<h1><center><u><font color='blue'>Multiple Linear Regression with Python</font></u></center></h1>

### Setting up the Working Environment

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### The Working Dataset

In [None]:
# Import the Boston housing dataset from sklearn
# (load_boston was removed in scikit-learn 1.2; this notebook assumes an older version)
from sklearn.datasets import load_boston

In [None]:
# Setting X and y
X = load_boston().data
y = load_boston().target

In [None]:
# Transform X into a DataFrame
boston = pd.DataFrame(X, columns=load_boston().feature_names)
boston.head()

In [None]:
# Inserting the target variable to the DataFrame
boston.insert(0, 'Price', y)
boston.head()

### Using sklearn to build regression model

In [None]:
# Import LinearRegression from sklearn.linear_model
from sklearn.linear_model import LinearRegression

In [None]:
# Create a linear regression object
lm_reg = LinearRegression() 

### Fitting the Model

In [None]:
# fit the linear regression model
lm_reg.fit(X, y)

### Checking the parameters

In [None]:
# print the intercept value
print("The model intercept value is {:.4f}".format(lm_reg.intercept_))

In [None]:
# zipping coefficients with their names
list(zip(boston.columns[1:], lm_reg.coef_))

In [None]:
# print the parameters 
for var, coef in list(zip(boston.columns[1:], lm_reg.coef_)):
    print("The {0:7s} coefficient is: {1:8.4f}".format(var,coef))

In [None]:
results = pd.DataFrame(list(zip(boston.columns[1:], lm_reg.coef_)), 
             columns=["Variable", "Coefficient"])
results

###  Prediction

After fitting the model, we can use it to predict on new data. In our case, we don't have new data, so we predict on the data used to build the model. The __predict()__ method of the linear regression model does the prediction for us.

Syntax:
```python 
y_pred = lm_obj.predict(new_data)
```
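As a minimal sketch of what __predict()__ does, the example below fits a model on synthetic data (an assumption for illustration, not the Boston data) and checks that the predictions equal the intercept plus the features multiplied by the coefficients. The names `X_demo`, `y_demo`, and `X_new` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): 50 rows, 3 features, known linear relationship
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = 2.0 + X_demo @ np.array([1.5, -0.5, 3.0]) + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X_demo, y_demo)

# "New" observations must have the same number of columns as the training data
X_new = rng.normal(size=(5, 3))
y_new_pred = model.predict(X_new)

# For a linear model, predict() is equivalent to intercept + X @ coefficients
manual_pred = model.intercept_ + X_new @ model.coef_
print(np.allclose(y_new_pred, manual_pred))
```

The new data only needs the same feature columns, in the same order, as the data used in __fit()__.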

In [None]:
# prediction
y_pred = lm_reg.predict(X)

## Model Performance

After learning how to fit a model and how to make predictions, you need a tool to measure the performance of your model. Linear regression problems have many metrics, each with its strengths and weaknesses; the most commonly used ones are listed below:

### The model Scores

  1. R^2 (Coefficient of Determination)
  2. MSE (Mean Squared Error)
  3. RMSE (Root Mean Squared Error)
  4. MAE (Mean Absolute Error)
  
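These metrics are available in the __sklearn.metrics__ module, and to use them we need to import them. RMSE has no function of its own in the import below; we compute it as the square root of the MSE.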

Syntax:
```python
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
```
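To make the definitions concrete, here is a small sketch that computes the same metrics by hand with NumPy. The four actual and predicted values are made up for illustration.

```python
import numpy as np

# Hypothetical actual and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

errors = y_true - y_hat

# Mean squared error: average of the squared errors
mse = np.mean(errors ** 2)

# Root mean squared error: same units as the target
rmse = np.sqrt(mse)

# Mean absolute error: average of the absolute errors
mae = np.mean(np.abs(errors))

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MSE: {:.4f}, RMSE: {:.4f}, MAE: {:.4f}, R^2: {:.4f}".format(mse, rmse, mae, r2))
```

For MSE, RMSE, and MAE lower is better; for R^2 the best possible value is 1.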

At this step, we evaluate the model on the same data we used to train it.

In [None]:
# Import the necessary metrics
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

In [None]:
# print the R^2
print("R^2 score: {:.4f}".format(r2_score(y, y_pred)))

In [None]:
# The mean squared error
print("Mean squared error: {:.4f}".format(mean_squared_error(y, y_pred)))

In [None]:
# The root mean squared error
print("Root mean squared error: {:.4f}".format(np.sqrt(mean_squared_error(y, y_pred))))

In [None]:
# The mean absolute error
print("Mean absolute error: {:.4f}".format(mean_absolute_error(y, y_pred)))

# Machine Learning Process

Building Machine Learning Models goes through several steps, which we mention here briefly:

1. **Step 1: Extracting features**: Datasets don't typically come with clear features, so there is work to be done in reformatting the dataset. Additionally, you need to decide which features to begin with.

2. **Step 2: Dataset splitting**: Split the dataset into two datasets: the training set and the test set.

3. **Step 3: Model Training**: Train the model on the training set.

4. **Step 4: Model Evaluation**: Evaluate the model on the test set. Model evaluation is performed many times, not just once.

    - In model evaluation, a threshold on a calculated metric must be chosen in order to decide whether the model is useful and can be used in practice.
    - If the model is not good enough, we move back to the training step to __tune__ the model or consider other solutions, such as changing the inputs.
    - We go back and forth between building and testing several times until we are satisfied with our model.
    - If the model is not improving, the cause may be a small dataset, not enough features, etc.
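The four steps can be sketched end to end as follows. This is only an outline under stated assumptions: the data is synthetic (generated with `make_regression`), and `R2_THRESHOLD` is a hypothetical acceptance threshold, not a recommended value.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Step 1: features and target (synthetic stand-in data, an assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

# Step 2: split into training and test sets
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=0)

# Step 3: train the model on the training set only
model = LinearRegression().fit(X_tr, y_tr)

# Step 4: evaluate on the test set and compare against a chosen threshold
test_r2 = r2_score(y_te, model.predict(X_te))
R2_THRESHOLD = 0.8  # hypothetical threshold, chosen for illustration
model_is_useful = test_r2 >= R2_THRESHOLD

print("Test R^2: {:.4f}, useful: {}".format(test_r2, model_is_useful))
```

If `model_is_useful` is false, the loop described above applies: go back to Step 3 (or Step 1) and try again.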


In [None]:
from IPython.display import Image, HTML
HTML("""
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style>
""")

In [None]:
Image('ML process.png')

# Train-Test and Model Validation 

Training the model and then assessing it on the same data is not a good idea (__not an honest assessment__), because the model's performance will not indicate how well it generalizes to unseen data. For this reason, it is common practice to split your data into two sets, a training set and a test set. You train or fit the model on the __training set__, then make predictions on the __test set__. Finally, the metrics are calculated between the predictions and the known values.
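The sketch below illustrates why the training score can be misleading. It is a deliberately extreme, made-up case: the target is pure noise and there are more features (25) than training rows (20), so the model can fit the training data almost perfectly even though there is nothing to learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)

# Pure noise (assumption): the target has no real relationship with the features
X_train_demo = rng.normal(size=(20, 25))
y_train_demo = rng.normal(size=20)
X_test_demo = rng.normal(size=(20, 25))
y_test_demo = rng.normal(size=20)

model = LinearRegression().fit(X_train_demo, y_train_demo)

# Score on the data the model has already seen
train_r2 = r2_score(y_train_demo, model.predict(X_train_demo))
# Score on data the model has never seen
test_r2 = r2_score(y_test_demo, model.predict(X_test_demo))

print("Train R^2: {:.4f}".format(train_r2))
print("Test  R^2: {:.4f}".format(test_r2))
```

The training R^2 looks excellent, while the test R^2 shows that the model does not generalize. Only the second number is an honest assessment.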

## Splitting Data into Train/Test sets

  - It is common practice in machine learning projects to split the data into a __train set__ and a __test set__ (also called unseen data).
  
  - The train set is used to train or fit the model
  
  - The test set is used to __evaluate or assess__ the model
  
  

### Splitting the Dataset Using scikit-learn

  - First, import __train_test_split()__ from __sklearn.model_selection__.
  
  - Use __train_test_split()__ to randomly split the data.
  
#### __train_test_split()__ has a few arguments:
   - __*arrays__: The arrays to split; pass the feature data first and the target data second.
   - __test_size__: Specifies what proportion of the original data is used for the test set (20%, 25%, and 30% are commonly used proportions).
   - __random_state__: Sets a seed for the random number generator that splits the data into train and test. Using the same seed reproduces the exact split, so we get the same results.
   - __shuffle__: __True__ by default, which shuffles the data before splitting so that the split does not depend on the original row order.
   - __stratify__: Useful in classification problems. __Stratification__ means the proportion of __events__ and __non-events__ is the same in both the train and test sets (we will use it in coming lectures).
   
Note:

   - By default, __train_test_split()__ splits the data into 75% training data and 25% test data, which is a good rule of thumb.
   - The common names for datasets are (__X_train, X_test__ (X is uppercase), __y_train, y_test__ (y is lowercase))
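A short sketch of these arguments in action, using a small made-up dataset (100 rows, with a binary target that has 20% events, both assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 2 features, binary target with 20% events
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 20 + [0] * 80)

# Default split: 75% train, 25% test
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
print(X_tr.shape, X_te.shape)

# Same random_state -> exactly the same split
_, X_te_again, _, _ = train_test_split(X_demo, y_demo, random_state=0)
print(np.array_equal(X_te, X_te_again))

# stratify=y keeps the proportion of events the same in train and test
_, _, y_tr_s, y_te_s = train_test_split(X_demo, y_demo,
                                        random_state=0, stratify=y_demo)
print(y_tr_s.mean(), y_te_s.mean())
```

Without `random_state`, each run would produce a different split, and therefore slightly different metrics.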

Syntax:
```python
# Import train_test_split
from sklearn.model_selection import train_test_split
# Split the data
X_train, X_test, y_train, y_test= train_test_split( X, y, 
                                                   test_size=0.2, 
                                                   random_state=10123)
```

In [None]:
# Import train_test_split from sklearn.model_selection
from sklearn.model_selection import train_test_split

In [None]:
# Create the training and test sets
X_train, X_test, y_train, y_test= train_test_split( X, y, 
                                                   test_size=0.2, 
                                                   random_state=123)

In [None]:
# Print shapes of the training and testing data sets
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

### Model Training 

In [None]:
# Create a Linear Regression object
lm = LinearRegression()

In [None]:
# fit the model on the training set
lm.fit(X_train, y_train)

### Prediction on training and testing sets

In [None]:
# Predict on train set
pred_train = lm.predict(X_train)
# Predict on test set
pred_test = lm.predict(X_test)

### Model Evaluation

In [None]:
# The R^2 Score
print("The R^2 on the train set is: {:.4f}".format(r2_score(y_train, pred_train)))

In [None]:
# The R^2 Score
print("The R^2 on the test set is: {:.4f}".format(r2_score(y_test, pred_test)))

In [None]:
print("The MSE on the train set is: {:.4f}".\
      format(mean_squared_error(y_train, pred_train)))

In [None]:
print("The MSE on the test set is: {:.4f}". \
      format(mean_squared_error(y_test, pred_test)))

In [None]:
print("The RMSE on the train set is: {:.4f}".\
      format(np.sqrt(mean_squared_error(y_train, pred_train))))

In [None]:
print("The RMSE on the test set is: {:.4f}". \
      format(np.sqrt(mean_squared_error(y_test, pred_test))))

## Model Diagnostics (Residual Plot)

One of the tools of regression diagnostics is the __residual plot__. The residuals are the differences between the actual values and the predicted values:
$$\text{resid} = y - \hat{y}$$ or
$$\text{residuals} = \text{actual values} - \text{predicted values}$$

 - **Residual plot**: a graph that shows the **residuals** on the vertical axis and the **predicted values** on the horizontal axis.
 
 
 - If the points in a residual plot are randomly scattered around the horizontal axis, a linear regression model is appropriate for the data; otherwise, a non-linear model is more appropriate.
 
 
 - If there is some structure or pattern, your model is not capturing all of the variation in the data. There could be an interaction between variables, or the data could be time-dependent while time is not considered. If this is the case, the data must be examined again carefully.
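To see what a "pattern" looks like numerically, the sketch below fits a straight line to a made-up, noise-free curved relationship (`y = x^2`, an assumption for illustration) and inspects the residuals. It uses `np.polyfit` rather than scikit-learn to keep the example small.

```python
import numpy as np

# Made-up curved relationship: a straight line is the wrong model here
x = np.linspace(-3, 3, 61)
y_curved = x ** 2

# Fit a straight line (degree-1 polynomial) by least squares
slope, intercept = np.polyfit(x, y_curved, deg=1)
fitted = intercept + slope * x

# Residuals = actual values - predicted values
resid = y_curved - fitted

# The residuals are not random: positive at both ends, negative in the middle
print("left end: {:.2f}, middle: {:.2f}, right end: {:.2f}".format(
    resid[0], resid[30], resid[-1]))
```

Plotted against the fitted values or against `x`, these residuals form a clear U shape instead of a random cloud around zero, which is the signal that a linear model is missing structure in the data.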

In [None]:
# calculate the residuals
resid_train = y_train - pred_train
resid_test = y_test-pred_test

In [None]:
# Scatter plot the training data
plt.figure(figsize= (12, 6))
train = plt.scatter(x = pred_train, y = resid_train , c = 'b', alpha=0.5)

# Scatter plot the testing data
test = plt.scatter(pred_test, resid_test , c = 'r', alpha=0.5)

# Plot a horizontal axis line at 0
plt.hlines(y = 0, xmin = -10, xmax = 50)

# Labels
plt.xlabel('Predicted values')
plt.ylabel('Residuals')
plt.legend((train, test), ('Training','Test'), loc='upper left')
plt.title('Residual Plots')
plt.show()

There does not seem to be an obvious pattern in the residual plot; the points are scattered above and below the horizontal line.

## Problems to Consider

1. The model was tested only once, on a single portion of the data. Is this the best technique, and is it an efficient way to test the model?

Consider this case: 

   - If we split the data again into train and test sets, are we going to get the same results?

2. We assumed the relationship is linear, which might not be the case. Shouldn't we consider a non-linear algorithm?

3. We didn't consider any interactions between variables, but there might be some.

4. No data transformation has been done. If we transform the features, we may see some improvement in the model.


All of the previous questions should be taken into consideration to improve the model.
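The first question can be checked directly. The sketch below repeats the train/test split with five different seeds on synthetic data (an assumption; the numbers will differ for the Boston data) and collects the test R^2 from each split.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic noisy data (assumption), small enough for the split to matter
X_demo, y_demo = make_regression(n_samples=100, n_features=5,
                                 noise=50.0, random_state=0)

test_scores = []
for seed in range(5):
    # A different random_state gives a different train/test split
    X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                              test_size=0.2, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    test_scores.append(r2_score(y_te, model.predict(X_te)))

print("Test R^2 per split:", np.round(test_scores, 4))
print("Spread (max - min): {:.4f}".format(max(test_scores) - min(test_scores)))
```

The score changes from split to split, so a single train/test split gives only one noisy estimate of the model's performance.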

## Try to improve the model by going back to the model-building step

In the next lecture, we will learn how to improve the predictive power of our models by learning new techniques about __tweaking or tuning__ the model's options. 

### Here are the topics of the next tutorial 

  1. K-fold Cross-Validation (CV)
  
  
  2. Data Transformation
     - Standardization
     - Normalization
     - Log Transformation
     
     
  3. Hyper-Parameter Tuning
  
     - Ridge Regression
     - Lasso Regression