# Linear Regression Problem - Multiple Variables
We have seen the concept of simple linear regression where a single predictor
variable X was used to model the response variable Y. In many applications, there
is more than one factor that influences the response. Multiple regression models
thus describe how a single response variable Y depends linearly on a number of
predictor variables.
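
In symbols, a multiple linear regression model with p predictors is commonly written as shown below. The notation (β for the coefficients, ε for the error term) is the usual textbook convention and is an assumption here, since this notebook does not define its own symbols.

```latex
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon
```

With p = 1 this reduces to the simple linear regression model with a single predictor.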

Examples:

The selling price of a house can depend on the desirability of the location, the number of bedrooms, the number of bathrooms, the year the house was built,
the square footage of the lot and a number of other factors.

Stock price prediction - It can depend on the overall market trend, the performance of that particular stock over a given period of time, and many other factors.

---
Now we will create a multivariable regression model. This time we will not hard-code our training data; we will read it from a file.

We will follow the same steps as for the univariable regression model.


1.   Import libraries
2.   Get training data
3.   Data preprocessing
4.   Train the model
5.   Evaluate model performance
6.   Use the trained model to predict on unseen data
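
As a rough orientation, the six steps can be sketched end to end on a small synthetic dataset. This is only an illustrative sketch: the column names (`spend_a`, `spend_b`, `region`), the generated data, and the use of `Pipeline` and `ColumnTransformer` are assumptions made for the example, not the approach this notebook follows in its cells.

```python
# Step 1: import libraries
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Step 2: get training data (hypothetical synthetic data standing in for a file)
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "spend_a": rng.uniform(0, 100, n),
    "spend_b": rng.uniform(0, 100, n),
    "region": rng.choice(["north", "south", "west"], n),
})
# The target depends linearly on the two numeric columns plus a little noise.
data["target"] = 3.0 * data["spend_a"] + 0.5 * data["spend_b"] + rng.normal(0, 1, n)

X = data.drop(columns=["target"])
y = data["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Step 3: preprocessing (scale numeric columns, one-hot encode the categorical one)
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["spend_a", "spend_b"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])

# Step 4: train the model
model = Pipeline([("preprocess", preprocess), ("regress", LinearRegression())])
model.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set
mse = mean_squared_error(y_test, model.predict(X_test))

# Step 6: predict on an unseen row
new_row = pd.DataFrame({"spend_a": [50.0], "spend_b": [20.0], "region": ["south"]})
prediction = model.predict(new_row)
print(mse, prediction)
```

Wrapping the preprocessing and the regressor in one pipeline means the same transformations are applied automatically at prediction time, which is one way to avoid repeating the encoding and scaling code for training and test data.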




### Step 1 - Import Required Libraries

In [48]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [49]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt 

### Step 2 - Get Training Data
We will use data on US startups and try to predict their profit based on their spending in different areas (R&D Spend, Administration and Marketing Spend) and their location (New York, California or Florida).

Let's first read the data and see how it looks.

In [50]:
dataset = pd.read_csv('/content/drive/My Drive/skill-squad training/Linear Regression/50_Startups.csv')
dataset.head(10)

| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |
| 5 | 131876.9 | 99814.71 | 362861.36 | New York | 156991.12 |
| 6 | 134615.46 | 147198.87 | 127716.82 | California | 156122.51 |
| 7 | 130298.13 | 145530.06 | 323876.68 | Florida | 155752.6 |
| 8 | 120542.52 | 148718.95 | 311613.29 | New York | 152211.77 |
| 9 | 123334.88 | 108679.17 | 304981.62 | California | 149759.96 |


### Step 3 - Data Preprocessing

Separate the feature and target variables.

In [51]:
X = dataset.drop(['Profit'], axis=1)
Y = dataset['Profit']
X.head()

| | R&D Spend | Administration | Marketing Spend | State |
|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York |
| 1 | 162597.7 | 151377.59 | 443898.53 | California |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida |


In [52]:
Y.head()

0    192261.83
1    191792.06
2    191050.39
3    182901.99
4    166187.94
Name: Profit, dtype: float64

Before we train our model, we will split the dataset into a training set and a test set.

Training set - will be used for model training

Test set - will be used for model validation

sklearn provides a ready-made split function, train_test_split, which we will use here.

In [53]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

In [54]:
X.shape

(50, 4)

In [55]:
X_train.shape

(40, 4)

In [56]:
X_test.shape

(10, 4)
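
The call above does not pass `random_state`, so the rows that land in the training and test sets change on every run. A minimal sketch of a reproducible split follows; the arrays `X_demo` and `y_demo` and the seed value 42 are hypothetical stand-ins chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 50 rows with 2 features each, mirroring the 50-row dataset size.
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50)

# Two splits with the same seed select exactly the same rows.
a_train, a_test, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
b_train, b_test, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(a_train.shape, a_test.shape)     # (40, 2) (10, 2)
print(np.array_equal(a_test, b_test))  # True
```

Fixing the seed makes the reported metrics repeatable between runs, which helps when comparing different preprocessing choices.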

### Missing Values

In [57]:
X_train.isnull().sum()

R&D Spend          0
Administration     0
Marketing Spend    0
State              0
dtype: int64

In [58]:
X_test.isnull().sum()

R&D Spend          0
Administration     0
Marketing Spend    0
State              0
dtype: int64

The dataset has one categorical column, "State", which cannot be used directly for model training. We have to convert it to numerical data first.

There are two common encoding options here: Label Encoder and One Hot Encoder.

Label Encoder - Encodes target labels with values between 0 and n_classes-1.

One Hot Encoder - Encodes categorical features as a one-hot numeric array.

Label encoding would imply an ordering among the states that does not exist, so here we will use One Hot Encoder to convert the State column into 3 separate columns, one for each state. For this we will use the OneHotEncoder class from sklearn.preprocessing.
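
To make the difference between the two encoders concrete, here is a small sketch on a toy column. The four-row `states` frame is a made-up example, not part of the startup dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy example: four rows covering the three state names used in this notebook.
states = pd.DataFrame({"State": ["New York", "California", "Florida", "California"]})

# Label encoding: one integer per category, assigned in sorted order
# (California=0, Florida=1, New York=2).
label_codes = LabelEncoder().fit_transform(states["State"])
print(label_codes)  # [2 0 1 0]

# One-hot encoding: one 0/1 column per category, in the same sorted order.
ohe = OneHotEncoder()
one_hot = ohe.fit_transform(states[["State"]]).toarray()
print(ohe.categories_[0])
print(one_hot)
```

In the one-hot result, each row has a single 1 in the column of its state, so no state is treated as "larger" than another.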

In [59]:
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)

In [60]:
print(X_train)

    R&D Spend  Administration  Marketing Spend       State
0    94657.16       145077.58        282574.31    New York
1    86419.70       153514.11             0.00    New York
2   142107.34        91391.77        366168.42     Florida
3    46426.07       157693.92        210797.67  California
4    64664.71       139553.16        137962.62  California
5   114523.61       122616.84        261776.23    New York
6    27892.92        84710.77        164470.71     Florida
7    22177.74       154806.14         28334.72  California
8    63408.86       129219.61         46085.25  California
9    23640.93        96189.63        148001.11  California
10  101913.08       110594.11        229160.95     Florida
11   55493.95       103057.49        214634.81     Florida
12   66051.52       182645.56        118148.20     Florida
13   61994.48       115641.28         91131.24     Florida
14  162597.70       151377.59        443898.53  California
15   15505.73       127382.30         35534.17    New Yo

In [61]:
from sklearn.preprocessing import OneHotEncoder
ohe_state = OneHotEncoder()
X_train_state = pd.DataFrame(ohe_state.fit_transform(X_train[['State']]).toarray())

Now concatenate the encoded values and drop the State column from the dataset, as we no longer need it.

In [62]:
X_train = pd.concat([X_train, X_train_state], axis=1)
X_train = X_train.drop(['State'], axis=1)
X_train.head()

| | R&D Spend | Administration | Marketing Spend | 0 | 1 | 2 |
|---|---|---|---|---|---|---|
| 0 | 94657.16 | 145077.58 | 282574.31 | 0.0 | 0.0 | 1.0 |
| 1 | 86419.7 | 153514.11 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 142107.34 | 91391.77 | 366168.42 | 0.0 | 1.0 | 0.0 |
| 3 | 46426.07 | 157693.92 | 210797.67 | 1.0 | 0.0 | 0.0 |
| 4 | 64664.71 | 139553.16 | 137962.62 | 1.0 | 0.0 | 0.0 |


In [63]:
X_test_state = pd.DataFrame(ohe_state.transform(X_test[['State']]).toarray())
X_test = pd.concat([X_test, X_test_state], axis=1)
X_test = X_test.drop(['State'], axis=1)
X_test.head()

| | R&D Spend | Administration | Marketing Spend | 0 | 1 | 2 |
|---|---|---|---|---|---|---|
| 0 | 78389.47 | 153773.43 | 299737.29 | 0.0 | 0.0 | 1.0 |
| 1 | 72107.6 | 127864.55 | 353183.81 | 0.0 | 0.0 | 1.0 |
| 2 | 120542.52 | 148718.95 | 311613.29 | 0.0 | 0.0 | 1.0 |
| 3 | 28663.76 | 127056.21 | 201126.82 | 0.0 | 1.0 | 0.0 |
| 4 | 67532.53 | 105751.03 | 304768.73 | 0.0 | 1.0 | 0.0 |


In [64]:
#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

In [65]:
X_test

array([[ 0.13139132,  1.2332933 ,  0.74573659, -0.73379939, -0.73379939,
         1.52752523],
       [ 0.0019281 ,  0.32389012,  1.16748743, -0.73379939, -0.73379939,
         1.52752523],
       [ 1.0001246 ,  1.05588077,  0.83945108, -0.73379939, -0.73379939,
         1.52752523],
       [-0.89340711,  0.29551734, -0.03240664, -0.73379939,  1.36277029,
        -0.65465367],
       [-0.09235964, -0.4522957 ,  0.7854401 , -0.73379939,  1.36277029,
        -0.65465367],
       [ 0.59061172, -0.94230725,  0.35123982,  1.36277029, -0.73379939,
        -0.65465367],
       [ 1.29015418,  1.00252587, -0.61169092,  1.36277029, -0.73379939,
        -0.65465367],
       [-0.22417806,  1.19568324, -0.9233775 , -0.73379939, -0.73379939,
         1.52752523],
       [ 0.08737848, -0.16741416,  0.73727088,  1.36277029, -0.73379939,
        -0.65465367],
       [-0.13207426,  1.20727117, -0.77407706, -0.73379939, -0.73379939,
         1.52752523]])
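
Standardization replaces each value x with (x - mean) / std, where the mean and standard deviation are computed on the training set only and then reused for the test set. This is why the one-hot columns in the output above take just two distinct values per column. A minimal check on made-up numbers, assuming nothing beyond NumPy and scikit-learn:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data.
train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[5.0]])

# Fit on the training data only, then apply the same statistics to the test data.
scaler = StandardScaler().fit(train)
scaled = scaler.transform(test)

# Manual version: StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for std().
manual = (test - train.mean(axis=0)) / train.std(axis=0)

print(scaled, manual)
```

Both results agree, which confirms that `transform` on the test set uses the training mean and standard deviation rather than recomputing them.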

### Step 4 - Train the model
Train a linear regression model on the training dataset.

In [66]:
lr_model = LinearRegression()
lr_model.fit(X_train, Y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
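
After fitting, the learned intercept and one coefficient per feature are available as `intercept_` and `coef_`. The sketch below uses hypothetical synthetic data (`X_demo`, `y_demo`) to show that these match an ordinary least-squares solution computed directly with NumPy. It is meant as an illustration of what the fit produces, not as a description of scikit-learn's internals.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: 40 rows, 3 features, known coefficients plus small noise.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(40, 3))
y_demo = 5.0 + X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=40)

model = LinearRegression().fit(X_demo, y_demo)

# Same fit via least squares on a design matrix with a leading column of ones
# (the column of ones plays the role of the intercept).
design = np.column_stack([np.ones(len(X_demo)), X_demo])
beta, *_ = np.linalg.lstsq(design, y_demo, rcond=None)

print(model.intercept_, model.coef_)
print(beta)
```

The first entry of `beta` corresponds to the intercept and the remaining entries to the feature coefficients.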

### Step 5 - Model Evaluation
Now use the test set to predict results. The predicted values will be compared with the actual values to assess model performance.

In [67]:
y_pred = lr_model.predict(X_test)
print(y_pred)

[120218.13628257 117020.57394271 154140.13099304  74877.24829903
 109246.88182893 134134.48503521 157026.82099433  99483.2340036
 116316.03752882 103664.54538378]


In [68]:
print(Y_test)

21    111313.02
27    105008.31
8     152211.77
36     90708.19
23    108733.99
11    144259.40
6     156122.51
31     97483.56
20    118474.03
29    101004.64
Name: Profit, dtype: float64


We cannot visualize the model's performance with a simple plot, because we have multiple inputs and it is not possible to plot all of them in 2D. Instead, we will use a loss function to evaluate the model. For regression, we use MSE (Mean Squared Error) as the performance metric.

In [69]:
from sklearn.metrics import mean_squared_error
print(mean_squared_error(Y_test, y_pred))

59725833.6018116
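
MSE is the average of the squared differences between actual and predicted values. Because it is expressed in squared units, its square root (RMSE) is often easier to read, since it is in the same units as Profit. The sketch below recomputes the metric by hand from the predicted and actual values printed above; the variable names `y_true` and `y_hat` are introduced only for this example.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Actual and predicted profits copied from the outputs printed above.
y_true = np.array([111313.02, 105008.31, 152211.77, 90708.19, 108733.99,
                   144259.40, 156122.51, 97483.56, 118474.03, 101004.64])
y_hat = np.array([120218.13628257, 117020.57394271, 154140.13099304,
                  74877.24829903, 109246.88182893, 134134.48503521,
                  157026.82099433, 99483.2340036, 116316.03752882,
                  103664.54538378])

# MSE by hand: mean of squared errors; this matches sklearn's result.
mse = np.mean((y_true - y_hat) ** 2)
rmse = np.sqrt(mse)  # back in the units of Profit

print(mse, mean_squared_error(y_true, y_hat))
print(rmse)  # roughly 7,728
```

An RMSE of roughly 7,700 on profits that range from about 90,000 to 156,000 in this test set gives a more intuitive sense of the typical prediction error than the raw MSE.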
