### In Multiple Linear Regression, the value of a dependent variable is predicted based on more than one independent variable.

From this dataset, we need to build a model that predicts the Profit earned by a startup from its expenditures: R&D Spend, Administration Spend, and Marketing Spend. Because there is more than one independent variable, this is a multiple linear regression problem.
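For reference, the model being fitted can be written out as below. This is a sketch that assumes State takes only the three values visible in the sample (so two dummy variables, $D_1$ and $D_2$, remain after one is dropped); the symbols $\beta_i$ and $\varepsilon$ are notation introduced here, not taken from the notebook.

$$
\text{Profit} = \beta_0 + \beta_1\,\text{R\&D Spend} + \beta_2\,\text{Administration} + \beta_3\,\text{Marketing Spend} + \beta_4 D_1 + \beta_5 D_2 + \varepsilon
$$

Here $\beta_0$ is the intercept, each remaining $\beta_i$ is the change in predicted Profit for a one-unit change in its variable with the others held fixed, and $\varepsilon$ is the error term.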

In [1]:
import numpy as np 
import matplotlib.pyplot as plt 
import pandas as pd 

In [2]:
data = pd.read_csv('50_Startups.csv')
data

,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94
5,131876.9,99814.71,362861.36,New York,156991.12
6,134615.46,147198.87,127716.82,California,156122.51
7,130298.13,145530.06,323876.68,Florida,155752.6
8,120542.52,148718.95,311613.29,New York,152211.77
9,123334.88,108679.17,304981.62,California,149759.96


In [3]:
X = data.iloc[:, [0,1,2,3]].values 
y = data.iloc[:, 4].values

The dataset contains one categorical variable, State, so we need to encode it as dummy variables before fitting the model.

In [8]:
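# Encoding categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# Convert the State names in column 3 to integer codes
labelencoder = LabelEncoder()
X[:, 3] = labelencoder.fit_transform(X[:, 3])
# Caution: with no column selection, OneHotEncoder expands every column of X, not only State
onehotencoder = OneHotEncoder()
X = onehotencoder.fit_transform(X).toarray()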

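One way to restrict the one-hot encoding to the State column alone is scikit-learn's `ColumnTransformer`. The sketch below is illustrative, not the notebook's method: the three-row `sample` frame is hypothetical (it reuses the first rows shown earlier), and `drop="first"` assumes scikit-learn 0.21 or later.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical three-row sample with the same column layout as the dataset.
sample = pd.DataFrame({
    "R&D Spend": [165349.20, 162597.70, 153441.51],
    "Administration": [136897.80, 151377.59, 101145.55],
    "Marketing Spend": [471784.10, 443898.53, 407934.54],
    "State": ["New York", "California", "Florida"],
})

# One-hot encode only "State". drop="first" removes one dummy column,
# and the numeric columns pass through unchanged after the dummies.
# sparse_threshold=0 forces a dense NumPy array as the result.
encoder = ColumnTransformer(
    transformers=[("state", OneHotEncoder(drop="first"), ["State"])],
    remainder="passthrough",
    sparse_threshold=0,
)
X_encoded = encoder.fit_transform(sample)

print(X_encoded.shape)  # two State dummies followed by three numeric columns
```

With this layout the numeric predictors stay continuous, and only State contributes dummy columns.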


In [9]:
#Avoiding the Dummy Variable Trap 
X = X[:, 1:]
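The reason for dropping one dummy column can be seen in a small sketch. It assumes a hypothetical three-category variable observed six times: when an intercept column is present, the full set of dummies sums to that intercept, so the design matrix no longer has full column rank.

```python
import numpy as np

# Hypothetical category codes for six observations, one-hot encoded.
labels = np.array([0, 1, 2, 0, 1, 2])
dummies = np.eye(3)[labels]
intercept = np.ones((6, 1))

# Intercept plus all three dummies: the dummy columns sum to the intercept.
full = np.hstack([intercept, dummies])
# Intercept plus two dummies: the first dummy column is dropped.
reduced = np.hstack([intercept, dummies[:, 1:]])

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)

print(rank_full, full.shape[1])        # rank is below the column count
print(rank_reduced, reduced.shape[1])  # rank equals the column count
```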

In [11]:
#Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, test_size = 0.2, random_state = 0)

In [12]:
# Fitting Multiple Linear Regression to the Training set 
from sklearn.linear_model import LinearRegression 
regressor = LinearRegression() 
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [13]:
#Predicting the Test set result 
y_pred = regressor.predict(X_test)

In [14]:
y_pred

array([108087.40943475,  94075.52935037, 103684.13111554, 109115.49885826,
       103344.82807584, 119826.12211671,  94325.09783199, 130700.8478805 ,
       108898.78755726, 103344.82807584])

Next, we check how the model performed. For this, we use the Mean Squared Error (MSE) metric from the scikit-learn library.

In [15]:
from sklearn.metrics import mean_squared_error 
print("The Mean Squared Error is- {}".format(mean_squared_error(y_test, y_pred))) 

The Mean Squared Error is- 1846130824.0073693
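Because MSE is expressed in squared units of the target, its square root (RMSE) is often easier to interpret. For the value printed above, the RMSE is roughly 43,000 in the units of Profit. The sketch below works through the arithmetic on three hypothetical values that are not taken from the dataset.

```python
import numpy as np

# Hypothetical true and predicted values, chosen so the arithmetic is easy to follow.
y_true = np.array([100.0, 200.0, 300.0])
y_hat = np.array([110.0, 190.0, 330.0])

errors = y_true - y_hat       # [-10, 10, -30]
mse = np.mean(errors ** 2)    # (100 + 100 + 900) / 3
rmse = np.sqrt(mse)           # back in the units of the target

print(mse, rmse)
```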


### Implementing Backward Elimination in Python

In [16]:
import statsmodels.api as sm  # sm.OLS is exposed by statsmodels.api
# Prepend a column of ones so that OLS estimates an intercept
X = np.append(arr=np.ones((50, 1)).astype(int), values=X, axis=1)
X_pred = X[:, [0, 1, 2, 3, 4, 5]]

In [20]:
Ols_regressor = sm.OLS(endog=y, exog=X_pred).fit()

In [21]:
Ols_regressor.summary()

Dep. Variable:,y,R-squared:,0.203
Model:,OLS,Adj. R-squared:,0.113
Method:,Least Squares,F-statistic:,2.244
Date:,"Mon, 08 Apr 2024",Prob (F-statistic):,0.0665
Time:,14:47:07,Log-Likelihood:,-594.98
No. Observations:,50,AIC:,1202.0
Df Residuals:,44,BIC:,1213.0
Df Model:,5,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,1.178e+05,5659.867,20.808,0.000,1.06e+05,1.29e+05
x1,-8.209e+04,3.84e+04,-2.139,0.038,-1.59e+05,-4730.369
x2,-5.284e+04,3.84e+04,-1.377,0.176,-1.3e+05,2.45e+04
x3,-6.828e+04,3.84e+04,-1.779,0.082,-1.46e+05,9086.971
x4,-4.801e+04,3.84e+04,-1.251,0.218,-1.25e+05,2.94e+04
x5,-3.654e+04,3.84e+04,-0.952,0.346,-1.14e+05,4.08e+04

Omnibus:,1.59,Durbin-Watson:,0.512
Prob(Omnibus):,0.452,Jarque-Bera (JB):,0.795
Skew:,-0.059,Prob(JB):,0.672
Kurtosis:,3.606,Cond. No.,7.47


In [22]:
X_pred = X[:, [0, 1, 2, 4, 5]] 
Ols_regressor = sm.OLS(endog=y, exog=X_pred).fit() 
Ols_regressor.summary()

Dep. Variable:,y,R-squared:,0.146
Model:,OLS,Adj. R-squared:,0.07
Method:,Least Squares,F-statistic:,1.922
Date:,"Mon, 08 Apr 2024",Prob (F-statistic):,0.123
Time:,14:47:44,Log-Likelihood:,-596.71
No. Observations:,50,AIC:,1203.0
Df Residuals:,45,BIC:,1213.0
Df Model:,4,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,1.163e+05,5731.004,20.290,0.000,1.05e+05,1.28e+05
x1,-8.061e+04,3.93e+04,-2.052,0.046,-1.6e+05,-1476.481
x2,-5.136e+04,3.93e+04,-1.307,0.198,-1.3e+05,2.78e+04
x3,-4.652e+04,3.93e+04,-1.184,0.243,-1.26e+05,3.26e+04
x4,-3.505e+04,3.93e+04,-0.892,0.377,-1.14e+05,4.41e+04

Omnibus:,0.922,Durbin-Watson:,0.444
Prob(Omnibus):,0.631,Jarque-Bera (JB):,0.291
Skew:,-0.059,Prob(JB):,0.864
Kurtosis:,3.355,Cond. No.,7.38


In [23]:
X_pred = X[:, [0, 4, 5]] 
Ols_regressor = sm.OLS(endog=y, exog=X_pred).fit() 
Ols_regressor.summary()

Dep. Variable:,y,R-squared:,0.036
Model:,OLS,Adj. R-squared:,-0.005
Method:,Least Squares,F-statistic:,0.8707
Date:,"Mon, 08 Apr 2024",Prob (F-statistic):,0.425
Time:,14:47:59,Log-Likelihood:,-599.75
No. Observations:,50,AIC:,1205.0
Df Residuals:,47,BIC:,1211.0
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,1.135e+05,5833.106,19.464,0.000,1.02e+05,1.25e+05
x1,-4.378e+04,4.08e+04,-1.072,0.289,-1.26e+05,3.84e+04
x2,-3.231e+04,4.08e+04,-0.791,0.433,-1.14e+05,4.98e+04

Omnibus:,0.217,Durbin-Watson:,0.094
Prob(Omnibus):,0.897,Jarque-Bera (JB):,0.028
Skew:,-0.058,Prob(JB):,0.986
Kurtosis:,3.003,Cond. No.,7.22


In [24]:
#Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_pred, y, train_size = 0.8, test_size = 0.2, random_state = 0)

In [25]:
# Fitting Multiple Linear Regression to the Training set 
from sklearn.linear_model import LinearRegression 
regressor = LinearRegression() 
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [26]:
#Predicting the Test set result 
y_pred = regressor.predict(X_test)

In [27]:
from sklearn.metrics import mean_squared_error 
print("The Mean Squared Error is- {}".format(mean_squared_error(y_test, y_pred)))

The Mean Squared Error is- 1418446415.0324838
