# Multiple Linear Regression: Predicting Startup Profit

## 1. Importing Necessary Packages

In this step, we import essential Python libraries for data manipulation, visualization, and machine learning. These packages provide the tools needed to load data, process it, and build regression models.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

## 2. Loading the Dataset and Splitting Features and Target

In this step, we load the startup dataset from a CSV file. We separate the independent variables (features such as R&D Spend, Administration, Marketing Spend, and State) into X, and the dependent variable (Profit) into y. This prepares the data for further preprocessing and modeling.

In [29]:
data = pd.read_csv('50_Startups.csv')
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values
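
As a quick illustration of what the two `iloc` slices return, here is a minimal sketch on a three-row frame. The column names follow the description above, the feature values are the first three rows of the dataset, and the `Profit` values are placeholders rather than real data. The names `demo`, `X_demo`, and `y_demo` are hypothetical.

```python
import pandas as pd

# Hypothetical three-row frame; the Profit values are placeholders, not real data.
demo = pd.DataFrame({
    "R&D Spend": [165349.2, 162597.7, 153441.51],
    "Administration": [136897.8, 151377.59, 101145.55],
    "Marketing Spend": [471784.1, 443898.53, 407934.54],
    "State": ["New York", "California", "Florida"],
    "Profit": [1.0, 2.0, 3.0],
})

X_demo = demo.iloc[:, :-1].values  # every column except the last
y_demo = demo.iloc[:, -1].values   # only the last column

print(X_demo.shape, X_demo.dtype)  # (3, 4) object
print(y_demo.shape, y_demo.dtype)  # (3,) float64
```

Because `State` holds strings, the feature matrix comes back with `object` dtype, which is why the later steps cast to `float64` before fitting with `statsmodels`.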

## 3. Handling Missing Data

Real-world datasets often contain missing values, which most scikit-learn estimators cannot handle directly. Here, we use `SimpleImputer` to replace missing values in the numeric feature columns (every column except the last one, State) with the mean of that column, so the feature matrix is complete before encoding.

In [30]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, :-1])
X[:, :-1] = imputer.transform(X[:, :-1])
X

array([[165349.2, 136897.8, 471784.1, 'New York'],
       [162597.7, 151377.59, 443898.53, 'California'],
       [153441.51, 101145.55, 407934.54, 'Florida'],
       [144372.41, 118671.85, 383199.62, 'New York'],
       [142107.34, 91391.77, 366168.42, 'Florida'],
       [131876.9, 99814.71, 362861.36, 'New York'],
       [134615.46, 147198.87, 127716.82, 'California'],
       [130298.13, 145530.06, 323876.68, 'Florida'],
       [120542.52, 148718.95, 311613.29, 'New York'],
       [123334.88, 108679.17, 304981.62, 'California'],
       [101913.08, 110594.11, 229160.95, 'Florida'],
       [100671.96, 91790.61, 249744.55, 'California'],
       [93863.75, 127320.38, 249839.44, 'Florida'],
       [91992.39, 135495.07, 252664.93, 'California'],
       [119943.24, 156547.42, 256512.92, 'Florida'],
       [114523.61, 122616.84, 261776.23, 'New York'],
       [78013.11, 121597.55, 264346.06, 'California'],
       [94657.16, 145077.58, 282574.31, 'New York'],
       [91749.16, 114175.79, 29491
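
To make the effect of mean imputation visible, the following sketch applies the same `SimpleImputer` settings to a small array with two missing entries. The array `toy` is a made-up example, not part of the startup data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up array with one missing value in each column.
toy = np.array([[1.0, 10.0],
                [np.nan, 20.0],
                [3.0, np.nan]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(toy)

# Column 0 mean is (1 + 3) / 2 = 2; column 1 mean is (10 + 20) / 2 = 15.
print(filled)
```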

## 4. Encoding Categorical Data and Avoiding the Dummy Variable Trap

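Some features, like 'State', are categorical and must be converted to numerical values for regression analysis. We use one-hot encoding to create one binary column per category; `ColumnTransformer` places these encoded columns first and passes the numeric columns through unchanged. The dummy columns always sum to 1 in every row, which makes them perfectly collinear with the intercept (the dummy variable trap). To avoid this, we remove one dummy column after encoding, leaving two columns for the three states.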

In [31]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [-1])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
X = X[:, 1:] # Avoiding the Dummy Variable Trap
X

array([[0.0, 1.0, 165349.2, 136897.8, 471784.1],
       [0.0, 0.0, 162597.7, 151377.59, 443898.53],
       [1.0, 0.0, 153441.51, 101145.55, 407934.54],
       [0.0, 1.0, 144372.41, 118671.85, 383199.62],
       [1.0, 0.0, 142107.34, 91391.77, 366168.42],
       [0.0, 1.0, 131876.9, 99814.71, 362861.36],
       [0.0, 0.0, 134615.46, 147198.87, 127716.82],
       [1.0, 0.0, 130298.13, 145530.06, 323876.68],
       [0.0, 1.0, 120542.52, 148718.95, 311613.29],
       [0.0, 0.0, 123334.88, 108679.17, 304981.62],
       [1.0, 0.0, 101913.08, 110594.11, 229160.95],
       [0.0, 0.0, 100671.96, 91790.61, 249744.55],
       [1.0, 0.0, 93863.75, 127320.38, 249839.44],
       [0.0, 0.0, 91992.39, 135495.07, 252664.93],
       [1.0, 0.0, 119943.24, 156547.42, 256512.92],
       [0.0, 1.0, 114523.61, 122616.84, 261776.23],
       [0.0, 0.0, 78013.11, 121597.55, 264346.06],
       [0.0, 1.0, 94657.16, 145077.58, 282574.31],
       [1.0, 0.0, 91749.16, 114175.79, 294919.57],
       [0.0, 1.0, 86419.7
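
The dummy variable trap can be seen directly on a few state labels. This sketch assumes only the three state names that appear in the output above, and uses `OneHotEncoder(drop="first")` as an alternative to slicing off the first column by hand; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

states = np.array([["New York"], ["California"], ["Florida"], ["New York"]])

# Full encoding: one column per category (sorted: California, Florida, New York).
full = OneHotEncoder().fit_transform(states).toarray()
print(full.sum(axis=1))  # each row sums to 1, duplicating the intercept column

# Dropping the first category removes the redundancy.
reduced = OneHotEncoder(drop="first").fit_transform(states).toarray()
print(reduced)  # columns: Florida, New York
```

A row of all zeros in `reduced` then stands for the dropped category, California.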

## 5. Splitting the Dataset into Training and Test Sets

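To evaluate the model's performance, we split the dataset into a training set (used to fit the model) and a test set (used to assess how well the model generalizes to new data). Holding out a test set does not prevent overfitting by itself, but it exposes it: a model that scores well on the training data and poorly on the test data has overfit. With `test_size=0.2`, 20% of the rows are held out, and `random_state=0` makes the split reproducible.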

In [35]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
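
The following sketch checks the split sizes and reproducibility on placeholder arrays with 50 rows. `X_demo` and `y_demo` are synthetic stand-ins, not the startup data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: row i of X_demo is [2i, 2i + 1], and y_demo[i] = i.
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(X_tr.shape, X_te.shape)  # (40, 2) (10, 2)

# Features and targets are shuffled together, so rows stay aligned.
print(np.array_equal(X_te[:, 0], 2 * y_te))  # True

# The same random_state reproduces the same split.
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(np.array_equal(X_te, X_te_again))  # True
```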

## 6. Training the Multiple Linear Regression Model

We create a LinearRegression object and fit it to the training data. The model learns the relationship between the features and the target variable (Profit) using the training set.

In [36]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression parameters:

| Parameter | Value |
|---|---|
| fit_intercept | True |
| copy_X | True |
| tol | 1e-06 |
| n_jobs | None |
| positive | False |
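
After fitting, the learned relationship is stored in the `coef_` and `intercept_` attributes. The sketch below illustrates this on synthetic, noise-free data generated from known coefficients, so the recovered values can be checked; the data and the name `model` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))

# Noise-free target built from known coefficients and intercept.
true_coef = np.array([2.0, -1.0, 0.5])
y_demo = 5.0 + X_demo @ true_coef

model = LinearRegression().fit(X_demo, y_demo)
print(model.intercept_)  # close to 5.0
print(model.coef_)       # close to [2.0, -1.0, 0.5]
```

On the startup data, `regressor.coef_` would hold one coefficient per column of `X_train`, in the same column order.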


## 7. Making Predictions and Comparing Results

After training the model, we use it to predict the target variable (Profit) for the test set. We then print the predictions next to the actual values, with predicted profit in the left column and actual profit in the right column. This side-by-side comparison shows how closely the model tracks outcomes it has not seen during training.

In [37]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))


[[103015.2  103282.38]
 [132582.28 144259.4 ]
 [132447.74 146121.95]
 [ 71976.1   77798.83]
 [178537.48 191050.39]
 [116161.24 105008.31]
 [ 67851.69  81229.06]
 [ 98791.73  97483.56]
 [113969.44 110352.25]
 [167921.07 166187.94]]
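
A side-by-side listing is easier to judge with summary metrics. As a sketch, the ten predicted and actual values printed above are copied into arrays here and scored with `r2_score` and `mean_absolute_error` from `sklearn.metrics`; in the notebook itself the same calls could be made directly on `y_test` and `y_pred`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Values copied from the printed comparison (left column: predicted, right: actual).
y_pred_printed = np.array([103015.2, 132582.28, 132447.74, 71976.1, 178537.48,
                           116161.24, 67851.69, 98791.73, 113969.44, 167921.07])
y_test_printed = np.array([103282.38, 144259.4, 146121.95, 77798.83, 191050.39,
                           105008.31, 81229.06, 97483.56, 110352.25, 166187.94])

r2 = r2_score(y_test_printed, y_pred_printed)
mae = mean_absolute_error(y_test_printed, y_pred_printed)
print(f"R^2 on the test set: {r2:.4f}")
print(f"Mean absolute error: {mae:.2f}")
```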


In [9]:
pip install statsmodels

Collecting statsmodels
  Downloading statsmodels-0.14.5-cp313-cp313-macosx_11_0_arm64.whl.metadata (9.5 kB)
Collecting patsy>=0.5.6 (from statsmodels)
  Downloading patsy-1.0.1-py2.py3-none-any.whl.metadata (3.3 kB)
Downloading statsmodels-0.14.5-cp313-cp313-macosx_11_0_arm64.whl (9.7 MB)
Downloading patsy-1.0.1-py2.py3-none-any.whl (232 kB)
Installing collected packages: patsy, statsmodels
Successfully installed patsy-1.0.1 statsmodels-0.14.5

## 8. Model Optimization Using Backward Elimination

In this section, we optimize our multiple linear regression model using the backward elimination technique with the `statsmodels` library. Backward elimination is a stepwise regression approach that helps identify the most significant features for predicting the target variable. The process involves:

1. Adding a column of ones to the feature matrix to account for the intercept term.
2. Fitting an Ordinary Least Squares (OLS) regression model with all possible predictors.
3. Reviewing the p-values for each feature. The p-value indicates the statistical significance of each variable.
4. Removing the feature with the highest p-value (if it is above a chosen significance level, typically 0.05).
5. Repeating steps 2-4 until all remaining features have p-values below the significance threshold.

This iterative process results in a simpler, more interpretable model that retains only the most impactful predictors for the target variable.

In [43]:
import statsmodels.api as sm

# statsmodels' OLS does not add an intercept, so prepend a column of ones.
# Build a new array instead of overwriting X: a cell that reassigns X stacks
# another column of ones on every re-run and shifts all column indices.
X_const = np.append(arr=np.ones((X.shape[0], 1)), values=X, axis=1).astype(np.float64)
# Columns: 0 = const, 1-2 = State dummies, 3 = R&D Spend, 4 = Administration, 5 = Marketing Spend

# Start with all predictors. A notebook cell displays only its last expression,
# so wrap a summary in print() to inspect an intermediate step.
X_opt = X_const[:, [0, 1, 2, 3, 4, 5]]
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()

# Remove column 2 (second State dummy).
X_opt = X_const[:, [0, 1, 3, 4, 5]]
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()

# Remove column 1 (first State dummy).
X_opt = X_const[:, [0, 3, 4, 5]]
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()

# Remove column 4 (Administration).
X_opt = X_const[:, [0, 3, 5]]
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()

# Remove column 5 (Marketing Spend), leaving the intercept and R&D Spend.
X_opt = X_const[:, [0, 3]]
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()

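The recorded summary below is degenerate: `const` and `x1` have identical estimates, Df Model is 0, R-squared is 0.0, and the condition number is 7.54e+16. This pattern is consistent with both selected columns being constant, which happens when a column of ones has been prepended to `X` more than once, so it should not be read as a fit on R&D Spend.

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 0.0 |
| Model: | OLS | Adj. R-squared: | 0.0 |
| Method: | Least Squares | F-statistic: | |
| Date: | Sun, 05 Oct 2025 | Prob (F-statistic): | |
| Time: | 20:04:29 | Log-Likelihood: | -600.65 |
| No. Observations: | 50 | AIC: | 1203.0 |
| Df Residuals: | 49 | BIC: | 1205.0 |
| Df Model: | 0 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 5.601e+04 | 2850.077 | 19.651 | 0.000 | 5.03e+04 | 6.17e+04 |
| x1 | 5.601e+04 | 2850.077 | 19.651 | 0.000 | 5.03e+04 | 6.17e+04 |

| | | | |
|---|---|---|---|
| Omnibus: | 0.018 | Durbin-Watson: | 0.02 |
| Prob(Omnibus): | 0.991 | Jarque-Bera (JB): | 0.068 |
| Skew: | 0.023 | Prob(JB): | 0.966 |
| Kurtosis: | 2.825 | Cond. No. | 7.54e+16 |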
