
# **Building a multiple linear regression model and making predictions on the 50 Startups dataset**

**Regression is a predictive modelling technique where the target variable to be estimated is continuous.**

Based on the number of input features, regression models are classified into:

• **Simple Linear Regression** (Univariate)

• **Multiple Linear Regression** (Multivariate)

• **Simple regression model:** This is the most basic regression model, in which predictions are made from a single feature of the data.

• **Multiple regression model:** As the name implies, in this regression model predictions are made from multiple features of the data.

• The goal of regression is to find a target function that can fit the input data with minimum error.
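
For a linear model, "minimum error" usually means least squares: choose the weights and intercept that minimize the sum of squared differences between predictions and targets. The sketch below illustrates this on a small synthetic dataset. The data, the names `true_w` and `true_b`, and the use of `np.linalg.lstsq` are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np

# Hypothetical toy data: 6 samples, 2 features, generated from a known linear rule
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(6, 2))
true_w = np.array([2.0, -1.0])   # assumed "true" weights
true_b = 5.0                     # assumed "true" intercept
y = X @ true_w + true_b          # noise-free targets

# Append a column of ones so the intercept is estimated along with the weights
A = np.column_stack([X, np.ones(len(X))])

# Solve min ||A @ coef - y||^2 (ordinary least squares)
coef, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
w, b = coef[:-1], coef[-1]

print("estimated weights:", w)
print("estimated intercept:", b)
```

Since the toy targets contain no noise, the estimated weights and intercept should match `true_w` and `true_b` up to floating-point error. `linear_model.LinearRegression` from scikit-learn also fits an ordinary least squares model.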

In [15]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model

In [16]:
#Load the data
data = pd.read_csv("50_Startups.csv")
data.head()


| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


In [17]:
data.shape

(50, 5)

## **Multivariate model:**

In [18]:
# Create the feature and target arrays from the data
X = data.drop('Profit', axis=1).values
y = data['Profit'].values
X=data.select_dtypes(include=np.number)

In [19]:
X.shape,y.shape

((50, 4), (50,))

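Now `y` contains the target values (`Profit`). Note that `X` is reassigned by the `select_dtypes` call, which runs on the full `data` frame: it holds every numeric column, `Profit` included, which is why its shape is (50, 4) rather than (50, 3). The non-numeric `State` column is left out.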
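
To keep the target out of the feature matrix, `Profit` can be dropped before the numeric columns are selected. The sketch below rebuilds the five rows shown by `data.head()` by hand (an assumption made so the example runs without the CSV file) and contrasts the two selections. The names `leaky` and `features` are illustrative.

```python
import numpy as np
import pandas as pd

# The five rows shown by data.head(), rebuilt by hand for illustration
data = pd.DataFrame({
    "R&D Spend": [165349.2, 162597.7, 153441.51, 144372.41, 142107.34],
    "Administration": [136897.8, 151377.59, 101145.55, 118671.85, 91391.77],
    "Marketing Spend": [471784.1, 443898.53, 407934.54, 383199.62, 366168.42],
    "State": ["New York", "California", "Florida", "New York", "Florida"],
    "Profit": [192261.83, 191792.06, 191050.39, 182901.99, 166187.94],
})

# Selecting numeric columns from the full frame keeps the target column
leaky = data.select_dtypes(include=np.number)
print(list(leaky.columns))

# Dropping the target first leaves only the numeric predictors
features = data.drop(columns=["Profit"]).select_dtypes(include=np.number)
X = features.values
y = data["Profit"].values
print(list(features.columns), X.shape, y.shape)
```

On the full dataset this selection would give `X` three columns. The categorical `State` column is still excluded. Including it would need an encoding step such as `pd.get_dummies`.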

## **Train Test Split**


*   Split the data into training and testing sets: 80% of the rows are used for training and the remaining 20% for testing.



In [20]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [21]:
X_train.shape,X_test.shape,y_train.shape,y_test.shape

((40, 4), (10, 4), (40,), (10,))

In [22]:
y_train= y_train.reshape(-1,1)
y_test= y_test.reshape(-1,1)

In [23]:
# Fit the model on the training data
reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)

LinearRegression()

## **Performance metrics on both training and testing data**

***Evaluation metrics are used to measure the quality of the statistical or machine learning model.***

Several evaluation metrics are commonly used to assess a regression model:



*   Mean Absolute Error (MAE)
*   Mean Squared Error (MSE)
*   Root Mean Squared Error (RMSE)
*   R-squared (R² score)
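
Each of these metrics is a simple function of the residuals (the differences between true and predicted values). A minimal sketch on hypothetical values shows how they can be computed by hand with NumPy. The arrays `y_true` and `y_hat` are made up for illustration.

```python
import numpy as np

# Hypothetical true targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 8.0, 8.5])

errors = y_true - y_hat                  # residuals

mae = np.mean(np.abs(errors))            # mean absolute error -> 0.5
mse = np.mean(errors ** 2)               # mean squared error  -> 0.375
rmse = np.sqrt(mse)                      # root mean squared error

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MAE = {mae}, MSE = {mse}, RMSE = {rmse}, R2 = {r2}")
```

These definitions match what `mean_absolute_error`, `mean_squared_error`, and `r2_score` from `sklearn.metrics` compute. MAE and RMSE are in the same units as the target, while R² is unitless and equals 1.0 only when every prediction is exact.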





In [24]:
#Evaluating the model on training data only
from sklearn.metrics import mean_absolute_error,r2_score,mean_squared_error
y_pred1 = reg.predict(X_train)
print(f'Performance of the model on training data :\n')
print(f'MAE = {mean_absolute_error(y_train, y_pred1)}')
print(f'MSE = {mean_squared_error(y_train, y_pred1)}')
print(f'RMSE = {np.sqrt(mean_squared_error(y_train, y_pred1))}')
print(f'R_2 = {r2_score(y_train, y_pred1)}')

Performance of the model on training data :

MAE = 1.3960743672214448e-11
MSE = 3.8985022269624195e-22
RMSE = 1.974462515967933e-11
R_2 = 1.0


In [25]:
#Evaluating the model on testing data only
from sklearn.metrics import mean_absolute_error,r2_score,mean_squared_error
y_pred2 = reg.predict(X_test)
print(f'Performance of the model on test data :\n')
print(f'MAE = {mean_absolute_error(y_test, y_pred2)}')
print(f'MSE = {mean_squared_error(y_test, y_pred2)}')
print(f'RMSE = {np.sqrt(mean_squared_error(y_test, y_pred2))}')
print(f'R_2 = {r2_score(y_test, y_pred2)}')

Performance of the model on test data :

MAE = 1.3824319466948509e-11
MSE = 2.594038400966295e-22
RMSE = 1.6106018753764987e-11
R_2 = 1.0


## **Conclusion**

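In this notebook we built a multiple linear regression model on the 50 Startups dataset and evaluated it on both the training and test splits. The model reaches R² = 1.0 with errors on the order of 1e-11 on both splits. Because `X` was built with `select_dtypes` on the full `data` frame, it includes the `Profit` column itself, so these scores show the model recovering the target from its own copy in the features rather than how well the spend columns predict profit.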