# Linear Regression 

Our dataset contains 50 start-ups. For each one it records R&D spending, administration spending, marketing spending, the state, and the profit for one financial year. Our goal is to build a model that predicts profit, so that we can quickly assess which company is the most profitable. Profit is the dependent variable, and the other four variables are the independent variables.

R&D Spend -- research and development spending over the past few years

Administration -- administration spending over the past few years

Marketing Spend -- marketing spending over the past few years

State -- state from which the data was collected

Profit -- profit of each start-up over the past few years
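
As a sketch of the model fitted further down, multiple linear regression assumes that profit is a linear combination of the predictors plus an error term. The symbols $\beta$, $\gamma_s$, $D_s$ and $\varepsilon$ are notation introduced here for illustration and do not appear in the dataset; $D_s$ is 1 when the start-up is located in state $s$ and 0 otherwise.

$$
\text{Profit} = \beta_0 + \beta_1\,\text{R\&D Spend} + \beta_2\,\text{Administration} + \beta_3\,\text{Marketing Spend} + \sum_{s} \gamma_s D_s + \varepsilon
$$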

In [1]:
# Import libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
#read dataset

dataset = pd.read_csv('50_Startups.csv')
dataset.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94


In [3]:
# Separate features (independent variables) and label (dependent variable)
X = dataset.iloc[:, :-1]  # independent variables: every column except the last
y = dataset.iloc[:, -1]   # dependent variable: Profit, the last column

In [4]:
# Print features (independent variables)
X.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State
0,165349.2,136897.8,471784.1,New York
1,162597.7,151377.59,443898.53,California
2,153441.51,101145.55,407934.54,Florida
3,144372.41,118671.85,383199.62,New York
4,142107.34,91391.77,366168.42,Florida


In [5]:
#print target variable
y.head()

0    192261.83
1    191792.06
2    191050.39
3    182901.99
4    166187.94
Name: Profit, dtype: float64

In [6]:
# Encoding categorical data
X = pd.get_dummies(X,columns=['State'])
X.head()

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State_California,State_Florida,State_New York
0,165349.2,136897.8,471784.1,0,0,1
1,162597.7,151377.59,443898.53,1,0,0
2,153441.51,101145.55,407934.54,0,1,0
3,144372.41,118671.85,383199.62,0,0,1
4,142107.34,91391.77,366168.42,0,1,0
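
To make the encoding step concrete, here is a minimal sketch on a made-up three-row frame (the `toy` data is hypothetical and is not part of `50_Startups.csv`). It also shows the `drop_first=True` option of `pd.get_dummies`, which drops one indicator column so that the remaining ones are no longer perfectly collinear with the intercept. The notebook itself keeps all three columns.

```python
import pandas as pd

# Hypothetical frame, used only to illustrate the encoding.
toy = pd.DataFrame({
    "R&D Spend": [100.0, 200.0, 300.0],
    "State": ["California", "Florida", "New York"],
})

# One indicator column per state, as in the cell above.
full = pd.get_dummies(toy, columns=["State"])

# drop_first=True removes the first category (California),
# which then acts as the baseline state.
reduced = pd.get_dummies(toy, columns=["State"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Recent pandas versions return boolean indicator columns by default; passing `dtype=int` to `pd.get_dummies` gives the 0/1 integers shown in the output above.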


In [7]:
X.shape

(50, 6)

In [9]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [10]:
# Fitting Multiple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train.values, y_train.values)

LinearRegression()
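
`LinearRegression` performs an ordinary least-squares fit. The sketch below illustrates that idea with plain NumPy on synthetic, noise-free data, so the recovered coefficients can be checked against the ones used to generate the target. The names `X_toy`, `y_toy` and the coefficients 5, 2 and -0.5 are assumptions made up for this example; it is not the notebook's model.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 100, size=(40, 2))             # two made-up features
y_toy = 5.0 + 2.0 * X_toy[:, 0] - 0.5 * X_toy[:, 1]   # noise-free target

# Prepend a column of ones so the first coefficient acts as the intercept.
A = np.column_stack([np.ones(len(X_toy)), X_toy])

# Least-squares solution of A @ beta ≈ y_toy.
beta, _, rank, _ = np.linalg.lstsq(A, y_toy, rcond=None)
intercept, coefs = beta[0], beta[1:]

print("intercept:", intercept)
print("coefficients:", coefs)
```

On the fitted scikit-learn model, the corresponding values are available as `regressor.intercept_` and `regressor.coef_`.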

In [11]:
X_test.shape

(10, 6)

In [12]:
y_test.values


array([103282.38, 144259.4 , 146121.95,  77798.83, 191050.39, 105008.31,
        81229.06,  97483.56, 110352.25, 166187.94])

In [16]:
# Predicting the Test set results
# The model was fitted on NumPy arrays, so pass .values here as well;
# passing the DataFrame triggers an "X has feature names" warning.
y_pred = regressor.predict(X_test.values)
y_pred

array([103015.20159796, 132582.27760816, 132447.73845174,  71976.09851258,
       178537.48221055, 116161.24230165,  67851.69209676,  98791.73374687,
       113969.43533012, 167921.0656955 ])

In [17]:
#print actual label
y_test

28    103282.38
11    144259.40
10    146121.95
41     77798.83
2     191050.39
27    105008.31
38     81229.06
31     97483.56
22    110352.25
4     166187.94
Name: Profit, dtype: float64

In [18]:
#Evaluate the model using r2-score
from sklearn.metrics import r2_score
print('Test R2-Score: ', r2_score(y_test,y_pred))


Test R2-Score:  0.9347068473282424
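
The score can be cross-checked by hand from the definition R² = 1 - SS_res / SS_tot, using the `y_test` and `y_pred` values printed earlier. This is a sketch with those printed values copied in; the variable names `y_true` and `y_hat` are chosen here for illustration.

```python
import numpy as np

# Values copied from the y_test and y_pred outputs shown earlier.
y_true = np.array([103282.38, 144259.40, 146121.95, 77798.83, 191050.39,
                   105008.31, 81229.06, 97483.56, 110352.25, 166187.94])
y_hat = np.array([103015.20159796, 132582.27760816, 132447.73845174,
                  71976.09851258, 178537.48221055, 116161.24230165,
                  67851.69209676, 98791.73374687, 113969.43533012,
                  167921.0656955])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot

# Root mean squared error, in the same units as Profit.
rmse = np.sqrt(ss_res / len(y_true))

print(f"R2:   {r2:.4f}")
print(f"RMSE: {rmse:.2f}")
```

The R² value agrees with the `r2_score` result above. The RMSE adds a view of the typical error size in the same units as profit, which R² alone does not give.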


### Predict the profit of a start-up in California with the following expenses:
| Department | Expenses |
| --- | --- |
| R&D | 162597.70 |
| Administration | 151377.59 |
| Marketing | 443898.53 |


In [13]:
X_test.head(1)

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State_California,State_Florida,State_New York
28,66051.52,182645.56,118148.2,0,1,0


In [19]:
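# Each row: R&D Spend, Administration, Marketing Spend,
# State_California, State_Florida, State_New York
new_data = [
    [162597.70, 151377.59, 443898.53, 1, 0, 0],  # California
    [162597.70, 151377.59, 443898.53, 0, 1, 0],  # Florida
    [162597.70, 151377.59, 443898.53, 0, 0, 1],  # New York
    [162597.70, 151377.59, 443898.53, 1, 0, 0],  # California again (same as the first row)
]
new_pred = regressor.predict(new_data)
new_pred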

array([189547.28196893, 188587.99780887, 190246.65102145, 189547.28196893])
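
Hand-written rows like these are easy to get wrong, because the model only sees positions, not column names. One way to reduce that risk is a small helper that builds each row from named arguments in the training-column order. `make_row` and `FEATURE_ORDER` are hypothetical names introduced for this sketch; the column order is taken from the encoded `X.head()` output shown earlier.

```python
# Column order the model was trained on (from the encoded X.head() output).
FEATURE_ORDER = [
    "R&D Spend", "Administration", "Marketing Spend",
    "State_California", "State_Florida", "State_New York",
]


def make_row(rd, admin, marketing, state):
    """Hypothetical helper: build one feature row in training-column order."""
    values = {"R&D Spend": rd, "Administration": admin, "Marketing Spend": marketing}
    state_columns = FEATURE_ORDER[3:]
    # Set the indicator for the requested state to 1 and the others to 0.
    for col in state_columns:
        values[col] = 1 if col == f"State_{state}" else 0
    # Reject states the model has never seen instead of silently encoding all zeros.
    if not any(values[col] for col in state_columns):
        raise ValueError(f"unknown state: {state!r}")
    return [values[col] for col in FEATURE_ORDER]


row = make_row(162597.70, 151377.59, 443898.53, "California")
print(row)
```

A list of such rows can be passed to `regressor.predict` in the same way as `new_data`.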

In [16]:
np.array(new_data).shape

(4, 6)