# Predicting a Startup's Profit/Success Rate Using Multiple Linear Regression in Python

In this notebook we build a Multiple Linear Regression model on a dataset that describes 50 startups, and we use it to predict the profit of a new startup from a handful of features. For venture capitalists, such a model can help decide whether a particular startup is worth investing in. So let's say that you work for a venture capital firm that has hired you as a data scientist to derive insights from the data and to help predict whether a particular startup would be a safe investment. The data can also show what difference it makes when a startup is launched in a particular state, and whether the better-performing startups are the ones that spent more on marketing or the ones whose stellar R&D department led to their profit and, in turn, their fame and success.



## Import Necessary Libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


In [2]:
startup_data = pd.read_csv('50_Startups.csv')
startup_data.head()

|   | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|-----------|----------------|-----------------|-------|--------|
| 0 | 165349.2 | 136897.8 | 471784.1 | New York | 192261.83 |
| 1 | 162597.7 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |


In [3]:
startup_data.shape

(50, 5)

In [4]:
# Let's check for missing values in the dataset
startup_data.isnull().sum()

R&D Spend          0
Administration     0
Marketing Spend    0
State              0
Profit             0
dtype: int64

In [5]:
startup_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   R&D Spend        50 non-null     float64
 1   Administration   50 non-null     float64
 2   Marketing Spend  50 non-null     float64
 3   State            50 non-null     object 
 4   Profit           50 non-null     float64
dtypes: float64(4), object(1)
memory usage: 2.1+ KB


In [6]:
# Let's check the summary statistics of the numeric columns
startup_data.describe()

|       | R&D Spend | Administration | Marketing Spend | Profit |
|-------|-----------|----------------|-----------------|--------|
| count | 50.0 | 50.0 | 50.0 | 50.0 |
| mean | 73721.6156 | 121344.6396 | 211025.0978 | 112012.6392 |
| std | 45902.256482 | 28017.802755 | 122290.310726 | 40306.180338 |
| min | 0.0 | 51283.14 | 0.0 | 14681.4 |
| 25% | 39936.37 | 103730.875 | 129300.1325 | 90138.9025 |
| 50% | 73051.08 | 122699.795 | 212716.24 | 107978.19 |
| 75% | 101602.8 | 144842.18 | 299469.085 | 139765.9775 |
| max | 165349.2 | 182645.56 | 471784.1 | 192261.83 |


In [7]:
## Let's separate the independent (predictor) features from the dependent (target) variable


In [24]:
x = startup_data.iloc[: , 0 : -1].values

In [25]:
x

array([[165349.2, 136897.8, 471784.1, 'New York'],
       [162597.7, 151377.59, 443898.53, 'California'],
       [153441.51, 101145.55, 407934.54, 'Florida'],
       [144372.41, 118671.85, 383199.62, 'New York'],
       [142107.34, 91391.77, 366168.42, 'Florida'],
       [131876.9, 99814.71, 362861.36, 'New York'],
       [134615.46, 147198.87, 127716.82, 'California'],
       [130298.13, 145530.06, 323876.68, 'Florida'],
       [120542.52, 148718.95, 311613.29, 'New York'],
       [123334.88, 108679.17, 304981.62, 'California'],
       [101913.08, 110594.11, 229160.95, 'Florida'],
       [100671.96, 91790.61, 249744.55, 'California'],
       [93863.75, 127320.38, 249839.44, 'Florida'],
       [91992.39, 135495.07, 252664.93, 'California'],
       [119943.24, 156547.42, 256512.92, 'Florida'],
       [114523.61, 122616.84, 261776.23, 'New York'],
       [78013.11, 121597.55, 264346.06, 'California'],
       [94657.16, 145077.58, 282574.31, 'New York'],
       [91749.16, 114175.79, 29491

In [23]:
y = startup_data.iloc[:,-1].values
y

array([192261.83, 191792.06, 191050.39, 182901.99, 166187.94, 156991.12,
       156122.51, 155752.6 , 152211.77, 149759.96, 146121.95, 144259.4 ,
       141585.52, 134307.35, 132602.65, 129917.04, 126992.93, 125370.37,
       124266.9 , 122776.86, 118474.03, 111313.02, 110352.25, 108733.99,
       108552.04, 107404.34, 105733.54, 105008.31, 103282.38, 101004.64,
        99937.59,  97483.56,  97427.84,  96778.92,  96712.8 ,  96479.51,
        90708.19,  89949.14,  81229.06,  81005.76,  78239.91,  77798.83,
        71498.49,  69758.98,  65200.33,  64926.08,  49490.75,  42559.73,
        35673.41,  14681.4 ])

## Encoding categorical data


Label encoding converts categorical labels into numeric form so that they become machine-readable. Machine learning algorithms can then operate on those labels directly. It is an important preprocessing step for structured datasets in supervised learning.

### Limitation of Label Encoding

Label encoding makes the data machine-readable, but it does so by assigning a unique integer (starting from 0) to each class. This can introduce an unintended ordering during training: a label with a higher value may be treated as having higher priority than a label with a lower value.
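To make the ordering issue concrete, here is a minimal sketch of label encoding on a toy list of state names. The list `states` and the variable names are illustrative assumptions, not part of the notebook's dataset handling.

```python
from sklearn.preprocessing import LabelEncoder

# Toy column of state names (illustrative, not the full dataset)
states = ["New York", "California", "Florida", "New York"]

encoder = LabelEncoder()
codes = encoder.fit_transform(states)

# classes_ is sorted alphabetically, so the integer codes follow that order
print(list(encoder.classes_))  # ['California', 'Florida', 'New York']
print(list(codes))             # [2, 0, 1, 2]
```

The codes 0, 1 and 2 carry no real meaning, yet a linear model would treat "New York" (2) as twice "Florida" (1), which is the priority problem described above.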

If we take a look at our dataset, we can clearly see that State is a string variable, and as discussed, we cannot feed strings into our machine learning model because it can only work with numbers. To overcome this problem we use the LabelEncoder object and then create dummy variables using the OneHotEncoder object. If our dataset had only two states, say New York and California, the one-hot encoding would have just two columns. Similarly, n different states produce n columns, and each state is represented by a series of 0s and 1s in which every column is 0 except the column for that particular state. For example, if A, B and C are three states, then A = 100, B = 010 and C = 001, which should make it clear how the OneHotEncoder works.
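The A, B, C example can be reproduced in a few lines. This is a small sketch on a toy single-column array; the names `toy_states` and `one_hot` are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical column with three distinct states (toy data)
toy_states = np.array([["A"], ["B"], ["C"], ["A"]])

toy_encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default, so convert it to a dense array
one_hot = toy_encoder.fit_transform(toy_states).toarray()

print(one_hot)
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]]
```

Each row has exactly one 1, in the column that belongs to that row's state.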

In [17]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [18]:
labelencoder = LabelEncoder()

The only categorical column is the name of the state, which is stored at index 3 of our feature array, so that is the column we encode.



In [33]:
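# Replace the state names in column 3 with integer codes
x[:, 3] = labelencoder.fit_transform(x[:, 3])
# Note: with no column selection, OneHotEncoder encodes every column of x,
# including the three numeric spend columns, not only the state column
onehotencoder = OneHotEncoder()
x = onehotencoder.fit_transform(x).toarray()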

In [35]:
x = x[:, 1:]

In [36]:
x

array([[0., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [1., 0., 0., ..., 0., 0., 1.],
       [0., 0., 0., ..., 1., 0., 0.]])
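One thing worth noting about the encoding step: `OneHotEncoder()` applied to the whole array treats every column as categorical, so the numeric spend columns are expanded into indicator columns too, which is consistent with the very wide array printed above. Below is a minimal sketch of encoding only the State column with scikit-learn's `ColumnTransformer`, using the first three rows of the dataset. The names `x_demo`, `ct` and `x_encoded` are illustrative, and `drop="first"` reflects an assumption that the first column was sliced off to avoid the dummy variable trap.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# First three rows of the feature array (R&D, Administration, Marketing, State)
x_demo = np.array([
    [165349.2, 136897.8, 471784.1, "New York"],
    [162597.7, 151377.59, 443898.53, "California"],
    [153441.51, 101145.55, 407934.54, "Florida"],
], dtype=object)

# One-hot encode only column 3; pass the numeric columns through unchanged.
# drop="first" removes one dummy column (assumed intent: avoid the dummy variable trap).
# sparse_threshold=0 forces a dense result.
ct = ColumnTransformer(
    transformers=[("state", OneHotEncoder(drop="first"), [3])],
    remainder="passthrough",
    sparse_threshold=0,
)
x_encoded = np.asarray(ct.fit_transform(x_demo), dtype=float)

# Encoded columns come first (Florida, New York), followed by the three spend columns
print(x_encoded.shape)  # (3, 5)
```

With this approach the numeric features keep their values, so the regression can use them as continuous predictors. The results reported in this notebook were not produced with this variant.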

## Splitting the dataset into the Training set and Test set


We import `train_test_split` and hold out 20% of the data as the test set, training on the remaining 80%.

In [37]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 0)
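As a quick sanity check of what an 80/20 split does to 50 rows, here is a small sketch on placeholder arrays. `X_demo` and `y_demo` are synthetic stand-ins, not the startup data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the dataset (50)
X_demo = np.arange(100).reshape(50, 2)
y_demo = np.arange(50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)  # (40, 2) (10, 2)
```

With only 10 test rows, any metric computed on the test set will be fairly noisy.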


##  Fitting our Linear Regression Model


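We now fit a Multiple Linear Regression model to the training set. With three numeric features $x_1, x_2, x_3$ and $m-1$ dummy variables $D_1, \dots, D_{m-1}$ for $m$ states, the regression equation looks like:

$$y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 D_1 + b_5 D_2 + \dots + b_{m+2} D_{m-1}$$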
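The regression equation maps directly onto the attributes of a fitted `LinearRegression`: `intercept_` plays the role of b(0) and `coef_` holds the remaining coefficients. The sketch below checks this on synthetic data with three numeric columns and two dummy columns. All data and names here are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data (not the startup dataset): 3 numeric columns + 2 dummy columns
numeric = rng.normal(size=(30, 3))
dummies = np.eye(3)[rng.integers(0, 3, size=30)][:, 1:]  # 3 categories, first dummy dropped
X_demo = np.hstack([numeric, dummies])
y_demo = 2.0 + X_demo @ np.array([1.0, -0.5, 0.3, 4.0, -2.0])

model = LinearRegression().fit(X_demo, y_demo)

# y = b0 + b1*x1 + b2*x2 + b3*x3 + b4*D1 + b5*D2, written as a dot product
manual_pred = model.intercept_ + X_demo @ model.coef_

print(np.allclose(manual_pred, model.predict(X_demo)))  # True
```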

In [38]:
from sklearn.linear_model import LinearRegression

In [39]:
regressor = LinearRegression()

In [41]:
regressor.fit(X_train, y_train)

LinearRegression()

In [42]:
y_pred = regressor.predict(X_test)

In [43]:
y_pred

array([126446.24292557, 101024.60153913, 122757.05006124,  84879.87036901,
       127925.23704126, 114624.1813817 , 111103.92347622, 127443.18787736,
       122052.52547925, 127925.23704126])

In [44]:
pd.DataFrame({"Actual": y_test, "Predict": y_pred}).head()

|   | Actual | Predict |
|---|--------|---------|
| 0 | 103282.38 | 126446.242926 |
| 1 | 144259.4 | 101024.601539 |
| 2 | 146121.95 | 122757.050061 |
| 3 | 77798.83 | 84879.870369 |
| 4 | 191050.39 | 127925.237041 |


In [45]:
from sklearn.metrics import mean_squared_error, r2_score

# Store the results under new names so the imported functions are not overwritten
mse = mean_squared_error(y_test, y_pred)
print('Mean Squared Error: %.2f' % mse)

r2 = r2_score(y_test, y_pred)
print('R2 Score: %.2f' % r2)

Mean Squared Error: 1047013959.82
R2 Score: 0.18
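For readers who want to see what these two metrics compute, here is a small sketch that works them out by hand with NumPy on the five actual/predicted pairs shown in the table earlier. Because it uses only five of the ten test rows, its numbers will differ from the scores printed above. The variable names are illustrative.

```python
import numpy as np

# First five actual/predicted pairs from the comparison table
y_true = np.array([103282.38, 144259.4, 146121.95, 77798.83, 191050.39])
y_hat = np.array([126446.242926, 101024.601539, 122757.050061,
                  84879.870369, 127925.237041])

residuals = y_true - y_hat

# Mean squared error: average of the squared residuals
mse_manual = np.mean(residuals ** 2)

# Root mean squared error: same units as Profit, easier to interpret
rmse_manual = np.sqrt(mse_manual)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print('MSE: %.2f, RMSE: %.2f, R2: %.2f' % (mse_manual, rmse_manual, r2_manual))
```

An R2 close to 0 means the model explains little more of the variance than always predicting the mean profit would.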
