### Use-case 2
A funding company has historical data on startups that records how much each one spent on R&D, administration, and marketing, along with its location. The company has hired you as a data scientist; your role is to create and deploy a model that predicts a startup's profit from its spending pattern and location.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [3]:
startupData = pd.read_csv('50_Startups.csv')

In [4]:
#Check for missing data (non-null counts) and the dtype of each column
startupData.info()
#The label (Profit) is numeric, so this is a regression problem
#We will use Multiple Linear Regression

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
R&D Spend          50 non-null float64
Administration     50 non-null float64
Marketing Spend    50 non-null float64
State              50 non-null object
Profit             50 non-null float64
dtypes: float64(4), object(1)
memory usage: 2.0+ KB


In [5]:
#ML algorithms expect the features to be numeric
startupData.head()
#Options for handling the categorical column (State):
# 1. Using pandas (get_dummies)
# 2. Label Encoding followed by One Hot Encoding

Unnamed: 0,R&D Spend,Administration,Marketing Spend,State,Profit
0,165349.2,136897.8,471784.1,New York,192261.83
1,162597.7,151377.59,443898.53,California,191792.06
2,153441.51,101145.55,407934.54,Florida,191050.39
3,144372.41,118671.85,383199.62,New York,182901.99
4,142107.34,91391.77,366168.42,Florida,166187.94
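
The first option mentioned in the comments above, `pd.get_dummies`, is not used in the rest of this notebook. As a minimal sketch of what it does, the example below applies it to a small made-up frame (the values and the `demo` name are illustrative assumptions, not rows from `50_Startups.csv`):

```python
import pandas as pd

# Small made-up frame reusing two column names from the dataset (values are illustrative only)
demo = pd.DataFrame({
    "R&D Spend": [1000.0, 2000.0, 3000.0],
    "State": ["New York", "California", "Florida"],
})

# One indicator column per state; dtype=int yields 0/1 instead of booleans
encoded = pd.get_dummies(demo, columns=["State"], dtype=int)
print(encoded.columns.tolist())
```

Each state becomes its own 0/1 column, which is the same representation the LabelEncoder plus OneHotEncoder route produces later on.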


In [15]:
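#Correlate only the numeric columns (recent pandas versions raise an error on the text column State)
startupData.select_dtypes(include='number').corr()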

Unnamed: 0,R&D Spend,Administration,Marketing Spend,Profit
R&D Spend,1.0,0.241955,0.724248,0.9729
Administration,0.241955,1.0,-0.032154,0.200717
Marketing Spend,0.724248,-0.032154,1.0,0.747766
Profit,0.9729,0.200717,0.747766,1.0


In [40]:
data = startupData.State.unique().tolist()
sorted(data)

['California', 'Florida', 'New York']

In [7]:
#Separate the data into features and label
features = startupData.iloc[:,0:4].values
label = startupData.iloc[:,[4]].values

In [10]:
#Handling Categorical Data
#A separate LabelEncoder object must be created for each categorical column that needs encoding
from sklearn.preprocessing import LabelEncoder
stateEncoder = LabelEncoder()
features[:,3] = stateEncoder.fit_transform(features[:,3])

In [12]:
from sklearn.preprocessing import OneHotEncoder
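#stateOHE = OneHotEncoder(categorical_features=[index of the label-encoded column])
#Note: categorical_features was deprecated in scikit-learn 0.20 and removed in 0.22,
#so this cell requires an older scikit-learn release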
stateOHE = OneHotEncoder(categorical_features=[3])
features = stateOHE.fit_transform(features).toarray()
features

array([[0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 1.6534920e+05,
        1.3689780e+05, 4.7178410e+05],
       [1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 1.6259770e+05,
        1.5137759e+05, 4.4389853e+05],
       [0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 1.5344151e+05,
        1.0114555e+05, 4.0793454e+05],
       [0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 1.4437241e+05,
        1.1867185e+05, 3.8319962e+05],
       [0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 1.4210734e+05,
        9.1391770e+04, 3.6616842e+05],
       [0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 1.3187690e+05,
        9.9814710e+04, 3.6286136e+05],
       [1.0000000e+00, 0.0000000e+00, 0.0000000e+00, 1.3461546e+05,
        1.4719887e+05, 1.2771682e+05],
       [0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 1.3029813e+05,
        1.4553006e+05, 3.2387668e+05],
       [0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 1.2054252e+05,
        1.4871895e+05, 3.1161329e+05],
       [1.0000000e+00, 0.0000000e+00,
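
On scikit-learn releases where `OneHotEncoder` no longer accepts `categorical_features`, a comparable result can be obtained with `ColumnTransformer`. The sketch below is one possible equivalent, not the method used in this notebook; it runs on made-up rows, and the names `demo_features` and `state_transformer` are assumptions introduced for illustration:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows in the same column order as `features`:
# R&D Spend, Administration, Marketing Spend, State
demo_features = np.array([
    [1000.0, 500.0, 300.0, "New York"],
    [2000.0, 600.0, 400.0, "California"],
    [3000.0, 700.0, 500.0, "Florida"],
], dtype=object)

# One-hot encode column 3 and pass the numeric columns through unchanged
state_transformer = ColumnTransformer(
    transformers=[("state", OneHotEncoder(handle_unknown="ignore"), [3])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
demo_encoded = state_transformer.fit_transform(demo_features)
print(demo_encoded.shape)
```

As in the output above, the three state indicator columns come first (in alphabetical order of the state names), followed by the three spend columns. With this approach the string column is encoded directly, so a separate `LabelEncoder` step is not needed.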

In [14]:
#Loop over random_state values to see which train-test split gives the best-performing model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

for i in range(1,51):
    X_train,X_test,y_train,y_test = train_test_split(features,
                                                label,
                                                test_size = 0.2,
                                                random_state=i)
    model = LinearRegression()
    model.fit(X_train,y_train)
    training_score = model.score(X_train,y_train)
    testing_score = model.score(X_test,y_test)

    #Print only the splits where the test score exceeds the training score
    if testing_score > training_score:
        print("Training Score {} Testing Score {} for Random State {}".format(training_score,testing_score,i))

Training Score 0.9424465426893971 Testing Score 0.9649618042060633 for Random State 1
Training Score 0.9398417195515446 Testing Score 0.9783259006626557 for Random State 2
Training Score 0.9473848999820091 Testing Score 0.9560357304860589 for Random State 4
Training Score 0.9438505226429931 Testing Score 0.9669763022158512 for Random State 5
Training Score 0.9385918220043519 Testing Score 0.990110511339781 for Random State 10
Training Score 0.9411603359254431 Testing Score 0.9726607102793833 for Random State 14
Training Score 0.946138584319559 Testing Score 0.9633877651310018 for Random State 21
Training Score 0.9425908513252554 Testing Score 0.9757906394981196 for Random State 22
Training Score 0.9464972114069966 Testing Score 0.9687727807395822 for Random State 24
Training Score 0.9454518446256155 Testing Score 0.9602561948870648 for Random State 26
Training Score 0.9482961316721963 Testing Score 0.9500997612784656 for Random State 29
Training Score 0.9435367947390881 Testing Score 0
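
The scores above vary with the split, and with only 50 rows a single 10-row test set can look better or worse by chance. One complementary check, offered here as a sketch rather than as part of the original workflow, is k-fold cross-validation, which averages the score over several splits. The data below is synthetic (`X_demo` and `y_demo` are stand-ins for `features` and `label`):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for `features` and `label`: 50 rows, 3 numeric columns
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([3.0, 0.5, 1.0]) + rng.normal(scale=0.1, size=50)

# Five shuffled folds; every row is used for testing exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")

print("R^2 per fold:", scores)
print("Mean R^2:", scores.mean())
```

The mean across folds gives a performance estimate that does not hinge on any one value of `random_state`.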

In [17]:
#Create the train-test split
#The training split is used to fit the model
#The testing split is used to check how well the model performs on unseen data
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(features,
                                                label,
                                                test_size = 0.2,
                                                random_state=10)

In [18]:
finalModel = LinearRegression()
finalModel.fit(X_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [19]:
print(finalModel.score(X_train,y_train))
print(finalModel.score(X_test,y_test))

0.9385918220043519
0.990110511339781
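
For a `LinearRegression` model, `score` returns the coefficient of determination R^2, that is, 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below recomputes it by hand on a tiny made-up dataset (the `demo_` names and values are illustrative assumptions) to show that the two agree:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny made-up dataset, roughly y = 2x
X_demo = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_demo = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

demo_model = LinearRegression().fit(X_demo, y_demo)
pred = demo_model.predict(X_demo)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_demo - pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, demo_model.score(X_demo, y_demo))
```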


In [20]:
stateEncoder.classes_

array(['California', 'Florida', 'New York'], dtype=object)

In [25]:
#Features --- R&D Spend, Administration , Marketing Spend, State
rdSpend = float(input("Enter R&D spend: "))
admSpend = float(input("Enter Administration spend: "))
marketingSpend = float(input("Enter Marketing Spend: "))
state = input("Enter State: ")

if state in stateEncoder.classes_:
    
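    #dtype=object keeps the numbers as floats instead of converting the whole row to strings
    featureInput = np.array([[rdSpend,admSpend,marketingSpend,state]], dtype=object)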
    #Applying LabelEncoding
    featureInput[:,3] = stateEncoder.transform(featureInput[:,3])
    #Applying OneHotEncoding
    featureInput = stateOHE.transform(featureInput).toarray()
    #Predict
    profit = finalModel.predict(featureInput)
    #Print the profit
    print("Predicted Profit is ",profit)
    

else:
    print("Model doesn't know about businesses in {}".format(state))

Enter R&D spend: 2345
Enter Administration spend: 2345
Enter Marketing Spend: 2345
Enter State: California
Predicted Profit is  [[51986.93957031]]


In [26]:
#Deployment process: persist the model and both encoders so they can be reloaded later
import pickle
with open("ProfitPredictionModel.model", "wb") as f:
    pickle.dump(finalModel, f)
with open("StateEncoder.encoder", "wb") as f:
    pickle.dump(stateEncoder, f)
with open("StateOHE.ohe", "wb") as f:
    pickle.dump(stateOHE, f)
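
On the deployment side, the pickled objects would be read back with `pickle.load` before predicting. The sketch below shows that round trip with a stand-in model trained on made-up data and written to a temporary file; `demo_model` and `model_path` are hypothetical names, and a real service would open `ProfitPredictionModel.model` and the two encoder files instead:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in model trained on made-up data (exactly y = 2x + 1)
demo_model = LinearRegression().fit(
    np.array([[0.0], [1.0], [2.0]]), np.array([1.0, 3.0, 5.0])
)

# Hypothetical location; a temporary directory keeps the example self-contained
model_path = os.path.join(tempfile.mkdtemp(), "demo.model")
with open(model_path, "wb") as f:
    pickle.dump(demo_model, f)

# Reload the model and predict with it, as a serving script would
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

print(loaded_model.predict(np.array([[3.0]])))
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.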