## Model Deployment
In the previous chapters we focussed on creating predicton models. Jupyter notebooks are great in data exploration, transformation and cleaning and in model tuning and evaluation. However, once you are fine with your model is time to use is in practice. Therefore we save the model to a file so we can use it over and over again.  We illustrate this using the titanic ensemble case. 

We first build the model. This typically has to be done only once or has to be repeated to improve the model as new data comes in or old data becomes outdated. We then save this model to a file. 

We then write a function that takes the features of a single passenger as parameters. That function then uses the model to guess wether or not that passenger has survived.  In a production environment that function is used together with the saved model. 

In [1]:
# we first build the model. 
import pandas as pd
url = 'https://raw.githubusercontent.com/HOGENT-Databases/DB3-Workshops/master/data/titanic.csv'
titanic = pd.read_csv(url)
titanic = titanic.drop(['PassengerId','Name','Ticket','Fare','Cabin','Embarked'],axis=1)
titanic = titanic.dropna()
titanic = pd.get_dummies(titanic, columns=["Sex"], prefix=["Sex"])
from sklearn.model_selection import train_test_split
X = titanic.drop('Survived',axis=1)
y = titanic['Survived']
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.30)
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

lr = LogisticRegression(solver='newton-cg')
rf100 = RandomForestClassifier(n_estimators=100) 
rf150 = RandomForestClassifier(n_estimators=150) 
rf200 = RandomForestClassifier(n_estimators=200) 
rf250 = RandomForestClassifier(n_estimators=250) 
gnb =  GaussianNB()

model = VotingClassifier(estimators=[('lr', lr), ('rf100', rf100),('rf150', rf150), ('rf200', rf200), 
                                     ('rf250', rf250), ('gnb', gnb)], voting='soft')
model.fit(X_train, y_train)

VotingClassifier(estimators=[('lr', LogisticRegression(solver='newton-cg')),
                             ('rf100', RandomForestClassifier()),
                             ('rf150',
                              RandomForestClassifier(n_estimators=150)),
                             ('rf200',
                              RandomForestClassifier(n_estimators=200)),
                             ('rf250',
                              RandomForestClassifier(n_estimators=250)),
                             ('gnb', GaussianNB())],
                 voting='soft')

In [2]:
# we now save the model to a file
# see https://scikit-learn.org/stable/modules/model_persistence.html

from google.colab import drive
drive.mount('/content/gdrive')
from joblib import dump
dump(model, '/content/gdrive/My Drive/survival_prediction_model.joblib')

ModuleNotFoundError: No module named 'google.colab'

In [3]:
# We will now use this model to guess wether or not an unseen passenger (one that has not been used to build the model)
# has survived or not. 

def PredictSurvival(model,Pclass,Sex,Age,SibSp,Parch):
    import pandas as pd
    passenger=pd.DataFrame(columns=['Pclass','Sex','Age','SibSp','Parch'])

    new_passenger = {'Pclass':Pclass,
                     'Sex':Sex,
                     'Age':Age,
                     'SibSp':SibSp,
                     'Parch':Parch}
    
    passenger = passenger.append(new_passenger,ignore_index=True)

    if Sex == 'male':
        passenger['Sex_male'] = 1
        passenger['Sex_female'] = 0
    else:
        passenger['Sex_male'] = 0
        passenger['Sex_female'] = 1        
    passenger.drop(columns=['Sex'],axis=1,inplace=True)

    # we can't use pd.get_dummies here because not all values (male,female) are available
    # for a single customer
    
    survived = model.predict(passenger)
    
    # most sklearn algorithms also offer a predict_proba method that returns an array of 
    # probabilities per class:
    survived_proba = model.predict_proba(passenger)
    return survived[0],survived_proba[0].max()


from joblib import load
model = load('/content/gdrive/My Drive/survival_prediction_model.joblib')

survived = PredictSurvival(model,Pclass=3,Sex='male',Age=40,SibSp=0,Parch=0)

print(survived)

survived = PredictSurvival(model,Pclass=1,Sex='female',Age=27,SibSp=0,Parch=0)

print(survived)



(1, 1.0)
(0, 0.93)
