# Project Streamlit

- Modeling the Titanic dataset

- Course name: Applied Machine Learning
- Course instructor: Sohail Tehranipour
- Student name: Afshin Masoudi Ashtiani
- Chapter 7: Building a Web App for Data Scientists
- Project: Streamlit Project
- Date: September 2024

## Step 1: Install required libraries

In [65]:
%pip install pandas numpy scikit-learn xgboost joblib tabulate

Note: you may need to restart the kernel to use updated packages.


## Step 2: Import required libraries

In [66]:
import os
import re
import time
import joblib
import pandas as pd
import numpy as np
from tabulate import tabulate

from sklearn.model_selection import train_test_split

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.tree import ExtraTreeClassifier
from xgboost import XGBClassifier

from sklearn.metrics import accuracy_score, roc_auc_score, recall_score, precision_score, f1_score, cohen_kappa_score, matthews_corrcoef

import warnings
warnings.filterwarnings('ignore')

## Step 3: Load the dataset

In [67]:
train_path = r'./repository/train.csv'
display = True

In [68]:
# Alternative (Google Colab only): load the dataset from a CSV file on Google Drive
# from google.colab import drive
# drive.mount('/content/drive')

# df = pd.read_csv('/content/drive/My Drive/Applied Machine Learning/Datasets/titanic_train.csv')
# X = df.drop(labels='Survived', axis=1)
# y = df.Survived

In [69]:
"""Load the dataset from a CSV file."""
df = pd.read_csv(train_path)
X = df.drop(labels='Survived', axis=1)
y = df.Survived

In [70]:
print(tabulate(df[:10], headers='keys', tablefmt='psql'))

+----+---------------+------------+----------+-----------------------------------------------------+--------+-------+---------+---------+------------------+---------+---------+------------+
|    |   PassengerId |   Survived |   Pclass | Name                                                | Sex    |   Age |   SibSp |   Parch | Ticket           |    Fare | Cabin   | Embarked   |
|----+---------------+------------+----------+-----------------------------------------------------+--------+-------+---------+---------+------------------+---------+---------+------------|
|  0 |             1 |          0 |        3 | Braund, Mr. Owen Harris                             | male   |    22 |       1 |       0 | A/5 21171        |  7.25   | nan     | S          |
|  1 |             2 |          1 |        1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female |    38 |       1 |       0 | PC 17599         | 71.2833 | C85     | C          |
|  2 |             3 |          1 |        3 | Hei

## Step 4: Split the dataset into Training and Testing sets

In [71]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=123)
print(tabulate(X_train[:10], headers='keys', tablefmt='psql'))

+-----+---------------+----------+-----------------------------------------------------+--------+-------+---------+---------+---------------+----------+---------+------------+
|     |   PassengerId |   Pclass | Name                                                | Sex    |   Age |   SibSp |   Parch | Ticket        |     Fare | Cabin   | Embarked   |
|-----+---------------+----------+-----------------------------------------------------+--------+-------+---------+---------+---------------+----------+---------+------------|
| 677 |           678 |        3 | Turja, Miss. Anna Sofia                             | female |    18 |       0 |       0 | 4138          |   9.8417 | nan     | S          |
| 547 |           548 |        2 | Padro y Manent, Mr. Julian                          | male   |   nan |       0 |       0 | SC/PARIS 2146 |  13.8625 | nan     | C          |
| 317 |           318 |        2 | Moraweck, Dr. Ernest                                | male   |    54 |       0 |     
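
As a rough check on the split sizes: with a float `test_size`, scikit-learn rounds the test share up and gives the remaining rows to training. The sketch below reproduces that arithmetic. It assumes the standard Kaggle Titanic `train.csv` with 891 rows; the row count is not printed in this notebook, so treat it as an assumption.

```python
import math

# Assumption: the standard Kaggle Titanic train.csv has 891 rows.
n_samples = 891
test_size = 0.1

# With a float test_size, the test count is rounded up;
# the remaining rows are used for training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
```

Under that assumption the models in Step 8 are scored on a fairly small test set, so differences of one or two percentage points between models correspond to only a handful of passengers.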

## Step 5: Pre-Process the data

- Create initial pre-process

In [72]:
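class PreProcessor(BaseEstimator, TransformerMixin):
    """Impute Age, split Cabin into letters and digits, and drop unused columns."""

    def fit(self, X, y=None):
        # SimpleImputer defaults to the mean, learned here from the training data only.
        self.ageImputer = SimpleImputer()
        self.ageImputer.fit(X[['Age']])
        return self

    def transform(self, X, y=None):
        # Work on a copy so the caller's DataFrame is not modified in place.
        X = X.copy()
        X['Age'] = self.ageImputer.transform(X[['Age']])
        # Missing cabins become 'M'; spaces are removed before letters and digits are separated.
        cabin = X['Cabin'].fillna('M').astype(str).str.replace(' ', '', regex=False)
        X['CabinClass'] = cabin.str.replace(r'[^a-zA-Z]', '', regex=True)
        # A cabin without digits (including the 'M' placeholder) gets number 0.
        X['CabinNumber'] = cabin.str.replace(r'[^0-9]', '', regex=True).replace('', '0').astype(int)
        X['Embarked'] = X['Embarked'].fillna('M')
        X = X.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)
        return X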
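
To make the cabin handling above easier to follow, here is a minimal standalone sketch of the same two regular expressions applied to single values. The helper name `split_cabin` and the sample values are illustrative and not part of the notebook.

```python
import re

def split_cabin(cabin):
    """Return (letters, number) for a raw Cabin value; None stands for a missing cabin."""
    # Missing cabins are replaced by the placeholder 'M'; spaces are removed.
    text = 'M' if cabin is None else str(cabin).replace(' ', '')
    letters = re.sub(r'[^a-zA-Z]', '', text)  # keep only the letters
    digits = re.sub(r'[^0-9]', '', text)      # keep only the digits
    return letters, (int(digits) if digits else 0)

# Illustrative values: a single cabin, a missing cabin, and a multi-cabin entry.
for raw in ['C85', None, 'C23 C25 C27']:
    print(raw, '->', split_cabin(raw))
```

Note that a multi-cabin entry collapses into one concatenated letter string and one concatenated number, so `CabinNumber` is not a real cabin number in that case. Whether that matters for the models is not examined here.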

## Step 6: Save the model

In [73]:
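def save_pipeline(pipeline: Pipeline, name: str) -> str:
    """Save the pipeline with joblib and return the file name ('' if nothing was saved)."""
    if pipeline is None:
        print("> No pipeline found to save ...!")
        return ''
    # Build a short prefix from the capital letters of the name, e.g. 'XGBClassifier' -> 'xgbc'.
    # The prefix is computed outside the f-string because reusing the same quote
    # character inside an f-string only works on Python 3.12 and later.
    prefix = ''.join(cap for cap in str(name) if cap.isupper()).lower()
    file_name = f'{prefix}pipe.joblib'
    joblib.dump(value=pipeline, filename=file_name)
    return file_name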
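
The file-naming rule used above can be checked in isolation. This is a small sketch with a hypothetical helper, `pipeline_file_name`, that only builds the name and writes nothing to disk.

```python
def pipeline_file_name(name):
    """Build '<capital letters, lower-cased>pipe.joblib' from a model name."""
    prefix = ''.join(ch for ch in str(name) if ch.isupper()).lower()
    return f'{prefix}pipe.joblib'

for model_name in ['LogisticRegression', 'RidgeClassifier',
                   'ExtraTreeClassifier', 'XGBClassifier']:
    print(model_name, '->', pipeline_file_name(model_name))

# Two different names can map to the same file, which would overwrite the earlier save.
print(pipeline_file_name('LinearRegression') == pipeline_file_name('LogisticRegression'))
```

The four models used in this project produce distinct names, so the collision is only a concern if more models are added later.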

## Step 7: Create pipelines for training models 

- Create columns transformer

In [74]:
preprocessor = PreProcessor()
numeric_pipeline = Pipeline([('Scaler', StandardScaler())])
categorical_pipeline = Pipeline([('OneHot', OneHotEncoder(handle_unknown='ignore'))])
transformer = ColumnTransformer([('num', numeric_pipeline, ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'CabinNumber']), ('cat', categorical_pipeline, ['Sex', 'Embarked', 'CabinClass'])])
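
As an illustration of what the two branches of the column transformer do, the following pure-Python sketch mimics standard scaling and one-hot encoding with unknown categories ignored. It is not scikit-learn's implementation; the function names and the category list are assumptions made for this example.

```python
import math

def standardize(values):
    """Scale values to zero mean and unit variance (population standard deviation)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

def one_hot(value, categories):
    """One indicator per known category; an unseen value yields all zeros."""
    return [1 if value == cat else 0 for cat in categories]

# Hypothetical categories learned during fit ('M' is the missing-value placeholder).
embarked_categories = ['C', 'M', 'Q', 'S']

print(one_hot('S', embarked_categories))  # a category seen during fit
print(one_hot('X', embarked_categories))  # an unseen category is encoded as all zeros
print(standardize([1, 2, 3]))
```

Encoding unseen categories as all zeros is what lets a fitted pipeline score new rows whose `CabinClass` or `Embarked` value never appeared in the training split, instead of raising an error.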

- Create pipelines and fit the pipelines

In [75]:
from sklearn.base import clone

# Clone the shared steps so each pipeline fits its own preprocessor and transformer.
def build_pipeline(step_name, estimator):
    return Pipeline([('InitialPreProc', clone(preprocessor)), ('Transformer', clone(transformer)), (step_name, estimator)])

lrpipe = build_pipeline('Logistic Regression', LogisticRegression()).fit(X_train, y_train)
rcpipe = build_pipeline('Ridge Classifier', RidgeClassifier()).fit(X_train, y_train)
etcpipe = build_pipeline('Extra Tree Classifier', ExtraTreeClassifier()).fit(X_train, y_train)
xgbcpipe = build_pipeline('XGB Classifier', XGBClassifier()).fit(X_train, y_train)

pipelines_df = pd.DataFrame([
    {'Model' : 'LogisticRegression', 'Filename' : save_pipeline(lrpipe, 'LogisticRegression')}, 
    {'Model' : 'RidgeClassifier', 'Filename' : save_pipeline(rcpipe, 'RidgeClassifier')}, 
    {'Model' : 'ExtraTreeClassifier', 'Filename' : save_pipeline(etcpipe, 'ExtraTreeClassifier')}, 
    {'Model' : 'XGBClassifier', 'Filename' : save_pipeline(xgbcpipe, 'XGBClassifier')}])

print(tabulate(pipelines_df, headers='keys', tablefmt='psql'))

+----+---------------------+-----------------+
|    | Model               | Filename        |
|----+---------------------+-----------------|
|  0 | LogisticRegression  | lrpipe.joblib   |
|  1 | RidgeClassifier     | rcpipe.joblib   |
|  2 | ExtraTreeClassifier | etcpipe.joblib  |
|  3 | XGBClassifier       | xgbcpipe.joblib |
+----+---------------------+-----------------+


## Step 8: Evaluate the models

In [76]:
eval_list = []
index_list = []

# process_time() counts CPU time, so 'TT (Sec)' covers loading each pipeline and predicting with it.
start_time = time.process_time()
for index, row in pipelines_df.iterrows():
    model = joblib.load(filename=row.Filename)
    y_pred = model.predict(X_test)
    current_time = time.process_time()

    eval_list.append({
        'Model' : row.Model,
        'Accuracy' : accuracy_score(y_test, y_pred),
        # AUC is computed from the hard 0/1 predictions, not from predicted probabilities.
        'AUC' : roc_auc_score(y_test, y_pred),
        'Recall' : recall_score(y_test, y_pred),
        'Prec.' : precision_score(y_test, y_pred),
        'F1' : f1_score(y_test, y_pred),
        'Kappa' : cohen_kappa_score(y_test, y_pred),
        'MCC' : matthews_corrcoef(y_test, y_pred),
        'TT (Sec)' : current_time - start_time})
    
    index_list.append((''.join(cap for cap in str(row.Model) if cap.isupper())).lower())

    start_time = current_time

eval_df = pd.DataFrame(eval_list, index= index_list)
print(tabulate(eval_df, headers='keys', tablefmt='psql'))

+------+---------------------+------------+----------+----------+----------+----------+----------+----------+------------+
|      | Model               |   Accuracy |      AUC |   Recall |    Prec. |       F1 |    Kappa |      MCC |   TT (Sec) |
|------+---------------------+------------+----------+----------+----------+----------+----------+----------+------------|
| lr   | LogisticRegression  |   0.811111 | 0.823733 | 0.857143 | 0.648649 | 0.738462 | 0.595024 | 0.60919  |    0.0625  |
| rc   | RidgeClassifier     |   0.811111 | 0.804147 | 0.785714 | 0.666667 | 0.721311 | 0.579901 | 0.584379 |    0.3125  |
| etc  | ExtraTreeClassifier |   0.766667 | 0.762097 | 0.75     | 0.6      | 0.666667 | 0.490566 | 0.497796 |    0.03125 |
| xgbc | XGBClassifier       |   0.833333 | 0.810484 | 0.75     | 0.724138 | 0.736842 | 0.614946 | 0.615149 |    0.0625  |
+------+---------------------+------------+----------+----------+----------+----------+----------+----------+------------+
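
The columns of this table can all be derived from a single confusion matrix. The sketch below does that by hand for one model. The four counts are an assumption: they were inferred from the reported XGBClassifier metrics and are not printed anywhere in the notebook.

```python
import math

# Assumption: counts inferred from the reported XGBClassifier metrics.
tp, fp, fn, tn = 21, 8, 7, 54
n = tp + fp + fn + tn

accuracy = (tp + tn) / n
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)
# With hard 0/1 predictions, the ROC AUC reduces to the mean of recall and specificity.
auc_from_labels = (recall + specificity) / 2
# Cohen's kappa compares observed agreement with the agreement expected by chance.
expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
kappa = (accuracy - expected) / (1 - expected)
mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

for name, value in [('Accuracy', accuracy), ('AUC', auc_from_labels), ('Recall', recall),
                    ('Prec.', precision), ('F1', f1), ('Kappa', kappa), ('MCC', mcc)]:
    print(f'{name:9s}{value:.6f}')
```

If the inferred counts are right, the printed values should agree with the XGBClassifier row. The AUC line also shows why this column behaves like a balanced accuracy here: it is computed from class labels, so it carries no information about how well the predicted probabilities rank passengers.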


## Step 9: Find the best model

In [77]:
best_index = eval_df.Accuracy.idxmax()
best_name = eval_df.loc[best_index, 'Model']
best_path = f'{best_index}pipe.joblib'
best_model = joblib.load(best_path)
print(f'>> The best model is {best_name}.')

>> The best model is XGBClassifier.

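Selecting on accuracy alone leaves ties unresolved: `idxmax` returns the first row holding the maximum, and in the evaluation table LogisticRegression and RidgeClassifier share the same accuracy. The sketch below shows one possible tie-break, ranking by accuracy and then by F1. The tie-break rule is a hypothetical choice for illustration; the Accuracy and F1 values are copied from the evaluation table in Step 8.

```python
# Accuracy and F1 values copied from the evaluation table in Step 8.
scores = {
    'lr':   {'Accuracy': 0.811111, 'F1': 0.738462},
    'rc':   {'Accuracy': 0.811111, 'F1': 0.721311},
    'etc':  {'Accuracy': 0.766667, 'F1': 0.666667},
    'xgbc': {'Accuracy': 0.833333, 'F1': 0.736842},
}

# Hypothetical tie-break: rank by accuracy first, then by F1.
ranking = sorted(scores,
                 key=lambda k: (scores[k]['Accuracy'], scores[k]['F1']),
                 reverse=True)
best_index = ranking[0]
best_path = f'{best_index}pipe.joblib'

print(ranking)
print(best_path)
```

With these numbers the winner is unchanged, since XGBClassifier has the highest accuracy outright; the secondary key only decides the order of the two tied models.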

## Step 10: Save the best model

In [78]:
joblib.dump(value= best_model, filename= best_path)

['xgbcpipe.joblib']

## Step 11: Predict the dataset

- Import the saved model

In [79]:
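# Note: this loads the logistic-regression pipeline; pass best_path instead to load the model selected in Step 9.
model = joblib.load(filename='lrpipe.joblib')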

- Import the testing dataset

In [80]:
test = pd.read_csv(r'./repository/test.csv')
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.1+ KB
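
The non-null counts above translate directly into missing-value counts, which shows what the preprocessing has to cope with on this file. A minimal sketch, using only the numbers printed by `test.info()`:

```python
# Non-null counts copied from the test.info() output above.
total_rows = 418
non_null = {'Age': 332, 'Fare': 417, 'Cabin': 91, 'Embarked': 418}

# Missing values per column = total rows minus non-null entries.
missing = {col: total_rows - count for col, count in non_null.items()}
print(missing)
```

`Age`, `Cabin` and `Embarked` are filled by the `PreProcessor` class from Step 5, but `Fare` is not, so the one missing fare reaches the model unchanged.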


- Predict the dataset

In [81]:
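# Note: this call predicts on `df`, the training data from Step 3, so the output below refers to those rows.
# With the logistic-regression pipeline loaded above, predicting on `test` would fail on its
# one missing Fare value, because the pipeline imputes only Age.
y_pred = model.predict(df)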
y_pred_df = pd.DataFrame(y_pred, columns= ['Prediction'])
print(tabulate(y_pred_df[:10], headers='keys', tablefmt='psql'))

+----+--------------+
|    |   Prediction |
|----+--------------|
|  0 |            0 |
|  1 |            1 |
|  2 |            1 |
|  3 |            1 |
|  4 |            0 |
|  5 |            0 |
|  6 |            0 |
|  7 |            0 |
|  8 |            1 |
|  9 |            1 |
+----+--------------+
