
## **Classification project - Titanic - Shahar B, January 2020**
*Binary classification*

# **Import Packages**


In [None]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import seaborn as sns
import math
from datetime import datetime


# Preprocess tools
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.base import TransformerMixin
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.preprocessing import LabelBinarizer
from sklearn.utils.class_weight import compute_class_weight

# Models
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold


# Metrics
from sklearn.metrics import log_loss, accuracy_score, confusion_matrix, classification_report, auc

# Visualizations
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import roc_curve, roc_auc_score

import pickle
from sys import modules
import warnings
warnings.filterwarnings('ignore')

# **Data Description**

 **Data Overview**

Source: Kaggle

dataset link: https://www.kaggle.com/c/titanic


The data has been split into two groups:

- training set (train.csv)
- test set (test.csv)

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
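
The gender-only baseline described above can be reproduced in a few lines. This is a minimal sketch: the `passengers` frame holds made-up rows (not data from `test.csv`), and `gender_baseline` is a hypothetical helper name.

```python
import pandas as pd

# Made-up rows for illustration only (not taken from test.csv).
passengers = pd.DataFrame(
    {"Sex": ["male", "female", "female", "male"]},
    index=pd.Index([892, 893, 894, 895], name="PassengerId"),
)

def gender_baseline(frame):
    """Predict survival for all and only the female passengers."""
    return (frame["Sex"] == "female").astype(int).rename("Survived")

baseline_predictions = gender_baseline(passengers)
print(baseline_predictions.tolist())
```

Any model trained later in the notebook should at least beat this kind of one-rule baseline to be worth its complexity.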


*Variable Notes*
pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5

sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.

# **Get the dataset**

## **Upload data files**
1. **Upload dataset CSV format file**

In [None]:
# def user_upload_multiple_files():
#     number_of_files = int(input("please enter number of files to upload\n"))

#     for file in range(0,number_of_files):
#         if 'google.colab' in modules:
#             from google.colab import files
#             uploaded = files.upload()
# user_upload_multiple_files()

# **Data Manipulation and EDA - Research**

## **Data Manipulation**

### **Read Data file**

In [None]:
df = pd.read_csv(r'train.csv', encoding="UTF-8", index_col="PassengerId")

### **Target Variable Definition**

In [None]:
target_variable = "Survived"

### **Data Head**

In [None]:
df.head()

### **Data Tail**

In [None]:
df.tail()

### **Dataset info**

In [None]:
df.info()

### **Describe dataset**

In [None]:
df.describe().T

### **Target variable proportion**

In [None]:
df[target_variable].value_counts(normalize=True)

### **Pairplot**

In [None]:
sns.pairplot(df, hue=target_variable, plot_kws={'alpha': 0.5})

**Pair Plot Notes**

**Age**

Passenger ages are roughly normally distributed, ranging from a 5-month-old baby to an 80-year-old passenger.
The mean and median age are both around 28-30.

Among the youngest passengers there are more survivors than casualties.


**P class**

The vast majority of passengers travelled in the lowest class (3rd), and many of the passengers in that class were lost.
In contrast, most of the passengers in the upper class (1st) survived.



### **Check Nulls**

In [None]:
df.isnull().sum(axis = 0)

**Pre-processing insights:**

There are null values in Age and Cabin, and two in Embarked.

1. Age - We will fill the null ages using a passenger title feature.
We'll extract each passenger's title from their name (Mr., Mrs., etc.).
The hypothesis is that a passenger's title indicates their approximate age.

2. Cabin - There are A LOT of nulls in this column.
We'll try to use it anyway, taking the first letter of the cabin as a feature and filling the nulls with "Not Mentioned" as its own category.

3. Embarked - We'll fill the two nulls with the most frequent embarkation port.
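
The three imputation ideas above can be sketched on a toy frame. The rows are made up, the `Title` column stands in for the title feature built in the next cells, and the cabin step is simplified to taking the first character, so treat this as an illustration rather than the notebook's exact logic.

```python
import numpy as np
import pandas as pd

# Made-up rows; column names mirror the Kaggle file where possible.
toy = pd.DataFrame({
    "Title": ["Mr", "Mr", "Master", "Master", "Mr"],
    "Age": [30.0, 40.0, 4.0, np.nan, np.nan],
    "Cabin": ["C85", None, None, "E46", None],
    "Embarked": ["S", "S", "C", None, "S"],
})

# 1. Age: fill missing ages with the mean age of passengers sharing the title.
toy["Age"] = toy["Age"].fillna(toy.groupby("Title")["Age"].transform("mean"))

# 2. Cabin: keep the first letter, with missing cabins as their own category.
toy["Cabin Class"] = toy["Cabin"].str[0].fillna("Not Mentioned")

# 3. Embarked: fill missing ports with the most frequent port.
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

print(toy)
```

Here the missing "Master" age becomes 4.0 and the missing "Mr" age becomes 35.0, which is the behaviour the title hypothesis relies on.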

### **Handle Nulls**

#### Handle Nulls and 0s in age

##### Create Title Feature

In [None]:
# Raw string; [a-zA-Z] (not [a-zA-z], which also matches punctuation such as "[" and "_")
regex_pattern_name = r'([a-zA-Z]*)\.(.*)'

In [None]:
def extract_title_from_name(string):
    title = re.search(regex_pattern_name, string)
    if not title:
        return 'No Title'
    return title.group(1)
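
To make the extraction concrete, the sketch below applies the same kind of pattern to a few names in the format of the `Name` column. The pattern is written here with the character class `[a-zA-Z]`; `title_of` and the last example string are hypothetical and only serve the illustration.

```python
import re

# The first group captures the run of letters right before the first period.
TITLE_PATTERN = re.compile(r"([a-zA-Z]*)\.(.*)")

def title_of(name):
    """Return the word before the first period, or 'No Title' if there is none."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else "No Title"

examples = ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina", "No Period Here"]
titles = [title_of(name) for name in examples]
print(titles)
```

Because `re.search` returns the leftmost match, the surname before the comma is skipped and the first word that is directly followed by a period is returned.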

In [None]:
df['Passanger Title'] = df['Name'].apply(extract_title_from_name)

In [None]:
df['Passanger Title'].value_counts()

In [None]:
title_dict = {  
                "Mr" : "Mr",
                "Miss" : "Miss",
                "Mrs" : "Mrs",
                "Master" : "Master",
                "Dr" : "Mr",
                "Rev" : "Other",
                "Col" : "Other",
                "Major" : "Mr",
                "Mlle" : "Miss",
                "Jonkheer" : "Other",
                "Lady" : "Mrs",
                "Capt" : "Mr",
                "Sir" : "Mr",
                "Mme" : "Miss",
                "Countess" : "Mrs",
                "Ms" : "Miss",
                "Don" : "Mr"
                }

In [None]:
df['Passanger Title'] = np.where(df['Passanger Title'].isin(title_dict), df['Passanger Title'].map(title_dict), "Other")

In [None]:
df[df['Passanger Title']=='Master'].sort_values('Age',ascending = False)

**Insights**

The "Master" title is used for children (age under 12).

##### Group by and filling Age nulls

In [None]:
df.groupby(['Passanger Title'])['Age'].mean()

In [None]:
age_dict = df.groupby(['Passanger Title'])['Age'].mean().to_dict()

In [None]:
df.loc[(df['Age'].isnull()) | (df['Age']==0),'Age'] = df['Passanger Title'].map(age_dict)

In [None]:
df.isnull().sum()

#### Handle Nulls and 0s in Embarked

##### calculate port with max passengers

In [None]:
max_embarked_port = df['Embarked'].value_counts().idxmax()

##### fill nulls with port with max passengers

In [None]:
df.loc[df['Embarked'].isnull(), 'Embarked'] = str(max_embarked_port)

In [None]:
df.isnull().sum()

#### Handle nulls in Cabin

In [None]:
# Raw string so that \d is passed to the regex engine unchanged
regex_pattern_cabin = r'^([A-Z])(\d*)$'

def extract_cabin_class_from_cabin(string):   
    if string == 'Not Mentioned':
        return 'Not Mentioned'
    cabin_class = re.search(regex_pattern_cabin, string)
    if not cabin_class:
        return 'Not Mentioned'
    return cabin_class.group(1)
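
A quick check of what this pattern accepts is useful, because the `^` and `$` anchors mean only a single cabin code (one letter followed by optional digits) matches. The sketch below uses a hypothetical helper `deck_of` and a few cabin strings in the style of the dataset.

```python
import re

CABIN_PATTERN = re.compile(r"^([A-Z])(\d*)$")

def deck_of(cabin):
    """Return the leading letter of a single cabin code, else 'Not Mentioned'."""
    match = CABIN_PATTERN.search(cabin)
    return match.group(1) if match else "Not Mentioned"

samples = ["C85", "D", "C23 C25 C27", "Not Mentioned"]
decks = {cabin: deck_of(cabin) for cabin in samples}
print(decks)
```

Entries that list several cabins, such as "C23 C25 C27", contain spaces and therefore fall into the "Not Mentioned" category together with the true nulls.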

In [None]:
# Assign the result back instead of using inplace=True on a column selection
df['Cabin'] = df['Cabin'].fillna('Not Mentioned')
df['Cabin Class'] = df['Cabin'].apply(extract_cabin_class_from_cabin)

In [None]:
df.isnull().sum()

**No more nulls**

### **Feature Extraction**

#### Fare Per Ticket

In [None]:
num_member_dict = df.groupby(['Ticket'])['Name'].count().to_dict()
df.loc[:,'num_of_members'] = df['Ticket'].map(num_member_dict)

In [None]:
df['fare_per_ticket'] = df['Fare'] / df['num_of_members']

In [None]:
df.head()

#### Fare Buckets

In [None]:
df['fare_per_ticket'].hist()

In [None]:
fare_buckets=pd.qcut(df['fare_per_ticket'],4).unique().to_list()
fare_buckets

In [None]:
fare_buckets_dict = {
                    (0, 7.762) : 0,
                    (7.762, 8.85) : 1,
                    (8.85, 24.288) : 2,
                    (24.288, math.inf) : 3
                    }   
for bounds, value in fare_buckets_dict.items():
    lower_bound, upper_bound = bounds
    df.loc[((df['fare_per_ticket'] > lower_bound) & (df['fare_per_ticket'] <= upper_bound)), "fare_buckets"] = value
df.loc[df['fare_buckets'].isnull(), "fare_buckets"] = -1

In [None]:
df.fare_buckets.value_counts()
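
The bucket bounds above were copied by hand from the `qcut` output. As an alternative sketch (not the notebook's method), the quantile edges can be kept programmatically with `retbins=True` and reused on new data with `pd.cut`. The fares below are made up, and widening the outer edges to infinity is an assumption about how out-of-range values should be handled.

```python
import numpy as np
import pandas as pd

# Made-up fares standing in for the training column.
train_fares = pd.Series([5.0, 7.0, 8.0, 9.0, 12.0, 20.0, 30.0, 80.0])

# Learn quartile edges on the training data and keep them for later reuse.
train_buckets, edges = pd.qcut(train_fares, 4, labels=False, retbins=True)

# Widen the outer edges so unseen values outside the training range still get a bucket.
edges[0], edges[-1] = -np.inf, np.inf

# Apply the same edges to new data.
new_fares = pd.Series([0.0, 8.5, 500.0])
new_buckets = pd.cut(new_fares, bins=edges, labels=False)
print(new_buckets.tolist())
```

Unlike the loop above, this variant never produces a `-1` bucket, so whether it is preferable depends on whether zero fares should stay a separate category.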

In [None]:
df.num_of_members.value_counts()

#### Age Buckets

In [None]:
df.Age.hist()

In [None]:
age_buckets_dict = {
                    (0, 18) : 0,
                    (18, 29) : 1,
                    (29, 40) : 2,
                    (40, 55) : 3,
                    (55,  math.inf): 4
                    }   
for bounds, value in age_buckets_dict.items():
    lower_bound, upper_bound = bounds
    df.loc[((df.Age > lower_bound) & (df.Age <= upper_bound)), "age_buckets"] = value
df.loc[df.age_buckets.isnull(), "age_buckets"] = -1

In [None]:
df.age_buckets.value_counts()

#### Gender

In [None]:
df['is_male?'] = np.where((df['Sex']=='male'), 1, 0)

#### number of family members onboard

In [None]:
df.loc[:, 'family_onboard'] = df.Parch + df.SibSp

In [None]:
df.family_onboard.value_counts()

In [None]:
family_onboard_dict = {
                    (0, 0) : 0,
                    (1, 1) : 1,
                    (2, 2) : 2,
                    (3,  math.inf) : 3
                    }   
for bounds, value in family_onboard_dict.items():
    lower_bound, upper_bound = bounds
    df.loc[((df.family_onboard >= lower_bound) & (df.family_onboard <= upper_bound)), "family_onboard_buckets"] = value

In [None]:
df.family_onboard_buckets.value_counts()

#### is_parent_with_3_childrens_or_more Feature

In [None]:
df["is_parent_with_3_childrens_or_more"] = np.where(df.Parch>2, 1, 0)

In [None]:
df["is_parent_with_3_childrens_or_more"].value_counts()

#### Drop unnecessary columns

In [None]:
df.head()

In [None]:
df_column_list = ['Name', 'Cabin', 'Ticket', 'Sex', 'Age', 'Fare', 'fare_per_ticket', 'family_onboard', 'SibSp', 'Parch']
df = df.drop(df_column_list, axis=1)

In [None]:
df.head()

### Split features & target variable

In [None]:
X = df.drop([target_variable], axis=1)
y = df[target_variable]

### Handle Categorical Features & Scaling

#### Split to categorical and non-categorical dfs

In [None]:
categorical_columns=['Embarked', 'Passanger Title', 'Cabin Class']
X_categorical = X[categorical_columns]
non_categorical_columns = [column for column in X.columns if column not in categorical_columns]
X_non_categorical = X[non_categorical_columns]

#### OneHotEncoder Categorical Features

In [None]:
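# Note: newer scikit-learn releases replace `sparse` with `sparse_output`
# and `get_feature_names` with `get_feature_names_out`.
encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)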
encoder.fit(X_categorical)
cat_features_names = list(encoder.get_feature_names(X_categorical.columns))

X_categorical_transformed = encoder.transform(X_categorical)
#X_categorical_transformed_dense = X_categorical_transformed.todense()

X_categorical_transformed_df = pd.DataFrame(X_categorical_transformed, columns=cat_features_names, index=X_categorical.index)

#### Scaling the Data

In [None]:
# data_scaler = MinMaxScaler()
# X_train = pd.DataFrame(scaler.fit_transform(X_train), index=X_train.index, columns=X_train.columns)
# X_test = pd.DataFrame(scaler.transform(X_test), index=X_test.index, columns=X_test.columns)

#### Join Features

In [None]:
X = pd.merge(X_non_categorical, X_categorical_transformed_df, left_index=True, right_index=True)

# **Model train - Research**

#### **Data balancing**

In [None]:
class_weights_list = list(compute_class_weight(class_weight='balanced', classes=np.unique(y), y=y))
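
For reference, the `'balanced'` heuristic weights each class by `n_samples / (n_classes * class_count)`. The sketch below checks this on made-up labels; `labels`, `weights` and `manual` are illustrative names only.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: six negatives and four positives.
labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
classes = np.unique(labels)

weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)

# Same quantity computed by hand: n_samples / (n_classes * count per class).
manual = len(labels) / (len(classes) * np.bincount(labels))

print(dict(zip(classes.tolist(), weights.tolist())))
```

The rarer class receives the larger weight. Passing `class_weight='balanced'` to an estimator, as several models below do, applies the same formula internally.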

#### **Split the Data**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state =0)

### **Test approaches**

#### **Define evaluation metrics function**

In [None]:
def print_evaluation_metrics(model):
    print(f"for model {str(model.__class__).split('.')[-1][:-2]}:")
    y_train_pred = model.predict(X_train)
    print("Train set Metrics:")
    print(classification_report(y_train, y_train_pred))
    y_test_pred = model.predict(X_test)
    print("Test set Metrics:")
    print(classification_report(y_test, y_test_pred))

#### **Decision Tree**

In [None]:
decision_tree_model = DecisionTreeClassifier(max_depth=4, class_weight='balanced')
decision_tree_model.fit(X_train, y_train)
print_evaluation_metrics(decision_tree_model)

#### **Logistic Regression**

In [None]:
logistic_regression_model = LogisticRegression(class_weight='balanced')
logistic_regression_model.fit(X_train, y_train)
print_evaluation_metrics(logistic_regression_model)

#### **KNN Classifier**

In [None]:
knn_classifier_model = KNeighborsClassifier(n_neighbors=5)
knn_classifier_model.fit(X_train, y_train)
print_evaluation_metrics(knn_classifier_model)

#### **SVM (SVC) Classifier**

In [None]:
svc_classifier_model = SVC(class_weight='balanced')
svc_classifier_model.fit(X_train, y_train)
print_evaluation_metrics(svc_classifier_model)

#### **Random Forest Classifier**

In [None]:
random_forest_model = RandomForestClassifier(n_estimators=300, max_depth=4, class_weight=None)
random_forest_model.fit(X_train, y_train)
print_evaluation_metrics(random_forest_model)

#### **Xgboost Classifier**

In [None]:
param_dist = {'objective':'binary:logistic', 'n_estimators':380, 'max_depth':3}

clf = XGBClassifier(**param_dist)

clf.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_test, y_test)],
        eval_metric='logloss',
        verbose=False)

evals_result = clf.evals_result()
print_evaluation_metrics(clf)

# **Production - Train a model**

## **Pre-process**

### **Pre-process: Read data and split to train, test, evaluation data sets**

##### **Read Data file**

In [None]:
df_train = pd.read_csv(r'train.csv', encoding="UTF-8", index_col="PassengerId")

##### **Target Variable Definition**

In [None]:
target_variable = "Survived"

##### **Split to target & features dfs**

In [None]:
X = df_train.drop(target_variable, axis=1)
y = df_train[target_variable]

##### **Split to train, evaluation, test**

In [None]:
X_train_temp, X_test, y_train_temp, y_test = train_test_split(X, y, test_size=0.1, random_state =0, stratify=y)

In [None]:
X_train, X_evaluation, y_train, y_evaluation = train_test_split(X_train_temp, y_train_temp, test_size=0.1, random_state =0, stratify=y_train_temp)

### **Feature Extraction**

#### **Handle Nulls Transformers**

##### Handle Nulls and 0s in age (and create title feature)

In [None]:
title_dict = {  
                "Mr" : "Mr",
                "Miss" : "Miss",
                "Mrs" : "Mrs",
                "Master" : "Master",
                "Dr" : "Mr",
                "Rev" : "Other",
                "Col" : "Other",
                "Major" : "Mr",
                "Mlle" : "Miss",
                "Jonkheer" : "Other",
                "Lady" : "Mrs",
                "Capt" : "Mr",
                "Sir" : "Mr",
                "Mme" : "Miss",
                "Countess" : "Mrs",
                "Ms" : "Miss",
                "Don" : "Mr"
                }

In [None]:
class TitleCreatorTransformer(TransformerMixin):
    def __init__(self, title_dict):
        # Captures the word right before the first period, e.g. "Mr" in "Braund, Mr. Owen Harris"
        self.regex_pattern = r'([a-zA-Z]*)\.(.*)'
        self.name_column = 'Name'
        self.title_dict = title_dict
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X['Passanger Title'] = X[self.name_column].apply(self.extract_title_from_name)
        # Use the instance's mapping rather than the module-level title_dict
        X['Passanger Title'] = np.where(X['Passanger Title'].isin(self.title_dict), X['Passanger Title'].map(self.title_dict), "Other")
        return X

    def extract_title_from_name(self, string):
        title = re.search(self.regex_pattern, string)
        if not title:
            return 'No Title'
        return title.group(1)

In [None]:
class FillAgeTransformer(TransformerMixin):
    def __init__(self):
        self.age_dict = {}

    def fit(self, X, y=None):
        self.age_dict = X.groupby(['Passanger Title'])['Age'].mean().to_dict()
        return self
    
    def transform(self, X):
        X.loc[(X['Age'].isnull()) | (X['Age']==0),'Age'] = X['Passanger Title']\
        .map(self.age_dict)   
        return X

##### Handle Nulls and 0s in Embarked

In [None]:
class FillNullsWithMaxIdTransformer(TransformerMixin):
    def __init__(self, column):
        self.column = column
        self.max_id = ''

    def fit(self, X, y=None):
        self.max_id = X[self.column].value_counts().idxmax()
        return self
    
    def transform(self, X):
        X.loc[X[self.column].isnull(), self.column] = str(self.max_id)
        return X

##### Handle nulls in Cabin

In [None]:
class ExtractCabinTransformer(TransformerMixin):
    def __init__(self, input_column):
        self.regex_pattern = r'^([A-Z])(\d*)$'  # raw string: one letter, optional digits
        self.input_column = input_column
        self.output_column = f'{self.input_column} Feature'
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X[self.output_column] = X[self.input_column].fillna('Not Mentioned') \
        .apply(self.extract_cabin_class_from_cabin)
        return X

    def extract_cabin_class_from_cabin(self, string):
        cabin_class = re.search(self.regex_pattern, string)
        if not cabin_class:
            return 'Not Mentioned'
        return cabin_class.group(1)

#### **Feature Extraction Transformers**

##### Fare Per Ticket

In [None]:
def calculate_fare_per_person(df, id_column='Ticket', groupby_key='Ticket', fare='Fare'):
    # Count the passengers sharing each ticket within this frame, then split the fare among them
    num_member_dict = df.groupby([groupby_key])[id_column].count().to_dict()
    df.loc[:,'num_of_members'] = df[groupby_key].map(num_member_dict)
    df['fare_per_ticket'] = df[fare] / df['num_of_members']
    return df

##### Fare Per Ticket Buckets

In [None]:
class FareBucketsTransformer(TransformerMixin):
    def __init__(self):
        self.fare_buckets = []
        
    def fit(self, X, y=None):
        self.fare_buckets = pd.qcut(X['fare_per_ticket'],4).unique().to_list()
        self.fare_buckets = sorted(self.fare_buckets, key=lambda fare_bucket: fare_bucket.left)
        return self
    
    def transform(self, X):
        for index, bounds in enumerate(self.fare_buckets):
            lower_bound, upper_bound = bounds.left, bounds.right
            if bounds == self.fare_buckets[-1]:
                X.loc[X.fare_per_ticket>lower_bound, "fare_buckets"] = index
            else:
                X.loc[((X.fare_per_ticket > lower_bound) & (X.fare_per_ticket <= upper_bound)), "fare_buckets"] = index
        #X.loc[X["fare_buckets"].isnull(), "fare_buckets"] = -1
        return X

##### Age Buckets

In [None]:
class AgeBucketsTransformer(TransformerMixin):
    def __init__(self, age_buckets_dict):
        self.age_buckets_dict = age_buckets_dict

    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        for bounds, value in self.age_buckets_dict.items():
            lower_bound, upper_bound = bounds
            X.loc[((X.Age > lower_bound) & (X.Age <= upper_bound)), "age_buckets"] = value
        X.loc[X.age_buckets.isnull(), "age_buckets"] = -1
        return X

In [None]:
age_buckets_dict = {
                    (0, 18) : 0,
                    (18, 29) : 1,
                    (29, 40) : 2,
                    (40, 55) : 3,
                    (55,  math.inf): 4
                    }

##### Gender

In [None]:
class GenderBinarizerTransformer(TransformerMixin):
    def __init__(self, column):
        self.column = column
        self.label_binarizer = LabelBinarizer()
        
    def fit(self, X, y=None):
        self.label_binarizer.fit(X[self.column])
        return self
    
    def transform(self, X):
        X['is_male?'] = self.label_binarizer.transform(X[self.column])
        return X

##### number of family members onboard

In [None]:
def sum_two_columns_df(df, output_col='family_onboard', column1='Parch', column2='SibSp'):
    df.loc[:, output_col] = df[column1] + df[column2]
    return df

In [None]:
family_onboard_dict = {
                    (0, 0) : 0,
                    (1, 1) : 1,
                    (2, 2) : 2,
                    (3,  math.inf) : 3
                    }

In [None]:
class FamilyMembersTransformer(TransformerMixin):
    def __init__(self, family_onboard_dict):
        self.family_onboard_dict = family_onboard_dict

    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        for bounds, value in self.family_onboard_dict.items():
            lower_bound, upper_bound = bounds
            X.loc[((X.family_onboard >= lower_bound) & (X.family_onboard <= upper_bound)), "family_onboard_buckets"] = value
        return X

##### is_parent_with_3_childrens_or_more Feature

In [None]:
def is_parent_with_3_childrens_or_more(df):
    df["is_parent_with_3_childrens_or_more"] = np.where(df.Parch>2, 1, 0)
    return df

##### Drop unnecessary columns

In [None]:
def drop_columns(df):    
    df_column_list = ['Name', 'Cabin', 'Ticket', 'Sex', 'Age', 'fare_buckets', 'fare_per_ticket', 'family_onboard', 'SibSp', 'Parch']#, 'Fare'
    df = df.drop(df_column_list, axis=1)
    return df

##### Handle Categorical Features

In [None]:
class CategoricalFeaturesTransformer(TransformerMixin):
    def __init__(self, categorical_columns):
        self.categorical_columns = categorical_columns
        self.encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
        
    def fit(self, X, y=None):
        self.encoder.fit(X[self.categorical_columns])
        return self
    
    def transform(self, X):
        X_categorical = X[self.categorical_columns]
        non_categorical_columns = [column for column in X.columns if column not in self.categorical_columns]
        X_non_categorical = X[non_categorical_columns]
        cat_features_names = list(self.encoder.get_feature_names(X_categorical.columns))
        X_categorical_transformed = self.encoder.transform(X_categorical)
        X_categorical_transformed_df = pd.DataFrame(X_categorical_transformed, columns=cat_features_names, index=X_categorical.index)
        X = pd.merge(X_non_categorical, X_categorical_transformed_df, left_index=True, right_index=True)
        return X

In [None]:
categorical_columns = ['Embarked', 'Passanger Title', 'Cabin Feature']

## **Pipeline**

#### **Create data pipeline**

In [None]:
titanic_steps = [
               ('title_creator', TitleCreatorTransformer(title_dict)),
               ('fill_blanks_age', FillAgeTransformer()),
               ('fill_blanks_embarked', FillNullsWithMaxIdTransformer('Embarked')),
               ('extract_cabin', ExtractCabinTransformer('Cabin')),
               ('fare_per_person', FunctionTransformer(calculate_fare_per_person)),
               ('fare_buckets', FareBucketsTransformer()),
               ('age_buckets', AgeBucketsTransformer(age_buckets_dict)),
               ('gender_binarizer', GenderBinarizerTransformer('Sex')),
               ('family_onboard', FunctionTransformer(sum_two_columns_df)),
               ('family_onboard_buckets', FamilyMembersTransformer(family_onboard_dict)),
               ('parent_w_3_or_more', FunctionTransformer(is_parent_with_3_childrens_or_more)),
               ('drop_columns', FunctionTransformer(drop_columns)),
               ('onehotencoder', CategoricalFeaturesTransformer(categorical_columns))#,
               #('XGBoost_Classifier', XGBClassifier(**param_dist))
               ]

titanic_pipeline = Pipeline(steps=titanic_steps)

#### **Fit pipeline**

In [None]:
X_train = titanic_pipeline.fit_transform(X_train, y_train)

#### **Transform pipeline**

In [None]:
X_evaluation = titanic_pipeline.transform(X_evaluation)
X_test = titanic_pipeline.transform(X_test)
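
The point of fitting on the training split and only transforming the other splits is that learned state (mean ages, the most frequent port, bucket edges) comes from training data alone. The sketch below shows the pattern with a hypothetical `MostFrequentFiller` transformer on made-up frames. Unlike the transformers above, it copies its input instead of modifying it in place, which is a design choice of this sketch.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

class MostFrequentFiller(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: fill nulls with the most frequent training value."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        # State is learned here, from the data passed to fit only.
        self.most_frequent_ = X[self.column].value_counts().idxmax()
        return self

    def transform(self, X):
        X = X.copy()  # leave the caller's frame untouched
        X[self.column] = X[self.column].fillna(self.most_frequent_)
        return X

# Made-up frames standing in for a training split and a later split.
train_demo = pd.DataFrame({"Embarked": ["S", "S", "C", None]})
new_demo = pd.DataFrame({"Embarked": [None, "Q", "Q"]})

demo_pipeline = Pipeline(steps=[("fill_embarked", MostFrequentFiller("Embarked"))])
train_filled = demo_pipeline.fit_transform(train_demo)
new_filled = demo_pipeline.transform(new_demo)
print(new_filled["Embarked"].tolist())
```

The null in `new_demo` is filled with "S", the most frequent value in the training frame, even though "Q" is more frequent in the new frame.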

#### **Modeling**

##### **Xgboost Classifier**

###### **Xgboost Classifier - Randomized Search**

In [None]:
params = {
        'min_child_weight': [1, 5, 10],
        'gamma': [0.5, 1, 1.5, 2, 5],
        'subsample': [0.6, 0.8, 1.0],
        'colsample_bytree': [0.6, 0.8, 1.0],
        'max_depth': [3], #, 4, 5],
        'n_estimators': [200] #, 300, 400]
        }

In [None]:
xgb = XGBClassifier()

skf = StratifiedKFold(n_splits=3, shuffle = True, random_state = 0)

random_search = RandomizedSearchCV(xgb, param_distributions=params, n_iter=200,
                                   scoring='accuracy', n_jobs=4, 
                                   cv=skf.split(X_train,y_train), verbose=3, random_state=0)
start_time = datetime.now()
random_search.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_evaluation, y_evaluation)],  # keep the test set out of tuning
        eval_metric='logloss',
        verbose=False)
print(f'\n Time taken: {(datetime.now()-start_time).total_seconds():.1f} seconds.')
xgb_model = random_search.best_estimator_

In [None]:
random_search.best_params_
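
It helps to know how large the search space is before choosing `n_iter`. The sketch below counts the combinations in the same parameter lists using `ParameterGrid`. The name `search_space` is illustrative, and the values are copied from the cell above.

```python
from sklearn.model_selection import ParameterGrid

# Same candidate values as the search above.
search_space = {
    'min_child_weight': [1, 5, 10],
    'gamma': [0.5, 1, 1.5, 2, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'max_depth': [3],
    'n_estimators': [200],
}

n_combinations = len(ParameterGrid(search_space))
print(n_combinations)
```

With 135 distinct combinations and every parameter given as a list, a request for `n_iter=200` cannot try more than those 135 candidates, so the search is effectively exhaustive here.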

###### **Xgboost Classifier - fit**

In [None]:
param_dist = {'objective':'binary:logistic', 'colsample_bytree': 0.6,
 'gamma': 1.5, 'max_depth': 3, 'min_child_weight': 1, 'n_estimators': 200, 
 'subsample': 0.8}

xgb_model = XGBClassifier(**param_dist)

xgb_model.fit(X_train, y_train,
        eval_set=[(X_train, y_train), (X_evaluation, y_evaluation)],
        eval_metric='logloss',
        verbose=False)

evals_result = xgb_model.evals_result()
print_evaluation_metrics(xgb_model)

## **Evaluation**

#### **Predict**

In [None]:
y_train_pred = xgb_model.predict(X_train)
y_evaluation_pred = xgb_model.predict(X_evaluation)
y_test_pred = xgb_model.predict(X_test)

#### **Define evaluation metrics function**

In [None]:
def print_evaluation_metrics(y_actual, y_pred, dataset):
    print(f"{dataset} set Metrics:")
    print(classification_report(y_actual, y_pred))

#### **Get evaluation metrics**

In [None]:
datasets_dict = {'train': (y_train, y_train_pred), 'evaluation': (y_evaluation, y_evaluation_pred), 'test': (y_test, y_test_pred)}

In [None]:
for dataset, (y_actual, y_pred) in datasets_dict.items():
    print_evaluation_metrics(y_actual, y_pred, dataset)

## **Model Performance Visualization**

#### **ROC Curve**

In [None]:
ns_probs = [0 for _ in range(len(y_train))]
lr_probs = xgb_model.predict_proba(X_train)
lr_probs = lr_probs[:, 1]
ns_auc = roc_auc_score(y_train, ns_probs)
lr_auc = roc_auc_score(y_train, lr_probs)
print('No Skill: ROC AUC=%.3f' % (ns_auc))
print('Xgboost: ROC AUC=%.3f' % (lr_auc))
ns_fpr, ns_tpr, _ = roc_curve(y_train, ns_probs)
lr_fpr, lr_tpr, _ = roc_curve(y_train, lr_probs)
plt.plot(ns_fpr, ns_tpr, linestyle='--', label='No Skill')
plt.plot(lr_fpr, lr_tpr, marker='.', label='Xgboost')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()
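
The curve above is computed on the training set, which tends to look optimistic. The sketch below computes the same quantities on a held-out split instead. It uses synthetic data and a `LogisticRegression` as a stand-in for the XGBoost model, so every name in it is an assumption made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the Titanic features.
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
X_fit, X_held_out, y_fit, y_held_out = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# Stand-in classifier; any model with predict_proba would do.
stand_in = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
held_out_probs = stand_in.predict_proba(X_held_out)[:, 1]

# A constant score cannot rank positives above negatives, so its AUC is 0.5.
no_skill_auc = roc_auc_score(y_held_out, [0] * len(y_held_out))
held_out_auc = roc_auc_score(y_held_out, held_out_probs)
fpr, tpr, _ = roc_curve(y_held_out, held_out_probs)

print(f"No Skill: ROC AUC={no_skill_auc:.3f}")
print(f"Held-out: ROC AUC={held_out_auc:.3f}")
```

In the notebook, the same idea would mean passing `X_evaluation`/`y_evaluation` or `X_test`/`y_test` to `predict_proba` and `roc_curve` in place of the training split.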

#### **Feature Importance**

In [None]:
pd.Series(xgb_model.feature_importances_, index=X_train.columns.values).sort_values().plot.barh(figsize=(4, 10), rot=0)

## **Save Winning Model**

In [None]:
# Use a context manager so the file handle is closed after writing
with open("best_model.pickle", "wb") as model_file:
    pickle.dump(xgb_model, model_file)
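
Only the classifier is pickled here, while the production step further down also needs the fitted preprocessing pipeline. One option, sketched below under stated assumptions, is to persist preprocessing and model together and verify the round trip. The scaler, the logistic regression and the file name `bundle_demo.pickle` are stand-ins, not the notebook's objects.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data and a stand-in preprocessing + model bundle.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
bundle = Pipeline([("scale", StandardScaler()), ("model", LogisticRegression())])
bundle.fit(X_demo, y_demo)

# Write and read back inside context managers so the files are closed.
path = os.path.join(tempfile.mkdtemp(), "bundle_demo.pickle")
with open(path, "wb") as handle:
    pickle.dump(bundle, handle)
with open(path, "rb") as handle:
    restored = pickle.load(handle)

# The restored bundle should reproduce the original predictions.
same_predictions = bool((restored.predict(X_demo) == bundle.predict(X_demo)).all())
print(same_predictions)
```

Pickles are tied to the library versions that produced them, so the loading environment should match the training environment as closely as possible.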

# **Production - Apply prediction on new data**

In [None]:
df_test = pd.read_csv(r'test.csv', encoding="UTF-8", index_col="PassengerId")
df_test_transformed = titanic_pipeline.transform(df_test)
with open("best_model.pickle", "rb") as model_file:
    model = pickle.load(model_file)
df_test_transformed_scored = pd.DataFrame(model.predict(df_test_transformed), columns=["Survived?"], index=df_test.index)
df_test_final_scored = pd.merge(df_test, df_test_transformed_scored["Survived?"], how='left', left_index=True, right_index=True)
df_test_final_scored.to_csv(r'test_scored.csv', encoding="UTF-8")