# Classification Algorithms with Python

For this lecture we will work with the Titanic data set from Kaggle. It is a very famous data set and is often a student's first step in machine learning!

We'll be trying to predict a binary class: survived or deceased. Let's begin by implementing logistic regression in Python for classification.

We'll use a "semi-cleaned" version of the Titanic data set. If you use the data set hosted directly on Kaggle, you may need to do some additional cleaning not shown here.

# Import Libraries
Let's import some libraries to get started!

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# from IPython.core.interactiveshell import InteractiveShell  
# InteractiveShell.ast_node_interactivity = "all"

# # InteractiveShell.ast_node_interactivity = "last_expr"

In [None]:
# 2+3
# 5-6

# The Data
Let's start by reading in the titanic_train.csv file into a pandas dataframe.

In [None]:
train = pd.read_csv('titanic_train.csv')
train.head(5)

In [None]:
train.describe()[['Pclass', 'Age']]

In [None]:
# !pip install pandas-profiling
import pandas_profiling
pandas_profiling.ProfileReport(train)

In [None]:
profile = pandas_profiling.ProfileReport(train)
profile.to_file(output_file="Titanic data profiling.html")
train.tail()

# Exploratory Data Analysis

Let's begin some exploratory data analysis! We'll start by checking out missing data!

# Missing Data
We can use seaborn to create a simple heatmap to see where we are missing data!

In [None]:
train.info()

In [None]:
sns.heatmap(train.isnull(), yticklabels=False,\
            cbar=False,cmap='viridis')
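
The heatmap gives a visual impression of the gaps; a numeric count per column can complement it. The following is a minimal sketch on a small made-up frame (the `toy` data is hypothetical and only stands in for the Titanic file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})

# Count and percentage of missing values per column
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
})
print(missing)
```

Applied to `train`, the same two expressions would show which columns are worth imputing and which are mostly empty.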

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Survived',data=train,palette='RdBu_r')

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Sex',data=train, palette='RdBu_r')

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Pclass',data=train, palette='rainbow')

In [None]:
# distplot is deprecated in recent seaborn releases; histplot is its replacement
sns.histplot(train['Age'].dropna(), kde=True, color='darkred', bins=20)

In [None]:
train['Age'].hist(bins=20,color='darkred',alpha=0.5)

In [None]:
sns.countplot(x='SibSp',data=train)

In [None]:
train[train['Fare'] >200]['Survived'].value_counts()

In [None]:

train['Fare'].hist(color='green',bins=40,figsize=(8,4))

# Cufflinks for plots
Let's take a quick moment to show an example of cufflinks!

In [None]:
# !pip install cufflinks
import cufflinks as cf
cf.go_offline()
train['Fare'].iplot(kind='hist',bins=30,color='green')
plt.show()

# Data Cleaning

We want to fill in the missing ages instead of just dropping the rows where age is missing. One way to do this is to fill in the mean age of all passengers (imputation). However, we can be smarter about this and use the average age by passenger class. For example:

In [None]:
plt.figure(figsize=(12, 7))
sns.boxplot(x='Pclass',y='Age',data=train,palette='winter')

In [None]:
train[train['Pclass']==1]['Age'].mean()

In [None]:
train[train['Pclass']==1]['Age'].mean()
train[train['Pclass']==2]['Age'].mean()
train[train['Pclass']==3]['Age'].mean()

In [None]:
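def impute_age(cols):
    # Look up values by label; positional cols[0] on a labeled Series
    # is deprecated in recent pandas versions
    age = cols['Age']
    pclass = cols['Pclass']

    if pd.isnull(age):
        # Fill with the approximate mean age of the passenger's class
        if pclass == 1:
            return 38.23
        elif pclass == 2:
            return 29.87
        else:
            return 25.14
    else:
        return age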

Now apply that function!

In [None]:
train.info()

In [None]:
train['Age'] = train[['Age','Pclass']].apply(impute_age,axis=1)

In [None]:
age_mean = train.groupby('Pclass')['Age'].mean()
age_mean
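
Hard-coded class means have to be updated by hand whenever the data changes. One alternative, sketched here on hypothetical toy data rather than the Titanic file, lets pandas compute the per-class mean and fill the gaps in one step with `groupby(...).transform('mean')`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two classes, one missing age in each
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 36.0, np.nan, 20.0, np.nan, 30.0],
})

# Mean age of each row's class, aligned to the original index
# (missing ages are ignored when the mean is computed)
class_mean = toy.groupby("Pclass")["Age"].transform("mean")

# Fill only the missing ages with the mean of their class
toy["Age"] = toy["Age"].fillna(class_mean)
print(toy)
```

This is not the approach used in the notebook, just a shorter variant of the same idea.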

In [None]:
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis')

In [None]:
train.drop(['Name', 'Ticket', 'Cabin'], inplace=True, axis=1)
train['Embarked'].value_counts()

In [None]:
train.dropna(inplace=True)

# Converting Categorical Features

We'll need to convert categorical features to dummy variables using pandas! Otherwise our machine learning algorithm won't be able to directly take in those features as inputs.

In [None]:
train['Embarked'].value_counts()

In [None]:
#sex = pd.get_dummies(train['Sex'],drop_first=False)
embark = pd.get_dummies(train['Embarked'],drop_first=True)
embark

In [None]:
embark.shape
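
To make the effect of `drop_first=True` concrete, here is a small sketch on a made-up column (the `toy` frame is an assumption for illustration). With three categories, two dummy columns are enough, because the dropped category is implied when both are zero:

```python
import pandas as pd

# Hypothetical toy column with the three embarkation codes
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Categories are ordered C, Q, S; drop_first removes the first one (C)
dummies = pd.get_dummies(toy["Embarked"], drop_first=True)
print(list(dummies.columns))

# Replace the original text column with the dummy columns
toy = pd.concat([toy.drop("Embarked", axis=1), dummies], axis=1)
print(toy)
```

A row with zeros in both `Q` and `S` represents `C`, so no information is lost.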

In [None]:
from sklearn.preprocessing import LabelEncoder
LE = LabelEncoder()
train['Sex'] = LE.fit_transform(train['Sex'])

In [None]:
train['Sex']

In [None]:
cities = ["paris", "Paris", "tokyo", "amsterdam", 'paris', 'tokyo']
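# Lower-cased copy of the list; the encoder below is still fit on the raw
# list, so 'paris' and 'Paris' are treated as different labels
cities_new = [city.lower() for city in cities]
cities_new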

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit_transform(cities)

In [None]:
list(le.classes_)

In [None]:
le.transform(["tokyo", "tokyo", "paris", 'Paris'])

In [None]:
list(le.inverse_transform([2, 2, 0, 1])) #to fetch the actual labels against 
# the given numeric label

In [None]:
train['Sex'].head()

In [None]:
train['Embarked'] = LE.fit_transform(train['Embarked'])
train.head()

# Building a Logistic Regression model
Let's start by splitting our data into a training set and test set (there is another test.csv file that you can play around with in case you want to use all this data for training).

# Train Test Split

In [None]:
from sklearn.model_selection import train_test_split
X = train.drop('Survived', axis=1)
Y = train['Survived']

In [None]:
X.shape

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,Y, \
                                                    test_size=0.30, 
                                                    random_state=101)

In [None]:
X_train.shape

In [None]:
X_test.shape

# Training and Predicting

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logmodel = LogisticRegression(solver='liblinear') # , class_weight='balanced'
logmodel

In [None]:
lm = logmodel.fit(X_train,y_train)
lm

In [None]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [None]:
lm.n_iter_

In [None]:
lm.classes_

In [None]:
lm.coef_

In [None]:
lm.intercept_
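
The coefficients and intercept are what turn a row of features into a probability. As a rough sketch on synthetic data (the names `X_toy`, `y_toy` and `clf` are made up for this example), the probability of class 1 for a binary model can be reproduced by applying the sigmoid function to the linear combination of the features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, not the Titanic set
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression(solver="liblinear").fit(X_toy, y_toy)

# Linear score z = X . w + b, then the sigmoid 1 / (1 + exp(-z))
z = X_toy @ clf.coef_.ravel() + clf.intercept_[0]
p_manual = 1.0 / (1.0 + np.exp(-z))

# Compare with the probability of class 1 reported by scikit-learn
p_sklearn = clf.predict_proba(X_toy)[:, 1]
print(np.allclose(p_manual, p_sklearn))
```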

In [None]:
predictions = logmodel.predict(X_test)
predictions

In [None]:
prob = logmodel.predict_proba(X_test)

In [None]:
prob_1 = prob[:,1]

In [None]:
pred = []
for probab in prob_1:
    if probab > 0.8:
        pred.append(1)
    else:
        pred.append(0)

In [None]:
pred
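
For a binary model, `predict` effectively uses a 0.5 cut-off on the class 1 probability, so raising the threshold to 0.8 yields fewer positive predictions. The loop above can also be written as a single vectorized comparison; a small sketch with made-up probabilities (`prob_demo` is hypothetical):

```python
import numpy as np

# Hypothetical class 1 probabilities
prob_demo = np.array([0.95, 0.81, 0.80, 0.40, 0.10])

# Loop-style thresholding, equivalent to the cell above
pred_loop = [1 if p > 0.8 else 0 for p in prob_demo]

# Vectorized form: compare the whole array, then cast booleans to integers
pred_vec = (prob_demo > 0.8).astype(int)
print(pred_vec.tolist())
```

Note that the comparison is strict, so a probability of exactly 0.8 is labeled 0.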

In [None]:
y_test[:5]

# Evaluation

We can check precision, recall, and f1-score using the classification report!

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test,predictions))
             

In [None]:
from sklearn.metrics import confusion_matrix
matrix = confusion_matrix(y_test,predictions)
print(matrix)
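
The numbers in the classification report follow directly from the four cells of the confusion matrix. A short worked sketch on made-up labels (`y_true` and `y_pred` are hypothetical, not the notebook's results):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(precision, recall, f1, accuracy)
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
```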

In [None]:
logmodel.score(X_test, y_test) #this is the accuracy score

In [None]:
predictions[:5]

In [None]:
type(predictions)

In [None]:
df = pd.DataFrame(y_test)
df['Predicted'] = predictions

In [None]:
df.tail()

In [None]:
logmodel.predict_proba(X_test[:5])

In [None]:
df1 = pd.DataFrame(logmodel.coef_, columns=X.columns)
df1

In [None]:
logmodel.intercept_

Not so bad! You might want to explore further feature engineering and the other titanic_test.csv file. Some suggestions for feature engineering:

- Try grabbing the title (Dr., Mr., Mrs., etc.) from the name as a feature
- Maybe the cabin letter could be a feature
- Is there any info you can get from the ticket?
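
As a starting point for the first two suggestions, here is a hedged sketch using pandas string methods. The sample values and the regular expression are assumptions about the "Surname, Title. Given names" format, and since the notebook drops `Name` and `Cabin` earlier, these features would have to be derived before that step:

```python
import pandas as pd

# A few names in the "Surname, Title. Given names" format
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
])

# Capture the text between the comma and the first period
titles = names.str.extract(r",\s*([^.]+)\.", expand=False)
print(titles.tolist())

# Hypothetical cabin values; the first character is the deck letter
cabins = pd.Series(["C85", None, "E46"])
deck = cabins.str[0]
print(deck.tolist())
```

Missing cabins stay missing after the slice, so they would still need a placeholder category before modelling.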

In [None]:
from sklearn.linear_model import SGDClassifier
SGD_clf = SGDClassifier()
SGD_clf.fit(X_train, y_train)  # default loss='hinge'

In [None]:
predictions = SGD_clf.predict(X_test)  # predict with the SGD model scored below
SGD_clf.score(X_test, y_test)

In [None]:
# Load libraries
# import pandas
# import numpy
# import matplotlib.pyplot as plt
# from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
# from sklearn.preprocessing import Imputer
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier

In [None]:
import warnings
warnings.simplefilter("ignore")
# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
models

In [None]:
from sklearn.model_selection import KFold, cross_val_score
results = []
names = []
n_splits = 5
for name, model in models:
    kfold = model_selection.KFold(n_splits=5, shuffle=True, \
                                  random_state=5)
    cv_results = model_selection.cross_val_score(model, X_train, \
                                                 y_train, cv=kfold, \
                                                 scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    msg = "%s: %5.2f (%5.2f)" % (name, cv_results.mean()*100, \
                           cv_results.std()*100)
    print(msg)

In [None]:
results

In [None]:
results_df = pd.DataFrame(results, index=names, \
                          columns='CV1 CV2 CV3 CV4 CV5'.split())
# results_df.drop(['CV Mean', 'CV Std Dev'], inplace=True, axis = 1)

In [None]:
results_df

In [None]:
results_df['CV Mean'] = results_df.iloc[:,0:n_splits].mean(axis=1)
results_df['CV Std Dev'] = results_df.iloc[:,0:n_splits].std(axis=1)

In [None]:
pd.set_option('display.precision', 2)  # full option name; the bare 'precision' alias fails in recent pandas
results_df*100

In [None]:
results_df.sort_values(by='CV Mean', ascending=False)*100

In [None]:
# InteractiveShell.ast_node_interactivity = "last_expr"

In [None]:
# %matplotlib inline
# Compare Algorithms
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

In [None]:
# Standardize the dataset
pipelines = []
pipelines.append(('ScaledLR', Pipeline([('Scaler', StandardScaler()),('LR', LogisticRegression())])))
pipelines.append(('ScaledLDA', Pipeline([('Scaler', StandardScaler()),('LDA', LinearDiscriminantAnalysis())])))
pipelines.append(('ScaledKNN', Pipeline([('Scaler', StandardScaler()),('KNN', KNeighborsClassifier())])))
pipelines.append(('ScaledCART', Pipeline([('Scaler', StandardScaler()),('CART', DecisionTreeClassifier())])))
pipelines.append(('ScaledNB', Pipeline([('Scaler', StandardScaler()),('NB', GaussianNB())])))
pipelines.append(('ScaledSVM', Pipeline([('Scaler', StandardScaler()),('SVM', SVC())])))
results = []
names = []
for name, model in pipelines:
    kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=5)
    cv_results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean()*100, cv_results.std()*100)
#     print(msg)

results_df = pd.DataFrame(results, index=names, \
                          columns='CV1 CV2 CV3 CV4 CV5'.split())
results_df['CV Mean'] = results_df.iloc[:,0:n_splits].mean(axis=1)
results_df['CV Std Dev'] = results_df.iloc[:,0:n_splits].std(axis=1)
results_df.sort_values(by='CV Mean', ascending=False)*100

In [None]:
# Compare Algorithms
fig = plt.figure()
fig.suptitle('Scaled Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

In [None]:
# Normalize the dataset
pipelines = []
pipelines.append(('ScaledLR', Pipeline([('Scaler', MinMaxScaler(feature_range=(0, 1))),('LR', LogisticRegression())])))
pipelines.append(('ScaledLDA', Pipeline([('Scaler', MinMaxScaler(feature_range=(0, 1))),('LDA', LinearDiscriminantAnalysis())])))
pipelines.append(('ScaledKNN', Pipeline([('Scaler', MinMaxScaler(feature_range=(0, 1))),('KNN', KNeighborsClassifier())])))
pipelines.append(('ScaledCART', Pipeline([('Scaler', MinMaxScaler(feature_range=(0, 1))),('CART', DecisionTreeClassifier())])))
pipelines.append(('ScaledNB', Pipeline([('Scaler', MinMaxScaler(feature_range=(0, 1))),('NB', GaussianNB())])))
pipelines.append(('ScaledSVM', Pipeline([('Scaler', MinMaxScaler(feature_range=(0, 1))),('SVM', SVC())])))
results = []
names = []
for name, model in pipelines:
    kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=5)
    cv_results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean()*100, cv_results.std()*100)
#     print(msg)

results_df = pd.DataFrame(results, index=names, \
                          columns='CV1 CV2 CV3 CV4 CV5'.split())
results_df['CV Mean'] = results_df.iloc[:,0:n_splits].mean(axis=1)
results_df['CV Std Dev'] = results_df.iloc[:,0:n_splits].std(axis=1)
results_df.sort_values(by='CV Mean', ascending=False)*100

In [None]:
# Compare Algorithms
fig = plt.figure()
fig.suptitle('Normalized Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

In [None]:
# ensembles
ensembles = []
ensembles.append(('ScaledAB', Pipeline([('Scaler', StandardScaler()),('AB', AdaBoostClassifier())])))
ensembles.append(('ScaledGBM', Pipeline([('Scaler', StandardScaler()),('GBM', GradientBoostingClassifier())])))  
ensembles.append(('ScaledRF', Pipeline([('Scaler', StandardScaler()),('RF', RandomForestClassifier())])))
ensembles.append(('ScaledET', Pipeline([('Scaler', StandardScaler()),('ET', ExtraTreesClassifier())])))
results = []
names = []
for name, model in ensembles:
    kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=5)
    cv_results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean()*100, cv_results.std()*100)
#     print(msg)
    
results_df = pd.DataFrame(results, index=names, \
                          columns='CV1 CV2 CV3 CV4 CV5'.split())
results_df['CV Mean'] = results_df.iloc[:,0:n_splits].mean(axis=1)
results_df['CV Std Dev'] = results_df.iloc[:,0:n_splits].std(axis=1)
results_df.sort_values(by='CV Mean', ascending=False)*100

In [None]:
# Compare Algorithms
fig = plt.figure()
fig.suptitle('Scaled Ensemble Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

In [None]:
# Tune scaled KNN
scaler = StandardScaler().fit(X_train)
rescaledX = scaler.transform(X_train)
neighbors = [1,3,5,7,9,11,13,15,17,19,21,25,30,35,40,50]
param_grid = dict(n_neighbors=neighbors)
model = KNeighborsClassifier(metric='euclidean')  # the first positional argument is n_neighbors, so pass metric by keyword

kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=5)
grid = GridSearchCV(estimator=model, param_grid=param_grid, \
                    scoring='accuracy', cv=kfold, )

grid_result = grid.fit(rescaledX, y_train)

In [None]:
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

In [None]:
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

In [None]:
model = KNeighborsClassifier(metric='euclidean', n_neighbors=9)

In [None]:
model.fit( rescaledX, y_train)
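
Because the tuned model is trained on scaled features, any data it is evaluated on must go through the same fitted scaler. A minimal sketch on synthetic data (all names here, such as `X_tr` and `scaler_demo`, are made up for the example):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the Titanic features
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1
)

# Fit the scaler on the training split only
scaler_demo = StandardScaler().fit(X_tr)

knn = KNeighborsClassifier(n_neighbors=9, metric="euclidean")
knn.fit(scaler_demo.transform(X_tr), y_tr)

# Reuse the same scaler (transform only, no refit) on the held-out split
acc = knn.score(scaler_demo.transform(X_te), y_te)
print(f"held-out accuracy: {acc:.3f}")
```

Fitting the scaler on the test split as well would leak information from the test data into the evaluation.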

In [None]:
c_values = [0.1, 0.3, 0.5, 0.7, 0.9, 1.0, 1.3, 1.5, 1.7, 2.0]
kernel_values = ['linear', 'poly', 'rbf', 'sigmoid']
degree_values = [2,3,4,5]
gamma_values =[0.1, 0.5, 1, 2]
param_grid = dict(C=c_values, kernel=kernel_values, degree=degree_values, \
                 gamma = gamma_values)
model = SVC()