# Kaggle Titanic
## Logistic Regression with Python


For this lecture we will be working with the [Titanic Data Set from Kaggle](https://www.kaggle.com/c/titanic). This is a very famous dataset.


# Step - 1 : Frame The Problem

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.



# Step - 2 : Obtain the Data

## Import Libraries

In [None]:
!pip install missingno

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as ms
%matplotlib inline

In [None]:
!ls -l

Pandas provides two core data types, Series and DataFrame, each with a rich set of built-in methods for handling data.

Pandas can read data from many sources through functions such as read_csv, read_excel and read_html. The data is loaded into a DataFrame.
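
As a minimal sketch of the reading step, the snippet below parses a small CSV held in memory instead of a file on disk. The three sample rows and the `csv_text` and `sample` names are made up for illustration; only the column names mimic the Titanic file.

```python
import io
import pandas as pd

# Hypothetical three-row sample; the column names mimic the Titanic file
csv_text = """PassengerId,Survived,Pclass,Age
1,0,3,22
2,1,1,38
3,1,3,
"""

# read_csv accepts a path, a URL, or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample))          # the whole table is a DataFrame
print(type(sample['Age']))   # a single column is a Series
print(sample.dtypes)         # Age is stored as float because one value is missing
```

The empty Age field in the last row is read as NaN, which is how missing values will appear in the real data as well.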

In [None]:
!wget https://www.dropbox.com/s/8grgwn4b6y25frw/titanic.csv

In [None]:
!ls -l

In [None]:
data = pd.read_csv('titanic.csv')

In [None]:
data.head()

In [None]:
# Show the last 3 rows of the data
data.tail(3)

In [None]:
type(data)

In [None]:
data.shape

In [None]:
data.info()

In [None]:
a = data['Age']

In [None]:
a

In [None]:
type(a)

In [None]:
type(data)

In [None]:
data.info()

In [None]:
data.describe()

## Accessing individual data in the data frame

### Working with Columns

Each DataFrame is a collection of Series, so accessing a single column returns a Series object.
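
The difference between selecting one column and selecting a list of columns can be sketched on a toy frame. The `toy` DataFrame and its values are hypothetical and only borrow two column names from the dataset.

```python
import pandas as pd

# Hypothetical toy frame, not the Titanic data
toy = pd.DataFrame({'Cabin': ['C85', None, 'E46'], 'Parch': [0, 1, 0]})

one_col = toy['Cabin']               # single brackets with a name -> Series
two_cols = toy[['Cabin', 'Parch']]   # a list of names -> DataFrame
still_frame = toy[['Cabin']]         # a one-item list still gives a DataFrame

print(type(one_col).__name__)
print(type(two_cols).__name__)
print(type(still_frame).__name__)
```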

In [None]:
data['Cabin'].value_counts()

In [None]:
y = data[['Cabin','Parch']].head()

In [None]:
type(y)

In [None]:
y.head()

In [None]:
data.info()  # No columns have been added or removed so far

# Step - 3 : Analyse the Data

In [None]:
ms.matrix(data)

In [None]:
data.info()

We can observe that there are missing values in 'Age', 'Cabin' and 'Embarked'. Let's continue.
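
The same observation can be made numerically. The sketch below counts missing cells per column on a hypothetical toy frame; the values are invented, and on the real data the same two calls would be applied to `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in the same three columns
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [None, 'C85', None, None],
    'Embarked': ['S', 'C', None, 'S'],
    'Fare': [7.25, 71.28, 7.92, 8.05],
})

missing_count = toy.isnull().sum()          # missing cells per column
missing_share = toy.isnull().mean() * 100   # the same, as a percentage

summary = pd.DataFrame({'missing': missing_count, 'percent': missing_share})
print(summary[summary['missing'] > 0])      # show only columns that have gaps
```

A column with a small share of missing values is a candidate for imputation, while a column that is mostly empty is often easier to drop.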

In [None]:
data.info()

## Visualization of data with Seaborn

In [None]:
sns.jointplot(x='Fare',y='Age',data=data)

In [None]:
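# distplot is deprecated in recent seaborn versions; histplot with kde=True is its replacement
sns.histplot(data['Fare'], kde=True)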

In [None]:
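# numeric_only=True skips text columns such as Name, which recent pandas versions refuse to correlate
sns.heatmap(data.corr(numeric_only=True),cmap='coolwarm')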
plt.title('data.corr()')

In [None]:
sns.swarmplot(x='Pclass',y='Age',data=data,palette='Set2')

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Survived',data=data,palette='RdBu_r')

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Sex',data=data,palette='RdBu_r')

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Survived',hue='Pclass',data = data,palette='rainbow')

In [None]:
data['Age'].hist(bins = 30, color = 'darkred', alpha = 0.8)

In [None]:
sns.countplot(x = 'SibSp', data = data)

In [None]:
data['Fare'].hist(color = 'green', bins = 40, figsize = (8,3))

#### What do you observe from the above charts?

# Step - 4 : Feature Engineering

## Feature Engineering

We want to fill the missing values in the Age column with the median age of each passenger class. This is called data imputation.

In [None]:
data.info()

In [None]:
plt.figure(figsize=(12, 7))
sns.boxplot(x='Pclass',y='Age',data=data,palette='winter')

In [None]:
data['Cabin'].value_counts()

Reading the box plot above, the median age for each class is approximately:
  
  * For **Class 1** - The median age is 37
  * For **Class 2** - The median age is 29
  * For **Class 3** - The median age is 24
  
Let's impute these values into the age column.

In [None]:
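def impute_age(cols):
    # cols is one row holding the 'Age' and 'Pclass' values;
    # label-based access avoids the deprecated positional lookup cols[0]
    Age = cols['Age']
    Pclass = cols['Pclass']
    
    if pd.isnull(Age):
        # Class-1
        if Pclass == 1:
            return 37
        # Class-2 
        elif Pclass == 2:
            return 29
        # Class-3
        else:
            return 24

    else:
        return Age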

Applying the function.

In [None]:
data['Age'] = data[['Age','Pclass']].apply(impute_age,axis=1)
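
As an alternative to hard-coding the three medians, the per-class median can be computed from the data itself. This is a sketch of a different technique (groupby with transform), not the method used in this notebook; the `toy` frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the Titanic data
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Age': [40.0, 36.0, np.nan, 20.0, np.nan, 30.0],
})

# Median age of each class, broadcast back to every row of that class
class_median = toy.groupby('Pclass')['Age'].transform('median')

# Keep known ages; use the class median only where Age is missing
toy['Age'] = toy['Age'].fillna(class_median)
print(toy)
```

Computing the medians this way keeps the imputation in step with the data if rows are added or filtered, at the cost of being slightly less explicit than the function above.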

Now let's visualize the missing values.

In [None]:
ms.matrix(data)

The Age column is imputed successfully.

Let's drop the Cabin column and the rows where Embarked is NaN.

In [None]:
data.drop('Cabin', axis = 1,inplace=True)

In [None]:
data.head()

In [None]:
data.dropna(inplace = True)

In [None]:
ms.matrix(data)

In [None]:
data.info()

## Converting Categorical Features 

We'll need to convert categorical features to dummy variables using pandas! Otherwise our machine learning algorithm won't be able to directly take in those features as inputs.

In [None]:
data['Embarked'].value_counts()

In [None]:
sex = pd.get_dummies(data['Sex'],drop_first=True)
embark = pd.get_dummies(data['Embarked'],drop_first=True)
sex.head()


In [None]:
embark.head()
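
To see what `drop_first` changes, the sketch below encodes a hypothetical four-value column with and without it. The `ports` Series is made up for illustration.

```python
import pandas as pd

# Hypothetical toy column with the three embarkation codes
ports = pd.Series(['S', 'C', 'Q', 'S'], name='Embarked')

full = pd.get_dummies(ports)                      # one column per category
reduced = pd.get_dummies(ports, drop_first=True)  # first category (C) is dropped

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```

With the first category dropped, a row with no flag set in the remaining columns stands for C. Dropping one column removes the exact linear dependence between the dummy columns, which matters for linear models such as logistic regression.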

In [None]:
old_data = data.copy()
data.drop(['Sex','Embarked','Name','Ticket'],axis=1,inplace=True)
data.head()



In [None]:
old_data.info()

In [None]:
data = pd.concat([data,sex,embark],axis=1)

In [None]:
data.head()

In [None]:
data.info()

In [None]:
data.describe()

# Step - 5 : Model Selection

## Building a Logistic Regression model

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data.drop('Survived',axis=1), 
                                                    data['Survived'], test_size=0.30, 
                                                    random_state=101)

In [None]:
len(y_test)

In [None]:
267/889  # share of rows held out for testing, close to the requested 0.30
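
The split sizes can be checked on stand-in data with the same number of rows. In this sketch the arrays are synthetic, and the `stratify=y` argument is an addition that the notebook does not use; it keeps the class balance similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 889 rows to match the row count after cleaning
X = np.arange(889 * 2).reshape(889, 2)
y = np.arange(889) % 2

# stratify=y is an assumption added here, not part of the original split
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=101, stratify=y
)

print(len(X_tr), len(X_te))     # sizes of the training and test parts
print(len(X_te) / len(X))       # realised test fraction
print(y_tr.mean(), y_te.mean()) # class balance in each part
```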

In [None]:
from sklearn.linear_model import LogisticRegression

# Build the Model.
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)

In [None]:
logmodel.coef_

In [None]:
predict =  logmodel.predict(X_test)
predict[:5]

In [None]:
y_test[:5]

Let's move on to evaluate our model.
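
One common way to evaluate a classifier is a confusion matrix together with per-class precision, recall and F1. The sketch below runs on synthetic data from `make_classification` as a stand-in for the Titanic features, and `max_iter=1000` is an assumption added to avoid convergence warnings; on the notebook's data the same two metric calls would take `y_test` and `predict`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Titanic features (an assumption for illustration)
X, y = make_classification(n_samples=500, n_features=6, random_state=101)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=101)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_pred)
print(cm)

# Precision, recall and F1 per class, plus overall accuracy
print(classification_report(y_te, y_pred))
```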

# Step - 5 A : Multiple Models


In [None]:
from sklearn import svm, tree, linear_model

Algols = [
   
            linear_model.LogisticRegression(),
    
            linear_model.LogisticRegressionCV(),

            svm.LinearSVC(),

            tree.DecisionTreeClassifier(),

          ]


In [None]:
from sklearn.metrics import confusion_matrix, precision_score, recall_score,accuracy_score, f1_score

Algols_columns = []
Algols_compare = pd.DataFrame(columns = Algols_columns)


row_index = 0

#Run once for each algorithm and collect the summary
for alg in Algols:
    
    predicted = alg.fit(X_train, y_train).predict(X_test)
    Algo_name = alg.__class__.__name__
    Algols_compare.loc[row_index,'Name'] = Algo_name
    Algols_compare.loc[row_index, 'Accuracy'] = accuracy_score(y_test, predicted)
    Algols_compare.loc[row_index, 'Precision'] = precision_score(y_test, predicted)
    Algols_compare.loc[row_index, 'Recall'] = recall_score(y_test, predicted)
    Algols_compare.loc[row_index, 'F1'] = f1_score(y_test, predicted)


    row_index+=1
    

    
#sort by Accuracy     
Algols_compare.sort_values(by = ['Accuracy'], ascending = False, inplace = True)    

#display output
Algols_compare
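
The comparison table can also be built by collecting one dictionary per model and creating the DataFrame once at the end. This is a sketch of that alternative on synthetic data; the `models`, `rows` and `compare` names are hypothetical, and only two of the classifiers are included to keep it short.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration)
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

models = [LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=0)]

rows = []
for model in models:
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        'Name': type(model).__name__,
        'Accuracy': accuracy_score(y_te, pred),
        'Precision': precision_score(y_te, pred),
        'Recall': recall_score(y_te, pred),
        'F1': f1_score(y_te, pred),
    })

# Building the frame once keeps the metric columns numeric
compare = pd.DataFrame(rows).sort_values(by='Accuracy', ascending=False)
print(compare)
```

Creating the frame in one step avoids growing it row by row with `.loc`, which tends to produce object-typed columns.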

In [None]:
from sklearn import model_selection

In [None]:
grid_search_models = [
                        [ linear_model.LogisticRegression(),     {
                                                                    'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
                                                                 } ],

                          [ linear_model.LogisticRegressionCV(), {
                                                                    'solver' : ['newton-cg', 'lbfgs',],
                                                                    'max_iter' : [10,30,50], 

                                                                 } ],

                          [ svm.LinearSVC(),                     {
                                                                    'C': [ 0.01, 0.1, 1, 10],
                                                                    'max_iter' : [10,30,50],
                              
                                                                 } ],

                          [ tree.DecisionTreeClassifier(),       {
                                                                    'max_depth' :  [1,2,3,4],
                                                                    'max_features' : ['sqrt', 'log2',None],
                                                                 } ],
                                  
                    ]

In [None]:
for model, params in grid_search_models:
    tune_model = model_selection.GridSearchCV(model, param_grid=params, scoring='roc_auc', verbose=1)
    tune_model.fit(X_train, y_train)
    print(tune_model.best_params_)
    print(model.__class__.__name__, tune_model.score(X_test, y_test))
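
A fitted grid search exposes two different scores, which are easy to confuse. The sketch below shows both on synthetic stand-in data with a single decision tree; the data and the `search` name are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration)
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={'max_depth': [1, 2, 3, 4]},
    scoring='roc_auc',
    cv=5,
)
search.fit(X_tr, y_tr)

print(search.best_params_)        # best combination found by cross-validation
print(search.best_score_)         # mean cross-validated ROC AUC for that combination
print(search.score(X_te, y_te))   # ROC AUC of the refitted best model on held-out data
```

`best_score_` is computed only from folds of the training data, while `score` on the test set measures how the selected model does on rows it has never seen.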

# Step - 5 B: Many More Models


In [None]:
from sklearn import svm,model_selection, tree, linear_model, neighbors 
from sklearn import naive_bayes, ensemble, discriminant_analysis, gaussian_process

ManyMoreAlgols = [
   
                   #Ensemble Methods
                    ensemble.AdaBoostClassifier(),
                    ensemble.BaggingClassifier(),
                    ensemble.ExtraTreesClassifier(),
                    ensemble.GradientBoostingClassifier(),
                    ensemble.RandomForestClassifier(),

                    #Gaussian Processes
                    gaussian_process.GaussianProcessClassifier(),

                    #GLM
                    linear_model.LogisticRegressionCV(),
                    linear_model.PassiveAggressiveClassifier(),
                    linear_model.RidgeClassifierCV(),
                    linear_model.SGDClassifier(),
                    linear_model.Perceptron(),

                    #Naive Bayes
                    naive_bayes.BernoulliNB(),
                    naive_bayes.GaussianNB(),

                    #Nearest Neighbor
                    neighbors.KNeighborsClassifier(),

                    #SVM
                    svm.SVC(probability=True),
                    svm.NuSVC(probability=True),
                    svm.LinearSVC(),

                    #Trees    
                    tree.DecisionTreeClassifier(),
                   #tree.ExtraTreeClassifier(),

          ]


In [None]:
from sklearn.metrics import confusion_matrix, precision_score, recall_score,accuracy_score, f1_score

Algols_columns = []
Algols_compare = pd.DataFrame(columns = Algols_columns)


row_index = 0

#Run once for each algorithm and collect the summary
for alg in ManyMoreAlgols:
    
    predicted = alg.fit(X_train, y_train).predict(X_test)
    Algo_name = alg.__class__.__name__
    Algols_compare.loc[row_index,'Name'] = Algo_name
    Algols_compare.loc[row_index, 'Accuracy'] = accuracy_score(y_test, predicted)
    Algols_compare.loc[row_index, 'Precision'] = precision_score(y_test, predicted)
    Algols_compare.loc[row_index, 'Recall'] = recall_score(y_test, predicted)
    Algols_compare.loc[row_index, 'F1'] = f1_score(y_test, predicted)


    row_index+=1
    

    
#sort by Precision
Algols_compare.sort_values(by = ['Precision'], ascending = False, inplace = True)

#display output
Algols_compare

In [None]:
#sort by F1
Algols_compare.sort_values(by = ['F1'], ascending = False, inplace = True)

#display output
Algols_compare

In [None]:
Algols_compare.sort_values(by = ['Recall'], ascending = False, inplace = True)    

#display output
Algols_compare