# 1. Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# 2. Read in the data, and check out training set

In [None]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

In [None]:
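# Stack train and test so every cleaning step is applied to both;
# rows that come from test.csv have NaN in Survived
combined = pd.concat([train, test], ignore_index=True)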

In [None]:
combined.describe()

# 3. Exploratory Data Analysis

Let's begin some exploratory data analysis! We'll start by checking out missing data!

## Missing Data

We can use seaborn to create a simple heatmap to see where we are missing data!

In [None]:
sns.heatmap(combined.isnull(),yticklabels=False,cbar=False,cmap='viridis')

Roughly 20 percent of the Age data is missing, which is likely small enough to fill in with some form of imputation. The Cabin column, on the other hand, is missing too much data to be useful at a basic level. We'll probably drop it later, or turn it into an indicator feature such as "Cabin Known: 1 or 0". Note that Survived is also missing for every row that came from test.csv, since those are the labels we have to predict.
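
To put numbers on what the heatmap shows, the fraction of missing values per column can be computed directly. The sketch below uses a small made-up frame named `toy` in place of `combined`, and the `CabinKnown` column name is a hypothetical choice for the indicator feature mentioned above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `combined`; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

# Fraction of missing values per column, largest first.
missing_fraction = toy.isnull().mean().sort_values(ascending=False)
print(missing_fraction)

# Hypothetical indicator feature: 1 if a cabin was recorded, 0 otherwise.
toy["CabinKnown"] = toy["Cabin"].notnull().astype(int)
print(toy[["Cabin", "CabinKnown"]])
```

The same two expressions should work on `combined` unchanged, as long as the indicator is created before the Cabin column is dropped.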

Let's continue by visualizing some more of the data! Check out the video for full explanations of these plots; this code just serves as a reference.

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Survived', data=combined, palette='RdBu_r')

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Survived', hue='Sex', data=combined, palette='RdBu_r')

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Survived', hue='Pclass', data=combined, palette='rainbow')

In [None]:
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(combined['Age'].dropna(), kde=False, color='darkred', bins=30)

In [None]:
sns.countplot(x='SibSp',data=combined)

In [None]:
combined['Fare'].hist(color='green',bins=40,figsize=(8,4))

___
## Data Cleaning
We want to fill in the missing Age values instead of dropping those rows. One way to do this is to fill in the mean age of all the passengers (imputation).
However, we can be smarter about this and check the typical age by passenger class. For example:


In [None]:
plt.figure(figsize=(12, 7))
sns.boxplot(x='Pclass',y='Age',data=combined,palette='winter')

We can see that the wealthier passengers in the higher classes tend to be older, which makes sense. We'll use an approximate typical age for each class, read off the plot, to impute Age based on Pclass.

In [None]:
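def impute_age(cols):
    # Access by label: positional access such as cols[0] is deprecated
    # on a Series with string labels
    Age = cols['Age']
    Pclass = cols['Pclass']

    # Fill a missing age with the approximate typical age for the class
    if pd.isnull(Age):
        if Pclass == 1:
            return 37
        elif Pclass == 2:
            return 29
        else:
            return 24

    # Keep known ages unchanged
    return Age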

Now apply that function!

In [None]:
combined['Age'] = combined[['Age','Pclass']].apply(impute_age, axis=1)
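
The row-wise `apply` above can also be written as a vectorized fill. The sketch below is an alternative, not the notebook's method. It assumes the per-class median is an acceptable stand-in for the hard-coded 37, 29 and 24, and it runs on a small made-up frame named `toy` rather than on `combined`. The `AgeImputed` column name is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `combined`; the ages are made up for illustration.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3],
    "Age": [38.0, np.nan, 40.0, 30.0, np.nan, 20.0, 24.0, np.nan],
})

# Median age of each class, broadcast back to one value per row.
# The median ignores NaN, so only known ages contribute.
class_median = toy.groupby("Pclass")["Age"].transform("median")

# Replace only the missing ages; known ages are left untouched.
toy["AgeImputed"] = toy["Age"].fillna(class_median)
print(toy)
```

Computing the fill values from the data avoids retyping constants when the dataset changes, at the cost of making the result depend on which rows are included when the medians are computed.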

Now let's check that heat map again!

In [None]:
sns.heatmap(combined.isnull(),yticklabels=False,cbar=False,cmap='viridis')

Great! Let's go ahead and drop the Cabin column and the rows where Embarked is NaN.

In [None]:
combined.drop('Cabin', axis=1, inplace=True)

In [None]:
combined.head()

In [None]:
combined.dropna(inplace=True, subset=['Embarked'])

## Converting Categorical Features 

We'll need to convert categorical features to dummy variables using pandas! Otherwise our machine learning algorithm won't be able to directly take in those features as inputs.
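
As a minimal illustration of what `pd.get_dummies` with `drop_first=True` produces, here is a sketch on a three-row made-up frame named `toy`. Passing `dtype=int` is an assumption made here so the output shows 0/1 instead of booleans; the notebook's own cells do not pass it.

```python
import pandas as pd

# Toy stand-in for the Sex and Embarked columns of `combined`.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# drop_first=True drops the first category in sorted order ('female', 'C'),
# because it is implied when all remaining dummy columns are 0.
sex = pd.get_dummies(toy["Sex"], drop_first=True, dtype=int)
embark = pd.get_dummies(toy["Embarked"], drop_first=True, dtype=int)

encoded = pd.concat([sex, embark], axis=1)
print(encoded)
```

Dropping one level per feature avoids perfectly collinear columns, which matters for linear models such as logistic regression.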

In [None]:
combined.info()

In [None]:
sex = pd.get_dummies(combined['Sex'], drop_first=True)
embark = pd.get_dummies(combined['Embarked'], drop_first=True)

In [None]:
combined.drop(['Sex','Embarked','Name','Ticket'], axis=1, inplace=True)

In [None]:
combined = pd.concat([combined,sex,embark], axis=1)

In [None]:
combined.head()

In [None]:
sns.heatmap(combined.isnull(),yticklabels=False,cbar=False,cmap='viridis')

# 4. Building a Logistic Regression model

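Both train.csv and test.csv are already in `combined`, so we first separate the labeled rows (where Survived is known) from the unlabeled test.csv rows. We then hold out 30% of the labeled rows as a validation set, which lets us evaluate the model before predicting on the real test set.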

## Train Test Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
train = combined[combined['Survived'].notnull()]
test = combined[combined['Survived'].isnull()]
test = test.drop('Survived', axis=1)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(train.drop(['Survived', 'PassengerId'],axis=1), 
                                                    train['Survived'], test_size=0.30, 
                                                    random_state=101)

## Training and Predicting

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
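# Raise max_iter above the default of 100; with unscaled features
# the solver may otherwise stop before converging
logmodel = LogisticRegression(max_iter=1000)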
logmodel.fit(X_train,y_train)

In [None]:
predictions = logmodel.predict(X_test)

# 5. Evaluation

We can check precision, recall, and f1-score using a classification report!

In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report(y_test,predictions))
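
To make the report's columns concrete, the sketch below derives precision, recall and f1-score for the positive class by hand. The labels in `y_true` and `y_pred` are made up for illustration and are not the notebook's results.

```python
import numpy as np

# Made-up labels: 1 = survived, 0 = did not survive.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 0, 1, 0, 1, 0])

# Counts for the positive class (label 1).
tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # predicted 1, actually 1
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # predicted 1, actually 0
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # predicted 0, actually 1

precision = tp / (tp + fp)  # share of predicted survivors that are correct
recall = tp / (tp + fn)     # share of actual survivors that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

`classification_report` prints these same quantities for each class, along with the support (the number of true samples per class).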

___
# 6. Creating Submission File

In [None]:
# Check, column by column, whether any values are still missing
test.isnull().any()

In [None]:
sns.heatmap(test.isnull(),yticklabels=False,cbar=False,cmap='viridis')

In [None]:
# Fill any remaining missing values with 0 so the model can predict on every row
test = test.fillna(0)

In [None]:
sns.heatmap(test.isnull(),yticklabels=False,cbar=False,cmap='viridis')

In [None]:
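# Set ids as PassengerId and predict survival
ids = test['PassengerId']
# Survived is float in combined (NaN for the test rows), so cast the predicted labels to int
predictions = logmodel.predict(test.drop('PassengerId', axis=1)).astype(int)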

#set the output as a dataframe and convert to csv file named submission.csv
output = pd.DataFrame({ 'PassengerId' : ids, 'Survived': predictions })
output.to_csv('submission.csv', index=False)
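
Because Survived holds NaN for the test.csv rows in `combined`, the column is stored as float, so predicted labels can come back as 0.0 and 1.0. A quick sanity check before uploading might look like the sketch below. The ids and predictions are made up, and the requirement that labels be written as integers is an assumption about what the competition expects.

```python
import numpy as np
import pandas as pd

# Made-up ids and float predictions standing in for the real ones.
ids = pd.Series([892, 893, 894], name="PassengerId")
raw_predictions = np.array([0.0, 1.0, 0.0])

# Cast to int so the file contains 0/1 rather than 0.0/1.0.
output = pd.DataFrame({
    "PassengerId": ids,
    "Survived": raw_predictions.astype(int),
})

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = output.to_csv(index=False)
print(csv_text)

# Basic checks on shape and content.
print(list(output.columns))
print(output["Survived"].isin([0, 1]).all())
```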