
# Lambda School, Intro to Data Science, Day 8 — Classification!

## Assignment

Run these cells to import the libraries and load the Titanic data:

In [0]:
import pandas as pd
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype

import seaborn as sns
import matplotlib.pyplot as plt

from statistics import mean
from sklearn.preprocessing import LabelEncoder
from itertools import combinations

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

import warnings
warnings.filterwarnings('ignore')

In [0]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(sns.load_dataset('titanic').drop(columns=['alive']), random_state=0)
target = 'survived'

### Data Exploration

In [3]:
train.shape

(668, 14)

In [4]:
train.head()

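| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 105 | 0 | 3 | male | 28.0 | 0 | 0 | 7.8958 | S | Third | man | True | NaN | Southampton | True |
| 68 | 1 | 3 | female | 17.0 | 4 | 2 | 7.925 | S | Third | woman | False | NaN | Southampton | False |
| 253 | 0 | 3 | male | 30.0 | 1 | 0 | 16.1 | S | Third | man | True | NaN | Southampton | False |
| 320 | 0 | 3 | male | 22.0 | 0 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | True |
| 706 | 1 | 2 | female | 45.0 | 0 | 0 | 13.5 | S | Second | woman | False | NaN | Southampton | True |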


In [5]:
print('Column Name          Data Type')
for col in train.columns:
  print('{:20} {}'.format(col, train[col].dtype))

Column Name          Data Type
survived             int64
pclass               int64
sex                  object
age                  float64
sibsp                int64
parch                int64
fare                 float64
embarked             object
class                category
who                  object
adult_male           bool
deck                 category
embark_town          object
alone                bool


In [6]:
train.describe()

| | survived | pclass | age | sibsp | parch | fare |
|---|---|---|---|---|---|---|
| count | 668.0 | 668.0 | 535.0 | 668.0 | 668.0 | 668.0 |
| mean | 0.386228 | 2.305389 | 29.9 | 0.534431 | 0.392216 | 32.373621 |
| std | 0.487249 | 0.837377 | 14.487993 | 1.161739 | 0.822509 | 50.632021 |
| min | 0.0 | 1.0 | 0.67 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 2.0 | 21.0 | 0.0 | 0.0 | 7.925 |
| 50% | 0.0 | 3.0 | 29.0 | 0.0 | 0.0 | 14.5 |
| 75% | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.275 |
| max | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [7]:
train.describe(include=['object', 'category'])

| | sex | embarked | class | who | deck | embark_town |
|---|---|---|---|---|---|---|
| count | 668 | 666 | 668 | 668 | 156 | 666 |
| unique | 2 | 3 | 3 | 3 | 7 | 3 |
| top | male | S | Third | man | C | Southampton |
| freq | 437 | 490 | 367 | 407 | 43 | 490 |


In [8]:
train.embark_town.unique()

array(['Southampton', 'Cherbourg', 'Queenstown', nan], dtype=object)

### Fill NAs

In [9]:
# find out train data columns that contain NAs
train.isna().sum()

survived         0
pclass           0
sex              0
age            133
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           512
embark_town      2
alone            0
dtype: int64

In [10]:
# check test data NAs
test.isna().sum()

survived         0
pclass           0
sex              0
age             44
sibsp            0
parch            0
fare             0
embarked         0
class            0
who              0
adult_male       0
deck           176
embark_town      0
alone            0
dtype: int64

In [11]:
# 'deck' has too many NAs to be useful, so it is left out of the features used later;
# compute the train data mean of 'age', which is used to fill the 'age' NAs
age_mean = mean(train.age.dropna())
print(age_mean)

29.9


In [12]:
# test data column 'age' mean
mean(test.age.dropna())

29.098715083798883

In [13]:
# fill train data 'age' NAs with the train mean, then verify none remain
train.age = train.age.fillna(age_mean)
train.age.isna().sum()

0

In [14]:
# fill test data 'age' NAs with the train mean (not the test mean), then verify none remain
test.age = test.age.fillna(age_mean)
test.age.isna().sum()

0
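The fill step above uses the train mean for both splits. Below is a minimal sketch of that pattern on toy data. The frames `toy_train` and `toy_test` and their values are hypothetical stand-ins, not Titanic rows, and `Series.mean()` is used here in place of `statistics.mean` because it skips NaN by default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for the real train/test split.
toy_train = pd.DataFrame({'age': [20.0, np.nan, 40.0]})
toy_test = pd.DataFrame({'age': [np.nan, 10.0]})

# Compute the statistic on the training rows only; NaN values are skipped.
toy_age_mean = toy_train['age'].mean()

# Reuse the same training mean for both frames so no test information is used.
toy_train['age'] = toy_train['age'].fillna(toy_age_mean)
toy_test['age'] = toy_test['age'].fillna(toy_age_mean)

print(toy_age_mean)              # 30.0
print(toy_test['age'].tolist())  # [30.0, 10.0]
```

Filling the test set with its own mean would also run, but it would let test-set information into preprocessing, which is why the train value is reused.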

### Encoding

In [15]:
# encode train data column 'sex'
le = LabelEncoder()
train[['sex']] = train[['sex']].apply(le.fit_transform)
train.sex.head()

105    1
68     0
253    1
320    1
706    0
Name: sex, dtype: int64

In [16]:
# encode test data column 'sex' with the encoder already fitted on the train data
test[['sex']] = test[['sex']].apply(le.transform)
test.sex.head()

495    1
648    1
278    1
31     0
255    0
Name: sex, dtype: int64
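For reference, here is a small sketch of the usual encoder pattern: fit on the training values, then reuse the fitted mapping on the test values. The lists `toy_train_sex` and `toy_test_sex` are hypothetical toy inputs, not the Titanic columns.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy values standing in for the 'sex' column of each split.
toy_train_sex = ['male', 'female', 'male']
toy_test_sex = ['female', 'female', 'male']

encoder = LabelEncoder()

# fit_transform learns the classes (sorted: 'female' -> 0, 'male' -> 1) and encodes.
train_codes = encoder.fit_transform(toy_train_sex)

# transform reuses the learned mapping; it raises ValueError on an unseen label.
test_codes = encoder.transform(toy_test_sex)

print(list(encoder.classes_))  # ['female', 'male']
print(train_codes.tolist())    # [1, 0, 1]
print(test_codes.tolist())     # [0, 0, 1]
```

Refitting on the test split would give the same codes only when both splits contain the same set of labels, so reusing the fitted encoder is the safer habit.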

Then, train a [Logistic Regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba), [Decision Tree](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), or [Random Forest](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) model. Use any features and parameters you want. 

Try to get better than 78.0% accuracy on the test set! (This is not required, but encouraged.)

Do refer to the lecture notebook — but try not to copy-paste.

> You must type each of these exercises in, manually. If you copy and paste, you might as well not even do them. The point of these exercises is to train your hands, your brain, and your mind in how to read, write, and see code. If you copy-paste, you are cheating yourself out of the effectiveness of the lessons. —*[Learn Python the Hard Way](https://learnpythonthehardway.org/book/intro.html)*

After this, you may want to try [Kaggle's Titanic challenge](https://www.kaggle.com/c/titanic)!
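The cells that follow only try Logistic Regression. As a rough sketch of how the other two suggested model types could be fitted and scored in the same way, the example below uses synthetic data. The arrays `X` and `y`, the `max_depth` and `n_estimators` values, and the `scores` dictionary are all illustrative assumptions, not settings from the lecture notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem (assumption): the label depends mostly on the first column.
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = (X[:, 0] + 0.1 * rng.randn(200) > 0.5).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hyperparameters are arbitrary illustrative choices.
models = {
    'decision tree': DecisionTreeClassifier(max_depth=3, random_state=0),
    'random forest': RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, clf.predict(X_test))
    print(name, 'test accuracy:', scores[name])
```

On the Titanic data, `X_train` and `X_test` would be replaced by `train[features]` and `test[features]`, and the labels by `train[target]` and `test[target]`.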

In [0]:
features_all = ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'alone']
target = 'survived'
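An exhaustive search over feature subsets grows quickly. As a quick check of how many models the seven features listed above imply for subset sizes 2 through 7, the sketch below counts them in two ways. The name `n_subsets` is introduced here for illustration only.

```python
from itertools import combinations
from math import comb

features_all = ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'alone']

# Count every subset of size 2..7 by enumerating them.
n_subsets = sum(
    1
    for k in range(2, len(features_all) + 1)
    for _ in combinations(features_all, k)
)

# Cross-check with binomial coefficients: 2**7 - C(7,0) - C(7,1) = 128 - 1 - 7.
assert n_subsets == sum(comb(len(features_all), k) for k in range(2, 8))

print(n_subsets)  # 120
```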

In [32]:
# try Logistic Regression model with different feature combinations
model = LogisticRegression()
for i in range(2, 8):
  for comb in combinations(features_all, i):
    features = list(comb)
    model.fit(train[features], train[target])
    print('features:', features)

    # Train accuracy
    y_true = train[target]
    y_pred = model.predict(train[features])
    print('Train accuracy:', accuracy_score(y_true, y_pred))

    # Test accuracy
    y_true = test[target]
    y_pred = model.predict(test[features])
    print('Test accuracy:', accuracy_score(y_true, y_pred))
    print()

features: ['pclass', 'sex']
Train accuracy: 0.7889221556886228
Test accuracy: 0.7802690582959642

features: ['pclass', 'age']
Train accuracy: 0.6946107784431138
Test accuracy: 0.726457399103139

features: ['pclass', 'sibsp']
Train accuracy: 0.6691616766467066
Test accuracy: 0.7085201793721974

features: ['pclass', 'parch']
Train accuracy: 0.6796407185628742
Test accuracy: 0.7130044843049327

features: ['pclass', 'fare']
Train accuracy: 0.6691616766467066
Test accuracy: 0.7085201793721974

features: ['pclass', 'alone']
Train accuracy: 0.7005988023952096
Test accuracy: 0.7130044843049327

features: ['sex', 'age']
Train accuracy: 0.7889221556886228
Test accuracy: 0.7802690582959642

features: ['sex', 'sibsp']
Train accuracy: 0.7949101796407185
Test accuracy: 0.7802690582959642

features: ['sex', 'parch']
Train accuracy: 0.7889221556886228
Test accuracy: 0.7802690582959642

features: ['sex', 'fare']
Train accuracy: 0.7844311377245509
Test accuracy: 0.7757847533632287

features: ['sex', 'al