# Module 3. Machine Learning
Machine learning lets a program recognize hidden patterns in data by itself. The field spans traditional statistical tools such as regression and more modern approaches such as deep learning. In this module we will go through some of these techniques using the Titanic dataset.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# as usual
# if you are working on Google Colab, please change the path to :
# https://raw.githubusercontent.com/JumpingSquid/py_tutorial/master/titanic.csv
df = pd.read_csv("titanic.csv")
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


As the column names show, there is a variable, "Survived", that indicates whether the passenger survived. We will try different models to see whether we can predict it well.

## 1. Train and Evaluate
To train a model, we first need to split the dataset into a training set and a test set. The training set is used to fit the model, and the test set is used to evaluate its accuracy on data the model has not seen.

In [3]:
# split the dataset into training set and test set by 8:2
split_n = int(0.8 * len(df))
df_train = df.iloc[:split_n, :]
df_test = df.iloc[split_n:, :]

print("sample size of training set:", len(df_train))
print("sample size of test set:", len(df_test))

sample size of training set: 712
sample size of test set: 179
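
The cell above splits by row position, so the result depends on the order of rows in the file. As an alternative sketch (not what this notebook uses), scikit-learn's `train_test_split` shuffles before splitting. The toy DataFrame `toy_df` below is a made-up stand-in for `df`, and `random_state=0` and `stratify` are illustrative choices.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in data; in the notebook this would be the Titanic DataFrame `df`.
toy_df = pd.DataFrame({
    "Pclass": [1, 2, 3, 3, 1, 2, 3, 1, 2, 3],
    "Survived": [1, 0, 0, 1, 1, 0, 0, 1, 1, 0],
})

# test_size=0.2 keeps the 8:2 ratio, random_state makes the shuffle reproducible,
# and stratify keeps the share of survivors similar in both parts.
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.2, random_state=0, stratify=toy_df["Survived"]
)

print("sample size of training set:", len(toy_train))
print("sample size of test set:", len(toy_test))
```

Shuffling matters most when the file is sorted in some meaningful way, for example by ticket class or by date.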


We first select three features: Pclass, Sex, and Age. Since gender is stored as a string, we need to convert it to 1/0.
We then need to fill the NaN values in the Age column (here with 0).

In [4]:
train_X = df_train.loc[:, ["Pclass", "Sex", "Age"]]
train_X = train_X.replace("female", 0)
train_X = train_X.replace("male", 1)
train_X = train_X.fillna(0)
train_y = df_train.Survived

test_X = df_test.loc[:, ["Pclass", "Sex", "Age"]]
test_X = test_X.replace("female", 0)
test_X = test_X.replace("male", 1)
test_X = test_X.fillna(0)
test_y = df_test.Survived
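
The same preprocessing can be wrapped in one function so that the training and test sets are guaranteed to be treated identically. The sketch below is only an illustration on made-up data: the helper name `prepare` is hypothetical, and it fills missing ages with the training-set median instead of the 0 used in the cell above, which is a different imputation choice.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for df_train / df_test.
toy_train = pd.DataFrame({"Sex": ["male", "female", "female", "male"],
                          "Age": [22.0, np.nan, 30.0, 40.0]})
toy_test = pd.DataFrame({"Sex": ["female", "male"],
                         "Age": [np.nan, 50.0]})

def prepare(frame, age_fill):
    """Hypothetical helper: encode Sex as 0/1 and fill missing Age values."""
    out = frame.copy()
    out["Sex"] = out["Sex"].map({"female": 0, "male": 1})
    out["Age"] = out["Age"].fillna(age_fill)
    return out

# Compute the fill value on the training set only, then reuse it for the test set,
# so that no information from the test set leaks into preprocessing.
age_median = toy_train["Age"].median()
train_prepared = prepare(toy_train, age_median)
test_prepared = prepare(toy_test, age_median)

print(train_prepared)
print(test_prepared)
```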

### Linear Regression
We first train a linear regression model as a benchmark. When performing machine learning, it is important to choose the right features. Features are the variables that the model learns from. For instance, in this case, a passenger's gender, age, and class might be good features for predicting the chance of survival. Domain knowledge does not guarantee good features, but it usually helps you identify them.

In [5]:
from sklearn.linear_model import LinearRegression

In [6]:
clf_linear = LinearRegression().fit(train_X, train_y)
print("R squared is", clf_linear.score(train_X, train_y))

R squared is 0.36546584549782446


In [7]:
from sklearn.metrics import mean_squared_error

In [8]:
# scikit-learn metrics take the true values first, then the predictions
print("The mean squared error for the linear regression is", mean_squared_error(test_y, clf_linear.predict(test_X)))

The mean squared error for the linear regression is 0.13624616279221732


### Logistic Regression
Logistic regression is better suited than linear regression to a binary outcome: it models the probability of the positive class and predicts a class label directly.

In [9]:
from sklearn.linear_model import LogisticRegression

In [10]:
clf_logistic = LogisticRegression(solver="lbfgs").fit(train_X, train_y)

In [11]:
print("The mean squared error for the logistic regression is", mean_squared_error(test_y, clf_logistic.predict(test_X)))

The mean squared error for the logistic regression is 0.2011173184357542


In [12]:
print("logistic:", clf_logistic.predict(test_X)[:5])
print("linear:", clf_linear.predict(test_X)[:5])

logistic: [0 0 0 0 1]
linear: [0.37687667 0.07623216 0.19814633 0.09727519 0.91093401]
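
The two outputs above differ in kind: the logistic model returns class labels, while the linear model returns unbounded scores. A minimal sketch on made-up one-feature data (the arrays `X`, `y`, and `X_new` are illustrative, not from the Titanic dataset) shows that linear scores can leave the [0, 1] range, whereas `predict_proba` from logistic regression always stays inside it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Made-up data: one feature, binary label.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

linear = LinearRegression().fit(X, y)
logistic = LogisticRegression(solver="lbfgs").fit(X, y)

# Points far outside the training range make the difference visible.
X_new = np.array([[-10.0], [2.5], [15.0]])

linear_scores = linear.predict(X_new)        # unbounded real numbers
proba = logistic.predict_proba(X_new)[:, 1]  # probability of class 1, always in [0, 1]
labels = logistic.predict(X_new)             # class labels, 0 or 1

print("linear scores:", linear_scores)
print("logistic probabilities:", proba)
print("logistic labels:", labels)
```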


In [13]:
print("Number of correct guess for logistic reg:", sum(clf_logistic.predict(test_X) == test_y))
print("Accuracy: ", sum(clf_logistic.predict(test_X) == test_y)/len(df_test))

Number of correct guess for logistic reg: 143
Accuracy:  0.7988826815642458
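
When both the labels and the predictions are 0 or 1, each squared error is 1 for a wrong guess and 0 for a correct one, so the mean squared error equals 1 minus the accuracy (here 0.2011 = 1 - 0.7989). The sketch below checks this on made-up arrays and uses scikit-learn's `accuracy_score`, which gives the same value as the manual `sum(...) / len(...)` computation.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Made-up labels and predictions: 3 of 5 are correct.
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 0])

accuracy = accuracy_score(y_true, y_pred)  # share of correct predictions
mse = mean_squared_error(y_true, y_pred)   # share of wrong predictions for 0/1 values

print("Accuracy:", accuracy)
print("MSE:", mse)
print("1 - accuracy:", 1 - accuracy)
```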


In [14]:
binary_output = np.where(clf_linear.predict(test_X) > 0.5, 1, 0)
print("Number of correct guess for linear reg:", sum(binary_output == test_y))
print("Accuracy: ", sum(binary_output == test_y)/len(df_test))

Number of correct guess for linear reg: 143
Accuracy:  0.7988826815642458


### Random Forest
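Random forest is perhaps one of the most powerful prediction models among the traditional ML tools. It combines the predictions of many decision trees, each trained on a random sample of the data.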
![random_forest_1](https://github.com/JumpingSquid/py_tutorial/raw/master/image/rf_ilus.png)

In [15]:
from sklearn.ensemble import RandomForestClassifier

In [16]:
clf_rf = RandomForestClassifier(n_estimators=150, max_depth=3, random_state=0)
clf_rf.fit(train_X, train_y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=3, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=150, n_jobs=None,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

In [17]:
print("The mean squared error for the random forest is", mean_squared_error(test_y, clf_rf.predict(test_X)))
print("Number of correct guess for random forest:", sum(clf_rf.predict(test_X) == test_y))
print("Accuracy: ", sum(clf_rf.predict(test_X) == test_y)/len(df_test))

The mean squared error for the random forest is 0.18435754189944134
Number of correct guess for random forest: 146
Accuracy:  0.8156424581005587
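
A fitted random forest also exposes the individual trees and a per-feature importance score, which can help when judging which features matter. The sketch below uses made-up data in which only the first column determines the label; the sizes and parameters are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up data: three features, but only the first one determines the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)
forest.fit(X, y)

# estimators_ holds the fitted decision trees that vote on each prediction.
print("number of trees:", len(forest.estimators_))

# feature_importances_ sums to 1; larger values mean the feature was more useful for splits.
importances = forest.feature_importances_
print("feature importances:", importances)
```

Applied to `clf_rf`, the same attribute would give one score each for Pclass, Sex, and Age.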


## 2. Feature Engineering
After training the models on the three basic features, we may want to add more features to improve accuracy. In practice, we can transform or combine one or more existing features to create a new one. This is called feature engineering. In general, when a dataset has a limited sample size and few features, feature engineering is key to improving a model's performance.

In [18]:
print(train_X.head())

   Pclass  Sex   Age
0       3    1  22.0
1       1    0  38.0
2       3    0  26.0
3       1    0  35.0
4       3    1  35.0


In [19]:
train_X["old"] = np.where(train_X.Age > 55, 1, 0)
train_X["young"] = np.where(train_X.Age < 20, 1, 0)
print(train_X.head())

test_X["old"] = np.where(test_X.Age > 55, 1, 0)
test_X["young"] = np.where(test_X.Age < 20, 1, 0)

   Pclass  Sex   Age  old  young
0       3    1  22.0    0      0
1       1    0  38.0    0      0
2       3    0  26.0    0      0
3       1    0  35.0    0      0
4       3    1  35.0    0      0
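
Note that missing ages were filled with 0 earlier, so those passengers fall below 20 and are flagged as young. As an alternative sketch for building age-group indicators in one step, `pd.cut` can bin a numeric column and `pd.get_dummies` can turn the bins into 0/1 columns. The series `ages`, the bin edges, and the labels below are illustrative assumptions; with `right=False` an age of exactly 55 counts as old, which differs slightly from the `Age > 55` rule above.

```python
import pandas as pd

# Made-up ages for illustration.
ages = pd.Series([4.0, 19.0, 20.0, 35.0, 55.0, 70.0], name="Age")

# right=False gives the intervals [0, 20), [20, 55), [55, inf).
age_group = pd.cut(
    ages,
    bins=[0, 20, 55, float("inf")],
    right=False,
    labels=["young", "adult", "old"],
)

# One indicator column per age group.
dummies = pd.get_dummies(age_group, prefix="age")

print(age_group.tolist())
print(dummies)
```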


In [20]:
# interaction features: combine the age flags with gender (Sex == 1 means male)
train_X["old_man"] = np.where((train_X.old == 1) & (train_X.Sex == 1), 1, 0)
train_X["young_man"] = np.where((train_X.young == 1) & (train_X.Sex == 1), 1, 0)
print(train_X.head())

test_X["old_man"] = np.where((test_X.old == 1) & (test_X.Sex == 1), 1, 0)
test_X["young_man"] = np.where((test_X.young == 1) & (test_X.Sex == 1), 1, 0)

   Pclass  Sex   Age  old  young  old_man  young_man
0       3    1  22.0    0      0        0          0
1       1    0  38.0    0      0        0          0
2       3    0  26.0    0      0        0          0
3       1    0  35.0    0      0        0          0
4       3    1  35.0    0      0        0          0


In [21]:
clf_linear = LinearRegression().fit(train_X, train_y)
clf_logistic = LogisticRegression(solver="lbfgs").fit(train_X, train_y)
clf_rf.fit(train_X, train_y)

binary_output = np.where(clf_linear.predict(test_X) > 0.5, 1, 0)
print("Number of correct guess for linear reg:", sum(binary_output == test_y))
print("Number of correct guess for logistic reg:", sum(clf_logistic.predict(test_X) == test_y))
print("Number of correct guess for random forest:", sum(clf_rf.predict(test_X) == test_y))

Number of correct guess for linear reg: 143
Number of correct guess for logistic reg: 142
Number of correct guess for random forest: 147
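
The fit-and-count pattern used above can be written once as a loop over a dictionary of models, which makes it easier to add another model later. This is a minimal sketch on made-up data; the dictionary `models`, the `results` mapping, and the synthetic arrays are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Made-up data: the label depends on the sum of two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# 8:2 split by position, as in the notebook.
X_train, X_test = X[:240], X[240:]
y_train, y_test = y[:240], y[240:]

models = {
    "logistic reg": LogisticRegression(solver="lbfgs"),
    "random forest": RandomForestClassifier(n_estimators=150, max_depth=3, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    correct = int((model.predict(X_test) == y_test).sum())
    results[name] = correct
    print(f"Number of correct guesses for {name}: {correct} / {len(y_test)}")
```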
