# Course Summary

In the Machine Learning course, we worked on two Kaggle competitions: Titanic Survival and MNIST. We applied various preprocessing techniques and classifiers to predict the target. Each competition provides two CSV files, train.csv and test.csv. train.csv contains every attribute, including the target, and is used to learn the relationship between the other attributes and the target. test.csv contains every attribute except the target. The relationship learned from train.csv is then applied to test.csv to predict the target.
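As a minimal sketch of this workflow, the snippet below uses two tiny in-memory tables in place of the real files. The column names `feature_a`, `feature_b`, and `target` are hypothetical and only illustrate that the test table lacks the target column.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-ins for train.csv and test.csv (not the competition data).
train = pd.DataFrame({
    "feature_a": [0, 0, 1, 1, 0, 1],
    "feature_b": [1, 2, 1, 2, 3, 3],
    "target":    [0, 0, 1, 1, 0, 1],
})
test = pd.DataFrame({
    "feature_a": [1, 0],
    "feature_b": [2, 1],
})

# Learn the relationship between the attributes and the target from train.
features = ["feature_a", "feature_b"]
model = DecisionTreeClassifier(random_state=0)
model.fit(train[features], train["target"])

# Apply that relationship to test, which has no target column.
predictions = test.assign(target=model.predict(test[features]))
print(predictions)
```

In this toy data the target simply follows `feature_a`, so the tree only needs one split to reproduce it.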

# Titanic Survival 

This is a data analysis of what kind of people survived the Titanic sinking. The data is read from file into a pandas DataFrame and preprocessed as follows.

In [7]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier

train_set = pd.read_csv('datasets/train.csv')
test_set = pd.read_csv('datasets/test.csv')

# Fill missing fares in the test set with the median fare (median() skips NaN).
test_set['Fare'] = test_set['Fare'].fillna(test_set['Fare'].median())

train_set = train_set.drop(["Cabin", "Ticket", "Name", "PassengerId"], axis=1)
test_set = test_set.drop(["Cabin", "Ticket", "Name", "PassengerId"], axis=1)
freq_embarked = train_set.Embarked.dropna().mode()[0]
combine = [train_set, test_set]

for dataset in combine:
    dataset['Sex'] = dataset['Sex'].map( {'female': 1, 'male': 0} ).astype(int)
    dataset['Embarked'] = dataset['Embarked'].fillna(freq_embarked)
    dataset['Embarked'] = dataset['Embarked'].map({'C':0, 'Q':1, 'S':2}).astype(int)
    
# Impute missing ages with the median age of each (Sex, Pclass) group.
guess_ages = np.zeros((2,3))
for dataset in combine:
    for i in range(0, 2):
        for j in range(0, 3):
            guess_set = dataset[(dataset['Sex'] == i) & (dataset['Pclass'] == j+1)]['Age'].dropna()
            age = guess_set.median()
            # Round the group median to the nearest 0.5.
            guess_ages[i,j] = int(age/0.5+0.5)*0.5
    for i in range(0,2):
        for j in range(0,3):
            dataset.loc[(dataset.Age.isnull()) & (dataset.Sex == i) & (dataset.Pclass == j+1), 'Age'] = guess_ages[i,j]
    dataset['Age'] = dataset['Age'].astype(int)
# Explore survival rates across 5 equal-width age ranges and 4 fare quartiles;
# the hard-coded bin edges below correspond to these bands.
train_set["AgeRange"] = pd.cut(train_set['Age'],5)
train_set[['AgeRange', 'Survived']].groupby(['AgeRange'], as_index=False).mean().sort_values(by='AgeRange', ascending=True)
train_set['FareBand'] = pd.qcut(train_set['Fare'], 4)
train_set[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True)

combine = [train_set, test_set]

for dataset in combine:
    dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 64, 'Age'] = 4
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare']   = 2
    dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)
    
train_set = train_set.drop(['FareBand'], axis=1)    
train_set = train_set.drop(['AgeRange'], axis=1)
train_set.head()

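| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 0 | 1 | 1 | 0 | 0 | 2 |
| 1 | 1 | 1 | 1 | 2 | 1 | 0 | 3 | 0 |
| 2 | 1 | 3 | 1 | 1 | 0 | 0 | 1 | 2 |
| 3 | 1 | 1 | 1 | 2 | 1 | 0 | 3 | 2 |
| 4 | 0 | 3 | 0 | 2 | 0 | 0 | 1 | 2 |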
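The age imputation and age banding in the cell above can also be expressed more compactly with `groupby(...).transform` and `pd.cut`. The following is a sketch on hypothetical toy data, not the Titanic file; it reuses the age edges 16, 32, 48, and 64 from the cell, but skips the rounding to the nearest 0.5 and the integer cast, so it is not a drop-in replacement.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Titanic file.
df = pd.DataFrame({
    "Sex":    [0, 0, 0, 1, 1, 1],
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age":    [40.0, np.nan, 50.0, 10.0, 20.0, np.nan],
})

# Fill each missing age with the median age of rows sharing Sex and Pclass.
group_median = df.groupby(["Sex", "Pclass"])["Age"].transform("median")
df["Age"] = df["Age"].fillna(group_median)

# Map ages to ordinal bands 0..4; intervals are closed on the right,
# so an age of exactly 16 falls in band 0.
edges = [-np.inf, 16, 32, 48, 64, np.inf]
df["AgeBand"] = pd.cut(df["Age"], bins=edges, labels=False)
print(df)
```

Passing `labels=False` makes `pd.cut` return the integer band index directly, which avoids one `loc` assignment per band.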


A decision tree classifier is used to predict whether a passenger survives or not. Decision trees suit this task because the target, Survived, takes only two values, although they also handle targets with more classes. The model is trained on all the preprocessed attributes to predict Survived.
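To make the idea of a decision tree more concrete, here is a small sketch that fits a tree on hypothetical toy passengers (not rows from train.csv) and prints the learned rules with scikit-learn's `export_text`. The toy data is constructed so that Survived follows Sex exactly, which is an assumption for illustration only.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy passengers (not rows from train.csv).
toy = pd.DataFrame({
    "Sex":      [1, 1, 1, 0, 0, 0, 0, 1],
    "Pclass":   [1, 2, 3, 1, 2, 3, 3, 1],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
})
X = toy[["Sex", "Pclass"]]
y = toy["Survived"]

# Limit the depth so the printed rules stay short.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Print the learned if/else rules as indented text.
print(export_text(tree, feature_names=["Sex", "Pclass"]))
```

Each internal node of the printed tree is a threshold test on one attribute, and each leaf carries the predicted class.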

train_set is split into two parts, train and test: train holds 80% of train_set and test holds the remaining 20%. X contains the preprocessed attributes and Y contains the Survived attribute. The decision tree classifier is fitted on train_X and train_Y and then applied to test_X to predict Survived. The predictions are compared with the actual Survived values in test_Y. The accuracy of the model is calculated as the number of correct predictions divided by the total number of predictions.
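The accuracy formula can be checked on a handful of made-up labels. The sketch below compares the manual computation with scikit-learn's `accuracy_score`; the arrays `y_true` and `y_hat` are hypothetical and not taken from the Titanic data.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels: 3 of the 5 predictions match the true values.
y_true = np.array([1, 0, 1, 1, 0])
y_hat = np.array([1, 0, 0, 1, 1])

# Number of correct predictions divided by the total number of predictions.
manual = np.sum(y_hat == y_true) / len(y_true)

# The same quantity computed by scikit-learn.
library = accuracy_score(y_true, y_hat)
print(manual, library)
```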

In [8]:
from sklearn.model_selection import train_test_split

# Separate the features from the target, then hold out 20% of the rows for evaluation.
train_X, test_X, train_Y, test_Y = train_test_split(
    train_set.drop(["Survived"], axis=1), train_set['Survived'], test_size=0.2, random_state=42)

tree_classifier = DecisionTreeClassifier()
tree_classifier.fit(train_X, train_Y) 
y_pred = tree_classifier.predict(test_X)
# Accuracy is measured on the held-out 20% split, not on the rows used for fitting.
print("Accuracy on test split: ", sum(y_pred==test_Y)/len(test_Y))

Accuracy on test split:  0.793296089385


# MNIST

The objective here is to correctly identify digits from a dataset of tens of thousands of handwritten images. I applied a decision tree classifier and a linear support vector machine (SVM) classifier to the dataset.
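As a quick, self-contained illustration of these two classifiers, the sketch below runs them on scikit-learn's bundled 8x8 digits dataset. This is a much smaller stand-in chosen so the code runs without the Kaggle CSV, so its scores are not comparable to the MNIST results reported in this notebook. The `max_iter` value is an assumption made to give the solver more room to converge.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: 1,797 8x8 digit images bundled with scikit-learn
# (an assumption for this sketch, not the Kaggle MNIST file).
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Decision tree on the raw pixel values.
tree_clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Linear SVM with hinge loss, preceded by feature standardization.
svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge", max_iter=10000)),
]).fit(X_train, y_train)

# score() returns the fraction of correct predictions on the held-out split.
tree_acc = tree_clf.score(X_test, y_test)
svm_acc = svm_clf.score(X_test, y_test)
print(f"tree: {tree_acc:.3f}  linear SVM: {svm_acc:.3f}")
```

Wrapping the scaler and the SVM in one `Pipeline` ensures the scaling statistics are learned from the training split only and then reused on the test split.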

In [9]:
mnist = pd.read_csv('datasets/mnist/train.csv')

# Separate the pixel columns from the label, then hold out 20% of the rows for evaluation.
train_X, test_X, train_Y, test_Y = train_test_split(
    mnist.drop(["label"], axis=1), mnist['label'], test_size=0.2, random_state=42)

tree_classifier = DecisionTreeClassifier()
tree_classifier.fit(train_X, train_Y) 
y_pred = tree_classifier.predict(test_X)
# Accuracy is measured on the held-out 20% split, not on the rows used for fitting.
print("Accuracy on test split: ", sum(y_pred==test_Y)/len(test_Y))

Accuracy on test split:  0.852261904762


In [10]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Separate the pixel columns from the label, then hold out 10% of the rows for evaluation.
X_train, X_test, Y_train, Y_test = train_test_split(
    mnist.drop(["label"], axis=1), mnist['label'], random_state=42, test_size=0.1)

# Standardize the pixel values, then fit a linear SVM with hinge loss.
svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge")),
])
svm_clf.fit(X_train, Y_train)
y_pred = svm_clf.predict(X_test)
# Accuracy is measured on the held-out 10% split, not on the rows used for fitting.
print("Accuracy on test split: ", sum(y_pred==Y_test)/len(Y_test))

Accuracy on test split:  0.901666666667
