In this file we'll observe how splitting the dataset into training and testing sets affects the measured performance of various classification algorithms.

The Titanic dataset from Kaggle is used throughout.

Classifier algorithms used:

- Decision tree algorithm
- Gaussian Naive Bayes algorithm 

In [1]:
# import the libraries we need

import numpy as np
import pandas as pd

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score # fraction of test samples that were predicted correctly
from sklearn.naive_bayes import GaussianNB # Gaussian naive Bayes classifier
# train_test_split lives in sklearn.model_selection; the old sklearn.cross_validation
# module was removed in scikit-learn 0.20, so we alias the new module to the old name
from sklearn import model_selection as cross_validation
from sklearn import datasets



In [2]:
# load the dataset into X, the input features
X = pd.read_csv('titanic_data.csv')

In [3]:
# keep only the numeric columns; text columns such as names are dropped
X = X.select_dtypes(include=[np.number])

In [4]:
# define the output (target) variable y
y = X['Survived']

In [5]:
# remove two more columns from the input features
del X['Age'] # because it has a lot of missing values
del X['Survived'] # because it is the output and must not be part of the input

In [6]:
# split the data: 40% is held out for testing, the remaining 60% is used for training
# random_state = 0 fixes the shuffle so the split is the same on every run
X_train,X_test,y_train,y_test = cross_validation.train_test_split(X,y,test_size = 0.4,random_state = 0)
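
To make the split itself less of a black box, here is a minimal pure-Python sketch of a shuffled hold-out split. The helper name `holdout_split` is made up for illustration, and rounding the test count up is an assumption of this sketch; scikit-learn's `train_test_split` does considerably more (it accepts arrays and DataFrames, can stratify, and so on).

```python
import math
import random

def holdout_split(rows, test_size=0.4, seed=0):
    """Hypothetical helper: shuffle the row indices, then hold out a fraction for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)            # seeded, so the split is repeatable
    n_test = int(math.ceil(test_size * len(rows)))  # assumption: round the test count up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [rows[i] for i in train_idx]
    test = [rows[i] for i in test_idx]
    return train, test

rows = list(range(10))                              # stand-in for 10 passengers
train, test = holdout_split(rows, test_size=0.4, seed=0)
print("{} training rows, {} test rows".format(len(train), len(test)))
```

Every row lands in exactly one of the two sets, so the classifier is scored only on rows it never saw during training.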

In [7]:
# let's train our classifiers:

# Decision tree classifier (DTC):
clf1 = DecisionTreeClassifier() # define the DTC as clf1
clf1.fit(X_train,y_train) # train the classifier on the training set
print("Decision Tree has an accuracy : {}".format(accuracy_score(y_test,clf1.predict(X_test))))

# in the line above, clf1.predict(X_test) first predicts y' for X_test;
# accuracy_score() then compares y' with the actual y_test and returns the fraction that match


Decision Tree has an accuracy : 0.663865546218


In [8]:
# Next is the Gaussian Naive Bayes classifier. The procedure is the same as
# for the decision tree, except that the classifier is GaussianNB

clf2 = GaussianNB()
clf2.fit(X_train,y_train)
print("GaussianNB has an accuracy : {}".format(accuracy_score(y_test,clf2.predict(X_test))))


GaussianNB has an accuracy : 0.672268907563
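
Accuracy itself is simple enough to write by hand. The sketch below is an illustration on made-up labels, not scikit-learn's implementation; the function name `accuracy` is hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / float(len(y_true))

y_true = [0, 0, 1, 1, 1]   # made-up actual labels
y_pred = [0, 1, 1, 0, 1]   # made-up predictions: 3 of 5 are right
print("accuracy: {:.2f}".format(accuracy(y_true, y_pred)))
```

Note that accuracy says nothing about which kind of mistake was made, which is why the metrics in the next sections are worth a look.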


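##### Confusion matrix
Let's use the confusion matrix to evaluate the Decision tree and GaussianNB classifiers, instead of the accuracy_score() used earlier. Unlike a single accuracy number, it shows which kinds of mistakes each classifier makes.

For a binary problem, scikit-learn puts the actual classes in the rows and the predicted classes in the columns, so the cells are ordered as follows:

True Negative ,  False Positive

False Negative , True Positive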
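
As a rough illustration of that layout, the four cells can be counted by hand. This is a sketch on made-up labels; `confusion_counts` is a hypothetical helper, and it assumes the labels are exactly 0 (negative) and 1 (positive).

```python
def confusion_counts(y_true, y_pred):
    """Count the four cells and return them as [[TN, FP], [FN, TP]]."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1          # survivor predicted as survivor
        elif t == 0 and p == 0:
            tn += 1          # non-survivor predicted as non-survivor
        elif t == 0 and p == 1:
            fp += 1          # non-survivor wrongly predicted as survivor
        else:
            fn += 1          # survivor wrongly predicted as non-survivor
    return [[tn, fp], [fn, tp]]

y_true = [0, 0, 0, 1, 1, 1, 1, 0]   # made-up actual labels
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]   # made-up predictions
print(confusion_counts(y_true, y_pred))
```

The diagonal holds the correct predictions; everything off the diagonal is an error of one kind or the other.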



In [9]:
from sklearn.metrics import confusion_matrix

In [10]:
# let's train our classifiers again and look at their confusion matrices:

# Decision tree classifier (DTC):
clf1 = DecisionTreeClassifier() # define the DTC as clf1
clf1.fit(X_train,y_train) # train the classifier on the training set
print("Decision Tree has a confusion matrix:\n{}".format(confusion_matrix(y_test,clf1.predict(X_test))))


Decision Tree has a confusion matrix:
[[160  61]
 [ 69  67]]


In [11]:
# Next is the Gaussian Naive Bayes classifier
clf2 = GaussianNB()
clf2.fit(X_train,y_train)
print("GaussianNB has a confusion matrix:\n{}".format(confusion_matrix(y_test,clf2.predict(X_test))))

GaussianNB has a confusion matrix:
[[188  33]
 [ 84  52]]


##### Precision and recall
Let's find the precision and recall for the Decision tree and GaussianNB classifiers.
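
Both numbers can be read straight off a confusion matrix: precision is TP / (TP + FP) and recall is TP / (TP + FN). The sketch below plugs in the GaussianNB matrix printed above; the variable names are just for illustration.

```python
# cells of the GaussianNB confusion matrix shown above: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 188, 33, 84, 52

# precision: of the passengers predicted to survive, how many actually did
precision_nb = tp / float(tp + fp)
# recall: of the passengers who actually survived, how many were found
recall_nb = tp / float(tp + fn)

print("precision: {:.2f}, recall: {:.2f}".format(precision_nb, recall_nb))
```

This works out to a precision of about 0.61 and a recall of about 0.38, so GaussianNB misses most of the actual survivors even though its positive predictions are right more often than not.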


In [12]:
# Decision tree classifier (DTC):
from sklearn.metrics import recall_score as recall
from sklearn.metrics import precision_score as precision

clf = DecisionTreeClassifier() # define the DTC as clf
clf.fit(X_train,y_train) # train the classifier on the training set
y_pred = clf.predict(X_test) # predict once and reuse the result for both metrics

print("Decision Tree Recall: {:.2f} and precision : {:.2f}".format(recall(y_test,y_pred),precision(y_test,y_pred)))


Decision Tree Recall: 0.50 and precision : 0.58


In [13]:
# Next is the Gaussian Naive Bayes classifier

clf = GaussianNB() # define GaussianNB as clf
clf.fit(X_train,y_train) # train the classifier on the training set
y_pred = clf.predict(X_test) # predict once and reuse the result for both metrics

print("GaussianNB recall: {:.2f} and precision : {:.2f}".format(recall(y_test,y_pred),precision(y_test,y_pred)))

GaussianNB recall: 0.38 and precision : 0.61


##### F1 Score
Now that you've seen precision and recall, another metric you might consider is the F1 score. It combines precision and recall into a single number (their harmonic mean), computed relative to a specific positive class, which here is Survived = 1.
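
As a quick worked example, the harmonic mean can be computed by hand from the GaussianNB confusion matrix shown earlier. The helper name `f1_from_counts` is made up for this sketch.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall)."""
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    return 2 * precision * recall / (precision + recall)

# GaussianNB confusion matrix from earlier: TP = 52, FP = 33, FN = 84
f1_nb = f1_from_counts(tp=52, fp=33, fn=84)
print("F1: {:.2f}".format(f1_nb))
```

Because it is a harmonic mean, F1 is pulled toward the smaller of the two values, so a classifier cannot score well by being strong on precision alone.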

In [14]:
# Decision tree classifier (DTC):
from sklearn.metrics import f1_score

clf = DecisionTreeClassifier() # define the DTC as clf
clf.fit(X_train,y_train) # train the classifier on the training set

print("Decision Tree f1 score: {:.2f}".format(f1_score(y_test,clf.predict(X_test))))


Decision Tree f1 score: 0.53


In [15]:
# Next is the Gaussian Naive Bayes classifier

clf = GaussianNB() # define GaussianNB as clf
clf.fit(X_train,y_train) # train the classifier on the training set

print("GaussianNB f1 score: {:.2f}".format(f1_score(y_test,clf.predict(X_test))))

GaussianNB f1 score: 0.47


# Evaluation metrics for linear regression:

##### Mean absolute error
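
Mean absolute error is the average of |actual - predicted| over the test samples, so it is expressed in the same units as the target. Here is a minimal sketch on made-up numbers; the function name `mean_abs_error` is hypothetical, and it handles a single target column only.

```python
def mean_abs_error(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    total = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    return total / float(len(y_true))

y_true = [3.0, 5.0, 2.0]   # made-up actual values
y_pred = [2.5, 5.0, 4.0]   # made-up predictions: off by 0.5, 0.0 and 2.0
print("MAE: {:.2f}".format(mean_abs_error(y_true, y_pred)))
```

With several target columns, as in the Linnerud data used here, scikit-learn's `mean_absolute_error` averages the per-column errors uniformly by default.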


In [1]:
import numpy as np
import pandas as pd

#load the dataset
from sklearn.datasets import load_linnerud

linnerud_data = load_linnerud()
X = linnerud_data.data
y = linnerud_data.target

In [2]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error as mae
from sklearn.linear_model import LinearRegression
from sklearn import model_selection as cross_validation # sklearn.cross_validation was removed in scikit-learn 0.20

In [3]:
#split the data
X_train,X_test,y_train,y_test = cross_validation.train_test_split(X,y,test_size = 0.4,random_state = 0)


In [6]:
# train the regressors

reg1 = DecisionTreeRegressor()
reg1.fit(X_train,y_train)
print("Decision tree mean absolute error:{:.2f}".format(mae(y_test,reg1.predict(X_test))))


reg2 = LinearRegression()
reg2.fit(X_train,y_train)
print("Linear regression mean absolute error:{:.2f}".format(mae(y_test,reg2.predict(X_test))))

Decision tree mean absolute error:10.17
Linear regression mean absolute error:10.41


##### Mean squared error
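
Mean squared error averages the squared differences instead of the absolute ones, so a single large miss weighs much more heavily. Here is a minimal sketch on made-up numbers; the function name `mean_sq_error` is hypothetical, and it handles a single target column only.

```python
def mean_sq_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    total = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return total / float(len(y_true))

y_true = [3.0, 5.0, 2.0]   # made-up actual values
y_pred = [2.5, 5.0, 4.0]   # made-up predictions: off by 0.5, 0.0 and 2.0
# squared errors are 0.25, 0.0 and 4.0: the one large miss dominates the average
print("MSE: {:.2f}".format(mean_sq_error(y_true, y_pred)))
```

Since the errors are squared, the result is in squared units of the target, which is why MSE values tend to look much larger than the corresponding mean absolute error.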

In [7]:
from sklearn.metrics import mean_squared_error as mse

In [8]:
# train the regressors

reg1 = DecisionTreeRegressor()
reg1.fit(X_train,y_train)
print("Decision tree mean squared error:{:.2f}".format(mse(y_test,reg1.predict(X_test))))


reg2 = LinearRegression()
reg2.fit(X_train,y_train)
print("Linear regression mean squared error:{:.2f}".format(mse(y_test,reg2.predict(X_test))))

Decision tree mean squared error:286.42
Linear regression mean squared error:323.82
