#Kaggle Tutorial for Machine Learning



This is a simple tutorial on how to get started with Machine Learning and Kaggle Competitions.

Necessary modules to run this notebook:
* [Numpy](http://www.numpy.org/)
* [Scikit-Learn](http://scikit-learn.org/stable/)
* [Pandas](http://pandas.pydata.org/)

We're going to use the [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic) dataset, from [Kaggle](https://www.kaggle.com).

In [None]:
#Import the Numpy library
import numpy as np
#Import 'tree' from scikit-learn library
from sklearn import tree
# Import the Pandas library
import pandas as pd

# Let's not worry about 'pandas copy vs view warnings' for now...
import warnings
warnings.filterwarnings('ignore')

Load the data. We'll be using the Kaggle Titanic files `train.csv` and `test.csv`, read here from the working directory.

In [None]:
train_path = "train.csv"
train = pd.read_csv(train_path)

test_path = "test.csv"
test = pd.read_csv(test_path)

In [None]:
#Print the `head` of the train and test dataframes
train.head()
#test.head()

Here is some useful information about our train and test datasets.

In [None]:
# Use the print() function so the cell runs on Python 3
print("Train data set shape: ", train.shape)
print("Test data set shape: ", test.shape)

**VARIABLE DESCRIPTIONS:**

* survival: Survival (0 = No; 1 = Yes)

* pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
* name: Name
* sex: Sex
* age: Age
* sibsp: Number of Siblings/Spouses Aboard
* parch: Number of Parents/Children Aboard
* ticket: Ticket Number
* fare: Passenger Fare
* cabin: Cabin
* embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Pandas has a nice feature to get general information about your data, the [describe()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html) method.

In [None]:
train.describe()

In [None]:
test.describe()

How many passengers survived in the training dataset?

In [None]:
train["Survived"].value_counts()

In [None]:
train["Survived"].value_counts(normalize = True)*100

In [None]:
# Males that survived vs males that passed away
print("Males that survived vs males that passed away:")
print(train["Survived"][train["Sex"]=='male'].value_counts())
# Normalized male survival
print("\nMales that survived vs males that passed away (Normalized):")
print(train["Survived"][train["Sex"] == 'male'].value_counts(normalize = True))

In [None]:
# Females that survived vs Females that passed away
print("Females that survived vs females that passed away:")
print(train["Survived"][train["Sex"] == 'female'].value_counts())
# Normalized female survival
print("\nFemales that survived vs females that passed away (Normalized):")
print(train["Survived"][train["Sex"] == 'female'].value_counts(normalize = True))
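The per-sex counts above can also be computed in a single step with a `groupby`. The following is a minimal sketch on a small made-up frame (the `toy` data is hypothetical, not the Titanic file); it relies on the fact that the mean of a 0/1 column is the fraction of ones.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic training set
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "male"],
    "Survived": [0, 0, 1, 1, 1, 0],
})

# The mean of a 0/1 column is the fraction of survivors in each group
survival_rate_by_sex = toy.groupby("Sex")["Survived"].mean()
print(survival_rate_by_sex)
```

On the real data, the same call would be `train.groupby("Sex")["Survived"].mean()`, as long as it runs before `Sex` is converted to integers.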

Let's check the data for missing values.

In [None]:
# Count the missing values in each column
for col in list(train.columns.values):
    print("Number of missing data on {}: {}".format(col, train[col].isnull().values.sum()))
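The same per-column count can be obtained without an explicit loop, since `DataFrame.isnull().sum()` sums the missing-value flags column by column. A small sketch on a hypothetical `toy` frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing entries
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Pclass": [3, 1, 1, 3],
})

# isnull() marks missing cells as True; sum() counts them per column
missing_counts = toy.isnull().sum()
print(missing_counts)
```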

###Some data treatment

What if you want to create new attributes instead of using the provided ones? This is called [feature engineering](http://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/).

In [None]:
# Create the column Child and assign to 'NaN'
train["Child"] = float('NaN')

In [None]:
# Assign 1 to passengers under 18, 0 to those 18 or older. Print the new column.
# .loc assigns on the original frame instead of on a possible copy
train.loc[train["Age"] < 18, "Child"] = 1
train.loc[train["Age"] >= 18, "Child"] = 0
train.head()

In [None]:
# Print normalized Survival Rates for passengers under 18
print("Survival Rate for under 18:")
print(train["Survived"][train["Child"] == 1].value_counts(normalize = True))

# Print normalized Survival Rates for passengers 18 or older
print("\nSurvival Rate for 18 or older:")
print(train["Survived"][train["Child"] == 0].value_counts(normalize =True))
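As a possible alternative to filling the `Child` column in several steps, the flag can be derived in one expression that keeps `NaN` wherever the age is unknown. This is only a sketch on a hypothetical `toy` frame, not the notebook's own method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ages, including one missing value
toy = pd.DataFrame({"Age": [5.0, 17.0, 18.0, 40.0, np.nan]})

age = toy["Age"]
# (age < 18) is False for missing ages, so mask those rows back to NaN
toy["Child"] = (age < 18).astype(float).where(age.notna())
print(toy)
```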

In [None]:
#Convert the male and female groups to integer form
# map() replaces every label in one pass and avoids chained assignment
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})

In [None]:
train.head()

In [None]:
#Impute the Embarked variable
train["Embarked"] = train["Embarked"].fillna("S")

In [None]:
#Convert the Embarked classes to integer form
# map() replaces every label in one pass and avoids chained assignment
train["Embarked"] = train["Embarked"].map({"S": 0, "C": 1, "Q": 2})

In [None]:
train.head()

Let's consider a model that will use the following features:

* Pclass
* Sex
* Age
* Fare

How many missing entries do we have?

In [None]:
print("Number of missing entries: {}".format(train[["Pclass", "Sex", "Age", "Fare"]].isnull().values.sum()))

This is exactly the number of missing values in the **Age** feature. A first approach is to drop those rows (entries) and train a model on the remaining data.

In [None]:
# Select the columns and drop incomplete rows in one step (no in-place edit of a slice)
train2 = train[["Pclass", "Sex", "Age", "Fare", "Survived"]].dropna(axis = 0)

print("Train shape: {}".format(train.shape))
print("Train2 shape: {}".format(train2.shape))

In [None]:
# Create the target and features numpy arrays: target, features_one
target = train2["Survived"].values
features_one = train2[["Pclass", "Sex", "Age", "Fare"]].values

##Let's start some Machine Learning

Now let's build our [Decision Tree Model](http://scikit-learn.org/stable/modules/tree.html).


The rows containing NaN values were dropped above, so the tree is fit on complete rows only.

In [None]:
# Fit your first decision tree: my_tree_one
my_tree_one = tree.DecisionTreeClassifier()
my_tree_one = my_tree_one.fit(features_one, target)

The `feature_importances_` attribute makes it simple to interpret the significance of the features we've included in our model.

In [None]:
# Look at the importance and score of the included features
feature_list = ["Pclass", "Sex", "Age", "Fare"]
importances = my_tree_one.feature_importances_

for k in range(0,len(feature_list)):
    print("Feature: {}\t-> Importance: {}".format(feature_list[k], importances[k]))
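Pairing the importances with their feature names in a `pandas.Series` makes them easy to sort and read. The sketch below uses synthetic data (an assumption for illustration only) in which the label depends on just one column, so that column should dominate the ranking; the feature names are reused from the notebook only as labels.

```python
import numpy as np
import pandas as pd
from sklearn import tree

# Synthetic data: the label depends only on the second column
rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = (X[:, 1] > 0.5).astype(int)

feature_list = ["Pclass", "Sex", "Age", "Fare"]
clf = tree.DecisionTreeClassifier(random_state=0).fit(X, y)

# Label each importance with its feature name and sort, largest first
importances = pd.Series(clf.feature_importances_, index=feature_list)
importances = importances.sort_values(ascending=False)
print(importances)
```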

In [None]:
print(my_tree_one.score(features_one, target))
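Note that the score above is measured on the same rows the tree was trained on, so it tends to be optimistic. One common way to get a more honest estimate is to hold out part of the data. The following is a rough sketch on synthetic, noisy data (not the Titanic set); the variable names `X_fit`, `X_val`, `train_acc` and `val_acc` are made up for this example.

```python
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split

# Synthetic noisy data: the label follows the first column plus random noise
rng = np.random.RandomState(0)
X = rng.rand(400, 4)
y = ((X[:, 0] + 0.3 * rng.randn(400)) > 0.5).astype(int)

# Keep 25% of the rows aside; the tree never sees them during fitting
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

clf = tree.DecisionTreeClassifier(random_state=0).fit(X_fit, y_fit)
train_acc = clf.score(X_fit, y_fit)  # accuracy on the rows used for fitting
val_acc = clf.score(X_val, y_val)    # accuracy on the held-out rows
print("Training accuracy: {:.3f}".format(train_acc))
print("Held-out accuracy: {:.3f}".format(val_acc))
```

A large gap between the two numbers is a typical sign of overfitting.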

###Make some predictions

Once we have our model, we can apply it to our test set and see the results. First, let's check the test set for missing values.

In [None]:
for col in list(test.columns.values):
    print("Number of missing data on {}: {}".format(col, test[col].isnull().values.sum()))

In [None]:
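# Impute the missing Fare and Age values with the median of each column
# (fillna avoids hard-coding the row index of the missing Fare)
test["Fare"] = test["Fare"].fillna(test["Fare"].median())
test["Age"] = test["Age"].fillna(test["Age"].median())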
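The cell above fills the gaps with the test set's own medians. A frequently used variant, shown here only as a sketch and not as the notebook's method, learns the medians from the training data and reuses them on the test data via scikit-learn's `SimpleImputer`. The `toy_train` and `toy_test` frames are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames with missing values
toy_train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0],
                          "Fare": [10.0, 20.0, 30.0, np.nan]})
toy_test = pd.DataFrame({"Age": [np.nan, 50.0],
                         "Fare": [5.0, np.nan]})

# Learn the per-column medians from the training frame only
imputer = SimpleImputer(strategy="median")
imputer.fit(toy_train)

# Apply those training medians to the test frame
test_filled = pd.DataFrame(imputer.transform(toy_test), columns=toy_test.columns)
print(test_filled)
```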

In [None]:
#Convert the male and female groups to integer form
# (map() avoids chained assignment on the dataframe)
test["Sex"] = test["Sex"].map({"male": 0, "female": 1})

#Impute the Embarked variable
test["Embarked"] = test["Embarked"].fillna("S")

#Convert the Embarked classes to integer form
test["Embarked"] = test["Embarked"].map({"S": 0, "C": 1, "Q": 2})
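Because the train and test sets must be encoded identically, it can help to put the conversion in one function and call it on both frames. The helper name `encode_categoricals` and the two code dictionaries below are hypothetical names introduced for this sketch; the mappings themselves are the ones used in this notebook.

```python
import pandas as pd

# Same integer codes as used in the notebook
SEX_CODES = {"male": 0, "female": 1}
EMBARKED_CODES = {"S": 0, "C": 1, "Q": 2}

def encode_categoricals(df):
    """Return a copy of df with Sex and Embarked converted to integers."""
    out = df.copy()
    out["Sex"] = out["Sex"].map(SEX_CODES)
    # Missing ports are filled with "S" before encoding, as in the notebook
    out["Embarked"] = out["Embarked"].fillna("S").map(EMBARKED_CODES)
    return out

# Hypothetical toy frame with one missing port
toy = pd.DataFrame({"Sex": ["male", "female"], "Embarked": ["C", None]})
encoded = encode_categoricals(toy)
print(encoded)
```

The function works on a copy, so the input frame is left unchanged.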

In [None]:
# Extract the features from the test set: Pclass, Sex, Age, and Fare.
test_features = test[["Pclass", "Sex", "Age", "Fare"]].values

# Make your prediction using the test set
my_prediction = my_tree_one.predict(test_features)

In [None]:
# Create a data frame with two columns: PassengerId & Survived. Survived contains your predictions
PassengerId =np.array(test["PassengerId"]).astype(int)
my_solution = pd.DataFrame(my_prediction, PassengerId, columns = ["Survived"])
print(my_solution)

# Check that your data frame has 418 entries
print(my_solution.shape)

# Write your solution to a csv file with the name my_solution_one.csv
my_solution.to_csv("my_solution_one.csv", index_label = ["PassengerId"])
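The submission file needs two columns, `PassengerId` and `Survived`. An equivalent way to build it, sketched here with made-up ids and predictions, is to make both explicit columns and write the file without the index.

```python
import numpy as np
import pandas as pd

# Hypothetical ids and predictions standing in for the real ones
passenger_ids = np.array([892, 893, 894])
predictions = np.array([0, 1, 0])

# Both fields are ordinary columns, so no index label is needed
submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": predictions})

# to_csv() without a path returns the CSV text, handy for a quick check
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Passing a file name as the first argument of `to_csv` writes the same content to disk.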

###Overfitting and how to control it

In [None]:
# Select the extended feature set and drop rows with missing values: train3
# (only Age still has missing values here, so the remaining rows line up with `target`)
train3 = train[["Pclass","Age","Sex","Fare", "SibSp","Parch", "Embarked"]].dropna(axis = 0)

In [None]:
features_two = train3.values

#Control overfitting by setting "max_depth" to 10 and "min_samples_split" to 5 : my_tree_two
max_depth = 10
min_samples_split = 5
my_tree_two = tree.DecisionTreeClassifier(max_depth = max_depth, min_samples_split = min_samples_split, random_state = 1)
my_tree_two = my_tree_two.fit(features_two, target)

#Print the score of the new decision tree
print("Second model score: {}".format(my_tree_two.score(features_two, target)))
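To judge whether limits such as `max_depth` actually help, the settings can be compared with cross-validation instead of the training score. The sketch below runs on synthetic, noisy data (an assumption, not the Titanic set) and uses scikit-learn's `cross_val_score`; the depth values are arbitrary examples.

```python
import numpy as np
from sklearn import tree
from sklearn.model_selection import cross_val_score

# Synthetic noisy data: the label follows the first column plus random noise
rng = np.random.RandomState(1)
X = rng.rand(300, 4)
y = ((X[:, 0] + 0.3 * rng.randn(300)) > 0.5).astype(int)

cv_scores = {}
for depth in (None, 10, 3):
    clf = tree.DecisionTreeClassifier(max_depth=depth, min_samples_split=5, random_state=1)
    # 5-fold cross-validation: each fold is scored on rows not used for fitting
    cv_scores[depth] = cross_val_score(clf, X, y, cv=5).mean()
    print("max_depth={}: mean CV accuracy {:.3f}".format(depth, cv_scores[depth]))
```

The setting with the highest cross-validated accuracy is usually a more reasonable choice than the one with the highest training accuracy.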

In [None]:
# Look at the importance and score of the included features
feature_list = ["Pclass","Age","Sex","Fare", "SibSp","Parch", "Embarked"]
importances = my_tree_two.feature_importances_

for k in range(0,len(feature_list)):
    print("Feature: {}\t-> Importance: {}".format(feature_list[k], importances[k]))

In [None]:
# Extract the features from the test set: Pclass, Age, Sex, Fare, SibSp, Parch, and Embarked.
test_features = test[["Pclass","Age","Sex","Fare", "SibSp","Parch", "Embarked"]].values

# Make your prediction using the test set
my_prediction2 = my_tree_two.predict(test_features)

In [None]:
my_solution2 = pd.DataFrame(my_prediction2, PassengerId, columns = ["Survived"])
print(my_solution2)

# Check that your data frame has 418 entries
print(my_solution2.shape)

# Write your solution to a csv file with the name my_solution_two.csv
my_solution2.to_csv("my_solution_two.csv", index_label = ["PassengerId"])

###Feature Engineering

In [None]:
# Create train_two with the newly defined feature
train_two = train.copy()
train_two["family_size"] = train_two.SibSp + train_two.Parch + 1

In [None]:
train3 = train_two[["Pclass", "Sex", "Age", "Fare", "SibSp", "Parch", "family_size"]]
for col in list(train3.columns.values):
    print("Number of missing data on {}: {}".format(col, train3[col].isnull().values.sum()))

In [None]:
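# Fill the missing ages with the training median (assign returns a new frame)
train3 = train3.assign(Age = train3["Age"].fillna(train["Age"].median()))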
target = train.Survived

In [None]:
# Create a new feature set and add the new feature
features_three = train3.values

# Define the tree classifier, then fit the model
my_tree_three = tree.DecisionTreeClassifier(max_depth = max_depth, min_samples_split = min_samples_split, random_state = 1)
my_tree_three = my_tree_three.fit(features_three, target)

# Print the score of this decision tree
print(my_tree_three.score(features_three, target))

In [None]:
test_three = test.copy()
test_three["family_size"] = test_three.SibSp + test_three.Parch + 1

In [None]:
# Extract the features from the test set: Pclass, Sex, Age, Fare, SibSp, Parch, and family_size.
test_features = test_three[["Pclass", "Sex", "Age", "Fare", "SibSp", "Parch", "family_size"]].values

# Make your prediction using the test set
my_prediction3 = my_tree_three.predict(test_features)

In [None]:
my_solution3 = pd.DataFrame(my_prediction3, PassengerId, columns = ["Survived"])

# Check that your data frame has 418 entries
print(my_solution3.shape)

# Write your solution to a csv file with the name my_solution_three.csv
my_solution3.to_csv("my_solution_three.csv", index_label = ["PassengerId"])