# Titanic competition - Random Forest
This notebook applies the Random Forest machine learning algorithm in Python 3 to the Kaggle Titanic competition. It follows the approach described in the DataCamp open course *Kaggle Python Tutorial on Machine Learning*.

We will follow these steps:
1. import the required libraries and get the data
2. describe the data
3. clean and format data
4. train the model
5. predict and submit to Kaggle

## Get the data with Pandas
Load the training and test sets into your Python environment. The data is stored on the web as CSV files; load it with the `read_csv()` function from the Pandas library.

In [14]:
# Import the Pandas library
import pandas as pd

# Import the Numpy library
import numpy as np

# Import the RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

# Disable the SettingWithCopyWarning
pd.options.mode.chained_assignment = None

# Load the train and test datasets to create two DataFrames
train_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/train.csv"
train = pd.read_csv(train_url)

test_url = "http://s3.amazonaws.com/assets.datacamp.com/course/Kaggle/test.csv"
test = pd.read_csv(test_url)

print(train.columns)

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')


## Describe the data

The data is described in the companion notebook [Titanic competition - Decision Tree](Titanic%20competition%20-%20Decision%20Tree.ipynb).
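
As a minimal sketch of what this step can look like, the summary calls can be tried on a small made-up DataFrame. The values here are hypothetical and not taken from the Titanic files; the same calls should work on `train`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data that mimics a few Titanic columns
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0],
})

# Summary statistics for the numeric columns
print(toy.describe())

# Number of missing values per column
missing = toy.isnull().sum()
print(missing)

# Share of survivors within each Sex group
survival_by_sex = toy.groupby("Sex")["Survived"].mean()
print(survival_by_sex)
```

Counting missing values first is useful because it shows which columns need imputing in the cleaning step.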

## Clean and format data

In [10]:
# Impute the Age variable
train["Age"] = train["Age"].fillna(train["Age"].median())

# Convert the male and female groups to integer form
train["Sex"] = train["Sex"].map({"male": 0, "female": 1})

# Impute the Embarked variable
train["Embarked"] = train["Embarked"].fillna("S")

# Convert the Embarked classes to integer form
#train["Embarked"][train["Embarked"] == "S"] = 0
#train["Embarked"][train["Embarked"] == "C"] = 1
#train["Embarked"][train["Embarked"] == "Q"] = 2

# Impute the Age and Fare variables
test["Age"] = test["Age"].fillna(test["Age"].median())
test["Fare"] = test["Fare"].fillna(test["Fare"].median())

# Convert the male and female groups to integer form
test["Sex"] = test["Sex"].map({"male": 0, "female": 1})

# Impute the Embarked variable
test["Embarked"] = test["Embarked"].fillna("S")

# Convert the Embarked classes to integer form
#test["Embarked"][test["Embarked"] == "S"] = 0
#test["Embarked"][test["Embarked"] == "C"] = 1
#test["Embarked"][test["Embarked"] == "Q"] = 2
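
Because the same steps are applied to both datasets, they could be bundled into one function. This is a sketch only: `clean` is a hypothetical helper that is not part of the course code, and the rows are made up. Unlike the cell above, it imputes `Fare` for every DataFrame it receives, which changes nothing when no fares are missing.

```python
import numpy as np
import pandas as pd

def clean(df):
    """Return a cleaned copy of df (hypothetical helper)."""
    out = df.copy()
    # Fill missing numeric values with the column median
    for col in ["Age", "Fare"]:
        out[col] = out[col].fillna(out[col].median())
    # Encode Sex as integers: male -> 0, female -> 1
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    # Fill missing ports of embarkation with "S"
    out["Embarked"] = out["Embarked"].fillna("S")
    return out

# Made-up rows to exercise the helper
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Fare": [7.25, 71.28, np.nan],
    "Sex": ["male", "female", "male"],
    "Embarked": ["S", None, "C"],
})
cleaned = clean(toy)
print(cleaned)
```

Working on a copy leaves the input untouched, so the function can be called as `train = clean(train)` and `test = clean(test)`.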

## Train the model

The Random Forest technique addresses the overfitting problem of decision trees. It grows multiple (very deep) classification trees using the training set. At prediction time, each tree comes up with a prediction and every outcome is counted as a vote. For example, if you have trained 3 trees and 2 say a passenger in the test set will survive while 1 says he will not, the passenger is classified as a survivor. This approach of overtraining individual trees, but letting the majority vote decide the actual classification, reduces overfitting.
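
The voting idea can be sketched with plain NumPy. The votes are hypothetical, and the code illustrates simple majority voting only; scikit-learn's `RandomForestClassifier` averages the class probabilities predicted by the trees rather than counting hard votes, so this is not its exact procedure.

```python
import numpy as np

# Hypothetical votes from 3 trees (rows) for 4 passengers (columns); 1 = survived
votes = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
])

# A passenger is classified as a survivor when more than half of the trees say so
n_trees = votes.shape[0]
majority = (votes.sum(axis=0) > n_trees / 2).astype(int)
print(majority)
```

The first passenger matches the example in the text: 2 of the 3 trees vote for survival, so the majority decision is 1.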

Building a random forest in Python looks almost the same as building a decision tree. There are two key differences: a different class is used and a new argument comes into play. The class also has to be imported from scikit-learn's `ensemble` module.
* Use the `RandomForestClassifier()` class instead of the `DecisionTreeClassifier()` class.
* Set `n_estimators` when creating the `RandomForestClassifier()`. This argument controls the number of trees you wish to plant and average over.
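
The two differences can be seen side by side in a short sketch. It uses synthetic data from `make_classification` as a stand-in (an assumption, not the Titanic features), and the value `n_estimators=25` is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Titanic features)
X, y = make_classification(n_samples=200, n_features=6, random_state=1)

# A single decision tree
tree = DecisionTreeClassifier(max_depth=10, random_state=1).fit(X, y)

# A forest: same fit/predict interface, plus n_estimators for the number of trees
forest = RandomForestClassifier(n_estimators=25, max_depth=10, random_state=1).fit(X, y)

# The fitted forest keeps one decision tree per estimator
print(len(forest.estimators_))
```

Each entry of `forest.estimators_` is itself a fitted `DecisionTreeClassifier`, which is why the rest of the workflow stays the same.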

In [12]:
# Create the target and features numpy arrays: target, features_forest
target = train["Survived"].values
features_forest = train[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch"]].values

# Building and fitting my_forest
forest = RandomForestClassifier(max_depth = 10, min_samples_split=2, n_estimators = 100, random_state = 1)
my_forest = forest.fit(features_forest, target)

# Print the accuracy of the fitted random forest on the training data
print(my_forest.score(features_forest, target))

0.9393939393939394
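
This score is computed on the same rows the forest was trained on, so it probably overstates how well the model will do on unseen passengers. One way to get a less optimistic estimate is cross-validation. The sketch uses synthetic stand-in data (an assumption); with the real data, `features_forest` and `target` would take the place of `X` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption; swap in features_forest and target)
X, y = make_classification(n_samples=300, n_features=6, random_state=1)

forest = RandomForestClassifier(max_depth=10, min_samples_split=2,
                                n_estimators=100, random_state=1)

# Accuracy on the training rows themselves
train_accuracy = forest.fit(X, y).score(X, y)

# Accuracy over 5 folds, each scored on rows the model did not see during fitting
cv_scores = cross_val_score(forest, X, y, cv=5)

print(train_accuracy)
print(cv_scores.mean())
```

`cross_val_score` fits a fresh copy of the model for every fold, so the already fitted `forest` is not modified by the call.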


## Predict and submit to Kaggle

Make use of the `.predict()` method of the fitted model (`my_forest`). You provide it the feature values from the dataset for which predictions need to be made (`test_features`, built from `test`).

Make sure your output is in line with the submission requirements of Kaggle: a CSV file with exactly 418 entries and two columns, `PassengerId` and `Survived`. Then make a new data frame using `DataFrame()`, and create a CSV file using the `to_csv()` method from Pandas.

In [13]:


# Compute predictions on our test set features then print the length of the prediction vector
test_features = test[["Pclass", "Age", "Sex", "Fare", "SibSp", "Parch"]].values
pred_forest = my_forest.predict(test_features)
print(len(pred_forest))

# Create a data frame with two columns: PassengerId & Survived. Survived contains your predictions
PassengerId = np.array(test["PassengerId"]).astype(int)
my_solution = pd.DataFrame(pred_forest, PassengerId, columns = ["Survived"])
print(my_solution)

# Check that your data frame has 418 entries
print(my_solution.shape)

# Write your solution to a csv file with the name my_solution_two.csv
my_solution.to_csv("my_solution_two.csv", index_label = ["PassengerId"])

418
      Survived
892          0
893          0
894          0
895          0
896          0
897          0
898          0
899          0
900          1
901          0
902          0
903          0
904          1
905          0
906          1
907          1
908          0
909          0
910          0
911          0
912          0
913          1
914          1
915          0
916          1
917          0
918          1
919          0
920          1
921          0
...        ...
1280         0
1281         0
1282         0
1283         1
1284         0
1285         0
1286         0
1287         1
1288         0
1289         1
1290         0
1291         0
1292         1
1293         0
1294         1
1295         0
1296         0
1297         0
1298         0
1299         0
1300         1
1301         1
1302         1
1303         1
1304         1
1305         0
1306         1
1307         0
1308         0
1309         0

[418 rows x 1 columns]
(418, 1)
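
Before uploading, it can help to read the file back and confirm its layout. The sketch below does this for a tiny made-up solution written to a temporary directory; the passenger IDs, predictions, and file name are hypothetical. For the real file, the expected shape would be `(418, 2)`.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical predictions for three passengers (made-up values)
passenger_ids = np.array([892, 893, 894])
predictions = np.array([0, 1, 0])

solution = pd.DataFrame({"Survived": predictions}, index=passenger_ids)

# Write to a temporary directory so the sketch leaves no files in the project
path = os.path.join(tempfile.mkdtemp(), "toy_solution.csv")
solution.to_csv(path, index_label="PassengerId")

# Read the file back and check the column layout
check = pd.read_csv(path)
print(list(check.columns))
print(check.shape)
```

The `index_label` argument is what turns the index into a named `PassengerId` column in the file, which gives the two-column layout described in the requirements.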
