# Overview

In this notebook we'll explore a typical workflow for applying data science to a problem and creating a predictive model.

Though the process of applying data science can vary somewhat from one problem to another, at its core the journey to AI follows these steps, which we call the AI ladder:

1. Collect
2. Organize
3. Analyze
4. Infuse

In this notebook, we have selected some of the most common activities from the AI ladder and will take you through them. Specifically, this is the workflow that we will be covering.


### Notebook's workflow

- Preparation
    1. Problem Definition
    2. Goal Setting
    3. Data Collection
- Data Preparation
    4. Data Exploration
    5. Data Cleaning
    6. Data Wrangling
- Data Analysis
    7. Data Visualization and insight
    8. Model Training
   

   

## Tools we will use

Throughout this notebook, you will learn to use a few different Python libraries. We introduce those tools here and import them into the project. To learn more about each one, you can click on its name:

- [`matplotlib`](https://matplotlib.org/): A library for creating static, animated, and interactive visualizations.
- [`seaborn`](https://seaborn.pydata.org/): A visualization library built on `matplotlib` that provides simpler, higher-level plotting tools.
- [`pandas`](https://pandas.pydata.org/docs/user_guide/10min.html): A library for data manipulation and analysis.
- [`numpy`](https://numpy.org/): A library for working with arrays, matrices, and linear algebra.
- [`sklearn`](https://scikit-learn.org/stable/getting_started.html): Scikit-learn is a library that provides machine learning algorithms and other relevant tools.


In [None]:
# Let's import these libraries
# Note that we give some of them short names for ease of use

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Data Analysis
import pandas as pd
import numpy as np
import sklearn

## 1. Preparation

In this section, we take the steps necessary to set up our problem so that we can stay focused and achieve the results that we need.

### 1.1 Problem Definition

The first step to successfully applying machine learning to an AI problem is to clearly define the problem and set goals.

**The problem**

In this case the problem that we are trying to solve is:
> Given that we know, for a subset of Titanic passengers, who survived the disaster and who didn't, train a predictive model to predict which of the remaining passengers also survived, and optimize it for accuracy.

**Background Information**

It is also helpful to understand the background of the problem we are solving. This helps us form hypotheses when exploring the data and when engineering new features that could help our predictive models.

For those who are not familiar with the event, take a look at the [Wikipedia](https://en.wikipedia.org/wiki/Titanic) article and pay particular attention to the "[Survivors and victims](https://en.wikipedia.org/wiki/Titanic#Survivors_and_victims)" section. That should provide some insight into the kinds of patterns you might want to explore in the data.

### 1.2 Goal Setting

The next step is to define the goals we want to achieve. In this case, we will set the following goals to address our problem definition:

1. Exploring the Data
2. Cleaning the Data
3. Wrangling the Data
4. Classifying the Data
5. Visualizing the results

The next steps all focus on these goals.


### 1.3 Data Collection

The last step before we start working with the data is collecting it. In this case we can simply load the data; in real-world use cases, however, one might need to use data mining processes or get the appropriate approvals before the data can be used.

For our purpose, you can load the data using the [`read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) method of the [`pandas`](https://pandas.pydata.org/docs/user_guide/10min.html) library.

In [None]:
#Import the data
df_data_1 = pd.read_csv(r'https://raw.githubusercontent.com/IBM/python-and-analytics/master/data/titanic.full.csv', na_values='?')

Notice the `na_values='?'` inside the parentheses. That's an additional argument that we pass to the `read_csv()` function of pandas to specify that `?` denotes a blank in our dataset. You might need to adjust this for your own datasets.

Our data is now loaded into our notebook in a `dataframe`. Think of a `dataframe` as a table in python which has rows, columns, and allows you to manipulate the data. We will look at some of these data manipulations below.
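
To see what `na_values` changes, here is a minimal sketch on a tiny inline CSV. The three rows and the `csv_text` name are made up for illustration and are not part of the Titanic file.

```python
import io
import pandas as pd

# Hypothetical three-row CSV; '?' stands in for a missing value
csv_text = "name,age,fare\nAlice,29,211.3\nBob,?,151.55\nCarol,2,?\n"

# Without na_values, the '?' marker forces age and fare to be read as text
with_marker = pd.read_csv(io.StringIO(csv_text))

# With na_values='?', the marker becomes NaN and the columns stay numeric
with_nan = pd.read_csv(io.StringIO(csv_text), na_values='?')

print(with_marker.dtypes)
print(with_nan.dtypes)
print(with_nan.isnull().sum())
```

If your own dataset marks blanks with something else (for example `NA` or `-`), the same argument should work with that marker instead.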

For simplicity, let's call our data `df`.

<font color='red'><b>IMPORTANT:</b></font> If your auto-import named the data differently, adjust the number at the end of `df_data_1` on the right-hand side accordingly.


In [None]:
df = df_data_1

Finally, we need to understand our data. For that, here is a quick description of what each column entails:

| Column    | Description                                                          |
| :-------- | :------------------------------------------------------------------- |
| pclass    | Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)                          |
| survived  | Survival (0 = No; 1 = Yes)                                           |
| name      | Name                                                                 |
| sex       | Sex                                                                  |
| age       | Age                                                                  |
| sibsp     | Number of Siblings/Spouses Aboard                                    |
| parch     | Number of Parents/Children Aboard                                    |
| ticket    | Ticket Number                                                        |
| fare      | Passenger Fare (British pound)                                       |
| cabin     | Cabin                                                                |
| embarked  | Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton) |
| boat      | Lifeboat                                                             |
| body      | Body Identification Number                                           |
| home.dest | Home/Destination                                                     |

At this point we are done with acquiring the data and defining our problem, and we are ready to move on to the Data Preparation step.

## 2. Data Preparation

### 2.1. Data Exploration

Before we start cleaning our dataset, we need to understand our data. Let's do that by first looking at the data, then computing simple statistics on it, and finally visualizing it.

#### Using a Pandas Dataframe

First let's take a look at our `dataframe` to become familiar with operations that we can perform on it, as well as what is inside it.

##### Viewing your data

In [None]:
# Let's print our DataFrame
df

Next, let's look at the data types in each column (aka feature). It is always useful to know which features are numerical, categorical, or text.

Here are some common data types:

- `float`: Numbers with decimals
- `int`: Integers. Numbers without any decimals
- `object`: Textual values
- `category`: Categorical data
- `bool`: logical `True` or `False` values
- `datetime64`: A format for storing dates
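
As a quick illustration of these types, the sketch below builds a made-up two-row table (the column names and values are assumptions, not taken from the Titanic data) and prints the type pandas assigns to each column.

```python
import pandas as pd

# Made-up two-row table; the column names are illustrative only
toy = pd.DataFrame({
    "fare": [7.25, 71.28],                                    # float
    "sibsp": [1, 0],                                          # int
    "is_adult": [True, False],                                # bool
    "sex": pd.Series(["male", "female"], dtype="category"),   # category
    "boarded": pd.to_datetime(["1912-04-10", "1912-04-11"]),  # datetime64
})

print(toy.dtypes)
```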

In [None]:
df.dtypes

Looks like some of the features were incorrectly picked up as strings (`object`). We will take a look at those next and fix them in the data cleaning section. For now, let's continue exploring our `dataframe`.

Next, let's see what methods our dataframe supports. To list these, in the cell below type `df.` and then press the `TAB` button on your keyboard. You should see a list of methods or attributes available.

In [None]:
# type df. and press TAB at the end of the following line and explore the options
# df.<TAB>



Note that the feature (column) names also appear as attributes that you can access on `df`, for example `df.fare`. That is the easiest way to pull out a column for manipulation or viewing.

`.describe()` and `.info()` show a quick summary of your dataframe.

In [None]:
df.describe()

In [None]:
df.info()

`.T` quickly transposes the table.

In [None]:
df.T

`.sort_index()` and `.sort_values()` sort your table.

In [None]:
df.sort_index(axis=1)

In [None]:
df.sort_values(by='age')

##### Selecting parts of the data

You can find many ways of selecting your data in the [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html#selection-by-label). Let's explore some of them together.

In [None]:
# Column Selection

# df.fare 
df["fare"]

In [None]:
# Row Selections by row number

df[0:4]

In [None]:
# Row selection by another feature value
# Read "Select the rows of df where survived equals 1"

df[df["survived"]==1]

In [None]:
# Selecting by row and column
# use the df.loc[rows , [column1, column2, ...] ]
# for instance

df.loc[0:3, ["name", "sex", "age"]]
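
One detail is easy to miss when mixing these selection styles: a plain slice such as `df[0:3]` excludes the end position, while a `.loc` label slice includes the end label. The sketch below shows this on a made-up five-row table (the values are assumptions for illustration).

```python
import pandas as pd

# Made-up table with the default 0..4 index
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "age": [29.0, 2.0, 30.0, 25.0, 48.0],
    "survived": [1, 0, 0, 1, 1],
})

first_rows = toy[0:3]                      # positional slice: rows 0, 1, 2
by_label = toy.loc[0:3, ["name", "age"]]   # label slice: rows 0, 1, 2 and 3
survivors = toy[toy["survived"] == 1]      # boolean mask: rows where the condition holds

print(len(first_rows), len(by_label), len(survivors))
```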

##### Setting values

There are many ways to set values in your table, which you can find in the [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html#setting).

You can, for instance, select the cells to set by their position, by their value, by a condition on their value (e.g. > 0), by a condition on a different column (feature) (e.g. if not survived, set age to 0!), and so on.

Let's explore some of them together.

In [None]:
# let's start by making a copy of df
# note that simply assigning a different name wouldn't be enough (df2 = df). This would mean df and df2 point to the same table in memory!
# The right way of making a copy is using the copy() method. 

df2 = df.copy()
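
The difference between a second name and a real copy can be checked directly. This is a small sketch on a made-up two-row table; the names `original`, `alias`, and `independent` are illustrative.

```python
import pandas as pd

original = pd.DataFrame({"age": [29.0, 2.0]})

alias = original               # a second name for the same table
independent = original.copy()  # a separate table with the same content

alias.loc[0, "age"] = 99.0         # this also changes `original`
independent.loc[1, "age"] = -1.0   # this leaves `original` alone

print(original)
print(alias is original, independent is original)
```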

In [None]:
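# Let's replace any '?' values in the age field with np.nan (Not a Number), which is an actual value supported by our table
# We select only the "age" column here; without it, every column of the matching rows would be set to NaN
# Because we already passed na_values='?' to read_csv, there may be nothing left to replace
df2.loc[df2["age"] == '?', "age"] = np.nan
df2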

In [None]:
# Next, let's see what would happen if we dropped every row that has any NaN (blank) values.

df2.dropna(how="any")

Interesting... Nothing remained! 
 
Looks like every row has some unknowns. We shall do some more specific cleaning in our Data Cleaning step.
 
For now, since `dropna()` doesn't perform its operation [`inplace`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) and since we didn't assign the output of `df2.dropna(how="any")` back to `df2`, the actual content of `df2` is unmodified, so we can move on. Let's verify `df2` is untouched.

In [None]:
# Checking df2 is unmodified
df2
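
The same behavior can be seen on a small made-up table: `dropna()` returns a new table and leaves the one it was called on unchanged. The `subset` argument shown at the end is one way to drop rows based on selected columns only.

```python
import numpy as np
import pandas as pd

# Made-up table with one blank in each column
toy = pd.DataFrame({
    "age": [29.0, np.nan, 30.0],
    "fare": [211.3, 151.5, np.nan],
})

dropped = toy.dropna(how="any")        # new table without rows that contain any NaN
age_only = toy.dropna(subset=["age"])  # only rows with a blank age are dropped

print(len(toy), len(dropped), len(age_only))
```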

#### Operating on the data 

Now it's time for some exploration! You can see many of the possible operations in the [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#basics-binop). Here we will take a look at some of them to become familiar with the syntax.

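`.mean()`, `.median()`, `.max()`, `.min()`, and `.skew()` are some of the basic statistical operations you can use. You can apply them to the whole table, to individual columns, or to your selected data.

In [None]:
# numeric_only=True skips text columns such as name, which have no mean
df.mean(numeric_only=True)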

In [None]:
df["age"].median()

In [None]:
# Remember to select data conditionally you can do df[(condition1) & (condition2) & ...]

female_survivors = df[ (df["sex"]=='female') & (df["survived"] == 1) ]
male_survivors = df[(df["sex"]=='male') & (df["survived"] == 1)]

f_age_median = female_survivors["age"].median()
m_age_median = male_survivors["age"].median()

print("Age medians f=", f_age_median, "m=", m_age_median)

In [None]:
f_age_skew = female_survivors["age"].skew()
m_age_skew = male_survivors["age"].skew()

print("Age skew f=", f_age_skew, "m=", m_age_skew)

We can see that there is a significant age difference between male and female survivors and that the skew in the data is different. So we will make sure to explore this further when we are visualizing the data. For now, you can explore the data even more with the statistical methods available before moving on.
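
Grouping, which we do not cover in depth in this notebook, gives the same kind of comparison in one step. The sketch below uses a made-up five-row table (not the Titanic data) to compute the median age of survivors for each sex.

```python
import pandas as pd

# Made-up table for illustration only
toy = pd.DataFrame({
    "sex": ["female", "male", "female", "male", "female"],
    "survived": [1, 1, 1, 0, 0],
    "age": [20.0, 40.0, 30.0, 50.0, 60.0],
})

# Keep the survivors, then compute one median per value of "sex"
survivors = toy[toy["survived"] == 1]
median_by_sex = survivors.groupby("sex")["age"].median()

print(median_by_sex)
```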

At this point we are familiar enough with dataframes that we can move on to the visualization section. Of course, there are many other operations that we haven't covered in the interest of time, such as joining tables, grouping items together, pivoting, time series, etc. You are welcome to explore these operations in the `pandas` [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html#setting).

#### Visual Exploration (matplotlib and seaborn)

Now that we know how to use a dataframe, we can move to creating exploratory plots using seaborn and matplotlib. You can find extensive information on how to make plots with these tools online. Here, we will cover some of the most commonly used features.

##### Histograms

Let's begin by exploring the survival rates by age on a histogram. We will use the `plt.hist()` function.

Remember that `plt` was the short name we assigned `matplotlib.pyplot` when we imported it.

In [None]:
plt.hist(df[df["survived"] == 1]["age"])

Interesting... looks like young children had high survival rates. Let's explore this more by increasing the number of bins.

In [None]:
plt.hist(df[df["survived"] == 1]["age"], bins = 20)
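
To make the effect of `bins` concrete without drawing anything, the sketch below counts a few made-up ages with `np.histogram`. It is used here as a stand-in for the plot: it returns the bar heights and bin edges for equal-width bins over the data range.

```python
import numpy as np

# Made-up ages for illustration
ages = np.array([1, 2, 3, 25, 26, 40, 41, 42, 70])

# Two wide bins hide the group of young children ...
counts_coarse, edges_coarse = np.histogram(ages, bins=2)

# ... while four narrower bins separate it from the adults
counts_fine, edges_fine = np.histogram(ages, bins=4)

print(counts_coarse, edges_coarse)
print(counts_fine, edges_fine)
```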

Let's look at a more advanced syntax to simplify our visualization. Here we will explore survival by `pclass` and `age`.

In [None]:
# Create a seaborn grid of empty plots with one row per unique "survived" value and one column per unique "pclass" value
grid = sns.FacetGrid(df, row='survived', col="pclass")

# apply plt.hist on the 'age' column of each selected data in the grid. 
grid.map(plt.hist, 'age', bins = 20)

# So the first line splits the data by 'survived' and 'pclass' columns, and then grid.map applies a plt.hist(..., bins = 20) on the age column.

Let's do the same for survival based on age and sex as well.

In [None]:
# Create a seaborn grid of empty plots with one row per unique "survived" value and one column per unique "sex" value
grid = sns.FacetGrid(df, row='survived', col='sex')

# apply plt.hist on the 'age' column of each selected data in the grid. 
grid.map(plt.hist, 'age', bins = 20)

<b>Observation</b>

This should provide more intuition on who survived and who didn't. It appears:
- pclass 1 had the least passengers, but most survived
- pclass 3 had the most passengers, most didn't survive
- most females survived while most males didn't
- most children and all elderly survived
- ...

<b>Conclusion: </b>

From this, we get the intuition that there might be a good correlation between age, sex, pclass and whether someone survived or not. So these are features that we might consider for our model training.

Next let's learn how to use `sns.pointplot` and `sns.barplot` to explore correlations in categorical and numerical features.

First, we will explore whether the port of boarding (`embarked`) seems to have a correlation with the survival rate.

In [None]:
grid = sns.FacetGrid(df, col='embarked')
grid.map(sns.pointplot, 'pclass', 'survived', 'sex')
grid.add_legend()

It appears that the port of embarkation has an impact on whether males or females survived, so we should definitely consider adding this to our model. Next let's see if those who paid more were more likely to survive. For this, we will use `sns.barplot`.

In [None]:
grid = sns.FacetGrid(df, row="survived", col='embarked')
grid.map(sns.barplot, 'sex', 'fare')
grid.add_legend()

It also appears that, though not as significant as other features, the fare paid played a role in whether someone survived or not. This seems to be affected by the port of embarkation as well as the passenger's sex. So let's consider `fare` for our model training as well.

<b>Exercise:</b>

This is where we encourage you to explore your assumptions about the data and see if you can find correlations that indicate whether your assumptions are correct.

In [None]:
# Room for exercise



This brings us to the end of our data exploration. We are now ready to clean and wrangle our data before we move on to model training.

### 2.2. Data Cleaning

Let's clean our data: remove what we don't need, then fill in the blanks.

#### Dropping unimportant features.

While we can't know for sure whether a feature is useful at the beginning, we can get a good feeling about it based on our data exploration. It is good practice to start with fewer features and slowly add more, since fewer features means faster model training and iteration time.

In this case, based on our observations in the previous section, we are going to keep `pclass`, `age`, `sex`, `embarked`, and `fare`. Additionally, we would see correlations with the following if we explored them, so we will keep those as well: `name`, `sibsp`, `parch`.
And of course our target column: `survived`.

In [None]:
df = df.loc[:, ["pclass", "age", "sex", "embarked", "fare", "name", "sibsp", "parch", "survived"]]
df

Since most Machine Learning models cannot deal with blank data, let's see which columns have blank values and address those. For this we use the `.isnull()` function that puts a true/false on each cell indicating if it is blank, and then we use the `.sum()` function on the result to sum up how many trues (blanks) are in each column.

In [None]:
df.isnull().sum()

Let's start with `age`. We shall replace the blank ones with a guessed value. Based on your knowledge of the domain, you can choose different methods for guessing blank numbers; here we simply fill in the missing ages with the most common age (the mode) and move on.

Of course, this is a very simplistic method and introduces unwanted error. A better approach would be to guess the age based on the median age in each group, separated by `sex`, `pclass`, and `embarked`. But that is outside the scope of this introductory notebook.

`.fillna()` is used to fill in the NaN values.

In [None]:
# Let's get the Mode of age
age_mode = df['age'].mode()
age_mode

In [None]:
# Since the returned mode is a Series (there can be ties), we take the first element so we have a scalar number to use
age_mode = age_mode[0]

# Then we can use the fillna()
df["age"] = df["age"].fillna(age_mode)

In [None]:
# Let's verify we have no more NaN's in the age column
df["age"].isnull().sum()
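
The group-wise alternative mentioned above can be sketched as follows. This is only an illustration on a made-up six-row table, grouped by `sex` alone to keep it short, and it is not what the rest of this notebook uses. `groupby(...).transform("median")` returns one value per row, so it lines up with the column being filled.

```python
import numpy as np
import pandas as pd

# Made-up table with one missing age per group
toy = pd.DataFrame({
    "sex": ["female", "female", "female", "male", "male", "male"],
    "age": [20.0, 30.0, np.nan, 40.0, np.nan, 60.0],
})

# Simple approach: one value (here the overall median) for every blank
simple = toy["age"].fillna(toy["age"].median())

# Group-wise approach: each blank gets the median of its own group
group_median = toy.groupby("sex")["age"].transform("median")
by_group = toy["age"].fillna(group_median)

print(pd.DataFrame({"simple": simple, "by_group": by_group}))
```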

<font color='red'><b>Exercise:</b></font> Do the same for the embarked and fare columns.

In [None]:
# Fill NaN embarked values with the mode of that column


# Fill NaN fare value with the mode of the column



Our data is now relatively clean, so we can move on to the next step.

### 2.3. Data Wrangling

Many Machine Learning algorithms operate on numerical values only, so we need to convert our textual and categorical fields to numbers.

In [None]:
# Let's see the types first
df.dtypes

One way to do this is to replace each value with a number that you want. So something like this:

```python
df['sex'] = df['sex'].map( {'female': 1, 'male': 0} )
df['sex'] = df['sex'].astype(int)
```

However, if you don't care about which number represents which category, there is a simpler way using `.astype('category')` and `.cat.codes`.

In [None]:
# Convert the column type to categorical
df["sex"] = df["sex"].astype('category')
df.dtypes

In [None]:
# Replace the categorical column with numerical codes
df["sex"] = df["sex"].cat.codes
df.dtypes

Next, let's do the same for the other categorical columns too.

In [None]:
df["pclass"] = df["pclass"].astype('category').cat.codes
df["embarked"] = df["embarked"].astype('category').cat.codes
df["survived"] = df["survived"].astype('category').cat.codes

# Let's check the dtypes to see if they look ok
df.dtypes
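
To see which number ends up representing which category, the sketch below encodes a short made-up series of ports. Note that a missing value is encoded as `-1`, which is one more reason to fill in blanks before this step.

```python
import numpy as np
import pandas as pd

# Made-up port values, including one blank
ports = pd.Series(["S", "C", np.nan, "Q", "S"])

as_category = ports.astype("category")
codes = as_category.cat.codes

# The code of each category is its position in .cat.categories
mapping = dict(enumerate(as_category.cat.categories))

print(codes.tolist())
print(mapping)
```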

Next, let's take care of the `name` column.

Perhaps the titles of the passengers would have some correlation with their survival. Let's see.

In [None]:
# Extract Titles from name

# Here's what we are doing
# Left-hand side: create a new column called title
# Right-hand side:
#   .str: treat the names as strings
#   .extract: extract the part of the string that matches our regular expression ' ([A-Za-z]+)\.'
#   The r prefix makes it a raw string so the backslash reaches the regex as-is; expand=False returns a single column
df['title'] = df["name"].str.extract(r' ([A-Za-z]+)\.', expand=False)
df['title']

Regular Expressions: Also called regex, regular expressions are a syntax that describe textual patterns. You can read this tutorial on [python regex](https://docs.python.org/3/howto/regex.html) to learn more.

Specifically, our regex ` ([A-Za-z]+)\.` matches a pattern that starts with a space (` `), is followed by one or more (`+`) upper- or lower-case letters (`[A-Za-z]`), and ends with a dot (`\.`). The parentheses capture just the letters, which is the part that `.extract` returns.

[Click Here](https://regexr.com/5n4td) to see a visual description of what this regex does or create your own.
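
The same pattern can be tried outside of pandas with Python's built-in `re` module. The sample names below are written in the style of the dataset and are used here only for illustration.

```python
import re

# Same pattern as above: space, one or more letters (captured), then a dot
pattern = re.compile(r' ([A-Za-z]+)\.')

names = [
    "Allen, Miss. Elisabeth Walton",
    "Allison, Master. Hudson Trevor",
    "No Title Here",
]

titles = []
for name in names:
    match = pattern.search(name)
    # group(1) is the text captured by the parentheses; None if nothing matched
    titles.append(match.group(1) if match else None)

print(titles)
```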

Now that we have our titles, let's break them down into common ones and rare ones, since the specific title probably won't have much value, but rather whether it's a common or rare one.

In [None]:
# Let's see the unique titles:

df["title"].value_counts()

In [None]:
# Let's normalize the title variants (Mlle and Ms are forms of Miss; Mme is the French form of Mrs)
df["title"] = df["title"].replace('Mlle', "Miss")
df["title"] = df["title"].replace('Ms', "Miss")
df["title"] = df["title"].replace('Mme', "Mrs")
df["title"] = df["title"].replace('', "None")
# Let's replace the rare ones with the keyword 'Rare'
# Here we are selecting the rare titles as anything not in a list
# Using .loc avoids the chained-assignment warning that df["title"][mask] = ... can trigger
df.loc[~df["title"].isin(["Mr", "Miss", "Mrs", "Master"]), "title"] = "Rare"

df["title"].value_counts()
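
As an alternative sketch, the same grouping can be written with a dictionary passed to `.replace()` and with `.where()`, which keeps a value where the condition holds and substitutes another value elsewhere. The titles below are made up for illustration.

```python
import pandas as pd

# Made-up titles for illustration
titles = pd.Series(["Mr", "Mlle", "Dr", "Ms", "Mrs", "Rev", "Master"])

# Map several variants in one call
titles = titles.replace({"Mlle": "Miss", "Ms": "Miss"})

# Keep the common titles, replace everything else with "Rare"
common = ["Mr", "Miss", "Mrs", "Master"]
grouped = titles.where(titles.isin(common), "Rare")

print(grouped.value_counts())
```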


Now let's see if our hypothesis that rare titles have a different chance of survival seems reasonable.

In [None]:
sns.barplot(x='title', y='survived', data=df)

There seems to be a correlation between the title and survival rate. Even within the same gender (for instance male), the title seems to make a difference in the survival rate.

So let's keep the new `title` column and drop the `name`.

`.drop()` is used for this, and we use `axis=1` to indicate that it is a column we are dropping, not a row.

In [None]:
df = df.drop("name", axis=1)
df.head()

Finally, let's replace the `title` column with numbers, and then we are ready to train our Machine Learning model.

In [None]:
# we will do this in steps so it's easier to follow
titles = df["title"]
titles_as_category = titles.astype('category')
title_codes = titles_as_category.cat.codes

# Assign the title codes 
df["title"] = title_codes

# Print the top of the table to check
df.head()

## 3. Data Analysis

Now that we have collected, cleaned, and wrangled our data, it is time to build our predictive model. Note that there are different families (classes) of predictive models (supervised, unsupervised, neural networks, ...), and each family has many models to select from. In a real-life scenario, you choose an algorithm, experiment with it, look at the results, and then decide how to proceed: explore another family/model, keep the one you have trained, or create a hybrid (bagging, boosting, ensemble, ...) to capitalize on the strengths of multiple models and families.

In a real-life scenario, you would want to do some research on each model class, learn their strengths, weaknesses, and input/output formats, and then pick the most promising ones based on your knowledge of the problem that you are solving.

To keep this notebook within our time limit, we will explore only one <b>"classifier"</b> model from the <b>"supervised learning"</b> class, called <b>"Decision Tree"</b>.

### What is a Decision Tree

From scikit-learn's documentation:

> Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation.

Simply put, this model is a tree-like set of decisions that finally lead to a prediction. 

As for its strengths, here are some that relate to our problem today, from scikit-learn's documentation:

> Some advantages of decision trees are:
>
> - Simple to understand and to interpret. Trees can be visualised.
> - Requires little data preparation. Other techniques often require data normalisation, dummy variables need to be created and blank values to be removed. Note however that this module does not support missing values.
> - Able to handle both numerical and categorical data. However scikit-learn implementation does not support categorical variables for now. Other techniques are usually specialised in analysing datasets that have only one type of variable.
> - Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily explained by boolean logic. By contrast, in a black box model (e.g., in an artificial neural network), results may be more difficult to interpret.
> - Possible to validate a model using statistical tests. That makes it possible to account for the reliability of the model.

And as for relevant weaknesses:

> The disadvantages of decision trees include:
>
> - Decision-tree learners can create over-complex trees that do not generalise the data well. This is called overfitting. Mechanisms such as pruning, setting the minimum number of samples required at a leaf node or setting the maximum depth of the tree are necessary to avoid this problem.
> - Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.
> - Predictions of decision trees are neither smooth nor continuous, but piecewise constant approximations. Therefore, they are not good at extrapolation.
> - There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems.
> - Decision tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting with the decision tree.

With these pros and cons in mind, Decision Trees seem like a strong candidate for our problem, so let's use them and see how they perform.

[Learn more](https://scikit-learn.org/stable/modules/tree.html)

### Packages used: scikit-learn

We will learn and use scikit-learn, which implements user-friendly methods that perform the training and testing of an algorithm for us. There are other libraries that could do the same; however, they often require more in-depth knowledge of how the algorithms work internally to configure properly.

A few concepts that we should learn before moving further:


#### Training/validation and test split

We often split our dataset into three sections (train/validate/test), often with ratios of 50/25/25. This is not always the case, however; in simplified scenarios such as our use case, one might only use a train/validate split with a ratio of about 70/30. The use of the "test" set is described below.

- The training set is shown to the model so it learns the patterns to use for its predictions.
- The validation set is then used to estimate the true performance of the model. This helps find out if the model just memorized the input data, or if it can in fact predict unseen data correctly (aka [generalize](https://en.wikipedia.org/wiki/Generalization_error)).
- The test set is used when the validation error is used for selecting the best model or when performing [hyper-parameter optimization](https://en.wikipedia.org/wiki/Hyperparameter_optimization). In those cases, because we are using the error rate from the validation set to pick the best model, our validation error rate will be smaller than the true error rate of the model. Therefore, we need a set of data which we never used to optimize or pick our model to give us an unbiased error rate for our model.

[Learn More](https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets)

Since in this case we are just using one model (no model selection), and once trained we aren't tuning it any further (no hyperparameter optimization), we can skip the test set and just use a train/validate split of roughly 70/30.
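
For completeness, here is a minimal sketch of the three-way 50/25/25 split, which we do not use in this notebook. One common way to get it is to call `train_test_split` twice; the arrays below are made-up placeholders and the `random_state` value is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows with 2 features and a binary label
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# First split: 50% for training, 50% held back
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)

# Second split: divide the held-back half evenly into validation and test
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```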

#### Model fit

How well a model's predictions fit the data is important. If the model does not adequately predict the output from the input even in the training phase, it is called underfit. An underfit model will also perform poorly on the validation set.

On the other hand, if a model performs amazingly on the training set (think 99% accurate) but poorly on the validation set, it is most likely overfit. You can think of this as a model that has basically memorized the training set and is therefore incapable of predicting well on unseen data.

If a model is neither underfit nor overfit, it is called balanced. This is where we want to be, and this is why we keep unseen data (validation and test splits). We use that unseen data to detect whether our model is overfitting.

[Learn More](https://en.wikipedia.org/wiki/Overfitting)
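
A simple way to spot overfitting is to compare the training score with the validation score. The sketch below does this on synthetic data from scikit-learn's `make_classification`, with deliberately noisy labels; the dataset and all parameter values are assumptions chosen for illustration, not the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y adds label noise so that memorizing does not pay off
X, y = make_classification(n_samples=1000, n_features=10, flip_y=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# An unconstrained tree keeps splitting until it fits the training set
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_score = tree.score(X_train, y_train)
val_score = tree.score(X_val, y_val)

# A large gap between the two scores is the usual symptom of overfitting
print("train: {:.2f} validation: {:.2f} gap: {:.2f}".format(
    train_score, val_score, train_score - val_score))
```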

#### Uppercase `X` and lowercase `y`

There is a tradition where uppercase `X` denotes the input data to a model and the lowercase `y` the labels or values that it is expected to learn. That is, each set (train/validate/test) is broken into `X` and `y` where `y` is the predicted column and `X` is all the other columns.


With that, let's split our data, perform our training, and see how the model performs on unseen data (validation).

In [None]:
# import the libraries we need
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Split the target column out
X = df.drop("survived", axis = 1)
y = df["survived"]

# perform the train/validation split
# test is our validation here since we won't further split our train set
# We specify the random state here for reproducible results. 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=20)

# Let's check the shapes (table dimensions) to make sure it makes sense
X_train.shape, y_train.shape, X_test.shape

In [None]:
# Exercise:
# Take a look at the data splits [X_train, X_test, y_train, y_test] and see if they make sense to you. You can explore, length, columns, dtypes, ...



<font color='red'><b>Note:</b></font> 
The following cell won't work. Once you see the error, read it and try to debug the issue in the next cell, then move forward to see the solution.

In [None]:
# Decision Tree
decision_tree = DecisionTreeClassifier()

# Note: This won't work. Once you see the error, read it and try to debug the issue, then move forward to see the solution.
decision_tree.fit(X_train, y_train)


Looks like something is wrong. Take a look at the ValueError message and use the cell below to explore and see if you can find out what is wrong. We left a small mistake in intentionally so you can experience the debugging process.

In [None]:
# Exercise
# Explore the copy of our original table below and find out what is wrong
# Make sure not to modify the original df, and only work on df2
df2 = df.copy()

# You may explore df2 here


<b>Solution:</b>

As the error says:
> Input contains NaN, infinity or a value too large for dtype('float32')

So let's find the column with NaN and fill it with the most common value in that column.

We will use `.isnull()` to fill the table with True wherever there is a null, and then use `.any()` to show whether each column contains any True (which would mean a `NaN`). You can run these steps one by one.

In [None]:

df.isnull().any()

In [None]:
# Now let's count the NaN in fare column
# Note that `df["fare"].isnull().count()` wouldn't work as it would count every value, True and False alike, while `.sum()` counts True as 1 and False as 0.
df["fare"].isnull().sum()
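
The difference between `.sum()`, `.count()`, and `.any()` on the result of `.isnull()` is easy to check on a short made-up series:

```python
import numpy as np
import pandas as pd

# Made-up fares with one blank
fares = pd.Series([7.25, np.nan, 71.28, 8.05])

is_missing = fares.isnull()      # True where the value is blank

n_missing = is_missing.sum()     # True counts as 1, False as 0
n_values = is_missing.count()    # counts every entry, True or False
has_missing = is_missing.any()   # True if at least one entry is True

print(n_missing, n_values, has_missing)
```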

In [None]:
# ok there is only one missing value, so let's replace it with the mode and move on.
fare_mode = df["fare"].mode()[0]
df["fare"] = df["fare"].fillna(fare_mode)

# and let's make sure there are no more nulls
df.isnull().any()

Perfect, no more nulls, we can now train our model

In [None]:
# Redo our split with fixed data
X = df.drop("survived", axis = 1)
y = df["survived"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=20)

# Decision Tree
decision_tree = DecisionTreeClassifier()

# Train Model
decision_tree.fit(X_train, y_train)

# Make predictions for validation set
y_pred = decision_tree.predict(X_test)

# calculate the model's accuracy on the training and validation sets
train_score = decision_tree.score(X_train, y_train)
test_score = decision_tree.score(X_test, y_test)

# Print results
# Note: this syntax formats and rounds numbers to two decimal places within a string. 
# You can learn more at https://www.w3schools.com/python/ref_string_format.asp
print("Training score: {:.2f} validation score: {:.2f}".format(train_score, test_score))

It looks like our decision tree is overfitting! It does very well on the training data (97% accuracy) but performs poorly on unseen data (the validation set).

How to counter that is beyond the scope of this notebook, but you can begin by requiring each leaf of the tree to contain at least 10 training samples. That is, replace the following line
```python
decision_tree = DecisionTreeClassifier()
```
with:
```python
decision_tree = DecisionTreeClassifier(min_samples_leaf=10)
```

This should reduce overfitting and improve our accuracy to about 80%! 

In [None]:
# Tuning the Tree Parameters
decision_tree = DecisionTreeClassifier(min_samples_leaf=10)

# Train Model
decision_tree.fit(X_train, y_train)

# Make predictions for validation set
y_pred = decision_tree.predict(X_test)

# calculate the model's accuracy on the training and validation sets
train_score = decision_tree.score(X_train, y_train)
test_score = decision_tree.score(X_test, y_test)

# Print results
print("Training score: {:.2f} validation score: {:.2f}".format(train_score, test_score))
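
To see what `min_samples_leaf` actually constrains, the sketch below fits two trees on synthetic data (again from `make_classification`, not the Titanic data) and inspects how many training samples ended up in each leaf. The helper `leaf_sizes` is a hypothetical name introduced here; it reads the fitted tree's `tree_` attribute.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

unconstrained = DecisionTreeClassifier(random_state=0).fit(X, y)
constrained = DecisionTreeClassifier(min_samples_leaf=10, random_state=0).fit(X, y)

def leaf_sizes(model):
    """Return the number of training samples that ended up in each leaf."""
    tree = model.tree_
    is_leaf = tree.children_left == -1  # scikit-learn marks leaves with -1
    return tree.n_node_samples[is_leaf]

print("number of leaves:", unconstrained.get_n_leaves(), "vs", constrained.get_n_leaves())
print("smallest leaf:", leaf_sizes(unconstrained).min(), "vs", leaf_sizes(constrained).min())
```

With the constraint in place, no leaf can describe fewer than 10 training samples, which limits how closely the tree can follow individual rows.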