# Processing Data for Decision Trees in Pandas
By Benned Hedegaard

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/12z93hzXqv5MPtUNb_BjD-GF0vfr5iwOI)

We'll use Kaggle's Titanic dataset to learn how to analyze, reformat, categorize, and slice data in Pandas DataFrames. We'll also delve a bit into the intuitions behind decision trees and some information theory. Note: We will NOT get to implementing decision trees themselves. That said, most of the tools and math to do this are explained/implemented in this Notebook.

[Link to the Kaggle competition](https://www.kaggle.com/c/titanic)

## Importing Data

In [0]:
import pandas as pd
import numpy as np
# import matplotlib.pyplot as plt Not used yet...

In [2]:
# Import the datasets from my GitHub
!git clone https://github.com/Benendead/Titanic_Pandas_Tutorial

Cloning into 'Titanic_Pandas_Tutorial'...
remote: Enumerating objects: 20, done.
remote: Counting objects: 100% (20/20), done.
remote: Compressing objects: 100% (19/19), done.
remote: Total 20 (delta 6), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (20/20), done.


In [3]:
# Look at where our data is.
!ls Titanic_Pandas_Tutorial/Data

test.csv  train.csv


In [4]:
# Read in our training dataset and give a preview of its contents.
tr = pd.read_csv("Titanic_Pandas_Tutorial/Data/train.csv")
tr.head(3) # Previews first 3 rows of the DataFrame.

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


In [5]:
# We can see that pd.read_csv gives us a DataFrame.
type(tr)

pandas.core.frame.DataFrame

In [6]:
# Check the shape of our data so far.
tr.shape

(891, 12)

In [7]:
# Just for kicks, let's cut off the Ticket column. I just wanted to show this method:
tr.drop(["Ticket"], axis = 1, inplace = True) # inplace = True modifies tr directly instead of returning a new DataFrame.
tr.head(1) # It worked.

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.25,,S


### Some Basic Pandas Terms
DataFrame - The primary data structure of Pandas. Allows us to store labeled columns of varying types in one structure.

Index - The row labels in a DataFrame.

Series - The second data structure of Pandas. Like one column of a DataFrame.
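
To make these three terms concrete, here is a minimal sketch on a tiny made-up table. The names and values are illustrative assumptions, not taken from the Titanic data.

```python
import pandas as pd

# A tiny illustrative DataFrame; the values are made up for this sketch.
toy = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cy"], "Age": [31.0, 45.0, 22.0]},
    index=["a", "b", "c"],
)

print(type(toy))          # the DataFrame itself
print(list(toy.index))    # the Index: row labels 'a', 'b', 'c'

ages = toy["Age"]         # selecting one column gives a Series
print(type(ages))
print(ages["b"])          # a Series keeps the same row labels as its DataFrame
```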

A useful reference cheatsheet:
https://assets.datacamp.com/blog_assets/PandasPythonForDataScience.pdf

And one for data wrangling:
https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf

## Analyzing Our Data

In [8]:
"""
Survey the variables we have for each person.

PassengerId - Unique identifiers for each passenger.
Survived - If the passenger survived. 0 = No, 1 = Yes
Pclass - Ticket class. 1 = 1st, 2 = 2nd, 3 = 3rd
Name - Name of the passenger. String
Sex - Sex of the passenger. "male" or "female"
Age - Age of the passenger. Float (fractional for infants, e.g. 0.42)
SibSp - # of siblings/spouses passenger had on board. Integer
Parch - # of parents/children passenger had on board. Integer
Ticket - Ticket number. String
Fare - Passenger's fare. Float
Cabin - Cabin number. String
Embarked - Port of embarkation. 
        C = Cherbourg, Q = Queenstown, S = Southampton
"""

tr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(4)
memory usage: 76.6+ KB


In [0]:
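# These column names are unclear. Let's fix that.
# Note: index = str also converts the row labels from integers to strings ("0", "1", ...).
tr = tr.rename(index = str, columns = {"SibSp": "SiblingsSpouses", "Parch": "ParentsChildren"}) # Returns the renamed dataframe.

# Notice the lack of inplace: we reassign the returned DataFrame to tr instead.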
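
A small sketch of what this call does, on a made-up two-row frame (the values are assumptions for illustration): `rename` returns a new DataFrame rather than changing the original, and `index = str` passes every row label through `str`, so the labels become strings.

```python
import pandas as pd

toy = pd.DataFrame({"SibSp": [1, 0], "Parch": [0, 2]})  # made-up values

renamed = toy.rename(index=str, columns={"SibSp": "SiblingsSpouses"})

print(list(toy.columns))      # the original frame is unchanged
print(list(renamed.columns))  # 'SibSp' is now 'SiblingsSpouses'
print(list(toy.index))        # integer labels 0, 1
print(list(renamed.index))    # string labels '0', '1'
```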

In [10]:
# Survey some information about the numerical variables.
tr.describe() # Method of the DataFrame class

Unnamed: 0,PassengerId,Survived,Pclass,Age,SiblingsSpouses,ParentsChildren,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


Observations:
* Every numerical feature has 891 values except Age, which has only 714. We'll need to fix that.
* What percentage of people survived?
* What was the age of the oldest person?

In [11]:
# We noticed that the Age category was missing some values. Let's check for null values across the DataFrame.
nulls = tr.isna() # Gives a dataframe of booleans showing which values in the training set are null.
nulls.head(3)
# We could use this later, maybe.

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SiblingsSpouses,ParentsChildren,Fare,Cabin,Embarked
0,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,True,False


In [12]:
# Like now.
nullsPerCol = nulls.sum() # Each True counts as 1, so this gives the number of nulls per column.
nullsPerCol

PassengerId          0
Survived             0
Pclass               0
Name                 0
Sex                  0
Age                177
SiblingsSpouses      0
ParentsChildren      0
Fare                 0
Cabin              687
Embarked             2
dtype: int64
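
Summing booleans works because each True counts as 1. The same mask can also give the share of missing values per column by taking the mean instead of the sum. A minimal sketch on made-up rows:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})  # made-up rows

mask = toy.isna()              # DataFrame of booleans
missing_counts = mask.sum()    # True counts as 1: number of nulls per column
missing_share = mask.mean()    # fraction of rows that are null per column

print(missing_counts)
print(missing_share)
```

Applied to the counts above, 687 of the 891 Cabin values are null, which is roughly 77% of the column.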

In [0]:
def makeCategorical(col, df):
    """
    Makes a given column categorical in the dataframe. Returns fixed dataframe.
    
    Parameters
    ----------
    col : String
        The name of the column to make categorical.
    df : pandas DataFrame
        The dataframe to be altered.
    """
    if (df[col].dtype.name == "category"):   # Skips columns that are already categories.
        return df
    df[col] = pd.Categorical(df[col]) # The important part.
    return df
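
As a quick illustration of the line marked "the important part", this sketch converts a made-up integer column and inspects the result. The column values are assumptions for the example.

```python
import pandas as pd

toy = pd.DataFrame({"Pclass": [3, 1, 3, 2]})   # made-up rows
print(toy["Pclass"].dtype)                     # an integer dtype before conversion

toy["Pclass"] = pd.Categorical(toy["Pclass"])
print(toy["Pclass"].dtype.name)                # 'category'
print(list(toy["Pclass"].cat.categories))      # the distinct values, sorted: [1, 2, 3]
```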

In [14]:
tr.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SiblingsSpouses,ParentsChildren,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,8.05,,S


In [0]:
# Makes these columns explicit categories:
categories = ["Survived", "Pclass", "Sex"]
for col in categories:
    makeCategorical(col, tr)

In [16]:
# Shows how many categories each categorical column has.
preview = ""
for i in range(len(categories)): # For the categories we just made...
    col = categories[i]
    preview = preview + str(col) + ": " + str(len(tr[col].cat.categories)) + "\n" # We'll get into the slicing a bit more in a second, don't worry.
print(preview)

Survived: 2
Pclass: 3
Sex: 2



## Slicing DataFrames
We just saw a pretty dense example; let's break down how DataFrames can be sliced.

In [17]:
# We can slice by rows pretty intuitively.
# Note: These slices do not include the final index.
tr[0:3]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SiblingsSpouses,ParentsChildren,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.925,,S


In [18]:
# Check the last 5 observations.
tr.tail(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SiblingsSpouses,ParentsChildren,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,7.75,,Q


In [19]:
# We can also index from the end of the DataFrame: -1 is the final row. The slice end is excluded, so -4:-1 leaves out the last row.
tr[-4:-1]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SiblingsSpouses,ParentsChildren,Fare,Cabin,Embarked
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,30.0,C148,C


In [20]:
# Finally, we can slice in both directions:
# Order is row, column.
tr.loc["0":"5", "Name":"Age"] # Note that the start and end of this slice are included.

Unnamed: 0,Name,Sex,Age
0,"Braund, Mr. Owen Harris",male,22.0
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0
2,"Heikkinen, Miss. Laina",female,26.0
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0
4,"Allen, Mr. William Henry",male,35.0
5,"Moran, Mr. James",male,


The row indices are Strings here because we passed index = str to rename earlier, which is a bit confusing. Some explanation:
* https://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-label

In [21]:
# We can also select specific rows or columns.
listOfImportantPeople = ["0", "5", "10", "420", "500"] # We only care about these people.
features = ["Name", "Age", "Cabin"]
tr.loc[listOfImportantPeople, features]

Unnamed: 0,Name,Age,Cabin
0,"Braund, Mr. Owen Harris",22.0,
5,"Moran, Mr. James",,
10,"Sandstrom, Miss. Marguerite Rut",4.0,G6
420,"Gheorgheff, Mr. Stanio",,
500,"Calic, Mr. Petar",17.0,


## Fixing Null Ages
Recall: Our Age column had some missing observations. We should work on that.

But are the ages of the males and females on the Titanic different enough to justify averaging them separately?

There's only one way to find out...

In [22]:
# Now, how do we access just one column?
sexLabels = tr["Sex"] # Selects the "Sex" column.
sexLabels.head(5)

0      male
1    female
2    female
3    female
4      male
Name: Sex, dtype: category
Categories (2, object): [female, male]

In [23]:
# It's a Series. Pandas documentation describes these as "One-dimensional ndarray with axis labels"
type(sexLabels)

pandas.core.series.Series

In [24]:
# What if we wanted a Series of booleans, in order to select the correct observations?
# We can select observations by label:
males_bool = sexLabels == "male"
females_bool = sexLabels == "female"
females_bool.head(5)

0    False
1     True
2     True
3     True
4    False
Name: Sex, dtype: bool

In [25]:
# Finally, let's find those examples.
males = tr.loc[males_bool]
males.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SiblingsSpouses,ParentsChildren,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.25,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,21.075,,S
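
Boolean Series like these can also be combined with `&` (and) and `|` (or) before being handed to `.loc`. A short sketch on made-up rows; the age threshold is an arbitrary assumption for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 4.0, 54.0],
})  # made-up rows

is_female = toy["Sex"] == "female"
is_adult = toy["Age"] >= 18

# Named masks combine directly; inline comparisons need parentheses
# because & and | bind more tightly than == and >=.
adult_women = toy.loc[is_female & is_adult]
women_or_adults = toy.loc[(toy["Sex"] == "female") | (toy["Age"] >= 18)]

print(adult_women)
print(len(women_or_adults))
```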


In [0]:
# Using what we've seen, create two dataframes: one with the male observations and one with the female observations.
tr_male = tr.loc[tr["Sex"] == "male"]
tr_female = tr.loc[tr["Sex"] == "female"]

In [27]:
# Previews the male dataframe.
tr_male.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SiblingsSpouses,ParentsChildren,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.25,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,8.4583,,Q


In [28]:
len(tr_male)

577

In [29]:
# Previews the female dataframe.
tr_female.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SiblingsSpouses,ParentsChildren,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1,C123,S


In [30]:
len(tr_female)

314

Let's create a function to replace the null values of a given column with the column's average. First, find the average of the Age column.

Pretty simple:

In [31]:
tr_male["Age"].mean()

30.72664459161148

In [32]:
tr_female["Age"].mean() # Small difference, but we can pretend it justifies this work.

27.915708812260537

Now, replace null values with that average. There's a function for that:

.fillna(value, inplace = boolean)


inplace controls whether the nulls are filled in the existing DataFrame (True) or in a newly created and returned DataFrame (False). Defaults to False.
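
The difference is easiest to see side by side. This is a minimal sketch on made-up rows; passing a dict as `value` restricts the fill to the listed columns.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Age": [22.0, np.nan], "Cabin": [None, "C85"]})  # made-up rows

# Default (inplace=False): a new DataFrame is returned, the original is unchanged.
filled = toy.fillna(value={"Age": 30.0})
print(toy["Age"].isna().sum())       # still 1 null in the original
print(filled["Age"].isna().sum())    # 0 nulls in the returned copy
print(filled["Cabin"].isna().sum())  # Cabin was not in the dict, so it keeps its null

# inplace=True: toy itself is modified.
toy.fillna(value={"Age": 30.0}, inplace=True)
print(toy["Age"].isna().sum())       # now 0
```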

In [0]:

def fixNulls(colName, df):
    """
    Replaces null values in a given column with the average of the column. Returns fixed dataframe.
    
    Parameters
    ----------
    colName : String
        The name of the column to fix.
    df : pandas DataFrame
        The dataframe to be altered.
    """
    average = df[colName].mean()
    df1 = df.fillna(value = {colName : average})
    return df1

In [34]:
# Previews the male dataframe's "Age" column before null fixing.
tr_male.loc[:, "Age"].head(10)

0     22.0
4     35.0
5      NaN
6     54.0
7      2.0
12    20.0
13    39.0
16     2.0
17     NaN
20    35.0
Name: Age, dtype: float64

In [0]:
tr_male = fixNulls("Age", tr_male)
tr_female = fixNulls("Age", tr_female)

In [36]:
# Shows the male dataframe's "Age" column after null fixing.
tr_male.loc[:, "Age"].head(10)

0     22.000000
4     35.000000
5     30.726645
6     54.000000
7      2.000000
12    20.000000
13    39.000000
16     2.000000
17    30.726645
20    35.000000
Name: Age, dtype: float64

In [37]:
# We see that the rest of the DataFrame is untouched: Cabin still has its nulls.
tr_male.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SiblingsSpouses,ParentsChildren,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.25,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,8.05,,S
5,6,0,3,"Moran, Mr. James",male,30.726645,0,0,8.4583,,Q


In [0]:
# Recombines the male and female dataframes back into a single dataframe.
tr = pd.concat([tr_male, tr_female]) # DataFrame.append was removed in pandas 2.0; pd.concat does the same job.

In [39]:
tr.sort_values(["PassengerId"], inplace = True) # Because we just threw all female observations to the end of the frame.
tr.tail(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SiblingsSpouses,ParentsChildren,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,27.915709,1,2,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,7.75,,Q
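
The split, fill, recombine, and sort steps above can also be collapsed into one step with `groupby(...).transform("mean")`. This is an alternative sketch rather than the approach used in this notebook, and the rows are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "female", "male"],
    "Age": [20.0, 28.0, np.nan, np.nan, 40.0],
})  # made-up rows

# For every row, the mean Age of that row's Sex group (NaN values are skipped).
group_means = toy.groupby("Sex")["Age"].transform("mean")

# Fill only the missing ages; the row order is never disturbed.
toy["Age"] = toy["Age"].fillna(group_means)
print(toy)
```

Because the rows never leave the original frame, no re-sorting is needed afterwards.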


In [40]:
# Check our nulls...
nullsPerCol = tr.isna().sum()
nullsPerCol

PassengerId          0
Survived             0
Pclass               0
Name                 0
Sex                  0
Age                  0
SiblingsSpouses      0
ParentsChildren      0
Fare                 0
Cabin              687
Embarked             2
dtype: int64

We still have a bit of work to do, then.

In [41]:
# Yeah, Cabin is basically useless.
tr.drop(["Cabin"], axis = 1, inplace = True)
tr.head(1)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SiblingsSpouses,ParentsChildren,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.25,S


In [42]:
# Those pesky two people with null embarked.
tr[nulls["Embarked"]]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SiblingsSpouses,ParentsChildren,Fare,Embarked
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,80.0,
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,80.0,


In [43]:
tr = tr.fillna(value = {"Embarked" : "C"})
tr[60:63]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SiblingsSpouses,ParentsChildren,Fare,Embarked
60,61,0,3,"Sirayanian, Mr. Orsen",male,22.0,0,0,7.2292,C
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,80.0,C
62,63,0,1,"Harris, Mr. Henry Birkhardt",male,45.0,1,0,83.475,S


In [44]:
tr[nulls["Embarked"]]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SiblingsSpouses,ParentsChildren,Fare,Embarked
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,80.0,C
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,80.0,C


In [45]:
# One last null check...
nullsPerCol = tr.isna().sum()
nullsPerCol

PassengerId        0
Survived           0
Pclass             0
Name               0
Sex                0
Age                0
SiblingsSpouses    0
ParentsChildren    0
Fare               0
Embarked           0
dtype: int64

### Extra Info
If you check the documentation, Pandas actually has a ton of ways to access data from a DataFrame. Plain index access is the simplest, but .loc and .iloc are both quite powerful as well.

.iloc selects by integer position and .loc selects by label (Strings in our case, as we've seen), but more info is here:

https://stackoverflow.com/questions/31593201/pandas-iloc-vs-ix-vs-loc-explanation-how-are-they-different
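
A compact sketch of the difference, on a made-up frame whose string labels mimic what `rename(index = str)` produced for `tr`:

```python
import pandas as pd

toy = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cy", "Di"], "Age": [31.0, 45.0, 22.0, 38.0]},
    index=["0", "1", "2", "3"],   # string labels, made up for this sketch
)

by_label = toy.loc["1":"2", "Name":"Age"]   # label slice: both ends included
by_position = toy.iloc[1:2, 0:2]            # position slice: end excluded

print(len(by_label))      # 2 rows
print(len(by_position))   # 1 row

# After reordering, positions change but labels travel with their rows.
shuffled = toy.sort_values("Age")
print(shuffled.iloc[0]["Name"])    # first row by position: the youngest
print(shuffled.loc["0"]["Name"])   # the row labelled '0', wherever it now sits
```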

## How Do We Use Our Data?

So we've seen how we can slice, fix up, and analyze our data a bit. We could certainly do a lot more, but at the end of the day, how do we determine which features are important in determining whether or not someone died on the Titanic?

First, some intuitions on Information Theory: https://bit.ly/InformationTheoryLink

And now some context on decision trees: https://bit.ly/DecisionTreeLink


So we want to be able to calculate the entropy of the subsets resulting from a potential split, which shows us which splits are actually useful. For a single set, the entropy can be calculated by:

$E = -\sum_{c=1}^{m} p_{c}\log_2(p_{c}).$

Here there are $m$ classes, and $p_{c}$ is the proportion of observations in class $c$.

Or in the case of just two classes (Survived or not):

$E = -p\log_2(p)-q\log_2(q).$

$p$ is the proportion of one class, $q$ the other.
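
To get a feel for the formula, here is a small standalone sketch that evaluates it directly from class proportions. The helper name `entropy_from_proportions` is made up for this example. The survivor count of 342 follows from the `describe()` output earlier: the mean of Survived is 0.383838 over 891 rows.

```python
import math

def entropy_from_proportions(proportions):
    """Shannon entropy in bits; classes with proportion 0 contribute nothing."""
    return sum(-p * math.log2(p) for p in proportions if p > 0)

print(entropy_from_proportions([0.5, 0.5]))   # an evenly mixed two-class set
print(entropy_from_proportions([1.0, 0.0]))   # a pure set

# 342 of the 891 training passengers survived.
p = 342 / 891
print(entropy_from_proportions([p, 1 - p]))
```

An even split gives the two-class maximum of 1 bit, a pure set gives 0, and the Titanic proportions land at about 0.96.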

In [46]:
# Because we finally fixed up the Embarked, we can categorize that:
makeCategorical("Embarked", tr)
categories.append("Embarked")
categories

['Survived', 'Pclass', 'Sex', 'Embarked']

In [0]:
def split(feature, df):
    """
    Splits a dataframe into sets based on the given categorical column.
    Returns a list of DataFrames.
    
    Parameters
    ----------
    feature : String
        The feature for dataframe to be split on.
    df : pandas DataFrame
        The dataframe to be split.
    """
    groups = df[feature].cat.categories
    categoryNum = len(groups)
    output = [] # Will be a list of DataFrames.
    for i in range(categoryNum): # For each of the categories of the feature given:
        currentCategorySeries = df[feature] == groups[i]
        output.append(df[currentCategorySeries]) # Append the df of the rows where the internal condition is true.
    return output

In [0]:
def entropy(df):
    """
    Calculates the "Survived" entropy for a given DataFrame.
    
    df : pandas DataFrame
        The dataframe to be considered.
    """
    totalObservations = float(len(df))
    survivedSeries = df["Survived"].value_counts()
    # print(survivedSeries)
    totalDied = float(survivedSeries[0]) # Looks up label 0, the count of those that died.
    totalSurvived = float(survivedSeries[1]) # Looks up label 1, the count of those that survived.
    p = totalDied / totalObservations
    q = totalSurvived / totalObservations
    if (p == 0 or q == 0): # A pure set has zero entropy; this also avoids log(0).
        return 0.
    return float((-p * np.log2(p)) - (q * np.log2(q)))

In [0]:
def entropyList(dfList):
    """
    Calculates the 'Survived' entropy for a given list of DataFrames.
    
    dfList : list of pandas DataFrames
        The dataframes to be considered.
    """
    totalLength = 0.
    for df in dfList: # Sum up total length of the DataFrames.
        totalLength = float(totalLength + len(df))
    
    summedEntropy = 0.
    
    for df in dfList:
        e = entropy(df)
        weight = float(len(df)) / totalLength # We weight each entropy by proportion.
        weightedEntropy = float(e * weight)
        summedEntropy = float(summedEntropy + weightedEntropy)
        
    return summedEntropy

In [50]:
# As it stands, about 38% of our examples survived, which is fairly close to an even split, hence an entropy near the maximum of 1.
entropy(tr)

0.9607079018756469

### Credits for Feature Insights

Many of the ideas for the features I'm going to create here come from this great source:
https://www.kaggle.com/startupsci/titanic-data-science-solutions

In [51]:
tr["isAlone"] = 0 # Adds a new column and sets its values to 0.
tr.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SiblingsSpouses,ParentsChildren,Fare,Embarked,isAlone
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.25,S,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.925,S,0


In [52]:
# Create Series indicating each person's family size.
familySizes = tr["ParentsChildren"] + tr["SiblingsSpouses"] + 1
familySizes.tail(10)

881    1
882    1
883    1
884    1
885    6
886    1
887    1
888    4
889    1
890    1
dtype: int64

In [53]:
# Sets a new column, isAlone, to booleans indicating if the person is alone or not.
tr["isAlone"] = familySizes == 1 # Family size of 1 must be alone.
tr.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SiblingsSpouses,ParentsChildren,Fare,Embarked,isAlone
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.25,S,False
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C,False
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.925,S,True
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1,S,False
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,8.05,S,True


In [54]:
makeCategorical("isAlone", tr)
categories.append("isAlone")
categories

['Survived', 'Pclass', 'Sex', 'Embarked', 'isAlone']

In [55]:
# Creates two DataFrames which are split along the "Sex" feature using the method made above. Getting 2 DataFrames makes sense.
sexSplit = split("Sex", tr)
len(sexSplit)

2

In [56]:
entropyList(sexSplit)

0.7430477952150327

In [0]:
def informationGain(col, df):
    """
    Finds the potential information gain based on splitting the given DataFrame
    on the specified categorical column.
    
    col : String
        Which feature to find the information gain for.
    df : pandas DataFrame
        The DataFrame to process.
    """
    startingEntropy = entropy(df)
    splitList = split(col, df)
    endingEntropy = entropyList(splitList)
    gain = float(startingEntropy - endingEntropy) # The gain is the decrease in entropy.
    return gain

In [58]:
# Basically, if we split by the literal answer, we'll improve the entropy from where it is to 0. Makes sense.
informationGain("Survived", tr)

0.9607079018756469

In [59]:
for c in categories:
    print(c, informationGain(c, tr))

Survived 0.9607079018756469
Pclass 0.0838310452960116
Sex 0.2176601066606142
Embarked 0.022147406355340515
isAlone 0.029708896074360225
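
To see the two extremes of information gain, here is a compact sketch on a made-up four-row table. It uses `groupby` and `value_counts(normalize=True)` instead of the `split`, `entropy`, and `entropyList` helpers above, so the function names `label_entropy` and `gain` are assumptions for this example only.

```python
import numpy as np
import pandas as pd

def label_entropy(labels):
    """Entropy (in bits) of a Series of class labels."""
    shares = labels.value_counts(normalize=True)
    shares = shares[shares > 0]              # drop empty classes to avoid log2(0)
    return float(-(shares * np.log2(shares)).sum())

def gain(df, feature, target):
    """Entropy of target minus the size-weighted entropy after grouping by feature."""
    weighted = sum(
        len(group) / len(df) * label_entropy(group[target])
        for _, group in df.groupby(feature)
    )
    return label_entropy(df[target]) - weighted

toy = pd.DataFrame({
    "Survived": [1, 1, 0, 0],
    "Perfect":  ["a", "a", "b", "b"],   # separates the classes completely
    "Useless":  ["x", "y", "x", "y"],   # each group is still half and half
})  # made-up rows

print(gain(toy, "Perfect", "Survived"))
print(gain(toy, "Useless", "Survived"))
```

A feature that separates the classes completely recovers the full starting entropy, while one whose groups mirror the overall mix gains nothing.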


### Concept Check: Information Gain
Hopefully it makes some sense why we cared about the "Sex" feature when processing our ages: of the features we tested, it gives by far the largest information gain.

Note: As you can see, Embarked is a practically useless feature. I therefore stand by the unscientific treatment we used to fix its null values.

In [0]:
class DecisionTree:
    """
    A decision tree splits a dataset into subsets of data, working to create pure subsets. Later, by following the determined path through the chosen
    splits, the tree can classify new observations. Other optimization methods such as ensembling many of these trees together can improve on this idea.
    """
    
    # Homework...

# Questions?