# Processing Data for Decision Trees in Pandas
By Benned Hedegaard

We'll use Kaggle's Titanic dataset to learn how to analyze, reformat, categorize, and slice data in Pandas DataFrames. We'll also delve a bit into the intuitions behind decision trees and some information theory. Note: We will NOT get to implementing decision trees themselves. Most of the tools needed to do so, though, are explained or implemented in this notebook.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/12z93hzXqv5MPtUNb_BjD-GF0vfr5iwOI)

## Importing Data

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# Import the data from my GitHub
!pip install -q xlrd
!git clone https://github.com/Benendead/Titanic_Pandas_Tutorial

Cloning into 'Titanic_Pandas_Tutorial'...
remote: Enumerating objects: 8, done.
remote: Counting objects: 100% (8/8), done.
remote: Compressing objects: 100% (7/7), done.
remote: Total 8 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (8/8), done.


In [3]:
# Look at where our data is.
!ls Titanic_Pandas_Tutorial/Data

test.csv  train.csv


In [4]:
# Read in our dataset and give a preview of its contents.
tr = pd.read_csv("Titanic_Pandas_Tutorial/Data/train.csv")
tr.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


In [5]:
# We can see that pd.read_csv gives us a DataFrame.
type(tr)

pandas.core.frame.DataFrame

## Analyzing Our Data

In [6]:
"""
Survey the variables we have for each person.

PassengerId - Unique identifiers for each passenger.
Survived - If the passenger survived. 0 = No, 1 = Yes
Pclass - Ticket class. 1 = 1st, 2 = 2nd, 3 = 3rd
Name - Name of the passenger. String
Sex - Sex of the passenger. "male" or "female"
Age - Age of the passenger. Float (infants have fractional ages)
SibSp - # of siblings/spouses passenger had on board. Integer
Parch - # of parents/children passenger had on board. Integer
Ticket - Ticket number. String
Fare - Passenger's fare. Float
Cabin - Cabin number. String
Embarked - Port of embarkation. 
        C = Cherbourg, Q = Queenstown, S = Southampton
"""

tr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


In [0]:
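# These column names are unclear. Let's fix that.
# Note: index = str also converts the row labels to strings, which matters when we slice with .loc later.
tr = tr.rename(index = str, columns = {"SibSp": "SiblingsSpouses", "Parch": "ParentsChildren"}) # Returns the renamed dataframe.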

In [8]:
# Survey some information about the numerical variables.
tr.describe() # Method of the DataFrame class

Unnamed: 0,PassengerId,Survived,Pclass,Age,SiblingsSpouses,ParentsChildren,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


Observations:
* Every numerical feature has 891 values except Age, which has only 714. We'll need to fix that.
* What percent of people survived?
* What was the age of the oldest person?
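
Both questions can be answered directly from the columns. Here is a minimal sketch on a small made-up frame (the `toy` values are illustrative, not the Titanic data); the same calls should work on `tr` as long as `Survived` is still numeric:

```python
import pandas as pd

# Toy data standing in for the Titanic columns (values are made up).
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0],
    "Age": [22.0, 38.0, 26.0, None, 80.0],
})

# The mean of a 0/1 column is the fraction of ones.
survival_rate = toy["Survived"].mean()

# max() skips missing values by default.
oldest = toy["Age"].max()

print(f"{survival_rate:.1%} survived; the oldest passenger was {oldest}")
```

In the `describe()` output above, these correspond to the `mean` of `Survived` (about 0.38) and the `max` of `Age` (80).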

In [9]:
# We noticed that the Age category was missing some values. Let's check for null values across the DataFrame.
# Gives a dataframe of booleans showing which values in the training set are null.
# We'll use this later, maybe.
nulls = pd.isna(tr)
nulls.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SiblingsSpouses,ParentsChildren,Ticket,Fare,Cabin,Embarked
0,False,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,True,False
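
A boolean frame like this is easiest to read once it is summarized. As a small sketch on made-up data (the `toy` frame is an assumption for illustration), summing the booleans gives the number of missing values per column:

```python
import pandas as pd

# Toy frame with a few missing values (made-up data).
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0],
    "Cabin": [None, "C85", None],
    "Fare": [7.25, 71.28, 7.93],
})

# isna() gives a frame of booleans; True counts as 1 when summed,
# so sum() yields the number of nulls in each column.
null_counts = toy.isna().sum()
print(null_counts)
```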


In [0]:
def makeCategorical(col, df):
    """
    Makes a given column categorical in the dataframe. Returns fixed dataframe.
    
    Parameters
    ----------
    col : String
        The name of the column to make categorical.
    df : pandas DataFrame
        The dataframe to be altered.
    """
    df[col] = pd.Categorical(df[col]) # The important part.
    return df

In [0]:
# Makes these columns explicit categories:
categories = ["Sex", "Survived", "Pclass", "SiblingsSpouses", "ParentsChildren"]
for col in categories:
    makeCategorical(col, tr)

In [12]:
# Shows how many categories each categorical column has.
preview = ""
for col in categories: # For the categories we just made
    preview = preview + str(col) + ": " + str(len(tr[col].cat.categories)) + "\n" # We'll get into this a bit more in a second, don't worry.
print(preview)

Sex: 2
Survived: 2
Pclass: 3
SiblingsSpouses: 7
ParentsChildren: 7
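
To see what the `.cat` accessor exposes, here is a small sketch on a made-up Series (the names `col`, `labels`, and `codes` are only for illustration):

```python
import pandas as pd

# A tiny column converted the same way makeCategorical does it.
col = pd.Series(pd.Categorical(["male", "female", "female", "male"]))

# .cat.categories lists the distinct labels, sorted.
labels = list(col.cat.categories)

# .cat.codes stores each row as an integer position into that list.
codes = list(col.cat.codes)

print(labels, codes)
```

Counting the entries of `.cat.categories` is what produces the per-column numbers printed above.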



## Slicing DataFrames

In [13]:
# We can slice by rows pretty intuitively. These slices do not include the final index.
tr[0:3]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SiblingsSpouses,ParentsChildren,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


In [14]:
# Check the last 5 observations.
tr.tail(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SiblingsSpouses,ParentsChildren,Ticket,Fare,Cabin,Embarked
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,0,0,370376,7.75,,Q


In [15]:
# We can also index from the end of the DataFrame. The final row is index -1, and the slice still excludes its end index, so this skips the last row.
tr[-4:-1]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SiblingsSpouses,ParentsChildren,Ticket,Fare,Cabin,Embarked
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0,C148,C


In [16]:
# Finally, we can slice in both directions:
tr.loc["0":"5", "Name":"Age"] # Note that the start and end of this slice are included.

Unnamed: 0,Name,Sex,Age
0,"Braund, Mr. Owen Harris",male,22.0
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0
2,"Heikkinen, Miss. Laina",female,26.0
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0
4,"Allen, Mr. William Henry",male,35.0
5,"Moran, Mr. James",male,


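The row indices are Strings here because the earlier rename call passed index = str, which converted every row label to a String. Also note that .loc slices by label, so both endpoints are included. Some explanation: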
* https://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-label
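
The difference between label-based and position-based slicing can be sketched on a small made-up frame whose index holds string labels, like `tr` after `rename(index = str)`. The `toy` frame and its values are assumptions for illustration:

```python
import pandas as pd

toy = pd.DataFrame(
    {"Name": ["A", "B", "C", "D"], "Age": [22.0, 38.0, 26.0, 35.0]},
    index=["0", "1", "2", "3"],  # string row labels
)

# .loc slices by label and includes both endpoints.
by_label = toy.loc["0":"2", "Name"]

# .iloc slices by integer position and excludes the end, like a Python list.
by_position = toy.iloc[0:2, 0]

print(len(by_label), len(by_position))
```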

In [17]:
# We can also select specific rows or columns.
listOfImportantPeople = ["0","5","10", "420", "500"] # We only care about these people.
features = ["Name", "Age", "Cabin"]
tr.loc[listOfImportantPeople, features]

Unnamed: 0,Name,Age,Cabin
0,"Braund, Mr. Owen Harris",22.0,
5,"Moran, Mr. James",,
10,"Sandstrom, Miss. Marguerite Rut",4.0,G6
420,"Gheorgheff, Mr. Stanio",,
500,"Calic, Mr. Petar",17.0,


## Fixing Null Ages
Recall: Our Age column had some missing observations. We should work on that.

But are the ages of the males and females on the Titanic different enough to justify averaging them separately?

There's only one way to find out...

In [18]:
# Now, how do we access just one column?
sexLabels = tr["Sex"] # Selects the "Sex" column.
sexLabels.head(5)

0      male
1    female
2    female
3    female
4      male
Name: Sex, dtype: category
Categories (2, object): [female, male]

In [19]:
# It's a Series. Pandas documentation describes these as "One-dimensional ndarray with axis labels"
type(sexLabels)

pandas.core.series.Series

In [20]:
# What if we wanted a Series of booleans, in order to select the correct observations?
# We can select observations by label:
males_bool = sexLabels == "male"
females_bool = sexLabels == "female"
females_bool.head(5)

0    False
1     True
2     True
3     True
4    False
Name: Sex, dtype: bool

In [21]:
# Finally, let's find those examples.
males = tr.loc[males_bool]
males.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SiblingsSpouses,ParentsChildren,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S


In [0]:
# Using what we've seen, create two dataframes: one with the male observations and one with the female observations.
tr_male = tr.loc[tr["Sex"] == "male"]
tr_female = tr.loc[tr["Sex"] == "female"]

In [23]:
# Previews the male dataframe.
tr_male.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SiblingsSpouses,ParentsChildren,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q


In [24]:
# Previews the female dataframe.
tr_female.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SiblingsSpouses,ParentsChildren,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S


Let's create a function to replace null values in a given column with the column's average. First, find the average of the Age column.

Pretty simple:

In [25]:
tr_male["Age"].mean()

30.72664459161148

Now, replace null values with that average. There's a function for that:

.fillna(value, inplace = boolean)


inplace controls whether the nulls are replaced in the existing object or in a newly created and returned one. It defaults to False.
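
A quick sketch of the default behaviour on a made-up Series (the `ages` values are illustrative): without `inplace = True`, the original is left alone and the filled result is returned.

```python
import pandas as pd

ages = pd.Series([22.0, None, 54.0])

# Default (inplace=False): returns a new Series; ages itself is unchanged.
# mean() ignores the missing value, so the average here is 38.0.
filled = ages.fillna(ages.mean())

print(ages.isna().sum())  # ages still contains its null
print(filled.tolist())
```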

In [0]:
def fixNulls(colName, df):
    """
    Replaces null values in a given column with the average of the column. Returns fixed dataframe.
    
    Parameters
    ----------
    colName : String
        The name of the column to fix.
    df : pandas DataFrame
        The dataframe to be altered.
    """
    average = df[colName].mean()
    # Puts the replacements directly into the dataframe.
    values = {colName : average}
    df.fillna(value = values, inplace=True)
    return df

In [27]:
# Previews the male dataframe's "Age" column before null fixing.
tr_male.loc[:, "Age"].head(10)

0     22.0
4     35.0
5      NaN
6     54.0
7      2.0
12    20.0
13    39.0
16     2.0
17     NaN
20    35.0
Name: Age, dtype: float64

In [28]:
tr_male = fixNulls("Age", tr_male)
tr_female = fixNulls("Age", tr_female)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._update_inplace(new_data)


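DISCLAIMER: The message here is a warning (pandas' SettingWithCopyWarning), not an error. It appears because tr_male and tr_female were created by filtering tr and are then modified in place, so pandas cannot tell whether we meant to change tr as well. I edited my syntax in the function to exactly what the linked section suggests and it still complains. Don't blame me, the function still works: the nulls are filled, as the next cell shows.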
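
One common way to avoid this message is to take an explicit copy when filtering, so that later edits clearly apply to the new frame only. This is a sketch on made-up data, not a change to the cells above; `toy` and `toy_male` are hypothetical names:

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"],
    "Age": [20.0, 30.0, None, None],
})

# .copy() makes an independent frame, so modifying it is unambiguous.
toy_male = toy.loc[toy["Sex"] == "male"].copy()

# Assign the filled column back instead of filling in place.
toy_male["Age"] = toy_male["Age"].fillna(toy_male["Age"].mean())

print(toy_male["Age"].tolist())
```

The original `toy` frame keeps its nulls, which is the point of working on a copy.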

In [29]:
# Shows the male dataframe's "Age" column after null fixing.
tr_male.loc[:, "Age"].head(10)

0     22.000000
4     35.000000
5     30.726645
6     54.000000
7      2.000000
12    20.000000
13    39.000000
16     2.000000
17    30.726645
20    35.000000
Name: Age, dtype: float64

In [0]:
# Recombines the male and female dataframes back into a single dataframe.
# DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions.
tr = pd.concat([tr_male, tr_female])
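
As an alternative to splitting, filling, and recombining, the per-group averages can be filled in one pass with `groupby`. This is a sketch of a different technique from the one used above, shown on made-up data; it also keeps the rows in their original order:

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "female", "male"],
    "Age": [20.0, 30.0, None, None, 50.0],
})

# transform("mean") returns a Series aligned with toy, where each row
# holds the average Age of that row's Sex group.
group_mean = toy.groupby("Sex")["Age"].transform("mean")

# Fill each null with the average of its own group.
toy["Age"] = toy["Age"].fillna(group_mean)

print(toy["Age"].tolist())
```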

## How Do We Use Our Data?
So we've seen how we can slice, fix up, and analyze our data a bit. We could certainly do a lot more, but at the end of the day, how do we determine which features are important in determining whether or not someone died on the Titanic?

First, some context on decision trees: https://bit.ly/DecTrees

And then Information Theory: https://bit.ly/InformationTheory


So we want to be able to calculate the Gini score of a split, merely to demonstrate which
splits are actually useful. The Gini score of a group can be calculated by:

$\sum_{k=1}^{m}(p_{k}\times(1-p_{k})),$

where $m$ is the number of classes and $p_{k}$ is the proportion of the group's inputs that belong to class $k$.
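
As a worked example of the formula, here is a small sketch that computes the score for one group from a plain list of class labels. The function name `gini_impurity` is hypothetical and is not used elsewhere in this notebook:

```python
from collections import Counter

def gini_impurity(labels):
    """Sum over classes of p_k * (1 - p_k) for a single group."""
    n = len(labels)
    if n == 0:
        return 0.0  # An empty group has no impurity.
    counts = Counter(labels)  # Number of inputs in each class.
    return sum((c / n) * (1 - c / n) for c in counts.values())

pure = gini_impurity([1, 1, 1, 1])   # Only one class present.
mixed = gini_impurity([0, 1, 0, 1])  # Two classes, evenly mixed.
print(pure, mixed)
```

A group containing a single class scores 0, and an even two-class mix scores 0.5, the maximum for two classes.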

In [0]:
def split(col, df):
    """
    Splits a dataframe into a list of dataframes based on the given categorical column.
    
    Parameters
    ----------
    col : String
        The name of the categorical column to split on.
    df : pandas DataFrame
        The dataframe to be split.
    """
    groups = df[col].cat.categories
    output = [] # Will be a list of DataFrames, one per category.
    for group in groups: # For each of the categories of the feature given:
        output.append(df[df[col] == group]) # Append the rows whose value in col equals this category.
    return output

In [0]:
def giniScore(splitData):
    """
    Takes a list of dataframes following some split over one of the features. Returns the split's Gini score.
    
    Parameters
    ----------
    splitData : list of pandas DataFrames
        The list resulting from a split whose Gini score we're calculating.
    """
    total = 0
    for df in splitData:    # Sums over each of the groups produced by the split.
        totalPpl = len(df)  # The number of examples (people) in the dataframe is its length.
        if totalPpl == 0:   # An empty group contributes nothing (and would divide by zero).
            continue
        survived = len(df[df["Survived"] == 1]) # The number of people whose "Survived" was 1. These people lived.
        pk = float(survived) / float(totalPpl)  # The fraction of people in this group who survived.
        total += pk * (1. - pk)
    return total

In [33]:
# Shows the Gini score of each categorical column.
preview = ""
for col in categories:           # Recall that we created a list of the 5 categorical columns' names.
    splitList = split(col, tr)   # This will be a list of DataFrames each containing observations of a given category in this column
    gini = giniScore(splitList)  # The Gini score of splitting on this column.
    preview = preview + str(col) + ": " + str(gini) + "\n"
print(preview)

Sex: 0.34463935983810007
Survived: 0.0
Pclass: 0.6660806692837389
SiblingsSpouses: 1.0499228460447578
ParentsChildren: 1.1229716579017808



In [34]:
# Demonstrates that the functions I've made give a Gini score of 0 for a pure split.
giniScore(split("Survived", tr))

0.0
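
Note that giniScore adds up one term per group without accounting for group size, which is why columns with many categories can score above 1. A common variant weights each group's impurity by its share of the rows. The sketch below shows that variant on made-up data; `weighted_gini` is a hypothetical name and is not part of this notebook:

```python
import pandas as pd

def weighted_gini(groups, target="Survived"):
    """Size-weighted average of each group's Gini impurity."""
    total_rows = sum(len(g) for g in groups)
    score = 0.0
    for g in groups:
        if len(g) == 0:
            continue  # Empty groups carry no weight.
        # Proportion of each class within this group.
        proportions = g[target].value_counts(normalize=True)
        impurity = (proportions * (1 - proportions)).sum()
        score += impurity * len(g) / total_rows
    return score

toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female"],
    "Survived": [0, 0, 1, 1],
})
groups = [toy[toy["Sex"] == s] for s in ["male", "female"]]
score = weighted_gini(groups)
print(score)
```

Here the male group has impurity 4/9 and holds 3 of the 4 rows, while the female group is pure, so the weighted score is 1/3.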

In [0]:
class DecisionTree:
    """
    A decision tree splits a dataset into subsets of data, working to create pure subsets. By following the same set
    of splits, one can classify new data put through the decision tree.
    """
    # Homework...