# 09.01.01 - PreparingData

## Purpose

This notebook extends the earlier pandas work using a new reference/sample data set, and shows how to prepare that data accordingly.

## Libraries

* Pandas
* Seaborn

## References/Reading
* Seaborn load_dataset - https://seaborn.pydata.org/generated/seaborn.load_dataset.html
* Pandas get_dummies - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html
* Tutorial including more visualization - https://www.geeksforgeeks.org/python-titanic-data-eda-using-seaborn/

In [1]:
import pandas as pd
from seaborn import load_dataset

In [2]:
titanicDataSet = load_dataset("titanic")
titanicDataSet

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [3]:
# We have two helper methods that can help us see more details, although the above already gives the shape
titanicDataSet.shape

(891, 15)

In [4]:
# We've seen this before; it tells us what type of data we have in the different
# columns.  Before, we were worried about the date columns.  Now we
# want to pay attention to the non-numbers (objects, for example)
titanicDataSet.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   survived     891 non-null    int64   
 1   pclass       891 non-null    int64   
 2   sex          891 non-null    object  
 3   age          714 non-null    float64 
 4   sibsp        891 non-null    int64   
 5   parch        891 non-null    int64   
 6   fare         891 non-null    float64 
 7   embarked     889 non-null    object  
 8   class        891 non-null    category
 9   who          891 non-null    object  
 10  adult_male   891 non-null    bool    
 11  deck         203 non-null    category
 12  embark_town  889 non-null    object  
 13  alive        891 non-null    object  
 14  alone        891 non-null    bool    
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.7+ KB


# Defining a data dictionary, understanding the data

It's important that you understand the type of data you're dealing with and what is included.  Let's take a few columns and run unique() on them.  You can document the results elsewhere, or in a summary object if needed.
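
One possible shape for such a summary object is a small DataFrame with one row per column. This is only a sketch: the `sample` frame and the `summarize_columns` helper are hypothetical stand-ins, not part of the notebook, and the values are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for a few Titanic columns; values are illustrative only.
sample = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "sex": ["male", "female", "female", "male"],
    "age": [22.0, 38.0, None, 35.0],
})

def summarize_columns(frame):
    """Return one row per column: dtype, distinct non-null values, and null count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "unique_values": frame.nunique(),   # nunique() ignores NaN by default
        "null_count": frame.isna().sum(),
    })

summary = summarize_columns(sample)
print(summary)
```

A table like this gives a quick first pass at which columns look categorical (few distinct values) and which contain nulls.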

In [5]:
titanicDataSet['survived'].unique()

array([0, 1])

In [6]:
titanicDataSet['sex'].unique()

array(['male', 'female'], dtype=object)

In [7]:
titanicDataSet['pclass'].unique()

array([3, 1, 2])

In [8]:
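# Let's make a small helper function to help us out
def generateDictionary(dataFrame, columnsOfInterest, friendlyNames):
    # Work on copies so the caller's lists are not emptied by pop()
    columnsOfInterest = list(columnsOfInterest)
    friendlyNames = list(friendlyNames)
    dataDictionary = {}
    # Pair each column with its friendly name, working from the end of both lists
    while len(columnsOfInterest) > 0 and len(friendlyNames) > 0:
        col = columnsOfInterest.pop()
        columnFriendlyName = friendlyNames.pop()
        dataDictionary[col] = {
            "FriendlyName": columnFriendlyName,
            "Values" : dataFrame[col].unique()
        }
    return dataDictionary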

In [9]:
# Now let's define the inputs
columns = ["survived", "pclass", "sex", "age", "sibsp", "parch", "fare", "embarked"]
names = ["Survival", "Ticket class", "Sex", "Age in years", "# of siblings / spouses aboard",
         "# of parents / children aboard", "Passenger fare",
         "Port of Embarkation  C = Cherbourg, Q = Queenstown, S = Southhampton"]



In [10]:
dataDictionary = generateDictionary(titanicDataSet, columns, names)

In [11]:
dataDictionary


{'embarked': {'FriendlyName': 'Port of Embarkation  C = Cherbourg, Q = Queenstown, S = Southhampton',
  'Values': array(['S', 'C', 'Q', nan], dtype=object)},
 'fare': {'FriendlyName': 'Passenger fare',
  'Values': array([  7.25  ,  71.2833,   7.925 ,  53.1   ,   8.05  ,   8.4583,
          51.8625,  21.075 ,  11.1333,  30.0708,  16.7   ,  26.55  ,
          31.275 ,   7.8542,  16.    ,  29.125 ,  13.    ,  18.    ,
           7.225 ,  26.    ,   8.0292,  35.5   ,  31.3875, 263.    ,
           7.8792,   7.8958,  27.7208, 146.5208,   7.75  ,  10.5   ,
          82.1708,  52.    ,   7.2292,  11.2417,   9.475 ,  21.    ,
          41.5792,  15.5   ,  21.6792,  17.8   ,  39.6875,   7.8   ,
          76.7292,  61.9792,  27.75  ,  46.9   ,  80.    ,  83.475 ,
          27.9   ,  15.2458,   8.1583,   8.6625,  73.5   ,  14.4542,
          56.4958,   7.65  ,  29.    ,  12.475 ,   9.    ,   9.5   ,
           7.7875,  47.1   ,  15.85  ,  34.375 ,  61.175 ,  20.575 ,
          34.6542,  63.3583, 

In [12]:
import pprint
pp = pprint.PrettyPrinter(indent = 4)
pp.pprint(dataDictionary)


{   'age': {   'FriendlyName': 'Age in years',
               'Values': array([22.  , 38.  , 26.  , 35.  ,   nan, 54.  ,  2.  , 27.  , 14.  ,
        4.  , 58.  , 20.  , 39.  , 55.  , 31.  , 34.  , 15.  , 28.  ,
        8.  , 19.  , 40.  , 66.  , 42.  , 21.  , 18.  ,  3.  ,  7.  ,
       49.  , 29.  , 65.  , 28.5 ,  5.  , 11.  , 45.  , 17.  , 32.  ,
       16.  , 25.  ,  0.83, 30.  , 33.  , 23.  , 24.  , 46.  , 59.  ,
       71.  , 37.  , 47.  , 14.5 , 70.5 , 32.5 , 12.  ,  9.  , 36.5 ,
       51.  , 55.5 , 40.5 , 44.  ,  1.  , 61.  , 56.  , 50.  , 36.  ,
       45.5 , 20.5 , 62.  , 41.  , 52.  , 63.  , 23.5 ,  0.92, 43.  ,
       60.  , 10.  , 64.  , 13.  , 48.  ,  0.75, 53.  , 57.  , 80.  ,
       70.  , 24.5 ,  6.  ,  0.67, 30.5 ,  0.42, 34.5 , 74.  ])},
    'embarked': {   'FriendlyName': 'Port of Embarkation  C = Cherbourg, Q = '
                                    'Queenstown, S = Southhampton',
                    'Values': array(['S', 'C', 'Q', nan], dtype=object)},
    'fare':

# Evaluating sample data
Still not the most readable, but we could make this better by creating a renderer.  Either way, we can get what we need out of this.  Some of the data is categorical in nature, and some is numerical.  Looking at the above:

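| Field    | Type               |
|----------|--------------------|
| age      | Number             |
| embarked | Categorical        |
| fare     | Number             |
| parch    | Number             |
| pclass   | Categorical-ish \* |
| sex      | Categorical        |
| sibsp    | Number             |
| survived | Binary             |

\* pclass is stored as an integer (1, 2, 3), but the values are ticket classes rather than measured quantities.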

The categorical fields need extra handling.  There are a number of ways to encode them, but we want to be careful that the encoding carries no implied hierarchy between the values.  A binary (on/off) representation is often the best fit for categorical data.
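
To make the "implied hierarchy" concern concrete, here is a minimal sketch comparing an integer encoding with a binary one. The `ports` series is a hypothetical stand-in that mirrors the `embarked` values seen above.

```python
import pandas as pd

# Hypothetical port codes, mirroring the 'embarked' values seen above.
ports = pd.Series(["S", "C", "Q", "S"], name="embarked")

# Integer coding: compact, but implies C < Q < S, an ordering with no real meaning.
integer_codes = ports.astype("category").cat.codes

# One-hot (binary) coding: one on/off column per value, no ordering implied.
one_hot = pd.get_dummies(ports, dtype=int)

print(integer_codes.tolist())
print(one_hot)
```

With the integer coding, a model could treat "S" as twice "Q"; with the binary columns, each port is simply present or absent.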

Another thing to note is that we have a lot of null values.  We'll have to deal with them, as most ML systems don't handle nulls well.  In practice, we should have a strategy for null values; that may not be simply dropping them, and it's decided case by case.  Here we'll end up dropping them, but first we need to do a bit of extra work.
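
As a rough illustration of what "a strategy" could mean, the sketch below contrasts dropping rows with filling in a value. The `passengers` frame is hypothetical, and median imputation is only one assumed option among many; this notebook itself goes on to drop the rows.

```python
import pandas as pd

# Hypothetical rows; None marks a missing age.
passengers = pd.DataFrame({
    "age": [22.0, None, 26.0, None, 35.0],
    "fare": [7.25, 71.28, 7.93, 8.05, 53.10],
})

# Strategy 1: drop any row with a missing value (loses two rows here).
dropped = passengers.dropna()

# Strategy 2: keep every row and fill missing ages with the median age.
median_age = passengers["age"].median()
imputed = passengers.assign(age=passengers["age"].fillna(median_age))

print(len(dropped), len(imputed), median_age)
```

Dropping is simple but shrinks the data set; filling keeps the rows at the cost of inventing values, so the right choice depends on the column and the use case.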

In [13]:
columns = ["survived", "pclass", "sex", "age", "sibsp", "parch", "fare", "embarked"]
columns

['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked']

In [14]:
# Get the columns we actually care about, out of the titanic data set
titanicDataSet = titanicDataSet[columns]
# Pay attention to the number of non-null counts.  age is our biggest problem, but we have others too.
titanicDataSet.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   survived  891 non-null    int64  
 1   pclass    891 non-null    int64  
 2   sex       891 non-null    object 
 3   age       714 non-null    float64
 4   sibsp     891 non-null    int64  
 5   parch     891 non-null    int64  
 6   fare      891 non-null    float64
 7   embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(2)
memory usage: 55.8+ KB


In [15]:
# dropna() returns a new DataFrame; reassigning avoids modifying a slice in place,
# which would otherwise trigger pandas' SettingWithCopyWarning
titanicDataSet = titanicDataSet.dropna()
titanicDataSet.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 712 entries, 0 to 890
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   survived  712 non-null    int64  
 1   pclass    712 non-null    int64  
 2   sex       712 non-null    object 
 3   age       712 non-null    float64
 4   sibsp     712 non-null    int64  
 5   parch     712 non-null    int64  
 6   fare      712 non-null    float64
 7   embarked  712 non-null    object 
dtypes: float64(2), int64(4), object(2)
memory usage: 50.1+ KB


# Dealing with the categories

As mentioned earlier, binary options are good for categories, while an integer range can cause us a headache going forward, largely because a model can fit to the ordering and spacing of the integers in a way we don't want.  Think of binary as just "on/off", whereas an integer range can act as a weight.

We'll use a pandas function we haven't seen before, called "get_dummies". You can read more about it at:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html

In [16]:
# Most basic usage.  Note how the column name "embarked" doesn't appear in the output, which makes it less useful
pd.get_dummies(titanicDataSet["embarked"], prefix_sep= "::", drop_first = False)

Unnamed: 0,C,Q,S
0,0,0,1
1,1,0,0
2,0,0,1
3,0,0,1
4,0,0,1
...,...,...,...
885,0,1,0
886,0,0,1
887,0,0,1
889,1,0,0


In [17]:
print(f" type => {type(titanicDataSet['embarked'])}")
print(f" type => {type(titanicDataSet[['embarked']])}")

 type => <class 'pandas.core.series.Series'>
 type => <class 'pandas.core.frame.DataFrame'>


In [18]:
# To fix this, let's use a slightly different selector.  This is much better
pd.get_dummies(titanicDataSet[["embarked"]], prefix_sep= "::", drop_first = False)

Unnamed: 0,embarked::C,embarked::Q,embarked::S
0,0,0,1
1,1,0,0
2,0,0,1
3,0,0,1
4,0,0,1
...,...,...,...
885,0,1,0
886,0,0,1
887,0,0,1
889,1,0,0
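
Selecting with double brackets is one way to keep the column name in the output. Another option, sketched here on a hypothetical `ports` series, is to pass the `prefix` parameter of `get_dummies` when only a Series is at hand.

```python
import pandas as pd

# Hypothetical Series standing in for titanicDataSet["embarked"].
ports = pd.Series(["S", "C", "Q"], name="embarked")

# prefix supplies the label that a bare Series would otherwise not contribute.
labelled = pd.get_dummies(ports, prefix="embarked", prefix_sep="::")

print(list(labelled.columns))
```

Either approach should yield the same `embarked::` column names; the double-bracket selector just saves repeating the name.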


In [19]:
# Now, we have an implied column, which is C.  If the other two are 0, then C is implied.  Let's drop it
pd.get_dummies(titanicDataSet[["embarked"]], prefix_sep= "::", drop_first = True)

Unnamed: 0,embarked::Q,embarked::S
0,0,1
1,0,0
2,0,1
3,0,1
4,0,1
...,...,...
885,1,0
886,0,1
887,0,1
889,0,0
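
The claim that the dropped column is implied can be checked directly. The sketch below uses a small hypothetical `ports` frame and rebuilds the `C` column from the remaining two dummies.

```python
import pandas as pd

# Hypothetical frame mirroring the 'embarked' column.
ports = pd.DataFrame({"embarked": ["S", "C", "Q", "S"]})

full = pd.get_dummies(ports, columns=["embarked"], prefix_sep="::", dtype=int)
reduced = pd.get_dummies(ports, columns=["embarked"], prefix_sep="::",
                         drop_first=True, dtype=int)

# The dropped column is 1 exactly when every remaining dummy is 0.
recovered_c = 1 - reduced.sum(axis=1)

print(recovered_c.tolist())
print(full["embarked::C"].tolist())
```

Both printed lists should match, which is why dropping the first column loses no information.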


In [20]:
# Let's create a helper method to assist a bit more.
def createCategoricalDummies(dataFrame, categoryList):
    return pd.get_dummies(dataFrame[categoryList], prefix_sep = "::", drop_first = True)

In [21]:
# Let's modify the original dataFrame to deal with both categories: add the dummy columns on, and drop the originals
categories = ["embarked", "sex"]

titanicDataSet = pd.concat(
    [titanicDataSet.drop(categories, axis=1), createCategoricalDummies(titanicDataSet, categories)], axis= 1)
titanicDataSet

Unnamed: 0,survived,pclass,age,sibsp,parch,fare,embarked::Q,embarked::S,sex::male
0,0,3,22.0,1,0,7.2500,0,1,1
1,1,1,38.0,1,0,71.2833,0,0,0
2,1,3,26.0,0,0,7.9250,0,1,0
3,1,1,35.0,1,0,53.1000,0,1,0
4,0,3,35.0,0,0,8.0500,0,1,1
...,...,...,...,...,...,...,...,...,...
885,0,3,39.0,0,5,29.1250,1,0,0
886,0,2,27.0,0,0,13.0000,0,1,1
887,1,1,19.0,0,0,30.0000,0,1,0
889,1,1,26.0,0,0,30.0000,0,0,1


In [22]:
titanicDataSet.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 712 entries, 0 to 890
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   survived     712 non-null    int64  
 1   pclass       712 non-null    int64  
 2   age          712 non-null    float64
 3   sibsp        712 non-null    int64  
 4   parch        712 non-null    int64  
 5   fare         712 non-null    float64
 6   embarked::Q  712 non-null    uint8  
 7   embarked::S  712 non-null    uint8  
 8   sex::male    712 non-null    uint8  
dtypes: float64(2), int64(4), uint8(3)
memory usage: 41.0 KB
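
The `uint8` dummy columns above reflect the pandas version this notebook was run with. Newer pandas releases (2.0 and later, as far as the release notes indicate) return boolean columns from `get_dummies` by default, so the output may show `True`/`False` instead of `1`/`0`. A minimal sketch, on a hypothetical `ports` frame, of pinning the type with the `dtype` parameter:

```python
import pandas as pd

# Hypothetical frame mirroring the 'embarked' column.
ports = pd.DataFrame({"embarked": ["S", "C", "Q"]})

# Requesting uint8 explicitly gives 0/1 columns regardless of the
# default used by the installed pandas version.
dummies = pd.get_dummies(ports, columns=["embarked"], prefix_sep="::",
                         drop_first=True, dtype="uint8")

print(dummies.dtypes)
```

Passing `dtype` explicitly keeps the prepared data numeric and consistent across environments.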
