#Preface

I will continue to use full documentation, which is not shown in the video. And my approach remains the same: (1) write a bare-bones implementation and debug it, (2) ask Gemini (or Claude or ChatGPT) to document it, add assertions and add type hints, (3) double-check that nothing got screwed up.

I am also adding a newer profiling tool that supersedes what I use in the video.

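I am using advanced features with `from __future__ import annotations`. A `__future__` import opts in to language behavior that is not yet the default. This one postpones the evaluation of type annotations: they are stored as strings instead of being evaluated when a function or class is defined. Note this import must be the first executable code in your notebook!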

In [1]:
from __future__ import annotations  #must be first code to be executed!

<center>
<h1>Chapter Two</h1>
</center>

<hr>

## LEARNING OBJECTIVES:
- Use of profiling tool to get overview of your data.
- Introduction to data wrangling. We will continue looking at wrangling concepts through following chapters, introducing them bit by bit.
- Introduction of a *Pipeline* to structure and organize our wrangling steps.
- First look at building *Transformers* to more formally define a wrangling step that can easily slot into our Pipeline.

#I. Bring in data

I'm going to try out a new package that gives a lot of details on a pandas dataset.

In [2]:
%%capture
!pip install ydata-profiling --upgrade  # Install or upgrade to ydata-profiling

In [3]:
from ydata_profiling import ProfileReport

In [4]:
#all of these should end up in your own library in chapter 3
import pandas as pd
import numpy as np
import warnings
from typing import Dict, Any, Optional, Union, List, Set, Hashable, Tuple, Self, Iterable

In [5]:
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vQjIoC4LWfiuFmnix6TUAGSOPK29QMsn5Sb9DS73mNmMqikngNSG9ntmxHUO7ySZXTPpmoP8yAV0auz/pub?output=csv'  #plain old string I got from Google Sheets
titanic_table = pd.read_csv(url)  #using our new package to read in an entire dataset - the coolest

In [6]:
titanic_table.head()  #print first 5 rows of the table

Unnamed: 0,Name,Age,Gender,Class,Joined,Married,Survived,Fare,Bio,Occupation,Class/Dept,Cabin,Boat,Nationality,URL
0,"ABBING, Mr Anthony",41.0,Male,C3,Southampton,0.0,0,7.0,"anthony abbing was born in cincinnati,...",Blacksmith,3rd Class Passenger,,,American,https://www.encyclopedia-titanica.org/titanic-...
1,"ABBOTT, Mr Ernest Owen",21.0,Male,Crew,Southampton,0.0,0,0.0,ernest owen abbott was born in southam...,Lounge Pantry Steward,Victualling Crew,,,English,https://www.encyclopedia-titanica.org/titanic-...
2,"ABBOTT, Mr Eugene Joseph",13.0,Male,C3,Southampton,,0,20.0,"\tmaster eugene joseph abbott, 13, w...",Scholar,3rd Class Passenger,,,American,https://www.encyclopedia-titanica.org/titanic-...
3,"ABBOTT, Mr Rossmore Edward",16.0,Male,C3,Southampton,0.0,0,,"mr rossmore edward abbott, 16, of prov...",Jeweller,3rd Class Passenger,,,"English, American",https://www.encyclopedia-titanica.org/titanic-...
4,"ABELSON, Mr Samuel",,Male,C2,Cherbourg,0.0,0,24.0,"mr samuel abelson, 30, from russia, an...",,2nd Class Passenger,,,Russian,https://www.encyclopedia-titanica.org/titanic-...


In [7]:
profile = ProfileReport(titanic_table, title='Titanic data', html={'style': {'full_width':True}})

In [8]:
profile.to_notebook_iframe()  #need this to work within colab

Output hidden; open in https://colab.research.google.com to view.

###How to use all this data

We will use some of it. In particular, we will look at ways of filling missing values. We will also look at the correlation information. You worked on dcor in chapter 1. We will also look at Pearson.
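
As a lightweight complement to the profile report, the same two pieces of information (missing-value counts and Pearson correlation) can be pulled straight from pandas. This is only a sketch on a made-up table: `toy` and its values are hypothetical, not the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table standing in for a real dataset
toy = pd.DataFrame({
    'Age':  [20.0, 30.0, np.nan, 50.0],
    'Fare': [10.0, 15.0, 20.0, 25.0],
    'Port': ['S', 'C', None, 'S'],
})

missing_counts = toy.isna().sum()       # number of missing values in each column
pearson = toy.corr(numeric_only=True)   # Pearson is the default correlation method

print(missing_counts)
print(pearson)
```

Rows with a missing value in either column are left out of that pair's correlation, which is one reason the missing-value counts are worth checking first.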


#II. Reduce table we will work with

Before we actually start analyzing the data in our table, there are two steps we need to do first:

1. Drop columns. We will only be working with a subset of the columns for the Titanic. For other datasets, we will likely want all the columns. So this is an optional step we are taking.

2. Drop duplicate rows. It is rarely the case that you want to include duplicates in your data. So this is really a required step.

Once we take these two steps, we can write our table out to GitHub and use the new version as one we work with.

## Dropping columns

I'm going to make a decision at this point to reduce the columns we consider. This is an important decision and could bite me later if I remove columns that are valuable in predicting. If this was not an academic exercise, you would likely need to justify why you drop columns - someone will ask.

There are 2 ways to go: (1) I could use a pandas method that removes columns or (2) I can just say what columns I want to keep.

In [9]:
print(titanic_table.columns.to_list()) #you can always get the column names as a list

['Name', 'Age', 'Gender', 'Class', 'Joined', 'Married', 'Survived', 'Fare', 'Bio', 'Occupation', 'Class/Dept', 'Cabin', 'Boat', 'Nationality', 'URL']


In [10]:
#Way 1: columns I want to remove

titanic_df_drop = titanic_table.drop(columns=['Name', 'Bio', 'Occupation', 'Class/Dept', 'Cabin', 'Boat', 'Nationality', 'URL'])

In [11]:
#Way 2: Say what columns I want to keep

titanic_df_keep = titanic_table[['Age', 'Gender', 'Class', 'Joined', 'Married',  'Fare', 'Survived']]  #notice nested list
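
To see that the two ways land in the same place, here is a small sketch on a hypothetical four-column table (the `toy` table and its values are made up for illustration).

```python
import pandas as pd

# Hypothetical toy table
toy = pd.DataFrame({
    'Name': ['a', 'b'],
    'Age': [41, 21],
    'Fare': [7.0, 0.0],
    'URL': ['x', 'y'],
})

dropped = toy.drop(columns=['Name', 'URL'])   # Way 1: name the columns to remove
kept = toy[['Age', 'Fare']]                   # Way 2: name the columns to keep

print(dropped.equals(kept))          # both ways give the same table here
print(toy.columns.to_list())         # the original table is untouched
```

Way 2 also lets you choose the column order, since the result follows the order of the list you pass in.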

###Convention note

pandas refers to tables as *DataFrames*, nicknamed `df`. I often use `df` instead of table - easier to type :)

##Now for duplicates

In [12]:
titanic_trimmed = titanic_df_keep.drop_duplicates(ignore_index=True)
len(titanic_df_keep), len(titanic_trimmed)

(2208, 1313)
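
If you want to know how many rows are duplicates before dropping them, `duplicated` gives a quick count. A minimal sketch on a hypothetical table:

```python
import pandas as pd

# Hypothetical toy table: row 2 repeats row 0
toy = pd.DataFrame({
    'Age': [41, 21, 41, 41],
    'Class': ['C3', 'Crew', 'C3', 'C1'],
})

n_duplicates = int(toy.duplicated().sum())         # rows that repeat an earlier row
trimmed = toy.drop_duplicates(ignore_index=True)   # keep first occurrence, renumber rows from 0

print(n_duplicates, len(toy), len(trimmed))
```

`ignore_index=True` renumbers the surviving rows so there are no gaps in the index.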

##Write out to local file

Then download and upload to GitHub.

I won't ask you to do this yet - I'll do it for you for now. I'll download the file and then upload it to one of my GitHub repositories. I'll use it in the next chapter.

But as a heads-up, I am going to ask you to carry out similar steps in the future. This way you can save all your data on GitHub and easily load it into your notebook.

In [13]:
titanic_trimmed.to_csv('titanic_trimmed.csv', index=False) #writes to a local file, but it is deleted overnight, so it needs to move to GitHub
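
Here is a small round-trip sketch showing what `index=False` buys you. The table, the file name `toy_trimmed.csv`, and the temporary directory are all hypothetical stand-ins.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy table
toy = pd.DataFrame({'Age': [41.0, 21.0], 'Class': ['C3', 'Crew']})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_trimmed.csv')
    toy.to_csv(path, index=False)   # index=False keeps the row numbers out of the file
    reloaded = pd.read_csv(path)    # read it back while the temporary file still exists

print(reloaded.columns.to_list())
```

Without `index=False`, the row numbers are written as an extra unnamed column, which then shows up as `Unnamed: 0` when the file is read back in.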

Ok, we are now ready to analyze and transform our data to get it ready for modeling and prediction.

#III. Let's start wrangling
<img src='https://www.dropbox.com/s/9fcc1crlxp19ijt/major_section.png?raw=1' width='300'>

Our goal is to use machine learning methods to make predictions. For now we will be predicting survival for Titanic passengers. But we will also look at a synthetic dataset formed around predicting the satisfaction of cable customers.

I've cheated and looked ahead. I know what form we need our data in to use machine learning algorithms. And I can tell you we have a ways to go to get there. I'll refer to getting our data in shape as data wrangling or just wrangling.

I'll take it a step at a time, using a chapter to look at a specific wrangling step.

##Features versus label

It may look a little weird now, and make more sense when we get to modeling, but I need to break up the raw table into 2 pieces: feature columns and label column. We will be wrangling the feature columns but not the label column.

The label column is the one we will choose to predict. We generally will not wrangle it.

So let's split it out.

In [14]:
titanic_features = titanic_trimmed.drop(columns='Survived')
titanic_label = titanic_trimmed['Survived'].to_list()

#IV. Categorical versus Numeric columns

We want to handle categorical columns differently from numeric columns. A categorical column has a discrete set of values. This set can be small (as in `Class`) or large (as in `Occupation`). It is contrasted with a numeric column, which generally can take on any real number value.

Categorical columns typically have a special problem: they contain string values that name the members of the category, e.g., 'C1', 'C2', etc. We have to turn these strings into numbers some way.

First let's take a look at what we have in our table in terms of column types.


In [16]:
titanic_features.dtypes

Unnamed: 0,0
Age,float64
Gender,object
Class,object
Joined,object
Married,float64
Fare,float64


### The datatype `object` is typically a string

It is what pandas uses to signal a non-numeric column.
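
Rather than reading the dtype listing by eye, you can ask pandas for the non-numeric columns directly. A minimal sketch on a hypothetical table:

```python
import pandas as pd

# Hypothetical toy table with one numeric and two string columns
toy = pd.DataFrame({
    'Age': [41.0, 21.0],
    'Gender': ['Male', 'Female'],
    'Class': ['C3', 'Crew'],
})

string_columns = toy.select_dtypes(exclude='number').columns.to_list()
numeric_columns = toy.select_dtypes(include='number').columns.to_list()

print(string_columns)    # columns that still need to be made numeric
print(numeric_columns)
```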

###We have 3 string columns

How we transform is going to take some thought. We need to make them all numeric.

###Let's start with `Gender`

What makes this relatively easy is that it is a binary column: only has 2 unique values. In cases like this, a simple replace method will do the work.

I am going to arbitrarily replace `Male` with `0` and `Female` with `1`.

I'll need a new pandas method. I have two choices: `map` and `replace`. I like `replace` because it only changes values that appear as keys in the mapping dictionary and leaves everything else alone. By contrast, `map` turns any value that is missing from the dictionary into `NaN`.
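
The difference is easiest to see on a value that is not in the dictionary. This sketch uses a made-up series with a hypothetical `'Unknown'` entry.

```python
import pandas as pd

# Hypothetical toy column; 'Unknown' has no key in the mapping
genders = pd.Series(['Male', 'Female', 'Unknown'], dtype=object)
mapping = {'Male': 0, 'Female': 1}

mapped = genders.map(mapping)         # 'Unknown' is not a key, so it becomes NaN
replaced = genders.replace(mapping)   # 'Unknown' is not a key, so it is left alone

print(mapped.tolist())
print(replaced.tolist())
```

With `replace`, a typo or an unexpected value stays visible in the column instead of silently turning into a missing value.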

In [17]:
#avoids spurious warning about mapping string to int (AKA downcasting)
pd.set_option('future.no_silent_downcasting', True)

In [18]:
titanic_df_1 = titanic_features.copy()  #I'm keeping versions so I can roll back

In [19]:
column = 'Gender'
mapping = {'Male':0, 'Female':1}
titanic_df_1[column] = titanic_df_1[column].replace(mapping)  #note mismatch with video - use of inplace=True is no longer recommended.

titanic_df_1.tail()

Unnamed: 0,Age,Gender,Class,Joined,Married,Fare
1308,4.0,0,C3,Cherbourg,0.0,22.0
1309,2.0,1,C3,Cherbourg,0.0,22.0
1310,23.0,1,C3,Cherbourg,1.0,22.0
1311,22.0,1,C1,Southampton,0.0,61.0
1312,27.0,0,C3,Cherbourg,0.0,7.0


###Why `inplace=True` is bad

In earlier versions of pandas, this worked fine:

<pre>
titanic_df_1[column].replace(mapping, inplace=True)
</pre>

It had the benefit of avoiding a new copy. But recent versions of pandas frown on this approach and will give you a warning if you use it. The now-accepted approach is the one I show above: reassign the column.

You will see in the videos the old `inplace=True` approach but I have updated the notebooks with the new approach.

##I'll set up a little test dataset for you for practice

In [20]:
test_data = {'Temperature':['Medium', 'High', 'High', 'Low'],
             'Rainfall': ['None', 'Light', 'Heavy', 'Light'],
             'Region': ['Desert', 'Tropical', 'Desert', 'Tropical'],
             'Reporting_ID': ['foo', 'fum', 'foe', 'fie']
            }
test_table = pd.DataFrame(test_data) #passing dictionary to pandas
test_table

Unnamed: 0,Temperature,Rainfall,Region,Reporting_ID
0,Medium,None,Desert,foo
1,High,Light,Tropical,fum
2,High,Heavy,Desert,foe
3,Low,Light,Tropical,fie


<img src='https://www.dropbox.com/s/8x575mvbi1xumje/cash_line.png?raw=1' height=3 width=500><br>
<img src='https://www.gannett-cdn.com/-mm-/56cbeec8287997813f287995de67747ba5e101d5/c=9-0-1280-718/local/-/media/2018/02/15/Phoenix/Phoenix/636542954131413889-image.jpg' height=50 align=center>

Focus on `Region`. Transform it to a numerical column. Do not change `test_table`.




In [21]:
test_table_1 = test_table.copy()

In [23]:
column = 'Region'
mapping = {'Desert': 0, 'Tropical': 1}
test_table_1[column] = test_table_1[column].replace(mapping)

In [24]:
test_table.head()

Unnamed: 0,Temperature,Rainfall,Region,Reporting_ID
0,Medium,None,Desert,foo
1,High,Light,Tropical,fum
2,High,Heavy,Desert,foe
3,Low,Light,Tropical,fie


In [25]:
test_table_1.head()  #Region should have 0 and 1 values

Unnamed: 0,Temperature,Rainfall,Region,Reporting_ID
0,Medium,None,0,foo
1,High,Light,1,fum
2,High,Heavy,0,foe
3,Low,Light,1,fie


##Nominal columns

Both Gender and Region are nominal columns. They have no natural ordering among their values. So `Desert > Tropical` makes no sense. It does not matter what numbers we give the values as long as each value gets a unique number.

The problem comes when a nominal column is not binary. Look at the `Reporting_ID` column. It has 4 values. And the `Occupation` column has 100s of distinct values.

I can tell you that some of the algorithms we will look at do arithmetic on these values. So we want to avoid giving the false impression that `foo` is 2 and hence twice as important as `foe`, which is 1. In essence, if we just blindly assign integers to each category, we are creating the illusion that certain categories are more important than others.
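
To make the illusion concrete, here is a tiny sketch of the arithmetic an algorithm could end up doing. The integer codes are an arbitrary, hypothetical assignment for the four `Reporting_ID` values.

```python
# Arbitrary integer codes for the four Reporting_ID values (hypothetical assignment)
codes = {'fie': 0, 'foe': 1, 'foo': 2, 'fum': 3}

gap_foo_foe = abs(codes['foo'] - codes['foe'])   # looks like "close together"
gap_fum_fie = abs(codes['fum'] - codes['fie'])   # looks like "far apart"
ratio_foo_foe = codes['foo'] / codes['foe']      # looks like "twice as much"
midpoint = (codes['fie'] + codes['foo']) / 2     # the "average" of fie and foo

print(gap_foo_foe, gap_fum_fie, ratio_foo_foe, midpoint)
```

None of these numbers mean anything for IDs. The gaps, the ratio, and the fact that the midpoint of `fie` and `foo` happens to equal the code for `foe` all come purely from the order in which I handed out the integers.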

But then what to do?

### Dealing with the nominal column `Joined`

There has been a recent uptick in interest in ways to deal with nominal columns. For now, we will use the standard approach, which is called One-Hot Encoding (AKA dummy-encoding). Later we may look at another approach called Target-Encoding.

Let's focus on the nominal column `Joined`.


In [26]:
set(titanic_df_1['Joined'].unique())  #updated from videos to use unique method - a little cleaner

{'Belfast', 'Cherbourg', 'Queenstown', 'Southampton'}

##One Hot Encoding (or dummy encoding from Statistics)

So we want to avoid just mapping `Belfast` to 0, ... `Southampton` to 3. Why? Because we are establishing a numerical ordering that does not exist. The general idea for an alternative is as follows.

1. The Joined column has 4 unique values. So I will add 4 new columns to the table: `Joined_Belfast, Joined_Cherbourg, Joined_Queenstown`, and `Joined_Southampton`.

2. I'll set all their values to 0 initially.

3. For a specific row, if that row has a value of `Belfast` in `Joined`, I'll reset the `Joined_Belfast` value to 1 (and let 0s remain in the other 3). Do the same for other `Joined` values in other rows.

4. What I end up with for each row is exactly a single 1 in the 4 new columns and 0 otherwise. I have one-hot encoded `Joined` in this way. The `Joined` column is dropped at this point without losing any information.
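
Spelled out literally, steps 1 through 4 look something like the sketch below. It runs on a hypothetical toy column and assumes the column has no `NaN` values. The comparison with `pd.get_dummies` at the end is just a sanity check.

```python
import pandas as pd

# Hypothetical toy column (assumed to contain no NaN values)
toy = pd.DataFrame({'Joined': ['Belfast', 'Southampton', 'Cherbourg', 'Belfast']})

encoded = toy.copy()
for value in sorted(toy['Joined'].unique()):
    new_column = f'Joined_{value}'                             # step 1: one new column per unique value
    encoded[new_column] = 0                                    # step 2: start with all 0s
    encoded.loc[encoded['Joined'] == value, new_column] = 1    # step 3: set 1 where the row matches
encoded = encoded.drop(columns='Joined')                       # step 4: original column no longer needed

builtin = pd.get_dummies(toy, columns=['Joined'], dtype=int)   # pandas doing the same job
print(encoded)
print(encoded.values.tolist() == builtin.values.tolist())
```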

I suppose I could write a loop to do this, but pandas has a method built-in to do it. Check it out.



In [27]:
titanic_df_2 = pd.get_dummies(titanic_df_1,
                               prefix='Joined',    #your choice
                               prefix_sep='_',     #your choice
                               columns=['Joined'],
                               dummy_na=False,    #no NaN indicator column; a NaN row gets 0 in every new column
                               drop_first=False,   #really should be True but could screw us up later
                               dtype=int
                               )

titanic_df_2.head()

Unnamed: 0,Age,Gender,Class,Married,Fare,Joined_Belfast,Joined_Cherbourg,Joined_Queenstown,Joined_Southampton
0,41.0,0,C3,0.0,7.0,0,0,0,1
1,21.0,0,Crew,0.0,0.0,0,0,0,1
2,13.0,0,C3,,20.0,0,0,0,1
3,16.0,0,C3,0.0,,0,0,0,1
4,,0,C2,0.0,24.0,0,1,0,0


Two of the arguments are less than intuitive. Here is the description from the docs:
<pre>
dummy_na bool, default False
Add a column to indicate NaNs, if False NaNs are ignored.

drop_first bool, default False
Whether to get k-1 dummies out of k categorical levels by removing the first level.
</pre>
For the former, "ignored" means no NaN column is added: a row whose value is NaN simply gets 0 in every new column. For the latter, we can leave one column out and infer it from all 0 values in the others. This is recommended to avoid collinearity issues. Check it out.

In [28]:
titanic_df_drop = pd.get_dummies(titanic_df_1,
                               prefix='Joined',    #your choice
                               prefix_sep='_',     #your choice
                               columns=['Joined'],
                               dummy_na=False,    #no NaN indicator column; a NaN row gets 0 in every new column
                               drop_first=True,    #will drop Belfast and infer it
                               dtype=int
                               )

titanic_df_drop.head()

Unnamed: 0,Age,Gender,Class,Married,Fare,Joined_Cherbourg,Joined_Queenstown,Joined_Southampton
0,41.0,0,C3,0.0,7.0,0,0,1
1,21.0,0,Crew,0.0,0.0,0,0,1
2,13.0,0,C3,,20.0,0,0,1
3,16.0,0,C3,0.0,,0,0,1
4,,0,C2,0.0,24.0,1,0,0
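
Here is a small sketch of how `dummy_na` and `drop_first` interact when a value is missing. It uses a hypothetical toy column, not the Titanic table (whose `Joined` column showed no NaN among its unique values).

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value
ports = pd.DataFrame({'Joined': ['Belfast', 'Cherbourg', np.nan]})

keep_all = pd.get_dummies(ports, columns=['Joined'], dummy_na=False,
                          drop_first=False, dtype=int)
drop_one = pd.get_dummies(ports, columns=['Joined'], dummy_na=False,
                          drop_first=True, dtype=int)

print(keep_all)   # the NaN row is the only row whose new columns are all 0
print(drop_one)   # the Belfast row and the NaN row now look identical
```

With `drop_first=False`, an all-0 row can only mean the original value was missing. With `drop_first=True`, all 0s is also how the dropped category is represented, so the two cases can no longer be told apart.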


##But problems later

You will have to trust me that dropping a column may give us grief when you get to your final project. So let's not drop it (and pay a slight penalty of collinearity).

<img src='https://www.dropbox.com/s/8x575mvbi1xumje/cash_line.png?raw=1' height=3 width=500><br>
<img src='https://www.gannett-cdn.com/-mm-/56cbeec8287997813f287995de67747ba5e101d5/c=9-0-1280-718/local/-/media/2018/02/15/Phoenix/Phoenix/636542954131413889-image.jpg' height=50 align=center>  

Transform the `Reporting_ID` column into one-hot encoded (OHE) form. Do not drop the first.





In [29]:
#Reminder of what is in test_table
test_table

Unnamed: 0,Temperature,Rainfall,Region,Reporting_ID
0,Medium,None,Desert,foo
1,High,Light,Tropical,fum
2,High,Heavy,Desert,foe
3,Low,Light,Tropical,fie


In [30]:
#Make changes to column in new version of table
test_table_1 = pd.get_dummies(test_table,
                              prefix='ID',
                              prefix_sep='_',
                              columns=['Reporting_ID'],
                              dummy_na=False,
                              drop_first=False,
                              dtype=int
                              )



In [31]:
test_table_1.head() #should see 4 new columns

Unnamed: 0,Temperature,Rainfall,Region,ID_fie,ID_foe,ID_foo,ID_fum
0,Medium,None,Desert,0,0,1,0
1,High,Light,Tropical,0,0,0,1
2,High,Heavy,Desert,0,1,0,0
3,Low,Light,Tropical,1,0,0,0


##"Dirty" category columns

Take a look at the unique values in the `Occupation` column.

In [32]:
len(titanic_table['Occupation'].unique())

385

That is a lot of values to one-hot encode! We would be adding roughly 385 new columns (the count above includes `NaN` as one of the unique values). A bit much. We do not have to deal with it because we dropped the column earlier. There are other ways to deal with a nominal column that has a large number of values, e.g., Target-Encoding. We may get back to some of these later.
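
When sizing up a column like this, it helps to know that `unique` counts `NaN` as a value while `nunique` does not. A minimal sketch on a hypothetical series:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value
jobs = pd.Series(['Steward', 'Scholar', np.nan, 'Steward', 'Jeweller'], dtype=object)

n_with_nan = len(jobs.unique())     # NaN is counted as one of the unique values
n_without_nan = jobs.nunique()      # NaN is excluded by default
top = jobs.value_counts().head(2)   # most frequent values first

print(n_with_nan, n_without_nan)
print(top)
```

`value_counts` is also a quick way to see whether a handful of values dominate the column or whether most values appear only once.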


<img src='https://www.dropbox.com/s/8x575mvbi1xumje/cash_line.png?raw=1' height=3 width=500><br>
<img src='https://www.gannett-cdn.com/-mm-/56cbeec8287997813f287995de67747ba5e101d5/c=9-0-1280-718/local/-/media/2018/02/15/Phoenix/Phoenix/636542954131413889-image.jpg' height=50 align=center>

Let me set up the next quiz for you.



## Ordinal columns

We can view a column as ordinal if it has a set of categorical values where there is an ordering among the values. A column containing temperature levels is an ordinal column. I can say one value, e.g., 'High', has an ordering relationship to another, e.g., 'Low'. So asking `High>Low` makes sense.

I can't say this is always a slam-dunk decision. I'm looking at you, `Class` column. I am going to make the following argument: from what I know of the social strata at the time (1912), first class passengers (C1) were viewed with more respect than lower classes (C2, C3). And Crew were the lowest of all. So I am going to view `Class` as ordinal. But I would not be surprised if you want to push back on this. It is a bit flimsy.

The good news is that we can use same replace mechanism to work with ordinal columns. Check it out.

In [33]:
titanic_df_3 = titanic_df_2.copy()  #I'm keeping versions so I can roll back

column = 'Class'
mapping = {'Crew':0,
            'C3': 1,
            'C2': 2,
            'C1': 3}


titanic_df_3[column] = titanic_df_3[column].replace(mapping)
titanic_df_3.tail()

Unnamed: 0,Age,Gender,Class,Married,Fare,Joined_Belfast,Joined_Cherbourg,Joined_Queenstown,Joined_Southampton
1308,4.0,0,1,0.0,22.0,0,1,0,0
1309,2.0,1,1,0.0,22.0,0,1,0,0
1310,23.0,1,1,1.0,22.0,0,1,0,0
1311,22.0,1,3,0.0,61.0,0,0,0,1
1312,27.0,0,1,0.0,7.0,0,1,0,0


Again, this is a subjective decision. I'm thinking ahead to the type of machine learning algorithms we will use. They will ask how close a crewman is to a first class passenger (for instance). I view that difference as big so I gave them a wide separation, i.e., `abs(0-3)`. In comparison, I view the separation between a crewman and a third class passenger as small, i.e., `abs(0-1)`.
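
As an aside, pandas has an ordered `Categorical` type that can record the same ordering. The sketch below swaps that in for the `replace` approach used in this chapter, on a hypothetical toy column, so treat it as an alternative to explore and not as the method used here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value
classes = pd.Series(['C3', 'Crew', 'C1', np.nan, 'C2'], dtype=object)
order = ['Crew', 'C3', 'C2', 'C1']   # lowest to highest, same order as the mapping above

as_ordered = pd.Categorical(classes, categories=order, ordered=True)
codes = pd.Series(as_ordered.codes, index=classes.index)   # position of each value in `order`
codes_with_nan = codes.where(codes >= 0)                   # missing values come back as -1; turn them into NaN

print(codes.tolist())
print(codes_with_nan.tolist())
```

One thing to watch: a missing value gets the code `-1`, which would look like a legitimate (and lowest) class if left in place. That is why the last step converts it back to `NaN`.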

<img src='https://www.dropbox.com/s/8x575mvbi1xumje/cash_line.png?raw=1' height=3 width=500><br>
<img src='https://www.gannett-cdn.com/-mm-/56cbeec8287997813f287995de67747ba5e101d5/c=9-0-1280-718/local/-/media/2018/02/15/Phoenix/Phoenix/636542954131413889-image.jpg' height=50 align=center>

Transform the `Temperature` and `Rainfall` columns into numerical form.




In [34]:
#Reminder of what is in test_table
test_table

Unnamed: 0,Temperature,Rainfall,Region,Reporting_ID
0,Medium,None,Desert,foo
1,High,Light,Tropical,fum
2,High,Heavy,Desert,foe
3,Low,Light,Tropical,fie


In [36]:
test_table_3 = test_table.copy()  #make both column changes in  this new table

In [37]:
#Make changes to columns in new version of table
rainfall = 'Rainfall'
temp = 'Temperature'
mapping_rain = {'None': 0, 'Light': 1, 'Heavy': 2}
mapping_temp = {'Low': 0, 'Medium': 1, 'High': 2}

test_table_3[rainfall] = test_table_3[rainfall].replace(mapping_rain)
test_table_3[temp] = test_table_3[temp].replace(mapping_temp)

In [38]:
test_table_3.head()  #should now see numeric values in Rainfall and Temperature

Unnamed: 0,Temperature,Rainfall,Region,Reporting_ID
0,1,0,Desert,foo
1,2,1,Tropical,fum
2,2,2,Desert,foe
3,0,1,Tropical,fie


##Quiz setup

I'd like you to work on applying what you have learned above to a new dataset. We will be using this dataset for the remainder of the course. Check it out.

In [79]:
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vQPM6PqZXgmAHfRYTcDZseyALRyVwkBtKEo_rtaKq_C7T0jycWxH6QVEzTzJCRA0m8Vz0k68eM9tDm-/pub?output=csv'


In [80]:
customers_df = pd.read_csv(url)
customers_df.head()

Unnamed: 0,ID,Gender,Experience Level,Time Spent,OS,ISP,Age,Rating
0,3,Female,medium,,iOS,Xfinity,,0
1,27,Male,medium,71.97,Android,Cox,50.0,0
2,30,Female,medium,101.81,,Cox,49.0,1
3,40,Female,medium,86.37,Android,Xfinity,53.0,0
4,52,Female,medium,103.97,iOS,Xfinity,58.0,0


In [81]:
customers_df.tail()

Unnamed: 0,ID,Gender,Experience Level,Time Spent,OS,ISP,Age,Rating
995,9984,,low,82.94,,Cox,72.0,1
996,9987,,low,76.93,,,,1
997,9989,Male,,,Android,Cox,47.0,1
998,9995,Female,medium,120.76,iOS,,45.0,1
999,9997,Male,low,83.42,,Xfinity,,1


##Summary

Each row represents a customer (known by their ID). It gives the amount of time each spent trying to get a phone app connected to their ISP. The Rating is what the customer rated the help they got.

##Steps to take

* First drop the ID column. It carries no useful info.

* As a change-up, we will **not** remove duplicates.

* Second encode the string columns. Leave `NaN` values in place - we will deal with them later.

* Make sure to separate out Ordinal and Nominal columns: they each have their own way of being encoded.

* For nominal columns, go ahead and use `drop_first=False`.

You do not have to save your transformed table to file.

You can ignore the Rating (label) column for now.

In [82]:
#I'll get you started
customer_features = customers_df.drop(columns='Rating')  #new table with target/label column missing
customer_label = customers_df['Rating'].to_list()  #but remember the target/label column for future

Reminder of what table looks like.



In [83]:
customer_features.head()

Unnamed: 0,ID,Gender,Experience Level,Time Spent,OS,ISP,Age
0,3,Female,medium,,iOS,Xfinity,
1,27,Male,medium,71.97,Android,Cox,50.0
2,30,Female,medium,101.81,,Cox,49.0
3,40,Female,medium,86.37,Android,Xfinity,53.0
4,52,Female,medium,103.97,iOS,Xfinity,58.0


###Determining unique values in column quickly

You could look at the detailed profile above. But here is another way.

In [84]:
set(customer_features['Gender'].unique())  #unique values in a column using set function

{'Female', 'Male', nan}
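
To do the same for every string column in one go, a short loop works. This is a sketch on a hypothetical stand-in table, not the real customer data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the customer table
toy = pd.DataFrame({
    'Gender': ['Female', 'Male', np.nan],
    'OS': ['iOS', 'Android', 'iOS'],
    'Age': [50.0, 49.0, np.nan],
})

unique_values = {}
for column in toy.select_dtypes(exclude='number').columns:
    unique_values[column] = set(toy[column].dropna().unique())   # leave NaN out of the set
    print(column, unique_values[column])
```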

###Ok, go to it



##As reference, here is what I end up with after all the steps

<img src='https://www.dropbox.com/s/kvqpjf9hl4zmpbq/Screen%20Shot%202022-09-29%20at%209.40.53%20AM.png?raw=1' height=150>




In [85]:
wrangled_df_0 = customer_features.drop(columns=['ID'])

In [95]:
wrangled_df_1 = wrangled_df_0.copy()

wrangled_df_1.head()

Unnamed: 0,Gender,Experience Level,Time Spent,OS,ISP,Age
0,Female,medium,,iOS,Xfinity,
1,Male,medium,71.97,Android,Cox,50.0
2,Female,medium,101.81,,Cox,49.0
3,Female,medium,86.37,Android,Xfinity,53.0
4,Female,medium,103.97,iOS,Xfinity,58.0


In [96]:
#probably should be making copies so can easily roll back

wrangled_df_1['OS'] = wrangled_df_1['OS'].replace({'Android': 0, 'iOS': 1})

# one hot encode ISP
wrangled_df_1 = pd.get_dummies(wrangled_df_1,
                               prefix='ISP',
                               prefix_sep='_',
                               columns=['ISP'],
                               dummy_na=False,
                               drop_first=False,
                               dtype=int
                                )

In [97]:
wrangled_df_1['Gender'] = wrangled_df_1['Gender'].replace({'Male': 0, 'Female': 1})

In [98]:
wrangled_df_1['Experience Level'] = wrangled_df_1['Experience Level'].replace({'low': 0, 'medium': 1, 'high': 2})

wrangled_df_1.head()

Unnamed: 0,Gender,Experience Level,Time Spent,OS,Age,ISP_AT&T,ISP_Cox,ISP_HughesNet,ISP_Xfinity
0,1,1,,1.0,,0,0,0,1
1,0,1,71.97,0.0,50.0,0,1,0,0
2,1,1,101.81,,49.0,0,1,0,0
3,1,1,86.37,0.0,53.0,0,0,0,1
4,1,1,103.97,1.0,58.0,0,0,0,1


##Reminder: what I end up with

|index|Gender|Experience Level|Time Spent|OS|Age|ISP\_AT&amp;T|ISP\_Cox|ISP\_HughesNet|ISP\_Xfinity|
|---|---|---|---|---|---|---|---|---|---|
|0|1|1|NaN|1|NaN|0|0|0|1|
|1|0|1|71\.97|0|50\.0|0|1|0|0|
|2|1|1|101\.81|NaN|49\.0|0|1|0|0|
|3|1|1|86\.37|0|53\.0|0|0|0|1|
|4|1|1|103\.97|1|58\.0|0|0|0|1|



###Use `count` column to see number of NaN values (`count` is the non-NaN total out of 1000 rows)

In [99]:
wrangled_df_1.describe(include='all').T

Unnamed: 0,count,unique,top,freq,mean,std,min,25%,50%,75%,max
Gender,788.0,2.0,1.0,458.0,,,,,,,
Experience Level,787.0,3.0,1.0,570.0,,,,,,,
Time Spent,800.0,,,,94.266525,11.236693,62.43,87.0775,93.495,100.4825,144.95
OS,794.0,2.0,1.0,461.0,,,,,,,
Age,811.0,,,,58.041924,10.624138,18.0,52.0,60.0,66.5,75.0
ISP_AT&T,1000.0,,,,0.025,0.156203,0.0,0.0,0.0,0.0,1.0
ISP_Cox,1000.0,,,,0.44,0.496635,0.0,0.0,0.0,1.0,1.0
ISP_HughesNet,1000.0,,,,0.063,0.243085,0.0,0.0,0.0,0.0,1.0
ISP_Xfinity,1000.0,,,,0.29,0.453989,0.0,0.0,0.0,1.0,1.0
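
If you would rather see the NaN counts directly instead of subtracting `count` from the number of rows, `isna().sum()` does it. A minimal sketch on a hypothetical table:

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with some missing values
toy = pd.DataFrame({
    'Gender': [1, 0, np.nan, 1],
    'Age': [50.0, np.nan, np.nan, 58.0],
})

non_missing = toy.count()     # same numbers as the count column of describe
missing = toy.isna().sum()    # NaN values per column

print(missing)
```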


#X. Can we normalize wrangling?

What is appealing to me (and the Data Science community) is the idea of data pipelines.


<img src='https://qph.fs.quoracdn.net/main-qimg-0bb485e3bbc6652f98ca8bb868481ec0'>

The general idea is you pass your original dataframe in on the left and then perform a set of wrangling operations on it, producing the final dataframe version on the right. Each operation is one step that takes a dataframe in and produces a dataframe.

It turns out `sklearn` has us covered. If you follow their guidelines, you can build a data pipeline like above. The general idea is that the above operations become calls on a method called `transform`. That method is attached to a transformer class. Each operation above is then an instance of a transformer class where its `transform` method is called. As you might guess, the `transform` method takes in a dataframe and produces a dataframe.

In [101]:
from sklearn import set_config
set_config(transform_output="pandas")  #says pass pandas tables through pipeline instead of numpy matrices
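
To give a feel for the dataframe-in, dataframe-out idea, here is a rough sketch of a two-step pipeline. It uses sklearn's generic `FunctionTransformer` wrapper around two plain functions, which is a stand-in for illustration and not the custom transformer classes developed in this chapter. The function names `encode_gender` and `encode_os` and the toy table are hypothetical.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def encode_gender(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical step: map Gender strings to 0/1 on a copy of the table."""
    out = df.copy()
    out['Gender'] = out['Gender'].map({'Male': 0, 'Female': 1})
    return out

def encode_os(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical step: one-hot encode the OS column."""
    return pd.get_dummies(df, columns=['OS'], dtype=int)

# Hypothetical toy table
toy = pd.DataFrame({'Gender': ['Male', 'Female'], 'OS': ['iOS', 'Android']})

pipe = Pipeline(steps=[
    ('gender', FunctionTransformer(encode_gender)),
    ('os', FunctionTransformer(encode_os)),
])

result = pipe.fit_transform(toy)   # each step receives the dataframe produced by the previous step
print(result)
```

Each step is named, so the pipeline doubles as a readable record of what was done to the table and in what order.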

##We have a slight problem - `sklearn` is missing transformers we want

Here are the 2 transformers I think we want:

1. A mapping transformer that will replace values in a column with other values, i.e., do a mapping.

2. A one-hot encoding transformer that will take a column and replace it with new columns that reflect the column values.

And we would like both these transformers to stick with dataframes: they accept a dataframe; they produce a dataframe.

Unfortunately, `sklearn` does not have these 2 transformers in its library. But there is hope! `sklearn` gives us the ability to build our own custom transformers and still fit them in a pipeline it recognizes.

Let's go over the code to build one of these missing transformers, the one that does mapping. I'll first present the entire code and then go over it piece by piece.



###Important notes:

1. It's easy to get tripped up here in OO land. What we want is a general mapping class that can be instantiated to work on specific column mappings.

2. I have given you the full-meal deal here in terms of documentation. One of the benefits of documenting so well is that Gemini (and Claude, ChatGPT, etc.) can read this documentation (including type hints) and use it to help you with coding. Colab itself has a powerful ability to read type hints and give you warnings. My only complaint is that warnings show up as small red lines under code that are easy to miss.

3. In the past I would have brought in a fancier type checker like mypy by pip installing it and doing other setup. But my belief now is that Colab has caught up with mypy and there is no need to install it.

In [103]:
from sklearn.base import BaseEstimator, TransformerMixin

class CustomMappingTransformer(BaseEstimator, TransformerMixin):
    """
    A transformer that maps values in a specified column according to a provided dictionary.

    This transformer follows the scikit-learn transformer interface and can be used in
    a scikit-learn pipeline. It applies value substitution to a specified column using
    a mapping dictionary, which can be useful for encoding categorical variables or
    transforming numeric values.

    Parameters
    ----------
    mapping_column : str or int
        The label of the column to which the mapping will be applied (usually a str name;
        an int only if the DataFrame uses integer column labels).
    mapping_dict : dict
        A dictionary defining the mapping from existing values to new values.
        Keys should be values present in the mapping_column, and values should
        be their desired replacements.

    Attributes
    ----------
    mapping_dict : dict
        The dictionary used for mapping values.
    mapping_column : str or int
        The column (by label) that will be transformed.

    Examples
    --------
    >>> import pandas as pd
    >>> df = pd.DataFrame({'category': ['A', 'B', 'C', 'A']})
    >>> mapper = CustomMappingTransformer('category', {'A': 1, 'B': 2, 'C': 3})
    >>> transformed_df = mapper.fit_transform(df)
    >>> transformed_df
       category
    0         1
    1         2
    2         3
    3         1
    """

    def __init__(self, mapping_column: Union[str, int], mapping_dict: Dict[Hashable, Any]) -> None:
        """
        Initialize the CustomMappingTransformer.

        Parameters
        ----------
        mapping_column : str or int
            The label of the column to apply the mapping to.
        mapping_dict : Dict[Hashable, Any]
            A dictionary defining the mapping from existing values to new values.

        Raises
        ------
        AssertionError
            If mapping_dict is not a dictionary.
        """
        assert isinstance(mapping_dict, dict), f'{self.__class__.__name__} constructor expected dictionary but got {type(mapping_dict)} instead.'
        self.mapping_dict: Dict[Hashable, Any] = mapping_dict
        self.mapping_column: Union[str, int] = mapping_column  #column to focus on

    def fit(self, X: pd.DataFrame, y: Optional[Iterable] = None) -> Self:
        """
        Fit method - performs no actual fitting operation.

        This method is implemented to adhere to the scikit-learn transformer interface
        but doesn't perform any computation.

        Parameters
        ----------
        X : pandas.DataFrame
            The input data to fit.
        y : array-like, default=None
            Ignored. Present for compatibility with scikit-learn interface.

        Returns
        -------
        self : instance of CustomMappingTransformer
            Returns self to allow method chaining.
        """
        print(f"\nWarning: {self.__class__.__name__}.fit does nothing.\n")
        return self  #always the return value of fit

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        """
        Apply the mapping to the specified column in the input DataFrame.

        Parameters
        ----------
        X : pandas.DataFrame
            The DataFrame containing the column to transform.

        Returns
        -------
        pandas.DataFrame
            A copy of the input DataFrame with mapping applied to the specified column.

        Raises
        ------
        AssertionError
            If X is not a pandas DataFrame or if mapping_column is not in X.

        Notes
        -----
        This method provides warnings if:
        1. Keys in mapping_dict are not found in the column values
        2. Values in the column don't have corresponding keys in mapping_dict
        """
        assert isinstance(X, pd.core.frame.DataFrame), f'{self.__class__.__name__}.transform expected Dataframe but got {type(X)} instead.'
        assert self.mapping_column in X.columns.to_list(), f'{self.__class__.__name__}.transform unknown column "{self.mapping_column}"'  #column legit?
        warnings.filterwarnings('ignore', message='.*downcasting.*')  #squash warning in replace method below

        #now check to see if all keys are contained in column
        column_set: Set[Any] = set(X[self.mapping_column].unique())
        keys_not_found: Set[Any] = set(self.mapping_dict.keys()) - column_set
        if keys_not_found:
            print(f"\nWarning: {self.__class__.__name__}[{self.mapping_column}] does not contain these keys as values {keys_not_found}\n")

        #now check to see if some keys are absent
        keys_absent: Set[Any] = column_set - set(self.mapping_dict.keys())
        if keys_absent:
            print(f"\nWarning: {self.__class__.__name__}[{self.mapping_column}] does not contain keys for these values {keys_absent}\n")

        X_: pd.DataFrame = X.copy()
        X_[self.mapping_column] = X_[self.mapping_column].replace(self.mapping_dict)
        return X_

    def fit_transform(self, X: pd.DataFrame, y: Optional[Iterable] = None) -> pd.DataFrame:
        """
        Fit to data, then transform it.

        Combines fit() and transform() methods for convenience.

        Parameters
        ----------
        X : pandas.DataFrame
            The DataFrame containing the column to transform.
        y : array-like, default=None
            Ignored. Present for compatibility with scikit-learn interface.

        Returns
        -------
        pandas.DataFrame
            A copy of the input DataFrame with mapping applied to the specified column.
        """
        #self.fit(X,y)  #commented out to avoid warning message in fit
        result: pd.DataFrame = self.transform(X)
        return result

##Let's go over the pieces

I am going to leave most of the documentation out for brevity.

###The initializer `__init__`

When we instantiate the class, we need to decide what info we need. In this case, we need to know what column to work on and the mapping. I am using a dictionary to denote the mapping.
<pre>
class CustomMappingTransformer(BaseEstimator, TransformerMixin):
  def __init__(self, mapping_column: Union[str, int], mapping_dict: Dict[Hashable, Any]) -> None:
    assert isinstance(mapping_dict, dict), f'{self.__class__.__name__} constructor expected dictionary but got {type(mapping_dict)} instead.'
    self.mapping_dict = mapping_dict
    self.mapping_column = mapping_column  #column to focus on

</pre>
Note I am using 2 of sklearn's classes: `BaseEstimator` is a parent class, `TransformerMixin` is a mixin class. If you are a little hazy on the difference, you might want to ask our good friend Gemini.

Also note I am using `Union[str,int]` as the type of the column name. Strictly speaking, I could have used `Hashable` instead (e.g., a tuple could technically be the "name" of a column), but that could lead to problems later. So I am sticking with what I have.

###Getting the class name

Just for convenience, I am using some code that will give me the class name. Makes it a little easier to copy and paste printing code without then having to edit.

<pre>
self.__class__.__name__
</pre>

###Careful with name alignment (not in video)

Notice these two lines use the same base name:

<pre>
    self.mapping_dict = mapping_dict
    self.mapping_column = mapping_column
</pre>

You need to keep this alignment. You will get an error if you do something like this:

<pre>
    self.my_mapping_dict = mapping_dict
    self.my_mapping_column = mapping_column
</pre>

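It looks like there should be no problem, but sklearn's `BaseEstimator` looks up every constructor parameter as an attribute with the same name (for instance when it displays or clones the transformer). If the names do not line up, that lookup fails with an error. Just one of the constraints we have when working with the sklearn library for pipelines.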
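Here is a minimal sketch of the alignment problem, assuming a recent scikit-learn. The two class names are hypothetical and exist only for this illustration. `get_params` is the `BaseEstimator` method that reads each constructor parameter back off the object.

```python
from sklearn.base import BaseEstimator

class AlignedTransformer(BaseEstimator):  # hypothetical class for illustration
    def __init__(self, mapping_column):
        self.mapping_column = mapping_column  # attribute name matches parameter name

class MisalignedTransformer(BaseEstimator):  # hypothetical class for illustration
    def __init__(self, mapping_column):
        self.my_mapping_column = mapping_column  # attribute name differs from parameter name

print(AlignedTransformer('Gender').get_params())  # {'mapping_column': 'Gender'}

try:
    MisalignedTransformer('Gender').get_params()  # looks for self.mapping_column
except AttributeError as err:
    print(f'AttributeError: {err}')
```

Note the error does not show up when you build the object. It shows up later, whenever sklearn asks the object for its parameters.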

###The `fit` process

It's a bit confusing. Each transformer needs at least three methods: `fit, transform, fit_transform`. There is a `fit` step because some transformers have to analyze/train on data before doing transformations. We will see that later. But in our case, no `fit` step is needed. So if someone invokes `fit`, we do nothing and give a warning.
<pre>
  def fit(self, X: pd.DataFrame, y: Optional[Iterable] = None) -> Self:
    print(f"Warning: {self.__class__.__name__}.fit does nothing.")
    return self  #always the return value of fit
</pre>

Note I am on the fence about this warning. I could easily be swayed to remove it and just do nothing other than `return self`, because it will clutter up output later. If you want to remove this type of warning, I won't mark you off.


###The `transform` process

This is where the action is. We will take a table in and do the mapping on the designated column (see constructor).
<pre>
  def transform(self, X: pd.DataFrame) -> pd.DataFrame:
    assert isinstance(X, pd.core.frame.DataFrame), f'{self.__class__.__name__}.transform expected Dataframe but got {type(X)} instead.'
    assert self.mapping_column in X.columns.to_list(), f'{self.__class__.__name__}.transform unknown column "{self.mapping_column}"'  #column legit?
    warnings.filterwarnings('ignore', message='.*downcasting.*')  #happens in replace method

    
    #now check to see if all keys are contained in column
    column_set = set(X[self.mapping_column].unique())
    keys_not_found = set(self.mapping_dict.keys()) - column_set
    if keys_not_found:
      print(f"\nWarning: {self.__class__.__name__}[{self.mapping_column}] does not contain these keys as values {keys_not_found}\n")

    #now check to see if some keys are absent
    keys_absent = column_set -  set(self.mapping_dict.keys())
    if keys_absent:
      print(f"\nWarning: {self.__class__.__name__}[{self.mapping_column}] does not contain keys for these values {keys_absent}\n")

    X_ = X.copy()
    X_[self.mapping_column] = X_[self.mapping_column].replace(self.mapping_dict)
    return X_
</pre>


###A note about asserts and warnings

I want you to write transformers that catch errors early, and do not let them slide by. So you see 2 asserts above:

1. Check if X is a DataFrame. Python does not enforce type hints at runtime, so in Colab a type hint will only give a warning, not an error. So we need an explicit check in an assert.

2. Check to make sure the column used for mapping is legit.

I also shut off one warning. The `replace` method used in the `transform` method will give a warning when attempting to map a string to a number. It views this as "downcasting". I suppose a cleaner approach would be to use something other than `replace` and avoid the warning. You can do that for extra credit if you want :) I bet Gemini could help you!

I also give a warning (printed) if not all the mapping keys show up in the column. I made a design decision here to use warning instead of error. It seems to me some subsetted tables might not have all keys. So warning is enough.

I also give a warning (printed) if the column contains values with no corresponding keys. I made a design decision here to use warning instead of error. I decided that there may be some cases where you only want to map a few of the values so don't need keys for all values. One annoying side-effect is you will get warnings on any NaN values in the column.

Some of the power you gain by writing your own custom classes is ability to make your own decisions on warnings versus errors.
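To see why the explicit assert matters, here is a small sketch using two hypothetical helper functions (not part of the transformer). Both carry the same type hint, but only the one with the assert stops bad input.

```python
def repeat_unchecked(values: list) -> list:
    # The hint says list, but Python does not check it when the code runs.
    return values * 2

def repeat_checked(values: list) -> list:
    # Explicit runtime check, in the same style as the transformer asserts.
    assert isinstance(values, list), f'repeat_checked expected list but got {type(values)} instead.'
    return values * 2

print(repeat_unchecked('ab'))  # prints abab, the wrong type slides right through

try:
    repeat_checked('ab')
except AssertionError as err:
    print(f'AssertionError: {err}')
```

One caveat: asserts are skipped if Python is started with the `-O` flag. That is not how Colab runs, but it is worth knowing.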

##A side-tour into those pesky `nan` values in a column

As reminder of what we saw earlier:

|index|Gender|Experience Level|Time Spent|OS|Age|ISP\_AT&amp;T|ISP\_Cox|ISP\_HughesNet|ISP\_Xfinity|
|---|---|---|---|---|---|---|---|---|---|
|0|1|1|NaN|1|NaN|0|0|0|1|
|1|0|1|71\.97|0|50\.0|0|1|0|0|
|2|1|1|101\.81|NaN|49\.0|0|1|0|0|
|3|1|1|86\.37|0|53\.0|0|0|0|1|
|4|1|1|103\.97|1|58\.0|0|0|0|1|

It is pretty darn confusing how they are represented. You may think that they are the string `"NaN"`. They are not. They just print that way. They can be viewed as a special `numpy` constant `np.nan`, which is a `float` (!). But a really weird float. Check it out.

In [104]:
np.nan  #prints as nan but is not a string

nan

In [105]:
type(np.nan)

float

In [106]:
np.nan == np.nan  #odd, but strangely accurate: one unknown value is not necessarily equal to another unknown value

False

In [107]:
np.nan + 1  #in essence an np.nan is a poison pill for arithmetic. Once you include it, you will always get np.nan as an answer.

nan

In [109]:
sum([np.nan, 23.1, 34.2])

nan

In [110]:
np.nan in [np.nan, 23.1, np.nan]  #True, but only because the in operator checks identity first and this is the very same np.nan object

True

The good news is that pandas methods know about this weirdness and do the right thing. But problems arise when you pull values out of pandas into plain Python, e.g., by converting a Series to a list.

In [111]:
ps = pd.Series([np.nan, 23.1, np.nan], [0,1,2]).to_list()
ps  #it looks like nan values there

[nan, 23.1, nan]

In [112]:
type(ps[0])

float

In [113]:
np.isnan(ps[0])

np.True_

In [114]:
np.nan in ps  #expecting True but got False: to_list built new float objects, and one nan never equals another

False

In [115]:
23.1 in ps

True

My take is the whole representation of missing values in pandas is in flux. In general, most pandas methods know about NaN and do the right thing, e.g., using the `mean` method on a column knows to skip over nans in the column.
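If you do need to look for nan in a plain Python list, here is a minimal sketch that avoids the `in` operator. The helper name `contains_nan` is my own invention for illustration. Inside pandas you can use `pd.isna` instead.

```python
import math

nan_a = float('nan')
nan_b = float('nan')  # a second, separate nan object

print(nan_a in [nan_a, 23.1])  # True: same object, so the identity check succeeds
print(nan_b in [nan_a, 23.1])  # False: different object, and nan == nan is False

def contains_nan(values) -> bool:
    """Return True if any float in values is nan (hypothetical helper)."""
    return any(isinstance(v, float) and math.isnan(v) for v in values)

print(contains_nan([nan_a, 23.1]))  # True
print(contains_nan([nan_b, 23.1]))  # True
print(contains_nan(['Male', 23.1]))  # False
```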

###For more info

[This stackoverflow thread](https://stackoverflow.com/q/62489359) might help.


##Always make a copy

Your transformers should make a copy of the input DataFrame before munching on it, and return that copy. That way the caller's table is left untouched.

`X_ = X.copy()`
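A small sketch of what goes wrong without the copy. Both function names are hypothetical and the two-row table is made up for the demonstration.

```python
import pandas as pd

def map_in_place(X: pd.DataFrame) -> pd.DataFrame:
    # No copy: this assignment changes the caller's table too.
    X['Gender'] = X['Gender'].map({'Male': 0, 'Female': 1})
    return X

def map_on_copy(X: pd.DataFrame) -> pd.DataFrame:
    X_ = X.copy()  # work on a copy, leave the input alone
    X_['Gender'] = X_['Gender'].map({'Male': 0, 'Female': 1})
    return X_

original = pd.DataFrame({'Gender': ['Male', 'Female']})

safe = map_on_copy(original)
print(original['Gender'].to_list())  # ['Male', 'Female'], input untouched

map_in_place(original)
print(original['Gender'].to_list())  # [0, 1], input was silently changed
```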

##Use library code when we can

I could have written a loop to do the replacement. But I remembered that we had already googled for a method to do this for us: `replace`. Easy peasy. But do see the note about downcasting that `replace` will complain about (and that I have turned off this complaint).

###The `fit_transform` process

This is a convenience function that combines the 2 methods.
<pre>
  def fit_transform(self, X: pd.DataFrame, y: Optional[Iterable] = None) -> pd.DataFrame:
    #self.fit(X, y)  #uncomment if you want
    result = self.transform(X)
    return result
</pre>

I am also on the fence about commenting the `self.fit` step out. I could be swayed that it is more consistent to always call it even if we know `fit` does nothing. Again, if you want to uncomment it, I won't mark you off.

###Test it out

Let's apply our new transformer to `Gender` and `Class`. Reminder: we will need to instantiate the class twice to get two separate operations in our pipeline.

In [116]:
import pandas as pd

In [117]:
url = 'https://raw.githubusercontent.com/fickas/course_datasets/refs/heads/main/titanic_trimmed.csv' #from earlier in notebook
titanic_trimmed = pd.read_csv(url)

In [118]:
titanic_features = titanic_trimmed.drop(columns='Survived')
titanic_features.head()  #print first 5 rows of the table

Unnamed: 0,Age,Gender,Class,Joined,Married,Fare
0,41.0,Male,C3,Southampton,0.0,7.0
1,21.0,Male,Crew,Southampton,0.0,0.0
2,13.0,Male,C3,Southampton,,20.0
3,16.0,Male,C3,Southampton,0.0,
4,,Male,C2,Cherbourg,0.0,24.0


##First let's test our assertions and warnings

In [119]:
#you should see a red line under the code below. This is Colab using type-hints to give you a warning.

gender_transformer = CustomMappingTransformer('Gender', [0,1])  #AssertionError: CustomMappingTransformer constructor expected dictionary but got <class 'list'> instead.

test_df = gender_transformer.fit_transform(titanic_features)  #error

AssertionError: CustomMappingTransformer constructor expected dictionary but got <class 'list'> instead.

In [120]:
gender_transformer = CustomMappingTransformer('gender', {'Male': 0, 'Female': 1})  #AssertionError: CustomMappingTransformer.transform unknown column "gender"
test_df = gender_transformer.fit_transform(titanic_features)  #error

AssertionError: CustomMappingTransformer.transform unknown column "gender"

###Test warning for absent keys

In [121]:
male_transformer = CustomMappingTransformer('Gender', {'Male': 0})  #produce a transform operator
male_df = male_transformer.fit_transform(titanic_features)  #Warning: CustomMappingTransformer[Gender] does not contain keys for these values {'Female'}
male_df[10:15]





Unnamed: 0,Age,Gender,Class,Joined,Married,Fare
10,20.0,0,Crew,Southampton,0.0,0.0
11,40.0,Female,C3,Southampton,1.0,9.0
12,24.0,0,C3,Southampton,0.0,7.0
13,,0,Crew,Southampton,0.0,0.0
14,37.0,0,Crew,Southampton,1.0,0.0


In [122]:
unknown_transformer = CustomMappingTransformer('Gender', {'Male': 0, 'Unknown':2})  #produce a transform operator
df = unknown_transformer.fit_transform(titanic_features)  #Two warnings: key 'Unknown' is not a value in the column, and value 'Female' has no key
df[10:15]







Unnamed: 0,Age,Gender,Class,Joined,Married,Fare
10,20.0,0,Crew,Southampton,0.0,0.0
11,40.0,Female,C3,Southampton,1.0,9.0
12,24.0,0,C3,Southampton,0.0,7.0
13,,0,Crew,Southampton,0.0,0.0
14,37.0,0,Crew,Southampton,1.0,0.0


In [123]:
#This warning is not an issue given we will get to nan values as last step in our pipeline

class_transformer = CustomMappingTransformer('Class', {'Crew': 0, 'C3': 1, 'C2': 2, 'C1': 3})
nan_df = class_transformer.transform(titanic_features)  #Warning: CustomMappingTransformer[Class] does not contain keys for these values {nan}





##Looks good

We can do our normal mappings now.

In [124]:
gender_transformer = CustomMappingTransformer('Gender', {'Male': 0, 'Female': 1})  #produce a transform operator
X2_df = gender_transformer.fit_transform(titanic_features)  #apply the operator

In [125]:
X2_df.head()

Unnamed: 0,Age,Gender,Class,Joined,Married,Fare
0,41.0,0,C3,Southampton,0.0,7.0
1,21.0,0,Crew,Southampton,0.0,0.0
2,13.0,0,C3,Southampton,,20.0
3,16.0,0,C3,Southampton,0.0,
4,,0,C2,Cherbourg,0.0,24.0


In [126]:
class_transformer = CustomMappingTransformer('Class', {'Crew': 0, 'C3': 1, 'C2': 2, 'C1': 3})
X3_df = class_transformer.transform(X2_df)





In [127]:
X3_df.head()

Unnamed: 0,Age,Gender,Class,Joined,Married,Fare
0,41.0,0,1,Southampton,0.0,7.0
1,21.0,0,0,Southampton,0.0,0.0
2,13.0,0,1,Southampton,,20.0
3,16.0,0,1,Southampton,0.0,
4,,0,2,Cherbourg,0.0,24.0


#VI. Ready for pipeline

We can define a pipeline that will call our transformers, in order, to pass the data through. If we start with `titanic_features`, the final output will be a dataframe in a transformed form.

Let me show you how to define a pipeline. It is relatively straightforward.

In [128]:
from sklearn.pipeline import Pipeline

#first define the pipeline
titanic_transformer = Pipeline(steps=[
    ('gender', CustomMappingTransformer('Gender', {'Male': 0, 'Female': 1})),
    ('class', CustomMappingTransformer('Class', {'Crew': 0, 'C3': 1, 'C2': 2, 'C1': 3})),
    ], verbose=True)

#now invoke it
transformed_df = titanic_transformer.fit_transform(titanic_features)

[Pipeline] ............ (step 1 of 2) Processing gender, total=   0.0s


[Pipeline] ............. (step 2 of 2) Processing class, total=   0.0s


You can see that each operation is a tuple. The first item is the name you give the operation. This name is your choice. The second is a call on a constructor to give you a transformer object.

The transformers are built when you define the pipeline. Once you invoke it, each transformer is called in turn, passing its output to the next transformer.
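To make the chaining idea concrete, here is a rough sketch of the loop at the heart of a pipeline. This is my own simplification, not sklearn's actual code (the real `Pipeline` does a lot more). The toy transformers `AddOne` and `Double` and the helper `run_steps` are hypothetical and work on plain lists to keep things short.

```python
class AddOne:
    """Toy transformer (hypothetical): adds 1 to every item."""
    def fit_transform(self, X, y=None):
        return [x + 1 for x in X]

class Double:
    """Toy transformer (hypothetical): doubles every item."""
    def fit_transform(self, X, y=None):
        return [x * 2 for x in X]

def run_steps(steps, X):
    """Feed X through each (name, transformer) tuple in order."""
    for name, transformer in steps:
        X = transformer.fit_transform(X)  # output of one step is input to the next
    return X

steps = [('add', AddOne()), ('double', Double())]
print(run_steps(steps, [1, 2, 3]))  # [4, 6, 8]
```

Order matters: swapping the two steps would give `[3, 5, 7]` instead.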

In [129]:
transformed_df.head()

Unnamed: 0,Age,Gender,Class,Joined,Married,Fare
0,41.0,0,1,Southampton,0.0,7.0
1,21.0,0,0,Southampton,0.0,0.0
2,13.0,0,1,Southampton,,20.0
3,16.0,0,1,Southampton,0.0,
4,,0,2,Cherbourg,0.0,24.0


We got what we wanted!



###Design decision!

I made a design decision to have each transformer that I code for the Pipeline focus on a single column. I think this is more readable for the datasets we will look at.

On the other hand, if you have hundreds of columns, you may consider writing transformers that accept a list of columns to work with, avoiding having hundreds of steps in your Pipeline.

Note that the transformers built into sklearn do not always follow this design. They may work on more than one column and in some cases, the entire set of columns in a table!

<img src='https://www.dropbox.com/s/8x575mvbi1xumje/cash_line.png?raw=1' height=3 width=500><br>
<img src='https://www.gannett-cdn.com/-mm-/56cbeec8287997813f287995de67747ba5e101d5/c=9-0-1280-718/local/-/media/2018/02/15/Phoenix/Phoenix/636542954131413889-image.jpg' height=50 align=center>

 I'm going to let you work from scratch on a new transformer called CustomRenamingTransformer. This transformer will rename one or more columns.

 Big caveat: if any columns that are targeted for renaming do not exist, then I want an **assert** error. I highlight **assert** because I don't want you to rely on Python to report the error.

Your assert error should name the columns that do not exist.






##Hint 1: look for a pandas method that does column renaming

Why write all the code yourself if there is a pandas method that does it? I bet Gemini knows about it.



##Hint 2: you can use operations on sets to determine if one list of elements is contained in another list of elements.

Allows you to avoid writing your own loops.
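As a quick refresher on set operations, here is a tiny sketch with made-up names (these are not the test cases below). Set difference gives you everything in the first set that is missing from the second.

```python
requested = {'Age', 'Boot', 'url'}   # names we were asked about (made up)
available = {'Age', 'Boat', 'URL'}   # names that actually exist (made up)

missing = requested - available      # in requested but not in available
print(missing == {'Boot', 'url'})    # True

print(not ({'Age'} - available))     # True: an empty difference means nothing is missing
```

An empty set is falsy in Python, so the result of a difference can go straight into an `if` or an `assert`.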

###Here is CustomMappingTransformer repeated for reference

It is not an exact match with what you need, but close.

<pre>
from sklearn.base import BaseEstimator, TransformerMixin

class CustomMappingTransformer(BaseEstimator, TransformerMixin):
    """
    A transformer that maps values in a specified column according to a provided dictionary.
    
    This transformer follows the scikit-learn transformer interface and can be used in
    a scikit-learn pipeline. It applies value substitution to a specified column using
    a mapping dictionary, which can be useful for encoding categorical variables or
    transforming numeric values.
    
    Parameters
    ----------
    mapping_column : str or int
        The name (str) or position (int) of the column to which the mapping will be applied.
    mapping_dict : dict
        A dictionary defining the mapping from existing values to new values.
        Keys should be values present in the mapping_column, and values should
        be their desired replacements.
        
    Attributes
    ----------
    mapping_dict : dict
        The dictionary used for mapping values.
    mapping_column : str or int
        The column (by name or position) that will be transformed.
        
    Examples
    --------
    >>> import pandas as pd
    >>> df = pd.DataFrame({'category': ['A', 'B', 'C', 'A']})
    >>> mapper = CustomMappingTransformer('category', {'A': 1, 'B': 2, 'C': 3})
    >>> transformed_df = mapper.fit_transform(df)
    >>> transformed_df
       category
    0        1
    1        2
    2        3
    3        1
    """

    def __init__(self, mapping_column: Union[str, int], mapping_dict: Dict[Hashable, Any]) -> None:
        """
        Initialize the CustomMappingTransformer.
        
        Parameters
        ----------
        mapping_column : str or int
            The name (str) or position (int) of the column to apply the mapping to.
        mapping_dict : Dict[Hashable, Any]
            A dictionary defining the mapping from existing values to new values.
            
        Raises
        ------
        AssertionError
            If mapping_dict is not a dictionary.
        """
        assert isinstance(mapping_dict, dict), f'{self.__class__.__name__} constructor expected dictionary but got {type(mapping_dict)} instead.'
        self.mapping_dict: Dict[Hashable, Any] = mapping_dict
        self.mapping_column: Union[str, int] = mapping_column  #column to focus on

    def fit(self, X: pd.DataFrame, y: Optional[Iterable] = None) -> Self:
        """
        Fit method - performs no actual fitting operation.
        
        This method is implemented to adhere to the scikit-learn transformer interface
        but doesn't perform any computation.
        
        Parameters
        ----------
        X : pandas.DataFrame
            The input data to fit.
        y : array-like, default=None
            Ignored. Present for compatibility with scikit-learn interface.
            
        Returns
        -------
        self : CustomMappingTransformer
            Returns self to allow method chaining.
        """
        print(f"\nWarning: {self.__class__.__name__}.fit does nothing.\n")
        return self  #always the return value of fit

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        """
        Apply the mapping to the specified column in the input DataFrame.
        
        Parameters
        ----------
        X : pandas.DataFrame
            The DataFrame containing the column to transform.
            
        Returns
        -------
        pandas.DataFrame
            A copy of the input DataFrame with mapping applied to the specified column.
            
        Raises
        ------
        AssertionError
            If X is not a pandas DataFrame or if mapping_column is not in X.
            
        Notes
        -----
        This method provides warnings if:
        1. Keys in mapping_dict are not found in the column values
        2. Values in the column don't have corresponding keys in mapping_dict
        """
        assert isinstance(X, pd.core.frame.DataFrame), f'{self.__class__.__name__}.transform expected Dataframe but got {type(X)} instead.'
        assert self.mapping_column in X.columns.to_list(), f'{self.__class__.__name__}.transform unknown column "{self.mapping_column}"'  #column legit?
        warnings.filterwarnings('ignore', message='.*downcasting.*')  #squash warning in replace method below

        #now check to see if all keys are contained in column
        column_set: Set[Any] = set(X[self.mapping_column].unique())
        keys_not_found: Set[Any] = set(self.mapping_dict.keys()) - column_set
        if keys_not_found:
            print(f"\nWarning: {self.__class__.__name__}[{self.mapping_column}] does not contain these keys as values {keys_not_found}\n")

        #now check to see if some keys are absent
        keys_absent: Set[Any] = column_set - set(self.mapping_dict.keys())
        if keys_absent:
            print(f"\nWarning: {self.__class__.__name__}[{self.mapping_column}] does not contain keys for these values {keys_absent}\n")

        X_: pd.DataFrame = X.copy()
        X_[self.mapping_column] = X_[self.mapping_column].replace(self.mapping_dict)
        return X_

    def fit_transform(self, X: pd.DataFrame, y: Optional[Iterable] = None) -> pd.DataFrame:
        """
        Fit to data, then transform it.
        
        Combines fit() and transform() methods for convenience.
        
        Parameters
        ----------
        X : pandas.DataFrame
            The DataFrame containing the column to transform.
        y : array-like, default=None
            Ignored. Present for compatibility with scikit-learn interface.
            
        Returns
        -------
        pandas.DataFrame
            A copy of the input DataFrame with mapping applied to the specified column.
        """
        #self.fit(X,y)  #commented out to avoid warning message in fit
        result: pd.DataFrame = self.transform(X)
        return result
</pre>

##And here are some test cases as your target



In [136]:
test_map_good =  {'URL':'foo', 'Boat':'fum'}  #ok
test_map_bad1 =  {'URL':'foo', 'Boot':'fum'}  #error - lists Boot as unknown
test_map_bad2 =  {'url':'foo', 'Boot':'fum'}  #error - lists Boot and url as unknown

###Your choice on how much to document

But as noted, the more you document, the better results you will get with code completion by Gemini and warnings from Colab.

**And reminder**: You can write a barebones implementation then ask Gemini to fully document it for you and add type hints. Frankly, I think Claude does a better job, but Gemini is ok. Only pain is in copying your code into Gemini window then copying results back to your code cell. But it is still a huge time savings.

In [130]:
class CustomRenamingTransformer(BaseEstimator, TransformerMixin):
    """
    A transformer for renaming columns in a pandas DataFrame.

    This transformer allows you to rename columns in a DataFrame using a provided
    dictionary. It adheres to the scikit-learn transformer interface and can
    be used within a scikit-learn pipeline.

    Parameters
    ----------
    rename_dict : dict
        A dictionary mapping existing column names (keys) to their desired new names (values).

    Attributes
    ----------
    rename_dict : dict
        The dictionary used for renaming columns.

    Examples
    --------
    >>> import pandas as pd
    >>> df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
    >>> renamer = CustomRenamingTransformer({'col1': 'new_col1', 'col2': 'new_col2'})
    >>> transformed_df = renamer.fit_transform(df)
    >>> transformed_df
       new_col1  new_col2
    0         1         4
    1         2         5
    2         3         6
    """
    def __init__(self, rename_dict: Dict[str, str]) -> None:
        """
        Initialize the CustomRenamingTransformer.

        Parameters
        ----------
        rename_dict : dict
            A dictionary mapping existing column names (keys) to their desired new names (values).

        Raises
        ------
        AssertionError
            If rename_dict is not a dictionary.
        """
        assert isinstance(rename_dict, dict), f'{self.__class__.__name__} constructor expected dictionary but got {type(rename_dict)} instead.'
        self.rename_dict: Dict[str, str] = rename_dict

    def fit(self, X: pd.DataFrame, y: Optional[Iterable] = None) -> Self:
        """
        Fit method - performs no actual fitting operation.

        This method is implemented to adhere to the scikit-learn transformer interface
        but doesn't perform any computation.

        Parameters
        ----------
        X : pandas.DataFrame
            The input data to fit.
        y : array-like, default=None
            Ignored. Present for compatibility with scikit-learn interface.

        Returns
        -------
        self : CustomRenamingTransformer
            Returns self to allow method chaining.
        """
        print(f"\nWarning: {self.__class__.__name__}.fit does nothing.\n")
        return self  #always the return value of fit

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        """
        Rename columns in the input DataFrame.

        Parameters
        ----------
        X : pandas.DataFrame
            The DataFrame containing the columns to rename.

        Returns
        -------
        pandas.DataFrame
            A copy of the input DataFrame with columns renamed according to rename_dict.

        Raises
        ------
        AssertionError
            If X is not a pandas DataFrame or if any key in rename_dict is not a column in X.
        """
        assert isinstance(X, pd.core.frame.DataFrame), f'{self.__class__.__name__}.transform expected Dataframe but got {type(X)} instead.'
        #check to make sure all keys in rename_dict are columns in X
        keys_not_found: Set = set(self.rename_dict.keys()) - set(X.columns)
        assert not keys_not_found, f'{self.__class__.__name__}.transform unknown column(s) "{keys_not_found}"'
        #do renaming using pandas rename method
        X_ = X.copy()  #make copy to avoid side-effects
        X_ = X_.rename(columns=self.rename_dict)
        return X_

    def fit_transform(self, X: pd.DataFrame, y: Optional[Iterable] = None) -> pd.DataFrame:
        """
        Fit to data, then transform it.

        Combines fit() and transform() methods for convenience.

        Parameters
        ----------
        X : pandas.DataFrame
            The DataFrame containing the columns to rename.
        y : array-like, default=None
            Ignored. Present for compatibility with scikit-learn interface.

        Returns
        -------
        pandas.DataFrame
            A copy of the input DataFrame with columns renamed according to rename_dict.
        """
        #self.fit(X,y)
        result: pd.DataFrame = self.transform(X)
        return result


##Test with this code

In [131]:
#you will get red line below [0,1] if you added type hints

rt = CustomRenamingTransformer([0,1])  #AssertionError: CustomRenamingTransformer constructor expected dictionary but got <class 'list'> instead.
new_df = rt.transform(titanic_table)
new_df.columns

AssertionError: CustomRenamingTransformer constructor expected dictionary but got <class 'list'> instead.

In [132]:
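rt = CustomRenamingTransformer(test_map_bad1)  #run the cell that defines test_map_bad1 first, or you get the NameError shown below
new_df = rt.transform(titanic_table)  #AssertionError: CustomRenamingTransformer.transform unknown column(s) "{'Boot'}"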

NameError: name 'test_map_bad1' is not defined

In [133]:
rt = CustomRenamingTransformer(test_map_bad2)  #run the cell that defines test_map_bad2 first, or you get the NameError shown below
new_df = rt.transform(titanic_table)  #AssertionError naming both 'Boot' and 'url' as unknown columns
new_df.columns

NameError: name 'test_map_bad2' is not defined

In [137]:
rt = CustomRenamingTransformer(test_map_good)
new_df = rt.transform([1,2,3])  #AssertionError: CustomRenamingTransformer.transform expected Dataframe but got <class 'list'> instead.
new_df.columns

AssertionError: CustomRenamingTransformer.transform expected Dataframe but got <class 'list'> instead.

In [138]:
rt = CustomRenamingTransformer(test_map_good)
new_df = rt.transform(titanic_table)
new_df.columns  #should produce correct set of new columns

Index(['Name', 'Age', 'Gender', 'Class', 'Joined', 'Married', 'Survived',
       'Fare', 'Bio', 'Occupation', 'Class/Dept', 'Cabin', 'fum',
       'Nationality', 'foo'],
      dtype='object')

#Challenge 1

Write the `CustomOHETransformer` using `pd.get_dummies` as the foundation. I'll give you a test case to check your results. I'll also give you a start. And remember, you can write barebones version and then ask Gemini to document for you.

To keep things simple, I'll allow you to have your `fit` method do nothing. All the action is in `transform`. We will later revisit this decision when we worry about something called "data leakage".
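If you have not used `pd.get_dummies` before, here is a small sketch on a made-up table (not the Titanic data). I am passing `dtype=int` so the new columns hold 0/1. Recent pandas versions give True/False if you leave it off.

```python
import pandas as pd

# Made-up table for illustration only.
df = pd.DataFrame({'Age': [41.0, 21.0, 13.0],
                   'Joined': ['Southampton', 'Cherbourg', 'Southampton']})

encoded = pd.get_dummies(df,
                         columns=['Joined'],  # column(s) to one-hot encode
                         prefix='Joined',     # new columns are named Joined_<value>
                         prefix_sep='_',
                         dummy_na=False,      # no extra column for NaN
                         drop_first=False,    # keep every dummy column
                         dtype=int)

print(encoded.columns.to_list())  # ['Age', 'Joined_Cherbourg', 'Joined_Southampton']
print(encoded['Joined_Cherbourg'].to_list())  # [0, 1, 0]
```

Notice the original `Joined` column is gone and the untouched `Age` column comes first.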

In [145]:
class CustomOHETransformer(BaseEstimator, TransformerMixin):
    """
    A transformer that performs one-hot encoding on a specified column.

    This transformer follows the scikit-learn transformer interface and can be
    used in a scikit-learn pipeline. It applies one-hot encoding to a specified
    column, creating new columns for each unique value in the original column.

    Parameters
    ----------
    target_column : str or int
        The name (str) or position (int) of the column to be one-hot encoded.
    dummy_na : bool, default=False
        Whether to create a dummy column for NaN values.
    drop_first : bool, default=False
        Whether to drop the first dummy column to avoid multicollinearity.

    Attributes
    ----------
    target_column : str or int
        The column (by name or position) that will be transformed.
    dummy_na : bool
        Whether to include a dummy column for NaN values.
    drop_first : bool
        Whether to drop the first dummy column.

    Examples
    --------
    >>> import pandas as pd
    >>> df = pd.DataFrame({'category': ['A', 'B', 'C', 'A']})
    >>> ohe = CustomOHETransformer('category')
    >>> transformed_df = ohe.fit_transform(df)
    >>> transformed_df
       category_A  category_B  category_C
    0           1           0           0
    1           0           1           0
    2           0           0           1
    3           1           0           0
    """

    def __init__(self, target_column: Union[str, int], dummy_na: bool = False, drop_first: bool = False) -> None:
        """
        Initialize the CustomOHETransformer.

        Parameters
        ----------
        target_column : str or int
            The name (str) or position (int) of the column to be one-hot encoded.
        dummy_na : bool, default=False
            Whether to create a dummy column for NaN values.
        drop_first : bool, default=False
            Whether to drop the first dummy column to avoid multicollinearity.
        """
        self.target_column: Union[str, int] = target_column
        self.dummy_na: bool = dummy_na
        self.drop_first: bool = drop_first

    def fit(self, X: pd.DataFrame, y: Optional[Iterable] = None) -> Self:
        """
        Fit method - performs no actual fitting operation.

        This method is implemented to adhere to the scikit-learn transformer interface
        but doesn't perform any computation.

        Parameters
        ----------
        X : pandas.DataFrame
            The input data to fit.
        y : array-like, default=None
            Ignored. Present for compatibility with scikit-learn interface.

        Returns
        -------
        self : instance of CustomOHETransformer
            Returns self to allow method chaining.
        """
        print(f"\nWarning: {self.__class__.__name__}.fit does nothing.\n")
        return self

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        """
        Apply one-hot encoding to the specified column in the input DataFrame.

        Parameters
        ----------
        X : pandas.DataFrame
            The DataFrame containing the column to transform.

        Returns
        -------
        pandas.DataFrame
            A copy of the input DataFrame with one-hot encoding applied to the
            specified column.

        Raises
        ------
        AssertionError
            If X is not a pandas DataFrame or if target_column is not in X.
        """
        assert isinstance(X, pd.core.frame.DataFrame), f'{self.__class__.__name__}.transform expected Dataframe but got {type(X)} instead.'
        assert self.target_column in X.columns.to_list(), f'{self.__class__.__name__}.transform unknown column "{self.target_column}"'

        X_ = pd.get_dummies(
            X,
            prefix=self.target_column,
            prefix_sep='_',
            columns=[self.target_column],
            dummy_na=self.dummy_na,
            drop_first=self.drop_first,
            dtype=int
        )
        return X_

    def fit_transform(self, X: pd.DataFrame, y: Optional[Iterable] = None) -> pd.DataFrame:
        """
        Fit to data, then transform it.

        Combines fit() and transform() methods for convenience.

        Parameters
        ----------
        X : pandas.DataFrame
            The DataFrame containing the column to transform.
        y : array-like, default=None
            Ignored. Present for compatibility with scikit-learn interface.

        Returns
        -------
        pandas.DataFrame
            A copy of the input DataFrame with one-hot encoding applied to the
            specified column.
        """
        # self.fit(X, y)
        result: pd.DataFrame = self.transform(X)
        return result


###Test it out

In [None]:
#don't change this code
ohe = CustomOHETransformer(target_column='joined')
error_df = ohe.fit_transform(titanic_features)  #AssertionError: CustomOHETransformer.transform unknown column joined



In [146]:
#don't change this code
ohe = CustomOHETransformer(target_column='Joined')
X1_df = ohe.fit_transform(titanic_features)
assert set(['Joined_Belfast', 'Joined_Cherbourg', 'Joined_Queenstown', 'Joined_Southampton']) - set(X1_df.columns.to_list()) == set()   #should be True

Put it in pipeline and try it out.

In [147]:
from sklearn.pipeline import Pipeline

#first define the pipeline (but do not invoke it)
titanic_transformer = Pipeline(steps=[
    ('gender', CustomMappingTransformer('Gender', {'Male': 0, 'Female': 1})),
    ('class', CustomMappingTransformer('Class', {'Crew': 0, 'C3': 1, 'C2': 2, 'C1': 3})),
    ('joined', CustomOHETransformer(target_column='Joined'))
    ], verbose=True)

#now invoke it
transformed_df = titanic_transformer.fit_transform(titanic_features)

[Pipeline] ............ (step 1 of 3) Processing gender, total=   0.0s


[Pipeline] ............. (step 2 of 3) Processing class, total=   0.0s
[Pipeline] ............ (step 3 of 3) Processing joined, total=   0.0s


In [148]:
transformed_df.head()

Unnamed: 0,Age,Gender,Class,Married,Fare,Joined_Belfast,Joined_Cherbourg,Joined_Queenstown,Joined_Southampton
0,41.0,0,1,0.0,7.0,0,0,0,1
1,21.0,0,0,0.0,0.0,0,0,0,1
2,13.0,0,1,,20.0,0,0,0,1
3,16.0,0,1,0.0,,0,0,0,1
4,,0,2,0.0,24.0,0,1,0,0



###What I see

|index|Age|Gender|Class|Married|Fare|Joined\_Belfast|Joined\_Cherbourg|Joined\_Queenstown|Joined\_Southampton|
|---|---|---|---|---|---|---|---|---|---|
|0|41\.0|0|1|0\.0|7\.0|0|0|0|1|
|1|21\.0|0|0|0\.0|0\.0|0|0|0|1|
|2|13\.0|0|1|NaN|20\.0|0|0|0|1|
|3|16\.0|0|1|0\.0|NaN|0|0|0|1|
|4|NaN|0|2|0\.0|24\.0|0|1|0|0|

#Challenge 2

We are missing an operator, one that drops columns. We do not need it for the Titanic dataset but do need it for the customer rating dataset.

Go ahead and build a transformer for dropping columns. I'll give you a start.

Note I have given the choice of either dropping or keeping. The user can pick whichever is easier: `'keep'` when only a small number of columns should stay, `'drop'` when only a small number should go.

##Warning vs error

I'll ask you to give **an error** using `assert` if `column_list` contains columns not in the dataframe **and** `action` is `'keep'`.

I'll ask you to give **a warning** if `column_list` contains columns not in the dataframe **and** `action` is `'drop'`. Note you will have to look at the documentation on the `drop` method to determine how to keep it from giving an error!

This split makes sense to me. If you are saying drop a column that is already missing, then just warn. However, if you say keep a column that is missing, that seems like an error.
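
To make the split concrete, here is a rough sketch of both behaviors outside of any class. The table and column names are made up for illustration, and this is just one way to do it, not necessarily how your transformer has to look.

```python
import warnings
import pandas as pd

# Hypothetical toy table, invented for illustration only
toy_df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# 'drop' case: a missing column only earns a warning
to_drop = ['B', 'Z']   # 'Z' is not a column in toy_df
missing = set(to_drop) - set(toy_df.columns)
if missing:
    warnings.warn(f"Columns not found and will be ignored: {missing}", UserWarning)

# errors='ignore' tells drop to skip labels it cannot find instead of raising
result = toy_df.drop(columns=to_drop, errors='ignore')
print(result.columns.to_list())  # ['A']

# 'keep' case: a missing column is treated as an error
to_keep = ['A', 'Z']
missing_keep = set(to_keep) - set(toy_df.columns)
try:
    assert not missing_keep, f"unknown columns to keep: {missing_keep}"
except AssertionError as e:
    print(f"AssertionError: {e}")
```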

In [173]:
from typing import Literal

class CustomDropColumnsTransformer(BaseEstimator, TransformerMixin):
    """
    A transformer that either drops or keeps specified columns in a DataFrame.

    This transformer follows the scikit-learn transformer interface and can be used in
    a scikit-learn pipeline. It allows for selectively keeping or dropping columns
    from a DataFrame based on a provided list.

    Parameters
    ----------
    column_list : List[str]
        List of column names to either drop or keep, depending on the action parameter.
    action : str, default='drop'
        The action to perform on the specified columns. Must be one of:
        - 'drop': Remove the specified columns from the DataFrame
        - 'keep': Keep only the specified columns in the DataFrame

    Attributes
    ----------
    column_list : List[str]
        The list of column names to operate on.
    action : str
        The action to perform ('drop' or 'keep').

    Examples
    --------
    >>> import pandas as pd
    >>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
    >>>
    >>> # Drop columns example
    >>> dropper = CustomDropColumnsTransformer(column_list=['A', 'B'], action='drop')
    >>> dropped_df = dropper.fit_transform(df)
    >>> dropped_df.columns.tolist()
    ['C']
    >>>
    >>> # Keep columns example
    >>> keeper = CustomDropColumnsTransformer(column_list=['A', 'C'], action='keep')
    >>> kept_df = keeper.fit_transform(df)
    >>> kept_df.columns.tolist()
    ['A', 'C']
    """

    def __init__(self, column_list: List[str], action: Literal['drop', 'keep'] = 'drop') -> None:
        """
        Initialize the CustomDropColumnsTransformer.

        Parameters
        ----------
        column_list : List[str]
            List of column names to either drop or keep.
        action : str, default='drop'
            The action to perform on the specified columns.
            Must be either 'drop' or 'keep'.

        Raises
        ------
        AssertionError
            If action is not 'drop' or 'keep', or if column_list is not a list.
        """
        assert action in ['keep', 'drop'], f'DropColumnsTransformer action {action} not in ["keep", "drop"]'
        assert isinstance(column_list, list), f'DropColumnsTransformer expected list but saw {type(column_list)}'
        self.column_list: List[str] = column_list
        self.action: Literal['drop', 'keep'] = action

    #your code below

    def fit(self, X: pd.DataFrame, y: Optional[Iterable] = None) -> Self:
        """
        Fit method - performs no actual fitting operation.

        This method is implemented to adhere to the scikit-learn transformer interface
        but doesn't perform any computation.

        Parameters
        ----------
        X : pandas.DataFrame
            The input data to fit.
        y : array-like, default=None
            Ignored. Present for compatibility with scikit-learn interface.

        Returns
        -------
        self : CustomDropColumnsTransformer
            Returns self to allow method chaining.
        """
        print(f"\nWarning: {self.__class__.__name__}.fit does nothing.\n")
        return self  #always the return value of fit

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        """
        Transforms the input DataFrame by either dropping or keeping specified columns.

        Parameters
        ----------
        X : pandas.DataFrame
            The DataFrame to transform.

        Returns
        -------
        pandas.DataFrame
            The transformed DataFrame with columns dropped or kept.

        Raises
        ------
        AssertionError
            If X is not a pandas DataFrame.
        KeyError
            If action is 'keep' and any column in `column_list` is not found in the input DataFrame.
        """
        assert isinstance(X, pd.DataFrame), f"{self.__class__.__name__}.transform expected DataFrame but got {type(X)} instead."

        X_ = X.copy()  # Make a copy of the DataFrame to avoid modifying the original


        if self.action == 'drop':
            unknown_columns = [col for col in self.column_list if col not in X_.columns]
            if unknown_columns:
                warnings.warn(f"Columns {unknown_columns} not found in DataFrame and will be ignored.", UserWarning)
            X_ = X_.drop(columns=self.column_list, errors='ignore')  # errors='ignore' skips missing columns instead of raising KeyError
        elif self.action == 'keep':
            try:
                X_ = X_[self.column_list]
            except KeyError as e:
                raise KeyError(f"Column {e} not found in the DataFrame.") from e

        return X_

    def fit_transform(self, X: pd.DataFrame, y: None = None) -> pd.DataFrame:
        """
        Fit to data, then transform it.

        Combines fit() and transform() methods for convenience.

        Parameters
        ----------
        X : pandas.DataFrame
            The DataFrame to transform.
        y : Ignored
            Not used, present for API consistency by convention.

        Returns
        -------
        pandas.DataFrame
            A copy of the input DataFrame with the specified columns dropped or kept.
        """
        #self.fit(X,y)
        result = self.transform(X)
        return result



##Test your error checking

In [151]:
customer_features.head()

Unnamed: 0,ID,Gender,Experience Level,Time Spent,OS,ISP,Age
0,3,Female,medium,,iOS,Xfinity,
1,27,Male,medium,71.97,Android,Cox,50.0
2,30,Female,medium,101.81,,Cox,49.0
3,40,Female,medium,86.37,Android,Xfinity,53.0
4,52,Female,medium,103.97,iOS,Xfinity,58.0


In [161]:
#test here - don't change
col_list = ['ID',	'Gender',	'Experience Level',	'Time Spent',	'OS',	'ISP',	'Age', 'First timer']  #AssertionError: CustomDropColumnsTransformer.transform unknown columns to keep: {'First timer'}

keepers = CustomDropColumnsTransformer(col_list, 'keep')

error_df = keepers.fit_transform(customer_features)  #assertion error


KeyError: 'Column "[\'First timer\'] not in index" not found in the DataFrame.'

The following test can make use of typing like this:

<pre>
action: Literal['drop', 'keep'] = 'drop'
</pre>

I asked Gemini to document my code first and it did not give me the Literal type above initially. I had to ask it to make sure to add Literal types. On the other hand, Claude gave it to me on the first shot.

With the Literal you will see a red line under `'remain'`.
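
One thing worth knowing: as far as I understand it, `Literal` is only a hint for the editor or type checker. Python itself does not enforce it when the code runs, which is why the `assert` in `__init__` still earns its keep. Here is a small sketch of that idea. The `Action` alias and `check_action` function are hypothetical names I made up for illustration.

```python
from typing import Literal, get_args

Action = Literal['drop', 'keep']   # hypothetical alias, for illustration only

def check_action(action: Action = 'drop') -> str:
    # get_args pulls the allowed values back out of the Literal type
    allowed = get_args(Action)
    assert action in allowed, f'action {action} not in {list(allowed)}'
    return action

print(check_action('keep'))

try:
    check_action('remain')   # a type checker flags this line; at runtime only the assert catches it
except AssertionError as e:
    print(f'AssertionError: {e}')
```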

In [162]:
#test here - don't change. Should see red line under remain if added type checking
col_list = ['Gender',	'Experience Level',	'Time Spent',	'OS',	'ISP',	'Age']  #keepers
keepers = CustomDropColumnsTransformer(col_list, 'remain')

transformed_df_keep = keepers.fit_transform(customer_features)
assert set(col_list)-set(transformed_df_keep.columns.to_list())==set()

AssertionError: DropColumnsTransformer action remain not in ["keep", "drop"]

In [163]:
#test here - don't change
col_list = ['ID',	'Experience Level',	'Time Spent',	'OS',	'ISP',	'Age', 'Rating']
droppers = CustomDropColumnsTransformer(col_list, 'drop')
transformed_df_drop = droppers.fit_transform(customer_features)  #Warning: DropColumnsTransformer does not contain these columns to drop: {'Rating'}.



In [157]:
transformed_df_drop.head()  #should see columns dropped even though got warning

NameError: name 'transformed_df_drop' is not defined

###Make sure drop and keep give equivalent results

In [164]:
#test here - don't change
col_list = ['Gender',	'Experience Level',	'Time Spent',	'OS',	'ISP',	'Age']  #keepers
keepers = CustomDropColumnsTransformer(col_list, 'keep')

transformed_df_keep = keepers.fit_transform(customer_features)
assert set(col_list)-set(transformed_df_keep.columns.to_list())==set()

In [165]:
transformed_df_keep.head()

Unnamed: 0,Gender,Experience Level,Time Spent,OS,ISP,Age
0,Female,medium,,iOS,Xfinity,
1,Male,medium,71.97,Android,Cox,50.0
2,Female,medium,101.81,,Cox,49.0
3,Female,medium,86.37,Android,Xfinity,53.0
4,Female,medium,103.97,iOS,Xfinity,58.0


In [166]:
#test here - don't change
col_list = ['ID']  #droppers
droppers = CustomDropColumnsTransformer(col_list, 'drop')
transformed_df_drop = droppers.fit_transform(customer_features)
assert not set(transformed_df_drop.columns.to_list()).intersection(set(col_list))  #they are gone

In [167]:
transformed_df_drop.head()

Unnamed: 0,Gender,Experience Level,Time Spent,OS,ISP,Age
0,Female,medium,,iOS,Xfinity,
1,Male,medium,71.97,Android,Cox,50.0
2,Female,medium,101.81,,Cox,49.0
3,Female,medium,86.37,Android,Xfinity,53.0
4,Female,medium,103.97,iOS,Xfinity,58.0


In [168]:
assert (transformed_df_keep.columns == transformed_df_drop.columns).all()  #should be True

#Challenge 3

Start a `customer_transformer` pipeline and add a drop step to the pipeline as the first step.

In [174]:
customer_transformer = Pipeline(steps=[
    #add drop step below
    ('drop', CustomDropColumnsTransformer(['ID'], 'drop'))
    ], verbose=True)

#now invoke it
transformed_df = customer_transformer.fit_transform(customer_features)

[Pipeline] .............. (step 1 of 1) Processing drop, total=   0.0s


In [175]:
transformed_df.head()

Unnamed: 0,Gender,Experience Level,Time Spent,OS,ISP,Age
0,Female,medium,,iOS,Xfinity,
1,Male,medium,71.97,Android,Cox,50.0
2,Female,medium,101.81,,Cox,49.0
3,Female,medium,86.37,Android,Xfinity,53.0
4,Female,medium,103.97,iOS,Xfinity,58.0


###What I see

|index|Gender|Experience Level|Time Spent|OS|ISP|Age|
|---|---|---|---|---|---|---|
|0|Female|medium|NaN|iOS|Xfinity|NaN|
|1|Male|medium|71\.97|Android|Cox|50\.0|
|2|Female|medium|101\.81|NaN|Cox|49\.0|
|3|Female|medium|86\.37|Android|Xfinity|53\.0|
|4|Female|medium|103\.97|iOS|Xfinity|58\.0|

#Challenge 4

Fill out the remainder of the customer_transformer pipeline. You have the first step. Now add the others.



###Here is a plan


* You will have to figure out what the possible categorical values are for each string column. There is a pandas method to tell you this. In essence there is a pandas method for everything! It's just figuring out what it is that is the trick :) Hint: look at the CustomMappingTransformer for a clue.

* There are 4 string feature columns. Decide whether to treat each one as ordinal or nominal. Then add the appropriate transformer to the pipeline.
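
The plan above can be sketched on a tiny made-up table. Everything here (`toy_df`, its columns, the mapping) is hypothetical and only meant to show the ordinal versus nominal decision, not the answer for the customer data.

```python
import pandas as pd

# Hypothetical toy table, invented for illustration only
toy_df = pd.DataFrame({'Size': ['small', 'large', None, 'medium'],
                       'Color': ['red', 'blue', 'red', None],
                       'Price': [1.5, 2.0, 3.25, 4.0]})

# Step 1: list the distinct values in each string column
for col in ['Size', 'Color']:
    print(col, toy_df[col].unique())

# Step 2a: Size has a natural order, so treat it as ordinal and map to ranked integers
ordinal = toy_df['Size'].map({'small': 0, 'medium': 1, 'large': 2})

# Step 2b: Color has no natural order, so treat it as nominal and one-hot encode it
nominal = pd.get_dummies(toy_df['Color'], prefix='Color', dtype=int)

print(ordinal)
print(nominal)
```

Missing values stay missing after the mapping and show up as all zeros after the one-hot encoding.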




In [176]:
#figure out how to get unique values in each string column
print(customers_df.apply(lambda col: col.unique()))

ID                  [3, 27, 30, 40, 52, 94, 109, 122, 126, 143, 14...
Gender                                            [Female, Male, nan]
Experience Level                             [medium, nan, low, high]
Time Spent          [nan, 71.97, 101.81, 86.37, 103.97, 102.56, 74...
OS                                                [iOS, Android, nan]
ISP                              [Xfinity, Cox, HughesNet, nan, AT&T]
Age                 [nan, 50.0, 49.0, 53.0, 58.0, 44.0, 64.0, 67.0...
Rating                                                         [0, 1]
dtype: object


###What I see

<pre>
ID                  [3, 27, 30, 40, 52, 94, 109, 122, 126, 143, 14...
Gender                                            [Female, Male, nan]
Experience Level                             [medium, nan, low, high]
Time Spent          [nan, 71.97, 101.81, 86.37, 103.97, 102.56, 74...
OS                                                [iOS, Android, nan]
ISP                              [Xfinity, Cox, HughesNet, nan, AT&T]
Age                 [nan, 50.0, 49.0, 53.0, 58.0, 44.0, 64.0, 67.0...
Rating                                                         [0, 1]
dtype: object
</pre>

In [177]:
#Once you finish challenge 3 you will have first step - figure out others - I ended up with 5 total steps
customer_transformer = Pipeline(steps=[
    #fill in the steps on your own
    # Drop ID
    ('drop', CustomDropColumnsTransformer(['ID'], 'drop')),
    # Map gender
    ('gender', CustomMappingTransformer('Gender', {'Male': 0, 'Female': 1})),
    # Map experience level
    ('experience_level', CustomMappingTransformer('Experience Level', {'low': 0, 'medium': 1, 'high': 2})),
    # One hot encode OS
    ('os', CustomOHETransformer('OS')),
    # One hot encode ISP
    ('isp', CustomOHETransformer('ISP'))
    ], verbose=True)

In [178]:
transformed_customer_df = customer_transformer.fit_transform(customer_features)
transformed_customer_df.head()

[Pipeline] .............. (step 1 of 5) Processing drop, total=   0.0s


[Pipeline] ............ (step 2 of 5) Processing gender, total=   0.0s


[Pipeline] .. (step 3 of 5) Processing experience_level, total=   0.0s
[Pipeline] ................ (step 4 of 5) Processing os, total=   0.0s
[Pipeline] ............... (step 5 of 5) Processing isp, total=   0.0s


Unnamed: 0,Gender,Experience Level,Time Spent,Age,OS_Android,OS_iOS,ISP_AT&T,ISP_Cox,ISP_HughesNet,ISP_Xfinity
0,1,1,,,0,1,0,0,0,1
1,0,1,71.97,50.0,1,0,0,1,0,0
2,1,1,101.81,49.0,0,0,0,1,0,0
3,1,1,86.37,53.0,1,0,0,0,0,1
4,1,1,103.97,58.0,0,1,0,0,0,1


###What I see

|index|Gender|Experience Level|Time Spent|Age|OS\_Android|OS\_iOS|ISP\_AT&amp;T|ISP\_Cox|ISP\_HughesNet|ISP\_Xfinity|
|---|---|---|---|---|---|---|---|---|---|---|
|0|1|1|NaN|NaN|0|1|0|0|0|1|
|1|0|1|71\.97|50\.0|1|0|0|1|0|0|
|2|1|1|101\.81|49\.0|0|0|0|1|0|0|
|3|1|1|86\.37|53\.0|1|0|0|0|0|1|
|4|1|1|103\.97|58\.0|0|1|0|0|0|1|

#That's it for this notebook!

But a few notes for the future.

##Notice we dropped Name before dropping duplicates

One of our first Titanic wrangling steps was to drop the Name column. We then dropped duplicates.
You could plausibly argue we should have done the reverse. Maybe true duplicates should have to match on Name as well. And we dropped almost 1000 rows by doing it the way we did. But I want to consider a more nuanced way to augment our data in a future notebook, i.e., by oversampling. For now, all of our rows are unique in Titanic.
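
Here is a tiny sketch of why the order matters. The three rows are made up for illustration and have nothing to do with the real passenger list.

```python
import pandas as pd

# Hypothetical rows: 0 and 2 are exact duplicates, row 1 differs only in Name
toy_df = pd.DataFrame({'Name': ['Ann', 'Bea', 'Ann'],
                       'Age': [30, 30, 30],
                       'Class': ['C1', 'C1', 'C1']})

# Order 1: drop Name first, then drop duplicates
name_first = toy_df.drop(columns='Name').drop_duplicates()

# Order 2: drop duplicates first (Name still included), then drop Name
dups_first = toy_df.drop_duplicates().drop(columns='Name')

print(len(name_first), len(dups_first))  # 1 2
```

With Name gone, all three rows look identical and collapse into one. With Name still present, only the exact repeat is removed.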

And in the customer dataset, we did not drop duplicates at all. I want to show you both approaches as we move forward: keeping versus dropping duplicates. They both have advantages and disadvantages.