# Sklearn Pandas Lab

### Introduction

In this lesson, we'll practice using the `sklearn_pandas` library to build preprocessing pipelines.  You can read about how the library works [here](https://github.com/scikit-learn-contrib/sklearn-pandas).  

Let's get started.

### Loading the Data

In [8]:
import pandas as pd
# url = "https://raw.githubusercontent.com/jigsawlabs-student/pipelines-and-transformers/master/report-hs.csv"
url = "https://raw.githubusercontent.com/jigsawlabs-student/pipelines-and-transformers/master/hs_directory.csv"
df_directory = pd.read_csv(url)

In [9]:
df_directory[:2]

Unnamed: 0,dbn,school_name,boro,overview_paragraph,school_10th_seats,academicopportunities1,academicopportunities2,academicopportunities3,academicopportunities4,academicopportunities5,...,state_code,Borough,Latitude,Longitude,Community Board,Council District,Census Tract,BIN,BBL,NTA
0,16K498,Brooklyn High School for Law and Technology,K,The mission of Brooklyn High School for Law an...,Y,iLearnNYC: Program for expanded online coursew...,Access to Law and Technology Programs: Youth C...,"Computer Programming, iCareers, Online courses...","Multicultural Literature, Conflict Resolution/...",,...,NY,BROOKLYN,40.688831,-73.920906,3,41,375,3039676.0,3014820000.0,Stuyvesant Heights ...
1,17K524,International High School at Prospect Heights,K,We are a small school that works in teams to p...,Y,"We are a Performance-Based Assessment school, ...","Safe, supportive, and nurturing environment wh...",We are a Computer Science for All school. Thro...,All students are matched to an internship in t...,"Afterschool programs include: Peer Tutoring, A...",...,NY,BROOKLYN,40.670349,-73.961695,9,35,213,3029686.0,3011870000.0,Crown Heights South ...


Let's begin with the numeric columns that contain null values.  Return a dataframe holding only the columns that have at least one `na` value and are not of type `object`.
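As a rough illustration of the idea, the sketch below applies the same two filters to a small made-up dataframe rather than to `df_directory`. The `toy` frame, its values, and the variable names `has_na` and `numeric_with_na` are assumptions for this example only, and this is one possible approach, not necessarily the intended solution.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_directory; the values are made up for illustration.
toy = pd.DataFrame({
    # object column with a missing value (dtype forced to object on purpose)
    "school_name": pd.Series(["A", "B", None], dtype=object),
    # numeric column with a missing value
    "graduation_rate": [0.74, np.nan, 0.96],
    # numeric column without missing values
    "total_students": [594, 417, 380],
})

# Boolean Series indexed by column name: True where a column has any missing value.
has_na = toy.isna().any()

# Keep the columns with missing values, then drop the object-typed ones.
numeric_with_na = toy.loc[:, has_na].select_dtypes(exclude="object")

print(list(numeric_with_na.columns))
```

Only `graduation_rate` survives both filters: `school_name` is dropped for being of type `object`, and `total_students` is dropped for having no missing values.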

In [10]:
na_vals = None

In [11]:
numeric_cols_some_na = None

In [12]:
numeric_cols_some_na.columns

Index(['graduation_rate', 'attendance_rate', 'pct_stu_enough_variety',
       'college_career_rate', 'pct_stu_safe', 'girls', 'boys', 'pbat',
       'international', 'specialized',
       ...
       'common_audition3', 'common_audition4', 'common_audition5',
       'common_audition6', 'common_audition7', 'common_audition8',
       'common_audition9', 'common_audition10', 'BIN', 'BBL'],
      dtype='object', length=158)

In [13]:
numeric_cols_some_na[:2]

Unnamed: 0,graduation_rate,attendance_rate,pct_stu_enough_variety,college_career_rate,pct_stu_safe,girls,boys,pbat,international,specialized,...,common_audition3,common_audition4,common_audition5,common_audition6,common_audition7,common_audition8,common_audition9,common_audition10,BIN,BBL
0,0.74,0.85,0.58,0.49,0.74,,,,,,...,,,,,,,,,3039676.0,3014820000.0
1,0.63,0.88,0.92,0.53,0.86,,,1.0,1.0,,...,,,,,,,,,3029686.0,3011870000.0


### DataFrame Mapper

Next, let's use the `DataFrameMapper` to handle missing values in some of the columns identified above.

Start by importing `DataFrameMapper` from the `sklearn_pandas` library.  

Initialize the mapper and select the `graduation_rate`, `attendance_rate`, and `college_career_rate` columns.  Use the mapper to replace the `na` values in each of the three columns with that column's mean.

Set up the mapper so that it returns a dataframe.

Then call the `fit_transform` method on the `numeric_cols_some_na` dataframe.
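To see what mean imputation does to a single column, here is a minimal sketch that runs scikit-learn's `SimpleImputer` on its own, outside of any mapper. The `toy` dataframe and its values are made up for illustration. The assumption is that a mapper step pairs a column with a transformer like this one.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up column with one missing value.
toy = pd.DataFrame({"graduation_rate": [0.5, np.nan, 1.0]})

# strategy="mean" replaces each missing value with the mean of the
# non-missing values seen during fit.
imputer = SimpleImputer(strategy="mean")

# Double brackets select a one-column dataframe (2D), which is the
# shape scikit-learn transformers expect.
filled = imputer.fit_transform(toy[["graduation_rate"]])

print(filled.ravel())
```

The missing entry is filled with 0.75, the mean of 0.5 and 1.0. Because the imputer needs two-dimensional input, mapper steps that use it typically name their column as a list, such as `['graduation_rate']`, rather than as a bare string.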

In [38]:
imputed_cols = None

imputed_cols[:2]

Unnamed: 0,graduation_rate,attendance_rate,college_career_rate
0,0.74,0.85,0.49
1,0.63,0.88,0.53


* Adding `is_na` columns

Now let's update our mapper to add an `is_na` column for each of the selected columns.  Copy the code that created the mapper above, and use the `MissingIndicator` transformer from `sklearn.impute` to return three additional `is_na` columns.
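Before wiring it into the mapper, it may help to see what `MissingIndicator` returns by itself. The sketch below uses a made-up `toy` dataframe. Passing `features="all"` is a choice made for this example so that an indicator column is produced even for a column that had no missing values during fitting.

```python
import numpy as np
import pandas as pd
from sklearn.impute import MissingIndicator

# Made-up column with one missing value.
toy = pd.DataFrame({"graduation_rate": [0.5, np.nan, 1.0]})

# features="all" returns one indicator column per input column,
# whether or not that column had missing values at fit time.
indicator = MissingIndicator(features="all")

# The result is a boolean array: True where the original value was missing.
flags = indicator.fit_transform(toy[["graduation_rate"]])

print(flags.ravel())
```

The indicator has to see the original column, not the imputed one, because after imputation there are no missing values left to flag.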

In [39]:
mapper = None

Fit and transform the mapper on `numeric_cols_some_na`.

In [41]:
imputed_cols_with_is_na = None

In [43]:
imputed_cols_with_is_na[:3]

Unnamed: 0,graduation_rate,graduation_rate_is_na,attendance_rate,attendance_rate_is_na,college_career_rate,college_career_rate_is_na
0,0.74,False,0.85,False,0.49,False
1,0.63,False,0.88,False,0.53,False
2,0.96,False,0.91,False,0.87,False


### Working with Categorical Data

Now let's add categorical data to the mix.  The `boro` column holds categorical data, so let's walk through how to encode it.

First, we select the `boro` column.

In [50]:
borough = df_directory['boro']
borough[:3]

0    K
1    K
2    M
Name: boro, dtype: object

Now let's check whether any values are missing or need imputing by counting how often each value occurs.
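As a sketch of this kind of check, the example below counts the values of a small made-up series. The `toy_boro` series is an assumption for illustration. Passing `dropna=False` to `value_counts` makes missing values show up as their own row instead of being silently left out.

```python
import pandas as pd

# Made-up borough codes, including one missing value.
toy_boro = pd.Series(["K", "K", "M", None], name="boro", dtype=object)

# Count each distinct value; dropna=False keeps the missing entry in the tally.
counts = toy_boro.value_counts(dropna=False)

print(counts)
print("missing:", toy_boro.isna().sum())
```

If the counts contain only the expected category codes and the missing total is zero, the column can be encoded without an imputation step.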

In [51]:


K    119
X    110
M    107
Q     80
R     11
Name: boro, dtype: int64

Things look good, so let's add it to our mapper.

In [58]:
from sklearn.preprocessing import OneHotEncoder

In [59]:
mapper = DataFrameMapper([
    # add in the rest of the steps here     
    (['boro'], OneHotEncoder())
], df_out = True)

In [64]:
trans_cols_one_hot = mapper.fit_transform(df_directory)

trans_cols_one_hot[:2]

Unnamed: 0,graduation_rate,graduation_rate_is_na,attendance_rate,attendance_rate_is_na,college_career_rate,college_career_rate_is_na,boro_x0_K,boro_x0_M,boro_x0_Q,boro_x0_R,boro_x0_X
0,0.74,False,0.85,False,0.49,False,1.0,0.0,0.0,0.0,0.0
1,0.63,False,0.88,False,0.53,False,1.0,0.0,0.0,0.0,0.0
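To look at the encoding in isolation, here is a minimal sketch that applies `OneHotEncoder` directly to a made-up `boro` column. The `toy` dataframe is an assumption for illustration, and the output here is a plain array, without the column names that the mapper generates.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up borough codes.
toy = pd.DataFrame({"boro": ["K", "M", "K", "X"]})

encoder = OneHotEncoder()

# By default the encoder returns a sparse matrix, so convert it
# to a dense array for printing.
encoded = encoder.fit_transform(toy[["boro"]]).toarray()

# The learned categories, in the order of the output columns.
print(encoder.categories_[0])
print(encoded)
```

Each row has a single 1.0 in the column for its category, which matches the pattern in the `boro_x0_*` columns above.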


### Adding untransformed columns

Sometimes we want to include a column in the output without performing any feature engineering on it.  In that case, we add the step as usual but pass `None` as the transformer, the second element of the tuple.

In [83]:
non_objs = df_directory.select_dtypes(exclude = 'object')

In [87]:
mapper = DataFrameMapper([
    (['boro'], OneHotEncoder()),
    (['total_students'], None)
], df_out = True)

In [88]:
total_students = mapper.fit_transform(df_directory)
total_students[:2]

Unnamed: 0,boro_x0_K,boro_x0_M,boro_x0_Q,boro_x0_R,boro_x0_X,total_students
0,1.0,0.0,0.0,0.0,0.0,594
1,1.0,0.0,0.0,0.0,0.0,417


### Summary 

In this lesson, we practiced working with the `DataFrameMapper` in `sklearn_pandas`.  We saw that with the `DataFrameMapper` we can specify which columns to transform and get the results back as columns of a dataframe, and that passing `None` as the transformer includes a column unchanged.  We also practiced using some new transformers, like the `OneHotEncoder` and the `MissingIndicator`. 