# Lab | Revisiting Machine Learning Case Study

- In this lab, you will use the `learningSet.csv` file, which you already cloned in today's activities. The full process for the week is shown in the PDF file.

### Instructions

Complete the following steps on the categorical columns in the dataset:

- Check for null values in all the columns
- Exclude the following variables based on their definitions. Create a new empty list called `drop_list`. We will append to this list and then drop all the columns in it later:
    - `OSOURCE` - symbol definitions not provided, too many categories
    - `ZIP` - we are including state already
- Identify columns that have over 50% missing values.
- Remove those columns from the dataframe
- Perform all of the cleaning processes from the Lesson.
- Reduce the number of categories in the column `GENDER`. The column should contain only "M" for males, "F" for females, and "other" for all the rest
    - Note that there are a few null values in the column. We will first replace those null values using the code below:

    ```python
    print(categorical['GENDER'].value_counts())
    categorical['GENDER'] = categorical['GENDER'].fillna('F')
    ```

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv('learningSet.csv', low_memory=False)

In [31]:
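# build a dataframe with the categorical (object-dtype) columns;
# .copy() avoids chained-assignment warnings when we modify it later
categorical = data.select_dtypes(object).copy()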

In [32]:
# Dataframe for percentage null values
nulls_percent_df = pd.DataFrame(categorical.isna().sum()*100/len(categorical)).reset_index()
nulls_percent_df.columns = ['column_name', 'nulls_percentage']

# Dataframe with columns whose share of missing values is above the threshold
columns_above_threshold = nulls_percent_df[nulls_percent_df['nulls_percentage']>50] # as instructed, the threshold is 50 (values are percentages, not fractions)
drop_list = list(columns_above_threshold['column_name'])
drop_list.extend(['OSOURCE','ZIP']) # add the two specified useless columns

In [33]:
drop_list  # OK, there are only the two specified columns...

['OSOURCE', 'ZIP']
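
One possible reason no column crosses the 50% threshold here is that the percentages are computed before the blank `' '` values are converted to real nulls in a later cell. The following is a minimal sketch on a hypothetical toy frame (the names `toy` and `null_percentages` are illustrative assumptions, not part of the lab) showing how the order of these two steps changes the result:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column "a" uses a single space as its missing marker
toy = pd.DataFrame({"a": [" ", " ", " ", "x"], "b": ["u", None, "v", "w"]})


def null_percentages(frame):
    """Return the percentage of null values per column as a Series."""
    return frame.isna().mean() * 100


# Blank strings are ordinary values, so isna() does not count them
before = null_percentages(toy)

# After replacing cells equal to " " with NaN, they are counted as missing
after = null_percentages(toy.replace(" ", np.nan))

print(before)
print(after)
```

Whether to convert blanks before or after applying the threshold is a judgment call for the lab; the sketch only illustrates that the two orders can give different drop lists.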

In [34]:
# Drop them from dataframe
categorical.drop(drop_list, axis=1, inplace=True)

In [35]:
# apply the cleaning we did in class: in column "MAILCODE", replace the blank ' ' value with "A", then replace every remaining ' ' value with an actual null value
categorical['MAILCODE'] = categorical['MAILCODE'].apply(lambda x: x.replace(" ", "A"))
categorical = categorical.apply(lambda x: x.replace(" ", np.nan))  # np.nan, since the np.NaN alias was removed in NumPy 2.0

# apply the cleaning we did in class: grouping low-frequency states into 'other'
df = pd.DataFrame(categorical['STATE'].value_counts()).reset_index()

df.columns = ['state', 'count']
other_states = list(df[df['count']<2500]['state'])

categorical['STATE'] = categorical['STATE'].where(~categorical['STATE'].isin(other_states), 'other')
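
The grouping step above can be illustrated on a small scale. This is a minimal sketch on a hypothetical series with an assumed threshold of 2 (the notebook uses 2500 on the real data); `states`, `min_count`, and `rare` are illustrative names only:

```python
import pandas as pd

# Hypothetical toy data standing in for the STATE column
states = pd.Series(["CA", "CA", "CA", "TX", "TX", "VT", "WY"], name="STATE")

min_count = 2  # assumed threshold for this toy example

# Count occurrences and collect the labels that fall below the threshold
counts = states.value_counts()
rare = counts[counts < min_count].index

# Keep frequent labels as they are; replace the rare ones with "other"
grouped = states.where(~states.isin(rare), "other")

print(grouped.value_counts())
```

Working directly with the index of `value_counts()` avoids the `reset_index()` and column-renaming step, though both approaches should give the same grouping.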

In [36]:
# clean gender column
categorical['GENDER'] = categorical['GENDER'].fillna('F') # fill null values with 'F', as instructed

gend = ['F','M']
categorical['GENDER'] = categorical['GENDER'].where(categorical['GENDER'].isin(gend), 'other')

In [37]:
categorical['GENDER'].value_counts()

GENDER
F        54234
M        39094
other     2084
Name: count, dtype: int64
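
As a quick sanity check of the `GENDER` logic, here is a minimal sketch on a hypothetical series (the values `"U"` and `"J"` are made-up stand-ins for any code other than `"F"` or `"M"`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the GENDER column
gender = pd.Series(["F", "M", "U", "J", np.nan, "M"], name="GENDER")

# Step 1: fill null values with "F", as the lab instructs
cleaned = gender.fillna("F")

# Step 2: keep "F" and "M"; map every other code to "other"
cleaned = cleaned.where(cleaned.isin(["F", "M"]), "other")

print(cleaned.value_counts())
```

The order matters: filling nulls first means they end up as `"F"`, whereas applying `where` first would send them to `"other"`.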