https://www.youtube.com/watch?v=V0u6bxQOUJ8

# Predictive Modeling

Predict whether an individual makes > $50k per year based on census data

https://archive.ics.uci.edu/ml/datasets/Adult

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# data

colnames = 'age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country income'.split()

df = pd.read_csv('adult.data.csv', names = colnames, header=None)
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [3]:
# remove leading and trailing whitespace from df['income']
df['income'] = df['income'].str.strip()

In [4]:
print(df['income'].value_counts())

# Typing the label by hand is error-prone because of capitalization and whitespace.
# Pulling the string straight from the value_counts() index avoids retyping it.
print(df['income'].value_counts().index[1])

<=50K    24720
>50K      7841
Name: income, dtype: int64
>50K
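The whitespace problem comes from the file using a comma followed by a space as its separator. A minimal sketch of one way to avoid it at load time, assuming the `skipinitialspace` option of `pd.read_csv` is acceptable for this data; the two rows and the three-column layout below are made up for illustration and are not taken from adult.data.csv:

```python
import io
import pandas as pd

# Two made-up rows in the same "comma + space" layout as the census file
raw = "39, State-gov, <=50K\n52, Self-emp-inc, >50K\n"

cols = ["age", "workclass", "income"]  # hypothetical subset of the real columns

# skipinitialspace=True drops the spaces that follow each delimiter
demo = pd.read_csv(io.StringIO(raw), names=cols, header=None,
                   skipinitialspace=True)

print(demo["income"].tolist())
```

Applied to the full file, this would strip every string column, not only income, so later labels such as ' United-States' would lose their leading space as well.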


In [5]:
# Create boolean target column
df['target'] = (df['income'] == '>50K')

# can also be done with a list comprehension (this yields 0/1 instead of True/False):
# df['target'] = [0 if x == '<=50K' else 1 for x in df['income']]

In [6]:
df['target'].value_counts()

False    24720
True      7841
Name: target, dtype: int64

In [7]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income,target
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K,False
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K,False
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K,False
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K,False
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K,False


In [8]:
# Assign a DataFrame of features, X, and a Series of the outcome variable, y
X = df.drop(['income', 'target'], axis=1)
y = df.target

In [9]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
dtypes: int64(6), object(8)
memory usage: 3.5+ MB


In [10]:
y.head()

0    False
1    False
2    False
3    False
4    False
Name: target, dtype: bool

In [11]:
y.value_counts()

False    24720
True      7841
Name: target, dtype: int64

## Basic Data Cleaning

### Dealing with Data Types

1. Numeric (income, age)
2. Categorical (gender, nationality)
3. Ordinal (low/med/high)

Most machine learning models accept only numeric features, so non-numeric columns must be converted first.

To convert:
- Create dummy variables
- Each categorical or ordinal feature becomes several 0/1 dummy columns, one per category
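Dummy columns discard the ordering of an ordinal feature. As a point of comparison only, here is a minimal sketch of encoding an ordinal column as integer codes next to its dummy encoding. The `size` column and its low/med/high levels are hypothetical and not part of the census data:

```python
import pandas as pd

# Hypothetical ordinal feature; the column name and levels are made up
sizes = pd.DataFrame({"size": ["low", "high", "med", "low"]})

# Ordered categorical: the integer codes follow the order given in `levels`
levels = ["low", "med", "high"]
sizes["size_code"] = pd.Categorical(
    sizes["size"], categories=levels, ordered=True
).codes

# Dummy encoding of the same column, one 0/1 column per level
dummies = pd.get_dummies(sizes["size"], prefix="size", dtype="uint8")

print(sizes["size_code"].tolist())
print(list(dummies.columns))
```

In this dataset, education-num appears to play the same role for education: a single integer column that keeps the order of the levels.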

In [12]:
# See Categorical variable: Education
X['education'].value_counts()

 HS-grad         10501
 Some-college     7291
 Bachelors        5355
 Masters          1723
 Assoc-voc        1382
 11th             1175
 Assoc-acdm       1067
 10th              933
 7th-8th           646
 Prof-school       576
 9th               514
 12th              433
 Doctorate         413
 5th-6th           333
 1st-4th           168
 Preschool          51
Name: education, dtype: int64

In [13]:
# explore pandas get_dummies
print(pd.get_dummies(X['education'].head(5)))

    11th   Bachelors   HS-grad
0      0           1         0
1      0           1         0
2      0           0         1
3      1           0         0
4      0           1         0


In [14]:
# Decide which categorical variables you want to use in the model
def unique_categories(DataFrame):
    for DataFrame_col in DataFrame.columns:
        if DataFrame[DataFrame_col].dtype == 'object':
            unique_cat = len(DataFrame[DataFrame_col].unique())
            print("Column --", DataFrame_col, "-- has ", unique_cat, " unique categories")
unique_categories(X)

Column -- workclass -- has  9  unique categories
Column -- education -- has  16  unique categories
Column -- marital-status -- has  7  unique categories
Column -- occupation -- has  15  unique categories
Column -- relationship -- has  6  unique categories
Column -- race -- has  5  unique categories
Column -- sex -- has  2  unique categories
Column -- native-country -- has  42  unique categories
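The same per-column count can be obtained without an explicit loop. A minimal sketch on a made-up three-row frame, assuming that every non-numeric column should be treated as categorical:

```python
import pandas as pd

# Tiny made-up frame with one numeric and two categorical columns
demo = pd.DataFrame({
    "age": [39, 50, 38],
    "sex": ["Male", "Male", "Female"],
    "race": ["White", "White", "Black"],
})

# Keep the non-numeric columns, then count distinct values in each
counts = demo.select_dtypes(exclude="number").nunique()
print(counts)
```

The result is a Series indexed by column name, so it can be sorted or filtered to pick out high-cardinality columns such as native-country.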


### Handle 'native-country'

In [15]:
# native-country has many categories, but most rows are ' United-States'
# so two bins are enough: U.S. and other. One boolean column encodes this.
X['native-country'].value_counts()

 United-States                 29170
 Mexico                          643
 ?                               583
 Philippines                     198
 Germany                         137
 Canada                          121
 Puerto-Rico                     114
 El-Salvador                     106
 India                           100
 Cuba                             95
 England                          90
 Jamaica                          81
 South                            80
 China                            75
 Italy                            73
 Dominican-Republic               70
 Vietnam                          67
 Guatemala                        64
 Japan                            62
 Poland                           60
 Columbia                         59
 Taiwan                           51
 Haiti                            44
 Iran                             43
 Portugal                         37
 Nicaragua                        34
 Peru                             31
 

In [16]:
X['native-country'].value_counts().index[0]

' United-States'

In [17]:
X['country-U.S.'] = (X['native-country'] == X['native-country'].value_counts().index[0])

In [18]:
X['country-U.S.'].value_counts()

True     29170
False     3391
Name: country-U.S., dtype: int64

In [19]:
X.head(1)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,country-U.S.
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,True


In [20]:
X = X.drop('native-country', axis=1)
X.head(1)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,country-U.S.
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,True
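Collapsing everything except the most common value is one option. A more general variant, sketched here under the assumption that a frequency cutoff is wanted, keeps every category above a threshold and groups the rest under a single label. The `min_count` value, the 'Other' label, and the sample data are all made up for illustration:

```python
import pandas as pd

# Made-up column: one dominant category, one mid-sized, one rare
country = pd.Series(["US"] * 6 + ["Mexico"] * 3 + ["Cuba"],
                    name="native-country")

min_count = 3  # hypothetical cutoff; tune for the real data
counts = country.value_counts()
keep = counts[counts >= min_count].index

# Values in `keep` stay as they are; everything else becomes 'Other'
binned = country.where(country.isin(keep), "Other")
print(binned.value_counts().to_dict())
```

With a cutoff high enough that only the top category survives, this reduces to the two-bin U.S./other split used above.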


## Handling Categorical to dummy

In [21]:
# Create a list of features to dummy
def column_objects(df):
    # return a list of columns with dtype object
    object_cols = []
    for col in df.columns:
        if df[col].dtype == 'object':
            object_cols.append(col)
    return object_cols

todummy_list = column_objects(X)
todummy_list

['workclass',
 'education',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex']

In [22]:
type(todummy_list)

list

In [23]:
X.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,country-U.S.
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,True
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,True
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,True
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,True
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,False


In [24]:
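def make_dummies(df, todummy_list):
    # Replace each listed column with its dummy columns.
    # drop() returns a new DataFrame, so the caller's frame is not modified.
    for s in todummy_list:
        new_dummies = pd.get_dummies(df[s], prefix=s)
        df = df.drop(s, axis=1)
        df = pd.concat([df, new_dummies], axis=1)
    return df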

X = make_dummies(X, todummy_list)
X.head()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,country-U.S.,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,...,relationship_ Own-child,relationship_ Unmarried,relationship_ Wife,race_ Amer-Indian-Eskimo,race_ Asian-Pac-Islander,race_ Black,race_ Other,race_ White,sex_ Female,sex_ Male
0,39,77516,13,2174,0,40,True,0,0,0,...,0,0,0,0,0,0,0,1,0,1
1,50,83311,13,0,0,13,True,0,0,0,...,0,0,0,0,0,0,0,1,0,1
2,38,215646,9,0,0,40,True,0,0,0,...,0,0,0,0,0,0,0,1,0,1
3,53,234721,7,0,0,40,True,0,0,0,...,0,0,0,0,0,1,0,0,0,1
4,28,338409,13,0,0,40,False,0,0,0,...,0,0,1,0,0,1,0,0,1,0
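For reference, `pd.get_dummies` can also encode several columns in one call through its `columns` argument, which drops the originals and prefixes the new columns automatically. A minimal sketch on a made-up two-row frame; `dtype="uint8"` is passed explicitly because recent pandas versions default to boolean dummy columns:

```python
import pandas as pd

# Tiny made-up frame with one numeric and two categorical columns
demo = pd.DataFrame({
    "age": [39, 28],
    "sex": ["Male", "Female"],
    "race": ["White", "Black"],
})

# Encode both categorical columns at once; `demo` itself is left unchanged
encoded = pd.get_dummies(demo, columns=["sex", "race"], dtype="uint8")
print(list(encoded.columns))
```

The untouched numeric columns come first, followed by the dummy columns for each encoded feature in the order listed.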


In [25]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 67 columns):
 #   Column                                 Non-Null Count  Dtype
---  ------                                 --------------  -----
 0   age                                    32561 non-null  int64
 1   fnlwgt                                 32561 non-null  int64
 2   education-num                          32561 non-null  int64
 3   capital-gain                           32561 non-null  int64
 4   capital-loss                           32561 non-null  int64
 5   hours-per-week                         32561 non-null  int64
 6   country-U.S.                           32561 non-null  bool 
 7   workclass_ ?                           32561 non-null  uint8
 8   workclass_ Federal-gov                 32561 non-null  uint8
 9   workclass_ Local-gov                   32561 non-null  uint8
 10  workclass_ Never-worked                32561 non-null  uint8
 11  workclass_ Private          

The categorical columns have now been converted to dummy variables.

In [26]:
X['country-U.S.'].value_counts()

True     29170
False     3391
Name: country-U.S., dtype: int64

In [27]:
# make 'country-U.S.' into uint8
X['country-U.S.'] = X['country-U.S.'].astype('uint8')
X['country-U.S.'].value_counts()

1    29170
0     3391
Name: country-U.S., dtype: int64

In [28]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 67 columns):
 #   Column                                 Non-Null Count  Dtype
---  ------                                 --------------  -----
 0   age                                    32561 non-null  int64
 1   fnlwgt                                 32561 non-null  int64
 2   education-num                          32561 non-null  int64
 3   capital-gain                           32561 non-null  int64
 4   capital-loss                           32561 non-null  int64
 5   hours-per-week                         32561 non-null  int64
 6   country-U.S.                           32561 non-null  uint8
 7   workclass_ ?                           32561 non-null  uint8
 8   workclass_ Federal-gov                 32561 non-null  uint8
 9   workclass_ Local-gov                   32561 non-null  uint8
 10  workclass_ Never-worked                32561 non-null  uint8
 11  workclass_ Private          

## Handle Missing Data

Imputing missing values is usually better than dropping the affected rows, since dropping also discards the valid values in those rows.

In [29]:
X.isnull().sum().sort_values(ascending=False).head(67)

age                              0
occupation_ Protective-serv      0
marital-status_ Never-married    0
marital-status_ Separated        0
marital-status_ Widowed          0
                                ..
education_ Masters               0
education_ Preschool             0
education_ Prof-school           0
education_ Some-college          0
sex_ Male                        0
Length: 67, dtype: int64

No NaN values in this version of the dataset (unknown entries are stored as ' ?' strings, which isnull() does not flag):
https://archive.ics.uci.edu/ml/datasets/Adult
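The ' ?' entries seen earlier in native-country and workclass are ordinary strings to pandas, so they do not show up in the null count. A minimal sketch of turning them into NaN before counting, using a made-up three-row frame and assuming every column holds strings:

```python
import numpy as np
import pandas as pd

# Made-up rows using the same ' ?' placeholder as the census file
demo = pd.DataFrame({
    "workclass": [" Private", " ?", " State-gov"],
    "native-country": [" ?", " Cuba", " ?"],
})

# Strip the leading space, then replace the bare '?' marker with NaN
cleaned = demo.apply(lambda col: col.str.strip()).replace("?", np.nan)
print(cleaned.isnull().sum().to_dict())
```

This would have to run before the dummy-variable step, because after encoding the placeholder lives on as its own column (for example workclass_ ?).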


If missing data were present, we could impute it using the method below:

In [30]:
# impute using SimpleImputer from sklearn.impute
# (Imputer was removed from sklearn.preprocessing in scikit-learn 0.22,
#  which is why the original import raised an ImportError)
from sklearn.impute import SimpleImputer

# replace each NaN with the median of its column
imp = SimpleImputer(missing_values=np.nan, strategy='median')
imp.fit(X)
X = pd.DataFrame(data=imp.transform(X), columns=X.columns)
X