## Adult Dataset Part 1: Cleaning and Preparation

#### Objectives:
1. Preprocess the data and store the cleaned dataset as adult_clean.csv
2. Load clean dataset and test supervised/unsupervised models
    - Goal: Determine best model to predict if new entry earns >50k or <50k with supervised learning
    - Goal: Find insights and patterns in data using unsupervised learning
3. Apply k-folds cross validation
    - Goal: Evaluate robustness of results across different models.
    - Goal: Use and justify different error metrics


#### Summary of dataset

Data source: https://archive.ics.uci.edu/ml/datasets/adult

Feature descriptions copied below from the UCI website:

- age: continuous. 
- workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked. 
- fnlwgt: Continuous. A weighting assigned from the sampling process.
- education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. 
- education-num: continuous. 
- marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. 
- occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces. 
- relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. 
- race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black. 
- sex: Female, Male. 
- capital-gain: continuous.
- capital-loss: continuous. 
- hours-per-week: continuous. 
- native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
- target: >50K, <=50K.

In [225]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV

In [226]:
# The original data does not come with headers. Add headers based on the documentation description.
headers = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 
           'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 
           'hours-per-week', 'native-country', 'target']

In [227]:
data = pd.read_csv('./data/adult.data', names = headers)
df = data.copy()

In [228]:
df.head()


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,target
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [229]:
# The raw values keep the space that follows each comma, so the missing-value marker is ' ?'
df.replace({" ?":np.nan},inplace=True)
df.columns
df.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       30725 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      30718 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  31978 non-null  object
 14  target          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [230]:
cat_col = ['workclass', 'education','marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
num_cols = [ 'age', 'fnlwgt', 'education-num','capital-gain', 'capital-loss', 'hours-per-week']

At first glance the raw file appears to have no missing values. **However, the dataset description indicates that missing values have been filled with '?'**, which is why `' ?'` was replaced with NaN above before calling `df.info()`. Figure out what to do here :)
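One possible way to handle the '?' markers at load time is sketched below on a tiny in-memory sample rather than the real file. The raw file separates fields with a comma followed by a space, so `skipinitialspace=True` strips that space and `na_values='?'` converts the marker to NaN. The sample rows and the shortened column list are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same ", "-separated layout as adult.data
sample = io.StringIO(
    "39, State-gov, Adm-clerical, United-States, <=50K\n"
    "54, ?, ?, United-States, >50K\n"
    "32, Private, Sales, ?, <=50K\n"
)
# Shortened, illustrative column list (the real file has 15 columns)
cols = ['age', 'workclass', 'occupation', 'native-country', 'target']

# Strip the space after each comma and treat '?' as missing while parsing
demo = pd.read_csv(sample, names=cols, skipinitialspace=True, na_values='?')

# Count missing values per column
missing_per_column = demo.isna().sum()
print(missing_per_column)
```

Compared with replacing `' ?'` after loading, this also removes the leading space from every other string value, which keeps category labels clean for later encoding.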

In [231]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
lab_enc = LabelEncoder()
X=df.iloc[:,:-1]
y=df.iloc[:,-1]
y= lab_enc.fit_transform(y)
X_train,X_test,y_traint,y_test = train_test_split(X,y,test_size=0.2,random_state=0)
X_train.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 26048 entries, 15282 to 2732
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             26048 non-null  int64 
 1   workclass       24590 non-null  object
 2   fnlwgt          26048 non-null  int64 
 3   education       26048 non-null  object
 4   education-num   26048 non-null  int64 
 5   marital-status  26048 non-null  object
 6   occupation      24584 non-null  object
 7   relationship    26048 non-null  object
 8   race            26048 non-null  object
 9   sex             26048 non-null  object
 10  capital-gain    26048 non-null  int64 
 11  capital-loss    26048 non-null  int64 
 12  hours-per-week  26048 non-null  int64 
 13  native-country  25571 non-null  object
dtypes: int64(6), object(8)
memory usage: 3.0+ MB


Before deciding **how to deal with the missing values, we want to know how they influence the data**. For example, if workclass and occupation are both missing in the same row, it would be reasonable to drop that row.
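The `df.info()` output above reports 1,836 missing workclass values and 1,843 missing occupation values, which hints that the two may largely overlap. A minimal sketch of how to check this is shown here on a made-up toy frame; in the notebook the same calls would be applied to `df` after the '?' replacement.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df (values are illustrative only)
toy = pd.DataFrame({
    'workclass':  ['Private', np.nan, np.nan, 'State-gov', 'Private'],
    'occupation': ['Sales',   np.nan, np.nan, np.nan,      'Tech-support'],
})

# Cross-tabulate the two missingness indicators:
# rows = is workclass missing, columns = is occupation missing
overlap = pd.crosstab(
    toy['workclass'].isna(),
    toy['occupation'].isna(),
    rownames=['workclass_missing'],
    colnames=['occupation_missing'],
)

# Number of rows where both columns are missing at once
both_missing = (toy['workclass'].isna() & toy['occupation'].isna()).sum()
print(overlap)
print(both_missing)
```

If nearly all of the missing workclass rows fall in the cell where occupation is missing too, the two gaps are really one pattern and can be treated together.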

In [232]:
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder,OneHotEncoder
from sklearn.pipeline import Pipeline
num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='most_frequent')
cat_encoder = OneHotEncoder()
cat_pipe = Pipeline([('imputer',cat_imputer),('encoder',cat_encoder)])
transformer = ColumnTransformer([('numerical',num_imputer,num_cols),('categorical',cat_pipe,cat_col)],remainder='drop')
X_train_imputed = transformer.fit_transform(X_train)
X_test_imputed = transformer.transform(X_test)

**What % of the dataset do NaNs represent?** What is the **best approach** for dealing with the NaNs?
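The first question can be answered directly from the non-null counts in the `df.info()` output above. The sketch below only re-uses those printed counts; the variable names are illustrative.

```python
import pandas as pd

n_rows = 32561  # RangeIndex size reported by df.info()

# Non-null counts copied from df.info() for the three columns with gaps
non_null = pd.Series({
    'workclass': 30725,
    'occupation': 30718,
    'native-country': 31978,
})

# Share of missing values per column, in percent
pct_missing = ((n_rows - non_null) / n_rows * 100).round(2)
print(pct_missing)
```

That works out to roughly 5.6% missing for workclass and occupation and about 1.8% for native-country. On the live frame, `df.isna().mean() * 100` gives the same per-column figures, and `df.isna().any(axis=1).mean() * 100` gives the share of rows with at least one NaN.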

In [238]:
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
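# with_mean=False: the one-hot encoded matrix is sparse, and StandardScaler cannot center sparse input
scaler = StandardScaler(with_mean=False)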
clf = SVC()
model = Pipeline([('scaler',scaler),('clf',clf)])



In [239]:
from sklearn import set_config
set_config(display='diagram')
model

In [234]:
# model.score(X_test_imputed,y_test)

In [242]:
from sklearn.model_selection import cross_val_score
params = {
    'clf__kernel':['rbf','poly'],
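    'clf__C':[1,10,10]  # note: 10 is listed twice, so C=10 is fit twice (duplicate rows in the results); likely a typo for a third distinct value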
}
new_grid = GridSearchCV(model,params,cv=5)

In [243]:
new_grid.fit(X_train_imputed,y_traint)

In [245]:
df_scores = pd.DataFrame(new_grid.cv_results_)
df_scores

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_clf__C,param_clf__kernel,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,7.652972,0.132441,1.344189,0.024625,1,rbf,"{'clf__C': 1, 'clf__kernel': 'rbf'}",0.854319,0.852591,0.849328,0.848531,0.845844,0.850123,0.003006,1
1,6.976254,0.048189,1.268104,0.007789,1,poly,"{'clf__C': 1, 'clf__kernel': 'poly'}",0.85144,0.849712,0.84952,0.8445,0.841428,0.84732,0.003747,6
2,8.746066,0.126852,1.283429,0.022735,10,rbf,"{'clf__C': 10, 'clf__kernel': 'rbf'}",0.855278,0.852015,0.85048,0.846804,0.841812,0.849278,0.004622,2
3,8.288231,0.190584,1.223332,0.021054,10,poly,"{'clf__C': 10, 'clf__kernel': 'poly'}",0.852207,0.85144,0.849904,0.846996,0.840276,0.848165,0.004329,4
4,8.827147,0.122206,1.277371,0.011127,10,rbf,"{'clf__C': 10, 'clf__kernel': 'rbf'}",0.855278,0.852015,0.85048,0.846804,0.841812,0.849278,0.004622,2
5,8.49426,0.261905,1.217066,0.01186,10,poly,"{'clf__C': 10, 'clf__kernel': 'poly'}",0.852207,0.85144,0.849904,0.846996,0.840276,0.848165,0.004329,4
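The grid search above scores with accuracy only, which is the default for classifiers. Since most rows in this dataset are `<=50K`, accuracy alone can look better than the model really is, so objective 3 (using different error metrics) matters here. A minimal sketch using `cross_validate` with several scorers follows. It assumes synthetic data from `make_classification` and swaps in `LogisticRegression` for speed, so it is not the notebook's SVC pipeline or its data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (assumption: not the Adult dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.75, 0.25], random_state=0
)

# Stand-in model: scaling followed by logistic regression
demo_model = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Stratified folds keep the class ratio similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

metric_names = ['accuracy', 'f1', 'roc_auc']
scores = cross_validate(demo_model, X_demo, y_demo, cv=cv, scoring=metric_names)

# Average each metric over the five folds
mean_scores = {name: scores[f'test_{name}'].mean() for name in metric_names}
print(mean_scores)
```

`GridSearchCV` accepts the same `scoring` list; in that case `refit` must name the metric used to pick the best model, for example `refit='f1'`.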


In [236]:
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.preprocessing import StandardScaler
# from sklearn.metrics import accuracy_score
# # scale the imputed/encoded matrices (the raw X_train still holds strings and NaNs)
# sc = StandardScaler(with_mean=False)
# X_tr_sc = sc.fit_transform(X_train_imputed)
# X_ts_sc = sc.transform(X_test_imputed)
# mod = RandomForestClassifier()
# mod.fit(X_tr_sc,y_traint)
# pred = mod.predict(X_ts_sc)

# accuracy_score(y_test,pred)

In [237]:
# lin_mod = LogisticRegression()
# lin_mod.fit(X_tr_sc,y_traint)
# pred_lin = lin_mod.predict(X_ts_sc)
# accuracy_score(y_test,pred_lin)