# Chapter 4. Feature engineering and Representing data.
# Part 1. Categorical data.

## One-hot-encoding (Dummy Variables)

A categorical feature can be represented as a set of binary features, one per category: the feature matching a sample's category is set to '1' and all the others are set to '0'.
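As a minimal sketch of this idea, the toy example below encodes a made-up 'color' column (the column name and its values are illustrative assumptions, not part of the dataset used in this chapter). 'dtype=int' is passed so that the result holds 0/1 values rather than booleans.

```python
import pandas as pd

# Hypothetical toy data: one categorical column with three distinct values
toy = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# One new column per distinct value; dtype=int gives 0/1 instead of booleans
toy_encoded = pd.get_dummies(toy, dtype=int)
print(toy_encoded)
```

Every row of the result contains exactly one '1', in the column that corresponds to the row's original value.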

#### Implementing One-hot-encoding:

Loading data:

In [2]:
import pandas as pd

#-----loading data
#the file has no header row, so 'header=None'
#'names=['...',...,'...']' defines the column names
data = pd.read_csv('adult.data', header=None, names=['age','workclass','fnlwgt','education','education-num','marital-status','occupation','relationship','race','gender','capital-gain','capital-loss','hours-per-week','native-country','income'])

#keeping only a few of the columns
data = data[['age','workclass','education','gender','hours-per-week','occupation','income']]
data.head()

| | age | workclass | education | gender | hours-per-week | occupation | income |
|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | Bachelors | Male | 40 | Adm-clerical | <=50K |
| 1 | 50 | Self-emp-not-inc | Bachelors | Male | 13 | Exec-managerial | <=50K |
| 2 | 38 | Private | HS-grad | Male | 40 | Handlers-cleaners | <=50K |
| 3 | 53 | Private | 11th | Male | 40 | Handlers-cleaners | <=50K |
| 4 | 28 | Private | Bachelors | Female | 40 | Prof-specialty | <=50K |


Data validation:

In [3]:
#it's useful to check whether the values in a column are spelled consistently:
print(data.gender.value_counts(), '\n')

 Male      21790
 Female    10771
Name: gender, dtype: int64 



^ The column has exactly two distinct values, spelled consistently.
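Had the check revealed several spellings of the same category, they would have to be unified before encoding, because each spelling would otherwise get its own dummy column. A possible cleanup is sketched below on hypothetical values (the series 'raw' is made up for illustration and is not taken from 'adult.data').

```python
import pandas as pd

# Hypothetical column in which the same two categories are spelled inconsistently
raw = pd.Series([' Male', 'male', 'Female ', ' Male', 'female'], name='gender')
print(raw.value_counts(), '\n')  # five rows, four distinct spellings

# Strip surrounding whitespace and unify capitalization
clean = raw.str.strip().str.capitalize()
print(clean.value_counts())
```

After the cleanup only the two intended categories remain.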

One-hot-encoding (making dummy variables):

In [4]:
#-----representing categorical data as dummy variables
data_dummies = pd.get_dummies(data)
print('Original features:\n', list(data.columns), '\n')
print('Features with dummies:\n', list(data_dummies.columns), '\n')
data_dummies.head()

Original features:
 ['age', 'workclass', 'education', 'gender', 'hours-per-week', 'occupation', 'income'] 

Features with dummies:
 ['age', 'hours-per-week', 'workclass_ ?', 'workclass_ Federal-gov', 'workclass_ Local-gov', 'workclass_ Never-worked', 'workclass_ Private', 'workclass_ Self-emp-inc', 'workclass_ Self-emp-not-inc', 'workclass_ State-gov', 'workclass_ Without-pay', 'education_ 10th', 'education_ 11th', 'education_ 12th', 'education_ 1st-4th', 'education_ 5th-6th', 'education_ 7th-8th', 'education_ 9th', 'education_ Assoc-acdm', 'education_ Assoc-voc', 'education_ Bachelors', 'education_ Doctorate', 'education_ HS-grad', 'education_ Masters', 'education_ Preschool', 'education_ Prof-school', 'education_ Some-college', 'gender_ Female', 'gender_ Male', 'occupation_ ?', 'occupation_ Adm-clerical', 'occupation_ Armed-Forces', 'occupation_ Craft-repair', 'occupation_ Exec-managerial', 'occupation_ Farming-fishing', 'occupation_ Handlers-cleaners', 'occupation_ Machine-op-inspct

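| | age | hours-per-week | workclass_ ? | workclass_ Federal-gov | workclass_ Local-gov | workclass_ Never-worked | workclass_ Private | workclass_ Self-emp-inc | workclass_ Self-emp-not-inc | workclass_ State-gov | ... | occupation_ Machine-op-inspct | occupation_ Other-service | occupation_ Priv-house-serv | occupation_ Prof-specialty | occupation_ Protective-serv | occupation_ Sales | occupation_ Tech-support | occupation_ Transport-moving | income_ <=50K | income_ >50K |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | 40 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 50 | 13 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 38 | 40 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 53 | 40 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 28 | 40 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |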


^ A new column was created for every value of each categorical feature; the numeric features 'age' and 'hours-per-week' were left unchanged.

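If the training and test data are stored in separate dataframes, they should be merged before encoding. Otherwise a category that is missing from one of them produces different columns in the two encodings, and the same column position may end up meaning different things.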
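The problem and two possible remedies are sketched below on hypothetical data (the 'item' column, its values, and all variable names are made up for illustration). The first remedy encodes the concatenated frame and splits it back; the second, shown as an alternative, encodes separately and then aligns the test columns to the training columns.

```python
import pandas as pd

# Hypothetical split in which the test set lacks one category ('box')
train = pd.DataFrame({'item': ['socks', 'fox', 'box']})
test = pd.DataFrame({'item': ['fox', 'socks']})

# Encoding separately yields different column sets
print(list(pd.get_dummies(train).columns))
print(list(pd.get_dummies(test).columns))

# Option 1: encode the concatenated frame, then split it back by key
combined = pd.concat([train, test], keys=['train', 'test'])
combined_dummies = pd.get_dummies(combined, dtype=int)
train_enc = combined_dummies.loc['train']
test_enc = combined_dummies.loc['test']

# Option 2: encode separately, then align test columns to the training columns
train_cols = pd.get_dummies(train, dtype=int).columns
test_aligned = pd.get_dummies(test, dtype=int).reindex(columns=train_cols, fill_value=0)
print(test_aligned)
```

Both options give the test set an all-zero 'item_box' column, so its layout matches that of the training set.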

Preparing dataset:

In [5]:
#Separating input features
features = data_dummies.loc[:, 'age':'occupation_ Transport-moving']
#creating dataset as NumPy array
X = features.values
y = data_dummies['income_ >50K'].values

print('X shape: {}'.format(X.shape))
print('y shape: {}'.format(y.shape))

X shape: (32561, 44)
y shape: (32561,)


Applying model:

In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

#splitting dataset
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=0)

#model initialization and building
logreg = LogisticRegression().fit(X_train, y_train)

#model validation
print('test accuracy: {}'.format(logreg.score(X_test, y_test)))

test accuracy: 0.8067804937968308


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


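^ The one-hot-encoded features can now be passed to a logistic regression model. The warning above means that the solver hit its iteration limit before converging; as it suggests, raising 'max_iter' or scaling the data addresses this.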
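A sketch of both remedies named in the warning is given below. It runs on synthetic data from 'make_classification', which is only a stand-in for the encoded dataset, and the value 'max_iter=1000' is an arbitrary assumption, so the printed accuracy says nothing about the result above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded dataset (an assumption for illustration)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Scale the features and allow more solver iterations than the default of 100
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

demo_score = pipe.score(X_te, y_te)
print('test accuracy: {:.3f}'.format(demo_score))
```

Wrapping the scaler and the model in one pipeline means the scaling parameters are learned from the training split only.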

## Integer categorical variables encoding

Some categorical features are stored not as strings but as integers, where each integer stands for a category.

Preparing dataframe:

In [10]:
demo_df = pd.DataFrame({'Integer feature': [0,1,2,1], 'Categorical feature': ['socks','fox','socks','box']})
demo_df

| | Integer feature | Categorical feature |
|---|---|---|
| 0 | 0 | socks |
| 1 | 1 | fox |
| 2 | 2 | socks |
| 3 | 1 | box |


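By default 'get_dummies' leaves integer and float columns unchanged and encodes only string-like columns, so an integer-coded categorical feature must either be converted to strings before one-hot-encoding or be listed explicitly in the 'columns' parameter.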
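The default behaviour can be checked with a short sketch on the same kind of dataframe (the variable names 'demo', 'default_enc' and 'explicit_enc' are introduced here for illustration only):

```python
import pandas as pd

demo = pd.DataFrame({'Integer feature': [0, 1, 2, 1],
                     'Categorical feature': ['socks', 'fox', 'socks', 'box']})

# Default: only the string column is encoded, the integer column passes through
default_enc = pd.get_dummies(demo, dtype=int)
print(list(default_enc.columns))

# Listing a column explicitly encodes it even though it holds integers
explicit_enc = pd.get_dummies(demo, columns=['Integer feature', 'Categorical feature'], dtype=int)
print(list(explicit_enc.columns))
```

In the first result 'Integer feature' is still a single numeric column; in the second it is replaced by one column per distinct integer.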

Converting integer features to strings and encoding:

In [12]:
#-----converting integers to strings
demo_df['Integer feature'] = demo_df['Integer feature'].astype(str)

#-----one-hot-encoding
#specific columns can be selected by 'columns' param
pd.get_dummies(demo_df, columns=['Integer feature','Categorical feature'])

| | Integer feature_0 | Integer feature_1 | Integer feature_2 | Categorical feature_box | Categorical feature_fox | Categorical feature_socks |
|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 1 |
| 3 | 0 | 1 | 0 | 1 | 0 | 0 |
