Our data comes in as a two-dimensional array of floating-point numbers, where each column is a <I>continuous feature</I> that describes the data points.

A particularly common type of feature is the <I>categorical feature</I>, also known as a <I>discrete feature</I>. Categorical features are usually not numeric.

The distinction between categorical features and continuous features is analogous to the distinction between classification and regression, only on the input side rather than the output side.

# Question:

## The question of how best to represent your data for a particular application is known as <I>"feature engineering"</I>.

Representing data in the right way can have a bigger influence on the performance of a supervised model than the exact parameters we choose.

### Categorical Variables:

In [1]:
# import libraries
import pandas as pd 
import numpy as np

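Import the Adult dataset. The file has no header row, so with the default settings pandas treats the first record as the column names, as the data.sample() output below shows.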

In [2]:
data = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data")

In [3]:
data_col_names = ["age","workclass","fnlwgt","education","education_num","marital_status","occupation",
"relationship","race","sex","capital_gain","capital_loss","hours_per_week","native_country","income"]

In [4]:
data.shape[1]

15

In [5]:
len(data_col_names)

15

In [6]:
data.sample()

Unnamed: 0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
24130,46,Self-emp-not-inc,353012,Doctorate,16,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,1902,50,United-States,>50K


In [7]:
data.columns=data_col_names
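
Assigning `data.columns` here overwrites the first record, which `pd.read_csv` had already consumed as the header, so that record is lost. A minimal sketch of one alternative, assuming we want to keep that record and also drop the leading space in front of each value. The two inline rows stand in for the real file, and `sample_csv`, `col_names` and `df` are names introduced here for illustration only:

```python
import io
import pandas as pd

col_names = ["age", "workclass", "fnlwgt", "education", "education_num",
             "marital_status", "occupation", "relationship", "race", "sex",
             "capital_gain", "capital_loss", "hours_per_week",
             "native_country", "income"]

# Two records copied from the outputs above, standing in for the real file.
sample_csv = io.StringIO(
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n"
    "50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, "
    "Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K\n"
)

# header=None: the file has no header row, so keep the first record as data.
# names=...: supply the column names directly.
# skipinitialspace=True: drop the space that follows each comma.
df = pd.read_csv(sample_csv, header=None, names=col_names,
                 skipinitialspace=True)

print(df.shape)
print(df.loc[0, "workclass"])
```

The rest of this notebook keeps the original loading code, so its outputs (32560 rows, and names such as `workclass_ ?` with a space) reflect the default behaviour.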

In [8]:
# see the first few rows of the dataset
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
1,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
2,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
3,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
4,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K


In [9]:
# check the 'income' column of data
data['income'].value_counts()

 <=50K    24719
 >50K      7841
Name: income, dtype: int64

In [10]:
#check all column names
data.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education_num',
       'marital_status', 'occupation', 'relationship', 'race', 'sex',
       'capital_gain', 'capital_loss', 'hours_per_week', 'native_country',
       'income'],
      dtype='object')

Now, let's take a few features to analyse the dataset and store them in adult_data.

In [11]:
# define new variable "adult_data" and store some features in it.
adult_data = data.loc[:,["age","workclass","education","sex","hours_per_week","occupation","income"]]

In [12]:
# look at new DataFrame
adult_data.head()

Unnamed: 0,age,workclass,education,sex,hours_per_week,occupation,income
0,50,Self-emp-not-inc,Bachelors,Male,13,Exec-managerial,<=50K
1,38,Private,HS-grad,Male,40,Handlers-cleaners,<=50K
2,53,Private,11th,Male,40,Handlers-cleaners,<=50K
3,28,Private,Bachelors,Female,40,Prof-specialty,<=50K
4,37,Private,Masters,Female,40,Exec-managerial,<=50K


#### The task is phrased as a classification task with the two classes being income <=50K and >50K.

In [13]:
# check the data type of each feature variable
adult_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32560 entries, 0 to 32559
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32560 non-null  int64 
 1   workclass       32560 non-null  object
 2   education       32560 non-null  object
 3   sex             32560 non-null  object
 4   hours_per_week  32560 non-null  int64 
 5   occupation      32560 non-null  object
 6   income          32560 non-null  object
dtypes: int64(2), object(5)
memory usage: 1.1+ MB


In this dataset, <I>age</I> and <I>hours_per_week</I> are continuous features.

The <I>workclass, education, sex </I> and <I>occupation</I> features are categorical features.
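
As a rough sketch, the same split can be read off the dtypes with `select_dtypes`. The `toy` frame below is a hypothetical stand-in built from a few rows shown above, not the real `adult_data`. Treat the dtype as a heuristic only: a categorical feature stored as integer codes would be listed as continuous.

```python
import pandas as pd

# Toy frame mirroring a few columns of adult_data (values from the rows above).
toy = pd.DataFrame({
    "age": [50, 38, 53],
    "workclass": ["Self-emp-not-inc", "Private", "Private"],
    "hours_per_week": [13, 40, 40],
    "sex": ["Male", "Male", "Male"],
})

# Numeric columns are treated as continuous, everything else as categorical.
continuous_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(continuous_cols)
print(categorical_cols)
```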

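We may apply a <I>logistic regression classifier</I> here.

Question: how do we overcome the problem that arises from categorical features? Logistic regression computes a weighted sum of numeric inputs, so values such as Private or Masters cannot be used directly.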

### One-Hot-Encoding (Dummy Variables)

The most common way to represent categorical variables is the <I>one-hot encoding</I> or <I>one-out-of-N encoding</I>,
also known as <I>dummy variables</I>. The idea is to replace a categorical variable with one new 0/1 feature per category.
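
To make the idea concrete, here is a small hand-rolled sketch with NumPy. It is an illustration of the encoding itself, not of how pandas or scikit-learn implement it, and the four `workclass` values are chosen for the example:

```python
import numpy as np

workclass = np.array(["Private", "State-gov", "Private", "Local-gov"])

# np.unique returns the sorted distinct categories; each becomes one column.
categories = np.unique(workclass)

# Compare every value against every category: exactly one match per row.
one_hot = (workclass[:, None] == categories[None, :]).astype(int)

print(categories)
print(one_hot)
```

Each row contains a single 1, in the column of the category that row belongs to.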

In [14]:
# see 'workclass' variable
adult_data.workclass.value_counts()

 Private             22696
 Self-emp-not-inc     2541
 Local-gov            2093
 ?                    1836
 State-gov            1297
 Self-emp-inc         1116
 Federal-gov           960
 Without-pay            14
 Never-worked            7
Name: workclass, dtype: int64

<b>There are two ways to convert the data to a one-hot encoding:
1. Using pd.get_dummies(data)
2. Using scikit-learn (from sklearn.preprocessing import OneHotEncoder)
</b>

Now, use pd.get_dummies():

In [15]:
print("Original features: \n", list(adult_data.columns),"\n")

data_dummies = pd.get_dummies(adult_data)

print("Features after get_dummies: \n", list(data_dummies.columns))

Original features: 
 ['age', 'workclass', 'education', 'sex', 'hours_per_week', 'occupation', 'income'] 

Features after get_dummies: 
 ['age', 'hours_per_week', 'workclass_ ?', 'workclass_ Federal-gov', 'workclass_ Local-gov', 'workclass_ Never-worked', 'workclass_ Private', 'workclass_ Self-emp-inc', 'workclass_ Self-emp-not-inc', 'workclass_ State-gov', 'workclass_ Without-pay', 'education_ 10th', 'education_ 11th', 'education_ 12th', 'education_ 1st-4th', 'education_ 5th-6th', 'education_ 7th-8th', 'education_ 9th', 'education_ Assoc-acdm', 'education_ Assoc-voc', 'education_ Bachelors', 'education_ Doctorate', 'education_ HS-grad', 'education_ Masters', 'education_ Preschool', 'education_ Prof-school', 'education_ Some-college', 'sex_ Female', 'sex_ Male', 'occupation_ ?', 'occupation_ Adm-clerical', 'occupation_ Armed-Forces', 'occupation_ Craft-repair', 'occupat

In the above output, the continuous features were not touched, while the categorical features were expanded into one new feature for each possible value:

In [16]:
# look at the newly created dummies data
data_dummies.head()

Unnamed: 0,age,hours_per_week,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,workclass_ Private,workclass_ Self-emp-inc,workclass_ Self-emp-not-inc,workclass_ State-gov,...,occupation_ Machine-op-inspct,occupation_ Other-service,occupation_ Priv-house-serv,occupation_ Prof-specialty,occupation_ Protective-serv,occupation_ Sales,occupation_ Tech-support,occupation_ Transport-moving,income_ <=50K,income_ >50K
0,50,13,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,1,0
1,38,40,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,53,40,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,28,40,0,0,0,0,1,0,0,0,...,0,0,0,1,0,0,0,0,1,0
4,37,40,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,1,0
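
The 0/1 values above come from an older pandas. To my knowledge, from pandas 2.0 onward `get_dummies` returns boolean columns by default, so a sketch like the following, which passes `dtype=int`, may be needed to reproduce the table. The two-row `toy` frame is an assumption made for the example:

```python
import pandas as pd

toy = pd.DataFrame({"age": [50, 38], "sex": ["Male", "Female"]})

# dtype=int requests 0/1 integers instead of True/False for the new columns.
dummies = pd.get_dummies(toy, dtype=int)

print(dummies)
```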


The "values" attribute converts the DataFrame into a NumPy array, which I can then use to train a machine learning model.

The target variable is now encoded in two "income" columns, so these must be separated from the features before training.

I extract only the columns containing features, that is, all columns from age to occupation_ Transport-moving.

Older pandas code often used DataFrame.ix[] for this kind of selection. ix[] was a hybrid indexer that accepted both labels and integer positions, supporting any of the inputs of loc[] and iloc[].

DataFrame.ix has been deprecated since pandas version 0.20.0 and was removed in pandas 1.0. Use the stricter indexers loc (label based) and iloc (integer-position based) instead.

In [17]:
features = data_dummies.loc[:,'age':'occupation_ Transport-moving']
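
Unlike Python list slicing, a label slice with `loc` includes its end label, which is why the last feature column is kept here. A small sketch on a made-up frame, with a hypothetical prefix-based alternative (`alt`) that does not depend on column order:

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [50, 38],
    "sex_Female": [0, 1],
    "sex_Male": [1, 0],
    "income_<=50K": [1, 1],
    "income_>50K": [0, 0],
})

# Label slices are inclusive on both ends: "sex_Male" is part of the result.
features = toy.loc[:, "age":"sex_Male"]
print(list(features.columns))

# Alternative: drop every column derived from the target by its prefix.
alt = toy.drop(columns=[c for c in toy.columns if c.startswith("income_")])
print(list(alt.columns))
```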

Now, extract the NumPy arrays:

In [18]:
data_dummies.columns

Index(['age', 'hours_per_week', 'workclass_ ?', 'workclass_ Federal-gov',
       'workclass_ Local-gov', 'workclass_ Never-worked', 'workclass_ Private',
       'workclass_ Self-emp-inc', 'workclass_ Self-emp-not-inc',
       'workclass_ State-gov', 'workclass_ Without-pay', 'education_ 10th',
       'education_ 11th', 'education_ 12th', 'education_ 1st-4th',
       'education_ 5th-6th', 'education_ 7th-8th', 'education_ 9th',
       'education_ Assoc-acdm', 'education_ Assoc-voc', 'education_ Bachelors',
       'education_ Doctorate', 'education_ HS-grad', 'education_ Masters',
       'education_ Preschool', 'education_ Prof-school',
       'education_ Some-college', 'sex_ Female', 'sex_ Male', 'occupation_ ?',
       'occupation_ Adm-clerical', 'occupation_ Armed-Forces',
       'occupation_ Craft-repair', 'occupation_ Exec-managerial',
       'occupation_ Farming-fishing', 'occupation_ Handlers-cleaners',
       'occupation_ Machine-op-inspct', 'occupation_ Other-service',
       'o

In [19]:
X = features.values

y = data_dummies['income_ >50K'].values

In [20]:
# check the shape of both
print(" X.shape {}, y.shape {}".format(X.shape,y.shape))

 X.shape (32560, 44), y.shape (32560,)


Now the data is represented in a way that scikit-learn can work with.
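
As a sketch of the next step, a logistic regression classifier could be trained roughly as follows. The `X` and `y` below are synthetic stand-ins generated for the example, not the arrays built above, and `max_iter=1000` is an arbitrary choice; no result for the Adult data is implied.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 samples, 5 numeric features, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out a test set (25% by default) to estimate generalisation.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

test_score = logreg.score(X_test, y_test)
print("Test score: {:.2f}".format(test_score))
```

With the real data, `X` and `y` from the cells above would take the place of the synthetic arrays.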