# Feature Engineering

Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work.

Feature engineering has two primary goals:

- Preparing an input dataset that is compatible with the requirements of the machine learning algorithm
- Improving the performance of machine learning models

# 1. Feature Encoding

# Categorical Encoding

Machines understand numbers, not text. Each text category has to be converted 
to a number before the machine can process it in mathematical equations.

Converting categorical columns to numerical columns, so that a machine 
learning algorithm can work with them, is called categorical encoding.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# Label Encoding

In [10]:
Data = ['None','low','medium','high','very high']

Data = pd.DataFrame(Data, columns=['Types'])



In [4]:
Data

Unnamed: 0,Types
0,None
1,low
2,medium
3,high
4,very high


In [12]:
labelencoder = LabelEncoder()

label_data = labelencoder.fit_transform(Data['Types'])
label_data

array([0, 2, 3, 1, 4])
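The codes above look shuffled because `LabelEncoder` sorts the unique labels and numbers them in that sorted order (the uppercase `'None'` sorts before the lowercase labels). The sketch below makes the mapping explicit through the fitted `classes_` attribute; the names `types` and `mapping` are only illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

types = pd.Series(['None', 'low', 'medium', 'high', 'very high'])

encoder = LabelEncoder()
codes = encoder.fit_transform(types)

# classes_ holds the sorted unique labels; a label's position is its code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
# {'None': 0, 'high': 1, 'low': 2, 'medium': 3, 'very high': 4}
```

Checking `classes_` after fitting is a quick way to see which number stands for which category.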

In [14]:
Data['label'] = label_data

In [7]:
Data

Unnamed: 0,Types,label
0,None,0
1,low,2
2,medium,3
3,high,1
4,very high,4


In [15]:
# inverse_transform expects the integer codes, not the original strings
Data['label_types'] = labelencoder.inverse_transform(Data['label'])

In [12]:
Data.head()

Unnamed: 0,Types,label,label_types
0,None,0,None
1,low,2,low
2,medium,3,medium
3,high,1,high
4,very high,4,very high
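`LabelEncoder` numbers the categories alphabetically, so in `Types` the code for high (1) ends up below the code for low (2). When a column has a natural ranking, one possible alternative is scikit-learn's `OrdinalEncoder` with an explicit category order. This is a minimal sketch, not part of the original notebook; the ranking in `order` and the column name `rank` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

levels = pd.DataFrame({'Types': ['None', 'low', 'medium', 'high', 'very high']})

# Assumed ranking, from lowest to highest
order = ['None', 'low', 'medium', 'high', 'very high']

# categories takes one list per encoded column
ordinal_encoder = OrdinalEncoder(categories=[order])
encoded = ordinal_encoder.fit_transform(levels[['Types']])  # 2-D float array

levels['rank'] = encoded.ravel().astype(int)
print(levels)
```

With the order given explicitly, the codes run from 0 for `'None'` to 4 for `'very high'`.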


In [2]:
df=pd.read_csv("ML Feature Encoding Resource16931320640.csv")
df

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
48838,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
48839,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
48840,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


In [22]:
df['education'].unique()

array(['11th', 'HS-grad', 'Assoc-acdm', 'Some-college', '10th',
       'Prof-school', '7th-8th', 'Bachelors', 'Masters', 'Doctorate',
       '5th-6th', 'Assoc-voc', '9th', '12th', '1st-4th', 'Preschool'],
      dtype=object)

In [23]:
df['education'] = labelencoder.fit_transform(df['education'])

In [24]:
df[['gender','income']] = df[['gender','income']].apply(labelencoder.fit_transform)


In [25]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,1,7,Never-married,Machine-op-inspct,Own-child,Black,1,0,0,40,United-States,0
1,38,Private,89814,11,9,Married-civ-spouse,Farming-fishing,Husband,White,1,0,0,50,United-States,0
2,28,Local-gov,336951,7,12,Married-civ-spouse,Protective-serv,Husband,White,1,0,0,40,United-States,1
3,44,Private,160323,15,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,1,7688,0,40,United-States,1
4,18,?,103497,15,10,Never-married,?,Own-child,White,0,0,0,30,United-States,0
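Reusing one `labelencoder` for several columns works for encoding, but each `fit_transform` call refits it, so afterwards it only remembers the classes of the last column it saw. If the codes need to be decoded later, one option is to keep a separate encoder per column. The sketch below uses a tiny stand-in frame with values copied from the rows shown above; `sample` and the `encoders` dict are hypothetical names.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny stand-in for the income data
sample = pd.DataFrame({
    'gender': ['Male', 'Male', 'Female'],
    'income': ['<=50K', '>50K', '<=50K'],
})

encoders = {}  # column name -> fitted LabelEncoder
for column in ['gender', 'income']:
    encoders[column] = LabelEncoder()
    sample[column] = encoders[column].fit_transform(sample[column])

# Each column can be decoded with its own encoder
decoded_gender = encoders['gender'].inverse_transform(sample['gender'])
print(sample)
print(decoded_gender)
```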


In [18]:
df['native-country']=labelencoder.fit_transform(df['native-country'])
df

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,39,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,39,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,39,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,39,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,39,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,39,<=50K
48838,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,39,>50K
48839,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,39,<=50K
48840,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,39,<=50K




Though label encoding is straightforward, it has the disadvantage that algorithms can misinterpret the numeric values as having some sort of hierarchy or order.

This ordering issue is addressed by another common approach called ‘One-Hot Encoding’. In this strategy, each category value is converted into a new column, and every row gets a 1 or 0 (true/false) in that column.

# One-Hot Encoder

In [None]:
# Employee is assumed to be a dict of column lists (including 'Department') defined beforehand
data = pd.DataFrame(Employee)

In [None]:
data

In [None]:
data.info()

Department is the categorical feature, since it has the object data type. 
The remaining columns are numerical features, since they have the int64 data type.

# Using the get_dummies method for one-hot encoding

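In [None]:
# One 0/1 column per distinct department
dummies = pd.get_dummies(data.Department)
dummies

In [None]:
# Attach the dummy columns to the original data
merged = pd.concat([data,dummies],axis='columns')
merged

In [None]:
# Drop the original text column now that it is encoded
final = merged.drop(['Department'], axis='columns')
final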
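`Employee` is not defined in the cells shown here, so the following self-contained sketch repeats the same three steps on a made-up stand-in. The department names and the `Experience` and `Salary` columns are hypothetical; only the `Department` column name comes from the notebook.

```python
import pandas as pd

# Hypothetical stand-in for the undefined `Employee` data
Employee = {
    'Department': ['Sales', 'HR', 'IT', 'Sales'],
    'Experience': [3, 5, 2, 7],
    'Salary': [40000, 52000, 45000, 61000],
}
data = pd.DataFrame(Employee)

# dtype=int gives 0/1 columns instead of True/False
dummies = pd.get_dummies(data['Department'], dtype=int)
merged = pd.concat([data, dummies], axis='columns')
final = merged.drop(['Department'], axis='columns')
print(final)
```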

In [None]:
df.head()

In [None]:
# One-hot encode only the relationship column; new columns are named Type_is_<category>
dum_df = pd.get_dummies(df, columns=["relationship"], prefix=["Type_is"])
dum_df.head()
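To see what `columns` and `prefix` do, here is a small sketch on a stand-in frame whose values are copied from four rows of the dataset shown earlier. The names `sample` and `dum_sample` are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    'age': [25, 38, 27, 58],
    'relationship': ['Own-child', 'Husband', 'Wife', 'Unmarried'],
})

# Only the listed column is encoded; the prefix replaces its name in the new columns
dum_sample = pd.get_dummies(sample, columns=['relationship'],
                            prefix=['Type_is'], dtype=int)
print(dum_sample.columns.tolist())
# ['age', 'Type_is_Husband', 'Type_is_Own-child', 'Type_is_Unmarried', 'Type_is_Wife']
```

The original `relationship` column is dropped automatically, so no separate `concat` and `drop` steps are needed.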

# Applying to the whole dataset

In [3]:
# Reload the raw income data; adjust the path to wherever income.csv is stored
new_df = pd.read_csv("income.csv")

In [4]:
# With no columns argument, get_dummies encodes every text (object) column
dummi_df = pd.get_dummies(new_df)

In [None]:
dummi_df.head()
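When `get_dummies` is called on a whole DataFrame without a `columns` argument, it encodes the text columns and leaves numeric columns unchanged. The sketch below shows this on a three-row stand-in with values copied from the dataset shown earlier; `sample` and `encoded` are illustrative names.

```python
import pandas as pd

sample = pd.DataFrame({
    'age': [25, 28, 18],
    'gender': ['Male', 'Male', 'Female'],
    'income': ['<=50K', '>50K', '<=50K'],
})

# Numeric columns pass through; each text column becomes one column per category
encoded = pd.get_dummies(sample, dtype=int)
print(encoded.columns.tolist())
# ['age', 'gender_Female', 'gender_Male', 'income_<=50K', 'income_>50K']
```

On the full dataset this can produce many columns, because every distinct value of every text column gets its own column.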