## Data Encoding 
- Encoding is the process of converting data, such as a sequence of characters or symbols, into a specified format so that it can be stored, transmitted, or processed.
- There are several types of encoding, including image encoding, audio and video encoding, and character encoding.
- Most machine learning models require all input and output variables to be numeric.
- Encoding can also mask data, which offers a limited form of privacy; it is not a substitute for encryption.



### Advantages of Encoding Data
#### Modeling 
- encoding categorical values as numbers allows machine learning models to process and evaluate the data

#### Privacy
- masks your data, since the files are not readable at a glance unless you know the scheme that was used to encode them; anyone who knows the scheme can reverse it, so sensitive data still needs encryption or access control
- can be useful if third parties need access to your archives but should not be able to read some sensitive fields directly



### Categorical Encoding Types 
- categorical variables are usually represented as ‘strings’ or ‘categories’ and are finite in number

#### Ordinal Data: The categories have an inherent order
- encoding should retain the order of the categories, for example the highest degree a person possesses

#### Nominal Data: The categories do not have an inherent order
- no notion of order is present, for example the city a person lives in
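
The difference between the two kinds of data can be sketched in pandas. The frame below is made-up illustration data (the column names `degree` and `city` and their values are assumptions, not part of the autos data set): the ordinal column gets an explicit category order, while the nominal column gets one indicator column per value.

```python
import pandas as pd

# Hypothetical example data, not taken from the autos data set
people = pd.DataFrame({
    "degree": ["bachelor", "high school", "master", "bachelor"],
    "city": ["Austin", "Boston", "Austin", "Denver"],
})

# Ordinal: state the order explicitly so the integer codes follow it
degree_order = ["high school", "bachelor", "master"]
people["degree"] = pd.Categorical(people["degree"],
                                  categories=degree_order, ordered=True)
people["degree_code"] = people["degree"].cat.codes

# Nominal: no order exists, so use one 0/1 column per city
city_dummies = pd.get_dummies(people["city"], prefix="city", dtype=int)

print(people[["degree", "degree_code"]])
print(city_dummies)
```

With an explicit order, a larger code always means a higher degree, which is the information an ordinal encoding is meant to keep.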

### Categorical Encoding Process
- read in data file 
- inspect types of data 
- handle any NaN values 



In [15]:
import pandas as pd
import numpy as np
### select the columns wanted ### 
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
           "num_doors", "body_style", "drive_wheels", "engine_location",
           "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system",
           "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
           "city_mpg", "highway_mpg", "price"]

### read in data file, replace '?' with NaN ###
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data",
                  header=None, names=headers, na_values="?" )
print('data frame shape:', df.shape)
print('--- data frame data types ---')
print(df.dtypes)
print('--- data frame ---')
df.head()

data frame shape: (205, 26)
--- data frame data types ---
symboling              int64
normalized_losses    float64
make                  object
fuel_type             object
aspiration            object
num_doors             object
body_style            object
drive_wheels          object
engine_location       object
wheel_base           float64
length               float64
width                float64
height               float64
curb_weight            int64
engine_type           object
num_cylinders         object
engine_size            int64
fuel_system           object
bore                 float64
stroke               float64
compression_ratio    float64
horsepower           float64
peak_rpm             float64
city_mpg               int64
highway_mpg            int64
price                float64
dtype: object
--- data frame ---


Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,wheel_base,length,width,height,curb_weight,engine_type,num_cylinders,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,13495.0
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,16500.0
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,171.2,65.5,52.4,2823,ohcv,six,152,mpfi,2.68,3.47,9.0,154.0,5000.0,19,26,16500.0
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,176.6,66.2,54.3,2337,ohc,four,109,mpfi,3.19,3.4,10.0,102.0,5500.0,24,30,13950.0
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,176.6,66.4,54.3,2824,ohc,five,136,mpfi,3.19,3.4,8.0,115.0,5500.0,18,22,17450.0


In [16]:
### create a data frame of all the categorical values which are 'object' ###
cat_df = df.select_dtypes(include=['object']).copy()
print('data frame shape:', cat_df.shape)
### check for NaN's in the data ### 
print('--- NaNs in the data ---')
print(cat_df.isna().sum())
print('--- data frame ---')
cat_df.head()

data frame shape: (205, 10)
--- NaNs in the data ---
make               0
fuel_type          0
aspiration         0
num_doors          2
body_style         0
drive_wheels       0
engine_location    0
engine_type        0
num_cylinders      0
fuel_system        0
dtype: int64
--- data frame ---


Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system
0,alfa-romero,gas,std,two,convertible,rwd,front,dohc,four,mpfi
1,alfa-romero,gas,std,two,convertible,rwd,front,dohc,four,mpfi
2,alfa-romero,gas,std,two,hatchback,rwd,front,ohcv,six,mpfi
3,audi,gas,std,four,sedan,fwd,front,ohc,four,mpfi
4,audi,gas,std,four,sedan,4wd,front,ohc,five,mpfi


In [3]:
### check the different values for 'num_doors' ### 
print('--- values for num_doors ---')
cat_df["num_doors"].value_counts()

--- values for num_doors ---


four    114
two      89
Name: num_doors, dtype: int64

In [4]:
### fill the NaNs with majority, four ### 
cat_df = cat_df.fillna({'num_doors': 'four'})
print('--- NaNs in the data ---')
cat_df.isna().sum()

--- NaNs in the data ---


make               0
fuel_type          0
aspiration         0
num_doors          0
body_style         0
drive_wheels       0
engine_location    0
engine_type        0
num_cylinders      0
fuel_system        0
dtype: int64
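
The majority value does not have to be typed in by hand. A minimal sketch, using a small made-up series in place of the `num_doors` column, looks up the most frequent value with `mode()` and fills with it:

```python
import pandas as pd

# Hypothetical stand-in for the num_doors column, with one missing value
num_doors = pd.Series(["four", "two", None, "four"], name="num_doors")

# mode() returns the most frequent value(s); take the first one
majority = num_doors.mode()[0]
filled = num_doors.fillna(majority)

print("majority value:", majority)
print("NaNs remaining:", filled.isna().sum())
```

Filling with the majority is only one option; whether it is appropriate depends on how many values are missing and why.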

### Pandas Encoding

#### Find and Replace
- manually replace categorical values 
- useful if amount of unique values is small 

In [5]:
### check the different values for 'num_cylinders' ### 
print('--- values for num_cylinders ---')
cat_df['num_cylinders'].value_counts()

--- values for num_cylinders ---


four      159
six        24
five       11
eight       5
two         4
three       1
twelve      1
Name: num_cylinders, dtype: int64

In [6]:
### manually replace the values with numbers ###
# pandas will convert from object to int64
cat_replace =   {"num_doors":     {"four": 4, "two": 2},
                 "num_cylinders": {"four": 4, "six": 6, 
                                   "five": 5, "eight": 8,
                                   "two": 2, "twelve": 12, 
                                   "three": 3 }}
cat_df = cat_df.replace(cat_replace)
print('--- data frame shape:', cat_df.shape)
print(' --- data types ---')
print(cat_df.dtypes)
print('--- data frame ---')
cat_df.head()

--- data frame shape: (205, 10)
 --- data types ---
make               object
fuel_type          object
aspiration         object
num_doors           int64
body_style         object
drive_wheels       object
engine_location    object
engine_type        object
num_cylinders       int64
fuel_system        object
dtype: object
--- data frame ---


Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system
0,alfa-romero,gas,std,2,convertible,rwd,front,dohc,4,mpfi
1,alfa-romero,gas,std,2,convertible,rwd,front,dohc,4,mpfi
2,alfa-romero,gas,std,2,hatchback,rwd,front,ohcv,6,mpfi
3,audi,gas,std,4,sedan,fwd,front,ohc,4,mpfi
4,audi,gas,std,4,sedan,4wd,front,ohc,5,mpfi
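
For a single column, `Series.map` is a possible alternative to `replace`. The sketch below uses a made-up series (the value `"rotary"` is invented to show an unmapped entry); unlike `replace`, `map` turns any value missing from the dictionary into NaN, which makes gaps in the mapping easy to spot.

```python
import pandas as pd

# Hypothetical values; "rotary" is deliberately absent from the mapping
num_cylinders = pd.Series(["four", "six", "five", "rotary"])

cyl_map = {"two": 2, "three": 3, "four": 4, "five": 5,
           "six": 6, "eight": 8, "twelve": 12}

# Values not found in cyl_map become NaN
mapped = num_cylinders.map(cyl_map)

# List the original values that the mapping did not cover
unmapped = num_cylinders[mapped.isna()]

print(mapped)
print("unmapped values:", unmapped.tolist())
```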


#### Label Encoding 
- technique that converts each value in a column to a number
- in pandas, the data type is changed to category first so that the category codes can be used
- the disadvantage is that the numeric values can be “misinterpreted” by algorithms as having an order or magnitude

In [7]:
### change the data type from object to category ### 
cat_df["body_style"] = cat_df["body_style"].astype('category')
print('--- data types ---')
print(cat_df.dtypes)
### create a new column with the category codes from the body_style values ### 
cat_df['body_style_cat'] = cat_df['body_style'].cat.codes
print('data frame shape:', cat_df.shape)
print('--- data frame ---')
cat_df.head()

--- data types ---
make                 object
fuel_type            object
aspiration           object
num_doors             int64
body_style         category
drive_wheels         object
engine_location      object
engine_type          object
num_cylinders         int64
fuel_system          object
dtype: object
data frame shape: (205, 11)
--- data frame ---


Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system,body_style_cat
0,alfa-romero,gas,std,2,convertible,rwd,front,dohc,4,mpfi,0
1,alfa-romero,gas,std,2,convertible,rwd,front,dohc,4,mpfi,0
2,alfa-romero,gas,std,2,hatchback,rwd,front,ohcv,6,mpfi,2
3,audi,gas,std,4,sedan,fwd,front,ohc,4,mpfi,3
4,audi,gas,std,4,sedan,4wd,front,ohc,5,mpfi,3
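
To see which number stands for which category, the codes can be matched against `cat.categories`. This is a small sketch on a made-up series of body styles rather than the full data frame:

```python
import pandas as pd

# Hypothetical sample of body_style values
body_style = pd.Series(
    ["convertible", "hatchback", "sedan", "wagon", "hardtop", "sedan"],
    dtype="category",
)

# Codes are positions in the (alphabetically sorted) category list
codes = body_style.cat.codes
mapping = dict(enumerate(body_style.cat.categories))

# The mapping lets us translate codes back to labels
decoded = codes.map(mapping)

print(mapping)
print(codes.tolist())
```

The codes follow alphabetical order here, so "wagon" (4) is not in any sense larger than "convertible" (0); a model that treats the codes as magnitudes would be misreading them.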


#### One Hot Encoding 
- converts each category value into a new column and assigns a 1 or 0 (True/False) value to the column
- has the benefit of not weighting a value improperly, but the downside of adding more columns to the data set
- columns that were not encoded keep their 'object' data type, so filter them out before analysis

In [8]:
import pandas as pd 
### use get_dummies on the columns for encoding ### 
# can work multiple columns at once 
cat_df = pd.get_dummies(cat_df, columns = ['drive_wheels'])
### can set a 'prefix' to the new columns for organization ### 
# cat_df = pd.get_dummies(cat_df, columns = ['body_style', 'drive_wheels'], prefix = ['body', 'drive'])
print('data frame shape:', cat_df.shape)
print('--- data frame ---')
cat_df.head()

data frame shape: (205, 13)
--- data frame ---


Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,engine_location,engine_type,num_cylinders,fuel_system,body_style_cat,drive_wheels_4wd,drive_wheels_fwd,drive_wheels_rwd
0,alfa-romero,gas,std,2,convertible,front,dohc,4,mpfi,0,0,0,1
1,alfa-romero,gas,std,2,convertible,front,dohc,4,mpfi,0,0,0,1
2,alfa-romero,gas,std,2,hatchback,front,ohcv,6,mpfi,2,0,0,1
3,audi,gas,std,4,sedan,front,ohc,4,mpfi,3,0,1,0
4,audi,gas,std,4,sedan,front,ohc,5,mpfi,3,1,0,0
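
One way to limit the extra columns is the `drop_first` option of `get_dummies`, which leaves out the first category because it is implied when all other indicators are 0. A minimal sketch on a made-up four-row frame:

```python
import pandas as pd

# Hypothetical sample of the drive_wheels column
cars = pd.DataFrame({"drive_wheels": ["rwd", "rwd", "fwd", "4wd"]})

# Full one-hot encoding: one column per category
full = pd.get_dummies(cars, columns=["drive_wheels"],
                      prefix="drive", dtype=int)

# drop_first=True drops the first category (4wd, alphabetically)
reduced = pd.get_dummies(cars, columns=["drive_wheels"],
                         prefix="drive", drop_first=True, dtype=int)

print(full)
print(reduced)
```

Dropping one column is mostly relevant for linear models, where the full set of indicators is perfectly collinear with an intercept.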


#### Custom Binary Encoding 
- uses a condition you define to collapse a column into a single new column holding a 1 or 0 (True/False) value
- useful when only one distinction matters, for example whether an engine type is an overhead cam (OHC) variant or not
- adds only one column, but the detail of the individual categories is lost in the new column
- the original column sticks around, so filter 'object' data types for analysis

In [9]:
### creating a column for if engine_type is 'ohc', True or False ### 
### check the different values for 'engine_type' ### 
print('--- values for engine_type ---')
cat_df['engine_type'].value_counts()

--- values for engine_type ---


ohc      148
ohcf      15
ohcv      13
dohc      12
l         12
rotor      4
dohcv      1
Name: engine_type, dtype: int64

In [10]:
### create a new column and if the value contains 'ohc' give it a 1, if not 0 ### 
cat_df['OHC_Code'] = np.where(cat_df['engine_type'].str.contains("ohc"), 1, 0)
print('data frame shape:', cat_df.shape)
print('--- OHC_Code values ---')
print(cat_df['OHC_Code'].value_counts())
print('--- data frame ---')
cat_df.head()

data frame shape: (205, 14)
--- OHC_Code values ---
1    189
0     16
Name: OHC_Code, dtype: int64
--- data frame ---


Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,engine_location,engine_type,num_cylinders,fuel_system,body_style_cat,drive_wheels_4wd,drive_wheels_fwd,drive_wheels_rwd,OHC_Code
0,alfa-romero,gas,std,2,convertible,front,dohc,4,mpfi,0,0,0,1,1
1,alfa-romero,gas,std,2,convertible,front,dohc,4,mpfi,0,0,0,1,1
2,alfa-romero,gas,std,2,hatchback,front,ohcv,6,mpfi,2,0,0,1,1
3,audi,gas,std,4,sedan,front,ohc,4,mpfi,3,0,1,0,1
4,audi,gas,std,4,sedan,front,ohc,5,mpfi,3,1,0,0,1
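
The same flag can be built without `np.where` by casting the boolean result directly. This sketch uses a made-up series of engine types; `regex=False` treats "ohc" as a plain substring and `na=False` counts missing values as non-matches.

```python
import pandas as pd

# Hypothetical sample of engine_type values
engine_type = pd.Series(["dohc", "ohcv", "ohc", "l", "rotor", "ohcf"])

# True where the text contains "ohc", then cast True/False to 1/0
ohc_code = engine_type.str.contains("ohc", regex=False, na=False).astype(int)

print(ohc_code.tolist())
```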


### Encoding Libraries
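- encoder classes from scikit-learn and category_encoders, well suited to machine learning models
- work well with pipelines, since each encoder can be fitted once and then applied to new data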

#### Ordinal Encoder 
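- equivalent to label encoding, applied to feature columns
- technique that converts each value in a column to a number
- the data type does not need to be changed to category first; the encoder learns the categories when it is fitted and returns the codes as floats
- the disadvantage is that the numeric values can be “misinterpreted” by algorithms as having an order or magnitude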

In [11]:
from sklearn.preprocessing import OrdinalEncoder
### initiate the encoder ###
oe = OrdinalEncoder()
### create a new column, use fit transform on the column to encode ### 
cat_df['make_code'] = oe.fit_transform(cat_df[['make']])
### show the change from make, make_code ### 
print('--- encoded "make" column ---')
cat_df[['make', 'make_code']].head()

--- encoded "make" column ---


Unnamed: 0,make,make_code
0,alfa-romero,0.0
1,alfa-romero,0.0
2,alfa-romero,0.0
3,audi,1.0
4,audi,1.0
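
When the categories do have a natural order, it can be passed to the encoder through the `categories` parameter instead of relying on the default sorting. The sketch below uses a made-up `num_doors` sample and also shows `inverse_transform`, which maps the codes back to the labels.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample of the num_doors column
cars = pd.DataFrame({"num_doors": ["two", "four", "four", "two"]})

# One list of categories per encoded column, in the desired order
oe = OrdinalEncoder(categories=[["two", "four"]])
codes = oe.fit_transform(cars[["num_doors"]])

# Recover the original labels from the codes
restored = oe.inverse_transform(codes)

print(codes.ravel())
print(restored.ravel())
```

Without the explicit list, the string categories would be sorted alphabetically, which would put "four" before "two".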


#### One Hot Encoder
- each new column is binary, true or false
- converts each category value into a new column and assigns a 1 or 0 (True/False) value to the column
- has the benefit of not weighting a value improperly, but the downside of adding more columns to the data set
- the original columns stick around, so filter 'object' data types for analysis

In [12]:
from sklearn.preprocessing import OneHotEncoder
### initiate the encoder ###
ohe = OneHotEncoder()
### use fit transform on the column to encode, the result is a sparse matrix ### 
ohe_results = ohe.fit_transform(cat_df[['body_style']])
### show the columns encoded from body_style ### 
print('--- encoded "body_style" columns ---')
### create a data frame of the columns encoded ###
# must use toarray() to convert the sparse matrix for a dataframe
# categories_ holds one array per encoded column, so take the first for the column names
print(pd.DataFrame(ohe_results.toarray(), columns = ohe.categories_[0]).head())
print('--- added encoded columns to data frame ---')
### add the encoded data to the original data frame ### 
cat_df = cat_df.join(pd.DataFrame(ohe_results.toarray(), columns=ohe.categories_[0]))
cat_df.head()

--- encoded "body_style" columns ---
   convertible  hardtop  hatchback  sedan  wagon
0          1.0      0.0        0.0    0.0    0.0
1          1.0      0.0        0.0    0.0    0.0
2          0.0      0.0        1.0    0.0    0.0
3          0.0      0.0        0.0    1.0    0.0
4          0.0      0.0        0.0    1.0    0.0
--- added encoded columns to data frame ---


Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,engine_location,engine_type,num_cylinders,fuel_system,body_style_cat,drive_wheels_4wd,drive_wheels_fwd,drive_wheels_rwd,OHC_Code,make_code,convertible,hardtop,hatchback,sedan,wagon
0,alfa-romero,gas,std,2,convertible,front,dohc,4,mpfi,0,0,0,1,1,0.0,1.0,0.0,0.0,0.0,0.0
1,alfa-romero,gas,std,2,convertible,front,dohc,4,mpfi,0,0,0,1,1,0.0,1.0,0.0,0.0,0.0,0.0
2,alfa-romero,gas,std,2,hatchback,front,ohcv,6,mpfi,2,0,0,1,1,0.0,0.0,0.0,1.0,0.0,0.0
3,audi,gas,std,4,sedan,front,ohc,4,mpfi,3,0,1,0,1,1.0,0.0,0.0,0.0,1.0,0.0
4,audi,gas,std,4,sedan,front,ohc,5,mpfi,3,1,0,0,1,1.0,0.0,0.0,0.0,1.0,0.0
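
The Encoding Libraries notes mention that these encoders work well with pipelines. As a rough sketch of what that can look like, the example below wraps `OneHotEncoder` in a `ColumnTransformer`, fits it on a made-up training frame, and then applies it to a new row. The frames and the step name `"body"` are assumptions for illustration; `handle_unknown="ignore"` makes a category that was not seen during fitting encode as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data and one new row with an unseen body style
train = pd.DataFrame({"body_style": ["sedan", "wagon", "sedan"],
                      "engine_size": [109, 136, 130]})
new = pd.DataFrame({"body_style": ["hatchback"], "engine_size": [152]})

ct = ColumnTransformer(
    [("body", OneHotEncoder(handle_unknown="ignore"), ["body_style"])],
    remainder="passthrough",   # keep engine_size as it is
    sparse_threshold=0,        # always return a dense array
)

# Fit on the training data, then reuse the fitted encoder on new data
train_encoded = ct.fit_transform(train)
new_encoded = ct.transform(new)

print(train_encoded)
print(new_encoded)
```

Fitting once and reusing the transformer keeps the column layout identical between training and new data, which `pd.get_dummies` does not guarantee on its own.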


#### Effect Encoder (deviation/sum encoding)
- similar to dummy encoding, but uses three values (1, 0, -1) instead of two (1, 0)
- converts a column with k categories into k-1 new columns and assigns a 1, 0, or -1 value to each
- the reference category, which is the row containing only 0s in dummy encoding, is encoded as all -1s in effect encoding
- has the benefit of not weighting a value improperly, but the downside of adding more columns to the data set
- the original columns stick around, so filter 'object' data types for analysis

In [13]:
!pip install category_encoders



In [17]:
import category_encoders as ce
import pandas as pd
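### initiate the encoder ### 
# assumes cat_df still holds the original 'drive_wheels' column (re-run the cell that builds cat_df first)
se = ce.SumEncoder(cols=['drive_wheels']) # set the column to encode 
se_results = se.fit_transform(cat_df) # fit transform the data 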
print('data frame shape:', se_results.shape)
print('--- data frame ---')
se_results.head()

data frame shape: (205, 12)
--- data frame ---




Unnamed: 0,intercept,make,fuel_type,aspiration,num_doors,body_style,drive_wheels_0,drive_wheels_1,engine_location,engine_type,num_cylinders,fuel_system
0,1,alfa-romero,gas,std,two,convertible,1.0,0.0,front,dohc,four,mpfi
1,1,alfa-romero,gas,std,two,convertible,1.0,0.0,front,dohc,four,mpfi
2,1,alfa-romero,gas,std,two,hatchback,1.0,0.0,front,ohcv,six,mpfi
3,1,audi,gas,std,four,sedan,0.0,1.0,front,ohc,four,mpfi
4,1,audi,gas,std,four,sedan,-1.0,-1.0,front,ohc,five,mpfi
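
The relation to dummy encoding can be sketched in plain pandas, without `category_encoders`. This is an illustration of the idea rather than the library's implementation: it starts from dummy columns, drops one reference category, and sets the rows of that category to -1. The sample values are made up, and the choice of reference category (here the last one alphabetically) is an assumption that may differ from the library's choice.

```python
import pandas as pd

# Hypothetical sample of the drive_wheels column
drive = pd.Series(["rwd", "fwd", "4wd", "rwd"], name="drive_wheels")

# Step 1: ordinary dummy columns, one per category
dummies = pd.get_dummies(drive, dtype=int)

# Step 2: pick a reference category and drop its column
reference = dummies.columns[-1]          # "rwd" in this sample
effect = dummies.drop(columns=reference).copy()

# Step 3: rows of the reference category become -1 in every column
effect.loc[drive == reference, :] = -1

print(effect)
```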


#### Hash Encoder
- hashing is a one-way process; in other words, the original input cannot be regenerated from the hash representation
- hashing transforms input of arbitrary size into a fixed-size value
- hashing has several other applications, such as data retrieval, checking for data corruption, and cryptography
- the user can fix the number of dimensions after transformation using the n_components argument
- by default, the hashing encoder uses the md5 hashing algorithm
- because it transforms the data into fewer dimensions, it may lead to loss of information
- since a large number of categories is mapped into fewer dimensions, multiple values can be represented by the same hash value; this is known as a collision

In [18]:
import category_encoders as ce
import pandas as pd
### initiate the encoder ### 
he = ce.HashingEncoder(cols='make',n_components=5) # set column and n_components
he_results = he.fit_transform(cat_df) # fit transform the data
print('data frame shape:', he_results.shape)
print('--- data frame ---')
he_results.head()

data frame shape: (205, 14)
--- data frame ---


Unnamed: 0,col_0,col_1,col_2,col_3,col_4,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system
0,0,0,0,1,0,gas,std,two,convertible,rwd,front,dohc,four,mpfi
1,0,0,0,1,0,gas,std,two,convertible,rwd,front,dohc,four,mpfi
2,0,0,0,1,0,gas,std,two,hatchback,rwd,front,ohcv,six,mpfi
3,0,0,0,0,1,gas,std,four,sedan,fwd,front,ohc,four,mpfi
4,0,0,0,0,1,gas,std,four,sedan,4wd,front,ohc,five,mpfi
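
The idea behind hash encoding can be sketched with the standard library alone. The helper names `hash_bucket` and `hash_encode` are hypothetical, and this is a simplified illustration of the hashing trick, not the exact procedure used by `category_encoders`: each value is hashed with md5, the digest is reduced modulo `n_components`, and the matching column is set to 1.

```python
import hashlib

def hash_bucket(value: str, n_components: int = 5) -> int:
    """Map a string to one of n_components buckets using md5."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_components

def hash_encode(values, n_components: int = 5):
    """Return one row of 0/1 flags per value."""
    rows = []
    for value in values:
        row = [0] * n_components
        row[hash_bucket(value, n_components)] = 1
        rows.append(row)
    return rows

# Hypothetical sample of car makes: 7 distinct values into 5 buckets
makes = ["alfa-romero", "audi", "bmw", "dodge", "honda", "mazda", "volvo"]
buckets = [hash_bucket(m) for m in makes]
encoded = hash_encode(makes)

print(buckets)
print("distinct makes:", len(set(makes)),
      "distinct buckets:", len(set(buckets)))
```

With seven distinct makes and only five buckets, at least two makes must share a bucket, which is the collision described in the notes.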


#### Binary Encoder
- often described as a combination of hash encoding and one-hot encoding, although no hash function is involved
- the categorical feature is first converted into numbers using an ordinal encoder, then each number is written in binary, and finally the binary digits are split into separate columns
- works really well when there are a high number of categories, because the number of columns grows with the logarithm of the number of categories
- unlike the hashing encoder, there is no n_components argument; the number of columns follows from the number of categories
- each category gets its own binary pattern, so there are no collisions, but the individual columns have no meaning on their own

In [20]:
import category_encoders as ce
import pandas as pd
### create some data ### 
df = pd.DataFrame({'City':['Delhi','Mumbai','Hyderabad','Chennai','Bangalore','Delhi','Hyderabad','Mumbai','Agra']})
### initiate the encoder ## 
be = ce.BinaryEncoder(cols=['City'], return_df=True) # parameters 
be_results = be.fit_transform(df) # fit transform the data 
print('data frame shape:', be_results.shape)
print('--- data frame ---')
be_results.head()

data frame shape: (9, 4)
--- data frame ---




Unnamed: 0,City_0,City_1,City_2,City_3
0,0,0,0,1
1,0,0,1,0
2,0,0,1,1
3,0,1,0,0
4,0,1,0,1
