### Introduction


`In many practical data science activities, the data set will contain categorical variables. These variables are typically stored as text values that represent various traits. Some examples include color (“Red”, “Yellow”, “Blue”), size (“Small”, “Medium”, “Large”), or geographic designations (State or Country). Regardless of what the value is used for, the challenge is determining how to use this data in the analysis. Many machine learning algorithms can support categorical values without further manipulation, but many more cannot. The analyst therefore has to figure out how to turn these text attributes into numerical values for further processing.`

In [96]:
import pandas as pd
import numpy as np

# Define the headers since the data does not have any
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
           "num_doors", "body_style", "drive_wheels", "engine_location",
           "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system",
           "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
           "city_mpg", "highway_mpg", "price"]

# Read in the CSV file and convert "?" to NaN
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data",
                 header=None, names=headers, na_values="?")
df.head()

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,wheel_base,...,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,13495.0
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,16500.0
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154.0,5000.0,19,26,16500.0
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102.0,5500.0,24,30,13950.0
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115.0,5500.0,18,22,17450.0


In [97]:
df.dtypes

symboling              int64
normalized_losses    float64
make                  object
fuel_type             object
aspiration            object
num_doors             object
body_style            object
drive_wheels          object
engine_location       object
wheel_base           float64
length               float64
width                float64
height               float64
curb_weight            int64
engine_type           object
num_cylinders         object
engine_size            int64
fuel_system           object
bore                 float64
stroke               float64
compression_ratio    float64
horsepower           float64
peak_rpm             float64
city_mpg               int64
highway_mpg            int64
price                float64
dtype: object

In [98]:
# We will focus only on encoding the categorical variables, so we keep only the object columns in a separate dataframe.

obj_df = df.select_dtypes(include=["object"]).copy()
obj_df.head()

Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system
0,alfa-romero,gas,std,two,convertible,rwd,front,dohc,four,mpfi
1,alfa-romero,gas,std,two,convertible,rwd,front,dohc,four,mpfi
2,alfa-romero,gas,std,two,hatchback,rwd,front,ohcv,six,mpfi
3,audi,gas,std,four,sedan,fwd,front,ohc,four,mpfi
4,audi,gas,std,four,sedan,4wd,front,ohc,five,mpfi


In [99]:
obj_df.isnull().sum()

make               0
fuel_type          0
aspiration         0
num_doors          2
body_style         0
drive_wheels       0
engine_location    0
engine_type        0
num_cylinders      0
fuel_system        0
dtype: int64

In [100]:
obj_df = obj_df.fillna({"num_doors": "four"})
# For simplicity, fill the two missing num_doors values with "four", since it is a common value.
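
The fill value can be checked against the data rather than assumed. The following is a minimal sketch on a small made-up series (the values are hypothetical, not the real `num_doors` column) that looks up the most common value before filling:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the num_doors column (not the real data)
doors = pd.Series(["four", "two", "four", np.nan, "four", "two", np.nan])

# value_counts() ignores NaN by default, so only observed values are counted
print(doors.value_counts())

# mode() returns a Series; its first entry is the most common value
most_common = doors.mode()[0]

# Fill the missing entries with the most common value
doors_filled = doors.fillna(most_common)
print(doors_filled.tolist())
```

On the real data, the same check could be run on `obj_df["num_doors"]` before deciding on the fill value.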

In [101]:
obj_df.dtypes

make               object
fuel_type          object
aspiration         object
num_doors          object
body_style         object
drive_wheels       object
engine_location    object
engine_type        object
num_cylinders      object
fuel_system        object
dtype: object

### `Approach 1 Find and Replace`

In [102]:
obj_df_1 = obj_df.copy()

In [103]:
cleanup_nums = {"num_doors":     {"four": 4, "two": 2},
                "num_cylinders": {"four": 4, "six": 6, "five": 5, "eight": 8,
                                  "two": 2, "twelve": 12, "three": 3}
               }

obj_df_1 = obj_df_1.replace(cleanup_nums)
obj_df_1.head()

# A nice benefit of this approach is that pandas “knows” the types of the values in the columns, so the converted columns are now int64 instead of object.

Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system
0,alfa-romero,gas,std,2,convertible,rwd,front,dohc,4,mpfi
1,alfa-romero,gas,std,2,convertible,rwd,front,dohc,4,mpfi
2,alfa-romero,gas,std,2,hatchback,rwd,front,ohcv,6,mpfi
3,audi,gas,std,4,sedan,fwd,front,ohc,4,mpfi
4,audi,gas,std,4,sedan,4wd,front,ohc,5,mpfi


In [104]:
obj_df_1.dtypes

make               object
fuel_type          object
aspiration         object
num_doors           int64
body_style         object
drive_wheels       object
engine_location    object
engine_type        object
num_cylinders       int64
fuel_system        object
dtype: object
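
As a possible alternative to `replace`, `Series.map` can be used for a single column. The sketch below uses a small made-up series (hypothetical values, not the real data set); one property of `map` is that values missing from the dictionary become NaN, which makes an incomplete mapping easy to spot:

```python
import pandas as pd

# Hypothetical sample of word-valued counts (not the real data set)
cyl = pd.Series(["four", "six", "five", "four", "twelve"])

word_to_num = {"two": 2, "three": 3, "four": 4, "five": 5,
               "six": 6, "eight": 8, "twelve": 12}

# map() returns NaN for any value that is missing from the dictionary
cyl_num = cyl.map(word_to_num)
print(cyl_num.tolist())

# A value without a mapping ("sixteen" here) shows up as NaN
incomplete = pd.Series(["four", "sixteen"]).map(word_to_num)
print(incomplete.isnull().tolist())
```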

### `Approach 2 Label Encoding`

In [105]:

obj_df_2 = obj_df.copy()

# The nice aspect of this approach is that you get the benefits of pandas categories (compact data size, ability to order, plotting support), and the column can still easily be converted to numeric values for further analysis.

In [106]:
obj_df_2["body_style"] = obj_df_2["body_style"].astype("category")
obj_df_2.dtypes

make                 object
fuel_type            object
aspiration           object
num_doors            object
body_style         category
drive_wheels         object
engine_location      object
engine_type          object
num_cylinders        object
fuel_system          object
dtype: object

In [107]:
obj_df_2["body_style"] = obj_df_2["body_style"].cat.codes
obj_df_2.head()

# convertible -> 0
# hardtop -> 1
# hatchback -> 2
# sedan -> 3
# wagon -> 4

Unnamed: 0,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,engine_type,num_cylinders,fuel_system
0,alfa-romero,gas,std,two,0,rwd,front,dohc,four,mpfi
1,alfa-romero,gas,std,two,0,rwd,front,dohc,four,mpfi
2,alfa-romero,gas,std,two,2,rwd,front,ohcv,six,mpfi
3,audi,gas,std,four,3,fwd,front,ohc,four,mpfi
4,audi,gas,std,four,3,4wd,front,ohc,five,mpfi
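
The code-to-label mapping listed in the comments above does not have to be written by hand. A minimal sketch, assuming a small made-up series of body styles (not the real data set), shows how it might be recovered from the category itself:

```python
import pandas as pd

# Hypothetical sample of body styles (not the real data set)
body = pd.Series(["sedan", "wagon", "convertible", "hatchback", "hardtop", "sedan"])

body_cat = body.astype("category")

# String categories are sorted, and each code is a position in that order
code_to_label = dict(enumerate(body_cat.cat.categories))
print(code_to_label)

codes = body_cat.cat.codes
print(codes.tolist())

# The original labels can be recovered from the codes via the mapping
restored = codes.map(code_to_label)
print(restored.tolist())
```

Keeping such a mapping around is useful because `cat.codes` discards the labels once it overwrites the column.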


### `Approach 3 One-Hot Encoding / Get Dummies`

In [108]:
# Label encoding has the advantage that it is straightforward but it has the disadvantage that the numeric values can be “misinterpreted” by the algorithms. For example, the value of 0 is obviously less than the value of 4 but does that really correspond to the data set in real life? Does a wagon have “4X” more weight in our calculation than the convertible? In this example, I don’t think so.

# A common alternative approach is called one-hot encoding (it also goes by several other names, such as dummy variables). Despite the different names, the basic strategy is to convert each category value into a new column and assign a 1 or 0 (True/False) value to the column. This has the benefit of not weighting a value improperly but does have the downside of adding more columns to the data set.

obj_df_3 = obj_df.copy()

In [109]:
pd.get_dummies(obj_df_3, columns=["body_style", "drive_wheels"], prefix=["body", "drive"]).head()

Unnamed: 0,make,fuel_type,aspiration,num_doors,engine_location,engine_type,num_cylinders,fuel_system,body_convertible,body_hardtop,body_hatchback,body_sedan,body_wagon,drive_4wd,drive_fwd,drive_rwd
0,alfa-romero,gas,std,two,front,dohc,four,mpfi,True,False,False,False,False,False,False,True
1,alfa-romero,gas,std,two,front,dohc,four,mpfi,True,False,False,False,False,False,False,True
2,alfa-romero,gas,std,two,front,ohcv,six,mpfi,False,False,True,False,False,False,False,True
3,audi,gas,std,four,front,ohc,four,mpfi,False,False,False,True,False,False,True,False
4,audi,gas,std,four,front,ohc,five,mpfi,False,False,False,True,False,True,False,False
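
Two optional `get_dummies` parameters may be worth knowing here: `dtype` controls whether the new columns hold True/False or 1/0, and `drop_first` removes one redundant column per encoded feature. A minimal sketch on a made-up frame (hypothetical values, not the real data set):

```python
import pandas as pd

# Hypothetical sample (not the real data set)
cars = pd.DataFrame({
    "make": ["audi", "bmw", "audi", "volvo"],
    "drive_wheels": ["fwd", "rwd", "4wd", "rwd"],
})

# dtype=int yields 0/1 instead of True/False.
# drop_first=True drops the first level (here "4wd"), which is implied
# whenever all remaining dummy columns are 0.
dummies = pd.get_dummies(cars, columns=["drive_wheels"], prefix="drive",
                         dtype=int, drop_first=True)
print(dummies)
```

Dropping one level is a design choice that mainly matters for linear models, where a full set of dummy columns is perfectly collinear with the intercept.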


### `Approach 4 Custom Binary Encoding`

In [110]:
obj_df_4 = obj_df.copy()

# Depending on the data set, you may be able to use some combination of label encoding and one hot encoding to create a binary column that meets your needs for further analysis.

In [111]:
obj_df_4["engine_type"].value_counts()

# For the sake of discussion, maybe all we care about is whether or not the engine is an Overhead Cam (OHC). In other words, the various versions of OHC are all the same for this analysis. If this is the case, then we could use the str accessor plus np.where to create a new column that indicates whether or not the car has an OHC engine.

engine_type
ohc      148
ohcf      15
ohcv      13
dohc      12
l         12
rotor      4
dohcv      1
Name: count, dtype: int64

In [112]:
obj_df_4["OHC_Code"] = np.where(obj_df_4["engine_type"].str.contains("ohc"), 1, 0)
obj_df_4[["make", "engine_type", "OHC_Code"]].head()

Unnamed: 0,make,engine_type,OHC_Code
0,alfa-romero,dohc,1
1,alfa-romero,dohc,1
2,alfa-romero,ohcv,1
3,audi,ohc,1
4,audi,ohc,1
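
The `engine_type` column has no missing values, but the same pattern applied to a column with gaps needs some care, because `str.contains` can return a missing value instead of True or False for such rows (depending on the pandas version and column dtype). A minimal sketch, using a made-up series with one missing entry:

```python
import numpy as np
import pandas as pd

# Hypothetical sample including a missing value (not the real data set)
engine = pd.Series(["dohc", "ohcv", "rotor", "l", np.nan, "ohc"])

# na=False treats missing entries as "no match", so the mask is purely boolean
is_ohc = engine.str.contains("ohc", na=False)

# 1 where the engine type contains "ohc", 0 otherwise
ohc_code = np.where(is_ohc, 1, 0)
print(ohc_code.tolist())
```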


### `Approach 5 Ordinal Encoding & One Hot Encoding by Scikit-Learn`

In [113]:
obj_df_5 = obj_df.copy()

In [114]:
from sklearn.preprocessing import OrdinalEncoder

ord_enc = OrdinalEncoder()
obj_df_5["make_code"] = ord_enc.fit_transform(obj_df_5[["make"]])
obj_df_5[["make", "make_code"]].head(11)

Unnamed: 0,make,make_code
0,alfa-romero,0.0
1,alfa-romero,0.0
2,alfa-romero,0.0
3,audi,1.0
4,audi,1.0
5,audi,1.0
6,audi,1.0
7,audi,1.0
8,audi,1.0
9,audi,1.0
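
By default `OrdinalEncoder` numbers the categories in sorted order, which is fine for `make` but would be misleading for a feature with a natural order. A minimal sketch, assuming a hypothetical `size` feature that is not part of the automobile data set, passes the order explicitly through the `categories` parameter:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered feature (not part of the automobile data set)
sizes = pd.DataFrame({"size": ["medium", "small", "large", "small"]})

# One list of categories per encoded column, given in the desired order
enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
size_code = enc.fit_transform(sizes[["size"]])

# fit_transform returns a 2-D array; flatten it for display
print(size_code.ravel().tolist())
```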


In [115]:
from sklearn.preprocessing import OneHotEncoder

one_hot = OneHotEncoder()
oe_results = one_hot.fit_transform(obj_df_5[["body_style"]])
type(oe_results)

scipy.sparse._csr.csr_matrix

In [116]:
pd.DataFrame(oe_results.toarray(), columns=one_hot.categories_[0]).head()

Unnamed: 0,convertible,hardtop,hatchback,sedan,wagon
0,1.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0
4,0.0,0.0,0.0,1.0,0.0


In [117]:
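# categories_ is a list with one array per encoded column, so take the first entry for flat column names
obj_df_5 = obj_df_5.join(pd.DataFrame(oe_results.toarray(), columns=one_hot.categories_[0]))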

### `Scikit Learn Pipeline`
A pipeline makes it possible to incorporate multiple encoding approaches in a single model.

In [118]:
# Incorporate the OneHotEncoder and OrdinalEncoder into a pipeline and use cross_val_score to analyze the results.

from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# for the purposes of this analysis, only use a small subset of features

feature_cols = [
    'fuel_type', 'make', 'aspiration', 'highway_mpg', 'city_mpg',
    'curb_weight', 'drive_wheels'
]

# Remove the empty price rows
df_ml = df.dropna(subset=['price'])

X = df_ml[feature_cols]
y = df_ml['price']

In [119]:
column_trans = make_column_transformer((OneHotEncoder(handle_unknown='ignore'),
                                        ['fuel_type', 'make', 'drive_wheels']),
                                      (OrdinalEncoder(), ['aspiration']),
                                      remainder='passthrough')

In [120]:
linreg = LinearRegression()
pipe = make_pipeline(column_trans, linreg)


In [121]:
cross_val_score(pipe, X, y, cv=10, scoring='neg_mean_absolute_error').mean().round(2)

-2935.31
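
One practical reason for `handle_unknown='ignore'` is that a fitted pipeline can then score rows whose category was never seen during training. The sketch below illustrates this on made-up toy data (all values, and the unseen make "saab", are hypothetical and not taken from the automobile data set):

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy training data (not the automobile data set)
train = pd.DataFrame({
    "make": ["audi", "bmw", "audi", "volvo", "bmw", "volvo"],
    "curb_weight": [2300, 2800, 2400, 3000, 2900, 3100],
})
price = pd.Series([14000, 21000, 15000, 18000, 22000, 19000])

transformer = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["make"]),
    remainder="passthrough",
)
pipe = make_pipeline(transformer, LinearRegression())
pipe.fit(train, price)

# "saab" was never seen during fit; with handle_unknown="ignore"
# it is encoded as all zeros instead of raising an error
new_cars = pd.DataFrame({"make": ["audi", "saab"], "curb_weight": [2350, 2700]})
predictions = pipe.predict(new_cars)
print(predictions)

# Inspect how the fitted encoder represents the two makes
ohe = pipe[0].named_transformers_["onehotencoder"]
make_encoded = ohe.transform(new_cars[["make"]]).toarray()
print(make_encoded)
```

Because the encoders live inside the pipeline, the same fitted transformations are applied at prediction time without any separate preprocessing step.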