**DS 301: Applied Data Modeling and Predictive Analysis**

**Lecture 8 – One-Hot Encoding**

# One-Hot Encoding
Nok Wongpiromsarn, 9 September 2020

**Load the automobile price data and apply some filtering as in Lecture 7**

Automobile_price_data_Raw.csv can be downloaded from

https://github.com/MicrosoftLearning/Principles-of-Machine-Learning-Python/tree/master/Module3

We saved it as *automobile.csv* under the *datasets* folder.

In [None]:
import os
import pandas as pd

data_path = os.path.join("datasets", "automobile.csv")
data = pd.read_csv(data_path)

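# Remove all rows where price or horsepower is missing (recorded as "?").
# .copy() gives an independent DataFrame, so the assignments below do not
# trigger a SettingWithCopyWarning.
data = data[(data.price != "?") & (data.horsepower != "?")].copy()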

# Change the data type of price and horsepower from *object* to a suitable numeric type
data.price = pd.to_numeric(data.price)
data.horsepower = pd.to_numeric(data.horsepower)

# Preview the first 10 rows of the cleaned data
data.head(10)
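
The filtering and conversion steps above can be tried on a small, made-up table. This is only a sketch: the `toy` DataFrame and its values are hypothetical, and the `errors="coerce"` route is an alternative shown for comparison, not the approach used in this lecture.

```python
import pandas as pd

# Hypothetical toy data; "?" marks a missing value, as in the raw CSV
toy = pd.DataFrame({
    "price": ["13495", "?", "16500"],
    "horsepower": ["111", "154", "?"],
})

# Filter out rows containing "?" and convert, mirroring the cell above
filtered = toy[(toy.price != "?") & (toy.horsepower != "?")].copy()
filtered["price"] = pd.to_numeric(filtered["price"])
filtered["horsepower"] = pd.to_numeric(filtered["horsepower"])

# Alternative: turn anything non-numeric into NaN, then drop those rows
coerced = toy.apply(pd.to_numeric, errors="coerce").dropna()

print(filtered.dtypes)
print(coerced)
```

Both routes keep only the first row here. The coerce route yields floating-point columns because the intermediate NaN values force a float dtype.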

**Apply one-hot encoding to the body-style column**

In [None]:
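# dtype=int keeps the indicator columns as 0/1; recent pandas versions default to True/False
body_1hot = pd.get_dummies(data['body-style'], prefix='body-style', dtype=int)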
data = pd.concat([data, body_1hot], axis=1)
data.head(10)
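
To see what `pd.get_dummies` produces without the full dataset, here is a minimal sketch on a hypothetical four-row Series (the values are made up for illustration). Each distinct category becomes its own indicator column, and every row has exactly one 1.

```python
import pandas as pd

# Hypothetical body styles for four cars
styles = pd.Series(["sedan", "wagon", "sedan", "hatchback"], name="body-style")

# One indicator column per category, named <prefix>_<category>
dummies = pd.get_dummies(styles, prefix="body-style", dtype=int)
print(dummies)

# Each row contains exactly one 1, marking that car's body style
print(dummies.sum(axis=1))
```

The columns come out in sorted category order: `body-style_hatchback`, `body-style_sedan`, `body-style_wagon`.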

In [None]:
# Remove the 'body-style' column. 
# Specify inplace=True to delete the column without having to reassign data.
data.drop('body-style', axis=1, inplace=True)
data.head(10)
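
The difference between dropping in place and reassigning can be checked on a tiny example. The DataFrame `df` below is hypothetical and only illustrates the two calling styles.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Without inplace: drop returns a new DataFrame and leaves df unchanged
df_new = df.drop(columns="b")
print(list(df.columns))      # df still has both columns at this point

# With inplace=True: df itself is modified and the call returns None
result = df.drop("b", axis=1, inplace=True)
print(result)
print(list(df.columns))
```

Because the in-place call returns `None`, writing `data = data.drop(..., inplace=True)` would overwrite `data` with `None`. Use one style or the other, not both.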

**Fit a linear regression model using the one-hot encoded body-style columns**

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

x = data[['engine-size', 'curb-weight', 'width', 'highway-mpg',
          'body-style_convertible', 'body-style_hardtop', 'body-style_hatchback', 
          'body-style_sedan', 'body-style_wagon']]
y = data['price']

# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=4)

# Apply least squares linear regression
reg = LinearRegression().fit(x_train, y_train)
print("Coefficients: {}".format(reg.coef_))
print("Intercept: {}".format(reg.intercept_))

y_test_predict = reg.predict(x_test)
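# RMSE is the square root of the MSE; the squared=False argument is deprecated in newer scikit-learn releases
rmse = mean_squared_error(y_test, y_test_predict) ** 0.5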
rsquared = reg.score(x_test, y_test)
print("RMSE: {}".format(rmse))
print("Coefficient of determination: {}".format(rsquared))
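
One caveat about the model above: the five body-style indicator columns always sum to 1, so together with the intercept they are linearly dependent. `LinearRegression` still returns a fit, but the individual dummy coefficients are then not uniquely determined. A common remedy, not used in this lecture, is to drop one category with `drop_first=True`. The sketch below uses a hypothetical five-row Series and an explicit column of ones standing in for the intercept.

```python
import numpy as np
import pandas as pd

# Hypothetical body styles for five cars
styles = pd.Series(["sedan", "wagon", "sedan", "hatchback", "wagon"], name="body-style")

# Full encoding: one column per category
full = pd.get_dummies(styles, prefix="body-style", dtype=int)
# Reduced encoding: the first category (hatchback) becomes the baseline
reduced = pd.get_dummies(styles, prefix="body-style", dtype=int, drop_first=True)

# Prepend a column of ones to play the role of the intercept
ones = np.ones((len(styles), 1))
rank_full = np.linalg.matrix_rank(np.hstack([ones, full.to_numpy()]))
rank_reduced = np.linalg.matrix_rank(np.hstack([ones, reduced.to_numpy()]))

print(full.shape[1] + 1, rank_full)        # number of columns vs. rank, full encoding
print(reduced.shape[1] + 1, rank_reduced)  # number of columns vs. rank, reduced encoding
```

With the full encoding the design matrix has more columns than its rank. After dropping one category the two match, and each remaining coefficient is read relative to the dropped baseline category.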