**DS 301: Applied Data Modeling and Predictive Analysis**

**Lecture 7 – Feature Selection**

# Data Manipulation and Visualization
Nok Wongpiromsarn, 7 September 2020

**Load the automobile price data**

Automobile_price_data_Raw.csv can be downloaded from

https://github.com/MicrosoftLearning/Principles-of-Machine-Learning-Python/tree/master/Module3

We save it under the *datasets* folder as *automobile.csv*, the file name that the loading code below uses.

In [None]:
import os
import pandas as pd

data_path = os.path.join("datasets", "automobile.csv")
data = pd.read_csv(data_path)

**Examine the data using pandas**

In [None]:
# Print a concise summary of the data
data.info()

# Print the first 10 rows of data
data.head(10)
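The raw file marks missing values with `?`, which is why columns such as price load as *object* rather than as numbers. A minimal sketch for counting these placeholders per column is shown below. The `toy` frame and its values are made up for illustration; on the real data the same expression would be applied to `data`.

```python
import pandas as pd

# Toy frame standing in for the automobile data (values are made up).
toy = pd.DataFrame({
    "make": ["audi", "bmw", "dodge", "honda"],
    "price": ["13950", "?", "6377", "7295"],
    "horsepower": ["102", "101", "?", "76"],
})

# Compare every cell with the "?" placeholder, then count matches per column.
missing_counts = (toy == "?").sum()
print(missing_counts)
```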

**Remove rows based on column value and change data type of columns**

In [None]:
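# Remove all rows whose price or horsepower is the "?" placeholder.
# .copy() makes the result an independent frame, so the assignments
# below cannot trigger pandas' SettingWithCopyWarning.
data = data[(data["price"] != "?") & (data["horsepower"] != "?")].copy()

# Change the data type of price and horsepower from *object* to a suitable numeric type
data["price"] = pd.to_numeric(data["price"])
data["horsepower"] = pd.to_numeric(data["horsepower"])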

# Check the type of the price column
print("Price type: {}".format(data.dtypes["price"]))
print("Horsepower type: {}\n\n".format(data.dtypes["horsepower"]))

# Check info again
data.info()

**Filter the columns based on data type**

Select only columns with *object* type.

In [None]:
# Print the type of each column
print("{}\n\n".format(data.dtypes))

# Construct a dataframe data_object with only columns of type object
data_object = data.select_dtypes(include=[object])

# Check data_object info
data_object.info()
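The same method can also split a frame into its numeric and non-numeric parts. The sketch below illustrates this on a made-up `toy` frame. The names `numeric_cols` and `non_numeric_cols` are hypothetical, and the `"number"` selector is used here in place of `object` only to show the complementary selection.

```python
import pandas as pd

# Toy frame with two text columns and two numeric columns (values are made up).
toy = pd.DataFrame({
    "make": ["audi", "bmw", "dodge"],
    "fuel-type": ["gas", "gas", "diesel"],
    "price": [13950, 16430, 6377],
    "engine-size": [109.0, 108.0, 90.0],
})

# Keep only the numeric columns (integers and floats).
numeric_cols = toy.select_dtypes(include="number")

# Keep everything that is not numeric.
non_numeric_cols = toy.select_dtypes(exclude="number")

print(list(numeric_cols.columns))
print(list(non_numeric_cols.columns))
```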

**Plot the data using seaborn**

Use *boxplot* to show the three quartile values of the distribution along with extreme values.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns 

plt.figure(figsize=(26, 12))
sns.boxplot(x="make", y="price", data=data)

In [None]:
plt.figure(figsize=(15, 8))
sns.boxplot(x="body-style", y="price", data=data)

In [None]:
plt.figure(figsize=(15, 8))
sns.boxplot(x="num-of-doors", y="price", data=data)

In [None]:
plt.figure(figsize=(15, 8))
sns.boxplot(x="fuel-type", y="price", data=data)

In [None]:
plt.figure(figsize=(15, 8))
sns.boxplot(x="aspiration", y="price", data=data)

In [None]:
plt.figure(figsize=(15, 8))
sns.boxplot(x="aspiration", y="price", hue="body-style", data=data)
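To read the box plots above numerically, the three quartiles per category can be computed directly with pandas. This is a minimal sketch on a made-up `toy` frame; the values should correspond to the lower edge, centre line, and upper edge of each box, assuming the plotting library uses the same default quantile interpolation.

```python
import pandas as pd

# Toy prices for two fuel types (values are made up).
toy = pd.DataFrame({
    "fuel-type": ["gas", "gas", "gas", "gas",
                  "diesel", "diesel", "diesel", "diesel"],
    "price": [5000, 7000, 9000, 11000, 8000, 10000, 12000, 14000],
})

# First quartile, median, and third quartile of price within each group.
# unstack() turns the quantile level into columns 0.25, 0.5, 0.75.
quartiles = (
    toy.groupby("fuel-type")["price"]
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
print(quartiles)
```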

Use *pairplot* to plot pairwise relationships of relevant features.

This creates a grid of Axes such that each numeric variable in data will be shared in the y-axis across a single row and in the x-axis across a single column. The diagonal Axes are treated differently, drawing a plot to show the univariate distribution of the data for the variable in that column.

Use *hue* to map plot aspects to different colors.

In [None]:
sns.pairplot(data.loc[:,["city-mpg", "highway-mpg", "curb-weight", "make", "engine-size", "price", "horsepower"]], 
             hue="make", diag_kind="hist");

Use *kind="reg"* to fit linear regression models to the scatter plots.

In [None]:
sns.pairplot(data.loc[:,["city-mpg", "highway-mpg", "curb-weight", "make", "engine-size", "price", "horsepower"]], 
             kind='reg');

**Visualize the correlation matrix**

In [None]:
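# Correlate numeric columns only (pandas 2.0+ raises an error on text columns)
corr = data.select_dtypes(include="number").corr()
# format(precision=2) replaces set_precision, which pandas 2.0 removed
corr.style.background_gradient(cmap='coolwarm').format(precision=2)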
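Instead of reading the coloured matrix by eye, the column for one target can be extracted and ranked by absolute correlation. The sketch below does this on a made-up `toy` frame; the name `ranked` is hypothetical. Ranking by absolute value matters because a strong negative correlation is as informative for a linear model as a strong positive one.

```python
import pandas as pd

# Toy numeric data (values are made up).
toy = pd.DataFrame({
    "engine-size": [90, 110, 130, 150, 180],
    "highway-mpg": [40, 34, 30, 27, 22],
    "height": [54, 52, 55, 53, 54],
    "price": [6000, 9000, 13000, 17000, 25000],
})

# Correlation of every numeric column with price, excluding price itself.
corr_with_price = toy.select_dtypes(include="number").corr()["price"].drop("price")

# Sort by absolute value so strong negative correlations rank high too.
ranked = corr_with_price.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```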

**Focusing on the *price* column, we see that *engine-size* has the highest correlation with price**

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

x = data[['engine-size']]
y = data['price']

# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=4)

# Apply least squares linear regression
reg = LinearRegression().fit(x_train, y_train)
y_predict = reg.predict(x)

# Visualize the results
plt.scatter(x['engine-size'], y, color='black')
plt.plot(x, y_predict, color='blue')
plt.xlabel("engine-size")
plt.ylabel("price")
plt.show()

Inspect the model

In [None]:
print("Coefficient: {}".format(reg.coef_))
print("Intercept: {}".format(reg.intercept_))

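# The difference between predictions one unit of engine-size apart equals the coefficient
print(reg.predict(pd.DataFrame({"engine-size": [101]})) - reg.predict(pd.DataFrame({"engine-size": [100]})))
# The prediction at engine-size 0 equals the intercept
print(reg.predict(pd.DataFrame({"engine-size": [0]})))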
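The two checks above rely on the form of a one-feature linear model: prediction = coefficient × engine-size + intercept. The sketch below verifies this on made-up data that lies exactly on a known line. The names `toy_reg` and `predict_at` are hypothetical helpers, not part of the lecture code.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data lying exactly on the line price = 150 * engine-size - 5000.
toy_x = pd.DataFrame({"engine-size": [80, 100, 120, 140, 160]})
toy_y = 150 * toy_x["engine-size"] - 5000

toy_reg = LinearRegression().fit(toy_x, toy_y)


def predict_at(size):
    """Predict the price for a single engine-size value."""
    query = pd.DataFrame({"engine-size": [size]})
    return toy_reg.predict(query)[0]


# One extra unit of engine-size changes the prediction by the coefficient.
slope_from_predictions = predict_at(101) - predict_at(100)

# With the feature set to zero, only the intercept remains.
value_at_zero = predict_at(0)

print(slope_from_predictions, toy_reg.coef_[0])
print(value_at_zero, toy_reg.intercept_)
```

Note that an intercept is an extrapolation when no observed value is near zero, as is the case for engine size, so it need not be a realistic price.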

Evaluate the model

In [None]:
# Root mean squared error
from sklearn.metrics import mean_squared_error

y_test_predict = reg.predict(x_test)
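# Take the square root of the MSE; the squared=False argument was deprecated in scikit-learn 1.4 and later removed
rmse = mean_squared_error(y_test, y_test_predict) ** 0.5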
print("RMSE: {}".format(rmse))

# Coefficient of determination
rsquared = reg.score(x_test, y_test)
print("Coefficient of determination: {}".format(rsquared))
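Both metrics can be computed by hand, which makes their definitions explicit. RMSE is the square root of the mean squared residual, and the coefficient of determination is 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below uses made-up arrays `y_true` and `y_pred` rather than the lecture's test set.

```python
import numpy as np

# Made-up true values and predictions.
y_true = np.array([10.0, 12.0, 15.0, 19.0])
y_pred = np.array([11.0, 12.0, 14.0, 21.0])

residuals = y_true - y_pred

# Root mean squared error: square, average, then take the root.
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# Coefficient of determination: 1 - SS_res / SS_tot.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("RMSE: {}".format(rmse_manual))
print("Coefficient of determination: {}".format(r2_manual))
```

RMSE is expressed in the units of the target (here, price), whereas the coefficient of determination is unitless and equals 1 for a perfect fit.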

**Add more features**

1. Add curb-weight

In [None]:
x = data[['engine-size', 'curb-weight']]
y = data['price']

# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=4)

# Apply least squares linear regression
reg = LinearRegression().fit(x_train, y_train)
print("Coefficient: {}".format(reg.coef_))
print("Intercept: {}".format(reg.intercept_))

y_test_predict = reg.predict(x_test)
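# Take the square root of the MSE; the squared=False argument was deprecated in scikit-learn 1.4 and later removed
rmse = mean_squared_error(y_test, y_test_predict) ** 0.5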
rsquared = reg.score(x_test, y_test)
print("RMSE: {}".format(rmse))
print("Coefficient of determination: {}".format(rsquared))

2. Add wheel-base, which results in worse performance

In [None]:
x = data[['engine-size', 'curb-weight', 'wheel-base']]
y = data['price']

# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=4)

# Apply least squares linear regression
reg = LinearRegression().fit(x_train, y_train)
print("Coefficient: {}".format(reg.coef_))
print("Intercept: {}".format(reg.intercept_))

y_test_predict = reg.predict(x_test)
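# Take the square root of the MSE; the squared=False argument was deprecated in scikit-learn 1.4 and later removed
rmse = mean_squared_error(y_test, y_test_predict) ** 0.5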
rsquared = reg.score(x_test, y_test)
print("RMSE: {}".format(rmse))
print("Coefficient of determination: {}".format(rsquared))

3. Find a good combination based on the correlation matrix

In [None]:
x = data[['engine-size', 'curb-weight', 'width', 'highway-mpg']]
y = data['price']

# Split the data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=4)

# Apply least squares linear regression
reg = LinearRegression().fit(x_train, y_train)
print("Coefficient: {}".format(reg.coef_))
print("Intercept: {}".format(reg.intercept_))

y_test_predict = reg.predict(x_test)
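# Take the square root of the MSE; the squared=False argument was deprecated in scikit-learn 1.4 and later removed
rmse = mean_squared_error(y_test, y_test_predict) ** 0.5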
rsquared = reg.score(x_test, y_test)
print("RMSE: {}".format(rmse))
print("Coefficient of determination: {}".format(rsquared))
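The three experiments above repeat the same split, fit, and evaluate steps with a different feature list each time. A minimal sketch of a helper that wraps these steps is shown below. The function name `evaluate_features`, the synthetic `toy` frame, and the `noise-col` column are all assumptions made for illustration; they are not part of the lecture data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def evaluate_features(frame, features, target="price", random_state=4):
    """Fit a linear regression on the given features; return test RMSE and R^2."""
    x = frame[features]
    y = frame[target]
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.2, random_state=random_state
    )
    model = LinearRegression().fit(x_train, y_train)
    errors = y_test - model.predict(x_test)
    rmse = float(np.sqrt(np.mean(errors ** 2)))
    return rmse, model.score(x_test, y_test)


# Synthetic stand-in for the automobile data (all values are generated).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "engine-size": rng.uniform(60, 300, n),
    "curb-weight": rng.uniform(1500, 4000, n),
    "noise-col": rng.normal(size=n),  # unrelated to price by construction
})
toy["price"] = (
    100 * toy["engine-size"] + 5 * toy["curb-weight"] + rng.normal(0, 500, n)
)

# Evaluate several candidate feature subsets on the same train/test split.
candidates = (
    ["engine-size"],
    ["engine-size", "curb-weight"],
    ["engine-size", "curb-weight", "noise-col"],
)
results = {tuple(f): evaluate_features(toy, f) for f in candidates}

for features, (rmse, r2) in results.items():
    print(features, "RMSE:", round(rmse, 1), "R^2:", round(r2, 3))
```

Because `random_state` is fixed, every subset is scored on the same test rows, so the numbers are directly comparable. A single split can still be noisy on a small dataset, so differences between close candidates should be read with caution.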