**CSCI4840-Fall 2023 LAB3**
**Objective:**
The objective of this lab is to provide students with hands-on experience in implementing, training, and evaluating regression models in machine learning, with a specific emphasis on applying preprocessing and hyperparameter tuning strategies to improve the model's performance.

**Learning Outcomes:**
* 1. Data Understanding: Demonstrate the ability to effectively explore and interpret dataset structures, identifying potential issues in data quality and implementing appropriate strategies for handling missing values.
* 2. Data Preprocessing: Apply data preprocessing techniques to prepare datasets for machine learning, showcasing proficiency in handling various types of data.
* 3. Linear Regression: Implement a linear regression model using a machine learning library, showcasing an understanding of the core concepts and the ability to apply them to real-world datasets.
* 4. Model Evaluation: Evaluate model performance using relevant metrics such as Mean Squared Error and R-squared, demonstrating the ability to assess the effectiveness of machine learning models on unseen data.
* 5. Feature Importance: Analyze and interpret feature importance, creating visualizations to effectively communicate the significance of different features in predictive modeling.
* 6. Hyperparameter Tuning: Optimize model performance through experimentation with hyperparameter tuning, demonstrating the ability to fine-tune models for improved predictive accuracy.

**Remark:** 
* 1. This lab is individual work; please show your own understanding and effort.
* 2. By default, all answers are expected to be supported by code rather than manual analysis.
* 3. If a question expects an answer besides code, please answer it in a comment or add a Markdown box.

**Q1 (5 points):**

* 1. load CarPrice.xls into a dataframe
* 2. check detailed information of the dataset using info function
* 3. check the first 10 records to get a better idea about the data
* 4. check the pairwise correlation of columns. Does this function apply to non-numerical features?
* 5. check the missing values in each column of the DataFrame

In [1]:
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from io import StringIO
import sys
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

In [2]:
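# The file has an .xls extension but is read here as CSV text.
car = pd.read_csv('CarPrice.xls')
# set_index returns a new DataFrame; the result is not assigned, so
# car_ID remains an ordinary column (see the info() output below).
car.set_index('car_ID')
car.info()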

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   car_ID            205 non-null    int64  
 1   symboling         205 non-null    int64  
 2   CarName           205 non-null    object 
 3   fueltype          205 non-null    object 
 4   aspiration        205 non-null    object 
 5   doornumber        205 non-null    object 
 6   carbody           205 non-null    object 
 7   drivewheel        205 non-null    object 
 8   enginelocation    205 non-null    object 
 9   wheelbase         205 non-null    float64
 10  carlength         205 non-null    float64
 11  carwidth          205 non-null    float64
 12  carheight         205 non-null    float64
 13  curbweight        205 non-null    int64  
 14  enginetype        205 non-null    object 
 15  cylindernumber    205 non-null    object 
 16  enginesize        205 non-null    int64  
 1

In [3]:
car.head(10)

Unnamed: 0,car_ID,symboling,CarName,fueltype,aspiration,doornumber,carbody,drivewheel,enginelocation,wheelbase,...,enginesize,fuelsystem,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price
0,1,3,alfa-romero giulia,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495.0
1,2,3,alfa-romero stelvio,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500.0
2,3,1,alfa-romero Quadrifoglio,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500.0
3,4,2,audi 100 ls,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950.0
4,5,2,audi 100ls,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450.0
5,6,2,audi fox,gas,std,two,sedan,fwd,front,99.8,...,136,mpfi,3.19,3.4,8.5,110,5500,19,25,15250.0
6,7,1,audi 100ls,gas,std,four,sedan,fwd,front,105.8,...,136,mpfi,3.19,3.4,8.5,110,5500,19,25,17710.0
7,8,1,audi 5000,gas,std,four,wagon,fwd,front,105.8,...,136,mpfi,3.19,3.4,8.5,110,5500,19,25,18920.0
8,9,1,audi 4000,gas,turbo,four,sedan,fwd,front,105.8,...,131,mpfi,3.13,3.4,8.3,140,5500,17,20,23875.0
9,10,0,audi 5000s (diesel),gas,turbo,two,hatchback,4wd,front,99.5,...,131,mpfi,3.13,3.4,7.0,160,5500,16,22,17859.167


In [4]:
X, y = car.iloc[:, :-1], car.iloc[:, -1]

In [5]:
#corr_matrix = np.corrcoef(X)
#for x in corr_matrix:
#    print(x)

In [6]:
#cm = np.corrcoef(X.values.T)
#hm = sns.heatmap(cm, cbar=True, annot=True, square=True,)
#plt.show()

In [7]:
#car.corr()

I tried the corrcoef function, a correlation heat map, and the .corr() function. All three raised the same error: string values are unsupported, so pairwise correlation does not apply to non-numerical features. (I commented the calls out so that the notebook runs from top to bottom.)
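
The error arises because correlation is only defined for numeric columns. A minimal sketch on a toy frame (the rows are made up for illustration and are not read from CarPrice.xls) restricts the computation to numeric columns first:

```python
import pandas as pd

# Toy frame standing in for the car data; values are illustrative only.
toy = pd.DataFrame({
    "fueltype": ["gas", "diesel", "gas", "gas"],
    "enginesize": [130, 152, 109, 136],
    "price": [13495.0, 16500.0, 13950.0, 17450.0],
})

# Keep only numeric columns, then compute the pairwise Pearson correlation.
numeric_corr = toy.select_dtypes(include="number").corr()
print(numeric_corr)
```

In recent pandas versions, car.corr(numeric_only=True) should have the same effect.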

There are no null cells in the data frame, as evidenced by the info call above (205 rows, and all columns have 205 non-null entries). The check discussed in class confirms this:

In [8]:
car.isnull().sum()

car_ID              0
symboling           0
CarName             0
fueltype            0
aspiration          0
doornumber          0
carbody             0
drivewheel          0
enginelocation      0
wheelbase           0
carlength           0
carwidth            0
carheight           0
curbweight          0
enginetype          0
cylindernumber      0
enginesize          0
fuelsystem          0
boreratio           0
stroke              0
compressionratio    0
horsepower          0
peakrpm             0
citympg             0
highwaympg          0
price               0
dtype: int64

**Q2 (8 points): train without preprocessing**
* 1. temporarily exclude all non-numerical features
* 2. Split the data into train dataset and test dataset
* 3. apply linear regression on the train dataset
* 4. test the trained model on the test dataset
* 5. print out the performance measurement mse, mae, r2 score

In [9]:
X_num = car.iloc[:, [0 ,1, 9, 10, 11, 12, 13, 16, 18, 19, 20, 21, 22, 23, 24]]
X_train, X_test, y_train, y_test = train_test_split(X_num, y)
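
The positional indices in the cell above depend on the exact column order. A sketch of an alternative, on a toy frame with made-up rows, selects numeric columns by dtype instead. Dropping car_ID (a row identifier) and fixing random_state=0 are assumptions of this sketch, not part of the original cell.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the car data; values are illustrative only.
toy = pd.DataFrame({
    "car_ID": [1, 2, 3, 4, 5, 6, 7, 8],
    "fueltype": ["gas"] * 8,
    "enginesize": [130, 130, 152, 109, 136, 136, 136, 131],
    "horsepower": [111, 111, 154, 102, 115, 110, 110, 140],
    "price": [13495.0, 16500.0, 16500.0, 13950.0,
              17450.0, 15250.0, 17710.0, 23875.0],
})

# Select numeric columns by dtype; drop the target and the row identifier.
X_num = toy.select_dtypes(include="number").drop(columns=["price", "car_ID"])
y = toy["price"]

# random_state pins the shuffle so repeated runs produce the same split.
X_train, X_test, y_train, y_test = train_test_split(
    X_num, y, test_size=0.25, random_state=0
)
print(X_train.shape, X_test.shape)
```

With a fixed split, metric differences between two feature sets reflect the features rather than the shuffle.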

In [10]:
# Ordinary least squares linear regression on the numeric features.
lin_reg_num = LinearRegression()
lin_reg_num.fit(X_train, y_train)

In [11]:
y_pred = lin_reg_num.predict(X_test)
print('MSE:', mean_squared_error(y_test, y_pred))
print('MAE:', mean_absolute_error(y_test, y_pred))
print('R^2:', r2_score(y_test, y_pred))

MSE: 12654535.630497536
MAE: 2410.0049986364384
R^2: 0.7484316085833278
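
MSE is expressed in squared price units, so its square root (about 3,557 for the run above) is easier to compare with the MAE. As a sketch of what the three metrics compute, the snippet below reproduces them with NumPy on a small made-up example and checks the results against scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative targets and predictions, not taken from the lab data.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 8.0, 8.5])

residuals = y_true - y_hat
mse = np.mean(residuals ** 2)         # mean squared error
mae = np.mean(np.abs(residuals))      # mean absolute error
rmse = np.sqrt(mse)                   # same units as the target
# R^2 = 1 - (residual sum of squares) / (total sum of squares)
r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mse, mae, rmse, r2)
```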


**Q3 (5 points): preprocessing the data set**
Identify the ordinal features, and then employ an appropriate strategy to transform them into numerical values.

In [12]:
door_map = {'two' : 2, 'four': 4}
car['doornumber'] = car['doornumber'].map(door_map)

In [13]:
cyl_map = {'twelve' : 12, 'eight' : 8, 'six' : 6, 'five' : 5, 'four' : 4, 'three' : 3, 'two' : 2}
car['cylindernumber'] = car['cylindernumber'].map(cyl_map)
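
Series.map returns NaN for any value that is missing from the dictionary, so a misspelled or unexpected category would silently become a missing value. A minimal sketch of a guard, using a toy column with made-up rows:

```python
import pandas as pd

# Toy column standing in for car['doornumber']; values are illustrative.
toy = pd.DataFrame({"doornumber": ["two", "four", "four", "two"]})
door_map = {"two": 2, "four": 4}

mapped = toy["doornumber"].map(door_map)

# Any category absent from door_map shows up as NaN after mapping,
# so a null check catches typos or unexpected values.
assert mapped.notna().all(), "unmapped doornumber values found"

toy["doornumber"] = mapped.astype(int)
print(toy["doornumber"].tolist())
```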

**Q4 (5 points): preprocessing the data set**
Identify the nominal features, and then employ an appropriate strategy to transform them into numerical values. You can simply remove features that you consider useless before you apply the transformation to the remaining features.

In [14]:
# Nominal (unordered) categorical columns to one-hot encode.
nominal_cols = ['CarName', 'fueltype', 'aspiration', 'carbody',
                'drivewheel', 'enginetype', 'fuelsystem', 'enginelocation']
ohe = OneHotEncoder()
one_hot_encoded = ohe.fit_transform(car[nominal_cols]).toarray()
one_hot_df = pd.DataFrame(one_hot_encoded, columns=ohe.get_feature_names_out(nominal_cols))
# Append the encoded columns, then drop the original string columns.
car = pd.concat([car, one_hot_df], axis=1)
car.drop(columns=nominal_cols, inplace=True)
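
One-hot encoding creates one column per distinct value, so a column such as CarName, where most rows have their own value, adds many nearly empty columns. The sketch below checks cardinality before encoding on a toy frame with made-up rows. The rule of dropping columns that are unique per row is a hypothetical choice for illustration, and pd.get_dummies is used here in place of OneHotEncoder.

```python
import pandas as pd

# Toy frame standing in for the car data; values are illustrative only.
toy = pd.DataFrame({
    "CarName": ["audi 100 ls", "audi fox", "audi 5000", "audi 4000"],
    "fueltype": ["gas", "gas", "diesel", "gas"],
    "carbody": ["sedan", "sedan", "wagon", "sedan"],
    "price": [13950.0, 15250.0, 18920.0, 23875.0],
})
nominal_cols = ["CarName", "fueltype", "carbody"]

# Count distinct values per nominal column.
cardinality = toy[nominal_cols].nunique()
print(cardinality)

# Hypothetical rule for this sketch: drop columns that are unique per row.
too_wide = cardinality[cardinality == len(toy)].index.tolist()

# Encode the remaining string columns; numeric columns pass through.
encoded = pd.get_dummies(toy.drop(columns=too_wide), dtype=float)
print(encoded.columns.tolist())
```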

In [15]:
# Move price back to the last column so that iloc[:, -1] selects the target.
price = car.pop('price')
car.insert(len(car.columns), 'price', price)

**Q5 (1 point): check feature data type**
Use info() to check your dataframe; all features should be numerical if the last two steps were applied correctly.

In [16]:
car.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Columns: 194 entries, car_ID to price
dtypes: float64(184), int64(10)
memory usage: 310.8 KB


**Q6 (5 points): check the correlation**
* 1. use corr() to check the pairwise correlation of columns in the dataframe.
* 2. update the dataframe so that it only contains features whose absolute correlation with price is greater than or equal to 0.7

In [17]:
corr = car.corr()
price_corr = corr.iloc[:, -1]
sort_corr = pd.DataFrame(columns = ['correlation', 'index'])
for x in range(len(price_corr)):
    sort_corr.loc[x] = [abs(price_corr.iloc[x]), x]
sort_corr.sort_values('correlation', ascending = False, inplace = True)


In [18]:
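car_column_names = car.columns
# Keep columns whose absolute correlation with price is at least 0.7
# (>= matches the "greater than or equal to" wording of the task).
high_corr_indices = sort_corr[sort_corr['correlation'] >= 0.7].index
relevant_columns = [car_column_names[i] for i in high_corr_indices]
filtered_car_df = car[relevant_columns]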
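
The same filter can be written without the explicit loop. A sketch on a toy frame (made-up values, chosen so that one feature is strongly positive, one strongly negative, and one weakly correlated with price):

```python
import pandas as pd

# Toy frame standing in for the processed car data; values are illustrative.
toy = pd.DataFrame({
    "enginesize": [1, 2, 3, 4, 5, 7],
    "citympg": [60, 50, 40, 30, 20, 10],
    "peakrpm": [1, -1, 1, -1, 1, -1],
    "price": [10, 20, 30, 40, 50, 60],
})

# Absolute correlation of every column with price (price itself is 1.0).
price_corr = toy.corr()["price"].abs()

# Boolean indexing keeps the columns at or above the threshold.
selected = price_corr[price_corr >= 0.7].sort_values(ascending=False).index
filtered = toy[selected]
print(filtered.columns.tolist())
```

Taking the absolute value keeps strongly negative predictors such as citympg in this toy example, which a plain threshold on the signed correlation would discard.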

In [19]:
filtered_car_df

Unnamed: 0,price,enginesize,curbweight,horsepower,carwidth,cylindernumber
0,13495.0,130,2548,111,64.1,4
1,16500.0,130,2548,111,64.1,4
2,16500.0,152,2823,154,65.5,6
3,13950.0,109,2337,102,66.2,4
4,17450.0,136,2824,115,66.4,5
...,...,...,...,...,...,...
200,16845.0,141,2952,114,68.9,4
201,19045.0,141,3049,160,68.8,4
202,21485.0,173,3012,134,68.9,6
203,22470.0,145,3217,106,68.9,6


**Q7 (3 points): apply model on processed data**
* 1. split the processed data into train and test set
* 2. use linear regression to train and test
* 3. print out the mse, mae, and r2 score

In [20]:
X_train, X_test, y_train, y_test = train_test_split(filtered_car_df.iloc[:, 1:], filtered_car_df.iloc[:, 0] )

In [21]:
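# Ordinary least squares linear regression on the correlation-filtered features.
lin_reg_filtered = LinearRegression()
lin_reg_filtered.fit(X_train, y_train)

In [22]:
y_pred = lin_reg_filtered.predict(X_test)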
print('MSE:', mean_squared_error(y_test, y_pred))
print('MAE:', mean_absolute_error(y_test, y_pred))
print('R^2:', r2_score(y_test, y_pred))

MSE: 14977460.974860903
MAE: 2450.1996977476006
R^2: 0.7733510685366318


**Q8 (3 points): Conclusion**
Did you achieve better results in Q7 compared to Q2? Provide a brief discussion explaining why or why not.

After running the regressions several times, the results were much the same between the two methods, with the processed data leaning slightly better. In the runs shown above, R^2 rose from 0.748 to 0.773, while MSE was higher (about 14.98 million versus 12.65 million). Each run draws a different random train/test split, so differences of this size could easily come from the split rather than from the features. Most of the features that pass the correlation filter are numeric columns that were already used in Q2, so with this particular dataset the preprocessing did little to identify more useful features. Standardizing the features (preferable to normalizing for this kind of regression) would make the coefficients comparable across features, although for ordinary least squares it is not expected to change the predictions on its own.
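
Because a single random split gives a noisy comparison, one option is to score both feature sets on the same cross-validation folds. The sketch below uses synthetic data rather than the car dataset; the fold count and the seeds are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data: only the first two of six features drive the target.
rng = np.random.default_rng(42)
X_all = rng.normal(size=(100, 6))
y = 3.0 * X_all[:, 0] - 2.0 * X_all[:, 1] + rng.normal(scale=0.5, size=100)
X_subset = X_all[:, :2]

# A seeded KFold gives both feature sets identical train/test folds.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores_all = cross_val_score(LinearRegression(), X_all, y, cv=cv, scoring="r2")
scores_subset = cross_val_score(LinearRegression(), X_subset, y, cv=cv, scoring="r2")

print("all features:    mean R^2 =", scores_all.mean())
print("feature subset:  mean R^2 =", scores_subset.mean())
```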

In [26]:
from sklearn.preprocessing import StandardScaler
stdsc = StandardScaler()
X, y = car.iloc[:, :-1], car.iloc[:, -1]
X_std = stdsc.fit_transform(X)
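
The cell above fits the scaler on every row, including rows that would later serve as test data. A sketch that fits the scaler on the training rows only, using a scikit-learn pipeline on synthetic data (the data, seeds, and coefficients are assumptions for illustration), also compares the result with an unscaled model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3)) * np.array([1.0, 100.0, 1000.0])
y = X @ np.array([2.0, 0.05, 0.003]) + rng.normal(scale=0.1, size=60)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The pipeline fits the scaler on the training rows only, then reuses
# the training means and variances to transform the test rows.
scaled_model = make_pipeline(StandardScaler(), LinearRegression())
scaled_model.fit(X_train, y_train)

# Same model without scaling, for comparison.
plain_model = LinearRegression().fit(X_train, y_train)

print("R^2 scaled:", scaled_model.score(X_test, y_test))
print("R^2 plain: ", plain_model.score(X_test, y_test))
```

For ordinary least squares with an intercept, the two models give essentially the same predictions. Scaling matters more for regularized or distance-based models and for comparing coefficient magnitudes.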
