# Import data

<p>
You can find the "Automobile Data Set" at the following link: <a href="https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data">https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data</a>.
</p> 
<p>The aim is to predict used-car prices from the given set of features.</p>

In [None]:
# Column names to assign to the data set
headers = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration", "num-of-doors", "body-style",
           "drive-wheels", "engine-location", "wheel-base", "length", "width", "height", "curb-weight", "engine-type",
           "num-of-cylinders", "engine-size", "fuel-system", "bore", "stroke", "compression-ratio", "horsepower",
           "peak-rpm", "city-mpg", "highway-mpg", "price"]

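As a minimal sketch of how a header list like the one above might be attached when reading the file, assuming pandas is used: the in-memory sample below is a made-up three-column excerpt standing in for the downloaded file, and the shortened `headers` list is only for illustration.

```python
import io

import pandas as pd

# Made-up three-column excerpt; the real file has 26 columns.
sample = io.StringIO("3,?,alfa-romero\n1,164,audi\n")
headers = ["symboling", "normalized-losses", "make"]

# header=None treats the first line as data; names supplies the column labels.
df = pd.read_csv(sample, header=None, names=headers)
print(df.head())
```

With the real data, the URL above could be passed to `pd.read_csv` in place of `sample`, together with the full header list.
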
# Data Wrangling
Perform data wrangling to convert the data from its initial format into one better suited to analysis.

## Handling Missing Values

- Convert "?" to NaN
- Count missing values in each column
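
One possible way to do both steps with pandas is sketched below. The small frame uses illustrative values, not rows from the data set.

```python
import numpy as np
import pandas as pd

# Toy frame with "?" marking missing entries (illustrative values only).
df = pd.DataFrame({
    "normalized-losses": ["?", "164", "158"],
    "num-of-doors": ["two", "?", "four"],
    "price": ["13495", "16500", "?"],
})

# Replace the "?" placeholder with NaN so pandas treats it as missing.
df = df.replace("?", np.nan)

# isnull() yields a boolean frame; summing counts the True values per column.
missing_counts = df.isnull().sum()
print(missing_counts)
```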

<b>Replace by mean:</b>
<ul>
    <li>"normalized-losses"</li>
    <li>"stroke"</li>
    <li>"bore"</li>
    <li>"horsepower"</li>
    <li>"peak-rpm"</li>
</ul>

<b>Replace by frequency:</b>
<ul>
    <li>"num-of-doors": replace with the most frequently occurring value.
    </li>
</ul>

<b>Drop the whole row:</b>
<ul>
    <li>"price"
    </li>
</ul>
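
The three strategies could look roughly like this in pandas. This is a sketch on a toy frame with made-up values, showing one column per strategy; the same pattern would be repeated for the other listed columns.

```python
import numpy as np
import pandas as pd

# Toy frame in which missing entries are already NaN (illustrative values only).
df = pd.DataFrame({
    "bore": ["3.0", np.nan, "4.0", "3.5"],
    "num-of-doors": ["four", "two", np.nan, "four"],
    "price": ["13495", "16500", "13950", np.nan],
})

# Replace by mean: cast to float first, then fill NaN with the column mean.
bore = df["bore"].astype("float")
df["bore"] = bore.fillna(bore.mean())

# Replace by frequency: fill NaN with the most common category.
most_common = df["num-of-doors"].value_counts().idxmax()
df["num-of-doors"] = df["num-of-doors"].fillna(most_common)

# Drop the whole row when the target "price" is missing, then renumber rows.
df = df.dropna(subset=["price"]).reset_index(drop=True)
print(df)
```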

## Correct Data Format

- "bore", "stroke", "price", "peak-rpm": float
- "normalized-losses", "horsepower": int
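
A short sketch of the type conversion, assuming the columns hold numeric strings with no missing values left (toy values shown):

```python
import pandas as pd

# Toy frame whose numeric columns were read as strings (illustrative values).
df = pd.DataFrame({
    "bore": ["3.47", "2.68"],
    "price": ["13495", "16500"],
    "horsepower": ["111", "154"],
})

# astype accepts a mapping from column name to target type.
df = df.astype({"bore": "float", "price": "float", "horsepower": "int"})
print(df.dtypes)
```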

## Data Standardization

- city-mpg, highway-mpg: convert mpg to L/100km by dividing 235 by the mpg value
- Name the new columns city-L/100km and highway-L/100km
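
The conversion is a single element-wise division. A sketch on toy values:

```python
import pandas as pd

# Toy fuel-economy values in miles per gallon (illustrative only).
df = pd.DataFrame({"city-mpg": [21, 19], "highway-mpg": [27, 25]})

# L/100km = 235 / mpg, applied element-wise to each column.
df["city-L/100km"] = 235 / df["city-mpg"]
df["highway-L/100km"] = 235 / df["highway-mpg"]
print(df)
```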

## Data Normalization 

"length", "width" and "height" - normalize using min-max normalization

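Min-max normalization rescales each column to the range [0, 1] via (x - min) / (max - min). A minimal sketch on made-up dimensions:

```python
import pandas as pd

# Toy dimensions (illustrative values only).
df = pd.DataFrame({
    "length": [150.0, 175.0, 200.0],
    "width": [60.0, 65.0, 70.0],
    "height": [50.0, 52.0, 58.0],
})

# Rescale each column so its minimum maps to 0 and its maximum to 1.
for col in ["length", "width", "height"]:
    col_min, col_max = df[col].min(), df[col].max()
    df[col] = (df[col] - col_min) / (col_max - col_min)

print(df)
```
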
## Binning 

- Segment the numerical `horsepower` column into 3 equal-sized bins labeled 'Low', 'Medium' and 'High', and store the result in a new column named `horsepower-binned`
- Plot the histogram of `horsepower`
- Plot a bar plot for `horsepower-binned`
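
The binning step might be sketched as follows, assuming "equal-sized" means equal-width intervals between the minimum and maximum (an interpretation, not something the task states). The values are made up, and the plots are left out.

```python
import numpy as np
import pandas as pd

# Toy horsepower values (illustrative only).
df = pd.DataFrame({"horsepower": [48, 100, 150, 200, 288]})

# Four evenly spaced edges define three equal-width bins.
bins = np.linspace(df["horsepower"].min(), df["horsepower"].max(), 4)
group_names = ["Low", "Medium", "High"]

# include_lowest=True keeps the minimum value inside the first bin.
df["horsepower-binned"] = pd.cut(
    df["horsepower"], bins, labels=group_names, include_lowest=True
)
print(df)
```

The counts per bin for a bar plot could then be obtained with `df["horsepower-binned"].value_counts()`.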

## Dummy Variable

- Convert the categorical variable `fuel-type` into dummy variables, name the resulting columns `gas` and `diesel`, and drop the `fuel-type` column
- Convert the categorical variable `aspiration` into dummy variables, name the resulting columns `std` and `turbo`, and drop the `aspiration` column
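
One way to build the dummy columns is `pd.get_dummies`, which names each new column after its category value. A sketch for `fuel-type` on toy rows; `aspiration` would follow the same pattern.

```python
import pandas as pd

# Toy rows (illustrative values only).
df = pd.DataFrame({
    "fuel-type": ["gas", "diesel", "gas"],
    "price": [13495.0, 16500.0, 13950.0],
})

# One indicator column per category: "diesel" and "gas".
dummies = pd.get_dummies(df["fuel-type"])

# Append the indicator columns and drop the original categorical column.
df = pd.concat([df, dummies], axis=1).drop(columns="fuel-type")
print(df)
```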

# Exploratory Data Analysis

<p>Perform EDA to identify the variables that matter most for predicting price.</p>

## Regression plots

- Visualize the relationship between each of the following numerical variables and `price` using **regression plots**, compute the corresponding correlation coefficients, and provide your observations on the results
    - `engine-size`
    - `highway-mpg`
    - `peak-rpm`
    - `stroke`

## Box plots
- Visualize the relationship between the following categorical variables and `price` using **box plots** and provide your observations on the results
    - `body-style`
    - `engine-location`
    - `drive-wheels`

## Value counts
Get the value counts for `drive-wheels` and `engine-location`
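
A sketch of the counting step on toy categories. The column label `value_counts` is a name chosen here for illustration.

```python
import pandas as pd

# Toy categories (illustrative values only).
df = pd.DataFrame({"drive-wheels": ["fwd", "rwd", "fwd", "4wd"]})

# value_counts() returns a Series indexed by category; to_frame makes it a table.
counts = df["drive-wheels"].value_counts().to_frame(name="value_counts")
print(counts)
```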

## Heat map
Use the `groupby` function to find the average `price` of each group, grouping by `drive-wheels` and `body-style`. Convert the result to a pivot table, then use a heat map to visualize how `price` varies with `drive-wheels` and `body-style`.
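
The grouping and pivoting steps might look like this. The rows are made up, and the heat map itself is omitted.

```python
import pandas as pd

# Toy rows (illustrative values only).
df = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "rwd", "rwd", "rwd"],
    "body-style": ["sedan", "hatchback", "sedan", "sedan", "hatchback"],
    "price": [10000.0, 8000.0, 20000.0, 24000.0, 15000.0],
})

# Average price for every (drive-wheels, body-style) combination.
grouped = df.groupby(["drive-wheels", "body-style"], as_index=False)["price"].mean()

# Rows become drive-wheels, columns become body-style, cells hold the mean price.
pivot = grouped.pivot(index="drive-wheels", columns="body-style", values="price")
print(pivot)
```

The resulting `pivot` table is a 2-D grid of numbers, which is the shape a heat-map plotting function typically expects.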

## Pearson Correlation Coefficient and P-value
- Calculate the Pearson correlation coefficient and p-value between each of the following features and `price`, and interpret the values
    - `wheel-base`
    - `horsepower`
    - `length`
    - `width`
    - `curb-weight`
    - `engine-size`
    - `bore`
    - `city-mpg`
    - `highway-mpg`
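
A sketch of the calculation for one feature, assuming SciPy is available. The numbers are made up, so the printed values say nothing about the real data.

```python
import pandas as pd
from scipy import stats

# Toy feature/price pairs (illustrative values only).
df = pd.DataFrame({
    "horsepower": [70, 90, 110, 150, 200],
    "price": [7000, 9500, 11000, 17000, 30000],
})

# pearsonr returns the correlation coefficient and the two-sided p-value.
coef, p_value = stats.pearsonr(df["horsepower"], df["price"])
print(f"Pearson coefficient: {coef:.3f}, p-value: {p_value:.4f}")
```

A coefficient near +1 or -1 suggests a strong linear relationship, and a small p-value suggests the correlation is unlikely to be due to chance alone.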

## ANOVA test
Analyze whether the type of `drive-wheels` affects `price` using an ANOVA test.
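
A one-way ANOVA could be run with `scipy.stats.f_oneway`, passing one array of prices per group. A sketch on made-up prices:

```python
import pandas as pd
from scipy import stats

# Toy prices per drive-wheels category (illustrative values only).
df = pd.DataFrame({
    "drive-wheels": ["fwd"] * 3 + ["rwd"] * 3 + ["4wd"] * 3,
    "price": [8000, 9000, 10000, 18000, 20000, 22000, 9500, 10500, 11500],
})

# Collect the prices belonging to each drive-wheels group.
groups = df.groupby("drive-wheels")["price"]

# f_oneway tests whether the group means differ more than chance would explain.
f_val, p_val = stats.f_oneway(
    groups.get_group("fwd"), groups.get_group("rwd"), groups.get_group("4wd")
)
print(f"F = {f_val:.2f}, p = {p_val:.5f}")
```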

## Important Variables
Based on the exploratory data analysis above, list 10 important variables to feed into the model.

# Model Development

Develop several models to predict the price of used cars using the important features identified above.

## Simple Linear Regression

Use simple linear regression to predict `price` based on the feature `highway-mpg`. Find the intercept and slope, and plug them into the linear function.
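
A minimal sketch of the fit, assuming scikit-learn's `LinearRegression` is used. The toy prices lie exactly on a line so the recovered intercept and slope are easy to check.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data lying exactly on price = 50000 - 1000 * highway-mpg (illustrative only).
df = pd.DataFrame({
    "highway-mpg": [20, 25, 30, 35, 40],
    "price": [30000, 25000, 20000, 15000, 10000],
})

X = df[["highway-mpg"]]  # 2-D feature table, as scikit-learn expects
y = df["price"]

lm = LinearRegression().fit(X, y)
intercept, slope = lm.intercept_, lm.coef_[0]

# The fitted linear function: price = intercept + slope * highway-mpg
print(f"price = {intercept:.1f} + ({slope:.1f}) * highway-mpg")
```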

### Model Evaluation using Visualization

Make regression and residual plots for `highway-mpg` and `price`, and interpret the plots.

### Measures for In-Sample Evaluation
- Calculate the R^2
- Compute the Mean Squared Error
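
Both measures compare actual values with predictions. A sketch using scikit-learn's metric functions on made-up numbers; with a fitted model, `y_pred` would come from its `predict` method.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy actual and predicted prices (illustrative values only).
y_true = np.array([10000.0, 15000.0, 20000.0])
y_pred = np.array([11000.0, 15000.0, 19000.0])

# MSE: average of the squared differences between actual and predicted values.
mse = mean_squared_error(y_true, y_pred)

# R^2: 1 minus (residual sum of squares / total sum of squares).
r2 = r2_score(y_true, y_pred)
print(f"MSE = {mse:.2f}, R^2 = {r2:.2f}")
```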

## Multiple Linear Regression

Use any 4 important variables as predictors of `price`. Find the intercept and the coefficients, and plug them into the linear function.

### Model Evaluation using Visualization

Compare the distribution of the fitted values that result from the model and the actual values by plotting both the distributions on the same plot and interpret the plot.

### Measures for In-Sample Evaluation
- Calculate the R^2
- Compute the Mean Squared Error

## Polynomial Regression
Create a 3rd order polynomial model for `highway-mpg` and `price` using np.polyfit and np.poly1d and plot the polynomial function using the `plot` method.
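
A sketch of the polynomial fit with NumPy. The toy prices are generated from a known cubic (an assumption made so the fit can be checked), and plotting is omitted.

```python
import numpy as np

# Toy inputs; prices come from a known cubic purely for illustration.
x = np.array([20.0, 25.0, 30.0, 35.0, 40.0, 45.0])
y = 0.5 * x**3 - 40 * x**2 + 500 * x + 20000

# Least-squares fit of a 3rd-order polynomial; coefficients are highest power first.
coeffs = np.polyfit(x, y, 3)

# poly1d wraps the coefficients in a callable polynomial object.
p = np.poly1d(coeffs)
print(p)

# Evaluate the fitted polynomial at a new highway-mpg value.
print(p(32.0))
```

Evaluating `p` on a dense grid such as `np.linspace(x.min(), x.max(), 100)` gives the points to pass to a plotting call.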

### Measures for In-Sample Evaluation
- Compute the R^2
- Compute the Mean Squared Error

## Decision Making
Determine which of the three models is best, based on their R^2 and MSE values.

# Model Evaluation 
Determine how accurate the predictions are.

## Training & Testing
- Use the function "train_test_split" to split up the data set such that 15% of the data samples will be utilized for testing, set the parameter "random_state" equal to 1
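
The split might be sketched as follows on a synthetic frame. The names `x_data` and `y_data` are chosen here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic frame with 100 rows (illustrative values only).
df = pd.DataFrame({
    "horsepower": np.arange(100) + 60,
    "price": np.arange(100) * 200 + 5000,
})

x_data = df.drop(columns="price")  # features
y_data = df["price"]               # target

# Hold out 15% of the rows for testing; random_state makes the split repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.15, random_state=1
)
print("training samples:", x_train.shape[0])
print("test samples:", x_test.shape[0])
```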

### Simple Linear Regression
- Build a simple linear regression model using `horsepower` feature from the training set
- Calculate the R^2 on the test data and interpret the results
- Perform 4-fold cross-validation and report the mean and standard deviation of the resulting scores.
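
The three steps above can be sketched as follows. The data are synthetic (a noisy straight line generated with a fixed seed), so the scores only illustrate the mechanics.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data: price is a noisy linear function of horsepower (illustrative only).
rng = np.random.default_rng(0)
horsepower = rng.uniform(50, 250, size=80)
price = 100 * horsepower + 2000 + rng.normal(0, 500, size=80)
x_data = pd.DataFrame({"horsepower": horsepower})
y_data = pd.Series(price, name="price")

x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.15, random_state=1
)

# Fit on the training set only, then score on the held-out test set.
lr = LinearRegression()
lr.fit(x_train[["horsepower"]], y_train)
r2_test = lr.score(x_test[["horsepower"]], y_test)  # R^2 on the test data
print(f"Test R^2: {r2_test:.3f}")

# 4-fold cross-validation over the full data; each score is an R^2 value.
scores = cross_val_score(lr, x_data[["horsepower"]], y_data, cv=4)
print(f"CV mean: {scores.mean():.3f}, CV std: {scores.std():.3f}")
```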

### Multiple Linear Regression
- Build a multiple linear regression model using the 4 important variables selected previously, fitted on the training set.
- Predict the values using the training data and test data
- Make a Distribution plot of Predicted prices and Actual prices of the training data.
- Make a Distribution plot of Predicted prices and Actual prices of the test data.