# AI-Dealership-Strategy with CRISP-DM Framework
AI modeling to assist dealership with selling used cars using the CRISP-DM Framework.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices. In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.

Task reframing: Accurately model the price of a used car, given the factors available in the data set of 426 thousand used car prices.

The business objective is to find out what the key drivers/factors for used car prices are so that the dealership can set its prices correctly and make more profitable sales. From a data mining standpoint, we reframe the task as correctly predicting the price of a used car based on our data. If we can train a model that successfully predicts the price of a car, then its weights tell us which factors contribute most to that price. Additionally, we'll be able to establish a certain amount of confidence by evaluating our models on a held-out test set created with a train-test split.

### Data Understanding

After considering the business understanding, we want to get familiar with our data.


<img src="images/NAs.png"/>

Even though we have roughly 426 thousand entries, we see that a lot of them contain NaNs.
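As a minimal sketch of how such a missing-value count could be produced with pandas: the small `df` below is a hypothetical stand-in for the real data set, and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the used car data set.
df = pd.DataFrame({
    "price": [5000, 12000, 8000, 15000],
    "odometer": [120000, np.nan, 90000, 30000],
    "condition": [np.nan, np.nan, "good", np.nan],
})

# Number of missing values per column, largest first.
na_counts = df.isna().sum().sort_values(ascending=False)

# Fraction of rows that are missing in each column.
na_share = df.isna().mean()

print(na_counts)
print(na_share)
```

Columns whose missing share is very high are the candidates for removal discussed next.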

Considering the NaN counts, and taking a closer look at some samples of the data and their variance, we find the following:

- The `VIN` and `id` columns are just different identifiers for our vehicles. They are useless for our data modeling and should be dropped.
- The `price`, `odometer` and `year` columns have wide distributions. We'll keep `year` as is, but we'll remove the upper and lower percentiles from `price` and `odometer`. It is very rare that a used car sells for almost 400,000 or 0 dollars, and cars with 0 miles on the odometer aren't exactly used.
- `drive`, `size`, `condition`, `cylinders` and `paint_color` have so many NaNs that keeping them would reduce our data set by more than half, so we should remove those columns.

<img src="images/boxplot-price.png"/>
<img src="images/boxplot-odometer.png"/>
<img src="images/boxplot-year.png"/>

- If we take a look at the unique value count for each column, we find that most columns have a reasonable number of unique values, with the exception of the `model` column, which has 30,000 different models. This is unreasonably large, so the column should be removed.
<img src="images/unique1.png"/>
<img src="images/unique2.png"/>
<img src="images/unique3.png"/>

- We aren't going to do any geographical modeling since our data set is too small for the entire United States. We can thus remove the `state` and `region` columns from our data set.

- The histograms show that several columns could reasonably be removed from our model:

Some values of both `manufacturer` and `model` are heavily over-represented, so both columns should be removed from our model.
<img src="images/histogram-manufacturer.png"/>
<img src="images/histogram-model.png"/>


- The following columns are reasonable to keep in our model, with the exception of those we already removed for other reasons.
<img src="images/histogram-fuel.png"/>
<img src="images/histogram-title_status.png"/>
<img src="images/histogram-transmission.png"/>
<img src="images/histogram-type.png"/>


We are left with the following columns, with the adjustments noted below, followed by removal of every row that contains a NaN:
- `price` -> remove the top and bottom percentiles to account for extremes
- `odometer` -> remove the top and bottom percentiles to account for extremes
- `year`
- `fuel`
- `title_status`
- `transmission`
- `type`

### Data Preparation

After our initial exploration and fine-tuning of the business understanding, it is time to construct our final dataset prior to modeling. Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`.

Let's drop all the columns we decided not to keep and remove every row that contains a NaN.

<img src="images/dropna.png"/>

We end up with 319,715 rows out of 426,880, which is a ~25% reduction in data. This is not ideal; however, our data set is still very large and we've reduced the useful factors down to just a handful.
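A minimal sketch of this step, assuming the data lives in a pandas DataFrame. The toy `df` and the shortened `keep` list are hypothetical and only illustrate the mechanics of selecting columns, dropping incomplete rows and measuring the reduction.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the real data set has many more rows and columns.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "price": [5000, 12000, 8000, 15000],
    "year": [2010, 2015, np.nan, 2018],
    "fuel": ["gas", "diesel", "gas", None],
    "model": ["a", "b", "c", "d"],
})

# Keep only the columns chosen for modeling (shortened here for illustration).
keep = ["price", "year", "fuel"]

# Drop every row that still has a missing value in the kept columns.
clean = df[keep].dropna()

# Fraction of rows lost by the clean-up.
reduction = 1 - len(clean) / len(df)
print(f"{len(clean)} of {len(df)} rows kept ({reduction:.0%} reduction)")
```

Applied to the figures above, the same formula gives 1 - 319715 / 426880, or roughly 25%.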

Let's continue our data preparation. As mentioned before, it doesn't make sense for used car dealers to be selling cars with no miles on them, cars that cost nothing, or outliers that cost $400,000, so let's remove the top and bottom 2% from both the `price` and `odometer` columns.

<img src="images/boxplot-price-fixed.png"/>
<img src="images/boxplot-odometer-fixed.png"/>

These distributions look much more reasonable!
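One way to trim the extremes is to filter on quantiles, as sketched below. The helper name `trim_percentiles` and the toy prices are hypothetical, and whether the two columns are trimmed one after the other or together is an assumption not stated above.

```python
import pandas as pd

def trim_percentiles(df, column, lower=0.02, upper=0.98):
    """Keep only rows whose `column` value lies between the given quantiles."""
    lo = df[column].quantile(lower)
    hi = df[column].quantile(upper)
    return df[df[column].between(lo, hi)]

# Hypothetical prices with one zero and one extreme outlier.
prices = [0] + list(range(5000, 14800, 100)) + [400000]
df = pd.DataFrame({"price": prices})

trimmed = trim_percentiles(df, "price")
print(len(df), "->", len(trimmed))
print(trimmed["price"].min(), trimmed["price"].max())
```

The same helper could then be applied to `odometer`.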

Since we have categorical data we'll need to perform One-Hot Encoding on the following columns:
- `fuel`
- `title_status`
- `transmission`
- `type`
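A minimal sketch of One-Hot Encoding with `pandas.get_dummies`. The toy frame is hypothetical, and the source does not say which encoder was actually used; `get_dummies` is chosen here only because it produces column names of the form `fuel_gas`.

```python
import pandas as pd

# Hypothetical toy frame with two of the categorical columns.
df = pd.DataFrame({
    "price": [5000, 12000, 8000],
    "fuel": ["gas", "diesel", "gas"],
    "transmission": ["automatic", "manual", "automatic"],
})

# Each category becomes its own 0/1 column named <column>_<value>.
encoded = pd.get_dummies(df, columns=["fuel", "transmission"], dtype=int)
print(encoded.columns.tolist())
```

Passing `drop_first=True` would drop one category per column, which avoids perfectly collinear dummy columns when fitting a linear model.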

Now we have 30 columns:
<img src="images/Encoded.png"/>


Now that we have a bunch of One-Hot Encoded variables and several other variables, we need to scale all of them so they have equal potential when initially training the model. Using a Standard Scaler, we scale every column except `price`.
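A sketch of scaling the features while leaving the target untouched, using scikit-learn's `StandardScaler`. The toy values are hypothetical; the point is that `price` is split off before scaling and is kept out of the feature matrix.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame.
df = pd.DataFrame({
    "price": [4000, 8000, 15000, 25000],
    "year": [2005, 2010, 2015, 2020],
    "odometer": [180000, 120000, 60000, 20000],
})

# Separate the target from the features before scaling.
X = df.drop(columns=["price"])
y = df["price"]

# Standardize each feature to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns, index=X.index)
print(X_scaled.round(3))
```

In a full workflow the scaler would normally be fitted on the training split only, so that no information from the test rows reaches the model.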

Now let's use a heatmap to see some general trends.

<img src="images/scaled-heatmap.png"/>

We see some general relationships that make sense. There is a positive correlation between price and year, as a newer car is usually more expensive. Additionally, we see a negative correlation between price and odometer. This is also reasonable, as more miles means more use, which makes the car less valuable.
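The numbers behind such a heatmap can be read directly from the correlation matrix. The sketch below uses a hypothetical toy frame whose values were chosen to show the same signs as described above.

```python
import pandas as pd

# Hypothetical toy frame: newer cars cost more, high-mileage cars cost less.
df = pd.DataFrame({
    "price": [4000, 8000, 15000, 25000],
    "year": [2005, 2010, 2015, 2020],
    "odometer": [180000, 120000, 60000, 20000],
})

# Pearson correlation of every feature with the target.
corr_with_price = df.corr()["price"].drop("price")
print(corr_with_price)
```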

Let's check the correlation between our columns using a graph as a theory check.

<img src="images/pairplot.png"/>

The scatter plots for price, odometer and year seem reasonable. It's much harder to tell for the rest, as they are all One-Hot Encoded. However, this seems reasonable enough, so let's build our model.

### Modeling

We'll tackle three different models and see what works best. Our first model will be using Linear Regression.

Fitting our Linear Regression model we get:

- R_squared: 1.0
- RMSE: 7.810271353972936e-10
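For reference, here is a sketch of how these two metrics could be computed on a held-out test set. The data is synthetic and every name in it is hypothetical. The second half illustrates one well-known way to obtain a perfect score by accident: leaving the target among the input features. Whether that happened here is not confirmed by the source.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: price depends on year and odometer plus noise.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "year": rng.integers(2000, 2022, n),
    "odometer": rng.uniform(1e4, 2e5, n),
})
df["price"] = (15000 + 1000 * (df["year"] - 2000)
               - 0.05 * df["odometer"] + rng.normal(0, 2000, n))

# Features must not contain the target.
X = df.drop(columns=["price"])
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
r2 = r2_score(y_test, pred)
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
print(f"R_squared: {r2:.3f}, RMSE: {rmse:.1f}")

# Leakage illustration: the target is left inside the feature matrix.
leaky_r2 = LinearRegression().fit(df, y).score(df, y)
print(f"R_squared with price among the features: {leaky_r2:.6f}")
```

With noisy data and no leakage, the R_squared stays below 1 and the RMSE is on the order of the noise.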

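An R_squared of exactly 1.0 with an RMSE near zero means the model reproduces every price almost perfectly, which is not believable for real listings. We are overfitting very heavily in our model, and a score this perfect is also worth checking for target leakage (for example, `price` remaining among the input features). Let's use Recursive Feature Elimination (RFE) to try to fix this issue: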

Using RFE and selecting for 3 features we find:

| Column | Selected | Rank |
| --- | --- | --- |
| `year` | False | 27 |
| `odometer` | False | 26 |
| `fuel_diesel` | True | 1 |
| `fuel_electric` | False | 3 |
| `fuel_gas` | True | 1 |
| `fuel_hybrid` | True | 1 |
| `fuel_other` | False | 2 |
| `title_status_clean` | False | 7 |
| `title_status_lien` | False | 8 |
| `title_status_missing` | False | 11 |
| `title_status_parts only` | False | 12 |
| `title_status_rebuilt` | False | 9 |
| `title_status_salvage` | False | 10 |
| `transmission_automatic` | False | 5 |
| `transmission_manual` | False | 6 |
| `transmission_other` | False | 4 |
| `type_SUV` | False | 16 |
| `type_bus` | False | 25 |
| `type_convertible` | False | 18 |
| `type_coupe` | False | 17 |
| `type_hatchback` | False | 20 |
| `type_mini-van` | False | 21 |
| `type_offroad` | False | 22 |
| `type_other` | False | 15 |
| `type_pickup` | False | 13 |
| `type_sedan` | False | 23 |
| `type_truck` | False | 14 |
| `type_van` | False | 19 |
| `type_wagon` | False | 24 |

This is not reasonable: `fuel_gas`, `fuel_diesel` and `fuel_hybrid` should not be the highest ranking features. There were 5 fuel types possible: `fuel_gas`, `fuel_diesel`, `fuel_hybrid`, `fuel_electric` and `fuel_other`. All of them are ranked between 1st and 3rd place, and together they cover every single car in the data set, so the ranking does not distinguish one car from another. In this case our model was not particularly useful.
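For comparison, here is a sketch of RFE on synthetic data where the informative features are known in advance. All column names are hypothetical. `RFE` with a linear estimator repeatedly drops the feature with the smallest coefficient magnitude, so it relies on the features being on comparable scales.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: three informative features and two pure-noise features.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(300, 5)),
    columns=["f0", "f1", "f2", "noise0", "noise1"],
)
y = 5 * X["f0"] + 3 * X["f1"] + 2 * X["f2"] + rng.normal(0, 0.1, 300)

# Keep the three features with the largest surviving coefficients.
selector = RFE(LinearRegression(), n_features_to_select=3).fit(X, y)

for col, selected, rank in zip(X.columns, selector.support_, selector.ranking_):
    print(f"Column: {col}, Selected {selected}, Rank: {rank}")

selected_columns = X.columns[selector.support_].tolist()
```

One possible explanation for the fuel result, which the source does not confirm, is that a full set of dummy columns for one category always sums to 1. Such columns are perfectly collinear with the intercept and can receive very large, offsetting coefficients, which a magnitude-based ranking then rewards.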

The last model we'll try is a Ridge regression, with its `alpha` tuned by GridSearchCV, to see if that works.

- Parameters Rankings: {'ridge__alpha': 0.1}
- Resulting Score: 1.0
- Model: 0.9999999999921322

And the result:

MSE: 0.001

We see the resulting score is again 1.0, like the R_squared from our Linear Regression model. This model is far too overfit and does not correctly predict the results.
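A sketch of how a grid search over `ridge__alpha` could be set up. The pipeline step name `ridge` is inferred from the parameter name in the output above. The alpha grid, the number of folds, the scoring choice and the synthetic data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: price depends on year and odometer plus noise.
rng = np.random.default_rng(1)
n = 400
X = pd.DataFrame({
    "year": rng.integers(2000, 2022, n),
    "odometer": rng.uniform(1e4, 2e5, n),
})
y = (15000 + 1000 * (X["year"] - 2000)
     - 0.05 * X["odometer"] + rng.normal(0, 2000, n))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling happens inside the pipeline, so each CV fold is scaled on its own.
pipe = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge())])
param_grid = {"ridge__alpha": [0.1, 1.0, 10.0, 100.0]}  # hypothetical grid

search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)

best_alpha = search.best_params_["ridge__alpha"]
test_mse = mean_squared_error(y_test, search.predict(X_test))
print(f"best alpha: {best_alpha}, test MSE: {test_mse:.1f}")
```

Reporting the error on `X_test`, which the search never saw, gives a more honest estimate than the score on the data used for tuning.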

### Evaluation

For future iterations, we'll need to adjust the model as follows:

There were far too many column values that required One-Hot Encoding. This resulted in a model that was exceptionally overfitted and hard to train well.
Step 1: Stop relying so heavily on One-Hot Encoded columns, or at least filter them down so that they do not dominate the model.

There were several fields that may have been very useful in building our model accurately, but they had too many missing values and had to be removed. Some examples of the useful columns: `wheelbase` or `cylinders`.
Step 2: Improve data collection for these columns, or accept losing a lot of rows in order to use those columns, as they are likely very valuable for the model.

Step 3: Try a different model such as K-Nearest Neighbors (KNN). This model might help with grouping the different fuel types together, for example, which might avoid the issue of them all being ranked very high together.
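A minimal sketch of what a KNN-based alternative could look like with scikit-learn's `KNeighborsRegressor`. The synthetic data and the choice of `n_neighbors=5` are assumptions, and this is not a claim that KNN would perform better on the real data.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: price depends on year and odometer plus noise.
rng = np.random.default_rng(2)
n = 500
X = pd.DataFrame({
    "year": rng.integers(2000, 2022, n),
    "odometer": rng.uniform(1e4, 2e5, n),
})
y = (15000 + 1000 * (X["year"] - 2000)
     - 0.05 * X["odometer"] + rng.normal(0, 2000, n))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# KNN is distance based, so features are scaled before the neighbor search.
knn = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsRegressor(n_neighbors=5)),
])
knn.fit(X_train, y_train)

knn_r2 = r2_score(y_test, knn.predict(X_test))
print(f"KNN R_squared on the test set: {knn_r2:.3f}")
```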


### Deployment

We found that the highest ranking features for our model are `fuel_gas`, `fuel_diesel` and `fuel_hybrid`, with `fuel_other` in second place and `fuel_electric` in third place. Since these are all 5 of the possible fuel types, the ranking covers every single car in the data set. This is completely unreasonable, so we cannot draw any valuable insights from this data. In this case our model was not particularly useful.