# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to ensure speed of processing. Your goal is to understand what factors make a car more or less expensive. As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src="images/crisp.png" width="50%"/>
</center>


To frame the task, throughout our practical applications we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

This dataset of used-car attributes and prices is very large. Even after removing entries with missing values and outliers, around 200,000 entries remain from this portion of the full set, which is enough to build a model on.
Only a few attributes are numeric: mileage and year. Most are categorical, such as number of cylinders, paint color, model, manufacturer, region, type, and size. The challenge is to identify which attributes correlate most strongly with price while including as many categorical attributes as possible, keeping in mind that some categories, such as model and region, have thousands of unique values. Preliminary analysis suggests that a good way to reach an accurate price estimate is to group cars by manufacturer and then refine by other properties such as year, number of cylinders, paint color, and model.

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

Steps to explore and evaluate the data:
1. Get the size of the dataset (number of entries and columns), the nature of each feature (numeric or categorical), and the number of missing or duplicate values.
2. For the numeric columns, learn how the values are distributed, explore the histograms, and check for outliers. For the categorical columns, count the unique values and examine how they are distributed.
3. Examine how each feature correlates with the target (price) column.
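
The steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the code from the notebook: the helper `summarize`, the toy DataFrame, and the column names `price`, `year`, `odometer`, and `manufacturer` are all assumptions made for the example.

```python
import pandas as pd


def summarize(df: pd.DataFrame, target: str = "price") -> dict:
    """Collect basic size, quality, and correlation facts (hypothetical helper)."""
    numeric = df.select_dtypes(include="number")
    categorical = df.select_dtypes(exclude="number")
    return {
        # Step 1: size, missing values, duplicate rows
        "shape": df.shape,
        "missing": df.isna().sum().to_dict(),
        "duplicates": int(df.duplicated().sum()),
        # Step 2: number of unique values per categorical column
        "unique_categories": categorical.nunique().to_dict(),
        # Step 3: linear correlation of each numeric feature with the target
        "corr_with_target": numeric.corr()[target].drop(target).to_dict(),
    }


# Toy data for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "price": [5000, 12000, 30000, 12000, 8000],
    "year": [2005, 2012, 2019, 2012, 2008],
    "odometer": [180000, 90000, 20000, 90000, None],
    "manufacturer": ["ford", "toyota", "tesla", "toyota", "ford"],
})
report = summarize(toy)
```

Histograms and outlier checks for the numeric columns could then be done with `df.describe()` and `df.hist()`, and `value_counts()` would show how each categorical column is distributed.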

### Data Preparation

After our initial exploration and fine tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 

- See the attached Jupyter notebook for data preparation: cleaning and filling missing values, removing outliers, and feature engineering. The region and state columns were given a numerical representation weighted by median price (normalized by median price), and the condition column was given an ordinal (numeric) representation to avoid adding columns through OneHotEncoder.
- Further data preparation applied OneHotEncoder to the categorical features. The model feature was omitted because it contained more than 10,000 unique values and was out of scope for this analysis.
- Three datasets were tried as model input. The best results were expected when the price column (the target) was transformed to a logarithmic scale, because the price distribution is very wide and spans over 3 orders of magnitude.
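
A rough sketch of the three transformations described above is shown here. It is one possible reading of the approach rather than the notebook's actual code: the helper `prepare`, the ordering in `CONDITION_ORDER`, the base-10 logarithm, and the column names are assumptions.

```python
import numpy as np
import pandas as pd

# Assumed ordering of condition labels, worst to best (not confirmed by the source).
CONDITION_ORDER = ["salvage", "fair", "good", "excellent", "like new", "new"]


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: encode region, condition, and price numerically."""
    out = df.copy()
    overall_median = out["price"].median()
    # Region -> median price of that region, normalized by the overall median price
    region_median = out.groupby("region")["price"].transform("median")
    out["region_weight"] = region_median / overall_median
    # Condition -> integer rank, which avoids extra one-hot columns
    rank = {label: i for i, label in enumerate(CONDITION_ORDER)}
    out["condition_rank"] = out["condition"].map(rank)
    # Price -> log scale (base 10 assumed) to compress the wide price range
    out["log_price"] = np.log10(out["price"])
    return out.drop(columns=["region", "condition"])


# Toy data for illustration only.
toy = pd.DataFrame({
    "price": [1000, 10000, 100000, 10000],
    "region": ["a", "a", "b", "b"],
    "condition": ["fair", "good", "new", "good"],
})
prepared = prepare(toy)
```

Because `region_weight` is derived from the target, it would probably be safer to compute the medians on the training split only and then apply them to the test split, so that test prices do not leak into the features.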

### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.

I built 3 models, each on a differently prepared dataset. As explained above, transforming the price column to a logarithmic scale gave the best results. I used GridSearchCV with 5 folds for cross-validation, along with separate train and test sets.
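
The setup described above (a held-out test set plus a 5-fold `GridSearchCV`) might look roughly like the sketch below. The source does not say which estimator or parameter grid was used, so `Ridge`, the `ALPHAS` grid, the synthetic data, and the column names are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; not drawn from the real dataset.
rng = np.random.default_rng(0)
n = 200
makes = rng.choice(["ford", "toyota", "tesla"], size=n)
years = rng.integers(2000, 2021, size=n)
premium = pd.Series(makes).map({"ford": 0.0, "toyota": 0.2, "tesla": 0.6}).to_numpy()
log_price = 3.5 + 0.03 * (years - 2000) + premium + rng.normal(0, 0.05, size=n)

X = pd.DataFrame({"manufacturer": makes, "year": years})
X_train, X_test, y_train, y_test = train_test_split(
    X, log_price, test_size=0.25, random_state=0
)

# One-hot encode the categorical column and scale the numeric one.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["manufacturer"]),
    ("num", StandardScaler(), ["year"]),
])
pipe = Pipeline([("prep", preprocess), ("model", Ridge())])

# 5-fold cross-validated search over the regularization strength.
ALPHAS = [0.1, 1.0, 10.0]
search = GridSearchCV(pipe, {"model__alpha": ALPHAS}, cv=5, scoring="r2")
search.fit(X_train, y_train)

# Score the refit best model on the held-out test set.
test_r2 = search.score(X_test, y_test)
```

Keeping the preprocessing inside the `Pipeline` means the encoder and scaler are refit within each fold, so the validation folds do not influence the transformations.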

### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight on drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

A high-performing model is one with high predictive capability and a low margin of error. My preliminary results show that the car manufacturer is a key driver of price. Other contributing factors were year and diesel fuel. To better evaluate my model, I suggest refitting it after dropping all rows with missing values instead of filling them, trying different feature engineering, and applying regularization over the features.
From the client, I would request cleaner data and additional business data, such as: 1) market supply and demand by manufacturer and model, and 2) supply and demand per regional and state market. The business data can be incorporated into the dataset either directly or to support feature engineering.
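
When the target is modeled on a logarithmic scale, the "margin of error" mentioned above is easier to explain to a dealership if it is converted back to dollars. The sketch below shows one way to do that; the helper `dollar_mae` is hypothetical, the base-10 logarithm is an assumption, and the numbers are made up for illustration.

```python
import numpy as np


def dollar_mae(y_true_log, y_pred_log) -> float:
    """Mean absolute error in price units, given log10-scale values (hypothetical helper)."""
    true_price = np.power(10.0, np.asarray(y_true_log, dtype=float))
    pred_price = np.power(10.0, np.asarray(y_pred_log, dtype=float))
    return float(np.mean(np.abs(true_price - pred_price)))


# Illustrative values only: true prices 1,000 and 10,000; predictions 1,100 and 9,000.
y_true_log = np.log10([1000.0, 10000.0])
y_pred_log = np.log10([1100.0, 9000.0])
mae = dollar_mae(y_true_log, y_pred_log)  # average of a 100 and a 1,000 dollar miss
```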

### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine tuning their inventory.

All three models I tried consistently showed that the top factor driving the price of a car is the manufacturer: high-end manufacturers such as Ferrari, Porsche, Aston Martin, and Tesla drive the price up. The type of car also has a big influence, represented among the top factors by the number of cylinders. Cars with 10 or 12 cylinders are likely to be more expensive, since these cylinder counts correlate with high-end models. Other contributing factors are a clean title status, year, and diesel fuel. Among car types, trucks were associated with higher prices.