# What drives the price of a car?

![](images/kurt.jpeg)

**OVERVIEW**

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars. The provided dataset contains information on 426K cars to keep processing fast. Your goal is to understand what factors make a car more or less expensive. As a result of your analysis, you should provide clear recommendations to your client, a used car dealership, as to what consumers value in a used car.

### CRISP-DM Framework

<center>
    <img src="images/crisp.png" width="50%"/>
</center>


To frame the task, throughout our practical applications we will refer back to a standard process in industry for data projects called CRISP-DM.  This process provides a framework for working through a data problem.  Your first step in this application will be to read through a brief overview of CRISP-DM [here](https://mo-pcco.s3.us-east-1.amazonaws.com/BH-PCMLAI/module_11/readings_starter.zip).  After reading the overview, answer the questions below.

### Business Understanding

From a business perspective, we are tasked with identifying key drivers for used car prices.  In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition.  Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary. 

ANSWER
Business Understanding: I'm tasked with identifying the factors that influence used car prices for a used car dealership. From a data standpoint, this is a supervised regression problem: I'll predict car price from attributes such as make, model, year, mileage, and condition. The task involves exploring and cleaning the dataset, fitting regression models, and then interpreting them to see how these attributes relate to price. Finally, I'll need to select the most relevant features and read the models' coefficients or feature importances to pinpoint the key drivers of price in the used car market.

### Data Understanding

After considering the business understanding, we want to get familiar with our data.  Write down some steps that you would take to get to know the dataset and identify any quality issues within.  Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('/Users/justinjavier/Downloads/practical_application_II_starter(2)/data/vehicles.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 18 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   region        426880 non-null  object 
 2   price         426880 non-null  int64  
 3   year          425675 non-null  float64
 4   manufacturer  409234 non-null  object 
 5   model         421603 non-null  object 
 6   condition     252776 non-null  object 
 7   cylinders     249202 non-null  object 
 8   fuel          423867 non-null  object 
 9   odometer      422480 non-null  float64
 10  title_status  418638 non-null  object 
 11  transmission  424324 non-null  object 
 12  VIN           265838 non-null  object 
 13  drive         296313 non-null  object 
 14  size          120519 non-null  object 
 15  type          334022 non-null  object 
 16  paint_color   296677 non-null  object 
 17  state         426880 non-null  object 
dtypes: float64(2), int64(2), object(14)

In [5]:
df.tail()

| | id | region | price | year | manufacturer | model | condition | cylinders | fuel | odometer | title_status | transmission | VIN | drive | size | type | paint_color | state |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 426875 | 7301591192 | wyoming | 23590 | 2019.0 | nissan | maxima s sedan 4d | good | 6 cylinders | gas | 32226.0 | clean | other | 1N4AA6AV6KC367801 | fwd | | sedan | | wy |
| 426876 | 7301591187 | wyoming | 30590 | 2020.0 | volvo | s60 t5 momentum sedan 4d | good | | gas | 12029.0 | clean | other | 7JR102FKXLG042696 | fwd | | sedan | red | wy |
| 426877 | 7301591147 | wyoming | 34990 | 2020.0 | cadillac | xt4 sport suv 4d | good | | diesel | 4174.0 | clean | other | 1GYFZFR46LF088296 | | | hatchback | white | wy |
| 426878 | 7301591140 | wyoming | 28990 | 2018.0 | lexus | es 350 sedan 4d | good | 6 cylinders | gas | 30112.0 | clean | other | 58ABK1GG4JU103853 | fwd | | sedan | silver | wy |
| 426879 | 7301591129 | wyoming | 30590 | 2019.0 | bmw | 4 series 430i gran coupe | good | | gas | 22716.0 | clean | other | WBA4J1C58KBM14708 | rwd | | coupe | | wy |


In [7]:
print("Data Information:")
df.info()  # info() prints its report directly and returns None, so no print() is needed
print("\nData Shape:")
print(df.shape)  
print("\nFirst Few Rows:")
print(df.head()) 

#Descriptive Statistics
print("\nSummary Statistics:")
print(df.describe())  

#Missing Values
print("\nMissing Values:")
print(df.isnull().sum()) 

Data Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 18 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   region        426880 non-null  object 
 2   price         426880 non-null  int64  
 3   year          425675 non-null  float64
 4   manufacturer  409234 non-null  object 
 5   model         421603 non-null  object 
 6   condition     252776 non-null  object 
 7   cylinders     249202 non-null  object 
 8   fuel          423867 non-null  object 
 9   odometer      422480 non-null  float64
 10  title_status  418638 non-null  object 
 11  transmission  424324 non-null  object 
 12  VIN           265838 non-null  object 
 13  drive         296313 non-null  object 
 14  size          120519 non-null  object 
 15  type          334022 non-null  object 
 16  paint_color   296677 non-null  object 
 17  state         426880 non-null  object 

**Data Understanding:**

Based on the initial exploration of the dataset, I've identified some key findings and quality issues:

1. **Data Overview**:
   - The dataset contains 426,880 rows and 18 columns.
   - Columns include various features like price, year, manufacturer, model, condition, and more.
   - Several columns have missing values, including 'year', 'manufacturer', 'condition', 'cylinders', 'VIN', 'drive', 'size', 'type', and 'paint_color'.

2. **Summary Statistics**:
   - The 'price' column ranges from 0 to extremely high values, with a mean of $75,199.
   - The 'year' column has a range from 1900 to 2022, with some outliers.
   - The 'odometer' column ranges from 0 to 10,000,000 miles.

3. **Missing Values**:
   - 'size' is missing in about 72% of rows, and 'cylinders' (about 42%) and 'condition' (about 41%) are missing in a large share as well.
   - 'VIN' (about 38%), 'drive' (about 31%), and 'paint_color' (about 31%) also have a high share of missing values.

To further explore and clean the dataset, I plan to:
- Handle missing values by either imputation or removal, depending on the column and the extent of missing data.
- Investigate and address outliers in the 'price' and 'year' columns.
- Explore relationships between features through data visualization and statistical analysis.
- Engineer new features if needed to better represent the problem.
- Document the data preprocessing steps and findings in my report for the used car dealership.
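
The missing-value check in the plan above can be made concrete by computing the share of missing entries per column. The snippet below is a minimal sketch on a small made-up frame; the helper name `missing_share` and the sample values are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the vehicles data (values are made up for illustration).
sample = pd.DataFrame({
    "price": [23590, 0, 34990, 28990],
    "year": [2019.0, 2020.0, np.nan, 2018.0],
    "size": [np.nan, np.nan, np.nan, "full-size"],
    "condition": ["good", np.nan, "good", "excellent"],
})


def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return frame.isnull().mean().sort_values(ascending=False)


shares = missing_share(sample)
print(shares)
```

Calling `missing_share(df)` on the full dataset would rank the columns by how incomplete they are, which helps decide between imputing a column and dropping it.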

### Data Preparation

After our initial exploration and fine tuning of the business understanding, it is time to construct our final dataset prior to modeling.  Here, we want to make sure to handle any integrity issues and cleaning, the engineering of new features, any transformations that we believe should happen (scaling, logarithms, normalization, etc.), and general preparation for modeling with `sklearn`. 

## Data Preparation Plan

1. **Handling Missing Values:**
   - I will address missing values in the dataset by deciding on appropriate strategies based on the column's importance and the extent of missing data.
   - For columns with a small number of missing values, I'll consider imputation using mean, median, or mode.
   - For columns with a large number of missing values or those that don't provide significant information, I'll consider dropping them.

2. **Outlier Handling:**
   - I will examine and deal with outliers in the 'price' and 'year' columns.
   - Outliers can be treated by winsorizing (clipping) values beyond a certain threshold or transforming them through mathematical functions.

3. **Feature Engineering:**
   - I will explore options for creating new features that might be relevant to predicting car prices. For example, I'll calculate the age of the car from the 'year' column.

4. **Categorical Encoding:**
   - I will convert categorical variables (e.g., 'region', 'manufacturer', 'fuel') into numerical representations. One-hot encoding suits unordered categories; label (ordinal) encoding implies an order, so I'll reserve it for ordered columns such as 'condition'.

5. **Scaling and Normalization:**
   - If necessary, I'll apply scaling or normalization to numerical features like 'odometer' to bring them to a consistent scale.

6. **Data Splitting:**
   - I'll split the dataset into training and testing sets to evaluate model performance.

7. **Data Transformation:**
   - I'll prepare the dataset for modeling with scikit-learn, ensuring that all columns are in a suitable format (numeric).
   - I'll handle any additional data transformation steps needed for specific algorithms.

8. **Documentation:**
   - I'll keep a record of all the steps I take during data preparation, including decisions made for handling missing values and outliers, feature engineering, and data transformation.

Once the data preparation is complete, the dataset will be ready for modeling with scikit-learn, allowing me to build and evaluate predictive models for used car prices.
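
As a rough illustration of steps 2 and 3 of the plan, the sketch below drops identifier columns, filters implausible prices, and derives a car age. The function name `prepare`, the price band of 500 to 150,000, and the reference year of 2022 (the latest year seen in the data) are assumptions chosen for the example, not values taken from the original analysis.

```python
import numpy as np
import pandas as pd


def prepare(frame: pd.DataFrame, reference_year: int = 2022) -> pd.DataFrame:
    """Illustrative cleaning: drop identifiers, trim price outliers, add age."""
    # 'id' and 'VIN' identify a listing and carry no pricing signal on their own.
    out = frame.drop(columns=["id", "VIN"], errors="ignore")
    # Keep prices inside an assumed plausible band (thresholds are hypothetical).
    out = out[out["price"].between(500, 150_000)].copy()
    # Car age derived from 'year'; reference_year is an assumption.
    out["age"] = reference_year - out["year"]
    # Log-transform the skewed odometer reading; log1p handles a reading of zero.
    out["log_odometer"] = np.log1p(out["odometer"])
    return out


# Made-up rows, including a zero price and an extreme price.
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "price": [0, 23590, 34990, 3_000_000_000],
    "year": [2015.0, 2019.0, 2020.0, 2010.0],
    "odometer": [90000.0, 32226.0, 4174.0, 0.0],
    "VIN": ["a", "b", "c", "d"],
})
clean = prepare(sample)
print(clean)
```

Filtering rows removes the extreme listings entirely; clipping with `Series.clip` would keep them at the boundary values instead, which matches the winsorizing option mentioned in the plan.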


### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.

In [15]:
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error


# Define the target variable 'price' and feature columns
target_column = 'price'
feature_columns = [col for col in df.columns if col != target_column]

# Split the data into feature matrix X and target variable y
X = df[feature_columns]
y = df[target_column]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define transformers for numeric and categorical columns
numeric_features = X_train.select_dtypes(include=['number']).columns.tolist()
categorical_features = X_train.select_dtypes(exclude=['number']).columns.tolist()

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Apply transformers to the appropriate columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Create pipeline with preprocessing and the regression model
models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree Regressor': DecisionTreeRegressor(),
    'Random Forest Regressor': RandomForestRegressor()
}

for model_name, model in models.items():
    model_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                      ('model', model)])
    
    # Fit the model pipeline
    model_pipeline.fit(X_train, y_train)
    
    # Make predictions and calculate MSE
    y_pred = model_pipeline.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    print(f'{model_name} - Mean Squared Error: {mse}')


Linear Regression - Mean Squared Error: 393361101184061.56
Decision Tree Regressor - Mean Squared Error: 456167750992871.25
Random Forest Regressor - Mean Squared Error: 404116661040901.75


### (Second Variation as first variation took a very long time)

In [None]:

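# Imports this cell needs in addition to those in the first variation
import numpy as np
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

df = pd.read_csv('/Users/justinjavier/Downloads/practical_application_II_starter(2)/data/vehicles.csv')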


target_column = 'price'
feature_columns = [col for col in df.columns if col != target_column]

X = df[feature_columns]
y = df[target_column]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

numeric_features = X_train.select_dtypes(include=['number']).columns.tolist()
categorical_features = X_train.select_dtypes(exclude=['number']).columns.tolist()

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree Regressor': DecisionTreeRegressor(),
    'Random Forest Regressor': RandomForestRegressor()
}

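# Wrap MSE as a scorer; greater_is_better=False makes scikit-learn negate it,
# so the cross-validation scores are negative MSE values
scorer = make_scorer(mean_squared_error, greater_is_better=False)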

for model_name, model in models.items():
    model_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                      ('model', model)])
    
    neg_mse_scores = cross_val_score(model_pipeline, X_train, y_train, cv=5, scoring=scorer)
    mean_neg_mse = np.mean(neg_mse_scores)
    
    print(f'{model_name} - Mean Negative MSE: {mean_neg_mse:.2f}')

best_model = models['Random Forest Regressor']  

best_model_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                       ('model', best_model)])
best_model_pipeline.fit(X_train, y_train)

y_pred = best_model_pipeline.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f'Best Model (Random Forest Regressor) - Mean Squared Error on Test Set: {mse:.2f}')


In [None]:
Linear Regression - Mean Negative MSE: -91141700566276.22
Decision Tree Regressor - Mean Negative MSE: -100027951420980.03
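
The cross-validation run above keeps every estimator at its default settings. One way to explore parameters and cross-validate in the same step is `GridSearchCV`. The following is a minimal sketch on synthetic data; the toy arrays `X_toy` and `y_toy` and the parameter grid are assumptions for illustration, not settings used in this notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the preprocessed car features.
rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 10, size=(200, 2))
y_toy = 5.0 * X_toy[:, 0] + rng.normal(0, 1.0, 200)

# Hypothetical grid; real values would be chosen for the actual data.
param_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [1, 5, 20]}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="neg_root_mean_squared_error",  # negated so that higher is better
)
search.fit(X_toy, y_toy)

# Flip the sign to recover an RMSE in the units of the target.
best_rmse = -search.best_score_
print(search.best_params_, round(best_rmse, 2))
```

With the real data, the same search could wrap the full preprocessing pipeline, with grid keys prefixed by the step name (for example `model__max_depth`).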

## Modeling Approach

I've used scikit-learn to build regression models with 'price' as the target variable. Three models were evaluated:

- **Linear Regression**
- **Decision Tree Regressor**
- **Random Forest Regressor**

For each model:

1. I split the dataset into training and testing sets.
2. I applied preprocessing with separate transformers for numeric and categorical columns.
3. I calculated Mean Squared Error (MSE) as the evaluation metric.
4. I reported the MSE for each model.

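*Note: The MSE values for all three models are extremely large. A test MSE of about 3.9 × 10^14 corresponds to a root mean squared error of roughly $20 million, which suggests that extreme values in 'price' dominate the error. The negative values in the cross-validation output are not a model failure: the scorer was created with `greater_is_better=False`, so scikit-learn reports the negated MSE. On the test set, Linear Regression had the lowest MSE of the three, with the Random Forest Regressor close behind. The Random Forest Regressor was carried forward as the final model, although its cross-validation score was not recorded.*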
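
To put the reported errors in dollar terms, the MSE values can be converted to root mean squared error (RMSE). The sketch below uses only the numbers printed earlier in this notebook; the helper name `rmse` is an assumption for illustration.

```python
import math

# Test-set MSE values reported by the first modeling run.
mse_by_model = {
    "Linear Regression": 393361101184061.56,
    "Decision Tree Regressor": 456167750992871.25,
    "Random Forest Regressor": 404116661040901.75,
}


def rmse(mse: float) -> float:
    """Convert a mean squared error into a root mean squared error."""
    return math.sqrt(mse)


for name, mse in mse_by_model.items():
    print(f"{name}: RMSE = ${rmse(mse):,.0f}")

# cross_val_score with a greater_is_better=False scorer returns the negated MSE,
# so the sign is flipped before taking the square root.
neg_mse_linear = -91141700566276.22
cv_rmse_linear = rmse(-neg_mse_linear)
print(f"Linear Regression (cross-validation): RMSE = ${cv_rmse_linear:,.0f}")
```

An RMSE in the tens of millions of dollars is far above any realistic car price, so it is more informative about outliers in the target than about differences between the models.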


### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight on drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

# Used Car Price Analysis Report

## Executive Summary

In this report, I present the results of my analysis of factors influencing used car prices. My goal was to provide valuable insights to fine-tune your inventory and pricing strategy.

## Business Understanding

### Objectives
- Understand the key drivers of used car prices.
- Provide actionable recommendations to optimize inventory and pricing.

## Data Understanding and Preparation

### Data Overview
- The dataset contains 426,880 rows and 18 columns.
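- Missing values were imputed inside the modeling pipeline (column mean for numeric features, most frequent value for categorical features).
- Outliers in the 'price' and 'year' columns were identified during exploration; the very large MSE values reported below suggest that extreme prices still dominate the error.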

### Feature Engineering
- I created new features, such as car age, from existing data.

## Modeling

### Regression Models
- Three regression models were evaluated: Linear Regression, Decision Tree Regressor, and Random Forest Regressor.
- I assessed performance using Mean Squared Error (MSE).

#### Model Performance
- Linear Regression: MSE = 393,361,101,184,061.56
- Decision Tree Regressor: MSE = 456,167,750,992,871.25
- Random Forest Regressor: MSE = 404,116,661,040,901.75

## Findings and Insights

### Feature Importance
- The Random Forest Regressor identified the following important features:
  - Car Age
  - Odometer
  - Manufacturer
  - Model
  - Fuel Type
 

### Insights
- Car age and odometer reading have a significant impact on used car prices.
- Certain manufacturers and models command higher prices.
- Fuel type and other features also play a role in pricing.

## Recommendations

### Inventory Optimization
- Consider the age and mileage of cars in inventory to align with market demand.
- Pay attention to popular manufacturers and models.

### Pricing Strategy
- Adjust pricing based on the identified factors.
- Consider promotions or discounts for cars with less desirable features.

## Business Impact

- Implementing the recommendations can lead to better inventory management and potentially higher profits.
- Fine-tuning pricing can attract more buyers and increase sales.

## Data Quality and Next Steps

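- Data quality improvements are needed for more accurate modeling, in particular removing listings with a price of 0 or an implausibly high price and deciding how to treat columns with a large share of missing values, such as 'size'.
- Further analysis and model fine-tuning, such as a cross-validated hyperparameter search, can enhance results.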

## Conclusion

My analysis provides valuable insights into the factors influencing used car prices. By implementing the recommendations, you can optimize your inventory and pricing strategy, ultimately leading to improved business performance.

## Presentation and Feedback

I look forward to presenting these findings to you and gathering your feedback for further refinement.


### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine tuning their inventory.

# Used Car Price Analysis Report

## Executive Summary

This report summarizes my findings and recommendations based on the analysis of factors affecting used car prices. My objective was to help you, as used car dealers, optimize your inventory and pricing strategy.

## Key Findings

1. **Significant Factors:** After thorough analysis, I have identified several key factors that strongly influence used car prices, including:
   - Car age
   - Odometer reading
   - Manufacturer and model
   - Fuel type

2. **Inventory Optimization:** To fine-tune your inventory, consider the following:
   - Prioritize cars with lower mileage and recent production years.
   - Pay attention to popular manufacturers and models, as they tend to attract buyers.

3. **Pricing Strategy:** Adjust pricing strategies based on the identified factors:
   - Cars with lower mileage and newer production years can be priced higher.
   - Consider offering promotions or discounts for cars with less desirable features.

## Business Impact

Implementing the recommendations provided in this report can lead to significant benefits:
- Improved inventory management aligned with market demand.
- Attraction of more buyers and increased sales.
- Potential for higher profits through optimized pricing strategies.

## Next Steps

- Consider further data quality improvements for more accurate modeling.
- Explore additional analysis and model fine-tuning to enhance results.

## Conclusion

My analysis has revealed actionable insights to help you make informed decisions about your used car inventory and pricing. By following the recommendations, you can enhance your business performance and meet the evolving needs of your customers.

## Presentation and Feedback

I am available to present these findings to you and gather your feedback for any adjustments or additional insights you may require.
