Report

The report is to be a short executive summary of the project, a minimum of 500 words and maximum 1500 words (~1-3 pages, including figures). It should contain:

* **Your project problem statement**: What is the underlying question you are seeking to answer or problem you are addressing? Can you articulate how your project adds business/societal value? 

* **Background on the subject matter area**: Why is this a good problem / subject area to apply data science techniques? How has it been addressed in the past?

* **Details on dataset**: What is the source of the dataset and the original data itself (including data format, structure and schema, etc.)? How was it collected? By whom and why?

* **Summary of cleaning and preprocessing**: Summarize at a high level, the preprocessing, feature engineering and any other data cleaning/transformation, and exploratory data analysis (EDA) performed and the motivation and reasoning behind it. This should be at an appropriate level of detail for a non-technical audience and high-level visuals such as a flow chart, tables, etc. will be helpful here.

* **Insights, modeling, and results**: A summary of all the EDA and modelling completed including the process of model selection, evaluation, and results. As with summarizing the cleaning and preprocessing, high-level visuals can and should be employed to effect here as appropriate.

* **Findings and conclusions**: Based on all analysis and modeling, how do your results compare against your initial goals and hypotheses? How does the practical value of your project measure up to your initial expectations? Summarize the practical applications of the project as well as potential next steps and future directions.

## **CAPSTONE EXECUTIVE SUMMARY**

### **Abstract**

Predicting the price of a vehicle is one of the most important practices in the automotive industry. In such a competitive market, accurate price prediction is a key factor in a dealership's success, and it also lets consumers make an informed decision when purchasing a vehicle. Because the data is readily available and the problem is well defined, vehicle pricing is well suited to data science techniques.

### **Introduction**
The goal of this project is to predict the price of a vehicle based on the features of the vehicle.

### **Problem Statement**

The key task is to identify which features have a significant impact on the sale price of a vehicle and to build a predictive model that estimates the sale price from those features. Car dealers can use the resulting insights to price their vehicles competitively, and customers can use them to get a fair idea of the price of a vehicle they are interested in.


Selling a vehicle is a complex process, and many of the factors that influence a sale are not obvious. As a result, a vehicle may sell for less than it is worth, which costs the dealership revenue, or for more than it is worth, which costs the customer value. This project therefore focuses on which features have a significant impact on the sale price of a vehicle.

This project will attempt to predict the sale price of a vehicle from its features. Doing so helps the dealership understand what drives the sale price and maximize its revenue, and it helps the customer understand the value of the vehicle and make an informed purchase decision.

Some underlying questions that are addressed in this project are:

* What are the most important features that affect the price of a car?
* How can a consumer use this model to make informed decisions when purchasing a car?
* How can a car dealership use this model to help them sell more cars?

### **Proposed Solution**

The proposed solution is a predictive model that estimates the sale price of a vehicle from its features, together with an analysis of which features drive that price.

### **Value Proposition**

The value of this project is twofold. Car dealers gain an understanding of the factors that influence the sale price of a vehicle and can use it to maximize revenue. Customers gain an understanding of the value of the vehicle they are purchasing and can make an informed purchase decision.

### **Background**

Car dealerships are an important part of the automotive industry and are responsible for the sale of new and used vehicles. Many of the factors that influence a sale are not obvious, so a vehicle may be sold below what it is worth or at a price that is too high. Either outcome is costly: a loss of revenue for the dealership or a loss of value for the customer. Understanding the factors that influence the sale price is therefore essential to maximizing a dealership's revenue.
 

The automotive industry is a multi-billion dollar industry that is constantly evolving. As data becomes more readily available, consumers can research a vehicle online and compare it with other vehicles before buying. Consumers are therefore better informed and are looking for a better deal, which in turn has changed the way dealerships sell vehicles. To protect and grow their revenue, dealerships need to understand the factors that influence the sale of a vehicle.

So where do data scientists come into play? Data scientists can help dealerships understand the factors that influence the sale of a vehicle and so maximize revenue, and they can help consumers understand the value of the vehicle they are purchasing. The techniques used in this project demonstrate how such an analysis can be performed and how machine learning can be used to predict the sale price of a vehicle from its features.

### **Details on Dataset**

The data was obtained by web scraping <a href="https://www.cargurus.com/">https://www.cargurus.com/</a>. It is for academic, research, and individual experimentation only and is not intended for commercial purposes. The dataset contains 3 million new and used vehicle listings in the United States and can be downloaded from <a href="https://www.kaggle.com/datasets/ananaymital/us-used-cars-dataset">https://www.kaggle.com/datasets/ananaymital/us-used-cars-dataset</a>.

To get a better understanding of the dataset, a data dictionary was created. It can be downloaded or reviewed <a href="https://docs.google.com/spreadsheets/d/1g_GaTFI6kdzBK4nJBLnEndxdix-bZoId/edit?usp=sharing&ouid=114887506937266750782&rtpof=true&sd=true">here</a>.

### **Methodology**

The methodology used in this project is as follows:
 
* Data Cleaning and Preprocessing - The data was cleaned and preprocessed so that it was ready for analysis.
* Exploratory Data Analysis - The data was explored to identify trends and patterns.
* Feature Engineering - New features were created to improve model performance.
* Model Selection - Candidate models were compared and the best-performing one was selected.
* Model Evaluation - The selected model's performance was evaluated.
* Model Deployment - The model was deployed to predict the sale price of a vehicle from its features.
 
### **Data Cleaning and Preprocessing**

In preparation for analysis and modeling, the data was cleaned and preprocessed using standard techniques: removing duplicates, removing or imputing missing values, removing outliers, and removing unnecessary columns. The data was then transformed using feature engineering techniques specific to this dataset.

Duplicates were detected using each vehicle's VIN (vehicle identification number). Listings that shared the same VIN were considered duplicates and were removed from the dataset.
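The deduplication step can be sketched with pandas as follows. This is a minimal illustration rather than the project's actual code: the column names `vin` and `price`, the toy rows, and the choice to keep the first listing seen for each VIN are all assumptions.

```python
import pandas as pd

# Toy listings; the column names ("vin", "price") are assumed for illustration.
listings = pd.DataFrame({
    "vin": ["1A", "2B", "1A", "3C"],
    "price": [21000, 18500, 21000, 30250],
})

# Assumption: keep the first listing seen for each VIN and drop later repeats.
deduped = listings.drop_duplicates(subset="vin", keep="first").reset_index(drop=True)

print(len(listings), "->", len(deduped))
```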

Columns with 50% or more missing values were removed from the dataset, because that many missing values could not be reliably imputed and would have had a negative impact on the analysis and modeling. Missing values in the remaining columns were imputed using the mean, median, mode, or a constant value, depending on the type of data.
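A minimal sketch of this rule in pandas is shown below. The column names and values are hypothetical, and pairing the median with the numeric column and the mode with the categorical column is only one reasonable choice among the imputation options listed above.

```python
import numpy as np
import pandas as pd

# Hypothetical columns; "bed_length" is 75% missing, the others 25% missing.
df = pd.DataFrame({
    "mileage": [12000.0, np.nan, 30500.0, 8000.0],
    "body_type": ["SUV", "Sedan", None, "SUV"],
    "bed_length": [np.nan, np.nan, np.nan, 67.0],
})

# Drop every column in which at least half of the values are missing.
missing_share = df.isna().mean()
df = df.loc[:, missing_share < 0.5].copy()

# Impute what remains: median for the numeric column, mode for the categorical one.
df["mileage"] = df["mileage"].fillna(df["mileage"].median())
df["body_type"] = df["body_type"].fillna(df["body_type"].mode().iloc[0])

print(df)
```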

Outliers were detected and removed from the dataset. Outliers were detected using the interquartile range (IQR) method. This method calculates the IQR of the data and then calculates the upper and lower bounds of the data. Any data that falls outside of the upper and lower bounds is considered an outlier and is removed from the dataset.
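The IQR rule can be illustrated with a short pandas sketch. The prices are made up, and the 1.5 multiplier on the IQR is the conventional choice, assumed here because the report does not state which multiplier was used.

```python
import pandas as pd

# Made-up prices with one obviously extreme listing.
prices = pd.Series([18000, 21000, 22500, 24000, 26000, 29000, 250000], name="price")

# Interquartile range and the bounds derived from it.
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # 1.5 is an assumed multiplier

# Keep only the values that fall inside the bounds.
kept = prices[prices.between(lower, upper)]

print(f"bounds: [{lower}, {upper}], kept {len(kept)} of {len(prices)}")
```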

The [data dictionary](https://docs.google.com/spreadsheets/d/1g_GaTFI6kdzBK4nJBLnEndxdix-bZoId/edit?usp=sharing&ouid=114887506937266750782&rtpof=true&sd=true) was used to decide which columns were not relevant to the analysis, which required some domain knowledge of the automotive industry. Those columns were removed from the dataset.

Feature engineering techniques such as creating dummy variables, creating new features, and removing unnecessary features were performed. The data was then split into training and testing sets for analysis and modeling. The training set was used to train the model and the testing set was used to test the model.
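As a rough sketch of the dummy-variable and splitting steps, the snippet below one-hot encodes a categorical column and holds out part of the data for testing. The column names, the toy values, the 80/20 ratio, and the use of `DataFrame.sample` for the split are assumptions made for illustration; the project may have used a different ratio or a dedicated splitting utility.

```python
import pandas as pd

# Toy data; column names and values are hypothetical.
df = pd.DataFrame({
    "transmission": ["A", "M", "A", "A", "M", "A", "A", "M", "A", "A"],
    "mileage": [0, 42000, 15000, 0, 88000, 23000, 5000, 61000, 0, 37000],
    "price": [31000, 14500, 24000, 35500, 9800, 22000, 28500, 12000, 33000, 19500],
})

# Dummy variables: one indicator column per transmission category.
encoded = pd.get_dummies(df, columns=["transmission"])

# Assumed 80/20 split: sample the training rows, use the rest for testing.
train = encoded.sample(frac=0.8, random_state=42)
test = encoded.drop(train.index)

# Separate the features from the target (price).
X_train, y_train = train.drop(columns="price"), train["price"]
X_test, y_test = test.drop(columns="price"), test["price"]

print(X_train.shape, X_test.shape)
```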

Once the data was cleaned and preprocessed, exploratory data analysis (EDA) was performed to gain a better understanding of the data and to determine which features were important to the analysis. First, a correlation matrix was created to determine which features were highly correlated with the sale price of the vehicle, using the Pearson correlation coefficient. This coefficient measures the linear correlation between two variables and ranges from -1 to 1: a value of 1 indicates a perfect positive correlation, a value of -1 indicates a perfect negative correlation, and a value of 0 indicates no linear correlation. The coefficient was calculated for all features in the dataset, and the features with the highest correlation with the sale price were selected for further analysis.
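The correlation step might look like the following sketch, which ranks numeric features by the strength of their Pearson correlation with price. The column names and values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Invented numeric features; "price" plays the role of the target variable.
df = pd.DataFrame({
    "year": [2015, 2017, 2018, 2019, 2020, 2021],
    "mileage": [90000, 70000, 52000, 30000, 12000, 0],
    "horsepower": [150, 170, 180, 200, 250, 300],
    "price": [9000, 13000, 16500, 21000, 27000, 34000],
})

# Pearson correlation matrix over the numeric columns only.
corr = df.select_dtypes(include="number").corr(method="pearson")

# Correlation of each feature with price, strongest (by absolute value) first.
price_corr = corr["price"].drop("price").sort_values(key=abs, ascending=False)

print(price_corr)
```

Sorting by absolute value keeps strongly negative relationships, such as mileage against price in this toy data, near the top of the ranking alongside strongly positive ones.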

### **Insights, Modeling, and Results**
 
From what we have learned from our exploratory data analysis, we can answer our questions of interest. 
 
1. Why is the price of a vehicle different from other vehicles?
 
    Vehicles differ in price because they differ in their features: year, make, model, engine type, transmission type, mileage, speed, size, and whether the vehicle has had an incident. These features matter for predicting price because they are highly correlated with the target variable.

2. What features are important to predicting the price of a vehicle?

    The features that are important to predicting the price of a vehicle are the year, make, model, engine type, transmission type, mileage, speed, size, and whether the vehicle has had an incident. In general, the stronger a feature's correlation with the target variable, the more useful it is for predicting price.

3. What is the relationship between the price of a vehicle and other features?
 
    The relationship is positive: the more features a vehicle has, the more expensive it tends to be.


In our EDA, we have learned:

* The most common body types listed are SUVs/crossovers, sedans, and pickup trucks.
* The most common vehicle makes are Ford, Chevrolet, and Toyota.
* The average fuel economy is 19 mpg in the city and 20-30 mpg on the highway.
* The most common engine type is the I4, followed by V6s and V8s.
* The most common exterior colors are white, black, and silver.
* The most common interior colors are black, gray, and white.
* At least 11% of vehicles have been in an accident, damaged, salvaged, or stolen.
* The average fuel tank size is 26 gal, excluding electric vehicles and hybrids.
* Gasoline-powered vehicles represent 86% of the data.
* The average horsepower of a vehicle falls between 180 and 200 hp.
* The average torque is 264 lb-ft.
* The average vehicle has 4-6 major options installed.
* The most common seating capacity is 5 seats, followed by 7 seats and 6 seats.
* The majority of the listed vehicles are new, showing zero mileage.
* The most common transmission is automatic, followed by manual.
* The top wheel systems, in order, are FWD, AWD, 4WD, RWD, and 4X2.
* The average age of a vehicle is 2 years.
* The average price of a vehicle is $29,933.37.

#### **EDA Summary and Insights**

The EDA shows that a vehicle's features are important to predicting its price. The most relevant features are the year, make, model, engine type, transmission type, mileage, speed, size, and whether the vehicle has had an incident, all of which are highly correlated with the target variable. In general, the stronger the correlation, the more useful the feature is for predicting price.

---

### **Actionable Insights**

Some actionable insights that we can take from this data are:
 
* The most common body types listed are SUVs/crossovers, sedans, and pickup trucks, and the leading brands are Ford, Chevrolet, and Toyota. A dealership looking to sell vehicles quickly may want to invest in these types of vehicles.

* The size of the vehicle plays a key part in its price: the larger the vehicle, the more expensive it tends to be, likely because larger vehicles are more expensive to manufacture. A company should consider this when pricing or purchasing vehicles.

* Fuel economy and savings matter. It may be wise to purchase vehicles that emphasize these areas, for example smaller vehicles with smaller engines and good fuel economy.

* The average price of a vehicle is $29,933.37. Vehicles priced above this value may be harder to sell, so a seller should consider how their asking price compares with $29,933.37.

### **Model Selection**

Six different models were tested to determine which would best fit the data. XGBoost performed best at predicting the price of a vehicle.
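A comparison of this kind can be sketched as a loop over candidate models scored with cross-validation. Everything in the snippet is an assumption made for illustration: the report does not name the other five models or the evaluation metric, so the sketch uses three scikit-learn regressors, 5-fold cross-validated RMSE, and synthetic data, with `GradientBoostingRegressor` standing in for XGBoost. Which model wins on synthetic data says nothing about the project's actual result.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the cleaned vehicle features (X) and prices (y).
X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)

# Hypothetical candidates; gradient boosting stands in for XGBoost here.
candidates = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Score every candidate with 5-fold cross-validated RMSE (lower is better).
rmse = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
    rmse[name] = -scores.mean()

# Select the candidate with the lowest average error.
best = min(rmse, key=rmse.get)
print(rmse, "best:", best)
```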

## **Conclusion**

In recent years, a large number of vehicles have been sold online because it is more convenient for both parties: the buyer can shop for a vehicle from the comfort of their home, and the seller can avoid the hassle of selling in person. With so many listings available to compare, it is important to understand the factors that affect the price of a vehicle. To that end, we have created a model that predicts the price of a vehicle from its features. Sellers can use the model to determine the price of their vehicle, and buyers can use it to determine whether a listed price is fair.