# Data description report

- **Author:** Anthony Bäckström
- **Date:** 22.10.2023
- **Reviewer:**

## Data Collection

### Source
The dataset has been collected from an online platform, Kaggle, which is known for hosting a variety of datasets provided by users and organizations. This dataset is named "US used cars dataset" and is provided by a user named Ananaymital.

[US Used cars dataset](https://www.kaggle.com/datasets/ananaymital/us-used-cars-dataset/data)

### Content

This dataset contains a comprehensive list of used cars in the United States, captured in various fields such as '**vin**' (Vehicle Identification Number), '**price**', '**city**', '**mileage**', '**seller_rating**', and many others. These fields provide detailed specifications, condition, and history of the vehicles, dealer information, and sales listing details.



## Description of Data Content

The data was collected around 2020 and spans approximately one year, providing a snapshot of the used car market during that period.

### Vehicle Identification

'**vin**' stands for Vehicle Identification Number, a unique code used by the automotive industry to identify individual motor vehicles, towed vehicles, motorcycles, scooters, and mopeds. It is crucial for tracking registrations, warranty claims, thefts, and recalls.

### Physical Attributes

These include various measurements and categorical descriptions of the vehicle's physical characteristics, such as '**back_legroom**' (space for rear passengers), '**bed_height**' and '**bed_length**' (for trucks), '**body_type**' (sedan, SUV, etc.), '**cabin**' (cabin size category for pickup trucks), '**height**', '**length**', and '**width**'. These attributes can affect the vehicle's desirability depending on the customer's needs and preferences.

### Performance Specifications

Fields like '**engine_cylinders**', '**engine_displacement**', '**engine_type**', '**horsepower**', '**transmission**', and '**torque**' describe the vehicle's engine performance and capabilities. These can significantly influence a vehicle's market value, operational costs, and attractiveness to certain buyers.

### Sales and Dealer Information

This includes '**city**', '**dealer_zip**' (geographical location of the dealer), '**franchise_dealer**', '**franchise_make**' (the brand of car the dealer is franchised to sell), '**listing_id**', '**sp_id**' (seller's ID), '**sp_name**' (seller's name). These give context about the sale listing and can influence trust and preference among buyers.

### Condition and History

Fields like '**fleet**' (if a vehicle was part of a fleet), '**frame_damaged**', '**has_accidents**', '**salvage**' (if the car was deemed a total loss), '**vehicle_damage_category**', and '**owner_count**' are critical for potential buyers. They indicate the vehicle's history and condition, which can significantly affect the value and sellability.

### Listing Details

These include '**daysonmarket**' (how long the vehicle has been listed for sale), '**description**', '**listed_date**', '**listing_color**', '**main_picture_url**', '**price**', '**savings_amount**' (discount or deal offerings), and '**seller_rating**' (the rating of the seller, which may affect buyer trust). These details influence how quickly and at what price a vehicle is likely to sell.

### Location Data

'**latitude**' and '**longitude**' provide precise geographical coordinates of the vehicle's location. This data is crucial for analyzing geographical trends, sales area impacts, and regional preferences or variations in pricing.

### Fuel and Economy

This category includes '**city_fuel_economy**', '**combine_fuel_economy**', '**fuel_tank_volume**', '**fuel_type**', and '**highway_fuel_economy**'. These fields are especially important to environmentally conscious buyers or those looking for fuel-efficient vehicles.

### Others

Additional fields like '**wheel_system**', '**year**' (the model year of the car), '**is_certified**' (if the car is certified pre-owned), '**is_cpo**', '**is_new**', '**is_oemcpo**' (whether the vehicle is a certified pre-owned vehicle of the original equipment manufacturer) and more provide other miscellaneous information that buyers might consider.



## Full List of Categories and Their Description

**Descriptions from Kaggle**

- vin: Type String. Vehicle Identification Number is a unique encoded string for every vehicle. Read more at https://www.autocheck.com/vehiclehistory/vin-basics
- back_legroom: Type String. Legroom in the rear seat.
- bed: Type String. Category of bed size(open cargo area) in pickup truck. Null usually means the vehicle isn't a pickup truck
- bed_height: Type String. Height of bed in inches
- bed_length: Type String. Length of bed in inches
- body_type: Type String. Body Type of the vehicle. Like Convertible, Hatchback, Sedan, etc.
- cabin: Type String. Category of cabin size(open cargo area) in pickup truck. Eg: Crew Cab, Extended Cab, etc.
- city: Type String. city where the car is listed. Eg: Houston, San Antonio, etc.
- city_fuel_economy: Type Float. Fuel economy in city traffic in km per litre
- combine_fuel_economy: Type Float. Combined fuel economy is a weighted average of City and Highway fuel economy in km per litre
- daysonmarket: Type Integer. Days since the vehicle was first listed on the website.
- dealer_zip: Type Integer. Zipcode of the dealer
- description: Type String. Vehicle description on the vehicle's listing page
- engine_cylinders: Type String. The engine configuration. Eg: I4, V6, etc.
- engine_displacement: Type Float. engine_displacement is the measure of the cylinder volume swept by all of the pistons of a piston engine, excluding the combustion chambers.
- engine_type: Type String. The engine configuration. Eg: I4, V6, etc.
- exterior_color: Type String. Exterior color of the vehicle, usually a fancy one same as the brochure.
- fleet: Type Boolean. Whether the vehicle was previously part of a fleet.
- frame_damaged: Type Boolean. Whether the vehicle has a damaged frame.
- franchise_dealer: Type Boolean. Whether the dealer is a franchise dealer.
- franchise_make: Type String. The company that owns the franchise.
- front_legroom: Type String. The legroom in inches for the passenger seat
- fuel_tank_volume: Type String. Fuel tank's filling capacity in gallons
- fuel_type: Type String. Dominant type of fuel ingested by the vehicle.
- has_accidents: Type Boolean. Whether the vin has any accidents registered.
- height: Type String. Height of the vehicle in inches
- highway_fuel_economy: Type Float. Fuel economy in highway traffic in km per litre
- horsepower: Type Float. Horsepower is the power produced by an engine.
- interior_color: Type String. Interior color of the vehicle, usually a fancy one same as the brochure.
- isCab: Type Boolean. Whether the vehicle was previously taxi/cab.
- is_certified: Type Boolean. Whether the vehicle is certified. Certified cars are covered through warranty period
- is_cpo: Type Boolean. Pre-owned cars certified by the dealer. Certified vehicles come with a manufacturer warranty for free repairs for a certain time period. Read more at https://www.cartrade.com/blog/2015/auto-guides/pros-and-cons-of-buying-a-certified-pre-owned-car-1235.html
- is_new: Type Boolean. If True means the vehicle was launched less than 2 years ago.
- is_oemcpo: Type Boolean. Pre-owned cars certified by the manufacturer. Read more at https://www.cargurus.com/Cars/articles/know_the_difference_dealership_cpo_vs_manufacturer_cpo
- latitude: Type Float. Latitude from the geolocation of the dealership.
- length: Type String. Length of the vehicle in inches
- listed_date: Type String. The date the vehicle was listed on the website. Does not make daysonmarket obsolete. The price is the one observed daysonmarket days after the listed date.
- listing_color: Type String. Dominant color group from the exterior color.
- listing_id: Unique Type Integer. Listing id from the website
- longitude: Type Float. Longitude from the geolocation of the dealership.
- main_picture_url: Type String. Points the location of the image.

**Our own descriptions for those categories where it was missing.**

- major_options: Type List. Includes car accessories such as ['Adaptive Cruise Control', 'Backup Camera', 'Leather Seats']
- make_name: Type String. Car maker.
- maximum_seating: Type String. Number of seats in the car.
- mileage: Type Float.
- model_name: Type String. Model of the car.
- owner_count: Type Integer. Indicates the number of previous owners.
- power: Type String. Engine power output and the engine speed at which it is reached, e.g. 246 hp @ 5,500 RPM.
- price: Type Float. Price of the car.
- salvage: Type Boolean. Whether the car was deemed a total loss.
- savings_amount: Type Integer.
- seller_rating: Type Integer. Ranges from 1-5.
- sp_id: Type Integer. Car Dealership ID.
- sp_name: Type String. Car Dealership name.
- theft_title: Type Boolean. 
- torque: Type String. Rotational force produced by the engine and the engine speed at which it is reached, e.g. 200 lb-ft @ 1,750 RPM.
- transmission: Type String. Automatic, CVT or other.
- transmission_display: Type String.
- trimId: Type String.
- trim_name: Type String.
- wheel_system: Type String. 
- wheel_system_display: Type String. Explains the wheel_system column in words.
- wheelbase: Type String. 
- width: Type String. Car width in inches.
- year: Type Integer. Model year of the car.
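
Several of the fields above are stored as strings even though they carry numeric information (for example `height` in inches, or `power` as `246 hp @ 5,500 RPM`). The following is a minimal sketch of how such strings could be turned into numbers before analysis. The helper names `parse_measure` and `parse_power`, the unit suffix in `'35.1 in'`, and the `'--'` placeholder are assumptions made for illustration, not something documented for this dataset.

```python
import re
from typing import Optional, Tuple

# A number with optional thousands separators and decimals, e.g. "5,500" or "35.1".
_NUMBER = r"(\d[\d,]*(?:\.\d+)?)"


def parse_measure(text: Optional[str]) -> Optional[float]:
    """Return the first number found in a string such as '35.1 in' (assumed format)."""
    if not text:
        return None
    match = re.search(_NUMBER, text)
    if match is None:  # e.g. a placeholder such as '--' (assumed)
        return None
    return float(match.group(1).replace(",", ""))


def parse_power(text: Optional[str]) -> Optional[Tuple[float, float]]:
    """Split '246 hp @ 5,500 RPM' into (value, rpm); also fits torque strings."""
    if not text:
        return None
    match = re.search(_NUMBER + r"\s*\S+\s*@\s*" + _NUMBER, text)
    if match is None:
        return None
    value, rpm = (float(group.replace(",", "")) for group in match.groups())
    return value, rpm


print(parse_measure("35.1 in"))
print(parse_power("246 hp @ 5,500 RPM"))
print(parse_power("200 lb-ft @ 1,750 RPM"))
```

Returning `None` for unparseable values keeps them visible as missing data instead of silently turning them into zeros.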

## Data Exploration

In this phase, we dive into the dataset to uncover underlying patterns, correlations, and insights that can inform the car dealership's strategic decisions. 

### a. Understanding Distributions through Histograms

- Price: We begin by plotting a histogram of the '**price**' column to understand the distribution of car prices in our dataset. This visualization helps us identify the range in which most car prices fall, highlighting the most common price brackets. Outliers, or exceptionally high or low values, can also be observed, which might indicate luxury vehicles, classic cars, or cars with potential issues.

- Daysonmarket (Sales Time): Next, a histogram of the '**daysonmarket**' column shows us how long cars typically stay on the market before they are sold. A higher frequency of low days-on-market could indicate a strong demand, while a right-skewed graph, where cars remain unsold for many days, might suggest issues like overpricing or a lack of interest in certain vehicle types.

- Year (Vehicle Age): Lastly, a histogram of the '**year**' column will reveal the age distribution of vehicles in the dataset. This plot helps us understand the market's skewness towards newer or older models and identify which vehicle ages are most common in our inventory.
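
As a rough illustration of the binning behind such histograms, the sketch below counts values per fixed-width bin using only the standard library. The `histogram` helper, the bin width of 5000, and the sample prices are hypothetical; a plotting library would normally be used for the actual figures.

```python
from collections import Counter


def histogram(values, bin_width):
    """Count values per bin; each bin is labelled by its lower edge."""
    counts = Counter()
    for value in values:
        if value is None:  # skip missing entries
            continue
        counts[int(value // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))


# Hypothetical prices, including one missing value and one high outlier.
prices = [4500.0, 12990.0, 13500.0, 18250.0, 19999.0, 21000.0, 89000.0, None]
price_bins = histogram(prices, bin_width=5000)

# Simple text rendering of the histogram.
for lower, count in price_bins.items():
    print(f"{lower:>6}-{lower + 5000:<6} {'#' * count}")
```

The isolated bin far to the right is how an outlier such as a luxury vehicle would show up in the price distribution.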

### b. Visualizing Relationships with Scatter Plots and Heatmaps

- We use scatter plots to visually explore potential relationships between the price and various influential factors like '**mileage**', '**horsepower**', '**year**' (age of the vehicle), '**make_name**' (brand), and '**body_type**'. For instance, plotting '**mileage**' against '**price**' might show a negative correlation, indicating that cars with higher mileage may sell for less. Heatmaps can further assist in identifying correlations across multiple variables, helping to spot patterns that aren't as obvious in scatter plots.
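
The correlation that a scatter plot or heatmap visualizes can be quantified with the Pearson coefficient. Below is a minimal standard-library sketch; the `pearson` helper and the mileage and price values are made up for illustration and are not results from the dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences, skipping missing pairs."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / sqrt(var_x * var_y)


# Hypothetical sample: price falls as mileage rises.
mileage = [10000, 30000, 60000, 90000, 120000]
price = [30500, 25000, 21000, 13500, 9000]
r = pearson(mileage, price)
print(f"mileage vs price: r = {r:.3f}")
```

A heatmap is essentially this coefficient computed for every pair of numeric columns and shown as a colored grid.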

### c. Impact of Previous Taxi Use on Price

- To assess whether a car's history as a taxi impacts its price or sales time, we will filter entries where the '**isCab**' column is true, supplemented by entries where the '**fleet**' column is true or where the '**description**' mentions taxi or fleet use. Comparing the average price and days-on-market for these cars against the general pool will indicate if a history of taxi use affects a vehicle's marketability or value.
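
The group comparison described here can be sketched as follows. This is an illustration on hypothetical rows, not the report's actual analysis: `compare_by_flag` and the sample values are made up, and the flag column is a parameter so the same helper works for `isCab` or `fleet`.

```python
from statistics import mean, median


def compare_by_flag(rows, flag, metrics=("price", "daysonmarket")):
    """Summarize each metric separately for rows where `flag` is True and False."""
    groups = {True: [], False: []}
    for row in rows:
        if row.get(flag) is None:  # unknown history: leave out of both groups
            continue
        groups[bool(row[flag])].append(row)

    summary = {}
    for value, members in groups.items():
        summary[value] = {"count": len(members)}
        for metric in metrics:
            observed = [m[metric] for m in members if m.get(metric) is not None]
            summary[value][metric] = (
                {"mean": mean(observed), "median": median(observed)}
                if observed
                else None
            )
    return summary


# Hypothetical listings.
rows = [
    {"isCab": True, "price": 9000.0, "daysonmarket": 80},
    {"isCab": True, "price": 11000.0, "daysonmarket": 60},
    {"isCab": False, "price": 15000.0, "daysonmarket": 30},
    {"isCab": False, "price": 21000.0, "daysonmarket": 50},
    {"isCab": None, "price": 18000.0, "daysonmarket": 20},
]
summary = compare_by_flag(rows, flag="isCab")
print(summary[True])
print(summary[False])
```

Reporting the median next to the mean is a deliberate choice, since a few extreme prices can distort the mean of a small group.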

### d. Geographical Analysis

- For a geographical perspective, we plot cars' locations on a map using their '**latitude**' and '**longitude**' values. This visualization allows us to see geographical clusters of vehicle listings and sales, providing insights into market density and potential regional preferences. Additionally, aggregating data by '**city**' or '**dealer_zip**' will let us analyze trends on a more macro level, such as identifying cities or regions with higher sales, prices, or quicker turnover times.
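
Aggregation by sales area can be sketched with a plain group-by. The `summarize_by` helper and the listings below are hypothetical, and the sketch assumes rows with missing `price` or `daysonmarket` have already been removed.

```python
from collections import defaultdict
from statistics import median


def summarize_by(rows, key):
    """Group rows by `key` and report listing count and median price / days on market."""
    grouped = defaultdict(list)
    for row in rows:
        if row.get(key) is not None:
            grouped[row[key]].append(row)
    return {
        group: {
            "listings": len(members),
            "median_price": median(m["price"] for m in members),
            "median_daysonmarket": median(m["daysonmarket"] for m in members),
        }
        for group, members in grouped.items()
    }


# Hypothetical listings in two cities.
rows = [
    {"city": "Houston", "price": 10000.0, "daysonmarket": 20},
    {"city": "Houston", "price": 14000.0, "daysonmarket": 35},
    {"city": "Houston", "price": 20000.0, "daysonmarket": 50},
    {"city": "San Antonio", "price": 12000.0, "daysonmarket": 60},
    {"city": "San Antonio", "price": 16000.0, "daysonmarket": 90},
]
by_city = summarize_by(rows, key="city")
for city, stats in by_city.items():
    print(city, stats)
```

Passing `key="dealer_zip"` instead would give the same summary at a finer geographical level.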

Through this explorative analysis, we aim to uncover trends, correlations, and insights that can guide the dealership in strategic decision-making, from pricing strategies to inventory selection and beyond.

## Data Quality Verification

In the process of analyzing the dataset "used_cars_data.csv," a crucial step is the verification of data quality.

### Missing Values

- An initial scan of the dataset was conducted to identify any missing entries across all fields. This is particularly pertinent for data fields that are essential for analysis, such as '**VIN**', '**price**', '**mileage**', and '**daysonmarket**'.
- Special attention was given to free-form text fields like '**description,**' where missing data or vague, inconsistent input is a common occurrence due to their unstructured nature. The presence of numerous missing or nonsensical entries in these fields could limit their utility for textual analysis or natural language processing tasks.
- Three columns are completely empty and can be dropped as soon as the data is loaded: **combine_fuel_economy**, **is_certified**, **vehicle_damage_category**.
- Two columns contain very little data (values are present in less than 6 % of rows), so dropping them entirely should be considered: **is_cpo**, **is_oemcpo**.
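
A check of this kind can be sketched by computing the share of missing values per column. The `missing_share` helper and the sample rows are hypothetical; the 0.94 threshold mirrors the "less than 6 %" figure above, and empty cells are assumed to arrive as empty strings, as they would from `csv.DictReader`.

```python
def missing_share(rows, columns):
    """Fraction of rows in which each column is None or an empty string."""
    total = len(rows)
    if total == 0:
        return {}
    return {
        column: sum(1 for row in rows if row.get(column) in (None, "")) / total
        for column in columns
    }


# Hypothetical rows: one column always filled, one always empty, one nearly empty.
rows = [
    {
        "vin": f"SAMPLE{i:03d}",
        "combine_fuel_economy": "",
        "is_cpo": "True" if i == 0 else "",
    }
    for i in range(20)
]
shares = missing_share(rows, ["vin", "combine_fuel_economy", "is_cpo"])

empty_columns = [c for c, share in shares.items() if share == 1.0]
sparse_columns = [c for c, share in shares.items() if 0.94 < share < 1.0]
print("drop:", empty_columns)
print("review:", sparse_columns)
```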

### Duplicate Entries

- We executed a search for duplicate records, which are defined as entries having identical information across all fields or, more specifically, entries with the same '**VIN**'. Such duplicates can skew analysis results, particularly if the dataset is used to model market-level dynamics or inventory turnover.
- Any duplicates identified were noted for removal or further investigation, as they might indicate data entry errors, multiple listings of the same vehicle, or other inconsistencies.
- **wheel_system** carries the same information as **wheel_system_display**: wheel_system holds an abbreviation, e.g. 'AWD', and wheel_system_display spells out the same value in words, e.g. 'all-wheel drive'.
- **listing_color** and **exterior_color** overlap: according to the field descriptions, listing_color is the dominant color group derived from exterior_color.
- **transmission** and **transmission_display** contain the same information.
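
Both kinds of duplication mentioned above, repeated records and redundant columns, can be checked with short helpers. The sketch below uses made-up rows; `duplicated_keys` and `is_one_to_one` are hypothetical names, and the VIN values are placeholders rather than real identifiers.

```python
from collections import Counter


def duplicated_keys(rows, key="vin"):
    """Return {value: count} for every key value that occurs more than once."""
    counts = Counter(row[key] for row in rows if row.get(key))
    return {value: n for value, n in counts.items() if n > 1}


def is_one_to_one(rows, col_a, col_b):
    """True if each value of col_a always pairs with the same col_b value and vice versa."""
    forward, backward = {}, {}
    for row in rows:
        a, b = row.get(col_a), row.get(col_b)
        if a is None or b is None:
            continue
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True


# Hypothetical rows with one repeated VIN.
rows = [
    {"vin": "A1", "wheel_system": "AWD", "wheel_system_display": "All-Wheel Drive"},
    {"vin": "A2", "wheel_system": "FWD", "wheel_system_display": "Front-Wheel Drive"},
    {"vin": "A1", "wheel_system": "AWD", "wheel_system_display": "All-Wheel Drive"},
]
print(duplicated_keys(rows))
print(is_one_to_one(rows, "wheel_system", "wheel_system_display"))
```

If two columns map one-to-one, one of them can be dropped without losing information.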


### Inconsistent Entries

- The dataset was examined for inconsistencies, such as cases where '**mileage**' might be exceedingly high or low relative to the '**year**' of the vehicle, or where the '**price**' is outside a logical range given the vehicle’s make, model, and condition.
- Additionally, categorical fields like '**body_type**', '**fuel_type**', and '**transmission**' were checked for inconsistent categorizations, misspellings, or illogical entries that may indicate data quality issues or require harmonization into consistent categories.
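
One simple way to surface inconsistent categorizations is to normalize case and whitespace and list labels that collapse to the same value. The sketch below is an assumption about how such a check could look; `label_variants` and the sample labels are hypothetical, and it does not catch genuine misspellings.

```python
from collections import defaultdict


def label_variants(values):
    """Group raw labels by a normalized form; return only forms with several spellings."""
    variants = defaultdict(set)
    for value in values:
        if value:
            normalized = " ".join(value.split()).casefold()
            variants[normalized].add(value)
    return {key: sorted(raw) for key, raw in variants.items() if len(raw) > 1}


# Hypothetical body_type values with inconsistent case and spacing.
body_types = ["Sedan", "sedan ", "SUV / Crossover", "Sedan", "Pickup Truck", "pickup  truck"]
inconsistent = label_variants(body_types)
for normalized, raw in inconsistent.items():
    print(normalized, "->", raw)
```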

### Data Integrity

- Key fields such as '**VIN**' were validated for format consistency, ensuring they adhere to the standardized structure of vehicle identification numbers. Any anomalies detected in '**VIN**' could suggest data corruption, misentry, or counterfeit records and thus were subjected to further scrutiny.
- Fields like '**price**', '**mileage**', and '**daysonmarket**' underwent integrity checks to confirm that the values fall within plausible ranges and conform to expected data types (e.g., numerical). Extreme values or outliers were flagged for review, as they might represent data errors or special cases needing separate consideration.
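
A format check for VINs can be sketched with a regular expression. Modern VINs are 17 characters long and use digits and capital letters except I, O, and Q. The helper name `is_plausible_vin` is hypothetical, the sample VIN is only an example string, and the check digit is deliberately not validated here.

```python
import re

# 17 characters, digits and capital letters without I, O and Q.
VIN_PATTERN = re.compile(r"[A-HJ-NPR-Z0-9]{17}")


def is_plausible_vin(vin):
    """Check length and alphabet only; the check digit is not verified."""
    return isinstance(vin, str) and VIN_PATTERN.fullmatch(vin.strip().upper()) is not None


for candidate in ["1HGCM82633A004352", "1HGCM82633A00435", "1HGCM82633AO04352", None]:
    print(candidate, is_plausible_vin(candidate))
```

Values that fail this check are candidates for the further scrutiny mentioned above; passing it does not prove that a VIN is genuine.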


## Data Selection

In this section of our analysis, we concentrate on filtering and focusing on the most relevant data points within the "used_cars_data.csv" dataset that will enable us to answer the client's specific queries effectively. The data columns we select are integral in understanding the dynamics of used car sales, particularly focusing on factors like how quickly cars sell, their pricing, and how their history or specifications impact these factors.

### a. Impact of Sales Area on Vehicle's Age, Type, and Price

- To analyze the impact of the sales area, we primarily focus on '**city**', '**latitude**', and '**longitude**' to define the geographical context of each vehicle sale. This regional information will be cross-analyzed with '**price**', '**year**' (to determine vehicle's age), and '**body_type**' to uncover any location-based trends or preferences.
- '**dealer_zip**' could also provide an additional layer of granularity for sales area analysis, especially if there are multiple dealerships within a city.

### b. Cars That Sell the Worst

- The '**daysonmarket**' field is pivotal here; vehicles with prolonged periods on the market are evidently harder to sell. We'll identify the cars that remain unsold longer than the average for this dataset.
- '**seller_rating**' will also be utilized. A lower rating might correlate with slow sales, possibly due to factors like customer service, trustworthiness, or the quality of vehicles sold by the dealer.
- '**make_name**' and '**model_name**' are essential to determine if specific brands or models are consistently harder to sell.
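
The selection of slow sellers can be sketched as follows: compute the dataset mean of `daysonmarket`, keep listings above it, and count them per make and model. The `slow_sellers` helper and the placeholder makes and models are hypothetical and do not reflect findings from the dataset.

```python
from collections import Counter
from statistics import mean


def slow_sellers(rows):
    """Return the mean days on market and (make, model) counts for listings above it."""
    threshold = mean(row["daysonmarket"] for row in rows)
    slow = Counter(
        (row["make_name"], row["model_name"])
        for row in rows
        if row["daysonmarket"] > threshold
    )
    return threshold, slow.most_common()


# Hypothetical listings with placeholder makes and models.
rows = [
    {"make_name": "MakeA", "model_name": "ModelX", "daysonmarket": 150},
    {"make_name": "MakeA", "model_name": "ModelX", "daysonmarket": 200},
    {"make_name": "MakeB", "model_name": "ModelY", "daysonmarket": 10},
    {"make_name": "MakeB", "model_name": "ModelY", "daysonmarket": 20},
    {"make_name": "MakeC", "model_name": "ModelZ", "daysonmarket": 120},
]
threshold, ranking = slow_sellers(rows)
print("mean daysonmarket:", threshold)
for (make, model), count in ranking:
    print(make, model, count)
```

On real data it may be worth normalizing the counts by the total number of listings per model, so that common models are not flagged merely because there are many of them.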

### c. Effect on Price for Vehicles Used as Taxis

- We'll use '**isCab**' (whether the vehicle was previously a taxi/cab) as the primary indicator, together with '**fleet**' (common for taxi or rental use) and '**has_accidents**' (as taxis might be prone to more wear and tear and potential accidents). Cross-referencing these with '**price**' will help determine if being a former taxi impacts a vehicle's market value.

### d. Identifying Dominant Features for Sales

- We'll conduct a comprehensive analysis, considering various features like '**price**', '**daysonmarket**', '**make_name**', '**model_name**', '**year**', '**body_type**', '**engine_cylinders**', '**horsepower**', '**city_fuel_economy**', '**highway_fuel_economy**', and condition indicators '**has_accidents**', '**fleet**', '**frame_damaged**'.
- Advanced statistical methods or machine learning algorithms (like feature importance from tree-based models) can be applied to quantitatively identify the top 10 features most strongly correlated with quick sales or high prices.

**Categories to be excluded**

- vin, listing_id: ID numbers are not selling points of the car.
- listed_date: the specific date is not needed; daysonmarket already captures how long the car has been on the market.
- city, latitude, longitude: in this task we focus on all cars regardless of location, so location fields are not needed.
- dealer_zip, franchise_dealer, franchise_make, seller_rating, sp_id, sp_name: we study the characteristics of the car; the influence of the seller is not considered here.
- wheel_system_display: carries the same information as the wheel_system category.

**Categories to change**

- From a sales point of view, it is worth examining whether the presence of a picture of the car matters. We therefore convert the values of the **main_picture_url** category to boolean True or False, depending on whether a picture of the car exists.
- The same is done for the **description** category: if the column contains a description the value is True, otherwise False. In this way, we can examine whether the presence of a more detailed description generally has an effect on sales.
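
The conversion of these two categories to booleans can be sketched as below. The `add_presence_flags` helper, the example URL, and the sample row are hypothetical; whitespace-only text is treated as missing, which is an assumption rather than a rule stated in this report.

```python
def add_presence_flags(row, columns=("main_picture_url", "description")):
    """Return a copy of the row where each listed column is True if it holds any text."""
    flagged = dict(row)
    for column in columns:
        value = row.get(column)
        flagged[column] = bool(value and str(value).strip())
    return flagged


# Hypothetical listing with a picture but an empty description.
row = {
    "price": 15000.0,
    "main_picture_url": "https://example.com/car.jpg",
    "description": "   ",
}
flagged = add_presence_flags(row)
print(flagged)
```

Working on a copy leaves the original text available in case the description is later needed for keyword searches.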

By narrowing our focus to these specific data columns, we aim to provide concise, relevant insights that directly address the client's inquiries. This targeted approach ensures efficiency in our analysis process and clarity in the consequent findings and recommendations.

## Summary of required data collection

### 1. What is the impact of the sales area on the vehicle's age, type, and price?
- #### Focus on:
    - '**city**' or '**dealer_zip**': To identify the sales area.
    - '**year**': To calculate the vehicle's age.
    - '**body_type**': To categorize the type of vehicle.
    - '**price**': To analyze the impact on pricing.
    - '**latitude**' and '**longitude**': For more detailed geographical analysis, if needed.

### 2. What kinds of cars sell the worst?
- #### Focus on:
    - '**daysonmarket**': To identify how long vehicles take to sell.
    - '**seller_rating**': Lower ratings might correlate with slower sales.
    - '**make_name**', '**model_name**', '**year**': To identify the specific cars.
    - '**price**': To check if pricing is a factor.

### 3. Has there been an effect on the car's price if it has been previously used as a taxi?
- #### Focus on:
    - '**isCab**': Indicates whether the car was previously used as a taxi/cab.
    - '**fleet**': This might indicate if the car was used commercially, possibly as a taxi.
    - '**description**': To check for mentions of the vehicle being used as a taxi.
    - '**price**': To analyze the pricing difference.
    - '**make_name**', '**model_name**': In case specific makes or models are more commonly used as taxis.

### 4. We want to choose the most dominant (key) features for sales (10 pcs).
- #### Focus on: (Since this requires a more holistic approach, fewer fields should be omitted initially.)
    - '**price**', '**daysonmarket**', '**seller_rating**', '**make_name**', '**model_name**', '**year**', '**body_type**', '**has_accidents**', '**engine_type**', '**horsepower**', '**mileage**', '**city_fuel_economy**', '**highway_fuel_economy**', '**owner_count**', and other features that seem intuitively important for a vehicle's sales appeal.


## Additional Data Considerations

Given the intricacies of the dataset at hand, coupled with the client's distinct inquiries, there's a clear indication that supplementary external data, particularly economic or automotive industry data relevant to the Finnish market, could provide profound contextual insights. This approach is particularly useful for drawing comparisons or understanding broader market trends that affect the used car industry.

Integrating these external data sources can be effectively performed through several fields. The '**VIN**' stands out as a particularly reliable identifier for this purpose, provided the external data maintains vehicle-specific records. In the absence of '**VIN**', alternative fields such as '**make_name**', '**model_name**', and '**year**' can serve as robust linking points.

### Initial Observations for Client's Questions

#### Impact of Sales Area on Vehicle's Age, Type, and Price
- To dissect the influence of sales area, the data should be segmented based on '**city**' or '**dealer_zip**', allowing for a detailed regional analysis. Key focus points would include the average and median '**price**', the '**year**' of the vehicles (to determine age), and the frequency of different '**body_type**' within each region. Additionally, leveraging the '**latitude**' and '**longitude**' data to create geographical visualizations can illuminate regional disparities or trends.

#### Cars That Sell the Worst
- Identifying vehicles that linger on the market involves pinpointing those with high mean '**daysonmarket**' values, potentially compounded by low '**seller_rating**'. An examination of the '**make_name**', '**model_name**', and '**year**' for this subset of vehicles is crucial to discern any prevalent patterns or specific models consistently underperforming in sales.

#### Effect on Car's Price if Previously Used as a Taxi
- Vehicles previously employed as taxis may be flagged under '**isCab**' or '**fleet**', or through particular annotations in the '**description**'. A comparative study focusing on '**price**' and '**daysonmarket**' for this cohort versus the broader dataset is essential to observe any notable variances indicative of a "**taxi effect**".

#### Dominant Features for Sales
- Pinpointing the features that most heavily sway sales requires a more nuanced analysis. While correlation studies between '**price**' or '**daysonmarket**' and various features offer preliminary insights, the deployment of advanced machine learning techniques is recommended for more accurate results. Utilizing models like Random Forest or Gradient Boosting for feature selection can help in distilling the top 10 features, gauged by their calculated importance scores.

### Next Steps in Analysis

As we move into the subsequent phases of the CRISP-DM methodology, the dataset will be preprocessed and cleaned to make it suitable for the detailed analysis that follows. That analysis will combine statistical evaluation with machine learning algorithms chosen to answer the client's questions.
