# Data Preparation Report

- Author: Iina Pirinen
- Date: 2.11.2023

## Introduction

The purpose of this report is to describe our work in the Data Preparation phase. The tasks we wanted to achieve in this phase were:

- Data cleaning and editing.
- Data integration and formatting.
- Handling missing values and outliers.
- Feature selection and engineering.
- Data scaling and formatting.
- Visualizing data after preprocessing.

Our team chose four different tasks for data processing, which are:

- Impact of Sales Area on Vehicle's Age, Type, and Price
- Cars That Sell the Worst
- Effect on Price for Vehicles Used as Taxis
- Identifying Dominant Features for Sales

Each member of the team carried out their own task. After this stage, the team members have a good knowledge of the data and its features. In the next step, we start building a machine learning model based on the material we have pre-processed. There may still be changes to these first stages.

## Data Selection

- The data was obtained from the Kaggle website. This material serves as the main data source.

- To view the data by sales area, we needed zip code information, which was obtained using the [zipcodes 1.2.0 library](https://pypi.org/project/zipcodes/).
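
As a rough illustration of how a zip code can be turned into a sales area, the sketch below maps a zip code to a state and then to a region. It assumes the four US Census Bureau regions (Northeast, Midwest, South, West) and the `zipcodes.matching` lookup; the names `REGIONS`, `zip_to_state` and `zip_to_region` are hypothetical and are not taken from the project notebooks.

```python
from typing import Callable, Optional

# State abbreviations grouped by US Census Bureau region (assumed grouping).
REGIONS = {
    "Northeast": "CT ME MA NH RI VT NJ NY PA",
    "Midwest": "IL IN MI OH WI IA KS MN MO NE ND SD",
    "South": "DE FL GA MD NC SC VA DC WV AL KY MS TN AR LA OK TX",
    "West": "AZ CO ID MT NV NM UT WY AK CA HI OR WA",
}
STATE_TO_REGION = {
    state: region for region, states in REGIONS.items() for state in states.split()
}


def zip_to_state(zip_code: str) -> Optional[str]:
    """Return the state abbreviation for a 5-digit zip code, or None."""
    import zipcodes  # third-party package, imported lazily

    matches = zipcodes.matching(zip_code)
    return matches[0]["state"] if matches else None


def zip_to_region(
    zip_code: str, lookup: Callable[[str], Optional[str]] = zip_to_state
) -> Optional[str]:
    """Map a zip code to a region; unknown states or zip codes give None."""
    return STATE_TO_REGION.get(lookup(zip_code))


# With the zipcodes package installed, a column of zip codes could be mapped
# with something like: df["region"] = df["dealer_zip"].map(zip_to_region)
```

The `lookup` parameter only exists so that the mapping can be tried out without the third-party package installed.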

## Data Cleaning

Each member of the group cleaned the used cars data to suit their own needs, because each task required different types of data. Here are links to each team member's task:

- [Impact of sales area](./Impact_of_Sales_Area.ipynb)
- [Cars That Sell the Worst](./worst_cars.ipynb)
- [Effect on Price for Vehicles Used as Taxis](./taxi.ipynb)
- [Identifying Dominant Features for Sales preprocessing file](./dominant-feats/dominant-features-cleaning.ipynb)
- [Identifying Dominant Features for Sales](./dominant-feats/dominant-features.ipynb)

The team members decided together that everyone would process the data for their own task. This way, everyone gained good experience in processing a large dataset at this stage.

### Delete duplicate rows

- Duplicate rows count: 40
- Share of duplicate rows: about 0.0013% of all rows

Rows that appeared to be duplicates were removed. Duplicates were identified by **vin**: rows that shared a **vin** with another row were treated as duplicates and removed from the data.
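
A minimal sketch of this step with pandas is shown below. The toy data is made up, and keeping the first occurrence of each **vin** is an assumption; the notebooks may handle this differently.

```python
import pandas as pd

# Toy frame standing in for the used-cars data; only `vin` is named in the report.
df = pd.DataFrame({
    "vin": ["A1", "B2", "A1", "C3"],
    "price": [10000, 15000, 10000, 8000],
})

# Flag every row whose VIN has already appeared earlier in the frame.
is_dup = df.duplicated(subset="vin", keep="first")
dup_count = int(is_dup.sum())
dup_pct = 100 * dup_count / len(df)

# Keep the first occurrence of each VIN (assumption) and renumber the rows.
deduped = df.drop_duplicates(subset="vin", keep="first").reset_index(drop=True)

print(f"Duplicate rows: {dup_count} ({dup_pct:.4f}%)")
```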
    
### Delete columns with a high number of missing values

The columns **bed** (99.3% missing), **bed_height** (85.7%), **bed_length** (85.7%), **cabin** (97.9%) and **owner_count** (50.6%) were removed because it would be difficult to define substitutes for the missing values in these columns.
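
The idea can be sketched as follows. The 50% cut-off and the helper name `drop_sparse_columns` are assumptions chosen to match the percentages listed above; the report does not state a fixed threshold.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, max_missing: float = 0.5):
    """Drop columns whose share of missing values exceeds max_missing."""
    missing_share = df.isna().mean()  # fraction of NaN values per column
    to_drop = missing_share[missing_share > max_missing].index.tolist()
    return df.drop(columns=to_drop), missing_share


# Toy data: `bed` and `owner_count` are mostly empty, `width` is not.
df = pd.DataFrame({
    "price": [1.0, 2.0, 3.0, 4.0],
    "bed": [np.nan, np.nan, np.nan, "Short"],
    "owner_count": [1.0, np.nan, np.nan, np.nan],
    "width": [70.0, np.nan, 72.0, 71.0],
})
cleaned, share = drop_sparse_columns(df, max_missing=0.5)
print(share.sort_values(ascending=False))
```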
    
### Column selection
    
The columns **combine_fuel_economy**, **is_certified**, **vehicle_damage_category**, **is_cpo** and **is_oemcpo** are not included, because they contain very little information or are completely empty.
    
![](./img/nulls.png)
    
Columns that contain the same information as other columns were dropped. These columns are: **listing_color**, **wheel_system_display**, **trim_name**, **listed_date**, **sp_name**, **franchise_dealer**, **franchise_make**.

ID columns are not needed, so they were dropped: **listing_id**, **trimId**, **sp_id**, **vin**.
    
The following columns were dropped because we do not need seller or location data in the machine learning phase: **city**, **latitude**, **longitude**, **dealer_zip**, **seller_rating**.

### Combining columns

**city_fuel_economy** and **highway_fuel_economy** were combined into a new column named **combined_fuel_economy**, which holds the average of the car's city and highway fuel economy.
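
As a small illustration with made-up values, the average can be computed row by row. How rows with only one of the two values should be handled is an assumption here: `mean(axis=1)` skips missing values by default, so such rows keep the one value that is present.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city_fuel_economy": [20.0, 30.0, np.nan],
    "highway_fuel_economy": [30.0, 40.0, 28.0],
})

# Row-wise mean of the two columns; NaN values are skipped by default.
df["combined_fuel_economy"] = df[
    ["city_fuel_economy", "highway_fuel_economy"]
].mean(axis=1)

# The source columns are no longer needed once they are combined.
df = df.drop(columns=["city_fuel_economy", "highway_fuel_economy"])
```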



### Change types to numeric

Some columns stored numeric data as strings, so their values had to be split and converted to a numeric type. Below is an example of the values in the height column before the type change.

![](./img/numeric.png)
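
One possible way to do the conversion is sketched below. It assumes values that look like `"66.1 in"` with `"--"` as a placeholder for missing data; the exact string format in the dataset may differ, and `to_number` is a hypothetical helper name.

```python
import pandas as pd


def to_number(series: pd.Series) -> pd.Series:
    """Extract the first number from each string and convert it to float."""
    extracted = series.str.extract(r"(-?\d+(?:\.\d+)?)", expand=False)
    # Anything without a number (for example "--") becomes NaN.
    return pd.to_numeric(extracted, errors="coerce")


height = pd.Series(["66.1 in", "58 in", "--", None])  # assumed format
height_num = to_number(height)
print(height_num)
```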

### Filling missing values

All columns were examined, and missing values were filled with the mean, median, or mode, depending on which was most appropriate for each column. The method was chosen by looking at graphs that show the distribution of the data. The example below shows that it would be best to replace the missing values in the **width** column with the mode.

![](./img/width_exmpl.png)
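
A minimal sketch of the two kinds of fill is given below with made-up values. Using the mode for **width** follows the example above; using the median for a skewed column such as mileage is only an assumption for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "width": [72.0, 72.0, 70.5, np.nan, 72.0, np.nan],
    "mileage": [10.0, 20.0, np.nan, 30.0, 1000.0, 40.0],
})

# mode() can return several values when there is a tie, so take the first.
width_mode = df["width"].mode().iloc[0]
df["width"] = df["width"].fillna(width_mode)

# The median is less sensitive to extreme values than the mean.
mileage_median = df["mileage"].median()
df["mileage"] = df["mileage"].fillna(mileage_median)
```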

## Data Preprocessing challenges

- The large dataset caused problems that we had not encountered in previous courses. We solved the problems with reading the data by splitting the material into chunks and reading the file in parts.
- Some code took a long time to run. In those cases the code was left to run overnight, so that working time was not spent waiting for it to finish.
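
Reading the file in parts can be sketched with the `chunksize` parameter of `pandas.read_csv`. The helper name `read_in_chunks`, the chunk size and the in-memory CSV are placeholders for illustration; in the project the source would be the path to the Kaggle CSV file.

```python
import io

import pandas as pd


def read_in_chunks(source, chunksize, usecols=None):
    """Read a CSV piece by piece and return the combined DataFrame."""
    parts = []
    for chunk in pd.read_csv(source, chunksize=chunksize, usecols=usecols):
        # Per-chunk filtering or type conversion could be done here
        # to keep memory use low before the parts are combined.
        parts.append(chunk)
    return pd.concat(parts, ignore_index=True)


# Small in-memory CSV standing in for the real file.
csv_text = "vin,price,year\nA1,10000,2015\nB2,15000,2018\nC3,8000,2009\n"
df = read_in_chunks(io.StringIO(csv_text), chunksize=2, usecols=["vin", "price"])
```

Loading only the needed columns with `usecols` reduces memory use further.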

## Key findings

### Impact of Sales Area on Vehicle's Age, Type, and Price

- Most cars, both new and old, are sold in the South. The fewest cars are sold in the Northeast, which may be because the Northeast is a smaller region than the others.

![](./img/sales-area.png)

- The most popular cars sold in all regions are SUV/Crossover, Pickup Truck and Sedan.

- Cars are cheapest in the Midwest and most expensive in the West.

![](./img/sales-area-price.png)

### Cars That Sell the Worst

- The graph shows that Pickup Truck and Coupe body types spend the least time on sale. Vans spend the most time on sale. Hatchbacks also stay on sale a little longer than the other body types.

![](./img/worst-body-type.png)
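
A comparison like this can be produced with a simple group-by. The column names `body_type` and `daysonmarket` and the use of the median are assumptions for illustration, and the numbers below are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "body_type": ["Van", "Van", "Coupe", "Coupe", "Pickup Truck"],
    "daysonmarket": [90, 110, 30, 40, 25],
})

# Median time on sale per body type, slowest sellers first.
median_days = (
    df.groupby("body_type")["daysonmarket"]
    .median()
    .sort_values(ascending=False)
)
print(median_days)
```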

- The three worst-selling makes are Pininfarina, Franklin and Daewoo. Pininfarina's poor sales are probably due to it being considered a "luxury car". Franklin cars, on the other hand, are old, so their price is probably explained by their rarity and their status as classic cars.

![](./img/worst-maker.png)

- The effect of the car's age on sales is clear. Cars that are only a few years old sell clearly faster than both old cars and brand-new cars. Cars from before 2010 generally sell worse the older they are.

![](./img/worst-year.png)

### Effect on Price for Vehicles Used as Taxis

- The price of taxis seems to remain consistently lower than other cars over the years.

![](./img/taxi-year.png)
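
The yearly comparison can be sketched with a pivot table. The boolean column `is_taxi`, the derived `usage` labels and the prices are hypothetical and only illustrate the shape of the calculation.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2015, 2015, 2015, 2018, 2018, 2018],
    "is_taxi": [True, False, False, True, False, False],
    "price": [9000, 12000, 14000, 15000, 20000, 22000],
})

# Readable labels make the pivoted columns easier to work with.
df["usage"] = df["is_taxi"].map({True: "taxi", False: "other"})

# Median price per model year, one column per usage group.
by_year = df.pivot_table(
    index="year", columns="usage", values="price", aggfunc="median"
)
by_year["gap"] = by_year["other"] - by_year["taxi"]
print(by_year)
```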

- For a reason we have not yet identified, SUVs that have been used as taxis are priced higher than regular SUVs.

![](./img/taxi-suv.png)

### Identifying Dominant Features for Sales

- The columns torque, horsepower, mileage, year, fuel tank volume, wheelbase, height and length seem to have a strong correlation with price.

    **Torque and Horsepower**: These are key indicators of the engine's power and performance. Cars with higher torque and horsepower values often provide better acceleration and overall performance, which can increase their market value.

    **Mileage**: Generally, lower mileage suggests less wear and tear on the car, indicating a potentially longer lifespan. Cars with lower mileage are often perceived as more reliable and can command higher prices in the used car market.

    **Year**: The model year of a car is crucial because newer cars often come with updated features, technology, and safety enhancements. Newer models can be perceived as more valuable and may have a higher market value compared to older ones.

    **Fuel Tank Volume**: Fuel tank volume measures capacity rather than fuel efficiency, but cars with larger fuel tanks may have a longer range between refueling, which can be an attractive feature and positively influence the price.

    **Wheelbase, Height, and Length**: These dimensions contribute to the overall size and design of the car. Larger cars, especially those with more interior space, may be considered more comfortable and desirable, leading to higher prices.

![](./img/dom-corr-price.png)
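
A ranking like this can be computed from the correlation matrix of the numeric columns. The sketch below uses made-up values and the default Pearson correlation, which is an assumption about how the figure was produced.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [10000, 15000, 20000, 30000],
    "horsepower": [120, 150, 200, 300],
    "mileage": [90000, 60000, 40000, 10000],
    "body_type": ["Sedan", "Sedan", "SUV", "SUV"],  # non-numeric, ignored
})

# Correlate numeric columns only, then rank by absolute correlation with price.
corr_with_price = (
    df.select_dtypes("number")
    .corr()["price"]
    .drop("price")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_price)
```

Sorting by absolute value keeps strongly negative relationships, such as mileage, near the top of the list.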



## Next Steps

1. Feature Selection:

    We will consider feature selection techniques to identify the most relevant variables for the modeling phase. Feature selection can help improve model performance and reduce dimensionality.

2. Model Selection:

    We will begin experimenting with different machine learning models. We'll select a variety of algorithms that are suitable for the problem, train each model on the preprocessed data, and evaluate their performance using appropriate metrics. We will also experiment with different hyperparameters to optimize model performance.

3. Cross-Validation:

    We'll implement cross-validation techniques to assess the generalization ability of models. Cross-validation helps in estimating how well models will perform on unseen data.

4. Evaluation and Model Comparison:

    We will analyze the results of the models and compare their performance metrics to identify the most promising candidates for the project. Depending on the problem type, we will use accuracy, precision, recall and F1-score (classification) or mean squared error (regression).

5. Iterative Refinement:

    As we work through the modeling phase (steps 2 and 3) and evaluate model performance (step 4), we are prepared to iterate and make adjustments. This might involve revisiting earlier steps, such as data preprocessing, to further improve the quality of the dataset.

6. Decision Making:

    Based on the performance and insights gained from the modeling phase, we'll make informed decisions about the direction of the project. We'll decide which model(s) to proceed with and what additional steps may be required.

7. Future Steps:

    We'll plan for the deployment and operationalization of the selected model(s) in a real-world setting. We'll consider how to maintain and update the model as new data becomes available.
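
To make the cross-validation and evaluation steps above more concrete, here is a rough sketch of k-fold cross-validation scored with mean squared error. Plain least-squares regression on synthetic data is used only as a stand-in; it is not the model the team has chosen, and `kfold_mse` is a hypothetical helper. In practice a library such as scikit-learn would normally handle this.

```python
import numpy as np


def kfold_mse(X, y, k=5, seed=0):
    """Return the test MSE of a least-squares fit for each of k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)  # shuffled index folds
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Add a column of ones so the model can learn an intercept.
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        residuals = y[test_idx] - A_test @ coef
        scores.append(float(np.mean(residuals ** 2)))
    return scores


# Synthetic regression data: y depends linearly on two features plus noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 + rng.normal(scale=0.1, size=100)

scores = kfold_mse(X, y, k=5)
print("MSE per fold:", scores)
print("Mean MSE:", sum(scores) / len(scores))
```

Each row is used for testing exactly once, so the mean of the fold scores gives an estimate of how the model would perform on unseen data.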