<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#What-sells-a-car?" data-toc-modified-id="What-sells-a-car?-1">What sells a car?</a></span><ul class="toc-item"><li><span><a href="#Initialization" data-toc-modified-id="Initialization-1.1">Initialization</a></span></li></ul></li><li><span><a href="#Load-data" data-toc-modified-id="Load-data-2">Load data</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Explore-initial-data" data-toc-modified-id="Explore-initial-data-2.0.1">Explore initial data</a></span></li><li><span><a href="#Conclusions-and-further-steps" data-toc-modified-id="Conclusions-and-further-steps-2.0.2">Conclusions and further steps</a></span></li></ul></li></ul></li><li><span><a href="#Data-Preprocessing" data-toc-modified-id="Data-Preprocessing-3">Data Preprocessing</a></span><ul class="toc-item"><li><span><a href="#column-date_posted" data-toc-modified-id="column-date_posted-3.1">column <code>date_posted</code></a></span></li><li><span><a href="#column-type" data-toc-modified-id="column-type-3.2">column <code>type</code></a></span></li><li><span><a href="#column-is_4wd" data-toc-modified-id="column-is_4wd-3.3">column <code>is_4wd</code></a></span></li></ul></li><li><span><a href="#Treat-missing-values" data-toc-modified-id="Treat-missing-values-4">Treat missing values</a></span><ul class="toc-item"><li><span><a href="#column-model_year" data-toc-modified-id="column-model_year-4.1">column <code>model_year</code></a></span></li><li><span><a href="#column-odometer" data-toc-modified-id="column-odometer-4.2">column <code>odometer</code></a></span></li><li><span><a href="#column-cylinders" data-toc-modified-id="column-cylinders-4.3">column <code>cylinders</code></a></span></li></ul></li><li><span><a href="#Enrich-data" data-toc-modified-id="Enrich-data-5">Enrich data</a></span><ul class="toc-item"><li><span><a href="#Day-of-the-week,-month,-and-year-the-ad-was-placed" 
data-toc-modified-id="Day-of-the-week,-month,-and-year-the-ad-was-placed-5.1">Day of the week, month, and year the ad was placed</a></span></li><li><span><a href="#Inserting-the-age-of-the-car" data-toc-modified-id="Inserting-the-age-of-the-car-5.2">Inserting the age of the car</a></span></li><li><span><a href="#Average-yearly-mileage" data-toc-modified-id="Average-yearly-mileage-5.3">Average yearly mileage</a></span></li><li><span><a href="#Replace-column-condition" data-toc-modified-id="Replace-column-condition-5.4">Replace column <code>condition</code></a></span></li><li><span><a href="#Check-clean-data" data-toc-modified-id="Check-clean-data-5.5">Check clean data</a></span></li></ul></li><li><span><a href="#Study-core-parameters" data-toc-modified-id="Study-core-parameters-6">Study core parameters</a></span><ul class="toc-item"><li><span><a href="#Study-core-parameters-without-outliers" data-toc-modified-id="Study-core-parameters-without-outliers-6.1">Study core parameters without outliers</a></span></li><li><span><a href="#Study-and-treat-outliers" data-toc-modified-id="Study-and-treat-outliers-6.2">Study and treat outliers</a></span></li><li><span><a href="#Ads-lifetime" data-toc-modified-id="Ads-lifetime-6.3">Ads lifetime</a></span></li><li><span><a href="#Average-price-per-each-type-of-vehicle" data-toc-modified-id="Average-price-per-each-type-of-vehicle-6.4">Average price per each type of vehicle</a></span></li></ul></li><li><span><a href="#Price-factors" data-toc-modified-id="Price-factors-7">Price factors</a></span><ul class="toc-item"><li><span><a href="#Scatterplot-for-suv" data-toc-modified-id="Scatterplot-for-suv-7.1">Scatterplot for <code>suv</code></a></span></li><li><span><a href="#Scatterplot-for-sedan" data-toc-modified-id="Scatterplot-for-sedan-7.2">Scatterplot for <code>sedan</code></a></span></li></ul></li><li><span><a href="#General-conclusion" data-toc-modified-id="General-conclusion-8">General conclusion</a></span></li></ul></div>


# What sells a car?

You're an analyst at Crankshaft List. Hundreds of free advertisements for vehicles are published on your site every day. You need to study data collected over the last few years and determine which factors influence the price of a vehicle.

We need to load the dataset and check it for incorrect and missing values. Once the data is prepared, we turn to the main questions: how long ads typically stay listed, how different parameters affect the price and the speed of sale, and what patterns and correlations the data contains. So, let's begin.

## Initialization

Let's start by importing the needed libraries and modules.

In [1]:
# Loading all the libraries
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt #imports the pyplot module from the matplotlib library to plot graphs

# Load data

Let's load the `vehicles_us.csv` file, make it a data frame and look at the general information.

In [4]:
#creating variable for dataset path
data_path = '/Users/lanadashevsky/Practicum DA projects/datasets/vehicles_us.csv'

In [3]:
# Load the data file into a DataFrame

try:
    data = pd.read_csv('vehicles_us.csv')
except FileNotFoundError:  # fall back to the local copy of the dataset
    data = pd.read_csv(data_path)
    
data.info()


### Explore initial data

The dataset contains the following fields:
- `price`
- `model_year`
- `model`
- `condition`
- `cylinders`
- `fuel` — gas, diesel, etc.
- `odometer` — the vehicle's mileage when the ad was published
- `transmission`
- `type`
- `paint_color`
- `is_4wd` — whether the vehicle has 4-wheel drive (Boolean type)
- `date_posted` — the date the ad was published
- `days_listed` — from publication to removal

We'll want to see how many columns and rows our data has, look at a few rows to check for potential issues with the data.

In [None]:
# print the general/summary information about the DataFrame
data.describe()

In [None]:
# count the missing values in each column
data.isna().sum()

**Conclusion.**

Our data has 51525 rows and 13 columns in total: 6 numeric columns and 7 of object type. There are missing values in 5 columns: `model_year`, `cylinders`, `odometer`, `paint_color`, `is_4wd`.

Two columns have type problems:
- `date_posted` contains strings that can be converted to datetime objects;
- `is_4wd` should be boolean rather than **float64**.

**Next step.** 
We need to print out a few rows of our data to see how the data is presented and what problems there might be.

In [None]:
# print a sample of data

data.head(10)

In [None]:
data.duplicated().sum()

On first review, there are no duplicate ads.

**The data has several problems:**
1. `model_year` has missing values and is of type float64;
2. `cylinders` and `odometer` have missing values;
3. `type` mixes uppercase and lowercase values;
4. `paint_color` has missing values;
5. `is_4wd` should be boolean but is stored as float64;
6. `date_posted` contains strings that can be converted to datetime objects.

### Conclusions and further steps

After looking at the data frame summary and the first few rows, we can take the first steps to fix some of the problems:

- convert the strings in `date_posted` to datetime objects;
- convert all `type` values to lowercase for ease of use;
- replace missing `paint_color` values with `unknown`;
- convert the `model_year` column to integer.

Let's start with simple problems that don't require big changes.

# Data Preprocessing

##  column `date_posted` 

The `date_posted` column has an object type and needs to be converted to a datetime type. For this we use the `pd.to_datetime()` function. In our data the date is written in the format **`%Y-%m-%d`**.

In [None]:
# changes the data type in the date_posted column to datetime type
data['date_posted'] = pd.to_datetime(data['date_posted'], format='%Y-%m-%d')
data.info()

Data type in column `date_posted` successfully changed to **datetime64**.

## column `type`

Some values in the `type` column are uppercase and some are lowercase. Let's check which unique values this column has.

In [None]:
# Let's see all values in type column to check
data['type'].unique()

In [None]:
# Convert all values to lowercase
data['type'] = data['type'].str.lower()

# Checking all the values in the column to make sure we fixed them
data['type'].value_counts()

We normalized the letter case in the `type` column, and it now looks fine.

## column `is_4wd`

The `is_4wd` column should be boolean but is stored as float64. Let's check which unique values this column has.

In [None]:
# Let's see all values in the is_4wd column to check
data['is_4wd'].unique()

The column has two unique values, 1 and NaN. It tells us whether the car has 4-wheel drive.  
Since 1 can be read as True, it is reasonable to assume that the missing values mean False. So we fill all missing values with 0 and then convert the column to boolean type.

In [None]:
# all missing values in the is_4wd column are replaced with 0
data['is_4wd'] = data['is_4wd'].fillna(0)

# changes the data type in the is_4wd column to boolean
data['is_4wd'] = data['is_4wd'].astype('bool')

In [None]:
# check column after convert
data['is_4wd'].value_counts()
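
As a small illustration of this conversion, here is a sketch on toy values (hypothetical, not taken from `vehicles_us.csv`). It relies on the same assumption made above, that a missing flag means "not 4wd". The `eq(1)` variant is an alternative that was not used in this notebook; it gives the same result because comparing NaN with 1 yields False.

```python
import numpy as np
import pandas as pd

# Toy flag column: 1.0 marks a 4wd car, NaN is assumed to mean "not 4wd"
flag = pd.Series([1.0, np.nan, 1.0, np.nan], name='is_4wd')

# Approach used in this notebook: fill the gaps with 0, then cast to bool
as_bool = flag.fillna(0).astype('bool')

# Alternative sketch: compare with 1 directly (NaN == 1 evaluates to False)
as_bool_alt = flag.eq(1)

print(as_bool.tolist())
print(as_bool.equals(as_bool_alt))
```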

# Treat missing values

Study the missing values for each column in more detail and decide what to do with them.

In [None]:
data.isna().sum()

Data is missing in 4 columns: `model_year`, `cylinders`, `odometer` and `paint_color`.

Three of these columns are logically related: `model_year`, `cylinders` and `odometer`. We'll start with the one that has the most missing values.

## column `model_year`

Let's look at the column `model_year` first.  
A car's model year can be estimated from its `condition` and `model`. Neither of these columns has missing values, so we can use them to fill the gaps in `model_year`.  
Let us look at a description of the column.

In [None]:
# check the summary statistics of the model_year column
data['model_year'].describe()

The minimum value in this column is 1908, which looks strange. Let's check how many rows have this year and decide what to do with them.

In [None]:
# check min values in column model_year
data.query('model_year == 1908')

Checking these models against their year suggests a human error made when the data was entered. The correct model year is 1998; fixing it also lets us avoid an outlier at the preprocessing stage.

In [None]:
# fix the incorrect minimum value
data['model_year'] = data['model_year'].replace(1908,1998)


In [None]:
# look at the missing rows
data[data.model_year.isna()]

We group the rows by `model` and `condition` and fill the missing `model_year` values with the median of each group.

In [None]:
# group by car model and condition, and fill the missing values with the group median
data['model_year'] = data['model_year'].fillna(data.groupby(['model', 'condition'])['model_year'].transform('median'))

# check row with missing value in column after fix
data.loc[72]

In [None]:
# check our data after fix
data.model_year.isna().sum()
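
The group-wise fill used above can be sketched on a toy frame (hypothetical values, not taken from the dataset). `transform('median')` returns a Series aligned with the original rows, so `fillna` can take each row's replacement from its own group.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'model': ['ford f-150', 'ford f-150', 'ford f-150', 'honda civic', 'honda civic'],
    'condition': ['good', 'good', 'good', 'excellent', 'excellent'],
    'model_year': [2010.0, 2014.0, np.nan, 2016.0, np.nan],
})

# Median model_year within each (model, condition) group, one value per original row
group_median = toy.groupby(['model', 'condition'])['model_year'].transform('median')

# Fill only the missing entries; existing values are left untouched
toy['model_year_filled'] = toy['model_year'].fillna(group_median)
print(toy)
```

A group in which every `model_year` is missing has no median, so its rows stay NaN after the fill. This is one reason to re-check `isna().sum()` afterwards.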

## column `odometer`

An odometer is an instrument for measuring the distance traveled by a vehicle, so this column is numerical.  
Let us look at a numerical description of the column.

In [None]:
# check a numerical description of the odometer column
data['odometer'].describe()

In [None]:
# look at the missing rows
data[data.odometer.isna()]

The column has a large standard deviation and a high maximum, so the median is a better choice than the mean.  
The odometer reading depends on the model year and the condition of the car, so instead of one overall median we use the median within each group.  
We group the rows by `model_year` and `condition` and fill the missing `odometer` values with the group median.

In [None]:
# grouped values by car's model_year and condition and filling the missing values by median
data['odometer'] = data['odometer'].fillna(data.groupby(['model_year', 'condition'])['odometer'].transform('median'))

# check row with missing value in column odometer after fix
data.loc[25]

In [None]:
# check our data after fix
data['odometer'].describe()

In [None]:
# check our data after fix
data.odometer.isna().sum()

## column `cylinders`

Let's look at the next column `cylinders`.  
This column keeps values that tell us how many cylinders the car has, so the values are numeric too.   
Let us look at the unique values of the column.

In [None]:
# check the unique values of the cylinders column
data['cylinders'].unique()

In [None]:
# look at the missing rows
data[data.cylinders.isna()]

We will replace the missing values in a similar way to the method we used for the `odometer` column.  
Keep in mind that the number of cylinders depends on the car model and year. Here we group by `model` only and use the median.

In [None]:
# group by car model and fill the missing values with the group median
data['cylinders'] = data['cylinders'].fillna(data.groupby(['model'])['cylinders'].transform('median'))

# check row with missing value in column cylinders after fix
data.loc[59]

In [None]:
data.cylinders.isna().sum()

**The `paint_color` column**  
This column is categorical, so we cannot fill its gaps with a median the way we did above. For now, its NaN values remain unchanged.

In [None]:
# check missing value in data after fix
data.isna().sum()

Filling the gaps with group medians worked well, and only a small number of missing values remain.  

Let's get to work and explore our data.

# Enrich data

## Day of the week, month, and year the ad was placed

To calculate the vehicle's age at the time the ad was placed, we need three separate columns (day, month and year), which we derive from `date_posted` using the built-in `.dt` accessor.

In [None]:
# Add datetime values for when the ad was placed (dt.day is the day of the month)

data['ad_day'] = pd.to_datetime(data['date_posted']).dt.day
data['ad_month'] = pd.to_datetime(data['date_posted']).dt.month
data['ad_year'] = pd.to_datetime(data['date_posted']).dt.year
data.head()
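
The `.dt` accessor also exposes the day of the week, which `dt.day` does not give. The sketch below uses two hypothetical dates in the same `%Y-%m-%d` format; the name `parts` is illustrative only.

```python
import pandas as pd

# Two hypothetical posting dates in the dataset's format
raw = pd.Series(['2018-06-23', '2019-01-07'])
parsed = pd.to_datetime(raw, format='%Y-%m-%d')

parts = pd.DataFrame({
    'day_of_month': parsed.dt.day,
    'weekday': parsed.dt.weekday,   # Monday=0 ... Sunday=6
    'month': parsed.dt.month,
    'year': parsed.dt.year,
})
print(parts)
```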

## Inserting the age of the car

Now we want to create a column `ad_posted_age` with the age of the car at the time of publication. We simply subtract `model_year` from `ad_year`.

In [None]:
# Add the vehicle's age when the ad was placed
data['ad_posted_age'] = data['ad_year'] - data['model_year']

data['ad_posted_age'].unique()

In [None]:
# round fractional ages (possible after the median fill of model_year) up to whole years
data['ad_posted_age'] = np.ceil(data['ad_posted_age'])
data['ad_posted_age'].unique()

We can assume that we may have zero values in the column in the case when the year of the model coincides with the year of the advertisement.  
Let's check it out.

In [None]:
# check zero values
data.query('ad_posted_age == 0')

Our guess was correct: 2156 rows have a zero value in the `ad_posted_age` column. We change these zeros to one year, because we round up and treat such a car as already a year old.

In [None]:
# fix the data based on our decision
data['ad_posted_age'] = data['ad_posted_age'].replace(0,1)

# check our column on zero values after fix
data.query('ad_posted_age == 0')

We have successfully inserted and checked our `ad_posted_age` column.

## Average yearly mileage


Now we want to add an `avg_yearly_mileage` column that stores the average number of miles per year for each car. To do this, the odometer value must be divided by the age of the car.

In [None]:
# Add the vehicle's average mileage per year
data['avg_yearly_mileage'] = data['odometer'] / data['ad_posted_age']
data.head() #check our data
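
The age and mileage steps can be checked together on a toy frame (hypothetical values). This sketch uses `clip(lower=1)` instead of `replace(0, 1)`. That is an alternative, not the notebook's method, and it differs in one respect: it would also raise any negative age to 1.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'ad_year': [2018, 2019, 2018],
    'model_year': [2011.0, 2019.0, 2012.5],   # 2012.5 imitates a median-filled year
    'odometer': [145000.0, 3000.0, 90000.0],
})

# Age in whole years, rounded up, with a minimum of one year
age = np.ceil(toy['ad_year'] - toy['model_year'])
toy['ad_posted_age'] = age.clip(lower=1)

# A minimum age of 1 also protects the division below from dividing by zero
toy['avg_yearly_mileage'] = toy['odometer'] / toy['ad_posted_age']
print(toy)
```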


## Replace column `condition`

In the `condition` column, replace string values with a numeric scale:
- new = 5
- like new = 4
- excellent = 3
- good = 2
- fair = 1
- salvage = 0

In [None]:
# check our unique data in condition  column
data['condition'].unique()

In [None]:
# let's create a dictionary with the condition names as keys
condition_dict={
    'new':5,
    'like new':4,
    'excellent':3,
    'good':2,
    'fair':1,
    'salvage':0
}

In [None]:
# define a function that maps each condition category to its numeric value from the dictionary
def replace_cond(x):
    return condition_dict[x]

data['cond_categ'] = data['condition'].apply(replace_cond)

#check  our unique values in new column
data['cond_categ'].unique()
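
The same recoding can be written with `Series.map`, shown here on toy values as an alternative sketch. The name `condition_scale` is illustrative. Unlike the function above, which raises a `KeyError` on a label missing from the dictionary, `map` turns such a label into NaN, so counting NaN afterwards shows whether anything was left unmapped.

```python
import pandas as pd

# Same scale as in the dictionary above
condition_scale = {
    'new': 5, 'like new': 4, 'excellent': 3,
    'good': 2, 'fair': 1, 'salvage': 0,
}

conditions = pd.Series(['good', 'like new', 'salvage', 'new'])

# Vectorized lookup; labels absent from the dictionary would become NaN
cond_categ = conditions.map(condition_scale)
unmapped = int(cond_categ.isna().sum())
print(cond_categ.tolist(), unmapped)
```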

## Check clean data

Look at our data after the pre-processing step.  
We have added several columns for further study, analysis and plotting: `ad_year`, `ad_posted_age`, `avg_yearly_mileage`, `cond_categ`.

In [None]:
# print the general/summary information about the DataFrame
data.info()

In [None]:
# print a sample of data
data.head()

In [None]:
data.describe()

# Study core parameters

Study the following parameters:
- Price
- The vehicle's age when the ad was placed
- Mileage
- Number of cylinders
- Condition

We plot histograms for each of these parameters and study how outliers affect the form and readability of the histograms.


In [None]:
# import library for plots and histograms
import seaborn as sns

## Study core parameters without outliers

In [None]:
data['price'].describe()

In [None]:
#plots a histogram for the 'price' column with 20 bins and the x-axis ranging from 0 to 60000
plt.title('Price')
data['price'].hist(bins=20, range=(0, 60000))
plt.show() #shows the above histogram
print('The outliers are considered to be values below', np.percentile(data.price, 3), ' and above', np.percentile(data.price, 97))

#plots a histogram for the 'ad_posted_age' column with 20 bins and the x-axis ranging from 0 to 40
plt.title('The vehicle\'s age when the ad was placed')
data['ad_posted_age'].hist(bins=20, range=(0, 40))
plt.show() #shows the above histogram

#plots a histogram for the 'odometer' column with 20 bins and the x-axis ranging from 0 to 400000
plt.title('Odometer')
data['odometer'].hist(bins=20, range=(0, 400000))
plt.show() #shows the above histogram

#plots a histogram for the 'cylinders' column with 5 bins
plt.title('Number of cylinders')
data['cylinders'].hist(bins=5)
plt.show() #shows the above histogram

#plots a histogram for the 'cond_categ' column with 6 bins
plt.title('Condition')
data['cond_categ'].hist(bins=6)
plt.show() #shows the above histogram


Let's analyze what we see on the histograms and what happens to our data.
- on the `price` histogram, the bins drop to near zero and form a long flat tail after 40,000, and the 97th percentile is 35,000;
- on the `ad_posted_age` histogram, the same thing happens after 30 years;
- on the `odometer` histogram, a tail starts to appear after 270,000 miles;
- there are no outliers on the `cylinders` and `condition` histograms.

## Study and treat outliers

Let's filter our data with these limits (**`price more than 45000, ad_posted_age more than 30, odometer more than 270000 miles`**) and see how the filtered data compares to the full data.

In [None]:
#creates a slice of data where the price is more than 45000, or the age is more than 30, or the odometer is more than 270000
outlier_data = data.query('price > 45000 or ad_posted_age > 30 or odometer > 270000')
outlier_data.head()

In [None]:
filtered_data = data.query('price < 45000 & ad_posted_age < 30 & odometer < 270000')
filtered_data.info() #general information about dataframe
filtered_data.head()
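
A row counts as an outlier when it breaks at least one of the limits, so the outliers are the complement of the rows we keep. The sketch below shows this on toy values (hypothetical). It also shows that requiring all three conditions at once selects almost nothing.

```python
import pandas as pd

toy = pd.DataFrame({
    'price': [5000, 60000, 12000, 8000],
    'ad_posted_age': [6, 3, 35, 10],
    'odometer': [100000, 20000, 150000, 300000],
})

# Rows inside all three limits
keep = (toy['price'] < 45000) & (toy['ad_posted_age'] < 30) & (toy['odometer'] < 270000)
kept = toy[keep]

# Complement: rows that break at least one limit (includes values exactly on a limit)
outliers = toy[~keep]

# Requiring all three conditions together matches no row in this toy data
all_three = toy.query('price > 45000 and ad_posted_age > 30 and odometer > 270000')
print(len(kept), len(outliers), len(all_three))
```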

Now let's build histograms by the same parameters for two tables at the same time - full and filtered tables.

In [None]:
plt.title('Price')
data['price'].hist(bins=20, alpha=0.3, range=(0, 45000), label='with outliers')
filtered_data['price'].hist(bins=20, alpha=0.3, range=(0, 45000), label='without outliers')
plt.legend(loc='upper right')
plt.show() #shows the above histogram

plt.title('The vehicle\'s age when the ad was placed')
data['ad_posted_age'].hist(bins=20, alpha=0.3, range=(0, 40), label='with outliers')
filtered_data['ad_posted_age'].hist(bins=20,alpha=0.3, range=(0, 40), label='without outliers')
plt.legend(loc='upper right')
plt.show()  #shows the above histogram

plt.title('Odometer')
data['odometer'].hist(bins=20, alpha=0.3, range=(0, 270000), label='with outliers')
filtered_data['odometer'].hist(bins=20,alpha=0.3, range=(0, 270000), label='without outliers')
plt.legend(loc='upper right')
plt.show() #shows the above histogram

plt.title('Number of cylinders')
data['cylinders'].hist(bins=5, alpha=0.3, label='with outliers')
filtered_data['cylinders'].hist(bins=5,alpha=0.3, label='without outliers')
plt.legend(loc='upper right')
plt.show() #shows the above histogram

plt.title('Condition')
data['cond_categ'].hist(bins=6, alpha=0.3, label='with outliers')
filtered_data['cond_categ'].hist(bins=6,alpha=0.3, label='without outliers')
plt.legend(loc='upper left')
plt.show() #shows the above histogram

**Conclusions** 
- `Vehicle price` - a right-skewed distribution with a peak at 5000-6000, followed by a steady decrease. Removing the outliers did not change the main shape of the distribution.
- `The vehicle's age when the ad was placed` - a right-skewed distribution with a peak at 6-7 years, followed by a steady decline. Removing the outliers did not change the main shape of the distribution.
- `Odometer` - a roughly symmetric, bell-shaped distribution with a peak at 100-150k miles. Removing the outliers did not change the main shape of the distribution.
- `Number of cylinders` - removing the outliers did not change the main shape of the distribution.
- `Condition` - most of the vehicles shown are in "good" or "excellent" condition. Removing the outliers did not change the main shape of the distribution.

In [None]:
# let's build a boxplot for column 'price' without outliers
sns.boxplot(x=data['price'], showfliers = False)

## Ads lifetime

- Study how many days advertisements were displayed.  
- Calculate the mean and median.  
- Describe the typical lifetime of an ad.  
- Determine when ads were removed quickly, and when they were listed for an abnormally long time.

In [None]:
def distr_percentile(parameter):
    
    ninety_seven = np.percentile(filtered_data[parameter], 97)
    three = np.percentile(filtered_data[parameter], 3)
    print('Statistics on: {}'.format(parameter))
    print('---------------------------------------')
    print('lower outliers border:',three)
    print('max outliers border:',ninety_seven)
    print('---------------------------------------')
    print('The outliers are considered to be values below',three, "and above",ninety_seven)
    print('We have',len(filtered_data[(filtered_data[parameter]<three)|(filtered_data[parameter]>ninety_seven)]),"values that we can consider outliers")
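
A variant that returns the bounds instead of printing them can be easier to reuse in later filters. This is a sketch with hypothetical names (`percentile_bounds`, `days`) and synthetic data. Note that `np.percentile` returns NaN if the input contains NaN.

```python
import numpy as np
import pandas as pd

def percentile_bounds(values, lower_pct=3, upper_pct=97):
    """Return the (lower, upper) percentile borders of a numeric sequence."""
    lower, upper = np.percentile(values, [lower_pct, upper_pct])
    return lower, upper

# Synthetic example: the integers 1..100
days = pd.Series(range(1, 101))
low, high = percentile_bounds(days)

# Count the values that fall outside the borders
n_out = int(((days < low) | (days > high)).sum())
print(low, high, n_out)
```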

In [None]:
filtered_data['days_listed'].describe()

According to this summary, the mean of `days_listed` is about 39 days and the median is 33.

In [None]:
# plots a histogram for the 'days_listed' column with 20 bins and the x-axis ranging from 0 to 200

plt.title('Ads lifetime')
filtered_data['days_listed'].hist(bins=20, range=(0, 200))
plt.show()
distr_percentile('days_listed')  # the function prints its own report and returns None

In [None]:
# Determine the lower limits for outliers
ad_outlier_data = filtered_data.query('days_listed < 5')
 
plt.title('Ads outlier lifetime less than 5 days')
ad_outlier_data['days_listed'].hist(bins=5, range=(0, 5)) 
plt.show()

# Determine the upper limits for outliers
ad_outlier_data_max = filtered_data.query('days_listed > 105')
 
plt.title('Ads outlier lifetime more than 105 days')
ad_outlier_data_max['days_listed'].hist(bins=7, range=(105, 270)) 
plt.show()

We will make the assumption that cars in excellent condition and at a lower price are bought faster and therefore they are in the data for a short time. Let's take a look and compare with cars whose ads are more than 105 days old.

In [None]:
print('Type and condition of the vehicles that sold in less than 5 days:')
type_car_short_ad=ad_outlier_data.groupby(['type', 'cond_categ'])['price'].agg(['count','mean']).reset_index().sort_values(by='count',ascending=False)
print(type_car_short_ad.iloc[:3])
print('---------------------------------')
print('')
print('Type and condition of the vehicles that sold in more than 105 days:')
type_car_long_ad=ad_outlier_data_max.groupby(['type', 'cond_categ'])['price'].agg(['count','mean']).reset_index().sort_values(by='count',ascending=False)
print(type_car_long_ad.iloc[:3])

Our assumption was not confirmed. There is no pattern, and the top three groups of ads for both periods are exactly the same.

Let's look at the average values of the columns `'price', 'model_year', 'odometer', 'cond_categ'` for all ads and for the ads that fall into the lower and upper outlier groups.  
To do this, we define a function that takes a column name as a parameter and prints its average value in each of the three dataframes.

In [None]:
# function that takes a column name and prints its average value for all ads and for both outlier groups
def lifetime_ads(parameter):
    print(f'Average: {parameter}')
    print('---------------------------------------------------') 
    print(f'for all vehicles: {int(filtered_data[parameter].mean())}')
    print(f'for vehicles that sold in less than 5 days: {int(ad_outlier_data[parameter].mean())}')
    print(f'for vehicles that sold in more than 105 days: {int(ad_outlier_data_max[parameter].mean())}')
    print('---------------------------------------------------')
    print(' ')


In [None]:
# pass each column name to the function lifetime_ads
for parameter in ['price', 'model_year', 'odometer', 'cond_categ']:
    lifetime_ads(parameter)

In [None]:
# describe the typical lifetime of an ad, let's make a boxplot
filtered_data.boxplot(column='days_listed', figsize=(8,8))
plt.show()

**Conclusions**
- The typical lifetime of an ad is between 19 and 53 days;
- Ads that stay listed for more than 105 days are considered abnormally long;
- Ads that were listed for less than 19 days indicate that the car was sold quickly.
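
These thresholds are consistent with the usual box plot rule, in which the whiskers reach at most 1.5 interquartile ranges beyond the quartiles (the matplotlib default). With quartiles of 19 and 53 days, the upper fence is 53 + 1.5 × 34 = 104 days. A minimal sketch with a hypothetical helper `tukey_fences` and toy values chosen to have those quartiles:

```python
import pandas as pd

def tukey_fences(values, k=1.5):
    """Return the (lower, upper) fences at k interquartile ranges from the quartiles."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy series whose quartiles are 19 and 53 (hypothetical values)
toy_days = pd.Series([10, 19, 33, 53, 60])
lower, upper = tukey_fences(toy_days)
print(lower, upper)
```

On a plot, each whisker is drawn only to the most extreme data point that lies within its fence, and `days_listed` cannot be negative, so a negative lower fence simply means there are no low-side outliers by this rule.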

In [None]:
# final filtered data based on the conclusions above
filtered_data_fin = filtered_data.query('days_listed > 5 & days_listed < 105')

In [None]:
# plots a histogram for the 'days_listed' column without outliers
plt.title('Ads lifetime without outliers')
filtered_data_fin['days_listed'].hist(bins=20, range=(0, 100))
plt.show()

## Average price per each type of vehicle

We analyze the number of ads and the average price for each type of vehicle.  

In [None]:
type_car_ad=filtered_data_fin.groupby(['type'])['price'].agg(['count','mean']).reset_index().sort_values(by='count',ascending=False)
type_car_ad

In [None]:
#Plot a graph showing the dependence of the number of ads on the vehicle type
plt.figure(figsize=(12,5))
sns.barplot(x="type", y="count", data=type_car_ad)

Two types with the greatest number of ads - **`suv`** and **`sedan`**.

# Price factors

## Scatterplot for `suv`

Now we will study how the price depends on the age, mileage, condition, type of transmission and color of the car in these categories.  

Let's make a cut according to the type of **`suv`**.

In [None]:
#creates a slice of rows where the type is SUV
suv = filtered_data_fin[filtered_data_fin['type'] == 'suv']
#reset the index after slicing
suv= suv.reset_index(drop=True)
suv

In [None]:
# price depends on age, mileage, condition, transmission type, and color.
for_scatter=suv[['price','ad_posted_age','avg_yearly_mileage','cond_categ', 'transmission', 'paint_color']]
for_scatter

We use a correlation matrix and pairwise scatterplots to study these dependencies.

In [None]:
sns.pairplot(for_scatter) 

In [None]:
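# correlation matrix (numeric columns only; transmission and paint_color are categorical)
for_scatter.select_dtypes(include='number').corr()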
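
As a reading aid, here is a sketch of a correlation matrix on toy data (hypothetical values). Text columns are dropped first with `select_dtypes`, and `DataFrame.corr` computes Pearson coefficients by default. A negative coefficient means that price falls as age grows.

```python
import pandas as pd

toy = pd.DataFrame({
    'price': [20000, 15000, 10000, 5000],
    'ad_posted_age': [1, 3, 6, 10],
    'transmission': ['automatic', 'manual', 'automatic', 'automatic'],
})

# Keep numeric columns only, then compute pairwise Pearson correlations
corr = toy.select_dtypes(include='number').corr()
price_vs_age = corr.loc['price', 'ad_posted_age']
print(corr)
```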

For categorical variables (transmission type and color), we plot box-and-whisker charts, and create scatterplots for the rest. When analyzing categorical variables, note that the categories must have at least 50 ads; otherwise, their parameters won't be valid for analysis.

In [None]:
# Making sure that it has more than 50 listings
suv['transmission'].value_counts()

In [None]:
plt.figure(figsize=(10,6))
ax=sns.boxplot(x="transmission", y="price", data=suv, showfliers = False)

In [None]:
# Making sure that it has more than 50 listings
print(suv['paint_color'].value_counts())

# remove two colors from our data because each has fewer than 50 listings
suv_color = suv.query('paint_color not in("yellow", "purple")')
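
The 50-ad rule can also be applied programmatically instead of listing the rare colors by hand. This is a sketch with a hypothetical helper `keep_frequent` on toy data. Because `value_counts` skips NaN by default, rows with a missing color are dropped by this approach.

```python
import pandas as pd

def keep_frequent(df, column, min_count=50):
    """Keep only rows whose category in `column` occurs at least `min_count` times."""
    counts = df[column].value_counts()
    frequent = counts[counts >= min_count].index
    return df[df[column].isin(frequent)]

# Toy data: two common colors and one rare color (hypothetical counts)
toy = pd.DataFrame({
    'paint_color': ['black'] * 60 + ['white'] * 55 + ['yellow'] * 3,
    'price': range(118),
})
filtered = keep_frequent(toy, 'paint_color')
print(filtered['paint_color'].value_counts())
```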

In [None]:
plt.figure(figsize=(12,6))
ax=sns.boxplot(x="paint_color", y="price", data=suv_color, showfliers = False)

**Conclusions**  

Based on the correlation matrix and plots, we can say:
- the age of the SUV is a factor that affects its price. No correlation was found between price and condition or average mileage;
- SUVs with manual transmission had higher typical prices than SUVs with automatic transmission;
- the most numerous ads were for black, white and silver cars;
- black cars were priced higher than the other common colors, with white cars in second place. Orange cars show the highest prices, but because there are so few of them (only 79), we cannot call this a trend.

## Scatterplot for `sedan`

Now we will study how the price depends on the age, mileage, condition, type of transmission and color of the car in these categories.

Let's make a cut according to the type of `sedan`.

In [None]:
#creates a slice of rows where the type is sedan
sedan = filtered_data_fin[filtered_data_fin['type'] == 'sedan']
#reset the index after slicing
sedan= sedan.reset_index(drop=True)

sedan

In [None]:
# price depends on age, mileage, condition, transmission type, and color.
for_scatter_sd=sedan[['price','ad_posted_age','avg_yearly_mileage','cond_categ', 'transmission', 'paint_color']]
for_scatter_sd

In [None]:
sns.pairplot(for_scatter_sd) 

In [None]:
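# correlation matrix (numeric columns only; transmission and paint_color are categorical)
for_scatter_sd.select_dtypes(include='number').corr()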

In [None]:
# Making sure that it has more than 50 listings
sedan['transmission'].value_counts()

In [None]:
plt.figure(figsize=(10,6))
ax=sns.boxplot(x="transmission", y="price", data=sedan, showfliers = False)

In [None]:
##Making sure that it has more than 50 listings
print(sedan['paint_color'].value_counts())

# remove three colors from our data because each has fewer than 50 listings
sedan_color = sedan.query('paint_color not in("yellow", "purple", "orange")')

In [None]:
plt.figure(figsize=(12,6))
ax=sns.boxplot(x="paint_color", y="price", data=sedan_color, showfliers = False)

**Conclusions**  

Based on the correlation matrix and plots, we can say:
- the age of the `sedan` is a factor that affects its price. No correlation was found between price and condition or average mileage;
- sedans with automatic transmission had higher typical prices than sedans with other transmission types;
- the most numerous ads were for silver, black, white and grey cars;
- black and white cars were priced higher than cars in other colors.

# General conclusion

- We filtered the data, filled in the missing values and, based on this, built histograms, plots and correlation matrices for the study.
- We added new columns (the age of the car at the time the ad was placed and the average yearly mileage) and recoded the condition as a numeric scale.
- The typical lifetime of an ad is between 19 and 53 days, and ads that last more than 105 days can be considered anomalously long.
- For short-lived ads, listed for less than 5 days, we assumed that these were cars in excellent condition at a lower price that sold quickly, but this was not confirmed.
- The factor that most influences the price is the age of the car. The Pearson correlation coefficient is 0.58 for SUVs and 0.61 for sedans.
- The three most popular types featured in ads are SUVs, sedans and trucks, with 11380, 11264 and 11055 ads respectively.
- For SUVs, black cars and cars with a manual transmission were more expensive.
- For sedans, black and white cars were almost equally expensive, with silver being the most popular color (1880 ads). Cars with an automatic transmission were priced higher than those with a manual transmission.