# Exploratory analysis of Car Sales

### Table of contents:
[Introduction](#intro)\
[1. Data overview](#overview)\
[2. Duplicated values](#duplicated)
* [Obvious duplicates](#obvious)
* [Implicit duplicates](#implicit)

[3. Missing values](#missing)

## Introduction <a id='intro'></a>

The goal of this analysis is to clean the data and prepare it for building a web app.

## 1. Data overview <a id='overview'></a>

In [78]:
import pandas as pd

In [79]:
data = pd.read_csv('../vehicles_us.csv')

In [80]:
data.head(10)

| | price | model_year | model | condition | cylinders | fuel | odometer | transmission | type | paint_color | is_4wd | date_posted | days_listed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9400 | 2011.0 | bmw x5 | good | 6.0 | gas | 145000.0 | automatic | SUV | NaN | 1.0 | 2018-06-23 | 19 |
| 1 | 25500 | NaN | ford f-150 | good | 6.0 | gas | 88705.0 | automatic | pickup | white | 1.0 | 2018-10-19 | 50 |
| 2 | 5500 | 2013.0 | hyundai sonata | like new | 4.0 | gas | 110000.0 | automatic | sedan | red | NaN | 2019-02-07 | 79 |
| 3 | 1500 | 2003.0 | ford f-150 | fair | 8.0 | gas | NaN | automatic | pickup | NaN | NaN | 2019-03-22 | 9 |
| 4 | 14900 | 2017.0 | chrysler 200 | excellent | 4.0 | gas | 80903.0 | automatic | sedan | black | NaN | 2019-04-02 | 28 |
| 5 | 14990 | 2014.0 | chrysler 300 | excellent | 6.0 | gas | 57954.0 | automatic | sedan | black | 1.0 | 2018-06-20 | 15 |
| 6 | 12990 | 2015.0 | toyota camry | excellent | 4.0 | gas | 79212.0 | automatic | sedan | white | NaN | 2018-12-27 | 73 |
| 7 | 15990 | 2013.0 | honda pilot | excellent | 6.0 | gas | 109473.0 | automatic | SUV | black | 1.0 | 2019-01-07 | 68 |
| 8 | 11500 | 2012.0 | kia sorento | excellent | 4.0 | gas | 104174.0 | automatic | SUV | NaN | 1.0 | 2018-07-16 | 19 |
| 9 | 9200 | 2008.0 | honda pilot | excellent | NaN | gas | 147191.0 | automatic | SUV | blue | 1.0 | 2019-02-15 | 17 |


In [81]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51525 entries, 0 to 51524
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   price         51525 non-null  int64  
 1   model_year    47906 non-null  float64
 2   model         51525 non-null  object 
 3   condition     51525 non-null  object 
 4   cylinders     46265 non-null  float64
 5   fuel          51525 non-null  object 
 6   odometer      43633 non-null  float64
 7   transmission  51525 non-null  object 
 8   type          51525 non-null  object 
 9   paint_color   42258 non-null  object 
 10  is_4wd        25572 non-null  float64
 11  date_posted   51525 non-null  object 
 12  days_listed   51525 non-null  int64  
dtypes: float64(4), int64(2), object(7)
memory usage: 5.1+ MB


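Conclusions:
- The column names are consistently styled (lowercase, words separated by underscores), so no renaming is needed.
- Five columns contain missing values: model_year, cylinders, odometer, paint_color, and is_4wd. The data must be preprocessed before any results can be considered reliable.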
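
The column-name check above was done by eye; a minimal sketch of the same check in code follows. The `SNAKE_CASE` pattern is an assumed definition of "good style" (lowercase letters, digits, single underscores), not something the notebook specifies.

```python
import re

import pandas as pd

# Column names copied from the data.info() output above.
columns = pd.Index([
    'price', 'model_year', 'model', 'condition', 'cylinders', 'fuel',
    'odometer', 'transmission', 'type', 'paint_color', 'is_4wd',
    'date_posted', 'days_listed',
])

# Assumed convention: starts with a lowercase letter, then lowercase
# letters or digits, with single underscores between words.
SNAKE_CASE = re.compile(r'[a-z][a-z0-9]*(_[a-z0-9]+)*')

# Collect every name that does not fully match the convention.
bad_names = [name for name in columns if not SNAKE_CASE.fullmatch(name)]
print(bad_names)
```

An empty list means every column already follows the assumed convention. On the loaded DataFrame, `data.columns` could be used in place of the hard-coded index.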

In [82]:
data.describe(include='all').T

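| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| price | 51525.0 | NaN | NaN | NaN | 12132.46492 | 10040.803015 | 1.0 | 5000.0 | 9000.0 | 16839.0 | 375000.0 |
| model_year | 47906.0 | NaN | NaN | NaN | 2009.75047 | 6.282065 | 1908.0 | 2006.0 | 2011.0 | 2014.0 | 2019.0 |
| model | 51525.0 | 100.0 | ford f-150 | 2796.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| condition | 51525.0 | 6.0 | excellent | 24773.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| cylinders | 46265.0 | NaN | NaN | NaN | 6.125235 | 1.66036 | 3.0 | 4.0 | 6.0 | 8.0 | 12.0 |
| fuel | 51525.0 | 5.0 | gas | 47288.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| odometer | 43633.0 | NaN | NaN | NaN | 115553.461738 | 65094.611341 | 0.0 | 70000.0 | 113000.0 | 155000.0 | 990000.0 |
| transmission | 51525.0 | 3.0 | automatic | 46902.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| type | 51525.0 | 13.0 | SUV | 12405.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| paint_color | 42258.0 | 12.0 | white | 10029.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |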


## 2. Duplicated values <a id='duplicated'></a>

### Obvious duplicates <a id='obvious'></a>

In [83]:
data.duplicated().sum()

0

Conclusion: there are no obvious duplicated rows in the dataset.

### Implicit duplicates <a id='implicit'></a>

To check for implicit duplicates in the 'model' column, print the unique model names sorted alphabetically.

In [84]:
data['model'].sort_values().unique()

array(['acura tl', 'bmw x5', 'buick enclave', 'cadillac escalade',
       'chevrolet camaro', 'chevrolet camaro lt coupe 2d',
       'chevrolet colorado', 'chevrolet corvette', 'chevrolet cruze',
       'chevrolet equinox', 'chevrolet impala', 'chevrolet malibu',
       'chevrolet silverado', 'chevrolet silverado 1500',
       'chevrolet silverado 1500 crew', 'chevrolet silverado 2500hd',
       'chevrolet silverado 3500hd', 'chevrolet suburban',
       'chevrolet tahoe', 'chevrolet trailblazer', 'chevrolet traverse',
       'chrysler 200', 'chrysler 300', 'chrysler town & country',
       'dodge charger', 'dodge dakota', 'dodge grand caravan',
       'ford econoline', 'ford edge', 'ford escape', 'ford expedition',
       'ford explorer', 'ford f-150', 'ford f-250', 'ford f-250 sd',
       'ford f-250 super duty', 'ford f-350 sd', 'ford f150',
       'ford f150 supercrew cab xlt', 'ford f250', 'ford f250 super duty',
       'ford f350', 'ford f350 super duty', 'ford focus', 'ford focus

The following groups are alternative spellings of the same model:
- ford f-150, ford f150
- ford f-250, ford f250
- ford f-250 sd, ford f-250 super duty, ford f250 super duty
- ford f-350 sd, ford f350 super duty

According to the official [Ford website](https://www.ford.com/trucks/f150/models/?intcmp=vhp-seconNav-modselect), the correct spelling is "ford f-model_number" (for example, ford f-150).

The `replace()` method removes these implicit duplicates by mapping each variant to its corrected name. The result is stored in a new column, 'model_clean', so the original 'model' column stays unchanged.

In [85]:
data['model_clean'] = data['model'].replace({
    'ford f150': 'ford f-150',
    'ford f250': 'ford f-250',
    'ford f350 super duty': 'ford f-350 sd',
})

In [86]:
data['model_clean'] = data['model_clean'].replace(['ford f250 super duty', 'ford f-250 super duty'], 'ford f-250 sd')

Checking that no implicit duplicates remain in the column 'model_clean'.

In [87]:
data['model_clean'].sort_values().unique()

array(['acura tl', 'bmw x5', 'buick enclave', 'cadillac escalade',
       'chevrolet camaro', 'chevrolet camaro lt coupe 2d',
       'chevrolet colorado', 'chevrolet corvette', 'chevrolet cruze',
       'chevrolet equinox', 'chevrolet impala', 'chevrolet malibu',
       'chevrolet silverado', 'chevrolet silverado 1500',
       'chevrolet silverado 1500 crew', 'chevrolet silverado 2500hd',
       'chevrolet silverado 3500hd', 'chevrolet suburban',
       'chevrolet tahoe', 'chevrolet trailblazer', 'chevrolet traverse',
       'chrysler 200', 'chrysler 300', 'chrysler town & country',
       'dodge charger', 'dodge dakota', 'dodge grand caravan',
       'ford econoline', 'ford edge', 'ford escape', 'ford expedition',
       'ford explorer', 'ford f-150', 'ford f-250', 'ford f-250 sd',
       'ford f-350 sd', 'ford f150 supercrew cab xlt', 'ford f350',
       'ford focus', 'ford focus se', 'ford fusion', 'ford fusion se',
       'ford mustang', 'ford mustang gt coupe 2d', 'ford ranger',
 

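Conclusion: the Ford F-series spelling variants listed above are merged in 'model_clean'. The names 'ford f150 supercrew cab xlt' and 'ford f350' were not part of the mapping and keep their unhyphenated spelling; neither has a hyphenated counterpart in the list, so they are not duplicates. The column is ready for further preprocessing.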
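
As an alternative to listing every variant by hand, the same renaming could be expressed as two pattern-based rules. The sketch below is an assumption about how that might look, not the method used in this notebook. It runs on a handful of names taken from the listing above. Unlike the explicit mapping, it also rewrites names that have no hyphenated counterpart, such as 'ford f350'.

```python
import pandas as pd

# A few spellings taken from the unique-values listing above.
models = pd.Series([
    'ford f150', 'ford f-150', 'ford f250 super duty',
    'ford f-250 super duty', 'ford f-250 sd', 'ford f350', 'ford focus',
])

normalized = (
    models
    # Insert the hyphen: "f150" -> "f-150" (only "f" plus exactly three digits).
    .str.replace(r'\bf(\d{3})\b', r'f-\1', regex=True)
    # Abbreviate "super duty" to "sd", matching the notebook's target spelling.
    .str.replace(r'\bsuper duty\b', 'sd', regex=True)
)
print(sorted(normalized.unique()))
# → ['ford f-150', 'ford f-250 sd', 'ford f-350', 'ford focus']
```

The explicit dictionary is easier to audit because every change is spelled out; the pattern version is shorter but can touch names that were never reviewed, so its output should be inspected before it replaces the mapping.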

## 3. Missing values <a id='missing'></a>

In [88]:
data.isna().sum()

price               0
model_year       3619
model               0
condition           0
cylinders        5260
fuel                0
odometer         7892
transmission        0
type                0
paint_color      9267
is_4wd          25953
date_posted         0
days_listed         0
model_clean         0
dtype: int64

The following columns have missing values:
* model_year
* cylinders
* odometer
* paint_color
* is_4wd
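
Absolute counts are easier to compare as a share of all rows. The sketch below recomputes that share from the non-null counts reported by `data.info()`; the variable names `n_rows`, `non_null`, and `missing_share` are illustrative only.

```python
import pandas as pd

# Total rows and non-null counts copied from the data.info() output.
n_rows = 51525
non_null = pd.Series({
    'model_year': 47906,
    'cylinders': 46265,
    'odometer': 43633,
    'paint_color': 42258,
    'is_4wd': 25572,
})

# Percentage of rows in which each column is missing, rounded to one decimal.
missing_share = ((n_rows - non_null) / n_rows * 100).round(1)
print(missing_share.sort_values(ascending=False))
```

On the loaded DataFrame, `data.isna().mean() * 100` gives the same percentages directly. The column is_4wd is missing in roughly half of the rows, far more often than any other column.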

In [89]:
group_model = data.groupby(['model'])['model_year'].count()
group_model

model
acura tl             224
bmw x5               246
buick enclave        257
cadillac escalade    295
chevrolet camaro     392
                    ... 
toyota sienna        308
toyota tacoma        769
toyota tundra        568
volkswagen jetta     485
volkswagen passat    324
Name: model_year, Length: 100, dtype: int64

Missing values in 'model_year' are filled with the median model year of the same model. The result is stored in a new column, 'model_year_clean'.

In [95]:
data['model_year_clean'] = data['model_year'].fillna(data.groupby('model')['model_year'].transform('median'))
data.head()

| | price | model_year | model | condition | cylinders | fuel | odometer | transmission | type | paint_color | is_4wd | date_posted | days_listed | model_clean | model_year_clean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9400 | 2011.0 | bmw x5 | good | 6.0 | gas | 145000.0 | automatic | SUV | NaN | 1.0 | 2018-06-23 | 19 | bmw x5 | 2011.0 |
| 1 | 25500 | NaN | ford f-150 | good | 6.0 | gas | 88705.0 | automatic | pickup | white | 1.0 | 2018-10-19 | 50 | ford f-150 | 2011.0 |
| 2 | 5500 | 2013.0 | hyundai sonata | like new | 4.0 | gas | 110000.0 | automatic | sedan | red | NaN | 2019-02-07 | 79 | hyundai sonata | 2013.0 |
| 3 | 1500 | 2003.0 | ford f-150 | fair | 8.0 | gas | NaN | automatic | pickup | NaN | NaN | 2019-03-22 | 9 | ford f-150 | 2003.0 |
| 4 | 14900 | 2017.0 | chrysler 200 | excellent | 4.0 | gas | 80903.0 | automatic | sedan | black | NaN | 2019-04-02 | 28 | chrysler 200 | 2017.0 |
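
A group-wise median fill leaves a value missing whenever every row of that group is missing. Whether that happens in this dataset is not shown here, so the sketch below uses toy data (not taken from the dataset) to illustrate the behavior and one possible fallback. The fallback to the overall median is an assumption, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy data: model "a" has two known years, model "b" has none.
toy = pd.DataFrame({
    'model': ['a', 'a', 'a', 'b', 'b'],
    'model_year': [2010.0, 2014.0, np.nan, np.nan, np.nan],
})

# Median per model, aligned with the original rows (NaN for model "b").
group_median = toy.groupby('model')['model_year'].transform('median')

# Fill from the group median first, then fall back to the overall median
# for groups with no known year at all (assumed fallback).
toy['model_year_clean'] = (
    toy['model_year']
    .fillna(group_median)
    .fillna(toy['model_year'].median())
)
print(toy)
```

After a fill like this, `toy['model_year_clean'].isna().sum()` is a quick way to confirm that no gaps remain.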
