# BMW Used Car Sales - EDA

Note: This notebook covers BMW used car sales data only.

The dataset contains information on price, transmission, mileage, fuel type, road tax, miles per gallon (mpg), and engine size.

In [None]:
# library
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.formula.api import ols
import missingno as msno

%matplotlib inline

In [None]:
df = pd.read_csv('../input/used-car-dataset-ford-and-mercedes/bmw.csv')
df['model'] = df['model'].str.strip() # 'model' values had a leading space

In [None]:
df.head()

Nothing looks off in the initial printout.

In [None]:
df.info()

Are the data types appropriate?
- model : String, so object seems appropriate. May need to be converted to category.
- year : Ambiguous for now. It could be the year the sale occurred or the year the car was produced, so it is unclear whether an integer or a category is more appropriate. Leaving it as integer for now.
- price : The currency unit is unknown, but integer works.
- transmission : Object data type works. May convert to category.
- mileage : Integer is sufficient.
- fuelType : Object is sufficient. May convert to category.
- tax : Integer is sufficient. The currency unit is unknown, however.
- mpg : Float is necessary for decimals. Good choice.
- engineSize : Engine capacity in liters, usually rounded to one decimal place. Float is appropriate.

Fortunately, there are no null values in any of the columns.
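The dtype notes above can be tried out on a small stand-in frame. This is only a sketch: the `toy` frame and its values are made up for illustration and are not rows from the BMW file.

```python
import pandas as pd

# Toy frame standing in for the BMW data (values are invented for illustration)
toy = pd.DataFrame({
    'model': ['1 Series', 'X3', 'i3'],
    'transmission': ['Manual', 'Automatic', 'Automatic'],
    'fuelType': ['Petrol', 'Diesel', 'Hybrid'],
    'price': [11200, 27000, 19500],
})

# Convert the string columns to the 'category' dtype
cat_like = ['model', 'transmission', 'fuelType']
toy = toy.astype({col: 'category' for col in cat_like})

# Count nulls per column; all zeros means there are no missing values
null_counts = toy.isna().sum()

print(toy.dtypes)
print(null_counts)
```

The same two calls (`astype` with a column-to-dtype mapping, and `isna().sum()`) should carry over to `df` unchanged.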

In [None]:
df.describe()

## Look into categorical variables

In [None]:
cat_col = ['model', 'year', 'transmission', 'fuelType']
sns.set_style('whitegrid')
for col in cat_col:
    unique = df[col].nunique()
    print('There are {} unique values in "{}" column'.format(unique, col))

In [None]:
# Create a Countplot function
def col_count(col):
    plt.figure(figsize=(14,6))
    sns.countplot(y=col, data=df.sort_values(col)).set_title('Count of {}'.format(col))

In [None]:
col_count('model')

Judging from the values in the 'model' column, BMW's used cars follow a system of sub-models, since the names share the same prefix or suffix:
- "Series", "M", "X", "Z", "i"

According to a quick Wikipedia search, these sub-models indicate a series of models with a consistent vehicle class.
For example, 5 Series cars are listed as "Mid-size luxury car" while 8 Series models are labeled "Grand tourer".
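If the sub-model family is ever needed as its own column, it could be derived from the model name. The helper below is a hypothetical sketch (the name `model_family` and the 'Other' fallback are my own assumptions), relying only on the prefix/suffix pattern noted above.

```python
import re

def model_family(model_name):
    """Return the family label for a model name (hypothetical helper)."""
    name = model_name.strip()
    # Names such as '5 Series' share the 'Series' suffix
    if name.endswith('Series'):
        return 'Series'
    # Names such as 'X3', 'M4', 'Z4', 'i8' share a leading letter
    match = re.match(r'[A-Za-z]', name)
    return match.group(0) if match else 'Other'

for example in [' 5 Series', 'X3', 'M4', 'Z4', 'i3']:
    print(example.strip(), '->', model_family(example))
```

On the real data this could be applied with `df['model'].map(model_family)`.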

In [None]:
col_count('year')

The counts drop significantly for cars made before 2013. This may matter if I run a regression analysis, since years with few samples can have an outsized influence on the estimated coefficients.
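One way to make the sparse years explicit is to count rows per year and flag those under a cutoff. This is a sketch on made-up years; the cutoff `min_samples` is a hypothetical value, not one derived from the data.

```python
import pandas as pd

# Made-up 'year' values for illustration only
years = pd.Series(
    [2010, 2012, 2012, 2016, 2016, 2016, 2019, 2019, 2019, 2019],
    name='year',
)

min_samples = 3  # hypothetical cutoff for "too few samples"

# Count rows per year, then keep the years that fall under the cutoff
year_counts = years.value_counts().sort_index()
sparse_years = [int(y) for y in year_counts[year_counts < min_samples].index]

print(year_counts)
print(sparse_years)
```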

In [None]:
col_count('transmission')

There are a good number of samples available for each type of transmission.

In [None]:
col_count('fuelType')

There may not be enough information on Electric or 'Other' fuel types due to their low sample counts.

## Look into numerical variables

In [None]:
num_col = ['price', 'mileage', 'tax', 'mpg', 'engineSize']
cat_col = ['model', 'year', 'transmission', 'fuelType']

def plot_kde(cat):
    plt.figure(figsize=(14,6))
    sns.histplot(x=cat, data=df, kde=True)
    plt.title('Distribution of {}'.format(cat))

In [None]:
# Create ecdf function

def plot_ecdf(data, variable):
    x = np.sort(data)
    y = np.arange(1, len(data)+1)/len(data)
    
    x_norm = np.sort(np.random.normal(data.mean(), data.std(), len(data)))
    
    plt.figure(figsize=(14,6))
    sns.scatterplot(x=x, y=y, label = variable)
    sns.scatterplot(x=x_norm, y=y, label='Normal distribution')
    plt.title('ECDF of {}'.format(variable))

In [None]:
plot_kde('price')

In [None]:
plot_ecdf(df['price'], 'price')

In [None]:
plot_ecdf(np.log(df['price']), 'ln(price)')

Takeaway:

- 'Price' distribution has a tail to the right
- Natural log of price seems to have a normal distribution
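The visual impression from the ECDF plots can be backed by a number such as sample skewness. The sketch below uses synthetic log-normal "prices" (an assumption for illustration, not the BMW data) to show how the log transform removes a right tail.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed "prices" drawn from a log-normal distribution
prices = pd.Series(rng.lognormal(mean=10, sigma=0.5, size=5000))

# Skewness near 0 indicates a symmetric distribution;
# a clearly positive value indicates a tail to the right
raw_skew = prices.skew()
log_skew = np.log(prices).skew()

print('skew of price     :', round(raw_skew, 3))
print('skew of ln(price) :', round(log_skew, 3))
```

On the notebook's data the equivalent calls would be `df['price'].skew()` and `np.log(df['price']).skew()`.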

In [None]:
plot_kde('mileage')

Many of the cars in the data are from recent production years, so this distribution is as expected.

In [None]:
plot_kde('mpg')

There may be some outliers in the mpg values, or there may be some eco-friendly vehicles with extremely high mpg (note that the mpg value for electric cars is a conversion of their energy consumption).

Let us investigate whether any purely oil-based (non-hybrid) cars appear among those with extremely high mpg values.

In [None]:
high_mpg = df[df['mpg'] > 100]
high_mpg.value_counts('fuelType')

In [None]:
very_high_mpg = df[df['mpg'] > 400]
very_high_mpg.value_counts('model')

In [None]:
df.loc[df['model'] == 'i3', 'mpg'].value_counts()

Every i3 has 470.8 as its mpg, which is probably an error. I will replace that value with np.nan.

In [None]:
df.loc[df['model'] == 'i3', 'mpg'] = np.nan

## Multi-variable examination

In [None]:
plt.figure(figsize=(14,6))
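# correlate numeric columns only (string columns such as 'model' cannot be correlated)
sns.heatmap(df.select_dtypes('number').corr(), annot=True)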
plt.title('Heatmap of correlation of BMW car sales data')

Takeaway:

- Car year (sold or produced?) has a strong, negative correlation with mileage (Newer cars have been driven fewer miles? Maybe)
- Car year (sold or produced?) has a strong, positive correlation with price
    - Depending on what the 'year' variable represents, there are two different interpretations of this result
    - If the 'year' variable represents the year in which the sale occurred, this suggests that trades made in recent years fetched higher prices
    - Otherwise, this suggests that newer cars fetch higher prices in the used market, which seems more plausible.
- Price has a strong, negative correlation with mileage (Cars used more tend to be cheaper? Probably)
- Price has a mild, positive correlation with engineSize (Powerful cars are more expensive? Probably)
- Tax has a mild, negative correlation with mpg (Less tax for eco-friendly vehicles? Plausible)
- Mpg has a weak, negative correlation with engineSize (More powerful cars are less eco-friendly? Plausible)

In [None]:
# Before going on with the analysis, work with models with significant number of samples (500+)
model_counts = df['model'].value_counts()
models = model_counts[model_counts.values > 500].index
models

df_models = df[df['model'].isin(models)].sort_values('model')

In [None]:
# Function to create point plots
def get_pointplot(x_col, y_col, hue_col, data):
    plt.figure(figsize=(14,6))
    sns.pointplot(x=x_col, y=y_col, hue=hue_col, data=data, alpha=0.5)
    plt.title('{} versus {}'.format(x_col, y_col))
    
def get_regplot(x_col, y_col, data):
    plt.figure(figsize=(14,6))
    sns.regplot(x=x_col, y=y_col, data=data, order=1)
    plt.title('{} versus {}'.format(x_col, y_col))

In [None]:
# Year and mileage
get_pointplot('year', 'mileage', 'model', df_models)

Takeaway:

- I believe the 'year' variable represents the year in which the car was produced.
- If that is true, newer cars tend to have lower mileage, and the confidence intervals for newer cars are fairly narrow.
- In addition, this suggests that we may be able to extract good insights about cars with a recent sale or model year.

In [None]:
# Year and price
get_pointplot('year', 'price', 'model', df_models)

As expected, older models tend to be cheaper than newer models.

In [None]:
get_regplot('mileage', 'price', df_models)

Greater mileage goes with lower price, but almost no car's price drops below a certain floor (prices stay above roughly 9000), which suggests that the relationship between mileage and price is probably not linear.

Then, how do we capture the magnitude of the relationship between mileage and price? We need to start by analyzing residual plots.

Residual plots are most often used to check whether a regression of one variable on another is a good fit.
Here, the residual plot visualizes the following:
1. The difference between the actual price and the predicted price on the y-axis (residuals)
2. Mileage on the x-axis

If our estimator is a good fit, we would expect the residuals to be distributed evenly above and below the y=0 line throughout the range of mileage. This would suggest that we are capturing the relationship between mileage and price without overshooting or undershooting.

However, if the residuals are distributed unevenly around the y=0 line, we would then look for different variable candidates by manipulating their scale. In this case, we may square, log, or otherwise transform the mileage variable.

From the earlier exercise, we discovered that the distribution of ln(price) closely matched the normal distribution. I believe a log transform is a good starting point, so I apply it to mileage next.
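To make the idea of a residual concrete, the sketch below computes residuals by hand with `np.polyfit` instead of relying on a plotting helper. The data are synthetic (an exponential price decay plus noise, chosen by me as an assumption) so that the uneven pattern described above is visible in numbers.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: price decays exponentially with mileage, plus noise
mileage = rng.uniform(1_000, 100_000, size=500)
price = 40_000 * np.exp(-mileage / 60_000) + rng.normal(0, 1_000, size=500)

# Fit a straight line (order-1 polynomial) by least squares
slope, intercept = np.polyfit(mileage, price, deg=1)
predicted = slope * mileage + intercept

# Residual = actual price minus predicted price
residuals = price - predicted

# With a curved true relationship, a straight line undershoots at the
# low-mileage end, so the residuals there are positive on average
low_end_mean = residuals[mileage < 10_000].mean()
middle_mean = residuals[(mileage > 40_000) & (mileage < 60_000)].mean()

print('slope             :', slope)
print('mean residual, low:', low_end_mean)
print('mean residual, mid:', middle_mean)
```

The residuals always average to roughly zero overall for a least-squares line with an intercept; it is their pattern across the x-axis that reveals a poor functional form.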

In [None]:
# Residual plot of price against ln(mileage)
plt.figure(figsize=(14,6))
sns.residplot(x=np.log(df_models['mileage']), y=df_models['price'], order=1, scatter_kws={'alpha' : 0.1})
plt.title('Residual plot of {} versus {} (order = 1)'.format('ln(mileage)', 'price'))

In [None]:
# Examine the scatterplot of ln(mileage) versus price
plt.figure(figsize=(14,6))
sns.scatterplot(x=np.log(df_models['mileage']), y=df_models['price'], hue=df_models['model'], alpha=0.3)
plt.title('Scatterplot of ln(mileage) and price')

If we look at the residual plot only, ln(mileage) does not seem like a good choice of variable for assessing the magnitude of its relationship with price via regression.

We can observe that:
1. Residuals mostly dip below 0 at the extremes of ln(mileage) values (e.g. 2 and 11), and
2. Residuals are concentrated above 0 around an ln(mileage) value of 9.

However, there are two clusters of data around ln(mileage)=2 and ln(mileage)=5 with very little mileage and little drop in price.
Note that ln(mileage) values of 2 and 5 correspond to approximately 7 miles and 150 miles. It means that those cars are practically new when it comes to mileage.

It may be wise to exclude cars with very little mileage, since they can heavily influence any regression that uses log-transformed variables.
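Excluding near-new cars before taking the log could look like the sketch below. The frame `cars` and the cutoff `min_mileage` are hypothetical; the right cutoff for the BMW data would need to be chosen by inspecting the clusters.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
cars = pd.DataFrame({
    'mileage': [8, 150, 5_000, 42_000, 90_000],
    'price': [38_000, 37_500, 30_000, 18_000, 11_000],
})

min_mileage = 1_000  # hypothetical cutoff for "practically new"

# Keep only cars at or above the cutoff, then add the log-scaled column
used = cars[cars['mileage'] >= min_mileage].copy()
used['ln_mileage'] = np.log(used['mileage'])

print(used)
```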

In [None]:
# For fun: residual plot of price against ln(mileage) with a 2nd-order polynomial fit
plt.figure(figsize=(14,6))
sns.residplot(x=np.log(df_models['mileage']), y=df_models['price'], order=2, scatter_kws={'alpha' : 0.1})
plt.title('Residual plot of {} versus {} (order = 2)'.format('ln(mileage)', 'price'))

A potential second option is to increase the order of the polynomial that the regression is run on.

In that case, a 2nd-order polynomial may be a candidate.
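Whether a higher order helps can also be checked numerically by comparing the residual sum of squares of the two fits. This is a minimal sketch on synthetic data with a built-in curvature (my assumption); `rss` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data with a curved relationship between ln(mileage) and price
ln_mileage = rng.uniform(7, 11.5, size=500)
price = 60_000 - 400 * ln_mileage ** 2 + rng.normal(0, 1_500, size=500)

def rss(x, y, degree):
    """Residual sum of squares of a least-squares polynomial fit."""
    coeffs = np.polyfit(x, y, deg=degree)
    fitted = np.polyval(coeffs, x)
    return float(np.sum((y - fitted) ** 2))

rss_linear = rss(ln_mileage, price, 1)
rss_quadratic = rss(ln_mileage, price, 2)

print('RSS, order 1:', rss_linear)
print('RSS, order 2:', rss_quadratic)
```

A higher order can never fit the training data worse, so a lower RSS alone does not justify it; the residual plot still needs to look even around y=0.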

***This begs the question...***

Which model retains its value the best?

I need to answer a few questions on my own before tackling this question.

1. Retaining car value against what?
    - Mileage is probably the best variable to measure against, as it tracks the level of usage regardless of when the car was produced.
    - However, running a simple regression puts us at risk of being influenced by the small number of samples at either extreme of the mileage values.
    - Instead of running a regression, I will assign each car a label based on its mileage quantile and compare the median prices.
2. Should the same model produced in different years be considered a different entity?
    - It is true that carmakers adjust cars produced in later years. However, the changes tend to be relatively minor. I will use model as the only distinct category in this analysis.
3. This is a used car dataset. How would you estimate the value retained after a new car is bought?
    - Unfortunately, the data covers used car sales only. Therefore, this analysis applies to used cars only, regardless of mileage. It cannot capture the discount of used cars relative to brand new cars.
    - Meanwhile, the dataset contains used cars with extremely low mileage, and they will be used as the benchmark when comparing the value retained after gaining mileage.
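The plan above (label cars by mileage quantile, then compare medians against the lowest tier) can be sketched on a toy table first. The models 'A' and 'B', the two tiers, and all numbers are invented for illustration.

```python
import pandas as pd

# Made-up cars for two hypothetical models
toy = pd.DataFrame({
    'model': ['A'] * 4 + ['B'] * 4,
    'mileage': [1_000, 20_000, 40_000, 80_000, 2_000, 25_000, 45_000, 85_000],
    'price': [30_000, 24_000, 20_000, 15_000, 40_000, 36_000, 30_000, 26_000],
})

# Split mileage into two equal-sized tiers (the notebook uses 8)
toy['tier'] = pd.qcut(toy['mileage'], 2, labels=['low', 'high']).astype(str)

# Median price per model and tier
medians = toy.pivot_table(values='price', index='model', columns='tier', aggfunc='median')

# Share of the low-mileage median price that remains in the high-mileage tier
retained = medians['high'] / medians['low']

print(medians)
print(retained)
```

Here the lowest tier plays the role of the "practically new" benchmark, so a ratio closer to 1 means better value retention.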

In [None]:
# group cars by mileage using qcut function
df_models['mileage_tier'] = pd.qcut(df_models['mileage'], 8)
df_models.value_counts(['model', 'mileage_tier']).sort_index()
plt.figure(figsize=(14,6))
sns.scatterplot(x=np.log(df_models['mileage']), y=df_models['price'], hue=df_models['mileage_tier'], alpha=0.5)
plt.title('ln(mileage) versus price')
plt.xlabel('ln(mileage)')

In [None]:
# gather the median price for individual mileage tiers
# (the median is more robust to outliers than the mean)
df_pivot = df_models.pivot_table(values='price', index='model', columns='mileage_tier', aggfunc='median')
df_pivot

In [None]:
# convert price to a proportion of the lowest-mileage tier (tier 0)
df_pivot = df_pivot.div(df_pivot.iloc[:, 0], axis=0)
df_pivot = df_pivot.iloc[:, 1:]  # drop tier 0, which is the baseline
df_pivot

In [None]:
# create a heatmap
plt.figure(figsize=(14,6))
sns.heatmap(df_pivot, annot=True)
plt.title('Value retention of BMW models over mileage (1 = 100% value retained)')

In [None]:
# Rank the models within each mileage tier (higher rank = better value retention)
df_rank = df_pivot.rank(axis=0, method='min')
df_rank

In [None]:
plt.figure(figsize=(14,6))
sns.heatmap(df_rank, annot=True, cmap='Greens')
plt.title('Value retention ranking of BMW models over mileage (7 = best)')

In [None]:
# Get average rank by model
df_averagerank = df_rank.mean(axis=1)
df_averagerank.sort_values(ascending=False)

Takeaway:
- Based on the average-rank assessment, the models ranked from best to worst value retention are:
    1. 2 Series (Compact car, Compact MPV)
    2. X3 (Compact Luxury SUV)
    3. 5 Series (Mid-size luxury car)
    4. 4 Series (Compact executive car)
    5. X1 (Sub-compact luxury SUV)
    6. 3 Series (Compact executive car)
    7. 1 Series (Hatchback, coupé, convertible, subcompact car)
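For readers unfamiliar with the average-rank assessment, here is how it behaves on a toy retention table. The models 'A', 'B', 'C' and their retention values are invented; only the `rank` and `mean` calls mirror the cells above.

```python
import pandas as pd

# Made-up retention proportions for three hypothetical models
retention = pd.DataFrame(
    {'tier_1': [0.95, 0.90, 0.90], 'tier_2': [0.80, 0.85, 0.70]},
    index=['A', 'B', 'C'],
)

# Rank within each tier; with method='min', tied models share the lowest rank
ranks = retention.rank(axis=0, method='min')

# Average the ranks across tiers; a higher average means better retention
average_rank = ranks.mean(axis=1).sort_values(ascending=False)

print(ranks)
print(average_rank)
```

Note that ranks discard the size of the gaps between models, so two models with nearly identical retention can still end up a full rank apart.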