## Linear Regression Extension

Car owners now have plenty of options for selling their car, one of which is an online marketplace. Cars2u is one such website and, to help potential customers, they would like you to build a model that will estimate the selling price.

Explore the data below (cleaning where necessary) and build a model that predicts selling price. Create new features if you feel this is appropriate. Once you have a model you are satisfied with, write a report that explains to a non-technical stakeholder (e.g. a customer) how the model works and how reliable it is.

<a href='https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho'>Documentation.</a> This dataset is from 2020.

## KSBs
Key KSBs you can evidence when writing a regression project:
- K13 and K14
- S10, S11 and an aspect of S13

[Portfolio Tracker](https://applied.multiverse.io/pluginfile.php/46450/mod_label/intro/Portfolio%20Tracker%202.0.xlsx)

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import cross_val_score
import statsmodels.api as sm

In [None]:
cars=pd.read_csv('car_details.csv')
cars.head()

In [None]:
# For context, prices are in Indian rupees (roughly ₹100 per £1)
# Could convert with something like: cars['selling_price'] = cars['selling_price']/100

In [None]:
cars.shape

In [None]:
cars.columns

In [None]:
cars.isnull().sum()

In [None]:
cars.duplicated().sum()

In [None]:
cars.drop_duplicates(inplace=True)

In [None]:
cars.describe(include='all')

Using `DataFrame.describe()` is an easy way of getting descriptive statistics on your data set. Broadly discussing the importance of statistics to analysis meets part of **K13**. Defining descriptive analytics and describing the benefit of using them in your project, or more broadly in your role, meets part of **K14**. Applying descriptive statistics for exploratory data analysis meets part of **S10**.

In [None]:
sns.boxplot(data=cars)
plt.show()

In [None]:
plt.hist(cars.selling_price)
plt.show()

Boxplots are a great way of identifying outliers. Once you have identified them, you can justify your choice to retain or remove them. Demonstrating outlier detection can partially meet **S13**.

There are significant outliers which are heavily skewing the data (which will affect results). Combined with the fact that most owners won't be selling cars priced above ₹1500000, this justifies keeping only the cars priced below that figure.
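
If you want a less hand-picked cut-off, one common option is the 1.5 × IQR rule that boxplots use for their whiskers. The sketch below applies it to a small made-up list of prices (illustrative values, not taken from the dataset); `iqr_bounds` is a hypothetical helper name, not something from pandas or seaborn.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences using the k * IQR rule (hypothetical helper)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up prices in rupees, for illustration only
prices = np.array([150_000, 200_000, 250_000, 300_000, 350_000, 400_000, 5_000_000])

lower, upper = iqr_bounds(prices)
outliers = prices[(prices < lower) | (prices > upper)]
print(f"Fences: {lower:.0f} to {upper:.0f}")
print(f"Flagged as outliers: {outliers}")
```

Whichever rule you use, the portfolio evidence comes from explaining why you kept or removed the flagged values.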


In [None]:
cars=cars[cars['selling_price']<1500000].copy()  # copy so later column assignments don't raise SettingWithCopyWarning

In [None]:
plt.hist(cars.selling_price, bins=20)
plt.show()

You could run normality tests here if you wish, depending on your data (see Module 8 workshop 3). This would give you the opportunity to describe and demonstrate inferential statistics (aspects of **S10** and **K13**), but there will be opportunities later.
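
As a rough sketch of what such a test could look like, the example below runs a Shapiro-Wilk test from SciPy on synthetic, right-skewed prices and again on their logarithm. The data is generated purely for illustration and the choice of Shapiro-Wilk is an assumption; the workshop may use a different test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic right-skewed "prices" standing in for selling_price
skewed = rng.lognormal(mean=12.5, sigma=0.6, size=500)

# Null hypothesis: the sample comes from a normal distribution
stat_raw, p_raw = stats.shapiro(skewed)
stat_log, p_log = stats.shapiro(np.log(skewed))

print(f"Raw prices: W={stat_raw:.3f}, p={p_raw:.4f}")
print(f"Log prices: W={stat_log:.3f}, p={p_log:.4f}")
```

A p-value below your chosen significance level means you reject the hypothesis that the data is normally distributed.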

In [None]:
cars.dtypes

In [None]:
cars.head()

In [None]:
cars.owner.value_counts()

In [None]:
cars['not_new']=cars.owner!='First Owner'

# Due to class imbalance it makes sense to differentiate between first owner and not instead of dummies

[Class imbalance intro](https://machinelearningmastery.com/what-is-imbalanced-classification/)

In [None]:
cars.head()

In [None]:
cars.seller_type.value_counts()

In [None]:
cars['seller_individual']=cars.seller_type=='Individual'

# Because of the class imbalance it makes sense to differentiate between individual and not as opposed to dummies

In [None]:
cars.head()

In [None]:
cars.fuel.value_counts()
# CNG, LPG and Electric won't make much impact on our model due to low numbers so will group as 'other'

In [None]:
cars['fuel']=cars.fuel.apply(lambda x: 'Other' if x in ['CNG','LPG','Electric'] else x)

In [None]:
cars.transmission.value_counts()

In [None]:
cars_dummy=pd.get_dummies(cars,columns=['fuel','transmission'])

In [None]:
cars_dummy.head()

In [None]:
cars_dummy.dtypes

In [None]:
# Pairplot not that valuable on dummy data. Generally Pairplot can aid in EDA/variable selection
#sns.pairplot(cars_dummy.select_dtypes(include=[np.number]))


In [None]:
fig, ax = plt.subplots(figsize=(10,10))
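# numeric_only=True skips the remaining text columns, which recent pandas versions refuse to correlate
sns.heatmap(cars_dummy.corr(numeric_only=True), annot=True, ax=ax)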
plt.show()

If you aren't intending to run the OLS model from statsmodels, you could run individual correlation tests (Pearson's or Spearman's, see Module 8 workshop 3) as hypothesis tests. The p-value is the probability that you would have found a result at least as extreme as the current one if the correlation coefficient were in fact zero (the null hypothesis). If this probability is lower than the conventional 5%, the correlation coefficient is called statistically significant. If it is not statistically significant, you have little justification for including that independent variable in the model. This covers aspects of **K13** and **S10**.
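
A minimal sketch of such a test, using `scipy.stats.pearsonr` and `scipy.stats.spearmanr` on synthetic data. The mileage and price values are generated for illustration only, so the numbers say nothing about the real dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic stand-ins for the km_driven and selling_price columns
km_driven = rng.uniform(5_000, 150_000, size=300)
selling_price = 600_000 - 2.0 * km_driven + rng.normal(0, 80_000, size=300)

alpha = 0.05  # significance level, set before running the test

r, p_pearson = stats.pearsonr(km_driven, selling_price)      # linear relationship
rho, p_spearman = stats.spearmanr(km_driven, selling_price)  # monotonic relationship

print(f"Pearson r={r:.2f}, p={p_pearson:.3g}, significant: {p_pearson < alpha}")
print(f"Spearman rho={rho:.2f}, p={p_spearman:.3g}, significant: {p_spearman < alpha}")
```

Pearson's assumes a roughly linear relationship, while Spearman's works on ranks and is more forgiving of skewed data and outliers.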

In [None]:
carX=cars_dummy[['km_driven','not_new','fuel_Petrol', 'fuel_Other', 'transmission_Manual']]
cary=cars_dummy['selling_price']

In [None]:
# random_state fixes the split so the scores are reproducible between runs
X_train,X_test,y_train,y_test=train_test_split(carX,cary, train_size=0.8, random_state=42)

In [None]:
carlm=LinearRegression()
carlm.fit(X_train,y_train)  # fit on the DataFrame so feature names match at predict/score time

In [None]:
cross_val=cross_val_score(carlm,carX,cary,cv=5)  # 5-fold R-squared on the full data
preds=carlm.predict(X_test)
print(f'Train score: {carlm.score(X_train,y_train)}')
print(f'Test score: {carlm.score(X_test,y_test)}')
print(f'Cross-val score: {cross_val.mean()}')
# scikit-learn metrics expect the arguments in the order (y_true, y_pred)
print(f'MAE: {mean_absolute_error(y_test,preds)}')
print(f'RMSE: {np.sqrt(mean_squared_error(y_test,preds))}')

The metrics suggest that about 34% of the variation in sale prices can be explained by this model. That is not a huge amount, but considering that many car sale prices are settled by haggling, it isn't too unexpected.

The error metrics show that, on average, a prediction is off by roughly ₹200000.

Overall this does make the model somewhat unreliable, but it does provide a starting point based on the features provided, with ₹200000 of room for negotiation.

Once you have built a model and have metrics to evaluate it, you can add, withdraw or change independent variables in a new instance of the model and compare the results to optimise it.
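
One way to make that comparison systematic is to loop over candidate feature lists and cross-validate each one. This is only a sketch on synthetic data: the `demo` DataFrame and the `age_years` feature are hypothetical and stand in for whatever new feature you engineer.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n = 400
# Synthetic data for illustration only
demo = pd.DataFrame({
    "km_driven": rng.uniform(5_000, 150_000, n),
    "transmission_Manual": rng.integers(0, 2, n),
    "age_years": rng.integers(1, 15, n),  # hypothetical engineered feature
})
demo["selling_price"] = (
    900_000
    - 1.5 * demo["km_driven"]
    - 200_000 * demo["transmission_Manual"]
    - 30_000 * demo["age_years"]
    + rng.normal(0, 60_000, n)
)

feature_sets = {
    "baseline": ["km_driven", "transmission_Manual"],
    "with_age": ["km_driven", "transmission_Manual", "age_years"],
}

scores = {}
for name, cols in feature_sets.items():
    # A fresh model per feature set; the default score for regressors is R-squared
    cv = cross_val_score(LinearRegression(), demo[cols], demo["selling_price"], cv=5)
    scores[name] = cv.mean()
    print(f"{name}: mean cross-val R-squared = {scores[name]:.3f}")
```

Keeping the scores in a dictionary makes it easy to tabulate the comparison in your report.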

**S11** 

Comparing train and test scores (and cross-validation scores) is a technique for checking whether your model is overfitting.

R-squared, MAE and RMSE are all metrics that can help you evaluate model performance (interpret them first).
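
To help with interpreting them, here is a short sketch that computes all three metrics by hand with NumPy. The actual and predicted prices are made up for illustration.

```python
import numpy as np

# Made-up actual and predicted prices in rupees
y_true = np.array([300_000, 450_000, 250_000, 600_000, 400_000], dtype=float)
y_pred = np.array([320_000, 400_000, 300_000, 550_000, 380_000], dtype=float)

errors = y_true - y_pred

mae = np.mean(np.abs(errors))         # average size of an error, in rupees
rmse = np.sqrt(np.mean(errors ** 2))  # like MAE, but penalises large errors more

ss_res = np.sum(errors ** 2)                    # variation the model fails to explain
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation in the actual prices
r2 = 1 - ss_res / ss_tot                        # share of variation explained

print(f"MAE: {mae:.0f}, RMSE: {rmse:.0f}, R-squared: {r2:.3f}")
```

RMSE is never smaller than MAE, and a large gap between the two suggests a few predictions are badly off.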

In [None]:
effect_df=pd.DataFrame(carlm.coef_,index=X_train.columns,columns=['Effect']).sort_values(by='Effect')

In [None]:
display(effect_df)
fig, ax = plt.subplots(figsize=(8,6))
plt.barh(effect_df.index, effect_df['Effect'])
plt.show()
print(f'The intercept from our model is: {carlm.intercept_}')

The factors that affect sale price the most are the fuel type and transmission. This model adds around ₹250000 to the predicted selling price of a car which uses diesel compared with one that uses petrol or an 'other' fuel type.

A manual transmission takes ₹288310 off the predicted sale price compared with an automatic.

For every 1 km driven, the predicted price goes down by ₹1.66.

The intercept is the value of y when all x values are 0. Here it is the model's price estimate for a car which is new (first owner), with 0 km on the clock, a diesel engine and an automatic transmission.
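
To see how the intercept and coefficients combine into a prediction, the sketch below rebuilds one prediction by hand and checks it against `.predict()`. The data is synthetic and uses only two features, so the coefficients will not match the ones quoted above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 200
# Synthetic data for illustration only
X_demo = pd.DataFrame({
    "km_driven": rng.uniform(0, 150_000, n),
    "transmission_Manual": rng.integers(0, 2, n),
})
y_demo = (
    700_000
    - 1.5 * X_demo["km_driven"]
    - 250_000 * X_demo["transmission_Manual"]
    + rng.normal(0, 50_000, n)
)

lm = LinearRegression().fit(X_demo, y_demo)

# One car: 50,000 km on the clock, manual transmission
car = pd.DataFrame({"km_driven": [50_000], "transmission_Manual": [1]})

# prediction = intercept + sum(coefficient * feature value)
by_hand = lm.intercept_ + float(np.dot(lm.coef_, car.iloc[0].to_numpy(dtype=float)))
by_model = lm.predict(car)[0]

# With every feature at 0, the prediction is just the intercept
baseline = pd.DataFrame({"km_driven": [0], "transmission_Manual": [0]})
at_zero = lm.predict(baseline)[0]

print(f"By hand: {by_hand:.0f}, by model: {by_model:.0f}, intercept: {lm.intercept_:.0f}")
```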

**Interpreting your model output is an important aspect of S11**

If you then use the .predict() function in context, you can use this as a jumping-off point in your results to define and evaluate predictive and prescriptive analytics (**K14**).

In [None]:
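#Example code
# Values follow the column order used to fit the model:
# km_driven, not_new, fuel_Petrol, fuel_Other, transmission_Manual
New_values = [[50000,1,1,0,1]]
carlm.predict(New_values)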

## Alternate/Additional Method - OLS with statsmodels

In [None]:
carX
#np.asarray(carX)
# The data needs to be numeric for OLS, so use np.asarray above or .astype(float) below


In [None]:
#carX.astype(float)

In [None]:
# Using statsmodels returns a p-value for each of our variables,
# so before running the model we should lay out our null and alternate hypotheses and set out our significance level:
# Null: the variable's coefficient is zero (no linear relationship with selling price)
# Alternate: the variable's coefficient is not zero
# alpha = 0.05

carX = sm.add_constant(carX)
model= sm.OLS(cary,carX.astype(float), hasconst=True) 

res=model.fit()

In [None]:
res.summary()

If using OLS from statsmodels instead of scikit-learn, remember to interpret the model outputs (intercept/coefficients).

The p-values are lower than the 0.05 significance level, so we can reject the null hypothesis and say there is a statistically significant relationship between our independent variables and our dependent variable. It therefore makes sense to retain these independent variables in our model. Aspects of **K13** and **S10**.

The step below adds extra rigour. I haven't seen this in a portfolio before.

In [None]:
plt.hist(res.resid)
plt.show()

If your residuals are approximately normally distributed, the normality assumption behind the model holds, and model inference (confidence intervals, model predictions) should also be valid.
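
A histogram can be hard to judge by eye, so one possible numeric companion is a Q-Q fit from `scipy.stats.probplot`. It returns a correlation coefficient that sits close to 1 when the residuals line up with a normal distribution. The two residual arrays below are synthetic stand-ins for `res.resid`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
# Synthetic residuals for illustration only
normal_resid = rng.normal(0, 50_000, size=500)
skewed_resid = rng.exponential(50_000, size=500) - 50_000

# probplot returns ((theoretical quantiles, ordered values), (slope, intercept, r))
(_, _), (_, _, r_normal) = stats.probplot(normal_resid, dist="norm")
(_, _), (_, _, r_skewed) = stats.probplot(skewed_resid, dist="norm")

print(f"Q-Q correlation, normal residuals: {r_normal:.3f}")
print(f"Q-Q correlation, skewed residuals: {r_skewed:.3f}")
```

In the notebook you would pass `res.resid` instead of the synthetic arrays.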