<div style="color:white;
            display:fill;
            border-radius:10px;
            background-color:#0e2f52;
            font-size:80%;
            font-family:Verdana;
            letter-spacing:1px">
    <h1 style='padding: 20px;
              color:white;
              text-align:center;'>
        AUTOSCOUT24 CAR PRICE PREDICTION (EDA and ML)
    </h1>
    </div>

<h2 align="center"><font color=#20a0ff> Linear Regression and Regularization(Linear-Ridge-Lasso-ElasticNet) </font></h2> 

---
    
<p align="right">
  Duygu Jones | Data Scientist  | July 2024 
    <br>
  Follow me:
  <a href="https://duygujones.vercel.app/">duygujones.com</a> | 
  <a href="https://www.linkedin.com/in/duygujones/">Linkedin</a> | 
  <a href="https://github.com/Duygu-Jones">GitHub</a> | 
  <a href="https://www.kaggle.com/duygujones">Kaggle</a> | 
  <a href="https://medium.com/@duygujones">Medium</a> | 
  <a href="https://public.tableau.com/app/profile/duygu.jones/vizzes">Tableau</a>
</p>


## **Table of Contents**

1. [Introduction](#1.)
1. [Importing Libraries](#2.)
1. [Reading the Dataset](#3.)
1. [EXPLORATORY DATA ANALYSIS (EDA)](#4.)
    - [Categorical Features](#cat)
    - [Numerical Features](#num)
    - [Correlations](#corr)
    - [Outlier Analysis](#out)
1. [MACHINE LEARNING](#4.1)
1. [Train | Test Split](#5.)
1. [Implement Linear Regression](#6.)
1. [Implement Ridge Regression](#7.)
1. [Implement Lasso Regression](#8.)
1. [Implement Elastic-Net Regression](#9.)
1. [Feature Importance](#10.)
1. [Compare Models Performance](#11.)
1. [Final Model and Predictions](#12.)

## Introduction

- This project aims to predict car prices using the Auto Scout dataset from AutoScout24, containing features of 9 different car models. 
- By performing Exploratory Data Analysis (EDA) and implementing machine learning models, the goal is to gain insights into the data and build effective regression models for car price prediction.
- Additionally, these models can support automotive industry stakeholders in understanding market trends, optimizing pricing strategies, and making data-driven decisions. Consumers can also benefit by making informed choices when selecting vehicles.
- Ultimately, the goal is to leverage data-driven insights to enhance the understanding of the automotive market, improve pricing accuracy, and support sustainable development in the automotive industry. 

## Objectives

1. **Understand the dataset and its features.**
2. **Clean and prepare the data for modeling.**
3. **Implement various regression algorithms to predict car prices.**
4. **Optimize model performance by tuning hyperparameters and focusing on important features.**
5. **Compare the performance of different regression algorithms.**



*The dataset and results are used for educational purposes, demonstrating the application of advanced machine learning techniques on real-world data. We aim to build effective regression models to predict car prices and gain a deeper understanding of machine learning techniques.*

## About the Dataset
The **Auto Scout** data is sourced from the online car trading company [AutoScout24](https://www.autoscout24.com) in 2019 and contains various features of 9 different car models. This project uses a pre-processed and organized dataset to explore and understand machine learning algorithms, particularly for car price prediction using regression techniques.

**Dataset:** AutoScout24 Car Sales Dataset  
- **Content:** Data on various features of 9 different car models.  
- **Number of Rows:** 15,915  
- **Number of Columns:** 23  

**Columns** (`price` is the target; the others are inputs):
- **make_model:** The make and model of the car
- **body_type:** The type of the car (e.g., Sedan)
- **price:** The price of the car (in EUR)
- **vat:** VAT status
- **km:** The car's mileage
- **Type:** The condition of the car (e.g., used)
- **Fuel:** The type of fuel (e.g., Diesel, Petrol)
- **Gears:** Number of gears
- **Comfort_Convenience:** Comfort and convenience features
- **Entertainment_Media:** Entertainment and media features
- **Extras:** Extra features
- **Safety_Security:** Safety and security features
- **age:** The age of the car
- **Previous_Owners:** Number of previous owners
- **hp_kW:** Engine power (in kW)
- **Inspection_new:** New inspection status
- **Paint_Type:** Type of paint
- **Upholstery_type:** Type of upholstery
- **Gearing_Type:** Type of gearing (e.g., automatic)
- **Displacement_cc:** Engine displacement (in cc)
- **Weight_kg:** The weight of the car (in kg)
- **Drive_chain:** Type of drive (e.g., front-wheel drive)
- **cons_comb:** Combined fuel consumption (L/100 km)

<a id="2."></a>
# - Import the Libraries

In [1]:
import numpy as np 
import pandas as pd 

import matplotlib.pyplot as plt
import seaborn as sns

import plotly.express as px
import scipy.stats as stats

plt.rcParams["figure.figsize"] = (10, 6)

%matplotlib inline  

from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.linear_model import Lasso, LassoCV
from sklearn.linear_model import ElasticNet, ElasticNetCV

from sklearn.model_selection import cross_val_score, cross_validate

from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

from yellowbrick.regressor import ResidualsPlot, PredictionError

import warnings
warnings.filterwarnings("ignore")

<a id="3"></a>
# - Read the Dataset

In [2]:
df0 = pd.read_csv('/kaggle/input/autoscout24-car-sales2019/autoscout_car_sales.csv')
df = df0.copy()

<a id="4."></a>
# 1. EXPLORATORY DATA ANALYSIS (EDA)

In [None]:
df.head()

In [None]:
df.info()

# - Rename the Columns

In [None]:
df.columns

In [3]:
df.rename(columns={
    'make_model': 'make_model',
    'body_type': 'body_type',
    'price': 'price',
    'vat': 'vat',
    'km': 'km',
    'Type': 'type',
    'Fuel': 'fuel_type',
    'Gears': 'gears_num',
    'Comfort_Convenience': 'comfort_convenience',
    'Entertainment_Media': 'entertainment_media',
    'Extras': 'extras',
    'Safety_Security': 'safety_security',
    'age': 'age',
    'Previous_Owners': 'previous_owners',
    'hp_kW': 'hp_kw',
    'Inspection_new': 'inspection_new',
    'Paint_Type': 'paint_type',
    'Upholstery_type': 'upholstery_type',
    'Gearing_Type': 'gearing_type',
    'Displacement_cc': 'displacement_cc',
    'Weight_kg': 'weight_kg',
    'Drive_chain': 'drive_chain',
    'cons_comb': 'fuel_cons_comb'
}, inplace=True)


In [None]:
df.columns 

# - Check Missing Values

In [4]:
# Check out the missing values

missing_count = df.isnull().sum()
value_count = df.isnull().count()
missing_percentage = round(missing_count / value_count * 100, 2)
missing_df = pd.DataFrame({"count": missing_count, "percentage": missing_percentage})
missing_df

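| Column              | count | percentage |
|---------------------|-------|------------|
| make_model          | 0     | 0.0        |
| body_type           | 0     | 0.0        |
| price               | 0     | 0.0        |
| vat                 | 0     | 0.0        |
| km                  | 0     | 0.0        |
| type                | 0     | 0.0        |
| fuel_type           | 0     | 0.0        |
| gears_num           | 0     | 0.0        |
| comfort_convenience | 0     | 0.0        |
| entertainment_media | 0     | 0.0        |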


# - Check Duplicated Values

In [5]:
# Let's observe first the unique values

def get_unique_values(df):
    
    output_data = []

    for col in df.columns:

        # If the number of unique values in the column is less than or equal to 10
        if df.loc[:, col].nunique() <= 10:
            # Get the unique values in the column
            unique_values = df.loc[:, col].unique()
            # Append the column name, number of unique values, unique values, and data type to the output data
            output_data.append([col, df.loc[:, col].nunique(), unique_values, df.loc[:, col].dtype])
        else:
            # Otherwise, append only the column name, number of unique values, and data type to the output data
            output_data.append([col, df.loc[:, col].nunique(),"-", df.loc[:, col].dtype])

    output_df = pd.DataFrame(output_data, columns=['Column Name', 'Number of Unique Values', ' Unique Values ', 'Data Type'])

    return output_df

In [6]:
get_unique_values(df)

|   | Column Name         | Number of Unique Values | Unique Values                                         | Data Type |
|---|---------------------|-------------------------|-------------------------------------------------------|-----------|
| 0 | make_model          | 9                       | [Audi A1, Audi A2, Audi A3, Opel Astra, Opel C...     | object    |
| 1 | body_type           | 8                       | [Sedans, Station wagon, Compact, Coupe, Van, O...     | object    |
| 2 | price               | 2952                    | -                                                     | int64     |
| 3 | vat                 | 2                       | [VAT deductible, Price negotiable]                    | object    |
| 4 | km                  | 6691                    | -                                                     | float64   |
| 5 | type                | 5                       | [Used, Employee's car, New, Demonstration, Pre...     | object    |
| 6 | fuel_type           | 4                       | [Diesel, Benzine, LPG/CNG, Electric]                  | object    |
| 7 | gears_num           | 4                       | [7.0, 6.0, 5.0, 8.0]                                  | float64   |
| 8 | comfort_convenience | 6196                    | -                                                     | object    |
| 9 | entertainment_media | 346                     | -                                                     | object    |


In [7]:
# Duplicated Data

df_duplicated = df[df.duplicated()]
df_duplicated.shape

(1673, 23)

In [None]:
# Drop the duplicated rows

def duplicate_values(df):
    print("Duplicate check...")
    num_duplicates = df.duplicated(subset=None, keep='first').sum()
    if num_duplicates > 0:
        print("There are", num_duplicates, "duplicated observations in the dataset.")
        df.drop_duplicates(keep='first', inplace=True)
        print(num_duplicates, "duplicates were dropped!")
        print("No more duplicate rows!")
    else:
        print("There are no duplicated observations in the dataset.")

# Run the check; the function modifies df in place
duplicate_values(df)

Whether to drop duplicates depends on how the data was collected and on the purpose of the analysis.
- When repeated rows are genuine repeated observations (for example, when tracking changes over time), keeping them may be more appropriate.
- Here the duplicated rows are exact copies across all 23 columns, so they add no new information. I dropped them before fitting the linear models.
- Linear regression fits the general trend of the data points, so removing exact copies should not change the predictions much; it mainly stops repeated listings from being weighted more heavily than unique ones.
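
The duplicate handling described above can be sketched on a small toy frame. The values below are hypothetical and are not taken from the AutoScout24 data; the sketch only illustrates how `duplicated` and `drop_duplicates` treat exact copies.

```python
import pandas as pd

# Toy listings (hypothetical values, not from the AutoScout24 data)
toy = pd.DataFrame({
    "make_model": ["Audi A1", "Audi A1", "Opel Corsa", "Audi A1"],
    "km": [10000, 10000, 25000, 10000],
    "price": [15000, 15000, 9000, 15000],
})

# duplicated(keep="first") flags every repeat after the first occurrence
n_duplicates = int(toy.duplicated(keep="first").sum())

# drop_duplicates returns a new frame; reset_index keeps the index contiguous
deduplicated = toy.drop_duplicates(keep="first").reset_index(drop=True)

print(n_duplicates, deduplicated.shape)
```

Returning a new frame instead of using `inplace=True` is a design choice of this sketch; it keeps the original rows available for comparison.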

In [None]:
# Basic statistics summary of Numerical features

df.describe().T

In [None]:
# Basic statistics summary of Object features

df.describe(include= 'object').T

<a id="num"></a>
# 1.1. Numerical Features

         'price', 'km','gears_num','age','previous_owners','hp_kw','inspection_new','displacement_cc','weight_kg','fuel_cons_comb'

### Distributions of Numerical Features

In [None]:
numerical_df = df.select_dtypes(include=['number'])

plt.figure(figsize=(15, 10))

num_vars = len(numerical_df.columns)

for i, var in enumerate(numerical_df.columns, 1):
    plt.subplot((num_vars // 3) + 1, 3, i)
    sns.histplot(data=df, x=var, kde=True)
    plt.title(f'Distribution of {var}')
    
plt.tight_layout()
plt.show()

### Distributions of the Numerical Features:

1. **Price:** Distribution is right-skewed, with most prices concentrated between 10,000 and 20,000 EUR. Some higher-end outliers, particularly above 40,000 EUR.
2. **Km (Mileage):** Distribution is right-skewed, with most vehicles having low mileage. Values above 150,000 km are less common.
3. **Gears Number:** Most vehicles have 5, 6, or 7 gears. Vehicles with 8 gears are rare.
4. **Age:** Peaks at 0, 1, 2, and 3 years, indicating these are common ages for vehicles. Vehicles are generally up to 3 years old.
5. **Previous Owners:** Most vehicles have had one or two previous owners. Vehicles with 3 or more previous owners are less common.
6. **Hp_kw (Engine Power in kW):** Distribution is right-skewed, with most vehicles between 50 and 150 kW. Vehicles above 200 kW are less frequent.
7. **Inspection_new (New Inspection):** Concentration at the extremes (values of 0 and 1). Mid-range values are rare.
8. **Displacement_cc (Engine Displacement in cc):** A peak at 1500 cc. Values above 2000 cc are less common.
9. **Weight_kg:** Most vehicles are between 1000 and 1500 kg. Vehicles above 2000 kg are rare.
10. **Fuel Consumption Combined:** Approximately normal distribution, with most values between 4 and 6 l/100 km. Values above 8 l/100 km are less frequent.

### Overall:
- **Right-Skewed Distributions:** Many features like `price`, `km`, and `hp_kw` show right-skewed distributions. This skewness might affect the assumptions of linear models, which assume normally distributed residuals.
- **Peaks in Discrete Features:** Features such as `gears_num` and `age` take only a few distinct values and show clear peaks, indicating specific common values that could be significant predictors in the model.
- **Outliers:** While several features have some higher-end outliers, their impact on linear modeling needs careful assessment to ensure they do not disproportionately influence the model.
- **Data Transformation:** Consider data transformations (e.g., a log transformation) to reduce the skew of features such as `price` and `km`, which may improve the fit of linear models.
- **Feature Scaling:** Ensure all features are appropriately scaled, especially those with wide ranges like `km` and `price`, to ensure effective model training.

By addressing these considerations, linear modeling can be made more robust and accurate with this dataset.
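
As an illustration of the feature-scaling point, here is a minimal sketch that assumes synthetic stand-ins for `km` and `age`; the data, the coefficients, and the choice of `Ridge(alpha=1.0)` are assumptions made for the example. Wrapping the scaler in a pipeline means it is fitted on the training split only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-ins for two features on very different scales (hypothetical)
km = rng.uniform(0, 200_000, size=200)
age = rng.integers(0, 4, size=200)
X = np.column_stack([km, age])
y = 20_000 - 0.05 * km - 1_000 * age + rng.normal(0, 500, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fitting the pipeline fits the scaler on the training split only
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X_train, y_train)

# After scaling, each training feature has mean ~0 and standard deviation ~1
scaler = model.named_steps["standardscaler"]
X_train_scaled = scaler.transform(X_train)
test_r2 = model.score(X_test, y_test)

print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
print(round(test_r2, 3))
```

Scaling matters most for the regularized models, because the penalty acts on coefficient size and would otherwise depend on each feature's units.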

In [None]:
print(df['displacement_cc'].value_counts())

In [None]:
# Try merging the displacement_cc values that occur fewer than 10 times into a single 'other_disp' category.
# This is exploratory only; the change is not kept.

displacement = df.displacement_cc.value_counts()

displacement[displacement < 10 ].index

In [None]:
other_disp = list(displacement[displacement < 10 ].index)
other_disp

In [None]:
# Apply on the column
# df['displacement_cc'] = df['displacement_cc'].apply(lambda x: 'other_disp' if x in other_disp else x)

# print(df['displacement_cc'].value_counts())

In [None]:
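# sns.distplot is deprecated; histplot with a KDE overlay gives an equivalent view
sns.histplot(df['displacement_cc'], kde=True, stat='density')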

<a id="cat"></a>
# 1.2. Categorical Features

        'make_model', 'body_type', 'vat', 'type', 'fuel_type', 'comfort_convenience',
        'entertainment_media', 'extras', 'safety_security', 'paint_type',
        'upholstery_type', 'gearing_type', 'drive_chain'

### Distributions of Categorical Features

In [None]:
import plotly.graph_objects as go
import plotly.express as px

# Distribution of our categorical characteristics with bar graph


def plot_bar_graphs(df, columns):
    for column in columns:
        plt.figure(figsize=(10, 4))
        ax = sns.countplot(x=column,
                           data=df,
                           order=df[column].value_counts().index)
        ax.bar_label(ax.containers[0], rotation=45)
        plt.xlabel(column, fontsize=10)
        plt.ylabel('Count', fontsize=10)
        plt.title(f'Bar Graph of {column}', fontsize=10)
        plt.xticks(rotation=45, ha='right', fontsize=10)
        plt.show()


cat_features = [
    'make_model', 'body_type', 'vat', 'type', 'fuel_type', 'paint_type',
    'upholstery_type', 'gearing_type', 'drive_chain'
]

plot_bar_graphs(df, cat_features)

## 'Make_model' Feature

- As the distribution of `make_model` values above shows, Audi A2 has only one data point.
    - A single observation is not enough to represent this model.
    - It is insufficient for generalization, so it is excluded from the analysis.
    - Removing it also avoids a dummy column that is non-zero for a single row, which should help keep the model stable.
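
The same idea can be written as a general rule for rare categories. This is a sketch on hypothetical toy data; the `min_count` threshold is an assumption chosen for the example, not a value used in this notebook.

```python
import pandas as pd

# Toy data (hypothetical): one model appears only once
toy = pd.DataFrame({
    "make_model": ["Audi A1"] * 3 + ["Opel Corsa"] * 2 + ["Audi A2"],
    "price": [15000, 16000, 15500, 9000, 9500, 28000],
})

min_count = 2  # hypothetical threshold for "too few observations"

# Find the categories that occur fewer than min_count times
counts = toy["make_model"].value_counts()
rare_models = counts[counts < min_count].index.tolist()

# Keep only the rows whose category is not rare
filtered = toy[~toy["make_model"].isin(rare_models)].copy()

print(rare_models, len(filtered))
```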

In [None]:
models = df.make_model.value_counts()
models

In [None]:
# Dropping the only Audi A2 observation to increase model performance

df = df[df['make_model'] != 'Audi A2'] 

In [None]:
df[df.make_model=="Audi A2"]

<a id="corr"></a>
# 1.3. Correlations

In [None]:
# Target vs Numerical Features

numerical_df = df.select_dtypes(include=['number'])

for i in range(0, len(numerical_df.columns), 5):
    sns.pairplot(data=numerical_df,
                x_vars=numerical_df.columns[i:i+5],
                y_vars=['price'])

#   - Label Encoding 

In [None]:
# Label-encode the categorical features to see their correlation with the target variable.

from sklearn.preprocessing import LabelEncoder

# Copy the original dataframe to avoid modifying it directly
df_labeled = df.copy()

# List of categorical columns
categorical_columns = ['make_model', 'body_type', 'vat', 'type', 'fuel_type', 'comfort_convenience',
'entertainment_media', 'extras', 'safety_security', 'paint_type',
'upholstery_type', 'gearing_type', 'drive_chain']

# Apply Label Encoding to each categorical column
label_encoders = {}
for column in categorical_columns:
    le = LabelEncoder()
    df_labeled[column] = le.fit_transform(df_labeled[column])
    label_encoders[column] = le

# Display the first few rows of the labeled dataframe
print(df_labeled.head())

In [None]:
df_labeled.info()

#   - Heatmap

In [None]:
correlation_matrix = df_labeled.corr()

plt.figure(figsize=(25,20))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix Heatmap')
plt.show()

In [None]:
# Check for multicollinearity

numerical_df = df.select_dtypes(include=['number'])
corr = numerical_df.corr()  # compute the correlation matrix once

# True if any pair of features has a correlation in [0.9, 1)
multicollinearity_check1 = corr[(corr >= 0.9) & (corr < 1)].any().any()

# True if any pair of features has a correlation in (-1, -0.9]
multicollinearity_check2 = corr[(corr <= -0.9) & (corr > -1)].any().any()

print(multicollinearity_check1)
print(multicollinearity_check2)

In [None]:
correlation_matrix = df_labeled.corr() # Labeled df

price_corr = correlation_matrix['price'].sort_values(ascending= False)
price_corr

In [None]:
price_corr.plot(kind='bar',figsize=(15,6))

### NOTE:

1. **Positive Correlations with Price:**
   - **Gears Number (`gears_num`):** Shows a moderately strong positive correlation with `price` (0.53), indicating that cars with more gears tend to be more expensive.
   - **Engine Power (`hp_kw`):** Exhibits a moderate positive correlation with `price` (0.45), suggesting that higher-powered cars are generally more expensive.
   - **Engine Displacement (`displacement_cc`):** Has a moderate positive correlation with `price` (0.36), indicating that cars with larger engine displacements tend to be more expensive.

2. **Negative Correlations with Price:**
   - **Mileage (`km`):** Displays a negative correlation with `price` (-0.35), suggesting that cars with higher mileage tend to be cheaper.
   - **Age (`age`):** Exhibits a negative correlation with `price` (-0.21), indicating that older cars tend to be cheaper.
   - **Fuel Consumption (`fuel_cons_comb`):** Shows a moderate negative correlation with `price` (-0.30), indicating that cars with higher fuel consumption tend to be cheaper.

3. **Multicollinearity Considerations:**
   - **Gears Number (`gears_num`), Engine Power (`hp_kw`), and Engine Displacement (`displacement_cc`):** These features show strong correlations with each other and with `price`, suggesting potential multicollinearity.
   - **Mileage (`km`) and Age (`age`):** These features also show a strong correlation (0.76) with each other, indicating partial redundancy.

Managing multicollinearity among these highly correlated features is important for model stability and performance. Note that correlations involving label-encoded nominal features depend on the arbitrary integer codes, so they are only a rough guide.
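
One common way to quantify multicollinearity is the variance inflation factor (VIF), which equals the corresponding diagonal entry of the inverse correlation matrix. The sketch below uses synthetic data in which `km` is deliberately tied to `age`; the helper `vif_from_correlation` and all values are hypothetical and are not part of this notebook.

```python
import numpy as np
import pandas as pd


def vif_from_correlation(features: pd.DataFrame) -> pd.Series:
    """Return each column's VIF, read off the diagonal of the inverse correlation matrix."""
    corr = features.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=features.columns, name="VIF").sort_values(ascending=False)


rng = np.random.default_rng(0)
n = 500

# Synthetic features (hypothetical): km depends strongly on age, hp_kw is independent
age = rng.uniform(0, 3, size=n)
km = 20_000 * age + rng.normal(0, 8_000, size=n)
hp_kw = rng.normal(85, 20, size=n)

toy = pd.DataFrame({"age": age, "km": km, "hp_kw": hp_kw})
vif = vif_from_correlation(toy)

print(vif.round(2))
```

A VIF close to 1 means a feature is nearly uncorrelated with the others; values above roughly 5 to 10 are often treated as a warning sign, although the cut-off is a convention rather than a rule.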

<a id="out"></a>
# 1.4. Outlier Analysis

Linear models are generally sensitive to outliers because they seek linear relationships between data points and employ MSE-like loss functions that amplify large errors. Some models, such as decision trees and robust regression, are more resilient to outliers and are less affected by such data. Therefore, when choosing a model, the characteristics of the dataset and the presence of outliers should be considered.

- We will not intervene with outliers at the moment, but we can take action later according to the model's forecasting performance.
- Let's observe the outliers for now.
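
To illustrate the sensitivity described above, the following sketch fits ordinary least squares and scikit-learn's `HuberRegressor` to synthetic data with a few injected outliers. The data and the true slope of 2.0 are assumptions made for the example; robust regression is not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(1)

# Synthetic line y = 2x + 1 with small noise (hypothetical data)
x = np.linspace(0, 10, 60)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=x.size)

# Inject a few large outliers at the high end of x
y[-5:] += 40.0

X = x.reshape(-1, 1)
ols = LinearRegression().fit(X, y)   # squared-error loss
huber = HuberRegressor().fit(X, y)   # loss grows linearly for large residuals

ols_slope = float(ols.coef_[0])
huber_slope = float(huber.coef_[0])

print(round(ols_slope, 2), round(huber_slope, 2))
```

The squared-error fit is pulled toward the outliers, while the Huber fit should stay closer to the underlying slope.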

In [None]:
# Checking Outliers

# Numerical features
numerical_columns = ['price', 'km','gears_num','age','hp_kw','displacement_cc','weight_kg','fuel_cons_comb']

# Create a figure with specified size
plt.figure(figsize=(16, 4))

# Draw one boxplot per numerical column, side by side
for i, col in enumerate(numerical_columns, 1):
    plt.subplot(1, len(numerical_columns), i)
    sns.boxplot(data=df[col])
    plt.title(col)

# Show the plots
plt.tight_layout()  # Adjust subplots to fit in the figure area.
plt.show()

## Target Variable

In [None]:
# Checking outliers for target variable 'price' with boxplot

sns.boxplot(x=df['price']);

In [None]:
#Skewness of the target variable
print("Skewness: %f" % df['price'].skew())

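# Distribution of target variable before log transformation
# (sns.distplot is deprecated; histplot with a KDE overlay gives an equivalent view)
price_untransformed = sns.histplot(df['price'], kde=True, stat='density')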

As a rule of thumb:
- a skewness between -0.5 and 0.5 indicates a fairly symmetrical, bell-shaped distribution;
- a skewness between -1.0 and -0.5 or between 0.5 and 1.0 indicates moderate skew;
- a skewness below -1.0 or above 1.0 indicates a highly skewed distribution.

*In our case the skewness is ~1.2, so `price` is highly skewed.
We can now try a log transformation so that it looks more normally distributed.*
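
The rule of thumb above, together with the effect of a log transformation, can be sketched as follows. The prices are a synthetic log-normal sample rather than the real data, and `skew_label` is a hypothetical helper. The sketch uses `np.log1p`, whose inverse `np.expm1` maps values (or predictions) back to the original scale.

```python
import numpy as np
import pandas as pd


def skew_label(skewness: float) -> str:
    """Classify a skewness value using the rule of thumb described above."""
    magnitude = abs(skewness)
    if magnitude < 0.5:
        return "fairly symmetrical"
    if magnitude <= 1.0:
        return "moderately skewed"
    return "highly skewed"


rng = np.random.default_rng(7)

# Synthetic right-skewed prices (hypothetical, log-normal)
prices = pd.Series(rng.lognormal(mean=9.7, sigma=0.45, size=5_000))

raw_skew = prices.skew()
log_prices = np.log1p(prices)     # log(1 + x)
log_skew = log_prices.skew()

# expm1 inverts log1p, returning values to the original scale
recovered = np.expm1(log_prices)

print(skew_label(raw_skew), "->", skew_label(log_skew))
```

If the model is trained on the log of the target, its predictions must be passed through the inverse transform before error metrics are reported in EUR.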

In [None]:
# After Log Transformation
price_transformed = sns.histplot(np.log(df['price']), kde=True, stat='density')

In [None]:
# Apply Log Transformation for Target Variable

#df['log_price'] = np.log(df['price'])
#df[['price', 'log_price']].head()

In [None]:
# Make and Models 

plt.figure(figsize=(16,6))
sns.boxplot(x="make_model", y="price", data=df, whis=3, color="skyblue")
plt.show()

In [None]:
total_outliers = []

for model in df.make_model.unique():
    
    car_prices = df[df["make_model"]== model]["price"]
    
    Q1 = car_prices.quantile(0.25)
    Q3 = car_prices.quantile(0.75)
    IQR = Q3-Q1
    lower_lim = Q1-1.5*IQR
    upper_lim = Q3+1.5*IQR
    
    count_of_outliers = (car_prices[(car_prices < lower_lim) | (car_prices > upper_lim)]).count()
    
    total_outliers.append(count_of_outliers)
    
    outlier_rate = round(count_of_outliers / len(car_prices), 3)
    print(f"The count of outliers for {model:<15}: {count_of_outliers:<5}, the rate of outliers: {outlier_rate}")
print()
print("Total outliers:", sum(total_outliers), "The rate of total outliers:", round(sum(total_outliers) / len(df), 3))

- Some models, such as Opel Astra, Opel Insignia, and Renault Clio, have a high number of outliers, indicating significant price deviations.
- Models such as Audi A1, Opel Corsa, and Renault Duster have more consistent price distributions with fewer outliers.
- The overall outlier rate is 2.6% under the 1.5×IQR rule applied per model, so only a small share of listings is affected.
- Because linear models are sensitive to such points, these outliers should be kept in mind during analysis and modeling.
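
The per-model IQR rule used above can be wrapped in a small reusable helper. This is a sketch on hypothetical toy prices; `iqr_outlier_mask` is an illustrative name and not a function from this notebook.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR below Q1 or above Q3."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy prices (hypothetical): one Audi A1 listing is far above the rest
toy = pd.DataFrame({
    "make_model": ["Audi A1"] * 6 + ["Opel Corsa"] * 6,
    "price": [15000, 15500, 16000, 16500, 17000, 40000,
              9000, 9200, 9400, 9600, 9800, 10000],
})

# Apply the rule within each model so limits reflect that model's own price range
rows = []
for model, prices in toy.groupby("make_model")["price"]:
    mask = iqr_outlier_mask(prices)
    rows.append({
        "make_model": model,
        "count": int(mask.sum()),
        "rate": round(float(mask.mean()), 3),
    })

summary = pd.DataFrame(rows).set_index("make_model")
print(summary)
```

Computing the limits per model rather than over the whole dataset avoids flagging every listing of an expensive model as an outlier.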

In [None]:
#==================================================================================================================
# Calculate skewness for numeric features

# A skewness value greater than 1 indicates positive skewness,
# a skewness value less than -1 indicates negative skewness,
# and a skewness value close to zero indicates a relatively symmetric distribution.

num_cols= df.select_dtypes('number').columns

skew_limit = 0.75               # define a limit above which we will log transform
skew_vals = df[num_cols].skew()


# Showing the skewed columns
skew_cols = (skew_vals
             .sort_values(ascending=False)
             .to_frame()
             .rename(columns={0:'Skew'})
             .query('abs(Skew) > {}'.format(skew_limit)))
skew_cols

In [None]:
# Apply log transformation to skewed columns
# Note: this also transforms `price` if it is among the skewed columns

for col in skew_cols.index:
    # log1p computes log(1 + x), so zero values are handled (inputs must still be greater than -1)
    df[col] = np.log1p(df[col])

# Display the transformed dataframe
print(df.head())

In [None]:
df[skew_cols.index].skew()  # recompute the skewness after the transformation

In [None]:
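# Before and After Log Transformation of Target
# df['price'] may already have been log-transformed above, so the raw prices are taken
# from the original copy df0 (which still includes the rows dropped earlier)

fig, (ax_before, ax_after) = plt.subplots(1, 2, figsize=(10, 5))
df0['price'].hist(ax=ax_before)
df0['price'].apply(np.log1p).hist(ax=ax_after)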

#====================================================================================================================

---

## One-Hot Encoding

### Explanation: Sample Data Point

**Original Data:**

- The `comfort_convenience`, `entertainment_media`, `extras`, and `safety_security` columns each hold a comma-separated list of features in a single cell.

| comfort_convenience          | entertainment_media       | extras      | safety_security              |
|------------------------------|---------------------------|-------------|------------------------------|
| Air Conditioning, Cruise Control | Bluetooth, CD Player     | Alloy Wheels | ABS, Airbags                 |


**After One-Hot Encoding:**

- Each category has been converted into a separate column. 
- For example, `Air Conditioning` and `Cruise Control` are represented as separate columns with the `cc_` prefix. 
- Similarly, other categories are represented with their own prefixes, with the presence of a category indicated by 1 and its absence by 0.

| cc_Air Conditioning | cc_Cruise Control | em_Bluetooth | em_CD Player | ex_Alloy Wheels | ss_ABS | ss_Airbags |
|---------------------|-------------------|--------------|--------------|----------------|--------|------------|
| 1                   | 1                 | 1            | 1            | 1              | 1      | 1          |


This process lets machine learning models use these categorical features as numerical inputs.
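
The encoding described above can be sketched on a two-row toy frame. The rows are hypothetical, and the sketch assumes items are separated by a bare comma with no surrounding spaces; a space after the comma would end up inside the generated column name.

```python
import pandas as pd

# Toy rows (hypothetical), each cell holding a comma-separated list of features
toy = pd.DataFrame({
    "comfort_convenience": ["Air Conditioning,Cruise Control", "Air Conditioning"],
    "entertainment_media": ["Bluetooth,CD Player", "Bluetooth"],
})

# Prefix used for the indicator columns derived from each source column
prefixes = {"comfort_convenience": "cc_", "entertainment_media": "em_"}

# str.get_dummies splits each cell on the separator and builds 0/1 indicator columns
encoded_parts = [
    toy[column].str.get_dummies(sep=",").add_prefix(prefix)
    for column, prefix in prefixes.items()
]
encoded = pd.concat(encoded_parts, axis=1)

print(encoded)
```

Because one row can carry several items at once, more than one indicator per source column can be 1 in the same row.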

In [None]:
selected_columns = df.loc[:, ['comfort_convenience', 'entertainment_media', 'extras', 'safety_security']]
selected_columns.head()

In [None]:
# Get Dummies for Categorical Features with More Than One Value

df = df.join(df["comfort_convenience"].str.get_dummies(sep = ",").add_prefix("cc_"))
df = df.join(df["entertainment_media"].str.get_dummies(sep = ",").add_prefix("em_"))
df = df.join(df["extras"].str.get_dummies(sep = ",").add_prefix("ex_"))
df = df.join(df["safety_security"].str.get_dummies(sep = ",").add_prefix("ss_"))

In [None]:
# Drop the original features, as we don't need them anymore.

df.drop(["comfort_convenience","entertainment_media","extras","safety_security"], axis=1, inplace=True)

In [None]:
# One-Hot Encoding for the remaining categorical columns

df = pd.get_dummies(df, drop_first=True)

In [None]:
print(df.shape)

df.head(3)

In [None]:
# Convert boolean columns to integers, changing True to 1 and False to 0.

bool_columns = df.columns[df.dtypes == 'bool']
df[bool_columns] = df[bool_columns].astype(int)

In [None]:
df.head(3)

In [None]:
corr_by_price = df.corr()["price"].sort_values()[:-1]
corr_by_price
#================================================================================================================

---

<a id="4.1"></a>
# MACHINE LEARNING MODELLING

<a id="5."></a>
# Train | Test Split

<a id="6."></a>
# Linear Regression

## Model

## Cross Validation

## Pipeline

<a id="7."></a>
# Ridge Regression

<a id="8."></a>
# Lasso Regression

<a id="9."></a>
# Elastic-Net Regression

<a id="10."></a>
# Feature Importance

<a id="11."></a>
# Comparing Models Performance

<a id="12."></a>
# Final Model and Predictions

<a id='import'></a>
<div style="color:white;
            display:fill;
            border-radius:10px;
            background-color:#0e2f52;
            font-size:80%;
            font-family:Verdana;
            letter-spacing:1px">
    <h3 style='padding: 20px;
              color:white;
              text-align:center;'>
        Thank you...
    </h3>
    </div>
