In [13]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [14]:
df = pd.read_csv('/home/leeladhar/Documents/Data_analysis_EDA/data.csv')

In [15]:
df.head()

,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
0,BMW,1 Series M,2011,premium unleaded (required),335.0,6.0,MANUAL,rear wheel drive,2.0,"Factory Tuner,Luxury,High-Performance",Compact,Coupe,26,19,3916,46135
1,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,28,19,3916,40650
2,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Coupe,28,20,3916,36350
3,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Coupe,28,18,3916,29450
4,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,28,18,3916,34500


# Feature Details
- Make: the manufacturer of the vehicle
- Model: the model of the vehicle
- Year: the year the vehicle was manufactured
- Engine Fuel Type: the type of fuel the vehicle's engine uses
- Engine HP: the horsepower of the engine
- Engine Cylinders: the number of cylinders in the engine
- Transmission Type: the type of transmission the vehicle has
- Driven_Wheels: the type of drivetrain the vehicle has (e.g., front wheel drive, rear wheel drive, etc.)
- Number of Doors: the number of doors the vehicle has
- Market Category: the vehicle's market category
- Vehicle Size: the size of the vehicle
- Vehicle Style: the style of the vehicle (e.g., sedan, SUV, coupe, etc.)
- highway MPG: the miles per gallon the vehicle gets on the highway
- city mpg: the miles per gallon the vehicle gets in the city
- Popularity: a measure of the vehicle's popularity
- MSRP: the manufacturer's suggested retail price (the target variable in this analysis)

In [16]:
df.shape

(11914, 16)

In [17]:
df.columns

Index(['Make', 'Model', 'Year', 'Engine Fuel Type', 'Engine HP',
       'Engine Cylinders', 'Transmission Type', 'Driven_Wheels',
       'Number of Doors', 'Market Category', 'Vehicle Size', 'Vehicle Style',
       'highway MPG', 'city mpg', 'Popularity', 'MSRP'],
      dtype='object')

In [18]:
df.describe()

,Year,Engine HP,Engine Cylinders,Number of Doors,highway MPG,city mpg,Popularity,MSRP
count,11914.0,11845.0,11884.0,11908.0,11914.0,11914.0,11914.0,11914.0
mean,2010.384338,249.38607,5.628829,3.436093,26.637485,19.733255,1554.911197,40594.74
std,7.57974,109.19187,1.780559,0.881315,8.863001,8.987798,1441.855347,60109.1
min,1990.0,55.0,0.0,2.0,12.0,7.0,2.0,2000.0
25%,2007.0,170.0,4.0,2.0,22.0,16.0,549.0,21000.0
50%,2015.0,227.0,6.0,4.0,26.0,18.0,1385.0,29995.0
75%,2016.0,300.0,6.0,4.0,30.0,22.0,2009.0,42231.25
max,2017.0,1001.0,16.0,4.0,354.0,137.0,5657.0,2065902.0


In [19]:
df.dtypes

Make                  object
Model                 object
Year                   int64
Engine Fuel Type      object
Engine HP            float64
Engine Cylinders     float64
Transmission Type     object
Driven_Wheels         object
Number of Doors      float64
Market Category       object
Vehicle Size          object
Vehicle Style         object
highway MPG            int64
city mpg               int64
Popularity             int64
MSRP                   int64
dtype: object

In [20]:
df.isnull().sum()

Make                    0
Model                   0
Year                    0
Engine Fuel Type        3
Engine HP              69
Engine Cylinders       30
Transmission Type       0
Driven_Wheels           0
Number of Doors         6
Market Category      3742
Vehicle Size            0
Vehicle Style           0
highway MPG             0
city mpg                0
Popularity              0
MSRP                    0
dtype: int64

# Missing Values
- Engine Fuel Type: 3 missing values
- Engine HP: 69 missing values
- Engine Cylinders: 30 missing values
- Number of Doors: 6 missing values
- Market Category: 3742 missing values

Let's handle these missing values:

- "Engine Fuel Type", "Engine Cylinders", and "Number of Doors": I'll fill the missing values with the mode (most common value) of each column. "Engine Fuel Type" is categorical; the other two are stored as floats but take only a handful of discrete values, so the mode is a reasonable fill for them as well.
- "Engine HP": I'll use the median to fill the missing values, as this column is numerical and the median is less sensitive to outliers than the mean.
- "Market Category": 3742 of the 11914 rows (about 31%) are missing. To keep things simple, I'll create a new category called "Unknown" for the missing values in this column.

After handling missing values, I'll convert the categorical variables into numerical ones using one-hot encoding. However, I'll only encode the columns with a reasonable number of unique categories to avoid creating a very high-dimensional dataset.
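As a rough illustration of the choices above, the sketch below uses a small made-up frame (the values are hypothetical, not taken from data.csv) to show the share of missing values per column and how a single outlier moves the mean much more than the median.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror the dataset, values are invented.
toy = pd.DataFrame({
    "Engine HP": [150.0, 200.0, np.nan, 250.0, 1000.0],
    "Market Category": ["Luxury", None, None, "Performance", "Luxury"],
})

# Fraction of missing values per column.
missing_share = toy.isnull().mean()
print(missing_share)

# The single 1000 HP outlier pulls the mean up far more than the median.
hp_mean = toy["Engine HP"].mean()      # (150 + 200 + 250 + 1000) / 4 = 400.0
hp_median = toy["Engine HP"].median()  # midpoint of 200 and 250 = 225.0
print(hp_mean, hp_median)

# Fill numeric gaps with the median and categorical gaps with a new label.
filled = toy.assign(**{
    "Engine HP": toy["Engine HP"].fillna(hp_median),
    "Market Category": toy["Market Category"].fillna("Unknown"),
})
print(filled)
```

Filling with 225 keeps the imputed row near typical values, whereas the mean of 400 would place it above three of the four observed cars.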

In [22]:
df.nunique()

Make                   48
Model                 915
Year                   28
Engine Fuel Type       10
Engine HP             356
Engine Cylinders        9
Transmission Type       5
Driven_Wheels           4
Number of Doors         3
Market Category        71
Vehicle Size            3
Vehicle Style          16
highway MPG            59
city mpg               69
Popularity             48
MSRP                 6049
dtype: int64

In [25]:
df['Make'].value_counts()

Chevrolet        1123
Ford              881
Volkswagen        809
Toyota            746
Dodge             626
Nissan            558
GMC               515
Honda             449
Mazda             423
Cadillac          397
Mercedes-Benz     353
Suzuki            351
BMW               334
Infiniti          330
Audi              328
Hyundai           303
Volvo             281
Subaru            256
Acura             252
Kia               231
Mitsubishi        213
Lexus             202
Buick             196
Chrysler          187
Pontiac           186
Lincoln           164
Oldsmobile        150
Land Rover        143
Porsche           136
Saab              111
Aston Martin       93
Plymouth           82
Bentley            74
Ferrari            69
FIAT               62
Scion              60
Maserati           58
Lamborghini        52
Rolls-Royce        31
Lotus              29
Tesla              18
HUMMER             17
Maybach            16
Alfa Romeo          5
McLaren             5
Spyker    

- Make: 48 unique values
- Model: 915 unique values
- Engine Fuel Type: 10 unique values
- Transmission Type: 5 unique values
- Driven_Wheels: 4 unique values
- Market Category: 71 unique values (72 once missing values are labeled "Unknown")
- Vehicle Size: 3 unique values
- Vehicle Style: 16 unique values

The "Model" and "Market Category" columns have a high number of unique values. One-hot encoding these columns would significantly increase the dimensionality of the dataset, which could lead to overfitting and longer training times.

For simplicity, I'll one-hot encode the "Make", "Engine Fuel Type", "Transmission Type", "Driven_Wheels", "Vehicle Size", and "Vehicle Style" columns.
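To make the dimensionality argument concrete, here is a small back-of-the-envelope sketch. It assumes one indicator column per category, which is the default behavior of `pd.get_dummies`, and uses the unique-value counts reported by `df.nunique()`.

```python
# Unique-value counts copied from the df.nunique() output.
encoded = {
    "Make": 48,
    "Engine Fuel Type": 10,
    "Transmission Type": 5,
    "Driven_Wheels": 4,
    "Vehicle Size": 3,
    "Vehicle Style": 16,
}
# Market Category is counted as 72: the 71 observed values plus "Unknown".
skipped = {"Model": 915, "Market Category": 72}

added_columns = sum(encoded.values())
avoided_columns = sum(skipped.values())
print(added_columns)    # 86
print(avoided_columns)  # 987
```

Encoding the six selected columns yields 86 indicator columns, while "Model" and "Market Category" alone would have added another 987.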

In [9]:
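# Handling missing values.
# Assign the result back instead of calling fillna(..., inplace=True) on a column:
# the in-place form is chained assignment, which warns in recent pandas versions
# and does not update df when Copy-on-Write is enabled.
df['Engine Fuel Type'] = df['Engine Fuel Type'].fillna(df['Engine Fuel Type'].mode()[0])
df['Engine HP'] = df['Engine HP'].fillna(df['Engine HP'].median())
df['Engine Cylinders'] = df['Engine Cylinders'].fillna(df['Engine Cylinders'].mode()[0])
df['Number of Doors'] = df['Number of Doors'].fillna(df['Number of Doors'].mode()[0])
df['Market Category'] = df['Market Category'].fillna('Unknown')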

In [11]:
# One-hot encoding for categorical columns with reasonable unique values
df_encoded = pd.get_dummies(df, columns=['Make', 'Engine Fuel Type', 'Transmission Type', 'Driven_Wheels', 'Vehicle Size', 'Vehicle Style'])

One-hot encoding is a technique used to convert categorical variables into a numerical format that machine learning algorithms can use. Categorical variables often contain labels like 'red', 'blue', 'small', 'large', etc., which most algorithms cannot consume directly because they require numerical input. One-hot encoding removes this obstacle in three steps:

- Identify the unique categories in the column.
- Create a new column for each unique category.
- Populate these new columns with '0's and '1's based on the original column's values.

Advantages:

- Represents each category without implying any order between categories.
- Suitable for any machine learning algorithm, not just linear models.

Disadvantages:

- Can lead to high memory consumption for columns with many unique values.
- The transformed dataset can be large and sparse, which might slow down model training.
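The three steps can be sketched by hand on a toy column and compared with `pd.get_dummies`. The `color` column and its values are invented for illustration; `dtype=int` is passed so the output uses 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical toy column; the values are invented for illustration.
toy = pd.DataFrame({"color": ["red", "blue", "red", "green"]})

# Step 1: identify the unique categories (sorted for a stable column order).
categories = sorted(toy["color"].unique())

# Steps 2 and 3: one new column per category, filled with 0 or 1.
manual = pd.DataFrame(
    {f"color_{c}": (toy["color"] == c).astype(int) for c in categories}
)
print(manual)

# pandas performs the same transformation in a single call.
auto = pd.get_dummies(toy, columns=["color"], dtype=int)
print(auto)
```

Each row ends up with exactly one 1 across the three new columns, which is where the name "one-hot" comes from.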

In [12]:

# Separating predictors and target
X = df_encoded.drop(columns=['MSRP', 'Model', 'Market Category'])
y = df_encoded['MSRP']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Training the linear regression model
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predicting on training and testing sets
y_train_pred = regressor.predict(X_train)
y_test_pred = regressor.predict(X_test)

# Calculating RMSE and R2 for the training set
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
r2_train = r2_score(y_train, y_train_pred)

# Calculating RMSE and R2 for the testing set
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))
r2_test = r2_score(y_test, y_test_pred)

# Printing the performance metrics
print(f'Training set: RMSE = {rmse_train}, R2 = {r2_train}')
print(f'Testing set: RMSE = {rmse_test}, R2 = {r2_test}')


Training set: RMSE = 24995.453342510424, R2 = 0.840615979160612
Testing set: RMSE = 19236.764859202045, R2 = 0.8447483709062453


The R2 score, also known as the coefficient of determination, measures how well the predictions from a regression model approximate the actual data points. It indicates the goodness-of-fit of a model: a score of 1 means the model fits the data perfectly, while a score close to 0 means the model explains almost none of the variability of the response around its mean. On the training data of a linear regression with an intercept it lies between 0 and 1; on held-out data it can be negative when the model predicts worse than simply using the mean.
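A minimal sketch of the definition, R2 = 1 - SS_res / SS_tot, on invented numbers. The helper name `r2` is hypothetical; in the notebook itself the value comes from `sklearn.metrics.r2_score`. The last case shows that predictions worse than the mean give a value below 0.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0]
perfect = r2(y_true, [1.0, 2.0, 3.0])    # 1.0
mean_only = r2(y_true, [2.0, 2.0, 2.0])  # 0.0
reversed_ = r2(y_true, [3.0, 2.0, 1.0])  # -3.0
print(perfect, mean_only, reversed_)
```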

MSE (Mean Squared Error): The average of the squared errors, where each error is the difference between a predicted value and the corresponding actual value.

RMSE (Root Mean Squared Error): The square root of the MSE. RMSE is in the same units as the response variable, which is helpful for interpretation.

MAE (Mean Absolute Error): The average of the absolute errors. It's less sensitive to outliers compared to MSE and RMSE.
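The three error metrics can be compared on a handful of invented values. This sketch uses plain NumPy; the arrays are hypothetical and chosen so that one prediction is far off.

```python
import numpy as np

# Hypothetical values; the last prediction misses by 20.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 30.0, 60.0])

errors = y_pred - y_true        # [2, -2, 0, 20]
mse = np.mean(errors ** 2)      # (4 + 4 + 0 + 400) / 4 = 102.0
rmse = np.sqrt(mse)             # about 10.1, in the same units as y
mae = np.mean(np.abs(errors))   # (2 + 2 + 0 + 20) / 4 = 6.0
print(mse, rmse, mae)
```

The single large miss contributes 400 of the 408 total squared error, so RMSE (about 10.1) ends up well above MAE (6.0).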