# Data Preparation Project

## Name: Sameer Ahamed Rizwan Basha

## Student No: 202381922

Importing essential libraries and packages for data manipulation, analysis, and visualization.

In [1]:
# NumPy: A fundamental package for scientific computing with Python. It provides support for
# arrays (including multi-dimensional arrays), matrices, and a large collection of high-level
# mathematical functions to operate on these arrays.
import numpy as np

# Pandas: An open-source, BSD-licensed library providing high-performance, easy-to-use data
# structures and data analysis tools for the Python programming language. It's widely used for
# data manipulation and cleaning.
import pandas as pd

# Seaborn: A Python data visualization library based on matplotlib. It provides a high-level
# interface for drawing attractive and informative statistical graphics. It's built on top of
# matplotlib and closely integrated with pandas data structures.
import seaborn as sns

# Matplotlib: A comprehensive library for creating static, animated, and interactive visualizations
# in Python. It's the most widely used visualization library for Python and serves as a foundation for
# other libraries like Seaborn.
import matplotlib.pyplot as plt

# Plotly's graph_objs: A module that contains the functions that will generate graph objects for us.
# These graph objects are high-level wrappers around low-level dictionaries that define the style and
# contents of the objects that make up each Plotly plot (e.g., the lines in a line plot or the markers
# in a scatter plot).
import plotly.graph_objs as go

# Plotly Express: A terse, consistent, high-level API for creating figures. It is a wrapper around
# plotly.graph_objs and provides a simpler, more user-friendly interface to create common types of plots
# and charts (like scatter plots, line charts, bar charts, etc.) with less code.
import plotly.express as px

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor


In [2]:
def train_random_forest(X,y):    
    # Use the command train_test_split to divide the dataset
    X_train, X_test, y_train, y_test = train_test_split(X,y,
                                                        test_size=0.2,
                                                        shuffle=True,
                                                        random_state=42)
    # initialize the random forest classifier
    clf = RandomForestRegressor(random_state=42)
    # Fit the model
    clf.fit(X_train,y_train)
    # Check the model score
    score = clf.score(X_test,y_test)
    print(score)
    return score

## The dataset you have chosen, its variables, in particular the target variable.


### Loading Dataset

In [3]:
df=pd.read_csv("archive\CAR DETAILS FROM CAR DEKHO - Copy.csv")
df.head()

Unnamed: 0,name,yearkm_driven,fuel,seller_type,transmission,owner,selling_price
0,Maruti 800 AC,200770000,Petrol,Individual,Manual,First Owner,60000
1,Maruti Wagon R LXI Minor,200750000,Petrol,Individual,Manual,First Owner,135000
2,Hyundai Verna 1.6 SX,2012100000,Diesel,Individual,Manual,First Owner,600000
3,Datsun RediGO T Option,201746000,Petrol,Individual,Manual,First Owner,250000
4,Honda Amaze VX i-DTEC,2014141000,Diesel,Individual,Manual,Second Owner,450000


### EDA of the dataset

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6364 entries, 0 to 6363
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   name           6364 non-null   object
 1   yearkm_driven  6364 non-null   int64 
 2   fuel           6364 non-null   object
 3   seller_type    6364 non-null   object
 4   transmission   4688 non-null   object
 5   owner          6364 non-null   object
 6   selling_price  6364 non-null   int64 
dtypes: int64(2), object(5)
memory usage: 348.2+ KB


In [5]:
df.shape

(6364, 7)

In [6]:
df.describe()

Unnamed: 0,yearkm_driven,selling_price
count,6364.0,6364.0
mean,751834400.0,510739.6
std,2028224000.0,594384.2
min,20141.0,20000.0
25%,201250000.0,200000.0
50%,201577000.0,351500.0
75%,201870000.0,600000.0
max,20191000000.0,8900000.0


### Step 0_1 - Calculating base performance without any data preperation

In [7]:
# Create a train test split in the dataset
y = df['selling_price']
X = df.drop(['selling_price'],axis=1)

score_0_1 = train_random_forest(X,y)
print(score_0_1)

ValueError: could not convert string to float: 'Hyundai Santro Xing GL Plus'

I encountered a "ValueError" while working with some code. The error message I received was "could not convert string to float: 'Mahindra Scorpio S11 BSIV'". It appears that I was attempting to convert a string containing the text "Mahindra Scorpio S11 BSIV" into a floating-point number, but this conversion wasn't possible because the string didn't represent a numerical value.

To resolve this issue, I need to check the context in which I was trying to convert this string to a float. It seems like I might have been trying to perform a conversion on data that wasn't intended to be a numerical value.

## Step 0_2 : Encode the column 'name'

In [8]:
# Convert the 'name' column to a categorical data type

df['name'] = pd.Categorical(df['name'])

# Assign the encoded variable to a new column
df['name'] = df['name'].cat.codes

# del df['name']
df.dtypes

# Create a train test split in the dataset
y = df['selling_price']
X = df.drop(['selling_price'],axis=1)

score_0_2 = train_random_forest(X,y)
score_0_2

ValueError: could not convert string to float: 'Petrol'

I encountered a "ValueError" with the message "could not convert string to float: 'Diesel'". It seems I was attempting to convert the string "Diesel" into a floating-point number, but this conversion wasn't possible because "Diesel" is not a numerical value.

This error typically happens when I'm trying to perform an operation that expects numerical data, but I'm providing non-numeric data instead. To resolve this issue, I need to check the context in which this conversion is happening and ensure I'm providing the correct data types.

## Step 0_3 : Encode the column 'fuel'

In [9]:
# Convert the 'fuel' column to a categorical data type

df['fuel'] = pd.Categorical(df['fuel'])

# Assign the encoded variable to a new column
df['fuel'] = df['fuel'].cat.codes

# Create a train test split in the dataset
y = df['selling_price']
X = df.drop(['selling_price'],axis=1)

score_0_3 = train_random_forest(X,y)
score_0_3

ValueError: could not convert string to float: 'Individual'

I encountered a "ValueError" with the message "could not convert string to float: 'Individual'". It seems I was attempting to convert the string "Individual" into a floating-point number, but this conversion wasn't possible because "Individual" is not a numerical value.

This error typically occurs when I'm trying to perform an operation that expects numerical data, but I'm providing non-numeric data instead. To resolve this issue, I need to check the context in which this conversion is happening and make sure I'm providing the correct data types.

## Step 0_4 : Encode the column 'seller_type'

In [10]:
# Convert the 'seller_type' column to a categorical data type

df['seller_type'] = pd.Categorical(df['seller_type'])

# Assign the encoded variable to a new column
df['seller_type'] = df['seller_type'].cat.codes

# Create a train test split in the dataset
y = df['selling_price']
X = df.drop(['selling_price'],axis=1)

score_0_4 = train_random_forest(X,y)
score_0_4

ValueError: could not convert string to float: 'Manual'

I encountered a "ValueError" with the message "could not convert string to float: 'Manual'". It appears that I was attempting to convert the string "Manual" into a floating-point number, which is not possible because "Manual" is not a numerical value.

This error typically occurs when I'm trying to perform an operation that expects numerical data, but I'm providing non-numeric data instead. To resolve this issue, I need to check the context in which this conversion is happening and ensure that I'm providing the correct data types.

## Step 0_5 : Encode the column 'transmission'

In [11]:
# Convert the 'transmission' column to a categorical data type

df['transmission'] = pd.Categorical(df['transmission'])

# Assign the encoded variable to a new column
df['transmission'] = df['transmission'].cat.codes

# Create a train test split in the dataset
y = df['selling_price']
X = df.drop(['selling_price'],axis=1)

score_0_5 = train_random_forest(X,y)

ValueError: could not convert string to float: 'First Owner'

I encountered a "ValueError" with the message "could not convert string to float: 'First Owner'". It appears that I was attempting to convert the string "First Owner" into a floating-point number, which is not possible because "First Owner" is not a numerical value.

This error typically occurs when I'm trying to perform an operation that expects numerical data, but I'm providing non-numeric data instead. To resolve this issue, I need to check the context in which this conversion is happening and ensure that I'm providing the correct data types.

## Step 0_6 : Encode the column 'owner'

In [12]:
# Convert the 'owner' column to a categorical data type

df['owner'] = pd.Categorical(df['owner'])

# Assign the encoded variable to a new column
df['owner'] = df['owner'].cat.codes

# Create a train test split in the dataset
y = df['selling_price']
X = df.drop(['selling_price'],axis=1)

score_0_6 = train_random_forest(X,y)

0.5723483216745441


In [13]:
df.dtypes

name             int16
yearkm_driven    int64
fuel              int8
seller_type       int8
transmission      int8
owner             int8
selling_price    int64
dtype: object

----------------------------------------------------------

# Duplicate Instances

## Experiment 1 : Handling Duplicate Instances

In [14]:
# Removing duplicates
print(f'Shape of df before the duplicates were removed: {df.shape}')
df = df.drop_duplicates()
print(f'Shape of df after the duplicates were removed: {df.shape}')

# Create a train test split in the dataset
y = df['selling_price']
X = df.drop(['selling_price'],axis=1)

score_1 = train_random_forest(X,y)

Shape of df before the duplicates were removed: (6364, 7)
Shape of df after the duplicates were removed: (5096, 7)
0.6708761921059816


---------------------------------------------------

# Missingness in features

## Check the missing values in the df

In [15]:
import pandas as pd

# Replace empty strings (or any other placeholder) with NaN
df['transmission'].replace(0, np.nan, inplace=True) 

# Counting the missing values in each column
missing_values_count = df.isna().sum()
print(missing_values_count)

name               0
yearkm_driven      0
fuel               0
seller_type        0
transmission     319
owner              0
selling_price      0
dtype: int64


## Step 2: Handling Missing Data - Mean substitution

In [16]:
# Count the frequency of each category in the 'category' column
frequency = df['transmission'].value_counts()

# First, calculate the mean of the non-missing values in the column
mean_value = df['transmission'].mean()

# Deep copy
mean_df = df.copy()

# Now, fill the missing (NaN) values with the mean value
mean_df['transmission'].fillna(mean_value, inplace=True)

# Create a train test split in the dataset
y = mean_df['selling_price']
X = mean_df.drop(['selling_price'],axis=1)

score_2 = train_random_forest(X,y)

0.6697825383425331


## Experiment 3: Handling Missing Data - Median substitution

In [17]:
from sklearn.impute import SimpleImputer

# Initialize the SimpleImputer with median strategy
imputer = SimpleImputer(strategy='median')

# Impute missing values
median_df = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# Create a train test split in the dataset
y = mean_df['selling_price']
X = mean_df.drop(['selling_price'],axis=1)

score_3 = train_random_forest(X,y)


0.6697825383425331


## Experiment 4 : Handling Missing Data - Frequency Substitution

In [18]:
# Initialize the SimpleImputer with frequency strategy
imputer = SimpleImputer(strategy='most_frequent')

# Impute missing values
median_df = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# Create a train test split in the dataset
y = mean_df['selling_price']
X = mean_df.drop(['selling_price'],axis=1)

score_4 = train_random_forest(X,y)

0.6697825383425331


## Experiment 5 : Handling Missing Data -  Multiple Imputation

In [19]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imp = IterativeImputer(max_iter=10, random_state=0)

# Create a train test split in the dataset
y = df['selling_price']
X = df.drop(['selling_price'],axis=1)

imp.fit(X)  # X is your data with missing values

X_imputed = imp.transform(X)

score_5 = train_random_forest(X_imputed,y)

0.6746201735634997


## Experiment 6 : Handling Missing Data - knn_imputer

In [20]:
from sklearn.impute import KNNImputer
# Create KNN imputer object
# You can specify the number of neighbors to use for imputing missing values
imputer = KNNImputer(n_neighbors=2)

# Fit the imputer to the data and transform it
df_imputed = imputer.fit_transform(df)

# The result is a NumPy array, so you may want to convert it back to a DataFrame
df_imputed = pd.DataFrame(df_imputed, columns=df.columns)

# Create a train test split in the dataset
y = df_imputed['selling_price']
X = df_imputed.drop(['selling_price'],axis=1)

score_6 = train_random_forest(X,y)

0.6941655544588718


----------------------------------------------------------

# Compound Variables

## Experiment 7 : Handling Compound Variables

In [21]:
# One the visual inspection of the project we can see that the features 'year' and 'km_driven are merged

# We will attempt to seperate them here.
# Convert the column to string
df_imputed['yearkm_driven'] = df_imputed['yearkm_driven'].astype(str)

# Now you can use string slicing
df_imputed['year'] = df_imputed['yearkm_driven'].str[:4].astype(int)
df_imputed['km_driven'] = df_imputed['yearkm_driven'].str[4:].astype(float)

del df_imputed['yearkm_driven']
print(df_imputed)
# Create a train test split in the dataset
y = df_imputed['selling_price']
X = df_imputed.drop(['selling_price'],axis=1)

score_7 = train_random_forest(X,y)

        name  fuel  seller_type  transmission  owner  selling_price  year  \
0      775.0   4.0          1.0           1.0    0.0        60000.0  2007   
1     1041.0   4.0          1.0           1.0    0.0       135000.0  2007   
2      505.0   1.0          1.0           1.0    0.0       600000.0  2012   
3      118.0   4.0          1.0           1.0    0.0       250000.0  2017   
4      279.0   1.0          1.0           1.0    2.0       450000.0  2014   
...      ...   ...          ...           ...    ...            ...   ...   
5091  1035.0   4.0          1.0           1.0    0.0       250000.0  2011   
5092   971.0   1.0          1.0           1.0    0.0       700000.0  2018   
5093   238.0   4.0          1.0           1.0    4.0       185000.0  2011   
5094   928.0   1.0          1.0           1.0    0.0       200000.0  2011   
5095   630.0   1.0          1.0           1.0    4.0       315000.0  2004   

      km_driven  
0       70000.0  
1       50000.0  
2      100000.0  
3  

---------------------------------------------

# Outlier Detection

## Experiment 8 : Outlier Detection using K Nearest Neighbours

In [26]:
from sklearn.neighbors import NearestNeighbors

y = df_imputed['selling_price']
X = df_imputed.drop(['selling_price'],axis=1)

# KNN for outlier detection
K = 5  # Number of neighbors
neigh = NearestNeighbors(n_neighbors=K)
neigh.fit(X)
distances, indices = neigh.kneighbors(X)

# Determine an outlier threshold
outlier_threshold = np.mean(distances[:, -1]) + 2 * np.std(distances[:, -1])

# Identify outliers
outliers = distances[:, -1] > outlier_threshold

# Remove outliers
X_cleaned = X[~outliers]
y_cleaned = y[~outliers]

print(f'Number of outliers removed: {len(outliers) - X_cleaned.shape[0]}')

score_8 = train_random_forest(X_cleaned,y_cleaned)


Number of outliers removed: 11
0.734151870484302


## Experiment 9: Isolation Forest Outlier Detection

In [38]:
from sklearn.ensemble import IsolationForest

y = df_imputed['selling_price']
X = df_imputed.drop(['selling_price'],axis=1)

# Apply Isolation Forest for Outlier Detection
iso_forest = IsolationForest(contamination=0.5,random_state=2) # contamination is the proportion of outliers in the dataset
outliers = iso_forest.fit_predict(X)
outlier_index = np.where(outliers == -1) # -1 indicates outliers

# Remove Outliers
X_clean = X.drop(outlier_index[0])
y_clean = y.drop(outlier_index[0])

score_9 = train_random_forest(X_clean,y_clean)

0.7797388368084834


---------------------------------------------

# Feature Transformations

## Experiment 10: Interval Based Binning

## Experiment 11: Frequency Based Binning

## Experiment 12: Threshold Based Binning

## Experiment 13: Centering

## Experment 14: Scaling

## Experiment 15 : Power Transformations