## Introduction

One of the biggest challenges for an auto dealership purchasing a used car at an auto auction is the risk that the vehicle might have serious issues that prevent it from being sold to customers. The auto community calls these unfortunate purchases "kicks".

Kicked cars often result when there are tampered odometers, mechanical issues the dealer is not able to address, issues with getting the vehicle title from the seller, or some other unforeseen problem. Kicked cars can be very costly to dealers after transportation cost, throw-away repair work, and market losses in reselling the vehicle.

In this project we will identify which cars have a higher risk of being kicks, which can provide real value to dealerships and help them offer the best selection to their customers.

We will take a look at real-world data on used cars purchased at auction (about 73K rows in the training set) and use machine learning techniques to build the models.

## Installing and importing all the libraries

In [None]:
pip install opendatasets

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import opendatasets as od
import os
import plotly.express as px

## Downloading the Data

In [None]:
# Kaggle competition URL, passed to opendatasets in the next cell
url = 'https://www.kaggle.com/competitions/DontGetKicked/data.csv'
#df = pd.read_csv(url)

In [None]:
od.download(url)

In [None]:
os.listdir('./DontGetKicked')

In [None]:
train_df = pd.read_csv('DontGetKicked/training.csv')
test_df = pd.read_csv('DontGetKicked/test.csv')

# Exploratory Data Analysis

We have downloaded the dataset and loaded it into dataframes. Let's check the data we have: `training.csv` contains the training data and `test.csv` contains the test data.

In [None]:
train_df

In [None]:
train_df.info()

In [None]:
test_df

In [None]:
a = train_df.IsBadBuy.value_counts()
plt.figure(figsize=(6,6))
sns.barplot(x=['No','Yes'],y=a.values)
plt.ylabel('Count')
plt.title("Is a Bad Buy", fontsize = 18)

**Insights**: The target variable in this dataset is `IsBadBuy`. About 64K rows are labelled as good buys and about 9K as bad buys, so the data is imbalanced.
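
To put a number on the imbalance, the sketch below takes the approximate counts quoted above (64K good buys, 9K bad buys) and computes the share of bad buys together with the class weights that scikit-learn's `class_weight='balanced'` heuristic would assign, `n_samples / (n_classes * count)`. The rounded counts are an assumption for illustration, not the exact dataset values.

```python
import numpy as np

# Approximate class counts quoted in the text (assumed rounded values)
counts = np.array([64_000, 9_000])  # index 0 = good buy, index 1 = bad buy

n_samples = counts.sum()
bad_buy_share = counts[1] / n_samples

# "balanced" heuristic: n_samples / (n_classes * count_per_class)
balanced_weights = n_samples / (len(counts) * counts)

print(f"Share of bad buys: {bad_buy_share:.1%}")
print(f"Balanced class weights: {balanced_weights.round(2)}")
```

The models later in this notebook use a milder hand-picked weighting rather than the balanced heuristic; the sketch only shows how strong the imbalance is.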

In [None]:
b = train_df.Auction.value_counts()
colors = ['cyan', 'lightblue','pink']
plt.figure(figsize=(7,7))
plt.title('Purchase in Auction',fontsize=18)
plt.pie(b,colors=colors,
        labels=b.index,  # take labels from the counts so they always match the slices
        autopct = '%1.1f%%',startangle=90,shadow=True,
       radius = 1.2,explode = (0, 0.0005,0))
plt.legend();

The chart shows that `Manheim` is the auction house where the largest number of vehicles were purchased.

In [None]:
age = train_df.VehicleAge.value_counts()
plt.figure(figsize=(8,6))
# Plot from the Series directly; the name of the counts column differs across pandas versions
sns.barplot(x=age.index,y=age.values)
plt.ylabel('count',fontsize=18)
plt.title('Vehicle Age',fontsize=18)

Insights: Vehicles aged 3 and 4 years are the most common purchases by count.

In [None]:
make = train_df.Make.value_counts()
plt.figure(figsize=(12,6))
# Plot from the Series directly; the name of the counts column differs across pandas versions
sns.barplot(x=make.index,y=make.values)
plt.ylabel('count',fontsize=18)
plt.xticks(rotation=75)
plt.title('Make',fontsize=18)

Insights: Chevrolet is the make with the most purchased vehicles.

In [None]:
px.histogram(train_df, x="VehicleAge", color='IsBadBuy')

Insights: Distribution of vehicle age, split by whether the purchase was a good or a bad buy.

In [None]:
px.histogram(train_df, x= "Make", color='IsBadBuy',width=1000)

Insights: Chevrolet and Dodge are the makes purchased most often, which suggests they are in high demand in the market.

In [None]:
px.histogram(train_df, x= "Nationality", color='IsBadBuy')

Insights: American vehicles are purchased far more often than others, which suggests higher demand for them.

In [None]:
px.histogram(train_df, x= "Size", color='IsBadBuy')

Insights: The majority of purchased vehicles are medium-sized.

In [None]:
px.histogram(train_df,x='VNST',width=1000)

Insights: Texas is the state where the most vehicles were purchased.

In [None]:
px.histogram(train_df,x='WheelType',y='IsBadBuy')

Insights: Most of the vehicles have the Alloy wheel type; this might be because the Special wheel type costs more.

In [None]:
plt.figure(figsize=(20, 12))

# Compute correlation matrix
corr = train_df.corr(numeric_only=True)

# Create a mask for the upper triangle
mask_matrix = np.triu(np.ones_like(corr, dtype=bool))

# Plot heatmap
sns.heatmap(corr, 
            mask=mask_matrix, 
            cmap='crest', 
            annot=True, 
            fmt=".2f", 
            linewidths=0.5, 
            square=True, 
            cbar_kws={"shrink": 0.75})

plt.title("Correlation Heatmap", fontsize=18)
plt.show()


Insights: The MMR price columns are very strongly correlated with each other. Such redundant features might affect the accuracy of what we are trying to achieve.
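
Reading strongly correlated pairs off a large heatmap is error-prone, so they can also be listed programmatically. The sketch below does this on a small synthetic frame; the column names, the values and the 0.9 threshold are all assumptions for illustration, not taken from the auction data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the price columns (hypothetical names and values)
rng = np.random.default_rng(0)
base = rng.normal(6000, 2000, size=500)
toy = pd.DataFrame({
    "auction_avg": base,
    "auction_clean": base * 1.2 + rng.normal(0, 100, size=500),
    "odometer": rng.normal(70000, 15000, size=500),
})

corr = toy.corr().abs()
# Keep only the strict upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
# Pairs whose absolute correlation exceeds the (assumed) threshold
high_pairs = pairs[pairs > 0.9]
print(high_pairs)
```

The same pattern could be applied to `train_df.corr(numeric_only=True)` to decide which of the price columns are candidates for removal.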

In [None]:
px.scatter(train_df, x="MMRAcquisitionAuctionAveragePrice", y="MMRAcquisitionRetailAveragePrice",color="IsBadBuy")

Insights: There is a clear, strong positive correlation between the two prices.

In [None]:
px.scatter(train_df, x = "MMRCurrentAuctionAveragePrice",y = "MMRCurrentAuctionCleanPrice", color='IsBadBuy')

Insights: The graph above shows a positive correlation, and beyond the 20K mark the vehicles appear to be bad buys.

In [None]:
px.scatter(train_df, x='MMRCurrentRetailAveragePrice', y='MMRCurrentRetailCleanPrice', color='IsBadBuy')

Insights: The graph above shows a positive correlation, and beyond the 20K mark the vehicles appear to be bad buys.

In [None]:
# Download competition files
!kaggle competitions download -c DontGetKicked

# Unzip the downloaded zip
import zipfile
with zipfile.ZipFile("DontGetKicked.zip", 'r') as zip_ref:
    zip_ref.extractall("DontGetKicked")

# text_file = open('/content/DontGetKicked/Carvana_Data_Dictionary.txt')
# content = text_file.read()
# print(content)
# text_file.close()

In [None]:
print(os.getcwd())

for root, dirs, files in os.walk(".", topdown=True):
    for name in files:
        if name.endswith(".txt"):
            print(os.path.join(root, name))

Let's read the data dictionary and then check the null values in the dataset.

In [None]:
with open("DontGetKicked/Carvana_Data_Dictionary.txt", "r") as f:
    content = f.read()
    print(content)

In [None]:
train_df.isna().sum()*100/len(train_df)

We can see that there are a lot of NaN values in the dataset.

In [None]:
train_df.IsOnlineSale.value_counts()

# Data Preprocessing

In [None]:
train_df['Transmission'] = train_df['Transmission'].replace({'manual':'MANUAL'})

In [None]:
train_df.isna().sum()

Insights: Approximately 90% of the values in `PRIMEUNIT` and `AUCGUART` are null, so it is better to drop them.

In [None]:
train_df.drop(['Trim','Model','RefId','VehYear','WheelTypeID','VNZIP1','PRIMEUNIT','AUCGUART','PurchDate'], axis=1, inplace=True)
test_df.drop(['Trim','Model','RefId','VehYear','WheelTypeID','VNZIP1','PRIMEUNIT','AUCGUART','PurchDate'], axis=1, inplace=True)

We remove unnecessary columns that are unlikely to help the model, so that it can learn better without them.

In [None]:
train_df.describe()

In [None]:
train_targets = train_df['IsBadBuy']
train_df.drop('IsBadBuy',axis=1, inplace= True)

Splitting the dependent variable (the target) from the independent variables.

In [None]:
train_df.head()

### Conclusions


1. We can drop **`Model`** & **`Trim`** as they have a lot of categories and the model won't be able to learn all of them.


2. Note: **`WheelType`** and **`WheelTypeID`** carry the same information. One of them contains numeric categories and the other contains strings. It's better to drop **`WheelTypeID`**, as the other column has the type of metal used for making the wheel, which might help us understand the importance of a particular metal used in making the wheel.

3.  **`VehYear`** might not play a crucial role since we have **`VehicleAge`** as a column. Both **`PurchDate`** and **`VehYear`** vary; the only thing that matters is how old the vehicle is at the time of resale. Thus we drop **`VehYear`**.


# Encoding & Imputing Technique

It is always good practice to identify the numerical and categorical columns so that it becomes easier to work on them.



In [None]:
num_cols = train_df.select_dtypes(exclude='object').columns.tolist()

Machine learning models generally cannot work with missing data, so we need to fill in the missing values. This process is called imputation.

In [None]:
from sklearn.impute import SimpleImputer

In [None]:
imputer = SimpleImputer(strategy='mean')

We use `SimpleImputer` so that missing values are filled with the column mean.

In [None]:
from sklearn.impute import SimpleImputer

# Select numeric columns
num_cols = train_df.select_dtypes(include='number').columns

# Initialize imputer
imputer = SimpleImputer(strategy='mean')  # or 'median', 'most_frequent', etc.

# Fit the imputer on the training data
imputer.fit(train_df[num_cols])

# Transform the data and update the DataFrame
train_df[num_cols] = imputer.transform(train_df[num_cols])

In [None]:
# Reuse the imputer fitted on the training data so the test set is filled with the training means
test_df[num_cols] = imputer.transform(test_df[num_cols])

In [None]:
train_df.isna().sum()

The mean strategy does not apply to categorical (object) columns, so we use a different strategy for them: each missing value is filled with the most frequent value of its column.
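
As a minimal sketch of most-frequent imputation, the example below fills two tiny made-up frames. It computes the fill values on the training frame only and reuses them for the test frame, which is a slight variation on the cells that follow (those compute the most frequent value separately per dataframe). The frames and their values are hypothetical.

```python
import numpy as np
import pandas as pd

# Tiny made-up frames standing in for the categorical columns
toy_train = pd.DataFrame({"Color": ["RED", "RED", "BLUE", np.nan],
                          "Size": ["MEDIUM", np.nan, "MEDIUM", "LARGE"]})
toy_test = pd.DataFrame({"Color": [np.nan, "BLUE"],
                         "Size": ["LARGE", np.nan]})

# Most frequent value of each column, computed on the training frame only
train_modes = toy_train.mode().iloc[0]

# fillna with a Series fills each column with the value stored under its name
toy_train_filled = toy_train.fillna(train_modes)
toy_test_filled = toy_test.fillna(train_modes)
print(toy_test_filled)
```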

In [None]:
train_df = train_df.apply(lambda x: x.fillna(x.value_counts().index[0]))

In [None]:
test_df = test_df.apply(lambda x: x.fillna(x.value_counts().index[0]))

In [None]:
test_df.isnull().sum()

In [None]:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()

We select some specific columns for label encoding. Label encoding maps each category to an integer, so the model may treat the labels as a `rank`.
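
To make the integer mapping concrete, here is a small sketch of what `LabelEncoder` produces for three auction names. The input list is made up for illustration; the point is that the codes follow the sorted order of the labels, not any meaningful ranking.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["MANHEIM", "OTHER", "ADESA", "MANHEIM"])

# classes_ holds the categories in sorted order; the code is the position in that list
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes)
```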

In [None]:
label_cols = ['Auction', 'Transmission', 'WheelType', 'Nationality', 'TopThreeAmericanName']

for col in label_cols:
    # Fit on the categories of both sets so each label maps to the same integer in train and test
    label_encoder.fit(pd.concat([train_df[col], test_df[col]]))
    train_df[col] = label_encoder.transform(train_df[col])
    test_df[col] = label_encoder.transform(test_df[col])

In [None]:
train_df.head()

In [None]:
category_col = train_df.select_dtypes(include = 'object').columns.tolist()

**Encoding Categorical Data:**

Since machine learning models can only be trained with numeric data, we need to convert categorical data to numbers. A very common technique is to use one-hot encoding for categorical columns.

<img src="https://i.imgur.com/n8GuiOO.png" width="640">

One hot encoding involves adding a new binary (0/1) column for each unique category of a categorical column.


We will use `OneHotEncoder` from `sklearn.preprocessing` to achieve this goal.

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Update for latest sklearn version
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# Fit only on training data
encoder.fit(train_df[category_col])

# Transform both datasets
train_encoded = encoder.transform(train_df[category_col])
test_encoded = encoder.transform(test_df[category_col])

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Define and fit encoder
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoder.fit(train_df[category_col])

# Transform data
train_encoded = encoder.transform(train_df[category_col])

# Get new column names
encoded_cols = list(encoder.get_feature_names_out(category_col))

# Create DataFrame with encoded columns
import pandas as pd
train_encoded_df = pd.DataFrame(train_encoded, columns=encoded_cols)

# Optional: Reset index if needed to merge
train_encoded_df.index = train_df.index

# Combine with original DataFrame (dropping old categorical columns)
train_df_final = pd.concat([train_df.drop(columns=category_col), train_encoded_df], axis=1)

After fitting the categorical columns to the `OneHotEncoder` object, the encoder creates a list of new columns from all the categories in the columns and we can access them using `get_feature_names_out`.

Now we will use these encoded column names to replace the categorical columns with their encoded versions.

In [None]:
import pandas as pd

# Transform categorical columns
train_encoded = encoder.transform(train_df[category_col])
test_encoded = encoder.transform(test_df[category_col])

# Create DataFrames from encoded arrays
train_encoded_df = pd.DataFrame(train_encoded, columns=encoded_cols, index=train_df.index)
test_encoded_df = pd.DataFrame(test_encoded, columns=encoded_cols, index=test_df.index)

# Drop original categorical columns from the original dataframes
train_df = train_df.drop(columns=category_col)
test_df = test_df.drop(columns=category_col)

# Concatenate encoded columns
train_df = pd.concat([train_df, train_encoded_df], axis=1)
test_df = pd.concat([test_df, test_encoded_df], axis=1)

# Optional (to defragment memory)
train_df = train_df.copy()
test_df = test_df.copy()

In [None]:
train_df[encoded_cols]

# Scaling Down the Data

In [None]:
train_df.columns.tolist()

Another good practice is to scale numeric features to a small range of values e.g. (0,1) or (−1,1). Scaling numeric features ensures that no particular feature has a disproportionate impact on the model's loss. Optimization algorithms also work better in practice with smaller numbers.
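
As a worked example of scaling to the (0, 1) range, the sketch below applies the min-max formula, `(x - min) / (max - min)`, by hand to a few made-up odometer readings. The values are hypothetical.

```python
import numpy as np

odometer = np.array([30_000.0, 60_000.0, 90_000.0, 120_000.0])  # made-up readings

# Min-max scaling: (x - min) / (max - min) maps the smallest value to 0 and the largest to 1
col_min, col_max = odometer.min(), odometer.max()
scaled = (odometer - col_min) / (col_max - col_min)
print(scaled)
```

Because the minimum and maximum are learned from the training data, test values outside that range can land slightly below 0 or above 1 after scaling.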

We will use `MinMaxScaler` from `sklearn.preprocessing` to scale numeric features.

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
scaler = MinMaxScaler()
scaler.fit(train_df[num_cols])

In [None]:
train_df[num_cols] = scaler.transform(train_df[num_cols])
test_df[num_cols] = scaler.transform(test_df[num_cols])

In [None]:
train_df[num_cols].describe()

In [None]:
cols_to_drop = ['Make', 'SubModel', 'Color', 'Size', 'VNST']

missing_cols_train = [col for col in cols_to_drop if col not in train_df.columns]
missing_cols_test = [col for col in cols_to_drop if col not in test_df.columns]

print("Missing in train_df:", missing_cols_train)
print("Missing in test_df:", missing_cols_test)

Since we applied one-hot encoding to the above columns, the original columns should no longer be in either dataframe; the check above confirms this.

# Training, Validation and Test Set

**Training, Validation and Test Sets:**

While building real-world machine learning models, it is quite common to split the dataset into three parts:

1. **Training set** - used to train the model, i.e., compute the loss and adjust the model's weights using an optimization technique.


2. **Validation set** - used to evaluate the model during training, tune model hyperparameters (optimization technique, regularization etc.), and pick the best version of the model. Picking a good validation set is essential for training models that generalize well.


3. **Test set** - used to compare different models or approaches and report the model's final accuracy. For many datasets, test sets are provided separately. The test set should reflect the kind of data the model will encounter in the real-world, as closely as feasible.

As a general rule of thumb you can use around 60% of the data for the training set, 20% for the validation set and 20% for the test set. If a separate test set is already provided, you can use a 75%-25% training-validation split.


When rows in the dataset have no inherent order, it's common practice to pick random subsets of rows for creating test and validation sets. This can be done using the `train_test_split` utility from `scikit-learn`. Learn more about it here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
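
With an imbalanced target, a purely random split can leave the validation set with a slightly different share of bad buys than the training set. One optional variation, sketched here on made-up labels, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts. This is an illustration under assumed toy data, not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 90 good buys (0) and 10 bad buys (1)
y = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the 90/10 class ratio in both the training and validation parts
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)
print(y_train.mean(), y_val.mean())
```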

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
inputs, val_inputs, train_targets, val_targets = train_test_split(train_df,train_targets, test_size=0.20, random_state=42)

# Dumb Model

It's always a good idea to build a baseline (or "dumb") model before training a machine learning model, so that we have a reference score our models need to beat.

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix

We will use the `accuracy_score` from `sklearn.metrics` library to test the accuracy of models by computing the percentage of matching values between the predictions and actual targets

In [None]:
dum_model_outs = np.zeros(len(inputs))
# accuracy_score expects the true labels first, then the predictions
accuracy_score(train_targets, dum_model_outs)

Our dumb model, which always says 'No', has an accuracy of 87%.
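
A high baseline accuracy on imbalanced data can be misleading. The sketch below uses made-up labels with a similar imbalance to show that an always-'No' model scores well on accuracy while catching none of the bad buys (recall of 0). The labels are hypothetical and only mirror the rough class ratio.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels with a similar rough imbalance: 87 good buys, 13 bad buys
y_true = np.array([0] * 87 + [1] * 13)
y_baseline = np.zeros_like(y_true)  # always predict "good buy"

baseline_accuracy = accuracy_score(y_true, y_baseline)
baseline_recall = recall_score(y_true, y_baseline)  # share of bad buys that were caught
print(baseline_accuracy, baseline_recall)
```

This is why it helps to look at the confusion matrix, and not only accuracy, when comparing the models that follow.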

# Model 1: Logistic Regression

We will make our first model, which is going to be a `LogisticRegression` model.
We will use `LogisticRegression` from `sklearn.linear_model` to build the model.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

In [None]:
lr_model = LogisticRegression(random_state = 42,solver='liblinear',class_weight={0: 1, 1:1.6})
lr_model.fit(inputs, train_targets)

We have made the lr_model object and have fitted the training inputs to the model.
Next we will get the predictions from the model and check the accuracy score.

In [None]:
lr_model.score(inputs, train_targets)

In [None]:
lr_model.score(val_inputs, val_targets)

In [None]:
train_preds = lr_model.predict(inputs)

**Confusion Matrix:**

A Confusion matrix is an N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes. The matrix compares the actual target values with those predicted by the machine learning model. This gives us a holistic view of how well our classification model is performing and what kinds of errors it is making.

For a binary classification problem, we would have a 2 x 2 matrix as shown below with 4 values:

![](https://cdn.analyticsvidhya.com/wp-content/uploads/2020/04/Basic-Confusion-matrix.png)

- The target variable has two values: Positive or Negative
- The columns represent the actual values of the target variable
- The rows represent the predicted values of the target variable


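Here TP and TN mean that the predicted value matches the actual value, FN means that the model predicted **False** but the actual value was **True**, and FP means that the model predicted **True** but the actual value was **False**. Note that scikit-learn's `confusion_matrix` uses the transposed layout: its rows are the actual values and its columns are the predicted values.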
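
As a small illustration of how scikit-learn lays out the matrix and what the `normalize='pred'` option does, the sketch below uses six made-up labels. The labels are hypothetical; only the function calls match the ones used in this notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

# scikit-learn layout: rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# normalize='pred' divides each column by its total, so the diagonal holds per-class precision
cm_pred = confusion_matrix(y_true, y_pred, normalize="pred")
print(cm)
print(cm_pred)
```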

In [None]:
confusion_matrix(train_targets, train_preds, normalize = 'pred')

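With `normalize='pred'`, each column of the matrix is divided by its total, so the diagonal shows how often each predicted class is correct: about 88% of the vehicles predicted as good buys are actually good buys (**TN**), and about 45% of those predicted as bad buys are actually bad buys (**TP**).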


In [None]:
accuracy = accuracy_score(train_targets, train_preds)
accuracy

In [None]:
val_preds = lr_model.predict(val_inputs)

In [None]:
confusion_matrix(val_targets, val_preds, normalize = 'pred')

On the validation set, about 88% of the vehicles predicted as good buys are actually good buys (**TN**), and about 41% of those predicted as bad buys are actually bad buys (**TP**).


In [None]:
preds=lr_model.predict(test_df)

# Model 2: KNN Classifier

We will create KNN classifier model.

We will use `KNeighborsClassifier` from `sklearn.neighbors` to build the model.


In [None]:
from sklearn.neighbors import KNeighborsClassifier
KNN = KNeighborsClassifier(algorithm='auto',leaf_size=30,metric='minkowski',n_neighbors=11,weights='uniform')

In [None]:
KNN.fit(inputs, train_targets)

In [None]:
KNN.score(inputs, train_targets)

In [None]:
KNN.score(val_inputs, val_targets)


We get an accuracy of about 87% on the validation set, but `KNeighborsClassifier` takes some time to produce predictions.

In [None]:
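# RefId was dropped from test_df during preprocessing, so reload it from the CSV
ref_ids = pd.read_csv('DontGetKicked/test.csv', usecols=['RefId'])['RefId']

# Predict first, then build the submission frame from the KNN predictions
preds=KNN.predict(test_df)
submission_df = pd.DataFrame({
    'RefId': ref_ids,
    'IsBadBuy': preds
})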

# Model 3: Decision Tree Classifier

Next we will create our decision tree model.

We will use `DecisionTreeClassifier` from `sklearn.tree` to build the model.

In [None]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=42)

In [None]:
model.fit(inputs, train_targets)

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix

In [None]:
train_preds = model.predict(inputs)

In [None]:
train_preds

In [None]:
pd.Series(train_preds).value_counts()

In [None]:
train_probs = model.predict_proba(inputs)
train_probs

In [None]:
accuracy_score(train_targets, train_preds)

In [None]:
model.score(val_inputs, val_targets)

We get an accuracy of 79% on the validation set. Let's tune some hyperparameters to increase the accuracy.

In [None]:
val_targets.value_counts() / len(val_targets)

In [None]:
from sklearn.tree import plot_tree, export_text
plt.figure(figsize=(80,20))
plot_tree(model, feature_names=list(inputs.columns), max_depth=3, filled=True);

In [None]:
model.tree_.max_depth

In [None]:
importance_df = pd.DataFrame({
    'feature': inputs.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

In [None]:
plt.title('Feature Importance')
sns.barplot(data=importance_df.head(10), x='importance', y='feature');

The plot shows the ten most important features according to the decision tree.

In [None]:
model = DecisionTreeClassifier(max_depth=3, random_state=42)

We try different values of the tree's `max_depth` to increase the accuracy.

In [None]:
model.fit(inputs, train_targets)

In [None]:
model.score(inputs, train_targets)

In [None]:
model.score(val_inputs, val_targets)

Now we can see that just by limiting the max depth of the tree to 3 we get an accuracy of 88%, which is far better than 79%.

In [None]:
def max_depth_error(md):
    model = DecisionTreeClassifier(max_depth=md, random_state=42)
    model.fit(inputs, train_targets)
    # Error is 1 - accuracy
    train_err = 1 - model.score(inputs, train_targets)
    val_err = 1 - model.score(val_inputs, val_targets)
    return {'Max Depth': md, 'Training Error': train_err, 'Validation Error': val_err}

In [None]:
errors_df = pd.DataFrame([max_depth_error(md) for md in range(1, 21)])

In [None]:
errors_df

In [None]:
plt.figure()
plt.plot(errors_df['Max Depth'], errors_df['Training Error'])
plt.plot(errors_df['Max Depth'], errors_df['Validation Error'])
plt.title('Training vs. Validation Error')
plt.xticks(range(0,21, 2))
plt.xlabel('Max. Depth')
plt.ylabel('Prediction Error (1 - Accuracy)')
plt.legend(['Training', 'Validation'])

We try different hyperparameter values to tune the model.
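
Instead of reading the best depth off the plot, it can be picked from the error table directly. The sketch below does this on a made-up table with the same column names as `errors_df`; the error values are invented to show the typical shape and are not results from this notebook.

```python
import pandas as pd

# Made-up error values with the typical shape: training error keeps falling,
# validation error bottoms out and then rises again
toy_errors = pd.DataFrame({
    "Max Depth": [1, 2, 3, 4, 5, 6],
    "Training Error": [0.123, 0.105, 0.100, 0.098, 0.094, 0.088],
    "Validation Error": [0.124, 0.106, 0.101, 0.102, 0.105, 0.110],
})

# Row with the lowest validation error
best_row = toy_errors.loc[toy_errors["Validation Error"].idxmin()]
best_depth = int(best_row["Max Depth"])
print(best_depth)
```

The depth is chosen by validation error, since training error keeps improving as the tree overfits.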

In [None]:
model = DecisionTreeClassifier(max_depth=4,max_leaf_nodes=50,random_state=42).fit(inputs, train_targets)
model.score(val_inputs, val_targets)

Hence we get an accuracy of 88% with the decision tree model.

# Model 4: RandomForest Classifier

Next we will make our random forest classifier model and we will use `RandomForestClassifier` from `sklearn.ensemble`.

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
model = RandomForestClassifier(n_jobs=-1, random_state=42)

In [None]:
model.fit(inputs, train_targets)

In [None]:
model.score(inputs, train_targets)

In [None]:
model.score(val_inputs, val_targets)

In [None]:
train_probs = model.predict_proba(inputs)
train_probs

In [None]:
# # Show all columns in test_df
# print("Columns in test_df:")
# print(test_df.columns.tolist())

# # Preview first few rows
# print("\nFirst few rows:")
# print(test_df.head())

In [None]:
# Predict probabilities for class 1
preds = model.predict_proba(test_df)[:, 1]

# Create a submission DataFrame, reloading RefId from the CSV
# because it was dropped from test_df during preprocessing
submission_df = pd.DataFrame({
    'RefId': pd.read_csv('DontGetKicked/test.csv', usecols=['RefId'])['RefId'],
    'IsBadBuy': preds
})

# Save to CSV
submission_df.to_csv('rf_Submissions.csv', index=False)

In [None]:
importance_df = pd.DataFrame({
    'feature': inputs.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

In [None]:
plt.figure(figsize=(10,6))
plt.title('Feature Importance')
sns.barplot(data=importance_df.head(10),palette='husl',x='importance', y='feature');
plt.show()

In [None]:
def test_params(**params):
    model = RandomForestClassifier(random_state=42, n_jobs=-1, **params).fit(inputs, train_targets)
    return model.score(inputs, train_targets), model.score(val_inputs, val_targets)

In [None]:
test_params(max_depth=40)

In [None]:
test_params(max_leaf_nodes=2**12)

In [None]:
test_params(max_features='log2')

# Training the Best Model:

Since we had the highest accuracy with random forest, let's tune our random forest model. Tuning is an iterative process, so we used the `test_params` function defined above to try individual hyperparameters before training the final model.



In [None]:
model = RandomForestClassifier(n_jobs=-1,
                               random_state=42,
                               n_estimators=300,
                               max_features='log2',
                               max_depth=40,
                               class_weight={0: 1, 1: 1.6})

In [None]:
model.fit(inputs, train_targets)

In [None]:
model.score(inputs, train_targets), model.score(val_inputs, val_targets)

In [None]:
preds=model.predict_proba(test_df)
submission_df['IsBadBuy']=preds[:,1]

In [None]:
# Step 3: Show the first few prediction results
print("Prediction Results (Top 10):")
print(submission_df.head(10))

# Step 4: Save to CSV
submission_df.to_csv('rf_Submissions.csv', index=False)
print("\nSaved to 'rf_Submissions.csv'")

# **Summary**

We downloaded and explored the data, performed EDA (Exploratory Data Analysis), cleaned the data and trained a few models to automate the process of identifying whether a car bought at auction is a good purchase or a bad purchase.

- The training data had approximately 73K rows and 34 columns.
- Prepared the dataset and removed the columns that had too many categories or were redundant with other columns.
- Imputed the missing values in both categorical columns and numeric columns.
- Encoded the categorical columns with label encoding & one-hot encoding and scaled the numerical values using `MinMaxScaler`.
- Then split the data into training data and validation data and built the dumb model to get the baseline for our models.
- The dataset was imbalanced, so we applied the `class_weight` parameter to compensate.
- Trained four models:
`LogisticRegression`, `KNeighborsClassifier`, `DecisionTreeClassifier` and `RandomForestClassifier`.
Among these, `RandomForestClassifier` performed best, and after hyperparameter tuning it gave an accuracy of 89% on the validation set.

**Possible Future Work:**

- Performing better feature engineering.
- Tuning the hyperparameters further.
- Performing cross-validation, such as k-fold.
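
As a rough sketch of the cross-validation idea, the example below runs stratified 5-fold cross-validation with scikit-learn on synthetic, imbalanced data. The data comes from `make_classification` and is not the auction dataset; the model settings simply mirror the logistic regression used earlier, and the fold count is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in data (roughly 88% / 12%), not the auction dataset
X, y = make_classification(n_samples=500, n_features=8, weights=[0.88, 0.12], random_state=42)

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(solver="liblinear", class_weight={0: 1, 1: 1.6})

# One accuracy score per fold
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())
```

Averaging over folds gives a steadier estimate than a single validation split, at the cost of training the model several times.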

