## Car Dataset Analysis and Naïve Bayes Classification Model Report

## 1. Introduction

This report presents an in-depth analysis of the car dataset and the implementation of a Naïve Bayes classification model. The objective is to analyze the dataset, perform preprocessing, build a predictive model, and evaluate its performance.

## 2. Libraries Used

The following libraries were used in the project:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

## 3. Dataset Overview

The dataset used in this project is the car dataset. It contains various attributes related to car specifications, and the goal is to predict the Transmission type of the vehicle.

### 3.1 Number of Columns

The dataset consists of nine columns:

- `Car_Name`

- `Year`

- `Selling_Price`

- `Present_Price`

- `Kms_Driven`

- `Fuel_Type`

- `Seller_Type`

- `Transmission`

- `Owner`

### 3.2 Relationship Between Columns

- `Selling_Price` and `Present_Price` are strongly correlated: cars with a higher current price also tend to sell for more.

- `Fuel_Type` and `Seller_Type` may influence the car's value.

- `Kms_Driven` and the car's age can affect price. Age is not a column in the dataset; it has to be derived from `Year`.

- `Transmission` is the target variable; the model predicts it from the remaining columns, including `Fuel_Type`, `Kms_Driven`, and `Owner`.
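
The first relationship in the list can be checked numerically with a Pearson correlation coefficient. The sketch below is a minimal, dependency-free illustration that uses only the five rows printed by `data.head()` in Section 4, so its result says little about the full dataset. The helper name `pearson` and the `REFERENCE_YEAR` value are assumptions made for this example and do not come from the original analysis. With the full DataFrame loaded, `data['Selling_Price'].corr(data['Present_Price'])` should compute the same statistic.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# The five rows shown by data.head() in this report.
years = [2014, 2013, 2017, 2011, 2014]
selling_price = [3.35, 4.75, 7.25, 2.85, 4.60]
present_price = [5.59, 9.54, 9.85, 4.15, 6.87]

r = pearson(selling_price, present_price)
print(f"Pearson r on the five sample rows: {r:.2f}")

# Assumption: the report does not state a reference year for computing age.
REFERENCE_YEAR = 2018
ages = [REFERENCE_YEAR - year for year in years]
print("Derived ages:", ages)
```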

## 4. Basic Analysis

The dataset was loaded and examined using the following functions:

In [2]:
data = pd.read_csv(r"C:\Users\Shaik Sakhlaih\Downloads\cardata.csv")
print(data.head())
print(data.tail())
print(data.info())
print(data.describe())

  Car_Name  Year  Selling_Price  Present_Price  Kms_Driven Fuel_Type  \
0     ritz  2014           3.35           5.59       27000    Petrol   
1      sx4  2013           4.75           9.54       43000    Diesel   
2     ciaz  2017           7.25           9.85        6900    Petrol   
3  wagon r  2011           2.85           4.15        5200    Petrol   
4    swift  2014           4.60           6.87       42450    Diesel   

  Seller_Type Transmission  Owner  
0      Dealer       Manual      0  
1      Dealer       Manual      0  
2      Dealer       Manual      0  
3      Dealer       Manual      0  
4      Dealer       Manual      0  
    Car_Name  Year  Selling_Price  Present_Price  Kms_Driven Fuel_Type  \
296     city  2016           9.50           11.6       33988    Diesel   
297     brio  2015           4.00            5.9       60000    Petrol   
298     city  2009           3.35           11.0       87934    Petrol   
299     city  2017          11.50           12.5       

### 4.1 Checking for Null Values

In [3]:
print(data.isnull().sum())

Car_Name         0
Year             0
Selling_Price    0
Present_Price    0
Kms_Driven       0
Fuel_Type        0
Seller_Type      0
Transmission     0
Owner            0
dtype: int64


The dataset has **no missing values**.

## 5. Data Preprocessing

Since the dataset contains categorical variables, we used Label Encoding to convert them into numeric values.

In [4]:
from sklearn.preprocessing import LabelEncoder

# A single encoder is refit for each column, so after this cell `le`
# only remembers the classes of the last column (Transmission).
le = LabelEncoder()
data['Car_Name'] = le.fit_transform(data['Car_Name'])
data['Fuel_Type'] = le.fit_transform(data['Fuel_Type'])
data['Seller_Type'] = le.fit_transform(data['Seller_Type'])
data['Transmission'] = le.fit_transform(data['Transmission'])
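
To make the encoding step concrete, the sketch below mimics what label encoding does to one column: the distinct values are sorted and numbered from 0. It is a plain-Python illustration, not the scikit-learn implementation; the helper name `label_encode` is hypothetical, and the input is the `Fuel_Type` column of the five rows printed by `data.head()`.

```python
def label_encode(values):
    """Number the sorted distinct values from 0 and encode the sequence."""
    classes = sorted(set(values))
    mapping = {label: index for index, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping

# Fuel_Type values from the five rows shown by data.head().
fuel_type = ["Petrol", "Diesel", "Petrol", "Petrol", "Diesel"]

encoded, mapping = label_encode(fuel_type)
print(mapping)  # {'Diesel': 0, 'Petrol': 1}
print(encoded)  # [1, 0, 1, 1, 0]
```

The integer codes carry an ordering that the categories themselves do not have, and Gaussian Naïve Bayes will treat them as continuous values. This is worth keeping in mind when interpreting the results, particularly for a high-cardinality column such as `Car_Name`.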

## 6. Model Building

The `Transmission` column is the target variable (`y`), and the remaining eight columns are the features (`x`).

In [5]:
from sklearn.model_selection import train_test_split

x = data.drop(['Transmission'], axis=1)
y = data['Transmission']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

print("x_train:", x_train.shape)
print("x_test:", x_test.shape)
print("y_train:", y_train.shape)
print("y_test:", y_test.shape)

x_train: (240, 8)
x_test: (61, 8)
y_train: (240,)
y_test: (61,)
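
The printed shapes follow from simple arithmetic on the row count. The sketch below reproduces them under the assumption that a fractional `test_size` is rounded up to a whole number of rows and the training set takes the remainder, which is consistent with the output above.

```python
import math

n_samples = 240 + 61  # total rows, taken from the shapes printed above
test_size = 0.2

n_test = math.ceil(test_size * n_samples)  # 0.2 * 301 = 60.2, rounded up
n_train = n_samples - n_test
print(n_train, n_test)  # 240 61
```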


### 6.1 Applying Naïve Bayes Classification

In [6]:
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(x_train, y_train)
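
For readers who want to see what a Gaussian Naïve Bayes classifier does internally, here is a minimal from-scratch sketch. It estimates a prior, a mean, and a variance per class and feature, then picks the class with the highest log-posterior under the assumption that features are independent and normally distributed. The function names, the `eps` variance floor, and the toy data are all made up for illustration; this is not the scikit-learn implementation and it was not part of the original analysis.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y, eps=1e-9):
    """Return {label: (log_prior, means, variances)} estimated from X and y."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    model = {}
    for label, rows in rows_by_class.items():
        columns = list(zip(*rows))
        means = [sum(col) / len(col) for col in columns]
        # eps keeps the variance strictly positive (a simplification).
        variances = [
            sum((v - m) ** 2 for v in col) / len(col) + eps
            for col, m in zip(columns, means)
        ]
        model[label] = (math.log(len(rows) / len(y)), means, variances)
    return model

def predict_one(model, row):
    """Pick the label with the largest log prior plus summed log likelihoods."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances)
        )
        return log_prior + log_likelihood
    return max(model, key=log_posterior)

# Toy data (invented): two well-separated classes with two features each.
X_toy = [[1.0, 1.2], [0.8, 1.0], [1.2, 0.9], [5.0, 5.1], [4.8, 5.3], [5.2, 4.9]]
y_toy = [0, 0, 0, 1, 1, 1]

toy_model = fit_gaussian_nb(X_toy, y_toy)
print(predict_one(toy_model, [1.1, 1.0]))  # 0
print(predict_one(toy_model, [5.1, 5.0]))  # 1
```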

## 7. Model Evaluation

### 7.1 Predictions

In [7]:
y_pred = gnb.predict(x_test)

### 7.2 Accuracy Score

In [8]:
from sklearn.metrics import accuracy_score

# accuracy_score expects the true labels first, then the predictions.
accuracy = accuracy_score(y_test, y_pred) * 100
print("The accuracy score:", accuracy)

The accuracy score: 88.52459016393442
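
Accuracy is simply the share of test rows whose predicted label matches the true label. The sketch below spells out that calculation and shows that the reported score corresponds to 54 correct predictions out of the 61 test rows. The helper name `accuracy_percent` and the short label lists are hypothetical and serve only as an illustration.

```python
def accuracy_percent(y_true, y_pred):
    """Percentage of positions where the prediction equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)

# Tiny made-up example: three of four predictions are correct.
print(accuracy_percent([1, 1, 0, 1], [1, 0, 0, 1]))  # 75.0

# The reported score on the 61-row test set matches 54 correct predictions.
n_test = 61
n_correct = 54
reported = 100.0 * n_correct / n_test
print(round(reported, 2))  # 88.52
```

Seven of the 61 test rows are therefore misclassified. A confusion matrix would show how those errors are distributed across the two transmission classes.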


## 8. Conclusion

- The dataset contained **no missing values**, simplifying preprocessing.

- Categorical columns were encoded using **Label Encoding**.

- The **Naïve Bayes classification** model was implemented.

The model reaches an accuracy of 88.52% on the 61-row test set, indicating fairly strong predictive performance.

The model can be further improved by tuning hyperparameters or using alternative classification models such as Decision Trees or Random Forest.

This report provides a comprehensive analysis and evaluation of the car dataset using Naïve Bayes classification.