### 1. Loading dataset
- The dataset is loaded using the read_csv() method from the pandas library because it is stored in a comma-separated values (CSV) format.

In [2]:
import pandas as pd
import numpy as np
data=pd.read_csv("cardekho.csv")


- The first ten rows of the dataset are displayed using the head() method in pandas, which returns five rows by default.

In [6]:
data.head(10)

| | name | year | selling_price | km_driven | fuel | seller_type | transmission | owner | mileage(km/ltr/kg) | engine | max_power | seats |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Maruti Swift Dzire VDI | 2014 | 450000 | 145500 | Diesel | Individual | Manual | First Owner | 23.4 | 1248.0 | 74.0 | 5.0 |
| 1 | Skoda Rapid 1.5 TDI Ambition | 2014 | 370000 | 120000 | Diesel | Individual | Manual | Second Owner | 21.14 | 1498.0 | 103.52 | 5.0 |
| 2 | Honda City 2017-2020 EXi | 2006 | 158000 | 140000 | Petrol | Individual | Manual | Third Owner | 17.7 | 1497.0 | 78.0 | 5.0 |
| 3 | Hyundai i20 Sportz Diesel | 2010 | 225000 | 127000 | Diesel | Individual | Manual | First Owner | 23.0 | 1396.0 | 90.0 | 5.0 |
| 4 | Maruti Swift VXI BSIII | 2007 | 130000 | 120000 | Petrol | Individual | Manual | First Owner | 16.1 | 1298.0 | 88.2 | 5.0 |
| 5 | Hyundai Xcent 1.2 VTVT E Plus | 2017 | 440000 | 45000 | Petrol | Individual | Manual | First Owner | 20.14 | 1197.0 | 81.86 | 5.0 |
| 6 | Maruti Wagon R LXI DUO BSIII | 2007 | 96000 | 175000 | LPG | Individual | Manual | First Owner | 17.3 | 1061.0 | 57.5 | 5.0 |
| 7 | Maruti 800 DX BSII | 2001 | 45000 | 5000 | Petrol | Individual | Manual | Second Owner | 16.1 | 796.0 | 37.0 | 4.0 |
| 8 | Toyota Etios VXD | 2011 | 350000 | 90000 | Diesel | Individual | Manual | First Owner | 23.59 | 1364.0 | 67.1 | 5.0 |
| 9 | Ford Figo Diesel Celebration Edition | 2013 | 200000 | 169000 | Diesel | Individual | Manual | First Owner | 20.0 | 1399.0 | 68.1 | 5.0 |


In [4]:
# Display the list of column names
columns_list = data.columns.tolist()
print(columns_list)

['name', 'year', 'selling_price', 'km_driven', 'fuel', 'seller_type', 'transmission', 'owner', 'mileage(km/ltr/kg)', 'engine', 'max_power', 'seats']


#### Column explanations
- name – The make and model of the car.
- year – The manufacturing year of the car.
- selling_price – The price at which the car is being sold.
- km_driven – The total distance the car has been driven, measured in kilometers.
- fuel – The type of fuel the car uses.
- seller_type – Who is selling the car.
- transmission – The car’s gearbox type, manual or automatic.
- owner – The ownership history.
- mileage(km/ltr/kg) – How far the car can travel per unit of fuel, in kilometers per liter or per kilogram depending on the fuel type.
- engine – The engine capacity, usually given in cubic centimeters (cc).
- max_power – The maximum power output of the engine.
- seats – The number of seats available in the car.

### 2. Checking missing values in the dataset

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8128 entries, 0 to 8127
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   name                8128 non-null   object 
 1   year                8128 non-null   int64  
 2   selling_price       8128 non-null   int64  
 3   km_driven           8128 non-null   int64  
 4   fuel                8128 non-null   object 
 5   seller_type         8128 non-null   object 
 6   transmission        8128 non-null   object 
 7   owner               8128 non-null   object 
 8   mileage(km/ltr/kg)  7907 non-null   float64
 9   engine              7907 non-null   float64
 10  max_power           7913 non-null   object 
 11  seats               7907 non-null   float64
dtypes: float64(3), int64(3), object(6)
memory usage: 762.1+ KB
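
The info() output above lists max_power with dtype object even though it describes a power figure. A minimal sketch of how such a column could be converted to numbers, assuming any entry that cannot be parsed should become NaN. The small frame below is made up for illustration and is not taken from cardekho.csv:

```python
import pandas as pd

# Hypothetical sample: numeric strings mixed with an unparseable entry and a gap
sample = pd.DataFrame({"max_power": ["74.0", "103.52", "n/a", None]})

# errors="coerce" turns anything that cannot be parsed into NaN
sample["max_power"] = pd.to_numeric(sample["max_power"], errors="coerce")

print(sample["max_power"].dtype)         # float64
print(sample["max_power"].isna().sum())  # 2
```

Coercion can raise the missing-value count for the column, so it is worth re-running the missing-value check afterwards.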


In [None]:
# showing number of missing values per column
missing_count = data.isnull().sum()
print(missing_count)

name                    0
year                    0
selling_price           0
km_driven               0
fuel                    0
seller_type             0
transmission            0
owner                   0
mileage(km/ltr/kg)    221
engine                221
max_power             215
seats                 221
dtype: int64


In [10]:
# Percentage of missing values per column
missing_percentage = (missing_count / len(data)) * 100
print(missing_percentage)

name                  0.000000
year                  0.000000
selling_price         0.000000
km_driven             0.000000
fuel                  0.000000
seller_type           0.000000
transmission          0.000000
owner                 0.000000
mileage(km/ltr/kg)    2.718996
engine                2.718996
max_power             2.645177
seats                 2.718996
dtype: float64
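
The count and the percentage can also be viewed side by side. A small sketch on a made-up frame (the values and the name `summary` are illustrative assumptions, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for the car data
df = pd.DataFrame({
    "selling_price": [450000, 370000, 158000, 225000],
    "engine": [1248.0, np.nan, 1497.0, 1396.0],
    "seats": [5.0, np.nan, np.nan, 5.0],
})

# isnull().sum() counts missing cells; isnull().mean() gives the missing fraction
summary = pd.DataFrame({
    "missing_count": df.isnull().sum(),
    "missing_pct": df.isnull().mean() * 100,
})

# Keep only the columns that actually have gaps, worst first
summary = summary[summary["missing_count"] > 0].sort_values(
    "missing_pct", ascending=False
)
print(summary)
```

Sorting by percentage makes it easier to spot which columns need attention first.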


### 3. Dropping rows with a missing target

In [12]:
# Drop rows where selling_price is NaN
data = data.dropna(subset=['selling_price'])
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8128 entries, 0 to 8127
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   name                8128 non-null   object 
 1   year                8128 non-null   int64  
 2   selling_price       8128 non-null   int64  
 3   km_driven           8128 non-null   int64  
 4   fuel                8128 non-null   object 
 5   seller_type         8128 non-null   object 
 6   transmission        8128 non-null   object 
 7   owner               8128 non-null   object 
 8   mileage(km/ltr/kg)  7907 non-null   float64
 9   engine              7907 non-null   float64
 10  max_power           7913 non-null   object 
 11  seats               7907 non-null   float64
dtypes: float64(3), int64(3), object(6)
memory usage: 762.1+ KB


### Why we can't train a model with missing target values
- In supervised machine learning, the target variable (y) is what the model learns to predict.
- If a row has a missing selling_price, there is no correct answer for that example, so the model has nothing to learn from that row.
- Including rows with missing targets would either cause errors or, worse, introduce incorrect assumptions into training.
- The model needs both features (X) and a target (y) for every training example.
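
The points above can be sketched on a tiny made-up frame. The rows and the names `labeled`, `X` and `y` are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the last one has no selling_price (the target)
df = pd.DataFrame({
    "year": [2014, 2006, 2010],
    "km_driven": [145500, 140000, 127000],
    "selling_price": [450000.0, 158000.0, np.nan],
})

# Drop only the rows whose target is missing
labeled = df.dropna(subset=["selling_price"])

X = labeled.drop(columns=["selling_price"])  # features
y = labeled["selling_price"]                 # target

print(len(df), len(labeled))  # 3 2
```

Every remaining row now has both features and a target, which is what a supervised estimator expects.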

### 4. Filling missing mileage values

In [13]:
# Fill missing mileage values with the mean of the column.
# The result is assigned back to the column instead of calling fillna(..., inplace=True)
# on data[col]: pandas warns about that pattern, and it stops working in pandas 3.0.
data['mileage(km/ltr/kg)'] = data['mileage(km/ltr/kg)'].fillna(data['mileage(km/ltr/kg)'].mean())


In [14]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8128 entries, 0 to 8127
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   name                8128 non-null   object 
 1   year                8128 non-null   int64  
 2   selling_price       8128 non-null   int64  
 3   km_driven           8128 non-null   int64  
 4   fuel                8128 non-null   object 
 5   seller_type         8128 non-null   object 
 6   transmission        8128 non-null   object 
 7   owner               8128 non-null   object 
 8   mileage(km/ltr/kg)  8128 non-null   float64
 9   engine              7907 non-null   float64
 10  max_power           7913 non-null   object 
 11  seats               7907 non-null   float64
dtypes: float64(3), int64(3), object(6)
memory usage: 762.1+ KB


### Why filling missing values can sometimes be better than dropping rows:

- Keeps more data → Dropping rows removes potentially valuable information from other columns, which reduces the dataset size and may weaken the model.

- Maintains patterns → If missing values are few and random, filling them with a statistical measure (mean, median, mode) preserves most of the dataset’s relationships.

- Avoids bias from deletion → Dropping too many rows might make the dataset less representative of the real-world distribution.

- Improves model stability → Some algorithms can’t handle missing values at all, so filling them avoids training errors without losing too much information.
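
A rough comparison of the two options on made-up data. The frame and the shortened column name `mileage` are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in one column
df = pd.DataFrame({
    "km_driven": [145500, 120000, 140000, 127000, 45000],
    "mileage": [23.4, np.nan, 17.7, np.nan, 20.14],
})

# Option 1: drop incomplete rows (also discards their km_driven values)
dropped = df.dropna()

# Option 2: fill the gaps with the column mean and keep every row
filled = df.assign(mileage=df["mileage"].fillna(df["mileage"].mean()))

print(len(dropped), len(filled))  # 3 5
```

All filled cells share the same value, so the column's spread shrinks slightly. The median is a common alternative when a column is skewed.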

### 5. Removing duplicate rows

In [15]:
# Remove duplicate rows
data.drop_duplicates(inplace=True)

### How duplicate rows can affect model training:

- Biases the model → Duplicate rows give certain examples extra weight, making the model think those patterns are more common than they really are.

- Reduces generalization → The model might overfit to repeated data instead of learning patterns that apply broadly.

- Wastes computation → Training time increases unnecessarily when the model processes the same example multiple times.

- Skews evaluation metrics → If the same row lands in both the training and test sets, the model is tested on data it has already seen, so validation metrics usually look better than real-world performance.
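
Before dropping duplicates it can help to count them. A minimal sketch on made-up listings (the rows and the names `n_dupes` and `deduped` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical listings; rows 0 and 2 are identical
df = pd.DataFrame({
    "name": ["Maruti Swift", "Honda City", "Maruti Swift"],
    "year": [2014, 2006, 2014],
    "selling_price": [450000, 158000, 450000],
})

# duplicated() flags every repeat after the first occurrence
n_dupes = df.duplicated().sum()

# Drop the repeats and rebuild a clean 0..n-1 index
deduped = df.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(deduped))  # 1 2
```

Removing duplicates before any train/test split keeps the same row from ending up on both sides.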

### 6. Creating a car_age feature

In [16]:
from datetime import datetime

# Get the current year
current_year = datetime.now().year

# Create a new column 'car_age'
data['car_age'] = current_year - data['year']

In [17]:
data.head()

| | name | year | selling_price | km_driven | fuel | seller_type | transmission | owner | mileage(km/ltr/kg) | engine | max_power | seats | car_age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Maruti Swift Dzire VDI | 2014 | 450000 | 145500 | Diesel | Individual | Manual | First Owner | 23.4 | 1248.0 | 74.0 | 5.0 | 11 |
| 1 | Skoda Rapid 1.5 TDI Ambition | 2014 | 370000 | 120000 | Diesel | Individual | Manual | Second Owner | 21.14 | 1498.0 | 103.52 | 5.0 | 11 |
| 2 | Honda City 2017-2020 EXi | 2006 | 158000 | 140000 | Petrol | Individual | Manual | Third Owner | 17.7 | 1497.0 | 78.0 | 5.0 | 19 |
| 3 | Hyundai i20 Sportz Diesel | 2010 | 225000 | 127000 | Diesel | Individual | Manual | First Owner | 23.0 | 1396.0 | 90.0 | 5.0 | 15 |
| 4 | Maruti Swift VXI BSIII | 2007 | 130000 | 120000 | Petrol | Individual | Manual | First Owner | 16.1 | 1298.0 | 88.2 | 5.0 | 18 |
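
Because datetime.now().year changes over time, car_age will differ when the notebook is re-run in a later year. One possible variation, sketched here as an assumption rather than the notebook's method, is to pin a reference year. The constant `REFERENCE_YEAR` and the sample rows are hypothetical:

```python
import pandas as pd

# Assumption: pin the reference year so the feature is reproducible.
# 2025 matches the car_age values in the output above (2014 -> 11).
REFERENCE_YEAR = 2025

df = pd.DataFrame({"year": [2014, 2006, 2010]})  # hypothetical rows
df["car_age"] = REFERENCE_YEAR - df["year"]

print(df["car_age"].tolist())  # [11, 19, 15]
```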


### 7. Checking unique fuel types

In [18]:
# Unique fuel types
print(data['fuel'].unique())

['Diesel' 'Petrol' 'LPG' 'CNG']


### Why knowing all possible values before encoding is important:

- Avoids missing categories → If you encode without checking, some categories might be skipped or wrongly handled.

- Prevents errors → Some encoders (like scikit-learn's OneHotEncoder with its default handle_unknown='error') will fail if they encounter unseen categories at prediction time.

- Ensures correct mapping → You need to verify that the mapping from category to number is logical and consistent.

- Helps decide encoding strategy → For example:
    - If there are few categories → one-hot encoding might be fine.
    - If there are many categories → label encoding or target encoding may be better.
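
With only four fuel types, one-hot encoding is a reasonable fit. A minimal sketch using pandas alone, assuming the category list printed above is complete. The `batch` frame and the `FUEL_TYPES` constant are made up for illustration:

```python
import pandas as pd

# Fuel categories as printed by data['fuel'].unique() above
FUEL_TYPES = ["Diesel", "Petrol", "LPG", "CNG"]

# Hypothetical batch that happens to contain only two of the four types
batch = pd.DataFrame({"fuel": ["Petrol", "Diesel", "Petrol"]})

# Declaring the full category list keeps the encoded columns stable,
# even when some categories are absent from the batch
batch["fuel"] = pd.Categorical(batch["fuel"], categories=FUEL_TYPES)
encoded = pd.get_dummies(batch["fuel"], prefix="fuel")

print(encoded.columns.tolist())
# ['fuel_Diesel', 'fuel_Petrol', 'fuel_LPG', 'fuel_CNG']
```

Fixing the categories up front means training and later data produce the same set of columns. A value outside the list becomes NaN instead of silently creating a new column.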