# ANSWERS.

Q1: Read the Bike Details dataset into a Pandas DataFrame and display its first 10 rows.

import pandas as pd

# Load dataset
df = pd.read_csv("BikeDetails.csv")

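# Inspect the dataset's size and column names
print("Shape of dataset:", df.shape)
print("Column Names:", df.columns.tolist())

# Display first 10 rows (wrap in print() when running as a script instead of a notebook)
df.head(10)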

Output (sample):

Shape of dataset: (6019, 7)
Column Names: ['name', 'year', 'selling_price', 'km_driven', 'seller_type', 'owner', 'mileage']

name	year	selling_price	km_driven	seller_type	owner	mileage

Honda Activa	2018	45000	15000	Individual	1st owner	50 kmpl
Bajaj Pulsar	2017	65000	32000	Individual	2nd owner	48 kmpl
…	…	…	…	…	…	…



---

Q2: Check for missing values and describe approach.

df.isnull().sum()

Output (example):

name             0
year             0
selling_price    0
km_driven        0
seller_type      0
owner            0
mileage         23
dtype: int64

Approach:

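Only mileage has missing values. If the column holds strings such as "50 kmpl" (as in the sample rows for Q1), it must first be converted to a number. The missing entries are then filled with the median, or the rows are dropped if mileage is not important for the analysis.


# Extract the numeric part of strings like "50 kmpl"; missing entries stay NaN
df['mileage'] = pd.to_numeric(df['mileage'].astype(str).str.extract(r'(\d+\.?\d*)')[0], errors='coerce')

# Fill missing mileage with the median (plain assignment instead of a chained inplace fill)
df['mileage'] = df['mileage'].fillna(df['mileage'].median())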
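
The choice between filling and dropping can be made explicit. The following is a minimal sketch, not part of the original answer: the helper name `handle_missing` and the 5% cut-off `max_drop_frac` are assumptions chosen for illustration, and the toy values are made up.

```python
import pandas as pd

def handle_missing(df: pd.DataFrame, column: str, max_drop_frac: float = 0.05) -> pd.DataFrame:
    """Drop rows with a missing `column` if few are missing, else fill with the median.

    max_drop_frac is a hypothetical cut-off, not something the dataset prescribes.
    """
    out = df.copy()  # leave the caller's DataFrame untouched
    missing_frac = out[column].isna().mean()
    if missing_frac <= max_drop_frac:
        return out.dropna(subset=[column])
    out[column] = out[column].fillna(out[column].median())
    return out

# Illustrative values only, not taken from BikeDetails.csv
toy = pd.DataFrame({"mileage": [50.0, 48.0, None, 60.0]})

filled = handle_missing(toy, "mileage")        # 25% missing, above the cut-off: median fill
dropped = handle_missing(toy, "mileage", 0.5)  # within the cut-off: rows dropped

print(filled["mileage"].tolist())
print(len(dropped))
```

On the real data the call would be `df = handle_missing(df, 'mileage')`, after mileage has been converted to a numeric column.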


---

Q3: Distribution of selling prices (histogram).

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8,5))
sns.histplot(df['selling_price'], bins=50, kde=True)
plt.xlabel("Selling Price")
plt.ylabel("Frequency")
plt.title("Distribution of Selling Prices")
plt.show()

Observation: The distribution is right-skewed. Most bikes are priced below ₹1,00,000, and a few outliers (very high-priced superbikes) form a long tail on the right.
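
The observation can be checked with numbers instead of being read off the plot. This is a small sketch on made-up prices; the helper name `share_below` and the sample values are assumptions for illustration.

```python
import pandas as pd

def share_below(prices: pd.Series, threshold: float) -> float:
    """Return the fraction of non-missing prices strictly below `threshold`."""
    prices = prices.dropna()
    return float((prices < threshold).mean())

# Illustrative values only, not taken from BikeDetails.csv
sample_prices = pd.Series([25000, 40000, 45000, 65000, 90000, 150000, 750000, 30000])

print("Share below 1,00,000:", share_below(sample_prices, 100_000))
print("Skewness:", sample_prices.skew())  # positive skew means a long right tail
```

With the real dataset, `share_below(df['selling_price'], 100_000)` would give the share that backs up "most bikes".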


---

Q4: Average selling price by seller type (bar plot).

avg_price = df.groupby('seller_type')['selling_price'].mean().reset_index()

plt.figure(figsize=(6,4))
sns.barplot(x='seller_type', y='selling_price', data=avg_price)
plt.title("Average Selling Price by Seller Type")
plt.show()

Observation:

Dealers sell at a higher average price than individuals.

Trustmark dealers (if present) show the highest average selling price, possibly because they sell bikes in better condition.
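
An average over a handful of rows is not very reliable, so it helps to look at group sizes alongside the means. A minimal sketch on made-up rows (the values are illustrative, and the median is added here only as a measure that is less sensitive to outliers):

```python
import pandas as pd

# Illustrative rows only; the real data comes from BikeDetails.csv
toy = pd.DataFrame({
    "seller_type": ["Individual", "Individual", "Individual",
                    "Dealer", "Dealer", "Trustmark dealer"],
    "selling_price": [30000, 45000, 60000, 70000, 90000, 120000],
})

# Count, mean and median per seller type, highest mean first
summary = (
    toy.groupby("seller_type")["selling_price"]
       .agg(["count", "mean", "median"])
       .sort_values("mean", ascending=False)
)
print(summary)
```

A group with a very small `count` should be interpreted with care, whatever its average.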



---

Q5: Average km_driven by ownership type.

avg_km = df.groupby('owner')['km_driven'].mean().reset_index()

plt.figure(figsize=(6,4))
sns.barplot(x='owner', y='km_driven', data=avg_km)
plt.title("Average km_driven by Ownership Type")
plt.show()

Observation:

1st-owner bikes have fewer kilometres on average than 3rd/4th-owner bikes.

More ownership transfers → more accumulated usage.



---

Q6: Detect and remove outliers in km_driven using IQR.

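# Quartiles and interquartile range of km_driven
Q1 = df['km_driven'].quantile(0.25)
Q3 = df['km_driven'].quantile(0.75)
IQR = Q3 - Q1

# Fences: values more than 1.5 * IQR beyond the quartiles count as outliers
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

print("Before:", df['km_driven'].describe(), sep="\n")

# Keep only the rows inside the fences
df = df[(df['km_driven'] >= lower) & (df['km_driven'] <= upper)]

print("After:", df['km_driven'].describe(), sep="\n")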

Output (example):

Before:
count    6019.0
mean     45000.3
max      1200000.0
After:
count    5900.0
mean     38000.7
max      200000.0

Observation: Extreme outliers, such as bikes showing 10+ lakh km, are removed.
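
To see how the fences behave, the same IQR rule can be worked through on five made-up readings. The helper name `iqr_bounds` and the values are assumptions for illustration.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)

# Illustrative readings only; the last one mimics a 12 lakh km entry
km = pd.Series([10000, 20000, 30000, 40000, 1200000])

lower, upper = iqr_bounds(km)
kept = km[km.between(lower, upper)]  # between() is inclusive on both ends

print(lower, upper)   # Q1 = 20000, Q3 = 40000, IQR = 20000
print(kept.tolist())
```

Here the lower fence is negative, so for a non-negative quantity like km_driven only the upper fence removes anything.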


---

Q7: Scatter plot of year vs. selling_price.

plt.figure(figsize=(8,5))
sns.scatterplot(x='year', y='selling_price', data=df)
plt.title("Year vs Selling Price")
plt.show()

Observation:

Newer bikes (recent years) have higher prices.

Older bikes sell for noticeably less, which is consistent with depreciation over time.
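
The plot uses the model year on the x-axis, while the observations are phrased in terms of age. If an explicit age column is wanted, one possible sketch is shown here. The column name `age` and the choice of `reference_year` are assumptions; the newest year in the data stands in for the current year.

```python
import pandas as pd

# Illustrative rows only, not taken from BikeDetails.csv
toy = pd.DataFrame({
    "year": [2018, 2017, 2010],
    "selling_price": [45000, 65000, 20000],
})

# Assumption: the newest model year in the data is used as the reference point
reference_year = toy["year"].max()
toy["age"] = reference_year - toy["year"]

print(toy)
```

Plotting `age` instead of `year` mirrors the scatter plot horizontally, so the trend reads as "price falls as age grows".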



---

Q8: One-hot encoding of seller_type.

df_encoded = pd.get_dummies(df, columns=['seller_type'])
df_encoded.head(5)

Output (sample first 5 rows):

year	selling_price	km_driven	owner	mileage	seller_type_Dealer	seller_type_Individual	seller_type_Trustmark dealer

2018	45000	15000	1st owner	50	0	1	0
2017	65000	32000	2nd owner	48	0	1	0
…	…	…	…	…	…	…	…
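
What the encoding does is easier to see on a few rows. This is a minimal sketch on made-up data. Passing `dtype=int` is an addition to the call above: recent pandas versions return True/False dummy columns by default, and `dtype=int` gives 0/1 columns like those in the sample output.

```python
import pandas as pd

# Illustrative rows only, not taken from BikeDetails.csv
toy = pd.DataFrame({
    "seller_type": ["Individual", "Dealer", "Trustmark dealer", "Individual"],
    "selling_price": [45000, 65000, 120000, 30000],
})

# One 0/1 column per category; the original seller_type column is removed
encoded = pd.get_dummies(toy, columns=["seller_type"], dtype=int)
dummy_cols = [c for c in encoded.columns if c.startswith("seller_type_")]

print(encoded.columns.tolist())
print(encoded[dummy_cols].sum(axis=1).tolist())  # each row has exactly one 1
```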



---

Q9: Correlation heatmap.

plt.figure(figsize=(8,6))
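# numeric_only=True skips text columns such as name and owner, which would otherwise raise an error
sns.heatmap(df_encoded.corr(numeric_only=True), annot=True, cmap="coolwarm")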
plt.title("Correlation Heatmap")
plt.show()

Observation:

year correlates positively with selling_price.

km_driven correlates negatively with selling_price.

The seller_type dummies correlate only weakly with selling_price.
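
The same information can be read as a ranked list instead of a heatmap. A minimal sketch on made-up rows; the values are illustrative and chosen so that price rises with year and falls with km_driven.

```python
import pandas as pd

# Illustrative rows only, not taken from BikeDetails.csv
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "year": [2012, 2014, 2016, 2018],
    "km_driven": [60000, 45000, 30000, 15000],
    "selling_price": [20000, 30000, 45000, 60000],
})

# Correlate numeric columns only, then rank each feature against selling_price
corr = toy.corr(numeric_only=True)
price_corr = corr["selling_price"].drop("selling_price").sort_values()

print(price_corr)  # most negative first, most positive last
```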



---

Q10: Summary Report

Key factors affecting selling price:

Year (newer bikes → higher price).

Km_driven (less driven bikes → higher price).

Seller_type (dealers sell at higher prices).

Owner count (1st-owner bikes fetch more).


Data cleaning performed:

Missing mileage filled with median.

Outliers in km_driven removed using IQR.

Encoded categorical variables (seller_type).


Conclusion:

Price decreases with a bike’s age and usage.

Trustmark dealers show premium pricing.

The cleaned dataset is ready for further modeling.