# EDA | Assignment

**Instructions:** Carefully read each question. Use Google Docs, Microsoft Word, or a similar tool
to create a document where you type out each question along with its answer. Save the
document as a PDF, and then upload it to the LMS. Please do not zip or archive the files before
uploading them. Each question carries 20 marks.



**Dataset Link (Bike Details Dataset):**
https://drive.google.com/file/d/1iKy23bMtEQShF_weneRNnYrFmzvpPOI3/view?usp=drive_link

● Download the Bike Details dataset.

● Complete all practical questions using Python and relevant data science libraries.

● Present your findings with appropriate visualizations and summary statistics.

● Save as PDF and submit.


**Question 1:** Read the Bike Details dataset into a Pandas DataFrame and display its first 10 rows.


(Show the shape and column names as well.)

(Include your Python code and output in the code box below.)

**Answer:**

In [None]:
import pandas as pd

# Load the dataset
# Update the file_path with the correct location of your CSV file
file_path = "/mnt/data/BIKE DETAILS.csv" # <<< Update this path
try:
    df = pd.read_csv(file_path)

    # Display first 10 rows, shape, and column names
    print("First 10 rows:")
    print(df.head(10))

    print("\nShape of the dataset:", df.shape)
    print("\nColumn Names:", df.columns.tolist())

except FileNotFoundError:
    print(f"Error: The file was not found at {file_path}. Please update the file_path with the correct location.")
except Exception as e:
    print(f"An error occurred: {e}")

Error: The file was not found at /mnt/data/BIKE DETAILS.csv. Please update the file_path with the correct location.


**Question 2:** Check for missing values in all columns and describe your approach for handling them.

(Include your Python code and output in the code box below.)

**Answer:**

In [None]:
import pandas as pd

# Reload the dataset
file_path = "/mnt/data/BIKE DETAILS.csv"
df = pd.read_csv(file_path)

# Check missing values in all columns
missing_values = df.isnull().sum()

# Percentage of missing values (the mean of the boolean mask is the missing share)
missing_percentage = df.isnull().mean() * 100

missing_summary = pd.DataFrame({
    "Missing Values": missing_values,
    "Percentage (%)": missing_percentage.round(2)
})

print(missing_summary)
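
The question also asks for a handling approach, which the cell above does not cover. One possible approach is sketched below on a small made-up frame, since the real file is not loaded here: drop columns that are mostly empty, fill numeric gaps with the median, and fill categorical gaps with the mode. The helper `handle_missing`, the 50% threshold, the `mostly_empty` column, and all values are assumptions for illustration, not results from the Bike Details dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the bike data (all values are made up)
toy = pd.DataFrame({
    "km_driven": [12000.0, np.nan, 30000.0, 45000.0],
    "seller_type": ["Individual", "Dealer", None, "Individual"],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],  # hypothetical sparse column
})

def handle_missing(frame, drop_threshold=0.5):
    """Drop sparse columns, then impute the rest (median for numeric, mode otherwise)."""
    out = frame.copy()
    # Drop columns whose share of missing values exceeds the threshold
    sparse_cols = out.columns[out.isnull().mean() > drop_threshold]
    out = out.drop(columns=sparse_cols)
    for col in out.columns:
        if out[col].isnull().any():
            if pd.api.types.is_numeric_dtype(out[col]):
                # The median is less sensitive to extreme values than the mean
                out[col] = out[col].fillna(out[col].median())
            else:
                # Most frequent category for non-numeric columns
                out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

cleaned = handle_missing(toy)
print(cleaned)
```

On the real data, the same function could be called as `handle_missing(df)` once `missing_summary` shows which columns actually contain gaps.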


**Question 3:** Plot the distribution of selling prices using a histogram and describe the overall trend.

(Include your Python code and output in the code box below.)

**Answer:**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load dataset
file_path = "/mnt/data/BIKE DETAILS.csv"
df = pd.read_csv(file_path)

# Plot histogram of selling prices
plt.figure(figsize=(8,5))
plt.hist(df['selling_price'], bins=50, edgecolor='black')
plt.title("Distribution of Selling Prices")
plt.xlabel("Selling Price")
plt.ylabel("Frequency")
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
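
To describe the overall trend in words, the histogram can be backed by a few numbers. The sketch below compares the mean and median and computes the skewness with `Series.skew()`. A mean well above the median together with a positive skew would indicate a right-skewed distribution (many lower-priced bikes and a few expensive ones). The prices used here are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up prices with one expensive bike, only to illustrate the check
prices = pd.Series(
    [20000, 25000, 30000, 35000, 40000, 45000, 300000],
    name="selling_price",
)

summary = {
    "mean": prices.mean(),
    "median": prices.median(),
    "skew": prices.skew(),  # > 0 suggests a long right tail
}
print(summary)
```

On the real data, `prices` would be replaced by `df['selling_price']`.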


**Question 4:** Create a bar plot to visualize the average selling price for each seller_type and write one observation.

(Include your Python code and output in the code box below.)

**Answer:**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load dataset
file_path = "/mnt/data/BIKE DETAILS.csv"
df = pd.read_csv(file_path)

# Calculate average selling price for each seller_type
avg_price = df.groupby("seller_type")["selling_price"].mean()

# Plot bar chart
plt.figure(figsize=(6,4))
avg_price.plot(kind='bar', color=['skyblue', 'orange', 'green'], edgecolor='black')
plt.title("Average Selling Price by Seller Type")
plt.xlabel("Seller Type")
plt.ylabel("Average Selling Price")
plt.xticks(rotation=0)
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()


**Question 5:** Compute the average km_driven for each ownership type (1st owner, 2nd owner, etc.), and present the result as a bar plot.

(Include your Python code and output in the code box below.)

**Answer:**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load dataset
file_path = "/mnt/data/BIKE DETAILS.csv"
df = pd.read_csv(file_path)

# Compute average km_driven for each ownership type
avg_km = df.groupby("owner")["km_driven"].mean().sort_values()

# Plot bar chart
plt.figure(figsize=(7,4))
avg_km.plot(kind='bar', color='teal', edgecolor='black')
plt.title("Average km_driven by Ownership Type")
plt.xlabel("Ownership Type")
plt.ylabel("Average km_driven")
plt.xticks(rotation=30, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()


**Question 6:** Use the IQR method to detect and remove outliers from the km_driven column. Show before-and-after summary statistics.

(Include your Python code and output in the code box below.)

**Answer:**

In [None]:
import pandas as pd

# Load dataset
file_path = "/mnt/data/BIKE DETAILS.csv"
df = pd.read_csv(file_path)

# Summary statistics before removing outliers
print("Before Removing Outliers:")
print(df["km_driven"].describe())

# IQR method
Q1 = df["km_driven"].quantile(0.25)
Q3 = df["km_driven"].quantile(0.75)
IQR = Q3 - Q1

# Define bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Remove outliers
df_no_outliers = df[(df["km_driven"] >= lower_bound) & (df["km_driven"] <= upper_bound)]

# Summary statistics after removing outliers
print("\nAfter Removing Outliers:")
print(df_no_outliers["km_driven"].describe())
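
The IQR rule used above keeps values inside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]. As a minimal sketch, the same steps can be wrapped in a small helper so the bounds and the number of removed rows are easy to report. The function name `iqr_bounds` and the toy readings are assumptions for illustration only.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) fences of the IQR rule for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up odometer readings with one extreme value
toy_km = pd.Series([5000, 10000, 15000, 20000, 25000, 500000], name="km_driven")

lower, upper = iqr_bounds(toy_km)
kept = toy_km[toy_km.between(lower, upper)]  # bounds are inclusive
removed = len(toy_km) - len(kept)

print("Bounds:", lower, upper)
print("Rows removed:", removed)
```

Reporting `removed` next to the before-and-after `describe()` output makes it clear how much data the filter discards.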


**Question 7:** Create a scatter plot of year vs. selling_price to explore the relationship between a bike's age and its price.

(Include your Python code and output in the code box below.)

**Answer:**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load dataset
file_path = "/mnt/data/BIKE DETAILS.csv"
df = pd.read_csv(file_path)

# Scatter plot: Year vs Selling Price
plt.figure(figsize=(8,5))
plt.scatter(df['year'], df['selling_price'], alpha=0.5, c='blue')
plt.title("Scatter Plot: Year vs. Selling Price")
plt.xlabel("Year of Manufacture")
plt.ylabel("Selling Price")
plt.grid(linestyle='--', alpha=0.7)
plt.show()
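
Since the question is framed in terms of a bike's age, it may help to derive an explicit age column from `year`. The sketch below assumes the most recent year in the data as the reference year, which is an assumption rather than something the dataset specifies. The values are made up for illustration.

```python
import pandas as pd

# Made-up rows, only to illustrate the derivation
toy = pd.DataFrame({
    "year": [2010, 2013, 2016, 2019],
    "selling_price": [15000, 28000, 42000, 70000],
})

# Assumption: age is measured relative to the newest bike in the data
reference_year = toy["year"].max()
toy["age"] = reference_year - toy["year"]

# Pearson correlation between age and price
age_price_corr = toy["age"].corr(toy["selling_price"])
print(toy)
print("Correlation (age vs. selling_price):", round(age_price_corr, 2))
```

A negative correlation on the real data would support the reading that older bikes tend to sell for less.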


**Question 8:** Convert the seller_type column into numeric format using **one-hot encoding**. Display the first 5 rows of the resulting DataFrame.

(Include your Python code and output in the code box below.)

**Answer:**

In [None]:
import pandas as pd

# Load dataset
file_path = "/mnt/data/BIKE DETAILS.csv"
df = pd.read_csv(file_path)

# One-hot encoding for seller_type
df_encoded = pd.get_dummies(df, columns=["seller_type"], drop_first=False)

# Display first 5 rows
print(df_encoded.head(5))
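
In recent pandas versions, `pd.get_dummies` returns boolean (`True`/`False`) columns by default. If 0/1 integers are preferred for the "numeric format" the question asks for, passing `dtype=int` is one option. The sketch below shows this on a small made-up frame; the rows are illustrative only.

```python
import pandas as pd

# Made-up rows, only to show the shape of the encoded output
toy = pd.DataFrame({
    "seller_type": ["Individual", "Dealer", "Individual"],
    "selling_price": [30000, 55000, 42000],
})

# dtype=int yields 0/1 columns instead of booleans
encoded = pd.get_dummies(toy, columns=["seller_type"], dtype=int)
print(encoded)
```

Each original category becomes its own `seller_type_<category>` column, and exactly one of them is 1 in every row.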


**Question 9:** Generate a heatmap of the correlation matrix for all numeric columns. What correlations stand out the most?

(Include your Python code and output in the code box below.)

**Answer:**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
file_path = "/mnt/data/BIKE DETAILS.csv"
df = pd.read_csv(file_path)

# Compute correlation matrix (numeric columns only)
corr_matrix = df.corr(numeric_only=True)

# Plot heatmap
plt.figure(figsize=(8,6))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Heatmap of Numeric Columns")
plt.show()
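
To answer which correlations stand out, the matrix can also be ranked numerically instead of being read off the heatmap by eye. The sketch below keeps each column pair once (upper triangle, diagonal excluded) and sorts the pairs by absolute correlation. The toy values are made up for illustration and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up numeric columns, only to illustrate the ranking
toy = pd.DataFrame({
    "year": [2010, 2012, 2014, 2016, 2018],
    "km_driven": [60000, 48000, 35000, 20000, 8000],
    "selling_price": [15000, 22000, 30000, 45000, 60000],
})

corr = toy.corr(numeric_only=True)

# Keep each pair once: upper triangle without the diagonal
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Strongest relationships first, regardless of sign
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

On the real data, `toy` would be replaced by `df`, and the top entries of `ranked` are the pairs worth discussing in the answer.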


**Question 10:** Summarize your findings in a brief report:

● What are the most important factors affecting a bike's selling price?

● Mention any data cleaning or feature engineering you performed.

(Include your Python code and output in the code box below.)

**Answer:**

In [None]:
import pandas as pd

# Load dataset
file_path = "/mnt/data/BIKE DETAILS.csv"
df = pd.read_csv(file_path)

# --- Data Cleaning ---
# Check missing values
missing_summary = df.isnull().sum()

# Feature Engineering: One-hot encoding seller_type
df_encoded = pd.get_dummies(df, columns=["seller_type
