# Car Price Prediction: Data Science Pipeline
**Author:** GROUP B
**Project:** End-to-End Car Price Prediction Analysis

This notebook documents the complete data science workflow, from raw data exploration to predictive modeling.

## 1. Setup and Library Imports
We use standard data science libraries: `pandas` and `numpy` for data manipulation, `seaborn` and `matplotlib` for visualization, and `scikit-learn` for machine learning.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score

# Set visualization style
sns.set_theme(style="whitegrid")
%matplotlib inline

## 2. Data Loading
We load the original dataset, `car_price_prediction_.csv`, then print its shape and preview the first rows.

In [None]:
df = pd.read_csv('car_price_prediction_.csv')
print(f"Dataset Shape: {df.shape}")
df.head()

## 3. Data Cleaning
In this step, we: 
- Remove irrelevant columns like `Car ID`.
- Check for missing values. The cell below only reports the count per column; it does not impute or drop anything.
- Verify data types.

In [None]:
# Dropping Car ID
if 'Car ID' in df.columns:
    df = df.drop('Car ID', axis=1)

# Null Value Check
print("Missing Values:")
print(df.isnull().sum())

# Data Types Check
print("\nData Types:")
print(df.dtypes)
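
If the null check above reports gaps, they need to be handled before modeling. The following is a minimal sketch of one common approach, assuming median imputation for numeric columns and mode imputation for everything else. The `toy` frame and the `impute_simple` helper are hypothetical and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; all values are made up for illustration.
toy = pd.DataFrame({
    "Brand": ["Audi", "BMW", None, "Audi"],
    "Mileage": [50000.0, np.nan, 120000.0, 80000.0],
    "Price": [20000.0, 35000.0, 15000.0, 25000.0],
})

def impute_simple(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric gaps with the column median and other gaps with the column mode."""
    out = frame.copy()  # leave the caller's frame untouched
    for col in out.columns:
        if out[col].isnull().sum() == 0:
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

cleaned = impute_simple(toy)
print(cleaned.isnull().sum())
```

When imputation feeds a model, the fill values should be learned from the training split only. One way to do that would be to place scikit-learn's `SimpleImputer` inside the preprocessing pipeline instead of filling the whole frame up front.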

## 4. Exploratory Data Analysis (EDA)
Understanding the distribution of car prices and features.

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(df['Price'], kde=True, color='blue')
plt.title('Distribution of Car Prices')
plt.xlabel('Price ($)')
plt.show()

### 4.1 Numerical Feature Analysis
Analyzing `Year`, `Engine Size`, and `Mileage`.

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

sns.scatterplot(ax=axes[0], x='Year', y='Price', data=df)
axes[0].set_title('Price vs Year')

sns.scatterplot(ax=axes[1], x='Mileage', y='Price', data=df)
axes[1].set_title('Price vs Mileage')

sns.scatterplot(ax=axes[2], x='Engine Size', y='Price', data=df)
axes[2].set_title('Price vs Engine Size')

plt.tight_layout()
plt.show()

## 5. Categorical Feature Analysis
Analyzing how `Brand`, `Fuel Type`, and `Condition` relate to price. The plots in this section cover `Brand` and `Condition`.

In [None]:
plt.figure(figsize=(12, 6))
sns.boxplot(x='Brand', y='Price', data=df)
plt.title('Car Price by Brand')
plt.xticks(rotation=45)
plt.show()

In [None]:
plt.figure(figsize=(10, 5))
sns.barplot(x='Condition', y='Price', data=df, order=['Used', 'Like New', 'New'])
plt.title('Average Price by Condition')
plt.show()
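
`Fuel Type` is not plotted above. A grouped summary is one quick way to inspect it alongside the plots. This is a sketch on made-up rows: the fuel type labels and prices below are assumptions for illustration, not values taken from the dataset.

```python
import pandas as pd

# Made-up rows for illustration; only the column names follow this notebook.
toy = pd.DataFrame({
    "Fuel Type": ["Petrol", "Diesel", "Electric", "Petrol", "Diesel", "Electric"],
    "Price": [20000, 30000, 50000, 24000, 26000, 40000],
})

# Count, mean, and median price per fuel type, highest median first.
fuel_summary = (
    toy.groupby("Fuel Type")["Price"]
       .agg(["count", "mean", "median"])
       .sort_values("median", ascending=False)
)
print(fuel_summary)
```

On the real data, the same call would be `df.groupby('Fuel Type')['Price']`. Reporting the count next to the mean helps flag categories with too few rows to compare reliably.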

## 6. Correlation Analysis
We use a heatmap to identify relationships between numerical columns.

In [None]:
plt.figure(figsize=(8, 6))
numeric_df = df.select_dtypes(include=[np.number])
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
plt.show()
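
The heatmap shows every pairwise correlation, but for this task the `Price` column is the one that matters. The sketch below ranks features by the absolute value of their correlation with `Price`. It runs on synthetic data in which price depends on `Mileage` only; that dependence is an assumption made for the demo, not a finding about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the numerical columns used in this notebook.
toy = pd.DataFrame({
    "Year": rng.integers(2000, 2024, size=n),
    "Mileage": rng.uniform(0, 300000, size=n),
    "Engine Size": rng.uniform(1.0, 6.0, size=n),
})
# Demo assumption: price falls with mileage, plus random noise.
toy["Price"] = 50000 - 0.1 * toy["Mileage"] + rng.normal(0, 2000, size=n)

# Pearson correlation of each numeric feature with Price, strongest first.
price_corr = (
    toy.select_dtypes(include=[np.number])
       .corr()["Price"]
       .drop("Price")
       .sort_values(key=np.abs, ascending=False)
)
print(price_corr)
```

Pearson correlation only captures linear relationships, so a value near zero does not rule out a nonlinear effect.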

## 7. Linear Regression Modeling
We build a predictive pipeline that includes:
1. **OneHotEncoding** for categorical data.
2. **StandardScaler** for numerical data.
3. **Linear Regression** as the estimator.

In [None]:
# Separate Features and Target
X = df.drop('Price', axis=1)
y = df['Price']

# Define column groups
# Select by dtype family so that int32/float32 and non-object text columns are not silently skipped
numerical_cols = X.select_dtypes(include='number').columns.tolist()
categorical_cols = X.select_dtypes(exclude='number').columns.tolist()

# Create preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ])

# Define Pipeline
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit Model
model_pipeline.fit(X_train, y_train)

# Predictions
y_pred = model_pipeline.predict(X_test)

# Metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse:.2f}")
print(f"R-squared Score: {r2:.4f}")
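
A single 80/20 split yields one estimate of R-squared, and that estimate can shift with the choice of `random_state`. Cross-validation gives a per-fold view of the same pipeline. The sketch below uses synthetic data with made-up brands and a built-in mileage effect, so its scores say nothing about the real dataset; `demo` and `pipe` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 300

# Synthetic stand-in data: one categorical and one numerical feature.
demo = pd.DataFrame({
    "Brand": rng.choice(["Audi", "BMW", "Ford"], size=n),
    "Mileage": rng.uniform(0, 300000, size=n),
})
# Demo assumption: price falls with mileage, plus random noise.
demo["Price"] = 60000 - 0.15 * demo["Mileage"] + rng.normal(0, 3000, size=n)

X_demo = demo.drop("Price", axis=1)
y_demo = demo["Price"]

# Same structure as the pipeline above: scale numbers, one-hot encode categories.
pipe = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num", StandardScaler(), ["Mileage"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Brand"]),
    ])),
    ("regressor", LinearRegression()),
])

# Shuffled 5-fold split; the preprocessor is refit on each training fold.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="r2")
print(f"R2 per fold: {np.round(scores, 3)}")
print(f"Mean R2: {scores.mean():.3f} (std {scores.std():.3f})")
```

With the notebook's own objects, the equivalent call would be `cross_val_score(model_pipeline, X, y, cv=cv, scoring="r2")`. Keeping the preprocessing inside the pipeline means the scaler and encoder never see the held-out fold.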

## 8. Conclusion and Observations
Based on the analysis:
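- The model's test-set R-squared is approximately -0.02. A value at or below zero means the linear model predicts no better than always guessing the mean price, which suggests that the input features carry little linear signal about price in the **original provided dataset**.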
- In a real-world scenario, we would perform further feature engineering or use a larger, more structured dataset to improve performance.
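
To make the R-squared reading above concrete, the sketch below computes the score by hand from its definition, 1 - SS_res / SS_tot. The `r_squared` helper and the four prices are made up for illustration; on real predictions, `r2_score` from scikit-learn performs the same calculation.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot measured around the mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up prices for illustration.
y_true = np.array([10000.0, 20000.0, 30000.0, 40000.0])

r2_mean = r_squared(y_true, np.full(4, y_true.mean()))  # always predict the mean
r2_off = r_squared(y_true, np.full(4, 30000.0))         # a constant that misses the mean
r2_exact = r_squared(y_true, y_true)                    # perfect predictions
print(r2_mean, r2_off, r2_exact)
```

Predicting the mean of the evaluated prices scores exactly zero, and any other constant scores below zero. A slightly negative test score is therefore what we would expect from a model that has learned little beyond the training-set average.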

---
**End of Report**