# Phase 1: Data Loading and String-to-Numeric Sanitization

In this initial phase, we address the primary challenge of the dataset: **Data Types**. In the raw CSV, numerical features such as `HorsePower`, `Price`, and `Total Speed` are stored as string objects because they include units (e.g., 'hp', 'km/h') and currency symbols.

### Methodology:
To prepare the data for mathematical analysis and machine learning, we implemented a robust cleaning pipeline:

1. **Unit Stripping:** We use a custom function to remove commas and specific units (case-insensitive).
2. **Range Resolution:** A significant portion of the data contains ranges (e.g., `$12,000 - $15,000` or `70-85 hp`). Our logic splits these strings, converts the individual values to floats, and calculates the **arithmetic mean** to provide a single representative data point.
3. **Type Conversion:** Each cleaned string is converted with `float()`. Values that cannot be parsed become `np.nan`, so the resulting columns have dtype `float64`.
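
The range-resolution idea in step 2 can be sketched on its own. The helper name `parse_numeric_range` and the regular expression below are illustrative assumptions, not the notebook's own function; the sketch simply pulls every number out of a string and averages them.

```python
import re
import numpy as np

def parse_numeric_range(text):
    """Return the mean of all numbers found in `text`, or NaN if there are none."""
    # Drop thousands separators so '12,000' is read as 12000.
    no_commas = str(text).replace(',', '')
    # Match unsigned integers or decimals; units and symbols are ignored.
    numbers = re.findall(r'\d+(?:\.\d+)?', no_commas)
    if not numbers:
        return np.nan
    return float(np.mean([float(n) for n in numbers]))

print(parse_numeric_range('$12,000 - $15,000'))  # 13500.0
print(parse_numeric_range('70-85 hp'))           # 77.5
print(parse_numeric_range('N/A'))                # nan
```

A regex like this would also pick up digits that are part of a label (for example the `8` in `V8`), so it is only appropriate for columns where every digit belongs to the measurement.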



### Summary Statistics:
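The final step of this block generates **Descriptive Statistics** with `describe()` (count, mean, standard deviation, min/max, and quartiles, where the 50% row is the median) for the cleaned `HorsePower` and `Cars Prices` columns. This allows us to verify the success of the conversion and identify initial data trends.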
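
Beyond summary statistics, one way to verify the conversion is to count how many rows had a raw value but ended up as `NaN` after cleaning. The sketch below uses made-up toy data in place of a real raw column and its cleaned counterpart.

```python
import numpy as np
import pandas as pd

# Toy data standing in for one raw column and its cleaned counterpart.
raw = pd.Series(['70-85 hp', '120 hp', None, 'unknown'])
cleaned = pd.Series([77.5, 120.0, np.nan, np.nan])

# Rows that had a raw value but could not be parsed into a number.
lost = raw.notna() & cleaned.isna()

print(cleaned.dtype)       # float64
print(int(lost.sum()))     # 1
print(raw[lost].tolist())  # ['unknown']
```

Inspecting the strings in `raw[lost]` shows which formats the cleaning function does not yet handle.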

In [None]:
import pandas as pd
import numpy as np

# Load the dataset with the correct encoding
df = pd.read_csv('Cars Datasets 2025.csv', encoding='latin-1')


# Function to clean string columns (handles units, commas, and ranges)
def clean_range_column(series, units_to_remove):
    # 1. Remove thousands separators and the unit text (case-insensitive)
    cleaned = (
        series.astype(str)
        .str.replace(',', '', regex=False)
        .str.replace(units_to_remove, '', regex=False, case=False)
        .str.strip()
    )

    # 2. Function to handle ranges (e.g., '70-85' becomes 77.5)
    def handle_range(value):
        if value.lower() in ('nan', 'n/a', ''):
            return np.nan
        # Remove every '$', not only leading/trailing ones, so '$12000 - $15000' parses
        value = value.replace('$', '').strip()

        if '-' in value:
            try:
                parts = [float(p.strip()) for p in value.split('-') if p.strip()]
                return np.mean(parts) if len(parts) >= 2 else (parts[0] if parts else np.nan)
            except ValueError:
                return np.nan
        else:
            try:
                return float(value)
            except ValueError:
                return np.nan

    return cleaned.apply(handle_range)

# Apply cleaning to the required columns
df['HorsePower_Clean'] = clean_range_column(df['HorsePower'], 'hp')
df['Total_Speed_Clean'] = clean_range_column(df['Total Speed'], 'km/h')
df['Performance_Clean'] = clean_range_column(df['Performance(0 - 100 )KM/H'], 'sec')
df['Cars Prices_Clean'] = clean_range_column(df['Cars Prices'], '')
df['Seats_Clean'] = clean_range_column(df['Seats'], '')
df['Torque_Clean'] = clean_range_column(df['Torque'], 'Nm')

print("--- Initial Cleaning & Summary Statistics Complete ---")
print(df[['HorsePower_Clean', 'Cars Prices_Clean']].describe().to_markdown(numalign="left", stralign="left"))

# Save the initially cleaned data
df.to_csv('cleaned_data.csv', index=False)

# Phase 2: Exploratory Data Analysis (EDA) & Data Visualization

With the dataset sanitized into numerical values, we now examine the data to identify patterns and statistical relationships. This phase works toward the requirement for 10-15 different analyses through the visualizations and grouped aggregations listed below.

### Key Analyses Performed:

1. **Correlation Analysis:** We use a **Heatmap** to determine how features like `HorsePower` and `Total Speed` relate to price (`Cars Prices_Clean`). This shows which variables have the strongest linear association with price and are therefore promising inputs for our machine learning model.
2. **Outlier Detection:** Using **Box Plots**, we identify extreme data points. The car market is unique because luxury hyper-cars (outliers) can skew the average price significantly.
3. **Distribution Analysis:** We use **Histograms** with Kernel Density Estimates (KDE) to see if our data follows a normal distribution or if it is skewed toward specific values.
4. **Categorical Standardization:** We unify the `Fuel Types` column (e.g., merging various "Hybrid" labels) to prepare for categorical encoding.
5. **Grouped Aggregations:** We calculate the **Mean Horsepower by Fuel Type** to see how performance varies across different engine technologies.
6. **Manufacturer Trends:** We visualize the **Top 10 Companies** to see which brands dominate the dataset.

### Visual Outputs:
* `correlation_heatmap.png`: Shows the pairwise correlations between the cleaned numeric columns.
* `outlier_boxplots.png`: Highlights the gap between standard and luxury vehicles.
* `horsepower_distribution.png`: Shows the frequency of different power levels.
* `top_10_companies.png`: Displays market representation in the data.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load data from the previous step
df = pd.read_csv('cleaned_data.csv')

# Define Categorical Standardization (CRUCIAL for later encoding)
def standardize_fuel(fuel):
    fuel = str(fuel).lower()
    if 'petrol' in fuel and 'diesel' in fuel:
        return 'Petrol/Diesel'
    # Check hybrids before electric so labels like 'Hybrid (Electric)' stay Hybrid
    elif 'hybrid' in fuel or 'hyrbrid' in fuel or 'plug-in' in fuel:
        return 'Hybrid'
    elif 'electric' in fuel or 'ev' in fuel:
        return 'Electric'
    elif 'diesel' in fuel:
        return 'Diesel'
    elif 'cng' in fuel:
        return 'CNG'
    elif 'hydrogen' in fuel:
        return 'Hydrogen'
    elif 'petrol' in fuel:
        return 'Petrol'
    else:
        return 'Other'

df['Fuel_Type_Standard'] = df['Fuel Types'].apply(standardize_fuel)

# Numerical columns for analysis
cleaned_cols = ['HorsePower_Clean', 'Total_Speed_Clean', 'Performance_Clean', 'Cars Prices_Clean']

# --- EDA Requirements ---

# 1. Correlation Analysis (Visualization: correlation_heatmap.png)
plt.figure(figsize=(8, 6))
sns.heatmap(df[cleaned_cols].corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix')
plt.savefig('correlation_heatmap.png')
plt.close()

# 2. Outlier Detection (Visualization: outlier_boxplots.png)
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.boxplot(y=df['HorsePower_Clean'])
plt.title('Box Plot of HorsePower')
plt.subplot(1, 2, 2)
sns.boxplot(y=df['Cars Prices_Clean'])
plt.ylim(0, 500000) # Zoomed in
plt.title('Box Plot of Car Prices (Zoomed)')
plt.savefig('outlier_boxplots.png')
plt.close()

# 3. Feature Distribution (Visualization: horsepower_distribution.png)
plt.figure(figsize=(8, 5))
sns.histplot(df['HorsePower_Clean'].dropna(), kde=True, bins=30)
plt.title('Distribution of HorsePower')
plt.savefig('horsepower_distribution.png')
plt.close()

# 4. Manufacturer Trends: Top 10 Companies by Count (Visualization: top_10_companies.png)
top_10_companies = df['Company Names'].value_counts().head(10)
plt.figure(figsize=(10, 5))
sns.barplot(x=top_10_companies.index, y=top_10_companies.values, palette="viridis")
plt.xticks(rotation=45, ha='right')
plt.title('Top 10 Companies by Car Count')
plt.tight_layout()
plt.savefig('top_10_companies.png')
plt.close()

# Print text results
print("\n--- 5. Grouped Aggregation: Mean HorsePower by Fuel Type ---")
print(df.groupby('Fuel_Type_Standard')['HorsePower_Clean'].mean().sort_values(ascending=False).to_markdown())

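print("\n--- 6. Data Types and Unique Value Counts ---")
# Report the unique-value count next to each dtype, as the heading promises
dtype_summary = pd.DataFrame({'Data Type': df.dtypes.astype(str), 'Unique Values': df.nunique()})
print(dtype_summary.to_markdown())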


# Save the fully cleaned data for ML
df.to_csv('cleaned_data_final.csv', index=False)
print("\nEDA complete. All 4 PNG files and 'cleaned_data_final.csv' saved.")

# Phase 3: Manual Data Preprocessing & Feature Engineering

To prepare our data for a Linear Regression model without relying on external machine learning libraries, we implement the core preprocessing steps manually. This ensures our model receives clean, normalized, and numerically consistent data.

### Methodology:

1. **Missing Value Imputation:** We identify missing values in numerical columns and fill them with the **Median**. The median is used instead of the mean because it is more robust against the extreme outliers (hyper-cars) identified in Phase 2.
2. **Standard Scaling (Z-score Normalization):** Rescaling features does not change the predictions of an ordinary least-squares fit, but it makes the learned weights comparable and keeps $X^T X$ better conditioned for the matrix inversion in Phase 4. If `HorsePower` ranges from 50 to 1000 and `Seats` ranges from 2 to 7, the raw weights differ by orders of magnitude simply because of the units. We transform every numerical feature so it has a **mean of 0 and a standard deviation of 1** using the formula:  
   $$z = \frac{x - \mu}{\sigma}$$
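3. **One-Hot Encoding:** A regression model needs numeric inputs, so text labels such as "Petrol" or "Diesel" cannot be used directly. We convert the `Fuel_Type_Standard` category into binary indicator columns (0 or 1) with `pd.get_dummies`, dropping the first level (`drop_first=True`) so the indicators are not collinear with the bias term. This is a pure Pandas implementation of categorical encoding.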
4. **Manual Train/Test Split:** To evaluate our model fairly, we shuffle the data using a random seed and split it:
   * **80% Training Set:** Used to teach the model.
   * **20% Testing Set:** Used to verify the model's accuracy on unseen data.
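
The scaling, encoding, and splitting steps above can be sketched on a toy frame. The column names `hp` and `fuel` and all values are hypothetical, and the shuffle uses NumPy's seeded generator rather than the `random` module used in the notebook's code cell.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset; the column names are shortened for the sketch.
toy = pd.DataFrame({
    'hp': [100.0, 200.0, 300.0, 400.0, 500.0],
    'fuel': ['Petrol', 'Diesel', 'Petrol', 'Electric', 'Diesel'],
})

# Z-score: subtract the mean, divide by the sample standard deviation.
mu, sigma = toy['hp'].mean(), toy['hp'].std()
toy['hp'] = (toy['hp'] - mu) / sigma

# One-hot encode; drop_first removes the alphabetically first level ('Diesel'),
# which becomes the baseline when all remaining indicator columns are 0.
encoded = pd.get_dummies(toy, columns=['fuel'], drop_first=True, dtype=int)

# Shuffle row positions with a seeded generator, then cut at 80%.
rng = np.random.default_rng(42)
order = rng.permutation(len(encoded))
cut = int(0.8 * len(encoded))
train, test = encoded.iloc[order[:cut]], encoded.iloc[order[cut:]]

print(encoded.columns.tolist())  # ['hp', 'fuel_Electric', 'fuel_Petrol']
print(len(train), len(test))     # 4 1
```

In a stricter setup the median, mean, and standard deviation would be computed from the training rows only, so that no information from the test set influences the transformation.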

### Configuration Export:
We save the means, standard deviations, and median values into `preprocessing_config.json`. This is a critical step that allows our **Streamlit application** to apply the exact same mathematical transformations to user inputs at runtime.
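
The Streamlit code itself is not part of this notebook, so the following is only a sketch of how the saved configuration might be applied to a single user input. The function name `transform_input` and the numeric values are assumptions; the dictionary keys match those written to `preprocessing_config.json`.

```python
import numpy as np

def transform_input(raw, config):
    """Turn one raw input dict into a feature vector ordered like final_columns."""
    row = {}
    for col, stats in config['scaling_stats'].items():
        value = raw.get(col)
        # Fall back to the stored median when the user leaves a field empty.
        if value is None:
            value = config['imputation_stats'][col]
        row[col] = (value - stats['mean']) / stats['std']
    # Set the matching one-hot column to 1; the dropped baseline level stays all zeros.
    fuel_col = 'Fuel_Type_Standard_' + raw['Fuel_Type_Standard']
    for col in config['categorical_levels']:
        row[col] = 1 if col == fuel_col else 0
    return np.array([row[c] for c in config['final_columns']], dtype=float)

# Made-up config values with the same keys the notebook writes to JSON.
config = {
    'imputation_stats': {'HorsePower_Clean': 150.0},
    'scaling_stats': {'HorsePower_Clean': {'mean': 200.0, 'std': 50.0}},
    'categorical_levels': ['Fuel_Type_Standard_Petrol'],
    'final_columns': ['HorsePower_Clean', 'Fuel_Type_Standard_Petrol'],
}

petrol_car = transform_input({'HorsePower_Clean': 300.0, 'Fuel_Type_Standard': 'Petrol'}, config)
missing_hp = transform_input({'HorsePower_Clean': None, 'Fuel_Type_Standard': 'Diesel'}, config)
print(petrol_car.tolist())  # [2.0, 1.0]
print(missing_hp.tolist())  # [-1.0, 0.0]
```

Ordering the output by `final_columns` matters because the model weights are stored as a plain list without column names.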

In [None]:
import pandas as pd
import numpy as np
import json
import random # Used for simple train/test split

# Load the final data
df = pd.read_csv('cleaned_data_final.csv')

# Drop rows with missing target (price)
df_clean = df.dropna(subset=['Cars Prices_Clean']).copy()

# Features (X) and Target (Y)
X = df_clean[['HorsePower_Clean', 'Total_Speed_Clean', 'Performance_Clean', 'Seats_Clean', 'Fuel_Type_Standard']].copy()
Y = df_clean['Cars Prices_Clean'].copy()
numerical_features = ['HorsePower_Clean', 'Total_Speed_Clean', 'Performance_Clean', 'Seats_Clean']

# --- 1. Handling Missing Values (Imputation with Median) ---
imputation_stats = {}
for col in numerical_features:
    median_val = X[col].median()
    X[col] = X[col].fillna(median_val)
    imputation_stats[col] = median_val # Save median for Streamlit

print("1. Missing values imputed with the median.")

# --- 2. Scaling Numerical Features (Standard Scaling) ---
scaling_stats = {}
for col in numerical_features:
    mean_val = X[col].mean()
    std_val = X[col].std()
    
    # Standard Scaling Formula: (x - mean) / std
    X[col] = (X[col] - mean_val) / std_val
    
    scaling_stats[col] = {'mean': mean_val, 'std': std_val} # Save stats for Streamlit

print("2. Numerical features scaled (Z-score).")

# --- 3. Encoding Categorical Features (One-Hot Encoding with Pandas) ---
# get_dummies already removes the original 'Fuel_Type_Standard' column, so no extra drop is needed
X_encoded = pd.get_dummies(X, columns=['Fuel_Type_Standard'], drop_first=True, dtype=int)

final_columns = X_encoded.columns.tolist()

# --- 4. Custom Train/Test Split (pure Python `random` module) ---
# Create a list of indices, shuffle, and split
indices = list(range(len(X_encoded)))
random.seed(42) # Set seed for reproducibility
random.shuffle(indices)

train_size = int(0.8 * len(X_encoded))
train_indices = indices[:train_size]
test_indices = indices[train_size:]

X_train = X_encoded.iloc[train_indices]
X_test = X_encoded.iloc[test_indices]
Y_train = Y.iloc[train_indices]
Y_test = Y.iloc[test_indices]


# --- 5. Save Preprocessing Stats (CUSTOM JSON) ---
preprocessing_config = {
    'imputation_stats': imputation_stats,
    'scaling_stats': scaling_stats,
    'final_columns': final_columns,
    # Convert DataFrame columns to list for saving categorical levels
    'categorical_levels': X_encoded.filter(regex='Fuel_Type_Standard_').columns.tolist()
}

with open('preprocessing_config.json', 'w') as f:
    json.dump(preprocessing_config, f)
    
print("\nPreprocessing complete. Config saved as 'preprocessing_config.json'.")
print(f"X_train shape: {X_train.shape}")

# Phase 4: Machine Learning - Manual Linear Regression Implementation

In the final phase of development, we build and train our predictive model. Rather than using "black-box" libraries, we implement **Linear Regression** using pure **NumPy** to solve the **Normal Equation**. This mathematical approach finds the specific weights that minimize the squared error between our predictions and the actual car prices.

### Technical Implementation:

1. **The Normal Equation:** We use matrix algebra to solve for the optimal parameter vector $\theta$ (weights and bias) using the formula:
   $$\theta = (X^T X)^{-1} X^T y$$
   * This involves transposing $X$, performing matrix multiplication (`@`), and inverting $X^T X$ with `np.linalg.inv`.
2. **Prediction Logic:** Once trained, the model predicts prices as a linear combination of the features, where $W$ holds the weights and $b$ is the bias (the first entry of $\theta$):
   $$\hat{Y} = X W + b$$
3. **Manual Performance Metrics:** To evaluate the model's accuracy on the unseen test set, we manually calculate:
   * **Root Mean Squared Error (RMSE):** The square root of the average squared prediction error, expressed in the same units as the price.
   * **R-squared ($R^2$):** The proportion of the variance in car price that is explained by our features (horsepower, speed, etc.).
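
To make the steps above concrete, here is a minimal sketch on toy data generated from a known linear rule, so the recovered parameters can be checked by eye. The data are made up, and the sketch uses `np.linalg.solve` in place of an explicit inverse; it solves the same normal equation and is generally the more numerically stable choice.

```python
import numpy as np

# Toy data generated exactly by y = 3 + 2*x1 - x2, so the fit should recover it.
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0], [5.0, 5.0]])
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]

# Prepend a column of ones so theta[0] plays the role of the bias.
Xb = np.column_stack([np.ones(len(X)), X])

# Normal equation (X^T X) theta = X^T y, solved without forming the inverse.
theta = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
bias, weights = theta[0], theta[1:]

# Manual metrics, computed the same way as described in the list above.
y_hat = X @ weights + bias
rmse = np.sqrt(np.mean((y - y_hat) ** 2))
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print(theta)  # should be close to [3, 2, -1]
print(rmse)   # should be close to 0
print(r2)     # should be close to 1
```

On real, noisy data the RMSE will be well above zero and $R^2$ below one; the toy case only confirms that the algebra is wired up correctly.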

### Model Persistence:
The final weights ($W$) and bias ($b$), along with the evaluation metrics, are exported to `linear_model_weights.json`. This "lightweight" model file is what powers the **runtime predictions** in our Streamlit application, allowing users to get instant price estimates without needing to re-train the model.
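
At runtime, a prediction needs only the stored bias and weights. The sketch below assumes a helper named `predict_price` and uses made-up numbers in place of the real file contents; only the `bias` and `weights` keys are taken from the notebook.

```python
import json
import numpy as np

def predict_price(features, model):
    """Linear prediction: dot product of features and weights, plus the bias."""
    return float(np.dot(features, model['weights']) + model['bias'])

# Made-up numbers standing in for the contents of linear_model_weights.json.
model = json.loads('{"bias": 50000.0, "weights": [20000.0, -5000.0]}')

# Features must already be preprocessed and ordered like the training columns.
print(predict_price([2.0, 1.0], model))  # 85000.0
```

In the application, `model` would instead be read from disk with `json.load`, and the feature vector would come from the same preprocessing that was applied to the training data.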

In [None]:
import numpy as np
import json
import pandas as pd

# Reuse X_train, Y_train, X_test, Y_test from the previous block (Code Block 3)

# --- Linear Regression Class using Pure NumPy (Normal Equation) ---
class SimpleLinearRegression:
    def __init__(self):
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        # 1. Add bias term (column of ones) to X
        X_with_bias = X.copy()
        X_with_bias.insert(0, 'Bias', 1.0)
        
        X_np = X_with_bias.values
        y_np = y.values.reshape(-1, 1)

        # 2. Normal Equation: W = (X^T * X)^-1 * X^T * y
        # We use the @ operator for matrix multiplication
        # np.linalg.inv calculates the inverse of a matrix
        try:
            W_np = np.linalg.inv(X_np.T @ X_np) @ X_np.T @ y_np
            
            self.bias = W_np[0, 0]
            self.weights = W_np[1:, 0]
        except np.linalg.LinAlgError:
            print("Warning: Matrix is singular. Cannot compute inverse. Weights set to zero.")
            self.bias = 0.0
            self.weights = np.zeros(X_np.shape[1] - 1)

    def predict(self, X):
        # Prediction: Y = XW + b
        if isinstance(X, pd.DataFrame):
            X_np = X.values
        else:
            X_np = X 
            
        return X_np @ self.weights + self.bias
        
# --- Train the Model ---
lr_model = SimpleLinearRegression()
lr_model.fit(X_train, Y_train)

# --- Evaluate the Model (Manual Metrics) ---
Y_pred_test = lr_model.predict(X_test)
Y_test_np = Y_test.values

# Calculate RMSE (Root Mean Squared Error)
rmse = np.sqrt(np.mean((Y_test_np - Y_pred_test)**2))

# Calculate R2 Score
SS_res = np.sum((Y_test_np - Y_pred_test)**2)
SS_tot = np.sum((Y_test_np - Y_test_np.mean())**2)
r2_score_val = 1 - (SS_res / SS_tot)

print("\n--- Linear Regression Model Results (Pure NumPy) ---")
print(f"R-squared (R2): {r2_score_val:.4f}")
print(f"Root Mean Squared Error (RMSE): ${rmse:,.2f}")

# --- Save the Model Weights (CUSTOM JSON) ---
model_weights = {
    'bias': lr_model.bias,
    'weights': lr_model.weights.tolist(),
    # Save the final test metrics for display in Streamlit
    'test_r2': r2_score_val,
    'test_rmse': rmse
}

with open('linear_model_weights.json', 'w') as f:
    json.dump(model_weights, f)
    
print("Model weights saved as 'linear_model_weights.json'.")