
# Automobile Price: EDA, Cleaning, and Simple Linear Regression
This notebook performs:
1) **EDA** and data cleaning (drop `normalized-losses`, fix dtypes)
2) **Scatterplot matrix** for numeric features (slides-friendly)
3) **Feature selection**: choose the single numeric feature most correlated with `price`
4) **Simple Linear Regression** (train/test split) using that one feature

**Note:** The notebook expects a CSV file whose column names match those used in the cleaning step (for example `price`, `horsepower`, `engine-size`, `normalized-losses`).

## 1. Imports & Settings

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 120)

## 2. Load Data
Update the `data_path` variable to point to your CSV file.

In [None]:
data_path = 'automobile_data.csv'  # <-- change to your path
df = pd.read_csv(data_path)
print('Raw shape:', df.shape)
df.head()
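
If your CSV marks missing values with a text placeholder such as `?` (an assumption here; inspect your own file first), `read_csv` can turn them into `NaN` at load time through its `na_values` parameter. The sketch below uses a hypothetical in-memory CSV rather than the real dataset.

```python
import io
import pandas as pd

# Hypothetical three-row CSV; '?' stands in for a missing value.
csv_text = (
    "make,horsepower,price\n"
    "alfa,111,13495\n"
    "audi,?,17450\n"
    "bmw,101,?\n"
)

# na_values tells the parser which strings to treat as missing.
toy = pd.read_csv(io.StringIO(csv_text), na_values='?')
print(toy.dtypes)
print(toy)
```

With the placeholder handled at load time, the affected columns arrive as floats instead of strings. The coercion in the cleaning step still acts as a safety net for any other non-numeric entries.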

## 3. Data Cleaning
- Drop `normalized-losses`
- Convert numeric-like columns to numeric dtypes
- Drop rows with missing target (`price`)

In [None]:
# Drop normalized-losses if present
if 'normalized-losses' in df.columns:
    df = df.drop(columns=['normalized-losses'])

# Known numeric columns from the dataset
numeric_like = [
    'price','highway-mpg','city-mpg','peak-rpm','horsepower','compression-ratio',
    'stroke','bore','engine-size','curb-weight','height','width','length','wheel-base'
]

for col in numeric_like:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors='coerce')

# Ensure target exists and drop rows with missing price
if 'price' not in df.columns:
    raise ValueError('Target column "price" not found in data.')
df = df.dropna(subset=['price']).reset_index(drop=True)

print('Shape after cleaning:', df.shape)
df.info()
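
To see the coercion and row-dropping logic in isolation, here is a minimal sketch on a hypothetical toy frame (the values and the `?` placeholder are assumptions for illustration, not rows from the dataset). Unparseable entries become `NaN`, and only rows without a usable `price` are dropped.

```python
import pandas as pd

# Hypothetical toy data: '?' stands in for a missing value stored as text.
toy = pd.DataFrame({
    'horsepower': ['111', '?', '154'],
    'price': ['13495', '16500', '?'],
})

# errors='coerce' converts unparseable strings to NaN instead of raising.
for col in ['horsepower', 'price']:
    toy[col] = pd.to_numeric(toy[col], errors='coerce')

# Drop rows where the target is missing; rows with a missing feature stay.
toy_clean = toy.dropna(subset=['price']).reset_index(drop=True)
print(toy_clean)
```

Note that the row with a missing `horsepower` survives this step. Rows with missing features are only filtered later, when a specific feature is paired with `price` for plotting or modeling.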

## 4. Quick EDA
Preview the first rows, summarize the numeric columns, and list each numeric column's correlation with `price`.

In [None]:
display(df.head())
display(df.describe(numeric_only=True))

corr = df.corr(numeric_only=True)
if 'price' in corr.columns:
    display(corr['price'].sort_values(ascending=False))
else:
    print('No numeric correlation to price found yet.')

## 5. Scatterplot Matrix (Numeric Features)
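A compact scatterplot matrix for numeric features. When there are more than eight numeric columns, only the eight most strongly correlated with `price` (including `price` itself) are plotted, which keeps the figure readable on a slide.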

In [None]:
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
# Keep to a manageable subset if many columns
plot_cols = num_cols
if len(plot_cols) > 8:
    # prioritize columns with strongest |corr| to price
    if 'price' in corr.columns:
        order = corr['price'].abs().sort_values(ascending=False).index.tolist()
        plot_cols = [c for c in order if c in num_cols][:8]

axarr = scatter_matrix(df[plot_cols].dropna(), figsize=(12, 12))
plt.suptitle('Scatterplot Matrix (Numeric Features)', y=1.02)
plt.tight_layout()
plt.savefig('scatter_matrix_numeric.png', dpi=150, bbox_inches='tight')
plt.show()
print('Saved scatter matrix to scatter_matrix_numeric.png')

## 6. Pick Best Single Feature for Price
We choose the numeric feature with the highest absolute correlation with `price` (excluding `price` itself).

In [None]:
if 'price' in df.columns:
    corr_series = df.select_dtypes(include=[np.number]).corr(numeric_only=True)['price']
    corr_series = corr_series.drop(labels=['price'])
    best_feature = corr_series.abs().idxmax()
    print('Best feature:', best_feature)
    print('Correlation with price:', corr_series[best_feature])
else:
    best_feature = None
    print('No price column present.')
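
The reason for ranking by absolute correlation can be shown on a small example. In the hypothetical toy frame below (column names `a` and `b` are made up for illustration), `b` is a perfect negative predictor of `price`, so it should win even though its raw correlation is the lowest.

```python
import pandas as pd

# Hypothetical toy data: 'a' is moderately positive, 'b' is perfectly negative.
toy = pd.DataFrame({
    'a': [1, 3, 2, 5, 4],
    'b': [9, 7, 5, 3, 1],
    'price': [10, 20, 30, 40, 50],
})

# Correlation of every numeric column with the target, excluding the target itself.
corr_series = toy.corr(numeric_only=True)['price'].drop(labels=['price'])

best_by_abs = corr_series.abs().idxmax()   # ranks by strength, ignoring sign
best_by_raw = corr_series.idxmax()         # would favor positive correlations only
print('By |corr|:', best_by_abs, '| by raw corr:', best_by_raw)
```

Ranking by the raw value would pick `a` here, while ranking by absolute value picks `b`, the stronger linear relationship.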

## 7. Visualize Best Feature vs Price
A simple scatter with a fitted line for intuition.

In [None]:
if best_feature is not None:
    x = df[best_feature]
    y = df['price']
    valid = ~(x.isna() | y.isna())
    plt.figure()
    plt.scatter(x[valid], y[valid], alpha=0.7)
    plt.xlabel(best_feature)
    plt.ylabel('price')
    plt.title(f'{best_feature} vs price')
    # Line of best fit (visual only)
    m, b = np.polyfit(x[valid], y[valid], 1)
    x_line = np.linspace(x[valid].min(), x[valid].max(), 100)
    y_line = m * x_line + b
    plt.plot(x_line, y_line)
    plt.tight_layout()
    plt.savefig('best_feature_vs_price.png', dpi=150, bbox_inches='tight')
    plt.show()
    print('Saved best_feature_vs_price.png')
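
As a quick check on how the fitted line is obtained: `np.polyfit` with degree 1 returns the coefficients from highest power to lowest, so the slope comes first and the intercept second. The sketch below uses hypothetical points that lie exactly on a known line.

```python
import numpy as np

# Hypothetical points generated from y = 2x + 1, with no noise.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# Degree-1 fit: coefficients are ordered [slope, intercept].
m, b = np.polyfit(x, y, 1)
print('slope:', round(m, 6), 'intercept:', round(b, 6))
```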

## 8. Simple Linear Regression (One Feature)
Split the data into train and test sets, fit `LinearRegression` on the training set, report R², MAE, and RMSE on the test set, and plot predicted-vs-actual and residual diagnostics.

In [None]:
if best_feature is not None:
    data = df[[best_feature, 'price']].dropna()
    X = data[[best_feature]].values
    y = data['price'].values

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    model = LinearRegression()
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)

    r2 = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
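    # RMSE is the square root of MSE. The `squared=False` argument was deprecated
    # and later removed in newer scikit-learn releases, so take the root explicitly.
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))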
    print({'feature': best_feature, 'R2': r2, 'MAE': mae, 'RMSE': rmse})

    # Predicted vs Actual
    plt.figure()
    plt.scatter(y_test, y_pred, alpha=0.7)
    lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
    plt.plot(lims, lims)
    plt.xlabel('Actual price')
    plt.ylabel('Predicted price')
    plt.title('Predicted vs Actual (One-Feature Linear Regression)')
    plt.tight_layout()
    plt.savefig('pred_vs_actual_price.png', dpi=150, bbox_inches='tight')
    plt.show()
    print('Saved pred_vs_actual_price.png')

    # Residuals plot
    residuals = y_test - y_pred
    plt.figure()
    plt.scatter(y_pred, residuals, alpha=0.7)
    plt.axhline(0)
    plt.xlabel('Predicted price')
    plt.ylabel('Residuals')
    plt.title('Residual Plot')
    plt.tight_layout()
    plt.savefig('residuals_price.png', dpi=150, bbox_inches='tight')
    plt.show()
    print('Saved residuals_price.png')
else:
    print('best_feature not available; check earlier steps.')
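
For reference, the three reported metrics can be reproduced by hand with NumPy. This is a minimal sketch on hypothetical values (named `y_true` and `y_hat` here to avoid clashing with the notebook's variables), using the standard definitions: MAE is the mean absolute error, RMSE is the square root of the mean squared error, and R² is one minus the ratio of residual to total sum of squares.

```python
import numpy as np

# Hypothetical values chosen so the arithmetic is easy to check by hand.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

errors = y_true - y_hat

mae = np.mean(np.abs(errors))                    # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))             # root of the mean squared error
ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot

print({'R2': r2, 'MAE': mae, 'RMSE': rmse})
```

MAE and RMSE are in the same units as `price`, which makes them easier to explain on a slide than R² alone. RMSE penalizes large misses more heavily than MAE does.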

## 9. Save Cleaned Data (Optional)

In [None]:
out_path = 'automobile_data_cleaned.csv'
df.to_csv(out_path, index=False)
out_path