# Objective

The goal of this analysis is to predict used-car prices and to compare how well several regression techniques perform on the task. The dataset includes the features engine_cc, mileage, assembly, body type, ad_city, color, fuel type, make, model, registration status, transmission, and year. After imputing missing values and one-hot encoding the categorical variables, we fitted three regression models and compared their performance on a held-out test set.
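
As a rough illustration of the two preprocessing steps named above, the sketch below imputes and one-hot encodes a tiny, made-up table. The `toy` frame and its values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical four-row listing table; values are illustrative only.
toy = pd.DataFrame({
    "mileage": [30000.0, None, 90000.0, 60000.0],
    "fuel_type": ["Petrol", None, "Hybrid", "Petrol"],
})

# Numeric gap: fill with the column mean.
toy["mileage"] = toy["mileage"].fillna(toy["mileage"].mean())
# Categorical gap: fill with the most frequent value (the mode).
toy["fuel_type"] = toy["fuel_type"].fillna(toy["fuel_type"].mode()[0])

# One-hot encoding: one indicator column per category.
encoded = pd.get_dummies(toy, columns=["fuel_type"])
print(encoded)
```

Each category becomes its own indicator column, which is why high-cardinality fields such as model or color can widen the feature matrix considerably.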

# Models Selected

For this analysis, the following models were chosen:

### Linear Regression
- **Simplicity and Interpretability**: Linear regression is a straightforward model that provides clear insight into the relationships between features and the target variable.
- **Baseline Model**: It serves as a reliable baseline for comparing more complex models.
- **Efficiency**: It is computationally efficient, making it suitable for large datasets with numerous features.
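
To make the interpretability point concrete, here is a minimal sketch of a one-feature least-squares fit in plain Python. The helper `fit_simple_ols` and the mileage and price values are hypothetical, chosen only so the coefficients are easy to read.

```python
def fit_simple_ols(x, y):
    """Closed-form least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: mileage in thousands of km, price in millions.
mileage = [10, 20, 30, 40, 50]
price = [9.0, 8.0, 7.0, 6.0, 5.0]

slope, intercept = fit_simple_ols(mileage, price)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

In this toy data the slope reads directly as "each additional 1,000 km lowers the price by about 0.1 million", which is the kind of statement a linear model makes easy to extract.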

### Random Forest Regression
- **Non-Linearity**: Random forests are capable of capturing complex, non-linear relationships between features and the target variable.
- **Robustness**: Averaging many decision trees, each trained on a bootstrap sample of the data, makes the model less prone to overfitting than a single tree.
- **Feature Importance**: Random forests provide insights into the importance of each feature, aiding in understanding their impact on the target variable.
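
The feature-importance point can be sketched on synthetic data, assuming scikit-learn and NumPy are available. The two features below are hypothetical: one drives the target and the other is pure noise, so the importances should reflect that difference.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_samples = 300

# Synthetic, illustrative features (not from the used-car dataset).
informative = rng.uniform(0, 10, n_samples)    # drives the target
noise_feature = rng.uniform(0, 10, n_samples)  # unrelated to the target
X = np.column_stack([informative, noise_feature])
y = 3.0 * informative + rng.normal(0, 0.5, n_samples)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X, y)

# Impurity-based importances, normalised to sum to 1.
importances = forest.feature_importances_
for name, score in zip(["informative", "noise_feature"], importances):
    print(f"{name}: {score:.3f}")
```

These impurity-based scores are relative rather than causal, and with one-hot encoded data the importance of a single categorical field is spread across its indicator columns.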

### Gradient Boosting Regression
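- **Accuracy**: Gradient boosting tends to deliver high predictive accuracy by adding weak learners (typically shallow trees) one at a time, each fitted to the errors of the ensemble built so far.
- **Flexibility**: It can optimize different loss functions, such as squared or absolute error, which makes it adaptable to a range of data distributions.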
- **Customizability**: The model offers numerous hyperparameters for tuning, allowing for optimization of performance.
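
The idea of building an ensemble from weak models one round at a time can be sketched in plain Python. This is a simplified illustration for squared error using one-split "stumps" on a single feature, not scikit-learn's implementation. The helpers `fit_stump` and `gradient_boost` and the data are hypothetical.

```python
def fit_stump(x, residuals):
    """Return the one-split predictor that best fits the residuals."""
    best = None
    # Skip the largest value so the right-hand side is never empty.
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def gradient_boost(x, y, n_rounds=20, learning_rate=0.5):
    """Fit stumps to successive residuals; return predictor and MSE per round."""
    base = sum(y) / len(y)  # start from the mean prediction
    predictions = [base] * len(y)
    stumps, mse_history = [], []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [pi + learning_rate * stump(xi)
                       for xi, pi in zip(x, predictions)]
        mse = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions)) / len(y)
        mse_history.append(mse)

    def predict(xi):
        return base + learning_rate * sum(s(xi) for s in stumps)

    return predict, mse_history


# Hypothetical data: car age in years versus price in millions.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [9.0, 8.2, 7.1, 6.5, 4.0, 3.8, 3.5, 3.4]

mean_y = sum(y) / len(y)
baseline_mse = sum((yi - mean_y) ** 2 for yi in y) / len(y)
predict, mse_history = gradient_boost(x, y)
print(f"baseline MSE: {baseline_mse:.3f}")
print(f"MSE after round 1: {mse_history[0]:.3f}")
print(f"MSE after round 20: {mse_history[-1]:.3f}")
```

The number of rounds and the learning rate here play the same role as the `n_estimators` and `learning_rate` hyperparameters of `GradientBoostingRegressor`: more rounds fit the training data more closely, which is why both are usually tuned against held-out data.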

# Evaluation Metrics

The models were evaluated using the following metrics:

- **Mean Absolute Error (MAE)**: This metric measures the average magnitude of the errors in the predictions, without considering whether the errors are positive or negative.
- **Mean Squared Error (MSE)**: MSE measures the average of the squares of the errors, penalizing larger errors more heavily than MAE.
- **Root Mean Squared Error (RMSE)**: RMSE is the square root of MSE, providing an error metric in the same units as the target variable.
- **R-squared (R²)**: This metric represents the proportion of variance in the target variable that can be explained by the independent variables, providing insight into the model's explanatory power.
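
The four definitions above can be checked by hand on a small example. The sketch below computes them in plain Python. The helper name `regression_metrics` and the numbers are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R² for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total variation
    r2 = 1 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Hypothetical prices (in millions) and predictions, for illustration only.
y_true = [4.0, 6.0, 8.0, 10.0]
y_pred = [5.0, 6.0, 7.0, 12.0]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Here the errors are -1, 0, 1 and -2, so MAE is 1.0 while MSE is 1.5: the single error of size 2 contributes 4 of the 6 squared-error units, which shows how MSE weights large misses more heavily.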

In [3]:
import pandas as pd

data_path = 'pakwheels_used_cars.csv'
df = pd.read_csv(data_path)

print(df.head())
print(df.isnull().sum())

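# Fill numeric gaps with the column mean. Note that 'price' is the target,
# so mean-imputing it keeps every row but adds synthetic labels.
numeric_columns = ['engine_cc', 'mileage', 'year', 'price']
for column in numeric_columns:
    # Assign back: chained fillna(..., inplace=True) may not update df under pandas copy-on-write
    df[column] = df[column].fillna(df[column].mean())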

# Fill categorical gaps with the most frequent value (the mode).
categorical_columns = ['assembly', 'body', 'color', 'fuel_type']
for column in categorical_columns:
    df[column] = df[column].fillna(df[column].mode()[0])

print(df.isnull().sum())

df_encoded = pd.get_dummies(df, columns=categorical_columns + ['ad_city', 'make', 'model', 'registered', 'transmission'])

print(df_encoded.head())

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

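# Drop the target and 'ad_ref', which is a listing identifier rather than a property of the car
X = df_encoded.drop(columns=['price', 'ad_ref'])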
y = df_encoded['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42)
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    results[name] = {
        "MAE": mean_absolute_error(y_test, predictions),
        "MSE": mean_squared_error(y_test, predictions),
        # Square root of MSE; avoids the `squared=False` argument, deprecated since scikit-learn 1.4
        "RMSE": mean_squared_error(y_test, predictions) ** 0.5,
        "R²": r2_score(y_test, predictions)
    }

results_df = pd.DataFrame(results).T
print(results_df)


    ad_ref  assembly       body ad_city                color  engine_cc  \
0  7927285  Imported        Van  Lahore          Pearl White     2000.0   
1  7679303  Imported  Hatchback  Lahore                 Grey      996.0   
2  7915479       NaN      Sedan  Lahore          Super white     1798.0   
3  7918380       NaN      Sedan  Lahore  Crystal Black Pearl     1500.0   
4  7676167  Imported        MPV  Lahore               Silver     3000.0   

  fuel_type    make  mileage    model     registered transmission    year  \
0    Hybrid  Nissan   124000   Serena  Un-Registered    Automatic  1905.0   
1    Petrol  Toyota    30738     Vitz         Punjab    Automatic  1905.0   
2    Petrol  Toyota   183000  Corolla         Punjab    Automatic  1905.0   
3    Petrol   Honda    41000    Civic         Punjab    Automatic  1905.0   
4    Petrol  Toyota   126000  Alphard         Punjab    Automatic  1905.0   

       price  
0  8990000.0  
1  4190000.0  
2  3990000.0  
3  6490000.0  
4  4750000.