# *Machine Learning Project*

## *Business Goal:*

#### The task is to model car prices from the available independent variables. Management will use the model to understand how prices vary with those variables, so that they can adjust car design, business strategy, and similar levers to reach target price levels. The model should also help management understand the pricing dynamics of a new market.


### 1. **Loading and Preprocessing**

#### ● *Load the dataset and perform necessary preprocessing steps.*

In [13]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset
df = pd.read_csv(r'C:\Users\HP\Downloads\Machine Learning\Project\CarPrice_Assignment.csv')

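# Separate the target (price) from the features; car_ID is a row identifier
# and CarName is a free-text label, so neither is used as a predictor
X = df.drop(columns=['price', 'car_ID', 'CarName'])
y = df['price']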

# Convert categorical variables into dummy/indicator variables
X = pd.get_dummies(X, drop_first=True)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
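
To make the encoding step concrete, here is a minimal sketch on a made-up three-row frame. The values are illustrative assumptions, not rows from the dataset; only the column names mirror it. It shows that `drop_first=True` drops the alphabetically first level of each categorical column, which then acts as the implicit baseline.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the dataset
toy = pd.DataFrame({
    'enginesize': [130, 152, 109],
    'fueltype': ['gas', 'diesel', 'gas'],
})

# Numeric columns pass through unchanged; each categorical column with
# k levels becomes k-1 indicator columns when drop_first=True
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))  # ['enginesize', 'fueltype_gas']
```

Here `diesel` is the dropped level, so a row with `fueltype_gas` equal to 0 (or `False`) is a diesel car.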

### 2. **Model Implementation**

#### ● *Implement the following five regression algorithms:*

1) Linear Regression
2) Decision Tree Regressor
3) Random Forest Regressor
4) Gradient Boosting Regressor
5) Support Vector Regressor


In [21]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR

# Define the models
models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree Regressor': DecisionTreeRegressor(random_state=42),
    'Random Forest Regressor': RandomForestRegressor(random_state=42),
    'Gradient Boosting Regressor': GradientBoostingRegressor(random_state=42),
    'Support Vector Regressor': SVR()
}

# Train the models
for name, model in models.items():
    model.fit(X_train, y_train)
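
One caveat about the loop above: `SVR()` is fitted on unscaled features and an unscaled target, which is a likely reason for its negative R-squared in the evaluation section. The sketch below is not part of the original notebook. It uses synthetic stand-in data (an assumption) to show one possible remedy: standardizing the features in a pipeline and the target through `TransformedTargetRegressor`.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): features on very different scales
# and a target in the thousands, loosely imitating car prices
rng = np.random.default_rng(42)
X_demo = np.column_stack([
    rng.normal(130, 40, 200),    # an engine-size-like feature
    rng.normal(2500, 500, 200),  # a curb-weight-like feature
])
y_demo = 100 * X_demo[:, 0] + 5 * X_demo[:, 1] + rng.normal(0, 500, 200)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Baseline: SVR with default settings on raw features and raw target
raw_svr = SVR().fit(Xtr, ytr)

# Scaled variant: standardize the features inside a pipeline and the target
# via TransformedTargetRegressor, so C and epsilon act on unit-scale values
scaled_svr = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR()),
    transformer=StandardScaler(),
).fit(Xtr, ytr)

r2_raw = r2_score(yte, raw_svr.predict(Xte))
r2_scaled = r2_score(yte, scaled_svr.predict(Xte))
print(f"raw SVR R-squared: {r2_raw:.3f}, scaled SVR R-squared: {r2_scaled:.3f}")
```

With the default `C=1`, an SVR's predictions can move only a limited distance from its intercept, so a target measured in thousands is barely fitted unless it is rescaled. Whether the same change helps on the car data would need to be checked by rerunning the notebook.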

### 3. **Model Evaluation**

#### ● *Compare the performance of all the models based on R-squared, Mean Squared Error (MSE), and Mean Absolute Error (MAE).*

#### ● *Identify the best performing model and justify why it is the best.*

In [44]:
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Define a function to evaluate models
def evaluate_model(model, X_test, y_test):
    """Return (MSE, R-squared, MAE) of a fitted model on the test data."""
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    return mse, r2, mae

# Evaluate each model
results = {}
for name, model in models.items():
    mse, r2, mae = evaluate_model(model, X_test, y_test)
    results[name] = {'MSE': mse, 'R-squared': r2, 'MAE': mae}

# Print the performance of all models
for name, metrics in results.items():
    print(f"{name} - MSE: {metrics['MSE']}, R-squared: {metrics['R-squared']}, MAE: {metrics['MAE']}")

# Identify the best performing model based on R-squared value
best_model_name = max(results, key=lambda x: results[x]['R-squared'])
print(f"\nBest Performing Model: {best_model_name}")

Linear Regression - MSE: 8482008.484371832, R-squared: 0.8925566700320242, MAE: 2089.382729204749
Decision Tree Regressor - MSE: 8300272.356143635, R-squared: 0.8948587586031806, MAE: 1886.3211463414634
Random Forest Regressor - MSE: 3314701.736754924, R-squared: 0.9580119976178074, MAE: 1261.4174512195123
Gradient Boosting Regressor - MSE: 5912585.344424135, R-squared: 0.9251040765527097, MAE: 1683.9518920721596
Support Vector Regressor - MSE: 86994157.1156621, R-squared: -0.10197271618931159, MAE: 5707.106445656011

Best Performing Model: Random Forest Regressor
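
On this single 80/20 split, Random Forest leads on all three metrics: it has the highest R-squared (about 0.958) and the lowest MSE and MAE. The Support Vector Regressor's negative R-squared means it does worse than always predicting the mean price. As a small sketch, the metrics printed above (copied here and rounded) can be put into a sorted table, with RMSE added so the error reads in the same units as price. The name `results_demo` is introduced only for this illustration.

```python
import pandas as pd

# Metrics copied from the printed output above, rounded for readability
results_demo = {
    'Linear Regression': {'MSE': 8482008.48, 'R-squared': 0.8926, 'MAE': 2089.38},
    'Decision Tree Regressor': {'MSE': 8300272.36, 'R-squared': 0.8949, 'MAE': 1886.32},
    'Random Forest Regressor': {'MSE': 3314701.74, 'R-squared': 0.9580, 'MAE': 1261.42},
    'Gradient Boosting Regressor': {'MSE': 5912585.34, 'R-squared': 0.9251, 'MAE': 1683.95},
    'Support Vector Regressor': {'MSE': 86994157.12, 'R-squared': -0.1020, 'MAE': 5707.11},
}

# One row per model; RMSE is the square root of MSE, in price units
table = pd.DataFrame(results_demo).T
table['RMSE'] = table['MSE'] ** 0.5
table = table.sort_values('R-squared', ascending=False)
print(table.round(3))
```

The RMSE column comes out at roughly 1,821 for Random Forest against roughly 9,327 for the Support Vector Regressor, which makes the gap easier to read than the raw MSE values.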


### 4. **Feature Importance Analysis**

#### ● *Identify the significant variables affecting car prices (feature selection).*

In [31]:
# Perform feature importance analysis for the best performing model (if applicable)
if hasattr(models[best_model_name], 'feature_importances_'):
    feature_importances = models[best_model_name].feature_importances_
    feature_names = X.columns
    feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})
    feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)
    print("\nFeature Importance Analysis:")
    print(feature_importance_df)


Feature Importance Analysis:
                  Feature    Importance
6              enginesize  5.444410e-01
5              curbweight  2.994207e-01
13             highwaympg  4.569430e-02
10             horsepower  3.551141e-02
3                carwidth  1.353314e-02
2               carlength  9.019221e-03
1               wheelbase  7.540727e-03
11                peakrpm  6.794992e-03
12                citympg  6.493206e-03
8                  stroke  4.912611e-03
7               boreratio  4.652634e-03
9        compressionratio  4.016081e-03
4               carheight  3.833689e-03
40        fuelsystem_mpfi  2.058066e-03
31    cylindernumber_four  1.733749e-03
19          carbody_sedan  1.354308e-03
18      carbody_hatchback  1.237313e-03
15       aspiration_turbo  1.132112e-03
0               symboling  9.572643e-04
26         enginetype_ohc  9.515885e-04
22         drivewheel_rwd  8.707071e-04
17        carbody_hardtop  6.209014e-04
32     cylindernumber_six  4.771876e-04
(remaining rows truncated)
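
The values above are impurity-based importances computed on the training data, and they can overstate features with many distinct values. A common cross-check is permutation importance on held-out data. The sketch below is not part of the original notebook and uses synthetic stand-in data (an assumption) in which only the first column drives the target.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): only column 0 influences the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = 10 * X_demo[:, 0] + rng.normal(scale=0.5, size=300)

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(Xtr, ytr)

# Shuffle each column of the held-out set in turn and record how much
# the model's R-squared drops; larger drops mean more important features
perm = permutation_importance(forest, Xte, yte, n_repeats=10, random_state=0)
ranking = np.argsort(perm.importances_mean)[::-1]
for idx in ranking:
    print(f"feature {idx}: {perm.importances_mean[idx]:.3f} "
          f"+/- {perm.importances_std[idx]:.3f}")
```

In the notebook, the same call could presumably be made with `models[best_model_name]`, `X_test`, and `y_test` to see whether `enginesize` and `curbweight` still dominate.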

### 5. **Hyperparameter Tuning**

#### ● *Perform hyperparameter tuning and check whether the performance of the model has increased.*

In [35]:
from sklearn.model_selection import GridSearchCV

# Define hyperparameter grids for the best performing model
if best_model_name == 'Random Forest Regressor':
    param_grid = {
        'n_estimators': [100, 200],
        'max_depth': [None, 10, 20],
        'min_samples_split': [2, 5],
        'min_samples_leaf': [1, 2]
    }
elif best_model_name == 'Gradient Boosting Regressor':
    param_grid = {
        'n_estimators': [100, 200],
        'learning_rate': [0.01, 0.1],
        'max_depth': [3, 5]
    }
elif best_model_name == 'Support Vector Regressor':
    param_grid = {
        'C': [0.1, 1],
        'epsilon': [0.1, 0.2]
    }
else:
    param_grid = {}

# Perform hyperparameter tuning if applicable
# (5-fold cross-validation on the training set, scored by R-squared,
# which is also the default score for regressors)
if param_grid:
    grid_search = GridSearchCV(models[best_model_name], param_grid, cv=5, scoring='r2')
    grid_search.fit(X_train, y_train)
    best_params = grid_search.best_params_
    print(f"\nBest Hyperparameters for {best_model_name}: {best_params}")
else:
    print("\nHyperparameter tuning is not applicable for the best performing model.")


Best Hyperparameters for Random Forest Regressor: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
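
The selected values match the defaults of `RandomForestRegressor` in recent scikit-learn versions. With the same `random_state`, the tuned model should therefore be the same configuration as the baseline, and no improvement is expected from this grid. The cell above also stops short of the second half of the task, which is to check performance after tuning. The sketch below is a hedged illustration on synthetic stand-in data (an assumption) with a deliberately small grid. It shows how the tuned and baseline models could be compared on the same held-out set.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption) so the sketch runs without the CSV
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Baseline model with fixed settings
baseline = RandomForestRegressor(n_estimators=50, random_state=42).fit(Xtr, ytr)

# Small illustrative grid; GridSearchCV refits the best setting on all of Xtr
search = GridSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=42),
    {'max_depth': [None, 5], 'min_samples_leaf': [1, 3]},
    cv=3,
    scoring='r2',
).fit(Xtr, ytr)
tuned = search.best_estimator_

# Compare both models on the same held-out test set
scores = {}
for label, model in [('baseline', baseline), ('tuned', tuned)]:
    pred = model.predict(Xte)
    scores[label] = (r2_score(yte, pred), mean_absolute_error(yte, pred))
    print(f"{label}: R-squared={scores[label][0]:.3f}, MAE={scores[label][1]:.2f}")
```

In the notebook, the equivalent check would presumably call `evaluate_model(grid_search.best_estimator_, X_test, y_test)` and compare the result with the Random Forest row from the evaluation section.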
