In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import r2_score


In [2]:
df = pd.read_csv("/content/cleaned_dataset.csv")

In [3]:
# Define features and target variable
X = df.drop(columns=['sales_price'])  # Features
y = df['sales_price']  # Target variable

# Convert categorical variables to dummy variables
X = pd.get_dummies(X, drop_first=True)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
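
The encoding step above can be illustrated on a toy frame. This is a minimal sketch; the `area` and `city` columns are hypothetical and not taken from `cleaned_dataset.csv`. With `drop_first=True`, the first category of each categorical column is dropped and becomes the implicit reference level, which avoids perfectly collinear dummy columns in the linear models.

```python
import pandas as pd

# Hypothetical toy data: one numeric column, one categorical column
toy = pd.DataFrame({
    'area': [900, 1200, 1500],
    'city': ['A', 'B', 'C'],
})

# Numeric columns pass through unchanged; 'city' becomes k-1 indicator columns
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))  # ['area', 'city_B', 'city_C']
```

A row with `city_B` and `city_C` both false belongs to the dropped category `A`.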


In [4]:
# Initialize models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(),
    'Lasso Regression': Lasso(),
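    'Decision Tree': DecisionTreeRegressor(),  # no random_state: score varies slightly between runs
    'Random Forest': RandomForestRegressor(),  # no random_state: score varies slightly between runs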
    'Support Vector Regression': SVR()
}

# Train models and calculate R² values
r2_scores = {}

for model_name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    r2_scores[model_name] = r2_score(y_test, y_pred)


In [5]:
# Print R² values for each model
for model_name, score in r2_scores.items():
    print(f"{model_name}: R² = {score:.4f}")


Linear Regression: R² = 0.7654
Ridge Regression: R² = 0.7657
Lasso Regression: R² = 0.7655
Decision Tree: R² = 0.8555
Random Forest: R² = 0.8953
Support Vector Regression: R² = -0.2288


### **Model Performance Interpretation**

1. **Linear Regression: R² = 0.7654**
   - **Interpretation:** The Linear Regression model explains about 76.5% of the variance in sales price on the test set. This is a reasonable baseline, but it leaves room for improvement through more flexible models or feature engineering.

2. **Ridge Regression: R² = 0.7657**
   - **Interpretation:** Ridge Regression, which adds L2 regularization, scores R² = 0.7657, essentially the same as Linear Regression. A difference of 0.0003 on a single train/test split is too small to show that regularization helps here; it suggests the default penalty (`alpha=1.0`) barely changes the fit.

3. **Lasso Regression: R² = 0.7655**
   - **Interpretation:** Lasso Regression, which adds L1 regularization, scores R² = 0.7655, again indistinguishable from Linear Regression and Ridge Regression. L1 regularization can shrink coefficients to zero and so act as feature selection, but at the default `alpha=1.0` it has almost no effect on the test score.

4. **Decision Tree: R² = 0.8555**
   - **Interpretation:** The Decision Tree model explains 85.55% of the variance in sales price, about 9 points more than the linear models. This suggests the data contains non-linear relationships or feature interactions that a linear model cannot represent.

5. **Random Forest: R² = 0.8953**
   - **Interpretation:** The Random Forest model achieves the highest R² in this comparison, 0.8953. Averaging many decision trees reduces the variance of a single tree, which is the likely reason it improves on the Decision Tree by about 4 points.

6. **Support Vector Regression: R² = -0.2288**
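   - **Interpretation:** A negative R² means Support Vector Regression (SVR) predicts worse than a constant model that always outputs the mean of the test targets. SVR is sensitive to the scale of features and target, and this notebook applies no scaling step and uses default hyperparameters (`C=1.0`, `epsilon=0.1`), so missing scaling and untuned parameters are the most likely causes.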
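
To make the negative score concrete: R² is defined as 1 − SS_res/SS_tot, so a model that always predicts the mean of the test targets scores exactly 0, and anything worse goes negative. A minimal sketch with made-up numbers (the values below are illustrative, not taken from the dataset, and `r2` is a hand-written helper rather than scikit-learn's `r2_score`):

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Illustrative target values only
y_true = [100.0, 200.0, 300.0, 400.0]

perfect = r2(y_true, y_true)             # every prediction exact
mean_baseline = r2(y_true, [250.0] * 4)  # always predict the mean (250)
off_center = r2(y_true, [150.0] * 4)     # constant prediction away from the mean

print(perfect, mean_baseline, round(off_center, 4))  # 1.0 0.0 -0.8
```

The third case shows how a model whose predictions barely move and sit off-center can land below zero, which is the pattern the SVR score suggests.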

### **Summary**

- **Best Performing Models:** Random Forest (R² = 0.8953) and Decision Tree (R² = 0.8555). Both can model non-linear relationships and feature interactions that the linear models miss.
- **Regularized Linear Models:** Ridge and Lasso Regression match plain Linear Regression at R² ≈ 0.765. Regularization can help with multicollinearity (Ridge) and feature selection (Lasso), but at default settings it makes no meaningful difference here.
- **Poor Performance:** Support Vector Regression has a negative R² (-0.2288). With default hyperparameters and no feature scaling it is not usable on this dataset; scaling and tuning of `C`, `epsilon`, and the kernel would be needed before judging it.
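
The scaling hypothesis for SVR can be checked with a pipeline. The sketch below uses synthetic data as a stand-in for the real dataset (the `area` and `rooms` features and their coefficients are assumptions made up for illustration). It standardizes the features with `StandardScaler` inside a pipeline and the target with `TransformedTargetRegressor`, then compares against SVR on raw inputs.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (hypothetical features on very different scales)
rng = np.random.default_rng(0)
area = rng.uniform(500, 3000, size=400)
rooms = rng.integers(1, 6, size=400)
X_demo = np.column_stack([area, rooms])
y_demo = 50 * area + 20_000 * rooms + rng.normal(0, 5_000, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# SVR on raw features and raw target, as in the comparison above
raw_svr = SVR().fit(X_tr, y_tr)
r2_raw = r2_score(y_te, raw_svr.predict(X_te))

# Same estimator, but with standardized features and standardized target
scaled_svr = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR()),
    transformer=StandardScaler(),
)
scaled_svr.fit(X_tr, y_tr)
r2_scaled = r2_score(y_te, scaled_svr.predict(X_te))

print(f"raw SVR R²: {r2_raw:.3f}, scaled SVR R²: {r2_scaled:.3f}")
```

Putting the scaler inside the pipeline means it is fitted on the training data only, so no test-set statistics leak into training. Whether scaling alone rescues SVR on the actual sales data would still need to be verified.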

### **Conclusion**

The Random Forest model is the most effective of the six for predicting sales price on this data, with the highest test-set R² (0.8953), so it is the model carried forward for tuning.
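
Because the ranking above rests on a single train/test split, a cross-validated comparison gives a sense of how stable the scores are. A minimal sketch on synthetic data from `make_regression` (an assumption standing in for the real features; the synthetic target is linear, so the ranking it prints says nothing about the sales data):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for X and y
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

# Shuffled 5-fold split, shared by all models so scores are comparable
cv = KFold(n_splits=5, shuffle=True, random_state=42)
candidates = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

cv_summary = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=cv, scoring="r2")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: R² = {scores.mean():.4f} ± {scores.std():.4f}")
```

Reporting the standard deviation alongside the mean makes it easier to tell a real gap between models from split-to-split noise.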

# **Model Optimization**

In [6]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score


In [7]:
# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
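    'max_features': ['auto', 'sqrt', 'log2']  # 'auto' is rejected by scikit-learn >= 1.3 (see the failed fits below); 1.0 is the equivalent setting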
}


In [8]:
# Initialize Random Forest Regressor
rf = RandomForestRegressor(random_state=42)

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           cv=5, scoring='r2', n_jobs=-1, verbose=2)

# Fit GridSearchCV
grid_search.fit(X_train, y_train)


Fitting 5 folds for each of 324 candidates, totalling 1620 fits


540 fits failed out of a total of 1620.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
540 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 729, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 1145, in wrapper
    estimator._validate_params()
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 638, in _validate_params
    validate_parameter_constraints(
  File "/usr/local/lib/python3.10/dist-packages/sklearn/utils/_param_validation.py", line 96, in validate_parameter_constraints
    raise InvalidParameterError(
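
The traceback is cut off before the error message, but the numbers in the log are consistent with every combination using `max_features='auto'` being rejected during parameter validation. `'auto'` was deprecated in scikit-learn 1.1 and removed in 1.3; for `RandomForestRegressor` it meant "use all features", which is now written as `1.0`. The sketch below reproduces the counts from the log with plain Python and builds a corrected grid (`fixed_grid` is a name introduced here for illustration):

```python
from itertools import product

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['auto', 'sqrt', 'log2'],
}
cv_folds = 5

# Enumerate every combination, the same way an exhaustive grid search does
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]
auto_candidates = [c for c in candidates if c['max_features'] == 'auto']

total_fits = len(candidates) * cv_folds
failed_fits = len(auto_candidates) * cv_folds
print(len(candidates), total_fits, failed_fits)  # 324 1620 540

# Corrected grid: 1.0 (all features) replaces the removed 'auto' option
fixed_grid = {**param_grid, 'max_features': [1.0, 'sqrt', 'log2']}
```

One third of the candidates (108 of 324) use `'auto'`, and 108 × 5 folds gives exactly the 540 failed fits reported. The remaining 216 candidates were still evaluated, so the search result below is valid for the `'sqrt'` and `'log2'` settings only.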

In [9]:
# Get the best parameters and model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

print("Best Parameters:", best_params)

# Predict using the optimized model
y_pred_optimized = best_model.predict(X_test)

# Evaluate the optimized model
optimized_r2 = r2_score(y_test, y_pred_optimized)
print("Optimized Random Forest R²:", optimized_r2)


Best Parameters: {'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 300}
Optimized Random Forest R²: 0.9361852439925779


## Interpretation

- `param_grid`: Specifies the hyperparameter values to be tested.
- `GridSearchCV`: Searches the parameter grid exhaustively, scoring each combination with cross-validation to find the best one.
- `best_params_`: Holds the best hyperparameter combination found during the search.
- `best_estimator_`: The model refitted on the full training set with the best parameters.
- `r2_score()`: Computes the R² value of the optimized model on the test set.
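
The attributes listed above can be explored on a small search that runs in a few seconds. This is a sketch on synthetic data with a deliberately tiny grid (`small_grid` and the data are assumptions for illustration, not the grid used in this notebook):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data
X_demo, y_demo = make_regression(
    n_samples=150, n_features=4, noise=5.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# 2 x 2 = 4 candidates, each scored with 3-fold cross-validation
small_grid = {'n_estimators': [20, 40], 'max_features': [1.0, 'sqrt']}
search = GridSearchCV(
    RandomForestRegressor(random_state=42), small_grid, cv=3, scoring='r2'
)
search.fit(X_tr, y_tr)

print("best_params_:", search.best_params_)
print("best_score_ (mean CV R²):", round(search.best_score_, 4))
# best_estimator_ has already been refitted on all of X_tr (refit=True by default)
print("test R²:", round(search.best_estimator_.score(X_te, y_te), 4))
```

`best_score_` is the mean cross-validated score on the training data, while the last line is the held-out test score; the two usually differ, and both are worth reporting.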



### Steps Taken to Improve Model Performance

#### Objective:
To improve the performance of the Random Forest model for predicting sales price.

#### Steps Taken:

 **Hyperparameter Tuning:**
   - **Parameter Grid Definition:** Defined a parameter grid for `GridSearchCV` including:
     - `n_estimators`: Number of trees in the forest.
     - `max_depth`: Maximum depth of each tree.
     - `min_samples_split`: Minimum number of samples required to split an internal node.
     - `min_samples_leaf`: Minimum number of samples required at a leaf node.
     - `max_features`: Number of features to consider for the best split.
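   - **GridSearchCV Initialization:** Used `GridSearchCV` to perform an exhaustive search over the parameter grid (324 combinations), scoring each with 5-fold cross-validation on R². The 540 fits that failed were scored as `nan` and therefore excluded from selection.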

 **Model Training and Evaluation:**
   - **Training:** `GridSearchCV` refitted the Random Forest on the full training set with the best hyperparameters it found (`best_estimator_`).
   - **Prediction:** Used the optimized model to predict sales prices on the test set.
   - **Performance Evaluation:** Evaluated the model's performance using the R² score, which measures the proportion of variance in the target variable that is predictable from the features.

#### Results:

- **Best Parameters:** The optimal hyperparameters for the Random Forest model are:
  - `max_depth`: None
  - `max_features`: 'sqrt'
  - `min_samples_leaf`: 1
  - `min_samples_split`: 2
  - `n_estimators`: 300

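- **Optimized Random Forest R²:** The optimized Random Forest model reaches R² = `0.9362` on the test set, meaning about 93.62% of the variance in sales price is explained, up from 0.8953 for the untuned model. The selected `max_depth`, `min_samples_leaf`, and `min_samples_split` are scikit-learn's defaults, so the gain comes from `max_features='sqrt'` and `n_estimators=300`. Part of the difference may also reflect run-to-run variation, since the initial model was trained without a fixed `random_state`.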
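
One way to separate the effect of the hyperparameters from the effect of the random seed is to train the baseline and the tuned configuration with the same `random_state` on the same split. A minimal sketch on synthetic data (an assumption standing in for the real dataset, so the printed numbers carry no meaning for sales prices):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in data
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Same seed for both models, so only the hyperparameters differ
baseline = RandomForestRegressor(random_state=42)
tuned = RandomForestRegressor(
    n_estimators=300, max_features='sqrt', random_state=42
)

r2_baseline = baseline.fit(X_tr, y_tr).score(X_te, y_te)
r2_tuned = tuned.fit(X_tr, y_tr).score(X_te, y_te)
print(f"baseline: {r2_baseline:.4f}, tuned: {r2_tuned:.4f}, "
      f"gain: {r2_tuned - r2_baseline:+.4f}")
```

Applied to the notebook's own `X_train` and `X_test`, the same pattern would show how much of the 0.8953 to 0.9362 improvement is attributable to tuning.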

#### Conclusion:

Hyperparameter tuning raised the Random Forest's test-set R² from 0.8953 to 0.9362. The tuned model therefore predicts sales prices more accurately than any other model evaluated in this notebook.
