
In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score

In [2]:
file_path = 'Final preprocessed data.csv'
data = pd.read_csv(file_path)

In [3]:
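# Copy the slices so later column assignments do not act on a view of `data`
X = data.iloc[:, :-1].copy()  # Features: every column except the last
y = data.iloc[:, -1].copy()   # Target: the last column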

In [4]:
# Columns stored as strings (dtype object) must be encoded before modeling
non_numeric_columns = X.select_dtypes(include=['object']).columns

# Integer-encode each non-numeric column and keep the fitted encoder per column
label_encoders = {}
for column in non_numeric_columns:
    le = LabelEncoder()
    X[column] = le.fit_transform(X[column])
    label_encoders[column] = le

In [6]:
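# Replace the 'No data' placeholder with NaN.
# X's string columns were label-encoded in the previous cell, so any 'No data'
# entries there already became a category code and are not matched here.
X = X.replace('No data', np.nan)
y = y.replace('No data', np.nan)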



In [8]:
# Coerce the target to numeric; entries that cannot be parsed become NaN
y = pd.to_numeric(y, errors='coerce')

# Impute missing values with the median of each column
X = X.fillna(X.median())
y = y.fillna(y.median())
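
The imputation above computes medians over the full dataset before the train/test split, so the test rows influence the fill values. A minimal sketch of one alternative, assuming scikit-learn's `SimpleImputer` is acceptable here, learns the medians from the training rows only. The toy frame `X_toy` and target `y_toy` are made-up stand-ins, not the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy frame with missing values (assumption: stands in for the real features)
X_toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0, np.nan, 8.0],
    "b": [10.0, np.nan, 30.0, 40.0, 50.0, np.nan, 70.0, 80.0],
})
y_toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# Split first, so the test rows play no part in computing the medians
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

imputer = SimpleImputer(strategy="median")
X_tr_imp = imputer.fit_transform(X_tr)  # medians learned from training rows only
X_te_imp = imputer.transform(X_te)      # test rows reuse the training medians

print("Per-column training medians:", imputer.statistics_)
```

The same split-then-fit order is what the notebook already uses for `StandardScaler` in the next cell.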

In [9]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features for better SVR performance
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [10]:
# Define different regression models to evaluate
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree Regressor": DecisionTreeRegressor(random_state=42),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=100, random_state=42),
    "Support Vector Regressor (SVR)": SVR(kernel='rbf')
}


In [12]:
# Train and evaluate each model
model_results = {}
for model_name, model in models.items():
    if model_name == "Support Vector Regressor (SVR)":
        # Use scaled data for SVR
        model.fit(X_train_scaled, y_train)
        y_pred = model.predict(X_test_scaled)
    else:
        # Use original data for other models
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

    # Evaluate the model
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    # Store the results
    model_results[model_name] = {"MSE": mse, "R2 Score": r2}

In [13]:
# Display the results
for model_name, metrics in model_results.items():
    print(f"{model_name}:")
    print(f"  MSE: {metrics['MSE']}")
    print(f"  R2 Score: {metrics['R2 Score']}")
    print("\n")

Linear Regression:
  MSE: 625514.1679660474
  R2 Score: -391261.4842582663


Decision Tree Regressor:
  MSE: 42.12313619391305
  R2 Score: -25.348248778394257


Random Forest Regressor:
  MSE: 68682.18342064426
  R2 Score: -42960.07600060279


Support Vector Regressor (SVR):
  MSE: 1.6948550581395745
  R2 Score: -0.06014097596167134




Among the tested models, the Support Vector Regressor (SVR) has the lowest Mean Squared Error (MSE, about 1.69) and the R² score closest to zero (about -0.06). Every model, however, has a negative R², which means each one predicts the test set worse than a constant prediction equal to the mean of the test targets. SVR is therefore the least poor of the four rather than a good fit, and it still needs substantial improvement.
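
To make the meaning of a negative R² concrete, the sketch below computes R² by hand as 1 - SS_res / SS_tot, which is the definition behind scikit-learn's `r2_score`. The helper `r2_score_manual` and the numbers are made up for illustration and are not taken from the dataset.

```python
def r2_score_manual(y_true, y_pred):
    """Return R^2 = 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model error
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # mean-baseline error
    return 1.0 - ss_res / ss_tot

# Made-up targets and predictions, for illustration only
y_true = [1.0, 2.0, 3.0, 4.0]
mean_baseline = [2.5] * 4           # always predict the mean of y_true
poor_model = [4.0, 1.0, 5.0, 0.0]   # predictions far from the targets

r2_baseline = r2_score_manual(y_true, mean_baseline)
r2_poor = r2_score_manual(y_true, poor_model)

print(r2_baseline)  # 0.0: exactly as good as predicting the mean
print(r2_poor)      # -5.0: squared error is six times the baseline's
```

A score of zero corresponds to the mean baseline, so any negative value means the model's squared error exceeds that of the baseline.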

In [15]:
# Import the necessary module
from sklearn.model_selection import GridSearchCV

# Hyperparameter tuning for SVR using GridSearchCV
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf']
}
grid_search = GridSearchCV(SVR(), param_grid, cv=3, scoring='r2', verbose=2, n_jobs=-1)
grid_search.fit(X_train_scaled, y_train)

Fitting 3 folds for each of 16 candidates, totalling 48 fits


In [16]:
# Best parameters from GridSearchCV
best_params = grid_search.best_params_
print(f"Best Parameters for SVR: {best_params}")

Best Parameters for SVR: {'C': 10, 'gamma': 1, 'kernel': 'rbf'}


In [17]:
# Train the SVR model with the best parameters
best_svr = SVR(**best_params)
best_svr.fit(X_train_scaled, y_train)

In [18]:
# Predict using the optimized SVR model
y_pred_best_svr = best_svr.predict(X_test_scaled)

In [19]:
# Evaluate the optimized SVR model
mse_best_svr = mean_squared_error(y_test, y_pred_best_svr)
r2_best_svr = r2_score(y_test, y_pred_best_svr)

In [20]:
print("Optimized Support Vector Regressor (SVR):")
print(f"  MSE: {mse_best_svr}")
print(f"  R2 Score: {r2_best_svr}")

Optimized Support Vector Regressor (SVR):
  MSE: 5.67616150479142
  R2 Score: -2.5504696218749396


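This notebook applies feature scaling and tunes the SVR hyperparameters with GridSearchCV. Grid search selected C=10 and gamma=1 with the RBF kernel, but on the test set the tuned SVR (MSE about 5.68, R² about -2.55) performs worse than the default SVR (MSE about 1.69, R² about -0.06), so the tuning did not improve generalization here. Polynomial feature engineering is not applied in any of the cells above.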
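
One way to combine polynomial features, scaling, and SVR tuning is to place all three in a scikit-learn `Pipeline` and pass that pipeline to GridSearchCV, so each cross-validation fold fits the preprocessing on its own training portion. The sketch below is an illustration under assumptions, not the notebook's method: the data is synthetic, and the names `X_demo`, `y_demo`, `pipe`, and `param_grid_demo` as well as the grid values are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): the notebook's CSV is not reproduced here
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo[:, 0] ** 2 + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=120)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Within each CV fold, the expansion and scaler are refit on that fold's training part
pipe = Pipeline([
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scale", StandardScaler()),
    ("svr", SVR(kernel="rbf")),
])

# The "svr__" prefix routes each parameter to the pipeline step named "svr"
param_grid_demo = {
    "svr__C": [1, 10],
    "svr__gamma": [0.01, 0.1],
}

search = GridSearchCV(pipe, param_grid_demo, cv=3, scoring="r2")
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print("Mean cross-validated R2 of best candidate:", search.best_score_)
print("Held-out R2:", search.score(X_te, y_te))
```

Comparing the cross-validated score with the held-out score is a quick check for the kind of gap seen above, where the tuned model scored worse on the test set than the default one.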