# Q1. In order to predict house price based on several characteristics, such as location, square footage, number of bedrooms, etc., you are developing an SVM regression model. Which regression metric in this situation would be the best to employ?

In the context of predicting house prices using an SVM regression model, the most appropriate regression metric to employ would be the **Mean Absolute Error (MAE)** or the **Root Mean Squared Error (RMSE)**.

Here's a brief explanation of both metrics:

1. **Mean Absolute Error (MAE):**
   - MAE measures the average absolute differences between predicted and actual values.
   - It shows how far off the predictions are on average; because the errors are not squared, large errors are not penalized as heavily as they are under RMSE.
   - MAE is easy to understand and interpret, as it represents the average absolute difference in the units of the target variable.

2. **Root Mean Squared Error (RMSE):**
   - RMSE measures the square root of the average of squared differences between predicted and actual values.
   - It penalizes larger errors more heavily compared to MAE, making it sensitive to outliers or large prediction errors.
   - RMSE is useful when you want to give more weight to large errors.

Choosing between MAE and RMSE depends on the specific characteristics of the problem. If you want to have a metric that is more sensitive to large errors, use RMSE. However, if you want a metric that is more robust to outliers, MAE might be a better choice.

In real estate, both metrics are commonly used, and it is good practice to report both for a more complete picture of the model's performance.
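
For reference, the two metrics can be written as follows. The symbols are notation introduced here rather than taken from the notebook: $y_i$ is the actual price, $\hat{y}_i$ the predicted price, and $n$ the number of test samples.

$$
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}
$$

Both are expressed in the units of the target, so here they are in the same units as `price`.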

In [5]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
model = SVR()

from sklearn.model_selection import GridSearchCV

In [6]:
df = pd.read_csv("Bengaluru_House_Data_1.csv")
df1 = pd.read_csv('Bengaluru_House_Data__2.csv')

In [7]:
df.head()

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.0


In [8]:
df1.head()

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Plot Area,18-Mar,Choodasandra,3 Bedroom,,900,4.0,3.0,155.0
1,Super built-up Area,Ready To Move,Kengeri Hobli,3 BHK,,1082,2.0,0.0,42.0
2,Super built-up Area,Ready To Move,Uttarahalli,3 BHK,,1350,2.0,3.0,47.24
3,Super built-up Area,17-Dec,Kadugodi,2 BHK,Alestrb,1314,2.0,2.0,78.0
4,Super built-up Area,18-May,Maithri Layout,2 BHK,,1290,2.0,3.0,40.63


In [9]:
combine_df = pd.concat([df , df1] , axis=0 , ignore_index=True)

combine_df.head()

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.0


In [10]:
combine_df.isnull().sum()

area_type          0
availability       0
location           1
size              16
society         5502
total_sqft         0
bath              73
balcony          609
price              0
dtype: int64

In [11]:
combine_df.balcony.unique()

array([ 1.,  3., nan,  2.,  0.])

In [12]:
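# Fill missing numeric values with the column median (assigning back avoids chained in-place edits)
combine_df['balcony'] = combine_df['balcony'].fillna(combine_df['balcony'].median())

# Note: dropna on a single selected column does not remove the row from combine_df,
# so the one missing location is still counted in the output below.
# combine_df.dropna(subset=['location'], inplace=True) would drop that row.
combine_df['location'].dropna(inplace=True)

combine_df['bath'] = combine_df['bath'].fillna(combine_df['bath'].median())
combine_df.isnull().sum()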

area_type          0
availability       0
location           1
size              16
society         5502
total_sqft         0
bath               0
balcony            0
price              0
dtype: int64

In [13]:
from sklearn.preprocessing import LabelEncoder

# One encoder per column, so each fitted mapping can be reused or inverted later
encoder_area = LabelEncoder()
encoder_availability = LabelEncoder()
encoder_location = LabelEncoder()
encoder_size = LabelEncoder()

combine_df['area_type_2'] = encoder_area.fit_transform(combine_df['area_type'])
combine_df['availability_2'] = encoder_availability.fit_transform(combine_df['availability'])
combine_df['location_2'] = encoder_location.fit_transform(combine_df['location'])
combine_df['size_2'] = encoder_size.fit_transform(combine_df['size'])

# total_sqft is label-encoded too (the column name 'total_sqrt_2' is kept as-is to match the outputs)
l_en = LabelEncoder()
combine_df['total_sqrt_2'] = l_en.fit_transform(combine_df['total_sqft'])

In [14]:
combine_df.isnull().sum()

area_type            0
availability         0
location             1
size                16
society           5502
total_sqft           0
bath                 0
balcony              0
price                0
area_type_2          0
availability_2       0
location_2           0
size_2               0
total_sqrt_2         0
dtype: int64

In [15]:
x = combine_df.drop(['price','area_type' , "availability" , "location" , "size" , 'society' , 'total_sqft'] , axis=1)

y = combine_df.price


In [16]:
X_train, X_test, y_train, y_test = train_test_split(x,y,test_size=0.20)

r_model = SVR(kernel='rbf')

r_model.fit(X_train , y_train)

# Q2. You have built an SVM regression model and are trying to decide between using MSE or R-squared as your evaluation metric. Which metric would be more appropriate if your goal is to predict the actual price of a house as accurately as possible?

In [13]:
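q2_model = SVR(kernel='rbf')

q2_model.fit(X_train , y_train)
y_pred = q2_model.predict(X_test)

from sklearn.metrics import mean_squared_error , r2_score

# scikit-learn metrics take (y_true, y_pred) in that order
print(mean_squared_error(y_test , y_pred))

# r2_score is not symmetric; the R-squared recorded below came from the swapped call r2_score(y_pred , y_test)
print(r2_score(y_test , y_pred))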

16385.686754444283
-14.168457659390114


If your goal is to predict the actual price of a house as accurately as possible, then **Mean Squared Error (MSE)** would be the more appropriate evaluation metric.

Here's why:

1. **Mean Squared Error (MSE):**
   - MSE measures the average of the squared differences between predicted and actual values.
   - It gives more weight to larger errors, which means it penalizes large deviations from the actual values more heavily.
   - In the context of house price prediction, it's crucial to minimize large prediction errors, as a significant difference between the predicted and actual price can have a substantial impact on the buyer or seller.

2. **R-squared (Coefficient of Determination):**
   - R-squared measures the proportion of the variance in the dependent variable (house prices) that is predictable from the independent variables (features).
   - While R-squared is a valuable metric for understanding the proportion of variability explained by the model, it doesn't directly measure prediction accuracy in terms of absolute price values.

Since your primary goal is to predict the actual price of a house as accurately as possible, you want to minimize the error in your predictions. MSE directly reflects this objective.

```python
from sklearn.metrics import mean_squared_error

# Assuming y_true and y_pred are your actual and predicted house prices
mse = mean_squared_error(y_true, y_pred)

print(f'MSE: {mse}')
```

Remember that while R-squared is important for understanding the goodness of fit of your model, it might not be the best metric when the absolute accuracy of predictions is the top priority.
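
One practical detail when computing both metrics with scikit-learn is argument order: the functions expect `(y_true, y_pred)`. The sketch below uses made-up prices (not values from the housing data) to show that MSE is unaffected by swapping the arguments, whereas R-squared is not, because its denominator is the variance of whatever is passed first.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical prices, made up for illustration only
y_true = np.array([40.0, 60.0, 95.0, 120.0, 150.0])
y_pred = np.array([70.0, 75.0, 90.0, 100.0, 105.0])

# MSE is symmetric in its arguments
mse_correct = mean_squared_error(y_true, y_pred)
mse_swapped = mean_squared_error(y_pred, y_true)

# R-squared is not: it normalizes by the variance of the first argument
r2_correct = r2_score(y_true, y_pred)
r2_swapped = r2_score(y_pred, y_true)

print(f"MSE correct: {mse_correct:.2f}  swapped: {mse_swapped:.2f}")
print(f"R2  correct: {r2_correct:.3f}  swapped: {r2_swapped:.3f}")
```

With these numbers the swapped R-squared even changes sign, so it is worth double-checking the order whenever R-squared looks implausible.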

# Q3. You have a dataset with a significant number of outliers and are trying to select an appropriate regression metric to use with your SVM model. Which metric would be the most appropriate in this scenario?

In [14]:
from sklearn.metrics import mean_absolute_error

print(mean_absolute_error(y_test , y_pred))

48.83646544682976


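**Mean Absolute Error (MAE)** is the most appropriate metric when the data contains many outliers:

- MAE measures the average absolute difference between predicted and actual values.
- It is less sensitive to outliers than Mean Squared Error (MSE) or Root Mean Squared Error (RMSE).
- Because MAE does not square the errors, a few large deviations do not dominate the score, which makes it more robust in the presence of outliers.

For the RBF model above, MAE is about 48.8 while the RMSE implied by the recorded MSE (the square root of 16385.69) is about 128.0; a gap of that size suggests that a small number of large errors dominates the squared metric.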
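
The difference in outlier sensitivity can be sketched on a toy example. The arrays below are hypothetical values invented for illustration, and `mae_and_rmse` is a helper defined only for this sketch.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical actual and predicted prices, made up for illustration
y_true = np.array([50.0, 60.0, 70.0, 80.0, 90.0])
y_pred = np.array([52.0, 58.0, 73.0, 77.0, 91.0])

def mae_and_rmse(actual, predicted):
    """Return (MAE, RMSE) for the given arrays."""
    mae = mean_absolute_error(actual, predicted)
    rmse = np.sqrt(mean_squared_error(actual, predicted))
    return mae, rmse

mae_clean, rmse_clean = mae_and_rmse(y_true, y_pred)

# Introduce one outlier: a single house whose actual price is far higher
y_true_outlier = y_true.copy()
y_true_outlier[-1] = 500.0
mae_out, rmse_out = mae_and_rmse(y_true_outlier, y_pred)

print(f"Without outlier: MAE={mae_clean:.2f}, RMSE={rmse_clean:.2f}")
print(f"With outlier:    MAE={mae_out:.2f}, RMSE={rmse_out:.2f}")
```

Both metrics rise when the outlier is added, but RMSE rises by considerably more, because the single large error is squared before averaging.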


# Q4. You have built an SVM regression model using a polynomial kernel and are trying to select the best metric to evaluate its performance. You have calculated both MSE and RMSE and found that both values are very close. Which metric should you choose to use in this case?

Since RMSE is the square root of MSE, the two values can only be very close when the MSE itself is close to 1 (or to 0); their closeness says nothing about how the errors are distributed across the data points. In this scenario, it is advisable to report the Root Mean Squared Error (RMSE) rather than MSE, because RMSE is expressed in the same units as the target and is therefore easier to interpret. Both metrics rank models identically, so nothing is lost by the choice.

**Root Mean Squared Error (RMSE):**

- RMSE is the square root of MSE: it averages the squared differences between predicted and actual values and then takes the square root of that average.
- Like MSE, it penalizes larger errors more heavily than smaller ones, which makes it sensitive to outliers or large prediction errors.
- Because of the square root, RMSE is in the units of the target variable, so it can be read directly as a typical error size.

**Mean Squared Error (MSE):**

- MSE measures the average of the squared differences between predicted and actual values, without taking the square root.
- It is in squared units of the target, so it does not give a direct interpretation of the size of the errors.
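
A small numeric sketch of how the two metrics relate, using hypothetical MSE values chosen only for illustration. Since RMSE is the square root of MSE, the two numbers coincide at 1 and drift apart as the MSE moves away from it.

```python
import math

# Hypothetical MSE values, chosen only to show the relationship RMSE = sqrt(MSE)
mse_values = [0.25, 1.0, 100.0, 10_000.0]

# Pair each MSE with its RMSE
mse_rmse_pairs = [(value, math.sqrt(value)) for value in mse_values]

for mse_value, rmse_value in mse_rmse_pairs:
    print(f"MSE = {mse_value:>10.2f}  ->  RMSE = {rmse_value:>8.2f}")
```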

In [15]:
q4_model = SVR(kernel='poly')

q4_model.fit(X_train , y_train)

q4_pred = q4_model.predict(X_test)

# scikit-learn metrics take (y_true, y_pred) in that order
mse = mean_squared_error(y_test , q4_pred)

import numpy as np
rmse = np.sqrt(mse)

print(f"MSE : {mse}")
print(f"RMSE : {rmse}")


MSE : 18634.09305883111
RMSE : 136.50675096430618


# Q5. You are comparing the performance of different SVM regression models using different kernels (linear, polynomial, and RBF) and are trying to select the best evaluation metric. Which metric would be most appropriate if your goal is to measure how well the model explains the variance in the target variable?

If your goal is to measure how well the model explains the variance in the target variable, the most appropriate evaluation metric is the Coefficient of Determination (R-squared).

Here's why:

**Coefficient of Determination (R-squared):**

- R-squared measures the proportion of the variance in the dependent variable (target variable) that is predictable from the independent variables (features).
- It indicates how well the model fits the data; a higher R-squared value indicates a better fit.
- A value of 1 indicates a perfect fit, and 0 means the model does no better than always predicting the mean of the target. On held-out data R-squared can also be negative, which means the model performs worse than that mean baseline.
- When comparing SVM regression models with different kernels (linear, polynomial, and RBF), R-squared lets you assess how well each model captures the variance in the target variable on a common, unit-free scale.
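
A minimal sketch of such a kernel comparison is shown below. It uses synthetic data from `make_regression` as a stand-in for the housing features (an assumption for this example, not the notebook's dataset) and wraps the scaler and model in a pipeline so that the same scaling is applied when fitting and when predicting.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data stands in for the housing features (an assumption for this sketch)
X_demo, y_demo = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

r2_by_kernel = {}
for kernel in ["linear", "poly", "rbf"]:
    # The pipeline scales the features identically at fit and predict time
    pipe = make_pipeline(StandardScaler(), SVR(kernel=kernel))
    pipe.fit(X_tr, y_tr)
    # r2_score expects (y_true, y_pred) in that order
    r2_by_kernel[kernel] = r2_score(y_te, pipe.predict(X_te))

for kernel, score in r2_by_kernel.items():
    print(f"{kernel:>6}: R^2 = {score:.3f}")
```

Which kernel scores highest depends on the data and on hyperparameters such as `C`, `epsilon`, and `degree`, which are left at their defaults here.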

In [18]:
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x,y,test_size=0.30 , random_state=42)

s_scaler = StandardScaler()
X_train_scaled = s_scaler.fit_transform(X_train)
X_test_scaled = s_scaler.transform(X_test)

q5_svr_model = SVR(kernel='poly')

q5_svr_model.fit(X_train_scaled , y_train)

In [28]:
from sklearn.metrics import mean_absolute_error , r2_score , mean_squared_error

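# The model was fit on scaled features, so it must predict on X_test_scaled.
# The figures recorded below came from predicting on the unscaled X_test with swapped metric arguments.
y_pred = q5_svr_model.predict(X_test_scaled)


# scikit-learn metrics take (y_true, y_pred) in that order
mae = mean_absolute_error(y_test , y_pred)
msr = mean_squared_error(y_test , y_pred)
r2 = r2_score(y_test , y_pred)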


print("MAE" , mae)
print("MSR" , msr)
print("r2" , r2)



MAE 1891559405.0555675
MSR 1.1085187440965087e+19
r2 -0.0009177555504140678


If your goal is to measure how well the model explains the variance in the target variable, then the most appropriate evaluation metric would be the **Coefficient of Determination (R-squared)**.

R-squared is a statistical measure that represents the proportion of the variance in the dependent variable that is predictable from the independent variables. A value of 1 indicates that the model perfectly explains the variance, 0 indicates that it explains none of it (no better than always predicting the mean), and negative values, which can occur on test data, indicate a fit worse than that baseline.

In the context of SVM regression models, you can use R-squared to assess how well the model captures the variation in the target variable. A higher R-squared value indicates a better fit.

Keep in mind that while R-squared is a valuable metric for assessing how well the model explains the variance, it's important to also consider other metrics (such as Mean Absolute Error, Mean Squared Error, etc.) to get a comprehensive understanding of the model's performance.
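
As a quick check of the R-squared scale, the sketch below scores three sets of predictions against the same made-up target values: exact predictions, a constant prediction equal to the mean, and predictions that are worse than the mean. The arrays are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical target values, made up for illustration
y_true = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Exact predictions
r2_perfect = r2_score(y_true, y_true)

# Always predicting the mean of the target
r2_mean_only = r2_score(y_true, np.full_like(y_true, y_true.mean()))

# Predictions in reversed order, which are worse than the mean baseline
r2_worse = r2_score(y_true, y_true[::-1].copy())

print(r2_perfect, r2_mean_only, r2_worse)
```

The three scores land at the top of the scale, at zero, and below zero respectively, which is why a negative R-squared on test data is a sign that something in the model or the evaluation needs attention.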