#### Q1. In order to predict house price based on several characteristics, such as location, square footage, number of bedrooms, etc., you are developing an SVM regression model. Which regression metric in this situation would be the best to employ?

##### Dataset link: https://drive.google.com/file/d/1Z9oLpmt6IDRNw7IeNcHYTGeJRYypRSC0/view?usp=share_link

#### Q2. You have built an SVM regression model and are trying to decide between using MSE or R-squared as your evaluation metric. Which metric would be more appropriate if your goal is to predict the actual price of a house as accurately as possible?

#### Q3. You have a dataset with a significant number of outliers and are trying to select an appropriate regression metric to use with your SVM model. Which metric would be the most appropriate in this scenario?

#### Q4. You have built an SVM regression model using a polynomial kernel and are trying to select the best metric to evaluate its performance. You have calculated both MSE and RMSE and found that both values are very close. Which metric should you choose to use in this case?

#### Q5. You are comparing the performance of different SVM regression models using different kernels (linear, polynomial, and RBF) and are trying to select the best evaluation metric. Which metric would be most appropriate if your goal is to measure how well the model explains the variance in the target variable?

## Answers

#### Q1. In order to predict house price based on several characteristics, such as location, square footage, number of bedrooms, etc., you are developing an SVM regression model. Which regression metric in this situation would be the best to employ?
##### Dataset link: https://drive.google.com/file/d/1Z9oLpmt6IDRNw7IeNcHYTGeJRYypRSC0/view?usp=share_link


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
pip install gdown

Collecting gdown
  Downloading gdown-4.7.1-py3-none-any.whl (15 kB)
Collecting filelock
  Downloading filelock-3.12.4-py3-none-any.whl (11 kB)
Installing collected packages: filelock, gdown
Successfully installed filelock-3.12.4 gdown-4.7.1
Note: you may need to restart the kernel to use updated packages.


In [2]:
import gdown
import pandas as pd

# Define the Google Drive file ID
file_id = "1Z9oLpmt6IDRNw7IeNcHYTGeJRYypRSC0"

# Define the URL to download the file
url = f"https://drive.google.com/uc?id={file_id}"

# Define the output file name
output_file = "data.csv"

# Download the file from Google Drive
gdown.download(url, output_file, quiet=False)

# Load the downloaded CSV file into a Pandas DataFrame
df = pd.read_csv(output_file)

# Now you can work with the DataFrame 'df'


Downloading...
From: https://drive.google.com/uc?id=1Z9oLpmt6IDRNw7IeNcHYTGeJRYypRSC0
To: /home/jovyan/work/data.csv
100%|██████████| 938k/938k [00:00<00:00, 2.46MB/s]


In [3]:
df.head()

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.0


In [4]:
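from sklearn.preprocessing import LabelEncoder

# Replace each category in 'size' with an integer code. The codes follow the
# sorted order of the labels and carry no numeric meaning.
encoder = LabelEncoder()
df['size'] = encoder.fit_transform(df['size'])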


In [5]:
# Encode the remaining categorical columns the same way (the encoder is refitted per column).
df['society'] = encoder.fit_transform(df['society'])
df['location'] = encoder.fit_transform(df['location'])
df['availability'] = encoder.fit_transform(df['availability'])
df['area_type'] = encoder.fit_transform(df['area_type'])

In [6]:
df.head()

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,3,40,419,13,464,1056,2.0,1.0,39.07
1,2,80,317,19,2439,2600,5.0,3.0,120.0
2,0,80,1179,16,2688,1440,2.0,3.0,62.0
3,3,80,757,16,2186,1521,3.0,1.0,95.0
4,3,80,716,13,2688,1200,2.0,1.0,51.0


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13320 entries, 0 to 13319
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   area_type     13320 non-null  int64  
 1   availability  13320 non-null  int64  
 2   location      13320 non-null  int64  
 3   size          13320 non-null  int64  
 4   society       13320 non-null  int64  
 5   total_sqft    13320 non-null  object 
 6   bath          13247 non-null  float64
 7   balcony       12711 non-null  float64
 8   price         13320 non-null  float64
dtypes: float64(3), int64(5), object(1)
memory usage: 936.7+ KB


In [8]:
df.isnull().sum()

area_type         0
availability      0
location          0
size              0
society           0
total_sqft        0
bath             73
balcony         609
price             0
dtype: int64

In [9]:
df.shape

(13320, 9)

In [10]:
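# Split on '-' so hyphenated entries are separated into two parts; rows without a hyphen get a missing total_sqft_2.
df[['total_sqft_1', 'total_sqft_2']] = df['total_sqft'].str.split('-', expand=True)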

In [11]:
df.head()


Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price,total_sqft_1,total_sqft_2
0,3,40,419,13,464,1056,2.0,1.0,39.07,1056,
1,2,80,317,19,2439,2600,5.0,3.0,120.0,2600,
2,0,80,1179,16,2688,1440,2.0,3.0,62.0,1440,
3,3,80,757,16,2186,1521,3.0,1.0,95.0,1521,
4,3,80,716,13,2688,1200,2.0,1.0,51.0,1200,


In [12]:
df['total_sqft'] = df['total_sqft'].replace('None', np.nan)

In [13]:
df.isnull().sum()

area_type           0
availability        0
location            0
size                0
society             0
total_sqft          0
bath               73
balcony           609
price               0
total_sqft_1        0
total_sqft_2    13119
dtype: int64

In [14]:
df.shape

(13320, 11)

In [15]:
df=df.drop(['total_sqft_2','total_sqft'], axis=1)

In [16]:
df.head()

Unnamed: 0,area_type,availability,location,size,society,bath,balcony,price,total_sqft_1
0,3,40,419,13,464,2.0,1.0,39.07,1056
1,2,80,317,19,2439,5.0,3.0,120.0,2600
2,0,80,1179,16,2688,2.0,3.0,62.0,1440
3,3,80,757,16,2186,3.0,1.0,95.0,1521
4,3,80,716,13,2688,2.0,1.0,51.0,1200


In [17]:
df['balcony'] = df['balcony'].fillna(df['balcony'].mode()[0])

In [18]:
df['bath'] = df['bath'].fillna(df['bath'].mode()[0])

In [19]:
df.isnull().sum()

area_type       0
availability    0
location        0
size            0
society         0
bath            0
balcony         0
price           0
total_sqft_1    0
dtype: int64

In [20]:
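# Keep the part before any 'S' and discard the rest; the next cells repeat this for 'A', 'P', 'C' and 'G'.
df[['total_sqft_1', 'total_sqft_S']] = df['total_sqft_1'].str.split('S', expand=True)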

In [21]:
df[['total_sqft_1', 'total_sqft_A']] = df['total_sqft_1'].str.split('A', expand=True)

In [22]:
df[['total_sqft_1', 'total_sqft_P']] = df['total_sqft_1'].str.split('P', expand=True)

In [23]:
df[['total_sqft_1', 'total_sqft_C']] = df['total_sqft_1'].str.split('C', expand=True)

In [24]:
df[['total_sqft_1', 'total_sqft_G']] = df['total_sqft_1'].str.split('G', expand=True)

In [25]:
df=df.drop(['total_sqft_S','total_sqft_A','total_sqft_P','total_sqft_C','total_sqft_G'], axis=1)

In [26]:
df.head()

Unnamed: 0,area_type,availability,location,size,society,bath,balcony,price,total_sqft_1
0,3,40,419,13,464,2.0,1.0,39.07,1056
1,2,80,317,19,2439,5.0,3.0,120.0,2600
2,0,80,1179,16,2688,2.0,3.0,62.0,1440
3,3,80,757,16,2186,3.0,1.0,95.0,1521
4,3,80,716,13,2688,2.0,1.0,51.0,1200


In [27]:
df['total_sqft_1']=df['total_sqft_1'].astype('float')
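
The splits above keep the first number of each `total_sqft` entry and discard everything after it, so a range is reduced to its lower bound and a value given in another unit is treated as square feet. A minimal alternative sketch is shown below. It assumes entries look like `"1056"`, `"2100 - 2850"`, or `"34.46Sq. Meter"` (formats suggested by the splits, not verified against the file); the helper name `parse_sqft`, the midpoint rule for ranges, and the unit table are all hypothetical choices.

```python
import re

# Conversion factors to square feet. Only units with well-known factors are
# listed; anything else is reported as unparseable instead of being guessed.
UNIT_TO_SQFT = {
    "sq. meter": 10.7639,
    "sq. yards": 9.0,
    "acres": 43560.0,
}

_NUMBER = r"\d+(?:\.\d+)?"


def parse_sqft(value):
    """Return the area in square feet as a float, or None if it cannot be parsed."""
    text = str(value).strip()

    # Range such as "2100 - 2850": use the midpoint (an assumption).
    match = re.fullmatch(rf"({_NUMBER})\s*-\s*({_NUMBER})", text)
    if match:
        return (float(match.group(1)) + float(match.group(2))) / 2

    # Number with an optional trailing unit such as "34.46Sq. Meter".
    match = re.fullmatch(rf"({_NUMBER})\s*(.*)", text)
    if match:
        number = float(match.group(1))
        unit = match.group(2).strip().lower()
        if unit == "":
            return number
        if unit in UNIT_TO_SQFT:
            return number * UNIT_TO_SQFT[unit]

    return None


for raw in ["1056", "2100 - 2850", "34.46Sq. Meter", "not a number"]:
    print(repr(raw), "->", parse_sqft(raw))
```

In the notebook this could be applied with `df['total_sqft'].apply(parse_sqft)` before the original column is dropped; rows that come back as `None` would then need to be imputed or removed.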

In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13320 entries, 0 to 13319
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   area_type     13320 non-null  int64  
 1   availability  13320 non-null  int64  
 2   location      13320 non-null  int64  
 3   size          13320 non-null  int64  
 4   society       13320 non-null  int64  
 5   bath          13320 non-null  float64
 6   balcony       13320 non-null  float64
 7   price         13320 non-null  float64
 8   total_sqft_1  13320 non-null  float64
dtypes: float64(4), int64(5)
memory usage: 936.7 KB


In [29]:
df.columns

Index(['area_type', 'availability', 'location', 'size', 'society', 'bath',
       'balcony', 'price', 'total_sqft_1'],
      dtype='object')

In [30]:
X=df[['area_type', 'availability', 'location', 'size', 'society', 'bath',
       'balcony', 'total_sqft_1']]
y=df['price']

In [31]:
X.head()

Unnamed: 0,area_type,availability,location,size,society,bath,balcony,total_sqft_1
0,3,40,419,13,464,2.0,1.0,1056.0
1,2,80,317,19,2439,5.0,3.0,2600.0
2,0,80,1179,16,2688,2.0,3.0,1440.0
3,3,80,757,16,2186,3.0,1.0,1521.0
4,3,80,716,13,2688,2.0,1.0,1200.0


In [32]:
y.head()

0     39.07
1    120.00
2     62.00
3     95.00
4     51.00
Name: price, dtype: float64

In [33]:
from sklearn.model_selection import train_test_split

# Hold out 25% of the rows for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [34]:
from sklearn.svm import SVR
svr=SVR()

In [35]:
svr.fit(X_train,y_train)

In [36]:
y_pred=svr.predict(X_test)
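
The model above is fitted on unscaled features. SVR with the default RBF kernel is sensitive to feature scale, because features with large numeric ranges dominate the kernel distances. The sketch below illustrates the effect on synthetic data (not the housing dataset): the target depends on a feature in [0, 1], while an irrelevant feature spans [0, 10000]. All variable names and the data-generating choices are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data: y is driven by the small-scale feature only.
rng = np.random.default_rng(42)
small = rng.uniform(0.0, 1.0, size=400)
large = rng.uniform(0.0, 10_000.0, size=400)
X = np.column_stack([small, large])
y = 10.0 * small + rng.normal(0.0, 0.5, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# SVR fitted directly on the raw features.
raw_model = SVR().fit(X_train, y_train)
r2_raw = r2_score(y_test, raw_model.predict(X_test))

# Same SVR with standardized features. Inside a pipeline the scaler is
# fitted on the training split only, which avoids leaking test statistics.
scaled_model = make_pipeline(StandardScaler(), SVR()).fit(X_train, y_train)
r2_scaled = r2_score(y_test, scaled_model.predict(X_test))

print(f"R^2 without scaling: {r2_raw:.3f}")
print(f"R^2 with scaling:    {r2_scaled:.3f}")
```

Whether scaling would raise the score on the housing data is not shown here; the sketch only demonstrates why it is usually worth trying before comparing metrics.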

In [37]:
from sklearn.metrics import r2_score

In [38]:
r2_score(y_test,y_pred)

0.312440482534433
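
The cell above reports only R-squared. Since the question asks which metric suits price prediction, it can help to look at the error-based metrics next to it. The sketch below computes MAE, MSE, RMSE and R-squared with `sklearn.metrics`. The helper name `regression_report` is hypothetical, the actual prices are the five values shown by `y.head()`, and the predictions are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),  # same units as the target
        "R2": r2_score(y_true, y_pred),
    }


# Actual prices from y.head(); the predictions are hypothetical.
y_true = [39.07, 120.0, 62.0, 95.0, 51.0]
y_hypothetical = [45.0, 100.0, 60.0, 90.0, 55.0]

report = regression_report(y_true, y_hypothetical)
for name, value in report.items():
    print(f"{name:>4}: {value:.3f}")
```

In the notebook the same helper could be called as `regression_report(y_test, y_pred)`. MAE and RMSE are expressed in the units of `price`, which makes them easier to communicate than R-squared when the size of the error is what matters.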



#### Q2. You have built an SVM regression model and are trying to decide between using MSE or R-squared as your evaluation metric. Which metric would be more appropriate if your goal is to predict the actual price of a house as accurately as possible?



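If your goal is to predict the actual price of a house as accurately as possible, Mean Squared Error (MSE) is the more appropriate of the two metrics for your SVM regression model: it measures the size of the prediction errors directly, whereas R-squared measures the proportion of variance explained.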

#### Accuracy of Predictions: 
MSE measures the average squared difference between the predicted values and the actual target values. It directly assesses prediction error, although in squared units of the target variable (e.g., squared dollars for house prices); taking the square root (RMSE) returns the error to the original units.

#### Sensitivity to Errors: 
MSE penalizes larger errors more heavily than smaller errors because it squares the differences between predictions and actual values. This means that it places greater emphasis on reducing significant prediction errors, which is crucial when predicting house prices accurately.

#### Commonly Used Metric:
MSE is a widely used and accepted metric for regression problems, particularly in scenarios where prediction accuracy is essential, such as house price prediction.
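
For reference, the two metrics can be written as follows, with $y_i$ the actual prices, $\hat{y}_i$ the predictions, $\bar{y}$ the mean actual price, and $n$ the number of test samples (this notation is introduced here and does not appear in the notebook):

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
$$

On a fixed test set the denominator of $R^2$ is constant, so $R^2 = 1 - \mathrm{MSE} \big/ \left(\tfrac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2\right)$ and both metrics rank competing models identically. The practical difference is interpretation: MSE reports the size of the error, while $R^2$ reports it relative to the spread of the prices.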

#### Q3. You have a dataset with a significant number of outliers and are trying to select an appropriate regression metric to use with your SVM model. Which metric would be the most appropriate in this scenario?



When a dataset contains a significant number of outliers and you are using a Support Vector Machine (SVM) for regression, choose a regression metric that is robust to outliers. A commonly used metric in such scenarios is the Mean Absolute Error (MAE).

#### Robustness to Outliers:

- MAE is less sensitive to outliers compared to other regression metrics like Mean Squared Error (MSE) or Root Mean Squared Error (RMSE).
- Outliers can have a significant impact on MSE and RMSE because they involve squaring the error values, which amplifies the effect of outliers.
- Consider a scenario where you're predicting housing prices. If your dataset contains some extreme outliers (e.g., a million-dollar house in a neighborhood of much cheaper houses), MAE can provide a more meaningful assessment of the model's prediction errors without being overly influenced by the outlier.
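
The points above can be checked numerically. The sketch below uses made-up prices and predictions (not taken from the dataset) and adds a single badly missed outlier, then compares how much MAE and RMSE grow. The helper names `mae` and `rmse` are defined here for illustration.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.mean(np.abs(diff)))


def rmse(y_true, y_pred):
    """Root mean squared error."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))


# Illustrative values: small errors on every sample.
y_true = [50.0, 60.0, 70.0, 80.0]
y_pred = [52.0, 58.0, 73.0, 79.0]

# Same data plus one outlier that the model misses by 30.
y_true_out = y_true + [130.0]
y_pred_out = y_pred + [100.0]

mae_clean, rmse_clean = mae(y_true, y_pred), rmse(y_true, y_pred)
mae_out, rmse_out = mae(y_true_out, y_pred_out), rmse(y_true_out, y_pred_out)

print(f"clean:        MAE={mae_clean:.2f}  RMSE={rmse_clean:.2f}")
print(f"with outlier: MAE={mae_out:.2f}  RMSE={rmse_out:.2f}")
print(f"growth:       MAE x{mae_out / mae_clean:.1f}  RMSE x{rmse_out / rmse_clean:.1f}")
```

Both metrics increase, but RMSE grows by a noticeably larger factor because the outlier's error is squared before averaging.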

#### Q4. You have built an SVM regression model using a polynomial kernel and are trying to select the best metric to evaluate its performance. You have calculated both MSE and RMSE and found that both values are very close. Which metric should you choose to use in this case?



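When an SVM regression model with a polynomial kernel yields MSE and RMSE values that are very close, RMSE is generally the better choice of evaluation metric. Because RMSE is the square root of MSE, the two are numerically close only when the error is near 1 (or 0), so the closeness itself says nothing about model quality; the choice rests on interpretability.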

#### RMSE Provides the Same Scale as the Target Variable:

RMSE is advantageous because it's expressed in the same units as the target variable (the dependent variable you are trying to predict). This means that you can directly interpret the RMSE value in the context of your problem without worrying about the scale of the error.
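
A short numeric sketch of the relationship between the two metrics follows. The helper name `rmse_from_mse` and the sample MSE values are chosen purely for illustration.

```python
import math


def rmse_from_mse(mse):
    """RMSE is the square root of MSE."""
    return math.sqrt(mse)


# The two numbers coincide only at 0 and 1 and drift apart elsewhere.
for mse in (0.25, 0.9, 1.0, 1.1, 100.0, 2500.0):
    rmse = rmse_from_mse(mse)
    print(f"MSE = {mse:>8.2f} (squared units)   RMSE = {rmse:>7.3f} (target units)")
```

An MSE of 2500 in squared price units corresponds to an RMSE of 50 price units, which is the figure that can be compared directly with typical house prices.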

#### Q5. You are comparing the performance of different SVM regression models using different kernels (linear, polynomial, and RBF) and are trying to select the best evaluation metric. Which metric would be most appropriate if your goal is to measure how well the model explains the variance in the target variable?

If the goal is to measure how well the model explains the variance in the target variable, the most appropriate evaluation metric is the Coefficient of Determination (R-squared or R²). R-squared quantifies the proportion of the variance in the target variable that is explained by the regression model, which indicates how well the model fits the data.

In the context of SVM regression models with different kernels (linear, polynomial, and RBF), we can use R-squared to compare their performance in explaining the variance.

- Variance: R-squared is the fraction of the variance in the target variable that is accounted for by the model's predictions. A value of 1 means a perfect fit, a value of 0 means the model does no better than always predicting the mean, and on held-out data the value can be negative.
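
A minimal sketch of such a comparison is given below. It uses synthetic data from `make_regression` rather than the housing dataset, standardizes the features inside a pipeline, and scores each kernel with 5-fold cross-validated R-squared. The sample size, noise level and default hyperparameters are assumptions for illustration, so the resulting ranking says nothing about the housing data.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression problem used only to demonstrate the workflow.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

mean_r2 = {}
for kernel in ("linear", "poly", "rbf"):
    # Scaling lives inside the pipeline so it is refitted on each training fold.
    model = make_pipeline(StandardScaler(), SVR(kernel=kernel))
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    mean_r2[kernel] = scores.mean()

for kernel, score in mean_r2.items():
    print(f"{kernel:>6}: mean R^2 = {score:.3f}")
```

Averaging R-squared over several folds gives a steadier comparison between kernels than a single train/test split.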