# Answer1

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
url = 'https://drive.google.com/uc?id=1Z9oLpmt6IDRNw7IeNcHYTGeJRYypRSC0'
# Read the CSV file into a DataFrame
df = pd.read_csv(url)

In [3]:
df.head()

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.0


In [4]:
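# Parse the room count from 'size' (e.g. "2 BHK"); only the first character is kept, so two-digit counts are truncated
df['Rooms'] = df['size'].str.split().str[0].str[0].astype('float')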
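
The expression above keeps only the first character of the first token, so a two-digit room count would be truncated to its leading digit. Below is a minimal sketch of an alternative, assuming the room count is always the first run of digits in `size`. The sample strings (including "10 BHK") are made up for illustration, and `str.extract` is swapped in for the notebook's `split`-based approach:

```python
import pandas as pd

# Hypothetical sample values; only "2 BHK" and "4 Bedroom" appear in df.head()
sizes = pd.Series(["2 BHK", "4 Bedroom", "10 BHK", None])

# Notebook approach: first character of the first token
first_char = sizes.str.split().str[0].str[0].astype("float")

# Alternative: capture the whole leading number with a regular expression
full_number = sizes.str.extract(r"(\d+)", expand=False).astype("float")

print(pd.DataFrame({"size": sizes, "first_char": first_char, "full_number": full_number}))
```

Switching the notebook to this parsing would change the `Rooms` column for any such rows, so the scores reported below would not necessarily stay the same.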

In [5]:
df['total_sqft'] = pd.to_numeric(df['total_sqft'], errors='coerce')

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13320 entries, 0 to 13319
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   area_type     13320 non-null  object 
 1   availability  13320 non-null  object 
 2   location      13319 non-null  object 
 3   size          13304 non-null  object 
 4   society       7818 non-null   object 
 5   total_sqft    13073 non-null  float64
 6   bath          13247 non-null  float64
 7   balcony       12711 non-null  float64
 8   price         13320 non-null  float64
 9   Rooms         13304 non-null  float64
dtypes: float64(5), object(5)
memory usage: 1.0+ MB


In [7]:
# Drop rows where total_sqft or Rooms is missing (NaN):
df = df.dropna(subset=['total_sqft','Rooms'])

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13065 entries, 0 to 13319
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   area_type     13065 non-null  object 
 1   availability  13065 non-null  object 
 2   location      13064 non-null  object 
 3   size          13065 non-null  object 
 4   society       7596 non-null   object 
 5   total_sqft    13065 non-null  float64
 6   bath          13057 non-null  float64
 7   balcony       12525 non-null  float64
 8   price         13065 non-null  float64
 9   Rooms         13065 non-null  float64
dtypes: float64(5), object(5)
memory usage: 1.1+ MB


In [9]:
# Split into independent variables (X) and the dependent variable (y):
X = df[['total_sqft','Rooms']]
y = df['price']

In [10]:
X

Unnamed: 0,total_sqft,Rooms
0,1056.0,2.0
1,2600.0,4.0
2,1440.0,3.0
3,1521.0,3.0
4,1200.0,2.0
...,...,...
13315,3453.0,5.0
13316,3600.0,4.0
13317,1141.0,2.0
13318,4689.0,4.0


In [11]:
y

0         39.07
1        120.00
2         62.00
3         95.00
4         51.00
          ...  
13315    231.00
13316    400.00
13317     60.00
13318    488.00
13319     17.00
Name: price, Length: 13065, dtype: float64

In [12]:
# Train/test split:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35, random_state=10)

In [13]:
X_train

Unnamed: 0,total_sqft,Rooms
4627,615.0,1.0
2833,918.0,2.0
6347,909.0,2.0
957,1303.0,2.0
3331,820.0,2.0
...,...,...
11867,1200.0,3.0
1376,1267.0,3.0
13067,7150.0,1.0
7447,4900.0,5.0


In [14]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [15]:
from sklearn.svm import SVR
svr = SVR(kernel='linear')

In [16]:
svr.fit(X_train,y_train)

In [17]:
y_pred = svr.predict(X_test)

In [19]:
# R-squared on the test set (a regression score, not a classification accuracy):
from sklearn.metrics import r2_score
r2_score(y_test,y_pred)

0.37913333354059997

# Answer2

When the goal is to predict the actual price of a house as accurately as possible, Mean Squared Error (MSE) is the more appropriate evaluation metric for a Support Vector Machine (SVM) regression model.

MSE is the average squared difference between the predicted and actual values. Because the errors are squared, large errors are penalized more heavily than small ones, which makes the metric sensitive to badly mispriced houses. It directly measures what the task is trying to minimize: the gap between predicted and actual prices.

R-squared (the coefficient of determination) is another common regression metric, but it measures the proportion of the variance in the dependent variable that is explained by the independent variables. That is useful for judging fit, yet it does not say how far off a typical prediction is, so it is less directly interpretable when the goal is price accuracy.

In summary, for predicting house prices with an SVM regression model, MSE is the more appropriate evaluation metric.
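
To illustrate how squaring weights large errors, here is a small sketch in plain Python. The helper name `mse_manual` and all the numbers are hypothetical and chosen only so the arithmetic is easy to follow. Both prediction sets have the same total absolute error (20), but one concentrates it in a single prediction:

```python
def mse_manual(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

actual = [50.0, 60.0, 70.0, 80.0]

# Every prediction is off by 5 (total absolute error: 20)
spread_out = [55.0, 65.0, 75.0, 85.0]

# One prediction is off by 20, the rest are exact (total absolute error: 20)
concentrated = [50.0, 60.0, 70.0, 100.0]

mse_spread = mse_manual(actual, spread_out)          # (25 + 25 + 25 + 25) / 4 = 25.0
mse_concentrated = mse_manual(actual, concentrated)  # (0 + 0 + 0 + 400) / 4 = 100.0

print("MSE, errors spread out:  ", mse_spread)
print("MSE, one large error:    ", mse_concentrated)
```

The single large miss yields an MSE four times higher, even though the total absolute error is identical.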

In [23]:
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test,y_pred)

In [24]:
print("Mean Squared Error:",mse)

Mean Squared Error: 15955.18189180317


# Answer3

When the dataset has a significant number of outliers, Mean Absolute Error (MAE) is the more appropriate regression metric for evaluating the SVM model.

MAE is less sensitive to outliers than Mean Squared Error (MSE) because it averages the absolute differences between the predicted and actual values. MSE squares each difference, so a few extreme errors can dominate the total and skew the evaluation.

Since MAE does not square the errors, each error contributes in proportion to its size, which gives a more robust picture of typical performance on data with outliers.
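
The difference in sensitivity can be seen on a toy example. This is a sketch with made-up numbers and hypothetical helper names (`mae_manual`, `mse_manual`), not the notebook's data: one extreme point is appended to an otherwise well-predicted set, and both metrics are recomputed.

```python
def mae_manual(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse_manual(actual, predicted):
    """Average squared difference between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

actual = [40.0, 55.0, 60.0, 75.0, 90.0]
predicted = [42.0, 53.0, 63.0, 72.0, 90.0]

mae_clean = mae_manual(actual, predicted)
mse_clean = mse_manual(actual, predicted)

# Append one hypothetical outlier: actual price 1000, predicted 100
actual_out = actual + [1000.0]
predicted_out = predicted + [100.0]

mae_out = mae_manual(actual_out, predicted_out)
mse_out = mse_manual(actual_out, predicted_out)

print(f"MAE: {mae_clean:.2f} -> {mae_out:.2f} (x{mae_out / mae_clean:.0f})")
print(f"MSE: {mse_clean:.2f} -> {mse_out:.2f} (x{mse_out / mse_clean:.0f})")
```

Both metrics increase, but MSE grows by a far larger factor because the outlier's error is squared.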

In [25]:
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test,y_pred)

In [26]:
print("Mean Absolute Error:",mae)

Mean Absolute Error: 43.230255250081306


# Answer4

When Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) are very close, RMSE is generally the better metric to report for an SVM regression model with a polynomial kernel, because it is expressed in the same units as the dependent (target) variable.

RMSE is the square root of MSE. Taking the square root brings the error back to the original scale of the data, which makes the magnitude of the errors easier to interpret and communicate in the context of the problem.

Both metrics summarize the size of the prediction errors, and lower values indicate better model performance. Because the square root is monotonic, they also rank models identically, so the choice between them comes down to interpretability, which favors RMSE.
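
As a small worked example of the units argument, the sketch below uses made-up values in which every prediction misses by exactly 10 price units. The MSE comes out in squared units, while the RMSE recovers the size of the typical miss:

```python
import math

# Hypothetical actual and predicted prices; each prediction is off by 10
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

# MSE: average of squared errors, in squared price units
mse_demo = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# RMSE: square root of MSE, back in price units
rmse_demo = math.sqrt(mse_demo)

print("MSE: ", mse_demo)   # 100.0
print("RMSE:", rmse_demo)  # 10.0
```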

In [28]:
import math
rmse = math.sqrt(mse)
print("RMSE:",rmse)

RMSE: 126.3138230432567


# Answer5
If your goal is to measure how well the model explains the variance in the target variable, the most appropriate evaluation metric would be the coefficient of determination, also known as R-squared. R-squared is commonly used for this purpose in regression models, including SVM regression.

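R-squared measures the proportion of the variance in the dependent variable that is explained by the independent variables in the model. Its maximum is 1, which corresponds to perfect predictions. A value of 0 means the model does no better than always predicting the mean of the target, and on held-out data the score can be negative when the model does worse than that. A higher R-squared value indicates a better fit of the model to the data.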

For different SVM regression models using different kernels (linear, polynomial, and RBF), you can calculate the R-squared value for each model and compare them. A higher R-squared value suggests that a larger proportion of the variance in the target variable is captured by the model.

So, in summary, use R-squared as the evaluation metric when your goal is to measure how well the model explains the variance in the target variable for SVM regression models with different kernels.
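
For reference, R-squared can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses a hypothetical helper `r2_manual` and made-up numbers to show three cases: a good fit, a model that always predicts the mean, and a model that does worse than the mean.

```python
def r2_manual(actual, predicted):
    """R-squared = 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [10.0, 20.0, 30.0, 40.0]  # mean is 25, SS_tot is 500

r2_good = r2_manual(actual, [12.0, 18.0, 33.0, 39.0])  # SS_res = 18   -> 0.964
r2_mean = r2_manual(actual, [25.0, 25.0, 25.0, 25.0])  # SS_res = 500  -> 0.0
r2_bad = r2_manual(actual, [40.0, 30.0, 20.0, 10.0])   # SS_res = 2000 -> -3.0

print("Good fit:          ", r2_good)
print("Always the mean:   ", r2_mean)
print("Worse than the mean:", r2_bad)
```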

# for Linear kernel:

In [29]:
from sklearn.svm import SVR
svr = SVR(kernel='linear')

In [30]:
svr.fit(X_train,y_train)

In [31]:
y_pred = svr.predict(X_test)

In [32]:
# R-squared on the test set:
from sklearn.metrics import r2_score
print("R2_square",r2_score(y_test,y_pred))

R2_square 0.37913333354059997


# for Polynomial:

In [33]:
svr = SVR(kernel='poly')

In [34]:
svr.fit(X_train,y_train)

In [35]:
y_pred = svr.predict(X_test)

In [36]:
# R-squared on the test set:
from sklearn.metrics import r2_score
print("R2_square",r2_score(y_test,y_pred))

R2_square 0.061893110469412815


# for RBF:

In [39]:
svr = SVR(kernel='rbf')

In [40]:
svr.fit(X_train,y_train)

In [42]:
y_pred = svr.predict(X_test)

In [43]:
# R-squared on the test set:
from sklearn.metrics import r2_score
print("R2_square",r2_score(y_test,y_pred))

R2_square 0.2597704628232984
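
The three kernel runs above repeat the same fit, predict, and score steps. A minimal sketch of doing this in one loop follows. It runs on synthetic data (the arrays `X_demo` and `y_demo` are invented here, not the housing dataset) and wraps the scaler and the model in a scikit-learn pipeline, which is a design choice of this sketch and not what the notebook does. All SVR hyperparameters are left at their defaults, as in the cells above.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data: two features, target roughly linear in the first one
rng = np.random.default_rng(0)
X_demo = rng.uniform(500, 3000, size=(200, 2))
y_demo = 0.05 * X_demo[:, 0] + rng.normal(0, 5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.35, random_state=10
)

scores = {}
for kernel in ("linear", "poly", "rbf"):
    # The pipeline fits the scaler on the training split only, then fits the SVR
    model = make_pipeline(StandardScaler(), SVR(kernel=kernel))
    model.fit(X_tr, y_tr)
    scores[kernel] = r2_score(y_te, model.predict(X_te))

for kernel, score in scores.items():
    print(f"{kernel}: R2 = {score:.3f}")
```

Collecting the scores in a dictionary makes it easy to compare kernels side by side or pick the best with `max(scores, key=scores.get)`.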
