Q1. In order to predict house price based on several characteristics, such as location, square footage,
number of bedrooms, etc., you are developing an SVM regression model. Which regression metric in this
situation would be the best to employ?

Dataset link:https://drive.google.com/file/d/1Z9oLpmt6IDRNw7IeNcHYTGeJRYypRSC0/view?usp=share_link

In the case of predicting house prices, the choice of metric should align with your specific goals. For instance, if you want to prioritize minimizing large errors, you might focus on RMSE or MSE. If your concern is the overall accuracy of predictions, MAE could be a good choice.

**Mean Squared Error (MSE):**

MSE is a widely used metric that measures the average squared difference between predicted and actual values. It gives higher weight to larger errors, which could be relevant if you want to penalize larger prediction errors more heavily.

**Root Mean Squared Error (RMSE):**

RMSE is the square root of the MSE, which is expressed in the same units as the target variable (house prices in this case). It provides an easily interpretable measure of the typical prediction error.

**Mean Absolute Error (MAE):**

MAE calculates the average absolute difference between predicted and actual values. Unlike MSE, MAE treats all errors equally regardless of their magnitude.

In [1]:
# Q2. You have built an SVM regression model and are trying to decide between using MSE or R-squared as
# your evaluation metric. Which metric would be more appropriate if your goal is to predict the actual 
# price of a house as accurately as possible?

# The metric that would be more appropriate for predicting the actual price of a house as accurately as
#  possible is Root Mean Squared Error (RMSE).

# RMSE measures the average magnitude of the errors between predicted values and actual values. It 
# gives higher weight to larger errors, which is particularly useful when dealing with predictions 
# related to price or any continuous variable. In the context of predicting house prices, RMSE will 
# penalize larger prediction errors more heavily, providing a more accurate reflection of how well 
# your model is performing in terms of predicting the actual prices.

import pandas as pd
import lxml
df=pd.read_csv('https://drive.google.com/u/0/uc?id=1Z9oLpmt6IDRNw7IeNcHYTGeJRYypRSC0&export=download')
df.head()

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.0


In [2]:
df=df[~df.isnull().any(axis=1)]
df.dropna(inplace=True)

In [3]:
df=df[~df['society'].isnull()]


In [4]:
df['society'].isnull().sum()

0

 **Q3. You have a dataset with a significant number of outliers and are trying to select an appropriate regression metric to use with your SVM model. Which metric would be the most appropriate in this scenario?**


When dealing with a dataset that contains a significant number of outliers and using a Support Vector Machine (SVM) model for regression, it's important to choose a regression metric that is robust to outliers. The most appropriate metric in this scenario would be the Mean Absolute Error (MAE).

The MAE calculates the average absolute difference between the predicted values and the actual target values. It is less sensitive to outliers compared to other metrics like Mean Squared Error (MSE) or Root Mean Squared Error (RMSE), which give more weight to larger errors and can be heavily influenced by outliers.

Since outliers can have a disproportionately large impact on the prediction errors in the case of MSE and RMSE, using MAE as the regression metric is a better choice when dealing with datasets that have significant outliers. It provides a more balanced evaluation of the model's performance that is less skewed by extreme values.


**Q4. You have built an SVM regression model using a polynomial kernel and are trying to select the best metric to evaluate its performance. You have calculated both MSE and RMSE and found that both values are very close. Which metric should you choose to use in this case?**


In this scenario, when both the Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) are very close, you should choose RMSE as the preferred metric. RMSE offers better interpretability, sensitivity to outliers, and is commonly used in regression tasks.

In [5]:
### handle missing values

# print(df.info())
df.head()
df.dropna(inplace=True)
df.reset_index()

Unnamed: 0,index,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.00
2,3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.00
3,5,Super built-up Area,Ready To Move,Whitefield,2 BHK,DuenaTa,1170,2.0,1.0,38.00
4,11,Plot Area,Ready To Move,Whitefield,4 Bedroom,Prrry M,2785,5.0,3.0,295.00
...,...,...,...,...,...,...,...,...,...,...
7491,13313,Super built-up Area,Ready To Move,Uttarahalli,3 BHK,Aklia R,1345,2.0,1.0,57.00
7492,13314,Super built-up Area,Ready To Move,Green Glen Layout,3 BHK,SoosePr,1715,3.0,3.0,112.00
7493,13315,Built-up Area,Ready To Move,Whitefield,5 Bedroom,ArsiaEx,3453,4.0,0.0,231.00
7494,13317,Built-up Area,Ready To Move,Raja Rajeshwari Nagar,2 BHK,Mahla T,1141,2.0,1.0,60.00


In [6]:
df.head()

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0
5,Super built-up Area,Ready To Move,Whitefield,2 BHK,DuenaTa,1170,2.0,1.0,38.0
11,Plot Area,Ready To Move,Whitefield,4 Bedroom,Prrry M,2785,5.0,3.0,295.0


In [7]:
df.drop('availability',inplace=True,axis=1)

## encode object values 



In [8]:
### encode area type
df['area_type'].unique()

array(['Super built-up  Area', 'Plot  Area', 'Built-up  Area',
       'Carpet  Area'], dtype=object)

In [62]:
df.groupby('area_type')['price'].mean

<bound method GroupBy.mean of <pandas.core.groupby.generic.SeriesGroupBy object at 0x7fe19d3ad120>>

In [None]:
for i 

In [9]:
ordinal_mapping={'Super built-up  Area':1, 'Plot  Area':4, 'Built-up  Area':2,'Carpet  Area':4}
df['encoded_area_type']=df['area_type'].map(ordinal_mapping)

In [10]:
### encode location

mean_price=df.groupby('location')['price'].mean()
j=0
for i in df['location']:
    df.loc[j,'encoded_location']=mean_price[i]
    j=j+1
   
    

In [11]:
df.head()

Unnamed: 0,area_type,location,size,society,total_sqft,bath,balcony,price,encoded_area_type,encoded_location
0,Super built-up Area,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07,4.0,44.313613
1,Plot Area,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0,1.0,122.189167
3,Super built-up Area,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0,4.0,123.256927
5,Super built-up Area,Whitefield,2 BHK,DuenaTa,1170,2.0,1.0,38.0,4.0,95.67102
11,Plot Area,Whitefield,4 Bedroom,Prrry M,2785,5.0,3.0,295.0,1.0,146.637368


In [12]:
## encode size

df.dropna(inplace=True)
df=df[~df['size'].isnull()]
df['size'].unique()

array(['2 BHK', '4 Bedroom', '3 BHK', '3 Bedroom', '1 RK', '4 BHK',
       '1 BHK', '5 BHK', '11 BHK', '5 Bedroom', '9 BHK', '2 Bedroom',
       '6 BHK', '7 BHK', '6 Bedroom'], dtype=object)

In [13]:
df['encoded_size']=df['size'].str.replace(' BHK','').str.replace(' Bedroom','').str.replace(' RK','')
df['encoded_size']

0       2
1       4
3       3
5       2
11      4
       ..
7488    2
7489    3
7490    3
7493    3
7494    2
Name: encoded_size, Length: 4176, dtype: object

In [14]:
df['encoded_size']=df['encoded_size'].astype(float)
df['encoded_size']

0       2.0
1       4.0
3       3.0
5       2.0
11      4.0
       ... 
7488    2.0
7489    3.0
7490    3.0
7493    3.0
7494    2.0
Name: encoded_size, Length: 4176, dtype: float64

In [15]:
### encode society
df=df[~df['society'].isnull()]
mean_price=df.groupby('society')['price'].mean()
j=0
for i in df['society']:
    df.loc[j,'encoded_society']=mean_price[i]
    j=j+1
   

In [16]:
df=df[~df['encoded_society'].isnull()]

In [17]:
### encode total_sqft

df=df[~df['total_sqft'].isnull()]
df=df[~df['total_sqft'].str.contains('-')]

In [18]:
df['total_sqft'].unique()

array(['1056', '2600', '1521', '1170', '2785', '1000', '2250', '1175',
       '1180', '1540', '2770', '1755', '2800', '510', '660', '1151',
       '1025', '1075', '1760', '1693', '700', '1724', '1254', '600',
       '1330.74', '970', '1459', '1270', '1670', '2010', '1185', '1600',
       '1200', '1500', '845', '5700', '1160', '1140', '1358', '1569',
       '1240', '2089', '2511', '1660', '1326', '1499', '708', '1060',
       '1296', '2894', '2502', '650', '2400', '1007', '1640', '1260',
       '1413', '1116', '1530', '2497', '1427', '2061', '1282', '1870',
       '880', '1535', '950', '1360', '3050', '1563.05', '1167', '890',
       '1612', '1710', '957', '1125', '1020', '1735', '2050', '1063',
       '1904', '2000', '1425', '1470', '450', '1152', '1350', '1550',
       '705', '770', '1242', '1700', '2144', '1704', '1070', '1327',
       '1400', '1225', '1909', '1359', '1595', '1475', '1580', '1295',
       '589', '1787', '984', '2405', '1080', '1153', '1148', '1110',
       '1100', '1

In [19]:
df['total_sqft']=df['total_sqft'].str.replace('Sq. Meter','').str.replace('Sq. Yards','')

  df['total_sqft']=df['total_sqft'].str.replace('Sq. Meter','').str.replace('Sq. Yards','')


In [20]:
df['total_sqft']=df['total_sqft'].astype(float)

In [21]:
df.head()

Unnamed: 0,area_type,location,size,society,total_sqft,bath,balcony,price,encoded_area_type,encoded_location,encoded_size,encoded_society
0,Super built-up Area,Electronic City Phase II,2 BHK,Coomee,1056.0,2.0,1.0,39.07,4.0,44.313613,2.0,53.5125
1,Plot Area,Chikka Tirupathi,4 Bedroom,Theanmp,2600.0,5.0,3.0,120.0,1.0,122.189167,4.0,126.0
3,Super built-up Area,Lingadheeranahalli,3 BHK,Soiewre,1521.0,3.0,1.0,95.0,4.0,123.256927,3.0,38.0
5,Super built-up Area,Whitefield,2 BHK,DuenaTa,1170.0,2.0,1.0,38.0,4.0,95.67102,2.0,38.0
11,Plot Area,Whitefield,4 Bedroom,Prrry M,2785.0,5.0,3.0,295.0,1.0,146.637368,4.0,127.786667


In [22]:
df.columns

Index(['area_type', 'location', 'size', 'society', 'total_sqft', 'bath',
       'balcony', 'price', 'encoded_area_type', 'encoded_location',
       'encoded_size', 'encoded_society'],
      dtype='object')

### train dataset

In [23]:
## independent anddependent data

X=df[['encoded_area_type', 'encoded_location','encoded_size', 'encoded_society','total_sqft', 'bath','balcony']]
y=df['price']


In [47]:
### train test split

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=42)

In [48]:
#### standard scalling

from sklearn.preprocessing import StandardScaler

scaler=StandardScaler()
X_train=scaler.fit_transform(X_train)
X_test=scaler.transform(X_test)

In [49]:
# Q4. You have built an SVM regression model using a polynomial kernel and are trying to select the best
# metric to evaluate its performance. You have calculated both MSE and RMSE and found that both values
# are very close. Which metric should you choose to use in this case?

In [50]:
# train model

from sklearn.svm import SVR

svr=SVR(kernel='poly')
svr.fit(X_train,y_train)

In [51]:
# predict test value
y_pred=svr.predict(X_test)

In [52]:
## model accuracy

from sklearn.metrics import mean_squared_error
mse=mean_squared_error(y_test,y_pred)
print('mse :',mse)
print('rmse :',mse**(1/2))

mse : 5185.772798313161
rmse : 72.01231004705488


**Q5. You are comparing the performance of different SVM regression models using different kernels (linear, polynomial, and RBF) and are trying to select the best evaluation metric. Which metric would be most
appropriate if your goal is to measure how well the model explains the variance in the target variable?**

If your goal is to measure how well the model explains the variance in the target variable, the most appropriate evaluation metric would be the Coefficient of Determination (R-squared or R^2).

In [53]:
param_grid={'kernel':['linear', 'polynomial','RBF']}

In [54]:
from sklearn.model_selection import GridSearchCV

grid=GridSearchCV(SVR(),param_grid=param_grid,refit=True,cv=5,verbose=3)
grid.fit(X_train,y_train)

Fitting 5 folds for each of 3 candidates, totalling 15 fits
[CV 1/5] END .....................kernel=linear;, score=0.731 total time=   0.1s
[CV 2/5] END .....................kernel=linear;, score=0.386 total time=   0.1s
[CV 3/5] END ....................kernel=linear;, score=-1.973 total time=   0.1s
[CV 4/5] END .....................kernel=linear;, score=0.603 total time=   0.1s
[CV 5/5] END .....................kernel=linear;, score=0.747 total time=   0.1s
[CV 1/5] END ...................kernel=polynomial;, score=nan total time=   0.0s
[CV 2/5] END ...................kernel=polynomial;, score=nan total time=   0.0s
[CV 3/5] END ...................kernel=polynomial;, score=nan total time=   0.0s
[CV 4/5] END ...................kernel=polynomial;, score=nan total time=   0.0s
[CV 5/5] END ...................kernel=polynomial;, score=nan total time=   0.0s
[CV 1/5] END ..........................kernel=RBF;, score=nan total time=   0.0s
[CV 2/5] END ..........................kernel=RBF

10 fits failed out of a total of 15.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/conda/lib/python3.10/site-packages/sklearn/svm/_base.py", line 180, in fit
    self._validate_params()
  File "/opt/conda/lib/python3.10/site-packages/sklearn/base.py", line 570, in _validate_params
    validate_parameter_constraints(
  File "/opt/conda/lib/python3.10/site-packages/sklearn/utils/_param_validation.py", line 97, in validate_parameter_constraints
    raise InvalidParameterError(
sklearn.uti

In [59]:
y_pred=grid.predict(X_test)
y_pred

array([ 70.06787382, 109.77081124,  89.8348695 ,  70.89156369,
       101.14724068, 109.78637009, 234.77443152,  56.43628031,
        61.01985617,  54.06874398,  82.34787632,   9.64396003,
       131.95250065,  56.346316  ,  60.84140163,  62.32990042,
        68.87054217,  16.56031368, 103.46995274,  57.65192383,
        58.60480725,  77.99939194,  88.95723953, 103.14797442,
       175.09454156,  42.89502677,  99.89602835,  77.21711444,
        11.54654321,  14.26657867, 103.55182782,  80.00673144,
        39.54177569, 161.09774865,  67.13084593,  63.05381427,
        54.99650342,  94.7228791 ,  86.37822704,  60.51957588,
        68.45670043,  50.35239159,  62.17666561,  94.27164009,
       205.95316082,  63.52940248,  75.93461958, 166.28310126,
        16.88787594, 113.38572588,  89.691062  , 106.51517938,
        56.34002518,  58.69456081, 119.34541761,  64.83668761,
        38.34623399,  49.66222513,  66.49065581,  45.67557377,
        -1.19824966,  95.57424249, 102.02816355,  73.35

In [56]:
grid.best_params_

{'kernel': 'linear'}

In [60]:
from sklearn.metrics import r2_score,accuracy_score

print(r2_score(y_test,y_pred))


0.5937501729912882
