### Loading Bengaluru House Dataset :-

In [24]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import io

# Load the data
Bengaluru_House_Data = pd.read_csv("Bengaluru_House_Data.csv")
Bengaluru_House_Data.head()

| | area_type | availability | location | size | society | total_sqft | bath | balcony | price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Super built-up Area | 19-Dec | Electronic City Phase II | 2 BHK | Coomee | 1056 | 2.0 | 1.0 | 39.07 |
| 1 | Plot Area | Ready To Move | Chikka Tirupathi | 4 Bedroom | Theanmp | 2600 | 5.0 | 3.0 | 120.0 |
| 2 | Built-up Area | Ready To Move | Uttarahalli | 3 BHK | | 1440 | 2.0 | 3.0 | 62.0 |
| 3 | Super built-up Area | Ready To Move | Lingadheeranahalli | 3 BHK | Soiewre | 1521 | 3.0 | 1.0 | 95.0 |
| 4 | Super built-up Area | Ready To Move | Kothanur | 2 BHK | | 1200 | 2.0 | 1.0 | 51.0 |


### Examine the Dataset :-

In [25]:
## Check Shape
Bengaluru_House_Data_shape = Bengaluru_House_Data.shape
print(f"Shape (Rows, Column) :- {Bengaluru_House_Data_shape}")


## Check Missing Value
# Columns which have null values
Bengaluru_House_Data_missing_value = Bengaluru_House_Data.isnull().sum()
print("\nMissing Values (Column wise) :-")
display(Bengaluru_House_Data_missing_value)


## Check Duplicates
Bengaluru_House_Data_duplicates = Bengaluru_House_Data.duplicated().sum()
print(f"\nNumber of duplicates :- {Bengaluru_House_Data_duplicates}")


## Check Summary
# Capture the output of df.info() as a string
Bengaluru_House_Data_summary_buffer = io.StringIO()
Bengaluru_House_Data.info(buf=Bengaluru_House_Data_summary_buffer)
Bengaluru_House_Data_summary_str = Bengaluru_House_Data_summary_buffer.getvalue()
# Close the buffer
Bengaluru_House_Data_summary_buffer.close()
# Bengaluru_House_Data_summary_str now holds the info() output as a string
print("\nSummary :-")
print(Bengaluru_House_Data_summary_str)


## Check Descriptive Summary
Bengaluru_House_Data_descriptive_summary =Bengaluru_House_Data.describe().T
print("\nDescriptive Statistics:-")
display(Bengaluru_House_Data_descriptive_summary)


## Check Mis-Spaced
Bengaluru_House_Data_mis_spaced_columns = [col for col in Bengaluru_House_Data.columns if ' ' in col]

if Bengaluru_House_Data_mis_spaced_columns:
    print("\nMis-spaced column names :-")
    for col in Bengaluru_House_Data_mis_spaced_columns:
        print(f"'{col}'")
else:
    print("\nNo mis-spaced column names found.")


## Check No. of Unique Value
print("\nTotal No. of Unique Value for each Columns :-")
for col_name in Bengaluru_House_Data.columns:
    Bengaluru_House_Data_No_unique_value = len(Bengaluru_House_Data[col_name].unique())
    print(f"{col_name} = {Bengaluru_House_Data_No_unique_value}")

Shape (Rows, Column) :- (13320, 9)

Missing Values (Column wise) :-


area_type          0
availability       0
location           1
size              16
society         5502
total_sqft         0
bath              73
balcony          609
price              0
dtype: int64


Number of duplicates :- 529

Summary :-
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13320 entries, 0 to 13319
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   area_type     13320 non-null  object 
 1   availability  13320 non-null  object 
 2   location      13319 non-null  object 
 3   size          13304 non-null  object 
 4   society       7818 non-null   object 
 5   total_sqft    13320 non-null  object 
 6   bath          13247 non-null  float64
 7   balcony       12711 non-null  float64
 8   price         13320 non-null  float64
dtypes: float64(3), object(6)
memory usage: 936.7+ KB


Descriptive Statistics:-


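| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| bath | 13247.0 | 2.69261 | 1.341458 | 1.0 | 2.0 | 2.0 | 3.0 | 40.0 |
| balcony | 12711.0 | 1.584376 | 0.817263 | 0.0 | 1.0 | 2.0 | 2.0 | 3.0 |
| price | 13320.0 | 112.565627 | 148.971674 | 8.0 | 50.0 | 72.0 | 120.0 | 3600.0 |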



No mis-spaced column names found.

Total No. of Unique Value for each Columns :-
area_type = 4
availability = 81
location = 1306
size = 32
society = 2689
total_sqft = 2117
bath = 20
balcony = 5
price = 1994


### Handling Missing Values, Label Encoding, Feature Engineering and Modifying Data Types

-   **`Handling Duplicates`**

In [26]:
Bengaluru_House_Data.drop_duplicates(keep='first', inplace=True)

1.  `Area Type`

In [27]:
# Define the label encoding mapping
area_type_mapping = {'Super built-up  Area': 3,
                     'Plot  Area': 2,
                     'Built-up  Area': 1,
                     'Carpet  Area': 0}

# Perform label encoding with the explicit mapping defined above
Bengaluru_House_Data['area_type_encoded'] = Bengaluru_House_Data['area_type'].map(area_type_mapping)

2.  `Availability`

In [28]:
def categorize_availability(avail):
    if avail == 'Ready To Move' or avail == 'Immediate Possession':
        return avail
    else:
        return 'Date-Based'

# Apply the categorize_availability function (this overwrites the 'availability' column)
Bengaluru_House_Data['availability'] = Bengaluru_House_Data['availability'].apply(categorize_availability)

# Define the label encoding mapping
availability_mapping = {'Ready To Move': 1,
                        'Immediate Possession': 1,  # treated the same as 'Ready To Move'
                        'Date-Based': 0
                        }

# Perform label encoding
Bengaluru_House_Data['availability_category'] = Bengaluru_House_Data['availability'].map(availability_mapping).astype(int)

3.  `Location`

In [29]:
def target_encode(df, target, categorical):
    target_encoded = {}
    global_mean = df[target].mean()
    for category in categorical:
        category_mean = df.groupby(category)[target].mean()
        category_count = df.groupby(category)[target].count()
        category_encoded = (category_mean * category_count + global_mean) / (category_count + 1)
        target_encoded[category] = category_encoded
    return target_encoded

target_encoded_location = target_encode(Bengaluru_House_Data, 'price', ['location'])
Bengaluru_House_Data['location_encoded'] = Bengaluru_House_Data['location'].map(target_encoded_location['location'])
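
To make the smoothing in `target_encode` concrete, here is a minimal sketch on a toy frame. The locations `A` and `B` and their prices are made-up values, not rows from the Bengaluru dataset. Each location's mean is pulled toward the global mean by one pseudo-observation, so locations with few listings move the most.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the Bengaluru dataset)
toy = pd.DataFrame({
    "location": ["A", "A", "A", "B"],
    "price": [100.0, 120.0, 140.0, 40.0],
})

global_mean = toy["price"].mean()  # (100 + 120 + 140 + 40) / 4 = 100.0
stats = toy.groupby("location")["price"].agg(["mean", "count"])

# Same formula as target_encode: add one pseudo-observation at the global mean
encoded = (stats["mean"] * stats["count"] + global_mean) / (stats["count"] + 1)
print(encoded)
# A: (120 * 3 + 100) / 4 = 115.0   (three listings, small shift)
# B: (40 * 1 + 100) / 2  = 70.0    (one listing, large shift)
```

One caveat: the notebook computes this encoding on the full dataset before the train-test split, so test-set prices influence the `location_encoded` feature. Computing the encoding from the training rows only and then mapping it onto the test rows would likely give a more honest evaluation.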

4.  `Size Or No. of Bedroom`

In [30]:
## No of Bedroom of the Bengaluru House
Bengaluru_House_Data['size'] = Bengaluru_House_Data['size'].str.replace('BHK', '').str.replace('Bedroom', '').str.replace('RK','')
Bengaluru_House_Data.dropna(subset=['size'], inplace=True)
Bengaluru_House_Data['size'] = Bengaluru_House_Data['size'].astype(int)
Bengaluru_House_Data.rename(columns={'size': 'No. of Bedroom'}, inplace=True)

5.  `society`

In [31]:
Bengaluru_House_Data = Bengaluru_House_Data.drop(columns=['society'])

6.  `total_sqft`

In [32]:
Bengaluru_House_Data['total_sqft'] = Bengaluru_House_Data['total_sqft'].str.replace('Sq. Meter', '')

def parse_total_sqft(total_sqft):
    try:
        parts = total_sqft.split(" -")
        if len(parts) == 1:
            return float(parts[0])
        else:
            return (float(parts[0]) + float(parts[1])) / 2
    except (ValueError, AttributeError):  # unparseable text or a missing value
        return None

# Apply the parsing function to the 'total_sqft' column
Bengaluru_House_Data['total_sqft'] = Bengaluru_House_Data['total_sqft'].apply(parse_total_sqft)
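
As a quick check of what `parse_total_sqft` does with different inputs, the sketch below re-declares the function so it runs on its own and calls it on a few sample strings. The samples are illustrative assumptions about the kinds of values the column may hold. Anything that cannot be parsed becomes `None`, which would explain the missing `total_sqft` values that appear after this step.

```python
def parse_total_sqft(total_sqft):
    """Return the area as a float; average the two ends of a 'low - high' range."""
    try:
        parts = total_sqft.split(" -")
        if len(parts) == 1:
            return float(parts[0])
        return (float(parts[0]) + float(parts[1])) / 2
    except (ValueError, AttributeError):
        # ValueError: text such as a unit suffix; AttributeError: non-string (e.g. None)
        return None

# Illustrative inputs: plain number, range, value with a unit suffix, missing value
samples = ["1056", "1133 - 1384", "4125Perch", None]
results = [parse_total_sqft(s) for s in samples]
print(results)  # [1056.0, 1258.5, None, None]
```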

7.  `bath`

In [33]:
## No of Bathroom of the Bengaluru House 
Bengaluru_House_Data['bath'] = Bengaluru_House_Data['bath'].fillna(0)
Bengaluru_House_Data['bath'] = Bengaluru_House_Data['bath'].astype(int)
Bengaluru_House_Data.rename(columns={'bath': 'No. of Bathroom'}, inplace=True)

8.  `balcony`

In [34]:
## No of Balcony of the Bengaluru House
Bengaluru_House_Data['balcony'] = Bengaluru_House_Data['balcony'].fillna(0)
Bengaluru_House_Data['balcony'] = Bengaluru_House_Data['balcony'].astype(int)
Bengaluru_House_Data.rename(columns={'balcony': 'No. of Balcony'}, inplace=True)

### Check Again Before going for Model Training

In [35]:
Bengaluru_House_Data_duplicates = Bengaluru_House_Data.duplicated().sum()
print(f"\nNumber of duplicates :- {Bengaluru_House_Data_duplicates}")

Bengaluru_House_Data_missing_value = Bengaluru_House_Data.isnull().sum()
print("\nMissing Values (Column wise) :-")
display(Bengaluru_House_Data_missing_value)


Number of duplicates :- 129

Missing Values (Column wise) :-


area_type                 0
availability              0
location                  1
No. of Bedroom            0
total_sqft               29
No. of Bathroom           0
No. of Balcony            0
price                     0
area_type_encoded         0
availability_category     0
location_encoded          1
dtype: int64

### Fixing These Basic Errors

In [36]:
Bengaluru_House_Data.drop_duplicates(keep='first', inplace=True)

In [37]:
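# Assign the filled columns back rather than calling fillna(inplace=True) on a column selection
Bengaluru_House_Data['location'] = Bengaluru_House_Data['location'].fillna(Bengaluru_House_Data['location'].mode()[0])
Bengaluru_House_Data['total_sqft'] = Bengaluru_House_Data['total_sqft'].fillna(Bengaluru_House_Data['total_sqft'].mean())
Bengaluru_House_Data['location_encoded'] = Bengaluru_House_Data['location_encoded'].fillna(Bengaluru_House_Data['location_encoded'].mode()[0])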

### Now We Are Good to Go for Further Processing

In [38]:
# Assuming X contains the features and y contains the target variable (price)
X = Bengaluru_House_Data[['area_type_encoded','availability_category','location_encoded','No. of Bedroom','total_sqft','No. of Bathroom','No. of Balcony']]
y = Bengaluru_House_Data['price']

In [39]:
X.shape,y.shape

((12646, 7), (12646,))

In [40]:
# Step 2: Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [41]:
# Step 3: Model Training
svm_model = SVR()

In [42]:
svm_model.fit(X_train, y_train)

In [43]:
# Step 4: Model Evaluation
y_pred = svm_model.predict(X_test)

###################################################################################################################################################################

Q.No-01    In order to predict house price based on several characteristics, such as location, square footage, number of bedrooms, etc., you are developing an SVM regression model. Which regression metric in this situation would be the best to employ?

Ans :-

**In the context of predicting house prices using an SVM regression model, several regression metrics are commonly used to evaluate the model's performance.**

**`Let's Calculate the metrics` :**

In [44]:
# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (MSE):", mse)

# Calculate Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)  # the squared=False argument is deprecated in recent scikit-learn releases
print("Root Mean Squared Error (RMSE):", rmse)

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error (MAE):", mae)

# Calculate R-squared (R2)
r2 = r2_score(y_test, y_pred)
print("R-squared (R2):", r2)

Mean Squared Error (MSE): 20646.778013454208
Root Mean Squared Error (RMSE): 143.68986746968002
Mean Absolute Error (MAE): 44.60814817174127
R-squared (R2): 0.2735174484813252


In [45]:
def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def coefficient_of_determination(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - (ss_res / ss_tot)

# Calculate Mean Absolute Percentage Error (MAPE)
mape = mean_absolute_percentage_error(y_test, y_pred)
print("Mean Absolute Percentage Error (MAPE):", mape)

# Calculate Coefficient of Determination (CD)
cd = coefficient_of_determination(y_test, y_pred)
print("Coefficient of Determination (CD):", cd)


Mean Absolute Percentage Error (MAPE): 30.553202201231084
Coefficient of Determination (CD): 0.2735174484813252


**According to these metrics, the choice of the best regression metric depends on the specific requirements and preferences of the analysis.**

**`However`, considering common practices and the characteristics of the calculated metrics, `here's some insight` :**

1. **Mean Squared Error (MSE)** and **Root Mean Squared Error (RMSE)**: These metrics penalize larger errors more heavily than smaller ones, which makes them sensitive to outliers. MSE is expressed in squared price units and is hard to interpret directly; RMSE is its square root, so it is back in the same units as the price. Both are commonly used for assessing model accuracy.

2. **Mean Absolute Error (MAE)**: MAE represents the average absolute difference between predicted and actual values. It is more robust to outliers than MSE and RMSE because it doesn't square the errors. MAE is expressed in the same units as the target variable (here, the units of the `price` column), so it is straightforward to interpret.

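3. **R-squared (R2)**, also called the **Coefficient of Determination (CD)**: Both names refer to the same quantity, which is why the two values computed above are identical. R-squared is the proportion of the variance in the dependent variable that is predictable from the independent variables. A value of 1 indicates a perfect fit, 0 corresponds to a model that always predicts the mean, and the value can be negative on held-out data when the model does worse than that. It doesn't directly provide insight into the magnitude of prediction errors.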

4. **Mean Absolute Percentage Error (MAPE)**: MAPE measures the average percentage difference between predicted and actual values. It is easy to interpret, especially for stakeholders who want to understand prediction accuracy in percentage terms. It is undefined when an actual value is zero and weights errors on low-priced houses more heavily.

**`Considering these insights and calculated metrics`**, **Mean Absolute Error (MAE) may be the most suitable regression metric in this situation**. It gives the average absolute difference between predicted and actual house prices in price units, which makes the model's performance easy to understand and communicate. Here the RMSE (about 143.7) is more than three times the MAE (about 44.6), which suggests that a small number of large errors dominates the squared-error metrics.

----------------------------------------------------------------------------------------------------------------------------------------------------------

Q.No-02    You have built an SVM regression model and are trying to decide between using MSE or R-squared as your evaluation metric. Which metric would be more appropriate if your goal is to predict the actual price of a house as accurately as possible?

Ans :-

**If the goal is to predict the actual price of a house as accurately as possible, it's essential to choose an evaluation metric that directly reflects the accuracy of the predictions.**

**`In this scenario, considering insights and the calculated metrics value` :**

1. **Mean Squared Error (MSE)** measures the average squared difference between predicted and actual values, so it directly reflects how far predictions are from the actual prices. Its drawback is that it is expressed in squared units; taking the square root (RMSE) restores the original price units.

2. **R-squared (R2)** represents the proportion of the variance in the dependent variable (house prices) that is predictable from the independent variables. A higher R-squared indicates that a larger proportion of the variability in house prices is explained by the model. However, R-squared does not provide information about the magnitude of prediction errors.

`Given that the goal is to predict house prices as accurately as possible,` **Mean Squared Error (MSE) would be the more appropriate evaluation metric. MSE directly quantifies the average squared difference between predicted and actual house prices, so a lower MSE means predictions are closer to the actual prices, which aligns with the goal. R-squared, by contrast, is a relative measure: it depends on how much the prices vary in the data being evaluated, not only on the size of the errors.**
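
The difference between an absolute error measure and a relative one can be seen in a small sketch. The two price arrays below are invented for illustration, and the helper names `mse` and `r2` are local to this example. The prediction errors are identical in both cases, so MSE is identical, yet R-squared changes sharply because the spread of the actual prices differs.

```python
import numpy as np

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1 - ss_res / ss_tot)

errors = np.array([10.0, -10.0, 10.0, -10.0])    # identical errors in both cases

narrow = np.array([90.0, 100.0, 110.0, 100.0])   # prices clustered together
wide = np.array([50.0, 100.0, 150.0, 100.0])     # prices spread out

mse_narrow, r2_narrow = mse(narrow, narrow + errors), r2(narrow, narrow + errors)
mse_wide, r2_wide = mse(wide, wide + errors), r2(wide, wide + errors)

print(mse_narrow, r2_narrow)  # MSE 100.0, R2 -1.0
print(mse_wide, r2_wide)      # MSE 100.0, R2 0.92
```

When the question is "how far off are the predicted prices?", the two cases are equally good, and only MSE reports them that way.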

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Q.No-03    You have a dataset with a significant number of outliers and are trying to select an appropriate regression metric to use with your SVM model. Which metric would be the most appropriate in this scenario?

Ans :-

`When dealing with a dataset that contains a significant number of outliers`, **it's important to choose a regression metric that is robust to outliers. One such metric is the Mean Absolute Error (MAE).**

**Mean Absolute Error (MAE)** is calculated by taking the average of the absolute differences between the predicted and actual values. It is less sensitive to outliers compared to other metrics like Mean Squared Error (MSE) or Root Mean Squared Error (RMSE) because it does not square the errors.

Using MAE as the regression metric for your Support Vector Machine (SVM) model would be appropriate in this scenario because it provides a more balanced evaluation of the model's performance, even when dealing with outliers. With MAE, the reported score is not dominated by a few extreme values in the dataset, allowing for a more reliable assessment of the model's predictive capabilities.
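
A minimal sketch of this effect, using invented prices rather than the notebook's data: five predictions are each off by 5, then one of them is changed to be off by 105, as might happen for a single extreme listing. MAE grows in proportion to the added error, while RMSE grows much faster because the error is squared.

```python
import numpy as np

y_true = np.array([50.0, 60.0, 70.0, 80.0, 90.0])

# Every prediction is off by exactly 5
y_pred_clean = y_true + np.array([5.0, -5.0, 5.0, -5.0, 5.0])

# Same predictions, except the last one is off by 105 (one extreme miss)
y_pred_outlier = y_pred_clean.copy()
y_pred_outlier[-1] = y_true[-1] + 105.0

def mae(y, p):
    return float(np.mean(np.abs(y - p)))

def rmse(y, p):
    return float(np.sqrt(np.mean((y - p) ** 2)))

mae_clean, rmse_clean = mae(y_true, y_pred_clean), rmse(y_true, y_pred_clean)
mae_out, rmse_out = mae(y_true, y_pred_outlier), rmse(y_true, y_pred_outlier)

print(mae_clean, rmse_clean)  # 5.0 and 5.0
print(mae_out, rmse_out)      # 25.0 and about 47.17
```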

--------------------------------------------------------------------------------------------------------------------------------------------------------------

Q.No-04    You have built an SVM regression model using a polynomial kernel and are trying to select the best metric to evaluate its performance. You have calculated both MSE and RMSE and found that both values are very close. Which metric should you choose to use in this case?

Ans :-

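**Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) are not independent: RMSE is the square root of MSE, so the two values can only be very close when the MSE itself is close to 1 (or to 0). Because they carry the same information and rank models identically, the choice comes down to interpretability. `In such cases, RMSE might be preferred over MSE`.**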

`The reason` is that RMSE is in the same unit as the target variable, making it more interpretable compared to MSE, which is in squared units. This can be advantageous for communication purposes, especially if stakeholders are more familiar with the original scale of the target variable.

`Therefore`, in this case where MSE and RMSE are very close, it would be reasonable to choose RMSE as the metric to evaluate the performance of the SVM regression model with a polynomial kernel.
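
The relationship between the two metrics can be checked directly. The sketch below takes the MSE value printed earlier in this notebook and recovers the RMSE from it, then shows a few arbitrary values to illustrate that a number and its square root only coincide at 1 (and at 0).

```python
import math

mse = 20646.778013454208   # MSE printed by the notebook's evaluation cell
rmse = math.sqrt(mse)
print(rmse)                # matches the printed RMSE of about 143.69

# A value and its square root are equal only at 0 and 1; elsewhere they diverge
for value in (0.25, 1.0, 4.0, 10_000.0):
    print(f"MSE = {value:>8} -> RMSE = {math.sqrt(value)}")
```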

--------------------------------------------------------------------------------------------------------------------------------------------------------------------

Q.No-05    You are comparing the performance of different SVM regression models using different kernels (linear, polynomial, and RBF) and are trying to select the best evaluation metric. Which metric would be most appropriate if your goal is to measure how well the model explains the variance in the target variable?

Ans :-

**When the goal is to measure how well a regression model explains the variance in the target variable, `the most appropriate evaluation metric is the coefficient of determination`, commonly known as** $ R^2 $ **(R-squared).**

The $ R^2 $ score quantifies the proportion of the variance in the dependent variable that is predictable from the independent variables. A value of 1 indicates perfect prediction and 0 indicates that the model explains none of the variability in the target variable (it does no better than always predicting the mean). On held-out data the score can also be negative, meaning the model does worse than that baseline.

`For comparing different SVM regression models with different kernels (linear, polynomial, and RBF)`, **using** $ R^2 $ **as the evaluation metric would be suitable**. This metric provides insight into how well each model captures the variance in the target variable, allowing for an informed comparison of their performance in terms of explaining the data variability.
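
A comparison along these lines could be sketched as follows. This is an illustration on synthetic data from `make_regression`, not on the Bengaluru dataset, and the `StandardScaler` step is an added assumption: the notebook fits `SVR` on unscaled features, whereas SVMs are generally sensitive to feature scale.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (not the Bengaluru dataset)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scores = {}
for kernel in ("linear", "poly", "rbf"):
    # Scaling is an assumption added for this sketch
    model = make_pipeline(StandardScaler(), SVR(kernel=kernel))
    model.fit(X_train, y_train)
    scores[kernel] = r2_score(y_test, model.predict(X_test))

# Higher R2 means the kernel explains more of the variance in the held-out targets
for kernel, score in scores.items():
    print(f"{kernel}: R2 = {score:.3f}")
```

Because every model is scored on the same test targets, the total variance is the same for each kernel, so the R2 values are directly comparable.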