# **House Price Predictor**

Pre-processing steps:

### **1) Importing libraries**

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### **2) Importing dataset**

In [9]:
house_data = pd.read_csv('Bengaluru_House_Data.csv')
house_data.head()

Unnamed: 0,area_type,availability,location,size,society,total_sqft,bath,balcony,price
0,Super built-up Area,19-Dec,Electronic City Phase II,2 BHK,Coomee,1056,2.0,1.0,39.07
1,Plot Area,Ready To Move,Chikka Tirupathi,4 Bedroom,Theanmp,2600,5.0,3.0,120.0
2,Built-up Area,Ready To Move,Uttarahalli,3 BHK,,1440,2.0,3.0,62.0
3,Super built-up Area,Ready To Move,Lingadheeranahalli,3 BHK,Soiewre,1521,3.0,1.0,95.0
4,Super built-up Area,Ready To Move,Kothanur,2 BHK,,1200,2.0,1.0,51.0


### **3) Analyzing the data.**

In [10]:
house_data.shape

(13320, 9)

The info() function summarizes each column in the dataset, for example its datatype and non-null count.

In [11]:
house_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13320 entries, 0 to 13319
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   area_type     13320 non-null  object 
 1   availability  13320 non-null  object 
 2   location      13319 non-null  object 
 3   size          13304 non-null  object 
 4   society       7818 non-null   object 
 5   total_sqft    13320 non-null  object 
 6   bath          13247 non-null  float64
 7   balcony       12711 non-null  float64
 8   price         13320 non-null  float64
dtypes: float64(3), object(6)
memory usage: 936.7+ KB


Here, we print the value counts of every column, i.e. how often each distinct value occurs.

In [12]:
for column in house_data.columns:
  print(house_data[column].value_counts())
  print("*"*20)

Super built-up  Area    8790
Built-up  Area          2418
Plot  Area              2025
Carpet  Area              87
Name: area_type, dtype: int64
********************
Ready To Move    10581
18-Dec             307
18-May             295
18-Apr             271
18-Aug             200
                 ...  
15-Aug               1
17-Jan               1
16-Nov               1
16-Jan               1
14-Jul               1
Name: availability, Length: 81, dtype: int64
********************
Whitefield                        540
Sarjapur  Road                    399
Electronic City                   302
Kanakpura Road                    273
Thanisandra                       234
                                 ... 
Bapuji Layout                       1
1st Stage Radha Krishna Layout      1
BEML Layout 5th stage               1
singapura paradise                  1
Abshot Layout                       1
Name: location, Length: 1305, dtype: int64
********************
2 BHK         5199
3 BHK        

### **4) Finding Missing data**

**Null Values**

The isna() function combined with the sum() function gives the number of null values in each column.

In [13]:
house_data.isna().sum()

area_type          0
availability       0
location           1
size              16
society         5502
total_sqft         0
bath              73
balcony          609
price              0
dtype: int64
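Raw counts are easier to judge as a share of all rows. The sketch below shows one way to get that share; the toy frame and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame standing in for house_data; the values are invented.
toy = pd.DataFrame({
    "location": ["Whitefield", None, "Hoodi", "Hoodi"],
    "society": [None, None, "Coomee", None],
    "price": [39.07, 120.0, 62.0, 95.0],
})

# isna() yields a boolean frame; the mean of booleans is the fraction of True values.
missing_share = toy.isna().mean()
print(missing_share)
```

A column with a large missing share, such as society here, is a candidate for dropping rather than imputing.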

### **5) Handling the Missing data**

**Analyzing null values**

As we can see, five columns contain null values: location, size, society, bath, and balcony. We need to analyze those columns and decide whether to drop each column or impute its values.

In the code below we drop four columns: availability, area_type, society, and balcony.


**Reason for dropping 'area_type' column:** The "area_type" column may describe different types of areas such as built-up area, super built-up area, plot area, etc. However, for house price prediction, the specific type of area might not be as relevant as other factors like location, size, amenities, etc.

**Reason for dropping 'availability' column:** The "availability" column typically indicates whether the property is ready-to-move, under construction, or other statuses related to availability. While this information may be important for certain analyses, for house price prediction, the availability status might not significantly influence the price prediction.

**Reason for dropping 'society' column:** The "society" column represents the housing society or community where the property is located. While the reputation or amenities of the society might influence the property's value, using this information directly in the model could lead to overfitting or introduce bias. Additionally, the column has many unique society names and 5502 missing values out of 13320 rows, so including it might not provide sufficient predictive power.

**Reason for dropping 'balcony' column:** The "balcony" column indicates the number of balconies in the property. While having balconies may affect the property's value, the exact number of balconies might not be as crucial for price prediction as other features. Additionally, 609 values are missing in this column, and it is assumed to have only a weak correlation with the target variable (house price); that correlation is not computed in this notebook. Dropping the column reduces model complexity.

In [14]:
# 'inplace=True' ensures that the changes are made directly to the original DataFrame, house_data.
house_data.drop(columns=['availability','area_type','society','balcony'],inplace=True)
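Before dropping a numeric column such as balcony, its relationship with price could be checked directly. The following is a minimal sketch on a made-up toy frame (the values are not from the dataset), assuming a plain Pearson correlation is an acceptable first check.

```python
import pandas as pd

# Invented values; in the notebook this check would run on house_data before the drop.
toy = pd.DataFrame({
    "balcony": [1.0, 3.0, None, 2.0, 0.0],
    "bath": [2.0, 5.0, 2.0, 3.0, 1.0],
    "price": [39.07, 120.0, 62.0, 95.0, 25.0],
})

# corr() computes pairwise Pearson correlations and skips missing values pair by pair.
corr_with_price = toy.corr()["price"].drop("price")
print(corr_with_price)
```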

Here, the describe() function gives summary statistics for the numerical columns in the dataset, for example the mean, minimum value, and maximum value.

In [15]:
house_data.describe()

Unnamed: 0,bath,price
count,13247.0,13320.0
mean,2.69261,112.565627
std,1.341458,148.971674
min,1.0,8.0
25%,2.0,50.0
50%,2.0,72.0
75%,3.0,120.0
max,40.0,3600.0


In [16]:
house_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13320 entries, 0 to 13319
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   location    13319 non-null  object 
 1   size        13304 non-null  object 
 2   total_sqft  13320 non-null  object 
 3   bath        13247 non-null  float64
 4   price       13320 non-null  float64
dtypes: float64(2), object(3)
memory usage: 520.4+ KB


### **6) Handling Categorical Variables**

As we can see, even after dropping those columns we still have some columns that contain null values, so let's start by imputing values in those columns.

The location column has one null value. Let's look at the value counts of the location column to see which values occur most often.

In [17]:
house_data['location'].value_counts()

Whitefield                        540
Sarjapur  Road                    399
Electronic City                   302
Kanakpura Road                    273
Thanisandra                       234
                                 ... 
Bapuji Layout                       1
1st Stage Radha Krishna Layout      1
BEML Layout 5th stage               1
singapura paradise                  1
Abshot Layout                       1
Name: location, Length: 1305, dtype: int64

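As we can see in the output above, Whitefield and Sarjapur Road occur most often, so we can impute either of those values. Here we use 'Sarjapur Road'. Note that the dataset spells this location with two spaces ('Sarjapur  Road'), so the single-spaced string filled in below is treated as a separate location; this is why the location count later rises from 1305 to 1306.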

In [18]:
house_data['location'] = house_data['location'].fillna('Sarjapur Road')
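An alternative to hard-coding the fill value is to read the most frequent value from the column itself. This is a small sketch on an invented Series, not the notebook's own method.

```python
import pandas as pd

# Invented values; None marks the missing entry.
toy = pd.Series(["Whitefield", "Whitefield", "Sarjapur  Road", None], name="location")

# mode() ignores missing values and returns the most frequent value(s); take the first.
most_common = toy.mode().iloc[0]
filled = toy.fillna(most_common)
print(filled.tolist())
```

Reading the value from the data avoids spelling mismatches between the fill string and the existing categories.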

The size column has 16 null values. Let's look at the value counts of the size column to see which value occurs most often.

In [19]:
house_data['size'].value_counts()

2 BHK         5199
3 BHK         4310
4 Bedroom      826
4 BHK          591
3 Bedroom      547
1 BHK          538
2 Bedroom      329
5 Bedroom      297
6 Bedroom      191
1 Bedroom      105
8 Bedroom       84
7 Bedroom       83
5 BHK           59
9 Bedroom       46
6 BHK           30
7 BHK           17
1 RK            13
10 Bedroom      12
9 BHK            8
8 BHK            5
11 BHK           2
11 Bedroom       2
10 BHK           2
14 BHK           1
13 BHK           1
12 Bedroom       1
27 BHK           1
43 Bedroom       1
16 BHK           1
19 BHK           1
18 Bedroom       1
Name: size, dtype: int64

As we can see in the output above, 2 BHK occurs most often, so we impute this value in place of the null values.

In [20]:
house_data['size'] = house_data['size'].fillna('2 BHK')

Next, we have the bath column, which is numerical and has 73 missing values. So, we will replace the null values with its median.

In [21]:
house_data['bath'] = house_data['bath'].fillna(house_data['bath'].median())

Here, as seen in the size column, values such as '2 BHK' and '4 Bedroom' are strings. Using the split() function we split each value on whitespace, take the first token with get(0), convert it to int, and save it in a new column 'bhk'.

In [22]:
house_data['bhk'] = house_data['size'].str.split().str.get(0).astype(int)
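The split-based parsing above relies on every value starting with a number and on the nulls having been filled first. A more defensive variant, sketched here on invented values, pulls the leading digits with a regular expression and keeps missing entries as missing.

```python
import pandas as pd

# Invented values, including a missing entry.
sizes = pd.Series(["2 BHK", "4 Bedroom", "1 RK", None])

# Extract the leading run of digits; rows without a match become NaN.
leading_digits = sizes.str.extract(r"^(\d+)", expand=False)

# to_numeric turns the digit strings into numbers; Int64 is pandas' nullable integer dtype.
bhk = pd.to_numeric(leading_digits, errors="coerce").astype("Int64")
print(bhk.tolist())
```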

**Outliers**

Rows where bhk is greater than 20 can be considered outliers. There are 2 such rows, and we need to handle them.

In [23]:
house_data[house_data.bhk>20]

Unnamed: 0,location,size,total_sqft,bath,price,bhk
1718,2Electronic City Phase II,27 BHK,8000,27.0,230.0,27
4684,Munnekollal,43 Bedroom,2400,40.0,660.0,43


**Analyzing the total_sqft column for removing the range**

The code below returns an array of the unique values present in the 'total_sqft' column of the DataFrame.

In [24]:
house_data['total_sqft'].unique()

array(['1056', '2600', '1440', ..., '1133 - 1384', '774', '4689'],
      dtype=object)

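Here, we define a convertRange function that converts strings representing numerical ranges (for example '1133 - 1384') into a single number. As written, it returns the sum of the lower and upper bounds, which overstates the area of range entries; the midpoint of the two bounds is the more common choice. Values that cannot be parsed as a number are returned as None.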

In [25]:
# Define a function named 'convertRange' to convert a 'total_sqft' string into a single float
def convertRange(x):
    # Split the input string 'x' based on the '-' delimiter
    temp = x.split('-')

    # Check if the string 'x' represents a numerical range such as '1133 - 1384'
    if len(temp) == 2:
        # If it's a range, convert the lower and upper bounds to floats, sum them, and return the result
        return (float(temp[0]) + float(temp[1]))

    # If the string 'x' does not represent a range, attempt to convert it to a float
    try:
        # If successful, return the float value
        return float(x)
    # Handle values that float() cannot parse (e.g., if 'x' contains non-numeric characters)
    except ValueError:
        # If conversion fails, return None
        return None

The line below applies the 'convertRange' function to each value in the 'total_sqft' column of the DataFrame.

In [26]:
house_data['total_sqft'] = house_data['total_sqft'].apply(convertRange)
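Since convertRange adds the two bounds of a range, a value like '1133 - 1384' becomes 2517. If the midpoint is preferred, a variant could look like the sketch below. The name range_midpoint is hypothetical, and swapping it in would change the statistics shown in the rest of the notebook.

```python
def range_midpoint(x):
    """Return the midpoint of an 'a - b' range, the float value of a plain number, else None."""
    parts = str(x).split("-")
    try:
        if len(parts) == 2:
            # Average the lower and upper bounds instead of summing them.
            return (float(parts[0]) + float(parts[1])) / 2
        return float(x)
    except (TypeError, ValueError):
        # Anything that cannot be parsed as a number is treated as missing.
        return None


print(range_midpoint("1133 - 1384"))
print(range_midpoint("1056"))
print(range_midpoint("not a number"))
```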

In [27]:
house_data.head()

Unnamed: 0,location,size,total_sqft,bath,price,bhk
0,Electronic City Phase II,2 BHK,1056.0,2.0,39.07,2
1,Chikka Tirupathi,4 Bedroom,2600.0,5.0,120.0,4
2,Uttarahalli,3 BHK,1440.0,2.0,62.0,3
3,Lingadheeranahalli,3 BHK,1521.0,3.0,95.0,3
4,Kothanur,2 BHK,1200.0,2.0,51.0,2


**Price per square feet**

Here, the price column holds values in lakhs (1 lakh = 100,000 rupees). We add a new column, price per square foot, expressed in rupees. Dividing price by total_sqft alone would give a value in lakhs per square foot, so we first multiply price by 100000 to convert it to rupees.

In [28]:
house_data['price_per_sqft'] = house_data['price'] * 100000 / house_data['total_sqft']
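As a quick check of the unit conversion, the first row of the dataset (price 39.07 lakhs, 1056 sqft) can be worked through by hand. The constant name LAKH is only an illustrative choice.

```python
LAKH = 100_000  # one lakh in rupees

price_lakh = 39.07   # price of the first row, in lakhs
total_sqft = 1056.0  # area of the first row, in square feet

# Convert the price to rupees first, then divide by the area.
price_per_sqft = price_lakh * LAKH / total_sqft
print(round(price_per_sqft, 2))
```

The result, about 3699.81 rupees per square foot, agrees with the first value of the price_per_sqft column.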

In [29]:
house_data['price_per_sqft']

0         3699.810606
1         4615.384615
2         4305.555556
3         6245.890861
4         4250.000000
             ...     
13315     6689.834926
13316    11111.111111
13317     5258.545136
13318    10407.336319
13319     3090.909091
Name: price_per_sqft, Length: 13320, dtype: float64

In [30]:
house_data.describe()

Unnamed: 0,total_sqft,bath,price,bhk,price_per_sqft
count,13274.0,13320.0,13320.0,13320.0,13274.0
mean,1587.856887,2.688814,112.565627,2.802778,7868.363
std,1291.390407,1.338754,148.971674,1.294496,106430.9
min,1.0,1.0,8.0,1.0,267.8298
25%,1100.0,2.0,50.0,2.0,4216.355
50%,1282.0,2.0,72.0,3.0,5415.005
75%,1691.8,3.0,120.0,3.0,7288.013
max,52272.0,40.0,3600.0,43.0,12000000.0


There are 1306 distinct locations. One-hot encoding them would produce 1306 columns, which is too many to pass to the model. Therefore, we have to reduce the number of locations.

To do so, we replace every location that occurs 10 times or fewer with 'other'.

In [31]:
house_data['location'].value_counts()

Whitefield                        540
Sarjapur  Road                    399
Electronic City                   302
Kanakpura Road                    273
Thanisandra                       234
                                 ... 
1st Stage Radha Krishna Layout      1
BEML Layout 5th stage               1
singapura paradise                  1
Uvce Layout                         1
Abshot Layout                       1
Name: location, Length: 1306, dtype: int64

Here, we pass a lambda function to apply() so that x.strip() is performed on every location and saved back to the location column, removing leading and trailing whitespace.

In [32]:
house_data['location'] = house_data['location'].apply(lambda x: x.strip())
location_count = house_data['location'].value_counts()

In [33]:
location_count

Whitefield                            541
Sarjapur  Road                        399
Electronic City                       304
Kanakpura Road                        273
Thanisandra                           237
                                     ... 
1Channasandra                           1
Hosahalli                               1
Vijayabank bank layout                  1
near Ramanashree California resort      1
Abshot Layout                           1
Name: location, Length: 1295, dtype: int64

Next, we look at the locations with a count of 10 or less; there are 1054 such locations.

In [34]:
location_count_less_10 = location_count[location_count<=10]
location_count_less_10

BTM 1st Stage                         10
Nagadevanahalli                       10
Basapura                              10
Sector 1 HSR Layout                   10
Dairy Circle                          10
                                      ..
1Channasandra                          1
Hosahalli                              1
Vijayabank bank layout                 1
near Ramanashree California resort     1
Abshot Layout                          1
Name: location, Length: 1054, dtype: int64

We change the location to 'other' for every location that occurs 10 times or fewer.

The line of code below uses a lambda function along with the apply() method to modify the 'location' column.

In [35]:
# Apply a lambda function to the 'location' column of the 'house_data' DataFrame
# The lambda function replaces locations with 10 or fewer occurrences with the string 'other'
house_data['location'] = house_data['location'].apply(lambda x: 'other' if x in location_count_less_10 else x)
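The same bucketing can be written without a row-wise lambda. The sketch below uses isin() and where() on an invented Series; it is an alternative formulation, not the notebook's code.

```python
import pandas as pd

# Invented data: 'A' occurs 11 times, 'B' 10 times, 'C' once.
loc = pd.Series(["A"] * 11 + ["B"] * 10 + ["C"])

counts = loc.value_counts()
# Labels that occur 10 times or fewer.
rare = counts[counts <= 10].index

# where() keeps values where the condition holds and substitutes 'other' elsewhere.
bucketed = loc.where(~loc.isin(rare), "other")
print(bucketed.value_counts())
```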

Here, the number of locations has been reduced to 242, so one-hot encoding produces 242 columns, which we can feed to the model.

In [36]:
house_data['location'].value_counts()

other                 2886
Whitefield             541
Sarjapur  Road         399
Electronic City        304
Kanakpura Road         273
                      ... 
Nehru Nagar             11
Banjara Layout          11
LB Shastri Nagar        11
Pattandur Agrahara      11
Narayanapura            11
Name: location, Length: 242, dtype: int64

### **One Hot Encoding**

In [37]:
one_hot_encoded_location = pd.get_dummies(house_data['location'])

In [38]:
house_data_encoded = pd.concat([house_data.drop('location', axis=1), one_hot_encoded_location], axis=1)

In [39]:
print(house_data_encoded.head())

        size  total_sqft  bath   price  bhk  price_per_sqft  \
0      2 BHK      1056.0   2.0   39.07    2     3699.810606   
1  4 Bedroom      2600.0   5.0  120.00    4     4615.384615   
2      3 BHK      1440.0   2.0   62.00    3     4305.555556   
3      3 BHK      1521.0   3.0   95.00    3     6245.890861   
4      2 BHK      1200.0   2.0   51.00    2     4250.000000   

   1st Block Jayanagar  1st Phase JP Nagar  2nd Phase Judicial Layout  \
0                    0                   0                          0   
1                    0                   0                          0   
2                    0                   0                          0   
3                    0                   0                          0   
4                    0                   0                          0   

   2nd Stage Nagarbhavi  ...  Vishveshwarya Layout  Vishwapriya Layout  \
0                     0  ...                     0                   0   
1                     0  ...      

### **7) Outlier detection and Removal**

In the summary below we can see a record with a total_sqft of 1, which is not possible. So, we can safely say that it is an outlier.

In [40]:
house_data.describe()

Unnamed: 0,total_sqft,bath,price,bhk,price_per_sqft
count,13274.0,13320.0,13320.0,13320.0,13274.0
mean,1587.856887,2.688814,112.565627,2.802778,7868.363
std,1291.390407,1.338754,148.971674,1.294496,106430.9
min,1.0,1.0,8.0,1.0,267.8298
25%,1100.0,2.0,50.0,2.0,4216.355
50%,1282.0,2.0,72.0,3.0,5415.005
75%,1691.8,3.0,120.0,3.0,7288.013
max,52272.0,40.0,3600.0,43.0,12000000.0


Here, dividing total_sqft by bhk gives the number of square feet per bedroom.

After applying describe() we find a minimum of 0.25 square feet per bedroom, which again is an outlier.

The summary below shows these outliers for total_sqft.

In [41]:
(house_data['total_sqft']/house_data['bhk']).describe()

count    13274.00000
mean       585.57528
std        403.35213
min          0.25000
25%        475.00000
50%        554.00000
75%        628.00000
max      26136.00000
dtype: float64

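To remove these outliers we apply a condition: we drop every record with less than 300 sqft per bedroom. The minimum total_sqft in the summary below is now 300. Rows whose total_sqft is null are dropped as well, because the comparison evaluates to False for them.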

In [42]:
house_data = house_data[((house_data['total_sqft']/house_data['bhk']) >= 300 )]
house_data.describe()

Unnamed: 0,total_sqft,bath,price,bhk,price_per_sqft
count,12530.0,12530.0,12530.0,12530.0,12530.0
mean,1624.470975,2.559537,111.382401,2.650838,6262.517051
std,1315.547998,1.077938,152.077329,0.976678,4181.698408
min,300.0,1.0,8.44,1.0,267.829813
25%,1125.0,2.0,49.0,2.0,4166.825095
50%,1305.0,2.0,70.0,3.0,5268.245958
75%,1717.75,3.0,115.0,3.0,6896.551724
max,52272.0,16.0,3600.0,16.0,176470.588235


The number of rows has dropped from 13320 to 12530.

In [43]:
house_data.shape

(12530, 7)

When we apply the describe() function to price_per_sqft, we find a maximum of 176470.588 rupees per sqft, which is surely an outlier.

In [44]:
house_data.price_per_sqft.describe()

count     12530.000000
mean       6262.517051
std        4181.698408
min         267.829813
25%        4166.825095
50%        5268.245958
75%        6896.551724
max      176470.588235
Name: price_per_sqft, dtype: float64

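To remove the price_per_sqft outliers we write a function 'remove_outliers_sqft' that takes a DataFrame and, for each location, keeps only the rows whose price_per_sqft lies within one standard deviation of that location's mean.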

In [45]:
def remove_outliers_sqft(df):
    # Initialize an empty DataFrame to store the filtered data
    df_output = pd.DataFrame()

    # Iterate over each group of dataframes grouped by 'location'
    for key, subdf in df.groupby('location'):
        # Calculate the mean and standard deviation of 'price_per_sqft' for the current location
        m = np.mean(subdf.price_per_sqft)
        st = np.std(subdf.price_per_sqft)

        # Filter the current location's dataframe to remove outliers
        gen_df = subdf[(subdf.price_per_sqft > (m - st)) & (subdf.price_per_sqft <= (m + st))]

        # Concatenate the filtered dataframes with the output DataFrame
        df_output = pd.concat([df_output, gen_df], ignore_index=True)

    # Return the DataFrame with outliers removed
    return df_output

# Call the function to remove outliers from 'house_data' and assign the result back to 'house_data'
house_data = remove_outliers_sqft(house_data)

# Display summary statistics for the cleaned DataFrame
house_data.describe()

Unnamed: 0,total_sqft,bath,price,bhk,price_per_sqft
count,10267.0,10267.0,10267.0,10267.0,10267.0
mean,1514.3731,2.469465,91.011917,2.573683,5643.208908
std,916.337739,0.970396,86.360434,0.892913,2273.138806
min,300.0,1.0,10.0,1.0,1250.0
25%,1110.0,2.0,49.0,2.0,4210.526316
50%,1285.0,2.0,67.0,2.0,5161.290323
75%,1650.0,3.0,100.0,3.0,6422.844564
max,30400.0,16.0,2200.0,16.0,24509.803922


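The function bhk_outlier_remover(df) removes outliers from the DataFrame df based on the number of bedrooms ('bhk') and the 'price_per_sqft' column, using statistics aggregated by location. Within each location, a property with n bedrooms is dropped if its price_per_sqft is below the mean price_per_sqft of the properties with n-1 bedrooms, provided that smaller group has more than 5 records.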


In [46]:
def bhk_outlier_remover(df):
    # Collect the index labels of the rows to drop
    exclude_indices = np.array([])
    for location, location_df in df.groupby('location'):
        # Mean, standard deviation and row count of price_per_sqft per bhk in this location
        bhk_stats = {}
        for bhk, bhk_df in location_df.groupby('bhk'):
            bhk_stats[bhk] = {
                'mean': np.mean(bhk_df.price_per_sqft),
                'std': np.std(bhk_df.price_per_sqft),
                'count': bhk_df.shape[0]
            }
        for bhk, bhk_df in location_df.groupby('bhk'):
            # Compare each bhk group with the group that has one bedroom fewer
            stats = bhk_stats.get(bhk-1)
            if stats and stats['count']>5:
                # Mark rows whose price_per_sqft is below the mean of the smaller homes
                exclude_indices = np.append(exclude_indices, bhk_df[bhk_df.price_per_sqft<(stats['mean'])].index.values)
    return df.drop(exclude_indices, axis='index')

The cell below applies the bhk_outlier_remover function to the 'house_data' DataFrame, removing outliers based on the statistics aggregated by location and bedroom count.

In [47]:
house_data = bhk_outlier_remover(house_data)
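To see which rows this rule removes, it helps to run it on a tiny example. The sketch below re-expresses the rule with a merge on invented data; it is an illustrative reformulation, and the column names smaller_mean and smaller_count are hypothetical.

```python
import pandas as pd

# Invented data: six 1-bhk homes at 5000 per sqft, two 2-bhk homes at 4000 and 6000.
toy = pd.DataFrame({
    "location": ["A"] * 8,
    "bhk": [1, 1, 1, 1, 1, 1, 2, 2],
    "price_per_sqft": [5000.0] * 6 + [4000.0, 6000.0],
})

# Mean and row count of price_per_sqft per (location, bhk).
stats = (toy.groupby(["location", "bhk"])["price_per_sqft"]
            .agg(["mean", "count"]).reset_index())
# Shift bhk up by one so each row joins to the stats of homes with one bedroom fewer.
stats["bhk"] = stats["bhk"] + 1
stats = stats.rename(columns={"mean": "smaller_mean", "count": "smaller_count"})

merged = toy.merge(stats, on=["location", "bhk"], how="left")
# Drop a row if the smaller group has more than 5 rows and the row is cheaper per sqft.
drop = (merged["smaller_count"] > 5) & (merged["price_per_sqft"] < merged["smaller_mean"])
kept = merged.loc[~drop, toy.columns]
print(kept)
```

Only the 2-bhk home at 4000 per sqft is removed, since it is cheaper per square foot than the average 1-bhk home in the same location.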

We can observe that the number of rows has dropped from 10267 to 7339 as we have removed outliers.

In [48]:
house_data.shape

(7339, 7)

In [49]:
house_data

Unnamed: 0,location,size,total_sqft,bath,price,bhk,price_per_sqft
0,1st Block Jayanagar,4 BHK,2850.0,4.0,428.0,4,15017.543860
1,1st Block Jayanagar,3 BHK,1630.0,3.0,194.0,3,11901.840491
2,1st Block Jayanagar,3 BHK,1875.0,2.0,235.0,3,12533.333333
3,1st Block Jayanagar,3 BHK,1200.0,2.0,130.0,3,10833.333333
4,1st Block Jayanagar,2 BHK,1235.0,2.0,148.0,2,11983.805668
...,...,...,...,...,...,...,...
10258,other,2 BHK,1200.0,2.0,70.0,2,5833.333333
10259,other,1 BHK,1800.0,1.0,200.0,1,11111.111111
10262,other,2 BHK,1353.0,2.0,110.0,2,8130.081301
10263,other,1 Bedroom,812.0,1.0,26.0,1,3201.970443


We drop two columns: 'size' and 'price_per_sqft'.

**Reason for dropping 'size' column:** The size column holds strings, which cannot be used directly for modeling, and its information is already captured in the 'bhk' column that we derived from it, which contains only the integer part of the string.

**Reason for dropping 'price_per_sqft' column:** price_per_sqft is derived from 'price' and 'total_sqft'. It adds no information beyond those two columns, and because it is computed from the target variable 'price', keeping it as a feature would leak the target into the model. So, we drop this column.

In [50]:
house_data.drop(columns=['size','price_per_sqft'],inplace=True)

### **8) Splitting dataset into Training and testing sets**

In [51]:
house_data.head()

Unnamed: 0,location,total_sqft,bath,price,bhk
0,1st Block Jayanagar,2850.0,4.0,428.0,4
1,1st Block Jayanagar,1630.0,3.0,194.0,3
2,1st Block Jayanagar,1875.0,2.0,235.0,3
3,1st Block Jayanagar,1200.0,2.0,130.0,3
4,1st Block Jayanagar,1235.0,2.0,148.0,2


In [52]:
house_data.to_csv("cleaned_data.csv")

In [53]:
X=house_data.drop(columns=['price'])
y=house_data['price']

In [54]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression,Lasso,Ridge
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score

In [55]:
X_train,X_test,y_train,y_test = train_test_split(X,y, test_size=0.2, random_state=0)

In [56]:
X_train

Unnamed: 0,location,total_sqft,bath,bhk
8532,other,1150.0,2.0,3
5325,Old Airport Road,2690.0,4.0,4
1895,EPIP Zone,2710.0,3.0,3
6529,Sultan Palaya,5000.0,5.0,4
9004,other,2100.0,4.0,4
...,...,...,...,...
6036,Sarjapur,650.0,1.0,1
3958,Kalyan nagar,1198.0,2.0,2
1996,Electronic City,435.0,1.0,1
3164,Hoodi,1639.0,3.0,3


In [57]:
X_test

Unnamed: 0,location,total_sqft,bath,bhk
6330,Sarjapur Road,1216.0,2.0,2
5876,Ramagondanahalli,1500.0,3.0,3
1040,Begur Road,1600.0,3.0,3
5969,Sahakara Nagar,1200.0,2.0,2
671,Balagere,1012.0,2.0,2
...,...,...,...,...
1191,Bharathi Nagar,1432.0,2.0,2
725,Banashankari,600.0,1.0,2
2457,Gottigere,3000.0,3.0,4
2709,Haralur Road,1027.0,1.0,2


In [58]:
print(X_train.shape)
print(X_test.shape)

(5871, 4)
(1468, 4)
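The two shapes follow from the 7339 cleaned rows and test_size=0.2. The arithmetic below reproduces them, assuming the test share is rounded up and the remainder goes to training, which is how scikit-learn is understood to behave when only test_size is given.

```python
import math

n_rows = 7339      # rows left after outlier removal
test_size = 0.2    # share passed to train_test_split

# 7339 * 0.2 = 1467.8, rounded up to a whole number of test rows.
n_test = math.ceil(n_rows * test_size)
n_train = n_rows - n_test
print(n_train, n_test)
```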


### Applying Linear Regression

In [63]:
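# Note: 'sparse' was renamed to 'sparse_output' in scikit-learn 1.2 and removed in 1.4; use sparse_output=False on newer versions
column_trans = make_column_transformer((OneHotEncoder(sparse=False), ['location']),remainder='passthrough')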

In [75]:
scaler = StandardScaler()

In [90]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()

In [69]:
pipe = make_pipeline(column_trans,scaler, lr)

In [70]:
pipe.fit(X_train,y_train)



In [76]:
y_pred_lr = pipe.predict(X_test)

In [77]:
r2_score(y_test, y_pred_lr)

0.7971571080931397
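The score reported by r2_score is the coefficient of determination, one minus the ratio of the residual sum of squares to the total sum of squares. A minimal pure-Python sketch of that definition follows; the helper name r2 is hypothetical and the inputs are toy numbers.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Squared error of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


print(r2([1, 2, 3], [1, 2, 3]))  # perfect predictions
print(r2([1, 2, 3], [2, 2, 2]))  # always predicting the mean
```

A score of 1 means perfect predictions and 0 means the model does no better than predicting the mean price, so a value near 0.80 indicates that roughly 80% of the variance in the test prices is explained.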

### Applying Lasso

In [78]:
lasso = Lasso()

In [80]:
pipe = make_pipeline(column_trans,scaler, lasso)

In [81]:
pipe.fit(X_train,y_train)



In [82]:
y_pred_lasso = pipe.predict(X_test)
r2_score(y_test, y_pred_lasso)

0.7851332530941815

### Applying Ridge

In [83]:
ridge = Ridge()

In [84]:
pipe = make_pipeline(column_trans, scaler, ridge)

In [85]:
pipe.fit(X_train,y_train)



In [86]:
y_pred_ridge = pipe.predict(X_test)
r2_score(y_test, y_pred_ridge)

0.7971713890436467

In [88]:
print("No Regularization: ", r2_score(y_test, y_pred_lr))
print("Lasso: ", r2_score(y_test, y_pred_lasso))
print("Ridge: ", r2_score(y_test, y_pred_ridge))

No Regularization:  0.7971571080931397
Lasso:  0.7851332530941815
Ridge:  0.7971713890436467
