<center>
    <h1><b> L1 and L2 Regularization</b></h1>
</center>

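L1 and L2 regularization are used to reduce model overfitting.

L1 (Lasso) adds a penalty to the loss equal to lambda times the sum of the absolute values of the coefficients (theta). This shrinks the coefficients and can drive some of them to exactly zero.

L2 (Ridge) adds a penalty equal to lambda times the sum of the squared coefficients (theta). This shrinks the coefficients toward zero but usually does not eliminate them.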
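
As a rough illustration of the two penalty terms, the sketch below computes them for a made-up coefficient vector. The names `theta` and `lam` and their values are illustrative assumptions, and the exact scaling of the penalty (for example a factor of 1/2 or 1/n) differs between textbooks and libraries.

```python
import numpy as np

# Hypothetical coefficient vector and regularization strength (illustrative values)
theta = np.array([3.0, -4.0, 0.0, 0.5])
lam = 0.1

# L1 (Lasso) penalty: lambda times the sum of absolute coefficient values
l1_penalty = lam * np.sum(np.abs(theta))

# L2 (Ridge) penalty: lambda times the sum of squared coefficient values
l2_penalty = lam * np.sum(theta ** 2)

print(l1_penalty, l2_penalty)
```

Because the L2 term squares each coefficient, large coefficients such as -4.0 dominate it, while the L1 term treats every unit of magnitude the same.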

In [1]:
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix

%matplotlib inline

In [2]:
# Suppress warnings for clean notebook
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [3]:
dataset = pd.read_csv(r'D:\AI Engineering\Python\My_Projects\Datasets\Melbourne_housing.csv', low_memory = False)

In [4]:
dataset.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Method,SellerG,Date,Distance,Postcode,Bedroom,...,Landsize,BuildingArea,YearBuilt,CouncilArea,Latitude,Longtitude,Regionname,Propertycount,ParkingArea,Price
0,Abbotsford,68 Studley St,2,h,SS,Jellis,3/9/2016,2.5,3067.0,2.0,...,126.0,inf,,Yarra City Council,-37.8014,144.9958,Northern Metropolitan,4019.0,Carport,
1,Airport West,154 Halsey Rd,3,t,PI,Nelson,3/9/2016,13.5,3042.0,3.0,...,303.0,225.0,2016.0,Moonee Valley City Council,-37.718,144.878,Western Metropolitan,3464.0,Detached Garage,840000.0
2,Albert Park,105 Kerferd Rd,2,h,S,hockingstuart,3/9/2016,3.3,3206.0,2.0,...,120.0,82.0,1900.0,Port Phillip City Council,-37.8459,144.9555,Southern Metropolitan,3280.0,Attached Garage,1275000.0
3,Albert Park,85 Richardson St,2,h,S,Thomson,3/9/2016,3.3,3206.0,2.0,...,159.0,inf,,Port Phillip City Council,-37.845,144.9538,Southern Metropolitan,3280.0,Indoor,1455000.0
4,Alphington,30 Austin St,3,h,SN,McGrath,3/9/2016,6.4,3078.0,3.0,...,174.0,122.0,2003.0,Darebin City Council,-37.7818,145.0198,Northern Metropolitan,2211.0,Parkade,


### Data Exploration

In [5]:
dataset.shape

(34857, 22)

In [6]:
# Number of unique values in each column
dataset.nunique()

Suburb             351
Address          34009
Rooms               12
Type                 3
Method               9
SellerG            388
Date                78
Distance           215
Postcode           211
Bedroom             15
Bathroom            11
Car                 15
Landsize          1684
BuildingArea       742
YearBuilt          160
CouncilArea         33
Latitude         13402
Longtitude       14524
Regionname           8
Propertycount      342
ParkingArea          8
Price             2871
dtype: int64

In [7]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34857 entries, 0 to 34856
Data columns (total 22 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Suburb         34857 non-null  object 
 1   Address        34857 non-null  object 
 2   Rooms          34857 non-null  int64  
 3   Type           34857 non-null  object 
 4   Method         34857 non-null  object 
 5   SellerG        34857 non-null  object 
 6   Date           34857 non-null  object 
 7   Distance       34856 non-null  float64
 8   Postcode       34856 non-null  float64
 9   Bedroom        26640 non-null  float64
 10  Bathroom       26631 non-null  float64
 11  Car            26129 non-null  float64
 12  Landsize       23047 non-null  float64
 13  BuildingArea   13760 non-null  object 
 14  YearBuilt      15551 non-null  float64
 15  CouncilArea    34854 non-null  object 
 16  Latitude       26881 non-null  float64
 17  Longtitude     26881 non-null  float64
 18  Region

In [8]:
dataset.isnull().sum()

Suburb               0
Address              0
Rooms                0
Type                 0
Method               0
SellerG              0
Date                 0
Distance             1
Postcode             1
Bedroom           8217
Bathroom          8226
Car               8728
Landsize         11810
BuildingArea     21097
YearBuilt        19306
CouncilArea          3
Latitude          7976
Longtitude        7976
Regionname           0
Propertycount        3
ParkingArea          0
Price             7610
dtype: int64

In [9]:
dataset.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Method', 'SellerG', 'Date',
       'Distance', 'Postcode', 'Bedroom', 'Bathroom', 'Car', 'Landsize',
       'BuildingArea', 'YearBuilt', 'CouncilArea', 'Latitude', 'Longtitude',
       'Regionname', 'Propertycount', 'ParkingArea', 'Price'],
      dtype='object')

In [10]:
cols_to_use = ['Suburb', 'Rooms', 'Type', 'Method', 'SellerG', 'Regionname', 'Propertycount', 'Distance', 'CouncilArea', 'Bedroom', 'Bathroom', 'Car', 'Landsize',
       'BuildingArea', 'YearBuilt', 'Price']

In [11]:
# Copy the selected columns so later assignments do not act on a view of `dataset`
df = dataset[cols_to_use].copy()
df[10:20]

Unnamed: 0,Suburb,Rooms,Type,Method,SellerG,Regionname,Propertycount,Distance,CouncilArea,Bedroom,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Price
10,Altona North,3,h,S,Jas,Western Metropolitan,5132.0,11.1,Hobsons Bay City Council,3.0,1.0,2.0,533.0,117,1970.0,781000.0
11,Altona North,4,h,S,hockingstuart,Western Metropolitan,5132.0,11.1,Hobsons Bay City Council,,,,,missing,,857500.0
12,Armadale,3,h,S,RT,Southern Metropolitan,4836.0,6.3,Stonnington City Council,2.0,1.0,1.0,305.0,missing,,
13,Armadale,2,u,S,Jellis,Southern Metropolitan,4836.0,6.3,Stonnington City Council,2.0,1.0,1.0,0.0,76,1964.0,599000.0
14,Ascot Vale,3,h,S,Jellis,Western Metropolitan,6567.0,5.9,Moonee Valley City Council,3.0,1.0,2.0,305.0,135,1910.0,
15,Ascot Vale,2,u,S,Nelson,Western Metropolitan,6567.0,5.9,Moonee Valley City Council,3.0,1.0,2.0,0.0,missing,,455000.0
16,Ashburton,4,h,S,Marshall,Southern Metropolitan,3052.0,11.0,Boroondara City Council,4.0,3.0,2.0,753.0,399,2015.0,2650000.0
17,Ashburton,2,h,S,Marshall,Southern Metropolitan,3052.0,11.0,Boroondara City Council,,,,,inf,,1820000.0
18,Ashwood,3,h,SP,Jellis,Southern Metropolitan,2894.0,12.2,Monash City Council,3.0,1.0,2.0,343.0,inf,1960.0,
19,Ashwood,2,h,S,Tim,Southern Metropolitan,2894.0,12.2,Monash City Council,2.0,1.0,1.0,583.0,inf,1950.0,995000.0


In [12]:
df.shape

(34857, 16)

In [13]:
df.isna().sum()

Suburb               0
Rooms                0
Type                 0
Method               0
SellerG              0
Regionname           0
Propertycount        3
Distance             1
CouncilArea          3
Bedroom           8217
Bathroom          8226
Car               8728
Landsize         11810
BuildingArea     21097
YearBuilt        19306
Price             7610
dtype: int64

In [14]:
# Filling the NaN values in the selected columns with zero
fill_zero_cols = ['Propertycount', 'Distance', 'Bedroom', 'Bathroom', 'Car']
df[fill_zero_cols] = df[fill_zero_cols].fillna(0)

In [15]:
df.isna().sum()

Suburb               0
Rooms                0
Type                 0
Method               0
SellerG              0
Regionname           0
Propertycount        0
Distance             0
CouncilArea          3
Bedroom              0
Bathroom             0
Car                  0
Landsize         11810
BuildingArea     21097
YearBuilt        19306
Price             7610
dtype: int64

In [16]:
df['BuildingArea'].unique()

array(['inf', '225', '82', '122', '263', '242', '108', '251', '117',
       'missing', '76', '135', '399', '118', '103', '180', nan, '123',
       '218', '129', '167', '154', '275', '121', '146', '125', '255',
       '94', '75', '254', '156', '404', '240', '370', '268', '202', '203',
       '69', '140', '214', '253', '189', '215', '195', '96', '104', '100',
       '313', '144', '130', '64', '93', '106', '107', '110', '70', '132',
       '229', '51', '147', '113', '83', '56', '137', '85', '175', '3558',
       '170', '259', '265', '353', '138', '19', '116', '87', '74', '320',
       '300', '52', '210', '120', '86', '97', '152', '200', '14', '161',
       '128', '178', '185', '109', '53', '133', '115', '143', '150',
       '236', '276', '188', '179', '249', '141', '349', '192', '34', '73',
       '84', '81', '207', '50', '197', '264', '312', '235', '221', '260',
       '183', '160', '186', '78', '105', '145', '168', '62', '111', '220',
       '315', '181', '500', '61', '112', '420', '226

In [17]:
# Replacing non-numeric placeholders with NaN (np.nan; the np.NaN alias was removed in NumPy 2.0)
df['BuildingArea'] = df['BuildingArea'].replace(['inf', 'missing', 'nan', None], np.nan)

In [18]:
df['BuildingArea'].unique()

array([nan, '225', '82', '122', '263', '242', '108', '251', '117', '76',
       '135', '399', '118', '103', '180', '123', '218', '129', '167',
       '154', '275', '121', '146', '125', '255', '94', '75', '254', '156',
       '404', '240', '370', '268', '202', '203', '69', '140', '214',
       '253', '189', '215', '195', '96', '104', '100', '313', '144',
       '130', '64', '93', '106', '107', '110', '70', '132', '229', '51',
       '147', '113', '83', '56', '137', '85', '175', '3558', '170', '259',
       '265', '353', '138', '19', '116', '87', '74', '320', '300', '52',
       '210', '120', '86', '97', '152', '200', '14', '161', '128', '178',
       '185', '109', '53', '133', '115', '143', '150', '236', '276',
       '188', '179', '249', '141', '349', '192', '34', '73', '84', '81',
       '207', '50', '197', '264', '312', '235', '221', '260', '183',
       '160', '186', '78', '105', '145', '168', '62', '111', '220', '315',
       '181', '500', '61', '112', '420', '226', '266', '410', '

In [19]:
# Converting BuildingArea datatype to float
df['BuildingArea'] = df['BuildingArea'].astype('float')
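
The replace-then-cast steps above can also be approximated with `pd.to_numeric`. The sketch below is an alternative on a toy Series, not the notebook's own method; the values in `raw` are made up to mimic the strings seen in `BuildingArea`.

```python
import numpy as np
import pandas as pd

# Toy column mimicking the mixed string values seen in 'BuildingArea'
raw = pd.Series(['inf', '225', 'missing', '82', None, '122'])

# errors='coerce' turns unparseable strings such as 'missing' into NaN
numeric = pd.to_numeric(raw, errors='coerce')

# The string 'inf' may be parsed as a float infinity, so replace infinities explicitly
numeric = numeric.replace([np.inf, -np.inf], np.nan)

print(numeric.tolist())
```

This avoids listing every placeholder string by hand, at the cost of silently discarding any value that fails to parse.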

In [20]:
df['BuildingArea'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 34857 entries, 0 to 34856
Series name: BuildingArea
Non-Null Count  Dtype  
--------------  -----  
13742 non-null  float64
dtypes: float64(1)
memory usage: 272.4 KB


In [21]:
# Filling selected columns with their mean
df['Landsize'] = df['Landsize'].fillna(df['Landsize'].mean())
df['BuildingArea'] = df['BuildingArea'].fillna(df['BuildingArea'].mean())

In [22]:
df.isna().sum()

Suburb               0
Rooms                0
Type                 0
Method               0
SellerG              0
Regionname           0
Propertycount        0
Distance             0
CouncilArea          3
Bedroom              0
Bathroom             0
Car                  0
Landsize             0
BuildingArea         0
YearBuilt        19306
Price             7610
dtype: int64

In [23]:
# Checking the percentage of each unique value in 'YearBuilt' column
(df['YearBuilt'].value_counts(normalize = True).values * 100).round(2)

array([9.58, 8.1 , 7.  , 4.67, 3.9 , 3.67, 3.5 , 3.41, 2.96, 2.86, 2.61,
       2.49, 2.35, 2.32, 2.14, 1.77, 1.67, 1.59, 1.55, 1.47, 1.47, 1.38,
       1.36, 1.3 , 1.29, 1.14, 1.1 , 1.09, 1.07, 1.05, 1.  , 0.92, 0.84,
       0.84, 0.81, 0.8 , 0.73, 0.7 , 0.64, 0.58, 0.53, 0.48, 0.39, 0.37,
       0.34, 0.32, 0.3 , 0.3 , 0.28, 0.26, 0.24, 0.24, 0.21, 0.21, 0.19,
       0.18, 0.17, 0.16, 0.16, 0.16, 0.15, 0.15, 0.15, 0.15, 0.15, 0.14,
       0.14, 0.14, 0.13, 0.13, 0.12, 0.12, 0.11, 0.11, 0.1 , 0.1 , 0.1 ,
       0.1 , 0.08, 0.08, 0.08, 0.08, 0.07, 0.07, 0.07, 0.07, 0.06, 0.06,
       0.06, 0.06, 0.06, 0.06, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05,
       0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.04, 0.03, 0.03,
       0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.02,
       0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.01, 0.01,
       0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01,
       0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.

In [24]:
# Function to fill null rows in a column by sampling from its observed value distribution
def replace_with_percentage(df: pd.DataFrame, col: str, new_col: str) -> pd.DataFrame:
    # Copy the original column to the new column
    df[new_col] = df[col].copy()

    # Find rows with NaN in the new column
    null_rows = df[new_col].isnull()
    
    # Get value counts of unique values in the column
    unique_value_count = df[col].value_counts(normalize=True)
    
    # Create a list of values based on their percentage
    values = unique_value_count.index.tolist()
    weights = unique_value_count.values.tolist()

    # Randomly assign values to NaN rows in the new column based on percentage distribution
    df.loc[null_rows, new_col] = np.random.choice(values, size=null_rows.sum(), p=weights)
    
    return df
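
To see what this kind of proportional random fill does, here is a minimal self-contained sketch on toy data. The frame `toy`, the seed, and the use of `np.random.default_rng` are assumptions for illustration. The function above uses the unseeded `np.random.choice`, so its filled values change from run to run.

```python
import numpy as np
import pandas as pd

# Toy data: 1990 appears three times as often as 2005 among the known values
toy = pd.DataFrame({'YearBuilt': [1990.0, 1990.0, 1990.0, 2005.0] + [np.nan] * 1000})

# Seeded generator so the random fill is repeatable (the seed value is arbitrary)
rng = np.random.default_rng(0)

# Observed relative frequencies of the non-null values
freq = toy['YearBuilt'].value_counts(normalize=True)

# Draw one replacement per missing row, weighted by those frequencies
missing = toy['YearBuilt'].isna()
toy['YearBuilt_'] = toy['YearBuilt']
toy.loc[missing, 'YearBuilt_'] = rng.choice(
    freq.index.to_numpy(), size=int(missing.sum()), p=freq.to_numpy()
)

# The filled column should roughly preserve the original 75% / 25% split
print(toy['YearBuilt_'].value_counts(normalize=True))
```

The fill keeps the marginal distribution of the column but ignores any relationship with other features, so the imputed years carry no real information about each property.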

In [25]:
# Replace NaN with values according to their percentage in a new column 'YearBuilt_'
df = replace_with_percentage(df, 'YearBuilt', 'YearBuilt_')
df.head(5)

Unnamed: 0,Suburb,Rooms,Type,Method,SellerG,Regionname,Propertycount,Distance,CouncilArea,Bedroom,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Price,YearBuilt_
0,Abbotsford,2,h,SS,Jellis,Northern Metropolitan,4019.0,2.5,Yarra City Council,2.0,1.0,1.0,126.0,160.2564,,,1975.0
1,Airport West,3,t,PI,Nelson,Western Metropolitan,3464.0,13.5,Moonee Valley City Council,3.0,2.0,1.0,303.0,225.0,2016.0,840000.0,2016.0
2,Albert Park,2,h,S,hockingstuart,Southern Metropolitan,3280.0,3.3,Port Phillip City Council,2.0,1.0,0.0,120.0,82.0,1900.0,1275000.0,1900.0
3,Albert Park,2,h,S,Thomson,Southern Metropolitan,3280.0,3.3,Port Phillip City Council,2.0,1.0,0.0,159.0,160.2564,,1455000.0,1890.0
4,Alphington,3,h,SN,McGrath,Northern Metropolitan,2211.0,6.4,Darebin City Council,3.0,2.0,1.0,174.0,122.0,2003.0,,2003.0


In [26]:
df['YearBuilt_'].dtype

dtype('float64')

In [27]:
# Dropping the original YearBuilt column (columns= already implies axis=1)
df1 = df.drop(columns='YearBuilt')

In [28]:
df1.head(1)

Unnamed: 0,Suburb,Rooms,Type,Method,SellerG,Regionname,Propertycount,Distance,CouncilArea,Bedroom,Bathroom,Car,Landsize,BuildingArea,Price,YearBuilt_
0,Abbotsford,2,h,SS,Jellis,Northern Metropolitan,4019.0,2.5,Yarra City Council,2.0,1.0,1.0,126.0,160.2564,,1975.0


In [29]:
df1.isnull().sum()

Suburb              0
Rooms               0
Type                0
Method              0
SellerG             0
Regionname          0
Propertycount       0
Distance            0
CouncilArea         3
Bedroom             0
Bathroom            0
Car                 0
Landsize            0
BuildingArea        0
Price            7610
YearBuilt_          0
dtype: int64

In [30]:
# Dropping null values in the dataframe
df1.dropna(inplace = True)

In [31]:
df1.isna().sum()

Suburb           0
Rooms            0
Type             0
Method           0
SellerG          0
Regionname       0
Propertycount    0
Distance         0
CouncilArea      0
Bedroom          0
Bathroom         0
Car              0
Landsize         0
BuildingArea     0
Price            0
YearBuilt_       0
dtype: int64

In [32]:
df1.dtypes

Suburb            object
Rooms              int64
Type              object
Method            object
SellerG           object
Regionname        object
Propertycount    float64
Distance         float64
CouncilArea       object
Bedroom          float64
Bathroom         float64
Car              float64
Landsize         float64
BuildingArea     float64
Price            float64
YearBuilt_       float64
dtype: object

In [33]:
# Converting the object columns to dummy variables (astype(int) also casts the float columns to int)
df1 = pd.get_dummies(df1, drop_first = True).astype(int)
df1.head(2)

Unnamed: 0,Rooms,Propertycount,Distance,Bedroom,Bathroom,Car,Landsize,BuildingArea,Price,YearBuilt_,...,CouncilArea_Moorabool Shire Council,CouncilArea_Moreland City Council,CouncilArea_Nillumbik Shire Council,CouncilArea_Port Phillip City Council,CouncilArea_Stonnington City Council,CouncilArea_Whitehorse City Council,CouncilArea_Whittlesea City Council,CouncilArea_Wyndham City Council,CouncilArea_Yarra City Council,CouncilArea_Yarra Ranges Shire Council
1,3,3464,13,3,2,1,303,225,840000,2016,...,0,0,0,0,0,0,0,0,0,0
2,2,3280,3,2,1,0,120,82,1275000,1900,...,0,0,0,1,0,0,0,0,0,0
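
Note that in the output above `Distance` for row 1 is 13, whereas the raw data had 13.5: casting the whole frame with `.astype(int)` truncates every float column. The sketch below, on a made-up two-column frame, contrasts that with passing `dtype=int` to `pd.get_dummies`, which only sets the type of the generated dummy columns. It is an alternative to consider, not what the notebook ran.

```python
import pandas as pd

# Toy frame with one float column and one categorical column
toy = pd.DataFrame({'Distance': [2.5, 13.5, 3.3], 'Type': ['h', 't', 'h']})

# Casting the whole frame truncates the fractional part of Distance
truncated = pd.get_dummies(toy, drop_first=True).astype(int)

# Passing dtype=int to get_dummies only sets the dtype of the new dummy columns
preserved = pd.get_dummies(toy, drop_first=True, dtype=int)

print(truncated['Distance'].tolist())
print(preserved['Distance'].tolist())
```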


In [34]:
# Creating X and Y
X = df1.drop('Price', axis = 1)
y = df1['Price']

In [35]:
# Splitting the dataset into training and testing sets
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size = 0.3, random_state = 0)

In [36]:
# Training the model
model = LinearRegression().fit(train_X, train_y)

In [37]:
# Making predictions
y_prediction = model.predict(test_X)
y_prediction

array([1872552.73424887,  724155.61780828, 1076922.59066442, ...,
        710613.26283245,  624662.26660199,  591245.12306279])

In [38]:
# Evaluating the model on the test set (score returns R^2)
model.score(test_X, test_y)

0.6750048842925458

In [39]:
# Scoring the model on the training (seen) data
model.score(train_X, train_y)

0.6786985682607156
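
For a regressor, `.score()` returns the coefficient of determination (R^2), not an accuracy. The sketch below checks this on small synthetic data and also reports mean absolute error. The data and the names `X_demo`, `y_demo` and `demo_model` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error

# Small synthetic regression problem (not the housing data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

demo_model = LinearRegression().fit(X_demo, y_demo)
pred = demo_model.predict(X_demo)

# .score() on a regressor is the coefficient of determination, same as r2_score
print(demo_model.score(X_demo, y_demo), r2_score(y_demo, pred))

# Mean absolute error reports the typical error in the target's own units
print(mean_absolute_error(y_demo, pred))
```

Here the train and test scores of the plain linear model are nearly equal (about 0.68 each), which suggests it is not overfitting much to begin with.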

### Using Lasso L1 Regularization

In [40]:
from sklearn import linear_model

In [41]:
# Creating Lasso object
lasso_reg = linear_model.Lasso(alpha=25, max_iter=100, tol=0.1)
lasso_reg.fit(train_X, train_y)

In [42]:
lasso_reg.score(test_X, test_y)

0.6770875427701454

In [43]:
lasso_reg.score(train_X, train_y)

0.6767407054747101
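
One way to see what the L1 penalty does is to count how many coefficients Lasso sets exactly to zero. The sketch below uses synthetic data in which only two of ten features matter. The data and `alpha=0.5` are illustrative assumptions, not values tuned for the housing set.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only the first two of ten features influence the target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 10))
y_demo = 5.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=300)

# alpha=0.5 is an arbitrary illustrative strength
lasso_demo = Lasso(alpha=0.5).fit(X_demo, y_demo)

# Count how many coefficients were shrunk exactly to zero
n_zero = int(np.sum(lasso_demo.coef_ == 0))
print(n_zero, lasso_demo.coef_.round(2))
```

The same check on the fitted housing model would be `np.sum(lasso_reg.coef_ == 0)`, which shows how many of the dummy columns were dropped.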

### L2 or Ridge Regularization

In [44]:
from sklearn.linear_model import Ridge

ridge_reg = Ridge(alpha = 5, max_iter = 100, tol = 0.1)
ridge_reg.fit(train_X, train_y)

In [45]:
ridge_reg.score(test_X, test_y)

0.677488894555211

In [46]:
ridge_reg.score(train_X, train_y)

0.6757340405453944
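
The `alpha` values above were fixed by hand. A possible next step, sketched here on synthetic data, is to pick `alpha` by cross-validation with `GridSearchCV` (imported at the top of the notebook but not used). The candidate grid and the `make_regression` data are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the housing features
X_demo, y_demo = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

# Candidate alphas are illustrative, not tuned for the housing data
param_grid = {'alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}

# 5-fold cross-validated search, scored with R^2 (the default for regressors)
search = GridSearchCV(Ridge(), param_grid, cv=5)
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

On the housing data the search would be fitted on `train_X` and `train_y`, keeping the test set aside for the final score.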