# ML Lab 04: Linear Regression

## Prit Kanadiya
## 211070010

<b>Aim</b>: To implement the Linear Regression algorithm from scratch and use it to approximate house prices in Mumbai based on several features.

<b>Theory</b>: Linear Regression is a simple yet powerful statistical method used for modeling the relationship between a dependent variable (target) and one or more independent variables (features). In the context of predicting house prices, the dependent variable (target) would be the price of the house, while the independent variables (features) could be various factors such as the size of the house, number of bedrooms, location, etc.

The basic idea behind linear regression is to find the best-fitting straight line that describes the relationship between the independent variables and the dependent variable. For a single feature, this line can be represented by the equation:

```y=mx+b```

Where:

```y``` is the dependent variable (target)

```x``` is the independent variable (feature)

```m``` is the slope of the line

```b``` is the y-intercept

The goal of linear regression is to find the values of ```m``` and ```b``` that minimize the error between the predicted values of ```y``` and the actual values of ```y```. This error is typically measured using a cost function, such as the Mean Squared Error (MSE).
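
For reference, with ```n``` training examples, the MSE for the line above and its partial derivatives can be written as follows. The symbols ```n```, ```i``` and ```J``` are notation introduced here for illustration and do not appear elsewhere in the lab.

```latex
J(m, b) = \frac{1}{n} \sum_{i=1}^{n} \bigl( (m x_i + b) - y_i \bigr)^2

\frac{\partial J}{\partial m} = \frac{2}{n} \sum_{i=1}^{n} \bigl( (m x_i + b) - y_i \bigr)\, x_i
\qquad
\frac{\partial J}{\partial b} = \frac{2}{n} \sum_{i=1}^{n} \bigl( (m x_i + b) - y_i \bigr)
```

Gradient descent repeatedly moves ```m``` and ```b``` a small step against these derivatives. The constant factor 2 is often dropped in code because it can be absorbed into the learning rate.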

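Once the values of ```m``` and ```b``` are determined, the linear regression model can make predictions on new data by plugging the values of the independent variables into the equation of the line. With several features, as in this lab, the same idea extends to one weight per feature plus a single intercept: ```y = w1*x1 + w2*x2 + ... + wn*xn + b```.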
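
As a small illustration of fitting and then predicting with a line, the sketch below uses made-up data that follows ```y = 2x + 1``` exactly and NumPy's least-squares fit rather than gradient descent. The data and the ```predict``` helper are assumptions introduced only for this example.

```python
import numpy as np

# Toy data (made up for illustration) that follows y = 2x + 1 exactly
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# np.polyfit with degree 1 returns the least-squares slope and intercept
m, b = np.polyfit(x, y, 1)

def predict(x_new):
    """Plug a new x value into the fitted line."""
    return m * x_new + b

print(m, b)
print(predict(5.0))
```

On this noise-free data the fitted slope and intercept should come out very close to 2 and 1, up to floating-point error.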

In this lab, we will implement the linear regression algorithm from scratch using Python and NumPy, and then use it to predict house prices in Mumbai based on various features. We will also evaluate the performance of our model using metrics such as Mean Squared Error.

In [1]:
# import all the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from geopy.geocoders import Nominatim

In [2]:
# obtain the raw csv 
data = "mumbai_house_prices.csv"
house_price = pd.read_csv(data)

In [3]:
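# print a summary of the dataset; info() prints directly and returns None,
# so wrapping it in print() also emits the trailing "None" in the output
print(house_price.info())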

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76038 entries, 0 to 76037
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   bhk         76038 non-null  int64  
 1   type        76038 non-null  object 
 2   locality    76038 non-null  object 
 3   area        76038 non-null  int64  
 4   price       76038 non-null  float64
 5   price_unit  76038 non-null  object 
 6   region      76038 non-null  object 
 7   status      76038 non-null  object 
 8   age         76038 non-null  object 
dtypes: float64(1), int64(2), object(6)
memory usage: 5.2+ MB
None


In [4]:
house_price.head()

|   | bhk | type | locality | area | price | price_unit | region | status | age |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | Apartment | Lak And Hanware The Residency Tower | 685 | 2.5 | Cr | Andheri West | Ready to move | New |
| 1 | 2 | Apartment | Radheya Sai Enclave Building No 2 | 640 | 52.51 | L | Naigaon East | Under Construction | New |
| 2 | 2 | Apartment | Romell Serene | 610 | 1.73 | Cr | Borivali West | Under Construction | New |
| 3 | 2 | Apartment | Soundlines Codename Urban Rainforest | 876 | 59.98 | L | Panvel | Under Construction | New |
| 4 | 2 | Apartment | Origin Oriana | 659 | 94.11 | L | Mira Road East | Under Construction | New |


In [5]:
# create a dictionary to map region to their latitude and longitude
geo = Nominatim(user_agent="Geopy Library")
unique_regions = house_price["region"].unique()
print("Total number of unique values: ", len(unique_regions))
lat_long_dict = {}
unknown_regions = []

for r in unique_regions:
    # geocode() returns None when no match is found
    loc = geo.geocode(r + ", Mumbai")
    if loc is None:
        unknown_regions.append(r)
        continue
    lat_long_dict[r] = [loc.latitude, loc.longitude]

print("Geopy could not find the following regions: ", unknown_regions, len(unknown_regions))

Total number of unique values:  228
Geopy could not find the following regions:  ['Mira Road East', 'Badlapur East', 'Badlapur West', 'Ambernath West', 'Ulhasnagar', 'Kewale', 'Nala Sopara', 'Karanjade', 'Neral', 'Karjat', 'Dronagiri', 'Navade', 'Owale', 'Ville Parle East', 'Vangani', 'Bhayandar East', 'Ambernath East', 'Nilje Gaon', 'Titwala', 'Koper Khairane', 'Napeansea Road', 'Koproli', 'Anjurdive', 'Taloje', 'Vasai West', 'Vasai east', 'Nalasopara East', 'Saphale', 'Kasheli', 'Panch Pakhdi', 'Hiranandani Estates', 'Vichumbe', 'Sector 17 Ulwe', 'Sector 23 Ulwe', 'Sector 20 Kamothe', 'Sector 30 Kharghar', 'Virar East', 'Sector 8 New panvel', 'Bhayandar West', 'Sector 20 Ulwe', 'Virar West', 'Palava', 'Greater Khanda', 'Sector-35D Kharghar', 'Umroli', 'Sector-9 Ulwe', 'Sector-3 Ulwe', 'kasaradavali thane west', 'Sector 19 Kharghar', 'Kalher', 'Sector 21 Kharghar', 'Usarghar Gaon', 'Patlipada', 'Vevoor', 'Sector 7 Kharghar', 'Badlapur', 'Khanda Colony', 'Gauripada', 'Warai', 'Khatiwal
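
Because this lookup depends on a network service, it can help to separate the loop from the geocoder so that it can be exercised offline. The sketch below is one hypothetical way to do that: ```build_lat_long```, ```FakeLocation``` and ```fake_geocode``` are names invented here, and the coordinates in the stub are placeholders, not real geocoding results.

```python
def build_lat_long(regions, geocode_fn, suffix=", Mumbai"):
    """Return (found, missing): a dict region -> [lat, lon] and a list of unresolved regions."""
    found, missing = {}, []
    for region in regions:
        try:
            loc = geocode_fn(region + suffix)
        except Exception:
            # treat a failed request (for example a timeout) like an unresolved region
            loc = None
        if loc is None:
            missing.append(region)
        else:
            found[region] = [loc.latitude, loc.longitude]
    return found, missing


class FakeLocation:
    """Stand-in for a geocoder result; only the two attributes used above."""
    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude


def fake_geocode(query):
    # placeholder coordinates, for demonstration only
    known = {"Parel, Mumbai": FakeLocation(19.0, 72.8)}
    return known.get(query)


found, missing = build_lat_long(["Parel", "Nowhere"], fake_geocode)
print(found, missing)
```

In the notebook, ```geo.geocode``` could be passed in place of ```fake_geocode```. Since the public Nominatim service limits request rates, throttling the calls (geopy ships a ```RateLimiter``` helper in ```geopy.extra.rate_limiter```) may also reduce failed lookups.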

In [6]:
# mark rows with unidentified regions for removal; for rows with identified regions, add their latitude and longitude to the dataset
del_idx = []
for i in range(len(house_price)):
    region = house_price.loc[i, "region"]
    if region in unknown_regions:
        del_idx.append(i)
    else:
        lat_long = lat_long_dict[region]
        house_price.at[i, "latitude"] = lat_long[0] 
        house_price.at[i, "longitude"] = lat_long[1]


In [7]:
house_price.head()

|   | bhk | type | locality | area | price | price_unit | region | status | age | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | Apartment | Lak And Hanware The Residency Tower | 685 | 2.5 | Cr | Andheri West | Ready to move | New | 19.117249 | 72.833968 |
| 1 | 2 | Apartment | Radheya Sai Enclave Building No 2 | 640 | 52.51 | L | Naigaon East | Under Construction | New | 19.013755 | 72.846294 |
| 2 | 2 | Apartment | Romell Serene | 610 | 1.73 | Cr | Borivali West | Under Construction | New | 19.229456 | 72.84799 |
| 3 | 2 | Apartment | Soundlines Codename Urban Rainforest | 876 | 59.98 | L | Panvel | Under Construction | New | 18.990978 | 73.065553 |
| 4 | 2 | Apartment | Origin Oriana | 659 | 94.11 | L | Mira Road East | Under Construction | New |  |  |


In [8]:
house_price.drop(del_idx, inplace=True)
house_price = house_price.reset_index(drop=True)
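
The row-by-row loop and the later drop can also be expressed with column-wise operations, which tends to be faster on a dataset of this size. The following is a minimal sketch on a toy frame; ```df```, ```lat_long``` and their values are placeholders standing in for the notebook's ```house_price``` and ```lat_long_dict```.

```python
import pandas as pd

# Toy stand-ins for the notebook's objects (values are placeholders)
df = pd.DataFrame({"region": ["A", "B", "A", "C"], "area": [600, 700, 800, 900]})
lat_long = {"A": [19.1, 72.8], "C": [19.0, 73.0]}  # region "B" was not geocoded

# Split the dictionary into one lookup table per coordinate
lat = {region: coords[0] for region, coords in lat_long.items()}
lon = {region: coords[1] for region, coords in lat_long.items()}

# Series.map leaves NaN for regions that are missing from the lookup table
df["latitude"] = df["region"].map(lat)
df["longitude"] = df["region"].map(lon)

# Drop rows without coordinates and renumber the index
df = df.dropna(subset=["latitude", "longitude"]).reset_index(drop=True)
print(df)
```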

In [9]:
house_price.head()

|   | bhk | type | locality | area | price | price_unit | region | status | age | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | Apartment | Lak And Hanware The Residency Tower | 685 | 2.5 | Cr | Andheri West | Ready to move | New | 19.117249 | 72.833968 |
| 1 | 2 | Apartment | Radheya Sai Enclave Building No 2 | 640 | 52.51 | L | Naigaon East | Under Construction | New | 19.013755 | 72.846294 |
| 2 | 2 | Apartment | Romell Serene | 610 | 1.73 | Cr | Borivali West | Under Construction | New | 19.229456 | 72.84799 |
| 3 | 2 | Apartment | Soundlines Codename Urban Rainforest | 876 | 59.98 | L | Panvel | Under Construction | New | 18.990978 | 73.065553 |
| 4 | 2 | Apartment | Bhoomi Simana Wing A Phase 1 | 826 | 3.3 | Cr | Parel | Under Construction | New | 19.009482 | 72.837661 |


In [10]:
house_price.shape

(61218, 11)

In [11]:
# remove columns type, locality and region: they are non-numeric, and region is already encoded as latitude and longitude
house_price.drop(['type', 'locality', 'region'], axis=1, inplace=True)

In [12]:
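# encode categorical columns as integers ("Unknown" age is treated the same as "New")
# assigning the result back avoids calling an inplace method on a column selection
house_price["age"] = house_price["age"].replace({"New": 0, "Resale": 1, "Unknown": 0})
house_price["status"] = house_price["status"].replace({"Ready to move": 0, "Under Construction": 1})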
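
One caveat with ```replace``` is that any label missing from the dictionary is left unchanged, so an unexpected category would remain a string. A possible check, sketched here on a made-up series (```ages``` and ```age_codes``` are illustrative names), is to use ```map```, which turns unmapped labels into NaN so they can be listed.

```python
import pandas as pd

ages = pd.Series(["New", "Resale", "Unknown", "New"])
age_codes = {"New": 0, "Resale": 1, "Unknown": 0}

# map() yields NaN for any label that is missing from the dictionary
encoded = ages.map(age_codes)

# collect the labels that could not be encoded (empty if the mapping is complete)
unmapped = ages[encoded.isna()].unique().tolist()
print(encoded.tolist(), unmapped)
```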

In [13]:
# convert prices quoted in crore to lakh (1 Cr = 100 L) so that every price is in lakh
is_crore = house_price["price_unit"] == "Cr"
house_price.loc[is_crore, "price"] *= 100

# scale latitude and longitude by 10,000
house_price["latitude"] *= 10000
house_price["longitude"] *= 10000

house_price.drop(["price_unit"], axis=1, inplace=True)

In [14]:
house_price.head()

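|   | bhk | area | price | status | age | latitude | longitude |
|---|---|---|---|---|---|---|---|
| 0 | 3 | 685 | 250.0 | 0 | 0 | 191172.495 | 728339.68 |
| 1 | 2 | 640 | 52.51 | 1 | 0 | 190137.554 | 728462.942891 |
| 2 | 2 | 610 | 173.0 | 1 | 0 | 192294.561 | 728479.905 |
| 3 | 2 | 876 | 59.98 | 1 | 0 | 189909.781 | 730655.529733 |
| 4 | 2 | 826 | 330.0 | 1 | 0 | 190094.817 | 728376.614 |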


In [15]:
# split into train and test dataset
test_ratio = 0.1
test_size = int(test_ratio*len(house_price))
test_indices = house_price.sample(test_size).index
train = house_price.drop(test_indices)
test = house_price.loc[test_indices]
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)
y_train = train.pop("price").tolist()
y_test = test.pop("price").tolist()
print("The size of X_train is: ", train.shape)
print("The size of X_test is: ", test.shape)
print("The size of y_train is: ", len(y_train))
print("The size of y_test is: ", len(y_test))

The size of X_train is:  (55097, 6)
The size of X_test is:  (6121, 6)
The size of y_train is:  55097
The size of y_test is:  6121
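
The split above draws a different random test set on every run, so results can vary between executions. A minimal sketch of a seeded split follows; the ```split_train_test``` helper, the seed value and the toy frame are assumptions made for illustration.

```python
import pandas as pd

def split_train_test(df, target, test_ratio=0.1, seed=42):
    """Shuffle-split df into (X_train, X_test, y_train, y_test) using a fixed seed."""
    test = df.sample(frac=test_ratio, random_state=seed)
    train = df.drop(test.index)
    X_train = train.drop(columns=[target]).reset_index(drop=True)
    X_test = test.drop(columns=[target]).reset_index(drop=True)
    return X_train, X_test, train[target].to_numpy(), test[target].to_numpy()

# Toy frame standing in for house_price
toy = pd.DataFrame({"area": range(100), "price": range(100)})
X_tr, X_te, y_tr, y_te = split_train_test(toy, "price")
print(X_tr.shape, X_te.shape, len(y_tr), len(y_te))
```

Calling the helper again with the same seed returns the same rows, which makes a reported MSE easier to reproduce.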


In [21]:
class LinearRegression:
    def __init__(self, lr=0.01, n_iters=1000):
        self.lr = lr
        self.n_iters = n_iters
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0

        # Gradient Descent
        for i in range(self.n_iters):

            if (i%100 == 0):
                print("iteration: ", i)

            y_predicted = np.dot(X, self.weights) + self.bias
            # Compute gradients
            dw = (1/n_samples) * np.dot(X.T, (y_predicted - y))
            db = (1/n_samples) * np.sum(y_predicted - y)

            # Update parameters
            self.weights -= self.lr * dw
            self.bias -= self.lr * db

    def predict(self, X):
        return np.dot(X, self.weights) + self.bias

# Convert the train/test splits to NumPy arrays
X_train = train.values  # all six remaining features are used
y_train = np.array(y_train)
X_test = test.values
y_test = np.array(y_test)

# Train the model
model = LinearRegression(lr=0.001, n_iters=5000)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate MSE
mse = np.mean((y_pred - y_test) ** 2)

iteration:  0
iteration:  100
iteration:  200
iteration:  300


  self.weights -= self.lr * dw


iteration:  400
iteration:  500
iteration:  600
iteration:  700
iteration:  800
iteration:  900
iteration:  1000
iteration:  1100
iteration:  1200
iteration:  1300
iteration:  1400
iteration:  1500
iteration:  1600
iteration:  1700
iteration:  1800
iteration:  1900
iteration:  2000
iteration:  2100
iteration:  2200
iteration:  2300
iteration:  2400
iteration:  2500
iteration:  2600
iteration:  2700
iteration:  2800
iteration:  2900
iteration:  3000
iteration:  3100
iteration:  3200
iteration:  3300
iteration:  3400
iteration:  3500
iteration:  3600
iteration:  3700
iteration:  3800
iteration:  3900
iteration:  4000
iteration:  4100
iteration:  4200
iteration:  4300
iteration:  4400
iteration:  4500
iteration:  4600
iteration:  4700
iteration:  4800
iteration:  4900
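
The stray ```self.weights -= self.lr * dw``` line in the log above looks like the tail of a NumPy runtime warning raised during the weight update, although the warning message itself is not shown. One plausible cause is that features such as the scaled latitude and longitude are in the hundreds of thousands, which can make gradient descent with a fixed learning rate diverge. A common remedy is to standardize each feature before training. The sketch below uses synthetic data and a hypothetical ```gradient_descent``` helper, not the lab's dataset or class.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic features on very different scales (placeholders, not the lab data)
X = np.column_stack([rng.uniform(300, 3000, 500),         # area-like column
                     rng.uniform(180000, 200000, 500)])   # scaled-latitude-like column
y = 0.1 * X[:, 0] + 0.001 * X[:, 1] + 5.0

# Standardize each column; the same mean and std would be reused for test data
mean, std = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - mean) / std

def gradient_descent(X, y, lr=0.1, n_iters=1000):
    """Batch gradient descent for linear regression with zero-initialised parameters."""
    weights, bias = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iters):
        error = X @ weights + bias - y
        weights -= lr * (X.T @ error) / len(y)
        bias -= lr * error.mean()
    return weights, bias

weights, bias = gradient_descent(X_scaled, y)
mse = np.mean((X_scaled @ weights + bias - y) ** 2)
print(np.isfinite(mse), mse)
```

With standardized inputs, a much larger learning rate than 0.001 stays stable in this toy setting, and the loss remains finite.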


In [22]:
print(mse)

0.1823793497794478
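
MSE is expressed in squared units of the target (here, lakh squared), so it can be hard to interpret on its own. Two commonly reported companions are RMSE, which is back in the target's units, and the R² score. The sketch below computes all three; ```regression_metrics``` is a hypothetical helper, and the numbers passed to it are made up purely to exercise the function.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (mse, rmse, r2) for two equal-length arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_pred - y_true) ** 2)
    rmse = np.sqrt(mse)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mse, rmse, r2

# Made-up values, not results from the lab
mse_demo, rmse_demo, r2_demo = regression_metrics([100, 200, 300], [110, 190, 300])
print(mse_demo, rmse_demo, r2_demo)
```

In the notebook, the same function could be called with ```y_test``` and ```y_pred```.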

<b>Conclusion:</b> Linear regression was implemented from scratch and applied to the Mumbai house price dataset. Region names were converted to their latitude and longitude so that location could be used as a numeric feature. In the process, I also gained an understanding of gradient descent and the loss function.