
## EDA

In [None]:
import pandas as pd
from pprint import pprint
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from src.utils import preprocess_data
from src.utils import feature_engg

import geopandas as gpd
from shapely.geometry import Point, Polygon

from dotenv import load_dotenv
import os

load_dotenv()
ROOT_DIR = os.environ.get("ROOT_DIR")
os.chdir(ROOT_DIR)

## Data Ingestion

In [None]:
df = pd.read_csv(r"data\raw\Train.csv")

## Data Preprocessing

### Analysing the data

Checking the data type and the null values in our data.

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.isnull().sum()

Checking the unique values of area_type, balcony, total_sqft, and size to understand what needs to be cleaned and how to clean it.

In [None]:
columns = ['area_type','balcony', 'total_sqft', 'size']

for col in columns:
    unique_values = df[col].unique()
    print(f"Unique values for column '{col}': {unique_values}")

## Checking frequency of variables which have null values

This will show us which variables/variable ranges are more prominent in the market

In [None]:
def size_plot(df,col):
    plt.figure(figsize=(30,5))
    sns.countplot(df, x=col)
    plt.title('Number of houses by size')

size_plot(df, 'size')

We see from the graph that most houses are 2 BHK and 3 BHK. However, some houses with 2 or 3 beds and baths are listed as Bedroom only, so we will have to clean the column on that basis and check again.

Spatial analysis using geopandas and geoplots to check location concentration 

In [None]:
def bath_plot():
    plt.figure(figsize=(8,4))
    sns.countplot(x=df.bath)
    plt.title('Number of Houses by bath count')

bath_plot()

Bath counts of 2 and 3 are the most common, though a few houses list 40 bathrooms!

In [None]:
def balcony_plot():
    plt.figure(figsize=(8,4))
    sns.countplot(x=df.balcony)
    plt.title('Number of Houses by balcony count')

balcony_plot()

## Creating plots for other variables to check for other value range and count 

In [None]:
def areatype_plot():
    plt.figure(figsize=(6,4))
    sns.countplot(x=df.area_type)
    plt.title('Area type with count of houses')

areatype_plot()

In [None]:
def availability_plot():
    plt.figure(figsize=(70,10))
    sns.countplot(x=df.availability)
    plt.title('House Availability')

availability_plot()

From this graph we see that most houses are offered as ready to move in, so the data is skewed towards ready-to-move-in houses.

In [None]:
def sqft_plot():
    plt.figure(figsize=(50,6))
    sns.histplot(x=df.total_sqft, binwidth=5000)
    plt.title('Max houses for area(sq ft) range')

sqft_plot()

The total area in sq ft is also highly skewed, so we will have to clean it.
We also saw from the unique values that the entries in this column are not in a consistent format, so we will have to fix that as well.

In [None]:
def price_plot():
    plt.figure(figsize=(6,4))
    sns.histplot(x=df.price, binwidth=100)
    plt.title('Max houses price range')

price_plot()

We can see that the price data is highly skewed, so we will have to remove the skewness.

## Data Cleaning

Making a copy of our dataframe to work on

In [None]:
copy_df = df[['area_type', 'availability', 'location', 'size', 'society', 'total_sqft', 'bath', 
              'balcony', 'price']].copy()

To clean our data we have:
- Removed the words bedroom, bhk and rk from the size column, as the unique values above showed was needed, and then re-plotted the column to better understand the cleaned values.
- Filled the null values using the mean, the median, or a fixed value, chosen based on the initial analysis.
- Replaced total_sqft values given as a range x-y with the mean of the range, (x+y)/2.
- Built a dictionary of conversion factors for the total_sqft values that are not given in sqft.
- Created a function to extract the numeric values from our dataframe copy_df.
- Used these functions to convert all values to sqft.
- Converted the total area in sqft from str to numeric format.
- Removed the skewness of the data.

All of the above is combined in a single function so that the same preprocessing can be applied to the test file as well

In [None]:
copy_df = preprocess_data(copy_df)
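The total_sqft steps listed above can be sketched as follows. This is only an illustration and not the actual body of `preprocess_data`: the helper `to_sqft`, the `UNIT_TO_SQFT` dictionary, and the sample strings are assumptions made for this example, and only three well-known conversion factors are included.

```python
import re

import numpy as np
import pandas as pd

# Conversion factors to square feet (illustrative subset, extend as needed).
UNIT_TO_SQFT = {
    "sq. meter": 10.7639,
    "sq. yards": 9.0,
    "acres": 43560.0,
}

RANGE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*$")
UNIT_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([A-Za-z. ]+?)\s*$")


def to_sqft(value):
    """Convert one raw total_sqft entry to a float number of square feet."""
    text = str(value)
    range_match = RANGE_RE.match(text)
    if range_match:
        # "x - y" becomes the mean of the range, (x + y) / 2.
        low, high = map(float, range_match.groups())
        return (low + high) / 2
    unit_match = UNIT_RE.match(text)
    if unit_match:
        # A number followed by a unit name is converted if the unit is known.
        factor = UNIT_TO_SQFT.get(unit_match.group(2).lower())
        return float(unit_match.group(1)) * factor if factor else np.nan
    try:
        return float(text)
    except ValueError:
        return np.nan


# Hypothetical raw values in the formats described above.
raw = pd.Series(["1056", "1133 - 1384", "34.46Sq. Meter", "2Acres", "unknown"])
clean = raw.apply(to_sqft)
print(clean)
```

Entries that cannot be parsed come back as NaN, so they can be filled later with the same mean or median strategy used for the other null values.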

In [None]:
copy_df.isnull().sum()

Checking the size plot after removing words like bhk, rk and bedrooms

In [None]:
size_plot(copy_df, 'size')
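For reference, stripping words such as bhk, rk and bedroom from the size column can be done with a single regular expression. This is a minimal sketch on made-up values, and it is an assumption that `preprocess_data` does something equivalent.

```python
import pandas as pd

# Hypothetical raw values of the size column.
size = pd.Series(["2 BHK", "4 Bedroom", "1 RK", None])

# Keep only the leading number and convert it to a numeric dtype.
rooms = pd.to_numeric(size.str.extract(r"(\d+)", expand=False))
print(rooms.tolist())
```

Missing entries stay missing after the extraction, so they can still be counted with `isnull().sum()`.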

Checking the price plot after removing the skewness

In [None]:
def price_plot():
    plt.figure(figsize=(6,4))
    sns.histplot(x=copy_df.price, binwidth=0.5)
    plt.title('Max houses price range')

price_plot()
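The effect of removing skewness can also be checked numerically. The sketch below assumes the transformation is `np.log1p`, which is suggested by the `np.expm1` call used on the predictions later in this notebook; the prices are made-up values.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed prices.
price = pd.Series([38, 39.07, 51, 62, 63.25, 95, 120, 204, 370, 600])

log_price = np.log1p(price)          # log(1 + x) compresses the long right tail
print("skew before:", round(price.skew(), 2))
print("skew after: ", round(log_price.skew(), 2))

restored = np.expm1(log_price)       # inverse transform, back to original units
```

Because the model is trained on the transformed price, its predictions have to be passed through `np.expm1` before they are reported.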

## Feature Engineering

- Making a new column month that contains only the month and not the exact date, since the plots above showed that we do not need the exact dates.
- Making a new column ready that contains only values like ready to move and immediate possession, since the plots above showed that we do not need the many individual dates.
- Combining the two columns month and ready gives the column we wanted, and we can now plot our graphs based on these columns.

In [None]:
copy_df = feature_engg(copy_df)
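The three steps above can be sketched on a few sample values. This is an assumption about what `feature_engg` does rather than its actual code, and the availability strings are illustrative.

```python
import pandas as pd

# Hypothetical availability values: dates such as "19-Dec" or a readiness label.
availability = pd.Series(["19-Dec", "Ready To Move", "18-May", "Immediate Possession"])

# month: the three-letter month after the dash, NaN when there is no date.
month = availability.str.extract(r"-([A-Za-z]{3})$", expand=False)

# ready: the original label, kept only where no month was found.
ready = availability.where(month.isna())

# combined: the month where available, otherwise the readiness label.
combined = month.fillna(ready)
print(combined.tolist())
```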

Creating a new column address in the copy_df dataframe to map our locations to their longitudes and latitudes for geospatial analysis

In [None]:
copy_df['address'] = copy_df['location']+',Bangalore,Karnataka,India'

## Creating Plots with Target Variable(Price)

These plots will help determine which variable affects our target variable the most, so that we can move forward with that data.

Boxplot for area type and price dependency

In [None]:
sns.boxplot(x = copy_df['area_type'], y=copy_df['price'])
plt.title('Area Type vs Price')

Boxplot for availability and price dependency

In [None]:
plt.figure(figsize=(30,5))
sns.boxplot(x = copy_df['availability'], y=copy_df['price'])
plt.title('Availability vs Price')

We can see that the results of this graph are not clear, so the data is better divided into months and then checked against the target variable.

In [None]:
plt.figure(figsize=(17,5))
sns.boxplot(x = copy_df['extract'], y=copy_df['price'])
plt.title('Availability vs Price')

In [None]:
plt.figure(figsize=(17,5))
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May','Jun','Jul','Aug','Sep','Oct','Nov', 'Dec']
sns.boxplot(x = copy_df['month'], y=copy_df['price'], order= month_order)
plt.title('Availability on the basis of months vs Price')

In [None]:
plt.figure(figsize=(17,5))
sns.boxplot(x = copy_df['ready'], y=copy_df['price'])
plt.title('Availability on the basis of readiness vs Price')

Boxplot for size and price dependency

In [None]:
sns.boxplot(x = copy_df['size'], y=copy_df['price'])
plt.title('Size vs Price')

Boxplot for location and price dependency 

In [None]:
plt.figure(figsize=(40,6))
sns.boxplot(x = copy_df['location'], y=copy_df['price'])
plt.title('Location vs Price')

Scatterplot for total area in square ft and price dependency

In [None]:
sns.scatterplot(x = copy_df['total_sqft'], y=copy_df['price'])
plt.title('Total area in sqft vs Price')

Boxplot for bath and price dependency 

In [None]:
plt.figure(figsize=(10,4))
sns.boxplot(x = copy_df['bath'], y=copy_df['price'])
plt.title('Bath vs Price')

Boxplot for balcony and price dependency 

In [None]:
sns.boxplot(x = copy_df['balcony'], y=copy_df['price'])
plt.title('Balcony vs Price')

Checking the correlation between bath, balcony, total area, and size to see whether any of these are highly correlated and one has to be ignored.

In [None]:
corr_df = copy_df[['bath', 'balcony', 'total_sqft','size', 'price']]
correlation_matrix = corr_df.corr()

sns.heatmap(correlation_matrix, annot=True, cmap='Blues', fmt='.1f')

We see that size and bath are highly correlated, so one of them can be dropped, chosen based on its correlation with our target variable price.
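One way to make that choice explicit is to compare each feature's absolute correlation with price and drop the weaker one. The sketch below uses made-up numbers, so which column ends up dropped here says nothing about the real data.

```python
import pandas as pd

# Hypothetical values for two correlated features and the target.
toy_df = pd.DataFrame({
    "size": [1, 2, 2, 3, 3, 4],
    "bath": [1, 2, 2, 3, 2, 4],
    "price": [30, 50, 55, 80, 90, 120],
})

# Absolute correlation of each candidate feature with the target.
corr_with_price = toy_df[["size", "bath"]].corrwith(toy_df["price"]).abs()

# The feature that explains the target less is the one to drop.
to_drop = corr_with_price.idxmin()
print(corr_with_price.round(2).to_dict(), "-> drop", to_drop)
```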

Extracting our cleaned data to a file to start modelling on it.

In [None]:
file_path = r'D:\source\repos\OrionDataAnalyticsInternshipJul23-Proj2\data\processed/'
file_name = "cleaned.csv"
extract_info = file_path + file_name
copy_df.to_csv(extract_info, index=False)

## Geo Spatial Analysis

Extracting a new file from copy_df to use for geoencoding

In [None]:
address_df = copy_df['address'].drop_duplicates()
file_path = r'D:\source\repos\OrionDataAnalyticsInternshipJul23-Proj2\data\raw/'
file_name = "address.csv"
extract_info = file_path + file_name
address_df.to_csv(extract_info, index=False)

### Data Ingestion: the shapefile to display the map of Bangalore

In [None]:
bangaloremap = gpd.read_file(r'data\raw\bbmpwards\bbmpwards.shp')

####  Plotting complete Bangalore map using the shapefile

In [None]:
bangaloremap.plot(alpha=0.5, edgecolor='k', legend=True)

### Data Ingestion: the encoded address file

In [None]:
encoded_df = pd.read_csv(r"data\final\encoded_addr.csv")

Reformatting our data into a GeoPandas Dataframe because we cannot use it directly as a dataframe

In [None]:
crs = "EPSG:4326"  # WGS84 longitude/latitude; the {'init': ...} dict form is deprecated
geometry = [Point(xy) for xy in zip(encoded_df['longitude'], encoded_df['latitude'])]
geo_df = gpd.GeoDataFrame(encoded_df, crs = crs, geometry = geometry)

### Plotting our encoded data using subplots on the bangalore map to see if we can come to any conclusion using it

In [None]:
fig, ax = plt.subplots(figsize = (8,8))
bangaloremap.plot(ax=ax, color='lightgrey',edgecolor='Darkgrey')
geo_df.plot(ax=ax)

ax.set_title('Bangalore Real Estate')
ax.set_ylim(12.8, 13.15)
ax.set_xlim(77.45,77.8)

Reducing alpha to see where the houses are concentrated; with every dot drawn at the same intensity, this is difficult to judge.

In [None]:
fig, ax = plt.subplots(figsize = (10,10))
bangaloremap.to_crs(epsg=4326).plot(ax=ax, color='lightgrey',edgecolor='Darkgrey')
geo_df.plot(ax=ax, alpha = .1 )
ax.set_title('Bangalore Real Estate')
ax.set_ylim(12.8, 13.15)
ax.set_xlim(77.45,77.8)
# plt.savefig('Property Map')

Using heatmap to check where the houses with the max price are located, as colours in the heatmap will indicate price range.

In [None]:
geo_df['price'] = copy_df['price']
fig, ax = plt.subplots(figsize = (10,10))
bangaloremap.to_crs(epsg=4326).plot(ax=ax, color='lightgrey',edgecolor='Darkgrey')
geo_df.plot(column = 'price', ax=ax, cmap = 'rainbow',
            legend = True, legend_kwds={'shrink': 0.3}, 
            markersize = 10)
ax.set_title('Bangalore Price Heatmap')
ax.set_ylim(12.80, 13.15)
ax.set_xlim(77.45,77.8)

Using heatmap to check where the houses with the max area are located, as colours in the heatmap will indicate area range.

In [None]:
geo_df['price'] = (copy_df['total_sqft'])
fig, ax = plt.subplots(figsize = (10,10))
bangaloremap.to_crs(epsg=4326).plot(ax=ax, color='lightgrey',edgecolor='Darkgrey')
geo_df.plot(column = 'price', ax=ax, cmap = 'rainbow',
            legend = True, legend_kwds={'shrink': 0.3}, 
            markersize = 10)
ax.set_title('Bangalore House Size Heatmap')
ax.set_ylim(12.80, 13.15)
ax.set_xlim(77.45,77.8)

- We can see from the geospatial analysis that no particular area in the data has a concentration of houses, nor is there an area where house price or house area is always higher.
- We can also see many outliers in the data, which could also suggest that our shapefile is not up to date or that our coordinates are not accurate.
- Our data suggests that location within Bangalore does not have an effect on price, so we do not need it to train our model effectively.

Using a correlation matrix to verify the above analysis

In [None]:
geo_df.drop('price',axis=1,inplace = True)
new_df = pd.merge(copy_df,geo_df, on='address')

corr_df = new_df[['bath', 'balcony', 'total_sqft','size','longitude','latitude', 'price']]
correlation_matrix = corr_df.corr()

sns.heatmap(correlation_matrix, annot=True, cmap='Blues', fmt='.1f')

The correlation matrix supports our analysis: longitude and latitude are not correlated with our target variable, so they do not need to be used when training our model.




## Geo Encoding

In [None]:
import pandas as pd
from pprint import pprint
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from mapbox import Geocoder
import os
from dotenv import load_dotenv
import json
from dataclasses import dataclass,  fields, _MISSING_TYPE
import requests

load_dotenv()
ROOT_DIR = os.environ.get("ROOT_DIR")
os.chdir(ROOT_DIR)

## Data Ingestion

In [None]:
address_df = pd.read_csv(r"data\raw\address.csv")

## Using Mapbox geocoding api

Trying for one address only

In [None]:
load_dotenv()
API_KEY = os.getenv('API_KEY')
geocoder = Geocoder(access_token= API_KEY)

address_geocodes  = []
address = address_df["address"][0]
print(f"Address in my data: {address}\n----------")

response = geocoder.forward(address).json()
print(f"Response received: \n")
pprint(response)
print(f"\n--------")

address_geocode = {'address': address,
'place_name': response["features"][0]["place_name"],
    'relevance': response["features"][0]["relevance"],
    'bbox': response["features"][0]["bbox"],
    'longitude': response["features"][0]["center"][0],
    'lattitude': response["features"][0]["center"][1]

    }
address_geocodes.append(address_geocode)

print(f"keeping the essential values, we get,")
pprint(address_geocode)
pprint(address_geocodes)

df = pd.DataFrame(address_geocodes)

print(df)

Geoencoding all the addresses

In [None]:
load_dotenv()
API_KEY = os.getenv('API_KEY')
geocoder = Geocoder(access_token= API_KEY)

address_geocodes  = []

for address in address_df.address:
    response = geocoder.forward(address).json()

    address_geocode = {'address': address,
    'place_name': response["features"][0]["place_name"],
        'relevance': response["features"][0]["relevance"],
        'longitude': response["features"][0]["center"][0],
        'lattitude': response["features"][0]["center"][1]

        }
    address_geocodes.append(address_geocode)

df = pd.DataFrame(address_geocodes)
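The loop above indexes `response["features"][0]` directly, which raises an `IndexError` if the API returns no features for an address. A more defensive way to flatten a response is sketched below; `parse_mapbox_response` is a hypothetical helper, the sample responses are made up, and only the fields already used in this notebook are read. The `lattitude` key keeps the spelling used in the rest of the notebook.

```python
def parse_mapbox_response(address, response):
    """Flatten one geocoding response into a record, tolerating empty results."""
    features = response.get("features") or []
    if not features:
        # No match: keep the address so the row can be inspected later.
        return {"address": address, "place_name": "NOT_FOUND", "relevance": 0.0,
                "longitude": None, "lattitude": None}
    best = features[0]
    longitude, latitude = best["center"]
    return {"address": address, "place_name": best["place_name"],
            "relevance": best["relevance"],
            "longitude": longitude, "lattitude": latitude}


# Made-up responses shaped like the fields used in the cells above.
hit = {"features": [{"place_name": "Some Locality, Bengaluru", "relevance": 1,
                     "center": [77.61, 12.93]}]}
miss = {"features": []}

print(parse_mapbox_response("Some Locality,Bangalore,Karnataka,India", hit))
print(parse_mapbox_response("Unknown Place,Bangalore,Karnataka,India", miss))
```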

Checking what our dataframe looks like now

In [None]:
df

Exporting the file containing all the geoencoded data so that we do not have to wait for geoencoding every time

In [None]:
file_path = r'data\processed/'
file_name = "addresses_gecoded_by_mapboxAPI.csv"
extract_info = file_path + file_name
df.to_csv(extract_info, index=False)

## EDA on Mapbox geocoded data

Data ingestion of the geo-encoded file

In [None]:
df = pd.read_csv(r"data\processed\addresses_gecoded_by_mapboxAPI.csv")

Finding all the rows where latitude and longitude point to the general location of Bangalore only

In [None]:
inaccurate_df = df.loc[(df['longitude'] == 77.591300) & (df['lattitude'] == 12.979120)]
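Comparing floats with `==` works here only as long as the coordinates survive the CSV round trip exactly. A tolerance-based comparison is a little more robust; the sketch below uses made-up rows, and the tolerance of 1e-6 degrees is an assumption.

```python
import numpy as np
import pandas as pd

# General Bangalore coordinates used in the filter above.
CITY_LON, CITY_LAT = 77.591300, 12.979120

# Hypothetical geocoded rows, using the notebook's column spelling.
coords = pd.DataFrame({
    "longitude": [77.5913, 77.6101, 77.59130000001],
    "lattitude": [12.97912, 12.9352, 12.97912],
})

# rtol=0 so that only the absolute tolerance (in degrees) applies.
is_generic = (
    np.isclose(coords["longitude"], CITY_LON, rtol=0, atol=1e-6)
    & np.isclose(coords["lattitude"], CITY_LAT, rtol=0, atol=1e-6)
)
print(is_generic.tolist())
```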

Plotting distribution of relevance, longitude and latitude 

In [None]:
sns.histplot(df.relevance)
plt.title("Histplot for Relevance")

In [None]:
sns.histplot(df.longitude)
plt.title("Histplot for Longitude")

In [None]:
sns.histplot(df.lattitude)
plt.title("Histplot for Latitude")

Plotting longitude and latitude using scatterplot

In [None]:
plt.figure(figsize=(10,10))
sns.scatterplot(x= df.longitude, y=df.lattitude)
plt.title('Latitude vs Longitude')
plt.ylabel('latitude')

## Using https://geocode.maps.co/ free api

We create a dataclass that falls back to default values, so that addresses the API has no data for do not raise errors

In [None]:
@dataclass
class GeocodedAddress:
    address: str
    place_name: str = 'NOT_FOUND' 
    relevance: float = 0
    longitude: float = None
    latitude: float = None

    def __post_init__(self):
        # Loop through the fields
        for field in fields(self):
            # If there is a default and the value of the field is none we can assign a value
            if not isinstance(field.default, _MISSING_TYPE) and getattr(self, field.name) is None:
                setattr(self, field.name, field.default)
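To see what `__post_init__` does, the sketch below builds a record with some values explicitly set to `None`. It is a variant of the class above that uses the public `dataclasses.MISSING` sentinel instead of the private `_MISSING_TYPE`; the address string is made up.

```python
from dataclasses import MISSING, asdict, dataclass, fields
from typing import Optional


@dataclass
class GeocodedAddress:
    address: str
    place_name: str = "NOT_FOUND"
    relevance: float = 0
    longitude: Optional[float] = None
    latitude: Optional[float] = None

    def __post_init__(self):
        # Replace any None value by the field's default, if it has one.
        for field in fields(self):
            if field.default is not MISSING and getattr(self, field.name) is None:
                setattr(self, field.name, field.default)


# Values passed as None fall back to their defaults.
partial = GeocodedAddress(address="Some Locality,Bangalore,Karnataka,India",
                          place_name=None, relevance=None)
print(asdict(partial))
```

`asdict` is used here in place of `__dict__` to get a plain dictionary that can be appended to a list and passed to `pd.DataFrame`.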

Trying for one address only

In [None]:
address_geocodes  = []
address = address_df["address"][3]
print(f"Address in my data: {address}\n----------")

geocode_base_url = f"https://geocode.maps.co/search?q={address}"
response = requests.get(geocode_base_url).json()
print(f"Response received: \n")
pprint(response)
print(f"\n--------")


if len(response) >=1:

    address_geocode = GeocodedAddress(
        address=address,
        place_name = response[0]["display_name"],
        relevance = response[0]["importance"],
        longitude = response[0]["lon"],
        latitude = response[0]["lat"]
    )
else: 
    address_geocode = GeocodedAddress(
        address=address
    )

# Runs for both branches: the record is stored whether or not the address was found
address_geocodes.append(address_geocode.__dict__)

print("Keeping the essential values, we get:")
pprint(address_geocode)
pprint(address_geocodes)

df = pd.DataFrame(address_geocodes)

print(df)

Geoencoding for all the addresses

In [None]:
address_geocodes  = []

for address in address_df.address:
    print(address)
    geocode_base_url = f"https://geocode.maps.co/search?q={address}"
    response = requests.get(geocode_base_url).json()

    if len(response) >=1:

        address_geocode = GeocodedAddress(
            address=address,
            place_name = response[0]["display_name"],
            relevance = response[0]["importance"],
            longitude = response[0]["lon"],
            latitude = response[0]["lat"]
        )
    else: 
        address_geocode = GeocodedAddress(
            address=address
        )

    address_geocodes.append(address_geocode.__dict__)


geocode_map_df = pd.DataFrame(address_geocodes)
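Two details of the loop above are worth a sketch: the address is pasted into the URL without encoding, and the coordinates are stored as returned. The helpers below are hypothetical, and it is an assumption that the API returns `lon`, `lat` and `importance` as strings or numbers that `float` can parse.

```python
from urllib.parse import urlencode

BASE_URL = "https://geocode.maps.co/search"


def build_search_url(address):
    """Build the query URL with the address safely percent-encoded."""
    return f"{BASE_URL}?{urlencode({'q': address})}"


def parse_first_hit(address, hits):
    """Keep the first result and convert its numeric fields to float."""
    if not hits:
        return {"address": address, "place_name": "NOT_FOUND", "relevance": 0.0,
                "longitude": None, "latitude": None}
    first = hits[0]
    return {"address": address, "place_name": first["display_name"],
            "relevance": float(first["importance"]),
            "longitude": float(first["lon"]), "latitude": float(first["lat"])}


print(build_search_url("Electronic City Phase II,Bangalore"))

# Made-up response in the shape used by the cells above.
sample_hits = [{"display_name": "Some Locality, Bengaluru", "importance": "0.4",
                "lon": "77.61", "lat": "12.93"}]
print(parse_first_hit("Some Locality,Bangalore,Karnataka,India", sample_hits))
```

Encoding matters for addresses that contain characters such as `&` or `#`, which would otherwise be read as part of the URL syntax.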

Exporting the file containing all the geoencoded data so that we do not have to wait for geoencoding every time

In [None]:
file_path = r'data\processed/'
file_name = "addresses_geocoded_by_geocodemaps.csv"
extract_info = file_path + file_name
geocode_map_df.to_csv(extract_info, index=False)

## EDA on geocode.maps.co geocoded data

Data ingestion on the geoencoded file

In [None]:
geocode_map_df = pd.read_csv(r"data\processed\addresses_geocoded_by_geocodemaps.csv")

Checking what our dataframe looks like

In [None]:
geocode_map_df

Finding all the rows where place_name is NOT_FOUND

In [None]:
not_found_df = geocode_map_df.loc[geocode_map_df['place_name'] == 'NOT_FOUND']

Plotting distribution of relevance, longitude and latitude

In [None]:
sns.histplot(geocode_map_df.relevance)
plt.title("Histplot for Relevance")

In [None]:
sns.histplot(geocode_map_df.longitude)
plt.title("Histplot for Longitude")

In [None]:
sns.histplot(geocode_map_df.latitude)
plt.title("Histplot for Latitude")

Plotting longitude and latitude using scatterplot

In [None]:
plt.figure(figsize=(10,10))
sns.scatterplot(x= geocode_map_df.longitude, y=geocode_map_df.latitude)
plt.title('Latitude vs Longitude')

We used two geoencoding APIs, Mapbox and geocode.maps.co. From the analysis above it is clear that Mapbox encodes our addresses far more reliably than geocode.maps.co: the NOT_FOUND rows show that geocode.maps.co could not find 477 addresses, whereas Mapbox failed on only 154.
Furthermore, for the addresses it did find, Mapbox reports a relevance of 1 for most of them, which means it was able to encode the exact address we wanted. The relevance is much lower for geocode.maps.co, which also shows in the scatterplots.

## Merging the 2 dataframes for better results 

Extracting only the addresses for which Mapbox returned the general Bangalore coordinates instead of a specific longitude and latitude

In [None]:
addr_inaccurate_df = inaccurate_df['address']

Merged the address column with the geocode url encoded data to get accurate coordinates in place of the general coordinates

In [None]:
merged_df = pd.merge(addr_inaccurate_df,geocode_map_df)

Checking how many of them are found and creating a dataframe for it

In [None]:
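# .copy() gives an independent frame, so modifying found_df in place later does not trigger SettingWithCopyWarning
found_df = merged_df.loc[merged_df['place_name']!='NOT_FOUND'].copy()
found_df.shape[0]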

We see that 54 rows now have better coordinates than the general Bangalore coordinates, so we write them back into our original df

In [None]:
# Set the index of found_df to be the address column
found_df.set_index('address', inplace=True)

# Iterate through the rows of df and update the longitude and latitude values
for index, row in df.iterrows():
    address = row['address']
    if address in found_df.index:
        df.at[index, 'place_name'] = found_df.at[address, 'place_name']
        df.at[index, 'longitude'] = found_df.at[address, 'longitude']
        df.at[index, 'lattitude'] = found_df.at[address, 'latitude']

# Reset the index to bring the DataFrame back to its original format
df.reset_index(drop=True, inplace=True)
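The same update can be written without `iterrows` by mapping addresses onto the lookup frame. This is a sketch on made-up frames; `mapbox_df` and `fallback_df` are hypothetical stand-ins for `df` and `found_df`, and it assumes the addresses in the lookup frame are unique.

```python
import pandas as pd

# Stand-in for df: rows A and C carry the general city coordinates.
mapbox_df = pd.DataFrame({
    "address": ["A", "B", "C"],
    "place_name": ["Bangalore", "B Street", "Bangalore"],
    "longitude": [77.5913, 77.70, 77.5913],
    "lattitude": [12.97912, 12.95, 12.97912],
})

# Stand-in for found_df, indexed by address.
fallback_df = pd.DataFrame({
    "address": ["A"],
    "place_name": ["A Road"],
    "longitude": [77.61],
    "latitude": [12.93],
}).set_index("address")

# Rows whose address has a better match in the fallback frame.
mask = mapbox_df["address"].isin(fallback_df.index)

# Copy each column across; the last pair bridges the two spellings.
for target, source in [("place_name", "place_name"),
                       ("longitude", "longitude"),
                       ("lattitude", "latitude")]:
    mapbox_df.loc[mask, target] = mapbox_df.loc[mask, "address"].map(fallback_df[source])

print(mapbox_df)
```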

Conducting Sanity Checks

In [None]:
df

In [None]:
df.loc[(df['longitude'] == 77.591300) & (df['lattitude'] == 12.979120)]

## Final geoencoded data 

In [None]:
plt.figure(figsize=(10,10))
sns.scatterplot(x= df.longitude, y=df.lattitude)
plt.title('Latitude vs Longitude')
plt.ylabel('latitude')

Extracting this dataframe now as it is better than our original encoded datasets.

In [None]:
file_path = r'data\final/'
file_name = "encoded_addr.csv"
extract_info = file_path + file_name
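# Export the merged dataframe df (not geocode_map_df), renaming its misspelled column so the file has a 'latitude' column
df.rename(columns={'lattitude': 'latitude'}).to_csv(extract_info, index=False)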

## Modelling 

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression 
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score  
import joblib
import os 

from src.utils import preprocess_data
from src.utils import feature_engg

## Data Ingestion

In [None]:
model_df = pd.read_csv(r"data\processed\cleaned.csv")

## Linear Regression Model

In [None]:
X= model_df[['area_type', 'total_sqft', 'bath', 'balcony', 'extract']]
Y = model_df['price']

# x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=50)

column_trans = ColumnTransformer(transformers=
                                 [('onehot', OneHotEncoder(), ['area_type', 'extract']),
                                  ('scaler', StandardScaler(), ['total_sqft', 'bath', 'balcony'])],
                                  remainder='passthrough')

pipeline = make_pipeline(column_trans, LinearRegression())

pipeline.fit(X, Y)

# y_pred = pipeline.predict(x_test)

filename = os.path.join("models", "1st_model_LR.joblib")
joblib.dump(pipeline, filename)

loaded_model_LR = joblib.load(filename)
result = loaded_model_LR.score(X, Y)
print(result)
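The score printed above is computed on the same rows the model was fitted on, so it tends to be optimistic. A hold-out estimate, along the lines of the commented-out `train_test_split` call, can be sketched as follows. The data is synthetic and the column subset is reduced for brevity; `handle_unknown="ignore"` is an added assumption so that categories unseen during fitting do not raise an error at prediction time.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for model_df (hypothetical values, not the real data).
rng = np.random.default_rng(0)
n_rows = 200
toy_df = pd.DataFrame({
    "area_type": rng.choice(["Plot Area", "Built-up Area"], size=n_rows),
    "total_sqft": rng.uniform(500, 3000, size=n_rows),
    "bath": rng.integers(1, 5, size=n_rows),
})
toy_df["price"] = (0.001 * toy_df["total_sqft"] + 0.1 * toy_df["bath"]
                   + rng.normal(0, 0.05, size=n_rows))

X = toy_df[["area_type", "total_sqft", "bath"]]
y = toy_df["price"]
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=50)

column_trans = ColumnTransformer(transformers=[
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["area_type"]),
    ("scaler", StandardScaler(), ["total_sqft", "bath"]),
])
pipeline = make_pipeline(column_trans, LinearRegression())
pipeline.fit(x_train, y_train)

# Compare the fit on seen rows with the fit on unseen rows.
train_r2 = pipeline.score(x_train, y_train)
holdout_r2 = r2_score(y_test, pipeline.predict(x_test))
print(f"train R2: {train_r2:.3f}, hold-out R2: {holdout_r2:.3f}")
```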

## Random Forest Regressor Model

In [None]:
X = model_df[['area_type', 'total_sqft', 'bath', 'balcony', 'extract']]
Y = model_df['price']

# x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=50)

column_trans = ColumnTransformer(transformers=
                                 [('onehot', OneHotEncoder(), ['area_type', 'extract']),
                                  ('scaler', StandardScaler(), ['total_sqft', 'bath', 'balcony'])],
                                  remainder='passthrough')

pipeline = make_pipeline(column_trans, RandomForestRegressor())

pipeline.fit(X, Y)

# y_pred = pipeline.predict(x_test)

filename = os.path.join("models", "1st_model_RF.joblib")
joblib.dump(pipeline, filename)

loaded_model_RF1 = joblib.load(filename)
result = loaded_model_RF1.score(X, Y)
print(result)

Next we check the random forest again after adding size, the parameter that was found to be highly correlated during EDA. Since the size column contained very few values, it is possible that the correlation matrix results are not fully accurate, so the model is worth testing with this parameter included.

In [None]:
X = model_df[['area_type','size','total_sqft', 'bath', 'balcony', 'extract']]
Y = model_df['price']

# x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.25, random_state=50)

column_trans = ColumnTransformer(transformers=
                                 [('onehot', OneHotEncoder(), ['area_type', 'extract']),
                                  ('scaler', StandardScaler(), ['size','total_sqft', 'bath', 'balcony'])],
                                  remainder='passthrough')

pipeline = make_pipeline(column_trans, RandomForestRegressor())

pipeline.fit(X, Y)

# y_pred = pipeline.predict(x_test)

filename = os.path.join("models", "2nd_model_RF.joblib")
joblib.dump(pipeline, filename)

loaded_model_RF2 = joblib.load(filename)
result = loaded_model_RF2.score(X, Y)
print(result)

## Evaluation

Using the test data to evaluate our models created.
- We first read the test file
- Then preprocess the test file
- Finally we add the new features in the test file

In [None]:
test_data = pd.read_csv(r"data/raw/Test.csv")
clean_test_data = preprocess_data(test_data)
clean_test_data = feature_engg(clean_test_data)

### Linear Regression

In [None]:
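x_test = clean_test_data[['area_type','total_sqft', 'bath', 'balcony', 'extract']]
y_test = clean_test_data['price']  # true prices; scoring requires the test file to provide them

y_pred_lr = loaded_model_LR.predict(x_test)
# Score against the true prices: scoring against y_pred_lr itself always returns 1.0
result = loaded_model_LR.score(x_test, y_test)
print("R-squared value on test data:", result)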
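One thing to keep in mind when reading an R-squared value: a model scored against its own predictions always gets 1.0, so the second argument to `score` has to be the true prices. The sketch below computes R-squared by hand on made-up numbers to show the difference; `r_squared` is a hypothetical helper.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot


# Made-up log-prices and predictions.
y_true = np.array([3.7, 4.1, 4.8, 5.3])
y_pred = np.array([3.9, 4.0, 4.6, 5.5])

print("predictions vs themselves:", r_squared(y_pred, y_pred))
print("predictions vs true values:", round(r_squared(y_true, y_pred), 3))
```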

### Random Forest

Using the first random forest model

In [None]:
clean_test_data.head()

In [None]:
x_test = clean_test_data[['area_type','total_sqft', 'bath', 'balcony', 'extract']]
y_pred = clean_test_data['price']

loaded_model_RF1 = joblib.load(os.path.join("models", "1st_model_RF.joblib"))
y_pred_rf1 = loaded_model_RF1.predict(x_test)
price = np.expm1(y_pred_rf1)

# result = loaded_model_RF1.score(x_test, y_pred_rf1)
# print("R-squared value on test data:", result)

In [None]:
headerList = ['price']
pd.DataFrame(price).to_csv(r'data\final\submission_RF1.csv',header=headerList, index_label= 'id')

Using the second random forest model

In [None]:
x_test = clean_test_data[['area_type','size','total_sqft', 'bath', 'balcony', 'extract']]
y_pred = clean_test_data['price']

y_pred_rf2 = loaded_model_RF2.predict(x_test)
price = np.expm1(y_pred_rf2)
# result = loaded_model_RF2.score(x_test, y_pred_rf2)
# print("R-squared value on test data:", result)

In [None]:
headerList = ['price']
pd.DataFrame(price).to_csv(r'data\final\submission_RF2.csv',header=headerList, index_label= 'id')

## Machine hack r score

![image.png](attachment:image.png)
