<div>
    <h1><center style="background-color:#C39BD3; color:white;">🏠 Housing Prices in Indian Metropolitan Areas</center></h1>
</div>

<div>
<img src="https://i.imgur.com/Q5IhUpF.gif">
</div>

<div class="alert alert-warning">
<p>Having been born and brought up in a metropolitan city, I've watched the city develop and housing prices rise depending on the amenities available in a particular region. This was my motivation for putting together a dataset for analysis 😄 <br><br>
Now let's delve into the factors that govern the pricing!
</p>
</div>

<div>  
<h3><center style="background-color:#C39BD3; color:white;"><strong>Importing Libraries 📚</strong></center></h3>
</div>

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.image as mpimg
import folium
import math
import plotly.graph_objects as go
import plotly.express as px
import eli5
import graphviz
import networkx as nx

from eli5.sklearn import PermutationImportance
from folium import Choropleth, Circle, Marker
from folium.plugins import HeatMap, MarkerCluster
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import ExtraTreesRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error,r2_score
from sklearn.model_selection import train_test_split
from geopy.geocoders import Nominatim
from sklearn import tree
from pdpbox import pdp, get_dataset, info_plots
from string import ascii_letters
from colorama import Fore, Back, Style
y_ = Fore.YELLOW
r_ = Fore.RED
g_ = Fore.GREEN
b_ = Fore.BLUE
m_ = Fore.MAGENTA

In [None]:
# Nominatim ships with geopy (geopy.geocoders.Nominatim), so no separate package is needed.
!pip install geopy
!pip install folium

<div>  
<h3><center style="background-color:#C39BD3; color:white;"><strong>Custom Color Palette 🎨</strong></center></h3>
</div>

In [None]:
custom_colors = ["#4e89ae", "#c56183","#ed6663","#ffa372"]
sns.set_palette(sns.color_palette(custom_colors))  # set_palette returns None, so there is nothing to assign

In [None]:
sns.palplot(sns.color_palette(custom_colors),size=1)
plt.tick_params(axis='both', labelsize=0, length = 0)

<center style="background: #93C0A4; font-size: 20px; padding: 10px; border: 1px solid lightgray; margin: 10px; width:100px; color:white;">
    Mumbai
</center>

In [None]:
mumbai = sns.dark_palette(custom_colors[0], reverse=True)
sns.palplot(sns.color_palette(mumbai),size=1)
plt.tick_params(axis='both', labelsize=0, length = 0)

<center style="background: #93C0A4; font-size: 20px; padding: 10px; border: 1px solid lightgray; margin: 10px; width:100px; color:white;">
    Delhi
</center>

In [None]:
delhi = sns.dark_palette(custom_colors[1], reverse=True)
sns.palplot(sns.color_palette(delhi),size=1)
plt.tick_params(axis='both', labelsize=0, length = 0)

<center style="background: #93C0A4; font-size: 20px; padding: 10px; border: 1px solid lightgray; margin: 10px; width:100px; color:white;">
    Chennai
</center>

In [None]:
chennai = sns.dark_palette(custom_colors[2], reverse=True)
sns.palplot(sns.color_palette(chennai),size=1)
plt.tick_params(axis='both', labelsize=0, length = 0)

<center style="background: #93C0A4; font-size: 20px; padding: 10px; border: 1px solid lightgray; margin: 10px; width:150px; color:white;">
    Hyderabad
</center>

In [None]:
hyderabad = sns.dark_palette(custom_colors[3], reverse=True)
sns.palplot(sns.color_palette(hyderabad),size=1)

<div>  
<h3><center style="background-color:#C39BD3; color:white;"><strong>Loading the dataset and displaying rows ⌛</strong></center></h3>
</div>

In [None]:
df1 = pd.read_csv('../input/housing-prices-in-metropolitan-areas-of-india/Mumbai.csv')
df2 = pd.read_csv('../input/housing-prices-in-metropolitan-areas-of-india/Delhi.csv')
df3 = pd.read_csv('../input/housing-prices-in-metropolitan-areas-of-india/Chennai.csv')
df4 = pd.read_csv('../input/housing-prices-in-metropolitan-areas-of-india/Hyderabad.csv')

In [None]:
df1.head(5)

In [None]:
df2.head(5)

In [None]:
df3.head(5)

In [None]:
df4.head(5)

<div class="alert alert-warning">
<p>📌 For some houses the listing said nothing about certain amenities, and such entries are marked with '9'. A 9 therefore means the information about the apartment is missing; it does not mean the amenity is absent in real life.<br><br>
We will drop the rows containing these values so that they don't cloud our analysis.
</p>
</div>
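
A narrower variant of this cleanup, sketched on a toy frame: the sentinel is swapped for NaN only in the amenity columns, so a genuine 9 in a column such as `No. of Bedrooms` would survive. The list `amenity_cols` and all toy values are assumptions for illustration, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): 9 marks "amenity not reported".
toy = pd.DataFrame({
    "Price": [4500000, 9000000, 7200000],
    "No. of Bedrooms": [2, 9, 3],   # a real 9 that should be kept
    "Gymnasium": [1, 0, 9],
    "SwimmingPool": [0, 1, 9],
})

amenity_cols = ["Gymnasium", "SwimmingPool"]  # assumed amenity columns

cleaned = toy.copy()
# Replace the sentinel only where it can mean "unknown".
cleaned[amenity_cols] = cleaned[amenity_cols].replace(9, np.nan)
# Drop rows whose amenity information is missing.
cleaned = cleaned.dropna(subset=amenity_cols)
print(cleaned)
```

The notebook itself replaces 9 in every column, which is simpler but would also discard a house that really has nine bedrooms.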

In [None]:
df1.replace(9, np.nan, inplace=True)
df2.replace(9, np.nan, inplace=True)
df3.replace(9, np.nan, inplace=True)
df4.replace(9, np.nan, inplace=True)

In [None]:
df1 = df1.dropna()
df2 = df2.dropna()
df3 = df3.dropna()
df4 = df4.dropna()

<div>  
<h3><center style="background-color:#C39BD3; color:white;"><strong>Dataframe shape after dropping values</strong></center></h3>
</div>

In [None]:
print(f"{y_}Mumbai:{r_}{df1.shape}\n")
print(f"{y_}Delhi:{r_}{df2.shape}\n")
print(f"{y_}Chennai:{r_}{df3.shape}\n")
print(f"{y_}Hyderabad:{r_}{df4.shape}\n")

In [None]:
print(f"{y_}Data types of data columns: \n{m_}{df1.dtypes}")

<div class="alert alert-warning">
<p>Converting Price from rupees to lakhs (1 lakh = 100,000 INR)
</p>
</div>
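
The conversion can also be written once and applied to each city in a loop, which avoids copy-paste slips between the four frames. This is only a sketch: `to_lakhs`, `RUPEES_PER_LAKH`, and the toy `city_frames` dictionary are hypothetical names and values.

```python
import pandas as pd

RUPEES_PER_LAKH = 100_000  # 1 lakh = 100,000 INR


def to_lakhs(prices: pd.Series) -> pd.Series:
    """Convert a Series of rupee prices to lakhs."""
    return prices / RUPEES_PER_LAKH


# Hypothetical frames standing in for the per-city tables.
city_frames = {
    "Mumbai": pd.DataFrame({"Price": [4_850_000, 12_000_000]}),
    "Delhi": pd.DataFrame({"Price": [7_500_000]}),
}

# Each frame is converted from its own Price column.
for name, frame in city_frames.items():
    frame["Price"] = to_lakhs(frame["Price"])
    print(name, frame["Price"].tolist())
```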

In [None]:
# Each city is converted from its own Price column
df1['Price'] = df1['Price']/100000
df2['Price'] = df2['Price']/100000
df3['Price'] = df3['Price']/100000
df4['Price'] = df4['Price']/100000

<div>  
<h3><center style="background-color:#C39BD3; color:white;"><strong>Feature generation: latitude and longitude 🌐</strong></center></h3>
</div>

In [None]:
geolocator = Nominatim(user_agent="Ruch")

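def feature_generation(df):
    """Geocode every Location and add Latitude/Longitude columns in place."""
    lat=[]
    long=[]
    for a, i in enumerate(df['Location']):
        try:
            # geocode() returns None for unknown places and can raise on network errors
            location = geolocator.geocode(i)
            lat.append(location.latitude)
            long.append(location.longitude)
            print(a)  # progress indicator
        except Exception:
            # "NA" is converted to NaN later in preprocess()
            lat.append("NA")
            long.append("NA")
    df['Latitude'] = lat
    df['Longitude'] = long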

In [None]:
# feature_generation(df1)
# feature_generation(df2)
# feature_generation(df3)
# feature_generation(df4)
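
Because many rows share the same `Location`, the number of geocoding requests can be cut by looking up each distinct place once and mapping the result back. The sketch below assumes a callable `geocode_fn` that behaves like `geolocator.geocode` (returns an object with `.latitude` and `.longitude`, or `None`); `add_coordinates`, `fake_geocode`, and the coordinates in the stub are hypothetical and only there so the example runs offline.

```python
from types import SimpleNamespace

import pandas as pd


def add_coordinates(df, geocode_fn):
    """Return a copy of df with Latitude/Longitude, geocoding each place once."""
    nan_pair = (float("nan"), float("nan"))
    coords = {}
    for place in df["Location"].dropna().unique():
        try:
            hit = geocode_fn(place)
        except Exception:
            hit = None  # treat lookup errors as "not found"
        coords[place] = (hit.latitude, hit.longitude) if hit else nan_pair
    out = df.copy()
    out["Latitude"] = out["Location"].map(lambda p: coords.get(p, nan_pair)[0])
    out["Longitude"] = out["Location"].map(lambda p: coords.get(p, nan_pair)[1])
    return out


calls = []  # records every lookup so the saving is visible


def fake_geocode(place):
    """Offline stand-in for a real geocoder (illustrative coordinates)."""
    calls.append(place)
    table = {"Andheri": (19.12, 72.85)}
    if place in table:
        lat, lon = table[place]
        return SimpleNamespace(latitude=lat, longitude=lon)
    return None


demo = pd.DataFrame({"Location": ["Andheri", "Andheri", "Nowhere"]})
result = add_coordinates(demo, fake_geocode)
print(result)
print("lookups made:", len(calls))
```

With a real geocoder, `add_coordinates(df1, geolocator.geocode)` would be the corresponding call; missing places come back as NaN rather than the string "NA".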

In [None]:
# df1.to_csv('/kaggle/working/Mumbai_updated.csv')
# df2.to_csv('/kaggle/working/Delhi_updated.csv')
# df3.to_csv('/kaggle/working/Chennai_updated.csv')
# df4.to_csv('/kaggle/working/Hyderabad_updated.csv')

In [None]:
df1 = pd.read_csv('../input/intermediate-notebooks-data/Mumbai_updated.csv')
df2 = pd.read_csv('../input/intermediate-notebooks-data/Delhi_updated.csv')
df3 = pd.read_csv('../input/intermediate-notebooks-data/Chennai_updated.csv')
df4 = pd.read_csv('../input/intermediate-notebooks-data/Hyderabad_updated.csv')

In [None]:
df1.head(5)

In [None]:
df1 = df1.drop(['Unnamed: 0'], axis = 1) 
df2 = df2.drop(['Unnamed: 0'], axis = 1) 
df3 = df3.drop(['Unnamed: 0'], axis = 1) 
df4 = df4.drop(['Unnamed: 0'], axis = 1) 

<div>  
<h3><center style="background-color:#C39BD3; color:white;"><strong>EDA 📊</strong></center></h3>
</div>

In [None]:
sns.set_style("whitegrid")

In [None]:
def triple_plot(x, title,c):
    fig, ax = plt.subplots(3,1,figsize=(20,10),sharex=True)
    sns.histplot(x, kde=True, ax=ax[0],color=c)  # distplot is deprecated in recent seaborn
    ax[0].set(xlabel=None)
    ax[0].set_title('Histogram + KDE')
    sns.boxplot(x=x, ax=ax[1],color=c)
    ax[1].set(xlabel=None)
    ax[1].set_title('Boxplot')
    sns.violinplot(x=x, ax=ax[2],color=c)
    ax[2].set(xlabel=None)
    ax[2].set_title('Violin plot')
    fig.suptitle(title, fontsize=16)
    plt.tight_layout(pad=3.0)
    plt.show()

In [None]:
triple_plot(df1['Price'],'Distribution of Price(in lakhs) in Mumbai',custom_colors[0])

In [None]:
triple_plot(df2['Price'],'Distribution of Price(in lakhs) in Delhi',custom_colors[1])

In [None]:
triple_plot(df3['Price'],'Distribution of Price(in lakhs) in Chennai',custom_colors[2])

In [None]:
triple_plot(df4['Price'],'Distribution of Price(in lakhs) in Hyderabad',custom_colors[3])

In [None]:
def count_plot(data,title,p):
    df5=data[data['Resale']== 0]
    df6=data[data['Resale']== 1]
    fig, ax = plt.subplots(1,2,figsize=(15, 10))
    ax[0]=sns.countplot(y='Location', data=df5, order=df5.Location.value_counts().index[:10],ax=ax[0],palette = p)
    ax[0].set_title('Number of New Properties')
    ax[1]=sns.countplot(y='Location', data=df6, order=df6.Location.value_counts().index[:10],ax=ax[1],palette = p)
    ax[1].set_title('Number of Resale Properties')   
    
    fig.suptitle(title, fontsize=16)
    plt.tight_layout(pad=3.0)
    plt.show()

In [None]:
count_plot(df1,'New and Resale Properties in Mumbai',mumbai)

In [None]:
count_plot(df2,'New and Resale Properties in Delhi',delhi)

In [None]:
count_plot(df3,'New and Resale Properties in Chennai',chennai)

In [None]:
count_plot(df4,'New and Resale Properties in Hyderabad',hyderabad)

In [None]:
def cat_plot(data,title,p):
    sns.catplot(x="No. of Bedrooms", y="Price", data=data,palette = p)
    plt.title('No. of Bedrooms vs Price in '+ title,size=16)
    plt.gcf().set_size_inches(6,8)
    plt.show()

In [None]:
cat_plot(df1,'Mumbai',mumbai)

In [None]:
cat_plot(df2,'Delhi',delhi)

In [None]:
cat_plot(df3,'Chennai',chennai)

In [None]:
cat_plot(df4,'Hyderabad',hyderabad)

In [None]:
def scatter_plot(data,title,c):
    sns.scatterplot(x="Area", y="Price", data=data,color=c,marker="P")
    plt.title('Area in square feet vs Price in '+ title,size=16)
    plt.gcf().set_size_inches(6,8)
    plt.show()

In [None]:
scatter_plot(df1,'Mumbai',custom_colors[0])

In [None]:
scatter_plot(df2,'Delhi',custom_colors[1])

In [None]:
scatter_plot(df3,'Chennai',custom_colors[2])

In [None]:
scatter_plot(df4,'Hyderabad',custom_colors[3])

In [None]:
frames = [df1,df2,df3,df4]
merged = pd.concat(frames)
merged = merged.loc[:, ~merged.columns.str.contains('^Unnamed')]

In [None]:
def preprocess(df) :
    # Work on a copy so the assignments below don't touch a view of the caller's frame
    df = df[['Location','Latitude','Longitude','Price']].copy()
    df = df.replace('NA', np.nan)
    df = df.dropna(subset=['Latitude', 'Price'])
    df["Latitude"] = df["Latitude"].astype(float)
    df["Longitude"] = df["Longitude"].astype(float)
    return df

In [None]:
map1_df = preprocess(df1)
map2_df = preprocess(df2)
map3_df = preprocess(df3)
map4_df = preprocess(df4)

<div>  
<h3><center style="background-color:#C39BD3; color:white;"><strong>House locations 🗺️</strong></center></h3>
</div>

<div>
<img src="https://i.imgur.com/bUJos0Ul.jpg" width="350" height="350"/>
</div>

In [None]:
city_map = folium.Map(location=[19.08,72.74], zoom_start=11.2, tiles='Stamen Terrain')
mc = MarkerCluster()
for idx, row in map1_df.iterrows():
    if not math.isnan(row['Longitude']) and not math.isnan(row['Latitude']):
        popup = """
        Location : <b>%s</b><br>
        Price : <b>%s</b><br>
        """ % (row['Location'], row['Price'])
        mc.add_child(Marker([row['Latitude'], row['Longitude']],tooltip=popup))
city_map.add_child(mc)  # add the cluster once, after all markers are in it
city_map

<div>
<img src="https://i.imgur.com/F2eFcsf.png" width="350" height="350"/>
</div>

In [None]:
city_map = folium.Map(location=[28.69,76.95], zoom_start=10, tiles='Stamen Terrain')
mc = MarkerCluster()
for idx, row in map2_df.iterrows():
    if not math.isnan(row['Longitude']) and not math.isnan(row['Latitude']):
        popup = """
        Location : <b>%s</b><br>
        Price : <b>%s</b><br>
        """ % (row['Location'], row['Price'])
        mc.add_child(Marker([row['Latitude'], row['Longitude']],tooltip=popup))
city_map.add_child(mc)  # add the cluster once, after all markers are in it
city_map

<div>
<img src="https://i.imgur.com/E2rku1K.png" width="350" height="350"/>
</div>

In [None]:
city_map = folium.Map(location=[13.04,80], zoom_start=10.5, tiles='Stamen Terrain')
mc = MarkerCluster()
for idx, row in map3_df.iterrows():
    if not math.isnan(row['Longitude']) and not math.isnan(row['Latitude']):
        popup = """
        Location : <b>%s</b><br>
        Price : <b>%s</b><br>
        """ % (row['Location'], row['Price'])
        mc.add_child(Marker([row['Latitude'], row['Longitude']],tooltip=popup))
city_map.add_child(mc)  # add the cluster once, after all markers are in it
city_map

<div>
<img src="https://i.imgur.com/PFS3PJv.png" width="350" height="350"/>
</div>

In [None]:
city_map = folium.Map(location=[17.4,78.2], zoom_start=10, tiles='Stamen Terrain')
mc = MarkerCluster()
for idx, row in map4_df.iterrows():
    if not math.isnan(row['Longitude']) and not math.isnan(row['Latitude']):
        popup = """
        Location : <b>%s</b><br>
        Price : <b>%s</b><br>
        """ % (row['Location'], row['Price'])
        mc.add_child(Marker([row['Latitude'], row['Longitude']],tooltip=popup))
city_map.add_child(mc)  # add the cluster once, after all markers are in it
city_map

<div>  
<h3><center style="background-color:#C39BD3; color:white;"><strong>Amenities</strong></center></h3>
</div>

In [None]:
c1 = ["#4e89ae","#BFD5E2"]
c2 = ["#c56183","#E6BCCA"]
c3 = ["#ed6663","#F7BDBC"]
c4 = ["#ffa372","#FFDECC"]

In [None]:
def pie_chart(df,link,c,addAll = True):
    df = df.iloc [:,5:-2] 
    fig = go.Figure()
    for column in df.columns.to_list():
        val = df[column].value_counts().rename_axis('unique_values').reset_index(name='val_count')
        labels = val['unique_values']
        values = val['val_count']
        fig.add_trace(
            go.Pie(
                labels=labels, 
                values=values,
                marker_colors=c
            )
        )
    # The "All" button is the same for every column, so build it once after the loop
    button_all = dict(label = 'All',
                      method = 'update',
                      args = [{'visible': df.columns.isin(df.columns),
                               'title': 'All',
                               'showlegend':True}])


    def create_layout_button(column):
        return dict(label = column,
                    method = 'update',
                    args = [{'visible': df.columns.isin([column]),
                             'title': column,
                             'showlegend': True}])
    fig.add_layout_image(
    dict(
        source=link,
        xref="paper", yref="paper",
        x=0.5, y=0.95,
        sizex=0.9, sizey=0.6,
        xanchor="center", yanchor="bottom"
    )
    )
    fig.update_layout(
        updatemenus=[go.layout.Updatemenu(
            active = 0,
            buttons = ([button_all] * addAll) + list(df.columns.map(lambda column: create_layout_button(column)))
            )
        ])
    
    fig.show()

In [None]:
pie_chart(df1,"https://i.imgur.com/OEr0Lw2.png",c1)

In [None]:
pie_chart(df2,"https://i.imgur.com/Byi2BQE.png",c2)

In [None]:
pie_chart(df3,"https://i.imgur.com/8Yxjfhx.png",c3)

In [None]:
pie_chart(df4,"https://i.imgur.com/KXYLDQV.png",c4)

<div>  
<h3><center style="background-color:#C39BD3; color:white;"><strong>Correlation</strong></center></h3>
</div>

In [None]:
merged.columns
merged = merged.rename(columns={"Children'splayarea": "ChildrenPlayArea"})
merged = merged.dropna()

In [None]:
plt.figure(figsize=(30,35))
corr = merged.corr(numeric_only=True)  # skip text columns such as Location (needs pandas >= 1.5)
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, cmap='coolwarm', vmax=.3, center=0,
            square=True, linewidths=.5,annot=True)
plt.show()

In [None]:
indices = corr.index.values
cor_matrix = corr.to_numpy()
G = nx.from_numpy_array(cor_matrix)  # from_numpy_matrix was removed in NetworkX 3.0
G = nx.relabel_nodes(G,lambda x: indices[x])
G.edges(data=True)

In [None]:
def corr_network(G, corr_direction, min_correlation):
    H = G.copy()

    for s1, s2, weight in G.edges(data=True):       
        if corr_direction == "positive":
            if weight["weight"] < 0 or weight["weight"] < min_correlation:
                H.remove_edge(s1, s2)
        else:
            if weight["weight"] >= 0 or weight["weight"] > min_correlation:
                H.remove_edge(s1, s2)
                
    edges,weights = zip(*nx.get_edge_attributes(H,'weight').items())
    
    weights = tuple([(1+abs(x))**2 for x in weights])
   
    d = dict(nx.degree(H))
    nodelist=d.keys()
    node_sizes=d.values()
    
    positions=nx.circular_layout(H)
    
    plt.figure(figsize=(15,15))

    nx.draw_networkx_nodes(H,positions,node_color='#d100d1',nodelist=nodelist,
                       node_size=tuple([x**3 for x in node_sizes]),alpha=0.8)

    nx.draw_networkx_labels(H, positions, font_size=8)

    if corr_direction == "positive":
        edge_colour = plt.cm.summer 
    else:
        edge_colour = plt.cm.autumn
        
    nx.draw_networkx_edges(H, positions, edgelist=edges,style='solid',
                          width=weights, edge_color = weights, edge_cmap = edge_colour,
                          edge_vmin = min(weights), edge_vmax=max(weights))
    plt.axis('off')
    plt.show() 

In [None]:
corr_network(G, corr_direction="positive",min_correlation = 0.5)

In [None]:
corr_network(G, corr_direction="negative",min_correlation = -0.1)
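
The same thresholding idea can be checked without drawing a graph by listing the feature pairs that pass the cut-off. The sketch below uses pandas only; `strong_pairs` and the toy columns are hypothetical and the values are made up for illustration.

```python
import numpy as np
import pandas as pd


def strong_pairs(corr: pd.DataFrame, min_correlation: float) -> pd.Series:
    """Return feature pairs whose correlation is at least min_correlation."""
    # Keep the strict upper triangle: each pair once, diagonal excluded.
    keep = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(keep).stack().dropna()
    return pairs[pairs >= min_correlation].sort_values(ascending=False)


toy = pd.DataFrame({
    "Gymnasium":    [1, 1, 0, 0, 1, 0],
    "SwimmingPool": [1, 1, 0, 0, 1, 1],
    "Resale":       [0, 1, 0, 1, 0, 1],
})

toy_pairs = strong_pairs(toy.corr(), min_correlation=0.5)
print(toy_pairs)
```

Only the Gymnasium/SwimmingPool pair survives here; in the network plot that pair would be the single remaining edge between distinct nodes.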

<div>  
<h3><center style="background-color:#C39BD3; color:white;"><strong>Permutation Importance</strong></center></h3>
</div>

In [None]:
feature_names = ['Area','No. of Bedrooms', 'Resale',
       'MaintenanceStaff', 'Gymnasium', 'SwimmingPool', 'LandscapedGardens',
       'JoggingTrack', 'RainWaterHarvesting', 'IndoorGames', 'ShoppingMall',
       'Intercom', 'SportsFacility', 'ATM', 'ClubHouse', 'School',
       '24X7Security', 'PowerBackup', 'CarParking', 'StaffQuarter',
       'Cafeteria', 'MultipurposeRoom', 'Hospital', 'WashingMachine',
       'Gasconnection', 'AC', 'Wifi', 'ChildrenPlayArea', 'LiftAvailable',
       'BED', 'VaastuCompliant', 'Microwave', 'GolfCourse', 'TV',
       'DiningTable', 'Sofa', 'Wardrobe', 'Refrigerator', 'Latitude',
       'Longitude']

X = merged[feature_names]
y = merged['Price']

In [None]:
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
model = RandomForestRegressor().fit(train_X, train_y)

In [None]:
perm = PermutationImportance(model, random_state=1).fit(val_X, val_y)
eli5.show_weights(perm, feature_names = val_X.columns.tolist())

<div class="alert alert-warning">
<p> Longitude is the most important feature. <br> Understandably, the area of the house plays a major role in the final price too.
</p>
</div>
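
Permutation importance measures how much the validation score drops when one column is shuffled. The notebook uses eli5 for this; the sketch below swaps in scikit-learn's own `sklearn.inspection.permutation_importance` on synthetic data where only the first feature drives the target, so the expected ranking is known in advance. All variable names and the data are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))                              # three synthetic features
y_demo = 5.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=400)   # only feature 0 matters

X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=1)
forest = RandomForestRegressor(n_estimators=50, random_state=1).fit(X_tr, y_tr)

# Shuffle each validation column several times and record the score drop.
result = permutation_importance(forest, X_va, y_va, n_repeats=5, random_state=1)
ranking = np.argsort(result.importances_mean)[::-1]  # most important first
print("feature ranking:", ranking)
print("mean importances:", result.importances_mean)
```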

<div>  
<h3><center style="background-color:#C39BD3; color:white;"><strong>Partial plots</strong></center></h3>
</div>

In [None]:
model2 = DecisionTreeRegressor(random_state=0, max_depth=5, min_samples_split=5).fit(train_X, train_y)

In [None]:
tree_graph = tree.export_graphviz(model2, out_file=None, feature_names=feature_names)
graphviz.Source(tree_graph)

<div class="alert alert-warning">
<p>Each internal node shows a splitting criterion, and its two branches correspond to the criterion being True or False. <br> The leaves hold the predicted price.
</p>
</div>
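
When graphviz is not available, a tree can also be printed as text with scikit-learn's `export_text`, which makes it easy to see which lines are split tests and which are predictions. The sketch fits a one-split tree on made-up area/price values; `X_toy`, `y_toy`, and `stump` are hypothetical names.

```python
from sklearn.tree import DecisionTreeRegressor, export_text

# Made-up data: area in square feet -> price in lakhs.
X_toy = [[500], [700], [1500], [1800]]
y_toy = [30.0, 34.0, 90.0, 98.0]

# max_depth=1 gives one split test and two leaves.
stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X_toy, y_toy)

tree_text = export_text(stump, feature_names=["Area"])
print(tree_text)  # lines with "<=" or ">" are split tests, "value" lines are leaf predictions
```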

<div class="alert alert-success">
<p>Interaction between Longitude coordinates of a house and the Area of the house
</p>
</div>

In [None]:
features_to_plot = ['Longitude', 'Area']
inter1  =  pdp.pdp_interact(model=model2, dataset=val_X, model_features=feature_names, features=features_to_plot)

pdp.pdp_interact_plot(pdp_interact_out=inter1, feature_names=features_to_plot, plot_type='contour')
plt.show()

<div>  
<h3><center style="background-color:#C39BD3; color:white;"><strong>Feature Importance</strong></center></h3>
</div>

In [None]:
model3 = ExtraTreesRegressor()
model3.fit(train_X, train_y)
fi = pd.DataFrame(model3.feature_importances_,
             columns=['importance'])
fi['feature'] = feature_names
fi = fi.sort_values('importance', ascending=False)

plt.figure(figsize=(20, 10))
ax = sns.barplot(data=fi, x='importance', y='feature',
                 palette="spring_r")
ax.tick_params(axis='both', which='both', labelsize=15)
ax.set_xlabel('Importance',fontsize=15, weight="bold");
ax.set_ylabel('Feature',fontsize=15,weight="bold");
plt.title("Feature Importance", size=20, weight="bold");

<div>  
<h3><center style="background-color:#C39BD3; color:white;"><strong>Model Training ⚙️ </strong></center></h3>
</div>

In [None]:
feature_names = ['Area','No. of Bedrooms','MaintenanceStaff','24X7Security','Latitude','Longitude']

X = merged[feature_names]
y = merged['Price']

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

In [None]:
def train_model(m,name):
    model = m
    model.fit(X=train_X, y=train_y)
    predictions = model.predict(val_X)
    mae = mean_absolute_error(val_y, predictions)
    r2 = r2_score(val_y, predictions)
    print("{0} mae {1} r2 {2}".format(name,mae,r2))

train_model(DecisionTreeRegressor(),"Decision Tree Regressor")
train_model(RandomForestRegressor(),"Random Forest Regressor")   
train_model(XGBRegressor(n_estimators=600),"XGBoost Regressor")   
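
For reference, the two metrics printed by `train_model` can be computed by hand: MAE is the mean absolute difference between predictions and true values, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. The helper `mae_and_r2` and the four toy values below are hypothetical.

```python
import numpy as np


def mae_and_r2(y_true, y_pred):
    """Compute mean absolute error and R^2 from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = np.mean(np.abs(y_true - y_pred))
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return mae, 1.0 - ss_res / ss_tot


# Toy prices in lakhs: absolute errors are 2, 2, 3 and 1.
mae, r2 = mae_and_r2([10, 20, 30, 40], [12, 18, 33, 39])
print(mae, r2)
```

Since prices were converted to lakhs, an MAE of 2 would mean the predictions are off by two lakh rupees on average.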

References:
* [NetworkX documentation](https://networkx.org/documentation/stable/tutorial.html)
* [Visualising stocks correlations with Networkx](https://towardsdatascience.com/visualising-stocks-correlations-with-networkx-88f2ee25362e)


<div>
    <img src="https://i.imgur.com/pl3FhXV.png">
</div>