#  **Housing Price Analysis of Top Tier Metropolitan Cities in India**

### <pre>Importing Required Libraries</pre>

In [None]:
# pip install geopy

In [None]:
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from geopy.geocoders import Nominatim
import folium
from folium import plugins

%matplotlib inline

***

![Bangalore Image](https://media.giphy.com/media/LnWAqDfCpfPSmnedvT/giphy.gif)

In [None]:
bangalore_df = pd.read_csv('../input/housing-prices-india-metropolitan/Bangalore.csv')

In [None]:
bangalore_df.info()

In [None]:
bangalore_df.sample(5)

***

![Delhi Image](https://media.giphy.com/media/hSEwaRIfSa4np4l5d5/giphy.gif)

In [None]:
delhi_df = pd.read_csv('../input/housing-prices-india-metropolitan/Delhi.csv')

In [None]:
delhi_df.info()

In [None]:
delhi_df.sample(5)

***

![Mumbai Image](https://media.giphy.com/media/STrgyibuarMZKHDQms/giphy.gif)

In [None]:
mumbai_df = pd.read_csv('../input/housing-prices-india-metropolitan/Mumbai.csv')

In [None]:
mumbai_df.info()

In [None]:
mumbai_df.sample(5)

<pre>We have imported the Bangalore, Delhi & Mumbai datasets.
Each of them contains about 40 variables/features, including the Price.</pre>

As per the data definition, `Price`, `Area` & `No. of Bedrooms` are numerical variables and `Location` is a text field; all remaining variables are categorical.
These categorical variables take one of three values: <br>
<code>0 = Particular facility is absent in the House</code><br>
<code>1 = Particular facility is present in the House</code><br>
<code>9 = No information is available for the variable, whether it is present or not</code>
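
Before deciding how to treat the code `9`, it can help to see how often each code occurs. The following is a minimal sketch on a hypothetical toy frame (the values are made up and are not taken from the datasets); the same two expressions could be applied to the flag columns of any of the three city frames.

```python
import pandas as pd

# Hypothetical toy data using the same 0/1/9 encoding as the amenity columns.
toy = pd.DataFrame({
    "Gymnasium":    [1, 0, 9, 9, 1],
    "SwimmingPool": [0, 0, 9, 9, 1],
})

# Count how often each code occurs in every column (rows = codes, columns = features).
code_counts = (
    toy.apply(lambda s: s.value_counts())
       .reindex([0, 1, 9])
       .fillna(0)
       .astype(int)
)

# Share of rows whose value is the "no information" code 9.
unknown_share = (toy == 9).mean()

print(code_counts)
print(unknown_share)
```

A high `unknown_share` for a column is a hint that imputing it would invent a lot of information.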

***

### <pre>Handling Missing Values</pre>

In [None]:
bangalore_df.replace(9,np.nan, inplace=True)
delhi_df.replace(9,np.nan, inplace=True)
mumbai_df.replace(9,np.nan, inplace=True)

In [None]:
bangalore_df.dropna(axis=0, inplace=True)
delhi_df.dropna(axis=0, inplace=True)
mumbai_df.dropna(axis=0, inplace=True)

In [None]:
bangalore_df.info()

In [None]:
delhi_df.info()

In [None]:
mumbai_df.info()

We cleaned the data by removing the missing values, i.e. the `9`s in the categorical variables. Because a large share of the data is missing, imputing each variable with its majority category could distort the analysis.
We therefore converted every `9` to `NaN` and dropped the rows containing them.
This leaves `1951` entries for Bangalore, `2002` entries for Delhi & `1398` entries for Mumbai for further analysis.
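
Note that `replace(9, np.nan)` on the whole frame also touches the numeric columns, so a listing that genuinely has a value of 9 there (for example 9 bedrooms) would be dropped too. Whether such rows exist in these datasets is not checked here. A minimal sketch of a narrower variant, on a hypothetical toy frame, restricts the replacement to the 0/1/9 flag columns; `non_flag_columns` and `flag_columns` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the third listing genuinely has 9 bedrooms.
toy = pd.DataFrame({
    "Price": [5000000, 12000000, 30000000],
    "Area": [900, 1500, 7000],
    "Location": ["A", "B", "C"],
    "No. of Bedrooms": [2, 3, 9],
    "Gymnasium": [1, 9, 0],
    "CarParking": [0, 9, 1],
})

# Columns that are NOT 0/1/9 flags.
non_flag_columns = ["Price", "Area", "Location", "No. of Bedrooms"]
flag_columns = [c for c in toy.columns if c not in non_flag_columns]

cleaned = toy.copy()
# Replace the code 9 only inside the flag columns ...
cleaned[flag_columns] = cleaned[flag_columns].replace(9, np.nan)
# ... and drop only the rows where a flag is unknown.
cleaned = cleaned.dropna(subset=flag_columns)

print(cleaned)
```

With this variant the row counts may differ from the ones reported above, since fewer rows can be removed.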

***

#### Variable Conversions

In [None]:
bangalore_df.columns

In [None]:
NonFloatColumns = ['Price','Area','Location','No. of Bedrooms']
ColumnsToConvert = []
for col in bangalore_df.columns:
    if col not in NonFloatColumns:
        ColumnsToConvert.append(col)
ColumnsToConvert

While handling missing values, we noticed that the variables that are categorical by definition are stored as numbers. We therefore collected those columns in the list above, and below we convert them to the required data type, i.e. Object (String).

In [None]:
bangalore_df[ColumnsToConvert] = bangalore_df[ColumnsToConvert].astype(str)
delhi_df[ColumnsToConvert] = delhi_df[ColumnsToConvert].astype(str)
mumbai_df[ColumnsToConvert] = mumbai_df[ColumnsToConvert].astype(str)

In [None]:
replace_dict = {'0':'No','1':'Yes','0.0':'No','1.0':'Yes'}
bangalore_df[ColumnsToConvert] = bangalore_df[ColumnsToConvert].replace(replace_dict)
delhi_df[ColumnsToConvert] = delhi_df[ColumnsToConvert].replace(replace_dict)
mumbai_df[ColumnsToConvert] = mumbai_df[ColumnsToConvert].replace(replace_dict)
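
As an alternative to casting to strings and then replacing four string variants, the flags can be mapped to labels directly. This is only a sketch on hypothetical toy data; it assumes the rows with missing flags have already been dropped, so the cast to `int` cannot fail.

```python
import pandas as pd

# Hypothetical toy flags, stored as floats the way they are after the NaN step.
toy = pd.DataFrame({
    "Gymnasium": [0.0, 1.0, 1.0],
    "Sofa": [1.0, 0.0, 0.0],
})
flag_columns = ["Gymnasium", "Sofa"]
labels = {0: "No", 1: "Yes"}

labelled = toy.copy()
# Cast each flag column to int, then look up its label.
labelled[flag_columns] = labelled[flag_columns].apply(
    lambda s: s.astype(int).map(labels)
)

print(labelled)
```

Any value outside `labels` would become `NaN`, which makes unexpected codes easy to spot.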

In [None]:
bangalore_df.head()

In [None]:
delhi_df.head()

In [None]:
mumbai_df.head()

***

#### Feature Scaling

`Price`, `Area` & `No. of Bedrooms` are the only numerical variables in our data, but they differ by orders of magnitude: `Price` takes very large values, which makes it hard to read alongside `Area` & `No. of Bedrooms` on plots. So we bring them to comparable scales.

In [None]:
# Converting Price to Lacs (1 Lac = 100,000)
bangalore_df.Price = bangalore_df.Price/100000
delhi_df.Price = delhi_df.Price/100000
mumbai_df.Price = mumbai_df.Price/100000

In [None]:
#Converting Area to Hundreds of Sq feet
bangalore_df.Area = bangalore_df.Area/100
delhi_df.Area = delhi_df.Area/100
mumbai_df.Area = mumbai_df.Area/100
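
The arithmetic behind the two conversions, as a small sketch: 1 lac is 100,000, so dividing by 100000 expresses a price in lacs, and dividing an area by 100 expresses it in hundreds of square feet. It assumes the raw `Price` is in rupees and the raw `Area` in square feet; the helper names and sample values are hypothetical.

```python
import pandas as pd

RUPEES_PER_LAC = 100_000   # 1 lac = 100,000 rupees
SQFT_PER_UNIT = 100        # plot areas in hundreds of square feet


def to_lacs(price_in_rupees):
    """Convert a price (scalar or Series) from rupees to lacs."""
    return price_in_rupees / RUPEES_PER_LAC


def to_hundreds_sqft(area_in_sqft):
    """Convert an area (scalar or Series) from sq.ft. to hundreds of sq.ft."""
    return area_in_sqft / SQFT_PER_UNIT


# Hypothetical listings.
toy = pd.DataFrame({"Price": [7_500_000, 120_000_000], "Area": [1250, 6500]})
toy["Price"] = to_lacs(toy["Price"])
toy["Area"] = to_hundreds_sqft(toy["Area"])

print(toy)
```

On this scale a price of 1000 means 1000 lacs and an area of 60 means 6000 sq.ft.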

In [None]:
bangalore_df.head()

In [None]:
bangalore_df.describe()

In [None]:
delhi_df.head()

In [None]:
delhi_df.describe()

In [None]:
mumbai_df.head()

In [None]:
mumbai_df.describe()

***

#### Plotting & Analysis

Based on the variables, we identified 2 categories: <br>
- `Important Amenities` like `Club House`,`Gym`,`SwimmingPool`,etc. <br>
- `Furnishing` like `Sofa`,`Wardrobe`,`TV`,etc. <br>

This lets us analyze each category and its impact on the Price.

In [None]:
ImpAmenitiesColumns = ['Resale','Gymnasium','SwimmingPool','ClubHouse','School','24X7Security','PowerBackup','CarParking','Hospital']
FurnishingColumns = ['DiningTable','Sofa','Wardrobe','Refrigerator','Microwave','TV','BED','AC','Wifi','Gasconnection']

In [None]:
# Function for plotting Bedrooms vs Price with each Important Amenity as HUE

def ImpAmenitiesPlot(df):
  fig, axes = plt.subplots(3,3, figsize=(18,9))
  
  col = 0
  for i in range(3):
    for j in range(3):
      axes[i,j].set_title("Bedrooms-Price Plot with {} as Hue".format(ImpAmenitiesColumns[col]))
      sns.barplot(x='No. of Bedrooms',y='Price',hue=ImpAmenitiesColumns[col],data=df,ax=axes[i,j])
      col += 1

  plt.tight_layout(pad=3)

In [None]:
# Function for plotting Bedrooms vs Price with each Furnishing item as HUE
def FurnishingPlot(df):
  fig, axes = plt.subplots(2,5, figsize=(25,10))

  col = 0
  for i in range(2):
    for j in range(5):
      axes[i,j].set_title("Bedrooms-Price Plot with {} as Hue".format(FurnishingColumns[col]))
      sns.barplot(x='No. of Bedrooms',y='Price',hue=FurnishingColumns[col],data=df,ax=axes[i,j])
      col += 1

  plt.tight_layout(pad=2)
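
The two grid functions above differ only in the column list and the grid shape. A minimal sketch of a single generic helper follows. It is a swapped-in variant, not the notebook's method: it uses a pandas pivot table and matplotlib bars instead of `sns.barplot`, so it shows mean prices without confidence intervals. The name `mean_price_grid` and the toy data are hypothetical.

```python
import math

import matplotlib.pyplot as plt
import pandas as pd


def mean_price_grid(df, hue_columns, ncols=3):
    """Draw one bedrooms-vs-mean-price bar chart per column in hue_columns."""
    nrows = math.ceil(len(hue_columns) / ncols)
    fig, axes = plt.subplots(nrows, ncols, figsize=(6 * ncols, 3 * nrows),
                             squeeze=False)
    flat_axes = axes.ravel()
    for ax, column in zip(flat_axes, hue_columns):
        # Mean price per bedroom count, split by the Yes/No value of the column.
        means = df.pivot_table(index="No. of Bedrooms", columns=column,
                               values="Price", aggfunc="mean")
        means.plot.bar(ax=ax)
        ax.set_title("Bedrooms-Price Plot with {} as Hue".format(column))
        ax.set_ylabel("Mean Price")
    # Hide grid cells that have no column assigned.
    for ax in flat_axes[len(hue_columns):]:
        ax.set_visible(False)
    fig.tight_layout(pad=3)
    return fig, axes


# Hypothetical toy data.
toy = pd.DataFrame({
    "No. of Bedrooms": [1, 1, 2, 2],
    "Price": [10, 20, 30, 40],
    "Gymnasium": ["No", "Yes", "No", "Yes"],
    "Sofa": ["Yes", "Yes", "No", "No"],
})
fig, axes = mean_price_grid(toy, ["Gymnasium", "Sofa"], ncols=3)
```

Because the grid shape is derived from the list length, the same helper would serve both the 9 amenity columns and the 10 furnishing columns.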

In [None]:
# Function for plotting Line Plots of Price with Bedrooms & Area to analyse the trend
def LinePlots(df):
  fig,axes = plt.subplots(1,2, figsize=(18,4))
  sns.lineplot(x='No. of Bedrooms',y='Price',data=df,ax=axes[0])
  sns.lineplot(x='Area',y='Price',data=df,ax=axes[1])
  plt.tight_layout(pad=3)

In [None]:
# Function to Plot Distribution and Variation, to understand the Outliers and Statistical Distribution of Price
def QuadPlot(df):
  fig,axes = plt.subplots(2,2, figsize=(18,8))
  sns.histplot(df['Price'],kde=True,ax=axes[0,0])  # histplot replaces the deprecated distplot
  sns.boxplot(x=df['Price'],ax=axes[0,1])
  sns.scatterplot(x='No. of Bedrooms', y='Price', data=df, ax=axes[1,0])
  sns.scatterplot(x='Area',y='Price', data=df,ax=axes[1,1])
  plt.tight_layout(pad=3)

***

### **Bangalore**

In [None]:
ImpAmenitiesPlot(bangalore_df)

In [None]:
FurnishingPlot(bangalore_df)

In [None]:
LinePlots(bangalore_df)

In [None]:
QuadPlot(bangalore_df)

***

### **Delhi**

In [None]:
ImpAmenitiesPlot(delhi_df)

In [None]:
FurnishingPlot(delhi_df)

In [None]:
LinePlots(delhi_df)

In [None]:
QuadPlot(delhi_df)

***

### **Mumbai**

In [None]:
ImpAmenitiesPlot(mumbai_df)

In [None]:
FurnishingPlot(mumbai_df)

In [None]:
LinePlots(mumbai_df)

In [None]:
QuadPlot(mumbai_df)

From the `BoxPlots` & `ScatterPlots` of all 3 cities, it is clearly visible that `Price` has outliers: some houses with a large `Area` and a high `No. of Bedrooms` have very high prices. Let's look at these rows in more detail (statistically).

***

In [None]:
bangalore_df.loc[(bangalore_df['Price']>1000) | (bangalore_df['Area']>60),['Price','Area','No. of Bedrooms']]

In [None]:
delhi_df.loc[(delhi_df['Price']>1000) | (delhi_df['Area']>60),['Price','Area','No. of Bedrooms']]

In [None]:
mumbai_df.loc[(mumbai_df['Price']>1000) | (mumbai_df['Area']>60),['Price','Area','No. of Bedrooms']]

According to the data, the outliers are houses with<br>
- <code>Area > 6000 sq.ft.</code>
- <code>No. of Bedrooms = 4, 5, 6</code>
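
The filters above use fixed thresholds chosen by eye. A common rule-of-thumb alternative is the 1.5 × IQR fence that box plots use for their whiskers. The sketch below applies it to hypothetical prices in lacs; it is not the method used in this notebook, and `iqr_outlier_mask` is an illustrative name.

```python
import pandas as pd


def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Hypothetical prices in lacs; the last one is far from the rest.
prices = pd.Series([40, 55, 60, 65, 70, 80, 95, 1200], name="Price")
mask = iqr_outlier_mask(prices)

print(prices[mask])
```

Here Q1 = 58.75 and Q3 = 83.75, so the upper fence is 121.25 and only the price of 1200 is flagged.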

***

In [None]:
ResaleBLR = bangalore_df[bangalore_df.Resale=='Yes']
ResaleDEL = delhi_df[delhi_df.Resale=='Yes']
ResaleMUM = mumbai_df[mumbai_df.Resale=='Yes']

fig,axes = plt.subplots(1,3,figsize=(18,8))
axes[0].set_title('Bangalore Resale Houses')
sns.countplot(y='Location',data=ResaleBLR,order=ResaleBLR.Location.value_counts().index[:10],ax=axes[0])
axes[1].set_title('Delhi Resale Houses')
sns.countplot(y='Location',data=ResaleDEL,order=ResaleDEL.Location.value_counts().index[:10],ax=axes[1])
axes[2].set_title('Mumbai Resale Houses')
sns.countplot(y='Location',data=ResaleMUM,order=ResaleMUM.Location.value_counts().index[:10],ax=axes[2])
plt.tight_layout(pad=3)
plt.show()

Above are the 10 regions with the highest number of **Resale** houses in each city.

In [None]:
FreshBLR = bangalore_df[bangalore_df.Resale=='No']
FreshDEL = delhi_df[delhi_df.Resale=='No']
FreshMUM = mumbai_df[mumbai_df.Resale=='No']

fig,axes = plt.subplots(1,3,figsize=(18,8))
axes[0].set_title('Bangalore Fresh Houses')
sns.countplot(y='Location',data=FreshBLR,order=FreshBLR.Location.value_counts().index[:10],ax=axes[0])
axes[1].set_title('Delhi Fresh Houses')
sns.countplot(y='Location',data=FreshDEL,order=FreshDEL.Location.value_counts().index[:10],ax=axes[1])
axes[2].set_title('Mumbai Fresh Houses')
sns.countplot(y='Location',data=FreshMUM,order=FreshMUM.Location.value_counts().index[:10],ax=axes[2])
plt.tight_layout(pad=3)
plt.show()

Above are the 10 regions with the highest number of **Fresh** houses in each city.
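
The same ranking can be read as a table instead of a plot. A minimal sketch on hypothetical toy data, where `top_locations` is an illustrative helper rather than part of the notebook:

```python
import pandas as pd


def top_locations(df, resale_value, n=10):
    """Return the n most frequent locations among rows with the given Resale label."""
    subset = df[df["Resale"] == resale_value]
    return subset["Location"].value_counts().head(n)


# Hypothetical listings.
toy = pd.DataFrame({
    "Location": ["A", "A", "B", "B", "B", "C", "C"],
    "Resale":   ["Yes", "No", "Yes", "Yes", "No", "No", "No"],
})

top_resale = top_locations(toy, "Yes", n=2)
top_fresh = top_locations(toy, "No", n=2)

print(top_resale)
print(top_fresh)
```

`value_counts()` already sorts in descending order, which is why the plots can take the first 10 index entries as their `order`.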

***

### Plotting Houses on Map in each City

In [None]:
def getCoordinates(Location,City):
  loc = Location+", "+City
  print(loc)
  geolocator = Nominatim(user_agent='EkansH',timeout=3)
  geo_loc = geolocator.geocode(loc)

  try:
    coordinates = {'lat':geo_loc.latitude,'lon':geo_loc.longitude}
  except AttributeError:
    coordinates = {'lat':np.nan,'lon':np.nan}

  return coordinates['lat'], coordinates['lon']
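
Many rows share the same `Location`, so geocoding every row repeats identical requests. A minimal sketch of geocoding each distinct location once and mapping the result back is shown below. To stay runnable offline it uses a stand-in `fake_geocode` function with made-up coordinates; the assumption is that the real callable (for example a `Nominatim(...).geocode` method) returns `None` when nothing is found and otherwise an object with `latitude` and `longitude`.

```python
from types import SimpleNamespace

import numpy as np
import pandas as pd


def geocode_unique(locations, city, geocode):
    """Geocode each distinct location once and map the results back to every row."""
    cache = {}
    for name in pd.unique(locations):
        result = geocode(f"{name}, {city}")
        if result is None:
            cache[name] = (np.nan, np.nan)
        else:
            cache[name] = (result.latitude, result.longitude)
    lat = locations.map(lambda name: cache[name][0])
    lon = locations.map(lambda name: cache[name][1])
    return lat, lon


# Stand-in geocoder with made-up coordinates (hypothetical, no network access).
calls = []
known = {"Whitefield, Bangalore": SimpleNamespace(latitude=12.97, longitude=77.75)}


def fake_geocode(query):
    calls.append(query)
    return known.get(query)


locations = pd.Series(["Whitefield", "Whitefield", "Nowhere"])
lat, lon = geocode_unique(locations, "Bangalore", fake_geocode)

print(lat.tolist(), lon.tolist(), len(calls))
```

With a real geocoding service it may also be worth throttling the requests, since public endpoints usually enforce usage limits.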

In [None]:
# bangalore_df['latitude'],bangalore_df['longitude'] = np.vectorize(getCoordinates)(bangalore_df.Location,'Bangalore')
# delhi_df['latitude'],delhi_df['longitude'] = np.vectorize(getCoordinates)(delhi_df.Location,'Delhi')
# mumbai_df['latitude'],mumbai_df['longitude'] = np.vectorize(getCoordinates)(mumbai_df.Location,'Mumbai')

In [None]:
# bangalore_df.to_csv("../input/housing-prices-india-metropolitan/bangalore_updated.csv", encoding="utf-8", index=False)
# delhi_df.to_csv("../input/housing-prices-india-metropolitan/delhi_updated.csv", encoding="utf-8", index=False)
# mumbai_df.to_csv("../input/housing-prices-india-metropolitan/mumbai_updated.csv", encoding="utf-8", index=False)

In [None]:
bangalore = pd.read_csv("../input/housing-prices-india-metropolitan/bangalore_updated.csv")
delhi = pd.read_csv("../input/housing-prices-india-metropolitan/delhi_updated.csv")
mumbai = pd.read_csv("../input/housing-prices-india-metropolitan/mumbai_updated.csv")

In [None]:
bangalore.head()

In [None]:
delhi.head()

In [None]:
mumbai.head()

In [None]:
blr_coordinates = (12.9791198,77.5912997)
del_coordinates = (28.6517178,77.2219388)
mum_coordinates = (19.0759899,72.8773928)

In [None]:
bangalore[bangalore['latitude'].isna()]

In [None]:
delhi[delhi['latitude'].isna()]

In [None]:
mumbai[mumbai['latitude'].isna()]

In [None]:
bangalore.dropna(axis=0, inplace=True)
delhi.dropna(axis=0, inplace=True)
mumbai.dropna(axis=0, inplace=True)

In [None]:
# Function to generate a Map of each City showing where the Houses are located
def GenerateMap(df,coordinates):
  city_map = folium.Map(location=coordinates, zoom_start=10)  # avoid shadowing the built-in map()
  HouseCluster = plugins.MarkerCluster().add_to(city_map)
  for idx,row in df.iterrows():
    # One marker per house; the popup shows its price in Lacs
    folium.Marker([row.latitude,row.longitude],popup=str(row['Price'])+' Lacs').add_to(HouseCluster)
  return city_map

> 👀 <code>[Learn more about using Maps in EDA](https://georgetsilva.github.io/posts/mapping-points-with-folium/) </code>

In [None]:
GenerateMap(bangalore,blr_coordinates)

In [None]:
GenerateMap(delhi,del_coordinates)

In [None]:
GenerateMap(mumbai,mum_coordinates)

***

## Final Conclusion of the Analysis

As per the data provided, we grouped the variables into 2 categories:
- `Important Amenities`, like Hospital, Gym, Swimming Pool, etc., which indicate whether a house has these amenities nearby.
- `Furnishing`, like Wardrobe, AC, TV, etc., which indicate whether the house comes with these furnishings, i.e. whether it is fully furnished or unfurnished.



**Important Amenities:-**
- The Resale variable has a strong association with Price: a resale house tends to be priced lower than a fresh house.
- Similarly, Gym, ClubHouse, 24X7Security, PowerBackup and CarParking are associated with higher prices.
- In contrast, Swimming Pool, School and Hospital are associated with lower prices: houses without these amenities nearby tend to be priced higher. This suggests that such houses lie on the outskirts of the city, far from hospitals and schools.
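
Statements like these are read off the bar plots. One way to put numbers behind them is to compare the median price with and without each amenity. The sketch below uses hypothetical toy data built only to illustrate the computation; it does not reproduce any result from the three datasets, and `median_price_by_flag` is an illustrative name.

```python
import pandas as pd


def median_price_by_flag(df, flag_columns):
    """Median Price for the 'No' and 'Yes' group of each flag column."""
    rows = {col: df.groupby(col)["Price"].median() for col in flag_columns}
    summary = pd.DataFrame(rows).T  # rows = flags, columns = No / Yes
    summary["Yes minus No"] = summary["Yes"] - summary["No"]
    return summary


# Hypothetical listings, prices in lacs.
toy = pd.DataFrame({
    "Price":     [50, 60, 70, 90, 100, 120],
    "Resale":    ["Yes", "Yes", "Yes", "No", "No", "No"],
    "Gymnasium": ["No", "No", "Yes", "No", "Yes", "Yes"],
})

summary = median_price_by_flag(toy, ["Resale", "Gymnasium"])
print(summary)
```

A negative `Yes minus No` means the houses with the flag set are cheaper in the sample. This is a raw comparison that does not control for area or number of bedrooms.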

**Furnishing:-**
- In the Furnishing category there is an unexpected trend: 1, 2 or 3 bedroom houses are priced higher when furnishings such as AC, TV, Wardrobe, etc. are provided, but 4 or 5 bedroom houses are priced higher when these furnishings are not provided.

- As the number of bedrooms and the area of a house increase, its price increases.
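
The furnishing trend is an interaction between the number of bedrooms and a furnishing flag, which a pivot table makes explicit. The toy data below is hypothetical and deliberately constructed to show such a reversal; it illustrates how one could check the pattern, not the actual figures.

```python
import pandas as pd

# Hypothetical listings, prices in lacs, constructed to show a reversal.
toy = pd.DataFrame({
    "No. of Bedrooms": [2, 2, 2, 2, 4, 4, 4, 4],
    "AC":    ["Yes", "Yes", "No", "No", "Yes", "Yes", "No", "No"],
    "Price": [60, 70, 40, 50, 200, 220, 300, 320],
})

# Median price for every (bedrooms, AC) combination.
table = toy.pivot_table(index="No. of Bedrooms", columns="AC",
                        values="Price", aggfunc="median")
print(table)
```

Reading along a row compares furnished and unfurnished houses of the same size, so a reversal shows up as the larger value switching columns between rows.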

According to the BoxPlot & ScatterPlot, there are outliers in the prices of houses with 4 or 5 bedrooms and an area of more than 6000 sq.ft.

**Conclusion Statement:** <br>
*Fresh houses on the outskirts of the city that have important amenities like Gym, Clubhouse, 24X7Security, PowerBackup and CarParking, and that do not have hospitals & schools nearby, are more expensive.*
*A family planning to purchase a 4 or 5 bedroom house may already own furniture and therefore prefer an unfurnished house, which may explain why unfurnished houses of this size are more expensive.*
*In contrast, a small family or bachelors planning to purchase a house tend to need a furnished one.*
*Anyone able to spend more money on a house gets more bedrooms and more area, and vice-versa.*