## Problem Statement :- 
#### Scrape House Prices from the Magic bricks Website and map the localities to a unique set of zip-codes in that city. Python modules are available to solve the latter half of the task.


Web-Scraping is often used as a tool to collect the data when no real-time data available for the analysis. First of all, we are going to collect all the search results from magic bricks websites in Hyderabad.All the data is neatly arranged.

Before we actually begin scraping the data, we need to decide on what features need to be scraped from the website. In the current example, we will be collecting,
No Bedrooms.

No Bathrooms.

Type of Furnishing.

The Tennants Preferred.

The Area of the House in sqft.

The locality of the House.

Price or Rent.

In [1]:
# importing libraries
import re
import time
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen,Request

Some websites automatically block any kind of scraping, and that’s why we’ll define a header to pass along the get command, which will basically make our queries to the website look like they are coming from an actual browser.

In [2]:
headers = ({'User-Agent':
            'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
url = "https://www.magicbricks.com/property-for-rent/residential-real-estate?proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment,Service-Apartment&cityName=Hyderabad&BudgetMin=5000&BudgetMax=50000&page="

After collecting the webpage data as a list of containers, we can see that data pertaining to each field is stored in <span> tags in each of the page_containers. We can write a script to collect all the span tags of each page_container, extract the data and loop it over the entire page_containers.
    
The logic remains the same for all the fields except Locality where the locality needs to be extracted from a string of characters and little tricky to get.I have used regular expressions to find the pattern which starts with ‘in’ and ends with a ‘number (\d)’ but this pattern is not uniform across the data, Some samples are also ending with spaces or with nothing at the end. So, I have used 3 patterns to get the locality data.

In [3]:
def house_price(url,headers):
    request = Request(url, headers=headers)
    response = urlopen(request)
    html = response.read()
    html_soup = BeautifulSoup(html , 'html.parser')
    house_container = html_soup.find_all('div', class_ = 'flex relative clearfix m-srp-card__container' )
    for container in range(len(house_container)) :
        first  = house_container[container].find_all('span')
        second = house_container[container].find_all('div',class_ = 'm-srp-card__summary__info')
        Price_  = first[1].text.replace('\n','')
        Price.append((Price_))
        Bedrooms_   = first[3].text.replace('\n','')
        Bedrooms.append((Bedrooms_))
        try:
            Bathrooms_  = second[4].text.replace('\n','')
            Bathrooms.append(Bathrooms_)
        except :
            Bathrooms.append('NaN')
        try:
            Area_       = first[4].text.replace('\n','')
            Area.append((Area_))
        except:
            Area.append(('NaN'))
        Furnishing_ = second[1].text.replace('\n','')
        Furnishing.append(Furnishing_)
        try:
            Tennants_   = second[2].text.replace('\n','')
            Tennants.append(Tennants_)
        except:
            Tennants.append('NaN')
        try :
            Locality_0  = house_container[container].find_all('span')[2]
            Locality_1 = Locality_0.text.replace('\n',' ')
            Locality_2 = re.search(r'in(.+?)\d', Locality_1)
            if Locality_2 is None :
                Locality_2 = re.search(r'in(.+?),',Locality_1)
            if Locality_2 is None :
                Locality_2 = re.search(r'in(.+?) ',Locality_1)
            Locality_  = Locality_2[1]
            Locality.append(Locality_)
        except:
            Locality.append(('NaN'))
        
    cols = ['Bedrooms', 'Bathrooms', 'Furnishing', 'Tennants', 'Area', 'Price','Locality']

    house_data = pd.DataFrame({'Bedrooms': Bedrooms,
                               'Bathrooms': Bathrooms,
                               'Furnishing': Furnishing,
                               'Tennants': Tennants,
                               'Area': Area,
                               'Price': Price,
                               'Locality':Locality})[cols]
    return house_data

In [4]:
Bedrooms = []
Bathrooms = []
Furnishing = []
Tennants = []
Locality = []
Area = []
Price = []

In [5]:
for i in range(1,40):
    url_ = url+str(i)
    house_data = house_price(url_,headers)

In [6]:
house_data.count()

Bedrooms      1171
Bathrooms     1171
Furnishing    1171
Tennants      1171
Area          1171
Price         1171
Locality      1171
dtype: int64

In [7]:

house_data.to_csv('magics_brics_data.csv')

In [8]:
data = pd.read_csv('magics_brics_data.csv')

In [9]:
data.head(10)

Unnamed: 0.1,Unnamed: 0,Bedrooms,Bathrooms,Furnishing,Tennants,Area,Price,Locality
0,0,3 BHK Service Apartment,1,Unfurnished,Bachelors,1390 sqft,26000,"Hafeezpet, NH"
1,1,1 BHK Builder Floor,1,Unfurnished,Bachelors/Family,read more,14000,Amarnath Residency
2,2,3 BHK Apartment,1,Semi-Furnished,Family,1650 sqft,35000,Nanakram Guda
3,3,2 BHK Builder Floor,1,Semi-Furnished,Bachelors,1200 sqft,22000,Jubilee Hills
4,4,2 BHK Apartment,1,Unfurnished,Bachelors,1295 sqft,27000,"Chandanagar, NH"
5,5,3 BHK Apartment,2,Semi-Furnished,Bachelors/Family,1630 sqft,30000,Gopanapalli
6,6,3 BHK Apartment,3,Semi-Furnished,Bachelors/Family,2075 sqft,49000,"Vajras Jasmine County, Gachibowli, Outer Ring..."
7,7,2 BHK Apartment,1,Unfurnished,Bachelors/Family,,8000,Eastend Colony
8,8,3 BHK Builder Floor,Immediately,Semi-Furnished,Bachelors/Family,1339 sqft,22000,Tolichowki
9,9,3 BHK Apartment,Immediately,Semi-Furnished,Bachelors/Family,1800 sqft,22000,"Narsingi, Outer Ring Road"


In [10]:
import requests
import logging
import time

logger = logging.getLogger("root")
logger.setLevel(logging.DEBUG)
ch = logging.StreamHandler()
ch.setLevel(logging.DEBUG)
logger.addHandler(ch)

##### Script for mapping address to unique zip code using google api

The script expects an input CSV file with a column that contains addresses. The default column name is “Address”, but we can specify different names in the configuration section of the script.

The script will take each address and geocode it using the Google APIs, returning:

the matching latitude and longitude,
the cleaned and formatted address from Google,
postcode of the matched address 
accuracy of the match,
the “type” of the location – “street, neighbourhood, locality”
google place ID,
the number of results returned,
the entire JSON response from Google can be requested.


For Generating API look into the link https://developers.google.com/maps/documentation/geocoding/get-api-key

In [11]:
API_KEY = 'AIzaSyAmcECGL6FaRhdm4YGAZRZB4ZL3ukttEyA'
BACKOFF_TIME = 30
output_filename = 'output-magicbricks.csv'

In [12]:
address_column_name = "Locality"
RETURN_FULL_RESULTS = False

In [13]:
if address_column_name not in data.columns:
    raise ValueError("Missing Address column in input data")

In [14]:
addresses = data[address_column_name].tolist()

The script functionality is simple: there’s a central function “get_google_result()” that actually requests data from the Google API using the Python requests library, and then a wrapper around that to handle data backup and geocoding query limits.

In [15]:
def get_google_results(address, api_key=None, return_full_response=False):
    # Set up your Geocoding url
    geocode_url = "https://maps.googleapis.com/maps/api/geocode/json?address={}".format(address)
    #print(geocode_url)
    if api_key is not None:
        geocode_url = geocode_url + "&key={}".format(api_key)
        
    # Ping google for the reuslts:
    results = requests.get(geocode_url)
    # Results will be in JSON format - convert to dict using requests functionality
    results = results.json()
    
    # if there's no results or an error, return empty results.
    if len(results['results']) == 0:
        output = {
            "postcode": None
        }
    else:    
        answer = results['results'][0]
        output = {
            "postcode": ",".join([x['long_name'] for x in answer.get('address_components') 
                                  if 'postal_code' in x.get('types')])
        }
        
    # Append some other details:    
    output['status'] = results.get('status')
    if return_full_response is True:
        output['response'] = results
    
    return output

In [16]:
results = []
# Go through each address in turn
for address in addresses:
    # While the address geocoding is not finished:
    geocoded = False
    while geocoded is not True:
        # Geocode the address with google
        try:
            geocode_result = get_google_results(address, API_KEY, return_full_response=RETURN_FULL_RESULTS)
        except Exception as e:
            logger.exception(e)
            logger.error("Major error with {}".format(address))
            logger.error("Skipping!")
            geocoded = True
            
        # If we're over the API limit, backoff for a while and try again later.
        if geocode_result['status'] == 'OVER_QUERY_LIMIT':
            logger.info("Hit Query Limit! Backing off for a bit.")
            time.sleep(BACKOFF_TIME * 60) # sleep for 30 minutes
            geocoded = False
        else:
            # If we're ok with API use, save the results
            # Note that the results might be empty / non-ok - log this
            if geocode_result['status'] != 'OK':
                logger.warning("Error geocoding {}: {}".format(address, geocode_result['status']))
            logger.debug("Geocoded: {}: {}".format(address, geocode_result['status']))
            results.append(geocode_result)  
            
            geocoded = True

    # Print status every 100 addresses
    if len(results) % 100 == 0:
    	logger.info("Completed {} of {} address".format(len(results), len(addresses)))
        

# All done
logger.info("Finished geocoding all addresses")
# Write the full results to csv using the pandas library.
data1=pd.DataFrame(results)

Geocoded:  Hafeezpet, NH : OK
Geocoded:  Amarnath Residency: OK
Geocoded:  Nanakram Guda : OK
Geocoded:  Jubilee Hills : OK
Geocoded:  Chandanagar, NH : OK
Geocoded:  Gopanapalli : OK
Geocoded:  Vajras Jasmine County, Gachibowli, Outer Ring Road : OK
Geocoded:  Eastend Colony: OK
Geocoded:  Tolichowki : OK
Geocoded:  Narsingi, Outer Ring Road : OK
Geocoded:  Osman Nagar : OK
Geocoded:  Vidyanagar, Adikmet : OK
Geocoded:  Kukatpally Housing Board Colony, NH : OK
Geocoded:  Narsingi, Outer Ring Road : OK
Geocoded:  Puppalaguda : OK
Geocoded:  My Home Avatar, Narsingi, Outer Ring Road : OK
Geocoded:  Lanco Hills, Manikonda, Outer Ring Road : OK
Geocoded:  Gachibowli: OK
Geocoded:  Yapral : OK
Geocoded:  Aditya Imperial Heights, Hafeezpet, NH : OK
Geocoded:  Sai Ram Nagar Colony, Champapet, Koti : OK
Geocoded:  Mantri Celestia, Gachibowli, Outer Ring Road : OK
Geocoded:  Kistareddypet, Outer Ring Road : OK
Geocoded:  Gks Habitat Royale, Yapral : OK
Geocoded:  Bikshapathi Nagar, Hafeezpet :

Geocoded:  Kondapur : OK
Geocoded:  Puppalaguda : OK
Geocoded:  Accurate Wind Chimes, Narsingi, Outer Ring Road : OK
Geocoded:  SMR Vinay Iconia, Kondapur : OK
Geocoded:  Miyapur, NH : OK
Completed 200 of 1171 address
Geocoded:  Boduppal, NH : OK
Geocoded:  Rainbow Vistas, Hitech City : OK
Geocoded:  Rajiv gandhi Nagar-Gachibowli : OK
Geocoded:  Kavuri Hills, Madhapur : OK
Geocoded:  Toli Chowki : OK
Geocoded:  Silicon Ridge, Attapur : OK
Geocoded:  Gks Habitat Royale, Yapral : OK
Geocoded:  Kukatpally Housing Board Colony, NH : OK
Geocoded:  Bhavani: OK
Geocoded:  Brindavan Colony-Toli Chowki : OK
Geocoded:  Empress Heights, Jubilee Hills : OK
Geocoded:  Attapur : OK
Geocoded:  Mayur Marg : OK
Geocoded:  Suchitra Circle : OK
Geocoded:  Masab Tank : OK
Geocoded:  Mantri Celestia, Gachibowli, Outer Ring Road : OK
Geocoded:  SMR Vinay Symphony, Gachibowli, Outer Ring Road : OK
Geocoded:  Project Hill Ridge Springs, Gachibowli, Outer Ring Road : OK
Geocoded:  NSK Exotica, Kukatpally, NH :

Error geocoding  									: ZERO_RESULTS
Geocoded:  									: ZERO_RESULTS
Geocoded:  									 								 				             				             			             				         			            	Vajras Jasmine County,  			            	 	                	 	                		 					            							          							         	Gachibowli, Outer Ring Road 							          							           			        				  						         	         		    		 	           			 						 							 								 					             					            	: OK
Geocoded:  									 								 				             				             			             				        	 	                	 	                		 					            							          							         	Kondapur 							          							           			        				  						         	         		    		 	           			 						 							 								 					             					            	: OK
Geocoded:  									 								 				             				             			             				        	 	                	 	               

Geocoded:  									 								 				             				             			             				         			            	Aditya Imperial Heights,  			            	 	                	 	                		 					            							          							         	Hafeezpet, NH  : OK
Geocoded:  									 								 				             				             			             				        	 	                	 	                		 					            							          							         	Brindavan Colony-Toli Chowki 							          							           			        				  						         	         		    		 	           			 						 							 								 					             					            	: OK
Geocoded:  									 								 				             				             			             				         			            	My Homes Jewel,  			            	 	                	 	                		 					            							          							         	Miyapur, NH  : OK
Geocoded:  									 								 				             				             			             				         			           

Geocoded:  PBEL City, Appa junction : OK
Geocoded:  West Marredpally : OK
Geocoded:  SV Brindavanam, Boduppal, NH : OK
Completed 400 of 1171 address
Geocoded:  Bowenpally : OK
Geocoded:  Manjeera Majestic Homes, Kukatpally, NH : OK
Geocoded:  Vertex Panache, Gachibowli, Outer Ring Road : OK
Geocoded:  Habsiguda, NH : OK
Geocoded:  Gandhi Nagar-Boiguda : OK
Geocoded:  Mehdipatnam : OK
Geocoded:  Sameera Sisiram, Kondapur : OK
Geocoded:  Kondapur : OK
Geocoded:  Emami Swanlake, Kukatpally, NH : OK
Geocoded:  Beeramguda, Ramachandra Puram, NH : OK
Geocoded:  Attapur : OK
Geocoded:  SMR Vinay Fountainhead, Miyapur, NH : OK
Geocoded:  Manasarovar Heights, Teachers Colony-Trimulgherry : OK
Geocoded:  Aditya Nagar, Hafeezpet : OK
Geocoded:  Raaaps Raaganjali, Nalanda Nagar, Upparpally : OK
Geocoded:  Lakdikapul, NH : OK
Geocoded:  Manikonda, Outer Ring Road : OK
Geocoded:  Mye Villa, Mallapur : OK
Geocoded:  Chandanagar, NH : OK
Geocoded:  GK Pride, Yapral : OK
Geocoded:  Vishnu Saphire, Kond

Geocoded:  									 								 				             				             			             				         			            	PBEL City,  			            	 	                	 	                		 					            							          							         	Appa junction 							          							           			        				  						         	         		    		 	           			 						 							 								 					             					            	: OK
Geocoded:  									 								 				             				             			             				         			            	Spectra Metro Heights,  			            	 	                	 	                		 					            							          							         	Nagole 							          							           			        				  						         	         		    		 	           			 						 							 								 					             					            	: OK
Geocoded:  									 								 				             				             			             				         			            	Elite Trehan Towers,  			            	 	                	 	         

Completed 600 of 1171 address
Geocoded:  My Home Avatar, Narsingi, Outer Ring Road : OK
Geocoded:  									 								 				             				             			             				        	 	                	 	                		 					            							          							         	Mettuguda 							          							           			        				  						         	         		    		 	           			 						 							 								 					             					            	: OK
Geocoded:  									 								 				             				             			             				        	 	                	 	                		 					            							          							         	LB Nagar, NH  : OK
Geocoded:  									 								 				             				             			             				        	 	                	 	                		 					            							          							         	Chaitanyapuri, Kothapet 							          							           			        				  						         	         		    		 	           			 						 							 								 					         

Geocoded:  									 								 				             				             			             				        	 	                	 	                		 					            							          							         	Matrusri Nagar, Miyapur, NH  : OK
Geocoded:  									 								 				             				             			             				        	 	                	 	                		 					            							          							         	Moinabad, Chevella Road 							          							           			        				  						         	         		    		 	           			 						 							 								 					             					            	: OK
Geocoded:  									 								 				             				             			             				        	 	                	 	                		 					            							          							         	Bolarum, Medchal Road 							          							           			        				  						         	         		    		 	           			 						 							 								 					             					            	: OK
Geocoded:  									 								 				

Geocoded:  P Janardhan Reddy Nagar, Gachibowli, Outer Ring Road : OK
Geocoded:  Kukatpally Project, Kukatpally, NH : OK
Geocoded:  Barkatpura, Kachiguda, NH : OK
Geocoded:  Gachibowli, Outer Ring Road : OK
Geocoded:  Gokul Nagar-Tarnaka : OK
Geocoded:  Madhapur : OK
Geocoded:  Vertex Pleasant, Kukatpally, NH : OK
Geocoded:  Nizampet : OK
Geocoded:  Modi Splendour, Kukatpally, NH : OK
Geocoded:  Habsiguda, NH : OK
Geocoded:  Janapriya Lakefront, Sainikpuri : OK
Geocoded:  Nanakram Guda : OK
Geocoded:  Shamshabad: OK
Completed 800 of 1171 address
Error geocoding  Moti Nagar RWA: ZERO_RESULTS
Geocoded:  Moti Nagar RWA: ZERO_RESULTS
Geocoded:  Tellapur, Outer Ring Road : OK
Geocoded:  Ameerpet, NH : OK
Geocoded:  NGO Colony : OK
Geocoded:  Aditya Empress Towers, Shaikpet : OK
Geocoded:  Prestige Ivy League, Hitech City : OK
Geocoded:  Alkapur Township, Manikonda, Outer Ring Road : OK
Geocoded:  Khaja Guda, Outer Ring Road : OK
Geocoded:  Srila Park Pride, Kukatpally, NH : OK
Geocoded:  SMR

Geocoded:  Gudimalkapur : OK
Geocoded:  Old Bowenpally : OK
Geocoded:  Gollaguda : OK
Geocoded:  Kondapur RTO : OK
Geocoded:  Alakapur: OK
Geocoded:  Asif Nagar : OK
Geocoded:  Prime Bhadradri Towers, Kukatpally, NH : OK
Geocoded:  Surya Nagar-Quthbullapur : OK
Geocoded:  Bhadurpalle : OK
Geocoded:  Hyderabad : OK
Geocoded:  Domalguda, Himayath Nagar, NH : OK
Geocoded:  HMT swarnapuri colony,Miyapur : OK
Geocoded:  Nanal nagar, Mehdipatnam : OK
Geocoded:  Hyderabad : OK
Geocoded:  Dammaiguda : OK
Geocoded:  Gulshan Colony, Qutub Shahi Tombs : OK
Geocoded:  Anjaneyanagar Colony: OK
Geocoded:  safa complex Sivarampalli : OK
Geocoded:  									 								 				             				             			             				        	 	                	 	                		 					            							          							         	Bachupally, Outer Ring Road 							          							           			        				  						         	         		    		 	           			 						 							 								 					             					         

Geocoded:  									 								 				             				             			             				        	 	                	 	                		 					            							          							         	Isnapur, Outer Ring Road 							          							           			        				  						         	         		    		 	           			 						 							 								 					             					            	: OK
Geocoded:  									 								 				             				             			             				        	 	                	 	                		 					            							          							           							         	Chilkalguda 							          			        				  						         	         		    		 	           			 						 							 								 					             					            	: OK
Geocoded:  									 								 				             				             			             				         			            	My Home Avatar,  			            	 	                	 	                		 					            							          							         	Narsingi, Outer Ring Road 		

In [17]:
data1.head()

Unnamed: 0,postcode,status
0,,OK
1,500084.0,OK
2,500032.0,OK
3,,OK
4,500050.0,OK


In [18]:
data1['postcode']=data1['postcode'].replace("","No code")

In [19]:
data1

Unnamed: 0,postcode,status
0,No code,OK
1,500084,OK
2,500032,OK
3,No code,OK
4,500050,OK
5,No code,OK
6,500032,OK
7,59501,OK
8,500008,OK
9,No code,OK


In [20]:
data['Postal_code']=data1['postcode']

In [21]:
data.to_csv('output-magicbricks.csv')

In [22]:
data.head()

Unnamed: 0.1,Unnamed: 0,Bedrooms,Bathrooms,Furnishing,Tennants,Area,Price,Locality,Postal_code
0,0,3 BHK Service Apartment,1,Unfurnished,Bachelors,1390 sqft,26000,"Hafeezpet, NH",No code
1,1,1 BHK Builder Floor,1,Unfurnished,Bachelors/Family,read more,14000,Amarnath Residency,500084
2,2,3 BHK Apartment,1,Semi-Furnished,Family,1650 sqft,35000,Nanakram Guda,500032
3,3,2 BHK Builder Floor,1,Semi-Furnished,Bachelors,1200 sqft,22000,Jubilee Hills,No code
4,4,2 BHK Apartment,1,Unfurnished,Bachelors,1295 sqft,27000,"Chandanagar, NH",500050


##### End-of-Notebook

In [23]:
!jupyter nbconvert --to html webscrapping.ipynb

[NbConvertApp] Converting notebook webscrapping.ipynb to html
[NbConvertApp] Writing 416060 bytes to webscrapping.html
