# Data Gathering

**In this phase, our goal is to collect property data for Delhi from `Makaan.com`, an Indian online real estate platform and property listing website.** 

Makaan.com is a marketplace where individuals and real estate professionals list and search for properties, including apartments, houses, land, and commercial spaces, for sale or rent across many Indian cities. For each listing we aim to capture the following details (the location and the number of bedrooms are both contained in the listing's tagline, for example "3 BHK in Sector 3 Dwarka Delhi"):

1. Property Location
2. Construction Status
3. Price
4. Area
5. Number of Bedrooms (BHK)
6. Number of Bathrooms

Website Link: https://www.makaan.com/delhi-residential-property/buy-property-in-delhi-city

### Imports

In [1]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import requests

### Lists for Individual Collection

In [2]:
tagline = []
construction_status = []
price = []
area = []
bathrooms = []

### Web Scraping

In [3]:
import time

start_time = time.time()

# Iterate through pages 1 to 2599 (the upper bound of range() is exclusive)
for i in range(1, 2600):
    
    # Construct the URL for each page
    page_url = f"https://www.makaan.com/listings?sortBy=date-desc&listingType=buy&pageType=CITY_URLS&cityName=Delhi&cityId=6&templateId=MAKAAN_CITY_LISTING_BUY&page={i}"
    
    try:
        # Send an HTTP GET request to the URL; the timeout keeps a stalled
        # connection from hanging the whole loop
        page = requests.get(page_url, timeout=30)
        
        # Check if the request was successful
        if page.status_code == 200:
            # Parse the page content with BeautifulSoup
            soup = BeautifulSoup(page.content, "html.parser")
            
            for span in soup.find_all('div', class_='txt'):
                try:
                    tagline.append(span.find('h3', class_='seo-hdng').find('span').text)
                except AttributeError:
                    tagline.append('NULL')                   

            try:
                construction_status.extend([item.text for item in soup.find_all('td', class_='val')])
            except AttributeError:
                construction_status.append('NULL')
                
            try:
                price.extend([f"{price_span.text} {unit_span.text}" for price_span, unit_span in zip(soup.find_all('span', class_='val', itemprop='offers'), soup.find_all('span', class_='unit'))])
            except AttributeError:
                price.append('NULL')                
                                
            try:
                area.extend([item.text for item in soup.find_all('td', class_='size')])
            except AttributeError:
                area.append('NULL')

            try:
                bathrooms.extend([item.text for item in soup.find_all('li', title='Bathrooms')])
            except AttributeError:
                bathrooms.append('NULL')                

        else:
            print(f"Error: Unable to retrieve page {i}. Status code: {page.status_code}")
    except Exception as e:
        print(f"Error: An error occurred on page {i}. {str(e)}")

end_time = time.time()
elapsed_time = end_time - start_time
print(f"Scraping took {elapsed_time} seconds.")


Scraping took 2374.0850162506104 seconds.
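The loop above reports a failed page and moves on, so a temporary network error means that page's listings are lost. One possible way to make this more forgiving is a small retry helper with exponential backoff. The sketch below is only an illustration: `fetch_with_retries`, its parameters, and the injected `fetch` and `sleep` callables are hypothetical names and are not part of the original notebook.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def fetch_with_retries(
    fetch: Callable[[str], T],
    url: str,
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call fetch(url) up to `retries` times, doubling the wait after each failure.

    Returns the first successful result, or None if every attempt raised.
    """
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as exc:  # broad on purpose, mirroring the loop above
            print(f"Attempt {attempt + 1} failed for {url}: {exc}")
            # Wait before the next attempt, but not after the final one
            if attempt < retries - 1:
                sleep(base_delay * 2 ** attempt)
    return None
```

In the notebook, `fetch` could be something like `lambda u: requests.get(u, timeout=30)`, with the existing status-code check applied to the returned response.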


### Checking List Lengths Before Constructing the DataFrame

In [11]:
print("Length of Tagline list:", len(tagline))
print("Length of Construction Status list:", len(construction_status))
print("Length of Price list:", len(price))
print("Length of Area list:", len(area))
print("Length of Bathrooms list:", len(bathrooms))

Length of Tagline list: 49980
Length of Construction Status list: 51980
Length of Price list: 51980
Length of Area list: 51980
Length of Bathrooms list: 49980


#### Matching the Lengths of all the Lists

In [24]:
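# Pad the shorter lists up to the length of the price list.
# Note: the None values are appended at the end, so this only equalizes the
# lengths; it does not recover which listings were actually missing a value.
target_length = len(price)
tagline.extend([None] * (target_length - len(tagline)))
bathrooms.extend([None] * (target_length - len(bathrooms)))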

In [25]:
print("Length of Tagline list:", len(tagline))
print("Length of Bathrooms list:", len(bathrooms))

Length of Tagline list: 51980
Length of Bathrooms list: 51980
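The padding step can also be written without depending on specific list names or lengths. The following is a minimal sketch, assuming the scraped lists are collected in a dictionary; `pad_to_longest` is a hypothetical helper and the two short lists are toy data for illustration. As with the cell above, it only equalizes lengths and does not realign values with the listings they came from.

```python
from typing import Any, Dict, List


def pad_to_longest(columns: Dict[str, List[Any]], fill: Any = None) -> Dict[str, List[Any]]:
    """Return copies of the lists, each padded at the end with `fill` to the longest length."""
    target = max((len(values) for values in columns.values()), default=0)
    return {
        name: values + [fill] * (target - len(values))
        for name, values in columns.items()
    }


# Toy data standing in for the scraped lists
columns = {
    "Tagline": ["3 BHK in Sector 3 Dwarka Delhi"],
    "Price": ["1.8 Cr", "25 L"],
}
padded = pad_to_longest(columns)
print({name: len(values) for name, values in padded.items()})
```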


### Constructing DataFrame

In [26]:
data = {
    'Tagline': tagline,
    'Construction Status': construction_status,
    'Price': price,
    'Area': area,
    'Bathrooms': bathrooms
}

df = pd.DataFrame(data)

In [27]:
df

|  | Tagline | Construction Status | Price | Area | Bathrooms |
|---|---|---|---|---|---|
| 0 | 3 BHK in Sector 3 Dwarka Delhi | Ready to move | 1.8 Cr | 1900 | 3 Bathrooms |
| 1 | 3 BHK in Sector 3 Dwarka Delhi | Ready to move | 1.8 Cr | 1900 | 2 Bathrooms |
| 2 | 3 BHK in Sector 3 Dwarka Delhi | Ready to move | 1.8 Cr | 1900 | 2 Bathrooms |
| 3 | 3 BHK in Sector 3 Dwarka Delhi | Ready to move | 1.81 Cr | 1900 | 3 Bathrooms |
| 4 | 3 BHK in Sector 3 Dwarka Delhi | Ready to move | 1.8 Cr | 1900 | 3 Bathrooms |
| ... | ... | ... | ... | ... | ... |
| 51975 |  | Ready to move | 25 L | 600 |  |
| 51976 |  | Ready to move | 4 Cr | 1800 |  |
| 51977 |  | Ready to move | 36.32 L | 789 |  |
| 51978 |  | Ready to move | 48 L | 1000 |  |
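The `Tagline` and `Price` columns are still free text at this point. As a rough sketch of how they might later be turned into structured values, the helpers below split a tagline into bedroom count and location and convert a price into rupees (1 L, lakh, is 100,000; 1 Cr, crore, is 10,000,000). The function names and regular expressions are assumptions based only on the sample rows shown above; values in any other shape, including missing ones, simply yield `None`.

```python
import re
from typing import Optional, Tuple

# Indian numbering units: 1 lakh (L) = 100,000; 1 crore (Cr) = 10,000,000
PRICE_UNITS = {"L": 100_000, "Cr": 10_000_000}


def parse_price(text: Optional[str]) -> Optional[int]:
    """Convert strings such as '1.8 Cr' or '25 L' to whole rupees, else None."""
    match = re.fullmatch(r"\s*([\d.,]+)\s*(L|Cr)\s*", text or "")
    if not match:
        return None
    try:
        number = float(match.group(1).replace(",", ""))
    except ValueError:  # e.g. a malformed number like '1..8'
        return None
    # round() absorbs floating-point noise from the multiplication
    return round(number * PRICE_UNITS[match.group(2)])


def parse_tagline(text: Optional[str]) -> Tuple[Optional[int], Optional[str]]:
    """Split '<n> BHK in <location>' into (n, location), else (None, None)."""
    match = re.fullmatch(r"\s*(\d+)\s*BHK\s+in\s+(.+?)\s*", text or "")
    if not match:
        return None, None
    return int(match.group(1)), match.group(2)


print(parse_tagline("3 BHK in Sector 3 Dwarka Delhi"))
print(parse_price("1.8 Cr"), parse_price("25 L"))
```

With pandas, these could be applied column-wise, for example through `df['Price'].map(parse_price)`.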


### Saving DataFrame as `xlsx` file

In [28]:
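import os

# Writing .xlsx files with pandas requires an Excel engine such as openpyxl
output_path = os.path.join(os.getcwd(), "Delhi-House-Prices.xlsx")
df.to_excel(output_path, index=False)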