# Data scraping
---

This script collects data from 'www.apartments.com' and stores it in a DataFrame. The following data are collected from the website. Data cleaning is performed later to remove any unwanted features from the DataFrame.
- Apartment complex name
- Address
- Rent
- Number of bedrooms and bathrooms, and square footage (sqft)
- Tenant rating
- Amenities such as laundry, fitness center, business center, etc.
- Allowed pets
- Nearby schools
- Additional information such as the year built

## Import useful libraries

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np

# regulate frequency of request
import time

## Parsing the main page with BeautifulSoup

The website provides a main page for each region, including each of the 5 boroughs of New York City. These main pages can be accessed by URLs of the form 'www.apartments.com/name-of-region'. In this project, the 5 boroughs of New York City are considered.

As a first step, the main page for each borough is parsed to find the total number of 'listing pages' to scrape. The number of listing pages is stored in a dictionary called 'boroughs_max_page'.

To avoid overloading the server, a 1-second delay is introduced between requests.

In [2]:
# boroughs of NY
boroughs = ['manhattan-ny','queens-ny','brooklyn-ny','staten-island-ny','bronx-ny']
boroughs_max_page = {'manhattan-ny':0,'queens-ny':0,'brooklyn-ny':0,'staten-island-ny':0,'bronx-ny':0}
boroughs_listing_pages = {'manhattan-ny':{},'queens-ny':{},'brooklyn-ny':{},'staten-island-ny':{},'bronx-ny':{}}

# send user agent to avoid bot check
headers = {'User-Agent': 'User'}

# loop over boroughs of NY
for br in boroughs:
    
    # request page
    page_main = requests.get('https://www.apartments.com/%s/'%br, headers=headers)
    
    # pause to regulate request frequency
    time.sleep(1)

    # create soup of the main page
    soup_main = BeautifulSoup(page_main.content,'html.parser')
    
    # find the total number of apartment "listing pages" to scrape
    max_list_pages = 0

    # find tags that link to "listing pages"
    for tag in soup_main.find_all('a'):
        # select tags containing page numbers
        if 'data-page' in tag.attrs:
            if 'class' not in tag.attrs:
                # keep the largest page number seen so far for this borough
                max_list_pages = max(int(tag['data-page']), max_list_pages)

    # store the number of listing pages to scrape for this borough
    boroughs_max_page[br] = max_list_pages

# for testing            
print(boroughs_max_page)

{'manhattan-ny': 28, 'queens-ny': 28, 'brooklyn-ny': 28, 'staten-island-ny': 23, 'bronx-ny': 28}
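
The intent of the loop above, taking the largest `data-page` value among the pagination links that carry no `class` attribute, can be isolated into a small pure function. The sketch below is only an illustration: `max_page_from_attrs` is a hypothetical helper name, and it operates on plain attribute dictionaries (the kind of mapping `tag.attrs` returns) so it can be checked without any network access.

```python
def max_page_from_attrs(attr_dicts):
    """Return the largest 'data-page' value among the given attribute dicts.

    Hypothetical helper (not part of the original script). Dicts that have a
    'class' key are skipped, mirroring the filter used in the loop above.
    Returns 0 when no page number is found.
    """
    max_page = 0
    for attrs in attr_dicts:
        if "data-page" in attrs and "class" not in attrs:
            max_page = max(max_page, int(attrs["data-page"]))
    return max_page


# made-up attribute dicts standing in for the <a> tags of a main page
sample_attrs = [
    {"href": "/manhattan-ny/"},                        # no page number
    {"href": "/manhattan-ny/2/", "data-page": "2"},
    {"href": "/manhattan-ny/28/", "data-page": "28"},
    {"href": "/manhattan-ny/3/", "data-page": "3"},
    {"data-page": "99", "class": ["next"]},            # skipped: has a class
]
print(max_page_from_attrs(sample_attrs))
```

Inside the borough loop it could be called as `max_page_from_attrs(tag.attrs for tag in soup_main.find_all('a'))`, assuming the page layout matches what the original loop expects.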


## Collect URLs of each rental listing

Knowing the total number of listing pages for each borough, we parse each listing page to collect the URLs of the apartment listings. The URL and title of each apartment complex are stored in the dictionary "boroughs_listing_pages".

As in the previous step, a 1-second delay is introduced between requests to avoid overloading the server. Because of the delay, this code can take a few minutes to run. An optional debug message (commented out in the code) can therefore be printed before each page is processed.

In [3]:
# loop over boroughs
for br in boroughs:
#if True:

    # loop over page number
    for i in range(boroughs_max_page[br]):
        
        # set page number to avoid confusion
        page_number = i+1
        
        #print("Loop over page %s of %s" %(page_number,br))
        
        # request list page
        page_listing = requests.get('https://www.apartments.com/%s/%s' %(br,page_number), headers=headers)
        
        # pause to regulate request frequency
        time.sleep(1)
       
        # parse listing page with BeautifulSoup
        soup_listing = BeautifulSoup(page_listing.content,'html.parser')
       
        # for testing
        #print(soup_listing.prettify())

        for tag in soup_listing.findAll("a",{"class":"placardTitle"}):
            
            # add title and the link to each property
            if not tag['title'] in boroughs_listing_pages[br]:
                boroughs_listing_pages[br][tag['title']] = tag['href']
            
# Collected links
print("collected links to each apartment pages")

collected links to each apartment pages
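
The request-then-pause pattern used in both loops can be wrapped in one helper so the delay is never forgotten. The following is a minimal sketch under stated assumptions: `fetch_politely` is a hypothetical name, the retry on a non-200 status code is an addition the original script does not perform, and the `get` and `sleep` callables are passed in so the helper can be exercised without touching the network.

```python
import time


def fetch_politely(get, url, delay=1.0, retries=2, sleep=time.sleep):
    """Call get(url) and pause `delay` seconds after every attempt.

    Hypothetical helper (not part of the original script). If the response
    has a status_code other than 200, the request is repeated up to
    `retries` more times. The last response received is returned.
    """
    response = None
    for _ in range(retries + 1):
        response = get(url)
        # pause to regulate request frequency, as in the loops above
        sleep(delay)
        if getattr(response, "status_code", None) == 200:
            break
    return response
```

With the `requests` library it might be used as `fetch_politely(lambda u: requests.get(u, headers=headers), url)`, where `headers` is the dictionary defined earlier.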


For debugging purposes, the total number of apartment rental listings collected for each borough is printed here. In the current state, each borough has around 500 to 700 apartment listings.

In [4]:
# for debugging
for br in boroughs:
    print(len(boroughs_listing_pages[br]))


693
667
698
556
658


## Scrape apartment rental listings

In the previous step, we collected the title and URL of every apartment rental listing posted on the website. Now we loop over each URL and collect useful features such as rent, number of bedrooms, etc.

Parsing is done with BeautifulSoup, and the features collected from each apartment rental listing are stored in Python lists. Later, the lists are converted into a DataFrame.


Additional care is taken so that a feature missing from an apartment listing does not cause the script to crash.

In [5]:
# define lists to hold contents before creating DataFrame
list_rental_title = []
list_street_address = []
list_city = []
list_state = []
list_postal_code = []
list_rating = []
list_amenity = []
list_pet_policy = []
list_property_info = []
list_school = []
list_bedrooms = []
list_bathrooms = []
list_rent = []
list_deposit = []
list_unit = []
list_sqft = []
list_name = []
list_leaseLength = []
list_borough = []

# loop over boroughs again to scrape rental listings
for br in boroughs:
    
    # loop over rental listing
    for title, url in boroughs_listing_pages[br].items():
        
        #print("boroughs: %s, processing apartment: %s" %(br,title))
        
        #=======================================
        # request and parse each complex
        #=======================================
        
        # get url from the list
        rental_url = url
        
        # request list page
        page_rent = requests.get('%s' %rental_url, headers=headers)
        
        # pause to regulate request frequency
        time.sleep(0.2)
        
        # parse listing page with BeautifulSoup
        soup_rent = BeautifulSoup(page_rent.content,'html.parser')
        
        # for testing
        #print(soup_rent.prettify())
        
        #=======================================
        # find complex's property
        #=======================================
        
        # get rental title
        rental_title = title
        
        # street address
        if soup_rent.findAll("span",{"itemprop":"streetAddress"}):
            street_address = soup_rent.findAll("span",{"itemprop":"streetAddress"})[0].text
        else:
            street_address = ""
        
        # city
        if soup_rent.findAll("span",{"itemprop":"addressLocality"}):
            city = soup_rent.findAll("span",{"itemprop":"addressLocality"})[0].text
        else:
            city = ""
        
        # state
        if soup_rent.findAll("span",{"itemprop":"addressRegion"}):
            state = soup_rent.findAll("span",{"itemprop":"addressRegion"})[0].text
        else:
            state = ""
        
        # postal code (zip code)
        if soup_rent.findAll("span",{"itemprop":"postalCode"}):
            postal_code = soup_rent.findAll("span",{"itemprop":"postalCode"})[0].text
        else:
            postal_code = ""
        
        # apartment rating (guard against listings without a rating element)
        rating_tags = soup_rent.findAll("div",{"class":"rating"})
        if rating_tags and 'title' in rating_tags[0].attrs:
            rating = rating_tags[0]['title']
        else:
            rating = ""
        
        # list of amenities
        amenity_temp = []
        if soup_rent.findAll("section",{"class":"printPropertySection"}):
            for tag_amenity in soup_rent.findAll("section",{"class":"printPropertySection"})[0].findAll("li"):
                amenity_temp.append(tag_amenity.text)
            
        # add amenity to row
        amenity = amenity_temp
        
        # pet policy
        pet_policy_temp = []
        for tags_pet in soup_rent.findAll("div",{"class":"petPolicyDetails"}):
            pet_policy_temp.append(tags_pet.text)
            
        # add pet policy to row
        pet_policy = pet_policy_temp
            
        # additional information such as built date, complex size    
        if soup_rent.findAll("div",{"class":"specList propertyFeatures js-spec"}):
            property_info = soup_rent.findAll("div",{"class":"specList propertyFeatures js-spec"})[0].text
        else:
            property_info = ""
        
        # nearby school information
        school_temp = []
        for tag_school in soup_rent.findAll("div",{"class":"schoolCard"}):
            # get name, number of students, rating
            school_name = tag_school.findAll("p",{"class":"schoolType"})[0].text
            school_number_student = tag_school.findAll("p",{"class":"numberOfStudents"})[0].text
            school_rating = tag_school.findAll("i")[0]['class'][0]
            
            # add it to list of schools as a dictionary
            school_temp.append({"school_name":school_name, "school_number_student":school_number_student,"school_rating":school_rating})
            
        # add school to row
        school = school_temp

        # loop over available units in this apartment
        for tag_unit in soup_rent.findAll("tr",{"class":"rentalGridRow"}):
        
            #=======================================
            # find unit property
            #=======================================
          
            # append complex's properties
            list_rental_title.append(rental_title)
            list_street_address.append(street_address)
            list_city.append(city)
            list_state.append(state)
            list_postal_code.append(postal_code)
            list_rating.append(rating)
            list_amenity.append(amenity)
            list_pet_policy.append(pet_policy)
            list_property_info.append(property_info)
            list_school.append(school)
            list_borough.append(br)
            
            # number of bedrooms in this apartment
            if tag_unit.findAll("td",{"class":"beds"}):
                list_bedrooms.append(tag_unit.findAll("td",{"class":"beds"})[0].findAll("span",{"class":"shortText"})[0].text.strip())
            else:
                list_bedrooms.append("")
            
            # number of bathroom in this apartment
            if tag_unit.findAll("td",{"class":"baths"}):
                list_bathrooms.append(tag_unit.findAll("td",{"class":"baths"})[0].findAll("span",{"class":"shortText"})[0].text.strip())
            else:
                list_bathrooms.append("")
            
            # rent per month
            if tag_unit.findAll("td",{"class":"rent"}):
                list_rent.append(tag_unit.findAll("td",{"class":"rent"})[0].text.strip())
            else:
                list_rent.append("")
            
            # deposit
            if tag_unit.findAll("td",{"class":"deposit"}):
                list_deposit.append(tag_unit.findAll("td",{"class":"deposit"})[0].text.strip())
            else:
                list_deposit.append("")
            
            # unit
            if tag_unit.findAll("td",{"class":"unit"}):
                list_unit.append(tag_unit.findAll("td",{"class":"unit"})[0].text.strip())
            else:
                list_unit.append("")
        
            # sqft
            if tag_unit.findAll("td",{"class":"sqft"}):
                list_sqft.append(tag_unit.findAll("td",{"class":"sqft"})[0].text.strip())
            else:
                list_sqft.append("")
        
            # name
            if tag_unit.findAll("td",{"class":"name"}):
                list_name.append(tag_unit.findAll("td",{"class":"name"})[0].text.strip())
            else:
                list_name.append("")
                
            # lease length
            if tag_unit.findAll("td",{"class":"leaseLength"}):
                list_leaseLength.append(tag_unit.findAll("td",{"class":"leaseLength"})[0].text.strip())
            else:
                list_leaseLength.append("")
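
The "search, then take the first match or fall back to an empty string" pattern repeated throughout the cell above can be expressed once. The sketch below is illustrative only: `first_text` is a hypothetical helper, and it assumes nothing beyond a list-like result whose items expose a `.text` attribute, which is how the results of `findAll` are used in the cell.

```python
def first_text(matches, default="", strip=True):
    """Return the text of the first match, or `default` if there is none.

    Hypothetical helper (not part of the original script). `matches` is any
    list-like sequence of objects with a `.text` attribute.
    """
    if not matches:
        return default
    text = matches[0].text
    return text.strip() if strip else text
```

A lookup such as the street address could then be written as `street_address = first_text(soup_rent.findAll("span", {"itemprop": "streetAddress"}))`, which also avoids searching the document twice. Note that with `strip=True` surrounding whitespace is removed, which the original address lookups do not do.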
            
                    


## Converting list to DataFrame

The lists collected in the previous step are converted into a DataFrame. The DataFrame is then saved as a CSV file for post-processing.

In [6]:
df = pd.DataFrame({'rental_title':list_rental_title,
                   'borough':list_borough,
                   'street_address':list_street_address,
                   'city':list_city,
                   'state':list_state,
                   'postal_code':list_postal_code,
                   'rating':list_rating,
                   'amenity':list_amenity,
                   'pet_policy':list_pet_policy,
                   'property_info':list_property_info,
                   'school':list_school,
                   'bedrooms':list_bedrooms,
                   'bathrooms':list_bathrooms,
                   'rent':list_rent,
                   'deposit':list_deposit,
                   'unit':list_unit,
                   'sqft':list_sqft,
                   'name':list_name,
                   'leaseLength':list_leaseLength})
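
Because the columns are built from parallel lists, a single missed `append` would leave the lists with different lengths, and building a DataFrame from a dictionary of unequal-length lists fails. A quick check before the conversion can point at the offending column. The following is a sketch only; `check_equal_lengths` is a hypothetical helper, not something the original notebook defines.

```python
def check_equal_lengths(columns):
    """Return the common length of all column lists.

    Hypothetical helper (not part of the original script). `columns` maps a
    column name to its list of values. Raises ValueError, listing every
    column length, when the lists do not all have the same length.
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("column lengths differ: %s" % lengths)
    return next(iter(lengths.values()), 0)
```

It could be called with the same dictionary that is passed to `pd.DataFrame`, for example `check_equal_lengths({'rent': list_rent, 'unit': list_unit})`.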

Before storing the DataFrame as a CSV file, check the output and make sure it looks okay.

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9297 entries, 0 to 9296
Data columns (total 19 columns):
amenity           9297 non-null object
bathrooms         9297 non-null object
bedrooms          9297 non-null object
borough           9297 non-null object
city              9297 non-null object
deposit           9297 non-null object
leaseLength       9297 non-null object
name              9297 non-null object
pet_policy        9297 non-null object
postal_code       9297 non-null object
property_info     9297 non-null object
rating            9297 non-null object
rent              9297 non-null object
rental_title      9297 non-null object
school            9297 non-null object
sqft              9297 non-null object
state             9297 non-null object
street_address    9297 non-null object
unit              9297 non-null object
dtypes: object(19)
memory usage: 1.3+ MB


In [8]:
df.head(5)

Unnamed: 0,amenity,bathrooms,bedrooms,borough,city,deposit,leaseLength,name,pet_policy,postal_code,property_info,rating,rent,rental_title,school,sqft,state,street_address,unit
0,"[Bowling Alley, Cardio Fitness Room, Central A...",1 BA,Studio,manhattan-ny,New York,,12 - 24 Month Lease,Studio/1 Bathroom,[\n\n\r\n Pets Nego...,10018,\nProperty Information\n\n•Built in 2017\n•598...,5 star property,"$3,255","555TEN, New York, NY","[{'school_name': 'Public Elementary School', '...",500 Sq Ft,NY,555 10th Ave,
1,"[Bowling Alley, Cardio Fitness Room, Central A...",1 BA,Studio,manhattan-ny,New York,,12 - 24 Month Lease,Studio/1 Bathroom,[\n\n\r\n Pets Nego...,10018,\nProperty Information\n\n•Built in 2017\n•598...,5 star property,"$3,255","555TEN, New York, NY","[{'school_name': 'Public Elementary School', '...",,NY,555 10th Ave,29C
2,"[Bowling Alley, Cardio Fitness Room, Central A...",1 BA,Studio,manhattan-ny,New York,,12 - 24 Month Lease,Alcove Studio/1 Bathroom,[\n\n\r\n Pets Nego...,10018,\nProperty Information\n\n•Built in 2017\n•598...,5 star property,"$3,365","555TEN, New York, NY","[{'school_name': 'Public Elementary School', '...",647 Sq Ft,NY,555 10th Ave,
3,"[Bowling Alley, Cardio Fitness Room, Central A...",1 BA,Studio,manhattan-ny,New York,,12 - 24 Month Lease,Alcove Studio/1 Bathroom,[\n\n\r\n Pets Nego...,10018,\nProperty Information\n\n•Built in 2017\n•598...,5 star property,"$3,365","555TEN, New York, NY","[{'school_name': 'Public Elementary School', '...",,NY,555 10th Ave,32H
4,"[Bowling Alley, Cardio Fitness Room, Central A...",1 BA,1 BR,manhattan-ny,New York,,12 - 24 Month Lease,1 Bedroom/1 Bathroom,[\n\n\r\n Pets Nego...,10018,\nProperty Information\n\n•Built in 2017\n•598...,5 star property,"$4,415 - 5,050","555TEN, New York, NY","[{'school_name': 'Public Elementary School', '...",550 Sq Ft,NY,555 10th Ave,


In [9]:
df.to_csv('ny_rental_data.csv')
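
All columns are saved as strings, so values such as `"$4,415 - 5,050"` in `rent` or `"500 Sq Ft"` in `sqft` still need to be converted to numbers during post-processing. As a rough illustration of what that cleaning step might look like, the sketch below parses those two formats with regular expressions. The function names `parse_rent` and `parse_sqft` are hypothetical, and the formats handled are only the ones visible in the sample output above; other formats on the site are not covered.

```python
import re


def parse_rent(text):
    """Return (low, high) rent in dollars from a string like '$4,415 - 5,050'.

    Hypothetical helper (not part of the original script). A single price
    yields the same value twice; text without digits yields (None, None).
    """
    numbers = [int(n.replace(",", "")) for n in re.findall(r"\d[\d,]*", text)]
    if not numbers:
        return (None, None)
    return (min(numbers), max(numbers))


def parse_sqft(text):
    """Return the first number in a string like '647 Sq Ft', or None."""
    match = re.search(r"\d[\d,]*", text)
    return int(match.group().replace(",", "")) if match else None
```

With pandas, these could be applied column-wise, for example `df['sqft'].apply(parse_sqft)`.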