# Creating my own Dataset of Boston Apartments Leasing (Web Scraping)

## Using BeautifulSoup

## Description

The goal of this project is to create a dataset using web scraping: extracting apartment rental data from the RentHop site (http://www.renthop.com) and saving it to a CSV file.

4 simple steps:

1. Access Web Page
2. Locate Specific Information
3. Retrieve Data
4. Save Data - somewhere else to be accessed later (files or databases)


## Loading Libraries

In [17]:
# Libraries

import pandas as pd
import requests
from bs4 import BeautifulSoup


## Accessing Web Page

In [2]:
r = requests.get('https://www.renthop.com/boston-ma/apartments-for-rent')
r.content

b'<!doctype html>\n<html lang="en">\n<head>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n<meta http-equiv="Content-Language" content="en" />\n<title>Apartments for Rent in Boston, MA, No Fee Rentals | RentHop</title>\n<meta name="description" content="Newly listed Boston, MA apartments for rent. Smooth work commute, popular bars and nightlife, nearby restaurants and grocery stores, and safety. Find your perfect home in Boston, MA." />\n<meta name="author" content="RentHop" />\n<meta name="Copyright" content="Copyright (c) 2009 - 2020 RentHop.com" />\n<meta property="fb:page_id" content="124300320712" />\n<meta property="fb:app_id" content="294321126236" />\n<meta name="og:image" content="https://www.renthop.com/images/renthop_icon_small.png" />\n<link rel="image_src" href="https://www.renthop.com/images/renthop_icon_small.png" />\n<meta name="og:title" content="Apartments for Rent in Boston, MA, No Fee Rentals | RentHop" />\n<meta name="og:description" conten

I am going to use the BeautifulSoup library to do the HTML parsing.

In [3]:
# Creating an instance of BeautifulSoup

soup = BeautifulSoup(r.content, "html5lib")

The first step is to retrieve the **div** tags that contain the listing data on the page.

In [4]:
listing_divs = soup.select('div[class*=search-info]')
listing_divs

[<div class="search-info pr-4 pl-4 pr-md-0 pl-md-4 py-2 py-md-0">
 <div>
 <div class="float-right align-top font-size-9">
 <span class="font-gray-2 d-none d-sm-inline-block">
 Last 30 min
 </span>
 </div>
 <a class="font-size-11 listing-title-link b" href="https://www.renthop.com/listings/309-highland-avenue-somerville-ma-02144/a/58164299" id="listing-58164299-title">309 Highland Avenue, Apt A</a>
 <div class="font-size-9 overflow-ellipsis" id="listing-58164299-neighborhoods" style="margin-top: -1px;">
 Powder House, Somerville
 </div>
 </div>
 <div style="margin-top: 10px;">
 <table id="listing-58164299-info">
 <tbody><tr>
 <td class="font-size-11 b" id="listing-58164299-price" style="padding: 0px 10px 0px 0px; vertical-align: bottom;">
 $2,850
 </td>
 <td class="font-size-11 b" style="border-left: 1px solid #eeeeee; padding: 0px 10px 0px 10px; vertical-align: bottom;">
 <span style="color: #444444;">
 2 Bed
 </span>
 </td>
 <td class="font-size-11 b" style="border-left: 1px solid #ee

In [5]:
# Checking how many records the list contains

len(listing_divs)

20
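
To see what the `*=` attribute selector is doing, here is a minimal sketch that runs offline on a hand-written fragment modelled on the output above. The fragment (`sample_html`) and the `example.com` links are invented for illustration, and the built-in `html.parser` is used instead of `html5lib` only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the RentHop markup shown above.
sample_html = """
<div class="search-info pr-4 pl-4"><a id="listing-1-title" href="https://example.com/1">1 Main Street</a></div>
<div class="search-info pr-4 pl-4"><a id="listing-2-title" href="https://example.com/2">2 Main Street</a></div>
<div class="footer">Not a listing</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# [class*=search-info] matches every div whose class attribute
# contains the substring "search-info", whatever other classes it has.
sample_divs = sample_soup.select("div[class*=search-info]")

print(len(sample_divs))
```

The footer div is skipped because its class attribute does not contain the substring, which is the same reason the real page yields only listing blocks.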

## Locate Specific Information

Once I have all the **divs** with the listing data for each apartment, I need to pull out the individual data points for each apartment.

This is the information I want to target:

- URL of the listing
- Address of the apartment
- Neighborhood
- Price
- Number of bedrooms
- Number of bathrooms

In [6]:
# Checking out data in listing_divs

listing_divs[0]

<div class="search-info pr-4 pl-4 pr-md-0 pl-md-4 py-2 py-md-0">
<div>
<div class="float-right align-top font-size-9">
<span class="font-gray-2 d-none d-sm-inline-block">
Last 30 min
</span>
</div>
<a class="font-size-11 listing-title-link b" href="https://www.renthop.com/listings/309-highland-avenue-somerville-ma-02144/a/58164299" id="listing-58164299-title">309 Highland Avenue, Apt A</a>
<div class="font-size-9 overflow-ellipsis" id="listing-58164299-neighborhoods" style="margin-top: -1px;">
Powder House, Somerville
</div>
</div>
<div style="margin-top: 10px;">
<table id="listing-58164299-info">
<tbody><tr>
<td class="font-size-11 b" id="listing-58164299-price" style="padding: 0px 10px 0px 0px; vertical-align: bottom;">
$2,850
</td>
<td class="font-size-11 b" style="border-left: 1px solid #eeeeee; padding: 0px 10px 0px 10px; vertical-align: bottom;">
<span style="color: #444444;">
2 Bed
</span>
</td>
<td class="font-size-11 b" style="border-left: 1px solid #eeeeee; padding: 0px 10px 

In [7]:
# Retrieving data from one record

url = listing_divs[0].select('a[id*=title]')[0]['href']
address = listing_divs[0].select('a[id*=title]')[0].string
neighborhood = listing_divs[0].select('div[id*=hood]')[0].string.replace('\n','')

print(url)
print(address)
print(neighborhood)

https://www.renthop.com/listings/309-highland-avenue-somerville-ma-02144/a/58164299
309 Highland Avenue, Apt A
Powder House, Somerville
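
Indexing with `[0]` raises an `IndexError` when a listing lacks one of these elements. A more defensive variant, sketched here on an invented fragment (`sample_html` and the helper `text_or_none` are hypothetical names, not part of the notebook), uses `select_one`, which returns `None` when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical listing: the neighborhood div is left out on purpose.
sample_html = """
<div class="search-info">
  <a id="listing-1-title" href="https://example.com/1">1 Main Street, Apt A</a>
</div>
"""
listing = BeautifulSoup(sample_html, "html.parser").select_one("div[class*=search-info]")


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    tag = parent.select_one(selector)
    return tag.get_text(strip=True) if tag is not None else None


title_link = listing.select_one("a[id*=title]")
url = title_link["href"] if title_link is not None else None
address = text_or_none(listing, "a[id*=title]")
neighborhood = text_or_none(listing, "div[id*=hood]")

print(url)
print(address)
print(neighborhood)
```

`get_text(strip=True)` also removes the surrounding newlines, so the separate `.replace('\n','')` step is not needed in this variant.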


In [8]:
# Collecting data from the 20 records into one list

listing_list = []

for index in range(len(listing_divs)):
    each_listing = []
    current_listing = listing_divs[index]
    
    url = current_listing.select('a[id*=title]')[0]['href']
    address = current_listing.select('a[id*=title]')[0].string
    neighborhood = current_listing.select('div[id*=hood]')[0].string.replace('\n','')
    
    each_listing.append(url)
    each_listing.append(address)
    each_listing.append(neighborhood)
    
    listing_specs = current_listing.select('table[id*=info] tr') 

    for spec in listing_specs:
        try:
            each_listing.extend(spec.text.strip().replace(' ','_').split())
        except AttributeError:
            # Keep a NaN placeholder if the text of the row cannot be read
            each_listing.append(float('nan'))
            
    listing_list.append(each_listing)

    
listing_list[0:2]

[['https://www.renthop.com/listings/309-highland-avenue-somerville-ma-02144/a/58164299',
  '309 Highland Avenue, Apt A',
  'Powder House, Somerville',
  '$2,850',
  '2_Bed',
  '1_Bath'],
 ['https://www.renthop.com/listings/45-marion-street/513/58248057',
  '45 Marion Street, Apt 513',
  'Coolidge Corner, Brookline',
  '$4,950',
  '2_Bed',
  '2_Bath']]

## Retrieving Data

So far there are just 20 records because I analyzed only one page. To get more than 20 records, it is necessary to iterate over several pages. The site's search option produces the URL below, whose last parameter, *page*, makes it possible to navigate between result pages.

https://www.renthop.com/search/boston-ma?min_price=0&max_price=50000&q=&sort=hopscore&search=0&page=0

Let's code the page number change and get all the urls.


In [9]:
url_prefix = "https://www.renthop.com/search/boston-ma?min_price=0&max_price=50000&q=&sort=hopscore&search=0&page="
# The page number is the only part of the URL that changes
for page_number in range(4):
    target_page = url_prefix + str(page_number)
    print(target_page + '\n')
    

https://www.renthop.com/search/boston-ma?min_price=0&max_price=50000&q=&sort=hopscore&search=0&page=0

https://www.renthop.com/search/boston-ma?min_price=0&max_price=50000&q=&sort=hopscore&search=0&page=1

https://www.renthop.com/search/boston-ma?min_price=0&max_price=50000&q=&sort=hopscore&search=0&page=2

https://www.renthop.com/search/boston-ma?min_price=0&max_price=50000&q=&sort=hopscore&search=0&page=3
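
As an alternative to string concatenation, the query string can be assembled from a dictionary with the standard library. This is only a sketch: `BASE_URL` and `build_search_url` are names introduced here for illustration, and the parameter values are copied from the URL above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.renthop.com/search/boston-ma"


def build_search_url(page):
    """Build the search URL for one results page."""
    params = {
        "min_price": 0,
        "max_price": 50000,
        "q": "",
        "sort": "hopscore",
        "search": 0,
        "page": page,
    }
    # urlencode joins the pairs with "&" and escapes any special characters.
    return f"{BASE_URL}?{urlencode(params)}"


for page in range(4):
    print(build_search_url(page))
```

Keeping the parameters in a dictionary makes it easier to change a filter such as `max_price` later without editing a long string.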



Having located the data I need, the next step is to retrieve it from every page returned by the site search.

In [10]:
# Creating a function to retrieve the data from a single page
# of the site's search results

def retrieve_data(listing_divs):
    
    """
    Retrieve apartment characteristics from the HTML of one results page.

    Args:
        listing_divs: a list of "<div>" tags, one per listing

    Returns:
        listing_list: a list with one entry per apartment, each holding:
            - URL of the listing
            - Address of the apartment
            - Neighborhood
            - Price
            - Number of bedrooms
            - Number of bathrooms
    """
    
    listing_list = []

    for index in range(len(listing_divs)):
        each_listing = []
        current_listing = listing_divs[index]

        url = current_listing.select('a[id*=title]')[0]['href']
        address = current_listing.select('a[id*=title]')[0].string
        neighborhood = current_listing.select('div[id*=hood]')[0].string.replace('\n','')

        each_listing.append(url)
        each_listing.append(address)
        each_listing.append(neighborhood)

        listing_specs = current_listing.select('table[id*=info] tr') 

        for spec in listing_specs:
            try:
                each_listing.extend(spec.text.strip().replace(' ','_').split())
            except AttributeError:
                # Keep a NaN placeholder if the text of the row cannot be read
                each_listing.append(float('nan'))

        listing_list.append(each_listing)

    return listing_list

In [11]:
# Looping over the search result pages and applying the same steps to each one

url_prefix = "https://www.renthop.com/search/boston-ma?min_price=0&max_price=50000&q=&sort=hopscore&search=0&page="

all_pages_parsed = []
pages = 350

for page_number in range(1, pages + 1):
    target_page = url_prefix + str(page_number)

    r = requests.get(target_page)
    
    # Getting a BeautifulSoup instance to be able to retrieve data
    soup = BeautifulSoup(r.content, "html5lib")

    listing_divs = soup.select('div[class*=search-info]')
    
    one_page_parsed = retrieve_data(listing_divs)
    all_pages_parsed.extend(one_page_parsed)
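
Requesting 350 pages back to back puts load on the site and keeps going even if the results run out early. One possible refinement, sketched below, pauses between requests and stops at the first empty page. The function `scrape_pages` and its `fetch_listings` argument are hypothetical names, and the demonstration uses a stand-in dictionary instead of real requests so that it runs offline.

```python
import time


def scrape_pages(fetch_listings, max_pages, delay_seconds=1.0):
    """Collect listings page by page, pausing between requests.

    fetch_listings is a hypothetical callable: it takes a page number and
    returns the list of parsed listings found on that page.
    """
    collected = []
    for page_number in range(1, max_pages + 1):
        listings = fetch_listings(page_number)
        if not listings:
            # An empty page suggests we ran past the last page of results.
            break
        collected.extend(listings)
        time.sleep(delay_seconds)
    return collected


# Offline demonstration with a stand-in for the fetch-and-parse step.
fake_site = {1: [["url-1"], ["url-2"]], 2: [["url-3"]]}
rows = scrape_pages(lambda page: fake_site.get(page, []), max_pages=10, delay_seconds=0)

print(len(rows))
```

In this notebook, `fetch_listings` could wrap the body of the loop above: call `requests.get(target_page, timeout=10)`, check the response with `raise_for_status()`, parse it with BeautifulSoup, and return `retrieve_data(listing_divs)`. The one-second delay is an arbitrary choice, not a value recommended by the site.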


In [12]:
# Number of retrieved records 

print(len(all_pages_parsed))

7000


## Save Data on a CSV File

In [13]:
# Creating a pandas dataframe

df = pd.DataFrame(all_pages_parsed, columns=['url','address','neighborhood','price','rooms','baths','none'])
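
The seventh column, `none`, is there because some rows carry one more token than others. The sketch below uses two invented rows (`rows` and `df_demo` are illustrative names) to show how pandas pads the shorter row when given lists of unequal length.

```python
import pandas as pd

# Hypothetical rows: the second one has an extra token, as studio listings do.
rows = [
    ["url-1", "1 Main Street", "Somerville", "$2,850", "2_Bed", "1_Bath"],
    ["url-2", "2 Main Street", "Back Bay", "$1,850", "_", "Studio", "1_Bath"],
]
columns = ["url", "address", "neighborhood", "price", "rooms", "baths", "none"]

# The number of column names must match the longest row;
# shorter rows are padded with missing values on the right.
df_demo = pd.DataFrame(rows, columns=columns)

print(df_demo.shape)
print(df_demo["none"].isna().tolist())
```

The padding keeps the frame rectangular, but the values in the longer row end up shifted one column to the right and still need cleaning.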

In [14]:
df.head()

| | url | address | neighborhood | price | rooms | baths | none |
|---|---|---|---|---|---|---|---|
| 0 | https://www.renthop.com/listings/309-highland-... | 309 Highland Avenue, Apt A | Powder House, Somerville | $2,850 | 2_Bed | 1_Bath | |
| 1 | https://www.renthop.com/listings/45-marion-str... | 45 Marion Street, Apt 513 | Coolidge Corner, Brookline | $4,950 | 2_Bed | 2_Bath | |
| 2 | https://www.renthop.com/listings/529-beacon-st... | 529 Beacon Street, Apt 3 | Back Bay West, Back Bay, Boston | $1,850 | _ | Studio | 1_Bath |
| 3 | https://www.renthop.com/listings/magnus-ave-an... | 10 Magnus Avenue, Apt 2 | Ward Two, Somerville | $4,150 | 4_Bed | 1_Bath | |
| 4 | https://www.renthop.com/listings/50-stanhope-s... | 50 Stanhope Street, Apt 1L | Prudential - St. Botolph, Back Bay, Boston | $2,499 | 2_Bed | 2_Bath | |
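
Row 2 shows that a studio listing produces four tokens (`$1,850`, `_`, `Studio`, `1_Bath`), so its values spill one column to the right. One possible cleanup is to classify each token by its content instead of its position. This is a sketch that assumes tokens always look like the ones shown in this notebook. `parse_specs` is a hypothetical helper, and recording a studio as zero bedrooms is my own convention.

```python
def parse_specs(tokens):
    """Turn tokens such as ['$2,850', '2_Bed', '1_Bath'] into price, beds and baths."""
    specs = {"price": None, "beds": None, "baths": None}
    for token in tokens:
        if token.startswith("$"):
            specs["price"] = int(token.replace("$", "").replace(",", ""))
        elif token == "Studio":
            specs["beds"] = 0  # assumption: a studio counts as zero bedrooms
        elif token.endswith("_Bed"):
            specs["beds"] = int(token.split("_")[0])
        elif token.endswith("_Bath"):
            specs["baths"] = float(token.split("_")[0])
        # Any other token, such as a stray "_", is ignored.
    return specs


print(parse_specs(["$2,850", "2_Bed", "1_Bath"]))
print(parse_specs(["$1,850", "_", "Studio", "1_Bath"]))
```

Because every row then yields the same three keys, the extra `none` column is no longer needed for rows parsed this way.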


In [15]:
# Writing a comma-separated values (CSV) file

df.to_csv('apartments_leasing.csv', index=False)
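
The `index=False` argument matters when the file is read back later. The sketch below writes a tiny invented frame (`df_demo`) to in-memory buffers instead of files and compares the column names pandas recovers in each case.

```python
import io

import pandas as pd

df_demo = pd.DataFrame({"address": ["1 Main Street"], "price": ["$2,850"]})

# Default behaviour: the row labels are written as an unnamed first column.
with_index = io.StringIO()
df_demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# With index=False only the data columns are written.
without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```

Without `index=False`, the reloaded frame gains an extra `Unnamed: 0` column holding the old row numbers.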

## Conclusions

- Web scraping lets us create our own datasets for future analysis. This is just one example; there are many other sources on the web.

- To make locating and extracting the data easier, it is crucial to know the HTML of the web page being scraped.

- *BeautifulSoup* is a helpful and powerful tool for web scraping. It is easy to learn and has very good documentation, which you can check out at this [link](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

- *BeautifulSoup* does not fetch pages itself, so it needs an external library to make the request to the website. In this case I used Requests, and that dependency was not a disadvantage for this project.

[Wendy Navarrete](http://wendynavarrete.com)

July 2020