# Data Scraping RentHop.com
This notebook focuses on building an app to help choose an apartment in NYC.
The main content is based on the book Python Machine Learning Blueprints, published by Packt.

---------------------------

Importing libraries 

In [2]:
# Importing libraries
import numpy as np
import pandas as pd
import requests
import matplotlib.pyplot as plt

# Inline plots
%matplotlib inline

As stated in the title, we're going to use NYC apartment data in our model. The URL is:
https://www.renthop.com/nyc/apartments-for-rent

First, we'll run a quick test to make sure that we can retrieve the page.

In [None]:
r = requests.get('https://www.renthop.com/nyc/apartments-for-rent')
r.content

I removed the output of `r.content` because it is too long, but it should be the HTML of the whole page. You can try
copying the response into an HTML text editor to see what was downloaded. It renders as a bare page because the CSS file isn't downloaded, but all the content is there.
That makes scraping easier, as the page doesn't rely much on JavaScript for rendering.

If the data were behind JavaScript code, we would need a tool such as Selenium to interact with it.
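
Before parsing, it can help to confirm that the request actually succeeded. The following is a minimal sketch, not part of the original notebook: `is_usable_html` is a hypothetical helper that inspects the `status_code`, `headers`, and `content` attributes that a `requests` response exposes. The stand-in objects below only imitate a response so the check can be run offline.

```python
from types import SimpleNamespace


def is_usable_html(response):
    """Return True if the response looks like a successful, non-empty HTML page."""
    content_type = response.headers.get("Content-Type", "")
    return (
        response.status_code == 200
        and "text/html" in content_type
        and len(response.content) > 0
    )


# Stand-ins for a real response (assumption: only these three attributes are needed).
fake_ok = SimpleNamespace(
    status_code=200,
    headers={"Content-Type": "text/html; charset=UTF-8"},
    content=b"<html><body>listings</body></html>",
)
fake_blocked = SimpleNamespace(status_code=403, headers={}, content=b"")

print(is_usable_html(fake_ok))
print(is_usable_html(fake_blocked))
```

With a real request, `is_usable_html(r)` could be called right after `requests.get(...)` and before handing `r.content` to a parser.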

Now examine the page elements to see how we can parse the page data:

1. Open the RentHop site and right-click anywhere on the page.
2. Click on **Inspect**.
3. Click the square-with-arrow button, then click on the data on the page. The corresponding code will be highlighted.

You can see that each listing's data is in a table: the first `td` tag contains the price, the second the number of bedrooms, and
the third the number of bathrooms. The address can be found in an anchor (`a`) tag.

--------------------

To do the HTML parsing, we are going to use the BeautifulSoup library.

We simply need to pass the page content into the BeautifulSoup class.

In [6]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(r.content, 'html5lib')

# Now we can use the soup object to parse the apartment data. The first step
# is to retrieve the div tags that contain the listing data on the page.

# Select every div whose class attribute contains 'search-info'.
listing_divs = soup.select('div[class*=search-info]')
listing_divs

[<div class="search-info d-block d-md-inline-block align-top">
 <div>
 <div class="float-right font-size-9" style="padding-top: 2px;">
 <span class="font-gray-2 d-none d-sm-inline-block"></span>
 </div>
 <a class="font-size-12 b" href="https://www.renthop.com/listings/775-columbus-ave/07c/62453191" id="listing-62453191-title" style="text-decoration: none; line-height: 100%;">
 775 Columbus Avenue, Apt 07C
 </a>
 <div class="font-gray-1 font-size-8 overflow-ellipsis" id="listing-62453191-neighborhoods" style="margin-top: 0px;">
 Manhattan Valley, Upper West Side, Upper Manhattan, Manhattan
 </div>
 </div>
 <div class="vspace-1" style="height: 15px;"></div>
 <div id="listing-62453191-info">
 
 <div class="d-inline-block align-middle" id="listing-62453191-price" style="line-height: 100%;">
 <span class="font-size-20 b">$5,170</span>
 </div>
 <div class="d-inline-block">
 <div class="font-size-7 font-blue ml-1 b" style="line-height: 100%; background-color: var(--blue-dark-t-10); padding: 4

Looking at the page, we'd expect roughly twenty listing divs. Let's confirm with `len(listing_divs)`.

In [7]:
len(listing_divs)

20
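
The `*=` in the selector is a CSS substring match: it keeps any `div` whose `class` attribute contains `search-info` anywhere. A small sketch on made-up HTML follows. The snippet and its class names are illustrative, not taken from RentHop, and `html.parser` is used here only to avoid the extra `html5lib` dependency.

```python
from bs4 import BeautifulSoup

# Illustrative HTML, not real RentHop markup.
demo_html = """
<div class="search-info d-block">Listing A</div>
<div class="search-map">Map widget</div>
<div class="align-top search-info">Listing B</div>
"""

demo_soup = BeautifulSoup(demo_html, "html.parser")

# [class*=search-info] matches when the class attribute contains the substring.
matches = demo_soup.select("div[class*=search-info]")
names = [div.get_text(strip=True) for div in matches]
print(names)
```

The `search-map` div is skipped because its class attribute never contains the substring, regardless of where `search-info` appears among the other classes.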

## Pulling the individual data points

We need to pull out the individual data points for each apartment. The points we want are:

- URL of the listing
- Address of the apartment
- Neighborhood
- Number of bedrooms
- Number of bathrooms
- Price

Other data points, such as square footage, would be useful, but they are not available here.

Let's check the first listing:

In [8]:
listing_divs[0]

<div class="search-info d-block d-md-inline-block align-top">
<div>
<div class="float-right font-size-9" style="padding-top: 2px;">
<span class="font-gray-2 d-none d-sm-inline-block"></span>
</div>
<a class="font-size-12 b" href="https://www.renthop.com/listings/775-columbus-ave/07c/62453191" id="listing-62453191-title" style="text-decoration: none; line-height: 100%;">
775 Columbus Avenue, Apt 07C
</a>
<div class="font-gray-1 font-size-8 overflow-ellipsis" id="listing-62453191-neighborhoods" style="margin-top: 0px;">
Manhattan Valley, Upper West Side, Upper Manhattan, Manhattan
</div>
</div>
<div class="vspace-1" style="height: 15px;"></div>
<div id="listing-62453191-info">

<div class="d-inline-block align-middle" id="listing-62453191-price" style="line-height: 100%;">
<span class="font-size-20 b">$5,170</span>
</div>
<div class="d-inline-block">
<div class="font-size-7 font-blue ml-1 b" style="line-height: 100%; background-color: var(--blue-dark-t-10); padding: 4px 6px 3px 6px;">
No

The first `div` contains all the data points we're looking for, so there's no need to inspect the others for now.

We now need to target each data point individually.

The URL for the listing is in an anchor (`a`) tag. Let's parse it out with another `select` statement.

In [13]:
print(listing_divs[0].select('a[id*=title]')[0]['href'])

print(listing_divs[0].select('div[id*=hood]')[0])

https://www.renthop.com/listings/775-columbus-ave/07c/62453191
<div class="font-gray-1 font-size-8 overflow-ellipsis" id="listing-62453191-neighborhoods" style="margin-top: 0px;">
Manhattan Valley, Upper West Side, Upper Manhattan, Manhattan
</div>


That's the info we want! Now we can retrieve the other data points for the listing.

In [11]:
# First check the first listing div.
href = listing_divs[0].select('a[id*=title]')[0]['href']
addy = listing_divs[0].select('a[id*=title]')[0].string
hood = listing_divs[0].select('div[id*=hood]')[0].string.replace('\n', '')

# Verify this by printing what we got.
print(href)
print(addy)
print(hood)

https://www.renthop.com/listings/775-columbus-ave/07c/62453191

775 Columbus Avenue, Apt 07C

Manhattan Valley, Upper West Side, Upper Manhattan, Manhattan


Now we have the address!
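
The blank lines in the printed output come from newline characters surrounding the text inside each tag. Here is a minimal sketch of tidying these values, using the strings printed above as literals. Treating the last URL path segment as a listing ID is an assumption based on the `id` attributes seen in the markup.

```python
from urllib.parse import urlparse

# Values as they come out of the page, including surrounding newlines.
raw_href = "https://www.renthop.com/listings/775-columbus-ave/07c/62453191"
raw_addy = "\n775 Columbus Avenue, Apt 07C\n"
raw_hood = "\nManhattan Valley, Upper West Side, Upper Manhattan, Manhattan\n"

clean_addy = raw_addy.strip()
# Split the neighborhood string into its individual area names.
hood_parts = [part.strip() for part in raw_hood.split(",")]
# Assumption: the final path segment identifies the listing.
listing_id = urlparse(raw_href).path.rstrip("/").split("/")[-1]

print(clean_addy)
print(hood_parts)
print(listing_id)
```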

Let's continue with the other items: bedrooms, bathrooms and price.

Since they are in a table tag inside a div, and then inside a table row, we need to iterate
over each point to capture the data.

**NOTE**
The original book states the paragraph above, but at the time this notebook was written, the data is no longer in a table but
in a div matching `id*=info`. So you must change the following line from:

`listing_specs = listing_divs[0].select('table[id*=info] tr')`

to

`listing_specs = listing_divs[0].select('div[id*=info]')`
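
Because the page layout has already changed once, one option is to try the older selector first and fall back to the newer one. This is a sketch under that assumption, not code from the book: `select_specs` is a hypothetical helper, and the HTML is a trimmed imitation of the listing markup shown earlier.

```python
from bs4 import BeautifulSoup


def select_specs(listing):
    """Try the table-based selector, then fall back to the div-based one."""
    for selector in ("table[id*=info] tr", "div[id*=info]"):
        found = listing.select(selector)
        if found:
            return found
    return []


# Trimmed imitation of a listing div (illustrative only).
demo_listing_html = """
<div class="search-info">
  <div id="listing-1-info"><span>$5,170</span></div>
</div>
"""
demo_listing = BeautifulSoup(demo_listing_html, "html.parser")

specs = select_specs(demo_listing)
print(len(specs), specs[0].get_text(strip=True))
```

If neither selector matches, the helper returns an empty list, which makes a further layout change easy to detect.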



In [16]:
listing_specs = listing_divs[0].select('div[id*=info]')

for spec in listing_specs:
    spec_data = spec.text.strip().replace(' ', '_').split()
    print(spec_data)

['$5,170', 'No_Fee', 'By_Owner']
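
The underscore replacement keeps multi-word labels such as `No Fee` together when the text is split on whitespace. Below is a short sketch of the same idea on a plain string, followed by a hypothetical `parse_price` helper that turns the price token into an integer. The sample text imitates what `spec.text` might return; the exact whitespace on the live page is an assumption.

```python
import re

# Imitation of spec.text for one listing.
sample_text = "\n$5,170\n\nNo Fee\n\nBy Owner\n"

# Spaces become underscores, so split() only breaks on the newlines.
tokens = sample_text.strip().replace(" ", "_").split()


def parse_price(token):
    """Convert a token like '$5,170' to an int, or return None if it has no digits."""
    digits = re.sub(r"[^\d]", "", token)
    return int(digits) if digits else None


price = parse_price(tokens[0])
# Restore the spaces in the remaining labels.
labels = [t.replace("_", " ") for t in tokens[1:]]
print(tokens, price, labels)
```

Storing the price as an integer rather than a string should make it easier to use as a numeric feature later on.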


In [17]:
listing_specs

[<div id="listing-62453191-info">
 
 <div class="d-inline-block align-middle" id="listing-62453191-price" style="line-height: 100%;">
 <span class="font-size-20 b">$5,170</span>
 </div>
 <div class="d-inline-block">
 <div class="font-size-7 font-blue ml-1 b" style="line-height: 100%; background-color: var(--blue-dark-t-10); padding: 4px 6px 3px 6px;">
 No Fee
 </div>
 </div>
 <div class="d-inline-block">
 <div class="font-size-7 font-blue ml-1 b" style="line-height: 100%; background-color: var(--blue-dark-t-10); padding: 4px 6px 3px 6px;">
 By Owner
 </div>
 </div>
 </div>]