***
<div  align='center'><img src='https://cdn-res.keymedia.com/cms/images/ca/155/0348_637304012816162034.jpg' width='25%'></div >
<h1><center> WebScraping project: Ottawa Housing Market </center></h1>
<h2><center>Created by: Ginta Grinfelde</center></h2> 

***

## Importing libraries

In [1]:
from bs4 import BeautifulSoup # For HTML parsing
import requests # Website connections
from time import sleep # To prevent overwhelming the server between connections
from collections import Counter # Keep track of our term counts
import pandas as pd # For converting results to a dataframe and bar chart plots
import json # For parsing json
%matplotlib inline 
import re #For searching text
from random import randint # for sleep randomisation
import math # using math formulas


## 1. Testing if the website allows webscraping


In [2]:
page_url = 'https://www.point2homes.com/CA/Real-Estate-Listings/ON/Ottawa.html'

In [3]:
result = requests.get(page_url)

In [4]:
result.status_code # 200 means the request succeeded

200
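
A 200 status code only confirms that the request succeeded; on its own it does not say whether scraping is permitted. One extra check, sketched here with the standard library's `urllib.robotparser`, is to read the rules in the site's `robots.txt`. The rules in `SAMPLE_ROBOTS_TXT` are made up for illustration and are not the real rules of point2homes.com, and `is_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration only, NOT the real robots.txt of any site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent='*'):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, 'https://example.com/listings/ottawa.html'))
print(is_allowed(SAMPLE_ROBOTS_TXT, 'https://example.com/private/data.html'))
```

Against a live site, the parser's `set_url` and `read` methods could load the actual file instead of the sample text.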

## 2. Getting urls for each listing by using Beautiful Soup

### Getting content from the first page

After some failed attempts, I found out that the website displays only 30 pages of listings per search (24 listings per page). To get around this limit, I used price filters to split the search into 5 price ranges and scraped the result pages of each range separately. In this way, I was able to scrape 25-30 pages per price break.
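
The arithmetic behind this workaround can be sketched as follows, using the figures from the paragraph above (24 listings per page, at most 30 pages per search). The helper name `pages_to_scrape` and the example result counts are illustrative assumptions.

```python
import math

LISTINGS_PER_PAGE = 24   # Listings shown on each results page
MAX_PAGES = 30           # The site shows no more than 30 pages per search


def pages_to_scrape(num_results, per_page=LISTINGS_PER_PAGE, max_pages=MAX_PAGES):
    """Number of result pages that can actually be visited for one search."""
    return min(math.ceil(num_results / per_page), max_pages)


# One search can reach at most 30 * 24 listings.
print(MAX_PAGES * LISTINGS_PER_PAGE)   # 720
print(pages_to_scrape(100))            # 5
print(pages_to_scrape(5000))           # 30
```

Any search with more than 720 results is cut off, which is why narrower price ranges recover more listings.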

In [5]:
# Defining get_page function and getting the content from the first page - Real Estate Listings in Ottawa, Ontario.
# The same code can be used if using other provinces and cities.

def get_page(province, city, page, price_min, price_max):
    url = f'https://www.point2homes.com/CA/Real-Estate-Listings/{province}/{city}.html?page={page}&PriceMin={price_min}&PriceMax={price_max}'
    result = requests.get(url)
    soup = BeautifulSoup(result.content, 'html.parser')  # Naming the parser avoids a warning and keeps results consistent
    return soup
soup = get_page('ON', 'Ottawa', 1, 0, 250000)


In [6]:
# Setting price ranges to use as search filters and get more pages to scrape
# (the ranges do not overlap, so no listing is returned by two searches)
price_range = [
    ['0', '500000'],
    ['500001', '700000'],
    ['700001', '1000000'],
    ['1000001', '2000000'],
    ['2000001', '4000000']]
    

### Getting urls for each listing from all pages


In [1]:
# Importing sleep to slow down the requests
from time import sleep
list_urls = []

# Getting the number of search results per price filter (from the first page) and the number of listings per page,
# then calculating the number of pages to scrape for each price break.

for low, high in price_range:
    soup = get_page('ON', 'Ottawa', 1, low, high)
    results_text = soup.find('div', class_='results-no').get_text(strip=True).split()
    num_results_per_page = int(results_text[2].replace(',', ''))
    num_of_results = int(results_text[-2].replace(',', ''))
    num_of_pages = math.ceil(num_of_results / num_results_per_page)

    # Scraping every page of the current price break
    for page in range(1, num_of_pages + 1):
        sleep(randint(1, 10))  # Random 1-10 second pause before each request
        soup = get_page('ON', 'Ottawa', page, low, high)

        # Getting the url of each listing on the page and storing it in list_urls
        for addr in soup.find_all('h3', class_='item-address-title'):
            list_urls.append(addr.a.get('href'))
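
The indexes `[2]` and `[-2]` used above depend on the exact wording of the results counter. As a sketch, assuming the counter reads something like `1 - 24 of 1,234 Results` (a hypothetical string; the real page text may differ), the same two numbers can be pulled out with a regular expression that fails loudly if the wording changes. `parse_results_counter` is an illustrative name.

```python
import math
import re


def parse_results_counter(text):
    """Return (listings_per_page, total_results) from a counter string.

    Assumes wording like '1 - 24 of 1,234 Results'; raises ValueError otherwise.
    """
    match = re.search(r'(\d[\d,]*)\s*-\s*(\d[\d,]*)\s+of\s+(\d[\d,]*)', text)
    if match is None:
        raise ValueError(f'Unexpected results counter: {text!r}')
    first, last, total = (int(g.replace(',', '')) for g in match.groups())
    return last - first + 1, total


per_page, total = parse_results_counter('1 - 24 of 1,234 Results')
print(per_page, total, math.ceil(total / per_page))  # 24 1234 52
```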


## 3. Collecting data from each listing

### Determining which attributes to scrape

To have a better understanding of the available properties, I decided to scrape the following information from each listing:

0. prop_address = Property address
1. price = Listed selling price
2. num_beds = Number of bedrooms
3. num_baths = Number of bathrooms
4. prop_type = Property type (Residential, Single family etc)
5. prop_style = Property style (Semi-detached, Condominium etc)
6. year_built = When the house was built
7. parking = Available parking at the property
8. basement = If the property has a basement
9. prop_taxes = Estimated yearly taxes for houses
10. neighborhood = Property's neighborhood
11. postal_code = Property's postal code
12. lot_info = Size of the property
13. assoc_fee = Estimated fees for condominiums
14. walk_score = Walk Score is a number between 0 and 100 that measures the walkability of the address.
15. transit_score = Transit Score is a number between 0 and 100 that shows how close the public transport is to the address.
16. bike_score = Bike Score is a number between 0 and 100 that measures the bikeability of the address.
17. latitude = Address's latitude for using maps
18. longitude = Address's longitude for using maps

### Defining the full urls for each listing's page and getting the content
The urls in list_urls are partial, so a new function has to be defined for scraping each listing's page.

In [None]:
# The link format differs from the one used in get_page, so a new function is created.
# During scraping I found out that not all urls are partial, so I added an if statement to handle full urls.

def get_page2(ext):
    if ext.startswith('http'):
        url = ext
    else:
        url = f'https://www.point2homes.com{ext}'
    result = requests.get(url)
    p_content = BeautifulSoup(result.content, 'html.parser')
    return p_content
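
An alternative to the `if` statement, shown here only as a sketch, is `urllib.parse.urljoin` from the standard library, which leaves full urls untouched and completes partial ones. The listing paths below are made-up examples, and `full_listing_url` is a hypothetical name.

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.point2homes.com'


def full_listing_url(ext, base=BASE_URL):
    """Turn a partial listing path into a full url; keep full urls unchanged."""
    return urljoin(base, ext)


# Hypothetical hrefs of the two kinds found in list_urls
print(full_listing_url('/CA/Home-For-Sale/ON/Ottawa/example-listing.html'))
print(full_listing_url('https://www.example.com/some-listing.html'))
```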



### Scraping information based on the above attributes from each property listing


In [None]:
# Importing sleep again. It has to be called as sleep(...) for every pause: assigning its
# result (sleep = sleep(...)) pauses only once and replaces the function with None.
from time import sleep

MISSING = 'Missing info'  # Stored when a listing does not have the attribute


def safe(extract):
    # Run one lookup and fall back to MISSING if the element is not on the page
    try:
        return extract()
    except (AttributeError, IndexError, KeyError, TypeError):
        return MISSING


def tag_text(page, tag, css_class):
    # Text of the first <tag> with the given class
    return page.find(tag, class_=css_class).get_text(strip=True)


def dd_text(page, label):
    # Text of the <dd> that follows the <dt> labelled `label`
    return page.find('dt', string=label).find_next_sibling('dd').get_text(strip=True)


def score_text(page, css_class):
    # First child of the <span> inside the score block with the given class
    return list(page.find('div', class_=css_class).span.children)[0].get_text(strip=True)


def coord_value(page, name):
    # The input id is looked up in the raw HTML first, then used to read the value
    input_id = re.findall(name + r'_+._\d+', str(page))[0]
    return page.find('input', id=input_id)['value']


# Getting data from each listing and storing it in a list of lists - data_p

data_p = []

for url in list_urls:
    sleep(randint(1, 10))  # Random 1-10 second pause before each request
    p = get_page2(url)

    # One row per listing; each lookup is parsed once from the listing's page
    data_p.append([
        safe(lambda: tag_text(p, 'div', 'address-container')),          #0 prop_address
        safe(lambda: tag_text(p, 'div', 'price')),                      #1 price
        safe(lambda: tag_text(p, 'li', 'ic-beds')[0]),                  #2 num_beds (first character)
        safe(lambda: tag_text(p, 'li', 'ic-baths nosq')[0]),            #3 num_baths (first character)
        safe(lambda: tag_text(p, 'li', 'property-type ic-proptype')),   #4 prop_type
        safe(lambda: dd_text(p, 'Style')),                              #5 prop_style
        safe(lambda: dd_text(p, 'Year Built')),                         #6 year_built
        safe(lambda: dd_text(p, 'Parking info')),                       #7 parking
        safe(lambda: dd_text(p, 'Basement')),                           #8 basement
        safe(lambda: dd_text(p, 'Taxes')),                              #9 prop_taxes
        safe(lambda: dd_text(p, 'Neighborhood')),                       #10 neighborhood
        safe(lambda: dd_text(p, 'Postal Code')),                        #11 postal_code
        safe(lambda: dd_text(p, 'Lot info')),                           #12 lot_info
        safe(lambda: dd_text(p, 'Association Fee')),                    #13 assoc_fee
        safe(lambda: score_text(p, 'walkscore-item walkscore-ic1')),    #14 walk_score
        safe(lambda: score_text(p, 'walkscore-item walkscore-ic2')),    #15 transit_score
        safe(lambda: score_text(p, 'walkscore-item walkscore-ic3')),    #16 bike_score
        safe(lambda: coord_value(p, 'Latitude')),                       #17 latitude
        safe(lambda: coord_value(p, 'Longitude')),                      #18 longitude
    ])
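
The latitude and longitude lookups work in two steps: a regular expression first finds the full `id` of the input element in the raw HTML, and that `id` is then used to read the element's `value`. The first step can be tried on its own, as sketched here. The HTML fragment and the ids in it are invented for illustration, and the real page markup may differ; `find_input_id` is a hypothetical helper name.

```python
import re

# Invented fragment, for illustration only
sample_html = (
    '<input type="hidden" id="Latitude_1_12345" value="45.4215">'
    '<input type="hidden" id="Longitude_1_12345" value="-75.6972">'
)


def find_input_id(name, html):
    """Return the first id shaped like '<name>_<any char>_<digits>', or None."""
    matches = re.findall(name + r'_+._\d+', html)
    return matches[0] if matches else None


print(find_input_id('Latitude', sample_html))    # Latitude_1_12345
print(find_input_id('Longitude', sample_html))   # Longitude_1_12345
print(find_input_id('Altitude', sample_html))    # None
```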

# Data Analysis

In [5]:
#Importing libraries
import numpy as np
import plotly.offline as py
import plotly.graph_objs as go 
import plotly.express as px
import pandas as pd

py.init_notebook_mode(connected=True)

Creating a new DataFrame 'properties' from a list of lists

In [None]:
properties = pd.DataFrame(data_p, columns=[
    'prop_address', 'price', 'num_beds', 'num_baths', 'prop_type',
    'prop_style', 'year_built', 'parking', 'basement',
    'prop_taxes', 'neighborhood', 'postal_code', 'lot_info', 'assoc_fee',
    'walk_score', 'transit_score', 'bike_score', 'latitude', 'longitude'])

### Exporting the data to a csv file

In [None]:
properties.to_csv('/Users/gintagrinfelde/Documents/Data Science/Python/DataScraping_python-Ginta/point2homes.csv', index=False)

#### The exploration of the scraped data can be found in the Data Analysis notebook