# Lab 5.2 -- Scraping IMDB

Our goal is to scrape [IMDB](imdb.com) user reviews for *Borat Subsequent Moviefilm*.  Unfortunately, the user reviews page shows only a limited number of reviews, and the rest cannot be reached through a link to additional pages.  `selenium` to the rescue! In this lab, we will combine our two approaches to web scraping by

1. Using `selenium` to load the page and click the *Load More* button until we have all the reviews.
2. Creating a `BeautifulSoup` instance for the complete page and parsing the results.

### Task 1 -- Load the reviews.

Explore IMDB to find the web link for the user reviews for *Borat Subsequent Moviefilm* and load this page in Python with `selenium`.

In [175]:
#Easy way to copy and paste the filepath
import easygui
file = easygui.fileopenbox()
print(file)


/Users/dm6258xw/Downloads/chromedriver


In [176]:
# Your code here
from selenium import webdriver
driver = webdriver.Chrome(executable_path=file)
driver.get('https://www.imdb.com/title/tt13143964/reviews?ref_=tt_ov_rt')

### Task 2 -- Figure out how to click the *Load More* button.

To load all of the user reviews, we need to click the *Load More* button multiple times.  First, find the corresponding WebElement and verify that clicking this button loads another page of results.

In [177]:
# Your code here
driver.find_element_by_class_name('ipl-load-more__button').click()

### Task 3 -- Click *Load More* until you have all the results.

Now you need to write code that will keep clicking the *Load More* button when you find it.  **Hint:** We can think of this as an example of an *unfold* process, meaning you should use a `while` loop combined with a [try-and-except statement](https://pythonbasics.org/try-except/) to keep trying to click the button.  To make sure you don't get an infinite loop, use a variable to identify and hold the stopping condition/state.

In [178]:
# Your code here

from time import sleep

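done = False  # Stopping condition: becomes True once the button can no longer be clicked

while not done:
    try:
        driver.find_element_by_class_name('ipl-load-more__button').click()
        sleep(1)  # Give the next batch of reviews (and the button) time to load
    except Exception:
        # The button is gone or can no longer be clicked, so all reviews are loaded
        done = True
        print('The loop is now complete')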

The loop is now complete
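
The stopping logic in this loop does not depend on `selenium` itself, so it can be sketched and tried out with a stand-in for the click. The helper below is hypothetical (`click_until_gone`, `max_clicks`, and `pause` are names introduced here, not part of the lab). It keeps calling an action until the action raises, and it caps the number of clicks as a second guard against an infinite loop.

```python
from time import sleep


def click_until_gone(click, max_clicks=1000, pause=0.0):
    """Call `click()` until it raises or `max_clicks` is reached.

    Returns the number of successful clicks.
    """
    clicks = 0
    while clicks < max_clicks:
        try:
            click()
        except Exception:
            # The button could not be found or clicked, so we assume we are done.
            break
        clicks += 1
        sleep(pause)  # Give the page time to load the next batch
    return clicks


# Stand-in for the real button: it "disappears" after three clicks.
remaining = [3]


def fake_click():
    if remaining[0] == 0:
        raise RuntimeError('no Load More button')
    remaining[0] -= 1


n_clicks = click_until_gone(fake_click)
print(n_clicks)
```

With the driver from Task 1, the action could be something like `lambda: driver.find_element_by_class_name('ipl-load-more__button').click()` with `pause=1`.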


### Task 4 -- Load the results in a `BeautifulSoup` object.

Since `bs4` has better tools for parsing html, we will now switch to using this module to parse the results.  Recall that you can access the content of the current page from the `selenium` driver using `driver.page_source`.  You can use this attribute to make a `soup` object for the page using

> soup = BeautifulSoup(driver.page_source, 'html.parser')

In [179]:
# Your code here
import re
from bs4 import BeautifulSoup
from composablesoup import get_text
soup = BeautifulSoup(driver.page_source, 'html.parser')

### Task 5 -- Extract the information

Now extract the following data to a csv file.

1. Title
2. Score
3. User
4. Date
5. Text (replace commas with semi-colons!)
6. Two columns for X and Y, where `"X out of Y found this helpful"`
7. Permanent link to the review.


In [159]:
# Your code here


# Titles

titles = soup.find_all('a', attrs={'class':'title'})
title_messy = [get_text(x, strip=True) for x in titles]
title_output = [x.replace('\n', '').replace(',', ';') for x in title_messy]

title_output[:5]

[' Borat Make a Number 2',
 ' Laugh Out Loud Funny S#!^!',
 ' Excellent. And this is from a non Sasha Cohen Baron fan. REAL REVIEW.',
 ' Cohen is a genius',
 " The 10's are 10's & The 1's are 10's!"]

In [153]:
# Scores
score_re = re.compile(r'^\d\d?$')  # Raw string so that \d is not treated as a string escape
scores = soup.find_all('span')
score_text_all = [get_text(x, strip=True) for x in scores]
score_output = list(filter(score_re.match, score_text_all))

score_output[0:5]

['10', '10', '10', '10', '10']
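
To see what the score pattern accepts, it can help to try it on a few strings. This is a small sketch with made-up sample strings (they are not taken from the page). The pattern keeps text that consists of exactly one or two digits, which is why a rating such as `10` survives while `/10` or a date does not.

```python
import re

score_re = re.compile(r'^\d\d?$')

# Invented examples of text that might appear inside <span> tags
samples = ['10', '7', '/10', '100', '26 October 2020', '']
kept = [s for s in samples if score_re.match(s)]
print(kept)
```

Any other span whose text is only one or two digits would also pass this filter, so it is worth comparing the number of scores with the number of reviews before combining the columns.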

In [154]:
# Users
users = soup.find_all('span', attrs={'class':'display-name-link'})
user_output = [get_text(x, strip=True) for x in users]

user_output[0:5]

['MissCzarChasm',
 'YourSonsDad',
 'lvanka',
 'WindsOfWintergreen',
 'AnaAnaBanana']

In [170]:
#Date
dates = soup.find_all('span', attrs={'class':'review-date'})
date_output = [get_text(x, strip=True) for x in dates]

#Text
texts = soup.find_all('div', attrs={'class':'text show-more__control'})
text = [get_text(x, strip=True) for x in texts]
text_output = [x.replace(',', ';').replace('\n', ' ') for x in text]

#X out of Y
found_helpful = soup.find_all('div', attrs={'class':'actions text-muted'})
found_helpful_text = [get_text(x, strip=True) for x in found_helpful]
found_helpful_output_combined = [re.findall('[0-9]+', x) for x in found_helpful_text]

found_helpful_output = [', '.join(x) for x in found_helpful_output_combined]
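
The pattern `[0-9]+` would split a count such as `1,234` into two separate numbers. A hedged alternative is to capture the two counts explicitly. It assumes the text has the shape `"X out of Y found this helpful"` given in the task and that large counts use comma separators (an assumption). The helper name `parse_helpful` and the sample strings are invented for illustration.

```python
import re

# Each count starts with a digit and may contain comma separators.
helpful_re = re.compile(r'(\d[\d,]*) out of (\d[\d,]*) found this helpful')


def parse_helpful(text):
    """Return (X, Y) as integers, or None if the phrase is not present."""
    m = helpful_re.search(text)
    if m is None:
        return None
    x, y = (int(g.replace(',', '')) for g in m.groups())
    return x, y


small = parse_helpful('12 out of 30 found this helpful. Was this review helpful?')
large = parse_helpful('1,234 out of 2,000 found this helpful.')
missing = parse_helpful('Was this review helpful?')
print(small, large, missing)
```

Returning `None` for a missing phrase keeps one entry per review, which matters when the columns are later combined row by row.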

In [160]:
#Permalink
href = re.compile('/review/.*')
all_href = [a['href'] for a in soup.select('a[href]')]
permalink = ['https://www.imdb.com' + x for x in all_href if href.match(x)]

permalink[:5]

['https://www.imdb.com/review/rw6217081/?ref_=tt_urv',
 'https://www.imdb.com/review/rw6217081/?ref_=tt_urv',
 'https://www.imdb.com/review/rw6213611/?ref_=tt_urv',
 'https://www.imdb.com/review/rw6213611/?ref_=tt_urv',
 'https://www.imdb.com/review/rw6219436/?ref_=tt_urv']

In [171]:
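# Each permalink appears twice in the output above, so drop repeats (keeping order) before zipping
unique_permalink = list(dict.fromkeys(permalink))
results_list = [list(a) for a in zip(title_output, score_output, user_output, date_output, text_output, found_helpful_output, unique_permalink)]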

comma_join = [','.join(x) for x in results_list]

line_join = '\n'.join(comma_join)
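
`zip` stops at the shortest input without warning. If one list is longer or shorter than the others (for example because a review has no score, or a link is picked up twice), rows are silently dropped or shifted. A minimal guard, using a hypothetical helper name `check_aligned` and toy data, could look like this:

```python
def check_aligned(**columns):
    """Raise ValueError if the named lists do not all have the same length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f'columns have different lengths: {lengths}')
    return lengths


# Toy data: two reviews, but each link was collected twice.
titles = ['Great', 'Awful']
links = ['/review/a/', '/review/a/', '/review/b/', '/review/b/']

try:
    check_aligned(title=titles, permalink=links)
    message = ''
except ValueError as err:
    message = str(err)
    print(message)
```

Calling such a check on the real lists just before the `zip` step would surface a mismatch as an error instead of a quietly misaligned file.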


In [172]:
with open('lab5_2.csv', 'w') as outfile:
    outfile.write(line_join)
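
As an alternative to joining fields by hand, Python's built-in `csv` module quotes any field that contains a comma or a newline, so the review text would not need its commas replaced. Below is a minimal sketch with invented rows, written to an in-memory buffer instead of `lab5_2.csv`.

```python
import csv
import io

# Invented rows; the real rows would come from the scraped lists.
rows = [
    ['Title', 'Score', 'Text'],
    ['Very nice!', '10', 'Funny, sharp, and uncomfortable.'],
]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading it back recovers the original fields, commas included.
round_trip = list(csv.reader(io.StringIO(csv_text)))
```

To write a real file, open it with `open('lab5_2.csv', 'w', newline='')` and pass that file object to `csv.writer`.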