# Scrape Raw Data from Movie Review Pages
#### In this notebook, I import the URLs scraped in notebook \#1 and visit each page to scrape its data. I run each page through Beautiful Soup to strip away HTML artifacts.
#### A separate function extracts each individual movie feature, which makes troubleshooting easier and keeps the process modular: if I think of additional features to extract in the future, I will not have to rewrite all my code. As part of my troubleshooting, I printed the name of each movie, and which of the 8780 movies it was, as its page was visited and its data obtained.

### Importing URLs

In [1]:
import pandas as pd
import seaborn as sns
import requests, re, json, time, copy
from bs4 import BeautifulSoup as bs4

### Load list of movie URLs and visit them to scrape movie features

In [2]:
# Load json of movies_urls
with open('data/movies_urls.json') as json_file:  
    movies_urls = json.load(json_file)

In [3]:
movies_urls[:5]

['https://www.commonsensemedia.org/movie-reviews/siberia',
 'https://www.commonsensemedia.org/movie-reviews/shock-and-awe',
 'https://www.commonsensemedia.org/movie-reviews/eighth-grade',
 'https://www.commonsensemedia.org/movie-reviews/skyscraper',
 'https://www.commonsensemedia.org/movie-reviews/hotel-transylvania-3-summer-vacation']

### Main functions used to scrape data

In [4]:
movies_features = []
def scrape_movie_features(first_movie, num_movies_to_scrape):
    # Failed requests are collected as [index, url] pairs (a list, so .append works)
    missed_movies = []
    for offset in range(num_movies_to_scrape):
        movie = first_movie + offset
        # Stop before indexing past the end of the URL list
        if movie >= len(movies_urls):
            print('That was the last movie.')
            break
        url = movies_urls[movie]
        res = requests.get(url)
        print(movie, url[47:])  # progress: index and slug
        if res.status_code == 200:
            # Parse only pages that were fetched successfully
            soup = bs4(res.content, 'lxml')
            movies_features.append(get_movie_features(movie, url, soup))
        else:
            missed_movies.append([movie, url])
    return movies_features, missed_movies
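
The loop logic can be exercised without touching the network. The following is a minimal sketch, not the scraper used in this notebook: `scrape_batch`, `fetch`, and `delay_seconds` are hypothetical names, and the fetch function is passed in so that a fake can stand in for `requests.get`. It also pauses between requests, which the loop above does not do.

```python
import time

def scrape_batch(urls, first, count, fetch, delay_seconds=0.0):
    """Visit urls[first:first+count]; return (results, missed).

    fetch(url) is an assumed callable returning (status_code, body).
    """
    results, missed = [], []
    # Clamp the range so a too-large count cannot index past the list
    last = min(first + count, len(urls))
    for index in range(first, last):
        url = urls[index]
        status, body = fetch(url)
        if status == 200:
            results.append({'movie_id': index, 'body': body})
        else:
            missed.append([index, url])
        time.sleep(delay_seconds)  # pause between requests
    return results, missed

# Fake fetcher for a dry run: URLs ending in 'missing' simulate a failed request
def fake_fetch(url):
    return (404, '') if url.endswith('missing') else (200, '<h1>ok</h1>')

demo_urls = ['https://example.com/a',
             'https://example.com/missing',
             'https://example.com/c']
demo_results, demo_missed = scrape_batch(demo_urls, 0, 10, fake_fetch)
print(len(demo_results), demo_missed)
```

Returning the missed `[index, url]` pairs makes it possible to retry only those pages later instead of rerunning the whole scrape.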

In [5]:
def get_movie_features(num, url, soup):
    # Collect every raw feature for one movie page into a single dict
    movie_features = {}
    movie_features['movie_id'] = num
    movie_features['slug'] = url[47:]
    movie_features['title'] = get_title(soup)
    movie_features['age'] = get_age(soup)
    movie_features['family_topics'] = get_family_topics(soup)
    movie_features['is_it_any_good'] = get_is_it_any_good(soup)
    movie_features['movie_details_raw'] = get_movie_details_raw(soup)
    movie_features['one_line_description'] = get_one_line_description(soup)
    movie_features['overall_rating'] = get_overall_rating(soup)
    movie_features['parental_rating_and_spoilers_raw'] = get_parental_rating_and_spoilers_raw(soup)
    movie_features['what_is_the_story_raw'] = get_what_is_the_story(soup)
    movie_features['what_parents_need_to_know_raw'] = get_what_parents_need_to_know(soup)
    return movie_features
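
The slug above relies on the URL prefix being exactly 47 characters long. As an alternative sketch (the helper name `slug_from_url` is hypothetical and not part of this notebook), the last path segment can be taken with the standard library, which keeps working if the prefix ever changes:

```python
from urllib.parse import urlparse

def slug_from_url(url):
    """Return the last non-empty path segment of a URL."""
    path = urlparse(url).path
    return path.rstrip('/').rsplit('/', 1)[-1]

example = 'https://www.commonsensemedia.org/movie-reviews/eighth-grade'
print(slug_from_url(example))  # eighth-grade
print(example[47:])            # eighth-grade, same result as the fixed offset
```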

### Helper Functions to get specific raw movie features

In [6]:
def get_title(soup):
    return str(soup.h1)[4:-5]

In [7]:
def get_age(soup):
    return str(soup.find('div', 'csm-green-age').get_text())

In [8]:
def get_family_topics(soup):
    return str(soup.find('div',
                         'field-name-field-family-topics').get_text())   

In [9]:
def get_is_it_any_good(soup):
    return str(soup.find('div', 'field-name-field-any-good').get_text())

In [10]:
def get_movie_details_raw(soup):
    return str(soup.find('div', 'pane-product-details').get_text())   

In [11]:
def get_one_line_description(soup):
    return str(soup.find('div', 'field-name-field-one-liner').get_text())

In [12]:
def get_overall_rating(soup):
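    # Raw string avoids an invalid-escape warning; the first digit in the div's HTML is taken as the rating
    rating = re.search(r"\d", str(soup.find('div', 'field_stars_rating')))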
    return str(rating[0])
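
`get_overall_rating` raises a `TypeError` when no digit is found, for instance when the rating div is missing from a page. A minimal sketch of a more forgiving variant follows; `first_digit_or_none` is a hypothetical helper, and the sample markup is invented for illustration rather than copied from the site.

```python
import re

def first_digit_or_none(html_fragment):
    """Return the first digit in the text as a string, or None if absent."""
    match = re.search(r"\d", html_fragment)
    return match.group(0) if match else None

# Invented markup containing a digit
print(first_digit_or_none('<div class="field_stars_rating rating-4"></div>'))
# str(None) is what the original code would search when the div is missing
print(first_digit_or_none('None'))
```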

In [13]:
def get_parental_rating_and_spoilers_raw(soup):
    return str(soup.find_all('div', 'field-type-field-collection'))

In [14]:
def get_what_is_the_story(soup):
    return str(soup.find('div',
                         'field-name-field-what-is-story').get_text())

In [15]:
def get_what_parents_need_to_know(soup):
    return str(soup.find('div', 'field-name-field-parents-need-to-know'))
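
Most of the helpers above call `.get_text()` directly on the result of `soup.find(...)`, which returns `None` when a page lacks that section, so a single incomplete page would raise an `AttributeError`. One possible guard is sketched below, under the assumption that a missing section should simply be recorded as `None` (the name `text_or_none` is hypothetical):

```python
def text_or_none(soup, css_class):
    """Return the text of the first div with css_class, or None if missing.

    soup is expected to be a BeautifulSoup object (anything with a
    compatible find() method works).
    """
    tag = soup.find('div', css_class)
    return tag.get_text() if tag is not None else None
```

With this in place, a helper such as `get_age` could be reduced to `return text_or_none(soup, 'csm-green-age')`.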

### Scraping Movie Features

In [16]:
len(movies_urls)

8892

In [22]:
# Leave this cell commented out; uncommenting it runs the full scrape, which takes about two and a half hours
# movies_features, missed_movies = scrape_movie_features(0, 9000)

In [18]:
len(movies_features)

8892

In [23]:
# Run if you'd like to see the scraped raw, unprocessed movie data
# movies_features[8891]

In [20]:
# movies_features_0_to_8891new = movies_features

In [21]:
# with open('data/movies_features_0_to_8891new.json', 'w') as output:
#     json.dump(movies_features_0_to_8891new, output)

#### N.B. movies_features_0_to_8779_less_2 contains the features of every movie except two, Dolphins and ? (243 and 244). From that point on, each movie's list index and its movie_id differ by 2. All other movies were collected.
#### Apparently, these two broken movie pages were fixed (or possibly deleted) before the movies were rescraped on 7/13/18. Subsequent notebooks are not (yet) set up to access the new movie data.
#### Movie features are saved to a JSON file for further cleaning and data exploration in Notebook 3.
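
Since list position and `movie_id` can drift apart when pages are skipped, a lookup keyed on `movie_id` is safer than indexing the list directly. A minimal sketch, using made-up records shaped like the dicts built by `get_movie_features` (the titles are placeholders, not scraped data):

```python
records = [
    {'movie_id': 242, 'title': 'placeholder A'},
    {'movie_id': 245, 'title': 'placeholder B'},  # 243 and 244 were skipped
    {'movie_id': 246, 'title': 'placeholder C'},
]

# Key each record by its movie_id instead of its list position
by_id = {record['movie_id']: record for record in records}

print(by_id[245]['title'])
print(243 in by_id)  # skipped ids are simply absent
```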