# Scrape Raw Data from Movie Review Pages
#### In this notebook, a list of URLs is imported and used to scrape individual movie pages for text and non-text data. The data is then run through Beautiful Soup to strip away HTML artifacts and exported for further processing.
#### Separate functions extract each individual movie feature, which makes troubleshooting easier and keeps the code modular.

### Importing Libraries

In [1]:
import pandas as pd
import seaborn as sns
import requests, re, json, time, copy
from bs4 import BeautifulSoup as bs4

### Load list of movie URLs and visit them to scrape movie features

In [2]:
# Load json of movies_urls
with open('data/movies_urls.json') as json_file:  
    movies_urls = json.load(json_file)

In [3]:
movies_urls[:5]

['https://www.commonsensemedia.org/movie-reviews/the-music-of-silence',
 'https://www.commonsensemedia.org/movie-reviews/the-miseducation-of-cameron-post',
 'https://www.commonsensemedia.org/movie-reviews/the-spy-who-dumped-me',
 'https://www.commonsensemedia.org/movie-reviews/the-darkest-minds',
 'https://www.commonsensemedia.org/movie-reviews/like-father']

### Main functions used to scrape data

In [4]:
movies_features = []
def scrape_movie_features(first_movie, num_movies_to_scrape):
    """Scrape up to num_movies_to_scrape pages, starting at index first_movie."""
    missed_movies = []  # [index, url] pairs for pages that did not return 200
    # Stop at the end of the URL list so indexing never raises IndexError
    last_movie = min(first_movie + num_movies_to_scrape, len(movies_urls))
    for movie in range(first_movie, last_movie):
        url = movies_urls[movie]
        res = requests.get(url)
        print(movie, url[47:])
        if res.status_code == 200:
            # Parse only pages that were actually retrieved
            soup = bs4(res.content, 'lxml')
            movies_features.append(get_movie_features(movie, url, soup))
        else:
            missed_movies.append([movie, url])
    if last_movie == len(movies_urls):
        print('That was the last movie.')
    return movies_features, missed_movies
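The loop above sends each request once, with no timeout and no pause between pages. As a minimal sketch of a more defensive fetch step, the hypothetical helper `fetch_with_retry` below (not part of the original notebook) retries a failed request a few times and sleeps between attempts. The `get` and `sleep` parameters are assumptions introduced so the helper can be exercised without network access; in the scraper, `get` could be something like `lambda u: requests.get(u, timeout=10)`.

```python
import time

def fetch_with_retry(url, get, retries=3, delay=1.0, sleep=time.sleep):
    """Call get(url) up to `retries` times, pausing `delay` seconds between tries.

    `get` is any callable returning an object with a `status_code` attribute.
    Returns the first response with status 200, or None if every attempt fails.
    """
    for attempt in range(retries):
        try:
            res = get(url)
            if res.status_code == 200:
                return res
        except Exception as err:  # in real use, narrow this to requests.RequestException
            print(f'attempt {attempt + 1} failed for {url}: {err}')
        if attempt < retries - 1:
            sleep(delay)  # be polite to the server before trying again
    return None
```

A `None` result would then be recorded in `missed_movies` in the same way as a non-200 status code.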

In [5]:
def get_movie_features(num, url, soup):
    # Collect every raw feature for one movie page into a single dict
    movie_features = {}
    movie_features['movie_id'] = num
    movie_features['slug'] = url[47:]
    movie_features['title'] = get_title(soup)
    movie_features['age'] = get_age(soup)
    movie_features['family_topics'] = get_family_topics(soup)
    movie_features['is_it_any_good'] = get_is_it_any_good(soup)
    movie_features['movie_details_raw'] = get_movie_details_raw(soup)
    movie_features['one_line_description'] = get_one_line_description(soup)
    movie_features['overall_rating'] = get_overall_rating(soup)
    movie_features['parental_rating_and_spoilers_raw'] = get_parental_rating_and_spoilers_raw(soup)
    movie_features['what_is_the_story_raw'] = get_what_is_the_story(soup)
    movie_features['what_parents_need_to_know_raw'] = get_what_parents_need_to_know(soup)
    return movie_features
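The slug is taken with `url[47:]`, which works because every URL in the list shares the same 47-character prefix. A small sketch of an alternative that does not depend on the prefix length is shown here; `get_slug` is a hypothetical helper name, not something defined in the original notebook.

```python
from urllib.parse import urlparse

def get_slug(url):
    """Return the last path segment of a review URL."""
    path = urlparse(url).path
    return path.rstrip('/').rsplit('/', 1)[-1]

url = 'https://www.commonsensemedia.org/movie-reviews/the-music-of-silence'
print(get_slug(url))  # the-music-of-silence
print(url[47:])       # the-music-of-silence
```

Both approaches agree for the URLs shown earlier; the parsed version would keep working if the domain or path prefix changed.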

### Helper Functions to get specific raw movie features

In [6]:
def get_title(soup):
    # Slice off the surrounding <h1> and </h1>; assumes the tag has no attributes
    return str(soup.h1)[4:-5]

In [7]:
def get_age(soup):
    return str(soup.find('div', 'csm-green-age').get_text())

In [8]:
def get_family_topics(soup):
    return str(soup.find('div',
                         'field-name-field-family-topics').get_text())   

In [9]:
def get_is_it_any_good(soup):
    return str(soup.find('div', 'field-name-field-any-good').get_text())

In [10]:
def get_movie_details_raw(soup):
    return str(soup.find('div', 'pane-product-details').get_text())   

In [11]:
def get_one_line_description(soup):
    return str(soup.find('div', 'field-name-field-one-liner').get_text())

In [12]:
def get_overall_rating(soup):
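    # Raw string avoids an invalid-escape warning; return None if no digit is found
    rating = re.search(r"\d", str(soup.find('div', 'field_stars_rating')))
    return str(rating[0]) if rating else None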

In [13]:
def get_parental_rating_and_spoilers_raw(soup):
    return str(soup.find_all('div', 'field-type-field-collection'))

In [14]:
def get_what_is_the_story(soup):
    return str(soup.find('div',
                         'field-name-field-what-is-story').get_text())

In [15]:
def get_what_parents_need_to_know(soup):
    return str(soup.find('div', 'field-name-field-parents-need-to-know'))
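Most of the helpers above call `.get_text()` directly on the result of `soup.find(...)`, which raises `AttributeError` when a page lacks that element. One possible way to keep a single missing field from stopping a long scrape is a small decorator, sketched below. The names `safe_feature`, `EmptyPage`, and `get_age_safe` are hypothetical and exist only for illustration; `EmptyPage` stands in for a soup object whose `find()` never matches.

```python
import functools

def safe_feature(getter):
    """Wrap a getter so a missing page element yields None instead of raising."""
    @functools.wraps(getter)
    def wrapper(soup):
        try:
            return getter(soup)
        except (AttributeError, TypeError):
            # soup.find(...) returned None, so .get_text() or indexing failed
            return None
    return wrapper

class EmptyPage:
    """Stand-in for a soup object whose find() never matches (illustration only)."""
    def find(self, *args, **kwargs):
        return None

@safe_feature
def get_age_safe(soup):
    return str(soup.find('div', 'csm-green-age').get_text())

print(get_age_safe(EmptyPage()))  # None
```

Storing `None` for a missing field leaves the decision about how to handle gaps to the later cleaning step.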

### Scraping Movie Features

In [25]:
# DO NOT RUN THIS CELL unless you intend to re-scrape: the scraper takes almost three hours
movies_features, missed_movies = scrape_movie_features(0, 10000)
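Because `scrape_movie_features` takes a starting index and a count, a multi-hour run could also be split into smaller batches so that an interruption loses less work. The sketch below only computes the `(first_movie, num_movies_to_scrape)` pairs; `batch_ranges` is a hypothetical helper and the batch size of 1000 is an arbitrary choice for illustration.

```python
def batch_ranges(total, batch_size):
    """Yield (first_movie, num_movies_to_scrape) pairs covering range(total)."""
    for start in range(0, total, batch_size):
        yield start, min(batch_size, total - start)

print(list(batch_ranges(2500, 1000)))
# [(0, 1000), (1000, 1000), (2000, 500)]
```

Each pair could be passed to `scrape_movie_features`, with the accumulated results written to disk between batches.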

In [19]:
len(movies_features)

8765

In [21]:
# Run if you'd like to see the scraped raw, unprocessed movie data
movies_features[len(movies_features) - 1]

{'age': 'age 8+',
 'family_topics': "Families can talk about silent movies. What did you expect going in? Were any parts surprising? Did you ever forget you were watching a silent film and just get into the story?\n\nFamilies can also talk about technology and filmmaking. Buster Keaton didn't have any of the tools we have today and still managed to make the action exciting. Do you think not relying on technology somehow made this filmmaker more inventive? Or do you think he was limited by the lack of CGI and other effects common today?\n\n",
 'is_it_any_good': "Even viewers who normally don't seek out silent movies or classics in general are in for a treat. SHERLOCK JR. is clever, charming, inventive, and full of surprises. There's so much packed into 44 minutes, it's hard to believe that there's a movie within a movie and a love story and a frame-up and it all ties together and makes perfect sense with just the occasional pithy caption.\nThe runaway moped scene had to take so much pla

In [22]:
# Replace RENUMBER with len(movies_features) - 1, and update later notebooks to match,
# so the system can recommend more recent movies. See below for further explanation.
movies_features_0_to_RENUMBERnew = movies_features

In [24]:
with open('data/movies_features_0_to_RENUMBERnew.json', 'w') as output:
    json.dump(movies_features_0_to_RENUMBERnew, output)
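The RENUMBER placeholder has to be edited by hand after each scrape. As a sketch of one way to avoid that, the hypothetical helper `save_features` below builds the file name from the last index and writes the list to it. The naming pattern is an assumption based on the comment above, and the sample records are placeholders, not real scraped data.

```python
import json
import os
import tempfile

def save_features(features, directory):
    """Write features to movies_features_0_to_<last index>.json and return the path."""
    last_index = len(features) - 1
    path = os.path.join(directory, f'movies_features_0_to_{last_index}.json')
    with open(path, 'w') as output:
        json.dump(features, output)
    return path

# Round-trip check with placeholder records (not real scraped data)
sample = [{'movie_id': 0, 'slug': 'example-a'}, {'movie_id': 1, 'slug': 'example-b'}]
with tempfile.TemporaryDirectory() as tmp:
    path = save_features(sample, tmp)
    with open(path) as json_file:
        assert json.load(json_file) == sample
    print(os.path.basename(path))  # movies_features_0_to_1.json
```

In the notebook, the call would be `save_features(movies_features, 'data')`, and later notebooks would need to load the matching file name.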

#### Movie features are saved to a JSON file for further cleaning and data exploration in Notebook 3.