# Movie Dataset Creation

### We first extract the information for a single movie from its Wikipedia page. We then obtain the links for all the Walt Disney Pictures films and use each link to extract the information for that movie separately.

### We cover a wide range of Python & data science topics in this video. They include:
- Web scraping with BeautifulSoup
- Cleaning data
- Pattern matching with regular expressions (re library)
- Working with dates (datetime library)
- Accessing data from an API using the Requests library

### Import necessary libraries

In [7]:
import requests
from bs4 import BeautifulSoup as bs

### Request the content for the movie Toy_Story_3

In [8]:
url = 'https://en.wikipedia.org/wiki/Toy_Story_3'
r = requests.get(url)
soup = bs(r.content, 'html.parser')
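
The request above assumes that the page downloads successfully. A minimal sketch of a more defensive fetch follows. The helper name `fetch_soup`, the 10-second timeout, and the User-Agent string are illustrative assumptions, not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup as bs


def fetch_soup(url, timeout=10):
    """Download a page and return it parsed (hypothetical helper)."""
    # The User-Agent value is an illustrative placeholder.
    headers = {'User-Agent': 'movie-dataset-notebook/0.1 (educational example)'}
    r = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx responses instead of parsing an error page.
    r.raise_for_status()
    return bs(r.content, 'html.parser')


# Usage (performs a network request):
# soup = fetch_soup('https://en.wikipedia.org/wiki/Toy_Story_3')
```

Setting a timeout keeps the cell from hanging on a stalled connection, and `raise_for_status()` makes a failed download visible immediately.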

### Clean the data by removing all superscript ('sup') and 'span' tags, together with their contents

In [10]:
for tag in soup.find_all(['sup', 'span']):
    tag.decompose()
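
To see what this loop does, here is a small sketch on a made-up HTML fragment (the fragment and the names `html`, `demo`, and `cleaned` are assumptions for illustration, not content from the Wikipedia page). Note that `decompose()` removes a tag along with everything inside it, so any text held in a 'span' is lost as well.

```python
from bs4 import BeautifulSoup as bs

# Made-up HTML fragment for illustration only.
html = '<td>Lee Unkrich<sup>[1]</sup> <span>(hidden note)</span></td>'
demo = bs(html, 'html.parser')

for tag in demo.find_all(['sup', 'span']):
    tag.decompose()  # remove the tag and its contents from the tree

# Join the remaining text pieces, stripping surrounding whitespace.
cleaned = demo.get_text(' ', strip=True)
print(cleaned)
```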

### The following line finds the infobox, i.e. the first table on the page with the class 'infobox vevent'

In [12]:
data1 = soup.find('table', class_='infobox vevent')

### The section that we want starts at the third 'tr' (index 2), so we skip the first two rows

In [None]:
trs = data1.find_all('tr')[2:]

### All 'th' elements give the task, e.g. 'Directed by' or 'Starring'

The corresponding values (for example, the people credited with the task) are found in 'li' elements when the cell holds a list, and in the 'td' element otherwise

In [14]:
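for tr in trs:
    th = tr.find('th')
    if th is None:
        continue  # skip rows that have no label
    task = th.get_text(' ', strip=True)
    items = tr.find_all('li')
    if items:
        # Several values: collect each list item, replacing non-breaking spaces
        person = [w.text.replace('\xa0', ' ') for w in items]
    else:
        # Single value: take the text of the data cell, if there is one
        td = tr.find('td')
        person = td.get_text(' ', strip=True) if td else ''
    print(task + ':', person)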

Directed by: Lee Unkrich
Screenplay by: Michael Arndt
Story by: ['John Lasseter', 'Andrew Stanton', 'Lee Unkrich']
Produced by: Darla K. Anderson
Starring: ['Tom Hanks', 'Tim Allen', 'Joan Cusack', 'Don Rickles', 'Wallace Shawn', 'John Ratzenberger', 'Estelle Harris', 'Ned Beatty', 'Michael Keaton', 'Jodi Benson', 'John Morris']
Cinematography: ['Jeremy Lasky', 'Kim White']
Edited by: Ken Schretzmann
Music by: Randy Newman
Production companies: ['Walt Disney Pictures', 'Pixar Animation Studios']
Distributed by: Walt Disney Studios Motion Pictures
Release dates: ['June 12, 2010 (Taormina Film Fest)', 'June 18, 2010 (United States)']
Running time: 103 minutes
Country: United States
Language: English
Budget: $200 million
Box office: $1.067 billion
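
The values printed above are still plain strings. As a minimal sketch of the cleaning, regular-expression, and date-handling topics listed at the top, the following converts a few of them into numbers and dates. The helper names (`minutes_to_int`, `money_to_float`, `parse_release_date`) and the `UNITS` table are assumptions for illustration, not the notebook's own code.

```python
import re
from datetime import datetime

# Multipliers for the magnitude words seen in the infobox (assumed set).
UNITS = {'thousand': 1e3, 'million': 1e6, 'billion': 1e9}


def minutes_to_int(text):
    """'103 minutes' -> 103; returns None if no number is found."""
    m = re.search(r'\d+', text)
    return int(m.group()) if m else None


def money_to_float(text):
    """'$200 million' -> amount in dollars; returns None if no match."""
    m = re.search(r'\$\s*(\d[\d,]*(?:\.\d+)?)\s*(thousand|million|billion)?',
                  text, re.IGNORECASE)
    if not m:
        return None
    number = float(m.group(1).replace(',', ''))
    unit = (m.group(2) or '').lower()
    return number * UNITS.get(unit, 1)


def parse_release_date(text):
    """'June 18, 2010 (United States)' -> datetime.date; None if no match."""
    m = re.search(r'[A-Z][a-z]+ \d{1,2}, \d{4}', text)
    # '%B' matches full month names and assumes an English locale.
    return datetime.strptime(m.group(), '%B %d, %Y').date() if m else None


print(minutes_to_int('103 minutes'))
print(money_to_float('$200 million'))
print(parse_release_date('June 18, 2010 (United States)'))
```

Each helper returns `None` when the pattern is absent, so rows with an unexpected format can be detected instead of raising an exception.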
