In [1]:
import pandas as pd
import os
import requests
from bs4 import BeautifulSoup

**More Info**

- [pandas: Flat file functions](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)

In [3]:
# Import the Rotten Tomatoes bestofrt TSV file into a DataFrame
df = pd.read_csv("bestofrt.tsv", sep='\t')

In [4]:
# Check to see if the file was imported correctly
df.head()

Unnamed: 0,ranking,critic_score,title,number_of_critic_ratings
0,1,99,The Wizard of Oz (1939),110
1,2,100,Citizen Kane (1941),75
2,3,100,The Third Man (1949),77
3,4,99,Get Out (2017),282
4,5,97,Mad Max: Fury Road (2015),370


## Gather data programmatically 

### #1 Use `requests`

In [6]:
import requests

In [7]:
url = 'https://www.rottentomatoes.com/m/et_the_extraterrestrial'
response = requests.get(url)
response.url

'https://www.rottentomatoes.com/m/et_the_extraterrestrial'

In [8]:
# save HTML to file
with open("et_the_extraterrestrial.html", mode="wb") as file:
    file.write(response.content)

### #2 Use `BeautifulSoup`

In [9]:
# work with HTML in memory
from bs4 import BeautifulSoup
# name the parser explicitly so Beautiful Soup does not have to guess one
soup = BeautifulSoup(response.content, "lxml")
type(soup)

bs4.BeautifulSoup

#### HTML files in Python

Search for and extract patterns in text:

1. [Python String Find Method](https://www.tutorialspoint.com/python/string_find.htm)
2. [Regular Expressions](https://regexr.com/)
3. An HTML parser such as [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/)
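
As a rough illustration of the first two approaches, the sketch below pulls a page title out of a small HTML string with `str.find` and with the `re` module. The `html` string and the variable names are made up for this example. Real pages are far messier, which is why a parser is usually the safer choice.

```python
import re

# A tiny, made-up HTML document standing in for a downloaded page
html = ("<html><head><title>E.T. The Extra-Terrestrial (1982) - Rotten Tomatoes"
        "</title></head><body></body></html>")

# Approach 1: str.find returns the index of the first match, or -1 if absent
start = html.find("<title>") + len("<title>")
end = html.find("</title>")
title_from_find = html[start:end]

# Approach 2: a regular expression with one capture group for the tag text
match = re.search(r"<title>(.*?)</title>", html)
title_from_regex = match.group(1) if match else None

print(title_from_find)
print(title_from_regex)
```

Both approaches depend on the exact text of the tag, so they break as soon as the tag gains an attribute or the markup changes slightly.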

**Let's extract the Movie title:**

1. Make the soup:
    - pass the path to your HTML file into `open` to get a file handle
    - pass that file handle into the `BeautifulSoup` constructor
2. Find and extract data

In [10]:
with open("et_the_extraterrestrial.html") as file:
    soup = BeautifulSoup(file, "lxml")

In [11]:
# find the title of the web page
soup.find('title')

<title>E.T. The Extra-Terrestrial (1982) - Rotten Tomatoes</title>

#### Access tag content with `.contents`

- `.contents` returns a list of the tag's children

In [12]:
# title content with ' - Rotten Tomatoes' removed
soup.find('title').contents[0][:-len(' - Rotten Tomatoes')]

'E.T. The Extra-Terrestrial (1982)'
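
The slice above always removes the last 18 characters, whether or not the title really ends with ` - Rotten Tomatoes`. A minimal sketch of a more defensive version follows. `strip_suffix` is a hypothetical helper written for this note, not part of Beautiful Soup.

```python
def strip_suffix(text, suffix):
    """Return text without suffix, but only if text really ends with it."""
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text

page_title = "E.T. The Extra-Terrestrial (1982) - Rotten Tomatoes"
print(strip_suffix(page_title, " - Rotten Tomatoes"))

# A title without the suffix is returned unchanged instead of being truncated
print(strip_suffix("Some Other Page", " - Rotten Tomatoes"))
```

On Python 3.9 and later, the built-in `str.removesuffix` method does the same job.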

### Quiz

With your knowledge of HTML file structure, you're going to use Beautiful Soup to extract the Audience Score and the number of audience ratings from each HTML file, along with the movie title as in the video above (so we have something to merge the datasets on later), and then save them in a pandas DataFrame.

The Jupyter Notebook below contains template code that:

- Creates an empty list, `df_list`, to which dictionaries will be appended. This list of dictionaries will eventually be converted to a pandas DataFrame (this is the [most efficient way of building a DataFrame row by row](https://stackoverflow.com/questions/28056171/how-to-build-and-fill-pandas-dataframe-from-for-loop/28058264#28058264)).
- Loops through each movie's Rotten Tomatoes HTML file in the `rt_html` folder.
- Opens each HTML file and passes it into a file handle called `file`.
- Creates a DataFrame called `df` by converting `df_list` using the `pd.DataFrame` [constructor](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).

Your task is to extract the title, audience score, and number of audience ratings in each HTML file so each trio can be appended as a dictionary to `df_list`.

The Beautiful Soup methods required for this task are:

- `find()`
- `find_all()`

There is an excellent tutorial on these methods ([Searching the tree](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree)) in the Beautiful Soup documentation. Please consult that tutorial if you are stuck.

**More Information**
1. [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/)
2. [Stack Overflow: Beautiful Soup and Unicode Problems](https://stackoverflow.com/questions/19508442/beautiful-soup-and-unicode-problems)
3. [Stack Overflow: Python: Removing \xa0 from string](https://stackoverflow.com/questions/10993612/python-removing-xa0-from-string)

In [13]:
# List of dictionaries to build file by file and later convert to a DataFrame
df_list = []
folder = 'rt_html'
for movie_html in os.listdir(folder):
    with open(os.path.join(folder, movie_html)) as file:
        # make the soup
        soup = BeautifulSoup(file, "lxml")
        # get the title
        title = soup.find('title').contents[0][:-len(' - Rotten Tomatoes')]
        # get the audience score
        audience_score = soup.find('div', class_="audience-score meter").find("span").contents[0][:-1]
        # number of audience ratings
        num_audience_ratings = soup.find("div", class_="audience-info hidden-xs superPageFontColor").find_all("div")[1].contents[2].strip().replace(',', '')
        
        # Append to list of dictionaries
        df_list.append({'title': title,
                        'audience_score': int(audience_score),
                        'number_of_audience_ratings': int(num_audience_ratings)})
df = pd.DataFrame(df_list, columns = ['title', 'audience_score', 'number_of_audience_ratings'])

In [14]:
df.head()

Unnamed: 0,title,audience_score,number_of_audience_ratings
0,E.T. The Extra-Terrestrial (1982),72,32314083
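
The extraction code above cleans each scraped string before calling `int`: it drops the trailing character from the score and removes whitespace and commas from the rating count. The sketch below isolates that cleaning step. The helper names and the sample strings (`'72%'`, `' 32,314,083 '`) are assumptions chosen to match the output shown, not text copied from the page.

```python
def parse_percent(text):
    """Convert a string like '72%' to the integer 72."""
    return int(text.strip().rstrip('%'))

def parse_count(text):
    """Convert a string like ' 32,314,083 ' to the integer 32314083."""
    return int(text.strip().replace(',', ''))

# Illustrative inputs only
row = {'title': 'E.T. The Extra-Terrestrial (1982)',
       'audience_score': parse_percent('72%'),
       'number_of_audience_ratings': parse_count('\n 32,314,083 ')}
print(row)
```

Keeping the cleaning in small functions makes it easier to see which step failed when a page does not follow the expected layout.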


[Best of Rotten Tomatoes: Critic vs. Audience Scores (Tableau Public Viz)](https://public.tableau.com/profile/david.venturi#!/vizhome/BestofRottenTomatoesCriticvs_AudienceScores/BestofRottenTomatoesCriticvs_AudienceScores)

In [15]:
df = pd.read_csv("bestofrt.tsv", sep='\t')
df.head()

Unnamed: 0,ranking,critic_score,title,number_of_critic_ratings
0,1,99,The Wizard of Oz (1939),110
1,2,100,Citizen Kane (1941),75
2,3,100,The Third Man (1949),77
3,4,99,Get Out (2017),282
4,5,97,Mad Max: Fury Road (2015),370


## Source: Downloading Files from the Internet

### HTTP (Hypertext Transfer Protocol)

HTTP, the Hypertext Transfer Protocol, is the language that web browsers (like Chrome or Safari) and web servers (basically computers where the contents of a website are stored) use to speak to each other. Every time you open a web page, download a file, or watch a video, it's HTTP that makes it possible.

HTTP(S) is a request/response protocol:

1. Your computer, a.k.a. the client, sends a request to a server for some file. For this lesson: "Get me the file `1-the-wizard-of-oz-1939-film.txt`", for example.
    - `GET` is the name of the HTTP request method (of which there are several) used for retrieving data.
    - The `requests` library has a function called `get` that sends the request for us and returns a response containing the contents of the file we requested, which we can then save to a file.
    - We import the `os` library so we can store the downloaded file in a folder called `ebert_reviews`.
    
2. The web server sends back a response. 
    - If the request is valid: "Here is the file you asked for:", then followed by the contents of the `1-the-wizard-of-oz-1939-film.txt` file itself.
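
To make the exchange more concrete, here is a sketch of roughly what a minimal `GET` request looks like as text under HTTP/1.1. The URL is hypothetical, `build_get_request` is a helper invented for this note, and real clients such as `requests` send additional headers.

```python
from urllib.parse import urlsplit

def build_get_request(url):
    """Build the text of a minimal HTTP/1.1 GET request for url."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    # Request line, then headers, then a blank line that ends the request
    return (f"GET {path} HTTP/1.1\r\n"
            f"Host: {parts.netloc}\r\n"
            "Connection: close\r\n"
            "\r\n")

# Hypothetical URL used only for illustration
request_text = build_get_request(
    "https://example.com/reviews/1-the-wizard-of-oz-1939-film.txt")
print(request_text)
```

The server's response has a similar shape: a status line, some headers, a blank line, and then the body, which in this lesson is the contents of the text file.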


If you'd like to learn more, or are feeling like there are knowledge gaps you'd like to fill in, I encourage you to check out the following videos in our free Web Development course: concepts 2-5 and 24-30 in Lesson 1 ("How the Web Works").

### Roger Ebert review word cloud

The data for the word cloud is the text of Roger Ebert's review of each movie on the Rotten Tomatoes Top 100 Movies of All Time list.

In [16]:
# create the ebert_reviews folder
folder_name = "ebert_reviews"
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

In [17]:
url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9904_11-e.t.-the-extra-terrestrial/11-e.t.-the-extra-terrestrial.txt"

# get the url
response = requests.get(url)
response

<Response [200]>

- `200` is the HTTP status code meaning the request has succeeded;
- this response object is what the `requests.get` function returns;
- the entire text file is now in our computer's working memory, inside the `response` variable;
- it is stored in the body of the response, which we can access using `.content`.
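
For reference, Python's standard library knows the standard status codes and their reason phrases. The short sketch below looks up a few of them; the choice of codes is arbitrary. With `requests`, the numeric code of a response is available as `response.status_code`.

```python
from http import HTTPStatus

# Map a few common status codes to their standard reason phrases
phrases = {code: HTTPStatus(code).phrase for code in (200, 404, 500)}

for code, phrase in phrases.items():
    print(code, phrase)
```

Codes in the 200 range signal success, those in the 400 range a problem with the request, and those in the 500 range a problem on the server.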

In [18]:
# check the contents
#response.content

- the content is in bytes format, and we can save it to a file on our computer;
- we will open a file called `11-e.t.-the-extra-terrestrial.txt`, which is everything after the last `/` in the URL;
- to get that name, we use the `split` method and select the last item in the list it returns;
- we need to open this file in `wb` (write binary) mode because `response.content` is bytes, not text;
- finally, we write to the file handle we've opened with `file.write(response.content)`.
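
The steps above can be tried offline. In the sketch below, `fake_content` stands in for `response.content` and a temporary folder stands in for `ebert_reviews`; both are assumptions made so the example runs without a network connection.

```python
import os
import tempfile

url = ("https://d17h27t6h515a5.cloudfront.net/topher/2017/September/"
       "59ad9904_11-e.t.-the-extra-terrestrial/11-e.t.-the-extra-terrestrial.txt")

# Everything after the last '/' is the file name
file_name = url.split('/')[-1]

# Stand-in for response.content: bytes, not str
fake_content = "E.T. The Extra-Terrestrial (1982)\n".encode('utf-8')

# A temporary folder keeps the example from touching real data
with tempfile.TemporaryDirectory() as tmp_folder:
    path = os.path.join(tmp_folder, file_name)
    with open(path, mode='wb') as file:
        file.write(fake_content)
    size_on_disk = os.path.getsize(path)

print(file_name, size_on_disk)
```

Because the file is opened in binary mode, the bytes are written exactly as received, with no encoding or newline translation.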

In [19]:
# write the response to a file
with open(os.path.join(folder_name, url.split('/')[-1]), mode='wb') as file:
    file.write(response.content)

In [20]:
# check the contents of the folder
os.listdir(folder_name)

['1-the-wizard-of-oz-1939-film.txt',
 '10-metropolis-1927-film.txt',
 '100-battleship-potemkin.txt',
 '11-e.t.-the-extra-terrestrial.txt',
 '12-modern-times-film.txt',
 '14-singin-in-the-rain.txt',
 '15-boyhood-film.txt',
 '16-casablanca-film.txt',
 '17-moonlight-2016-film.txt',
 '18-psycho-1960-film.txt',
 '19-laura-1944-film.txt',
 '2-citizen-kane.txt',
 '20-nosferatu.txt',
 '21-snow-white-and-the-seven-dwarfs-1937-film.txt',
 '22-a-hard-day27s-night-film.txt',
 '23-la-grande-illusion.txt',
 '25-the-battle-of-algiers.txt',
 '26-dunkirk-2017-film.txt',
 '27-the-maltese-falcon-1941-film.txt',
 '29-12-years-a-slave-film.txt',
 '3-the-third-man.txt',
 '30-gravity-2013-film.txt',
 '31-sunset-boulevard-film.txt',
 '32-king-kong-1933-film.txt',
 '33-spotlight-film.txt',
 '34-the-adventures-of-robin-hood.txt',
 '35-rashomon.txt',
 '36-rear-window.txt',
 '37-selma-film.txt',
 '38-taxi-driver.txt',
 '39-toy-story-3.txt',
 '4-get-out-film.txt',
 '40-argo-2012-film.txt',
 '41-toy-story-2.txt',
 

### Quiz

In the Jupyter Notebook below, programmatically download all of the Roger Ebert review text files to a folder called `ebert_reviews` using the Requests library. Use a for loop in conjunction with the provided `ebert_review_urls` list.

Here is the [Requests documentation](http://docs.python-requests.org/en/master/) for easy reference. It is excellently clear relative to similar libraries, like [urllib](https://docs.python.org/3/howto/urllib2.html).

In [21]:
# url for each ebert review
ebert_review_urls = ['https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9900_1-the-wizard-of-oz-1939-film/1-the-wizard-of-oz-1939-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9901_2-citizen-kane/2-citizen-kane.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9901_3-the-third-man/3-the-third-man.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9902_4-get-out-film/4-get-out-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9902_5-mad-max-fury-road/5-mad-max-fury-road.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9902_6-the-cabinet-of-dr.-caligari/6-the-cabinet-of-dr.-caligari.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9903_7-all-about-eve/7-all-about-eve.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9903_8-inside-out-2015-film/8-inside-out-2015-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9903_9-the-godfather/9-the-godfather.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9904_10-metropolis-1927-film/10-metropolis-1927-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9904_11-e.t.-the-extra-terrestrial/11-e.t.-the-extra-terrestrial.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9904_12-modern-times-film/12-modern-times-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9904_14-singin-in-the-rain/14-singin-in-the-rain.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9905_15-boyhood-film/15-boyhood-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9905_16-casablanca-film/16-casablanca-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9905_17-moonlight-2016-film/17-moonlight-2016-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9906_18-psycho-1960-film/18-psycho-1960-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9906_19-laura-1944-film/19-laura-1944-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9906_20-nosferatu/20-nosferatu.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9907_21-snow-white-and-the-seven-dwarfs-1937-film/21-snow-white-and-the-seven-dwarfs-1937-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9907_22-a-hard-day27s-night-film/22-a-hard-day27s-night-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9907_23-la-grande-illusion/23-la-grande-illusion.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9908_25-the-battle-of-algiers/25-the-battle-of-algiers.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9908_26-dunkirk-2017-film/26-dunkirk-2017-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9908_27-the-maltese-falcon-1941-film/27-the-maltese-falcon-1941-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9909_29-12-years-a-slave-film/29-12-years-a-slave-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9909_30-gravity-2013-film/30-gravity-2013-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9909_31-sunset-boulevard-film/31-sunset-boulevard-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990a_32-king-kong-1933-film/32-king-kong-1933-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990a_33-spotlight-film/33-spotlight-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990a_34-the-adventures-of-robin-hood/34-the-adventures-of-robin-hood.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990b_35-rashomon/35-rashomon.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990b_36-rear-window/36-rear-window.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990b_37-selma-film/37-selma-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990c_38-taxi-driver/38-taxi-driver.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990c_39-toy-story-3/39-toy-story-3.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990c_40-argo-2012-film/40-argo-2012-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990d_41-toy-story-2/41-toy-story-2.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990d_42-the-big-sick/42-the-big-sick.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990d_43-bride-of-frankenstein/43-bride-of-frankenstein.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990d_44-zootopia/44-zootopia.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990e_45-m-1931-film/45-m-1931-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990e_46-wonder-woman-2017-film/46-wonder-woman-2017-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990e_48-alien-film/48-alien-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990f_49-bicycle-thieves/49-bicycle-thieves.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990f_50-seven-samurai/50-seven-samurai.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990f_51-the-treasure-of-the-sierra-madre-film/51-the-treasure-of-the-sierra-madre-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9910_52-up-2009-film/52-up-2009-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9910_53-12-angry-men-1957-film/53-12-angry-men-1957-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9910_54-the-400-blows/54-the-400-blows.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9911_55-logan-film/55-logan-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9911_57-army-of-shadows/57-army-of-shadows.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9912_58-arrival-film/58-arrival-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9912_59-baby-driver/59-baby-driver.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9913_60-a-streetcar-named-desire-1951-film/60-a-streetcar-named-desire-1951-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9913_61-the-night-of-the-hunter-film/61-the-night-of-the-hunter-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9913_62-star-wars-the-force-awakens/62-star-wars-the-force-awakens.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9913_63-manchester-by-the-sea-film/63-manchester-by-the-sea-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9914_64-dr.-strangelove/64-dr.-strangelove.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9914_66-vertigo-film/66-vertigo-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9914_67-the-dark-knight-film/67-the-dark-knight-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9915_68-touch-of-evil/68-touch-of-evil.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9915_69-the-babadook/69-the-babadook.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9915_72-rosemary27s-baby-film/72-rosemary27s-baby-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9916_73-finding-nemo/73-finding-nemo.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9916_74-brooklyn-film/74-brooklyn-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9917_75-the-wrestler-2008-film/75-the-wrestler-2008-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9917_77-l.a.-confidential-film/77-l.a.-confidential-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9918_78-gone-with-the-wind-film/78-gone-with-the-wind-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9918_79-the-good-the-bad-and-the-ugly/79-the-good-the-bad-and-the-ugly.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9918_80-skyfall/80-skyfall.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9919_82-tokyo-story/82-tokyo-story.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9919_83-hell-or-high-water-film/83-hell-or-high-water-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9919_84-pinocchio-1940-film/84-pinocchio-1940-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9919_85-the-jungle-book-2016-film/85-the-jungle-book-2016-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991a_86-la-la-land-film/86-la-la-land-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991b_87-star-trek-film/87-star-trek-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991b_89-apocalypse-now/89-apocalypse-now.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991c_90-on-the-waterfront/90-on-the-waterfront.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991c_91-the-wages-of-fear/91-the-wages-of-fear.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991c_92-the-last-picture-show/92-the-last-picture-show.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991d_93-harry-potter-and-the-deathly-hallows-part-2/93-harry-potter-and-the-deathly-hallows-part-2.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991d_94-the-grapes-of-wrath-film/94-the-grapes-of-wrath-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991d_96-man-on-wire/96-man-on-wire.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991e_97-jaws-film/97-jaws-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991e_98-toy-story/98-toy-story.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991e_99-the-godfather-part-ii/99-the-godfather-part-ii.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991e_100-battleship-potemkin/100-battleship-potemkin.txt']

In [22]:
for ebert_review_url in ebert_review_urls:  
    
    # get the url response
    response = requests.get(ebert_review_url)
    
    # write the response to a file
    with open(os.path.join(folder_name, ebert_review_url.split('/')[-1]), mode='wb') as file:
        file.write(response.content)

In [23]:
# check the contents of the folder
len(os.listdir(folder_name))

88
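
Re-running the loop above downloads every file again. One possible refinement, sketched below, skips files that are already on disk. `download_missing` and `fetch_bytes` are hypothetical helpers, and the default fetcher uses the standard library's `urllib.request` rather than `requests` so the sketch has no third-party dependencies. The demonstration at the end uses a fake fetcher and made-up URLs, so it runs offline.

```python
import os
import tempfile
from urllib.request import urlopen

def fetch_bytes(url):
    """Default fetcher: return the body of url as bytes (stdlib only)."""
    with urlopen(url) as response:
        return response.read()

def download_missing(urls, folder, fetch=fetch_bytes):
    """Download each URL into folder unless the file already exists.

    Returns the list of file names that were newly written.
    """
    os.makedirs(folder, exist_ok=True)
    written = []
    for url in urls:
        file_name = url.split('/')[-1]
        path = os.path.join(folder, file_name)
        if os.path.exists(path):
            continue  # skip files left by an earlier run
        with open(path, mode='wb') as file:
            file.write(fetch(url))
        written.append(file_name)
    return written

# Offline demonstration with a fake fetcher and hypothetical URLs
demo_urls = ["https://example.com/a/1-first.txt",
             "https://example.com/b/2-second.txt"]
with tempfile.TemporaryDirectory() as demo_folder:
    first_run = download_missing(demo_urls, demo_folder, fetch=lambda url: b"demo")
    second_run = download_missing(demo_urls, demo_folder, fetch=lambda url: b"demo")

print(first_run, second_run)
```

Passing the fetcher in as an argument is a design choice that makes the function easy to test. To use `requests` instead, something like `fetch=lambda url: requests.get(url).content` should work.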

**More Information**

- A text file is downloaded in this example. Binary files (images, for example) are best read and written in [other ways](http://docs.python-requests.org/en/latest/user/quickstart/#binary-response-content).
- [Stack Overflow: What does 'wb' mean in this code, using Python?](https://stackoverflow.com/questions/2665866/what-does-wb-mean-in-this-code-using-python)

### Text File Structure

A text file uses a specific **character set** and:

 - Contains no formatting, like italics or bolding;
 - Has no media (images or video);
 - Has lines separated by the newline character (`\n` in Python), which is invisible in most software, such as text editors;
 - Needs the right **encoding** to be selected for the document to display properly.
 - The Ebert review text files have no further structure: each is just a blob of text.
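
Two of the points above are easy to see in a few lines of Python: newline characters become visible with `repr`, and decoding bytes with the wrong encoding garbles non-ASCII characters. The sample strings are made up for this sketch.

```python
text = "Metropolis (1927)\nThe opening shots\n"

# repr() makes the otherwise invisible newline characters visible
print(repr(text))
lines = text.splitlines()
print(lines)

# Decoding bytes with the wrong encoding garbles non-ASCII characters
raw = "café".encode('utf-8')
right = raw.decode('utf-8')
wrong = raw.decode('latin-1')
print(right, wrong)
```

The second decode does not raise an error. It silently produces the wrong characters, which is why the encoding should be stated explicitly when opening a text file.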
 
### Encodings and Character Sets Articles

- [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/) by Joel Spolsky
- What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text

### Unicode and Python

In Python 3, there is:

- one text type: `str`, which holds Unicode data and
- two byte types: `bytes` and `bytearray`

The Stack Overflow answers [here](https://stackoverflow.com/questions/6224052/what-is-the-difference-between-a-string-and-a-byte-string) explain the different use cases well.
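
A short sketch of the three types side by side may help. The title string is only sample data, and the in-place edit is there purely to show that `bytearray` is mutable.

```python
title = "La Grande Illusion"       # str: Unicode text
encoded = title.encode('utf-8')    # bytes: immutable sequence of byte values
buffer = bytearray(encoded)        # bytearray: mutable sequence of byte values

buffer[0:2] = b"Le"                # in-place edits work on bytearray only
decoded = buffer.decode('utf-8')   # back to str

print(type(title).__name__, type(encoded).__name__, type(buffer).__name__)
print(decoded)
```

Moving from `str` to bytes always goes through `encode`, and moving back goes through `decode`. Both take the name of an encoding.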

**More Information**

If you're still confused about the difference between character sets and encodings, check out these articles:

- [The difference between UTF-8 and Unicode?](http://www.polylab.dk/utf8-vs-unicode.html)
- [More About Unicode in Python 2 and 3](http://lucumr.pocoo.org/2014/1/5/unicode-in-2-and-3/)


### Text Files in Python

The [`glob`](https://docs.python.org/3/library/glob.html) library:

- Makes opening files with a similar path structure (like our folder of Roger Ebert review text files) simple;
- Uses glob patterns, which rely on wildcard characters such as `*` to match file paths;
- `glob.glob` returns a list of the matching paths.
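
A minimal sketch of a glob pattern in action is shown below. It creates a few empty files with made-up names in a temporary folder and then matches only those ending in `.txt`.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Create a few empty files with made-up names
    for name in ("1-first.txt", "2-second.txt", "notes.md"):
        open(os.path.join(folder, name), "w").close()

    # '*' matches any run of characters within a file name
    txt_paths = glob.glob(os.path.join(folder, "*.txt"))
    txt_names = sorted(os.path.basename(p) for p in txt_paths)

print(txt_names)
```

`glob.glob` makes no promise about the order of its results, so the names are sorted here before printing.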

#### Quiz

So we have 88 Roger Ebert reviews to open and read, which you can see in the Jupyter Notebook dashboard below (click jupyter in the top lefthand corner to access the dashboard) in the ebert_reviews folder. 

We'll need a loop to iterate through all of the files in this folder, opening and reading each one, and then extract the bits of text that we need as separate pieces of data:

1. the first line, which is the movie title (to merge to the master dataset with)
2. the second line, which is the review URL (not necessary for the word cloud but nice to have)
3. everything from the third line onwards, which is the review text


- Lines in each file are separated by newline characters;
- `open` returns a file object, which is an iterator, so we can read the file line by line, or read it all at once with `read()`:

```
with open(ebert_review, encoding='utf-8') as file:
    print(file.read())
```

The Jupyter Notebook below contains template code that:

- Creates an empty list, `df_list`, to which dictionaries will be appended. This list of dictionaries will eventually be converted to a pandas DataFrame (this is the most efficient way of building a DataFrame row by row).
- Loops through each movie's Roger Ebert review text file in the `ebert_reviews` folder.
- Opens each text file using a path generated by `glob` and passes it into a file handle called `file`.
- Creates a DataFrame called `df` by converting `df_list` using the `pd.DataFrame` constructor.

Your task is to extract the movie title, Roger Ebert review URL, and the review in each text file and append each trio as a dictionary to `df_list`.

The file methods required for this task are:

- `readline()`
- `read()`
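
The two methods share the file's position: each `readline()` call consumes one line, and a later `read()` returns only what is left. The sketch below shows this on a made-up file with the same three-part layout, written to a temporary folder.

```python
import os
import tempfile

# Made-up file contents: title, URL, then the body text
sample = ("Some Movie (1999)\n"
          "http://example.com/review\n"
          "First paragraph.\n"
          "Second paragraph.\n")

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.txt")
    with open(path, "w", encoding="utf-8") as file:
        file.write(sample)

    with open(path, encoding="utf-8") as file:
        first_line = file.readline().rstrip("\n")   # line 1 only
        second_line = file.readline().rstrip("\n")  # line 2 only
        rest = file.read()                          # everything that remains

print(first_line)
print(second_line)
print(repr(rest))
```

Using `rstrip("\n")` removes the newline only when it is present, whereas slicing with `[:-1]` would cut off a real character if a line had no trailing newline.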

**More Information**
- [Stack Overflow: Best Practices for Opening Files in Python](https://stackoverflow.com/questions/5250744/difference-between-open-and-codecs-open-in-python/22288895#22288895)
- [Stack Overflow: The Correct, Fully Pythonic Way to Read a File](https://stackoverflow.com/questions/8009882/how-to-read-a-large-file-line-by-line-in-python/8010133#8010133)
- [Stack Overflow: Iterables and Iterators](https://stackoverflow.com/questions/16994552/is-file-object-in-python-an-iterable/16994568#16994568)
- [Wikipedia: Glob programming](https://en.wikipedia.org/wiki/Glob_(programming))

In [25]:
import glob

In [46]:
# create an empty list of dictionaries
df_list = []

# every file that ends with .txt
for ebert_review in glob.glob('ebert_reviews/*.txt'):
    with open(ebert_review, encoding='utf-8') as file:   
        # title
        title = file.readline()[:-1]
        # review url
        review_url = file.readline()[:-1]
        # review text
        review_text = file.read()
        # Append to list of dictionaries
        df_list.append({'title': title,
                        'review_url': review_url,
                        'review_text': review_text})

# convert list to data frame        
df = pd.DataFrame(df_list, columns = ['title', 'review_url', 'review_text'])
df.head()

Unnamed: 0,title,review_url,review_text
0,The Wizard of Oz (1939),http://www.rogerebert.com/reviews/great-movie-...,As a child I simply did not notice whether a m...
1,Metropolis (1927),http://www.rogerebert.com/reviews/great-movie-...,The opening shots of the restored “Metropolis”...
2,Battleship Potemkin (1925),http://www.rogerebert.com/reviews/great-movie-...,"""The Battleship Potemkin” has been so famous f..."
3,E.T. The Extra-Terrestrial (1982),http://www.rogerebert.com/reviews/great-movie-...,Dear Raven and Emil:\n\nSunday we sat on the b...
4,Modern Times (1936),http://www.rogerebert.com/reviews/modern-times...,"A lot of movies are said to be timeless, but s..."
