# Gathering Data

- We will master gathering data here
    

- this step varies a lot depending on the project:
    
    - the data could be spread out across many sources and file formats
    
    - this step can be the most challenging part of working with data, more so than analytic technique, modelling, or visualizing
    
    - it is usually glossed over as an afterthought, but here we will gather data from many sources and file formats and learn how to combine them all into a master dataset


 - it is the first step in the data wrangling process: we start with no data, and then we have it
 
 - it varies from project to project
     
     - sometimes you are given the data or pointed to it, and sometimes you need to search for the right data for your project
     
     - sometimes it isn't readily available and you need to generate it somehow
     
 - once we find our data, it is not unusual for it to be spread across many sources and file formats, which makes it tricky to organize the data in your programming environment
 
- this is the most technically challenging lesson in the most code-heavy part of data analytics

    - Note: think how complicated this can get. It ranges from accessing an API to building complex data pipelines, distributed systems, and web scrapers that pull and hold lots of data and make it analyzable. Streaming algorithms can get really complex, as can systems like Spark and MapReduce. These are all about wrangling massive amounts of data. How can machines handle it when volume goes up? How do we find it? What are the requirements under which we need it picked up?
    
- we will gain the coding skills and craftiness to conquer the majority of gathering scenarios we will see in the future

- Structure of lesson:

    - pose questions
    
    - explore the source of each piece of data we need to answer those questions, each piece from a different source with a different file format
    
    - we will learn about the structure of each file and how to handle it with Python libraries, then gather each piece of data and join them together to form a master dataset

- Problem:

    - we want to pick a movie: how do I pick what I want?
    
        - Rotten Tomatoes Top 100 Movies of All Time, each with 40 or more critic reviews
        
            - compare the Tomatometer critic scores with the audience scores by plotting them in a scatter plot
            
                - horizontal axis is critic score and vertical axis is audience score
                
                    - we can see movies with high critic scores and low audience scores, movies that are high on both, and so on, so we can split the plot into four quadrants
                    
                    - Roger Ebert is a favorite reviewer. We can create a word cloud from his reviews to see which words he uses
                    
                        - Note: a word cloud is a data visualization in which the size of each word encodes information (typically how often the word appears)
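
The four-quadrant split described above can be sketched in a few lines. This is only an illustration: the `quadrant` helper, the threshold of 50, and the score pairs are all assumptions made up for the example, not values from the lesson.

```python
def quadrant(critic_score, audience_score, threshold=50):
    """Label a movie by which side of the threshold each score falls on.

    The default threshold of 50 is an arbitrary illustrative midpoint.
    """
    critic = "high critic" if critic_score >= threshold else "low critic"
    audience = "high audience" if audience_score >= threshold else "low audience"
    return f"{critic} / {audience}"


# Made-up (critic_score, audience_score) pairs, one per quadrant
examples = [(95, 90), (95, 30), (20, 85), (15, 10)]
labels = [quadrant(critic, audience) for critic, audience in examples]

for pair, label in zip(examples, labels):
    print(pair, "->", label)
```

In a real scatter plot the same idea would show up as one vertical and one horizontal reference line crossing at the chosen thresholds.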
        
       

## Source Files on Hand (Flat files)

- Gathering can be complicated, but sometimes it is easy and we are just given a file

- We are given here:

    - a TSV file (tab-separated values)
    
    - Flat file structure
    
- This is like being emailed a file by a colleague, getting it on a thumb drive, or pulling it from a company's file storage system

    - this is an internal system
    
- Flat File Family

    - core traits
    
        - tabular data (data structured in rows where each row has columns) and plain text formatting
        
        - one data record per line, with one or more fields
        
        - fields are separated by a delimiter
        
        - pandas or a spreadsheet application uses these cues in the data to decide how to represent it in the program
        
        - human readable
        
   - differences
   
       - delimiter can change, like tabs or commas
       
   - Excel and .txt files aren't necessarily flat files
   
       -  .txt files can be if they follow the core traits of a flat file, but they can also just be text blobs
       
       - Excel files are tabular, but they aren't flat. A modern Excel (.xlsx) file is a zip archive of XML files
       
           - the zip format isn't human readable; once unzipped, the XML is human readable, but it isn't as easy to read as a flat file
           
   - advantages:
   
       - they are text files and therefore human readable
       
       - lightweight
       
       - simple to understand
       
       - lots of software can read/write to text files (like text editors)
       
       - great for small datasets
       
  - disadvantages:
  
      - lack of standards
      
      - data redundancy
      
      - sharing data can be cumbersome
      
      - not great for large datasets 
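
To make the "one record per line, fields separated by a delimiter" trait concrete, here is a small sketch using the standard library's `csv` module. The two in-memory strings are stand-ins for real files; only the delimiter differs between them.

```python
import csv
import io

# The same two records, once tab-delimited and once comma-delimited
tsv_text = "ranking\ttitle\n1\tThe Wizard of Oz (1939)\n"
csv_text = "ranking,title\n1,The Wizard of Oz (1939)\n"

# csv.reader splits each line into fields on the given delimiter
tsv_rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
csv_rows = list(csv.reader(io.StringIO(csv_text), delimiter=","))

print(tsv_rows)
print(tsv_rows == csv_rows)
```

Once parsed, both files give the same rows, which is why the delimiter is the main thing that differs within the flat file family.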

        
      

Flat files vs databases:

- no-brainer for a flat file: 6 by 9 (a small number of rows and columns)
- no-brainer for a database: 860 weather stations by 1388 days and counting

- Small becomes large when:

    - data management takes away from analysis

    - it depends on the number of files rather than the number of columns: 100 files perhaps, 1000 for sure
    
    - many people share the data: data management and documentation become hard when more people are analyzing and generating it

Flat files in Python

- because they are human readable, it is easy to parse or understand a file with base Python

- How to do it:

    - open the file, read the text line by line, separate the content based on the delimiter, and store everything in your favorite data structure
    
    - but we can use pandas instead, which can handle many file types and is great for tabular data
    
    - `read_csv` does this for us
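
The base-Python steps above (open, read line by line, split on the delimiter, store) might look like the sketch below. An in-memory `io.StringIO` stands in for the opened file so the example runs on its own, and the column layout is an assumption based on the DataFrame shown further down.

```python
import io

# Stand-in for open("rotten_tomato.tsv", encoding="utf-8"); column layout is assumed
flat_file = io.StringIO(
    "ranking\tcritic_score\ttitle\tnumber_of_critic_ratings\n"
    "1\t99\tThe Wizard of Oz (1939)\t110\n"
    "2\t100\tCitizen Kane (1941)\t75\n"
)

delimiter = "\t"

# The first line holds the column names
header = flat_file.readline().rstrip("\n").split(delimiter)

# Every following line is one record; store each as a dictionary
records = []
for line in flat_file:
    fields = line.rstrip("\n").split(delimiter)
    records.append(dict(zip(header, fields)))

print(records[0]["title"])
```

Note that every value stays a string here; converting types is one of the chores `read_csv` takes care of.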

In [246]:
%ls

 ebert_reviews/            'Introduction Data Wrangling.ipynb'
 example-job-posting.jpg   online-job-postings.csv
 features.txt              rotten_tomato.tsv
'Gathering Data.ipynb'     rt_html/


In [247]:
import pandas as pd
df = pd.read_csv("rotten_tomato.tsv", delimiter="\t")

In [248]:
df

Unnamed: 0,ranking,critic_score,title,number_of_critic_ratings
0,1,99,The Wizard of Oz (1939),110
1,2,100,Citizen Kane (1941),75
2,3,100,The Third Man (1949),77
3,4,99,Get Out (2017),282
4,5,97,Mad Max: Fury Road (2015),370
...,...,...,...,...
95,96,100,Man on Wire (2008),156
96,97,97,Jaws (1975),74
97,98,100,Toy Story (1995),78
98,99,97,"The Godfather, Part II (1974)",72


## Web Scraping

- We want to add the audience score and the number of audience reviews to our dataset

    - these are not easily accessible from Rotten Tomatoes
    
- Web scraping is extracting data from websites with code

    - the idea is simple: we grab the HTML from a website, and because we get back text with tags, we can use a parser on it
    
        - BeautifulSoup is great for this
        
        - we can either download the HTML and parse it offline, or fetch and parse it over the internet
        
- We first need to get each page's data as HTML

    - we can save it manually, but there are better ways to do it
    
    - we can download it programmatically
    
         - we want extra data points for 100 movies, so we would need to do that with a loop
         
- To web scrape, all we really need is to understand HTML file structure

    - tags enclose elements, and the elements are structured as a tree via nested tags
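
To see the tree that nested tags form, here is a minimal sketch that records each opening tag with its nesting depth. It uses the standard library's `html.parser` so it runs without extra installs; the `TreePrinter` class and the HTML snippet are made up for illustration, and the depth tracking assumes every tag is explicitly closed.

```python
from html.parser import HTMLParser


class TreePrinter(HTMLParser):
    """Record each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        # A new element starts one level below its parent
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        # Closing a tag moves back up one level
        self.depth -= 1


html = "<html><head><title>Hello</title></head><body><h1>Hi</h1><p>Text</p></body></html>"
printer = TreePrinter()
printer.feed(html)

for depth, tag in printer.outline:
    print("  " * depth + tag)
```

The indented printout mirrors the tree: `head` and `body` are children of `html`, and `title`, `h1`, and `p` sit one level deeper.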

In [249]:
# This is how simple it is to get data from the internet
import requests
url = "https://www.rottentomatoes.com/m/et_the_extraterrestrial"
response = requests.get(url)

In [250]:
# This will save the page to a file on the computer
with open("et_the_extraterrestrial", mode="wb") as file:
    file.write(response.content)

In [251]:
!head -4 et_the_extraterrestrial

<!DOCTYPE html>
<html lang="en"
      dir="ltr"
      xmlns:fb="http://www.facebook.com/2008/fbml"


In [252]:
# We don't actually need to download it, so we can remove it
%rm et_the_extraterrestrial

In [253]:
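# Quick check that BeautifulSoup parses a snippet of HTML
from bs4 import BeautifulSoup
soup = BeautifulSoup("<h1>Hello World</h1>", "html.parser")
print(soup.h1)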

<h1>Hello World</h1>


In [254]:
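# Parse the response directly with BS4, without saving it to disk first
# html.parser is Python's built-in parser; lxml also works if installed
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")
soup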

<!DOCTYPE html>

<html dir="ltr" lang="en" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema/">
<head prefix="og: http://ogp.me/ns# flixstertomatoes: http://ogp.me/ns/apps/flixstertomatoes#">
<!-- salt=lay-def-02-juRm -->
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="ie=edge" http-equiv="x-ua-compatible"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>E.T. The Extra-Terrestrial (1982) - Rotten Tomatoes</title>
<meta 2002,="" a="" alien="" and="" around="" as="" be="" becomes="" before="" begin="" braver="" but="" by="" can="" communicate="" communication="" content="Both a classic movie for kids and a remarkable portrait of childhood, E.T. is a sci-fi adventure that captures that strange moment in youth when the world is a place of mysterious possibilities (some wonderful, some awful), and the universe seems somehow separate from the one inhabited by grown-ups. Henry Thomas pla

- We could parse HTML with the string `.find()` method or with regular expressions

- BS4 is a better fit for this: it is a library built for parsing HTML
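
For comparison, here is a rough sketch of the two manual approaches mentioned above, pulling the page title out of a short HTML string. The HTML snippet is a trimmed stand-in for the real page, and both approaches assume exactly one simple `<title>` tag with no attributes.

```python
import re

html = "<html><head><title>E.T. The Extra-Terrestrial (1982) - Rotten Tomatoes</title></head></html>"

# Approach 1: the string .find() method locates the tag boundaries
start = html.find("<title>") + len("<title>")
end = html.find("</title>")
title_via_find = html[start:end]

# Approach 2: a regular expression with a capture group
match = re.search(r"<title>(.*?)</title>", html)
title_via_regex = match.group(1)

print(title_via_find)
print(title_via_find == title_via_regex)
```

Both work on this tidy input, but they get brittle quickly on real pages, which is the reason to reach for a parser.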


In [255]:
# We will get the title with BS4

In [256]:
title = soup.find("title").contents[0][:-len(" - Rotten Tomatoes")]
title

'E.T. The Extra-Terrestrial (1982)'

In [257]:
%ls

 ebert_reviews/            'Introduction Data Wrangling.ipynb'
 example-job-posting.jpg   online-job-postings.csv
 features.txt              rotten_tomato.tsv
'Gathering Data.ipynb'     rt_html/


In [258]:
import os
# List of dictionaries to build file by file and later convert to a DataFrame
df_list = []
# Folder with saved web scraped data
folder = "rt_html"
for movie_html in os.listdir(folder):
    file_path = os.path.join(folder, movie_html)

    assert os.path.isfile(file_path)

    # Grab the title, audience score, and number of audience ratings
    with open(file_path, encoding="utf-8") as file:
        # html.parser is Python's built-in parser; lxml also works if installed
        soup = BeautifulSoup(file, "html.parser")
        # Grabbing the title and score was simple
        title = soup.title.string[:-len(" - Rotten Tomatoes")]
        rating = soup.find("div", "audience-score meter").find("span").string[:-1]

        # The audience number required a simple loop
        audience_number_raw = soup.find_all("span", class_="subtle superPageFontColor")
        audience_number = None
        for span in audience_number_raw:
            # span.string is None when the span has nested tags, so guard against it
            if span.string and "User Ratings:" in span.string:
                parent = span.parent
                # Remove the first matching span from the parent before reading the number
                parent.find("span", class_="subtle superPageFontColor").decompose()
                raw_number = parent.contents[1]
                audience_number = str(raw_number).strip().replace(",", "")

        # Test for missing values, which would indicate a broken scraper
        assert title is not None and rating is not None and audience_number is not None

        # Append to list of dictionaries
        df_list.append({'title': title,
                        'audience_score': int(rating),
                        'number_of_audience_ratings': int(audience_number)})
df = pd.DataFrame(df_list,
                  columns=['title',
                           'audience_score',
                           'number_of_audience_ratings'])


In [259]:
# We have the data in a dataframe!!
df.head()

Unnamed: 0,title,audience_score,number_of_audience_ratings
0,Modern Times (1936),95,39736
1,Vertigo (1958),93,101454
2,Skyfall (2012),86,372497
3,It Happened One Night (1934),93,33106
4,All About Eve (1950),94,44564
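
Since the goal is a master dataset, here is a hedged sketch of how the critic table and the audience table could be joined on `title` with pandas. The two small frames use placeholder titles and scores (not real data), and an inner join is just one reasonable choice.

```python
import pandas as pd

# Placeholder rows (not real scores) standing in for the two gathered tables
critic_df = pd.DataFrame({"title": ["Movie A", "Movie B"],
                          "critic_score": [90, 80]})
audience_df = pd.DataFrame({"title": ["Movie B", "Movie A"],
                            "audience_score": [70, 60]})

# An inner join keeps only the titles present in both tables
master_df = critic_df.merge(audience_df, on="title", how="inner")
print(master_df)
```

The join only matches titles that are spelled identically in both sources, so it is worth checking the row count after merging.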


## Downloading Files From The Internet

- We will grab all the words from Roger Ebert's reviews for his word cloud

- We have 88 .txt files hosted by Udacity (the URL list below skips some of the 100 rankings)

- We only need `requests` and an understanding of HTTP basics to download files from the internet

In [260]:
import requests
import os

In [261]:
folder_name = 'ebert_reviews'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

In [262]:
ebert_review_urls = ['https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9900_1-the-wizard-of-oz-1939-film/1-the-wizard-of-oz-1939-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9901_2-citizen-kane/2-citizen-kane.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9901_3-the-third-man/3-the-third-man.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9902_4-get-out-film/4-get-out-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9902_5-mad-max-fury-road/5-mad-max-fury-road.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9902_6-the-cabinet-of-dr.-caligari/6-the-cabinet-of-dr.-caligari.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9903_7-all-about-eve/7-all-about-eve.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9903_8-inside-out-2015-film/8-inside-out-2015-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9903_9-the-godfather/9-the-godfather.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9904_10-metropolis-1927-film/10-metropolis-1927-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9904_11-e.t.-the-extra-terrestrial/11-e.t.-the-extra-terrestrial.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9904_12-modern-times-film/12-modern-times-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9904_14-singin-in-the-rain/14-singin-in-the-rain.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9905_15-boyhood-film/15-boyhood-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9905_16-casablanca-film/16-casablanca-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9905_17-moonlight-2016-film/17-moonlight-2016-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9906_18-psycho-1960-film/18-psycho-1960-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9906_19-laura-1944-film/19-laura-1944-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9906_20-nosferatu/20-nosferatu.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9907_21-snow-white-and-the-seven-dwarfs-1937-film/21-snow-white-and-the-seven-dwarfs-1937-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9907_22-a-hard-day27s-night-film/22-a-hard-day27s-night-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9907_23-la-grande-illusion/23-la-grande-illusion.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9908_25-the-battle-of-algiers/25-the-battle-of-algiers.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9908_26-dunkirk-2017-film/26-dunkirk-2017-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9908_27-the-maltese-falcon-1941-film/27-the-maltese-falcon-1941-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9909_29-12-years-a-slave-film/29-12-years-a-slave-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9909_30-gravity-2013-film/30-gravity-2013-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9909_31-sunset-boulevard-film/31-sunset-boulevard-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990a_32-king-kong-1933-film/32-king-kong-1933-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990a_33-spotlight-film/33-spotlight-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990a_34-the-adventures-of-robin-hood/34-the-adventures-of-robin-hood.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990b_35-rashomon/35-rashomon.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990b_36-rear-window/36-rear-window.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990b_37-selma-film/37-selma-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990c_38-taxi-driver/38-taxi-driver.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990c_39-toy-story-3/39-toy-story-3.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990c_40-argo-2012-film/40-argo-2012-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990d_41-toy-story-2/41-toy-story-2.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990d_42-the-big-sick/42-the-big-sick.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990d_43-bride-of-frankenstein/43-bride-of-frankenstein.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990d_44-zootopia/44-zootopia.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990e_45-m-1931-film/45-m-1931-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990e_46-wonder-woman-2017-film/46-wonder-woman-2017-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990e_48-alien-film/48-alien-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990f_49-bicycle-thieves/49-bicycle-thieves.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990f_50-seven-samurai/50-seven-samurai.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad990f_51-the-treasure-of-the-sierra-madre-film/51-the-treasure-of-the-sierra-madre-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9910_52-up-2009-film/52-up-2009-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9910_53-12-angry-men-1957-film/53-12-angry-men-1957-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9910_54-the-400-blows/54-the-400-blows.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9911_55-logan-film/55-logan-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9911_57-army-of-shadows/57-army-of-shadows.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9912_58-arrival-film/58-arrival-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9912_59-baby-driver/59-baby-driver.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9913_60-a-streetcar-named-desire-1951-film/60-a-streetcar-named-desire-1951-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9913_61-the-night-of-the-hunter-film/61-the-night-of-the-hunter-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9913_62-star-wars-the-force-awakens/62-star-wars-the-force-awakens.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9913_63-manchester-by-the-sea-film/63-manchester-by-the-sea-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9914_64-dr.-strangelove/64-dr.-strangelove.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9914_66-vertigo-film/66-vertigo-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9914_67-the-dark-knight-film/67-the-dark-knight-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9915_68-touch-of-evil/68-touch-of-evil.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9915_69-the-babadook/69-the-babadook.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9915_72-rosemary27s-baby-film/72-rosemary27s-baby-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9916_73-finding-nemo/73-finding-nemo.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9916_74-brooklyn-film/74-brooklyn-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9917_75-the-wrestler-2008-film/75-the-wrestler-2008-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9917_77-l.a.-confidential-film/77-l.a.-confidential-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9918_78-gone-with-the-wind-film/78-gone-with-the-wind-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9918_79-the-good-the-bad-and-the-ugly/79-the-good-the-bad-and-the-ugly.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9918_80-skyfall/80-skyfall.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9919_82-tokyo-story/82-tokyo-story.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9919_83-hell-or-high-water-film/83-hell-or-high-water-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9919_84-pinocchio-1940-film/84-pinocchio-1940-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad9919_85-the-jungle-book-2016-film/85-the-jungle-book-2016-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991a_86-la-la-land-film/86-la-la-land-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991b_87-star-trek-film/87-star-trek-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991b_89-apocalypse-now/89-apocalypse-now.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991c_90-on-the-waterfront/90-on-the-waterfront.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991c_91-the-wages-of-fear/91-the-wages-of-fear.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991c_92-the-last-picture-show/92-the-last-picture-show.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991d_93-harry-potter-and-the-deathly-hallows-part-2/93-harry-potter-and-the-deathly-hallows-part-2.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991d_94-the-grapes-of-wrath-film/94-the-grapes-of-wrath-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991d_96-man-on-wire/96-man-on-wire.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991e_97-jaws-film/97-jaws-film.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991e_98-toy-story/98-toy-story.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991e_99-the-godfather-part-ii/99-the-godfather-part-ii.txt',
                     'https://d17h27t6h515a5.cloudfront.net/topher/2017/September/59ad991e_100-battleship-potemkin/100-battleship-potemkin.txt']

In [266]:
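for url in ebert_review_urls:
    response = requests.get(url)
    # Stop early on a failed download instead of saving an error page
    response.raise_for_status()
    # Name each local file after the last segment of its URL
    with open(os.path.join(folder_name,
                           url.split('/')[-1]),
              mode="wb") as file:
        file.write(response.content)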

In [267]:
assert len(os.listdir("ebert_reviews/")) == 88

Text File Structure

- Definition: a text file uses a specific character set and has no specific formatting and no media

- lines are separated by `\n` (the newline character), which is usually invisible

- Flat files are text files with a structure, whereas the Roger Ebert files are blobs

- character sets and encodings are something every programmer needs to know

    - it makes no sense to have a string without knowing its encoding
    
    - there is no such thing as plain text; it always has an encoding
    
        - whether it is in memory, in a file, or in an email
        
        - if we don't know its encoding we can't display it correctly
        
        - Quote: “...the simple fact that if you don’t tell me whether a particular string is encoded using UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western European), you simply cannot display it correctly or even figure out where it ends. There are over a hundred encodings and above code point 127, all bets are off.”
        
- Python uses one text type, `str`, which holds Unicode data

- and two byte types: `bytes` and `bytearray`
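
A small sketch of why the encoding matters: the same bytes read back differently depending on which encoding we decode them with. The sample string is taken from one of the movie titles in this dataset.

```python
text = "L'Armée des ombres"

# Encoding turns str (Unicode text) into bytes
utf8_bytes = text.encode("utf-8")
print(type(utf8_bytes).__name__, len(text), len(utf8_bytes))

# Decoding with the right encoding recovers the original text
assert utf8_bytes.decode("utf-8") == text

# Decoding with the wrong encoding raises no error here, it just yields garbled text
wrong = utf8_bytes.decode("latin-1")
print(wrong)
```

The accented character takes two bytes in UTF-8, so the byte count is one more than the character count, and Latin-1 misreads those two bytes as two separate characters.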

Text files:

- We need to open these blob files and read them into a dataframe

- we will iterate over each file with a loop

- we can do this with `os` and `glob`
    
    - with `glob` we can use patterns with wildcards to specify sets of file names
  
- We should always specify the encoding when we open a file (check the file's source to find out which one), or else our code will break a lot
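
Here is a minimal sketch of a `glob` wildcard pattern at work. It creates a throwaway folder with a few empty files (the file names are hypothetical) and then selects only the `.txt` ones.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Create a few empty files to match against (hypothetical names)
    for name in ["1-review.txt", "2-review.txt", "notes.md"]:
        with open(os.path.join(folder, name), mode="w", encoding="utf-8"):
            pass

    # '*' matches any run of characters, so this selects only the .txt files
    pattern = os.path.join(folder, "*.txt")
    matched = sorted(os.path.basename(path) for path in glob.glob(pattern))

print(matched)
```

`glob.glob` returns paths in arbitrary order, so the sketch sorts them to get a stable result.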

In [283]:
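import glob

# Build a list of dictionaries, one per review file, then convert to a DataFrame
df_list = []
for ebert_review in glob.glob('ebert_reviews/*.txt'):
    with open(ebert_review, encoding="utf-8") as f:
        # First line is the title, second is the review URL; [:-1] strips the newline
        title = f.readline()[:-1]
        review_url = f.readline()[:-1]
        # Everything after the first two lines is the review text
        review_text = f.read()
        df_list.append({'title': title,
                        'review_url': review_url,
                        'review_text': review_text})
ebert_df = pd.DataFrame(df_list)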

In [284]:
ebert_df.head()

Unnamed: 0,title,review_url,review_text
0,The Maltese Falcon (1941),http://www.rogerebert.com/reviews/great-movie-...,Among the movies we not only love but treasure...
1,M (1931),http://www.rogerebert.com/reviews/great-movie-...,The horror of the faces: That is the overwhelm...
2,Citizen Kane (1941),http://www.rogerebert.com/reviews/great-movie-...,“I don't think any word can explain a man's li...
3,The Big Sick (2017),http://www.rogerebert.com/reviews/the-big-sick...,"It sounds impossible—too melodramatic, too cra..."
4,Army of Shadows (L'Armée des ombres) (1969),http://www.rogerebert.com/reviews/great-movie-...,"Jean-Pierre Melville's ""Army of Shadows"" is ab..."
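
The review text gathered here is the raw material for the word cloud, where word size encodes how often a word appears. A rough sketch of the counting step with `collections.Counter` follows; the sentence is a made-up stand-in for one `review_text` value, and the simple regular expression is only one possible way to split words.

```python
import re
from collections import Counter

# Made-up stand-in for one row of review_text
review_text = "A great film. A great cast, and a great score."

# Lowercase the text and pull out runs of letters and apostrophes as words
words = re.findall(r"[a-z']+", review_text.lower())
word_counts = Counter(words)

print(word_counts.most_common(2))
```

A real word cloud would usually drop filler words such as "a" and "and" first, so that the large words are the informative ones.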


## APIs and Access Libraries

- We want the pictures from the Wikipedia API

- The MediaWiki API lets us grab the pictures

- we can use client libraries (access libraries) to make it simpler to use the API

    - We can use the access library wptools
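
To get a feel for what an access library does for us, here is a sketch that builds a raw MediaWiki query URL by hand. It only constructs the URL and makes no network call. The `page_image_query_url` helper is hypothetical, and the parameters (`prop=pageimages`, `piprop=original`) are my assumption of a suitable query for a page's main image, so check them against the MediaWiki API documentation before relying on them.

```python
from urllib.parse import urlencode

# English Wikipedia's MediaWiki API endpoint
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def page_image_query_url(page_title):
    """Build a MediaWiki query URL asking for a page's main image (assumed parameters)."""
    params = {
        "action": "query",
        "prop": "pageimages",
        "piprop": "original",
        "format": "json",
        "titles": page_title,
    }
    return API_ENDPOINT + "?" + urlencode(params)


url = page_image_query_url("E.T. the Extra-Terrestrial")
print(url)
```

The URL could then be fetched with `requests.get(url)` and the JSON response parsed; an access library hides this URL building and response handling behind simpler calls.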

In [296]:
# Shell command: run this in a terminal, not in a notebook cell (here it raises a SyntaxError)
# source venv/bin/activate

In [289]:
response.content

