---

Here, we will create a web scraper to scrape OpenTable's DC listings. We're interested in each restaurant's name, location, price, and how many times it was booked today. OpenTable provides all of this information on its listings page: http://www.opentable.com/washington-dc-restaurant-listings. Let's inspect the elements of this page to ensure we can find each piece of information we're interested in, beginning with importing the libraries we need.

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
import twitter, re, datetime, pandas as pd
from textacy import preprocessing
from collections import Counter
import matplotlib.pyplot as plt
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep
import numpy as np
import requests
import urllib
import json

%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [2]:
# Let's set the url we want to visit #
url = 'http://www.opentable.com/washington-dc-restaurant-listings'

# Let's visit that url and grab the html #
html = requests.get(url)

In [3]:
# Let's check what's in the html (.text returns the response content as a Unicode string) #
html.text[:500]

'           <!DOCTYPE html><html lang="en"><head><meta charset="utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=9; IE=8; IE=7; IE=EDGE"/> <title>Restaurant Reservation Availability</title>    <meta  name="robots" content="noindex,nofollow" > </meta>     <link rel="shortcut icon" href="//components.otstatic.com/components/favicon/1.0.6/favicon/favicon.ico" type="image/x-icon"/><link rel="icon" href="//components.otstatic.com/components/favicon/1.0.6/favicon/favicon-16.png" sizes="16x16"/><l'

In [4]:
# Let's convert into a soup object so we can parse it #
soup = BeautifulSoup(html.text, 'html.parser')

Note: We will utilize the web browser inspect tool to find the tags associated with elements of the page we want to scrape.

As a soup object, we can now begin to retrieve data from the HTML page.

In [5]:
# Let's print the restaurant names #
soup.find_all(name='span', attrs={'class':'rest-row-name-text'})

[<span class="rest-row-name-text">Et Corner</span>,
 <span class="rest-row-name-text">Napoleon Ondricka</span>,
 <span class="rest-row-name-text">922 O'Reilly</span>,
 <span class="rest-row-name-text">Zackary Lakin</span>,
 <span class="rest-row-name-text">Murray</span>,
 <span class="rest-row-name-text">Doloremque Burgs</span>,
 <span class="rest-row-name-text">Villages</span>,
 <span class="rest-row-name-text">Cum Mill</span>,
 <span class="rest-row-name-text">Nesciunt Avenue</span>,
 <span class="rest-row-name-text">429 Kohler</span>,
 <span class="rest-row-name-text">Schuster</span>,
 <span class="rest-row-name-text">Minus Vista</span>,
 <span class="rest-row-name-text">Magnam Place</span>,
 <span class="rest-row-name-text">Blanditiis Fords</span>,
 <span class="rest-row-name-text">Kaileys</span>,
 <span class="rest-row-name-text">Treutel</span>,
 <span class="rest-row-name-text">Libero</span>,
 <span class="rest-row-name-text">Quia Shields</span>,
 <span class="rest-row-name-text"

In [6]:
# Let's print out the restaurant names for each element we find #
for entry in soup.find_all(name='span', attrs={'class':'rest-row-name-text'}):
    print(entry.text)

Et Corner
Napoleon Ondricka
922 O'Reilly
Zackary Lakin
Murray
Doloremque Burgs
Villages
Cum Mill
Nesciunt Avenue
429 Kohler
Schuster
Minus Vista
Magnam Place
Blanditiis Fords
Kaileys
Treutel
Libero
Quia Shields
Iure
Hazles
Consequatur
Fredericks
856 Nader
Suscipit
Kellis
Kertzmann Keys
Agloe Bar & Grill
Lauras
Exercitationem Schimmel
Ut
Annabelle Center
Brandi Barton
Illo Court
1161 Kemmer
Iure Branch
Corrupti Waelchi
Padberg
Dolorem Creek
Granvilles
542 Erdman
Quisquam D'Amore
Perferendis Locks
Soluta Corners
Non Vista
Jasmin Koss
Casper
618 Larson
Laborum Groves
Izabella Will
Donnell Haag
Veronica Anderson
Buster Falls
Considine
Mitchell Plains
Chester Radial
Emmas
1448 Gaylord
Carloss
Village
Via
Hickle
Est
Erdman Islands
Coralie Kessler
Sunt
535 Ortiz
Langosh Lake
Minima Jaskolski
Oswaldo Legros
Recusandae Deckow
Island
Vitae
Mountains
Ikes
Scarletts
Volkman Flat
Quaerat Light
1188 Dietrich
Consectetur Field
Electas
427 Little
524 Hettinger
Drives
Dolorem Lakes
Raphaelles
Mason Por

In [7]:
# Let's print the restaurant locations #
soup.find_all(name='span', attrs={'class':'rest-row-meta--location'})

[<span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">West Toneystad</span>,
 <span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">Port Lilianaville</span>,
 <span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">North Kassandra</span>,
 <span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">Koeppbury</span>,
 <span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">New Karltown</span>,
 <span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">West Cielo</span>,
 <span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">Watsicaburgh</span>,
 <span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">Pacochaside</span>,
 <span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">Schaefermouth</span>,
 <span class="rest-row-meta--location rest-row-meta-text sfx1388addContent">Keeblerside</span>,
 <span class="rest-row-met

In [8]:
# Let's print out the restaurant locations for each element we find #
for entry in soup.find_all('span', {'class':'rest-row-meta--location'}):
    print(entry.text)

West Toneystad
Port Lilianaville
North Kassandra
Koeppbury
New Karltown
West Cielo
Watsicaburgh
Pacochaside
Schaefermouth
Keeblerside
Lake Aronhaven
West Vinnie
Lake Natalie
South Scotburgh
Felipabury
Abbottville
Lydaburgh
New Anissa
Jacobiport
Walterview
South Toreyhaven
West Maye
Eugenialand
Ratkeside
Cheyenneville
Halvorsonburgh
East Alaynaside
West Cierraborough
Traceymouth
Lake Jakobfort
Judyville
East Jamelmouth
East Keyon
Lake Josefina
South Destini
Port Marisaberg
Lake Vitaborough
New Ottostad
Schoenhaven
Port Norris
West Nyah
Maverickborough
Lake Moriahbury
West Dino
West Kelli
Addiemouth
Ramonfurt
Walkerborough
New Hillary
Lake Emelie
Treutelhaven
Elinorehaven
New Justus
Port Mabelleberg
Colbyside
Trinityton
West Shanon
North Johnathon
Auerton
New Nora
Kassulkefurt
West Kareemside
Lelaside
Grahammouth
Vitoport
Bergstromchester
New Fernefurt
West Tad
New Taylorland
Toyville
Rosebury
East Korbin
North Heavenberg
Kutchburgh
Demetrisside
Lake Maximohaven
Jayceechester
Harveyville

In [9]:
# Let's print the restaurant prices #
soup.find_all('div', {'class':'rest-row-pricing'})

[<div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    </i> <span class="pricing--not-the-price">  $        </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $      </i> <span class="pricing--not-the-price">  $    $      </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    $  </i> <span class="pricing--not-the-price"> </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    $  </i> <span class="pricing--not-the-price"> </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    </i> <span class="pricing--not-the-price">  $        </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $    $    </i> <span class="pricing--not-the-price">  $        </span></div>,
 <div class="rest-row-pricing"> <i class="pricing--the-price">  $    $      </i> <span class="pricing--not-the-price">  $    $      

In [10]:
# Let's print out the restaurant prices (dollar signs) for each element we find #
for entry in soup.find_all('div', {'class':'rest-row-pricing'}):
    print(entry.find('i').text)

  $    $    $    
  $    $      
  $    $    $    $  
  $    $    $    $  
  $    $    $    
  $    $    $    
  $    $      
  $    $    $    $  
  $    $    $    $  
  $    $    $    $  
  $    $    $    $  
  $    $    $    
  $    $    $    $  
  $    $    $    $  
  $    $      
  $    $    $    $  
  $    $      
  $    $      
  $    $    $    $  
  $    $    $    $  
  $    $    $    $  
  $    $      
  $    $      
  $    $      
  $    $      
  $    $      
  $    $    $    
  $    $    $    $  
  $    $    $    
  $    $    $    $  
  $    $      
  $    $    $    
  $    $    $    $  
  $    $    $    $  
  $    $    $    
  $    $      
  $    $      
  $    $    $    $  
  $    $      
  $    $    $    
  $    $    $    $  
  $    $    $    $  
  $    $    $    $  
  $    $    $    $  
  $    $    $    
  $    $    $    $  
  $    $    $    
  $    $      
  $    $    $    
  $    $    $    
  $    $      
  $    $    $    
  $    $      
  $    $    $    $  
  $    $  

In [11]:
# Let's try to print the number of dollars signs per restaurant #
for entry in soup.find_all('div', {'class':'rest-row-pricing'}):
    price = entry.find('i').text
    print('Number of $:',price.count('$'))

Number of $: 3
Number of $: 2
Number of $: 4
Number of $: 4
Number of $: 3
Number of $: 3
Number of $: 2
Number of $: 4
Number of $: 4
Number of $: 4
Number of $: 4
Number of $: 3
Number of $: 4
Number of $: 4
Number of $: 2
Number of $: 4
Number of $: 2
Number of $: 2
Number of $: 4
Number of $: 4
Number of $: 4
Number of $: 2
Number of $: 2
Number of $: 2
Number of $: 2
Number of $: 2
Number of $: 3
Number of $: 4
Number of $: 3
Number of $: 4
Number of $: 2
Number of $: 3
Number of $: 4
Number of $: 4
Number of $: 3
Number of $: 2
Number of $: 2
Number of $: 4
Number of $: 2
Number of $: 3
Number of $: 4
Number of $: 4
Number of $: 4
Number of $: 4
Number of $: 3
Number of $: 4
Number of $: 3
Number of $: 2
Number of $: 3
Number of $: 3
Number of $: 2
Number of $: 3
Number of $: 2
Number of $: 4
Number of $: 4
Number of $: 2
Number of $: 4
Number of $: 3
Number of $: 3
Number of $: 3
Number of $: 3
Number of $: 3
Number of $: 3
Number of $: 3
Number of $: 3
Number of $: 4
Number of 
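
The three fields scraped so far can be gathered into one record per restaurant. The following is a minimal sketch rather than the notebook's own method: it runs on a made-up HTML snippet that mimics the class names seen above, and it assumes the name, location, and price lists line up one-to-one in page order (the real page may not guarantee that).

```python
from bs4 import BeautifulSoup

# Made-up sample markup that mimics the class names used above #
sample_html = """
<div><span class="rest-row-name-text">Et Corner</span>
<span class="rest-row-meta--location rest-row-meta-text">West Toneystad</span>
<div class="rest-row-pricing"><i class="pricing--the-price"> $ $ $ </i><span class="pricing--not-the-price"> $ </span></div></div>
<div><span class="rest-row-name-text">Napoleon Ondricka</span>
<span class="rest-row-meta--location rest-row-meta-text">Port Lilianaville</span>
<div class="rest-row-pricing"><i class="pricing--the-price"> $ $ </i><span class="pricing--not-the-price"> $ $ </span></div></div>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Let's collect each field with the same selectors used in the cells above #
names = [tag.text.strip() for tag in sample_soup.find_all('span', {'class': 'rest-row-name-text'})]
locations = [tag.text.strip() for tag in sample_soup.find_all('span', {'class': 'rest-row-meta--location'})]
prices = [tag.find('i').text.count('$') for tag in sample_soup.find_all('div', {'class': 'rest-row-pricing'})]

# Assumption: the three lists line up one-to-one, in page order #
records = [{'name': name, 'location': loc, 'price': price}
           for name, loc, price in zip(names, locations, prices)]

for record in records:
    print(record)
```

Note that zip stops at the shortest list, so a missing field on the real page would silently misalign or drop rows; comparing the three list lengths first is a cheap safeguard.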

In [12]:
# Let's print the number of times each restaurant was booked #
soup.find_all('div', {'class':'booking'})

[]

It seems like we can't find the number of bookings for each restaurant. This is likely because the number of bookings is dynamic data that the page fills in with JavaScript after it loads (as opposed to the restaurant's name, location, and price, which are static data present in the initial HTML). Thus, the JavaScript must run before we scrape. There are a few ways to resolve this. Here, we'll request that the page load, wait one second, and then grab the source HTML from the page.

Let's continue with Selenium (a browser automation tool that drives a real browser, so JavaScript renders just as it would in a human-navigated browser). The page should believe we're visiting from a live connection on a browser client, and the rendered JavaScript output should become part of the page source.

In [13]:
# Let's visit our relevant page #
driver = webdriver.Firefox()
driver.get('http://www.opentable.com/washington-dc-restaurant-listings')

# Let's wait one second #
sleep(1)

# Let's grab the page source #
html = driver.page_source
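
A fixed one-second sleep may be too short on a slow connection and wasteful on a fast one. One alternative is to poll until the content we need has appeared. The sketch below shows the idea in plain Python; `wait_for`, `FakeDriver`, and `booking_html` are hypothetical names, and the fake driver only stands in for a real browser so the example runs without one. Selenium also ships its own `WebDriverWait` helper for this purpose.

```python
import time

def wait_for(condition, timeout=10.0, interval=0.5):
    """Call condition() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within %s seconds' % timeout)
        time.sleep(interval)

# Stand-in for a Selenium driver: the booking markup only "appears" on the third read #
class FakeDriver:
    def __init__(self):
        self.reads = 0

    @property
    def page_source(self):
        self.reads += 1
        if self.reads < 3:
            return '<html><body>loading</body></html>'
        return '<html><body><div class="booking">Booked 4 times today</div></body></html>'

fake_driver = FakeDriver()

def booking_html():
    # Read the source once per attempt; return it only when the marker is present #
    source = fake_driver.page_source
    return source if 'class="booking"' in source else None

rendered = wait_for(booking_html, timeout=5.0, interval=0.01)
print(fake_driver.reads)
```

With a real driver, the condition would read `driver.page_source` (or look up an element) instead of the fake object.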

In [14]:
# Let's convert into a soup object so we can parse it #
html = BeautifulSoup(html, "lxml")

In [15]:
# Let's print the number of times each restaurant was booked again #
html.find_all('div', {'class':'booking'})

[<div class="booking"><span class="tadpole"></span> Booked 4 times today</div>,
 <div class="booking"><span class="tadpole"></span> Booked 1 times today</div>,
 <div class="booking"><span class="tadpole"></span> Booked 1 times today</div>]

In [16]:
# Let's print out the number of times each restaurant was booked today #
for entry in html.find_all('div', {'class':'booking'}):
    print(entry.text)

 Booked 4 times today
 Booked 1 times today
 Booked 1 times today


In [17]:
# Let's close our driver #
driver.close()

Note: This notebook was created during the Covid-19 pandemic (resulting in the small number of bookings seen above).

Let's clean this up a little bit. We're going to use Regular Expressions (Regex) to grab only the digits in each entry's text.

In [18]:
# Let's grab the text for each entry #
for booking in html.find_all('div', {'class':'booking'}):
    
    # Let's match the first run of digits #
    match = re.search(r'\d+', booking.text)
    
    if match:
        print(match.group())

4
1
1
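
For later analysis it is usually more convenient to keep the counts as integers than to print them. A minimal sketch, assuming a hypothetical helper named `booking_count`; the last two sample strings are made up to show a multi-digit count and a string with no digits.

```python
import re

def booking_count(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else None

# The first two strings follow the output above; the last two are made-up edge cases #
samples = [' Booked 4 times today', ' Booked 1 times today',
           ' Booked 12 times today', ' Not booked yet']

counts = [booking_count(s) for s in samples]
print(counts)
```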


Sometimes an API doesn't provide all the information we would like to get. Let's continue by using a combination of scraping and API calls to find the ratings and networks of famous television shows. The Internet Movie Database contains data about movies and TV shows. Unfortunately, it does not have a public API. However, the webpage http://www.imdb.com/chart/toptv/?ref_=nv_tp_tv250_2 contains a list of the top 250 TV shows of all time.

In [19]:
# Let's create a function to obtain the list of the top 250 results #
def get_top_250():
    response = requests.get('http://www.imdb.com/chart/toptv/?ref_=nv_tp_tv250_2')
    html = response.text
    
    # Let's find everything after /title/ up to the next forward slash in each a href element #
    entries = re.findall('<a href.*?/title/(.*?)/', html)
    
    # Let's deduplicate the matches (the resulting list is unordered) #
    return list(set(entries))

In [20]:
# Let's call the function and check the length of the list #
entries = get_top_250()
len(entries)

251

In [21]:
# Let's check out why we get an extra entry in our list #
entries

['tt0994314',
 'tt0423731',
 'tt5788792',
 'tt0121955',
 'tt5712554',
 'tt0088509',
 'tt6769208',
 'tt0090509',
 'tt7651892',
 'tt9566030',
 'tt0086831',
 'tt0417299',
 'tt7278862',
 'tt0108855',
 'tt1910272',
 'tt0318871',
 'tt0088484',
 'tt1533395',
 'tt0112130',
 'tt1298820',
 'tt2395695',
 'tt1758429',
 'tt5425186',
 'tt0275137',
 'tt1242773',
 'tt4063800',
 'tt0758745',
 'tt0111893',
 'tt4834232',
 'tt4158110',
 'tt4269716',
 'tt2243973',
 'tt0074006',
 'tt0092337',
 'tt4295140',
 'tt6077448',
 'tt0278238',
 'tt0979432',
 'tt0214341',
 'tt7660850',
 'tt1586680',
 'tt4574334',
 'tt0388629',
 'tt0047708',
 'tt1733785',
 'tt0387764',
 'tt0248654',
 'tt10233448',
 'tt0098936',
 'tt5189670',
 'tt9561862',
 'tt7259746',
 'tt0052520',
 'tt0795176',
 'tt1442449',
 'tt6108262',
 'tt0081834',
 'tt0290978',
 'tt0081846',
 'tt5555260',
 'tt1486217',
 'tt9621106',
 'tt1305826',
 'tt2937900',
 'tt0187664',
 'tt2306299',
 'tt0193676',
 'tt4508902',
 'tt0075537',
 'tt0086661',
 'tt2802850',
 'tt0

In [22]:
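# Let's find the index of the entry that isn't an IMDb ID (IDs start with 'tt') and drop it from the list #
nn = None

for x in range(len(entries)):
    if not entries[x].startswith('tt'):
        nn = x

# If no stray entry was found, nn stays None and pop raises an error instead of silently dropping a valid ID #
entries.pop(nn)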

'?count=100&amp;groups=oscar_best_picture_winners&amp;sort=year%2Cdesc&amp;ref_=nv_ch_osc" tabindex="-1" aria-disabled="false"><span class="ipc-list-item__text" role="presentation">Best Picture Winners<'

In [23]:
# Let's check the length of the list again #
len(entries)

250
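
The stray entry appears because `(.*?)` captures anything between `/title/` and the next slash, including a link whose path merely contains `/title/`. A tighter pattern that only accepts `tt` followed by digits would avoid the cleanup step. The sketch below uses a made-up HTML snippet; the exact shape of IMDb's real markup (for example, double-quoted href values that start with `/title/`) is an assumption.

```python
import re

# Made-up markup: two shows (one linked twice) plus a navigation link containing /title/ #
sample_html = '''
<a href="/title/tt0994314/?ref_=chart">Show A</a>
<a href="/title/tt0423731/?ref_=chart">Show B</a>
<a href="/title/tt0994314/?ref_=chart"><img alt="Show A"></a>
<a href="/search/title/?count=100&amp;groups=oscar_best_picture_winners">Best Picture Winners</a>
'''

# The pattern used above also captures the navigation link #
loose = re.findall('<a href.*?/title/(.*?)/', sample_html)

# A stricter pattern only accepts IDs of the form tt + digits #
strict = re.findall(r'<a href="/title/(tt\d+)/', sample_html)

# dict.fromkeys removes duplicates while keeping page order (unlike set) #
unique_ids = list(dict.fromkeys(strict))

print(len(loose), unique_ids)
```

Keeping page order can matter if the chart rank is needed later, since a set discards it.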

Although the Internet Movie Database does not have a public API, an open API exists at http://www.tvmaze.com/api. Let's use this API to retrieve information about each of the 250 TV shows we have just extracted.

In [24]:
# Let's create a function to pull information from the API using JSON interaction and store it in a DataFrame #
shows_df1 = pd.DataFrame(columns = ['show_name', 'rating_avg', 'genres', 'network', 'premiere_date', 'status'])

def get_entry(entry):
    res = requests.get('http://api.tvmaze.com/lookup/shows?imdb=' + entry)
    
    if res.status_code == 200:
        # Let's parse the JSON body once and reuse it for every field #
        results = json.loads(res.text)
        
        try:
            status = results.get('status')
        except AttributeError:
            status = 'NA'
            
        try: 
            rating = results.get('rating').get('average')
        except AttributeError:
            rating = 'NA'
            
        try:
            network = results.get('network').get('name')
        except AttributeError:
            network = 'NA'
            
        try:
            title = results.get('name')
        except AttributeError:
            title = 'NA'
            
        try:
            premier = results.get('premiered')
        except AttributeError:
            premier = 'NA'
            
        try:
            genres = results.get('genres')
        except AttributeError:
            genres = 'NA'
        
        # Let's append the show as a new row #
        shows_df1.loc[len(shows_df1)] = [title, rating, genres, network, premier, status]

In [25]:
# Let's call the above function #
for entry in entries:
    get_entry(entry)
    
shows_df1

Unnamed: 0,show_name,rating_avg,genres,network,premiere_date,status
0,Code Geass,8.1,"[Drama, Action, Anime, Science-Fiction]",MBS,2006-10-05,Ended
1,Samurai Champloo,7.7,"[Comedy, Action, Adventure, Anime]",Fuji TV,2004-05-20,Ended
2,The Marvelous Mrs. Maisel,8.4,"[Drama, Comedy]",,2017-03-17,Running
3,South Park,8.6,[Comedy],Comedy Central,1997-08-13,Running
4,The Grand Tour,8.2,"[Comedy, Adventure]",,2016-11-18,Running
5,Blue Planet II,9.5,[Nature],BBC One,2017-10-29,Ended
6,One Strange Rock,7.5,[],National Geographic Channel,2018-03-26,Running
7,"Yes, Prime Minister",8.6,[Comedy],BBC Two,1986-01-09,Ended
8,Avatar: The Last Airbender,8.9,"[Action, Adventure, Fantasy]",Nickelodeon,2005-02-21,Ended
9,My Brilliant Friend,7.6,[Drama],Rai 1,2018-11-18,Running


In [26]:
# Let's check for null values present, as well as data types for each column #
shows_df1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 226 entries, 0 to 225
Data columns (total 6 columns):
show_name        226 non-null object
rating_avg       203 non-null float64
genres           226 non-null object
network          226 non-null object
premiere_date    226 non-null object
status           226 non-null object
dtypes: float64(1), object(5)
memory usage: 12.4+ KB


In [27]:
# Let's create a function to pull information from the API, converting the JSON into a Python dictionary #
shows_df2 = pd.DataFrame(columns = ['show_name', 'rating_avg', 'genres', 'network', 'premiere_date', 'status'])

def get_entry(entry):
    res=requests.get('http://api.tvmaze.com/lookup/shows?imdb='+entry)
    if res.status_code == 200:
        results = json.loads(res.text)
        
        try:    
            status = results['status']
        except TypeError:
            status = 'NA'   
        try:
            rating = results['rating']['average']
        except TypeError:
            rating = 'NA'
        try:
            network = results['network']['name']
        except TypeError:
            network = 'NA'
        try:   
            title = results['name']
        except TypeError:
            title = 'NA'
        try:   
            genres = results['genres']
        except TypeError:
            genres = 'NA'
        try:   
            premier = results['premiered']
        except TypeError:
            premier = 'NA'
        
        shows_df2.loc[len(shows_df2)] = [title, rating, genres, network, premier, status]

In [28]:
# Let's call the above function #
for entry in entries:
    get_entry(entry)
    
shows_df2

Unnamed: 0,show_name,rating_avg,genres,network,premiere_date,status
0,Code Geass,8.1,"[Drama, Action, Anime, Science-Fiction]",MBS,2006-10-05,Ended
1,Samurai Champloo,7.7,"[Comedy, Action, Adventure, Anime]",Fuji TV,2004-05-20,Ended
2,The Marvelous Mrs. Maisel,8.4,"[Drama, Comedy]",,2017-03-17,Running
3,South Park,8.6,[Comedy],Comedy Central,1997-08-13,Running
4,The Grand Tour,8.2,"[Comedy, Adventure]",,2016-11-18,Running
5,Blue Planet II,9.5,[Nature],BBC One,2017-10-29,Ended
6,One Strange Rock,7.5,[],National Geographic Channel,2018-03-26,Running
7,"Yes, Prime Minister",8.6,[Comedy],BBC Two,1986-01-09,Ended
8,Avatar: The Last Airbender,8.9,"[Action, Adventure, Fantasy]",Nickelodeon,2005-02-21,Ended
9,My Brilliant Friend,7.6,[Drama],Rai 1,2018-11-18,Running


In [29]:
# Let's check for null values present, as well as data types for each column #
shows_df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 226 entries, 0 to 225
Data columns (total 6 columns):
show_name        226 non-null object
rating_avg       203 non-null float64
genres           226 non-null object
network          226 non-null object
premiere_date    226 non-null object
status           226 non-null object
dtypes: float64(1), object(5)
memory usage: 12.4+ KB
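
Both versions above repeat the same try/except block for every field. One possible consolidation is a small helper that walks nested keys and falls back to a default. This is a sketch, not the notebook's method: `dig` is a hypothetical helper, and the payload is made up to resemble the fields used above. Unlike the cells above, it also maps a null leaf value (such as a missing average rating) to the default.

```python
import json

def dig(data, *keys, default='NA'):
    """Follow keys through nested dicts; return default if any step is missing or None."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or current.get(key) is None:
            return default
        current = current[key]
    return current

# Made-up payload shaped like the fields used above; 'network' is null #
payload = json.loads(
    '{"name": "Example Show", "status": "Running", "network": null, '
    '"rating": {"average": 8.4}, "genres": ["Drama"], "premiered": "2017-03-17"}'
)

row = {
    'show_name': dig(payload, 'name'),
    'rating_avg': dig(payload, 'rating', 'average'),
    'genres': dig(payload, 'genres'),
    'network': dig(payload, 'network', 'name'),
    'premiere_date': dig(payload, 'premiered'),
    'status': dig(payload, 'status'),
}
print(row)
```

Rows built this way could be collected in a list and passed to pd.DataFrame once at the end, which avoids growing the DataFrame one row at a time.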


Let's continue with NLP using the Twitter API. Here, we'll create a method to pull a list of tweets from the Twitter API and attempt to classify whether a tweet comes from Sanders or Trump. We will begin with the Twitter API key setup.

In [30]:
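# Let's load the keys from environment variables rather than hard-coding credentials in the notebook #
# (the variable names below are placeholders; use whichever names you export) #
import os

twitter_keys = {
    'consumer_key' : os.environ['TWITTER_CONSUMER_KEY'],
    'consumer_secret' : os.environ['TWITTER_CONSUMER_SECRET'],
    'access_token_key' : os.environ['TWITTER_ACCESS_TOKEN_KEY'],
    'access_token_secret' : os.environ['TWITTER_ACCESS_TOKEN_SECRET']
}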

In [31]:
# Let's create a class to make requests and eventually transform the JSON responses into DataFrames #
api = twitter.Api(
    consumer_key = twitter_keys['consumer_key'],
    consumer_secret = twitter_keys['consumer_secret'],
    access_token_key = twitter_keys['access_token_key'],
    access_token_secret = twitter_keys['access_token_secret']
)

class TweetMiner(object):
    result_limit = 20    
    api = False
    data = []
    
    def __init__(self, keys_dict, api, result_limit = 20):
        self.api = api
        self.twitter_keys = keys_dict
        self.result_limit = result_limit
        
    def mine_user_tweets(self, user='DAndaluz', mine_retweets=False, max_pages=5):
        data = []
        last_tweet_id = False
        page = 1
        
        while page <= max_pages:

            if last_tweet_id:
                statuses = self.api.GetUserTimeline(screen_name=user, count=self.result_limit, max_id=last_tweet_id - 1)        
            else:
                statuses = self.api.GetUserTimeline(screen_name=user, count=self.result_limit)
                
            for item in statuses:
                mined = {
                    'tweet_id' : item.id,
                    'handle' : item.user.name,
                    'retweet_count' : item.retweet_count,
                    'text' : item.text,
                    'mined_at' : datetime.datetime.now(),
                    'created_at' : item.created_at,
                }
            
                last_tweet_id = item.id
                data.append(mined)
                
            page += 1
            
        return data
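
The paging logic in mine_user_tweets can be seen in isolation with a stand-in for the API. In the sketch below, `fake_timeline` and `mine_ids` are hypothetical names, and the tweet IDs are made up. It relies on the same idea as the class: `max_id` is inclusive, so each request asks for IDs up to the last one seen minus one. As a small variation on the class above, it also stops early once a page comes back empty.

```python
def fake_timeline(count, max_id=None):
    """Stand-in for a timeline endpoint: up to count IDs, newest first, none above max_id."""
    all_ids = [110, 108, 105, 103, 101]
    eligible = [i for i in all_ids if max_id is None or i <= max_id]
    return eligible[:count]

def mine_ids(fetch, count=2, max_pages=5):
    collected = []
    last_id = None

    for _ in range(max_pages):
        if last_id is None:
            page = fetch(count=count)
        else:
            # max_id is inclusive, so subtract one to avoid repeating the last tweet #
            page = fetch(count=count, max_id=last_id - 1)

        # Let's stop as soon as the timeline is exhausted #
        if not page:
            break

        collected.extend(page)
        last_id = page[-1]

    return collected

print(mine_ids(fake_timeline))
```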

In [32]:
# Let's instantiate the class #
miner = TweetMiner(twitter_keys, api, result_limit=2)

In [33]:
# Let's mine the tweets from two Twitter users: Sanders and Trump #
sanders = miner.mine_user_tweets(user="berniesanders", max_pages=5)
trump = miner.mine_user_tweets(user="realDonaldTrump", max_pages=5)

In [34]:
# Let's check out the first tweet in our list for Sanders #
print(sanders[0])

{'tweet_id': 1259963746899955715, 'handle': 'Bernie Sanders', 'retweet_count': 258, 'text': 'I find it interesting that now that the coronavirus has hit the White House, Donald Trump is now suddenly a big sup… https://t.co/LcL8mJtTTx', 'mined_at': datetime.datetime(2020, 5, 11, 17, 53, 7, 367511), 'created_at': 'Mon May 11 21:49:00 +0000 2020'}


In [35]:
# Let's check out the first tweet in our list for Trump #
print(trump[0])

{'tweet_id': 1259940503598125057, 'handle': 'Donald J. Trump', 'retweet_count': 4269, 'text': 'RT @WhiteHouse: LIVE: President @realDonaldTrump delivers remarks on testing https://t.co/m2HCbBcA5o', 'mined_at': datetime.datetime(2020, 5, 11, 17, 53, 7, 922836), 'created_at': 'Mon May 11 20:16:39 +0000 2020'}


In [36]:
# Let's convert the tweet outputs to a DataFrame #
pd.DataFrame(sanders).head()

Unnamed: 0,created_at,handle,mined_at,retweet_count,text,tweet_id
0,Mon May 11 21:49:00 +0000 2020,Bernie Sanders,2020-05-11 17:53:07.367511,258,I find it interesting that now that the corona...,1259963746899955715
1,Mon May 11 18:42:27 +0000 2020,Bernie Sanders,2020-05-11 17:53:07.367511,1420,"In April, 26.4 percent of workers lost their j...",1259916797425463296
2,Mon May 11 15:48:34 +0000 2020,Bernie Sanders,2020-05-11 17:53:07.467841,2167,"While working people struggle to survive, the ...",1259873039682076674
3,Mon May 11 15:11:51 +0000 2020,Bernie Sanders,2020-05-11 17:53:07.467841,436,"RT @postlive: Sen. Bernie Sanders says ""health...",1259863800662183941
4,Mon May 11 15:08:59 +0000 2020,Bernie Sanders,2020-05-11 17:53:07.583834,189,Join me live now with the @washingtonpost to d...,1259863079493517312


Next, let's create the training data.

In [37]:
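# Let's instantiate the class and mine the tweets from Sanders and Trump #
# (the API returns at most 200 tweets per request, so five pages yield at most 1,000 tweets per user) #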
miner = TweetMiner(twitter_keys, api, result_limit=400)
sanders_tweets = miner.mine_user_tweets('berniesanders')
trump_tweets = miner.mine_user_tweets('realDonaldTrump')

In [38]:
# Let's create a DataFrame for Sanders #
sanders_df = pd.DataFrame(sanders_tweets) 
print('Rows: %s \nColumns: %s' %(sanders_df.shape[0], sanders_df.shape[1]))

Rows: 1000 
Columns: 6


In [39]:
# Let's create a DataFrame for Trump #
trump_df = pd.DataFrame(trump_tweets) 
print('Rows: %s \nColumns: %s' %(trump_df.shape[0], trump_df.shape[1]))

Rows: 599 
Columns: 6


In [40]:
# Let's concat the two DataFrames #
tweets = pd.concat([sanders_df, trump_df], axis=0)
print('Rows: %s \nColumns: %s' %(tweets.shape[0], tweets.shape[1]))
tweets.sample(10)

Rows: 1599 
Columns: 6


Unnamed: 0,created_at,handle,mined_at,retweet_count,text,tweet_id
393,Thu Apr 30 12:32:50 +0000 2020,Donald J. Trump,2020-05-11 17:53:10.848994,17805,RT @realDonaldTrump: Despite reports to the co...,1255837513559822338
301,Sat May 02 21:25:45 +0000 2020,Donald J. Trump,2020-05-11 17:53:10.848994,5330,RT @Alyssafarah: The Trump Admin is making sur...,1256696404917059585
951,Tue Feb 18 01:36:39 +0000 2020,Bernie Sanders,2020-05-11 17:53:10.077327,7201,"Together, we are going to end the greed of the...",1229580453574823937
418,Thu Apr 30 11:45:41 +0000 2020,Donald J. Trump,2020-05-11 17:53:11.238465,17805,"Despite reports to the contrary, Sweden is pay...",1255825648448348161
185,Fri May 08 20:15:43 +0000 2020,Donald J. Trump,2020-05-11 17:53:10.510074,14981,https://t.co/ykwoWfQfCf,1258853107573891075
356,Fri May 01 04:36:58 +0000 2020,Donald J. Trump,2020-05-11 17:53:10.848994,6884,RT @hughhewitt: The always careful ⁦@EliLake⁩ ...,1256080145053614081
983,Sat Feb 15 23:31:49 +0000 2020,Bernie Sanders,2020-05-11 17:53:10.077327,676,RT @kailanikm: Bernie Sanders is rallying a cr...,1228824262993219584
147,Wed Apr 08 15:20:18 +0000 2020,Bernie Sanders,2020-05-11 17:53:08.757910,4278,Please join me at 11:45 a.m. ET for a special ...,1247907128024666114
175,Sat May 09 15:53:36 +0000 2020,Donald J. Trump,2020-05-11 17:53:10.510074,6667,RT @MikeGarcia2020: 2/The right to vote is sac...,1259149530240598016
270,Wed Mar 25 21:59:56 +0000 2020,Bernie Sanders,2020-05-11 17:53:09.128129,1190,Please join me at 7 p.m. ET tonight to discuss...,1242934267543420930


Let's set up a vectorizer and figure out what the most common ngrams are.

In [41]:
# Let's use the TfidfVectorizer to find ngrams for us #
vect = TfidfVectorizer(ngram_range=(2,4))

# Let's pull all of Sanders' tweet texts into one giant string #
summaries = ''.join(sanders_df['text'])
ngrams_summaries = vect.build_analyzer()(summaries)

Counter(ngrams_summaries).most_common(20)

[('https co', 794),
 ('health care', 64),
 ('on the', 63),
 ('of the', 58),
 ('in the', 57),
 ('to the', 46),
 ('we are', 46),
 ('for all', 39),
 ('we need', 39),
 ('for the', 38),
 ('we have', 38),
 ('if you', 37),
 ('we re', 35),
 ('going to', 35),
 ('we will', 35),
 ('the coronavirus', 31),
 ('we must', 30),
 ('need to', 27),
 ('we can', 26),
 ('and the', 26)]

In [42]:
# Let's pull all of Trump's tweet texts into one giant string #
summaries = ''.join(trump_df['text'])
ngrams_summaries = vect.build_analyzer()(summaries)

Counter(ngrams_summaries).most_common(20)

[('https co', 308),
 ('of the', 41),
 ('in the', 31),
 ('president realdonaldtrump', 27),
 ('fake news', 25),
 ('for the', 24),
 ('thank you', 21),
 ('rt johnwhuber', 20),
 ('rt realdonaldtrump', 15),
 ('will be', 15),
 ('the people', 14),
 ('people of', 12),
 ('to be', 12),
 ('should be', 11),
 ('has been', 11),
 ('on the', 11),
 ('to the', 10),
 ('do nothing', 10),
 ('rt cdcgov', 10),
 ('rt whitehouse', 9)]
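
The top result, 'https co', follows from how the vectorizer's default analyzer tokenizes text: it lowercases, keeps only tokens of two or more word characters, and joins consecutive tokens with spaces. The sketch below re-implements that behavior in plain Python for illustration only; `word_ngrams` is a hypothetical helper, and the sample sentence is made up.

```python
import re
from collections import Counter

def word_ngrams(text, min_n=2, max_n=4):
    """Lowercase, keep tokens of 2+ word characters, and build word n-grams."""
    tokens = re.findall(r'\b\w\w+\b', text.lower())
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[start:start + n]))
    return grams

# Made-up text; the URL splits into 'https', 'co', 'abc123' because the single 't' is dropped #
sample = 'Health care is a right. https://t.co/abc123 Health care for all.'

counts = Counter(word_ngrams(sample))
print(counts.most_common(1))
print(counts['https co'])
```

Since every shortened link contributes one 'https co' bigram, removing URLs before counting would give a cleaner picture of each user's language.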

Now, let's process the tweets and build a model.

In [43]:
# Let's use the textacy package to do some more comprehensive preprocessing #
tweet_text = tweets['text'].values

clean_text = [x.lower() for x in tweet_text]

clean_text = [preprocessing.replace.replace_hashtags(x, replace_with='') for x in clean_text]

clean_text = [preprocessing.replace.replace_urls(x, replace_with='') for x in clean_text]

clean_text = [preprocessing.normalize.normalize_unicode(x) for x in clean_text]

clean_text = [preprocessing.normalize.normalize_whitespace(x) for x in clean_text]

clean_text = [preprocessing.replace.replace_currency_symbols(x, replace_with='') for x in clean_text]

clean_text = [preprocessing.replace.replace_phone_numbers(x, replace_with='') for x in clean_text]

clean_text = [preprocessing.replace.replace_emails(x, replace_with='') for x in clean_text]

clean_text = [preprocessing.replace.replace_emojis(x, replace_with='') for x in clean_text]

clean_text = [preprocessing.remove.remove_accents(x) for x in clean_text]

In [44]:
# Let's check out the first three tweets before preprocessing #
print(tweet_text[0:3])

['I find it interesting that now that the coronavirus has hit the White House, Donald Trump is now suddenly a big sup… https://t.co/LcL8mJtTTx'
 'In April, 26.4 percent of workers lost their jobs or had their hours reduced. \n\nDuring this horrific crisis, we hav… https://t.co/196FDdUzUz'
 'While working people struggle to survive, the rich reach unthinkable levels of wealth.\n\nThe annual cost of chemothe… https://t.co/eQwrFVpC5T']


In [45]:
# Let's check out the first three tweets after preprocessing #
print(clean_text[0:3])

['i find it interesting that now that the coronavirus has hit the white house, donald trump is now suddenly a big sup...', 'in april, 26.4 percent of workers lost their jobs or had their hours reduced. \nduring this horrific crisis, we hav...', 'while working people struggle to survive, the rich reach unthinkable levels of wealth.\nthe annual cost of chemothe...']


In [46]:
# Let's make the user handle our target (Sanders will be 0 and Trump will be 1) #

y = tweets['handle'].map(lambda x: 0 if x == 'Bernie Sanders' else 1).values

print('Target mean (baseline):', np.mean(y))

Target mean (baseline): 0.37460913070669166


In [47]:
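# Let's convert our text data to tfidf features #
# (.toarray() returns a plain ndarray; newer scikit-learn versions reject the np.matrix that .todense() returns) #
tfv = TfidfVectorizer(ngram_range=(1,4), max_features=2000)
X = tfv.fit_transform(clean_text).toarray()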

print('Rows: %s \nColumns: %s' %(X.shape[0], X.shape[1]))

Rows: 1599 
Columns: 2000


In [48]:
# Let's cross-validate the accuracy #
accuracies = cross_val_score(LogisticRegression(solver='lbfgs'), X, y, cv=10)

print('List of accuracies:', accuracies)
print('\nMean of accuracies:', np.mean(accuracies))

# Let's instantiate and fit our logistic regression model #
estimator = LogisticRegression(solver='lbfgs')
estimator.fit(X, y)

List of accuracies: [0.925      0.9        0.85625    0.8125     0.8375     0.7875
 0.86875    0.84375    0.83125    0.86163522]

Mean of accuracies: 0.8524135220125787


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

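This is a good accuracy considering the baseline: always predicting the majority class (Sanders) would be correct about 63% of the time (1 - 0.375), whereas the cross-validated accuracy is about 85%.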
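
The comparison can be worked out from the row counts and the mean accuracy printed earlier. A short sketch of the arithmetic (the variable names are illustrative):

```python
# Row counts and mean accuracy as printed in the outputs above #
sanders_rows = 1000
trump_rows = 599
cv_mean = 0.8524135220125787

total = sanders_rows + trump_rows

# The target mean is the share of tweets labelled 1 (Trump) #
target_mean = trump_rows / total

# Always predicting the larger class gives the majority-class baseline #
majority_baseline = max(sanders_rows, trump_rows) / total

print('Target mean:', round(target_mean, 4))
print('Majority baseline:', round(majority_baseline, 4))
print('Improvement over baseline:', round(cv_mean - majority_baseline, 4))
```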

Let's check the predicted probability for a random tweet from both Sanders and Trump.

In [49]:
# Let's pick one random tweet from each user and transform both into tfidf vectors #
source_test = [sanders_df['text'][np.random.choice(len(sanders_df['text']))],
               trump_df['text'][np.random.choice(len(trump_df['text']))]]

Xtest = tfv.transform(source_test)

# Let's predict using the previously trained model #
estimator.predict_proba(Xtest)

array([[0.76465727, 0.23534273],
       [0.23585026, 0.76414974]])

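We can see above that our classifier predicts both tweets correctly. The first column is the probability that the tweet is from Sanders (class 0), and the second is the probability that it is from Trump (class 1). Note that both tweets were drawn from the training data and were not passed through the preprocessing steps above, so this is a sanity check rather than an evaluation on unseen tweets.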