![StatModels](https://www.durhamtech.edu/themes/custom/durhamtech/images/durham-tech-logo-web.svg) 

## Web Scraping

This lecture provides foundational knowledge and examples of scraping data from live websites.

---

### Set Up
1.	Ensure you have a stable internet connection



### Needed Packages
1.	pandas
2.  selenium
3.  bs4
4.  requests
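5.  random (standard library)
6.  time (standard library)
7.  lxml (parser backend for the `'lxml'` and `'xml'` BeautifulSoup calls used later)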
---

# Table of Contents

### The Basics
#### <a href='#1'>What is web scraping and why do you need it?</a>
#### <a href='#2'>Needed Packages</a>
#### <a href='#3'>Scraping ESPN</a>
#### <a href='#4'>A note on anonymous web scraping and browsing</a>

### Advanced Methods
#### <a href='#5'>Creating an email address through python only</a>
#### <a href='#6'>Download Consumer Spending Data</a>
#### <a href='#7'>Scrape current US National Debt</a>

### Additional Use cases
#### <a href='#8'>Scrape images and other files</a>
#### <a href='#9'>Scraping function to download files of any type from a website</a>
#### <a href='#10'>Scrape funny coffee pictures</a>
#### <a href='#11'>Scrape Bloomberg sitemap (XML) for current political news</a>
#### <a href='#12'>Web crawl Twitter account</a>

### Practice
#### <a href='#13'>Exercise Set 1</a>
#### <a href='#14'>Exercise Set 2</a>
#### <a href='#15'>Exercise Set 3</a>
#### <a href='#16'>Exercise Set 4</a>

<a id='1'></a>
## What is web scraping and why do you need it?
Web scraping is a catch-all term for gathering information directly from web pages on the internet.  Unlike structured APIs and databases, web scraping requires you to get creative in how you collect the data, as no two projects are the same.  As an important addition to web scraping, this lecture also covers basic use of the selenium package, which lets you traverse and interact with the vast majority of modern webpages, ranging from obscure pages all the way to Facebook.

In general, web scraping should be your last resort when looking to automate a data gathering exercise: websites are prone to change over time, many web hosts do not appreciate web scraping, and it is generally much slower than working with prebuilt data sources.

<a id='2'></a>
## Needed Packages

In [10]:
import pandas as pd
import warnings
import selenium
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
import time
from random import randint
import requests
warnings.filterwarnings('ignore')

<a id='3'></a>
## Scraping ESPN

In [12]:
##define website url
url = 'https://www.espn.com/nba/boxscore/_/gameId/401360104'

In [13]:
##Open browser
browser = webdriver.Chrome(executable_path = 'chromedriver.exe')
browser.get(url)

In [14]:
##Close browser
browser.close()

In [16]:
##Open browser without actually seeing it, also known as 'headless' browsing
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
browser = webdriver.Chrome(executable_path = 'chromedriver.exe',chrome_options=chrome_options)
browser.get(url)


In [17]:
##Grab the raw webpage
innerHTML = browser.execute_script("return document.body.innerHTML")

In [18]:
##Parse raw webpage and quit the browser (quit() also ends the chromedriver process. Don't want to gunk up your RAM!!)
soup = BeautifulSoup(innerHTML,"html.parser")
browser.quit()

In [19]:
##Find all tables on page
tables = soup.findAll("table")


In [36]:
##Grab Stats for all players on both teams & convert to dataframe
df = []
for table in tables:
    for row in table.findAll("tr"):
        cells = row.findAll("td")
        cells = [ele.text.strip() for ele in cells]
        if len(cells) == 15 and cells[0] != '' and cells[0] != 'TEAM':
            df.append(cells[0:3])
df = pd.DataFrame(df, columns = ['Player','Minutes Played','FG'])
df.head()

| | Player | Minutes Played | FG |
|---|---|---|---|
| 0 | J. GrantJ. GrantSF | 33 | 5-16 |
| 1 | S. BeyS. BeySF | 23 | 4-12 |
| 2 | I. StewartI. StewartC | 27 | 2-6 |
| 3 | C. JosephC. JosephPG | 25 | 3-6 |
| 4 | C. CunninghamC. CunninghamPG | 28 | 3-13 |
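
The `Player` column above shows each name twice, followed by the position (for example `J. GrantJ. GrantSF`). One way to tidy it, sketched below under the assumption that every cell follows that repeated-name pattern, is to look for the longest prefix that is immediately repeated. The helper name `split_player_cell` is hypothetical and not part of the lecture code.

```python
def split_player_cell(cell):
    """Split a cell such as 'J. GrantJ. GrantSF' into (name, position).

    Assumes the cell is the player's name written twice, followed by a
    position code. Returns (cell, '') when no repetition is found.
    """
    # Try the longest possible repeated prefix first.
    for k in range(len(cell) // 2, 0, -1):
        if cell[:k] == cell[k:2 * k]:
            return cell[:k], cell[2 * k:]
    return cell, ''


for raw in ['J. GrantJ. GrantSF', 'I. StewartI. StewartC']:
    print(split_player_cell(raw))
```

On the dataframe, something like `df['Player'].map(split_player_cell)` could then be used to build separate name and position columns.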


<a id='4'></a>
## A note on anonymous web scraping and browsing
When web scraping, many websites will blacklist your IP address in an effort to prevent you from abusing their sites.  If you are scraping on the job because your employer needs the data, an IP ban is not something you want ruining your day.  A common workaround is to use a proxy server so that the website doesn't see your actual IP address.  The code below gives a framework for putting a proxy server between you and the internet calls you make with Python.

In [78]:
##Proxy server list
## 'https://proxydig.com/free-proxy-list/'

In [5]:
##Connect to a website using a proxy server
PROXY = '209.165.163.187:3128'  ##Note that if you cannot connect to a webpage, try using a different proxy server from the site above
url = 'https://whatismyipaddress.com/'
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=%s' % PROXY)
browser = webdriver.Chrome(executable_path = 'chromedriver.exe',chrome_options=chrome_options)
browser.get(url)

##NOTE: Using proxy servers will avoid most common website blacklisting BUT it is not a substitute for a VPN for security, NOR
##should you attempt to use proxy servers for nefarious activity

While proxy servers enable you to avoid blacklisting, they leave you exposed from a security standpoint.  There are two common methods to add a layer of security.  The first is to always turn on a VPN before doing any web scraping; this ensures that even if someone tries to trace back to the original IP, only the VPN's address will appear.  Alternatively, Selenium allows you to use the Tor browser to make internet calls.  This method is generally extremely slow and unnecessary, but it is a fun exercise if you have a couple of hours to spare!
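
Spacing requests out is another common way to lower the odds of being blacklisted, and it is a natural use for the `time` and `randint` imports listed earlier. The sketch below is only an illustration: the function name `polite_pause` and the default bounds are assumptions, not values from the lecture.

```python
import time
from random import randint


def polite_pause(min_seconds=2, max_seconds=6, sleep=time.sleep):
    """Sleep for a random whole number of seconds and return the delay.

    min_seconds and max_seconds are illustrative defaults (assumptions).
    `sleep` can be swapped out, for example when testing.
    """
    delay = randint(min_seconds, max_seconds)  # inclusive on both ends
    sleep(delay)
    return delay


# Example: pause after each page load (the URLs are placeholders).
for page in ['https://example.com/a', 'https://example.com/b']:
    # browser.get(page) or requests.get(page) would go here
    waited = polite_pause(0, 1)
    print(f'{page}: waited {waited}s')
```

Randomizing the wait, rather than sleeping a fixed interval, makes the traffic look less mechanical, though no delay guarantees a site will tolerate scraping.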

<a id='13'></a>
## -------------PRACTICE-------------

1. Create a function that will open and return a headless web browser. Think about which arguments would be helpful to pass to the function.

2. Go to: https://www.macrotrends.net/1333/historical-gold-prices-100-year-chart and scrape the table titled 'Gold Prices - Historical Annual Data'

3.  Put the data from question 2 in a pandas dataframe.

<a id='5'></a>
## Creating an email address through python only


In [28]:
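##Open webpage (with fresh options, so settings from an earlier example such as the proxy are not reused)
url = 'https://mail.tutanota.com/login'
chrome_options = webdriver.ChromeOptions()
browser = webdriver.Chrome(executable_path = 'chromedriver.exe',chrome_options=chrome_options)
browser.get(url)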

In [29]:
##Select 'More' so we can create an account
more_button = browser.find_element_by_xpath('//*[@id="login-view"]/div[2]/div/div[3]/div/button')
more_button.click()

In [31]:
##Select 'Sign Up' so we can create an account
sign_up_button = browser.find_element_by_xpath('//*[@id="login-view"]/div[2]/div/div[4]/div/div/div/button[1]/div')
sign_up_button.click()

In [32]:
##Select the Free option
free_button = browser.find_element_by_xpath('//*[@id="upgrade-account-dialog"]/div[2]/div[1]/div[1]/div[5]/button/div')
free_button.click()

In [33]:
##Agree to terms
term_1 = browser.find_element_by_xpath('//*[@id="modal"]/div[2]/div/div/div/div[2]/div[1]/div/input')
term_1.click()
term_2 = browser.find_element_by_xpath('//*[@id="modal"]/div[2]/div/div/div/div[2]/div[2]/div/input')
term_2.click()
ok_button = browser.find_element_by_xpath('//*[@id="modal"]/div[2]/div/div/div/div[3]/button[2]/div')
ok_button.click()


In [37]:
##Add Account Info
email_add = browser.find_element_by_xpath('//*[@id="signup-account-dialog"]/div/div[1]/div/div/div/div[1]/input')
email_add.click()
email_add.send_keys('test_user_abc_123') ###You will need to put in a new username!

password = browser.find_element_by_xpath('//*[@id="signup-account-dialog"]/div/div[2]/div[1]/div/div/div/div[1]/input[4]')
password.click()
password.send_keys('Sample_password!')

sec_password = browser.find_element_by_xpath('//*[@id="signup-account-dialog"]/div/div[2]/div[3]/div/div/div/div/input')
sec_password.click()
sec_password.send_keys('Sample_password!')


In [38]:
##Agree to terms 2
term_1 = browser.find_element_by_xpath('//*[@id="signup-account-dialog"]/div/div[3]/div/input')
term_1.click()
term_2 = browser.find_element_by_xpath('//*[@id="signup-account-dialog"]/div/div[4]/div/input')
term_2.click()
ok_button = browser.find_element_by_xpath('//*[@id="signup-account-dialog"]/div/div[5]/button/div')
ok_button.click()

In [39]:
##Continue
ok_button = browser.find_element_by_xpath('//*[@id="wizardDialogContent"]/div[4]/div/button/div')
ok_button.click()

In [40]:
##Type in Password
password_input = browser.find_element_by_xpath('//*[@id="login-view"]/div[2]/div/div[1]/form/div[2]/div/div/div/div/div/input')
password_input.click()
password_input.send_keys('Sample_password!')


In [41]:
##Login
login_button = browser.find_element_by_xpath('//*[@id="login-view"]/div[2]/div/div[1]/form/div[4]/button/div')
login_button.click()

<a id='14'></a>
## -------------PRACTICE-------------

1. Login to your new email account.

2. Click the button to send an email.

3. Send yourself an email.

<a id='6'></a>
## Download Consumer Spending Data

In [48]:
##Open Webpage and make full screen
url = 'https://www.bea.gov/'
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--start-maximized")
browser = webdriver.Chrome(executable_path = 'chromedriver.exe',chrome_options=chrome_options)
browser.get(url)

In [49]:
##Navigate to Data section
data = browser.find_element_by_link_text('Data')
data.click()

In [51]:
##Navigate to Data by Topic section
data = browser.find_element_by_partial_link_text('Data by Topic')
data.click()

In [52]:
##Navigate to Consumer Spending Section
data = browser.find_element_by_link_text('Consumer Spending')
data.click()

In [53]:
##Navigate to next Consumer Spending Section
data = browser.find_element_by_xpath('//*[@id="test"]/div[2]/article/div/div/div/ul/li[1]/a')
data.click()

In [54]:
##Navigate to Interactive Data
data = browser.find_element_by_link_text('Interactive Data')
data.click()

In [55]:
##Navigate to Summary Tables
data = browser.find_element_by_link_text('Summary Tables')
data.click()

In [56]:
##Navigate to Personal Income and Outlays
data = browser.find_element_by_partial_link_text('PERSONAL INCOME AND OUTLAYS')
data.click()

In [57]:
##Navigate to table 2.2A
data = browser.find_element_by_partial_link_text('2.2A')
data.click()

In [59]:
##Download Data
data = browser.find_element_by_xpath('//*[@id="showDownload"]')
data.click()

In [60]:
##Select Download Format
data = browser.find_element_by_xpath('//*[@id="download_wraper"]/div/a[2]')
data.click()
browser.close()

<a id='15'></a>
## -------------PRACTICE-------------

1. Navigate to a different data source.

2. Download a different data set than what was downloaded above.

3. Read the new data set into a pandas dataframe.

<a id='7'></a>
## Scrape current US National Debt

In [5]:
##Open Webpage in a headless browser
url = 'https://www.usdebtclock.org/'
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
browser = webdriver.Chrome(executable_path = 'chromedriver.exe',chrome_options=chrome_options)
browser.get(url)

In [6]:
##Grab the raw webpage
innerHTML = browser.execute_script("return document.body.innerHTML")

In [7]:
##Parse raw webpage and quit the browser (quit() also ends the chromedriver process. Don't want to gunk up your RAM!!)
soup = BeautifulSoup(innerHTML,"html.parser")
browser.quit()

In [8]:
##Scrape the first number showing the total debt
first_span = soup.select_one("div span")  # first <span> nested inside any <div>
if first_span is not None:
    print(first_span.text.strip())

$29,699,434,196,803
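
The scraped value is a string, so it usually needs converting before any arithmetic. A minimal sketch, assuming the text always looks like the output above (a dollar sign followed by comma-separated digits). The helper name `parse_dollar_amount` is hypothetical.

```python
def parse_dollar_amount(text):
    """Convert a string such as '$29,699,434,196,803' to an int."""
    return int(text.strip().replace('$', '').replace(',', ''))


debt = parse_dollar_amount('$29,699,434,196,803')
print(debt)
```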


<a id='8'></a>
## Scrape images and other files
Let's see how we can automatically find and download files linked on a website.

The data you need for your projects might not always be raw data; it may come in the form of files (images, .txt files, etc.).

In [12]:
# As we can see there are two images on the data-x.blog/resources
# say that we want to download them
# Images are displayed with the <img> tag in HTML

# open connection and create new soup

raw = requests.get('https://data-x.blog/resources/').content
soup = BeautifulSoup(raw,features='html.parser')

print(soup.find('img')) 
# as we can see below the image urls 
# are stored in the src attribute inside the img tag

<img alt="WordPress.com" height="45" src="//s1.wp.com/wp-content/themes/h4/i/logo-h-rgb.png" width="180"/>


In [13]:
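# Parse all urls to the images
from urllib.parse import urljoin

img_urls = list()
for img in soup.find_all('img'):
    img_url = img.get('src') or ''  # some <img> tags have no src attribute
    if '.jpeg' in img_url or '.jpg' in img_url:
        # resolve relative and protocol-relative (//host/...) links
        img_url = urljoin('https://data-x.blog/resources/', img_url)
        print(img_url)
        img_urls.append(img_url)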

In [14]:
## Let's look at what our current file directory looks like

%ls

 Volume in drive C is Windows
 Volume Serial Number is 8066-06E8

 Directory of C:\Users\bensm\Desktop\Durham\Lectures\WebScraping

01/06/2022  07:29 PM    <DIR>          .
01/06/2022  07:29 PM    <DIR>          ..
11/26/2021  06:18 PM    <DIR>          .ipynb_checkpoints
09/13/2018  01:58 PM         6,702,592 chromedriver.exe
11/14/2021  08:28 PM           310,346 Data Sources.pptx
11/26/2021  07:36 PM             5,554 Free_Proxy_List.csv
11/14/2021  08:28 PM             2,825 Template.ipynb
01/06/2022  07:29 PM            23,506 Web_Scraping.ipynb
               5 File(s)      7,044,823 bytes
               3 Dir(s)  126,389,366,784 bytes free


In [15]:
# To download and save files with Python we can use 
# the shutil library which is a file operations library
'''
The shutil module offers a number of high-level operations on files and 
collections of files. In particular, functions are provided which support 
file copying and removal.
'''

import shutil

for idx, img_url in enumerate(img_urls): 
    #enumerate to create an integer file name for every image
    
    # make a request to the image URL
    img_source = requests.get(img_url, stream=True) 
    # we set stream = True to download/ 
    # stream the content of the data
    
    with open('img'+str(idx)+'.jpg', 'wb') as file: 
        # open file connection, create file and write to it
        shutil.copyfileobj(img_source.raw, file) 
        # save the raw file object

    del img_source # to remove the file from memory

In [16]:
## Let's see if the file has been saved
%ls

 Volume in drive C is Windows
 Volume Serial Number is 8066-06E8

 Directory of C:\Users\bensm\Desktop\Durham\Lectures\WebScraping

01/06/2022  07:29 PM    <DIR>          .
01/06/2022  07:29 PM    <DIR>          ..
11/26/2021  06:18 PM    <DIR>          .ipynb_checkpoints
09/13/2018  01:58 PM         6,702,592 chromedriver.exe
11/14/2021  08:28 PM           310,346 Data Sources.pptx
11/26/2021  07:36 PM             5,554 Free_Proxy_List.csv
11/14/2021  08:28 PM             2,825 Template.ipynb
01/06/2022  07:29 PM            23,506 Web_Scraping.ipynb
               5 File(s)      7,044,823 bytes
               3 Dir(s)  126,388,563,968 bytes free


<a id='9'></a>
## Scraping function to download files of any type from a website
Below is a function that takes a website URL and a file type, then downloads up to a given number of matching files from that site.

In [24]:
# Extended scraping function of any file format
import os # To interact with operating system and format file name
import shutil # To copy file object from python to disk
import requests
import bs4 as bs

def py_file_scraper(url, html_tag='img', source_tag='src', file_type='.jpg',max=-1):
    
    '''
    Function that scrapes a website for certain file formats.
    The files will be placed in a folder called "files" 
    in the working directory.
    
    url = the url we want to scrape from
    html_tag = the file tag (usually img for images or 
    a for file links)
    
    source_tag = the source tag for the file url 
    (usually src for images or href for files)
    
    file_type = .png, .jpg, .pdf, .csv, .xls etc.
    
    max = integer (max number of files to scrape, 
    if = -1 it will scrape all files)
    '''
    
    # make a directory called 'files' 
    # for the files if it does not exist
    if not os.path.exists('files/'):
        os.makedirs('files/')
    print('Loading content from the url...')
    source = requests.get(url).content
    print('Creating content soup...')
    soup = bs.BeautifulSoup(source,'html.parser')
    
    i=0
    print('Finding tag:%s...'%html_tag)
    for n, link in enumerate(soup.find_all(html_tag)):
        file_url = link.get(source_tag) or ''  # the tag may lack this attribute
        print ('\n',n+1,'. File url',file_url)
        
        
        if file_url.startswith('http'): # only absolute http(s) links can be downloaded
            print('It is a valid url..')
            
            
            if file_type in file_url: #only check for specific 
                # file type
                
                print('%s FILE TYPE FOUND IN THE URL...'%file_type)
                file_name = os.path.splitext(os.path.basename(file_url))[0] + file_type 
                #extract file name from url

                file_source = requests.get(file_url, stream = True)
             
                # open new stream connection

                with open('./files/'+file_name, 'wb') as file: 
                    # open file connection, create file and 
                    # write to it
                    
                    shutil.copyfileobj(file_source.raw, file) 
                    # save the raw file object
                    
                    print('DOWNLOADED:',file_name)
                    
                    i+=1
                    
                del file_source # delete from memory
            else:
                print('%s file type NOT found in url:'%file_type)
                print('EXCLUDED:',file_url) 
                # urls not downloaded from
                
        if i == max:
            print('Max reached')
            break
            

    print('Done!')
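
The function above checks the file type with a substring test and builds the file name from the whole url, query string included. A slightly stricter alternative, sketched here with only the standard library, is to parse the url first. The helper names `filename_from_url` and `has_file_type` are hypothetical and not part of `py_file_scraper`.

```python
import posixpath
from urllib.parse import unquote, urlparse


def filename_from_url(file_url):
    """Return the last path segment of a url, ignoring query and fragment."""
    path = urlparse(file_url).path  # drops '?fit=...' and '#...'
    return unquote(posixpath.basename(path))


def has_file_type(file_url, file_type):
    """True when the url's file name ends with file_type (case-insensitive)."""
    extension = posixpath.splitext(filename_from_url(file_url))[1]
    return extension.lower() == file_type.lower()


url = ('https://i1.wp.com/goldcoffee.com/wp-content/uploads/2021/05/'
       'coffee-beans-cup-rustic-wood-scaled.jpg?fit=1024%2C683&ssl=1')
print(filename_from_url(url))
print(has_file_type(url, '.jpg'))
```

Comparing the real extension avoids matching urls that merely contain `.jpg` somewhere in a query parameter or directory name.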

<a id='10'></a>
## Scrape funny coffee pictures

In [26]:
py_file_scraper('https://goldcoffee.com/dark/') 
# scrape coffee

Loading content from the url...
Creating content soup...
Finding tag:img...

 1 . File url https://www.facebook.com/tr?id=2963138980640409&ev=PageView

&noscript=1
It is a valid url..
.jpg file type NOT found in url:
EXCLUDED: https://www.facebook.com/tr?id=2963138980640409&ev=PageView

&noscript=1

 2 . File url https://goldcoffee.com/wp-content/uploads/2021/04/cropped-Asset-1.png
It is a valid url..
.jpg file type NOT found in url:
EXCLUDED: https://goldcoffee.com/wp-content/uploads/2021/04/cropped-Asset-1.png

 3 . File url https://i1.wp.com/goldcoffee.com/wp-content/uploads/2021/05/coffee-beans-cup-rustic-wood-scaled.jpg?fit=1024%2C683&ssl=1
It is a valid url..
.jpg FILE TYPE FOUND IN THE URL...
DOWNLOADED: coffee-beans-cup-rustic-wood-scaled.jpg

 4 . File url https://goldcoffee.com/wp-content/uploads/2021/04/cropped-Asset-1.png
It is a valid url..
.jpg file type NOT found in url:
EXCLUDED: https://goldcoffee.com/wp-content/uploads/2021/04/cropped-Asset-1.png

 5 . File url https:

<a id='11'></a>
## Scrape Bloomberg sitemap (XML) for current political news

In [27]:
# Sitemaps are XML documents that list a site's urls between tags.
# XML is both human and machine readable.
# To get a site's newest links, find its sitemap!
# News websites often have topic sitemaps (e.g. politics); bots that
# constantly track the news track those sitemaps.

# Before scraping a website look at its robots.txt file
bs.BeautifulSoup(requests.get('https://www.bloomberg.com/robots.txt').content,'lxml')

<html><body><p># Bot rules:
# 1. A bot may not injure a human being or, through inaction, allow a human being to come to harm.
# 2. A bot must obey orders given it by human beings except where such orders would conflict with the First Law.
# 3. A bot must protect its own existence as long as such protection does not conflict with the First or Second Law.
# If you can read this then you should apply here https://www.bloomberg.com/careers/
User-agent: *
Disallow: /polska
Disallow: /account/*

User-agent: Mediapartners-Google
Disallow: /about/careers
Disallow: /about/careers/
Disallow: /offlinemessage/
Disallow: /apps/fbk
Disallow: /bb/newsarchive/
Disallow: /apps/news
Disallow: /search

User-agent: Spinn3r
Disallow: /podcasts/
Disallow: /feed/podcast/
Disallow: /bb/avfile/

User-agent: Googlebot-News
Disallow: /sponsor/
Disallow: /news/sponsors/*
Disallow: /news/terminal/*

Sitemap: https://www.bloomberg.com/sitemap.xml
Sitemap: https://www.bloomberg.com/feeds/bbiz/sitemap_index.xml
Site
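
Rather than reading the rules by eye, Python's standard library can check a url against them. The sketch below parses a few rules copied from the output above and is only an illustration; the wrapper name `is_allowed` is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# A few rules copied from the Bloomberg robots.txt output above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /polska
Disallow: /account/*
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent='*'):
    """Return True when the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(is_allowed('https://www.bloomberg.com/polska'))
print(is_allowed('https://www.bloomberg.com/feeds/bpol/sitemap_news.xml'))
```

To check against the live file instead, `RobotFileParser` also offers `set_url(...)` followed by `read()`. Note that this parser matches rules by path prefix, so wildcard patterns such as `/account/*` may not be interpreted the way some crawlers read them.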

In [28]:
source = requests.get('https://www.bloomberg.com/feeds/bpol/sitemap_news.xml').content
soup = bs.BeautifulSoup(source,'xml') # Note parser 'xml'

In [29]:
print(soup.prettify())

<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
 <url>
  <loc>
   https://www.bloomberg.com/news/articles/2022-01-06/illinois-governor-asks-biden-for-covid-tests-for-chicago-schools
  </loc>
  <news:news>
   <news:publication>
    <news:name>
     Bloomberg
    </news:name>
    <news:language>
     en
    </news:language>
   </news:publication>
   <news:publication_date>
    2022-01-07T00:06:19.403Z
   </news:publication_date>
   <news:title>
    Illinois Asks Biden for Covid Tests for Chicago Schools
   </news:title>
   <news:keywords/>
   <news:stock_tickers/>
  </news:news>
  <image:image>
   <image:loc>
    https://assets.bwbx.io/images/users/iqjWHBFdfxIU/iE36oF.uh.MA/v1/1200x-1.jpg
   </image:loc>
   <image:license>
    https://www.bloomberg.com/tos
   </image:license>
  </image:image>
 </url>
 <url>
  <loc>
 

In [30]:
# Find political news headlines
for news in soup.find_all({'news'}):
    print(news.title.text)
    print(news.publication_date.text)
    #print(news.keywords.text)
    print('\n')

Illinois Asks Biden for Covid Tests for Chicago Schools
2022-01-07T00:06:19.403Z


Ex-New York Time Columnist Nicholas Kristof Ineligible to Run for Oregon Governor
2022-01-06T23:18:49.556Z


U.S. to Revise Singapore Travel Advisory After Covid Data Flap
2022-01-06T23:15:53.982Z


Chicago ICUs Fill; Alaska Airlines Cuts Flights: Virus Update
2022-01-06T22:59:28.882Z


Brazil’s Omicron Toll Begins to Show Even Amid Data Blackout
2022-01-06T22:22:01.681Z


Conservatives' Plan to Regain Power Unravels in South Korea
2022-01-06T21:00:05.631Z


Cambodian Prime Minister Is First Foreign Leader to Visit Myanmar After Coup
2022-01-06T21:00:00.008Z


China’s Local Governments Give Early Hints of 2022 GDP Target
2022-01-06T21:00:00.006Z


New York Man Charged With Acting as an Egyptian Government Agent
2022-01-06T20:32:19.514Z


Jan. 6’s Wounds in Congress Run Deep, and Trump Keeps Them Fresh
2022-01-06T20:31:17.602Z


IMF Critiqued for Dropping Tough Loan Standards for Covid Funds
2022-01-06T20
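
For comparison, the same headline extraction can be sketched with the standard library's `xml.etree.ElementTree`, which needs the namespaces spelled out. The namespace URIs below are the ones declared in the sitemap output above; the function name `parse_news_sitemap` and the trimmed `SAMPLE` document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Namespaces as declared on <urlset> in the sitemap output above.
NS = {
    'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9',
    'news': 'http://www.google.com/schemas/sitemap-news/0.9',
}

# One <url> entry copied from the output above (trimmed for brevity).
SAMPLE = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
 <url>
  <loc>https://www.bloomberg.com/news/articles/2022-01-06/illinois-governor-asks-biden-for-covid-tests-for-chicago-schools</loc>
  <news:news>
   <news:publication_date>2022-01-07T00:06:19.403Z</news:publication_date>
   <news:title>Illinois Asks Biden for Covid Tests for Chicago Schools</news:title>
  </news:news>
 </url>
</urlset>
"""


def parse_news_sitemap(xml_text):
    """Return a list of dicts with url, title and date for each <url> entry."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for url in root.findall('sm:url', NS):
        rows.append({
            'url': url.findtext('sm:loc', default='', namespaces=NS).strip(),
            'title': url.findtext('news:news/news:title', default='', namespaces=NS).strip(),
            'date': url.findtext('news:news/news:publication_date', default='', namespaces=NS).strip(),
        })
    return rows


for row in parse_news_sitemap(SAMPLE):
    print(row['title'], row['date'])
```

A list of dicts like this can be passed straight to `pd.DataFrame(...)` if a table is more convenient.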

<a id='12'></a>
## Web crawl Twitter account

In [31]:
# Helper function to maintain the urls and the number of times they appear

url_dict = dict()

def add_to_dict(url_d, key):
    if key in url_d:
        url_d[key] = url_d[key] + 1
    else:
        url_d[key] = 1

In [32]:
# Recursive function which extracts links from the given url up to a given 'depth'.

def get_urls(url, depth):
    if depth == 0:
        return
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    for link in soup.find_all('a'):
        if link.has_attr('href') and "https://" in link['href']:
#             print(link['href'])
            add_to_dict(url_dict, link['href'])
            get_urls(link['href'], depth - 1)

In [33]:
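# Iterative function which extracts links starting from the given url. Note that 'depth' here
# caps the number of urls collected (the loop stops once len(urls) > depth), not the link depth.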

def get_urls_iterative(url, depth):
    urls = [url]
    for url in urls:
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        for link in soup.find_all('a'):
            if link.has_attr('href') and "https://" in link['href']:
                add_to_dict(url_dict, link['href'])
                urls.append(link['href'])
        if len(urls) > depth:
            break
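
An iterative crawl that respects an actual link depth, and avoids fetching the same page twice, can be sketched as a breadth-first search. This is an illustration rather than the lecture's method: `crawl` and `get_links` are hypothetical names, and `get_links` stands in for the requests/BeautifulSoup link extraction so the example runs without network access.

```python
from collections import Counter, deque


def crawl(start_url, max_depth, get_links):
    """Breadth-first crawl that fetches pages fewer than max_depth hops away.

    get_links is any callable mapping a url to the list of urls found on
    that page (a stand-in for real fetching and parsing).
    """
    counts = Counter()         # how many times each link was seen
    visited = {start_url}      # pages already fetched or queued
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue           # do not fetch pages beyond the limit
        for link in get_links(url):
            counts[link] += 1
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return counts


# Toy link graph standing in for real pages.
FAKE_WEB = {
    'https://a.example': ['https://b.example', 'https://c.example'],
    'https://b.example': ['https://c.example', 'https://a.example'],
    'https://c.example': ['https://d.example'],
}

result = crawl('https://a.example', 2, lambda u: FAKE_WEB.get(u, []))
for key, value in result.items():
    print(key, '----', value)
```

`Counter` plays the role of the `add_to_dict` helper, and the `visited` set keeps the crawl from looping when pages link back to each other.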

In [34]:
get_urls("https://twitter.com/GolfWorld", 2)
for key in url_dict:
    print(str(key) + "  ----   " + str(url_dict[key]))

https://help.twitter.com/using-twitter/twitter-supported-browsers  ----   1
https://microsoft.com/edge  ----   1
https://www.apple.com/safari  ----   1
https://www.google.com/chrome  ----   1
https://www.mozilla.org/firefox  ----   1
https://twitter.com/  ----   4
https://status.twitterstat.us/  ----   4
https://cards-dev.twitter.com/validator  ----   3
https://publish.twitter.com  ----   3
https://privacy.twitter.com/  ----   3
https://transparency.twitter.com/  ----   3
https://about.twitter.com/en/who-we-are/our-company.html  ----   3
https://about.twitter.com/en/who-we-are/twitter-for-good.html  ----   3
https://blog.twitter.com/  ----   3
https://about.twitter.com/en/who-we-are/brand-toolkit.html  ----   3
https://careers.twitter.com/  ----   3
https://investor.twitterinc.com/  ----   3
https://help.twitter.com/  ----   3
https://help.twitter.com/en/using-twitter  ----   3
https://media.twitter.com/  ----   3
https://business.twitter.com/en/help.html  ----   3
https://help.twitter

<a id='16'></a>
## -------------PRACTICE-------------

1. Go to a new site and download at least 2 images. 

2. Scrape a different Twitter account.

3. Get the latest Bloomberg news topic.