# Web Scraping — Part 2 — Workbook

In this lesson, we'll learn how to scrape multiple web pages with the Python libraries `requests` and `BeautifulSoup`.

---

## Quick Demonstration of Image Scraping — NYT Front Page

### Import Requests and BeautifulSoup

Once again, we're going to use the `requests` library and the `BeautifulSoup` library to scrape data.

In [2]:
import requests
from bs4 import BeautifulSoup

### Get HTML Data and Extract Text

*The New York Times* Front Page: https://nytimes.com

Here we're going to request the URL for *The New York Times* front page, extract the text of the web page, and then transform it into a BeautifulSoup document.

In [166]:
response = requests.get("https://nytimes.com")
html_string = response.text
document = BeautifulSoup(html_string, "html.parser")

Here we search through the HTML code to find all the `<img>` tags:

In [None]:
document.find_all('img')

To display these images in our Jupyter notebook, we're going to import `Markdown` and `display` from the `IPython.display` module. Because Markdown can render raw HTML, they let us turn each `<img>` tag into Markdown and display the image in this notebook.

In [None]:
from IPython.display import Markdown, display

# Loop through all the images on the NYT front page
for image in document.find_all('img'):
    
    # Convert the image tag to a string
    image_string = str(image)
    
    # Transform the tag to Markdown and then display it as Markdown
    display(Markdown(image_string))

## Quick Demonstration of Image Scraping — Bill Gates's LinkedIn Page

https://www.linkedin.com/in/williamhgates/

In [109]:
response = requests.get("https://www.linkedin.com/in/williamhgates/")
html_string = response.text
document = BeautifulSoup(html_string, "html.parser")

In [110]:
from IPython.display import Markdown, display

# Loop through all the images on Bill Gates's LinkedIn page
for image in document.find_all('img'):
    # Convert the image tag to a string
    image_string = str(image)
    # Transform the tag to Markdown and then display it as Markdown
    display(Markdown(image_string))

What's going wrong here?

In [None]:
response
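
One way to investigate is to look at the status code that came back with the response, which `requests` exposes as `response.status_code`. The sketch below uses only the standard library to translate a status code into a readable label. `describe_status` is a hypothetical helper name, not part of `requests`, and the example codes are illustrative rather than the actual output of the cell above.

```python
from http import HTTPStatus

def describe_status(status_code):
    """Return a readable label for an HTTP status code (hypothetical helper)."""
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        # Some servers reply with codes that are not in the HTTP standard
        phrase = "non-standard status code"
    return f"{status_code} {phrase}"

# Illustrative codes; with a real response you would pass response.status_code
for code in [200, 403, 404, 999]:
    print(describe_status(code))
```

A code of 200 means the request succeeded. A code in the 400s or 500s means the page you asked for was not returned, so there may be nothing useful to parse.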

## Scraping Multiple Web Pages At a Time

In the last lesson, we figured out how to scrape the lyrics for a single Missy Elliott song.

In [117]:
response = requests.get("https://genius.com/Missy-elliott-work-it-lyrics")
html_string = response.text
document = BeautifulSoup(html_string, "html.parser")

In [None]:
document.find('p').text

But how can we scrape lyrics for multiple Missy Elliott songs at a time?

### Figure Out the Pattern

What we need to do is figure out how to programmatically generate the correct Genius web page URL for each song we're interested in:

`f"https://genius.com/Missy-elliott-{formatted_song}-lyrics"`

In [152]:
song_titles = ['Work It', 'WTF (Where They From)', 'The Rain (Supa Dupa Fly)']

```
for song in song_titles:
    formatted_song = ?????
    response = requests.get(f"https://genius.com/Missy-elliott-{formatted_song}-lyrics")
    html_string = response.text
    document = BeautifulSoup(html_string, "html.parser")
    document.find('p').text
```

Let's inspect the Genius web pages for each of these songs:

https://genius.com/Missy-elliott-work-it-lyrics

https://genius.com/Missy-elliott-the-rain-supa-dupa-fly-lyrics

https://genius.com/Missy-elliott-wtf-where-they-from-lyrics
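
Before writing any formatting code, it can help to confirm that the URL template really reproduces the three URLs above. The sketch below copies the slugs by hand from those URLs and fills them into the same f-string. `known_slugs` and `generated_urls` are names chosen for this illustration.

```python
# Slugs copied by hand from the three Genius URLs above
known_slugs = ["work-it", "the-rain-supa-dupa-fly", "wtf-where-they-from"]

# Fill each slug into the URL template
generated_urls = [
    f"https://genius.com/Missy-elliott-{formatted_song}-lyrics"
    for formatted_song in known_slugs
]

for url in generated_urls:
    print(url)
```

Comparing each slug with its original song title shows what the formatting step has to change.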

### Make Song Titles Fit Pattern — Your Turn!

Create a function called `format_song()` that will take in a song title and then return the song title correctly formatted for its Genius web page.

For example, the song `WTF (Where They From)` needs to be converted to `wtf-where-they-from`.

Hint: You will need to use [string methods](https://info1350.github.io/Intro-CA-SP21/02-Python/06-String-Methods.html#id1)!
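
As a reminder of how string methods behave, here is a short sketch on an unrelated example string. Each method returns a new string, so calls can be chained. The example text and the characters being replaced are chosen only for illustration and are not the solution to this exercise.

```python
greeting = "Hello, World!"

# .lower() returns a lowercase copy of the string
lowercase_greeting = greeting.lower()
print(lowercase_greeting)

# .replace(old, new) swaps every occurrence of old for new;
# replacing with an empty string deletes the character
no_punctuation = lowercase_greeting.replace(",", "").replace("!", "")
print(no_punctuation)

# Methods can be chained because each one returns a string
with_underscores = no_punctuation.replace(" ", "_")
print(with_underscores)
```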

In [148]:
def format_song(song):
    #Your Code Here 👇
    
    
    
    
    return formatted_song

Test your function on these two song titles to make sure it's working correctly.

In [149]:
format_song('WTF (Where They From)')

'wtf-where-they-from'

In [150]:
format_song('Work It')

'work-it'

### Put It All Together

In [156]:
song_titles = ['Work It', 'WTF (Where They From)', 'The Rain (Supa Dupa Fly)']

Now use your `format_song()` function to create the variable `formatted_song`, which will allow the code below to work.

In [None]:
for song in song_titles:
    formatted_song = ???? #Use your format_song() function here
    response = requests.get(f"https://genius.com/Missy-elliott-{formatted_song}-lyrics")
    html_string = response.text
    document = BeautifulSoup(html_string, "html.parser")
    lyrics = document.find('p').text
    print(lyrics)

## Write Lyrics to a Text File

In [152]:
song_titles = ['Work It', 'WTF (Where They From)', 'The Rain (Supa Dupa Fly)']

Here we are writing the lyrics to a text file rather than printing them out.

Again, use your `format_song()` function to create the variable `formatted_song`, which will allow the code below to work.

In [160]:
with open('Missy-Elliott-Lyrics.txt', mode='w') as file_object:
    
    for song in song_titles:
        formatted_song = format_song(song)  #Use your format_song() function here
        response = requests.get(f"https://genius.com/Missy-elliott-{formatted_song}-lyrics")
        html_string = response.text
        document = BeautifulSoup(html_string, "html.parser")
        lyrics = document.find('p').text
        
        # Add a newline so the last word of one song doesn't run into the first word of the next
        file_object.write(lyrics + "\n")
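
The file-writing pattern can be tried out without making any web requests. The sketch below writes a few placeholder strings (standing in for scraped lyrics) to a file in a temporary folder, one per line, and reads them back. The placeholder text, the file name, and the use of `tempfile` are assumptions made for this illustration.

```python
import os
import tempfile

# Placeholder strings standing in for scraped lyrics
placeholder_lyrics = ["first song text", "second song text", "third song text"]

# Work in a temporary folder so no real files are overwritten
with tempfile.TemporaryDirectory() as folder:
    file_path = os.path.join(folder, "example-lyrics.txt")

    # mode='w' creates the file, or replaces its contents if it already exists
    with open(file_path, mode="w") as file_object:
        for lyrics in placeholder_lyrics:
            # The newline keeps each song's text on its own line
            file_object.write(lyrics + "\n")

    # Read the file back to check what was written
    with open(file_path) as file_object:
        saved_text = file_object.read()

print(saved_text)
```

Because the file is opened once, before the loop, every song is written to the same file. Opening it with `mode='w'` inside the loop would instead overwrite the file on each pass and keep only the last song.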

## Count Top Words From File

If we wanted to find out the most frequent words in Missy Elliott's lyrics, we could use the word counter code that we've used in previous lessons.

In [163]:
import re
from collections import Counter

stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',
 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 've', 'll', 'amp']


def split_into_words(any_chunk_of_text):
    lowercase_text = any_chunk_of_text.lower()
    split_words = re.split(r"\W+", lowercase_text)
    return split_words

def get_top_words(full_text, number_of_words=20):
    all_the_words = split_into_words(full_text)
    meaningful_words = [word for word in all_the_words if word not in stopwords]
    meaningful_words_tally = Counter(meaningful_words)
    most_frequent_meaningful_words = meaningful_words_tally.most_common(number_of_words)
    return most_frequent_meaningful_words

Let's read in the file that we created and get the top words.

In [None]:
with open('Missy-Elliott-Lyrics.txt') as file_object:
    missy_lyrics = file_object.read()
get_top_words(missy_lyrics)
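
To see what happens inside `get_top_words()`, the sketch below runs the same steps on a made-up sentence: lowercase, split on non-word characters, filter, and count. The sentence and the one-word stopword list are invented for this illustration. It also filters out the empty string that `re.split` leaves behind when the text ends with punctuation, a step the functions above do not include.

```python
import re
from collections import Counter

sample_text = "Ring ring! The phone, the PHONE... ring."
small_stopwords = ["the"]

# Lowercase the text, then split wherever there is a run of non-word characters
words = re.split(r"\W+", sample_text.lower())
print(words)

# Drop stopwords and the empty string left over after the final period
meaningful_words = [word for word in words if word and word not in small_stopwords]

# Tally the remaining words and keep the two most common
top_words = Counter(meaningful_words).most_common(2)
print(top_words)
```

Each item in the result is a `(word, count)` pair, ordered from most to least frequent.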

## What patterns do you notice about the top 20 words from these Missy Elliott songs?
Feel free to open the text file in the file browser on the left and inspect the lyrics manually.

## Bonus: If You Wanted to Change the Artist...

In [None]:
artist = 'Bts'
song_titles = ['Dynamite', 'Euphoria', 'Fake Love']

for song in song_titles:
    formatted_song = ???? #Use your format_song() function here
    response = requests.get(f"https://genius.com/{artist}-{formatted_song}-lyrics")
    html_string = response.text
    document = BeautifulSoup(html_string, "html.parser")
    lyrics = document.find('p').text
    print(lyrics)
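
Because the artist is now a variable too, the whole URL can be built from two pieces. The sketch below wraps the template in a small function and builds one URL without requesting it. `build_genius_url` is a hypothetical helper, and the slug `dynamite` is written by hand rather than produced by `format_song()`.

```python
def build_genius_url(artist, formatted_song):
    """Fill an artist name and a formatted song title into the Genius URL pattern."""
    return f"https://genius.com/{artist}-{formatted_song}-lyrics"

# Hand-written slug; in the loop above it would come from format_song()
example_url = build_genius_url("Bts", "dynamite")
print(example_url)
```

Artist names that contain spaces would need the same kind of formatting as song titles, as in `Missy-elliott` earlier in this workbook.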

## Group Discussion

* Do you think scholars should use web scraping in their research? Why or why not?
* How would you feel if you found out that one of your social media posts had been included in an academic article without your knowledge?
* What are some strategies that you think scholars might use to do web scraping in an ethical way?