# Web Scraping — Part 2 — Workbook

In this lesson, we're going to learn how to scrape multiple web pages with the Python libraries `requests` and `BeautifulSoup`.

---

## Quick Demonstration of Image Scraping — NYT Front Page

### Import Requests and BeautifulSoup

Once again, we're going to use the `requests` library and the `BeautifulSoup` library to scrape data.

In [1]:
import requests
from bs4 import BeautifulSoup

### Get HTML Data and Extract Text

*The New York Times* Front Page: https://nytimes.com

Here we're going to request the URL for *The New York Times* front page, extract the text of the web page, then transform it into a BeautifulSoup document.

In [2]:
response = requests.get("https://nytimes.com")
html_string = response.text
document = BeautifulSoup(html_string, "html.parser")

Here we search through the HTML code to find all the `<img>` tags:

In [3]:
document.find_all('img')

[<img alt="President Biden faces left at a lectern, holding a microphone." class="css-hdqqnp" src="https://static01.nyt.com/images/2022/11/06/multimedia/06dc-biden-lead-1-860e/06dc-biden-lead-1-860e-threeByTwoSmallAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>,
 <img alt="President Biden faces left at a lectern, holding a microphone." class="css-122y91a" src="https://static01.nyt.com/images/2022/11/06/multimedia/06dc-biden-lead-1-860e/06dc-biden-lead-1-860e-threeByTwoSmallAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>,
 <img alt="On Saturday in Pennsylvania, Donald J. Trump held a rally for Republicans. On Sunday in Florida, Mr. Trump and Ron DeSantis held separate campaign rallies." class="css-dzl7b5" loading="lazy"/>,
 <img alt="On Saturday in Pennsylvania, Donald J. Trump held a rally for Republicans. On Sunday in Florida, Mr. Trump and Ron DeSantis held separate campaign rallies." class="css-122y91a" src="https://static01.nyt.com/

To display these images in our Jupyter notebook, we're going to import `Markdown` and `display` from `IPython.display`, which allow us to render code output as Markdown and thus display the images in this notebook.

In [4]:
from IPython.display import Markdown, display

# Loop through all the images on the NYT front page
for image in document.find_all('img'):
    
    # Convert the image tag to a string
    image_string = str(image)
    
    # Transform the tag to Markdown and then display it as Markdown
    display(Markdown(image_string))

<img alt="President Biden faces left at a lectern, holding a microphone." class="css-hdqqnp" src="https://static01.nyt.com/images/2022/11/06/multimedia/06dc-biden-lead-1-860e/06dc-biden-lead-1-860e-threeByTwoSmallAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="President Biden faces left at a lectern, holding a microphone." class="css-122y91a" src="https://static01.nyt.com/images/2022/11/06/multimedia/06dc-biden-lead-1-860e/06dc-biden-lead-1-860e-threeByTwoSmallAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="On Saturday in Pennsylvania, Donald J. Trump held a rally for Republicans. On Sunday in Florida, Mr. Trump and Ron DeSantis held separate campaign rallies." class="css-dzl7b5" loading="lazy"/>

<img alt="On Saturday in Pennsylvania, Donald J. Trump held a rally for Republicans. On Sunday in Florida, Mr. Trump and Ron DeSantis held separate campaign rallies." class="css-122y91a" src="https://static01.nyt.com/images/2022/11/06/multimedia/06pol-trump-desantis-1-2e67/06pol-trump-desantis-1-2e67-threeByTwoSmallAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="John Fetterman with supporters at an event in Harrisburg, Pa., on Sunday." class="css-dzl7b5" loading="lazy"/>

<img alt="John Fetterman with supporters at an event in Harrisburg, Pa., on Sunday." class="css-122y91a" src="https://static01.nyt.com/images/2022/11/06/multimedia/00pol-fetterman-photo01-1-f506/00pol-fetterman-photo01-1-f506-threeByTwoSmallAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="slideshow-animate css-3kzcvh" loading="lazy"/>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2022/11/06/multimedia/06nycm-fashion-fader-1/-1-3dc2-square640.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="slideshow-animate css-3kzcvh" loading="lazy"/>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2022/11/06/multimedia/06nycm-fashion-fader-6-1-c4d2/06nycm-fashion-fader-6-1-c4d2-square640.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="slideshow-animate css-3kzcvh" loading="lazy"/>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2022/11/06/multimedia/06nycm-fashion-fader-7-1-6338/06nycm-fashion-fader-7-1-6338-square640.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="slideshow-animate css-3kzcvh" loading="lazy"/>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2022/11/06/multimedia/06nycm-fashion-fader-4/06nycm-fashion-grid3-1-0f56-square640.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="slideshow-animate css-3kzcvh" loading="lazy"/>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2022/11/06/multimedia/06nycm-fashion-fader-2/06nycm-fashion-fader-2-square640.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="slideshow-animate css-3kzcvh" loading="lazy"/>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2022/11/06/multimedia/06nycm-fashion-fader-3/06nycm-fashion-grid4-1-c973-square640.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="slideshow-animate css-3kzcvh" loading="lazy"/>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2022/11/06/multimedia/06nycm-fashion-fader-5/06nycm-fashion-grid1-1-ced0-square640.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="css-dzl7b5" loading="lazy"/>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2022/11/06/business/00MOLLY-Lede/00MOLLY-Lede-threeByTwoSmallAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="A man in a black basketball jersey and black shorts hunches over." class="css-dzl7b5" loading="lazy"/>

<img alt="A man in a black basketball jersey and black shorts hunches over." class="css-122y91a" src="https://static01.nyt.com/images/2022/11/06/business/06nike-irving/merlin_215808777_cb994196-0a10-4748-b39e-6b5ea57e38b9-threeByTwoSmallAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="An undulating landscape of white sand dunes in the foreground, with a backdrop of surrounding rocky mountains and clouds in the sky." class="css-dzl7b5" loading="lazy"/>

<img alt="An undulating landscape of white sand dunes in the foreground, with a backdrop of surrounding rocky mountains and clouds in the sky." class="css-122y91a" src="https://static01.nyt.com/images/2022/11/04/realestate/04-atomic-new-mexico-alamogorda3/04-atomic-new-mexico-alamogorda3-square640.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="css-dzl7b5" loading="lazy"/>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2022/11/03/multimedia/nyc-salary-quiz-1-29fa/nyc-salary-quiz-1-29fa-threeByTwoSmallAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="In “Whatever Weighs You Down,” a screen showed the artist Chisato Minamimura using sign language to mirror the pianist Zubin Kanga’s gestures, which he made onstage while wearing sensor gloves." class="css-dzl7b5" loading="lazy"/>

<img alt="In “Whatever Weighs You Down,” a screen showed the artist Chisato Minamimura using sign language to mirror the pianist Zubin Kanga’s gestures, which he made onstage while wearing sensor gloves." class="css-122y91a" src="https://static01.nyt.com/images/2022/11/03/arts/03experimental01/03experimental01-threeByTwoSmallAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="Pamela Paul" class="css-1ii2lp6" loading="lazy"/>

<img alt="Pamela Paul" class="css-122y91a" src="https://static01.nyt.com/images/2022/07/12/opinion/pamela-paul-new/pamela-paul-new-thumbLarge-v2.png?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="Ezra Klein" class="css-1ii2lp6" loading="lazy"/>

<img alt="Ezra Klein" class="css-122y91a" src="https://static01.nyt.com/images/2021/01/06/opinion/ezra-klein/ezra-klein-thumbLarge-v3.png?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="Margaret Renkl" class="css-1ii2lp6" loading="lazy"/>

<img alt="Margaret Renkl" class="css-122y91a" src="https://static01.nyt.com/images/2017/04/08/opinion/margaret-renkl/margaret-renkl-thumbLarge-v2.png?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="css-dzl7b5" loading="lazy"/>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2022/11/07/opinion/07renkl-1/merlin_157863966_af54805c-431b-4da4-bf08-fbaf189967f5-threeByTwoSmallAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="Katherine Miller" class="css-1ii2lp6" loading="lazy"/>

<img alt="Katherine Miller" class="css-122y91a" src="https://static01.nyt.com/images/2022/09/09/opinion/katherine-miller/katherine-miller-thumbLarge.png?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="css-dzl7b5" loading="lazy"/>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2022/11/03/podcasts/03-RunUp-left-image/03-RunUp-left-image-square320-v3.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="css-dzl7b5" loading="lazy"/>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2022/11/04/climate/04cli-newsletter-illo/04cli-newsletter-illo-square320.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="css-dzl7b5" loading="lazy"/>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2016/09/04/fashion/02Modernlove-podcast-miscarriage-image/04LOVE-square320.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="css-dzl7b5" loading="lazy"/>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2022/11/03/well/03Well-NL-Boredom/03Well-NL-Boredom-square320-v2.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="A bunker that has been converted into a temple in Keelung, Taiwan." class="css-dzl7b5" loading="lazy"/>

<img alt="A bunker that has been converted into a temple in Keelung, Taiwan." class="css-122y91a" src="https://static01.nyt.com/images/2022/10/31/world/00keelung-dispatch-07/merlin_215584809_3dc0644b-96cc-4ae0-b154-f7a2bd0501d7-threeByTwoSmallAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img class="svelte-1xus8y6" src="https://static01.nyt.com/newsgraphics/2021/coronavirus-tracking/images/svg/timeseries/USA/USA-cases-two-weeks.svg"/>

<img class="svelte-1xus8y6" src="https://static01.nyt.com/newsgraphics/2021/coronavirus-tracking/images/svg/timeseries/USA/USA-deaths-two-weeks.svg"/>

<img class="svelte-1xus8y6" src="https://static01.nyt.com/newsgraphics/2021/coronavirus-tracking/images/svg/timeseries/NYT-World/NYT-World-cases-two-weeks.svg"/>

<img class="svelte-1xus8y6" src="https://static01.nyt.com/newsgraphics/2021/coronavirus-tracking/images/svg/timeseries/NYT-World/NYT-World-deaths-two-weeks.svg"/>

<img alt="US coronavirus cases" class="svelte-14oieza" loading="lazy" src="https://static01.nyt.com/newsgraphics/2021/coronavirus-tracking/images/maps/USA/hotspots-state.png"/>

<img alt="Global coronavirus cases" class="svelte-14oieza" loading="lazy" src="https://static01.nyt.com/newsgraphics/2021/coronavirus-tracking/images/maps/NYT-World/hotspots.png"/>

<img alt="Gabrielle Blair, 48, also known as Design Mom, sits in her study at home in Argentan, France. Her family has been restoring the 17th century house since 2019. " class="css-dzl7b5" loading="lazy"><noscript><img alt="Gabrielle Blair, 48, also known as Design Mom, sits in her study at home in Argentan, France. Her family has been restoring the 17th century house since 2019. " class="css-122y91a" src="https://static01.nyt.com/images/2022/10/21/multimedia/21WELL-DESIGN-MOM1-1-f7de/21WELL-DESIGN-MOM1-1-f7de-threeByTwoSmallAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/></noscript></img>

<img alt="Gabrielle Blair, 48, also known as Design Mom, sits in her study at home in Argentan, France. Her family has been restoring the 17th century house since 2019. " class="css-122y91a" src="https://static01.nyt.com/images/2022/10/21/multimedia/21WELL-DESIGN-MOM1-1-f7de/21WELL-DESIGN-MOM1-1-f7de-threeByTwoSmallAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="css-dzl7b5" loading="lazy"><noscript><img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2022/11/04/well/04WELL-NIGHTMARE-VARIANT/04WELL-NIGHTMARE-VARIANT-threeByTwoSmallAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/></noscript></img>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2022/11/04/well/04WELL-NIGHTMARE-VARIANT/04WELL-NIGHTMARE-VARIANT-threeByTwoSmallAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="css-dzl7b5" loading="lazy"><noscript><img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2022/03/09/well/00well-daylight-saving-time/00well-daylight-saving-time-threeByTwoSmallAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/></noscript></img>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2022/03/09/well/00well-daylight-saving-time/00well-daylight-saving-time-threeByTwoSmallAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="css-dzl7b5" loading="lazy"/>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2022/11/01/multimedia/25ASKWELL-STATINS1-1-a328/25ASKWELL-STATINS1-1-a328-threeByTwoSmallAt2X-v3.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="css-dzl7b5" loading="lazy"/>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2022/11/03/well/03-Burst-Hold-It-Together-Patia/03-Burst-Hold-It-Together-Patia-threeByTwoSmallAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="The home of Thomas and Kathleen Dennis in Truro, Mass., part of the Cape Cod National Seashore. The couple has had to move the house twice as the sand dunes near it erode." class="css-dzl7b5" loading="lazy"/>

<img alt="The home of Thomas and Kathleen Dennis in Truro, Mass., part of the Cape Cod National Seashore. The couple has had to move the house twice as the sand dunes near it erode." class="css-122y91a" src="https://static01.nyt.com/images/2022/11/05/multimedia/05sp-water-move-inyt1/merlin_215839311_67692910-1869-4380-9553-5494761411c9-square640.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="From “The Blue Scarf.”" class="css-dzl7b5" loading="lazy"/>

<img alt="From “The Blue Scarf.”" class="css-122y91a" src="https://static01.nyt.com/images/2022/10/30/books/review/30Jones-CHILDRENS/30Jones-CHILDRENS-threeByTwoSmallAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="From left, Mei Mac, Ami Okumura Jones and Dai Tabuchi in the Royal Shakespeare Company production of “My Neighbour Totoro” at the Barbican Theater." class="css-dzl7b5" loading="lazy"/>

<img alt="From left, Mei Mac, Ami Okumura Jones and Dai Tabuchi in the Royal Shakespeare Company production of “My Neighbour Totoro” at the Barbican Theater." class="css-122y91a" src="https://static01.nyt.com/images/2022/11/03/world/03londontheater1/merlin_215506701_c9ee9efa-012d-4b75-b373-7ec6931e9587-threeByTwoSmallAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="A singer wearing all black pauses in front of a piano onstage at a recital hall, with the word “benedictus” projected on the wall." class="css-dzl7b5" loading="lazy"/>

<img alt="A singer wearing all black pauses in front of a piano onstage at a recital hall, with the word “benedictus” projected on the wall." class="css-122y91a" src="https://static01.nyt.com/images/2022/11/07/multimedia/04davone-tines/04davone-tines-threeByTwoSmallAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="Millie Bobby Brown in “Enola Holmes 2.”" class="css-dzl7b5" loading="lazy"/>

<img alt="Millie Bobby Brown in “Enola Holmes 2.”" class="css-122y91a" src="https://static01.nyt.com/images/2022/11/03/multimedia/03enolaholmes-review-1-2378/03enolaholmes-review-1-2378-threeByTwoSmallAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="css-dzl7b5" loading="lazy"/>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2022/11/06/multimedia/https---theathletic-com-3768577-2022-11-06-nfl-week-9-takeaways/https---theathletic-com-3768577-2022-11-06-nfl-week-9-takeaways--threeByTwoSmallAt2X.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="css-dzl7b5" loading="lazy"/>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2019/04/18/homepage/spelling-bee-logo-bulletin/spelling-bee-logo-bulletin-square320-v5.png?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="css-dzl7b5" loading="lazy"/>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2020/03/23/crosswords/crossword-logo-nytgames-hires/crossword-logo-nytgames-hires-square320-v3.png?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="css-dzl7b5" loading="lazy"/>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2022/03/02/crosswords/alpha-wordle-icon-new/alpha-wordle-icon-new-square320-v3.png?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="Wordle gives players six tries to guess the daily word. Only a small number of people go in with a plan." class="css-dzl7b5" loading="lazy"/>

<img alt="Wordle gives players six tries to guess the daily word. Only a small number of people go in with a plan." class="css-122y91a" src="https://static01.nyt.com/images/2022/08/25/crosswords/wordle-top/wordle-top-square320.jpg?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="css-dzl7b5" loading="lazy"/>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2021/05/27/multimedia/alpha-letterboxed-promo-1622145789727/alpha-letterboxed-promo-1622145789727-square320.png?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>

<img alt="" class="css-dzl7b5" loading="lazy"/>

<img alt="" class="css-122y91a" src="https://static01.nyt.com/images/2020/03/23/crosswords/tiles-logo-nytgames-hi-res/tiles-logo-nytgames-hi-res-square320-v4.png?format=pjpg&amp;quality=75&amp;auto=webp&amp;disable=upscale"/>
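Several of the tags above have no `src` attribute (many of the ones marked `loading="lazy"`), so they do not point to an image file. As a minimal sketch of how the image URLs could be collected on their own, the example below parses a small made-up HTML snippet and uses the tag's `.get()` method, which returns `None` when an attribute is missing. The `sample_html` string and its `example.com` URLs are placeholders for illustration, not NYT data.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a real page (not NYT data)
sample_html = """
<img alt="First photo" class="lazy-placeholder" loading="lazy"/>
<img alt="First photo" src="https://example.com/first.jpg"/>
<img alt="" src="https://example.com/second.jpg"/>
"""

sample_document = BeautifulSoup(sample_html, "html.parser")

image_urls = []
for image in sample_document.find_all("img"):
    # .get() returns None instead of raising an error when src is missing
    source = image.get("src")
    if source is not None:
        image_urls.append(source)

print(image_urls)
```

The same loop could be pointed at `document.find_all('img')` from the cells above to list only the tags that actually carry an image URL.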

## Quick Demonstration of Image Scraping — Bill Gates's LinkedIn Page

https://www.linkedin.com/in/williamhgates/

In [5]:
response = requests.get("https://www.linkedin.com/in/williamhgates/")
html_string = response.text
document = BeautifulSoup(html_string, "html.parser")

In [6]:
from IPython.display import Markdown, display

# Loop through all the images on Bill Gates's LinkedIn page
for image in document.find_all('img'):
    # Convert the image tag to a string
    image_string = str(image)
    # Transform the tag to Markdown and then display it as Markdown
    display(Markdown(image_string))

What's going wrong here?

In [7]:
response

<Response [999]>
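One way to catch this kind of problem early is to look at the status code before parsing anything. The sketch below is a hypothetical helper, `describe_status()`, that takes the integer stored in `response.status_code`; it is not part of `requests`. It assumes that only `200` means the page came back normally. Note that `999` is not a standard HTTP status code. It is commonly reported as LinkedIn's way of refusing automated requests, so treat that reading as an assumption rather than a documented fact.

```python
def describe_status(status_code):
    """Return a short, human-readable note about an HTTP status code."""
    if status_code == 200:
        return "OK: the server returned the page"
    if 400 <= status_code < 500:
        return "Client error: the request was rejected or the page was not found"
    if 500 <= status_code < 600:
        return "Server error: the site had a problem handling the request"
    # Anything else, including non-standard codes such as 999
    return "Other status: check whether the site allows automated requests"

print(describe_status(200))
print(describe_status(999))
```

In the notebook, this could be called as `describe_status(response.status_code)` right after `requests.get(...)`, before building the BeautifulSoup document.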

## Scraping Multiple Web Pages At a Time

In the last lesson, we figured out how to scrape the lyrics for a single Missy Elliott song.

In [8]:
response = requests.get("https://genius.com/Missy-elliott-work-it-lyrics")
html_string = response.text
document = BeautifulSoup(html_string, "html.parser")

In [9]:
document.find('p').text

'Produced by'

But how can we scrape lyrics for multiple Missy Elliott songs at a time?

### Figure Out the Pattern

What we need to do is figure out how to programmatically generate the correct Genius web page URL for each song we're interested in:

`f"https://genius.com/Missy-elliott-{formatted_song}-lyrics"`
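As a small illustration of the pattern, the f-string below fills in `formatted_song` by hand for one song. The value `"work-it"` is typed in directly here, taken from the working URL used earlier in this lesson; generating it automatically from the song title is the next step.

```python
# Slug typed in by hand for now
formatted_song = "work-it"

# The f-string drops the slug into the fixed parts of the URL
url = f"https://genius.com/Missy-elliott-{formatted_song}-lyrics"
print(url)
```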

In [10]:
song_titles = ['Work It', 'WTF (Where They From)', 'The Rain (Supa Dupa Fly)']

```
for song in song_titles:
    formatted_song = ?????
    response = requests.get(f"https://genius.com/Missy-elliott-{formatted_song}-lyrics")
    html_string = response.text
    document = BeautifulSoup(html_string, "html.parser")
    document.find('p').text
```

Let's inspect the Genius web pages for each of these songs:

https://genius.com/Missy-elliott-work-it-lyrics

https://genius.com/Missy-elliott-the-rain-supa-dupa-fly-lyrics

https://genius.com/Missy-elliott-wtf-where-they-from-lyrics

### Make Song Titles Fit Pattern — Your Turn!

Create a function called `format_song()` that will take in a song title and then return the song title correctly formatted for its Genius web page.

For example, the song `WTF (Where They From)` needs to be converted to `wtf-where-they-from`.

Hint: You will need to use [string methods](https://info1350.github.io/Intro-CA-SP21/02-Python/06-String-Methods.html#id1)!

In [11]:
def format_song(song):
    #Your Code Here 👇
    
    
    
    
    return formatted_song

Test your function on these two song titles to make sure it's working correctly.

In [12]:
# format_song('WTF (Where They From)')

NameError: name 'formatted_song' is not defined

In [None]:
# format_song('Work It')
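If you get stuck, here is one possible approach to compare against; it is not the only correct answer. It is named `format_song_sketch()` (a hypothetical name, so it does not overwrite your own `format_song()`), and it assumes the only changes needed are lowercasing, dropping parentheses, and replacing spaces with hyphens. Titles with other punctuation, such as apostrophes or commas, would likely need more handling.

```python
def format_song_sketch(song):
    # Lowercase the whole title
    formatted_song = song.lower()
    # Remove opening and closing parentheses
    formatted_song = formatted_song.replace("(", "").replace(")", "")
    # Replace spaces with hyphens
    formatted_song = formatted_song.replace(" ", "-")
    return formatted_song

for title in ['Work It', 'WTF (Where They From)', 'The Rain (Supa Dupa Fly)']:
    print(format_song_sketch(title))
```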

### Put It All Together

In [13]:
# song_titles = ['Work It', 'WTF (Where They From)', 'The Rain (Supa Dupa Fly)']

Now use your `format_song()` function to create the variable `formatted_song`, which will allow the code below to work.

In [18]:
# for song in song_titles:
#     formatted_song = format_song(song)
#     response = requests.get(f"https://genius.com/Missy-elliott-{formatted_song}-lyrics")
#     html_string = response.text
#     document = BeautifulSoup(html_string, "html.parser")
#     lyrics = document.find('p').text
#     print(lyrics)

NameError: name 'formatted_song' is not defined

## Write Lyrics to a Text File

In [19]:
song_titles = ['Work It', 'WTF (Where They From)', 'The Rain (Supa Dupa Fly)']

Here we are writing the lyrics to a text file rather than printing them out.

Again, use your `format_song()` function to create the variable `formatted_song`, which will allow the code below to work.

In [22]:
# with open('Missy-Elliott-Lyrics.txt', mode='w') as file_object:
    
#     for song in song_titles:
#         # formatted_song = format_song(song)  #Use your format_song() function here
#         response = requests.get(f"https://genius.com/Missy-elliott-{formatted_song}-lyrics")
#         html_string = response.text
#         document = BeautifulSoup(html_string, "html.parser")
#         lyrics = document.find('p').text
        
#         file_object.write(lyrics)
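Because `write()` does not add line breaks on its own, the lyrics for consecutive songs would run together in the file. The sketch below shows one way to keep them separate, using made-up placeholder strings instead of scraped lyrics and a temporary folder so that it does not touch `Missy-Elliott-Lyrics.txt`. The file name `example-lyrics.txt` and the `placeholder_lyrics` list are assumptions for illustration.

```python
import os
import tempfile

# Placeholder text standing in for scraped lyrics
placeholder_lyrics = ["first song text", "second song text", "third song text"]

with tempfile.TemporaryDirectory() as folder:
    file_path = os.path.join(folder, "example-lyrics.txt")

    with open(file_path, mode="w", encoding="utf-8") as file_object:
        for lyrics in placeholder_lyrics:
            # Add a blank line after each song so the texts stay separate
            file_object.write(lyrics + "\n\n")

    # Read the file back to check what was written
    with open(file_path, encoding="utf-8") as file_object:
        saved_text = file_object.read()

print(saved_text)
```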

## Count Top Words From File

If we wanted to find out the most frequent words in Missy Elliott's lyrics, we could use the word counter code that we've used in previous lessons.

In [23]:
import re
from collections import Counter

stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours',
 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers',
 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves',
 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are',
 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does',
 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until',
 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into',
 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here',
 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',
 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so',
 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 've', 'll', 'amp']


def split_into_words(any_chunk_of_text):
    lowercase_text = any_chunk_of_text.lower()
    split_words = re.split(r"\W+", lowercase_text)
    return split_words

def get_top_words(full_text, number_of_words=20):
    all_the_words = split_into_words(full_text)
    meaningful_words = [word for word in all_the_words if word not in stopwords]
    meaningful_words_tally = Counter(meaningful_words)
    most_frequent_meaningful_words = meaningful_words_tally.most_common(number_of_words)
    return most_frequent_meaningful_words

Let's read in the file that we created and get the top words.

In [24]:
missy_lyrics = open('Missy-Elliott-Lyrics.txt').read()
get_top_words(missy_lyrics)

[('', 1)]
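A result like `[('', 1)]` suggests the text being counted was empty, which would be the case if the lines that write the lyrics were still commented out when the file was created. Splitting an empty string with `re.split()` returns a list containing one empty string, and that empty string is what gets counted. The sketch below demonstrates this and shows one possible way to drop empty strings; `split_into_nonempty_words()` is a hypothetical variant, not the function defined above.

```python
import re

# Splitting an empty string yields a list with a single empty string
print(re.split(r"\W+", ""))

def split_into_nonempty_words(any_chunk_of_text):
    lowercase_text = any_chunk_of_text.lower()
    split_words = re.split(r"\W+", lowercase_text)
    # Keep only the pieces that actually contain characters
    return [word for word in split_words if word]

print(split_into_nonempty_words(""))
print(split_into_nonempty_words("Work it!"))
```

Once the lyrics have actually been written to the file, rerunning `get_top_words(missy_lyrics)` should give a list of real words instead.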

## What patterns do you notice about the top 20 words from these Missy Elliott songs?
Feel free to open the text file in the file browser at the left and inspect the lyrics manually.

## Bonus: If You Wanted to Change the Artist...

In [26]:
artist = 'Bts'
song_titles = ['Dynamite', 'Euphoria', 'Fake Love']

# for song in song_titles:
#     formatted_song = ???? #Use your format_song() function here
#     response = requests.get(f"https://genius.com/{artist}-{formatted_song}-lyrics")
#     html_string = response.text
#     document = BeautifulSoup(html_string, "html.parser")
#     lyrics = document.find('p').text
#     print(lyrics)

## Group Discussion

* Do you think scholars should use web scraping in their research? Why or why not?
* How would you feel if you found out that one of your social media posts had been included in an academic article without your knowledge?
* What are some strategies that you think scholars might use to do web scraping in an ethical way?