# First Steps:  Data Scraping

The world is full of terrible people who don't want you to have the easy-to-access data you so richly deserve.  "Web-scraping" is much harder than it used to be (websites are seldom just a bunch of .html files anymore), but there are still times when scraping the web can be useful.

To do this, I'm going to make use of two venerable libraries:  `requests` and `BeautifulSoup`.  Are they already included in our installation of JupyterLab?  I think so, but we can check by asking the Python package manager, `pip`, to tell us everything it knows...

In [1]:
pip list

Package                       Version
----------------------------- ---------
alembic                       1.7.7
altair                        4.2.0
anyio                         3.5.0
argon2-cffi                   21.3.0
argon2-cffi-bindings          21.2.0
asttokens                     2.0.5
async-generator               1.10
attrs                         21.4.0
Babel                         2.9.1
backcall                      0.2.0
backports.functools-lru-cache 1.6.4
beautifulsoup4                4.11.1
bleach                        5.0.0
blinker                       1.4
bokeh                         2.4.2
Bottleneck                    1.3.4
brotlipy                      0.7.0
cached-property               1.5.2
certifi                       2021.10.8
certipy                       0.1.3
cffi                          1.15.0
charset-normalizer            2.0.12
click                         8.1.2
cloudpickle                   2.0.0
colorama                      0.4.4
conda          
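
If scrolling through that whole list feels like a chore, here is a minimal sketch of asking about one package at a time instead.  It assumes Python 3.8 or newer (that is when `importlib.metadata` joined the standard library), and `installed_version` is a name I made up for this example, not something pip provides.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package isn't installed."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Look up the two libraries by the names pip uses for them.
for name in ("requests", "beautifulsoup4"):
    print(name, installed_version(name))
```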

Life is good.  Let's get started by "importing" the `requests` library.

In [2]:
import requests

No errors is good!  Now, let's open up the BeautifulSoup library:

In [3]:
import beautifulsoup4

ModuleNotFoundError: No module named 'beautifulsoup4'

Wait, what?  Oh, right, this is Python, *which often lies to your face*.  It turns out the package that pip lists as `beautifulsoup4` gets imported under the name `bs4`.  Don't ask, just make a note of it for the future.

In [4]:
from bs4 import BeautifulSoup
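
The name pip shows and the name you `import` don't have to match, and nothing forces them to.  One rough way to check which name Python can actually import, using only the standard library (`can_import` is a hypothetical helper of my own, not part of any library):

```python
from importlib.util import find_spec


def can_import(module_name):
    """True if Python can find a top-level module with this name."""
    return find_spec(module_name) is not None


print(can_import("bs4"))             # the import name
print(can_import("beautifulsoup4"))  # the pip name, which is not an importable module
```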

Stellar.  Now let's grab some headlines from Kotaku.

> Also, in case it is no longer listed as one of the headlines, here is a reminder that the games industry, games-adjacent industries, and most (many) gamers are largely just toxic sludge:

> https://kotaku.com/kim-kardashian-roblox-sex-tape-advertisement-experience-1848811161

In [5]:
kotakuNewsSite = "https://kotaku.com/culture/news"

In [6]:
myResult = requests.get(kotakuNewsSite)
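
One thing to know: `requests.get` will happily hand back an error page without complaining, so it can be worth checking the response before parsing it.  A minimal sketch, assuming a ten-second timeout is reasonable (`fetch_page` and the timeout value are my own choices, not anything the site requires):

```python
import requests


def fetch_page(url, timeout=10):
    """Download a page and return its raw bytes, raising if the server reports an error."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    return response.content
```

With that defined, `fetch_page(kotakuNewsSite)` would return the same bytes as `myResult.content`, or stop with an error if the request failed.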

In [7]:
mySoup = BeautifulSoup(myResult.content, 'html.parser')

In [8]:
myHeadlines = mySoup.find_all('h2')
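
`find_all` returns a list-like collection of every matching tag, which is why `len()` and `[0]` work on it.  Here is a small sketch on a made-up scrap of HTML (not Kotaku's real markup) that also shows `select`, which takes a CSS selector for the times when a plain tag name grabs too much:

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only.
sample_html = """
<article><h2>First headline</h2></article>
<article><h2>Second headline</h2></article>
<aside><h2>Newsletter signup</h2></aside>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

all_h2 = sample_soup.find_all("h2")            # every <h2> in the document
article_h2 = sample_soup.select("article h2")  # only <h2> tags inside an <article>

print(len(all_h2), len(article_h2))
```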

In [9]:
headlineQuantity = len(myHeadlines)

In [10]:
headlineQuantity # the site usually includes 20 on a page...

20

Perfect.  Let's test it by looking at the first headline.  Don't forget the first headline will be headline number 0 because computers <3 U:

In [11]:
print(myHeadlines[0])

<h2 class="sc-759qgu-0 cAqwZL cw4lnv-6 fqqIge"><i>Elden Ring</i>'s Legendary '50 Hit' Illusory Wall, Murdered In The Big Patch (RIP)</h2>


That's not very useful.  Luckily, every `BeautifulSoup` tag has a built-in `.text` attribute that drops the HTML and hands back just the words.  To get at it, we use what is called "dot syntax": put a dot after the object, then name the thing you want from it.  It is kind of like adding a degree of specificity to a request.

In [12]:
print(myHeadlines[0].text)

Elden Ring's Legendary '50 Hit' Illusory Wall, Murdered In The Big Patch (RIP)


YESSSSSSSSS.  (That's "yes" in Python).  Let's collect them all!

In [13]:
for i in range(headlineQuantity):
    print(myHeadlines[i].text)

Elden Ring's Legendary '50 Hit' Illusory Wall, Murdered In The Big Patch (RIP)
Robotech Board Game Fires Enormous Missile Barrage Into Kickstarter Funding Goal
Battlefield 2042 Is Now A Little Bit Better
Snoop Dogg Is Now Playable In Call Of Duty
Lego Star Wars DLCs Add Rogue One, Classic '90s Minifigs
GTA V’s Next-Gen Ports Remove Some Transphobic Content
Nintendo Worker Files Complaint With National Labor Relations Board
Elden Ring Patch Patches Patches
Tiny Tina’s Wonderlands Already Has DLC, Will Never Let You Go
New World Of Warcraft Expansion Lets You Fly Dragons
Elden Ring Patch Really Wants Folks To Finally Notice The Damn Tutorial
Xbox Game Pass’ April Lineup Is Kind Of A Letdown
Elden Ring's Biggest Patch Yet Nerfs Bleed, Fixes Huge Bugged Quest, And Way More
Kim Kardashian Threatens To Sue Roblox Over Sex Tape Ad, Creator Banned
Report: Sega Developing New, 'Big-Budget' Jet Set Radio & Crazy Taxi Games
Former Xbox Boss Remembers Microsoft Trying To Buy Blizzard, Westwood Bac
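
Python can also loop over the headlines directly, without counting up to `headlineQuantity` first.  A sketch of that style, again on made-up HTML so it runs without touching the network (`sample_soup` and `headline_texts` are names invented for this example):

```python
from bs4 import BeautifulSoup

sample_soup = BeautifulSoup(
    "<h2>Headline one</h2><h2>Headline two</h2>", "html.parser"
)

# Loop over the tags themselves instead of over index numbers.
for headline in sample_soup.find_all("h2"):
    print(headline.text)

# Or build a plain list of strings in one line.
headline_texts = [headline.text for headline in sample_soup.find_all("h2")]
```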

Fun!  Obviously.  But can we dump that data to a text file so that we can have it for all eternity?

Sure, why not?

In [14]:
myFile = open("newHeadlines.txt", "a+")

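"w" is "write new file" (wiping out whatever was already there); "a" is "append to file".  The "+" in "a+" means we can also read from the file, and either way Python creates the file if it doesn't exist yet.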

In [15]:
for i in range(headlineQuantity):
    myFile.write(myHeadlines[i].text)
    myFile.write("\n")

In [16]:
myFile.close()
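
Forgetting that last `close()` is an easy mistake, so here is a sketch of the same idea using a `with` block, which closes the file for you.  The headlines below are stand-in data and the temporary folder is only there so the example doesn't touch real files; in the notebook you would loop over `myHeadlines` and use `.text` as before.

```python
import os
import tempfile

# Stand-in data; in the notebook this would come from myHeadlines.
sample_headlines = ["Elden Ring Patch Patches Patches", "GTA V’s Next-Gen Ports"]

# Write into a temporary folder so the example doesn't clobber anything.
sample_path = os.path.join(tempfile.mkdtemp(), "newHeadlines.txt")

# The with block closes the file automatically, even if an error occurs.
# encoding="utf-8" keeps curly apostrophes like ’ intact on every platform.
with open(sample_path, "a", encoding="utf-8") as sample_file:
    for headline in sample_headlines:
        sample_file.write(headline + "\n")
```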