# Why Might You Webscrape?

When you're faced with a data science project, you may want to use outside data resources found on the web. For example, you may want to use news headlines for sentiment analysis, historical weather data for a forecasting project, or online shop prices for your personal product sniper. However, although many websites present data in a pleasant, easy-to-read manner, the data is often not downloadable, which makes it difficult to use in a data science project. What to do now?

That's where webscraping comes in. If you can webscrape, you don't necessarily need the data to be downloadable. You just need the data to be presented in a neat, organized manner and your webscraper will take care of the rest. Let's take a look at some examples below.

# Imports

Like most Python projects, webscraping requires you to install a few libraries. The first is the `requests` library, which helps you access data from a website given a specific URL. The second is the `BeautifulSoup` library (installed as the `beautifulsoup4` package and imported from `bs4`), which helps you parse and read the data. The final library is `pandas`, which you may already be familiar with. It helps you organize the data and export it to your customary CSV file. To import the libraries, run the import commands below.

If you receive an error along the lines of "No module named ...", try running:
- `!pip install requests`
- `!pip install beautifulsoup4`

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Scraping News Headlines

First, we will scrape news headlines from https://www.nbcnews.com/business. Feel free to visit the website and take a look at the headlines available.

In the code cell below, we define the URL we wish to scrape and then use `requests.get()` to download the page.

In [2]:
nbc_business = "https://www.nbcnews.com/business"
res = requests.get(nbc_business)
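Before parsing, it can help to confirm that the request actually succeeded. The sketch below is one possible way to do that and is not part of the original notebook: `fetch_html` is a hypothetical helper name and the 10-second timeout is an arbitrary choice. `raise_for_status()` raises an exception for 4xx and 5xx responses. The second half builds a fake 404 response by hand so the behavior can be seen without touching the network.

```python
import requests


def fetch_html(url, headers=None, timeout=10):
    """Download a page and return its raw bytes, failing loudly on HTTP errors.

    Hypothetical helper; the timeout value is an arbitrary assumption.
    """
    res = requests.get(url, headers=headers, timeout=timeout)
    res.raise_for_status()  # raises requests.exceptions.HTTPError on 4xx/5xx
    return res.content


# Offline demonstration: a hand-built response object with a 404 status code.
fake = requests.Response()
fake.status_code = 404

try:
    fake.raise_for_status()
    request_failed = False
except requests.exceptions.HTTPError:
    request_failed = True

print(request_failed)
```

With a helper like this, a blocked or missing page stops the notebook with a clear error instead of silently producing an empty list of headlines later on.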

Next, we will use the result from `requests.get()` and feed it into `BeautifulSoup()` to create an object we can parse.

In [3]:
soup = BeautifulSoup(res.content, 'html.parser')
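To get a feel for what a parsed object holds, here is a small offline sketch. The HTML string is made up for illustration and is not taken from the NBC page, and the variable names are our own.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page.
sample_html = """
<html>
  <head><title>Example Business News</title></head>
  <body>
    <h1>Top stories</h1>
    <p>First paragraph.</p>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Tags can be reached by name; get_text() returns the text they contain.
page_title = sample_soup.title.get_text()
first_heading = sample_soup.h1.get_text()

print(page_title)
print(first_heading)

# prettify() returns the parsed document as an indented string,
# which is handy for skimming the structure of a page.
print(sample_soup.prettify())
```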

At this time, it would be a good idea to turn our attention back to the website we're scraping. On the news site, hover over a headline, right-click, and click Inspect. You should see HTML code pop up with a line highlighted. The HTML code represents the source code for the webpage we're on and the highlighted portion represents the code specifically for the news headline. Pretty neat!

It's not super important to understand what's going on with the HTML right now. Just know that there are tags such as `<span> </span>` and attributes such as `class="tease-card__headline"`. For webscraping purposes, you want to find these tags and their identifying attributes, because the `find_all()` function uses them to pick out the matching elements, as shown below.

In [4]:
headlines = soup.find_all('span', {'class': 'tease-card__headline'})
headlines += soup.find_all('h2', {'class': 'wide-tease-item__headline'})
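To see what `find_all()` returns, the sketch below runs the same kind of query on a made-up HTML snippet. The markup and headlines are invented; only the class names come from the cell above. Each result is a tag object rather than a string, so `get_text()` is used to pull out the headline text. The `class_=` keyword is an equivalent way to write the dictionary filter.

```python
from bs4 import BeautifulSoup

# Invented markup that reuses the two class names from the cell above.
sample_html = """
<div>
  <span class="tease-card__headline">Markets rally on earnings</span>
  <span class="byline">Staff writer</span>
  <h2 class="wide-tease-item__headline">  New rules for lenders </h2>
</div>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

# Dictionary form of the attribute filter.
found = sample_soup.find_all("span", {"class": "tease-card__headline"})
# Keyword form of the same kind of filter.
found += sample_soup.find_all("h2", class_="wide-tease-item__headline")

# strip=True trims the whitespace around each headline.
headline_text = [tag.get_text(strip=True) for tag in found]
print(headline_text)
```

Note that the `byline` span is skipped because its class does not match either filter.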

Now that we have grabbed all of our news headlines, we only need to put them into a pandas dataframe and create a csv. The following code cell does just that.

In [5]:
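# Extract the text of each headline tag before building the dataframe
data_nbc = pd.DataFrame([h.get_text(strip=True) for h in headlines])
data_nbc.to_csv("headlines.csv")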

In [6]:
data_nbc

Unnamed: 0,0
0,Airbnb changes booking process for travelers
1,"In Georgia’s midterms, a growing health care c..."
2,How American families are dealing with the new...
3,"A recession is coming, economists say. When it..."
4,Evictions are piling up across the U.S. as Cov...
5,A seismic grocery merger faces major oppositio...
6,Hong Kong is inviting back the (business) worl...
7,Inside Twitter's chaotic short-notice layoffs
8,Advertisers pull back from Twitter amid 'uncer...
9,How pay transparency laws set the stage for an...
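The `Unnamed: 0` column in the output above is most likely the dataframe index, which `to_csv()` writes by default. Below is a minimal sketch of one way to leave it out and give the column a readable name. The column name `headline` and the two headlines are made up for illustration, and the CSV is written to an in-memory buffer rather than a file.

```python
import io

import pandas as pd

# Made-up headlines standing in for the scraped text.
sample_headlines = ["Markets rally on earnings", "New rules for lenders"]

# A dictionary gives the column a name instead of the default 0.
headline_df = pd.DataFrame({"headline": sample_headlines})

# index=False keeps the row numbers out of the CSV.
buffer = io.StringIO()
headline_df.to_csv(buffer, index=False)

csv_lines = buffer.getvalue().splitlines()
print(csv_lines)
```

Passing a filename such as `"headlines.csv"` instead of the buffer writes the same content to disk.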


# Scraping Weather Data

Next, we're going to try scraping weather data, specifically the weather of Berkeley. Please visit https://forecast.weather.gov/MapClick.php?lat=37.8699&lon=-122.2705#.Y2n19eTMJD8 to see the site we are scraping. Again, the following code cell will grab the data from the appropriate URL and then build a parser using BeautifulSoup.

In [7]:
url = "https://forecast.weather.gov/MapClick.php?lat=37.8699&lon=-122.2705#.Y2n19eTMJD8"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

This time, we wish to grab multiple columns of data: the time period, a short description of the weather, and the high or low temperature. If you take a look at the HTML code, you'll see that the data we want is nested inside `<div>` tags identified by the attribute `class="tombstone-container"`. The following code block first finds every `tombstone-container` and then grabs the text of the `<p>` elements with the classes `period-name`, `short-desc`, and `temp` nested within each container.

In [15]:
items = soup.find_all("div", {"class": "tombstone-container"})
period_name = [item.find("p", {"class": "period-name"}).get_text() for item in items]
short_desc = [item.find("p", {"class": "short-desc"}).get_text() for item in items]
temp = [item.find("p", {"class": "temp"}).get_text() for item in items]

Now we convert the scraped data into a dataframe and a CSV file. The code cell below organizes all of our data and outputs the CSV.

In [16]:
df = pd.DataFrame({"Period" : period_name, "Short Description" : short_desc, "Temperature" : temp})
df.to_csv("weather.csv")
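The `Temperature` column holds strings such as `Low: 51 °F`. If numbers are needed for a forecasting project, one option is to split the label from the value with a regular expression. The sketch below uses a few hand-typed rows in the same format as the scraped values; the new column names `Kind` and `Temp_F` are our own assumptions.

```python
import pandas as pd

# Hand-typed rows in the same format as the scraped values.
weather = pd.DataFrame({
    "Period": ["Tonight", "Tuesday", "TuesdayNight"],
    "Temperature": ["Low: 51 °F", "High: 54 °F", "Low: 49 °F"],
})

# Named groups become the column names of the extracted dataframe.
parts = weather["Temperature"].str.extract(r"(?P<Kind>High|Low):\s*(?P<Temp_F>-?\d+)")

weather["Kind"] = parts["Kind"]
weather["Temp_F"] = parts["Temp_F"].astype(int)  # digits as strings -> integers

print(weather)
```

The `astype(int)` call assumes every row matched the pattern; rows that do not match would come back as missing values and need handling first.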

In [17]:
df

Unnamed: 0,Period,Short Description,Temperature
0,Tonight,Showers,Low: 51 °F
1,Tuesday,Showers,High: 54 °F
2,TuesdayNight,ShowersLikely,Low: 49 °F
3,Wednesday,ChanceShowers,High: 54 °F
4,WednesdayNight,Partly Cloudy,Low: 42 °F
5,Thursday,Mostly Sunny,High: 55 °F
6,ThursdayNight,Mostly Clear,Low: 40 °F
7,VeteransDay,Partly Sunny,High: 55 °F
8,FridayNight,Mostly Cloudy,Low: 42 °F
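Values such as `TuesdayNight` and `ShowersLikely` in the output above have lost the space between words. A likely cause, which we are assuming rather than confirming here, is that the page separates the words with a `<br>` tag, and `get_text()` joins the pieces with nothing in between. The sketch below reproduces the effect on a made-up snippet and shows the `separator` argument of `get_text()` as one way around it.

```python
from bs4 import BeautifulSoup

# Made-up snippet: two words separated by a line-break tag (an assumption
# about how the forecast page is marked up).
sample_html = '<p class="period-name">Tuesday<br>Night</p>'
tag = BeautifulSoup(sample_html, "html.parser").find("p", {"class": "period-name"})

joined = tag.get_text()                  # pieces are joined with no separator
spaced = tag.get_text(" ", strip=True)   # pieces are joined with a single space

print(repr(joined))
print(repr(spaced))
```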


# Scraping an Online Store

For our last exercise, we will scrape an online store for prices, ratings, and product descriptions. We will be scraping https://www.thewhiskyexchange.com if you would like to take a gander. The code cell below defines the URL to visit as well as a header for our requests. This header is important because websites sometimes try to block suspicious traffic, such as the rapid consecutive requests a webscraper makes. To appear like a normal user, the header sets the `User-Agent` field that a browser fills out when normally browsing the web.

In [33]:
baseurl = "https://www.thewhiskyexchange.com"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36'}
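When many requests share the same header, one option is a `requests.Session`, which stores the headers once and sends them with every call made through it. This is a sketch of an alternative, not what the notebook cells do, and the shortened `User-Agent` string is a placeholder.

```python
import requests

# Placeholder value; a real browser string is much longer.
headers = {"User-Agent": "Mozilla/5.0 (placeholder browser string)"}

session = requests.Session()
session.headers.update(headers)  # merged into the session's default headers

# Every call made through the session would now send this User-Agent, e.g.
# page = session.get("https://www.thewhiskyexchange.com", timeout=10)

print(session.headers["User-Agent"])
```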

For this webscraping exercise, there are actually multiple URLs we want to visit, namely one for each page of products. Additionally, we don't just want to visit the search result pages. We want to visit each individual product's page and grab the relevant data. To do this, we first scrape the relevant links using the for loop below. Notice how we change the URL on each iteration of the loop to scrape a different page of products.

In [34]:
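productlinks = []
data = []
c = 0
# Visit listing pages 1 through 5 and collect the link to every product on each page
for x in range(1,6):
    page_url = 'https://www.thewhiskyexchange.com/c/35/japanese-whisky?pg={}&psize=24&sort=pasc'.format(x)
    # Send the browser-like headers defined above with every request
    k = requests.get(page_url, headers=headers).text
    soup = BeautifulSoup(k, 'html.parser')
    productlist = soup.find_all("li", {"class": "product-grid__item"})

    for product in productlist:
        # Combine each product's href with the base URL to form a full link
        link = product.find("a", {"class": "product-card"}).get('href')
        productlinks.append(baseurl + link)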
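In the loop above, `range(1,6)` yields pages 1 through 5, and each page number is substituted into the `pg=` parameter of the URL. The sketch below builds those URLs without making any requests. It also shows `urljoin` from the standard library as a possible alternative to `baseurl + link`, since it copes with both relative and absolute links. The product paths are hypothetical.

```python
from urllib.parse import urljoin

baseurl = "https://www.thewhiskyexchange.com"
listing = baseurl + "/c/35/japanese-whisky?pg={}&psize=24&sort=pasc"

# One URL per page of results, pages 1 through 5.
page_urls = [listing.format(page) for page in range(1, 6)]
for page_url in page_urls:
    print(page_url)

# Hypothetical relative link, in the shape an <a href="..."> might have.
relative_href = "/p/12345/example-whisky"
product_url = urljoin(baseurl, relative_href)

# A link that is already absolute is returned unchanged.
absolute_href = "https://www.thewhiskyexchange.com/p/67890/another-example"
same_url = urljoin(baseurl, absolute_href)

print(product_url)
print(same_url)
```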

Now that we have a list of product URLs, we can scrape the data for each product. The following for loop does just that, scraping each product's price, description, rating, and name and then saving them to a dictionary. Note that this code cell may take a while to run, because it makes one request per product.

In [35]:
for link in productlinks:
    f = requests.get(link,headers=headers).text
    hun=BeautifulSoup(f,'html.parser')

    # find() returns None when an element is missing, so calling .text on it
    # raises AttributeError; in that case we record the field as None.
    try:
        price=hun.find("p", {"class": "product-action__price"}).text.replace('\n',"")
    except AttributeError:
        price = None

    try:
        about=hun.find("div", {"class": "product-main__description"}).text.replace('\n',"")
    except AttributeError:
        about=None

    try:
        rating = hun.find("div", {"class": "review-overview"}).text.replace('\n',"")
    except AttributeError:
        rating=None

    try:
        name=hun.find("h1", {"class": "product-main__name"}).text.replace('\n',"")
    except AttributeError:
        name=None

    whisky = {"name": name, "price": price, "rating": rating, "about": about}

    data.append(whisky)
    # Running count of the products processed so far
    c = c + 1
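The four lookups in the loop above follow the same pattern, so they could be folded into a small helper. This is a sketch of one possible refactoring, not the notebook's own code: `text_or_none` and `FIELDS` are hypothetical names, and the sample page is made up so the example runs offline. The class names are the ones used in the loop.

```python
from bs4 import BeautifulSoup


def text_or_none(soup, tag, class_name):
    """Return the text of the first matching element, or None if it is absent."""
    element = soup.find(tag, {"class": class_name})
    if element is None:
        return None
    return element.get_text().replace("\n", "")


# Field name -> (tag, class), using the class names from the loop above.
FIELDS = {
    "name": ("h1", "product-main__name"),
    "price": ("p", "product-action__price"),
    "rating": ("div", "review-overview"),
    "about": ("div", "product-main__description"),
}

# Made-up product page that has no rating block.
sample_html = """
<h1 class="product-main__name">Example Whisky</h1>
<p class="product-action__price">£10</p>
<div class="product-main__description">A made-up description.</div>
"""
page = BeautifulSoup(sample_html, "html.parser")

whisky = {field: text_or_none(page, tag, cls) for field, (tag, cls) in FIELDS.items()}
print(whisky)
```

Checking for `None` explicitly does the same job as the `try`/`except` blocks, and adding a new field only takes one more entry in the dictionary.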

Finally, let's organize all of our data and output to a csv. Running the following code cell will do just that.

In [36]:
df = pd.DataFrame(data)
df.to_csv("prices.csv")

In [37]:
df

Unnamed: 0,name,price,rating,about
0,Suntory Toki,£33.95,4(34 Reviews),Toki is a blended whisky from Suntory's three ...
1,Shinju Japanese Whisky,£37.45,,"A sweet, fruity Japanese whisky from Shinju th..."
2,The Chita Whisky,£51.95,4.5(42 Reviews),Chita is Suntory's grain distillery. The flags...
3,Nikka Coffey Grain Whisky,£57.95,4.5(52 Reviews),"A release of grain whisky from Japan's Nikka, ..."
4,Shinju 8 Year Old Japanese Whisky,£59.75,,"A sweet, fruity eight-year-old whisky from Shi..."
...,...,...,...,...
115,Yamazaki 12 Year Old,£150,4.5(94 Reviews),One of the first Japanese single malts to brea...
116,Akashi 5 Year OldRed Wine Cask,£160,,"A single malt from the White Oak distillery, r..."
117,Mars Komatgatake IPA FinishBot.2021,£175,,A limited-edition Mars Komagatake Japanese sin...
118,Shizuoka Contact S Single Malt3 Year Old,£175,,The third whisky to come out of Shizuoka disti...
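The scraped `price` and `rating` values are still strings such as `£33.95` and `4.5(94 Reviews)`. For something like a price tracker, they usually need to be numbers. The sketch below shows one way to convert them, using formats seen in the output above. `parse_price` and `parse_rating` are hypothetical helper names, and the regular expression assumes the rating always looks like `score(count Reviews)`.

```python
import re


def parse_price(text):
    """Convert a string such as '£33.95' to a float; non-strings give None."""
    if not isinstance(text, str):
        return None
    return float(text.replace("£", "").replace(",", "").strip())


def parse_rating(text):
    """Split '4.5(94 Reviews)' into (4.5, 94); missing ratings give (None, None)."""
    if not isinstance(text, str):
        return None, None
    match = re.search(r"([\d.]+)\s*\((\d+)\s+Reviews?\)", text)
    if match is None:
        return None, None
    return float(match.group(1)), int(match.group(2))


print(parse_price("£33.95"))
print(parse_rating("4.5(94 Reviews)"))
print(parse_rating(None))
```

These could be applied to the dataframe with, for example, `df["price"].map(parse_price)`.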


# Conclusion

That's all from us, folks. Today, we scraped news headlines, weather data, and online shopping prices. Hopefully you now feel comfortable scraping the web. Have fun using this skill to your advantage, and remember to scrape responsibly.
