## Part I - Getting and parsing a text corpus

In [127]:
import pandas as pd
df = pd.read_json('data/News_Category_Dataset.json', lines=True, orient='records')
df.head()

| | short_description | headline | date | link | authors | category |
|---|---|---|---|---|---|---|
| 0 | She left her husband. He killed their children... | There Were 2 Mass Shootings In Texas Last Week... | 2018-05-26 | https://www.huffingtonpost.com/entry/texas-ama... | Melissa Jeltsen | CRIME |
| 1 | Of course it has a song. | Will Smith Joins Diplo And Nicky Jam For The 2... | 2018-05-26 | https://www.huffingtonpost.com/entry/will-smit... | Andy McDonald | ENTERTAINMENT |
| 2 | The actor and his longtime girlfriend Anna Ebe... | Hugh Grant Marries For The First Time At Age 57 | 2018-05-26 | https://www.huffingtonpost.com/entry/hugh-gran... | Ron Dicker | ENTERTAINMENT |
| 3 | The actor gives Dems an ass-kicking for not fi... | Jim Carrey Blasts 'Castrato' Adam Schiff And D... | 2018-05-26 | https://www.huffingtonpost.com/entry/jim-carre... | Ron Dicker | ENTERTAINMENT |
| 4 | The "Dietland" actress said using the bags is ... | Julianna Margulies Uses Donald Trump Poop Bags... | 2018-05-26 | https://www.huffingtonpost.com/entry/julianna-... | Ron Dicker | ENTERTAINMENT |


Let's start by looking at these URLs and making sure they're valid before we bother fetching them all. They're all supposed to be from huffingtonpost.com (according to Kaggle, where I got this dataset), so we'll check that with a regular expression before we move on.

In [128]:
import re
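# Capture everything between the first "www." and the last ".com" (the match is greedy)
p = re.compile(r'(?<=www\.).+(?=\.com)')

def get_root(url):
    """Return the text between 'www.' and '.com', or None if the URL has no such span."""
    tmp = p.search(url)
    return None if tmp is None else tmp.group()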

df.link.apply(get_root).unique()[:5]

array(['huffingtonpost', 'huffingtonpost.comhttps://www.outsports',
       'huffingtonpost.comhttp://www.newnownext',
       'huffingtonpost.comhttps://www.theguardian',
       'huffingtonpost.comhttp://www.citypages'], dtype=object)
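
The strange-looking roots above are a side effect of the greedy `.+` in the pattern: it runs to the last `.com` in the string, so a URL that contains two addresses yields everything between the first `www.` and the final `.com`. A minimal sketch of the difference, using one of the malformed links from this dataset (the `lazy` pattern is only an illustrative alternative, not what this notebook uses):

```python
import re

# Same pattern as get_root(): greedy, so it extends to the LAST ".com"
greedy = re.compile(r'(?<=www\.).+(?=\.com)')
# Illustrative alternative: lazy, so it stops at the FIRST ".com"
lazy = re.compile(r'(?<=www\.).+?(?=\.com)')

url = ('https://www.huffingtonpost.comhttp://www.politico.com/story/2016/06/'
       'hardly-anybody-wants-to-speak-at-trumps-convention-224815')

greedy_root = greedy.search(url).group()
lazy_root = lazy.search(url).group()

print(greedy_root)  # huffingtonpost.comhttp://www.politico
print(lazy_root)    # huffingtonpost
```

The greedy behaviour is actually convenient here, because it makes most doubled-up links stand out as anything other than plain `huffingtonpost`.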

Clearly some of these links don't seem valid, so let's look at a few at random to get a better sense of what's wrong with them.

In [129]:
import numpy as np
df['root'] = df.link.apply(get_root)
mismatches = df[df.root!='huffingtonpost']['link']
print(f"Total number of news articles = {len(df)}\nNumber of mismatches = {len(mismatches)}\n")

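# "Double links": a second http:// or https:// scheme after the leading https://.
# Use a new name so the pattern that get_root() relies on is not overwritten.
double_p = re.compile(r'(?<=https://).+(?=(http://|https://))')
double_link_idx = df[df['link'].apply(lambda x: double_p.search(x) is not None)].index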
print(f"{len(double_link_idx)} 'double links'.")

for idx in np.random.choice(range(len(mismatches)), size=10):
    print(mismatches.iloc[idx])

Total number of news articles = 124989
Number of mismatches = 3774

4785 'double links'.
https://www.huffingtonpost.comhttps://medium.com/matter/on-gawker-s-problem-with-women-f1197d8c1a4e
https://www.huffingtonpost.comhttp://www.usmagazine.com/celebrity-news/news/daisy-fuentes-and-richard-marx-couldnt-be-happier-after-romantic-wedding-photos-w160414
https://www.huffingtonpost.comhttp://www.vh1.com/news/213630/dmc-big-freedia-and-more-join-vh1s-special-love-hip-hop-out-in-hip-hop/
https://www.huffingtonpost.comhttp://www.realclearpolitics.com/articles/2015/08/13/americas_crush_on_political_outsiders_summer_fling_or_real_deal_127768.html
https://www.huffingtonpost.comhttp://www.politico.com/story/2016/06/hardly-anybody-wants-to-speak-at-trumps-convention-224815
https://www.huffingtonpost.comhttps://www.thedodo.com/bear-woman-cube-1257985982.html
https://www.huffingtonpost.comhttp://www.theatlantic.com/politics/archive/2016/05/the-path-to-a-trump-presidency/482796/
https://www.huffington

Having run the code above a few times, it appears that all of these (or close to it) are links to another website, some with a HuffPo address prepended. I could pull out the real link with another regex pretty easily, but there are only about 8,000 problem links out of ~125,000 articles total. Besides, I don't really want to worry about the robots.txt policies for each of these websites when I can just scrape HuffingtonPost (which has a very permissive policy). So I'm just going to drop these and move on with the scraping.
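
For reference, recovering the embedded link is roughly a one-regex job. The sketch below assumes the real URL is always the last `http://` or `https://` address in the string; `extract_real_link` is a hypothetical helper and is not used anywhere else in this notebook, which simply drops these rows.

```python
import re

# Match an http(s):// address that contains no further http(s):// scheme
# and runs to the end of the string, i.e. the LAST URL in the string.
LAST_URL = re.compile(r'https?://(?:(?!https?://).)*$')

def extract_real_link(link):
    """Return the last URL embedded in `link`, or None if there is none (hypothetical helper)."""
    match = LAST_URL.search(link)
    return match.group() if match else None

doubled = 'https://www.huffingtonpost.comhttps://www.thedodo.com/bear-woman-cube-1257985982.html'
print(extract_real_link(doubled))
```

A well-formed link with a single scheme is returned unchanged, so the helper could be applied to the whole column without special-casing.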

In [130]:
#Drop the non-HuffPo URLs
df.drop(mismatches.index, inplace=True)
df.drop([x for x in double_link_idx if x not in mismatches.index], inplace=True)

Now since there are ~120,000 articles and I have other things to do today than watch requests fly one at a time, I'll be using asyncio to get the text of these links. https://www.huffpost.com/robots.txt doesn't list any pages-per-minute limit, nor does the /entry/* route appear on its disallow list, so I didn't expect to need any special rate limiting (in practice there did appear to be some throttling to deal with, which is described in the blog post related to this notebook at alanpenkar.com). The actual scraping is done in the file 'get_articles.py', since Jupyter and asyncio don't play well together: they both use the event loop. Feel free to look at that code directly; below is a light breakdown of the HTML parsing.

I started with one link taken at random to test out the parsing. From the inspector in Chrome, it looks like I want to capture all the text within a section tag with id="entry-body". The pages came in a couple of different layouts with regard to structure and classes, so the process was iterative: run the scraper on a random set of pages, look at the parsing errors, and examine their format manually with Chrome Dev Tools and the BeautifulSoup code below.
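
To give a rough idea of the concurrency pattern, here is a minimal sketch of bounded-concurrency fetching with asyncio. It is not the code in 'get_articles.py': `fetch_all`, `fake_fetch` and `max_concurrent` are hypothetical names, and `fake_fetch` stands in for a real HTTP request so the example runs offline. A semaphore caps the number of requests in flight, which is one simple way to stay under a server-side throttle.

```python
import asyncio

async def fetch_all(urls, fetch, max_concurrent=10):
    """Run `fetch(url)` for every URL with at most `max_concurrent` calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        async with semaphore:
            try:
                return url, await fetch(url)
            except Exception as exc:
                # Record the failure instead of aborting the whole batch
                return url, exc

    # gather() returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(u) for u in urls))

async def fake_fetch(url):
    # Stand-in for a real request made with an async HTTP client (assumption)
    await asyncio.sleep(0)
    return f"<html>{url}</html>"

results = asyncio.run(fetch_all(["a", "b", "c"], fake_fetch, max_concurrent=2))
print(results)
```

Run this as a script rather than in a notebook cell, since `asyncio.run` cannot be called while another event loop is already running.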

In [93]:
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
ua = UserAgent()

response = requests.get('https://www.huffingtonpost.com/entry/avocado-feel-full-overweight-lunch_us_5b9dc55de4b03a1dcc8cae44', headers={'User-Agent':ua.random})
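# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(response.text, 'html.parser')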
string = " ".join(soup.find("section", {"id":"entry-body"}).stripped_strings)

string

'Avocado lovers, good news for you. A small new study in  the Nutrition Journal  shows that eating half of a Hass avocado at lunchtime can help overweight adults feel full for longer in the hours following. This could be useful in combating hunger pangs between lunch and dinner -- which can lead to the urge to eat unhealthy snacks. The study was funded by the Hass Avocado Board, but the board did not have a role in the design or conduct of the study, nor the interpretation of the results. "Avocados are a very popular and delicious fruit, and from the results of our study, may also be helpful for people who are looking to better manage their weight," study researcher Dr. Joan Sabate, a professor of nutrition at Loma Linda University School of Public Health, said in a statement. For the study, 26 people ages 25 to 65 who were overweight and moderately obese (with a body mass index higher than 25 but less than 35) were given the same breakfast for three days. Then for lunch, the participa

The parsing logic above worked for most pages, but about 5% of pages had parsing issues. The code below managed to handle those cases. If you look closely, you will see that the parsing leaves some strange artifacts such as '\x0a'. Those will be handled during text preprocessing in the next notebook.

In [91]:
response = requests.get('https://www.huffingtonpost.com/entry/texas-amanda-painter-mass-shooting_us_5b081ab4e4b0802d69caad89', headers={'User-Agent':ua.random})
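# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(response.text, 'html.parser')
# This layout splits the article across several <div class="content-list-component"> blocks
string = " ".join([''.join(x.stripped_strings) for x in soup.find_all("div", {"class":"content-list-component"}) ])
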
string

'DENTON, Texas―Amanda Painter sat at the kitchen table in an unfamiliar apartment with an absurd dilemma: She had nothing to wear to a vigil for her three dead children. Her clothes were at home, but her home was now a crime scene. Less than 100 hours after her children were murdered, Amanda, 29, found herself in a Walmart, hobbling down the aisles, the gunshot wound to her neck concealed by a gauze bandage. She found a suitable purple shirt and kept her head down. She hadn’t expected to see so many children at the store, laughing and playing. For eight years, Amanda answered to “Mommy.” Now, her babies ― Odin, 8, Caydence, 6, and Drake, 4 ― were gone. Each time she closed her eyes, even to blink, they returned. Last week, Amanda’s ex-husband, Justin Painter, 39, entered her home in Ponder, Texas,and fatally shot her boyfriend, Seth Richardson, 29, her three children and then himself. He intentionally kept Amanda alive, he told her, to live with the pain. It was an unthinkable tragedy.
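
The two layouts above can be folded into a single function that tries the first selector and falls back to the second. This is a minimal sketch under that assumption; `extract_text` is a hypothetical name and not necessarily how 'get_articles.py' is organised.

```python
from bs4 import BeautifulSoup

def extract_text(html):
    """Return the article text from either page layout, or None if neither matches."""
    soup = BeautifulSoup(html, "html.parser")

    # Layout 1: the whole article sits inside <section id="entry-body">
    body = soup.find("section", {"id": "entry-body"})
    if body is not None:
        return " ".join(body.stripped_strings)

    # Layout 2: the article is split across <div class="content-list-component"> blocks
    parts = soup.find_all("div", {"class": "content-list-component"})
    if parts:
        return " ".join("".join(p.stripped_strings) for p in parts)

    # Neither layout matched, e.g. the page no longer exists
    return None

layout_1 = '<section id="entry-body"><p>Hello</p><p>world</p></section>'
layout_2 = ('<div class="content-list-component"><p>First.</p></div>'
            '<div class="content-list-component"><p>Second.</p></div>')
print(extract_text(layout_1))
print(extract_text(layout_2))
```

Returning None for unrecognised pages makes it easy to collect the failures afterwards and inspect their structure by hand.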

On inspection, several of these pages (like the one below) don't exist anymore.  However, that'll get taken care of in the preprocessing notebook as well.

In [139]:
df.loc[76120,'link']

'https://www.huffingtonpost.com/entry/taxidermist-immortalizes-michigan-state-win-with-chipmunks_us_56328a38e4b0c66bae5bcf89'