# Data Acquisition

To compile a list of news stories and articles about the **transgender** topic from the past year, I used three different sources:

1. NewsAPI.org - A free but limited API for compiling news articles related to a single topic. I scraped a reasonable number of stories from here.
2. WorldNewsAPI.com - Another free news API, with a global scope and a greater free query limit. It also provided some interesting metadata for each article, such as a sentiment analysis run by the WorldNews organization.
3. GroundNews.com - A paid news service which provides a wealth of interesting metadata and analysis on the news stories it covers. I have a paid subscription to this service, and was able to scrape a decent number of news articles, along with pertinent metadata.

Two of these sources were APIs, which gave me the text in a format that was relatively easy to clean. The third, GroundNews, was more of a challenge. Scraping the article summaries and metadata proved relatively simple, but scraping the source texts from the original stories was trickier, and I had to get creative to glean usable text data from those sources.

Let's start with the APIs.

**Important**: I've modified this code from the original .py files in two ways. First, I've added some comments. Second, I replaced the final line so that it shows the head of the dataframe rather than writing it to CSV. I've already compiled the data, and I don't want to accidentally overwrite it if this code no longer returns the same results (which it probably won't, since what these APIs return depends on when they are queried).

# News API

For this one, the strategy was relatively simple. I first acquired an API key from the website, then ran the following series of queries to acquire the maximum number of articles possible. The real difficulty was in overcoming the restrictions this API places on free users. I was limited both in the number of requests and in the page number I could ask for in a request. In this case, I was limited to 5 pages per query, and to get around this limitation I reran the query using each available sorting parameter. I then combined these results into a single dataframe containing all metadata and dropped duplicates.

Here's what I did:

In [1]:
import requests
import pandas as pd

df1 = pd.DataFrame(columns=['title','description','content','author','publishedAt','source','url','urlToImage'])
print('running!')

#Changing sorting parameters to get around the free API limitations.
#There is a maximum page limit of 5, limiting the number of results seen even further than would be possible given the 100 queries/day limitation.

#Articles sorted by publish date by default
for num in range(1,6):  
    request_params = {'q':'transgender',
                      'page':f'{num}',
                      'language':'en',
                      'apiKey':'edb887b5553c43aab598452a6335ad0c'}
    response = requests.get('https://newsapi.org/v2/everything',request_params)
    
    json = response.json()
    articles = json['articles']
        
    for article in articles:
        df1.loc[len(df1.index)]=[
        article['title'],
        article['description'],
        article['content'],
        article['author'],
        article['publishedAt'],
        article['source']['id'],
        article['url'],
        article['urlToImage']
        ]
        
df2 = pd.DataFrame(columns=['title','description','content','author','publishedAt','source','url','urlToImage'])

#Same query sorted by relevancy
for num in range(1,6):  
    request_params = {'q':'transgender',
                      'page':f'{num}',
                      'sortBy':'relevancy',
                      'language':'en',
                      'apiKey':'edb887b5553c43aab598452a6335ad0c'}
    response = requests.get('https://newsapi.org/v2/everything',request_params)
    
    json = response.json()
    articles = json['articles']
        
    for article in articles:
        df2.loc[len(df2.index)]=[
        article['title'],
        article['description'],
        article['content'],
        article['author'],
        article['publishedAt'],
        article['source']['id'],
        article['url'],
        article['urlToImage']
        ]
        
df3 = pd.DataFrame(columns=['title','description','content','author','publishedAt','source','url','urlToImage'])

#Same Query sorted by popularity
for num in range(1,6):  
    request_params = {'q':'transgender',
                      'page':f'{num}',
                      'sortBy':'popularity',
                      'language':'en',
                      'apiKey':'edb887b5553c43aab598452a6335ad0c'}
    response = requests.get('https://newsapi.org/v2/everything',request_params)
    
    json = response.json()
    articles = json['articles']
        
    #I kept every piece of information attached to each article in its own column in the raw text dataframe, even if the last two ultimately prove irrelevant.
    #Better to have extra irrelevant data than leave out something important.
    for article in articles:
        df3.loc[len(df3.index)]=[
        article['title'],
        article['description'],
        article['content'],
        article['author'],
        article['publishedAt'],
        article['source']['id'],
        article['url'],
        article['urlToImage']
        ]

df_all = pd.concat([df1,df2,df3]).drop_duplicates()
    
# df_all.to_csv('../../data/newsapi_corpus_raw.csv')

df_all.head()



running!


Unnamed: 0,title,description,content,author,publishedAt,source,url,urlToImage
0,The owner of a restaurant in New York state to...,Some workers at the pizzeria even equated bein...,Lourdes Balduque/Getty Images\r\n<ul><li>Staff...,Grace Dean,2024-01-29T14:33:43Z,business-insider,https://www.businessinsider.com/new-york-resta...,https://i.insider.com/65b796ba6c8f0a134f7aa55e...
1,Man guilty of killing transgender woman in hat...,It was the United States' first federal trial ...,In combo of undated selfie images provided cou...,,2024-02-24T05:40:52Z,,https://www.npr.org/2024/02/24/1233685859/man-...,https://media.npr.org/assets/img/2024/02/24/ap...
2,Utah Joins 10 Other States in Regulating Bathr...,Utah became the latest state to regulate bathr...,Utah became the latest state to regulate bathr...,AMY BETH HANSON / AP,2024-01-31T16:25:37Z,time,https://time.com/6590528/utah-joins-states-reg...,https://api.time.com/wp-content/uploads/2024/0...
3,Killer was moved to Brianna Ghey's school afte...,Scarlett Jenkinson was moved after drugging a ...,Killer Scarlett Jenkinson was moved to a new s...,https://www.facebook.com/bbcnews,2024-02-02T12:30:54Z,bbc-news,https://www.bbc.co.uk/news/uk-68153179,https://ichef.bbci.co.uk/news/1024/branded_new...
4,South Dakota has apologized and must pay $300K...,,"Si vous cliquez sur « Tout accepter », nos par...",,2024-02-06T18:31:04Z,,https://consent.yahoo.com/v2/collectConsent?se...,
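
The three loops above differ only in their `sortBy` value, so the same strategy can be written once. The following is a minimal sketch, not the code that produced the data: the names `article_to_row`, `fetch_page`, and `collect_articles`, and the `NEWSAPI_KEY` environment variable, are hypothetical. It gathers rows in a plain list and drops exact duplicate rows as it goes, which stands in for the `drop_duplicates()` call above.

```python
import os

NEWSAPI_URL = 'https://newsapi.org/v2/everything'
COLUMNS = ['title', 'description', 'content', 'author',
           'publishedAt', 'source', 'url', 'urlToImage']


def article_to_row(article):
    """Flatten one article dict from the response into a tuple ordered like COLUMNS."""
    row = {col: article.get(col) for col in COLUMNS}
    # 'source' is a nested dict; keep only its id, as in the cell above.
    row['source'] = (article.get('source') or {}).get('id')
    return tuple(row[col] for col in COLUMNS)


def fetch_page(page, sort_by=None):
    """Request one page of results. NEWSAPI_KEY is an assumed environment variable."""
    import requests  # imported here so the helpers above work without it installed

    params = {'q': 'transgender', 'page': page, 'language': 'en',
              'apiKey': os.environ['NEWSAPI_KEY']}
    if sort_by is not None:
        params['sortBy'] = sort_by
    response = requests.get(NEWSAPI_URL, params=params, timeout=10)
    response.raise_for_status()
    return response.json().get('articles', [])


def collect_articles(fetch=fetch_page,
                     sort_orders=(None, 'relevancy', 'popularity'),
                     pages=range(1, 6)):
    """Run the same query under each sort order and drop exact duplicate rows."""
    seen = set()
    rows = []
    for sort_by in sort_orders:
        for page in pages:
            for article in fetch(page, sort_by):
                row = article_to_row(article)
                if row not in seen:
                    seen.add(row)
                    rows.append(row)
    return rows
```

Passing the result to `pd.DataFrame(collect_articles(), columns=COLUMNS)` would give a dataframe with the same columns as `df_all`. Reading the key from the environment also keeps it out of the notebook itself.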


# WorldNewsAPI

This API was friendlier to free users than NewsAPI, with much looser limits. Here I was limited to a maximum of 50 *points* per day, with each API request costing at least one point, plus an additional point for every 100 results returned. Even so, this meant that with careful typing and some luck, I could pull over 4,000 articles in one go!

Here was my final code:

In [2]:
import requests
import pandas as pd

df = pd.DataFrame(columns=['title','text','authors','country','sentiment','url'])

#Going to extract a bunch of news articles from the past couple of months from this API as well.
url = 'https://api.worldnewsapi.com/search-news'

#There is a maximum of 100 returned articles per request. Make 31 such requests (offsets 0 through 3000) and append the results of each into a single dataframe.
# o is the request offset. It decides which batch of 100 we're requesting.
for o in range(0,3001,100):
    request_params = {'text':'transgender',
                      'number':'100',
                      'offset':f'{o}',
                      'earliest-publish-date':'2023-09-01',
                      'sort':'publish-time',
                      'language':'en',
                      'api-key':'968a873ef4a14e2fb2acc6a0107ccbea'}
    print(f'requesting articles {o+1}-{o+101}!')
    #get the news list from the response json.
    resp = requests.get(url,request_params)
    json = resp.json()
    articles =json['news']
    #Store the relevant metadata for each article in the corresponding column and append that row to my dataframe.
    for article in articles:
        df.loc[len(df.index)]=[
        article['title'],
        article['text'],
        #Authors is often a list of people, but is sometimes empty. This ensures that this column contains all strings.
        article['authors'][0] if len(article['authors'])>0 else '',
        article['source_country'],
        article['sentiment'],
        article['url']
        ]
#Drop any duplicates I've accidentally acquired
df = df.drop_duplicates()
#df.to_csv('../../data/worldnewsapi_corpus_raw.csv',mode='a')
df.head()

requesting articles 1-101!
requesting articles 101-201!
requesting articles 201-301!
requesting articles 301-401!
requesting articles 401-501!
requesting articles 501-601!
requesting articles 601-701!
requesting articles 701-801!
requesting articles 801-901!
requesting articles 901-1001!
requesting articles 1001-1101!
requesting articles 1101-1201!
requesting articles 1201-1301!
requesting articles 1301-1401!
requesting articles 1401-1501!
requesting articles 1501-1601!
requesting articles 1601-1701!
requesting articles 1701-1801!
requesting articles 1801-1901!
requesting articles 1901-2001!
requesting articles 2001-2101!
requesting articles 2101-2201!
requesting articles 2201-2301!
requesting articles 2301-2401!
requesting articles 2401-2501!
requesting articles 2501-2601!
requesting articles 2601-2701!
requesting articles 2701-2801!
requesting articles 2801-2901!
requesting articles 2901-3001!
requesting articles 3001-3101!


Unnamed: 0,title,text,authors,country,sentiment,url
0,Russia’s Supreme Court effectively outlaws LGB...,Menu Menu World U.S. Politics Sports Entertain...,DASHA LITVINOVA,us,0.311,https://apnews.com/article/russia-lgbtq-crackd...
1,Where Do Trans Rights Stand in Taiwan After Sa...,"An estimated 5,000 people gathered in Ximendin...",Daniel Yo-Ling,us,0.134,https://thediplomat.com/2023/11/where-do-trans...
2,Transgender People&#039;s Neurological Needs A...,"As a transgender neurologist, I advocate for t...",Deneen Broadnax,us,0.12,https://worldnewsera.com/news/science/transgen...
3,"Class 10th, 12th Board Exams Forms: Transgende...",After the Supreme Court recognised transgender...,Deeksha Teri,in,0.06,https://indianexpress.com/article/education/cl...
4,Photos: Protesters squared off in downtown Ottawa,Article content PHOTO GALLERY Thousands of peo...,Lois Kirkup,ca,0.481,https://ottawacitizen.com/news/local-news/phot...
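
The offset loop above can also be written as a small helper that stops as soon as the API returns an empty batch. This is a sketch under assumptions: `page_offsets`, `first_author`, and `collect_news` are hypothetical names, and the `fetch` argument stands in for the `requests.get` call in the cell above. The progress label uses `offset + len(batch)`, so each printed range is inclusive and does not overlap the next one.

```python
def page_offsets(total, batch_size=100):
    """Return the offset of each batch needed to cover `total` results."""
    return range(0, total, batch_size)


def first_author(article):
    """Return the first listed author, or '' when the list is missing or empty."""
    authors = article.get('authors') or []
    return authors[0] if authors else ''


def collect_news(fetch, total=3100, batch_size=100):
    """Collect rows batch by batch; `fetch(offset, number)` returns a list of article dicts."""
    rows = []
    for offset in page_offsets(total, batch_size):
        batch = fetch(offset, batch_size)
        if not batch:
            # An empty batch means the results ran out early, so stop spending points.
            break
        print(f'received articles {offset + 1}-{offset + len(batch)}')
        for article in batch:
            rows.append({
                'title': article.get('title'),
                'text': article.get('text'),
                'authors': first_author(article),
                'country': article.get('source_country'),
                'sentiment': article.get('sentiment'),
                'url': article.get('url'),
            })
    return rows
```

With the default `total=3100`, `page_offsets` produces the same 31 offsets as `range(0,3001,100)` in the original cell.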


And last but not least...

# Ground News

So this one was interesting. To acquire the most relevant articles from GroundNews, I had to take advantage of my paid membership, as well as some UI features which were difficult to use purely with the Python **requests** package. So ultimately I decided to *download the HTML* of a page I could access manually. This page listed articles of transgender interest.
![image.png](attachment:image.png)
Before doing this, I manually expanded the page to the maximum possible size by repeatedly clicking this button.
![image-2.png](attachment:image-2.png)
The result was a long list of **events** related to the transgender topic, each of which linked to multiple articles originally from other sites.

Here's how I scraped those pages:

In [4]:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

df = pd.DataFrame(columns = ['title','summary','bias','factuality','owner','source'])
print('running!')

#Get into the topic page on Ground News for the topic of transgender.
#There is a button on Ground News to expand the article selection. I don't know how to click it within a python
#url request and clicking it does nothing to change the URL, so instead I populated the page with as many articles
#as I was allowed and then copied the resulting html into a local .txt file. I'm using this html as my basis here.
#I also had to modify this filepath.
with open('../data/ground_news_transgender_interest.html', encoding='utf8', mode='r') as file:
    content = file.read()
soup = BeautifulSoup(content,'lxml').body

articles = []
#Make a list of article links by looking for all links in the body which contain
#/article/. These all link to article pages on ground news.
for article in soup.find_all(href=re.compile('article')):
    articles.append('https://ground.news'+article.get('href'))
    
#Keep a running tally of articles so I can stay sane while it's working.
for i, article in enumerate(articles, start=1):
    print(f'{i}/{len(articles)}')
    #Go to the article page on ground news. This is a list of stories from different sources
    #about the same event, along with information about the sources of each story.
    article_page = requests.get(article)
    article_soup = BeautifulSoup(article_page.content,'html.parser').body
    
    stories = article_soup.find_all(id="article-summary")
    
    #For each story, get the source-bias, factuality rating, summary, and owner of the company.
    #Then follow the link to the original story and grab the original story's text.
    for story in stories:
        
        title = story.find('h4').get_text() if story.find('h4') is not None else ''
        summary = story.find('p').get_text() if story.find('p') is not None else ''
        
        #Bias, Factuality, and Owner buttons may not be present. Return an empty string if
        #BeautifulSoup can't find them.
        bias = story.find(id=re.compile('article-source-bias')).get_text() if story.find(id=re.compile('article-source-bias')) is not None else ''
        
        factuality = story.find(id=re.compile('article-source-factuality')).get_text() if story.find(id=re.compile('article-source-factuality')) is not None else ''
        
        owner = story.find(id=re.compile('article-source-owner')).get_text() if story.find(id=re.compile('article-source-owner')) is not None else ''
        
        if story.find('a') is not None:
            source_url = story.find('a').get('href')
            # I think trying to add the source text might be resulting in memory errors in this environment, I can't get the code to complete.
            
            # source_page = requests.get(source_url)
            # source_soup = BeautifulSoup(source_page.content,'html.parser')
            
            # #get rid of scripts and styles in the source page
            # for script in source_soup(["script", "style"]):
            #     script.extract()
            
            # #get the raw text. Not formatting any more here because these pages will
            # #vary wildly in their structure due to being from different websites.
            # source_text = source_soup.get_text()
        else:
            source_url = ''
        
        #Add all of the values relating to this story to the dataframe.
        df.loc[len(df.index)] = [(title if title is not None else ''),
                                 (summary if summary is not None else ''),
                                 (bias if bias is not None else ''),
                                 (factuality if factuality is not None else ''),
                                 (owner if owner is not None else ''),
                                 (source_url if source_url is not None else '')]
df = df.drop_duplicates()
df[['owner_type','owner']]=df['owner'].str.split(': ', n=1, expand=True)
#df.to_csv('../../data/ground_news_articles_raw.csv',index=False)
df.head()

running!
1/97
2/97
3/97
4/97
5/97
6/97
7/97
8/97
9/97
10/97
11/97
12/97
13/97
14/97
15/97
16/97
17/97
18/97
19/97
20/97
21/97
22/97
23/97
24/97
25/97
26/97
27/97
28/97
29/97
30/97
31/97
32/97
33/97
34/97
35/97
36/97
37/97
38/97
39/97
40/97
41/97
42/97
43/97
44/97
45/97
46/97
47/97
48/97
49/97
50/97
51/97
52/97
53/97
54/97
55/97
56/97
57/97
58/97
59/97
60/97
61/97
62/97
63/97
64/97
65/97
66/97
67/97
68/97
69/97
70/97
71/97
72/97
73/97
74/97
75/97
76/97
77/97
78/97
79/97
80/97
81/97
82/97
83/97
84/97
85/97
86/97
87/97
88/97
89/97
90/97
91/97
92/97
93/97
94/97
95/97
96/97
97/97


Unnamed: 0,title,summary,bias,factuality,owner,source,owner_type
0,Greece legalises same-sex marriage,Greece has become the first Christian Orthodox...,Center,High Factuality,Government of the United Kingdom,https://www.bbc.co.uk/news/world-europe-683101...,Government
1,Greece becomes first Orthodox Christian countr...,Lawmakers in the 300-seat parliament voted for...,Lean Left,Mixed Factuality,Scott Trust Limited,https://www.theguardian.com/world/2024/feb/15/...,Independent
2,Greece legalises same sex marriage in landmark...,The law gives same-sex couples the right to we...,Lean Left,High Factuality,The Hindu Group,https://www.thehindu.com/news/international/gr...,Independent
3,Greece becomes first Orthodox Christian countr...,Greece has become the first Orthodox Christian...,Center,High Factuality,Bell Media,https://www.ctvnews.ca/world/greece-becomes-fi...,Media Conglomerate
4,Greece legalises same-sex marriage – will anot...,Greece has become the first majority-Orthodox ...,Lean Left,Mixed Factuality,Evgeny Lebedev,https://www.independent.co.uk/news/world/europ...,Individual
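
The two parsing steps in the cell above (collecting article links from the saved page, then reading the optional fields of each story) can be isolated into small functions that are easy to try on a snippet of HTML. This is a sketch rather than the code that produced the data: `extract_article_links`, `text_or_empty`, and `parse_story` are hypothetical names. It differs from the cell in two labeled ways: it matches `/article/` rather than `article`, and it uses `urljoin` so that an href which is already absolute is not prefixed a second time.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'https://ground.news'


def extract_article_links(html):
    """Return absolute URLs for every link whose href contains '/article/'."""
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    for tag in soup.find_all(href=re.compile('/article/')):
        url = urljoin(BASE_URL, tag.get('href'))
        if url not in links:  # skip repeated links to the same event
            links.append(url)
    return links


def text_or_empty(tag):
    """Return the tag's text, or '' when BeautifulSoup found nothing."""
    return tag.get_text() if tag is not None else ''


def parse_story(story):
    """Pull the fields used above out of one story element."""
    link = story.find('a')
    return {
        'title': text_or_empty(story.find('h4')),
        'summary': text_or_empty(story.find('p')),
        'bias': text_or_empty(story.find(id=re.compile('article-source-bias'))),
        'factuality': text_or_empty(story.find(id=re.compile('article-source-factuality'))),
        'owner': text_or_empty(story.find(id=re.compile('article-source-owner'))),
        'source': link.get('href', '') if link is not None else '',
    }
```

Because `text_or_empty` handles the missing-element case in one place, each field needs only a single `find` call instead of two.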


# Scraping the Source Texts (Getting Creative)

Now that I had a list of article summaries and metadata, the next step was to gather the source text from the original articles. If I wanted to analyze the difference in lexicon of different news sources on this topic, I would need to get the original text.

Here was the code I used:

In [5]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re


sources=df['source']
source_texts=[]

whitespace = re.compile(r'\s+')

for source in sources:
    conf = ''
    text = ''
    #For each source page, get the full html as a BeautifulSoup Object
    try:
        soup = BeautifulSoup(requests.get(source,timeout=10).content,'html.parser').body
        #This is code I found online which rips out style and scripts in the page. This
        #is useful for extracting the text content.
        for script in soup(['script', 'style','[document]', 'head', 'title']):
            if script is not None:
                script.extract()
        
        #Another tactic to extract the story without getting ads and stuff. It seems
        #many news sites tag their main story. This does result in a lot of blank
        #rows, but I figure a blank row is less problematic than a row filled with
        #junk words.
        # if soup.find(id=re.compile('main|content|article|story')) is not None:
        #     text = re.sub(whitespace, ' ',soup.find(id=re.compile('main|content|article|story')).get_text())
        #     conf = 'high'
        # elif soup.find_all('article') is not None or soup.find_all('p') is not None:
        #     if soup.find_all('article') is not None:
        #         text += ' '.join([re.sub(whitespace, ' ', a.get_text()) for a in soup.find_all('article')])
        #     if soup.find_all('p') is not None:
        #         text += ' '.join([re.sub(whitespace, ' ', p.get_text()) for p in soup.find_all('p')])
        #     conf = 'med'
        # else:
        #     text = re.sub(whitespace, ' ',soup.get_text())
        #     conf = 'low'
            
        text = re.sub(whitespace, ' ',soup.get_text())
        
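        source_texts.append(text)
    except Exception:
        #If the request or the parsing fails, store a blank so the list stays aligned with the dataframe rows.
        source_texts.append('')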

df['source_text']=source_texts
# df.to_csv('../../data/ground_news_articles_raw.csv')
df.head()


Unnamed: 0,title,summary,bias,factuality,owner,source,owner_type,source_text
0,Greece legalises same-sex marriage,Greece has become the first Christian Orthodox...,Center,High Factuality,Government of the United Kingdom,https://www.bbc.co.uk/news/world-europe-683101...,Government,BBC HomepageSkip to contentAccessibility HelpY...
1,Greece becomes first Orthodox Christian countr...,Lawmakers in the 300-seat parliament voted for...,Lean Left,Mixed Factuality,Scott Trust Limited,https://www.theguardian.com/world/2024/feb/15/...,Independent,Skip to main contentSkip to navigationSkip to...
2,Greece legalises same sex marriage in landmark...,The law gives same-sex couples the right to we...,Lean Left,High Factuality,The Hindu Group,https://www.thehindu.com/news/international/gr...,Independent,India World Opinion Sports e-Paper Menu Short...
3,Greece becomes first Orthodox Christian countr...,Greece has become the first Orthodox Christian...,Center,High Factuality,Bell Media,https://www.ctvnews.ca/world/greece-becomes-fi...,Media Conglomerate,Skip to main content Live Search CTVNews.ca S...
4,Greece legalises same-sex marriage – will anot...,Greece has become the first majority-Orthodox ...,Lean Left,Mixed Factuality,Evgeny Lebedev,https://www.independent.co.uk/news/world/europ...,Individual,Jump to contentUS EditionChangeUK EditionAsia...


This code went through a variety of iterations, as I tried different techniques to extract the useful text from each of these various HTML web pages. I tried scraping text only from specific tags, like ***<a\>***, ***<article\>***, and ***<p\>***, and from elements whose ***id*** matched a pattern such as ***main|body***. However, none of these methods provided a truly general solution for removing junk text outside of the main article. I eventually resolved to just gather all of the text, and try to remove the junky bits during the data cleaning/preprocessing step.

This ultimately proved a good decision, as I was able to sort out the junk much better once I was in the cleaning step. To see how I did that, check out Data Cleaning!
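
For reference, the tiered idea from the commented-out branch in the cell above might look like the following. This is a sketch of the abandoned approach, not the method that produced the final data, and `clean` and `extract_text` are hypothetical names. It differs from the commented-out code in two ways: it tests whether `find_all` returned anything (it returns a list, never `None`), and it falls back to `<p>` tags only when no `<article>` tag is present, so paragraphs nested inside an article are not counted twice.

```python
import re

from bs4 import BeautifulSoup

WHITESPACE = re.compile(r'\s+')
MAIN_ID = re.compile('main|content|article|story')


def clean(text):
    """Collapse runs of whitespace into single spaces."""
    return WHITESPACE.sub(' ', text).strip()


def extract_text(html):
    """Return (text, confidence) for one page, trying the narrowest match first."""
    soup = BeautifulSoup(html, 'html.parser')
    # Remove elements whose contents are never article text.
    for tag in soup(['script', 'style']):
        tag.extract()

    # Tier 1: an element whose id suggests it holds the main story.
    main = soup.find(id=MAIN_ID)
    if main is not None:
        return clean(main.get_text(' ')), 'high'

    # Tier 2: <article> tags, or failing that, every <p> tag.
    blocks = soup.find_all('article') or soup.find_all('p')
    if blocks:
        return clean(' '.join(b.get_text(' ') for b in blocks)), 'med'

    # Tier 3: everything that is left on the page.
    return clean(soup.get_text(' ')), 'low'
```

The confidence label makes it possible to keep every row while still filtering on how the text was obtained later on.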