# Data Acquisition

To compile a list of news stories and articles about the **transgender** topic from the past year, I used three different sources:

    1. NewsAPI.org - A free but limited API for compiling news articles on a single topic. I pulled a reasonable number of stories from here.
    2. WorldNewsAPI.com - Another free news API, with a global scope and a more generous free query limit. It also provides some interesting metadata for each article, such as a sentiment analysis run by the WorldNews organization.
    3. GroundNews.com - A paid news service that provides a wealth of interesting metadata and analysis on the news stories it covers. I have a paid subscription to this service and was able to scrape a decent number of news articles, along with pertinent metadata.

Two of these sources are APIs, which gave me the text in a relatively easy-to-clean format. The third, GroundNews, was more of a challenge. Scraping the article summaries and metadata proved relatively simple, but scraping the source texts of the original stories proved tricky, and I had to get creative to glean usable text data from those sources.

Let's start with the APIs.

# News API

For this one the strategy was relatively simple. I first acquired an API key from the website, then ran the following series of queries to acquire the maximum number of articles possible. The real difficulty was in overcoming the restrictions this API places on free users. I was limited both in the number of requests and in the page number I could ask for in a request. In this case, I was limited to 5 pages per query, and to get around this limitation I reran the query once with each available sorting parameter. I then combined these results into a single dataframe containing all metadata and dropped duplicates.

Here's what I did:

In [1]:
import requests
import pandas as pd

df1 = pd.DataFrame(columns=['title','description','content','author','publishedAt','source','url','urlToImage'])
print('running!')

#Changing the sorting parameter to get around the free API limitations.
#There is a maximum page limit of 5, which caps the number of results even further than the 100 queries/day limitation would.

#Articles sorted by publish date by default
for num in range(1,6):  
    request_params = {'q':'transgender',
                      'page':f'{num}',
                      'language':'en',
                      'apiKey':'YOUR_NEWSAPI_KEY'}  #placeholder: supply your own key
    response = requests.get('https://newsapi.org/v2/everything',request_params)
    
    json = response.json()
    articles = json['articles']
        
    for article in articles:
        df1.loc[len(df1.index)]=[
        article['title'],
        article['description'],
        article['content'],
        article['author'],
        article['publishedAt'],
        article['source']['id'],
        article['url'],
        article['urlToImage']
        ]
        
df2 = pd.DataFrame(columns=['title','description','content','author','publishedAt','source','url','urlToImage'])

#Same query sorted by relevancy
for num in range(1,6):  
    request_params = {'q':'transgender',
                      'page':f'{num}',
                      'sortBy':'relevancy',
                      'language':'en',
                      'apiKey':'YOUR_NEWSAPI_KEY'}  #placeholder: supply your own key
    response = requests.get('https://newsapi.org/v2/everything',request_params)
    
    json = response.json()
    articles = json['articles']
        
    for article in articles:
        df2.loc[len(df2.index)]=[
        article['title'],
        article['description'],
        article['content'],
        article['author'],
        article['publishedAt'],
        article['source']['id'],
        article['url'],
        article['urlToImage']
        ]
        
df3 = pd.DataFrame(columns=['title','description','content','author','publishedAt','source','url','urlToImage'])

#Same Query sorted by popularity
for num in range(1,6):  
    request_params = {'q':'transgender',
                      'page':f'{num}',
                      'sortBy':'popularity',
                      'language':'en',
                      'apiKey':'YOUR_NEWSAPI_KEY'}  #placeholder: supply your own key
    response = requests.get('https://newsapi.org/v2/everything',request_params)
    
    json = response.json()
    articles = json['articles']
        
    #I kept every piece of information attached to each article in its own column in the raw text dataframe, even if the last two ultimately prove irrelevant.
    #Better to have extra irrelevant data than to leave out something important.
    for article in articles:
        df3.loc[len(df3.index)]=[
        article['title'],
        article['description'],
        article['content'],
        article['author'],
        article['publishedAt'],
        article['source']['id'],
        article['url'],
        article['urlToImage']
        ]

df_all = pd.concat([df1,df2,df3]).drop_duplicates()
    
# df_all.to_csv('../../data/newsapi_corpus_raw.csv')

df_all.head()



running!


Unnamed: 0,title,description,content,author,publishedAt,source,url,urlToImage
0,The owner of a restaurant in New York state to...,Some workers at the pizzeria even equated bein...,Lourdes Balduque/Getty Images\r\n<ul><li>Staff...,Grace Dean,2024-01-29T14:33:43Z,business-insider,https://www.businessinsider.com/new-york-resta...,https://i.insider.com/65b796ba6c8f0a134f7aa55e...
1,Man guilty of killing transgender woman in hat...,It was the United States' first federal trial ...,In combo of undated selfie images provided cou...,,2024-02-24T05:40:52Z,,https://www.npr.org/2024/02/24/1233685859/man-...,https://media.npr.org/assets/img/2024/02/24/ap...
2,Utah Joins 10 Other States in Regulating Bathr...,Utah became the latest state to regulate bathr...,Utah became the latest state to regulate bathr...,AMY BETH HANSON / AP,2024-01-31T16:25:37Z,time,https://time.com/6590528/utah-joins-states-reg...,https://api.time.com/wp-content/uploads/2024/0...
3,Killer was moved to Brianna Ghey's school afte...,Scarlett Jenkinson was moved after drugging a ...,Killer Scarlett Jenkinson was moved to a new s...,https://www.facebook.com/bbcnews,2024-02-02T12:30:54Z,bbc-news,https://www.bbc.co.uk/news/uk-68153179,https://ichef.bbci.co.uk/news/1024/branded_new...
4,South Dakota has apologized and must pay $300K...,,"Si vous cliquez sur « Tout accepter », nos par...",,2024-02-06T18:31:04Z,,https://consent.yahoo.com/v2/collectConsent?se...,
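
As a rough sketch, the three near-identical loops above could be folded into a single loop over the sort orders. This is not the code I ran, and the helper names here (`article_to_row`, `collect_articles`, `make_newsapi_fetcher`) are hypothetical. The fetching step is passed in as a function so the looping and de-duplication logic can be tried without spending any API requests.

```python
COLUMNS = ['title', 'description', 'content', 'author',
           'publishedAt', 'source', 'url', 'urlToImage']


def article_to_row(article):
    """Flatten one NewsAPI-style article dict into a row keyed by COLUMNS."""
    row = {col: article.get(col) for col in COLUMNS}
    # 'source' arrives as a nested dict; keep only its id, as in the cell above.
    row['source'] = (article.get('source') or {}).get('id')
    return row


def collect_articles(fetch_page, sort_orders=(None, 'relevancy', 'popularity'), max_page=5):
    """Call fetch_page(sort_by, page) for every combination and drop exact duplicates.

    fetch_page is an assumed callable returning a list of article dicts.
    A sort order of None means "leave sortBy out and use the API default".
    """
    rows, seen = [], set()
    for sort_by in sort_orders:
        for page in range(1, max_page + 1):
            for article in fetch_page(sort_by, page):
                row = article_to_row(article)
                key = tuple(row[col] for col in COLUMNS)
                if key not in seen:  # same article returned under another sort order
                    seen.add(key)
                    rows.append(row)
    return rows


def make_newsapi_fetcher(api_key):
    """Build a fetch_page function that performs the same query as the cell above."""
    import requests  # imported here so the rest of the sketch runs without it

    def fetch_page(sort_by, page):
        params = {'q': 'transgender', 'page': page,
                  'language': 'en', 'apiKey': api_key}
        if sort_by is not None:
            params['sortBy'] = sort_by
        response = requests.get('https://newsapi.org/v2/everything', params=params)
        response.raise_for_status()  # fail loudly instead of hitting a KeyError below
        return response.json()['articles']

    return fetch_page
```

Calling `pd.DataFrame(collect_articles(make_newsapi_fetcher(key)), columns=COLUMNS)` should give a frame equivalent to `df_all`, where `key` could be read from an environment variable rather than written into the notebook.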


# WorldNewsAPI

This API was friendlier to the free user than NewsAPI, with much looser limitations. Here I was limited to a maximum of 50 *points* per day, with each API request costing at least one point, plus an additional 1 point for every 100 results returned. Even so, this meant that with careful typing and some luck, I could pull over 4,000 articles in one go!

Here was my final code:

In [None]:
import requests
import pandas as pd

df = pd.DataFrame(columns=['title','text','authors','country','sentiment','url'])

#Going to extract a bunch of news articles from the past couple of months from this API as well.
url = 'https://api.worldnewsapi.com/search-news'

#There is a maximum of 100 returned articles per request. Make 31 such requests (offsets 0 through 3000) and append the results of each to a single dataframe.
# o is the request offset. It decides which batch of 100 we're requesting.
for o in range(0,3001,100):
    request_params = {'text':'transgender',
                      'number':'100',
                      'offset':f'{o}',
                      'earliest-publish-date':'2023-09-01',
                      'sort':'publish-time',
                      'language':'en',
                      'api-key':'YOUR_WORLDNEWSAPI_KEY'}  #placeholder: supply your own key
    print(f'requesting articles {o+1}-{o+100}!')
    #get the news list from the response json.
    resp = requests.get(url,request_params)
    json = resp.json()
    articles =json['news']
    #Store the relevant metadata for each article in the corresponding column and append that row to my dataframe.
    for article in articles:
        df.loc[len(df.index)]=[
        article['title'],
        article['text'],
        #Authors is often a list of people, but is sometimes empty or missing. Keep only the first author so that this column contains all strings.
        article['authors'][0] if article['authors'] else '',
        article['source_country'],
        article['sentiment'],
        article['url']
        ]
        
df = df.drop_duplicates()
df.to_csv('../../data/worldnewsapi_corpus_raw.csv',mode='a')
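
Since every request costs at least one point, it may be worth stopping as soon as the API runs out of matching articles rather than always walking the full range of offsets. The sketch below shows one way to do that. It is not the code I ran: `paginate`, `first_author`, and the `fetch_batch` callable are hypothetical names, and the fetching step is left as a parameter so the paging logic can be checked offline.

```python
def paginate(fetch_batch, batch_size=100, max_results=3100):
    """Yield articles batch by batch until max_results is reached or the API runs dry.

    fetch_batch(offset, number) is an assumed callable returning a list of articles.
    The defaults reproduce the offsets 0, 100, ..., 3000 used in the cell above.
    """
    for offset in range(0, max_results, batch_size):
        batch = fetch_batch(offset, batch_size)
        yield from batch
        if len(batch) < batch_size:
            break  # a short batch means there is nothing further to request


def first_author(article):
    """Return the first listed author, or '' when the list is empty or missing."""
    authors = article.get('authors') or []
    return authors[0] if authors else ''
```

With a real `fetch_batch` that wraps the `requests.get` call from the cell above, `list(paginate(fetch_batch))` would collect the same articles while skipping any requests past the end of the results.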