# Capstone Proposal

### Problem Statement

- Climate change is the most profound existential crisis human civilization has ever faced, yet media coverage still gives this reality little weight. Media has a profound impact on how the population views issues, and the tone and frequency of media attention influence society's direction in addressing them. Despite increasing warnings from scientists, there does not seem to be a response scaled to the climate crisis. There does, however, seem to be a growing climate movement and more awareness in general. This project aims to quantify any change in news coverage that could indicate how the media is directing the population. Using newspaper headlines, and additional resources if available, it will analyze the last 10 years of media coverage. APIs will be used to build a corpus of headlines, and NLP will be used to assess the sentiment of the articles that refer to climate change and related terms. This assessment will be compared over time to establish whether any trend is present in how climate-related issues are presented. The frequency of climate-related headlines will also be explored to determine whether the topic is receiving more attention over time.
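
The frequency part of this analysis can be sketched in a few lines. The example below is only an illustration under stated assumptions: the search terms in `CLIMATE_TERMS`, the helper names, and the sample headlines are all hypothetical, and the real term list would be chosen during the project.

```python
import re
from collections import Counter

# Hypothetical search terms; the real list would be chosen during the project.
CLIMATE_TERMS = ("climate change", "global warming", "climate crisis")
CLIMATE_PATTERN = re.compile(
    "|".join(re.escape(term) for term in CLIMATE_TERMS), re.IGNORECASE
)


def is_climate_headline(headline):
    """Return True if the headline mentions any of the assumed climate terms."""
    return bool(CLIMATE_PATTERN.search(headline))


def yearly_climate_share(records):
    """Map each year to the fraction of its headlines that mention climate terms.

    `records` is an iterable of (iso_date, headline) pairs.
    """
    totals, matches = Counter(), Counter()
    for iso_date, headline in records:
        year = int(iso_date[:4])  # ISO 8601 dates start with the four-digit year
        totals[year] += 1
        if is_climate_headline(headline):
            matches[year] += 1
    return {year: matches[year] / totals[year] for year in sorted(totals)}


# Made-up headlines, for illustration only.
sample = [
    ("2013-03-02T00:00:00Z", "City council debates new parking rules"),
    ("2013-07-15T00:00:00Z", "Scientists warn on global warming"),
    ("2022-01-10T00:00:00Z", "Climate change drives record heat"),
    ("2022-05-21T00:00:00Z", "Climate crisis tops summit agenda"),
]
shares = yearly_climate_share(sample)
print(shares)
```

Reporting a share rather than a raw count keeps the comparison fair if the total number of published headlines changes from year to year.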

### Deliverable

- A Streamlit web dashboard showing the results of the analysis, a description of the data, and the methodology used to produce the analysis.


## <u>Sample Code</u>

In [1]:
import os
import time
import json
import requests
from api_keys import nyt_api
from api_keys import grdn_api

### New York Times Data

- New York Times data should be easy to access. The developer FAQ covers the rate limits under <a href="https://developer.nytimes.com/faq">11. Is there an API call limit?</a>:

  > Yes, there are two rate limits per API: 4,000 requests per day and 10 requests per minute. You should sleep 6 seconds between calls to avoid hitting the per minute rate limit. If you need a higher rate limit, please contact us at code@nytimes.com.

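- Even with this limit in mind, pulling 120 months of data (one request per month) needs only about 12 minutes of waiting at 6 seconds per request, so roughly 20 minutes in total is a reasonable estimate.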
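
The estimate above can be checked with simple arithmetic. This is a rough sketch that assumes one archive request per month and the 6-second spacing recommended in the FAQ; the constant names are only illustrative.

```python
MONTHS = 10 * 12            # ten years of monthly archive files
SECONDS_BETWEEN_CALLS = 6   # spacing recommended in the FAQ
PER_MINUTE_LIMIT = 10
PER_DAY_LIMIT = 4000

total_seconds = MONTHS * SECONDS_BETWEEN_CALLS
total_minutes = total_seconds / 60
calls_per_minute = 60 / SECONDS_BETWEEN_CALLS

print(f"{MONTHS} requests need about {total_minutes:.0f} minutes of waiting")
print(f"Within per-minute limit: {calls_per_minute <= PER_MINUTE_LIMIT}")
print(f"Within per-day limit: {MONTHS <= PER_DAY_LIMIT}")
```

The waiting time alone comes to 12 minutes, which leaves room for download and file-writing time inside the rough 20-minute figure.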

In [9]:
# code to pull all the data I would need from NYT
# (left commented out so the notebook does not download everything again)

# os.makedirs('nyt', exist_ok=True)  # make sure the output folder exists
# for year in range(2013, 2023):
#     for month in range(1, 13):
#         req = requests.get(f'https://api.nytimes.com/svc/archive/v1/{year}/{month}.json?api-key={nyt_api}')
#         req.raise_for_status()  # stop on an error response instead of saving it
#         with open(f'nyt/nyt_{year}_{month}.json', 'w') as file:
#             json.dump(req.json(), file)
#         time.sleep(4)  # seems to take at least 2 seconds to write file so not full 6 second sleep

In [38]:
# then can easily access the data that has been pulled
with open('nyt/nyt_2013_1.json', 'r') as file:
    archive = json.load(file)
    
# show the first item in the docs
archive['response']['docs'][0]
# important fields are 'abstract', 'headline' (its 'main' entry), 'pub_date', and 'section_name'

{'abstract': 'The Emancipation Proclamation evolved during the Civil War years, as did the thinking of its author.',
 'web_url': 'https://opinionator.blogs.nytimes.com/2012/12/31/abraham-lincoln-and-the-emancipation-proclamation/',
 'snippet': 'The Emancipation Proclamation evolved during the Civil War years, as did the thinking of its author.',
 'lead_paragraph': 'In an op-ed, Eric Foner writes:',
 'source': 'The New York Times',
 'multimedia': [],
 'headline': {'main': 'Abraham Lincoln and the Emancipation Proclamation',
  'kicker': 'Opinionator',
  'content_kicker': None,
  'print_headline': '',
  'name': None,
  'seo': None,
  'sub': None},
 'keywords': [{'name': 'subject',
   'value': 'Civil War (US) (1861-65)',
   'rank': 1,
   'major': 'N'},
  {'name': 'subject',
   'value': 'Emancipation Proclamation (1863)',
   'rank': 2,
   'major': 'N'},
  {'name': 'subject', 'value': 'Slavery', 'rank': 3, 'major': 'N'},
  {'name': 'persons', 'value': 'Lincoln, Abraham', 'rank': 4, 'major': 
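
To show how the important fields could be pulled out of one archive record, here is a minimal sketch. The function name `flatten_nyt_doc` is hypothetical, and the `pub_date` and `section_name` values in the example are placeholders because that part of the output is cut off above.

```python
def flatten_nyt_doc(doc):
    """Pick out the fields of interest from one item of archive['response']['docs']."""
    headline = doc.get("headline") or {}  # 'headline' is a nested dict
    return {
        "headline": headline.get("main", ""),
        "abstract": doc.get("abstract", ""),
        "pub_date": doc.get("pub_date"),
        "section_name": doc.get("section_name"),
    }


# Trimmed copy of the record shown above; the 'pub_date' and 'section_name'
# values are placeholders, not real data.
example_doc = {
    "abstract": "The Emancipation Proclamation evolved during the Civil War years, "
                "as did the thinking of its author.",
    "headline": {
        "main": "Abraham Lincoln and the Emancipation Proclamation",
        "kicker": "Opinionator",
    },
    "pub_date": "PLACEHOLDER_DATE",
    "section_name": "PLACEHOLDER_SECTION",
}

row = flatten_nyt_doc(example_doc)
print(row["headline"])
```

Using `.get` with defaults means a record with a missing field produces an empty value instead of raising a `KeyError`.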

### Guardian Data

- Guardian data will be harder to acquire since the API does not provide as easy access to the whole set of headlines.
- Looking through the API explorer, I found that there are 956561 items that I can obtain from The Guardian for the time range I'm interested in, but there are no abstracts, only headlines, publication dates, and section names. This might not be a problem but could hinder sentiment analysis.
- Requests return only 10 items per page by default, but <a href="https://stackoverflow.com/questions/61031878/get-all-articles-guardian-api">this Stack Overflow exchange</a> suggests that it is possible to increase that number so that the number of requests I need to make will be more reasonable.
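
The effect of a larger page size on the number of requests can be estimated before making any calls. The sketch below assumes the `page-size` query parameter mentioned in the linked exchange; the page sizes tried here (50 and 200) are assumptions, and the API may cap the value at a lower number. The helper names are hypothetical.

```python
import math
from urllib.parse import urlencode

BASE_URL = "https://content.guardianapis.com/search"
TOTAL_ITEMS = 956561  # 'total' reported by the API explorer for this date range


def requests_needed(total_items, page_size):
    """Number of pages needed to cover every item at the given page size."""
    return math.ceil(total_items / page_size)


def search_url(page, page_size, api_key):
    """Build the search URL for one page of results."""
    params = {
        "from-date": "2013-01-01",
        "to-date": "2023-01-01",
        "order-by": "oldest",
        "page": page,
        "page-size": page_size,  # assumed parameter, per the linked exchange
        "api-key": api_key,
    }
    return f"{BASE_URL}?{urlencode(params)}"


for size in (10, 50, 200):
    print(size, requests_needed(TOTAL_ITEMS, size))
```

At the default of 10 items per page this gives 95657 requests, while a page size of 200, if allowed, would cut that to 4783.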

In [40]:
# would need to iterate over as many pages as needed to aggregate all the data

# the request uri includes the dates I am interested in
for num in range(1, 2):
    req = requests.get(
        f'https://content.guardianapis.com/search?from-date=2013-01-01&to-date=2023-01-01&order-by=oldest&page={num}&api-key={grdn_api}'
    )
    results = json.loads(req.content)
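
Extending the single-page request to the whole date range means looping until the `pages` value in the response is reached. The following is a sketch only: `collect_results` and `fake_fetch` are hypothetical names, and the fetch function is passed in so the loop can be tried offline. A real fetch function would wrap the `requests.get` call and pause between calls.

```python
def collect_results(fetch_page, max_pages=None):
    """Gather the 'results' lists from consecutive pages into one list.

    `fetch_page(page_number)` must return the parsed JSON of one search response.
    """
    collected = []
    page = 1
    while True:
        response = fetch_page(page)["response"]
        for item in response["results"]:
            # Keep only the headline, publication date, and section name.
            collected.append({
                "webTitle": item["webTitle"],
                "webPublicationDate": item["webPublicationDate"],
                "sectionName": item["sectionName"],
            })
        reached_last = page >= response["pages"]
        reached_cap = max_pages is not None and page >= max_pages
        if reached_last or reached_cap:
            break
        page += 1
    return collected


def fake_fetch(page):
    """Stand-in for the real request: three pages with one made-up item each."""
    return {"response": {"pages": 3, "currentPage": page, "results": [
        {"webTitle": f"Headline {page}",
         "webPublicationDate": "2013-01-01T00:00:00Z",
         "sectionName": "Science"},
    ]}}


headlines = collect_results(fake_fetch)
print(len(headlines))
```

The optional `max_pages` argument makes it possible to test the loop on a few pages before committing to the full download.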
    

In [42]:
results
    
# relevant fields would be 'webTitle', 'webPublicationDate', and 'sectionName'
# pages field shows how many requests would need to be made if keeping the default 10 items per page

{'response': {'status': 'ok',
  'userTier': 'developer',
  'total': 956561,
  'startIndex': 1,
  'pageSize': 10,
  'currentPage': 1,
  'pages': 95657,
  'orderBy': 'oldest',
  'results': [{'id': 'science/2013/jan/01/stephen-hawking-silences-go-compare',
    'type': 'article',
    'sectionId': 'science',
    'sectionName': 'Science',
    'webPublicationDate': '2013-01-01T00:00:00Z',
    'webTitle': 'Stephen Hawking silences Go Compare singer in latest ad instalment',
    'webUrl': 'https://www.theguardian.com/science/2013/jan/01/stephen-hawking-silences-go-compare',
    'apiUrl': 'https://content.guardianapis.com/science/2013/jan/01/stephen-hawking-silences-go-compare',
    'isHosted': False,
    'pillarId': 'pillar/news',
    'pillarName': 'News'},
   {'id': 'society/2013/jan/01/uk-needs-michelle-obama-obesity',
    'type': 'article',
    'sectionId': 'society',
    'sectionName': 'Society',
    'webPublicationDate': '2013-01-01T00:01:00Z',
    'webTitle': 'UK needs its own Michelle Ob