<a href="https://colab.research.google.com/github/FranziskaSW/DS-keyword-clusters/blob/master/1_Introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

 <p align="center">**<font  size="70" >Analyzing</font>**
<br>![alt text](https://upload.wikimedia.org/wikipedia/commons/thumb/7/77/The_New_York_Times_logo.png/800px-The_New_York_Times_logo.png)
<br>**<font  size="5" >Using data to gain insight into the newspaper's policy</font>**</p>

<br> <p align="center">
**Franziska Wehrmann** | franzisk.wehrmann@mail.huji.ac.il | franziska  <br>
**Alon Itach** | alon.itach@mail.huji.ac.il | alonit  <br>
**Yuval Reif** | yuval.reif@mail.huji.ac.il | yuval.reif  <br>
  </p>

# Problem Description
--------------------------

When we sat down to come up with ideas for the project, Franzi shared David Kriesel's [Spiegel Mining](http://www.dkriesel.com/spiegelmining) talk with us, in which he explored a database of articles that he himself scraped over several years from the German newsmagazine Spiegel Online.

We all liked the spirit of Kriesel's project and wanted to build something similar within the context of the course. Our main goals were to gain insight into a newspaper's publication policy through data mining & visualization, and to have fun playing with a huge amount of data.

### The problems we chose to tackle were:
- **Identifying stories in the sea of articles:** Finding sets of articles that are connected in some sense (either in a straightforward way or in some deeper sense), so that together they define a story.
- **Finding trends in the newspaper's publications:** Developing tools to determine these trends so as to gain insight into the choices (conscious or unconscious) the newspaper's editors make.
- **Analyzing the network of keywords:** Defining a network with article keywords as nodes and with edges between keywords that appear together in the same article, and searching this network for insights about the newspaper's publications in a broad sense.
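
As a rough illustration of the keyword network described in the last point, the edges can be counted by iterating over all keyword pairs of each article. This is only a minimal sketch: the function name `build_cooccurrence_edges` and the toy input are made up for illustration and are not the code used in the project.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_edges(articles):
    """Count how often each pair of keywords is tagged on the same article.

    `articles` is an iterable of keyword lists, one list per article.
    Returns a Counter mapping a sorted (keyword_a, keyword_b) tuple to the
    number of articles in which both keywords appear.
    """
    edges = Counter()
    for keywords in articles:
        # Deduplicate and sort so every pair has one canonical orientation.
        unique = sorted(set(keywords))
        for pair in combinations(unique, 2):
            edges[pair] += 1
    return edges


# Toy input (hypothetical, not taken from the dataset).
toy_articles = [
    ["Heroin", "New York City", "Opioids and Opiates"],
    ["Heroin", "Opioids and Opiates"],
    ["New York City"],
]
edges = build_cooccurrence_edges(toy_articles)
print(edges.most_common(1))
```

The edge weights can then serve as input for any graph library or clustering method; which one fits best depends on the size of the keyword vocabulary.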



# The Data
-------------
Since this project is about textual data, we needed a language we could all understand, so English was our only option. Of all the English-language news APIs, the NYT's API was the most promising. Although it provides only article metadata (scraping the articles' full content requires an online subscription), it does have some cool features:
- Has articles from 1851 (!) to date
- Has **rich textual metadata**, including the following for each article:
  - Headline and the lead paragraph
  - Tagged keywords of 5 types: people, organizations, locations, creative works & subjects
    - Subject being anything from "Classified Information and State Secrets" to "#MeToo Movement"
  - News desk (a general category: Politics, Culture, Sports, etc.) and section of the newspaper (more specific: Soccer, Middle East, Economy) it was published in
  - Publication date & time
- Is free & well maintained
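
To give a feeling for how the fields listed above can be pulled out of a single record, here is a small sketch. The helper name `summarize_document` is hypothetical, and the field names (`headline.main`, `keywords[].name`, `keywords[].value`, `news_desk`, `pub_date`) are assumed from the example document shown in this notebook; other records may be missing some of them.

```python
from collections import defaultdict


def summarize_document(doc):
    """Flatten one article record into a few fields of interest."""
    keywords_by_type = defaultdict(list)
    for kw in doc.get("keywords", []):
        # "name" holds the keyword type, "value" the keyword itself.
        keywords_by_type[kw["name"]].append(kw["value"])
    return {
        "headline": (doc.get("headline") or {}).get("main"),
        "news_desk": doc.get("news_desk"),
        "pub_date": doc.get("pub_date"),
        "keywords": dict(keywords_by_type),
    }


# Trimmed-down version of the example document from this notebook.
sample_doc = {
    "headline": {"main": "De Blasio Moves to Bring Safe Injection Sites to New York City"},
    "keywords": [
        {"name": "persons", "value": "de Blasio, Bill", "rank": 1, "major": "N"},
        {"name": "glocations", "value": "New York City", "rank": 2, "major": "N"},
        {"name": "subject", "value": "Heroin", "rank": 4, "major": "N"},
    ],
    "pub_date": "2018-05-03T21:00:11+0000",
    "news_desk": "Metro",
}
summary = summarize_document(sample_doc)
print(summary["keywords"])
```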
 



# The API
----------------

The NYTimes API is a RESTful service. Each request returns a JSON file containing the data of one month. An example of one (cleaned) document is shown below.
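
A request for one month could look roughly like the following sketch. The endpoint pattern, the `api-key` query parameter and the `response` → `docs` layout are assumptions based on the public NYT Archive API and should be checked against the official documentation. The helper names are hypothetical, and this is not necessarily the code we used to download the data.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint pattern of the NYT Archive API; verify against the docs.
ARCHIVE_URL = "https://api.nytimes.com/svc/archive/v1/{year}/{month}.json"


def archive_url(year, month, api_key):
    """Build the request URL for one month of article metadata."""
    if not 1 <= month <= 12:
        raise ValueError("month must be in 1..12")
    query = urllib.parse.urlencode({"api-key": api_key})
    return ARCHIVE_URL.format(year=year, month=month) + "?" + query


def extract_docs(payload):
    """Return the list of article records from a parsed response.

    Assumes the records sit under payload["response"]["docs"].
    """
    return payload.get("response", {}).get("docs", [])


def fetch_month(year, month, api_key):
    """Download and parse one month (performs a network request)."""
    with urllib.request.urlopen(archive_url(year, month, api_key)) as resp:
        return extract_docs(json.load(resp))
```

Calling `fetch_month(2018, 5, "YOUR_API_KEY")` would then return the list of documents for May 2018, provided the key is valid and the assumptions above hold.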

All in all, we analyzed data **from 1990 to 2019**: **2131065 unique documents**, totaling **3.83 GB** in JSON format.

In [0]:
import json
print(json.dumps({"web_url": "https://www.nytimes.com/2018/05/03/nyregion/nyc-safe-injection-sites-heroin.html", "snippet": "Though no injection sites exist yet in the United States, the endorsement of the strategy by New York may give the movement behind it special impetus.", "print_page": "1", "blog": [], "source": "The New York Times", "headline": {"main": "De Blasio Moves to Bring Safe Injection Sites to New York City", "kicker": None, "content_kicker": None, "print_headline": "To Curb Overdoses, New York Plans to Try Safe Injection Sites", "name": None, "seo": None, "sub": None}, "keywords": [{"name": "persons", "value": "de Blasio, Bill", "rank": 1, "major": "N"}, {"name": "glocations", "value": "New York City", "rank": 2, "major": "N"}, {"name": "subject", "value": "Drug Abuse and Traffic", "rank": 3, "major": "N"}, {"name": "subject", "value": "Heroin", "rank": 4, "major": "N"}, {"name": "subject", "value": "Opioids and Opiates", "rank": 5, "major": "N"}, {"name": "subject", "value": "Hypodermic Needles and Syringes", "rank": 6, "major": "N"}], "pub_date": "2018-05-03T21:00:11+0000", "document_type": "article", "news_desk": "Metro", "byline": {"original": "By WILLIAM NEUMAN", "person": [{"firstname": "William", "middlename": None, "lastname": "NEUMAN", "qualifier": None, "title": None, "role": "reported", "organization": "", "rank": 1}], "organization": None}, "type_of_material": "News", "_id": "5aeb785d47de81a9012268e6", "word_count": 1204, "score": 1, "uri": "nyt://article/efa4a550-e283-558f-b135-ebe6222ba3cc"}, indent=2))


{
  "web_url": "https://www.nytimes.com/2018/05/03/nyregion/nyc-safe-injection-sites-heroin.html",
  "snippet": "Though no injection sites exist yet in the United States, the endorsement of the strategy by New York may give the movement behind it special impetus.",
  "print_page": "1",
  "blog": [],
  "source": "The New York Times",
  "headline": {
    "main": "De Blasio Moves to Bring Safe Injection Sites to New York City",
    "kicker": null,
    "content_kicker": null,
    "print_headline": "To Curb Overdoses, New York Plans to Try Safe Injection Sites",
    "name": null,
    "seo": null,
    "sub": null
  },
  "keywords": [
    {
      "name": "persons",
      "value": "de Blasio, Bill",
      "rank": 1,
      "major": "N"
    },
    {
      "name": "glocations",
      "value": "New York City",
      "rank": 2,
      "major": "N"
    },
    {
      "name": "subject",
      "value": "Drug Abuse and Traffic",
      "rank": 3,
      "major": "N"
    },
    {
      "name": "subject",
 