<a href="https://colab.research.google.com/github/Dawudis/Tokenization-Coreference-Experiments-on-NYT-Articles/blob/main/Spacy_Tokenization_w_Coreference_Resolution_01252022.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Scraping NYT Articles**

In [None]:
# use pynytimes to fetch article URLs from the NYT Article Search API
%pip install pynytimes

In [2]:
import datetime

from pynytimes import NYTAPI
nyt = NYTAPI("YOUR_NYT_API_KEY", parse_dates=True)  # replace the placeholder with your own NY Times API key

# to see more on pynytimes, go to: https://github.com/michadenheijer/pynytimes

articles = nyt.article_search(
    query = "politics",
    results = 30,
    dates = {
        "begin": datetime.datetime(2021, 1, 1), 
        "end": datetime.datetime(2022, 1, 1) #extracting article urls from 1/1/2021 to 1/1/2022
    }
)

In [3]:
import pandas as pd
df = pd.DataFrame(articles, columns=['web_url'])  # put the article URLs into a DataFrame
print(df.head())
print(df.columns)
print(df.shape)

                                             web_url
0  https://www.nytimes.com/2021/12/23/us/politics...
1  https://www.nytimes.com/2021/12/31/us/politics...
2  https://www.nytimes.com/2021/12/12/arts/televi...
3  https://www.nytimes.com/2021/12/30/us/white-ho...
4  https://www.nytimes.com/2021/12/30/opinion/sup...
Index(['web_url'], dtype='object')
(30, 1)
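
`pd.DataFrame(articles, columns=['web_url'])` keeps only the `web_url` key of each result dictionary. The same selection can be sketched without pandas. The sample records and the `extract_urls` helper below are hypothetical and only mimic the shape of the search results. Unlike the DataFrame, which would hold a missing value for a record without a URL, this sketch skips such records.

```python
# Hypothetical records shaped like the article_search results (made-up URLs).
sample_articles = [
    {"web_url": "https://www.example.com/a", "headline": {"main": "A"}},
    {"web_url": "https://www.example.com/b", "headline": {"main": "B"}},
    {"headline": {"main": "no url"}},  # a record without a URL
]


def extract_urls(articles):
    """Return the web_url of every record that has one, in order."""
    return [a["web_url"] for a in articles if a.get("web_url")]


urls = extract_urls(sample_articles)
print(urls)
```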


In [None]:
# use news-please to scrape the article text from each URL
!pip3 install news-please

In [5]:
from newsplease import NewsPlease

result = []
for url in df["web_url"]:
    # download each article and keep only its main body text
    result.append(NewsPlease.from_url(url).maintext)
# result is now a list with one text string per article

# write the article texts to a text file so the corpus can be read back and split into sentences
with open("nyarticles.txt", "w") as textfile:
    for element in result:
        textfile.write(element + "\n")
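
The loop above assumes every download succeeds. news-please may leave `maintext` as `None` when it cannot extract any body text, and in that case `element + "\n"` raises a `TypeError`. A defensive sketch follows. `write_articles` is a hypothetical helper, not part of news-please, and the demo texts stand in for the scraped `result` list.

```python
import os
import tempfile


def write_articles(texts, path):
    """Write each article followed by a newline, skipping missing or empty texts.

    Returns the number of articles actually written.
    """
    written = 0
    with open(path, "w", encoding="utf-8") as fh:
        for text in texts:
            if not text:  # None or "" (extraction failed)
                continue
            fh.write(text.strip() + "\n")
            written += 1
    return written


# Stand-in for the scraped `result` list; None marks a failed extraction.
demo_texts = ["First article.", None, "Second article.\n"]
demo_path = os.path.join(tempfile.mkdtemp(), "nyarticles_demo.txt")
count = write_articles(demo_texts, demo_path)
with open(demo_path, encoding="utf-8") as fh:
    demo_lines = fh.read().splitlines()
print(count, demo_lines)
```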

In [6]:
with open('nyarticles.txt', 'r') as f:
    data = f.read()  # read the whole file into a single string

# **Apply Coreference Resolution + Tokenize + Clean the Text and Apply the Entity Extraction Function**

In [None]:
# install neuralcoref together with the pinned spaCy version and English model it is used with
!pip install spacy==2.1.0
!python -m spacy download en
!pip install neuralcoref
import spacy
import neuralcoref
nlp = spacy.load('en')
neuralcoref.add_to_pipe(nlp)

In [8]:
# run the pipeline on the article text and get the coreference-resolved text
doc = nlp(data)
coref_resolved = doc._.coref_resolved
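
`doc._.coref_resolved` returns the text with each coreferring mention replaced by the main mention of its cluster. The toy function below illustrates that substitution on a plain token list. It is not neuralcoref's implementation: the function name, the cluster format (a main mention plus token-index spans), and the example sentence are all assumptions made for the illustration.

```python
def resolve_mentions(tokens, clusters):
    """Replace each mention span with its cluster's main mention.

    tokens:   list of word strings
    clusters: list of (main_mention, [(start, end), ...]), end exclusive
    """
    replacement = {}  # start index -> (end index, main mention)
    for main, spans in clusters:
        for start, end in spans:
            replacement[start] = (end, main)

    out, i = [], 0
    while i < len(tokens):
        if i in replacement:
            end, main = replacement[i]
            out.append(main)
            i = end  # skip the tokens of the replaced mention
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


tokens = "Mr. Roth said he disagreed with the ruling".split()
clusters = [("Mr. Roth", [(3, 4)])]  # token 3 ("he") refers to "Mr. Roth"
resolved = resolve_mentions(tokens, clusters)
print(resolved)
```

Substitution of this kind is a likely source of phrases such as "only we judges can fix only we judges" in the extracted sentences: every mention in a cluster is overwritten, whether or not the result reads naturally.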

In [9]:
# use spaCy again to split the resolved text into sentences and to supply the entities for the extraction step below
# a different model is loaded for the sentence splitting because it gave better accuracy
nlp1 = spacy.load("en_core_web_sm")
doc1 = nlp1(coref_resolved)
sentences = doc1.sents  # generator of sentence spans; it can be iterated only once

In [10]:
# entity extraction: collect every sentence that contains a PERSON entity into test_list
test_list = []
for item in sentences:
  for ent in item.ents:
    if ent.label_ == "PERSON":
      test_list.append(item)

In [11]:
# test_list contains repeated sentences: a sentence is appended once for each PERSON entity in it
test_list[0:5]

[the 50-50 split in the Senate means that Vice President Kamala Harris must break tied votes,
 Similarly, Biden made an early two-pronged bet about the midterms: that a surging economy and a waning threat from the coronavirus would deliver victory to the Democrats.,
 “Chief Justice Roberts is taking a page from “Chief Justice Roberts old playbook: acknowledging institutional challenges in the judiciary but telling the public that only we judges can fix only we judges,” Mr. Roth said.,
 “Chief Justice Roberts is taking a page from “Chief Justice Roberts old playbook: acknowledging institutional challenges in the judiciary but telling the public that only we judges can fix only we judges,” Mr. Roth said.,
 “Chief Justice Roberts is taking a page from “Chief Justice Roberts old playbook: acknowledging institutional challenges in the judiciary but telling the public that only we judges can fix only we judges,” Mr. Roth said.]
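
The repeats arise because the inner loop appends the sentence once for every PERSON entity it contains. One way to avoid them at the source is to test each sentence a single time with `any`. The sketch below uses small stand-in classes in place of spaCy's objects so that it runs without a model; `FakeEnt`, `FakeSent`, and `sentences_with_person` are hypothetical names.

```python
from collections import namedtuple

# Stand-ins for spaCy objects: only the attributes the loop relies on.
FakeEnt = namedtuple("FakeEnt", ["text", "label_"])
FakeSent = namedtuple("FakeSent", ["text", "ents"])


def sentences_with_person(sentences):
    """Keep each sentence once if it has at least one PERSON entity."""
    return [s for s in sentences
            if any(ent.label_ == "PERSON" for ent in s.ents)]


demo_sents = [
    FakeSent("Roberts met Roth.", (FakeEnt("Roberts", "PERSON"),
                                   FakeEnt("Roth", "PERSON"))),
    FakeSent("The Senate voted.", (FakeEnt("Senate", "ORG"),)),
]
kept = sentences_with_person(demo_sents)
print([s.text for s in kept])
```

This only prevents repeats caused by several entities in one sentence. A sentence that genuinely occurs twice in the text would still appear twice.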

In [12]:
# remove the repeated sentences, keeping the first occurrence of each
repitition_removal = [] 
for i in test_list: 
    if i not in repitition_removal: 
        repitition_removal.append(i)
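
The membership test `i not in repitition_removal` rescans the list for every sentence, so the cost grows quadratically with the number of sentences. If the sentences are compared as strings, `dict.fromkeys` gives an order-preserving alternative in roughly linear time. A minimal sketch on plain strings follows. This is an assumption for illustration: the notebook itself compares spaCy spans, not strings, and `dedupe_keep_order` is a hypothetical helper.

```python
def dedupe_keep_order(items):
    """Drop repeated items while keeping first-occurrence order."""
    return list(dict.fromkeys(items))


demo_items = ["a said b.", "c said d.", "a said b.", "a said b."]
unique = dedupe_keep_order(demo_items)
print(unique)
```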

In [13]:
# the items of repitition_removal are spaCy Span objects rather than strings, which won't work in a DataFrame
repitition_removal[0:5]

[the 50-50 split in the Senate means that Vice President Kamala Harris must break tied votes,
 Similarly, Biden made an early two-pronged bet about the midterms: that a surging economy and a waning threat from the coronavirus would deliver victory to the Democrats.,
 “Chief Justice Roberts is taking a page from “Chief Justice Roberts old playbook: acknowledging institutional challenges in the judiciary but telling the public that only we judges can fix only we judges,” Mr. Roth said.,
 Chief Justice Roberts “Chief Justice Roberts addressed at some length a recent series of articles in The Wall Street Journal that found that 131 federal judges had violated a federal law by hearing 685 lawsuits between 2010 and 2018 that involved companies in which they or they families owned shares of stock.,
 Sesame Street,” which premiered in 1969, was the project of Joan Ganz Cooney, a TV executive who was originally more interested in the civil rights movement than in education but came to see the c

# **Input Results Into DataFrame**

In [14]:
# so convert each element of the list to a string
sentences = list(map(str, repitition_removal))
sentences[0:5]

['the 50-50 split in the Senate means that Vice President Kamala Harris must break tied votes',
 'Similarly, Biden made an early two-pronged bet about the midterms: that a surging economy and a waning threat from the coronavirus would deliver victory to the Democrats.\n',
 '“Chief Justice Roberts is taking a page from “Chief Justice Roberts old playbook: acknowledging institutional challenges in the judiciary but telling the public that only we judges can fix only we judges,” Mr. Roth said.',
 'Chief Justice Roberts “Chief Justice Roberts addressed at some length a recent series of articles in The Wall Street Journal that found that 131 federal judges had violated a federal law by hearing 685 lawsuits between 2010 and 2018 that involved companies in which they or they families owned shares of stock.\n',
 'Sesame Street,” which premiered in 1969, was the project of Joan Ganz Cooney, a TV executive who was originally more interested in the civil rights movement than in education but came
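
Some of the strings above end in `\n`, presumably a line break carried over from the text file. Stripping whitespace before building the DataFrame would keep the CSV cells clean. A small sketch follows; `clean_sentences` is a hypothetical helper and the sample strings are made up.

```python
def clean_sentences(sentences):
    """Strip surrounding whitespace and drop sentences that end up empty."""
    cleaned = (s.strip() for s in sentences)
    return [s for s in cleaned if s]


raw = ["Biden made an early bet.\n", "  Mr. Roth said.", "\n"]
clean = clean_sentences(raw)
print(clean)
```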

In [15]:
sent = pd.DataFrame(sentences, columns=['sentences'])  # build a DataFrame named 'sent' from the list of sentences
sent.head()

                                           sentences
0  the 50-50 split in the Senate means that Vice ...
1  Similarly, Biden made an early two-pronged bet...
2  “Chief Justice Roberts is taking a page from “...
3  Chief Justice Roberts “Chief Justice Roberts a...
4  Sesame Street,” which premiered in 1969, was t...


In [16]:
sent.to_csv('2021-2022corefspacy.csv', index=False) #converts dataframe to csv

In [17]:
#downloads csv file onto computer
from google.colab import files
files.download('2021-2022corefspacy.csv')