<a href="https://colab.research.google.com/github/Dawudis/Tokenization-Coreference-Experiments-on-NYT-Articles/blob/main/nnsplit_Tokenization_w_Coreference_Resolution_01252022.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Scraping NYT Articles**

In [None]:
# Use pynytimes to retrieve article URLs from the NYT Article Search API
%pip install pynytimes

In [None]:
import datetime

from pynytimes import NYTAPI
nyt = NYTAPI("YOUR_NYT_API_KEY", parse_dates=True)  # replace the placeholder with your own NYT API key

# to see more on pynytimes, go to: https://github.com/michadenheijer/pynytimes

articles = nyt.article_search(
    query = "politics",
    results = 30,
    dates = {
        "begin": datetime.datetime(2021, 1, 1), 
        "end": datetime.datetime(2022, 1, 1) #extracting article urls from 1/1/2021 to 1/1/2022
    }
)

In [3]:
import pandas as pd
df = pd.DataFrame(articles, columns=['web_url'])  # put the article URLs into a DataFrame
print(df.head())
print(df.columns)
print(df.shape)

                                             web_url
0  https://www.nytimes.com/2021/12/23/us/politics...
1  https://www.nytimes.com/2021/12/31/us/politics...
2  https://www.nytimes.com/2021/12/12/arts/televi...
3  https://www.nytimes.com/2021/12/30/us/white-ho...
4  https://www.nytimes.com/2021/12/30/opinion/sup...
Index(['web_url'], dtype='object')
(30, 1)
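
The DataFrame step above keeps only the `web_url` field of each search result. As a minimal sketch of the same selection in plain Python, assuming the search returns a list of dictionaries that each carry a `web_url` key (the records and the `extract_urls` helper below are made-up placeholders, not real API output), the URLs can also be collected and de-duplicated like this:

```python
# Hypothetical stand-ins for the records returned by article_search;
# only the 'web_url' key matters for this step.
sample_articles = [
    {"web_url": "https://example.com/a", "headline": "First"},
    {"web_url": "https://example.com/b", "headline": "Second"},
    {"web_url": "https://example.com/a", "headline": "First (duplicate)"},
]


def extract_urls(articles):
    """Return the unique web_url values, in their original order."""
    seen = set()
    urls = []
    for article in articles:
        url = article.get("web_url")
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


urls = extract_urls(sample_articles)
print(urls)
```

Dropping duplicate URLs before scraping avoids downloading the same article twice.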


In [None]:
# Use news-please to scrape the article text from each URL
!pip3 install news-please

In [5]:
from newsplease import NewsPlease

result = []
for url in df["web_url"]:
    # For every URL in the DataFrame, download the article and keep its main text.
    text = NewsPlease.from_url(url).maintext
    if text:  # maintext can be None when no body text was extracted
        result.append(text)
# result is now a list with one string per article.

# Write the article texts to a text file so that the combined text can be tokenized into sentences.
with open("nyarticles.txt", "w") as textfile:
    for element in result:
        textfile.write(element + "\n")

In [6]:
# Read the whole file back as a single string.
with open('nyarticles.txt', 'r') as f:
    data = f.read()
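
The file round trip above produces one long string with the articles separated by newlines. A minimal sketch, using placeholder texts rather than scraped articles, shows that joining the list in memory gives the same string, which may be convenient when the intermediate file is not needed:

```python
import os
import tempfile

# Placeholder article texts; None mimics an article whose text could not be extracted.
sample_result = ["First article text.", None, "Second article text."]

# Join the usable texts in memory, one article per line.
data_in_memory = "".join(text + "\n" for text in sample_result if text)

# The same content after a round trip through a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "articles.txt")
    with open(path, "w", encoding="utf-8") as handle:
        for text in sample_result:
            if text:
                handle.write(text + "\n")
    with open(path, "r", encoding="utf-8") as handle:
        data_from_file = handle.read()

print(data_in_memory == data_from_file)
```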

# **Apply Coreference Resolution + Tokenize + Clean the Text and Apply the Entity Extraction Function**

In [None]:
# Install the dependencies needed for neuralcoref
!pip install spacy==2.1.0
!python -m spacy download en
!pip install neuralcoref
import spacy
import neuralcoref
nlp = spacy.load('en')
neuralcoref.add_to_pipe(nlp)

In [8]:
# Convert the article text into coreference-resolved text
doc = nlp(data)
coref_resolved = doc._.coref_resolved
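
The resolved text is the original text with each coreferring mention replaced by the main mention of its cluster. The toy sketch below illustrates only that substitution idea; the `resolve_mentions` helper and the hand-written cluster are hypothetical and are not how neuralcoref works internally:

```python
def resolve_mentions(text, replacements):
    """Replace character spans with the given strings.

    replacements is a list of (start, end, new_text) tuples that do not overlap.
    """
    pieces = []
    cursor = 0
    for start, end, new_text in sorted(replacements):
        pieces.append(text[cursor:start])  # text before the mention, unchanged
        pieces.append(new_text)            # main mention substituted in
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


text = "Harris spoke first. She thanked the Senate."
# Hand-written cluster: "She" refers back to "Harris".
start = text.index("She")
resolved = resolve_mentions(text, [(start, start + len("She"), "Harris")])
print(resolved)
```

Substitution of this kind is why resolved sentences can repeat a name where the original used a pronoun.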

In [None]:
#install necessary dependencies for sentence tokenization model
!pip install nnsplit
from nnsplit import NNSplit
splitter = NNSplit.load("en")

In [10]:
#tokenize 'coref_resolved' text
splits = splitter.split([coref_resolved])

In [11]:
# print() displays an object by calling str() on it and appending a newline,
# so we can build the same text directly instead of redirecting stdout.
# Note: splitter.split takes a list of texts and returns one Split per text, so 'splits'
# has a single item here; per the nnsplit README, iterating over splits[0] yields the individual sentences.
output = "".join(str(sentence) + "\n" for sentence in splits)
# 'output' is a single string, so we still have to convert it into a list
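
For reference, the text that `print()` writes for an object is `str(obj)` followed by a newline. The sketch below uses a hypothetical `FakeSplit` class (a stand-in for any object that defines `__str__`, not the nnsplit class) to compare capturing printed output with `contextlib.redirect_stdout` against building the string directly:

```python
import contextlib
import io


class FakeSplit:
    """Hypothetical stand-in for an object that defines __str__."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text


items = [FakeSplit("First sentence."), FakeSplit("Second sentence.")]

# Option 1: capture what print() writes; stdout is restored automatically.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    for item in items:
        print(item)
captured = buffer.getvalue()

# Option 2: build the same string directly with str().
direct = "".join(str(item) + "\n" for item in items)

print(captured == direct)
```

The context manager restores `sys.stdout` even if an exception is raised inside the loop, which manual reassignment does not guarantee.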

In [28]:
# Now we write 'output' to a text file, then read each line of that file into the list 'tokens_list' (one entry per line)
with open("tokens.txt", "w") as text_file:
    text_file.write(output)

with open("tokens.txt") as text_file:
    tokens_list = text_file.readlines()
tokens_list[0:5]

['Sign up here to get On Politics in your inbox on Tuesdays and Thursdays.\n',
 'Given that this is the last On Politics newsletter before Christmas, and of 2021 for that matter, it seems like a good time to take stock and reflect on what a wish list might be for the nation’s leaders.\n',
 'Today, Democrats control both the White House and Congress. But the party’s hold on power is so slim — the 50-50 split in the Senate means that Vice President Kamala Harris must break tied votes — that the entire Biden agenda is dependent on every single Democrat’s falling into line. And they aren’t all doing so.\n',
 'History bodes poorly for the party’s in a first midterm election, and many Democrats are bracing for a rout in 2022. Here is what many Democrats think the nation’s leaders are looking for in the New Year:\n',
 'Biden: Biden won the Democratic nomination after making two early bets in the primary that paid off big: that Biden would be seen as the most electable Democrat and that Black 
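
Each entry above keeps its trailing `\n` because `readlines()` preserves line endings. As a minimal sketch on a made-up string, and assuming the text uses only `\n` line breaks, `str.splitlines` gives the same list without a file, and a small clean-up step drops the newlines and empty lines:

```python
output = "First line.\nSecond line.\n\nFourth line.\n"

# Same entries that readlines() would return for this text:
# every entry keeps its trailing newline.
with_newlines = output.splitlines(keepends=True)

# Dropping the newlines and skipping empty lines gives cleaner entries.
clean = [line.strip() for line in output.splitlines() if line.strip()]

print(with_newlines)
print(clean)
```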

In [13]:
#now we use a map function to apply spaCy's nlp function to each element of our list to prepare it for entity extraction
doc1 = list(map(nlp, tokens_list))

In [14]:
# Entity extraction: any sentence in the text that contains a PERSON entity is put into "test_list"
test_list = []
for item in doc1:
  for ent in item.ents:
    if ent.label_ == "PERSON":
      test_list.append(item)
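
Because the loop appends inside the inner `for`, a sentence with several PERSON entities is added several times. A hedged alternative is to test each sentence once with `any()`. The sketch below uses `SimpleNamespace` objects as hypothetical stand-ins for parsed sentences (only the `.ents` and `.label_` attributes are mimicked), so it runs without spaCy:

```python
from types import SimpleNamespace


def make_doc(text, labels):
    """Build a hypothetical stand-in for a parsed sentence with an .ents list."""
    ents = [SimpleNamespace(label_=label) for label in labels]
    return SimpleNamespace(text=text, ents=ents)


docs = [
    make_doc("Harris and Biden met.", ["PERSON", "PERSON"]),
    make_doc("The Senate voted.", ["ORG"]),
    make_doc("Roberts wrote the report.", ["PERSON"]),
]

# any() stops at the first PERSON entity, so each sentence is kept at most once.
person_sentences = [
    doc for doc in docs
    if any(ent.label_ == "PERSON" for ent in doc.ents)
]

print([doc.text for doc in person_sentences])
```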

In [25]:
# test_list contains repeated sentences: a sentence is appended once for every PERSON entity found in it
test_list[0:5]

[Today, Democrats control both the White House and Congress. But the party’s hold on power is so slim — the 50-50 split in the Senate means that Vice President Kamala Harris must break tied votes — that the entire Biden agenda is dependent on every single Democrat’s falling into line. And they aren’t all doing so.,
 Biden: Biden won the Democratic nomination after making two early bets in the primary that paid off big: that Biden would be seen as the most electable Democrat and that Black voters would be a loyal base. Both bets paid off. Similarly, Biden made an early two-pronged bet about the midterms: that a surging economy and a waning threat from the coronavirus would deliver victory to the Democrats.,
 “Chief Justice Roberts is taking a page from “Chief Justice Roberts old playbook: acknowledging institutional challenges in the judiciary but telling the public that only we judges can fix only we judges,” Mr. Roth said. “Yet the problems of overlooked financial conflicts and sexual

In [16]:
# Remove the repeated sentences from the list, keeping the first occurrence of each
repitition_removal = [] 
for i in test_list: 
    if i not in repitition_removal: 
        repitition_removal.append(i)
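
The membership test in the loop scans the whole result list for every item. When the items are plain strings, an order-preserving alternative is `dict.fromkeys`, since dictionaries keep insertion order in Python 3.7 and later. A minimal sketch on made-up sentences:

```python
sentences = ["A met B.", "C spoke.", "A met B.", "C spoke.", "D left."]

# Dictionary keys are unique and keep first-seen order, so this drops repeats
# while preserving the order in which sentences first appeared.
unique_sentences = list(dict.fromkeys(sentences))

print(unique_sentences)
```

This variant compares sentences by their text, so it assumes the items have already been converted to strings.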

In [26]:
# The items in repitition_removal are spaCy Doc objects, which won't work well in a DataFrame
repitition_removal[0:5]

[Today, Democrats control both the White House and Congress. But the party’s hold on power is so slim — the 50-50 split in the Senate means that Vice President Kamala Harris must break tied votes — that the entire Biden agenda is dependent on every single Democrat’s falling into line. And they aren’t all doing so.,
 Biden: Biden won the Democratic nomination after making two early bets in the primary that paid off big: that Biden would be seen as the most electable Democrat and that Black voters would be a loyal base. Both bets paid off. Similarly, Biden made an early two-pronged bet about the midterms: that a surging economy and a waning threat from the coronavirus would deliver victory to the Democrats.,
 “Chief Justice Roberts is taking a page from “Chief Justice Roberts old playbook: acknowledging institutional challenges in the judiciary but telling the public that only we judges can fix only we judges,” Mr. Roth said. “Yet the problems of overlooked financial conflicts and sexual

# **Input Results Into DataFrame**

In [27]:
#so we convert each element of the list into a string
sentences = list(map(str, repitition_removal))
sentences[0:5]

['Today, Democrats control both the White House and Congress. But the party’s hold on power is so slim — the 50-50 split in the Senate means that Vice President Kamala Harris must break tied votes — that the entire Biden agenda is dependent on every single Democrat’s falling into line. And they aren’t all doing so.\n',
 'Biden: Biden won the Democratic nomination after making two early bets in the primary that paid off big: that Biden would be seen as the most electable Democrat and that Black voters would be a loyal base. Both bets paid off. Similarly, Biden made an early two-pronged bet about the midterms: that a surging economy and a waning threat from the coronavirus would deliver victory to the Democrats.\n',
 '“Chief Justice Roberts is taking a page from “Chief Justice Roberts old playbook: acknowledging institutional challenges in the judiciary but telling the public that only we judges can fix only we judges,” Mr. Roth said. “Yet the problems of overlooked financial conflicts a

In [19]:
sent = pd.DataFrame(sentences, columns=['sentences'])  # convert the list to a DataFrame named 'sent' (short for sentences)
sent.head()

Unnamed: 0,sentences
0,"Today, Democrats control both the White House ..."
1,Biden: Biden won the Democratic nomination aft...
2,“Chief Justice Roberts is taking a page from “...
3,“Chief Justice Roberts “Chief Justice Roberts ...
4,"“Sesame Street,” which premiered in 1969, was ..."


In [20]:
sent.to_csv('2021-2022corefnnsplit.csv', index=False) #converts dataframe to csv
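
Sentences that contain commas are wrapped in double quotes in the CSV file so that they stay in a single field. The sketch below shows the same minimal-quoting convention with the standard `csv` module on made-up rows; it assumes the pandas writer uses this convention by default:

```python
import csv
import io

rows = [
    ["sentences"],
    ["Today, Democrats control both chambers."],
    ["Biden won the nomination."],
]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")  # default quoting is QUOTE_MINIMAL
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original fields, commas included.
read_back = list(csv.reader(io.StringIO(csv_text)))
print(read_back == rows)
```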

In [None]:
#downloads csv file onto computer
from google.colab import files
files.download('2021-2022corefnnsplit.csv')