# Using the module `nytcomments` to retrieve comments on NYT articles:

First we import the three useful functions `get_dataset`, `get_comments` and `get_articles` from the module `nytcomments`.

In [1]:
from nytcomments import get_dataset, get_comments, get_articles

The main function `get_dataset` retrieves the complete dataset, with information about both the comments and their respective articles, whereas the other two functions, `get_comments` and `get_articles`, retrieve comments or articles alone. The function `get_comments` can be used in place of the now-deprecated NYT Community API, with the additional benefit of returning the data as a ready-to-use `pandas` dataframe (or `csv` files) without requiring an API key. The function `get_articles` can be used as an API wrapper for the NYT article search API, in which the JSON data is converted and processed to return a `pandas` dataframe (or `csv` files).

## Retrieving comments given the url of an article using the function `get_comments`:

Given the url of an article, the function `get_comments` retrieves the article's comments as a `pandas` dataframe, with a built-in option to save the comments as ready-to-use `csv` files. The function is a substitute for the NYT Community API, which is deprecated and currently has unresolved bugs. Just like the NYT Community API, the function retrieves all the comments except nested replies at depths greater than 2, and for each comment or reply it retrieves at most 3 replies. The function `get_comments` does not use any API, so no API key is required for it.

We pass the `article_url` as an input argument and collect the comments in the `pandas` dataframe `comments_df`:

In [2]:
article_url = 'https://www.nytimes.com/2018/04/08/us/facebook-users-data-harvested-cambridge-analytica.html'
comments_df, total_comments = get_comments(article_url)

Retrieved 164 comments


The first 5 rows of the dataframe are as follows: 

In [3]:
comments_df.head()

Unnamed: 0,approveDate,commentBody,commentID,commentSequence,commentTitle,commentType,createDate,depth,editorsSelection,parentID,...,status,timespeople,trusted,updateDate,userDisplayName,userID,userLocation,userTitle,userURL,inReplyTo
0,1523303298,What I find amusing is that FB has very little...,26657017,26657017,<br/>,comment,1523299839,1,False,,...,approved,1,0,1523303298,Jessica,54109366,New York,,,
1,1523303278,"So Cambridge Analytica, Qualtrics, and Faceboo...",26657471,26657471,<br/>,comment,1523301614,1,False,,...,approved,1,0,1523303278,Charlie,60562074,Iowa,,,
2,1523303278,"That last line - ""Facebook, though, is a tough...",26656949,26656949,<br/>,comment,1523299592,1,False,,...,approved,1,0,1523303278,Pamela Thacher,61119821,"Canton, NY",,,
3,1523303276,"So, here is the action plan:<br/>First thing i...",26656457,26656457,<br/>,comment,1523297425,1,False,,...,approved,1,0,1523303276,Lostin24,63458708,Michigan,,,
4,1523303273,Facebook has said they will alert those who ha...,26657130,26657130,<br/>,comment,1523300305,1,False,,...,approved,1,0,1523303273,AW,8559639,California,,,
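
The `commentBody` values shown above contain HTML remnants such as `<br/>`, and text fields can also carry HTML entities such as `&amp;`. The following is a minimal sketch, using only the Python standard library, of one way to turn such a string into plain text. The helper name `clean_comment_text` is hypothetical and is not part of `nytcomments`.

```python
import html
import re

# Matches <br>, <br/> and <br /> in any letter case.
_BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)


def clean_comment_text(raw):
    """Replace <br/> tags with newlines and decode HTML entities."""
    text = _BR_TAG.sub("\n", raw)
    return html.unescape(text).strip()


# Beginning of a comment body from the table above.
sample = "So, here is the action plan:<br/>First thing"
print(clean_comment_text(sample))
```

If the comments are held in a `pandas` dataframe, the same helper could be applied column-wise, for example with `comments_df['commentBody'].apply(clean_comment_text)`.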


The features of the dataframe along with the respective datatypes are as follows:

In [4]:
comments_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 236 entries, 0 to 0
Data columns (total 29 columns):
approveDate              236 non-null object
commentBody              236 non-null object
commentID                236 non-null int64
commentSequence          236 non-null int64
commentTitle             236 non-null object
commentType              236 non-null object
createDate               236 non-null object
depth                    236 non-null int64
editorsSelection         236 non-null bool
parentID                 72 non-null object
parentUserDisplayName    72 non-null object
permID                   236 non-null object
picURL                   236 non-null object
recommendations          236 non-null int64
recommendedFlag          0 non-null object
replies                  236 non-null object
replyCount               236 non-null int64
reportAbuseFlag          0 non-null object
sharing                  236 non-null int64
status                   236 non-null object
timespeople
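
In the listing above, `approveDate` and `createDate` are reported as `object`, and the values shown earlier (for example `1523303298`) look like Unix timestamps in seconds. Assuming that interpretation is correct, the sketch below converts such a value to a timezone-aware datetime using only the standard library. The helper name `epoch_to_utc` is hypothetical.

```python
from datetime import datetime, timezone


def epoch_to_utc(value):
    """Convert a Unix timestamp in seconds (int or numeric string) to a UTC datetime."""
    return datetime.fromtimestamp(int(value), tz=timezone.utc)


# approveDate of the first comment shown earlier.
print(epoch_to_utc("1523303298").isoformat())  # 2018-04-09T19:48:18+00:00
```

For a whole column, `pandas` offers an equivalent route via `pd.to_datetime(comments_df['approveDate'].astype('int64'), unit='s')`, again assuming the column holds epoch seconds.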

The function `get_comments` has three arguments:
- article_url: The web url of an article.
- save: The option to save the comments' dataframe as a `csv` file. Default is False.
- printout: The option to print output logging the comment retrieval process. Default is True.

The arguments `save` and `printout` are optional. 

Here we save the comments as a `csv` file and suppress the output.

In [5]:
get_comments(article_url, save=True, printout=False);

## Retrieving comments' and articles' dataset using the function `get_dataset`:

For the next two functions, `get_dataset` and `get_articles`, an NYT article search API key is required, which can be obtained from [NYT Developers' Network](http://developer.nytimes.com/).

In [6]:
ARTICLE_API_KEY = '################################' # Please use your API-key here.
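
Rather than pasting the key into the notebook, it can be read from the environment. This is a minimal sketch, not something `nytcomments` requires: the variable name `NYT_ARTICLE_API_KEY` and the helper `load_api_key` are assumptions chosen for illustration.

```python
import os


def load_api_key(var_name="NYT_ARTICLE_API_KEY"):
    """Return the API key stored in the given environment variable."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable to your API key.")
    return key


# Usage, once the variable is set in your shell:
# ARTICLE_API_KEY = load_api_key()
```

This keeps the key out of shared notebooks and version control.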

The main function `get_dataset` has the following arguments:
- ARTICLE_API_KEY: The API key for the NYT article search API that can be obtained from [NYT Developers' Network](http://developer.nytimes.com/signup).  

Using the [NYT article search API](http://developer.nytimes.com/), we retrieve 10 articles per page. The range of pages is from 0 to 199.
- page_lower:  The lower limit for the article search. The range allowed is 0 to page_upper or 199, whichever is lower. Default is 0.
- page_upper:  The upper limit for the article search. The range allowed is 0 or page_lower, whichever is higher, to 199. Default is 30.
- begin_date: The begin date for the article search. The required format is `YYYYMMDD`. Default date is `20081031`, the [date of release for the NYT Community API](https://open.blogs.nytimes.com/2008/10/30/announcing-the-new-york-times-community-api/). 
- end_date: The end date for the article search. The required format is `YYYYMMDD`. Default date is current date.
- query: The keywords for the article search query. The comments from all the articles containing the query term are retrieved.
- sort: The chronological order for the article search. The two options are `oldest` or `newest`. Default is `newest`.
- max_comments: The maximum number of comments to be retrieved. Default is 50,000 comments.
- save: The option to save the dataframes as `csv` files named `Articles.csv` and `Comments.csv` in the current directory. Default is False.
- printout: The option to either print or suppress the output logging the retrieval process. Default is True.

All the above arguments except `ARTICLE_API_KEY` are optional.
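
The constraints listed above (pages between 0 and 199, dates as `YYYYMMDD`) can be checked before calling the function. The sketch below uses only the standard library. The helper names `to_api_date` and `check_page_range` are hypothetical and not part of `nytcomments`.

```python
from datetime import date

MAX_PAGE = 199          # highest page index, per the text above
ARTICLES_PER_PAGE = 10  # articles returned per page, per the text above


def to_api_date(day):
    """Format a datetime.date as a YYYYMMDD string."""
    return day.strftime("%Y%m%d")


def check_page_range(page_lower, page_upper):
    """Raise ValueError unless 0 <= page_lower <= page_upper <= MAX_PAGE."""
    if not 0 <= page_lower <= page_upper <= MAX_PAGE:
        raise ValueError(
            f"pages must satisfy 0 <= page_lower <= page_upper <= {MAX_PAGE}"
        )
    return page_lower, page_upper


print(to_api_date(date(2018, 4, 2)))  # 20180402
print(check_page_range(0, 30))        # (0, 30)
```

The results could then be passed on, for example as `begin_date=to_api_date(date(2018, 4, 2))`.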

The function returns two dataframes, one each for articles and comments. Of all the articles published in the NYT in Jan-Feb 2017, roughly 15% were open for comments. The function stores and returns information only about the articles that have comments.

In [7]:
articles_df, comments_df = get_dataset(ARTICLE_API_KEY, page_lower=0, page_upper=2)

Page:  0
Retrieved 729 comments
Article url: https://www.nytimes.com/2018/04/08/opinion/fox-friends-trump.html
Retrieved 59 comments
Article url: https://www.nytimes.com/2018/04/08/us/politics/john-bolton-trump.html
Retrieved 231 comments
Article url: https://www.nytimes.com/2018/04/08/opinion/judicial-independence.html
Page:  1
Retrieved 110 comments
Article url: https://www.nytimes.com/2018/04/08/nyregion/trump-tower-fire-art-collector.html
Retrieved 164 comments
Article url: https://www.nytimes.com/2018/04/08/us/facebook-users-data-harvested-cambridge-analytica.html

Total articles stored:  5
Total comments retrieved:  1835


In [8]:
articles_df.head()

Unnamed: 0,articleID,byline,documentType,headline,keywords,multimedia,newDesk,printPage,pubDate,sectionName,snippet,source,typeOfMaterial,webURL,articleWordCount
0,5acaab47068401528a2a509b,CHARLES M. BLOW,article,Horror of Being Governed by 'Fox &amp; Friends',"[News and News Media, United States Politics a...",68,OpEd,19,2018-04-08 23:52:37,Unknown,Trump’s Fox fixation isn’t benign or inconsequ...,The New York Times,Op-Ed,https://www.nytimes.com/2018/04/08/opinion/fox...,731
1,5acaa80a068401528a2a508c,By PETER BAKER,article,Fiery National Security Adviser For a Powder-K...,"[Bolton, John R, Trump, Donald J, National Sec...",68,Washington,1,2018-04-08 23:38:48,Politics,Mr. Bolton takes over as the president’s third...,The New York Times,News,https://www.nytimes.com/2018/04/08/us/politics...,2560
2,5acaa675068401528a2a5080,By THE EDITORIAL BOARD,article,State Courts Under Attack,"[Elections, Courts and the Judiciary, Politics...",65,Editorial,18,2018-04-08 23:33:12,Unknown,Lawmakers around the country are treating judg...,The New York Times,Editorial,https://www.nytimes.com/2018/04/08/opinion/jud...,1056
3,5acaa258068401528a2a503d,By JOHN LELAND and LUIS FERRÉ-SADURNÍ,article,Art Collector and Bon Vivant Dies in Fire at t...,"[Brassner, Todd R (1951-2018), Trump Tower (Ma...",66,Metro,15,2018-04-08 23:14:30,Unknown,Todd Brassner lived alone amid a collection of...,The New York Times,News,https://www.nytimes.com/2018/04/08/nyregion/tr...,839
4,5aca9ded068401528a2a501c,By MATTHEW ROSENBERG and GABRIEL J.X. DANCE,article,Affected Users Say Facebook Betrayed Them,"[Cambridge Analytica, Facebook Inc, Data-Minin...",65,Investigative,1,2018-04-08 22:55:38,Unknown,Facebook plans to begin notifying users on Mon...,The New York Times,News,https://www.nytimes.com/2018/04/08/us/facebook...,1399


In [9]:
articles_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 15 columns):
articleID           5 non-null category
byline              5 non-null category
documentType        5 non-null category
headline            5 non-null object
keywords            5 non-null object
multimedia          5 non-null int64
newDesk             5 non-null category
printPage           5 non-null int32
pubDate             5 non-null datetime64[ns]
sectionName         5 non-null category
snippet             5 non-null object
source              5 non-null category
typeOfMaterial      5 non-null category
webURL              5 non-null category
articleWordCount    5 non-null int64
dtypes: category(8), datetime64[ns](1), int32(1), int64(2), object(3)
memory usage: 1.5+ KB


In [10]:
comments_df.head()

Unnamed: 0,approveDate,articleID,articleWordCount,commentBody,commentID,commentSequence,commentTitle,commentType,createDate,depth,...,status,timespeople,trusted,typeOfMaterial,updateDate,userDisplayName,userID,userLocation,userTitle,userURL
0,1523305668,5acaab47068401528a2a509b,731,I was always a little leery of the quality of ...,26658107.0,26658107.0,<br/>,comment,1523304099,1.0,...,approved,1,0,Op-Ed,1523305668,Zack,66580591.0,Ottawa,,
1,1523305662,5acaab47068401528a2a509b,731,"Another ""I'm smart and if you disagree with m...",26658359.0,26658359.0,<br/>,comment,1523304807,1.0,...,approved,0,0,Op-Ed,1523305662,Jim Brown,68445462.0,Brooklyn,,
2,1523305637,5acaab47068401528a2a509b,731,"Except for any truly newsworthy event, I disco...",26658073.0,26658073.0,<br/>,comment,1523303963,1.0,...,approved,1,0,Op-Ed,1523305637,Michael,55663429.0,Ottawa,,
3,1523305636,5acaab47068401528a2a509b,731,"After that diatribe, you’ve got to make an app...",26658775.0,26658775.0,<br/>,comment,1523305367,1.0,...,approved,1,0,Op-Ed,1523305636,Ed,50727767.0,"Old Field, NY",,
4,1523305351,5acaab47068401528a2a509b,731,"I m not ""governed"" by Fox news. I listen to F...",26657581.0,26657581.0,<br/>,comment,1523302005,1.0,...,approved,1,0,Op-Ed,1523305351,Sarah,34539023.0,N.J.,,


In [12]:
articles_df, comments_df = get_dataset(ARTICLE_API_KEY, begin_date='20180402', end_date='20180403', 
                                        query='gun violence')

Page:  0
Page:  1
Retrieved 371 comments
Article url: https://www.nytimes.com/2018/04/02/sports/nfl-cheerleaders.html
Page:  2
Retrieved 17 comments
Article url: https://www.nytimes.com/2018/04/02/nyregion/south-bronx-jail-rikers-island.html
Retrieved 463 comments
Article url: https://www.nytimes.com/2018/04/02/us/teacher-strikes-oklahoma-kentucky.html
Page:  3
Retrieved 17 comments
Article url: https://www.nytimes.com/2018/04/02/arts/television/here-and-now-recap-yes.html
Page:  4
Retrieved 225 comments
Article url: https://www.nytimes.com/2018/04/02/opinion/tennessee-guns-schools.html
Retrieved 885 comments
Article url: https://www.nytimes.com/2018/04/02/magazine/gun-culture-is-my-culture-and-i-fear-for-what-it-has-become.html
Page:  5
Retrieved 71 comments
Article url: https://www.nytimes.com/2018/04/01/business/economy/labor-recruit.html
Retrieved 463 comments
Article url: https://www.nytimes.com/2018/04/01/opinion/stephon-clark-tragedy.html
Page:  6
No aricles found on page 6

Tot

In [14]:
articles_df, comments_df = get_dataset(ARTICLE_API_KEY, begin_date='20180101', sort='oldest', max_comments=1000)

Page:  0
Page:  1
Retrieved 535 comments
Article url: https://www.nytimes.com/2017/12/31/opinion/failed-war-on-drugs.html
Page:  2
Retrieved 250 comments
Article url: https://www.nytimes.com/2017/12/31/opinion/7-wishes-for-2018.html
Page:  3
Retrieved 50 comments
Article url: https://www.nytimes.com/2018/01/01/us/at-veterans-hospital-in-oregon-a-push-for-better-ratings-puts-patients-at-risk-doctors-say.html
Page:  4
Retrieved 11 comments
Article url: https://www.nytimes.com/2018/01/01/realestate/shopping-fireplaces-wood-stoves.html
Retrieved 427 comments
Article url: https://www.nytimes.com/2018/01/01/world/asia/kim-jong-un-offer-talks-south-korea-and-us.html
Retrieved 145 comments
Article url: https://www.nytimes.com/2018/01/01/reader-center/ag-sulzberger-publisher-questions.html

Total articles stored:  6
Total comments retrieved:  1977


In [15]:
articles_df, comments_df = get_dataset(ARTICLE_API_KEY, end_date='20180409', max_comments=1000, save=True)

Page:  0
Retrieved 732 comments
Article url: https://www.nytimes.com/2018/04/08/opinion/fox-friends-trump.html
Retrieved 59 comments
Article url: https://www.nytimes.com/2018/04/08/us/politics/john-bolton-trump.html
Retrieved 235 comments
Article url: https://www.nytimes.com/2018/04/08/opinion/judicial-independence.html

Total articles stored:  3
Total comments retrieved:  1391
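
With `save=True`, the dataframes are written to `Articles.csv` and `Comments.csv`. The following is a minimal sketch of reading the saved comments back with the standard library and counting comments per article. It assumes the saved file keeps the `articleID` column shown in `comments_df.head()` above. The helper name `comments_per_article` and the in-memory sample rows are for illustration only.

```python
import csv
import io
from collections import Counter


def comments_per_article(csv_file):
    """Count rows per articleID in an open CSV file of comments."""
    reader = csv.DictReader(csv_file)
    return Counter(row["articleID"] for row in reader)


# Tiny in-memory stand-in for Comments.csv; the real file has many more
# columns, and the last row here is made up for illustration.
sample = io.StringIO(
    "articleID,commentID\n"
    "5acaab47068401528a2a509b,26658107\n"
    "5acaab47068401528a2a509b,26658359\n"
    "5acaa80a068401528a2a508c,1\n"
)
counts = comments_per_article(sample)
print(counts.most_common(1))
```

For the real file, the same function could be called on `open('Comments.csv', newline='')`; the file's encoding is not stated in the text above, so it may need to be passed explicitly.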


## Retrieving articles' dataframe using the function `get_articles`:

The function `get_articles` has arguments similar to those above, except that the default values of some arguments differ:
- ARTICLE_API_KEY: The API key for the NYT article search API that can be obtained from [NYT Developers' Network](http://developer.nytimes.com/signup).  

Using the [NYT article search API](http://developer.nytimes.com/), we retrieve 10 articles per page. The range of pages is from 0 to 199.
- page_lower:  The lower limit for the article search. The range allowed is 0 to page_upper or 199, whichever is lower. Default is 0.
- page_upper:  The upper limit for the article search. The range allowed is 0 or page_lower, whichever is higher, to 199. Default is 50.
- begin_date: The begin date for the article search. The required format is `YYYYMMDD`. Default date is `20081031`, the [date of release for the NYT Community API](https://open.blogs.nytimes.com/2008/10/30/announcing-the-new-york-times-community-api/). 
- end_date: The end date for the article search. The required format is `YYYYMMDD`. Default date is current date.
- query: The keywords for the article search query. All the articles containing the query term are retrieved.
- sort: The chronological order for the article search. The two options are `oldest` or `newest`. Default is `newest`.
- max_articles: The maximum number of articles to be retrieved. Default is 100,000 articles.
- save: The option to save the dataframes as `csv` files named `Articles.csv` and `Comments.csv` in the current directory. Default is False.
- printout: The option to either print or suppress the output logging the retrieval process. Default is True.

All the above arguments except `ARTICLE_API_KEY` are optional.

The function returns a `pandas` dataframe for articles. Unlike the function `get_dataset` above, this function stores and returns information about all the articles found by the search with the arguments passed to the function.

In [16]:
articles_df = get_articles(ARTICLE_API_KEY, begin_date='April02-2018', end_date='2018/04/03', query='gun violence')

Page:  0
Article url: https://www.nytimes.com/2018/04/02/us/politics/justice-department-california-lawsuit-land-use-law.html
Article url: https://www.nytimes.com/2018/04/02/us/martin-luther-king-memphis.html
Article url: https://www.nytimes.com/interactive/2018/04/02/us/king-mlk-last-sermon-annotated.html
Article url: https://www.nytimes.com/aponline/2018/04/02/us/ap-us-school-shooting-ted-nugent.html
Article url: https://www.nytimes.com/2018/04/02/us/cliff-crash-intentional.html
Article url: https://www.nytimes.com/2018/04/02/arts/television/bill-cosby-trial-jury-selection.html
Article url: https://www.nytimes.com/2018/04/02/world/americas/caravans-migrants-mexico-trump.html
Article url: https://www.nytimes.com/2018/04/02/briefing/trade-war-oklahoma-march-madness.html
Article url: https://www.nytimes.com/2018/04/02/obituaries/martin-luther-king-jr.html
Article url: https://www.nytimes.com/2018/04/02/us/students-on-martin-luther-king-legacy.html
Page:  1
Article url: https://www.nytime

In [17]:
articles_df.head()

Unnamed: 0,articleID,byline,documentType,headline,keywords,multimedia,newDesk,printPage,pubDate,sectionName,snippet,source,typeOfMaterial,webURL,articleWordCount
0,5ac2c09a068401528a2a10d7,By KATIE BENNER,article,Justice Dept. Takes Aim at California To Wrest...,"[United States Politics and Government, Califo...",68,Washington,12,2018-04-02 23:45:26,Politics,The Trump administration’s second suit against...,The New York Times,News,https://www.nytimes.com/2018/04/02/us/politics...,797
1,5ac2c018068401528a2a10d0,"By ALAN BLINDER, JERRY GRAY and MORRIGAN McCARTHY",article,A Dampened Dream,"[King, Martin Luther Jr, Civil Rights and Libe...",68,National,12,2018-04-02 23:43:17,Unknown,“There is neither optimism nor strong pessimis...,The New York Times,News,https://www.nytimes.com/2018/04/02/us/martin-l...,553
2,5ac2bdb6068401528a2a10bc,By NIKITA STEWART,multimedia,Unknown,"[King, Martin Luther Jr, Speeches and Statements]",66,U.S.,0,2018-04-02 23:33:07,Unknown,"Part rousing sermon/part somber premonition, “...",The New York Times,Interactive Feature,https://www.nytimes.com/interactive/2018/04/02...,0
3,5ac2bd0d068401528a2a10b3,By THE ASSOCIATED PRESS,article,Ted Nugent: Florida School Shooting Survivor '...,[],0,,0,2018-04-02 23:30:18,Unknown,"Rocker Ted Nugent says a Parkland, Florida, sc...",AP,News,https://www.nytimes.com/aponline/2018/04/02/us...,142
4,5ac2bcdd068401528a2a10af,By NIRAJ CHOKSHI,article,Family’s Fatal Plunge Off Cliff May Have Been ...,"[Deaths (Fatalities), Hart, Jennifer (1979-201...",63,National,0,2018-04-02 23:29:31,Unknown,The parents and at least three of their six ad...,The New York Times,News,https://www.nytimes.com/2018/04/02/us/cliff-cr...,761


## Points to note:
Though there is no limit on the number of comments that can be retrieved daily, the article search is restricted to 1000 articles per day. Since the function `get_dataset` uses the NYT article search API internally, this limit on the number of articles applies to it as well. You must agree to the [Terms of Use](http://developer.nytimes.com/tou) for the NYT API to retrieve the articles' data using this package. The functions sometimes fail with `HTTP: 500 Internal Server Error` because of the NYT API; if that happens, the solution is to call the function again.
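
The advice to call the function again after a server error can be automated with a small retry wrapper. This is a minimal sketch under stated assumptions: the helper `call_with_retries` is hypothetical, and because the exact exception type raised by the package is not documented here, it retries on any `Exception`.

```python
import time


def call_with_retries(func, *args, retries=3, delay=1.0, **kwargs):
    """Call func(*args, **kwargs), retrying after a pause if it raises."""
    for attempt in range(1, retries + 1):
        try:
            return func(*args, **kwargs)
        except Exception:  # assumption: the exception type is not documented
            if attempt == retries:
                raise  # give up and surface the last error
            time.sleep(delay)
```

A call might then look like `call_with_retries(get_dataset, ARTICLE_API_KEY, page_lower=0, page_upper=2)`. Keep the daily article limit in mind, since every retry issues new requests.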

## Acknowledgement:
* Part of the code used above is inspired by the code written by [Neal Caren](http://nealcaren.web.unc.edu/scraping-comments-from-the-new-york-times/), with some modifications.
* NYT article search API is used for the article search.


The data collected using this module has been contributed to [Kaggle datasets](https://www.kaggle.com/aashita/nyt-comments), and more contributions are welcome. A [detailed exploration](https://www.kaggle.com/aashita/nyt-comments-eda) of the data, with graphs and some models built using it, can also be found on Kaggle.