# Using the module `nytcomments` to retrieve comments on NYT articles:

First we import the three useful functions `get_dataset`, `get_comments` and `get_articles` from the module `nytcomments`.

In [1]:
from nytcomments import get_dataset, get_comments, get_articles

The main function `get_dataset` retrieves the complete dataset, with information about both the comments and their respective articles, whereas the other two functions, `get_comments` and `get_articles`, retrieve comments or articles alone. The function `get_comments` can be used in place of the now-deprecated NYT Community API, with the additional benefit of returning the data as a ready-to-use `pandas` dataframe (or `csv` file) without requiring an API key. The function `get_articles` acts as a wrapper for the NYT article search API: it converts and processes the JSON data and returns a `pandas` dataframe (or `csv` file).

## Retrieving comments' and articles' dataset using the function `get_dataset`:

For the two functions `get_dataset` and `get_articles`, the NYT article_search API key is required, which can be obtained from [NYT Developers' Network](http://developer.nytimes.com/).

In [2]:
ARTICLE_API_KEY =  '################################' # Please use your API-key here.
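
To keep the key out of a shared notebook, it can be read from an environment variable instead of being pasted into the cell. The following is a minimal sketch; the helper `load_api_key` and the variable name `NYT_ARTICLE_API_KEY` are assumptions made for illustration and are not part of `nytcomments`.

```python
import os


def load_api_key(env_var="NYT_ARTICLE_API_KEY"):
    """Return the API key stored in an environment variable.

    The variable name is a hypothetical choice for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key.")
    return key
```

With the variable set in the shell that launches the notebook, `ARTICLE_API_KEY = load_api_key()` can replace the assignment above.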

The main function `get_dataset` has the following arguments:
- ARTICLE_API_KEY: The API key for the NYT article search API that can be obtained from [NYT Developers' Network](http://developer.nytimes.com/signup).  

Using the [NYT article search API](http://developer.nytimes.com/), we retrieve 10 articles per page. The range of pages is from 0 to 199.
- page_lower:  The lower limit for the article search. The range allowed is 0 to page_upper or 199, whichever is lower. Default is 0.
- page_upper:  The upper limit for the article search. The range allowed is 0 or page_lower, whichever is higher, to 199. Default is 30.
- begin_date: The begin date for the article search. The required format is `YYYYMMDD`. Default date is `20081031`, the [date of release for the NYT Community API](https://open.blogs.nytimes.com/2008/10/30/announcing-the-new-york-times-community-api/). 
- end_date: The end date for the article search. The required format is `YYYYMMDD`. Default date is current date.
- sort: The chronological order for the article search. The two options are `oldest` or `newest`. Default is `newest`.
- query: The keywords for the article search query. The comments from all the articles containing the query term are retrieved. Default is None.
- filter_query: The filters for the search of the articles. It supports the entire list of [filtering options](http://developer.nytimes.com/article_search_v2.json#/README) provided by the NYT article search API, such as the day of the week, the word count of the articles, source, etc. Default is None.
- max_comments: The upper limit on the number of comments. Default is 100,000 comments.
- max_articles: The upper limit on the number of articles. Default is 10,000 articles.
- printout: The option to either print or suppress the output logging the retrieval process. The default is set True.
- save: The option to save the dataframes as `csv` files. The default is set False.
- filename: The filename for the csv files. The argument is ignored when the save argument above is False. Default is None, in which case the files are named `Articles.csv` and `Comments.csv`.
- path: The path for saving the `csv` files. The argument is ignored when the save argument above is False. Default is None, in which case the files are saved in the current directory. 

All the above arguments except `ARTICLE_API_KEY` are optional.
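
Because each call counts against the daily article search quota, it can help to check the arguments before making the call. The sketch below validates the page range and the `YYYYMMDD` date format described in the list above. The helper `check_search_args` is hypothetical and not part of `nytcomments`, and it may be stricter than the library itself about date formats.

```python
from datetime import datetime

MAX_PAGE = 199  # highest page index mentioned in the argument list


def check_search_args(page_lower=0, page_upper=30,
                      begin_date="20081031", end_date=None):
    """Raise ValueError if the arguments fall outside the documented ranges."""
    if not 0 <= page_lower <= page_upper <= MAX_PAGE:
        raise ValueError("expected 0 <= page_lower <= page_upper <= 199")
    # strptime raises ValueError when a date is not in YYYYMMDD form.
    begin = datetime.strptime(begin_date, "%Y%m%d")
    end = datetime.strptime(end_date, "%Y%m%d") if end_date else datetime.now()
    if begin > end:
        raise ValueError("begin_date must not be later than end_date")
    return True
```

For example, `check_search_args(0, 2, "20180428", "20180429")` passes, while a date written as `"2018-04-28"` raises a `ValueError` before any request is sent.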

The function returns two dataframes - one each for articles and comments. Among all the articles published in NYT in Jan-Feb 2017, roughly 15% of them were open for comments. The function stores and returns the information only about those articles that included comments. 

In [3]:
# Please make sure to enter the ARTICLE_API_KEY in the cell above before calling the function.
articles_df, comments_df = get_dataset(ARTICLE_API_KEY, page_lower=0, page_upper=2)

Page:  0
Page:  1
Retrieved 43 comments from the article with url: 
https://www.nytimes.com/2018/04/28/crosswords/daily-puzzle-2018-04-29.html

Total articles stored:  1
Total comments retrieved:  43


The returned articles' dataframe consists of 15 features:

In [4]:
articles_df.head()

| | articleID | byline | documentType | headline | keywords | multimedia | newDesk | pubDate | snippet | source | typeOfMaterial | webURL | articleWordCount | printPage | sectionName |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5ae4eee3068401528a2ab44d | By CAITLIN LOVINGER | article | Mis-Unabbreviated | [Crossword Puzzles] | 68 | Games | 2018-04-28 22:00:01 | Peter Wentz’s puzzle is in the home run depart... | The New York Times | News | https://www.nytimes.com/2018/04/28/crosswords/... | 1104 | 0 | Unknown |


In [5]:
articles_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1 entries, 0 to 0
Data columns (total 15 columns):
articleID           1 non-null category
byline              1 non-null category
documentType        1 non-null category
headline            1 non-null object
keywords            1 non-null object
multimedia          1 non-null int64
newDesk             1 non-null category
pubDate             1 non-null datetime64[ns]
snippet             1 non-null object
source              1 non-null category
typeOfMaterial      1 non-null category
webURL              1 non-null category
articleWordCount    1 non-null int64
printPage           1 non-null int32
sectionName         1 non-null category
dtypes: category(8), datetime64[ns](1), int32(1), int64(2), object(3)
memory usage: 844.0+ bytes


The returned comments' dataframe consists of 34 features:

In [6]:
comments_df.head()

| | approveDate | commentBody | commentID | commentSequence | commentTitle | commentType | createDate | depth | editorsSelection | parentID | ... | userLocation | userTitle | userURL | inReplyTo | articleID | sectionName | newDesk | articleWordCount | printPage | typeOfMaterial |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1525004536 | A dry and joyless puzzle for me today— morning... | 26909364 | 26909364 | `<br/>` | comment | 1525004534 | 1 | False | 0 | ... | New York | | | 0 | 5ae4eee3068401528a2ab44d | Unknown | Games | 1104 | 0 | News |
| 1 | 1525002291 | But it wasn’t DOH. It was AHA. I did that too. | 26909070 | 26909070 | `<br/>` | comment | 1525002288 | 1 | False | 0 | ... | Delaware | | | 0 | 5ae4eee3068401528a2ab44d | Unknown | Games | 1104 | 0 | News |
| 2 | 1525001093 | Ok, I still don't get AM RADIO. In the other t... | 26908928 | 26908928 | `<br/>` | comment | 1525001091 | 1 | False | 0 | ... | Clarkston, Georgia | | | 0 | 5ae4eee3068401528a2ab44d | Unknown | Games | 1104 | 0 | News |
| 3 | 1524997558 | I loved this. What a complete solving experie... | 26908633 | 26908633 | `<br/>` | comment | 1524997555 | 1 | False | 0 | ... | Asheville, NC | | | 0 | 5ae4eee3068401528a2ab44d | Unknown | Games | 1104 | 0 | News |
| 4 | 1524996344 | I really struggled with today's puzzle. I had... | 26908561 | 26908561 | `<br/>` | comment | 1524996341 | 1 | False | 0 | ... | Harrogate, UK | | | 0 | 5ae4eee3068401528a2ab44d | Unknown | Games | 1104 | 0 | News |


In [7]:
articles_df, comments_df = get_dataset(ARTICLE_API_KEY, begin_date='20180428', end_date='20180429', 
                                        query='gun violence')

Page:  0
Retrieved 275 comments from the article with url: 
https://www.nytimes.com/2018/04/28/us/golden-state-killer-joseph-deangelo.html
Page:  1
No aricles found on page 1

Total articles stored:  1
Total comments retrieved:  275


In [8]:
articles_df, comments_df = get_dataset(ARTICLE_API_KEY, begin_date='20180115', sort='oldest', max_comments=1000)

Page:  0

API rate limit exceeded for today. No more comments can be retrieved using the article search today, however the function get_comments can be used to retrieve further comments w/o limit if the list of URL(s) of the article(s) are provided to the function.

Total articles stored:  0
Total comments retrieved:  0
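
The rate-limit message above points to `get_comments` as a fallback when article URLs are already known. A minimal sketch of that fallback is to loop over the URLs and call a fetch function on each one. The helper `collect_comments` is hypothetical, and the fetch function is passed in as an argument so that the loop does not depend on any particular `get_comments` signature.

```python
def collect_comments(article_urls, fetch):
    """Call fetch(url) for each URL and return {url: result}.

    Articles whose retrieval raises an exception are skipped.
    """
    results = {}
    for url in article_urls:
        try:
            results[url] = fetch(url)
        except Exception as exc:  # keep going if one article fails
            print(f"Skipping {url}: {exc}")
    return results
```

A call such as `collect_comments(urls, lambda u: get_comments(u, printout=False))` would then return one dataframe per article, and `pandas.concat(list(results.values()), ignore_index=True)` could combine them into a single dataframe.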


In [9]:
articles_df, comments_df = get_dataset(ARTICLE_API_KEY, end_date='20180427', max_comments=1000, 
                                       save=True, filename='April2018')

Page:  0
Retrieved 44 comments from the article with url: 
https://www.nytimes.com/2018/04/26/theater/review-iceman-cometh-denzel-washington.html
Page:  1
Retrieved 57 comments from the article with url: 
https://www.nytimes.com/2018/04/26/opinion/metoo-cosby-guilty.html
Page:  2
Retrieved 15 comments from the article with url: 
https://www.nytimes.com/2018/04/26/us/victims-golden-state-killer.html
Retrieved 110 comments from the article with url: 
https://www.nytimes.com/2018/04/26/climate/congress-pruitt-epa-ethics.html
Page:  3
Page:  4
Retrieved 773 comments from the article with url: 
https://www.nytimes.com/2018/04/26/opinion/trumps-war-poor.html
Page:  5
Retrieved 224 comments from the article with url: 
https://www.nytimes.com/2018/04/26/opinion/cosby-guilty-assault-women.html
Maximum limit of 1000 for the comments have exceeded. Terminating retrieval.


Total articles stored:  6
Total comments retrieved:  1223
The articles' and comments' data is stored as the csv files - Artic

## Retrieving articles' dataframe using the function `get_articles`:

The function `get_articles` takes many of the same arguments as `get_dataset`:

- ARTICLE_API_KEY: The API key for the NYT article search API that can be obtained from [NYT Developers' Network](http://developer.nytimes.com/signup).  

Using the [NYT article search API](http://developer.nytimes.com/), we retrieve 10 articles per page. The range of pages is from 0 to 199.
- page_lower:  The lower limit for the article search. The range allowed is 0 to page_upper or 199, whichever is lower. Default is 0.
- page_upper:  The upper limit for the article search. The range allowed is 0 or page_lower, whichever is higher, to 199. Default is 50.
- begin_date: The begin date for the article search. The required format is `YYYYMMDD`. Default date is `20081031`, the [date of release for the NYT Community API](https://open.blogs.nytimes.com/2008/10/30/announcing-the-new-york-times-community-api/). 
- end_date: The end date for the article search. The required format is `YYYYMMDD`. Default date is current date.
- sort: The chronological order for the article search. The two options are `oldest` or `newest`. Default is `newest`.
- query: The keywords for the article search query. All the articles containing the query term are retrieved. Default is None.
- filter_query: The filters for the search of the articles. It supports the entire list of [filtering options](http://developer.nytimes.com/article_search_v2.json#/README) provided by the NYT article search API, such as the day of the week, the word count of the articles, source, etc. Default is None.
- max_articles: The maximum number of articles to be retrieved. Default is 10,000 articles.
- printout: The option to either print or suppress the output logging the retrieval process. The default is set True.
- save: The option to save the articles' dataframe as a `csv` file. The default is set False.
- filename: The filename for the csv file. The argument is ignored when the save argument above is False. Default is None, in which case the file is named `Articles.csv`.
- path: The path for saving the `csv` file. The argument is ignored when the save argument above is False. Default is None, in which case the file is saved in the current directory. 


All the above arguments except `ARTICLE_API_KEY` are optional.

The function returns a `pandas` dataframe for articles. Unlike the function `get_dataset` above, this function stores and returns information about all the articles matched by the arguments passed to it, whether or not they have comments.

In [10]:
articles_df = get_articles(ARTICLE_API_KEY, begin_date='April27-2018', end_date='2018/04/28', query='North Korea')

Page:  0
Article url: https://www.nytimes.com/2018/04/27/opinion/korea-peace-talks.html
Article url: https://www.nytimes.com/video/world/asia/100000005872698/from-dictator-to-diplomat-kim-jong-uns-extreme-makeover.html
Article url: https://www.nytimes.com/2018/04/27/briefing/north-korea-angela-merkel-golden-state-killer.html
Article url: https://www.nytimes.com/aponline/2018/04/27/business/ap-us-business-highlights.html
Article url: https://www.nytimes.com/aponline/2018/04/27/sports/ap-paul-newberry-going-international.html
Article url: https://www.nytimes.com/2018/04/27/world/asia/koreans-set-the-table-for-a-deal-that-trump-will-try-to-close.html
Article url: https://www.nytimes.com/reuters/2018/04/27/world/asia/27reuters-northkorea-southkorea-usa-military.html
Article url: https://www.nytimes.com/reuters/2018/04/27/world/asia/27reuters-northkorea-missiles-usa.html
Article url: https://www.nytimes.com/2018/04/27/opinion/north-korea-south-korea.html
Article url: https://www.nytimes.com

The returned articles' dataframe consists of 15 features:

In [11]:
articles_df.head()

| | articleID | byline | documentType | headline | keywords | multimedia | newDesk | printPage | pubDate | sectionName | snippet | source | typeOfMaterial | webURL | articleWordCount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5ae3b4ef068401528a2ab2d9 | By THE EDITORIAL BOARD | article | Koreans’ Talk of Peace Raises Hopes and Doubts | [North Korea, South Korea, Kim Jong-un, Moon J... | 68 | Editorial | 0 | 2018-04-27 23:40:27 | Unknown | As the leaders of the North and South met, eve... | The New York Times | Editorial | https://www.nytimes.com/2018/04/27/opinion/kor... | 537 |
| 1 | 5ae3a9be068401528a2ab2cb | By ROBIN STEIN, AINARA TIEFENTHÄLER and NATALI... | multimedia | Unknown | [North Korea, Kim Jong-un, International Relat... | 68 | World / Asia Pacific | 0 | 2018-04-27 22:52:44 | Asia Pacific | How did North Korea’s leader, Kim Jong-un, go ... | The New York Times | Video | https://www.nytimes.com/video/world/asia/10000... | 374 |
| 2 | 5ae3a230068401528a2ab2ba | By CHARLES McDERMID and SANDRA STEVENSON | article | Your Evening Briefing | [] | 68 | NYTNow | 0 | 2018-04-27 22:20:30 | Unknown | Here’s what you need to know at the end of the... | The New York Times | briefing | https://www.nytimes.com/2018/04/27/briefing/no... | 1121 |
| 3 | 5ae39bab068401528a2ab2b0 | By THE ASSOCIATED PRESS | article | Business Highlights | [] | 0 | | 0 | 2018-04-27 21:52:41 | Unknown | | AP | News | https://www.nytimes.com/aponline/2018/04/27/bu... | 911 |
| 4 | 5ae396c0068401528a2ab2aa | By THE ASSOCIATED PRESS | article | Column: US Leagues Are on the Verge of Going I... | [] | 0 | | 0 | 2018-04-27 21:31:42 | Unknown | An NFL team in London? Count on it. | AP | News | https://www.nytimes.com/aponline/2018/04/27/sp... | 1078 |


## Retrieving comments given the url of an article using the function `get_comments`:

Given the url of an article, the function `get_comments` retrieves the comments on the article in a `pandas` dataframe, with a built-in option to save the comments as a ready-to-use `csv` file. The function is a substitute for the NYT Community API, which is deprecated and currently has unresolved bugs. Just like the NYT Community API, the function retrieves all the comments except nested replies at depths greater than 2, and for each comment or reply it retrieves at most 3 replies. The function `get_comments` does not use any API, so no API key is required for it.
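
Since replies are limited in depth, a retrieved thread can be regrouped with a simple pass over the rows. The sketch below maps each top-level comment to its direct replies. It assumes that `parentID` holds the `commentID` of the parent and is 0 for top-level comments (which have `depth` 1), as the sample rows in this notebook suggest; the helper `group_replies` is hypothetical.

```python
from collections import defaultdict


def group_replies(comments):
    """Map each top-level commentID to the list of its direct reply IDs.

    `comments` is an iterable of dicts with the keys commentID,
    parentID and depth. Assumes parentID is 0 for top-level comments.
    """
    comments = list(comments)
    replies = defaultdict(list)
    for c in comments:
        if c["parentID"] != 0:
            replies[c["parentID"]].append(c["commentID"])
    # Keep only top-level comments (depth 1) as keys.
    return {c["commentID"]: replies.get(c["commentID"], [])
            for c in comments if c["depth"] == 1}
```

Rows in that shape can be obtained from a comments dataframe with `comments_df.to_dict("records")`.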

The function `get_comments` has the following arguments:
- article_url: the web url of an article
- max_comments: The upper limit on the number of comments. Default is 50,000.
- printout: The option to either print or suppress the output logging the retrieval process. The default is set True.
- save: The option to save the comments' dataframe as a `csv` file. The default is set False.
- filename: The filename for the csv file. The argument is ignored when the save argument above is False. Default is None, in which case the file is named `Comments.csv`.
- path: The path for saving the `csv` file. The argument is ignored when the save argument above is False. Default is None, in which case the file is saved in the current directory. 

Except for the argument `article_url`, all other arguments are optional.

We pass the `article_url` as an input argument and collect the comments in the `pandas` dataframe `comments_df`:

In [12]:
article_url = 'https://www.nytimes.com/2018/04/08/us/facebook-users-data-harvested-cambridge-analytica.html'
comments_df = get_comments(article_url)

Retrieved 288 comments from the article with url: 
https://www.nytimes.com/2018/04/08/us/facebook-users-data-harvested-cambridge-analytica.html

Total comments retrieved:  288


The first 5 rows of the dataframe are as follows: 

In [13]:
comments_df.head()

| | approveDate | commentBody | commentID | commentSequence | commentTitle | commentType | createDate | depth | editorsSelection | parentID | ... | status | timespeople | trusted | updateDate | userDisplayName | userID | userLocation | userTitle | userURL | inReplyTo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1523335059 | I had my emails on Wikileaks, my personal data... | 26663182 | 26663182 | `<br/>` | comment | 1523314468 | 1 | False | 0 | ... | approved | 1 | 0 | 1523335059 | GA | 66396621 | Rhinebeck, NY | | | 0 |
| 1 | 1523334998 | Delete your Facebook account TODAY...and get a... | 26657340 | 26657340 | `<br/>` | comment | 1523301141 | 1 | False | 0 | ... | approved | 1 | 0 | 1523334998 | Jay David | 41260666 | NM | | | 0 |
| 2 | 1523332343 | I don't know whether to be proud or aghast tha... | 26661486 | 26661486 | `<br/>` | comment | 1523309729 | 1 | False | 0 | ... | approved | 1 | 0 | 1523332343 | NotDeadYet | 67564564 | NJ | | | 0 |
| 3 | 1523331198 | A good reason to NEVER log into any app with F... | 26661829 | 26661829 | `<br/>` | comment | 1523310406 | 1 | False | 0 | ... | approved | 1 | 0 | 1523331198 | norman0000 | 54269757 | Grand Cayman | | | 0 |
| 4 | 1523330677 | Note FT is not saying anything about CA claimi... | 26662459 | 26662459 | `<br/>` | comment | 1523312123 | 1 | False | 0 | ... | approved | 1 | 0 | 1523330677 | Abby | 55271958 | Tucson | | | 0 |


The returned comments' dataframe consists of 28 features:

In [14]:
comments_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 288 entries, 0 to 287
Data columns (total 28 columns):
approveDate              288 non-null int64
commentBody              288 non-null object
commentID                288 non-null int32
commentSequence          288 non-null int64
commentTitle             288 non-null category
commentType              288 non-null category
createDate               288 non-null int64
depth                    288 non-null int64
editorsSelection         288 non-null bool
parentID                 288 non-null category
parentUserDisplayName    89 non-null category
permID                   288 non-null category
picURL                   288 non-null category
recommendations          288 non-null int16
recommendedFlag          0 non-null category
replyCount               288 non-null int8
reportAbuseFlag          0 non-null category
sharing                  288 non-null int8
status                   288 non-null category
timespeople              288 non-null i
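
The `approveDate` and `createDate` columns are stored as integers. Judging by the sample values, they appear to be Unix timestamps in seconds; this is an inference from the data shown here, not something the module documents. Under that assumption, a small helper (hypothetical, not part of `nytcomments`) converts a value to a readable UTC datetime:

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(int(seconds), tz=timezone.utc)


# approveDate of the first comment shown earlier in this section
print(epoch_to_utc(1523335059).isoformat())
```

For a whole column, `pandas.to_datetime(comments_df["createDate"], unit="s")` performs the same conversion on the dataframe.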

Here we save the comments as a `csv` file and suppress the output.

In [15]:
get_comments(article_url, save=True, printout=False);

## Points to note:
Although there is no daily limit on the number of comments that can be retrieved, the article search is restricted to 1000 articles per day. Since the function `get_dataset` uses the NYT article search API internally, the same limit applies to it. Using the same API key for multiple simultaneous retrievals terminates the retrieval with the message `API rate limit exceeded for today`. Unlike the other two functions, `get_comments` has no such limit.
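
At 10 articles per page, the limit of 1000 articles works out to roughly 100 pages per day, so a larger page range can be split into daily batches. The sketch below does that split. The helper `daily_page_batches` is hypothetical, and it treats `page_upper` as exclusive, which is inferred from the earlier sample run where `page_lower=0, page_upper=2` logged pages 0 and 1 only.

```python
def daily_page_batches(page_lower, page_upper, pages_per_day=100):
    """Split [page_lower, page_upper) into (lower, upper) batches.

    Each batch covers at most pages_per_day pages. The upper bound is
    treated as exclusive, an assumption based on the sample output.
    """
    if pages_per_day <= 0:
        raise ValueError("pages_per_day must be positive")
    batches = []
    start = page_lower
    while start < page_upper:
        stop = min(start + pages_per_day, page_upper)
        batches.append((start, stop))
        start = stop
    return batches
```

Each `(lower, upper)` pair could then be passed as `page_lower` and `page_upper` to `get_dataset` or `get_articles` on a separate day.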

The data collected using this module has been contributed to [Kaggle](https://www.kaggle.com/aashita/nyt-comments), where the dataset is at the [top among the 20 featured datasets](https://www.kaggle.com/datasets). A [detailed exploration](https://www.kaggle.com/aashita/nyt-comments-eda) of the data with graphs, as well as [some models](https://www.kaggle.com/aashita/nyt-comments/kernels) built on it, can also be found on Kaggle.