Resources:
- Training datasets: https://blog.cambridgespark.com/50-free-machine-learning-datasets-sentiment-analysis-b9388f79c124
- Word2Vec tutorial (skip-gram model): https://medium.com/nearist-ai/word2vec-tutorial-the-skip-gram-model-c7926e1fdc09
- Unsupervised sentiment analysis: https://towardsdatascience.com/unsupervised-sentiment-analysis-a38bf1906483

Open questions:
- For the articles: do we distinguish between articles and blog posts? And do we work by article category?
- For the comments: do we distinguish between comments and replies to comments, or do we only look at the comments?

In [1]:
import pandas as pd
import numpy as np
import os
from glob import glob

In [2]:
path_to_data = 'data/'

In [3]:
print('Reading articles...')
articles = pd.concat(map(pd.read_csv, glob(os.path.join(path_to_data+'nyt-articles/', "*.csv"))), axis=0, sort=True).reset_index()
print('Reading comments...')
comments = pd.concat(map(pd.read_csv, glob(os.path.join(path_to_data+'nyt-comments/', "*.csv"))), axis=0, sort=True).reset_index()

Reading articles...
Reading comments...
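
The loading step above can be wrapped in a small helper. This is a minimal sketch, not the notebook's own code: the function name `read_csv_folder` is hypothetical, and it assumes every CSV in a folder shares roughly the same columns. It passes `low_memory=False` so that pandas infers each column's dtype from the whole file rather than chunk by chunk, and it uses `ignore_index=True` instead of `reset_index()`, so no extra `index` column is created.

```python
import os
import tempfile
from glob import glob

import pandas as pd


def read_csv_folder(folder):
    """Concatenate every CSV file found in `folder` into one DataFrame.

    Hypothetical helper: files are read in sorted order so the result is
    reproducible, and the row index is renumbered from 0.
    """
    paths = sorted(glob(os.path.join(folder, "*.csv")))
    frames = [pd.read_csv(path, low_memory=False) for path in paths]
    return pd.concat(frames, axis=0, sort=True, ignore_index=True)


# Tiny demonstration on throwaway files (made-up rows, not the NYT data).
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"articleID": ["a1"], "newDesk": ["OpEd"]}).to_csv(
        os.path.join(tmp, "part1.csv"), index=False)
    pd.DataFrame({"articleID": ["a2"], "newDesk": ["Editorial"]}).to_csv(
        os.path.join(tmp, "part2.csv"), index=False)
    demo = read_csv_folder(tmp)

print(demo)
```

With the notebook's layout, the call would look like `read_csv_folder(os.path.join(path_to_data, 'nyt-articles'))`.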




# 1. Descriptive statistics

In [10]:
articles['documentType'].value_counts() # some of them are blogposts

article     9168
blogpost     167
Name: documentType, dtype: int64

In [11]:
articles['newDesk'].value_counts(dropna=False)

OpEd               1719
National            630
Metro               593
Learning            575
Foreign             514
Culture             507
Business            407
Magazine            386
Washington          378
Dining              366
Games               357
Well                342
Editorial           333
Sports              332
Science             284
Upshot              230
RealEstate          225
Insider             154
Weekend             140
Unknown             117
Travel              112
Arts&Leisure         71
BookReview           62
Styles               60
SundayBusiness       57
Metropolitan         57
Podcasts             40
Photo                39
NewsDesk             32
Smarter Living       31
Climate              31
Investigative        28
Obits                22
Politics             22
Express              19
SpecialSections      18
TStyle               14
EdLife               11
Letters               5
Summary               4
NYTNow                4
Society         

In [12]:
articles['sectionName'].value_counts(dropna=False)

Unknown                       6380
Politics                       638
Sunday Review                  353
Television                     261
Asia Pacific                   174
Europe                         172
Family                         166
Live                           138
Middle East                     89
Move                            61
Book Review                     60
Art & Design                    54
Economy                         54
Baseball                        52
Eat                             47
Olympics                        45
Soccer                          43
Media                           43
Mind                            42
Lesson Plans                    41
The Daily                       40
Music                           38
Americas                        37
Pro Basketball                  34
Wine, Beer & Cocktails          33
DealBook                        31
Pro Football                    31
Africa                          21
College Basketball  

In [13]:
comments.columns

Index(['index', 'approveDate', 'articleID', 'articleWordCount', 'commentBody',
       'commentID', 'commentSequence', 'commentTitle', 'commentType',
       'createDate', 'depth', 'editorsSelection', 'inReplyTo', 'newDesk',
       'parentID', 'parentUserDisplayName', 'permID', 'picURL', 'printPage',
       'recommendations', 'recommendedFlag', 'replyCount', 'reportAbuseFlag',
       'sectionName', 'sharing', 'status', 'timespeople', 'trusted',
       'typeOfMaterial', 'updateDate', 'userDisplayName', 'userID',
       'userLocation', 'userTitle', 'userURL'],
      dtype='object')

In [14]:
comments.head()

Unnamed: 0,index,approveDate,articleID,articleWordCount,commentBody,commentID,commentSequence,commentTitle,commentType,createDate,...,status,timespeople,trusted,typeOfMaterial,updateDate,userDisplayName,userID,userLocation,userTitle,userURL
0,0,1491245186,58def1347c459f24986d7c80,716.0,This project makes me happy to be a 30+ year T...,22022598.0,22022598.0,<br/>,comment,1491237000.0,...,approved,1.0,0.0,News,1491245186,Rob Gayle,46006296.0,"Riverside, CA",,
1,1,1491188619,58def1347c459f24986d7c80,716.0,Stunning photos and reportage. Infuriating tha...,22017350.0,22017350.0,,comment,1491180000.0,...,approved,1.0,0.0,News,1491188619,Susan A.,29202761.0,<br/>,,
2,2,1491188617,58def1347c459f24986d7c80,716.0,Brilliant work from conception to execution. I...,22017334.0,22017334.0,<br/>,comment,1491179000.0,...,approved,1.0,0.0,News,1491188617,Meta,63944806.0,Raleigh NC,,
3,3,1491167820,58def1347c459f24986d7c80,716.0,NYT reporters should provide a contributor's l...,22015913.0,22015913.0,<br/>,comment,1491150000.0,...,approved,1.0,0.0,News,1491167820,Tom Wyrick,1266184.0,"Missouri, USA",,
4,4,1491167815,58def1347c459f24986d7c80,716.0,Could only have been done in print. Stunning.,22015466.0,22015466.0,<br/>,comment,1491147000.0,...,approved,1.0,0.0,News,1491167815,Joe Sharkey,61121360.0,"Tucson, Arizona",,


In [15]:
comments['commentType'].value_counts()

comment          1595760
userReply         580279
reporterReply        325
Name: commentType, dtype: int64
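
If replies end up being analysed separately from top-level comments (one of the open questions at the top), the `commentType` column is enough to split them. The sketch below runs on a made-up four-row frame; `comments_demo` and `REPLY_TYPES` are illustrative names, and treating `reporterReply` as a reply is an assumption.

```python
import pandas as pd

# Made-up rows that mimic the commentType values listed above.
comments_demo = pd.DataFrame({
    "commentID": [1, 2, 3, 4],
    "commentType": ["comment", "userReply", "comment", "reporterReply"],
})

# Assumption: both kinds of reply are grouped together.
REPLY_TYPES = ["userReply", "reporterReply"]

is_reply = comments_demo["commentType"].isin(REPLY_TYPES)
top_level = comments_demo[~is_reply]
replies = comments_demo[is_reply]

print(len(top_level), "top-level comments,", len(replies), "replies")
```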

# 2. Clean data

## 2.1. Keeping only some articles

We keep only the comments posted on OpEd and Editorial articles whose authors wrote at least 10 such articles in the NYT.


In [16]:
import re

In [17]:
# Keep only articles from the OpEd and Editorial desks
articles = articles[articles['newDesk'].isin(['OpEd', 'Editorial'])]
len(articles)

2052

In [18]:
# Clean the author names of the articles

# One column per author when a byline lists several (split on ",", "and", "with")
authors = articles['byline'].str.split(',|and|with', expand=True).add_prefix('author_')

# Remove everything up to and including "by", "BY" or "By" ("Interviewed by", etc.)
for col in authors.columns:
    has_name = authors[col].notna()
    authors.loc[has_name, col] = authors.loc[has_name, col].apply(lambda x: re.sub(r'.*(by|BY|By)', '', x).strip())

# Merge on the index to attach the article ID
authors = pd.merge(authors, articles[['articleID']], left_index=True, right_index=True)

# Stack to get one row per (article, author) pair
authors = authors.set_index(['articleID']).stack().reset_index(level=-1, drop=True).reset_index(name='author')

# Manually remove entries that are not real authors: "M.D" (a title) and "Unknown".
# Note: the match is case-sensitive, so the upper-case "UNKNOWN" byline is not removed.
authors = authors[~authors['author'].isin(['M.D', 'Unknown'])]

# Keep authors who wrote 10 or more articles
author_counts = authors['author'].value_counts()
authors_list = author_counts[author_counts >= 10].index
authors = authors[authors['author'].isin(authors_list)]
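
The byline cleaning above relies on plain substring patterns, so a lower-case "and" inside a name, or a name containing "BY", would be cut. A stricter variant could use word boundaries, as in the sketch below. It is an alternative, not the code that produced the counts shown in this notebook; `split_byline` and the example bylines are hypothetical.

```python
import re

# Split on commas, or on "and"/"with" only when they are standalone words.
_SEPARATORS = re.compile(r",|\b(?:and|with)\b", flags=re.IGNORECASE)
# Strip a leading "By", optionally preceded by words such as "Interview".
_PREFIX = re.compile(r"^.*?\bby\b", flags=re.IGNORECASE)
# Entries that are not real authors (compared in upper case).
_NOT_AUTHORS = {"M.D", "M.D.", "UNKNOWN"}


def split_byline(byline):
    """Return the list of author names found in one byline string."""
    names = []
    for part in _SEPARATORS.split(byline):
        name = _PREFIX.sub("", part).strip()
        if name and name.upper() not in _NOT_AUTHORS:
            names.append(name)
    return names


# Made-up bylines for illustration.
print(split_byline("By DAVID BROOKS and GAIL COLLINS"))
print(split_byline("Interview by BOBBY SANDERS, M.D."))
```

Because the comparison against `_NOT_AUTHORS` is done in upper case, both "Unknown" and "UNKNOWN" would be dropped here.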

In [19]:
# In this authors list, we randomly select 100 authors.
# import random
# authors_list = random.sample(list(authors_list), 100)
# authors = authors[authors['author'].isin(authors_list)]

In [20]:
authors['author'].value_counts()

THE EDITORIAL BOARD       258
PAUL KRUGMAN              113
GAIL COLLINS               75
NICHOLAS KRISTOF           75
FRANK BRUNI                75
DAVID BROOKS               72
ROSS DOUTHAT               67
CHARLES M. BLOW            64
ROGER COHEN                61
LIZ SPAYD                  53
BRET STEPHENS              48
DAVID LEONHARDT            47
THOMAS L. FRIEDMAN         37
MICHELLE GOLDBERG          34
THOMAS B. EDSALL           33
TIMOTHY EGAN               30
MAUREEN DOWD               29
UNKNOWN                    27
EVAN GERSHKOVICH           21
LINDA GREENHOUSE           16
JENNIFER FINNEY BOYLAN     14
MARGARET RENKL             11
ANDREW ROSENTHAL           11
SUSAN CHIRA                11
Name: author, dtype: int64

In [22]:
len(authors['author'].unique())

24

In [23]:
# In the comments data, keep only the comments posted on articles by the retained authors, matched on "articleID".
comments = comments[comments['articleID'].isin(authors['articleID'].unique())]
len(comments)

671748

## 2.2. Tokenization of comments

In [24]:
from gensim.models import LdaModel
from gensim import corpora
import nltk
from string import punctuation
from nltk.tokenize import TreebankWordTokenizer

In [25]:
# Replace HTML line breaks with spaces (NaN comment bodies are left untouched)
comments['commentBody'] = comments['commentBody'].str.replace('<br/>', ' ', regex=False)

In [27]:
nltk.download('stopwords')
en_stop = set(nltk.corpus.stopwords.words('english'))
# Use a set for fast membership tests
to_be_removed = en_stop | set(punctuation)

print("Tokenizing...")
tok = TreebankWordTokenizer()
# Tokenize each comment, then drop stop words and punctuation tokens (case-insensitive)
comments['commentTokens'] = comments['commentBody'].apply(
    lambda x: [t for t in tok.tokenize(x) if t.lower() not in to_be_removed]
)


# dictionary = corpora.Dictionary(comments['commentTokens'].tolist())
# corpus = [dictionary.doc2bow(text) for text in comments['commentTokens'].tolist()]
# ldamodel = LdaModel(corpus, id2word=dictionary, num_topics=4)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\naila\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Tokenizing...
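
To see what the filtering step does on a single comment without downloading NLTK data, here is a self-contained sketch. It swaps in a simple regular-expression tokenizer for `TreebankWordTokenizer` and a tiny hand-picked stop-word set for the NLTK list, so its tokens will not match the notebook's exactly (contractions, for instance, are split differently). `clean_tokens`, `STOP_WORDS`, and `TOKEN_PATTERN` are illustrative names.

```python
import re
from string import punctuation

# Tiny illustrative subset; the notebook uses NLTK's English stop-word list.
STOP_WORDS = {"the", "a", "is", "to", "of", "and", "this", "it"}

# A word (optionally with an apostrophe part) or a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def clean_tokens(text):
    """Tokenize `text`, then drop stop words and punctuation tokens."""
    text = text.replace("<br/>", " ")
    to_be_removed = STOP_WORDS | set(punctuation)
    return [t for t in TOKEN_PATTERN.findall(text) if t.lower() not in to_be_removed]


print(clean_tokens("Stunning photos and reportage.<br/>This is great!"))
```

Tokens keep their original case; only the stop-word comparison is lower-cased, as in the cell above.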


In [64]:
comments.to_csv(path_to_data + 'df_clean.csv', sep=';')
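
One thing to keep in mind when reloading this file: a column of Python lists such as `commentTokens` is written to CSV as text, so it comes back as strings unless it is parsed again. The sketch below shows the round trip on a two-row made-up frame, with an in-memory buffer standing in for `df_clean.csv`; using `ast.literal_eval` as the converter is one possible choice, not something the notebook prescribes.

```python
import ast
import io

import pandas as pd

# Made-up rows with a list-valued column, like commentTokens.
df = pd.DataFrame({
    "commentID": [1, 2],
    "commentTokens": [["Stunning", "photos"], ["Brilliant", "work"]],
})

# Write with the same separator as the notebook, to memory instead of disk.
buffer = io.StringIO()
df.to_csv(buffer, sep=";", index=False)

# Plain read: the token lists come back as strings.
buffer.seek(0)
as_text = pd.read_csv(buffer, sep=";")

# Read with a converter: the strings are parsed back into lists.
buffer.seek(0)
as_lists = pd.read_csv(buffer, sep=";", converters={"commentTokens": ast.literal_eval})

print(type(as_text["commentTokens"][0]).__name__)
print(type(as_lists["commentTokens"][0]).__name__)
```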