I had data from several sources (Kaggle, pages scraped from web.archive.org, and the BBC API) that I loaded into a MongoDB database. I originally worked with those datasets, so I practiced retrieving the data from MongoDB.

However, later in the project I ended up using an expanded version of the Kaggle dataset that came as a SQLite file. It was too large, so I decided not to put it in MongoDB.

In [2]:
from os import listdir
from os.path import isfile, join
import json

In [3]:
from pymongo import MongoClient

In [4]:
import pandas as pd
import datetime as dt

We have data from different news publications and TV news transcripts, so we create two collections in the NEWS database: publications and tvnews.

In [5]:
client = MongoClient()

In [6]:
# list_database_names() replaces the deprecated database_names()
client.list_database_names()

['NEWS', 'admin', 'books', 'config', 'local', 'outings']

In [7]:
news_db = client.NEWS

In [8]:
publications = news_db.publications

In [9]:
tvnews = news_db.tvnews

In [36]:
guardian_col = news_db.guardian

In [10]:

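# empty the publications collection before reloading it
# (remove() is deprecated in PyMongo 3 and gone in PyMongo 4, where delete_many({}) is the replacement)
publications.remove({ })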

{'n': 142570, 'ok': 1.0}

In [155]:
# list_collection_names() replaces the deprecated collection_names()
news_db.list_collection_names()

['tvnews', 'publications']

#### Load the AllTheNews dataset into a dataframe


In [40]:
# load the All The News csv files (one dataframe per file, stacked together)
mypath = '/Users/aminenhila/Desktop/Metis/Project4/Data_Code/Data/all-the-news/'
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath,f))]

all_the_news_df = pd.DataFrame()
for filename in onlyfiles:
    file_df = pd.read_csv(mypath+filename, index_col = None)
    all_the_news_df = pd.concat([all_the_news_df, file_df])
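
Calling `pd.concat` inside the loop copies the accumulated dataframe on every iteration. A minimal sketch of an alternative, assuming the folder holds only the CSV files to load: collect the frames in a list and concatenate once. The helper name `load_csv_folder` and the two tiny demo files are made up for illustration and are not part of the original data.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_csv_folder(folder):
    """Read every .csv file in `folder` and stack them into one dataframe."""
    # hypothetical helper: read each file once, then concatenate a single time
    frames = [pd.read_csv(path) for path in sorted(Path(folder).glob("*.csv"))]
    # sort=False leaves the columns unsorted; ignore_index=True renumbers the rows
    return pd.concat(frames, ignore_index=True, sort=False)


# demo with two throwaway files (made-up content)
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"title": ["a"], "content": ["x"]}).to_csv(
        Path(tmp) / "articles1.csv", index=False)
    pd.DataFrame({"title": ["b", "c"], "content": ["y", "z"]}).to_csv(
        Path(tmp) / "articles2.csv", index=False)
    demo_df = load_csv_folder(tmp)

print(demo_df.shape)
```

Note that `ignore_index=True` gives a fresh 0..n-1 index, whereas the loop above keeps each file's own row numbers, so the index restarts for every file.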




In [41]:
all_the_news_df = all_the_news_df.drop(['Unnamed: 0', 'url'], axis = 1)
all_the_news_df.head()

Unnamed: 0,author,content,date,id,month,publication,title,year
0,Carl Hulse,WASHINGTON — Congressional Republicans have...,2016-12-31,17283.0,12.0,New York Times,House Republicans Fret About Winning Their Hea...,2016.0
1,Benjamin Mueller and Al Baker,"After the bullet shells get counted, the blood...",2017-06-19,17284.0,6.0,New York Times,Rift Between Officers and Residents as Killing...,2017.0
2,Margalit Fox,"When Walt Disney’s “Bambi” opened in 1942, cri...",2017-01-06,17285.0,1.0,New York Times,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",2017.0
3,William McDonald,"Death may be the great equalizer, but it isn’t...",2017-04-10,17286.0,4.0,New York Times,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",2017.0
4,Choe Sang-Hun,"SEOUL, South Korea — North Korea’s leader, ...",2017-01-02,17287.0,1.0,New York Times,Kim Jong-un Says North Korea Is Preparing to T...,2017.0


In [42]:
all_the_news_df = all_the_news_df.rename(columns = {'content':'bodytext'})

In [43]:
all_the_news_df.head()

Unnamed: 0,author,bodytext,date,id,month,publication,title,year
0,Carl Hulse,WASHINGTON — Congressional Republicans have...,2016-12-31,17283.0,12.0,New York Times,House Republicans Fret About Winning Their Hea...,2016.0
1,Benjamin Mueller and Al Baker,"After the bullet shells get counted, the blood...",2017-06-19,17284.0,6.0,New York Times,Rift Between Officers and Residents as Killing...,2017.0
2,Margalit Fox,"When Walt Disney’s “Bambi” opened in 1942, cri...",2017-01-06,17285.0,1.0,New York Times,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",2017.0
3,William McDonald,"Death may be the great equalizer, but it isn’t...",2017-04-10,17286.0,4.0,New York Times,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",2017.0
4,Choe Sang-Hun,"SEOUL, South Korea — North Korea’s leader, ...",2017-01-02,17287.0,1.0,New York Times,Kim Jong-un Says North Korea Is Preparing to T...,2017.0


In [44]:
all_the_news_df = all_the_news_df[all_the_news_df['bodytext'].notna()]

In [45]:
all_the_news_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 142570 entries, 0 to 49998
Data columns (total 8 columns):
author         126694 non-null object
bodytext       142570 non-null object
date           139929 non-null object
id             142570 non-null float64
month          139929 non-null float64
publication    142570 non-null object
title          142568 non-null object
year           139929 non-null float64
dtypes: float64(3), object(5)
memory usage: 9.8+ MB


In [46]:
all_the_news_df.date = pd.to_datetime(all_the_news_df.date)

In [47]:
all_the_news_df.year = all_the_news_df.date.dt.year
all_the_news_df.month = all_the_news_df.date.dt.month
all_the_news_df['day'] = all_the_news_df.date.dt.day

In [48]:
all_the_news_df.head()

Unnamed: 0,author,bodytext,date,id,month,publication,title,year,day
0,Carl Hulse,WASHINGTON — Congressional Republicans have...,2016-12-31,17283.0,12.0,New York Times,House Republicans Fret About Winning Their Hea...,2016.0,31.0
1,Benjamin Mueller and Al Baker,"After the bullet shells get counted, the blood...",2017-06-19,17284.0,6.0,New York Times,Rift Between Officers and Residents as Killing...,2017.0,19.0
2,Margalit Fox,"When Walt Disney’s “Bambi” opened in 1942, cri...",2017-01-06,17285.0,1.0,New York Times,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",2017.0,6.0
3,William McDonald,"Death may be the great equalizer, but it isn’t...",2017-04-10,17286.0,4.0,New York Times,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",2017.0,10.0
4,Choe Sang-Hun,"SEOUL, South Korea — North Korea’s leader, ...",2017-01-02,17287.0,1.0,New York Times,Kim Jong-un Says North Korea Is Preparing to T...,2017.0,2.0


In [49]:
all_the_news_df[(all_the_news_df['publication'] == 'Guardian')]

Unnamed: 0,author,bodytext,date,id,month,publication,title,year,day
0,Jessica Glenza,The son of a Louisiana man whose father was sh...,2016-07-13,151908.0,7.0,Guardian,Alton Sterling’s son: ’Everyone needs to prote...,2016.0,13.0
1,,Copies of William Shakespeare’s first four boo...,2016-05-25,151909.0,5.0,Guardian,Shakespeare’s first four folios sell at auctio...,2016.0,25.0
2,Robert Pendry,"Debt: $20, 000, Source: College, credit cards,...",2016-10-31,151910.0,10.0,Guardian,My grandmother’s death saved me from a life of...,2016.0,31.0
3,Bradford Frost,"It was late. I was drunk, nearing my 35th birt...",2016-11-26,151911.0,11.0,Guardian,I feared my life lacked meaning. Cancer pushed...,2016.0,26.0
4,,A central Texas man serving a life sentence fo...,2016-08-20,151912.0,8.0,Guardian,Texas man serving life sentence innocent of do...,2016.0,20.0
...,...,...,...,...,...,...,...,...,...
49994,Lawrence Grandpre,There have been many proposed solutions to the...,2016-08-12,151902.0,8.0,Guardian,"If Baltimore is serious about police reform, g...",2016.0,12.0
49995,Mary Valle,"Maybe I feel like August won’t let go of me, b...",2016-08-28,151903.0,8.0,Guardian,The transition from summer to fall feels like ...,2016.0,28.0
49996,,"Diana Marcela, 28, has spent 13 years with Far...",2016-09-16,151904.0,9.0,Guardian,"Colombia: Farc’s female fighters, then and now...",2016.0,16.0
49997,Paul Mason,"This Christmas break, for anybody steeped in t...",2016-12-26,151905.0,12.0,Guardian,Why I’m optimistic about 2017,2016.0,26.0


In [50]:
all_the_news_df.date = all_the_news_df.date.dt.strftime('%Y-%m-%d')

In [51]:
# convert the dataframe into a list of dictionaries, one per article
all_the_news_dict = all_the_news_df.to_dict(orient = 'records')

In [52]:
all_the_news_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 142570 entries, 0 to 49998
Data columns (total 9 columns):
author         126694 non-null object
bodytext       142570 non-null object
date           142570 non-null object
id             142570 non-null float64
month          139929 non-null float64
publication    142570 non-null object
title          142568 non-null object
year           139929 non-null float64
day            139929 non-null float64
dtypes: float64(4), object(5)
memory usage: 10.9+ MB


In [53]:
# insert the all_the_news records into the MongoDB publications collection
publications.insert_many(all_the_news_dict)

<pymongo.results.InsertManyResult at 0x1137dbffa0>
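
The `info()` output above shows missing values in `author`, `title`, `year`, `month` and `day`. `to_dict(orient='records')` keeps those as float NaN, and as far as I know MongoDB then stores them as a NaN number rather than as null. A minimal sketch of one way to convert them before inserting, assuming plain record dictionaries; the helper name `nan_to_none` and the sample records are made up for illustration.

```python
import math


def nan_to_none(record):
    """Return a copy of `record` with float NaN values replaced by None."""
    return {
        key: (None if isinstance(value, float) and math.isnan(value) else value)
        for key, value in record.items()
    }


# made-up sample records shaped like the rows above
sample_records = [
    {"title": "Example title", "author": float("nan"), "year": 2016.0},
    {"title": "Another title", "author": "Some Author", "year": float("nan")},
]
cleaned_records = [nan_to_none(record) for record in sample_records]
print(cleaned_records)
```

With a live connection, `cleaned_records` would be passed to `insert_many` in place of the raw records.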

In [54]:
# count_documents({}) replaces the deprecated count()
publications.count_documents({})

142570

#### Load the Guardian JSON files and keep only the keys we need

In [11]:
# read the guardian json files
mypath = '/Users/aminenhila/Desktop/Metis/Project4/Data_Code/Data/Guardian/'

# keep regular files only and skip hidden files
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f)) and not f.startswith('.')]

In [12]:
sorted_onlyfiles = sorted(onlyfiles)

In [19]:
files_2019_2020 = sorted_onlyfiles[1461:]
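
The slice `[1461:]` selects files by position, which silently breaks if a file is added or missing. 1461 happens to be the number of days from 2015 through 2018, which suggests one file per day, but the filenames are not shown here. A minimal sketch of selecting by name instead, under the assumption (not confirmed by the notebook) that each filename starts with its year, as in `2019-01-01.json`; the helper `files_for_years` and the example names are hypothetical.

```python
def files_for_years(filenames, years):
    """Return the sorted filenames that start with one of the given years."""
    # assumption: filenames begin with a four-digit year, e.g. '2019-01-01.json'
    prefixes = tuple(str(year) for year in years)
    return sorted(name for name in filenames if name.startswith(prefixes))


# made-up filenames for illustration
example_names = ["2018-12-31.json", "2020-02-03.json", "2019-01-01.json", ".DS_Store"]
selected = files_for_years(example_names, [2019, 2020])
print(selected)
```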

In [20]:
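guardian_data = []
for i, file in enumerate(files_2019_2020, start=1):
    print(i)  # progress counter
    # json.load() no longer accepts an encoding argument (removed in Python 3.9),
    # so the encoding is set when opening the file instead
    with open(join(mypath, file), encoding='utf-8') as f:
        data = json.load(f)
    guardian_data.extend(data)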

1
2
3
...
277


In [28]:
guardian_data[230]

{'id': 'sport/2019/jan/02/johanna-konta-crashes-out-brisbane-international-tennis',
 'type': 'article',
 'sectionId': 'sport',
 'sectionName': 'Sport',
 'webPublicationDate': '2019-01-02T10:10:17Z',
 'webTitle': 'Johanna Konta out of Brisbane International in second round',
 'webUrl': 'https://www.theguardian.com/sport/2019/jan/02/johanna-konta-crashes-out-brisbane-international-tennis',
 'apiUrl': 'https://content.guardianapis.com/sport/2019/jan/02/johanna-konta-crashes-out-brisbane-international-tennis',
 'fields': {'headline': 'Johanna Konta out of Brisbane International in second round',
  'standfirst': '<ul><li>British No 1 loses 6-2, 7-6 (2) to Ajla Tomljanovic</li><li>Caroline Wozniacki begins 2019 with win in Auckland</li></ul>',
  'trailText': 'Johanna Konta could not follow her victory over Sloane Stephens at the Brisbane International with a win against Ajla Tomljanovic, losing 6-2, 7-6 (2)',
  'byline': 'Guardian sport and agencies',
  'main': '<figure class="element elemen

In [29]:
def clean_guardian(guardian_data):
    '''Keep only the key-value pairs we need, skipping articles without a byline.'''
    cleaned_guardian_data = [
        {'sectionName': article['sectionName'],
         'publication': 'Guardian',
         'byline': article['fields']['byline'],
         'webPublicationDate': article['webPublicationDate'],
         'headline': article['fields']['headline'],
         'lang': article['fields']['lang'],
         'bodyText': article['fields']['bodyText']}
        for article in guardian_data
        if article['fields'].get('byline') is not None
    ]
    return cleaned_guardian_data
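
`clean_guardian` checks only `byline` with `.get()`, so an article missing `lang` or `bodyText` would raise a `KeyError`. A minimal sketch of a more tolerant variant, assuming missing fields should simply become `None`; the name `clean_guardian_safe` is made up, and the sample articles reuse a few values from the record printed above plus one made-up article without a byline.

```python
def clean_guardian_safe(articles):
    """Like clean_guardian, but missing keys become None instead of raising."""
    cleaned = []
    for article in articles:
        fields = article.get("fields", {})
        if fields.get("byline") is None:
            continue  # same rule as above: skip articles without a byline
        cleaned.append({
            "sectionName": article.get("sectionName"),
            "publication": "Guardian",
            "byline": fields["byline"],
            "webPublicationDate": article.get("webPublicationDate"),
            "headline": fields.get("headline"),
            "lang": fields.get("lang"),
            "bodyText": fields.get("bodyText"),
        })
    return cleaned


sample_articles = [
    {"sectionName": "Sport",
     "webPublicationDate": "2019-01-02T10:10:17Z",
     "fields": {"headline": "Johanna Konta out of Brisbane International in second round",
                "byline": "Guardian sport and agencies"}},
    # made-up article with no byline, which should be skipped
    {"sectionName": "News", "fields": {"headline": "No byline here"}},
]
safe_result = clean_guardian_safe(sample_articles)
print(len(safe_result))
```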

In [30]:
cleaned_guardian_dict_list = clean_guardian(guardian_data)

In [32]:
guardian_df = pd.DataFrame(cleaned_guardian_dict_list)

In [34]:
guardian_df.shape

(83096, 7)

In [35]:
guardian_dict = guardian_df.to_dict(orient = 'records')

In [38]:
len(guardian_dict)

83096

In [39]:
guardian_col.insert_many(guardian_dict)

<pymongo.results.InsertManyResult at 0x10e87029b0>
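
All 83096 records go to MongoDB in a single `insert_many` call here. A minimal sketch of splitting a list of records into fixed-size batches, which is mainly useful for reporting progress or resuming after a failure; the helper name `in_batches` and the batch size are assumptions for illustration, and the database call is shown only as a comment because it needs a running server.

```python
def in_batches(records, batch_size):
    """Yield consecutive slices of `records` with at most `batch_size` items."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


# made-up records for illustration
demo_records = [{"n": i} for i in range(7)]
batch_sizes = [len(batch) for batch in in_batches(demo_records, 3)]
print(batch_sizes)

# with a live connection this would look like:
# for batch in in_batches(guardian_dict, 1000):
#     guardian_col.insert_many(batch)
```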

#### Load the BBC TV news dataset

In [106]:
BBC_df = pd.read_csv('/Users/aminenhila/Desktop/Metis/Project4/Data_Code/Data/Final_BBCNEWS.csv')

In [107]:
BBC_df = BBC_df.drop(['Unnamed: 0', 'URL', 'IAPreviewThumb', 'IAShowID'], axis = 1)
# the csv file also contains rows from other stations without their transcripts;
# only the BBC transcripts are available in this csv file, so keep just those rows
BBC_df = BBC_df[BBC_df.Station == 'BBCNEWS']
BBC_df.head()

Unnamed: 0,MatchDateTime,Station,Show,Snippet,NewsTranscripts
0,11/15/2017 13:26:03,BBCNEWS,BBC News at One,that is thought to be a danger point. that wou...,the army in zimbabwe seizes control of the co...
1,11/4/2017 22:08:32,BBCNEWS,BBC News,with. excess heat is killing people. so we are...,this is bbc news. i'm rachel schofield. the h...
2,11/15/2017 21:47:38,BBCNEWS,Outside Source,to hold the world's temperature rise as close ...,"hello, i'm kasia madera, this is outside sour..."
3,11/4/2017 12:14:30,BBCNEWS,BBC News,contains is not news in the sense that this is...,this is bbc news. i'm shaun ley. the headline...
4,11/4/2017 23:12:06,BBCNEWS,BBC News,administration's view on climate change. the s...,this is bbc news. i'm rachel schofield. the h...


In [108]:
BBC_df.MatchDateTime = pd.to_datetime(BBC_df.MatchDateTime)
BBC_df = BBC_df.rename(columns = {'MatchDateTime': 'date', 'NewsTranscripts': 'bodytext'})

In [109]:
BBC_df.date = pd.to_datetime(BBC_df.date)
BBC_df['year'] = BBC_df.date.dt.year
BBC_df['day'] = BBC_df.date.dt.day
BBC_df['month'] = BBC_df.date.dt.month


In [110]:
BBC_df.date = pd.to_datetime(BBC_df.date.dt.date)

In [111]:
BBC_df.head()

Unnamed: 0,date,Station,Show,Snippet,bodytext,year,day,month
0,2017-11-15,BBCNEWS,BBC News at One,that is thought to be a danger point. that wou...,the army in zimbabwe seizes control of the co...,2017,15,11
1,2017-11-04,BBCNEWS,BBC News,with. excess heat is killing people. so we are...,this is bbc news. i'm rachel schofield. the h...,2017,4,11
2,2017-11-15,BBCNEWS,Outside Source,to hold the world's temperature rise as close ...,"hello, i'm kasia madera, this is outside sour...",2017,15,11
3,2017-11-04,BBCNEWS,BBC News,contains is not news in the sense that this is...,this is bbc news. i'm shaun ley. the headline...,2017,4,11
4,2017-11-04,BBCNEWS,BBC News,administration's view on climate change. the s...,this is bbc news. i'm rachel schofield. the h...,2017,4,11


#### Load the MSNBC TV news dataset

In [119]:
MSNBC_df = pd.read_csv('/Users/aminenhila/Desktop/Metis/Project4/Data_Code/Data/Final_MSNBC.csv')


In [120]:
MSNBC_df = MSNBC_df.drop(['Unnamed: 0', 'URL', 'IAPreviewThumb', 'IAShowID'], axis = 1)
# the csv file also contains rows from other stations without their transcripts;
# only the MSNBC transcripts are available in this csv file, so keep just those rows
MSNBC_df = MSNBC_df[MSNBC_df.Station == 'MSNBC']
MSNBC_df.head()

Unnamed: 0,MatchDateTime,Station,Show,Snippet,NewsTranscripts
1363,1/25/2018 19:34:10,MSNBC,MSNBC Live With Katy Tur,"reality. the u.s. can at a significant cost, c...",welcome back. it is transcript time. senator ...
1364,1/11/2018 10:37:15,MSNBC,First Look,says they intend to honor the blue slip courte...,tuesday's televised meeting with lawmakers o...
1365,1/9/2018 12:54:24,MSNBC,Morning Joe,of dhs. so many topics to discuss. seems prett...,"tuesday morning. ""morning joe"" starts right n..."
1366,1/8/2018 1:36:14,MSNBC,Kasie DC,"be the figure head of a parallel party, a para...",see how invisalign® treatment can shape your ...
1367,1/29/2018 16:23:06,MSNBC,MSNBC Live With Velshi and Ruhle,president trump was asked if he believes in cl...,i brought my big boots to your wisdom and it ...


In [121]:
MSNBC_df.MatchDateTime = pd.to_datetime(MSNBC_df.MatchDateTime)
MSNBC_df = MSNBC_df.rename(columns = {'MatchDateTime': 'date', 'NewsTranscripts': 'bodytext'})


In [126]:
MSNBC_df.date = pd.to_datetime(MSNBC_df.date)
MSNBC_df['year'] = MSNBC_df.date.dt.year
MSNBC_df['day'] = MSNBC_df.date.dt.day
MSNBC_df['month'] = MSNBC_df.date.dt.month


In [127]:
MSNBC_df.date = pd.to_datetime(MSNBC_df.date.dt.date)

In [128]:
MSNBC_df.head()


Unnamed: 0,date,Station,Show,Snippet,bodytext,year,day,month
1363,2018-01-25,MSNBC,MSNBC Live With Katy Tur,"reality. the u.s. can at a significant cost, c...",welcome back. it is transcript time. senator ...,2018,25,1
1364,2018-01-11,MSNBC,First Look,says they intend to honor the blue slip courte...,tuesday's televised meeting with lawmakers o...,2018,11,1
1365,2018-01-09,MSNBC,Morning Joe,of dhs. so many topics to discuss. seems prett...,"tuesday morning. ""morning joe"" starts right n...",2018,9,1
1366,2018-01-08,MSNBC,Kasie DC,"be the figure head of a parallel party, a para...",see how invisalign® treatment can shape your ...,2018,8,1
1367,2018-01-29,MSNBC,MSNBC Live With Velshi and Ruhle,president trump was asked if he believes in cl...,i brought my big boots to your wisdom and it ...,2018,29,1


#### Load the CNN and Fox TV news dataset

In [135]:
CNN_FOX_df = pd.read_csv('/Users/aminenhila/Desktop/Metis/Project4/Data_Code/Data/FINAL_FOX_CNN.csv')


In [137]:
CNN_FOX_df = CNN_FOX_df.drop(['Unnamed: 0', 'URL', 'IAPreviewThumb', 'IAShowID'], axis = 1)
# the csv file also contains rows from other stations without their transcripts;
# only the CNN and Fox News transcripts are available in this csv file, so keep just those rows
CNN_FOX_df = CNN_FOX_df[(CNN_FOX_df.Station == 'CNN') | (CNN_FOX_df.Station == 'FOXNEWS')]
CNN_FOX_df.head()

Unnamed: 0,MatchDateTime,Station,Show,Snippet,NewsTranscripts
0,1/22/2015 10:09:31,CNN,Early Start With John Berman and Christine Romans,that transformed the arctic. one alaska senato...,crisis in yemen that could derail the america...
1,1/25/2015 10:51:30,CNN,CNNI Simulcast,we've got white house correspondent michelle k...,[laughter] ♪ borf a liver tute face stummy wa...
2,1/28/2015 15:34:56,CNN,CNN Newsroom With Carol Costello,dioxide in the atmosphere when we burn fossil ...,"for the first lady for standing there, being ..."
3,1/25/2015 11:27:39,CNN,New Day Sunday,renewable energy more accessible and effortibl...,good morning. president obama receiving a war...
4,1/1/2015 10:10:23,CNN,All the Best All the Worst 2014 An Anderson Co...,taking aggressive steps to address climate cha...,you pull out the popcorn and let the side sho...


In [139]:
CNN_FOX_df.MatchDateTime = pd.to_datetime(CNN_FOX_df.MatchDateTime)
CNN_FOX_df = CNN_FOX_df.rename(columns = {'MatchDateTime': 'date', 'NewsTranscripts': 'bodytext'})


In [140]:
CNN_FOX_df.date = pd.to_datetime(CNN_FOX_df.date)
CNN_FOX_df['year'] = CNN_FOX_df.date.dt.year
CNN_FOX_df['day'] = CNN_FOX_df.date.dt.day
CNN_FOX_df['month'] = CNN_FOX_df.date.dt.month


In [141]:
CNN_FOX_df.date = pd.to_datetime(CNN_FOX_df.date.dt.date)

In [142]:
CNN_FOX_df.head()

Unnamed: 0,date,Station,Show,Snippet,bodytext,year,day,month
0,2015-01-22,CNN,Early Start With John Berman and Christine Romans,that transformed the arctic. one alaska senato...,crisis in yemen that could derail the america...,2015,22,1
1,2015-01-25,CNN,CNNI Simulcast,we've got white house correspondent michelle k...,[laughter] ♪ borf a liver tute face stummy wa...,2015,25,1
2,2015-01-28,CNN,CNN Newsroom With Carol Costello,dioxide in the atmosphere when we burn fossil ...,"for the first lady for standing there, being ...",2015,28,1
3,2015-01-25,CNN,New Day Sunday,renewable energy more accessible and effortibl...,good morning. president obama receiving a war...,2015,25,1
4,2015-01-01,CNN,All the Best All the Worst 2014 An Anderson Co...,taking aggressive steps to address climate cha...,you pull out the popcorn and let the side sho...,2015,1,1
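
The BBC, MSNBC and CNN/Fox cells above repeat the same steps: drop the bookkeeping columns, filter by station, rename columns, and split the date. A minimal sketch of a shared helper, assuming all three CSV files use the column names shown above; the function name `prepare_tv_news` and the two sample rows (with placeholder text) are made up for illustration.

```python
import pandas as pd


def prepare_tv_news(raw_df, stations):
    """Apply the shared cleaning steps to one TV news dataframe."""
    # keep only the requested stations, working on a copy
    df = raw_df[raw_df["Station"].isin(stations)].copy()
    # drop the bookkeeping columns if they are present
    df = df.drop(columns=["Unnamed: 0", "URL", "IAPreviewThumb", "IAShowID"],
                 errors="ignore")
    df = df.rename(columns={"MatchDateTime": "date", "NewsTranscripts": "bodytext"})
    df["date"] = pd.to_datetime(df["date"])
    df["year"] = df["date"].dt.year
    df["day"] = df["date"].dt.day
    df["month"] = df["date"].dt.month
    # keep the calendar date only (time set to midnight)
    df["date"] = df["date"].dt.normalize()
    return df


# made-up sample rows with placeholder text
sample = pd.DataFrame({
    "MatchDateTime": ["11/15/2017 13:26:03", "1/25/2018 19:34:10"],
    "Station": ["BBCNEWS", "MSNBC"],
    "Show": ["BBC News at One", "MSNBC Live With Katy Tur"],
    "Snippet": ["snippet text", "snippet text"],
    "NewsTranscripts": ["transcript text", "transcript text"],
    "URL": ["", ""],
})
bbc_sample = prepare_tv_news(sample, ["BBCNEWS"])
print(bbc_sample[["date", "year", "month", "day"]])
```

With the real files this would be called once per dataframe, for example with `['CNN', 'FOXNEWS']` for the combined CNN/Fox file.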


#### Combine the CNN, Fox, MSNBC, BBC TV news dataframes together

In [145]:
TV_news_df = pd.concat([CNN_FOX_df,MSNBC_df, BBC_df])

In [146]:
TV_news_df

Unnamed: 0,date,Station,Show,Snippet,bodytext,year,day,month
0,2015-01-22,CNN,Early Start With John Berman and Christine Romans,that transformed the arctic. one alaska senato...,crisis in yemen that could derail the america...,2015,22,1
1,2015-01-25,CNN,CNNI Simulcast,we've got white house correspondent michelle k...,[laughter] ♪ borf a liver tute face stummy wa...,2015,25,1
2,2015-01-28,CNN,CNN Newsroom With Carol Costello,dioxide in the atmosphere when we burn fossil ...,"for the first lady for standing there, being ...",2015,28,1
3,2015-01-25,CNN,New Day Sunday,renewable energy more accessible and effortibl...,good morning. president obama receiving a war...,2015,25,1
4,2015-01-01,CNN,All the Best All the Worst 2014 An Anderson Co...,taking aggressive steps to address climate cha...,you pull out the popcorn and let the side sho...,2015,1,1
...,...,...,...,...,...,...,...,...
63677,2017-08-06,BBCNEWS,Breakfast,still meets its climate change targets. italia...,"hello. this is breakfast, with rogerjohnson a...",2017,6,8
63678,2017-08-06,BBCNEWS,BBC News,to cap energy prices during june's election ca...,this is bbc news. the headlines at 10.00. the...,2017,6,8
63679,2017-08-08,BBCNEWS,Outside Source,climate change? in some ways it does but in so...,welcome to outside source. donald trump has t...,2017,8,8
63680,2017-08-10,BBCNEWS,BBC News,getting concerned and we don't know how it wil...,now it is time for our news review. we begin ...,2017,10,8


In [175]:
type(TV_news_dict)

dict

In [176]:
# convert it into a list of dictionaries, one per transcript
TV_news_dict = TV_news_df.to_dict(orient = 'records')

In [177]:
# insert the TV_news_dict records into the tvnews collection in the MongoDB NEWS database
tvnews.insert_many(TV_news_dict)

<pymongo.results.InsertManyResult at 0x11cf462b40>