-
Notifications
You must be signed in to change notification settings - Fork 1
/
OnlineNewsPopularityAnalysis.Rmd
1016 lines (800 loc) · 49.9 KB
/
OnlineNewsPopularityAnalysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Online News Popularity Analysis"
author: "Anastasia Makarevich"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output:
md_document:
variant: markdown_github
html_document:
# theme of html document
# theme of code highlight
# table of contents
theme : journal # "default", "cerulean", "journal",
# "flatly", "readable", "spacelab",
# "united", "cosmo", "lumen", "paper",
# "sandstone", "simplex", "yeti"
highlight : tango # "default", "tango", "pygments",
# "kate", "monochrome", "espresso",
# "zenburn", "haddock", "textmate"
toc : true # get table of content
toc_depth : 3
toc_float : true
df_print : paged
---
```{r echo=FALSE, message=FALSE, warning=FALSE, packages}
# Load all of the packages
library(ggplot2)
library(tidyr)
library(dplyr)
library(gridExtra)
library(reshape2)
library(GGally)
library(scales)
# knitr: Suppress code/messages/warnings
# Set default plot options and center them
knitr::opts_chunk$set(fig.width=9,fig.height=5,fig.path='Figs/',
fig.align='center',tidy=TRUE,
echo=FALSE,warning=FALSE,message=FALSE)
```
```{r echo=FALSE, message=FALSE, warning=FALSE, constants}
# set constants for colours
MAIN_COLOUR <- "royalblue4"
FILL_COLOUR <- "whitesmoke"
```
# Introduction. Dataset Overview
```{r echo=FALSE, message=FALSE, warning=FALSE, data_loading}
# Load the Data
news_raw <- read.csv('OnlineNewsPopularity/OnlineNewsPopularity.csv')
```
In this work I will be analysing popularity of online news, specifically - what features can possibly help us to predict the number of shares?
The source dataset can be downloaded from UCI repository: [see here](http://archive.ics.uci.edu/ml/datasets/Online+News+Popularity).
This dataset contains information about the articles published on Mashable.com during a two-year period. Popularity is measured by the number of shares since the publication date. An article is considered popular if it exceed the threshold of 1400 shares (as suggested by the dataset creators).
# Variables Transformation
The dataset contains 39797 observations with 61 attributes - real and integers. However, conceptually the dataset contains categorical variables as well which are encoded as integers (for example, channel name and weekday name). Such encoding is very convenient for prediction models. However, for the sake of plotting we will convert the dummy variables back to categorical variables. The following new categorical variables will be created:
- weekend (one of: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday)
- channel (one of: LifeStyle, Entertainment, Business, Social Media, Tech, World, Other (if it's not in one of the channels))
- topic (the topic with the maximum value for Latent Dirichlet Allocation is selected, one of: Topic 1, Topic 2, Topic 3, Topic 4, Topic 5)
```{r message=FALSE, warning=FALSE, preprocess}
# create and fill in the column for weekday
weekdays_names <- c('weekday_is_monday', 'weekday_is_tuesday',
'weekday_is_wednesday', 'weekday_is_thursday',
'weekday_is_friday', 'weekday_is_saturday',
'weekday_is_sunday')
weekdays <- c("Monday", 'Tuesday', 'Wednesday', 'Thursday', 'Friday',
'Saturday', 'Sunday')
# rename weekday columns
colnames(news_raw)[colnames(news_raw) %in% weekdays_names] <- weekdays
# create one column for weekdays
news_raw$weekday <- factor(colnames(news_raw[weekdays])
[max.col(news_raw[weekdays])], levels = weekdays)
channel_names <- c('data_channel_is_lifestyle', 'data_channel_is_entertainment',
'data_channel_is_bus', 'data_channel_is_socmed',
'data_channel_is_tech', 'data_channel_is_world')
channels <- c('Lifestyle', 'Entertainment', 'Business',
'Social Media', 'Tech', 'World')
news_raw$Other <- as.numeric(!rowSums(news_raw[channel_names]))
# rename columns
colnames(news_raw)[colnames(news_raw) %in% channel_names] <- channels
# create column with topic names and convert them to ordered factor
news_raw$channel <- factor(colnames(news_raw[c(channels, 'Other')])
[max.col(news_raw[c(channels, 'Other')])],
levels = c(channels, 'Other'))
# select topic names
topic_names <- c('LDA_00','LDA_01','LDA_02','LDA_03','LDA_04')
# define news topic names
topics <- c('Topic 1', 'Topic 2', 'Topic 3', 'Topic 4', 'Topic 5')
# rename columns
colnames(news_raw)[colnames(news_raw) %in% topic_names] <- topics
# create column with topic names and convert them to ordered factor
news_raw$topic <- factor(colnames(news_raw[topics])[max.col(news_raw[topics])],
levels = topics)
```
```{r echo=FALSE, message=FALSE, warning=FALSE, sample}
# remove outliers
news <- news_raw
# create a sample of the subset, so that plotting is faster and takes less space
news_sample <- news[sample(1:nrow(news), 5000), ]
```
# Univariate EDA
## News Examples: Best and Worst
Before we dive into exploring variables, it's interesting to take a quick look what news are the most/least readable.
### Best Stories
```{r echo=FALSE, message=FALSE, warning=FALSE, top_stories}
# sort stories by the number of shares in decreasing order and show top 5
head(news[order(news$shares,decreasing=T),c('url','shares','timedelta')],n=5)
```
### Worst Stories
```{r echo=FALSE, message=FALSE, warning=FALSE, worst_stories}
# sort stories by the number of shares in increasing order and show top 5
head(news[order(news$shares,decreasing=F),c('url','shares','timedelta')],n=5)
```
## Number of Shares
Numeric Summary:
```{r echo=FALSE, message=FALSE, warning=FALSE, shares_summary}
# show statistics for shares
summary(news$shares)
```
The distribution of the number of shares of the original data is highly skewed (right-skewed), so
there are two plots: the first one shows original data, the second one is without outliers (omitting values abouve Q3 + 1.5*IQR).
```{r echo=FALSE, message=FALSE, warning=FALSE, shares_distribution}
# create histogram with all shares
shares <- ggplot(aes(x = shares), data=news_raw) +
geom_histogram(fill=FILL_COLOUR, col=MAIN_COLOUR, bins=30) +
theme_linedraw() +
xlab("Number of Shares") +
ylab("Counts")
# create histogram for shares without outliers
shares_no_outliers <-
ggplot(aes(x = shares), data=
subset(news_raw, news_raw$shares < 1.5*IQR(news_raw$shares)+
quantile(news_raw$shares,0.75))) +
geom_histogram(fill=FILL_COLOUR, col=MAIN_COLOUR, bins=30) +
theme_linedraw() +
xlab("Number of shares (No Outliers)") +
ylab("Counts")
grid.arrange(shares, shares_no_outliers,ncol=1)
```
Since the distribution is so heavily skewed, it is reasonable to remove outliers so that we work with more balanced data. We will consider everything that is above Q3 + 1.5*IQR as an outlier.
```{r echo=FALSE, message=FALSE, warning=FALSE, outliers}
# remove outliers
news <- subset(news_raw, news_raw$shares < 1.5*IQR(news$shares)+
quantile(news$shares,0.75))
```
## Days of the Week
Summary:
```{r echo=FALSE, message=FALSE, warning=FALSE, weekday_summary}
#show summary for the number of news per weekday
summary(news$weekday)
```
```{r echo = FALSE, message=FALSE, warning=FALSE, weekday_plot}
# create barplot for number of news by weekday
ggplot(aes(x = weekday), data=news) +
geom_bar(fill=FILL_COLOUR, color=MAIN_COLOUR) +
theme_linedraw() +
xlab("Weekday")
```
There are significantly fewer publications on weekends. We should explore if there are fewer shares on weekends as well.
## News Channels
```{r echo=FALSE, message=FALSE, warning=FALSE, channel_summary}
# show summary for number of news by channel
summary(news$channel)
```
```{r echo=FALSE, message=FALSE, warning=FALSE, Channel_Plot}
# create barplot for number of news by channel
ggplot(aes(x = channel), data=news_sample) +
geom_bar(fill=FILL_COLOUR, color=MAIN_COLOUR) +
theme_linedraw()
```
Surprisingly, Lifestyle and Social Media channels get significantly fewer publications that any other channels. Also, there is significant amoung of news not assigned to any chanel, so we should account for that.
## News Topics
```{r echo=FALSE, message=FALSE, warning=FALSE, topics_summary}
# show summary for the number of news by topic
summary(news$topic)
```
```{r echo=FALSE, message=FALSE, warning=FALSE, topics_plot}
# create barplot for the number of news by topic
ggplot(aes(x = topic), data=news_sample) +
geom_bar(fill=FILL_COLOUR, color=MAIN_COLOUR) +
theme_linedraw()
```
We see that Topic 2 is the least covered on Mashable, while news on Topics 4 and 5 are published more often.
## Title and Text Length
The length of the text/title is measured in the number of tokens (not necessarily distinct). In the very basic case, the stop words are not excluded, so we're simply measuring text/title length with the number of words in it.
```{r echo=FALSE, message=FALSE, warning=FALSE, words_summary}
# show summary for title length and text length
summary(select(news,c("n_tokens_title", "n_tokens_content")))
```
```{r echo=FALSE, message=FALSE, warning=FALSE, words_plot}
# create histogram for the title length
title_length <- ggplot(aes(x=n_tokens_title), data=news) +
geom_histogram(bins=20, fill=FILL_COLOUR, color=MAIN_COLOUR) +
labs(title="Title Length (Number of\nwords in title)") +
xlab("Title Length") +
theme_linedraw()
# create histogram for the text length
text_length <- ggplot(aes(x=n_tokens_content), data=news) +
geom_histogram(binwdith=200, fill=FILL_COLOUR, color=MAIN_COLOUR) +
labs(title ="Text Length (Number of\nwords in text)") +
xlab("Text Length") +
theme_linedraw()
# plot two histograms side by side
grid.arrange(title_length, text_length, ncol=2)
```
There is one interesting thing that we can notice here. There is quite large number of news with zero number of words! Does it even make sense?
```{r echo=FALSE, message=FALSE, warning=FALSE, no_text_counts}
# select news that contain no text
no_text <- subset(news, news$n_tokens_content==0)
```
To be precise, there are `r nrow(no_text)` of them. Let's see if they contain anything instead:
```{r echo=FALSE, message=FALSE, warning=FALSE, no_text_characteristics}
# count the number of news that contain videos, the number of news that contain
# images, the number of ntws that contain links and the number of news that
# contain neither
data.frame(counts = c(sum(no_text$num_videos > 0),
sum(no_text$num_imgs > 0),
sum(no_text$num_hrefs > 0),
sum(no_text$num_videos==0 &
no_text$num_imgs==0 &
no_text$num_hrefs==0)),
row.names = c("Videos", "Images", "Links", "Nothing"))
```
Looks like there are 78 news bits that don't have any content at all! Let's actually look at the titles with supposedly no content:
```{r echo=FALSE, message=FALSE, warning=FALSE, no_text_sample}
# show first 10 news with no text, no images, no links
head(subset(no_text, no_text$num_videos==0 &
no_text$num_imgs==0 &
no_text$num_hrefs==0)$url, 10)
```
And, surprisingly, they still get a lot of shares! There must be something wrong. Let's check one news article from this list - [http://mashable.com/2013/01/23/fitness-gadget-gym-cost-comparison/](http://mashable.com/2013/01/23/fitness-gadget-gym-cost-comparison/). Turns out, it contains a lot text and a lof images! We can conclude that these observations might be corrupted! Do those with no text really have no text then? Let's see:
```{r echo=FALSE, message=FALSE, warning=FALSE, no_text}
# show first article that has at least one video or at least one image or
# at least onelink
head(subset(no_text, no_text$num_videos>0 |
no_text$num_imgs>0 |
no_text$num_hrefs>0)$url, 1)
```
[The link](#'http://mashable.com/2013/01/23/actual-facebook-graph-searches/') leads to an article named "Tumbrl Serves Up Hilariously Awful 'Actual Facebook Graph Searches'" that does have text! We can suspect, that actually these news that are reported to have zero number of words were actually parsed incorrectly. We could have written our own parser, but this is outside of the scope of this work, so we will just remove these observations from the dataset.
```{r message=FALSE, warning=FALSE, no_text_removal}
# remove news with no text
news <- subset(news, news$n_tokens_content > 0)
```
Also, at this point we can create another categorical variable that will split texts by length. We will create 3 buckets:
- Short texts: 0-400 words
- Medium: 400-1000 words
- LongReads: over 1000 words
```{r message=FALSE, warning=FALSE, length_category}
# create categorical variable for text length by cutting n_tokens_content into
# three categories
news$text_length <- cut(news$n_tokens_content, breaks = c(
0, 400, 1000, 9000),
labels = c("Short","Medium","LongRead"))
summary(news$text_length)
```
## Videos and Links
Summary:
```{r echo=FALSE, message=FALSE, warning=FALSE, videos_summary}
# show summary for number of links, number of imagees, number of videos and
# number of self-references
summary(select(news, c("num_hrefs", "num_self_hrefs",
"num_imgs", "num_videos")))
```
```{r echo=FALSE, message=FALSE, warning=FALSE, videos_links_images_plot}
# create histogram for number of links
links <- ggplot(aes(x=num_hrefs), data=news) +
geom_histogram(binwidth=2, fill=FILL_COLOUR, color=MAIN_COLOUR)+
theme_linedraw() +
xlab("Number of links") +
labs(title="Number of Links") +
coord_cartesian(xlim=c(-0.5, quantile(news$num_hrefs,0.75)+
1.5*IQR(news$num_hrefs)))
# create histogram for number of self-references
self_refs <- ggplot(aes(x=num_self_hrefs), data=news) +
geom_histogram(binwidth=1, fill=FILL_COLOUR, color=MAIN_COLOUR)+
theme_linedraw() +
xlab("Number of self references") +
labs(title ="Number of Self References") +
scale_x_continuous(breaks= pretty_breaks()) +
coord_cartesian(xlim=c(-0.5,
quantile(news$num_self_hrefs,0.75)+
1.5*IQR(news$num_self_hrefs)))
# create histogram for number of images
images <- ggplot(aes(x=num_imgs), data=news) +
geom_histogram(binwidth=1, fill=FILL_COLOUR, color=MAIN_COLOUR)+
theme_linedraw() +
xlab("Number of images") +
labs(title ="Number of Images") +
coord_cartesian(xlim=c(-0.5,quantile(news$num_imgs,0.75)+
1.5*IQR(news$num_imgs))) +
scale_x_continuous(breaks= pretty_breaks())
# create histogram for number of videos
videos <- ggplot(aes(x=num_videos), data=news) +
geom_histogram(binwidth=1, fill=FILL_COLOUR, color=MAIN_COLOUR)+
theme_linedraw() +
xlab("Number of videos") +
labs(title ="Number of Videos") +
coord_cartesian(xlim=c(-0.5,quantile(news$num_videos,0.75)+1.5*IQR(news$num_videos)))
# plot all 4 histograms on 2x2 grid
grid.arrange(links, self_refs, images, videos, ncol=2, nrow=2)
```
All of the above histograms are right-skewed - most values are concentrated on the left, so smaller values are more typical. We see that an average text has 10 links (including 3 self-references), 4 images an 1 video (but not necessarily all these at the same time).
## Text Semantics
### Subjectivity and Polarity
Polarity measures the emotions expressed in the text. In this dataset polarity values are in the continuous interval from -1 to 1 (inclusive). Closeness to -1 means that text has negative sentiment, while closness to +1 means positive sentiment.
Subjectivity simply measures how subjective is the text (i.e. if it expresses an opinion or states a fact). Subjectivity of 0 will indicate that the text simply states the facts, while subjectivity close to 1 will indicate that the text is an opinion rather than a bunch of facts.
```{r echo=FALSE, message=FALSE, warning=FALSE, text_summary}
# show summary for polarity and subjectivity of title and text
summary(select(news, c("global_subjectivity", "global_sentiment_polarity",
"title_subjectivity", "title_sentiment_polarity")))
```
```{r echo=FALSE, message=FALSE, warning=FALSE, text_plot}
# create histogram for text subjectivity
glob_subj <- ggplot(aes(x=global_subjectivity), data=news) +
geom_histogram(bins=20, fill=FILL_COLOUR, color=MAIN_COLOUR)+
theme_linedraw() +
xlab("Globar subjectivity value") +
labs(title="Global Subjectivity")
# create histogram for text polarity
glob_sent <- ggplot(aes(x=global_sentiment_polarity), data=news) +
geom_histogram(bins=20, fill=FILL_COLOUR, color=MAIN_COLOUR)+
theme_linedraw() +
xlab("Global sentiment polarity value") +
labs(title ="Global Sentiment Polarity")
# create histogram for title subjectivity
title_subj <- ggplot(aes(x=title_subjectivity), data=news) +
geom_histogram(bins=20, fill=FILL_COLOUR, color=MAIN_COLOUR)+
theme_linedraw() +
xlab("Title subjectivity value") +
labs(title ="Title subjectivity")
# create histogram for title polarity
title_sent <- ggplot(aes(x=title_sentiment_polarity), data=news) +
geom_histogram(bins=20, fill=FILL_COLOUR, color=MAIN_COLOUR)+
theme_linedraw() +
xlab("Title polarity value") +
labs(title ="Title sentiment polarity")
# plot all 4 histograms on 2x2 grid
grid.arrange(glob_subj, title_subj, glob_sent, title_sent, ncol=2, nrow=2)
```
We can note that global subjectivity has normal distribution with the mean a bit shifted to the left from 0.5, so most of the texts tend to be neutrality. And if we look at the same characteristic of the title, we will notice that there is a very explicit peak at 0 that tells us that significant proportion (specifically - `r nrow(subset(news, news$title_subjectivity==0))/nrow(news)`) of all news titles simply state facts.
The distribution for the text sentiment polarity looks normal with the mean shifted to the right from 0 which means that in general the newswriters prefer to be "positive". However, when it comes to naming the news, the authors mostly (`r nrow(subset(news, news$title_sentiment_polarity==0))/nrow(news)` of the titles) prefer neutral style.
### Negative and Positive Words Rates
Global rate of positive/negative words shows the percentage of positive/negative words in the whole text. For the following histogram I decided to combine them and plot the histogram for the proportion of non-neutral tokens in the text.
Rate of positive words shows the percentage of positive words among all non-neutral tokens. For this plot I took a subset of all the news the do contain any non-neutral words, because the rate of 0 can also mean that there are no non-neutral words at all.
Numerica summary for global rate of non-neutral words and globalr rate of positive words:
```{r echo=FALSE, message=FALSE, warning=FALSE, non_neutral_summary}
# show summaris for rate of non-neutral words and rate of positive words
summary(news$global_rate_positive_words+news$global_rate_negative_words)
summary(news$rate_positive_words)
```
```{r echo=FALSE, message=FALSE, warning=FALSE, non_neutral_plot}
# create histogram for rate of non-neutral words
non_neutral <- ggplot(aes(x=global_rate_positive_words+global_rate_negative_words), data=news) +
geom_histogram(bins=20, fill=FILL_COLOUR, color=MAIN_COLOUR)+
theme_linedraw() +
xlab("Proportion of non-neutral words from the total") +
labs(title="Global Rate Non-Neutral words")
# create histogram for rate of positive words
positive <- ggplot(aes(x=rate_positive_words),
data=subset(news,
(news$global_rate_negative_words+
news$global_rate_positive_words)>0)) +
geom_histogram(bins=20, fill=FILL_COLOUR, color=MAIN_COLOUR)+
theme_linedraw() +
xlab("Positive/Non-Neutral words ratio") +
labs(title="Rate of Positive Words")
# plot two histograms side by side
grid.arrange(non_neutral, positive, ncol=2)
```
We can conclude that the texts mostly have low proportion of non-neutral words. And for the non-neutral tokens we can see that the histogram is left-skewed - most news are positive. It's interesting to note that there are much more news that have only positive and no negative words than news with just negative words (bars at 0 and at 1).
# Univariate Analysis
## Dataset Structure
Original dataset contains 39644 observations with 61 attributes. The attributes are float, integer and categorical. The distribution of the variable of interest (shares) is heavily skewed, so we had to remove the outliers.
## Main Feature of Interest
We're interested in exploring how the number of shares is relates to other variables.
## Feature to Support the Feature of Interest
- the subjectivity/polarity of title/text
- the number/presece of images
- the length of the text
- the data channel
- the day of publication
- the topic
## Additional Features
I createed several variables during the first stage:
- weekday (decodes dummy variables)
- channel (decodes dummy variables)
- other (to identify that the channel is other)
- topic' that stores the topic that has the highest value of membership according to LDA
- text_length (categorical variabl that splits all texts into 3 categories: Short, Medium, LongRead)
- contains_images (indicates if the text contains images)
## Unusual Distributions and Variable Transformations
Most of the distributions are right-skewed, which seems quite logical. There are
only two plots that show almost normal distribution:
- global subjectivity
- global sentiment polarity
- number of words in title
I've also sclaed the distribution of the numer of shares using log scale, because
the data contains extreme values (outliers) that make it hard to see the shape.
Also, it looks like for the goals of prediction we should better get rid of
outliers - they are extreme values that happen rarely so it doesn't make sense to
analyse them together with other oservations.
One interesting distribution is the distribution of the rate of positive words - it's left-skewed for the articles that contain any non-neutral tokens. Also the distributions of the title polarity and subjectivity have very clear peak at the "neutral position" which means that most titles are neutral (both in terms of subjectivity and polarity).
# Bivariate EDA
## Older => more popular?
One natural question that one may ask this data is whether the news get more shares as more and more days pass? The smallest time interval between the publication data and data acquisition date (in days) is `r min(news$timedelta)`. So we're not comparing total novices to old monsters. Then it makes sense to plot the number of shares against the number of days elapsed:
```{r echo=FALSE, message=FALSE, warning=FALSE, time_shares_plot}
# plot number of shares against timedelta
ggplot(aes(x=timedelta, y = shares), data=news) +
geom_point(alpha = 0.5, size = 1, position = 'jitter', colour=MAIN_COLOUR) +
xlab("Days") +
ylab("Number of shares") +
theme_linedraw()
```
One might expect that as time goes by, the text will naturally will gain more shares, but the two variables look completely uncorrelated. And, indeed, the correlation coefficient is just `r cor(news$timedelta, news$shares, method='pearson')`
We can explain it by the fact that new articles become outdated very fast, so if it's not shared intensively in the first 8 days, then it will not become significantly more popular no matter how long it stays on the site. But this is actually good for our analysis, because it justifies our measure of the popularity in the number of shares even if for the texts with very different publication dates.
## Numeric Characteristics of the Texts
Let's see if there is some relationship between the number of shares and numeric characteristics of the text, such as:
- title length
- text length
- number of links
- number of images
- number of videos
```{r echo=FALSE, message=FALSE, warning=FALSE, cor_numeric}
# select only numeric characteristics of the text and number of shares
numeric_characteristics <- news[c('n_tokens_title',
'n_tokens_content', "num_hrefs",
"num_imgs", "num_videos",
'shares')]
#round(cor(numeric_characteristics),4)
```
```{r echo=FALSE, message=FALSE, warning=FALSE, plot_numeric}
# create and show plot for all possible pairs of numeric characteristics and the number of shares
num_pairs <- ggpairs(numeric_characteristics, columnLabels = c('Title\nLength',
'Text\nLength',
'Number of\nLinks',
'Number of\nImages',
'Number of\nVideos',
'Shares'),
lower=list(continuous=wrap("points", colour=MAIN_COLOUR,alpha = 0.2,
size = 1, position = 'jitter'))) +
theme(text = element_text(size=8))
num_pairs
```
It doesnt look like any of the variables are highly or even moderately related to the number of shares. The highest we see is the correlation of 0.0823 and 0.0585 and 0.0523 between the average number of shares per day and the number of references and images and text length.
As for the other variables, we can see moderate correlation between:
- the number of references and the text length (0.408);
- the number of images and the text length (0.367);
- the number of images and the number of references (0.351);
Here it makes sense to stop and think about the variables we're assessing. Maybe it's not the number of images that influence the number of shares, but it's their presence that can make a difference? Let's create a variable for that and call it "contains_images".
```{r echo=FALSE, message=FALSE, warning=FALSE, transformation_images}
# create new variable that indicates whether the articles contains an image
news$contains_images <- news$num_imgs > 0
```
```{r echo=FALSE, message=FALSE, warning=FALSE, contains_images_plot}
# create scatter plot for number of images against the number of shares
images_shares <- ggplot(aes(x=num_imgs, y = shares), data = news) +
geom_point(colour=MAIN_COLOUR) +
scale_y_log10() +
ylab("Shares (Log Scale)") +
xlab("Number of Images") +
theme_linedraw()
# create boxplot to compare the number of shares between article with and
# without images
images_shares_bin <- ggplot(aes(x=contains_images, y = shares), data=news) +
geom_boxplot(colour=MAIN_COLOUR) +
ylab("Shares (Log Scale)") +
xlab("Contains Images") +
scale_y_log10() +
theme_linedraw()
grid.arrange(images_shares, images_shares_bin, ncol=2)
```
There is some difference but it's very weak - maybe it's more explicit in some specific categories? We should explore it in multivariate analysis section.
```{r echo=FALSE, message=FALSE, warning=FALSE, contains_images_summary}
# compare median for the number of shares for news with and without images
data.frame(news %>% group_by(contains_images) %>%
summarize(median=median(shares)))
```
## Sematnic Characteristics of the Texts
Next, we can check if there is any relationshiip between the number of shares and semantic characteristics of the text, such as:
- global subjectivity
- global sentiment polarity
- global rate of positive words
- global rate of negative words
- rate of positive words
```{r echo=FALSE, message=FALSE, warning=FALSE, semantic_cor}
# select features with semantic characteristics and the number of shares
semantic_characteristics <- news[c('global_subjectivity',
"global_sentiment_polarity",
'global_rate_positive_words',
'global_rate_negative_words',
'rate_positive_words',
'shares')]
#round(cor(semantic_characteristics),4)
```
```{r echo=FALSE, message=FALSE, warning=FALSE, semantic_plot}
# create and show plot for all possible pairs of semantic characteristics and the number of shares
sem_pairs <- ggpairs(semantic_characteristics, columnLabels =
c('Global\nSubjectivity',
'Global Sentiment\nPolarity',
'Global Rate\nPositive Words',
'Global Rate\nNegative Words',
'Rate Positive\nWords',
'Shares'),
lower=list(continuous=
wrap("points", colour=MAIN_COLOUR,
alpha = 0.2, size = 1,
position = 'jitter'))) +
theme(text=element_text(size=8))
sem_pairs
```
Again, we can see that there are no strong correlations between the number of shares and other variables. There is some very very weak positive correlation between the number of shares and text subjectivity and global rate of positive words.
Note: we should be very careful with these attributes: we shouldn't use rate of positive words, global rate of positive words and global sentiment polarity if we ever decide to build a model because these three variable are correlated (which is quite logical).
As a final step for text characteristics, let's see how the characteristics themselves are related to each other:
```{r echo=FALSE, message=FALSE, warning=FALSE, num_sem_plot}
# select most interesting numeric and semantic characteristics and plot
# them against each other
num_sem_characteristics <- news[c('n_tokens_title',
'n_tokens_content', 'num_imgs', 'num_hrefs',
'global_subjectivity','global_sentiment_polarity')]
ggpairs(num_sem_characteristics, columnLabels = c('Title Length',
'Text Length',
'Number of\nImages',
'Number of\nLinks',
'Global Subjectivity',
"Global Sentiment\nPolarity"),
lower=list(continuous=wrap("points", colour=MAIN_COLOUR,alpha = 0.2,
size = 1, position = 'jitter'))) +
theme(text=element_text(size=8))
```
I expected to see some negative correlation between the number of links and subjectivity (links might be used to reference some facts to prove the point), but the result was the oppositte. One possible explanation is that at the same time links can also mean references to the image/video sources. Nevertheless, the correlation is too small to claim anything.
## Shares by Weekday
In the univariate analysis we noticed that weekend gets much fewer publications than work days. What about the shares of the news published on weekend? Do these news go unnoticed by the audience?
```{r echo=FALSE, message=FALSE, warning=FALSE, shares_by_weekday}
# create boxplot that shows statistics of shares by weekday
ggplot(aes(x=weekday, y = shares), data=news) +
geom_boxplot(colour=MAIN_COLOUR) +
ylab("Shares") +
xlab("Weekday") +
theme_linedraw()
```
Quite the opposite! Even though there are less news published on weekend, those that published get significnatly more shares, especially on Saturday. It doesn't mean that one causes another, of course. One guess is that on weekend people have more time to actually read the news thoroughly and of course they first look at those that were just published. It's interesting to check if weekend news are longer than work day news:
```{r echo=FALSE, message=FALSE, warning=FALSE, length_by_weekday}
# create boxplots for text length by weekday
ggplot(aes(x=weekday, y = n_tokens_content), data=news) +
geom_boxplot(colour=MAIN_COLOUR) +
ylab("Text Length") +
scale_y_log10() +
xlab("Weekday") +
theme_linedraw()
```
And indeed - it looks like weekends texts (especially those published on Saturday) are are slightly longer:
```{r echo=FALSE, message=FALSE, warning=FALSE, length_by_weekday_summary}
# show medians for each weekday
data.frame(news %>% group_by(weekday) %>%
summarize(median=median(n_tokens_content)))
```
### Shares by Channel
The next question is - are people more likely to share the news published in specific channels?
```{r echo=FALSE, message=FALSE, warning=FALSE, shares_by_channel}
# create boxplots for number of shares by channel
ggplot(aes(x=channel, y = shares), data=news) +
geom_boxplot(colour=MAIN_COLOUR) +
ylab("Shares") +
scale_y_log10() +
theme_linedraw()
```
It looks like Entertainment and World news are the least shareable, while Social Media news are the most shareable.
### Shares by Topic
Let's check if we can see some pattern in the number of shares by topic:
```{r echo=FALSE, message=FALSE, warning=FALSE, shares_by_topic}
# create boxplots fo number of shares by topic
ggplot(aes(x=topic, y = shares), data=news) +
geom_boxplot(colour=MAIN_COLOUR) +
ylab("Shares") +
scale_y_log10() +
xlab("Topic") +
guides(colour=FALSE)
```
Looks like topics 1 and 5 are shared slightly more often than topics:
```{r echo=FALSE, message=FALSE, warning=FALSE, shares_by_topic_summary}
# show medians for number of shares by topic
data.frame(news %>% group_by(topic) %>%
summarise(median = median(shares)))
```
But what do these topics mean?
### What Do Topics Mean?
Since we don't have access to the original data scraping script, we can only guess what the topics are about. One way to get some idea of what the topics are about is to see how these anonymous topics distributed across channels:
```{r echo=FALSE, message=FALSE, warning=FALSE, topics_channels}
# create barplots to show news counts by topic by channel
ggplot(aes(x=topic, fill=topic),
data = news) +
geom_bar() +
theme_linedraw() +
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.title.x=element_blank()) +
xlab("Topic") +
ylab("Count") +
facet_wrap(~channel)
```
And indeed we can make some conclusions based on these barplots:
- topic 1 is related to business
- topic 5 is related to technology
- topic 3 is related to world news
- topic 2 is related to entertainment
Taking into account our previous visualisation, we can now say that news about business and technology are better candidates for popular news than anything else.
## Number of Links, Number of Images vs Number of Shares
```{r echo=FALSE, message=FALSE, warning=FALSE, length_vs_shares}
# plot shares against number of links as scatterplot
shares_hrefs <- ggplot(aes(x=num_hrefs, y=shares), data=news) +
geom_point(alpha = 0.2, size = 3, position = 'jitter', colour=MAIN_COLOUR) +
xlab("Number of Links") +
ylab("Shares (Log Scale)") +
scale_y_log10() +
theme_linedraw()
# plot shares against the number of images as scatterplot
shares_images <- ggplot(aes(x=num_imgs, y=shares), data=news) +
geom_point(alpha = 0.2, size = 3, position = 'jitter', colour=MAIN_COLOUR) +
xlab("Number of Images") +
ylab("Shares (Log Scale)") +
theme_linedraw() +
scale_y_log10()
grid.arrange(shares_hrefs, shares_images, ncol=2)
```
The correlation is very weak and is barely identifiable on the plot, so we should not rely on it.
## Being Objective Doesn't Make You Popular
When first approaching this dataset, I expected that there will be some moderate correlation between the subjectivity and the number of shares. On one hand, factual news are more reliable and so "safer" to share. On the other hand, more subjective news are more appealing. So maybe these two thing balance each other out instead? Hence, the low value of the correlation coefficient - `r cor(news$global_subjectivity, news$shares)`.
```{r echo=FALSE, message=FALSE, warning=FALSE, subj_plot}
# plot shares against global subjectivity
ggplot(aes(x=global_subjectivity, y = shares),
data = news) +
geom_point(alpha = 0.2, size = 2, position = 'jitter', colour=MAIN_COLOUR) +
theme_linedraw() +
xlab("Subjectivity") +
ylab("Shares")
```
The plot can serve as an example of absolute patternlessness. I don't think we should build any models based on that.
# Bivariate Analysis
## Observed Relationships
It looks like our variable of interest (the number of shares) doesn't have any strong relationship with any of other variables. Correlation values are very low and if we want to build a model, then maybe it will make sense to cluster the dataset somehow and build a separate model for each cluster, but this is out of the scope of this work.
That said, there are some weak, but rather interesting relationships. We noticed that:
- news published on weekend get more shares;
- text length has barely noticeable positive effect on the number of shares;
- news published in Social Media channel are shared more often, but at the same time news on topics related to Technology and Business are more shared (looks like we have some sort of Simpson's paradox here!);
## Relationships Between Features
One of the most interesting relationships is the relationship between the weekday and text length. We've discovered that texts published on weekends get more shares and are longer! At the same time, there are fewer texts published on weekend and there is almost negligible correlation between the text length and the number of shares.
## Strongest Relationship?
The strongest relationship was among the variables that are innately connected. These are global rate of positive/negative words and rate of positive/negative words as well as subjectivity and polarity. So, as it has been said, we should be very careful if we decide to build a model with this variables. For example, positive words often indicate some emotion that increases the subjectivity of the text, as well as polarity, so we shouldn't be surprised to see positive correlation between them.
The most noticeable impact on the number of shares is from day of publication and channel. As for continuous variables, the strongest relationship (although still very weak) with the average number of shares per day was shown by:
- number of links;
- number of images;
- global subjectivity.
# Multivariate EDA
## Shares by Text Length: With Images vs Without Images
Previously, we noted that presence of images doesn't influence the number of shares that much. Let's check if that's true for all texts independently of their length:
```{r echo=FALSE, message=FALSE, warning=FALSE, length_shares_images}
# create boxplot fo number of shares for news with/without images by text length
ggplot(aes(x=contains_images, y=shares, color=contains_images), data=news) +
geom_boxplot() +
scale_y_log10() +
facet_wrap(~text_length) +
theme_linedraw() +
xlab('Contains Images?') +
ylab('Number of Shares') +
guides(colour=FALSE)
```
Looks like the presence of images has positive impact on the number of shares for long reads. Also, it's interesting that the median for the number of shares is lower for long reads than for medium and short texts, so long articles without images are the least shareable.
```{r echo=FALSE, message=FALSE, warning=FALSE, length_shares_images_summary}
# Show medians and IQRs for news of different lengths with/without images
data.frame(subset(news) %>% group_by(contains_images,text_length) %>% summarize(median=median(shares),
lower_fence = quantile(shares,0.25),
upper_fence = quantile(shares,0.75)))
```
## Shares by topic by text length
```{r echo=FALSE, message=FALSE, warning=FALSE, shares_topics_length}
# create boxplots for the number of shares by topic by text length
ggplot(aes(x=topic, y=shares, color=topic), data=news) +
geom_boxplot() +
scale_y_log10() +
facet_wrap(~text_length) +
theme_linedraw() +
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.title.x=element_blank())
```
Here we can notice that long reads on topic 1 (Business) get the most shares. The trend doesn't hold for short texts: most popular short texts are on topics 4 and 5 (Technology). Both observations seem quite logical: long reviews in business topics and short technical news (e.g. about some new gadget) are what seems worth sharing.
## How are business and tech news affected by the presence of images?
```{r}
ggplot(aes(x=topic, y=shares, color=contains_images), data=
subset(news,news$topic=="Topic 1" | topic=="Topic 5")) +
geom_boxplot() +
scale_y_log10() +
facet_wrap(~text_length) +
theme_linedraw() +
theme(axis.text.x=element_text(size=8, angle=45)) +
ylab("Number of Shares") +
xlab("Topic") +
guides(color=guide_legend(title="Contains Images?"))
```
We can see that news related to Technology are less affected by the absence of images, while for business long reads it looks like an important feature.
## Subjectivity vs Shares by Channel: Weekends and Workdays
```{r echo=FALSE, message=FALSE, warning=FALSE, subj_shares_topic}
news$is_weekend <- news$Saturday | news$Sunday
ggplot(aes(x=global_subjectivity, y = shares, color=is_weekend),
data = subset(news, news$channel !='Other')) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
theme_linedraw() +
xlab("Number of Images") +
ylab("Shares (Log Scale)") +
facet_wrap(~channel) +
guides(color=guide_legend(title="Weekend?"))
```
The scatter plots looks quite similar to each other (i.e. show no correlation), but we may notice a few things. No matter what the channel is, the subjectiviy of the text doesn't influence it much as well as other numeric characteristics. And we can again see that news published on weekened are shifted up while news shared on workdays are more concentrated on the bottom.
# Multivariate Analysis
## Observed Relationships
One interesting finding is that presence of images has more impact on long text rather than on shorter ones. However, it's quite logical if you think about it - it's easier to read long text and get involved if it has visuals. Also, we identified that business long reads is the top shared category that strengthened our previous observation that tech and business news are the most shared.
## Inter-Feature Interactions
I tried various combinations, but it doesn't look that there are more connections. The most interesting connections have already been reported in bivariate analysis section. Further exploration didn't give anything.
### Possible Models
I decided not to build a model, because it doesn't look like any of the variables have really significan and measurable impact on the outcome variables (i.e. we can't predict based only on 0/1 columns). This work was important to understand that numeric and basic semantic characteristics are not enough to predict popularity.
------
# Final Plots and Summary
## Plot One: News are Negatively Skewed in a Positive Way!
```{r echo=FALSE, Plot_One}
non_neutral_summary <- ggplot(aes(x=global_rate_positive_words+global_rate_negative_words),
data=news) +
geom_histogram(bins=20, fill=FILL_COLOUR, color=MAIN_COLOUR)+
theme_linedraw() +
xlab("Proportion of non-neutral words from the total") +
ylab("News counts") +
labs(title="Global Rate Non-Neutral words")
positive_summary <- ggplot(aes(x=rate_positive_words),
data=subset(news,
(news$global_rate_negative_words+
news$global_rate_positive_words)>0)) +
geom_histogram(bins=20, fill=FILL_COLOUR, color=MAIN_COLOUR)+
theme_linedraw() +
xlab("Positive/Non-Neutral words ratio") +
ylab("News counts")+
labs(title="Rate of Positive Words")
grid.arrange(non_neutral_summary, positive_summary, nrow=2)
```
I live in Eastern Europe and based on the news I see every day here I expected quite the opposite distribution: i.e., I expected that news will be left-skewed in terms of rate of non-neutral words, while the ratio of positive words to non-neutral words will be right skewed, because it's simply easier to draw one's attention with bad news. But it's actually quite the opposite: the second histogram shows that in fact, the distribution of positive words rations is left-skewed. However, we don't know if it's because there is some latent advertising that boosts the rate of positive words or the authors are just nice.
## Plot Two: Weekends Get Fewer Publications, But More Shares
```{r echo=FALSE, Plot_Two}
weekday_counts_summary <- ggplot(aes(x = weekday), data=news) +
geom_bar(fill=FILL_COLOUR, color=MAIN_COLOUR) +
theme_linedraw() +
xlab("Weekday") +
ylab("Number of Publications") +
labs(title="Numaber of Publications by Weekday")
weekday_shares_summary <- ggplot(aes(x=weekday, y = shares, color=weekday), data=news) +
geom_boxplot(colour=MAIN_COLOUR) +
ylab("Number of Shares") +
xlab("Weekday") +
labs(title="Numaber of Publications by Weekday")
grid.arrange(weekday_counts_summary, weekday_shares_summary, nrow=2)
```
Another thing I found interesting is that there are significantly fewer news published on weekend, but these news are shared the most - especially on Saturdays. Maybe when the news article is shared on weekend it spreads faster because people have more free time to actually read (rather than just look at the title).
## Plot Three: Top News Are Business Longreads with Images
```{r echo=FALSE, Plot_Three}
shares_by_topic <- ggplot(aes(x=topic, y=shares, color=topic),
data=subset(news, text_length=="LongRead")) +
geom_boxplot() +
scale_y_log10() +
facet_wrap(~text_length) +
theme_linedraw() +
theme(axis.text.x=element_text(size=8, angle=45, hjust=1)) +
xlab("Topic") +
ylab("Number of Shares") +
labs(title="Number of Shares by Topic\nfor Long Reads\n") +
guides(color=FALSE)
bus_shares_by_img <- ggplot(aes(x = contains_images, y=shares, color=contains_images),
data=subset(news,
news$topic=="Topic 1" & text_length=="LongRead")) +
geom_boxplot() +
scale_y_log10() +
facet_wrap(~text_length) +
theme_linedraw() +
theme(axis.text.x=element_text(size=8, angle=45, hjust=1)) +
xlab("Contains Images?") +
ylab("Number of Shares") +
labs(title="Number of Shares for\nBusiness LongReads:\nWith vs Without Images") +
guides(color= FALSE)
grid.arrange(shares_by_topic, bus_shares_by_img, ncol=2)
```
This is maybe the strongest relationship I was able to find. Compared to other groups, business long reads show the best performance (we've identified that Topic 1 is related to Business). Also, if we look closer at business long reads that contain and don't contain images, we will see that the difference is quite dramatic, and this is the only category where this pattern is so explicit.
------
# Reflection
During this exploration, I realised one very important thing. We all know about the GIGO principle. I think it also applies to data in most cases. And this refers not only to the way the data was collected and the way the measurements were taken. It's also about collecting the right data. Based on the exploration we've done it seems like it doesn't make much sense to predict news popularity based on numeric characteristics, even if they measure some semantic features (although very simple) as well.
It doesn't matter how positive and how good the text is - if it's not interesting, it will not get popular. Also this exploration shows how important it is to be able to look at the actual data and if possible - at the source of the data. It will help to detect potential errors in the dataset - like when we discovered that zero-length texts just were not processed correctly.