# This is the Group 9 Project Report ipynb File!

The data set we will be using is the Online News Popularity Dataset from the UCI
Machine Learning Repository. Subsequent sources can be found through the link on the UCI web page.
(Link: https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity)

## Introduction: 

It is the responsibility of journalists and editors to understand which features of an article grab the
reader’s attention in order to allocate resources effectively, improve the reading experience, and increase
sales. Multiple factors influence the popularity of an article, including contemporaneity, writing quality,
and other latent elements (Keneshloo, 2016). To investigate the properties of articles that gain shares,
we are interested in determining the features that influence the popularity of an article. The question
we are trying to answer is “how many shares will an article with specific features (i.e., a particular
word count, number of images, etc.) generate?” We aim to predict whether the number of shares on social
networks, and therefore the popularity, is influenced by these predictors, and to determine which features
are common among articles with a higher share rate.

The data set we chose for our project summarizes heterogeneous features of articles published by a digital
media platform named “Mashable” over two years. Mashable is a multi-platform media and entertainment
company that focuses on publishing news about entertainment, tech, culture, and science (Mashable, 2022).
The dataset has 61 attributes in total: 58 predictive attributes, 2 non-predictive attributes, and 1
target field. To predict which content-based factors affect the number of shares, we filtered the columns
down to those most relevant to our prediction. The features that we are interested in are the 8 listed below:

- n_tokens_title: Number of words in the title
- n_tokens_content: Number of words in the content
- num_hrefs: Number of links
- num_imgs: Number of images
- average_token_length: Average length of the words in the content
- rate_positive_words: Rate of positive words among non-neutral tokens
- rate_negative_words: Rate of negative words among non-neutral tokens
- shares: Number of shares (target)

With this filtered data, we anticipate seeing a correlation between these features and the number of
shares. For the method of data analysis, we plan to use k-nearest neighbors (KNN) regression because the
value we are predicting ("shares") is quantitative. With a continuous outcome, regression is an appropriate
method because it predicts a numeric quantity rather than a category. Further, we plan to visually represent the data on graphs (scatter plot, bar graph, and line graph) for easier comparison.
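
To make the idea behind KNN regression concrete, the sketch below predicts a value as the unweighted mean of the outcomes of the k closest training points. It is a hand-rolled illustration with a single predictor, not the `kknn` implementation used later, and the function name `knn_predict` and the toy word counts and share counts are made up for this example.

```r
# Hand-rolled KNN regression for illustration only (not the kknn implementation).
knn_predict <- function(train_x, train_y, new_x, k) {
  # Distance from the new point to every training point (one predictor).
  distances <- abs(train_x - new_x)
  # Indices of the k closest training points.
  nearest <- order(distances)[seq_len(k)]
  # Prediction = unweighted mean of their outcomes ("rectangular" weighting).
  mean(train_y[nearest])
}

# Made-up word counts and share counts, purely hypothetical.
toy_words  <- c(200, 400, 600, 800, 1000, 1200)
toy_shares <- c(900, 1100, 1500, 1300, 1000, 800)

# The three nearest word counts to 650 are 600, 800, and 400.
toy_prediction <- knn_predict(toy_words, toy_shares, new_x = 650, k = 3)
toy_prediction
```

With several predictors the distance would be computed across all of them, which is why predictors are usually put on a common scale before fitting.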


## Summary of what we Found: 

Broadly speaking, upon analysis of our KNN regression model (and its predictions against the test dataset), no clear correlations were observed between any of the selected predictors and the number of shares of each plotted article. Our model did not predict any positive or negative trend for increasing values of any of the selected predictors, nor did the test values it was compared against demonstrate such trends.

The 7 aforementioned selected predictors central to the project were the number of images, number of words in the content, number of words in the title, number of links, average word length, and positivity and negativity rates. None of these demonstrated a strong correlation that would indicate a direct impact on the number of shares. What was observed, however, was a certain range of each predictor within which the majority of articles with a high number of overall shares fit.

A higher number of images contained in an article was found to not necessarily lead to a higher number of article shares. In fact, many articles with no images still had a large number of shares (3000+). 

No significant correlation is observed between the word count of an article’s content and the number of shares that those articles generate, but generally speaking, there are more articles with 1500 or fewer total words that generate 3000+ shares.

No correlation is observed between the number of links and shares. However, there is a greater density of articles with 3000+ shares among articles with 10 or fewer embedded links.

There appears to be no correlation between the number of images in an article and the number of shares that the article generates. However, highly shared articles are more densely concentrated among articles with 2 or fewer images.

Moreover, there is no correlation between the average word length in an article and the number of article shares, but broadly speaking, there is a greater density of articles with 3000+ shares among articles with an average word length of 4.5-5.0.

Finally, the relationships between positivity and negativity rates and the number of article shares do not demonstrate a clear trend. However, it is observed that most articles have positive words at a rate of more than 0.5, and negative words at a rate of less than 0.5.
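
One way to check whether a range such as "1500 or fewer words" really favours highly shared articles, rather than simply containing most of the articles, is to compare the proportion of 3000+ share articles within each range. The sketch below does this on synthetic data; the data frame `toy` and its values are simulated stand-ins, not the Mashable data, and the 500-word bin width is an arbitrary choice for illustration.

```r
set.seed(9)
# Synthetic stand-in data; NOT the Mashable data.
toy <- data.frame(
  n_tokens_content = round(runif(500, min = 50, max = 3000)),
  shares = round(rexp(500, rate = 1 / 1500)) + 1
)

# Bin word count into 500-word intervals.
toy$word_bin <- cut(toy$n_tokens_content, breaks = seq(0, 3000, by = 500))

# Flag highly shared articles, using the 3000-share cutoff from the text.
toy$high_share <- toy$shares >= 3000

# Proportion of highly shared articles within each bin.
high_share_rate <- tapply(toy$high_share, toy$word_bin, mean)

# Rank correlation as a single-number summary of any monotone trend.
rank_cor <- cor(toy$n_tokens_content, toy$shares, method = "spearman")

high_share_rate
rank_cor
```

If the proportions are similar across bins, a higher count of highly shared articles in one range mostly reflects that more articles fall in that range.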

## What we expected to Find:

These results do not agree with what we initially expected. After reviewing the predictors present in the selected dataset, we hypothesized possible relationships between the predictors (specified elements of articles) and the number of article shares. Specifically, we expected that articles of higher popularity (i.e., those with a greater number of shares) would have a higher number of images and videos but would not be excessive in article word count, number of words in the title, or average word length (a relative middle ground in each of these areas was expected to maximize shares).

Most of these predictors (and their expected impact on article popularity, for which the number of shares serves as a proxy) were considered through the lens of viewer engagement: more images and videos and a manageable overall word count would be expected to draw viewers in and maintain their attention, leaving them more likely to share a given article with their acquaintances.

With regard to positivity and negativity rates, we initially expected that articles with higher positivity rates would have more total shares and that articles with higher negativity rates would have fewer, owing to the feelings induced in those reading articles with words associated with positive versus negative emotions.

## Impacts of what we found

These findings are significant in that one can pinpoint ranges of values for the selected predictors in which an article is more likely to gain a significant number of shares, thereby allowing an article to be shaped in a way that optimizes its popularity. Specifically, while the predictors are not associated with any positive trend in the number of shares, there may be instances in which there is a peak in the middle of the data, a “middle ground” so to speak. In this way, we may identify ranges of average word length, number of images and links, and other predictors that more often than not coincide with a high number of shares, and use this to inform the way in which articles are written. For instance, articles with fewer than 1500 words and a title that is 5-15 words in length are predicted to have a very high number of overall shares.

## Future questions

In order to provide useful recommendations for maximizing shares in the future, we will conduct further research on other factors that might affect the number of shares. Given that the predictors we examined did not reveal any clear relationships, there are two follow-up considerations we intend to examine.

The first concerns questions that would help us ascertain whether other article elements have a clear correlation with the number of article shares. Are articles published on weekends likely to receive more shares than those published on weekdays? Do websites with more page views generate more shares? Which subjects lead to more shares? Does the weather affect the number of shares? Do different regions or countries affect the number of shares? Do different age groups affect the number of shares?

The second future consideration concerns the data we use to make predictions. It is entirely possible that our dataset, if expanded, could provide a better representation of possible relationships between article elements and the number of shares, in a manner that is not as susceptible to outliers in the dataset.



The following cell loads the packages that will be used for this project:

In [None]:
library(tidyverse)
library(testthat)
library(digest)
library(repr)
library(tidymodels)
library(GGally)
library(ISLR)
library(infer)
library(cowplot)

This cell imports our data from a link. The link points to a public GitHub repository containing the data, because the direct link on the UCI website gives us a .zip file.

In [None]:
url <- "https://raw.githubusercontent.com/EPICxFLIPPER/data/main/OnlineNewsPopularity.csv"
popularity <- read_csv(url) 

head(popularity)

**Table 1:** Shows the data set loaded in without any changes.

The next cell wrangles our data. The data was already quite tidy; only a few null values had to be removed. We also selected only our predictors and the shares column (the value being predicted). We selected these predictors because they can be easily controlled by a person writing an article. Our exploratory analysis revealed quite a few outlier points, which have been removed. Furthermore, values of 0 that don't make sense, such as a word count of 0, were removed.

In [None]:
set.seed(9)
# Wrangling
popularity_tidy <- popularity |>
  # Remove the two non-predictive columns (url and timedelta).
  select(-url, -timedelta) |>
  # Keep the selected predictors and the target by dropping every other column.
  select(n_tokens_title:shares, -num_videos, -n_non_stop_words, -n_non_stop_unique_tokens,
         -data_channel_is_lifestyle:-is_weekend, -LDA_00:-LDA_04, -num_self_hrefs,
         -num_keywords:-global_rate_negative_words,
         -avg_positive_polarity:-abs_title_sentiment_polarity, -n_unique_tokens) |>
  # Remove outliers: keep only articles with fewer than 5000 shares.
  filter(shares < 5000) |>
  arrange(desc(shares)) |>
  drop_na(n_tokens_content:shares) |>
  # Remove values where 0 doesn't make sense.
  filter(n_tokens_title > 0) |>
  filter(n_tokens_content > 0) |>
  filter(average_token_length > 0) |>
  filter(shares > 0)



head(popularity_tidy)

**Table 2:** Shows the wrangled data.

In the following cell we take a random sample of our data to explore, because working with the full data set caused the R kernel to crash. A random sample of sufficient size should be representative enough of the population to allow for continued exploration of the data.
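
As a rough illustration of why a sample of this size can stand in for the full data, the sketch below draws 5000 values without replacement from a synthetic, right-skewed population and compares summary statistics. The population here is simulated with `rexp()` and is only a hypothetical stand-in for the share counts.

```r
set.seed(9)
# Synthetic right-skewed "population" of share counts (hypothetical values).
population_shares <- round(rexp(30000, rate = 1 / 1500)) + 1

# Draw 5000 values without replacement, mirroring the sample size used here.
sample_shares <- sample(population_shares, size = 5000, replace = FALSE)

# Compare summary statistics of the sample with the full population.
comparison <- rbind(
  population = c(mean = mean(population_shares), median = median(population_shares)),
  sample     = c(mean = mean(sample_shares), median = median(sample_shares))
)
comparison
```

The same comparison could be run on the real columns to confirm that the sample and the full wrangled data have similar centres.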

In [None]:
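set.seed(9)
# rep_sample_n() adds a grouping column `replicate`; drop it so `shares ~ .` ignores it.
popularity_sample <- rep_sample_n(popularity_tidy, 5000, replace = FALSE) |>
  ungroup() |> select(-replicate)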

In the next cell we split the data into training and testing sets, with 80% in the training set, stratified by shares.
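
For a numeric outcome such as shares, stratifying means binning the outcome first and then sampling within each bin, so that the training and testing sets cover low and high share counts in similar proportions. The sketch below illustrates that idea in base R with quartile bins on synthetic data; it is an approximation of the concept, not the `rsample` implementation, and the data frame `toy` is simulated.

```r
set.seed(9)
# Synthetic outcome values; hypothetical, for illustration only.
toy <- data.frame(id = 1:1000, shares = round(rexp(1000, rate = 1 / 1500)) + 1)

# Bin the numeric outcome into quartiles to act as strata.
quartile_breaks <- quantile(toy$shares, probs = seq(0, 1, by = 0.25))
toy$stratum <- cut(toy$shares, breaks = quartile_breaks, include.lowest = TRUE)

# Within each stratum, pick 80% of the row indices for training.
rows_by_stratum <- split(seq_len(nrow(toy)), toy$stratum)
train_rows <- unlist(lapply(rows_by_stratum, function(rows) {
  sample(rows, size = floor(0.8 * length(rows)))
}))

toy_train <- toy[train_rows, ]
toy_test  <- toy[-train_rows, ]

# Sizes of the two sets.
c(train = nrow(toy_train), test = nrow(toy_test))
```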

In [None]:
set.seed(9)
# Splitting the data into training and testing data
# Strata = shares (the variable being predicted)
popularity_split <- initial_split(popularity_sample, prop = .80, strata = shares)
popularity_train <- training(popularity_split)
popularity_test <- testing(popularity_split)


In [None]:
set.seed(9)
initial_table <- popularity_train |>
map_df(mean,na.rm =TRUE)
initial_table


**Table 3:** This table shows the mean of each predictor variable and of shares in the training set.

In [None]:
set.seed(9)
# Below are the distribution of shares and scatter plots of shares against each predictor.
# Though some of the predictors are less variable, we believe they will still be good predictors of shares.

options(repr.plot.width = 20, repr.plot.height = 7)

shares_plot <- popularity_train |>
ggplot(aes(x = shares)) + geom_histogram(binwidth = 100) + xlim(0, 5000) + labs(x = "Shares", y = "Frequency") +
ggtitle("Distribution of Shares (Fig 1.0)") + theme(text = element_text(size = 20))


shares_vs_imgs_plot <- popularity_train |>
ggplot(aes(x = num_imgs , y = shares)) + geom_point(alpha = .3) + xlab("Images") + ylab("Shares") +
ggtitle("Shares vs Images (Fig 1.4)") +
theme(text = element_text(size = 20))

shares_vs_word_count_plot <- popularity_train |>
ggplot(aes(x = n_tokens_content , y = shares)) + geom_point(alpha = .3) + xlab("Word Count") + ylab("Shares") +
ggtitle("Shares vs Word Count (Fig 1.2)") +
theme(text = element_text(size = 20))

shares_vs_token_length_plot <- popularity_train |>
ggplot(aes(x = average_token_length , y = shares)) + geom_point(alpha = .3) + xlab("Average Word Length") + ylab("Shares") +
ggtitle("Shares vs Avg Word Length (Fig 1.5)") +
theme(text = element_text(size = 20))

shares_vs_title_plot <- popularity_train |>
ggplot(aes(x = n_tokens_title , y = shares)) + geom_point(alpha = .3) + xlab("Words in Title") + ylab("Shares") +
ggtitle("Shares vs Words in Title (Fig 1.1)") +
theme(text = element_text(size = 20))

shares_vs_links_plot <- popularity_train |>
ggplot(aes(x = num_hrefs , y = shares)) + geom_point(alpha = .3) + xlab("Links") + ylab("Shares") +
ggtitle("Shares vs Links (Fig 1.3)") +
theme(text = element_text(size = 20))

shares_vs_positive_plot <- popularity_train |>
ggplot(aes(x = rate_positive_words , y = shares)) + geom_point(alpha = .3) + xlab("Positive words rate") + ylab("Shares") +
ggtitle("Shares vs Positivity (Fig 1.6)") +
theme(text = element_text(size = 20))

shares_vs_negative_plot <- popularity_train |>
ggplot(aes(x = rate_negative_words , y = shares)) + geom_point(alpha = .3) + xlab("Negative words rate") + ylab("Shares") +
ggtitle("Shares vs Negativity (Fig 1.7)") +
theme(text = element_text(size = 20))

plots <- plot_grid(
    shares_vs_title_plot,
    shares_vs_word_count_plot,
    shares_vs_links_plot,
    shares_vs_imgs_plot,
    shares_vs_token_length_plot,
    shares_vs_positive_plot,
    shares_vs_negative_plot,
ncol = 3) 
shares_plot
plots

- **Fig1.0:** Distribution of Shares
- **Fig1.1:** Shares vs Words in Title
- **Fig1.2:** Shares vs Word Count
- **Fig1.3:** Shares vs Links
- **Fig1.4:** Shares vs Images
- **Fig1.5:** Shares vs Word Length
- **Fig1.6:** Shares vs Positivity Rate
- **Fig1.7:** Shares vs Negativity Rate

The following cell performs cross-validation. From our exploratory analysis we can see that none of the relationships in Figures 1.1-1.7 are particularly linear. Because of this, we will be using a KNN regression model, and we need to find the best number of neighbors to predict with. We will be using 5-fold cross-validation, shown below. Note that we test only 40-80 neighbors because, through this process, we found that the best-performing number of neighbors lies in this range. Reducing the range saves running time for the viewer of this project.
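
The logic of choosing the number of neighbors by 5-fold cross-validation can be sketched by hand: each fold is held out in turn, the model is fit on the other four, and the validation RMSE is averaged per candidate value. The example below uses synthetic one-predictor data and a hand-rolled `knn_predict` helper, both of which are assumptions for illustration; it is not the tidymodels tuning code used in this project.

```r
set.seed(9)
# Synthetic data with one predictor; hypothetical, not the article data.
n <- 200
toy <- data.frame(x = runif(n, 0, 10))
toy$y <- sin(toy$x) + rnorm(n, sd = 0.3)

# Hand-rolled KNN regression: mean outcome of the k nearest training points.
knn_predict <- function(train, new_x, k) {
  sapply(new_x, function(x0) {
    nearest <- order(abs(train$x - x0))[seq_len(k)]
    mean(train$y[nearest])
  })
}

# Assign each row to one of 5 folds at random.
folds <- sample(rep(1:5, length.out = n))

# For each candidate k, average the validation RMSE over the 5 folds.
candidate_k <- c(1, 5, 10, 20, 40)
cv_rmse <- sapply(candidate_k, function(k) {
  mean(sapply(1:5, function(f) {
    train <- toy[folds != f, ]
    valid <- toy[folds == f, ]
    pred <- knn_predict(train, valid$x, k)
    sqrt(mean((valid$y - pred)^2))
  }))
})

# The candidate with the lowest mean RMSE is selected.
best_k <- candidate_k[which.min(cv_rmse)]
data.frame(neighbors = candidate_k, mean_rmse = cv_rmse)
best_k
```

With a single predictor no scaling is needed; with several predictors, as in this project, scaling and centering keep any one predictor from dominating the distance.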

In [None]:
# Locating the best number of neighbors to use
set.seed(9)
popularity_spec <- nearest_neighbor(weight_func = "rectangular" , neighbors = tune()) |> 
set_engine("kknn") |> 
set_mode("regression")

popularity_recipe <- recipe(shares~., data = popularity_train) |> 
step_scale(all_predictors()) |> 
step_center(all_predictors())

gridvals <- tibble(neighbors = seq(from = 40, to = 80, by = 1))
popularity_vfold <- vfold_cv(data = popularity_train, v = 5, strata = shares)

popularity_fit <- workflow() |>
add_recipe(popularity_recipe) |>
add_model(popularity_spec) |>
tune_grid(resamples = popularity_vfold, grid = gridvals) |>
collect_metrics() |>
filter(.metric == "rmse") |>
arrange(mean) |>
select(neighbors) |>
slice(1) |>
pull()



popularity_fit
                   


Next, we will fit our model on the training data with the number of neighbors found above and compare our predictions to the testing set.

In [None]:
set.seed(9)
popularity_spec_2 <- nearest_neighbor(weight_func = "rectangular", neighbors = popularity_fit) |> 
set_engine("kknn") |> 
set_mode("regression")

popularity_fit_2 <- workflow() |> 
add_recipe(popularity_recipe) |>
add_model(popularity_spec_2) |>
fit(data = popularity_train)

popularity_predictions <- predict(popularity_fit_2, popularity_test) |> 
bind_cols(popularity_test)

head(popularity_predictions)

**Table 4:** Shows the testing data with the prediction column.
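
One way to summarize how close the predictions are to the observed shares is the root mean squared error (RMSE), compared against a baseline that always predicts the mean. The sketch below uses made-up observed and predicted values; the helper `rmse` and the toy vectors are hypothetical and are not results from this project.

```r
# Root mean squared error between observed and predicted values.
rmse <- function(truth, estimate) {
  sqrt(mean((truth - estimate)^2))
}

# Made-up observed and predicted share counts (hypothetical).
toy_truth <- c(1000, 1500, 2000, 800)
toy_pred  <- c(1200, 1400, 1700, 1000)

toy_rmse <- rmse(toy_truth, toy_pred)

# Baseline for comparison: always predict the mean of the observed values.
baseline_rmse <- rmse(toy_truth, mean(toy_truth))

c(model = toy_rmse, baseline = baseline_rmse)
```

In the tidymodels workflow, a comparable summary should be available by passing the prediction table to `metrics()` from yardstick with `truth = shares` and `estimate = .pred`. A model RMSE close to the baseline would be consistent with the predictors carrying little information about shares.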

Below we revisit our predictor vs. shares plots, this time with the predicted shares overlaid onto the graphs. The graphs correspond to the following figure labels:

- Fig2.0: Shares vs Words in Title
- Fig2.1: Shares vs Word Count
- Fig2.2: Shares vs Links
- Fig2.3: Shares vs Images
- Fig2.4: Shares vs Word Length
- Fig2.5: Shares vs Positivity Rate
- Fig2.6: Shares vs Negativity Rate

The plots have been intentionally left fairly large to better show the predictions, which are represented by the blue line on each graph.

In [None]:
set.seed(9)
options(repr.plot.width = 12 , repr.plot.height =7)
plot_1 <- popularity_train |> 
ggplot(aes(x = n_tokens_content , y = shares )) + geom_point(alpha = .5) +
geom_line(data = popularity_predictions , aes(x = n_tokens_content , y = .pred), color = "blue") +
labs(x = "Word Count in Content" , y = "Shares", color = "Predictions")+
ggtitle("Shares vs Word Count (W/Predictions)Fig2.1") + theme(text = element_text(size = 20)) + xlim(0,3000)


plot_2 <- popularity_train |> 
ggplot(aes(x = n_tokens_title , y = shares )) + geom_point(alpha = .5) +
geom_line(data = popularity_predictions , aes(x = n_tokens_title , y = .pred), color = "blue") + xlim(0,20) +
theme(text = element_text(size = 20)) + labs(x = "Words in Title" , y = "Shares", color = "Predictions") +
ggtitle("Shares vs Words in Title (W/Predictions) Fig2.0")


plot_3 <- popularity_train |> 
ggplot(aes(x = rate_positive_words , y = shares )) + geom_point(alpha = .5) +
geom_line(data = popularity_predictions , aes(x = rate_positive_words , y = .pred), color = "blue") + xlim(0,1) +
theme(text = element_text(size = 20)) + labs(x = "Positivity Rate" , y = "Shares" , color = "Predictions")+
ggtitle("Shares vs Positivity (W/Predictions)Fig2.5")


plot_4 <- popularity_train |> 
ggplot(aes(x = average_token_length , y = shares )) + geom_point(alpha = .5) +
geom_line(data = popularity_predictions , aes(x = average_token_length , y = .pred), color = "blue") + xlim(4,5.5) +
theme(text = element_text(size = 20))+labs(x = "Average Word Length" , y = "Shares" , color = "Predictions")+
ggtitle("Shares vs Word Length (W/Predictions)Fig2.4")


plot_5 <- popularity_train |> 
ggplot(aes(x = num_imgs , y = shares )) + geom_point(alpha = .5) +
geom_line(data = popularity_predictions , aes(x = num_imgs , y = .pred), color = "blue") + xlim(0,20) +
theme(text = element_text(size = 20))+labs(x = "Images" , y = "Shares" , color = "Predictions")+
ggtitle("Shares vs Images (W/Predictions)Fig2.3")


plot_6 <- popularity_train |> 
ggplot(aes(x = num_hrefs , y = shares )) + geom_point(alpha = .5) +
geom_line(data = popularity_predictions , aes(x = num_hrefs , y = .pred), color = "blue") + xlim(0,20) +
theme(text = element_text(size = 20))+labs(x = "Links" , y = "Shares" , color = "Predictions")+
ggtitle("Shares vs Links (W/Predictions)Fig2.2")


plot_7 <- popularity_train |> 
ggplot(aes(x = rate_negative_words , y = shares )) + geom_point(alpha = .5) +
geom_line(data = popularity_predictions , aes(x = rate_negative_words , y = .pred), color = "blue") + xlim(0,1) +
theme(text = element_text(size = 20)) + labs(x = "Negativity Rate" , y = "Shares" , color = "Predictions")+
ggtitle("Shares vs Negativity (W/Predictions)Fig2.6")

plot_2
plot_1
plot_6
plot_5
plot_4
plot_3
plot_7


