# The News Trend Predictor - Research & Development of News Trend Prediction Strategies & Methodologies

#### Student Names & IDs:
- Liam Fitzmaurice - S.N. 14027149
- Shance Zhao (Alex) - S.N. 24013122
- Zhuonan Mai (Miranda) - S.N. 19044660


# Introduction
Group 2 presents the development of The News Trend Predictor app - a web app targeted at journalists, social media personalities/marketers, and anyone wishing to know the likely relevancy of a news item in the near future, based on its recent trajectory.

The News Trend Predictor’s target prediction variable is the Google Trend of the given news text string. This is a self-relative measure of popularity, normalised to a 0-100 range, based on the number of Google searches containing the string. Using it as the target variable is directly relevant to the target users, as it is useful data for developing search engine optimisation strategies when publishing news articles, writing social media posts, or generating any content that relies on algorithmic content discovery to reach viewers.
This report documents our journey through the data wrangling, analysis, predictive modelling, and app development processes, the successes and roadblocks we faced, and our resulting key findings.
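
To make the "self-relative, 0-100" idea concrete, the sketch below rescales a window of raw counts so that the busiest day scores 100. It is a simplified illustration only: the function name and the raw counts are hypothetical, and Google's own calculation is more involved than a single division.

```python
import pandas as pd


def normalise_to_trend_scale(search_counts: pd.Series) -> pd.Series:
    """Rescale raw counts so the busiest day in the window scores 100."""
    peak = search_counts.max()
    if peak == 0:
        # No searches at all in the window: every day scores 0.
        return search_counts * 0
    return (search_counts / peak * 100).round().astype(int)


# Hypothetical daily search counts for one news item (not real data).
raw = pd.Series(
    [1940, 2000, 960, 880, 840],
    index=pd.date_range("2024-05-01", periods=5),
)
trend_like = normalise_to_trend_scale(raw)
print(trend_like)
```

Because the scale is relative to the peak inside the chosen window, the same day can receive a different score when the window changes.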


#### Datasets used:
1. Google Search Trend data for each respective news item analysed.
2. Custom-generated calendar dataset.
3. Custom-generated dataset of each respective news item’s most popular keyword-related YouTube videos during the given timeframes.


#### Dataset sources: 
1. Trend data is scraped from https://trends.google.com/trends/ using the Python library “PyTrends”.
2. The calendar dataset was internally created by Group 4.
3. Data is fetched from the YouTube API through community-donated API keys at https://yt.lemnoslife.com/


### Research Questions
1. Are prediction algorithms effective in predicting the popularity trend of individual news items?
2. How do the quality and source of data impact the reliability of news trend forecasts?
3. What is the relationship between the data and different categories/types of news (e.g. short-term, medium-term, and long-term news items)?
4. Which machine learning algorithms are most effective and appropriate for this type of prediction/forecast?


### Executive Summary
The News Trend Predictor app forecasts the relevancy of news items by leveraging Google Trend data, normalised to a 0-100 scale. Aimed at journalists and content creators, the app uses historical data from YouTube and Google Trends to overcome access restrictions on other platforms.
Our research explored the effectiveness of prediction algorithms, data quality impacts, and optimal machine learning methods. Key findings indicate that small datasets and highly correlated features can lead to overfitting and poor predictive performance, underscoring the need for adequate data and efficient feature selection.

#### Modules

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import statistics
import plotly.graph_objs as go
from datetime import datetime
import math
import json
from pathlib import Path
import time
from TrendProcesses import FetchData, CreateFeatures, RunAnalysis, RunModels
import plotly.express as px

import seaborn as sns    
from sklearn.preprocessing import RobustScaler, StandardScaler,MinMaxScaler
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor,KNeighborsClassifier
from sklearn.metrics import mean_absolute_error,mean_squared_error,r2_score,accuracy_score
import statsmodels.formula.api as smf
#from handle_model_util import handle_yt_data, handle_tr_data,normalize_trend
from pylab import rcParams
rcParams['figure.dpi'] = 150
rcParams['lines.linewidth'] = 2
rcParams['axes.facecolor'] = 'white'
rcParams['patch.edgecolor'] = 'white'
rcParams['font.family'] = 'StixGeneral'

# Data Wrangling
At the beginning of our project, we faced significant challenges in our attempts to find and integrate historic hashtag and "keyword mention" data sources. Many websites, including major platforms such as X, Meta (Facebook and Instagram) and TikTok, require application registration, the upload of personal identification, and a clear explanation of the data collection purpose. This process proved to be cumbersome and time-consuming, often resulting in delays and additional administrative burdens. Despite these efforts, access to the desired historical data remained limited, prompting us to explore alternative sources.

To use the readily accessible hashtag and keyword mention data from these sources, we would need to write an automated script that fetches predetermined news item strings from their respective APIs, run it daily, and use the results to build our own historic datasets. Doing this would severely limit app functionality, as the selected news items would need to be known up to 30 days in advance, which is not possible.

Ultimately, we settled on using YouTube and Google Trends data for our project. YouTube provided a readily accessible repository of historical content and keyword mentions, while calendar data offered valuable insights into trends and events over time. This combination allowed us to circumvent the stringent requirements of other platforms and focus on the analysis and integration of accessible and relevant data, thus streamlining our research process and enhancing the quality of our findings.

Liam noted that the data sources are similar to those of a prior project of his, but our usage differs enough to warrant using them, and our results will differ accordingly. Specifically, we use YouTube data purely as predictive features for the Google Trend (not for YouTube views), our gathered data is not limited to a single YouTube account, we add external calendar features, and our project has vastly different research goals.

After testing the data gathering methods below, we decided to limit our scope to 30 days of gathered data, and limit forecast predictions made from that data to a maximum of 1 day.
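
One way to frame the one-day forecast limit is to pair each day's features with the following day's trend value and hold out the most recent rows for testing. The sketch below illustrates that framing only: the function names and the `target_next_day` column are assumptions and are not taken from our TrendProcesses code.

```python
import pandas as pd


def make_one_day_ahead_frame(df: pd.DataFrame, target: str = "trend") -> pd.DataFrame:
    """Pair each day's features with the next day's target value."""
    out = df.copy()
    out["target_next_day"] = out[target].shift(-1)
    # The final day has no "tomorrow" yet, so it cannot be used for training.
    return out.dropna(subset=["target_next_day"])


def chronological_split(df: pd.DataFrame, test_fraction: float = 0.2):
    """Hold out the most recent rows rather than a random sample."""
    n_test = max(1, int(round(len(df) * test_fraction)))
    return df.iloc[:-n_test], df.iloc[-n_test:]


# Small demo frame with rounded values; column names follow the report.
demo = pd.DataFrame(
    {
        "trend": [48.0, 44.0, 42.0, 46.0, 59.0],
        "views": [26737.7, 40893.3, 6026.7, 416213.0, 288151.5],
    },
    index=pd.to_datetime(
        ["2024-05-03", "2024-05-04", "2024-05-05", "2024-05-06", "2024-05-07"]
    ),
)
supervised = make_one_day_ahead_frame(demo)
train, test = chronological_split(supervised)
print(train)
print(test)
```

Splitting by time keeps information from later days out of the training rows, which a random split would not guarantee.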

#### Data Gathering
For each news item, the following steps are taken:
1. Google Trends data is fetched for the news item string (1 API call).
2. 30 YouTube API calls are made, each gathering the top 1-3 videos relevant to the news item string (30 API calls).
3. 1 API call is made for each video ID fetched, to gather video statistics (30-90 API calls).
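
The call counts in the steps above can be totalled per news item. The arithmetic below is a small sketch using only the figures listed; the constant and function names are illustrative.

```python
DAYS = 30               # days of history gathered per news item
VIDEOS_PER_DAY = (1, 3)  # fewest and most videos returned per search call


def api_calls_per_news_item(days: int = DAYS, videos_per_day=VIDEOS_PER_DAY):
    """Return the (lowest, highest) number of API calls for one news item."""
    trend_calls = 1        # one Google Trends request
    search_calls = days    # one YouTube search request per day
    fewest, most = videos_per_day
    # One statistics request per video ID found.
    stats_low, stats_high = days * fewest, days * most
    base = trend_calls + search_calls
    return base + stats_low, base + stats_high


low, high = api_calls_per_news_item()
print(f"Between {low} and {high} API calls per news item")
```

With the figures above this gives between 61 and 121 calls per news item.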

The code below is commented out and replaced with a raw CSV data read. The data is for the news item "palistine" and is aligned later:

In [5]:


#data_fetcher = FetchData(raw_data_loader)
#trend, yt_data = data_fetcher.fetch_and_return_final_df_list("palistine")


all_trends = pd.read_csv('./data/test_trends_data.csv')
trend = all_trends.loc[:, ["date", all_trends.columns[1]]]
yt_data = pd.read_csv('./data/test_yt_data.csv')

print("Trend data:")
print(trend.head())
print('')
print("YouTube Initial data:")
print(yt_data.head())

Trend data:
         date  palistine
0  2024-05-01         97
1  2024-05-02        100
2  2024-05-03         48
3  2024-05-04         44
4  2024-05-05         42

YouTube Initial data:
         date  views  likes  comments
0  2024-05-03  17894   3611        99
1  2024-05-03  18462   1718       104
2  2024-05-03  43857  10830       273
3  2024-05-04  90220    730       794
4  2024-05-04  14960   1418        75


#### Data Preparation & Feature Engineering
Preparing data is not an easy task. Initially, our goal was to acquire historical data from major social media platforms to diversify our dataset.
However, during the data acquisition process, we encountered various restrictions imposed by large platforms such as Twitter, Meta, and TikTok. Consequently, we decided to focus solely on data from YouTube and Google Trends. This shift allowed us to streamline our data collection efforts and enhance the quality and accessibility of our data.

The code below triggers the data alignment and feature creation methods in our TrendProcesses module.
Data is grouped by date, and the mean values of each initial column are used.

The features included and created are:
- Video views on the given day.
- Video likes on the given day.
- Video comments on the given day.
- How many days old the video data is - calculated from the last .max() date in the data set, to avoid misrepresenting age when the script is run on a later date.
- Views, likes, and comments per day.
- The daily likes to views ratio, and the daily comments to views ratio.
- The trend to daily views ratio, and the trend to daily likes ratio.
- The difference between a given day's daily views and the prior day's daily views ("diff_daily_views"), and the same difference for likes ("diff_daily_likes") and comments ("diff_daily_comments").
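
The feature list above can be sketched in pandas as follows. This is an approximation for illustration, not the CreateFeatures implementation: it assumes "per day" means the value divided by the video's age in days, clips that age to at least 1 so the newest date does not divide by zero, and fills the first day's difference with 0.

```python
import pandas as pd


def build_features(trend: pd.DataFrame, yt: pd.DataFrame) -> pd.DataFrame:
    """Align trend and YouTube data by date and derive the listed features."""
    yt = yt.copy()
    yt["date"] = pd.to_datetime(yt["date"])
    # Several videos can share a date, so take the mean per date.
    daily = yt.groupby("date")[["views", "likes", "comments"]].mean()

    tr = trend.copy()
    tr["date"] = pd.to_datetime(tr["date"])
    tr = tr.set_index("date")
    tr.columns = ["trend"]  # the single keyword column becomes "trend"
    df = daily.join(tr, how="inner")

    # Age is measured from the newest date in the data, not from today.
    df["days_old"] = (df.index.max() - df.index).days
    age = df["days_old"].clip(lower=1)  # assumption: avoid dividing by zero

    for col in ["views", "likes", "comments"]:
        df[f"daily_{col}"] = df[col] / age

    df["daily_likes_to_views_ratio"] = df["daily_likes"] / df["daily_views"]
    df["daily_comments_to_views_ratio"] = df["daily_comments"] / df["daily_views"]
    df["trend_to_daily_views_ratio"] = df["trend"] / df["daily_views"]
    df["trend_to_daily_likes_ratio"] = df["trend"] / df["daily_likes"]

    for col in ["views", "likes", "comments"]:
        # Assumption: the first day has no prior day, so its difference is 0.
        df[f"diff_daily_{col}"] = df[f"daily_{col}"].diff().fillna(0)
    return df


# Only the first rows printed earlier in the report are used here.
trend_demo = pd.DataFrame({
    "date": ["2024-05-01", "2024-05-02", "2024-05-03", "2024-05-04", "2024-05-05"],
    "palistine": [97, 100, 48, 44, 42],
})
yt_demo = pd.DataFrame({
    "date": ["2024-05-03", "2024-05-03", "2024-05-03", "2024-05-04", "2024-05-04"],
    "views": [17894, 18462, 43857, 90220, 14960],
    "likes": [3611, 1718, 10830, 730, 1418],
    "comments": [99, 104, 273, 794, 75],
})
features = build_features(trend_demo, yt_demo)
print(features.T)
```

The inner join keeps only the dates present in both sources, which is why the first two trend dates drop out of the demo output.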

Additionally, a "calendar" dataset is generated, which numerically labels the day of the week each date falls on ("day_number", with 0 being Monday and 6 being Sunday) and flags whether the date is an American federal public holiday ("is_holiday"). We chose American public holidays because we estimated that most YouTube news-related videos target a Western audience, and the USA is the largest audience sharing a single set of public holidays. Other Western populations are smaller and either observe holidays on different days or overlap with the American ones (Christmas etc.).
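
A calendar table of this shape can be generated with pandas' built-in US federal holiday calendar. The sketch below is one possible approach under that assumption; the `build_calendar` name and the 0/1 encoding of "is_holiday" are illustrative and may differ from our generated dataset.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar


def build_calendar(start: str, end: str) -> pd.DataFrame:
    """Label each date with its weekday number and a US federal holiday flag."""
    dates = pd.date_range(start, end, freq="D")
    holidays = USFederalHolidayCalendar().holidays(start=start, end=end)
    calendar = pd.DataFrame(index=dates.rename("date"))
    calendar["day_number"] = dates.dayofweek.to_numpy()  # 0 = Monday, 6 = Sunday
    calendar["is_holiday"] = dates.isin(holidays).astype(int)
    return calendar


calendar = build_calendar("2024-05-03", "2024-06-01")
print(calendar[calendar["is_holiday"] == 1])
```

For the 30-day window shown, the only flagged date is Memorial Day (Monday 27 May 2024).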


Two datasets are returned - both normalised data and non-normalised data. The calendar and trend data are excluded from normalisation.
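
As an illustration of normalising every column except the calendar and trend columns, the sketch below applies min-max scaling with plain pandas. The choice of min-max scaling is an assumption made for the example; the scaler used inside CreateFeatures may differ.

```python
import pandas as pd

# Columns left on their original scale, as described above.
EXCLUDED = ["trend", "day_number", "is_holiday"]


def normalise_features(df: pd.DataFrame, excluded=EXCLUDED) -> pd.DataFrame:
    """Min-max scale every column to [0, 1] except the excluded ones."""
    out = df.copy()
    cols = [c for c in out.columns if c not in excluded]
    col_min = out[cols].min()
    col_range = out[cols].max() - col_min
    # A constant column has zero range; divide by 1 so it maps to 0, not NaN.
    col_range = col_range.replace(0, 1)
    out[cols] = (out[cols] - col_min) / col_range
    return out


# Hypothetical values, chosen so the scaled output is easy to check by eye.
demo_frame = pd.DataFrame({
    "views": [10.0, 20.0, 30.0],
    "likes": [5.0, 5.0, 5.0],
    "trend": [48.0, 44.0, 42.0],
    "day_number": [4, 5, 6],
    "is_holiday": [0, 0, 0],
})
demo_normalised = normalise_features(demo_frame)
print(demo_normalised)
```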


In [6]:
feature_creator = CreateFeatures()
data, data_normalised = feature_creator.create_features(trend, yt_data)
print("Data columns (same in both sets):")
print(data.columns)
print('')
print('')
print("Regular data:")
print(data.head())

print('')
print("Normalised data:")
print(data_normalised.head())

Data columns (same in both sets):
Index(['views', 'likes', 'comments', 'trend', 'days_old', 'daily_views',
       'daily_likes', 'daily_comments', 'daily_likes_to_views_ratio',
       'daily_comments_to_views_ratio', 'trend_to_daily_views_ratio',
       'trend_to_daily_likes_ratio', 'diff_daily_views', 'diff_daily_likes',
       'diff_daily_comments', 'day_number', 'is_holiday'],
      dtype='object')


Regular data:
                    views         likes     comments  trend  days_old  \
date                                                                    
2024-05-03   26737.666667   5386.333333   158.666667   48.0        29   
2024-05-04   40893.333333   1800.333333   318.333333   44.0        28   
2024-05-05    6026.666667    551.666667    12.000000   42.0        27   
2024-05-06  416213.000000  16955.000000  2487.000000   46.0        26   
2024-05-07  288151.500000   9613.000000   827.500000   59.0        25   



# EDA/Data Visualisation


#### Cluster Analysis

#### Correlations
Data correlations are also explored in order to find relevant features for inclusion during predictive modelling. Although a correlation matrix is usually a useful feature selection tool, every news item varies in total "news relevancy" length, so in this domain the matrix is more useful as a diagnostic of whether the resulting prediction model is likely to be accurate than as a strict tool for feature selection.

The primary use is to gauge linearity with the target variable "trend": some correlation is good, but extremely high correlations (or correlations that are all very close to 0) indicate that the model will be substandard.
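
For reference, a per-feature correlation table with "Feature" and "Correlation" columns can be produced with plain pandas as sketched below. This is an assumed equivalent for illustration and is not the body of `RunAnalysis.get_corr_matrix`; the demo values are hypothetical.

```python
import pandas as pd


def correlation_to_target(df: pd.DataFrame, target: str = "trend") -> pd.DataFrame:
    """Return each feature's Pearson correlation with the target, strongest first."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    table = corr.rename("Correlation").rename_axis("Feature").reset_index()
    # Sort by absolute strength so strong negative correlations rank highly too.
    return table.sort_values(
        "Correlation", key=abs, ascending=False, ignore_index=True
    )


# Hypothetical values: views track the trend exactly, comments move against it.
corr_demo = pd.DataFrame({
    "trend": [1.0, 2.0, 3.0, 4.0],
    "comments": [4.0, 3.0, 2.0, 2.0],
    "views": [2.0, 4.0, 6.0, 8.0],
})
corr_table = correlation_to_target(corr_demo)
print(corr_table)
```

A table in this long format can be passed directly to a bar chart with one bar per feature.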

In [7]:
analyser = RunAnalysis()
corr_matrix_norm = analyser.get_corr_matrix(data_normalised)

fig = px.bar(corr_matrix_norm, x='Feature', y='Correlation', title='Feature Correlation to Google Trend')
fig.show()


# Predictive Modelling

NOTES:
kNN, Linear Regression, and Decision Tree models for now.
Each model will have a plot showing the accuracy of predictions against the real trend data.
Only plot the final versions of each model - talk about the iterations of parameters/features used that got you there!


# The Trend Predictor App

NOTES:
I will discuss & show screenshots of:
The nodejs local version vs streamlit version.
The reasoning behind the presentation layout and data/graphs included.

# Conclusion 

The development of The News Trend Predictor app provided valuable insights into the complexities of predictive modelling for news trends. One of the primary challenges we encountered was accessing high-quality historical data due to stringent requirements from platforms like Twitter, Meta, and TikTok. This obstacle led us to pivot towards using YouTube and Google Trends data, which were more accessible and relevant to our needs. This strategic decision allowed us to streamline our data analysis process and improve the overall quality of our findings. Additionally, the server-side integration process underscored the importance of managing Python versions to ensure compatibility, and of managing API calls efficiently to avoid issues like the 429 errors encountered with PyTrends.

Our findings highlighted the critical importance of sufficient data quantities and appropriate feature selection in predictive modelling. Small datasets led to unstable coefficients and poor predictive performance in linear regression models (Wilstrup & Kasak, 2021), while KNN models struggled with classification accuracy due to insufficient neighbouring data points (Wilstrup & Kasak, 2022). Moreover, the presence of highly correlated features increased the risk of overfitting, emphasising the need for regularisation techniques to enhance model generalisation (Ajitesh Kumar, 2024; Genuer et al., 2010). These insights are invaluable for future developments in predictive modelling and content strategy optimisation, providing a solid foundation for further research and practical applications in this domain.
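
The effect of regularisation on highly correlated features can be illustrated with a small synthetic example. The sketch below uses made-up data (two nearly identical features named after ours, about a month of rows) and a closed-form ridge fit without an intercept; it is not one of the models from this report.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30  # roughly one month of daily rows

# Two almost identical synthetic features, mimicking strong collinearity.
views = rng.normal(size=n)
likes = views + rng.normal(scale=0.01, size=n)
X = np.column_stack([views, likes])
y = 3.0 * views + rng.normal(scale=0.5, size=n)


def fit_ols(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Ordinary least squares coefficients (no intercept)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def fit_ridge(X: np.ndarray, y: np.ndarray, alpha: float) -> np.ndarray:
    """Ridge coefficients from the closed form (X'X + alpha*I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


ols_coef = fit_ols(X, y)
ridge_coef = fit_ridge(X, y, alpha=1.0)
print("OLS coefficients:  ", ols_coef)
print("Ridge coefficients:", ridge_coef)
```

With near-duplicate features, least squares is free to assign large offsetting coefficients, while the ridge penalty pulls the coefficient vector towards a smaller, more stable solution.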


#### Main Challenges
Data Collection:
API Issues: Many websites, including major platforms such as Meta and TikTok, require application registration, the upload of personal identification, and a clear explanation of the data collection purpose.

Server-Side Integration:
The server, running JavaScript, triggers Python scripts. The code must first determine the Python version being used by the user. If the versions do not match, the server must align with the user's version to ensure compatibility and detect any issues.

API Call Restrictions:
The imposed limit of 600 API calls can be restrictive. For instance, processing a single video might necessitate 600 API calls. Processing multiple videos will thus exceed the API call limit, necessitating careful management and optimization.
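
One common way to manage rate-limited calls, such as those that return 429 errors, is to retry with exponential backoff. The sketch below is generic: `RateLimitError` is a hypothetical stand-in for whatever exception the client library raises, and the attempt count and delays are illustrative.

```python
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the error a client raises on HTTP 429."""


def call_with_backoff(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(), doubling the wait after each rate-limit error."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of retries: let the caller handle it
            sleep(base_delay * 2 ** attempt)


# Demo: a fake request that is rate limited twice before succeeding.
attempts = {"count": 0}


def flaky_fetch():
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise RateLimitError("429 Too Many Requests")
    return "ok"


waits = []  # record the delays instead of actually sleeping
result = call_with_backoff(flaky_fetch, sleep=waits.append)
print(result, waits)
```

Passing the sleep function as a parameter keeps the demo instant; in real use the default `time.sleep` applies the delays.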

Modelling Considerations:
Overfitting Concerns: High correlations among many of the features heighten the risk of model overfitting.
Iterative Modelling: Iterating on the models is important for addressing and mitigating overfitting and for enhancing model performance.


### Key findings:

1. Lorem ipsum dolor sit amet
2. Lorem ipsum dolor sit amet
3. Lorem ipsum dolor sit amet
4. ...

# Bibliography (optional) 