# Popularity Prediction

I will provide you with all the analysis I have made regarding the dataset, after going through them carefully, I want you to come up with ML models and techniques we can use to predict the popularity.
1. Topic is one of the four values {'obama','microsoft','palestine','economy'} The occurences of these Topic are as followed {28610,21858,8843,33928} in the same order as the topic list.
2. Number of Articles Published by Day of the Week: 
DayOfWeek
Monday       16250
Tuesday      16720
Wednesday    15739
Thursday     15424
Friday       13846
Saturday      6994
Sunday        82662. 
3. Number of Articles Published by Time of Day:
TimeOfDay
Morning (6 AM - 12 PM)      20257
Afternoon (12 PM - 5 PM)    23551
Evening (5 PM - 9 PM)       17374
Night (9 PM - 6 AM)         32057
4. Top Topics by Platform:
Facebook: obama (Average Popularity = 2411.93)
GooglePlus: palestine (Average Popularity = 2480.58)
LinkedIn: microsoft (Average Popularity = 909.74)
Insights: 

* *Obama and Palestine perform better on facebook as compared to microsoft and economy, suggesting sentimental topics have a higher popularity on Facebook as compared to other topics.*

* *Global Topics like palestine and economy perform good on GooglePlus.*

* *Microsoft and Economy perform the best in LinkedIn as they directly relate to job opportunities and purpose of the platform.*

5. *Insight: Almost all articles with a repeating sources for a topic attains a lot of traction as compared to the other average occuring sources on all platforms irrespective of the topic. Therefore we can conclude Source is a highly important parameter for an articles popularity irrespective of the platfrom or topic. A repetative source for a topic will attain on an average higher popularity as compared to sources that are not frequent.*

6. There are two more columns in the news_df Sentiment_Headline and Sentiment_Title
I combined the two columns using average and created a temporary mean_sentiment column and then analysed popularity vs upper, lower quartile of mean_sentiment and got the below results:                                      
                                          Facebook	GooglePlus	LinkedIn 
sentiment_category			
Extreme +vs Sentiment	1890.259258	1824.372085	808.876082
Extreme -ve Sentiment	1745.458791	1869.388414	810.139478
Medium Sentiment	1600.414076	1752.636002	762.046227
*Insight - Articles with extreme positive or negative Headline/Title perform better on all platforms as compared to modest sentiment Headline/Title article. The articles follow "Bad publicity is Good publicity". Articles with extreme negative values perform even better than positive articles on googleplus and linkedin.*

7. *Insight- In general we observe that articles having the most common words in headline/title for a topic have higher average popularity than others, this could be because of "Hot Topic" phenomena, when there is an important news regarding any of the one topic most of the sources write about it and the Title/Headline being highly repeated gains a higher popularity than other not so relevant titles/headlines for a topic.* 

8. *Insight - we can say that articles published on weekends gain traction on facebook and googleplus while the opposite is true for linkedin which is more of a business/corporate application, mostly used during week days.*


In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from lightgbm import LGBMRegressor
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import FunctionTransformer

In [3]:
data=pd.read_csv("Cleaned_News.csv")

# Feature Engineering

In [4]:
data.columns

Index(['IDLink', 'Title', 'Headline', 'Source', 'Topic', 'PublishDate',
       'SentimentTitle', 'SentimentHeadline', 'Facebook', 'GooglePlus',
       'LinkedIn'],
      dtype='object')

In [5]:
data['PublishDate'] = pd.to_datetime(data['PublishDate'], errors='coerce')

In [6]:
# Creating new Columns 
import pandas as pd

def create_derived_features(df):
    """Creates derived features from existing columns.

    Args:
        df: Pandas DataFrame containing news data with 'SentimentTitle',
            'SentimentHeadline', and 'PublishDate' columns.

    Returns:
        A new Pandas DataFrame with the derived features, or the original
        DataFrame if the required columns are missing. Returns None if input is not a dataframe.
    """
    if not isinstance(df, pd.DataFrame):
        print("Input is not a Pandas DataFrame")
        return None
    df_derived = df.copy()

    required_cols = ['SentimentTitle', 'SentimentHeadline', 'PublishDate']
    if not all(col in df_derived.columns for col in required_cols):
        print(f"Missing required columns: {set(required_cols) - set(df_derived.columns)}")
        return df_derived  # Return original if columns are missing

    # 1. Sentiment Mean
    df_derived['Sentiment_mean'] = df_derived[['SentimentTitle', 'SentimentHeadline']].mean(axis=1)

    # 2. Publish Day
    df_derived['PublishDay'] = df_derived['PublishDate'].dt.day_name()

    # 3. Publish Hour
    df_derived['PublishHour'] = df_derived['PublishDate'].dt.hour

    # 4. Time of Day
    def categorize_time(hour):
        if 6 <= hour < 12:
            return 'Morning'
        elif 12 <= hour < 17:
            return 'Afternoon'
        elif 17 <= hour < 21:
            return 'Evening'
        else:
            return 'Night'

    df_derived['TimeOfDay'] = df_derived['PublishHour'].apply(categorize_time)

    return df_derived

data = create_derived_features(data)



In [30]:
data.head()

Unnamed: 0,IDLink,Title,Headline,Source,Topic,PublishDate,SentimentTitle,SentimentHeadline,Facebook,GooglePlus,LinkedIn,Sentiment_mean,PublishDay,PublishHour,TimeOfDay
0,99248,Obama Lays Wreath at Arlington National Cemetery,Obama Lays Wreath at Arlington National Cemete...,USA TODAY,obama,2002-04-02 00:00:00,0.0,-0.0533,2547.659722,1538.570833,499.025,-0.02665,Tuesday,0,Night
1,10423,A Look at the Health of the Chinese Economy,Tim Haywood investment director businessunit h...,Bloomberg,economy,2008-09-20 00:00:00,0.208333,-0.156386,1380.145833,1957.444444,753.729167,0.025974,Saturday,0,Night
2,18828,Nouriel Roubini Global Economy Not Back to 2008,Nouriel Roubini NYU professor and chairman at ...,Bloomberg,economy,2012-01-28 00:00:00,-0.42521,0.139754,1647.295833,2242.472222,874.993056,-0.142728,Saturday,0,Night
3,27788,Finland GDP Expands In Q4,Finlands economy expanded marginally in the th...,RTT News,economy,2015-03-01 00:06:00,0.0,0.026064,1157.554167,1805.383333,701.736111,0.013032,Sunday,0,Night
4,27789,Tourism govt spending buoys Thai economy in Ja...,Tourism and public spending continued to boost...,The Nation Thailand39s English news,economy,2015-03-01 00:11:00,0.0,0.141084,1439.5125,2166.45,857.6875,0.070542,Sunday,0,Night


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
def preprocess_data(df):
    """Preprocesses the data for social media engagement prediction."""


    df['Facebook'] = np.log1p(df['Facebook'] + 1)
    df['GooglePlus'] = np.log1p(df['GooglePlus'] + 1)
    df['LinkedIn'] = np.log1p(df['LinkedIn'] + 1)

    source_counts = df.groupby(['Topic', 'Source'])['Source'].count().unstack(fill_value=0)
    top_10_sources = {}
    for topic in df['Topic'].unique():
        top_10_sources[topic] = source_counts.loc[topic].nlargest(10).index.tolist()

    for topic, sources in top_10_sources.items():
        for source in sources:
            source_col = f"source_is_{topic}_{source.replace(' ', '_')}"
            df[source_col] = (df['Topic'] == topic) & (df['Source'] == source)
            
    categorical_features = ['Topic', 'PublishDay', 'TimeOfDay']
    for feature in categorical_features:
        ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
        encoded = ohe.fit_transform(df[[feature]])
        encoded_df = pd.DataFrame(encoded, columns=ohe.get_feature_names_out([feature]))
        df = pd.concat([df, encoded_df], axis=1)
        df.drop(feature, axis=1, inplace=True)

    scaler = StandardScaler()
    df['Sentiment_mean_scaled'] = scaler.fit_transform(df[['Sentiment_mean']].values)

    df['is_weekend'] = (df['PublishDay_Saturday'] == 1) | (df['PublishDay_Sunday'] == 1)

    def preprocess_text(text):
        if not isinstance(text, str):
            return ""  # Return empty string instead of list for CountVectorizer
        text = text.lower()
        text = re.sub(r'[^\w\s]', '', text)
        tokens = [word for word in text.split() if word not in stop_words]
        return " ".join(tokens)  # Return space-separated string

    df['CleanedHeadline'] = df['Headline'].apply(preprocess_text)
    df['CleanedTitle'] = df['Title'].apply(preprocess_text)

    headline_vectorizer = CountVectorizer(max_features=20, ngram_range=(5, 5))
    headline_patterns = headline_vectorizer.fit_transform(df['CleanedHeadline']).toarray()
    df = pd.concat([df, pd.DataFrame(headline_patterns, columns=headline_vectorizer.get_feature_names_out())], axis=1)

    title_vectorizer = CountVectorizer(max_features=20, ngram_range=(3, 3))
    title_patterns = title_vectorizer.fit_transform(df['CleanedTitle']).toarray()
    df = pd.concat([df, pd.DataFrame(title_patterns, columns=title_vectorizer.get_feature_names_out())], axis=1)

    feature_cols = list(df.filter(like='is_')) + list(df.filter(like='PublishDay_')) + list(df.filter(like='TimeOfDay_')) + ['Sentiment_mean_scaled'] + list(df.filter(like='source_is_')) + list(headline_vectorizer.get_feature_names_out()) + list(title_vectorizer.get_feature_names_out()) + ['is_weekend']
    X = df[feature_cols]
    X = X.loc[:, (X != X.iloc[0]).any()]

    return X, df[['Facebook', 'GooglePlus', 'LinkedIn']]

# LightGBM Model

In [15]:
df=data.copy()

In [20]:
df['Facebook_log'] = np.log1p(df['Facebook'] + 1)
df['GooglePlus_log'] = np.log1p(df['GooglePlus'] + 1)
df['LinkedIn_log'] = np.log1p(df['LinkedIn'] + 1)

In [21]:
# Extract date-time components
df['Year'] = pd.to_datetime(df['PublishDate']).dt.year
df['Month'] = pd.to_datetime(df['PublishDate']).dt.month
df['Day'] = pd.to_datetime(df['PublishDate']).dt.day
df['Hour'] = pd.to_datetime(df['PublishDate']).dt.hour

# Cyclical encoding for Month and Day
df['Month_sin'] = np.sin(2 * np.pi * df['Month'] / 12)
df['Month_cos'] = np.cos(2 * np.pi * df['Month'] / 12)
df['Day_sin'] = np.sin(2 * np.pi * df['Day'] / 31)
df['Day_cos'] = np.cos(2 * np.pi * df['Day'] / 31)


In [22]:
# Compute topic-wise source frequency
source_topic_counts = df.groupby(['Topic', 'Source']).size().reset_index(name='SourceCount')
total_topic_counts = df.groupby('Topic').size().reset_index(name='TotalCount')

# Merge to calculate normalized frequency
df = df.merge(source_topic_counts, on=['Topic', 'Source'], how='left')
df = df.merge(total_topic_counts, on='Topic', how='left')
df['SourceFreq'] = df['SourceCount'] / df['TotalCount']
df.drop(['SourceCount', 'TotalCount'], axis=1, inplace=True)


In [23]:
from sklearn.feature_extraction.text import CountVectorizer
import re

# Helper function to preprocess text
def preprocess_text(text):
    text = re.sub(r'[^\w\s]', '', text.lower())  # Remove punctuation and convert to lowercase
    stop_words = set(["the", "and", "is", "to", "in", "it", "of", "for", "on", "with", "as", "this", "at", "by"])  # Example stop words
    words = [word for word in text.split() if word not in stop_words]
    return ' '.join(words)

df['CleanedHeadline'] = df['Headline'].apply(preprocess_text)
df['CleanedTitle'] = df['Title'].apply(preprocess_text)

# Get top 20 patterns for headline (5 words)
headline_vectorizer = CountVectorizer(max_features=20, ngram_range=(5, 5))
headline_patterns = headline_vectorizer.fit(df['CleanedHeadline']).get_feature_names_out()

# Get top 20 patterns for title (3 words)
title_vectorizer = CountVectorizer(max_features=20, ngram_range=(3, 3))
title_patterns = title_vectorizer.fit(df['CleanedTitle']).get_feature_names_out()

# Add binary columns for presence of these patterns
for pattern in headline_patterns:
    df[f'Headline_{pattern}'] = df['CleanedHeadline'].str.contains(pattern).astype(int)

for pattern in title_patterns:
    df[f'Title_{pattern}'] = df['CleanedTitle'].str.contains(pattern).astype(int)


In [24]:
# Features and Targets
features = [
    'Topic', 'PublishHour', 'Month_sin', 'Month_cos', 'Day_sin', 'Day_cos', 
    'SourceFreq', 'Sentiment_mean'] + \
    [f'Headline_{pattern}' for pattern in headline_patterns] + \
    [f'Title_{pattern}' for pattern in title_patterns]

target_facebook = 'Facebook_log'
target_googleplus = 'GooglePlus_log'
target_linkedin = 'LinkedIn_log'

X = df[features]
y_facebook = df[target_facebook]
y_googleplus = df[target_googleplus]
y_linkedin = df[target_linkedin]



# Train-test split for each target
X_train_fb, X_test_fb, y_train_fb, y_test_fb = train_test_split(X, y_facebook, test_size=0.2, random_state=42)
X_train_gp, X_test_gp, y_train_gp, y_test_gp = train_test_split(X, y_googleplus, test_size=0.2, random_state=42)
X_train_li, X_test_li, y_train_li, y_test_li = train_test_split(X, y_linkedin, test_size=0.2, random_state=42)


In [25]:
# Ensure all columns are numeric
X_train_fb = X_train_fb.apply(pd.to_numeric, errors='coerce')
X_test_fb = X_test_fb.apply(pd.to_numeric, errors='coerce')
X_train_gp = X_train_gp.apply(pd.to_numeric, errors='coerce')
X_test_gp = X_test_gp.apply(pd.to_numeric, errors='coerce')
X_train_li = X_train_li.apply(pd.to_numeric, errors='coerce')
X_test_li = X_test_li.apply(pd.to_numeric, errors='coerce')

# Check for any remaining non-numeric columns
print(X_train_fb.dtypes)
print(X_test_fb.dtypes)

Topic                                                     float64
PublishHour                                                 int32
Month_sin                                                 float64
Month_cos                                                 float64
Day_sin                                                   float64
Day_cos                                                   float64
SourceFreq                                                float64
Sentiment_mean                                            float64
Headline_canadian prime minister justin trudeau             int32
Headline_external affairs minister sushma swaraj            int32
Headline_federal reserve chair janet yellen                 int32
Headline_his final state union address                      int32
Headline_international middle east media center             int32
Headline_japanese prime minister shinzo abe                 int32
Headline_middle east media center wwwimemcorg               int32
Headline_o

In [26]:
# Train LightGBM Models
from lightgbm import LGBMRegressor

def train_lgbm(X_train, y_train, X_test, y_test):
    model = LGBMRegressor(random_state=42)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, predictions))
    return model, rmse

# Train and evaluate for each target
model_fb, rmse_fb = train_lgbm(X_train_fb, y_train_fb, X_test_fb, y_test_fb)
model_gp, rmse_gp = train_lgbm(X_train_gp, y_train_gp, X_test_gp, y_test_gp)
model_li, rmse_li = train_lgbm(X_train_li, y_train_li, X_test_li, y_test_li)

# Print results
print(f"RMSE for Facebook: {rmse_fb}")
print(f"RMSE for GooglePlus: {rmse_gp}")
print(f"RMSE for LinkedIn: {rmse_li}")


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.005568 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 665
[LightGBM] [Info] Number of data points in the train set: 74591, number of used features: 44
[LightGBM] [Info] Start training from score 7.387639
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.005083 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 665
[LightGBM] [Info] Number of data points in the train set: 74591, number of used features: 44
[LightGBM] [Info] Start training from score 7.460173
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004843 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough,

In [27]:
# Test the models and compare RMSE values
print(f"RMSE for model_fb: {rmse_fb}")
print(f"RMSE for model_gp: {rmse_gp}")
print(f"RMSE for model_li: {rmse_li}")

# Make predictions with each model
predictions_fb = model_fb.predict(X_test_fb)
predictions_gp = model_gp.predict(X_test_gp)
predictions_li = model_li.predict(X_test_li)

# Optionally: Calculate RMSE for each model's predictions
from sklearn.metrics import mean_squared_error
import numpy as np

rmse_fb_test = np.sqrt(mean_squared_error(y_test_fb, predictions_fb))
rmse_gp_test = np.sqrt(mean_squared_error(y_test_gp, predictions_gp))
rmse_li_test = np.sqrt(mean_squared_error(y_test_li, predictions_li))

print(f"Test RMSE for model_fb: {rmse_fb_test}")
print(f"Test RMSE for model_gp: {rmse_gp_test}")
print(f"Test RMSE for model_li: {rmse_li_test}")


RMSE for model_fb: 0.20451280810356615
RMSE for model_gp: 0.1467251658915226
RMSE for model_li: 0.18870089063629905
Test RMSE for model_fb: 0.20451280810356615
Test RMSE for model_gp: 0.1467251658915226
Test RMSE for model_li: 0.18870089063629905


In [30]:
# Inverse log1p transformation for predictions
predictions_fb_original = np.expm1(predictions_fb)
predictions_gp_original = np.expm1(predictions_gp)
predictions_li_original = np.expm1(predictions_li)

# Inverse log1p transformation for actual values
y_test_fb_original = np.expm1(y_test_fb)
y_test_gp_original = np.expm1(y_test_gp)
y_test_li_original = np.expm1(y_test_li)

# Calculate RMSE in the original scale
rmse_fb_original = np.sqrt(mean_squared_error(y_test_fb_original, predictions_fb_original))
rmse_gp_original = np.sqrt(mean_squared_error(y_test_gp_original, predictions_gp_original))
rmse_li_original = np.sqrt(mean_squared_error(y_test_li_original, predictions_li_original))

print(f"Original Scale RMSE for model_fb: {rmse_fb_original}")
print(f"Original Scale RMSE for model_gp: {rmse_gp_original}")
print(f"Original Scale RMSE for model_li: {rmse_li_original}")


Original Scale RMSE for model_fb: 363.19203719791534
Original Scale RMSE for model_gp: 274.7420048319023
Original Scale RMSE for model_li: 138.58863271095132


In [31]:
df[["Facebook","LinkedIn","GooglePlus"]].describe()

Unnamed: 0,Facebook,LinkedIn,GooglePlus
count,93239.0,93239.0,93239.0
mean,1709.134605,785.776226,1799.756127
std,600.936341,228.10237,505.231734
min,482.225,134.4125,441.1875
25%,1228.854167,579.208333,1422.045833
50%,1540.0625,801.472222,1695.366667
75%,2115.179167,928.319444,2077.1
max,5763.013889,1975.381944,4503.333333


FB Test RMSE =  363.32

GP Test RMSE = 274.68

LI Test RMSE = 138.716

# Baseline Model 

In [32]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import mean_squared_error, r2_score
import nltk
from nltk.corpus import stopwords
import re
import math

news=data.copy()

#nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

#news=pd.read_csv("Tableau_News.csv")
# Preprocessing and Feature Engineering

# Log transform the target variables
news['Facebook'] = np.log1p(news['Facebook'])
news['GooglePlus'] = np.log1p(news['GooglePlus'])
news['LinkedIn'] = np.log1p(news['LinkedIn'])

# One-Hot Encoding for Topic
ohe_topic = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
topic_encoded = ohe_topic.fit_transform(news[['Topic']])
topic_df = pd.DataFrame(topic_encoded, columns=ohe_topic.get_feature_names_out(['Topic']))
news = pd.concat([news, topic_df], axis=1)

# One-Hot Encoding for Day of the Week
ohe_day = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
day_encoded = ohe_day.fit_transform(news[['PublishDay']])
day_df = pd.DataFrame(day_encoded, columns=ohe_day.get_feature_names_out(['PublishDay']))
news = pd.concat([news, day_df], axis=1)

# One-Hot Encoding for Time of Day
ohe_time = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
time_encoded = ohe_time.fit_transform(news[['TimeOfDay']])
time_df = pd.DataFrame(time_encoded, columns=ohe_time.get_feature_names_out(['TimeOfDay']))
news = pd.concat([news, time_df], axis=1)

# Scaling numerical features
scaler = StandardScaler()
news['Sentiment_mean_scaled'] = scaler.fit_transform(news[['Sentiment_mean']].values)

# Feature Engineering: Source (Frequency Encoding)
source_counts = news.groupby(['Topic', 'Source'])['Source'].count().unstack(fill_value=0)
top_10_sources = {}
for topic in news['Topic'].unique():
    top_10_sources[topic] = source_counts.loc[topic].nlargest(10).index.tolist()

for topic, sources in top_10_sources.items():
    for source in sources:
        source_col = f"source_is_{topic}_{source.replace(' ', '_')}"  # more robust column names
        news[source_col] = (news['Topic'] == topic) & (news['Source'] == source)

# Feature Engineering: Interaction Terms
news['is_weekend'] = (news['PublishDay'] == "Sunday") | (news['PublishDay'] == "Saturday")

# Text Preprocessing (for Headline and Title)
def preprocess_text(text):
    if isinstance(text, float) and math.isnan(text):
        return []
    text = str(text)
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    tokens = [word for word in text.split() if word not in stop_words]
    return tokens

news['CleanedHeadline'] = news['Headline'].apply(preprocess_text)
news['CleanedTitle'] = news['Title'].apply(preprocess_text)

# Feature Engineering: Common Words in Headline and Title (using CountVectorizer)
headline_vectorizer = CountVectorizer(max_features=20, ngram_range=(5, 5))
headline_patterns = headline_vectorizer.fit_transform(news['CleanedHeadline'].apply(lambda x: " ".join(x))).toarray()
news = pd.concat([news, pd.DataFrame(headline_patterns, columns=headline_vectorizer.get_feature_names_out())], axis=1)

title_vectorizer = CountVectorizer(max_features=20, ngram_range=(3, 3))
title_patterns = title_vectorizer.fit_transform(news['CleanedTitle'].apply(lambda x: " ".join(x))).toarray()
news = pd.concat([news, pd.DataFrame(title_patterns, columns=title_vectorizer.get_feature_names_out())], axis=1)

# Define features (X) and targets (y)
feature_cols = list(news.filter(like='is_')) + list(news.filter(like='PublishDay_')) + list(news.filter(like='TimeOfDay_')) + ['Sentiment_mean_scaled'] + list(news.filter(like='source_is_')) + list(headline_vectorizer.get_feature_names_out()) + list(title_vectorizer.get_feature_names_out()) + ['is_weekend']
X = news[feature_cols]

targets = ['Facebook', 'GooglePlus', 'LinkedIn']
results = {}

for target in targets:
    y = news[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Linear Regression using scikit-learn
    from sklearn.linear_model import LinearRegression
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Reverse the log transform
    y_test_original = np.expm1(y_test)
    y_pred_original = np.expm1(y_pred)

    # Evaluate the model
    mse = mean_squared_error(y_test_original, y_pred_original)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test_original, y_pred_original)

    results[target] = {'MSE': mse, 'RMSE': rmse, 'R-squared': r2}

# Print results
for target, metrics in results.items():
    print(f"Results for {target}:")
    print(f"  MSE: {metrics['MSE']:.2f}")
    print(f"  RMSE: {metrics['RMSE']:.2f}")
    print(f"  R-squared: {metrics['R-squared']:.2f}")
    print("-" * 30)

Results for Facebook:
  MSE: 263899.43
  RMSE: 513.71
  R-squared: 0.27
------------------------------
Results for GooglePlus:
  MSE: 192197.84
  RMSE: 438.40
  R-squared: 0.24
------------------------------
Results for LinkedIn:
  MSE: 34805.38
  RMSE: 186.56
  R-squared: 0.35
------------------------------


# Random Forest

In [11]:
df=data.copy()

In [12]:
df.columns

Index(['IDLink', 'Title', 'Headline', 'Source', 'Topic', 'PublishDate',
       'SentimentTitle', 'SentimentHeadline', 'Facebook', 'GooglePlus',
       'LinkedIn', 'Sentiment_mean', 'PublishDay', 'PublishHour', 'TimeOfDay'],
      dtype='object')

In [9]:
df.head()

Unnamed: 0,IDLink,Title,Headline,Source,Topic,PublishDate,SentimentTitle,SentimentHeadline,Facebook,GooglePlus,LinkedIn,Sentiment_mean,PublishDay,PublishHour,TimeOfDay
0,99248,Obama Lays Wreath at Arlington National Cemetery,Obama Lays Wreath at Arlington National Cemete...,USA TODAY,obama,2002-04-02 00:00:00,0.0,-0.0533,2547.659722,1538.570833,499.025,-0.02665,Tuesday,0,Night
1,10423,A Look at the Health of the Chinese Economy,Tim Haywood investment director businessunit h...,Bloomberg,economy,2008-09-20 00:00:00,0.208333,-0.156386,1380.145833,1957.444444,753.729167,0.025974,Saturday,0,Night
2,18828,Nouriel Roubini Global Economy Not Back to 2008,Nouriel Roubini NYU professor and chairman at ...,Bloomberg,economy,2012-01-28 00:00:00,-0.42521,0.139754,1647.295833,2242.472222,874.993056,-0.142728,Saturday,0,Night
3,27788,Finland GDP Expands In Q4,Finlands economy expanded marginally in the th...,RTT News,economy,2015-03-01 00:06:00,0.0,0.026064,1157.554167,1805.383333,701.736111,0.013032,Sunday,0,Night
4,27789,Tourism govt spending buoys Thai economy in Ja...,Tourism and public spending continued to boost...,The Nation Thailand39s English news,economy,2015-03-01 00:11:00,0.0,0.141084,1439.5125,2166.45,857.6875,0.070542,Sunday,0,Night


In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor
import nltk
from nltk.corpus import stopwords
import re
import math
stop_words = set(stopwords.words('english'))


def preprocess_data(df):
    """Preprocesses the data for social media engagement prediction."""


    df['Facebook'] = np.log1p(df['Facebook'] + 1)
    df['GooglePlus'] = np.log1p(df['GooglePlus'] + 1)
    df['LinkedIn'] = np.log1p(df['LinkedIn'] + 1)

    source_counts = df.groupby(['Topic', 'Source'])['Source'].count().unstack(fill_value=0)
    top_10_sources = {}
    for topic in df['Topic'].unique():
        top_10_sources[topic] = source_counts.loc[topic].nlargest(10).index.tolist()

    for topic, sources in top_10_sources.items():
        for source in sources:
            source_col = f"source_is_{topic}_{source.replace(' ', '_')}"
            df[source_col] = (df['Topic'] == topic) & (df['Source'] == source)
            
    categorical_features = ['Topic', 'PublishDay', 'TimeOfDay']
    for feature in categorical_features:
        ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
        encoded = ohe.fit_transform(df[[feature]])
        encoded_df = pd.DataFrame(encoded, columns=ohe.get_feature_names_out([feature]))
        df = pd.concat([df, encoded_df], axis=1)
        df.drop(feature, axis=1, inplace=True)

    scaler = StandardScaler()
    df['Sentiment_mean_scaled'] = scaler.fit_transform(df[['Sentiment_mean']].values)

    df['is_weekend'] = (df['PublishDay_Saturday'] == 1) | (df['PublishDay_Sunday'] == 1)

    def preprocess_text(text):
        if not isinstance(text, str):
            return ""  # Return empty string instead of list for CountVectorizer
        text = text.lower()
        text = re.sub(r'[^\w\s]', '', text)
        tokens = [word for word in text.split() if word not in stop_words]
        return " ".join(tokens)  # Return space-separated string

    df['CleanedHeadline'] = df['Headline'].apply(preprocess_text)
    df['CleanedTitle'] = df['Title'].apply(preprocess_text)

    headline_vectorizer = CountVectorizer(max_features=20, ngram_range=(5, 5))
    headline_patterns = headline_vectorizer.fit_transform(df['CleanedHeadline']).toarray()
    df = pd.concat([df, pd.DataFrame(headline_patterns, columns=headline_vectorizer.get_feature_names_out())], axis=1)

    title_vectorizer = CountVectorizer(max_features=20, ngram_range=(3, 3))
    title_patterns = title_vectorizer.fit_transform(df['CleanedTitle']).toarray()
    df = pd.concat([df, pd.DataFrame(title_patterns, columns=title_vectorizer.get_feature_names_out())], axis=1)

    feature_cols = list(df.filter(like='is_')) + list(df.filter(like='PublishDay_')) + list(df.filter(like='TimeOfDay_')) + ['Sentiment_mean_scaled'] + list(df.filter(like='source_is_')) + list(headline_vectorizer.get_feature_names_out()) + list(title_vectorizer.get_feature_names_out()) + ['is_weekend']
    X = df[feature_cols]
    X = X.loc[:, (X != X.iloc[0]).any()]

    return X, df[['Facebook', 'GooglePlus', 'LinkedIn']]



X, y = preprocess_data(df)

targets = y.columns
results = {}

for target in targets:
    y_target = y[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y_target, test_size=0.2, random_state=42)

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    best_model = None
    best_rmse = float('inf')
    for n_estimators in range(10, 101, 10):
        rf_model = RandomForestRegressor(n_estimators=n_estimators, random_state=42, n_jobs=-1)
        rf_model.fit(X_train_scaled, y_train)
        y_pred = rf_model.predict(X_test_scaled)
        rmse = np.sqrt(mean_squared_error(y_test, y_pred))
        if rmse < best_rmse:
            best_model = rf_model
            best_rmse = rmse
            print(f"Improved RMSE for {target}: {best_rmse:.4f} with n_estimators={n_estimators}")

    try:
        y_pred = best_model.predict(X_test_scaled)
        mse = mean_squared_error(np.expm1(y_test), np.expm1(y_pred)) #Inverse log transformation
        rmse = np.sqrt(mse)
        r2 = r2_score(np.expm1(y_test), np.expm1(y_pred))
        results[target] = {'MSE': mse, 'RMSE': rmse, 'R-squared': r2}
    except Exception as e:
        print(f"Error during prediction or evaluation for {target}: {e}")

for target, metrics in results.items():
    print(f"Results for {target} (Random Forest):")
    print(f"  MSE: {metrics['MSE']:.2f}")
    print(f"  RMSE: {metrics['RMSE']:.2f}")
    print(f"  R-squared: {metrics['R-squared']:.2f}")
    print("-" * 30)

Improved RMSE for Facebook: 0.3161 with n_estimators=10
Improved RMSE for Facebook: 0.3117 with n_estimators=20
Improved RMSE for Facebook: 0.3107 with n_estimators=30
Improved RMSE for Facebook: 0.3098 with n_estimators=40
Improved RMSE for Facebook: 0.3094 with n_estimators=50
Improved RMSE for Facebook: 0.3092 with n_estimators=60
Improved RMSE for Facebook: 0.3090 with n_estimators=70
Improved RMSE for Facebook: 0.3089 with n_estimators=80
Improved RMSE for Facebook: 0.3087 with n_estimators=90
Improved RMSE for Facebook: 0.3085 with n_estimators=100
Improved RMSE for GooglePlus: 0.2548 with n_estimators=10
Improved RMSE for GooglePlus: 0.2515 with n_estimators=20
Improved RMSE for GooglePlus: 0.2505 with n_estimators=30
Improved RMSE for GooglePlus: 0.2499 with n_estimators=40
Improved RMSE for GooglePlus: 0.2498 with n_estimators=50
Improved RMSE for GooglePlus: 0.2495 with n_estimators=60
Improved RMSE for GooglePlus: 0.2494 with n_estimators=70
Improved RMSE for GooglePlus: 0.2

# Neural Network

In [33]:
df=data.copy()

In [34]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import mean_squared_error, r2_score
import nltk
from nltk.corpus import stopwords
import re
import math
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

#nltk.download('stopwords', quiet=True)
stop_words = set(stopwords.words('english'))


# One-Hot Encoding for Topic
ohe_topic = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
topic_encoded = ohe_topic.fit_transform(df[['Topic']])
topic_df = pd.DataFrame(topic_encoded, columns=ohe_topic.get_feature_names_out(['Topic']))
df = pd.concat([df, topic_df], axis=1)

# One-Hot Encoding for Day of the Week
ohe_day = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
day_encoded = ohe_day.fit_transform(df[['PublishDay']])
day_df = pd.DataFrame(day_encoded, columns=ohe_day.get_feature_names_out(['PublishDay']))
df = pd.concat([df, day_df], axis=1)

# One-Hot Encoding for Time of Day
ohe_time = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
time_encoded = ohe_time.fit_transform(df[['TimeOfDay']])
time_df = pd.DataFrame(time_encoded, columns=ohe_time.get_feature_names_out(['TimeOfDay']))
df = pd.concat([df, time_df], axis=1)

# Scaling numerical features
scaler = StandardScaler()
df['Sentiment_mean_scaled'] = scaler.fit_transform(df[['Sentiment_mean']].values)

# Feature Engineering: Source (Frequency Encoding)
source_counts = df.groupby(['Topic', 'Source'])['Source'].count().unstack(fill_value=0)
top_10_sources = {}
for topic in df['Topic'].unique():
    top_10_sources[topic] = source_counts.loc[topic].nlargest(10).index.tolist()

for topic, sources in top_10_sources.items():
    for source in sources:
        source_col = f"source_is_{topic}_{source.replace(' ', '_')}"  # more robust column names
        df[source_col] = (df['Topic'] == topic) & (df['Source'] == source)

# Feature Engineering: Interaction Terms
df['facebook_weekend'] = (df['Facebook'] > 0) & ((df['PublishDay_Saturday'] == 1) | (df['PublishDay_Sunday'] == 1))
df['googleplus_weekend'] = (df['GooglePlus'] > 0) & ((df['PublishDay_Saturday'] == 1) | (df['PublishDay_Sunday'] == 1))
df['linkedin_weekend'] = (df['LinkedIn'] > 0) & ((df['PublishDay_Saturday'] == 1) | (df['PublishDay_Sunday'] == 1))

# Text Preprocessing (for Headline and Title)
def preprocess_text(text):
    if isinstance(text, float) and math.isnan(text):
        return []
    text = str(text)
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    tokens = [word for word in text.split() if word not in stop_words]
    return tokens

df['CleanedHeadline'] = df['Headline'].apply(preprocess_text)
df['CleanedTitle'] = df['Title'].apply(preprocess_text)

# Feature Engineering: Common Words in Headline and Title (using CountVectorizer)
headline_vectorizer = CountVectorizer(max_features=20, ngram_range=(5, 5))
headline_patterns = headline_vectorizer.fit_transform(df['CleanedHeadline'].apply(lambda x: " ".join(x))).toarray()
df = pd.concat([df, pd.DataFrame(headline_patterns, columns=headline_vectorizer.get_feature_names_out())], axis=1)

title_vectorizer = CountVectorizer(max_features=20, ngram_range=(3, 3))
title_patterns = title_vectorizer.fit_transform(df['CleanedTitle'].apply(lambda x: " ".join(x))).toarray()
df = pd.concat([df, pd.DataFrame(title_patterns, columns=title_vectorizer.get_feature_names_out())], axis=1)

# ... (All Preprocessing and Feature Engineering steps - SAME AS IN THE PREVIOUS IMPROVED CODE)
feature_cols = list(df.filter(like='is_')) + list(df.filter(like='PublishDay_')) + list(df.filter(like='TimeOfDay_')) + ['Sentiment_mean_scaled'] + list(df.filter(like='source_is_')) + list(df.filter(regex=r'\b\w+(?:\s+\w+){4}\b')) + list(df.filter(regex=r'\b\w+(?:\s+\w+){2}\b')) + ['facebook_weekend', 'googleplus_weekend', 'linkedin_weekend']

X = df[feature_cols]
X = X.loc[:, (X != X.iloc[0]).any()]

targets = ['Facebook', 'GooglePlus', 'LinkedIn']
results = {}

for target in targets:
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Build the Neural Network Model
    model = keras.Sequential([
        layers.Dense(128, activation='relu', input_shape=[X_train_scaled.shape[1]]),  # Input layer + hidden layer
        layers.Dropout(0.3),  # Dropout for regularization
        layers.Dense(64, activation='relu'),
        layers.Dropout(0.3),
        layers.Dense(1)  # Output layer (1 neuron for regression)
    ])

    model.compile(loss='mse', optimizer='adam', metrics=['mse', 'mae'])

    # Train the Model with Early Stopping
    early_stopping = keras.callbacks.EarlyStopping(
        monitor='val_loss', patience=10, restore_best_weights=True
    )

    history = model.fit(
        X_train_scaled, y_train,
        epochs=100,  # Adjust as needed
        batch_size=32,  # Adjust as needed
        validation_split=0.2, # Validation split
        callbacks=[early_stopping],
        verbose = 0
    )
    y_pred = model.predict(X_test_scaled)
    y_pred = y_pred.flatten()
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)

    results[target] = {'MSE': mse, 'RMSE': rmse, 'R-squared': r2}

# Print results
for target, metrics in results.items():
    print(f"Results for {target} (Neural Network):")
    print(f"  MSE: {metrics['MSE']:.2f}")
    print(f"  RMSE: {metrics['RMSE']:.2f}")
    print(f"  R-squared: {metrics['R-squared']:.2f}")
    print("-" * 30)






Results for Facebook (Neural Network):
  MSE: 221747.57
  RMSE: 470.90
  R-squared: 0.38
------------------------------
Results for GooglePlus (Neural Network):
  MSE: 184815.31
  RMSE: 429.90
  R-squared: 0.27
------------------------------
Results for LinkedIn (Neural Network):
  MSE: 33008.16
  RMSE: 181.68
  R-squared: 0.38
------------------------------
