# Online News Popularity Prediction

**Dataset Source:** [Online News Popularity Dataset](https://www.kaggle.com/datasets/ishantjuyal/online-news-popularity)

**Objective:**  
Predict the number of shares an online news article will receive on social media platforms based on various features extracted from the article and its metadata.

---

## Table of Contents
1. [Introduction](#Introduction)
2. [Problem Statement](#Problem-Statement)
3. [Data Loading and Overview](#Data-Loading-and-Overview)
4. [Exploratory Data Analysis (EDA)](#Exploratory-Data-Analysis-EDA)
5. [Data Preprocessing](#Data-Preprocessing)
6. [Feature Engineering](#Feature-Engineering)
7. [Model Building and Training](#Model-Building-and-Training)
8. [Model Evaluation](#Model-Evaluation)
9. [Results and Discussion](#Results-and-Discussion)
10. [Conclusion](#Conclusion)
11. [References](#References)

---

## Introduction

In the age of digital media, understanding and predicting the popularity of online news articles is crucial for content creators, marketers, and publishers. This notebook explores the Online News Popularity dataset to build models that predict the number of shares an article will receive on social media platforms. By leveraging various features such as textual content metrics, metadata, and sentiment analysis, we aim to identify key factors that influence article popularity.

---

## Problem Statement

This is a **regression** problem where the objective is to predict the continuous variable **'shares'**, representing the number of times an online news article is shared on social media platforms.

---

## Data Loading and Overview




In [None]:
!pip install shap

In [None]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# For modeling
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

# For warnings
import warnings
warnings.filterwarnings('ignore')

# For SHAP
import shap

# Setting visual style
sns.set(style="whitegrid")

In [None]:
# Loading the dataset
df = pd.read_csv('OnlineNewsPopularity.csv')

# Displaying the first five rows
df.head()


In [None]:
# Displaying dataset shape
print(f"Dataset contains {df.shape[0]} rows and {df.shape[1]} columns.")


In [None]:
# Listing all columns
df.columns.tolist()


### Removing Leading and Trailing Whitespace

In [None]:
# Removing leading and trailing whitespaces from column names
df.columns = df.columns.str.strip()

# Displaying cleaned column names
print("\nCleaned Column Names:")
print(df.columns.tolist())


In [None]:
# Dropping non-predictive columns
df_clean = df.drop(['url', 'timedelta'], axis=1)

# Verify the columns after cleaning and dropping
print("\nColumns after cleaning and dropping non-predictive features:")
print(df_clean.columns.tolist())

---

## Exploratory Data Analysis (EDA)
### 1. Understanding the Target Variable

In [None]:
# Description of target variable
print("Description of 'shares' variable:")
print(df['shares'].describe())


In [None]:
# Distribution of 'shares'
plt.figure(figsize=(10,6))
sns.histplot(df['shares'], bins=50, kde=True)
plt.title('Distribution of Number of Shares')
plt.xlabel('Number of Shares')
plt.ylabel('Frequency')
plt.show()


The distribution of shares is heavily right-skewed: most articles receive relatively few shares, while a small number go viral.

### Log Transformation of Target Variable

In [None]:
# Applying log transformation to the target variable
if 'shares' in df_clean.columns:
    df_clean['log_shares'] = np.log1p(df_clean['shares'])
    print("'log_shares' column created successfully.")
else:
    print("'shares' column not found in df_clean.")

# Plotting the transformed target variable
plt.figure(figsize=(10,6))
sns.histplot(df_clean['log_shares'], bins=50, kde=True)  # 'log_shares' lives in df_clean, not df
plt.title('Distribution of Log-Transformed Number of Shares')
plt.xlabel('log(1 + Number of Shares)')
plt.ylabel('Frequency')
plt.show()


The log transformation compresses the long right tail and makes the distribution much closer to symmetric, which tends to suit regression models better than the raw counts.
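
A minimal sketch of the transformation and its inverse, using a handful of made-up share counts (illustrative values, not taken from the dataset). It shows how `log1p` pulls in an extreme value, and how `expm1` maps a number on the log scale back to a share count, which would be needed if a model were trained on `log_shares` and its predictions reported as shares.

```python
import math

# Hypothetical share counts, chosen only to illustrate a long right tail
shares = [500, 900, 1400, 2800, 690000]

# log1p(x) = log(1 + x); defined at x = 0, unlike log(x)
log_shares = [math.log1p(s) for s in shares]

# expm1(y) = exp(y) - 1 is the inverse of log1p
recovered = [math.expm1(v) for v in log_shares]

# Ratio of largest to smallest value before and after the transform
raw_ratio = max(shares) / min(shares)
log_ratio = max(log_shares) / min(log_shares)

print([round(v, 2) for v in log_shares])
print(f"max/min ratio: raw = {raw_ratio:.0f}, log scale = {log_ratio:.2f}")
```

The same pair of functions is available as `np.log1p` and `np.expm1` for whole columns.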

### 2. Checking for Missing Values

In [None]:
# Checking for missing values
missing_values = df.isnull().sum()
print("Missing values in each column:")
print(missing_values[missing_values > 0])


There are no missing values in the dataset.

### 3. Statistical Summary

In [None]:
# Statistical summary
df.describe()


### 4. Correlation Analysis

In [None]:
# Correlation matrix
plt.figure(figsize=(20,20))
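# Use df_clean (numeric columns only) and leave out the derived 'log_shares' column
corr_matrix = df_clean.drop(columns=['log_shares'], errors='ignore').corr()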
sns.heatmap(corr_matrix, annot=False, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()


#### Correlation with Target Variable



In [None]:
# Sorted in descending order: most positive first, most negative last
corr_with_target = corr_matrix['shares'].sort_values(ascending=False)

print("Top 10 features positively correlated with shares:")
print(corr_with_target.drop('shares').head(10))

print("\nTop 10 features negatively correlated with shares:")
print(corr_with_target.drop('shares').tail(10).sort_values())
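
For reference, `DataFrame.corr()` computes the Pearson coefficient by default. The sketch below works it out by hand on three toy columns (hypothetical values and names, not from the dataset) and ranks them the same way as the cell above, so it is easy to see which end of a descending sort holds the positive and the negative correlations.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy columns, not taken from the dataset
target = [1.0, 2.0, 3.0, 4.0, 5.0]
features = {
    "rises_with_target": [2.0, 4.0, 6.0, 8.0, 10.0],
    "falls_with_target": [5.0, 4.0, 3.0, 2.0, 1.0],
    "unrelated": [1.0, -1.0, 0.0, -1.0, 1.0],
}

corr = {name: pearson_r(values, target) for name, values in features.items()}

# Descending sort: most positive first, most negative last
ranked = sorted(corr.items(), key=lambda item: item[1], reverse=True)
for name, r in ranked:
    print(f"{name}: {r:+.2f}")
```

A strongly negative coefficient is as informative as a strongly positive one, which is why ranking by absolute value is a common choice when correlation is used for feature selection.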

### Feature Distributions

In [None]:
# Selecting top 5 positively and negatively correlated features
top_positive_features = corr_with_target.iloc[1:6].index.tolist()  # Top 5 positive (position 0 is 'shares' itself)
top_negative_features = corr_with_target.iloc[-5:].index.tolist()  # Top 5 negative

# Combine both lists
top_features = top_positive_features + top_negative_features

# Plotting distributions
for feature in top_features:
    plt.figure(figsize=(8,4))
    sns.scatterplot(x=df_clean[feature], y=df_clean['shares'])
    plt.title(f'Shares vs {feature}')
    plt.xlabel(feature)
    plt.ylabel('Shares')
    plt.show()


## Data Preprocessing

### Feature Selection, Train/Test Split, and Scaling

In [None]:
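# Target variable, as defined in the Problem Statement
target = 'shares'

# Using absolute correlation for feature selection
corr_with_target_abs = corr_matrix[target].abs().sort_values(ascending=False)

# Excluding the target and its log transform, which would leak the answer,
# and keeping only columns that exist in df_clean
candidates = corr_with_target_abs.drop(labels=[target, 'log_shares'], errors='ignore')
candidates = candidates[candidates.index.isin(df_clean.columns)]

# Selecting top 30 features based on absolute correlation
top_corr_features = candidates.head(30).index.tolist()

# Subset the data
X_selected = df_clean[top_corr_features]
y = df_clean[target]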

# Splitting into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.2, random_state=42)

# Feature Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Converting back to DataFrame
X_train_scaled = pd.DataFrame(X_train_scaled, columns=top_corr_features, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=top_corr_features, index=X_test.index)
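
The scaling step fits on the training set and only transforms the test set. A minimal sketch of what that means for a single feature, assuming the standard z-score formula that `StandardScaler` applies (mean and population standard deviation learned from the training data); the numbers are hypothetical.

```python
import statistics

# Hypothetical single feature, split into training and test values
train_values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
test_values = [3.0, 9.0]

# "fit": learn the mean and population standard deviation from the training data only
mean = statistics.fmean(train_values)
std = statistics.pstdev(train_values)

# "transform": apply the same training statistics to both splits
train_scaled = [(v - mean) / std for v in train_values]
test_scaled = [(v - mean) / std for v in test_values]

print(f"mean = {mean}, std = {std}")
print("test values on the training scale:", test_scaled)
```

Reusing the training statistics keeps information about the test set out of the fitted preprocessing, so the test score remains an honest estimate.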


## Feature Engineering

### Feature Importance Using Random Forest

In [None]:
# Ensure 'shares' is not in the feature list
top_corr_features = [feature for feature in top_corr_features if feature != 'shares']

# Training the Random Forest model with the corrected feature set
rf_initial = RandomForestRegressor(n_estimators=100, random_state=42)
rf_initial.fit(X_train_scaled[top_corr_features], y_train)

# Getting feature importances
importances = rf_initial.feature_importances_
feature_importances = pd.Series(importances, index=top_corr_features).sort_values(ascending=False)

# Plotting feature importances
plt.figure(figsize=(12,8))
sns.barplot(x=feature_importances, y=feature_importances.index)
plt.title('Feature Importances')
plt.xlabel('Importance Score')
plt.ylabel('Features')
plt.show()


## Model Building and Training

Train multiple regression models to compare their performance.

In [None]:
# Initializing and training Linear Regression
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
y_pred_lr = lr.predict(X_test_scaled)

# Initializing and training Decision Tree Regressor
dt = DecisionTreeRegressor(random_state=42)
dt.fit(X_train_scaled, y_train)
y_pred_dt = dt.predict(X_test_scaled)

# Initializing and training K-Nearest Neighbors Regressor
knn = KNeighborsRegressor()
knn.fit(X_train_scaled, y_train)
y_pred_knn = knn.predict(X_test_scaled)

# Initializing and training Random Forest Regressor
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train_scaled, y_train)
y_pred_rf = rf.predict(X_test_scaled)


### Hyperparameter Tuning with GridSearchCV

Optimize the Random Forest model.

In [None]:
# Defining parameter grid
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

# Initializing GridSearchCV
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           cv=3, n_jobs=-1, verbose=2, scoring='r2')

# Fitting GridSearchCV
grid_search.fit(X_train_scaled, y_train)

# Best parameters
print("Best parameters found: ", grid_search.best_params_)

# Best estimator
best_rf = grid_search.best_estimator_

# Predicting with the best estimator
y_pred_best_rf = best_rf.predict(X_test_scaled)
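
To get a feel for the cost of this search before running it, the number of model fits can be counted from the grid alone. The sketch below enumerates the same `param_grid` with the standard library; `cv_folds` is a name introduced here for illustration and mirrors `cv=3` in the cell above.

```python
from itertools import product

# Same grid as in the tuning cell
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
}
cv_folds = 3  # mirrors cv=3

# Every combination of values, one dict per candidate model
keys = list(param_grid)
candidates = [dict(zip(keys, combo)) for combo in product(*param_grid.values())]

n_candidates = len(candidates)
n_fits = n_candidates * cv_folds  # one fit per candidate per fold

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

`GridSearchCV` then refits the best candidate once more on the full training set, since `refit=True` is its default, so adding values to any list in the grid multiplies the total training time.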


## Model Evaluation

Evaluate all models using MAE, MSE, RMSE, and R².

In [None]:
# Defining evaluation function
def evaluate_model(y_true, y_pred, model_name):
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)
    print(f"--- {model_name} ---")
    print(f"MAE: {mae:.2f}")
    print(f"MSE: {mse:.2f}")
    print(f"RMSE: {rmse:.2f}")
    print(f"R²: {r2:.4f}")
    print("\n")
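
As a sanity check on what these four metrics measure, the sketch below computes them by hand on three hypothetical values, using the standard definitions that the scikit-learn functions implement. It also includes a deliberately poor set of predictions to show that R² is not bounded below by zero.

```python
import math

# Hypothetical true and predicted values
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / n   # mean absolute error
mse = sum(e ** 2 for e in errors) / n   # mean squared error
rmse = math.sqrt(mse)                   # same units as the target

# R² compares the model's squared error with that of always predicting the mean
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

# Predictions that are worse than the mean give a negative R²
bad_pred = [9.0, 5.0, 1.0]
ss_res_bad = sum((t - p) ** 2 for t, p in zip(y_true, bad_pred))
r2_bad = 1 - ss_res_bad / ss_tot

print(f"MAE={mae:.2f}  MSE={mse:.2f}  RMSE={rmse:.2f}  R²={r2:.3f}")
print(f"R² for the poor predictions: {r2_bad:.1f}")
```

MAE and RMSE are expressed in the units of the target (shares), while R² is unitless, which makes it the easier metric to compare across models.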


In [None]:
# Evaluating Linear Regression
evaluate_model(y_test, y_pred_lr, "Linear Regression")

# Evaluating Decision Tree Regressor
evaluate_model(y_test, y_pred_dt, "Decision Tree Regressor")

# Evaluating K-Nearest Neighbors Regressor
evaluate_model(y_test, y_pred_knn, "K-Nearest Neighbors Regressor")

# Evaluating Random Forest Regressor
evaluate_model(y_test, y_pred_rf, "Random Forest Regressor")

# Evaluating Tuned Random Forest Regressor
evaluate_model(y_test, y_pred_best_rf, "Tuned Random Forest Regressor")


In [None]:
# Comparing R² scores
models = ['Linear Regression', 'Decision Tree', 'KNN', 'Random Forest', 'Tuned Random Forest']
r2_scores = [
    r2_score(y_test, y_pred_lr),
    r2_score(y_test, y_pred_dt),
    r2_score(y_test, y_pred_knn),
    r2_score(y_test, y_pred_rf),
    r2_score(y_test, y_pred_best_rf)
]

plt.figure(figsize=(10,6))
sns.barplot(x=models, y=r2_scores)
plt.title('R² Scores of Different Models')
plt.ylabel('R² Score')
plt.xlabel('Models')
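# R² is negative when a model does worse than predicting the mean, so extend the axis if needed
plt.ylim(min(0, min(r2_scores)), 1)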
plt.show()


## Results and Discussion

### Residual Analysis

In [None]:
# Residuals for Tuned Random Forest
residuals = y_test - y_pred_best_rf

plt.figure(figsize=(10,6))
sns.histplot(residuals, bins=50, kde=True)
plt.title('Distribution of Residuals (Tuned Random Forest)')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.show()

# Residuals vs Predicted
plt.figure(figsize=(10,6))
sns.scatterplot(x=y_pred_best_rf, y=residuals)
plt.axhline(0, color='red', linestyle='--')
plt.title('Residuals vs Predicted Values (Tuned Random Forest)')
plt.xlabel('Predicted Shares')
plt.ylabel('Residuals')
plt.show()


In [None]:
# Initialize SHAP explainer
explainer = shap.Explainer(best_rf, X_train_scaled)
shap_values = explainer(X_test_scaled)

# Summary plot
shap.summary_plot(shap_values, X_test_scaled)
