<a href="https://colab.research.google.com/github/Desmyk/ADVMachineLearning/blob/main/Web_Traffic_Capstone(ADVML).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ***Advanced Machine Learning Capstone: Web Traffic Time Series Forecasting using Wikipedia data***

**Author:** Michael Mbugua  
**Project Type:** Machine Learning Capstone  
**Dataset Source:** [Kaggle - Web Traffic Time Series Forecasting](https://www.kaggle.com/c/web-traffic-time-series-forecasting/data?select=train_1.csv.zip)

---

## ***Objective***

The goal of this capstone project is to develop a hybrid machine learning model that can accurately forecast daily web traffic for Wikipedia pages by combining:

1. **Temporal patterns** (using LSTM/Transformers)

2. **Page relationships** (using Graph Neural Networks)

3. **Metadata features** (language, access type, etc.)

This system has practical applications in:

- Resource allocation for Wikipedia servers

- Anomaly detection (e.g., bot attacks or viral content)

- Content delivery network optimization

---

## ***Dataset Overview***

The dataset contains daily web traffic for about 145,000 Wikipedia pages from 2015-07-01 to 2016-12-31, with the following features:

### **Core Features**

- **Page**: Page identifier combining the article title, project, access type, and agent (e.g., "2NE1_zh.wikipedia.org_all-access_spider")

- **Date**: Daily timestamp

- **Visits**: Number of daily page views (integer, sparse)

### **Extracted Metadata Features**

From page URLs we derive:

**Content Features:**

- Page title (e.g., "Apple_Inc")

- Language code (e.g., "de", "en")

- Access type (all-access/desktop/mobile-web)

- Agent type (all-agents/spider)
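
As a minimal sketch of how these fields could be pulled from a page name, the snippet below assumes the layout `<title>_<language>.wikipedia.org_<access>_<agent>` seen in the sample rows. The names `PAGE_RE` and `parse_page` are hypothetical helpers, not part of the notebook, and names that do not fit the assumed layout fall back to "unknown".

```python
import re

# Assumed layout: <title>_<lang>.wikipedia.org_<access>_<agent>
# The title may itself contain underscores, so it is matched greedily.
PAGE_RE = re.compile(
    r"^(?P<title>.+)_(?P<lang>[a-z]{2})\.wikipedia\.org_"
    r"(?P<access>[^_]+)_(?P<agent>[^_]+)$"
)

def parse_page(page):
    """Split a page name into title, language, access type, and agent."""
    match = PAGE_RE.match(page)
    if match is None:
        # Hypothetical fallback for names that do not follow the assumed layout
        return {"title": page, "lang": "unknown",
                "access": "unknown", "agent": "unknown"}
    return match.groupdict()

print(parse_page("52_Hz_I_Love_You_zh.wikipedia.org_all-access_spider"))
```

Anchoring the pattern on `.wikipedia.org_` rather than splitting on underscores keeps multi-word titles such as `52_Hz_I_Love_You` intact.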

**Temporal Features:**

- Day of week

- Month

- Special events (holidays/extremes)
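
The calendar features can be derived directly from the date column. The sketch below is illustrative only: the column names `DayOfWeek`, `Month`, and `IsWeekend` are assumptions, and holiday or extreme-event flags are left out because they would need an external calendar that the notebook does not specify.

```python
import pandas as pd

# First week of the dataset's date range
calendar = pd.DataFrame({"Date": pd.date_range("2015-07-01", periods=7, freq="D")})

# pandas encodes Monday as 0 and Sunday as 6
calendar["DayOfWeek"] = calendar["Date"].dt.dayofweek
calendar["Month"] = calendar["Date"].dt.month
# Weekend flag as a compact 0/1 column (an assumed extra feature)
calendar["IsWeekend"] = (calendar["DayOfWeek"] >= 5).astype("uint8")

print(calendar)
```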

**Graph Features:**

- Page similarity network

- Community detection clusters
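
The notebook does not show how the similarity network or the clusters are built. One possible sketch, under the assumption that two pages are "similar" when their log-scaled traffic series are strongly correlated, links such pages and uses connected components as a simple stand-in for community detection. The helpers `similarity_edges` and `connected_components` and the 0.8 threshold are hypothetical.

```python
import numpy as np

def similarity_edges(series, threshold=0.8):
    """Return (i, j) page pairs whose log-traffic correlation is >= threshold."""
    corr = np.corrcoef(np.log1p(np.asarray(series, dtype=float)))
    n = corr.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if corr[i, j] >= threshold]

def connected_components(n_nodes, edges):
    """Label each node with the smallest node index in its component."""
    parent = list(range(n_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)  # keep the smaller index as root
    return [find(i) for i in range(n_nodes)]

# Toy traffic: rows are pages, columns are days (made-up numbers)
toy = [[1, 2, 3, 4, 5],
       [2, 4, 6, 8, 10],
       [5, 4, 3, 2, 1]]
edges = similarity_edges(toy)
print(edges, connected_components(len(toy), edges))
```

A dedicated community-detection algorithm would replace `connected_components` in practice; the stand-in only shows where cluster labels would come from.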

## ***Key Challenges***

- **Extreme sparsity:** 37% of visit counts are zero

- **Power-law distribution:** A few pages receive most of the traffic

- **Complex seasonality:** Weekly and yearly patterns plus event spikes
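
The first two challenges can be quantified with two small measurements. The sketch below is illustrative: `zero_fraction` and `top_share` are hypothetical helper names, and the toy matrix holds made-up numbers rather than values from the dataset.

```python
import numpy as np

def zero_fraction(visits):
    """Share of non-missing daily counts that are exactly zero."""
    values = np.asarray(visits, dtype=float)
    observed = values[~np.isnan(values)]
    return float((observed == 0).mean())

def top_share(page_totals, top_frac=0.01):
    """Share of total traffic captured by the top `top_frac` of pages."""
    totals = np.sort(np.asarray(page_totals, dtype=float))[::-1]
    k = max(1, int(np.ceil(top_frac * totals.size)))
    return float(totals[:k].sum() / totals.sum())

# Toy matrix: rows are pages, columns are days (NaN marks a missing day)
toy = np.array([[0, 0, 1, np.nan],
                [5, 0, 2, 3],
                [900, 800, 950, 1000]])

print(zero_fraction(toy))
print(top_share(np.nansum(toy, axis=1), top_frac=0.3))
```

Missing values are excluded from the zero count so that an unrecorded day is not mistaken for a day without visits.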

Let's begin by loading and exploring the dataset:

In [55]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.seasonal import seasonal_decompose
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
import warnings
warnings.filterwarnings('ignore')



## Loading and Initial Inspection

In [56]:
# To save memory, load only the first 1,000 rows while checking the structure
df = pd.read_csv('/content/train_1.csv', nrows=1000)
df.head()

Unnamed: 0,Page,2015-07-01,2015-07-02,2015-07-03,2015-07-04,2015-07-05,2015-07-06,2015-07-07,2015-07-08,2015-07-09,...,2016-12-22,2016-12-23,2016-12-24,2016-12-25,2016-12-26,2016-12-27,2016-12-28,2016-12-29,2016-12-30,2016-12-31
0,2NE1_zh.wikipedia.org_all-access_spider,18.0,11.0,5.0,13.0,14.0,9.0,9.0,22.0,26.0,...,32.0,63.0,15.0,26.0,14.0,20.0,22.0,19.0,18.0,20.0
1,2PM_zh.wikipedia.org_all-access_spider,11.0,14.0,15.0,18.0,11.0,13.0,22.0,11.0,10.0,...,17.0,42.0,28.0,15.0,9.0,30.0,52.0,45.0,26.0,20.0
2,3C_zh.wikipedia.org_all-access_spider,1.0,0.0,1.0,1.0,0.0,4.0,0.0,3.0,4.0,...,3.0,1.0,1.0,7.0,4.0,4.0,6.0,3.0,4.0,17.0
3,4minute_zh.wikipedia.org_all-access_spider,35.0,13.0,10.0,94.0,4.0,26.0,14.0,9.0,11.0,...,32.0,10.0,26.0,27.0,16.0,11.0,17.0,19.0,10.0,11.0
4,52_Hz_I_Love_You_zh.wikipedia.org_all-access_s...,,,,,,,,,,...,48.0,9.0,25.0,13.0,3.0,11.0,27.0,13.0,36.0,10.0


In [57]:
print("\nMissing values:\n", df.isna().sum())


Missing values:
 Page           0
2015-07-01    65
2015-07-02    65
2015-07-03    67
2015-07-04    64
              ..
2016-12-27     9
2016-12-28    10
2016-12-29     9
2016-12-30     8
2016-12-31     9
Length: 551, dtype: int64


In [58]:
print(f"Data shape: {df.shape}")
df.info()

Data shape: (1000, 551)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Columns: 551 entries, Page to 2016-12-31
dtypes: float64(550), object(1)
memory usage: 4.2+ MB


In [59]:
# find the 'object column' (most data is numerical)
object_columns = df.select_dtypes(include=['object']).columns
print(object_columns)

Index(['Page'], dtype='object')


In [60]:
# Explore the numerical data using descriptive statistics
df.describe()

Unnamed: 0,2015-07-01,2015-07-02,2015-07-03,2015-07-04,2015-07-05,2015-07-06,2015-07-07,2015-07-08,2015-07-09,2015-07-10,...,2016-12-22,2016-12-23,2016-12-24,2016-12-25,2016-12-26,2016-12-27,2016-12-28,2016-12-29,2016-12-30,2016-12-31
count,935.0,935.0,933.0,936.0,936.0,936.0,939.0,940.0,940.0,939.0,...,992.0,991.0,990.0,991.0,992.0,991.0,990.0,991.0,992.0,991.0
mean,24.190374,20.251337,17.499464,17.797009,19.370726,14.28312,16.14164,19.212766,24.825532,23.255591,...,26.772177,27.564077,26.110101,26.270434,28.287298,26.702321,28.869697,28.638749,26.739919,28.193744
std,205.152589,204.60279,111.479464,113.336268,115.707303,109.435244,87.550596,92.067146,128.433328,127.413396,...,160.167573,139.266722,132.44628,132.271681,153.228522,137.455966,142.725111,151.953976,141.797581,137.473299
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2.0,2.0,4.0,4.0,5.0,3.0,3.0,4.0,6.0,5.0,...,8.0,9.0,8.0,9.0,7.0,8.0,10.0,10.0,8.0,8.0
50%,7.0,7.0,7.0,8.0,9.0,5.0,7.0,9.0,11.0,12.0,...,12.0,14.0,12.0,15.0,12.0,13.0,16.0,16.0,13.0,13.0
75%,20.0,14.0,14.0,16.25,18.0,9.0,14.0,17.0,25.0,20.0,...,21.0,24.0,20.0,21.0,20.0,23.0,26.0,25.0,22.0,21.0
max,6051.0,6123.0,3213.0,3417.0,3465.0,3195.0,2568.0,2566.0,3855.0,3639.0,...,4850.0,4175.0,3941.0,3927.0,4152.0,4047.0,4315.0,4613.0,3883.0,3927.0


In [61]:
# let's see the columns in the dataset
print(df.columns)

Index(['Page', '2015-07-01', '2015-07-02', '2015-07-03', '2015-07-04',
       '2015-07-05', '2015-07-06', '2015-07-07', '2015-07-08', '2015-07-09',
       ...
       '2016-12-22', '2016-12-23', '2016-12-24', '2016-12-25', '2016-12-26',
       '2016-12-27', '2016-12-28', '2016-12-29', '2016-12-30', '2016-12-31'],
      dtype='object', length=551)


## Data Characteristics


### **Traffic Distribution**

In [62]:
# 1. Melt the DataFrame to combine date columns into a single 'visits' column
df_melted = pd.melt(df, id_vars=['Page'], value_vars=df.columns[1:], # Exclude 'Page' column
                    var_name='Date', value_name='visits')
# 2. Convert 'Date' to datetime if it's not already
df_melted['Date'] = pd.to_datetime(df_melted['Date'])


In [20]:
import plotly.express as px
import numpy as np

# Apply log transform and drop NaNs
log_visits = np.log1p(df_melted['visits'].dropna())

# Create interactive histogram
fig = px.histogram(
    log_visits,
    nbins=50,
    title="Log-Scaled Visits Distribution",
    labels={"value": "log(Visits + 1)", "count": "Frequency"},
    template="plotly_white"
)

# Customize layout
fig.update_layout(
    xaxis_title="log(Visits + 1)",
    yaxis_title="Frequency",
    bargap=0.1
)

fig.show()
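
The histogram uses `log1p` rather than a plain logarithm because the counts include zeros. A short illustration of the transform and its inverse follows. The sample values are chosen for illustration, with 6051 being the 2015-07-01 maximum from the `describe()` output above.

```python
import numpy as np

visits = np.array([0.0, 1.0, 10.0, 6051.0])

log_visits = np.log1p(visits)   # log(1 + x), which is defined at x = 0
restored = np.expm1(log_visits)  # exp(y) - 1, the inverse transform

print(log_visits.round(3))
print(np.allclose(restored, visits))
```

If a model were trained on log-scaled visits, `np.expm1` would be the step that maps its predictions back to raw counts.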



### **Temporal Patterns (Seasonality)**

In [21]:
import plotly.graph_objects as go
# Group by date and calculate daily traffic
daily_traffic = df_melted.groupby('Date')['visits'].sum().reset_index()

# 7-day rolling average
daily_traffic['7_day_avg'] = daily_traffic['visits'].rolling(7).mean()

# Create the figure
fig = go.Figure()

# Raw daily visits
fig.add_trace(go.Scatter(
    x=daily_traffic['Date'],
    y=daily_traffic['visits'],
    mode='lines',
    name='Daily Visits',
    line=dict(color='lightblue'),
    hovertemplate='Date: %{x}<br>Visits: %{y:.0f}<extra></extra>'
))

# 7-day rolling average
fig.add_trace(go.Scatter(
    x=daily_traffic['Date'],
    y=daily_traffic['7_day_avg'],
    mode='lines',
    name='7-Day Rolling Average',
    line=dict(color='darkblue', width=2),
    hovertemplate='Date: %{x}<br>7-Day Avg: %{y:.0f}<extra></extra>'
))

# Layout tweaks
fig.update_layout(
    title="Wikipedia Daily Traffic with Weekly Rolling Average",
    xaxis_title="Date",
    yaxis_title="Total Visits",
    template="plotly_white",
    hovermode="x unified",
    height=400
)

fig.show()
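
One detail of the rolling average is easy to miss: `rolling(7)` returns NaN until seven observations are available, so the smoothed line starts six days after the raw one. The toy series below (made-up numbers) sketches this behavior next to the `min_periods=1` alternative.

```python
import pandas as pd

daily = pd.Series(
    [10, 12, 9, 30, 11, 8, 10, 13],
    index=pd.date_range("2015-07-01", periods=8, freq="D"),
)

strict = daily.rolling(7).mean()                  # NaN for the first six days
lenient = daily.rolling(7, min_periods=1).mean()  # averages whatever is available

print(pd.DataFrame({"visits": daily, "strict": strict, "lenient": lenient}))
```

The strict form avoids averages based on very few points, while the lenient form keeps the series the same length as the input.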


## Feature Engineering Strategy

### **Metadata Extraction**

In [65]:
# Page names follow <title>_<lang>.wikipedia.org_<access>_<agent>,
# so the fields are separated by underscores, not dots
# Extract language
df['Language'] = df['Page'].str.extract(r'_([a-z]{2})\.wikipedia\.org_', expand=False)
# Extract access type
df['Access'] = df['Page'].str.extract(r'\.org_(all-access|desktop|mobile-web)_', expand=False)

# Extract the article title (greedy match, since titles may contain underscores)
df['Title'] = df['Page'].str.extract(r'^(.*)_[a-z]{2}\.wikipedia\.org_', expand=False)

# Names that do not match the pattern get 'unknown'
df['Language'] = df['Language'].fillna('unknown')
df['Access'] = df['Access'].fillna('unknown')

# One-hot encode with clearer column names & optimized dtypes:
# Language and Access become binary (0/1) columns, one per category
df = pd.get_dummies(df, columns=['Language', 'Access'], prefix=['Lang', 'Access'], dtype='uint8')

### **Temporal Patterns**

In [69]:
# Weekly aggregation by Language. This needs the long-format df_melted (one row per
# page and date): the wide df has no 'Date' column, and its Language column was
# replaced by one-hot columns above, so the language is re-derived from 'Page'
traffic = df_melted.assign(
    Language=df_melted['Page']
    .str.extract(r'_([a-z]{2})\.wikipedia\.org_', expand=False)
    .fillna('unknown')
)
weekly = traffic.groupby([pd.Grouper(key='Date', freq='W'), 'Language'])['visits'].sum().reset_index()

# Normalize per week
weekly['WeeklyMax'] = weekly.groupby('Date')['visits'].transform('max')
weekly['Normalized'] = weekly['visits'] / weekly['WeeklyMax']

# Pivot for heatmap
heatmap_data = weekly.pivot(index='Date', columns='Language', values='Normalized')

# Interactive heatmap
fig = px.imshow(
    heatmap_data.T,  # Transpose so languages are rows
    labels=dict(x="Week", y="Language", color="Normalized Visits"),
    x=heatmap_data.index,
    y=heatmap_data.columns,
    aspect="auto",
    color_continuous_scale='YlGnBu'
)
fig.update_layout(title='Normalized Weekly Traffic by Language')
fig.show()

## Modeling Approach

### **LSTM with Feature Embeddings**

In [52]:
num_samples = 1000
seq_length = 30
num_features = 4
lang_vocab = 50

# Dummy data
X_seq = np.random.rand(num_samples, seq_length, num_features)
X_lang = np.random.randint(0, lang_vocab, size=(num_samples, 1))
X_access = np.random.randint(0, 2, size=(num_samples, 1))
y_target = np.random.rand(num_samples)


In [53]:
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding, Concatenate, Flatten, Dropout
from tensorflow.keras.models import Model

def build_traffic_model(seq_length=30, num_features=4, lang_vocab=50):
    # 1. Time-series input (Visits, 7d_avg, 28d_trend, IsHoliday)
    num_input = Input(shape=(seq_length, num_features), name="Num_Features")
    lstm_out = LSTM(64, return_sequences=False)(num_input)

    # 2. Language embedding
    lang_input = Input(shape=(1,), name="Language")
    lang_embed = Embedding(input_dim=lang_vocab, output_dim=8)(lang_input)
    lang_embed_flat = Flatten()(lang_embed)

    # 3. Access type embedding
    access_input = Input(shape=(1,), name="AccessType")
    access_embed = Embedding(input_dim=2, output_dim=2)(access_input)
    access_embed_flat = Flatten()(access_embed)

    # 4. Combine all
    combined = Concatenate()([lstm_out, lang_embed_flat, access_embed_flat])
    x = Dense(64, activation='relu')(combined)
    x = Dropout(0.3)(x)
    x = Dense(32, activation='relu')(x)
    output = Dense(1, activation='linear', name="Output")(x)  # use 'sigmoid' for binary, etc.

    model = Model(inputs=[num_input, lang_input, access_input], outputs=output)
    return model


In [54]:
model = build_traffic_model()
model.compile(optimizer='adam', loss='mse', metrics=['mae'])

# Example usage
model.fit(
    x=[X_seq, X_lang, X_access],  # each shaped appropriately
    y=y_target,
    batch_size=32,
    epochs=10,
    validation_split=0.1
)


Epoch 1/10
29/29 ━━━━━━━━━━━━━━━━━━━━ 4s 31ms/step - loss: 0.2721 - mae: 0.4349 - val_loss: 0.0889 - val_mae: 0.2524
Epoch 2/10
29/29 ━━━━━━━━━━━━━━━━━━━━ 1s 16ms/step - loss: 0.0934 - mae: 0.2651 - val_loss: 0.0853 - val_mae: 0.2476
Epoch 3/10
29/29 ━━━━━━━━━━━━━━━━━━━━ 1s 18ms/step - loss: 0.0878 - mae: 0.2546 - val_loss: 0.0846 - val_mae: 0.2466
Epoch 4/10
29/29 ━━━━━━━━━━━━━━━━━━━━ 1s 21ms/step - loss: 0.0899 - mae: 0.2582 - val_loss: 0.0848 - val_mae: 0.2473
Epoch 5/10
29/29 ━━━━━━━━━━━━━━━━━━━━ 1s 26ms/step - loss: 0.0915 - mae: 0.2637 - val_loss: 0.0838 - val_mae: 0.2455
Epoch 6/10
29/29 ━━━━━━━━━━━━━━━━━━━━ 1s 26ms/step - loss: 0.0933 - mae: 0.2621 - val_loss: 0.0842 - val_mae: 0.2463
Epoch 7/10
29/29 ━━━━━━━━━━━━━━━━━━━━ 1s 20ms/step - loss: 0

<keras.src.callbacks.history.History at 0x7e61b579e110>

### ***Model Summary***
Inputs:
- [30 x 4] time series
- [1] language category
- [1] access type

Architecture:
- LSTM on numerical sequence
- Embedding + Flatten for categorical inputs
- Dense layers to learn interactions between the sequence and categorical features
- Output: single value (regression target)
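
The model above is trained on random dummy arrays. To feed it real data, each page's series would first have to be cut into `[30 x 4]` windows. The sketch below shows one way to do that with NumPy, assuming a `(time, num_features)` array whose columns correspond to the four channels named in the model comment (Visits, 7d_avg, 28d_trend, IsHoliday). `make_windows` is a hypothetical helper and the input here is a synthetic counter, not dataset values.

```python
import numpy as np

def make_windows(features, target, seq_length=30):
    """Slice a (time, num_features) array into overlapping windows.

    Window i covers steps [i, i + seq_length) and is paired with the
    target value at step i + seq_length (a one-step-ahead forecast).
    """
    features = np.asarray(features, dtype="float32")
    target = np.asarray(target, dtype="float32")
    n_windows = features.shape[0] - seq_length
    if n_windows <= 0:
        raise ValueError("series must be longer than seq_length")
    X = np.stack([features[i:i + seq_length] for i in range(n_windows)])
    y = target[seq_length:]
    return X, y

# Synthetic input: 100 time steps, 4 feature channels
steps, num_features = 100, 4
feats = np.arange(steps * num_features).reshape(steps, num_features)

X, y = make_windows(feats, feats[:, 0])
print(X.shape, y.shape)
```

Because each target lies strictly after its window, no future value leaks into the inputs. The resulting `X` matches the shape expected by the `Num_Features` input.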