<div style="color: #095AAD; font-weight: bold; font-size: 16px;">
    
# Wikipedia Data Collection - Gathering Interest Trends</div>

This notebook collects daily Wikipedia page view counts for key AI/ML terms as a proxy for public interest over time. The collected data will later be correlated with salary trends in the tech industry, to examine how public interest in a technology relates to job-market compensation.

**Data source**: [Wikimedia REST API](https://wikimedia.org/api/rest_v1/)
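
To make the request format concrete, here is a minimal sketch of how a per-article URL for this API can be assembled. The path layout mirrors the URL used in the collection function later in this notebook; `build_pageviews_url` is a hypothetical helper, and percent-encoding the title is an added precaution rather than something the notebook does.

```python
from datetime import date
from urllib.parse import quote

BASE_URL = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def build_pageviews_url(article: str, start: date, end: date) -> str:
    """Build a daily per-article pageviews URL for English Wikipedia."""
    # Percent-encode the title so characters such as '(' or '/' stay in one path segment
    title = quote(article, safe="")
    return (
        f"{BASE_URL}/en.wikipedia/all-access/all-agents/"
        f"{title}/daily/{start:%Y%m%d}/{end:%Y%m%d}"
    )


url = build_pageviews_url("Machine_learning", date(2020, 1, 1), date(2020, 1, 1))
print(url)
```

Dates go into the path as `YYYYMMDD`, which matches the `strftime('%Y%m%d')` call in the main collection loop.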

<div style="color: #095AAD; font-weight: bold; font-size: 16px;">
    
## Collection Strategy</div>

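I collect daily page views for five keywords that represent key areas of the tech industry. Each keyword maps to several spellings of one Wikipedia article title, and the views of all spellings are summed into a single daily total: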

| **Keyword** | **Wikipedia Articles** | **Category** |
|-------------|----------------------|--------------|
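| `chatgpt` | ChatGPT, chatgpt, Chatgpt | AI |
| `ai` | Artificial_intelligence, Artificial_Intelligence, artificial_intelligence | AI |
| `ml` | Machine_learning, Machine_Learning, machine_learning, Machine-learning | AI |
| `dl` | Deep_learning, Deep_Learning, deep_learning, Deep-learning | AI |
| `python` | Python_(programming_language), Python_programming_language, Python_language, python_language | Programming |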
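
The grouping in the table can be sketched as a small aggregation step: per-title counts go in, one total per keyword comes out. This is only an illustration; `aggregate_views` is a hypothetical helper, and the counts in `sample_views` are made up rather than taken from the API.

```python
# Title variants per keyword, copied from two rows of the table above
ARTICLE_GROUPS = {
    "chatgpt": ["ChatGPT", "chatgpt", "Chatgpt"],
    "ml": ["Machine_learning", "Machine_Learning", "machine_learning", "Machine-learning"],
}


def aggregate_views(article_groups, views_by_article):
    """Sum per-article views into one total per keyword; absent titles count as 0."""
    return {
        keyword: sum(views_by_article.get(title, 0) for title in variants)
        for keyword, variants in article_groups.items()
    }


# Made-up counts for illustration only, not real page view data
sample_views = {"ChatGPT": 100, "Chatgpt": 5, "Machine_learning": 40}
totals = aggregate_views(ARTICLE_GROUPS, sample_views)
print(totals)  # {'chatgpt': 105, 'ml': 40}
```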

<div style="color: #095AAD; font-weight: bold; font-size: 16px;">
    
## Dataset Structure</div>

After collection, the resulting dataset has the following columns:

| **Column** | **Description** | **Example** |
|------------|-----------------|-------------|
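| `date` | Day the page views were recorded (YYYY-MM-DD) | 2020-01-01, 2025-06-29 |
| `keyword` | Technology term | chatgpt, ai, ml, dl, python |
| `views` | Daily page views, summed across the keyword's article title variants | 80510, 19841, 4103 |
| `category` | Technology category | AI, Programming |
| `period` | Whether the date falls before or after the ChatGPT release (2022-11-30); the release day itself counts as `after` | before, after |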
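
A quick way to check rows against this structure is sketched below. It assumes rows are plain dictionaries with `date` stored as a `YYYY-MM-DD` string; `validate_row` is a hypothetical helper, not part of the notebook. The first sample row is taken from the notebook output, the second is deliberately wrong.

```python
from datetime import datetime

EXPECTED_COLUMNS = ("date", "keyword", "views", "category", "period")
CHATGPT_RELEASE = datetime(2022, 11, 30)  # same cutoff as get_period_label


def validate_row(row):
    """Return a list of problems found in one dataset row (empty list means OK)."""
    missing = [col for col in EXPECTED_COLUMNS if col not in row]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    day = datetime.strptime(row["date"], "%Y-%m-%d")
    if not isinstance(row["views"], int) or row["views"] < 0:
        problems.append("views must be a non-negative integer")
    if row["category"] not in ("AI", "Programming"):
        problems.append(f"unexpected category: {row['category']}")
    expected_period = "after" if day >= CHATGPT_RELEASE else "before"
    if row["period"] != expected_period:
        problems.append(f"period should be {expected_period!r}")
    return problems


good = {"date": "2020-01-01", "keyword": "ai", "views": 6572,
        "category": "AI", "period": "before"}
bad = dict(good, period="after")  # wrong label on purpose
print(validate_row(good))  # []
print(validate_row(bad))   # ["period should be 'before'"]
```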

<div style="color: #095AAD; font-weight: bold; font-size: 15px;">

### Importing required libraries</div>

In [1]:
import asyncio
import aiohttp
import pandas as pd
from datetime import datetime, timedelta
import time

<div style="color: #095AAD; font-weight: bold; font-size: 15px;">

### Data collection functions</div>

In [2]:
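# Get page views for a specific article on a specific date (YYYYMMDD).
# Returns 0 when the API has no data for that title/date or the request fails,
# so a failed request is indistinguishable from a day with zero views.
async def get_views_async(session, article, date):
    url = f'https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/all-access/all-agents/{article}/daily/{date}/{date}'
    headers = {'User-Agent': 'WikipediaAnalysis/1.0 (educational project)'}
    try:
        async with session.get(url, headers=headers) as response:
            if response.status == 200:
                data = await response.json()
                return data['items'][0]['views']
            else:
                return 0
    except (aiohttp.ClientError, asyncio.TimeoutError, KeyError, IndexError, ValueError):
        # Network errors and malformed responses count as zero views
        return 0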

# Classify dates as before/after ChatGPT release
def get_period_label(date):
    chatgpt_date = datetime(2022, 11, 30)
    return 'after' if date >= chatgpt_date else 'before'

# Categorize keywords into AI or Programming
def get_category(keyword):
    if keyword in ['chatgpt', 'ai', 'ml', 'dl']:
        return 'AI'
    elif keyword == 'python':
        return 'Programming'
    else:
        return 'Other'
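
`get_views_async` reads only `data['items'][0]['views']` because it requests a single day. The same endpoint takes a start and an end date in the URL, so one request could in principle cover a whole range and return one item per day. Below is a minimal sketch of parsing such a response, assuming each item carries a `timestamp` of the form `YYYYMMDDHH` and a `views` field (worth confirming against the API documentation). `parse_daily_views` is a hypothetical helper, and the counts in `sample_payload` are made up.

```python
def parse_daily_views(payload):
    """Map each day (YYYY-MM-DD) in a pageviews response to its view count."""
    views_by_day = {}
    for item in payload.get("items", []):
        stamp = item["timestamp"][:8]  # assumed format: YYYYMMDDHH, keep YYYYMMDD
        day = f"{stamp[:4]}-{stamp[4:6]}-{stamp[6:8]}"
        views_by_day[day] = int(item["views"])
    return views_by_day


# Made-up payload shaped like the assumed API response
sample_payload = {
    "items": [
        {"article": "Deep_learning", "timestamp": "2020010100", "views": 10},
        {"article": "Deep_learning", "timestamp": "2020010200", "views": 12},
    ]
}
print(parse_daily_views(sample_payload))
# {'2020-01-01': 10, '2020-01-02': 12}
```

A response without an `items` key yields an empty dictionary, which lets the caller tell "no data" apart from "zero views".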

<div style="color: #095AAD; font-weight: bold; font-size: 15px;">

### Main collection function</div>

In [3]:
# Simplified Wikipedia data collection for correlation analysis
async def collect_wikipedia_data_simple(start: datetime, end: datetime):
    
    article_groups = {
        'chatgpt': ['ChatGPT', 'chatgpt', 'Chatgpt'],
        'ai': ['Artificial_intelligence', 'Artificial_Intelligence', 'artificial_intelligence'],
        'ml': ['Machine_learning', 'Machine_Learning', 'machine_learning', 'Machine-learning'],
        'dl': ['Deep_learning', 'Deep_Learning', 'deep_learning', 'Deep-learning'],
        'python': ['Python_(programming_language)', 'Python_programming_language', 'Python_language', 'python_language']
    }
    
    dataset = []
    current_date = start
    
    start_time = time.time()
    
    async with aiohttp.ClientSession() as session:
        while current_date <= end:
            
            date_str = current_date.strftime('%Y%m%d')
            
            # Collect aggregated data only
            for keyword, variants in article_groups.items():
                tasks = []
                for article in variants:
                    task = get_views_async(session, article, date_str)
                    tasks.append(task)
                
                results = await asyncio.gather(*tasks)
                total_views = sum(results)
                
                record = {
                    'date': current_date.strftime('%Y-%m-%d'),
                    'keyword': keyword,
                    'views': total_views,
                    'category': get_category(keyword),
                    'period': get_period_label(current_date)
                }
                
                dataset.append(record)
            
            current_date += timedelta(days=1)
        
        # Save all data in one file
        df_final = pd.DataFrame(dataset)
        df_final['date'] = pd.to_datetime(df_final['date'])
        df_final.to_csv("wikipedia_data_complete.csv", index=False)
        
        print(f'Saved complete dataset with {len(df_final)} rows')
    
    end_time = time.time()
    print(f'Collection completed in {end_time - start_time:.2f} seconds')
    return df_final
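
The loop above awaits each keyword's requests before moving on, so days are processed one after another. One possible way to overlap more requests without flooding the API is to cap the number in flight with `asyncio.Semaphore`. The sketch below uses a stub coroutine instead of real HTTP calls; `gather_bounded`, `fake_fetch`, and `demo` are hypothetical names, and the limit of 3 is arbitrary.

```python
import asyncio


async def gather_bounded(factories, limit):
    """Run coroutine factories concurrently, with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def run(factory):
        async with semaphore:
            return await factory()

    # gather returns results in the same order as the input factories
    return await asyncio.gather(*(run(f) for f in factories))


async def demo():
    in_flight = 0
    peak = 0

    async def fake_fetch(n):
        # Stand-in for an HTTP request; tracks how many calls overlap
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return n * 10

    factories = [lambda n=n: fake_fetch(n) for n in range(8)]
    results = await gather_bounded(factories, limit=3)
    return results, peak


if __name__ == "__main__":
    # In a notebook cell, use `await demo()` instead of asyncio.run
    print(asyncio.run(demo()))
```

Passing factories rather than coroutine objects means no request starts before it has acquired the semaphore.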

<div style="color: #095AAD; font-weight: bold; font-size: 15px;">

### Collecting data</div>

Run the collection for the full date range, 2020-01-01 through 2025-06-29, to gather the Wikipedia page view trends.

In [4]:
start_date = datetime(2020, 1, 1)
end_date = datetime(2025, 6, 29)
await collect_wikipedia_data_simple(start_date, end_date)

Saved complete dataset with 10035 rows
Collection completed in 1481.96 seconds


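| | date | keyword | views | category | period |
|---|------|---------|-------|----------|--------|
| 0 | 2020-01-01 | chatgpt | 0 | AI | before |
| 1 | 2020-01-01 | ai | 6572 | AI | before |
| 2 | 2020-01-01 | ml | 2867 | AI | before |
| 3 | 2020-01-01 | dl | 1672 | AI | before |
| 4 | 2020-01-01 | python | 4685 | Programming | before |
| ... | ... | ... | ... | ... | ... |
| 10030 | 2025-06-29 | chatgpt | 80510 | AI | after |
| 10031 | 2025-06-29 | ai | 19841 | AI | after |
| 10032 | 2025-06-29 | ml | 4103 | AI | after |
| 10033 | 2025-06-29 | dl | 2176 | AI | after |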


**Collection summary:**

The collection gathers daily Wikipedia page views for 5 technology keywords from 2020-01-01 through 2025-06-29. The result is a dataset of 10,035 rows (one per keyword per day), saved to `wikipedia_data_complete.csv` for trend analysis and correlation with salary data.
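
As a sanity check on the reported row count, the expected number of rows follows directly from the date range and the number of keywords. A short sketch, using the same start and end dates as the collection cell:

```python
from datetime import datetime

start_date = datetime(2020, 1, 1)
end_date = datetime(2025, 6, 29)
n_keywords = 5  # chatgpt, ai, ml, dl, python

# Both endpoints are included, hence the +1
n_days = (end_date - start_date).days + 1
expected_rows = n_days * n_keywords

print(n_days, expected_rows)  # 2007 10035
```

The total matches the "Saved complete dataset with 10035 rows" message printed by the collection run.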