# Wikipedia Page Views

**Important: DO NOT CLEAR THE OUTPUT OF THIS NOTEBOOK AFTER EXECUTION!!!**

This notebook downloads Wikipedia page view statistics and creates a dictionary mapping doc_id -> page_views.

**Output:** `gs://db204905756/page_views/pageview.pkl`

## Setup

In [None]:
!pip install -q google-cloud-storage==1.43.0

In [None]:
!gcloud dataproc clusters list --region us-central1

In [None]:
import pickle
from collections import Counter, defaultdict
from pathlib import Path
from google.cloud import storage

# Bucket configuration
bucket_name = 'db204905756'

client = storage.Client()
bucket = client.bucket(bucket_name)

print(f"✅ Connected to bucket: {bucket_name}")

## Check if Page Views Already Exist

No need to download again if we already have it!

In [None]:
# Check if pageview.pkl already exists
existing = !gsutil ls gs://$bucket_name/page_views/pageview.pkl 2>/dev/null

if existing and 'pageview.pkl' in str(existing):
    print("✅ pageview.pkl already exists! You can skip the download.")
    print("   Location: gs://" + bucket_name + "/page_views/pageview.pkl")
    SKIP_DOWNLOAD = True
else:
    print("Page views not found. Will download and process.")
    SKIP_DOWNLOAD = False
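
The same existence check can also be done through the Python client instead of shelling out to `gsutil`. The following is a minimal sketch, assuming the `bucket` object created in Setup; `pageview_exists` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
def pageview_exists(bucket, blob_name="page_views/pageview.pkl"):
    """Return True if the pickled page view dictionary is already in the bucket.

    `bucket` is expected to be a google.cloud.storage Bucket, such as the one
    created in Setup. `pageview_exists` is a hypothetical helper name.
    """
    # Blob.exists() asks GCS whether an object with this name is present.
    return bucket.blob(blob_name).exists()

# Possible usage in this notebook (requires GCS credentials):
# SKIP_DOWNLOAD = pageview_exists(bucket)
```

This avoids parsing command output, at the cost of depending on the client library for the check.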

## Download Page View Data

**Skip this section if pageview.pkl already exists!**

Downloads the August 2021 monthly page view dump (~2.3 GB, bzip2-compressed) from Wikimedia and processes it.

In [None]:
# Only run if we need to download
if not SKIP_DOWNLOAD:
    pv_path = 'https://dumps.wikimedia.org/other/pageview_complete/monthly/2021/2021-08/pageviews-202108-user.bz2'
    p = Path(pv_path) 
    pv_name = p.name
    pv_temp = f'{p.stem}-4dedup.txt'
    
    print("Downloading page views file (~2.3GB)...")
    print("This will take about 10-15 minutes...")
    !wget -N $pv_path
    print("\n✅ Download complete!")
else:
    print("Skipping download - file already exists.")

In [None]:
# Only run if we need to process
if not SKIP_DOWNLOAD:
    pv_path = 'https://dumps.wikimedia.org/other/pageview_complete/monthly/2021/2021-08/pageviews-202108-user.bz2'
    p = Path(pv_path) 
    pv_name = p.name
    pv_temp = f'{p.stem}-4dedup.txt'
    
    print("Extracting and filtering English Wikipedia pages...")
    print("This will take a few minutes...")
    
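    # Keep English Wikipedia rows, then the article ID (field 3) and page views (field 5).
    # The final grep drops rows whose ID or view count is not a plain integer.
    # grep -E with a literal space is used instead of grep -P and \s for compatibility.
    !bzcat $pv_name | grep "^en\.wikipedia" | cut -d' ' -f3,5 | grep -E "^[0-9]+ [0-9]+$" > $pv_temp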
    
    print("✅ Extraction complete!")
    
    # Check file size
    !ls -lh $pv_temp
else:
    print("Skipping extraction - file already exists.")
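
For readers less familiar with the shell pipeline, the same filtering logic can be sketched in Python. This is an illustration only, not a replacement for the cell above: `filter_english_lines` is a hypothetical name, the sample rows are invented, and the field positions (3 for the article ID, 5 for the view count) are taken from the comment in the extraction cell.

```python
import re

# Same shape the final grep accepts: "<digits> <digits>".
_ID_VIEWS = re.compile(r"[0-9]+ [0-9]+")

def filter_english_lines(lines):
    """Yield (article_id, views) pairs for English Wikipedia rows only."""
    for line in lines:
        # Equivalent of: grep "^en\.wikipedia"
        if not line.startswith("en.wikipedia"):
            continue
        fields = line.rstrip("\n").split(" ")
        if len(fields) < 5:
            continue
        # Equivalent of: cut -d' ' -f3,5 followed by the integer-only grep
        candidate = f"{fields[2]} {fields[4]}"
        if _ID_VIEWS.fullmatch(candidate):
            yield int(fields[2]), int(fields[4])

# Invented sample rows, for illustration only.
sample = [
    "en.wikipedia Example_Page 12345 desktop 42 A1B2\n",
    "de.wikipedia Beispiel 777 desktop 9 A9\n",
    "en.wikipedia No_Id_Page null mobile-web 5 A5\n",
]
kept = list(filter_english_lines(sample))
print(kept)
```

Only the first sample row survives: the second is not English Wikipedia, and the third has a non-numeric ID.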

In [None]:
# Only run if we need to process
if not SKIP_DOWNLOAD:
    pv_temp = 'pageviews-202108-user-4dedup.txt'
    
    print("Creating page view dictionary...")
    
    # Create a Counter that sums up page views for the same article
    wid2pv = Counter()
    
    with open(pv_temp, 'rt') as f:
        for i, line in enumerate(f):
            parts = line.strip().split(' ')
            if len(parts) == 2:
                try:
                    wid2pv[int(parts[0])] += int(parts[1])
                except ValueError:
                    continue
            
            # Progress indicator
            if i % 1000000 == 0 and i > 0:
                print(f"  Processed {i:,} lines...")
    
    print(f"\n✅ Created dictionary with {len(wid2pv):,} articles")
    
    # Copy into a defaultdict so missing doc IDs read as 0 (it is pickled as a plain dict when saved)
    page_view_dict = defaultdict(int)
    for doc_id, views in wid2pv.items():
        page_view_dict[doc_id] = views
    
    # Show some stats
    top_10 = wid2pv.most_common(10)
    print("\nTop 10 most viewed articles:")
    for doc_id, views in top_10:
        print(f"  Doc ID {doc_id}: {views:,} views")
else:
    print("Skipping processing - file already exists.")
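
The reason a `Counter` is used is that the same article ID can appear on more than one line of the filtered file, and those counts need to be summed. A small sketch of that behavior on made-up numbers (`sum_views` is a hypothetical helper name):

```python
from collections import Counter

def sum_views(pairs):
    """Sum view counts per article ID from (article_id, views) pairs."""
    totals = Counter()
    for article_id, views in pairs:
        totals[article_id] += views
    return totals

# Made-up data: article 12 appears twice, so its counts are added together.
pairs = [(12, 10), (7, 3), (12, 5)]
totals = sum_views(pairs)
print(totals.most_common(1))
print(totals[999])  # a missing ID reads as 0 and is not inserted
```

A `Counter` already returns 0 for IDs it has never seen, which is the same lookup behavior a `defaultdict(int)` gives, except that the `Counter` does not store the missing key.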

## Save to GCS

In [None]:
# Only run if we need to save
if not SKIP_DOWNLOAD:
    print("Saving page view dictionary...")
    
    # Save locally
    with open("pageview.pkl", 'wb') as f:
        pickle.dump(dict(page_view_dict), f)
    
    # Upload to GCS under page_views/ (replaces any existing pageview.pkl there; other objects are untouched)
    blob = bucket.blob('page_views/pageview.pkl')
    blob.upload_from_filename('pageview.pkl')
    
    print(f"\n✅ Saved to gs://{bucket_name}/page_views/pageview.pkl")
else:
    print("Skipping save - file already exists in GCS.")

## Verify

In [None]:
print("Files in page_views/:")
!gsutil ls -lh gs://$bucket_name/page_views/

In [None]:
# Test loading the file
print("Testing load from GCS...")

blob = bucket.blob('page_views/pageview.pkl')
contents = blob.download_as_bytes()
loaded_pv = pickle.loads(contents)

print(f"✅ Successfully loaded {len(loaded_pv):,} page view entries")

# Show sample
sample_ids = list(loaded_pv.keys())[:5]
print("\nSample entries:")
for doc_id in sample_ids:
    print(f"  Doc ID {doc_id}: {loaded_pv[doc_id]:,} views")
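
Beyond printing a few entries, a quick structural check can catch a corrupted or empty file. The following is a sketch under the assumption that every key and value should be a non-negative integer, as produced by the processing cell; `check_page_views` is a hypothetical helper name.

```python
def check_page_views(page_views):
    """Return a list of problems found in a doc_id -> views mapping."""
    problems = []
    if not page_views:
        problems.append("mapping is empty")
    for doc_id, views in page_views.items():
        if not isinstance(doc_id, int) or not isinstance(views, int):
            problems.append(f"non-integer entry: {doc_id!r} -> {views!r}")
        elif views < 0:
            problems.append(f"negative view count for {doc_id}")
    return problems

# Possible usage after the load above:
# assert not check_page_views(loaded_pv)
```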

## Clean Up (Optional)

Remove the large temporary files to free up disk space.

In [None]:
# Uncomment to clean up temporary files
# !rm -f pageviews-202108-user.bz2
# !rm -f pageviews-202108-user-4dedup.txt
# !rm -f pageview.pkl
# print("✅ Temporary files removed")

## Summary

### File Created:

| File | Location | Description |
|------|----------|-------------|
| pageview.pkl | page_views/ | Dictionary mapping doc_id -> page views |

### Usage in search_frontend.py:

```python
# Load page views
from google.cloud import storage
import pickle

client = storage.Client()
bucket = client.bucket('db204905756')
blob = bucket.blob('page_views/pageview.pkl')
page_views = pickle.loads(blob.download_as_bytes())

# Get page views for a document
doc_id = 12345
views = page_views.get(doc_id, 0)
```
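
If the frontend needs counts for several documents at once, the single lookup above extends naturally to a list. A minimal sketch, assuming `page_views` is the dictionary loaded above and that the caller passes integer doc IDs; `get_views` is a hypothetical helper name and not necessarily how `search_frontend.py` is organized.

```python
def get_views(page_views, doc_ids):
    """Return view counts in the same order as doc_ids; unknown IDs give 0."""
    return [page_views.get(doc_id, 0) for doc_id in doc_ids]

# Made-up example data.
example_views = {1: 10, 2: 20}
result = get_views(example_views, [2, 99, 1])
print(result)
```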

In [None]:
print("\n" + "="*50)
print("✅ Page Views - COMPLETE!")
print("="*50)
print(f"\nLocation: gs://{bucket_name}/page_views/pageview.pkl")