# Population Data Preprocessing

In [13]:
import pandas as pd
import numpy as np

population_df = pd.read_csv("input_data/population_by_country_AUG.2024.csv")

## Data Cleaning and Initial Preprocessing

This cell performs initial data cleaning:
1. Checks for missing values in the dataset.
2. Removes rows with missing values.
3. Checks for and removes duplicate entries in the dataset.

In [14]:
# Check for missing values
print("\nMissing values in population_df dataset:")
print(population_df.isnull().sum())

# Remove rows with missing values
population_df = population_df.dropna()

# Check for duplicates
print("\nDuplicates in population_df dataset:")
print(population_df.duplicated().sum())

# Remove duplicates
population_df = population_df.drop_duplicates()


Missing values in population_df dataset:
Geography     0
Population    0
dtype: int64

Duplicates in population_df dataset:
0


## Identifying Regions and Assigning to Countries

This section identifies regions and assigns them to countries:
1. Identifies rows that represent regions (entries in all caps in the 'Geography' column).
2. Creates a new 'region' column in the dataset.
3. Iterates through the dataset, assigning the appropriate region to each country.

In [15]:
# Identify the rows that represent regions (ALL CAPS entries in 'Geography')
regions = population_df[population_df['Geography'].str.isupper()].copy()

# Create an empty column for regions in the population_df
population_df['region'] = None

# Iterate over the regions and assign the corresponding region to countries
current_region = None
for index, row in population_df.iterrows():
    if row['Geography'].isupper():
        # If the row is a region, set it as the current region
        current_region = row['Geography']
    else:
        # If the row is a country, assign the current region
        population_df.at[index, 'region'] = current_region

## Creating the Final Country Population Dataset

This cell creates the final country population dataset:
1. Removes rows corresponding to regions, keeping only country data.
2. Renames the 'Geography' column to 'country' for clarity.

In [None]:
# Remove the rows corresponding to regions and keep only the country rows
country_population_df = population_df[~population_df['Geography'].str.isupper()].copy()

# Rename the columns to 'country'
country_population_df = country_population_df.rename(columns={'Geography': 'country'})

# Save the final DataFrame to a CSV file for future use
country_population_df.to_csv('intermediate_data/population_by_country_with_region.csv', index=False)

# Politicians Data Preprocessing

In [16]:
import pandas as pd
import numpy as np
import requests
import json
import pandas as pd
from tqdm import tqdm
from urllib.parse import unquote

politicians_df = pd.read_csv("input_data/politicians_by_country_AUG.2024.csv")

## Data Cleaning and Initial Preprocessing

This cell performs initial data cleaning:
1. Checks for missing values in the dataset.
2. Removes rows with missing values.
3. Checks for and removes duplicate entries in the dataset.

In [17]:
# Check for missing values
print("\nMissing values in population_df dataset:")
print(population_df.isnull().sum())

# Remove rows with missing values
population_df = population_df.dropna()

# Check for duplicates
print("\nDuplicates in population_df dataset:")
print(population_df.duplicated().sum())

# Remove duplicates
population_df = population_df.drop_duplicates()


Missing values in population_df dataset:
Geography      0
Population     0
region        24
dtype: int64

Duplicates in population_df dataset:
0


## Get Predicted Score

### Constants

In [None]:
# Constants
ORES_API_ENDPOINT = "https://ores.wikimedia.org/v3/scores/enwiki/{revid}/articlequality"
MEDIAWIKI_API_ENDPOINT = "https://en.wikipedia.org/w/api.php"
INPUT_FILE = 'input_data/politicians_by_country_AUG.2024.csv'
OUTPUT_FILE = 'intermediate_data/politicians_with_quality_and_revisions.csv'
ERROR_LOG_FILE = 'intermediate_data/error_log.txt'

REQUEST_HEADER = {
    'User-Agent': "<pgupta1@uw.edu>, University of Washington, MSDS DATA 512 - AUTUMN 2024"
}

## Helper Functions

These functions handle various tasks:
1. Extracting article titles from URLs.
2. Getting the current revision ID for a given article.
3. Requesting ORES quality scores for article revisions.

In [None]:

def extract_title_from_url(url):
    """Extract the article title from a Wikipedia URL."""
    parts = url.split("/")
    if len(parts) > 4:
        return unquote(parts[4].replace("_", " "))
    return url

def get_current_revision(title):
    """Get the current revision ID for a given Wikipedia article title."""
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "revisions",
        "rvprop": "ids",
        "rvlimit": 1
    }
    
    try:
        response = requests.get(MEDIAWIKI_API_ENDPOINT, params=params, headers=REQUEST_HEADER)
        response.raise_for_status()
        data = response.json()
        page = next(iter(data['query']['pages'].values()))
        return page['revisions'][0]['revid']
    except Exception as e:
        return None

def request_ores_score(rev_id):
    """Get ORES quality prediction for a given article revision."""
    try:
        response = requests.get(ORES_API_ENDPOINT.format(revid=rev_id), headers=REQUEST_HEADER)
        response.raise_for_status()
        data = response.json()
        return data['enwiki']['scores'][str(rev_id)]['articlequality']['score']['prediction']
    except requests.exceptions.HTTPError as http_err:
        print(f"HTTP error occurred: {http_err}")
    except Exception as e:
        print(f"Error getting ORES score for rev_id {rev_id}: {str(e)}")
    return None

### Main Processing Function

This function is the main workhorse of the script:
1. It loads the input data.
2. Processes each article to get its current revision ID and quality score.
3. Handles errors and logs them.
4. Saves progress periodically and outputs the final results.

In [11]:

def process_politicians_data():
    """Process all the politicians data and get quality predictions."""
    df = pd.read_csv(INPUT_FILE)
    
    print("Columns in the CSV file:")
    print(df.columns.tolist())
    
    url_column = 'url' 
    
    if url_column not in df.columns:
        raise ValueError(f"Column '{url_column}' not found in the CSV file.")
    
    df['quality'] = None
    df['revision_id'] = None
    error_count = 0
    error_log = []

    for index, row in tqdm(df.iterrows(), total=len(df), desc="Processing articles"):
        url = row[url_column]
        title = extract_title_from_url(url)
        rev_id = get_current_revision(title)
        
        if rev_id:
            df.at[index, 'revision_id'] = rev_id
            quality = request_ores_score(rev_id)
            if quality:
                df.at[index, 'quality'] = quality
            else:
                error_count += 1
                error_msg = f"Failed to get ORES score for {title} (rev_id: {rev_id})"
                error_log.append(error_msg)
        else:
            error_count += 1
            error_msg = f"Failed to get revision ID for {title}"
            error_log.append(error_msg)
        
        # Save progress every 100 articles
        if (index + 1) % 100 == 0:
            df.to_csv(OUTPUT_FILE, index=False)
            print(f"Progress saved. Processed {index+1} articles.")

    # Calculate and print error rate
    total_articles = len(df)
    error_rate = error_count / total_articles
    print(f"\nError rate: {error_rate:.2%}")

    if error_rate > 0.01:
        print("Error rate is above 1%. Please review the error log.")
    
    # Log errors
    with open(ERROR_LOG_FILE, 'w') as f:
        for error in error_log:
            f.write(f"{error}\n")

    # Save final results
    df.to_csv(OUTPUT_FILE, index=False)
    print(f"\nFinal results saved to {OUTPUT_FILE}")
    print(f"Error log saved to {ERROR_LOG_FILE}")

if __name__ == "__main__":
    process_politicians_data()

Columns in the CSV file:
['name', 'url', 'country']


Processing articles:   0%|          | 0/7155 [00:00<?, ?it/s]

Processing articles:   1%|▏         | 100/7155 [02:01<2:26:12,  1.24s/it]

Progress saved. Processed 100 articles.


Processing articles:   3%|▎         | 200/7155 [04:00<2:44:08,  1.42s/it]

Progress saved. Processed 200 articles.


Processing articles:   4%|▍         | 300/7155 [06:35<5:13:10,  2.74s/it]

Progress saved. Processed 300 articles.


Processing articles:   6%|▌         | 400/7155 [08:49<2:23:13,  1.27s/it]

Progress saved. Processed 400 articles.


Processing articles:   7%|▋         | 500/7155 [10:57<2:40:14,  1.44s/it]

Progress saved. Processed 500 articles.


Processing articles:   8%|▊         | 600/7155 [13:54<2:41:57,  1.48s/it]

Progress saved. Processed 600 articles.


Processing articles:  10%|▉         | 700/7155 [16:11<1:54:08,  1.06s/it]

Progress saved. Processed 700 articles.


Processing articles:  11%|█         | 800/7155 [18:05<2:05:39,  1.19s/it]

Progress saved. Processed 800 articles.


Processing articles:  13%|█▎        | 900/7155 [20:07<2:05:53,  1.21s/it]

Progress saved. Processed 900 articles.


Processing articles:  14%|█▍        | 1000/7155 [22:01<1:55:07,  1.12s/it]

Progress saved. Processed 1000 articles.


Processing articles:  15%|█▌        | 1100/7155 [24:07<1:54:59,  1.14s/it]

Progress saved. Processed 1100 articles.


Processing articles:  17%|█▋        | 1200/7155 [26:05<2:11:21,  1.32s/it]

Progress saved. Processed 1200 articles.


Processing articles:  18%|█▊        | 1300/7155 [28:00<1:43:01,  1.06s/it]

Progress saved. Processed 1300 articles.


Processing articles:  20%|█▉        | 1400/7155 [29:51<1:57:41,  1.23s/it]

Progress saved. Processed 1400 articles.


Processing articles:  21%|██        | 1500/7155 [31:45<1:46:14,  1.13s/it]

Progress saved. Processed 1500 articles.


Processing articles:  22%|██▏       | 1600/7155 [33:37<1:51:56,  1.21s/it]

Progress saved. Processed 1600 articles.


Processing articles:  23%|██▎       | 1666/7155 [35:56<29:02:59, 19.05s/it]

HTTP error occurred: 504 Server Error: Gateway Timeout for url: https://ores.wikimedia.org/v3/scores/enwiki/1208171356/articlequality


Processing articles:  24%|██▍       | 1700/7155 [36:34<1:43:29,  1.14s/it] 

Progress saved. Processed 1700 articles.


Processing articles:  25%|██▌       | 1800/7155 [38:25<1:38:44,  1.11s/it]

Progress saved. Processed 1800 articles.


Processing articles:  27%|██▋       | 1900/7155 [40:25<1:53:30,  1.30s/it]

Progress saved. Processed 1900 articles.


Processing articles:  28%|██▊       | 2000/7155 [42:30<1:33:05,  1.08s/it]

Progress saved. Processed 2000 articles.


Processing articles:  29%|██▉       | 2100/7155 [44:29<1:39:45,  1.18s/it]

Progress saved. Processed 2100 articles.


Processing articles:  31%|███       | 2200/7155 [46:29<1:30:45,  1.10s/it]

Progress saved. Processed 2200 articles.


Processing articles:  32%|███▏      | 2300/7155 [48:22<1:27:10,  1.08s/it]

Progress saved. Processed 2300 articles.


Processing articles:  34%|███▎      | 2400/7155 [50:17<1:47:15,  1.35s/it]

Progress saved. Processed 2400 articles.


Processing articles:  35%|███▍      | 2500/7155 [52:19<1:32:17,  1.19s/it]

Progress saved. Processed 2500 articles.


Processing articles:  36%|███▋      | 2600/7155 [54:17<1:26:01,  1.13s/it]

Progress saved. Processed 2600 articles.


Processing articles:  38%|███▊      | 2700/7155 [56:21<1:20:58,  1.09s/it]

Progress saved. Processed 2700 articles.


Processing articles:  39%|███▉      | 2800/7155 [58:22<1:20:48,  1.11s/it]

Progress saved. Processed 2800 articles.


Processing articles:  41%|████      | 2900/7155 [1:00:17<1:20:28,  1.13s/it]

Progress saved. Processed 2900 articles.


Processing articles:  42%|████▏     | 3000/7155 [1:02:12<1:15:26,  1.09s/it]

Progress saved. Processed 3000 articles.


Processing articles:  43%|████▎     | 3100/7155 [1:04:06<1:23:24,  1.23s/it]

Progress saved. Processed 3100 articles.


Processing articles:  45%|████▍     | 3200/7155 [1:05:58<1:17:15,  1.17s/it]

Progress saved. Processed 3200 articles.


Processing articles:  46%|████▌     | 3300/7155 [1:07:49<1:10:54,  1.10s/it]

Progress saved. Processed 3300 articles.


Processing articles:  48%|████▊     | 3400/7155 [1:09:45<1:06:25,  1.06s/it]

Progress saved. Processed 3400 articles.


Processing articles:  48%|████▊     | 3454/7155 [1:11:57<19:44:20, 19.20s/it]

HTTP error occurred: 504 Server Error: Gateway Timeout for url: https://ores.wikimedia.org/v3/scores/enwiki/1067021356/articlequality


Processing articles:  48%|████▊     | 3455/7155 [1:12:58<32:36:50, 31.73s/it]

HTTP error occurred: 504 Server Error: Gateway Timeout for url: https://ores.wikimedia.org/v3/scores/enwiki/1231810366/articlequality


Processing articles:  49%|████▉     | 3500/7155 [1:13:53<1:04:36,  1.06s/it] 

Progress saved. Processed 3500 articles.


Processing articles:  50%|█████     | 3600/7155 [1:15:48<1:22:39,  1.40s/it]

Progress saved. Processed 3600 articles.


Processing articles:  52%|█████▏    | 3700/7155 [1:17:53<1:02:14,  1.08s/it]

Progress saved. Processed 3700 articles.


Processing articles:  53%|█████▎    | 3800/7155 [1:19:51<1:19:32,  1.42s/it]

Progress saved. Processed 3800 articles.


Processing articles:  55%|█████▍    | 3900/7155 [1:21:48<1:02:16,  1.15s/it]

Progress saved. Processed 3900 articles.


Processing articles:  56%|█████▌    | 4000/7155 [1:23:51<1:06:36,  1.27s/it]

Progress saved. Processed 4000 articles.


Processing articles:  57%|█████▋    | 4100/7155 [1:25:57<1:01:34,  1.21s/it]

Progress saved. Processed 4100 articles.


Processing articles:  59%|█████▊    | 4200/7155 [1:27:57<54:24,  1.10s/it]  

Progress saved. Processed 4200 articles.


Processing articles:  60%|██████    | 4300/7155 [1:29:56<55:27,  1.17s/it]  

Progress saved. Processed 4300 articles.


Processing articles:  61%|██████▏   | 4400/7155 [1:31:52<51:45,  1.13s/it]  

Progress saved. Processed 4400 articles.


Processing articles:  63%|██████▎   | 4500/7155 [1:33:54<49:02,  1.11s/it]  

Progress saved. Processed 4500 articles.


Processing articles:  64%|██████▍   | 4600/7155 [1:35:55<52:41,  1.24s/it]  

Progress saved. Processed 4600 articles.


Processing articles:  66%|██████▌   | 4700/7155 [1:38:01<45:17,  1.11s/it]  

Progress saved. Processed 4700 articles.


Processing articles:  66%|██████▋   | 4743/7155 [1:40:03<12:49:36, 19.14s/it]

HTTP error occurred: 504 Server Error: Gateway Timeout for url: https://ores.wikimedia.org/v3/scores/enwiki/747227236/articlequality


Processing articles:  67%|██████▋   | 4800/7155 [1:41:22<47:43,  1.22s/it]   

Progress saved. Processed 4800 articles.


Processing articles:  68%|██████▊   | 4900/7155 [1:43:27<43:23,  1.15s/it]  

Progress saved. Processed 4900 articles.


Processing articles:  70%|██████▉   | 5000/7155 [1:45:37<40:20,  1.12s/it]  

Progress saved. Processed 5000 articles.


Processing articles:  71%|███████▏  | 5100/7155 [1:47:34<38:27,  1.12s/it]

Progress saved. Processed 5100 articles.


Processing articles:  73%|███████▎  | 5200/7155 [1:49:33<41:27,  1.27s/it]

Progress saved. Processed 5200 articles.


Processing articles:  74%|███████▍  | 5300/7155 [1:51:27<31:51,  1.03s/it]

Progress saved. Processed 5300 articles.


Processing articles:  75%|███████▌  | 5400/7155 [1:53:24<35:29,  1.21s/it]

Progress saved. Processed 5400 articles.


Processing articles:  77%|███████▋  | 5500/7155 [1:55:28<33:47,  1.22s/it]

Progress saved. Processed 5500 articles.


Processing articles:  78%|███████▊  | 5600/7155 [1:57:34<27:54,  1.08s/it]

Progress saved. Processed 5600 articles.


Processing articles:  80%|███████▉  | 5700/7155 [1:59:35<33:26,  1.38s/it]

Progress saved. Processed 5700 articles.


Processing articles:  81%|████████  | 5800/7155 [2:01:35<25:26,  1.13s/it]

Progress saved. Processed 5800 articles.


Processing articles:  82%|████████▏ | 5900/7155 [2:03:37<26:51,  1.28s/it]

Progress saved. Processed 5900 articles.


Processing articles:  84%|████████▍ | 6000/7155 [2:05:37<23:22,  1.21s/it]

Progress saved. Processed 6000 articles.


Processing articles:  85%|████████▌ | 6100/7155 [2:07:38<21:54,  1.25s/it]

Progress saved. Processed 6100 articles.


Processing articles:  87%|████████▋ | 6200/7155 [2:09:49<18:16,  1.15s/it]

Progress saved. Processed 6200 articles.


Processing articles:  88%|████████▊ | 6300/7155 [2:11:48<15:50,  1.11s/it]

Progress saved. Processed 6300 articles.


Processing articles:  89%|████████▉ | 6400/7155 [2:14:03<17:22,  1.38s/it]

Progress saved. Processed 6400 articles.


Processing articles:  91%|█████████ | 6500/7155 [2:16:03<12:07,  1.11s/it]

Progress saved. Processed 6500 articles.


Processing articles:  92%|█████████▏| 6600/7155 [2:18:03<13:06,  1.42s/it]

Progress saved. Processed 6600 articles.


Processing articles:  94%|█████████▎| 6700/7155 [2:20:08<08:56,  1.18s/it]

Progress saved. Processed 6700 articles.


Processing articles:  95%|█████████▌| 6800/7155 [2:22:05<07:20,  1.24s/it]

Progress saved. Processed 6800 articles.


Processing articles:  96%|█████████▋| 6900/7155 [2:24:03<05:11,  1.22s/it]

Progress saved. Processed 6900 articles.


Processing articles:  98%|█████████▊| 7000/7155 [2:26:04<03:06,  1.20s/it]

Progress saved. Processed 7000 articles.


Processing articles:  99%|█████████▉| 7100/7155 [2:28:08<01:05,  1.18s/it]

Progress saved. Processed 7100 articles.


Processing articles: 100%|██████████| 7155/7155 [2:29:15<00:00,  1.25s/it]


Error rate: 0.17%

Final results saved to politicians_with_quality_and_revisions.csv
Error log saved to error_log.txt



