# Wikipedia Politicians Analysis

## Import Libraries

In [6]:
import pandas as pd
import numpy as np

## Load and Preprocess Data

In this cell, we:
1. Load the CSV file containing data about Wikipedia articles on politicians by country.
2. Convert the population from millions to an absolute head count.
3. Filter out countries with zero population.
4. Define high-quality articles as those marked as 'FA' (Featured Article) or 'GA' (Good Article).
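
The four steps above can be tried on a small in-memory frame before running them on the real CSV. This is a minimal sketch: the column names follow the notebook, but the countries and values in `toy` are made up for illustration.

```python
import pandas as pd

# Hypothetical rows standing in for wp_politicians_by_country.csv;
# column names follow the notebook, the values are invented.
toy = pd.DataFrame({
    'country': ['aland', 'aland', 'borduria', 'nowhere'],
    'article_title': ['A One', 'A Two', 'B One', 'N One'],
    'article_quality': ['FA', 'Stub', 'GA', 'C'],
    'population': [0.5, 0.5, 2.0, 0.0],  # in millions
})

# Millions -> absolute head count
toy['population'] = toy['population'] * 1_000_000

# Drop zero-population rows; .copy() gives an independent frame to modify
toy = toy[toy['population'] > 0].copy()

# Flag Featured ('FA') and Good ('GA') articles
toy['is_high_quality'] = toy['article_quality'].isin(['FA', 'GA'])

print(toy[['country', 'population', 'is_high_quality']])
```

The zero-population row is dropped, and only the `FA` and `GA` rows are flagged as high quality.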

In [7]:
# Load the CSV file
df = pd.read_csv('output_data/wp_politicians_by_country.csv')

# Convert population to actual number (from millions) and filter out zero population
df['population'] = df['population'] * 1_000_000
df = df[df['population'] > 0].copy()  # copy so the column added below is not written to a slice

# Define high-quality articles
df['is_high_quality'] = df['article_quality'].isin(['FA', 'GA'])

## Calculate Country Metrics

Here, we:
1. Group the data by country and calculate total articles, high-quality articles, and population for each country.
2. Compute total articles and high-quality articles per million people (the `*_per_capita` columns hold per-million values).
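
As a worked example of the per-million scaling, the helper below (`per_million` is a hypothetical name, not part of the notebook) applies the same formula to two invented countries. It multiplies before dividing, which is algebraically the same as the notebook's expression.

```python
def per_million(count: int, population: float) -> float:
    """Articles per one million inhabitants."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count * 1_000_000 / population


# Hypothetical small country: 33 articles, 100,000 inhabitants
print(per_million(33, 100_000))     # 330.0

# Hypothetical large country: 5 articles, 50 million inhabitants
print(per_million(5, 50_000_000))   # 0.1
```

Small populations inflate the ratio quickly, which is worth keeping in mind when reading the country rankings.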

In [8]:
# Group by country and calculate metrics
country_metrics = df.groupby('country').agg({
    'article_title': 'count',
    'is_high_quality': 'sum',
    'population': 'first'
}).reset_index()

# Calculate per capita metrics (per 1,000,000 people)
country_metrics['total_articles_per_capita'] = (country_metrics['article_title'] / country_metrics['population']) * 1_000_000
country_metrics['high_quality_articles_per_capita'] = (country_metrics['is_high_quality'] / country_metrics['population']) * 1_000_000

## Calculate Region Metrics

Similar to country metrics, we:
1. Group the data by region and calculate total articles, high-quality articles, and population for each region.
2. Calculate per capita metrics for total articles and high-quality articles per million people at the regional level.


In [9]:
# Group by region and calculate metrics
region_metrics = df.groupby('region').agg({
    'article_title': 'count',
    'is_high_quality': 'sum',
    'population': 'sum'
}).reset_index()

# Calculate per capita metrics for regions (per 1,000,000 people)
region_metrics['total_articles_per_capita'] = (region_metrics['article_title'] / region_metrics['population']) * 1_000_000
region_metrics['high_quality_articles_per_capita'] = (region_metrics['is_high_quality'] / region_metrics['population']) * 1_000_000
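
One thing worth checking in the regional aggregation: if `df` has one row per article (as the `'count'` and `'first'` aggregations at the country level suggest), then `'population': 'sum'` adds a country's population once per article, which inflates the regional denominator. The sketch below shows one possible alternative on invented data, counting each country's population once per region. The frame `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical article-level rows: one row per article, population repeated
toy = pd.DataFrame({
    'region': ['R1', 'R1', 'R1', 'R2'],
    'country': ['aland', 'aland', 'borduria', 'syldavia'],
    'article_title': ['A One', 'A Two', 'B One', 'S One'],
    'is_high_quality': [True, False, False, True],
    'population': [1_000_000, 1_000_000, 3_000_000, 2_000_000],
})

# Summing over article rows counts aland's population twice
naive = toy.groupby('region')['population'].sum()

# Article counts come straight from the article-level rows
counts = toy.groupby('region').agg(
    article_title=('article_title', 'count'),
    is_high_quality=('is_high_quality', 'sum'),
)

# Population: keep one row per country, then sum within each region
region_population = (
    toy.drop_duplicates(subset=['region', 'country'])
       .groupby('region')['population']
       .sum()
)

region_toy = counts.join(region_population).reset_index()
region_toy['total_articles_per_capita'] = (
    region_toy['article_title'] * 1_000_000 / region_toy['population']
)

print(naive)
print(region_toy)
```

In this toy data the naive sum gives region `R1` a population of 5,000,000, while the deduplicated sum gives 4,000,000.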

## Define Formatting Function

This function selects the requested columns, formats float columns to four decimal places, and returns the result as a Markdown pipe table for readable notebook output.


In [10]:
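# Function to format tables
def format_table(df, columns, title):
    """Return the selected columns of df as a pipe-style Markdown table.

    `title` documents the call site but is not rendered in the output.
    """
    formatted_df = df[columns].copy()
    # Render float columns as fixed-point strings with four decimals
    for col in formatted_df.select_dtypes(include=['float64']).columns:
        formatted_df[col] = formatted_df[col].apply(lambda x: f'{x:.4f}')
    return formatted_df.to_markdown(index=False, tablefmt="pipe")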
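
To see what the float-formatting loop does in isolation, here is a minimal sketch on an invented frame. It leaves out the `to_markdown` call, so it only illustrates the string conversion step; the frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical frame with one float column and one integer column
toy = pd.DataFrame({
    'country': ['aland', 'borduria'],
    'total_articles_per_capita': [83.333333, 0.01134],
    'articles': [5, 2],
})

formatted = toy.copy()
# Only float64 columns are converted to fixed-point strings
for col in formatted.select_dtypes(include=['float64']).columns:
    formatted[col] = formatted[col].map(lambda x: f'{x:.4f}')

print(formatted['total_articles_per_capita'].tolist())  # ['83.3333', '0.0113']
print(formatted['articles'].tolist())                   # [5, 2]
```

Integer and text columns pass through unchanged because `select_dtypes` picks up only the `float64` columns.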

## Country-by-country analysis: total-articles-per-capita and high-quality-articles-per-capita

In [11]:

print("Country-by-country analysis: total-articles-per-capita and high-quality-articles-per-capita")
country_analysis = country_metrics[['country', 'total_articles_per_capita', 'high_quality_articles_per_capita']].sort_values('total_articles_per_capita', ascending=False)
print(format_table(country_analysis, ['country', 'total_articles_per_capita', 'high_quality_articles_per_capita'], "Country-by-Country Analysis"))
print()

Country-by-country analysis: total-articles-per-capita and high-quality-articles-per-capita
| country                        |   total_articles_per_capita |   high_quality_articles_per_capita |
|:-------------------------------|----------------------------:|-----------------------------------:|
| antigua and barbuda            |                    330      |                             0      |
| federated states of micronesia |                    140      |                             0      |
| marshall islands               |                    130      |                             0      |
| tonga                          |                    100      |                             0      |
| barbados                       |                     83.3333 |                             0      |
| seychelles                     |                     60      |                             0      |
| montenegro                     |                     60      |                             5      |

## Regional analysis: total-articles-per-capita and high-quality-articles-per-capita

In [12]:
print("Regional analysis: total-articles-per-capita and high-quality-articles-per-capita")
region_analysis = region_metrics[['region', 'total_articles_per_capita', 'high_quality_articles_per_capita']].sort_values('total_articles_per_capita', ascending=False)
print(format_table(region_analysis, ['region', 'total_articles_per_capita', 'high_quality_articles_per_capita'], "Regional Analysis"))

Regional analysis: total-articles-per-capita and high-quality-articles-per-capita
| region          |   total_articles_per_capita |   high_quality_articles_per_capita |
|:----------------|----------------------------:|-----------------------------------:|
| OCEANIA         |                      0.6391 |                             0.009  |
| NORTHERN EUROPE |                      0.1644 |                             0.0077 |
| CARIBBEAN       |                      0.1553 |                             0.0064 |
| CENTRAL AMERICA |                      0.1323 |                             0.0071 |
| CENTRAL ASIA    |                      0.0535 |                             0.0026 |
| WESTERN ASIA    |                      0.0456 |                             0.002  |
| SOUTHERN EUROPE |                      0.0444 |                             0.003  |
| EASTERN AFRICA  |                      0.0278 |                             0.0007 |
| WESTERN EUROPE  |                      0.0257 |                             0.0011 |

# Results

## Top and Bottom Countries Analysis

### 1. Top 10 countries by coverage

In [13]:
top_10_coverage = country_metrics.sort_values('total_articles_per_capita', ascending=False).head(10)
print("1. Top 10 countries by coverage:")
print(format_table(top_10_coverage, ['country', 'total_articles_per_capita'], "Top 10 Countries by Total Articles per Capita"))
print()

1. Top 10 countries by coverage:
| country                        |   total_articles_per_capita |
|:-------------------------------|----------------------------:|
| antigua and barbuda            |                    330      |
| federated states of micronesia |                    140      |
| marshall islands               |                    130      |
| tonga                          |                    100      |
| barbados                       |                     83.3333 |
| seychelles                     |                     60      |
| montenegro                     |                     60      |
| bhutan                         |                     55      |
| maldives                       |                     55      |
| samoa                          |                     40      |
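
The `sort_values(...).head(10)` pattern used for these rankings can also be written with pandas' `nlargest` and `nsmallest`. The sketch below compares the two on an invented frame with no tied values; it is an optional alternative, not the notebook's own approach.

```python
import pandas as pd

# Hypothetical per-country metric values with no ties
toy = pd.DataFrame({
    'country': ['a', 'b', 'c', 'd'],
    'total_articles_per_capita': [4.0, 40.0, 0.4, 400.0],
})

top_2 = toy.nlargest(2, 'total_articles_per_capita')
bottom_2 = toy.nsmallest(2, 'total_articles_per_capita')

# The sort-then-head pattern selects the same rows in the same order
same_as_sort = toy.sort_values('total_articles_per_capita', ascending=False).head(2)

print(top_2['country'].tolist())     # ['d', 'b']
print(bottom_2['country'].tolist())  # ['c', 'a']
print(top_2.equals(same_as_sort))    # True
```

When values are tied, the two approaches may order the tied rows differently.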



### 2. Bottom 10 countries by coverage

In [14]:
bottom_10_coverage = country_metrics.sort_values('total_articles_per_capita', ascending=True).head(10)
print("2. Bottom 10 countries by coverage:")
print(format_table(bottom_10_coverage, ['country', 'total_articles_per_capita'], "Bottom 10 Countries by Total Articles per Capita"))
print()

2. Bottom 10 countries by coverage:
| country       |   total_articles_per_capita |
|:--------------|----------------------------:|
| china         |                      0.0113 |
| ghana         |                      0.088  |
| india         |                      0.1057 |
| saudi arabia  |                      0.1355 |
| zambia        |                      0.1485 |
| norway        |                      0.1818 |
| israel        |                      0.2041 |
| egypt         |                      0.3042 |
| cote d'ivoire |                      0.3236 |
| ethiopia      |                      0.3478 |



### 3. Top 10 countries by high quality

In [15]:
top_10_quality = country_metrics.sort_values('high_quality_articles_per_capita', ascending=False).head(10)
print("3. Top 10 countries by high quality:")
print(format_table(top_10_quality, ['country', 'high_quality_articles_per_capita'], "Top 10 Countries by High Quality Articles per Capita"))
print()

3. Top 10 countries by high quality:
| country               |   high_quality_articles_per_capita |
|:----------------------|-----------------------------------:|
| montenegro            |                             5      |
| luxembourg            |                             2.8571 |
| albania               |                             2.5926 |
| kosovo                |                             2.3529 |
| maldives              |                             1.6667 |
| lithuania             |                             1.3793 |
| croatia               |                             1.3158 |
| guyana                |                             1.25   |
| palestinian territory |                             1.0909 |
| slovenia              |                             0.9524 |



### 4. Bottom 10 countries by high quality

In [16]:
bottom_10_quality = country_metrics.sort_values('high_quality_articles_per_capita', ascending=True).head(10)
print("4. Bottom 10 countries by high quality:")
print(format_table(bottom_10_quality, ['country', 'high_quality_articles_per_capita'], "Bottom 10 Countries by High Quality Articles per Capita"))
print()

4. Bottom 10 countries by high quality:
| country                        |   high_quality_articles_per_capita |
|:-------------------------------|-----------------------------------:|
| zimbabwe                       |                                  0 |
| qatar                          |                                  0 |
| grenada                        |                                  0 |
| gambia                         |                                  0 |
| samoa                          |                                  0 |
| senegal                        |                                  0 |
| federated states of micronesia |                                  0 |
| estonia                        |                                  0 |
| eritrea                        |                                  0 |
| equatorial guinea              |                                  0 |
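
All ten rows above are tied at zero, so which countries appear depends on how the sort orders ties rather than on any difference in the data. One way to make this ranking more informative, sketched here on invented values, is to report how many countries have no high-quality articles and then rank only the non-zero ones. The frame `toy` is hypothetical.

```python
import pandas as pd

col = 'high_quality_articles_per_capita'

# Hypothetical per-country values, several tied at zero
toy = pd.DataFrame({
    'country': ['a', 'b', 'c', 'd', 'e'],
    col: [0.0, 0.0, 0.5, 0.0, 0.02],
})

# How many countries have no high-quality articles at all
zero_count = int((toy[col] == 0).sum())

# Lowest-ranked countries among those with at least one such article
lowest_nonzero = toy[toy[col] > 0].nsmallest(2, col)

print(zero_count)                          # 3
print(lowest_nonzero['country'].tolist())  # ['e', 'c']
```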



### 5. Geographic regions by total coverage

In [17]:
regions_total_coverage = region_metrics.sort_values('total_articles_per_capita', ascending=False)
print("5. Geographic regions by total coverage:")
print(format_table(regions_total_coverage, ['region', 'total_articles_per_capita'], "Geographic Regions by Total Articles per Capita"))
print()

5. Geographic regions by total coverage:
| region          |   total_articles_per_capita |
|:----------------|----------------------------:|
| OCEANIA         |                      0.6391 |
| NORTHERN EUROPE |                      0.1644 |
| CARIBBEAN       |                      0.1553 |
| CENTRAL AMERICA |                      0.1323 |
| CENTRAL ASIA    |                      0.0535 |
| WESTERN ASIA    |                      0.0456 |
| SOUTHERN EUROPE |                      0.0444 |
| EASTERN AFRICA  |                      0.0278 |
| WESTERN EUROPE  |                      0.0257 |
| NORTHERN AFRICA |                      0.0248 |
| EASTERN EUROPE  |                      0.0244 |
| MIDDLE AFRICA   |                      0.0242 |
| SOUTHERN AFRICA |                      0.0207 |
| SOUTH AMERICA   |                      0.0165 |
| WESTERN AFRICA  |                      0.0089 |
| SOUTHEAST ASIA  |                      0.0088 |
| EAST ASIA       |                      0.0041 |
| SOUTH A

### 6. Geographic regions by high quality coverage

In [18]:
regions_quality_coverage = region_metrics.sort_values('high_quality_articles_per_capita', ascending=False)
print("6. Geographic regions by high quality coverage:")
print(format_table(regions_quality_coverage, ['region', 'high_quality_articles_per_capita'], "Geographic Regions by High Quality Articles per Capita"))

6. Geographic regions by high quality coverage:
| region          |   high_quality_articles_per_capita |
|:----------------|-----------------------------------:|
| OCEANIA         |                             0.009  |
| NORTHERN EUROPE |                             0.0077 |
| CENTRAL AMERICA |                             0.0071 |
| CARIBBEAN       |                             0.0064 |
| SOUTHERN EUROPE |                             0.003  |
| CENTRAL ASIA    |                             0.0026 |
| WESTERN ASIA    |                             0.002  |
| NORTHERN AFRICA |                             0.0014 |
| SOUTHERN AFRICA |                             0.0013 |
| EASTERN EUROPE  |                             0.0013 |
| WESTERN EUROPE  |                             0.0011 |
| MIDDLE AFRICA   |                             0.0008 |
| EASTERN AFRICA  |                             0.0007 |
| SOUTHEAST ASIA  |                             0.0006 |
| SOUTH AMERICA   |                     