## Wikipedia Names by Year

See the README for the full description. The assignment asks for the names of all people born in the 150 years ending in 2015; the code below fetches a slightly wider range, 1860 through 2023.
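
As a quick check on the window arithmetic, a 150-year span that ends in 2015 (inclusive) starts in 1866. The sketch below is only illustrative; the names `end_year`, `span`, and `years` are assumptions and do not appear in the assignment code.

```python
# Illustrative names; not taken from the assignment code.
end_year = 2015
span = 150

# range() excludes its stop value, so add 1 to include end_year itself.
years = range(end_year - span + 1, end_year + 1)

print(years[0], years[-1], len(years))  # 1866 2015 150
```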

In [None]:
import requests
import pandas as pd
from collections import Counter

wikipedia_api_url = "https://en.wikipedia.org/w/api.php"


In [6]:
# Build the query parameters for the Wikipedia API, then fetch every
# member of the "<year> births" category, following pagination
def fetch_category_members(year):
    
    req = {
        'action': 'query',
        'format': 'json',
        'list': 'categorymembers',
        'cmlimit': 500,
        'cmtitle': f'Category:{year}_births'
    }

    all_members = []

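    # Request pages from the Wikipedia API until no continuation token remains
    while True:
        r = requests.get(wikipedia_api_url, params=req, timeout=30)
        r.raise_for_status()  # fail loudly on an HTTP error instead of parsing an error page
        data = r.json()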

        # Extract category members and add them to the all_members list
        all_members.extend(data['query']['categorymembers'])

        # Check if there's a 'cmcontinue' parameter in the response
        if 'continue' in data and 'cmcontinue' in data['continue']:
            # Update the req dictionary with the 'cmcontinue' value for the next request
            req['cmcontinue'] = data['continue']['cmcontinue']
        else:
            # If there's no 'cmcontinue', it means we've fetched all the results, so break out of the loop
            break

    # Return a DataFrame for the year
    return pd.DataFrame({
        'Year': [year] * len(all_members),
        'Name': [member['title'] for member in all_members]
    })
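
The continuation loop above can also be written as a generator that takes the page-fetching step as an argument, which makes the paging logic testable without network access. This is a minimal sketch, not the assignment's method: `iter_members` and `fake_fetch` are hypothetical names, and the two canned pages are made-up data shaped like the `query`/`continue` response that the function above reads.

```python
from typing import Callable, Dict, Iterator


def iter_members(fetch_page: Callable[[Dict], Dict], params: Dict) -> Iterator[Dict]:
    """Yield category members across all pages of a continued query."""
    params = dict(params)  # copy so the caller's dict is not mutated
    while True:
        data = fetch_page(params)
        yield from data["query"]["categorymembers"]
        # A missing 'cmcontinue' token means the last page has been read.
        token = data.get("continue", {}).get("cmcontinue")
        if token is None:
            return
        params["cmcontinue"] = token


# Hypothetical stand-in for the API: two canned pages, no network access.
def fake_fetch(params: Dict) -> Dict:
    if "cmcontinue" not in params:
        return {
            "query": {"categorymembers": [{"title": "A"}, {"title": "B"}]},
            "continue": {"cmcontinue": "page2"},
        }
    return {"query": {"categorymembers": [{"title": "C"}]}}


titles = [m["title"] for m in iter_members(fake_fetch, {"list": "categorymembers"})]
print(titles)  # ['A', 'B', 'C']
```

Against the real API, `fetch_page` could be something like `lambda p: requests.get(wikipedia_api_url, params=p).json()`.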


In [8]:

# Fetch data for each year and store each DataFrame in a list
dfs = [fetch_category_members(year) for year in range(1860, 2024)]

# Concatenate all DataFrames
final_df = pd.concat(dfs, ignore_index=True)


         Year                           Name
0        1860      Charles Wheaton Abbot Jr.
1        1860             Frank Frost Abbott
2        1860           William Louis Abbott
3        1860        Robert Abdy (cricketer)
4        1860                   Abe Masakoto
...       ...                            ...
1461002  2021                  Grace Warrior
1461003  2022                Vinice Mabansag
1461004  2022                   Aire Webster
1461005  2023              Ernest Brooksbank
1461006  2023  Prince François of Luxembourg

[1461007 rows x 2 columns]
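
Some titles in the output, such as "Robert Abdy (cricketer)", carry a parenthetical disambiguator after the person's name. If later analysis needs only the names, one option is to strip a trailing parenthetical. This is a rough sketch under that assumption; `strip_disambiguator` is a hypothetical helper, not part of the assignment code, and it would also remove a trailing parenthetical that is genuinely part of a title.

```python
import re

# Matches optional whitespace plus one parenthetical at the very end of the title.
_TRAILING_PARENS = re.compile(r"\s*\([^()]*\)$")


def strip_disambiguator(title: str) -> str:
    """Remove a trailing '(...)' disambiguator, if present."""
    return _TRAILING_PARENS.sub("", title)


# Sample titles copied from the output above.
samples = ["Robert Abdy (cricketer)", "Frank Frost Abbott"]
cleaned = [strip_disambiguator(t) for t in samples]
print(cleaned)  # ['Robert Abdy', 'Frank Frost Abbott']
```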


In [9]:
# Group by 'Year', count names per year, sort in descending order and get top 10
top_10_years = final_df.groupby('Year').size().sort_values(ascending=False).head(10)

print(top_10_years)

Year
1988    18334
1986    17832
1990    17806
1989    17736
1987    17733
1985    17691
1991    17330
1984    17255
1992    17225
1982    17037
dtype: int64
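
`Counter` is imported at the top of the notebook but is not used in the cells shown. The same per-year tally could be expressed with it, as in this small sketch. The list of years is made-up sample data standing in for the `Year` column of `final_df`.

```python
from collections import Counter

# Made-up sample of birth years, standing in for final_df['Year'].
years = [1988, 1986, 1988, 1990, 1988, 1986]

counts = Counter(years)

# most_common(n) returns the n most frequent values with their counts,
# ordered from most to least frequent.
top = counts.most_common(2)
print(top)  # [(1988, 3), (1986, 2)]
```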


After pulling the members of each year's births category through the Wikipedia API, storing each year's results in a DataFrame, and concatenating those DataFrames, the most common birth year among people with English Wikipedia pages in the range fetched is 1988, with 18,334 pages.