# API Quest
## Oslo

# HYPOTHESES
- Rich countries have more Nobel Prizes
    - Nobel Prize winners migrate to rich countries
    - Nobel Prize winners migrate to stable countries
- Countries of birth / early education have more impact than countries of higher education
- Nobel Prize laureates are getting younger
- Nobel Prizes are awarded more often to international teams than before

- Gender Differences: Is there a significant difference in the gender ratio among Nobel Prize winners? Has this changed over time?
- Geographic Distribution: In which countries or regions are Nobel Prize winners predominantly located? Has this distribution changed over time?
- Age of Winners: What is the age distribution of Nobel Prize winners? Are there any noticeable trends in age?
- Publications: Are there specific journals where Nobel Prize winners’ research is commonly published? How influential are these journals?

## HYPOTHESIS 1
- Men are overrepresented among Nobel Prize laureates

# DATA SOURCES

1. **Nobel Laureates Data**
	- **Nobel Prize Official Data**
	  - Description: Comprehensive information on all Nobel laureates, including their age, nationality, affiliation, prize category, and motivation.
	  - Link: [Nobel Prize Official Website](https://www.nobelprize.org/organization/developer-zone-2/)
	  - API: [Nobel Prize API](https://www.nobelprize.org/organization/developer-zone-2/)
	- **Kaggle Nobel Laureates Dataset**
	  - Description: A dataset compiled from the Nobel Prize official data, available in CSV format for easy analysis.
	  - Link: [Kaggle Nobel Prize Dataset](https://www.kaggle.com/datasets/imdevskp/nobel-prize/data)

2. **Economic Indicators**
	- **World Bank GDP Data**
	  - Description: GDP per capita and other economic indicators for countries worldwide.
	  - Link: [World Bank GDP per Capita](https://data.worldbank.org/indicator/NY.GDP.PCAP.CD)
	- **Heritage Foundation Index of Economic Freedom**
	  - Description: Measures economic freedom in countries across 12 quantitative and qualitative factors.
	  - Link: [Index of Economic Freedom](https://www.heritage.org/index/)

3. **Education Expenditure and Statistics**
	- **UNESCO Education Data**
	  - Description: Data on government expenditure on education as a percentage of GDP and total government expenditure.
	  - Link: [UNESCO Education Expenditure](http://data.uis.unesco.org/)
	- **OECD Education Statistics**
	  - Description: Detailed statistics on education spending, enrollment rates, and educational attainment among OECD countries.
	  - Link: [OECD Education at a Glance](https://www.oecd.org/education/education-at-a-glance/)

4. **Gender Statistics**
	- **UNESCO Gender Parity Index**
	  - Description: Data on gender parity in education and literacy rates.
	  - Link: [UNESCO Gender Equality Data](http://data.uis.unesco.org/)
	- **World Bank Gender Data Portal**
	  - Description: Comprehensive data on gender equality indicators globally.
	  - Link: [World Bank Gender Data](https://datatopics.worldbank.org/gender/)

5. **Demographic and Socioeconomic Data**
	- **United Nations Educational, Scientific and Cultural Organization (UNESCO) Institute for Statistics**
	  - Description: Data on education, literacy rates, and demographic factors.
	  - Link: [UNESCO UIS Data](http://data.uis.unesco.org/)
	- **OECD Social and Welfare Statistics**
	  - Description: Indicators on social protection, income inequality, and more.
	  - Link: [OECD Social Data](https://www.oecd.org/social/soc/)



## Selected data sources

1. Nobel API
2. https://uis.unesco.org/
3. https://databank.worldbank.org/source/world-development-indicators

In [None]:
#TODO filter STEM fields
#TODO modularize

In [779]:
#imports
import os
import json
import requests
import pandas as pd
from dotenv import load_dotenv
import plotly.express as px

In [780]:
#settings
pd.set_option('display.max_colwidth', 900)
pd.set_option('display.max_rows', 50)

In [None]:
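#load env
load_dotenv()
token = os.getenv('TOKEN')
# Confirm the token was loaded without printing the secret itself
print('TOKEN loaded' if token else 'TOKEN not found in environment')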

In [782]:
#TODO: Get the data from the API
enrollment_df = pd.read_csv('sources/school_enrolment_gender.csv')
display(enrollment_df.head())  # a bare expression only renders when it ends the cell

laureates_url = 'https://api.nobelprize.org/2.1/laureates'

In [783]:
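def flatten(records, prefix=''):
    """Flatten nested JSON records into a single wide DataFrame.

    Nested dicts become dotted columns (a.b); each list of dicts is spread
    over numbered column groups (col_1.*, col_2.*, ...).
    """
    flattened = pd.json_normalize(records)

    if prefix:
        flattened = flattened.add_prefix(prefix + '.')

    for column in flattened.columns:
        # Use the first non-empty list as the sample, so that a missing value
        # in the first row does not hide a column that needs expanding
        sample = next((v for v in flattened[column] if isinstance(v, list) and v), None)

        if sample is not None and isinstance(sample[0], dict):
            # Find the maximum length of lists in the column
            max_len = flattened[column].apply(lambda x: len(x) if isinstance(x, list) else 0).max()
            for i in range(max_len):
                # i-th element of each row's list, or None when the list is shorter
                inner_dict = flattened[column].apply(
                    lambda x: x[i] if isinstance(x, list) and len(x) > i else None
                )
                flattened = pd.concat([flattened, flatten(inner_dict, f"{column}_{i+1}")], axis=1)
            flattened.drop(column, axis=1, inplace=True)

    return flattened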
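
To make the column naming easier to follow, here is a minimal sketch of what `pd.json_normalize` does with one nested record. The `sample` record is hypothetical: it is only shaped loosely like an API laureate and is not real data.

```python
import pandas as pd

# Hypothetical record, shaped loosely like one API laureate (not real data)
sample = [{
    "id": "1",
    "knownName": {"en": "Example Person"},
    "nobelPrizes": [{"awardYear": "1901", "category": {"en": "Physics"}}],
}]

flat = pd.json_normalize(sample)

# Nested dicts become dotted column names; the list of prizes stays in one cell
print(sorted(flat.columns))
print(type(flat.loc[0, "nobelPrizes"]))

# Normalizing the first prize on its own and prefixing it mirrors what
# flatten() does for each list position
first_prize = pd.json_normalize(flat.loc[0, "nobelPrizes"][0]).add_prefix("nobelPrizes_1.")
print(first_prize.columns.tolist())
```

This is why later cells can refer to columns such as `nobelPrizes_1.awardYear`.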


In [None]:
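def get_all_laureates():
    """Download every laureate from the Nobel Prize API, one page at a time."""
    offset = 0
    limit = 25
    total = None  # unknown until the first response reports meta.count
    pages = []

    while total is None or offset < total:
        url = f"{laureates_url}?offset={offset}&limit={limit}"
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # stop on HTTP errors instead of parsing an error body
        data = response.json()
        total = data['meta']['count']
        pages.append(flatten(data['laureates']))
        offset += limit

    # Concatenate once at the end rather than growing a DataFrame inside the loop
    all_laureates = pd.concat(pages, ignore_index=True)
    all_laureates['id'] = all_laureates['id'].astype(int)
    return all_laureates.sort_values('id')

laureates_df = get_all_laureates()
laureates_df.head()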
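
The offset/limit loop can also be separated from the HTTP call so it can be tested offline. The sketch below is an illustration only: `fetch_all`, `fake_fetch`, and the `'items'` key are hypothetical names, with `fake_fetch` standing in for the real request.

```python
def fetch_all(fetch_page, limit=25):
    """Collect every record from an offset/limit paginated source.

    fetch_page(offset, limit) must return a dict shaped like
    {'meta': {'count': <total>}, 'items': [...]} (hypothetical shape).
    """
    items = []
    offset = 0
    total = None  # unknown until the first page arrives

    while total is None or offset < total:
        page = fetch_page(offset, limit)
        total = page['meta']['count']
        items.extend(page['items'])
        offset += limit

    return items


# Stand-in for the HTTP call: serves slices of a 60-element list
records = list(range(60))

def fake_fetch(offset, limit):
    return {'meta': {'count': len(records)}, 'items': records[offset:offset + limit]}

result = fetch_all(fake_fetch, limit=25)
print(len(result))
```

Starting with `total = None` ensures at least one request is made, so no initial guess for the record count is needed.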

In [785]:
gender_columns = [
    {'id': {'original_name': 'id','dtype': 'int64'}},
    {'name': {'original_name': 'knownName.en','dtype': 'object'}},
    {'gender': {'original_name': 'gender','dtype': 'category'}},
    {'award_year': {'original_name': 'nobelPrizes_1.awardYear','dtype': 'int64'}},
]
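
The `dtype` entries in this specification are not applied anywhere yet. One possible way to use them is sketched below on a hypothetical two-row frame, assuming the API delivers `id` and `awardYear` as strings. The names `column_specs`, `raw`, and `typed` are illustrative only.

```python
import pandas as pd

# Same list-of-single-key-dicts layout as gender_columns
column_specs = [
    {'id': {'original_name': 'id', 'dtype': 'int64'}},
    {'award_year': {'original_name': 'nobelPrizes_1.awardYear', 'dtype': 'int64'}},
]

# Hypothetical raw frame: the values arrive as strings
raw = pd.DataFrame({'id': ['2', '1'], 'nobelPrizes_1.awardYear': ['1903', '1901']})

# Merge the single-key dicts into one {new_name: spec} mapping
specs = {name: spec for column in column_specs for name, spec in column.items()}
rename_map = {spec['original_name']: name for name, spec in specs.items()}
dtypes = {name: spec['dtype'] for name, spec in specs.items()}

# Select, rename, then cast every column to its declared dtype
typed = raw[list(rename_map)].rename(columns=rename_map).astype(dtypes)
print(typed.dtypes)
```

Casting `award_year` to an integer makes grouping and sorting by year numeric rather than lexicographic.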

In [None]:

# Map original API column names to the short names declared in gender_columns
rename_map = {spec['original_name']: name for column in gender_columns for name, spec in column.items()}
gender_df = laureates_df[list(rename_map)].rename(columns=rename_map).sort_values('id')
display(gender_df.head())

# Count laureates per gender; sort on the numeric count before the share is formatted as text
gender_counts = gender_df.groupby('gender').aggregate({'id': 'count'}).reset_index()
gender_counts = gender_counts.sort_values('id', ascending=False)
gender_counts['proportion'] = (gender_counts['id'] / gender_counts['id'].sum()).apply(lambda x: f"{x:.0%}")
gender_counts.index = range(1, len(gender_counts) + 1)
display(gender_counts)


In [None]:
yearly_ratio = gender_df.groupby(['award_year','gender']).size().unstack(fill_value=0)
display(yearly_ratio[['female','male']])
yearly_ratio['total'] = yearly_ratio.sum(axis=1)
yearly_ratio['female_ratio'] = yearly_ratio['female'] / yearly_ratio['total']
yearly_ratio['male_ratio'] = yearly_ratio['male'] / yearly_ratio['total']
display(yearly_ratio[['female_ratio','male_ratio']])
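
As a cross-check, the same per-year shares can be computed in one step with `pd.crosstab`. This is a sketch on toy data, not real laureates; `toy` and `shares` are illustrative names.

```python
import pandas as pd

# Toy data, not real laureates
toy = pd.DataFrame({
    'award_year': [1901, 1901, 1903, 1903, 1903, 1903],
    'gender': ['male', 'male', 'female', 'male', 'male', 'male'],
})

# normalize='index' divides each row by its own total, giving per-year shares
shares = pd.crosstab(toy['award_year'], toy['gender'], normalize='index')
print(shares)
```

Each row of `shares` sums to 1, so a year with no female laureates shows 0.0 rather than being dropped.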


In [None]:
gender_cumulative = gender_df.groupby(['award_year', 'gender']).size().unstack(fill_value=0).cumsum()
gender_cumulative['total'] = gender_cumulative.sum(axis=1)
gender_cumulative['male_proportion'] = gender_cumulative['male'] / gender_cumulative['total']
gender_cumulative['female_proportion'] = gender_cumulative['female'] / gender_cumulative['total']
display(gender_cumulative[['male_proportion', 'female_proportion']])
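
The cumulative calculation can be checked by hand on a small toy frame. The data below is hypothetical and chosen so that the running totals are easy to verify.

```python
import pandas as pd

# Toy data, not real laureates
toy = pd.DataFrame({
    'award_year': [1901, 1901, 1903, 1903, 1911],
    'gender': ['male', 'male', 'female', 'male', 'female'],
})

# Yearly counts per gender, then running totals down the years
counts = toy.groupby(['award_year', 'gender']).size().unstack(fill_value=0)
running = counts.cumsum()

# Divide each running count by the running total of that row
running_share = running.div(running.sum(axis=1), axis=0)
print(running_share)
```

By 1911 the toy data holds two women out of five laureates, so the female share in the last row is 0.4.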

In [None]:
fig = px.bar(gender_counts, x='gender', y='id', text='proportion', title='Gender Distribution of Nobel Laureates')
fig.show()

In [None]:
fig = px.line(gender_cumulative, x=gender_cumulative.index, y=['female', 'male'], title='Cumulative Gender Distribution of Nobel Laureates')
fig.show()

In [None]:
fig = px.line(gender_cumulative, x=gender_cumulative.index, y=['female_proportion', 'male_proportion'], title='Cumulative Proportion of Nobel Laureates by gender')
fig.show()

In [None]:
fig = px.line(yearly_ratio, x=yearly_ratio.index, y=['female_ratio', 'male_ratio'], title='Yearly Gender Distribution of Nobel Laureates')
fig.show()