# API Quest
## Oslo

# HYPOTHESES
- Rich countries have more Nobel Prizes
    - Nobel Prize winners migrate to rich countries
    - Nobel Prize winners migrate to stable countries
- Countries of birth / early education have more impact than countries of higher education
- Nobel Prize laureates are getting younger
- Nobel Prizes are awarded more often to international teams than before

- Gender Differences: Is there a significant difference in the gender ratio among Nobel Prize winners? Has this changed over time?
- Geographic Distribution: In which countries or regions are Nobel Prize winners predominantly located? Has this distribution changed over time?
- Age of Winners: What is the age distribution of Nobel Prize winners? Are there any noticeable trends in age?
- Publications: Are there specific journals where Nobel Prize winners’ research is commonly published? How influential are these journals?

## HYPOTHESIS 1
- Men are overrepresented among Nobel Prize winners

## Selected data sources

1. Nobel API
2. crossref.org
3. https://archive.ics.uci.edu/ml/datasets/Gender+by+Name
4. namsor.app

In [131]:
#QUESTIONS
#caching
#error handling / checkpoints?
#nested jsons?
#what if we don't know the possible value?
#FileNotFoundError as check for file existence?
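
One possible answer to the caching and `FileNotFoundError` questions above is to attempt the read and fall back to fetching only when the file is missing. The sketch below is an assumption about how such a helper could look, not the code in `wrangling`; `load_or_fetch` and `fake_fetch` are hypothetical names, and the fetch step is passed in as a callable so the example runs without network access.

```python
import os
import tempfile

import pandas as pd


def load_or_fetch(path, fetch):
    """Return the cached CSV at `path`, or call `fetch()` and cache its result."""
    try:
        # Reading first and catching FileNotFoundError avoids the gap between
        # an os.path.exists() check and the actual open.
        df = pd.read_csv(path)
        print('Loading cached data')
    except FileNotFoundError:
        print('Cache miss, fetching')
        df = fetch()
        os.makedirs(os.path.dirname(path) or '.', exist_ok=True)
        # index=False keeps the row index out of the file, so no
        # 'Unnamed: 0' column appears on the next load.
        df.to_csv(path, index=False)
    return df


def fake_fetch():
    # Hypothetical stand-in for a real API call
    return pd.DataFrame({'id': [1, 2], 'gender': ['male', 'female']})


cache_path = os.path.join(tempfile.mkdtemp(), 'sources', 'demo.csv')
first = load_or_fetch(cache_path, fake_fetch)   # fetches and writes the cache
second = load_or_fetch(cache_path, fake_fetch)  # reads the cache
```

Catching the exception is usually preferred over checking for existence beforehand, because the file can appear or disappear between the check and the read.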

In [132]:
#TODO fix given names function to accept ending years
#TODO compare bar charts of nobel vs fields
#TODO compare evolution of fields
#TODO PREZ: mention analyses dropped for lack of data


In [133]:
%load_ext autoreload
%autoreload 2 

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [134]:
#imports
import os
import requests
import time
import pandas as pd
from dotenv import load_dotenv
from datetime import datetime
import plotly.express as px
from wrangling import *

In [135]:
#settings
pd.set_option('display.max_colwidth', 900)
pd.set_option('display.max_rows', 40)

In [136]:
#load env
load_dotenv()
name_token = os.getenv('NAME_KEY')


# Main data

In [137]:
laureates_url = 'https://api.nobelprize.org/2.1/laureates'
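
The endpoint returns laureates in pages, so a full download has to loop. The sketch below assumes the API accepts `offset` and `limit` query parameters and returns the records under a `laureates` key with the total in `meta.count`; those details are assumptions to check against the API documentation. `fetch_all`, `get_json`, and `fake_get_json` are hypothetical names, and the HTTP call is injected so the example runs offline.

```python
def fetch_all(url, get_json, page_size=25, key='laureates'):
    """Collect every record from an offset/limit paginated endpoint."""
    records = []
    offset = 0
    while True:
        page = get_json(url, {'offset': offset, 'limit': page_size})
        batch = page.get(key, [])
        records.extend(batch)
        offset += len(batch)
        # Fall back to the running offset if the response carries no total
        total = page.get('meta', {}).get('count', offset)
        if not batch or offset >= total:
            break
    return records


def fake_get_json(url, params):
    # Hypothetical stand-in for the real API: seven records served in pages
    data = [{'id': str(i)} for i in range(1, 8)]
    start = params['offset']
    return {
        'laureates': data[start:start + params['limit']],
        'meta': {'count': len(data)},
    }


all_records = fetch_all('https://api.nobelprize.org/2.1/laureates',
                        fake_get_json, page_size=3)
```

With `requests`, the injected callable could wrap `requests.get(url, params=params, timeout=30)`, call `raise_for_status()` on the response, and return `response.json()`.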

In [138]:
laureates_df = load_or_fetch_laureates('sources/laureates.csv', laureates_url)
display(laureates_df)

Loading cached laureates data


Unnamed: 0,id,fileName,gender,sameAs,knownName.en,knownName.se,givenName.en,givenName.se,familyName.en,familyName.se,...,nobelPrizes_1.affiliations_4.countryNow.en,nobelPrizes_1.affiliations_4.countryNow.no,nobelPrizes_1.affiliations_4.countryNow.se,nobelPrizes_1.affiliations_4.countryNow.sameAs,nobelPrizes_1.affiliations_4.countryNow.latitude,nobelPrizes_1.affiliations_4.countryNow.longitude,nobelPrizes_1.affiliations_4.continent.en,nobelPrizes_1.affiliations_4.locationString.en,nobelPrizes_1.affiliations_4.locationString.no,nobelPrizes_1.affiliations_4.locationString.se
0,1,rontgen,male,"['https://www.wikidata.org/wiki/Q35149', 'https://en.wikipedia.org/wiki/Wilhelm_Röntgen']",Wilhelm Conrad Röntgen,Wilhelm Conrad Röntgen,Wilhelm Conrad,Wilhelm Conrad,Röntgen,Röntgen,...,,,,,,,,,,
1,2,lorentz,male,"['https://www.wikidata.org/wiki/Q41688', 'https://en.wikipedia.org/wiki/Hendrik_Lorentz']",Hendrik A. Lorentz,Hendrik A. Lorentz,Hendrik A.,Hendrik A.,Lorentz,Lorentz,...,,,,,,,,,,
2,3,zeeman,male,"['https://www.wikidata.org/wiki/Q79000', 'https://en.wikipedia.org/wiki/Pieter_Zeeman']",Pieter Zeeman,Pieter Zeeman,Pieter,Pieter,Zeeman,Zeeman,...,,,,,,,,,,
3,4,becquerel,male,"['https://www.wikidata.org/wiki/Q41269', 'https://en.wikipedia.org/wiki/Henri_Becquerel']",Henri Becquerel,Henri Becquerel,Henri,Henri,Becquerel,Becquerel,...,,,,,,,,,,
4,5,pierre-curie,male,"['https://www.wikidata.org/wiki/Q37463', 'https://en.wikipedia.org/wiki/Pierre_Curie']",Pierre Curie,Pierre Curie,Pierre,Pierre,Curie,Curie,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
987,1030,brus,male,"['https://www.wikidata.org/wiki/Q194646', 'https://en.wikipedia.org/wiki/Louis_E._Brus']",Louis Brus,Louis Brus,Louis,Louis,Brus,Brus,...,,,,,,,,,,
988,1031,yekimov,male,"['https://www.wikidata.org/wiki/Q1547368', 'https://en.wikipedia.org/wiki/Alexei_Ekimov']",Aleksey Yekimov,Aleksej Jekimov,Aleksey,Aleksej,Yekimov,Jekimov,...,,,,,,,,,,
989,1032,fosse,male,"['https://www.wikidata.org/wiki/Q443868', 'https://en.wikipedia.org/wiki/Jon_Fosse']",Jon Fosse,Jon Fosse,Jon,Jon,Fosse,Fosse,...,,,,,,,,,,
990,1033,mohammadi,female,"['https://www.wikidata.org/wiki/Q4967771', 'https://en.wikipedia.org/wiki/Narges_Mohammadi']",Narges Mohammadi,Narges Mohammadi,Narges,Narges,Mohammadi,Mohammadi,...,,,,,,,,,,
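
Column names such as `nobelPrizes_1.affiliations_4.countryNow.en` suggest that the nested JSON was flattened with dots for nested objects and `_<index>` for list items, while lists of plain values (like `sameAs`) were kept whole. The sketch below is one way to get that shape; it is an assumption, not the `wrangling` implementation, `flatten` is a hypothetical helper, the record is made up, and zero-based indices are a guess.

```python
def flatten(obj, prefix=''):
    """Flatten nested dicts and lists of dicts into a single-level dict."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten(value, f'{prefix}.{key}' if prefix else key))
    elif isinstance(obj, list) and any(isinstance(v, (dict, list)) for v in obj):
        # Lists of nested objects get one indexed prefix per element
        for i, value in enumerate(obj):
            flat.update(flatten(value, f'{prefix}_{i}'))
    else:
        # Scalars and lists of scalars are stored as-is
        flat[prefix] = obj
    return flat


# Made-up record for illustration only
record = {
    'id': '0',
    'knownName': {'en': 'Example Laureate'},
    'sameAs': ['https://example.org/a', 'https://example.org/b'],
    'nobelPrizes': [
        {'awardYear': '1950', 'category': {'en': 'Physics'}},
        {'awardYear': '1960', 'category': {'en': 'Chemistry'}},
    ],
}
flat = flatten(record)
```

A list of such flat dicts can be passed straight to `pd.DataFrame`, which fills keys missing from a record with `NaN`. That would explain the many empty trailing columns above.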


### GENDER ANALYSIS

In [139]:
#gender data schema
gender_columns = get_json('schema')

In [140]:
#transforms df into usable form
gender_df = shape_dataframe(laureates_df, gender_columns)
display(gender_df)

Unnamed: 0,id,name,gender,award_year,field
0,1,Wilhelm Conrad Röntgen,male,1901,Physics
1,2,Hendrik A. Lorentz,male,1902,Physics
2,3,Pieter Zeeman,male,1902,Physics
3,4,Henri Becquerel,male,1903,Physics
4,5,Pierre Curie,male,1903,Physics
...,...,...,...,...,...
985,1028,Anne L’Huillier,female,2023,Physics
986,1029,Moungi Bawendi,male,2023,Chemistry
987,1030,Louis Brus,male,2023,Chemistry
988,1031,Aleksey Yekimov,male,2023,Chemistry


In [141]:
#shape nobels by year
nobels_by_year = gender_df.groupby(['award_year', 'gender']).size().unstack(fill_value=0)
nobels_by_year['total'] = nobels_by_year.sum(axis=1)
display(nobels_by_year)

gender,female,male,total
award_year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1901,0,3,3
1902,0,4,4
1903,1,4,5
1904,0,3,3
1905,0,3,3
...,...,...,...
2019,1,11,12
2020,3,7,10
2021,0,10,10
2022,1,8,9


In [142]:
#display ratios
nobels_by_year['female_ratio_nobels'] = nobels_by_year['female'] / nobels_by_year['total']
nobels_by_year['male_ratio_nobels'] = nobels_by_year['male'] / nobels_by_year['total']
display(nobels_by_year.head(3))

gender,female,male,total,female_ratio_nobels,male_ratio_nobels
award_year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1901,0,3,3,0.0,1.0
1902,0,4,4,0.0,1.0
1903,1,4,5,0.2,0.8
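
The count, total, and ratio steps above can also be done in one call with `pd.crosstab` and `normalize='index'`, which divides each row by its own total. This is a sketch on toy data that mirrors the 1901 and 1903 counts shown above; `toy` and `ratios` are illustrative names.

```python
import pandas as pd

# Toy data: 3 men in 1901; 1 woman and 4 men in 1903
toy = pd.DataFrame({
    'award_year': [1901, 1901, 1901, 1903, 1903, 1903, 1903, 1903],
    'gender': ['male', 'male', 'male', 'female', 'male', 'male', 'male', 'male'],
})

# Each row sums to 1.0: the share of each gender among that year's laureates
ratios = pd.crosstab(toy['award_year'], toy['gender'], normalize='index')
print(ratios)
```

The result has one column per gender and one row per award year, so it can replace the separate `female_ratio_nobels` and `male_ratio_nobels` calculations if the intermediate counts are not needed.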


In [143]:
#select only ratios
#reset_index() returns a new frame, which avoids modifying a slice in place
nobels_ratio_by_year = nobels_by_year[['female_ratio_nobels', 'male_ratio_nobels']].reset_index()
display(nobels_ratio_by_year.head(3))

gender,award_year,female_ratio_nobels,male_ratio_nobels
0,1901,0.0,1.0
1,1902,0.0,1.0
2,1903,0.2,0.8


In [144]:
#rename columns
nobels_ratio_by_year = nobels_ratio_by_year.rename(columns={'award_year': 'year'})
display(nobels_ratio_by_year.head(3))

gender,year,female_ratio_nobels,male_ratio_nobels
0,1901,0.0,1.0
1,1902,0.0,1.0
2,1903,0.2,0.8


## Graphs

In [145]:
custom_colors = {
    'Men Nobel Win': '#1f77b4',        
    'Men Scientists': '#87ceeb',            
    'Women Nobel Win': '#cd8816',     
    'Women Scientists': '#ffb333',          
}

In [146]:
#cumulative count
gender_cumulative = gender_df.groupby(['award_year', 'gender']).size().unstack(fill_value=0).cumsum()
gender_cumulative = gender_cumulative.rename(columns={'male': 'Men Winners', 'female': 'Women Winners'})
display(gender_cumulative.head(3))

gender,Women Winners,Men Winners
award_year,Unnamed: 1_level_1,Unnamed: 2_level_1
1901,0,3
1902,0,7
1903,1,11


In [147]:
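#custom_colors is keyed by the ratio-chart labels, so map the cumulative labels explicitly
cumulative_colors = {'Men Winners': custom_colors['Men Nobel Win'], 'Women Winners': custom_colors['Women Nobel Win']}
fig = px.line(gender_cumulative, x=gender_cumulative.index, y=['Men Winners', 'Women Winners'], title='Cumulative Gender Distribution of Nobel Laureates', color_discrete_map=cumulative_colors)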
fig.update_layout(template='plotly_white')
fig.show()

### FIELD ANALYSIS  

In [148]:
#get the authors of random papers
authors_names_df = get_papers_authors(gender_columns, 1901, 2023, 'initial')
display(authors_names_df)

Loading from cached names db


Unnamed: 0,year,field,name,gender
0,1901,Physics,Ludwig,
1,1901,Physics,Dawson,
2,1901,Physics,John,
3,1901,Physics,George,
4,1901,Chemistry,Bernard,
...,...,...,...,...
4549,2023,Economic Sciences,Vincenzo,
4550,2023,Economic Sciences,Henrique,
4551,2023,Economic Sciences,Tünde-Ilona,
4552,2023,Economic Sciences,Salvatore,


In [151]:
#genderize the names
fields_df = genderize_names(authors_names_df)
display(fields_df.head(3))

Unnamed: 0,year,field,name,gender
0,1901,Physics,Ludwig,male
1,1901,Physics,Dawson,male
2,1901,Physics,John,male
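
`genderize_names` is defined in `wrangling` and not shown here; it may call namsor.app or use the UCI Gender by Name dataset listed in the data sources. As a rough illustration of the local-lookup variant only, the sketch below maps first names to a gender through a small in-memory table. The table, the `name_gender` and `authors` frames, and the case-insensitive matching are all assumptions.

```python
import pandas as pd

# Hypothetical lookup table; a real one would be loaded from a names dataset
name_gender = pd.DataFrame({
    'name': ['John', 'Ludwig', 'Anne'],
    'gender': ['male', 'male', 'female'],
})

# Toy author names in the same shape as authors_names_df
authors = pd.DataFrame({
    'year': [1901, 1901, 2023],
    'name': ['Ludwig', 'JOHN', 'Tünde-Ilona'],
})

# Match on lower-cased names so capitalisation differences do not matter
lookup = name_gender.assign(key=name_gender['name'].str.lower()).set_index('key')['gender']
authors['gender'] = authors['name'].str.lower().map(lookup)
print(authors)
```

Names missing from the table stay `NaN`. With pandas defaults, `groupby` drops rows whose key is `NaN`, so a grouping on `gender` would leave those rows out of the per-gender counts while a plain per-year row count still includes them. Any proportions built that way will sum to less than 1 when names are unmatched.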


In [None]:
#group by decade
fields_df['decade'] = fields_df['year'] // 10 * 10
display(fields_df)

#proportion of males and females by decade
gender_counts = fields_df.groupby(['decade', 'gender']).size().reset_index(name='count')
display(gender_counts)

total_counts = fields_df.groupby('decade').size().reset_index(name='total')
display(total_counts)

gender_proportions = pd.merge(gender_counts, total_counts, on='decade')
gender_proportions['proportion'] = gender_proportions['count'] / gender_proportions['total']
display(gender_proportions)

pivot_fields_df = gender_proportions.pivot(index='decade', columns='gender', values='proportion').reset_index()
print(pivot_fields_df)


In [None]:
# proportion of males and females by year
gender_counts = fields_df.groupby(['year', 'gender']).size().reset_index(name='count')
display(gender_counts)

total_counts = fields_df.groupby('year').size().reset_index(name='total')
display(total_counts)

gender_proportions = pd.merge(gender_counts, total_counts, on='year')
gender_proportions['proportion'] = gender_proportions['count'] / gender_proportions['total']
display(gender_proportions)

pivot_fields_df = gender_proportions.pivot(index='year', columns='gender', values='proportion').reset_index()
display(pivot_fields_df)

# graph the data
fig = px.line(pivot_fields_df, x='year', y=['female', 'male'], title='Scientific papers by Gender Over Time')
fig.show()

# Plot the yearly Nobel ratios for comparison with pivot_fields_df
fig = px.line(nobels_ratio_by_year, x='year', y=['female_ratio_nobels', 'male_ratio_nobels'], title='Yearly Distribution of Nobel Laureates')
fig.show()

In [None]:
display(pivot_fields_df)
display(nobels_ratio_by_year)

# Ensure 'year' columns are of the same data type
pivot_fields_df['year'] = pivot_fields_df['year'].astype(int)
nobels_ratio_by_year['year'] = nobels_ratio_by_year['year'].astype(int)

merged_ratios = pd.merge(pivot_fields_df, nobels_ratio_by_year, on='year', suffixes=('_papers', '_nobels'))
merged_ratios.rename(columns={
    'female': 'Women Scientists',
    'male': 'Men Scientists',
    'female_ratio_nobels': 'Women Nobel Win',
    'male_ratio_nobels': 'Men Nobel Win'
}, inplace=True)
display(merged_ratios)

# Create the line graph with markers
fig = px.line(
    merged_ratios,
    x='year',
    y=[
        'Men Nobel Win',
        'Men Scientists', 
        'Women Nobel Win', 
        'Women Scientists', 
        ],
    title='Gender Ratios in Scientific Papers and Nobel Laureates Over Time',
    color_discrete_map=custom_colors,
)
# Apply a theme
fig.update_layout(template='plotly_white')
# Annotate 2009, a rare year of overrepresentation
fig.add_annotation(
    x=2009,
    y=merged_ratios.loc[merged_ratios['year'] == 2009, 'Women Nobel Win'].values[0],
    text="Rare overrepresentation",
    showarrow=True,
    arrowhead=1
)

fig.show()

# Calculate average ratios over time
# Drop 'year' before averaging and skip any non-numeric columns
average_ratios_df = merged_ratios.drop(columns='year').mean(numeric_only=True).to_frame(name='Average').T
display(average_ratios_df)

# Create a bar chart for average ratios
fig_avg = px.bar(
    average_ratios_df.melt(var_name='Category', value_name='Average Ratio'),
    x='Category',
    y='Average Ratio',
    title='Average Gender Ratios in Scientific Papers and Nobel Laureates',
    color='Category',
    color_discrete_map=custom_colors
)
fig_avg.update_layout(template='plotly_white')
fig_avg.show()
