# Nobel Prize Analysis

In [38]:
import pandas as pd

## Data Exploration

In [39]:
df = pd.read_csv('data/nobel_prize_data.csv')

In [40]:
print(f'The data has {df.shape[0]} rows and {df.shape[1]} columns')
print(f'The data has {df.isna().sum().sum()} missing values')
print(f'The data has {df.isnull().sum().sum()} null values')
print(f'The data has {df.duplicated().sum()} duplicated rows')
print(f'The data has {df.dtypes.nunique()} different data types')

The data has 962 rows and 16 columns
The data has 1023 missing values
The data has 1023 null values
The data has 0 duplicated rows
The data has 2 different data types
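The "missing" and "null" counts above match because `DataFrame.isnull` is an alias of `DataFrame.isna`, so both calls count the same cells. A minimal sketch on a hypothetical toy frame (the name `toy` and its values are illustrative, not taken from the Nobel data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the Nobel data set.
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", None, None]})

# isna() and isnull() flag exactly the same cells (NaN, None, NaT).
na_count = int(toy.isna().sum().sum())
null_count = int(toy.isnull().sum().sum())

print(na_count, null_count)
```

One of the two calls is therefore enough when summarizing a frame.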


In [41]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 962 entries, 0 to 961
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   year                   962 non-null    int64 
 1   category               962 non-null    object
 2   prize                  962 non-null    object
 3   motivation             874 non-null    object
 4   prize_share            962 non-null    object
 5   laureate_type          962 non-null    object
 6   full_name              962 non-null    object
 7   birth_date             934 non-null    object
 8   birth_city             931 non-null    object
 9   birth_country          934 non-null    object
 10  birth_country_current  934 non-null    object
 11  sex                    934 non-null    object
 12  organization_name      707 non-null    object
 13  organization_city      707 non-null    object
 14  organization_country   708 non-null    object
 15  ISO                    934 non-null    object

In [42]:
first_year = df['year'].min()
print(f'The Nobel prize was first awarded in {first_year}')

The Nobel prize was first awarded in 1901


In [43]:
last_year = df['year'].max()
print(f'The last year included is {last_year}')

The last year included is 2020


#### Which columns tend to have NaN values? And how many NaN values are there per column?

In [44]:
df.isna().sum().sort_values(ascending=False)

organization_name        255
organization_city        255
organization_country     254
motivation                88
birth_city                31
birth_date                28
birth_country             28
birth_country_current     28
sex                       28
ISO                       28
year                       0
category                   0
prize                      0
prize_share                0
laureate_type              0
full_name                  0
dtype: int64
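Raw counts are easier to compare when shown next to the share of rows they represent. The sketch below assumes a small hypothetical frame (`toy`) with a few of the same column names; `isna().mean()` gives the fraction of missing values per column because `True` counts as 1.

```python
import pandas as pd

# Hypothetical toy frame reusing a few column names from the data set.
toy = pd.DataFrame({
    "year": [1901, 1902, 1903, 1904],
    "organization_name": ["Example University", None, None, None],
    "birth_date": ["1850-01-01", "1860-02-02", None, "1870-03-03"],
})

# Count and percentage of missing values per column, largest first.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(1),
}).sort_values("n_missing", ascending=False)

print(missing)
```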

#### Why are there so many NaN values for the birth date? And why are there so many missing values among the organisation columns?

In [45]:
col_subset = ['year', 'category', 'full_name', 'birth_date', 'organization_name', 'laureate_type']
df.loc[df['birth_date'].isna(), col_subset]

| | year | category | full_name | birth_date | organization_name | laureate_type |
|---|---|---|---|---|---|---|
| 24 | 1904 | Peace | Institut de droit international (Institute of ... | NaN | NaN | Organization |
| 60 | 1910 | Peace | Bureau international permanent de la Paix (Per... | NaN | NaN | Organization |
| 89 | 1917 | Peace | Comité international de la Croix Rouge (Intern... | NaN | NaN | Organization |
| 200 | 1938 | Peace | Office international Nansen pour les Réfugiés ... | NaN | NaN | Organization |
| 215 | 1944 | Peace | Comité international de la Croix Rouge (Intern... | NaN | NaN | Organization |
| 237 | 1947 | Peace | American Friends Service Committee (The Quakers) | NaN | NaN | Organization |
| 238 | 1947 | Peace | Friends Service Council (The Quakers) | NaN | NaN | Organization |
| 283 | 1954 | Peace | Office of the United Nations High Commissioner... | NaN | NaN | Organization |
| 348 | 1963 | Peace | Comité international de la Croix Rouge (Intern... | NaN | NaN | Organization |
| 349 | 1963 | Peace | Ligue des Sociétés de la Croix-Rouge (League o... | NaN | NaN | Organization |


In [46]:
df.loc[df['organization_name'].isna(), col_subset]

| | year | category | full_name | birth_date | organization_name | laureate_type |
|---|---|---|---|---|---|---|
| 1 | 1901 | Literature | Sully Prudhomme | 1839-03-16 | NaN | Individual |
| 3 | 1901 | Peace | Frédéric Passy | 1822-05-20 | NaN | Individual |
| 4 | 1901 | Peace | Jean Henry Dunant | 1828-05-08 | NaN | Individual |
| 7 | 1902 | Literature | Christian Matthias Theodor Mommsen | 1817-11-30 | NaN | Individual |
| 9 | 1902 | Peace | Charles Albert Gobat | 1843-05-21 | NaN | Individual |
| ... | ... | ... | ... | ... | ... | ... |
| 932 | 2018 | Peace | Nadia Murad | 1993-07-02 | NaN | Individual |
| 942 | 2019 | Literature | Peter Handke | 1942-12-06 | NaN | Individual |
| 946 | 2019 | Peace | Abiy Ahmed Ali | 1976-08-15 | NaN | Individual |
| 954 | 2020 | Literature | Louise Glück | 1943-04-22 | NaN | Individual |
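The rows shown suggest two patterns: laureates without a birth date are organizations, and laureates without an organization are individuals, here in Literature and Peace. Eyeballing a truncated table is not a full check, so one way to test both ideas is to group the missing-value flags. The sketch below uses a hypothetical four-row frame (`toy`); the Physics row and its "Example University" are invented for illustration.

```python
import pandas as pd

# Hypothetical toy frame; the Physics row is invented for illustration.
toy = pd.DataFrame({
    "category": ["Peace", "Peace", "Physics", "Literature"],
    "laureate_type": ["Organization", "Individual", "Individual", "Individual"],
    "birth_date": [None, "1822-05-20", "1850-01-01", "1839-03-16"],
    "organization_name": [None, None, "Example University", None],
})

# Number of missing birth dates per laureate type.
birth_missing_by_type = toy["birth_date"].isna().groupby(toy["laureate_type"]).sum()

# Number of missing organization names per prize category.
org_missing_by_category = toy["organization_name"].isna().groupby(toy["category"]).sum()

print(birth_missing_by_type)
print(org_missing_by_category)
```

Running the same two `groupby` calls on `df` would show whether the pattern holds for all rows rather than only those displayed.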


#### Data type manipulation

In [47]:
df['birth_date'] = pd.to_datetime(df['birth_date'])
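Missing entries become `NaT` after conversion. If the column might also contain malformed strings (an assumption, nothing above indicates it does), `errors="coerce"` turns those into `NaT` as well instead of raising. A small sketch on a hypothetical series:

```python
import pandas as pd

# Hypothetical values: one valid date, one missing entry, one malformed string.
raw = pd.Series(["1839-03-16", None, "not a date"])

# Unparseable entries become NaT rather than raising an error.
parsed = pd.to_datetime(raw, errors="coerce")

print(parsed)
print(parsed.dt.year)
```

Once the column is a datetime, the `.dt` accessor exposes components such as the year.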

#### Adding a column called *share_pct* that stores each laureate's share as a floating-point fraction (for example, `1/2` becomes `0.5`)

In [48]:
separated_values = df.prize_share.str.split('/', expand=True)
numerator = pd.to_numeric(separated_values[0])
denominator = pd.to_numeric(separated_values[1])
df['share_pct'] = numerator / denominator
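To see what the division produces, the same steps can be run on a few hypothetical `prize_share` strings. The result is a fraction between 0 and 1; the extra column `share_pct_100` is an illustrative name, not part of the notebook, showing how to scale to percent if that is preferred.

```python
import pandas as pd

# Hypothetical sample of prize_share strings.
toy = pd.DataFrame({"prize_share": ["1/1", "1/2", "1/4", "1/3"]})

# Split "numerator/denominator" into two columns and divide.
parts = toy["prize_share"].str.split("/", expand=True)
toy["share_pct"] = pd.to_numeric(parts[0]) / pd.to_numeric(parts[1])

# Optional, illustrative: the same share on a 0-100 scale.
toy["share_pct_100"] = toy["share_pct"] * 100

print(toy)
```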