# Nobel Prize Analysis
Table of contents:
* [Data Cleaning](#data-cleaning)

In [1]:
import pandas as pd
import numpy as np


## Data Cleaning <a id='data-cleaning'></a>

In [2]:
df = pd.read_csv('data/nobel_prize_data.csv')
df.sample(frac=0.01)

Unnamed: 0,year,category,prize,motivation,prize_share,laureate_type,full_name,birth_date,birth_city,birth_country,birth_country_current,sex,organization_name,organization_city,organization_country,ISO
321,1959,Physics,The Nobel Prize in Physics 1959,"""for their discovery of the antiproton""",1/2,Individual,Owen Chamberlain,1920-07-10,"San Francisco, CA",United States of America,United States of America,Male,University of California,"Berkeley, CA",United States of America,USA
20,1904,Chemistry,The Nobel Prize in Chemistry 1904,"""in recognition of his services in the discove...",1/1,Individual,Sir William Ramsay,1852-10-02,Glasgow,Scotland,United Kingdom,Male,University College London,London,United Kingdom,GBR
327,1960,Physics,The Nobel Prize in Physics 1960,"""for the invention of the bubble chamber""",1/1,Individual,Donald Arthur Glaser,1926-09-21,"Cleveland, OH",United States of America,United States of America,Male,University of California,"Berkeley, CA",United States of America,USA
775,2005,Literature,The Nobel Prize in Literature 2005,"""who in his plays uncovers the precipice under...",1/1,Individual,Harold Pinter,1930-10-10,London,United Kingdom,United Kingdom,Male,,,,GBR
783,2006,Chemistry,The Nobel Prize in Chemistry 2006,"""for his studies of the molecular basis of euk...",1/1,Individual,Roger D. Kornberg,1947-04-24,"St. Louis, MO",United States of America,United States of America,Male,Stanford University,"Stanford, CA",United States of America,USA
5,1901,Physics,The Nobel Prize in Physics 1901,"""in recognition of the extraordinary services ...",1/1,Individual,Wilhelm Conrad Röntgen,1845-03-27,Lennep (Remscheid),Prussia (Germany),Germany,Male,Munich University,Munich,Germany,DEU
952,2020,Economics,The Sveriges Riksbank Prize in Economic Scienc...,“for improvements to auction theory and invent...,1/2,Individual,Paul R. Milgrom,1948-04-20,"Detroit, MI",United States of America,United States of America,Male,Stanford University,"Stanford, CA",United States of America,USA
677,1997,Chemistry,The Nobel Prize in Chemistry 1997,"""for their elucidation of the enzymatic mechan...",1/4,Individual,Paul D. Boyer,1918-07-31,"Provo, UT",United States of America,United States of America,Male,University of California,"Los Angeles, CA",United States of America,USA
516,1981,Chemistry,The Nobel Prize in Chemistry 1981,"""for their theories, developed independently, ...",1/2,Individual,Kenichi Fukui,1918-10-04,Nara,Japan,Japan,Male,Kyoto University,Kyoto,Japan,JPN
601,1989,Physics,The Nobel Prize in Physics 1989,"""for the development of the ion trap technique""",1/4,Individual,Wolfgang Paul,1913-08-10,Lorenzkirch,Germany,Germany,Male,University of Bonn,Bonn,Germany,DEU


Drop unnecessary columns.

In [3]:
df = df.drop(columns=['motivation', 'birth_city', 'birth_country', 'organization_city'])
df.sample(frac=0.01, random_state=42)

Unnamed: 0,year,category,prize,prize_share,laureate_type,full_name,birth_date,birth_country_current,sex,organization_name,organization_country,ISO
344,1963,Literature,The Nobel Prize in Literature 1963,1/1,Individual,Giorgos Seferis,1900-03-13,Turkey,Male,,,TUR
381,1967,Medicine,The Nobel Prize in Physiology or Medicine 1967,1/3,Individual,Haldan Keffer Hartline,1903-12-22,United States of America,Male,Rockefeller University,United States of America,USA
351,1963,Physics,The Nobel Prize in Physics 1963,1/4,Individual,J. Hans D. Jensen,1907-06-25,Germany,Male,University of Heidelberg,Germany,DEU
778,2005,Peace,The Nobel Peace Prize 2005,1/2,Organization,International Atomic Energy Agency (IAEA),,,,,,
277,1953,Physics,The Nobel Prize in Physics 1953,1/1,Individual,Frits Zernike,1888-07-16,Netherlands,Male,Groningen University,Netherlands,NLD
667,1996,Literature,The Nobel Prize in Literature 1996,1/1,Individual,Wislawa Szymborska,1923-07-02,Poland,Female,,,POL
215,1944,Peace,The Nobel Peace Prize 1944,1/1,Organization,Comité international de la Croix Rouge (Intern...,,,,,,
465,1976,Literature,The Nobel Prize in Literature 1976,1/1,Individual,Saul Bellow,1915-06-10,Canada,Male,,,CAN
86,1916,Literature,The Nobel Prize in Literature 1916,1/1,Individual,Carl Gustaf Verner von Heidenstam,1859-07-06,Sweden,Male,,,SWE
286,1955,Chemistry,The Nobel Prize in Chemistry 1955,1/1,Individual,Vincent du Vigneaud,1901-05-18,United States of America,Male,Cornell University,United States of America,USA


Count categorical values in `['category', 'laureate_type']`.

In [4]:
df.value_counts(subset=['category', 'laureate_type'], dropna=True)

category    laureate_type
Medicine    Individual       222
Physics     Individual       216
Chemistry   Individual       186
Literature  Individual       117
Peace       Individual       103
Economics   Individual        86
Peace       Organization      32
dtype: int64
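
For readers new to `DataFrame.value_counts`, here is a minimal sketch on a made-up frame (the `toy` data is an assumption, not part of the Nobel data set). It shows how `subset` counts each distinct combination, and how the result could be pivoted into a small table:

```python
import pandas as pd

# Hypothetical toy data, not taken from the Nobel data set.
toy = pd.DataFrame({
    'category': ['Peace', 'Peace', 'Physics', 'Peace'],
    'laureate_type': ['Individual', 'Organization', 'Individual', 'Individual'],
})

# Count each (category, laureate_type) pair; the result is sorted by count, descending.
counts = toy.value_counts(subset=['category', 'laureate_type'])

# Pivot the pair counts into a category x laureate_type table, filling absent pairs with 0.
table = counts.unstack(fill_value=0)
print(table)
```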

Check for any duplicated rows based on `prize`, `category`, `full_name` and `birth_date`.

In [5]:
df.duplicated(subset=['prize', 'category', 'full_name', 'birth_date']).any()

False
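
Had the check returned `True`, the next step would be to look at the offending rows. The sketch below uses a made-up frame (names and values are hypothetical) to show the difference between the default `keep='first'` and `keep=False`, which flags every member of a duplicate group:

```python
import pandas as pd

# Hypothetical toy data with one repeated laureate entry.
toy = pd.DataFrame({
    'prize': ['Prize A 1950', 'Prize A 1950', 'Prize B 1951'],
    'full_name': ['Ada Example', 'Ada Example', 'Ada Example'],
    'birth_date': ['1900-01-01', '1900-01-01', '1900-01-01'],
})

key = ['prize', 'full_name', 'birth_date']

# Default keep='first': only the second occurrence is flagged.
flags = toy.duplicated(subset=key)

# keep=False flags every member of a duplicate group, which helps when inspecting them.
all_dupes = toy[toy.duplicated(subset=key, keep=False)]

print(flags.tolist())
print(all_dupes)
```

Note that the third row is not flagged: the same person with a different `prize` counts as a distinct record.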

No duplicated rows, good. Now check for `NaN` values.

In [6]:
df.isna().sum()

year                       0
category                   0
prize                      0
prize_share                0
laureate_type              0
full_name                  0
birth_date                28
birth_country_current     28
sex                       28
organization_name        255
organization_country     254
ISO                       28
dtype: int64

Let's drop the rows that contain `NaN` values. This removes 258 of the 962 rows, most of them because `organization_name` is missing.

In [7]:
df = df.dropna(axis=0)
df.isna().sum()

year                     0
category                 0
prize                    0
prize_share              0
laureate_type            0
full_name                0
birth_date               0
birth_country_current    0
sex                      0
organization_name        0
organization_country     0
ISO                      0
dtype: int64
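
Dropping every row with any `NaN` is the simplest option, but it also discards laureates whose only missing field is the organization. If those rows matter for a later question, one alternative (an assumption about what the analysis might need, not what this notebook does) is to restrict `dropna` to the columns that must be present. A minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one complete row, one without an organization,
# and one without a birth date or an organization.
toy = pd.DataFrame({
    'full_name': ['A', 'B', 'C'],
    'birth_date': ['1900-01-01', '1910-05-05', np.nan],
    'organization_name': ['Some University', np.nan, np.nan],
})

# Drop a row if any column is missing (the approach used in this notebook).
strict = toy.dropna(axis=0)

# Drop a row only if birth_date is missing; a missing organization is tolerated.
lenient = toy.dropna(subset=['birth_date'])

print(len(strict), len(lenient))
```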

Good, now let's check the data type of each column.

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 704 entries, 0 to 961
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   year                   704 non-null    int64 
 1   category               704 non-null    object
 2   prize                  704 non-null    object
 3   prize_share            704 non-null    object
 4   laureate_type          704 non-null    object
 5   full_name              704 non-null    object
 6   birth_date             704 non-null    object
 7   birth_country_current  704 non-null    object
 8   sex                    704 non-null    object
 9   organization_name      704 non-null    object
 10  organization_country   704 non-null    object
 11  ISO                    704 non-null    object
dtypes: int64(1), object(11)
memory usage: 71.5+ KB


Convert the `year` and `birth_date` columns to the datetime data type. Because `year` holds plain integers, pass `format='%Y'`; otherwise `pd.to_datetime` interprets each integer as a number of nanoseconds since the Unix epoch.

In [9]:
# Parse the integer year as a calendar year (e.g. 1973 -> 1973-01-01).
df['year'] = pd.to_datetime(df['year'], format='%Y')
df['birth_date'] = pd.to_datetime(df['birth_date'])
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 704 entries, 0 to 961
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   year                   704 non-null    datetime64[ns]
 1   category               704 non-null    object        
 2   prize                  704 non-null    object        
 3   prize_share            704 non-null    object        
 4   laureate_type          704 non-null    object        
 5   full_name              704 non-null    object        
 6   birth_date             704 non-null    datetime64[ns]
 7   birth_country_current  704 non-null    object        
 8   sex                    704 non-null    object        
 9   organization_name      704 non-null    object        
 10  organization_country   704 non-null    object        
 11  ISO                    704 non-null    object        
dtypes: datetime64[ns](2), object(10)
memory usage: 71.5+ KB


Create a `prize_pct` column that holds `prize_share` as a floating-point number.

In [10]:
# Split 'n/d' strings into numerator and denominator columns, then divide.
df_prize_share = df['prize_share'].str.split('/', expand=True)
numerator = pd.to_numeric(df_prize_share[0])
denominator = pd.to_numeric(df_prize_share[1])
df['prize_pct'] = numerator / denominator

df.sample(n=5, random_state=42)

Unnamed: 0,year,category,prize,prize_share,laureate_type,full_name,birth_date,birth_country_current,sex,organization_name,organization_country,ISO,prize_pct
428,1973-01-01,Chemistry,The Nobel Prize in Chemistry 1973,1/2,Individual,Geoffrey Wilkinson,1921-07-14,United Kingdom,Male,Imperial College,United Kingdom,GBR,0.5
139,1927-01-01,Physics,The Nobel Prize in Physics 1927,1/2,Individual,Charles Thomson Rees Wilson,1869-02-14,United Kingdom,Male,University of Cambridge,United Kingdom,GBR,0.5
132,1926-01-01,Physics,The Nobel Prize in Physics 1926,1/1,Individual,Jean Baptiste Perrin,1870-09-30,France,Male,Sorbonne University,France,FRA,1.0
314,1958-01-01,Physics,The Nobel Prize in Physics 1958,1/3,Individual,Pavel Alekseyevich Cherenkov,1904-07-28,Russia,Male,P.N. Lebedev Physical Institute,Russia,RUS,0.333333
461,1975-01-01,Physics,The Nobel Prize in Physics 1975,1/3,Individual,Ben Roy Mottelson,1926-07-09,United States of America,Male,Nordita,Denmark,USA,0.333333

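The split-and-divide conversion of `prize_share` can be checked in isolation. The sketch below uses a made-up Series (`shares` is a hypothetical name) and cross-checks the vectorised result against the standard library's `fractions.Fraction`, which parses `'n/d'` strings directly:

```python
from fractions import Fraction

import numpy as np
import pandas as pd

# Hypothetical share strings in the same 'n/d' format as prize_share.
shares = pd.Series(['1/1', '1/2', '1/3', '1/4'])

# Vectorised approach: split on '/', convert both parts to numbers, divide.
parts = shares.str.split('/', expand=True)
pct = pd.to_numeric(parts[0]) / pd.to_numeric(parts[1])

# Cross-check with the standard library, one element at a time.
check = shares.map(lambda s: float(Fraction(s)))

print(pct.tolist())
print(np.allclose(pct, check))
```

The vectorised version is the better fit for a whole column; the `Fraction` variant is mainly useful as a sanity check.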

One thing to note here: because we dropped many rows, the index now has gaps (it still runs from 0 to 961 but holds only 704 entries). Therefore, the next step is to reset the index.

In [11]:
df = df.reset_index(drop=True)
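
To see what `reset_index(drop=True)` changes, here is a minimal sketch on a made-up frame (the `toy` data is an assumption): dropping a row leaves a gap in the labels, and resetting renumbers them from 0.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a missing value in the middle row.
toy = pd.DataFrame({'full_name': ['A', 'B', 'C'], 'sex': ['Female', np.nan, 'Male']})

# Dropping the NaN row leaves the original labels 0 and 2, so the index has a gap.
dropped = toy.dropna()
print(dropped.index.tolist())

# drop=True discards the old labels instead of keeping them as a new 'index' column.
clean = dropped.reset_index(drop=True)
print(clean.index.tolist())
```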

Now, what have we done so far?
* We dropped unnecessary columns.
* We made sure there are no duplicated rows.
* We dropped rows that contain `NaN` values.
* We converted the `year` and `birth_date` columns to the datetime data type.
* We created a new column called `prize_pct` that holds `prize_share` as a floating-point number.

Let's proceed to the next step, which is to investigate our data.