## Practice notebook for cleaning data sets

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
from pydataset import data

In [73]:
df = pd.read_csv('untidy-data/gapminder1.csv')
df.head()

,year,country,measure,measurement
0,1955,Afghanistan,pop,8891209.0
1,1960,Afghanistan,pop,9829450.0
2,1965,Afghanistan,pop,10997885.0
3,1970,Afghanistan,pop,12430623.0
4,1975,Afghanistan,pop,14132019.0


In [74]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2079 entries, 0 to 2078
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   year         2079 non-null   int64  
 1   country      2079 non-null   object 
 2   measure      2079 non-null   object 
 3   measurement  2079 non-null   float64
dtypes: float64(1), int64(1), object(2)
memory usage: 65.1+ KB


The gapminder1 dataset starts with four columns:

    - year
    - country
    - measure
    - measurement
    
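The data is in long format: each row holds one value for one country in one year, and `measure` says whether that value is the population (`pop`), fertility rate (`fertility`), or life expectancy (`life_expect`).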
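A minimal sketch of turning this kind of measure/measurement layout into one column per measure, using a small hand-built frame instead of the CSV. The names `toy` and `toy_wide` are made up for illustration; the values are copied from the 1955 rows displayed elsewhere in this notebook.

```python
import pandas as pd

# Toy long-format frame (hypothetical name); values copied from the
# 1955 rows for Afghanistan and Argentina shown in this notebook.
toy = pd.DataFrame({
    'year': [1955] * 6,
    'country': ['Afghanistan'] * 3 + ['Argentina'] * 3,
    'measure': ['pop', 'fertility', 'life_expect'] * 2,
    'measurement': [8891209.0, 7.7, 30.332, 18927821.0, 3.1265, 64.399],
})

# One row per (year, country), one column per distinct value of 'measure'.
# pivot_table aggregates with the mean by default; with a single value per
# cell the mean is just that value.
toy_wide = toy.pivot_table(index=['year', 'country'],
                           columns='measure',
                           values='measurement')
print(toy_wide)
```

Without `columns='measure'`, the same call would average all three measures together into a single `measurement` column, which is rarely what is wanted.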

In [71]:
# pivot so that each measure (fertility, life_expect, pop) gets its own column
df1 = df.pivot_table(index=['year','country'], columns='measure', values='measurement')
df1.sample(10).sort_values('year')

year,country,fertility,life_expect,pop
1955,Egypt,6.97,44.444,23855527.0
1955,United Kingdom,2.49,70.42,50946000.0
1960,El Salvador,6.847,52.307,2581583.0
1965,China,6.06,58.38112,715185000.0
1965,United States,2.545,70.76,194303000.0
1970,Argentina,3.1455,67.065,23962313.0
1980,Bolivia,5.2995,53.859,5441298.0
1980,Nigeria,6.9,45.826,68550274.0
1980,Iran,6.63,59.62,39583397.0
1985,Poland,2.15,70.98,37225792.0


**What was the highest fertility rate for each country?**

In [72]:
df1.groupby('country').fertility.max()

country
Afghanistan           8.0000
Argentina             3.4400
Aruba                 5.1500
Australia             3.4060
Austria               2.7800
Bahamas               4.5030
Bangladesh            6.8500
Barbados              4.6700
Belgium               2.6440
Bolivia               6.7500
Brazil                6.1501
Canada                3.8820
Chile                 5.4860
China                 6.0600
Colombia              6.7600
Costa Rica            7.2245
Croatia               2.4200
Cuba                  4.6805
Dominican Republic    7.6405
Ecuador               6.7000
Egypt                 7.0730
El Salvador           6.8470
Finland               2.7690
France                2.8500
Georgia               2.9790
Germany               2.4900
Greece                2.3800
Grenada               6.7000
Haiti                 6.3000
Hong Kong             5.3100
Iceland               4.0230
India                 5.8961
Indonesia             5.6720
Iran                  7.0000
Iraq  

**Which country had the highest population in 2000?**

In [84]:
df2 = df.pivot_table(index=['year','country'], columns='measure', values='measurement').reset_index()
df2

measure,year,country,fertility,life_expect,pop
0,1955,Afghanistan,7.7000,30.332,8891209.0
1,1955,Argentina,3.1265,64.399,18927821.0
2,1955,Aruba,5.1500,64.381,53865.0
3,1955,Australia,3.4060,70.330,9277087.0
4,1955,Austria,2.5200,67.480,6946885.0
...,...,...,...,...,...
688,2005,Switzerland,1.4200,81.701,7489370.0
689,2005,Turkey,2.1430,71.777,69660559.0
690,2005,United Kingdom,1.8150,79.425,60441457.0
691,2005,United States,2.0540,78.242,295734134.0


In [103]:
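# keep only the rows for 2000, then pull the row with the largest population
pop_2000 = df2[df2.year == 2000]
pop_2000.loc[pop_2000['pop'].idxmax()]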

measure
year                   2000
country               China
fertility               1.7
life_expect          72.028
pop            1262645000.0
Name: 580, dtype: object

**Which year had the highest overall population? What was the population?**

In [121]:
# use ['pop'] rather than .pop, which can be confused with the DataFrame.pop method
df2.groupby('year')['pop'].sum().idxmax()

2005

In [130]:
df2[df2.year == 2005]['pop'].sum()

5014673090.0
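
The two cells above compute the yearly totals twice. As a sketch, both answers can be read from one grouped series. The frame `toy` below is a hypothetical stand-in for df2, built from four rows displayed earlier in this notebook, so its totals are not the real ones.

```python
import pandas as pd

# Hypothetical stand-in for df2, using four rows shown earlier in the notebook.
toy = pd.DataFrame({
    'year': [1955, 1955, 2005, 2005],
    'country': ['Afghanistan', 'Argentina', 'Turkey', 'United Kingdom'],
    'pop': [8891209.0, 18927821.0, 69660559.0, 60441457.0],
})

# Total population per year, computed once.
# Bracket indexing is used because toy.pop refers to the DataFrame.pop method.
pop_by_year = toy.groupby('year')['pop'].sum()

top_year = pop_by_year.idxmax()   # index label (year) of the largest total
top_total = pop_by_year.max()     # the largest total itself
print(top_year, top_total)
```

Any such total covers only the countries present in the file, so it should not be read as a world total.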