# First notebook

An introduction to Jupyter Notebook and the `pandas` module, using the preparation of data for analysis as an example:

- loading data / selecting rows
- selection of data columns / deletion of unnecessary ones
- selection of rows based on values in the column
- unification of column names
- unification of string type data in different tables
- conversion of data to the appropriate format, e.g. `int`
- saving data to a new csv file

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import re

In [2]:
%pwd

'/home/u1/22_dydaktyka/04inzynier/notebooks'

In [3]:
%cd ../data

/home/u1/22_dydaktyka/04inzynier/data


In [4]:
%ls

countryPopulation.csv  WHO-COVID-19-global-data.csv
covid.csv              WPP2019_POP_F01_1_TOTAL_POPULATION_BOTH_SEXES.xlsx


# Read data

> Population data is stored in a Microsoft Excel spreadsheet (`.xlsx`).

In [5]:
# population data
df = pd.read_excel('WPP2019_POP_F01_1_TOTAL_POPULATION_BOTH_SEXES.xlsx')

In [6]:
df.head(20)

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 68,Unnamed: 69,Unnamed: 70,Unnamed: 71,Unnamed: 72,Unnamed: 73,Unnamed: 74,Unnamed: 75,Unnamed: 76,Unnamed: 77
0,,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,United Nations,,,,,,,,,,...,,,,,,,,,,
4,Population Division,,,,,,,,,,...,,,,,,,,,,
5,Department of Economic and Social Affairs,,,,,,,,,,...,,,,,,,,,,
6,,,,,,,,,,,...,,,,,,,,,,
7,World Population Prospects 2019,,,,,,,,,,...,,,,,,,,,,
8,File POP/1-1: Total population (both sexes com...,,,,,,,,,,...,,,,,,,,,,
9,"Estimates, 1950 - 2020",,,,,,,,,,...,,,,,,,,,,


### Re-reading the file

> Useful data starts from row 15 and the first column 'Index' is an index.

>The data will be reloaded without unnecessary rows.
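
The effect of `skiprows` and `index_col` can be tried on a small in-memory file. This is only a sketch: the miniature layout below is a made-up stand-in for the spreadsheet, and `pd.read_csv` is used instead of `pd.read_excel` so that no Excel engine is needed; both functions accept these two parameters.

```python
import io
import pandas as pd

# Hypothetical miniature of the spreadsheet layout: two banner rows,
# then the real header row, then the data.
raw = io.StringIO(
    "United Nations,,\n"
    "World Population Prospects 2019,,\n"
    "Index,Country,2020\n"
    "1,Burundi,11890.781\n"
    "2,Comoros,869.595\n"
)

# skiprows=2 discards the banner rows;
# index_col=0 turns the first column ('Index') into the row index.
demo = pd.read_csv(raw, skiprows=2, index_col=0)
print(demo)
```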

In [7]:
df = pd.read_excel('WPP2019_POP_F01_1_TOTAL_POPULATION_BOTH_SEXES.xlsx',
                   index_col=0, skiprows=16)

In [8]:
df.head(3)

Unnamed: 0_level_0,Variant,"Region, subregion, country or area *",Notes,Country code,Type,Parent code,1950,1951,1952,1953,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
Index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,Estimates,WORLD,,900,World,0,2.53643e+06,2.58403e+06,2.63086e+06,2.67761e+06,...,7.04119e+06,7.12583e+06,7.21058e+06,7.29529e+06,7.3798e+06,7.46402e+06,7.54786e+06,7.63109e+06,7.71347e+06,7.7948e+06
2,Estimates,UN development groups,a,1803,Label/Separator,900,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3,Estimates,More developed regions,b,901,Development Group,1803,814819,824004,833720,843788,...,1.23956e+06,1.24411e+06,1.24845e+06,1.25262e+06,1.25662e+06,1.26048e+06,1.26415e+06,1.26756e+06,1.27063e+06,1.2733e+06


# Selecting / indexing pd.DataFrame

### With `.loc/iloc` methods

> general rule `start:stop:step`

> `[rows,cols]` --> `[start:stop:step, start:stop:step]`

> `df.loc[start:stop:step, start:stop:step]`

#### Selecting
> by name: `df.loc[start:stop:step, start:stop:step]`

> by index: `df.iloc[start:stop:step, start:stop:step]`

### With attribute access (column name)

> `df.column_name`

### With `[]`
> `df['column_name']`
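
The selection forms listed above can be compared side by side on a small table. This is a minimal sketch; the frame `demo` is a hypothetical four-row sample that reuses a few values from this notebook.

```python
import pandas as pd

demo = pd.DataFrame(
    {"Country": ["Burundi", "Comoros", "Djibouti", "Eritrea"],
     "c2020": [11890781, 869595, 988002, 3546427]},
    index=[27, 28, 29, 30],
)

# .loc selects by label; the stop label (29) is included
by_label = demo.loc[27:29, "Country":"c2020"]

# .iloc selects by position; the stop position (2) is excluded
by_position = demo.iloc[0:2, 0:1]

# a step of 2 keeps every other row
every_other = demo.iloc[::2]

# attribute access and [] return the same column
as_attribute = demo.c2020
as_item = demo["c2020"]
```

Note the asymmetry: `.loc` slices include the stop label, while `.iloc` slices exclude the stop position, as ordinary Python slices do.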

### Omitting unnecessary columns
  >The necessary data are contained in the '2020' column.
  
  >The columns with data from 1950 to 2019 will be deleted.

In [9]:
# create list with column names
l = [str(x) for x in range(1950,2020)]
print(l)

['1950', '1951', '1952', '1953', '1954', '1955', '1956', '1957', '1958', '1959', '1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968', '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977', '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018', '2019']


In [10]:
# remove columns
df = df.drop(columns=l)
df.head(3)

Unnamed: 0_level_0,Variant,"Region, subregion, country or area *",Notes,Country code,Type,Parent code,2020
Index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,Estimates,WORLD,,900,World,0,7.7948e+06
2,Estimates,UN development groups,a,1803,Label/Separator,900,...
3,Estimates,More developed regions,b,901,Development Group,1803,1.2733e+06


### Unification of column names

In [11]:
df.columns

Index(['Variant', 'Region, subregion, country or area *', 'Notes',
       'Country code', 'Type', 'Parent code', '2020'],
      dtype='object')

In [12]:
# change column names - shorter names
df.columns = ['Variant', 'Country', 'Notes','Ccode', 'Type', 'Pcode', 'c2020']
df.head(1)

Unnamed: 0_level_0,Variant,Country,Notes,Ccode,Type,Pcode,c2020
Index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,Estimates,WORLD,,900,World,0,7794800.0


### Check columns

Print the unique values of each column to check which columns contain useful data.

In [13]:
for col in df.columns[:-1]:
    print(f'{col}:\n', df[col].unique(), '\n')

Variant:
 ['Estimates'] 

Country:
 ['WORLD' 'UN development groups' 'More developed regions'
 'Less developed regions' 'Least developed countries'
 'Less developed regions, excluding least developed countries'
 'Less developed regions, excluding China'
 'Land-locked Developing Countries (LLDC)'
 'Small Island Developing States (SIDS)' 'World Bank income groups'
 'High-income countries' 'Middle-income countries'
 'Upper-middle-income countries' 'Lower-middle-income countries'
 'Low-income countries' 'No income group available' 'Geographic regions'
 'Africa' 'Asia' 'Europe' 'Latin America and the Caribbean'
 'Northern America' 'Oceania' 'Sustainable Development Goal (SDG) regions'
 'SUB-SAHARAN AFRICA' 'Eastern Africa' 'Burundi' 'Comoros' 'Djibouti'
 'Eritrea' 'Ethiopia' 'Kenya' 'Madagascar' 'Malawi' 'Mauritius' 'Mayotte'
 'Mozambique' 'Réunion' 'Rwanda' 'Seychelles' 'Somalia' 'South Sudan'
 'Uganda' 'United Republic of Tanzania' 'Zambia' 'Zimbabwe'
 'Middle Africa' 'Angola' 'Cameroon' 

### Omitting unnecessary columns and rows
  
 Columns to delete - do not contain useful data:
   >`['Variant', 'Notes', 'Ccode','Pcode']` 
  
 Rows to be deleted are determined by unnecessary values from the `Type` column:  
   >`['World','Label/Separator','Development Group','Special other',
 'Income Group','Region','SDG region','Subregion','SDG subregion']`
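
Filtering out rows by a list of unwanted values can also be done in one step with `Series.isin` and a negated mask. This is a sketch on a hypothetical four-row frame; the `Type` value `'Country/Area'` is a placeholder assumed for illustration and may not match the spreadsheet.

```python
import pandas as pd

demo = pd.DataFrame({
    "Country": ["WORLD", "Africa", "Burundi", "Comoros"],
    "Type": ["World", "Region", "Country/Area", "Country/Area"],  # toy values
})
unwanted = ["World", "Region"]

# isin() marks rows whose Type is in the list; ~ inverts the mask,
# so only rows with other Type values are kept.
countries = demo[~demo["Type"].isin(unwanted)]
print(countries)
```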

In [14]:
delCol = ['Variant', 'Notes', 'Ccode','Pcode']
delRow = ['World','Label/Separator','Development Group','Special other','Income Group',
      'Region','SDG region','Subregion','SDG subregion']

In [15]:
# drop columns in 'delCol' list
df = df.drop(columns=delCol)

In [16]:
# keep only the rows whose 'Type' value is not in the 'delRow' list
for val in delRow:
    df = df[df.Type != val]

In [17]:
# remove 'Type' column
df = df.drop(columns='Type')
df.shape

(235, 2)

In [18]:
# Sort by population in descending order
df.sort_values('c2020',ascending=False).head()

Unnamed: 0_level_0,Country,c2020
Index,Unnamed: 1_level_1,Unnamed: 2_level_1
128,China,1439320.0
120,India,1380000.0
289,United States of America,331003.0
139,Indonesia,273524.0
124,Pakistan,220892.0



In [19]:
# get information about 'df'
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 235 entries, 27 to 289
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Country  235 non-null    object
 1   c2020    235 non-null    object
dtypes: object(2)
memory usage: 5.5+ KB


### Change data type
Column `c2020` contains numerical data, but it is stored in the table as `object` dtype. The values are given in thousands, so they are also multiplied by 1000.
  >Convert the column to a numeric type (`int`).
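
One alternative way to do such a conversion is `pd.to_numeric`, followed by scaling, rounding and an integer cast. This is a sketch, not the approach used in this notebook; the mixed string/float sample values are an assumption made for illustration.

```python
import pandas as pd

# Hypothetical object-dtype column holding populations in thousands
demo = pd.DataFrame({"c2020": ["11890.781", 869.595, "988.002"]}, dtype=object)

# errors='coerce' would turn unparsable entries into NaN instead of raising
thousands = pd.to_numeric(demo["c2020"], errors="coerce")

# scale to persons, round away floating-point noise, then cast to integer
demo["c2020"] = (thousands * 1000).round().astype("int64")
print(demo.dtypes)
```

Rounding before the cast matters because `astype` truncates: a product stored as `...780.9999` would otherwise lose one person.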

In [20]:
df.loc[:,'c2020'] = df.c2020.convert_dtypes()*1000
df.loc[:,'c2020'] = df.c2020.astype('int')
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 235 entries, 27 to 289
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Country  235 non-null    object
 1   c2020    235 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 5.5+ KB


Unnamed: 0_level_0,Country,c2020
Index,Unnamed: 1_level_1,Unnamed: 2_level_1
27,Burundi,11890781
28,Comoros,869595
29,Djibouti,988002
30,Eritrea,3546427
31,Ethiopia,114963583


#### Formatting

Data displayed in pd.DataFrame can be formatted - see [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html#Finer-Control:-Display-Values)
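
The format strings passed to `style.format` follow Python's standard format specification, so they can be tried on a plain number first. A small sketch, using one population value from this notebook:

```python
population = 1439323774

# '_' and ',' are the two grouping options for thousands separators
with_underscore = "{:_}".format(population)
with_comma = "{:,}".format(population)

print(with_underscore)  # 1_439_323_774
print(with_comma)       # 1,439,323,774
```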

In [21]:
df.sort_values('c2020',ascending=False).head(10).style.format({'c2020':"{:_}"})

Unnamed: 0_level_0,Country,c2020
Index,Unnamed: 1_level_1,Unnamed: 2_level_1
128,China,1_439_323_774
120,India,1_380_004_385
289,United States of America,331_002_647
139,Indonesia,273_523_621
124,Pakistan,220_892_331
190,Brazil,212_559_409
76,Nigeria,206_139_587
118,Bangladesh,164_689_383
240,Russian Federation,145_934_460
184,Mexico,128_932_753


### Country names validation

To compare country names across different tables, they must first be unified. Scope:
- remove whitespace from the beginning and end
- replace any run of whitespace between words with a single space
- capitalize the first letter of each word in the name
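
The three steps above can also be written without an explicit loop, using the vectorized `.str` accessor. This is a minimal sketch on a hypothetical series of messy names, not the cell used in this notebook.

```python
import pandas as pd

# Hypothetical messy input
names = pd.Series(["  United  States\tof America ", "china", "Isle of Man"])

clean = (
    names.str.strip()                           # trim both ends
         .str.replace(r"\s+", " ", regex=True)  # collapse whitespace runs
         .str.title()                           # capitalize each word
)
print(clean.tolist())
```

Be aware that `title()` also capitalizes short words and the letter after an apostrophe, which is why names such as `Dem. People'S Republic Of Korea` appear later in this notebook.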

In [71]:
for i, country in enumerate(df.Country):
    country = country.strip()               # trim leading/trailing whitespace
    country = re.sub(r'\s+', ' ', country)  # collapse whitespace runs
    country = country.title()               # capitalize each word
    df.iloc[i, 0] = country

df.sort_values('c2020',ascending=False).head(10).style.format({'c2020':"{:>_}"})

Unnamed: 0_level_0,Country,c2020
Index,Unnamed: 1_level_1,Unnamed: 2_level_1
128,China,1_439_323_774
120,India,1_380_004_385
289,United States Of America,331_002_647
139,Indonesia,273_523_621
124,Pakistan,220_892_331
190,Brazil,212_559_409
76,Nigeria,206_139_587
118,Bangladesh,164_689_383
240,Russian Federation,145_934_460
184,Mexico,128_932_753


In [23]:
# save data
name = 'countryPopulation.csv'
df.to_csv(name,sep=';',index=False)
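
Because the file is written with `sep=';'`, the same separator has to be given when it is read back. A short round-trip sketch, using an in-memory buffer in place of a file on disk and a hypothetical two-row frame:

```python
import io
import pandas as pd

demo = pd.DataFrame({"Country": ["Burundi", "Comoros"],
                     "c2020": [11890781, 869595]})

buffer = io.StringIO()
demo.to_csv(buffer, sep=";", index=False)  # index=False: no index column written
buffer.seek(0)

# Without sep=';' each line would be read as a single column
restored = pd.read_csv(buffer, sep=";")
print(restored)
```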

# Covid data

In [24]:
cov = pd.read_csv('WHO-COVID-19-global-data.csv',encoding='utf-8')
cov.head()

Unnamed: 0,Date_reported,Country_code,Country,WHO_region,New_cases,Cumulative_cases,New_deaths,Cumulative_deaths
0,2020-01-03,AF,Afghanistan,EMRO,0,0,0,0
1,2020-01-04,AF,Afghanistan,EMRO,0,0,0,0
2,2020-01-05,AF,Afghanistan,EMRO,0,0,0,0
3,2020-01-06,AF,Afghanistan,EMRO,0,0,0,0
4,2020-01-07,AF,Afghanistan,EMRO,0,0,0,0


In [25]:
# column names
print(cov.columns.to_list())
cov.columns = cov.columns.str.strip()
print(cov.columns.to_list())

['Date_reported', ' Country_code', ' Country', ' WHO_region', ' New_cases', ' Cumulative_cases', ' New_deaths', ' Cumulative_deaths']
['Date_reported', 'Country_code', 'Country', 'WHO_region', 'New_cases', 'Cumulative_cases', 'New_deaths', 'Cumulative_deaths']
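
The leading spaces in the column names come from the `, ` separators in the file header. As an alternative to stripping them afterwards, `pd.read_csv` has a `skipinitialspace` parameter that drops spaces following each delimiter. A sketch on a hypothetical two-line excerpt:

```python
import io
import pandas as pd

# Hypothetical excerpt with a space after each comma in the header
raw = io.StringIO(
    "Date_reported, Country_code, Country\n"
    "2020-01-03,AF,Afghanistan\n"
)

demo = pd.read_csv(raw, skipinitialspace=True)
print(demo.columns.to_list())
```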


In [26]:
# drop columns
cov = cov.drop(columns=['Country_code','WHO_region'])

In [27]:
# short names
cov.columns = ['Date', 'Country', 'Ncases','CumCases', 'Ndeaths','Cumdeaths']
cov.head()

Unnamed: 0,Date,Country,Ncases,CumCases,Ndeaths,Cumdeaths
0,2020-01-03,Afghanistan,0,0,0,0
1,2020-01-04,Afghanistan,0,0,0,0
2,2020-01-05,Afghanistan,0,0,0,0
3,2020-01-06,Afghanistan,0,0,0,0
4,2020-01-07,Afghanistan,0,0,0,0


# Countries comparison

Differences in the number and names of countries in both tables
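
The comparison relies on set difference: `a.difference(b)` (or `a - b`) returns the elements present only in `a`. A small sketch with hypothetical three-element sets shows how a case-insensitive comparison separates capitalization-only mismatches from names that really differ:

```python
# Hypothetical excerpts of the two name sets
pop_names = {"China", "Antigua And Barbuda", "Channel Islands"}
cov_names = {"China", "Antigua and Barbuda", "Jersey"}

only_pop = pop_names - cov_names  # same as pop_names.difference(cov_names)
only_cov = cov_names - pop_names

# casefold() removes case differences before comparing
fold_pop = {name.casefold() for name in pop_names}
fold_cov = {name.casefold() for name in cov_names}
really_only_pop = fold_pop - fold_cov

print(sorted(only_pop))
print(sorted(really_only_pop))
```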

In [28]:
s1 = set(df.Country)
s2 = set(cov.Country)

In [29]:
s1.difference(s2)

{'Antigua And Barbuda',
 'Bolivia (Plurinational State Of)',
 'Bonaire, Sint Eustatius And Saba',
 'Bosnia And Herzegovina',
 'Channel Islands',
 'China, Hong Kong Sar',
 'China, Macao Sar',
 'China, Taiwan Province Of China',
 "Côte D'Ivoire",
 "Dem. People'S Republic Of Korea",
 'Democratic Republic Of The Congo',
 'Iran (Islamic Republic Of)',
 'Isle Of Man',
 "Lao People'S Democratic Republic",
 'Micronesia (Fed. States Of)',
 'Northern Mariana Islands',
 'Republic Of Korea',
 'Republic Of Moldova',
 'Saint Kitts And Nevis',
 'Saint Martin (French Part)',
 'Saint Pierre And Miquelon',
 'Saint Vincent And The Grenadines',
 'Sao Tome And Principe',
 'Sint Maarten (Dutch Part)',
 'State Of Palestine',
 'Trinidad And Tobago',
 'Turks And Caicos Islands',
 'United Kingdom',
 'United Republic Of Tanzania',
 'United States Of America',
 'Venezuela (Bolivarian Republic Of)',
 'Wallis And Futuna Islands',
 'Western Sahara'}

In [30]:
s2.difference(s1)

{'Antigua and Barbuda',
 'Bolivia (Plurinational State of)',
 'Bonaire, Sint Eustatius and Saba',
 'Bosnia and Herzegovina',
 'Côte d’Ivoire',
 "Democratic People's Republic of Korea",
 'Democratic Republic of the Congo',
 'Guernsey',
 'Iran (Islamic Republic of)',
 'Isle of Man',
 'Jersey',
 'Kosovo[1]',
 "Lao People's Democratic Republic",
 'Micronesia (Federated States of)',
 'Northern Mariana Islands (Commonwealth of the)',
 'Other',
 'Pitcairn Islands',
 'Republic of Korea',
 'Republic of Moldova',
 'Saint Kitts and Nevis',
 'Saint Martin',
 'Saint Pierre and Miquelon',
 'Saint Vincent and the Grenadines',
 'Sao Tome and Principe',
 'Sint Maarten',
 'The United Kingdom',
 'Trinidad and Tobago',
 'Turks and Caicos Islands',
 'United Republic of Tanzania',
 'United States of America',
 'Venezuela (Bolivarian Republic of)',
 'Wallis and Futuna',
 'occupied Palestinian territory, including east Jerusalem'}

# Matching country names

...