# Data preparetion process

> Yes, it could be done in less amount of code, but that way it's much easier to read<br>
> Also it could be done line by line or even in Spark, but our datasets small enough to use them in Pandas

1. Get all dirs for files

In [1]:
import os
os.chdir('../Data/WHO Datasets/Unboxed')
cwd = os.getcwd()

In [2]:
# We are having datasets with and without metadata. So let's work only with 'DATA' marked datasets
datalist = []
for dirpath, dirname, filenames in os.walk(cwd):
    for file in filenames:
        if file.endswith("Data.csv"):
            datalist.append(os.path.join(dirpath, file))

2. Check what is in these datasets

In [3]:
with open(datalist[1], mode='r') as f:
    print(f.readline())
    print(f.readline())

Series Name,Series Code,Country Name,Country Code,1960 [YR1960],1961 [YR1961],1962 [YR1962],1963 [YR1963],1964 [YR1964],1965 [YR1965],1966 [YR1966],1967 [YR1967],1968 [YR1968],1969 [YR1969],1970 [YR1970],1971 [YR1971],1972 [YR1972],1973 [YR1973],1974 [YR1974],1975 [YR1975],1976 [YR1976],1977 [YR1977],1978 [YR1978],1979 [YR1979],1980 [YR1980],1981 [YR1981],1982 [YR1982],1983 [YR1983],1984 [YR1984],1985 [YR1985],1986 [YR1986],1987 [YR1987],1988 [YR1988],1989 [YR1989],1990 [YR1990],1991 [YR1991],1992 [YR1992],1993 [YR1993],1994 [YR1994],1995 [YR1995],1996 [YR1996],1997 [YR1997],1998 [YR1998],1999 [YR1999],2000 [YR2000],2001 [YR2001],2002 [YR2002],2003 [YR2003],2004 [YR2004],2005 [YR2005],2006 [YR2006],2007 [YR2007],2008 [YR2008],2009 [YR2009],2010 [YR2010],2011 [YR2011],2012 [YR2012],2013 [YR2013],2014 [YR2014],2015 [YR2015],2016 [YR2016],2017 [YR2017],2018 [YR2018],2019 [YR2019],2020 [YR2020],2021 [YR2021]

"Birth rate, crude (per 1,000 people)",SP.DYN.CBRT.IN,Afghanistan,AFG,50.34,50.44

- Because all datasets are from WHO we are can use that as a reference for all of them
- Some datasets have more than 1 series, so for grouping will be much easier to use Pandas

3. Create a list of countries that you're have in any dataset

In [4]:
import pandas as pd

countries = set()

for dataset in datalist:
    df = pd.read_csv(dataset)
    df.drop(df.loc[df['Series Name'] != df['Series Name'].unique()[0]].index, axis=0, inplace=True)
    countries.update(df['Country Code'].tolist())

len(countries)

275

4. Create a list of countries that you're have in any dataset

In [5]:
country_check = {}
for country in countries:
    country_check[country] = 'Not'

for dataset in datalist:
    df = pd.read_csv(dataset)
    df.drop(df.loc[df['Series Name'] != df['Series Name'].unique()[0]].index, axis=0, inplace=True)
    country_list = df['Country Code'].tolist()

    for country in country_list:
        if country_check[country] != 'Delete':
            country_check[country] = 'Have'
    
    for country in country_check:
        if country_check[country] == 'Not':
            country_check[country] = 'Delete'
        elif country_check[country] == 'Have':
            country_check[country] = 'Not'

countries_filtered = set()

for country in country_check:
    if country_check[country] == 'Not':
        countries_filtered.add(country)

len(countries_filtered)

193

5. Get names of countries with void lines and delete them from our countries_filtered set

In [6]:
country_check = {}
info_cols = ['Series Name','Series Code','Country Name','Country Code']

for country in countries_filtered:
    country_check[country] = True

for dataset in datalist:
    df = pd.read_csv(dataset, na_values='..')
    df.drop(info_cols[0:3], axis=1, inplace=True)
    idx = list(set(df.index) - set(df.drop('Country Code', axis=1).dropna(how='all').index))
    for country in df.iloc[idx]['Country Code'].dropna().to_list():
        country_check[country] = False
    
for country in country_check:
    if country_check[country] == False and country in countries_filtered:
        countries_filtered.remove(country)

len(countries_filtered)

139

6. Choose a period that you're interested in

> Based on the fact that each dataset can cover different periods, let's use smallest timespan in our datasets

In [7]:
period = {}

for dataset in datalist:
    df = pd.read_csv(dataset, na_values='..')
    years = set(df.drop(info_cols, axis=1).dropna(how='all',axis=1).columns)
    print(df['Series Name'][0],'\n', years)

Total alcohol consumption per capita (liters of pure alcohol, projected estimates, 15+ years of age) 
 {'2000 [YR2000]', '2005 [YR2005]', '2010 [YR2010]', '2015 [YR2015]', '2018 [YR2018]'}
Birth rate, crude (per 1,000 people) 
 {'1962 [YR1962]', '1995 [YR1995]', '2012 [YR2012]', '2005 [YR2005]', '2017 [YR2017]', '1996 [YR1996]', '2003 [YR2003]', '1991 [YR1991]', '1966 [YR1966]', '1981 [YR1981]', '1977 [YR1977]', '2007 [YR2007]', '2013 [YR2013]', '1978 [YR1978]', '1972 [YR1972]', '1986 [YR1986]', '1973 [YR1973]', '1971 [YR1971]', '2002 [YR2002]', '1969 [YR1969]', '2009 [YR2009]', '1993 [YR1993]', '1994 [YR1994]', '1965 [YR1965]', '2016 [YR2016]', '1976 [YR1976]', '1990 [YR1990]', '1960 [YR1960]', '1988 [YR1988]', '2010 [YR2010]', '1998 [YR1998]', '1970 [YR1970]', '1974 [YR1974]', '2015 [YR2015]', '2019 [YR2019]', '1983 [YR1983]', '1985 [YR1985]', '1964 [YR1964]', '2014 [YR2014]', '1961 [YR1961]', '1979 [YR1979]', '1984 [YR1984]', '1987 [YR1987]', '1989 [YR1989]', '2011 [YR2011]', '2006 

As we can see - we will have a very bad time with our datasets in some cases. So we should forget about datasets:<br>
`Mortality rate attributed to unsafe water, unsafe sanitation and lack of hygiene`<br>
`Total alcohol consumption per capita`

In [8]:
del datalist[0]
del datalist[-4]

In [9]:
period = {}
years_min = 0
years_max = 9999

for dataset in datalist:
    df = pd.read_csv(dataset, na_values='..')
    years = set(df.drop(info_cols, axis=1).dropna(how='all',axis=1).columns)
    if int(min(years)[0:4]) > years_min:
        years_min = int(min(years)[0:4])
    if int(max(years)[0:4]) < years_max:
        years_max = int(max(years)[0:4])

print(years_min, years_max)

2000 2019


We will increase that timespan to 20 years. 1999-2019

In [10]:
# Savilg it as a set of columns
df = pd.read_csv(datalist[0], na_values='..')
years_col_list = df.drop(info_cols, axis=1).columns
years_col_list

Index(['1960 [YR1960]', '1961 [YR1961]', '1962 [YR1962]', '1963 [YR1963]',
       '1964 [YR1964]', '1965 [YR1965]', '1966 [YR1966]', '1967 [YR1967]',
       '1968 [YR1968]', '1969 [YR1969]', '1970 [YR1970]', '1971 [YR1971]',
       '1972 [YR1972]', '1973 [YR1973]', '1974 [YR1974]', '1975 [YR1975]',
       '1976 [YR1976]', '1977 [YR1977]', '1978 [YR1978]', '1979 [YR1979]',
       '1980 [YR1980]', '1981 [YR1981]', '1982 [YR1982]', '1983 [YR1983]',
       '1984 [YR1984]', '1985 [YR1985]', '1986 [YR1986]', '1987 [YR1987]',
       '1988 [YR1988]', '1989 [YR1989]', '1990 [YR1990]', '1991 [YR1991]',
       '1992 [YR1992]', '1993 [YR1993]', '1994 [YR1994]', '1995 [YR1995]',
       '1996 [YR1996]', '1997 [YR1997]', '1998 [YR1998]', '1999 [YR1999]',
       '2000 [YR2000]', '2001 [YR2001]', '2002 [YR2002]', '2003 [YR2003]',
       '2004 [YR2004]', '2005 [YR2005]', '2006 [YR2006]', '2007 [YR2007]',
       '2008 [YR2008]', '2009 [YR2009]', '2010 [YR2010]', '2011 [YR2011]',
       '2012 [YR2012]', '

In [11]:
# We will check first and last position. We'll work with everything inbetween
years_col_list = [
    '2000 [YR2000]', '2019 [YR2019]'
]

In [12]:
years_col_list_full = [
    '2000 [YR2000]', '2001 [YR2001]', '2002 [YR2002]', '2003 [YR2003]',
    '2004 [YR2004]', '2005 [YR2005]', '2006 [YR2006]', '2007 [YR2007]',
    '2008 [YR2008]', '2009 [YR2009]', '2010 [YR2010]', '2011 [YR2011]',
    '2012 [YR2012]', '2013 [YR2013]', '2014 [YR2014]', '2015 [YR2015]',
    '2016 [YR2016]', '2017 [YR2017]', '2018 [YR2018]', '2019 [YR2019]'
]

7. Delete all countries that have smaller observation period (SHOuLD BE REWRITTEN)

In [13]:
# For check if country is inside our period
countries_check = {}
for country in countries_filtered:
    countries_check[country] = 'Not'


for dataset in datalist:
    df = pd.read_csv(dataset, na_values='..')
    df_years = df.loc[df['Series Name'][0] == df['Series Name']].drop(info_cols[0:3], axis=1)

    df_years[years_col_list + ['Country Code']]
    
    for country in df_years[years_col_list + ['Country Code']].dropna(how='any')['Country Code']:
        if country in countries_check and countries_check[country] != 'Delete':
          countries_check[country] = 'Have'
    
    for country in countries_check:
        if countries_check[country] == 'Not':
            countries_check[country] = 'Delete'
        elif countries_check[country] == 'Have':
            countries_check[country] = 'Not'

b=[i[1] for i in countries_check.items()]

for k in list(set(b)):
    print("{0}: {1}".format(k, b.count(k)))

Not: 20
Delete: 119


In [14]:
# For check if country is inside our period
countries_check = {}
for country in countries_filtered:
    countries_check[country] = 'Not'




for dataset in datalist:
    df = pd.read_csv(dataset, na_values='..')
    df_years = df.loc[df['Series Name'][0] == df['Series Name']].drop(info_cols[0:3], axis=1)

    for col in df_years.columns.drop('Country Code'):
        if col in years_col_list:
            years_col_list_check[col] = True



    df_years[temp_year_list + ['Country Code']]
    
    for country in df_years[temp_year_list + ['Country Code']].dropna(how='any')['Country Code']:
        if country in countries_check and countries_check[country] != 'Delete':
          countries_check[country] = 'Have'
    
    for country in countries_check:
        if countries_check[country] == 'Not':
            countries_check[country] = 'Delete'
        elif countries_check[country] == 'Have':
            countries_check[country] = 'Not'

    for year in years_col_list:
        years_col_list_check[year] = False

b=[i[1] for i in countries_check.items()]

for k in list(set(b)):
    print("{0}: {1}".format(k, b.count(k)))

Not: 20
Delete: 119


So we will use 20 countries. Let's record list of their country codes in the 

In [15]:
for country in countries_check:
    if countries_check[country] == 'Delete' and country in countries_filtered:
        countries_filtered.remove(country)

len(countries_filtered)

20

8. Check for data gaps in your data and fill them if needed

In [16]:
len(datalist)

11

Datasets should be checked separatelty, one by one, each has it's own specific

In [None]:
# Since it will be a lot of datasets, here as functions i will create repeatetive tasks

In [67]:
# Pattern will be similar so I'll not comment repeated things
df = pd.read_csv(datalist[0], na_values="..")
df = df.loc[df['Country Code'].isin(countries_filtered)]

# We will assume based on the data collection process that if table has more than 1 series, then gaps in them are positioned in the same places
# It is based on the fact that each table is from the same research process. Different series is only separation by gender or class
# df_years = df.loc[df['Series Name'][0] == df['Series Name']].drop(info_cols[0:3], axis=1)

# For check if all years are in our dataset
years_col_list_check = {}
for year in years_col_list_full:
    years_col_list_check[year] = False

for col in df_years.columns.drop('Country Code'):
    if col in years_col_list_full:
        years_col_list_check[col] = True

# That list is needed for the column selection. If we're don't have any, we will get an error
# I'm sure that there is another way, but this is easier
temp_year_list = []
for year in years_col_list_check:
    if years_col_list_check[year] == True:
        temp_year_list.append(year)
    else:
        print("Table doesn't have column {0}".format(year))

na_all_check = False

for country in df['Country Code']:
    row_check = False
    na_col_list = []
    for col in temp_year_list:
       row_check = df.loc[df['Country Code'] == country][col].isna().bool() or row_check
       if df.loc[df['Country Code'] == country][col].isna().bool() == True:
           na_col_list.append(col)
    if row_check == True:
        print('Country {0} has an null values in the next columns: {1}'.format(country, na_col_list))
        na_all_check = True

if na_all_check == False:
    print('Countries have no NaN values except missing columns')

df[['Country Code'] + temp_year_list]

Table doesn't have column 2001 [YR2001]
Countries have no NaN values except missing columns


Unnamed: 0,Country Code,2000 [YR2000],2002 [YR2002],2003 [YR2003],2004 [YR2004],2005 [YR2005],2006 [YR2006],2007 [YR2007],2008 [YR2008],2009 [YR2009],2010 [YR2010],2011 [YR2011],2012 [YR2012],2013 [YR2013],2014 [YR2014],2015 [YR2015],2016 [YR2016],2017 [YR2017],2018 [YR2018],2019 [YR2019]
7,ARG,19.366,18.756,18.453,18.352,18.353,18.194,18.005,17.812,17.849,17.87,17.806,17.729,17.632,17.504,17.346,16.824,16.206,15.187,14.783
11,AUT,9.8,9.7,9.5,9.7,9.5,9.4,9.2,9.3,9.2,9.4,9.3,9.4,9.4,9.6,9.8,10.0,10.0,9.7,9.6
17,BLR,9.4,8.9,9.0,9.1,9.3,9.9,10.7,11.1,11.4,11.4,11.5,12.2,12.5,12.5,12.5,12.4,10.8,9.9,9.3
18,BEL,11.4,10.9,11.0,11.3,11.4,11.6,11.7,11.9,11.8,11.9,11.7,11.5,11.3,11.2,10.8,10.8,10.5,10.4,10.2
42,COL,22.114,21.021,20.444,19.853,19.265,18.688,18.173,17.702,17.275,16.909,16.609,16.344,16.088,15.834,15.585,15.353,15.081,14.841,14.682
46,CRI,19.788,18.12,17.656,17.228,16.99,16.797,16.823,16.899,16.53,16.04,15.772,15.483,15.093,14.852,14.522,14.145,13.926,13.562,12.847
53,DNK,12.6,11.9,12.0,12.0,11.9,12.0,11.7,11.8,11.4,11.4,10.6,10.4,10.0,10.1,10.2,10.8,10.6,10.6,10.5
56,DOM,24.842,23.868,23.425,22.972,22.485,21.865,21.705,21.947,22.132,22.038,21.792,21.454,21.166,20.867,20.596,20.258,19.993,19.762,19.291
59,SLV,27.465,24.484,23.257,22.085,21.261,20.689,20.182,19.812,19.584,19.428,19.372,19.151,18.969,18.809,18.406,17.566,16.974,16.536,16.416
67,FIN,11.0,10.7,10.9,11.0,11.0,11.2,11.1,11.2,11.3,11.4,11.1,11.0,10.7,10.5,10.1,9.6,9.1,8.6,8.3


In [18]:
df = pd.read_csv(datalist[1], na_values="..")
df

Unnamed: 0,Series Name,Series Code,Country Name,Country Code,1996 [YR1996],1998 [YR1998],2000 [YR2000],2002 [YR2002],2003 [YR2003],2004 [YR2004],...,2012 [YR2012],2013 [YR2013],2014 [YR2014],2015 [YR2015],2016 [YR2016],2017 [YR2017],2018 [YR2018],2019 [YR2019],2020 [YR2020],2021 [YR2021]
0,Control of Corruption: Percentile Rank,CC.PER.RNK,Russian Federation,RUS,15.053763,21.390375,20.212767,20.634920,25.396826,23.645321,...,15.165876,15.639811,17.307692,15.865385,21.634615,17.307692,20.192308,24.038462,19.230770,19.711538
1,Control of Corruption: Percentile Rank,CC.PER.RNK,Afghanistan,AFG,4.301075,8.021390,4.787234,4.761905,4.761905,6.403941,...,1.421801,0.947867,5.288462,5.769231,3.846154,3.846154,4.807693,6.250000,4.807693,12.500000
2,Control of Corruption: Percentile Rank,CC.PER.RNK,Albania,ALB,19.354839,18.181818,23.936171,24.338625,22.751324,27.586206,...,25.118483,24.170616,34.134617,37.500000,37.980770,41.826923,34.615383,31.730770,31.730770,31.730770
3,Control of Corruption: Percentile Rank,CC.PER.RNK,Algeria,DZA,33.333332,22.994652,16.489361,22.751324,29.629629,29.064039,...,36.018959,39.336494,31.250000,30.769230,30.288462,32.211540,28.846153,28.365385,27.884615,29.807692
4,Control of Corruption: Percentile Rank,CC.PER.RNK,American Samoa,ASM,,,,,,76.847290,...,65.876778,65.876778,87.019234,87.019234,87.500000,94.711540,94.711540,94.711540,88.461540,88.942307
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
214,,,,,,,,,,,...,,,,,,,,,,
215,,,,,,,,,,,...,,,,,,,,,,
216,,,,,,,,,,,...,,,,,,,,,,
217,Data from database: Worldwide Governance Indic...,,,,,,,,,,...,,,,,,,,,,


In [19]:
df = pd.read_csv(datalist[2], na_values="..")
df

Unnamed: 0,Series Name,Series Code,Country Name,Country Code,1960 [YR1960],1961 [YR1961],1962 [YR1962],1963 [YR1963],1964 [YR1964],1965 [YR1965],...,2012 [YR2012],2013 [YR2013],2014 [YR2014],2015 [YR2015],2016 [YR2016],2017 [YR2017],2018 [YR2018],2019 [YR2019],2020 [YR2020],2021 [YR2021]
0,"Death rate, crude (per 1,000 people)",SP.DYN.CDRT.IN,Afghanistan,AFG,31.921,31.349,30.845,30.359,29.867,29.389,...,7.711,7.478,7.395,7.331,7.077,7.027,6.981,6.791,7.113,
1,"Death rate, crude (per 1,000 people)",SP.DYN.CDRT.IN,Albania,ALB,16.681,15.735,14.871,13.918,12.993,12.146,...,7.573,7.819,7.868,7.947,8.035,8.150,8.308,8.480,10.785,
2,"Death rate, crude (per 1,000 people)",SP.DYN.CDRT.IN,Algeria,DZA,23.785,23.723,25.046,22.615,22.917,23.106,...,4.767,4.673,4.555,4.437,4.472,4.542,4.482,4.392,5.398,
3,"Death rate, crude (per 1,000 people)",SP.DYN.CDRT.IN,American Samoa,ASM,,,,,,,...,,,4.200,,,5.100,,,5.600,
4,"Death rate, crude (per 1,000 people)",SP.DYN.CDRT.IN,Andorra,AND,,,,,,,...,3.900,,,,4.300,,4.400,3.900,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
266,,,,,,,,,,,...,,,,,,,,,,
267,,,,,,,,,,,...,,,,,,,,,,
268,,,,,,,,,,,...,,,,,,,,,,
269,Data from database: World Development Indicators,,,,,,,,,,...,,,,,,,,,,


In [20]:
df = pd.read_csv(datalist[3], na_values="..")
df

Unnamed: 0,Series Name,Series Code,Country Name,Country Code,1960 [YR1960],1961 [YR1961],1962 [YR1962],1963 [YR1963],1964 [YR1964],1965 [YR1965],...,2012 [YR2012],2013 [YR2013],2014 [YR2014],2015 [YR2015],2016 [YR2016],2017 [YR2017],2018 [YR2018],2019 [YR2019],2020 [YR2020],2021 [YR2021]
0,Domestic general government health expenditure...,SH.XPD.GHED.PC.CD,Afghanistan,AFG,,,,,,,...,2.233482,2.770863,2.926995,3.046787,3.056288,3.348260,2.721409,5.388990,,
1,Domestic general government health expenditure...,SH.XPD.GHED.PC.CD,Albania,ALB,,,,,,,...,103.797180,113.417021,118.457388,107.598633,115.212352,120.758235,148.436569,,,
2,Domestic general government health expenditure...,SH.XPD.GHED.PC.CD,Algeria,DZA,,,,,,,...,244.723577,235.838739,258.972463,205.426791,176.502810,170.199512,168.575678,161.333285,,
3,Domestic general government health expenditure...,SH.XPD.GHED.PC.CD,American Samoa,ASM,,,,,,,...,,,,,,,,,,
4,Domestic general government health expenditure...,SH.XPD.GHED.PC.CD,Andorra,AND,,,,,,,...,1574.375856,1627.334496,1725.140756,1549.631194,1631.386656,1789.861213,1916.984563,1906.859429,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
266,,,,,,,,,,,...,,,,,,,,,,
267,,,,,,,,,,,...,,,,,,,,,,
268,,,,,,,,,,,...,,,,,,,,,,
269,Data from database: World Development Indicators,,,,,,,,,,...,,,,,,,,,,


In [21]:
df = pd.read_csv(datalist[4], na_values="..")
df

Unnamed: 0,Series Name,Series Code,Country Name,Country Code,1960 [YR1960],1961 [YR1961],1962 [YR1962],1963 [YR1963],1964 [YR1964],1965 [YR1965],...,2012 [YR2012],2013 [YR2013],2014 [YR2014],2015 [YR2015],2016 [YR2016],2017 [YR2017],2018 [YR2018],2019 [YR2019],2020 [YR2020],2021 [YR2021]
0,GDP per capita (current US$),NY.GDP.PCAP.CD,Afghanistan,AFG,62.369375,62.443703,60.950364,82.021738,85.511073,105.243196,...,663.141053,651.987862,628.146804,592.476537,520.252064,530.149831,502.056771,500.522664,516.866552,368.754614
1,GDP per capita (current US$),NY.GDP.PCAP.CD,Albania,ALB,,,,,,,...,4247.630047,4413.062005,4578.633208,3952.802538,4124.055390,4531.019374,5287.663694,5396.215864,5332.160475,6492.872012
2,GDP per capita (current US$),NY.GDP.PCAP.CD,Algeria,DZA,239.031069,209.915477,169.925637,225.821562,238.875870,253.307007,...,5610.730894,5519.777576,5516.230604,4197.421361,3967.199451,4134.936720,4171.795011,4022.150184,3337.252512,3690.627878
3,GDP per capita (current US$),NY.GDP.PCAP.CD,American Samoa,ASM,,,,,,,...,11920.061090,12038.871592,12313.997357,13101.541816,13300.824611,12372.884783,13195.935900,13672.576657,15501.526337,15743.310758
4,GDP per capita (current US$),NY.GDP.PCAP.CD,Andorra,AND,,,,,,,...,44904.580043,44750.435680,45682.246231,38885.376014,39932.164487,40632.484393,42903.443579,41327.502031,37207.493861,42137.327271
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
266,,,,,,,,,,,...,,,,,,,,,,
267,,,,,,,,,,,...,,,,,,,,,,
268,,,,,,,,,,,...,,,,,,,,,,
269,Data from database: World Development Indicators,,,,,,,,,,...,,,,,,,,,,


In [22]:
df = pd.read_csv(datalist[5], na_values="..")
df

Unnamed: 0,Series Name,Series Code,Country Name,Country Code,1960 [YR1960],1961 [YR1961],1962 [YR1962],1963 [YR1963],1964 [YR1964],1965 [YR1965],...,2012 [YR2012],2013 [YR2013],2014 [YR2014],2015 [YR2015],2016 [YR2016],2017 [YR2017],2018 [YR2018],2019 [YR2019],2020 [YR2020],2021 [YR2021]
0,Income share held by highest 10%,SI.DST.10TH.10,Afghanistan,AFG,,,,,,,...,,,,,,,,,,
1,Income share held by highest 10%,SI.DST.10TH.10,Albania,ALB,,,,,,,...,22.9,,25.5,24.8,25.0,24.6,22.7,23.8,,
2,Income share held by highest 10%,SI.DST.10TH.10,Algeria,DZA,,,,,,,...,,,,,,,,,,
3,Income share held by highest 10%,SI.DST.10TH.10,American Samoa,ASM,,,,,,,...,,,,,,,,,,
4,Income share held by highest 10%,SI.DST.10TH.10,Andorra,AND,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
266,,,,,,,,,,,...,,,,,,,,,,
267,,,,,,,,,,,...,,,,,,,,,,
268,,,,,,,,,,,...,,,,,,,,,,
269,Data from database: World Development Indicators,,,,,,,,,,...,,,,,,,,,,


In [23]:
df = pd.read_csv(datalist[6], na_values="..")
df

Unnamed: 0,Series Name,Series Code,Country Name,Country Code,1960 [YR1960],1961 [YR1961],1962 [YR1962],1963 [YR1963],1964 [YR1964],1965 [YR1965],...,2013 [YR2013],2014 [YR2014],2015 [YR2015],2016 [YR2016],2017 [YR2017],2018 [YR2018],2019 [YR2019],2020 [YR2020],2021 [YR2021],2022 [YR2022]
0,Land Surface Temperature,EN.LND.LTMP.DC,Russian Federation,RUS,,,,,,,...,-4.258061,-3.753492,-2.753141,-2.374735,-3.254571,-3.754938,-3.254885,-1.247285,-4.255229,
1,Land Surface Temperature,EN.LND.LTMP.DC,Afghanistan,AFG,,,,,,,...,29.124779,28.628137,28.875890,30.626397,29.874431,30.125905,27.872626,28.125945,30.874159,
2,Land Surface Temperature,EN.LND.LTMP.DC,Albania,ALB,,,,,,,...,19.812026,19.316749,19.939278,19.061845,20.311461,19.687391,19.935919,20.314952,19.438976,
3,Land Surface Temperature,EN.LND.LTMP.DC,Algeria,DZA,,,,,,,...,38.628051,38.187711,37.687921,38.438034,37.687599,37.874857,38.376966,38.625905,38.626593,
4,Land Surface Temperature,EN.LND.LTMP.DC,Andorra,AND,,,,,,,...,8.826377,9.530140,10.960856,11.337043,11.091862,9.274452,11.459539,11.403015,10.278319,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
239,,,,,,,,,,,...,,,,,,,,,,
240,,,,,,,,,,,...,,,,,,,,,,
241,,,,,,,,,,,...,,,,,,,,,,
242,Data from database: Environment Social and Gov...,,,,,,,,,,...,,,,,,,,,,


In [24]:
df = pd.read_csv(datalist[7], na_values="..")
df

Unnamed: 0,Series Name,Series Code,Country Name,Country Code,1960 [YR1960],1961 [YR1961],1962 [YR1962],1963 [YR1963],1964 [YR1964],1965 [YR1965],...,2012 [YR2012],2013 [YR2013],2014 [YR2014],2015 [YR2015],2016 [YR2016],2017 [YR2017],2018 [YR2018],2019 [YR2019],2020 [YR2020],2021 [YR2021]
0,"Life expectancy at birth, female (years)",SP.DYN.LE00.FE.IN,Afghanistan,AFG,33.285,33.813,34.297,34.773,35.246,35.702,...,63.514,64.027,64.274,64.576,65.096,66.099,66.458,66.677,65.432,
1,"Life expectancy at birth, female (years)",SP.DYN.LE00.FE.IN,Albania,ALB,57.780,58.900,59.750,60.831,61.851,62.866,...,80.703,80.781,81.013,81.183,81.377,81.504,81.608,81.666,79.676,
2,"Life expectancy at birth, female (years)",SP.DYN.LE00.FE.IN,Algeria,DZA,43.608,43.467,42.569,43.476,43.357,43.279,...,75.478,75.840,76.467,76.824,76.803,76.821,77.205,77.760,75.912,
3,"Life expectancy at birth, female (years)",SP.DYN.LE00.FE.IN,American Samoa,ASM,,,,,,,...,,,,,,,,,,
4,"Life expectancy at birth, female (years)",SP.DYN.LE00.FE.IN,Andorra,AND,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
798,,,,,,,,,,,...,,,,,,,,,,
799,,,,,,,,,,,...,,,,,,,,,,
800,,,,,,,,,,,...,,,,,,,,,,
801,Data from database: World Development Indicators,,,,,,,,,,...,,,,,,,,,,


In [25]:
df = pd.read_csv(datalist[9], na_values="..")
df

Unnamed: 0,Series Name,Series Code,Country Name,Country Code,1960 [YR1960],1961 [YR1961],1962 [YR1962],1963 [YR1963],1964 [YR1964],1965 [YR1965],...,2012 [YR2012],2013 [YR2013],2014 [YR2014],2015 [YR2015],2016 [YR2016],2017 [YR2017],2018 [YR2018],2019 [YR2019],2020 [YR2020],2021 [YR2021]
0,"Suicide mortality rate (per 100,000 population)",SH.STA.SUIC.P5,Afghanistan,AFG,,,,,,,...,4.0,4.0,3.9,4.0,4.0,4.1,4.1,4.1,,
1,"Suicide mortality rate (per 100,000 population)",SH.STA.SUIC.P5,Albania,ALB,,,,,,,...,5.2,5.3,5.0,4.8,4.7,4.7,4.5,4.3,,
2,"Suicide mortality rate (per 100,000 population)",SH.STA.SUIC.P5,Algeria,DZA,,,,,,,...,2.9,2.9,2.8,2.7,2.6,2.5,2.5,2.5,,
3,"Suicide mortality rate (per 100,000 population)",SH.STA.SUIC.P5,American Samoa,ASM,,,,,,,...,,,,,,,,,,
4,"Suicide mortality rate (per 100,000 population)",SH.STA.SUIC.P5,Andorra,AND,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
798,,,,,,,,,,,...,,,,,,,,,,
799,,,,,,,,,,,...,,,,,,,,,,
800,,,,,,,,,,,...,,,,,,,,,,
801,Data from database: World Development Indicators,,,,,,,,,,...,,,,,,,,,,


In [26]:
df = pd.read_csv(datalist[10], na_values="..")
df

Unnamed: 0,Series Name,Series Code,Country Name,Country Code,1996 [YR1996],1998 [YR1998],2000 [YR2000],2002 [YR2002],2003 [YR2003],2004 [YR2004],...,2012 [YR2012],2013 [YR2013],2014 [YR2014],2015 [YR2015],2016 [YR2016],2017 [YR2017],2018 [YR2018],2019 [YR2019],2020 [YR2020],2021 [YR2021]
0,Voice and Accountability: Percentile Rank,VA.PER.RNK,Russian Federation,RUS,43.5,36.815922,37.810944,36.318409,32.338310,30.769230,...,19.248827,18.779343,20.689655,20.197044,17.733990,18.719212,18.840580,17.874395,20.289856,19.806763
1,Voice and Accountability: Percentile Rank,VA.PER.RNK,Afghanistan,AFG,1.0,0.497512,0.995025,9.452736,14.427860,15.384615,...,14.084507,14.553990,16.256157,18.719212,20.689655,22.167488,20.289856,21.256039,19.806763,7.246377
2,Voice and Accountability: Percentile Rank,VA.PER.RNK,Albania,ALB,29.5,38.805969,41.293533,48.258705,50.248756,49.038460,...,50.704224,51.173710,50.246304,52.709358,52.216747,54.187191,52.657005,52.173912,51.207729,50.241547
3,Voice and Accountability: Percentile Rank,VA.PER.RNK,Algeria,DZA,14.5,12.437811,14.427860,18.905472,18.905472,23.557692,...,22.535212,23.943663,25.123152,24.630543,23.645321,23.152710,21.256039,19.806763,19.323671,20.772947
4,Voice and Accountability: Percentile Rank,VA.PER.RNK,American Samoa,ASM,,,,,,64.423080,...,80.751175,81.220657,,,,,95.169083,92.753624,81.642509,75.845413
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
214,,,,,,,,,,,...,,,,,,,,,,
215,,,,,,,,,,,...,,,,,,,,,,
216,,,,,,,,,,,...,,,,,,,,,,
217,Data from database: Worldwide Governance Indic...,,,,,,,,,,...,,,,,,,,,,


In [27]:
df = pd.read_csv(datalist[11], na_values="..")
df

IndexError: list index out of range

9. Save in the most comfortable way