# BW \#75 Refugees
How many refugees are there in general around the world? Has the number increased over the years? Where do refugees come from, and where do they go?
This week, we'll look at data about international refugees and try to understand the countries and numbers involved. This is obviously a large and complex topic, one filled with heartbreak and pain, as well as political turmoil in the countries that refugees leave and the countries where they arrive. Immigration and refugees are hotly debated topics in many countries.

## Data and seven questions
The data comes from three files, all produced by the World Bank:
- The population of each country and area, per year (population): https://api.worldbank.org/v2/en/indicator/SP.POP.TOTL?downloadformat=csv
- The number of refugees who left each country and area, per year (origin): https://api.worldbank.org/v2/en/indicator/SM.POP.REFG.OR?downloadformat=csv
- The number of refugees who arrived in each country and area, per year (destination): https://api.worldbank.org/v2/en/indicator/SM.POP.REFG?downloadformat=csv

## Challenges
Learning goals include combining data frames, working with multi-indexes, grouping, filtering, working with time data, and resampling.
- Create a single data frame whose index is made up of country names. The columns will be a multi-index with top-level names "origin", "destination", and "population". The lower level of the multi-index should contain the years; you can remove other columns.
- What 10 countries accepted the most refugees in 2000 and 2023?

In [2]:
import pandas as pd

Reading the file with the first (commented-out) line of code raises a `ParserError`, because the column headers are not on the first line of the CSV file. We therefore pass `header=2`, which tells `read_csv` to use row 2 (counting from 0, and ignoring blank lines) as the header row and to skip everything above it. We also pass `index_col='Country Name'` so that the country names become the index.
We only want to keep the year columns, so we drop the others.
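
To see what `header=2` does without the real download, here is a minimal sketch on a made-up, World Bank-style file held in memory. The sample text, its values, and the names `sample_csv` and `sample_df` are assumptions for illustration; only the general layout (metadata lines, then the header row, with a trailing comma on each line) is meant to mimic the real file.

```python
import io

import pandas as pd

# Made-up sample mimicking the layout of a World Bank CSV download:
# two metadata lines (separated by blank lines), then the header row.
sample_csv = '''"Data Source","World Development Indicators",

"Last Updated Date","2024-01-01",

"Country Name","Country Code","Indicator Name","Indicator Code","1960","1961",
"Aruba","ABW","Some indicator","XX.YYY","","",
"Angola","AGO","Some indicator","XX.YYY","150000","150000",
'''

# With the defaults, pandas takes the first line as the header; the wider
# rows further down then typically trigger a ParserError.
try:
    pd.read_csv(io.StringIO(sample_csv))
except pd.errors.ParserError as exc:
    print(f"Default read failed: {exc}")

# header=2 uses row 2 (zero-based, blank lines ignored) as the header row.
sample_df = pd.read_csv(io.StringIO(sample_csv),
                        header=2,
                        index_col='Country Name')
print(sample_df.columns.tolist())
```

The trailing comma at the end of each line is most likely what produces the extra `Unnamed: ...` column seen in the real file.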

In [37]:
#refugee_origin_df = pd.read_csv(r"C:\Users\npigeon\Git\BW #75 Refugees\API_SM.POP.REFG.OR_DS2_en_csv_v2_2022116\API_SM.POP.REFG.OR_DS2_en_csv_v2_2022116.csv")
refugee_origin_df = pd.read_csv(r"C:\Users\npigeon\Git\BW #75 Refugees\API_SM.POP.REFG.OR_DS2_en_csv_v2_2022116\API_SM.POP.REFG.OR_DS2_en_csv_v2_2022116.csv",
                                header=2, # skip the first two rows
                                index_col='Country Name') # setting the first level of index to 'Country Name'

refugee_origin_df.columns

Index(['Country Code', 'Indicator Name', 'Indicator Code', '1960', '1961',
       '1962', '1963', '1964', '1965', '1966', '1967', '1968', '1969', '1970',
       '1971', '1972', '1973', '1974', '1975', '1976', '1977', '1978', '1979',
       '1980', '1981', '1982', '1983', '1984', '1985', '1986', '1987', '1988',
       '1989', '1990', '1991', '1992', '1993', '1994', '1995', '1996', '1997',
       '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005', '2006',
       '2007', '2008', '2009', '2010', '2011', '2012', '2013', '2014', '2015',
       '2016', '2017', '2018', '2019', '2020', '2021', '2022', '2023',
       'Unnamed: 68'],
      dtype='object')

In [38]:
refugee_origin_df.drop(['Country Code', 'Indicator Name', 'Indicator Code', 'Unnamed: 68'], axis=1, inplace=True)
refugee_origin_df

Country Name,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
Aruba,,,,,,,,,,,...,,,,,,,,,,
Africa Eastern and Southern,,,,,,,,,,,...,3559470.0,4195068.0,4917140.0,6058091.0,5958200.0,5936536.0,5959979.0,6146257.0,6124318.0,6939594.0
Afghanistan,,,,,,,,,,,...,2596259.0,2666294.0,2501447.0,2624265.0,2681267.0,2727556.0,2594827.0,2712869.0,5661717.0,6403144.0
Africa Western and Central,,,,,,,,,,,...,919751.0,1038377.0,1090726.0,1155918.0,1283440.0,1354330.0,1457853.0,1675916.0,1750586.0,1793223.0
Angola,150000.0,150000.0,170000.0,175000.0,200000.0,220000.0,303800.0,356200.0,381360.0,408190.0,...,9468.0,11855.0,8388.0,8292.0,8243.0,8176.0,8196.0,11403.0,12021.0,11506.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Kosovo,,,,,,,,,,,...,,,,,,,,,,
"Yemen, Rep.",,,,,,,,,,,...,2623.0,15901.0,18427.0,23555.0,31145.0,36522.0,32433.0,37615.0,40147.0,48044.0
South Africa,,,,,,,390.0,450.0,740.0,290.0,...,422.0,447.0,452.0,461.0,482.0,441.0,474.0,643.0,784.0,919.0
Zambia,,,,,,,,,15000.0,13000.0,...,309.0,334.0,256.0,260.0,266.0,259.0,252.0,255.0,296.0,324.0


In the correction, they tell pandas to keep only the columns whose names consist entirely of digits (here, the four-digit years) with the `filter` method, passing a regular expression via the `regex` keyword argument and specifying that we're looking at the column names via `axis='columns'`. The regular expression has three parts:
- `^` anchors our match to the start of the string
- `\d+` looks for one or more decimal digits
- `$` anchors our match to the end of the string
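
To see why both anchors matter, here is a small sketch on a toy data frame. The frame and its values are made up; the column names are borrowed from the real file. `filter` keeps a column when the pattern is found anywhere in its name, so without `^` and `$` a name such as `Unnamed: 68` would also be kept.

```python
import pandas as pd

# Toy frame with one column whose name contains digits
# but is not purely numeric.
toy = pd.DataFrame([[1, 2, 3, 4]],
                   columns=['Country Code', '1960', '1961', 'Unnamed: 68'])

anchored = toy.filter(regex=r'^\d+$', axis='columns')
unanchored = toy.filter(regex=r'\d+', axis='columns')

print(anchored.columns.tolist())    # only the purely numeric names
print(unanchored.columns.tolist())  # also keeps 'Unnamed: 68'
```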


In [42]:
refugee_origin_df = pd.read_csv(r"C:\Users\npigeon\Git\BW #75 Refugees\API_SM.POP.REFG.OR_DS2_en_csv_v2_2022116\API_SM.POP.REFG.OR_DS2_en_csv_v2_2022116.csv",
                                header=2, index_col='Country Name').filter(regex=r'^\d+$', axis='columns') 
# skip the first two rows
# setting the first level of index to 'Country Name' 
# filter columns that are only numbers
refugee_origin_df

Country Name,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
Aruba,,,,,,,,,,,...,,,,,,,,,,
Africa Eastern and Southern,,,,,,,,,,,...,3559470.0,4195068.0,4917140.0,6058091.0,5958200.0,5936536.0,5959979.0,6146257.0,6124318.0,6939594.0
Afghanistan,,,,,,,,,,,...,2596259.0,2666294.0,2501447.0,2624265.0,2681267.0,2727556.0,2594827.0,2712869.0,5661717.0,6403144.0
Africa Western and Central,,,,,,,,,,,...,919751.0,1038377.0,1090726.0,1155918.0,1283440.0,1354330.0,1457853.0,1675916.0,1750586.0,1793223.0
Angola,150000.0,150000.0,170000.0,175000.0,200000.0,220000.0,303800.0,356200.0,381360.0,408190.0,...,9468.0,11855.0,8388.0,8292.0,8243.0,8176.0,8196.0,11403.0,12021.0,11506.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Kosovo,,,,,,,,,,,...,,,,,,,,,,
"Yemen, Rep.",,,,,,,,,,,...,2623.0,15901.0,18427.0,23555.0,31145.0,36522.0,32433.0,37615.0,40147.0,48044.0
South Africa,,,,,,,390.0,450.0,740.0,290.0,...,422.0,447.0,452.0,461.0,482.0,441.0,474.0,643.0,784.0,919.0
Zambia,,,,,,,,,15000.0,13000.0,...,309.0,334.0,256.0,260.0,266.0,259.0,252.0,255.0,296.0,324.0


If all of the columns are four-digit strings representing years, we can change them to be integers. If we want, we could run `astype(int)` on `refugee_origin_df.columns`, getting back integers, and then assign the result back to `refugee_origin_df.columns`:

In [45]:
refugee_origin_df.columns = refugee_origin_df.columns.astype(int)
refugee_origin_df

Country Name,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
Aruba,,,,,,,,,,,...,,,,,,,,,,
Africa Eastern and Southern,,,,,,,,,,,...,3559470.0,4195068.0,4917140.0,6058091.0,5958200.0,5936536.0,5959979.0,6146257.0,6124318.0,6939594.0
Afghanistan,,,,,,,,,,,...,2596259.0,2666294.0,2501447.0,2624265.0,2681267.0,2727556.0,2594827.0,2712869.0,5661717.0,6403144.0
Africa Western and Central,,,,,,,,,,,...,919751.0,1038377.0,1090726.0,1155918.0,1283440.0,1354330.0,1457853.0,1675916.0,1750586.0,1793223.0
Angola,150000.0,150000.0,170000.0,175000.0,200000.0,220000.0,303800.0,356200.0,381360.0,408190.0,...,9468.0,11855.0,8388.0,8292.0,8243.0,8176.0,8196.0,11403.0,12021.0,11506.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Kosovo,,,,,,,,,,,...,,,,,,,,,,
"Yemen, Rep.",,,,,,,,,,,...,2623.0,15901.0,18427.0,23555.0,31145.0,36522.0,32433.0,37615.0,40147.0,48044.0
South Africa,,,,,,,390.0,450.0,740.0,290.0,...,422.0,447.0,452.0,461.0,482.0,441.0,474.0,643.0,784.0,919.0
Zambia,,,,,,,,,15000.0,13000.0,...,309.0,334.0,256.0,260.0,266.0,259.0,252.0,255.0,296.0,324.0


It feels odd to use assignment, though, when we've managed to do everything else via method chaining. How can we assign to our data frame's columns from within a method chain?

It's possible with the `pipe` method, which lets us run a function on our data frame, getting a chained result. And the function we'll run? It'll be a `lambda`, taking the data frame that we've gotten so far via the method chain, running `set_axis` to assign to our columns:

In [48]:
refugee_origin_df = pd.read_csv(r"C:\Users\npigeon\Git\BW #75 Refugees\API_SM.POP.REFG.OR_DS2_en_csv_v2_2022116\API_SM.POP.REFG.OR_DS2_en_csv_v2_2022116.csv",
                                header=2, index_col='Country Name').filter(regex=r'^\d+$', axis='columns').pipe(lambda df_: df_.set_axis(df_.columns.astype(int), axis='columns'))    
refugee_origin_df

Country Name,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
Aruba,,,,,,,,,,,...,,,,,,,,,,
Africa Eastern and Southern,,,,,,,,,,,...,3559470.0,4195068.0,4917140.0,6058091.0,5958200.0,5936536.0,5959979.0,6146257.0,6124318.0,6939594.0
Afghanistan,,,,,,,,,,,...,2596259.0,2666294.0,2501447.0,2624265.0,2681267.0,2727556.0,2594827.0,2712869.0,5661717.0,6403144.0
Africa Western and Central,,,,,,,,,,,...,919751.0,1038377.0,1090726.0,1155918.0,1283440.0,1354330.0,1457853.0,1675916.0,1750586.0,1793223.0
Angola,150000.0,150000.0,170000.0,175000.0,200000.0,220000.0,303800.0,356200.0,381360.0,408190.0,...,9468.0,11855.0,8388.0,8292.0,8243.0,8176.0,8196.0,11403.0,12021.0,11506.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Kosovo,,,,,,,,,,,...,,,,,,,,,,
"Yemen, Rep.",,,,,,,,,,,...,2623.0,15901.0,18427.0,23555.0,31145.0,36522.0,32433.0,37615.0,40147.0,48044.0
South Africa,,,,,,,390.0,450.0,740.0,290.0,...,422.0,447.0,452.0,461.0,482.0,441.0,474.0,643.0,784.0,919.0
Zambia,,,,,,,,,15000.0,13000.0,...,309.0,334.0,256.0,260.0,266.0,259.0,252.0,255.0,296.0,324.0
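
As a side note, `pipe` with `set_axis` is not the only chain-friendly way to convert the column labels. The sketch below, on a made-up one-row frame, compares it with `rename(columns=int)`, which applies `int` to every column label. This alternative is my own suggestion, not something the correction uses.

```python
import pandas as pd

# Toy frame with string year labels, as they come out of read_csv.
toy = pd.DataFrame({'1960': [150000.0], '1961': [150000.0]}, index=['Angola'])

# The approach from the text: pipe + set_axis.
via_pipe = toy.pipe(lambda df_: df_.set_axis(df_.columns.astype(int),
                                             axis='columns'))

# Alternative: rename calls int() on each column label.
via_rename = toy.rename(columns=int)

print(via_pipe.columns.tolist() == via_rename.columns.tolist())
```

Both return a new data frame and leave `toy` untouched, so either fits into a method chain.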


We perform the same operations on the two other files, for destination and population.

In [49]:
refugee_destination_df = pd.read_csv(r"C:\Users\npigeon\Git\BW #75 Refugees\API_SM.POP.REFG_DS2_en_csv_v2_2011316\API_SM.POP.REFG_DS2_en_csv_v2_2011316.csv",
                                header=2, index_col='Country Name').filter(regex=r'^\d+$', axis='columns').pipe(lambda df_: df_.set_axis(df_.columns.astype(int), axis='columns'))    
refugee_destination_df

Country Name,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
Aruba,,,,,,,,,,,...,,,,,,,,,,
Africa Eastern and Southern,,,,,,,,,,,...,2637640.0,3333273.0,3990478.0,5155400.0,5114399.0,5087755.0,5183533.0,5436720.0,5412266.0,5553759.0
Afghanistan,,,,,,,,,,,...,300421.0,257553.0,59770.0,75927.0,72228.0,72227.0,72278.0,66949.0,52159.0,34826.0
Africa Western and Central,,,,,,,,,,,...,1108169.0,1138010.0,1200854.0,1172523.0,1285773.0,1315229.0,1474135.0,1631057.0,1702392.0,2296159.0
Angola,,,,,,,,,,,...,15468.0,15547.0,15547.0,41119.0,39856.0,25793.0,25791.0,26045.0,25514.0,25174.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Kosovo,,,,,,,,,,,...,,,,,,,,,,
"Yemen, Rep.",,,,,,,,,,,...,257637.0,267163.0,269778.0,270913.0,264359.0,268503.0,166906.0,89467.0,77458.0,55568.0
South Africa,,,,,,,,,,,...,112182.0,121635.0,91018.0,88694.0,89285.0,78395.0,76729.0,75512.0,66596.0,67145.0
Zambia,,,,,,5000.0,6290.0,10710.0,13190.0,12640.0,...,25566.0,26434.0,29338.0,41266.0,49877.0,57518.0,66070.0,75154.0,61159.0,72411.0


In [50]:
population_df = pd.read_csv(r"C:\Users\npigeon\Git\BW #75 Refugees\API_SP.POP.TOTL_DS2_en_csv_v2_2001050\API_SP.POP.TOTL_DS2_en_csv_v2_2001050.csv",
                                header=2, index_col='Country Name').filter(regex=r'^\d+$', axis='columns').pipe(lambda df_: df_.set_axis(df_.columns.astype(int), axis='columns'))    
population_df

Country Name,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
Aruba,54608.0,55811.0,56682.0,57475.0,58178.0,58782.0,59291.0,59522.0,59471.0,59330.0,...,103594.0,104257.0,104874.0,105439.0,105962.0,106442.0,106585.0,106537.0,106445.0,106277.0
Africa Eastern and Southern,130692579.0,134169237.0,137835590.0,141630546.0,145605995.0,149742351.0,153955516.0,158313235.0,162875171.0,167596160.0,...,583651101.0,600008424.0,616377605.0,632746570.0,649757148.0,667242986.0,685112979.0,702977106.0,720859132.0,739108306.0
Afghanistan,8622466.0,8790140.0,8969047.0,9157465.0,9355514.0,9565147.0,9783147.0,10010030.0,10247780.0,10494489.0,...,32716210.0,33753499.0,34636207.0,35643418.0,36686784.0,37769499.0,38972230.0,40099462.0,41128771.0,42239854.0
Africa Western and Central,97256290.0,99314028.0,101445032.0,103667517.0,105959979.0,108336203.0,110798486.0,113319950.0,115921723.0,118615741.0,...,397855507.0,408690375.0,419778384.0,431138704.0,442646825.0,454306063.0,466189102.0,478185907.0,490330870.0,502789511.0
Angola,5357195.0,5441333.0,5521400.0,5599827.0,5673199.0,5736582.0,5787044.0,5827503.0,5868203.0,5928386.0,...,27128337.0,28127721.0,29154746.0,30208628.0,31273533.0,32353588.0,33428486.0,34503774.0,35588987.0,36684202.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Kosovo,990150.0,1014211.0,1038618.0,1063175.0,1087700.0,1111812.0,1135522.0,1159611.0,1184645.0,1211011.0,...,1812771.0,1788196.0,1777557.0,1791003.0,1797085.0,1788878.0,1790133.0,1786038.0,1768086.0,1756374.0
"Yemen, Rep.",5542459.0,5646668.0,5753386.0,5860197.0,5973803.0,6097298.0,6228430.0,6368014.0,6515904.0,6673981.0,...,27753304.0,28516545.0,29274002.0,30034389.0,30790513.0,31546691.0,32284046.0,32981641.0,33696614.0,34449825.0
South Africa,16520441.0,16989464.0,17503133.0,18042215.0,18603097.0,19187194.0,19789771.0,20410677.0,21050540.0,21704214.0,...,54729551.0,55876504.0,56422274.0,56641209.0,57339635.0,58087055.0,58801927.0,59392255.0,59893885.0,60414495.0
Zambia,3119430.0,3219451.0,3323427.0,3431381.0,3542764.0,3658024.0,3777680.0,3901288.0,4029173.0,4159007.0,...,15737793.0,16248230.0,16767761.0,17298054.0,17835893.0,18380477.0,18927715.0,19473125.0,20017675.0,20569737.0
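
Since the same chain is repeated for all three files, it could be wrapped in a small helper. The sketch below is one possible way to do that; the function name `load_world_bank` is hypothetical, and the in-memory sample (with made-up values) only stands in for a real downloaded file so that the example runs on its own.

```python
import io

import pandas as pd


def load_world_bank(path_or_buffer):
    """Hypothetical helper: read one World Bank CSV, keep only year columns."""
    return (
        pd.read_csv(path_or_buffer, header=2, index_col='Country Name')
        .filter(regex=r'^\d+$', axis='columns')
        .pipe(lambda df_: df_.set_axis(df_.columns.astype(int),
                                       axis='columns'))
    )


# Made-up sample mimicking the layout of a World Bank CSV download.
sample_csv = '''"Data Source","World Development Indicators",

"Last Updated Date","2024-01-01",

"Country Name","Country Code","Indicator Name","Indicator Code","1960","1961",
"Aruba","ABW","Some indicator","XX.YYY","","",
"Angola","AGO","Some indicator","XX.YYY","150000","150000",
'''

sample_df = load_world_bank(io.StringIO(sample_csv))
print(sample_df)
```

With the real data, the same function would be called once per file path instead of on the in-memory sample.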


We now have three data frames, all of which have the same index and columns. Normally, we would use `pd.concat` to combine multiple data frames into a single, new one. But I asked you to do something a bit different, combining them but using a multi-index for the columns, such that we can keep track of the data's original frames.

Fortunately, we can pass `pd.concat` an additional `keys` keyword argument, indicating the multi-index name we want to associate with each of the original data frames:

In [52]:
df = pd.concat([population_df, refugee_destination_df, refugee_origin_df], 
              axis='columns',
              keys=['population', 'destination', 'origin'])
df

,population,population,population,population,population,population,population,population,population,population,...,origin,origin,origin,origin,origin,origin,origin,origin,origin,origin
Country Name,1960,1961,1962,1963,1964,1965,1966,1967,1968,1969,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
Aruba,54608.0,55811.0,56682.0,57475.0,58178.0,58782.0,59291.0,59522.0,59471.0,59330.0,...,,,,,,,,,,
Africa Eastern and Southern,130692579.0,134169237.0,137835590.0,141630546.0,145605995.0,149742351.0,153955516.0,158313235.0,162875171.0,167596160.0,...,3559470.0,4195068.0,4917140.0,6058091.0,5958200.0,5936536.0,5959979.0,6146257.0,6124318.0,6939594.0
Afghanistan,8622466.0,8790140.0,8969047.0,9157465.0,9355514.0,9565147.0,9783147.0,10010030.0,10247780.0,10494489.0,...,2596259.0,2666294.0,2501447.0,2624265.0,2681267.0,2727556.0,2594827.0,2712869.0,5661717.0,6403144.0
Africa Western and Central,97256290.0,99314028.0,101445032.0,103667517.0,105959979.0,108336203.0,110798486.0,113319950.0,115921723.0,118615741.0,...,919751.0,1038377.0,1090726.0,1155918.0,1283440.0,1354330.0,1457853.0,1675916.0,1750586.0,1793223.0
Angola,5357195.0,5441333.0,5521400.0,5599827.0,5673199.0,5736582.0,5787044.0,5827503.0,5868203.0,5928386.0,...,9468.0,11855.0,8388.0,8292.0,8243.0,8176.0,8196.0,11403.0,12021.0,11506.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Kosovo,990150.0,1014211.0,1038618.0,1063175.0,1087700.0,1111812.0,1135522.0,1159611.0,1184645.0,1211011.0,...,,,,,,,,,,
"Yemen, Rep.",5542459.0,5646668.0,5753386.0,5860197.0,5973803.0,6097298.0,6228430.0,6368014.0,6515904.0,6673981.0,...,2623.0,15901.0,18427.0,23555.0,31145.0,36522.0,32433.0,37615.0,40147.0,48044.0
South Africa,16520441.0,16989464.0,17503133.0,18042215.0,18603097.0,19187194.0,19789771.0,20410677.0,21050540.0,21704214.0,...,422.0,447.0,452.0,461.0,482.0,441.0,474.0,643.0,784.0,919.0
Zambia,3119430.0,3219451.0,3323427.0,3431381.0,3542764.0,3658024.0,3777680.0,3901288.0,4029173.0,4159007.0,...,309.0,334.0,256.0,260.0,266.0,259.0,252.0,255.0,296.0,324.0
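
Here is a minimal sketch of what `keys` does, using two tiny frames with made-up values. It also shows the two ways of selecting from the resulting multi-index columns that the rest of this notebook relies on.

```python
import pandas as pd

# Two toy frames sharing the same index and columns (values are made up).
countries = ['Jordan', 'Germany']
population = pd.DataFrame({2000: [5.0, 82.0], 2023: [11.0, 84.0]},
                          index=countries)
destination = pd.DataFrame({2000: [1.6, 0.9], 2023: [3.1, 2.6]},
                           index=countries)

# keys= adds an outer column level naming the frame each column came from.
combined = pd.concat([population, destination],
                     axis='columns',
                     keys=['population', 'destination'])

# A (top level, year) tuple selects a single column as a series ...
print(combined[('destination', 2000)])

# ... while a top-level key alone selects that whole sub-frame.
print(combined['destination'].columns.tolist())
```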


## What 10 countries accepted the most refugees in 2000 and 2023?
### 2000

In [94]:
filtered = df['destination'][2000]
filtered_df = pd.DataFrame(filtered)
filtered_df.sort_values(2000, ascending=False).head(10)

Country Name,2000
World,15935134.0
Low & middle income,13376149.0
IDA & IBRD total,12001044.0
Middle income,11490188.0
Early-demographic dividend,8438167.0
Lower middle income,7780161.0
Middle East & North Africa,6096323.0
Middle East & North Africa (excluding high income),6083390.0
IDA total,6036232.0
IBRD only,5964812.0


We can also use the `nlargest` method, which returns the 5 largest values by default:

In [95]:
df[('destination', 2000)].nlargest()

Country Name
World                         15935134.0
Low & middle income           13376149.0
IDA & IBRD total              12001044.0
Middle income                 11490188.0
Early-demographic dividend     8438167.0
Name: (destination, 2000), dtype: float64
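
The two approaches can be compared on a toy series (labels and values are made up). The sketch also shows that `nlargest` ignores missing values and falls back to `n=5` when called without an argument.

```python
import pandas as pd

# Toy series with one missing value.
s = pd.Series({'A': 3.0, 'B': 9.0, 'C': 1.0, 'D': 7.0,
               'E': 5.0, 'F': 8.0, 'G': float('nan')})

# Sorting in descending order and taking the head ...
top_sorted = s.sort_values(ascending=False).head(3)

# ... gives the same rows as nlargest, in one step.
top_nlargest = s.nlargest(3)

print(top_nlargest)
print(len(s.nlargest()))  # without an argument, n defaults to 5
```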

There is a problem with `Country Name`: it contains not only countries but also geographic regions and other aggregates. To work around this, we install the `pycountry` library and use it to get a list of valid country names.

In [134]:
import pycountry

In [139]:
valid_countries = {country.name for country in pycountry.countries}
unique_index_names = set(df.index.unique())
invalid_names = unique_index_names - valid_countries
invalid_names

{'Africa Eastern and Southern',
 'Africa Western and Central',
 'Arab World',
 'Bahamas, The',
 'Bolivia',
 'British Virgin Islands',
 'Caribbean small states',
 'Central Europe and the Baltics',
 'Channel Islands',
 'Congo, Dem. Rep.',
 'Congo, Rep.',
 "Cote d'Ivoire",
 'Curacao',
 'Early-demographic dividend',
 'East Asia & Pacific',
 'East Asia & Pacific (IDA & IBRD countries)',
 'East Asia & Pacific (excluding high income)',
 'Egypt, Arab Rep.',
 'Euro area',
 'Europe & Central Asia',
 'Europe & Central Asia (IDA & IBRD countries)',
 'Europe & Central Asia (excluding high income)',
 'European Union',
 'Fragile and conflict affected situations',
 'Gambia, The',
 'Heavily indebted poor countries (HIPC)',
 'High income',
 'Hong Kong SAR, China',
 'IBRD only',
 'IDA & IBRD total',
 'IDA blend',
 'IDA only',
 'IDA total',
 'Iran, Islamic Rep.',
 "Korea, Dem. People's Rep.",
 'Korea, Rep.',
 'Kosovo',
 'Kyrgyz Republic',
 'Lao PDR',
 'Late-demographic dividend',
 'Latin America & Caribbe

In [140]:
# World Bank names for countries and territories that pycountry
# either lacks or spells differently
names_to_add = {'Bahamas, The',
                'Bolivia',
                'British Virgin Islands',
                'Channel Islands',
                'Congo, Dem. Rep.',
                'Congo, Rep.',
                "Cote d'Ivoire",
                'Curacao',
                'Egypt, Arab Rep.',
                'Gambia, The',
                'Hong Kong SAR, China',
                'Iran, Islamic Rep.',
                "Korea, Dem. People's Rep.",
                'Korea, Rep.',
                'Kosovo',
                'Kyrgyz Republic',
                'Lao PDR',
                'Tanzania',
                'Turkiye',
                'Venezuela, RB',
                'Virgin Islands (U.S.)',
                'West Bank and Gaza',
                'Yemen, Rep.'}
valid_countries.update(names_to_add)
invalid_names = unique_index_names - valid_countries


In [141]:
filtered_df = df[df.index.isin(valid_countries)]
filtered_df[('destination', 2000)].nlargest(10)

Country Name
Pakistan              2001460.0
Iran, Islamic Rep.    1868000.0
Jordan                1610630.0
West Bank and Gaza    1428891.0
Germany                906000.0
Tanzania               680862.0
United States          508209.0
Serbia                 484390.0
Guinea                 427210.0
Sudan                  414928.0
Name: (destination, 2000), dtype: float64

The correction instead builds a regular expression describing the names (and partial names) to ignore: a collection of words and phrases that appear in the aggregate rows, stored in the variable `ignore_pattern`. It then takes the data frame's index and turns it into a series with `to_series`, which returns a series whose index and values are identical. Finally, it uses `loc` with `str.contains` and `~` to keep only those elements that do not match the pattern.

In [114]:
ignore_pattern = r'(?:countries|situations|IDA|IBRD|demographic|OECD|World|Middle East|Sub-Saharan|ern and|Euro|income|Asia|America\b)'

countries = (df
             .index
             .to_series()
             .loc[lambda s_: ~s_.str.contains(ignore_pattern,
                                              regex=True)]
            )
(
    df
    .loc[lambda df_: df_.index.isin(countries)]
    [('destination', 2000)]
    .nlargest(10)
)

Country Name
Pakistan              2001460.0
Iran, Islamic Rep.    1868000.0
Jordan                1610630.0
West Bank and Gaza    1428891.0
Germany                906000.0
Tanzania               680862.0
United States          508209.0
Serbia                 484390.0
Guinea                 427210.0
Sudan                  414928.0
Name: (destination, 2000), dtype: float64
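
The same idea can be tried on a toy frame with made-up values and a much shorter pattern. As a variation on the correction (my own shortcut, not the author's method), this sketch calls `str.contains` directly on the index rather than going through `to_series`.

```python
import pandas as pd

# Toy frame mixing aggregate rows and real countries (values are made up).
toy = pd.DataFrame({'refugees': [15.9, 1.6, 11.5, 0.9]},
                   index=['World', 'Jordan', 'Middle income', 'Germany'])

# A non-capturing group (?:...) keeps pandas from warning about match groups.
pattern = r'(?:World|income)'

# True for every index label that matches the pattern anywhere.
is_aggregate = toy.index.str.contains(pattern, regex=True)

# ~ inverts the mask, keeping only the rows that did not match.
only_countries = toy.loc[~is_aggregate]
print(only_countries.index.tolist())
```

Because the pattern is searched for anywhere in the name, a fragment that is too short could also remove a real country, so it is worth checking which names get dropped.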

### 2023
We do the same for 2023, reusing the `countries` series:

In [144]:
(
    df
    .loc[lambda df_: df_.index.isin(countries)]
    [('destination', 2023)]
    .nlargest(10)
)

Country Name
Iran, Islamic Rep.    3764517.0
Turkiye               3251127.0
Jordan                3063591.0
Germany               2593007.0
West Bank and Gaza    2482144.0
Pakistan              1988231.0
Uganda                1577498.0
Lebanon               1279108.0
Russian Federation    1230131.0
Chad                  1100921.0
Name: (destination, 2023), dtype: float64


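The numbers, as you can see, are generally far higher than they were 23 years earlier: the tenth-place country in 2023 (Chad, with about 1.1 million) hosted more refugees than all but four countries did in 2000. Pakistan is an exception, with roughly the same number in both years.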
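
To put a number on the change, here is a small sketch that computes the 2023-to-2000 ratio for the five countries that appear in both top-10 lists. The values are copied from the two outputs above; the names `hosted` and `growth` are just for illustration.

```python
import pandas as pd

# Values copied from the 2000 and 2023 top-10 outputs,
# restricted to countries present in both lists.
hosted = pd.DataFrame(
    {2000: [2001460.0, 1868000.0, 1610630.0, 1428891.0, 906000.0],
     2023: [1988231.0, 3764517.0, 3063591.0, 2482144.0, 2593007.0]},
    index=['Pakistan', 'Iran, Islamic Rep.', 'Jordan',
           'West Bank and Gaza', 'Germany'],
)

# Ratio above 1 means more refugees hosted in 2023 than in 2000.
growth = (hosted[2023] / hosted[2000]).sort_values(ascending=False)
print(growth.round(2))
```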