In [1]:
# To Do:
# * Include introduction saying that we could have better data as it relates to genderqueer / non-binary individuals so that programs can be more specific to support these individuals
# Include importance around early childhood development in intro
# Include in intro question of whether it is fair to do machine learning from this dataset because numbers are based on a ton of dependent variables
# Talk about clear lack of data and consequences for application of tool
# Generalize functions to look at overall displaced individuals or by country totals
# Right now I am wondering about the age ranges 5 - 11, 5 - 17, 12 - 17 and if there is a better way to split data-wise than just summing. Until 2005 they didn't split the age range
# Clustering for each country geographically
# R^2 jumps all around (see if I can test variance)


## Introduction

The world is currently experiencing the largest refugee crisis on record [1]. Over the past decade, the number of refugees and displaced persons has risen sharply, and it is imperative that services to support these individuals are in place throughout the relocation process. Importantly, services must be tailored to the individual and family level so that we correctly assist those who need them. Using historical trends, we can better understand the demographics of displaced individuals.

The United Nations Refugee Agency (UNHCR) collects demographic data on displaced individuals [2]. We can use this data to look at the number of individuals by country, location, age and sex. This will help non-profits and humanitarian organizations plan their services strategically so that they benefit the most individuals. For example, they will have a better idea of how many women-specific and early-age development programs are needed. Understanding where individuals are coming from will also make it easier to plan language integration services, job planning and relocation within specific communities.

Specifically, this tool will provide demographic trends for a given country or territory of asylum/residence, which is how the UNHCR export is keyed (or for totals across a group of countries), and for a given demographic (sex/age range). Trends over time will be presented through digestible visuals.

In [2]:
# import libraries

import pandas as pd
import numpy as np
import pdb
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns; sns.set()
from sklearn.metrics import r2_score


## Pre-processing of data

In [3]:
# pre-processing of data

# data: http://popstats.unhcr.org/en/demographics
demographics_df = pd.read_csv('./unhcr_popstats_export_demographics_2019_12_23_152351.csv', header = 2)

demographics_df.head()


Unnamed: 0,Year,Country / territory of asylum/residence,Location Name,Female 0-4,Female 5-11,Female 5-17,Female 12-17,Female 18-59,Female 60+,F: Unknown,F: Total,Male 0-4,Male 5-11,Male 5-17,Male 12-17,Male 18-59,Male 60+,M: Unknown,M: Total
0,2001,Afghanistan,West,,,0.0,,,,0.0,0,,,0.0,,,,0.0,0
1,2001,Afghanistan,Various,14335.0,,45451.0,,99880.0,19234.0,412004.0,590904,14716.0,,47522.0,,114965.0,13025.0,435492.0,625720
2,2001,Afghanistan,North,,,0.0,,,,0.0,0,,,0.0,,,,0.0,0
3,2001,Afghanistan,Kabul,,,1.0,,1.0,,0.0,2,,,0.0,,2.0,,0.0,2
4,2001,Afghanistan,Herat,,,0.0,,1.0,,0.0,1,,,0.0,,1.0,,0.0,1


In [4]:
# Cleaning data

# convert counts to integers because you can't have half of a person
# assuming that all '*' values are 0
def clean_convert_unhcr_data(col):
    demographics_df[col] = demographics_df[col].fillna(0).replace('*', 0).astype(int)
    
    
demographics_columns = ['Female 0-4','Female 5-11','Female 5-17','Female 12-17',
                        'Female 18-59','Female 60+','F: Unknown','F: Total',
                        'Male 0-4','Male 5-11','Male 5-17','Male 12-17',
                        'Male 18-59','Male 60+','M: Unknown','M: Total']

for col in demographics_columns:
    clean_convert_unhcr_data(col)
    
# The 'total' columns are sums of all other columns for each gender, but the age ranges 5-11, 5-17 and 12-17 overlap.
# Combine the three into a single 5-17 column so every row uses the same age bands.
def add_cols(col, col_list):
    demographics_df[col] = demographics_df[col_list].sum(axis = 1)
    

add_cols('Female 5-17',['Female 5-11','Female 5-17','Female 12-17'])

add_cols('Male 5-17',['Male 5-11','Male 5-17','Male 12-17'])

cols_to_drop = ['Female 5-11', 'Female 12-17', 'Male 5-11', 'Male 12-17','Location Name']
    
demographics_df = demographics_df.drop(cols_to_drop, axis = 1)
demographics_df = demographics_df.groupby(['Year','Country / territory of asylum/residence']).sum().reset_index()

# find countries with only one entry as you can't make a regression off of one entry
for country in demographics_df['Country / territory of asylum/residence'].unique():
    count = demographics_df[demographics_df['Country / territory of asylum/residence'] == country]['Country / territory of asylum/residence'].count()
    if count == 1:
        print(country)
        
# remove from dataframe
demographics_df = demographics_df[demographics_df['Country / territory of asylum/residence'] != 'Bonaire']


Bonaire
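
The single-entry check above prints the country and then removes it by name. As a minimal sketch, the same filter could be derived from the counts so that nothing is hard-coded. The toy frame and the names `toy_df`, `rows_per_country`, `single_entry` and `filtered_df` are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for the aggregated demographics_df.
country_col = 'Country / territory of asylum/residence'
toy_df = pd.DataFrame({
    'Year': [2001, 2002, 2003, 2018],
    country_col: ['Afghanistan', 'Afghanistan', 'Afghanistan', 'Bonaire'],
    'F: Total': [10, 12, 15, 3],
})

# After the groupby there is one row per (year, country), so the number
# of rows per country equals the number of years with data.
rows_per_country = toy_df[country_col].value_counts()

# A regression cannot be fit to a single point, so collect those countries.
single_entry = rows_per_country[rows_per_country == 1].index.tolist()

# Keep only the countries that have at least two rows.
filtered_df = toy_df[~toy_df[country_col].isin(single_entry)]
```

Deriving the list this way would keep the filter valid if a later export adds or removes single-entry countries.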


## Functions Assignment

In [5]:
def specify_country(dataframe_input, country):
    '''Filter a dataframe covering all countries down to a single country.
    '''
    dataframe_output = dataframe_input[dataframe_input['Country / territory of asylum/residence'] == country]
    return dataframe_output
    
def lin_reg(dataframe_input, demographic, test_size):
    '''This builds the linear regression model for a specific dataset.
    First it converts the x and y data to numpy arrays, then fits a linear regression on the training split.
    Returns [coefficient, intercept, test_size, R^2], where R^2 is computed on the training split.
    '''
    
    # use the dataframe passed in rather than a global variable
    x_data = dataframe_input['Year'].values.reshape(-1, 1)
    y_data = dataframe_input[demographic].values.reshape(-1, 1)
    
    # create training and testing datasets; pick a small test_size as there isn't a lot of data
    x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size = test_size)

    # create linear regression object and fit it to the training data
    lr = LinearRegression()
    lr.fit(x_train, y_train)
    # predictions on the training data (the test split is not used for scoring here)
    y_pred = lr.predict(x_train)
    
    linear_regression_list = [lr.coef_[0][0], lr.intercept_[0], test_size, r2_score(y_train, y_pred)]

    return linear_regression_list


def get_max_r_squared(dataframe_input, demographic, country):
    '''This iterates over different test_size values for train_test_split and returns the fit with the maximum R squared value.
    '''
    
    cols_max_r2 = ['coefficient', 'intercept', 'test_size','r_squared']
    appended_data = []
    # can't go higher than 0.3, otherwise not enough data is in the fitting process
    test_size_range = np.linspace(0.01, 0.3, 100, endpoint = False) 

    for i in test_size_range:
        appended_data.append(lin_reg(dataframe_input = dataframe_input, demographic = demographic, test_size = i))
    
    r2_df = pd.DataFrame(appended_data, columns = cols_max_r2)

    # get the row with the max R^2; if several rows tie, keep only the first. Ties come from cases where
    # there is no statistical trend and therefore R^2 = 1.0
    # .copy() so the column assignments below act on an independent dataframe
    max_r2 = r2_df[r2_df['r_squared'] == r2_df['r_squared'].max()].reset_index(drop = True).iloc[:1].copy()
    max_r2['demographic'] = demographic
    max_r2['country'] = country
    max_r2 = max_r2[['country','demographic', 'coefficient', 'intercept', 'test_size', 'r_squared']]
    return max_r2
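
The functions above call `train_test_split` without a fixed seed and score the fit on the training split, so repeated runs can return different R squared values. As a hedged sketch of one alternative, the split could be seeded and the fit scored on the held-out split as well. The helper `fit_and_score` and the synthetic series are assumptions for illustration and do not use UNHCR data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def fit_and_score(years, counts, test_size=0.25, random_state=0):
    """Hypothetical helper: fit on the training split, score on both splits."""
    x = np.asarray(years, dtype=float).reshape(-1, 1)
    y = np.asarray(counts, dtype=float)
    # A fixed random_state makes the split, and so the scores, reproducible.
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=test_size, random_state=random_state)
    model = LinearRegression().fit(x_train, y_train)
    return {
        'coefficient': model.coef_[0],
        'intercept': model.intercept_,
        'r2_train': r2_score(y_train, model.predict(x_train)),
        'r2_test': r2_score(y_test, model.predict(x_test)),
    }


# Synthetic, noise-free series (not UNHCR data): counts grow by 50 per year.
years = np.arange(2001, 2019)
counts = 50.0 * (years - 2001) + 100.0

first = fit_and_score(years, counts)
second = fit_and_score(years, counts)
```

With the seed fixed, `first` and `second` are identical. Comparing `r2_train` with `r2_test` on real data would indicate whether a fit generalizes beyond the rows it was trained on.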


In [6]:
# test the return of a single country and demographic input

country_input, demographic = 'Burkina Faso', 'Female 60+'

country_specified_dataframe = specify_country(demographics_df, country_input)
get_max_r_squared(country_specified_dataframe, demographic, country_input)


Unnamed: 0,country,demographic,coefficient,intercept,test_size,r_squared
0,Burkina Faso,Female 60+,80.518767,-161526.625335,0.2391,0.784713
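
The coefficient and intercept describe a straight line, count = coefficient * year + intercept. A minimal sketch of evaluating that line is given here, using the values from the Burkina Faso row above. The function name `project_count` and the choice of 2020 as the year are assumptions for illustration. Because `get_max_r_squared` fits on a random split, rerunning the notebook may produce different coefficients.

```python
def project_count(coefficient, intercept, year):
    """Evaluate the fitted line: count = coefficient * year + intercept."""
    return coefficient * year + intercept


# Values copied from the Burkina Faso / 'Female 60+' row above.
coefficient = 80.518767
intercept = -161526.625335

# Hypothetical target year, chosen only to illustrate the arithmetic.
projected_2020 = project_count(coefficient, intercept, 2020)

# Counts cannot be negative, so clamp at zero before rounding to whole people.
projected_2020_people = max(0, round(projected_2020))
```

For these inputs the line evaluates to roughly 1,121 individuals. Extrapolating a linear fit past the observed years is a strong assumption, so a figure like this is best read as a rough indication.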


In [7]:
# test list of countries
test_country_list_df = pd.DataFrame({'country': ['Samoa', 'Iceland', 'Afghanistan', 'Rep. of Korea', 'Burkina Faso', 'Bosnia and Herzegovina']})
test_country_list = test_country_list_df['country']

# full list of countries
full_country_list = demographics_df['Country / territory of asylum/residence'].sort_values().unique()

# full list of demographics
demographics_list = pd.DataFrame({'demographic': ['Female 0-4', 
                                                  'Female 5-17', 
                                                  'Female 18-59', 
                                                  'Female 60+', 
                                                  'Male 0-4', 
                                                  'Male 5-17', 
                                                  'Male 18-59', 
                                                  'Male 60+',
                                                 ]})


cols_final_df = ['country','demographic','coefficient','intercept','test_size','r_squared']
lin_reg_list =[]

# iterate over each country and demographic in dataframes
for country in full_country_list:
    # filter once per country, then fit each demographic
    country_specified_dataframe = specify_country(demographics_df, country)
    for demographic in demographics_list['demographic']:
        lin_reg_object = get_max_r_squared(country_specified_dataframe, demographic, country)
        lin_reg_list.append(lin_reg_object)
        
overall_lin_reg_df = pd.concat(lin_reg_list, ignore_index = True)
# sort countries and demographic
overall_lin_reg_df = overall_lin_reg_df.sort_values(by = ['country', 'demographic']).reset_index(drop = True)

overall_lin_reg_df.to_csv('./regression_values_overall.csv', index = False)
overall_lin_reg_df.head(10)


Unnamed: 0,country,demographic,coefficient,intercept,test_size,r_squared
0,Afghanistan,Female 0-4,11330.270107,-22642870.0,0.2652,0.682125
1,Afghanistan,Female 18-59,10713.749896,-21159420.0,0.2449,0.105482
2,Afghanistan,Female 5-17,20940.614941,-41813680.0,0.2971,0.585107
3,Afghanistan,Female 60+,-2730.824176,5535973.0,0.126,0.372109
4,Afghanistan,Male 0-4,10985.575916,-21951700.0,0.2884,0.636405
5,Afghanistan,Male 18-59,-20800.769212,42300550.0,0.2507,0.440784
6,Afghanistan,Male 5-17,19117.812804,-38121860.0,0.2507,0.506845
7,Afghanistan,Male 60+,-1393.338021,2854077.0,0.2304,0.119121
8,Albania,Female 0-4,28.556907,-57264.19,0.2652,0.478819
9,Albania,Female 18-59,69.405882,-139148.6,0.2884,0.424607
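
Several of the fits in the table have a low R squared, so one possible next step is to keep only the fits above a chosen threshold before using them. The sketch below uses a few rows copied from the table above. The cut-off `MIN_R_SQUARED` and the names `fits_df` and `usable_fits` are assumptions for illustration, not values taken from the original analysis.

```python
import pandas as pd

# A few rows copied from the regression table above.
fits_df = pd.DataFrame({
    'country': ['Afghanistan'] * 4,
    'demographic': ['Female 0-4', 'Female 18-59', 'Female 5-17', 'Female 60+'],
    'r_squared': [0.682125, 0.105482, 0.585107, 0.372109],
})

# Hypothetical cut-off: treat fits below this R^2 as too weak to report.
MIN_R_SQUARED = 0.5

# Keep the stronger fits and list the best one first.
usable_fits = (fits_df[fits_df['r_squared'] >= MIN_R_SQUARED]
               .sort_values('r_squared', ascending=False)
               .reset_index(drop=True))
```

The same filter could be applied to `overall_lin_reg_df`, since it has the same `r_squared` column.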


#### References:

#### [1]: https://www.unhcr.org/blogs/statistics-refugee-numbers-highest-ever/
#### [2]: http://popstats.unhcr.org/en/demographics