# DOPP 2019W Exercise 3 - Group 32

### Contributors
- Eszter Katalin Bognar - 11931695
- Luis Kolb - 01622731
- Alexander Leitner - 01525882

### Objectives of the analysis
- What is the most accurate overview of flows of refugees between countries that can be obtained? 
- Are there typical characteristics of refugee origin and destination countries? 
- Are there typical characteristics of large flows of refugees? 
- Can countries that will produce large numbers of refugees be predicted? Can refugee flows be predicted?

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import logging
import hashlib

pd.options.mode.chained_assignment = None  # default='warn'

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

from sklearn import svm
from sklearn import neighbors
from sklearn import tree
from sklearn import metrics
from sklearn import preprocessing
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures

## Load datasets
We started our analysis by loading the necessary data files.

We selected 4 datasets to use:
- OECD International Migration Database data (https://stats.oecd.org/Index.aspx?DataSetCode=MIG)
- Gross Domestic Product per Capita data (https://data.worldbank.org/indicator/NY.GDP.PCAP.CD)
- Human Development Index data (http://hdr.undp.org/en/data)
- Worldwide Governance Indicators data (https://datacatalog.worldbank.org/dataset/worldwide-governance-indicators)

Each dataset was loaded and formatted, including:

- reshaping columns (changing rows to columns or columns to rows when necessary),
- getting rid of unwanted columns, 
- renaming columns, 
- setting proper data types,
- setting country-year multiindex to facilitate future data merge.
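
As a minimal sketch of these formatting steps, the following toy example reshapes a small wide table (one column per year) into the long country-year layout. The table and its values are made up for illustration, and `melt` is used here as one possible way to turn columns into rows; the loaders in this notebook use `stack` instead.

```python
import pandas as pd

# Toy wide table in the style of the World Bank exports: one row per country,
# one column per year. The values are made up for illustration.
wide = pd.DataFrame({
    "Country Name": ["Austria", "Chad"],
    "2000": [24000.0, None],
    "2001": [24500.0, 200.0],
})

# Columns to rows: every (country, year) combination becomes its own row.
long = wide.melt(id_vars="Country Name", var_name="year", value_name="GDP")

# Rename, set proper data types, and index by (country, year).
long = long.rename(columns={"Country Name": "country"})
long["year"] = long["year"].astype(int)
long = long.set_index(["country", "year"]).sort_index()

print(long)
```

Rows with missing values are kept at this stage, so that they can be handled explicitly later.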

In [None]:
'''
Load & format OECD International Migration Database data
'''

def load_oecd_data():
    """ 
    Load oecd data file
    Reshape the dataset to have country, destination, year, asylum_seekers columns
    Set hierarchical index (country, year)
    
    Returns
    --------
    oecd: data frame containing oecd data
    """
    #load dataset
    oecd = pd.read_csv('data/oecd_data.csv',na_values=['..'])
    #reshape table
    oecd.set_index(['country','destination', 'year','variable'], inplace=True)
    oecd=oecd.unstack()
    oecd.columns = oecd.columns.droplevel(0)
    #move the index levels back into regular columns
    oecd.reset_index(drop=False, inplace=True)
    #keep only the wanted columns
    oecd = oecd[['country','destination','year','Inflows of asylum seekers by nationality']]
    #rename columns
    oecd = oecd.rename(columns={'Inflows of asylum seekers by nationality': 'asylum_seekers'})
    #set index
    oecd=oecd.set_index(['country', 'year'])
    return oecd

oecd_df=load_oecd_data()
oecd_df.info()
oecd_df.head(20)

In [None]:
'''
Load & format Gross Domestic Product per Capita data
'''

def load_gdp_data():
    """ 
    Load gdp data file
    Reshape the dataset to have country, year, GDP columns
    Change data types
    Set hierarchical index (country, year)
    
    Returns
    --------
    gdp: data frame containing gdp data
    """
    #load dataset
    gdp = pd.read_csv('data/GDPPC_data.csv',na_values=['..'])
    #drop unwanted columns
    gdp.drop(['Indicator Name','Indicator Code','Country Code' ], inplace=True, axis=1)
    #reshape dataframe
    gdp.set_index(['Country Name'], inplace=True)
    gdp=gdp.stack(dropna=False).to_frame()
    gdp.reset_index(drop=False, inplace=True)
    #rename columns
    gdp=gdp.rename(columns={'Country Name': 'country', 'level_1': 'year', 0: 'GDP'})
    #set datatype for year, GDP
    gdp['year']=gdp['year'].astype(int)
    gdp['GDP']=gdp['GDP'].astype(float)
    #only use data from 2000 onwards
    gdp = gdp[gdp.year >= 2000]
    #set index
    gdp=gdp.set_index(['country', 'year'])
    return gdp

gdp_df=load_gdp_data()
gdp_df.info()
gdp_df.tail(10)

In [None]:
'''
Load & format Human Development Index data
'''

def load_hdi_data():
    """ 
    Load hdi data file
    Reshape the dataset to have country, year, HDI columns
    Change data types
    Set hierarchical index (country, year)
    
    Returns
    --------
    hdi: data frame containing hdi data
    """
    #load dataset
    hdi = pd.read_csv('data/HDI.csv',na_values=['..'])
    #drop unwanted columns
    hdi.drop(['HDI Rank (2018)'], inplace=True, axis=1)
    #reshape dataframe
    hdi.set_index(['Country'], inplace=True)
    hdi=hdi.stack(dropna=False).to_frame()
    hdi.reset_index(drop=False, inplace=True)
    #rename columns
    hdi=hdi.rename(columns={'Country': 'country', 'level_1': 'year', 0: 'HDI'})
    #set datatype for year
    hdi['year']=hdi['year'].astype(int)
    hdi['HDI']=hdi['HDI'].astype(float)
    #only use data from 2000 onwards
    hdi = hdi[hdi.year >= 2000]
    #set index
    hdi=hdi.set_index(['country', 'year'])
    return hdi

hdi_df=load_hdi_data()
hdi_df.info()
hdi_df.head(10)

In [None]:
'''
Load & format World Governance Index data

column values:
CC.EST: Control of Corruption: Estimate
GE.EST: Government Effectiveness: Estimate
PV.EST: Political Stability and Absence of Violence/Terrorism: Estimate
RL.EST: Rule of Law: Estimate
RQ.EST: Regulatory Quality: Estimate
VA.EST: Voice and Accountability: Estimate

'''


def load_wgi_data():
    """ 
    Load wgi data file
    Reshape the dataset to have country, year, WGI columns
    Change data types
    Set hierarchical index (country, year)
    
    Returns
    --------
    wgi: data frame containing wgi data
    """
    #load dataset
    wgi = pd.read_csv('data/WGIData.csv',na_values=['..'])
    #drop unwanted rows
    wgi=wgi[wgi['Indicator Code'].str.contains('EST', regex= True, na=False)]
    #drop unwanted columns
    wgi.drop(['Country Code','Indicator Name'], inplace=True, axis=1)
    #reshape dataframe
    wgi.set_index(['Country Name','Indicator Code'], inplace=True)
    wgi=wgi.stack(dropna=False).to_frame()
    wgi.reset_index(drop=False, inplace=True)
    #rename columns
    wgi=wgi.rename(columns={'Country Name': 'country', 'level_2': 'year', 0: 'variable'})
    wgi.set_index(['country','year','Indicator Code'], inplace=True)
    wgi=wgi.unstack()
    wgi.columns = wgi.columns.droplevel(0)
    wgi.reset_index(drop=False, inplace=True)
    #set datatype for year 
    wgi['year']=wgi['year'].astype(int)
    #only use data from 2000 onwards
    wgi = wgi[wgi.year >= 2000]
    #set index
    wgi=wgi.set_index(['country', 'year'])
    return wgi
wgi_df=load_wgi_data()
wgi_df.info()
wgi_df.head(10)

## Country name inconsistency check 
Before merging the data we have to check the datasets for inconsistencies. 
We would like to merge on the country-year multiindex.
Year is used consistently in all data files; however, we have to look for differing usage and typos in country names.
We selected oecd_df as the base dataframe, so we compare the country names in oecd_df to the country names in the hdi_df, gdp_df and wgi_df datasets.
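
The idea behind the check can be sketched with a plain set difference. The helper `missing_countries` and the toy name lists below are hypothetical and only illustrate the comparison; the notebook's own `country_check` reports the same information per dataset.

```python
import pandas as pd

def missing_countries(base_names, other_names):
    """Return the names in base_names that do not occur in other_names, sorted."""
    return sorted(set(base_names) - set(other_names))

# Toy country lists; the spellings mimic typical differences between sources.
oecd_names = pd.Index(["Austria", "Iran", "Viet Nam"], name="country")
gdp_names = pd.Index(["Austria", "Iran, Islamic Rep.", "Vietnam"], name="country")

unmatched = missing_countries(oecd_names, gdp_names)
print(unmatched)
```

Every name returned this way would silently drop out of an inner merge on the country-year index, which is why the names are harmonised first.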

In [None]:
def country_check(countries, gdp_df, hdi_df, wgi_df):
    """ 
    Parameters
    --------
    countries: list of countries (source or destination) in oecd_df
    gdp_df: gdp of the countries
    hdi_df: hdi of the countries
    wgi_df: wgi data of the countries
    
    Returns
    --------
    check: dataframe showing inconsistent country name usage
    """   
    #get country name list series
    hdi_c = hdi_df.index.unique(level=0).to_series()
    gdp_c = gdp_df.index.unique(level=0).to_series()
    wgi_c = wgi_df.index.unique(level=0).to_series()
    
    #collect one result record per oecd_df source or destination country name,
    #flagging whether the name is present in the other dfs
    records = []
    for index,row in countries.iterrows():
        records.append({'country': row.country,
                        'hdi':row.isin(hdi_c).values[0],
                        'gdp':row.isin(gdp_c).values[0],
                        'wgi':row.isin(wgi_c).values[0]})
    #create dataframe for the results
    check=pd.DataFrame(records, columns=['country','hdi','gdp','wgi'])

    #keep only problematic country rows in the result df
    check=check.loc[(check['hdi'] == False) | (check['gdp'] == False) | (check['wgi'] == False)]
    return check

oecd_s = pd.DataFrame(oecd_df.index.unique(level=0))
oecd_s_check = country_check(oecd_s,gdp_df, hdi_df, wgi_df)

oecd_d = pd.DataFrame(oecd_df['destination'].unique(),columns=['country'])
oecd_d_check = country_check(oecd_d,gdp_df, hdi_df, wgi_df)

display(oecd_d_check.head(50),oecd_s_check.head(50))
df = pd.concat([oecd_d_check,oecd_s_check])

To make the country names consistent, we first tried the fuzzy search method of the fuzzywuzzy library. 
Due to errors, we finally decided to manually create a dictionary of country names to replace or delete.
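
The following sketch illustrates one general risk of fuzzy matching on country names. It uses `difflib` from the Python standard library as a stand-in, not fuzzywuzzy itself, and the candidate list is made up, so it does not reproduce the specific errors we encountered.

```python
import difflib

# Candidate names as they might appear in the base dataset (toy list).
base_names = ["Czech Republic", "Chad", "China"]

# difflib scores similarity purely on shared character sequences.
match = difflib.get_close_matches("Czechia", base_names, n=1, cutoff=0.6)
print(match)

# An explicit dictionary encodes the intended mapping instead.
manual = {"Czechia": "Czech Republic"}
print(manual["Czechia"])
```

With this candidate list the closest string match for "Czechia" is "China", because the short name shares proportionally more characters with it than with "Czech Republic". A hand-written dictionary avoids this kind of silent mismatch at the cost of manual work.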

In [None]:
country_dict = {
    'Bahamas, The' :"Bahamas",
    'Bolivia (Plurinational State of)' :"Bolivia",
    'Cabo Verde' :"Cape Verde",
    'Congo, Rep.' :"Congo",
    'Czechia' :"Czech Republic",
    "Cote d'Ivoire" :"Côte d'Ivoire",
    'Congo, Dem. Rep.' :"Democratic Republic of the Congo",
    'Congo (Democratic Republic of the)':"Democratic Republic of the Congo",
    'Egypt, Arab Rep.' :"Egypt",
    'Gambia, The' :"Gambia",
    'Iran (Islamic Republic of)' :"Iran",
    'Iran, Islamic Rep.': "Iran",
    'Korea (Republic of)' :"Korea",
    'Korea, Rep.':"Korea",
    'Kyrgyz Republic' :"Kyrgyzstan",
    "Lao People's Democratic Republic" :"Laos",
    "Lao PDR":"Laos",
    'Micronesia (Federated States of)' :"Micronesia",
    'Micronesia, Fed. Sts.':'Micronesia',
    'Moldova (Republic of)' :"Moldova",
    'Russian Federation' :"Russia",
    'St. Kitts and Nevis' :"Saint Kitts and Nevis",
    'St. Lucia' :"Saint Lucia",
    'St. Vincent and the Grenadines' :"Saint Vincent and the Grenadines",
    'Slovakia' :"Slovak Republic",
    'Syrian Arab Republic' :"Syria",
    'Eswatini' :"Swaziland",
    "Eswatini (Kingdom of)": "Swaziland",
    'Tanzania (United Republic of)' :"Tanzania",
    'Venezuela (Bolivarian Republic of)' :"Venezuela",
    "Venezuela, RB":"Venezuela",
    'Vietnam' :"Viet Nam",
    'Yemen, Rep.' :"Yemen"  
}

country_del = [
    #former countries
    "Former Czechoslovakia",
    "Former USSR",
    "Former Yugoslavia",
    "Serbia and Montenegro",
    #China's territories
    "Macau",
    "Chinese Taipei",
    "Hong Kong, China",
    #USA's territories 
    "Guam",
    "Puerto Rico",
    #UK's territory
    "Bermuda",
    #New Zealand's territories
    "Cook Islands",
    "Tokelau",
    "Niue",
    #Palestinian territories
    "West Bank and Gaza Strip",
    #no proper data
    "Nauru",
    "San Marino",
    "Somalia",
    "Tuvalu",
    "Democratic People's Republic of Korea",
    "Monaco",
    "Not stated",
    "Unknown",
    "Total",
]

def country_correction(df):
    """ 
    Parameters
    --------
    df: dataframe with original country names

    Returns
    --------
    df_corrected: data frame with replaced country names
    """
    #rename country names  
    df_corrected=df.rename(index=country_dict, level=0)
    return df_corrected

#delete unwanted countries from the oecd_df base df
oecd_df=oecd_df.drop(country_del, level=0, errors='ignore')

#call country_correction() to rename country names
hdi_df=country_correction(hdi_df)
gdp_df=country_correction(gdp_df)
wgi_df=country_correction(wgi_df)


#check for inconsistencies again
countries = pd.DataFrame(oecd_df.index.unique(level=0))
oecd_check = country_check(countries,gdp_df, hdi_df, wgi_df)

if oecd_check.empty:
    print("No more inconsistent country names!")

## Outlier detection
We checked the distributions of the hdi, gdp and wgi data. The histograms show that there are more poor than wealthy countries: GDP is skewed towards zero. HDI lies in the range [0, 1] and the wgi metrics lie roughly in the range [-2, 2]. We cannot see any outliers.
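
The visual check could be complemented by a numeric one. The sketch below is an assumption about how this might look, not part of our analysis: it flags values outside the valid range and applies the common 1.5 x IQR rule to made-up data. The helper name `iqr_outliers` is hypothetical.

```python
import pandas as pd

def iqr_outliers(values, k=1.5):
    """Return the entries lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values[(values < q1 - k * iqr) | (values > q3 + k * iqr)]

# Made-up index values in [0, 1], plus one value that cannot be a valid HDI.
toy_hdi = pd.Series([0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70, 1.80])

flagged = iqr_outliers(toy_hdi)
out_of_range = toy_hdi[(toy_hdi < 0) | (toy_hdi > 1)]
print(flagged.tolist(), out_of_range.tolist())
```

For a right-skewed variable such as GDP, a rule like this may flag legitimately high values, so its output still needs judgment.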

In [None]:
gdp_df['GDP'].hist(bins=50)

In [None]:
hdi_df['HDI'].hist(bins=50)

In [None]:
wgi_df[['CC.EST','GE.EST','PV.EST','RL.EST','VA.EST']].hist(bins=50)

## Handling missing values

### hdi, gdp and wgi datasets
After resolving the country name inconsistencies and checking for outliers, we moved on to examine missing values in the data files. We started by analysing the datasets to identify problematic areas, then implemented our solution to handle them. 
At this point we only focus on the hdi, gdp and wgi datasets since these datasets can be treated similarly. 
For the oecd dataset we will use other methods later. 

In [None]:
#investigating occurrences of missing values in the datasets
display(hdi_df.isnull().sum())
display(gdp_df.isnull().sum())
display(wgi_df.isnull().sum())
#display a sample with missing values
display(hdi_df.iloc[hdi_df.index.get_level_values('country') == 'Eritrea'])

Since the hdi, gdp and wgi indicators can be treated equally and their values don't change rapidly from year to year, we replace missing data with the median of the available values for the given country. We selected this method because interpolation can't work properly where many consecutive values are missing. Where no data were available for a country at all, we simply dropped the rows.
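
The per-country median fill can also be expressed without an explicit loop. The following is a minimal sketch on made-up data, assuming a (country, year) index as used in this notebook; it is an alternative formulation, not the function we ran.

```python
import numpy as np
import pandas as pd

# Toy (country, year) frame with gaps; the numbers are made up.
idx = pd.MultiIndex.from_product(
    [["Chad", "Eritrea", "Nowhere"], [2000, 2001, 2002]],
    names=["country", "year"],
)
toy = pd.DataFrame(
    {"HDI": [np.nan, 0.30, 0.32, 0.40, np.nan, 0.44, np.nan, np.nan, np.nan]},
    index=idx,
)

# Per-country median, broadcast back to every row of that country.
country_median = toy.groupby(level="country").transform("median")

# Fill the gaps, then drop countries that had no observations at all.
filled = toy.fillna(country_median).dropna()
print(filled)
```

A grouped `transform` computes each median once per country, which tends to be considerably faster than filtering the whole frame inside a loop.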

In [None]:
#it is a bit slow, please be patient!
def handle_missingMetricValues(incomplete_data):
    """ 
    Parameters
    --------
    incomplete_data: data frame containing missing values 
    
    Returns
    --------
    complete_data: data frame not containing any missing values
    """
    #fill missing values country by country
    for i in incomplete_data.index.unique(level=0):
        columns=incomplete_data[incomplete_data.index.get_level_values('country')==i].columns
        for col in columns:
            #fill missing values with the median of the country's available values
            median=incomplete_data.loc[incomplete_data.index.get_level_values('country') == i][col].median()
            incomplete_data.loc[(incomplete_data.index.get_level_values('country') == i) & (incomplete_data[col].isnull()), col] = median            
    #drop rows where there isn't any data, so the median can't be calculated
    complete_data=incomplete_data.dropna()
    return complete_data

hdi_complete=handle_missingMetricValues(hdi_df)
gdp_complete=handle_missingMetricValues(gdp_df)
wgi_complete=handle_missingMetricValues(wgi_df)

display(hdi_complete.isnull().sum())
display(gdp_complete.isnull().sum())
display(wgi_complete.isnull().sum())

display(hdi_complete.iloc[hdi_complete.index.get_level_values('country') == 'Eritrea'])

### Missing values in the oecd dataset
We examine the dataset and the occurrences of missing values.

In [None]:
display(oecd_df.info())
display(oecd_df.isnull().sum())
#examining the number of people flows for given source-destination country pairs
oecd=oecd_df.reset_index()
#change year to string to avoid aggregation by groupby
oecd['year']=oecd['year'].astype(str)
#set multiindex (country-destination)
oecd=oecd.set_index(['country','destination'])
#sum the number of asylum_seekers for each country-destination pairs
agg_df=oecd.groupby(oecd.index).sum(min_count=1)
display(agg_df)
#identify the country pairs where flows of people between the countries is zero
missing_pairs=agg_df.loc[(agg_df['asylum_seekers'] == 0) |(agg_df['asylum_seekers'].isnull())]
#list the missing country pairs 
display(missing_pairs)
display(oecd.iloc[(oecd.index.get_level_values('country') == 'Albania') & (oecd.index.get_level_values('destination') == 'Chile')])

There are a number of country pairs (e.g. Albania-Chile) where none of the years have inflows of asylum seekers between the countries. Since we could not find a comparable data source with appropriate data for cold-deck imputation, we assume that migration between these countries is negligible, and we decided to delete these rows from the final dataset. 
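
A compact way to express this deletion rule is a grouped sum that is broadcast back to the rows. The sketch below uses made-up flows and is only an illustration of the rule, not the function we ran.

```python
import numpy as np
import pandas as pd

# Toy flows: one row per (source, destination, year); values are made up.
flows = pd.DataFrame({
    "country":        ["Albania", "Albania", "Albania", "Albania", "Chad", "Chad"],
    "destination":    ["Chile", "Chile", "Austria", "Austria", "Chile", "Chile"],
    "year":           [2000, 2001, 2000, 2001, 2000, 2001],
    "asylum_seekers": [np.nan, 0.0, 12.0, np.nan, 0.0, 0.0],
})

# Total per pair, broadcast to each row; missing values count as 0 in the sum.
pair_total = flows.groupby(["country", "destination"])["asylum_seekers"].transform("sum")

# Keep only pairs with at least one positive recorded flow.
kept = flows[pair_total > 0]
print(kept)
```

Pairs whose yearly values are all zero or all missing are removed, while the Albania-Austria pair is kept together with its remaining gap, which is filled in a later step.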

In [None]:
#it is a bit slow, please be patient!
def delete_missingOECDValues(incomplete_data, missing_p):
    """ 
    Parameters
    --------
    incomplete_data: data frame containing missing values 
    missing_p: list of country pairs with missing data
    
    Returns
    --------
    complete_data: data frame with deleted missing values
    """
    #set country-destination multiindex on the dataset
    incomplete_data=incomplete_data.reset_index()
    incomplete_data=incomplete_data.set_index(['country','destination'])
    #delete rows where none of the years have inflows of asylum seekers between the given country pairs
    for i in incomplete_data.index.unique():
        if np.any(i==missing_p.index):
            incomplete_data.drop(i,inplace=True)
    complete_data=incomplete_data
    return complete_data
oecd_deleted=delete_missingOECDValues(oecd_df, missing_pairs)

In [None]:
#examine whether delete was successful or not
check=oecd_deleted.reset_index()
if check[(check['country']=='Albania') & (check['destination']=='Chile')].empty:
    display('Delete successful!')

For the remaining missing values, we calculated the mean of asylum_seekers for the given country pairs and filled the holes with this value.  

In [None]:
def input_missingOECDValues(incomplete_data):
    """ 
    Parameters
    --------
    incomplete_data: data frame containing missing values 
    
    Returns
    --------
    complete_data: data frame with imputed missing values, final dataframe that won't have missing values anymore
    """
    #set country-destination multiindex on the dataset
    incomplete_data=incomplete_data.reset_index()
    incomplete_data=incomplete_data.set_index(['country','destination'])
    #impute missing values with the mean of the given country pair
    for i in incomplete_data.index.unique():
        mean=incomplete_data.loc[(incomplete_data.index.get_level_values('country')==i[0]) & (incomplete_data.index.get_level_values('destination')==i[1])].mean()
        incomplete_data.loc[(incomplete_data.index.get_level_values('country') == i[0]) & (incomplete_data.index.get_level_values('destination')==i[1]) & (incomplete_data['asylum_seekers'].isnull()), 'asylum_seekers'] = mean.asylum_seekers
    complete_data=incomplete_data
    return complete_data
oecd_complete=input_missingOECDValues(oecd_deleted)
display(oecd_complete.isnull().sum())

In [None]:
#set back country-year as index before data merge
oecd_complete=oecd_complete.reset_index()
oecd_complete=oecd_complete.set_index(['country','year'])

In [None]:
def merge_dest(s_merged, gdp_df, hdi_df, wgi_df):
    """ 
    Parameters
    --------
    s_merged: merged dataset with values for the source country
    gdp_df: gdp of the countries
    hdi_df: hdi of the countries
    wgi_df: wgi data of the countries
    
    Returns
    --------
    d_merged: merged data frame that contains complete migration and country data for the source and destination countries
    """
    s_merged = s_merged.rename(columns={'destination': 'country'})
    s_merged=s_merged.set_index(['year','country'])
    merge1=pd.merge(gdp_df, hdi_df, on=['year', 'country'])
    merge2=pd.merge(merge1, wgi_df, on=['year', 'country'])
    d_merged=pd.merge(merge2,s_merged,on=['year','country'])
    d_merged=d_merged.reset_index()
    d_merged = d_merged.rename(columns={'GDP': 'd_GDP',
                                             'HDI': 'd_HDI',
                                             'CC.EST': 'd_CC.EST',
                                             'GE.EST': 'd_GE.EST',
                                             'PV.EST': 'd_PV.EST',
                                             'RL.EST': 'd_RL.EST',
                                             'VA.EST': 'd_VA.EST',
                                             'RQ.EST': 'd_RQ.EST',
                                             'country': 'destination'})
    return d_merged


def merge_data(oecd_df, gdp_df, hdi_df, wgi_df):
    """ 
    Parameters
    --------
    oecd_df: yearly data for foreign population inflow and asylum seeker inflow from source to destination country
    gdp_df: gdp of the countries
    hdi_df: hdi of the countries
    wgi_df: wgi data of the countries
    
    Returns
    --------
    merged_data: merged data frame that contains complete migration and country data
    
    """
    merge1=pd.merge(gdp_df, hdi_df, on=['year', 'country'])
    merge2=pd.merge(merge1, wgi_df, on=['year', 'country'])
    s_merged=pd.merge(merge2,oecd_df,on=['year','country'])
    s_merged=s_merged.reset_index()
    s_merged = s_merged.rename(columns={'GDP': 's_GDP',
                                             'HDI': 's_HDI',
                                             'CC.EST': 's_CC.EST',
                                             'GE.EST': 's_GE.EST',
                                             'PV.EST': 's_PV.EST',
                                             'RL.EST': 's_RL.EST',
                                             'VA.EST': 's_VA.EST',
                                             'RQ.EST': 's_RQ.EST',
                                             'country': 'source'})
    merged_data=merge_dest(s_merged,gdp_df, hdi_df, wgi_df)
    #set multiindex with year, source, destination
    merged_data=merged_data.set_index(['year','source'])
    
    return merged_data

data_merged = merge_data(oecd_complete, gdp_complete, hdi_complete, wgi_complete)
data_merged.to_csv("merged.csv")
data_merged.info()
data_merged

In [None]:
display(data_merged.isnull().sum())

### Predictions
In this section we try to predict the refugee flow and see whether this is possible. We use linear regression to do that. 

1) The merged data is already prepared and ready for further calculation.
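
The evaluation idea used in this section, fitting on the earlier years and predicting 2017, can be sketched on synthetic data as follows. The feature values and the linear relationship are made up, so the near-perfect fit here says nothing about the real data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data for one source country: the target is an exact linear
# function of the features, so a near-perfect fit is expected here.
years = np.arange(2000, 2018)
toy = pd.DataFrame({
    "year": years,
    "s_GDP": np.linspace(500.0, 900.0, len(years)),
    "s_PV.EST": np.sin(years),  # arbitrary made-up indicator
})
toy["asylum_seekers"] = 1000.0 - 0.5 * toy["s_GDP"] - 200.0 * toy["s_PV.EST"]

features = ["s_GDP", "s_PV.EST"]
train = toy[toy["year"] <= 2016]      # years used for fitting
held_out = toy[toy["year"] == 2017]   # year to be predicted

model = LinearRegression().fit(train[features], train["asylum_seekers"])
predicted_2017 = model.predict(held_out[features])[0]
actual_2017 = held_out["asylum_seekers"].iloc[0]
print(round(predicted_2017, 2), round(actual_2017, 2))
```

Splitting by year rather than at random keeps information from the predicted year out of the training data.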

In [None]:
countries = ["Chad"]   # list of source countries whose refugee flows we would like to predict; here only Chad
data_merged=data_merged.reset_index()   # turn the index back into regular columns

In [None]:

# function to prepare the data set for the prediction
# input: name of the source country
# output: the rows of 2017 (to be predicted) and the rows from 2000 to 2016 (for training)
def load_ref_data_predict(countries):
    ref_data_pred = data_merged[data_merged.year == 2017] # rows from the year 2017 only
    ref_data = data_merged[data_merged.year != 2017] # rows from 2000 to 2016
    # restrict both data sets to the requested source country (copies, so data_merged stays untouched)
    ref_data = ref_data[ref_data["source"] == countries].copy()
    ref_data_pred = ref_data_pred[ref_data_pred["source"] == countries].copy()
    # convert the source and destination names to numbers, since the regressors need numeric input
    ref_data['source'] = ref_data['source'].apply(lambda x:float(int(hashlib.sha1(x.encode('utf-8')).hexdigest(), 16)))
    ref_data['destination'] = ref_data['destination'].apply(lambda x:float(int(hashlib.sha1(x.encode('utf-8')).hexdigest(), 16)))   
    ref_data_pred['source'] = ref_data_pred['source'].apply(lambda x:float(int(hashlib.sha1(x.encode('utf-8')).hexdigest(), 16)))
    ref_data_pred['destination'] = ref_data_pred['destination'].apply(lambda x:float(int(hashlib.sha1(x.encode('utf-8')).hexdigest(), 16)))
    
    return ref_data_pred,ref_data





# columns used as descriptors for the prediction
columns = ["year","destination","source","d_VA.EST","d_GDP","d_HDI","d_CC.EST","d_GE.EST","d_PV.EST","d_RL.EST","s_GDP","s_HDI","s_GE.EST","s_PV.EST","s_RL.EST","s_VA.EST"] 
# list that stores the results of the prediction
prediction_score = []
# loop over all countries that the user defined above
for i in countries:
    # call the function defined above to prepare the data set for the prediction
    real,ref_pre_data = load_ref_data_predict(i)
    # define the descriptors used for training
    descriptors = ref_pre_data[columns]
    # and the target for the machine learning
    target = ref_pre_data[["asylum_seekers"]]

    np.random.seed(seed=12345) # fix the seed so the random split is reproducible
    msk = np.random.rand(len(descriptors)) <= 0.85 # mask to split the dataset into roughly 85% training and 15% test rows
    # this is a single hold-out split; 10 fold cross validation would instead rotate the test part over 10 disjoint folds
    training_data = descriptors[msk]
    test_data = descriptors[~msk]
    training_target = target[msk]
    test_target = target[~msk]

    # Linear Regression
    # regression is used because the target is a real number / integer, which classification methods cannot predict

    reg = LinearRegression().fit(training_data, training_target) # fit the model to the training data
    #logger.debug(reg.score(training_data,training_target)) # evaluate on the training data
    reg.predict(test_data) # predict
    #logger.info(reg.score(test_data,test_target)) #evaluate

    #Lasso Regression

    #make parameters list
    normalize = [False, True]
    intercept = [True, False]
    alpha = [1e-15, 1e-10, 1e-5, 1, 5, 10]
    results = [] # to collect the results

    #loop through all possible parameter combinations
    for n in normalize:
        for inter in intercept:
            for a in alpha:
                result_row = {}

                result_row["normalize"] = n
                result_row["intercept"] = inter
                result_row["alpha"] = a

                lasso = Lasso(alpha=a, fit_intercept=inter, normalize=n) # create the regressor

                lasso.fit(training_data, training_target)
                predicted = lasso.predict(test_data)

                result_row["score"] = round(lasso.score(test_data,test_target["asylum_seekers"]), 4)
                results.append(result_row)
    #logger.info(pd.DataFrame(results)) # look over the result list to find the best parameters for machine learning


    # Linear Regression revisited with polynomial features - parameter grid as for the Lasso Regression
    normalize = [False, True]
    bias = [False, True]
    interaction = [False, True]
    intercept = [True, False]
    degree = [ 1, 2, 3, 4, 5]
    results = []
    for n in normalize:
        for b in bias:
            for interact in interaction:
                for inter in intercept:
                    for d in degree:

                        result_row = {}
                        result_row["normalize"] = n
                        result_row["bias"] = b
                        result_row["interaction"] = interact
                        result_row["intercept"] = inter
                        result_row["degree"] = d

                        poly = PolynomialFeatures(include_bias=b, interaction_only=interact, degree=d)
                        lrm = LinearRegression(normalize=n, fit_intercept=inter)
                        # fit the feature expansion on the training data and reuse it for the test data
                        X_poly=poly.fit_transform(training_data)
                        X_p_poly=poly.transform(test_data)
                        lrm.fit(X_poly, training_target)
                        result_row["score"] = round(lrm.score(X_p_poly,test_target[["asylum_seekers"]]), 4)
                        results.append(result_row)
    pd.set_option('display.max_rows', None)
    #logger.info(pd.DataFrame(results))
    logger.info(pd.DataFrame(results)[["score"]].max())
    prediction_score.append(i)
    prediction_score.append(pd.DataFrame(results)[["score"]].max())
    # use the 2017 data to evaluate the prediction
    test_descriptors = real[columns]
    test_target = real[["asylum_seekers"]]

    poly = PolynomialFeatures(include_bias=False, interaction_only=False, degree=2)
    lrm = LinearRegression(normalize=True, fit_intercept=True)

    # use the full 2000-2016 data to learn
    X_poly=poly.fit_transform(descriptors)
    X_p_poly=poly.transform(test_descriptors)
    #logger.info(test_descriptors)
    #logger.info(descriptors)
    lrm.fit(X_poly, target)
    #logger.info(lrm.predict(X_p_poly))
    #logger.info(real["asylum_seekers"])
    logger.info(round(lrm.score(X_p_poly,test_target["asylum_seekers"]), 4)) # see how it performed

# Interpretation

- Can countries that will produce large numbers of refugees be predicted?

We predicted the flow of refugees for some countries. We selected the countries with the highest numbers of refugees and used the 2000-2016 data to predict 2017 with linear regression. We see that it is not possible to predict the data well for these countries: the best score for a prediction was below 40 percent.

- Can refugee flows be predicted?

It is also not easy to predict the flow of refugees from country to country. For example, if a war breaks out during the period, the refugee flow keeps increasing, and this makes prediction difficult.  
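
The scores quoted in this section are the values returned by the `score` method of the scikit-learn regressors, which is the coefficient of determination R². As a small sketch of how to read such a value, the following computes R² by hand on made-up numbers; the helper `r2_score_manual` is hypothetical and written only for this illustration.

```python
import numpy as np

def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

observed = [10.0, 20.0, 30.0, 40.0]   # made-up flows

perfect = r2_score_manual(observed, observed)
mean_only = r2_score_manual(observed, [25.0] * 4)
worse = r2_score_manual(observed, [40.0, 30.0, 20.0, 10.0])
print(perfect, mean_only, worse)
```

A value of 1 means a perfect prediction, 0 means the model does no better than always predicting the mean, and negative values mean it does worse than that.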
