# YOUR PROJECT TITLE

> **Note the following:** 
> 1. This is *not* meant to be an example of an actual **data analysis project**, just an example of how to structure such a project.
> 1. Remember the general advice on structuring and commenting your code
> 1. The `dataproject.py` file includes a function which can be used multiple times in this notebook.

Imports and set magics:

In [1]:
# a. import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import ipywidgets as widgets
from pandas_datareader import wb

# b. autoreload modules when code is run
%load_ext autoreload
%autoreload 2

# c. user written modules
from plot_function import *


# Introduction

In this project we analyse the relationship between net migration and four variables: GDP, the employment rate, the labour force and wages. We chose this topic because migration is heavily debated in most rich countries, especially with the rise of the far right in Europe.

We chose to study the USA, since it is a net immigration country, and Romania, since it is a net emigration country. We use data from 1991, the first year for which all series are available, until 2019, so that the pandemic is excluded.

# Read and clean data

We import the data from the World Bank through the `pandas_datareader` API and save a copy of each series to disk.

In [2]:
# set up the sample period
start_year = 1991
end_year = 2019

We download each indicator into its own dataframe and rename the indicator column to a readable name.

In [3]:
# a. migration
df_migration = wb.download(indicator='SM.POP.NETM', country=[ 'USA', 'ROU'], start=start_year, end=end_year)
df_migration = df_migration.rename(columns = {'SM.POP.NETM':'Net Migration'})

# b. GDP
df_gdp = wb.download(indicator='NY.GDP.MKTP.CD', country=[ 'USA', 'ROU'], start=start_year, end=end_year)
df_gdp = df_gdp.rename(columns = {'NY.GDP.MKTP.CD':'GDP'})

# c. employment rate
df_employ = wb.download(indicator='SL.EMP.TOTL.SP.ZS', country=[ 'USA', 'ROU'], start=start_year, end=end_year)
df_employ = df_employ.rename(columns = {'SL.EMP.TOTL.SP.ZS':'Employment Rate'})

# d. labor force
df_labour = wb.download(indicator='SL.TLF.TOTL.IN', country=[ 'USA', 'ROU'], start=start_year, end=end_year)
df_labour = df_labour.rename(columns = {'SL.TLF.TOTL.IN':'Labor Force'})

# e. wage (SL.EMP.WORK.ZS is the World Bank share of wage and salaried workers in total employment)
df_wage = wb.download(indicator='SL.EMP.WORK.ZS', country=[ 'USA', 'ROU'], start=start_year, end=end_year)
df_wage = df_wage.rename(columns = {'SL.EMP.WORK.ZS':'Wage'})

# f. resetting indexes and column type
df_migration = df_migration.reset_index().astype({'year': int, 'country': 'string'})
df_gdp = df_gdp.reset_index().astype({'year': int, 'country': 'string'})
df_employ = df_employ.reset_index().astype({'year': int, 'country': 'string'})
df_labour = df_labour.reset_index().astype({'year': int, 'country': 'string'})
df_wage = df_wage.reset_index().astype({'year': int, 'country': 'string'})

# g. save data to csv files
df_migration.to_csv('migration.csv', index=False)
df_gdp.to_csv('gdp.csv', index=False)
df_employ.to_csv('employment_ratio.csv', index=False)
df_labour.to_csv('labour_force.csv', index=False)
df_wage.to_csv('wage.csv', index=False)
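
The rename, index reset and type conversion above follow the same pattern for every indicator. A minimal sketch of that pattern on a small hand-made frame is shown here, so it runs without calling the World Bank API. The helper name `tidy`, the frame `raw` and all values are hypothetical, and the sketch assumes that `wb.download` returns a frame indexed by `country` and `year` with the year stored as text.

```python
import pandas as pd

# Hypothetical stand-in for the result of wb.download (assumption): a frame
# indexed by (country, year) with the year stored as text. Values are made up.
index = pd.MultiIndex.from_product(
    [["Romania", "United States"], ["2019", "2018"]],
    names=["country", "year"],
)
raw = pd.DataFrame({"SM.POP.NETM": [-10.0, -12.0, 30.0, 28.0]}, index=index)


def tidy(frame, code, name):
    """Rename the indicator column, flatten the index and fix the dtypes."""
    out = frame.rename(columns={code: name}).reset_index()
    return out.astype({"year": int, "country": "string"})


df_example = tidy(raw, "SM.POP.NETM", "Net Migration")
print(df_example.dtypes)
```

Wrapping the three steps in one function would let the five indicators share a single code path, at the cost of hiding the individual steps from the reader.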

In [4]:
# a. create a list with the dataframes
df_list = [df_gdp, df_employ, df_labour, df_wage]


# b. merge all dataframes together
df = df_migration

for dtf in df_list:
    df = pd.merge(df, dtf, how='outer', on=['country', 'year'])


## Explore each data set

In order to be able to **explore the raw data**, you may provide **static** and **interactive plots** to show important developments 

In [5]:
# plot net migration and a chosen variable on two y-axes
def plot_func(country, data):
    country_data = df[df['country'] == country]
    fig, ax1 = plt.subplots()

    color = 'tab:blue'
    ax1.set_xlabel('Year')
    ax1.set_ylabel('Net Migration', color=color)
    ax1.plot(country_data['year'], country_data['Net Migration'], color=color)
    

    ax2 = ax1.twinx()  

    color = 'tab:purple'
    ax2.set_ylabel(str(data), color=color)  
    ax2.plot(country_data['year'], country_data[data], color=color)


    plt.title(f'{country}: Net Migration and {data} over Time')
    fig.tight_layout()  
    plt.show()

# create two dropdowns
country_widget = widgets.Dropdown(options=df['country'].unique(), description='Country:')
data_list=df.columns.tolist()[3:]
data_widget = widgets.Dropdown(options=data_list, description='Data:')

#create an interactive plot
widgets.interact(plot_func, country=country_widget, data=data_widget);



In [6]:
# scatter plot of a chosen variable against net migration, with a linear trend line
def scatter_func(country, data):
    country_data = df[df['country'] == country]
    fig, ax1 = plt.subplots()

    color = 'tab:blue'
    ax1.set_xlabel('Net Migration')
    ax1.set_ylabel(str(data), color=color)
    ax1.scatter(country_data['Net Migration'], country_data[data], color=color)
    
    # fit and draw a linear trend line (np.polyfit returns the slope first, then the intercept)
    slope, intercept = np.polyfit(country_data['Net Migration'], country_data[data], 1)
    ax1.plot(country_data['Net Migration'], country_data['Net Migration']*slope + intercept)

    plt.title(f'{country}: Net Migration vs {data}')
    fig.tight_layout()  
    plt.show()

# create two dropdowns
country_widget = widgets.Dropdown(options=df['country'].unique(), description='Country:')
data_list=df.columns.tolist()[3:]
data_widget = widgets.Dropdown(options=data_list, description='Data:')

#create an interactive plot
widgets.interact(scatter_func, country=country_widget, data=data_widget);


**Interactive plot** :

Explain what you see when moving elements of the interactive plot around. 

# Merge data sets

Here we are dropping elements from both data set X and data set Y. A left join would keep all observations in data X intact and subset only from Y. 

Make sure that your resulting data sets have the correct number of rows and columns. That is, be clear about which observations are thrown away. 
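
One way to check the row count after a merge is to ask pandas to record where each row came from. The sketch below uses two made-up frames (only the column layout mirrors this notebook) and the `indicator` and `validate` arguments of `pd.merge`. It is an illustration of the check, not the merge actually run above.

```python
import pandas as pd

# Made-up frames: each has one (country, year) pair the other lacks.
left = pd.DataFrame({
    "country": ["A", "A", "B"],
    "year": [2000, 2001, 2000],
    "Net Migration": [1.0, 2.0, 3.0],
})
right = pd.DataFrame({
    "country": ["A", "B", "B"],
    "year": [2000, 2000, 2001],
    "GDP": [10.0, 30.0, 40.0],
})

# indicator=True adds a "_merge" column with the origin of each row;
# validate="one_to_one" raises an error if a key is duplicated on either side.
merged = pd.merge(
    left, right,
    how="outer", on=["country", "year"],
    indicator=True, validate="one_to_one",
)
counts = merged["_merge"].value_counts()

print(len(left), len(right), len(merged))
print(counts)
```

With an outer join no observation is thrown away, so rows marked `left_only` or `right_only` show up as missing values in the merged frame instead of disappearing.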

**Note:** Don't make Venn diagrams in your own data project. It is just for exposition. 

# Analysis

In [7]:
# a. separate US and Romania data
df_US = df[df["country"] == "United States"]
df_Rom = df[df["country"] == "Romania"]

# b. select the correct columns
var_list = df.columns.tolist()[3:]

# c. create a dictionary with all combinations of countries and measures
dict_var = {"country":["United States", "United States", "Romania", "Romania"], "measure":["corr","R^2 (%)","corr","R^2 (%)"]}

# d. adding the variables to the dictionary
for data in var_list:

    # i. calculating correlation between migration and data for each country 
    US_data = df_US["Net Migration"].corr(df_US[data])
    Rom_data = df_Rom["Net Migration"].corr(df_Rom[data])

    # ii. adding the correlation and coefficient of determination to the dictionary
    dict_var[data] = [f"{US_data:.3f}" , f"{100 * US_data**2:.0f}", f"{Rom_data:.3f}", f"{100 * Rom_data**2:.0f}"]

# e. create dataframe with data
corr_table = pd.DataFrame(dict_var).set_index(["country","measure"])

print(corr_table)

                          GDP Employment Rate Labor Force    Wage
country       measure                                            
United States corr     -0.670           0.376      -0.748  -0.689
              R^2 (%)      45              14          56      48
Romania       corr      0.251           0.060      -0.105   0.378
              R^2 (%)       6               0           1      14


As we can see, the correlation results for GDP are inconclusive. The US has a negative correlation (-0.670), meaning that higher net migration tends to coincide with lower GDP, while Romania has a small positive correlation (0.251), pointing in the opposite direction. Romania's coefficient of determination is very small: variation in net migration accounts for only 6% of the variation in GDP. In the US, by contrast, it accounts for 45%.

The employment rate is positively correlated with net migration in both countries, so higher net migration tends to coincide with a higher employment rate. The relationship is weak, however: it accounts for only 14% of the variation in the US and about 0% in Romania, where the two series are essentially uncorrelated.

The labor force is negatively correlated with net migration in both countries, meaning that higher net migration tends to coincide with a smaller labor force. In the US, variation in net migration accounts for 56% of the variation in the labor force, while in Romania it accounts for only 1%.

Finally, the results for wages are also inconclusive. The US shows a negative correlation (-0.689), meaning that higher net migration tends to coincide with lower wages, while Romania shows a positive one (0.378). Variation in net migration accounts for 48% of the variation in wages in the US, but only 14% in Romania.
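
The percentages above are the squared correlations from the table. For a straight-line fit with an intercept, the squared Pearson correlation equals the coefficient of determination of the fitted line, which is why it can be read as a share of explained variation. The sketch below checks this on made-up numbers, not on the World Bank data.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.1, 3.9, 6.2, 7.8, 10.1])

# Pearson correlation, as computed with Series.corr in the analysis cell.
r = x.corr(y)

# Linear fit y = slope * x + intercept, as in the scatter plot cell.
slope, intercept = np.polyfit(x, y, 1)
fitted = slope * x + intercept

# Coefficient of determination: 1 - residual variation / total variation.
ss_res = ((y - fitted) ** 2).sum()
ss_tot = ((y - y.mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot

print(f"corr = {r:.3f}, corr^2 = {r**2:.3f}, R^2 = {r_squared:.3f}")
```

The equality holds only for a single regressor with an intercept, and a high value describes how well a line fits, not why the two series move together.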

# Conclusion

We can conclude that net migration does not explain much of the variation in GDP, employment, labor force and wages: at most 56% in the US and at most 14% in Romania. It is therefore not a reliable way of predicting these variables, and other variables might be more closely related to them.

The labor force in the US has the strongest relationship with net migration (a correlation of -0.748, or an R^2 of 56%), but even then we cannot sustain the claim that migration is bad, since correlation is not causation.
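
One reason correlation need not reflect causation in yearly series is a shared time trend. The sketch below is an assumption-laden illustration on constructed series, not a result from our data: two series that both grow over the sample period are strongly correlated in levels, while their year-to-year changes are uncorrelated by construction.

```python
import numpy as np
import pandas as pd

# Constructed series (not World Bank data): a common upward trend plus
# two deterministic patterns whose year-to-year changes are unrelated.
years = np.arange(1991, 2020)
t = np.arange(len(years))
wiggle_x = np.where(t % 2 == 0, 1.0, -1.0)          # +1, -1, +1, -1, ...
wiggle_y = np.where((t // 2) % 2 == 0, 1.0, -1.0)   # +1, +1, -1, -1, ...

x = pd.Series(t + wiggle_x, index=years)
y = pd.Series(t + wiggle_y, index=years)

# Correlation of the levels is driven by the common trend.
corr_levels = x.corr(y)

# Correlation of the first differences removes the common trend.
corr_changes = x.diff().corr(y.diff())

print(f"levels: {corr_levels:.3f}, changes: {corr_changes:.3f}")
```

Comparing the correlation in levels with the correlation in changes is one simple robustness check that could be applied to the table in the analysis section.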