## Migration of International Families to and from Denmark

To recruit and retain an international workforce, Denmark needs to be attractive to international families, not just to single people.

This notebook looks at historical data from 2009 to 2024 to examine migration trends for people under the age of 20.

### Assumptions
* People under the age of 20 rarely migrate without their family, so migration in this age group can be used as a proxy for family migration
* Families are more likely to want to stay medium-to-long term so that children can complete programmes at their school and maintain friendships

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
df_migrant = pd.read_csv('youngmigrants.csv', header=None, encoding='ISO-8859-1')
df_immigrant = pd.read_csv('youngimmigration.csv', header=None, encoding='ISO-8859-1')
df_emigrant = pd.read_csv('youngemigrants.csv', header=None, encoding='ISO-8859-1')
df_citizens = pd.read_csv('youngcitizens.csv', header=None, encoding='ISO-8859-1')

In [40]:
# Function to assign column headers based on age, citizenship, and a year range
def assign_columns(df, start_year, end_year):
    columns = ['Age', 'Citizenship'] + list(range(start_year, end_year + 1))
    df.columns = columns
    return df

# Apply the function to each dataframe with appropriate year ranges
df_migrant = assign_columns(df_migrant, 2010, 2024)
df_immigrant = assign_columns(df_immigrant, 2009, 2023)
df_emigrant = assign_columns(df_emigrant, 2009, 2023)
df_citizens = assign_columns(df_citizens, 2010, 2023)
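
Because `assign_columns` overwrites `df.columns` directly, pandas raises a `ValueError` when the number of labels does not match the number of columns in the file. A minimal sketch of a pre-check follows, assuming each file has two label columns followed by one column per year. `expected_width` is a hypothetical helper that is not part of the notebook, and the toy rows (including the age label format) are made up for illustration.

```python
import pandas as pd

def expected_width(start_year, end_year, label_cols=2):
    """Columns a file should have: label columns plus one per year (inclusive)."""
    return label_cols + (end_year - start_year + 1)

# Toy frame standing in for a headerless CSV: Age, Citizenship, then 3 years.
# The values and the '0 years' label format are assumptions for illustration.
toy = pd.DataFrame([["0 years", "Denmark", 10, 11, 12],
                    ["1 year", "Denmark", 20, 21, 22]])

# Compare the actual width with the width implied by a year range
width_ok = toy.shape[1] == expected_width(2010, 2012)
width_bad = toy.shape[1] == expected_width(2010, 2023)
print(width_ok, width_bad)
```

Running a check like this before each `assign_columns` call would show early whether a year range such as 2010 to 2023 really matches the file it is applied to.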



I want to play around with these numbers a bit before I make my life very complicated, so I am going to flatten these dataframes into grand totals by age.
The insights will not be very interesting, but I can check whether my model works.

In [45]:
def totals_by_age(df, metric):
    # Sum across all citizenships for each year while keeping the age information.
    # Dropping 'Citizenship' first keeps its text values out of the sum.
    totals = df.drop(columns=['Citizenship']).groupby('Age').sum().reset_index()
    # Label the rows with the type of total, placing 'Metric' before 'Age'
    totals.insert(0, 'Metric', metric)
    return totals

total_migrants = totals_by_age(df_migrant, 'Total Migrants')
total_immigrants = totals_by_age(df_immigrant, 'Total Immigrants')
total_emigrants = totals_by_age(df_emigrant, 'Total Emigrants')
total_citizens = totals_by_age(df_citizens, 'Total Citizens')

# Concatenate all the DataFrames along the row axis, keeping age-specific totals
df_combined = pd.concat([total_migrants, total_immigrants, total_emigrants, total_citizens], ignore_index=True)

# Extract the numeric part of the 'Age' labels, ignoring any non-numeric text
# (expand=False returns a Series rather than a one-column DataFrame)
df_combined['Age'] = df_combined['Age'].str.extract(r'(\d+)', expand=False).astype(int)

# Sort the combined DataFrame by 'Age' to ensure proper ordering
df_combined = df_combined.sort_values(by='Age').reset_index(drop=True)


# Keep only the years present in all four datasets (2010-2023)
df_combined = df_combined.drop(columns=[2009, 2024], errors='ignore')

print(df_combined.columns)



Index(['Metric',    'Age',     2010,     2011,     2012,     2013,     2014,
           2015,     2016,     2017,     2018,     2019,     2020,     2021,
           2022,     2023],
      dtype='object')


In [48]:
# I need to take a break here because I am not sure this is calculating what I want it to

def track_cohort(df, metric, start_age, start_year, end_year):
    # Keep only the rows for the requested metric and index them by age,
    # so a cell can be looked up as (age, year). Chaining the calls returns
    # a new frame instead of modifying a slice of df in place.
    df_filtered = df[df['Metric'] == metric].set_index('Age')

    # Track a specific cohort, e.g., those who were start_age in start_year
    cohort_data = []

    # Loop over each year, shifting the age each time
    for year in range(start_year, end_year + 1):
        age = (year - start_year) + start_age  # Increment age each year starting from start_age
        if age in df_filtered.index:
            cohort_data.append(df_filtered.at[age, year])
        else:
            cohort_data.append(None)  # Append None if the age is not present

    # Create a DataFrame to display the cohort over time
    cohort_df = pd.DataFrame({'Year': list(range(start_year, end_year + 1)), 'Population': cohort_data, 'Metric': metric})

    return cohort_df



# Track the cohort for 'Total Migrants' starting from age 0 in 2010
cohort_migrants = track_cohort(df_combined, 'Total Migrants', start_age=0, start_year=2010, end_year=2023)

# Track the cohort for 'Total Immigrants' starting from age 0 in 2010
cohort_immigrants = track_cohort(df_combined, 'Total Immigrants', start_age=0, start_year=2010, end_year=2023)

# Track the cohort for 'Total Emigrants' starting from age 0 in 2010
cohort_emigrants = track_cohort(df_combined, 'Total Emigrants', start_age=0, start_year=2010, end_year=2023)

# Track the cohort for 'Total Citizens' starting from age 0 in 2010
cohort_citizens = track_cohort(df_combined, 'Total Citizens', start_age=0, start_year=2010, end_year=2023)

# Concatenate all cohort DataFrames for comparison
cohort_all = pd.concat([cohort_migrants, cohort_immigrants, cohort_emigrants, cohort_citizens], ignore_index=True)

# Display the resulting DataFrame for all metrics
print(cohort_all)


    Year  Population            Metric
0   2010        6050    Total Migrants
1   2011        6228    Total Migrants
2   2012        6333    Total Migrants
3   2013        6473    Total Migrants
4   2014        6617    Total Migrants
5   2015        6894    Total Migrants
6   2016        7471    Total Migrants
7   2017        7914    Total Migrants
8   2018        8104    Total Migrants
9   2019        8262    Total Migrants
10  2020        8361    Total Migrants
11  2021        8438    Total Migrants
12  2022        8544    Total Migrants
13  2023        9361    Total Migrants
14  2010         649  Total Immigrants
15  2011         476  Total Immigrants
16  2012         460  Total Immigrants
17  2013         451  Total Immigrants
18  2014         536  Total Immigrants
19  2015         804  Total Immigrants
20  2016         673  Total Immigrants
21  2017         403  Total Immigrants
22  2018         344  Total Immigrants
23  2019         304  Total Immigrants
24  2020         244  Tot

I think I want to group ages into bands (pre-school, grundskole, young adult) for each year in the data to make things more manageable. Maybe I need to do this before I join the dataframes?
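
A minimal sketch of the age banding with `pd.cut`, on made-up totals. The cut-offs (0-5 pre-school, 6-15 grundskole, 16-19 young adult) are my assumption and would need checking against how the Danish school system is actually defined, and the `Age band` column name is hypothetical.

```python
import pandas as pd

# Toy totals by single year of age for two years (made-up numbers)
toy = pd.DataFrame({
    "Age": [0, 3, 6, 10, 15, 16, 19],
    2010: [5, 4, 6, 7, 3, 2, 1],
    2011: [6, 5, 6, 8, 4, 2, 2],
})

# Assumed cut-offs: 0-5 pre-school, 6-15 grundskole, 16-19 young adult.
# Bins are right-inclusive, so -1 is used as the lower edge to include age 0.
bins = [-1, 5, 15, 19]
labels = ["Pre-school", "Grundskole", "Young adult"]
toy["Age band"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# Sum the yearly totals within each band
banded = toy.drop(columns="Age").groupby("Age band", observed=True).sum()
print(banded)
```

Since this only needs a numeric `Age` column, it could be applied either to each dataframe before joining or to the combined one afterwards, as long as the grouping also includes `Metric` in the combined case.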

Maybe a Sankey diagram?