In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings('ignore')

In [3]:
# Load the dataset to check if it's suitable for ARIMA analysis
arima_df = pd.read_csv('population_migration_df.csv')

# Displaying the first few rows of the dataframe to understand its structure and contents for ARIMA modeling
arima_df.head()


Unnamed: 0,Year,Age Group,Sex,Population,Emigrants_All_Destinations,Immigrants_All_Origins,Net_Migration,Migration_Rate
0,1996,0 - 14 years,Female,417972,600.0,3600.0,3000.0,7.177514
1,1996,0 - 14 years,Male,441452,400.0,3100.0,2700.0,6.11618
2,1996,15 - 24 years,Female,309797,11700.0,6700.0,-5000.0,-16.139601
3,1996,15 - 24 years,Male,323093,9800.0,4200.0,-5600.0,-17.332471
4,1996,25 - 44 years,Female,512789,3000.0,8100.0,5100.0,9.945611


In [5]:
# Aggregating the population across age groups and sexes to create a single yearly time series
yearly_population = arima_df.groupby('Year')['Population'].sum()

# Converting the series to a one-column DataFrame; 'Year' stays as the index
yearly_population_df = yearly_population.to_frame('Total_Population')

# Display the time series data
yearly_population_df


Year,Total_Population
1996,3626087
1997,3664313
1998,3703082
1999,3741647
2000,3789536
2001,3847198
2002,3917203
2003,3979853
2004,4045188
2005,4133839


In [6]:
from statsmodels.tsa.stattools import adfuller

# Performing the Augmented Dickey-Fuller test to check for stationarity
adf_test = adfuller(yearly_population_df['Total_Population'])

# Extracting the p-value from the test results
adf_p_value = adf_test[1]

# Returning the p-value, the test statistic, and the critical values at the 1%, 5%, and 10% significance levels
adf_p_value, adf_test[0], adf_test[4]


(0.9573186218596464,
 -0.014473769953042242,
 {'1%': -3.7112123008648155,
  '5%': -2.981246804733728,
  '10%': -2.6300945562130176})

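With a p-value of about 0.96, far above 0.05, we fail to reject the null hypothesis of a unit root, so the test gives no evidence that the time series is stationary. The test statistic (about -0.014) is also greater (less negative) than the critical values at the 1%, 5%, and 10% levels, which leads to the same conclusion.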
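
The decision rule behind this reading can be written out explicitly. The snippet below is only a sketch: it copies the numbers from the ADF output above, and `rejects_unit_root` is a hypothetical helper introduced for illustration, not part of statsmodels. The unit-root null is rejected at a given level only when the test statistic is below (more negative than) that level's critical value.

```python
# Values copied from the ADF output above
test_statistic = -0.014473769953042242
p_value = 0.9573186218596464
critical_values = {
    '1%': -3.7112123008648155,
    '5%': -2.981246804733728,
    '10%': -2.6300945562130176,
}


def rejects_unit_root(statistic, critical_values):
    """Hypothetical helper: for each significance level, report whether the
    statistic is below (more negative than) the critical value."""
    return {level: statistic < cv for level, cv in critical_values.items()}


rejections = rejects_unit_root(test_statistic, critical_values)
for level, rejected in rejections.items():
    print(f"{level}: reject unit root = {rejected}")

# Equivalent check using the p-value at the conventional 5% level
rejects_at_5pct = p_value < 0.05
```

Here every entry comes out as not rejected, which matches the p-value check.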

Given these results, we need to difference the data until it is stationary. The number of differences required gives the d term of the ARIMA(p, d, q) model, after which the remaining orders p and q can be determined.
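
As a rough sketch of the differencing step, the snippet below applies pandas `Series.diff()` to the yearly totals. It assumes only the ten values displayed in the table above (the full series in the notebook may be longer), and the variable names `total_population`, `first_diff`, and `second_diff` are illustrative.

```python
import pandas as pd

# Yearly totals as displayed above (assumption: the full series may be longer)
total_population = pd.Series(
    [3626087, 3664313, 3703082, 3741647, 3789536,
     3847198, 3917203, 3979853, 4045188, 4133839],
    index=pd.Index(range(1996, 2006), name='Year'),
    name='Total_Population',
)

# First difference: year-over-year change; the leading NaN is dropped
first_diff = total_population.diff().dropna()

# Second difference, in case one round of differencing is not enough
second_diff = first_diff.diff().dropna()

print(first_diff)
```

Re-running `adfuller` on `first_diff` (and on `second_diff` if needed) would indicate how many differences the series requires before the unit-root null can be rejected.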