Airline Passengers Data
---

In this exercise, you will use the Airline Passengers dataset, which records the monthly number of passengers for an airline company from 1949 to 1960.

1. Load the data from the file `c2_AirPassengers.csv` into a dataframe:
- Use the Month column as the index when you load the data.
- Check the shape of the data. 
- Are all years and months from 1949 to 1960 present in the data?

In [None]:
# Import libraries
import pandas as pd
import numpy as np

# Load the data and create a datetime index
data = pd.read_csv("c2_AirPassengers.csv", index_col="Month", parse_dates=True)

# Rename columns for convenience
data.columns = ["Passengers"]
data.index.names = ["Date"]

# Print the shape of the dataframe
print(data.shape)

# Print a few samples
data.head()

__Answers__:
Yes, there are 144 rows, i.e. 12 years multiplied by 12 months of data, which is consistent with one row for every month from January 1949 to December 1960.
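A row count of 144 is consistent with complete data, but it does not rule out a missing month offset by a duplicate. A minimal sketch of a stricter check follows. It assumes the parsed index holds month-start timestamps, and it uses a hypothetical stand-in `index` so that it runs without the CSV file; in the notebook, `data.index` would take its place.

```python
import pandas as pd

# Hypothetical stand-in for data.index (assumed to hold month-start dates).
index = pd.date_range("1949-01-01", "1960-12-01", freq="MS")

# Every month start from January 1949 to December 1960
expected = pd.date_range("1949-01-01", "1960-12-01", freq="MS")

# Months that should be present but are not, and repeated months
missing = expected.difference(index)
has_duplicates = index.duplicated().any()

print("Expected months:", len(expected))
print("Missing months:", len(missing))
print("Duplicated months:", has_duplicates)
```

If `missing` is empty and there are no duplicates, every year and month in the range is present exactly once.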

2. Do the following steps:
- Create new columns Month and Year using the index of the dataframe.
- Group the Passengers column by the Year and provide summary statistics of the number of passengers.
- Discuss the patterns.

In [None]:
# Create new columns Month and Year
data["Year"] = data.index.year
data["Month"] = data.index.month

# Group by the 'Year' and generate descriptive statistics
data.groupby("Year")["Passengers"].describe().astype(int)

__Answers__:
Looking at the result from the table, we observe:  
* an increasing *trend*, as the *mean* number of passengers increases with time. For example, the mean number of passengers in 1949 is approximately 126. Six years later, in 1955, the mean has more than doubled (284) and, in 1960, it has almost quadrupled (476).
* that the variation within each year increases with time. In 1949, the number of passengers ranges from 104 to 148 (i.e., a range of 44 passengers and a standard deviation of 13 passengers), while in 1960 the standard deviation is 77 passengers, i.e. about 6 times higher.
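The ratios quoted above can be checked with a little arithmetic. The sketch below uses only the rounded figures from the table (the variable names are introduced here for illustration) and also computes the coefficient of variation, std divided by mean, as one possible way to compare spread relative to level.

```python
# Rounded values taken from the summary table discussed above
mean_1949, std_1949 = 126, 13
mean_1960, std_1960 = 476, 77

# Growth of the level and of the spread between 1949 and 1960
mean_ratio = mean_1960 / mean_1949
std_ratio = std_1960 / std_1949

# Spread relative to the level in each year
cv_1949 = std_1949 / mean_1949
cv_1960 = std_1960 / mean_1960

print(f"mean ratio: {mean_ratio:.2f}, std ratio: {std_ratio:.2f}")
print(f"CV 1949: {cv_1949:.3f}, CV 1960: {cv_1960:.3f}")
```

With these rounded inputs the spread grows faster than the mean, so the relative variation rises as well.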

3. Finding trends and cycles in the data using visualisations:

- Do you observe a trend in the data over time?
- Is there a seasonal component present in the data? If yes, describe it and explain whether it has changed over time.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

In [None]:
# plotting trend and seasonality
fig, axes = plt.subplots(4, 1, figsize=(12, 14), gridspec_kw={"hspace": 0.4})
axes[0].plot(data["Passengers"])
axes[0].set_title("Monthly Number of Passengers")
axes[0].set_ylabel("Passengers")

axes[1].plot(data["Passengers"].resample("A").mean())
axes[1].set_title("Yearly Average Number of Passengers")
axes[1].set_ylabel("Passengers")

sns.boxplot(data=data, x="Month", y="Passengers", color="white", ax=axes[2])
axes[2].set_title("Distribution of the Passengers for each Month")

# Draw on the fourth panel explicitly instead of relying on the current axes
sns.scatterplot(
    data=data,
    x="Month",
    y="Passengers",
    hue="Year",
    size="Year",
    sizes=(200, 20),
    ax=axes[3],
)
axes[3].set_title("Shift in seasonal pattern over the years")
axes[3].legend(bbox_to_anchor=(1, 1))

plt.show()

__Answers__:
- Yes, there is a steady and upward trend in the data over time (1st and 2nd plots).
- Yes, there exists a strong seasonal pattern, as summer months are more popular than the winter months (3rd and 4th plots). Also, there is an upward shift in the seasonal component as more passengers are traveling in the more recent years (4th plot).
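A Year by Month table is another way to read the same two effects: each row shows the seasonal shape, and each column shows how a given month shifts upward over the years. The sketch below is only an illustration on a synthetic stand-in (`toy`, with an assumed July peak and steady growth), not on the real data; in the notebook, the same `pivot_table` call could be applied to `data`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the notebook's dataframe (assumption: July peak, steady growth)
index = pd.date_range("1949-01-01", periods=144, freq="MS")
t = np.arange(144)
seasonal = 0.2 * np.cos(2 * np.pi * (index.month.to_numpy() - 7) / 12)
toy = pd.DataFrame({"Passengers": np.exp(4.8 + 0.01 * t + seasonal)}, index=index)
toy["Year"] = toy.index.year
toy["Month"] = toy.index.month

# One row per year, one column per month
table = toy.pivot_table(index="Year", columns="Month", values="Passengers")

# Month with the highest value in each year
peak_month = table.idxmax(axis=1)

print(table.round(0).astype(int))
print(peak_month)
```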

4. Examine the autocorrelations in the data.

In [None]:
from pandas.plotting import lag_plot

In [None]:
fig, axes = plt.subplots(
    nrows=3, ncols=4, figsize=(15, 12), gridspec_kw={"hspace": 0.4, "wspace": 0.4}
)
fig.suptitle("Lag plots of current month Passengers versus lagged Passengers")

for (ax, lag) in zip(axes.flatten(), range(1, 13)):

    lag_plot(data["Passengers"], lag=lag, ax=ax)
    ax.set_title("autocorr: {:.2f}".format(data["Passengers"].autocorr(lag)))

fig.subplots_adjust(top=0.93)

plt.show()

In [None]:
import statsmodels.api as sm

In [None]:
sm.graphics.tsa.plot_acf(data["Passengers"], lags=24, alpha=0.05, zero=False)

plt.xlabel("Lag")
plt.ylabel("Autocorrelation")
plt.title("autocorrelations with 95% confidence bands")

plt.show()

__Answers__:

- There are strong positive autocorrelations at the 1st and 12th lags. Thus the 1st and 12th lags of the number of passengers are likely to be strong predictors of the future number of passengers.
- Based on the 95% confidence bands, the first 13 lags are significantly different from zero because they lie above the upper band.
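To make the numbers in the lag plots concrete: `Series.autocorr(lag)` is the Pearson correlation between the series and a copy of itself shifted by `lag`. The sketch below shows this on a synthetic stand-in series (an assumption, used so the example runs without the CSV) and compares the result with the simple white-noise band of ±1.96/√n. The bands drawn by `plot_acf` may be wider at higher lags, so this band is only a rough reference.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: trend + 12-month cycle + noise (not the real data)
rng = np.random.default_rng(0)
n = 144
t = np.arange(n)
series = pd.Series(100 + 2 * t + 20 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 5, n))

lag = 12

# Correlation between the series and its lagged copy
manual = series.corr(series.shift(lag))

# The built-in shortcut computes the same quantity
builtin = series.autocorr(lag)

# Approximate 95% band under the assumption of white noise
band = 1.96 / np.sqrt(n)

print(f"manual: {manual:.4f}, builtin: {builtin:.4f}, band: +/-{band:.4f}")
```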

5. Determine whether the data are stationary and, if there is evidence of non-stationarity, try to remove it.

In [None]:
# Plot the number of passengers
plt.figure(figsize=(8, 4))

plt.plot(data["Passengers"])
plt.xlabel("Year")
plt.ylabel("Monthly passengers (in thousands)")

plt.show()

__Answers__:

* The plot shows that both the mean and the variance increase over time, which indicates non-stationarity.
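Rolling statistics are one simple way to back up this visual impression. The sketch below computes a 12-month rolling mean and standard deviation on a synthetic stand-in series (`toy`, an assumption with multiplicative growth and seasonality); in the notebook, `data["Passengers"]` would take its place.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with multiplicative trend and seasonality (not the real data)
index = pd.date_range("1949-01-01", periods=144, freq="MS")
t = np.arange(144)
seasonal = 0.2 * np.cos(2 * np.pi * (index.month.to_numpy() - 7) / 12)
toy = pd.Series(np.exp(4.8 + 0.01 * t + seasonal), index=index)

# 12-month rolling mean and standard deviation
rolling_mean = toy.rolling(window=12).mean()
rolling_std = toy.rolling(window=12).std()

# Compare the first and the last complete 12-month window
first_mean, last_mean = rolling_mean.dropna().iloc[[0, -1]]
first_std, last_std = rolling_std.dropna().iloc[[0, -1]]

print(f"rolling mean: {first_mean:.1f} -> {last_mean:.1f}")
print(f"rolling std:  {first_std:.1f} -> {last_std:.1f}")
```

For a stationary series, both rolling statistics would stay roughly constant over time instead of drifting upward.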

In [None]:
fig, ax = plt.subplots(4, 1, figsize=(10, 10), sharex=True)

# create the plot for original data
ax[0].set_title("Monthly Number of Passengers")
ax[0].plot(data["Passengers"])

# create the plot for log-transformed data
ax[1].set_title("Log-transformed Monthly Number of Passengers")
ax[1].plot(np.log(data["Passengers"]))

# create the seasonally differenced plot
ax[2].set_title("12-Month Differenced Log-transformed Number of Passengers")
ax[2].plot(np.log(data["Passengers"]).diff(12))

# create the first differenced plot
ax[3].set_title(
    "First Difference of the 12-Month Differenced Log-transformed Number of Passengers"
)
ax[3].set_xlabel("Date")
ax[3].plot(np.log(data["Passengers"]).diff(12).diff(1))

fig.tight_layout()
plt.show()

__Answers__:

* The plots show that taking the log makes the variation more stable over time (2nd panel). The 12-month differencing then removes the seasonality (3rd panel), and finally the first differencing removes the remaining trend, so the series fluctuates around zero (last panel).
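Why this chain of transformations works can be seen on an idealised series. The sketch below assumes a purely multiplicative toy series (exponential growth times a fixed seasonal factor, no noise), which is a simplification and not the real data. The log turns the product into a sum, the 12-month difference cancels the seasonal term and leaves a constant, and the first difference removes that constant. It also shows that the two differences together cost 13 observations.

```python
import numpy as np
import pandas as pd

# Idealised toy series: exponential growth times a fixed seasonal factor, no noise
index = pd.date_range("1949-01-01", periods=144, freq="MS")
t = np.arange(144)
seasonal = 0.2 * np.cos(2 * np.pi * (index.month.to_numpy() - 7) / 12)
toy = pd.Series(np.exp(4.8 + 0.01 * t + seasonal), index=index)

# log: multiplicative structure becomes additive
log_toy = np.log(toy)

# 12-month difference: the seasonal term cancels, leaving 12 * 0.01 = 0.12
seasonal_diff = log_toy.diff(12)

# first difference: the remaining constant is removed
transformed = seasonal_diff.diff(1)

# 12 values are lost to the seasonal difference and 1 more to the first difference
n_valid = int(transformed.notna().sum())

print("seasonal difference (first valid values):", seasonal_diff.dropna().head(3).round(4).tolist())
print("valid observations after both differences:", n_valid)
```

On real data the result is not exactly zero, but the same mechanism explains why the last panel fluctuates around zero.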

In [None]:
sm.graphics.tsa.plot_acf(
    np.log(data["Passengers"]).diff(12).diff(1).dropna(),
    lags=15,
    alpha=0.05,
    zero=False,
)

plt.xlabel("Lag")
plt.ylabel("Autocorrelation")
plt.title("autocorrelations with 95% confidence bands")

plt.show()

__Answers__:

* The plot shows that most of the autocorrelations are no longer present in the transformed data. Some autocorrelation remains at the 1st and 12th lags, so one can use these lags to model the (stationary) time series.