# AISE4010 — Assignment 1 
### Time Series Forecasting with Autoregressive Models and MLP 

**Grade:** 100 points

## Instructions
Follow these steps before submitting your assignment:
1. Complete the notebook.
2. Make sure **all plots have axis labels**.
3. When finished, go to **Kernel → Restart & Run All** to ensure a clean, error‑free run.
4. Fix any errors until your notebook runs without problems.
5. Submit **one completed notebook** per group to OWL by the deadline.
6. Reference all external code and documentation you use.

## Dataset 
- **File:** `weather.csv`
- **Location:** Szeged, Hungary
- **Frequency:** Daily (fixed calendar index)
- **Time span:** ≈ 2006–2016
- **Target:** `Temperature (C)`
- **Key variables:** Temperature (C), Pressure (millibars), Humidity, Apparent Temperature (C), Wind Speed (km/h), Wind Bearing (degrees), Visibility (km)


## Question 1: Data Preprocessing (25%)

### Q1.1 Exploratory Data Analysis (2%)
1. Load the dataset and print the **first 6 rows**.  
2. Encode categorical variables (one‑hot). *(If none, report "none.")*


In [26]:
# import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from pmdarima import auto_arima # install first (pip install pmdarima)
from statsmodels.tsa.stattools import adfuller # install first (pip install statsmodels)
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error
from math import sqrt

ModuleNotFoundError: No module named 'pmdarima'

In [None]:
# Answer to Q1.1.1

path = 'weather.csv'
df = pd.read_csv(path, index_col='Formatted Date', parse_dates=True)
df.head(6)  # head() defaults to 5 rows; the question asks for the first 6

In [20]:
# Answer to Q1.1.2
#No categorical variables 
print("None")

None


### Q1.2 Handling Missing Data (11%)
1. Report missingness (**counts & %**) and show a heatmap for **Temperature, Pressure, Humidity**.
2. Use these two imputation methods:\
   a) Forward-fill (FFill).\
   b) Linear time interpolation.
3. Pick one method and save the result as `daily_clean`.
4. **Discussion** Which imputation method did you pick and why?  
5. **Discussion** How might your choice bias trend/seasonality estimates?  
6. **Discussion** Would your choice change if the gap were 30 days instead of 7? Explain briefly.


In [21]:
# Answer to Q1.2.1

countNa = df.isnull().sum()
percentageNa = (df.isnull().sum() / len(df))*100

string = f"{countNa} \n\n{percentageNa}"
print(string)

Temperature (C)             47
Apparent Temperature (C)     0
Humidity                    40
Wind Speed (km/h)            0
Wind Bearing (degrees)       0
Visibility (km)              0
Pressure (millibars)        47
dtype: int64 

Temperature (C)             1.169445
Apparent Temperature (C)    0.000000
Humidity                    0.995272
Wind Speed (km/h)           0.000000
Wind Bearing (degrees)      0.000000
Visibility (km)             0.000000
Pressure (millibars)        1.169445
dtype: float64
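
The counts and percentages above can also be combined into one table, which is easier to read and to sort. This is a minimal sketch on a small synthetic frame; the `demo` frame and the `missing_report` helper are hypothetical and not part of the assignment data.

```python
import numpy as np
import pandas as pd

def missing_report(frame):
    """Return per-column missing counts and percentages, worst column first."""
    counts = frame.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": 100 * counts / len(frame),
    })
    return report.sort_values("missing", ascending=False)

# Synthetic stand-in for the weather data (values are made up)
demo = pd.DataFrame({
    "Temperature (C)": [1.0, np.nan, 3.0, np.nan],
    "Humidity": [0.8, 0.9, np.nan, 0.7],
    "Visibility (km)": [10.0, 9.0, 8.0, 7.0],
})
report = missing_report(demo)
print(report)
```

In the notebook, the same helper could be called on `df` in place of `demo`.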


In [22]:
# Answer to Q1.2.2

# Forward fill (FFill): carry the last valid value forward
# (DataFrame.ffill() replaces the deprecated fillna(method="ffill"))
df_ffill = df.ffill()

# Linear interpolation between the neighbouring valid values
df_linear = df.interpolate(method="linear")


In [23]:
# Answer to Q1.2.3

daily_clean = df_linear
print(daily_clean)

                Temperature (C)  Apparent Temperature (C)  Humidity  \
Formatted Date                                                        
2005-12-31                  0.6                      -4.0      0.89   
2006-01-01                  4.1                      -0.2      0.82   
2006-01-02                  5.3                       1.8      0.85   
2006-01-03                  2.3                       0.4      0.90   
2006-01-04                  2.3                      -0.7      0.91   
...                         ...                       ...       ...   
2016-12-27                  0.3                      -3.2      0.89   
2016-12-28                  0.2                      -3.2      0.89   
2016-12-29                  0.2                      -3.3      0.89   
2016-12-30                  0.1                      -3.3      0.89   
2016-12-31                  0.1                      -3.3      0.89   

                Wind Speed (km/h)  Wind Bearing (degrees)  Visibility (km)  

**Answer to Q1.2.4**: 


I chose linear interpolation over FFill because every variable is continuous rather than categorical. If the temperature is 10 on day 1 and 12 on day 3, linear interpolation estimates day 2 as 11, which is plausible for a smoothly varying quantity. FFill suits step-like signals such as a traffic light: if readings 1 and 3 are both GO, it is reasonable to carry the value forward and assume reading 2 is GO as well.
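
The temperature example can be checked directly. This is a small sketch on a three-day toy series (the `temps` values are illustrative and not taken from `weather.csv`):

```python
import numpy as np
import pandas as pd

# Day 2 is missing between readings of 10 and 12
temps = pd.Series(
    [10.0, np.nan, 12.0],
    index=pd.date_range("2006-01-01", periods=3, freq="D"),
)

ffilled = temps.ffill()                       # repeats the previous day's value
linear = temps.interpolate(method="linear")   # midpoint of the two neighbours

print(ffilled.tolist())  # [10.0, 10.0, 12.0]
print(linear.tolist())   # [10.0, 11.0, 12.0]
```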

**Answer to Q1.2.5**: 


Linear interpolation smooths over short-term variability and cannot reproduce sudden changes or extreme values inside a gap, so it can bias estimates of short-term seasonality. It is less likely to distort long-term trends than FFill, which artificially extends past values and flattens variability.

**Answer to Q1.2.6**: 


If the missing gap were 30 days instead of 7, linear interpolation would become less reliable since it assumes linearity across the gap. Over long gaps, environmental variables like temperature or humidity may change non-linearly (e.g., due to weather patterns). In that case, other methods such as seasonal decomposition, regression models, or external weather data sources would be more appropriate.
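
One way to act on this is to interpolate only short gaps and leave long ones missing for a different method. The sketch below is an assumption about how that could be done, not part of the assignment: `interpolate_short_gaps` and `max_gap` are hypothetical names. It measures each gap first, because the `limit` argument of `interpolate` would partially fill a long gap rather than skip it.

```python
import numpy as np
import pandas as pd

def interpolate_short_gaps(series, max_gap=7):
    """Linearly fill interior gaps of at most `max_gap` points; keep longer gaps as NaN."""
    is_na = series.isna()
    # Give every run of consecutive missing / observed points its own id
    run_id = is_na.astype(int).diff().ne(0).cumsum()
    # Length of the missing run each point belongs to (0 for observed points)
    run_len = is_na.groupby(run_id).transform("sum")
    filled = series.interpolate(method="linear", limit_area="inside")
    # Undo the interpolation wherever the gap was too long
    return filled.mask(is_na & (run_len > max_gap))

# Synthetic daily series with one short and one long gap
idx = pd.date_range("2006-01-01", periods=40, freq="D")
demo = pd.Series(np.arange(40, dtype=float), index=idx)
demo.iloc[5:8] = np.nan     # 3-day gap: filled
demo.iloc[12:30] = np.nan   # 18-day gap: left missing
result = interpolate_short_gaps(demo, max_gap=7)
print(result.isna().sum())
```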

### Q1.3 Stationarity Analysis (12%)
1. Extract the **univariate** series `Temperature_Series = daily_clean['Temperature (C)']` and `Pressure_Series = daily_clean['Pressure (millibars)']`.  
2. Report the results of a stationarity test (**ADF** or **KPSS**) for both series.
3. **Discussion:** Explain your conclusion about the stationarity of each series based on your results.
4. Apply differencing to both series, plot **before/after**, and report the results of your chosen stationarity test on the differenced series.
5. **Discussion:** Explain the reason for your choice of differencing technique for each series.
6. **Discussion:** Would you difference a series that is already stationary by ADF? When might that still help?

In [25]:
# Answer to Q1.3.1
Temperature_Series = daily_clean['Temperature (C)']
Pressure_Series = daily_clean['Pressure (millibars)']

In [None]:
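# Answer to Q1.3.2
def adf_test(dataset):
    # Augmented Dickey-Fuller test; the lag order is chosen by AIC
    dftest = adfuller(dataset, autolag='AIC')
    # adfuller returns (statistic, p-value, lags used, n obs, critical values, icbest)
    print(f"ADF statistic: {dftest[0]:.4f}")
    print(f"p-value: {dftest[1]:.4f}")
    return dftest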

**Answer to Q1.3.3**: 


In [None]:
# Answer to Q1.3.4
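
Differencing as asked in Q1.3.4 can be sketched with `Series.diff`. The example below uses a synthetic series (linear trend plus a yearly cycle) as a stand-in for `Temperature_Series`. Treating 365 days as the seasonal lag is an assumption, not something the assignment specifies.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2006-01-01", periods=3 * 365, freq="D")
t = np.arange(len(idx))
# Synthetic stand-in: slow upward trend plus a 365-day cycle
demo = pd.Series(0.01 * t + 10 * np.sin(2 * np.pi * t / 365), index=idx)

# First difference y_t - y_{t-1}: turns a linear trend into a constant
first_diff = demo.diff().dropna()

# Seasonal difference y_t - y_{t-365}: cancels the yearly cycle
seasonal_diff = demo.diff(365).dropna()

print(len(first_diff), len(seasonal_diff))
print(seasonal_diff.round(6).unique())
```

On this toy series the seasonal difference is a constant equal to one year of trend, which shows why a seasonal difference is the natural choice when the dominant structure is an annual cycle.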

**Answer to Q1.3.5**: 


**Answer to Q1.3.6**: 


## Question 2: Model-Based Techniques (35%)

### Q2.1 ARIMA model identification and forecasting (20%)
1. Use `Temperature_Series` to plot **ACF/PACF** and list the choice of candidate order set for ARIMA: (p,d,q).  
2. **Discussion:** Explain the reasons for your choice of (p,d,q).
3. Select the orders by **AIC** using the training set (hold out the last **365 days** for testing).  
4. Fit the selected ARIMA on the training set and evaluate the predictions' **MAE/MSE** on the test set.
5. Plot predictions with **95% CI**.
6. Forecast the **next 365 days** and visualize with historical context and **95% CI**.


In [18]:
df.shape  # shape is an attribute (a tuple), not a method

In [17]:
# Answer to Q2.1.1
df['Temperature (C)'].plot(figsize=(12, 5))  # figsize must be a (width, height) tuple
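
Q2.1.1 asks for ACF/PACF plots. In the notebook these would normally come from `plot_acf` and `plot_pacf` in `statsmodels.graphics.tsaplots`. As a library-free illustration of what the ACF measures, the sketch below computes the sample autocorrelation by hand on a synthetic AR(1) series. The helper `sample_acf` and the simulated data are hypothetical.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation r_k for lags k = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array(
        [np.dot(x[: len(x) - k], x[k:]) / denom for k in range(max_lag + 1)]
    )

# Simulate an AR(1) process y_t = 0.8 * y_{t-1} + e_t
rng = np.random.default_rng(0)
noise = rng.normal(size=2000)
ar1 = np.zeros(2000)
for i in range(1, 2000):
    ar1[i] = 0.8 * ar1[i - 1] + noise[i]

acf = sample_acf(ar1, max_lag=5)
print(np.round(acf, 2))
```

For an AR(1) process the autocorrelation decays roughly geometrically with the lag, which is the kind of pattern that motivates a small `p` and `q = 0` among the candidate orders.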

**Answer to Q2.1.2**: 


In [None]:
# Answer to Q2.1.3

In [None]:
# Answer to Q2.1.4
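
The hold-out split and the MAE/MSE evaluation can be sketched independently of the model. In the example below the forecast is a seasonal-naive stand-in (repeat the last observed year), not an ARIMA forecast, and the data are synthetic. In the notebook, `forecast` would be replaced by the fitted model's predictions for the test period.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2006-01-01", periods=4 * 365, freq="D")
t = np.arange(len(idx))
# Synthetic stand-in for Temperature_Series: a pure 365-day cycle
demo = pd.Series(10 + 10 * np.sin(2 * np.pi * t / 365), index=idx)

# Hold out the last 365 days; never shuffle a time series
horizon = 365
train, test = demo.iloc[:-horizon], demo.iloc[-horizon:]

# Stand-in forecast: repeat the last year of the training set
forecast = pd.Series(train.iloc[-horizon:].to_numpy(), index=test.index)

errors = test - forecast
mae = errors.abs().mean()
mse = (errors ** 2).mean()
print(f"MAE={mae:.4f}  MSE={mse:.4f}")
```

A baseline like this is also a useful reference point: an ARIMA model that cannot beat it on the test year is probably missing the seasonal structure.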

In [None]:
# Answer to Q2.1.5

In [None]:
# Answer to Q2.1.6

### Q2.2 SARIMA forecasting (15%)

1. Derive a **monthly** series from `daily_clean` and fit a **SARIMA** with seasonal period **12**. Hold out the **last 24 months** as test set for prediction. 
2. Report the predictions' **MAE/MSE**.
3. Plot the prediction with **95% CI**.
4. **Discussion:** Compare the ARIMA and SARIMA predictions and explain your findings.


In [None]:
# Answer to Q2.2.1
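
Deriving the monthly series and the 24-month hold-out could look like the sketch below. It uses a synthetic daily series in place of `daily_clean['Temperature (C)']` and assumes that monthly means are the intended aggregation, which the assignment does not state. The SARIMA fit itself is left out; `statsmodels.tsa.statespace.sarimax.SARIMAX` with a `seasonal_order` ending in 12 is one option for it.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2006-01-01", "2016-12-31", freq="D")
t = np.arange(len(idx))
# Synthetic stand-in for the cleaned daily temperature
daily_demo = pd.Series(10 + 10 * np.sin(2 * np.pi * t / 365.25), index=idx)

# Month-start bins, averaged over each calendar month
monthly = daily_demo.resample("MS").mean()

# Hold out the last 24 months as the test set
train, test = monthly.iloc[:-24], monthly.iloc[-24:]
print(len(monthly), len(train), len(test))
print(test.index[0].date(), test.index[-1].date())
```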

In [None]:
# Answer to Q2.2.2

In [None]:
# Answer to Q2.2.3

**Answer to Q2.2.4**: 


## Question 3: Neural Networks for Time Series Forecasting (40%)
Use `daily_clean` for all parts.

### Q3.1 Sliding Window for Time Series — Univariate (2%)
1. Restructure **Temperature**: past **10 days** → **next day**.

2. Hold out last 20% as test set.


In [None]:
# Answer to Q3.1.1

In [None]:
# Answer to Q3.1.2
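
The restructuring in Q3.1 can be sketched as follows: each row of `X` holds 10 consecutive values and `y` holds the value that follows them. The helper name `make_windows` is hypothetical, and the demo series is a simple counter so the alignment is easy to check by eye.

```python
import numpy as np

def make_windows(values, window=10):
    """Turn a 1-D series into (past `window` values -> next value) pairs."""
    values = np.asarray(values, dtype=float)
    X = np.stack([values[i : i + window] for i in range(len(values) - window)])
    y = values[window:]
    return X, y

series = np.arange(100, dtype=float)   # stand-in for the temperature values
X, y = make_windows(series, window=10)

# Chronological 80/20 split: the test set is the most recent 20% of windows
split = int(len(X) * 0.8)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

print(X.shape, y.shape)   # (90, 10) (90,)
print(X[0], y[0])         # first window is 0..9, its target is 10.0
```

Splitting by position rather than at random keeps every test window later in time than the training windows, although the first test windows still take their inputs from days that also appear in the training set.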

### Q3.2 MLP — Univariate (18%)
1. Build an MLP with one hidden layer of 64 neurons, ReLU activation, Adam optimizer with learning rate of 0.001 and batch_size=32, and train it for 20 epochs. 
2. Report **RMSE/MAE**.
3. Plot **two figures**: (1) **last 100 test points** (true vs. predicted), (2) **scatter (true vs. predicted)** with the **y = x** line.
4. **Discussion:** Compare ARIMA vs. the univariate MLP in **RMSE/MAE** and **plots**. Which patterns does each capture better? 
5. **Discussion:** Would increasing the input window beyond 10 days help? Why or why not?


In [None]:
# Answer to Q3.2.1
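
The assignment does not name a framework, so the sketch below uses scikit-learn's `MLPRegressor` as a stand-in; a Keras or PyTorch model may be what the course expects. With the `adam` solver, `max_iter` counts epochs. The training data are synthetic. Note that `MLPRegressor` also applies a small default L2 penalty (`alpha`), which the assignment does not mention.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Synthetic stand-in for the daily temperature series
rng = np.random.default_rng(0)
t = np.arange(1200)
series = 10 * np.sin(2 * np.pi * t / 365) + rng.normal(scale=0.5, size=t.size)

# Past 10 days -> next day, then a chronological 80/20 split
window = 10
X = np.stack([series[i : i + window] for i in range(len(series) - window)])
y = series[window:]
split = int(len(X) * 0.8)

# One hidden layer of 64 ReLU units, Adam, lr=0.001, batch size 32, 20 epochs
model = MLPRegressor(
    hidden_layer_sizes=(64,),
    activation="relu",
    solver="adam",
    learning_rate_init=0.001,
    batch_size=32,
    max_iter=20,
    random_state=0,
)
model.fit(X[:split], y[:split])   # may warn that 20 epochs did not converge

pred = model.predict(X[split:])
rmse = np.sqrt(mean_squared_error(y[split:], pred))
mae = mean_absolute_error(y[split:], pred)
print(f"RMSE={rmse:.3f}  MAE={mae:.3f}")
```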

In [None]:
# Answer to Q3.2.2

In [None]:
# Answer to Q3.2.3

**Answer to Q3.2.4**: 


**Answer to Q3.2.5**: 


### Q3.3 MLP — Multivariate (20%)
1. Use **Temperature & Pressure** for the past **10 days** as the inputs and **next‑day Temperature** as the target. 
2. Hold out last 20% as test set.
3. Build a Multivariate MLP with one hidden layer of 64 neurons, ReLU activation, Adam optimizer with learning rate of 0.001 and batch_size=32, and train it for 20 epochs. 
4. Report **RMSE/MAE**.
5. Plot **two figures**: (1) **last 100 test points** (true vs. predicted), (2) **scatter (true vs. predicted)** with the **y = x** line.
6. **Discussion:** Did Pressure improve Temperature forecasting vs. univariate? Why might it help/hurt?  
7. **Discussion:** Suggest two additional features you would add next and why?


In [None]:
# Answer to Q3.3.1
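
For the multivariate case, each input row can hold 10 days of both features flattened into 20 values, since an MLP expects a flat input vector. This is a sketch under that assumption: `make_multivariate_windows` is a hypothetical helper and the two demo columns are counters chosen so the layout is easy to read.

```python
import numpy as np

def make_multivariate_windows(features, target, window=10):
    """Flatten `window` days of all features into one row; the target is the next day."""
    features = np.asarray(features, dtype=float)   # shape (n_days, n_features)
    target = np.asarray(target, dtype=float)       # shape (n_days,)
    n = len(features) - window
    X = np.stack([features[i : i + window].ravel() for i in range(n)])
    y = target[window:]
    return X, y

temp = np.arange(50, dtype=float)               # stand-in for Temperature (C)
pressure = 1000 + np.arange(50, dtype=float)    # stand-in for Pressure (millibars)

X, y = make_multivariate_windows(np.column_stack([temp, pressure]), temp, window=10)
print(X.shape, y.shape)   # (40, 20) (40,)
print(X[0][:4], y[0])     # day-0 temp, day-0 pressure, day-1 temp, day-1 pressure; target 10.0
```

Because pressure values are about two orders of magnitude larger than temperatures, scaling both features with statistics computed on the training portion only is usually worth considering before fitting the network.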

In [None]:
# Answer to Q3.3.2

In [None]:
# Answer to Q3.3.3

In [None]:
# Answer to Q3.3.4

In [None]:
# Answer to Q3.3.5

**Answer to Q3.3.6**: 


**Answer to Q3.3.7**: 
