
In [21]:
!pip install -q pami pmdarima

## Step 1: Install Required Libraries

This cell installs the necessary Python libraries:

- `PAMI`: For frequent pattern mining (e.g., FP-Growth).
- `pmdarima`: For time-series forecasting using Auto-ARIMA.

These installations are needed once per Colab runtime session and must be rerun after the runtime is reset.


## Step 2: Load Dataset and Select Top Sensors

- Load the sensor dataset (`ETL_DATASET.csv`).
- Identify the 100 sensors with the fewest missing (NaN) values.
- These sensors will be used for preprocessing and forecasting.
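
The selection rule above can be tried on a toy frame first. This is only a minimal sketch: the column names `s1` to `s3` and their values are made up for illustration and do not come from `ETL_DATASET.csv`, and `k=2` stands in for the 100 used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data: hypothetical sensor columns, not the real dataset.
toy = pd.DataFrame({
    "s1": [1.0, np.nan, 3.0, np.nan],
    "s2": [1.0, 2.0, 3.0, 4.0],
    "s3": [np.nan, np.nan, np.nan, 4.0],
})

# Count NaN per column and sort ascending, so the most complete sensors come first.
nan_counts = toy.isna().sum().sort_values()

# Keep the k most complete sensors (k=100 in the notebook, k=2 here).
top_k = list(nan_counts.index[:2])
print(top_k)
```

When several sensors have the same NaN count, their relative order is not guaranteed unless a stable sort is requested, for example `sort_values(kind="stable")`.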


In [36]:
import pandas as pd
import numpy as np
from scipy import stats
from pmdarima import auto_arima
from PAMI.frequentPattern.basic import FPGrowth

from google.colab import files
uploaded = files.upload()

df = pd.read_csv('ETL_DATASET.csv', parse_dates=[0])
df.set_index(df.columns[0], inplace=True)

nan_counts = df.isna().sum().sort_values()
top_100_sensors = nan_counts.index[:100]
print("Top 100 sensors:", list(top_100_sensors))



## Step 3: Remove Outliers and Impute Missing Values

- Detect outliers using Z-score (values where |z| > 3) and mark them as NaN.
- Use linear interpolation to fill missing values in each sensor.
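- Apply forward and backward filling to close any gaps that interpolation leaves. A sensor with no valid readings at all cannot be filled, so check the reported NaN count (50 in the run below) before forecasting.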
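
The cleaning steps above can be sketched on a single toy series. The values are hypothetical, and the Z-score is computed by hand with the population standard deviation (`ddof=0`) rather than with `scipy.stats.zscore`, so the sketch needs only pandas and NumPy.

```python
import numpy as np
import pandas as pd

# Toy series: hypothetical readings with one existing gap and one spike.
values = [10.0] * 6 + [np.nan] + [10.0] * 6 + [500.0] + [10.0] * 6
s = pd.Series(values)

# Z-score with population std (ddof=0); pandas skips NaN in mean() and std().
z = (s - s.mean()) / s.std(ddof=0)

# Mark |z| > 3 as missing, then fill all gaps linearly in both directions.
cleaned = s.mask(z.abs() > 3).interpolate(method="linear", limit_direction="both")

# A series with no valid values has nothing to interpolate or fill from.
empty = pd.Series([np.nan, np.nan, np.nan])
still_empty = empty.interpolate(method="linear", limit_direction="both").ffill().bfill()
```

Here `cleaned` contains no NaN and the spike is replaced by its neighbours' level, while `still_empty` stays entirely NaN. A nonzero NaN count after filling therefore points to sensors without any usable readings.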


In [23]:
df_selected = df[top_100_sensors].copy()
df_selected = df_selected.apply(pd.to_numeric, errors='coerce')

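for col in df_selected.columns:
    # Z-score per sensor, ignoring NaN; rows with |z| > 3 are treated as outliers.
    z = stats.zscore(df_selected[col], nan_policy='omit')
    outliers = np.where(np.abs(z) > 3)[0]
    df_selected.iloc[outliers, df_selected.columns.get_loc(col)] = np.nan

# Fill gaps linearly, then forward/backward fill anything still missing.
df_interpolated = df_selected.interpolate(method='linear', limit_direction='both')
# fillna(method=...) is deprecated in recent pandas; ffill()/bfill() are the replacements.
df_interpolated = df_interpolated.ffill().bfill()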

print("NaN remaining:", df_interpolated.isna().sum().sum())

NaN remaining: 50




## Step 4: Forecast 5 Days of Sensor Data Using Auto-ARIMA

- Use `auto_arima()` to automatically fit an ARIMA model to each sensor's time series.
- Forecast the next 5 days for each sensor.
- Sensors with fewer than 15 data points (forecast horizon + 10) are skipped.
- The result is saved as `etl_forecast_5days.csv`.
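
One thing to watch when collecting forecasts column by column is that pandas aligns on index labels when a Series is assigned to a DataFrame column. The sketch below uses a stand-in Series instead of a real `auto_arima` forecast; the index `100..104` and the values are assumptions for illustration, chosen to mimic a forecast indexed after the end of the training data.

```python
import numpy as np
import pandas as pd

horizon = 5
out = pd.DataFrame(index=range(horizon))  # index labels 0..4

# Stand-in for a forecast returned as a Series with its own index
# (hypothetical values; the index 100..104 is an assumption).
forecast = pd.Series([30.0, 31.0, 36.0, 37.0, 33.0], index=range(100, 105))

out["aligned_by_index"] = forecast        # labels do not match, so every row is NaN
out["by_position"] = forecast.to_numpy()  # positional assignment keeps the values
```

If the forecast column comes out as NaN, every later comparison such as `>= 35` evaluates to `False`, so it is worth checking `forecast_df.isna().sum().sum()` before saving.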


In [24]:
forecast_horizon = 5
forecast_df = pd.DataFrame(index=range(forecast_horizon))
skipped = []

for sensor in top_100_sensors:
    series = df_interpolated[sensor].dropna()
    if len(series) < forecast_horizon + 10:
        skipped.append(sensor)
        continue
    try:
        model = auto_arima(series, seasonal=False, max_p=3, max_q=3, d=1,
                           error_action='ignore', suppress_warnings=True)
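        forecast = model.predict(n_periods=forecast_horizon)
        # Assign by position: predict() may return a Series with its own index,
        # and index-aligned assignment would then fill this column with NaN.
        forecast_df[sensor] = np.asarray(forecast)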
    except Exception as e:
        print(f"Error on {sensor}:", e)
        skipped.append(sensor)

forecast_df.to_csv('etl_forecast_5days.csv', index=False)
print("Saved etl_forecast_5days.csv")


Saved etl_forecast_5days.csv




## Step 5: Convert Forecast to Binary and Transaction Format

- Apply a threshold (e.g., 35) to convert the forecasted values to binary:
  - Value ≥ 35 → 1 (High PM2.5)
  - Value < 35 → 0 (Low PM2.5)
- For each day, collect the names of sensors with a value of 1.
- Save each day's active sensors as a transaction in tab-separated format (`etl_tdb_5days.csv`).
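
The conversion above can be checked by hand on a small example. This is a sketch with hypothetical sensor names `A`, `B`, `C` and made-up forecast values; only the threshold of 35 is taken from this notebook.

```python
import pandas as pd

# Hypothetical forecasts for three sensors over three days.
forecast = pd.DataFrame({
    "A": [40.0, 20.0, 36.0],
    "B": [35.0, 10.0, 50.0],
    "C": [12.0, 15.0, 34.9],
})
threshold = 35

# 1 where the forecast is at or above the threshold, otherwise 0.
binary = (forecast >= threshold).astype(int)

# One transaction per day: the names of the sensors flagged 1 on that day.
transactions = [
    [str(col) for col, val in row.items() if val == 1]
    for _, row in binary.iterrows()
]

# Tab-separated lines, as they would be written to the transaction file.
lines = ["\t".join(t) for t in transactions]
print(lines)
```

A day on which no sensor reaches the threshold becomes an empty line. NaN forecasts also compare as `False` and end up as 0, so missing forecasts silently look like low readings.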


In [32]:
binary_df = (forecast_df >= 35).astype(int)
binary_df.to_csv('etl_forecast_binary_5days.csv', index=False)
print("Saved etl_forecast_binary_5days.csv")

transactions = []
for _, row in binary_df.iterrows():
    active = [str(sensor) for sensor, val in row.items() if val == 1]
    transactions.append(active)

with open('etl_tdb_5days.csv', 'w') as f:
    for t in transactions:
        f.write("\t".join(t) + "\n")

print("Saved etl_tdb_5days.csv")


Saved etl_forecast_binary_5days.csv
Saved etl_tdb_5days.csv


## Step 6: Run FP-Growth Algorithm Using PAMI

- Load the transactional database into the PAMI FP-Growth algorithm.
- Set the minimum support count (e.g., 2).
- Extract frequent patterns of co-occurring sensors.
- Display the number of patterns found and a sample of the results.
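
To make the meaning of a minimum support count concrete, the sketch below counts itemset support by brute-force enumeration. It is not FP-Growth and not PAMI's implementation; `frequent_itemsets` is a hypothetical helper that is only practical for tiny inputs, since the number of subsets grows exponentially with transaction length.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_sup):
    """Brute-force support counting; returns {itemset tuple: support count}."""
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        # Count every non-empty subset of the transaction once.
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    # Keep only itemsets that occur in at least min_sup transactions.
    return {k: v for k, v in counts.items() if v >= min_sup}


# Hypothetical transactions; the last day has no active sensors.
transactions = [["A", "B"], ["A", "B", "C"], ["A"], []]
patterns = frequent_itemsets(transactions, min_sup=2)
print(patterns)

# If every transaction is empty, no itemset can reach the support threshold.
no_patterns = frequent_itemsets([[], [], []], min_sup=2)
```

With `min_sup=2`, an itemset must appear in at least two of the daily transactions. An input file consisting only of empty lines therefore yields zero patterns regardless of the algorithm used.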


In [33]:
input_file = 'etl_tdb_5days.csv'
min_sup = 2

fp = FPGrowth.FPGrowth(iFile=input_file, minSup=min_sup, sep='\t')
fp.mine()

patterns_df = fp.getPatternsAsDataFrame()
print("Frequent patterns found:", len(patterns_df))
patterns_df.head()


Frequent patterns were generated successfully using frequentPatternGrowth algorithm
Frequent patterns found: 0


(empty DataFrame; columns: Patterns, Support)
