## Step 1: Load and Inspect the Dataset

### Loading the Data
We load the synthetic IoT sensor data from a CSV file. The `timestamp` column is parsed as a `datetime` object and set as the index to facilitate time-based operations.

In [1]:
import pandas as pd
import numpy as np

# Load the CSV file (change the filename if needed)
df = pd.read_csv('synthetic_iot_sensor_data.csv', parse_dates=['timestamp'])

# Set timestamp as the index to facilitate time-based operations
df.set_index('timestamp', inplace=True)

# View the first few rows of the dataset
print(df.head())


                     temperature  vibration    pressure   humidity  failure  \
timestamp                                                                     
2024-12-01 00:00:00    27.746529   1.291776  162.724539  49.082652        0   
2024-12-01 00:01:00    28.762782   1.163971  143.952310  53.087117        0   
2024-12-01 00:02:00    27.959954   1.163477  159.641210  47.865008        0   
2024-12-01 00:03:00    33.445726   1.262017  161.557163  51.507817        0   
2024-12-01 00:04:00    30.700783   0.941095  162.739058  61.794441        0   

                     machine_id  
timestamp                        
2024-12-01 00:00:00           1  
2024-12-01 00:01:00           1  
2024-12-01 00:02:00           1  
2024-12-01 00:03:00           1  
2024-12-01 00:04:00           1  


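### Checking for Missing Values
We count the missing values in each column, then fill any gaps with a forward fill (`ffill`) followed by a backward fill (`bfill`); the backward fill covers gaps at the very start of the data that a forward fill cannot reach. This dataset reports zero missing values in every column, so the fill changes nothing here.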

In [2]:
# Check for missing values
missing_counts = df.isnull().sum()
print(f"Missing Values:\n{missing_counts}")

# Fill missing values with a forward fill, then a backward fill for leading gaps
# (fillna(method=...) is deprecated in recent pandas releases)
df.ffill(inplace=True)  # Forward fill
df.bfill(inplace=True)  # Backward fill


Missing Values:
temperature    0
vibration      0
pressure       0
humidity       0
failure        0
machine_id     0
dtype: int64
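
The dataset carries a `machine_id` column, so a plain forward fill can copy a reading from one machine into the next machine's rows wherever the two meet. The sketch below shows the difference on a made-up six-row frame (the values and the two machine IDs are assumptions for illustration, not taken from the dataset) and one possible per-machine alternative using `groupby`.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; values and machine IDs are invented for illustration.
toy = pd.DataFrame(
    {
        "machine_id": [1, 1, 1, 2, 2, 2],
        "temperature": [np.nan, 20.0, np.nan, np.nan, 30.0, 31.0],
    }
)

# Plain fill: the first row of machine 2 inherits 20.0 from machine 1.
filled_global = toy["temperature"].ffill().bfill()

# Per-machine fill: each machine's gaps are filled only from its own readings.
filled_per_machine = toy.groupby("machine_id")["temperature"].transform(
    lambda s: s.ffill().bfill()
)

print(filled_global.tolist())
print(filled_per_machine.tolist())
```

Whether the per-machine variant is needed depends on how the rows are ordered and where the gaps fall; with no missing values, as in this dataset, both give the same result.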



## Step 2: Outlier Detection and Capping
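### Detecting and Capping Outliers
We flag outliers in each sensor column with the interquartile range (IQR) rule and cap them: values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are clipped to those bounds rather than dropped, so the row count stays the same.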

In [3]:
# Detect and cap outliers using IQR
def cap_outliers(series, threshold=1.5):
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - threshold * IQR
    upper_bound = Q3 + threshold * IQR
    return np.clip(series, lower_bound, upper_bound)

# Apply capping to sensor columns
for col in ['temperature', 'vibration', 'pressure', 'humidity']:
    df[col] = cap_outliers(df[col])

# View summary statistics to ensure outliers are capped
print(df.describe())


         temperature      vibration       pressure       humidity  \
count  432000.000000  432000.000000  432000.000000  432000.000000   
mean       30.193101       1.020732     150.121604      50.023433   
std         7.623520       0.765009      36.747368      14.707954   
min        12.247470      -0.846734      63.612176      13.235434   
25%        23.307757       0.328889     116.454736      36.451290   
50%        30.124453       1.012455     150.109059      49.967662   
75%        36.875796       1.688356     183.944987      63.551031   
max        57.227855       3.727556     235.730156      84.902346   

             failure     machine_id  
count  432000.000000  432000.000000  
mean        0.006944       5.500000  
std         0.083044       2.872285  
min         0.000000       1.000000  
25%         0.000000       3.000000  
50%         0.000000       5.500000  
75%         0.000000       8.000000  
max         1.000000      10.000000  
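
To make the IQR rule concrete, here is a small worked example on six made-up readings. It uses `Series.clip`, which is equivalent to the `np.clip` call above for this purpose, and it counts how many values were changed, a more direct check than scanning `describe()`. The readings are invented for illustration.

```python
import pandas as pd

def cap_outliers(series, threshold=1.5):
    """Clip values outside [Q1 - threshold*IQR, Q3 + threshold*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - threshold * iqr, upper=q3 + threshold * iqr)

# Five ordinary readings and one obvious spike (made-up values).
readings = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 100.0])
capped = cap_outliers(readings)

# Q1 = 11.25, Q3 = 13.75, IQR = 2.5, so the bounds are 7.5 and 17.5.
n_changed = int((readings != capped).sum())
print(capped.tolist())
print(f"values capped: {n_changed}")
```

Only the spike is pulled in to the upper bound of 17.5; the other five readings pass through unchanged.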


## Step 3: Data Resampling
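### Resampling to Hourly Data
The data is resampled to hourly intervals by taking the mean of every column within each hour. The mean is taken across all machines at once, which is why `machine_id` averages to 5.5 in the output and `failure` becomes the fraction of readings flagged in that hour. The result is stored in `df_resampled`; the feature engineering in Step 4 continues on the minute-level `df`.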

In [4]:
# Resample to hourly data (you can also use 'D' for daily, '15min' for 15-minute intervals)
# The uppercase 'H' and 'T' aliases are deprecated in recent pandas releases
df_resampled = df.resample('h').mean()  # Resample hourly and compute mean
print(df_resampled.head())


                     temperature  vibration    pressure   humidity  failure  \
timestamp                                                                     
2024-12-01 00:00:00    35.194576   1.543561  174.481228  60.699344      0.0   
2024-12-01 01:00:00    38.988500   1.956669  194.782088  68.449871      0.0   
2024-12-01 02:00:00    31.835665   1.208833  156.979400  52.262330      0.0   
2024-12-01 03:00:00    21.945653   0.161972  110.353798  33.953453      0.0   
2024-12-01 04:00:00    22.696499   0.274998  115.117618  35.854095      0.0   

                     machine_id  
timestamp                        
2024-12-01 00:00:00         5.5  
2024-12-01 01:00:00         5.5  
2024-12-01 02:00:00         5.5  
2024-12-01 03:00:00         5.5  
2024-12-01 04:00:00         5.5  
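
If hourly values per machine are wanted instead of one pooled average, one option is to group by `machine_id` together with an hourly `pd.Grouper`. The sketch below runs on a small stand-in frame (two machines, two hours, invented values) and is not the notebook's method; taking the `max` of `failure` is an assumed choice that marks an hour as failed if any minute in it failed.

```python
import numpy as np
import pandas as pd

# Stand-in data: 2 machines, 120 one-minute readings each (invented values).
idx = pd.date_range("2024-12-01", periods=120, freq="min", name="timestamp")
frames = []
for machine in (1, 2):
    failure = np.zeros(len(idx), dtype=int)
    if machine == 2:
        failure[90] = 1  # one failure for machine 2, in the second hour
    frames.append(
        pd.DataFrame(
            {"machine_id": machine, "temperature": 20.0 + machine, "failure": failure},
            index=idx,
        )
    )
toy = pd.concat(frames)

# Group by machine and by hour of the DatetimeIndex, then aggregate per column.
hourly = toy.groupby(["machine_id", pd.Grouper(freq="h")]).agg(
    temperature=("temperature", "mean"),
    failure=("failure", "max"),
)
print(hourly)
```

The result has one row per machine per hour, indexed by `(machine_id, timestamp)`, so `machine_id` stays an identifier instead of being averaged.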



## Step 4: Feature Engineering
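### Creating Lag Features
Lag features hold each sensor's value from 1, 2, and 3 rows earlier, giving a model access to recent history. `shift` works on row position, so a lag equals one minute only where consecutive rows belong to the same machine and are one minute apart.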

In [5]:
# Create lag features for the past 3 time points
for col in ['temperature', 'vibration', 'pressure', 'humidity']:
    for lag in range(1, 4):  # Lags 1, 2, and 3
        df[f'{col}_lag_{lag}'] = df[col].shift(lag)
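
Because `shift` is positional, the first rows of each machine can pick up the last readings of the previous machine when several machines are stacked in one frame. The toy example below (invented values, two hypothetical machines) contrasts a plain shift with a per-machine shift via `groupby`; it is a sketch of one way to handle this, assuming the rows are sorted by time within each machine.

```python
import pandas as pd

# Two machines stacked in one frame (invented values).
toy = pd.DataFrame(
    {
        "machine_id": [1, 1, 1, 2, 2, 2],
        "temperature": [20.0, 21.0, 22.0, 30.0, 31.0, 32.0],
    }
)

# Plain shift: machine 2's first row receives machine 1's last reading (22.0).
toy["lag_1_global"] = toy["temperature"].shift(1)

# Per-machine shift: each machine's first row has no predecessor, so it is NaN.
toy["lag_1_per_machine"] = toy.groupby("machine_id")["temperature"].shift(1)

print(toy)
```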


### Adding Rolling Statistics
Rolling mean, standard deviation, minimum, maximum, kurtosis, and skewness are computed over a window of 60 rows, which corresponds to 60 minutes at the data's one-minute sampling interval.

In [6]:
# Rolling windows for past 60 timestamps (since this is minute-based, 60 = 1 hour)
for col in ['temperature', 'vibration', 'pressure', 'humidity']:
    df[f'{col}_rolling_mean'] = df[col].rolling(window=60).mean()
    df[f'{col}_rolling_std'] = df[col].rolling(window=60).std()


In [7]:
# Compute rolling min, max, kurtosis, skewness
for col in ['temperature', 'vibration', 'pressure', 'humidity']:
    df[f'{col}_rolling_min'] = df[col].rolling(window=60).min()
    df[f'{col}_rolling_max'] = df[col].rolling(window=60).max()
    df[f'{col}_rolling_kurtosis'] = df[col].rolling(window=60).kurt()
    df[f'{col}_rolling_skew'] = df[col].rolling(window=60).skew()
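
A fixed-size window produces NaN until it is full, which is where most of the rows dropped in Step 5 come from. The short example below uses a window of 3 on five made-up readings to show that warm-up, along with two alternatives: `min_periods=1`, and a time-based window such as `"3min"`, which requires a `DatetimeIndex`. The series is invented for illustration.

```python
import pandas as pd

# Five one-minute readings (invented values).
idx = pd.date_range("2024-12-01", periods=5, freq="min")
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=idx)

# Fixed window of 3 rows: the first 2 results are NaN.
fixed = s.rolling(window=3).mean()

# Same window, but emit a value as soon as 1 observation is available.
partial = s.rolling(window=3, min_periods=1).mean()

# Time-based window: all readings in the 3 minutes up to and including each row.
timed = s.rolling("3min").mean()

print(fixed.tolist())
print(partial.tolist())
print(timed.tolist())
```

With `window=60`, the same logic leaves the first 59 rows without a value. Partial windows keep those rows at the cost of statistics computed from fewer points.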


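### Adding Cyclic Features
To capture time-of-day and day-of-week patterns, we extract `hour` and `day_of_week` from the index and encode each as a sine and cosine pair. This places the ends of each cycle next to each other, so hour 23 and hour 0 are treated as neighbours rather than as 23 units apart.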

In [8]:
# Add cyclic features for time-of-day (hour) and day-of-week
df['hour'] = df.index.hour
df['day_of_week'] = df.index.dayofweek

# Sin/Cos transformations for cyclic encoding
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
df['dayofweek_sin'] = np.sin(2 * np.pi * df['day_of_week'] / 7)
df['dayofweek_cos'] = np.cos(2 * np.pi * df['day_of_week'] / 7)
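
The benefit of the sine and cosine pair can be checked numerically. In the sketch below, `encode_hour` is a hypothetical helper that applies the same formula as the cell above; it compares distances between encoded hours with the gap between the raw hour values.

```python
import numpy as np

def encode_hour(hour):
    """Map an hour (0-23) to a point on the unit circle."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

# Raw hour values make 23:00 and 00:00 look far apart.
raw_gap = abs(23 - 0)

# In the encoded space, 23 -> 0 is exactly as far as 22 -> 23,
# and opposite hours (0 and 12) are the farthest apart.
dist_23_0 = np.linalg.norm(encode_hour(23) - encode_hour(0))
dist_22_23 = np.linalg.norm(encode_hour(22) - encode_hour(23))
dist_0_12 = np.linalg.norm(encode_hour(0) - encode_hour(12))

print(raw_gap, dist_23_0, dist_22_23, dist_0_12)
```

Both columns are needed: sine alone cannot distinguish, for example, hour 3 from hour 9, since they share the same sine value.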


## Step 5: Final Preprocessing and Export
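### Handling NaN Values
The lag and rolling calculations leave NaN values in the rows where not enough history is available; a 60-row rolling window, for example, has no value for the first 59 rows. These rows are dropped so that every remaining row has a complete set of features.

### Exporting the Processed Data
The preprocessed dataset is saved to a new CSV file, with the timestamp index included, for further analysis.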

In [9]:
# Drop NaN values (introduced by lags, rolling, etc.)
df = df.dropna()

# Export the preprocessed dataset
df.to_csv('processed_iot_sensor_data.csv', index=True)
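
When the exported file is read back later, the timestamp column has to be parsed and restored as the index again, since CSV does not store dtypes. The round trip below is a minimal sketch on a three-row stand-in frame written to a temporary directory; the file name `roundtrip.csv` and the values are placeholders, not part of the pipeline above.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the processed frame (invented values).
idx = pd.date_range("2024-12-01", periods=3, freq="min", name="timestamp")
out = pd.DataFrame({"temperature": [20.0, 21.0, 22.0]}, index=idx)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "roundtrip.csv")
    out.to_csv(path, index=True)
    # Parse the timestamp column and use it as the index again.
    back = pd.read_csv(path, parse_dates=["timestamp"], index_col="timestamp")

same_values = back["temperature"].tolist() == out["temperature"].tolist()
same_index = bool((back.index == out.index).all())
print(same_values, same_index)
```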
