In [2]:
import pandas as pd


In [15]:
import numpy as np

In [13]:
from scipy import stats

In [4]:
df = pd.read_csv("AnoML_IoT.csv")


In [5]:
df.head()


,Time,Temperature,Humidity,Air Quality,Light,Loudness
0,1623781306,37.94,28.94,75,644,106
1,1623781316,37.94,29.0,75,645,145
2,1623781326,37.88,28.88,75,644,146
3,1623781336,37.72,28.94,75,646,139
4,1623781346,37.69,29.19,75,644,155


In [6]:
df.shape


(6558, 6)

In [7]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6558 entries, 0 to 6557
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Time         6558 non-null   int64  
 1   Temperature  6558 non-null   float64
 2   Humidity     6558 non-null   float64
 3   Air Quality  6558 non-null   int64  
 4   Light        6558 non-null   int64  
 5   Loudness     6558 non-null   int64  
dtypes: float64(2), int64(4)
memory usage: 307.5 KB


In [8]:
df.isnull().sum()


Time           0
Temperature    0
Humidity       0
Air Quality    0
Light          0
Loudness       0
dtype: int64

In [9]:
df.nunique()


Time           6558
Temperature     635
Humidity        661
Air Quality       1
Light            50
Loudness        258
dtype: int64

In [10]:
profile = pd.DataFrame({
    "Missing_Values": df.isnull().sum(),
    "Missing_Percentage": df.isnull().mean() * 100,
    "Unique_Values": df.nunique()
})

profile


,Missing_Values,Missing_Percentage,Unique_Values
Time,0,0.0,6558
Temperature,0,0.0,635
Humidity,0,0.0,661
Air Quality,0,0.0,1
Light,0,0.0,50
Loudness,0,0.0,258


I started with basic data profiling, counting the missing values and the distinct values in each column, to surface data quality issues before applying any transformations.
The profiling shows that the dataset contains no missing values, so completeness is not a concern. However, the Air Quality column has only one unique value across all 6,558 records, so it carries no information for anomaly detection or statistical analysis.
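
The constant-column check can also be done programmatically rather than by reading the profile table. The following is a minimal sketch on a made-up toy frame (the values and the `toy` and `constant_cols` names are illustrative, not taken from AnoML_IoT.csv):

```python
import pandas as pd

# Hypothetical toy frame standing in for the sensor data (values are made up).
toy = pd.DataFrame({
    "Temperature": [37.9, 37.8, 37.7, 37.6],
    "Air Quality": [75, 75, 75, 75],
    "Light": [644, 645, 644, 646],
})

# A column with at most one distinct value has zero variance,
# so no outlier test can ever flag anything in it.
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) <= 1]
print(constant_cols)

# Drop the constant columns in one step.
reduced = toy.drop(columns=constant_cols)
print(reduced.columns.tolist())
```

On a real dataset the same list comprehension would be run on `df` before deciding what to drop.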

In [11]:
df = df.drop(columns=["Air Quality"])


In [12]:
df.columns


Index(['Time', 'Temperature', 'Humidity', 'Light', 'Loudness'], dtype='object')

Since the Air Quality column has no variation, I removed it from further analysis: it cannot contribute to anomaly detection or data quality insights.

In [16]:
numeric_cols = df.select_dtypes(include=np.number).columns
numeric_cols


Index(['Time', 'Temperature', 'Humidity', 'Light', 'Loudness'], dtype='object')

Z-Score:

In [17]:
z_scores = np.abs(stats.zscore(df[numeric_cols]))


In [18]:
df["Z_Anomaly"] = (z_scores > 3).any(axis=1)


In [19]:
df["Z_Anomaly"].value_counts()


Z_Anomaly
False    6313
True      245
Name: count, dtype: int64

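I applied the Z-Score method to detect extreme anomalies in the sensor readings. A record is flagged when the absolute Z-Score of any numeric column exceeds 3, that is, when a value lies more than three standard deviations above or below its column mean. This flagged 245 of the 6,558 records (about 3.7%).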
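
To make the rule concrete, here is a small sketch that computes the Z-Score by hand and compares it with `scipy.stats.zscore`. The readings are made up for illustration; the only assumption about SciPy is its documented default of using the population standard deviation (`ddof=0`).

```python
import numpy as np
from scipy import stats

# Made-up readings: 19 ordinary values and one spike.
readings = np.array([20.0] * 19 + [100.0])

# z = (x - mean) / std, with the population standard deviation (ddof=0),
# which is the default used by scipy.stats.zscore.
manual_z = (readings - readings.mean()) / readings.std(ddof=0)
scipy_z = stats.zscore(readings)

# Same rule as in the notebook: flag values more than 3 standard deviations out.
is_anomaly = np.abs(manual_z) > 3

print(np.allclose(manual_z, scipy_z))
print(readings[is_anomaly])
```

Only the spike is flagged here. Note that a large outlier also inflates the mean and standard deviation it is measured against, which is one reason the Z-Score rule tends to flag fewer records than the IQR rule.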

IQR:

In [20]:
Q1 = df[numeric_cols].quantile(0.25)
Q3 = df[numeric_cols].quantile(0.75)
IQR = Q3 - Q1


In [21]:
iqr_anomaly = (
    (df[numeric_cols] < (Q1 - 1.5 * IQR)) |
    (df[numeric_cols] > (Q3 + 1.5 * IQR))
)


In [22]:
df["IQR_Anomaly"] = iqr_anomaly.any(axis=1)


In [23]:
df["IQR_Anomaly"].value_counts()


IQR_Anomaly
False    5515
True     1043
Name: count, dtype: int64

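I used the Interquartile Range (IQR) method as a second, distribution-based check. A record is flagged when any numeric column falls below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR. These limits are often tighter than a Z-Score threshold of 3, so the method also catches readings that are less extreme but still outside the normal operating range: 1,043 of the 6,558 records (about 15.9%) were flagged.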
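
As a worked illustration of the IQR limits, the sketch below applies the same formula to a short made-up series (the values are not from the dataset):

```python
import pandas as pd

# Made-up values: nine ordinary readings and one unusually high one.
values = pd.Series([10, 11, 12, 13, 14, 15, 16, 17, 18, 40])

q1 = values.quantile(0.25)   # 25th percentile
q3 = values.quantile(0.75)   # 75th percentile
iqr = q3 - q1                # spread of the middle 50% of the data

# Anything outside these limits is treated as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(lower_fence, upper_fence)
print(outliers.tolist())
```

Because the quartiles are barely affected by the extreme value itself, the limits stay close to the bulk of the data.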

Handling outliers:

In [24]:
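# Cap each numeric column at its IQR limits (Q1 and IQR come from the cell above)
# instead of dropping the flagged rows.
for col in numeric_cols:
    lower_bound = Q1[col] - 1.5 * IQR[col]
    upper_bound = Q3[col] + 1.5 * IQR[col]
    df[col] = np.clip(df[col], lower_bound, upper_bound)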


In [25]:
df[numeric_cols].describe()


,Time,Temperature,Humidity,Light,Loudness
count,6558.0,6558.0,6558.0,6558.0,6558.0
mean,1623814000.0,26.663663,56.507776,630.906374,151.823651
std,18932.76,3.757682,8.782959,4.885552,21.63042
min,1623781000.0,22.19,38.5675,625.0,100.5
25%,1623798000.0,24.09,53.455,627.0,138.0
50%,1623814000.0,25.0,60.12,629.0,150.0
75%,1623830000.0,28.25,63.38,633.0,163.0
max,1623847000.0,34.49,71.81,642.0,200.5


Instead of removing anomalous records, I applied capping based on IQR limits to handle outliers. This approach preserves the overall data structure while reducing the impact of extreme sensor readings.
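
The same capping can be written without a loop using `DataFrame.clip` with per-column bounds. This is an alternative sketch on a made-up toy frame, not the code used in the notebook:

```python
import pandas as pd

# Hypothetical toy frame (values are made up) with one outlier per column.
toy = pd.DataFrame({
    "Temperature": [24.0, 25.0, 26.0, 27.0, 60.0],
    "Loudness": [140, 150, 160, 170, 20],
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr   # one lower limit per column
upper = q3 + 1.5 * iqr   # one upper limit per column

# axis=1 aligns the bounds with the column labels, so every column
# is clipped to its own limits in a single call.
capped = toy.clip(lower=lower, upper=upper, axis=1)
print(capped)
```

Every row is kept; only the two out-of-range values are moved to the nearest limit.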

Saving the Cleaned Dataset:

In [26]:
df.to_csv("cleaned_sensor_readings.csv", index=False)


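After handling anomalies and outliers, the cleaned dataset is saved as a CSV file for further analysis and reporting. The file also contains the Z_Anomaly and IQR_Anomaly flag columns, so the originally flagged records can still be identified after capping.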
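
The `index=False` argument matters when the file is read back. The sketch below uses an in-memory buffer and a made-up toy frame to show the difference; no file on disk is involved:

```python
import io
import pandas as pd

# Hypothetical toy frame (values are made up).
toy = pd.DataFrame({"Time": [1, 2], "Temperature": [37.5, 38.25]})

# With index=False only the data columns are written.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# With the default index=True the row index is written as an unnamed
# first column, which read_csv then labels "Unnamed: 0".
buffer_with_index = io.StringIO()
toy.to_csv(buffer_with_index)
buffer_with_index.seek(0)
restored_with_index = pd.read_csv(buffer_with_index)

print(restored.columns.tolist())
print(restored_with_index.columns.tolist())
```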

Two statistical methods were used to identify anomalies in the sensor data.
The Z-Score method flagged 245 records with extreme deviations, which may be caused by sudden sensor spikes or hardware issues.
The IQR method flagged a larger set of 1,043 records, which suggests it also picks up milder deviations; whether these reflect gradual sensor drift would need a time-based check to confirm.
Instead of removing these records, outlier values were capped at the IQR limits, which keeps every record while reducing the influence of extreme readings.
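
One way to see how the two methods relate is to cross-tabulate their flags. The sketch below does this on a small made-up series; on the real data the same `pd.crosstab` call could be applied to the Z_Anomaly and IQR_Anomaly columns, but those counts are not shown here.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up readings: 21 ordinary values, one mild outlier (75), one extreme (120).
values = pd.Series(list(range(40, 61)) + [75, 120], dtype=float)

# Z-Score rule: more than 3 standard deviations from the mean.
z_flag = pd.Series(np.abs(stats.zscore(values.to_numpy())) > 3, name="Z_Anomaly")

# IQR rule: outside Q1 - 1.5*IQR or Q3 + 1.5*IQR.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_flag = ((values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)).rename("IQR_Anomaly")

# Rows: Z-Score flag, columns: IQR flag.
overlap = pd.crosstab(z_flag, iqr_flag)
print(overlap)
```

In this toy example the extreme value is flagged by both rules, while the milder one is flagged only by the IQR rule, which mirrors the pattern of the IQR method flagging more records.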