In [32]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

print("Ready")


Ready


In [33]:
data = [
    ["u001", "2025-01-01 08:01:00", "192.168.1.10", 1],
    ["u001", "2025-01-01 08:03:00", "192.168.1.10", 0],
    ["u001", "2025-01-01 08:04:00", "192.168.1.10", 0],
    ["u001", "2025-01-01 18:30:00", "192.168.1.10", 1],
    ["u002", "2025-01-01 02:15:00", "10.0.0.5", 0],
    ["u002", "2025-01-01 02:16:00", "10.0.0.5", 0],
    ["u002", "2025-01-01 02:17:00", "10.0.0.5", 0],
    ["u002", "2025-01-01 09:00:00", "10.0.0.5", 1],
]

anomalous_data = [
    ["u999", "2025-01-01 01:00:00", "203.0.113.9", 0],
    ["u999", "2025-01-01 01:01:00", "203.0.113.9", 0],
    ["u999", "2025-01-01 01:02:00", "203.0.113.9", 0],
    ["u999", "2025-01-01 01:03:00", "203.0.113.9", 0],
    ["u999", "2025-01-01 01:04:00", "203.0.113.9", 0],
]

full_data = data + anomalous_data

df = pd.DataFrame(
    full_data,
    columns=["user_id", "timestamp", "ip", "success"]
)

df["timestamp"] = pd.to_datetime(df["timestamp"])
df



Unnamed: 0,user_id,timestamp,ip,success
0,u001,2025-01-01 08:01:00,192.168.1.10,1
1,u001,2025-01-01 08:03:00,192.168.1.10,0
2,u001,2025-01-01 08:04:00,192.168.1.10,0
3,u001,2025-01-01 18:30:00,192.168.1.10,1
4,u002,2025-01-01 02:15:00,10.0.0.5,0
5,u002,2025-01-01 02:16:00,10.0.0.5,0
6,u002,2025-01-01 02:17:00,10.0.0.5,0
7,u002,2025-01-01 09:00:00,10.0.0.5,1
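8,u999,2025-01-01 01:00:00,203.0.113.9,0
9,u999,2025-01-01 01:01:00,203.0.113.9,0
10,u999,2025-01-01 01:02:00,203.0.113.9,0
11,u999,2025-01-01 01:03:00,203.0.113.9,0
12,u999,2025-01-01 01:04:00,203.0.113.9,0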


## Objective

The goal of this analysis is to identify anomalous login behavior using basic statistical methods.
Rather than machine learning, this approach relies on defining normal behavior and detecting
significant deviations that may indicate suspicious activity.


## Dataset Scope

This analysis uses a small, synthetic login dataset created for educational and exploratory purposes. 
The dataset size limits the ability to observe rare or extreme anomalies, but allows controlled 
demonstration of statistical detection methods and their assumptions.


In [34]:
# Select failed attempts first so the per-user counts below use the current DataFrame
failed_logins = df[df["success"] == 0]
failed_logins


Unnamed: 0,user_id,timestamp,ip,success
1,u001,2025-01-01 08:03:00,192.168.1.10,0
2,u001,2025-01-01 08:04:00,192.168.1.10,0
4,u002,2025-01-01 02:15:00,10.0.0.5,0
5,u002,2025-01-01 02:16:00,10.0.0.5,0
6,u002,2025-01-01 02:17:00,10.0.0.5,0
8,u999,2025-01-01 01:00:00,203.0.113.9,0
9,u999,2025-01-01 01:01:00,203.0.113.9,0
10,u999,2025-01-01 01:02:00,203.0.113.9,0
11,u999,2025-01-01 01:03:00,203.0.113.9,0
12,u999,2025-01-01 01:04:00,203.0.113.9,0


In [35]:
# Count failed attempts per user
failed_counts = failed_logins.groupby("user_id").size()
failed_counts


user_id
u001    2
u002    3
u999    5
dtype: int64

In [36]:
# Series.std() returns the sample standard deviation (ddof=1) by default
mean_failures = failed_counts.mean()
std_failures = failed_counts.std()

print(f"mean={mean_failures:.4f}, std={std_failures:.4f}")


mean=3.3333, std=1.5275

## Defining Normal Behavior

Normal login behavior is defined relative to the mean failed login count across users:
a count within one standard deviation of the mean is treated as typical.

Users whose failed login count deviates from the mean by more than two standard
deviations (an absolute z-score above 2) are flagged as anomalous.
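
As a worked form of this definition, assuming the sample standard deviation (pandas `Series.std()` uses `ddof=1` by default), the z-score of user $i$ with failed login count $x_i$ can be written as:

$$
z_i = \frac{x_i - \bar{x}}{s}, \qquad
\bar{x} = \frac{1}{n}\sum_{j=1}^{n} x_j, \qquad
s = \sqrt{\frac{1}{n-1}\sum_{j=1}^{n} (x_j - \bar{x})^2}
$$

For the counts 2, 3, and 5 (the per-user failed counts once u999 is included), $\bar{x} = 10/3 \approx 3.33$ and $s = \sqrt{7/3} \approx 1.53$, so the largest z-score is $(5 - 3.33)/1.53 \approx 1.09$.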

In [None]:
z_scores = (failed_counts - mean_failures) / std_failures
z_scores


In [None]:
anomalies = z_scores[abs(z_scores) > 2]
anomalies


## Anomaly Detection Results

Users whose z-score exceeds 2 in absolute value are flagged as anomalous.
A flagged user with a positive z-score has an unusually high number of failed
login attempts compared to the rest of the population, which may indicate
brute-force or unauthorized access attempts.

In [None]:
df["hour"] = df["timestamp"].dt.hour
hourly_activity = df.groupby("hour").size()
hourly_activity


In [None]:
# Bar chart of all login attempts (successful and failed) per hour of day
fig, ax = plt.subplots(figsize=(8, 4))
hourly_activity.plot(kind="bar", ax=ax)
ax.set_title("Login Activity by Hour")
ax.set_xlabel("Hour of Day")
ax.set_ylabel("Login Count")
plt.show()


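## Time-Based Anomalies

Login attempts occurring during late-night or otherwise unusual hours may indicate
automated scripts or attackers operating outside a user's normal activity pattern.
In this dataset, every attempt from u999 and all failed attempts from u002 fall
between 01:00 and 03:00.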
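
One way to turn this observation into a feature is to mark each attempt as inside or outside an expected activity window. The sketch below is illustrative only: the 07:00 to 19:00 window and the names `WORK_START`, `WORK_END`, `events`, and `off_hours` are assumptions, not part of the original analysis.

```python
import pandas as pd

# Hypothetical expected-activity window (an assumption for illustration):
# 07:00 inclusive to 19:00 exclusive.
WORK_START, WORK_END = 7, 19

# Three sample attempts taken from the dataset above.
events = pd.DataFrame(
    {
        "user_id": ["u001", "u002", "u999"],
        "timestamp": pd.to_datetime(
            ["2025-01-01 08:01:00", "2025-01-01 02:15:00", "2025-01-01 01:00:00"]
        ),
    }
)

# Series.between is inclusive on both ends, so the last in-window hour is 18.
hour = events["timestamp"].dt.hour
events["off_hours"] = ~hour.between(WORK_START, WORK_END - 1)

# Share of each user's attempts that fall outside the window.
off_hours_share = events.groupby("user_id")["off_hours"].mean()
print(off_hours_share)
```

A per-user share like this is a contextual signal rather than a verdict; a sensible window would depend on each user's own history and time zone.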


## Limitations

- Z-score detection assumes approximately normally distributed counts, which may not hold in real login data
- IQR-based detection is more robust to outliers but is still unreliable on very small datasets


In [None]:
Q1 = failed_counts.quantile(0.25)
Q3 = failed_counts.quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

iqr_anomalies = failed_counts[
    (failed_counts < lower_bound) | (failed_counts > upper_bound)
]

Q1, Q3, IQR, iqr_anomalies


## IQR-Based Anomaly Detection

In addition to z-scores, the Interquartile Range (IQR) method was used to
identify anomalous users. IQR-based detection is more robust to outliers
and does not assume a normal distribution of login behavior.

Users whose failed login count falls below Q1 - 1.5×IQR or above Q3 + 1.5×IQR are
considered anomalous under this method.
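
The bounds can be made concrete with the three per-user failed counts observed in this notebook. This is a minimal sketch; the helper name `iqr_fences` and its parameter `k` are introduced here for illustration only.

```python
import pandas as pd


def iqr_fences(counts: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) bounds Q1 - k*IQR and Q3 + k*IQR."""
    q1 = counts.quantile(0.25)  # pandas interpolates linearly by default
    q3 = counts.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Per-user failed login counts, including the injected user u999.
failed_counts = pd.Series({"u001": 2, "u002": 3, "u999": 5})

lower, upper = iqr_fences(failed_counts)
print(lower, upper)  # 0.25 6.25

# Keep only the users outside the bounds.
flagged = failed_counts[(failed_counts < lower) | (failed_counts > upper)]
print(flagged.empty)  # True: a count of 5 is still below the upper bound of 6.25
```

With only three values, Q1 = 2.5 and Q3 = 4.0, so the upper bound of 6.25 sits above every observed count.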


## Findings

This analysis evaluated login behavior using basic statistical techniques,
including failed login frequency and time-of-day patterns.

Across the observed dataset:
- Failed login counts for the two original users were similar (2 for u001, 3 for u002)
- No users exceeded anomaly thresholds under either the z-score or the IQR method
- Late-night login activity was observed and identified as a potential
  contextual risk factor, though not statistically anomalous on its own

Under the current definitions of normal behavior, no anomalous login activity
was detected in the original dataset. After injecting a synthetic extreme case
(u999, with five consecutive failed logins), neither method flagged the injected
user: its z-score was about 1.09, below the threshold of 2, and its count of 5
stayed inside the IQR bounds.


## Future Improvements

This analysis could be enhanced by:
- Applying the same analysis to a larger real-world authentication dataset
- Incorporating rolling time windows to detect slow, low-volume attacks
- Comparing statistical baselines with unsupervised ML methods such as Isolation Forest
- Integrating contextual features such as IP reputation and device fingerprints
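
As a rough illustration of the rolling-window idea in the list above, the sketch below counts failed attempts per user within a trailing time window using pandas time-based rolling. The 5-minute window and the names `events`, `failed`, and `peak_failures` are assumptions chosen for the example, not a tuned detection rule.

```python
import pandas as pd

# Login attempts from the dataset above (success: 1 = ok, 0 = failed).
events = pd.DataFrame(
    {
        "user_id": ["u001"] * 4 + ["u002"] * 3 + ["u999"] * 5,
        "timestamp": pd.to_datetime(
            [
                "2025-01-01 08:01:00", "2025-01-01 08:03:00",
                "2025-01-01 08:04:00", "2025-01-01 18:30:00",
                "2025-01-01 02:15:00", "2025-01-01 02:16:00",
                "2025-01-01 02:17:00",
                "2025-01-01 01:00:00", "2025-01-01 01:01:00",
                "2025-01-01 01:02:00", "2025-01-01 01:03:00",
                "2025-01-01 01:04:00",
            ]
        ),
        "success": [1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    }
)
events["failed"] = 1 - events["success"]

# Failed attempts per user in the trailing 5 minutes (hypothetical window size).
rolling_failures = (
    events.sort_values("timestamp")
    .set_index("timestamp")
    .groupby("user_id")["failed"]
    .rolling("5min")
    .sum()
)

# Highest number of failures any single window saw, per user.
peak_failures = rolling_failures.groupby(level="user_id").max()
print(peak_failures)
```

A window-based count separates a burst of failures from the same number of failures spread over a day, which a per-user total cannot do.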



It is important to note that anomaly detection systems are designed to flag potential risks, not to confirm malicious intent.

In [38]:
failed_logins = df[df["success"] == 0]
failed_logins.groupby("user_id").size()


user_id
u001    2
u002    3
u999    5
dtype: int64

In [39]:
failed_counts = failed_logins.groupby("user_id").size()

mean_failures = failed_counts.mean()
std_failures = failed_counts.std()

z_scores = (failed_counts - mean_failures) / std_failures
z_scores


user_id
u001   -0.872872
u002   -0.218218
u999    1.091089
dtype: float64

In [40]:
z_scores[abs(z_scores) > 2]


Series([], dtype: float64)

In [41]:
Q1 = failed_counts.quantile(0.25)
Q3 = failed_counts.quantile(0.75)
IQR = Q3 - Q1

failed_counts[(failed_counts < Q1 - 1.5*IQR) | (failed_counts > Q3 + 1.5*IQR)]


Series([], dtype: int64)

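Although a synthetic anomalous user was injected, neither the z-score nor the IQR method flagged the behavior as anomalous.
This highlights a known limitation of statistical detection techniques on small datasets: the extreme value itself inflates the mean and the standard deviation, so it may not deviate enough from the mean to cross the threshold. With only three users, the largest possible sample z-score is (n - 1)/√n ≈ 1.15, so a threshold of 2 cannot be reached.
This experiment demonstrates the importance of dataset scale when applying statistical anomaly detection.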
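
The sample-size effect can be checked numerically. With the sample standard deviation (`ddof=1`, the pandas default), no single value in a sample of size n can have an absolute z-score above (n - 1)/√n, a bound known as Samuelson's inequality. The sketch below uses only the standard library; the helper name `max_sample_z` is introduced here for illustration.

```python
import math
import statistics


def max_sample_z(n: int) -> float:
    """Upper bound on |z| for any point in a sample of size n (ddof=1)."""
    return (n - 1) / math.sqrt(n)


# Per-user failed login counts, including the injected user u999.
counts = [2, 3, 5]
mean = statistics.mean(counts)
std = statistics.stdev(counts)  # sample standard deviation, like Series.std()
z_scores = [(c - mean) / std for c in counts]

print(round(max(z_scores), 6))     # 1.091089
print(round(max_sample_z(3), 4))   # 1.1547

# Smallest number of users for which a z-score above 2 is possible at all.
n_needed = next(n for n in range(2, 100) if max_sample_z(n) > 2)
print(n_needed)  # 6
```

Under these assumptions, a threshold of 2 only becomes reachable with at least six users, and even then only if one count is extreme relative to all the others.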