### 💻 **Cloud User Behavior and Account Activity Anomalies**

### 🎯 **Project Objective**

This project **simulates user activity and detects anomalies in user behavior**. Anomalies of interest include:
- Access to multiple accounts from a single IP address within a short period.
- Sudden spikes in storage usage or unusual login patterns.
- Potentially malicious behavior, such as unauthorized access attempts or fraud indicators, that deviates from normal usage patterns.

We'll use **Anomaly Detection** techniques to identify outliers that might signal a compromised account or malicious activities.

### 🧾 **Feature List**

In [1]:
feature_list = [
    'user_id',                    # Unique user identifier
    'login_frequency',             # Frequency of logins per day
    'avg_session_duration',        # Average session duration (minutes)
    'failed_login_attempts',       # Number of failed login attempts
    'successful_login_attempts',   # Number of successful login attempts
    'ip_address',                  # IP address from which the user logged in
    'device_type',                 # Device type (e.g., mobile, desktop)
    'account_age_days',            # Age of the account (in days)
    'total_storage_usage_GB',      # Total storage used by the user (in GB)
    'storage_increase_rate_GB_per_day', # Rate at which the user's storage is increasing (in GB/day)
    'number_of_accounts_accessed', # Number of accounts accessed within a short period
    'login_location',              # Geographical location of the login (could be encoded as a region/country)
    'login_time_of_day',           # Time of the day when login occurred (e.g., morning, afternoon, evening)
    'security_alerts',             # Number of security alerts associated with the account
    'is_account_compromised'       # Target label: 0 for normal, 1 for compromised account (for training purposes)
]

### 📦 **Step 1: Import Required Libraries**

In [2]:
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix


### 🧑‍💻 **Step 2: Generate Synthetic User Behavior Data**

In [3]:
np.random.seed(42)
num_samples = 1000

def simulate_user_data(n):
    data = pd.DataFrame({
        'user_id': np.random.randint(1, 100, n),
        'login_frequency': np.random.normal(3, 1, n),  # 3 logins per day on average
        'avg_session_duration': np.random.normal(30, 10, n),  # 30 minutes session duration
        'failed_login_attempts': np.random.poisson(0.5, n),  # Low number of failed attempts
        'successful_login_attempts': np.random.poisson(3, n),  # Normal number of successful logins
        'ip_address': np.random.choice(['192.168.1.1', '192.168.1.2', '192.168.2.1'], n),
        'device_type': np.random.choice(['mobile', 'desktop'], n),
        'account_age_days': np.random.normal(200, 50, n),  # Accounts are on average 200 days old
        'total_storage_usage_GB': np.random.normal(20, 5, n),  # Average storage usage in GB
        'storage_increase_rate_GB_per_day': np.random.normal(0.1, 0.05, n),  # Low rate of storage increase
        'number_of_accounts_accessed': np.random.poisson(1, n),  # Average of 1 account accessed
        'login_location': np.random.choice(['USA', 'Canada', 'Germany', 'India'], n),
        'login_time_of_day': np.random.choice(['morning', 'afternoon', 'evening'], n),
        'security_alerts': np.random.poisson(0.2, n),  # Low number of security alerts per account
        'is_account_compromised': np.random.choice([0, 1], n, p=[0.95, 0.05])  # ~5% labelled compromised; drawn independently of the features above
    })
    return data

user_data = simulate_user_data(num_samples)
user_data.head()

|  | user_id | login_frequency | avg_session_duration | failed_login_attempts | successful_login_attempts | ip_address | device_type | account_age_days | total_storage_usage_GB | storage_increase_rate_GB_per_day | number_of_accounts_accessed | login_location | login_time_of_day | security_alerts | is_account_compromised |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | 1.584797 | 46.061195 | 1 | 4 | 192.168.1.1 | mobile | 140.788136 | 25.024361 | 0.096535 | 2 | Canada | afternoon | 1 | 0 |
| 1 | 93 | 3.595401 | 45.624595 | 0 | 5 | 192.168.1.2 | mobile | 195.187477 | 29.734562 | 0.065992 | 3 | Canada | evening | 1 | 1 |
| 2 | 15 | 0.920872 | 20.076572 | 0 | 1 | 192.168.2.1 | desktop | 237.268615 | 12.278009 | 0.075984 | 2 | Canada | morning | 0 | 1 |
| 3 | 72 | 2.978525 | 33.858844 | 2 | 5 | 192.168.1.1 | mobile | 187.853828 | 14.794464 | 0.163346 | 0 | India | morning | 1 | 0 |
| 4 | 61 | 2.38291 | 40.675993 | 1 | 0 | 192.168.2.1 | mobile | 220.671769 | 21.067797 | 0.093969 | 2 | Germany | morning | 0 | 0 |
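
Note that the generator draws `is_account_compromised` independently of every other column, so the label carries no signal a detector could pick up. A minimal sketch of one way to tie the label to behavior follows. The helper name `inject_compromise_signal` and the shift sizes are illustrative assumptions, not part of the original notebook or calibrated to real data.

```python
import numpy as np
import pandas as pd

def inject_compromise_signal(df, seed=0):
    """Return a copy in which rows labelled compromised get shifted feature values.

    The shift sizes below are hypothetical and chosen only for illustration.
    """
    rng = np.random.default_rng(seed)
    out = df.copy()
    mask = out["is_account_compromised"] == 1
    k = int(mask.sum())
    # Assumed pattern: more failed logins, more accounts touched, faster storage growth
    out.loc[mask, "failed_login_attempts"] = rng.poisson(6, k)
    out.loc[mask, "number_of_accounts_accessed"] = rng.poisson(5, k)
    out.loc[mask, "storage_increase_rate_GB_per_day"] = rng.normal(2.0, 0.5, k)
    return out

# Tiny demo frame using the same column names as the simulated data
demo = pd.DataFrame({
    "failed_login_attempts": [0, 1, 0, 0],
    "number_of_accounts_accessed": [1, 0, 2, 1],
    "storage_increase_rate_GB_per_day": [0.10, 0.12, 0.08, 0.10],
    "is_account_compromised": [0, 1, 0, 1],
})
shifted = inject_compromise_signal(demo)
print(shifted)
```

Applied to the output of `simulate_user_data`, a step like this would give the later evaluation something to measure; without it, the labels are effectively random noise.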


### 🧹 **Step 3: Data Preprocessing & Feature Engineering**

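- **Encode categorical features** (`ip_address`, `device_type`, `login_location`, `login_time_of_day`) as integer codes.
- **Standardize** every model input (e.g., `login_frequency`, `avg_session_duration`, `storage_increase_rate_GB_per_day`, and the encoded categorical columns) to zero mean and unit variance with `StandardScaler`. The identifier `user_id` and the label `is_account_compromised` are excluded.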

In [4]:
# Encoding categorical features using label encoding
user_data['ip_address'] = user_data['ip_address'].astype('category').cat.codes
user_data['device_type'] = user_data['device_type'].astype('category').cat.codes
user_data['login_location'] = user_data['login_location'].astype('category').cat.codes
user_data['login_time_of_day'] = user_data['login_time_of_day'].astype('category').cat.codes

# Standardize numerical features
scaler = StandardScaler()
scaled_data = scaler.fit_transform(user_data.drop(columns=['user_id', 'is_account_compromised']))

# Creating a new dataframe with the scaled features
user_data_scaled = pd.DataFrame(scaled_data, columns=user_data.columns.drop(['user_id', 'is_account_compromised']))
user_data_scaled['is_account_compromised'] = user_data['is_account_compromised']
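
Integer codes give the categories an arbitrary order, and standardizing them treats that order as a numeric scale. One alternative, sketched here under the assumption that one-hot encoding suits the data, is scikit-learn's `ColumnTransformer` with `OneHotEncoder`. The column subsets and the tiny frames are illustrative only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["login_frequency", "avg_session_duration"]
categorical_cols = ["device_type", "login_location"]

# sparse_threshold=0 forces a dense array so the result is easy to inspect
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,
)

train = pd.DataFrame({
    "login_frequency": [3.0, 2.5, 4.0, 3.5],
    "avg_session_duration": [30.0, 25.0, 40.0, 35.0],
    "device_type": ["mobile", "desktop", "mobile", "desktop"],
    "login_location": ["USA", "Canada", "Germany", "USA"],
})
new = pd.DataFrame({
    "login_frequency": [3.2],
    "avg_session_duration": [28.0],
    "device_type": ["mobile"],
    "login_location": ["Brazil"],  # category never seen during fit
})

X_train = preprocess.fit_transform(train)  # learn scaling and categories once
X_new = preprocess.transform(new)          # reuse them on later data
print(X_train.shape, X_new.shape)
```

Because the transformer is fitted once and reused, later batches are encoded with the same columns as the training data, and an unseen category becomes an all-zero block instead of a shifted code.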

### 🤖 **Step 4: Train the Model Using Isolation Forest**

In [5]:
# Train the model using Isolation Forest
model = IsolationForest(n_estimators=100, contamination=0.05, random_state=42)
model.fit(user_data_scaled.drop(columns='is_account_compromised'))

# Predict anomalies (IsolationForest returns 1 for inliers/normal, -1 for outliers/anomalies)
user_data_scaled['predicted_anomaly'] = model.predict(user_data_scaled.drop(columns='is_account_compromised'))

# Map anomalies (1 -> normal, -1 -> anomaly)
user_data_scaled['predicted_anomaly'] = user_data_scaled['predicted_anomaly'].map({1: 0, -1: 1})
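
Besides hard labels, `IsolationForest` exposes a continuous score through `decision_function`, which can be used to rank rows by how anomalous they look. A minimal sketch on made-up data (the array `X` and the injected shift are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(500, 3))
X[:10] += 6  # hypothetical outliers shifted away from the bulk

forest = IsolationForest(n_estimators=100, contamination=0.05, random_state=42).fit(X)

scores = forest.decision_function(X)  # below 0 means past the contamination threshold
labels = forest.predict(X)            # -1 where the score is negative, otherwise 1
most_anomalous = np.argsort(scores)[:10]  # row indices with the lowest scores

print("flagged:", int((labels == -1).sum()))
print("lowest-scoring rows:", most_anomalous)
```

The `contamination` parameter only sets where the score is cut, so roughly that fraction of the training rows is flagged regardless of whether they are truly anomalous.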

### 📈 **Step 5: Evaluate Model Performance**

In [6]:
# Classification report and confusion matrix
print(classification_report(user_data_scaled['is_account_compromised'], user_data_scaled['predicted_anomaly']))
print(confusion_matrix(user_data_scaled['is_account_compromised'], user_data_scaled['predicted_anomaly']))

              precision    recall  f1-score   support

           0       0.95      0.95      0.95       950
           1       0.04      0.04      0.04        50

    accuracy                           0.90      1000
   macro avg       0.49      0.49      0.49      1000
weighted avg       0.90      0.90      0.90      1000

[[902  48]
 [ 48   2]]
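
The report can be checked by hand from the confusion matrix, whose rows are true labels and whose columns are predictions. The short sketch below redoes the arithmetic and compares precision with the base rate of compromised accounts:

```python
import numpy as np

# Confusion matrix printed above: [[TN, FP], [FN, TP]]
cm = np.array([[902, 48],
               [48, 2]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)        # 2 / 50
recall = tp / (tp + fn)           # 2 / 50
accuracy = (tp + tn) / cm.sum()   # 904 / 1000
base_rate = (tp + fn) / cm.sum()  # 50 / 1000

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"accuracy={accuracy:.3f} base_rate={base_rate:.3f}")
```

Precision (0.04) sits at roughly the base rate (0.05), which is what flagging rows at random would achieve. That is expected here, because the simulated labels are drawn independently of the features; the 0.90 accuracy mostly reflects the 95% share of normal accounts.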


### 🧪 **Step 6: Test the Model with Synthetic Test Data**

In [7]:
# Generate a fresh synthetic test set (same distribution as the training data; no anomalies are injected)
test_data = simulate_user_data(200)

# Encode categorical features as in training (the codes match only if every category appears in both sets)
test_data_encoded = test_data.copy()
test_data_encoded['ip_address'] = test_data_encoded['ip_address'].astype('category').cat.codes
test_data_encoded['device_type'] = test_data_encoded['device_type'].astype('category').cat.codes
test_data_encoded['login_location'] = test_data_encoded['login_location'].astype('category').cat.codes
test_data_encoded['login_time_of_day'] = test_data_encoded['login_time_of_day'].astype('category').cat.codes

# Check the data types after encoding
print(test_data_encoded.dtypes)

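# Scale the test features with the scaler fitted on the training data,
# keeping the column names so they match the names the model was fitted with
feature_cols = test_data_encoded.columns.drop(['user_id', 'is_account_compromised'])
test_data_scaled = pd.DataFrame(
    scaler.transform(test_data_encoded[feature_cols]),
    columns=feature_cols,
    index=test_data_encoded.index,
)

# Predict anomalies on the test data
test_data_encoded['predicted_anomaly'] = model.predict(test_data_scaled)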
test_data_encoded['predicted_anomaly'] = test_data_encoded['predicted_anomaly'].map({1: 0, -1: 1})  # Convert to 0 (normal) / 1 (anomaly)

# Extract anomalies for inspection
anomalies = test_data_encoded[test_data_encoded['predicted_anomaly'] == 1]

# Display the detected anomalies
anomalies.head()


user_id                               int32
login_frequency                     float64
avg_session_duration                float64
failed_login_attempts                 int32
successful_login_attempts             int32
ip_address                             int8
device_type                            int8
account_age_days                    float64
total_storage_usage_GB              float64
storage_increase_rate_GB_per_day    float64
number_of_accounts_accessed           int32
login_location                         int8
login_time_of_day                      int8
security_alerts                       int32
is_account_compromised                int32
dtype: object




|  | user_id | login_frequency | avg_session_duration | failed_login_attempts | successful_login_attempts | ip_address | device_type | account_age_days | total_storage_usage_GB | storage_increase_rate_GB_per_day | number_of_accounts_accessed | login_location | login_time_of_day | security_alerts | is_account_compromised | predicted_anomaly |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19 | 86 | 4.299631 | 18.705144 | 1 | 6 | 2 | 1 | 235.304707 | 21.264472 | 0.151659 | 3 | 0 | 1 | 1 | 0 | 1 |
| 20 | 34 | 2.60067 | 33.131234 | 3 | 1 | 0 | 0 | 321.487801 | 21.082847 | 0.040445 | 4 | 2 | 2 | 0 | 0 | 1 |
| 26 | 91 | 4.197423 | 14.51417 | 2 | 4 | 1 | 0 | 176.12276 | 27.216418 | 0.055897 | 1 | 0 | 2 | 1 | 0 | 1 |
| 37 | 61 | 2.916668 | 28.797281 | 0 | 2 | 2 | 1 | 176.496314 | 19.187945 | 0.112309 | 4 | 1 | 0 | 2 | 0 | 1 |
| 47 | 63 | 3.452763 | 30.19259 | 0 | 6 | 0 | 1 | 213.329167 | 13.127849 | 0.079764 | 4 | 3 | 0 | 3 | 0 | 1 |
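
Calling `.astype('category').cat.codes` separately on the training and test frames gives matching codes only when both contain the same set of values. A small sketch of one way to pin the mapping, assuming the four locations used by the simulator form the full vocabulary (the constant `LOCATIONS` and helper `encode_location` are hypothetical names):

```python
import pandas as pd

# Fixed vocabulary, assumed to match the values used by the simulator
LOCATIONS = ["Canada", "Germany", "India", "USA"]

def encode_location(series):
    """Map locations to stable integer codes; values outside LOCATIONS become -1."""
    return pd.Categorical(series, categories=LOCATIONS).codes

batch = pd.Series(["USA", "India"])  # a batch that lacks Canada and Germany

stable_codes = encode_location(batch)               # uses the fixed vocabulary
naive_codes = batch.astype("category").cat.codes    # re-derives categories per batch

print(list(stable_codes), list(naive_codes))
```

With the fixed vocabulary, `USA` keeps the same code in every batch, whereas the per-batch encoding renumbers it whenever some category is missing.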


### 📤 **Step 7: Export Anomalies to CSV**

In [8]:
anomalies.to_csv("user_behavior_anomalies.csv", index=False)
print(f"Exported {len(anomalies)} anomalies to 'user_behavior_anomalies.csv'")


Exported 12 anomalies to 'user_behavior_anomalies.csv'
