
# Task: Create a synthetic (artificially generated) dataset that simulates 1,500 authentication requests from 50 different computers on campus over a 24-hour period.

**Dataset Structure:**

The dataset has **5 columns** (features) that **hold information** about each authentication request:

*   **Username:** A unique identifier for each user: a five-character string made of one letter followed by a four-digit birth year (for example, `D1986`).
*   **Computer ID:** A unique identifier for each computer, consisting of two letters followed by two digits.
*   **Connection Time:** The specific time of each request within the 24-hour period (from midnight to just before midnight the next day).
*   **IP Address:** The IP address from which the request originated.
*   **Label:** The target column, described below.

**Labels (Target Column):**

There is one target column with two possible labels: A (Accept) and D (Deny). These indicate whether the authentication request was approved or denied.
The labels must be **assigned randomly based on a normal distribution**, so that "A" and "D" appear in a varied but roughly balanced way across the dataset.

**Dataset Constraints:**

*   The system has 200 unique users and 50 unique computers.
*   A total of 1,500 requests is generated over the 24-hour period.

**In Summary:**
The task is to use different methods for generating sample values (like random distributions) to simulate a realistic set of authentication requests:
random usernames, computer IDs, connection times, and IP addresses.
Randomly assign "Accept" or "Deny" labels to these requests based on a normal distribution.

**Step 1:** Set Up Environment and Libraries

In [4]:
import pandas as pd
import numpy as np
from scipy.stats import poisson, binom, norm
from datetime import timedelta, datetime
import random

**Step 2:** Generate Features Based on Distributions

**1. Username** - Uniform Distribution

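We will generate a pool of 200 unique usernames, each a random letter (A-Z) followed by a birth year (range: 1980-2005), and then assign one username from this pool to each of the 1,500 requests uniformly at random.

If a later analysis needs it, usernames can be normalised for consistency (for example, by splitting the letter and the year into separate features before standardising the year); no such transformation is applied in this notebook.

In [5]:
# Setting parameters
num_users = 200
num_requests = 1500

# Build one username: a random uppercase letter followed by a birth year
def generate_username():
    letter = chr(random.randint(65, 90))  # Random uppercase letter (A-Z)
    year = random.randint(1980, 2005)  # Random birth year, inclusive
    return f"{letter}{year}"

# Build a pool of exactly 200 unique usernames (676 combinations are possible)
unique_usernames = set()
while len(unique_usernames) < num_users:
    unique_usernames.add(generate_username())
unique_usernames = sorted(unique_usernames)

# Draw one username per request, uniformly at random from the pool
usernames = [random.choice(unique_usernames) for _ in range(num_requests)]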

**2. Computer ID** - Binomial Distribution

Computer IDs are discrete values (two letters followed by two digits) drawn from 50 unique options. Each request picks one of the 50 computers uniformly at random, so the number of requests that a given computer receives follows a binomial distribution with n = 1,500 and p = 1/50 (30 requests per computer on average).

We can also use feature encoding if needed, such as One-Hot Encoding, especially if treating Computer ID as a categorical variable in model training.

We will generate computer IDs from a set of 50 unique IDs, with each ID comprising two letters and two digits:

In [6]:
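# Generate a set of 50 unique computer IDs (two letters + two digits); the set discards rare duplicates
unique_computers = set()
while len(unique_computers) < 50:
    unique_computers.add(f"{chr(random.randint(65, 90))}{chr(random.randint(65, 90))}{random.randint(0, 9)}{random.randint(0, 9)}")
unique_computers = sorted(unique_computers)
computers = [random.choice(unique_computers) for _ in range(num_requests)]  # Uniform choice per request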
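
The one-hot encoding mentioned above could look like the following minimal sketch. It applies `pandas.get_dummies` to a small hand-written sample; the sample IDs and the `CID` column prefix are assumptions made for illustration, not part of the generated dataset.

```python
import pandas as pd

# Hypothetical sample of computer IDs (illustrative values only)
sample = pd.DataFrame({"Computer ID": ["WH78", "ZG92", "WH78", "JQ03"]})

# One indicator column per distinct ID; each row has exactly one active column
encoded = pd.get_dummies(sample["Computer ID"], prefix="CID")

print(encoded.columns.tolist())
```

With 50 distinct computers, the same call would produce 50 indicator columns, which is worth keeping in mind before feeding the result to a model.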

**3. Connection Time** - Exponential Distribution

An exponential distribution is used to simulate connection times. With a mean of one tenth of a day (2.4 hours), most requests cluster in the hours just after midnight and thin out over the rest of the day. To model busier login times in the morning or afternoon instead, the draws could be offset from a chosen peak hour.

To handle potential outliers, apply a binning technique (e.g., Equal-Width Binning) to group times into intervals (e.g., every 15 minutes) for further analysis.

We will generate times within a 24-hour period, clustering shortly after midnight:

In [7]:
# Generate connection times using exponential distribution
def generate_connection_time():
    seconds_in_day = 24 * 60 * 60
    random_time = int(np.random.exponential(scale=seconds_in_day / 10)) % seconds_in_day  # Mean of 2.4 hours; the modulo wraps rare draws beyond 24 hours
    return (datetime.min + timedelta(seconds=random_time)).time()

connection_times = [generate_connection_time() for _ in range(num_requests)]
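
The equal-width binning described above can be sketched as follows. This is one possible approach using only the standard library; the helper name `bin_time` and the sample times are assumptions made for illustration.

```python
from collections import Counter
from datetime import time

BIN_MINUTES = 15  # Width of each bin, as suggested in the text


def bin_time(t, width=BIN_MINUTES):
    """Return the start of the equal-width bin that contains time t."""
    minutes = t.hour * 60 + t.minute
    start = (minutes // width) * width  # Round down to the bin boundary
    return time(start // 60, start % 60)


# Hypothetical connection times (illustrative values only)
sample_times = [time(0, 6, 37), time(1, 36, 16), time(1, 42, 35), time(15, 1, 41)]

# Count how many requests fall into each 15-minute interval
bin_counts = Counter(bin_time(t) for t in sample_times)
for start, count in sorted(bin_counts.items()):
    print(start, count)
```

The same helper could be applied to the full list with `Counter(bin_time(t) for t in connection_times)`.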

**4. IP Address** - Poisson Distribution

Each octet of an IP address is drawn from a Poisson distribution, so the generated addresses cluster in a narrow range around the rate parameter. This loosely mimics subnet allocation, which is common in closed environments like campuses.

For pattern consistency, we can normalise the IP range or aggregate similar IPs to analyse subnet clusters or hotspot IPs, using data transformation methods like standardisation.

We will generate IP addresses as four octet values, each drawn with a Poisson rate parameter of λ=50. At this rate, the octets stay inside the valid 0-255 range in practice:

In [8]:
# Generate IP addresses
def generate_ip_address():
    return ".".join(str(poisson.rvs(50)) for _ in range(4))  # λ=50 for moderate clustering

ip_addresses = [generate_ip_address() for _ in range(num_requests)]
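
Aggregating similar IPs into subnet clusters, as suggested above, might be sketched like this. The `/24` prefix length, the helper name `to_subnet`, and the sample addresses are assumptions made for illustration.

```python
import ipaddress
from collections import Counter

# Hypothetical addresses (illustrative values only)
sample_ips = ["53.48.54.51", "53.48.54.7", "43.53.57.49", "50.59.41.58"]


def to_subnet(ip, prefix=24):
    """Map an IPv4 address to the network that contains it."""
    # strict=False lets ip_network accept an address with host bits set
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))


# Count how many addresses fall into each subnet
subnet_counts = Counter(to_subnet(ip) for ip in sample_ips)
print(subnet_counts.most_common(1))
```

A shorter prefix such as `/16` would group addresses more coarsely and may suit data where the last two octets vary widely.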

**5. Target Label (A/D)** - Normal Distribution

Each label is derived from a draw from a normal distribution with a mean of 0.5: values above 0.5 become "Accept" and values at or below 0.5 become "Deny". Because the threshold equals the mean, each label has a probability of 0.5 regardless of the standard deviation (0.1 here), so the two labels come out approximately balanced.

To confirm the statistical balance of the label assignment, a chi-square goodness-of-fit test on the label frequencies can check whether the observed counts are consistent with a 50/50 split.

We will randomly assign "A" or "D" labels based on a normal distribution:

In [9]:
# Generate target labels using normal distribution
def generate_label():
    return "A" if norm.rvs(loc=0.5, scale=0.1) > 0.5 else "D"

labels = [generate_label() for _ in range(num_requests)]
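
The chi-square check mentioned above could be run roughly as follows. This sketch uses `scipy.stats.chisquare` on a hand-written label list; the counts of 760 and 740 are made-up values for illustration, not results from this notebook.

```python
from collections import Counter
from scipy.stats import chisquare

# Hypothetical label list (illustrative values only): 760 accepts, 740 denies
sample_labels = ["A"] * 760 + ["D"] * 740

counts = Counter(sample_labels)
observed = [counts["A"], counts["D"]]

# With no expected frequencies given, chisquare assumes equal counts per category
statistic, p_value = chisquare(observed)

# A large p-value means the counts are consistent with a 50/50 split
print(f"chi2={statistic:.3f}, p={p_value:.3f}")
```

Replacing `sample_labels` with the generated `labels` list applies the same test to the real data. Note that this tests balance only; it says nothing about the order in which the labels appear.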

**Step 3:** Creating a DataFrame

In [10]:
# Combine all features into a DataFrame
df = pd.DataFrame({
    'Username': usernames,
    'Computer ID': computers,
    'Connection Time': connection_times,
    'IP Address': ip_addresses,
    'Label': labels
})

# Display a sample of the dataset
print(df.head())

  Username Computer ID Connection Time   IP Address Label
0    D1986        WH78        15:01:41  53.48.54.51     A
1    S1985        ZG92        00:06:37  43.53.57.49     A
2    L2004        JQ03        01:36:16  57.48.45.45     A
3    K1989        OU14        01:42:35  50.59.41.58     A
4    D2005        AE81        05:42:29  49.49.46.52     A


**Step 4:** Export Dataset as a *.csv* file

In [11]:
# Save to CSV
df.to_csv('synthetic_authentication_data.csv', index=False)
print("Dataset saved as synthetic_authentication_data.csv")

Dataset saved as synthetic_authentication_data.csv
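
A quick sanity check of the data against the task constraints might look like the sketch below. The helper `check_constraints` is a hypothetical name, and the three-row frame stands in for the full dataset.

```python
import pandas as pd


def check_constraints(frame, n_requests, max_users, max_computers):
    """Return a dict of pass/fail flags for the task's size constraints."""
    return {
        "requests": len(frame) == n_requests,
        "users": frame["Username"].nunique() <= max_users,
        "computers": frame["Computer ID"].nunique() <= max_computers,
        "labels": set(frame["Label"]) <= {"A", "D"},
    }


# Hypothetical miniature dataset (illustrative values only)
tiny = pd.DataFrame({
    "Username": ["D1986", "S1985", "D1986"],
    "Computer ID": ["WH78", "ZG92", "WH78"],
    "Label": ["A", "D", "A"],
})

result = check_constraints(tiny, n_requests=3, max_users=2, max_computers=2)
print(result)
```

For the full dataset, the call would be `check_constraints(df, 1500, 200, 50)`; any flag that comes back `False` points at a constraint the generation step did not meet.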
