### Cybersecurity Regression Datasets

This repository provides two synthetic datasets designed for regression modeling tasks in the cybersecurity domain. Each dataset includes **1,000 samples** with clearly defined data-generating processes and added Gaussian noise.

---

### Dataset B: Three Features

**File:** `cyber_risk_3features.csv`

- **Features:**
  - `failed_login_attempts` (integer): Same distribution as in Dataset A (Poisson λ = 2, truncated at [0, 20]).  
  - `password_age_days` (float): Days since the last password change, sampled from Uniform(0, 365).  
  - `phishing_click_rate` (float): Simulated proportion of phishing links clicked, drawn from a Beta(2, 8) distribution.

- **Target (risk_score):**  
  Defined by the following regression model with an interaction term:

  \[
  \text{risk\_score} = 10 + 2.5x_1 + 0.02x_2 + 20x_3 + 0.5(x_1 \cdot x_3) + \varepsilon,
  \quad \varepsilon \sim \mathcal{N}(0, 3^2)
  \]

  where \(x_1 =\) failed login attempts, \(x_2 =\) password age in days, and \(x_3 =\) phishing click rate.  
  Values are clipped to [0, 100].
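
The generating process described above can be sketched in NumPy as follows. This is a minimal illustration, not the script that produced `cyber_risk_3features.csv`: the function name `make_cyber_risk_3features`, the seed, and the use of clipping to implement the "[0, 20]" truncation are all assumptions.

```python
import numpy as np


def make_cyber_risk_3features(n_samples=1000, seed=0):
    """Hypothetical re-creation of Dataset B (seed and truncation method are assumptions)."""
    rng = np.random.default_rng(seed)

    # x1: failed login attempts, Poisson(lambda = 2); truncation approximated by clipping
    x1 = np.clip(rng.poisson(lam=2.0, size=n_samples), 0, 20)
    # x2: password age in days, Uniform(0, 365)
    x2 = rng.uniform(0.0, 365.0, size=n_samples)
    # x3: phishing click rate, Beta(2, 8)
    x3 = rng.beta(2.0, 8.0, size=n_samples)

    # Gaussian noise with standard deviation 3
    eps = rng.normal(0.0, 3.0, size=n_samples)

    # Regression model with the x1 * x3 interaction term, clipped to [0, 100]
    risk = 10 + 2.5 * x1 + 0.02 * x2 + 20 * x3 + 0.5 * (x1 * x3) + eps
    risk = np.clip(risk, 0.0, 100.0)

    return np.column_stack([x1, x2, x3]), risk


X, y = make_cyber_risk_3features()
print(X.shape, y.shape)
```

Note that the model includes the interaction \(x_1 \cdot x_3\), so a regression on the three raw features alone is slightly misspecified and its coefficients need not match 2.5, 0.02, and 20 exactly.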

---


## Linear Regression without sklearn (Normal Equation)

In [1]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np

# Load the data
path = "./data/cyber_risk_3features.csv"
df = pd.read_csv(path)

# Select the three features
feature_cols = ["failed_login_attempts", "password_age_days", "phishing_click_rate"]
target_col = "risk_score"

# Separate features and target
X = df[feature_cols].values
y = df[target_col].values

# 1. Train-test split (e.g., 80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [2]:
import numpy as np

In [3]:
# Build the design matrix (prepend an intercept column)
ones = np.ones((X_train.shape[0], 1))
A = np.hstack([ones, X_train])
b = y_train.reshape(-1, 1)

# Normal equation: x = (A^T A)^(-1) A^T b
ATA = A.T @ A
P = np.linalg.inv(ATA) @ A.T
x = P @ b

print("Normal equation coefficients (β):")
print(f"  intercept: {x[0,0]}")
for name, coef in zip(feature_cols, x[1:,0]):
    print(f"  {name}: {coef}")

Normal equation coefficients (β):
  intercept: 9.478211450864045
  failed_login_attempts: 2.7224997006270457
  password_age_days: 0.02133729769244556
  phishing_click_rate: 20.21359815073925
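
Forming the explicit inverse works here, but solving the linear system directly is usually the more numerically robust route. The sketch below compares three ways of getting the same coefficients and evaluates the fit on held-out rows. It runs on synthetic stand-in data (an assumption, so that it does not depend on the CSV being present) with main effects only, and it splits by position rather than with `train_test_split`.

```python
import numpy as np

# Synthetic stand-in for the CSV (assumption: main effects only, no interaction term)
rng = np.random.default_rng(42)
n = 1000
X = np.column_stack([
    rng.poisson(2.0, n),     # failed_login_attempts
    rng.uniform(0, 365, n),  # password_age_days
    rng.beta(2, 8, n),       # phishing_click_rate
])
y = 10 + 2.5 * X[:, 0] + 0.02 * X[:, 1] + 20 * X[:, 2] + rng.normal(0, 3, n)

# 80/20 split by position (rows are independent draws, so no shuffling is needed here)
X_train, X_test = X[:800], X[800:]
y_train, y_test = y[:800], y[800:]

# Design matrix with an intercept column
A = np.hstack([np.ones((X_train.shape[0], 1)), X_train])

# (1) explicit inverse, as in the cell above
beta_inv = np.linalg.inv(A.T @ A) @ A.T @ y_train
# (2) solve the normal equations without forming the inverse
beta_solve = np.linalg.solve(A.T @ A, A.T @ y_train)
# (3) least squares directly on A (SVD-based)
beta_lstsq, _, rank, _ = np.linalg.lstsq(A, y_train, rcond=None)

# Mean squared error on the held-out rows
A_test = np.hstack([np.ones((X_test.shape[0], 1)), X_test])
test_mse = np.mean((A_test @ beta_lstsq - y_test) ** 2)

print("rank of A:", rank)
print("max |inv - lstsq|:", np.max(np.abs(beta_inv - beta_lstsq)))
print("max |solve - lstsq|:", np.max(np.abs(beta_solve - beta_lstsq)))
print("test MSE:", test_mse)
```

When the design matrix has full column rank, all three approaches should agree to within floating-point error, and the test MSE should land near the noise variance (9 in this setup).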


In [4]:
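M = np.linalg.inv(ATA) @ ATA

# Display as a DataFrame for readability; the result should be close to the identity matrix
M_df = pd.DataFrame(M)  # separate name so the dataset in `df` is not overwritten
print(M_df.round(3))  # round to 3 decimal places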

     0    1    2    3
0  1.0  0.0  0.0  0.0
1 -0.0  1.0 -0.0 -0.0
2 -0.0  0.0  1.0 -0.0
3 -0.0  0.0  0.0  1.0
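
The same sanity check can be done programmatically rather than by eye. The sketch below uses `np.allclose` against the identity and reports the condition number of `ATA`. The random, well-conditioned design matrix is an assumption for illustration and is unrelated to the dataset.

```python
import numpy as np

# Hypothetical well-conditioned design matrix: intercept plus three standard-normal features
rng = np.random.default_rng(0)
A = np.hstack([np.ones((800, 1)), rng.normal(size=(800, 3))])
ATA = A.T @ A

# inv(ATA) @ ATA should reproduce the identity up to floating-point error
M = np.linalg.inv(ATA) @ ATA
is_identity = np.allclose(M, np.eye(ATA.shape[0]))

# A small condition number means the inverse can be trusted
cond = np.linalg.cond(ATA)

print("inv(ATA) @ ATA is identity:", is_identity)
print("condition number of ATA:", cond)
```

A large condition number is a warning that the inverse, and therefore the normal-equation coefficients, may be unreliable.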


## Linear Regression with Multicollinearity

In [5]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np

# Load the data
path = "./data/cyber_risk_3features.csv"
df = pd.read_csv(path)

# Select the three features
feature_cols = ["failed_login_attempts", "password_age_days", "phishing_click_rate"]
target_col = "risk_score"

In [6]:
# Separate the features (copy so that adding a column later does not write to a view of df)
X = df[feature_cols].copy()
X.head()

   failed_login_attempts  password_age_days  phishing_click_rate
0                      0         355.787157             0.211141
1                      1          88.139501             0.245941
2                      3         263.920947             0.348929
3                      1         209.474702             0.200946
4                      2         254.357568             0.137828


In [7]:
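# Add a column that is an exact linear combination of two existing features (perfect multicollinearity)
X['fake_col'] = X['failed_login_attempts'] + X['phishing_click_rate']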
feature_cols += ['fake_col']
X.head()

   failed_login_attempts  password_age_days  phishing_click_rate  fake_col
0                      0         355.787157             0.211141  0.211141
1                      1          88.139501             0.245941  1.245941
2                      3         263.920947             0.348929  3.348929
3                      1         209.474702             0.200946  1.200946
4                      2         254.357568             0.137828  2.137828


In [8]:
X = X.values
y = df[target_col].values

# 1. Train-test split (e.g., 80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [9]:
# Build the design matrix (prepend an intercept column)
ones = np.ones((X_train.shape[0], 1))
A = np.hstack([ones, X_train])
b = y_train.reshape(-1, 1)

# Normal Equation: (A^T A)^(-1) A^T b
ATA = A.T @ A
P = np.linalg.inv(ATA) @ A.T
x = P @ b

print("\n=== Linear Regression (Normal Equation) Results ===")
print(f"  intercept: {x[0,0]}")
for name, coef in zip(feature_cols, x[1:,0]):
    print(f"  {name}: {coef}")


=== Linear Regression (Normal Equation) Results ===
  intercept: 9.924000642846305
  failed_login_attempts: 0.9655003687606164
  password_age_days: 0.021298301273648297
  phishing_click_rate: 16.205202531361472
  fake_col: 2.701585503370397


In [10]:
A.shape

(800, 5)

In [13]:
np.linalg.det(ATA)  # Misleading: ATA is singular (exact determinant is 0); the nonzero value is floating-point error.

np.float64(470.5473364724547)

In [15]:
np.linalg.matrix_rank(ATA)

np.int64(4)
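
The determinant is a poor singularity test here. It is the product of the eigenvalues of `ATA`, so several very large eigenvalues multiplied by one roundoff-sized eigenvalue can give an arbitrary finite number. Rank, singular values, and the condition number are more informative. The sketch below shows these diagnostics on synthetic stand-in data (an assumption, built to mimic the three features plus the collinear `fake_col`).

```python
import numpy as np

# Synthetic stand-in features (assumption: same distributions as documented for Dataset B)
rng = np.random.default_rng(0)
n = 800
x1 = rng.poisson(2.0, n).astype(float)  # failed_login_attempts
x2 = rng.uniform(0, 365, n)             # password_age_days
x3 = rng.beta(2, 8, n)                  # phishing_click_rate
fake = x1 + x3                          # exact linear combination, like fake_col

# Design matrix: intercept, three features, and the redundant column
A = np.column_stack([np.ones(n), x1, x2, x3, fake])

# Numerical rank from the SVD (5 columns, but only 4 are linearly independent)
rank = np.linalg.matrix_rank(A)

# Singular values in descending order; the last one is at roundoff level
sing_vals = np.linalg.svd(A, compute_uv=False)

# Condition number: ratio of largest to smallest singular value
cond = np.linalg.cond(A)

print("columns:", A.shape[1], "rank:", rank)
print("singular values:", sing_vals)
print("condition number:", cond)
print("det(A^T A):", np.linalg.det(A.T @ A))  # not a reliable indicator of singularity
```

Checking the rank of `A` rather than of `ATA` is slightly preferable, because forming `ATA` squares the condition number.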

In [12]:
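M = np.linalg.inv(ATA) @ ATA

# Display as a DataFrame for readability; for a singular ATA this is no longer the identity
M_df = pd.DataFrame(M)  # separate name so the dataset in `df` is not overwritten
print(M_df.round(3))  # round to 3 decimal places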

       0      1       2      3      4
0  1.017  0.058   3.204  0.003  0.061
1  0.062  1.000   0.000  0.016  0.000
2 -0.000 -0.000   1.000 -0.000 -0.000
3 -0.062  0.000 -16.000  0.984  0.000
4 -0.062  0.000   0.000  0.016  0.750
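
Because `inv(ATA) @ ATA` is far from the identity, the coefficients obtained from the explicit inverse should not be trusted. One common alternative is an SVD-based least-squares solver such as `np.linalg.lstsq`, which returns the minimum-norm solution for a rank-deficient design matrix. The sketch below uses synthetic stand-in data (an assumption). It illustrates that the individual coefficients of collinear columns are not identifiable, while the fitted values and certain coefficient sums are.

```python
import numpy as np

# Synthetic stand-in data (assumption: main effects only, no interaction term)
rng = np.random.default_rng(0)
n = 800
x1 = rng.poisson(2.0, n).astype(float)
x2 = rng.uniform(0, 365, n)
x3 = rng.beta(2, 8, n)
y = 10 + 2.5 * x1 + 0.02 * x2 + 20 * x3 + rng.normal(0, 3, n)

# Full-rank design (4 columns) and collinear design (extra column x1 + x3)
A_full = np.column_stack([np.ones(n), x1, x2, x3])
A_coll = np.column_stack([A_full, x1 + x3])

beta_full, _, rank_full, _ = np.linalg.lstsq(A_full, y, rcond=None)
beta_coll, _, rank_coll, _ = np.linalg.lstsq(A_coll, y, rcond=None)

# Fitted values agree even though the coefficient vectors differ
same_fit = np.allclose(A_full @ beta_full, A_coll @ beta_coll)

# Since fake = x1 + x3, only these sums are determined by the data
sum_x1 = beta_coll[1] + beta_coll[4]  # should match beta_full[1]
sum_x3 = beta_coll[3] + beta_coll[4]  # should match beta_full[3]

print("ranks:", rank_full, rank_coll)
print("same fitted values:", same_fit)
print("x1 coefficient:", beta_full[1], "vs sum:", sum_x1)
print("x3 coefficient:", beta_full[3], "vs sum:", sum_x3)
```

The same identity gives a quick check on the inverse-based output shown earlier. There, `failed_login_attempts` + `fake_col` is about 0.9655 + 2.7016 = 3.6671, while the three-feature fit gave 2.7225 on the same split. This suggests that the result computed with `np.linalg.inv` is not an accurate least-squares solution.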
