
**GAP 1: Inadequate Cookie Analysis**

**Problem:** The research paper notes that stateful tracking (for example, HTTP cookies) is common, but conventional detection only checks whether a cookie is present. A presence check says nothing about how securely the cookie is configured.

**Improvement:** Implement a **Cookie Hygiene Score (CHS)** that quantifies a cookie's security from the attributes it sets (`Secure`, `HttpOnly`, `SameSite`) and subtracts a penalty for an excessive lifespan.
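
Written as a formula, the scoring rule used by the code in this section can be sketched as follows. The symbols $c$ (a single cookie) and $T$ (the lifespan threshold) are notation introduced here for illustration; the weights and the 30-day threshold mirror the `cookie_score` function.

```latex
\mathrm{CHS}(c) = \mathbb{1}[\texttt{Secure} \in c]
                + \mathbb{1}[\texttt{HttpOnly} \in c]
                + \mathbb{1}[\texttt{SameSite} \in c]
                - \mathbb{1}[\text{Max-Age}(c) > T],
\qquad T = 30 \times 86400\ \text{s}
```

Under this rule the score ranges from $-1$ to $3$. For example, a cookie set with `Secure; HttpOnly; Max-Age=31536000` scores $1 + 1 + 0 - 1 = 1$.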



In [1]:

# analysis/cookie_hygiene.py

import pandas as pd

ATTRS = ["secure", "httponly", "samesite"]


def parse_set_cookie(header):
    """Parse a Set-Cookie header string into a dict of security-relevant attributes."""
    parts = [p.strip() for p in header.split(";")]
    # The first segment is the name=value pair; only the name is kept.
    name = parts[0].split("=", 1)[0]
    flags = {k: False for k in ATTRS}
    samesite = None
    max_age = None
    expires = None
    for p in parts[1:]:
        kv = p.split("=", 1)
        k = kv[0].strip().lower()
        # Valueless attributes (e.g. "Secure") get an empty string so the
        # string methods below are always safe to call.
        v = kv[1].strip().lower() if len(kv) == 2 else ""
        if k in ("secure", "httponly"):
            flags[k] = True
        elif k == "samesite":
            flags["samesite"] = True
            samesite = v or None
        elif k == "max-age":
            # Non-numeric or negative values are treated as missing.
            max_age = int(v) if v.isdigit() else None
        elif k == "expires":
            expires = v or None
    return {"name": name, "secure": flags["secure"], "httponly": flags["httponly"],
            "samesite": flags["samesite"], "samesite_val": samesite,
            "max_age": max_age, "expires": expires}
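
The parser keeps the raw `Expires` value, but nothing in this section turns it into a lifetime. A minimal sketch of one way to do that is shown below. `cookie_lifetime_seconds` is a hypothetical helper, not part of the original notebook; it assumes `Max-Age` takes precedence over `Expires` (as RFC 6265 specifies) and that the `Expires` value is an HTTP-style date that `email.utils.parsedate_to_datetime` can read.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def cookie_lifetime_seconds(max_age, expires, now=None):
    """Return the cookie lifetime in seconds, or None if it cannot be determined.

    Hypothetical helper: Max-Age wins over Expires when both are present.
    """
    if max_age is not None:
        return max_age
    if not expires:
        return None  # no lifetime attribute: a session cookie
    now = now or datetime.now(timezone.utc)
    try:
        when = parsedate_to_datetime(expires)
    except (TypeError, ValueError):
        return None  # unparseable date: treat as unknown
    if when.tzinfo is None:
        # Assumption: a date without a zone is interpreted as UTC.
        when = when.replace(tzinfo=timezone.utc)
    return (when - now).total_seconds()
```

A value returned by this helper could stand in for `max_age` wherever a lifespan check is applied, so that cookies relying only on `Expires` are measured as well.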


In [2]:
def cookie_score(df_setcookie):  # df_setcookie columns: url, set_cookie_header
    rows = []
    for _, r in df_setcookie.iterrows():
        meta = parse_set_cookie(r["set_cookie_header"])
        # Base score (max 3): +1 for Secure, +1 for HttpOnly, +1 for SameSite
        score = (1 if meta["secure"] else 0) + (1 if meta["httponly"] else 0) + (1 if meta["samesite"] else 0)
        # Expiry penalty: -1 if Max-Age exceeds 30 days (potential tracking risk).
        # Only Max-Age is checked; cookies that set only Expires are not penalized.
        life_pen = 1 if (meta["max_age"] and meta["max_age"] > 60*60*24*30) else 0
        rows.append({**meta, "url": r["url"], "score": score - life_pen})
    out = pd.DataFrame(rows)
    # Calculate Mean/Median Cookie Hygiene Score (CHS) per URL
    chs = out.groupby("url")["score"].agg(["mean","median","count"]).reset_index().rename(
        columns={"mean":"chs_mean","median":"chs_med","count":"cookie_count"})
    return chs, out

**GAP 2: Binary Navigator Fingerprinting Detection**


**Problem:** Navigator fingerprinting is common, but the paper notes that detection only records **presence** and does not profile *which properties* are read. A presence-only model cannot distinguish a benign query from aggressive fingerprinting.


**Improvement:** Extract **Navigator Specific Properties (NSP)** into a feature matrix, quantifying the breadth of properties accessed to identify aggressive fingerprinting attempts.

In [3]:
# analysis/extract_navigator.py

import pandas as pd

NAV_KEYS = ["userAgent","language","languages","platform","deviceMemory",
            "hardwareConcurrency","plugins","webdriver","userAgentData.brands",
            "userAgentData.platform","userAgentData.mobile"]

def build_nsp(js_calls_df):  # columns: url, api, prop
    # Filter for navigator API calls
    df = js_calls_df[js_calls_df["api"].str.contains("navigator", case=False, na=False)].copy()
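    df["prop_norm"] = df["prop"].fillna("").astype(str).str.lower()

    # Map raw properties to the canonical NAV_KEYS names. Longer keys are tried
    # first so that "languages" is not absorbed by "language" and
    # "userAgentData.platform" is not absorbed by "userAgent" or "platform".
    feats = {k.lower(): k for k in NAV_KEYS}
    ordered = sorted(feats, key=len, reverse=True)
    df["prop_bucket"] = df["prop_norm"].map(
        lambda p: next((feats[f] for f in ordered if f in p), None))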

    # Pivot to create a URL x Property matrix (counts of access)
    mat = (df.dropna(subset=["prop_bucket"])
             .assign(val=1)
             .pivot_table(index="url", columns="prop_bucket", values="val", aggfunc="sum", fill_value=0)
             .reset_index())

    # Convert to binary presence (1=used, 0=not used)
    for c in [c for c in mat.columns if c!="url"]:
        mat[c] = (mat[c] > 0).astype(int)

    # Total count of unique navigator keys accessed
    mat["n_nav_keys"] = mat.drop(columns=["url"]).sum(axis=1)
    return mat
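
To show how the breadth count might be used to separate light navigator use from aggressive fingerprinting, here is a small sketch on a hand-made matrix shaped like the output of `build_nsp`. The URLs, the values, and the cutoff `AGGRESSIVE_MIN_KEYS` are all assumptions for illustration; the section does not specify a threshold.

```python
import pandas as pd

# Toy NSP-style matrix: one row per URL, one binary column per navigator key.
nsp = pd.DataFrame({
    "url": ["https://a.example", "https://b.example"],
    "userAgent": [1, 1],
    "language": [1, 1],
    "platform": [0, 1],
    "hardwareConcurrency": [0, 1],
    "deviceMemory": [0, 1],
    "webdriver": [0, 1],
})

# Breadth of access: number of distinct navigator keys read per URL.
nsp["n_nav_keys"] = nsp.drop(columns=["url"]).sum(axis=1)

# Hypothetical cutoff: flag pages that read at least 5 distinct keys.
AGGRESSIVE_MIN_KEYS = 5
nsp["aggressive_fp"] = nsp["n_nav_keys"] >= AGGRESSIVE_MIN_KEYS

print(nsp[["url", "n_nav_keys", "aggressive_fp"]])
```

A fixed cutoff is the simplest option. In practice the threshold would need to be chosen from the distribution of `n_nav_keys` in the crawled data.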

**GAP 3: Lack of Predictive Context (Machine Learning)**

**Problem:** The mere presence of a tracking mechanism is an **unreliable test** for classifying a webpage. Security solutions need context to differentiate malicious pages from benign ones.

**Improvement:** Integrate the extracted quantitative features (CHS, NSP) into a simple machine learning model (Logistic Regression) to establish a baseline for **predictive context**.

In [4]:

# analysis/build_features.py

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score

def train_eval(X, y):
    # Standard train/test split for evaluation
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

    # Train Logistic Regression classifier
    clf = LogisticRegression(max_iter=200).fit(Xtr, ytr)
    proba = clf.predict_proba(Xte)[:,1]

    # Return evaluation metrics and feature coefficients (risk weights)
    return {
        "AUC": roc_auc_score(yte, proba),
        "AP": average_precision_score(yte, proba),
        "coef": dict(zip(X.columns, clf.coef_[0]))
    }

# Placeholder for the data loading and model execution part
# crawl (baseline vs EASP)
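
The placeholder leaves feature assembly open. One way it might look is sketched below, assuming per-URL tables shaped like the CHS and NSP outputs and a separate label table. All values here are made up, the `labels` frame is hypothetical ground truth, and filling a missing feature with 0 is a modeling assumption rather than something the notebook specifies.

```python
import pandas as pd

# Toy per-URL tables shaped like the CHS and NSP outputs (values are made up).
chs = pd.DataFrame({"url": ["u1", "u2", "u3"],
                    "chs_mean": [2.5, 0.0, 1.0],
                    "cookie_count": [4, 2, 1]})
nsp = pd.DataFrame({"url": ["u1", "u2"],
                    "n_nav_keys": [1, 7]})
labels = pd.DataFrame({"url": ["u1", "u2", "u3"],
                       "label": [0, 1, 0]})  # hypothetical ground truth

# Outer-join on url so pages with no cookies or no navigator calls are kept;
# a missing feature is filled with 0 (an assumption).
feats = chs.merge(nsp, on="url", how="outer").fillna(0)

# Keep only the URLs that have a label.
data = feats.merge(labels, on="url", how="inner")

X = data.drop(columns=["url", "label"])
y = data["label"]
```

`X` and `y` in this shape could then be passed to `train_eval`, given enough labeled pages of each class for a stratified split.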