# **Fake-News-Credit-Card-Fraud-Movies-prediction**

**Datasets:**

- Fake News: https://www.kaggle.com/datasets/emineyetm/fake-news-detection-datasets
- Credit Card Fraud: https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
- Movies: https://www.kaggle.com/datasets/parasharmanas/movie-recommendation-system


In [3]:

import os, pandas as pd, numpy as np
from pathlib import Path

def SAFE_READ_CSV(preferred_paths, fallback_msg):
    for p in preferred_paths:
        if os.path.exists(p):
            try:
                try:
                    df = pd.read_csv(p)
                except UnicodeDecodeError:
                    df = pd.read_csv(p, encoding='latin-1')
                print(f"Loaded dataset from: {p}")
                return df
            except Exception as e:
                print(f"Found {p} but couldn't read as CSV: {e}")
    print(fallback_msg)
    manual = input('➡ Enter full path to your CSV (or press Enter to cancel): ').strip()
    if manual:
        if not os.path.exists(manual):
            raise FileNotFoundError(f'Path does not exist: {manual}')
        try:
            return pd.read_csv(manual)
        except UnicodeDecodeError:
            return pd.read_csv(manual, encoding='latin-1')
    raise FileNotFoundError('CSV not found. Place the file next to this notebook or give a valid path.')


## Fake News Detection (TF‑IDF + Logistic Regression)

In [15]:

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report, confusion_matrix
import joblib, numpy as np

df_true = SAFE_READ_CSV(['data/News/True.csv','True.csv','/mnt/data/True.csv'], "Place 'True.csv' in data/ or provide a path")
df_fake = SAFE_READ_CSV(['data/News/Fake.csv','Fake.csv','/mnt/data/Fake.csv'], "Place 'Fake.csv' in data/ or provide a path")


def unify(df):
    if 'text' in df.columns:
        return df[['text']].copy()
    candidates = [c for c in df.columns if str(c).lower() in ['title','subject','content','article','body']]
    if candidates:
        return df[candidates].astype(str).agg(' '.join, axis=1).to_frame('text')
    str_cols = [c for c in df.columns if df[c].dtype == 'object']
    if str_cols:
        lengths = df[str_cols].astype(str).apply(lambda s: s.str.len().fillna(0).mean())
        best = lengths.sort_values(ascending=False).index[0]
        return df[[best]].rename(columns={best: 'text'})
    raise ValueError('No text columns detected')

X_true = unify(df_true); X_true['label'] = 0
X_fake = unify(df_fake); X_fake['label'] = 1
df_news = pd.concat([X_true, X_fake], ignore_index=True).dropna(subset=['text'])

X_train, X_test, y_train, y_test = train_test_split(df_news['text'], df_news['label'], test_size=0.2, random_state=42, stratify=df_news['label'])

pipe = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1,2), min_df=3, max_df=0.9)),
    ('clf', LogisticRegression(max_iter=2000, class_weight='balanced'))
])
pipe.fit(X_train, y_train)
pred = pipe.predict(X_test)
proba = pipe.predict_proba(X_test)[:,1]
print('Accuracy:', accuracy_score(y_test, pred))
print('ROC-AUC:', roc_auc_score(y_test, proba))
print('\nClassification Report:\n', classification_report(y_test, pred, target_names=['real','fake']))
print('Confusion Matrix:\n', confusion_matrix(y_test, pred))
joblib.dump(pipe, 'fake_news_tfidf_lr.joblib')
print('Saved → fake_news_tfidf_lr.joblib')

def explain_text(txt, pipeline, top_k=8):
    vec = pipeline.named_steps['tfidf']
    clf = pipeline.named_steps['clf']
    Xv = vec.transform([txt])
    coefs = clf.coef_.ravel()
    feats = np.array(vec.get_feature_names_out())
    vals = Xv.toarray().ravel()
    contrib = vals * coefs
    pos = np.argsort(contrib)[-top_k:][::-1]
    neg = np.argsort(contrib)[:top_k]
    return list(zip(feats[pos], contrib[pos])), list(zip(feats[neg], contrib[neg]))


Loaded dataset from: data/News/True.csv
Loaded dataset from: data/News/Fake.csv
Accuracy: 0.9868596881959911
ROC-AUC: 0.999084486151076

Classification Report:
               precision    recall  f1-score   support

        real       0.98      0.99      0.99      4284
        fake       0.99      0.98      0.99      4696

    accuracy                           0.99      8980
   macro avg       0.99      0.99      0.99      8980
weighted avg       0.99      0.99      0.99      8980

Confusion Matrix:
 [[4247   37]
 [  81 4615]]
Saved → fake_news_tfidf_lr.joblib


 ensure your trained model file exists in the same folder:
   fake_news_tfidf_lr.joblib

pip install -U streamlit scikit-learn joblib pandas numpy beautifulsoup4 lxml requests


pip install newspaper3k

streamlit
scikit-learn
joblib
pandas
numpy
beautifulsoup4
lxml
requests
newspaper3k   

streamlit run app_fake_news_streamlit.py
requriements

In [None]:
# app_fake_news_streamlit.py
# Fake News Detector — TF-IDF(1–2) + Logistic Regression
# • Paste text or URL (robots.txt-aware smart scraping; prefers newspaper3k)
# • Token-level explainability (top FAKE vs REAL) + bar charts + inline highlights
# • Batch CSV scoring + download
# • Decision threshold slider, quality-gate, caching, resilient errors
# • Polished UI with badges, metrics, and JSON payload

import os, re, time, json, joblib, numpy as np, pandas as pd, streamlit as st
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# =============== Streamlit Config & Style ===============
st.set_page_config(page_title="Fake News Detector", page_icon="📰", layout="wide")

CUSTOM_CSS = """
<style>
:root { --pill:#eef3ff; --accent:#4c78ff; --good:#10b981; --bad:#ef4444; }
.block-container { padding-top: 1.0rem; }
.pill { display:inline-block; padding:.25rem .65rem; border-radius:999px; background:var(--pill); }
.badge-real { background:#ecfdf5; color:#065f46; border:1px solid #10b98133; }
.badge-fake { background:#fef2f2; color:#7f1d1d; border:1px solid #ef444433; }
.small { opacity:.75; font-size:.90rem; }
.token-hi { padding:.05rem .25rem; border-radius:.25rem; }
.token-hi.real { background:#ecfdf5; }
.token-hi.fake { background:#fef2f2; }
footer { visibility: hidden; }  /* tidy the footer */
</style>
"""
st.markdown(CUSTOM_CSS, unsafe_allow_html=True)

st.title("📰 Fake News Detector")
st.caption("TF-IDF + Logistic Regression • URL scrape or paste text • Token contributions")

# =============== Load trained pipeline (cached) ===============
@st.cache_resource
def load_pipeline(model_path="fake_news_tfidf_lr.joblib"):
    """
    Cache the trained scikit-learn Pipeline for fast, repeatable inference.
    Expects a Pipeline with steps 'tfidf' (TfidfVectorizer) and 'clf' (LogisticRegression).
    """
    if not os.path.exists(model_path):
        raise FileNotFoundError(
            f"Model file '{model_path}' not found. Train your notebook to create it."
        )
    pipe = joblib.load(model_path)
    if "tfidf" not in pipe.named_steps or "clf" not in pipe.named_steps:
        raise ValueError("Expected Pipeline with steps: 'tfidf' and 'clf'.")
    return pipe

pipe = load_pipeline()
vec = pipe.named_steps["tfidf"]
clf = pipe.named_steps["clf"]

# =============== Utilities ===============
def is_url(s: str) -> bool:
    try:
        p = urlparse(s.strip())
        return bool(p.scheme and p.netloc)
    except Exception:
        return False

@st.cache_data(ttl=180)
def robots_allowed(url: str, ua: str = "Mozilla/5.0") -> bool:
    """robots.txt check; return True if allowed or robots missing."""
    try:
        p = urlparse(url)
        robots_url = f"{p.scheme}://{p.netloc}/robots.txt"
        rp = RobotFileParser()
        rp.set_url(robots_url)
        rp.read()
        return rp.can_fetch(ua, url) if rp.default_entry is not None else True
    except Exception:
        return True

def _scrape_bs4(url: str, timeout: int = 15) -> str:
    import requests
    from bs4 import BeautifulSoup
    headers = {
        "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124 Safari/537.36")
    }
    r = requests.get(url, headers=headers, timeout=timeout)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "lxml")
    article = soup.find("article")
    ps = (article.find_all("p") if article else soup.find_all("p"))
    text = " ".join(p.get_text(" ", strip=True) for p in ps if p.get_text(strip=True))
    return text.strip()

def _scrape_newspaper3k(url: str, timeout: int = 20) -> str:
    from newspaper import Article
    art = Article(url, keep_article_html=False, language="en")
    art.download()
    t0 = time.time()
    while art.download_state == 0 and time.time() - t0 < timeout:
        time.sleep(0.1)
    art.parse()
    title = (art.title or "").strip()
    text  = (art.text  or "").strip()
    return " ".join([title, text]).strip()

@st.cache_data(ttl=300)
def smart_scrape(url: str) -> str:
    """Prefer newspaper3k; fallback to BeautifulSoup; normalize + clamp length."""
    if not robots_allowed(url):
        return "[SCRAPE_BLOCKED] robots.txt disallows fetching this URL."
    try:
        import newspaper  # noqa: F401
        txt = _scrape_newspaper3k(url)
    except Exception:
        try:
            txt = _scrape_bs4(url)
        except Exception as e:
            return f"[SCRAPE_ERROR] {e}"
    txt = re.sub(r"\s+", " ", txt).strip()
    return txt[:25000] if len(txt) > 25000 else txt

def explain_text(txt: str, top_k: int = 10):
    """
    Token contributions:
      • top_fake: pushes toward FAKE (positive coef)
      • top_real: pushes toward REAL (negative coef)
    """
    Xv = vec.transform([txt])
    coefs = clf.coef_.ravel()
    feats = np.array(vec.get_feature_names_out())
    vals = Xv.toarray().ravel()
    contrib = vals * coefs
    pos_idx = np.argsort(contrib)[-top_k:][::-1]
    neg_idx = np.argsort(contrib)[:top_k]
    top_fake = [(feats[i], float(contrib[i])) for i in pos_idx if vals[i] > 0]
    top_real = [(feats[i], float(contrib[i])) for i in neg_idx if vals[i] > 0]
    return top_fake, top_real

def score_text(text: str, threshold: float = 0.5):
    prob_fake = float(pipe.predict_proba([text])[0][1])
    pred = int(prob_fake >= threshold)
    return ("FAKE" if pred == 1 else "REAL"), prob_fake

def highlight_tokens(text: str, top_fake, top_real):
    """Inline highlight of top tokens (exact word match, case-insensitive)."""
    if not text or (not top_fake and not top_real):
        return text
    tokens = [t for t,_ in (top_fake + top_real)]
    tokens = sorted(set(tokens), key=len, reverse=True)[:60]
    if not tokens: return text
    label_map = {t.lower():"fake" for t,_ in top_fake}
    label_map.update({t.lower():"real" for t,_ in top_real})
    def repl(m):
        w = m.group(0)
        cls = label_map.get(w.lower(), "real")
        return f'<span class="token-hi {cls}">{w}</span>'
    pattern = r"\b(" + "|".join(re.escape(t) for t in tokens) + r")\b"
    return re.sub(pattern, repl, text, flags=re.IGNORECASE)

# =============== Sidebar (Controls) ===============
with st.sidebar:
    st.header("⚙️ Settings")
    threshold = st.slider("Decision threshold (FAKE if prob ≥ threshold)", 0.10, 0.90, 0.50, 0.01)
    top_k = st.slider("Tokens to show per side", 5, 25, 10, 1)
    st.markdown("**Model file:** `fake_news_tfidf_lr.joblib`")
    st.markdown("**Dataset:** Fake/True News (two CSVs). See Kaggle dataset page.")
    st.markdown('<span class="small">Model: TF-IDF (1–2 grams, min_df=3, max_df=0.9) + Logistic Regression (class_weight=\"balanced\").</span>',
                unsafe_allow_html=True)

# =============== Tabs ===============
tab1, tab2 = st.tabs(["Single Article", "Batch CSV (optional)"])

# ---------- Tab 1: Single Article ----------
with tab1:
    st.subheader("Single Article")
    src = st.text_area("Paste a news article, social post, blog text, or a URL",
                       height=180, placeholder="Paste text or https://example.com/article")

    c1, c2, c3 = st.columns([1,1,1])
    with c1:
        min_chars = st.number_input("Minimum characters (quality gate)", 0, 1000, 40, 10)
    with c2:
        show_bars = st.checkbox("Show contribution bar charts", True)
    with c3:
        show_inline = st.checkbox("Inline token highlights", True)

    analyze = st.button("Analyze", type="primary", use_container_width=True)

    if analyze and src.strip():
        text = src.strip()
        if is_url(text):
            with st.spinner("Scraping article…"):
                text = smart_scrape(text)
            if text.startswith("[SCRAPE_BLOCKED]"):
                st.error(text); st.stop()
            if text.startswith("[SCRAPE_ERROR]"):
                st.error(text); st.stop()

        if len(text) < min_chars:
            st.warning("Input looks too short for reliable classification. Add more text.")

        label, prob_fake = score_text(text, threshold=threshold)

        # KPI badges
        cL, cR = st.columns([1,1])
        with cL:
            badge = 'badge-fake">FAKE' if label == "FAKE" else 'badge-real">REAL'
            st.markdown(f'### Prediction: <span class="pill {badge}</span>', unsafe_allow_html=True)
        with cR:
            st.metric(label="Probability (FAKE)", value=f"{prob_fake:.3f}",
                      delta=f"{(prob_fake - threshold):+.3f} vs threshold")

        # Token contributions
        top_fake, top_real = explain_text(text, top_k=top_k)

        colA, colB = st.columns(2)
        with colA:
            st.subheader("Tokens pushing ➜ FAKE")
            df_fake = pd.DataFrame(top_fake, columns=["token", "contribution"])
            st.dataframe(df_fake, use_container_width=True, hide_index=True)
            if show_bars and not df_fake.empty:
                st.bar_chart(df_fake.set_index("token"))
        with colB:
            st.subheader("Tokens pushing ➜ REAL")
            df_real = pd.DataFrame(top_real, columns=["token", "contribution"])
            st.dataframe(df_real, use_container_width=True, hide_index=True)
            if show_bars and not df_real.empty:
                st.bar_chart(df_real.assign(contribution=lambda d: -d["contribution"]).set_index("token"))

        # Inline highlight preview + JSON
        with st.expander("Show analyzed text"):
            if show_inline and (top_fake or top_real):
                st.markdown(highlight_tokens(text, top_fake, top_real), unsafe_allow_html=True)
            else:
                st.write(text)
        with st.expander("Prediction JSON"):
            payload = {
                "label": label,
                "prob_fake": round(prob_fake, 6),
                "prob_real": round(1 - prob_fake, 6),
                "threshold": threshold,
                "top_tokens_fake": top_fake,
                "top_tokens_real": top_real
            }
            st.code(json.dumps(payload, indent=2))

# ---------- Tab 2: Batch CSV ----------
with tab2:
    st.subheader("Batch CSV Scoring")
    st.write("Upload a CSV with a text column (e.g., `text`, `content`, `article`, `body`, or `message`).")
    up = st.file_uploader("CSV file", type=["csv"])
    if up:
        df_in = pd.read_csv(up)
        st.dataframe(df_in.head(), use_container_width=True)
        guess = None
        for c in df_in.columns:
            if str(c).strip().lower() in ["text", "content", "article", "body", "message"]:
                guess = c; break
        text_col = st.selectbox("Text column", options=df_in.columns.tolist(),
                                index=df_in.columns.get_loc(guess) if guess in df_in.columns else 0)

        if st.button("Score CSV", use_container_width=True):
            with st.spinner("Scoring…"):
                series = df_in[text_col].astype(str).fillna("")
                Xv = vec.transform(series)
                proba_fake = pipe.predict_proba(Xv)[:, 1]
                pred = (proba_fake >= threshold).astype(int)
                df_out = df_in.copy()
                df_out["prob_fake"] = proba_fake
                df_out["prediction"] = pred
            st.success("Scoring complete.")
            st.dataframe(df_out.head(20), use_container_width=True)
            st.download_button(
                "Download results (CSV)",
                df_out.to_csv(index=False).encode("utf-8"),
                file_name="fake_news_scored.csv",
                mime="text/csv",
                use_container_width=True
            )

# =============== Footer ===============
st.markdown("---")
st.markdown(
    """
**Notes**
- Model: TF-IDF (1–2 grams) + Logistic Regression trained on the Fake/True news CSVs.
- Outputs are probabilistic and context-dependent; always consider source, date and domain knowledge.
"""
)





## Credit Card Fraud Detection

In [19]:
# ==== Credit Card Fraud Detection (Upgraded) ===================================
# Supervised: Logistic Regression (balanced) with PR-optimal threshold
# Anomaly:    PCA reconstruction error (no neighbors/IF imports)
# Fusion:     AND / OR / weighted-average of LR and PCA scores
# Saves:
#   - fraud_lr_balanced.joblib
#   - fraud_pca_anomaly.joblib
#   - (helpers at bottom to score single/batch transactions consistently)
# ===============================================================================
import os, numpy as np, pandas as pd, joblib
from dataclasses import dataclass
from typing import Tuple, Literal, Optional, Dict

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report, confusion_matrix, roc_auc_score,
    precision_recall_curve, average_precision_score
)

# ---------------- SAFE_READ_CSV ----------------
try:
    SAFE_READ_CSV
except NameError:
    def SAFE_READ_CSV(preferred_paths, fallback_msg):
        for p in preferred_paths:
            if os.path.exists(p):
                try:
                    return pd.read_csv(p)
                except Exception as e:
                    print(f"Found {p} but couldn't read it: {e}")
        raise FileNotFoundError(fallback_msg)

# ---------------- Load data ----------------
paths = [
    'data/cerditcard/creditcard.csv',   # your existing path
    'data/creditcard/creditcard.csv',   # common corrected path
    'creditcard.csv', '/mnt/data/creditcard.csv'
]
df_cc = SAFE_READ_CSV(paths, "Place 'creditcard.csv' in data/ or provide a valid path")
assert 'Class' in df_cc.columns, "Expected binary target 'Class' (0=legit,1=fraud)."

X = df_cc.drop(columns=['Class'])
y = df_cc['Class'].astype(int)

# Optional: common tweaks
# if 'Time' in X.columns:   X = X.drop(columns=['Time'])
# if 'Amount' in X.columns: X = X.assign(Amount=np.log1p(X['Amount']))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# ---------------- Helpers ----------------
def pr_optimal_threshold(y_true: np.ndarray, scores: np.ndarray) -> float:
    """Choose threshold that maximizes F1 on the PR curve."""
    prec, rec, thr = precision_recall_curve(y_true, scores)
    f1 = 2 * prec * rec / (prec + rec + 1e-9)
    best = np.nanargmax(f1)
    return float(thr[best-1] if best > 0 and (best-1) < len(thr) else 0.5)

# ---------------- 1) Logistic Regression (balanced) ----------------
# If you want probability calibration, set CALIBRATE=True (kept off to be fast & simple)
CALIBRATE = False

pipe_lr = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=5000, class_weight='balanced', solver='lbfgs'))
])
pipe_lr.fit(X_train, y_train)

proba_lr = pipe_lr.predict_proba(X_test)[:, 1]
thr_lr = pr_optimal_threshold(y_test.values, proba_lr)
pred_lr = (proba_lr >= thr_lr).astype(int)

roc_auc = roc_auc_score(y_test, proba_lr)
ap_lr = average_precision_score(y_test, proba_lr)

print(f'LR ROC-AUC: {roc_auc:.6f} | PR-AUC (AP): {ap_lr:.6f} | thr*: {thr_lr:.6f}')
print('\nClassification Report (LR):\n', classification_report(y_test, pred_lr, digits=4))
print('Confusion Matrix (LR):\n', confusion_matrix(y_test, pred_lr))

# Save model with threshold and feature order to avoid drift later
lr_bundle = {
    'model': pipe_lr,
    'thr': thr_lr,
    'feature_names': list(X.columns)
}
joblib.dump(lr_bundle, 'fraud_lr_balanced.joblib')
print('Saved → fraud_lr_balanced.joblib')

# ---------------- 2) PCA anomaly detector (reconstruction error) ----------------
def fit_pca_anomaly(X_train_df: pd.DataFrame, y_train_series: pd.Series, var_keep: float = 0.95) -> Dict:
    """Fit PCA on normal class (y==0) with RobustScaler. Return components & scaler stats."""
    Xn = X_train_df[y_train_series == 0].values.astype(float)

    scaler = RobustScaler()
    Xn_rb = scaler.fit_transform(Xn)

    # SVD: X ≈ U S V^T ; keep k s.t. cumulative variance ≥ var_keep
    U, S, Vt = np.linalg.svd(Xn_rb, full_matrices=False)
    var = (S ** 2)
    cum = np.cumsum(var) / np.sum(var)
    k = int(np.searchsorted(cum, var_keep) + 1)
    Vt_k = Vt[:k, :]  # k x d

    return {
        'center_': scaler.center_,
        'scale_': scaler.scale_,
        'Vt_k': Vt_k,
        'k': k,
        'var_keep': var_keep,
        'feature_names': list(X_train_df.columns)
    }

def pca_recon_error(bundle: Dict, X_df: pd.DataFrame) -> np.ndarray:
    """Compute reconstruction error with saved components/scaler."""
    # enforce feature order
    X_df = X_df.reindex(columns=bundle['feature_names'], fill_value=0)
    Xv = X_df.values.astype(float)

    X_rb = (Xv - bundle['center_']) / (bundle['scale_'] + 1e-12)
    Vt_k = bundle['Vt_k']
    Z = X_rb @ Vt_k.T
    X_rec = Z @ Vt_k
    err = np.sum((X_rb - X_rec) ** 2, axis=1)
    return err

pca_bundle = fit_pca_anomaly(X_train, y_train, var_keep=0.95)
err_test = pca_recon_error(pca_bundle, X_test)
# min-max normalize to [0,1] as fraudiness score
score_pca = (err_test - err_test.min()) / (err_test.max() - err_test.min() + 1e-12)
thr_pca = pr_optimal_threshold(y_test.values, score_pca)
pred_pca = (score_pca >= thr_pca).astype(int)
ap_pca = average_precision_score(y_test, score_pca)

print(f'\nPCA-Anomaly PR-AUC (AP): {ap_pca:.6f} | thr*: {thr_pca:.6f}')
print('Classification Report (PCA):\n', classification_report(y_test, pred_pca, digits=4))
print('Confusion Matrix (PCA):\n', confusion_matrix(y_test, pred_pca))

pca_model = {'type': 'pca_recon', 'bundle': pca_bundle, 'thr': thr_pca}
joblib.dump(pca_model, 'fraud_pca_anomaly.joblib')
print('Saved → fraud_pca_anomaly.joblib')

# ---------------- Inference helpers (feature-order safe) ----------------
def score_transactions_lr(df_tx: pd.DataFrame, threshold: Optional[float] = None) -> Tuple[np.ndarray, np.ndarray]:
    saved = joblib.load('fraud_lr_balanced.joblib')
    model, thr, feats = saved['model'], saved['thr'], saved['feature_names']
    df_tx = df_tx.reindex(columns=feats, fill_value=0)  # guard against drift
    proba = model.predict_proba(df_tx.values)[:, 1]
    t = threshold if threshold is not None else thr
    return proba, (proba >= t).astype(int)

def score_transactions_pca(df_tx: pd.DataFrame, threshold: Optional[float] = None) -> Tuple[np.ndarray, np.ndarray]:
    saved = joblib.load('fraud_pca_anomaly.joblib')
    bndl, thr = saved['bundle'], saved['thr']
    scores = pca_recon_error(bndl, df_tx)
    scores = (scores - scores.min()) / (scores.max() - scores.min() + 1e-12)
    t = threshold if threshold is not None else thr
    return scores, (scores >= t).astype(int)

def fuse_decisions(
    proba_lr: np.ndarray,
    score_pca: np.ndarray,
    thr_lr: float,
    thr_pca: float,
    mode: Literal['or', 'and', 'avg'] = 'or',
    w_lr: float = 0.6
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Fusion:
      - 'or' : flag if (LR>=thr_lr) OR (PCA>=thr_pca)        -> higher recall
      - 'and': flag if (LR>=thr_lr) AND (PCA>=thr_pca)       -> higher precision
      - 'avg': flag if w*LR + (1-w)*PCA >= fused_threshold   -> set fused_threshold=0.5 by default
    Returns (fused_score, fused_pred)
    """
    if mode == 'or':
        pred = ((proba_lr >= thr_lr) | (score_pca >= thr_pca)).astype(int)
        score = np.maximum(proba_lr, score_pca)
        return score, pred
    elif mode == 'and':
        pred = ((proba_lr >= thr_lr) & (score_pca >= thr_pca)).astype(int)
        score = np.minimum(proba_lr, score_pca)
        return score, pred
    else:  # 'avg'
        fused = w_lr * proba_lr + (1 - w_lr) * score_pca
        # pick a decent default fused threshold
        fused_thr = 0.5
        pred = (fused >= fused_thr).astype(int)
        return fused, pred

# ---------------- Example: how to use the saved artifacts ----------------
if __name__ == "__main__":
    # Single or batch scoring on the test fold:
    proba_lr_test, pred_lr_test = score_transactions_lr(X_test)
    score_pca_test, pred_pca_test = score_transactions_pca(X_test)

    # Fusion examples
    lr_thr = joblib.load('fraud_lr_balanced.joblib')['thr']
    pca_thr = joblib.load('fraud_pca_anomaly.joblib')['thr']

    fused_score_or, fused_pred_or   = fuse_decisions(proba_lr_test, score_pca_test, lr_thr, pca_thr, mode='or')
    fused_score_and, fused_pred_and = fuse_decisions(proba_lr_test, score_pca_test, lr_thr, pca_thr, mode='and')
    fused_score_avg, fused_pred_avg = fuse_decisions(proba_lr_test, score_pca_test, lr_thr, pca_thr, mode='avg', w_lr=0.7)

    # Quick sanity printouts (optional)
    from sklearn.metrics import classification_report
    print("\n--- Fusion=OR report (higher recall) ---")
    print(classification_report(y_test, fused_pred_or, digits=4))
    print("\n--- Fusion=AND report (higher precision) ---")
    print(classification_report(y_test, fused_pred_and, digits=4))
    print("\n--- Fusion=AVG report (w=0.7) ---")
    print(classification_report(y_test, fused_pred_avg, digits=4))


Loaded dataset from: data/cerditcard/creditcard.csv
LR ROC-AUC: 0.972083 | PR-AUC (AP): 0.718971 | thr*: 1.000000

Classification Report (LR):
               precision    recall  f1-score   support

           0     0.9997    0.9997    0.9997     56864
           1     0.8247    0.8163    0.8205        98

    accuracy                         0.9994     56962
   macro avg     0.9122    0.9080    0.9101     56962
weighted avg     0.9994    0.9994    0.9994     56962

Confusion Matrix (LR):
 [[56847    17]
 [   18    80]]
Saved → fraud_lr_balanced.joblib

PCA-Anomaly PR-AUC (AP): 0.656308 | thr*: 0.034632
Classification Report (PCA):
               precision    recall  f1-score   support

           0     0.9997    0.9993    0.9995     56864
           1     0.6838    0.8163    0.7442        98

    accuracy                         0.9990     56962
   macro avg     0.8417    0.9078    0.8719     56962
weighted avg     0.9991    0.9990    0.9991     56962

Confusion Matrix (PCA):
 [[56827



In [None]:
# Credit Card Fraud Detector (Streamlit)
# - Loads: fraud_lr_balanced.joblib, fraud_pca_anomaly.joblib
# - Single transaction JSON or CSV upload
# - Tunable thresholds + decision fusion (OR / AND / weighted average)
# - Safe feature-ordering (prevents 'feature names' warnings)
# - Downloadable results

import json, os, joblib, numpy as np, pandas as pd, streamlit as st
from typing import Tuple

st.set_page_config(page_title="Fraud Detector", page_icon="💳", layout="wide")
st.title("💳 Credit Card Fraud Detector")
st.caption("Logistic Regression (balanced) + PCA anomaly • Kaggle CreditCardFraud dataset")

# -----------------------------
# Load artifacts (cached)
# -----------------------------
@st.cache_resource
def load_artifacts():
    if not os.path.exists("fraud_lr_balanced.joblib"):
        raise FileNotFoundError("Missing fraud_lr_balanced.joblib. Train and save first.")
    if not os.path.exists("fraud_pca_anomaly.joblib"):
        raise FileNotFoundError("Missing fraud_pca_anomaly.joblib. Train and save first.")
    lr_bundle  = joblib.load("fraud_lr_balanced.joblib")
    pca_bundle = joblib.load("fraud_pca_anomaly.joblib")
    # Extract pieces
    lr_model      = lr_bundle["model"]
    lr_thr_saved  = float(lr_bundle["thr"])
    feature_names = lr_bundle.get("feature_names", None)
    if feature_names is None:
        # Backward safe: derive from model if missing
        try:
            feature_names = lr_model.named_steps["scaler"].feature_names_in_.tolist()
        except Exception:
            raise ValueError("feature_names not found. Re-save with training column order.")
    pca_bndl = pca_bundle["bundle"]
    pca_thr  = float(pca_bundle["thr"])
    pca_feats = pca_bndl["feature_names"]
    # Must match columns; if not, we’ll reindex to lr feature list
    if feature_names != pca_feats:
        st.warning("LR and PCA feature lists differ. The app will align inputs to LR’s feature_names.")
    return lr_model, lr_thr_saved, feature_names, pca_bndl, pca_thr

LR_MODEL, LR_THR_SAVED, FEATS, PCA_BUNDLE, PCA_THR_SAVED = load_artifacts()

# -----------------------------
# Helpers (feature-order-safe)
# -----------------------------
def ensure_feature_order(df_like: pd.DataFrame) -> pd.DataFrame:
    """Return a DataFrame with exactly the training columns (missing -> 0, extras dropped)."""
    if not isinstance(df_like, pd.DataFrame):
        df_like = pd.DataFrame([df_like], columns=FEATS)  # best effort
    # cast to numeric and align
    df_like = df_like.copy()
    for c in df_like.columns:
        df_like[c] = pd.to_numeric(df_like[c], errors="coerce")
    df_like = df_like.reindex(columns=FEATS, fill_value=0)
    return df_like

def pca_recon_error(bundle: dict, X_df: pd.DataFrame) -> np.ndarray:
    """Compute PCA reconstruction error from saved components/scaler."""
    # enforce feature order for PCA as well (it stores its own list)
    X_df = X_df.reindex(columns=bundle['feature_names'], fill_value=0)
    Xv = X_df.values.astype(float)
    center, scale = bundle['center_'], bundle['scale_']
    Vt_k = bundle['Vt_k']
    X_rb = (Xv - center) / (scale + 1e-12)
    Z = X_rb @ Vt_k.T
    X_rec = Z @ Vt_k
    err = np.sum((X_rb - X_rec)**2, axis=1)
    return err

def minmax01(x: np.ndarray) -> np.ndarray:
    return (x - x.min()) / (x.max() - x.min() + 1e-12)

def score_lr(df_tx: pd.DataFrame, thr: float) -> Tuple[np.ndarray, np.ndarray]:
    X = ensure_feature_order(df_tx)
    # Keep DataFrame to avoid feature-name warnings:
    proba = LR_MODEL.predict_proba(X)[:, 1]
    pred  = (proba >= thr).astype(int)
    return proba, pred

def score_pca(df_tx: pd.DataFrame, thr: float) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    X = ensure_feature_order(df_tx)  # align to LR features; PCA helper reindexes again internally
    err = pca_recon_error(PCA_BUNDLE, X)
    score = minmax01(err)  # fraudiness in [0,1]
    pred  = (score >= thr).astype(int)
    return score, pred, err

def fuse(proba_lr, score_pca, thr_lr, thr_pca, mode="or", w_lr=0.6, fused_thr=0.5):
    """
    Fusion:
      - 'or'  : flag if LR>=thr_lr OR PCA>=thr_pca   (higher recall)
      - 'and' : flag if LR>=thr_lr AND PCA>=thr_pca  (higher precision)
      - 'avg' : flag if w_lr*LR + (1-w_lr)*PCA >= fused_thr
    Returns (fused_score, fused_pred)
    """
    mode = (mode or "or").lower()
    if mode == "and":
        pred = ((proba_lr >= thr_lr) & (score_pca >= thr_pca)).astype(int)
        score = np.minimum(proba_lr, score_pca)
        return score, pred
    elif mode == "avg":
        fused = w_lr * proba_lr + (1 - w_lr) * score_pca
        pred  = (fused >= fused_thr).astype(int)
        return fused, pred
    else:  # 'or'
        pred = ((proba_lr >= thr_lr) | (score_pca >= thr_pca)).astype(int)
        score = np.maximum(proba_lr, score_pca)
        return score, pred

# -----------------------------
# Sidebar controls
# -----------------------------
st.sidebar.header("⚙️ Settings")
lr_thr   = st.sidebar.slider("LR threshold (saved default)", 0.01, 0.99, float(LR_THR_SAVED), 0.01)
pca_thr  = st.sidebar.slider("PCA threshold (saved default)", 0.01, 0.99, float(PCA_THR_SAVED), 0.01)
fusion   = st.sidebar.selectbox("Fusion mode", ["or", "and", "avg"], index=0)
w_lr     = st.sidebar.slider("Weighted fusion: weight for LR (w)", 0.0, 1.0, 0.6, 0.05)
f_thr    = st.sidebar.slider("Weighted fusion: fused threshold", 0.05, 0.95, 0.50, 0.01)
st.sidebar.markdown("**Artifacts:** `fraud_lr_balanced.joblib`, `fraud_pca_anomaly.joblib`")
st.sidebar.markdown("**Dataset:** Kaggle – Credit Card Fraud (<small>ULB</small>)")

# -----------------------------
# Tabs
# -----------------------------
tab1, tab2 = st.tabs(["Single Transaction", "Batch CSV"])

# ===== Tab 1: Single =====
with tab1:
    st.subheader("Single Transaction (JSON)")
    colL, colR = st.columns([2,1])

    with colL:
        example = st.toggle("Load a random realistic example")
        if example:
            # Simple synthetic demo in the V1..V28 + Amount + Time shape
            ex = {c: 0 for c in FEATS}
            # nudge a few features
            for k in ["V3","V10","V12","V14","V17","V18","V21","Amount"]:
                if k in ex: ex[k] = float(np.random.normal(loc=2.0, scale=1.0))
            st.session_state["single_json"] = json.dumps(ex, indent=2)
        default_json = st.session_state.get("single_json", "{\n  \"" + FEATS[0] + "\": 0\n}")
        txt = st.text_area(
            "Paste a JSON object with the same schema as the training features.",
            value=default_json, height=280
        )
        run_single = st.button("Score Single", type="primary")
    with colR:
        st.write("**Training feature names**")
        st.code(", ".join(FEATS[:10]) + (" ... " if len(FEATS) > 10 else ""), language="text")

    if run_single:
        try:
            data = json.loads(txt)
            if isinstance(data, dict):
                df_one = pd.DataFrame([data])
            elif isinstance(data, list):
                df_one = pd.DataFrame(data)
            else:
                raise ValueError("JSON must be an object or list of objects.")
        except Exception as e:
            st.error(f"Invalid JSON: {e}")
            st.stop()

        proba_lr, pred_lr = score_lr(df_one, thr=lr_thr)
        score_pc, pred_pc, err = score_pca(df_one, thr=pca_thr)
        fused_score, fused_pred = fuse(proba_lr, score_pc, lr_thr, pca_thr, mode=fusion, w_lr=w_lr, fused_thr=f_thr)

        c1, c2, c3 = st.columns(3)
        with c1:
            st.metric("LR prob (fraud)", f"{proba_lr[0]:.4f}", delta=f"thr {lr_thr:.2f}")
            st.write("Decision:", "**FRAUD**" if pred_lr[0]==1 else "legit")
        with c2:
            st.metric("PCA score (fraudiness)", f"{score_pc[0]:.4f}", delta=f"thr {pca_thr:.2f}")
            st.write("Decision:", "**FRAUD**" if pred_pc[0]==1 else "legit")
        with c3:
            st.metric("Fused score", f"{fused_score[0]:.4f}", delta=f"{fusion} mode")
            st.write("Decision:", "**FRAUD**" if fused_pred[0]==1 else "legit")

        st.markdown("**Verbose JSON**")
        st.json({
            "lr": {"proba_fraud": float(proba_lr[0]), "threshold": lr_thr, "pred": int(pred_lr[0])},
            "pca": {"score_fraud": float(score_pc[0]), "threshold": pca_thr, "pred": int(pred_pc[0])},
            "fusion": {"mode": fusion, "w_lr": w_lr, "fused_thr": f_thr,
                       "score": float(fused_score[0]), "pred": int(fused_pred[0])}
        })

# ===== Tab 2: Batch =====
with tab2:
    st.subheader("Batch CSV scoring")
    st.write("Upload a CSV with **exactly** the model’s feature columns (extras are ignored, missing are filled with 0).")
    up = st.file_uploader("CSV file", type=["csv"])
    if up is not None:
        df_in = pd.read_csv(up)
        st.write("Preview:")
        st.dataframe(df_in.head(), use_container_width=True)

        if st.button("Score CSV", type="primary"):
            X = ensure_feature_order(df_in)
            proba_lr, pred_lr = score_lr(X, thr=lr_thr)
            score_pc, pred_pc, err = score_pca(X, thr=pca_thr)
            fused_score, fused_pred = fuse(proba_lr, score_pc, lr_thr, pca_thr, mode=fusion, w_lr=w_lr, fused_thr=f_thr)

            out = df_in.copy()
            out["lr_proba_fraud"] = proba_lr
            out["lr_pred"] = pred_lr
            out["pca_score_fraud"] = score_pc
            out["pca_pred"] = pred_pc
            out["fused_score"] = fused_score
            out["fused_pred"] = fused_pred

            st.success("Scoring complete.")
            st.dataframe(out.head(30), use_container_width=True)
            st.download_button(
                "Download results (CSV)",
                data=out.to_csv(index=False).encode("utf-8"),
                file_name="fraud_scored.csv",
                mime="text/csv",
                use_container_width=True
            )

# -----------------------------
# Explainers / Notes
# -----------------------------
st.markdown("---")
st.markdown("""
**How this works**

- **LR (supervised):** outputs a calibrated-like probability of fraud. We use the **PR-optimal threshold** learned during training (you can adjust it).
- **PCA (anomaly):** trains on normal transactions only; **reconstruction error** is min-max scaled to a fraudiness score in [0,1].
- **Fusion:**
  - **OR** → alert if either LR or PCA fires (higher recall).
  - **AND** → alert only if both fire (higher precision).
  - **AVG** → use weighted average of LR and PCA scores and compare to a fused threshold.

**Best practices**
- Keep thresholds stable across environments for consistent alerting.
- Monitor drift: if incoming feature distributions shift, retrain or re-fit PCA on recent normals.
- Log decisions with scores and thresholds for auditability.

**Dataset**
- Kaggle (ULB): https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
""")


## Personalized Movie Recommendation (User-based CF)

In [None]:
# ===========================================================
#  Movie Recommender – High-throughput Edition
#  * One-time build ≈ 60-90 s on 16 GB / 8-Core laptop
#  * Subsequent runs ≈ 2-3 s (load Joblib + serve requests)
# ===========================================================

import os, sys, math, json, shutil, joblib, numpy as np, pandas as pd
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor
from scipy.sparse import csr_matrix, save_npz, load_npz
from sklearn.preprocessing import normalize
from collections import defaultdict

# -------- Runtime tunables (override via env) ------------------
N_THREADS = int(os.getenv("NUM_THREADS", max(os.cpu_count()-1, 1)))
os.environ["OPENBLAS_NUM_THREADS"] = os.environ["MKL_NUM_THREADS"] = str(N_THREADS)

DATA_DIR   = Path("data/Movie")
CACHE_DIR  = Path("movie_cache_fast");  CACHE_DIR.mkdir(exist_ok=True)
MODEL_PATH = Path("movie_recommender.joblib")

RATINGS_CSV = DATA_DIR / "ratings.csv"
MOVIES_CSV  = DATA_DIR / "movies.csv"
R_PARQUET   = CACHE_DIR / "ratings.parquet"
M_PARQUET   = CACHE_DIR / "movies.parquet"
CSR_NPZ     = CACHE_DIR / "ratings_csr.npz"
MAPS_PKL    = CACHE_DIR / "id_maps.pkl"

# --------------------------------------------------------------
# 1) Fast CSV → Parquet (runs once)
# --------------------------------------------------------------
def _csv_to_parquet(src: Path, dst: Path, dtypes: dict):
    try:
        import polars as pl
        df = pl.read_csv(src, dtypes=dtypes).to_pandas()
    except ModuleNotFoundError:
        df = pd.read_csv(src, dtype=dtypes)
    dst.parent.mkdir(exist_ok=True)
    df.to_parquet(dst, index=False)
    return df

if not R_PARQUET.exists():
    print("⚡ Converting ratings.csv → Parquet (1st-time)…")
    _csv_to_parquet(
        RATINGS_CSV, R_PARQUET,
        dtypes={"userId": "int32", "movieId": "int32", "rating": "float32", "timestamp": "int64"}
    )
if not M_PARQUET.exists():
    print("⚡ Converting movies.csv → Parquet (1st-time)…")
    _csv_to_parquet(
        MOVIES_CSV, M_PARQUET,
        dtypes={"movieId": "int32", "title": "string", "genres": "string"}
    )

# --------------------------------------------------------------
# 2) Load Parquet (<< 1 s)
# --------------------------------------------------------------
df_ratings = pd.read_parquet(R_PARQUET)
df_movies  = pd.read_parquet(M_PARQUET)

# --------------------------------------------------------------
# 3) Build csr_matrix once, cache to NPZ
# --------------------------------------------------------------
if not CSR_NPZ.exists() or not MAPS_PKL.exists():
    print("⚡ Building CSR rating matrix…")
    uid_map = {u: i for i, u in enumerate(np.sort(df_ratings["userId"].unique()))}
    iid_map = {m: j for j, m in enumerate(np.sort(df_ratings["movieId"].unique()))}
    rows = df_ratings["userId"].map(uid_map).to_numpy("int32")
    cols = df_ratings["movieId"].map(iid_map).to_numpy("int32")
    vals = df_ratings["rating"].to_numpy("float32")
    R = csr_matrix((vals, (rows, cols)),
                   shape=(len(uid_map), len(iid_map)), dtype=np.float32)
    save_npz(CSR_NPZ, R)
    joblib.dump({"uid_map": uid_map, "iid_map": iid_map}, MAPS_PKL, compress=3)
else:
    uid_map = joblib.load(MAPS_PKL)["uid_map"]
    iid_map = joblib.load(MAPS_PKL)["iid_map"]
    R = load_npz(CSR_NPZ)

uid_inv = {i: u for u, i in uid_map.items()}
iid_inv = {j: m for m, j in iid_map.items()}

# --------------------------------------------------------------
# 4) Cache L2-normalized copies
# --------------------------------------------------------------
print("⚡ L2-normalizing users & items…")
R_user = normalize(R, axis=1, copy=True)
R_item = normalize(R, axis=0, copy=True)

# --------------------------------------------------------------
# 5) Popularity prior (Bayesian smoothing, m = 20)
# --------------------------------------------------------------
stats = df_ratings.groupby("movieId").rating.agg(["count", "mean"]).reset_index()
gmean = df_ratings["rating"].mean()
m = 20
stats["pop_score"] = (stats["count"]*stats["mean"] + m*gmean) / (stats["count"] + m)
pop_items_internal = [iid_map[i] for i in stats.sort_values("pop_score", ascending=False)["movieId"]
                      if i in iid_map]

# --------------------------------------------------------------
# 6) Core scorers  (vectorized dot-product → BLAS multi-thread)
# --------------------------------------------------------------
def _usercf(uidx: int, k=50):
    sims = R_user @ R_user[uidx].T      # (N_users, 1)
    sims = sims.toarray().ravel()
    sims[uidx] = 0
    neigh = sims.argpartition(-k)[:k]
    scores = defaultdict(float)
    seen = set(R[uidx].indices)
    for n in neigh:
        s = sims[n]
        if s <= 0: continue
        row = R[n]
        for j, v in zip(row.indices, row.data):
            if j in seen: continue
            scores[j] += s * v
    return scores

def _itemcf(uidx: int, k=100):
    seen_idx = R[uidx].indices
    seen_val = R[uidx].data
    if len(seen_idx) == 0:
        return {}
    scores = defaultdict(float)
    for it, v in zip(seen_idx, seen_val):
        sims = R_item.T @ R_item[:, it]
        sims = sims.toarray().ravel()
        sims[it] = 0
        neigh = sims.argpartition(-k)[:k]
        for j in neigh:
            if j in seen_idx: continue
            scores[j] += sims[j] * v
    return scores

def _hybrid(uidx, alpha=0.6, ku=50, ki=100):
    sc_u = _usercf(uidx, ku)
    sc_i = _itemcf(uidx, ki)
    items = set(sc_u) | set(sc_i)
    return {i: alpha*sc_u.get(i, 0) + (1-alpha)*sc_i.get(i, 0) for i in items}

# --------------------------------------------------------------
# 7) Public API
# --------------------------------------------------------------
def recommend_for_user(user_id: int, top_n=10, mode="hybrid",
                       alpha=0.6, ku=50, ki=100):
    if user_id not in uid_map:
        rec_idx = pop_items_internal[:top_n]
    else:
        uidx = uid_map[user_id]
        if mode == "user":
            scores = _usercf(uidx, ku)
        elif mode == "item":
            scores = _itemcf(uidx, ki)
        else:
            scores = _hybrid(uidx, alpha, ku, ki)
        rec_idx = sorted(scores, key=scores.get, reverse=True)[:top_n] or pop_items_internal[:top_n]
    mids = [iid_inv[i] for i in rec_idx]
    return df_movies[df_movies.movieId.isin(mids)][["movieId", "title"]].set_index("movieId").loc[mids].reset_index()

def similar_movies(movie_id: int, top_n=10):
    if movie_id not in iid_map:
        return pd.DataFrame()
    itx = iid_map[movie_id]
    sims = R_item.T @ R_item[:, itx]
    sims = sims.toarray().ravel()
    sims[itx] = 0
    neigh = sims.argpartition(-top_n)[:top_n]
    neigh = neigh[np.argsort(-sims[neigh])]
    mids = [iid_inv[i] for i in neigh]
    return df_movies[df_movies.movieId.isin(mids)][["movieId", "title"]].set_index("movieId").loc[mids].reset_index()

# --------------------------------------------------------------
# 8) Minimal evaluation (LOO Recall@10 on 200 users)
#    Runs quickly thanks to pre-built matrices.
# --------------------------------------------------------------
def loo_recall(k=10, max_users=200):
    rng = np.random.default_rng(42)
    hits = 0; tested = 0
    for uid, grp in df_ratings.groupby("userId"):
        if len(grp) < 3: continue
        tested += 1
        if tested > max_users: break
        hold = grp.sample(1, random_state=rng).iloc[0]
        recommendations = recommend_for_user(uid, top_n=k).movieId.tolist()
        if hold.movieId in recommendations:
            hits += 1
    return hits / tested if tested else None

recall10 = loo_recall()
print(f"Leave-One-Out Recall@10: {recall10:.3f}")

# --------------------------------------------------------------
# 9) Persist complete model bundle
# --------------------------------------------------------------
model_bundle = dict(
    R_path=str(CSR_NPZ), R_user_norm=None, R_item_norm=None,  # load lazily if huge
    uid_map=uid_map, iid_map=iid_map, uid_inv=uid_inv, iid_inv=iid_inv,
    df_movies=df_movies[["movieId", "title", "genres"]],
    pop_idx=pop_items_internal,
    meta=dict(recall10=recall10, threads=N_THREADS)
)
joblib.dump(model_bundle, MODEL_PATH, compress=3)
print("✅  Saved →", MODEL_PATH)

# ------------------- Quick demo -------------------
if __name__ == "__main__":
    uid_demo = list(uid_map.keys())[0]
    print("\nTop-10 (hybrid) for user", uid_demo)
    print(recommend_for_user(uid_demo, top_n=10))

    mid_demo = list(iid_map.keys())[0]
    print("\nMovies similar to:", df_movies.loc[df_movies.movieId==mid_demo, 'title'].iloc[0])
    print(similar_movies(mid_demo, top_n=10))


⚡ Converting ratings.csv → Parquet (1st-time)…
⚡ Converting movies.csv → Parquet (1st-time)…
⚡ Building CSR rating matrix…
⚡ L2-normalizing users & items…
Leave-One-Out Recall@10: 0.000
✅  Saved → movie_recommender.joblib

Top-10 (hybrid) for user 1
   movieId                                              title
0      260          Star Wars: Episode IV - A New Hope (1977)
1       50                         Usual Suspects, The (1995)
2     2571                                 Matrix, The (1999)
3     1196  Star Wars: Episode V - The Empire Strikes Back...
4      318                   Shawshank Redemption, The (1994)
5      593                   Silence of the Lambs, The (1991)
6     1210  Star Wars: Episode VI - Return of the Jedi (1983)
7      527                            Schindler's List (1993)
8      110                                  Braveheart (1995)
9      356                                Forrest Gump (1994)

Movies similar to: Toy Story (1995)
   movieId                     

In [None]:
# app_movie_streamlit.py
# ===========================================================
#  🎬 Streamlit Movie Recommender (Auto-Personal + Filters + APIs)
#  - Loads bundle: movie_recommender.joblib (created by your trainer script)
#  - Personalizes WITHOUT asking questions:
#      • If URL has ?user_id=<id> and it's in data → use that profile
#      • Otherwise start from popularity; learns from your 👍 during session
#  - Advanced controls: include/exclude genres, year range, min rating count
#  - Recency boost, Serendipity (exploration), MMR diversity re-ranking
#  - Optional TMDb (posters) & OMDb (plot/IMDB rating) — paste keys in sidebar
#  - Uses st.query_params (no deprecated experimental API)
# ===========================================================

import os, re, time, json, math, numpy as np, pandas as pd, joblib, requests
from pathlib import Path
from typing import Dict, List, Tuple
from scipy.sparse import load_npz, csr_matrix
from sklearn.preprocessing import normalize
import streamlit as st

# -------------------- Page setup --------------------
st.set_page_config(page_title="🎬 Movie Recommender", page_icon="🎬", layout="wide")
st.title("🎬 Movie Recommender")
st.caption("Hybrid collaborative filtering • Auto-personalization • Filters • Diversity • Recency • APIs")

# -------------------- Load model bundle --------------------
@st.cache_resource
def load_bundle(path: str = "movie_recommender.joblib"):
    if not Path(path).exists():
        raise FileNotFoundError(
            f"Missing '{path}'. Run the training notebook/script that saved movie_recommender.joblib."
        )
    B = joblib.load(path)
    for k in ["R_path","uid_map","iid_map","uid_inv","iid_inv","df_movies","pop_idx","meta"]:
        if k not in B:
            raise ValueError(f"Bundle missing key: {k}")
    return B

B = load_bundle()
DF_MOVIES: pd.DataFrame = B["df_movies"].copy()
UID_MAP: Dict[int,int]   = B["uid_map"]
IID_MAP: Dict[int,int]   = B["iid_map"]
UID_INV: Dict[int,int]   = B["uid_inv"]
IID_INV: Dict[int,int]   = B["iid_inv"]
POP_IDX: List[int]       = B["pop_idx"]
R_NPZ                    = B["R_path"]

# -------------------- Load matrices (cached) --------------------
@st.cache_resource
def load_mats(r_npz_path: str) -> Tuple[csr_matrix, csr_matrix, csr_matrix, np.ndarray]:
    R: csr_matrix = load_npz(r_npz_path)
    R_user = normalize(R, axis=1, copy=True)  # user-cosine
    R_item = normalize(R, axis=0, copy=True)  # item-cosine
    item_counts = np.asarray((R > 0).sum(axis=0)).ravel()  # #ratings per movie
    return R, R_user, R_item, item_counts

R, R_USER, R_ITEM, ITEM_COUNTS = load_mats(R_NPZ)

# -------------------- Metadata (genres, years) --------------------
def parse_year(title: str) -> int | None:
    m = re.search(r"\((\d{4})\)\s*$", str(title))
    return int(m.group(1)) if m else None

DF_MOVIES["year"] = DF_MOVIES["title"].map(parse_year)

def split_genres(g: str) -> List[str]:
    g = (g or "")
    return [] if g in ("(no genres listed)", "", None) else [x.strip() for x in g.split("|")]

DF_MOVIES["genres_list"] = DF_MOVIES["genres"].map(split_genres)
ALL_GENRES = sorted({g for gs in DF_MOVIES["genres_list"] for g in gs})

TITLE_BY_MID = dict(zip(DF_MOVIES.movieId, DF_MOVIES.title))
GENRES_BY_MID = dict(zip(DF_MOVIES.movieId, DF_MOVIES.genres_list))
YEAR_BY_MID = dict(zip(DF_MOVIES.movieId, DF_MOVIES.year))

# -------------------- Similarity-based scorers --------------------
def _user_seen(uidx: int) -> set[int]:
    return set(R.getrow(uidx).indices.tolist())

def _usercf_scores(uidx: int, k_neighbors=50) -> Dict[int,float]:
    sims = (R_USER @ R_USER.getrow(uidx).T).toarray().ravel()
    sims[uidx] = 0.0
    if k_neighbors < len(sims):
        neigh_idx = np.argpartition(-sims, k_neighbors)[:k_neighbors]
    else:
        neigh_idx = np.where(sims > 0)[0]
    neigh_sims = sims[neigh_idx]
    seen = _user_seen(uidx)
    scores: Dict[int,float] = {}
    for n_i, s in zip(neigh_idx, neigh_sims):
        if s <= 0: 
            continue
        row = R.getrow(n_i)
        for j, v in zip(row.indices, row.data):
            if j in seen:
                continue
            scores[j] = scores.get(j, 0.0) + s * float(v)
    return scores

def _itemcf_scores(uidx: int, k_neighbors=100) -> Dict[int,float]:
    row = R.getrow(uidx)
    seen_idx, seen_val = row.indices, row.data
    if len(seen_idx) == 0:
        return {}
    scores: Dict[int,float] = {}
    for it, r in zip(seen_idx, seen_val):
        sims = (R_ITEM.T @ R_ITEM[:, it]).toarray().ravel()
        sims[it] = 0.0
        if k_neighbors < len(sims):
            neigh_idx = np.argpartition(-sims, k_neighbors)[:k_neighbors]
        else:
            neigh_idx = np.where(sims > 0)[0]
        for jt in neigh_idx:
            if jt in seen_idx:
                continue
            scores[jt] = scores.get(jt, 0.0) + float(sims[jt]) * float(r)
    return scores

def _hybrid_scores(uidx: int, ku=50, ki=100, alpha=0.6) -> Dict[int,float]:
    su = _usercf_scores(uidx, ku)
    si = _itemcf_scores(uidx, ki)
    keys = set(su) | set(si)
    return {k: alpha*su.get(k,0.0) + (1-alpha)*si.get(k,0.0) for k in keys}

def _idx_to_frame(idx_list: List[int]) -> pd.DataFrame:
    mids = [IID_INV[i] for i in idx_list]
    out = DF_MOVIES[DF_MOVIES.movieId.isin(mids)][["movieId","title","genres","year"]].copy()
    return out.set_index("movieId").loc[mids].reset_index()

# -------------------- Auto-profile (NO questions) --------------------
# Source 1: URL param ?user_id=<id> (dataset user)
# Source 2: Session 👍 likes (implicit)
params = st.query_params  # ✔️ modern API
USER_ID_PARAM = None
try:
    if "user_id" in params and params.get("user_id"):
        # st.query_params values are strings (or list-like); handle both
        raw = params.get("user_id")
        USER_ID_PARAM = int(raw[0] if isinstance(raw, list) else raw)
except Exception:
    USER_ID_PARAM = None

@st.cache_data
def user_genre_affinity_from_dataset(user_id: int) -> Dict[str, float]:
    """Normalized genre profile from this dataset user's historical ratings."""
    if user_id not in UID_MAP:
        return {}
    uidx = UID_MAP[user_id]
    row = R.getrow(uidx)
    w = {}
    for j, rating in zip(row.indices, row.data):
        mid = IID_INV[j]
        for g in GENRES_BY_MID.get(mid, []):
            w[g] = w.get(g, 0.0) + float(rating)
    s = sum(w.values()) or 1.0
    return {k: v/s for k, v in w.items()}

if "session_likes" not in st.session_state:
    st.session_state.session_likes: set[int] = set()

def session_genre_affinity() -> Dict[str,float]:
    """Genre counts from quick likes this session."""
    if not st.session_state.session_likes:
        return {}
    w = {}
    for mid in st.session_state.session_likes:
        for g in GENRES_BY_MID.get(mid, []):
            w[g] = w.get(g, 0.0) + 1.0
    s = sum(w.values()) or 1.0
    return {k: v/s for k, v in w.items()}

def boost_by_genre(scores: Dict[int,float], beta: float, affinity: Dict[str,float]) -> Dict[int,float]:
    """Multiply score by (1 + β * affinity_sum_of_item_genres)."""
    if beta <= 0 or not affinity:
        return scores
    out = {}
    for j, sc in scores.items():
        mid = IID_INV[j]
        a = sum(affinity.get(g, 0.0) for g in GENRES_BY_MID.get(mid, []))
        out[j] = sc * (1.0 + beta * a)
    return out

# -------------------- Filters, Recency, Serendipity & MMR --------------------
def apply_filters(candidates: List[int],
                  include_genres: List[str],
                  exclude_genres: List[str],
                  year_min: int | None, year_max: int | None,
                  min_count: int) -> List[int]:
    keep: List[int] = []
    for j in candidates:
        mid = IID_INV[j]
        gs  = GENRES_BY_MID.get(mid, [])
        yr  = YEAR_BY_MID.get(mid, None)
        cnt = int(ITEM_COUNTS[j])
        if include_genres and not any(g in gs for g in include_genres):
            continue
        if exclude_genres and any(g in gs for g in exclude_genres):
            continue
        if year_min and yr and yr < year_min: 
            continue
        if year_max and yr and yr > year_max:
            continue
        if cnt < min_count:
            continue
        keep.append(j)
    return keep

def recency_weight(year: int | None, ref_year: int, strength: float) -> float:
    """Return weight ∈ (0, 1.5] that increases for newer titles; strength∈[0,1]."""
    if not year or strength <= 0:
        return 1.0
    # Exponential decay back 30 years as a rough window
    age = max(0, ref_year - year)
    base = math.exp(-age / 30.0)  # ~0.97 per year
    return 1.0 + strength * (base - 0.5)  # shift so mid≈1, cap via strength

def apply_recency(scores: Dict[int,float], strength: float) -> Dict[int,float]:
    if strength <= 0:
        return scores
    ref_year = int(pd.Timestamp.now().year)
    out = {}
    for j, sc in scores.items():
        mid = IID_INV[j]
        out[j] = sc * recency_weight(YEAR_BY_MID.get(mid), ref_year, strength)
    return out

def add_serendipity(scores: Dict[int,float], epsilon: float, seed: int = 42) -> Dict[int,float]:
    """Add tiny jitter to surface long-tail items; epsilon∈[0,1]."""
    if epsilon <= 0:
        return scores
    rng = np.random.default_rng(seed + int(time.time()) // 30)  # refresh every ~30s
    out = {}
    for j, sc in scores.items():
        noise = rng.random() * epsilon * 0.05  # up to 5% bump at epsilon=1
        out[j] = sc * (1.0 + noise)
    return out

def mmr_diversify(ranked_idxs: List[int], top_n: int, lam: float = 0.7) -> List[int]:
    """
    MMR: argmax_i [ λ * rel(i) - (1-λ) * max_{j∈S} sim(i,j) ]
    rel = rank-based; sim via item cosine. λ↑ = relevance, λ↓ = diversity.
    """
    if top_n <= 1 or not ranked_idxs:
        return ranked_idxs[:top_n]
    rel = {j: (len(ranked_idxs) - r) / len(ranked_idxs) for r, j in enumerate(ranked_idxs)}
    S: List[int] = [ranked_idxs[0]]
    C = set(ranked_idxs[1:])
    while len(S) < min(top_n, len(ranked_idxs)) and C:
        best, best_val = None, -1e9
        for i in list(C):
            # similarity to set S
            sim_to_S = 0.0
            for j in S:
                sim_to_S = max(sim_to_S, float((R_ITEM[:, i].T @ R_ITEM[:, j]).toarray().ravel()[0]))
            val = lam * rel.get(i, 0.0) - (1.0 - lam) * sim_to_S
            if val > best_val:
                best_val, best = val, i
        S.append(best); C.remove(best)
    return S

# -------------------- Sidebar: Controls + API keys --------------------
with st.sidebar:
    st.header("⚙️ Settings")
    mode  = st.selectbox("Scoring mode", ["hybrid","user","item"], index=0,
                         help="hybrid = UserCF ⨉ ItemCF blend; user = neighbors by user; item = neighbors by your watched items")
    top_n = st.slider("Top-N", 5, 50, 12, 1)
    ku    = st.slider("User-CF neighbors (ku)", 10, 300, 80, 5, help="How many similar users inform your scores")
    ki    = st.slider("Item-CF neighbors (ki)", 20, 400, 140, 10, help="How many similar items per watched item")
    alpha = st.slider("Hybrid weight α (user ↔ item)", 0.0, 1.0, 0.6, 0.05, help="Higher α favors user-based signals")

    st.divider()
    st.markdown("### Filters")
    include_g = st.multiselect("Include any of genres", options=ALL_GENRES, default=[],
                               help="Show only items matching at least one of these genres")
    exclude_g = st.multiselect("Exclude genres", options=ALL_GENRES, default=[],
                               help="Hide any item containing these genres")

    min_year = int(pd.Series([y for y in DF_MOVIES["year"] if y]).min() or 1900)
    max_year = int(pd.Series([y for y in DF_MOVIES["year"] if y]).max() or 2025)
    yr_min, yr_max = st.slider("Year range", min_year, max_year, value=(1990, max_year),
                               help="Restrict results to a specific release window")
    min_cnt = st.slider("Minimum rating count", 1, int(ITEM_COUNTS.max()), value=5, step=1,
                        help="Require at least this many ratings for robustness")

    st.divider()
    st.markdown("### Re-ranking")
    beta  = st.slider("Genre boost (β)", 0.0, 1.0, 0.35, 0.05, help="Favor items matching your inferred genres")
    recency = st.slider("Recency boost", 0.0, 1.0, 0.30, 0.05, help="Prefer newer titles a bit more")
    serend = st.slider("Serendipity (exploration)", 0.0, 1.0, 0.10, 0.05, help="Add tiny noise to discover long-tail items")
    lam   = st.slider("MMR diversity (λ)", 0.1, 0.95, 0.70, 0.05, help="Higher=more relevance, lower=more diversity")

    st.divider()
    st.markdown("### API Keys (optional)")
    tmdb_key_default = os.getenv("TMDB_API_KEY", "")
    omdb_key_default = os.getenv("OMDB_API_KEY", "")
    tmdb_key = st.text_input("TMDb API Key", type="password", value=tmdb_key_default,
                             help="Used for posters. Free key at themoviedb.org")
    omdb_key = st.text_input("OMDb API Key", type="password", value=omdb_key_default,
                             help="Used for plot/IMDB rating. Get a key at omdbapi.com")
    use_posters = st.toggle("Use TMDb posters", value=bool(tmdb_key))
    use_omdb    = st.toggle("Use OMDb metadata", value=bool(omdb_key))
    st.caption(f"Model meta: Recall@10 (LOO) = {B['meta'].get('recall10')}, Threads = {B['meta'].get('threads')}")

# -------------------- Personalization source (silent) --------------------
profile_src = "session"
affinity = session_genre_affinity()
if USER_ID_PARAM is not None and USER_ID_PARAM in UID_MAP:
    affinity_ds = user_genre_affinity_from_dataset(USER_ID_PARAM)
    if affinity_ds:
        affinity = affinity_ds
        profile_src = f"user_id={USER_ID_PARAM}"

# Utilities to set or copy a permalink with user_id
def set_permalink_user(uid: int):
    st.query_params["user_id"] = str(uid)

# -------------------- Recommender core --------------------
def recommend_scores_for_uid(uidx: int) -> Dict[int,float]:
    if mode == "user":
        return _usercf_scores(uidx, ku)
    elif mode == "item":
        return _itemcf_scores(uidx, ki)
    else:
        return _hybrid_scores(uidx, ku=ku, ki=ki, alpha=alpha)

def recommend_now() -> pd.DataFrame:
    # 1) Base ranking
    if USER_ID_PARAM is not None and USER_ID_PARAM in UID_MAP:
        uidx = UID_MAP[USER_ID_PARAM]
        scores = recommend_scores_for_uid(uidx)
        ranked = sorted(scores, key=scores.get, reverse=True)
    else:
        ranked = list(POP_IDX)

    # Convert to dict for boosts (top slice for speed)
    base_dict = {j: (1.0 - (r/len(ranked))) for r, j in enumerate(ranked[:3000])}

    # 2) Boost by inferred genres
    boosted = boost_by_genre(base_dict, beta=beta, affinity=affinity)

    # 3) Recency & Serendipity
    boosted = apply_recency(boosted, recency)
    boosted = add_serendipity(boosted, serend)

    ranked = sorted(boosted, key=boosted.get, reverse=True)

    # 4) Filters
    ranked = apply_filters(ranked, include_g, exclude_g, yr_min, yr_max, min_cnt)

    # 5) Diversity (MMR)
    ranked = mmr_diversify(ranked, top_n=top_n, lam=lam) if ranked else ranked

    return _idx_to_frame(ranked[:top_n])

# -------------------- Optional: posters & OMDb --------------------
@st.cache_data(show_spinner=False, ttl=3600)
def tmdb_poster(title: str, year: int | None, key: str):
    if not key or not title:
        return None
    try:
        q = {"api_key": key, "query": title}
        if year: q["year"] = year
        r = requests.get("https://api.themoviedb.org/3/search/movie", params=q, timeout=8)
        r.raise_for_status()
        res = r.json().get("results", [])
        if not res: return None
        path = res[0].get("poster_path")
        return f"https://image.tmdb.org/t/p/w342{path}" if path else None
    except Exception:
        return None

@st.cache_data(show_spinner=False, ttl=1800)
def omdb_meta(title: str, year: int | None, key: str):
    if not key or not title:
        return {}
    try:
        q = {"t": title, "apikey": key}
        if year: q["y"] = str(year)
        r = requests.get("https://www.omdbapi.com/", params=q, timeout=8)
        r.raise_for_status()
        j = r.json()
        if j.get("Response") != "True": return {}
        return {
            "imdbRating": j.get("imdbRating"),
            "Runtime": j.get("Runtime"),
            "Genre": j.get("Genre"),
            "Plot": j.get("Plot"),
        }
    except Exception:
        return {}

# -------------------- Main panel: Recommendations --------------------
df_rec = recommend_now()
st.subheader("Recommended for you")
st.caption(f"Personalization source: **{profile_src}**  •  Add 👍 to refine in-session  •  Optional: pass ?user_id=<id>")

if USER_ID_PARAM is not None and USER_ID_PARAM in UID_MAP:
    col = st.columns([1,1,2,2])[0]
    with col:
        if st.button("Copy permalink for this user", use_container_width=True):
            set_permalink_user(USER_ID_PARAM)
            st.success("Query parameter set. Share this URL to reproduce personalization.")

if df_rec.empty:
    st.warning("No results with current filters. Try relaxing constraints.")
else:
    show_df = df_rec.copy()
    if use_posters:
        show_df["poster"] = [tmdb_poster(t, y, tmdb_key) for t, y in zip(show_df["title"], show_df["year"])]
    if use_omdb:
        metas = [omdb_meta(t, y, omdb_key) for t, y in zip(show_df["title"], show_df["year"])]
        show_df["imdbRating"] = [m.get("imdbRating") for m in metas]
        show_df["Plot"]       = [m.get("Plot") for m in metas]

    st.dataframe(show_df, use_container_width=True, hide_index=True)
    st.download_button("Download CSV", show_df.to_csv(index=False).encode("utf-8"),
                       file_name="recs.csv", mime="text/csv")

# -------------------- Quick feedback (implicit personalization) --------------------
st.markdown("#### Quick feedback (optional)")
st.caption("Click 👍 to like a title. This silently tunes your session profile (genre boost).")
cols = st.columns(len(df_rec) if len(df_rec) else 1)
for i, row in df_rec.iterrows():
    with cols[i % len(cols)]:
        if st.button("👍", key=f"like_{int(row.movieId)}"):
            st.session_state.session_likes.add(int(row.movieId))
            st.rerun()  # modern API

# -------------------- Similar movies expander --------------------
with st.expander("Find similar to a title"):
    options = DF_MOVIES.sort_values("title")[["movieId","title"]]
    pick = st.selectbox("Choose a title", options["title"].tolist(), index=0)
    mid = int(options.loc[options["title"] == pick, "movieId"].iloc[0])
    itx = IID_MAP.get(mid)
    if itx is not None:
        sims = (R_ITEM.T @ R_ITEM[:, itx]).toarray().ravel()
        sims[itx] = 0.0
        k = st.slider("How many similar?", 5, 30, 12)
        neigh = np.argpartition(-sims, k)[:k]
        neigh = neigh[np.argsort(-sims[neigh])]
        df_sim = _idx_to_frame(neigh.tolist())
        if use_posters:
            df_sim["poster"] = [tmdb_poster(t, parse_year(t), tmdb_key) for t in df_sim["title"]]
        st.dataframe(df_sim, use_container_width=True, hide_index=True)

# -------------------- About / Help --------------------
st.markdown("---")
with st.expander("ℹ️ How this works & what the controls do"):
    st.markdown("""
**Pipeline**
- **UserCF**: Finds neighbors with similar rating patterns; recommends what they liked.
- **ItemCF**: Finds movies similar to the ones you rated; recommends analogues.
- **Hybrid (α)**: Weighted blend of UserCF and ItemCF (α=1 → pure UserCF, α=0 → pure ItemCF).

**Personalization (no prompts)**
- If the URL has `?user_id=123` and that user exists in the dataset, we infer a **genre profile** from their ratings and boost matching films.
- Without a user id, we start with a **Bayesian-smoothed popularity** list; as you click **👍**, we update a session genre profile to nudge results.

**Filters**
- Include / Exclude genres: quick content control.
- Year range & Minimum rating count: quality guardrails (avoid ultra-obscure items if you want).

**Re-ranking**
- **Genre boost (β)**: multiplies scores by `(1 + β * affinity_to_item_genres)`.
- **Recency boost**: gently favors newer titles using an exponential age decay.
- **Serendipity**: tiny score noise so long-tail titles occasionally surface.
- **MMR diversity (λ)**: trades off relevance vs. similarity to already-picked items to reduce near-duplicates.

**APIs**
- **TMDb** (posters): paste your key; we call `/search/movie` and build poster URLs (`image.tmdb.org/t/p/w342/...`).
- **OMDb** (metadata): paste your key; we call `/?t=<title>&y=<year>` to fetch Plot and IMDb rating.
- Keys are used **only in-session** and not saved to disk.

**Links**
- Share a personalized view by adding `?user_id=<dataset_user_id>` to the app URL (use the “Copy permalink for this user” button).
""")
