# Mathematical Machine Learning – Tutorial 3  


## Exercise 2 

In [1]:
import re
from pathlib import Path

import numpy as np
import pandas as pd
from ucimlrepo import fetch_ucirepo
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

### a) Inspecting the datasets

In [2]:
# fetch dataset 
wine_quality = fetch_ucirepo(id=186) 
  
# data (as pandas dataframes) 
X = wine_quality.data.features 
y = wine_quality.data.targets 
  
# metadata 
#print(wine_quality.metadata) 
  
# variable information 
print(wine_quality.variables) 


                    name     role         type demographic  \
0          fixed_acidity  Feature   Continuous        None   
1       volatile_acidity  Feature   Continuous        None   
2            citric_acid  Feature   Continuous        None   
3         residual_sugar  Feature   Continuous        None   
4              chlorides  Feature   Continuous        None   
5    free_sulfur_dioxide  Feature   Continuous        None   
6   total_sulfur_dioxide  Feature   Continuous        None   
7                density  Feature   Continuous        None   
8                     pH  Feature   Continuous        None   
9              sulphates  Feature   Continuous        None   
10               alcohol  Feature   Continuous        None   
11               quality   Target      Integer        None   
12                 color    Other  Categorical        None   

               description units missing_values  
0                     None  None             no  
1                     None  Non

In [3]:
# fetch dataset 
statlog_shuttle = fetch_ucirepo(id=148) 
  
# data (as pandas dataframes) 
X = statlog_shuttle.data.features 
y = statlog_shuttle.data.targets 
  
# metadata 
#print(statlog_shuttle.metadata) 
  
# variable information 
print(statlog_shuttle.variables) 


        name     role     type demographic description units missing_values
0   Rad Flow  Feature  Integer        None        None  None             no
1  Fpv Close  Feature  Integer        None        None  None             no
2   Fpv Open  Feature  Integer        None        None  None             no
3       High  Feature  Integer        None        None  None             no
4     Bypass  Feature  Integer        None        None  None             no
5  Bpv Close  Feature  Integer        None        None  None             no
6   Bpv Open  Feature  Integer        None        None  None             no
7      class   Target  Integer        None        None  None             no


## Exercise 2 – Modeling Inputs / Outputs

We consider two UCI datasets:

1. Wine Quality dataset (red / white Vinho Verde wines).  
2. Statlog (Shuttle) dataset.

---

### (a) Description of the datasets and variables

#### Wine Quality

- Two related datasets: red (1599 samples) and white (4898 samples) wines.
- Each row corresponds to one wine sample.
- **Inputs (11 continuous features):**
  1. fixed acidity  
  2. volatile acidity  
  3. citric acid  
  4. residual sugar  
  5. chlorides  
  6. free sulfur dioxide  
  7. total sulfur dioxide  
  8. density  
  9. pH  
  10. sulphates  
  11. alcohol
- **Output:** `quality` – integer sensory score on a scale from 0 to 10 (the scores that actually occur in the data range from 3 to 9).

Typical ranges (roughly):
- fixed acidity: about 4–16 g/dm³,  
- alcohol: about 8–15 % vol,  
- quality: small integer.
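
The ranges quoted above can be checked with pandas. The following is a minimal sketch on a small hand-made frame: the values in `toy` are placeholders rather than rows from the dataset, and `feature_ranges` is a hypothetical helper introduced only for this illustration. With the real data one would pass `wine_quality.data.features` instead of `toy`.

```python
import pandas as pd


def feature_ranges(df: pd.DataFrame) -> pd.DataFrame:
    """Return the minimum and maximum of every numeric column."""
    return df.describe().loc[["min", "max"]].T


# Placeholder values for illustration only, not actual dataset rows.
toy = pd.DataFrame(
    {
        "fixed_acidity": [7.4, 6.3, 11.2],
        "alcohol": [9.4, 12.8, 10.0],
    }
)

ranges = feature_ranges(toy)
print(ranges)
```

The result has one row per feature and the two columns `min` and `max`, which makes it easy to compare against the rough ranges listed above.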

#### Statlog (Shuttle)

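- Around 58,000 instances; 9 input attributes and 1 output label.
- Inputs: 9 numerical attributes (time and sensor measurements of the space shuttle’s control system). Note that the `variables` table printed above lists only seven feature rows, and their names coincide with the class names below, so that table appears to mislabel the inputs.
- Output: 7-class categorical label describing the shuttle state:
  1. Rad Flow
  2. Fpv Close
  3. Fpv Open
  4. High
  5. Bypass
  6. Bpv Close
  7. Bpv Open
- Strong class imbalance: roughly 80% of the samples belong to class 1 (Rad Flow).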
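
The class imbalance can be quantified by the relative class frequencies. The sketch below uses synthetic labels, not the real Shuttle labels, and `class_shares` is a hypothetical helper. With the real data one could pass `statlog_shuttle.data.targets.iloc[:, 0]`.

```python
import pandas as pd


def class_shares(labels: pd.Series) -> pd.Series:
    """Relative frequency of each class, sorted by class label."""
    return labels.value_counts(normalize=True).sort_index()


# Synthetic labels for illustration: 8 samples of class 1, one each of 4 and 5.
toy_labels = pd.Series([1] * 8 + [4, 5], name="class")

shares = class_shares(toy_labels)
print(shares)
```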

---

### (b) Modeling via random variables / vectors

#### Wine Quality

Let
\[
X =
\begin{pmatrix}
X_1 \\ \vdots \\ X_{11}
\end{pmatrix}
\in \mathbb{R}^{11}
\]
be the random input vector of physicochemical features:

- \(X_1\): fixed acidity,  
- \(\dots\),  
- \(X_{11}\): alcohol content.

Let \(Y \in \{0,1,\dots,10\}\) be the discrete output random variable representing the quality score.

Each row of the dataset is modeled as an independent realization \((x^{(k)}, y^{(k)})\) of \((X, Y)\), drawn from the joint distribution \(P_{X,Y}\).

#### Statlog (Shuttle)

Let
\[
Z \in \mathbb{R}^9
\]
be the 9-dimensional input vector (time + 8 sensor measurements), and let
\[
C \in \{1,2,\dots,7\}
\]
be the class label representing the shuttle state.

As before, each dataset row is modeled as an independent realization \((z^{(k)}, c^{(k)})\) of \((Z, C)\) with joint distribution \(P_{Z,C}\).

---

### (c) Type of machine learning tasks

#### Wine Quality

- The dataset is mainly used for **supervised learning**.
- Two standard views:
  1. **Regression:** predict the numerical quality score \(Y\) from inputs \(X\).
  2. **Classification:** treat quality as a discrete class (possibly ordinal), e.g. “low”, “medium”, “high”.

Unsupervised tasks such as clustering wines are also possible but not the main focus.

#### Statlog (Shuttle)

- This is a **supervised multi-class classification** problem:
  - Input: 9 real-valued attributes.
  - Output: class \(C \in \{1,\dots,7\}\).
- Because of the class imbalance, metrics like macro F1-score can be more informative than plain accuracy.
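
To illustrate why accuracy can mislead under class imbalance, the following sketch compares accuracy and macro F1 for a trivial classifier that always predicts the majority class. The labels are synthetic and chosen only for illustration.

```python
from sklearn.metrics import accuracy_score, f1_score

# Synthetic labels: 90 samples of class 1 and 10 samples of class 2.
y_true = [1] * 90 + [2] * 10
# A trivial classifier that always predicts the majority class.
y_pred = [1] * 100

acc = accuracy_score(y_true, y_pred)
# zero_division=0 scores the never-predicted class 2 with F1 = 0.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"accuracy = {acc:.3f}")
print(f"macro F1 = {macro_f1:.3f}")
```

Accuracy is 0.9 although class 2 is never detected; macro F1 averages the per-class F1 scores (18/19 for class 1 and 0 for class 2) and therefore drops to about 0.47.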

---

### (d) Specific ML problem formulations

#### Wine Quality – Regression problem

**Problem statement.**  
Given the 11-dimensional physicochemical measurements \(X\) of a wine, predict its quality score \(Y\).

We learn a function
\[
f:\mathbb{R}^{11} \to \mathbb{R}
\]
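that approximates the regression function, i.e. the minimizer of the expected squared error \(\mathbb{E}[(Y - f(X))^2]\),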
\[
f^\*(x) = \mathbb{E}[Y \mid X=x]
\]
using a regression model (e.g. linear regression or random forest).  
Optionally, we round the predicted value to the nearest integer to obtain a discrete quality label.
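
A minimal sketch of the regression-then-round idea, assuming scikit-learn's `LinearRegression` as the model. The data are synthetic stand-ins for the wine features, and clipping to the range 0 to 10 is an extra safeguard added here, not something the problem statement requires.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the wine data: 200 samples, 11 features.
X_toy = rng.normal(size=(200, 11))
# Synthetic scores centred at 6, driven by the first feature plus noise.
y_toy = 6.0 + X_toy[:, 0] + rng.normal(scale=0.5, size=200)

reg = LinearRegression().fit(X_toy, y_toy)

# Real-valued predictions approximating E[Y | X = x].
y_hat = reg.predict(X_toy)
# Round to the nearest integer and clip to the valid score range 0..10.
y_label = np.clip(np.rint(y_hat), 0, 10).astype(int)

print(y_hat[:5])
print(y_label[:5])
```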

Alternative classification problem:

- Define a binary label:
  \[
  Y' =
    \begin{cases}
      1, & \text{quality} \ge 7 \quad (\text{“good”})\\
      0, & \text{quality} \le 6 \quad (\text{“not good”})
    \end{cases}
  \]
- Learn a classifier \(g: \mathbb{R}^{11} \to \{0,1\}\).
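
The binary label defined above can be derived from the quality column with a single comparison. The scores below are hypothetical values for illustration.

```python
import pandas as pd

# Hypothetical quality scores for illustration.
quality = pd.Series([3, 5, 6, 7, 8], name="quality")

# Y' = 1 if quality >= 7 ("good"), else 0 ("not good").
y_binary = (quality >= 7).astype(int)

print(y_binary.tolist())
```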

#### Statlog (Shuttle) – Multi-class classification

**Problem statement.**  
Given a 9-dimensional vector of shuttle sensor readings \(Z\), predict the shuttle’s operational state \(C \in \{1,\dots,7\}\).

We learn a classifier
\[
h: \mathbb{R}^9 \to \{1,\dots,7\},
\]
for example a multi-class logistic regression, decision tree, or neural network, trained to approximate the Bayes classifier, which minimizes the probability of misclassification:
\[
h^\*(z) = \arg\max_{c} \mathbb{P}(C=c \mid Z=z).
\]
In evaluation, we take class imbalance into account using suitable metrics (per-class recall, macro F1, etc.).
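
In practice the arg max is taken over estimated class probabilities. The sketch below uses a hypothetical matrix of posterior estimates (one row per input, one column per class) to show the decision rule; the numbers are made up.

```python
import numpy as np

# Hypothetical estimated posteriors P(C = c | Z = z) for 3 inputs, classes 1..7.
posteriors = np.array(
    [
        [0.80, 0.05, 0.05, 0.04, 0.03, 0.02, 0.01],
        [0.10, 0.10, 0.10, 0.50, 0.10, 0.05, 0.05],
        [0.05, 0.05, 0.05, 0.05, 0.70, 0.05, 0.05],
    ]
)

# argmax returns a 0-based column index; add 1 to obtain labels in {1, ..., 7}.
predicted_class = posteriors.argmax(axis=1) + 1

print(predicted_class)
```

With a fitted scikit-learn classifier, the columns of `predict_proba` follow the order of `clf.classes_`, so indexing `clf.classes_` with the arg max is the safer way to map back to labels.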


## Exercise 3 – Detecting Spam (Spambase)

The Spambase dataset uses 57 input features to describe emails and a binary label
indicating whether an email is spam (1) or not spam (0).

---

### (a) Description of the 57 input variables

The 57 features can be grouped into three categories:

1. **48 word frequency features**, of the form
   \[
   \text{word\_freq\_<WORD>} = 100 \cdot \frac{\text{number of occurrences of <WORD>}}{\text{total number of words in the email}}.
   \]
   The words are (in order):

   1. `word_freq_make`  
   2. `word_freq_address`  
   3. `word_freq_all`  
   4. `word_freq_3d`  
   5. `word_freq_our`  
   6. `word_freq_over`  
   7. `word_freq_remove`  
   8. `word_freq_internet`  
   9. `word_freq_order`  
   10. `word_freq_mail`  
   11. `word_freq_receive`  
   12. `word_freq_will`  
   13. `word_freq_people`  
   14. `word_freq_report`  
   15. `word_freq_addresses`  
   16. `word_freq_free`  
   17. `word_freq_business`  
   18. `word_freq_email`  
   19. `word_freq_you`  
   20. `word_freq_credit`  
   21. `word_freq_your`  
   22. `word_freq_font`  
   23. `word_freq_000`  
   24. `word_freq_money`  
   25. `word_freq_hp`  
   26. `word_freq_hpl`  
   27. `word_freq_george`  
   28. `word_freq_650`  
   29. `word_freq_lab`  
   30. `word_freq_labs`  
   31. `word_freq_telnet`  
   32. `word_freq_857`  
   33. `word_freq_data`  
   34. `word_freq_415`  
   35. `word_freq_85`  
   36. `word_freq_technology`  
   37. `word_freq_1999`  
   38. `word_freq_parts`  
   39. `word_freq_pm`  
   40. `word_freq_direct`  
   41. `word_freq_cs`  
   42. `word_freq_meeting`  
   43. `word_freq_original`  
   44. `word_freq_project`  
   45. `word_freq_re`  
   46. `word_freq_edu`  
   47. `word_freq_table`  
   48. `word_freq_conference`  

2. **6 character frequency features**, also in percent:
   \[
   \text{char\_freq\_<CHAR>} = 100 \cdot \frac{\text{number of <CHAR> characters}}{\text{total number of characters}}.
   \]

   49. `char_freq_;` (frequency of `;`)  
   50. `char_freq_(` (frequency of `(`)  
   51. `char_freq_[` (frequency of `[`)  
   52. `char_freq_!` (frequency of `!`)  
   53. `char_freq_$` (frequency of `$`)  
   54. `char_freq_#` (frequency of `#`)  

3. **3 capital-letter run-length features**:
   - 55. `capital_run_length_average`: average length of uninterrupted sequences of capital letters.  
   - 56. `capital_run_length_longest`: length of the longest uninterrupted sequence of uppercase letters.  
   - 57. `capital_run_length_total`: total number of uppercase letters (sum of all run lengths).

The **class label** is a separate binary variable indicating spam (1) or non-spam (0).
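
As a worked example of the three feature types, the sketch below evaluates one feature of each kind on a short made-up text. The tokenization rule (maximal alphanumeric runs, lower-cased) is an assumption; the dataset documentation does not prescribe an implementation.

```python
import re

text = "FREE money!!! Click to get your free prize NOW"

# Tokenization rule is an assumption: maximal alphanumeric runs, lower-cased.
tokens = [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

# word_freq_free: percentage of tokens equal to "free".
word_freq_free = 100.0 * tokens.count("free") / len(tokens)

# char_freq_!: percentage of characters equal to "!".
char_freq_excl = 100.0 * text.count("!") / len(text)

# Capital run lengths: lengths of maximal runs of uppercase letters.
runs = [len(r) for r in re.findall(r"[A-Z]+", text)]
capital_run_length_average = sum(runs) / len(runs)
capital_run_length_longest = max(runs)
capital_run_length_total = sum(runs)

print(word_freq_free, char_freq_excl)
print(capital_run_length_average, capital_run_length_longest, capital_run_length_total)
```

Here the text has 9 tokens, 2 of which are "free", and 46 characters, 3 of which are `!`. The capital runs are `FREE`, `C` and `NOW`, with lengths 4, 1 and 3.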

---

### (b) Alternative feature set for spam detection

Beyond the handcrafted Spambase features, modern spam filters often use richer feature sets, e.g.:

1. **Character-based features**
   - Frequencies of punctuation (`!`, `?`, `*`, etc.).
   - Ratios of uppercase to lowercase letters.
   - Character n-grams like repeated symbols (“$$”, “!!!”).

2. **Word-based features**
   - Bag-of-words or TF–IDF representations of tokens.
   - Presence of typical spam keywords such as “lottery”, “prize”, “winner”, “investment”, “offer”.
   - Word n-grams (bigrams, trigrams) capturing short phrases.

3. **HTML / tag-based features**
   - Counts of `<a>`, `<img>`, `<script>` tags in HTML emails.
   - Presence of obfuscated links or hidden text.
   - Whether the email is plain-text vs. HTML.

4. **Structural features**
   - Email length (characters, words, lines).
   - Number of URLs and their average lengths.
   - Number of attachments or images.

5. **Header-based features**
   - Sender domain (freemail vs. corporate).
   - Mismatch between `From` and `Reply-To` domains.
   - Subject line properties (length, presence of all caps, many exclamation marks, etc.).

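Such feature sets can be combined with modern classifiers (e.g. gradient boosting, neural networks). They can be expected to transfer better to current email traffic than the 57-feature Spambase representation, whose word list (e.g. `george`, `hp`, `650`) is tied to the mailbox the data was collected from.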
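
A few of the character-based and structural features listed above can be computed with the standard library alone. The sketch below is illustrative: `simple_text_features`, its feature names and the URL pattern are assumptions made for this example. Note that `upper_ratio` is the share of uppercase letters among all letters, a bounded variant of the uppercase-to-lowercase ratio.

```python
import re

URL_RE = re.compile(r"https?://\S+")


def simple_text_features(text: str) -> dict:
    """Compute a few illustrative character-based and structural features."""
    letters = [ch for ch in text if ch.isalpha()]
    upper = sum(ch.isupper() for ch in letters)
    urls = URL_RE.findall(text)
    return {
        "n_chars": len(text),
        "n_words": len(text.split()),
        "n_exclamations": text.count("!"),
        "upper_ratio": upper / len(letters) if letters else 0.0,
        "n_urls": len(urls),
        # True if one of the symbols ! $ ? * occurs twice in a row.
        "has_repeated_symbols": bool(re.search(r"([!$?*])\1", text)),
    }


example = "WIN a prize!! Visit http://example.com now"
feats = simple_text_features(example)
print(feats)
```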


In [4]:
import re
import numpy as np
import pandas as pd


In [5]:
# Words and characters as defined by Spambase
SPAMBASE_WORDS = [
    "make", "address", "all", "3d", "our", "over", "remove", "internet",
    "order", "mail", "receive", "will", "people", "report", "addresses",
    "free", "business", "email", "you", "credit", "your", "font", "000",
    "money", "hp", "hpl", "george", "650", "lab", "labs", "telnet", "857",
    "data", "415", "85", "technology", "1999", "parts", "pm", "direct",
    "cs", "meeting", "original", "project", "re", "edu", "table",
    "conference",
]

SPAMBASE_CHARS = [';', '(', '[', '!', '$', '#']


### Programming Task

In [6]:
# fetch dataset 
spambase = fetch_ucirepo(id=94) 
  
# data (as pandas dataframes) 
X = spambase.data.features 
y = spambase.data.targets 
  
# metadata 
#print(spambase.metadata) 
  
# variable information 
print(spambase.variables) 

                          name     role        type demographic  \
0               word_freq_make  Feature  Continuous        None   
1            word_freq_address  Feature  Continuous        None   
2                word_freq_all  Feature  Continuous        None   
3                 word_freq_3d  Feature  Continuous        None   
4                word_freq_our  Feature  Continuous        None   
5               word_freq_over  Feature  Continuous        None   
6             word_freq_remove  Feature  Continuous        None   
7           word_freq_internet  Feature  Continuous        None   
8              word_freq_order  Feature  Continuous        None   
9               word_freq_mail  Feature  Continuous        None   
10           word_freq_receive  Feature  Continuous        None   
11              word_freq_will  Feature  Continuous        None   
12            word_freq_people  Feature  Continuous        None   
13            word_freq_report  Feature  Continuous        Non

In [7]:
# All feature names
feature_names = list(X.columns)

# Word-based features
word_features = [f for f in feature_names if f.startswith("word_freq_")]
# Strip the prefix to get just the word
keywords = [f.replace("word_freq_", "") for f in word_features]

print("Number of word features:", len(word_features))
print("Keywords:")
print(keywords)

# Character-based features
char_features = [f for f in feature_names if f.startswith("char_freq_")]
print("Character features:", char_features)

# Capital-run features
capital_features = [f for f in feature_names if f.startswith("capital_run_length_")]
print("Capital run features:", capital_features)


Number of word features: 48
Keywords:
['make', 'address', 'all', '3d', 'our', 'over', 'remove', 'internet', 'order', 'mail', 'receive', 'will', 'people', 'report', 'addresses', 'free', 'business', 'email', 'you', 'credit', 'your', 'font', '000', 'money', 'hp', 'hpl', 'george', '650', 'lab', 'labs', 'telnet', '857', 'data', '415', '85', 'technology', '1999', 'parts', 'pm', 'direct', 'cs', 'meeting', 'original', 'project', 're', 'edu', 'table', 'conference']
Character features: ['char_freq_;', 'char_freq_(', 'char_freq_[', 'char_freq_!', 'char_freq_$', 'char_freq_#']
Capital run features: ['capital_run_length_average', 'capital_run_length_longest', 'capital_run_length_total']


In [9]:
# ---------------------------------------------------------
# 1) Load Spambase from UCI repository
# ---------------------------------------------------------
spambase = fetch_ucirepo(id=94)  # Spambase dataset

X_full = spambase.data.features.copy()      # 57 input features
y = spambase.data.targets.iloc[:, 0]       # target column (0 = ham, 1 = spam)

# Get feature groups directly from the column names
WORD_COLS = [c for c in X_full.columns if c.startswith("word_freq_")]
CHAR_COLS = [c for c in X_full.columns if c.startswith("char_freq_")]
CAPITAL_COLS = [c for c in X_full.columns if c.startswith("capital_run_length_")]

# We'll keep this canonical column order
FEATURE_COLS = list(X_full.columns)


# ---------------------------------------------------------
# 2) Feature extraction from raw email text
# ---------------------------------------------------------

WORD_RE = re.compile(r"[A-Za-z0-9]+")
CAPITAL_RUN_RE = re.compile(r"[A-Z]+")


def extract_features_from_text(text: str) -> pd.Series:
    """
    Compute the 57 Spambase features from raw email text.
    Returns a pandas Series with index = FEATURE_COLS.
    """

    # ----- word frequencies -----
    tokens = [m.group(0).lower() for m in WORD_RE.finditer(text)]
    total_words = len(tokens)
    word_counts = {}

    for col in WORD_COLS:
        word = col.replace("word_freq_", "")
        word_counts[word] = 0

    for t in tokens:
        if t in word_counts:
            word_counts[t] += 1

    features = {}

    for col in WORD_COLS:
        word = col.replace("word_freq_", "")
        count = word_counts[word]
        freq = 100.0 * count / total_words if total_words > 0 else 0.0
        features[col] = freq

    # ----- character frequencies -----
    total_chars = len(text)
    for col in CHAR_COLS:
        ch = col.replace("char_freq_", "")
        count_ch = text.count(ch)
        freq_ch = 100.0 * count_ch / total_chars if total_chars > 0 else 0.0
        features[col] = freq_ch

    # ----- capital run length features -----
    runs = [len(m.group(0)) for m in CAPITAL_RUN_RE.finditer(text)]
    if runs:
        total_cap = sum(runs)
        longest = max(runs)
        avg = total_cap / len(runs)
    else:
        total_cap = 0
        longest = 0
        avg = 0.0

    features["capital_run_length_average"] = avg
    features["capital_run_length_longest"] = longest
    features["capital_run_length_total"] = total_cap

    # Return as Series in canonical column order
    return pd.Series(features)[FEATURE_COLS]


def extract_features_from_file(path: str | Path) -> pd.DataFrame:
    """Read a text file and return a 1×57 DataFrame of features."""
    text = Path(path).read_text(encoding="latin1", errors="ignore")
    s = extract_features_from_text(text)
    return s.to_frame().T   # shape (1, 57)


# ---------------------------------------------------------
# 3) Train a simple spam classifier on Spambase
# ---------------------------------------------------------

X_train, X_test, y_train, y_test = train_test_split(
    X_full[FEATURE_COLS].values,
    y.values,
    test_size=0.2,
    random_state=42,
    stratify=y.values,
)

clf = LogisticRegression(max_iter=2000)
clf.fit(X_train, y_train)

print("Training accuracy:", clf.score(X_train, y_train))
print("Test accuracy:    ", clf.score(X_test, y_test))


# ---------------------------------------------------------
# 4) Apply to three example emails
# ---------------------------------------------------------

example_files = ["Mails/test_mail01.txt", "Mails/test_mail02.txt", "Mails/test_mail03.txt"]  # adjust names/paths

for fname in example_files:
    if not Path(fname).exists():
        print(f"\nFile {fname} not found – skip.")
        continue

    X_new = extract_features_from_file(fname)[FEATURE_COLS].values
    pred = clf.predict(X_new)[0]
    proba = clf.predict_proba(X_new)[0, 1]

    label = "SPAM" if pred == 1 else "NOT SPAM"
    print(f"\nFile: {fname}")
    print(f"Predicted label: {label}")
    print(f"P(spam) = {proba:.3f}")


Training accuracy: 0.9328804347826087
Test accuracy:     0.9272529858849077

File: Mails/test_mail01.txt
Predicted label: NOT SPAM
P(spam) = 0.175

File: Mails/test_mail02.txt
Predicted label: NOT SPAM
P(spam) = 0.251

File: Mails/test_mail03.txt
Predicted label: SPAM
P(spam) = 0.608
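
The Spambase features live on very different scales: the frequency features are percentages, while `capital_run_length_total` is a raw count. One common variation, which is not part of the solution above, is to standardize the features inside a pipeline before fitting the logistic regression. The sketch below shows the pattern on synthetic data, so its accuracy says nothing about Spambase itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for Spambase: 57 non-negative features on different scales.
n_samples = 400
X_toy = rng.exponential(scale=1.0, size=(n_samples, 57))
X_toy[:, -1] *= 1000.0  # mimic a large-valued count feature
# Synthetic labels that depend on the first and the last feature.
y_toy = (X_toy[:, 0] + X_toy[:, -1] / 1000.0 > 2.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# The scaler is fitted on the training split only, inside the pipeline.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)

test_acc = model.score(X_te, y_te)
print(f"Test accuracy on synthetic data: {test_acc:.3f}")
```

Because the scaler is part of the pipeline, the same transformation is applied automatically when calling `model.predict` on features extracted from new emails.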
