# $$\textbf{TU Berlin Summer School 2025}$$

<br>

<center>
<img src='../storage/images/logo.png' width=900>
</center>

<br>

## $$\textbf{Exercise Session: Basics of Python Programming}$$ 🐍💻

### $\textbf{Welcome to Your First Python Challenge!}$ 🚀

*Today you'll build a complete health risk prediction system using pure Python!* ✨

In [2]:
# 📝 Enter your name here - this helps us track your progress!
YOUR_NAME = "felix"  # Replace with your actual name

In [3]:
# ✅ Version check - making sure you have the right Python version!
import sys
assert sys.version_info >= (3,10), 'You need to be running at least Python version 3.10'
print(f"✨ Great! You're running Python {sys.version_info.major}.{sys.version_info.minor} - you're all set!")

✨ Great! You're running Python 3.13 - you're all set!


In [7]:
# 🧪 Testing setup - This cell is for grading. DO NOT remove it!

# Use unittest asserts for automatic testing
import unittest; t = unittest.TestCase()
from pprint import pprint  # Pretty printing for better output display

# 📊 Helper function to validate percentage values (0.0 to 1.0)
def assert_percentage(val):
    """Ensures a value is a valid percentage between 0 and 1"""
    t.assertGreaterEqual(val, 0.0, f'Percentage ({val}) cannot be < 0')
    t.assertLessEqual(val, 1.0, f'Percentage ({val}) cannot be > 1')
    
print("🔧 Testing framework loaded successfully!")
    

🔧 Testing framework loaded successfully!


# 📋 Exercise Sheet 1: Python Basics 🐍

## $\textbf{Health Risk Prediction Challenge}$ 🏥

This first exercise sheet tests the basic functionality of the Python programming language in the context of a **real-world prediction task**. We consider the problem of predicting the health risk of subjects from their personal data and habits.

### $\textbf{Our Decision Tree}$ 🌳

The decision tree below will guide our predictions:

![](../storage/images/2_day/tree.png)

### $\textbf{Important Rules}$ 📝

**For this exercise sheet, you are required to use only pure Python!** 
- ❌ No external modules (including `NumPy`)
- ✅ Only built-in Python functions and methods
- 🚀 Next time we'll implement the nearest neighbor part using `NumPy`!

*This approach helps you understand the fundamentals before using powerful libraries!* 💪

## 🎯 **Task 1: Classifying a Single Instance** (15 Points)

### $\textbf{What You'll Learn}$ 📚
- Working with **tuples** to represent structured data
- Implementing **conditional logic** (if/else statements)
- Following a **decision tree algorithm**

### $\textbf{Your Mission}$ 🎯

**Patient Data Format:** We represent patient info as a tuple: `(smoker, age, diet)`

**Your Task:** 
- Implement the function `decision()` that takes a patient tuple as input
- Follow the decision tree logic from the image above
- Return either `'less'` or `'more'` (health risk level)
- ⚠️ **Important:** Only these two outputs are valid!

*Think step by step through the decision tree - smoker status first, then age, then diet!* 🤔
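
As a warm-up for the tuple handling and nested conditions this task relies on, here is a minimal sketch that unpacks a patient tuple and branches on its fields. The helper `describe_patient` and the age cut-off of 18 are hypothetical and deliberately do not encode the decision tree from the image.

```python
def describe_patient(x: tuple) -> str:
    """Hypothetical helper: turns a patient tuple into a short description."""
    # Tuple unpacking assigns the three fields to names in one step
    smoker, age, diet = x

    # Nested conditions: the inner age check only runs for smokers
    if smoker == 'yes':
        if age < 18:
            status = 'underage smoker'
        else:
            status = 'adult smoker'
    else:
        status = 'non-smoker'

    return f'{status} with a {diet} diet'


print(describe_patient(('yes', 31, 'good')))  # adult smoker with a good diet
print(describe_patient(('no', 25, 'poor')))   # non-smoker with a poor diet
```

The same pattern (unpack first, then branch from the root of the tree downwards) should carry over to `decision()`.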

In [8]:
def decision(x: tuple) -> str:
    '''
    🌳 This function implements the decision tree represented in the above image. 
    
    📊 Decision Logic:
    1. Check if patient is a smoker
    2. If smoker: check age (≥29.5 → 'more', <29.5 → check diet)
    3. If non-smoker: always 'less'
    4. For young smokers: good diet → 'less', poor diet → 'more'
    
    Args:
        x (tuple): Input tuple containing exactly three values:
            - x[0] (str): Smoker status - 'yes' means smoker, any other value means non-smoker
            - x[1] (int): Age of patient in years  
            - x[2] (str): Diet quality - 'good' means good diet, any other value means poor diet
            
    Returns:
        str: Either 'more' or 'less' representing health risk level
        
    Example:
        >>> decision(('yes', 35, 'good'))
        'more'
        >>> decision(('no', 25, 'poor'))  
        'less'
    '''
    # 🔍 Extract patient information from tuple
    smoker, age, diet = x
    
    # YOUR CODE HERE - Follow the decision tree logic!
    # Hint: Start with checking if the patient is a smoker
    # Then check age and diet as needed
    if smoker=="yes":
        if diet=="good":
            decision= "less"
        else:
            decision = "more"
    else:
        if age<=29.5:
            decision="less"
        else:
            decision="more"
    return decision
        
    # YOUR CODE HERE
    

In [9]:
# 🧪 Test your decision function - let's see if it works!

print("🔬 Testing decision function...")

# Test case 1: smoker, age 31, good diet (asserted below: 'less')
x = ('yes', 31, 'good')
output = decision(x)
print(f'✅ Test 1: decision({x}) --> {output}')
t.assertIsInstance(output, str)
t.assertEqual(output, 'less')

# Test case 2: smoker, age 29, poor diet (asserted below: 'more')
x = ('yes', 29, 'poor')
output = decision(x)
print(f'✅ Test 2: decision({x}) --> {output}')
t.assertIsInstance(output, str)
t.assertEqual(output, 'more')

print("🎉 All tests passed! Your decision function is working correctly!")


🔬 Testing decision function...
✅ Test 1: decision(('yes', 31, 'good')) --> less
✅ Test 2: decision(('yes', 29, 'poor')) --> more
🎉 All tests passed! Your decision function is working correctly!


## 📁 **Task 2: Reading a Dataset from a Text File** (10 Points)

### $\textbf{What You'll Learn}$ 📚
- **File I/O operations** in Python
- **String processing** and data parsing  
- **List comprehensions** and data transformation
- **Error handling** and proper file management

### $\textbf{Your Mission}$ 🎯

The file `health-test.txt` located in `../storage/data/health/` contains several fictitious records of personal data and habits. 

**We split this task into two parts:**
1. **Part A:** Process a single line from the file
2. **Part B:** Load the entire file and process all lines

### $\textbf{Requirements}$ ✅
- Read the file automatically using Python file methods
- Represent the dataset as a **list of tuples** 
- Ensure tuples have the same format: `('yes', 31, 'good')`
- **Always close the file** after reading (use `with` statement!)

### $\textbf{Important Notes}$ ⚠️
- 📝 **Values from files are always strings** - you'll need to convert age to `int`
- 🔚 **Each line ends with `\n`** - remember to strip it!
- 💻 **Windows users:** Avoid older versions of Notepad (they do not display Unix-style `\n` line breaks)
- 📋 **Use Jupyter text editor** or modern editors to inspect files

*File processing is a fundamental skill in data science!* 💡
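
The notes above can be tried out on a throwaway file before touching the real dataset. The sketch below is only an illustration: the temporary file and its two records are made up, and `tempfile` is used purely so the example runs anywhere.

```python
import os
import tempfile

# Write a small throwaway file (its contents are made up for illustration)
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('yes,31,good\nno,25,poor\n')
    path = tmp.name

# The with statement closes the file automatically, even if an error occurs
with open(path) as f:
    raw_lines = f.readlines()

print(f.closed)               # True: the file is closed after the with block
print(raw_lines)              # every line still ends with '\n'

fields = raw_lines[0].strip().split(',')
print(fields)                 # all three values are strings, including the age
print(int(fields[1]) + 1)     # after int(), the age behaves like a number

os.remove(path)               # clean up the temporary file
```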

In [14]:
def parse_line_test(line: str) -> tuple:
    '''
    📝 Takes a line from the file and parses it into a patient tuple.
    
    🔍 Step-by-step process:
    1. Remove the newline character (\n) from the end
    2. Split the line by commas to get individual values
    3. Convert age (middle value) from string to integer
    4. Return as tuple in format: (smoker, age, diet)
    
    Args:
        line (str): A line from the `health-test.txt` file (e.g., "yes,31,good\n")
        
    Returns:
        tuple: A patient tuple (smoker_str, age_int, diet_str)
        
    Example:
        >>> parse_line_test("yes,31,good\n")
        ('yes', 31, 'good')
    '''
    # YOUR CODE HERE
    line = line.strip()
    smoker, age, diet = line.split(',')
    return (smoker, int(age), diet.strip())
    # YOUR CODE HERE
    

In [15]:
# 🧪 Test your line parsing function

print("🔬 Testing parse_line_test function...")

x = 'yes,23,good\n'  # Sample line with newline character
parsed_line = parse_line_test(x)
print(f"📊 Input: '{x.strip()}\\n'")
print(f"✅ Output: {parsed_line}")

# Verify the output format
t.assertIsInstance(parsed_line, tuple)
t.assertEqual(len(parsed_line), 3)
t.assertIsInstance(parsed_line[1], int)  # Age should be integer
t.assertNotIn('\n', parsed_line[-1], 'Are you handling line breaks correctly?')
t.assertEqual(parsed_line[-1], 'good')

print("🎉 Line parsing test passed! Your function works correctly!")


🔬 Testing parse_line_test function...
📊 Input: 'yes,23,good\n'
✅ Output: ('yes', 23, 'good')
🎉 Line parsing test passed! Your function works correctly!


In [16]:
def gettest() -> list:
    '''
    📂 Opens the `health-test.txt` file and parses it into a list of patient tuples.
    
    🔄 Recommended approach:
    1. Use 'with open()' statement for safe file handling
    2. Read all lines from the file
    3. Use parse_line_test() function to process each line
    4. Return list of all parsed patient tuples
    
    💡 You're encouraged to use the `parse_line_test` function, but it's not mandatory.
    
    Returns:
        list: A list of patient tuples, e.g., [('yes', 31, 'good'), ('no', 25, 'poor'), ...]
        
    Example:
        >>> data = gettest()
        >>> print(len(data))  # Should show 8 patients
        >>> print(data[0])    # First patient tuple
    '''
    # YOUR CODE HERE
    with open('../storage/data/health/health-test.txt') as f:
        lines = f.readlines()
    return [parse_line_test(line) for line in lines]
    # YOUR CODE HERE
    

In [17]:
testset = gettest() 
pprint(testset)
t.assertIsInstance(testset, list)
t.assertEqual(len(testset), 8)
t.assertIsInstance(testset[0], tuple)


[('yes', 21, 'poor'),
 ('no', 50, 'good'),
 ('no', 23, 'good'),
 ('yes', 45, 'poor'),
 ('yes', 51, 'good'),
 ('no', 60, 'good'),
 ('no', 15, 'poor'),
 ('no', 18, 'good')]


## 📊 **Task 3: Applying the Decision Tree to the Dataset** (15 Points)

### $\textbf{What You'll Learn}$ 📚
- **Iterating over datasets** with loops
- **Calculating percentages** and ratios
- **Applying functions** to multiple data points
- **Statistical analysis** of classification results

### $\textbf{Your Mission}$ 🎯

**Apply the decision tree to ALL patients in the dataset:**
- Use your `decision()` function on each patient tuple
- Count how many are classified as `"more"` (high risk)
- Return the **ratio** (percentage as decimal) of high-risk patients

### $\textbf{Understanding Ratios}$ 🧮
A ratio is a value in the interval **[0, 1]** representing the fraction of total cases:
- 📈 **Example:** Out of 50 patients, 15 are classified as `"more"`
- 🔢 **Calculation:** 15 ÷ 50 = 0.3 (30% high risk)
- ✅ **Return:** `0.3`

*This gives us insight into the overall health risk distribution in our dataset!* 💡
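
To make the arithmetic concrete, here is a small sketch of the ratio computation on a made-up list of predicted labels (the list `labels` is hypothetical and not taken from the dataset).

```python
labels = ['more', 'less', 'less', 'more', 'less']  # made-up predictions

count_more = labels.count('more')
ratio = count_more / len(labels)   # true division always yields a float

print(count_more, len(labels), ratio)   # 2 5 0.4
```

Note that `len(labels)` would be zero for an empty dataset, so a real implementation may want to guard against division by zero.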

In [18]:
def evaluate_testset(dataset: list) -> float:
    '''
    📊 Calculates the percentage of patients classified as 'more' (high risk).
    
    🔄 Algorithm:
    1. Initialize counter for 'more' classifications
    2. Loop through each patient in the dataset
    3. Apply decision() function to each patient
    4. Count how many return 'more'
    5. Calculate ratio: count_more / total_patients
    
    Args:
        dataset (list): A list of patient tuples, e.g., [('yes', 31, 'good'), ...]
    
    Returns:
        float: The ratio (0.0 to 1.0) of patients classified as 'more'
        
    Example:
        >>> data = [('yes', 35, 'good'), ('no', 25, 'poor'), ('yes', 40, 'poor')]
        >>> evaluate_testset(data)
        0.6667  # 2 out of 3 patients are 'more'
    '''
    # YOUR CODE HERE
    count_more = sum(1 for patient in dataset if decision(patient) == 'more')
    total_patients = len(dataset)
    return count_more / total_patients if total_patients > 0 else 0.0
    # YOUR CODE HERE
    

In [19]:
ratio = evaluate_testset(gettest())
print(f'ratio --> {ratio}')
t.assertIsInstance(ratio, float)
assert_percentage(ratio)


ratio --> 0.5


## 🧠 **Task 4: Learning from Examples** (10 Points)

### $\textbf{What You'll Learn}$ 📚
- **Supervised learning** concepts
- **Training data** vs. test data  
- **Data structure design** for machine learning
- **Expert-labeled datasets** and their importance

### $\textbf{The Data-Driven Approach}$ 🎯

Instead of relying on a fixed decision tree, let's use a **data-driven approach** where classifications are based on training examples manually labeled by medical experts! 👩‍⚕️👨‍⚕️

### $\textbf{About the Training Data}$ 📋

**File:** `health-train.txt` contains expert-labeled patient data
- **First 3 columns:** Same as `health-test.txt` (smoker, age, diet)  
- **Last column:** Expert label (`'more'` or `'less'`)
- **Purpose:** Train our machine learning models!

### $\textbf{Your Mission}$ 🎯

**Convert the file into a list of pairs:**
- **Pair structure:** `(patient_tuple, label)`
- **Patient tuple:** `(smoker, age, diet)` - the attributes  
- **Label:** `'more'` or `'less'` - expert classification

**Two-part approach:**
1. **Part A:** Process individual lines  
2. **Part B:** Load and process entire file

### $\textbf{Data Structure Guide}$ 📊
- 🔹 **Triplet:** Tuple with exactly 3 values `(a, b, c)`
- 🔹 **Pair:** Tuple with exactly 2 values `(x, y)`
- 🔹 **Our format:** `((smoker, age, diet), label)`

*This is how real machine learning datasets are structured!* 💡
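
The nested `((smoker, age, diet), label)` format can be taken apart with tuple unpacking. The following sketch uses two made-up examples (the list `examples` is hypothetical) to show both the two-step and the one-step form.

```python
# Made-up training examples in the ((smoker, age, diet), label) format
examples = [
    (('yes', 40, 'good'), 'more'),
    (('no', 26, 'good'), 'less'),
]

for patient, label in examples:      # unpack the pair
    smoker, age, diet = patient      # unpack the triplet
    print(f'{smoker=}, {age=}, {diet=} -> {label}')

# Nested unpacking does both steps at once
(first_smoker, first_age, first_diet), first_label = examples[0]

# Collect only the labels, ignoring the patient tuples
labels = [label for _, label in examples]
print(labels)
```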

In [20]:
def parse_line_train(line: str) -> tuple:
    '''
    This function works similarly to the `parse_line_test` function.
    It parses a line of the `health-train.txt` file into a tuple that 
    contains a patient tuple and a label.
    
    Args:
        line (str): A line from the `health-train.txt`
    
    Returns: 
        tuple: A tuple that contains a patient tuple and a label as a string
    '''
    # YOUR CODE HERE
    line = line.strip()
    parts = line.split(',')
    smoker = parts[0]
    age = parts[1]  # kept as a string here; distance() converts it with int()
    diet = parts[2]
    label = parts[3]
    return ((smoker, age, diet), label)
    # YOUR CODE HERE
    

In [21]:
x = 'yes,67,poor,more\n'
parsed_line = parse_line_train(x)
print(parsed_line)

t.assertIsInstance(parsed_line, tuple)
t.assertEqual(len(parsed_line), 2)

data, label = parsed_line

t.assertIsInstance(data, tuple)
t.assertEqual(len(data), 3)
t.assertEqual(data[1], "67")

t.assertIsInstance(label, str)
t.assertNotIn('\n', label, 'Are you handling line breaks correctly?')
t.assertEqual(label, 'more')


(('yes', '67', 'poor'), 'more')


In [22]:
def gettrain() -> list:
    '''
    Opens the `health-train.txt` file and parses it into 
    a list of patient tuples accompanied by their respective label. 
    
    Returns:
        list: A list of tuples comprised of a patient tuple and a label
    '''
    # YOUR CODE HERE
    with open('../storage/data/health/health-train.txt') as f:
        lines = f.readlines()
    return [parse_line_train(line) for line in lines]
    # YOUR CODE HERE
    

In [23]:
trainset = gettrain()
pprint(trainset)
t.assertIsInstance(trainset, list)
t.assertEqual(len(trainset), 16)
first_datapoint = trainset[0]
t.assertIsInstance(first_datapoint, tuple)
t.assertIsInstance(first_datapoint[0], tuple)
t.assertIsInstance(first_datapoint[1], str)

[(('yes', '54', 'good'), 'less'),
 (('no', '55', 'good'), 'less'),
 (('no', '26', 'good'), 'less'),
 (('yes', '40', 'good'), 'more'),
 (('yes', '25', 'poor'), 'less'),
 (('no', '13', 'poor'), 'more'),
 (('no', '15', 'good'), 'less'),
 (('no', '50', 'poor'), 'more'),
 (('yes', '33', 'good'), 'more'),
 (('no', '35', 'good'), 'less'),
 (('no', '41', 'good'), 'less'),
 (('yes', '30', 'poor'), 'more'),
 (('no', '39', 'poor'), 'more'),
 (('no', '20', 'good'), 'less'),
 (('yes', '18', 'poor'), 'less'),
 (('yes', '55', 'good'), 'more')]


## 🎯 **Task 5: Nearest Neighbor Classifier** (25 Points)

### $\textbf{What You'll Learn}$ 📚
- **Distance metrics** and similarity measures
- **Nearest neighbor algorithm** (k-NN with k=1)
- **Classification by similarity** 
- **Algorithmic thinking** and optimization

### $\textbf{The Nearest Neighbor Concept}$ 🔍

The **nearest neighbor algorithm** classifies test points by finding the most similar example in the training data and using its label! 

**How it works:** 🔄
1. Calculate distance from test point to ALL training points
2. Find the training point with **minimum distance**
3. Return the label of that nearest neighbor

📖 **Learn more:** [Nearest Neighbor Classifiers](http://www.robots.ox.ac.uk/~dclaus/digits/neighbour.htm)

### $\textbf{Our Distance Function}$ 📏

We need to measure similarity between patients. Our custom distance formula:

```python
distance(a, b) = (a[0] != b[0]) + ((a[1] - b[1]) / 50.0) ** 2 + (a[2] != b[2])
```

**Where `a` and `b` are patient tuples** (smoker, age, diet)

### $\textbf{Understanding the Formula}$ 🧮
- **`(a[0] != b[0])`:** Smoker difference (0 if same, 1 if different)
- **`((a[1] - b[1]) / 50.0) ** 2`:** Age difference (normalized and squared)
- **`(a[2] != b[2])`:** Diet difference (0 if same, 1 if different)

### $\textbf{Your Mission}$ 🎯

1. **Implement the distance function** between two patients
2. **Implement the neighbor function** that finds the closest training example

### $\textbf{Pro Tip}$ 💡
You can use `float('inf')` for infinity when tracking minimum distances!

*Nearest neighbor is one of the simplest yet powerful machine learning algorithms!* ✨
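
The three steps above, together with the `float('inf')` tip, can be sketched on a tiny made-up training set. The names `toy_distance`, `toy_train` and `query` are hypothetical, and the sketch assumes that ages are already numbers.

```python
def toy_distance(a: tuple, b: tuple) -> float:
    # Same formula as above; True/False count as 1/0 when added to a float
    return (a[0] != b[0]) + ((a[1] - b[1]) / 50.0) ** 2 + (a[2] != b[2])


# Made-up training examples in the ((smoker, age, diet), label) format
toy_train = [
    (('yes', 40, 'good'), 'more'),
    (('no', 30, 'good'), 'less'),
    (('no', 50, 'poor'), 'more'),
]
query = ('no', 35, 'good')

best_distance = float('inf')   # any real distance is smaller than infinity
best_label = None
for patient, label in toy_train:
    d = toy_distance(query, patient)
    if d < best_distance:      # keep the closest example seen so far
        best_distance, best_label = d, label

print(best_label, round(best_distance, 4))
```

Here the second training example differs from the query only by five years of age, so its distance is about 0.01 and its label `'less'` is chosen.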

In [24]:
def distance(a: tuple, b: tuple) -> float:
    '''
    📏 Calculates the distance between two patient tuples using our custom formula.
    
    🔢 Formula breakdown:
    - Smoker difference: (a[0] != b[0]) → 0 if same, 1 if different
    - Age difference: ((a[1] - b[1]) / 50.0) ** 2 → normalized squared difference  
    - Diet difference: (a[2] != b[2]) → 0 if same, 1 if different
    
    💡 Why this formula?
    - Categorical differences (smoker, diet) contribute 0 or 1
    - Age difference is normalized by 50 to prevent dominance
    - Squared age difference penalizes larger age gaps more
    
    Args:
        a, b (tuple): Two patient tuples (smoker, age, diet) for distance calculation
        
    Returns:
        float: The distance between patients a and b
        
    Example:
        >>> distance(('yes', 30, 'good'), ('yes', 40, 'poor'))
        1.04  # Same smoker status (0) + age diff (0.04) + different diet (1)
    '''
    # YOUR CODE HERE
    smoker_diff = int((a[0] != b[0]))
    age_diff = ((int(a[1]) - int(b[1])) / 50.0) ** 2  # int() accepts ages stored as int or as str
    diet_diff = 0 if a[2] == b[2] else 1
    return smoker_diff + age_diff + diet_diff
    # YOUR CODE HERE
    

In [25]:
# Test distance
x1 = ('yes', 34, 'poor')
x2 = ('yes', 51, 'good')
dist = distance(x1, x2)
print(f'distance({x1}, {x2}) --> {dist}')
expected_dist = 1.1156
t.assertAlmostEqual(dist, expected_dist)


distance(('yes', 34, 'poor'), ('yes', 51, 'good')) --> 1.1156


In [26]:
def neighbor(x: tuple, trainset: list) -> str:
    '''
    Returns the label of the nearest data point in trainset to x.
    If x is `('no', 30, 'good')` and the nearest data point in trainset
    is `('no', 31, 'good')` with label `'less'` then `'less'` will be returned 
    
    Args: 
        x (tuple): The data point for which we want to find the nearest neighbor
        trainset (list): A list of tuples with patient tuples and a label
        
    Returns: 
        str: The label of the nearest data point in the trainset. Can only be 'more' or 'less'
    '''
    # YOUR CODE HERE
    distances = [distance(x, train[0]) for train in trainset]
    min_index = distances.index(min(distances))
    return trainset[min_index][1]
    # YOUR CODE HERE
    

In [27]:
# Test neighbor
x = ('yes', 31, 'good')
prediction = neighbor(x, gettrain())
print(f'prediction --> {prediction}')
expected = "more"
t.assertEqual(prediction, expected)


prediction --> more


* Apply both the decision tree and nearest neighbor classifiers to the test set, and return the list of data points on which the two classifiers disagree, together with the fraction of the test set on which this happens.

In [28]:
def compare_classifiers(trainset: list, testset: list) -> tuple:
    '''
    This function compares the two classification methods by finding all the data points for which 
    the methods disagree.
    
    Args:
        trainset (list): The training set used in the nearest neighbor classifier.
        testset (list): Contains the elements which will be used to compare the 
            decision tree and nearest neighbor classification methods.
    
    Returns:
        tuple: A pair `(disagree, ratio)`, where `disagree` is a list containing all the data points
            which yield different results for the two classification methods, and `ratio` is the
            fraction (0.0 to 1.0) of data points for which the two methods disagree.
    
    '''
    # YOUR CODE HERE    
    disagree = []
    for x in testset:
        label_tree = decision(x)
        label_nn = neighbor(x, trainset)
        if label_tree != label_nn:
            disagree.append(x)
    percentage = len(disagree) / len(testset) if testset else 0.0
    return disagree, percentage
    # YOUR CODE HERE
    
    

In [29]:
# Test compare_classifiers
disagree, ratio = compare_classifiers(gettrain(), gettest())
t.assertIsInstance(disagree, list)
t.assertIsInstance(disagree[0], tuple)
assert_percentage(ratio)

### $\textbf{Performance Challenge}$ ⚠️

**Problem with Simple Nearest Neighbor:** Comparing to ALL training points can be slow for large datasets (thousands+ points)!

**Solution:** Train a model first, then use it for fast classification! 🚀

## 🎓 **Task 6: Nearest Mean Classifier** (25 Points)

### $\textbf{What You'll Learn}$ 📚
- **Object-oriented programming** with classes
- **Model training** vs. prediction phases
- **Numerical data representation**
- **Centroid-based classification**

### $\textbf{The Smart Approach}$ 🧠

**Nearest Mean Classifier operates in two steps:**

1. **🎯 Training Phase:** Compute the average (centroid) point for each class
2. **⚡ Prediction Phase:** Classify new points based on nearest class centroid

**Why it's faster:** Instead of comparing to ALL training points, we only compare to 2 centroids! 

### $\textbf{Numerical Conversion}$ 🔢

**Convert categorical attributes to numerical values:**
- **Smoker:** `yes=1.0`, `no=0.0` 
- **Diet:** `good=0.0`, `poor=1.0`
- **Age:** Keep as `float` (instead of `int`)
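
Once every attribute is a number, a class centroid is simply the column-wise mean of the encoded tuples. The sketch below illustrates this on three made-up patients; the helper `encode` and the list `group` are hypothetical names used only for this example.

```python
def encode(patient: tuple) -> tuple:
    """Hypothetical helper: maps (smoker, age, diet) to three floats."""
    smoker, age, diet = patient
    return (
        1.0 if smoker == 'yes' else 0.0,   # yes=1.0, no=0.0
        float(age),                        # age as float
        0.0 if diet == 'good' else 1.0,    # good=0.0, poor=1.0
    )


group = [('yes', 40, 'good'), ('no', 50, 'poor'), ('yes', 30, 'poor')]
encoded = [encode(p) for p in group]

# zip(*encoded) regroups the values by column: smokers, ages, diets
columns = list(zip(*encoded))
centroid = tuple(sum(col) / len(col) for col in columns)

print(encoded)
print(centroid)   # mean smoker value, mean age, mean diet value
```

The mean age of this group is 40.0, while the smoker and diet entries become fractions between 0 and 1, which is why the categorical attributes have to be numbers before averaging.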

### $\textbf{New Distance Function}$ 📏

For numerical data, we use a **squared Euclidean distance** (with the age difference scaled by 50):

```python
distance(a,b) = (a[0] - b[0]) ** 2 + ((a[1] - b[1]) / 50.0) ** 2 + (a[2] - b[2]) ** 2
```

*Now all differences are numerical subtractions instead of boolean comparisons!*
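
A quick check shows how the two formulas relate: for values encoded as 0.0 or 1.0, the squared difference gives the same result as the boolean comparison, while for fractional values such as centroids it falls between 0 and 1. The helper names below are hypothetical.

```python
def categorical_term(a: str, b: str) -> int:
    # Term from the first distance function: 1 if the values differ, else 0
    return int(a != b)


def numerical_term(a: float, b: float) -> float:
    # Term from the numerical distance function: squared difference
    return (a - b) ** 2


encoding = {'yes': 1.0, 'no': 0.0}

# For 0/1-encoded values both terms agree on every combination
for a in ('yes', 'no'):
    for b in ('yes', 'no'):
        assert categorical_term(a, b) == numerical_term(encoding[a], encoding[b])

# A centroid entry such as 0.5 yields a partial mismatch instead of 0 or 1
print(numerical_term(0.5, 1.0))   # 0.25
```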

### $\textbf{Object-Oriented Design}$ 🏗️

We'll build this as a **class** with methods:
- **`train()`:** Learn from training data  
- **`predict()`:** Classify new patients

### $\textbf{Your Mission}$ 🎯

1. **Implement `gettrain_num()`** - Load numerical training data
2. **Implement new `distance_num()`** - Euclidean distance for numerical data  
3. **Implement `NearestMeanClassifier`** - Complete the class methods

*This introduces you to real machine learning model architecture!* ✨

In [62]:
def parse_line_train_num(line: str) -> tuple:
    '''
    Takes a line from the file `health-train.txt`, including a newline, 
    and parses it into a numerical patient tuple and its label
    
    Args:
        line (str): A line from the `health-train.txt` file
    Returns:
        tuple: A pair containing a numerical patient tuple and its label
    '''
    # YOUR CODE HERE

    parts = line.strip().split(',')
    # Convert to numerical values
    smoker = 1.0 if parts[0].strip() == 'yes' else 0.0
    age = float(parts[1].strip())
    diet = 0.0 if parts[2].strip() == 'good' else 1.0
    label = parts[3].strip()
    return ((smoker, age, diet), label)

def gettrain_num() -> list:
    '''
    Parses the `health-train.txt` file into numerical patient tuples
    
    Returns: 
        list: A list of tuples containing numerical patient tuples and their labels
    '''
    # YOUR CODE HERE
    with open('../storage/data/health/health-train.txt') as f:
        lines = f.readlines()
    return [parse_line_train_num(line) for line in lines]
    # YOUR CODE HERE
    

In [63]:
# Test gettrain_num
trainset_num = gettrain_num()
t.assertIsInstance(trainset_num, list)
first_datapoint = trainset_num[0]
print(f'first_datapoint --> {first_datapoint}')
t.assertIsInstance(first_datapoint[0], tuple)
t.assertIsInstance(first_datapoint[0][0], float)
t.assertIsInstance(first_datapoint[0][1], float)
t.assertIsInstance(first_datapoint[0][2], float)

first_datapoint --> ((1.0, 54.0, 0.0), 'less')


In [67]:
def distance_num(a: tuple, b: tuple) -> float:
    '''
    Calculates the distance between two data points (numerical patient tuples)
    Args:
        a, b (tuple): Two numerical patient tuples for which 
            we want to calculate the distance
    Returns:
        float: The distance between a, b according to the above formula
    '''
    # YOUR CODE HERE
    smoker_diff = (a[0] - b[0]) ** 2
    age_diff = ((a[1] - b[1]) / 50.0) ** 2
    diet_diff = (a[2] - b[2]) ** 2
    return smoker_diff + age_diff + diet_diff
    # YOUR CODE HERE
    

In [68]:
x1 = (1.0, 23.0, 0.0)
x2 = (0.0, 41.0, 1.0)
dist = distance_num(x1, x2)
print(f'dist --> {dist}')
t.assertIsInstance(dist, float)
expected_dist = 2.1296
t.assertAlmostEqual(dist, expected_dist)

dist --> 2.1296


In [74]:
class NearestMeanClassifier:
    '''
    Represents a NearestMeanClassifier.
    
    When an instance is trained, a dataset is provided and the mean for each class is calculated.
    During prediction, the instance compares the data point to each class mean (not all data points) 
    and returns the label of the class mean to which the data point is closest.
    
    Instance Attributes:
        more (tuple): A tuple representing the mean of every 'more' data-point in the dataset
        less (tuple): A tuple representing the mean of every 'less' data-point in the dataset
    '''
    
    def __init__(self):
        self.more = None
        self.less = None
    
    def train(self, dataset: list):
        '''
        Calculates the class means for a given dataset and stores 
        them in instance attributes more, less. 
        Args:
            dataset (list): A list of tuples each of them containing a numerical patient tuple and its label
        Returns:
            self
        '''
        # YOUR CODE HERE
        more_points = [data[0] for data in dataset if data[1] == 'more']
        less_points = [data[0] for data in dataset if data[1] == 'less']
        self.more = tuple(sum(p[i] for p in more_points) / len(more_points) for i in range(3)) if more_points else None
        self.less = tuple(sum(p[i] for p in less_points) / len(less_points) for i in range(3)) if less_points else None
        return self

    def predict(self, x: tuple) -> str:
        '''
        Returns a prediction/label for the numerical patient tuple x. 
        The classifier compares the given data point to the mean 
        tuple of each class and returns the label of the
        class to which x is closest (according to our 
        distance function).
        
        Args: 
            x (tuple): A numerical patient tuple for which we want a prediction
            
        Returns:
            str: The predicted label
        '''
        # YOUR CODE HERE
        if self.more is None or self.less is None:
            raise ValueError("Classifier has not been trained.")
        dist_more = distance_num(x, self.more)
        dist_less = distance_num(x, self.less)
        return 'more' if dist_more < dist_less else 'less'
        # YOUR CODE HERE
        
        
    def __str__(self):
        return repr(self)
    def __repr__(self):
        more = tuple(round(m, 3) for m in self.more) if self.more else self.more
        less = tuple(round(l, 3) for l in self.less) if self.less else self.less
        return f'NearestMeanClassfier(more: {more}, less: {less})'

* Instantiate the `NearestMeanClassifier`, train it on the training data, and return it

In [75]:
def build_and_train(trainset_num: list) -> NearestMeanClassifier:
    '''
    Instantiates the `NearestMeanClassifier`, trains it on the
    `trainset_num` dataset and returns it.
    
    Args: 
        trainset_num (list): A list of numerical patient tuples with their respective labels
    
    Returns:
        NearestMeanClassifier: A NearestMeanClassifier trained on `trainset_num`
    '''
    classifier = NearestMeanClassifier()
    classifier.train(trainset_num)
    return classifier


In [76]:
# Test build_and_train
classifier = build_and_train(gettrain_num())
print(classifier)
t.assertIsInstance(classifier, NearestMeanClassifier)

t.assertIsNotNone(classifier.more, 'Did you train the classifier? \
Did you store the mean vector for the \'more\' class?')
t.assertIsNotNone(classifier.less, 'Did you train the classifier? \
Did you store the mean vector for the \'less\' class?')

t.assertIsInstance(classifier.more, tuple)
t.assertIsInstance(classifier.less, tuple)

t.assertEqual(round(classifier.more[1]), 37)
t.assertEqual(round(classifier.less[1]), 32)


NearestMeanClassfier(more: (0.571, 37.143, 0.571), less: (0.333, 32.111, 0.222))


* Load the test dataset into memory as a list of numerical patient tuples
* Predict the test data using the nearest mean classifier and return all test examples for which all three classifiers (decision tree, nearest neighbor and nearest mean) agree.

**Note**: Be careful that the `NearestMeanClassifier` expects the dataset in a different form, compared to the other two methods.
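
One way to keep the two representations aligned is to derive the numerical list from the string list and then walk over both in parallel with `zip`. This is only a sketch on two made-up records; `to_numeric` is a hypothetical helper that follows the encoding described in Task 6.

```python
records = [('yes', 21, 'poor'), ('no', 50, 'good')]  # made-up test records


def to_numeric(x: tuple) -> tuple:
    """Hypothetical helper: encodes a string patient tuple as floats."""
    smoker, age, diet = x
    return (
        1.0 if smoker == 'yes' else 0.0,
        float(age),
        0.0 if diet == 'good' else 1.0,
    )


records_num = [to_numeric(x) for x in records]

# zip pairs the i-th string tuple with the i-th numerical tuple
for original, numeric in zip(records, records_num):
    print(original, '->', numeric)
```

The string form can then be passed to `decision` and `neighbor`, and the numerical form to the nearest mean classifier, without any index bookkeeping.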

In [None]:
def parse_line_test_num(line: str) -> tuple:
    '''
    Takes a line from the `health-test.txt` file, including a newline,
    and parses it into a numerical patient tuple
    
    Args:
        line (str): A line from the `health-test.txt` file
    
    Returns: 
        tuple: A numerical patient tuple (smoker, age, diet)
    '''
    # YOUR CODE HERE

    parts = line.strip().split(',')
    smoker = 1.0 if parts[0].strip() == 'yes' else 0.0
    age = float(parts[1].strip())
    diet = 0.0 if parts[2].strip() == 'good' else 1.0
    return (smoker, age, diet)

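def gettest_num() -> list:
    '''
    Parses the `health-test.txt` file into numerical patient tuples
    
    Returns: 
        list: A list containing numerical patient tuples, loaded from `health-test.txt`
    '''
    with open('../storage/data/health/health-test.txt') as f:
        lines = f.readlines()
    return [parse_line_test_num(line) for line in lines]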
    # YOUR CODE HERE
    

In [80]:
testset_num = gettest_num()
pprint(testset_num)
t.assertIsInstance(testset_num, list)
t.assertEqual(len(testset_num), 8)
t.assertIsInstance(testset_num[0], tuple)
t.assertEqual(len(testset_num[0]), 3)

[(1.0, 21.0, 1.0),
 (0.0, 50.0, 0.0),
 (0.0, 23.0, 0.0),
 (1.0, 45.0, 1.0),
 (1.0, 51.0, 0.0),
 (0.0, 60.0, 0.0),
 (0.0, 15.0, 1.0),
 (0.0, 18.0, 0.0)]


In [84]:
def predict_test() -> list:
    '''
    Classifies the test set using all the methods that were developed in this exercise sheet,
    namely `decision`, `neighbor` and `NearestMeanClassifier`
    
    Returns:
        list: A list of pairs, each containing a patient tuple that was classified 
            the same by all methods and the predicted label
            
    Example:
    >>> predict_test()
    [(('yes', 22, 'poor'), 'less'),
     (('yes', 21, 'poor'), 'less'),
     (('no', 31, 'good'), 'more')]
     
    This example only shows what the output should look like. The values in the tuples 
    are completely made up.
    '''
    # YOUR CODE HERE
    agreed_samples = []
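    # Load the original test set (string attributes)
    testset = gettest()
    # Load the numerical test set
    testset_num = gettest_num()
    # Load the training set (used by neighbor)
    trainset = gettrain()
    # Load the numerical training set (used by NearestMeanClassifier)
    trainset_num = gettrain_num()
    # Train the nearest mean classifier
    classifier = build_and_train(trainset_num)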
    classifier = build_and_train(trainset_num)
    for i in range(len(testset)):
        x_str = testset[i]
        x_num = testset_num[i]
        pred_decision = decision(x_str)
        pred_neighbor = neighbor(x_str, trainset)
        pred_classifier = classifier.predict(x_num)
        if pred_decision == pred_neighbor == pred_classifier:
            agreed_samples.append((x_str, pred_decision))
    return agreed_samples

In [86]:
same_predictions = predict_test()
pprint(same_predictions)
t.assertIsInstance(same_predictions, list)
t.assertEqual(len(same_predictions), 3)
t.assertIsInstance(same_predictions[0], tuple)
t.assertIsInstance(same_predictions[0][0], tuple)
t.assertIsInstance(same_predictions[0][0][0], str)
t.assertIsInstance(same_predictions[0][1], str)

[(('no', 23, 'good'), 'less'),
 (('yes', 45, 'poor'), 'more'),
 (('no', 18, 'good'), 'less')]
