## Testing
Tests are provided for each task. You can use these tests to practice test-driven development (TDD), but please note that using them will likely make the tasks easier.
Please also note that the tests may not be exhaustive, i.e. even when all tests pass, your solution can still be imperfect.

In [1]:
# This cell is for setting up the testing. Please execute it.

# Use unittest asserts
import unittest

t = unittest.TestCase()
from pprint import pprint


# Helper assert function
def assert_percentage(val):
    t.assertGreaterEqual(val, 0.0, f"Percentage ({val}) cannot be < 0")
    t.assertLessEqual(val, 1.0, f"Percentage ({val}) cannot be > 1")

# Warm Ups

Before starting the homework sheet we recommend you finish these warm-up tasks. They should help you to get familiar with Python code.

## Function and types

Write a function using a list comprehension that returns the types of the elements in the list.

* The function should be called `types_of`
* The function expects a list as an input argument.
* The function should return a list with the types of the given list elements.
* Read the testing cell to understand how `types_of` is supposed to work.
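
As a reminder of the syntax, a list comprehension builds a new list by applying an expression to every element of an iterable. The sketch below uses made-up names (`words`, `lengths`) that are not part of the task and only illustrates the general pattern:

```python
# Illustrative data; not part of the exercise.
words = ["tree", "leaf", "branch"]

# Equivalent loop version, shown for comparison.
lengths_loop = []
for w in words:
    lengths_loop.append(len(w))

# List comprehension: [expression for item in iterable]
lengths = [len(w) for w in words]

print(lengths)  # [4, 4, 6]
```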

In [2]:

### Please enter your solution here ###
def types_of(l: list) -> list[type]:
    return [type(i) for i in l]


In [3]:
# Test types_of function
import ast
import inspect

def test_types_of():
    types = types_of([7, 0.7, "hello", True, (2, "s")])
    assert isinstance(types, list)
    t.assertEqual(types[0], int)
    t.assertEqual(types[1], float)
    t.assertEqual(types[2], str)
    t.assertEqual(types[3], bool)
    t.assertEqual(types[-1], tuple)
    t.assertEqual([int, float, str, bool, tuple], types)

    # check that the function uses a list comprehension
    source = inspect.getsource(types_of)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ListComp):
            break
    else:
        t.fail("types_of does not use a list comprehension")

test_types_of()

## Basic Math Operations and Interpreting Formulas

Some machine learning methods require data normalization as a pre-processing step. Write a function `normalize_list` that takes a list of numbers and normalizes it in two different ways:
1. Min-max feature scaling: scales the $x_i$ to a given range $[a, b]$. When $[a, b] = [0, 1]$, this is sometimes called min-max normalization.
$$ x_i' = a + \frac{(b-a)(x_i - \min{x})}{\max{x} - \min{x}} $$
2. Scaling to unit length:
$$ x_i' = \frac{x_i}{\|x\|_2} $$


In [15]:
from math import sqrt

def normalize_list(values: list[float], range: tuple[float, float] = (0.0, 1.0)) -> tuple[list[float], list[float]]:
### Please enter your solution here ###
    a, b = range
    min_val = min(values)
    max_val = max(values)
    normalized_values = [a + (b - a) * (x - min_val) / (max_val - min_val) for x in values]
    norm = sqrt(sum(x**2 for x in values))  # Euclidean norm, computed once
    unit_scaled_values = [x / norm for x in values]

    return normalized_values, unit_scaled_values


# Test
values = [1, 2, 3, 4, 5]
print(normalize_list(values))  
print(normalize_list(values, (1.0, 2.0)))


([0.0, 0.25, 0.5, 0.75, 1.0], [0.13483997249264842, 0.26967994498529685, 0.40451991747794525, 0.5393598899705937, 0.674199862463242])
([1.0, 1.25, 1.5, 1.75, 2.0], [0.13483997249264842, 0.26967994498529685, 0.40451991747794525, 0.5393598899705937, 0.674199862463242])
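
Both formulas divide by a quantity that can be zero: $\max{x} - \min{x}$ vanishes when all values are equal, and $\|x\|_2$ vanishes for an all-zero list. The task does not say what should happen in these cases, so the following is only one possible convention. The name `safe_normalize_list`, the parameter `target_range`, and the chosen fallbacks are assumptions made for this sketch.

```python
from math import sqrt


def safe_normalize_list(values, target_range=(0.0, 1.0)):
    """Hypothetical variant of normalize_list that avoids division by zero."""
    a, b = target_range
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Assumed convention: a constant list maps to the lower bound a.
        scaled = [a for _ in values]
    else:
        scaled = [a + (b - a) * (x - lo) / span for x in values]

    norm = sqrt(sum(x * x for x in values))
    if norm == 0:
        # Assumed convention: the zero vector is returned unchanged.
        unit = [float(x) for x in values]
    else:
        unit = [x / norm for x in values]
    return scaled, unit


print(safe_normalize_list([3, 3, 3])[0])        # [0.0, 0.0, 0.0]
print(safe_normalize_list([1, 2, 3, 4, 5])[0])  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Raising a `ValueError` in the degenerate cases would be an equally reasonable design; which one is preferable depends on how the caller uses the result.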


## String formatting

What does the following string formatting evaluate to?
* Write the result of each string formatting into the variables `result1`, `result2`, `result3`, and `result4`.
* Example: `string0 = "This is a {} string.".format("test")`
* Example solution: `result0 = "This is a test string."`
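
The following sketch recalls the formatting features involved: positional placeholders, expressions inside f-strings, and format specifiers. The variables `price` and `count` are made up for illustration and do not appear in the exercise.

```python
price = 3.14159
count = 7

# Positional placeholders are filled in order by str.format.
print("{} items cost {}".format(count, price))  # 7 items cost 3.14159

# f-strings evaluate the expression inside the braces.
print(f"{count * 2} halves")                     # 14 halves

# A format specifier after the colon controls the presentation.
print(f"{price:.2f}")                            # 3.14

# True division always yields a float, even for whole results.
print(f"{count / 2}")                            # 3.5
```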

In [16]:
# first string
string1 = "The sky is {}. {} words in front of {} random words create {} random sentence.".format(
    "clear", "Random", "other", 1
)

# second string
a = "irony"
b = "anyone"
c = "room"

string2 = f"The {a} of the situation wasn't lost on {b} in the {c}."

# third string
string3 = f"{7*10} * {9/3} with three digits after the floating point looks like this: {70*3 :.3f}."

# fourth string
string4 = "   Hello World.   ".strip()

In [35]:

### Please enter your solution here ###
result1 = "The sky is clear. Random words in front of other random words create 1 random sentence."
result2 = "The irony of the situation wasn't lost on anyone in the room."
result3 = "70 * 3.0 with three digits after the floating point looks like this: 210.000."
result4 = "Hello World."


In [36]:
# Test the string results
t.assertEqual(string1, result1)
t.assertEqual(string2, result2)
t.assertEqual(string3, result3)
t.assertEqual(string4, result4)

## Concatenation and Enumeration


Concatenate the strings from the list `animals` into one string using a for-loop.

* Use: `counting +=` and string formatting (`f-strings`).
* Use `enumerate` to get the `i`th index.
* The result should look as follows: `'| 0: mouse | 1: rabbit | 2: cat | 3: dog |'`

***Note that this is not the most efficient way to concatenate strings in Python but part of this exercise is to showcase `for-loops`***
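
For comparison, a more efficient pattern collects the pieces first and joins them once with `str.join`. This is only a sketch of that alternative on made-up data (`colors`); the exercise itself still asks for the for-loop version.

```python
colors = ["red", "green", "blue"]  # illustrative data, not the exercise's list

# Build all pieces first, then join them in a single pass.
parts = [f" {i}: {name} " for i, name in enumerate(colors)]
joined = "|" + "|".join(parts) + "|"

print(joined)  # | 0: red | 1: green | 2: blue |
```

Repeated `+=` may copy the growing string on every iteration, whereas `join` allocates the result once.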

In [4]:
animals = ["mouse", "rabbit", "cat", "dog"]

In [5]:
counting = "|"

### Please enter your solution here ###
for i, el in enumerate(animals):
    counting += f" {i}: {el} |"


print(counting)

| 0: mouse | 1: rabbit | 2: cat | 3: dog |


In [6]:
# Test of the enumeration loop
t.assertEqual(counting, "| 0: mouse | 1: rabbit | 2: cat | 3: dog |")

## Object Oriented Programming

### Person Class

Create a class called `Person` with the following attributes:
- `name` (string)
- `age` (integer)
- `hobbies` (list of strings)

The initialization method (`__init__`) should receive the attributes as arguments and assign them to the object.

Define a method called `introduction` that prints and returns the following for a person called `John` with age `30` and hobbies `["reading", "hiking", "swimming"]`:

`"Hello, my name is John and I am 30 years old. My hobbies are: reading, hiking, swimming."`

If the person has no hobbies (i.e. the list of hobbies is empty), the method should print and return:

`"Hello, my name is John and I am 30 years old. I have no hobbies."`

In [18]:
### Please enter your solution here ###
class Person:
    def __init__(self, name: str, age: int, hobbies: list[str]):
        self.name = name
        self.age = age
        self.hobbies = hobbies

    def introduction(self) -> str:
        if self.hobbies: 
            s = f"Hello, my name is {self.name} and I am {self.age} years old. My hobbies are: {', '.join(self.hobbies)}."
        else:
            s = f"Hello, my name is {self.name} and I am {self.age} years old. I have no hobbies."
        
        print(s)
        return s


In [19]:
# Test person class
import io
from contextlib import redirect_stdout

def test_Person():
    p1 = Person("Alice", 25, ["reading", "swimming"])
    p2 = Person("Bob", 30, [])
    p3 = Person("Charlie", 40, ["reading"])

    t.assertEqual(p1.introduction(), "Hello, my name is Alice and I am 25 years old. My hobbies are: reading, swimming.")
    t.assertEqual(p2.introduction(), "Hello, my name is Bob and I am 30 years old. I have no hobbies.")
    t.assertEqual(p3.introduction(), "Hello, my name is Charlie and I am 40 years old. My hobbies are: reading.")

    # check that method prints the correct string
    f = io.StringIO()
    with redirect_stdout(f):
        p1.introduction()
        output = f.getvalue().strip()

    if output == '':
        t.fail("The method does not seem to print anything.")

    t.assertEqual(output, "Hello, my name is Alice and I am 25 years old. My hobbies are: reading, swimming.", 
                          "The method should print the correct string.")

test_Person()

Hello, my name is Alice and I am 25 years old. My hobbies are: reading, swimming.
Hello, my name is Bob and I am 30 years old. I have no hobbies.
Hello, my name is Charlie and I am 40 years old. My hobbies are: reading.


### Cookie Jar Class

This problem is adapted from the [Harvard CS50 OpenCourseWare exercise "Cookie Jar"](https://cs50.harvard.edu/python/2022/psets/8/jar/). Please implement a [cookie jar](https://en.wikipedia.org/wiki/Cookie_jar) via a class `Jar` that contains the following methods:
* `__init__` should initialize a cookie jar with the given capacity, which represents the maximum number of cookies that can fit in the cookie jar. If capacity is not a non-negative `int`, though, `__init__` should instead raise a `ValueError`.
* `__str__` should return a `str` with n `🍪`, where n is the number of cookies in the cookie jar. For instance, if there are 3 cookies in the cookie jar, then `__str__` should return `"🍪🍪🍪"`.
* `deposit` should add n cookies to the cookie jar. If adding that many would exceed the cookie jar’s capacity, though, `deposit` should instead raise a `ValueError`.
* `withdraw` should remove n cookies from the cookie jar. Nom nom nom. If there aren’t that many cookies in the cookie jar, though, `withdraw` should instead raise a `ValueError`.
* `get_capacity` should return the cookie jar’s capacity.
* `get_size` should return the number of cookies actually in the cookie jar, initially 0.

In [38]:
class Jar:
### Please enter your solution here ###
    def __init__(self, capacity: int=12):
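        # Reject anything that is not a non-negative int, as the task requires
        if not isinstance(capacity, int) or capacity < 0:
            raise ValueError("capacity must be a non-negative int")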
        self.capacity = capacity
        self.size = 0
    
    def __str__(self):
        return "🍪" * self.size

    def deposit(self, n: int):
        if self.size + n > self.capacity:
            raise ValueError
        self.size += n

    def withdraw(self, n: int):
        if self.size - n < 0:
            raise ValueError
        self.size -= n
    
    def get_capacity(self):
        return self.capacity
    
    def get_size(self):
        return self.size

It's also important to test your `Jar` class. Fill in the functions below, making sure to test all of `Jar`'s methods and all relevant edge cases.

`unittest`'s `TestCase` class provides the methods `assertEqual(a, b)` and `assertRaises(error)`, which the test cases use to check expected behavior.

For example:

 - `self.assertEqual(jar.get_capacity(), 12)`
 - `with self.assertRaises(ValueError): Jar(-5)`
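
To show the two assertions in a complete test case, here is a minimal self-contained sketch. The `Counter` class is hypothetical and exists only for this illustration; it is not related to `Jar`.

```python
import unittest


class Counter:
    """Hypothetical class used only to illustrate the assertions."""

    def __init__(self, start=0):
        if start < 0:
            raise ValueError("start must be non-negative")
        self.value = start


class TestCounter(unittest.TestCase):
    def test_default_start(self):
        self.assertEqual(Counter().value, 0)

    def test_negative_start_raises(self):
        # The block passes only if ValueError is raised inside it.
        with self.assertRaises(ValueError):
            Counter(-1)


# Run the test case programmatically and collect the result.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestCounter)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())  # True
```

Loading a single test case explicitly keeps the run limited to `TestCounter`, which can be convenient in a notebook where several test classes are defined.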

In [52]:
import unittest

class TestJar(unittest.TestCase):
    
    def test_initialization_with_default_capacity(self):
        ### Please enter your solution here ###
        jar = Jar()
        self.assertEqual(jar.get_capacity(), 12)
        self.assertEqual(jar.get_size(), 0)

    def test_initialization_with_custom_capacity(self):
        ### Please enter your solution here ###
        cap = 5
        jar = Jar(cap)
        self.assertEqual(jar.get_capacity(), 5)
        self.assertEqual(jar.get_size(), 0)

    def test_initialization_with_invalid_capacity(self):
        ### Please enter your solution here ###
        with self.assertRaises(ValueError):
            Jar(-5)

    def test_str_representation_empty_jar(self):
        ### Please enter your solution here ###
        jar = Jar()
        self.assertEqual(str(jar), "")

    def test_str_representation_non_empty_jar(self):
        ### Please enter your solution here ###
        jar = Jar()
        jar.deposit(5)
        self.assertEqual(str(jar), "🍪🍪🍪🍪🍪")

    def test_deposit_valid(self):
        ### Please enter your solution here ###
        jar = Jar()
        n = 5
        jar.deposit(n)
        self.assertEqual(jar.get_size(), n)

    def test_deposit_exceeds_capacity(self):
        ### Please enter your solution here ###
        capacity = 5
        jar = Jar(capacity)
        with self.assertRaises(ValueError):
            jar.deposit(capacity+1)

    def test_withdraw_valid(self):
        ### Please enter your solution here ###
        jar = Jar()
        n = 5
        jar.deposit(n)
        jar.withdraw(n-1)
        self.assertEqual(jar.get_size(), 1)


    def test_withdraw_exceeds_size(self):
        ### Please enter your solution here ###
        jar = Jar()
        n = 5
        jar.deposit(n)
        with self.assertRaises(ValueError):
            jar.withdraw(n+1)


    def test_deposit_and_withdraw_sequence(self):
        ### Please enter your solution here ###
        jar = Jar()
        jar.deposit(12)
        jar.withdraw(5)
        jar.deposit(3)
        jar.withdraw(6)
        self.assertEqual(4, jar.get_size())


unittest.main(argv=[''], verbosity=2, exit=False)

test_deposit_and_withdraw_sequence (__main__.TestJar.test_deposit_and_withdraw_sequence) ... ok
test_deposit_exceeds_capacity (__main__.TestJar.test_deposit_exceeds_capacity) ... ok
test_deposit_valid (__main__.TestJar.test_deposit_valid) ... ok
test_initialization_with_custom_capacity (__main__.TestJar.test_initialization_with_custom_capacity) ... ok
test_initialization_with_default_capacity (__main__.TestJar.test_initialization_with_default_capacity) ... ok
test_initialization_with_invalid_capacity (__main__.TestJar.test_initialization_with_invalid_capacity) ... ok
test_str_representation_empty_jar (__main__.TestJar.test_str_representation_empty_jar) ... ok
test_str_representation_non_empty_jar (__main__.TestJar.test_str_representation_non_empty_jar) ... ok
test_withdraw_exceeds_size (__main__.TestJar.test_withdraw_exceeds_size) ... ok
test_withdraw_valid (__main__.TestJar.test_withdraw_valid) ... ok

----------------------------------------------------------------------
Ran 10 tests

<unittest.main.TestProgram at 0x2454dd0fe50>

## Error Handling

This problem is adapted from the [Harvard CS50 OpenCourseWare exercise](https://cs50.harvard.edu/python/2022/psets/3/fuel/). Fuel gauges indicate, often with fractions, just how much fuel is in a tank. For instance 1/4 indicates that a tank is 25% full, 1/2 indicates that a tank is 50% full, and 3/4 indicates that a tank is 75% full.

In the cell below, implement a function `main` that takes a fraction as a string, formatted as `'X/Y'`, where X and Y are each integers. `main` should return a string consisting of a percentage rounded to the nearest integer (e.g. `"XX%"`), expressing how much fuel is in the tank. If only 1% or less remains, it should return `"E"` instead to indicate that the tank is almost empty. If 99% or more remains, it should return `"F"` to indicate that the tank is almost full.

For computing the percentage from the input string `fraction`, create a separate function `convert`.

Add error handling so that only realistic inputs are accepted: an input is invalid if X or Y is not an integer, X is greater than Y, or Y is 0. In these cases return `"Invalid input, please try again."` Be sure to catch any exceptions like `ValueError` or `ZeroDivisionError`.

In [None]:
def main(fraction: str) -> str:
    ### Please enter your solution here ###
    try:
        val = convert(fraction)
    except (ValueError, ZeroDivisionError):
        # Malformed input, negative X, X > Y, or Y == 0
        return "Invalid input, please try again."
    if val <= 1:
        return "E"
    elif val >= 99:
        return "F"
    else:
        return f"{val}%"


def convert(fraction: str) -> int:
    ### Please enter your solution here ###
    # Unpacking raises ValueError unless the input contains exactly one "/"
    x, y = fraction.split("/")
    # int() raises ValueError for non-integer strings such as "three" or "1.5"
    x, y = int(x), int(y)
    if x < 0 or x > y:
        raise ValueError("X must lie between 0 and Y")
    # Raises ZeroDivisionError for "0/0"; round() gives the nearest integer
    return round(x / y * 100)


# Testing the function
test_fractions = ["2/3", "1/4", "4/4", "0/4", "3/0", "three/four", "1.5/3", "5/4", "abc"]

results = [main(fraction) for fraction in test_fractions]
results


['67%',
 '25%',
 'F',
 'E',
 'Invalid input, please try again.',
 'Invalid input, please try again.',
 'Invalid input, please try again.',
 'Invalid input, please try again.',
 'Invalid input, please try again.']
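
The task asks for a percentage rounded to the nearest integer. In Python, `int()` and `round()` behave differently here, which matters for inputs such as `2/3`. The short sketch below only contrasts the two built-ins; the variable `ratio` is made up for illustration.

```python
ratio = 2 / 3

# int() truncates toward zero, so the fractional part is discarded.
print(int(ratio * 100))    # 66

# round() rounds to the nearest integer.
print(round(ratio * 100))  # 67

# Note: round() uses "round half to even" for exact ties.
print(round(0.5), round(1.5), round(2.5))  # 0 2 2
```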

# Exercise Sheet 1: Python Basics

This exercise sheet examines the basic functionalities of the Python programming language in the context of a simple prediction task. We consider the problem of predicting the health risks of subjects from their personal data and habits. We first try to accomplish this task by using a decision tree.

![](tree.png)

Make sure that you have downloaded the `tree.png` file from ISIS. For this exercise sheet, you are required to use only pure Python and must not import any module, including `NumPy`. Next week we will implement the nearest neighbor part of this exercise sheet using `NumPy` 😉.

## 1. Classifying a single instance

* In this sheet we represent patient info as a tuple.
* Implement the function `decision` that takes as input a tuple containing values for attributes `(smoker, age, diet)`, and computes the output of the decision tree. The function should return either `'less'` or `'more'`. No other outputs are valid.

In [9]:
def decision(x: tuple[str, int, str]) -> str:
    """
    Implements the decision tree represented in the image above. As input the function
    receives a tuple with three values that represent some information about a patient.
    
    Args:
        x: Input tuple containing exactly three values.
           The first element is a string and represents if a patient is a smoker. 
           If they do smoke, this string will be 'yes'. All other values represent 
           that the patient is not a smoker.
           The second element represents the age of a patient in years as an integer.
           The last element represents the diet of a patient. If a patient has a good diet this
           string will be 'good'. All other values represent that the patient has a poor diet.
    
    Returns:
        str: A string that has either the value 'more' or 'less'. No other return value
             is valid.
    """
    ### Please enter your solution here ###
    smoker, age, diet = x
    if (smoker == "yes" and age < 29.5) or (smoker != "yes" and diet == "good"):
        return "less"
    else:
        return "more"


In [25]:
# Test decision function
def test_decision():
    try:
        t
    except NameError:
        print("No test object found. Did you run the first cell?")
        raise

    # Test expected 'more'
    x = ("yes", 31, "good")
    output = decision(x)
    print(f"decision({x}) --> {output}")
    t.assertIsInstance(output, str)
    t.assertEqual(output, "more")

    # Test expected 'less'
    x = ("yes", 29, "poor")
    output = decision(x)
    print(f"decision({x}) --> {output}")
    t.assertIsInstance(output, str)
    t.assertEqual(output, "less")

test_decision()

decision(('yes', 31, 'good')) --> more
decision(('yes', 29, 'poor')) --> less


## 2. Reading a dataset from a text file
In the previous task we wrote a function that classifies the health risk of patients using manually defined rules, which determine for which inputs a patient is at `more` or `less` risk. In the next exercises we will approach the task differently: our goal is to create a classification method based on data. To achieve this, we first need functions that load the existing data into the program. We can then also use the loaded data as input for our decision tree implementation and check what it outputs.

The file `health-test.txt` contains several fictitious records of personal data and habits. We split this task into two parts. In the first part, we assume that we have read a line from the file and can now process it. In the second part, we load the file and process each line using the function we have defined for this purpose.

* Read the file automatically using the methods introduced during the lecture.
* Represent the dataset as a list of tuples. Make sure that the tuples have the same format as in the previous task, e.g. `('yes', 31, 'good')`.

**Notes**: 
* Make sure that you close the file after you have opened it and read its content. If you use a `with` statement then you don't have to worry about closing the file.
* Make sure when opening a file not to use an absolute path. An absolute path will work on your computer, but when your code is tested on the department's computers, it will fail. Use relative paths when opening files.
* Values read from files are always strings.
* Each line contains a newline `\n` character at the end.
* If you are using Windows as your operating system, refrain from opening any text files using Notepad. It will remove any linebreaks `\n`. You should inspect the files using the Jupyter text editor or any other modern text editor.
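
The notes above can be illustrated with a small self-contained sketch. It writes a throwaway file into a temporary directory purely so that the example runs anywhere; the file name `example.txt` and its contents are made up, and in the exercise itself you should open `health-test.txt` via a relative path as described.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.txt")
with open(path, "w") as f:
    f.write("yes,23,good\nno,40,poor\n")

# The with-statement closes the file automatically, even on errors.
with open(path, "r") as f:
    raw_lines = f.readlines()

print(f.closed)   # True
print(raw_lines)  # each element still ends with a newline character

# Values are strings; strip the newline and convert explicitly.
fields = raw_lines[0].rstrip("\n").split(",")
print(fields)              # ['yes', '23', 'good']
print(int(fields[1]) + 1)  # 24
```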

In [25]:
def parse_line_test(line: str) -> tuple[str, int, str]:
    """
    Takes a line from the file, including a newline, and parses it into a patient tuple.

    Args:
        line: A line from the `health-test.txt` file
    Returns:
        tuple: A tuple representing a patient
    """
    assert (
        line[-1] == "\n"
    ), "Did you change the contents of the line before calling this function?"
    ### Please enter your solution here ###
    vals = line.strip().split(",")
    return vals[0], int(vals[1]), vals[2]

In [26]:
def test_parse_line_test():
    x = "yes,23,good\n"
    parsed_line = parse_line_test(x)
    smoker, age, diet = parsed_line
    print(parsed_line)
    t.assertIsInstance(parsed_line, tuple)
    t.assertEqual(len(parsed_line), 3)
    t.assertIsInstance(age, int)
    t.assertNotIn("\n", diet, "Are you handling line breaks correctly?")
    t.assertEqual(parsed_line[-1], "good")
    
test_parse_line_test()

('yes', 23, 'good')


In [27]:
def gettest() -> list[tuple[str, int, str]]:
    """
    Opens the `health-test.txt` file and parses it
    into a list of patient tuples. You are encouraged to use
    the `parse_line_test` function but it is not necessary to do so.

    This function assumes that the `health-test.txt` file is located in
    the same directory as this notebook.

    Returns:
        list: A list of patient tuples as read from the file
    """
    ### Please enter your solution here ###
    with open("health-test.txt", "r") as f:
        lines = f.readlines()
    
    return [parse_line_test(line) for line in lines]

In [28]:
a = 1, 2, 3
print(type(a))
print(a)

b = list((1, 2, 3))
c = [1, 2, 3]
print(type(b))
print(b)
print(type(c))
print(c)

<class 'tuple'>
(1, 2, 3)
<class 'list'>
[1, 2, 3]
<class 'list'>
[1, 2, 3]


In [29]:
def test_gettest():
    testset = gettest()
    pprint(testset)
    t.assertIsInstance(testset, list)
    t.assertEqual(len(testset), 8)
    t.assertIsInstance(testset[0], tuple)

test_gettest()

[('yes', 21, 'poor'),
 ('no', 50, 'good'),
 ('no', 23, 'good'),
 ('yes', 45, 'poor'),
 ('yes', 51, 'good'),
 ('no', 60, 'good'),
 ('no', 15, 'poor'),
 ('no', 18, 'good')]


## 3. Applying the decision tree to the dataset

* Apply the decision tree to all points in the dataset, and return the proportion of them that are classified as `"more"`.
* A proportion is a value in the interval [0, 1]. For example, if 15 out of 50 data points are classified as `"more"`, the function should return `0.3`.

In [30]:
def evaluate_testset(dataset: list[tuple[str, int, str]]) -> float:
    """
    Calculates the proportion of data points for which the
    decision function evaluates to `more` for a given dataset

    Args:
        dataset: A list of patient tuples

    Returns:
        float: The proportion (a value in [0, 1]) of data points that are evaluated to `'more'`
    """
    ### Please enter your solution here ###
    more = sum(1 for data in dataset if decision(data) == 'more')
    return more / len(dataset)

In [31]:
def test_evaluate_testset():
    ratio = evaluate_testset(gettest())
    print(f"ratio --> {ratio}")
    t.assertIsInstance(ratio, float)
    assert_percentage(ratio)
    t.assertTrue(0.3 < ratio < 0.4)

test_evaluate_testset()

ratio --> 0.375


## 4. Learning from examples
Suppose that instead of relying on a fixed decision tree, we would like to use a data-driven approach where data points are classified based on a set of training observations manually labeled by experts. Such a labeled dataset is available in the file `health-train.txt`. The first three columns have the same meaning as in `health-test.txt`, and the last column corresponds to the labels.

* Read the `health-train.txt` file and convert it into a list of pairs. The first element of each pair is a triplet of attributes (the patient tuple), and the second element is the label.
* Similarly to the previous exercise, we split the task into two parts. The first involves processing each line individually. The second handles opening the file and processing all of its lines.

**Note**: A triplet is a tuple that contains exactly three values; a pair is a tuple that contains exactly two values.
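
A pair that contains a triplet can be taken apart in one step with nested unpacking. The values in this sketch are made up and only illustrate the shape of the data:

```python
# A pair whose first element is a triplet (values are made up).
record = (("no", 42, "good"), "less")

# Nested unpacking mirrors the structure of the data.
(smoker, age, diet), label = record

print(len(record), len(record[0]))  # 2 3
print(smoker, age, diet, label)     # no 42 good less
```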

In [34]:
def parse_line_train(line: str) -> tuple[tuple[str, int, str], str]:
    """
    This function works similarly to the `parse_line_test` function.
    It parses a line of the `health-train.txt` file into a tuple that
    contains a patient tuple and a label.

    Args:
        line: A line from the `health-train.txt`
        
    Returns:
        tuple: A tuple that contains a patient tuple and a label as a string
    """
    assert line[-1] == "\n"
    ### Please enter your solution here ###
    vals = line.strip().split(",")
    return ((vals[0], int(vals[1]), vals[2]), vals[3])


In [35]:
def test_parse_line_train():
    x = "yes,67,poor,more\n"
    parsed_line = parse_line_train(x)
    print(parsed_line)

    t.assertIsInstance(parsed_line, tuple)
    t.assertEqual(len(parsed_line), 2)

    data, label = parsed_line

    t.assertIsInstance(data, tuple)
    t.assertEqual(len(data), 3)
    t.assertEqual(data[1], 67)

    t.assertIsInstance(label, str)
    t.assertNotIn("\n", label, "Are you handling line breaks correctly?")
    t.assertEqual(label, "more")

test_parse_line_train()

(('yes', 67, 'poor'), 'more')


In [48]:
def gettrain() -> list[tuple[tuple[str, int, str], str]]:
    """
    Opens the `health-train.txt` file and parses it into
    a list of patient tuples accompanied by their respective label.

    Returns:
        list: A list of tuples comprised of a patient tuple and a label
    """
    ### Please enter your solution here ###
    with open("health-train.txt", "r") as f:
        lines = f.readlines()
    
    return [parse_line_train(line) for line in lines]

In [49]:
def test_gettrain():
    trainset = gettrain()
    pprint(trainset)
    t.assertIsInstance(trainset, list)
    t.assertEqual(len(trainset), 16)
    first_datapoint = trainset[0]
    t.assertIsInstance(first_datapoint, tuple)
    t.assertIsInstance(first_datapoint[0], tuple)
    t.assertIsInstance(first_datapoint[1], str)

test_gettrain()

[(('yes', 54, 'good'), 'less'),
 (('no', 55, 'good'), 'less'),
 (('no', 26, 'good'), 'less'),
 (('yes', 40, 'good'), 'more'),
 (('yes', 25, 'poor'), 'less'),
 (('no', 13, 'poor'), 'more'),
 (('no', 15, 'good'), 'less'),
 (('no', 50, 'poor'), 'more'),
 (('yes', 33, 'good'), 'more'),
 (('no', 35, 'good'), 'less'),
 (('no', 41, 'good'), 'less'),
 (('yes', 30, 'poor'), 'more'),
 (('no', 39, 'poor'), 'more'),
 (('no', 20, 'good'), 'less'),
 (('yes', 18, 'poor'), 'less'),
 (('yes', 55, 'good'), 'more')]


## 5. Nearest neighbor classifier

We consider the nearest neighbor algorithm that classifies test points following the label of the nearest neighbor in the training data. You can read more about Nearest neighbor classifiers [here](http://www.robots.ox.ac.uk/~dclaus/digits/neighbour.htm). For this, we need to define a distance function between data points. We define it to be

`distance(a, b) = (a[0] != b[0]) + ((a[1] - b[1]) / 50.0) ** 2 + (a[2] != b[2])`

where `a` and `b` are two tuples representing two patients.

* Implement the distance function.
* Implement the function that retrieves for a test point the nearest neighbor in the training set, and classifies the test point accordingly (i.e. returns the label of the nearest data point).

**Hint**: You can use the special `infinity` floating point value with `float('inf')`

***Keep in mind that `bool`s in Python are also `int`s. `True` is the same as `1` and `False` is the same as `0`***
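
The hint about `float('inf')` refers to a common pattern for finding a minimum in a loop: start with infinity as the best value seen so far and update it whenever something smaller appears. The sketch below applies the pattern to made-up `(distance, label)` pairs rather than to patient data, so it is not a solution to the task.

```python
# Made-up (distance, label) candidates; two share the smallest distance.
candidates = [(0.8, "a"), (0.3, "b"), (0.3, "c"), (1.2, "d")]

best_distance = float("inf")  # larger than any finite number
best_label = None

for dist, label in candidates:
    # Strict "<" keeps the first of several equally near candidates.
    if dist < best_distance:
        best_distance = dist
        best_label = label

print(best_distance, best_label)  # 0.3 b
```

Using `<=` instead of `<` would select the last of several tied candidates, so the choice of comparison determines the tie-breaking rule.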

In [50]:
# This cell demonstrates that boolean values can be used in arithmetic operations
print(f'{True + True = }')
print(f'{True + False = }')
print(f'{True * 3 = }')
print(f'{False * True = }')

True + True = 2
True + False = 1
True * 3 = 3
False * True = 0


In [51]:
def distance(a: tuple[str, int, str], b: tuple[str, int, str]) -> float:
    """
    Calculates the distance between two data points (patient tuples).
    
    Args:
        a, b: Two patient tuples for which we want to calculate the distance
        
    Returns:
        float: The distance between a and b according to the above formula
    """
    ### Please enter your solution here ###
    return (a[0] != b[0]) + ((a[1] - b[1]) / 50.0) ** 2 + (a[2] != b[2])

In [52]:
# Test distance
def test_distance():
    x1 = ("yes", 34, "poor")
    x2 = ("yes", 51, "good")
    dist = distance(x1, x2)
    print(f"distance({x1}, {x2}) --> {dist}")
    expected_dist = 1.1156
    t.assertAlmostEqual(dist, expected_dist)

test_distance()

distance(('yes', 34, 'poor'), ('yes', 51, 'good')) --> 1.1156


In [59]:
def neighbor(x: tuple[str, int, str], trainset: list[tuple[tuple[str, int, str], str]]) -> str:
    """
    Returns the label of the nearest data point in trainset to x.
    If x is `('no', 30, 'good')` and the nearest data point in trainset
    is `('no', 31, 'good')` with label `'less'` then `'less'` will be returned.
    In case two elements have the exact same distance, the element that first occurs
    in the dataset is picked (the element with the smallest index).

    Args:
        x: The data point for which we want to find the nearest neighbor
        trainset: A list of tuples with patient tuples and a label

    Returns:
        str: The label of the nearest data point in the trainset. Can only be 'more' or 'less'.
    """
    ### Please enter your solution here ###
    # min() returns the first minimal element, which gives the required tie-breaking
    _, label = min(trainset, key=lambda datapoint: distance(x, datapoint[0]))
    return label


In [60]:
# Test neighbor
def test_neighbor():
    x = ("yes", 31, "good")
    prediction = neighbor(x, gettrain())
    print(f"prediction --> {prediction}")
    t.assertIn(prediction, ["more", "less"])
    expected = "more"
    t.assertEqual(prediction, expected)

test_neighbor()

prediction --> more


In this part we want to compare the decision tree we have implemented with the nearest neighbor method. Apply both the decision tree and nearest neighbor classifiers on the test set, and return the list of datapoint(s) for which the two classifiers disagree, and the probability that they disagree.

In [63]:
def compare_classifiers(trainset: list[tuple[tuple[str, int, str], str]],
                        testset: list[tuple[str, int, str]]) -> tuple[list[tuple[str, int, str]], float]:
    """
    This function compares the two classification methods (decision tree, nearest neighbor)
    by finding all the data points for which the methods disagree. It returns
    a list of the test data points for which the two methods do not return
    the same label as well as the ratio of those data points compared to the whole
    test set (i.e. the probability of disagreement between the two methods).

    Args:
        trainset: The training set used by the nearest neighbor classifier.
        testset: Contains the elements which will be used to compare the decision tree
                 and nearest neighbor classification methods.

    Returns:
        tuple: Tuple of disagree and percentage:
               disagree: A list containing all the data points which yield different results for the two
                         classification methods
               percentage: The ratio of data points for which the two methods disagree, it must be a value between 0 and 1
    """
    ### Please enter your solution here ###
    disagree = []
    for data in testset:
        if neighbor(data, trainset) != decision(data):
            disagree.append(data)
    
    percentage = len(disagree) / len(testset)

    return disagree, percentage

In [64]:
# Test compare_classifiers
def test_compare_classifiers():
    disagree, ratio = compare_classifiers(gettrain(), gettest())
    print(f"ratio = {ratio}")
    t.assertIsInstance(disagree, list)
    t.assertIsInstance(ratio, float)
    t.assertIsInstance(disagree[0], tuple)
    t.assertEqual(len(disagree[0]), 3)
    assert_percentage(ratio)
    t.assertTrue(0.1 < ratio < 0.2)

test_compare_classifiers()

ratio = 0.125


One problem of simple nearest neighbors is that one needs to compare the point to all data points in the training set to make the prediction. This can be slow for datasets of thousands of points or more. Alternatively, some classifiers train a model first and then use it to classify the data.

## 6. Nearest mean classifier

We consider one such trainable model, which operates in two steps:

1. Compute the average point for each class
2. Classify new points to be of the class whose average point is nearest to the point to predict.
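As a rough illustration of these two steps, here is a minimal sketch on made-up two-dimensional points, using plain squared Euclidean distance. The names `class_means` and `nearest_mean_label` and the toy data are assumptions for this example only. They are not part of the exercise interface, and the exercise itself uses its own distance function.

```python
def class_means(dataset):
    """Step 1: average the points of each class, coordinate by coordinate."""
    grouped = {}
    for point, label in dataset:
        grouped.setdefault(label, []).append(point)
    return {
        label: tuple(sum(coord) / len(points) for coord in zip(*points))
        for label, points in grouped.items()
    }


def nearest_mean_label(x, means):
    """Step 2: return the label whose class mean is closest to x."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    return min(means, key=lambda label: sq_dist(x, means[label]))


# Toy data (hypothetical): two well-separated classes "A" and "B"
toy = [((0.0, 0.0), "A"), ((2.0, 0.0), "A"), ((10.0, 10.0), "B"), ((12.0, 10.0), "B")]
toy_means = class_means(toy)
print(toy_means)                                   # {'A': (1.0, 0.0), 'B': (11.0, 10.0)}
print(nearest_mean_label((3.0, 1.0), toy_means))   # A
```

Note that prediction only compares the new point with one mean per class, no matter how many training points there are.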

For this classifier, we convert the attributes smoker and diet to real values (for smoker: `1.0` if 'yes' otherwise `0.0`, and for diet: `0.0` if 'good' otherwise `1.0`), and use the modified distance function:

`distance(a,b) = (a[0] - b[0]) ** 2 + ((a[1] - b[1]) / 50.0) ** 2 + (a[2] - b[2]) ** 2`

From now on, age is also represented as a `float`. The resulting data points are referred to as numerical patient tuples.
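To make the conversion and the modified distance concrete, the sketch below converts two patients and evaluates the formula term by term. `to_numeric` is a hypothetical helper name, and the two patients (including the diet value `"poor"`) are made-up values chosen only for illustration.

```python
def to_numeric(patient):
    """Convert (smoker, age, diet) into a tuple of three floats."""
    smoker, age, diet = patient
    return (
        1.0 if smoker == "yes" else 0.0,   # smoker: 'yes' -> 1.0, otherwise 0.0
        float(age),                        # age keeps its value, as a float
        0.0 if diet == "good" else 1.0,    # diet: 'good' -> 0.0, otherwise 1.0
    )


a = to_numeric(("yes", 23, "good"))   # (1.0, 23.0, 0.0)
b = to_numeric(("no", 41, "poor"))    # (0.0, 41.0, 1.0)

# Evaluate the formula term by term
smoker_term = (a[0] - b[0]) ** 2          # 1.0
age_term = ((a[1] - b[1]) / 50.0) ** 2    # (-18 / 50) ** 2, about 0.1296
diet_term = (a[2] - b[2]) ** 2            # 1.0
total = smoker_term + age_term + diet_term
print(total)                              # approximately 2.1296
```

Dividing the age difference by 50 means that an age gap of 50 years weighs as much as a difference in smoker or diet.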

We adopt an object-oriented approach for building this classifier.

* Implement the `gettrain_num` function that will load the training dataset from the `health-train.txt` file and parse each line to a numerical patient tuple with its label. You can still follow the same structure that we used before (i.e. using a `parse_line_...` function), however, it is not required for this exercise. Only the `gettrain_num` function will be tested.


* Implement the new distance function.


* Implement the methods `train` and `predict` of the class `NearestMeanClassifier`.

In [None]:
def parse_line_train_num(line: str) -> tuple[tuple[float, float, float], str]:
    """
    Takes a line from the file `health-train.txt`, including a newline,
    and parses it into a numerical patient tuple.

    Args:
        line: A line from the `health-train.txt` file
    Returns:
        tuple: A numerical patient tuple and its label
    """

### Please enter your solution here ###



def gettrain_num():
    """
    Parses the `health-train.txt` file into numerical patient tuples.

    Returns:
        patient_tuples: A list of tuples containing numerical patient tuples and their labels
    """

### Please enter your solution here ###



In [None]:
# Test gettrain_num
def test_gettrain_num():
    trainset_num = gettrain_num()
    t.assertIsInstance(trainset_num, list)
    first_datapoint = trainset_num[0]
    print(f"first_datapoint --> {first_datapoint}")
    t.assertIsInstance(first_datapoint[0], tuple)
    t.assertIsInstance(first_datapoint[0][0], float)
    t.assertIsInstance(first_datapoint[0][1], float)
    t.assertIsInstance(first_datapoint[0][2], float)

test_gettrain_num()

In [None]:
def distance_num(a: tuple[float, float, float],
                 b: tuple[float, float, float]) -> float:
    """
    Calculates the distance between two numerical patient tuples.
    
    Args:
        a, b: Two numerical patient tuples for which we want to calculate the distance
        
    Returns:
        float: The distance between a, b
    """

### Please enter your solution here ###



In [None]:
def test_distance_num():
    x1 = (1.0, 23.0, 0.0)
    x2 = (0.0, 41.0, 1.0)
    dist = distance_num(x1, x2)
    print(f"dist --> {dist}")
    t.assertIsInstance(dist, float)
    t.assertTrue(2.12 < dist < 2.13)

test_distance_num()

In [None]:
class NearestMeanClassifier:
    """
    Represents a NearestMeanClassifier.

    When an instance is trained, a dataset is provided and the mean for each class is calculated.
    During prediction, the instance compares the data point to each class mean (not all data points)
    and returns the label of the class mean to which the data point is closest.

    Instance Attributes:
        more: A tuple representing the mean of every 'more' data point in the dataset.
        less: A tuple representing the mean of every 'less' data point in the dataset.
    """

    def __init__(self):
        self.more = None
        self.less = None

    def train(self, dataset: list[tuple[tuple[float, float, float], str]]) -> "NearestMeanClassifier":
        """
        Calculates the class means for a given dataset and stores
        them in instance attributes `more` and `less`.

        The mean of the more class is a tuple containing three elements.
        Each element of the mean tuple contains the mean of all the elements
        in the training set that are labeled `more` for each corresponding index.
        This means that the mean tuple contains the mean smoker, age and diet
        values.
        The same is true of the less mean tuple, but for all the elements
        labeled `less`.

        This method returns the instance itself; its purpose is the side
        effect of setting the `more` and `less` instance attributes.

        Args:
            dataset: A list of tuples each of them containing a numerical patient tuple
                     and its label
        Returns:
            NearestMeanClassifier: Instance of NearestMeanClassifier class
        """

### Please enter your solution here ###

        return self

    def predict(self, x) -> str:
        """
        Returns a prediction/label for numeric patient tuple x.
        The classifier compares the given data point to the mean
        class tuples of each class and returns the label of the
        class to which x is the closest (according to our
        distance function). In the unlikely case that the distance
        is equal for both classes, `'less'` is returned.

        Args:
            x: A numerical patient tuple for which we want a prediction
            
        Returns:
            str: The predicted label
        """

### Please enter your solution here ###


    def __repr__(self):
        try:
            more = tuple(round(m, 3) for m in self.more)
            less = tuple(round(l, 3) for l in self.less)
        except (AttributeError, TypeError):
            # Before training the attributes are None, and iterating over None raises TypeError
            more, less = None, None
        return f"{self.__class__.__name__}(more: {more}, less: {less})"

## 7. Comparing all three classification methods
Finally, we want to compare all three methods that we have implemented. Similarly to how we needed to define a new method to load the training data for the nearest mean classifier, we need to define an equivalent method to load the test data. Our goal is to see for which data points all three classifiers output the same result.

* Load the test dataset into memory as a list of numerical patient tuples. In order to achieve this you have to implement the `gettest_num` function.
* Apply all three methods to each data point in the test set and return the indices of all test examples on which all three classifiers (decision tree, nearest neighbor and nearest mean) agree. Remember that for the nearest mean method we used an OOP approach, so you will have to create an instance and train the model in order to be able to use it on the test data.

**Note**: Be careful that the `NearestMeanClassifier` expects the dataset in a different form, compared to the other two methods.
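The index bookkeeping can be sketched independently of the classifiers. The example below assumes that the three prediction lists have already been computed. The lists shown are made-up values, and `agreeing_indices` is a hypothetical name.

```python
def agreeing_indices(*prediction_lists):
    """Return the indices at which all prediction lists hold the same label."""
    return [
        i
        for i, labels in enumerate(zip(*prediction_lists))
        if len(set(labels)) == 1   # a set with one element means full agreement
    ]


# Made-up predictions for four data points
tree_preds = ["more", "less", "less", "more"]
neighbor_preds = ["more", "more", "less", "more"]
mean_preds = ["more", "less", "less", "less"]
agreed = agreeing_indices(tree_preds, neighbor_preds, mean_preds)
print(agreed)  # [0, 2]
```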

In [None]:
def gettest_num() -> list:
    """
    Parses the `health-test.txt` file into numerical patient tuples.

    Returns:
        list: A list containing numerical patient tuples (tuples consisting of 3 floats), loaded from `health-test.txt`
    """


### Please enter your solution here ###



In [None]:
def test_gettest_num():
    testset_num = gettest_num()
    pprint(testset_num)
    t.assertIsInstance(testset_num, list)
    t.assertEqual(len(testset_num), 8)
    t.assertIsInstance(testset_num[0], tuple)
    t.assertEqual(len(testset_num[0]), 3)
    t.assertEqual(testset_num[0], (1.0, 21.0, 1.0))

test_gettest_num()

In [None]:
def predict_test() -> list:
    """
    Classifies the test set using all the methods that were developed in this exercise sheet,
    namely `decision`, `neighbor` and `NearestMeanClassifier`.
    This function loads all the needed data by calling the corresponding functions
    (gettrain, gettest, gettrain_num, gettest_num).

    Returns:
        list: A list of the indices of all the data points for which all three classifiers have
              the same output
    """

### Please enter your solution here ###

    return agreed_samples

In [None]:
def test_predict_test():
    same_predictions = predict_test()
    pprint(same_predictions)
    t.assertIsInstance(same_predictions, list)
    t.assertEqual(len(same_predictions), 6)
    t.assertIsInstance(same_predictions[0], int)
    t.assertEqual(same_predictions[-2:], [6, 7])

test_predict_test()

### Saving Results

Often when working with a machine learning model, we want to save the classifications made by the model for further analysis. Here, with few data points in the test set and fairly simple models, we can rerun the models and have our results in less than a second. In practice, however, this is usually not the case.

We will save the classifications made by all three models in a `csv` (comma-separated values) file named `health-test-results.csv`. A `csv` file has two components: the header and the data. Each line in a `csv` file contains the same number of values, which are separated by commas. The header provides a label for each position in a line and usually indicates what the values in that position mean.
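For illustration, a tiny `csv` text with a header and two data lines could be put together as follows. The column names and values are made up, and this simple joining assumes that no value contains a comma itself.

```python
toy_header = ["id", "score", "passed"]        # hypothetical column labels
toy_rows = [[1, 0.5, True], [2, 0.75, False]]  # hypothetical data lines

lines = [",".join(toy_header)]
for row in toy_rows:
    # every data line has as many values as the header has labels
    lines.append(",".join(str(value) for value in row))

csv_text = "\n".join(lines)
print(csv_text)
# id,score,passed
# 1,0.5,True
# 2,0.75,False
```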

To do this we will create a `csv` file with the header:

`smoker,age,diet,label,pred_decision_tree,pred_nearest_neighbor,pred_nearest_mean`

We will do this in three steps:

1) Write the function `to_csv` that takes `header` as a fixed argument, followed by an arbitrary number of positional arguments. Be sure to check that the number of entries in the header is the same as the number of additional arguments passed to the function. These additional arguments are lists of values, one list per column, and as such they should all have the same length. Raise a `ValueError` if either of these conditions is not met.

2) Write the function `safe_write` that writes a string to a file. This function should check whether a file of the same name already exists and whether the name of the output file ends with `.csv`. If the file already exists, raise a `RuntimeError`. If the file name does not end in `.csv`, raise a `ValueError`.

3) Write the function `save_results` that classifies all samples in the test set with each classifier, as is done in `predict_test`, creates a `csv` with `to_csv`, and writes the results to a file using `safe_write`.

Call `save_results` at the end of the code block.
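Some Python building blocks are useful for these steps: `*args` collects any number of positional arguments into a tuple, `zip(*columns)` turns column lists into rows, and `os.path.exists` reports whether a path is already taken. The sketch below shows them on made-up data. `describe_columns` and the file name `example.csv` are hypothetical, and this is not meant as the required implementation.

```python
import os
import tempfile


def describe_columns(header, *columns):
    """Pair each header entry with its column after checking the counts match."""
    if len(header) != len(columns):
        raise ValueError("header and columns differ in number")
    return dict(zip(header, columns))


cols = describe_columns(["a", "b"], [1, 2, 3], [4, 5, 6])
rows = list(zip(*cols.values()))   # transpose the columns into rows
print(rows)                        # [(1, 4), (2, 5), (3, 6)]

# Checking for an existing file before writing (inside a temporary folder)
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.csv")
    existed_before = os.path.exists(path)   # False: nothing written yet
    with open(path, "w") as f:
        f.write("a,b\n1,4\n")
    existed_after = os.path.exists(path)    # True: the file is now there
    has_csv_suffix = path.endswith(".csv")  # True

print(existed_before, existed_after, has_csv_suffix)  # False True True
```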

In [None]:
import os

#header = ["smoker", "age", "diet", "label", "pred_decision_tree", "pred_nearest_neighbor", "pred_nearest_mean"]

header = ["a", "b", "c"]


def to_csv(header: list[str], *args) -> str:

### Please enter your solution here ###



def safe_write(text_data: str, outfile='output.txt', folder='./'):
    
    os.makedirs(folder, exist_ok=True)


### Please enter your solution here ###



def save_results() -> None:

    header = ["smoker", "age", "diet", "pred_decision_tree", "pred_nearest_neighbor", "pred_nearest_mean"]
    

### Please enter your solution here ###

    

save_results()