# Module 1, Practical 7

In this practical we will see how to get input from the command line. 

## Gentle reminder on functions

The basic definition of a function is:
```
def function_name(input) :
    #code implementing the function
    ...
    ...
    return return_value
```

Functions are defined with the **def** keyword, which precedes the *function_name*; the list of parameters follows in parentheses. A colon **:** ends the line holding the definition of the function. The code implementing the function is marked by indentation. A function **might** or **might not** return a value; when it does, a **return** statement is used.
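As a quick illustration of the last point, here is a minimal sketch with one function that returns nothing and one that returns a value. The names ```greet``` and ```square``` are made up for this example.

```python
def greet(name):
    """Prints a greeting; there is no return statement, so the result is None."""
    print("Hello, {}!".format(name))


def square(x):
    """Returns the square of x."""
    return x * x


result = greet("Ada")   # prints the greeting
print(result)           # None, because greet does not return anything
print(square(4))        # the returned value can be used directly
```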


## Getting input from the command line

To run a program ```my_python_program.py``` from the command line, open a terminal (in Linux) or the command prompt (in Windows). Assuming that python is present in the path, ```cd``` into the folder containing your python program (e.g. ```cd C:\python\my_exercises\```) and type
```python3 my_python_program.py```
or
```python my_python_program.py```
If arguments have to be passed from the command line, put them after the program name (e.g. ```python my_python_program.py param1 param2 param3```).

Python provides the module **sys** to interact with the interpreter. In particular, **sys.argv** is a list representing all the arguments passed to the python script from the command line.

Consider the following code:

In [None]:
import sys
"""Test input from command line in systest.py"""

if len(sys.argv) != 4: #note that this is the number of params +1!!!
    print("Dear user, I was expecting 3 params. You gave me ",len(sys.argv)-1)
    sys.exit(1)
else:
    for i in range(0,len(sys.argv)):
        print("Param {}: {} ({})".format(i,sys.argv[i],type(sys.argv[i])))

Invoking the ```systest.py``` script from command line with the command  ```python3 exercises/systest.py 1st_param 2nd 3``` will return:
```
Param 0: exercises/systest.py (<class 'str'>)
Param 1: 1st_param (<class 'str'>)
Param 2: 2nd (<class 'str'>)
Param 3: 3 (<class 'str'>)
```
Invoking the ```systest.py``` script from command line with the command  ```python3 exercises/systest.py 1st_param``` will return:
```
Dear user, I was expecting 3 params. You gave me  1
```

Note that the parameter at index 0, ```sys.argv[0]``` holds the name of the script, and that all parameters are actually **strings** (and therefore need to be cast to numbers if we want to do mathematical operations on them).

**Example:** Write a script that takes two integers in input, i1 and i2, and computes the sum, difference, multiplication and division on them. 

In [None]:
import sys
"""Maths example with input from command line"""

if len(sys.argv) != 3:
    print("Dear user, I was expecting 2 params. You gave me ",len(sys.argv)-1)
    sys.exit(1)
else:
    i1 = int(sys.argv[1])
    i2 = int(sys.argv[2])
    print("{} + {} = {}".format(i1,i2, i1 + i2))
    print("{} - {} = {}".format(i1,i2, i1 - i2))
    print("{} * {} = {}".format(i1,i2, i1 * i2))
    if i2 != 0:
        print("{} / {} = {}".format(i1,i2, i1 / i2))
    else:
        print("{} / {} = Infinite".format(i1,i2))


Which, depending on user input, should give something like:

![i1](img/pract7/math.png)

Note that we need to check that the values given in input are actually numbers, otherwise the execution will crash (third example). For non-negative integers this is easy (```str.isdigit()```), but a leading minus sign makes ```isdigit()``` return ```False```, and floats are more complex still, so handling the conversion error with an exception is usually the more robust choice.
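A minimal sketch of the exception-based check follows. The helpers ```to_int``` and ```to_float``` are hypothetical names introduced here; they return ```None``` instead of crashing when the text is not a valid number.

```python
def to_int(text):
    """Returns text converted to int, or None if it is not a valid integer."""
    try:
        return int(text)
    except ValueError:
        return None


def to_float(text):
    """Returns text converted to float, or None if it is not a valid number."""
    try:
        return float(text)
    except ValueError:
        return None


# Compare the three checks on a few sample strings
for raw in ["42", "-7", "3.14", "abc"]:
    print(repr(raw), "isdigit:", raw.isdigit(),
          "int:", to_int(raw), "float:", to_float(raw))
```

In a script, the same functions could be applied to ```sys.argv[1]``` and ```sys.argv[2]```, printing an error message when either result is ```None```.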

A more flexible and powerful way of getting input from the command line makes use of the ```argparse``` [module](https://docs.python.org/3/howto/argparse.html). 

## Argparse

Argparse is a command line parsing module which deals with user specified parameters (positional arguments) and optional arguments.


Very briefly, the basic syntax of the ```argparse``` module (for more information check the [official documentation](https://docs.python.org/3/howto/argparse.html)) is the following.

1. Import the module:

```
import argparse
```

2. Define an argparse object:

```
parser = argparse.ArgumentParser(description="This is the description of the program")
```

Note the parameter *description*, a string that describes the program.

3. Add positional arguments:
```
parser.add_argument("arg_name", type = obj, 
                    help = "Description of the parameter")
```
where ```arg_name``` is the name of the argument (which will be used to retrieve its value). The argument has type ```obj``` (the type will be automatically checked for us) and a description specified in the ```help``` string.

4. Add optional arguments:
```
parser.add_argument("-p", "--optional_arg", type = obj, default = def_val, 
                        help = "Description of the parameter")
```
where ```-p``` is the short form of the parameter (the short form is optional), ```--optional_arg``` is the extended name and it requires a value after it, ```type``` is the data type of the parameter passed (e.g. str, int, float, ..), and ```default``` is optional and gives a default value to the parameter. If no default is specified and the argument is not passed, it will get the value ```None```. ```help``` is again the description string.

5. Parse the arguments:
```
args = parser.parse_args()
```
The parser checks the arguments and stores their values in the object returned by ```parse_args()``` (an ```argparse.Namespace```), which we called ```args```.

6. Retrieve and process arguments:
```
myArgName = args.arg_name
myOptArg = args.optional_arg
```
Now the variables contain the values specified by the user and we can use them.
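The six steps can be tried out without a terminal, as in this minimal sketch. The parser and its arguments (```word```, ```--times```) are hypothetical; the only trick is that ```parse_args()``` accepts a list of strings, which it reads instead of the real command line.

```python
import argparse

# Steps 1-2: import the module and create the parser
parser = argparse.ArgumentParser(description="Repeats a word")

# Step 3: one positional argument
parser.add_argument("word", type=str, help="The word to repeat")

# Step 4: one optional argument with a default value
parser.add_argument("-n", "--times", type=int, default=2,
                    help="How many times to repeat the word")

# Step 5: parse. The list stands in for what the user would type
args = parser.parse_args(["hello", "--times", "3"])

# Step 6: retrieve and use the values
print(args.word * args.times)

# When the optional argument is left out, the default is used
defaults = parser.parse_args(["hello"])
print(defaults.times)
```

Calling ```parser.parse_args()``` with no list, as in the steps above, reads the arguments typed by the user instead.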

Let's revisit the maths example from the previous section.

**Example:** Write a script that takes two integers in input, i1 and i2, and computes the sum, difference, multiplication and division on them. 

In [None]:
import argparse
"""Maths example with input from command line"""
parser = argparse.ArgumentParser(description="""This script gets two integers  in input 
and performs some operations on them""")
parser.add_argument("i1", type=int,
                    help="The first integer")
parser.add_argument("i2", type=int,
                    help="The second integer")


args = parser.parse_args()

i1 = args.i1
i2 = args.i2
print("{} + {} = {}".format(i1,i2, i1 + i2))
print("{} - {} = {}".format(i1,i2, i1 - i2))
print("{} * {} = {}".format(i1,i2, i1 * i2))
if i2 != 0:
    print("{} / {} = {}".format(i1,i2, i1 / i2))
else:
    print("{} / {} = Infinite".format(i1,i2))

That returns the following:

![iap2](img/pract7/systest_argparse.png)


Note that we did not have to check the types of the inputs ourselves (see the last run of the script): argparse does this automatically.
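A small sketch of that automatic check, assuming the same two positional integers as in the script above. The list passed to ```parse_args()``` stands in for a command line with a wrong second value; argparse reports the problem and stops the program by raising ```SystemExit```, which is caught here only so that the exit code can be shown.

```python
import argparse

parser = argparse.ArgumentParser(description="Type checking demo")
parser.add_argument("i1", type=int, help="The first integer")
parser.add_argument("i2", type=int, help="The second integer")

try:
    parser.parse_args(["3", "abc"])   # "abc" cannot be converted to int
    exit_code = 0
except SystemExit as err:
    # argparse has already printed a usage message to standard error
    exit_code = err.code

print("exit code:", exit_code)
```

In a normal script there is no need for the ```try```/```except```: the usage message is printed and the program terminates with a non-zero exit status.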

**Example:**
Let's write a program that gets a string (S) and an integer (N) in input and prints the string repeated N times. Three optional parameters are specified: verbosity (-v) to make the software print a more descriptive output, separator (-s) to separate each copy of the string (defaults to " ") and trailpoints (-p) to add several "." at the end of the string (defaults to 1). 

In [None]:
import argparse
parser = argparse.ArgumentParser(description="""This script gets a string 
                                 and an integer and repeats the string N times""")
parser.add_argument("string", type=str,
                    help="The string to be repeated")
parser.add_argument("N", type=int,
                    help="The number of times to repeat the string")

parser.add_argument("-v", "--verbose", action="store_true",
                    help="increase output verbosity")

parser.add_argument("-p", "--trailpoints", type = int, default = 1, 
                    help="Adds these many trailing points")
parser.add_argument("-s", "--separator", type = str, default = " ", 
                    help="The separator between repeated strings")

args = parser.parse_args()

mySTR = args.string + args.separator
trailP = "." * args.trailpoints
answer = mySTR * args.N 

answer = answer[:-len(args.separator)] + trailP #to remove the last separator

if args.verbose:
    print("the string {} repeated {} times is: {}".format(args.string, args.N, answer))
else:
    print(answer)


Executing the program from command line without parameters gives the message:

![i2](img/pract7/noargs.png)

Calling it with the ```-h``` flag:

![i3](img/pract7/help.png)

With the positional arguments ```"ciao a tutti"``` and ```3```:

![i4](img/pract7/pos_args.png)

With the positional arguments ```"ciao a tutti"``` and ```3```, and with the optional parameters ```-s "___" -p 3 -v```

![i5](img/pract7/sample.png)


**Example:**
Let's write a program that reads a text file specified by the user and prints it to screen. Optionally, the file might be compressed with gzip to save space, and the program should be able to read gzipped files too. Hint: use the module gzip, whose interface is very similar to standard file handling ([more info here](https://docs.python.org/3/library/gzip.html?highlight=gzip#module-gzip)). You can find a text file here [textFile.txt](file_samples/textFile.txt) and its gzipped version here [textFile.gz](file_samples/textFile.gz):


In [None]:
import argparse
import gzip

parser = argparse.ArgumentParser(description="""Reads and prints a text file""")
parser.add_argument("filename", type=str, help="The file name")
parser.add_argument("-z", "--gzipped", action="store_true", 
                    help="If set, input file is assumed gzipped")

args = parser.parse_args()
inputFile = args.filename
fh = ""
if args.gzipped:
    fh = gzip.open(inputFile, "rt")
else:
    fh = open(inputFile, "r")

for line in fh:
    line = line.strip("\n")
    print(line)

fh.close()


The output:

![i6](img/pract7/read_gz.png)
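The same idea can be sketched with ```with``` blocks, so that the file is closed even if an error occurs. The helpers ```open_text``` and ```read_lines``` are hypothetical names, and the gzipped file is created in a temporary folder so that the example does not depend on the sample files.

```python
import gzip
import os
import tempfile


def open_text(path, gzipped=False):
    """Returns a text-mode file handle for a plain or a gzipped file."""
    if gzipped:
        return gzip.open(path, "rt")
    return open(path, "r")


def read_lines(path, gzipped=False):
    """Returns the lines of the file without the trailing newline."""
    with open_text(path, gzipped) as fh:
        return [line.rstrip("\n") for line in fh]


# Build a small gzipped file in a temporary folder and read it back
with tempfile.TemporaryDirectory() as tmp:
    gz_path = os.path.join(tmp, "sample.txt.gz")
    with gzip.open(gz_path, "wt") as out:
        out.write("first line\nsecond line\n")
    lines = read_lines(gz_path, gzipped=True)

for line in lines:
    print(line)
```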

**Example:** Let's write a program that reads the content of a file and prints to screen some stats like the number of lines, the number of characters and maximum number of characters in one line. Optionally (if flag -v is set) it should print the content of the file. You can find a text file here [textFile.txt](file_samples/textFile.txt):


In [None]:
import argparse


def readText(f):
    """reads the file and returns a list with 
    each line as separate element"""
    # the with block closes the file as soon as the lines have been read
    with open(f, "r") as myF:
        ret = myF.readlines() #careful with big files!
    return ret


def computeStats(fileList):
    """returns a tuple (num.lines, num.characters,max_char.line)"""
    num_lines = len(fileList)
    lines_len = [len(x.replace("\n", "")) for x in fileList]
    num_char = sum(lines_len)
    max_char = max(lines_len)
    return (num_lines, num_char, max_char)


parser = argparse.ArgumentParser(description="Computes file stats")
parser.add_argument("inputFile", type=str, help="The input file")
parser.add_argument(
    "-v", "--verbose", action="store_true", help="if set, prints the file content")

args = parser.parse_args()

inFile = args.inputFile

lines = readText(inFile)
stats = computeStats(lines)
if args.verbose:
    print("File content:\n{}\n".format("".join(lines)))
print(
    "Stats:\nN.lines:{}\nN.chars:{}\nMax. char in line:{}".format(
        stats[0], stats[1], stats[2]))


Output with -v flag:

![i7](img/pract7/fileread.png)

Output without -v flag:

![i8](img/pract7/filereadnov.png)
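For big files, where ```readlines()``` would load everything in memory, the same three statistics can be computed one line at a time. This is a sketch under that assumption; ```stream_stats``` is a hypothetical name and an in-memory ```io.StringIO``` object stands in for an open file.

```python
import io


def stream_stats(handle):
    """Returns (num_lines, num_chars, max_chars_in_line), reading line by line."""
    num_lines = 0
    num_chars = 0
    max_chars = 0
    for line in handle:
        length = len(line.rstrip("\n"))   # newline characters are not counted
        num_lines += 1
        num_chars += length
        max_chars = max(max_chars, length)
    return (num_lines, num_chars, max_chars)


sample = io.StringIO("abc\nde\n\nfghij\n")
print(stream_stats(sample))
```

With a real file, the call would be ```stream_stats(fh)``` inside a ```with open(...) as fh:``` block. Unlike the version based on ```max()``` of a list, this one also works on an empty file.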

## Libraries installation for the next practicals

This section is not mandatory for the current practical, but it sets the basis for the next ones. To perform the next practicals we will need three additional libraries. Check whether they are already available by typing the following commands in the console or by putting them in a python script:
```
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
```

If, upon execution, you do not get any error messages, you are sorted. Otherwise, you need to install them.

In **Linux** you can install the libraries by typing in a terminal ```sudo pip3 install matplotlib```, ```sudo pip3 install pandas``` and ```sudo pip3 install numpy``` (or ```sudo python3.X -m pip install matplotlib```, ```sudo python3.X -m pip install pandas``` and ```sudo python3.X -m pip install numpy```), where X is your python version.

In **Windows** you can install the libraries by typing in the command prompt (to open it type ```cmd``` in the search box) ```pip3 install matplotlib```, ```pip3 install pandas``` and ```pip3 install numpy```. If you are using anaconda you need to run these commands from the *anaconda prompt*.

**Please install them in this order** (i.e. **matplotlib** first, then **pandas** and finally **numpy**). You might not need to install numpy, as matplotlib requires it. Once that is done, try the above imports again: they should work this time around.
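If you prefer a check that does not stop at the first missing library, here is a minimal sketch using ```importlib.util.find_spec``` from the standard library, which looks a module up without importing it. The helper name ```is_installed``` is made up for this example.

```python
import importlib.util


def is_installed(module_name):
    """Returns True if the top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


for name in ["matplotlib", "pandas", "numpy"]:
    status = "available" if is_installed(name) else "missing"
    print("{}: {}".format(name, status))
```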

## Exercises

1. Modify the program of Exercise 6 of Practical 6 in order to allow users to specify the input and output files from command line. Then test it with the provided files. The text of the exercise follows:

Write a python program that reads two files. The first is a one column text file ([contig_ids.txt](file_samples/contig_ids.txt)) with the identifiers of some contigs that are present in the second file, which is a fasta formatted file ([contigs82.fasta](file_samples/contigs82.fasta)). The program will write on a third, fasta formatted file (e.g. filtered_contigs.fasta) only those entries in *contigs82.fasta* having identifier in *contig_ids.txt*.



<div class="tggle" onclick="toggleVisibility('ex1');">Show/Hide Solution</div>
<div id="ex1" style="display:none;">

In [None]:
"""exercises/filterFasta.py"""

import argparse

def readIDS(f):
    """reads a one column file in and stores
    the ids in a dictionary that is returned at the end"""
    ret = dict()
    with open(f, "r") as file:
        for line in file:
            line = line.strip()
            if(line not in ret):
                ret[line] = 1 #Important. It is like: True
    return ret

def filterFasta(inF, outF, ids2keep):
    oF = open(outF, "w")
    
    outputME = False
    with open(inF, "r") as file:
        for line in file:
            line = line.strip()
            if line.startswith(">"):
                #this is the header
                if ids2keep.get(line[1:],False):
                    oF.write(line +"\n")
                    outputME = True
                    print("Writing contig ", line[1:])
                else:
                    outputME = False
            else:
                if outputME:
                    oF.write(line +"\n")
        
    oF.close()
    

parser = argparse.ArgumentParser(description="Filters a fasta file")
parser.add_argument("inputFasta", type = str, help = "The input fasta file")
parser.add_argument("inputIDS", type = str, help = "The IDS to keep")
parser.add_argument("outputFasta", type = str, 
                    help = "The output fasta file with filtered entries")
args = parser.parse_args()
idsFile = args.inputIDS
inFasta = args.inputFasta
outFasta = args.outputFasta

ids = readIDS(idsFile)
filterFasta(inFasta,outFasta, ids)

</div>

2. Have a look at this website: [sociopatterns.org](http://www.sociopatterns.org/). This collaboration is super cool!
They developed a wearable device called sociobadge. People wear this badge for some days, and the device collects face-to-face interaction data. In the data section of the website, you will find a list of datasets collected in different social contexts (hospital, primary school, high school, workplaces, conferences, ...). 

In this exercise we are going to play with those datasets. In particular, download the [hospital](http://www.sociopatterns.org/datasets/hospital-ward-dynamic-contact-network/) dataset.
The dataset looks like this:

```
Time idA idB typeA typeB
```
For instance:
```
140	1157	1232	MED	ADM
160	1157	1191	MED	MED
500	1157	1159	MED	MED
520	1157	1159	MED	MED
560	1159	1191	MED	MED
580	1159	1191	MED	MED
```
The data are tab-separated. From the first row we can see that at timestamp 140 (seconds) user 1157 interacts with user 1232; from the same row we know that user 1157 is a medical doctor, while user 1232 is a member of the administrative staff.
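A small sketch of how one row in this format could be split into typed fields. The function name ```parse_row``` is an assumption for this example; the row is the first one of the sample above.

```python
def parse_row(line):
    """Splits one tab-separated row into (time, idA, idB, typeA, typeB)."""
    time, id_a, id_b, type_a, type_b = line.rstrip("\n").split("\t")
    # the first three fields are numbers, the last two are user types
    return (int(time), int(id_a), int(id_b), type_a, type_b)


row = parse_row("140\t1157\t1232\tMED\tADM\n")
print(row)
```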

    1. Write a function that asks the user for an id and prints either "the user is not present" or the type of the user, for instance: "the user X is a Med/Nurse/Adm/Patient".
    2. Write a function that asks the user for a user type (Med/Nur/Adm/Pat) and the name of a file (like: "only doctors.txt") and writes a new file containing only the interactions for that type of user.
    3. Print to the screen the pair of users that has the maximum number of interactions. For instance, id 10 and id 30 are probably friends because they have the maximum number of interactions between them.
    4.  **Difficult!** Do doctors have longer continuous interactions with patients or with other doctors? Suppose that a doctor interacts with another doctor for 300 seconds on the first day, and then talks with the same doctor on another day for 400 seconds. Then the longest continuous interaction would be 400 seconds. The question is: do doctors prefer to talk with other doctors or with patients? :) You have to compute the longest interaction for each doctor-to-doctor pair and each doctor-to-patient pair; then you can answer the question. **Solution not given.**




<div class="tggle" onclick="toggleVisibility('miodifficile');">Show/Hide Solution</div>
<div id="miodifficile" style="display:none;">

In [None]:

import argparse

# read the file
def read_the_data(input_file):
    res = []
    with open(input_file, "r") as file:
        for line in file:
            tmp = line.rstrip("\n").split("\t")

            tmp[0] = int(tmp[0])
            tmp[1] = int(tmp[1])
            tmp[2] = int(tmp[2])
            
            res.append(tmp)
    return res
#1
def get_unique_ids(data):
    unique_ids = dict()

    for _,i,j,type_i,type_j in data:
        if i not in unique_ids:
            unique_ids[i] = type_i
        if j not in unique_ids:
            unique_ids[j] = type_j

    return unique_ids

def check_id(data,id):
    unique = get_unique_ids(data)

    if id in unique:
        print("the user",id,"is in the data and it is a ",unique[id])
    else:
        print("the user is not present")

# 2
def get_sub_data(data,type_user):
    sub_data = ""
    for t,i,j,type_i,type_j in data:
        if type_i == type_user.upper() and type_j == type_user.upper():
            tmp_str = str(t)+"\t"+str(i)+"\t"+str(j)+"\t"+type_i+"\t"+type_j
            sub_data += tmp_str+"\n"
    return sub_data

def store_sub_data(sub_data,output_file):

    with open(output_file, 'w') as f:
        f.write(sub_data)
    
    print("file stored in",output_file)


def interaction_frequencies(data):
    interaction = dict()

    for _,i,j,_,_ in data:
        if not (i,j) in interaction and not (j,i) in interaction:
            interaction[(i,j)] = 1
        else:
            if (i,j) in interaction:
                interaction[(i,j)] += 1
            elif (j,i) in interaction:
                interaction[(j,i)] += 1
    return interaction

def serach_biggest_interaction(interaction):
    maximum = 0
    key = None

    for coppia,val in interaction.items():
        if val>maximum:
            maximum = val
            key = coppia

    print("the biggest interaction is between",key[0],"and",key[1],".In particular, they had",maximum,"interactions.")


                


parser = argparse.ArgumentParser(description="Sociopattern data LH10")
parser.add_argument("-I","--input", type = str, required = True, help = "The input file")
parser.add_argument("-O", "--output", type = str, default = "", 
                    help="where to store the file")

parser.add_argument("-S", "--search", type = int, default = -1, 
                    help="id to search")
parser.add_argument("-T","--type",type=str,default="",help="type to search")
args = parser.parse_args()

input_file = args.input
search_id = args.search

output_file = args.output
type_search = args.type


# read the data
data = read_the_data(input_file)

# 1
# if the user specify the id then search it
if search_id != -1:
    check_id(data,search_id)


# 2
# if user specify to search an user type, then it has to specify both the output file and the i
if output_file != "" and type_search != "":
    sub_data = get_sub_data(data,type_search)
    store_sub_data(sub_data,output_file)

# 3
interaction_F = interaction_frequencies(data)
serach_biggest_interaction(interaction_F)

</div>
3. Write a python script that takes in input a single-entry .fasta file (specified from the command line) with the amino acid sequence of a protein and prints (1) the total number of amino acids and (2) for each amino acid, its count and percentage of the whole. Optionally, if the user specifies the flag "-S" (--search) followed by a string representing an amino acid sequence, the program should count and print how many times that input sequence appears. Download the [Sars-Cov-2 Spike Protein](file_samples/P0DTC2.fasta.txt) and test your script on it. *Please use functions*. 


<div class="tggle" onclick="toggleVisibility('ex1-cov');">Show/Hide Solution</div>
<div id="ex1-cov" style="display:none;">

In [None]:
"""exercises/process_fasta.py"""

""" test example: 
python3 process_fasta.py ../file_samples/P0DTC2.fasta.txt -S SSVL """

import argparse

parser = argparse.ArgumentParser(description="Parses a single-entry fasta file and returns some stats.")
parser.add_argument("inputFasta", type = str, help = "The input fasta file")
parser.add_argument("-S", "--search", type = str, default = "", 
                    help="The (optional) string to look for.")





# Reads a fasta file input_file in 
#and returns the tuple (header, sequence)
def read_sequence(input_file):
    sequence = ""
    hdr = ""
    # the with block makes sure the file is closed after reading
    with open(input_file) as inF:
        for line in inF:
            line = line.strip()
            if not line.startswith(">"):
                sequence += line
            else:
                hdr = line[1:]

    return hdr,sequence


# Gets a sequence seq and returns 
# a dictionary with the counts of all elements.
# This function also prints off the counts (and %)
def count_chars(seq):
    char_dict = dict()
    L = len(seq)
    for c in seq:
        if char_dict.get(c, None) is None:
            char_dict[c] = 0
        char_dict[c] += 1
    
    print("The sequence has length: {}".format(L)) 
    for el in char_dict:
        print("{} is present {} times ({:.2f} %)".format(el, char_dict[el], 100 * char_dict[el]/L))
    
    return char_dict

# Counts how many times search_s is in seq and returns an integer
def count_str(seq, search_s):
    return seq.count(search_s)
   

args = parser.parse_args()

inFasta = args.inputFasta

search_str = args.search


h,s = read_sequence(inFasta)
print(h)
print(s)
cnts = count_chars(s)

if len(search_str) > 0:
    C = count_str(s, search_str)
    print("Sequence '{}' is present {} times ".format(search_str, C))

The output should be like:
```    
The sequence has length: 1273
M is present 14 times (1.10 %)
F is present 77 times (6.05 %)
V is present 97 times (7.62 %)
L is present 108 times (8.48 %)
P is present 58 times (4.56 %)
S is present 99 times (7.78 %)
Q is present 62 times (4.87 %)
C is present 40 times (3.14 %)
N is present 88 times (6.91 %)
T is present 97 times (7.62 %)
R is present 42 times (3.30 %)
A is present 79 times (6.21 %)
Y is present 54 times (4.24 %)
G is present 82 times (6.44 %)
D is present 62 times (4.87 %)
K is present 61 times (4.79 %)
H is present 17 times (1.34 %)
W is present 12 times (0.94 %)
I is present 76 times (5.97 %)
E is present 48 times (3.77 %)
Sequence 'SSVL' is present 2 times
```

</div>

4. The [Fisher's dataset](http://onlinelibrary.wiley.com/doi/10.1111/j.1469-1809.1936.tb02137.x/abstract) regarding Petal and Sepal length and width in csv format can be found [here](file_samples/Fishers_Iris.csv). These are the measurements of the flowers of fifty plants for each of three species: Iris setosa, Iris versicolor and Iris virginica.

The header of the file is:
```
Species Number,Species Name,Petal width,Petal length,Sepal length,Sepal width
```

Write a python script that reads this file in input (feel free to hard-code the filename in the code) and computes the average petal length and width and sepal length and width for each of the three different Iris species. Print them to the screen alongside the number of elements.
<div class="tggle" onclick="toggleVisibility('ex4');">Show/Hide Solution</div>
<div id="ex4" style="display:none;">

In [None]:
def readCSV(f):
    """reads the csv dataset and returns a dictionary with 
    species name as key and, as value, a dictionary with 
    four keys : petalLen, sepalLen, petalWidth, sepalWidth
    """
    ret = dict()
    with open(f, "r") as file:
        for line in file:
            line = line.strip()
            if not line.startswith("Species Number"):
                data = line.split(",")
                speciesName = data[1]
                # float() accepts both integer and decimal measurements
                pWidth = float(data[2])
                pLen = float(data[3])
                sLen = float(data[4])
                sWidth = float(data[5])
                if(speciesName not in ret):
                    ret[speciesName] = {"petalLen" : [], "sepalLen" : [], 
                                        "petalWidth" : [], "sepalWidth" : []
                                       }
                ret[speciesName]["petalLen"].append(pLen)
                ret[speciesName]["sepalLen"].append(sLen)
                ret[speciesName]["sepalWidth"].append(sWidth)
                ret[speciesName]["petalWidth"].append(pWidth)
    return ret

def printData(dataDict):
    for s in dataDict:
        avgPlen = sum(dataDict[s]["petalLen"])/len(dataDict[s]["petalLen"])
        avgPwid = sum(dataDict[s]["petalWidth"])/len(dataDict[s]["petalWidth"])
        avgSlen = sum(dataDict[s]["sepalLen"])/len(dataDict[s]["sepalLen"])
        avgSwid = sum(dataDict[s]["sepalWidth"])/len(dataDict[s]["sepalWidth"])
        print("Species {} has {} measurements:".format(s, len(dataDict[s]["petalLen"])))
        print("\t petal length {}".format(avgPlen))
        print("\t petal width {}".format(avgPwid))
        print("\t sepal length {}".format(avgSlen))
        print("\t sepal width {}".format(avgSwid))
        
        
        

inFile = "file_samples/Fishers_Iris.csv"

data = readCSV(inFile)
printData(data)

</div>

## 1st ML model

### Machine Learning
You'll explore this in more depth in other courses, but here's a super simple explanation. You are given a set of data (like images), and each piece of data has a label or category (like "dog" or "cat"). Your goal is to build a model that learns from this data so that, when it sees new, unlabeled data, it can predict the correct label (i.e., whether an image shows a dog or a cat). This process involves training the model on examples to help it make accurate predictions in the future.  

In general, your data is split into a training set and a test set (ignoring validation for now). The main idea is to train your model on the training set, where the labels are available, and then evaluate its performance on the test set to see how well it can predict unseen data.

![ML](img/ml.png)

In the image above, you have 9 images to train your model and 2 images to test how well your model performs.  
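A minimal sketch of such a split, assuming the examples fit in a list. The helper ```train_test_split``` is a hypothetical name written here for illustration (it is not the scikit-learn function of the same name), and the integers are stand-ins for labelled examples.

```python
import random


def train_test_split(items, train_fraction=0.8, seed=0):
    """Shuffles a copy of items and returns the pair (train, test)."""
    shuffled = list(items)
    # a fixed seed makes the split reproducible from run to run
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


samples = list(range(10))   # stand-ins for ten labelled examples
train, test = train_test_split(samples)
print(len(train), len(test))
```

When data and labels live in two separate lists, one option is to split ```list(zip(data, labels))``` so that each example stays attached to its label.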

In this lab you have to implement a simple ML model: k-Nearest Neighbors (k-NN).

---
### What is k-Nearest Neighbors (k-NN)?
k-Nearest Neighbors (k-NN) is one of the simplest machine learning algorithms used for classification and regression. Here, we'll focus on using it for classification, which means dividing data into categories (e.g., classifying flowers as different species).

### Main Idea Behind k-NN
The basic idea behind k-NN is **“things that are similar are likely to belong to the same group (or class).”** To classify a new data point, we look at the data points closest to it (its "neighbors"). We then decide what class the new data point belongs to based on the majority class among its neighbors.

### For instance:

You are given the roundness and diameter of both grapes and pears, and your task is to predict whether a fruit is a grape or a pear based on these two features (roundness and diameter). So your input data is something like:
```
    data = [[0.91, 1.1],[0.88, 1.3], [...] ... [0.7,4.2]]
    label = [1,1,...,0]
```
The data includes the roundness and diameter of the fruits, while the label is a list where each value indicates the type of fruit (1 for grape, 0 for pear). For example, the first fruit has a roundness of 0.91 and a diameter of 1.1, and its label is 1, meaning it is a grape.

Now, suppose you have a new fruit with a roundness of 0.82 and a diameter of 2.1. Based on the data, you need to predict whether it's a grape or a pear. Let's see this with a figure:

![knn](https://images.datacamp.com/image/upload/v1676909140/image2_26000761c3.png)


As you can see, grapes (blue points) tend to have a higher roundness and a smaller diameter, which is why the blue points are clustered in the top-left corner of the plot. On the other hand, pears have lower roundness and larger diameter values. Now, suppose a new fruit appears (represented by a red point)—is it a pear or a grape?

k-NN works by selecting the k closest data points (neighbors) to the input based on a distance metric (usually Euclidean distance). Then, it looks at the labels of those k neighbors and assigns the label that appears most frequently (majority vote) to the input. In our case, it would find the k nearest fruits based on roundness and diameter, and classify the new fruit as either a grape or a pear based on the majority label of its closest neighbors.
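The procedure just described can be sketched in a few lines. This is only an illustration on the three fruits quoted above (the elided ones are left out), with a hypothetical function name ```knn_predict```; the features are used as they are, so the diameter weighs more than the roundness in the distance.

```python
import math
from collections import Counter


def knn_predict(train_data, train_labels, point, k):
    """Returns the majority label among the k training points closest to point."""
    distances = []
    for features, label in zip(train_data, train_labels):
        # Euclidean distance between the training point and the new point
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(features, point)))
        distances.append((dist, label))
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    # majority vote among the k nearest neighbors
    return Counter(nearest_labels).most_common(1)[0][0]


# [roundness, diameter] of the three fruits quoted above; 1 = grape, 0 = pear
data = [[0.91, 1.1], [0.88, 1.3], [0.7, 4.2]]
label = [1, 1, 0]

new_fruit = [0.82, 2.1]
print(knn_predict(data, label, new_fruit, k=3))
```

With so few points the result says little about real fruits; the sketch only shows the mechanics of sorting by distance and voting.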

### To do

1. Download the iris dataset from [here](img/IRIS.csv)
2. Load the data and process it to obtain a list of lists, i.e.
```
    data = [[7.1, 3.0, 5.9, 2.1],
            [5.4, 3.7, 1.5, 0.2],
            [5.6, 2.5, 3.9, 1.1],
            ...]
    labels = ['Iris-virginica', 'Iris-setosa', 'Iris-versicolor']
```
where the first flower has four features (7.1, 3.0, 5.9, 2.1) and it is an Iris-virginica.  
3. Split the data into train and test sets (you can use 80% of the data for training and the rest for testing).  
4. Build your k-NN model! To compute the distance you can use a simple Euclidean distance. The Euclidean distance between two points $p = (p_1, p_2, \dots, p_n)$ and $q = (q_1, q_2, \dots, q_n)$ in an n-dimensional space is given by the formula:
$$
d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \dots + (p_n - q_n)^2}
$$
5. Evaluate your model! Compute predictions for the test set and use its label to compute the accuracy 
$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $$
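
The two formulas above translate almost literally into code. In this sketch the function names ```euclidean_distance``` and ```accuracy``` are assumptions, the two points are the first two flowers of the sample list in step 2, and the predicted labels are invented purely to show the computation.

```python
import math


def euclidean_distance(p, q):
    """Euclidean distance between two points with the same number of coordinates."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# The first two flowers from the sample list in step 2
d = euclidean_distance([7.1, 3.0, 5.9, 2.1], [5.4, 3.7, 1.5, 0.2])
print(round(d, 2))

# Made-up predictions versus true labels, for illustration only
predicted = ["Iris-setosa", "Iris-virginica", "Iris-setosa", "Iris-versicolor"]
actual = ["Iris-setosa", "Iris-virginica", "Iris-versicolor", "Iris-versicolor"]
print(accuracy(predicted, actual))
```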