# Applied Machine Learning (2024), exercises


## General instructions for all exercises

Follow the instructions and fill in your solution under the line marked by the tag

> YOUR CODE HERE

Also remove the line

> raise NotImplementedError()

**Do not change other areas of the document**, since it may disturb the autograding of your results!
  
Having written the answer, execute the code cell by pressing the `Shift-Enter` key combination. The code is run, and it may print some information under the code cell. The focus automatically moves to the next cell, and you can execute that cell by pressing `Shift-Enter` again, until you reach the code cell that tests your solution. Execute it and follow the feedback. Usually it either says that the solution seems acceptable or reports some errors. You can go back to your solution, modify it, and repeat the process until you are satisfied. Then proceed to the next task.
   
Repeat the process for all tasks.

The notebook may also contain manually graded answers. Write your manually graded answer under the line marked by the tag:

> YOUR ANSWER HERE

Manually graded answers are written in Markdown format. They may contain text, pseudocode, or mathematical formulas. You can write formulas in $\LaTeX$ syntax by enclosing the formula in dollar signs (`$`). For example, `$f(x)=2 \pi / \alpha$` will produce $f(x)=2 \pi / \alpha$.

When you have passed the tests in the notebook and are ready to submit, validate and submit your solutions using the nbgrader tools from the `Nbgrader/Assignment List` menu.


## Feature analysis

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import sklearn.datasets
import seaborn as sns
import pandas as pd

### Get Phoneme data

Download the Phoneme dataset from [OpenML datasets](https://www.openml.org/search?type=data&sort=runs&status=active) using the `sklearn.datasets.fetch_openml()` function. The only parameter you must give to `fetch_openml()` is the dataset name, `'phoneme'` in this case. The function fetches the data but may emit some warning messages. You can suppress them by giving two extra optional parameters, `version=1` and `parser='auto'`, which define which version of the dataset you want and how the data is parsed.

Store the result as a variable called `phoneme`.

Study the data structure returned by the fetch. What data structures does it contain? Extract the feature matrix to a variable called `X` and the target to a variable called `Y`.

Find out the number of samples and the number of features and save them to variables called `n` and `p`, respectively.

*TIP* The object returned for the phoneme dataset contains several data structures. You can treat it as a dictionary and index it with strings as keys. You can list all available keys using `phoneme.keys()`.
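As a minimal sketch of this dictionary-like access, the example below uses the iris dataset bundled with scikit-learn instead of the OpenML download, so it runs without a network connection. It assumes that the dataset loaders return the same kind of container (a `Bunch`). The names `bunch`, `features`, `target`, `n_rows`, and `n_cols` are illustrative only and are not the names the autograder expects.

```python
from sklearn.datasets import load_iris

# Illustration only: the bundled iris data stands in for the OpenML download.
bunch = load_iris(as_frame=True)

# The returned object behaves like a dictionary with string keys.
print(type(bunch).__name__)
print(list(bunch.keys()))

# Entries can be read with dictionary-style or attribute-style access.
features = bunch["data"]   # pandas DataFrame, one row per sample
target = bunch.target      # pandas Series with the class labels

# The shape of the feature matrix is (number of rows, number of columns).
n_rows, n_cols = features.shape
print(n_rows, n_cols)
```

The same pattern of listing the keys first and then picking the entries you need should carry over to the object returned by `fetch_openml()`.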

Name your variables according to the following table so that the autograder tests work. Please note that the variable names are case sensitive.

| Variable | Name |
| ----------|------|
| The dataset object returned by `fetch_openml()` | `phoneme` |
| The feature matrix |`X`|
| The target variable|`Y`|
| Number of samples |`n`|
| Number of features |`p`|


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
X.head()

In [None]:
# Check the required variables one by one and report the first problem found.
if 'phoneme' not in globals():
    print("phoneme not found! Please read the data to it.")
elif 'X' not in globals():
    print("X not found! Please assign the feature data to it.")
elif 'Y' not in globals():
    print("Y not found! Please assign the target data to it.")
elif 'n' not in globals():
    print("n not found! Please assign the number of samples to it.")
elif 'p' not in globals():
    print("p not found! Please assign the number of features to it.")
elif not isinstance(X, pd.DataFrame):
    print("X is not a pandas data frame:", type(X))
else:
    print("No errors found so far. Your code may work.")


### Visual examination

Use Seaborn to make a pairplot showing how the features depend on the target class. The pairplot function returns a handle to the figure it draws. Store this handle in the variable `fig`.

It is really useful to group the data by using the target class as a hue, like in the lecture notes. This requires merging the target class into the same data frame as the data. Make this fancier plot in the following steps:

1. Make a new dataframe `Xwithy` by copying `X` into it. Use the deep copy method `X.copy()` instead of simple assignment. Read more about [Equals (=) vs shallow copy vs deep copy in Pandas Dataframes](https://direct.dataquest.io/equals-vs-shallow-copy-vs-deep-copy-in-pandas-dataframes-8affdbf85161).
2. Add a new column `target` to the new dataframe `Xwithy` to hold the target variable. This can be achieved by simply assigning the vector `Y` to a new column `target` in the dataframe `Xwithy`. The assignment is legal even if the column does not exist yet; the column is created automatically during the assignment.
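The difference between plain assignment and `copy()` can be sketched on a small made-up dataframe. The names `toy`, `labels`, `alias`, and `independent` are hypothetical and used only for this illustration. The values are invented.

```python
import pandas as pd

# Small made-up data, only for illustration.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
labels = pd.Series(["1", "2", "1"])

# Plain assignment: both names refer to the same dataframe object.
alias = toy

# copy() makes a deep copy by default, so changes do not leak back into toy.
independent = toy.copy()

# Assigning to a column that does not exist yet creates it.
# A pandas Series is aligned on the index, so the indexes should match.
independent["target"] = labels

print(alias is toy)
print(list(toy.columns))
print(list(independent.columns))
```

Because the column was added to the copy, the original dataframe keeps only its feature columns, which is what the grading table above asks for `X`.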

Needed variables for grading

| Variable | Name |
| ----------|------|
| The original unmodified dataframe | `X` |
| New dataframe having the target too |`Xwithy`|
| The target variable |`Y`|
| The handle of the seaborn figure |`fig`|


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
if 'fig' not in globals():
    print("fig is not defined, please store the pairplot as fig")
elif 'Xwithy' not in globals():
    print("Xwithy is not defined, follow the instructions please")
else:
    print("This can work")


In [None]:
X.shape

### Test of normality

Test whether the variables are normally distributed in each target class using a normality test. Place the p-values of the normality tests in arrays or lists called `pn1` and `pn2`.

![image.png](attachment:bbcc93bf-3b80-43e5-aeb7-cb6bf3ee85b6.png)

Needed variables for grading

| Variable | Name |
| ----------|------|
| The p-value of normality test of variable V1 in target 1 | `pn1[0]` |
| The p-value of normality test of variable V2 in target 1 | `pn1[1]` |
| The p-value of normality test of variable V3 in target 1 | `pn1[2]` |
| The p-value of normality test of variable V4 in target 1 | `pn1[3]` |
| The p-value of normality test of variable V5 in target 1 | `pn1[4]` |
| The p-value of normality test of variable V1 in target 2 | `pn2[0]` |
| The p-value of normality test of variable V2 in target 2 | `pn2[1]` |
| The p-value of normality test of variable V3 in target 2 | `pn2[2]` |
| The p-value of normality test of variable V4 in target 2 | `pn2[3]` |
| The p-value of normality test of variable V5 in target 2 | `pn2[4]` |

**TIP** Use loops or operation vectorization to make the calculations easier
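The two approaches mentioned in the tip can be sketched on synthetic data with `scipy.stats.normaltest`. Everything in this example is made up for illustration: the dataframe `toy`, its columns `gauss`, `skewed`, and `label`, and the variables `p_loop` and `p_vec`. It is a sketch of the mechanics under these assumptions, not the expected solution.

```python
import numpy as np
import pandas as pd
from scipy.stats import normaltest

# Synthetic data: one roughly normal column and one clearly skewed column.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "gauss": rng.normal(size=500),
    "skewed": rng.exponential(size=500),
    "label": np.repeat(["A", "B"], 250),
})

# Select the rows of one class and only the feature columns.
group_a = toy.loc[toy["label"] == "A", ["gauss", "skewed"]]

# Loop version: one test per column.
p_loop = [normaltest(group_a[col]).pvalue for col in group_a.columns]

# Vectorized version: axis=0 tests every column in a single call.
p_vec = normaltest(group_a.to_numpy(), axis=0).pvalue

print(p_loop)
print(p_vec)
```

A small p-value means that the null hypothesis of normality is rejected for that column. Both versions should give the same p-values, so the choice between them is a matter of taste.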

In [None]:
from scipy.stats import normaltest
from scipy.stats import ttest_ind
from scipy.stats import mannwhitneyu


# YOUR CODE HERE
raise NotImplementedError()

In [None]:
if 'pn1' not in globals():
    print("pn1 is not defined, please store the p-values of target 1 as pn1")
elif 'pn2' not in globals():
    print("pn2 is not defined, please store the p-values of target 2 as pn2")
elif len(pn1)<5:
    print("pn1 does not seem to contain all the p-values")
elif len(pn2)<5:
    print("pn2 does not seem to contain all the p-values")
else:
    print("This can work")



### Test of mean

Use a suitable test to see whether the means of each feature are significantly different between the targets. Store the p-value of the test for each variable in an array or list called `pm`.

Needed variables for grading

| Variable | Name |
| ----------|------|
| The p-value of test of means of variable V1 between targets 1 and 2 | `pm[0]` |
| .... | .... |


**TIP** Use loops or operation vectorization to make the calculations easier
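The two tests imported earlier, `ttest_ind` and `mannwhitneyu`, can be applied column by column as sketched below on synthetic data. The arrays `group_a` and `group_b` and the result names `welch` and `p_mwu` are hypothetical. Which test is suitable for the phoneme data depends on your normality results, so this sketch shows the mechanics only and assumes nothing about the real data.

```python
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

# Synthetic data: two groups with two features each.
# Feature 0 has the same mean in both groups, feature 1 is shifted by 1.
rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
group_b = rng.normal(loc=[0.0, 1.0], scale=1.0, size=(200, 2))

# Welch's t-test (no equal-variance assumption), vectorized over columns.
welch = ttest_ind(group_a, group_b, axis=0, equal_var=False)
print(welch.pvalue)

# Mann-Whitney U test, a rank-based alternative, looped over columns.
p_mwu = [
    mannwhitneyu(group_a[:, j], group_b[:, j], alternative="two-sided").pvalue
    for j in range(group_a.shape[1])
]
print(p_mwu)
```

The t-test compares means directly and assumes roughly normal data. The Mann-Whitney U test compares the two distributions through ranks, so it is a common choice when normality is rejected, although strictly speaking it does not test the means themselves.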

In [None]:

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
if 'pm' not in globals():
    print("pm is not defined, please store the p-values as pm")
elif len(pm)<5:
    print("pm does not seem to contain all the p-values")
else:
    print("This can work")



### The result
Are the means significantly different or not?

YOUR ANSWER HERE