# Machine Learning Project

In [None]:
# Student names and numbers:

The assignments below should be solved and documented as a project that will form the basis for the
examination. When solving the exercises it is important that you

  * document all relevant results and analyses that you have obtained/performed during the exercises.
  * try to relate your results to the theoretical background of the methods being applied.

Feel free to add cells if you need to.

Please hand in assignments 1-6 in a _**single**_ Jupyter notebook where you retain the questions outlined below. You are welcome to adapt code from the web (e.g. Kaggle kernels), but you **_must_** reference the original source in your notebook. In addition to _clean, well-documented code_ (i.e. functions with <a href="https://www.geeksforgeeks.org/python-docstrings/">docstrings</a>, etc.), your notebook will be judged according to how well each step is explained (using Markdown).

In general, direct questions regarding assignments 1, 4, 5 and 6 to Frederik, and questions regarding assignments 2, 3, and 7 to Richard. 

Last, but not least:
* Looking for an overview of the markdown language? The cheat sheet <a href="https://medium.com/ibm-data-science-experience/markdown-for-jupyter-notebooks-cheatsheet-386c05aeebed">here</a> might help.
* For the Python specific components of the exercises, you should not need constructs beyond those that are already included in the notebooks on the course's web-page (still you should not feel constrained by these, so feel free to be adventurous). You may, however, need to consult the documentation for some of the methods supplied by `sklearn`.

**Groups:** Create your own groups. May be across teams. 2-4 students per group. No one-person groups.


**Submission deadline:** Thursday, December 15 before 13.00 CET (Notebooks + presentation recording)

**Expected workload:** Each student is expected to spend around 50 hours on the project.

### Deliverables
The teams have to submit three deliverables before the submission deadline: 1) a notebook of assignments 1-6, 2) a notebook of assignment 7, and 3) a presentation video uploaded to an online platform, e.g. YouTube or Vimeo.

#### Notebook
The notebook contains all the code to explore the dataset, train the final model and documents each step clearly. If code is copied from another codebase such as Github or Stack Overflow it **_must_** be properly referenced.


#### Presentation
The presentation video should be 15 min long and should highlight the problem you are solving, interesting things you found in the data and the steps involved in building up your model. At the exam we will discuss the presentation and ask questions about your project and submissions. A link to the video must be placed in the notebook for assignment 7.

### Randomness
For ALL random states, choose state = 69 so we can replicate your work.
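
A minimal sketch of what a fixed random state buys you, using NumPy's generator API. The constant name `RANDOM_STATE` is only an illustrative choice; the same value can be passed to the `random_state` parameter of `sklearn` functions such as `train_test_split`.

```python
import numpy as np

RANDOM_STATE = 69  # the seed required by the assignment

# Two generators created with the same seed yield identical shuffles,
# which is what makes a run reproducible.
first_shuffle = np.random.default_rng(RANDOM_STATE).permutation(10)
second_shuffle = np.random.default_rng(RANDOM_STATE).permutation(10)

print(np.array_equal(first_shuffle, second_shuffle))
```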


In [8]:
# Import all necessary modules here:
import ast
from PIL import Image
import os
import numpy as np
import pandas as pd
import datetime as dt
import cv2
import matplotlib.pyplot as plt

from sklearn.neighbors import KNeighborsClassifier

## 1. The IceCat Dataset

__You should be able to do this exercise after Lecture 3.__

The IceCat Dataset, kindly provided to us by Stibo Systems, contains a large amount of data on different office products. As an example of "real-world" data, these data are imperfect and incomplete. As such, this exercise is not so much an exercise in creating a good machine learning model, but places a larger emphasis on "cleaning the data".

We are going to work with a subset of the IceCat Dataset. In particular, you will be provided with a zip file of 5,854 images of office products, each with the name "product ID".jpg. You will also be provided with a list of colors, `colors.txt`, which, when imported using the code below, is a list of tuples of the form `[("product ID", "color"), ...]`. (The code below assumes that `colors.txt` is in the same folder as the jupyter notebook. Feel free to change the code if you prefer a different organization of your files).

In [2]:
with open("colors.txt","r") as file:
    colors = ast.literal_eval(file.read())

Your task is to clean up the data and construct a simple machine learning model (_e.g._, _k_-nearest neighbor) that can identify the color of a product. You have free hands - there is hardly any one "correct answer" - but you need to argue for your choices. Among other things, you probably need to think about the following as you work with the data:

* All of the images have different sizes.

* Some of the images are RGB images (3 layers), others are CMYK (4 layers), some might even be black-and-white (1 layer).

* Some colors are only represented by very few products.

* Some colors are very similar, such as "Purple" and "Violet".

* A product may have a particular color, but a packaging of a different color. Similarly, the color of, say, a computer monitor may be black, while the image of it could show a monitor that is turned on with a green screensaver.

* Many products are attributed to several colors, such as "Black, Blue" or even "Blue, Green, Orange, Violet, Yellow". Yet others are described as "Multicolor" or "Assorted colors".

Again, you have free hands in how you are going to solve these (and other) challenges, but you must argue for and reflect on your choices as you progress.
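
One possible way to handle similar and ambiguous color labels is to normalize each label before training. The sketch below is an assumption-laden illustration: the synonym map, the set of "multi" labels and the function name `normalize_label` are all hypothetical, and it assumes that multi-color labels are comma-separated as in "Black, Blue".

```python
# Hypothetical synonym map; which colors to merge is a judgment call.
SYNONYMS = {"violet": "purple", "gray": "grey"}

# Hypothetical spellings of labels that do not name a single color.
MULTI_LABELS = {"multicolor", "multicolour", "multi",
                "assorted colors", "assorted colours"}


def normalize_label(raw_label):
    """Return one canonical color name, or None if the label is ambiguous."""
    label = raw_label.strip().lower()
    if label in MULTI_LABELS or "," in label:
        return None  # several colors or no specific color: candidate for removal
    return SYNONYMS.get(label, label).capitalize()


for example in ["Violet", "Black, Blue", " Multicolor ", "red"]:
    print(repr(example), "->", normalize_label(example))
```

Returning `None` rather than deleting the entry keeps the decision (drop, or assign to a dominant color) separate from the normalization itself.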

### Cleansing of data

In [None]:
# find min width and min height of image collection to determine dimensions of resize
# analysing data
min_height = 10 * 100
min_width = 10 * 100 

parent_dir = os.getcwd()
img_path = "{}{}".format(parent_dir, "/images/")

for f in os.listdir(img_path):
    fn, fext = os.path.splitext(f)
    imgName = '{}{}'.format(fn, fext)
    img = Image.open(os.path.join(img_path, imgName))
    if (img.width < min_width):
        min_width = img.width
    if (img.height < min_height):
        min_height = img.height
print("Min width: {}, min height: {}".format(min_width, min_height))   

In [3]:
# check how many images can be reduced to 100 by 100 measurements
min_height = 100
min_width = 100 
counter = 0

parent_dir = os.getcwd()
img_path = "{}{}".format(parent_dir, "/images/")

for f in os.listdir(img_path):
    fn, fext = os.path.splitext(f)
    imgName = '{}{}'.format(fn, fext)
    img = Image.open(os.path.join(img_path, imgName))
    if (img.width < min_width or img.height < min_height):
        counter += 1

print("Min width: {}, min height: {}".format(min_width, min_height))   
print(counter)

Min width: 100, min height: 100
12


In [4]:
## as there are only 12 images which have a size smaller than 100 x 100 px, they will be removed
import os
from PIL import Image

counter = 0

parent_dir = os.getcwd()
img_path = "{}{}".format(parent_dir, "/images/")
img_to_remove = []
id_img_to_remove = []

for f in os.listdir(img_path):
    fn, fext = os.path.splitext(f)
    imgName = '{}{}'.format(fn, fext)
    fileName = os.path.join(img_path, imgName)
    img = Image.open(fileName)
    w, h = img.size  # PIL reports size as (width, height)
    if (w < 100 or h < 100):
        id_img_to_remove.append(int(fn))
        img_to_remove.append(fileName)
        counter += 1
print(counter)

for i in range(len(img_to_remove)):
    print(img_to_remove[i])
    os.remove(img_to_remove[i])

for i in range(len(id_img_to_remove)):
    print(id_img_to_remove[i])

12
C:\Users\Maria\Machine learning\MAL1 Project/images/12072108.jpg
C:\Users\Maria\Machine learning\MAL1 Project/images/12072114.jpg
C:\Users\Maria\Machine learning\MAL1 Project/images/1294740.jpg
C:\Users\Maria\Machine learning\MAL1 Project/images/1578973.jpg
C:\Users\Maria\Machine learning\MAL1 Project/images/1586367.jpg
C:\Users\Maria\Machine learning\MAL1 Project/images/1586373.jpg
C:\Users\Maria\Machine learning\MAL1 Project/images/1898642.jpg
C:\Users\Maria\Machine learning\MAL1 Project/images/2090912.jpg
C:\Users\Maria\Machine learning\MAL1 Project/images/26284122.jpg
C:\Users\Maria\Machine learning\MAL1 Project/images/5472096.jpg
C:\Users\Maria\Machine learning\MAL1 Project/images/729840.jpg
C:\Users\Maria\Machine learning\MAL1 Project/images/729841.jpg
12072108
12072114
1294740
1578973
1586367
1586373
1898642
2090912
26284122
5472096
729840
729841


In [5]:
# clean up label data from img with less than 100 px in dimensions
def remove_img(colors_data, img_ids):
    colors_arr = colors_data.copy()
    counter_removed = 0
    i = len(colors_arr)
    while(i):
        i -= 1
        if (colors_arr[i][0] in img_ids):
            counter_removed += 1
            colors_arr.remove(colors_arr[i])
            
    print("Original num of label data: {}".format(len(colors_data)))
    print("Num of label data after small images removed: {}".format(len(colors_arr)))
    print("Num of label tuples removed: {}".format(counter_removed))
    return colors_arr

colors_arr = remove_img(colors, id_img_to_remove)


Original num of label data: 5854
Num of label data after small images removed: 5842
Num of label tuples removed: 12


In [None]:
## Images have different sizes, so every image is resized to exactly 100 x 100 px
## (the aspect ratio is not preserved)
parent_dir = os.getcwd()
img_path = os.path.join(parent_dir, "images")

new_dir_name = "resized-images"
new_dir_path = os.path.join(parent_dir, new_dir_name)
os.makedirs(new_dir_path, exist_ok=True)

for f in os.listdir(img_path):
    if f.endswith('.jpg'):
        image = Image.open(os.path.join(img_path, f))
        # Image.LANCZOS replaces Image.ANTIALIAS, which was removed in Pillow 10
        imgResized = image.resize((100, 100), Image.LANCZOS)
        imgResized.save(os.path.join(new_dir_path, f), 'jpeg', quality=90)

In [None]:
# check how many images RGB, CMYK or L
parent_dir = os.getcwd()
img_path = os.path.join(parent_dir, "resized-images")

RGB_count = 0
CMYK_count = 0
L_count = 0

for file in os.listdir(img_path):
    if file.endswith('.jpg'):
        fn, fext = os.path.splitext(file)
        imgName = '{}{}'.format(fn, fext)
        img = Image.open(os.path.join(img_path, imgName))
        if img.mode == 'RGB':
            RGB_count += 1
        elif img.mode == 'CMYK':
            CMYK_count += 1
        else:
            L_count += 1
                
print('RGB: {}, CMYK: {}, L: {}'.format(RGB_count, CMYK_count, L_count))

In [None]:
# convert all CMYK images to RGB
parent_dir = os.getcwd()
img_path = os.path.join(parent_dir, "resized-images")
count = 0

for file in os.listdir(img_path):
    if file.endswith('.jpg'):
        fn, fext = os.path.splitext(file)
        imgName = '{}{}'.format(fn, fext)
        img = Image.open(os.path.join(img_path, imgName))
        if img.mode == 'CMYK':
            img = img.convert('RGB')
            img.save(os.path.join(img_path, imgName), 'JPEG')
            count += 1
print("Successfully converted [{}] CMYK images to RGB".format(count))

In [None]:
# find maximum amount colors per image
max_num_of_colors = 1

for i in range(len(colors_arr)):
    len_colors = len(colors_arr[i][1].strip().split(" "))
    if (len_colors > max_num_of_colors):
        max_num_of_colors = len_colors

print("Max num of colors per image found [{}]".format(max_num_of_colors))

In [None]:
# check how many tuples in dataset have more than 1 color
num_img_with_more_than_one_color = 0

for i in range(len(colors_arr)):
    len_colors = len(colors_arr[i][1].strip().split(" "))
    if (len_colors > 1):
        num_img_with_more_than_one_color += 1

print("Num of images found with more than 1 color [{}]".format(num_img_with_more_than_one_color))

#### Remove multicolored images and labels

In [None]:
# delete images with more than one color label found from resized
cwd = os.getcwd()    
location = os.path.join(cwd, "resized-images")
num_more_than_two = 0

for i in range(len(colors_arr)):
    len_colors = len(colors_arr[i][1].strip().split(" "))
    if (not colors_arr[i][0] in id_img_to_remove):
        if (len_colors > 1 ):
            num_more_than_two += 1
            file = "{}{}".format(str(colors_arr[i][0]), '.jpg')
            path = os.path.join(location, file)
            os.remove(path)
    

print("Num of images removed [{}]".format(num_more_than_two))

In [None]:
# remove data/tuples with more than one color label
def remove_more_than_one_color_label(colors_data):
    colors_arr = colors_data.copy()
    i = len(colors_arr)
    while (i):
        i -= 1
        len_colors = len(colors_arr[i][1].strip().split(" "))
        if (len_colors > 1):
            colors_arr.remove(colors_arr[i])
    print("Original num of label data: {}".format(len(colors_data)))
    print("Num of label data after multicolor removed: {}".format(len(colors_arr)))
    return colors_arr

colors_arr = remove_more_than_one_color_label(colors_arr)        

In [None]:
# delete images with Multicolour label found from resized
cwd = os.getcwd()    
location = os.path.join(cwd, "resized-images")
num_multi_label_found = 0
multi_label = "Multicolour"
multi_label_abbr = "Multi"
multi_color_exists = False

for i in range(len(colors_arr)):
    if (colors_arr[i][1].lower() == multi_label.lower() or 
        colors_arr[i][1].lower() == multi_label_abbr.lower()):
        multi_color_exists = True
        num_multi_label_found += 1
        file = "{}{}".format(str(colors_arr[i][0]), '.jpg')
        path = os.path.join(location, file)
        os.remove(path)
if not multi_color_exists:
    print("No data with multicolor labels found")

print("Num of images with Multicolour label removed [{}]".format(num_multi_label_found))

In [None]:
# remove data/tuples labelled "Multicolour" or "Multi"
def remove_multicolour_label(colors_data):
    colors_arr = colors_data.copy()
    i = len(colors_arr)
    while (i):
        i -= 1
        if (colors_arr[i][1].lower() == multi_label.lower() or 
           colors_arr[i][1].lower() == multi_label_abbr.lower()):
            colors_arr.remove(colors_arr[i])
    print("Original num of label data: {}".format(len(colors_data)))
    print("Num of label data after multicolor removed: {}".format(len(colors_arr)))
    return colors_arr

colors_arr = remove_multicolour_label(colors_arr)     

In [None]:
# find how many colors in dataset after removing multicolors
colors_found = []
all_colors = []
colors_with_one_occurance = []

for i in range(len(colors_arr)):
    all_colors.append(colors_arr[i][1])
    if (len(colors_found) == 0):
        colors_found.append(colors_arr[i][1])
    else:
        exist = False
        for j in range(len(colors_found)):
            if (colors_arr[i][1] == colors_found[j]):
                exist = True
        if not exist:
            colors_found.append(colors_arr[i][1])

print("Number of colors found: {} in {} images".format(len(colors_found), len(colors_arr)))    
#for i in range(len(colors_found)):
    #print(colors_found[i])

for i in range(len(colors_found)):
    print("Occurrences of color {}: {}".format(colors_found[i], all_colors.count(colors_found[i])))

In [None]:
#Research paper mentioning 11 basic colour categories for classifications of basic colors
#Red, Green, Blue, Yellow, Orange, Pink, Purple, Brown, Grey, Black, and White
basic_color_categories = ["Red", "Green", "Blue", "Yellow", "Orange", "Pink", 
                          "Purple", "Brown", "Grey", "Black", "White"]

cwd = os.getcwd()    
location = os.path.join(cwd, "resized-images")

def delete_img_from_resized_collection(filename):
        file = "{}{}".format(filename, '.jpg')
        path = os.path.join(location, file)
        os.remove(path)

# remove data/tuples (and their images) whose label is not one of the 11 basic colors
def remove_all_non_basic_colors_from_label_data_and_img(colors_data):
    colors_arr = colors_data.copy()
    i = len(colors_arr)
    while (i):
        i -= 1
        if not (colors_arr[i][1] in basic_color_categories):
            delete_img_from_resized_collection(colors_arr[i][0])
            colors_arr.remove(colors_arr[i])
            #print("To remove: [{}]".format(colors_arr[i][1]))
    return colors_arr

colors_arr = remove_all_non_basic_colors_from_label_data_and_img(colors_arr)
print("label data: [{}]".format(len(colors_arr)))

In [None]:
# array of ids
# array of images in vector form
# array of label

# loop through images in folder 
# - take image id and place in array of ids
# - take vector of img and place in vector array
# - find label for that img id and place in array label

#use cleansed label data to construct the three arrays
cwd = os.getcwd()
location = os.path.join(cwd, "resized-images")

arr_id = []
arr_img_vec = []
arr_label = []

def get_img_vector(img_id):
    """Load a resized image and return it as a flat 1-D pixel vector."""
    file = "{}{}".format(img_id, '.jpg')
    path = os.path.join(location, file)
    # convert to RGB so that greyscale images yield vectors of the same length
    arr = np.array(Image.open(path).convert('RGB'))
    return arr.ravel()

for i in range(len(colors_arr)):
    arr_id.append(colors_arr[i][0])
    arr_label.append(colors_arr[i][1])
    arr_img_vec.append(get_img_vector(colors_arr[i][0]))

# print a summary instead of every pixel value
print("Vectors: {}, vector length: {}".format(len(arr_img_vec), len(arr_img_vec[0])))

In [None]:
data = {
    "image ID": arr_id,
    "vector": arr_img_vec,
    "label": arr_label
}

dataset = pd.DataFrame(data)
dataset

In [None]:
# converting categorical data of 'label' column into numerical data - one-hot encoding
dataset = pd.get_dummies(dataset, columns=['label'])
dataset.head()

In [None]:
dataset_copy = dataset.copy(deep = True)

# stack the per-image vectors into one 2-D array of shape (n_images, n_pixels);
# np.array(dataset_copy[['vector']]) would give an (n_images, 1) array of objects
dataset_data = np.stack(dataset_copy['vector'].to_numpy())
print(dataset_data.shape)
print(type(dataset_data))

dataset_target = np.array(dataset_copy[['label_Black', 'label_Blue', 'label_Brown', 'label_Green', 
                                  'label_Grey', 'label_Orange', 'label_Pink', 'label_Purple', 
                                  'label_Red', 'label_White', 'label_Yellow']])
print(dataset_target)
print(len(dataset_target))
print(type(dataset_target))

In [None]:
# split data into train validate and test
from sklearn.model_selection import train_test_split

X_trainval, X_test, y_trainval, y_test = train_test_split(dataset_data, dataset_target, random_state = 69)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, random_state = 69)
print("Size of training set:{}".format(X_train.shape[0]))
print("Size of validation set:{}".format(X_val.shape[0]))
print("Size of test set:{}".format(X_test.shape[0]))

In [None]:
# Train several k-nearest-neighbor models on the data with various values of k
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt

training_accuracy = []
validation_accuracy = []
neighbors_settings = range(1, 20)

for n_neighbors in neighbors_settings:
    # build the model
    classifier = KNeighborsClassifier(n_neighbors = n_neighbors)
    classifier.fit(X_train, y_train)
    # record training set accuracy
    training_accuracy.append(classifier.score(X_train, y_train))
    # record validation accuracy; the test set is held back for the final evaluation
    validation_accuracy.append(classifier.score(X_val, y_val))
    
plt.plot(neighbors_settings, training_accuracy, label="training accuracy")
plt.plot(neighbors_settings, validation_accuracy, label="validation accuracy")
plt.ylabel("Accuracy")
plt.xlabel("n_neighbors")
plt.legend();

In [None]:
# selecting hyperparameters using cross validation

In [None]:
# validate model

## 2. Flights Departing from NYC

__You should be able to do this exercise after Lecture 4.__

For this exercise we will be using the famous nycflights13 data which contains the `airlines`, `airports`, `flights`, `planes`, and `weather` datasets. Please see the documentation (`nycflights13.pdf`) for further information.

**(a)** Load all files as pandas dataframes and display the first 5 rows of each dataset.

**(b)** Convert all temperature attributes to degree Celsius. We will be using this in what follows.
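
A minimal sketch of the conversion, assuming the temperature attributes are recorded in degrees Fahrenheit (check `nycflights13.pdf` to confirm which columns this applies to). The helper name `fahrenheit_to_celsius` and the sample values are illustrative only.

```python
import numpy as np


def fahrenheit_to_celsius(temp_f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (np.asarray(temp_f, dtype=float) - 32.0) * 5.0 / 9.0


# Illustrative values, not taken from the weather data.
sample_f = [32.0, 50.0, 212.0]
sample_c = fahrenheit_to_celsius(sample_f)
print(sample_c)
```

Because the function accepts any array-like input, it can also be applied to a whole pandas column at once.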

**(c)** Using OLS, investigate if flight distance is associated with arrival delay. You should be cautious regarding negative delays.

**(d)** Using OLS, investigate if departure delay is associated with arrival delay. Again,
   consider what to do with negative delays.
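
One way to treat negative delays is to regard an early flight as "no delay" and clip the values at zero before fitting. The sketch below uses synthetic stand-in numbers (not nycflights13 values) and a plain least-squares fit with NumPy; whether clipping, dropping or keeping negative delays is most appropriate is a choice you need to argue for.

```python
import numpy as np

# Synthetic stand-in data, in minutes; not taken from the flights table.
dep_delay = np.array([-5.0, 0.0, 10.0, 30.0, 60.0])
arr_delay = np.array([-12.0, -3.0, 8.0, 25.0, 58.0])

# Option shown here: treat early departures/arrivals as zero delay.
dep_clipped = np.clip(dep_delay, 0, None)
arr_clipped = np.clip(arr_delay, 0, None)

# OLS fit of arr = intercept + slope * dep via least squares.
design = np.column_stack([np.ones_like(dep_clipped), dep_clipped])
(intercept, slope), *_ = np.linalg.lstsq(design, arr_clipped, rcond=None)
print("intercept: {:.2f}, slope: {:.2f}".format(intercept, slope))
```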

**(e)** Investigate whether departure delay is associated with weather conditions
   at the origin airport. This includes descriptives, plotting, regression modelling,
   considering missing values etc. For regression, do OLS, Ridge, Lasso, and Elastic Net.
   The analysis should also include seasonality trends as a "weather condition". You could,
   for instance, plot the daily departure delay with the date (or monthly). What are the
   three most important weather conditions when trying to predict departure delays?

**(f)** Is the age of the plane associated with delay? Do OLS, Ridge, Lasso, and Elastic Net.

**(g)** Do a principal component analysis of the weather at JFK using the following columns:
   temp, dewp, humid, wind_dir, wind_speed, precip, visib.
   How many principal components should be used to capture the variability in the weather data?
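
The number of components is often chosen from the cumulative explained variance. The sketch below illustrates the idea on synthetic data standing in for the seven weather columns; the 90% threshold is a hypothetical choice, and the decomposition is done directly with NumPy's SVD rather than with `sklearn`.

```python
import numpy as np

rng = np.random.default_rng(69)

# Synthetic stand-in for the 7 weather columns (200 observations).
weather = rng.normal(size=(200, 7))
weather[:, 1] = weather[:, 0] + 0.1 * rng.normal(size=200)  # a correlated pair

# Standardize so that columns with large units do not dominate.
standardized = (weather - weather.mean(axis=0)) / weather.std(axis=0)

# Squared singular values are proportional to the variance per component.
singular_values = np.linalg.svd(standardized, compute_uv=False)
explained_ratio = singular_values**2 / np.sum(singular_values**2)
cumulative = np.cumsum(explained_ratio)

# Smallest number of components that reaches the (hypothetical) 90% threshold.
n_components = int(np.searchsorted(cumulative, 0.90) + 1)
print(cumulative.round(3), n_components)
```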

**(h)** Build regression models (OLS, Ridge, Lasso, and Elastic Net) that associate
   an airport's latitude with weather conditions (temp, dewp, humid, wind_dir, wind_speed,
   precip, visib). Remove all but the three most significant weather conditions and redo
   the analysis.

**(i)** On a map, plot the airports that have flights to them where the points that represent
   airports are relative in size to the average departure delay. You can see an example in "airports.png".

**(j)** These questions require no code.

- Suppose you are using Ridge Regression and you notice that the training error and the validation error are almost equal and fairly high. Would you say that the model suffers from high bias or high variance? Should you increase the regularization hyperparameter or reduce it?
- Why would you want to use:
    - Ridge Regression instead of plain Linear Regression (i.e. without any regularization)?
    - Lasso instead of Ridge Regression?
    - Elastic Net instead of Lasso?
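
To build intuition for the regularization hyperparameter, the sketch below computes the closed-form ridge solution on synthetic data and shows how the coefficient vector shrinks as the penalty grows. It omits the intercept for brevity, and the function name `ridge_coefficients` is illustrative only.

```python
import numpy as np


def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution (X^T X + alpha I)^-1 X^T y, without intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


rng = np.random.default_rng(69)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)

# alpha = 0 corresponds to plain least squares.
alphas = (0.0, 10.0, 1000.0)
norms = [np.linalg.norm(ridge_coefficients(X, y, alpha)) for alpha in alphas]
for alpha, norm in zip(alphas, norms):
    print("alpha = {:>6}: coefficient norm = {:.3f}".format(alpha, norm))
```

A larger penalty pulls the coefficients toward zero, which reduces variance at the cost of additional bias.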

## 3. Clustering of Handwritten Digits

__You should be able to do this exercise after Lecture 5.__

This exercise uses the famous MNIST dataset to explore several clustering techniques. The data is stored in a ".mat" file; to load it in a notebook, use the `loadmat()` function from `scipy.io` (replace the path below with your own).

In [None]:
from scipy.io import loadmat
mnist = loadmat('mnist-original')
mnist_data = mnist["data"].T
mnist_label = mnist["label"][0]
import numpy as np
print("Number of datapoints: {}\n".format(mnist_data.shape[0]))
print("Number of features: {}\n".format(mnist_data.shape[1]))
print("List of labels: {}\n".format(np.unique(mnist_label)))

There are 70,000 images, and each image has 784 features. This is because each image is 28×28 pixels,
and each feature simply represents one pixel’s intensity, from 0 (white) to 255 (black). Let’s take a peek at one digit from the dataset. All you need to do is grab an instance’s feature vector, reshape it to a 28×28 array, and display it using Matplotlib’s `imshow()` function:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
index = 4
print("Value of datapoint no. {}:\n{}\n".format(index,mnist_data[index]))
print("As image:\n")
plt.imshow(mnist_data[index].reshape(28,28),cmap=plt.cm.gray_r)
plt.show()

**(a)** Perform k-means clustering with k=10 on this dataset.

**(b)** Using visualization techniques analogous to what we have done in the Clustering notebook
   for the faces data, can you determine the 'nature' of the 10 constructed clusters?
   Do the clusters (roughly) coincide with the 10 different actual digits?

**(c)** Perform a supervised clustering evaluation using the adjusted Rand index.
   Are the results stable, when you perform several random restarts of k-means?
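
The adjusted Rand index compares groupings, not the cluster numbers themselves, which matters because k-means assigns arbitrary cluster IDs. A small sketch with made-up labels, using `adjusted_rand_score` from `sklearn.metrics`:

```python
from sklearn.metrics import adjusted_rand_score

# Made-up ground truth and cluster assignments for six points.
true_digits = [0, 0, 1, 1, 2, 2]
cluster_ids = [2, 2, 0, 0, 1, 1]  # same grouping, different cluster IDs

score = adjusted_rand_score(true_digits, cluster_ids)
print(score)
```

A perfect grouping scores 1 regardless of how the clusters are numbered, so scores from several random restarts can be compared directly.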

**(d)** Now perform hierarchical clustering on the data.
   (in order to improve visibility in the constructed dendrograms, you can also use a
   much reduced dataset as constructed using sklearn.utils.resample shown below).
   Does the visual analysis of the dendrogram indicate a natural number of clusters?

**(e)** Using different cluster distance metrics (ward, single, average, etc.),
   what do the clusterings look like that are produced at the level of k=10 clusters?
   See the Clustering notebook for the needed Python code, including the fcluster
   method to retrieve 'plain' clusterings from the hierarchical clustering.
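
As a reminder of how `fcluster` turns a hierarchy into a 'plain' clustering, here is a small sketch on synthetic 2-D points (three well-separated blobs, not MNIST). It uses `linkage` and `fcluster` from `scipy.cluster.hierarchy`; the data and the value of `n_clusters` are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(69)

# Three synthetic blobs of 20 points each.
points = np.vstack([rng.normal(loc=center, scale=0.1, size=(20, 2))
                    for center in (0.0, 5.0, 10.0)])

n_clusters = 3
flat_labels = {}
for method in ("ward", "single", "average"):
    merge_tree = linkage(points, method=method)
    # criterion="maxclust" cuts the tree so that at most n_clusters clusters remain.
    flat_labels[method] = fcluster(merge_tree, t=n_clusters, criterion="maxclust")
    print(method, np.unique(flat_labels[method], return_counts=True))
```

For the MNIST exercise the same call would be used with `t=10`; comparing the cluster sizes per linkage method is a quick way to spot degenerate clusterings.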

In [None]:
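from sklearn.utils import resample

# draw 200 samples without replacement from the arrays loaded above
small_mnist_data, small_mnist_label = resample(mnist_data, mnist_label, n_samples=200, replace=False, random_state=69)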

**(f)** Do a DBSCAN clustering of the small dataset. Tweak the different parameters.

**(g)** Try to compare the different clustering methods on the MNIST dataset in the same way
   the book does on the faces dataset on pp. 195-206.

## 4. The Local Elections

__You should be able to do this exercise after Lecture 6.__

In the local elections of 2021, around 100 candidates stood for election for the city council of Horsens. 83 of them represented a national party, had more than one candidate and provided answers to the <a href="https://www.dr.dk/nyheder/politik/kandidattest">DR Candidate Test</a>, a test designed to help voters find out who they should vote for. In this test, the candidates answered 18 questions, which we will use as features in the following. The politicians belong to 9 parties, which will be our classes.

The numpy files `X_Horsens.npy` and `Y_Horsens.npy` contain the data. `Y_Horsens.npy` contains a letter representing the party to which each candidate belongs. The following parties are represented:

| Party letter | Party name | Party name (English) | Political position | Party color |
| :-: | :-: | :-: | :-: | :-: |
| A | Socialdemokratiet | Social Democrats | Centre-left | Red |
| B | Radikale Venstre | Social Liberal Party | Centre-left | Indigo |
| C | Det Konservative Folkeparti | Conservative People's Party | Right-wing | Green |
| D | Nye Borgerlige | New Right | Far-right | Black |
| F | Socialistisk Folkeparti | Socialist People's Party | Left-wing | Fuchsia |
| I | Liberal Alliance | Liberal Alliance | Right-wing | Cyan |
| O | Dansk Folkeparti | Danish People's Party | Far-right | Yellow |
| V | Venstre | Danish Liberal Party | Centre-right | Blue |
| Z* | Enhedslisten | Red-Green Alliance | Far-left | Dark red |

*_Note that, although the party letter of Enhedslisten is actually Ø, we will here use Z to avoid any complications with the wonderful Danish letters Æ, Ø and Å. Feel free to change the Z back to an Ø if you find that it does not cause any problems._
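
For plotting, the party colors in the table can be kept in a small lookup. The sketch below assumes that Matplotlib's named colors are an acceptable rendering of the colors listed above; the names `PARTY_COLORS` and `colors_for` are illustrative.

```python
# Party letter -> Matplotlib color name (an assumed rendering of the table above).
PARTY_COLORS = {
    "A": "red",
    "B": "indigo",
    "C": "green",
    "D": "black",
    "F": "fuchsia",
    "I": "cyan",
    "O": "yellow",
    "V": "blue",
    "Z": "darkred",
}


def colors_for(party_letters):
    """Return the plot color for each party letter, in the same order."""
    return [PARTY_COLORS[letter] for letter in party_letters]


print(colors_for(["A", "V", "Z"]))
```

The resulting list can be passed as the `c` argument of a scatter plot.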

Meanwhile, `X_Horsens.npy` contains the answers to the test as numbers between -1.5 and 1.5, such that -1.5 is "Strongly disagree", -0.5 is "Disagree", 0.5 is "Agree" and 1.5 is "Strongly agree". The 18 questions concern, in order, subdivision, schools, windmills, building permits, tall buildings, housing, child care, culture, nursing homes, taxes, sports, refugees, nursing homes (again), public transportation, meat-free days, welfare, privatization, and religious minorities.

Both files can be imported using `numpy.load`.

__(a)__ How well do you (intuitively) expect that we can predict the partisan affiliation of a candidate based on their answers to the test?

__(b)__ Based on the answers from all 83 candidates for the Horsens city council, perform a Principal Component Analysis with 2 principal components. Plot the results in a figure using these 2 components as the axes. Label the points with the party letter and the appropriate color.

__(c)__ Comment on the results. You may consider the following questions for inspiration: Can the political parties be separated? Can the typical distinction of "left-wing" and "right-wing" be discerned? Which of the 18 questions (features) are most important?

The number of candidates (83) is on the (very) low side when we want to do machine learning. Luckily, the neighbouring city of Databorg had no less than 8,300 candidates standing for election, with a political environment similar to that of Horsens. In the following, we will use the data from Databorg. These are stored in the numpy files `X_Databorg.npy` and `Y_Databorg.npy` in the same format as the Horsens data.

__(d)__ Once again, perform a Principal Component Analysis and visualize the results. Compare the results to those of the Horsens data.

Confident that we can predict the partisan affiliation of a politician reasonably well based on their answers to the test, we want to build a model that will allow us to distinguish between the 9 political parties. For this purpose, we split the data into a training and a validation set.

__(e)__ Split the data into a training and a validation set, with appropriate fractions.

First, we assume that a Naive Bayes approach is sufficient for our purposes.

__(f)__ Comment on the basic assumption of the Naive Bayes approach. Is this a reasonable assumption for the problem at hand?
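
For reference, the assumption can be written out as follows, with $y$ the party and $x_1, \dots, x_{18}$ the answers to the 18 questions (notation introduced here for illustration):

$$
P(y \mid x_1, \dots, x_{18}) \;\propto\; P(y) \prod_{i=1}^{18} P(x_i \mid y)
$$

That is, given the party, the answer to each question is treated as independent of the answers to all other questions.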

__(g)__ Classify the instances of the validation set using a Naive Bayes approach. Comment on the results.

Assume instead that a _k_-nearest neighbour approach is sufficient for our needs.

__(h)__ Using default settings of the _k_-NN classifier, classify the instances of the validation set. Comment on the performance.

__(i)__ Play around with different values of _k_. Decide on a "good" value of _k_. Comment on the results.

We now try to use a decision tree instead.

__(j)__ What is the _minimum_ depth of an appropriate decision tree? Why?
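
One possible line of reasoning, assuming binary splits (as in `sklearn` decision trees) and that every party needs at least one leaf of its own, can be checked with a short calculation:

```python
import math

n_classes = 9  # parties in the data set

# A binary tree of depth d has at most 2**d leaves,
# so d must satisfy 2**d >= n_classes.
min_depth = math.ceil(math.log2(n_classes))
print(min_depth)
```

This is only a lower bound on the depth; a tree that separates the parties well may need to be deeper.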

__(k)__ Build a decision tree with at least the depth from above. Play around with the tree depth. Include a figure that shows some relevant measure of the performance as a function of the tree depth. Comment on any issues of over-fitting. Decide on a tree which you will keep for later use. Can you do better than the _k_-NN classifier?

__(l)__ What are the most important features? Visualize this in an appropriate way. Does it match what you would expect? Compare to the results of the PCA analysis. Do we expect them to be the same? Why/why not?

We know that decision trees suffer from certain problems that may be solved by using decision forests.

__(m)__ Build a decision forest. Play around with the number of trees in the forest. Decide on a forest.

__(n)__ Extract the most important features. Comment and compare with previously obtained results.

Finally, we want to compare the models we have worked with so far (i.e., Naive Bayes, _k_-NN, decision tree and decision forest).

__(o)__ Compare the results of the models in terms of confusion matrices, accuracy, precision, recall, and f-score. How well can we predict the partisan affiliation of a candidate based on their answers to a test? How does this compare with your intuition?
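
To make the relation between these measures explicit, the sketch below derives them from a confusion matrix built by hand with NumPy. The labels are made up, and the helper name `confusion_matrix_counts` is illustrative; `sklearn.metrics` offers ready-made equivalents such as `confusion_matrix` and `classification_report`.

```python
import numpy as np


def confusion_matrix_counts(y_true, y_pred, classes):
    """Count predictions: rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(classes)}
    matrix = np.zeros((len(classes), len(classes)), dtype=int)
    for truth, guess in zip(y_true, y_pred):
        matrix[index[truth], index[guess]] += 1
    return matrix


# Made-up labels for six candidates from three parties.
y_true = ["A", "A", "V", "V", "V", "Z"]
y_pred = ["A", "V", "V", "V", "A", "Z"]
matrix = confusion_matrix_counts(y_true, y_pred, classes=["A", "V", "Z"])

true_positives = np.diag(matrix)
precision = true_positives / matrix.sum(axis=0)  # per predicted class (column)
recall = true_positives / matrix.sum(axis=1)     # per true class (row)
f_score = 2 * precision * recall / (precision + recall)
accuracy = true_positives.sum() / matrix.sum()

print(matrix)
print(precision, recall, f_score, accuracy)
```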

## 5. Sentiment Analysis

__You should be able to do this exercise after Lecture 8.__

In this exercise we use the IMDb dataset to perform a sentiment analysis. The code below assumes that the data is placed in the same folder as this notebook. The reviews are loaded as a pandas dataframe, and the beginning of the first few reviews is printed.

In [None]:
reviews = pd.read_csv('reviews.txt', header=None)
labels = pd.read_csv('labels.txt', header=None)
Y = (labels=='positive').astype(np.int_)

print(type(reviews))
print(reviews.head())

**(a)** Split the reviews and labels in test, train and validation sets. The train and validation sets will be used to train your model and tune hyperparameters, the test set will be saved for testing.

In [None]:
from sklearn.model_selection import train_test_split

# First hold out the test set (default test_size=0.25), then split the rest into train and validation
X_trainval, X_test, y_trainval, y_test = train_test_split(reviews, labels, random_state=69)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, random_state=69)
print("Size of training set: {}".format(X_train.shape[0]))
print("Size of validation set: {}".format(X_val.shape[0]))
print("Size of test set: {}".format(X_test.shape[0]))

**(b)** Use the `CountVectorizer` from `sklearn.feature_extraction.text` to create a Bag-of-Words representation of the reviews. (See an example of how to do this in chapter 7 of "Müller and Guido".) Only use the 10,000 most frequent words (use the `max_features` parameter of `CountVectorizer`).

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

bards_words = ["The fool doth think he is wise,",
               "but the wise man knows himself to be a fool"]

# Learn the vocabulary (one integer index per distinct token) from the two sentences
vect = CountVectorizer()
vect.fit(bards_words)

print("Vocabulary size: {}".format(len(vect.vocabulary_)))
print("Vocabulary content:\n {}".format(vect.vocabulary_))

**(c)** Explore the representation of the reviews. How is a single word represented? How about a whole review?
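
To see what the representation looks like, it can help to start from a corpus small enough to read by eye. The sketch below uses two hypothetical toy "reviews" (an assumption, not the IMDb data) and the `max_features` parameter mentioned in (b).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical toy reviews (assumption); use the training reviews in the notebook.
docs = ["a great great movie", "a terrible movie"]

vect = CountVectorizer(max_features=10000)  # keep at most the 10,000 most frequent words
bow = vect.fit_transform(docs)              # sparse matrix: one row per review

# Words ordered by their column index in the matrix.
vocab = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
dense = bow.toarray()

print(vocab)   # single-character tokens such as "a" are dropped by the default tokenizer
print(dense)   # each entry counts how often a word occurs in a review
```

Each word corresponds to one column, and each review to one row of word counts. Only call `toarray()` on small examples; for the full dataset the matrix is best kept sparse.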

**(d)** Train a neural network with a single hidden layer on the dataset, tuning the relevant hyperparameters to optimize accuracy. 
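
A minimal sketch of how a single-hidden-layer network could be tuned on a validation set is given below, using `MLPClassifier` from `sklearn`. The toy reviews, the candidate values for `hidden_layer_sizes` and `alpha`, and the choice of the `lbfgs` solver are all assumptions made to keep the example small; they are not recommendations for the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier

# Hypothetical toy reviews (assumption); use the splits from part (a) in the notebook.
train_texts = ["great film loved it", "wonderful acting great story",
               "terrible film hated it", "awful acting boring story"]
train_labels = [1, 1, 0, 0]
val_texts = ["loved the great story", "boring and awful"]
val_labels = [1, 0]

vect = CountVectorizer(max_features=10000)
X_tr = vect.fit_transform(train_texts)  # fit the vocabulary on the training data only
X_va = vect.transform(val_texts)        # reuse the same vocabulary for validation data

results = {}
for n_hidden in [4, 16]:          # size of the single hidden layer
    for alpha in [1e-4, 1e-1]:    # strength of the L2 regularisation
        mlp = MLPClassifier(hidden_layer_sizes=(n_hidden,), alpha=alpha,
                            solver="lbfgs", max_iter=500, random_state=0)
        mlp.fit(X_tr, train_labels)
        results[(n_hidden, alpha)] = mlp.score(X_va, val_labels)

best_params = max(results, key=results.get)
print("Best (hidden units, alpha):", best_params)
```

The one-element tuple in `hidden_layer_sizes` is what makes this a network with a single hidden layer.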

**(e)** Test your sentiment-classifier on the test set.

**(h)** Use the classifier to classify a few sentences you write yourselves. 

## 6. Speech Recognition

__You should be able to do this exercise after Lecture 9.__

In this exercise, we will work with the <a href="https://arxiv.org/pdf/1804.03209.pdf">Google Speech Command Dataset</a>, which can be downloaded from <a href="http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz">here</a> (note: you do not need to download the full dataset, but it will allow you to play around with the raw audio files). This dataset contains 105,829 one-second-long audio files with utterances of 35 common words.

We will use a subset of this dataset as indicated in the table below.

| Word | How many? | Class # |
| :-: | :-: | :-: |
| Yes | 4,044 | 3 |
| No | 3,941 | 1 |
| Stop | 3,872 | 2 |
| Go | 3,880 | 0 |

The data is given in the files `XSound.npy` and `YSound.npy`, both of which can be imported using `numpy.load`. `XSound.npy` contains spectrograms (_i.e._, matrices with a time axis and a frequency axis, of size 62 (time) x 65 (frequency)). `YSound.npy` contains the class number, as indicated in the table above.

__(a)__ Explore and prepare the data, including splitting the data into training, validation and testing data, handling outliers, perhaps taking logarithms, etc. Data preparation is - as always - quite important. Document what you do.
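
The preparation steps could look roughly like the sketch below. It uses a random array with the documented 62 x 65 shape as a stand-in (an assumption), assumes the spectrogram values are non-negative, and picks a 60/20/20 split arbitrarily.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption) with the documented shape 62 x 65. In the notebook use
# X = np.load("XSound.npy") and y = np.load("YSound.npy") instead.
rng = np.random.default_rng(0)
X = rng.random((200, 62, 65)) * 1e3  # non-negative values, as assumed for the spectrograms
y = rng.integers(0, 4, size=200)     # class numbers 0-3 as in the table above

X_log = np.log1p(X)                     # compress the dynamic range; log1p is defined at zero
X_flat = X_log.reshape(len(X_log), -1)  # flatten each spectrogram to 62 * 65 = 4030 features

# Stratified splits keep the class proportions similar in all three sets.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X_flat, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0)

print(X_train.shape, X_val.shape, X_test.shape)
```

Flattening is only needed for models that expect one feature vector per sample; keep the 2D arrays around for the visualisations.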

__(b)__ Visualize a few examples of yes's, no's, stop's and go's, so that you have a reasonable intuitive understanding of the difference between the words.

__(c)__ Train a neural network and at least one other algorithm on the data. Find a good set of hyperparameters for each model. Do you think a neural network is suitable for this kind of problem? Why/why not?

__(d)__ Classify instances of the validation set using your models. Comment on the results in terms of metrics you have learned in the course.

__(e)__ Identify (a few) misclassified words, including what they are misclassified as. Visualize them as before, and compare with your intuitive understanding of how the words look. Do you find the misclassified examples surprising?

## 7. Group Assignment & Presentation



__You should be able to start on this exercise after Lecture 1.__

*This exercise must be a group effort. That means everyone must participate in the assignment.*

In this assignment you will solve a data science problem end-to-end, pretending to be recently hired data scientists in a company. To help you get started, we've prepared a checklist to guide you through the project. Here are the main steps that you will go through:

1. Frame the problem and look at the big picture
2. Get the data
3. Explore and visualise the data to gain insights
4. Prepare the data to better expose the underlying data patterns to machine learning algorithms
5. Explore many different models and short-list the best ones
6. Fine-tune your models
7. Present your solution 

In each step we list a set of questions that one should have in mind when undertaking a data science project. The list is not meant to be exhaustive, but does contain a selection of the most important questions to ask. We will be available to provide assistance with each of the steps, and will allocate some part of each lesson towards working on the projects.

Your group must submit a _**single**_ Jupyter notebook, structured in terms of the first 6 sections listed above (the seventh will be a video uploaded to some streaming platform, e.g. YouTube, Vimeo, etc.).

### 1. Analysis: Frame the problem and look at the big picture
1. Find a problem/task that everyone in the group finds interesting
2. Define the objective in business terms
3. How should you frame the problem (supervised/unsupervised etc.)?
4. How should performance be measured?

### 2. Get the data
1. Find and document where you can get the data from
2. Get the data
3. Check the size and type of data (time series, geographical etc)

### 3. Explore the data
1. Create a copy of the data for explorations (sampling it down to a manageable size if necessary)
2. Create a Jupyter notebook to keep a record of your data exploration
3. Study each feature and its characteristics:
    * Name
    * Type (categorical, int/float, bounded/unbounded, text, structured, etc)
    * Percentage of missing values
    * Check for outliers, rounding errors etc
4. For supervised learning tasks, identify the target(s)
5. Visualise the data
6. Study the correlations between features
7. Identify the promising transformations you may want to apply (e.g. convert skewed targets to normal via a log transformation)
8. Document what you have learned
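
A first pass over the features, as listed above, can be sketched with pandas. The small DataFrame and its column names are hypothetical and only serve to make the example runnable.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (assumption); replace with a copy of your own dataset.
df = pd.DataFrame({
    "income": [30.0, 45.0, np.nan, 80.0, 52.0],
    "age": [25, 38, 41, 59, 33],
    "city": ["A", "B", "A", None, "B"],
})

exploration = df.copy()  # explore on a copy so the original stays intact

# Type and percentage of missing values per feature.
summary = pd.DataFrame({
    "dtype": exploration.dtypes.astype(str),
    "missing_pct": exploration.isna().mean() * 100,
})
print(summary)

# Pairwise correlations between the numeric features only.
correlations = exploration.select_dtypes("number").corr()
print(correlations)
```

`exploration.describe()` and histograms per feature are natural next steps when checking for outliers.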

### 4. Prepare the data
Notes:
* Work on copies of the data (keep the original dataset intact).
* Write functions for all data transformations you apply, for three reasons:
    * So you can easily prepare the data the next time you run your code
    * So you can apply these transformations in future projects
    * To clean and prepare the test set
    
    
1. Data cleaning:
    * Fix or remove outliers (or keep them)
    * Fill in missing values (e.g. with zero, mean, median, regression ...) or drop their rows (or columns)
2. Feature selection (optional):
    * Drop the features that provide no useful information for the task (e.g. a customer ID is usually useless for modelling).
3. Feature engineering, where appropriate:
    * Discretize continuous features
    * Use one-hot encoding if/when relevant
    * Add promising transformations of features (e.g. $\log(x)$, $\sqrt{x}$, $x^2$, etc)
    * Aggregate features into promising new features
4. Feature scaling: standardise or normalise features
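
One way to follow the advice about writing functions for all transformations is to wrap them in an `sklearn` pipeline, which can then be fitted on the training data and re-applied to the test set. The sketch below is one possible arrangement; the helper `build_preprocessor`, the toy data, and the choice of median imputation are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data (assumption).
df = pd.DataFrame({
    "income": [30.0, 45.0, np.nan, 80.0],
    "city": ["A", "B", "A", "B"],
})


def build_preprocessor(numeric_cols, categorical_cols):
    """Return a transformer that imputes and scales numeric columns
    and one-hot encodes categorical columns (hypothetical helper)."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),  # fill missing values with the median
        ("scale", StandardScaler()),                   # standardise to zero mean, unit variance
    ])
    categorical = OneHotEncoder(handle_unknown="ignore")
    return ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("cat", categorical, categorical_cols),
    ])


preprocessor = build_preprocessor(["income"], ["city"])
prepared = preprocessor.fit_transform(df)  # fit on training data only
print(prepared.shape)
```

Calling `preprocessor.transform(...)` on the test set later applies the statistics learned from the training data, rather than re-estimating them.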

### 5. Short-list promising models
We expect you to do some additional research and train at **least one model per team member**.

1. Train mainly quick and dirty models from different categories (e.g. linear, SVM, Random Forests etc) using default parameters
2. Measure and compare their performance
3. Analyse the most significant variables for each algorithm
4. Analyse the types of errors the models make
5. Have a quick round of feature selection and engineering if necessary
6. Have one or two more quick iterations of the five previous steps
7. Short-list the top three to five most promising models, preferring models that make different types of errors
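
The "quick and dirty" comparison in steps 1 and 2 can be sketched with cross-validation over a handful of models with default parameters. The three models and the synthetic data below are assumptions chosen for illustration; any other candidates would fit the same loop.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (assumption); use your prepared training data instead.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "random forest": RandomForestClassifier(random_state=0),
}

cv_means = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validated accuracy
    cv_means[name] = scores.mean()
    print(f"{name:<20} {scores.mean():.3f} +/- {scores.std():.3f}")
```

For a regression task the same loop works with regressors and a `scoring` argument suited to the problem.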

### 6. Fine-tune the system
1. Fine-tune the hyperparameters
2. Once you are confident about your final model, measure its performance on the test set to estimate the generalisation error
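
The two steps above could be sketched with `GridSearchCV`: hyperparameters are chosen by cross-validation on the training data, and the test set is touched only once at the end. The model, the parameter grid, and the synthetic data are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Hypothetical grid; every combination is evaluated with 3-fold cross-validation.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X_train, y_train)  # the test set is not used during tuning

print("Best parameters:", search.best_params_)

# Single final evaluation on the held-out test set (uses the refitted best model).
test_accuracy = search.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```

If the test score is then used to go back and tune further, it stops being an unbiased estimate of the generalisation error.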

### 7. Present your solution
1. Document what you have done
2. Create a nice 15-minute video presentation with slides
    * Make sure you highlight the big picture first
3. Explain why your solution achieves the business objective
4. Don't forget to present interesting points you noticed along the way:
    * Describe what worked and what did not
    * List your assumptions and your model's limitations
5. Ensure your key findings are communicated through nice visualisations or easy-to-remember statements (e.g. "the median income is the number-one predictor of housing prices")
6. Upload the presentation to some online platform, e.g. YouTube or Vimeo, and supply a link to the video in the notebook.

In [None]:
import pandas as pd

# Load the waste-fee dataset and preview the first five rows
waste_fee_data = pd.read_csv('public_data_waste_fee.csv')
print(waste_fee_data.head(5))

## References

Géron, A. 2017, *Hands-On Machine Learning with Scikit-Learn and TensorFlow*, Appendix B, O'Reilly Media, Inc., Sebastopol.

#### Links and references used in assignment 1
* https://www.kaggle.com/code/robikscube/working-with-image-data-in-python/notebook
* https://www.irjet.net/archives/V8/i8/IRJET-V8I8147.pdf
* https://stackoverflow.com/questions/63001988/how-to-remove-background-of-images-in-python 