Group nr:

Name 1 and CID: Philip Fredriksson (Philiper)

Name 2 and CID: Abdulrahman Hameshli (Hameshli)

In [1]:
import numpy as np
import copy
import random
import pandas as pd
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from mining_world import Environment
from IPython.display import Image



pygame 2.1.2 (SDL 2.0.18, Python 3.9.13)
Hello from the pygame community. https://www.pygame.org/contribute.html


# Mining world 

<img src="imgs/poster.png" width="800"/>

## Scenario


Humanity has now reached a point where we need to extract and refine more Copium, a precious resource with great value. The only problem is that Copium can only be found on certain uninhabitable planets. This of course means that automated robots are sent instead.      

Copium is naturally very unstable and exists only briefly before it decays. Very specific geological activity and circumstances are needed for Copium to form. Its life cycle is as follows. First, a hot stream of liquid magma flows to the surface, creating a hotspot that looks like a small crater. At the surface, if the conditions are right, Copium can form during the cool-down period. But as stated previously, Copium is unstable in its natural environment and decays into other materials shortly after. 

These craters form at random, but their heat can easily be detected with a satellite. A satellite alone cannot tell whether a newly formed deposit contains Copium, so a robot rover on the ground collects further measurements with its sensors. The rover's job is to move to the hotspots and determine whether there could be Copium there. The rover has many sensors that measure the properties of the ground below it, but Copium cannot be detected directly with these types of sensors. This is where machine learning comes in: the measurements are used to classify whether the deposit contains Copium or not.  


<img src="imgs/overView.png" width="500"/>


# The environment

The environment can be initialized as shown below. For each step, a direction (North, South, East, West) is specified for the rover.

## Actions 

<img src="imgs/actions.png" width="300"/>


In [2]:
env = Environment(map_type=1, fps=5, resolution=(1000, 1000))
actions = env.get_action_space()  
print('Possible actions', actions)
for i in range(20):
    env.step(random.choice(actions))  # random action
    env.render() 



env.exit()


Possible actions ['N', 'S', 'W', 'E']


# Navigation - Tree search

This section shows how the navigation is done. You do not need to understand it for the assignment, but it is used throughout. 

##  Breadth first

The method used is breadth-first search, one of the simplest tree search algorithms. It tries every sequence of actions up to a fixed number of steps and returns the first action of the best-scoring sequence. 
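
To make the idea concrete without the game environment, here is a minimal sketch on a made-up one-dimensional world. The names `toy_breadth_first` and `reward` are hypothetical, and the score is a plain sum of step rewards, which is a simplification of the scoring used in the notebook's `Node` class.

```python
from collections import deque

def toy_breadth_first(step_reward, actions, max_depth):
    """Enumerate all action sequences up to max_depth, breadth first,
    and return the first action of the best-scoring sequence."""
    queue = deque([((), 0.0)])  # (action sequence so far, accumulated score)
    best_score, best_action = float("-inf"), None
    while queue:
        sequence, score = queue.popleft()
        if len(sequence) == max_depth:
            continue  # do not expand beyond the fixed depth
        for action in actions:
            child = sequence + (action,)
            child_score = score + step_reward(child)
            if child_score > best_score:
                best_score, best_action = child_score, child[0]
            queue.append((child, child_score))
    return best_action

def reward(sequence):
    # Hypothetical reward: 1 when the rover stands on x == 2, otherwise 0.
    x = sequence.count("E") - sequence.count("W")
    return 1.0 if x == 2 else 0.0

print(toy_breadth_first(reward, ["W", "E"], max_depth=3))
```

The queue guarantees that all sequences of length n are scored before any sequence of length n + 1, which is what makes the search breadth first.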

In [3]:
class Node():
    def __init__(self, actor):
        self.actor = actor
        self.total_score = 0
    
    def update(self, action, inherited_score):
        score = self.actor.step(action)
        self.total_score = 1.05*inherited_score + score
        return self.total_score
    
    def get_score(self):
        return self.total_score

In [4]:
def breadth_first_search(actor, max_depth, action_space):
    node = Node(copy.deepcopy(actor)) 
    queue_keys = ['0'] # queue to keep track of nodes that have not yet been expanded.  
    visited = {queue_keys[0]: node} # saves visited nodes in order to not recalculate the entire path for each step. 
    
    max_score = -np.inf
    best_action = None

    while True:
        key = queue_keys.pop(0)
        if len(key) > max_depth: # stop at a set depth 
            break    
        node = visited[key]
        
        for action in action_space: # expand all children nodes
            child_node = copy.deepcopy(node)  # copy current node
            score = child_node.update(action=action, inherited_score=node.get_score()) # update node with action
            child_key = key + action # create child node key
            
            if score > max_score: # save best path 
                max_score = score
                best_action = child_key[1]
                
            visited[child_key] = child_node  # add child node to visited nodes.
            queue_keys.append(child_key)  # add child node to the queue of non-expanded nodes. 
            
    return best_action


In [5]:
env = Environment(map_type=1, fps=10, resolution=(1000, 1000))

for i in range(100):
    action = breadth_first_search(actor=env.get_actor(), max_depth=3, action_space=env.get_action_space())
    env.step(action)
    env.render()

env.exit()

# Exercise 1: Collect data
The first step is to collect some data that will be used for training and validation. The available feature types can be seen with env.get_sensor_properties(), and the actual measurements can be retrieved with env.get_sensor_readings(). It returns a dictionary with the same keys as env.get_sensor_properties(), containing a value for each feature. If the robot is not currently over a deposit, it returns None. The label can be extracted with env.get_ground_truth(), which returns 1 if there is copium in the deposit and 0 if not. 
  

In [6]:
sensor_properties = env.get_sensor_properties()
print('Sensor properties', sensor_properties)


Sensor properties ['ground_density', 'moist', 'reflectivity', 'silicon_rate', 'oxygen_rate', 'iron_rate', 'aluminium_rate', 'magnesium_rate', 'undetectable']


The exercise is then to append the sensor readings to the corresponding feature in the data dictionary, together with whether there is copium or not.  

In [7]:
env = Environment(map_type=1, fps=500, resolution=(1000, 1000))
sensor_properties = env.get_sensor_properties()

# We can initialize the dictionary the following way.
data = dict()
data['copium'] = [] 

for key in sensor_properties:
    data[key] = []
    
for i in range(5000):
    action = breadth_first_search(actor=env.get_actor(), max_depth=3, action_space=env.get_action_space())
    env.step(action)
    # if we are over a deposit. 
    if env.get_sensor_readings() is not None:
        sensor_readings = env.get_sensor_readings()
        copium = env.get_ground_truth()
        
        
        for k in sensor_readings:
            data[k].append(sensor_readings[k])
        data["copium"].append(copium)
        # TODO: Append the sensor readings and copium to the data dictionary 
     
    env.render()


env.exit()

# Exercise 2: Data structure

## a) Pandas data frame 
In this assignment we will use a pandas data frame for storing the collected data. First create a pandas data frame from the dictionary. The documentation can be found at https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html; only the data field needs to be filled in with the created dictionary. Call this data frame df. 


In [8]:
# TODO: create pandas dataframe
df = pd.DataFrame(data = data)

df


Unnamed: 0,copium,ground_density,moist,reflectivity,silicon_rate,oxygen_rate,iron_rate,aluminium_rate,magnesium_rate,undetectable
0,0,1.321989,0.044215,0.396911,0.196866,0.001051,0.500530,0.089334,0.144944,0.067275
1,0,1.596081,0.043164,0.261395,0.088919,0.036337,0.529313,0.272098,0.007346,0.065987
2,0,1.387114,0.113539,0.342523,0.167453,0.036640,0.333710,0.295505,0.124583,0.042109
3,0,0.867221,0.045670,0.408328,0.182977,0.001452,0.625436,0.042407,0.084928,0.062800
4,1,2.221257,0.143925,0.140340,0.328879,0.045785,0.154637,0.294352,0.141526,0.034821
...,...,...,...,...,...,...,...,...,...,...
1672,0,0.947341,0.174261,0.478678,0.110655,0.053563,0.160335,0.411792,0.226121,0.037535
1673,0,0.914211,0.202574,0.355194,0.397140,0.015709,0.086244,0.361789,0.047044,0.092074
1674,0,1.382906,0.035779,0.592140,0.084296,0.030913,0.555291,0.213399,0.059439,0.056662
1675,0,2.731427,0.050158,0.205842,0.074063,0.023015,0.414536,0.335054,0.095900,0.057432


In [9]:
# # From the data frame you can access all data for a key with for example:
# print("All data for a feature \n", df["copium"])
# # print()

# # You can access a single sample with:
# print("Single sample from index \n", df.iloc[1])
# # print()

# # You can access all features but one with:
# all_features_without_copium = df.drop(columns='copium')
# # print("All features without copium \n", all_features_without_copium)

## b) Part 1: Analyse data balance

The class counts can be retrieved with .value_counts() on a column of a pandas data frame. Use it to get the occurrence of copium in the samples. Is the dataset balanced?

Answer: No, the dataset is not balanced. There are many more zeros than ones.
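
As a small illustration of this check, the sketch below counts the classes in a made-up label column. The `toy` frame and its values are hypothetical and are not the collected data.

```python
import pandas as pd

# Hypothetical labels, not the collected data: 8 samples without copium, 2 with.
toy = pd.DataFrame({"copium": [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]})

counts = toy["copium"].value_counts()                 # absolute count per class
shares = toy["copium"].value_counts(normalize=True)   # fraction per class

print(counts)
print(shares)
```

Looking at the fractions rather than the raw counts makes the imbalance easy to compare between data sets of different size.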

In [10]:
# TODO: Get number of samples with copium and the number of samples without copium.
value = df["copium"].value_counts()
value
df.columns

Index(['copium', 'ground_density', 'moist', 'reflectivity', 'silicon_rate',
       'oxygen_rate', 'iron_rate', 'aluminium_rate', 'magnesium_rate',
       'undetectable'],
      dtype='object')

## b) Part 2: Balance data, do this exercise later! 

We have seen what happens with unbalanced data; now try to balance the data set. You will also need to change ex c) so that it uses the balanced data. We show how it can be done by downsampling the more common class; your job is to create an upsampled balanced data set in a similar way. You only need to use the upsampled data set for the remaining part 2 exercises. 

What could be the reason for choosing one of these over the other?

Answer: Upsampling means that data points are added to the minority class to even out the difference (only in the training data). With resample, the added points are copies of existing minority samples drawn with replacement. Downsampling instead removes data points from the majority class.


The disadvantage of upsampling is that the repeated minority samples can bias the model towards them (overfitting). The disadvantage of downsampling is that we lose data that could be valuable.
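
The answer notes that upsampling should only touch the training data. One way to arrange that is to split first and resample only the training part, so that copies of the same minority sample cannot end up on both sides of the split. The sketch below shows this ordering on a made-up frame; the `toy` frame, its values and the `random_state` seeds are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical toy frame standing in for df: 16 samples without copium, 4 with.
toy = pd.DataFrame({
    "ground_density": [float(i) for i in range(20)],
    "copium": [0] * 16 + [1] * 4,
})

# Split first, keeping the class ratio in both parts (stratify).
toy_train, toy_test = train_test_split(
    toy, test_size=0.25, shuffle=True, stratify=toy["copium"], random_state=0
)

# Upsample the minority class inside the training part only.
train_zero = toy_train[toy_train["copium"] == 0]
train_one = toy_train[toy_train["copium"] == 1]
train_one_up = resample(
    train_one, replace=True, n_samples=len(train_zero), random_state=0
)
toy_train_balanced = pd.concat([train_zero, train_one_up])

# The test rows were set aside before resampling, so no copy can leak into them.
overlap = set(toy_train_balanced.index) & set(toy_test.index)
print(toy_train_balanced["copium"].value_counts())
print("rows shared between train and test:", len(overlap))
```

If the data is upsampled before the split instead, the test set can contain exact copies of training samples, which tends to make the measured scores look better than they are.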




In [11]:
# TODO balance data set. 
# step 1: separate the data into one part that contains copium and one that doesn't,
# can for example be done with df[df["copium"]==0] etc.

# wocp=df.drop["copium"]
# cp= df["copium"]


df_zero = df[df["copium"]==0]
df_one = df[df["copium"]==1]

# downsample majority (without replacement, so no majority sample is repeated)
df_zero_downsampled = resample(df_zero,
                               replace=False,
                               n_samples=df_one.shape[0])

df_balanced_downsampled = pd.concat([df_one, df_zero_downsampled])


# TODO: upsample minority 
df_one_upsampled = resample(df_one,
                            replace=True,
                            n_samples=df_zero.shape[0])

df_balanced_upsampled = pd.concat([df_zero, df_one_upsampled])
# df_balanced_downsampled
df_balanced_upsampled




Unnamed: 0,copium,ground_density,moist,reflectivity,silicon_rate,oxygen_rate,iron_rate,aluminium_rate,magnesium_rate,undetectable
0,0,1.321989,0.044215,0.396911,0.196866,0.001051,0.500530,0.089334,0.144944,0.067275
1,0,1.596081,0.043164,0.261395,0.088919,0.036337,0.529313,0.272098,0.007346,0.065987
2,0,1.387114,0.113539,0.342523,0.167453,0.036640,0.333710,0.295505,0.124583,0.042109
3,0,0.867221,0.045670,0.408328,0.182977,0.001452,0.625436,0.042407,0.084928,0.062800
5,0,0.892759,0.159717,0.372696,0.057934,0.096038,0.289739,0.256003,0.241648,0.058638
...,...,...,...,...,...,...,...,...,...,...
481,1,1.619183,0.080347,0.585788,0.188982,0.012283,0.072776,0.574806,0.085608,0.065544
1117,1,1.672787,0.090138,0.183835,0.216693,0.019543,0.470493,0.058139,0.175600,0.059532
237,1,1.035640,0.243853,0.075727,0.375325,0.006069,0.299553,0.138357,0.163250,0.017446
877,1,1.937138,0.013220,0.269331,0.092234,0.087079,0.524250,0.057285,0.203951,0.035200


## c) Split data

Here we will divide the data into a training set and a test set. A good rule of thumb is to use 80% of the data in the training set and 20% in the test set. The two data sets should be randomly sampled (shuffled). This is done with train_test_split() from sklearn, https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html. The syntax looks like 

train, test = train_test_split(dataframe, test_size=ratio_test_set, shuffle=True)

Why is it important that the data is shuffled when it is split? What could happen otherwise?

Answer: If the data is not shuffled (i.e. still sorted by class), we might end up with one class in the training data and the other class in the test data.
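
The effect is easy to reproduce on a made-up label list that is sorted by class, which is what a frame looks like right after concatenating the two class frames. The list and the `random_state` seed below are assumptions for illustration; `stratify` is an optional extra that keeps the class ratio equal in both parts.

```python
from sklearn.model_selection import train_test_split

# Hypothetical labels sorted by class: 16 without copium followed by 4 with.
labels = [0] * 16 + [1] * 4

# Without shuffling, the last 20 % of the rows become the test set.
train_u, test_u = train_test_split(labels, test_size=0.2, shuffle=False)
print("unshuffled train classes:", set(train_u))
print("unshuffled test classes: ", set(test_u))

# With shuffling and stratification, both parts contain both classes.
train_s, test_s = train_test_split(
    labels, test_size=0.2, shuffle=True, stratify=labels, random_state=0
)
print("stratified train classes:", set(train_s))
print("stratified test classes: ", set(test_s))
```

In the unshuffled case a classifier would be trained without ever seeing a copium sample and then tested only on copium samples.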



In [40]:
# TODO: divide the data into train and test set. 
train, test = train_test_split(df_balanced_upsampled, test_size=0.2, shuffle=True)


# Exercise 3: Performance evaluation

Here we will define a class that will later be used for evaluating the performance of the classification models. More information about precision and recall can be found at https://en.wikipedia.org/wiki/Precision_and_recall. 

Explain why the different metrics are useful. Why is accuracy not always enough?

Answer: Accuracy is not always enough because if one class dominates the data, the classifier can always guess that class and still get a high accuracy score. Precision and recall reveal this, since they only measure how well the positive class (copium) is handled.   
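
A worked example with made-up numbers shows the problem. Assume a hypothetical test set of 100 deposits, 16 of which contain copium, and a classifier that always answers "no copium".

```python
# Hypothetical test set: 84 deposits without copium (0) and 16 with copium (1).
labels = [0] * 84 + [1] * 16
# A "classifier" that always answers "no copium".
preds = [0] * len(labels)

tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)

accuracy = (tp + tn) / len(labels)
precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0

print("accuracy ", accuracy)   # high, although no copium is ever found
print("precision", precision)
print("recall   ", recall)
```

The accuracy is 0.84 even though the rover would never find a single deposit, while recall is 0 and makes the failure visible.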


In [41]:
class Classification_eval(object):
    def __init__(self):
        # counters 
        self.TP = 0 # correctly identified positive 
        self.FP = 0 # falsely identified positive 
        self.TN = 0 # correctly identified negative 
        self.FN = 0 # falsely identified negative 
    
    def update(self, pred, label):
        """
        pred - the prediction, will be either a 1 or 0. 
        label - the correct answer, will be either a 1 or 0.
        """
        # TODO: add to one of the counters each time this function is called. 
        if pred == 1 and label == 1:
            self.TP += 1
        elif pred == 0 and label == 0:
            self.TN += 1
        elif pred == 1 and label == 0:
            self.FP += 1
        elif pred == 0 and label == 1:
            self.FN += 1

    
    def accuracy(self): 
        # returns the accuracy 
        if (self.TP + self.TN) == 0:
            return 0
        # TODO: calculate the accuracy.
        accuracy = (self.TP + self.TN)/(self.TP + self.TN+self.FP + self.FN)
        return np.round(accuracy, 4)
    
    def precision(self): # percentage of the estimated positive that actually is positive
        if self.TP == 0:
            return 0
        # TODO: calculate the precision.
        precision = self.TP/(self.TP+self.FP)
        return np.round(precision, 4)
    
    def recall(self): # percentage of correctly identified positive of the total positive
        if self.TP == 0:
            return 0
        # TODO: calculate the recall.
        recall = self.TP/(self.TP+self.FN)
        return np.round(recall, 4)

# Exercise 4: K- nearest neighbours

## a) Normalize

Here we will code our K-NN classifier; Method 2.1 on page 21 in the book has the pseudocode for K-NN. We start with data normalization, i.e. we normalize the input data so that each feature has the same range in terms of max/min values. The min value can be found with data.min(), and similarly the max value with data.max(). 

Why is it important that the data is normalized for the K-NN algorithm?

Answer: Normalization rescales the numeric columns of the dataset to a common scale. K-NN compares samples by distance, so without normalization a feature with a large range would dominate the distance, and features with small ranges would barely influence which neighbours are chosen.
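
Since K-NN picks neighbours by Euclidean distance, the scale of each feature decides how much it counts. The sketch below uses made-up rows with two features on very different scales (the arrays and the helper `nearest` are hypothetical) and shows that the nearest neighbour changes once min-max normalization is applied.

```python
import numpy as np

# Hypothetical training rows with two features on very different scales:
# column 0 ranges over 0..1000, column 1 over 0..1.
train_x = np.array([[0.0, 0.0], [1000.0, 1.0], [520.0, 0.9], [600.0, 0.1]])
query = np.array([500.0, 0.1])

def nearest(rows, q):
    """Index of the row with the smallest Euclidean distance to q."""
    return int(np.argmin(np.linalg.norm(rows - q, axis=1)))

# Min-max normalization with statistics taken from the training rows only.
lo, hi = train_x.min(axis=0), train_x.max(axis=0)
train_n = (train_x - lo) / (hi - lo)
query_n = (query - lo) / (hi - lo)

print("nearest without normalization:", nearest(train_x, query))
print("nearest with normalization:   ", nearest(train_n, query_n))
```

Without normalization row 2 wins because only the first column matters; after normalization row 3 wins because it matches the query in the second column and is still close in the first.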


In [42]:
class Normalize(object):
    def __init__(self):
        self.min = None
        self.max = None
    
    def normalize(self, data):
        # normalize the data and return it. 
        return (data-self.min)/(self.max-self.min)
    
    def update_normalization(self, data):
        # Save the min and max values for each feature. This function is only used for the training data.
        self.min = data.min()
        self.max = data.max()

## b) K-NN
Let's make the K-NN algorithm; fill in the TODOs.

In [43]:
class KNN(object):
    def __init__(self, k):
        self.features = None #pd.DataFrame(columns=9) # normalized features from training data 
        self.labels = None # the corresponding labels (if there is copium)
        self.normalize = Normalize() # class instance for normalization
        self.k = k # the k value in k-nn algorithm.
        
    def fit(self, features, labels):
        # This is where we save the training data. 
        # TODO: update the normalize filter, normalize the features (save to self.features)
        # and save the labels to self.labels
        self.normalize.update_normalization(features)
        self.features = self.normalize.normalize(features) 
        self.labels = labels
    
    
    def predict(self, features):
        # here we get one sample to make a prediction 
        # TODO normalize the input features. 
        
        features_norm = self.normalize.normalize(features).to_numpy()    # this is b (the normalized query sample)
        train_features = self.features.to_numpy()    # convert once instead of in every iteration
        
        # Make a list with k elements that are all inf, since every number is smaller than inf.
        list_dist = np.ones(self.k)*np.inf
        list_prediction = np.zeros(self.k)
        
        for i in range(len(train_features)): 
            dist = np.linalg.norm(train_features[i] - features_norm)
            if dist < np.max(list_dist):
                # replace the currently farthest of the k stored neighbours
                idx = np.argmax(list_dist)
                list_dist[idx] = dist
                list_prediction[idx] = self.labels.iloc[i]
                
        return self.majority_vote(list_prediction)
    
    def majority_vote(self, pred_list):
        # Here is a function that will return the majority vote from a list. 
        keys = list(Counter(pred_list).keys())
        occurance = list(Counter(pred_list).values())
        idx = np.argmax(occurance)
        return keys[idx]

## c) part 1: Evaluate the K-NN 
Evaluate the K-NN and choose a suitable k value. 

In [47]:
train_labels = train['copium']
train_features = train.drop(columns='copium')

y = test['copium']
x = test.drop(columns='copium')

# TODO, try different values of k. 

knn = KNN(k = 50)
knn.fit(train_features, train_labels)

log = Classification_eval()
for i in range(x.shape[0]):
    pred = knn.predict(x.iloc[i])
    log.update(pred, y.iloc[i])

print('Accuracy', log.accuracy())

print('Precision', log.precision())
print('Recall', log.recall())


Accuracy 0.8142
Precision 0.8294
Recall 0.7712



Try some different values of k. Just looking at these results, would the classifier work well for all k? 

Answer: Not really. Due to the large majority of zeros (no copium), the classifier labels almost every new data point as no copium when k is large. With a large k, each new data point is compared with more neighbours, and because most of them are zeros, the majority vote almost always comes out as zero. At k = 50 no sample is predicted as copium at all, so precision and recall drop to 0. 

| k | Accuracy | Precision | Recall | 
| --- | --- | --- | --- |
| 1 | 0.8601 | 0.6207| 0.3333 |  
| 5 | 0.869 | 0.8571 | 0.2222 |   
| 20 | 0.8512 | 1.0 | 0.0741 |   
| 50 | 0.8393 | 0 | 0 |   
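
The effect described in the answer can be reproduced on a tiny made-up data set. The function `knn_predict` and the arrays below are hypothetical and only sketch the majority vote; unlike the class above, ties are resolved as 0.

```python
import numpy as np

def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k nearest training rows (Euclidean distance)."""
    dists = np.linalg.norm(train_x - query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = train_y[nearest]
    return int(votes.sum() * 2 > k)  # 1 only on a strict majority of ones

# Hypothetical, already normalized data: 8 samples without copium, 2 with.
train_x = np.array([[0.10], [0.15], [0.20], [0.25], [0.30],
                    [0.35], [0.40], [0.45], [0.90], [0.95]])
train_y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
query = np.array([0.92])  # lies right next to the two copium samples

for k in (1, 3, 5, 9):
    print("k =", k, "prediction =", knn_predict(train_x, train_y, query, k))
```

For k = 1 and k = 3 the two nearby copium samples win the vote. From k = 5 on, the vote has to include at least three of the far-away zeros, so the prediction flips to 0 even though the query sits next to the copium samples.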

## c) part 2, do later!
Now we have balanced data, try the same k values as in part 1. Have the results changed since ex 4 c) part 1? Would this classifier work better? 

Answer: This classifier would work better. Accuracy improves for small k and drops slightly for k = 50, while recall increases for every k (for example from 0.0741 to 0.8376 at k = 20). This means that with balanced data the classifier no longer almost always says "no copium". Previously it mostly guessed "no copium", so even when there was copium it would usually miss it. Now it finds the copium relatively often, which was the goal. We don't want a machine that always says "no copium".

| k | Accuracy | Precision | Recall | 
| --- | --- | --- | --- |
| 1 | 0.9593 | 0.9276 | 0.9926 |  
| 5 | 0.9204 | 0.887 | 0.9557 |   
| 20 | 0.8513 | 0.8502 | 0.8376 |   
| 50 | 0.8142 | 0.8294 | 0.7712 | 

# Exercise 5: Learn tree based classifier

Here we will code our tree-based classifier. We start by coding a function that finds the best (according to gini) splitting point for a given data set, and then define a recursive class for the nodes that make up our tree. 

## a) Find split point

The first step is to define a function that can find the splitting criterion with the highest gini value. 

The gini value can be described as:

If $\Gamma$ contains the set of all labels, then $\Gamma(x_1 < 1)$ would be all labels that satisfy the criterion $x_1 < 1$. More generally we can write $\Gamma(x_i < c)$, where $i$ is the index of one of the features and $c$ is the threshold. Then we can define:

$v_1 = mean(\Gamma(x_i < c))$

$v_2 = mean(\Gamma(x_i \geq c))$

$s_1 = v_1^2 + (1-v_1)^2$

$s_2 = v_2^2 + (1-v_2)^2$

We define len(x) to give the number of elements of x, then the weighted gini value is:

$s = \frac{len(\Gamma(x_i < c))}{len(\Gamma)}*s_1 + \frac{len(\Gamma(x_i \geq c))}{len(\Gamma)}*s_2$

The goal is to split the data so that we maximize $s$; there is one $s$ for every combination of $x_i$ and $c$. Here we use a $c$ value that is the average of two adjacent data points in the sorted data. 
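
As a worked example of the formulas above, the sketch below evaluates $s$ for every split position of a small made-up label list that is already sorted by some feature. The helper `weighted_gini` and the list are hypothetical. Note that $s$ as defined here is a purity score (one minus the weighted Gini impurity), which is why it is maximized.

```python
def weighted_gini(labels_head, labels_tail):
    """Return s for one candidate split of 0/1 labels (both parts non-empty)."""
    def purity(labels):
        v = sum(labels) / len(labels)      # mean of the labels, v_1 or v_2
        return v ** 2 + (1 - v) ** 2       # s_1 or s_2
    n = len(labels_head) + len(labels_tail)
    return (len(labels_head) / n) * purity(labels_head) \
         + (len(labels_tail) / n) * purity(labels_tail)

# Hypothetical labels, sorted by the feature x_i.
labels = [0, 0, 0, 1, 1]

scores = {i: weighted_gini(labels[:i], labels[i:]) for i in range(1, len(labels))}
for i, s in scores.items():
    print("split after", i, "samples: s =", round(s, 4))

best_i = max(scores, key=scores.get)
print("best split after", best_i, "samples")
```

Splitting after three samples separates the classes perfectly, so both parts are pure and $s = 1$; every other split leaves a mixed part and scores lower.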

In [49]:
def find_split_point(data, label, parameter):
    """
    data - all the data we want to split, our (gamma)
    label - the parameter we want to classify. 
    parameter - the parameter we want to check for, our x_i    # should x_i be one of our features?
    -----------
    return:
    split_value - the splitting value, our c. 
    gini_value - the gini value for the best c.
    df_head - the data frame belonging to x_i < c
    df_tail - the data frame belonging to x_i >= c
    """
    # begin by sorting the data after the parameter. 

    sorted_data = data.sort_values(by=parameter)
    sorted_label = sorted_data[label]
    
    np_data = sorted_data.to_numpy()
    np_label = sorted_label.to_numpy()
    # TODO loop through all the split points in the sorted data and find 
    # the best gini_value (s) and split_value (c).  
    data_len = len(sorted_data) 
    
    gini_index = 0
    split_value = 0
    split_index = 0
    
    for i in range(1, data_len):
        gamma1 = np_label[:i]        # labels with x_i < c
        gamma2 = np_label[i:]        # labels with x_i >= c
        
        v1 = np.mean(gamma1)
        v2 = np.mean(gamma2)
        
        s1 = v1**2 + (1-v1)**2
        s2 = v2**2 + (1-v2)**2
        
        s = (len(gamma1)/data_len) * s1 + (len(gamma2)/data_len) * s2 
        
        
        if s > gini_index:
            gini_index = s 
            split_index = i
            # c is the average of the two sorted data points on either side of the split
            split_value = np.mean(sorted_data[parameter].iloc[i-1:i+1])


    df_head = sorted_data.head(split_index)  
    df_tail = sorted_data.tail(len(sorted_data) - split_index)
    return split_value, gini_index, df_head, df_tail


## b) Tree Node



In [50]:
class TreeNode():

    def majority_vote(self, pred_list):
        # Here is a function that will return the majority vote from a list. 
        keys = list(Counter(pred_list).keys())
        occurance = list(Counter(pred_list).values())
        idx = np.argmax(occurance)
        return keys[idx]

    def __init__(self, classification=None):
        self.split_value = None # the splitting value (c)
        self.split_parameter = None # which feature was used for the split (x_i)
        self.child_nodes = [] # list that contains two child nodes, if not leaf_node
        self.leaf_node = 0 # is this leaf_node (0= no, 1=yes)
        self.classification = classification # classification made in this node.
        
    def predict(self, data):
        # TODO: we need to traverse the tree recursively down to a leaf node.
        # step 1: check if this is a leaf node, if it is then return classification otherwise continue with step 2.
        # step 2: check the input data for the splitting criteria, i.e. data[x_i] < c ...
        # (data[x_i] < c corresponds to child_nodes[0] and data[x_i] >= c to child_nodes[1])
        # step 3: call the predict function in the corresponding child node and return the prediction.
         
        if self.leaf_node == 1:             
            return self.classification 
        else:
            if data[self.split_parameter] < self.split_value:

                return self.child_nodes[0].predict(data)

            else:
                return self.child_nodes[1].predict(data)

            
  
            
    def learn(self, data, label, min_node_size):
        """
        data - the training data
        label - the parameter we want to classify
        min_node_size - number of data points in a node for it to become a leaf node. 
        """
        # TODO: write the learning function. 
        # Step 1: check if the data fulfils the min_node_size criteria, if so make this node a leaf node and return.
        data_len = len(data)
        
        if data_len < min_node_size:
            self.leaf_node = 1
            self.classification = self.majority_vote(data[label])
            return


        # Step 1.5: Check if the data is homogeneous, i.e. only contains one type of label. If that's 
        # the case then make this node a leaf node and return.
        
        if len(data.loc[(data[label] == 0)]) == data_len or len(data.loc[(data[label] == 1)]) == data_len:
            self.leaf_node = 1
            self.classification = self.majority_vote(data[label])

            return 
        


        # Step 2: Loop over all features and get the best gini and split_value for each feature.
        best_gini = 0

        for feature in data.columns:
            if feature != label:
                split_value, gini, head, tail = find_split_point(data, label, feature)
                if gini > best_gini:
                    best_gini = gini
                    self.split_value = split_value
                    self.split_parameter = feature
                    df_head = head
                    df_tail = tail

        # Step 3: create the two child nodes and let each learn from its part of the data.
        child_0 = TreeNode()  
        child_1 = TreeNode()
        child_0.learn(df_head,label,min_node_size)
        child_1.learn(df_tail,label,min_node_size)

        # Step 4: append the child nodes to self.child_nodes. They should be in the order 
        # of the child nodes corresponding to [df_head, df_tail].
        self.child_nodes=[child_0,child_1]


##  Train the Tree

In [57]:
tree = TreeNode() # create root node
# learn the tree structure
tree.learn(train, "copium", min_node_size=1)


## Test

In [58]:
y = test['copium']
x = test.drop(columns='copium')
log = Classification_eval()

for i in range(x.shape[0]):
    pred = tree.predict(x.iloc[i])

    log.update(pred, y.iloc[i])
        
print('accuracy', log.accuracy())
print('precision', log.precision())
print('recall', log.recall())

accuracy 0.9664
precision 0.9345
recall 1.0


## c) part 1

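Try some different values of min_node_size. How do these results differ from the K-NN?

Answer: On the unbalanced data the tree's accuracy is slightly higher than K-NN's (0.86 to 0.89 compared with 0.84 to 0.87), and its recall is clearly higher (0.50 to 0.59 compared with at most 0.33). The tree is also less sensitive to its parameter: K-NN finds no copium at all for k = 50, while the tree still finds about half of it for min_node_size = 50.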

| min_node_size | Accuracy | Precision | Recall | 
| --- | --- | --- | --- |
| 1  | 0.881 | 0.6522 | 0.5556 |  
| 10 | 0.8899 | 0.6809 | 0.5926 |   
| 20 | 0.875 | 0.6429 | 0.5 |   
| 50 | 0.8601 | 0.5714 | 0.5185 |   

## c) part 2, Try with balanced data, do this later!

Try some different values of min_node_size.

Answer: 

| min_node_size | Accuracy | Precision | Recall | 
| --- | --- | --- | --- |
| 1  | 0.9664 | 0.9345 | 1.0 |  
| 10 | 0.9611 | 0.9338 | 0.9889 |   
| 20 | 0.931 | 0.9056 | 0.8841 |   
| 50 | 0.8991 | 0.8615 | 0.9557 | 

# Exercise 6: Deployment

Here we will try the learned classifiers on a larger map. Make sure that the last-run versions of K-NN and the tree have good parameters, i.e. k and min_node_size values. 


In [59]:
env = Environment(map_type=2, fps=5000, resolution=(1000, 1000))

sensor_properties = env.get_sensor_properties()
sensor_sample = dict()
for key in sensor_properties:
    sensor_sample[key] = [0]

log_knn = Classification_eval()
log_tree = Classification_eval()

    
for i in range(500):
    action = breadth_first_search(actor=env.get_actor(), max_depth=3, action_space=env.get_action_space())
    env.step(action)
    if env.get_sensor_readings() is not None:
        sensor_readings = env.get_sensor_readings()
        for key in sensor_readings:
            sensor_sample[key][0] = sensor_readings[key]
        sensor_sample_df = pd.DataFrame(sensor_sample)
        log_knn.update(knn.predict(sensor_sample_df.iloc[0]), env.get_ground_truth())
        log_tree.update(tree.predict(sensor_sample_df.iloc[0]), env.get_ground_truth())
        env.plt_acc.update_acc(log_tree.accuracy(), log_knn.accuracy())
    env.render()

env.exit()

print("K-NN accuracy ", log_knn.accuracy(), "Tree accuracy", log_tree.accuracy())
print("Number of copium deposits found, K-NN:", log_knn.TP, " Tree:", log_tree.TP)




K-NN accuracy  0.8687 Tree accuracy 0.9293
Number of copium deposits found, K-NN: 9  Tree: 10


# Exercise 7: Balance data

Go to 2 b) part 2 and balance the data, then do part 2 of the exercises that have one. Lastly, run exercise 6 with the classifiers trained on balanced data. What is the major difference?

Answer:

| Balanced | Accuracy K-NN | Copium found K-NN | Accuracy tree | Copium found tree |
| --- | --- | --- | --- | --- |
| NO | 0.8889 | 5 | 0.9333 | 11 | 
| YES | 0.8687  | 9 | 0.9293 | 10 | 