# Machine Learning robot with Arduino UNO and LIDAR
##### Author: [Nikodem Bartnik](https://nikodembartnik.pl/), [Indystry.cc](https://indystry.cc/)
This is the code used to process the data collected during manual racing and, based on that data, train the classifier that will later be used in the autonomous driving stage. If you prefer traditional Python code, you can take a look at main.py. The README file in the GitHub repository also has some useful information.

If you want to see how the project works you can take a look at these two videos on YouTube:
- [Machine Learning on Arduino Uno was a Good Idea](https://www.youtube.com/watch?v=PdSDhdciSpE)
- [The Racing Machine with AI and Arduino](https://www.youtube.com/watch?v=KJIKexczPrU)

We will start by importing all the necessary libraries. 

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from micromlgen import port

### Loading the data

Then we can load the data from a txt file (all files can be found in the data folder of the main repository). Depending on your system you might have to define the path to the file differently. To load the file we will simply use the pandas module and read it as a data frame from CSV.
Each data file contains a different number of samples and has exactly 241 columns. Columns 0-239 hold the measurements from the LIDAR, and the last column holds the label as a letter. There are only five possible letters:

- F - forward
- I - forward right
- R - right
- G - forward left
- L - left

In [8]:
data = pd.read_csv('C:\\Users\\josep\\Downloads\\MASTERDATA2.txt', header=None)
print(data.head())

   0    1    2    3    4    5    6    7    8    9    ...  231  232  233  234  \
0  371  371  372  373  374  375  377  379  383  385  ...  378  376  374  373   
1  371  371  372  373  374  376  377  379  381  383  ...  374  376  377  375   
2  373  376  376  373  377  378  379  380  382  381  ...  393  386  377  380   
3  375  374  373  376  381  378  383  380  387  389  ...  400  396  393  387   
4  391  374  380  379  379  379  380  384  391  393  ...  406  410  400  404   

   235  236  237  238  239  240  
0  372  371  371  370  370    F  
1  374  372  371  370  370    F  
2  377  375  375  371  370    F  
3  377  385  382  378  377    F  
4  393  398  396  394  392    F  

[5 rows x 241 columns]
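Since every file should have exactly 241 columns, it can be handy to check that right after loading. Below is a minimal sketch of such a check. The helper `load_scans` and the constant `EXPECTED_COLUMNS` are names I made up for illustration, and the two-row "file" is synthetic data built in memory, not part of the real dataset.

```python
import io

import pandas as pd

EXPECTED_COLUMNS = 241  # 240 LIDAR readings + 1 label column (hypothetical constant)


def load_scans(source):
    """Read a headerless CSV of LIDAR samples and verify its width.

    `source` can be a file path or any file-like object accepted by pd.read_csv.
    """
    frame = pd.read_csv(source, header=None)
    if frame.shape[1] != EXPECTED_COLUMNS:
        raise ValueError(
            f"expected {EXPECTED_COLUMNS} columns, got {frame.shape[1]}"
        )
    return frame


# Synthetic stand-in for a data file: two samples with 240 readings each.
fake_file = io.StringIO(
    ",".join(["370"] * 240) + ",F\n" + ",".join(["380"] * 240) + ",L\n"
)
scans = load_scans(fake_file)
print(scans.shape)
print(scans.iloc[:, -1].tolist())
```

With a real file you would pass the path instead of `fake_file`; a file with a missing or extra column would then fail early instead of producing a confusing error later.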


### Data cleaning
Next, we can initiate some data cleaning. I intend to keep this step straightforward, but feel free to experiment and enhance the cleaning process. As they often emphasize in data science, "garbage in, garbage out," so the cleaner the data, the better the final result.

I will rename the last column to "Label" for ease of work. Additionally, we'll eliminate all samples with the stray labels "d," "b," "s," "l," "D," "r," "m," or "n," which leaves only "F," "L," and "R," and we'll drop every sample that contains a measurement equal to 0. To streamline the task for the classifier, I've chosen to focus solely on these three driving commands. This selection is sufficient for navigating the racetracks I designed.

In [9]:
data.rename(columns={data.columns[-1]: 'Label'}, inplace=True)
print(f"Label counts before cleaning the data: \n {data['Label'].value_counts()}")
# Drop samples whose label is one of the unwanted letters.
unwanted_labels = ['d', 'b', 's', 'l', 'D', 'r', 'm', 'n']
data = data[~data['Label'].isin(unwanted_labels)]
# Drop samples where any LIDAR measurement is 0.
data = data[~(data.iloc[:, :-1] == 0).any(axis=1)]
data.reset_index(drop=True, inplace=True)
print(f"Label counts after cleaning the data: \n {data['Label'].value_counts()}")

Label counts before cleaning the data: 
 Label
F    12261
R     5349
L     4769
D       93
s       49
l        9
b        3
Name: count, dtype: int64
Label counts after cleaning the data: 
 Label
F    12242
R     5346
L     4769
Name: count, dtype: int64
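To make it easier to see what the cleaning does, here is a small sketch on a made-up four-row data frame. Note that it is not the exact code from the cell above: the hypothetical helper `clean_scans` keeps a list of wanted labels instead of listing the unwanted ones, which is my assumption about what might be more convenient when new stray labels show up.

```python
import pandas as pd

KEEP_LABELS = ("F", "L", "R")  # labels that survive cleaning (assumed whitelist)


def clean_scans(frame, keep=KEEP_LABELS):
    """Keep only wanted labels and drop rows that contain a zero reading."""
    frame = frame[frame["Label"].isin(list(keep))]
    readings = frame.drop(columns="Label")
    has_zero = (readings == 0).any(axis=1)
    return frame[~has_zero].reset_index(drop=True)


# Toy data: two readings per sample instead of 240.
toy = pd.DataFrame(
    {
        0: [371, 0, 375, 390],
        1: [372, 380, 376, 391],
        "Label": ["F", "L", "s", "R"],
    }
)
cleaned = clean_scans(toy)
print(cleaned)
```

The second row is removed because of the zero reading and the third because of its label, so only the "F" and "R" samples remain.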


Now we will separate X and y, that is, the input and output data. After that we will split them into train and test sets with train_test_split. The label encoder converts the letters in the label column to numbers so that the classifier can work with them.

In [10]:
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)
label_mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print(f"Label encoding mapping: {label_mapping}")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Label encoding mapping: {'F': 0, 'L': 1, 'R': 2}
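The classifier will later return numbers, not letters, so it helps to see how the encoding goes in both directions. Here is a short sketch with a hand-written list of labels (made up for illustration); LabelEncoder sorts the classes alphabetically before assigning the numbers.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

# Letters to numbers: classes are sorted, so F -> 0, L -> 1, R -> 2.
codes = encoder.fit_transform(["F", "R", "L", "F"])
print(codes.tolist())

# Numbers back to letters, e.g. to interpret a prediction.
letters = encoder.inverse_transform([0, 1, 2])
print(letters.tolist())
```

This is the same mapping that the cell above prints, so on the Arduino side a prediction of 0 would mean F, 1 would mean L, and 2 would mean R.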


### Data selection

We don't need all the data. Most of it is just noise that won't be useful for us (remember? garbage in, garbage out). We are not too concerned about what is behind us; driving forward while looking backwards is not the best idea. That's why we do data selection. Why now? Data selection should be performed after the division into train and test sets, otherwise we are exposed to the data leakage problem.

We will perform data selection with SelectKBest from the sklearn package. With K you define how many features you want to select. While making the first video I was able to get the robot to autonomously navigate the race track with decent precision with as few as 10 features. For the second video, where I tried to make the robots race, I had to increase the number of dimensions to 80 to get it to work. Even with such a high number of dimensions the Arduino still seems to work well.

In [11]:
k = 80
k_best = SelectKBest(score_func=f_classif, k=k)
k_best.fit(X_train, y_train)

selected_feature_indices = k_best.get_support(indices=True)
# we have to print it like this to have the commas between the indices so that it's easy to copy and paste to Arduino IDE
print("selected features: ", X.columns[selected_feature_indices])

selected features:  Index([ 88,  90, 130, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
       143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156,
       157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170,
       171, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202,
       203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216,
       217, 218, 219, 220, 221, 222, 223, 224, 225, 226],
      dtype='object')
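If you want to convince yourself how SelectKBest behaves, you can try it on synthetic data where you know which columns matter. The sketch below is only an illustration under my own assumptions: 20 random columns, of which columns 3 and 7 are deliberately tied to the label. The selector is fitted on the training split only and then applied to both splits, which is the order that avoids data leakage.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples = 300
y = rng.integers(0, 3, size=n_samples)       # three classes, like F/L/R
X = rng.normal(size=(n_samples, 20))         # 20 columns of pure noise
# Make columns 3 and 7 informative by shifting them with the class label.
X[:, 3] += 3.0 * y
X[:, 7] -= 3.0 * y

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

selector = SelectKBest(score_func=f_classif, k=2)
selector.fit(X_train, y_train)               # scores use the training split only
chosen = selector.get_support(indices=True)

# The same fitted selector reduces both splits to the chosen columns.
X_train_small = selector.transform(X_train)
X_test_small = selector.transform(X_test)
print(chosen.tolist(), X_train_small.shape, X_test_small.shape)
```

On the real LIDAR data the picture is of course less clear-cut, but the idea is the same: the angles whose distances change the most between classes get the highest scores.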


### Training the model
Training the model is a straightforward process, thanks to all the libraries available in Python. The ultimate outcome depends on our dataset and the preceding steps we executed. Post-training, accuracy will be computed using the test set, and a higher accuracy is desirable.

In the videos, the classifiers I employed achieved a maximum accuracy of about 75%, which, while not the optimal performance and open to improvement, enabled the robot to autonomously navigate the racetrack. Infrequent collisions with the wall did occur. At times, the robot could navigate for a few minutes without any crashes. We will also print the classification report to see the accuracy for all the classes.

In [12]:
clf = RandomForestClassifier(max_depth=3, random_state=42)
clf.fit(X_train.iloc[:, selected_feature_indices], y_train)

y_pred = clf.predict(X_test.iloc[:, selected_feature_indices])

accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

class_names = label_encoder.classes_
report = classification_report(y_test, y_pred, target_names=class_names, zero_division=0)
print('Classification Report:\n', report)

Accuracy: 0.7685599284436494
Classification Report:
               precision    recall  f1-score   support

           F       0.76      0.84      0.80      2437
           L       0.76      0.66      0.71       949
           R       0.79      0.71      0.75      1086

    accuracy                           0.77      4472
   macro avg       0.77      0.74      0.75      4472
weighted avg       0.77      0.77      0.77      4472
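The recall column in the report can also be read directly from a confusion matrix, which additionally shows which classes get mixed up with each other. Here is a small sketch with hand-made predictions (the eight samples are invented for illustration and are not results from the robot).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented ground truth and predictions for classes 0 (F), 1 (L), 2 (R).
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 0, 2])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
print(cm)

# Recall per class: correct predictions divided by the samples of that class.
recall = cm.diagonal() / cm.sum(axis=1)
# Overall accuracy: all correct predictions divided by all samples.
accuracy = cm.diagonal().sum() / cm.sum()
print(recall, accuracy)
```

In this toy case one F sample was predicted as L and one R sample as F. The same kind of table built from `y_test` and `y_pred` in the notebook would show where the real classifier confuses the classes.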



### Exporting the Classifier

While performing tasks in Python is convenient, we face limitations when it comes to running Python code on Arduino. Therefore, the next step involves exporting the classifier. I've come across an [excellent article](https://eloquentarduino.github.io/2019/11/how-to-train-a-classifier-in-scikit-learn/) online that provides a detailed explanation of how to export the classifier to C and integrate it with Arduino. The resulting file will be saved to the directory where you are currently working, so please remember to move it to the Arduino folder. If you are experimenting with and testing various models, make sure to change the index at the end of the file name to avoid mixing up files.

**REMEMBER** to copy the selected dimensions and paste them into the Arduino file. The number of dimensions during training and later during classification must match, otherwise it won't work!


In [14]:
with open("randomForest.h", mode="w") as arduino_code:
    arduino_code.write(port(clf))
print("selected features: ", X.columns[selected_feature_indices])

selected features:  Index([ 88,  90, 130, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
       143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156,
       157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170,
       171, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202,
       203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216,
       217, 218, 219, 220, 221, 222, 223, 224, 225, 226],
      dtype='object')
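Copying the indices by hand from the printed Index is easy to get wrong, so one option is to format them as a ready-to-paste C array. This is only a sketch: the helper `to_c_array` and the array name `SELECTED_FEATURES` are my own placeholders, and the name actually expected by the Arduino sketch may be different. It uses `uint8_t`, which assumes all indices fit in the range 0-255 (true for the 240 LIDAR columns).

```python
def to_c_array(indices, name="SELECTED_FEATURES"):
    """Format feature indices as a one-line C array declaration.

    `name` is a hypothetical identifier; rename it to match your Arduino code.
    """
    values = [int(i) for i in indices]
    if not values or min(values) < 0 or max(values) > 255:
        raise ValueError("indices must be non-empty and fit in uint8_t (0-255)")
    body = ", ".join(str(v) for v in values)
    return f"const uint8_t {name}[{len(values)}] = {{{body}}};"


# Example with the first four indices from the output above.
declaration = to_c_array([88, 90, 130, 132])
print(declaration)
```

In the notebook you could call it as `to_c_array(selected_feature_indices)`; the array length in the declaration then also documents how many dimensions the classifier expects.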
