This is the template for the image recognition exercise. <Br>
Some **general instructions**, read these carefully:
 - The final assignment is returned as a clear and understandable *report*
    - briefly define the concepts and explain the steps you take
    - use the Markdown feature of the notebook for longer explanations
 - return your output as a *working* Jupyter notebook
 - name your file Exercise_MLPR2023_Partx_uuid.ipynb
    - use the uuid code determined below
    - use this same code for each part of the assignment
 - write easily readable code with comments
    - if you use code from the web, provide a reference
 - it is OK to discuss the assignment with a friend, but it is not OK to copy someone else's work. Everyone should submit their own implementation
    - in case of identical submissions, both submissions fail

**Deadlines:**
- Part 1: Mon 6.2 at 23:59
- Part 2: Mon 20.2 at 23:59
- Part 3: Mon 6.3 at 23:59

**No extensions for the deadlines** <br>
- after each deadline, example results are given, and it is not possible to submit anymore

**If you encounter problems, Google first and if you can’t find an answer, ask for help**
- Moodle area for questions
- pekavir@utu.fi
- teacher available for questions on Mondays 30.1, 13.2 (after lecture) and Thursday 2.3 (at lecture)

**Grading**

The exercise counts towards the course grade. The course exam has 5 questions worth 6 points each. The exercise gives up to 6 points, so the maximum total score is 36 points.

From the template below, you can see how many exercise points can be acquired from each task. Exam points are given according to the table below: <br>
<br>
7 exercise points: 1 exam point <br>
8 exercise points: 2 exam points <br>
9 exercise points: 3 exam points <br>
10 exercise points: 4 exam points <br>
11 exercise points: 5 exam points <br>
12 exercise points: 6 exam points <br>
<br>
To pass the exercise, you need at least 7 exercise points, and at least 1 exercise point from each Part.
    
Each student grades one peer submission and their own submission. After each Part deadline, example results are given. Study them carefully and perform the grading according to the given instructions. The mean of the peer grade and the self-grade is used for the final points.

In [1]:
import uuid
# Run this cell only once and save the code. Use the same id code for each Part.
# Print a random id generated with uuid1()
print("The id code is: ", end="")
print(uuid.uuid1())

The id code is: f62e65c2-b927-11ed-a7f1-3497f6911829


# Introduction (1 p)

Write an introductory chapter for your report.
<br>
- Explain the purpose of this task.
- Describe what kind of data were used. Where did they originate? Give a correct reference.
- Which methods did you use?
- Briefly describe the results.

# Part 2

Data exploration and model selection

# Part 3

## Performance estimation (2 p)

Use the previously gathered data (again, use the standardized features). <br>
Estimate the performance of each model using nested cross validation. Use 10-fold cross validation for outer and <br>
5-fold repeated cross validation with 3 repetitions for inner loop.  <br> 
Select the best model in the inner loop using the hyperparameter combinations and ranges defined in Part 2. <br>
For each model, calculate the accuracy and the confusion matrix. <br> 
Which hyperparameter/hyperparameter combination is most often chosen as the best one for each classifier? 

## Discussion (2 p)

Discuss your results

- Which model performs the best? Why?
- Ponder the limitations and generalization of the models. How well will the classifiers perform for data outside this data set?
- Compare your results with the original article. Are they comparable?
- Ponder applications for this type of model (classifying rice or other plant species). Who could benefit from them? Ponder also what would be interesting to study further in this area.
- What did you learn? What was difficult? Could you improve your own working process in some way?

In [2]:
import os
import numpy as np
import pandas as pd
import seaborn as sns
import random as rng
import cv2
import matplotlib.pyplot as plt
import warnings


from sklearn.model_selection import RepeatedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score, LeaveOneOut, GridSearchCV, KFold
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

from itertools import product


In [3]:
parquet_path = '..\\training_data\\Rice_Sample_data.parquet'

df_parquet = pd.read_parquet(parquet_path)

In [4]:
# Separate DataFrame holding only the numerical feature columns
df_numerical = df_parquet.iloc[:, 3:]

# Copy of the numerical features with the Species label added, for later use
df_species = df_numerical.copy()
df_species['Species'] = df_parquet['Species']

In [12]:
y = df_species['Species'].values
X = df_species.iloc[:,:21].values

StScaler = StandardScaler()
X_standardized = StScaler.fit_transform(X)
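
A quick sanity check of what standardization does can be sketched as follows. This is a minimal illustration on hypothetical toy data (`X_toy` is an assumption, not the rice features): after `StandardScaler.fit_transform`, each column should have mean close to 0 and standard deviation close to 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features with very different scales (NOT the rice data)
rng_demo = np.random.default_rng(42)
X_toy = rng_demo.normal(loc=[5.0, 200.0, -3.0],
                        scale=[1.0, 50.0, 0.1],
                        size=(100, 3))

# Fit the scaler and transform in one step, as in the cell above
X_toy_std = StandardScaler().fit_transform(X_toy)

# StandardScaler uses the population standard deviation (ddof=0),
# which is also the default of ndarray.std()
col_means = X_toy_std.mean(axis=0)
col_stds = X_toy_std.std(axis=0)

print('Column means:', np.round(col_means, 6))
print('Column stds: ', np.round(col_stds, 6))
```

Note that fitting the scaler on the full data set before cross-validation means the scaling statistics are computed from all samples, including those later used as test folds. The assignment asks for the standardized features, so this is kept here, but it is worth mentioning as a limitation in the discussion.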

In [13]:
# Splitting data
X_train, X_test, y_train, y_test = train_test_split(X_standardized, y, test_size=0.2, random_state=42)

# Initializing Classifiers

# model1 = RandomForestClassifier(max_depth=10, max_features=10)
# model2 = MLPClassifier(solver='adam', activation='logistic', hidden_layer_sizes=7)
# model3 = KNeighborsClassifier(n_neighbors=3)
model1 = RandomForestClassifier()
model2 = MLPClassifier()
model3 = KNeighborsClassifier()

# Setting parameter grids

hp_grid1 = [{'max_depth': list(range(1,12)),
            'max_features': list(range(1,12))}]
hp_grid2 = [{'solver': ['adam', 'sgd'], 
            'activation': ['relu', 'logistic'],
            'hidden_layer_sizes': list(range(1,10))}]
hp_grid3 = [{'n_neighbors': list(range(1,10))}]


Defining inner loop

In [14]:

# Setting repeated kfold for inner loop
cv_inner = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)

gridcvs = {}
for hpgrid, model, name in zip((hp_grid1, hp_grid2, hp_grid3),
                               (model1, model2, model3),
                               ('RForest', 'MLP', 'KNN')):
    gcv = GridSearchCV(estimator=model,
                        param_grid=hpgrid,
                        scoring='accuracy',
                        cv=cv_inner,
                        verbose=0,
                        refit=True)

    gridcvs[name] = gcv
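
To get a feeling for the computational cost of the nested procedure, the number of model fits can be counted before running anything. The sketch below rebuilds the three grids from the cell above and assumes the 10 outer folds required by the task. The variable names (`grids_demo`, `fit_counts`) are illustrative only.

```python
from sklearn.model_selection import ParameterGrid, RepeatedKFold

# Same grids as defined above, repeated here so the sketch is self-contained
grids_demo = {
    'RForest': [{'max_depth': list(range(1, 12)),
                 'max_features': list(range(1, 12))}],
    'MLP': [{'solver': ['adam', 'sgd'],
             'activation': ['relu', 'logistic'],
             'hidden_layer_sizes': list(range(1, 10))}],
    'KNN': [{'n_neighbors': list(range(1, 10))}],
}

# 5 splits x 3 repeats = 15 inner train/validation splits per combination
n_inner_splits = RepeatedKFold(n_splits=5, n_repeats=3,
                               random_state=42).get_n_splits()
n_outer_folds = 10  # outer 10-fold cross validation (from the task description)

fit_counts = {}
for name, grid in grids_demo.items():
    n_combos = len(ParameterGrid(grid))
    # Inner-loop fits only; refit=True adds one more fit per outer fold
    fit_counts[name] = n_combos * n_inner_splits * n_outer_folds
    print(f'{name}: {n_combos} combinations, {fit_counts[name]} inner fits')
```

The random forest grid dominates the run time, which is one reason to keep the grids from Part 2 reasonably small.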




Defining outer loop

In [15]:
# Suppress warnings because they flooded the output
warnings.filterwarnings("ignore")

In [16]:
# Outer loop: 10-fold cross validation around each inner grid search

for name, gcv in sorted(gridcvs.items()):
    print('Algorithm:', name)
    print('inner loop:')

    outer_scores = []
    cv_outer = KFold(n_splits=10, random_state=42, shuffle=True)

    for train_set, valid_set in cv_outer.split(X_train, y_train):

        gcv.fit(X_train[train_set], y_train[train_set])  # Inner loop hyperparameter tuning
        print('\n Best accuracy (avg. of inner test folds) %.2f%%' % (gcv.best_score_ * 100))
        print('Best parameters', gcv.best_params_)

        # Score the refitted best model on the outer test fold (valid_set)
        outer_scores.append(gcv.best_estimator_.score(X_train[valid_set], y_train[valid_set]))
        print(' Accuracy (on outer test folds) %.2f%%' % (outer_scores[-1] * 100))

    print('\n Outer loop:')
    print('  Accuracy %.2f%% +/- %.2f' % (np.mean(outer_scores) * 100, np.std(outer_scores) * 100))
    


Algorithm: KNN
inner loop:

 Best accuracy (avg. of inner test folds) 98.92%
Best parameters {'n_neighbors': 1}
 Accuracy (on outer test folds) 100.00%

 Best accuracy (avg. of inner test folds) 98.46%
Best parameters {'n_neighbors': 3}
 Accuracy (on outer test folds) 100.00%

 Best accuracy (avg. of inner test folds) 99.38%
Best parameters {'n_neighbors': 1}
 Accuracy (on outer test folds) 95.83%

 Best accuracy (avg. of inner test folds) 99.07%
Best parameters {'n_neighbors': 1}
 Accuracy (on outer test folds) 100.00%

 Best accuracy (avg. of inner test folds) 99.07%
Best parameters {'n_neighbors': 1}
 Accuracy (on outer test folds) 100.00%

 Best accuracy (avg. of inner test folds) 98.45%
Best parameters {'n_neighbors': 1}
 Accuracy (on outer test folds) 100.00%

 Best accuracy (avg. of inner test folds) 98.92%
Best parameters {'n_neighbors': 1}
 Accuracy (on outer test folds) 100.00%

 Best accuracy (avg. of inner test folds) 98.92%
Best parameters {'n_neighbors': 1}
 Accuracy (on 
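
The task also asks for a confusion matrix and for the most frequently chosen hyperparameters, which the loop above does not collect. One possible way to gather both is sketched below. It is a minimal example under stated assumptions: the data are synthetic (`make_classification`, not the rice features), only KNN with a reduced grid is used, and all names ending in `_demo` are hypothetical.

```python
from collections import Counter

import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, KFold, RepeatedKFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the standardized features (assumption)
X_demo, y_demo = make_classification(n_samples=150, n_features=8,
                                     n_informative=5, n_classes=3,
                                     random_state=42)

cv_inner_demo = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
cv_outer_demo = KFold(n_splits=10, shuffle=True, random_state=42)

y_pred_demo = np.empty_like(y_demo)  # out-of-fold predictions
chosen_params = Counter()            # how often each setting wins the inner loop

for train_idx, test_idx in cv_outer_demo.split(X_demo):
    search = GridSearchCV(KNeighborsClassifier(),
                          param_grid={'n_neighbors': [1, 3, 5]},
                          scoring='accuracy',
                          cv=cv_inner_demo,
                          refit=True)
    search.fit(X_demo[train_idx], y_demo[train_idx])

    # Predict the outer test fold with the refitted best model
    y_pred_demo[test_idx] = search.predict(X_demo[test_idx])
    chosen_params[tuple(sorted(search.best_params_.items()))] += 1

# Every sample is predicted exactly once, so one matrix covers all outer folds
acc_demo = accuracy_score(y_demo, y_pred_demo)
cm_demo = confusion_matrix(y_demo, y_pred_demo)

print('Nested CV accuracy: %.2f%%' % (acc_demo * 100))
print('Confusion matrix:\n', cm_demo)
print('Most often chosen:', chosen_params.most_common(1))
```

Pooling the out-of-fold predictions gives a single confusion matrix per model. Summing per-fold matrices would give the same result.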


Discuss your results

- Which model performs the best? Why?
It seems that the Random Forest classifier performs best.



- Ponder the limitations and generalization of the models. How well will the classifiers perform for data outside this data set?


- Compare your results with the original article. Are they comparable?


- Ponder applications for this type of model (classifying rice or other plant species). Who could benefit from them? Ponder also what would be interesting to study further in this area.


- What did you learn? What was difficult? Could you improve your own working process in some way?

