<!-- Notebook title -->
# Title

# 1. Notebook Description

### 1.1 Task Description
<!-- 
- A brief description of the problem you're solving with machine learning.
- Define the objective (e.g., classification, regression, clustering, etc.).
-->

TODO

### 1.2 Useful Resources
<!--
- Links to relevant papers, articles, or documentation.
- Description of the datasets (if external).
-->

### 1.2.1 Data

#### 1.2.1.1 Common

* [Datasets Kaggle](https://www.kaggle.com/datasets)  
  &nbsp;&nbsp;&nbsp;&nbsp;A vast repository of datasets across various domains provided by Kaggle, a platform for data science competitions.
  
* [Toy datasets from Sklearn](https://scikit-learn.org/stable/datasets/toy_dataset.html)  
  &nbsp;&nbsp;&nbsp;&nbsp;A collection of small datasets that come with the Scikit-learn library, useful for quick prototyping and testing algorithms.
  
* [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)  
  &nbsp;&nbsp;&nbsp;&nbsp;A widely-used repository for machine learning datasets, with a variety of real-world datasets available for research and experimentation.
  
* [Google Dataset Search](https://datasetsearch.research.google.com/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A tool from Google that helps to find datasets stored across the web, with a focus on publicly available data.
  
* [AWS Public Datasets](https://registry.opendata.aws/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A registry of publicly available datasets that can be analyzed on the cloud using Amazon Web Services (AWS).
  
* [Microsoft Azure Open Datasets](https://azure.microsoft.com/en-us/services/open-datasets/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A collection of curated datasets from various domains, made available by Microsoft Azure for use in machine learning and analytics.
  
* [Awesome Public Datasets](https://github.com/awesomedata/awesome-public-datasets)  
  &nbsp;&nbsp;&nbsp;&nbsp;A GitHub repository that lists a wide variety of datasets across different domains, curated by the community.
  
* [Data.gov](https://www.data.gov/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A portal to the US government's open data, offering access to a wide range of datasets from various federal agencies.
  
* [Google BigQuery Public Datasets](https://cloud.google.com/bigquery/public-data)  
  &nbsp;&nbsp;&nbsp;&nbsp;Public datasets hosted by Google BigQuery, allowing for quick and powerful querying of large datasets in the cloud.
  
* [Papers with Code](https://paperswithcode.com/datasets)  
  &nbsp;&nbsp;&nbsp;&nbsp;A platform that links research papers with the corresponding code and datasets, helping researchers reproduce results and explore new data.
  
* [Zenodo](https://zenodo.org/)  
  &nbsp;&nbsp;&nbsp;&nbsp;An open-access repository that allows researchers to share datasets, software, and other research outputs, often linked to academic publications.
  
* [The World Bank Open Data](https://data.worldbank.org/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A comprehensive source of global development data, with datasets covering various economic and social indicators.
  
* [OpenML](https://www.openml.org/)  
  &nbsp;&nbsp;&nbsp;&nbsp;An online platform for sharing datasets, machine learning experiments, and results, fostering collaboration in the ML community.
  
* [Stanford Large Network Dataset Collection (SNAP)](https://snap.stanford.edu/data/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A collection of large-scale network datasets from Stanford University, useful for network analysis and graph-based machine learning.
  
* [KDnuggets Datasets](https://www.kdnuggets.com/datasets/index.html)  
  &nbsp;&nbsp;&nbsp;&nbsp;A curated list of datasets for data mining and data science, compiled by the KDnuggets community.


#### 1.2.1.2 Project

### 1.2.2 Learning

* [K-Nearest Neighbors on Kaggle](https://www.kaggle.com/code/mmdatainfo/k-nearest-neighbors)

* [Complete Guide to K-Nearest-Neighbors](https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor)

### 1.2.3 Documentation

---

# 2. Setup

## 2.1 Imports
<!--
- Import necessary libraries (e.g., `numpy`, `pandas`, `matplotlib`, `scikit-learn`, etc.).
-->

In [1]:
from ikt450.src.common_imports import *
from ikt450.src.config import get_paths
from ikt450.src.common_func import load_dataset, save_dataframe, ensure_dir_exists

import torch
import torchvision
import torchvision.transforms as transforms

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

## 2.2 Global Variables
<!--
- Define global constants, paths, and configuration settings used throughout the notebook.
-->

### 2.2.1 Paths

In [2]:
paths = get_paths()

### 2.2.2 Split ratio

In [3]:
SPLITRATIO = 0.8

In [8]:
paths

{'PATH_PROJECT_ROOT': 'C:\\Users\\jonin\\Documents\\ikt450\\ikt450',
 'PATH_ASSIGNMENTS': 'C:\\Users\\jonin\\Documents\\ikt450\\ikt450\\assignments',
 'PATH_COMMON': 'C:\\Users\\jonin\\Documents\\ikt450\\ikt450\\common',
 'PATH_COMMON_DATASETS': 'C:\\Users\\jonin\\Documents\\ikt450\\ikt450\\common\\datasets',
 'PATH_COMMON_NOTEBOOKS': 'C:\\Users\\jonin\\Documents\\ikt450\\ikt450\\common\\notebooks',
 'PATH_COMMON_RESOURCES': 'C:\\Users\\jonin\\Documents\\ikt450\\ikt450\\common\\resources',
 'PATH_COMMON_SCRIPTS': 'C:\\Users\\jonin\\Documents\\ikt450\\ikt450\\common\\scripts',
 'PATH_REPORTS': 'C:\\Users\\jonin\\Documents\\ikt450\\ikt450\\reports',
 'PATH_SRC': 'C:\\Users\\jonin\\Documents\\ikt450\\ikt450\\src',
 'PATH_1_KNN': 'C:\\Users\\jonin\\Documents\\ikt450\\ikt450\\assignments\\1_knn',
 'PATH_2_MLP': 'C:\\Users\\jonin\\Documents\\ikt450\\ikt450\\assignments\\2_mlp',
 'PATH_CNN': 'C:\\Users\\jonin\\Documents\\ikt450\\ikt450\\assignments\\CNN'}

In [16]:
# Load the Food11 dataset from PATH_COMMON_DATASETS/food11.
# Expected folder structure: food11/training, food11/validation and food11/evaluation,
# each containing 11 subfolders (one per class) that hold the images.
# torchvision.datasets.ImageFolder derives the labels from the subfolder names;
# every image is resized, converted to a tensor and normalized.

transform = transforms.Compose([
    transforms.Resize((224, 224)),  # common input size for ImageNet-pretrained models
    transforms.ToTensor(),
    # widely used ImageNet per-channel mean and std (not computed on Food11)
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

train_dataset = torchvision.datasets.ImageFolder(root=f"{paths['PATH_COMMON_DATASETS']}/food11/training", transform=transform)
test_dataset = torchvision.datasets.ImageFolder(root=f"{paths['PATH_COMMON_DATASETS']}/food11/evaluation", transform=transform)
val_dataset = torchvision.datasets.ImageFolder(root=f"{paths['PATH_COMMON_DATASETS']}/food11/validation", transform=transform)
val_dataset  # only the last expression in a cell is displayed

Dataset ImageFolder
    Number of datapoints: 3430
    Root location: C:\Users\jonin\Documents\ikt450\ikt450\common\datasets/food11/validation
    StandardTransform
Transform: Compose(
               Resize(size=(224, 224), interpolation=bilinear, max_size=None, antialias=True)
               ToTensor()
               Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
           )
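
As a rough illustration of how the integer labels come about: `ImageFolder` sorts the class subfolder names and numbers them in that order. The sketch below mimics that mapping with the standard library only. The helper `folder_class_index` and the three folder names are hypothetical and stand in for the 11 real class folders.

```python
import os
import tempfile


def folder_class_index(root):
    """Map each class subfolder under `root` to an integer label, in sorted order."""
    classes = sorted(entry.name for entry in os.scandir(root) if entry.is_dir())
    return {name: idx for idx, name in enumerate(classes)}


# Build a throwaway folder tree with three illustrative class folders.
with tempfile.TemporaryDirectory() as root:
    for name in ["Soup", "Bread", "Dessert"]:
        os.makedirs(os.path.join(root, name))
    class_to_idx = folder_class_index(root)

print(class_to_idx)  # {'Bread': 0, 'Dessert': 1, 'Soup': 2}
```

On the real dataset, the corresponding mapping should be available as `train_dataset.class_to_idx`.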

In [20]:
# Look at the label distribution: count the number of training images per class.
from collections import Counter

label_count = dict(sorted(Counter(train_dataset.targets).items()))
label_count

{0: 994,
 1: 429,
 2: 1500,
 3: 986,
 4: 848,
 5: 1325,
 6: 440,
 7: 280,
 8: 855,
 9: 1500,
 10: 709}
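
The counts above are noticeably imbalanced (280 images in the smallest class, 1500 in the largest). One common way to compensate, which this notebook does not necessarily use, is to weight each class inversely to its frequency. The sketch below copies the counts from the output above and applies the assumed formula `total / (num_classes * count)`.

```python
# Class counts copied from the output above.
label_count = {0: 994, 1: 429, 2: 1500, 3: 986, 4: 848, 5: 1325,
               6: 440, 7: 280, 8: 855, 9: 1500, 10: 709}

total = sum(label_count.values())   # total number of training images
num_classes = len(label_count)

# Inverse-frequency weights: rare classes get weights above 1, frequent ones below 1.
class_weights = {
    label: total / (num_classes * count)
    for label, count in label_count.items()
}

# Ratio between the largest and the smallest class.
imbalance_ratio = max(label_count.values()) / min(label_count.values())

print(total)  # 9866
for label, weight in class_weights.items():
    print(label, round(weight, 3))
```

Weights of this kind could, for example, be passed as a tensor to the `weight` argument of `torch.nn.CrossEntropyLoss`.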

## 2.3 Function Definitions
<!--
- Define helper functions that will be used multiple times in the notebook.
- Consider organizing these into separate sections (e.g., data processing functions, model evaluation functions).
-->

### 2.3.1 Distance Calculation

#### 2.3.1.1 Euclidean Distance
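
This section has no implementation yet. A minimal sketch of the Euclidean distance $d(a, b) = \sqrt{\sum_i (a_i - b_i)^2}$ between two points of equal dimension could look like the following. The function name `euclidean_distance` is an assumption, not something defined elsewhere in the notebook.

```python
import math


def euclidean_distance(a, b):
    """Return the Euclidean distance between two equally long sequences of numbers."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    # Sum the squared coordinate differences, then take the square root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```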

---

# 4. Data Processing

## 4.1 Data loading
<!--
- Load datasets from files or other sources.
-->

In [5]:
%ls {paths['PATH_COMMON_DATASETS']}

 Volume in drive C is Windows
 Volume Serial Number is FA0F-7C2E

 Directory of C:\Users\jonin\Documents\ikt450\ikt450\common\datasets

10.09.2024  20:40    <DIR>          .
28.08.2024  02:09    <DIR>          ..
03.09.2024  21:22            19 488 ecoli.data
03.09.2024  21:22             3 022 ecoli.names
10.09.2024  20:40    <DIR>          food
23.08.2024  18:45            23 278 pima-indians-diabetes.data.csv
               3 File(s)         45 788 bytes
               3 Dir(s)  297 591 087 104 bytes free


## 4.2 Data inspection
<!--
- Preview the data (e.g., `head`, `describe`).
-->

### 4.2.1 Info

In [None]:
df.info()

### 4.2.2 Describe

In [None]:
df.describe()

### 4.2.3 Head

In [None]:
df.head()

## 4.3 Data Visualization

In [None]:
# TODO Add code for visualization

## 4.4 Data Cleaning
<!--
- Handle missing values, outliers, and inconsistencies.
- Remove or impute missing data.
-->

### 4.4.1 NULL, NaN, Missing values

In [None]:
df.isnull().sum()

In [None]:
df.isna().sum()

In [None]:
df.duplicated().sum()

In [None]:
#df.corr()

## 4.5 Feature Engineering
<!--
- Create new features from existing data.
- Normalize or standardize features.
- Encode categorical variables.
-->

### 4.5.1 Normalize

#### 4.5.1.1 Feature Selection / Data Separation

<details>
<br>
<details>
<summary>What does it do?</summary>
<br>
This line removes the target column (the `'TODO'` placeholder in the code below) from the DataFrame `df` and assigns the remaining columns to `X`.
</details>
<br>
<details>
<summary>Why do we do it?</summary>
<br>
We do this to separate the input features (which are stored in `X`) from the target variable (which will be stored in `y`). This separation is essential in supervised learning tasks where the goal is to predict the target variable based on the input features.
</details>
</details>

In [None]:
df.columns

In [None]:
X = df.drop(columns='TODO')

#### 4.5.1.2 Target Variable Extraction

<details>
<br>
<details>
<summary>What does it do?</summary>
<br>
This line selects the target column (the `'TODO'` placeholder in the code below) from the DataFrame `df` and assigns it to `y`.
</details>
<br>
<details>
<summary>Why do we do it?</summary>
<br>
We do this to isolate the target variable, which represents the labels or outcomes that we aim to predict using the machine learning model.
</details>
</details>

In [None]:
y = df['TODO']

#### 4.5.1.3 Feature Scaling / Standardization / Z-score Normalization

<details>
<br>
<details>
<summary>What does it do?</summary>
<br>
This line standardizes the features in `X` by subtracting the mean of each feature and dividing by the standard deviation of that feature. This transforms the data so that each feature has a mean of 0 and a standard deviation of 1.
</details>
<br>
<details>
<summary>Why do we do it?</summary>
<br>
Standardization matters for algorithms that rely on distance calculations (such as K-Nearest Neighbors or SVMs with distance-based kernels) and for models trained with gradient descent (such as neural networks). Without it, features with larger numeric ranges can dominate the distance calculation or slow down optimization. After standardization, all features are on a comparable scale, so none dominates merely because of its units.
</details>
</details>

In [None]:
X = (X - X.mean()) / X.std()

## 4.6 Data Splitting
<!--
- Split data into training, validation, and test sets.
-->

In [None]:
# Sklearn train_test_split
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=(1-SPLITRATIO), random_state=RANDOM_SEED)
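
For intuition, the following sketch shows roughly what a seeded, shuffled split by `SPLITRATIO` amounts to, using only the standard library. It is not scikit-learn's implementation. The helper `split_indices` and the value of `RANDOM_SEED` are assumptions made for illustration.

```python
import random

SPLITRATIO = 0.8
RANDOM_SEED = 42  # hypothetical value; the notebook's actual seed is not shown


def split_indices(n_samples, split_ratio, seed):
    """Shuffle the indices 0..n_samples-1 and cut them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    cut = int(n_samples * split_ratio)
    return indices[:cut], indices[cut:]


train_idx, test_idx = split_indices(10, SPLITRATIO, RANDOM_SEED)
print(len(train_idx), len(test_idx))  # 8 2
```

The returned index lists could then be used to select rows, for example with `X.iloc[train_idx]` and `y.iloc[train_idx]`.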

---

# 5. Model Development

## 5.1 Model Selection
<!--
- Choose the model(s) to be trained (e.g., linear regression, decision trees, neural networks).
-->

## 5.2 Model Training
<!--
- Train the selected model(s) using the training data.
-->

## 5.3 Model Evaluation
<!--
- Evaluate model performance on validation data.
- Use appropriate metrics (e.g., accuracy, precision, recall, RMSE).
-->

## 5.4 Hyperparameter Tuning
<!--
- Fine-tune the model using techniques like Grid Search or Random Search.
- Evaluate the impact of different hyperparameters.
-->

## 5.5 Model Testing
<!--
- Evaluate the final model on the test dataset.
- Ensure that the model generalizes well to unseen data.
-->

## 5.6 Model Interpretation (Optional)
<!--
- Interpret the model results (e.g., feature importance, SHAP values).
- Discuss the strengths and limitations of the model.
-->

---

# 6. Predictions


## 6.1 Make Predictions
<!--
- Use the trained model to make predictions on new/unseen data.
-->

## 6.2 Save Model and Results
<!--
- Save the trained model to disk for future use.
- Export prediction results for further analysis.
-->

---

# 7. Documentation and Reporting

## 7.1 Summary of Findings
<!--
- Summarize the results and findings of the analysis.
-->

## 7.2 Next Steps
<!--
- Suggest further improvements, alternative models, or future work.
-->

## 7.3 References
<!--
- Cite any resources, papers, or documentation used.
-->