<!-- Notebook title -->
# Title

# 1. Notebook Description

### 1.1 Task Description
<!-- 
- A brief description of the problem you're solving with machine learning.
- Define the objective (e.g., classification, regression, clustering, etc.).
-->

TODO

### 1.2 Useful Resources
<!--
- Links to relevant papers, articles, or documentation.
- Description of the datasets (if external).
-->

### 1.2.1 Data

#### 1.2.1.1 Common

* [Datasets Kaggle](https://www.kaggle.com/datasets)  
  &nbsp;&nbsp;&nbsp;&nbsp;A vast repository of datasets across various domains provided by Kaggle, a platform for data science competitions.
  
* [Toy datasets from Sklearn](https://scikit-learn.org/stable/datasets/toy_dataset.html)  
  &nbsp;&nbsp;&nbsp;&nbsp;A collection of small datasets that come with the Scikit-learn library, useful for quick prototyping and testing algorithms.
  
* [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)  
  &nbsp;&nbsp;&nbsp;&nbsp;A widely-used repository for machine learning datasets, with a variety of real-world datasets available for research and experimentation.
  
* [Google Dataset Search](https://datasetsearch.research.google.com/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A tool from Google that helps to find datasets stored across the web, with a focus on publicly available data.
  
* [AWS Public Datasets](https://registry.opendata.aws/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A registry of publicly available datasets that can be analyzed on the cloud using Amazon Web Services (AWS).
  
* [Microsoft Azure Open Datasets](https://azure.microsoft.com/en-us/services/open-datasets/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A collection of curated datasets from various domains, made available by Microsoft Azure for use in machine learning and analytics.
  
* [Awesome Public Datasets](https://github.com/awesomedata/awesome-public-datasets)  
  &nbsp;&nbsp;&nbsp;&nbsp;A GitHub repository that lists a wide variety of datasets across different domains, curated by the community.
  
* [Data.gov](https://www.data.gov/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A portal to the US government's open data, offering access to a wide range of datasets from various federal agencies.
  
* [Google BigQuery Public Datasets](https://cloud.google.com/bigquery/public-data)  
  &nbsp;&nbsp;&nbsp;&nbsp;Public datasets hosted by Google BigQuery, allowing for quick and powerful querying of large datasets in the cloud.
  
* [Papers with Code](https://paperswithcode.com/datasets)  
  &nbsp;&nbsp;&nbsp;&nbsp;A platform that links research papers with the corresponding code and datasets, helping researchers reproduce results and explore new data.
  
* [Zenodo](https://zenodo.org/)  
  &nbsp;&nbsp;&nbsp;&nbsp;An open-access repository that allows researchers to share datasets, software, and other research outputs, often linked to academic publications.
  
* [The World Bank Open Data](https://data.worldbank.org/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A comprehensive source of global development data, with datasets covering various economic and social indicators.
  
* [OpenML](https://www.openml.org/)  
  &nbsp;&nbsp;&nbsp;&nbsp;An online platform for sharing datasets, machine learning experiments, and results, fostering collaboration in the ML community.
  
* [Stanford Large Network Dataset Collection (SNAP)](https://snap.stanford.edu/data/)  
  &nbsp;&nbsp;&nbsp;&nbsp;A collection of large-scale network datasets from Stanford University, useful for network analysis and graph-based machine learning.
  
* [KDnuggets Datasets](https://www.kdnuggets.com/datasets/index.html)  
  &nbsp;&nbsp;&nbsp;&nbsp;A curated list of datasets for data mining and data science, compiled by the KDnuggets community.


#### 1.2.1.2 Project

### 1.2.2 Learning

* [K-Nearest Neighbors on Kaggle](https://www.kaggle.com/code/mmdatainfo/k-nearest-neighbors)

* [Complete Guide to K-Nearest-Neighbors](https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor)

### 1.2.3 Documentation

---

# 2. Setup

## 2.1 Imports
<!--
- Import necessary libraries (e.g., `numpy`, `pandas`, `matplotlib`, `scikit-learn`, etc.).
-->

In [55]:
from ikt450.src.common_imports import *
from ikt450.src.config import get_paths
from ikt450.src.common_func import load_dataset, save_dataframe, ensure_dir_exists
import pandas as pd
import numpy as np
import random

## 2.2 Global Variables
<!--
- Define global constants, paths, and configuration settings used throughout the notebook.
-->

### 2.2.1 Paths

In [56]:
paths = get_paths()

### 2.2.2 Seed

In [57]:
RANDOM_SEED = 7

### 2.2.3 Split ratio

In [58]:
SPLITRATIO = 0.8

## 2.3 Function Definitions
<!--
- Define helper functions that will be used multiple times in the notebook.
- Consider organizing these into separate sections (e.g., data processing functions, model evaluation functions).
-->

---

# 3. System Setup 
<!-- (Optional but recommended) -->

## 3.1 Styling
<!--
- Set up any visual styles (e.g., for plots).
- Configure notebook display settings (e.g., `matplotlib` defaults, pandas display options).
-->

## 3.2 Environment Configuration
<!--
- Check system dependencies, versions, and ensure reproducibility (e.g., set random seeds).
-->

### 3.2.1 Seed

In [59]:
np.random.seed(RANDOM_SEED)

---

# 4. Data Processing

## 4.1 Data loading
<!--
- Load datasets from files or other sources.
-->

In [60]:
%ls {paths['PATH_COMMON_DATASETS']}

 Volume in drive C is Windows
 Volume Serial Number is FA0F-7C2E

 Directory of C:\Users\jonin\Documents\ikt450\ikt450\common\datasets

03.09.2024  21:23    <DIR>          .
28.08.2024  02:09    <DIR>          ..
03.09.2024  21:22            19�488 ecoli.data
03.09.2024  21:22             3�022 ecoli.names
23.08.2024  18:45            23�278 pima-indians-diabetes.data.csv
               3 File(s)         45�788 bytes
               2 Dir(s)  260�156�375�040 bytes free


In [61]:
df = pd.read_csv(f"{paths['PATH_COMMON_DATASETS']}/ecoli.data",  delim_whitespace=True, header=None)

  df = pd.read_csv(f"{paths['PATH_COMMON_DATASETS']}/ecoli.data",  delim_whitespace=True, header=None)


## 4.2 Data inspection
<!--
- Preview the data (e.g., `head`, `describe`).
-->

### 4.2.1 Info

In [62]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 336 entries, 0 to 335
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       336 non-null    object 
 1   1       336 non-null    float64
 2   2       336 non-null    float64
 3   3       336 non-null    float64
 4   4       336 non-null    float64
 5   5       336 non-null    float64
 6   6       336 non-null    float64
 7   7       336 non-null    float64
 8   8       336 non-null    object 
dtypes: float64(7), object(2)
memory usage: 23.8+ KB


### 4.2.2 Describe

In [63]:
df.describe()

Unnamed: 0,1,2,3,4,5,6,7
count,336.0,336.0,336.0,336.0,336.0,336.0,336.0
mean,0.50006,0.5,0.495476,0.501488,0.50003,0.500179,0.499732
std,0.194634,0.148157,0.088495,0.027277,0.122376,0.215751,0.209411
min,0.0,0.16,0.48,0.5,0.0,0.03,0.0
25%,0.34,0.4,0.48,0.5,0.42,0.33,0.35
50%,0.5,0.47,0.48,0.5,0.495,0.455,0.43
75%,0.6625,0.57,0.48,0.5,0.57,0.71,0.71
max,0.89,1.0,1.0,1.0,0.88,1.0,0.99


### 4.2.3 Head

In [64]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,AAT_ECOLI,0.49,0.29,0.48,0.5,0.56,0.24,0.35,cp
1,ACEA_ECOLI,0.07,0.4,0.48,0.5,0.54,0.35,0.44,cp
2,ACEK_ECOLI,0.56,0.4,0.48,0.5,0.49,0.37,0.46,cp
3,ACKA_ECOLI,0.59,0.49,0.48,0.5,0.52,0.45,0.36,cp
4,ADI_ECOLI,0.23,0.32,0.48,0.5,0.55,0.25,0.35,cp


## 4.3 Data Visualization

In [65]:
# TODO Add code for visualization

## 4.4 Data Cleaning
<!--
- Handle missing values, outliers, and inconsistencies.
- Remove or impute missing data.
-->

### 4.4.1 NULL, NaN, Missing values

In [66]:
df.isnull().sum()

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
dtype: int64

In [67]:
df.isna().sum()

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
dtype: int64

In [68]:
df.duplicated().sum()

0

In [69]:
#df.corr()

## 4.5 Feature Engineering
<!--
- Create new features from existing data.
- Normalize or standardize features.
- Encode categorical variables.
-->

### 4.5.1 Normalize

#### 4.5.1.1 Feature Selection / Data Separation

<details>
<br>
<details>
<summary>What does it?</summary>
<br>
This line removes the `` column from the DataFrame `df` and assigns the remaining columns to `X`.
</details>
<br>
<details>
<summary>Why do we do it?</summary>
<br>
We do this to separate the input features (which are stored in `X`) from the target variable (which will be stored in `y`). This separation is essential in supervised learning tasks where the goal is to predict the target variable based on the input features.
</details>
</details>

In [70]:
df.columns

Index([0, 1, 2, 3, 4, 5, 6, 7, 8], dtype='int64')

In [71]:
df.rename(columns={0: 'id', 1: 'mcg', 2: 'gvh', 3: 'lip', 4: 'chg', 5: 'aac', 6: 'alm1', 7: 'alm2', 8: 'class'}, inplace=True)

# remove id column
df.drop('id', axis=1, inplace=True)
# keep only the rows with class 'cp' or 'im'
df = df[df['class'].isin(['cp', 'im'])]


In [72]:

# encode class
df['class'] = df['class'].map({'cp': 0, 'im': 1})

# split data into X and y
X_data = df.drop('class', axis=1)
Y_data = df['class']

# standardize the data
#X_data = (X_data - X_data.mean()) / X_data.std()





#### 4.5.1.2 Target Variable Extraction

<details>
<br>
<details>
<summary>What does it?</summary>
<br>
This line selects the `` column from the DataFrame `df` and assigns it to `y`.
</details>
<br>
<details>
<summary>Why do we do it?</summary>
<br>
We do this to isolate the target variable, which represents the labels or outcomes that we aim to predict using the machine learning model.
</details>
</details>

---

# 5. Model Development

In [73]:
class preseptron():
    def __init__(self, n_inputs):
        self.w = np.random.rand(n_inputs)*0.001
        self.b = np.random.rand(1)

    def sigmoid(self, x):
        x = np.clip(x, -500, 500)  # Clipping to prevent overflow
        return 1 / (1 + np.exp(-x))
    
    def sigmoid_derivative(self, x):
        return x * (1 - x)
    
    def forward(self, x):
     
        
        self.latest_output = self.sigmoid(np.dot(x, self.w) + self.b)
        self.latest_input = x
        return self.latest_output
    
    def backward(self,error,lr):

        gradient = error * self.sigmoid_derivative(self.latest_output)
 
        self.w += lr * gradient * self.latest_input
        self.b += lr * gradient
        return gradient * self.w
    



        
    


def mse (y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)
    
    
    
    

In [74]:
class MLP_scrach():
    def __init__(self, layer1_size, layer2_size, layer3_size, lr=0.1):
        self.lr = lr
        self.layer1 = [preseptron(7) for _ in range(layer1_size)]
        self.layer2 = [preseptron(layer1_size) for _ in range(layer2_size)]
        self.layer3 = [preseptron(layer2_size) for _ in range(layer3_size)]
        
    def forward(self, inputs):
        inputs = np.array(inputs)
        
        x = []
        for l in self.layer1:
            x.append(l.forward(inputs))
        inputs = np.array(x)
        inputs = np.reshape(inputs, (len(inputs)))
        
        x = []
        for l in self.layer2:
            x.append(l.forward(inputs))
        inputs = np.array(x)
        inputs = np.reshape(inputs, (len(inputs)))
        x = []
        for l in self.layer3:
            x.append(l.forward(inputs))
        return np.array(x)
    
    def backward(self,error):
        initial_error = error
        new_error = []
        for l, err in zip(self.layer3,initial_error):
            new_error.append(l.backward(err,self.lr))
        initial_error = np.array(new_error)
        initial_error = np.sum(initial_error, axis=0)
        new_error = []
        for l, err in zip(self.layer2,initial_error):
            new_error.append(l.backward(err,self.lr))
        initial_error = np.array(new_error)
      #  print(initial_error.shape,"2")
        initial_error = np.sum(initial_error, axis=0)
    
        new_error = []
        for l, err in zip(self.layer1,initial_error):
            new_error.append(l.backward(err,self.lr))
        
        


 

In [75]:
# lets try to train the model
model = MLP_scrach(5,5,1, 0.05)
# split the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X_data, Y_data, test_size=0.2, random_state=RANDOM_SEED)

# train the model

for i in range(1000):
    epoch_loss = 0
    for x,y in zip(X_train.values, y_train.values):
       
        y_pred = model.forward(x)
        
        error = y - y_pred
        epoch_loss += mse(y, y_pred)
       
        model.backward(error)
    print(f"Epoch {i} loss: {epoch_loss}")
    
   


Epoch 0 loss: 45.031450756544885
Epoch 1 loss: 41.97651908279049
Epoch 2 loss: 41.82419435178383
Epoch 3 loss: 41.81111953409264
Epoch 4 loss: 41.80927029816671
Epoch 5 loss: 41.80902176869111
Epoch 6 loss: 41.809088973082964
Epoch 7 loss: 41.809225312164216
Epoch 8 loss: 41.809377108468006
Epoch 9 loss: 41.80953229761506
Epoch 10 loss: 41.80968814620371
Epoch 11 loss: 41.80984403095497
Epoch 12 loss: 41.80999980637018
Epoch 13 loss: 41.81015543506894
Epoch 14 loss: 41.81031090409427
Epoch 15 loss: 41.81046620597597
Epoch 16 loss: 41.81062133444855
Epoch 17 loss: 41.810776283482575
Epoch 18 loss: 41.81093104706546
Epoch 19 loss: 41.81108561915017
Epoch 20 loss: 41.811239993642786
Epoch 21 loss: 41.81139416439862
Epoch 22 loss: 41.811548125220014
Epoch 23 loss: 41.81170186985461
Epoch 24 loss: 41.811855391993895
Epoch 25 loss: 41.81200868527191
Epoch 26 loss: 41.812161743263225
Epoch 27 loss: 41.81231455948204
Epoch 28 loss: 41.81246712738026
Epoch 29 loss: 41.812619440346225
Epoch 30 l

In [76]:

# test the model
y_pred = []
for x in X_test.values:
    y_pred.append(model.forward(x))
y_pred = np.array(y_pred)
y_pred = np.reshape(y_pred, (len(y_pred)))
print(mse(y_test.values, y_pred))
accuracy = np.mean(y_test.values == np.round(y_pred))
accuracy
   

0.007225510454271275


1.0

## 5.1 Model Selection
<!--
- Choose the model(s) to be trained (e.g., linear regression, decision trees, neural networks).
-->

## 5.2 Model Training
<!--
- Train the selected model(s) using the training data.
-->

## 5.3 Model Evaluation
<!--
- Evaluate model performance on validation data.
- Use appropriate metrics (e.g., accuracy, precision, recall, RMSE).
-->

## 5.4 Hyperparameter Tuning
<!--
- Fine-tune the model using techniques like Grid Search or Random Search.
- Evaluate the impact of different hyperparameters.
-->

## 5.5 Model Testing
<!--
- Evaluate the final model on the test dataset.
- Ensure that the model generalizes well to unseen data.
-->

## 5.6 Model Interpretation (Optional)
<!--
- Interpret the model results (e.g., feature importance, SHAP values).
- Discuss the strengths and limitations of the model.
-->

---

# 6. Predictions


## 6.1 Make Predictions
<!--
- Use the trained model to make predictions on new/unseen data.
-->

## 6.2 Save Model and Results
<!--
- Save the trained model to disk for future use.
- Export prediction results for further analysis.
-->

---

# 7. Documentation and Reporting

## 7.1 Summary of Findings
<!--
- Summarize the results and findings of the analysis.
-->

## 7.2 Next Steps
<!--
- Suggest further improvements, alternative models, or future work.
-->

## 7.3 References
<!--
- Cite any resources, papers, or documentation used.
-->