<a href="https://colab.research.google.com/github/Anas4444/HackathonDevWebGroupe10/blob/main/Copy_of_W%26B_and_PyData_Tunisia_Bean_Leaf_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<img src="https://wandb.me/logo-im-png" width="400" alt="Weights & Biases" />
<!--- @wandbcode{beans-comp-pydata-tunisia} -->

Use Weights & Biases for machine learning experiment tracking, dataset versioning, and project collaboration.


<img src="https://wandb.me/mini-diagram" width="650" alt="Weights & Biases" />


## What this notebook covers with Weights and Biases:
* Metrics logging 
* Exploratory Data Analysis (EDA)
* W&B plots such as Confusion Matrices, ROC curves & PR curves
* Hyperparameter search with W&B Sweeps



# ✅ Sign Up

Sign up for a free [Weights & Biases account here](https://wandb.ai/signup).

# Kaggle Competition Page

[Submit to the Competition here](https://www.kaggle.com/c/bean-comp-pytunisia/overview)

# 🚀 Installing and importing

In [None]:
!pip install -q --upgrade wandb
!pip install -q scikit-learn==1.0.1

In [None]:
import os
import wandb
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

A helper function that computes several classification metrics at once, prints them, and writes them to the W&B run summary

In [None]:
def log_metrics(labels, preds, is_val=True):
  # Prefix metric names so validation and train results stay separate in W&B
  pref = 'validation' if is_val else 'train'

  # Score the arguments passed in, not the notebook-level y_val / y_pred
  metrics = {}
  metrics[f"{pref}/accuracy_score"] = accuracy_score(labels, preds)
  metrics[f"{pref}/precision"] = precision_score(labels, preds, average="weighted")
  metrics[f"{pref}/recall"] = recall_score(labels, preds, average="weighted")
  metrics[f"{pref}/f1_score"] = f1_score(labels, preds, average="weighted")

  # Print each metric and store it in the run summary
  for k, v in metrics.items():
    print(f'{k} : {v}')
    wandb.summary[k] = v

  #wandb.log(metrics)

Set some constants 

In [None]:
PROJECT = 'beans-tabular-pydata-tunisia'
DATA_DIR = 'data'
ARTIFACT_PATH = 'wandb_fc/beans-tabular-pydata-tunisia/beans_competition_dataset:latest'

# 💾 Data
#### Download and Load the Data
The dataset CSV files (the train, validation and unlabeled test sets read in the next cell) will be downloaded to `DATA_DIR`


In [None]:
wandb.init(project=PROJECT, job_type='download_dataset')
artifact = wandb.use_artifact(ARTIFACT_PATH, type='dataset')
artifact_dir = artifact.download(DATA_DIR)
wandb.finish()

In [None]:
# Read csvs to DataFrame
train_df = pd.read_csv(f'{DATA_DIR}/train_c.csv')
train_df = train_df.sample(frac=1)  # shuffle the train data
train_df.reset_index(inplace=True, drop=True)

val_df = pd.read_csv(f'{DATA_DIR}/val.csv')
test_df = pd.read_csv(f'{DATA_DIR}/test_no_label.csv')

train_df.head()

In [None]:
test_df

#### Prep Data
Extract the `X` and `y` values and encode the class names as integers

In [None]:
le = preprocessing.LabelEncoder()

y_train_txt = train_df['Class'].values.tolist()
le.fit(y_train_txt)
labels = le.classes_

X_train = train_df.iloc[:,:-2].values.tolist()
y_train = le.transform(y_train_txt)

X_val = val_df.iloc[:,:-2].values.tolist()
y_val_txt = val_df['Class'].values.tolist()
y_val = le.transform(y_val_txt)

X_test = test_df.iloc[:,:-1].values.tolist()

labels = list(le.classes_)  # class names ordered to match the integer encoding used in y_train / y_val

list(le.inverse_transform([2, 2, 1]))
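
To make the encoding step concrete, here is a minimal sketch on made-up class names (the three names are illustrative assumptions, not read from the competition data). It shows that `LabelEncoder` sorts the class names, so the position of a name in `classes_` is its integer code, and that `inverse_transform` maps codes back to names.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical class names, for illustration only
toy_classes = ["SIRA", "BOMBAY", "SIRA", "CALI"]

toy_le = LabelEncoder()
toy_le.fit(toy_classes)

# classes_ is sorted; the index of each name is its integer code
toy_names = list(toy_le.classes_)
toy_codes = toy_le.transform(toy_classes).tolist()

# Map integer codes back to the original class names
toy_decoded = list(toy_le.inverse_transform([2, 0]))

print(toy_names)
print(toy_codes)
print(toy_decoded)
```

This ordering is why class names passed to plotting functions should come from `le.classes_` rather than from the order in which classes happen to appear in the data.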

# 🖼️ EDA with W&B Tables
Log the train and validation datasets to W&B Tables for EDA

In [None]:
wandb.init(project=PROJECT, job_type='log_dataset')
wandb.log({'Datasets/train_ds':train_df})
wandb.log({'Datasets/val_ds':val_df})
wandb.finish()

# 👟 Train
Train a [`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier) from scikit-learn

In [None]:
wandb.init(project=PROJECT)

model = RandomForestClassifier()

# ✍️ Log your model's parameter config to W&B
wandb.config.update(model.get_params())

model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred = model.predict(X_val)
y_probas = model.predict_proba(X_val)

importances = model.feature_importances_
indices = np.argsort(importances)[::-1]

✍️ Log your model's metrics to W&B

In [None]:
log_metrics(y_val, y_pred)
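
The precision, recall and F1 values above use `average="weighted"`. As a rough illustration of what that means, the sketch below recomputes weighted precision by hand on a tiny made-up set of labels (the arrays are assumptions for illustration, not competition data): each class's precision is weighted by the number of true samples in that class.

```python
import numpy as np
from sklearn.metrics import precision_score

# Toy labels and predictions, for illustration only
toy_true = np.array([0, 0, 1, 1, 1, 2])
toy_pred = np.array([0, 1, 1, 1, 1, 2])

# Per-class precision: correct predictions of a class / all predictions of that class
per_class = precision_score(toy_true, toy_pred, average=None)

# Support: number of true samples in each class
support = np.bincount(toy_true)

# Weighted average: per-class scores weighted by support
manual_weighted = float(np.sum(per_class * support) / support.sum())
sklearn_weighted = float(precision_score(toy_true, toy_pred, average="weighted"))

print(per_class, manual_weighted, sklearn_weighted)
```

Because the weights follow class frequency, a weighted score can hide poor performance on a rare class, which is one reason to also look at the per-class charts.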

# 🤩 Visualize Model Performance in W&B
Weights & Biases has charting functions for popular model evaluation charts, including confusion matrices, ROC curves, PR curves and more.
[Check out wandb charts documentation here $\rightarrow$](https://docs.wandb.ai/guides/track/log/plots#model-evaluation-charts)

**Confusion Matrix**


In [None]:
wandb.log({"Confusion Matrix": wandb.plot.confusion_matrix(probs=y_probas, y_true=y_val, class_names=labels)})
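
The call above takes class probabilities rather than hard predictions. As a minimal sketch of how probabilities turn into confusion-matrix counts, assuming the predicted class is simply the one with the highest probability, the toy example below builds the count matrix with NumPy (the probabilities and labels are made up for illustration).

```python
import numpy as np

# Toy predicted probabilities (rows: samples, columns: classes), for illustration only
toy_probas = np.array([
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.4, 0.3],
    [0.1, 0.2, 0.7],
])
toy_true = np.array([0, 0, 1, 1, 1, 2])

# Assumption: the predicted class is the column with the highest probability
toy_pred = toy_probas.argmax(axis=1)

# counts[i, j] = number of samples with true class i predicted as class j
n_classes = toy_probas.shape[1]
counts = np.zeros((n_classes, n_classes), dtype=int)
for t, p in zip(toy_true, toy_pred):
    counts[t, p] += 1

print(counts)
```

Off-diagonal entries are the misclassifications; here one sample of class 0 is predicted as class 1.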

**ROC Curve**


In [None]:
wandb.log({"ROC Curve": wandb.plot.roc_curve(y_val, y_probas, labels=labels, title='ROC Curve')})

**Precision Recall Curve**

In [None]:
wandb.log({"Precision-Recall": wandb.plot.pr_curve(y_val, y_probas, labels=labels, title='Precision-Recall')})

**Feature Importances**

Evaluates and plots the importance of each feature for the classification task. This only works with classifiers that expose a `feature_importances_` attribute, such as tree-based models.

In [None]:
feat_names = train_df.columns.values[:-2]  # same feature columns that were used to build X_train
imps = []
feats = []
for i in indices:
  imps.append(importances[i])
  feats.append(feat_names[i])

fi_data = pd.DataFrame({"Feature":feats, "Importance":imps})
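
The ranking logic in the cell above can be seen in isolation in this small sketch. The feature names and importance values are hypothetical, chosen only to show how `np.argsort(...)[::-1]` yields indices from most to least important.

```python
import numpy as np

# Hypothetical importances and feature names, for illustration only
toy_importances = np.array([0.10, 0.55, 0.35])
toy_feature_names = np.array(["Area", "Perimeter", "Eccentricity"])

# argsort returns indices in ascending order; reversing gives descending importance
order = np.argsort(toy_importances)[::-1]

# Pair each feature name with its importance, most important first
ranked = [(str(toy_feature_names[i]), float(toy_importances[i])) for i in order]

for name, value in ranked:
    print(f"{name}: {value:.2f}")
```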

In [None]:
table = wandb.Table(dataframe=fi_data)  # build the table directly from the DataFrame
wandb.log({"Feature Importance" : wandb.plot.bar(table, "Feature",
                               "Importance", title="Feature Importance")})

#### 🏁 Finish W&B Run
When you've finished logging for a run, call `wandb.finish()` so that metrics from your next experiment are not logged to the wrong run.

In [None]:
wandb.finish()

# Submission

In [None]:
y_pred_test = model.predict(X_test)
y_pred_test = list(le.inverse_transform(y_pred_test))
ids = test_df.id.values

submission_df = pd.DataFrame({'Id':ids, 'Predicted':y_pred_test})
submission_df.to_csv('submission.csv', index=False)

# 🧪 Hyperparameter Sweep

Weights & Biases also lets you run hyperparameter sweeps with its [Sweeps functionality](https://docs.wandb.ai/guides/sweeps/python-api).

#### Sweep Train Function
A W&B Sweep needs to be given a config and a training function to run.

In [None]:
def train():     
    with wandb.init() as _:
      
      model = RandomForestClassifier(
          n_estimators=wandb.config['n_estimators'],   # n_estimators parameter will now be set by W&B
          max_depth=wandb.config['max_depth']     # max_depth parameter will now be set by W&B
          
          # [Optional] add additional model parameters here
          
          )
      
      # ✍️ Log your model's parameter config to W&B
      wandb.config['model_type'] = 'random_forest'
      wandb.config.update(model.get_params())

      model.fit(X_train, y_train)

      y_pred_train = model.predict(X_train)
      y_pred = model.predict(X_val)
      y_probas = model.predict_proba(X_val)
        
      wandb.summary["validation/accuracy"] = accuracy_score(y_val, y_pred)
      wandb.summary["validation/precision"] = precision_score(y_val, y_pred, average="weighted")
      wandb.summary["validation/recall"] = recall_score(y_val, y_pred, average="weighted")
      wandb.summary["validation/f1_score"] = f1_score(y_val, y_pred, average="weighted")

💡 **Tip**

The `train` function above uses scikit-learn's [`RandomForestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier), but you can also modify the code to use other models such as `DecisionTreeClassifier` or `AdaBoostClassifier`, or other boosting models such as [`XGBoost`](https://xgboost.readthedocs.io/en/latest/get_started.html).

Note that you'll likely have to change the parameter names in the `sweep_config` when using these models in a sweep.


```
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

model = DecisionTreeClassifier()
model = AdaBoostClassifier()
```



#### Sweep Config
Define the name of your sweep, the search method, and the parameters to sweep over. See the [Sweep Configuration Docs](https://docs.wandb.ai/guides/sweeps/configuration) for more advanced functionality.

In [None]:
sweep_config = {
  "name" : "beans_sweep",
  "method" : "random",
  "parameters" : {
    "n_estimators" :{
      "min": 10,
      "max": 400
    },
    "max_depth" :{
      "min": 2,
      "max": 100
    },

    # [Optional] add additional parameters here

  }
}

sweep_id = wandb.sweep(sweep_config, project=PROJECT)

💡 **Tip**

The above `sweep_config` is very simple. Consider sweeping over additional parameters, and don't forget to modify your `train` function to pass these additional parameters to your model.
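
As one possible way to follow this tip, the sketch below adds two more `RandomForestClassifier` parameters to the config and shows a helper that passes them to the model. The names `extended_sweep_config` and `build_model`, the sweep name, and the chosen ranges are assumptions for illustration. The `metric` entry is optional for a random search and points at the `validation/f1_score` summary key written by `train`.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical extended config; ranges are illustrative, not tuned
extended_sweep_config = {
    "name": "beans_sweep_extended",
    "method": "random",
    "metric": {"name": "validation/f1_score", "goal": "maximize"},
    "parameters": {
        "n_estimators": {"min": 10, "max": 400},
        "max_depth": {"min": 2, "max": 100},
        "min_samples_split": {"min": 2, "max": 20},
        "criterion": {"values": ["gini", "entropy"]},
    },
}

def build_model(config):
    """Hypothetical helper: map sweep parameters onto RandomForestClassifier arguments."""
    return RandomForestClassifier(
        n_estimators=config["n_estimators"],
        max_depth=config["max_depth"],
        min_samples_split=config["min_samples_split"],
        criterion=config["criterion"],
    )

# Inside `train`, `wandb.config` would be passed instead of this plain dict
example_model = build_model(
    {"n_estimators": 50, "max_depth": 10, "min_samples_split": 4, "criterion": "gini"}
)
print(example_model)
```

Keeping the parameter names in the config identical to the model's argument names makes the mapping in the training function easy to check.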

#### Run Sweep
Define the number of experiments to run with `N_RUNS`, then pass `sweep_id` and the training function to `wandb.agent` to start the sweep


In [None]:
N_RUNS = 50 # number of runs to execute
wandb.agent(sweep_id, project=PROJECT, function=train, count=N_RUNS)
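
Each of the `N_RUNS` runs trains with one sampled configuration. The sketch below illustrates the general idea of a random search over the same two ranges using only the standard library. It is an illustration of the technique under the assumption of uniform integer sampling, not a description of how W&B samples internally, and `toy_space` and `sample_config` are hypothetical names.

```python
import random

# Same bounds as the sweep config above, copied into a plain dict
toy_space = {
    "n_estimators": {"min": 10, "max": 400},
    "max_depth": {"min": 2, "max": 100},
}

def sample_config(space, rng):
    """Draw one integer per parameter, uniformly between min and max (inclusive)."""
    return {name: rng.randint(b["min"], b["max"]) for name, b in space.items()}

rng = random.Random(0)  # fixed seed so the draws are repeatable
samples = [sample_config(toy_space, rng) for _ in range(5)]

for s in samples:
    print(s)
```

Random search does not use results from earlier runs, so runs are independent and the sweep can be stopped or extended at any point.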

# 🪄 More from W&B
#### 🎨 Example Gallery

See examples of projects tracked and visualized with W&B in our gallery, [Fully Connected→](https://app.wandb.ai/gallery)

#### 🏙️ Community

Join a community of ML practitioners in our
[Discourse forum→](http://wandb.me/and-you)

#### 📏 Best Practices

1. **Projects**: Log multiple runs to a project to compare them. `wandb.init(project="project-name")`
2. **Groups**: For multiple processes or cross validation folds, log each process as a run and group them together. `wandb.init(group='experiment-1')`
3. **Tags**: Add tags to track your current baseline or production model.
4. **Notes**: Type notes in the table to track the changes between runs.
5. **Reports**: Take quick notes on progress to share with colleagues and make dashboards and snapshots of your ML projects.
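
The project, group, tag and note practices above can be combined in a single `wandb.init` call. The sketch below only builds the keyword arguments for a set of cross-validation folds so that it runs without a W&B login. The helper `run_kwargs_for_fold`, the group name, and the tag are hypothetical.

```python
def run_kwargs_for_fold(fold, n_folds=5):
    """Hypothetical helper: build wandb.init arguments for one cross-validation fold."""
    return {
        "project": "beans-tabular-pydata-tunisia",
        "group": "rf-cross-validation",   # hypothetical group shared by all folds
        "name": f"fold-{fold}",
        "tags": ["baseline"],             # hypothetical tag marking the current baseline
        "notes": f"Random forest, fold {fold} of {n_folds}",
    }

all_kwargs = [run_kwargs_for_fold(f) for f in range(1, 6)]

# In the notebook, each dict would be passed as: wandb.init(**kwargs)
for kwargs in all_kwargs:
    print(kwargs["group"], kwargs["name"], kwargs["tags"])
```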

#### 🤓 Advanced Setup

1. [Environment variables](https://docs.wandb.com/library/environment-variables): Set API keys in environment variables so you can run training on a managed cluster.
2. [Offline mode](https://docs.wandb.com/library/technical-faq#can-i-run-wandb-offline): Use offline mode (`WANDB_MODE=offline`, formerly `dryrun`) to train offline and sync results later.
3. [On-prem](https://docs.wandb.com/self-hosted): Install W&B in a private cloud or air-gapped servers in your own infrastructure. We have local installations for everyone from academics to enterprise teams.
4. [Sweeps](http://wandb.me/sweeps-colab): Set up hyperparameter search quickly with our lightweight tool for tuning.
5. [Artifacts](http://wandb.me/artifacts-colab): Track and version models and datasets in a streamlined way that automatically picks up your pipeline steps as you train models.
6. [Tables](http://wandb.me/dsviz-nature-colab): Log, query, and analyze tabular data. Understand your datasets, visualize model predictions, and share insights in a central dashboard.