# Definitions, thoughts and everything else

# Table of Contents
- [Machine Learning models](#section1)
- [Target Variables](#section2)
- [Regression vs. Classification](#section3)
- [Model Choices Based on Target Variable](#section4)
  - [Binary Target Variable (Classification)](#subsection4-1)
  - [Count Target Variable (Regression)](#subsection4-2)
  - [Percentage Target Variable (Regression)](#subsection4-3)
- [Metrics](#section5)
- [Models Tried](#section6)
  - [Random Forest Classifier with binary Target](#subsection6-1)
    - [Key Findings](#subsection6-1-1)
    - [Results](#subsection6-1-2)
  - [Random Forest Regressor with percentage Target](#subsection6-2)

<a id="section1"></a>
## Machine Learning models

Machine Learning
  ├── Supervised Learning
  │    ├── Regression
  │    │    ├── Linear Regression (LR)
  │    │    └── Generalized Linear Models (GLM)
  │    ├── Classification
  │    │    ├── Logistic Regression (LogR)
  │    │    └── Support Vector Machines (SVM)
  │    ├── Generative Learning
  │    │    ├── Gaussian Discriminant Analysis (GDA)
  │    │    └── Naive Bayes (NB)
  │    ├── Tree-based and Ensemble Methods
  │    │    ├── CART
  │    │    ├── Random Forest (RF)
  │    │    └── Boosting
  │    └── Other Non-parametric Approaches
  │         └── k-Nearest Neighbors (k-NN)
  ├── Unsupervised Learning
  │    ├── Clustering
  │    │    ├── k-means
  │    │    ├── Hierarchical Clustering
  │    │    └── Expectation-Maximization (EM)
  │    └── Dimension Reduction
  │         ├── Principal Component Analysis (PCA)
  │         └── Independent Component Analysis (ICA)
  ├── Deep Learning
  │    ├── Neural Networks (NN)
  │    ├── Convolutional Neural Networks (CNN)
  │    └── Recurrent Neural Networks (RNN)
  └── Reinforcement Learning and Control


## ML Process Summary

- Data Preparation: Normalize and split your data.

- Model Definition: Define your model architecture (for example, an LSTM).

- Hyperparameter Tuning: Use methods like Grid Search, Random Search, or manual tuning.

- Model Training: Train your model with training data and validate with validation data.

- Model Evaluation: Evaluate the model on test data using appropriate metrics.

- Cross-Validation: Ensure robustness with K-fold cross-validation.

- Advanced Tuning: Consider advanced hyperparameter optimization libraries.

- Model Persistence: Save and load your trained model.
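
The data preparation step above can be sketched in plain Python. This is a minimal illustration that assumes time-ordered observations; the helper names `chronological_split` and `fit_min_max`, the 70/15/15 split, and the sample values are all hypothetical and not taken from the project.

```python
def chronological_split(rows, train_pct=70, val_pct=15):
    """Split time-ordered rows into train/validation/test without shuffling."""
    n = len(rows)
    train_end = n * train_pct // 100
    val_end = n * (train_pct + val_pct) // 100
    return rows[:train_end], rows[train_end:val_end], rows[val_end:]


def fit_min_max(values):
    """Learn min-max scaling parameters from the given values only."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant feature
    return lambda x: (x - lo) / span


# Hypothetical feature values, ordered by time.
series = [12.0, 15.0, 9.0, 20.0, 18.0, 7.0, 14.0, 16.0, 11.0, 19.0]

train, val, test = chronological_split(series)

# Fit the scaler on the training part only, then reuse it everywhere,
# so that no information from validation or test data leaks into training.
scale = fit_min_max(train)
train_scaled = [scale(x) for x in train]
val_scaled = [scale(x) for x in val]
test_scaled = [scale(x) for x in test]
```

Scaled validation and test values may fall outside [0, 1] when they exceed the range seen in training; that is expected with this approach.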

<a id="section2"></a>
## Target Variables

<a id="section3"></a>
### Regression vs. Classification

The primary distinction between regression and classification problems lies in the nature of the target variable.

**Regression:** Predicts a continuous numerical value.

Example: Predicting the price of a house, the temperature, or, in this project, the count of available docks or the percentage of docking availability.

**Classification:** Predicts a categorical value or class.

Example: Classifying an email as spam or not spam, predicting whether a customer will churn, or, in this project, predicting whether there are any available docks (binary classification).
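
As a rough illustration of how one raw observation can feed all three kinds of target, the sketch below derives a count, a percentage, and a binary label from a single dock reading. The function name `make_targets`, the `capacity` input, and the sample readings are assumptions made for the example.

```python
def make_targets(docks_available, capacity):
    """Derive three alternative targets from one dock observation."""
    return {
        "count": docks_available,                   # regression on a count
        "percentage": docks_available / capacity,   # regression on a value in [0, 1]
        "has_docks": docks_available > 0,           # binary classification
    }


# Hypothetical station readings: (docks_available, capacity)
readings = [(0, 20), (5, 20), (20, 20)]
targets = [make_targets(docks, capacity) for docks, capacity in readings]

for row in targets:
    print(row)
```

Which of the three targets to model then determines the family of models and metrics discussed in the following sections.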

<a id="section4"></a>
## Model Choices Based on Target Variable

<a id="subsection4-1"></a>
### Binary Target Variable (Classification)

* **Logistic Regression:** Often the starting point for binary classification.

* **Decision Trees:** Can handle non-linear relationships and provide interpretable models.

* **Random Forest:** An ensemble of decision trees, often providing better performance than individual trees.

* **Support Vector Machines (SVM):** Effective for complex classification problems.

* **Naive Bayes:** Simple and efficient, especially for text classification but can also be used for numerical data.
 
* **Neural Networks:** Powerful but require more data and computational resources.

<a id="subsection4-2"></a>
### Count Target Variable (Regression)

* **Linear Regression:** Suitable for linear relationships between features and the target.

* **Decision Trees:** Can handle non-linear relationships but might not be as accurate as specialized regression models.

* **Random Forest:** Often performs well for regression tasks.

* **Gradient Boosting Machines (GBM):** Known for their strong performance in regression problems.

* **Support Vector Regression (SVR):** An extension of SVM for regression.

* **Neural Networks:** Can be used for regression but require careful architecture design.

<a id="subsection4-3"></a>
### Percentage Target Variable (Regression)

* **Linear Regression:** Can be a starting point, but consider transforming the target if it is bounded or far from normally distributed.

* **Decision Trees, Random Forest, GBM:** These models can handle non-linear relationships and often perform well for percentage-based predictions.

* **Beta Regression:** Specifically designed for modeling data constrained between 0 and 1, which is suitable for percentages.
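
One common way to make a bounded percentage target more suitable for a linear model is a logit transform, which maps proportions onto the whole real line. The sketch below is only an illustration of that idea, not something these notes report having tried; the clipping constant `eps` and the sample proportions are assumptions.

```python
import math


def logit(p, eps=1e-6):
    """Map a proportion in [0, 1] to the real line, clipping away exact 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def inverse_logit(z):
    """Map a real-valued prediction back to a proportion in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical docking-availability proportions.
proportions = [0.0, 0.25, 0.5, 0.9, 1.0]
transformed = [logit(p) for p in proportions]         # fit the model on these
recovered = [inverse_logit(z) for z in transformed]   # map predictions back
```

Predictions mapped back through `inverse_logit` always stay strictly between 0 and 1, which an untransformed linear regression cannot guarantee.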

<a id="section5"></a>
## Metrics

<a id="subsection5-1"></a>
### Depends on the target variable (like everything)

* **Binary:** Accuracy, precision, recall, F1-score, ROC curve, AUC.
* **Count or percentage:** Mean Squared Error (MSE), Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), R-squared.
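
The regression metrics listed above follow directly from their definitions. The sketch below computes them in plain Python on a few made-up values; `regression_metrics` and the sample data are hypothetical.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MSE, MAE, RMSE and R-squared for paired true/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    return {
        "mse": mse,
        "mae": mae,
        "rmse": math.sqrt(mse),
        "r2": 1.0 - ss_res / ss_tot,
    }


# Hypothetical dock counts and predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
metrics = regression_metrics(y_true, y_pred)
```

RMSE is in the same unit as the target (docks or percentage points), which usually makes it easier to interpret than MSE.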

<a id="section6"></a>
## Models Tried

<a id="subsection6-1"></a>
### Random Forest Classifier with binary Target
<a id="subsection6-1-1"></a>
#### Key Findings

* Training on all 510+ stations takes far too long.

* Trained on only 5 stations instead.

* Class imbalance: the model predicts True well but reaches only 61% for False (no docking available).

* Uninformative fields: is_holiday, district

* GridSearch: the first run took 7.5 hours and did not improve results much; it now takes more than 2 hours.
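
A common response to the class imbalance noted above is to weight each class inversely to its frequency. The sketch below computes such weights with the formula `n_samples / (n_classes * class_count)`, which is the heuristic scikit-learn applies for `class_weight="balanced"`. Whether the models in these notes used class weights is not stated, so treat this as an option rather than a description of what was done. The class counts are the test-set supports from the reports below; in practice the weights would be derived from the training labels.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Supports from the classification reports: 8708 False, 47981 True.
labels = [False] * 8708 + [True] * 47981
weights = balanced_class_weights(labels)

print(weights)
```

The minority class (False, no docking available) receives the larger weight, so misclassifying it costs more during training.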


<a id="subsection6-1-2"></a>
#### Results

* Removed month and day

```
Accuracy: 0.8513997424544444
Confusion Matrix:
 [[ 1256  7452]
 [  972 47009]]
Classification Report:
               precision    recall  f1-score   support

       False       0.56      0.14      0.23      8708
        True       0.86      0.98      0.92     47981

    accuracy                           0.85     56689
   macro avg       0.71      0.56      0.57     56689
weighted avg       0.82      0.85      0.81     56689

Precision:  0.8631681386680377
Recall:  0.9797419812008921
F1 Score:  0.917768102926534
ROC AUC:  0.5619885836183606
```

* Adding month

```
Accuracy: 0.8555980878124504
Confusion Matrix:
 [[ 1470  7238]
 [  948 47033]]
Classification Report:
               precision    recall  f1-score   support

       False       0.61      0.17      0.26      8708
        True       0.87      0.98      0.92     47981

    accuracy                           0.86     56689
   macro avg       0.74      0.57      0.59     56689
weighted avg       0.83      0.86      0.82     56689

Precision:  0.8666322713788212
Recall:  0.9802421791959317
F1 Score:  0.9199428862027149
ROC AUC:  0.5745262342924996
```
* Month and day
```
Accuracy: 0.8624600892589391
Confusion Matrix:
 [[ 1793  6915]
 [  882 47099]]
Classification Report:
               precision    recall  f1-score   support

       False       0.67      0.21      0.32      8708
        True       0.87      0.98      0.92     47981

    accuracy                           0.86     56689
   macro avg       0.77      0.59      0.62     56689
weighted avg       0.84      0.86      0.83     56689

Precision:  0.871977635427852
Recall:  0.9816177236822909
F1 Score:  0.923555076229227
ROC AUC:  0.5937601709821653
```
* Month and day as numerical
```
Accuracy: 0.8609430400959622
Confusion Matrix:
 [[ 1895  6813]
 [ 1070 46911]]
Classification Report:
               precision    recall  f1-score   support

       False       0.64      0.22      0.32      8708
        True       0.87      0.98      0.92     47981

    accuracy                           0.86     56689
   macro avg       0.76      0.60      0.62     56689
weighted avg       0.84      0.86      0.83     56689

Precision:  0.8731851686397141
Recall:  0.9776995060544799
F1 Score:  0.9224915195909739
ROC AUC:  0.5976577456776763
```
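
The summary figures in these reports can be re-derived from the confusion matrices alone. The sketch below does this for the first run, assuming the usual layout with rows as actual classes and columns as predicted classes (the row sums match the reported supports). The function name `metrics_from_confusion` is hypothetical.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Derive summary metrics from a 2x2 confusion matrix (positive class = True)."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # recall for True
    recall_false = tn / (tn + fp)      # recall for False (specificity)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "recall_false": recall_false,
        "balanced_accuracy": (recall + recall_false) / 2,
    }


# First run ("Removed month and day"): [[1256, 7452], [972, 47009]]
first_run = metrics_from_confusion(tn=1256, fp=7452, fn=972, tp=47009)

for name, value in first_run.items():
    print(f"{name}: {value:.4f}")
```

The balanced accuracy computed this way coincides with the reported ROC AUC value for the run. That suggests the AUC was computed from hard class predictions rather than from predicted probabilities, in which case it carries no information beyond the two per-class recalls.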

<a id="subsection6-2"></a>
### Random Forest Regressor with percentage Target