<a href="https://colab.research.google.com/github/RafaelNovais/MasterAI/blob/master/DeepLearningMindMap.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#LOGISTIC REGRESSION
import numpy as np

class LogisticRegression:
    def __init__(self, learning_rate=0.01, num_epochs=100, batch_size=1):
        """
        Logistic Regression model using Stochastic Gradient Descent.

        Parameters:
        - learning_rate: Learning rate for gradient descent.
        - num_epochs: Number of epochs to train.
        - batch_size: Number of samples per batch for SGD.
        """
        self.learning_rate = learning_rate
        self.num_epochs = num_epochs
        self.batch_size = batch_size
        self.weights = None
        self.bias = None

    def sigmoid(self, z):
        """Apply sigmoid function (z is clipped so np.exp cannot overflow)."""
        z = np.clip(z, -500, 500)
        return 1 / (1 + np.exp(-z))

    def forward_pass(self, X):
        """Compute the linear combination and apply sigmoid."""
        return self.sigmoid(np.dot(X, self.weights) + self.bias)

    def compute_loss(self, y, y_pred):
        """Compute binary cross-entropy loss."""
        epsilon = 1e-15  # Prevent log(0)
        y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
        return -np.mean(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))

    def fit(self, X, y):
        """
        Train the logistic regression model using SGD.

        Parameters:
        - X: Feature matrix of shape (n_samples, n_features).
        - y: Labels vector of shape (n_samples,).
        """
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0

        for epoch in range(self.num_epochs):
            indices = np.arange(n_samples)
            np.random.shuffle(indices)
            X = X[indices]
            y = y[indices]

            for start in range(0, n_samples, self.batch_size):
                end = start + self.batch_size
                X_batch = X[start:end]
                y_batch = y[start:end]

                y_pred = self.forward_pass(X_batch)

                dw = np.dot(X_batch.T, (y_pred - y_batch)) / len(y_batch)
                db = np.sum(y_pred - y_batch) / len(y_batch)

                self.weights -= self.learning_rate * dw
                self.bias -= self.learning_rate * db

            y_pred_epoch = self.forward_pass(X)
            loss = self.compute_loss(y, y_pred_epoch)
            print(f"Epoch {epoch + 1}/{self.num_epochs}, Loss: {loss:.4f}")

    def predict(self, X):
        """
        Predict binary labels for input data.

        Parameters:
        - X: Feature matrix of shape (n_samples, n_features).

        Returns:
        - Binary predictions (0 or 1).
        """
        y_pred = self.forward_pass(X)
        return (y_pred >= 0.5).astype(int)

# Example usage
if __name__ == "__main__":
    # Generate dummy data
    np.random.seed(42)
    X = np.random.randn(100, 2)
    y = (np.dot(X, [1.5, -2.0]) + 0.5 > 0).astype(int)

    # Initialize and train model
    model = LogisticRegression(learning_rate=0.1, num_epochs=20, batch_size=10)
    model.fit(X, y)

    # Predict on new data
    X_test = np.array([[0.5, -1.0], [-1.5, 2.0]])
    predictions = model.predict(X_test)
    print("Predictions:", predictions)


Epoch 1/20, Loss: 0.5758
Epoch 2/20, Loss: 0.4979
Epoch 3/20, Loss: 0.4441
Epoch 4/20, Loss: 0.4048
Epoch 5/20, Loss: 0.3747
Epoch 6/20, Loss: 0.3506
Epoch 7/20, Loss: 0.3310
Epoch 8/20, Loss: 0.3145
Epoch 9/20, Loss: 0.3006
Epoch 10/20, Loss: 0.2885
Epoch 11/20, Loss: 0.2779
Epoch 12/20, Loss: 0.2686
Epoch 13/20, Loss: 0.2602
Epoch 14/20, Loss: 0.2526
Epoch 15/20, Loss: 0.2458
Epoch 16/20, Loss: 0.2396
Epoch 17/20, Loss: 0.2339
Epoch 18/20, Loss: 0.2286
Epoch 19/20, Loss: 0.2238
Epoch 20/20, Loss: 0.2192
Predictions: [1 0]


# Classic ML

1 - Classification: predict a class/category

2 - Regression: predict a number/specific value

3 - Clustering: create groups based on similarities or differences in the details of the data

4 - Co-training: combines clustering with classification, so you can group based on a specific object or detail

5 - Relationship Discovery: find associations, such as chips and ketchup

6 - Reinforcement Learning: positive reinforcement (rewards) and negative reinforcement (punishment); the agent needs to explore the environment



**Ensembles: Basic Idea** Group/combine several processing runs or models to diversify: process the data several times and aggregate the results. "Many are smarter than a few." Using your data, construct multiple classifiers, then combine their decisions to arrive at a final decision

* Bagging - Bootstrap Aggregation: decreases the variance by generating multiple data sets of the same size as the original data set (sampling with replacement). Models are trained independently/in parallel and then combined
  * Simulates the existence of multiple data sets
  * Bagging works by reducing the variance through averaging/voting over multiple classifiers created with different training subsets
  * Works well if the classification algorithm is unstable

* Boosting – Uses subsets of the original data set to produce averagely performing models weighted towards “harder” problems. It then “boosts” the results through a vote. A sequential process: the data is reprocessed N times until the final classification
  * Also uses voting/averaging
  * Weights models according to performance
  * Boosting generates a series of classifiers
  * Their output is combined to produce a weighted vote

* Random Forests
  * CART is a Decision Tree classifier
  * Create a ‘forest’ of trees as an ensemble
  * Works like bagging, except – Decision tree is base learner

* Stacking – Combines the results from multiple heterogeneous classifiers to improve classification. A voting scheme is not applied; instead, a linear algorithm is applied to make the classification. The heterogeneous base models are run in parallel and then combined by a meta-model
  * Stacking, a.k.a. Stacked Generalisation, is a family of methods for heterogeneous ensembles: classifiers are built using different classification algorithms
  * Because of their different characteristics, stacking seeks to combine them in a way that is more sophisticated than a simple vote

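The bagging idea above can be sketched with NumPy alone. This is only an illustration under stated assumptions: the decision-stump base learner, the helper names (`fit_stump`, `predict_stump`, `bagging_fit`, `bagging_predict`) and the toy data are all made up for the example and are not part of the original notes.

```python
import numpy as np

def fit_stump(X, y):
    """Fit a one-feature threshold classifier (hypothetical base learner)."""
    best = (0, 0.0, 1, np.inf)  # (feature, threshold, polarity, error)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for polarity in (1, -1):
                pred = (polarity * (X[:, j] - t) >= 0).astype(int)
                err = np.mean(pred != y)
                if err < best[3]:
                    best = (j, t, polarity, err)
    return best[:3]

def predict_stump(stump, X):
    j, t, polarity = stump
    return (polarity * (X[:, j] - t) >= 0).astype(int)

def bagging_fit(X, y, n_models=25, seed=0):
    """Train each model on a bootstrap sample of the same size as the data."""
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)  # sample with replacement
        models.append(fit_stump(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Combine the independent models by majority vote."""
    votes = np.stack([predict_stump(m, X) for m in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)

# Toy data (assumed): two features, linearly separable labels
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.3 * X[:, 1] > 0).astype(int)

models = bagging_fit(X, y)
accuracy = np.mean(bagging_predict(models, X) == y)
print(f"Training accuracy of the bagged stumps: {accuracy:.2f}")
```

Each stump sees a different bootstrap sample, so the models can be trained in parallel; only the final vote combines them.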


# Neural Networks

Highly parallel computation

  – Weights: positive or negative

  – Activation: 0/1, soft threshold; nonlinear; differentiable. Commonly: sigmoid function

  – Output: "squashed" linear function of input

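A single neuron as described above, as a minimal sketch. The input, weight, and bias values are arbitrary numbers chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    """Soft threshold: nonlinear, differentiable, output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])   # inputs (assumed values)
w = np.array([0.8, -0.4, 0.1])   # weights can be positive or negative
b = -0.2                         # bias

z = np.dot(w, x) + b             # linear function of the input
a = sigmoid(z)                   # "squashed" output
print(f"z = {z:.2f}, activation = {a:.4f}")
```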

**Fully Connected Feed-Forward Neural Network**

* Also known as a Multi-Layer Perceptron
* Simplest architecture, widely used
* Neurons connected in layers
    – Family of functions parameterised by weights
    – No internal state
* High-dimensional non-linear interpolation
* Network is a distributed model of the data

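A forward pass through a small fully connected network can be sketched as below. The layer sizes, the random initialisation, and the helper names `init_mlp` and `forward` are assumptions for the example; sigmoid is used in every layer purely to keep the sketch short.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_mlp(layer_sizes, seed=0):
    """One (weights, bias) pair per pair of adjacent layers."""
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.normal(scale=0.5, size=(n_in, n_out))
        b = np.zeros(n_out)
        params.append((W, b))
    return params

def forward(params, X):
    """The network is a function of X parameterised by the weights."""
    a = X
    for W, b in params:
        a = sigmoid(a @ W + b)
    return a

params = init_mlp([3, 4, 1])          # 3 inputs, 4 hidden units, 1 output
X = np.random.default_rng(1).normal(size=(5, 3))
out = forward(params, X)
print(out.shape)
```

Calling `forward` twice with the same `params` and `X` gives the same result, which is what "no internal state" means in practice.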


# Gradient Descent with Backpropagation

* Forward propagation
* Backpropagation Step
  * Output Layer
  * Hidden Layers
* Gradient Descent Update Step

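The steps listed above, written out for a network with one hidden layer. This is a sketch under assumptions: the XOR data, the 4 hidden units, sigmoid activations, the average log loss, and the learning rate are all choices made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, y_hat, eps=1e-15):
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Toy data (assumed): XOR
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))
lr = 0.5
n = len(X)
losses = []

for step in range(5000):
    # Forward propagation
    a1 = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(a1 @ W2 + b2)
    losses.append(log_loss(y, y_hat))

    # Backpropagation, output layer: sigmoid + log loss gives (y_hat - y)
    delta2 = (y_hat - y) / n
    dW2 = a1.T @ delta2
    db2 = delta2.sum(axis=0, keepdims=True)

    # Backpropagation, hidden layer: chain rule through the sigmoid
    delta1 = (delta2 @ W2.T) * a1 * (1 - a1)
    dW1 = X.T @ delta1
    db1 = delta1.sum(axis=0, keepdims=True)

    # Gradient descent update step
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

print(f"Loss: first step {losses[0]:.4f}, last step {losses[-1]:.4f}")
```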

# Common Activation Functions

* Logistic
* Tanh
* ReLU
* Leaky ReLU

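The four activation functions listed above, as a quick NumPy sketch. The leaky slope `alpha=0.01` is a commonly used default, assumed here rather than taken from the notes.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))       # output in (0, 1)

def tanh(z):
    return np.tanh(z)                     # output in (-1, 1)

def relu(z):
    return np.maximum(0.0, z)             # zero for negative inputs

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)  # small slope for negative inputs

z = np.array([-2.0, 0.0, 2.0])
for fn in (logistic, tanh, relu, leaky_relu):
    print(fn.__name__, fn(z))
```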

* **For classification tasks, we will use the same average log loss cost function that we saw last week for logistic regression**


# Deep Neural Networks

* NNs with more than a small number of hidden layers (several layers of processing)
  * Generally large in scale: many input nodes; many hidden layers; many nodes per hidden layer; large amounts of training data


**High-Level Perspectives**
* Practical perspective:
  * Deep architectures achieve many of the same things as shallow ones, but more efficiently, particularly for perception tasks (vision/sound)
  * Basic training techniques were not effective, so we needed to find ways of getting it to work
* Connectionist programming perspective:
  * Deep learning provides an integrated approach to feature engineering and learning, all within a connectionist architecture
  * Nodes in early layers can be considered functions/subroutines that are re-used in later ones
  * Convolutional NNs extend this idea further

* Structured Data
  * typically organised in a table
* Unstructured Data
  * Photos, Movies, Audio, Documents
* Multi-Class Classification - Represent classes as 3 or more outputs
  * The Softmax function replaces the standard Sigmoid function used in binary classification. It rescales the z values so that the ŷ values sum to 1, as required for a probability distribution.
  
* Methods to Avoid Overfitting
  * Data Augmentation - The best way to improve the generalisation of an ML model is to train it with more data
  * Early Stopping - In the early stages of NN training, when it is far from converging, overfitting is never an issue, but it can become an issue as training proceeds, particularly if the network is complex relative to the dataset size
  * L2 regularisation - The idea behind regularisation is that we add an extra penalty term to the cost function to penalise more complex networks. Python: numpy.linalg.norm

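For the multi-class bullet above, a minimal softmax sketch. Subtracting the row maximum before exponentiating is a standard numerical-stability trick that does not change the result; the example z values are made up.

```python
import numpy as np

def softmax(z):
    """Rescale each row of z so the outputs are positive and sum to 1."""
    z = z - np.max(z, axis=-1, keepdims=True)  # avoids overflow in exp
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

z = np.array([[2.0, 1.0, 0.1],    # scores for 3 classes, sample 1
              [0.0, 0.0, 0.0]])   # equal scores, sample 2
y_hat = softmax(z)
print(y_hat)
print(y_hat.sum(axis=1))
```

The predicted class is the index of the largest output, e.g. `np.argmax(y_hat, axis=1)`.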

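The L2 penalty mentioned above can be computed with `numpy.linalg.norm`, which returns the Frobenius norm for a 2-D array by default. The scaling `lam / (2 * n_samples)` is one common convention, assumed here; the weight matrices and the data loss value are hypothetical.

```python
import numpy as np

def l2_penalty(weight_matrices, lam, n_samples):
    """lam / (2 n) times the sum of squared Frobenius norms of the weights."""
    squared_norms = sum(np.linalg.norm(W) ** 2 for W in weight_matrices)
    return lam / (2 * n_samples) * squared_norms

W1 = np.array([[1.0, -2.0], [0.5, 0.0]])   # hypothetical layer-1 weights
W2 = np.array([[3.0], [-1.0]])             # hypothetical layer-2 weights
data_loss = 0.30                           # hypothetical unregularised cost

penalty = l2_penalty([W1, W2], lam=0.1, n_samples=100)
total_cost = data_loss + penalty
print(f"penalty = {penalty:.6f}, total cost = {total_cost:.6f}")
```

With this convention, each weight gradient gains an extra `(lam / n_samples) * W` term, which shrinks large weights at every update.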
# Training Algorithms
* Mini-Batch Gradient Descent
  * Divide full dataset into mini batches
  * Loop over mini-batches, and do one iteration of the training algorithm on 1 mini-batch


* Backprop with Momentum
  * Each parameter should be able to change at a rate appropriate to itself, rather than there being one fixed learning rate
  * The previous changes in parameter values should influence the current direction of updates
  * To achieve this, for each parameter, compute an exponentially weighted moving average of previous gradients, and use that to update the parameter

* RMSprop (Root Mean Square propagation)
  * Another adaptive learning rate algorithm, proposed by Geoff Hinton, and strongly related to an earlier one called AdaGrad
  * For each parameter, keep a moving average of its squared gradient
  * When updating, divide the current gradient by the square root of the average squared gradient

* Adam Optimiser = Adaptive Moment Estimation
  * Essentially, combine the ideas from Momentum and RMSprop: consider both velocity and acceleration terms

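The three update rules above (Momentum, RMSprop, Adam) can be compared side by side on a toy problem. This is a sketch, not a reference implementation: the quadratic objective, the hyperparameter values, and the `state` dictionary are assumptions, and the Adam step uses the commonly published form with bias correction, which the notes do not spell out.

```python
import numpy as np

def momentum_step(w, grad, state, lr=0.1, beta=0.9):
    # Exponentially weighted moving average of previous gradients
    state["v"] = beta * state["v"] + (1 - beta) * grad
    return w - lr * state["v"]

def rmsprop_step(w, grad, state, lr=0.01, beta=0.9, eps=1e-8):
    # Moving average of the squared gradient, per parameter
    state["s"] = beta * state["s"] + (1 - beta) * grad ** 2
    return w - lr * grad / (np.sqrt(state["s"]) + eps)

def adam_step(w, grad, state, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    # Momentum-style average and RMSprop-style average combined
    state["t"] += 1
    state["v"] = beta1 * state["v"] + (1 - beta1) * grad
    state["s"] = beta2 * state["s"] + (1 - beta2) * grad ** 2
    v_hat = state["v"] / (1 - beta1 ** state["t"])  # bias correction
    s_hat = state["s"] / (1 - beta2 ** state["t"])
    return w - lr * v_hat / (np.sqrt(s_hat) + eps)

def grad_fn(w):
    """Gradient of the toy objective f(w) = 0.5 * (10 * w0**2 + w1**2)."""
    return np.array([10.0, 1.0]) * w

def new_state():
    return {"v": np.zeros(2), "s": np.zeros(2), "t": 0}

results = {}
for name, step in [("momentum", momentum_step),
                   ("rmsprop", rmsprop_step),
                   ("adam", adam_step)]:
    w, state = np.array([1.0, 1.0]), new_state()
    for _ in range(500):
        w = step(w, grad_fn(w), state)
    results[name] = w
    print(name, w)
```

All three should end close to the minimum at the origin; RMSprop and Adam take steps of roughly the same size in both coordinates even though the gradients differ by a factor of 10.
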
* Convolutional Networks
  * Building on previous idea, introduce a set of shared weights and biases for each field
  * Usually have multiple convolutional layers in a CNN


* Model Re-Use
  * Pre-Training
  * Transfer Learning

* Convolutions ???
* Convolutions on Multi-Channel Inputs ??
* Multiple Feature Detectors

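A naive sketch of the convolution bullets above: one function that handles multi-channel inputs and several feature detectors at once. As in most CNN libraries, the operation is really a cross-correlation (the kernel is not flipped). The function name `conv2d`, the array layout, and the edge-detector kernels are assumptions for the example.

```python
import numpy as np

def conv2d(image, kernels):
    """'Valid' convolution without padding or stride.

    image:   (H, W, C_in)
    kernels: (n_filters, kH, kW, C_in), one kernel per feature detector
    returns: (H - kH + 1, W - kW + 1, n_filters)
    """
    H, W, _ = image.shape
    n_filters, kH, kW, _ = kernels.shape
    out = np.zeros((H - kH + 1, W - kW + 1, n_filters))
    for f in range(n_filters):
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                patch = image[i:i + kH, j:j + kW, :]
                # Same kernel weights at every position, summed over channels
                out[i, j, f] = np.sum(patch * kernels[f])
    return out

# 5x5 single-channel image: dark left part, bright right part
image = np.zeros((5, 5, 1))
image[:, 2:, 0] = 1.0

vertical_edge = np.array([[-1.0, 0.0, 1.0]] * 3)
horizontal_edge = vertical_edge.T
kernels = np.stack([vertical_edge, horizontal_edge])[..., np.newaxis]

feature_maps = conv2d(image, kernels)
print(feature_maps[:, :, 0])  # responds to the vertical edge
print(feature_maps[:, :, 1])  # no horizontal edge in this image
```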

**CNNs Incorporate Important Ideas that can Help Learning**
* Sparse Connectivity
  * Rather than having fully-connected layers, only some units in one layer
  are derived from values of some units in previous layer
  * Reduces number of weights, reducing overfitting risk, computational cost
  and training demands
* Parameter Sharing
  * Rather than learning a separate parameter for
  each connection, learn shared parameters that
  represent specific operations
  * Again, reduces number of weights
* Equivariant Representations
  * If you apply a transformation to input and then put
  it through conv layer, equivalent to putting it through
  conv layer and then applying the transformation
  * For CNNs, true of translation operations specifically,
  but not others such as rotations
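
The equivariance property above can be checked numerically on a 1-D signal. This is a small sketch with made-up values: shifting the input and then convolving gives the same result as convolving and then shifting (away from the borders), while flipping the input (the 1-D analogue of a 180-degree rotation) does not commute with the convolution.

```python
import numpy as np

def conv1d_valid(x, k):
    """1-D 'valid' cross-correlation of signal x with kernel k."""
    n_out = len(x) - len(k) + 1
    return np.array([np.dot(x[i:i + len(k)], k) for i in range(n_out)])

x = np.array([0, 0, 1, 3, 2, 0, 0, 0], dtype=float)  # zero-padded signal
k = np.array([1.0, -1.0, 0.5])                        # arbitrary kernel

out = conv1d_valid(x, k)

# Translation: shift the input one step to the right, then convolve
shifted_x = np.roll(x, 1)
out_of_shifted = conv1d_valid(shifted_x, k)
translation_equivariant = np.allclose(out_of_shifted[1:], out[:-1])

# Flip: reverse the input, then convolve
flip_equivariant = np.allclose(conv1d_valid(x[::-1], k), out[::-1])

print("translation:", translation_equivariant)
print("flip:", flip_equivariant)
```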