# About CNN

* A **convolutional neural network (CNN)** is a type of **artificial neural network** designed for tasks such as image recognition and processing. It's particularly effective in analyzing visual data, thanks to its ability to automatically and adaptively learn spatial hierarchies of features from the input.



* CNNs use a specialized architecture that includes **convolutional layers**, **pooling layers**, and **fully connected layers.** Convolutional layers apply convolution operations to the input data, capturing local patterns and features. Pooling layers then reduce the spatial dimensions of the representation, focusing on the most important information. Fully connected layers connect every neuron in one layer to every neuron in the next layer, enabling high-level reasoning.



* CNNs have been highly successful in tasks like image classification, object detection, and facial recognition, among others. Their architecture is inspired by the visual processing in the human brain, making them well-suited for tasks involving spatial hierarchies of features.

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, MaxPooling2D, Flatten
from tensorflow.keras.preprocessing.image import ImageDataGenerator

* **Sequential()** creates a model as a linear stack of layers, where the output of each layer is the input of the next.

In [2]:
#Initializing the CNN
classifier = Sequential()

# How the CNN works

* The **convolutional layer** in a **Convolutional Neural Network (CNN)** performs the core operation of convolution. **Convolution** is a mathematical operation that combines two functions to produce a third function. In the context of CNNs, this operation is applied to the input data and a set of learnable filters or kernels.

Here's a simplified explanation of what happens in a convolutional layer:

**1. Filter (Kernel):** A small matrix that slides over the input data. Each element in the filter has a weight.

**2. Convolution Operation:** The filter slides over the input data, and at each position, it performs element-wise multiplication with the local region of the input data. The results are summed up to produce a single value. This process is repeated across the entire input to produce an **output feature map.**

**3. Learnable Weights:** The weights in the filter are learnable parameters that the neural network optimizes during training. These weights capture important patterns or features in the input data.

**4. Activation Function:** After convolution, an **activation function (like ReLU - Rectified Linear Unit)** is often applied element-wise to introduce non-linearity and allow the network to learn complex relationships.

* The convolutional layer's key advantage is its ability to automatically learn spatial hierarchies of features. It can capture local patterns in the input data, such as edges, textures, or shapes, and then combine them in deeper layers to recognize more complex patterns and objects.

* In summary, the convolutional layer plays a crucial role in feature extraction and enables CNNs to effectively learn and recognize patterns in images or other grid-structured data.
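
The four steps above can be sketched in plain NumPy. This is a minimal illustration, not how Keras implements `Conv2D`: it assumes a single-channel image, no padding, and a stride of 1, and the image and filter values are made up for the example. Like most deep learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def conv2d_single(image, kernel):
    """Slide `kernel` over a 2D `image` (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            # Element-wise multiply the patch by the filter, then sum to one value.
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

def relu(x):
    return np.maximum(0, x)

# Hypothetical 4x4 image: dark left half, bright right half.
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)

# Hypothetical hand-written 2x2 filter that responds to a dark-to-bright vertical edge.
# In a real CNN these weights are learned during training.
kernel = np.array([[-1, 1],
                   [-1, 1]], dtype=float)

feature_map = relu(conv2d_single(image, kernel))
print(feature_map)
```

The resulting 3x3 feature map is non-zero only in its middle column, which is where the window straddles the edge.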

# Breakdown of parameters in Conv2D

* **Conv2D** - 2D convolution used for processing 2D grid data like images.

* **32** - Number of filters or kernels in the convolutional layer. Each filter detects different features in the input.

* **(3,3)** - Filter size, 3×3. During convolution, each filter slides over the input in 3×3 patches.

* **input_shape=(64,64,3)** - Shape of the input data: images of 64×64 pixels with 3 color channels (RGB).

* **activation="relu"** - The Rectified Linear Unit activation function is applied element-wise to the layer's output to introduce non-linearity. ReLU is commonly used in hidden layers to allow the network to learn complex patterns.
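
As a rough check on what these parameters imply, the output size and the number of trainable weights can be worked out by hand. The helper names below are made up for this sketch, and it assumes the Keras defaults of no padding (`padding="valid"`) and a stride of 1, since the layer in this notebook does not override them.

```python
def conv2d_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial size (height or width) of a convolution output."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

def conv2d_param_count(filters, kernel_size, in_channels):
    """Each filter has kernel_size * kernel_size * in_channels weights plus one bias."""
    return filters * (kernel_size * kernel_size * in_channels + 1)

out_side = conv2d_output_size(64, 3)      # 64x64 input, 3x3 filter
params = conv2d_param_count(32, 3, 3)     # 32 filters over 3 input channels

print(out_side, params)  # 62 896
```

So the first layer turns a (64, 64, 3) image into a (62, 62, 32) stack of feature maps using 896 trainable parameters.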

# Usage of filters

* **Filters (or kernels)** in a convolutional layer play a crucial role in feature extraction. They act as small windows that slide over the input data, performing local operations to detect patterns and features. Here are a few reasons why filters are used in the convolution layer:

**1. Feature Detection:** Filters are designed to detect specific features in the input data, such as edges, textures, or shapes. By sliding these filters over the entire input, the convolutional layer can capture local patterns.

**2. Parameter Sharing:** Filters have learnable weights that are shared across the entire input. This parameter sharing reduces the number of parameters in the model, making it more efficient and reducing the risk of overfitting. The same filter is used at different spatial locations in the input.

**3. Spatial Hierarchies:** Convolutional layers can learn hierarchical representations of features. Lower layers may capture simple features like edges, while deeper layers combine these simple features to recognize more complex patterns or objects. This hierarchical approach mimics how the visual system works in biological organisms.

**4. Translation Invariance:** Because the same filter is applied at every position, a pattern detected in one part of the image is also detected elsewhere; the response in the feature map simply shifts along with the pattern. Combined with pooling, this makes the network largely insensitive to where a feature appears in the input.

**5. Local Connectivity:** Filters operate on local regions of the input, allowing the network to focus on local features and spatial relationships. This local connectivity is especially useful for grid-structured data like images.

* In summary, filters in the convolutional layer enable the neural network to automatically learn and extract relevant features from the input data. This process is essential for the success of Convolutional Neural Networks (CNNs) in tasks such as image recognition, where understanding local patterns is crucial for identifying objects and patterns in images.

# About ReLU

* **ReLU, or Rectified Linear Unit**, is an **activation function** commonly used in neural networks, including Convolutional Neural Networks (CNNs). It introduces **non-linearity to the network by outputting the input directly if it is positive; otherwise, it outputs zero.**

* Mathematically, the ReLU activation function is defined as:
**f(x)=max(0,x)**

* Here's a simple explanation of what it does:
1. **Linear for Positive Values:** If the input x is positive, the function returns x itself. So, for any positive input, ReLU is a linear function.

2. **Zero for Negative Values:** If the input x is zero or negative, the function returns zero. This change of slope at zero is what makes ReLU non-linear overall, which is crucial for enabling the network to learn complex relationships and patterns.

* The main advantages of using ReLU include:
* **Simplicity:** ReLU is computationally efficient and easy to implement.

* **Avoiding Vanishing Gradient Problem:** Unlike some other activation functions (e.g., sigmoid or tanh), ReLU does not saturate for positive inputs, helping to mitigate the vanishing gradient problem during training.

* **Promoting Sparsity:** ReLU sets negative values to zero, which can lead to sparse representations. This sparsity can be beneficial for memory efficiency and generalization.

* However, one drawback of ReLU is the "dying ReLU" problem, where neurons can sometimes become inactive (output zero) and stop learning if they consistently receive negative inputs during training. To address this, variants like Leaky ReLU and Parametric ReLU have been proposed, which allow a small, non-zero gradient for negative inputs, preventing neurons from becoming completely inactive.
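
A short NumPy sketch of ReLU next to Leaky ReLU makes the difference concrete. The slope `alpha=0.01` for negative inputs is an assumed, commonly used value, not something fixed by this notebook.

```python
import numpy as np

def relu(x):
    """f(x) = max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Like ReLU, but negative inputs keep a small slope `alpha` (assumed value)."""
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
relu_out = relu(x)         # negatives are clipped to zero
leaky_out = leaky_relu(x)  # negatives are scaled down instead of zeroed

print(relu_out)
print(leaky_out)
```

With plain ReLU the two negative inputs give exactly zero, so no gradient flows back through them; with Leaky ReLU they give small negative values and therefore still carry a small gradient.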

# Why we have used Conv2D instead of Conv3D

* While the input to a convolutional layer is often described as a 3D image, it's more accurate to say it's a 3D tensor. 

* In the context of Convolutional Neural Networks (CNNs), the input data is indeed three-dimensional, representing an image with height, width, and color channels. The dimensions are typically organized as (height, width, channels). For example, a color image with dimensions 64x64 pixels and three color channels (RGB) would have an input shape of (64, 64, 3).

* However, when we talk about passing this data through a 2D convolutional layer, we're referring to the fact that the filter slides along two spatial dimensions only (height and width). **The third dimension (channels) is not slid over: each filter spans all input channels at once.**

* In other words, each filter has a depth that matches the number of input channels. At every spatial position it multiplies element-wise across all channels and sums everything into a single value, so one filter produces one 2D feature map. A Conv3D layer, by contrast, also slides its filters along a third axis, which suits volumetric or video data rather than single color images.

* So, even though we refer to it as a 2D convolution layer, it's still able to handle the depth of the input data due to its design. 
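
To make the handling of depth concrete, here is a minimal NumPy sketch of a single filter applied to a multi-channel image. It is an illustration under simple assumptions (no padding, stride 1, random values standing in for a real image and learned weights), not the Keras implementation.

```python
import numpy as np

def conv2d_one_filter(image, kernel):
    """image: (H, W, C), kernel: (kh, kw, C). Returns one 2D feature map."""
    kh, kw, kc = kernel.shape
    assert kc == image.shape[2], "filter depth must match the number of input channels"
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):          # slide over height
        for j in range(out_w):      # slide over width
            # The patch covers all channels; the sum collapses them into one number.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw, :] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))     # stand-in for a 64x64 RGB image
kernel = rng.random((3, 3, 3))      # one 3x3 filter spanning all 3 channels

feature_map = conv2d_one_filter(image, kernel)
print(feature_map.shape)  # (62, 62)
```

The filter moves in only two directions, yet the output has no channel axis left: the three channels were combined at every position. A layer with 32 such filters stacks 32 of these maps.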

In [3]:
#Step1 - Convolution
classifier.add(Conv2D(32, (3,3), input_shape = (64,64,3), activation="relu"))

# What is the Pooling layer

* The **pooling layer** is a component commonly used in Convolutional Neural Networks (CNNs) to **downsample the spatial dimensions of the input data,** reducing its size and computational complexity. The pooling operation is applied independently to each depth slice of the input.

* There are two main types of pooling layers: Max Pooling and Average Pooling.

* **Max Pooling:** In max pooling, the output value of a specific region (often a 2x2 or 3x3 window) is the maximum value from that region in the input. It helps retain the most prominent features from the input, focusing on the presence of specific patterns.

* **Average Pooling:** In average pooling, the output value for a specific region is the average of all values in that region in the input. It provides a smoothed version of the input and is less likely to emphasize specific features.

* The pooling layer serves several purposes:
* **Spatial Reduction:** It reduces the spatial dimensions (width and height) of the input, making subsequent layers computationally more efficient.

* **Translation Invariance:** Pooling helps the network become somewhat invariant to small translations in the input, allowing it to recognize features regardless of their precise location.

* **Feature Generalization:** By summarizing information from a local neighborhood, pooling encourages the network to focus on the most relevant and general features.
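
Both pooling types can be sketched in a few lines of NumPy. The function below is a simplified illustration for one 2D feature map with non-overlapping windows (stride equal to the window size, which is also the default for `MaxPooling2D`); the input values are made up.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling over a single 2D feature map."""
    h, w = feature_map.shape
    out_h, out_w = h // size, w // size
    # Drop trailing rows/columns that do not fill a complete window.
    trimmed = feature_map[:out_h * size, :out_w * size]
    # Split into (out_h, size, out_w, size) so each window has its own pair of axes.
    windows = trimmed.reshape(out_h, size, out_w, size)
    if mode == "max":
        return windows.max(axis=(1, 3))
    return windows.mean(axis=(1, 3))

fm = np.array([[1, 3, 2, 0],
               [5, 7, 1, 1],
               [0, 2, 4, 8],
               [6, 2, 0, 4]], dtype=float)

max_pooled = pool2d(fm, mode="max")
avg_pooled = pool2d(fm, mode="avg")

print(max_pooled)  # [[7. 2.] [6. 8.]]
print(avg_pooled)  # [[4.  1. ] [2.5 4. ]]
```

A 4x4 map becomes 2x2 in both cases; max pooling keeps the strongest response in each window, while average pooling smooths it.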

In [4]:
#Step 2 - Pooling
classifier.add(MaxPooling2D(pool_size = (2,2)))

In [6]:
#Adding a second convolution layer
classifier.add(Conv2D(32, (3,3), activation='relu'))
classifier.add(MaxPooling2D(pool_size = (2,2)))

# Flatten Layer

* The **Flatten layer** is a type of layer in neural networks that is used to convert the input data, which could be a multi-dimensional tensor, into a one-dimensional vector. It essentially flattens the input without altering the actual data.

* In the context of Convolutional Neural Networks (CNNs), the Flatten layer is often used after one or more convolutional and pooling layers when transitioning from spatial feature extraction to fully connected layers.

* The **Flatten layer takes the output of the previous layer** (which is a multi-dimensional tensor resulting from the convolutional and pooling operations) and **flattens it into a one-dimensional vector.** This vector is then passed to one or more fully connected layers.

* The Flatten layer is crucial when transitioning from convolutional and pooling layers to fully connected layers because **fully connected layers expect one-dimensional input.** The Flatten operation keeps every learned feature value and only rearranges them into a single vector, preparing the data for the final classification or regression layers of the neural network.
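
As a sketch of what flattening does to the shapes in this notebook: with the layer settings used here and the Keras defaults (no padding, stride 1 for convolutions, 2x2 pooling), the spatial size goes 64 → 62 → 31 → 29 → 14, so the last pooling layer outputs a (14, 14, 32) tensor. The array below is filled with placeholder numbers standing in for real activations.

```python
import numpy as np

# Placeholder for the (14, 14, 32) output of the second pooling layer.
pooled = np.arange(14 * 14 * 32, dtype=float).reshape(14, 14, 32)

# Flatten: same values, same order, one axis.
flat = pooled.reshape(-1)

print(pooled.shape, "->", flat.shape)  # (14, 14, 32) -> (6272,)
```

No value is changed or dropped; reshaping `flat` back to (14, 14, 32) recovers the original tensor.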

In [7]:
#Step 3 - Add flattening
classifier.add(Flatten())

# Dense Layer

* **Dense layers**, also known as **fully connected layers**, are used in neural networks for their ability to learn complex patterns and representations from the input data. These layers connect every neuron from the previous layer to every neuron in the current layer, allowing for the learning of intricate relationships between features. 

**1. Capturing Non-linear Relationships:** A dense layer on its own computes a weighted sum, which is linear. Combined with a non-linear activation function such as ReLU, it enables the model to learn and represent complex, non-linear relationships in the data. This is important for tasks where the input-output mapping is not a simple linear transformation.

**2. Global Information Integration:** Dense layers integrate information from all neurons in the previous layer, providing a global view of the learned features. This is in contrast to convolutional and pooling layers, which focus on local patterns.

**3. Feature Combination:** Dense layers can learn to combine features learned by earlier layers, creating more abstract and high-level representations of the input data. This ability is crucial for tasks like image recognition, natural language processing, and other complex pattern recognition tasks.

**4. Task-Specific Representation:** The features learned by convolutional layers might be spatially invariant, but dense layers allow the network to create task-specific representations. For instance, in image classification, dense layers can learn to combine lower-level features to recognize higher-level patterns relevant to the specific classes.
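
A dense layer's forward pass is a matrix product plus a bias, optionally followed by an activation. The sketch below uses a hypothetical helper `dense_forward` and tiny random inputs for illustration; the final line counts the parameters of a 128-unit dense layer fed by a flattened 14 × 14 × 32 input, which is the size that follows from the layers in this notebook under Keras defaults.

```python
import numpy as np

def dense_forward(x, weights, bias, activation=None):
    """Fully connected layer: every input value feeds every output unit."""
    z = x @ weights + bias            # weighted sum for each unit (linear)
    if activation == "relu":
        return np.maximum(0.0, z)     # the non-linearity comes from the activation
    return z

rng = np.random.default_rng(0)
x = rng.random(6)                          # hypothetical flattened feature vector
weights = rng.standard_normal((6, 4))      # 6 inputs connected to 4 units
bias = np.zeros(4)

linear_out = dense_forward(x, weights, bias)
relu_out = dense_forward(x, weights, bias, activation="relu")

# One weight per (input, unit) pair plus one bias per unit.
dense_params = 14 * 14 * 32 * 128 + 128
print(dense_params)  # 802944
```

Most of the parameters in a small CNN like this one sit in the first dense layer, not in the convolutional layers.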

# Sigmoid Function

* The **sigmoid activation function**, also known as the **logistic function**, is a **common activation function** used in neural networks. It's particularly popular in the **output layer of binary classification models, where the goal is to produce a probability value between 0 and 1.**

* The sigmoid function is defined as:
**σ(x) = 1/(1+e^(−x))**

* Here, **e** is the **base of the natural logarithm**, and **x is the input to the function.** The **output of the sigmoid function is always in the range (0, 1).**

* Key properties of the sigmoid function:

* **S-Shaped Curve:** The sigmoid function produces an S-shaped curve, mapping a wide range of input values to a smaller output range.

* **Output Range:** The output is bounded between 0 and 1, making it suitable for representing probabilities.

* **Smooth Gradient:** The sigmoid function has a smooth gradient, which is beneficial for optimization algorithms during the training of neural networks.

* In the context of neural networks, the sigmoid activation function is often used in the output layer of binary classification models. It takes the output of the model and transforms it into a probability score, where values closer to **1 represent a higher probability of belonging to the positive class, and values closer to 0 represent a higher probability of belonging to the negative class.**

* For other types of problems, different output activations are usually more appropriate, such as softmax for multi-class classification or a linear (identity) output for regression.
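
The sigmoid formula can be written directly in Python. The version below is a small sketch that branches on the sign of `x` so that `math.exp` is never called with a large positive argument, which would overflow.

```python
import math

def sigmoid(x):
    """Logistic function 1 / (1 + e^(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)          # x is negative here, so z is in (0, 1)
    return z / (1.0 + z)

for value in (-5.0, 0.0, 5.0):
    print(value, round(sigmoid(value), 4))
```

An input of 0 maps to exactly 0.5, large negative inputs approach 0, and large positive inputs approach 1, which gives the S-shaped curve described above.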

# Why the second Dense layer has only 1 neuron

* **First Dense Layer:** This layer has 128 neurons (or units) and uses the ReLU (Rectified Linear Unit) activation function. The choice of 128 neurons is somewhat arbitrary and depends on the specific architecture and requirements of your neural network. The ReLU activation introduces non-linearity to the model, allowing it to learn complex patterns.

* **Second Dense Layer:** This layer has 1 neuron and uses the sigmoid activation function. The choice of 1 neuron with a sigmoid activation function in the output layer suggests that the network is designed for binary classification. The sigmoid activation function squashes the output to a range between 0 and 1, making it interpretable as a probability. In binary classification, a threshold can be applied (e.g., 0.5), where values above the threshold are predicted as one class, and values below are predicted as the other class.
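
Turning the single sigmoid output into a class label is a one-line comparison. In this sketch the function name, the example probabilities, and the choice to send a value of exactly 0.5 to class 0 are all assumptions made for illustration.

```python
def predict_labels(probabilities, threshold=0.5):
    """Map sigmoid outputs to class 1 if above the threshold, else class 0."""
    return [1 if p > threshold else 0 for p in probabilities]

probs = [0.08, 0.5, 0.51, 0.97]   # made-up model outputs
labels = predict_labels(probs)
print(labels)  # [0, 0, 1, 1]
```

Raising the threshold makes the model more conservative about predicting the positive class, and lowering it does the opposite.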

In [8]:
#Step 4 - Full connection
classifier.add(Dense(units=128, activation='relu'))
classifier.add(Dense(units=1, activation='sigmoid'))

# Adam Optimizer

* **Adam**, short for **Adaptive Moment Estimation**, is an **optimization algorithm** commonly used for training neural networks. It is an extension of **stochastic gradient descent (SGD)** and incorporates concepts from both **momentum optimization** and **RMSprop (Root Mean Square Propagation).**

* The key features of the Adam optimizer include:

* **Adaptive Learning Rates:** Adam adapts the learning rate for each parameter individually. It uses running estimates of the first moment of the gradients (their mean, as in momentum optimization) and the second moment (their uncentered variance, as in RMSprop) to scale each update.

* **Momentum:** Similar to momentum optimization, Adam keeps an exponentially decaying average of past gradients, which smooths the updates and accelerates progress along consistent directions.

* **Root Mean Square Propagation (RMSprop):** Adam incorporates the concept of RMSprop by maintaining a moving average of the square of gradients. This helps in normalizing the learning rates, especially for parameters with high variance in their gradients.

* **Bias Correction:** Adam performs bias correction during the initial iterations to counteract the fact that the momentum and squared gradient terms are initialized to zero, causing a bias towards zero at the beginning of training.


* The Adam optimizer is known for its efficiency and effectiveness in a wide range of neural network architectures and tasks. It is particularly well-suited for **non-convex optimization** problems like training **deep neural networks.**
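
The pieces listed above fit together in a single update rule. The following is a minimal sketch of one Adam step for a single scalar parameter, not the Keras implementation; the hyperparameter defaults are the commonly cited ones and should be read as assumptions, and the quadratic toy problem is made up for the example.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad         # moving average of gradients (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2    # moving average of squared gradients (RMSprop)
    m_hat = m / (1 - beta1 ** t)               # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Toy problem: minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, m, v = 0.0, 0.0, 0.0
history = []
for t in range(1, 2001):
    grad = 2 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.01)
    history.append(w)

print(history[0], w)
```

Thanks to bias correction, the very first step has a size close to the learning rate (about 0.01 here) regardless of how large the gradient is, and `w` then settles near the minimum at 3.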

# Binary Crossentropy

* **Binary Crossentropy**, often referred to as **log loss**, is a **loss function** commonly used in **binary classification problems.** It measures the difference between the true labels and the predicted probabilities for a binary classification task. The binary crossentropy loss is defined mathematically as follows:

**L(y,ŷ) = −(y⋅log(ŷ)+(1−y)⋅log(1−ŷ))**

* Here:
* **y** is the true label (either 0 or 1).
* **ŷ** is the predicted probability of belonging to class 1 (the positive class).
* **log** is the natural logarithm.

* The binary crossentropy loss penalizes models more when they confidently predict the wrong class. If the true label is 1 (y=1), only the term −y⋅log(ŷ) is non-zero, penalizing the model more the closer the predicted probability (ŷ) is to 0. If the true label is 0 (y=0), only the term −(1−y)⋅log(1−ŷ) is non-zero, penalizing the model more the closer the predicted probability is to 1.
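
The formula can be evaluated for a single example in a few lines. This is a sketch, not the Keras loss: the clipping constant `eps` is an assumption added so that `log(0)` can never occur.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Loss for one example; y_true is 0 or 1, y_pred is the predicted P(class 1)."""
    y_pred = min(max(y_pred, eps), 1 - eps)   # keep the prediction away from exactly 0 or 1
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))

confident_right = binary_crossentropy(1, 0.9)   # true label 1, predicted 0.9
confident_wrong = binary_crossentropy(1, 0.1)   # true label 1, predicted 0.1

print(round(confident_right, 3), round(confident_wrong, 3))  # 0.105 2.303
```

The confidently wrong prediction costs roughly 22 times more than the confidently right one, which is the behavior described above.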

In [9]:
#Compiling the CNN
classifier.compile(optimizer='adam', loss="binary_crossentropy", metrics=['accuracy'])

# ImageDataGenerator

* **ImageDataGenerator** is a utility class provided by the Keras library for **real-time data augmentation during training.** Data augmentation involves applying various transformations to the training images to artificially increase the size of the training dataset and improve the generalization ability of the model.

* **rescale=1./255:** This scales the pixel values of the images by a factor of 1/255. This is a common practice to normalize the pixel values to the range [0, 1].

* **shear_range=0.2:** **Shearing** is a **transformation** that shifts the position of pixels along a certain direction. The shear_range parameter determines the magnitude of the shear.

* **zoom_range=0.2:** This parameter specifies the range for random zooming. A zoom_range of 0.2 means that the image can be zoomed in or out by a factor randomly chosen between [1 - 0.2, 1 + 0.2].

* **horizontal_flip=True:** This enables random horizontal flipping of images. It provides additional variations by horizontally flipping some of the images during training.
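
To give a feel for what two of these options do to the pixel data, here is a rough NumPy sketch of rescaling and random horizontal flipping. It is not how `ImageDataGenerator` is implemented, the helper names are made up, a random array stands in for a real image, and shear and zoom are left out because they need image interpolation.

```python
import numpy as np

def rescale(image, factor=1.0 / 255):
    """Map 8-bit pixel values from [0, 255] to roughly [0, 1]."""
    return image.astype(np.float32) * factor

def random_horizontal_flip(image, rng, p=0.5):
    """Mirror the image left-to-right with probability p."""
    if rng.random() < p:
        return image[:, ::-1, :]   # reverse the width axis
    return image

rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)  # stand-in for a 64x64 RGB image

scaled = rescale(raw)
flipped = raw[:, ::-1, :]                     # what a flip looks like when it happens
augmented = random_horizontal_flip(raw, rng)  # flipped or unchanged, chosen at random

print(scaled.min(), scaled.max(), augmented.shape)
```

Because the flip is random and applied on the fly, the model sees slightly different versions of the same image across epochs without the dataset on disk changing.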

In [10]:
# Training images: rescale pixel values and apply random augmentation
train_datagen = ImageDataGenerator(rescale=1./255, shear_range=0.2, zoom_range=0.2, horizontal_flip=True)

In [12]:
test_datagen = ImageDataGenerator(rescale=1./255)

In [25]:
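training_set = train_datagen.flow_from_directory(
    'C:\\Users\\User\\Desktop\\Images\\training_set',
    target_size=(64, 64),  # resize images to match the Conv2D input_shape
    batch_size=32,
    class_mode='binary',   # one 0/1 label per image, matching the sigmoid output
)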

Found 0 images belonging to 0 classes.
