**Presented by: Reza Saadatyar (2024-2025)**<br/>
**E-mail: Reza.Saadatyar@outlook.com**

**0️⃣ Dataset: Binary Prediction of Poisonous Mushrooms** | [kaggle](https://www.kaggle.com/competitions/playground-series-s4e8/overview) | [Google Drive](https://drive.google.com/file/d/1UYK3t54ee-9Gxa2POmygRI_kdCBkCONi/view?usp=sharing)

**Files:**
* **train.csv** - the training dataset; class is the binary target (either e or p)
* **test.csv** - the test dataset; your objective is to predict target class for each row
* **sample_submission.csv** - a sample submission file in the correct format

![Mushrooms.JPG](attachment:Mushrooms.JPG)

**1️⃣ Load data**<br/>
**2️⃣ Data Cleaning**<br/>
- `Handle Missing Values`
  - Remove rows or columns with too many missing values.
  - Impute missing values using methods like mean, median, mode, or advanced techniques like KNN or predictive modeling.
- `Handle Outliers`
   - Detect outliers using statistical methods (e.g., Z-score, IQR).
     - The Z-score measures how many standard deviations a data point is from the mean. Typically, a Z-score greater than 3 or less than -3 is considered an outlier.
     - The Interquartile Range (IQR) measures the spread of the middle 50% of a dataset and is less influenced by outliers than the range or standard deviation. It is calculated as the difference between the third quartile (Q3) and the first quartile (Q1).

Note: If the data includes features of string type, handle outliers only after encoding categorical variables into numerical format.<br>
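
The cleaning steps above can be sketched with NumPy. This is a minimal illustration, not the notebook's actual cleaning code: the `cap_diameter` values are made up, and the helper names (`impute_median`, `zscore_outliers`, `iqr_outliers`) and the conventional `1.5 * IQR` fence are assumptions.

```python
import numpy as np

def impute_median(values):
    """Replace NaNs with the median of the observed (non-missing) values."""
    values = np.asarray(values, dtype=float)
    filled = values.copy()
    filled[np.isnan(filled)] = np.nanmedian(values)
    return filled

def zscore_outliers(values, threshold=3.0):
    """Flag points whose absolute Z-score exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold

def iqr_outliers(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical numeric column: 20 typical values, one missing entry, one extreme value
cap_diameter = np.array([5.0, 6.0, 7.0, 5.5, 6.5] * 4 + [np.nan, 60.0])

clean = impute_median(cap_diameter)   # fill the missing entry first
z_mask = zscore_outliers(clean)       # boolean mask, True marks an outlier
iqr_mask = iqr_outliers(clean)

print("Z-score outliers:", clean[z_mask])
print("IQR outliers:", clean[iqr_mask])
```

Imputing before outlier detection keeps `NaN` from propagating into the mean, standard deviation, and quartiles. Because one extreme value inflates the standard deviation, the Z-score rule can miss outliers in small samples where the IQR rule still catches them.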

**3️⃣ Encoding Categorical Variables**<br/>
`Label Encoding:` Convert categorical data into numerical format by assigning a distinct integer to each category.
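
As a rough sketch of label encoding, the snippet below maps each distinct category to an integer with `np.unique`. The helper name `label_encode` and the sample values are illustrative assumptions; scikit-learn's `LabelEncoder` is a common alternative.

```python
import numpy as np

def label_encode(values):
    """Map each distinct category to an integer code (categories sorted alphabetically)."""
    categories, codes = np.unique(np.asarray(values), return_inverse=True)
    return codes, categories

# Hypothetical target column with the two classes described above (e or p)
classes = ["e", "p", "p", "e", "p"]
codes, categories = label_encode(classes)

print(codes)              # integer code for each row
print(categories[codes])  # decoding: look the codes up in the category array
```

The integers carry an arbitrary order (here alphabetical), which tree-based models usually tolerate but linear and distance-based models may misread as a ranking.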

**4️⃣ Split data into training, validation, and test sets**<br/>
`Training set:` Used to train the model (e.g., 70-80% of the data).<br/>
`Validation set:` Used to tune hyperparameters, trigger early stopping, and evaluate model performance during training (e.g., 10-15% of the data).<br/>
`Testing set:` Used to evaluate the final model's performance (e.g., 10-15% of the data).<br/>
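
One way to produce such a split is to shuffle the row indices once and slice them. The sketch below assumes a 70/15/15 split and a hypothetical helper `train_val_test_split`; the fractions and seed are illustrative choices.

```python
import numpy as np

def train_val_test_split(n_samples, val_frac=0.15, test_frac=0.15, seed=0):
    """Return disjoint index arrays for the training, validation, and test sets."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)  # shuffle once so the three sets are random
    n_test = int(round(n_samples * test_frac))
    n_val = int(round(n_samples * val_frac))
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = train_val_test_split(100)
print(len(train_idx), len(val_idx), len(test_idx))
```

Index arrays can be applied to features and labels alike (for example `X[train_idx]`, `y[train_idx]`), which keeps rows aligned. For imbalanced classes, a stratified split is usually preferable to this plain random one.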

**5️⃣ Scaling/Normalization:**<br/>
Rescale numerical features to a comparable scale (e.g., a fixed range such as 0 to 1 or -1 to 1, or zero mean and unit variance) using:<br/>
`Min-Max Scaling:` Normalizes or scales the data to a specified range, typically [0, 1] or [a, b]. Best suited for uniformly distributed data with no significant outliers.<br/>
   $X_{\text{scaled}} = \large \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}$<br/>
`Standardization (Z-Score):` Standardizes features by removing the mean and scaling to unit variance. Useful when the data is normally distributed or when the distribution of data varies significantly across features.<br/> 
   $X_{\text{standardized}} = \large \frac{X - \mu}{\sigma}$<br/>
   $\mu$ is the mean and $\sigma$ is the standard deviation of the feature.<br/>
`Mean Normalization:` Similar to standardization, but divides by the range instead of the standard deviation, so the data has a mean of 0 and values between -1 and 1.<br/>
   $X_{\text{normalized}} = \large \frac{X - \mu}{X_{\text{max}} - X_{\text{min}}}$<br/>
   $\mu$ is the mean.
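
The three formulas above translate directly into NumPy. In this sketch the statistics are computed on the training set only and then reused for any other split, to avoid leaking information from validation or test data. The function names and sample values are illustrative, and the code assumes no feature is constant (which would cause a division by zero).

```python
import numpy as np

def fit_stats(x_train):
    """Per-feature statistics, computed on the training set only."""
    return {
        "min": x_train.min(axis=0),
        "max": x_train.max(axis=0),
        "mean": x_train.mean(axis=0),
        "std": x_train.std(axis=0),
    }

def min_max_scale(x, s):
    return (x - s["min"]) / (s["max"] - s["min"])

def standardize(x, s):
    return (x - s["mean"]) / s["std"]

def mean_normalize(x, s):
    return (x - s["mean"]) / (s["max"] - s["min"])

# Two hypothetical features on very different scales
x_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
stats = fit_stats(x_train)

x_minmax = min_max_scale(x_train, stats)   # each column now spans [0, 1]
x_std = standardize(x_train, stats)        # each column has mean 0, std 1
x_mean = mean_normalize(x_train, stats)    # each column has mean 0, range 1
```

After any of the three transforms both columns are on the same scale, even though the raw second feature is 100 times larger than the first.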

**Benefits of Normalization in Machine Learning and Deep Learning**<br/>
`Improves Training Stability:` Models, especially neural networks, are sensitive to the scale of input data. Features with large ranges can dominate the learning process, leading to slower convergence or convergence to suboptimal solutions.<br/>
`Faster Convergence:` Normalization helps optimization algorithms converge faster by scaling features to a similar scale.<br/>
`Prevents Numerical Instability:` Neural networks can experience numerical instability when working with features that have very large or very small values.<br/>
`Required by Algorithms:` Distance-based algorithms (e.g., `k-nearest neighbors`, `SVMs with RBF kernels`) compute distances across all features, so a feature with a large range dominates the result unless the features are on the same scale.<br/>

**6️⃣ Feature Extraction**<br/>
`Factor Analysis (FA):` Reduces dimensionality by identifying hidden factors that explain the correlations among observed variables. It assumes the observed variables are linear combinations of underlying factors plus noise, and extracts the factors that account for the variance shared among the variables.<br/>
`Isometric Feature Mapping (Isomap):` Isomap is a nonlinear dimensionality reduction method that maintains the geometric structure of data in a lower-dimensional space. It builds a graph of nearest neighbors in the high-dimensional space, calculates the shortest paths between pairs of points, and utilizes these distances to map the data into a reduced-dimensional space.<br/>
`Principal component analysis (PCA):` PCA reduces the dimensionality of data while retaining as much variance as possible. It determines new axes, called principal components, that maximize variance in the data; the axes are mutually orthogonal, and the projected features are uncorrelated.<br/>
`Linear discriminant analysis (LDA):` LDA identifies the linear combination of features that optimally distinguishes between classes. It enhances class separation in the reduced space by maximizing the ratio of between-class variance to within-class variance.<br/>
`Singular value decomposition (SVD):` SVD decomposes a matrix into three other matrices to simplify data representation. It factors a matrix into singular vectors and singular values, capturing important structures in the data.<br/>
`Independent component analysis (ICA):` ICA decomposes a multivariate signal into independent, additive components. It assumes that the observed data is a mixture of independent sources and recovers those sources by maximizing their statistical independence.<br/>
`t-distributed Stochastic Neighbor Embedding (t-SNE):` t-SNE is a nonlinear dimensionality reduction method that maps high-dimensional data into two or three dimensions, mainly for visualization. It does so by minimizing the Kullback-Leibler divergence between two probability distributions: one representing pairwise similarities in the high-dimensional space and the other in the low-dimensional space.<br/>
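
PCA and SVD from the list above are closely related: the principal components are the right singular vectors of the mean-centered data matrix. The sketch below shows this with NumPy on synthetic data. The helper `pca_svd` and the generated data are assumptions for illustration, not the notebook's feature-extraction code.

```python
import numpy as np

def pca_svd(x, n_components):
    """Project x onto its top principal components using the SVD of the centered data."""
    x_centered = x - x.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value
    u, s, vt = np.linalg.svd(x_centered, full_matrices=False)
    components = vt[:n_components]
    projected = x_centered @ components.T
    explained_variance = (s ** 2) / (x.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    return projected, components, ratio[:n_components]

# Synthetic data: the second feature is almost a multiple of the first, the third is small noise
rng = np.random.default_rng(0)
t = rng.normal(size=200)
x = np.column_stack([
    t,
    2 * t + 0.01 * rng.normal(size=200),
    0.01 * rng.normal(size=200),
])

projected, components, ratio = pca_svd(x, n_components=2)
print(projected.shape)  # (200, 2)
print(ratio)            # share of total variance captured by each kept component
```

Because two of the three features are nearly collinear, the first component captures almost all of the variance, which is the situation where PCA reduces dimensionality with little information loss.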

<font color='#FF000e' size="4.8" face="Arial"><b>Import modules</b></font>

<font color=#0423d1 size="4.8" face="Arial"><b>Extracting the files and their respective paths</b></font>

<font color=#24f508 size="4.8" face="Arial"><b>Load data</b></font>

<font color=#f706bb size="4.8" face="Arial"><b>Data cleaning (Missing data)</b></font>

<font color=#0afcf0 size="4.8" face="Arial"><b>Encoding Categorical Variables</b></font>

<font color= #f805af size="4.95" face="Arial"><b>Data cleaning (Outliers data)</b></font>

<font color= #faf608 size="4.5" face="Arial"><b>Split data</b></font>

<font color=#f08c09 size="4.5"  face="Arial"><b>Normalization</b></font>

<font color=#d0ec54 size="4.5"  face="Arial"><b>Feature extraction</b></font>

In [None]:
# PyTorch provides torch.nn.Module and torch.nn.Linear
import torch

# Define the model with specified in_features and out_features
class LinearRegressionModel(torch.nn.Module):
    def __init__(self, in_features, out_features):
        super(LinearRegressionModel, self).__init__()
        self.linear = torch.nn.Linear(in_features, out_features)  # specify input and output dimensions

    def forward(self, x):
        # Apply the linear layer: y = x @ weight.T + bias
        return self.linear(x)

# Example usage
in_features = 1
out_features = 1
model = LinearRegressionModel(in_features, out_features)
model.linear  # inspect the underlying torch.nn.Linear layer

In [19]:
# Define the model
class LinearRegressionModel(torch.nn.Module):
    def __init__(self):
        super(LinearRegressionModel, self).__init__()
        self.linear = torch.nn.Linear(1, 1)  # input and output are both 1 dimension

    def forward(self, x):
        return self.linear(x)

# Create the model
model = LinearRegressionModel()
model
# # Define the loss function and the optimizer
# criterion = torch.nn.MSELoss()
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# # Sample data
# x_train = torch.tensor([[1.0], [2.0], [3.0], [4.0]])  # inputs do not need gradients
# y_train = torch.tensor([[2.0], [4.0], [6.0], [8.0]])  # targets do not need gradients

# # Training loop
# for epoch in range(100):
#     model.train()
    
#     # Forward pass
#     y_pred = model(x_train)
    
#     # Compute loss
#     loss = criterion(y_pred, y_train)
    
#     # Backward pass and optimization
#     optimizer.zero_grad()
#     loss.backward()
#     optimizer.step()
    
#     if (epoch+1) % 10 == 0:
#         print(f'Epoch [{epoch+1}/100], Loss: {loss.item():.4f}')

**Linear Regression Model:**<br/>
The [`torch.nn.Linear()`](https://pytorch.org/docs/1.9.1/generated/torch.nn.Linear.html) module, also known as a feed-forward layer or fully connected layer, applies a linear transformation to its input: it multiplies the input `x` by the transposed weights matrix `A` and adds a bias `b`.<br/>

$y = x\cdot{A^T} + b$ ⇒ $y = mx+b$<br/>
* `x` is the input to the layer (deep learning is a stack of layers like `torch.nn.Linear()` and others on top of each other).
* `A` is the weights matrix created by the layer. It starts out as random numbers that get adjusted as the neural network learns to better represent patterns in the data. The `T` in $A^T$ indicates that the weights matrix is transposed before the multiplication.
* `b` is the bias term used to slightly offset the weights and inputs.
* `y` is the output (a transformation of the input, in the hope of discovering patterns in it).

`in_features` indicates the number of input features, i.e., the size of the last dimension of the input tensor for the linear layer. It defines the number of columns in the weights matrix `A`, where each input feature multiplies its corresponding column in `A`.<br/>
`out_features` is an integer giving the number of output features, i.e., the size of the last dimension of the output tensor from the linear layer. It defines the number of rows in the weights matrix `A` and the size of the bias vector `b`, with each row of `A` producing one output feature.
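
The shapes described above can be checked directly. The sketch below builds a layer with 3 input features and 2 output features (arbitrary illustrative sizes), then reproduces its output by hand using the layer's own `weight` and `bias` parameters.

```python
import torch

torch.manual_seed(0)  # fixed seed so the random initial weights are reproducible

layer = torch.nn.Linear(in_features=3, out_features=2)
x = torch.randn(4, 3)  # a batch of 4 samples with 3 features each

y = layer(x)                                  # output of the layer
manual = x @ layer.weight.T + layer.bias      # the same computation written out

print(layer.weight.shape)  # torch.Size([2, 3]) -> (out_features, in_features)
print(layer.bias.shape)    # torch.Size([2])    -> (out_features,)
print(y.shape)             # torch.Size([4, 2]) -> (batch, out_features)
print(torch.allclose(y, manual, atol=1e-6))
```

The weight matrix is stored as `(out_features, in_features)`, which is why it is transposed before multiplying the `(batch, in_features)` input.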