## Columbia University
### ECBM E4040 Neural Networks and Deep Learning. Fall 2023.

# Assignment 1, Task 4: Questions (10%)

### Question 1 

What is the effect of increasing the number of layers in an MLP? Based on your experiments in task 2/3, are larger models (with a greater number of hidden layers) preferred to smaller ones? Why or why not?

   Your answer: 
#### Answer:
  #### Q1:
   1. Representational Power: Deeper networks generally have greater representational power. They can capture more complex relationships and hierarchies in the data. For instance, in the context of image recognition, initial layers might capture edges and textures, while deeper layers capture more complex structures and patterns.
   2. Risk of Overfitting: Adding more layers also increases the number of parameters in the model. This can make the network more prone to overfitting, especially if you have limited training data. Overfitting occurs when the model starts to memorize the training data instead of generalizing from it.
   3. Computational Complexity: The training time and computational resources required increase with the addition of more layers. This can be a concern especially for very deep networks or when training on large datasets.
   4. Vanishing & Exploding Gradients: Deeper networks can be harder to train due to the vanishing and exploding gradient problems. These issues can make gradients (used in backpropagation to update the weights) either too small (vanish) or too large (explode), causing training instability. This has been somewhat mitigated by better weight initialization techniques and activation functions, but it remains a consideration.
   5. Need for Regularization: Deeper models might require the use of regularization techniques like dropout, batch normalization, or L2 regularization to prevent overfitting and stabilize training.
   #### Q2: 
   Smaller models are preferred. They are less prone to overfitting and train faster, whereas larger models tend to overfit a small dataset.
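
To illustrate how depth adds parameters, here is a minimal sketch that counts the weights and biases of a fully connected network. The helper name `mlp_param_count` and the layer sizes are hypothetical and are not taken from the assignment code.

```python
def mlp_param_count(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the width of every layer, input first and output last.
    """
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix plus bias vector
    return total


# Hypothetical sizes: 784 inputs, hidden width 100, 10 output classes.
shallow = mlp_param_count([784, 100, 10])           # one hidden layer
deep = mlp_param_count([784, 100, 100, 100, 10])    # three hidden layers
print(shallow, deep)
```

Each extra hidden layer of width 100 adds 100 * 100 + 100 parameters in this setup, which is extra capacity that a small dataset may not be able to constrain.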

### Question 2 

What is the significance of activation functions in deep learning models? Name two activation functions that you can use for a hidden layer **and** two for the output layer. 

   Your answer:
#### Answer:
Activation functions play a crucial role in deep learning models for several reasons:

1. **Non-linearity**: One of the primary purposes of an activation function is to introduce non-linearity into the network, which is essential for learning complex patterns. Without non-linear activation functions, any stack of layers would collapse into a single linear (affine) map, so the network could only model linear relationships regardless of its depth.

2. **Gradient Propagation**: Activation functions and their derivatives (used in backpropagation) facilitate the updating of weights throughout the network. A good activation function will allow gradients to flow smoothly, helping the network to learn effectively.

3. **Bounding Outputs**: Some activation functions squash their outputs into a fixed range, such as (0, 1) for the sigmoid function or (-1, 1) for the tanh function. This can be useful in certain scenarios to ensure that outputs don't reach extremely high or low values.

4. **Sparsity**: Activation functions like the ReLU (Rectified Linear Unit) introduce sparsity. ReLU sets all negative inputs to zero, so a neuron is active (i.e., produces a non-zero output) only when it receives a positive input. Variants such as Leaky ReLU keep a small negative slope instead of an exact zero. This sparsity can make computation more efficient and may help the robustness of the model.
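
The non-linearity point can be checked numerically. The sketch below uses NumPy with random weights (all names and shapes are illustrative assumptions) to show that two stacked linear layers equal one linear layer, and that a ReLU in between breaks this equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))                       # 5 samples, 4 features
W1, b1 = rng.normal(size=(4, 8)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 3)), rng.normal(size=3)

# Two stacked layers with no activation in between.
two_layer = (x @ W1 + b1) @ W2 + b2

# The same map written as a single linear layer.
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = x @ W + b
collapses = np.allclose(two_layer, one_layer)

# Inserting a ReLU between the layers changes the function.
with_relu = np.maximum(0.0, x @ W1 + b1) @ W2 + b2
relu_differs = not np.allclose(with_relu, one_layer)

print(collapses, relu_differs)
```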

For **hidden layers**, two common activation functions are:

1. **ReLU (Rectified Linear Unit)**: It's defined as $$ f(x) = \max(0, x) $$. It's the most widely used activation function due to its simplicity and efficiency. Variants like Leaky ReLU and Parametric ReLU have also been proposed to address some of its limitations.

2. **tanh (Hyperbolic Tangent)**: It's defined as $$ f(x) = \frac{2}{1 + e^{-2x}} - 1 $$. The tanh function outputs values in the range (-1, 1), making it zero-centered and sometimes preferred over the sigmoid function in hidden layers.
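
As a quick sanity check of the two definitions above, the following NumPy sketch implements both and compares the tanh formula against `np.tanh`. The function names `relu` and `tanh_from_formula` are chosen here for illustration.

```python
import numpy as np


def relu(x):
    """Element-wise max(0, x)."""
    return np.maximum(0.0, x)


def tanh_from_formula(x):
    """tanh written as 2 / (1 + exp(-2x)) - 1."""
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0


x = np.linspace(-3.0, 3.0, 7)
formula_matches = np.allclose(tanh_from_formula(x), np.tanh(x))
print(relu(x))
print(formula_matches)
```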

For the **output layer**, the choice of activation function depends on the specific task:

1. **Sigmoid**: For binary classification problems, the sigmoid function is often used as it maps its input to a value between 0 and 1, which can be interpreted as a probability.

2. **Softmax**: For multi-class classification problems, the softmax function is used. It converts the raw output scores (logits) from the network into probability distributions over the classes.
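
Both output activations can be sketched in a few lines of NumPy. The logits below are made-up values for illustration. Subtracting the row maximum inside `softmax` is a common numerical-stability trick and does not change the result.

```python
import numpy as np


def sigmoid(z):
    """Map a real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def softmax(logits):
    """Turn each row of logits into a probability distribution."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)  # avoids overflow in exp
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)


logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
probs = softmax(logits)
print(probs.sum(axis=-1))   # each row sums to 1
print(sigmoid(0.0))         # a score of 0 maps to probability 0.5
```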

Remember, the choice of activation function not only depends on its position (hidden layer vs. output layer) but also the specific problem you're trying to solve. For tasks like regression, you might use a linear (or no) activation function in the output layer.

### Question 3

Assume you have a problem to predict the annual rainfall in a certain region with some historic numeric data that was given to you. Which of these 2 models (Linear Regression vs. Logistic Regression) would you use? How would you modify the problem statement to use the other model? Do they both adopt a linear decision boundary? 

   Your answer: 
#### Answer:
A **Linear Regression** model is more suitable here, because rainfall prediction is a regression problem whose output is a continuous value (the amount of rainfall).

**Logistic Regression**, on the other hand, is used for binary classification problems. It predicts the probability that a given instance belongs to a particular category. The output of logistic regression is a value between 0 and 1, which can be interpreted as the probability of the instance belonging to the positive class.

To use **Logistic Regression** for this problem, you'd need to modify the problem statement to make it a classification task. For instance:

* Predict whether the annual rainfall in a certain region will be above average (1) or below average (0) based on historic numeric data.

Regarding the decision boundary:

* **Linear Regression** does not have a concept of a "decision boundary" in the same way classification models do. It tries to find the best linear relationship (line in case of 2D, plane in case of 3D, and hyperplane in higher dimensions) that fits the data.
* **Logistic Regression** adopts a linear decision boundary, even though the relationship between independent variables and the predicted probability is non-linear (due to the logistic/sigmoid function). What this means is that in a 2D space, the decision boundary would be a straight line. In a 3D space, it would be a plane, and so forth.

In essence, while Logistic Regression outputs probabilities through a non-linear transformation (the logistic function), the decision boundary itself, which separates one class from another, is linear in the feature space.
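
A short derivation makes this explicit. The weight vector $w$, the bias $b$, and the 0.5 threshold are notation and assumptions introduced here for illustration. The predicted probability is

$$ \hat{p}(x) = \sigma(w^\top x + b) = \frac{1}{1 + e^{-(w^\top x + b)}} $$

and predicting the positive class when the probability is at least 0.5 gives

$$ \hat{p}(x) \ge 0.5 \iff e^{-(w^\top x + b)} \le 1 \iff w^\top x + b \ge 0 $$

so the boundary is the hyperplane $w^\top x + b = 0$, which is linear in the features even though $\hat{p}(x)$ is not.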


### Question 4

What will happen if you choose a very small or a very large learning rate? 

   Your answer: 
#### Answer:
The learning rate in optimization algorithms, especially gradient-based methods like gradient descent, is a crucial hyperparameter. Its value affects the convergence and performance of the algorithm. Here's what generally happens with very small or very large learning rates:

1. **Very Small Learning Rate**:

   - **Slow Convergence**: A small learning rate means the model makes tiny adjustments to the weights during each iteration. As a result, it might take a significantly larger number of iterations to converge to the minimum.
   
   - **Risk of Getting Stuck in Local Minima**: If the learning rate is too small, the model might stall on plateaus, near saddle points, or in poor local minima, because no single step is large enough to leave those regions. This matters most in the complex loss landscapes associated with deep neural networks.
   
   - **More Stable Convergence**: While the convergence is slow, it tends to be more stable and less oscillatory. The updates are more refined and precise, reducing the chances of overshooting.
   
2. **Very Large Learning Rate**:

   - **Overshooting**: A large learning rate means the model makes big adjustments to the weights during each iteration. This can lead to overshooting the optimal point in the loss landscape.
   
   - **Divergence**: In extreme cases, instead of converging, the loss might diverge to infinity, meaning the model fails to learn anything meaningful.
   
   - **Oscillation**: Even if it doesn't diverge, the model might oscillate around the minimum, never truly converging.
   
   - **Risk of Skipping Optimal Solutions**: With too large a step, the optimization process might skip over valleys or narrow regions in the loss landscape that represent better solutions.   
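
These regimes can be reproduced on a toy problem. The sketch below runs plain gradient descent on f(x) = x**2, a function chosen here purely for illustration, where each update multiplies x by (1 - 2 * learning_rate).

```python
def gradient_descent(learning_rate, steps=20, x0=1.0):
    """Minimize f(x) = x**2 with plain gradient descent and return the final x."""
    x = x0
    for _ in range(steps):
        grad = 2.0 * x                  # derivative of x**2
        x = x - learning_rate * grad
    return x


small = gradient_descent(0.001)      # tiny steps: x barely moves from 1.0
moderate = gradient_descent(0.1)     # converges toward the minimum at 0
oscillating = gradient_descent(1.0)  # x flips between 1 and -1 forever
large = gradient_descent(1.1)        # |x| grows every step: divergence

print(small, moderate, oscillating, large)
```

On this quadratic, any learning rate above 1.0 diverges and any rate between 0 and 1.0 converges. Real loss surfaces have no such clean threshold, but the qualitative behaviour is the same.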

### Question 5

What is the interpretation of **perplexity** in t-SNE? How did you set this value during the tuning in task3?
    
   Your answer:
#### Q1:
1. **Balance Between Preserving Local and Global Structure**: Perplexity can be roughly understood as a measure that determines how to balance attention between preserving the local and global structures in the data. A low perplexity emphasizes preserving local data structures, whereas a high perplexity gives more importance to the global structures.

2. **Effective Number of Neighbors**: Perplexity can also be viewed as a knob that sets the effective number of neighbors t-SNE considers for each point. For example, a perplexity of 30 would mean that each point roughly considers 30 other points as its neighbors.

3. **Stability and Clusters**: The choice of perplexity can influence the stability and appearance of clusters in the t-SNE visualization. Too low a value might result in isolated clusters that don't truly represent the underlying data distribution, while too high a value can merge clusters that should be distinct.
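
The "effective number of neighbors" reading follows from the definition of perplexity as 2 raised to the Shannon entropy (in bits) of a point's neighbor distribution. The sketch below computes it for two made-up distributions. The helper `perplexity` is written here for illustration and is not part of any t-SNE library.

```python
import numpy as np


def perplexity(p):
    """Perplexity 2**H(p) of a discrete distribution p, with H in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                          # treat 0 * log(0) as 0
    entropy_bits = -np.sum(p * np.log2(p))
    return 2.0 ** entropy_bits


uniform_30 = np.full(30, 1 / 30)          # 30 equally likely neighbors
peaked = np.array([0.97, 0.01, 0.01, 0.01])  # one dominant neighbor

print(perplexity(uniform_30))   # close to 30
print(perplexity(peaked))       # only slightly above 1
```

In t-SNE the bandwidth of each point's Gaussian kernel is tuned so that its neighbor distribution reaches the requested perplexity.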

#### Q2:
1. **Experimentation**: There isn't a one-size-fits-all perplexity value. It's often recommended to try multiple perplexity values to see which one gives a representation that makes the most sense for your data.

2. **Common Range**: Perplexity values between 5 and 50 are common, but this can vary based on the size and nature of the dataset.

3. **Visual Inspection**: After generating t-SNE visualizations with different perplexity values, visually inspect the results. The best value often depends on which visualization best captures meaningful patterns or clusters in the data.

4. **Stability Test**: A good practice is to run t-SNE multiple times (due to its random initialization) for each perplexity value to ensure the results are stable and not a random artifact.