#### **1. Which linear regression algorithm can you use if you have a training set with millions of features?**

- I would say Lasso regression, because during training it tends to eliminate the weights of the least important features (i.e., set them to zero).
- To train this model, you can use an applicable form of *gradient descent*: stochastic, mini-batch or batch (if the training set can fit in memory).
- You **cannot** use the Normal Equation or the SVD approach because their computational complexity grows quickly (more than quadratically) with the number of features.
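
To make the scaling argument concrete, here is a minimal sketch of plain stochastic gradient descent for linear regression. Each update touches a single instance and costs work proportional to the number of features, with no matrix inversion involved. The function name `sgd_linear_regression`, the toy data, and the hyperparameters are my own assumptions for illustration, and the sketch uses no L1 penalty, so it is not Lasso.

```python
import random

def sgd_linear_regression(X, y, lr=0.01, n_epochs=50, seed=0):
    """Fit y ~ w.x + b with plain stochastic gradient descent (illustrative helper)."""
    rng = random.Random(seed)
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    indices = list(range(len(X)))
    for _ in range(n_epochs):
        rng.shuffle(indices)  # visit the instances in a new random order each epoch
        for i in indices:
            # Prediction error for one instance: work is linear in n_features.
            error = sum(wj * xj for wj, xj in zip(w, X[i])) + b - y[i]
            # Gradient step on the squared error of this single instance.
            w = [wj - lr * 2 * error * xj for wj, xj in zip(w, X[i])]
            b -= lr * 2 * error
    return w, b

# Toy, noise-free data generated from y = 3*x0 - 2*x1 + 1 (assumed for the demo).
data_rng = random.Random(42)
X = [[data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)] for _ in range(200)]
y = [3 * x0 - 2 * x1 + 1 for x0, x1 in X]

w, b = sgd_linear_regression(X, y)
print(w, b)  # should be close to the generating coefficients
```

With millions of features the same loop still only needs one instance in memory at a time, which is why the stochastic and mini-batch variants are the usual choice here.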

#### **2. Suppose the features in your training set have very different scales. Which algorithms might suffer from this and how? What can you do about it?**

**My Answer:**
- Any regularized regression model will suffer: ridge regression, lasso regression, and elastic net regression.
- This is because these algorithms are very sensitive to the scale of the input features.
- To mitigate this, we can perform feature scaling (e.g. using Scikit-Learn's `StandardScaler`).

**Aurelien's Answer:**
- If the features in your training set have very different scales, the cost function will have the shape of an elongated bowl, so the Gradient Descent algorithms will take a long time to converge. To solve this you should scale the data before training the model. Note that the Normal Equation or SVD approach will work just fine without scaling. Moreover, regularized models may converge to a suboptimal solution if the features are not scaled: since regularization penalizes large weights, features with smaller values will tend to be ignored compared to features with larger values.
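
The "elongated bowl" effect can be seen in a small experiment. The sketch below runs batch gradient descent on the same toy data twice, once on raw features with very different ranges and once after standardizing them, and counts the updates needed to reach a small MSE. The helper names (`standardize`, `batch_gd_iterations`), the data, the learning rates, and the tolerance are all assumptions chosen for the demo; `standardize` mimics what `StandardScaler` does but is not the Scikit-Learn implementation.

```python
import random
from statistics import mean, pstdev

def standardize(X):
    """Rescale each column to zero mean and unit variance."""
    cols = list(zip(*X))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) for c in cols]
    return [[(v - mu) / s for v, mu, s in zip(row, mus, sigmas)] for row in X]

def batch_gd_iterations(X, y, lr, max_iters=2000, tol=1e-6):
    """Run batch GD on the MSE; return the number of updates done before MSE < tol."""
    m, n = len(X), len(X[0])
    w, b = [0.0] * n, 0.0
    for it in range(max_iters):
        errors = [sum(wj * xj for wj, xj in zip(w, row)) + b - yi
                  for row, yi in zip(X, y)]
        if sum(e * e for e in errors) / m < tol:
            return it
        grad_w = [2 / m * sum(e * row[j] for e, row in zip(errors, X))
                  for j in range(n)]
        grad_b = 2 / m * sum(errors)
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return max_iters  # did not reach the tolerance

# One feature in [0, 1], the other in [0, 1000]; targets are noise-free.
rng = random.Random(0)
X = [[rng.uniform(0, 1), rng.uniform(0, 1000)] for _ in range(50)]
y = [2 * x1 + 0.003 * x2 + 1 for x1, x2 in X]

# The raw run needs a tiny learning rate to stay stable along the large feature.
raw_iters = batch_gd_iterations(X, y, lr=5e-7)
scaled_iters = batch_gd_iterations(standardize(X), y, lr=0.1)
print("raw:", raw_iters, "scaled:", scaled_iters)
```

On the raw features the learning rate is capped by the large-scale feature, so progress along the other directions is extremely slow and the run hits the iteration limit. After scaling, a much larger learning rate is stable and the loop finishes quickly.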

#### **3. Can gradient descent get stuck in a local minimum when training a logistic regression model?**

**My Answer:**
- No. The cost function for logistic regression is convex, meaning that gradient descent is guaranteed to approach the global minimum (provided the learning rate is not too high).

**Aurelien's Answer:**
- Gradient Descent cannot get stuck in a local minimum when training a Logistic Regression model because the cost function is convex. Convex means that if you draw a straight line between any two points on the curve, the line never crosses the curve.
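
A quick sanity check of the convexity claim: if the log loss has a single minimum, batch gradient descent started from very different initial parameters should end up at the same place. The sketch below does this for a one-feature logistic regression. The data (deliberately not linearly separable, so the minimum is finite), the starting points, and the helper `fit_logistic` are assumptions for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, w, b, lr=0.5, n_iters=2000):
    """Batch gradient descent on the mean log loss, starting from (w, b)."""
    m = len(xs)
    for _ in range(n_iters):
        residuals = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(r * x for r, x in zip(residuals, xs)) / m
        grad_b = sum(residuals) / m
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Overlapping classes: x = -1 and x = 1 each appear with both labels.
xs = [-3, -2, -1, 0, 1, 2, 3, -1, 1]
ys = [0, 0, 0, 0, 1, 1, 1, 1, 0]

# Three very different starting points.
solutions = [fit_logistic(xs, ys, w0, b0) for w0, b0 in [(0, 0), (5, -5), (-5, 5)]]
for w, b in solutions:
    print(round(w, 6), round(b, 6))
```

All three runs print the same parameters up to rounding, which is what a convex cost function predicts.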

#### **4. Do all gradient descent algorithms lead to the same model, provided you let them run long enough?**

**My Answer:**
- I would say no because:
    1. Not all cost functions are convex, so gradient descent can get stuck in a local minimum and never find the global minimum.
    2. If the learning rate is too large, the algorithm may overshoot the global minimum and diverge.

**Aurelien's Answer:**
- If the optimization problem is convex (such as Linear Regression or Logistic Regression), and assuming the learning rate is not too high, then all Gradient Descent algorithms will approach the global optimum and end up producing fairly similar models. However, unless you gradually reduce the learning rate, Stochastic GD and Mini-batch GD will never truly converge; instead, they will keep jumping back and forth around the global optimum. This means that even if you let them run for a very long time, these Gradient Descent algorithms will produce slightly different models.

#### **5. Suppose you use batch gradient descent and you plot the validation error at every epoch. If you notice that the validation error consistently goes up, what is likely going on? How can you fix this?**

**My Answer:**
- This means that the model is overfitting the training data, i.e., it generalizes poorly to previously unseen data. To fix this, you need to provide more training data until the validation error reaches the training error.

**Aurelien's Answer:**
- If the validation error consistently goes up after every epoch, then one possibility is that the learning rate is too high and the algorithm is diverging. If the training error also goes up, then this is clearly the problem and you should reduce the learning rate. However, if the training error is not going up, then your model is overfitting the training set and you should stop training.
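
The decision rule in this answer can be written down as a tiny helper that looks at the recent history of both error curves. This is only a sketch: the function names `is_rising` and `diagnose`, the `window` parameter, and the returned strings are hypothetical, and a real check would need to tolerate noise in the curves.

```python
def is_rising(errors, window=3):
    """True if the error increased on each of the last `window` epochs."""
    recent = errors[-(window + 1):]
    return len(recent) == window + 1 and all(a < b for a, b in zip(recent, recent[1:]))

def diagnose(train_errors, val_errors, window=3):
    """Map the two per-epoch error histories to the advice given in the answer above."""
    if not is_rising(val_errors, window):
        return "ok"
    if is_rising(train_errors, window):
        # Both curves go up: the algorithm is diverging.
        return "diverging: reduce the learning rate"
    # Validation error goes up while training error does not: overfitting.
    return "overfitting: stop training"

print(diagnose([1.0, 1.5, 2.2, 3.0], [1.1, 1.7, 2.5, 3.4]))
print(diagnose([1.0, 0.8, 0.6, 0.5], [1.1, 1.2, 1.4, 1.7]))
```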

#### **6. Is it a good idea to stop mini-batch gradient descent immediately when the validation error goes up?**

**My Answer:**
- Since mini-batch gradient descent picks the instances for each step at random, the validation error is not guaranteed to improve at every iteration. Stopping immediately would likely be very premature.

**Aurelien's Answer:**
- Due to their random nature, neither Stochastic Gradient Descent nor Mini-batch Gradient Descent is guaranteed to make progress at every single training iteration. So if you immediately stop training when the validation error goes up, you may stop much too early, before the optimum is reached. A better option is to save the model at regular intervals; then, when it has not improved for a long time (meaning it will probably never beat the record), you can revert to the best saved model.
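
The "save the best model and roll back" strategy can be sketched as a training loop with a patience counter. Everything here is a stand-in: `train_with_early_stopping`, `train_step`, `validation_error`, the dictionary used as a "model", and the hard-coded noisy validation curve are assumptions made up for the demo, not part of any library.

```python
import copy

def train_with_early_stopping(model, train_step, validation_error,
                              patience=3, max_epochs=1000):
    """Train until the validation error has not improved for `patience` epochs,
    then return the best checkpoint seen so far."""
    best_error = float("inf")
    best_model = copy.deepcopy(model)
    epochs_without_improvement = 0
    for _ in range(max_epochs):
        train_step(model)
        error = validation_error(model)
        if error < best_error:
            best_error = error
            best_model = copy.deepcopy(model)  # checkpoint the new record
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no new record for a while: give up and roll back
    return best_model, best_error

# A made-up noisy validation curve: it goes up at epoch 3 but improves again later.
val_curve = [5.0, 4.0, 4.2, 3.0, 3.1, 3.3, 2.0, 2.5, 2.6, 2.7, 2.8, 2.9]

model = {"epoch": 0}

def train_step(m):
    m["epoch"] += 1  # stand-in for one epoch of mini-batch GD

def validation_error(m):
    return val_curve[m["epoch"] - 1]

best_model, best_error = train_with_early_stopping(
    model, train_step, validation_error, patience=3, max_epochs=len(val_curve))
print(best_model, best_error)
```

Stopping at the first increase (epoch 3) would have kept a model with validation error 4.0, while waiting a few epochs and reverting to the best checkpoint recovers the model from epoch 7.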

#### **7. Which gradient descent algorithm (among those discussed) will reach the vicinity of the optimal solution the fastest? Which will actually converge? How can you make the others converge as well?**

**My Answer:**
- Stochastic or mini-batch gradient descent with early stopping will likely reach the vicinity of the optimal solution the fastest.
- Batch gradient descent will actually converge. We can make stochastic and mini-batch GD converge by implementing early stopping (stopping the algorithm after the validation error has been above the minimum for some time, then rolling back the model parameters to the point where the validation error was at a minimum).

**Aurelien's Answer:**
- Stochastic Gradient Descent has the fastest training iteration since it considers only one training instance at a time, so it is generally the first to reach the vicinity of the global optimum (or Mini-batch GD with a very small mini-batch size). However, only Batch Gradient Descent will actually converge, given enough training time. As mentioned, Stochastic GD and Mini-batch GD will bounce around the optimum, unless you gradually reduce the learning rate.
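
To see the "bouncing around the optimum" behaviour and the effect of gradually reducing the learning rate, the sketch below runs SGD twice on a noisy one-parameter regression: once with a constant learning rate and once with a decaying schedule. It then measures how far the weight stays from the closed-form optimum over the last steps. The data, the schedule `0.1 / (1 + 0.01 * t)`, and the helper `sgd_spread` are assumptions picked for the demo, not the book's exact learning schedule.

```python
import random

# Noisy toy data for the one-parameter model y ~ w * x (assumed for the demo).
rng = random.Random(0)
xs = [rng.uniform(-1, 1) for _ in range(100)]
ys = [2 * x + rng.gauss(0, 0.5) for x in xs]

# Closed-form least-squares optimum, used only as a reference point.
w_opt = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def sgd_spread(schedule, n_steps=20000, tail=2000, seed=1):
    """Run SGD and return the mean |w - w_opt| over the last `tail` steps."""
    step_rng = random.Random(seed)
    w, total = 0.0, 0.0
    for t in range(n_steps):
        i = step_rng.randrange(len(xs))           # pick one instance at random
        grad = 2 * (w * xs[i] - ys[i]) * xs[i]    # gradient of its squared error
        w -= schedule(t) * grad
        if t >= n_steps - tail:
            total += abs(w - w_opt)
    return total / tail

constant = sgd_spread(lambda t: 0.1)
decaying = sgd_spread(lambda t: 0.1 / (1 + 0.01 * t))
print(f"constant lr: {constant:.4f}   decaying lr: {decaying:.4f}")
```

With the constant rate the weight keeps jumping around the optimum, while the decaying schedule lets it settle much closer to it.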