# Discussion and Takeaway

## 2.1. Data Manipulation

### 2.1.5. Saving Memory
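To avoid wasting memory, we should write results into existing storage with slice notation `[:]` (or an in-place operator such as `+=`) rather than rebinding the variable!  

For example,  
`A = A + B` (X): allocates new memory for the result and rebinds `A`  
`A[:] = A + B` or `A += B` (O): writes the result into the memory `A` already occupies  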
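
A minimal sketch of how one might check this, assuming PyTorch tensors; the tensor values and the use of `data_ptr()` to compare storage addresses are illustrative choices, not something taken from the notes:

```python
import torch

A = torch.ones(3, 2)
B = torch.full((3, 2), 2.0)

# Rebinding: A + B is written to newly allocated memory, then the name A moves.
ptr_before = A.data_ptr()
A = A + B
rebinding_reused_memory = (A.data_ptr() == ptr_before)

# Slice assignment: the result is copied into the storage A already occupies.
ptr_before = A.data_ptr()
A[:] = A + B
slice_reused_memory = (A.data_ptr() == ptr_before)

# Augmented assignment also updates A in place.
ptr_before = A.data_ptr()
A += B
augmented_reused_memory = (A.data_ptr() == ptr_before)

print(rebinding_reused_memory, slice_reused_memory, augmented_reused_memory)
# False True True
```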

## 2.2. Data Preprocessing  

### 2.2.2. Data Preparation
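We can get rid of NaN values, which can silently corrupt later computation, with a strategy that fits the column type: for a categorical column such as RoofType, one-hot encode it and treat NaN as its own category (giving RoofType_Slate and RoofType_nan); for a numerical column, replace NaN with a numerical value such as the column mean!  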
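
Both strategies can be sketched with pandas. The small table below is a made-up toy example, and the choice of the column mean for imputation is an assumption rather than the only option:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one numerical and one categorical column with gaps.
data = pd.DataFrame({
    "NumRooms": [np.nan, 2.0, 4.0, np.nan],
    "RoofType": [np.nan, np.nan, "Slate", np.nan],
})

# Categorical column: one indicator column per category, plus one for NaN.
inputs = pd.get_dummies(data, dummy_na=True)

# Numerical column: replace NaN with the mean of the observed values.
inputs = inputs.fillna(inputs.mean())

print(inputs)
```

After these two steps the table has the columns NumRooms, RoofType_Slate, and RoofType_nan, and no NaN entries remain.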

## 2.3. Linear Algebra

### 2.3.5. Basic Properties of Tensor Arithmetic  
We should not confuse the `*` operator with matrix multiplication in tensor arithmetic! `*` is the elementwise (Hadamard) product: it multiplies entries at matching positions and returns a tensor of the same shape.

### 2.3.6. Reduction  
Reducing with axis=0 collapses the rows and leaves one value per column; reducing with axis=1 collapses the columns and leaves one value per row!
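
A small sketch of both reductions on a 2x3 matrix (the matrix itself is an arbitrary example):

```python
import torch

A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
# A = [[0., 1., 2.],
#      [3., 4., 5.]]

col_sums = A.sum(axis=0)  # collapses the 2 rows    -> tensor([3., 5., 7.])
row_sums = A.sum(axis=1)  # collapses the 3 columns -> tensor([ 3., 12.])

print(col_sums.shape, row_sums.shape)
```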

### 2.3.8. Dot Product
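The `*` operator on its own is not the dot product in PyTorch! It is elementwise, so the dot product of two vectors is `torch.dot(x, y)`, or equivalently `torch.sum(x * y)`.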

### 2.3.9. Matrix-Vector Products
The `@` operator performs matrix multiplication in PyTorch, covering both matrix-vector and matrix-matrix products!
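
Since `*`, the dot product, and `@` are easy to mix up, here is a side-by-side sketch on small example tensors (the values are arbitrary):

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([4.0, 5.0, 6.0])
A = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

elementwise = x * y          # tensor([ 4., 10., 18.]), same shape as x and y
dot = torch.dot(x, y)        # tensor(32.), a single number
dot_via_sum = (x * y).sum()  # tensor(32.), the same dot product
matvec = A @ x               # tensor([14., 32.]), matrix-vector product
matmat = A @ A.T             # 2x2 result, matrix-matrix product

print(elementwise, dot, matvec, matmat, sep="\n")
```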

### 2.3.11. Norms
Three norms come up most often:
1. torch.norm() on a vector: Euclidean (L2) norm
2. torch.abs().sum(): Manhattan (L1) norm
3. torch.norm() on a matrix: Frobenius norm
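
A quick sketch of the three norms on example tensors chosen so the results are easy to check by hand. Note that newer PyTorch documentation recommends `torch.linalg.norm` over `torch.norm`, although `torch.norm` still works:

```python
import torch

u = torch.tensor([3.0, -4.0])
M = torch.ones(4, 9)

l2 = torch.norm(u)              # sqrt(3^2 + 4^2) = 5
l1 = torch.abs(u).sum()         # |3| + |-4| = 7
frobenius = torch.norm(M)       # sqrt(36 ones squared) = 6

print(l2.item(), l1.item(), frobenius.item())
```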

## 2.5. Automatic Differentiation

### 2.5.1. A Simple Function
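- If y is a scalar function of a vector x, calling y.backward() computes the gradient of y with respect to x and stores it in x.grad.
- PyTorch accumulates gradients rather than overwriting them, so before computing a new gradient we reset the stored one with x.grad.zero_().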
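
The two steps can be sketched as follows; the function y = 2 * dot(x, x) is just a convenient example whose gradient, 4x, is easy to verify:

```python
import torch

x = torch.arange(4.0, requires_grad=True)  # x = [0., 1., 2., 3.]

y = 2 * torch.dot(x, x)       # scalar function of x
y.backward()                  # fills x.grad with dy/dx = 4x
grad_first = x.grad.clone()   # tensor([ 0.,  4.,  8., 12.])

x.grad.zero_()                # reset, otherwise the next gradient is added on top
y = x.sum()
y.backward()
grad_second = x.grad.clone()  # tensor([1., 1., 1., 1.])

print(grad_first, grad_second)
```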

### 2.5.2. Backward for Non-Scalar Variables
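- If y is a vector rather than a scalar, its derivative with respect to a vector x is a matrix, the Jacobian.
- However, we usually do not need the full Jacobian. Instead, we sum up the gradients of each component of y with respect to x, for example with y.sum().backward().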
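
A sketch of the two equivalent ways to get the summed gradient for a vector-valued y, using y = x * x as an example (so the expected result is 2x):

```python
import torch

x = torch.arange(4.0, requires_grad=True)

# Option 1: pass a vector of ones as the `gradient` argument.
# Calling y.backward() with no argument would raise an error for non-scalar y.
y = x * x
y.backward(gradient=torch.ones(len(y)))
grad_via_vector = x.grad.clone()  # tensor([0., 2., 4., 6.])

# Option 2: reduce y to a scalar first, then backpropagate.
x.grad.zero_()
y = x * x
y.sum().backward()
grad_via_sum = x.grad.clone()     # tensor([0., 2., 4., 6.])

print(torch.equal(grad_via_vector, grad_via_sum))
```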

### 2.5.3. Detaching Computation
- If we want to use a value computed from a vector x in other expressions, but treat it as a constant rather than as a function of x, we can use detach(). It gives us a new variable with the same value that is cut off from the computational graph, so no gradient flows through it back to x.
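
A minimal sketch of the effect, assuming y = x * x as the intermediate value. With the detached copy u, the gradient of z = u * x is u itself; without detach() it would be 3x^2:

```python
import torch

x = torch.arange(4.0, requires_grad=True)

y = x * x
u = y.detach()      # same values as y, but treated as a constant
z = u * x

z.sum().backward()  # dz/dx = u, because u carries no dependence on x
print(x.grad)       # tensor([0., 1., 4., 9.])
print(u.requires_grad)  # False
```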

## 3.1. Linear Regression

### 3.1.1. Basics
- Minibatch SGD is widely used in deep learning, but why? Wouldn't a quasi-Newton method perform better?
- Maybe quasi-Newton methods cope worse with local minima than SGD does... (still an open question for me)

### 3.1.2. Vectorization for Speed
- In deep learning, especially in the training phase, we should use vectorized operations rather than for-loops to reduce running time!
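
A rough sketch for comparing the two approaches on an elementwise addition. The vector length is an arbitrary choice, and the measured times depend on the machine, so no timings are quoted here:

```python
import time
import torch

n = 10000
a = torch.ones(n)
b = torch.ones(n)

# For-loop: one Python-level addition per element.
start = time.perf_counter()
c = torch.zeros(n)
for i in range(n):
    c[i] = a[i] + b[i]
loop_seconds = time.perf_counter() - start

# Vectorized: a single call that adds all elements at once.
start = time.perf_counter()
d = a + b
vectorized_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.5f} s, vectorized: {vectorized_seconds:.5f} s")
```

Both versions produce the same result; only the time spent differs.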

## 3.2. Object-Oriented Design for Implementation

### 3.2.1. Utilities
- @add_to_class(Class): We can register a function as a method of a class after the class has been defined, even after instances have been created!
- HyperParameters class: Its save_hyperparameters() method saves all arguments of the `__init__` method as attributes of the object!
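
To make the two utilities concrete, here is a simplified re-implementation in plain Python. It is a sketch of the idea, not the library's actual code; the classes `A` and `B` and the method `do` are hypothetical names used only for the demonstration:

```python
import inspect

def add_to_class(Class):
    """Return a decorator that attaches the decorated function to Class."""
    def wrapper(func):
        setattr(Class, func.__name__, func)
        return func
    return wrapper

class HyperParameters:
    def save_hyperparameters(self, ignore=()):
        """Copy the caller's arguments (the locals of __init__) onto self."""
        frame = inspect.currentframe().f_back
        _, _, _, local_vars = inspect.getargvalues(frame)
        for name, value in local_vars.items():
            if name != "self" and name not in ignore and not name.startswith("_"):
                setattr(self, name, value)

class A:
    def __init__(self):
        self.b = 1

obj = A()  # instance created before the method exists

@add_to_class(A)
def do(self):
    return self.b + 1

class B(HyperParameters):
    def __init__(self, a, b, c):
        self.save_hyperparameters(ignore=["c"])

hp = B(a=1, b=2, c=3)

print(obj.do())                      # 2
print(hp.a, hp.b, hasattr(hp, "c"))  # 1 2 False
```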

## 4.1. Softmax Regression

### 4.1.1. Classification
- Regression cannot deal with every problem!
- In classification problems, we focus on "which category?" questions. There are also cases where more than one label may be true at once (multi-label classification).
- A plain linear model is problematic here: nothing guarantees that its outputs are nonnegative or sum to 1, and an output may even exceed 1...
- To solve this, we map the outputs to values between 0 and 1 that always sum to 1. The function that does this is the softmax function!
- To improve computational efficiency as well, we can vectorize the computation over a minibatch of data!
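
For reference, the normalization can be written out as follows. The symbols are assumptions introduced here for illustration: $\mathbf{o}$ denotes the raw outputs of the linear model for one example, and $\hat{\mathbf{y}}$ the resulting predicted probabilities.

$$
\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{o}), \qquad \hat{y}_i = \frac{\exp(o_i)}{\sum_j \exp(o_j)}
$$

Each $\hat{y}_i$ is positive because the exponential is positive, and the shared denominator makes the entries sum to 1.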

### 4.1.2. Loss Function
In softmax regression, we define the loss function as the negative log-likelihood, which turns out to be the cross-entropy loss!
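
The connection can be sketched in two lines. The notation is an assumption for illustration: $n$ examples, $q$ classes, one-hot labels $\mathbf{y}$, and predicted probabilities $\hat{\mathbf{y}}$.

$$
-\log P(\mathbf{Y} \mid \mathbf{X}) = \sum_{i=1}^{n} -\log P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}) = \sum_{i=1}^{n} l(\mathbf{y}^{(i)}, \hat{\mathbf{y}}^{(i)})
$$

$$
l(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{j=1}^{q} y_j \log \hat{y}_j
$$

Because $\mathbf{y}$ is one-hot, only the term for the true class survives, so minimizing $l$ maximizes the predicted probability of the correct label.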

## 4.2. The Image Classification Dataset

### 4.2.2. Reading a Minibatch
We can use the built-in data iterator rather than creating our own, and read one minibatch with next(iter(data.train_dataloader()))!
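
The same pattern can be sketched with PyTorch's built-in `DataLoader`. The synthetic tensors below are a stand-in for a real image dataset, and the shapes and batch size are assumptions chosen for illustration:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in: 256 blank 1x28x28 "images" with labels 0..9.
images = torch.zeros(256, 1, 28, 28)
labels = torch.arange(256) % 10

loader = DataLoader(TensorDataset(images, labels), batch_size=64, shuffle=True)

# iter() creates the iterator; next() pulls the first minibatch from it.
X, y = next(iter(loader))
print(X.shape, X.dtype, y.shape, y.dtype)
```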

## 4.3. The Base Classification Model

### 4.3.2. Accuracy
When we define the accuracy() function, which measures the fraction of predictions that match the labels, we should convert the predicted classes to the same data type as y before comparing them, because the == operator is sensitive to data types!
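
A standalone sketch of such a function, written here as a plain function rather than a method; the example predictions and labels are made up:

```python
import torch

def accuracy(Y_hat, Y):
    """Fraction of rows in Y_hat whose highest-scoring class equals the label."""
    Y_hat = Y_hat.reshape((-1, Y_hat.shape[-1]))
    preds = Y_hat.argmax(dim=1).type(Y.dtype)  # match the dtype of the labels
    compare = (preds == Y.reshape(-1)).type(torch.float32)
    return compare.mean()

Y_hat = torch.tensor([[0.1, 0.3, 0.6],
                      [0.3, 0.2, 0.5]])
Y = torch.tensor([0, 2])

acc = accuracy(Y_hat, Y)
print(acc.item())  # 0.5: the second prediction is right, the first is wrong
```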

## 4.4. Softmax Regression Implementation from Scratch

### 4.4.1. The Softmax
If we implement the softmax function from scratch, we must keep in mind that it is numerically fragile: exponentiating a very large entry of X overflows, and exponentiating very negative entries underflows to 0!
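
The sketch below shows the overflow on an extreme input and one common remedy, subtracting the row-wise maximum before exponentiating. That remedy is a swapped-in technique for illustration, not part of the from-scratch version discussed in these notes; the function names are hypothetical:

```python
import torch

def naive_softmax(X):
    X_exp = torch.exp(X)
    return X_exp / X_exp.sum(dim=1, keepdim=True)

def stable_softmax(X):
    # Shifting by the row maximum leaves the result unchanged mathematically,
    # but keeps every exponent at or below 0, so exp() cannot overflow.
    shifted = X - X.max(dim=1, keepdim=True).values
    X_exp = torch.exp(shifted)
    return X_exp / X_exp.sum(dim=1, keepdim=True)

X = torch.tensor([[1.0, 2.0, 3.0],
                  [1000.0, 1001.0, 1002.0]])

naive = naive_softmax(X)    # second row: exp() gives inf, and inf / inf is nan
stable = stable_softmax(X)  # both rows are valid probability vectors

print(naive)
print(stable)
```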

### 4.4.3. The Cross-Entropy Loss
We can build the loss function from the cross-entropy loss, which is the standard choice for classification in deep learning! (However, the model we are implementing here is still just softmax regression, not a deep network...)
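
A minimal sketch of a from-scratch cross-entropy, assuming `y_hat` already holds predicted probabilities (one row per example) and `y` holds integer class labels; the numbers are made up:

```python
import torch

def cross_entropy(y_hat, y):
    # Pick the predicted probability of the true class in each row,
    # then average the negative logs over the minibatch.
    picked = y_hat[torch.arange(len(y_hat)), y]
    return -torch.log(picked).mean()

y = torch.tensor([0, 2])
y_hat = torch.tensor([[0.1, 0.3, 0.6],
                      [0.3, 0.2, 0.5]])

loss = cross_entropy(y_hat, y)
print(loss.item())  # -(log 0.1 + log 0.5) / 2, roughly 1.4979
```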

## 5.1. Multilayer Perceptrons
To go beyond the limitations of linear models, we can insert one or more hidden layers between the input and the output, and apply a nonlinear activation function such as ReLU, sigmoid, or tanh after each hidden layer!
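
A small sketch of why the nonlinearity matters, with randomly initialized example weights (the layer sizes are arbitrary). Without an activation, two stacked linear layers collapse into a single linear layer:

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 3)                            # 4 examples, 3 features
W1, b1 = torch.randn(3, 5), torch.zeros(5)       # hidden layer with 5 units
W2, b2 = torch.randn(5, 2), torch.zeros(2)       # output layer with 2 units

# MLP with a ReLU hidden layer.
H = torch.relu(X @ W1 + b1)
O = H @ W2 + b2

# Same two layers without the activation ...
O_linear = (X @ W1 + b1) @ W2 + b2
# ... equal one linear layer with merged weights and bias.
W_collapsed = W1 @ W2
b_collapsed = b1 @ W2 + b2

print(torch.allclose(O_linear, X @ W_collapsed + b_collapsed, atol=1e-5))  # True
print(O.shape)
```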

## 5.2. Implementation of Multilayer Perceptrons
There is nothing to add as discussion or takeaways... (This section simply shows how to implement an MLP!)

## 5.3. Forward Propagation, Backward Propagation, and Computational Graphs

### 5.3.1. Forward Propagation
In the computational graph, forward propagation computes and stores the intermediate variables in order, from the input layer to the output layer!

### 5.3.3. Backpropagation
To compute the gradients of the parameters, we use backpropagation, which traverses the network in reverse order, from the output layer to the input layer, applying the chain rule!

### 5.3.4. Training Neural Networks
Forward propagation and backpropagation depend on each other. Forward propagation traverses the computational graph using the current parameters and stores the intermediate variables; backpropagation then reuses those stored values to compute the gradients of the parameters via the chain rule, and the optimizer uses these gradients to update the parameters!
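
This dependency can be sketched on a tiny one-hidden-layer network. The setup is a simplification chosen for illustration (no bias, no regularization, squared-error loss, random example values): the manual backward pass reuses the intermediate values `z`, `h`, and `o` stored during the forward pass, and its result is compared against autograd.

```python
import torch

torch.manual_seed(0)
x = torch.randn(3)
y = torch.tensor(1.0)
W1 = torch.randn(4, 3, requires_grad=True)
W2 = torch.randn(1, 4, requires_grad=True)

# Forward propagation: compute and keep the intermediate variables.
z = W1 @ x                          # hidden pre-activation, shape (4,)
h = torch.relu(z)                   # hidden activation, shape (4,)
o = W2 @ h                          # output, shape (1,)
loss = 0.5 * (o - y).pow(2).sum()

# Backpropagation by autograd.
loss.backward()

# Backpropagation by hand, applying the chain rule from output to input.
with torch.no_grad():
    d_o = o - y                     # dL/do
    grad_W2 = torch.outer(d_o, h)   # dL/dW2 needs the stored h
    d_h = W2.T @ d_o                # dL/dh
    d_z = d_h * (z > 0)             # ReLU passes gradient only where z > 0
    grad_W1 = torch.outer(d_z, x)   # dL/dW1 needs the stored z and the input x

print(torch.allclose(grad_W1, W1.grad), torch.allclose(grad_W2, W2.grad))
```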