### Classify YouTube Category Based on Video Description
#### Group members: Tina Yi, Xueqing (Annie) Wu, Jiayi Zhou
#### Date: Dec 6

#### Step 1a

#### Step 2a

#### Step 2b
For the discriminative neural network, we designed a basic recurrent neural network (RNN) for text classification. It assigns each YouTube video to one of two categories based on its description.

The model first converts input words into dense vectors through an embedding layer: `nn.Embedding(input_size, hidden_size)` maps each word (represented as an index) to a dense vector of size `hidden_size`. The embedded sequences are then processed by a recurrent layer, `nn.RNN(hidden_size, hidden_size, batch_first=True)`, which takes the sequence of embedded vectors and produces an output sequence. Its hidden state has size `hidden_size`, which is set to 64. Finally, a linear layer, `nn.Linear(hidden_size, output_size)`, maps the RNN output to the two output classes: music or sports.
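The three layers described above can be sketched as a small PyTorch module. This is a minimal sketch, not the project's actual `RNN` module: the class name `SimpleRNNClassifier`, the vocabulary size of 1000, and the choice to classify from the last time step of the RNN output are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class SimpleRNNClassifier(nn.Module):
    """Embedding -> RNN -> Linear (hypothetical class name)."""

    def __init__(self, input_size: int, hidden_size: int = 64, output_size: int = 2):
        super().__init__()
        # Maps word indices to dense vectors of size hidden_size.
        self.embedding = nn.Embedding(input_size, hidden_size)
        # Processes the embedded sequence; batch_first=True means (batch, seq, feature).
        self.rnn = nn.RNN(hidden_size, hidden_size, batch_first=True)
        # Maps the RNN output to the two classes (music or sports).
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(x)       # (batch, seq_len, hidden_size)
        output, _ = self.rnn(embedded)     # (batch, seq_len, hidden_size)
        last_step = output[:, -1, :]       # assumption: use the last time step
        return self.fc(last_step)          # (batch, output_size) logits


model = SimpleRNNClassifier(input_size=1000)   # vocabulary size is an assumed value
tokens = torch.randint(0, 1000, (4, 12))       # 4 descriptions, 12 token ids each
logits = model(tokens)
print(logits.shape)  # torch.Size([4, 2])
```

The output is one pair of logits per description, which is the shape the cross-entropy loss expects.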

The model's output is compared with the true labels using the cross-entropy loss, and the Adam optimizer minimizes this loss. Training runs for a fixed number of epochs (20 in the current run). Within each epoch, the DataLoader (`dataloader`) iterates over batches of data. For each batch, the gradients are first reset to zero with `optimizer.zero_grad()` so that gradients from the previous batch do not accumulate. The model's output for the batch is then computed (`outputs = model(inputs)`), the loss is calculated, and backpropagation is run (`loss.backward()`). Finally, the optimizer updates the model parameters from the computed gradients (`optimizer.step()`).
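The order of operations in the loop can be sketched as follows. This is a minimal sketch under stated assumptions: the data are random toy tensors, the learning rate and batch size are assumed values, and the model is a stand-in (a mean of embeddings followed by a linear layer), not the RNN from this report, since the loop itself does not depend on the model.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Toy stand-in data: 32 "descriptions" of 10 token ids each, with binary labels.
inputs = torch.randint(0, 50, (32, 10))
labels = torch.randint(0, 2, (32,))
dataloader = DataLoader(TensorDataset(inputs, labels), batch_size=8, shuffle=True)

# Stand-in classifier (NOT the report's RNN): mean of embeddings -> linear layer.
model = nn.Sequential(nn.EmbeddingBag(50, 16), nn.Linear(16, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr is an assumed value

num_epochs = 20
epoch_losses = []
for epoch in range(num_epochs):
    running_loss = 0.0
    for batch_inputs, batch_labels in dataloader:
        optimizer.zero_grad()                    # clear gradients from the previous batch
        outputs = model(batch_inputs)            # forward pass
        loss = criterion(outputs, batch_labels)  # cross-entropy against the true labels
        loss.backward()                          # backpropagation
        optimizer.step()                         # parameter update
        running_loss += loss.item()
    epoch_losses.append(running_loss / len(dataloader))

print(f"Final average loss: {epoch_losses[-1]:.4f}")
```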

During training, the total loss, the number of correct predictions, and the total number of samples are accumulated. Accuracy is the ratio of correct predictions to total samples. After each epoch, the average loss and accuracy for that epoch are printed. The overall training time is measured by recording the start and end times of training.
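The bookkeeping described above might look like the sketch below. The helper name `update_metrics`, the hand-made logits, and the choice to average the loss per batch (rather than per sample) are assumptions; the printed line only mimics the format of the training log.

```python
import time

import torch


def update_metrics(totals, loss, outputs, labels):
    """Accumulate loss, correct predictions, and sample count for one batch."""
    predicted = outputs.argmax(dim=1)  # class with the highest logit
    totals["loss"] += loss
    totals["correct"] += (predicted == labels).sum().item()
    totals["samples"] += labels.size(0)
    totals["batches"] += 1
    return totals


start = time.time()
totals = {"loss": 0.0, "correct": 0, "samples": 0, "batches": 0}

# Two hand-made batches: (batch loss, logits, true labels).
batch_a = (0.70, torch.tensor([[2.0, 0.1], [0.3, 1.5]]), torch.tensor([0, 1]))
batch_b = (0.50, torch.tensor([[0.2, 0.9], [1.1, 0.4]]), torch.tensor([0, 0]))
for batch_loss, outputs, labels in (batch_a, batch_b):
    update_metrics(totals, batch_loss, outputs, labels)

avg_loss = totals["loss"] / totals["batches"]     # assumption: averaged per batch
accuracy = totals["correct"] / totals["samples"]  # correct predictions / total samples
elapsed = time.time() - start
print(f"Loss: {avg_loss:.4f}, Accuracy: {accuracy:.4f}, Time: {elapsed:.2f} seconds")
```

Here three of the four toy predictions match their labels, so the accuracy is 0.75.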


#### Step 3a

##### Naive Bayes Model on Real Data

In [2]:
import Naive_Bayes as nb

In [3]:
nb.run_experiment("../data/category10.txt", "../data/category17.txt")

raw counts (train): 0.9606124357176251
raw_counts (test): 0.9601803155522164
Time for raw counts section: 45.80 seconds
tfidf (train): 0.9608795832498497
tfidf (test): 0.9604808414725771
Time for tfidf section: 16.20 seconds


##### RNN Model on Real Data
Qualitatively, the RNN's loss on real data decreases over the course of training, although not monotonically: it falls from 0.6743 in epoch 1 to 0.4448 in epoch 6, stays above that level through epoch 11, and then drops to 0.2767 by epoch 15. Accuracy follows the same pattern, with an overall upward trend and a temporary dip in epochs 7 to 11. This indicates that the model is learning to distinguish between music and sports, though training is not fully stable. The time per epoch is consistent, at roughly 15 seconds (14.77 to 15.81 seconds in the epochs shown).

Quantitatively, the final accuracy after 20 epochs is approximately 89.57%, and the total training time for 20 epochs is about 305.60 seconds, which we consider acceptable.

The RNN is designed to capture sequential dependencies in data. Here, the model learns patterns in the word order of video descriptions and handles variable-length sequences through sequence padding. The accuracy of approximately 89.57% indicates that the model distinguishes between the music and sports categories reasonably well.

However, RNNs have weaknesses. First, they can be sensitive to hyperparameter choices such as the learning rate, batch size, and hidden layer size, so hyperparameter tuning may be needed for the best results. Second, basic RNNs struggle to capture long-term dependencies in sequences. More advanced architectures, such as LSTM, could mitigate this limitation.
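As an illustration of the LSTM alternative mentioned above, the recurrent layer could be swapped for `nn.LSTM` while keeping the embedding and linear layers. This is a sketch only and was not part of our experiments: the class name `LSTMClassifier`, the vocabulary size, and the use of the final hidden state for classification are assumptions.

```python
import torch
import torch.nn as nn


class LSTMClassifier(nn.Module):
    """Embedding -> LSTM -> Linear (hypothetical variant, not evaluated here)."""

    def __init__(self, input_size: int, hidden_size: int = 64, output_size: int = 2):
        super().__init__()
        self.embedding = nn.Embedding(input_size, hidden_size)
        # nn.LSTM adds a cell state and gates, which help retain long-range information.
        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(x)           # (batch, seq_len, hidden_size)
        _, (h_n, _) = self.lstm(embedded)      # h_n: (num_layers, batch, hidden_size)
        return self.fc(h_n[-1])                # logits from the final hidden state


lstm_model = LSTMClassifier(input_size=1000)   # vocabulary size is an assumed value
lstm_logits = lstm_model(torch.randint(0, 1000, (4, 12)))
print(lstm_logits.shape)  # torch.Size([4, 2])
```

The rest of the training loop would not need to change, because the module still returns one pair of logits per description.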

In [2]:
import RNN as rnn

In [6]:
rnn.train_rnn_model("../data/category10.txt", "../data/category17.txt")

Read files done
Start training
Epoch 1/20, Loss: 0.6743, Accuracy: 0.5874, Time: 15.81 seconds
Epoch 2/20, Loss: 0.6448, Accuracy: 0.6015, Time: 15.80 seconds
Epoch 3/20, Loss: 0.6349, Accuracy: 0.6153, Time: 15.55 seconds
Epoch 4/20, Loss: 0.5541, Accuracy: 0.7402, Time: 15.47 seconds
Epoch 5/20, Loss: 0.4600, Accuracy: 0.8186, Time: 15.43 seconds
Epoch 6/20, Loss: 0.4448, Accuracy: 0.8132, Time: 15.44 seconds
Epoch 7/20, Loss: 0.5575, Accuracy: 0.6869, Time: 15.62 seconds
Epoch 8/20, Loss: 0.5817, Accuracy: 0.6696, Time: 15.71 seconds
Epoch 9/20, Loss: 0.5624, Accuracy: 0.6874, Time: 15.79 seconds
Epoch 10/20, Loss: 0.4873, Accuracy: 0.7559, Time: 15.59 seconds
Epoch 11/20, Loss: 0.5451, Accuracy: 0.7085, Time: 14.77 seconds
Epoch 12/20, Loss: 0.4229, Accuracy: 0.8368, Time: 14.81 seconds
Epoch 13/20, Loss: 0.3827, Accuracy: 0.8666, Time: 14.95 seconds
Epoch 14/20, Loss: 0.3800, Accuracy: 0.8361, Time: 14.97 seconds
Epoch 15/20, Loss: 0.2767, Accuracy: 0.8842, Time: 14.92 seconds
Epo

#### Step 3b

##### Naive Bayes on Synthetic Data

In [4]:
nb.run_experiment("../data/synthetic_music.txt", "../data/synthetic_sports.txt")

raw counts (train): 0.9878888888888889
raw_counts (test): 0.9895
Time for raw counts section: 13.52 seconds
tfidf (train): 0.9877777777777778
tfidf (test): 0.9895
Time for tfidf section: 4.31 seconds


##### RNN Model on Synthetic Data
Qualitatively, the RNN's loss on synthetic data fluctuates, which suggests difficulty converging or sensitivity to the dataset. The accuracy likewise shows no clear trend, which may indicate that the model has trouble learning patterns in the synthetic data. Each epoch is shorter than for the RNN on real data (about 4.6 seconds versus about 15 seconds), most likely because the dataset is smaller.
Quantitatively, the final accuracy after 20 epochs is 61.55%, which is a weak result for a two-class problem and well below the 89.57% reached on real data. The total training time is 92.18 seconds for 20 epochs, shorter than for the RNN on real data. Further investigation into data quality and adjustments to the hyperparameters may improve performance.
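The fluctuation can be checked directly against the epoch log. The sketch below uses six lines copied from the output of the synthetic-data run and lists the epochs in which the loss rose relative to the previous epoch. The regular expression and variable names are illustrative and not part of the project code.

```python
import re

# Six consecutive lines copied from the synthetic-data training log.
log = """\
Epoch 5/20, Loss: 0.6795, Accuracy: 0.5277, Time: 4.59 seconds
Epoch 6/20, Loss: 0.6806, Accuracy: 0.5252, Time: 4.60 seconds
Epoch 7/20, Loss: 0.6703, Accuracy: 0.5414, Time: 4.53 seconds
Epoch 8/20, Loss: 0.6329, Accuracy: 0.6499, Time: 4.53 seconds
Epoch 9/20, Loss: 0.6090, Accuracy: 0.6741, Time: 4.65 seconds
Epoch 10/20, Loss: 0.6826, Accuracy: 0.5214, Time: 4.72 seconds
"""

pattern = re.compile(r"Epoch (\d+)/\d+, Loss: ([\d.]+), Accuracy: ([\d.]+)")
records = [
    (int(epoch), float(loss), float(acc))
    for epoch, loss, acc in pattern.findall(log)
]

# Epochs whose loss is higher than in the preceding epoch.
loss_increases = [
    current[0]
    for previous, current in zip(records, records[1:])
    if current[1] > previous[1]
]
print(loss_increases)  # [6, 10]
```

The jump at epoch 10 (loss 0.6090 to 0.6826, accuracy 0.6741 to 0.5214) undoes most of the progress made in epochs 7 to 9.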

In [8]:
rnn.train_rnn_model("../data/synthetic_music.txt", "../data/synthetic_sports.txt")

Read files done
Start training
Epoch 1/20, Loss: 0.6934, Accuracy: 0.5108, Time: 4.57 seconds
Epoch 2/20, Loss: 0.6877, Accuracy: 0.5211, Time: 4.57 seconds
Epoch 3/20, Loss: 0.6876, Accuracy: 0.5232, Time: 4.42 seconds
Epoch 4/20, Loss: 0.6869, Accuracy: 0.5133, Time: 4.88 seconds
Epoch 5/20, Loss: 0.6795, Accuracy: 0.5277, Time: 4.59 seconds
Epoch 6/20, Loss: 0.6806, Accuracy: 0.5252, Time: 4.60 seconds
Epoch 7/20, Loss: 0.6703, Accuracy: 0.5414, Time: 4.53 seconds
Epoch 8/20, Loss: 0.6329, Accuracy: 0.6499, Time: 4.53 seconds
Epoch 9/20, Loss: 0.6090, Accuracy: 0.6741, Time: 4.65 seconds
Epoch 10/20, Loss: 0.6826, Accuracy: 0.5214, Time: 4.72 seconds
Epoch 11/20, Loss: 0.6804, Accuracy: 0.5239, Time: 4.59 seconds
Epoch 12/20, Loss: 0.6765, Accuracy: 0.5458, Time: 4.71 seconds
Epoch 13/20, Loss: 0.6676, Accuracy: 0.5697, Time: 4.61 seconds
Epoch 14/20, Loss: 0.6338, Accuracy: 0.6486, Time: 4.65 seconds
Epoch 15/20, Loss: 0.5771, Accuracy: 0.7236, Time: 4.63 seconds
Epoch 16/20, Loss:

#### Step 4

##### Naive Bayes

##### RNN
In terms of quality and correctness, RNNs capture sequential dependencies in data, which suits tasks where word order within a video description matters. However, for longer descriptions the model's ability to capture long-term dependencies diminishes: a basic RNN has limited memory, so information from early in a long sequence may be lost.

In terms of data, RNNs handle variable-length sequences, which helps when descriptions differ in length. They also learn embeddings, which lets them represent the input in a meaningful way. However, their performance depends heavily on the quality and representativeness of the training data. In our experiments, the model performed markedly worse on the synthetic data (61.55% accuracy) than on the real data (89.57%).

In terms of training time, the RNN takes considerably longer to train than Naive Bayes: about 305.60 seconds on the real data, versus 45.80 and 16.20 seconds for the two Naive Bayes sections. Computationally, training RNN architectures with many parameters can be demanding, and RNNs may need significant memory, particularly for longer sequences.

Regarding interpretability, the learned weights of an RNN can in principle be inspected for hints about which parts of a sequence drive a prediction. In practice, however, the model's complexity makes its internal representations difficult to interpret.
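To make the variable-length handling discussed in this section concrete, the sketch below pads three token-id sequences of different lengths to a common length with `torch.nn.utils.rnn.pad_sequence`. The token ids are made up, and reserving id 0 for padding is an assumption. The project's own padding code may differ.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical token-id sequences for three descriptions of different lengths.
sequences = [
    torch.tensor([4, 7, 2]),
    torch.tensor([5, 1]),
    torch.tensor([9, 3, 8, 6]),
]

# Pad every sequence to the length of the longest one.
# Assumption: id 0 is reserved for padding and never used for a real word.
padded = pad_sequence(sequences, batch_first=True, padding_value=0)
lengths = torch.tensor([len(seq) for seq in sequences])

print(padded.shape)  # torch.Size([3, 4])
print(padded[1])     # tensor([5, 1, 0, 0])
```

Because padding is appended at the end, the last time step of a short sequence is a padding token. Keeping the true `lengths` alongside the padded batch makes it possible to account for this later.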
