Shravyala edited this page Jul 17, 2020 · 2 revisions

Lesson-10 (Word Embedding)

In this ICP, we learnt what word embeddings are, the types of ANNs, and Recurrent Neural Networks.

Software Requirements:

  1. Python version 3

  2. PyCharm

  3. Keras and TensorFlow installed

  4. Anaconda

  5. GitHub

Objectives of this ICP:

1. The provided code contains three mistakes that stop it from running successfully; find those mistakes and explain why they need to be corrected for the code to run.

(a) For this task, the source code is available at the link below.

    https://umkc.box.com/s/3so2s3dx7cjp4hwnurjx6t3it161ptey

(b) In the above source code, sentiment analysis is performed on the “imdb_master.csv” data set.

Below is the source code.

(c) Because of these errors, the code fails to execute and an error is displayed in the output.

(d) The 3 mistakes in the given code are:

(e) The input dimension should be vocab_size.

(f) The output layer should have 3 neurons, one each for the positive, negative, and neutral classes in the target column of the dataset.

(g) The output layer activation function should be softmax, which works best for multi-class classification; sigmoid is meant for binary classification.
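A minimal sketch of the corrected model definition, with the three fixes applied. The 300-unit hidden layer is taken from the description below; the exact vocab_size value and the adam optimizer are assumptions for illustration:

```python
# Sketch of the corrected model (assumed values: vocab_size=2000, adam optimizer)
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, Dense

vocab_size = 2000  # fix 1: input dimension must match the vocabulary size

model = Sequential([
    Input(shape=(vocab_size,)),      # bag-of-words input of width vocab_size
    Dense(300, activation='relu'),
    Dense(3, activation='softmax'),  # fixes 2 and 3: 3 neurons, softmax
])
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
```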

(h) Import the necessary libraries.

(i) Read the data.

(j) Pre-process the data in the next step.

(k) Trim the dataset down to 20,000 records.

(l) Tokenize the data and convert the text into matrix form.

(m) Use the LabelEncoder method to convert the text labels to digits and fit-transform the data.

(n) Split the data into train and test sets, taking 25% as the test data.

(o) Use a deep learning Sequential model with 2 layers.

(p) The 1st layer is an Embedding layer, a learned lookup table (in the spirit of word2vec) that captures the semantic relationships within the data.

(q) 2nd layer: 300 neurons with the relu activation function.

(r) Output layer: 3 neurons with the softmax activation function.

(s) The number of epochs is 5, the batch size is 256, and the loss function is sparse_categorical_crossentropy.

(t) The accuracy and loss values of the train and validation data are printed in the output.

(u) Below is the source code for this program.
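The steps above can be sketched end to end as follows. Toy sentences stand in for the real reviews, and the vocabulary size, optimizer, and label names are assumptions, not the original script:

```python
# Hedged sketch of the described pipeline on toy data standing in for the reviews
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, Dense

texts = ["great movie", "terrible film", "it was okay",
         "loved it", "hated it", "just average"] * 50
labels = ["pos", "neg", "neutral", "pos", "neg", "neutral"] * 50

max_words = 500                               # assumed vocabulary size
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
X = tokenizer.texts_to_matrix(texts)          # text -> matrix form

y = LabelEncoder().fit_transform(labels)      # text labels -> 0/1/2

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

model = Sequential([
    Input(shape=(max_words,)),
    Dense(300, activation='relu'),            # 2nd layer: 300 neurons, relu
    Dense(3, activation='softmax'),           # output layer: 3 neurons, softmax
])
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=5, batch_size=256,
                    validation_data=(X_test, y_test), verbose=0)
```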

The output is shown below:

Below is the link for the complete output.

https://github.com/Shravyala/Python-Deep-Learning/blob/master/ICP-10/Output%20Images/ICP-10%20(1)_Output%201.txt

2. Add an embedding layer to the model; did you experience any improvement?

Bonus questions

1. Plot the loss and accuracy using the history object.

2. Predict over one sample of data and check what the prediction for it will be.

By adding the embedding layer to the previous code, the accuracy and loss values differ from the previous run.

Below is the source code:

The output is shown below:

The loss and accuracy values are 0.84 and 0.51, respectively.

For the bonus points, plots are drawn for the loss and accuracy of the train and validation data. A sample is also predicted, and the output shown is 2.
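A sketch of the two bonus tasks, plotting the curves from the history object and predicting over one sample. Random toy data stands in for the real features, and the layer sizes are assumptions:

```python
# Hedged sketch: plot training history and predict one sample (toy data)
import numpy as np
import matplotlib
matplotlib.use('Agg')  # render off-screen so no display is needed
import matplotlib.pyplot as plt
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, Dense

X = np.random.rand(200, 50)               # toy stand-in for the real features
y = np.random.randint(0, 3, size=200)     # toy 3-class labels

model = Sequential([Input(shape=(50,)),
                    Dense(32, activation='relu'),
                    Dense(3, activation='softmax')])
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
history = model.fit(X, y, epochs=5, batch_size=32,
                    validation_split=0.25, verbose=0)

# Loss and accuracy curves for train and validation data
for key in ['loss', 'val_loss', 'accuracy', 'val_accuracy']:
    plt.plot(history.history[key], label=key)
plt.legend()
plt.savefig('history.png')

# Predict over one sample: argmax of the softmax output gives the class index
pred = np.argmax(model.predict(X[:1], verbose=0), axis=1)[0]
```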

Below is the link for the complete output.

https://github.com/Shravyala/Python-Deep-Learning/blob/master/ICP-10/Output%20Images/ICP-10%20(2)_Output.txt

3. Apply the code on the 20_newsgroup data set we worked on in the previous classes.

from sklearn.datasets import fetch_20newsgroups

categories = None  # None loads all categories; set a list of names to restrict
newsgroups_train = fetch_20newsgroups(subset='train', shuffle=True, categories=categories)

  1. For this task, the data set in the same code is replaced with 20_newsgroup.

  2. The embedding layer is applied and the graph is plotted for loss and accuracy.

  3. Below is the source code.

The output is shown below. Since execution was slow in PyCharm, it was run in Google Colab.
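A minimal sketch of the same pipeline adapted to 20_newsgroup; the substantive change is 20 output neurons instead of 3. Fetching the real data needs a network call, so toy posts stand in here and the fetch lines are shown as comments:

```python
# Hedged sketch: the same model pointed at 20 classes (toy stand-in data)
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, Dense

# Real data would come from:
# from sklearn.datasets import fetch_20newsgroups
# newsgroups_train = fetch_20newsgroups(subset='train', shuffle=True)
# texts, y = newsgroups_train.data, newsgroups_train.target

texts = ["sample newsgroup post %d" % i for i in range(100)]
y = np.random.randint(0, 20, size=100)        # 20 categories instead of 3

max_words = 1000
tok = Tokenizer(num_words=max_words)
tok.fit_on_texts(texts)
X = tok.texts_to_matrix(texts)

model = Sequential([Input(shape=(max_words,)),
                    Dense(300, activation='relu'),
                    Dense(20, activation='softmax')])  # 20 output neurons now
model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=1, batch_size=32, verbose=0)
```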

Video Link:

https://youtu.be/1gl43UMcvWw
