##### Quick notes on concepts that were not covered in the Deep Learning modules of the Machine Learning Nanodegree:

##### Regarding DL for Text:
* In a text dataset, key terms matter most, yet we might not have enough data describing them. A deep learning model usually expects many samples to learn the variance of a label, like many images of a particular flower.
* Learning the semantic relation between "kitty" and "cat" also requires a huge dataset. If two words are similar, the model has to share weights between them, i.e. its internal weights should be such that, inside a large network (e.g. a CNN), both words are transformed along nearly the same path and the model treats them as the same when asked.

##### Similar words occur in similar contexts:
* As we have a lot of text available from Wikipedia, we can use unsupervised learning to identify similar words, under **the assumption that similar words occur in similar contexts.**
* This assumption helps in several ways:
    * Without knowing their meaning, we can associate words with each other.
    * A word can be represented through its associated words rather than through a sparse n x n word matrix. This representation of a word as a vector of associations is called an **embedding**. It is a much smaller vector than a row of the sparse matrix.
    * From a single word's embedding we can recover the multiple words associated with it.
    
##### How do we get the word embedding?
* **Word2Vec**: split a sentence into two sets of words, with one set trying to predict the other. A simple logistic regression can serve as the prediction model. The model is trained over windows of words, and the learned weights that best represent the corpus (sentences, paragraphs or a book) are then saved as the word vectors.
    * **Skip-gram**: predicts the neighboring context words from a word.
    * **Continuous Bag of Words (CBOW)**: predicts the word from its neighboring context.
    * https://www.geeksforgeeks.org/python-word-embedding-using-word2vec/
    * **Sampled softmax**: when the vocabulary of target words is large, a full softmax is inefficient. Instead, the **non-target classes** are randomly sampled, and only this small subset is used in each training step.

* **t-SNE**: a visualization technique that projects the word embeddings onto 2D while retaining the relations between words. Closely related words are plotted near each other and loosely related words far apart.
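
To make the skip-gram / CBOW distinction above concrete, here is a minimal sketch of how the training pairs could be generated from one sentence. The helper names and the window size of 1 are my own assumptions for illustration; this is not how any particular Word2Vec library is implemented.

```python
def context_windows(tokens, window=1):
    """Yield (center_word, context_words) for every position in the sentence."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right


def skip_gram_pairs(tokens, window=1):
    # skip-gram: the center word is the input, each neighbor is a target
    return [(center, ctx)
            for center, context in context_windows(tokens, window)
            for ctx in context]


def cbow_pairs(tokens, window=1):
    # CBOW: the neighbors are the input, the center word is the target
    return [(context, center)
            for center, context in context_windows(tokens, window)]


sentence = "the kitty sat on the mat".split()
print(skip_gram_pairs(sentence)[:4])
print(cbow_pairs(sentence)[:2])
```

Both functions see exactly the same windows; only the direction of prediction differs.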

##### So, now, we have word embedding...what next ?
* Compare them using cosine similarity. Why compare? That is the most common activity on text: applications like plagiarism checks or classifying documents by author require word embeddings to be compared.


* Can it also help in Q&A? In predicting the next sequence of sentences?
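
The cosine-similarity comparison mentioned above can be sketched in a few lines. The three-dimensional vectors are made up for illustration (real embeddings have many more dimensions and are learned, not hand-written).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# hypothetical embeddings, invented for this example
embeddings = {
    "cat":   [0.9, 0.1, 0.3],
    "kitty": [0.8, 0.2, 0.35],
    "car":   [0.1, 0.9, 0.0],
}

print(cosine_similarity(embeddings["cat"], embeddings["kitty"]))
print(cosine_similarity(embeddings["cat"], embeddings["car"]))
```

With these made-up numbers "cat" is much closer to "kitty" than to "car", which is the behaviour we hope a trained embedding shows.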

##### RNN: Recurrent Neural Network
* It takes time into account for the input vector, i.e. what the state was at step t-1. It shares weights across time, rather than across space as a CNN does.
* Since the input changes over time and at any point depends on the previous state, a model is used to summarize past events. This model forms the recurrent connection.
* A repeated network made of a recurrent connection (to summarize the past), an input vector and a predicting model (classification or regression) is called an **RNN**. Usually the predicting model is the same in every repetition.
* The input vector keeps the same form, but at step t the network also takes into account the recursive transformation over t-1 and earlier steps.
* **Backpropagation through time** results in correlated updates to the weights, which is bad for gradient descent. The updates are correlated because of the recurrent connection: the weights are carried in the output of step t-1.

* **Vanishing gradients**: the gradients shrink rapidly towards zero, so the model effectively stops training. This also causes memory loss: the model does not keep track of the more distant past and only tracks the recent past.

* **Exploding gradients**: the gradients grow rapidly towards infinity. This can be addressed by gradient clipping: when the gradients become too large they are rescaled so that the step is cut short. I.e. during backpropagation, once the gradient reaches the maximum threshold it is not allowed to grow any further, and the capped value is used for step t-x and earlier.
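
A rough sketch of why gradients vanish or explode, and of one common clipping variant (clipping by L2 norm). The scalar recurrence `h_t = w * h_{t-1}` is a deliberately simplified assumption, not a full RNN, and the function names are my own.

```python
import math


def gradient_through_time(recurrent_weight, steps):
    """Gradient factor after backpropagating `steps` steps through the
    scalar linear recurrence h_t = w * h_{t-1}: it is simply w ** steps."""
    return recurrent_weight ** steps


def clip_by_norm(grads, max_norm):
    """Rescale the gradient vector so its L2 norm never exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


print(gradient_through_time(0.5, 20))   # shrinks towards 0 (vanishing)
print(gradient_through_time(1.5, 20))   # grows very large (exploding)
print(clip_by_norm([30.0, 40.0], max_norm=5.0))
```

Clipping keeps the direction of the gradient and only shortens the step, which helps with explosion but does nothing for vanishing gradients.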


##### LSTM: Long Short-Term Memory
* Addresses the vanishing gradient problem in RNNs.
* Replaces the predicting model in the RNN with an LSTM "cell" that does memory management. This memory management helps the network remember older events, thus addressing the vanishing gradient problem.
* The memory management operations are storing, forgetting and reading memory.
* Instead of making those operations discrete (all or nothing), each one is controlled through a logistic regression (a gate), which makes partial reads, stores and deletes possible.
* **Regularization**: L2 regularization can be used on all 4 sides (2 inputs and 2 outputs). Dropout can be used only on the input and/or output, not on the recurrent connections that carry state to future steps.
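
The store / forget / read gates described above can be sketched as one step of a scalar LSTM cell. This follows the standard LSTM equations as I understand them; the parameter layout (`params` mapping a gate name to a `(w_x, w_h, bias)` triple) and the numbers are invented for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell. `params` maps each gate name to a
    hypothetical (w_x, w_h, bias) triple."""
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    forget = sigmoid(pre("forget"))          # how much old memory to keep
    store = sigmoid(pre("input"))            # how much new content to write
    read = sigmoid(pre("output"))            # how much memory to expose
    candidate = math.tanh(pre("candidate"))  # the new content itself

    c = forget * c_prev + store * candidate  # partial forget + partial store
    h = read * math.tanh(c)                  # partial read
    return h, c


# made-up parameters: keep almost all old memory, write almost nothing new
remember = {"forget": (0.0, 0.0, 10.0), "input": (0.0, 0.0, -10.0),
            "output": (0.0, 0.0, 10.0), "candidate": (1.0, 0.0, 0.0)}
print(lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, params=remember))
```

Because each gate is a sigmoid between 0 and 1, the cell can keep, write or expose a fraction of its memory instead of making a hard yes/no decision.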

##### RNN Application: Beam Search
* RNNs are used for predicting time-series activity; one example is word prediction like the Google search bar does.
* It takes the text entered so far as input and predicts the upcoming word or characters.
* Taking only the single best prediction at each step would be too greedy; it is better to keep a couple of them and continue the sequence for each.
* This means that multiple sequences are now being investigated.
* Based on the final probability scores, one sequence can then be chosen.
* This approach avoids accidentally committing to one word early; instead we have a couple of sequences to choose from.

e.g.: "App" can lead to apple, append or application. If the model picks, say, apple just because its probability is slightly larger than the others, you would lose the other 2 options.

If you had kept both 'l' and 'e' and continued the series, you could choose at the end from apple, append and application. Based on the probability score (the product of all the predicted probabilities), you are then in a better position to choose the word.

* Maintaining all sequences might not be cost effective (time and memory). For this we have beam search.
* It prunes the sequences that are not so likely and keeps only the most likely ones. It defines a beam width, the number of most likely sequences kept at each step, thereby simulating a beam-like projection.
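
A minimal sketch of the beam search idea above, using the "app" example. The next-token probability table `NEXT` is invented for illustration; in practice these probabilities would come from a trained model such as an RNN.

```python
# hypothetical next-token probabilities, made up for this example
NEXT = {
    "app":  {"l": 0.55, "e": 0.45},
    "appl": {"e": 0.4, "ication": 0.6},
    "appe": {"nd": 1.0},
}


def beam_search(start, next_probs, beam_width, max_steps=10):
    """Keep only the `beam_width` most probable sequences at every step."""
    beams = [(1.0, start)]                      # (probability so far, text)
    for _ in range(max_steps):
        candidates = []
        for prob, text in beams:
            options = next_probs.get(text, {})
            if not options:                     # finished sequence: carry over
                candidates.append((prob, text))
            for token, p in options.items():
                candidates.append((prob * p, text + token))
        # prune: sort by probability and keep the top beam_width
        new_beams = sorted(candidates, reverse=True)[:beam_width]
        if new_beams == beams:                  # nothing was extended
            break
        beams = new_beams
    return beams


print(beam_search("app", NEXT, beam_width=1))   # greedy: one sequence only
print(beam_search("app", NEXT, beam_width=2))   # keeps two sequences alive
```

With a beam width of 1 the search commits to 'l' early and ends on "application"; with a width of 2 it also follows 'e' and finds "append", whose overall probability is higher under these made-up numbers.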

##### LSTM through Keras (note: the cells below use only Embedding, Flatten and Dense layers; no LSTM layer is added):

In [1]:
import numpy as np
# define documents
docs = ['Well done!',
		'Good work',
		'Great effort',
		'nice work',
		'Excellent!',
		'Weak',
		'Poor effort!',
		'not good',
		'poor work',
		'Could have done better.']
# define class labels
labels = np.array([1,1,1,1,1,0,0,0,0,0])

In [2]:
from tensorflow.keras.preprocessing.text import one_hot
# integer encode the documents
vocab_size = 50
encoded_docs = [one_hot(d, vocab_size) for d in docs]
print(encoded_docs)

[[46, 48], [40, 9], [32, 7], [40, 9], [4], [25], [39, 7], [22, 40], [39, 9], [41, 6, 48, 38]]
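
Note that 'Good work' and 'nice work' both encode to `[40, 9]` above. As far as I know, Keras's `one_hot` does not build a vocabulary; it hashes each word into one of `vocab_size - 1` buckets, so different words can collide. The sketch below imitates that idea with `hashlib.md5` so the result is reproducible; the function name `hashed_one_hot` and the simplistic punctuation handling are my own assumptions, and the indices it produces will not match the ones printed above.

```python
import hashlib


def hashed_one_hot(text, n):
    """Map each word to an integer in [1, n-1] by hashing (index 0 is left
    free for padding). A sketch of the hashing idea, not Keras's own code."""
    words = text.lower().replace("!", " ").replace(".", " ").split()
    return [int(hashlib.md5(w.encode()).hexdigest(), 16) % (n - 1) + 1
            for w in words]


print(hashed_one_hot("Good work", 50))
print(hashed_one_hot("nice work", 50))
# with very few buckets, collisions between different words are guaranteed
print(hashed_one_hot("alpha beta gamma", 3))
```

The smaller `vocab_size` is relative to the number of distinct words, the more often two words share an index and therefore an embedding.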


In [3]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
# pad documents to a max length of 4 words
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)

[[46 48  0  0]
 [40  9  0  0]
 [32  7  0  0]
 [40  9  0  0]
 [ 4  0  0  0]
 [25  0  0  0]
 [39  7  0  0]
 [22 40  0  0]
 [39  9  0  0]
 [41  6 48 38]]


In [4]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Flatten, Dense, Embedding
#from tensorflow.keras.embeddings import Embedding
# define the model
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# summarize the model
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 4, 8)              400       
_________________________________________________________________
flatten (Flatten)            (None, 32)                0         
_________________________________________________________________
dense (Dense)                (None, 1)                 33        
Total params: 433
Trainable params: 433
Non-trainable params: 0
_________________________________________________________________
None
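
The parameter counts in the summary above can be checked by hand. This is plain arithmetic on the values already used in the cell (`vocab_size=50`, embedding size 8, `max_length=4`); only the variable names are mine.

```python
vocab_size, embedding_dim, max_length = 50, 8, 4

embedding_params = vocab_size * embedding_dim   # one 8-d vector per word index
flatten_units = max_length * embedding_dim      # 4 words x 8 dims, no weights
dense_params = flatten_units * 1 + 1            # 32 weights + 1 bias

print(embedding_params, flatten_units, dense_params,
      embedding_params + dense_params)          # 400 32 33 433
```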


In [46]:
# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=1)
print('Accuracy: %f' % (accuracy*100))

Accuracy: 100.000000


In [50]:
model.predict(padded_docs)

array([[0.82694954],
       [0.77747256],
       [0.7040919 ],
       [0.77747256],
       [0.6916161 ],
       [0.28503722],
       [0.29529685],
       [0.23452783],
       [0.42103335],
       [0.07797129]], dtype=float32)

##### Just with "not" what is the score ?

In [53]:
model.predict(np.array([22,0,0,0]).reshape(1,4))

array([[0.27370414]], dtype=float32)

##### It is near 0 and hence negative

##### Just with "great", what is the score?

In [54]:
model.predict(np.array([32,0,0,0]).reshape(1,4))

array([[0.6872823]], dtype=float32)

##### Sentiment prediction is "positive"

##### Just with "work", what is the score? (Note: the cells below pass index 32, which encodes "great" above; "work" is index 9, so these scores describe "great".)

In [55]:
model.predict(np.array([32,0,0,0]).reshape(1,4))

array([[0.6872823]], dtype=float32)

In [56]:
model.predict(np.array([0,32,0,0]).reshape(1,4)) # index 32 as 2nd word

array([[0.5001465]], dtype=float32)

In [58]:
model.predict(np.array([22,32,0,0]).reshape(1,4)) # "not" followed by index 32

array([[0.30238178]], dtype=float32)

In [59]:
model.predict(np.array([39,32,0,0]).reshape(1,4)) # "poor" followed by index 32

array([[0.30804333]], dtype=float32)

* with index 32 as the first word the model predicts positive sentiment
* with index 32 as the second word the score is neutral (about 0.50) on its own and negative once "not" or "poor" precedes it

### What if there are words as labels?
* Like quotes as the sentences and the author of each sentence as the label?

In [80]:
docs_str = [("A for", "Apple")
            , ("B for", "Ball")
            , ("C for", "cat")
            , ("Kitty can also mean", "cat")
            , (" meow sound can also mean", "cat")
            , ("D for", "Dog")
            , ("A is also for", "Ant")
            , ("D can also have", "Drum")
            , ("B is popularly known shortcut for", "Be")
           ]

* Here a simple classifier could map the sequence of text to a class based on length or a bag of words (BoW). I guess this is not a good example for LSTM.

* If the model is built this way, it might predict "Ant" for "D is also known for". Let us check!

In [81]:
vocab_size = 100
encoded_docs = [one_hot(d[0], vocab_size) for d in docs_str]
print(encoded_docs)

[[24, 18], [19, 18], [90, 18], [27, 40, 28, 29], [31, 95, 40, 28, 29], [77, 18], [24, 43, 28, 18], [77, 40, 28, 46], [19, 43, 6, 14, 3, 18]]


In [82]:
max_length = max([len(x) for x in encoded_docs])
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)

[[24 18  0  0  0  0]
 [19 18  0  0  0  0]
 [90 18  0  0  0  0]
 [27 40 28 29  0  0]
 [31 95 40 28 29  0]
 [77 18  0  0  0  0]
 [24 43 28 18  0  0]
 [77 40 28 46  0  0]
 [19 43  6 14  3 18]]


In [83]:
model = Sequential()
model.add(Embedding(vocab_size, 20, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# summarize the model
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 6, 20)             2000      
_________________________________________________________________
flatten_3 (Flatten)          (None, 120)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 121       
Total params: 2,121
Trainable params: 2,121
Non-trainable params: 0
_________________________________________________________________
None


In [87]:
labels = [ one_hot(y, vocab_size)[0] for y in [x[1] for x in docs_str] ]
print(labels)
# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=1)
print('Accuracy: %f' % (accuracy*100))

[73, 53, 78, 78, 78, 49, 85, 70, 23]
Accuracy: 0.000000

* Accuracy is 0 because the model ends in a single sigmoid unit trained with binary cross-entropy, which can only represent a 0/1 target; hashed integers such as 73 or 53 are not valid targets for it. The labels are therefore one-hot encoded in the next cells.


In [88]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

labels = [x[1] for x in docs_str]

# integer encode
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(labels)
print(integer_encoded)

[1 2 6 6 6 4 0 5 3]


In [89]:
onehot_encoder = OneHotEncoder(sparse=False)  # scikit-learn >= 1.2 renames this argument to sparse_output
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded)

[[0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 1. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0. 0. 0.]]


In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.


In [92]:
onehot_encoded = onehot_encoder.fit_transform(np.array(labels).reshape(-1,1))
print(onehot_encoded)

[[0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 1. 0. 0.]
 [1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0. 0. 0.]]


In [99]:
model = Sequential()
model.add(Embedding(vocab_size, 30, input_length=max_length))
model.add(Flatten())
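model.add(Dense(7, activation='sigmoid'))  # 'softmax' is the usual pairing with categorical_crossentropy; sigmoid is what produced the outputs shown here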
# compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
# summarize the model
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, 6, 30)             3000      
_________________________________________________________________
flatten_6 (Flatten)          (None, 180)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 7)                 1267      
Total params: 4,267
Trainable params: 4,267
Non-trainable params: 0
_________________________________________________________________
None


In [104]:
# fit the model
model.fit(padded_docs, onehot_encoded, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(padded_docs, onehot_encoded, verbose=1)
print('Accuracy: %f' % (accuracy*100))

Accuracy: 100.000000


In [105]:
model.predict(padded_docs)

array([[0.01943946, 0.35496178, 0.04048574, 0.0042848 , 0.04294088,
        0.00186032, 0.02373973],
       [0.00214723, 0.03345937, 0.31550813, 0.0127781 , 0.01372948,
        0.00122043, 0.06902653],
       [0.00129768, 0.01842001, 0.05497   , 0.00307879, 0.04996714,
        0.00225902, 0.33951503],
       [0.00857687, 0.00259757, 0.00412455, 0.00419933, 0.00232127,
        0.02179644, 0.5118555 ],
       [0.00760907, 0.00786635, 0.00805849, 0.01714033, 0.00963524,
        0.01276016, 0.90130043],
       [0.00285327, 0.0481348 , 0.02228722, 0.00250304, 0.36661592,
        0.01408225, 0.06488186],
       [0.5527029 , 0.01783398, 0.00162145, 0.01724142, 0.00420105,
        0.01095372, 0.01680672],
       [0.02444848, 0.00358361, 0.00100103, 0.00916204, 0.01780856,
        0.6778833 , 0.07911226],
       [0.07609388, 0.01595449, 0.0753158 , 0.9118255 , 0.00963558,
        0.03528861, 0.08005422]], dtype=float32)

In [108]:
text = "D is also known for"
encoded_text = one_hot(text, vocab_size)
print("Encoded text :", encoded_text)
padded_text = pad_sequences([encoded_text], maxlen=max_length, padding='post')
print(padded_text)

Encoded text : [77, 43, 28, 14, 18]
[[77 43 28 14 18  0]]


In [109]:
model.predict(padded_text)

array([[0.27758425, 0.00707912, 0.00233823, 0.08292675, 0.14962743,
        0.32138762, 0.07658859]], dtype=float32)
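
To read the prediction above as a class name, the seven scores have to be mapped back to the label order. `LabelEncoder` numbers the distinct labels in sorted order, which `sorted(set(...))` reproduces in this sketch (it matches the `[1 2 6 6 6 4 0 5 3]` output earlier). The probabilities are copied from the cell above.

```python
labels = ["Apple", "Ball", "cat", "cat", "cat", "Dog", "Ant", "Drum", "Be"]
# sorted order of the distinct labels (uppercase sorts before lowercase)
classes = sorted(set(labels))

# scores copied from the model.predict output for "D is also known for"
probs = [0.27758425, 0.00707912, 0.00233823, 0.08292675,
         0.14962743, 0.32138762, 0.07658859]

ranked = sorted(zip(probs, classes), reverse=True)
print(ranked[:2])   # [(0.32138762, 'Drum'), (0.27758425, 'Ant')]
```

So the model's top guess is "Drum", with the expected "Ant" a close second.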

In [111]:
text = "D is known shortcut for"
encoded_text = one_hot(text, vocab_size)
print("Encoded text :", encoded_text)
padded_text = pad_sequences([encoded_text], maxlen=max_length, padding='post')
print(padded_text)
model.predict(padded_text)

Encoded text : [77, 43, 14, 3, 18]
[[77 43 14  3 18  0]]


array([[0.18166593, 0.04393473, 0.01495257, 0.05598897, 0.3390646 ,
        0.28406417, 0.21930079]], dtype=float32)

In [112]:
text = "F is known shortcut for"
encoded_text = one_hot(text, vocab_size)
print("Encoded text :", encoded_text)
padded_text = pad_sequences([encoded_text], maxlen=max_length, padding='post')
print(padded_text)
model.predict(padded_text)

Encoded text : [9, 43, 14, 3, 18]
[[ 9 43 14  3 18  0]]


array([[0.22576067, 0.06020024, 0.04715523, 0.11328119, 0.05352072,
        0.07511067, 0.26769105]], dtype=float32)

In [113]:
text = "I dont know which one is this"
encoded_text = one_hot(text, vocab_size)
print("Encoded text :", encoded_text)
padded_text = pad_sequences([encoded_text], maxlen=max_length, padding='post')
print(padded_text)
model.predict(padded_text)

Encoded text : [76, 48, 82, 3, 15, 43, 18]
[[48 82  3 15 43 18]]


array([[0.22029433, 0.3572051 , 0.27860332, 0.38095355, 0.30702597,
        0.20268635, 0.33346674]], dtype=float32)
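
The seven scores just above add up to about 2.08 rather than 1, because a sigmoid output layer scores every class independently. A softmax layer would instead normalize the scores into a probability distribution. A small sketch, where the logits passed to `softmax` are made up for illustration:

```python
import math


def softmax(logits):
    """Turn arbitrary scores into probabilities that sum to 1."""
    shifted = [z - max(logits) for z in logits]   # subtract max for stability
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# sigmoid outputs copied from the "I dont know which one is this" cell
sigmoid_row = [0.22029433, 0.3572051, 0.27860332, 0.38095355,
               0.30702597, 0.20268635, 0.33346674]
print(sum(sigmoid_row))

# hypothetical logits, invented for this example
print(softmax([2.0, 1.0, 0.1]))
```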

In [114]:
text = "what?"
encoded_text = one_hot(text, vocab_size)
print("Encoded text :", encoded_text)
padded_text = pad_sequences([encoded_text], maxlen=max_length, padding='post')
print(padded_text)
model.predict(padded_text)

Encoded text : [86]
[[86  0  0  0  0  0]]


array([[0.00523269, 0.03013298, 0.03118229, 0.00637817, 0.02094114,
        0.00545558, 0.14406063]], dtype=float32)

In [115]:
text = "what is this?"
encoded_text = one_hot(text, vocab_size)
print("Encoded text :", encoded_text)
padded_text = pad_sequences([encoded_text], maxlen=max_length, padding='post')
print(padded_text)
model.predict(padded_text)

Encoded text : [86, 43, 18]
[[86 43 18  0  0  0]]


array([[0.03901008, 0.01958153, 0.02946073, 0.03354695, 0.03508037,
        0.0135765 , 0.09712441]], dtype=float32)

###### Why would we choose LSTM in this case? We could do it without that too, right?

##### After an hour of searching, I could not find a precise example where it is absolutely necessary to use an LSTM (or RNN).

##### It is always claimed that it gave good results in text translation, music note prediction and games, but I see it as below
* Why could a simple classification model not be used iteratively to produce the same result as an LSTM?
    * e.g.: I can iteratively get the probability of the subsequent characters from a trained model.
* If memory is the advantage of LSTM, then why would a simple search not give a better result, or at least the same result, as the LSTM?
    * e.g.: I can SEARCH the series of musical notes and predict the next note. I agree that the search time is much longer than the prediction time of the LSTM, but I guess we can always improve the search techniques.

##### Ans: Not really an answer. The outro to the lesson carefully makes the same point as above: when we have enough data and computation power, we can mix various techniques (more professionally called "layers" or "stacks of techniques") to build a new model which *OFTEN* does a better job than hand-crafted approaches to the problem.

* This indirectly tells me that we can have a simplified (hand-crafted) approach and a machine learning approach, but the machine learning one can be fine-tuned through choices like the number of layers, number of neurons, activation function etc. In the naive approach, however, the industry expert or data scientist has to solve the problem with careful coding.

##### Models as Lego:
* A deep neural network is nothing but stacked hidden neural network layers; it is up to you how you want to form the layers. It can be a combination of RNN, LSTM and CNN.
* Why? e.g.: for speech translation you can reuse the speech recognition patterns to complete the translation. So you may have a CNN for the speech recognition and then an LSTM to understand the missing sequence pattern.

##### Quick note on CNN and maxpool:
- A CNN layer is a filter that recognizes elements (local features) in the input.
- Max pooling downsizes the sample by keeping only the maximum of each window.
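
A one-dimensional sketch of the two notes above: a filter sliding over the input, followed by max pooling. The function names, the edge-detecting kernel `[-1, 1]` and the toy signal are assumptions chosen for illustration; real CNN layers learn their kernels and usually work on 2D images.

```python
def convolve_1d(values, kernel):
    """Slide a filter over the input (valid positions only, no padding).
    As in deep learning libraries, the kernel is not flipped."""
    k = len(kernel)
    return [sum(v * w for v, w in zip(values[i:i + k], kernel))
            for i in range(len(values) - k + 1)]


def max_pool_1d(values, size=2, stride=2):
    """Keep the maximum of each window, shrinking the sequence."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, stride)]


signal = [0, 0, 1, 1, 1, 0, 0, 0]
edges = convolve_1d(signal, [-1, 1])   # responds where the signal changes
print(edges)                           # [0, 1, 0, 0, -1, 0, 0]
print(max_pool_1d(edges))              # [1, 0, 0]
```

The filter output marks where the element (here, a rising edge) occurs, and pooling keeps that evidence while roughly halving the length.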

##### Outlier:
* We know that plotting is the best way to spot outliers, but an isolated point is not necessarily an outlier.
    * e.g.: in fraud detection you expect the fraud characteristics to lie outside the normal range.
    * In rent prediction, a higher sqft usually has a higher rent or sale price.
        * Even if the sqft is high, the rent may be low for that area, depending on the neighborhood.

### Quick notes on NLTK, Gensim TF-IDF generation

In [117]:
from nltk.corpus import stopwords

In [119]:
len(stopwords.words('english'))

179

In [120]:
from nltk.stem import WordNetLemmatizer, SnowballStemmer, PorterStemmer

In [140]:
stem = WordNetLemmatizer()
print(stem.lemmatize("working"))

stem = SnowballStemmer('english')
stem.stem('working')

working


'work'

* `WordNetLemmatizer.lemmatize` treats the word as a noun by default (`pos='n'`), which is why "working" comes back unchanged; passing `pos='v'` lemmatizes it as a verb. The Snowball stemmer simply strips the suffix, giving 'work'.

In [204]:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'Topic modeling helps in exploring large amounts of text data, finding clusters of words, the similarity between documents, and discovering abstract topics. As if these reasons weren’t compelling enough, topic modeling is also used in search engines wherein the search string is matched with the results. Getting interesting, isn’t it? Well, read on then!',
    'All languages have their own intricacies and nuances which are quite difficult for a machine to capture (sometimes they’re even misunderstood by us humans!). This can include different words that mean the same thing, and also the words which have the same spelling but different meanings.',
    'We can easily distinguish between these words because we are able to understand the context behind these words. However, a machine would not be able to capture this concept as it cannot understand the context in which the words have been used. This is where Latent Semantic Analysis (LSA) comes into play as it attempts to leverage the context around the words to capture the hidden concepts, also known as topics.',
    'The next step is to represent each and every term and document as a vector. We will use the document-term matrix and decompose it into multiple matrices. We will use sklearn’s TruncatedSVD to perform the task of matrix decomposition.',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())  # removed in scikit-learn 1.2; newer versions use get_feature_names_out()

print(X.shape)


['able', 'abstract', 'all', 'also', 'amounts', 'analysis', 'and', 'are', 'around', 'as', 'attempts', 'be', 'because', 'been', 'behind', 'between', 'but', 'by', 'can', 'cannot', 'capture', 'clusters', 'comes', 'compelling', 'concept', 'concepts', 'context', 'data', 'decompose', 'decomposition', 'different', 'difficult', 'discovering', 'distinguish', 'document', 'documents', 'each', 'easily', 'engines', 'enough', 'even', 'every', 'exploring', 'finding', 'for', 'getting', 'have', 'helps', 'hidden', 'however', 'humans', 'if', 'in', 'include', 'interesting', 'into', 'intricacies', 'is', 'isn', 'it', 'known', 'languages', 'large', 'latent', 'leverage', 'lsa', 'machine', 'matched', 'matrices', 'matrix', 'mean', 'meanings', 'misunderstood', 'modeling', 'multiple', 'next', 'not', 'nuances', 'of', 'on', 'own', 'perform', 'play', 'quite', 're', 'read', 'reasons', 'represent', 'results', 'same', 'search', 'semantic', 'similarity', 'sklearn', 'sometimes', 'spelling', 'step', 'string', 'task', 'term

In [205]:
vectorizer

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

In [206]:
X.todense()

matrix([[0.        , 0.13413859, 0.        , 0.08561892, 0.13413859,
         0.        , 0.08561892, 0.        , 0.        , 0.08561892,
         0.        , 0.        , 0.        , 0.        , 0.        ,
         0.1057564 , 0.        , 0.        , 0.        , 0.        ,
         0.        , 0.13413859, 0.        , 0.13413859, 0.        ,
         0.        , 0.        , 0.13413859, 0.        , 0.        ,
         0.        , 0.        , 0.13413859, 0.        , 0.        ,
         0.13413859, 0.        , 0.        , 0.13413859, 0.13413859,
         0.        , 0.        , 0.13413859, 0.13413859, 0.        ,
         0.13413859, 0.        , 0.13413859, 0.        , 0.        ,
         0.        , 0.13413859, 0.21151281, 0.        , 0.13413859,
         0.        , 0.        , 0.17123785, 0.13413859, 0.08561892,
         0.        , 0.        , 0.13413859, 0.        , 0.        ,
         0.        , 0.        , 0.13413859, 0.        , 0.        ,
         0.        , 0.        , 0

In [207]:
X.shape

(4, 128)
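
To see where numbers such as 0.13413859 and 0.08561892 come from, here is a hand-rolled TF-IDF sketch. It uses the smoothed idf `ln((1 + n) / (1 + df)) + 1` and L2 row normalisation, which I believe are the `TfidfVectorizer` defaults shown above (`smooth_idf=True`, `norm='l2'`); the whitespace tokenizer and the toy documents are simplifications of my own. As a sanity check, the ratio 0.08561892 / 0.13413859 (about 0.638) in the first row equals the idf ratio of a term found in 3 of the 4 documents ("also") to a term found in only 1.

```python
import math
from collections import Counter


def tfidf(docs):
    """TF-IDF with smoothed idf and L2 row normalisation (a sketch)."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # document frequency: in how many documents does each term occur?
    df = Counter(term for doc in tokenized for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


toy = ["topic modeling finds topics", "context helps capture topics"]
for row in tfidf(toy):
    print(row)
```

Terms that occur in fewer documents get a larger weight, which is why "topic" outweighs "topics" in the first toy document.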

In [208]:
vectorizer = TfidfVectorizer(stop_words='english', ngram_range=(1,4))
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())  # removed in scikit-learn 1.2; newer versions use get_feature_names_out()

print(X.shape)

['able', 'able capture', 'able capture concept', 'able capture concept understand', 'able understand', 'able understand context', 'able understand context words', 'abstract', 'abstract topics', 'abstract topics reasons', 'abstract topics reasons weren', 'amounts', 'amounts text', 'amounts text data', 'amounts text data finding', 'analysis', 'analysis lsa', 'analysis lsa comes', 'analysis lsa comes play', 'attempts', 'attempts leverage', 'attempts leverage context', 'attempts leverage context words', 'capture', 'capture concept', 'capture concept understand', 'capture concept understand context', 'capture hidden', 'capture hidden concepts', 'capture hidden concepts known', 'capture misunderstood', 'capture misunderstood humans', 'capture misunderstood humans include', 'clusters', 'clusters words', 'clusters words similarity', 'clusters words similarity documents', 'comes', 'comes play', 'comes play attempts', 'comes play attempts leverage', 'compelling', 'compelling topic', 'compelling 

In [216]:
import gensim  # needed for the gensim.utils and gensim.parsing calls below
from gensim.models import TfidfModel
from gensim.corpora import Dictionary
from gensim.parsing.preprocessing import remove_stopwords

corpus_gen = [gensim.utils.simple_preprocess(remove_stopwords(doc))  for doc in corpus]
print(corpus_gen[0])

['topic', 'modeling', 'helps', 'exploring', 'large', 'amounts', 'text', 'data', 'finding', 'clusters', 'words', 'similarity', 'documents', 'discovering', 'abstract', 'topics', 'as', 'reasons', 'weren', 'compelling', 'enough', 'topic', 'modeling', 'search', 'engines', 'search', 'string', 'matched', 'results', 'getting', 'interesting', 'isn', 'it', 'well', 'read', 'then']


In [217]:
dct = Dictionary(corpus_gen)
corpus_doc2bow = [dct.doc2bow(line) for line in corpus_gen]  # convert corpus to BoW format
model = TfidfModel(corpus_doc2bow)  # fit model
vector = model[corpus_doc2bow]  # apply the model to every document in the corpus

In [218]:
len([dct.get(i) for doc in vector for i, value in doc])

95

In [219]:
[dct.get(i) for doc in vector for i, value in sorted(doc, key=lambda x:x[1])[-5:]]

['well',
 'weren',
 'modeling',
 'search',
 'topic',
 'sometimes',
 'spelling',
 'they',
 'thing',
 'different',
 'semantic',
 'used',
 'able',
 'understand',
 'context',
 'vector',
 'document',
 'matrix',
 'term',
 'use']

In [221]:
corpus_gen = gensim.parsing.preprocessing.preprocess_documents(corpus)
dct = Dictionary(corpus_gen)
corpus_doc2bow = [dct.doc2bow(line) for line in corpus_gen]  # convert corpus to BoW format
model = TfidfModel(corpus_doc2bow)  # fit model
vector = model[corpus_doc2bow]  # apply the model to every document in the corpus

In [222]:
len([dct.get(i) for doc in vector for i, value in doc])

77

In [224]:
[dct.get(i) for doc in vector for i, value in doc]

['abstract',
 'amount',
 'cluster',
 'compel',
 'data',
 'discov',
 'document',
 'engin',
 'explor',
 'find',
 'get',
 'help',
 'interest',
 'isn’t',
 'larg',
 'match',
 'model',
 'read',
 'reason',
 'result',
 'search',
 'similar',
 'string',
 'text',
 'topic',
 'weren’t',
 'word',
 'word',
 'captur',
 'differ',
 'difficult',
 'human',
 'includ',
 'intricaci',
 'languag',
 'machin',
 'mean',
 'misunderstood',
 'nuanc',
 'spell',
 'they’r',
 'thing',
 'topic',
 'word',
 'captur',
 'machin',
 'abl',
 'analysi',
 'attempt',
 'come',
 'concept',
 'context',
 'distinguish',
 'easili',
 'hidden',
 'known',
 'latent',
 'leverag',
 'lsa',
 'plai',
 'semant',
 'understand',
 'document',
 'decompos',
 'decomposit',
 'matric',
 'matrix',
 'multipl',
 'perform',
 'repres',
 'sklearn’',
 'step',
 'task',
 'term',
 'truncatedsvd',
 'us',
 'vector']

In [232]:
top_among_docs = [(i,value) for doc in vector for i, value in sorted(doc, key=lambda x:x[1])[-10:]]
sorted([(dct.get(i),value) for i,value in sorted(top_among_docs, key=lambda x: x[1])[-20:]], key= lambda x:x[1])

[('includ', 0.2314203798197213),
 ('intricaci', 0.2314203798197213),
 ('languag', 0.2314203798197213),
 ('misunderstood', 0.2314203798197213),
 ('nuanc', 0.2314203798197213),
 ('spell', 0.2314203798197213),
 ('they’r', 0.2314203798197213),
 ('thing', 0.2314203798197213),
 ('topic', 0.26294325735857826),
 ('abl', 0.33715249805902553),
 ('concept', 0.33715249805902553),
 ('understand', 0.33715249805902553),
 ('model', 0.3505910098114377),
 ('search', 0.3505910098114377),
 ('matrix', 0.4082482904638631),
 ('term', 0.4082482904638631),
 ('us', 0.4082482904638631),
 ('differ', 0.4628407596394426),
 ('mean', 0.4628407596394426),
 ('context', 0.5057287470885382)]

### Transformers:
* Why am I targeting Transformers?
    * Ans: I am trying to understand BERT: Bidirectional Encoder Representations from Transformers.

* Why am I reading about BERT?
    * Ans: To improve the performance of the text classification task.

* So what are Transformers?
    * A Transformer is a type of neural network built around attention (self-attention) rather than recurrence or convolution.
    * They are built for sequence transduction, such as neural machine translation.
    * Applications: text-to-speech, speech recognition etc.

* What are Attention Models?

* What is sequence transduction?