<h1>An overview of NLP</h1>

<p>The field of study that focuses on the interactions between human language and computers is called Natural Language Processing, or NLP for short. It sits at the intersection of computer science, artificial intelligence, and computational linguistics (<a href="https://en.wikipedia.org/wiki/Natural_language_processing">Wikipedia</a>).</p>

<p>Humans have been writing things down for thousands of years. Over that time, our brains have gained a tremendous amount of experience in understanding natural language. When we read something written on a piece of paper or in a blog post on the internet, we understand what it really means in the real world. We feel the emotions that the text elicits, and we often visualise how the thing described would look in real life.</p>

<p>Natural Language Processing (NLP) is a sub-field of Artificial Intelligence that is focused on enabling computers to understand and process human languages, to get computers closer to a human-level understanding of language.</p>



<h1>Road trip of NLP</h1>

<ul>
    <li><strong>Initial KMs.</strong>
        <ul>
            <li>
                In the early days, many language-processing systems were designed by hand-coding a set of rules, e.g. by writing grammars or devising heuristic rules for stemming.
            </li>
        </ul>
    </li>
    <li><strong>After some hundred KMs</strong>
        <ul>
            <li>
                Since the late 1980s and mid-1990s, much natural language processing research has relied heavily on machine learning. Researchers tried to learn such rules automatically through the analysis of large corpora of typical real-world examples. Many different classes of machine-learning algorithms have been applied to natural-language-processing tasks.
            </li>
            <li>
                Some of the earliest-used algorithms, such as <i>decision trees</i>, produced systems of hard if-then rules similar to the systems of hand-written rules that were then common.
            </li>
        </ul>
    </li>
    <li><strong>Later on the road</strong>
        <ul>
            <li>
                Research focused on <i>statistical models</i>, which make soft, probabilistic decisions based on attaching real-valued weights to each input feature. Such models have the advantage that they can express the relative certainty of many different possible answers rather than only one.
            </li>
        </ul>
    </li>
</ul>
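<p>The difference between hard if-then rules and soft, weighted decisions can be sketched in a few lines of Python. This is only an illustration: the function names and the weights below are made up for this example, whereas a real statistical model would learn its weights from a corpus.</p>

```python
import math

def rule_based_sentiment(tokens):
    """Hard if-then rules: a single hand-written check decides the label."""
    if "excellent" in tokens:
        return "positive"
    if "disappointing" in tokens:
        return "negative"
    return "unknown"

# Hypothetical real-valued weights, one per input feature (token).
WEIGHTS = {"excellent": 2.0, "good": 1.0, "not": -1.5, "disappointing": -2.0}

def positive_probability(tokens):
    """Soft decision: sum the token weights and squash the score with a sigmoid."""
    score = sum(WEIGHTS.get(token, 0.0) for token in tokens)
    return 1.0 / (1.0 + math.exp(-score))

tokens = "not good".split()
print(rule_based_sentiment(tokens))            # unknown
print(round(positive_probability(tokens), 3))  # 0.378
```

<p>The rule-based function has nothing to say about "not good", while the weighted model still returns a degree of certainty for it.</p>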



<h1>Why is NLP difficult?</h1>

<strong>“Virat Kohli was wreaking havoc last night. He literally smashed every ball out of the park.”</strong>
<p>What can we understand from this news statement?</p>
<p>To us humans it is obvious what this means. We know that Virat Kohli is a batsman, that he played brilliantly, and that he scored a lot of runs against the opposition. The park here refers to the stadium and not to some <i>park</i>.</p>
<p>But computers don't work like humans, and they cannot <i>read between the lines</i>. To them, wreaking havoc means that Virat Kohli caused some kind of disaster, and park means a <i>park</i> and not a stadium.</p>

<p>This is what NLP actually is: <strong>NLP is a workout program that researchers have designed for computers, which computers follow in an attempt to reach human-level strength.</strong> Researchers keep improving the program based on how the computers perform.</p>

<p>If you are human, you understood what the statement above really means. If you are not human and still understood it, then NLP is working at its imagined best.</p>

<h1>Major Applications and researched tasks of NLP</h1>
<p>The following is a list of some of the most commonly researched tasks in natural language processing. Some of these tasks have direct real-world applications, while others are used in solving larger tasks. Though natural language processing tasks are closely intertwined, they are frequently subdivided into categories for convenience. A coarse division is given below.</p>

<h3>Syntax</h3>
<ol>
    <li><strong>Grammar induction:</strong> Generate a formal grammar that describes a language's syntax.</li>
    <li><strong>Lemmatization:</strong> Remove inflectional endings and return the base dictionary form of a word, which is also known as a lemma.</li>
    <li><strong>Morphological segmentation:</strong> Separate words into individual morphemes (the smallest grammatical units in a language) and identify the class of the morphemes.</li>
    <li><strong>Part-of-speech tagging:</strong> Given a sentence, determine the part of speech for each word.</li>
    <li><strong>Parsing:</strong> Given a sentence, determine its parse tree (grammatical analysis).</li>
    <li><strong>Sentence breaking:</strong> Given a chunk of text, find the sentence boundaries.</li>
    <li><strong>Stemming:</strong> Reducing words to their root form.</li>
    <li><strong>Word segmentation:</strong> Separate a chunk of continuous text into separate words.</li>
    <li><strong>Terminology extraction:</strong> Extract relevant terms from a given corpus.</li>
</ol>

<h3>Semantics</h3>
<ol>
    <li><strong>Lexical semantics:</strong> What is the computational meaning of individual words in context?</li>
    <li><strong>Machine translation:</strong> Automatically translate text from one human language to another.</li>
    <li><strong>Named entity recognition (NER):</strong> Given a stream of text, determine which items in the text map to proper names, such as people or places. NER is not limited to proper names; it also covers other entities such as amounts of money and times.</li>
    <li><strong>Natural language generation:</strong> Convert information from computer databases or semantic intents into readable human language.</li>
    <li><strong>Natural language understanding:</strong> Convert chunks of text into more formal representations that are easier for computer programs to manipulate.</li>
    <li><strong>Optical character recognition (OCR):</strong> Given an image representing printed text, determine the corresponding text.</li>
    <li><strong>Question answering:</strong> Given a human-language question, determine its answer.</li>
    <li><strong>Recognizing textual entailment:</strong> Given two text fragments, determine if one being true entails the other, entails the other's negation, or allows the other to be either true or false.</li>
    <li><strong>Relationship extraction:</strong> Given a chunk of text, identify the relationships among named entities (e.g. who is married to whom).</li>
    <li><strong>Sentiment analysis:</strong> Given a document or a text, extract the sentiment.</li>
    <li><strong>Topic segmentation and recognition:</strong> Given a chunk of text, separate it into segments, each of which is devoted to a topic, and identify the topic of each segment.</li>
    <li><strong>Word sense disambiguation:</strong> Many words have more than one meaning; we have to select the meaning which makes the most sense in context.</li>
</ol>

<h3>Discourse</h3>
<ol>
    <li><strong>Automatic summarization:</strong> Produce a readable summary of a chunk of text.</li>
    <li><strong>Coreference resolution:</strong> Given a sentence or larger chunk of text, determine which words ("mentions") refer to the same objects ("entities"). For example: "Ronaldo gave Messi a shirt. They played together. He gave him his boots." Deciding whether "he" refers to Ronaldo or to Messi is the job of coreference resolution.</li>
    <li><strong>Discourse analysis:</strong> Approaches to analyze written, vocal, or sign language use, or any significant semiotic event.</li>
</ol>

<h3>Speech</h3>
<ol>
    <li><strong>Speech recognition:</strong> Given a sound clip of a person or people speaking, determine the textual representation of the speech.</li>
    <li><strong>Speech segmentation:</strong> Given a sound clip of a person or people speaking, separate it into words.</li>
    <li><strong>Text-to-speech:</strong> Given a text, produce a spoken representation of it.</li>
    <li><strong>Dialogue:</strong> Holding a conversation with humans.</li>
 
</ol>
<p>For more on applications and researched tasks, refer to <a href="https://en.wikipedia.org/wiki/Natural_language_processing">Wikipedia</a>.</p>

<h1>How NLP works</h1>
<p>To understand NLP better, let us try to do <strong>Sentiment Analysis</strong>, which is a text classification task, and see how NLP works.<br>
Note: This is a very basic example of NLP, but I hope it gives you the gist of how things work in practice.</p>

<h3>Use case</h3>
<p>We will build a sentiment classifier for movie reviews. That is, we will try to classify a review as positive or negative based on the marker words it contains.</p>
<h3>Let us have a look at the dataset</h3>
<ul>
    <li>We will use the IMDB movie review dataset. This dataset is annotated with positive and negative labels thanks to researchers at Stanford. The dataset can be accessed at this <a href="http://ai.stanford.edu/~amaas/data/sentiment/">link</a>.</li>
    <li>This dataset contains 25,000 positive and 25,000 negative examples that are pre-annotated with labels. It also contains unannotated examples that researchers can use for further work.</li>
    <li>The train and test folders each contain 12,500 positive and 12,500 negative examples. Each example is a movie review stored in a separate text file, so we have 25K text files for training and 25K for testing.</li>
    <li>The dataset contains at most 30 reviews per movie, so that no single movie becomes too influential.</li>
    <li>A review that gives the movie 7 or more stars out of 10 is labelled positive, and a review that gives 4 or fewer stars is labelled negative.</li>
    <li>Positive and negative examples are equal in number, so accuracy can be used as a metric.</li>
</ul>
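<p>Reading such a folder layout into memory could look roughly like the sketch below. It assumes the archive has been extracted into a folder with <code>train</code> and <code>test</code> sub-folders, each holding a <code>pos</code> and a <code>neg</code> folder of <code>.txt</code> files. The function name <code>load_reviews</code> and the root folder name are choices made for this example only.</p>

```python
from pathlib import Path

def load_reviews(root, split):
    """Return (texts, labels) for one split; label 1 = positive, 0 = negative."""
    texts, labels = [], []
    for folder_name, label in (("pos", 1), ("neg", 0)):
        folder = Path(root) / split / folder_name
        # Each review lives in its own text file.
        for path in sorted(folder.glob("*.txt")):
            texts.append(path.read_text(encoding="utf-8"))
            labels.append(label)
    return texts, labels

# Example call, assuming the data was extracted to a folder named "aclImdb":
# train_texts, train_labels = load_reviews("aclImdb", "train")
```

<p>Because the classes are balanced, the fraction of correctly predicted labels over such a list is a reasonable first metric.</p>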
<h3>Let us see an example of positive and negative review.</h3>
<p><strong>Positive:</strong> The movie was very nice. It was full of surprises and fun. Characters were at their best.<br>
<strong>Negative:</strong> This movie hurt my expectations. I had expected much more from this director. Not a good movie.</p>

<h3>Let us formally write our input and output</h3>
<p>
    <strong>Input</strong>: Text of review.<br>
    <strong>Output</strong>: Class of sentiment: Positive or Negative.<br>
    Note: There could be more output classes, such as slightly positive, but for the sake of understanding we use just two.
</p>


<h2>Text Preprocessing</h2>
<p>Since the reviews are in string format, and most Machine Learning algorithms take numeric features as input, we have to somehow convert the reviews into numeric features.</p>
<p>Before the conversion, we will also process the text to get better results. The following are the basic and most important pre-processing steps, which are needed for most NLP tasks.</p>

<h3>1. Tokenization</h3>
<p>First we would like to split the input sequence into individual tokens. A token can be thought of as a useful unit for semantic processing. Put simply, tokenization means breaking the text into small meaningful segments.</p>
<p>There are three well-known ways of tokenizing text:</p>
<ul>
    <li>White Space Tokenizer.</li>
    <li>Tokenizing using punctuation.</li>
    <li>Using grammar to tokenize.</li>
</ul>
<p>Let us look at each of them using Python's NLTK library.</p>

In [1]:
#importing nltk library.
import nltk 

#taking an example text, to see how different tokenization works.
text = "This is Andrew's text, isn't it?"

In [2]:
#White Space Tokenizer
tokenizer = nltk.tokenize.WhitespaceTokenizer()
tokenizer.tokenize(text)

['This', 'is', "Andrew's", 'text,', "isn't", 'it?']

<p><strong>Problem:</strong> Here <i>it?</i> is a separate token that will not match <i>it</i> when the two are compared. But <i>it</i> and <i>it?</i> have the same meaning, so we would like to treat them as the same token.</p>
<p>Let us try tokenizing using punctuation.</p>

In [3]:
#Tokenizing on the basis of punctuations
tokenizer = nltk.tokenize.WordPunctTokenizer()
tokenizer.tokenize(text)

['This', 'is', 'Andrew', "'", 's', 'text', ',', 'isn', "'", 't', 'it', '?']

<p><strong>Problem:</strong> <i>s</i>, <i>isn</i> and <i>t</i> are not very meaningful.</p>
<p>Let us try to tokenize on the basis of some grammar rules of English.</p>

In [4]:
#Treebank tokenizer. It follows the Penn Treebank tokenization conventions, e.g. it splits contractions such as "isn't".
tokenizer = nltk.tokenize.TreebankWordTokenizer()
tokenizer.tokenize(text)

['This', 'is', 'Andrew', "'s", 'text', ',', 'is', "n't", 'it', '?']

<h3>2. Token Normalization</h3>
<p>Take a look at these word pairs:</p>
<ul>
    <li>wolf, wolves</li>
    <li>talk, talks</li>
    <li>kick, kicked</li>
</ul>
<p>The words in each pair look different, but they convey almost the same meaning. So it would be very beneficial to have only one representation for both of them. The same applies to any number of variations of the same word.</p>
<p>For this we can do Token Normalization. We will explore it in code, but first some theory to understand what happens under the hood.</p>

<p>There are basically two very famous processes to do Token Normalization.</p>
<ol>
    <li>Stemming
        <ul>
            <li>A process of removing and replacing suffixes to get to the root form of the word which is called the <i>stem</i>.</li>
            <li>Usually refers to the heuristics that chop off suffixes.</li>
            <li>Example: Porter's stemmer, the oldest and most famous stemmer for the English language. It applies 5 phases of suffix rules sequentially. It fails on irregular forms and produces non-words, but it still works in practice.</li>
            <li>Some conversions: feet -> feet, wolves -> wolv, cats -> cat, talked -> talk</li>
        </ul>
    </li>
    <li>Lemmatization
        <ul>
            <li>Usually refers to doing things properly with the use of a vocabulary and morphological analysis.</li>
            <li>Returns the base or dictionary form of a word, which is called the <i>lemma</i>.</li>
            <li>Example: WordNet Lemmatizer. It uses the WordNet database to look up lemmas. Not all forms are reduced.</li>
            <li>Some conversions: feet -> foot, wolves -> wolf, cats -> cat, talked -> talked</li>
        </ul>
    </li>
</ol>
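<p>To make the difference concrete, here is a deliberately tiny sketch of both ideas. The suffix rules are a toy set invented for this example (they are not Porter's actual rules), and the lemma dictionary is a hand-made stand-in for a real lexical database such as WordNet.</p>

```python
# Toy suffix rules, tried in order. NOT Porter's actual rule set.
SUFFIX_RULES = [("ies", "i"), ("es", ""), ("ed", ""), ("ing", ""), ("s", "")]

def toy_stem(word):
    """Chop the first matching suffix, keeping at least three characters."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

# Tiny hand-made vocabulary standing in for a real lexical database.
LEMMA_DICTIONARY = {"feet": "foot", "wolves": "wolf", "cats": "cat"}

def toy_lemmatize(word):
    """Look the word up; forms missing from the vocabulary stay unchanged."""
    return LEMMA_DICTIONARY.get(word, word)

words = ["feet", "wolves", "cats", "talked"]
print([toy_stem(w) for w in words])       # ['feet', 'wolv', 'cat', 'talk']
print([toy_lemmatize(w) for w in words])  # ['foot', 'wolf', 'cat', 'talked']
```

<p>The stemmer never consults a vocabulary, so it is happy to produce non-words, while the lemmatizer is only as good as the vocabulary behind it.</p>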
<p>Now let's do the same with Python's NLTK library.</p>

In [5]:
#Taking a simple text to show how token normalization works
text = "feet cats wolves talked tripping trick footballs"
tokenizer = nltk.tokenize.TreebankWordTokenizer()
tokens = tokenizer.tokenize(text)

In [6]:
#Stemming using Porter Stemmer
stemmer = nltk.stem.PorterStemmer()
" ".join(stemmer.stem(token) for token in tokens)

'feet cat wolv talk trip trick footbal'

<p><strong>Problem: </strong><i>feet</i> is not converted into <i>foot</i>. There are also non-words such as <i>wolv</i> and <i>footbal</i>.</p>

In [7]:
#Lemmatization using WordNetLemmatizer
lemmatizer = nltk.stem.WordNetLemmatizer()
" ".join(lemmatizer.lemmatize(token) for token in tokens)

'foot cat wolf talked tripping trick football'

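<p><strong>Problem: </strong><i>talked</i> is not converted into <i>talk</i>, and <i>tripping</i> is not converted either. This happens because the WordNet lemmatizer treats every token as a noun unless it is told the part of speech.</p>
<p><strong>Note:</strong> We need to try both stemming and lemmatization and choose whichever works best for our task.</p>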

<h3>3. Further Normalization</h3>
<ul>
    <li><strong>Normalizing Capital letters</strong>
        <ul>
            <li><i>Us</i> and <i>us</i> will be <i>us</i> if both are pronouns.</li>
            <li><i>Us</i> or <i>us</i> could also refer to the country (US). The tricky part is telling whether a given token is the country or the pronoun.</li>
            <li>We can use heuristics for this:
                <ul>
                    <li>Lowercase the beginning of a sentence, since a sentence typically starts with an uppercase letter.</li>
                    <li>Lowercase words in titles.</li>
                    <li>Leave mid-sentence words as they are, because a capitalized word in the middle of a sentence may be a named entity.</li>
                </ul>
            </li>
        </ul>
    </li>
    <li><strong>Dealing with Acronyms</strong>
        <ul>
            <li>The same acronym can be written in multiple forms. For example, eta, e.t.a. and E.T.A. are all acronyms for Estimated Time of Arrival. So it is better to convert all the forms into one single form.</li>
        </ul>
    </li>
    
</ul>
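<p>These heuristics are simple enough to sketch directly. The two helper functions below are hypothetical names written for this example: one lowercases only the first token of a sentence, the other collapses dotted acronym variants into a single form.</p>

```python
import re

def lowercase_sentence_start(tokens):
    """Lowercase only the first token; mid-sentence capitals may be named entities."""
    if not tokens:
        return tokens
    return [tokens[0].lower()] + tokens[1:]

def normalize_acronym(token):
    """Map dotted variants such as e.t.a. and E.T.A. to one form (eta)."""
    # Two or more single letters, each followed by a dot.
    if re.fullmatch(r"(?:[A-Za-z]\.){2,}", token):
        return token.replace(".", "").lower()
    return token

print(lowercase_sentence_start(["Us", "and", "the", "US"]))
# ['us', 'and', 'the', 'US']
print([normalize_acronym(t) for t in ["eta", "e.t.a.", "E.T.A."]])
# ['eta', 'eta', 'eta']
```

<p>Both are heuristics and will make mistakes. For instance, a sentence that starts with a name loses its capital letter.</p>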



<h2>Feature Extraction</h2>
<p>Now that we have pre-processed the text into normalized tokens, we will extract features from it to feed into a machine learning algorithm.</p>
<p>For this task we will vectorize the text. In simple words, we will convert the tokens of each text into a vector of numbers. There are three well-known approaches, each with its own merits and demerits. We will look at all three of them one by one.</p>

<h3>1. Bag Of Words (BOW)</h3>
<p>In BOW we count the number of occurrences of each token in our text. The motivation for this approach is that we are looking for marker words like <i>excellent</i> and <i>disappointing</i>, which can help in discriminating between a positive and a negative review.</p>
<p>After counting occurrences we end up with a feature vector for each text as a whole as well as for every individual token. Let us look at a very small example of BOW vectorization.</p>
<p>For this example we will take three movie reviews:</p>
    <ul>
        <li>good movie</li>
        <li>not a good movie</li>
        <li>did not like</li>
    </ul>
<table>
    <th>Review/Token</th><th>good</th><th>movie</th><th>not</th><th>a</th><th>did</th><th>like</th>
    <tr>
        <th>good movie</th><td>1</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td>
    </tr>
    <tr>
        <th>not a good movie</th><td>1</td><td>1</td><td>1</td><td>1</td><td>0</td><td>0</td>
    </tr>
    <tr>
        <th>did not like</th><td>0</td><td>0</td><td>1</td><td>0</td><td>1</td><td>1</td>
    </tr>
</table>

<p>As we can see from the table, we have a count for every word in each review. Note that a count can be greater than 1 when a word occurs several times in a review. Each row is taken as the vector for the corresponding review, and each column is a vector representation of the corresponding word.</p>
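<p>The same counts can be reproduced with a few lines of plain Python. This is a minimal sketch: the column order of the table is hard-coded as the vocabulary, and <code>bag_of_words</code> is a name chosen for this example.</p>

```python
from collections import Counter

reviews = ["good movie", "not a good movie", "did not like"]
# Column order taken from the table above.
vocabulary = ["good", "movie", "not", "a", "did", "like"]

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary token occurs in the text."""
    counts = Counter(text.split())
    return [counts[token] for token in vocabulary]

vectors = [bag_of_words(review, vocabulary) for review in reviews]
for review, vector in zip(reviews, vectors):
    print(review, vector)
# good movie [1, 1, 0, 0, 0, 0]
# not a good movie [1, 1, 1, 1, 0, 0]
# did not like [0, 0, 1, 0, 1, 1]
```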
<p><strong>Problems with this method</strong></p>
    <ul>
        <li>Since it is a BOW representation, we lose the word ordering.</li>
        <li>Counters are not normalized.</li>
    </ul>
     

<h3>2. BOW with n-grams</h3>
<p>The problem of losing word ordering can be reduced with n-grams, where n is a positive integer.<br>
       Let us see what we mean by n-grams:<br>
        1-gram: single tokens. Example: good, movie, etc.<br>
        2-gram: token pairs. Example: good movie, did not, etc.<br>
        and so on.
</p>
<p>Now let us see how it looks. We will take the same three reviews as in the case of BOW. And we will include 2-grams also.</p>
<table>
    <th>Review/Token</th><th>good</th><th>good movie</th><th>movie</th><th>not</th><th>not good</th><th>did</th><th>did not</th><th>like</th><th>not like</th>
    <tr>
        <th>good movie</th><td>1</td><td>1</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td>
    </tr>
    <tr>
        <th>not a good movie</th><td>1</td><td>1</td><td>1</td><td>1</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td>
    </tr>
    <tr>
        <th>did not like</th><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>1</td><td>1</td><td>1</td><td>1</td>
    </tr>
</table>
<p>Now we have preserved some word ordering using 2-grams. As in the last case, each row corresponds to a movie review. Please note that in the above table we have removed stop words (n-grams that are too frequent), which is why <i>a</i> is gone and <i>not good</i> appears as a 2-gram.</p>
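<p>Generating the 1-grams and 2-grams for a review takes only a small helper. The sketch below assumes that <i>a</i> is the only stop word, as in the table, and the function names are made up for this example.</p>

```python
# Assumption: "a" is the only stop word, matching the table above.
STOP_WORDS = {"a"}

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def unigrams_and_bigrams(text):
    """Drop stop words, then collect 1-grams followed by 2-grams."""
    tokens = [token for token in text.split() if token not in STOP_WORDS]
    return ngrams(tokens, 1) + ngrams(tokens, 2)

print(unigrams_and_bigrams("not a good movie"))
# ['not', 'good', 'movie', 'not good', 'good movie']
```

<p>Counting these n-grams per review, exactly as in plain BOW, gives the rows of the table.</p>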
<p><strong>Problems with n-gram representation</strong></p>
    <ol>
        <li>Too many features: we added just 2-grams for only 3 small reviews and already got 9 features.
            What if we had hundreds of reviews and thousands of distinct tokens? Adding just 2-grams could then push the number of features into the millions.
        </li>
        <li>Counters are not normalized.</li>
    </ol>
<p><strong>Resolving the issue of too many features</strong><br>
    Since there can be too many features, we can remove some n-grams based on how frequently they occur in the documents of our corpus.</p>
    <ul>
        <li>High Frequency n-grams
            <ul>
                <li>Articles, prepositions, etc. Example: and, a, the, etc.</li>
                <li>They are called <strong><i>Stop-Words</i></strong>. They won't help us discriminate between texts, which is our task at hand. In most NLP tasks we remove stop words. This is why we removed stop words in the above case of BOW with n-grams.
                </li>
            </ul>
        </li>
        <li>Low Frequency n-grams
            <ul>
                <li>Typos, rare n-grams</li>
                <li>We do not need them, as they are very likely to cause overfitting. An n-gram that appears in only one or two training reviews can look like a perfect feature, so the classifier may learn to output a label just from seeing it. We do not want such dependencies.</li>
            </ul>
        </li>
    </ul>
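<p>Filtering by document frequency can be sketched as follows. The thresholds <code>min_df</code> and <code>max_df_ratio</code> are illustrative values picked for this tiny corpus, and the function names are invented for this example.</p>

```python
from collections import Counter

def document_frequency(documents):
    """Count in how many documents each token occurs (not how many times)."""
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))
    return df

def filter_vocabulary(documents, min_df=2, max_df_ratio=0.5):
    """Keep tokens seen in at least min_df documents and in at most
    max_df_ratio of all documents."""
    df = document_frequency(documents)
    n_docs = len(documents)
    return sorted(
        token for token, count in df.items()
        if count >= min_df and count / n_docs <= max_df_ratio
    )

reviews = ["good movie", "not a good movie", "did not like", "i like it", "good one"]
documents = [review.split() for review in reviews]
print(filter_vocabulary(documents))  # ['like', 'movie', 'not']
```

<p>Here <i>good</i> is dropped as too frequent (3 of 5 reviews), while tokens such as <i>did</i> and <i>one</i> are dropped as too rare.</p>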


<h3>3. TF-IDF Vectorization</h3>
<p>As we saw in the previous case, some high and low frequency n-grams should be removed. That leaves the medium frequency n-grams, and since we deal with a lot of data, there can still be a very large number of them.<br>
    <i>We will try to determine which medium frequency n-grams are more useful than others</i> in the task at hand.
    </p>
<p><strong>Motivation</strong>: The motivation behind this is that n-grams with a smaller frequency can be more discriminating than others because they can capture a specific issue in the review.<br>
For example, a particular hotel review may complain about the wi-fi, which is a big issue for that guest but does not come up in every review.<br>
    So we want to extract n-grams that are common in one document and rare in the others, so that they highlight some particular issue.</p>
<p>For this task we can use <strong>TF-IDF Vectorization</strong>.</p>
<p><strong>TF-IDF: a better BOW</strong><br>
    We can replace the counters in BOW vectors with TF-IDF values and then normalize the results row-wise. This way we resolve the non-normalized counter issue of simple BOW vectors.</p>

<p>All this can be done using scikit-learn's <a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html">TF-IDF vectorizer</a>. To understand how TF-IDF works under the hood, please visit the Wikipedia article on <a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf">TF-IDF</a>.</p>
<p>Let us now look at the implementation of TF-IDF Vectorization in Python.</p>

In [12]:
#importing TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
#importing pandas to display results
import pandas as pd

#We will take 5 reviews to see how Tfidf Vectorizer works.
reviews = ["good movie", "not a good movie", "did not like", "i like it", "good one"]

#Please check the docs of TfidfVectorizer to know about different parameters of TfidfVectorizer.
tfidf = TfidfVectorizer(min_df=2, max_df=0.5, ngram_range=(1,2))

features = tfidf.fit_transform(reviews)
#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
pd.DataFrame(features.toarray(), columns=tfidf.get_feature_names_out())

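<table>
    <tr><th></th><th>good movie</th><th>like</th><th>movie</th><th>not</th></tr>
    <tr><th>0</th><td>0.707107</td><td>0.0</td><td>0.707107</td><td>0.0</td></tr>
    <tr><th>1</th><td>0.57735</td><td>0.0</td><td>0.57735</td><td>0.57735</td></tr>
    <tr><th>2</th><td>0.0</td><td>0.707107</td><td>0.0</td><td>0.707107</td></tr>
    <tr><th>3</th><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td></tr>
    <tr><th>4</th><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>
</table>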


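<p>From the above results we can see that we now have a real-valued weight for each 1-gram and 2-gram that survived the frequency filters. These are the TF-IDF values, and we can use them as our features. The last review, <i>good one</i>, has an all-zero row because none of its n-grams survived.</p>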
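<p>To see where these numbers come from, the values can be recomputed by hand. The sketch below assumes what I understand to be the vectorizer's default settings: raw counts as term frequency, a smoothed inverse document frequency of ln((1 + n) / (1 + df)) + 1, row-wise L2 normalization, and a tokenizer that ignores single-character tokens. The four kept n-grams are hard-coded, and the helper names are invented for this example.</p>

```python
import math

reviews = ["good movie", "not a good movie", "did not like", "i like it", "good one"]
# The n-grams that survived the frequency filters in the output above.
features = ["good movie", "like", "movie", "not"]

def count_ngram(review, ngram):
    """Count a 1- or 2-gram among the review's tokens (single letters dropped)."""
    tokens = [token for token in review.split() if len(token) > 1]
    n = len(ngram.split())
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams.count(ngram)

counts = [[count_ngram(review, f) for f in features] for review in reviews]
n_docs = len(reviews)
# Document frequency: number of reviews that contain each feature.
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(features))]
# Smoothed inverse document frequency.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def l2_normalize(row):
    """Scale a row to unit length; an all-zero row stays all zero."""
    norm = math.sqrt(sum(value * value for value in row))
    return [value / norm if norm else 0.0 for value in row]

tfidf_rows = [l2_normalize([c * w for c, w in zip(row, idf)]) for row in counts]
for row in tfidf_rows:
    print([round(value, 6) for value in row])
```

<p>Every kept n-gram occurs in exactly two reviews, so all four share the same IDF, and the final values are decided by the row-wise normalization alone.</p>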
<p><strong>Summary of feature extraction:</strong>
<br>1. We built simple count features in the BOW manner.
<br>2. We can also add n-grams.
<br>3. We can replace counters with TF-IDF values.
</p>

<h1>Making the Sentiment Classifier</h1>
<p>Now that we have extracted features, we can feed them into any Machine Learning classification model. That said, <i>any</i> is a relative term, and there are always some constraints. In this case our features are very long sparse vectors (lots of zeros), so decision tree models tend to be slow to train and often give poor accuracy. One model that performs very well with long sparse vectors is <strong>Logistic Regression</strong>. It is a linear classification model that is very fast to train.</p>
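<p>To show what such a linear model does with BOW features, here is a from-scratch sketch of logistic regression trained with plain gradient descent. It is meant for understanding only: the toy vocabulary, the four training vectors, the learning rate and the number of epochs are all made up for this example. In practice you would use a library implementation on sparse matrices.</p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, y, learning_rate=0.5, epochs=200):
    """Fit weights and bias with batch gradient descent on the log loss."""
    n_features = len(X[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for features, label in zip(X, y):
            z = sum(w * x for w, x in zip(weights, features)) + bias
            error = sigmoid(z) - label  # prediction minus target
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        # Step against the average gradient.
        weights = [w - learning_rate * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(X)
    return weights, bias

def predict(weights, bias, features):
    """Return 1 (positive) if the predicted probability is at least 0.5."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if sigmoid(z) >= 0.5 else 0

# Toy BOW vectors over the hypothetical vocabulary [good, movie, not, like].
X = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
y = [1, 0, 0, 1]  # 1 = positive, 0 = negative

weights, bias = train_logistic_regression(X, y)
print([predict(weights, bias, features) for features in X])
```

<p>After training, the weight for <i>not</i> ends up negative, which is how the model separates "not good movie" from "good movie" even though both contain the same positive words.</p>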

<h3>Some notes on how to improve the results of Logistic Regression</h3>
<ol>
    <li>Play around with tokenization: special tokens for emojis, exclamation marks, etc. can help. People use these tokens often, and they can tell a lot about the sentiment of a review. For example, a smiling emoji can be a very good indicator of a good review, whereas a sad or angry emoji could point towards a bad review.</li>
    <li>Try different models like SVM, Naive Bayes, etc.</li>
    <li>Throw BOW away and use Deep Learning, though the accuracy gain from Deep Learning for this sentiment analysis task is not mind-blowing.</li>
</ol>
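<p>As a small illustration of the first note, a tokenizer can be taught to keep such markers as tokens of their own. The regular expression below is a hypothetical pattern written for this example. It only knows a handful of ASCII emoticons and runs of exclamation or question marks, and it does not cover Unicode emoji.</p>

```python
import re

# Hypothetical pattern: a few ASCII emoticons, runs of ! or ?, then plain words.
TOKEN_PATTERN = re.compile(r"[:;]-?[()DP]|[!?]+|\w+(?:'\w+)?")

def tokenize_with_markers(text):
    """Split text into words while keeping emoticons and !!! runs as tokens."""
    return TOKEN_PATTERN.findall(text)

print(tokenize_with_markers("Loved it!!! :)"))
# ['Loved', 'it', '!!!', ':)']
print(tokenize_with_markers("Not good :-("))
# ['Not', 'good', ':-(']
```

<p>Tokens such as <code>:)</code> or <code>!!!</code> then become ordinary BOW features that the classifier can weight like any marker word.</p>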

<h1>Let us move further</h1>
<p>Now that we have seen what NLP is, where it is used, and a small example of its usage, I am sure you will be able to explore the enormous world of NLP. The world of NLP is very big, and it takes a huge amount of time and resources to get even a little hold of it. Taming it is altogether a different ball game.</p>
<p>This was the scenario a few years back, but not now, thanks to pre-built NLP libraries and models that make our lives easier in the gigantic world of NLP. Researchers have invested years of their lives in building these libraries and models. Since they have already put in the effort to design benchmark models and libraries for us, instead of building things from scratch to solve a similar NLP problem, we should use those pretrained models and libraries on our own NLP dataset. You have already seen an example of the scikit-learn library, a Python library for Machine Learning, in the sentiment analysis example.</p>
<p>A lot of research has happened, and is still going on, in NLP. So naturally a lot of models and libraries already exist. Some of them are very good and some not so much. To make it easier to choose which libraries and models suit your task, I will explain the most popular ones.</p>
<p>This will be just an overview of libraries and models. Please follow the links for complete information.</p>

<h2>1. NLP Libraries</h2>
<p>There are a lot of NLP libraries available. Here are five of the most popular ones.</p>

<h4>1. <a href="https://www.nltk.org/">Natural Language Toolkit(NLTK)</a></h4>
<ul>
    <li>NLTK is usually the first contender when listing or talking about Python NLP libraries. The Natural Language Toolkit is fairly mature (it has been in development since 2001) and has positioned itself as one of the primary resources when it comes to Python and language processing.</li>
    <li>It is not only written and maintained really well, but also comes packaged with a lot of data, corpora and pre-trained models. Since NLTK was primarily developed as an educational library, there is also a very good free textbook that accompanies the library.</li>
    <li>It supports tasks such as classification, tokenization, stemming, tagging, parsing, and semantic reasoning. Today it serves as an educational foundation for Python developers who are new to NLP.</li>
</ul>


<h4>2. <a href="https://spacy.io/">spaCy</a></h4>
<ul>
    <li>It is the main competitor of NLTK. The two libraries can be used for the same tasks. spaCy is a relatively young project that labels itself as “industrial-strength natural language processing”.</li>
    <li>spaCy is known for its very fast syntactic parser. Moreover, since the toolkit is written in Cython, it is also really speedy overall.</li>
    <li>At the time of writing, spaCy's features include named entity recognition, support for 49+ languages, 16 statistical models for 9 languages, pre-trained word vectors, easy deep learning integration, part-of-speech tagging, built-in visualizers for syntax and NER, etc.</li>
</ul>


<h4>3. <a href = "https://radimrehurek.com/gensim/">Gensim</a></h4>
<ul>
    <li>Gensim is a fairly specialized library that is highly optimized for (unsupervised) semantic (topic) modelling.</li>
    <li>Semantic analysis, and topic modelling in particular, is a very specific sub-discipline of NLP, but an important and exciting one. Gensim is the go-to library for these kinds of NLP and text mining. It’s fast, scalable, and very efficient.</li>
    <li>It can handle large text collections with the help of efficient data streaming and incremental algorithms. Its memory usage and processing speed are highly optimized, which was achieved with the help of another Python library, NumPy.</li>
</ul>

<h4>4. <a href ="https://scikit-learn.org/stable/">scikit-learn</a></h4>
<ul>
    <li>This is one of the most important and widely used Python libraries for Machine Learning. You have already seen this library in use in the sentiment classification example.</li>
    <li>This library provides various tools for text preprocessing.</li>
</ul>

<h4>5. <a href = "https://pypi.org/project/polyglot/">Polyglot</a></h4>
<ul>
    <li>Polyglot is in some respects very different from the libraries we’ve talked about so far. While it has similar features (tokenization, PoS tagging, word embeddings, named entity recognition, etc.), Polyglot is primarily designed for multilingual applications.</li>
    <li>Within the space of simultaneously dealing with various languages, it provides some very interesting features such as language detection and transliteration that are usually not as pronounced in other packages.</li>
    <li>Thanks to NumPy, it also works really fast. Using polyglot is similar to spaCy – it’s very straightforward and will be an excellent choice for projects involving a language spaCy doesn’t support. </li>
</ul>

<h3>Pros and Cons of these libraries</h3>
<table>
    <th style="text-align:center">Library</th><th style="text-align:center">Pros</th><th style="text-align:center">Cons</th>
    <tr>
        <td>NLTK</td>
        <td style="text-align:left">
            <ul>
                <li>The most well-known and full NLP library.</li>
                <li>Many third party extensions.</li>
                <li>Plenty of approaches to each NLP task.</li>
                <li>Fast sentence tokenization.</li>
                <li>Supports the largest number of languages compared to other libraries.</li>
            </ul>
        </td>
        <td style="text-align:left">
            <ul>
                <li>Complicated to learn and use.</li>
                <li>Quite slow.</li>
                <li>In sentence tokenization, NLTK only splits text by sentences, without analyzing the semantic structure.</li>
                <li>Does not provide neural network models.</li>
                <li>No integrated word vectors.</li>
            </ul>
        </td>
    </tr>
    <tr>
        <td>spaCy</td>
        <td style="text-align:left">
            <ul>
                <li>The fastest NLP framework.</li>
                <li>Easy to learn and use because it has one single highly optimized tool for each task.</li>
                <li>Processes objects. More object-oriented compared to other libraries.</li>
                <li>Uses neural networks for training some models.</li>
                <li>Provides built-in word vectors.</li>
                <li>Active support and development</li>
            </ul>
        </td>
        <td style="text-align:left">
            <ul>
                <li>Lacks flexibility, compared to NLTK</li>
                <li>Sentence Tokenization is slower than NLTK</li>
                <li>Doesn't support many languages.</li>
            </ul>
        </td>
    </tr>
    <tr>
        <td>Gensim</td>
        <td style="text-align:left">
            <ul>
                <li>Works with large datasets and processes data streams.</li>
                <li>Provides tf-idf vectorization, word2vec, doc2vec, latent semantic analysis and latent Dirichlet allocation.</li>
                <li>Supports deep learning.</li>
            </ul>
        </td>
        <td style="text-align:left">
            <ul>
                <li>Designed primarily for unsupervised learning.</li>
                <li>Does not have enough tools to provide a full NLP pipeline, so it should be used together with another library (spaCy or NLTK).</li>
            </ul>
        </td>
    </tr>
    <tr>
        <td>scikit-learn</td>
        <td style="text-align:left">
            <ul>
                <li>Has functions that help build bag-of-words (BOW) features for text classification problems.</li>
                <li>Provides a wide variety of algorithms for ML models.</li>
                <li>Has good documentation and intuitive class methods.</li>
            </ul>
        </td>
        <td style="text-align:left">
            <ul>
                <li>For more sophisticated pre-processing (like POS tagging) you must use some other NLP library before using the models of scikit-learn.</li>
                <li>Doesn't use neural networks for text preprocessing.</li>
            </ul>
        </td>
    </tr>
    <tr>
        <td>Polyglot</td>
        <td style="text-align:left">
            <ul>
                <li>Supports a large number of languages (16 to 196 languages, depending on the task).</li>
                <li>Can perform similar tasks as spaCy or NLTK.</li>
            </ul>
        </td>
        <td style="text-align:left">
            <ul>
                <li>Not as popular as, for example, NLTK or spaCy. It can be slow and has weak community support.</li>
            </ul>
        </td>
    </tr>
        
</table>


<h3>How to use these libraries</h3>
<p>Now that we have compared the pros and cons of these libraries, let us see how they work and how we can use them.</p>
<p>We will see how to install each library and walk through a short example of each. We have already seen how <i>scikit-learn</i> works, so we need not discuss it again. For <i>NLTK</i>, <i>spaCy</i> and <i>Polyglot</i> we will take a sample text and see how they perform Named Entity Recognition (NER), i.e. how well they identify which words are entities. For <i>Gensim</i> we will see an example of document similarity. For all other uses of these libraries, please go to their docs. I have already given links to the docs in the heading of each library.</p>
<p>First we will do <strong>NER</strong> using <i>NLTK</i>, <i>spaCy</i> and <i>Polyglot</i>.<br>
</p>

In [114]:
#We will use this paragraph to perform NER with all three libraries.
text = """But Google is starting from behind. The company made a late push
into hardware, and Apple’s Siri, available on iPhones, and Amazon’s Alexa
software, which runs on its Echo and Dot devices, have clear leads in
consumer adoption.
I was born in India on 23/03/1996. Chennai is a coastal city.
"""

<h3>NLTK</h3>

In [58]:
#installing NLTK. 
#If you have any trouble, see: https://pypi.org/project/nltk/
#Please note that I am using Jupyter notebook and Python 3.
!pip install nltk

In [62]:
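#importing tokenizer, POS tagger and NER chunker
from nltk import word_tokenize, pos_tag, ne_chunk

#Note: these functions need NLTK's data packages (tokenizer, tagger and chunker models).
#If one is missing, NLTK raises a LookupError naming the package to fetch with nltk.download().

#performing tokenization followed by POS tagging. Then extracting named entities.
print(ne_chunk(pos_tag(word_tokenize(text))))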

(S
  But/CC
  (PERSON Google/NNP)
  is/VBZ
  starting/VBG
  from/IN
  behind/IN
  ./.
  The/DT
  company/NN
  made/VBD
  a/DT
  late/JJ
  push/NN
  into/IN
  hardware/NN
  ,/,
  and/CC
  (PERSON Apple/NNP)
  ’/NNP
  s/NN
  (PERSON Siri/NNP)
  ,/,
  available/JJ
  on/IN
  (ORGANIZATION iPhones/NNS)
  ,/,
  and/CC
  (PERSON Amazon/NNP)
  ’/NNP
  s/VBD
  (PERSON Alexa/NNP)
  software/NN
  ,/,
  which/WDT
  runs/VBZ
  on/IN
  its/PRP$
  Echo/NNP
  and/CC
  (ORGANIZATION Dot/NNP)
  devices/NNS
  ,/,
  have/VBP
  clear/JJ
  leads/NNS
  in/IN
  consumer/NN
  adoption/NN
  ./.
  I/PRP
  was/VBD
  born/VBN
  in/IN
  (GPE India/NNP)
  on/IN
  23/03/1996/CD
  ./.
  (PERSON Chennai/NNP)
  is/VBZ
  a/DT
  coastal/JJ
  city/NN
  ./.)


<p>The output is in the form of token/POS-tag pairs. You can see that NLTK has split the whole text into tokens, then performed POS tagging, and finally extracted the named entities. All the named entities are in brackets.<br>
For example, (GPE India/NNP) means that India is a GPE (geo-political entity, i.e. a location) and its POS tag is NNP (noun, proper, singular).<br>
    There are a lot of POS tags; to learn about all of them, please refer to this StackOverflow <a href = "https://stackoverflow.com/questions/15388831/what-are-all-possible-pos-tags-of-nltk">answer.</a>
</p>
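<p>If you want the entities as a Python list rather than reading them off the printed tree, they can be pulled out of the bracketed text. The sketch below is only an illustration that uses the standard library on the last part of the output above; the helper name <i>extract_entities</i> is made up for this example. With the live tree object, walking <i>tree.subtrees()</i> would be the more direct route.</p>

```python
import re

# The last part of the tree printed above, wrapped in (S ...) again.
printed_tree = """(S
  I/PRP
  was/VBD
  born/VBN
  in/IN
  (GPE India/NNP)
  on/IN
  23/03/1996/CD
  ./.
  (PERSON Chennai/NNP)
  is/VBZ
  a/DT
  coastal/JJ
  city/NN
  ./.)"""


def extract_entities(tree_text):
    """Return (label, entity text) pairs from a printed ne_chunk tree."""
    entities = []
    # Each named entity is printed as "(LABEL token/TAG token/TAG ...)".
    for label, body in re.findall(r"\((\w+) ([^()]+)\)", tree_text):
        # Split on the last "/" so tokens that contain "/" stay intact.
        words = [pair.rsplit("/", 1)[0] for pair in body.split()]
        entities.append((label, " ".join(words)))
    return entities


print(extract_entities(printed_tree))
```

<p>Note that the date 23/03/1996 does not appear in the result, because NLTK tagged it only as a number (CD) and not as an entity.</p>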

<h3>spaCy</h3>

In [64]:
#installing spaCy. 
#If you have any trouble, see: https://spacy.io/usage
#Please note that I am using Jupyter notebook and Python 3.
!pip install -U spacy

#To use spaCy you need a language model. The model that we need depends on our language and also on our usage. 
#Please refer to the spaCy installation link above for full details. 
#We will download the English model for our usage.
#Note: the "en" shortcut comes from older spaCy releases. As of spaCy v3 such shortcuts are deprecated and the full model name below should be used.
!python -m spacy download en

#To download the small English statistical model. Medium and large models are also available.
!python -m spacy download en_core_web_sm

In [103]:
#importing spaCy and the small English statistical model, then loading it
import spacy
import en_core_web_sm
nlp = en_core_web_sm.load()

#performing NER over the same text.
doc = nlp(text)
for entity in doc.ents:
    print(entity, entity.label_)

Google ORG

 GPE
Apple ORG
iPhones PRODUCT
Amazon ORG
Alexa
 ORG
Echo GPE
Dot ORG

 GPE

 GPE
India GPE
23/03/1996 DATE
Chennai PERSON

 GPE


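<p>In spaCy, when we perform NER, only the NER output is shown; all the other details are hidden. The output of spaCy's NER is in the form of entity, NER tag pairs. For example, 23/03/1996 DATE means that 23/03/1996 is a date and DATE is its NER tag. The blank entries tagged GPE (and the line break after Alexa) appear to be the newline characters of the text, which the model has picked up as entities.</p>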
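<p>The whitespace-only entries in the output are easy to filter out before using the result. The sketch below is a small illustration on pairs copied by hand from the output above; reading the blank lines as newline characters is my interpretation of that output, and the helper name <i>clean_entities</i> is made up for this example. With a live <i>doc</i>, the pairs could be built from <i>entity.text</i> and <i>entity.label_</i> instead.</p>

```python
# (text, label) pairs copied from the spaCy output above. "\n" stands for the
# newline characters that seem to have been reported as entities (an assumption).
raw_entities = [
    ("Google", "ORG"), ("\n", "GPE"), ("Apple", "ORG"), ("iPhones", "PRODUCT"),
    ("Amazon", "ORG"), ("Alexa\n", "ORG"), ("Echo", "GPE"), ("Dot", "ORG"),
    ("\n", "GPE"), ("\n", "GPE"), ("India", "GPE"), ("23/03/1996", "DATE"),
    ("Chennai", "PERSON"), ("\n", "GPE"),
]


def clean_entities(pairs):
    """Strip surrounding whitespace and drop entities that are only whitespace."""
    cleaned = []
    for text, label in pairs:
        text = text.strip()
        if text:
            cleaned.append((text, label))
    return cleaned


for text, label in clean_entities(raw_entities):
    print(text, label)
```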

<h3>Polyglot</h3>

In [104]:
#installing Polyglot and the other required modules. The other three modules must be installed for Polyglot to work. 
#If you have any trouble, see: https://pypi.org/project/polyglot/
#Please note that I am using Jupyter notebook and Python 3.
!pip install PyICU
!pip install pycld2
!pip install morfessor
!pip install polyglot

#Downloading other requirements like POS tagger, embeddings and NER tagger
!polyglot download pos2.en
!polyglot download embeddings2.en
!polyglot download ner2.en

In [118]:
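#importing Polyglot's Text class, which wraps the text and exposes its named entities
from polyglot.text import Text
poly_text = Text(text)
print(poly_text.entities)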

[I-ORG(['Google']), I-PER(['Apple’s', 'Siri']), I-LOC(['India']), I-LOC(['Chennai'])]


<p>Polyglot gives only the named entities as output and hides all other details. The output is in the form NER-Tag(['token']). For example, I-ORG(['Google']) means that <i>Google</i> is an entity and <i>I-ORG</i> (organization) is its NER tag.</p>

<h3>Gensim</h3>
<p>Gensim does not work like the three libraries above, so for it we will take an example of finding document similarity. For all other examples of Gensim, please go to this <a href="https://radimrehurek.com/gensim/tutorial.html#id2">link</a>.</p>
<p>For our example, we will take 9 documents and find their similarity with a sample document.</p>

In [119]:
#installing Gensim
#If you have any trouble, see: https://pypi.org/project/gensim/
#Please note that I am using Jupyter notebook and Python 3.

!pip install gensim

In [20]:
#importing gensim
import gensim

In [120]:
#First, let’s create a small corpus of nine documents and twelve features.
#From the sentiment analysis example we are familiar with how documents are converted into vectors. This is the same idea.
#Each document is a vector stored as a list of (feature id, value) pairs. The total number of features is 12.
corpus = [[(0, 1.0), (1, 1.0), (2, 1.0)],
           [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
           [(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
           [(0, 1.0), (4, 1.0), (7, 1.0)],
           [(3, 1.0), (5, 1.0), (6, 1.0)],
           [(9, 1.0)],
           [(9, 1.0), (10, 1.0)],
           [(9, 1.0), (10, 1.0), (11, 1.0)],
           [(8, 1.0), (10, 1.0), (11, 1.0)]]

In [122]:
#Now we will convert the vectors in our corpus into tf-idf vectors. We have already seen tf-idf vectors in sentiment 
#analysis example.
#Tf-Idf is a simple transformation which takes documents represented as bag-of-words counts and 
#applies a weighting which discounts common terms (or, equivalently, promotes rare terms). 
#It also scales the resulting vector to unit length (in the Euclidean norm).

from gensim import models
tfidf = models.TfidfModel(corpus)

In [124]:
#Now we will create a sample document to calculate its similarity with all the documents in our corpus.
#We will apply the tf-idf transformation to this document as well.
#The output shows the sample document as a tf-idf vector.
sample_doc = [(0, 1), (4, 1)]
print(tfidf[sample_doc])

[(0, 0.8075244024440723), (4, 0.5898341626740045)]
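<p>To see where these two numbers come from, the weighting can be worked out by hand. The sketch below is not Gensim's code. It assumes each count is multiplied by log2(N / df), where N is the number of documents and df is the number of documents containing the feature, and that the vector is then scaled to unit length. Under that assumption it reproduces the values printed above. The function names are made up for this example.</p>

```python
import math

# The same nine-document corpus as above: lists of (feature id, count) pairs.
corpus = [[(0, 1.0), (1, 1.0), (2, 1.0)],
          [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
          [(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
          [(0, 1.0), (4, 1.0), (7, 1.0)],
          [(3, 1.0), (5, 1.0), (6, 1.0)],
          [(9, 1.0)],
          [(9, 1.0), (10, 1.0)],
          [(9, 1.0), (10, 1.0), (11, 1.0)],
          [(8, 1.0), (10, 1.0), (11, 1.0)]]


def document_frequencies(docs):
    """Count in how many documents each feature id appears."""
    df = {}
    for doc in docs:
        for feature_id, _ in doc:
            df[feature_id] = df.get(feature_id, 0) + 1
    return df


def tfidf_vector(doc, df, n_docs):
    """Weight each count by log2(n_docs / df), then scale to unit length."""
    weighted = [(fid, count * math.log2(n_docs / df[fid])) for fid, count in doc]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    return [(fid, w / norm) for fid, w in weighted] if norm else []


df = document_frequencies(corpus)
sample_vec = tfidf_vector([(0, 1), (4, 1)], df, len(corpus))
print(sample_vec)
```

<p>Feature 0 appears in only two documents while feature 4 appears in three, so feature 0 is the rarer term and gets the larger weight.</p>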


In [125]:
#let us transform the whole corpus via TfIdf and index it, in preparation for finding similarity 
from gensim import similarities
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=12)

In [126]:
#Now calculate the similarity of our sample document against every document in the corpus:
sims = index[tfidf[sample_doc]]
print(list(enumerate(sims)))

[(0, 0.4662244), (1, 0.19139354), (2, 0.2460055), (3, 0.778005), (4, 0.0), (5, 0.0), (6, 0.0), (7, 0.0), (8, 0.0)]


<p>The output is a list of tuples, each holding a document index and its similarity score. For example, the first document (index=0) has a similarity score of 0.466 = 46.6%, the second document has a similarity score of 19.1%, and so on.<br>

Thus, according to the TfIdf document representation and the cosine similarity measure, the document most similar to our sample_doc is document no. 4 (index=3), with a similarity score of 77.8%. Note that in the TfIdf representation, any documents which do not share any features with sample_doc (documents no. 5–9) get a similarity score of 0.0.</p>
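<p>Reading the ranking off the raw list is error-prone because the list is indexed from 0 while the text counts documents from 1. A small sketch of sorting the scores, using the values copied from the output above (the variable names are only illustrative):</p>

```python
# Similarity scores copied from the output above (position = index in the corpus).
sims = [0.4662244, 0.19139354, 0.2460055, 0.778005, 0.0, 0.0, 0.0, 0.0, 0.0]

# Pair each score with a 1-based document number, then sort from most to least similar.
ranked = sorted(enumerate(sims, start=1), key=lambda pair: pair[1], reverse=True)

for doc_no, score in ranked:
    if score > 0:
        print(f"document no. {doc_no}: {score:.1%}")

# Documents that share no features with the sample document.
unrelated = [doc_no for doc_no, score in ranked if score == 0.0]
print("no shared features:", unrelated)
```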

<h2>End Notes on Libraries</h2>
<p>Please note that these libraries do not always work as well as their documentation suggests. You need to try different libraries to check which one fulfils your goals.<br>
    For example, in the outputs above both NLTK and spaCy label <i>Chennai</i> as a person, while Polyglot labels it as a location. NLTK does not recognise 23/03/1996 as a date, but spaCy does.<br>

If no library fulfils your requirements, you can train a library on your own dataset. Training can take a lot of time and resources, but it can help a lot.</p>
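<p>One simple way to compare libraries is to put their labels for the same tokens side by side. The sketch below does this for a few tokens, with the labels copied by hand from the three NER outputs shown earlier; the structure and names are only illustrative.</p>

```python
# Entity labels copied from the three outputs above. None means the library
# did not report that token as an entity.
labels = {
    "Google":     {"NLTK": "PERSON", "spaCy": "ORG",    "Polyglot": "I-ORG"},
    "India":      {"NLTK": "GPE",    "spaCy": "GPE",    "Polyglot": "I-LOC"},
    "Chennai":    {"NLTK": "PERSON", "spaCy": "PERSON", "Polyglot": "I-LOC"},
    "23/03/1996": {"NLTK": None,     "spaCy": "DATE",   "Polyglot": None},
}
libraries = ["NLTK", "spaCy", "Polyglot"]

# Print a small comparison table, one row per token.
print(f"{'token':<12}" + "".join(f"{name:<10}" for name in libraries))
for token, found in labels.items():
    print(f"{token:<12}" + "".join(f"{found[name] or '-':<10}" for name in libraries))

# Tokens that at least one library did not report at all.
missed = [token for token, found in labels.items() if None in found.values()]
print("missed by at least one library:", missed)
```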

<h2>2. NLP Pretrained models</h2>
<p>An alternative to training the libraries yourself is to fine-tune pretrained models.</p>
<h6>Table of Contents</h6>
<ol>
    <li>What is a pretrained model?</li>
    <li>Why is it important?</li>
    <li>How to use them?</li>
    <li>What are the best models available?</li>
</ol>

<p>Let us cover all of these topics one by one.</p>

<h5>1. What is a pretrained model?</h5>
<p>Simply put, a pre-trained model is a model created by someone else to solve a similar problem. Instead of building a model from scratch, you use a model trained on another problem as a starting point.</p>

<h5>2. Why is it important?</h5>
<p>The author(s) have already put in the effort to design a benchmark model for us! Instead of building a model from scratch to solve a similar NLP problem, we can use that pretrained model on our own NLP dataset.<br>
A bit of fine-tuning will be required, but it saves us a ton of time and computational resources.<br>

For example, suppose you want to build a self-driving car. You can spend years building a decent image recognition algorithm from scratch, or you can take the Inception model (a pre-trained model) from Google, which was trained on ImageNet data, to identify objects in those pictures.<br>

A pre-trained model may not be 100% accurate in your application, but it saves the huge effort required to reinvent the wheel.<br></p>

<h5>3. How to use them?</h5>
<p>What is our objective when we train a neural network? (I am assuming that you are familiar with NNs and how they work. If not, then to better understand this topic please get familiar with the basics of NNs, such as forward and back propagation. This <a href = "https://www.analyticsvidhya.com/blog/2018/10/introduction-neural-networks-deep-learning/">link</a> could be useful.)<br>

We wish to identify the correct weights for the network through multiple forward and backward iterations. By using pre-trained models which have previously been trained on large datasets, we can directly use the weights and architecture obtained and apply that learning to our problem statement. This is known as transfer learning. We “transfer the learning” of the pre-trained model to our specific problem statement.<br>

You should be very careful when choosing which pre-trained model to use in your case. If the problem statement at hand is very different from the one on which the pre-trained model was trained, the predictions we get will be very inaccurate. For example, a model previously trained for speech recognition would work horribly if we tried to use it to identify objects.</p>

<p><strong><a href="https://www.analyticsvidhya.com/blog/2017/06/transfer-learning-the-art-of-fine-tuning-a-pre-trained-model/">Ways to fine-tune the model</a></strong></p>
<ul>
    <li><strong>Feature extraction</strong> – We can use a pre-trained model as a feature extraction mechanism. We remove the output layer (for example, in an ImageNet classifier, the one which gives the probabilities for each of the 1000 classes) and then use the rest of the network as a fixed feature extractor for the new data set.</li>
    <li><strong>Use the architecture of the pre-trained model</strong> – We keep the architecture of the model, initialize all the weights randomly, and train the model on our dataset from scratch.</li>
    <li><strong>Train some layers while freezing others</strong> – Another way to use a pre-trained model is to train it partially. We keep the weights of the initial layers of the model frozen and retrain only the higher layers. We can experiment with how many layers to freeze and how many to train.
</li>
</ul>
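<p>The first option, feature extraction, can be illustrated with a toy example in plain Python. This is only a sketch under stated assumptions: the "pre-trained" weights and the four-point data set are made up, the frozen network is a single tanh layer, and the new output layer is a logistic regression trained by gradient descent. A real project would do the same thing with a deep-learning framework and a real pre-trained model.</p>

```python
import math

# Hypothetical "pre-trained" layer: these weights are made up for the example
# and stay frozen, i.e. they are never updated below.
PRETRAINED_WEIGHTS = [[1.0, -1.0], [0.5, 0.5]]


def extract_features(x):
    """Frozen feature extractor: one tanh layer using the pre-trained weights."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in PRETRAINED_WEIGHTS]


def predict(head, bias, features):
    """New output layer: logistic regression on top of the extracted features."""
    z = sum(w * f for w, f in zip(head, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def mean_loss(head, bias, data):
    """Average binary cross-entropy over the data set."""
    total = 0.0
    for x, y in data:
        p = predict(head, bias, extract_features(x))
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)


# Tiny made-up data set for the new task: the label equals the first input.
data = [((1, 0), 1), ((0, 1), 0), ((1, 1), 1), ((0, 0), 0)]

head, bias = [0.0, 0.0], 0.0  # only these parameters are trained
frozen_before = [row[:] for row in PRETRAINED_WEIGHTS]
loss_before = mean_loss(head, bias, data)

learning_rate = 0.5
for _ in range(200):
    grad_head, grad_bias = [0.0, 0.0], 0.0
    for x, y in data:
        features = extract_features(x)
        error = predict(head, bias, features) - y  # gradient of the loss w.r.t. z
        grad_head = [g + error * f for g, f in zip(grad_head, features)]
        grad_bias += error
    # Update the new output layer only; the pre-trained weights are untouched.
    head = [w - learning_rate * g / len(data) for w, g in zip(head, grad_head)]
    bias -= learning_rate * grad_bias / len(data)

loss_after = mean_loss(head, bias, data)
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

<p>The third option (training some layers while freezing others) follows the same pattern, except that the updates would also be applied to the higher layers of the pre-trained network.</p>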


<h5>4. What are the best models available?</h5>
<p>The models below are grouped by their applications.</p>
<ul>
    <li><strong>Multi-Purpose NLP Models</strong>
        <ul>
            <li>ULMFiT</li>
            <li>Transformer</li>
            <li>Google’s BERT</li>
            <li>Transformer-XL</li>
            <li>OpenAI’s GPT-2</li>
        </ul>
    </li>
    <li><strong>Word Embeddings</strong>
        <ul>
            <li>ELMo</li>
            <li>Flair</li>
        </ul>
    </li>
    <li><strong>Other Pretrained Models</strong>
        <ul>
            <li>StanfordNLP</li>
        </ul>
    </li>
</ul>

<p>For an overview of all of them, please go through this <a href = "https://www.analyticsvidhya.com/blog/2019/03/pretrained-models-get-started-nlp/">link</a>. There you will also find more resources to get a complete hold on pretrained models.</p>









    

    




<h1>End Notes</h1>
<p>I am just a learner and very new to the field of Machine Learning and NLP. All of this is the result of my learning from various resources. This notebook is a mix of my own knowledge and what I have studied in various blog posts and courses.<br>
    This notebook can be used by anyone without any constraints, because knowledge must be free and flowing.<br>
    If you were going through this notebook and found any errors or did not understand anything, please feel free to contact me. <br>
    Ajeet Singh. [ajeet.singh.ec14@gmail.com]
</p>
<p>Thank you for reading.</p>