# Natural Language Processing Introduction

Here I will be using the NLTK library for Natural Language Processing. Credit goes to Sentdex's video tutorials on YouTube.

In [None]:
!pip install nltk

In [None]:
import nltk

In [None]:
nltk.download()

## Preprocessing

First, we will discuss some methods used in Natural Language Processing to preprocess the data.

### First Method: Tokenizing

We will discuss two kinds of tokenizing: sentence tokenizing, which splits the text into sentences, and word tokenizing, which splits it into words and punctuation marks.

In [None]:
from nltk.tokenize import word_tokenize, sent_tokenize

example_text = "Hey man! Liverpool FC is playing tomorrow. Where are you watching it? I have a midterm tomorrow, have to prepare for it."
# Tokenizing the text into sentences and into words
sentence_tokenized = sent_tokenize(example_text)
word_tokenized = word_tokenize(example_text)

for sentence in sentence_tokenized:
    print(sentence)
for word in word_tokenized:
    print(word)

## Second Method : Removing Stopwords

Stop words are words like "the", "am", "is" and others which appear in the text far more often than other words, but add little or nothing to our understanding of the data.

In [None]:
from nltk.corpus import stopwords

In [None]:
stop_words = set(stopwords.words("english"))

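filtered_words = []
for word in word_tokenized:
    # NLTK's stop word list is lowercase, so compare in lowercase;
    # otherwise "Where" and "I" would slip through the filter
    if word.lower() not in stop_words:
        filtered_words.append(word)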

print(filtered_words)

## Third Method: Stemming

Stemming reduces a word to its stem, or root form. It strips suffixes such as "ing", "es" or "s" from the end of the word. Note that the stem is not always a real dictionary word.

In [None]:
from nltk.stem import PorterStemmer

In [None]:
pstem = PorterStemmer()

for word in word_tokenized:
    print(pstem.stem(word))

In the output above, you can see that "playing" is stemmed to "play" and "watching" to "watch".

Another example of stemming:

In [None]:
w = ['cats', 'corns', 'caresses', 'carriers']

In [None]:
for word in w:
    print(pstem.stem(word))
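
To make the idea of suffix stripping concrete, here is a deliberately naive sketch in plain Python. It is not the Porter algorithm, which applies a much longer, ordered set of rewrite rules. The function name naive_stem, the suffix list and the minimum stem length are all assumptions I picked for illustration.

```python
# A toy suffix stripper, NOT the Porter algorithm.
# The suffix list and the minimum stem length are illustrative assumptions.
SUFFIXES = ("ing", "es", "s")  # tried in this order
MIN_STEM_LENGTH = 3


def naive_stem(word):
    """Strip the first matching suffix if enough of the word is left."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        stem = lowered[: -len(suffix)]
        if lowered.endswith(suffix) and len(stem) >= MIN_STEM_LENGTH:
            return stem
    return lowered


for word in ["playing", "watching", "cats", "caresses", "is"]:
    print(word, "->", naive_stem(word))
```

A rule set this small breaks quickly: it turns "string" into "str", for instance. That is why a real stemmer needs many more rules and conditions.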

## Third Method : Part of Speech Tagging

Part of Speech (POS) tagging, also called grammatical tagging, labels each word of the text with its part of speech, based on both the word's meaning and the context in which it is used.

In [None]:
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

In [None]:
train_text = state_union.raw("1961-Kennedy.txt")
sample_text = state_union.raw("1962-Kennedy.txt")

In [None]:
# Training the PunktSentenceTokenizer on the 1961 address
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

In [None]:
tokenized = custom_sent_tokenizer.tokenize(sample_text)

In [None]:
def processing_content():
    try:
        for sent in tokenized:
            words = nltk.word_tokenize(sent)
            tagged = nltk.pos_tag(words)
            print(tagged)
            
    except Exception as e:
        print(str(e))

processing_content()

POS tagging returns a list of tuples, one per token. The first element of each tuple is the word from the text and the second element is its POS tag, which tells you whether the word is, for example, a preposition, a verb or a noun.
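
Since the output is just a list of (word, tag) tuples, you can filter it with ordinary Python. The sketch below uses a hand-written list in the same shape; the tags are my own illustrative Penn Treebank tags, not actual output of nltk.pos_tag.

```python
# Hand-written (word, tag) pairs in the shape that nltk.pos_tag returns.
# They are illustrative and were not produced by running the tagger.
tagged = [
    ("Liverpool", "NNP"),
    ("is", "VBZ"),
    ("playing", "VBG"),
    ("in", "IN"),
    ("London", "NNP"),
    ("tomorrow", "NN"),
]

# Penn Treebank noun tags all start with "NN" (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in tagged if tag.startswith("NN")]
# Verb tags all start with "VB" (VB, VBD, VBG, VBN, VBP, VBZ).
verbs = [word for word, tag in tagged if tag.startswith("VB")]

print(nouns)
print(verbs)
```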

## Fourth method: Chunking

In [None]:
def processing_content():
    try:
        for sent in tokenized:
            words = nltk.word_tokenize(sent)
            tagged = nltk.pos_tag(words)
            
            chunkGram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            
            chunkParser = nltk.RegexpParser(chunkGram)
            chunked = chunkParser.parse(tagged)
            
            print(chunked)
    except Exception as e:
        print(str(e))

processing_content()

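Above you can see how the words are chunked based on their POS tags. The grammar groups any number of adverbs (RB tags), followed by any number of verbs (VB tags), followed by one or more proper nouns (NNP) and an optional noun (NN). You can also draw these chunks and see how they are grouped using the chunked.draw() method.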
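
The chunk grammar is essentially a regular expression over tags instead of characters. The sketch below imitates that idea with Python's re module on a hand-tagged sentence. It only approximates what nltk.RegexpParser does and is not its implementation; the sentence and its tags are assumptions I wrote by hand.

```python
import re

# Hand-tagged sentence in the (word, tag) format; not actual tagger output.
tagged = [
    ("President", "NNP"), ("Kennedy", "NNP"), ("quickly", "RB"),
    ("addressed", "VBD"), ("Congress", "NNP"), ("today", "NN"), (".", "."),
]

# Write the tag sequence as one string: "<NNP><NNP><RB><VBD><NNP><NN><.>"
tag_string = "".join(f"<{tag}>" for _, tag in tagged)

# The pattern from chunkGram spelled out as an ordinary regular expression:
# any adverbs, then any verbs, then one or more proper nouns, then an optional noun.
pattern = re.compile(r"(?:<RB[^<>]?>)*(?:<VB[^<>]?>)*(?:<NNP>)+(?:<NN>)?")

chunks = []
for match in pattern.finditer(tag_string):
    first = tag_string.count("<", 0, match.start())  # index of the chunk's first word
    length = match.group().count("<")                # number of words in the chunk
    chunks.append([word for word, _ in tagged[first:first + length]])

print(chunks)
```

Here "President Kennedy" forms one chunk and "quickly addressed Congress today" forms another, while the final full stop is left outside any chunk.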

## Fifth Method : Named Entity Recognition

In [None]:
def processing_content():
    try:
        for sent in tokenized:
            words = nltk.word_tokenize(sent)
            tagged = nltk.pos_tag(words)
            
            named_entity = nltk.ne_chunk(tagged)
            print(named_entity)
            
            
    except Exception as e:
        print(str(e))

processing_content()

Named Entity Recognition gives you information such as whether a word in your text is a person, an organisation, a location and more. The issue with NLTK's named entity recognition is its high false positive rate: many words are misclassified, and two words which should have been chunked together and classified as one entity are often handled separately. One of the parameters you can use with NLTK's named entity recognition is binary:

        nltk.ne_chunk(tagged, binary = True)

With binary set to True, every named entity is labelled simply as NE instead of getting a specific type. This helps with chunking adjacent words together into one entity, but you are no longer able to see which kind of entity was assigned to them.

## Sixth Method: Lemmatization

Lemmatization is very similar to stemming, with one important difference: the result is always a real word, the lemma or dictionary form.
You can think of it as taking a word, looking it up in the dictionary and replacing it with its base entry. The replacement might be the same word or a completely different one, but it will carry the same meaning. The lemmatizer treats every word as a noun unless you pass a different part of speech through the pos parameter.

In [None]:
from nltk.stem import WordNetLemmatizer

In [None]:
lemmatizer = WordNetLemmatizer()

In [None]:
print(lemmatizer.lemmatize('better', pos = 'a'))

In [None]:
print(lemmatizer.lemmatize('walks'))
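
To see why the pos argument matters, here is a toy dictionary lookup in plain Python. The table stands in for a real dictionary such as WordNet; its entries and the function name toy_lemmatize are hypothetical and only mirror the idea, not how WordNetLemmatizer works internally.

```python
# A toy lookup table standing in for a real dictionary such as WordNet.
# Entries and the function name are hypothetical, for illustration only.
LEMMA_TABLE = {
    ("better", "a"): "good",   # adjective: comparative form of "good"
    ("walks", "n"): "walk",
    ("walks", "v"): "walk",
    ("mice", "n"): "mouse",
}


def toy_lemmatize(word, pos="n"):
    """Return the table entry for (word, pos), or the word itself if absent."""
    return LEMMA_TABLE.get((word, pos), word)


print(toy_lemmatize("better", pos="a"))
print(toy_lemmatize("better"))
print(toy_lemmatize("mice"))
```

With the default noun lookup, "better" has no entry and comes back unchanged, whereas the adjective lookup maps it to "good".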

### Finding synonyms and antonyms using NLTK

In [None]:
from nltk.corpus import wordnet
# Look up the synsets (groups of synonyms) of the word "good"
synonyms = wordnet.synsets("good")
print(synonyms)
# Get the name of the first lemma in the first synset
print(synonyms[0].lemmas()[0].name())
# Get the definition and then the usage examples of the first synset
print(synonyms[0].definition())
print(synonyms[0].examples())

### Finding the similarity of the words used

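In NLTK we can compare how similar two words are. The wup_similarity method used below computes the Wu-Palmer similarity, which scores two synsets by how close they sit in the WordNet hierarchy.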

In [None]:
word1 = wordnet.synset("apple.n.01")
word2 = wordnet.synset("orange.n.01")
print(word1.wup_similarity(word2))

In [None]:
word1 = wordnet.synset("orange.n.01")
word2 = wordnet.synset("lemon.n.01")
print(word1.wup_similarity(word2))

In [None]:
word1 = wordnet.synset("apple.n.01")
word2 = wordnet.synset("okra.n.01")
print(word1.wup_similarity(word2))

In [None]:
word1 = wordnet.synset("apple.n.01")
word2 = wordnet.synset("pea.n.01")
print(word1.wup_similarity(word2))

If you had compared words of a totally different kind, such as car or even chocolate, you would have seen the similarity score go down.
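
The Wu-Palmer score is 2 * depth(LCS) / (depth(a) + depth(b)), where LCS is the deepest ancestor the two concepts share. The sketch below computes it on a tiny hand-made taxonomy. The taxonomy and the helper names are assumptions for illustration; WordNet's hierarchy is far larger, so the numbers will not match the cells above.

```python
# A tiny hand-made taxonomy mapping each concept to its parent.
# It is hypothetical and far smaller than WordNet's hierarchy.
PARENT = {
    "entity": None,
    "food": "entity",
    "fruit": "food",
    "apple": "fruit",
    "orange": "fruit",
    "vehicle": "entity",
    "car": "vehicle",
}


def path_to_root(concept):
    """Return the chain [concept, parent, ..., root]."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def depth(concept):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(concept))


def wup_similarity(a, b):
    """Wu-Palmer score: 2 * depth(LCS) / (depth(a) + depth(b))."""
    ancestors_of_a = path_to_root(a)
    # The first ancestor of b that is also an ancestor of a is the deepest shared one.
    lcs = next(c for c in path_to_root(b) if c in ancestors_of_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(wup_similarity("apple", "orange"))
print(wup_similarity("apple", "car"))
```

In this toy tree, apple and orange share the ancestor "fruit" and score 0.75, while apple and car only share the root and score about 0.29.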

## Text Classification

Now we will try to classify text based on the information we can get from the text itself and the labels associated with it. We cannot do this without a labelled dataset, as we need the labels to train our classifiers.

In [None]:
import nltk
import random
from nltk.corpus import movie_reviews

# One (word list, category) pair per review, for both "neg" and "pos"
documents = [
    (list(movie_reviews.words(fileid)), category)
    for category in movie_reviews.categories()
    for fileid in movie_reviews.fileids(category)
]

random.shuffle(documents)
all_words = []

for word in movie_reviews.words():
    all_words.append(word.lower())

all_words = nltk.FreqDist(all_words)
print(all_words.most_common(15))

In [None]:
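# Note: keys() follows insertion order (first appearance in the corpus), not
# frequency; all_words.most_common(3000) would pick the most frequent words instead
word_features = list(all_words.keys())[:3000]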

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features
        
print(find_features(movie_reviews.words('neg/cv000_29416.txt')))
featuresets = [(find_features(rev),category) for (rev,category) in documents]
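
The feature dictionary is easier to read on a small scale. This sketch repeats the same idea with a four-word vocabulary and a made-up review; both are assumptions for illustration, whereas the cell above uses 3000 words from movie_reviews.

```python
# Toy vocabulary and review; the cell above uses 3000 words from movie_reviews.
toy_word_features = ["great", "boring", "plot", "awful"]


def toy_find_features(document, vocabulary):
    """Map every vocabulary word to True/False depending on its presence."""
    words = set(document)  # a set makes each membership test cheap
    return {w: (w in words) for w in vocabulary}


review = ["the", "plot", "was", "great", "great", "fun"]
print(toy_find_features(review, toy_word_features))
```

Note that the features only record whether a word is present: "great" appears twice in the review, but its feature is simply True.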


We create our first classifier using NLTK's Naive Bayes classifier. The accuracy of this particular classifier is very unstable and therefore not really reliable: it might go as high as 81.5% (as in my case), and the very next time you reshuffle your data and run it, the accuracy might drop to around 55 to 60%.

In [None]:
training_set = featuresets[:1800]
test_set = featuresets[1800:]

classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Naive Bayes Algo accuracy percent:", (nltk.classify.accuracy(classifier, test_set))*100)

classifier.show_most_informative_features(15)


In [None]:
training_set = featuresets[:1900]
test_set = featuresets[1900:]

classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Naive Bayes Algo accuracy percent:", (nltk.classify.accuracy(classifier, test_set))*100)

classifier.show_most_informative_features(15)


NLTK has a wrapper, SklearnClassifier, through which we can use scikit-learn's classifiers from within NLTK. Here we will train a range of these classifiers and then build an ensemble out of them.

In [None]:
from nltk.classify.scikitlearn import SklearnClassifier

from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
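training_set = featuresets[:1800]
test_set = featuresets[1800:]
# Retrain on the first 1800 reviews: the classifier from the previous cell saw
# featuresets[1800:1900] during training, which would inflate its test accuracy
classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Original Naive Bayes Algo accuracy percent:", (nltk.classify.accuracy(classifier, test_set))*100)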

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier Naive Bayes Algo accuracy percent:", (nltk.classify.accuracy(MNB_classifier, test_set))*100)

#GaussianNB_classifier = SklearnClassifier(GaussianNB())
#GaussianNB_classifier.train(training_set)
#print("GaussianNB_classifier accuracy percent:", (nltk.classify.accuracy(GaussianNB_classifier, test_set))*100)

BernoulliNB_classifier = SklearnClassifier(BernoulliNB())
BernoulliNB_classifier.train(training_set)
print("BernoulliNB_classifier accuracy percent:", (nltk.classify.accuracy(BernoulliNB_classifier, test_set))*100)


In [None]:
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC

LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(training_set)
print("LogisticRegression_classifier accuracy percent:", (nltk.classify.accuracy(LogisticRegression_classifier, test_set))*100)

SGDClassifier_classifier = SklearnClassifier(SGDClassifier())
SGDClassifier_classifier.train(training_set)
print("SGDClassifier_classifier accuracy percent:", (nltk.classify.accuracy(SGDClassifier_classifier, test_set))*100)

SVC_classifier = SklearnClassifier(SVC())
SVC_classifier.train(training_set)
print("SVC_classifier accuracy percent:", (nltk.classify.accuracy(SVC_classifier, test_set))*100)

LinearSVC_classifier = SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(training_set)
print("LinearSVC_classifier accuracy percent:", (nltk.classify.accuracy(LinearSVC_classifier, test_set))*100)

NuSVC_classifier = SklearnClassifier(NuSVC())
NuSVC_classifier.train(training_set)
print("NuSVC_classifier accuracy percent:", (nltk.classify.accuracy(NuSVC_classifier, test_set))*100)



Here we build our ensemble. The voting classifier takes a vote from every classifier and outputs the label with the highest vote count, together with a confidence: the fraction of classifiers that voted for that label.

In [None]:
from nltk.classify import ClassifierI
from statistics import mode

In [None]:
class VoteClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers
        
    def classify(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        return mode(votes)
    
    def confidence(self, features):
        votes = []
        for c in self._classifiers:
            v = c.classify(features)
            votes.append(v)
        choice_votes = votes.count(mode(votes))
        conf = choice_votes / len(votes)
        return conf
    
voted_classifier = VoteClassifier(classifier, MNB_classifier, BernoulliNB_classifier,
                                 LogisticRegression_classifier,
                                 SGDClassifier_classifier,LinearSVC_classifier,
                                 NuSVC_classifier)

print("voted_classifier accuracy percent:", (nltk.classify.accuracy(voted_classifier, test_set))*100)

print("Classification:", voted_classifier.classify(test_set[0][0]),"Confidence %:", voted_classifier.confidence(test_set[0][0])*100)

In [None]:
# Classify the next five test reviews and report the share of agreeing votes
for features, _ in test_set[1:6]:
    print("Classification:", voted_classifier.classify(features),"Confidence %:", voted_classifier.confidence(features)*100)
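
To make the confidence figures concrete, here is the vote counting on its own, with a hand-written list of votes. The votes are hypothetical, and I use collections.Counter here in place of statistics.mode purely to get the winning label and its count in one step.

```python
from collections import Counter

# Hypothetical votes from seven classifiers for a single review.
votes = ["pos", "neg", "pos", "pos", "neg", "pos", "pos"]

# most_common(1) returns a list holding the (label, count) pair of the winner.
label, count = Counter(votes).most_common(1)[0]
confidence = count / len(votes)

print(label, round(confidence * 100, 1))
```

Five of the seven votes agree, so the confidence is about 71.4%. With two labels, an odd number of classifiers guarantees that the vote can never end in a tie.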