# INTRODUCTION TO NATURAL LANGUAGE PROCESSING

## Libraries to install
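```
$ pip install "spacy<3"
$ python -m spacy download en_core_web_sm
```

The training code in this tutorial uses the spaCy 2.x API (for example `nlp.create_pipe()` and `nlp.begin_training()`), so install a 2.x release.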

# Sentiment Analysis

## What is sentiment analysis?

**Sentiment analysis** is the process of analysing text to understand the opinion or tone expressed in it. It is a tool that allows computers to detect the underlying tone of a given text. This is difficult to achieve: even humans sometimes struggle to judge the tone of a piece of text, and it is harder still for computers. Fortunately, there are tools and techniques for doing it.

## Why sentiment analysis?

1. To understand customers' opinions on a certain service or product.

2. To monitor social media.

3. To process data in order to detect the opinions and emotions of a group of people.

## Objectives of this tutorial

1. Learn natural language processing techniques

2. Use ML to determine the sentiment of a text

3. Learn how to use spaCy for sentiment analysis


# Using Natural Language Processing to Clean Data

NLP, like any other ML task, requires a well-cleaned dataset. The main aim of cleaning is to reduce the noise that is inherent in human-written text. There are many tools for this; common ones include the Natural Language Toolkit (NLTK), TextBlob, and spaCy. In this tutorial we will use spaCy. In NLP we use the following process to clean our data:

1. Tokenizing sentences

2. Removing stop words like “if,” “but,” “or,” and so on

3. Normalizing words

4. Vectorizing text



## Terminologies

### Tokenization

Tokenization is the process of breaking a sentence or a chunk of text into smaller pieces. It is the first step in an NLP pipeline. There are two main types of tokenization:

#### Word Tokenization 

This type of tokenization involves breaking a text into individual words.

#### Sentence Tokenization

This involves breaking a text into individual sentences.
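To make the two types concrete, here is a minimal sketch that uses only Python's `re` module. The function names `sentence_tokenize` and `word_tokenize` are hypothetical, and the rules are deliberately naive (they would mis-split abbreviations such as "Dr."); this is not how spaCy tokenizes text.

```python
import re
from typing import List


def sentence_tokenize(text: str) -> List[str]:
    """Split text into sentences after '.', '!' or '?' followed by whitespace."""
    # Collapse line breaks and repeated spaces into single spaces first.
    flat = " ".join(text.split())
    parts = re.split(r"(?<=[.!?])\s+", flat)
    return [part for part in parts if part]


def word_tokenize(sentence: str) -> List[str]:
    """Split a sentence into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)


sample = "The car had been hastily packed. Marta was inside!"
sentences = sentence_tokenize(sample)
words = word_tokenize(sentences[0])

print(sentences)
print(words)
```

Note that punctuation becomes its own token here, which matches what the spaCy output in the next example shows.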

#### Example of word tokenization

In [1]:
import spacy
text = """
Dave watched as the forest burned up on the hill,
only a few miles from his house. The car had
been hastily packed and Marta was inside trying to round
up the last of the pets. "Where could she be?" he wondered
as he continued to wait for Marta to appear with the pets.
"""

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
token_list = [token for token in doc]
token_list

[,
 Dave,
 watched,
 as,
 the,
 forest,
 burned,
 up,
 on,
 the,
 hill,
 ,,
 ,
 only,
 a,
 few,
 miles,
 from,
 his,
 house,
 .,
 The,
 car,
 had,
 ,
 been,
 hastily,
 packed,
 and,
 Marta,
 was,
 inside,
 trying,
 to,
 round,
 ,
 up,
 the,
 last,
 of,
 the,
 pets,
 .,
 ",
 Where,
 could,
 she,
 be,
 ?,
 ",
 he,
 wondered,
 ,
 as,
 he,
 continued,
 to,
 wait,
 for,
 Marta,
 to,
 appear,
 with,
 the,
 pets,
 .,
 ]

### Removing Stop Words

Stop words are words that are useful in human communication but carry little information for sentiment analysis in a machine learning or deep learning model, for example "the" and "as". We need to remove them.

spaCy comes with a default list of stop words that we can use to filter the tokenized text.

In [2]:
filtered_tokens = [token for token in doc if not token.is_stop]
filtered_tokens

[,
 Dave,
 watched,
 forest,
 burned,
 hill,
 ,,
 ,
 miles,
 house,
 .,
 car,
 ,
 hastily,
 packed,
 Marta,
 inside,
 trying,
 round,
 ,
 pets,
 .,
 ",
 ?,
 ",
 wondered,
 ,
 continued,
 wait,
 Marta,
 appear,
 pets,
 .,
 ]
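The filtered list above still contains punctuation and newline tokens, because `is_stop` only checks the stop word list. The sketch below shows one way to drop those as well, using plain strings and a tiny made-up stop list (`STOP_WORDS` and `remove_noise` are hypothetical names, not part of spaCy). With spaCy tokens you could apply the same idea through the `token.is_punct` and `token.is_space` attributes.

```python
import string

# Hypothetical, deliberately tiny stop list; spaCy's built-in list is much larger.
STOP_WORDS = {"the", "as", "on", "a", "from", "his", "only", "few", "up"}


def remove_noise(tokens):
    """Drop stop words, punctuation-only tokens and whitespace-only tokens."""
    kept = []
    for token in tokens:
        if not token.strip():
            continue  # whitespace such as "\n"
        if all(ch in string.punctuation for ch in token):
            continue  # punctuation such as "," or "."
        if token.lower() in STOP_WORDS:
            continue  # stop word
        kept.append(token)
    return kept


tokens = ["\n", "Dave", "watched", "as", "the", "forest", "burned",
          "up", "on", "the", "hill", ",", "\n"]
cleaned = remove_noise(tokens)
print(cleaned)
```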

### Normalization

Now that we have removed the unnecessary words, let's normalize the remaining text. Normalization is the process of condensing the many variations of a word into a single form. For example, "wait", "waited" and "waiting" are all condensed into "wait".

#### Types of Normalization

There are two main types of normalization:

1. Stemming

With this method, a word is trimmed down to its stem, as in the example above: "wait", "waiting" and "waited" share the common stem "wait". Because stemming only trims characters, it misses the relationship between irregular forms such as "drive" and "drove".

2. Lemmatization

This is a more powerful alternative to stemming. Lemmatization considers the context of a word and converts it to its meaningful base form, which is called the lemma.

Put another way, lemmatization is the process of converting a word into its base form according to its dictionary meaning.

Example: "studying" and "studies" both have the dictionary form "study".

Since lemmatization is more powerful than stemming, spaCy only provides lemmatization.

In [3]:
lemma = [f"token: {token}, lemma: {token.lemma_}" for token in filtered_tokens]
lemma

['token: \n, lemma: \n',
 'token: Dave, lemma: Dave',
 'token: watched, lemma: watch',
 'token: forest, lemma: forest',
 'token: burned, lemma: burn',
 'token: hill, lemma: hill',
 'token: ,, lemma: ,',
 'token: \n, lemma: \n',
 'token: miles, lemma: mile',
 'token: house, lemma: house',
 'token: ., lemma: .',
 'token: car, lemma: car',
 'token: \n, lemma: \n',
 'token: hastily, lemma: hastily',
 'token: packed, lemma: pack',
 'token: Marta, lemma: Marta',
 'token: inside, lemma: inside',
 'token: trying, lemma: try',
 'token: round, lemma: round',
 'token: \n, lemma: \n',
 'token: pets, lemma: pet',
 'token: ., lemma: .',
 'token: ", lemma: "',
 'token: ?, lemma: ?',
 'token: ", lemma: "',
 'token: wondered, lemma: wonder',
 'token: \n, lemma: \n',
 'token: continued, lemma: continue',
 'token: wait, lemma: wait',
 'token: Marta, lemma: Marta',
 'token: appear, lemma: appear',
 'token: pets, lemma: pet',
 'token: ., lemma: .',
 'token: \n, lemma: \n']
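To see why stemming misses pairs like "drive" and "drove", here is a toy comparison in plain Python. The suffix list and the small lookup table are made up for illustration, and `naive_stem` and `lookup_lemma` are hypothetical helpers; real stemmers and spaCy's lemmatizer use far richer rules and data.

```python
# Hypothetical suffix list for the naive stemmer.
SUFFIXES = ("ing", "ed", "s")

# Hypothetical miniature dictionary for the lookup lemmatizer.
LEMMA_TABLE = {
    "drove": "drive",
    "driven": "drive",
    "studies": "study",
    "studying": "study",
}


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lookup_lemma(word):
    """Return the dictionary form if known, otherwise fall back to the stem."""
    return LEMMA_TABLE.get(word, naive_stem(word))


stems = [naive_stem(w) for w in ["wait", "waited", "waiting", "drove"]]
lemmas = [lookup_lemma(w) for w in ["waited", "drove", "studies"]]
print(stems)
print(lemmas)
```

The stemmer leaves "drove" untouched because there is no suffix to trim, while the dictionary lookup maps it to "drive".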

### Vectorization

Vectorization is the process of converting each token into a numeric array that is unique to that token and represents various features of it. Vectors are used to find similarities between words, classify text, and perform other NLP operations.

The array that represents a token's vector can be a **dense array**, in which every position contains a defined value, or a **sparse array**, in which most positions are empty. In most cases we'll use dense arrays.
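The difference between the two layouts can be sketched with plain Python containers. This is only an illustration of the idea: the values are made up, and `to_sparse` and `to_dense` are hypothetical helpers rather than anything spaCy provides.

```python
# A dense vector stores a value for every position, including the zeros.
dense = [0.0, 0.0, 3.5, 0.0, 0.0, 0.0, 1.2, 0.0]


def to_sparse(vector):
    """Keep only the non-zero entries as {index: value}."""
    return {i: v for i, v in enumerate(vector) if v != 0.0}


def to_dense(sparse, length):
    """Rebuild the full list, filling the missing positions with 0.0."""
    return [sparse.get(i, 0.0) for i in range(length)]


sparse = to_sparse(dense)
restored = to_dense(sparse, len(dense))
print(sparse)
print(restored == dense)
```

A sparse layout saves space when most positions are empty, while a dense layout is simpler to feed to a neural network.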

In [4]:
filtered_tokens[1]

Dave

In [5]:
filtered_tokens[1].vector

array([ 1.8371642 ,  1.452925  , -1.6147203 ,  0.67836225, -0.659443  ,
        1.6417911 ,  0.57964015,  2.3021278 , -0.13260579,  0.57509375,
        1.5654867 , -0.69388777, -0.5960694 , -1.5377433 ,  1.9425607 ,
       -2.4552503 ,  1.2321602 ,  1.0434954 , -1.5102386 , -0.57876253,
        0.12055516,  3.6501799 ,  2.616098  , -0.57102156, -1.5221778 ,
        0.0062914 ,  0.22760749, -1.9220744 , -1.6252842 , -4.2262235 ,
       -3.495663  , -3.3120532 ,  0.81387675, -0.00677478, -0.11603296,
        1.462044  ,  3.0751472 ,  0.35958475, -0.22526968, -2.7439258 ,
        1.2696334 ,  4.606787  ,  0.3403422 , -2.127231  ,  1.261918  ,
       -4.209798  ,  5.4528546 ,  1.6940243 , -2.597298  ,  0.95049405,
       -1.9105787 , -2.374928  , -1.4227569 , -2.2528832 , -1.7998077 ,
        1.607501  ,  2.9914231 ,  2.8065157 , -1.2510273 , -0.5496425 ,
       -0.49980426, -1.3882611 , -0.47047865, -2.9670255 ,  1.7884939 ,
        4.5282784 , -1.2602415 , -0.14885461,  1.0419188 , -0.08

In [6]:
filtered_tokens[1].vector.shape

(96,)
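One common way to use such vectors for "finding similarities" is cosine similarity, which compares the direction of two vectors. The sketch below uses made-up three-dimensional vectors rather than real 96-dimensional spaCy vectors, and `cosine_similarity` is written by hand here for illustration.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector
    return dot / (norm_a * norm_b)


# Made-up vectors: the first two point in nearly the same direction.
vec_a = [1.0, 2.0, 0.5]
vec_b = [0.9, 2.1, 0.4]
vec_c = [-1.0, 0.2, 3.0]

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

Values close to 1 mean the vectors point the same way; values near 0 mean they are unrelated.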

# spaCy For Text Classification

spaCy has many built-in functions for preprocessing data, and it also provides pipeline components that let us label our data. The default pipeline is defined in a JSON file associated with whichever preexisting model you’re using (en_core_web_sm for this tutorial), but you can also build one from scratch if you wish.

One of the built-in pipeline components that spaCy provides is called textcat (TextCategorizer), which enables you to assign categories (labels) to your text data and use that as training data for a neural network or a machine learning model.

This process will generate a trained model that you can then use to predict the sentiment of a given piece of text.

To achieve this, we need to follow these steps:

1. Add the textcat component to the existing pipeline.
2. Add valid labels to the textcat component.
3. Load, shuffle, and split your data.
4. Train the model, evaluating on each training loop.
5. Use the trained model to predict the sentiment of non-training data.
6. Optionally, save the trained model.

# Building A Sentiment Analyzer

To build a sentiment analysis pipeline, you need to include the following steps:

1. Loading data
2. Preprocessing
3. Training the classifier
4. Classifying data

We will build a movie sentiment analyzer. The dataset we'll be using is the [Large Movie Review Dataset](https://ai.stanford.edu/~amaas/data/sentiment/).

## Loading data



In [7]:
import os
import random

def load_training_data(
    data_directory: str = "./datasets/aclImdb_v1/aclImdb/train",
    split: float = 0.8,
    limit: int = 0
) -> tuple:
    # Load from files
    reviews = []
    for label in ["pos", "neg"]:
        labeled_directory = f"{data_directory}/{label}"
        for review in os.listdir(labeled_directory):
            if review.endswith(".txt"):
                with open(f"{labeled_directory}/{review}", encoding="utf-8") as f:
                    text = f.read()
                    text = text.replace("<br />", "\n\n")
                    if text.strip():
                        spacy_label = {
                            "cats": {
                                "pos": "pos" == label,
                                "neg": "neg" == label
                            }
                        }
                        reviews.append((text, spacy_label))
    random.shuffle(reviews)

    if limit:
        reviews = reviews[:limit]
    split = int(len(reviews) * split)
    return reviews[:split], reviews[split:]
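The shuffle-and-split step at the end of `load_training_data()` can be tried in isolation on toy data, without downloading the dataset. In this sketch `shuffle_and_split` is a hypothetical helper and the `seed` parameter is an addition for reproducibility; the toy reviews use the same `{"cats": {...}}` label structure as the function above.

```python
import random


def shuffle_and_split(items, split=0.8, limit=0, seed=None):
    """Shuffle a copy of items, optionally truncate it, then slice it in two."""
    rng = random.Random(seed)
    items = list(items)  # work on a copy so the caller's list is untouched
    rng.shuffle(items)
    if limit:
        items = items[:limit]
    cut = int(len(items) * split)
    # First part is the training set, second part is the test set.
    return items[:cut], items[cut:]


# Ten toy reviews with alternating labels.
toy_reviews = [
    (f"review {i}", {"cats": {"pos": i % 2 == 0, "neg": i % 2 == 1}})
    for i in range(10)
]
train_part, test_part = shuffle_and_split(toy_reviews, split=0.8, seed=0)
print(len(train_part), len(test_part))
```

The order of the returned pair matters: the first element is the larger training slice, the second is the held-out test slice.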

In [8]:
train, test = load_training_data()

So far we have loaded the dataset and done some simple processing, such as assigning a label to each review, so we now have a dataset of labeled text. We also shuffled the data to ensure that the ordering of the files does not affect the model's accuracy and that the reviews are well mixed before we split them into training and test sets.

To split the data we use list slicing. The slice index is obtained by multiplying the `split` argument passed to the function by the total length of the reviews list. The function returns the first 80% of the reviews as the training set and the remaining 20% as the test set, in that order.

#### NOTE:

The label dictionary structure is a format required by the spaCy model during the training loop.

In [9]:
len(train)

20000

In [10]:
len(test)

5000

In [28]:
train[0]

("I rented this because I'm a bit weary of '80s NBC programming and apparently I saved myself a lot of money. I have nothing against any of the actors and for their credit they do a good job but this show is flawed from the premise.\n\n\n\nWe have a character who is unlikable. He's full of flaws, not enlightened, and a complete jerk on a good day. Yet the reason why anybody should care just isn't there. While creating an American sitcom centered around a complete bullheaded jackass is revolutionary and full of potential, it just isn't met here within this show. Most of the supporting characters aren't fully fleshed characters but rather sad punching bags that want empathy from the audience for being punching bags. As in any sitcom, they are the ones who are made the most normal for the audience to relate to, and in doing this they negate the lead character to such an extent that we see Bittinger being himself and harming people and they just stay there because....why? There is no reaso

Now that our data is preprocessed, it's time to train our model to classify the text.

# Training A Classifier

We will use a convolutional neural network (CNN) to analyze the sentiment. CNNs can also be used for other forms of text classification, as long as the right data is provided alongside the right labels.

Let's build the pipeline.

In [12]:
def train_model(
    training_data: list,
    test_data: list,
    iterations: int
) -> None:
    # Build pipeline
    nlp = spacy.load("en_core_web_sm")
    if "textcat" not in nlp.pipe_names:
        # Create the text categorizer and add it at the end of the pipeline
        textcat = nlp.create_pipe("textcat", config={"architecture": "simple_cnn"})
        nlp.add_pipe(textcat, last=True)
    else:
        # Reuse the existing text categorizer
        textcat = nlp.get_pipe("textcat")

    # Labels must match the keys of the "cats" dictionaries in the data
    textcat.add_label("pos")
    textcat.add_label("neg")

For more examples and explanations of the above code, see the [spaCy usage examples](https://spacy.io/usage/examples#textcat). Basically, we load the en_core_web_sm model and check whether the pipeline already contains a component named textcat. If not, we create it using .create_pipe() and add it to the very end of the pipeline using the "last" argument.

If the component already exists, we fetch it with .get_pipe() and store it in a variable that we can use later in our program. To classify our text we need to tell textcat what to look for, in this case the labels, so we use .add_label() to add them to textcat.

Now that textcat knows which labels to learn during training, it's time to train the model. Let's begin by writing the training loop.

## Building The Training Loop

In the training loop we want to train only the textcat component (pipe), so we first need to disable the other pipes.

In [13]:
from spacy.util import minibatch, compounding


def train_model(
    training_data: list,
    test_data: list,
    iterations: int
) -> None:
    # Build pipeline
    nlp = spacy.load("en_core_web_sm")
    if "textcat" not in nlp.pipe_names:
        textcat = nlp.create_pipe("textcat", config={"architecture": "simple_cnn"})
        nlp.add_pipe(textcat, last=True)
    else:
        textcat = nlp.get_pipe("textcat")

    textcat.add_label("pos")
    textcat.add_label("neg")

    # Train only textcat: every other pipe is disabled inside the with block
    training_excluded_pipes = [pipe for pipe in nlp.pipe_names if pipe != "textcat"]

    with nlp.disable_pipes(training_excluded_pipes):
        optimizer = nlp.begin_training()
        print("Beginning training...")
        # Generator of batch sizes that grows from 4 towards 32
        batch_sizes = compounding(4.0, 32.0, 1.001)

        # The loop stays inside the with block so the other pipes remain disabled
        for i in range(iterations):
            loss = {}
            random.shuffle(training_data)
            batches = minibatch(training_data, size=batch_sizes)
            for batch in batches:
                text, labels = zip(*batch)
                nlp.update(
                    text,
                    labels,
                    drop=0.2,
                    sgd=optimizer,
                    losses=loss
                )

Here we have used nlp.begin_training(), which returns the initial optimizer that nlp.update() uses to update the weights of the underlying model.

For each iteration we create an empty dictionary called loss, which is updated by nlp.update(). During each iteration we split the data into minibatches using minibatch().

Each batch is split into texts and labels, which we then pass to nlp.update(); this is the call that actually runs the training on the underlying model.
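To get a feel for how growing batch sizes and minibatching interact, here is a simplified stand-in written with the standard library only. `compounding_sizes` and `make_minibatches` are hypothetical names that mimic the general idea behind spaCy's `compounding()` and `minibatch()` (a geometric series of sizes capped at a maximum, consumed one size per batch); they are not spaCy's actual implementations.

```python
from itertools import islice


def compounding_sizes(start, stop, factor):
    """Yield start, start*factor, start*factor**2, ... capped at stop."""
    value = start
    while True:
        yield min(value, stop)
        value *= factor


def make_minibatches(items, sizes):
    """Cut items into consecutive batches, taking each batch size from sizes."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, int(next(sizes))))
        if not batch:
            return  # the data is exhausted
        yield batch


data = list(range(20))
# A large factor is used here so the growth is visible on a tiny dataset.
batches = list(make_minibatches(data, compounding_sizes(2.0, 8.0, 2.0)))
print([len(batch) for batch in batches])
```

With the tutorial's values (4.0, 32.0, 1.001) the growth is much slower: under this geometric model it would take roughly two thousand batches to reach the maximum size of 32.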

The drop parameter sets the dropout rate: the fraction of the model's internal values that are randomly dropped during each training update. This makes it harder for the model to simply memorize the training data instead of learning a generalizable pattern.
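As a rough illustration of what a dropout rate of 0.2 means, the sketch below zeroes each value in a list with probability 0.2 and rescales the survivors so the expected total stays the same (a common variant often called inverted dropout). This is a generic toy, not spaCy's internal code, and `apply_dropout` is a hypothetical helper.

```python
import random


def apply_dropout(activations, rate, rng):
    """Zero each value with probability `rate` and rescale the survivors."""
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


rng = random.Random(42)  # fixed seed so the run is repeatable
activations = [1.0] * 10000
dropped = apply_dropout(activations, rate=0.2, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
print(round(zero_fraction, 3))
```

Roughly one value in five is zeroed on each call, and a different random subset is chosen every time, so the model cannot rely on any single internal value.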

# Evaluating Model Performance



In [14]:
def evaluate_model(
    tokenizer, textcat, test_data: list
) -> dict:
    reviews, labels = zip(*test_data)
    reviews = (tokenizer(review) for review in reviews)
    true_positives = 0
    false_positives = 1e-8  # Can't be 0 because of presence in denominator
    true_negatives = 0
    false_negatives = 1e-8
    for i, review in enumerate(textcat.pipe(reviews)):
        true_label = labels[i]["cats"]
        for predicted_label, score in review.cats.items():
            # Every cats dictionary includes both labels. You can get all
            # the info you need with just the pos label.
            if (
                predicted_label == "neg"
            ):
                continue
            if score >= 0.5 and true_label["pos"]:
                true_positives += 1
            elif score >= 0.5 and true_label["neg"]:
                false_positives += 1
            elif score < 0.5 and true_label["neg"]:
                true_negatives += 1
            elif score < 0.5 and true_label["pos"]:
                false_negatives += 1
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)

    if precision + recall == 0:
        f_score = 0
    else:
        f_score = 2 * (precision * recall) / (precision + recall)
    return {"precision": precision, "recall": recall, "f-score": f_score}
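The three metrics returned by `evaluate_model()` can be checked by hand on a small example. The counts below are made up, and `precision_recall_f1` is a hypothetical helper that guards against empty denominators with a conditional instead of the `1e-8` trick used above.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F-score from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Made-up counts: 80 true positives, 20 false positives, 10 false negatives.
precision, recall, f_score = precision_recall_f1(tp=80, fp=20, fn=10)
print(precision, recall, f_score)
```

Precision answers "of the reviews predicted positive, how many really were?" (80 of 100), recall answers "of the truly positive reviews, how many did we find?" (80 of 90), and the F-score is their harmonic mean.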

In [15]:
def train_model(
    training_data: list,
    test_data: list,
    iterations: int = 20
) -> None:
    # Build pipeline
    nlp = spacy.load("en_core_web_sm")
    if "textcat" not in nlp.pipe_names:
        textcat = nlp.create_pipe(
            "textcat", config={"architecture": "simple_cnn"}
        )
        nlp.add_pipe(textcat, last=True)
    else:
        textcat = nlp.get_pipe("textcat")

    textcat.add_label("pos")
    textcat.add_label("neg")

    # Train only textcat
    training_excluded_pipes = [
        pipe for pipe in nlp.pipe_names if pipe != "textcat"
    ]
    with nlp.disable_pipes(training_excluded_pipes):
        optimizer = nlp.begin_training()
        # Training loop
        print("Beginning training")
        print("Loss\tPrecision\tRecall\tF-score")
        batch_sizes = compounding(
            4.0, 32.0, 1.001
        )  # A generator that yields an infinite series of batch sizes
        for i in range(iterations):
            print(f"Training iteration {i}")
            loss = {}
            random.shuffle(training_data)
            batches = minibatch(training_data, size=batch_sizes)
            for batch in batches:
                text, labels = zip(*batch)
                nlp.update(text, labels, drop=0.2, sgd=optimizer, losses=loss)
            with textcat.model.use_params(optimizer.averages):
                evaluation_results = evaluate_model(
                    tokenizer=nlp.tokenizer,
                    textcat=textcat,
                    test_data=test_data
                )
                print(
                    f"{loss['textcat']}\t{evaluation_results['precision']}"
                    f"\t{evaluation_results['recall']}"
                    f"\t{evaluation_results['f-score']}"
                )

    # Save model
    with nlp.use_params(optimizer.averages):
        nlp.to_disk("model_artifacts")

In [16]:
TEST_REVIEW = """
Transcendently beautiful in moments outside the office, it seems almost
sitcom-like in those scenes. When Toni Colette walks out and ponders
life silently, it's gorgeous.<br /><br />The movie doesn't seem to decide
whether it's slapstick, farce, magical realism, or drama, but the best of it
doesn't matter. (The worst is sort of tedious - like Office Space with less humor.)
"""

In [17]:
def test_model(input_data: str = TEST_REVIEW):
    #  Load saved trained model
    loaded_model = spacy.load("model_artifacts")
    # Generate prediction
    parsed_text = loaded_model(input_data)
    # Determine prediction to return
    if parsed_text.cats["pos"] > parsed_text.cats["neg"]:
        prediction = "Positive"
        score = parsed_text.cats["pos"]
    else:
        prediction = "Negative"
        score = parsed_text.cats["neg"]
    print(
        f"Review text: {input_data}\nPredicted sentiment: {prediction}"
        f"\tScore: {score}"
    )
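`test_model()` compares the two scores directly, which works because there are exactly two labels. If you later add more labels, one possible generalization is to pick the key with the highest score. The sketch below uses a plain dictionary with made-up scores in place of `parsed_text.cats`, and `top_label` is a hypothetical helper.

```python
def top_label(cats):
    """Return the label with the highest score together with that score."""
    label = max(cats, key=cats.get)
    return label, cats[label]


# Made-up scores in the same shape as a textcat "cats" dictionary.
label, score = top_label({"pos": 0.12, "neg": 0.88})
print(label, score)
```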

In [18]:
if __name__ == "__main__":
    train, test = load_training_data(limit=2500)
    train_model(train, test)
    print("Testing model")
    test_model()

Beginning training
Loss	Precision	Recall	F-score
Training iteration 0
11.505850926274434	0.7350746268382434	0.7635658914432726	0.7490494296293136
Training iteration 1
2.0383097512676613	0.8132780082650092	0.7596899224511748	0.7855711422530833
Training iteration 2
0.5732080414236407	0.8114754098028084	0.7674418604353704	0.7888446214825161
Training iteration 3
0.1899917852024373	0.82426778239229	0.7635658914432726	0.7927565392035106
Training iteration 4
0.07318920258921935	0.8262711864056664	0.755813953459077	0.7894736841785638
Training iteration 5
0.04007941439840579	0.8305084745410801	0.7596899224511748	0.7935222671743513
Training iteration 6
0.01971181024350699	0.8382978723047533	0.7635658914432726	0.7991886409412092
Training iteration 7
0.009295679886122343	0.8347457626764938	0.7635658914432726	0.7975708501701388
Training iteration 8
0.004880438206953386	0.8262711864056664	0.755813953459077	0.7894736841785638
Training iteration 9
0.002710866429310954	0.825531914858488	0.7519379844669

In [19]:
test_model(input_data = "I hate the movie")

Review text: I hate the movie
Predicted sentiment: Negative	Score: 0.9999545812606812


In [20]:
test_model("Love the movie")

Review text: Love the movie
Predicted sentiment: Positive	Score: 0.9999545812606812


In [21]:
test_model("This movie is not fun to watch, it has alot of boring characters, it should be improved")

Review text: This movie is not fun to watch, it has alot of boring characters, it should be improved
Predicted sentiment: Negative	Score: 0.9998980760574341


In [23]:
test_model("gaming is boring")

Review text: gaming is boring
Predicted sentiment: Negative	Score: 0.9999545812606812


In [27]:
test_model("That leads to very accurate data.")

Review text: That leads to very accurate data.
Predicted sentiment: Negative	Score: 0.9999545812606812


In [32]:
test_model(str(test[0]))

Review text: ("I rented this because I'm a bit weary of '80s NBC programming and apparently I saved myself a lot of money. I have nothing against any of the actors and for their credit they do a good job but this show is flawed from the premise.\n\n\n\nWe have a character who is unlikable. He's full of flaws, not enlightened, and a complete jerk on a good day. Yet the reason why anybody should care just isn't there. While creating an American sitcom centered around a complete bullheaded jackass is revolutionary and full of potential, it just isn't met here within this show. Most of the supporting characters aren't fully fleshed characters but rather sad punching bags that want empathy from the audience for being punching bags. As in any sitcom, they are the ones who are made the most normal for the audience to relate to, and in doing this they negate the lead character to such an extent that we see Bittinger being himself and harming people and they just stay there because....why? Ther