## Install

**These lines are from spaCy's docs:** https://spacy.io/usage

pip install -U spacy

or **conda**

conda install -c conda-forge spacy

I used the medium model (md) because it contains more data than the small one (sm)

python -m spacy download en_core_web_md

In [1]:
import pandas as pd

In [2]:
## train dataset url

## https://www.kaggle.com/c/tweet-sentiment-extraction/data?select=train.csv

## useful links

## https://www.dataquest.io/blog/tutorial-text-classification-in-python-using-spacy/
## https://towardsdatascience.com/machine-learning-for-text-classification-using-spacy-in-python-b276b4051a49
## https://www.kaggle.com/poonaml/text-classification-using-spacy

# Loading train dataset
dataset_train = pd.read_csv('../Data/train.csv')

In [3]:
dataset_train

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative
...,...,...,...,...
27476,4eac33d1c0,wish we could come see u on Denver husband l...,d lost,negative
27477,4f4c4fc327,I`ve wondered about rake to. The client has ...,", don`t force",negative
27478,f67aae2310,Yay good for both of you. Enjoy the break - y...,Yay good for both of you.,positive
27479,ed167662a5,But it was worth it ****.,But it was worth it ****.,positive


In [4]:
# Verifying quantity of each category
dataset_train.sentiment.value_counts()

neutral     11118
positive     8582
negative     7781
Name: sentiment, dtype: int64

### Preparing the data

In [5]:
# return 0 if negative and 1 if positive and 2 if neutral
def convert_sentiment(x):
    if x == 'negative':
        return 0
    elif x == 'positive':
        return 1
    else:
        return 2
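
The same mapping can also be written as a dictionary lookup, which keeps the label codes in one place. This is only an alternative sketch: `SENTIMENT_CODES` and `convert_sentiment_lookup` are hypothetical names that are not part of the original notebook.

```python
# Hypothetical alternative to convert_sentiment: a lookup table.
SENTIMENT_CODES = {"negative": 0, "positive": 1}

def convert_sentiment_lookup(label):
    # Anything that is not 'negative' or 'positive' falls back to 2,
    # mirroring the else branch of convert_sentiment.
    return SENTIMENT_CODES.get(label, 2)

codes = [convert_sentiment_lookup(s) for s in ["negative", "positive", "neutral"]]
print(codes)  # [0, 1, 2]
```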

In [6]:
dataset_train2 = dataset_train.copy()  # work on a copy so re-running this cell does not re-map already converted labels
dataset_train2['sentiment'] = dataset_train2['sentiment'].apply(convert_sentiment)

In [7]:
# For each row (apply with axis=1), take the text and sentiment columns, pack them into a
#  (text, sentiment) tuple and store it in a new column of the dataset, called tuples
dataset_train2['tuples'] = dataset_train2.apply(
    lambda row: (row['text'],row['sentiment']), axis=1)

In [8]:
dataset_train2

Unnamed: 0,textID,text,selected_text,sentiment,tuples
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",2,"( I`d have responded, if I were going, 2)"
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,0,( Sooo SAD I will miss you here in San Diego!!...
2,088c60f138,my boss is bullying me...,bullying me,0,"(my boss is bullying me..., 0)"
3,9642c003ef,what interview! leave me alone,leave me alone,0,"( what interview! leave me alone, 0)"
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",0,"( Sons of ****, why couldn`t they put them on ..."
...,...,...,...,...,...
27476,4eac33d1c0,wish we could come see u on Denver husband l...,d lost,0,( wish we could come see u on Denver husband ...
27477,4f4c4fc327,I`ve wondered about rake to. The client has ...,", don`t force",0,( I`ve wondered about rake to. The client has...
27478,f67aae2310,Yay good for both of you. Enjoy the break - y...,Yay good for both of you.,1,( Yay good for both of you. Enjoy the break - ...
27479,ed167662a5,But it was worth it ****.,But it was worth it ****.,1,"( But it was worth it ****., 1)"


In [9]:
# Converting the new column to a list
train = dataset_train2['tuples'].tolist()

In [10]:
train[:5]

[(' I`d have responded, if I were going', 2),
 (' Sooo SAD I will miss you here in San Diego!!!', 0),
 ('my boss is bullying me...', 0),
 (' what interview! leave me alone', 0),
 (' Sons of ****, why couldn`t they put them on the releases we already bought',
  0)]

In [11]:
# use a list comprehension to keep only 'negative' or 'positive' comments (0 or 1, i.e. False or True)
train2 = [item for item in train if item[1] != 2]

In [12]:
train2[:5]

[(' Sooo SAD I will miss you here in San Diego!!!', 0),
 ('my boss is bullying me...', 0),
 (' what interview! leave me alone', 0),
 (' Sons of ****, why couldn`t they put them on the releases we already bought',
  0),
 ('2am feedings for the baby are fun when he is all smiles and coos', 1)]

### Functions from spacy docs

https://spacy.io/usage/examples

Block **Training spaCy's text classifier**

In [70]:
def load_data(limit=0, split=0.7):
    train_data = train2
    np.random.shuffle(train_data)
    train_data = train_data[-limit:]
    texts, labels = zip(*train_data)
    cats = [{"POSITIVE": bool(y)} for y in labels]
    split = int(len(train_data) * split)
    return (texts[:split], cats[:split]), (texts[split:], cats[split:])

def evaluate(tokenizer, textcat, texts, cats):
    docs = (tokenizer(text) for text in texts)
    tp = 1e-8  # True positives
    fp = 1e-8  # False positives
    fn = 1e-8  # False negatives
    tn = 1e-8  # True negatives
    for i, doc in enumerate(textcat.pipe(docs)):
        gold = cats[i]
        for label, score in doc.cats.items():
            if label not in gold:
                continue
            if score >= 0.5 and gold[label] >= 0.5:
                tp += 1.
            elif score >= 0.5 and gold[label] < 0.5:
                fp += 1.
            elif score < 0.5 and gold[label] < 0.5:
                tn += 1
            elif score < 0.5 and gold[label] >= 0.5:
                fn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * (precision * recall) / (precision + recall)
    return {'textcat_p': precision, 'textcat_r': recall, 'textcat_f': f_score}
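
To make the shuffle-and-split step in `load_data` easier to follow, here is a minimal standalone sketch using only the standard library. The helper name `split_examples`, the `seed` parameter and the toy examples are assumptions for illustration; the notebook itself shuffles `train2` in place with `np.random.shuffle` and no fixed seed.

```python
import random

def split_examples(examples, split=0.7, seed=0):
    """Shuffle a copy of (text, label) pairs and split it into train/dev parts."""
    shuffled = list(examples)              # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    cut = int(len(shuffled) * split)       # same cut-off rule as load_data
    return shuffled[:cut], shuffled[cut:]

# Toy data standing in for train2
toy = [("text %d" % i, i % 2) for i in range(20)]
train_part, dev_part = split_examples(toy, split=0.5)
print(len(train_part), len(dev_part))
```

Copying the list before shuffling is a design choice: it keeps the original order available if the cell is run again.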


## Using spaCy to analyze the comments

**Command:** python -m spacy download en_core_web_md

Reference: https://spacy.io/usage

In [71]:
import spacy
import numpy as np

In [72]:
nlp = spacy.load("en_core_web_md")

Code from https://www.kaggle.com/poonaml/text-classification-using-spacy

In [73]:
# add the text classifier to the pipeline if it doesn't exist
# nlp.create_pipe works for built-ins that are registered with spaCy
if 'textcat' not in nlp.pipe_names:
    textcat = nlp.create_pipe('textcat')
    nlp.add_pipe(textcat, last=True)
# otherwise, get it, so we can add labels to it
else:
    textcat = nlp.get_pipe('textcat')

In [74]:
# add label to text classifier
textcat.add_label('POSITIVE')

1

### Loading the train and test1 data

It is called test1 because this first test uses the Twitter dataset; after that, I'll use the real data, from YouTube comments

In [75]:
# Vars from SpaCy's docs
n_texts=10000 # "Number of texts to train from", "option", "t", int
n_iter=15 #"Number of training iterations", "option", "n", int

In [76]:
print('Loading data...')
(train_texts, train_cats), (dev_texts, dev_cats) = load_data(limit=n_texts)

print("Using {} examples ({} training, {} evaluation)"
     .format(n_texts, len(train_texts), len(dev_texts)))

Loading data...
Using 10000 examples (7000 training, 3000 evaluation)


In [77]:
train_data = list(zip(train_texts, [{'cats': cats} for cats in train_cats]))

### Checking the new data

Now the classes of our data are POSITIVE: True or POSITIVE: False

True for a positive phrase

False for a negative phrase
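
The conversion from the numeric labels to this format can be sketched in plain Python as follows. The helper name `to_textcat_example` and the toy pairs are assumptions for illustration; the notebook builds the same structure through `load_data` and the `zip` call in the previous cell.

```python
def to_textcat_example(text, label):
    # label is 0 (negative) or 1 (positive); bool() turns it into False/True
    return (text, {"cats": {"POSITIVE": bool(label)}})

# Two tweets taken from the dataset preview, with their numeric labels
pairs = [("my boss is bullying me...", 0), ("Yay good for both of you.", 1)]
examples = [to_textcat_example(text, label) for text, label in pairs]
for example in examples:
    print(example)
```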

In [80]:
train_data[:5]

[('He is so silly.  http://twitpic.com/4jk6b', {'cats': {'POSITIVE': False}}),
 ('ok, back to the dentist today. All I want to do is bask in the sun',
  {'cats': {'POSITIVE': True}}),
 (' Hi bunny! I recently have subcribed to your channel on YouTube! You make some great stuff. Kinda just wanted to say hi!',
  {'cats': {'POSITIVE': True}}),
 ('well.. all my slacking off earned me a D and a C   but at least everything else are A`s and B`s ^^  next school year all B`s and A`s Esh!',
  {'cats': {'POSITIVE': True}}),
 (' iknowww! Not many people know about it tho. So I like to keep it my little secret',
  {'cats': {'POSITIVE': True}})]

## Training

This block is from spaCy's docs: https://spacy.io/usage/examples

Some lines were modified following https://www.kaggle.com/poonaml/text-classification-using-spacy

In [85]:
from spacy.util import minibatch, compounding

#### Small Explanation

spaCy's text classifier uses a multi-label convolutional neural network, and training can take a long time.

The variable **n_iter** is the number of passes over the training data in the training loop, so large values mean long training times.

It all depends on the size of your dataset.
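
To get a feel for how much work one iteration involves, the sketch below imitates the batching used in the training cell, where batch sizes start at 4 and grow by a factor of 1.001 per batch up to a cap of 32. `compounding_sizes` and `count_batches` are hypothetical pure-Python stand-ins written for illustration, not spaCy's own implementation, so treat the resulting numbers as approximate.

```python
import itertools

def compounding_sizes(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    size = float(start)
    while True:
        yield min(size, stop)
        size *= compound

def count_batches(n_examples, sizes):
    """Count how many batches are needed to consume n_examples items."""
    batches = 0
    remaining = n_examples
    for size in sizes:
        if remaining <= 0:
            break
        remaining -= int(size)  # batch sizes are truncated to whole numbers
        batches += 1
    return batches

first_sizes = list(itertools.islice(compounding_sizes(4.0, 32.0, 1.001), 3))
print(first_sizes)
# Number of update steps per iteration for 7000 training examples
print(count_batches(7000, compounding_sizes(4.0, 32.0, 1.001)))
```

Multiplying the number of batches by `n_iter` gives a rough count of update steps for the whole run.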

In [86]:
# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'textcat']
with nlp.disable_pipes(*other_pipes):  # only train textcat
    optimizer = nlp.begin_training()
    print("Training the model...")
    print('{:^5}\t{:^5}\t{:^5}\t{:^5}'.format('LOSS', 'P', 'R', 'F'))
    for i in range(n_iter):
        losses = {}
        # batch up the examples using spaCy's minibatch
        batches = minibatch(train_data, size=compounding(4., 32., 1.001))
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.2,
                       losses=losses)
        with textcat.model.use_params(optimizer.averages):
            # evaluate on the dev data split off in load_data()
            scores = evaluate(nlp.tokenizer, textcat, dev_texts, dev_cats)
        print('{0:.3f}\t{1:.3f}\t{2:.3f}\t{3:.3f}'  # print a simple table
              .format(losses['textcat'], scores['textcat_p'],
                      scores['textcat_r'], scores['textcat_f']))

Training the model...
LOSS 	  P  	  R  	  F  
6.494	0.898	0.875	0.886
3.426	0.905	0.890	0.898
2.444	0.908	0.909	0.908
1.959	0.907	0.902	0.905
1.573	0.907	0.894	0.900
1.306	0.902	0.899	0.901
0.958	0.905	0.893	0.899
0.778	0.900	0.893	0.896
0.789	0.900	0.891	0.895
0.658	0.895	0.888	0.892
0.571	0.889	0.888	0.889
0.527	0.886	0.881	0.884
0.482	0.890	0.874	0.882
0.358	0.888	0.878	0.883
0.351	0.889	0.878	0.883
0.396	0.883	0.881	0.882
0.410	0.885	0.883	0.884
0.301	0.886	0.881	0.883
0.353	0.886	0.883	0.884
0.307	0.888	0.882	0.885


#### More explanation

The training loop above prints some information:

LOSS, P, R, F

**LOSS** - We can read this value as the accumulated training error of our model (lower is better)

**P means Precision**

- Precision is the ratio of correctly predicted positive observations to the total predicted positive observations.
 - **Example:** Of all the passengers that we labeled as survived, how many actually survived?
 - TruePositives / (TruePositives + FalsePositives)
 
**R means Recall** (also known as sensitivity)
- Recall is the ratio of correctly predicted positive observations to all observations that are actually in the positive class.
 - **Example:** Of all the passengers that truly survived, how many did we label?
 - TruePositives / (TruePositives + FalseNegatives)
 
**F means F-Score** 
- F-score is the harmonic mean of Precision and Recall. A high F-score is not possible with an excellent precision and a terrible recall, or with a terrible precision and an excellent recall, so it expresses both concerns in a single score.
 - F-Score = 2*(Recall * Precision) / (Recall + Precision)
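
As a small worked example of the three formulas, the sketch below computes them from raw counts. The counts are made up for illustration and are not taken from the model above; `precision_recall_f` is a hypothetical helper name.

```python
def precision_recall_f(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * (recall * precision) / (recall + precision)
    return precision, recall, f_score

# Made-up counts: 90 true positives, 10 false positives, 30 false negatives
p, r, f = precision_recall_f(tp=90, fp=10, fn=30)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.9 0.75 0.818
```

Note that the F-score (0.818) sits closer to the lower of the two values than a plain average (0.825) would.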
 
Our values for P, R and F are good, around 0.88
 

#### References
- https://machinelearningmastery.com/precision-recall-and-f-measure-for-imbalanced-classification/#:~:text=Precision%20quantifies%20the%20number%20of,and%20recall%20in%20one%20number.
- https://blog.exsilio.com/all/accuracy-precision-recall-f1-score-interpretation-of-performance-measures/
- https://medium.com/@vilsonrodrigues/machine-learning-o-que-s%C3%A3o-acurracy-precision-recall-e-f1-score-f16762f165b0

### Saving the model

In [93]:
output_dir=%pwd
nlp.to_disk(output_dir)
print("Saved model to", output_dir)

Saved model to D:\backup\Documentos\PROGRAMACAO\InteligenciaArtificial\NLP\YouTubeNPL\Notebooks


In [94]:
# Test the saved model
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)

Loading from D:\backup\Documentos\PROGRAMACAO\InteligenciaArtificial\NLP\YouTubeNPL\Notebooks


### It's time to test

In [87]:
# Loading comments dataset
dataset = pd.read_csv('../Data/EuroPython Conference/comments.csv')

In [99]:
comments = list(dataset['Comment'])

In [120]:
doc = nlp2(comments[2])
comments[2], doc.cats

('Nice, very "complete" talk. I\'d like to mention a very recent CLI package I created though called cliche. You can find out more about it here https://github.com/kootenpv/cliche',
 {'POSITIVE': 0.9899575114250183})

A result close to 1 is a **positive** classification

In [122]:
doc = nlp2(comments[12])
comments[12], doc.cats

("That's really pathetic.", {'POSITIVE': 0.0014924798160791397})

A result close to 0 is a **negative** classification
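
To turn these scores into labels, one option is a simple cut-off at 0.5, the same threshold the `evaluate` function uses. The helper `label_from_cats` is a hypothetical name, and the scores below are copied from the two outputs above.

```python
def label_from_cats(cats, threshold=0.5):
    # cats is the dict found in doc.cats, e.g. {'POSITIVE': 0.98}
    return "positive" if cats["POSITIVE"] >= threshold else "negative"

print(label_from_cats({"POSITIVE": 0.9899575114250183}))     # positive
print(label_from_cats({"POSITIVE": 0.0014924798160791397}))  # negative
```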

### MISSION COMPLETE

**For now**

#### Observations

- We got **0.88** for Precision, Recall and F-Score, but this number **could be better** if the data were preprocessed, because some text-cleaning techniques could improve the accuracy of the model (this can be done in future versions)
- This project uses the English language and could be adapted to **other languages** easily
- This is the **first version** of the project, so there are a lot of things that can be improved
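
As one possible starting point for the text treatment mentioned in the first observation, the sketch below strips URLs and @mentions and collapses repeated whitespace. The function name `clean_text` and the choice of rules are assumptions; whether they actually improve the scores would need to be checked by retraining.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")

def clean_text(text):
    text = URL_RE.sub(" ", text)      # drop links
    text = MENTION_RE.sub(" ", text)  # drop @user mentions
    return " ".join(text.split())     # collapse runs of whitespace

# A tweet from the training data shown earlier
print(clean_text("He is so silly.  http://twitpic.com/4jk6b"))  # He is so silly.
```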