# spaCy - continuation (English)

In [1]:
import spacy
from spacy.cli import download
print(download('es_core_news_sm'))

✔ Download and installation successful
You can now load the package via spacy.load('es_core_news_sm')
None


In [2]:
#pip install spacy

import spacy
from spacy.cli import download
print(download('en_core_web_sm'))

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
None


In [3]:
import spacy
nlp = spacy.load('en_core_web_sm')

# Sentence Detection

Sentence detection is the process of locating the start and end of sentences in a given text. This allows you to divide a text into linguistically meaningful units. You’ll use these units when you’re processing your text to perform tasks such as part-of-speech tagging and entity extraction.

In [4]:
about_text = ('Gus Proto is a Python developer currently'
               ' working for a London-based Fintech'
               ' company. He is interested in learning'
               ' Natural Language Processing.')

In [5]:
text_doc=nlp(about_text)
list_sents=list(text_doc.sents)
print(len(list_sents))
print((list_sents))

2
[Gus Proto is a Python developer currently working for a London-based Fintech company., He is interested in learning Natural Language Processing.]
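Since sentence detection is about locating where each sentence starts and ends, it can help to see those boundaries as character offsets. The sketch below is a deliberately naive, regex-based stand-in and not spaCy's algorithm: the helper name `naive_sentence_spans` is hypothetical, and it assumes every sentence ends with `.`, `!` or `?`.

```python
import re

def naive_sentence_spans(text):
    """Return (start, end, sentence) triples using a simple punctuation rule.

    This is NOT how spaCy segments text; it only illustrates the idea of
    locating sentence boundaries as character offsets.
    """
    spans = []
    # A "sentence" here is any run of characters up to and including . ! or ?
    for match in re.finditer(r"[^.!?]+[.!?]", text):
        raw = match.group()
        leading = len(raw) - len(raw.lstrip())  # skip the space after the previous sentence
        spans.append((match.start() + leading, match.end(), raw.strip()))
    return spans

about_text = ('Gus Proto is a Python developer currently'
              ' working for a London-based Fintech'
              ' company. He is interested in learning'
              ' Natural Language Processing.')

spans = naive_sentence_spans(about_text)
for start, end, sentence in spans:
    print(start, end, sentence)
```

A rule this simple breaks on abbreviations such as "Dr." or on decimal numbers, which is one reason to rely on a trained pipeline for real text.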


# Tokenization in spaCy

Tokenization is the next step after sentence detection. It allows you to identify the basic units in your text. These basic units are called tokens. Tokenization is useful because it breaks a text into meaningful units. These units are used for further analysis, like part of speech tagging.

In [6]:
introduction_text = ('This tutorial is about Natural Language Processing in Spacy.')
# Calling nlp() on a string creates a Doc object, a container for the tokens
introduction_doc = nlp(introduction_text)
print([token.text for token in introduction_doc])

['This', 'tutorial', 'is', 'about', 'Natural', 'Language', 'Processing', 'in', 'Spacy', '.']


In [7]:
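file_name = 'introduction.txt'
# Use a context manager so the file is closed after reading
with open(file_name, encoding='utf-8') as input_file:
    introduction_file_text = input_file.read()
introduction_doc = nlp(introduction_file_text)
result = [token.text for token in introduction_doc]
print(result)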

['This', 'tutorial', 'is', 'about', 'Natural', 'Language', 'Processing', 'in', 'Spacy', '.', '\n']


In [8]:
for token in text_doc:
    print(token, token.idx)

Gus 0
Proto 4
is 10
a 13
Python 15
developer 22
currently 32
working 42
for 50
a 54
London 56
- 62
based 63
Fintech 69
company 77
. 84
He 86
is 89
interested 92
in 103
learning 106
Natural 115
Language 123
Processing 132
. 142
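The numbers printed next to each token are character offsets into the original string. As a rough illustration, the sketch below reproduces such offsets with a plain regular expression and checks that slicing the text at each offset returns the token. The regex tokenizer is an assumption made for this example; it is far simpler than spaCy's tokenizer and only happens to agree with it on this text.

```python
import re

about_text = ('Gus Proto is a Python developer currently'
              ' working for a London-based Fintech'
              ' company. He is interested in learning'
              ' Natural Language Processing.')

# Words (\w+) or single punctuation marks; whitespace is skipped.
tokens = [(m.group(), m.start()) for m in re.finditer(r"\w+|[^\w\s]", about_text)]

for token_text, idx in tokens[:5]:
    # Slicing the original string at idx gives the token back.
    print(token_text, idx, about_text[idx:idx + len(token_text)])
```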


# Token attributes
*    `idx` - the character offset of the token within the parent document.
*    `is_alpha` - whether the token consists of alphabetic characters.
*    `is_punct` - whether the token is punctuation.
*    `is_space` - whether the token consists of whitespace characters.
*    `shape_` - the orthographic shape of the word (for example, `Xxxxx` for `Python`).
*    `is_stop` - whether the token is a stop word.

https://spacy.io/api/token#attributes
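To get a feel for what some of these attributes compute, here is a small plain-Python approximation. The function `approx_shape` is a hypothetical helper written for this notebook, not part of spaCy; it assumes that letters map to `X`/`x`, digits map to `d`, and runs of the same symbol are capped at four characters.

```python
def approx_shape(text):
    """Approximate a token shape: X for uppercase, x for lowercase, d for digits."""
    shape = []
    last_char = ""
    run_length = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and symbols are kept as-is
        if shape_char == last_char:
            run_length += 1
        else:
            run_length = 1
            last_char = shape_char
        if run_length <= 4:  # assumption: runs are capped at four characters
            shape.append(shape_char)
    return "".join(shape)

for word in ["Gus", "London", "2019", "+1", "."]:
    # str.isalpha() plays a role similar to is_alpha for plain strings
    print(word, word.isalpha(), approx_shape(word))
```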

In [9]:
for token in text_doc:
#     print (token, token.idx)
#    print (token, token.is_lower)
#    print (token, token.lower_)
#    print (token, token.is_alpha)
#     print (token, token.is_punct)
#     print (token, token.is_space)    
#     print (token, token.shape_)     
     print (token, token.is_stop)       

Gus False
Proto False
is True
a True
Python False
developer False
currently False
working False
for True
a True
London False
- False
based False
Fintech False
company False
. False
He True
is True
interested False
in True
learning False
Natural False
Language False
Processing False
. False


# Stop Words

Stop words are the most common words in a language. In English, some examples of stop words are *the*, *are*, *but*, and *they*. Most sentences need stop words in order to be full sentences that make sense, but because these words carry little meaning on their own they are often removed before analysis.

## english

In [10]:
import spacy
nlp = spacy.load("en_core_web_sm")
spacy_stopwords_en = spacy.lang.en.stop_words.STOP_WORDS
print(len(spacy_stopwords_en))
print()
for stop_word in list(spacy_stopwords_en)[:10]:
     print(stop_word)

326

ours
whereas
done
everything
very
these
even
last
beside
off


## spanish

In [11]:
import spacy
nlp = spacy.load('es_core_news_sm')
spacy_stopwords_es = spacy.lang.es.stop_words.STOP_WORDS
print(len(spacy_stopwords_es))
print()
for stop_word in list(spacy_stopwords_es)[:10]:
     print(stop_word)

551

valor
sola
un
algunas
estais
trabaja
conseguimos
largo
bajo
pasada


## back to english

In [12]:
for token in text_doc:
     if not token.is_stop:
         print(token)

Gus
Proto
Python
developer
currently
working
London
-
based
Fintech
company
.
interested
learning
Natural
Language
Processing
.


In [13]:
list_spacy_stopwords_en = list(spacy_stopwords_en)
for token in text_doc:
#    if token not in list_spacy_stopwords_en[0:10]:
    if token.text not in list_spacy_stopwords_en:
        print(token)

Gus
Proto
Python
developer
currently
working
London
-
based
Fintech
company
.
He
interested
learning
Natural
Language
Processing
.
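Notice that "He" appears in this second output although `token.is_stop` filtered it out earlier. The stop-word list stores lowercase entries, so an exact string comparison misses capitalized tokens. The sketch below shows the difference with a tiny hand-picked stop-word set (an assumption for illustration, not spaCy's full list).

```python
# A tiny, hand-picked stop-word set for illustration only (not spaCy's list).
stop_words = {"he", "is", "a", "in", "for"}

tokens = ["He", "is", "interested", "in", "learning"]

case_sensitive = [t for t in tokens if t not in stop_words]
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

print(case_sensitive)    # "He" survives the exact-match test
print(case_insensitive)  # lowercasing first removes it
```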


# Lemmatization

Lemmatization is the process of reducing inflected forms of a word while still ensuring that the reduced form belongs to the language. This reduced form or root word is called a lemma.


    comienza -> comenzar
    comenzarán -> comenzar
    clases -> clase

# Stemming

There is another way to reduce the number of tokens. This process consists of simply trimming the words to reduce them to a common base:

    comienza -> comienz
    comenzarán -> comenz
    clases -> clas


These two techniques are considered mutually exclusive: you apply one or the other, never both. But which one is recommended?

In general, lemmatization is preferred, since it is a good compromise between reducing the number of distinct tokens and preserving more of their original form. Stemming is more aggressive and tends to lose more information.

In other words:
* Stemming drops the end of the word to retain a stable root. It is fast, but sometimes the results are difficult to interpret.
* Lemmatization is smarter and takes into account the meaning of the word.
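The contrast can be sketched with two toy functions applied to the three Spanish examples above. Both the suffix list and the lemma dictionary are made up for these three words only; real stemmers and lemmatizers use much larger rule sets and vocabularies.

```python
# Toy suffix list and lemma dictionary, made up for the three example words.
SUFFIXES = ["arán", "es", "a"]  # checked in this order
LEMMAS = {"comienza": "comenzar", "comenzarán": "comenzar", "clases": "clase"}

def toy_stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

def toy_lemma(word):
    """Look the word up in a dictionary; fall back to the word itself."""
    return LEMMAS.get(word, word)

for word in ["comienza", "comenzarán", "clases"]:
    print(word, "->", toy_stem(word), "|", toy_lemma(word))
```

The stemmer leaves two different stems (`comienz`, `comenz`) for the same verb, while the lemma lookup maps both forms to `comenzar`.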

In [14]:
import spacy
nlp = spacy.load("en_core_web_sm")
# Each continuation string starts with a space so that adjacent words are not fused
conference_help_text = ('Gus is helping organize a developer'
     ' conference on Applications of Natural Language'
     ' Processing. He keeps organizing local Python meetups'
     ' and several internal talks at his workplace.')
conference_help_doc = nlp(conference_help_text)
for token in conference_help_doc:
     print (token, token.lemma_)

Gus Gus
is be
helping helping
organize organize
a a
developer developer
conference conference
on on
Applications application
of of
Natural Natural
Language Language
Processing Processing
. .
He he
keeps keep
organizing organize
local local
Python Python
meetups meetup
and and
several several
internal internal
talks talk
at at
his his
workplace workplace
. .


In [15]:
string = """
The crew of the USS Discovery discovered many discoveries.
Discovering is what explorers do."""
conference_help_doc = nlp(string)
for token in conference_help_doc:
     print (token, token.lemma_)


 

The the
crew crew
of of
the the
USS USS
Discovery Discovery
discovered discover
many many
discoveries discovery
. .

 

Discovering Discovering
is be
what what
explorers explorer
do do
. .


# Word Frequency

In [16]:
from collections import Counter
complete_text = ('Gus Proto is a Python developer currently '
'working for a London-based Fintech company. He is'
' interested in learning Natural Language Processing.'
' There is a developer conference happening on 21 July'
' 2019 in London. It is titled "Applications of Natural'
' Language Processing". There is a helpline number '
' available at +1-1234567891. Gus is helping organize it.'
' He keeps organizing local Python meetups and several'
' internal talks at his workplace. Gus is also presenting'
' a talk. The talk will introduce the reader about "Use'
' cases of Natural Language Processing in Fintech".'
' Apart from his work, he is very passionate about music.'
' Gus is learning to play the Piano. He has enrolled '
' himself in the weekend batch of Great Piano Academy.'
' Great Piano Academy is situated in Mayfair or the City'
' of London and has world-class piano instructors.')

complete_doc = nlp(complete_text)

In [17]:
words = [token.text for token in complete_doc if not token.is_stop and not token.is_punct and not token.is_space]

In [18]:
print(len(words))
print(words)

80
['Gus', 'Proto', 'Python', 'developer', 'currently', 'working', 'London', 'based', 'Fintech', 'company', 'interested', 'learning', 'Natural', 'Language', 'Processing', 'developer', 'conference', 'happening', '21', 'July', '2019', 'London', 'titled', 'Applications', 'Natural', 'Language', 'Processing', 'helpline', 'number', 'available', '+1', '1234567891', 'Gus', 'helping', 'organize', 'keeps', 'organizing', 'local', 'Python', 'meetups', 'internal', 'talks', 'workplace', 'Gus', 'presenting', 'talk', 'talk', 'introduce', 'reader', 'Use', 'cases', 'Natural', 'Language', 'Processing', 'Fintech', 'Apart', 'work', 'passionate', 'music', 'Gus', 'learning', 'play', 'Piano', 'enrolled', 'weekend', 'batch', 'Great', 'Piano', 'Academy', 'Great', 'Piano', 'Academy', 'situated', 'Mayfair', 'City', 'London', 'world', 'class', 'piano', 'instructors']


In [19]:
word_freq = Counter(words)
word_freq

Counter({'Gus': 4,
         'Proto': 1,
         'Python': 2,
         'developer': 2,
         'currently': 1,
         'working': 1,
         'London': 3,
         'based': 1,
         'Fintech': 2,
         'company': 1,
         'interested': 1,
         'learning': 2,
         'Natural': 3,
         'Language': 3,
         'Processing': 3,
         'conference': 1,
         'happening': 1,
         '21': 1,
         'July': 1,
         '2019': 1,
         'titled': 1,
         'Applications': 1,
         'helpline': 1,
         'number': 1,
         'available': 1,
         '+1': 1,
         '1234567891': 1,
         'helping': 1,
         'organize': 1,
         'keeps': 1,
         'organizing': 1,
         'local': 1,
         'meetups': 1,
         'internal': 1,
         'talks': 1,
         'workplace': 1,
         'presenting': 1,
         'talk': 2,
         'introduce': 1,
         'reader': 1,
         'Use': 1,
         'cases': 1,
         'Apart': 1,
         'work': 

In [20]:
word_freq.most_common(5)

[('Gus', 4), ('London', 3), ('Natural', 3), ('Language', 3), ('Processing', 3)]

In [21]:
word_freq.items()

dict_items([('Gus', 4), ('Proto', 1), ('Python', 2), ('developer', 2), ('currently', 1), ('working', 1), ('London', 3), ('based', 1), ('Fintech', 2), ('company', 1), ('interested', 1), ('learning', 2), ('Natural', 3), ('Language', 3), ('Processing', 3), ('conference', 1), ('happening', 1), ('21', 1), ('July', 1), ('2019', 1), ('titled', 1), ('Applications', 1), ('helpline', 1), ('number', 1), ('available', 1), ('+1', 1), ('1234567891', 1), ('helping', 1), ('organize', 1), ('keeps', 1), ('organizing', 1), ('local', 1), ('meetups', 1), ('internal', 1), ('talks', 1), ('workplace', 1), ('presenting', 1), ('talk', 2), ('introduce', 1), ('reader', 1), ('Use', 1), ('cases', 1), ('Apart', 1), ('work', 1), ('passionate', 1), ('music', 1), ('play', 1), ('Piano', 3), ('enrolled', 1), ('weekend', 1), ('batch', 1), ('Great', 2), ('Academy', 2), ('situated', 1), ('Mayfair', 1), ('City', 1), ('world', 1), ('class', 1), ('piano', 1), ('instructors', 1)])

In [22]:
[word for (word, freq) in word_freq.items() if freq == 1]

['Proto',
 'currently',
 'working',
 'based',
 'company',
 'interested',
 'conference',
 'happening',
 '21',
 'July',
 '2019',
 'titled',
 'Applications',
 'helpline',
 'number',
 'available',
 '+1',
 '1234567891',
 'helping',
 'organize',
 'keeps',
 'organizing',
 'local',
 'meetups',
 'internal',
 'talks',
 'workplace',
 'presenting',
 'introduce',
 'reader',
 'Use',
 'cases',
 'Apart',
 'work',
 'passionate',
 'music',
 'play',
 'enrolled',
 'weekend',
 'batch',
 'situated',
 'Mayfair',
 'City',
 'world',
 'class',
 'piano',
 'instructors']

# Part of Speech Tagging


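Part of speech (POS) is the grammatical role a word plays in a sentence. English traditionally has eight parts of speech:

* Noun
* Pronoun
* Adjective
* Verb
* Adverb
* Preposition
* Conjunction
* Interjection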


In [23]:
for token in complete_doc:
     print (token, token.pos_)

Gus PROPN
Proto PROPN
is AUX
a DET
Python PROPN
developer NOUN
currently ADV
working VERB
for ADP
a DET
London PROPN
- PUNCT
based VERB
Fintech PROPN
company NOUN
. PUNCT
He PRON
is AUX
interested ADJ
in ADP
learning VERB
Natural PROPN
Language PROPN
Processing PROPN
. PUNCT
There PRON
is VERB
a DET
developer NOUN
conference NOUN
happening VERB
on ADP
21 NUM
July PROPN
2019 NUM
in ADP
London PROPN
. PUNCT
It PRON
is AUX
titled VERB
" PUNCT
Applications NOUN
of ADP
Natural PROPN
Language PROPN
Processing PROPN
" PUNCT
. PUNCT
There PRON
is VERB
a DET
helpline ADJ
number NOUN
  SPACE
available ADJ
at ADP
+1 PROPN
- PUNCT
1234567891 NUM
. PUNCT
Gus PROPN
is AUX
helping VERB
organize VERB
it PRON
. PUNCT
He PRON
keeps VERB
organizing VERB
local ADJ
Python PROPN
meetups NOUN
and CCONJ
several ADJ
internal ADJ
talks NOUN
at ADP
his PRON
workplace NOUN
. PUNCT
Gus PROPN
is AUX
also ADV
presenting VERB
a DET
talk NOUN
. PUNCT
The DET
talk NOUN
will AUX
introduce VERB
the DET
reader NOUN
about 

## extracting adjectives only

In [24]:
for token in complete_doc:
     if token.pos_=="ADJ":
         print (token)

interested
helpline
available
local
several
internal
passionate


# Named Entity Recognition
Named Entity Recognition (NER) is the process of locating named entities in unstructured text and then classifying them into pre-defined categories, such as person names, organizations, locations, monetary values, percentages, time expressions, and so on.

In [25]:
piano_class_text = ("""Great Piano Academy is situated in Mayfair or the City of 
                    London and has world-class piano instructors. 12 August 2022""")
piano_class_doc = nlp(piano_class_text)
print(piano_class_doc.ents)
for ent in piano_class_doc.ents:
     print(ent.text, ent.label_)

(Great Piano Academy, Mayfair, the City of 
                    , London, 12, August 2022)
Great Piano Academy ORG
Mayfair LOC
the City of 
                     GPE
London GPE
12 CARDINAL
August 2022 DATE
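The entity `the City of` in the output above drags along the line break and the indentation of the triple-quoted string. One common way to avoid this is to collapse whitespace before passing the text to `nlp`. The sketch below shows only that string clean-up step; it does not re-run the model, so no entity output is claimed here.

```python
piano_class_text = """Great Piano Academy is situated in Mayfair or the City of
                    London and has world-class piano instructors. 12 August 2022"""

# str.split() with no arguments splits on any run of whitespace (spaces,
# tabs, newlines), so joining with a single space collapses all of them.
normalized_text = " ".join(piano_class_text.split())
print(normalized_text)
```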


In [26]:
from spacy import displacy
displacy.render(piano_class_doc, style='ent', jupyter=True)

# Why would we want to "reduce" the amount of information through these transformations?

Removing stopwords and symbols, and applying lemmatization or stemming, all reduce the number of unique elements in our dataset. The aim is to improve the performance of our algorithm in two ways:

* Eliminating stopwords removes common words that have little discriminative value between texts. Likewise, for many problems we do not need to know the tense in which a verb was written, or whether the word was "corrupt" or "corruption"; with base forms the algorithm can “learn” the general idea.

* The same idea can be applied to very rare tokens in our text: we could remove tokens that do not appear more than $X$ times, on the suspicion that they are misspellings or unimportant words.
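The rare-token filter can be sketched with `collections.Counter`. The token list and the threshold value below are hypothetical; in practice the counts would come from your own corpus.

```python
from collections import Counter

# Hypothetical token list; "pianno" stands in for a misspelling.
tokens = ["piano", "london", "piano", "gus", "pianno", "london", "gus", "gus"]
x = 1  # threshold: keep only tokens that appear more than x times

token_counts = Counter(tokens)
frequent_tokens = [t for t in tokens if token_counts[t] > x]
vocabulary = sorted(set(frequent_tokens))

print(vocabulary)
```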


# The order of cleaning the text

It is common to apply the steps in the following order:

* cleaning:
  * Conversion to lowercase
  * Removal of special characters, retweet markers (`RT @user:`), website links, multiple spaces, stopwords and punctuation symbols (replacing them with a space)
* Tokenization, keeping only tokens longer than 1 character
* Lemmatization or stemming

In [27]:
def limpiar_texto(texto):
    # Placeholder: returns the text unchanged until the cleaning steps are implemented
    nuevo_texto = texto
    return nuevo_texto


In [28]:
import re
test = r"RT @LondonEconomics: Hi guys the last meeting was a complete nonsense see the video here https://t.co/rnHPgyhx4Z @cienciadedatos #textmining"
print(test)
print(limpiar_texto(test))

RT @LondonEconomics: Hi guys the last meeting was a complete nonsense see the video here https://t.co/rnHPgyhx4Z @cienciadedatos #textmining
RT @LondonEconomics: Hi guys the last meeting was a complete nonsense see the video here https://t.co/rnHPgyhx4Z @cienciadedatos #textmining
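As the output shows, the placeholder function returns the tweet unchanged. One possible way to implement the cleaning steps listed above is sketched here. The function name `clean_and_tokenize_sketch`, the regular expressions and the small stop-word set are all assumptions made for this example, not the notebook's reference solution.

```python
import re

# Small illustrative stop-word set (an assumption, not spaCy's full list).
STOP_WORDS = {"the", "was", "a", "see", "here", "hi"}

def clean_and_tokenize_sketch(text):
    """One possible cleaning pipeline following the order listed above."""
    new_text = text.lower()                             # lowercase
    new_text = re.sub(r"^rt\s+@\w+:?", " ", new_text)   # retweet prefix "RT @user:"
    new_text = re.sub(r"https?://\S+", " ", new_text)   # website links
    new_text = re.sub(r"[^\w\s]", " ", new_text)        # punctuation and symbols
    new_text = re.sub(r"\s+", " ", new_text).strip()    # multiple spaces
    # Tokenize on spaces, keep tokens longer than 1 character, drop stop words.
    return [t for t in new_text.split(" ") if len(t) > 1 and t not in STOP_WORDS]

test = ("RT @LondonEconomics: Hi guys the last meeting was a complete nonsense "
        "see the video here https://t.co/rnHPgyhx4Z @cienciadedatos #textmining")
cleaned = clean_and_tokenize_sketch(test)
print(cleaned)
```

Links are removed before punctuation on purpose: stripping punctuation first would break the URL into fragments that the link pattern no longer matches.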


# Word Vector Representation

When we’re looking at words alone, it’s difficult for a machine to understand connections that a human would understand immediately. Engine and car, for example, have what might seem like an obvious connection (cars run using engines), but that link is not so obvious to a computer.

Thankfully, there’s a way we can represent words that captures more of these sorts of connections. A word vector is a numeric representation of a word that communicates its relationship to other words.

Each word is represented as a unique and lengthy array of numbers. You can think of these numbers as being something like GPS coordinates. GPS coordinates consist of two numbers (latitude and longitude), and if we saw two sets of GPS coordinates that were numerically close to each other (like 43,-70 and 44,-70), we would know that those two locations were relatively close together. Word vectors work similarly, although there are many more than two coordinates assigned to each word, so they’re much harder for a human to eyeball.

Using spaCy’s `en_core_web_sm` model, let’s take a look at the length of a vector for a single word, and what that vector looks like, using `.vector` and `.shape`. Note that the small model does not ship with static word vectors, so the 96 values shown below are derived from the pipeline’s context-sensitive tensors; the larger `en_core_web_md` and `en_core_web_lg` packages include dedicated word vectors.

In [29]:
import spacy
from spacy.cli import download
print(download('en_core_web_sm'))
nlp = spacy.load("en_core_web_sm")

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
None


In [30]:
mango = nlp('mango')
print(mango.vector.shape)
print(mango.vector)

(96,)
[-0.70611125 -1.4329455   0.24227941  0.6598132  -0.20285606 -0.3363567
 -1.4245116  -0.11146422 -0.56221646  0.3003068  -0.19000328 -0.08635545
  1.3099948   1.379954    0.02685246  1.5109322  -0.733334    0.80945534
  0.29014212 -0.2684864  -0.7413073  -0.7534003   1.52542    -0.61603916
  0.3729881   0.31268534 -0.68583065 -0.75191927  0.58086497 -1.0955321
  0.86638093 -1.9158285  -0.05129784 -0.20604798  0.2827754  -2.019856
 -0.0126412   0.3666329  -1.2550778   1.6548673  -0.85672385 -0.9216615
  0.2952034   0.01230198 -0.42903078 -0.4966709  -0.25612807 -1.3058071
  1.8100011   0.51152885  0.03403987  0.70565414  0.42585516 -0.8349808
  0.5538808   0.57170147 -1.101404    0.33620203  0.07782254  0.5464119
 -0.06026481 -0.5734616   0.6843033  -1.0217375  -0.11573818 -0.93082213
 -0.85589534  0.5505712   1.3896189  -0.5574837   0.19777809  0.3153283
 -0.37644464  0.38533548  0.02513826 -0.293028   -0.23319107  0.8843169
  0.61514205 -1.189681    1.3120099   0.49911803 -0.060

There’s no way that a human could look at that array and identify it as meaning “mango,” but representing the word this way works well for machines, because it allows us to represent both the word’s meaning and its “proximity” to other similar words using the coordinates in the array.
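"Proximity" between vectors is often measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors (real word vectors have far more dimensions, and these numbers are purely illustrative) to show how related words end up with a higher score than unrelated ones.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "word vectors", for illustration only.
toy_vectors = {
    "car":    [0.9, 0.8, 0.1],
    "engine": [0.8, 0.9, 0.2],
    "mango":  [0.1, 0.2, 0.9],
}

car_engine = cosine_similarity(toy_vectors["car"], toy_vectors["engine"])
car_mango = cosine_similarity(toy_vectors["car"], toy_vectors["mango"])
print(car_engine, car_mango)
```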

# Sentiment analysis

Sentiment analysis, as fascinating as it is, is not without its flaws. Human language is nuanced and often far from straightforward. Machines might struggle to identify the emotions behind an individual piece of text despite their extensive grasp of past data. Some situations where sentiment analysis might fail are:

* Sarcasm, jokes, irony. These things generally don’t follow a fixed set of rules, so they might not be correctly classified by sentiment analysis systems.
* Nuance. Words can have multiple meanings and connotations, which are entirely subject to the context they occur in.
* Multipolarity. When the given text is positive in some parts and negative in others.
* Negation detection. It can be challenging for the machine because the function and the scope of the word ‘not’ in a sentence is not definite; moreover, suffixes and prefixes such as ‘non-,’ ‘dis-,’ ‘-less’ etc. can change the meaning of a text.


## Sentiment analysis using Python NLP models
#### Using pretrained models
* TextBlob

* NLTK includes VADER
https://neptune.ai/blog/sentiment-analysis-python-textblob-vs-vader-vs-flair

* spaCy can also use TextBlob through the spacytextblob extension
https://spacy.io/universe/project/spacy-textblob

# TextBlob
https://textblob.readthedocs.io/en/dev/

https://www.quora.com/What-is-polarity-and-subjectivity-in-sentiment-analysis?share=1

TextBlob returns the polarity and subjectivity of a sentence. Polarity lies in the range [-1, 1], where -1 is a negative sentiment and 1 is a positive sentiment. Negation words reverse the polarity. TextBlob also has semantic labels that help with fine-grained analysis, for example emoticons, exclamation marks and emojis.

Subjectivity lies in the range [0, 1] and quantifies how much of the text is personal opinion rather than factual information: the higher the subjectivity, the more opinionated the text.

TextBlob has one more parameter, intensity, which it also uses when calculating subjectivity. Intensity determines whether a word modifies the next word. In English, adverbs are used as modifiers (‘very good’).

```
 Each word in the lexicon has scores for:
 1)     polarity: negative vs. positive    (-1.0 => +1.0)
 2) subjectivity: objective vs. subjective (+0.0 => +1.0)
 3)    intensity: modifies next word?      (x0.5 => x2.0)
``` 
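To make the interplay of polarity, intensity and negation more concrete, here is a toy lexicon-based scorer. It is only a sketch: the lexicon entries, the `1.3` intensity for "very" and the `-0.5` negation factor are assumptions chosen for illustration, and the function `toy_polarity` is not TextBlob's implementation.

```python
# Toy lexicon: illustrative values only, not TextBlob's real lexicon.
POLARITY = {"great": 0.8, "horrible": -1.0}
INTENSITY = {"very": 1.3}
NEGATIONS = {"not"}

def toy_polarity(text):
    """Average word polarity, applying intensifiers and negations to the next word."""
    scores = []
    multiplier = 1.0
    for word in text.lower().split():
        if word in NEGATIONS:
            multiplier *= -0.5  # assumption: negation flips and dampens
        elif word in INTENSITY:
            multiplier *= INTENSITY[word]
        elif word in POLARITY:
            score = POLARITY[word] * multiplier
            scores.append(max(-1.0, min(1.0, score)))  # clip to [-1, 1]
            multiplier = 1.0
        else:
            multiplier = 1.0  # modifiers only affect the next word
    return sum(scores) / len(scores) if scores else 0.0

for phrase in ["great", "very great", "not great"]:
    print(phrase, toy_polarity(phrase))
```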

### TextBlob simple analysis

In [31]:
# we can do simple things like analysing a single word
from textblob import TextBlob

TextBlob("great").sentiment
## Sentiment(polarity=0.8, subjectivity=0.75)

In [None]:
TextBlob("not great").sentiment
## Sentiment(polarity=-0.4, subjectivity=0.75)

In [None]:
TextBlob("very great").sentiment
## Sentiment(polarity=1.0, subjectivity=0.9750000000000001)

In [None]:
res = TextBlob("This movie is amazingly directed")
print(res.sentiment.polarity)
print(res.sentiment.subjectivity)
print(res.sentiment_assessments.assessments)

### TextBlob more advanced analysis

In [None]:
from textblob import TextBlob

test = r"RT @LondonEconomics: Hi guys I hated the last meeting, it was a complete shit see the video here https://t.co/rnHPgyhx4Z @cienciadedatos #textmining"
#test = r"RT @LondonEconomics: Hi guys I loved the last meeting, was great, see the video here https://t.co/rnHPgyhx4Z @cienciadedatos #textmining"

text = limpiar_texto(test)  # limpiar_texto is the cleaning function defined earlier

res = TextBlob(text)
print(res.sentiment.polarity)
print(res.sentiment.subjectivity)
print(res.sentiment_assessments.assessments)

In [None]:
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')
text = 'I had a really horrible day. It was the worst day ever! But every now and then I have a really good day that makes me happy.'
doc = nlp(text)
print(doc._.blob.polarity)

In [None]:
doc._.blob.polarity                            # Polarity: -0.125
doc._.blob.subjectivity                        # Subjectivity: 0.9
doc._.blob.sentiment_assessments.assessments   # Assessments: [(['really', 'horrible'], -1.0, 1.0, None), (['worst', '!'], -1.0, 1.0, None), (['really', 'good'], 0.7, 0.6000000000000001, None), (['happy'], 0.8, 1.0, None)]
doc._.blob.ngrams()  

In [None]:
from transformers import pipeline
sentiment_pipeline = pipeline("sentiment-analysis")
data = ["I love you", "I hate you"]
sentiment_pipeline(data)

# Sentiments in Spanish... ?
https://huggingface.co/finiteautomata/beto-sentiment-analysis

#### sentiment-analysis-spanish
https://pypi.org/project/sentiment-analysis-spanish/

The function `sentiment(text)` returns a number between 0 and 1: the probability that the string `text` is "positive". Probabilities close to 0 mean that the text is negative, probabilities close to 1 mean that the text is positive, and values in between correspond to neutral texts.
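If you need a label rather than a number, you can map the returned probability onto three classes. The helper below is a hypothetical sketch: the name `label_from_probability` and the 0.4 / 0.6 cut-offs are arbitrary assumptions, since the package itself only returns the probability.

```python
def label_from_probability(probability, low=0.4, high=0.6):
    """Map a positive-class probability to a coarse label.

    The 0.4 / 0.6 cut-offs are arbitrary assumptions for illustration.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability < low:
        return "negative"
    if probability > high:
        return "positive"
    return "neutral"

for p in [0.02, 0.5, 0.97]:
    print(p, label_from_probability(p))
```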

In [None]:
from sentiment_analysis_spanish import sentiment_analysis
sentiment = sentiment_analysis.SentimentAnalysisSpanish()
print(sentiment.sentiment("me gusta la tombola es genial"))

In [None]:
sentiment = sentiment_analysis.SentimentAnalysisSpanish()
print(sentiment.sentiment("me parece terrible esto que me estás diciendo"))