# SPACY - continuation (English)

In [1]:
import spacy
from spacy.cli import download
print(download('es_core_news_sm'))

ImportError: cannot import name 'download1' from 'spacy.cli' (/home/michal/.local/lib/python3.7/site-packages/spacy/cli/__init__.py)

In [None]:
#pip install spacy

import spacy
from spacy.cli import download
print(download('en_core_web_sm'))

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

# Sentence Detection

Sentence detection is the process of locating the start and end of sentences in a given text. This allows you to divide a text into linguistically meaningful units. You’ll use these units when you’re processing your text to perform tasks such as part-of-speech tagging and entity extraction.

In [None]:
about_text = ('Gus Proto is a Python developer currently'
               ' working for a London-based Fintech'
               ' company. He is interested in learning'
               ' Natural Language Processing.')

In [None]:
text_doc=nlp(about_text)
list_sents=list(text_doc.sents)
print(len(list_sents))
print(list_sents)

# Tokenization in spaCy

Tokenization is the next step after sentence detection. It allows you to identify the basic units in your text. These basic units are called tokens. Tokenization is useful because it breaks a text into meaningful units. These units are used for further analysis, like part of speech tagging.

In [None]:
introduction_text = ('This tutorial is about Natural Language Processing in Spacy.')
# we must create DOC object - a container
introduction_doc = nlp(introduction_text)
print ([token.text for token in introduction_doc])

In [None]:
file_name = 'introduction.txt'
# Use a context manager so the file is closed after reading
with open(file_name) as f:
    introduction_file_text = f.read()
introduction_doc = nlp(introduction_file_text)
result=[token.text for token in introduction_doc]
print(result)

In [None]:
for token in text_doc:
    print(token, token.idx)

# Token attributes
*    `idx` - the character offset of the token within the parent document.
*    `is_alpha` - whether the token consists only of alphabetic characters.
*    `is_punct` - whether the token is a punctuation symbol.
*    `is_space` - whether the token consists only of whitespace.
*    `shape_` - the orthographic shape of the word (for example, `Xxx` for `Gus`).
*    `is_stop` - whether the token is a stop word.

https://spacy.io/api/token#attributes

In [None]:
for token in text_doc:
    # Uncomment one line at a time to inspect a different attribute.
    # print(token, token.idx)
    # print(token, token.is_lower)
    # print(token, token.lower_)
    # print(token, token.is_alpha)
    # print(token, token.is_punct)
    # print(token, token.is_space)
    # print(token, token.shape_)
    print(token, token.is_stop)

# Stop Words

Stop words are the most common words in a language. In English, some examples of stop words are "the", "are", "but", and "they". Most sentences need to contain stop words in order to be full sentences that make sense.

## english

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")
spacy_stopwords_en = spacy.lang.en.stop_words.STOP_WORDS
print(len(spacy_stopwords_en))
print()
for stop_word in list(spacy_stopwords_en)[:10]:
     print(stop_word)

## spanish

In [None]:
import spacy
nlp = spacy.load('es_core_news_sm')
spacy_stopwords_es = spacy.lang.es.stop_words.STOP_WORDS
print(len(spacy_stopwords_es))
print()
for stop_word in list(spacy_stopwords_es)[:10]:
     print(stop_word)

## back to english

In [None]:
for token in text_doc:
     if not token.is_stop:
         print(token)

In [None]:
# Same filter done by hand: look up the lowercased token text in the stop-word set.
# Lowercasing matters because the set holds lowercase words, so 'He' would not match 'he'.
for token in text_doc:
    if token.text.lower() not in spacy_stopwords_en:
        print(token)

# Lemmatization

Lemmatization is the process of reducing inflected forms of a word while still ensuring that the reduced form belongs to the language. This reduced form or root word is called a lemma. For example, in Spanish:


    comienza -> comenzar
    comenzarán -> comenzar
    clases -> clase

# Stemming

There is another way to reduce the number of tokens. This process consists of simply trimming the words to reduce them to a common base:

    comienza -> comienz
    comenzarán -> comenz
    clases -> clas


These two techniques are mutually exclusive: you apply one or the other, never both. Which one is recommended?

In general, lemmatization is preferred, since it is a good compromise between reducing the number of distinct tokens and preserving the original form of the words. Stemming is more aggressive and tends to lose more information.

In other words:
* Stemming drops the end of the word to retain a stable root. It is fast, but sometimes the results are difficult to interpret.
* Lemmatization is smarter and takes into account the meaning of the word.
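
The difference between the two approaches can be sketched in plain Python. This is a toy illustration, not how spaCy or any real stemmer works: the suffix list, the lookup dictionary, and the function names `toy_stem` and `toy_lemmatize` are all hypothetical and deliberately tiny.

```python
# Toy illustration only: the suffix list and the lemma dictionary below are
# hypothetical and far smaller than anything a real stemmer or lemmatizer uses.
SUFFIXES = ("ing", "ies", "ed", "es", "s")

LEMMA_LOOKUP = {
    "organizing": "organize",
    "discoveries": "discovery",
    "keeps": "keep",
    "talks": "talk",
}


def toy_stem(word):
    """Strip the first matching suffix, without checking that the result is a word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in a dictionary of known forms; fall back to the word itself."""
    return LEMMA_LOOKUP.get(word, word)


for word in ("organizing", "discoveries", "keeps", "talks"):
    print(f"{word:12} stem={toy_stem(word):10} lemma={toy_lemmatize(word)}")
```

The stem of `organizing` comes out as `organiz`, which is not a word, while the lookup returns `organize`. That is the trade-off described above: trimming is cheap but blind, and a lemmatizer needs knowledge about the language.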

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")
conference_help_text = ('Gus is helping organize a developer'
     ' conference on Applications of Natural Language'
     ' Processing. He keeps organizing local Python meetups'
     ' and several internal talks at his workplace.')
conference_help_doc = nlp(conference_help_text)
for token in conference_help_doc:
     print (token, token.lemma_)

In [None]:
string = """
The crew of the USS Discovery discovered many discoveries.
Discovering is what explorers do."""
conference_help_doc = nlp(string)
for token in conference_help_doc:
     print (token, token.lemma_)

# Word Frequency

In [None]:
from collections import Counter
complete_text = ('Gus Proto is a Python developer currently '
'working for a London-based Fintech company. He is'
' interested in learning Natural Language Processing.'
' There is a developer conference happening on 21 July'
' 2019 in London. It is titled "Applications of Natural'
' Language Processing". There is a helpline number '
' available at +1-1234567891. Gus is helping organize it.'
' He keeps organizing local Python meetups and several'
' internal talks at his workplace. Gus is also presenting'
' a talk. The talk will introduce the reader about "Use'
' cases of Natural Language Processing in Fintech".'
' Apart from his work, he is very passionate about music.'
' Gus is learning to play the Piano. He has enrolled '
' himself in the weekend batch of Great Piano Academy.'
' Great Piano Academy is situated in Mayfair or the City'
' of London and has world-class piano instructors.')

complete_doc = nlp(complete_text)

In [None]:
words = [token.text for token in complete_doc if not token.is_stop and not token.is_punct and not token.is_space]

In [None]:
print(len(words))
print(words)

In [None]:
word_freq = Counter(words)
word_freq

In [None]:
word_freq.most_common(5)

In [None]:
word_freq.items()

In [None]:
[word for (word, freq) in word_freq.items() if freq == 1]

# Part of Speech Tagging


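Part of speech (POS) is the grammatical role a word plays in a sentence. English is traditionally described as having eight parts of speech:

*    Noun
*    Pronoun
*    Adjective
*    Verb
*    Adverb
*    Preposition
*    Conjunction
*    Interjection

spaCy exposes the tag of each token as `token.pos_`, using labels such as `NOUN`, `VERB` and `ADJ`.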


In [None]:
for token in complete_doc:
     print (token, token.pos_)

## extracting adjectives only

In [None]:
for token in complete_doc:
     if token.pos_=="ADJ":
         print (token)

# Named Entity Recognition
Named Entity Recognition (NER) is the process of locating named entities in unstructured text and then classifying them into pre-defined categories, such as person names, organizations, locations, monetary values, percentages, time expressions, and so on.

In [None]:
piano_class_text = ('Great Piano Academy is situated in Mayfair or the City of'
                    ' London and has world-class piano instructors. 12 August 2022')
piano_class_doc = nlp(piano_class_text)
print(piano_class_doc.ents)
for ent in piano_class_doc.ents:
     print(ent.text, ent.label_)

In [None]:
from spacy import displacy
displacy.render(piano_class_doc, style='ent', jupyter=True)

# Why would we want to "reduce" the amount of information through these transformations?

The idea behind removing stop words and symbols, and behind lemmatization or stemming, is to reduce the number of unique elements in our dataset, with the aim of improving the performance of our algorithm in two ways:

* Eliminating stop words removes common words that have little discriminative value between texts. Likewise, for many problems we do not need to know the tense in which a verb was written, or whether the word was "corrupt" or "corruption"; with base forms the algorithm can “learn” the general idea.

* The same idea can be applied to very rare tokens within our text: we could remove tokens that appear no more than $X$ times, on the suspicion that they are misspellings or unimportant words.
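
Dropping rare tokens can be sketched with `collections.Counter`. The function name `drop_rare_tokens`, the `min_count` parameter, and the sample token list are assumptions made for this illustration.

```python
from collections import Counter


def drop_rare_tokens(tokens, min_count=2):
    """Keep only tokens that occur at least `min_count` times in `tokens`."""
    counts = Counter(tokens)
    return [token for token in tokens if counts[token] >= min_count]


# 'pianno' is a deliberate misspelling that occurs only once.
tokens = ["piano", "london", "piano", "pianno", "talk", "london", "talk"]
print(drop_rare_tokens(tokens, min_count=2))
```

In practice the counts would be computed over the whole corpus rather than a single text, so that a word is only discarded if it is rare everywhere.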


# The order of cleaning the text

The cleaning steps are commonly applied in the order presented here:

* Cleaning:
  * Conversion to lowercase
  * Removal of special characters, RT @:, website links, multiple spaces, stop words and punctuation symbols (replacing them with a space)
* Tokenization, keeping only tokens longer than 1 character
* Lemmatization or stemming

In [None]:
def limpiar_texto(texto):
    # Placeholder: returns the text unchanged until the cleaning steps are implemented.
    nuevo_texto = texto
    return nuevo_texto


In [None]:
import re
test = r"RT @LondonEconomics: Hi guys the last meeting was a complete nonsense see the video here https://t.co/rnHPgyhx4Z @cienciadedatos #textmining"
print(test)
print(limpiar_texto(test))
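
A minimal sketch of the cleaning order listed above, using only the standard library. The function name `clean_and_tokenize`, the tiny stop-word set, and the decision to drop whole mentions and hashtags are assumptions made for illustration. Lemmatization is left out because it needs a language model.

```python
import re

# Hypothetical, deliberately tiny stop-word list; in practice you would use
# a full list such as spaCy's STOP_WORDS.
EXAMPLE_STOP_WORDS = {"the", "was", "a", "see", "here", "hi"}


def clean_and_tokenize(text, stop_words=EXAMPLE_STOP_WORDS):
    """Lowercase, strip tweet noise, and return tokens longer than one character."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # website links
    text = re.sub(r"\brt\b", " ", text)         # retweet marker
    text = re.sub(r"[@#]\w+", " ", text)        # mentions and hashtags (assumed unwanted)
    text = re.sub(r"[^a-z\s]", " ", text)       # digits, punctuation, other symbols (English only)
    tokens = text.split()                       # splitting also collapses repeated spaces
    return [t for t in tokens if len(t) > 1 and t not in stop_words]


test = ("RT @LondonEconomics: Hi guys the last meeting was a complete nonsense "
        "see the video here https://t.co/rnHPgyhx4Z @cienciadedatos #textmining")
print(clean_and_tokenize(test))
# ['guys', 'last', 'meeting', 'complete', 'nonsense', 'video']
```

Links are removed before the other patterns on purpose: once punctuation has been replaced by spaces, a URL would fall apart into fragments that are harder to recognise.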

# Word Vector Representation

When we’re looking at words alone, it’s difficult for a machine to understand connections that a human would understand immediately. Engine and car, for example, have what might seem like an obvious connection (cars run using engines), but that link is not so obvious to a computer.

Thankfully, there’s a way we can represent words that captures more of these sorts of connections. A word vector is a numeric representation of a word that communicates its relationship to other words.

Each word is interpreted as a unique and lengthy array of numbers. You can think of these numbers as being something like GPS coordinates. GPS coordinates consist of two numbers (latitude and longitude), and if we saw two sets of GPS coordinates that were numerically close to each other (like 43,-70, and 44,-70), we would know that those two locations were relatively close together. Word vectors work similarly, although there are a lot more than two coordinates assigned to each word, so they’re much harder for a human to eyeball.

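Using spaCy’s `en_core_web_sm` model, let’s take a look at the length of a vector for a single word, and what that vector looks like, using `.vector` and `.shape`. Note that the small (`sm`) models do not ship with static word vectors, so `.vector` falls back to context-sensitive tensors; the `md` and `lg` models include real word vectors.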

In [None]:
import spacy
from spacy.cli import download
print(download('en_core_web_sm'))
nlp = spacy.load("en_core_web_sm")

In [None]:
mango = nlp('mango')
print(mango.vector.shape)
print(mango.vector)

There’s no way that a human could look at that array and identify it as meaning “mango,” but representing the word this way works well for machines, because it allows us to represent both the word’s meaning and its “proximity” to other similar words using the coordinates in the array.
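
One common way to measure this “proximity” is cosine similarity. The sketch below uses made-up 3-dimensional vectors instead of real spaCy vectors, so the numbers only illustrate the mechanics; the names `cosine_similarity` and `toy_vectors` are assumptions for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional vectors, for illustration only; real word vectors
# have tens or hundreds of dimensions and are learned from text.
toy_vectors = {
    "mango": [0.9, 0.1, 0.0],
    "banana": [0.8, 0.2, 0.1],
    "engine": [0.0, 0.2, 0.9],
}

print(cosine_similarity(toy_vectors["mango"], toy_vectors["banana"]))  # close to 1
print(cosine_similarity(toy_vectors["mango"], toy_vectors["engine"]))  # close to 0
```

Just as with the GPS analogy, vectors that point in nearly the same direction are treated as close, regardless of how many coordinates they have.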

# Sentiment analysis

Sentiment analysis, as fascinating as it is, is not without its flaws. Human language is nuanced and often far from straightforward. Machines might struggle to identify the emotions behind an individual piece of text despite their extensive grasp of past data. Some situations where sentiment analysis might fail are:

* Sarcasm, jokes, irony. These things generally don’t follow a fixed set of rules, so they might not be correctly classified by sentiment analytics systems.
* Nuance. Words can have multiple meanings and connotations, which are entirely subject to the context they occur in.
* Multipolarity. When the given text is positive in some parts and negative in others.
* Negation detection. It can be challenging for the machine because the function and the scope of the word ‘not’ in a sentence is not definite; moreover, suffixes and prefixes such as ‘non-,’ ‘dis-,’ ‘-less’ etc. can change the meaning of a text.


## Sentiment analysis using Python NLP models
#### using pretrained models
* TextBlob

* NLTK includes VADER
https://neptune.ai/blog/sentiment-analysis-python-textblob-vs-vader-vs-flair

* spaCy can use TextBlob through the spacytextblob extension
https://spacy.io/universe/project/spacy-textblob

# TextBlob
https://textblob.readthedocs.io/en/dev/

https://www.quora.com/What-is-polarity-and-subjectivity-in-sentiment-analysis?share=1

TextBlob returns the polarity and subjectivity of a sentence.

Polarity lies in the range [-1, 1], where -1 is a negative sentiment and 1 is a positive sentiment. Negation words reverse the polarity. TextBlob also has semantic labels that help with fine-grained analysis, for example for emoticons, exclamation marks and emojis.

Subjectivity lies in the range [0, 1] and quantifies how much personal opinion, as opposed to factual information, the text contains. A higher subjectivity means that the text contains personal opinion rather than factual information.

TextBlob has one more parameter, intensity, which it takes into account when calculating subjectivity. Intensity determines whether a word modifies the next word. In English, adverbs are used as modifiers (‘very good’).

```
 Each word in the lexicon has scores for:
 1)     polarity: negative vs. positive    (-1.0 => +1.0)
 2) subjectivity: objective vs. subjective (+0.0 => +1.0)
 3)    intensity: modifies next word?      (x0.5 => x2.0)
``` 

### TextBlob simple analysis

In [None]:
from textblob import TextBlob

# We can do simple things, like analysing a single word
TextBlob("great").sentiment
## Sentiment(polarity=0.8, subjectivity=0.75)

In [None]:
TextBlob("not great").sentiment
## Sentiment(polarity=-0.4, subjectivity=0.75)

In [None]:
TextBlob("very great").sentiment
## Sentiment(polarity=1.0, subjectivity=0.9750000000000001)
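
The three results above are consistent with a simple rule, sketched here in plain Python rather than with TextBlob itself. The entry for "great" is read off the first output. The intensity of 1.3 for "very" and the factor of -0.5 for negation are assumptions inferred from the other two outputs, not values taken from TextBlob's documentation.

```python
def clip(value, low=-1.0, high=1.0):
    """Keep a score inside its allowed range."""
    return max(low, min(high, value))


# (polarity, subjectivity) for "great", read off the first output above.
GREAT = (0.8, 0.75)
# Assumed values, chosen because they reproduce the outputs above.
VERY_INTENSITY = 1.3
NEGATION_FACTOR = -0.5


def score(polarity, subjectivity, intensity=1.0, negated=False):
    """Apply an intensity modifier and an optional negation to one lexicon entry."""
    polarity = polarity * intensity
    subjectivity = subjectivity * intensity
    if negated:
        polarity = polarity * NEGATION_FACTOR
    return clip(polarity), clip(subjectivity, 0.0, 1.0)


print(score(*GREAT))                             # "great"
print(score(*GREAT, negated=True))               # "not great"
print(score(*GREAT, intensity=VERY_INTENSITY))   # "very great"
```

For "very great" the raw polarity is 0.8 × 1.3 = 1.04, which is clipped to the maximum of 1.0, while the subjectivity 0.75 × 1.3 = 0.975 stays inside its range.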

In [None]:
res = TextBlob("This movie is amazingly directed")
print(res.sentiment.polarity)
print(res.sentiment.subjectivity)
print(res.sentiment_assessments.assessments)

### TextBlob more advanced analysis

In [None]:
from textblob import TextBlob

test = r"RT @LondonEconomics: Hi guys I hated the last meeting, it was a complete shit see the video here https://t.co/rnHPgyhx4Z @cienciadedatos #textmining"
#test = r"RT @LondonEconomics: Hi guys I loved the last meeting, was great, see the video here https://t.co/rnHPgyhx4Z @cienciadedatos #textmining"

# Clean the tweet with the function defined earlier; TextBlob expects a plain string
text = limpiar_texto(test)

res = TextBlob(text)
print(res.sentiment.polarity)
print(res.sentiment.subjectivity)
print(res.sentiment_assessments.assessments)

In [None]:
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')
text = 'I had a really horrible day. It was the worst day ever! But every now and then I have a really good day that makes me happy.'
doc = nlp(text)
print(doc._.blob.polarity )

In [None]:
doc._.blob.polarity                            # Polarity: -0.125
doc._.blob.subjectivity                        # Subjectivity: 0.9
doc._.blob.sentiment_assessments.assessments   # Assessments: [(['really', 'horrible'], -1.0, 1.0, None), (['worst', '!'], -1.0, 1.0, None), (['really', 'good'], 0.7, 0.6000000000000001, None), (['happy'], 0.8, 1.0, None)]
doc._.blob.ngrams()  

In [None]:
from transformers import pipeline
sentiment_pipeline = pipeline("sentiment-analysis")
data = ["I love you", "I hate you"]
sentiment_pipeline(data)

# Sentiments in Spanish... ?
https://huggingface.co/finiteautomata/beto-sentiment-analysis

#### sentiment-analysis-spanish
https://pypi.org/project/sentiment-analysis-spanish/

The function `sentiment(text)` returns a number between 0 and 1: the probability that the string `text` is "positive". Low probabilities (close to 0) mean that the text is negative, and high probabilities (close to 1) mean that the text is positive. Values in between correspond to neutral texts.

In [None]:
from sentiment_analysis_spanish import sentiment_analysis
sentiment = sentiment_analysis.SentimentAnalysisSpanish()
print(sentiment.sentiment("me gusta la tombola es genial"))

In [None]:
sentiment = sentiment_analysis.SentimentAnalysisSpanish()
print(sentiment.sentiment("me parece terrible esto que me estás diciendo"))
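
To turn such a probability into a label, one option is a pair of cut-offs. The helper below is a sketch: the name `label_from_probability` and the 0.4 and 0.6 thresholds are arbitrary assumptions for illustration, not part of the `sentiment-analysis-spanish` package.

```python
def label_from_probability(p_positive, low=0.4, high=0.6):
    """Map a probability of being positive to a coarse label.

    The 0.4 and 0.6 cut-offs are arbitrary illustration values.
    """
    if not 0.0 <= p_positive <= 1.0:
        raise ValueError("expected a probability between 0 and 1")
    if p_positive < low:
        return "negative"
    if p_positive > high:
        return "positive"
    return "neutral"


# Example probabilities, not actual outputs of the library.
for p in (0.05, 0.5, 0.93):
    print(p, label_from_probability(p))
```

Where to place the cut-offs depends on the data; checking them against a handful of hand-labelled texts is a reasonable starting point.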