<a href="https://colab.research.google.com/github/Dforouzanfar/Machine_Learning/blob/master/3.%20Applications/1.%20Text%20Mining/Spacy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# spaCy

spaCy is an open-source Python library for advanced natural language processing (NLP). It is designed to process large volumes of text efficiently and ships with pretrained statistical models and integrations with deep learning frameworks.

**Key Features**
1. Tokenization
2. Named Entity Recognition (NER)
3. Part-of-Speech (POS) Tagging
4. Dependency Parsing (relationships between words)
5. Word Vectors (word embeddings)
6. Custom Pipelines
7. Multi-language Support

**Applications**
1. Text classification
2. Information extraction
3. Summarization
4. Sentiment analysis
5. Translation

In [None]:
try:
  import spacy
except ImportError:
  !pip install spacy
  # Use the full package name; the "en" shortcut is deprecated in spaCy v3
  !python -m spacy download en_core_web_sm
  import spacy

# 1. Tokenization

### spacy.blank(name)

In [None]:
# Creating a blank English spaCy pipeline
nlp = spacy.blank("en")

nlp.pipeline  # spacy.blank() gives us only a tokenizer, so the list of pipeline components is empty

[]

In [None]:
# Tokenizing a text string (note the run of three spaces after "extracting")
doc = nlp("Text mining is the process of extracting   meaningful patterns and insights from text data.")
doc

Text mining is the process of extracting   meaningful patterns and insights from text data.

In [None]:
for token in doc:
  print(token)

Text
mining
is
the
process
of
extracting
  
meaningful
patterns
and
insights
from
text
data
.
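
The whitespace token in the output above appears because spaCy's tokenization is non-destructive: every character of the input is kept. A minimal sketch of that property, using a shortened example sentence chosen here for illustration:

```python
import spacy

nlp = spacy.blank("en")
text = "Text mining is the process of extracting   meaningful patterns."
doc = nlp(text)

# Each token stores its character offset (idx) and its trailing whitespace,
# so the original string can be rebuilt exactly from the tokens.
for token in doc:
    print(f"{token.idx:>3} | {token.text!r:<14} | {token.whitespace_!r}")

rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
```

Because nothing is discarded, character offsets in a `Doc` can always be mapped back to the original string.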


In [None]:
token = doc[1]
token.text

'mining'

#### Token Attributes

Each token exposes attributes that describe it. Some of the most commonly used are:
* is_alpha
* is_currency
* is_digit
* is_space
* lemma
* like_email
* like_url

You can list all of a token's attributes and methods with ```dir(token)```.
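
As a rough illustration of how these attributes can be used to filter tokens, the sketch below runs a blank pipeline over a made-up sentence (the e-mail address and URL are placeholders, not taken from the notebook):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Email jane@example.com or visit https://example.com to pay $20")

# Boolean token attributes work well as filters in list comprehensions
emails = [t.text for t in doc if t.like_email]
urls = [t.text for t in doc if t.like_url]
currency = [t.text for t in doc if t.is_currency]
digits = [t.text for t in doc if t.is_digit]

print(emails, urls, currency, digits)
```

These attributes are computed from the token text alone, so they are available even in a blank pipeline.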

In [None]:
token = doc[7]
token, token.is_space

(  , True)

In [None]:
for token in doc:
  if not token.is_space and not token.is_punct:
    print(token)

Text
mining
is
the
process
of
extracting
meaningful
patterns
and
insights
from
text
data


In [None]:
# We can also select a span of the sentence
span = doc[:5]
span

Text mining is the process
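
A span is a view onto the `Doc`, not a copy, and it carries its own position information. The following sketch, on a shortened sentence assumed for illustration, shows a few of the attributes a span offers:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Text mining is the process of extracting meaningful patterns.")

span = doc[:5]                          # tokens 0..4; the end index is exclusive
print(span.text)                        # text covered by the span
print(span.start, span.end, len(span))  # token positions within the doc
print(span[-1].text)                    # indexing is relative to the span
print(span.start_char, span.end_char)   # character offsets into doc.text
```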

#### Adding a pipe

Visit spaCy's doc page to explore more: https://spacy.io/usage/processing-pipelines

In [None]:
nlp.add_pipe('sentencizer')

<spacy.pipeline.sentencizer.Sentencizer at 0x7ba21d4cacc0>

In [None]:
doc = nlp("Text mining is the process of extracting meaningful patterns and insights from text data. NLP is a branch of artificial intelligence that focuses on the interaction between computers and human language.")
for i, sentence in enumerate(doc.sents, start=1):
    print(f"sentence {i} is:\n{sentence}\nThe words in this sentence are: ")
    for word in sentence:
        if not word.is_punct:
            print(word)
    print("\n")

sentence 1 is:
Text mining is the process of extracting meaningful patterns and insights from text data.
The words in this sentence are: 
Text
mining
is
the
process
of
extracting
meaningful
patterns
and
insights
from
text
data


sentence 2 is:
NLP is a branch of artificial intelligence that focuses on the interaction between computers and human language.
The words in this sentence are: 
NLP
is
a
branch
of
artificial
intelligence
that
focuses
on
the
interaction
between
computers
and
human
language
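
In a notebook it is easy to run an `add_pipe` cell twice, which raises an error because a component with that name already exists. One possible way to guard against this, sketched here with two short example sentences that are not from the notebook, is to check `nlp.pipe_names` first:

```python
import spacy

nlp = spacy.blank("en")
print(nlp.pipe_names)  # a blank pipeline has no components

# Only add the component if it is not already part of the pipeline
if "sentencizer" not in nlp.pipe_names:
    nlp.add_pipe("sentencizer")  # rule-based sentence splitter, no model needed
print(nlp.pipe_names)

doc = nlp("Text mining finds patterns in text. NLP studies human language.")
sentences = [sent.text for sent in doc.sents]
print(sentences)

# Components can also be removed again
nlp.remove_pipe("sentencizer")
```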




# 2. Named Entity Recognition

## spacy.load()

We can also load a pretrained model using ```spacy.load()```.  
To explore available models, visit spaCy's models page: https://spacy.io/models/en

In [None]:
# Loading a pretrained English pipeline
nlp = spacy.load("en_core_web_sm")
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
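
`spacy.load()` raises an `OSError` when the requested package is not installed. A minimal sketch of a defensive loader follows; `load_pipeline` is a hypothetical helper written for this example, and falling back to a blank pipeline with a sentencizer is only one possible design:

```python
import spacy

def load_pipeline(name="en_core_web_sm"):
    """Hypothetical helper: load a trained pipeline, else fall back to a blank one."""
    try:
        return spacy.load(name)
    except OSError:
        # The package is not installed: fall back to tokenizer + sentencizer
        nlp = spacy.blank("en")
        nlp.add_pipe("sentencizer")
        return nlp

nlp = load_pipeline()
print(nlp.pipe_names)

doc = nlp("Text mining finds patterns in text. NLP studies human language.")
print(len(list(doc.sents)))
```

The fallback pipeline can still tokenize and split sentences, but it has no tagger, parser, or entity recognizer, so attributes such as `token.pos_` or `doc.ents` stay empty.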

In [None]:
# Processing a two-sentence text with the pretrained pipeline
doc = nlp("Text mining is the process of extracting meaningful patterns and insights from text data. NLP is a branch of artificial intelligence that focuses on the interaction between computers and human language.")

In [None]:
# Sentence Tokenization
for sentence in doc.sents:
    print(sentence)

Text mining is the process of extracting meaningful patterns and insights from text data.
NLP is a branch of artificial intelligence that focuses on the interaction between computers and human language.


In [None]:
doc = nlp("As of January 2025 Apple has a market cap of $3.580 Trillion USD.")
doc.ents

In [None]:
for ent in doc.ents:
  print(f"{ent.text:<15} | {ent.label_}")

January 2025    | DATE
Apple           | ORG
$3.580 Trillion | MONEY
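
Entity labels such as `DATE`, `ORG`, and `MONEY` are abbreviations. `spacy.explain()`, which the notebook also uses for POS tags, gives a short description of each; a small sketch:

```python
import spacy

# Labels taken from the entity output above
for label in ["DATE", "ORG", "MONEY"]:
    print(f"{label:<6} | {spacy.explain(label)}")

# For a term that is not in spaCy's glossary the result is None
unknown = spacy.explain("NOT_A_LABEL")
print(unknown)
```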


In [None]:
# Use displacy.render for a well-structured visualization
from spacy import displacy

displacy.render(doc, style="ent")
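
Outside a notebook, `displacy.render` can return the markup as a string instead of displaying it. The sketch below sets one entity by hand on a blank pipeline so that it runs without a trained model; that hand-set entity and the shortened sentence are assumptions made for the example, and with `en_core_web_sm` the `ner` component would fill `doc.ents` itself.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("As of January 2025 Apple has a large market cap.")

# Hand-set entity (illustration only): token 4 is "Apple"
doc.ents = [Span(doc, 4, 5, label="ORG")]

# jupyter=False returns the markup; page=True wraps it in a full HTML page
html = displacy.render(doc, style="ent", page=True, jupyter=False)
print(html[:60])
```

The returned string can then be written to an `.html` file and opened in a browser.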

# 3. Part of Speech Tagger

In [None]:
doc = nlp("Batman patrols Gotham City under the cover of darkness, ensuring justice prevails against its relentless wave of crime.")

In [None]:
for token in doc:
  print(f"{token.text:<10} | {token.pos_:<6} | {spacy.explain(token.pos_)}")

Batman     | PROPN  | proper noun
patrols    | VERB   | verb
Gotham     | PROPN  | proper noun
City       | PROPN  | proper noun
under      | ADP    | adposition
the        | DET    | determiner
cover      | NOUN   | noun
of         | ADP    | adposition
darkness   | NOUN   | noun
,          | PUNCT  | punctuation
ensuring   | VERB   | verb
justice    | NOUN   | noun
prevails   | VERB   | verb
against    | ADP    | adposition
its        | PRON   | pronoun
relentless | ADJ    | adjective
wave       | NOUN   | noun
of         | ADP    | adposition
crime      | NOUN   | noun
.          | PUNCT  | punctuation
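
Once tokens carry POS tags, they are easy to aggregate. The sketch below counts tags with a hypothetical helper, `pos_counts`. To stay runnable without a downloaded model it builds a `Doc` with hand-assigned tags copied from the first rows of the output above; with a loaded pipeline you would pass `nlp(text)` instead.

```python
from collections import Counter

import spacy
from spacy.tokens import Doc

def pos_counts(doc):
    """Hypothetical helper: count coarse POS tags, skipping punctuation and spaces."""
    return Counter(t.pos_ for t in doc if not t.is_punct and not t.is_space)

# Hand-annotated Doc (assumption for this sketch); tags match the output above
vocab = spacy.blank("en").vocab
doc = Doc(
    vocab,
    words=["Batman", "patrols", "Gotham", "City", "."],
    pos=["PROPN", "VERB", "PROPN", "PROPN", "PUNCT"],
)

counts = pos_counts(doc)
print(counts.most_common())
```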


# 4. Stemming & Lemmatization
**Stemming**: a crude heuristic that strips word suffixes to reduce words to a common root, which need not be a dictionary word.
* playing, played, plays --> play
* eating, eats --> eat
* ate --> ate (irregular forms are left unchanged)

**Lemmatization**: a more sophisticated process that uses vocabulary and morphology to reduce words to their base or dictionary form.
* playing, played, plays --> play
* **ate** --> eat

### Lemmatization

In [53]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [54]:
doc = nlp("eating eats eat ate adjustable ability meeting")

In [57]:
for token in doc:
  print(f"Token: {token.text:<10} | Lemma: {token.lemma_}")

Token: eating     | Lemma: eat
Token: eats       | Lemma: eat
Token: eat        | Lemma: eat
Token: ate        | Lemma: eat
Token: adjustable | Lemma: adjustable
Token: ability    | Lemma: ability
Token: meeting    | Lemma: meeting


### Customizing lemmatizer

The `attribute_ruler` component can override token attributes such as the lemma. Add the rule before processing the text, because a `Doc` that already exists is not updated.

In [67]:
attribute_r = nlp.get_pipe('attribute_ruler')

# Map both "Dad" and "Papa" to the lemma "Father"
attribute_r.add(
    [
        [{"TEXT": "Dad"}],
        [{"TEXT": "Papa"}]
    ],
    {"LEMMA": "Father"}
)

In [69]:
doc = nlp("Dad, let's go out! Papa, don't say no")
for token in doc:
  if token.text == 'Dad' or token.text == 'Papa':
    print(f"Token: {token.text:<5} | Lemma: {token.lemma_}")

Token: Dad   | Lemma: Father
Token: Papa  | Lemma: Father
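
The order of operations matters here: a rule only affects text processed after it was added. The sketch below shows this on a blank pipeline with its own `attribute_ruler`, so it needs no trained model. Matching on `LOWER` instead of `TEXT` is a variation chosen for this example so that other capitalizations are covered too.

```python
import spacy

nlp = spacy.blank("en")
text = "Dad, let's go out! Papa, don't say no"

before = nlp(text)  # processed before any rule exists

ruler = nlp.add_pipe("attribute_ruler")
# LOWER matches the lowercased token text, so "dad" and "DAD" match as well
ruler.add(
    patterns=[[{"LOWER": "dad"}], [{"LOWER": "papa"}]],
    attrs={"LEMMA": "Father"},
)

after = nlp(text)  # processed with the rule in place

lemmas_before = [t.lemma_ for t in before if t.lower_ in ("dad", "papa")]
lemmas_after = [t.lemma_ for t in after if t.lower_ in ("dad", "papa")]
print(lemmas_before)
print(lemmas_after)
```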


### Stemming
spaCy does not provide a stemmer, so we use NLTK instead.

In [48]:
try:
  import nltk
  from nltk.stem import PorterStemmer
except ImportError:
  !pip install nltk
  import nltk
  from nltk.stem import PorterStemmer

In [49]:
stemmer = PorterStemmer()

In [58]:
words = ["eating", "eats", "eat", "ate", "adjustable", "ability", "meeting"]

for word in words:
  print(f"{word:<10} | {stemmer.stem(word)}")

eating     | eat
eats       | eat
eat        | eat
ate        | ate
adjustable | adjust
ability    | abil
meeting    | meet
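
The two libraries can also be combined: spaCy handles tokenization and NLTK does the stemming. A minimal sketch, assuming both packages are installed and reusing the word list from above as a single string:

```python
import spacy
from nltk.stem import PorterStemmer

nlp = spacy.blank("en")  # tokenizer only, no trained model required
stemmer = PorterStemmer()

doc = nlp("eating eats eat ate adjustable ability meeting")

# Stem every non-punctuation token produced by spaCy's tokenizer
stems = [stemmer.stem(token.text) for token in doc if not token.is_punct]
print(stems)
```

This keeps spaCy's handling of punctuation and whitespace while still producing Porter stems.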
