<a href="https://colab.research.google.com/github/Dforouzanfar/Machine_Learning/blob/master/3.%20Applications/1.%20Text%20Mining/Spacy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# spaCy

spaCy is an open-source Python library for advanced natural language processing (NLP). It is designed to handle large-scale NLP tasks efficiently and ships with pre-trained statistical models and deep learning integration.

**Key Features**
1. Tokenization
2. Named Entity Recognition - NER
3. Part-of-Speech (POS) Tagging
4. Dependency Parsing - relationships between words.
5. Word Embeddings - Word2Vec, GloVe, FastText, ...
6. Custom Pipelines
7. Multi-language Support

**Applications**
1. Text classification
2. Information extraction
3. Summarization
4. Sentiment analysis
5. Translation

In [None]:
try:
  import spacy
except ImportError:
  # Install spaCy and the small English model used throughout this notebook
  !pip install spacy
  !python -m spacy download en_core_web_sm
  import spacy

# 1. Tokenization

### spacy.blank(name)

In [None]:
# Creating a blank English spaCy pipeline
nlp = spacy.blank("en")

nlp.pipeline  # spacy.blank() creates a pipeline that contains only the tokenizer, so the component list is empty

[]

In [None]:
# Tokenizing a sentence with the blank pipeline (the extra spaces before "meaningful" become a whitespace token)
doc = nlp("Text mining is the process of extracting   meaningful patterns and insights from text data.")
doc

Text mining is the process of extracting   meaningful patterns and insights from text data.

In [None]:
for token in doc:
  print(token)

Text
mining
is
the
process
of
extracting
  
meaningful
patterns
and
insights
from
text
data
.


In [None]:
token = doc[1]
token.text

'mining'

#### Token Attributes

Each token exposes many attributes that we can use to inspect or filter it. Some of the most commonly used attributes are:
* is_alpha
* is_currency
* is_digit
* is_space
* lemma_
* like_email
* like_url

You can list all attributes and methods of a token with ```dir(token)```.
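
As a minimal sketch of how these attributes behave, the cell below runs a blank English pipeline on a made-up sentence (the e-mail address, URL, and amount are illustrative assumptions, not data from this notebook) and prints the flags that are true for each token. These flags are rule-based, so no trained model should be needed.

```python
import spacy

# A blank pipeline is enough here: these flags come from the tokenizer
# and lexical rules, not from a trained model.
nlp = spacy.blank("en")
doc = nlp("Email info@example.com or visit https://spacy.io to pay $5.")

for token in doc:
    flags = {
        "is_alpha": token.is_alpha,
        "is_digit": token.is_digit,
        "is_currency": token.is_currency,
        "is_punct": token.is_punct,
        "like_email": token.like_email,
        "like_url": token.like_url,
    }
    # Keep only the names of the flags that are True for this token
    active = [name for name, value in flags.items() if value]
    print(f"{token.text:<18} | {', '.join(active)}")
```

Filtering on such flags (for example, dropping tokens where ```like_url``` is true) is a common cleaning step before counting words.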

In [None]:
token = doc[7]
token, token.is_space

(  , True)

In [None]:
for token in doc:
  if not token.is_space and not token.is_punct:
    print(token)

Text
mining
is
the
process
of
extracting
meaningful
patterns
and
insights
from
text
data


In [None]:
# We can also select a span of the sentence
span = doc[:5]
span

Text mining is the process

#### Adding a pipe

Visit spaCy's doc page to explore more: https://spacy.io/usage/processing-pipelines

In [None]:
nlp.add_pipe('sentencizer')

<spacy.pipeline.sentencizer.Sentencizer at 0x7ba21d4cacc0>

In [None]:
doc = nlp("Text mining is the process of extracting meaningful patterns and insights from text data. NLP is a branch of artificial intelligence that focuses on the interaction between computers and human language.")
for i, sentence in enumerate(doc.sents, start=1):
    print(f"sentence {i} is:\n{sentence}\nThe words in this sentence are: ")
    for word in sentence:
        if not word.is_punct:
            print(word)
    print("\n")

sentence 1 is:
Text mining is the process of extracting meaningful patterns and insights from text data.
The words in this sentence are: 
Text
mining
is
the
process
of
extracting
meaningful
patterns
and
insights
from
text
data


sentence 2 is:
NLP is a branch of artificial intelligence that focuses on the interaction between computers and human language.
The words in this sentence are: 
NLP
is
a
branch
of
artificial
intelligence
that
focuses
on
the
interaction
between
computers
and
human
language




# 2. Named Entity Recognition

## spacy.load()

We can also load a pretrained model using ```spacy.load()```.  
To explore available models, visit spaCy's models page: https://spacy.io/models/en

In [None]:
# Loading the small pretrained English pipeline
nlp = spacy.load("en_core_web_sm")
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [None]:
# Processing two sentences with the pretrained pipeline
doc = nlp("Text mining is the process of extracting meaningful patterns and insights from text data. NLP is a branch of artificial intelligence that focuses on the interaction between computers and human language.")

In [None]:
# Sentence Tokenization
for sentence in doc.sents:
    print(sentence)

Text mining is the process of extracting meaningful patterns and insights from text data.
NLP is a branch of artificial intelligence that focuses on the interaction between computers and human language.


In [None]:
doc = nlp("As of January 2025 Apple has a market cap of $3.580 Trillion USD.")
doc.ents

In [None]:
for ent in doc.ents:
  print(f"{ent.text:<15} | {ent.label_:<6} | {spacy.explain(ent.label_)}")

Batman          | ORG    | Companies, agencies, institutions, etc.
Gotham City     | PERSON | People, including fictional


In [None]:
# Entities
nlp.pipe_labels['ner']

['CARDINAL',
 'DATE',
 'EVENT',
 'FAC',
 'GPE',
 'LANGUAGE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERCENT',
 'PERSON',
 'PRODUCT',
 'QUANTITY',
 'TIME',
 'WORK_OF_ART']

In [None]:
doc = nlp("As of early 2025, Musk's net worth is estimated to be approximately $426 billion, according to Bloomberg")

In [None]:
for ent in doc.ents:
  print(f"{ent.text:<26} | {ent.label_:<6} | {spacy.explain(ent.label_)}")

early 2025                 | DATE   | Absolute or relative dates or periods
Musk                       | PERSON | People, including fictional
approximately $426 billion | MONEY  | Monetary values, including unit
Bloomberg                  | PERSON | People, including fictional


In [None]:
# Use displacy.render for a well-structured visualization
from spacy import displacy

displacy.render(doc, style="ent")

### Custom Entities

In [None]:
doc = nlp("Tesla is going to acquire Twitter for $45 billion")

In [None]:
# Get maximum lengths for formatting
max_length_text = max(len(ent.text) for ent in doc.ents)
max_length_label = max(len(ent.label_) for ent in doc.ents)

# Print entities with formatted output
for ent in doc.ents:
    print(f"{ent.text:<{max_length_text}} | {ent.label_:<{max_length_label}} | {spacy.explain(ent.label_)}")

Tesla       | ORG     | Companies, agencies, institutions, etc.
Twitter     | PRODUCT | Objects, vehicles, foods, etc. (not services)
$45 billion | MONEY   | Monetary values, including unit


In [None]:
from spacy.tokens import Span

# doc[5:6] is the single token "Twitter"; relabel it from PRODUCT to ORG
first_span = Span(doc, 5, 6, label="ORG")
doc.set_ents([first_span], default="unmodified")  # default="unmodified" keeps the other entities as they are

In [None]:
for ent in doc.ents:
  print(f"{ent.text:<15} | {ent.label_:<10} | {spacy.explain(ent.label_)}")

Tesla           | ORG        | Companies, agencies, institutions, etc.
Twitter         | ORG        | Companies, agencies, institutions, etc.
$45 billion     | MONEY      | Monetary values, including unit
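
Setting a span by hand works for a single document. For a recurring correction, one alternative is spaCy's built-in ```entity_ruler``` component, which is not used in this notebook. The sketch below is an assumption-laden illustration: it uses a blank pipeline and two hand-written patterns, so only the listed names are recognized.

```python
import spacy

# Blank pipeline plus a rule-based entity component (no trained NER here)
nlp_rules = spacy.blank("en")
ruler = nlp_rules.add_pipe("entity_ruler")

# Hypothetical patterns: exact-text matches that should be labeled ORG
ruler.add_patterns([
    {"label": "ORG", "pattern": "Tesla"},
    {"label": "ORG", "pattern": "Twitter"},
])

doc_rules = nlp_rules("Tesla is going to acquire Twitter for $45 billion")
entities = [(ent.text, ent.label_) for ent in doc_rules.ents]
print(entities)
```

Because the rules live in the pipeline, every document processed afterwards gets the same labels without any manual span indices.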


# 3. Part of Speech Tagger

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [None]:
doc = nlp("Batman patrols Gotham City under the cover of darkness, ensuring justice prevails against its relentless wave of crime.")

In [None]:
for token in doc:
  print(f"{token.text:<10} | {token.pos_:<6} | {spacy.explain(token.pos_)}")

Batman     | PROPN  | proper noun
patrols    | VERB   | verb
Gotham     | PROPN  | proper noun
City       | PROPN  | proper noun
under      | ADP    | adposition
the        | DET    | determiner
cover      | NOUN   | noun
of         | ADP    | adposition
darkness   | NOUN   | noun
,          | PUNCT  | punctuation
ensuring   | VERB   | verb
justice    | NOUN   | noun
prevails   | VERB   | verb
against    | ADP    | adposition
its        | PRON   | pronoun
relentless | ADJ    | adjective
wave       | NOUN   | noun
of         | ADP    | adposition
crime      | NOUN   | noun
.          | PUNCT  | punctuation


#### Tag
```tag_``` gives the language-specific, fine-grained part-of-speech tag for a token, based on the language's grammar, whereas ```pos_``` gives the coarse-grained universal POS tag.

Use ```tag_``` when you require detailed grammatical information (e.g., singular vs. plural nouns, verb tense).
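
To get a feel for what the fine-grained tags distinguish, ```spacy.explain()``` can be called on tag names directly, without loading a model. The tag selection below is an illustrative assumption: a few Penn Treebank style tags that all collapse into just two coarse classes, NOUN/PROPN and VERB.

```python
import spacy

# Fine-grained tags that would share a coarse pos_ value:
# nouns (NN, NNS, NNP) and verbs (VBZ, VBD, VBG)
fine_tags = ["NN", "NNS", "NNP", "VBZ", "VBD", "VBG"]

# spacy.explain looks each label up in spaCy's glossary
explanations = {tag: spacy.explain(tag) for tag in fine_tags}

for tag, meaning in explanations.items():
    print(f"{tag:<4} | {meaning}")
```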

In [None]:
print("token", ' '*6, 'pos', ' '*4, 'pos explain', ' '*5, 'tag', ' '*4, 'tag explain', '\n', '-'*80)
for token in doc:
  print(f"{token.text:<10} | {token.pos_:<6} | {spacy.explain(token.pos_):<15} | {token.tag_:<6} | {spacy.explain(token.tag_)}")

token        pos      pos explain       tag      tag explain 
 --------------------------------------------------------------------------------
Batman     | PROPN  | proper noun     | NNP    | noun, proper singular
patrols    | VERB   | verb            | VBZ    | verb, 3rd person singular present
Gotham     | PROPN  | proper noun     | NNP    | noun, proper singular
City       | PROPN  | proper noun     | NNP    | noun, proper singular
under      | ADP    | adposition      | IN     | conjunction, subordinating or preposition
the        | DET    | determiner      | DT     | determiner
cover      | NOUN   | noun            | NN     | noun, singular or mass
of         | ADP    | adposition      | IN     | conjunction, subordinating or preposition
darkness   | NOUN   | noun            | NN     | noun, singular or mass
,          | PUNCT  | punctuation     | ,      | punctuation mark, comma
ensuring   | VERB   | verb            | VBG    | verb, gerund or present participle
justice    | NOUN

#### count_by

```doc.count_by()``` returns a dictionary that maps each attribute ID (here, the ID of a coarse-grained POS tag) to the number of tokens that carry that value. The IDs can be turned back into strings through ```doc.vocab```.

In [None]:
count = doc.count_by(spacy.attrs.POS)
count

{96: 3, 100: 3, 85: 4, 90: 1, 92: 5, 97: 2, 95: 1, 84: 1}

In [None]:
for key, value in count.items():
    print(f"{doc.vocab[key].text:<6} | {value}")

PROPN  | 3
VERB   | 3
ADP    | 4
DET    | 1
NOUN   | 5
PUNCT  | 2
PRON   | 1
ADJ    | 1
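
The same method works with other token attributes. As a small sketch (the sentence and the choice of the ```LOWER``` attribute are assumptions for illustration), counting by lowercase form gives a case-insensitive word frequency table, and a blank pipeline is enough because no tagging is involved.

```python
import spacy
from spacy.attrs import LOWER

nlp_blank = spacy.blank("en")
doc_words = nlp_blank("Text mining extracts patterns from text. Text data is everywhere.")

# count_by returns {attribute_hash: frequency}
counts = doc_words.count_by(LOWER)

# Resolve each hash back to its string through the shared StringStore
readable = {nlp_blank.vocab.strings[key]: value for key, value in counts.items()}

for word, freq in sorted(readable.items(), key=lambda item: -item[1]):
    print(f"{word:<10} | {freq}")
```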


# 4. Stemming & Lemmatization
**Stemming**: a crude heuristic process that strips word suffixes to reduce words to a common root. Because it only applies fixed rules, it cannot handle irregular forms.
* playing, played, plays --> play
* eating, eats --> eat
* ate --> ate

**Lemmatization**: a more sophisticated process that uses knowledge of the language to reduce words to their base (dictionary) form, so irregular forms are handled as well.
* playing, played, plays --> play
* **ate** --> eat
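
The difference can be sketched in plain Python. The suffix list and lookup table below are toy assumptions made up for this illustration; they are far simpler than NLTK's Porter stemmer or spaCy's lemmatizer and only aim to show why a rule-based stemmer misses "ate" while a dictionary-backed lemmatizer does not.

```python
# Toy rules for illustration only (hypothetical, not a real stemmer)
SUFFIXES = ("ing", "ed", "s")

# Toy dictionary of irregular forms (hypothetical entries)
LEMMA_TABLE = {"ate": "eat", "eaten": "eat"}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemma(word):
    """Look the word up first; fall back to suffix stripping."""
    return LEMMA_TABLE.get(word, toy_stem(word))


for w in ["playing", "played", "plays", "eating", "eats", "ate"]:
    print(f"{w:<8} | stem: {toy_stem(w):<5} | lemma: {toy_lemma(w)}")
```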

### Lemmatization

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [None]:
doc = nlp("eating eats eat ate adjustable ability meeting")

In [None]:
for token in doc:
  print(f"Token: {token.text:<10} | Lemma: {token.lemma_}")

Token: eating     | Lemma: eat
Token: eats       | Lemma: eat
Token: eat        | Lemma: eat
Token: ate        | Lemma: eat
Token: adjustable | Lemma: adjustable
Token: ability    | Lemma: ability
Token: meeting    | Lemma: meeting


### Customizing lemmatizer

In [None]:
doc = nlp("Dad, let's go out! Papa, don't say no")
for token in doc:
  if token.text == 'Dad' or token.text == 'Papa':
    print(f"Token: {token.text:<5} | Lemma: {token.lemma_}")

Token: Dad   | Lemma: Father
Token: Papa  | Lemma: Father


In [None]:
attribute_r = nlp.get_pipe('attribute_ruler')

# Map both "Dad" and "Papa" to the lemma "Father"
attribute_r.add(
    [[{"TEXT": "Dad"}], [{"TEXT": "Papa"}]],
    {"LEMMA": "Father"}
)

In [None]:
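# Re-process the text so that the new attribute_ruler patterns are applied
doc = nlp("Dad, let's go out! Papa, don't say no")
for token in doc:
  if token.text == 'Dad' or token.text == 'Papa':
    print(f"Token: {token.text:<5} | Lemma: {token.lemma_}")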

Token: Dad   | Lemma: Father
Token: Papa  | Lemma: Father


### Stemming
spaCy does not provide a stemmer, so we use NLTK to get the stem of a word instead.

In [None]:
try:
  import nltk
  from nltk.stem import PorterStemmer
except ImportError:
  # Install NLTK only if it is missing
  !pip install nltk
  import nltk
  from nltk.stem import PorterStemmer

In [None]:
stemmer = PorterStemmer()

In [None]:
words = ["eating", "eats", "eat", "ate", "adjustable", "ability", "meeting"]

for word in words:
  print(f"{word:<10} | {stemmer.stem(word)}")

eating     | eat
eats       | eat
eat        | eat
ate        | ate
adjustable | adjust
ability    | abil
meeting    | meet


# 5. Bag of Words

In [None]:
import numpy as np
import pandas as pd
import requests

In [None]:
path = "https://raw.githubusercontent.com/Dforouzanfar/Machine_Learning/refs/heads/master/3.%20Applications/1.%20Text%20Mining/data/spam.csv"

# Download the dataset and save it locally
response = requests.get(path)
response.raise_for_status()  # stop early if the download failed
with open("dataframe.csv", 'wb') as f:
  f.write(response.content)

df = pd.read_csv("dataframe.csv")

In [None]:
df.head()

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [None]:
df["Category"].value_counts()

Category,count
ham,4825
spam,747


In [None]:
df['Category'] = df['Category'].map({'spam': 1, 'ham':0}).astype(int)
df.head(2)

Unnamed: 0,Category,Message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['Message'], df['Category'], test_size=0.2)

In [None]:
len(X_train), len(X_test)

(4457, 1115)

### CountVectorizer

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorize = CountVectorizer()

In [None]:
X_train_cv = count_vectorize.fit_transform(X_train.values)
X_test_cv = count_vectorize.transform(X_test.values)
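
To see what the vectorizer produces, here is a small sketch on a three-message toy corpus (made up for illustration, not taken from the spam dataset). Each column of the matrix corresponds to one vocabulary word, and each cell holds how often that word occurs in the message.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus
corpus = ["free entry to win", "ok see you later", "win win free"]

toy_vectorizer = CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform(corpus)

# vocabulary_ maps each word to its column index; sort words by that index
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(toy_vocab)
print(toy_matrix.toarray())
```

Note that the vocabulary is learned from the training messages only, which is why the notebook calls ```fit_transform``` on the training set and plain ```transform``` on the test set.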

### NaiveBayes

In [None]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train_cv, y_train)
y_pred = model.predict(X_test_cv)

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.99      1.00      0.99       960
           1       0.97      0.94      0.95       155

    accuracy                           0.99      1115
   macro avg       0.98      0.97      0.97      1115
weighted avg       0.99      0.99      0.99      1115



### sklearn.pipeline

scikit-learn's ```Pipeline``` chains the vectorizer and the classifier into a single estimator, so the same workflow takes fewer lines of code.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

clf = Pipeline([
              ('vectorizer', CountVectorizer()),
              ('nb', MultinomialNB())
])

In [None]:
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.99      1.00      0.99       960
           1       0.97      0.94      0.95       155

    accuracy                           0.99      1115
   macro avg       0.98      0.97      0.97      1115
weighted avg       0.99      0.99      0.99      1115
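
Since the vectorizer is part of the pipeline, a fitted pipeline accepts raw strings directly. The sketch below illustrates this on a tiny hand-made training set (the messages and labels are hypothetical and far too few for a real model).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 = spam, 0 = ham
texts = ["win free prize now", "free cash win", "see you at lunch", "are you coming home"]
labels = [1, 1, 0, 0]

toy_clf = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("nb", MultinomialNB()),
])
toy_clf.fit(texts, labels)

# New, unseen messages are vectorized and classified in one call
predictions = toy_clf.predict(["free prize win", "see you home"])
print(predictions)
```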



# 6. Stop Words

In [None]:
from spacy.lang.en.stop_words import STOP_WORDS

stop_words = list(STOP_WORDS)
stop_words[:10]

['since',
 'name',
 'twelve',
 '‘d',
 'under',
 'will',
 'beyond',
 'became',
 'yours',
 'among']

In [None]:
def omit_stop_words(text):
  doc = nlp(text)
  no_stop_words = [token.text for token in doc if not token.is_stop]

  return ' '.join(no_stop_words)

In [None]:
omit_stop_words("As the morning breeze rustled through the leaves, the distant mountains gradually emerged from the fog, their peaks kissed by the first light of dawn.")

'morning breeze rustled leaves , distant mountains gradually emerged fog , peaks kissed light dawn .'
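
The result above still contains punctuation tokens. A possible variant, sketched below under the assumption that only lexical flags are needed, drops punctuation and whitespace as well and uses a blank pipeline, since ```is_stop``` does not depend on the tagger, parser, or NER. The function name is hypothetical.

```python
import spacy

# Tokenizer-only pipeline; is_stop and is_punct are lexical attributes
nlp_light = spacy.blank("en")


def omit_stop_words_and_punct(text):
    doc = nlp_light(text)
    kept = [
        token.text
        for token in doc
        if not token.is_stop and not token.is_punct and not token.is_space
    ]
    return " ".join(kept)


cleaned = omit_stop_words_and_punct("The cat sat on the mat.")
print(cleaned)
```

Skipping the trained components may noticeably shorten preprocessing when the function is applied to every message in a DataFrame.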

In [None]:
df.head()

Unnamed: 0,Category,Message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [None]:
df["Message"] = df["Message"].apply(omit_stop_words)

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['Message'], df['Category'], test_size=0.2)

In [None]:
clf = Pipeline([
              ('vectorizer', CountVectorizer()),
              ('nb', MultinomialNB())
])

In [None]:
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.98      0.99      0.99       959
           1       0.95      0.90      0.92       156

    accuracy                           0.98      1115
   macro avg       0.97      0.95      0.96      1115
weighted avg       0.98      0.98      0.98      1115

