
# spaCy

spaCy is an open-source Python library for advanced natural language processing (NLP). It is designed to process large volumes of text efficiently, ships with pre-trained statistical models, and integrates with deep learning frameworks.

**Key Features**
1. Tokenization
2. Named Entity Recognition (NER)
3. Part-of-Speech (POS) Tagging
4. Dependency Parsing (relationships between words)
5. Word Vectors (word embeddings)
6. Custom Pipelines
7. Multi-language Support

**Applications**
1. Text classification
2. Information extraction
3. Summarization
4. Sentiment analysis
5. Translation

# 0. Installation




In [2]:
# !pip install spacy
# !python -m spacy download en_core_web_sm

# 1. Tokenization

In [3]:
import spacy

### spacy.blank(name)

In [63]:
# Creating a blank English spaCy pipeline
nlp = spacy.blank("en")

nlp.pipeline  # spacy.blank() creates a pipeline that contains only the tokenizer, so the component list is empty

[]
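The empty list above does not mean the pipeline is unusable: the tokenizer always runs, even when no components have been added. A minimal sketch, assuming spaCy v3 is installed (the sample sentence is made up for illustration):

```python
import spacy

# A blank pipeline has no components, only the built-in tokenizer
nlp = spacy.blank("en")
print(nlp.pipe_names)

# Tokenization still works without any component
doc = nlp("spaCy splits text into tokens.")
tokens = [token.text for token in doc]
print(tokens)
```

`nlp.pipe_names` returns the component names as strings, which is often easier to read than the `(name, component)` pairs in `nlp.pipeline`.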

In [71]:
# Tokenizing a text string; the run of extra spaces becomes its own whitespace token
doc = nlp("Text mining is the process of extracting   meaningful patterns and insights from text data.")
doc

Text mining is the process of extracting   meaningful patterns and insights from text data.

In [72]:
for token in doc:
  print(token)

Text
mining
is
the
process
of
extracting
  
meaningful
patterns
and
insights
from
text
data
.


In [73]:
token = doc[1]
token.text

'mining'

#### Token Attributes

Each token exposes attributes that describe it. Some of the most commonly used are:
* is_alpha
* is_currency
* is_digit
* is_space
* lemma_ (filled in by a lemmatizer component; `lemma` without the underscore is the integer hash)
* like_email
* like_url

You can list all attributes and methods of a token with `dir(token)`.
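The attributes listed above can be tried on a sentence that mixes words, numbers, a currency symbol, an email address, and a URL. This is a minimal sketch on a blank English pipeline; the sample text and addresses are invented for illustration:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Email info@example.com or visit https://example.com to pay $20 today")

# Print a few boolean attributes for every token
for token in doc:
    print(
        f"{token.text:<22} alpha={token.is_alpha} digit={token.is_digit} "
        f"currency={token.is_currency} email={token.like_email} url={token.like_url}"
    )

# The same attributes make simple extraction a one-liner
emails = [token.text for token in doc if token.like_email]
urls = [token.text for token in doc if token.like_url]
print(emails)
print(urls)
```

These `is_*` and `like_*` attributes come from the tokenizer's lexical rules, so they work without a trained model.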

In [75]:
token = doc[7]
token, token.is_space

(  , True)

In [76]:
for token in doc:
  if not token.is_space and not token.is_punct:
    print(token)

Text
mining
is
the
process
of
extracting
meaningful
patterns
and
insights
from
text
data


In [77]:
# We can also select a span of the sentence
span = doc[:5]
span

Text mining is the process
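Slicing a `Doc` returns a `Span`, which is a view over the original tokens rather than a copy of the text. A short sketch, assuming a blank English pipeline and a shortened version of the sentence used above:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Text mining is the process of extracting meaningful patterns.")

span = doc[:5]
print(type(span).__name__)              # the slice is a Span object
print(span.text)                        # the text covered by the span
print(span.start, span.end, len(span))  # token offsets in the parent Doc

# A span can be indexed and iterated just like a Doc
first_token = span[0]
print(first_token.text)
```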

#### Adding a pipe

In [78]:
nlp.add_pipe('sentencizer')

<spacy.pipeline.sentencizer.Sentencizer at 0x7ba21d4cacc0>
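To see what the sentencizer changes, one option is to check whether a `Doc` carries sentence boundaries before and after the component is added. A minimal sketch, assuming spaCy v3 (where `Doc.has_annotation` is available); the two short sentences are made up for illustration:

```python
import spacy

nlp = spacy.blank("en")
text = "Text mining finds patterns. NLP studies language."

# Without a sentence-splitting component, no sentence boundaries are set
before = nlp(text).has_annotation("SENT_START")

# The rule-based sentencizer marks sentence starts using punctuation
nlp.add_pipe("sentencizer")
doc = nlp(text)
after = doc.has_annotation("SENT_START")

print(before, after)
print(nlp.pipe_names)

sentences = [sentence.text for sentence in doc.sents]
print(sentences)
```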

In [88]:
doc = nlp("Text mining is the process of extracting meaningful patterns and insights from text data. NLP is a branch of artificial intelligence that focuses on the interaction between computers and human language.")
# enumerate() keeps the sentence counter without a manual variable
for c, sentence in enumerate(doc.sents, start=1):
    print(f"sentence {c} is:\n{sentence}\nThe words in this sentence are: ")
    for word in sentence:
        if not word.is_punct:
            print(word)
    print("\n")

sentence 1 is:
Text mining is the process of extracting meaningful patterns and insights from text data.
The words in this sentence are: 
Text
mining
is
the
process
of
extracting
meaningful
patterns
and
insights
from
text
data


sentence 2 is:
NLP is a branch of artificial intelligence that focuses on the interaction between computers and human language.
The words in this sentence are: 
NLP
is
a
branch
of
artificial
intelligence
that
focuses
on
the
interaction
between
computers
and
human
language




### spacy.load(name)

In [61]:
# Loading the pre-trained small English pipeline
nlp = spacy.load("en_core_web_sm")
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [None]:
# Processing the same text with the trained pipeline
doc = nlp("Text mining is the process of extracting meaningful patterns and insights from text data. NLP is a branch of artificial intelligence that focuses on the interaction between computers and human language.")
doc

In [38]:
for sentence in doc.sents:
    print(sentence)

Text mining is the process of extracting meaningful patterns and insights from text data.
NLP is a branch of artificial intelligence that focuses on the interaction between computers and human language.


In [None]:
for sentence in doc.sents:
    for word in sentence:
        print(word)
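The nested loop above can be extended to show what the trained pipeline adds per token, such as part-of-speech tags and lemmas. The sketch below is one way to do that; `load_pipeline` is a hypothetical helper (not part of spaCy) that falls back to a blank pipeline with a sentencizer when `en_core_web_sm` is not installed, in which case the tag and lemma columns stay empty:

```python
import spacy


def load_pipeline(name="en_core_web_sm"):
    """Return the trained pipeline, or a blank one with a sentencizer if it is missing."""
    try:
        return spacy.load(name)
    except OSError:
        # spacy.load raises OSError when the model package cannot be found
        fallback = spacy.blank("en")
        fallback.add_pipe("sentencizer")
        return fallback


nlp = load_pipeline()
doc = nlp(
    "Text mining is the process of extracting meaningful patterns and insights "
    "from text data. NLP is a branch of artificial intelligence that focuses on "
    "the interaction between computers and human language."
)

sentences = list(doc.sents)
for sentence in sentences:
    print(sentence.text)
    for token in sentence:
        if not token.is_punct:
            # pos_ and lemma_ are empty strings if the fallback pipeline is in use
            print(f"  {token.text:<14} {token.pos_:<6} {token.lemma_}")
```

With the trained pipeline, sentence boundaries come from the dependency parser rather than from punctuation rules, which is why no sentencizer needs to be added in that case.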