## Natural Language Processing With SpaCy & Textacy
![textacylogo](textacylogo1.png)

+ Textacy
+ Textacy is a Python library for performing higher-level natural language processing (NLP) tasks, built on the high-performance spaCy library.
+ Textacy focuses on tasks facilitated by the availability of tokenized, POS-tagged, and parsed text.

### Uses
- Text preprocessing
- Keyword in context
- Topic modeling
- Information extraction
- Keyterm extraction
- Text and readability statistics
- Emotional valence analysis
- Quotation attribution

#### Installation
+ pip install textacy
+ conda install -c conda-forge textacy

##### Downloading Dataset
+ python -m textacy download 

### For Language Detection
+ pip install textacy[lang]
+ pip install cld2-cffi

In [1]:
# Loading Packages
import textacy

In [2]:
example = "Textacy is a Python library for performing higher-level natural language processing (NLP) tasks, built on the high-performance Spacy library. With the basics — tokenization, part-of-speech tagging, parsing — offloaded to another library, textacy focuses on tasks facilitated by the availability of tokenized, POS-tagged, and parsed text: keyterm extraction, readability statistics, emotional valence analysis, quotation attribution, and more."

### Text Preprocessing With Textacy
+ textacy.preprocess_text()
+ textacy.preprocess
    - Punctuation
    - Lowercase
    - URLs
    - Phone numbers
    - Currency
    - Emails


In [3]:
raw_text = """ The best programs, are the ones written when the programmer is supposed to be working on something else.Mike bought the book for $50 although in Paris it will cost $30 dollars.
Don’t document the problem, fix it.This is from https://twitter.com/codewisdom?lang=en. """

In [4]:
# Removing Punctuation
textacy.preprocess.remove_punct(raw_text)

' The best programs  are the ones written when the programmer is supposed to be working on something else Mike bought the book for $50 although in Paris it will cost $30 dollars \nDon t document the problem  fix it This is from https   twitter com codewisdom lang=en  '

In [5]:
# Replacing URLs with a placeholder
textacy.preprocess.replace_urls(raw_text, replace_with='TWITTER')

' The best programs, are the ones written when the programmer is supposed to be working on something else.Mike bought the book for $50 although in Paris it will cost $30 dollars.\nDon’t document the problem, fix it.This is from TWITTER '

In [6]:
# Replacing Currency Symbols
textacy.preprocess.replace_currency_symbols(raw_text,replace_with='USD')

' The best programs, are the ones written when the programmer is supposed to be working on something else.Mike bought the book for USD50 although in Paris it will cost USD30 dollars.\nDon’t document the problem, fix it.This is from https://twitter.com/codewisdom?lang=en. '

In [7]:
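# Replacing Emails (raw_text contains no email address, so this call leaves it unchanged)
textacy.preprocess.replace_emails(raw_text, replace_with='EMAIL')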

In [8]:
# Preprocess All
textacy.preprocess_text(raw_text,lowercase=True,no_punct=True,no_urls=True)

'the best programs are the ones written when the programmer is supposed to be working on something else mike bought the book for $50 although in paris it will cost $30 dollars don t document the problem fix it this is from url'
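
The combined call above can be approximated with the standard library alone. The following is only a rough sketch of the idea, not textacy's implementation: `simple_preprocess`, `URL_RE` and `PUNCT_TABLE` are hypothetical names, and the URL pattern is deliberately simplistic.

```python
import re
import string

# Deliberately simple URL pattern (assumption: URLs start with http:// or https://)
URL_RE = re.compile(r"https?://\S+")
# Translation table mapping every ASCII punctuation character to a space
PUNCT_TABLE = str.maketrans({ch: " " for ch in string.punctuation})

def simple_preprocess(text, lowercase=True, no_urls=True, no_punct=True):
    """Rough stand-in for a preprocessing pipeline; not textacy's implementation."""
    if no_urls:
        text = URL_RE.sub("URL", text)
    if no_punct:
        text = text.translate(PUNCT_TABLE)
    if lowercase:
        text = text.lower()
    # Collapse the runs of whitespace left behind by the replacements
    return " ".join(text.split())

cleaned = simple_preprocess("Fix it. This is from https://twitter.com/codewisdom?lang=en now!")
print(cleaned)
```

URLs are replaced before punctuation is stripped; otherwise the URL would be broken into pieces first and the pattern would no longer match.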

In [9]:
# Preprocessing text read from a file
with open("sample.txt") as f:
    sample_text = f.read()
textacy.preprocess_text(sample_text, lowercase=True)

'the best programs, are the ones written when the programmer is supposed to be working on something else.mike bought the book for $50 although in paris it will cost $30 dollars.\ndon’t document the problem, fix it.this is from https://twitter.com/codewisdom?lang=en.\ndebuggers don\'t remove bugs. they only show them in slow motion.\n"if at first you don’t succeed, call it version 1.0."\nin theory, there is no difference between theory and practice. but, in practice, there is.\n"commenting your code is like cleaning your bathroom - you never want to do it, but it really does create a more pleasant experience for you and your guests." - ryan campbell\nyour problem is another\'s solution; your solution will be their problem.'

### Reading a Text or a Document
+ textacy.Doc(your_text)
+ textacy.io.read_text(filepath)

In [10]:
example = "Textacy is a Python library for performing higher-level natural language processing (NLP) tasks, built on the high-performance Spacy library. With the basics — tokenization, part-of-speech tagging, parsing — offloaded to another library, textacy focuses on tasks facilitated by the availability of tokenized, POS-tagged, and parsed text: keyterm extraction, readability statistics, emotional valence analysis, quotation attribution, and more."

In [11]:
# With Doc
# Requires an installed spaCy language model
docx_textacy = textacy.Doc(example)

In [12]:
docx_textacy

Doc(82 tokens; "Textacy is a Python library for performing high...")

In [13]:
type(docx_textacy)

textacy.doc.Doc

In [14]:
# Using spaCy
# 'en' is a spaCy 2.x shortcut link; spaCy 3+ needs the full model name, e.g. 'en_core_web_sm'
import spacy
nlp = spacy.load('en')

In [15]:
docx_spacy = nlp(example)

In [16]:
docx_spacy

Textacy is a Python library for performing higher-level natural language processing (NLP) tasks, built on the high-performance Spacy library. With the basics — tokenization, part-of-speech tagging, parsing — offloaded to another library, textacy focuses on tasks facilitated by the availability of tokenized, POS-tagged, and parsed text: keyterm extraction, readability statistics, emotional valence analysis, quotation attribution, and more.

In [17]:
type(docx_spacy)

spacy.tokens.doc.Doc

In [18]:
# Both are named Doc, but they are different classes: textacy.doc.Doc vs. spacy.tokens.doc.Doc

#### Reading A File

In [19]:
# Method 1
file_textacy = textacy.Doc(open("example.txt").read())

In [20]:
file_textacy

Doc(471 tokens; "The nativity of Jesus or birth of Jesus is desc...")

In [21]:
# Method 2
# Creates a generator
# file_textacy2 = textacy.io.read_text('example.txt')
file_textacy2 = textacy.io.read_text('example.txt',lines=True)

In [22]:
type(file_textacy2)

generator

In [23]:
for text in file_textacy2:
    docx_file = textacy.Doc(text)
    print(docx_file)

Doc(148 tokens; "The nativity of Jesus or birth of Jesus is desc...")


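OSError: [E050] Can't find model 'un'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

+ This error most likely means textacy could not detect the language of one line (for example a blank line) and fell back to the code 'un' (unknown), for which no spaCy model exists. Passing the language explicitly, as in textacy.Doc(text, lang='en'), skips the detection.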
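
When a file is read line by line, blank lines are an easy source of text that cannot be turned into a useful document. A minimal sketch of a lazy reader that skips them, using only the standard library; `iter_nonblank_lines` is a hypothetical helper, not part of textacy, and the demo file is created on the fly.

```python
import os
import tempfile

def iter_nonblank_lines(path, encoding="utf-8"):
    """Yield stripped, non-empty lines from a text file one at a time."""
    with open(path, encoding=encoding) as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield line

# Demo on a throwaway file with a blank line in the middle
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("First paragraph.\n\nSecond paragraph.\n")
    demo_path = tmp.name

lines = list(iter_nonblank_lines(demo_path))
os.remove(demo_path)
print(lines)
```

Like the generator returned above, this reads the file lazily, so each yielded line could be passed to a Doc constructor without loading the whole file into memory.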

##### Working With Multiple Text Documents
+ textacy.io.read_text(filepath, lines=True)
+ textacy.io.read_json(filepath, lines=True)
+ textacy.io.csv.read_csv(filepath)

#### Analysis of Text
+ Tokenization
+ Ngrams
+ Named Entities
+ Key Terms & Text Rank
+ Basic Counts/Frequency & Stats
+ Bag of Terms

In [24]:
docx_spacy

Textacy is a Python library for performing higher-level natural language processing (NLP) tasks, built on the high-performance Spacy library. With the basics — tokenization, part-of-speech tagging, parsing — offloaded to another library, textacy focuses on tasks facilitated by the availability of tokenized, POS-tagged, and parsed text: keyterm extraction, readability statistics, emotional valence analysis, quotation attribution, and more.

In [25]:
# Using spaCy Named Entity Recognition
[ (entity.text,entity.label_) for entity in docx_spacy.ents ]

[('NLP', 'ORG'), ('Spacy', 'GPE')]

In [26]:
# Using Textacy Named Entity Extraction
list(textacy.extract.named_entities(docx_textacy))

[NLP, Spacy]

In [27]:
# N-grams with textacy
# NB: with plain spaCy, the nearest equivalent is noun phrases (doc.noun_chunks)
# Trigrams

list(textacy.extract.ngrams(docx_textacy,3))

[library for performing,
 level natural language,
 natural language processing,
 performance Spacy library,
 With the basics,
 focuses on tasks,
 availability of tokenized,
 emotional valence analysis]
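
At its core, an n-gram is just a sliding window of n consecutive tokens. A minimal sketch of that idea in plain Python; `word_ngrams` is a hypothetical helper and the token list is made up for illustration. The output above lists only 8 trigrams for an 82-token text, so textacy evidently filters its candidates, which this sketch does not attempt.

```python
def word_ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "natural language processing with textacy".split()
trigrams = word_ngrams(tokens, 3)
print(trigrams)
```

A list of k tokens yields k - n + 1 windows, and an empty list when the text is shorter than n.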

#### Info Extraction/Summary
+ semistructured_statements

In [39]:
docx = textacy.Doc(open("example1.txt").read())

In [41]:
# Extract statements about the entity "Jerusalem"
statements = textacy.extract.semistructured_statements(docx, "Jerusalem")

In [42]:
statements

<generator object semistructured_statements at 0x7f2403592db0>

In [43]:
# Prints Results
print("This text is about: ")
for statement in statements:
    subject,verb,point = statement
    print(f':{point}')

This text is about: 
:the third-holiest city, after Mecca and Medina.[26][27
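
The (subject, verb, point) triples unpacked above can be imitated very crudely with a regular expression. This is only a sketch of the idea: textacy works on the parsed text, whereas the hypothetical `simple_statements` below matches surface strings, and the sample text is invented for illustration.

```python
import re

def simple_statements(text, entity, cues=("is", "was", "are", "were")):
    """Yield (entity, cue, fragment) for text of the form '<entity> <cue> <fragment>'."""
    cue_pattern = "|".join(re.escape(cue) for cue in cues)
    pattern = re.compile(
        rf"\b{re.escape(entity)}\s+({cue_pattern})\s+([^.!?]+)", re.IGNORECASE
    )
    for match in pattern.finditer(text):
        yield entity, match.group(1), match.group(2).strip()

sample = "Jerusalem is a city in the Middle East. Many people visit. Jerusalem was besieged many times."
found = list(simple_statements(sample, "Jerusalem"))
for subject, verb, point in found:
    print(f"{subject} | {verb} | {point}")
```

Because it ignores grammar, this sketch misses statements where the entity and the verb are separated by other words; a dependency parse handles those cases.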


#### Key Terms and Text Rank
+ Textacy
+ PyTextRank


In [44]:
# Load keyterms for TextRank & SGRank
import textacy.keyterms
# Lemmatizing or otherwise normalizing the terms can give better results

In [45]:
mylemma = [token.lemma_ for token in docx_textacy]

In [46]:
mylemma

['textacy',
 'be',
 'a',
 'python',
 'library',
 'for',
 'perform',
 'high',
 '-',
 'level',
 'natural',
 'language',
 'processing',
 '(',
 'nlp',
 ')',
 'task',
 ',',
 'build',
 'on',
 'the',
 'high',
 '-',
 'performance',
 'spacy',
 'library',
 '.',
 'with',
 'the',
 'basic',
 '--',
 'tokenization',
 ',',
 'part',
 '-',
 'of',
 '-',
 'speech',
 'tagging',
 ',',
 'parse',
 '--',
 'offload',
 'to',
 'another',
 'library',
 ',',
 'textacy',
 'focus',
 'on',
 'task',
 'facilitate',
 'by',
 'the',
 'availability',
 'of',
 'tokeniz',
 ',',
 'pos',
 '-',
 'tag',
 ',',
 'and',
 'parse',
 'text',
 ':',
 'keyterm',
 'extraction',
 ',',
 'readability',
 'statistic',
 ',',
 'emotional',
 'valence',
 'analysis',
 ',',
 'quotation',
 'attribution',
 ',',
 'and',
 'more',
 '.']

In [47]:
# Using Lemma From Textacy
textacy.keyterms.textrank(docx_textacy, normalize='lemma', n_keyterms=10)

[('library', 0.0760061484239787),
 ('task', 0.05354513480888809),
 ('high', 0.05240430760022597),
 ('quotation', 0.041359218381418185),
 ('textacy', 0.03905943157880573),
 ('analysis', 0.03883696307086067),
 ('valence', 0.03742346418811198),
 ('emotional', 0.036606816948899494),
 ('statistic', 0.03611075043157879),
 ('readability', 0.03574957225478799)]
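
TextRank scores words by running PageRank over a graph in which words that occur near each other are linked. A minimal standard-library sketch of that idea follows; it is not textacy's implementation, and `textrank_keywords`, the window size and the toy word list are all assumptions made for illustration.

```python
from collections import defaultdict

def textrank_keywords(words, window=2, damping=0.85, iterations=50, top_n=5):
    """Rank words by PageRank over an undirected co-occurrence graph."""
    # Link each word to the words that follow it within the window
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1:i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    if not neighbors:
        return []
    # Power iteration: each node shares its score equally among its neighbors
    scores = {word: 1.0 / len(neighbors) for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping) / len(neighbors)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    # Highest score first; ties broken alphabetically for stable output
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:top_n]

# Toy input standing in for a lemmatized, stop-word-free token list
words = ("library task library text library parse text task "
         "keyterm library analysis").split()
ranked = textrank_keywords(words, top_n=3)
print(ranked)
```

Words that co-occur with many different neighbors collect the most score, which is why a frequent, well-connected term such as "library" ends up at the top.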

In [48]:
# Using SGRank
textacy.keyterms.sgrank(docx_textacy, ngrams=(1, 2, 3, 4), normalize='lower', n_keyterms=0.1)

[('level natural language processing', 0.3056854092725264),
 ('performance spacy library', 0.1095635417770438),
 ('textacy', 0.0850709659268793),
 ('python library', 0.07274216720436877),
 ('tasks', 0.051788257056594116),
 ('speech tagging', 0.046870946578303783),
 ('emotional valence analysis', 0.04081245273285479),
 ('higher', 0.03519922647193362)]

#### Text and Readability Statistics

#### Basic Counts
+ collections.Counter
+ textacy.TextStats

In [49]:
textcounts = textacy.TextStats(docx_textacy)

In [50]:
# Number of Unique Words
textcounts.n_unique_words

51

In [51]:
mytokens = [ token.text for token in docx_textacy ]

In [52]:
# Text stats for the longer document (docx, loaded from example1.txt)
textcounts = textacy.TextStats(docx)

In [53]:
# Number of tokens in docx_textacy (punctuation included)
len(mytokens)

82

In [54]:
# Basic Counts
textcounts.basic_counts

{'n_sents': 18,
 'n_words': 532,
 'n_chars': 2756,
 'n_syllables': 810,
 'n_unique_words': 305,
 'n_long_words': 156,
 'n_monosyllable_words': 357,
 'n_polysyllable_words': 73}

In [55]:
# More Specific
textcounts.basic_counts['n_sents']

18
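
The basic counts are the raw ingredients of the classic readability scores. As a worked example, the standard Flesch formulas can be applied by hand to the sentence, word and syllable counts printed above; the two function names are hypothetical helpers, not textacy attributes.

```python
def flesch_reading_ease(n_sents, n_words, n_syllables):
    """Flesch Reading Ease: higher scores indicate easier text."""
    return 206.835 - 1.015 * (n_words / n_sents) - 84.6 * (n_syllables / n_words)

def flesch_kincaid_grade(n_sents, n_words, n_syllables):
    """Flesch-Kincaid grade level: approximate US school grade."""
    return 0.39 * (n_words / n_sents) + 11.8 * (n_syllables / n_words) - 15.59

# Counts copied from the basic_counts output shown above
counts = {"n_sents": 18, "n_words": 532, "n_syllables": 810}
ease = flesch_reading_ease(**counts)
grade = flesch_kincaid_grade(**counts)
print(f"Flesch Reading Ease: {ease:.1f}")
print(f"Flesch-Kincaid grade: {grade:.1f}")
```

Both scores depend only on average sentence length and average syllables per word, so long sentences and long words make a text score as harder.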

In [56]:
# Count the most common nouns with collections.Counter,
# skipping stop words and punctuation
from collections import Counter
nouns = [token.text for token in docx if not token.is_stop and not token.is_punct and token.pos_ == 'NOUN']

In [57]:
word_freq = Counter(nouns)

common_nouns = word_freq.most_common(10)

In [58]:
common_nouns

[('city', 6),
 ('century', 4),
 ('capital', 2),
 ('times', 2),
 ('millennium', 2),
 ('period', 2),
 ('walls', 2),
 ('population', 2),
 ('importance', 2),
 ('plateau', 1)]

#### Bag Of Terms
+ The bag-of-words model represents a text by its term frequencies, that is, the number of times each term appears in the text.
- Uses
 - Word or term frequency
 - Document classification, with term frequencies as features for training a classifier
 - Computer vision
 - Similar to n-grams, except that word order is not preserved
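
The loss of word order is easy to see in a small sketch built on `collections.Counter`. The helpers `bag_of_words` and `bigrams` and the two sample sentences are made up for illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count each term."""
    return Counter(text.lower().split())

def bigrams(text):
    """Return the set of adjacent word pairs, which keeps local word order."""
    words = text.lower().split()
    return set(zip(words, words[1:]))

sentence_a = "the dog chased the cat"
sentence_b = "the cat chased the dog"

bag_a = bag_of_words(sentence_a)
bag_b = bag_of_words(sentence_b)

print(bag_a.most_common(1))
print("same bag of words:", bag_a == bag_b)
print("same bigrams:", bigrams(sentence_a) == bigrams(sentence_b))
```

The two sentences mean different things yet produce identical bags, while their bigram sets differ.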

In [59]:
# Bag of Terms or Words
bot = docx_textacy.to_bag_of_terms(ngrams=(1, 2, 3), named_entities=True, weighting='count',as_strings=True)

In [60]:
sorted(bot.items(), key=lambda x: x[1], reverse=True)[:15]

[('library', 3),
 ('textacy', 2),
 ('high', 2),
 ('task', 2),
 ('parse', 2),
 ('nlp', 1),
 ('spacy', 1),
 ('python', 1),
 ('perform', 1),
 ('level', 1),
 ('natural', 1),
 ('language', 1),
 ('processing', 1),
 ('build', 1),
 ('performance', 1)]

In [None]:
# Loading a Text Document
example2 = """Ubiquitous, mobile supercomputing. Intelligent robots. Self-driving cars. Neuro-technological brain enhancements. Genetic editing. The evidence of dramatic change is all around us and it’s happening at exponential speed.

Professor Klaus Schwab, Founder and Executive Chairman of the World Economic Forum, has been at the centre of global affairs for over four decades. He is convinced that we are at the beginning of a revolution that is fundamentally changing the way we live, work and relate to one another, which he explores in his new book, The Fourth Industrial Revolution."""

### Topic Modeling with Spacy & Textacy
+ A type of statistical model used to detect the abstract, hidden (latent) topics in a collection of documents
+ Uses
 - Text classification
 - Discovering the latent semantic structures in a document
 - Organizing and gaining insight into large collections of data
 - Bioinformatics
 - Recommendation systems, via similarity of topics
 

In [None]:
# Build the corpus
# `corpus` must hold textacy Doc objects (for example a textacy.Corpus) before the next cell runs

In [None]:
# Vectorizer
vectorizer = textacy.Vectorizer(weighting='tfidf', normalize=True, smooth_idf=True, min_df=2, max_df=0.95)
doc_term_matrix = vectorizer.fit_transform(
    doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True)
    for doc in corpus
)


In [None]:
print(repr(doc_term_matrix))
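
To make the TF-IDF weighting less of a black box, here is a small standard-library sketch that builds a dense document-term matrix. It assumes one common smoothed IDF variant and L2 normalization; whether the vectorizer above uses exactly these formulas is not confirmed here, and `tfidf_matrix` and the toy documents are invented for illustration.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocabulary, rows) where rows[i][j] is the TF-IDF weight of term j in doc i."""
    vocabulary = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (one common variant): ln((1 + N) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        row = [counts[t] * idf[t] for t in vocabulary]
        # L2-normalize so document length does not dominate
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        rows.append([w / norm for w in row])
    return vocabulary, rows

# Toy corpus: each document is already a list of terms
docs = [
    ["robot", "car", "robot"],
    ["car", "genome"],
    ["car", "forum", "book"],
]
vocabulary, rows = tfidf_matrix(docs)
print(vocabulary)
for row in rows:
    print([round(w, 3) for w in row])
```

A term that appears in every document ("car" here) gets the lowest IDF, so rarer terms dominate each row.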

#### Train and interpret a topic model:

In [None]:
# Build and Fit the Model
model = textacy.TopicModel('nmf', n_topics=10)
model.fit(doc_term_matrix)
doc_topic_matrix = model.transform(doc_term_matrix)


In [None]:
# Shape
doc_topic_matrix.shape

In [None]:
for topic_idx, top_terms in model.top_topic_terms(vectorizer.id_to_term, top_n=10):
    print('topic', topic_idx, ':', ' '.join(top_terms))
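
Listing the top terms of a topic amounts to sorting one row of the model's topic-term weight matrix. A minimal sketch of that step, assuming a hand-made weight matrix: `top_terms_per_topic`, the weights and the vocabulary below are all invented for illustration and do not come from a fitted model.

```python
def top_terms_per_topic(topic_term_weights, id_to_term, top_n=3):
    """Yield (topic_idx, terms) with the top_n highest-weighted terms of each topic."""
    for topic_idx, weights in enumerate(topic_term_weights):
        ranked_ids = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        yield topic_idx, [id_to_term[i] for i in ranked_ids[:top_n]]

# Hypothetical 2-topic x 5-term weight matrix, invented for illustration
id_to_term = {0: "robot", 1: "car", 2: "genome", 3: "forum", 4: "book"}
topic_term_weights = [
    [0.9, 0.7, 0.1, 0.0, 0.2],
    [0.0, 0.1, 0.3, 0.8, 0.6],
]
topics = list(top_terms_per_topic(topic_term_weights, id_to_term))
for topic_idx, terms in topics:
    print('topic', topic_idx, ':', ' '.join(terms))
```

The `id_to_term` mapping plays the same role as the vectorizer's mapping in the cell above: it turns column indices back into readable terms.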

In [None]:
# Jesse JCharis
# J-Secur1ty 
# Jesus Saves @JCharisTech