# Lab 02: Introduction to Text Preprocessing & the Spacy Toolkit

### Objectives:
1. Get familiar with basic text preprocessing pipelines
2. Get familiar with regular expressions, and the `re` package in Python
3. Evaluate the lexical diversity of the data in each category within the 20 News Groups Dataset
4. Use normalized BOW features to evaluate text similarity using the KL-divergence

### Required Reading:

1. https://universaldependencies.org/u/pos/
2. https://spacy.io/api/annotation#pos-tagging
3. https://spacy.io/api/annotation#dependency-parsing

# Part I: Introduction to Spacy

### Download Spacy's base English language *pipeline* components

``$ python -m spacy download en_core_web_sm``

What is a Spacy *pipeline*? A Spacy pipeline is an extensible tool that streamlines many of the common tasks in NLP, such as tokenization, part-of-speech tagging, named entity recognition, lemmatization, and dependency parsing. It also has custom pipeline components specifically for transformers. It is built for production use, and much thought and care has gone into its API and implementation. You can configure Spacy to use some of the statistical models that we will discover in this class; for now we're just going to cover some of the basics.
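The idea of a pipeline as a chain of components can be sketched in plain Python. This is a toy stand-in, not Spacy's implementation: the dictionary-based document and the names `tokenize`, `lowercase`, and `run_pipeline` are all hypothetical and chosen only for illustration.

```python
def tokenize(doc):
    # First component: split the raw text on whitespace.
    doc["tokens"] = doc["text"].split()
    return doc

def lowercase(doc):
    # Second component: normalize the tokens produced by the previous step.
    doc["tokens"] = [token.lower() for token in doc["tokens"]]
    return doc

def run_pipeline(text, components):
    # Each component receives the document and hands it on to the next one.
    doc = {"text": text}
    for component in components:
        doc = component(doc)
    return doc

result = run_pipeline("WHAT car is this", [tokenize, lowercase])
print(result["tokens"])
```

The point of the sketch is the contract: every component takes the document produced by the previous one and returns a document for the next one.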

In [1]:
import spacy

pipeline = spacy.load('en_core_web_sm')

### Download the 20 News Groups dataset using the sklearn package

This data consists of Usenet newsgroup posts from 20 different categories.

In [2]:
from sklearn.datasets import fetch_20newsgroups
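import ssl

# Workaround for SSL certificate errors when downloading the dataset.
# It disables certificate verification, so use it only in a throwaway notebook.
ssl._create_default_https_context = ssl._create_unverified_context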

ng_train = fetch_20newsgroups(subset='train')
ng_test = fetch_20newsgroups(subset='test')
ng_train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

### Get the number of training & test examples

In [3]:
len(ng_train.data), len(ng_test.data)

(11314, 7532)

### Take a peek at the first document and its label

In [4]:
ng_train.data[0]

"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

In [5]:
label_idx = ng_train.target[0]
ng_train.target_names[label_idx]

'rec.autos'

### Evaluate Spacy's recognition of entities, POS

In [6]:
from pprint import pprint

doc = pipeline(ng_train.data[0])
for i, token in enumerate(doc):
    pprint({"text": token.text,
            "lemma": token.lemma_,
            "POS": token.pos_,
            "tag": token.tag_,
            "dep": token.dep_,
            "shape": token.shape_,
            "is_alpha": token.is_alpha,
            "is_stop": token.is_stop})
    if i == 3:
        break

{'POS': 'ADP',
 'dep': 'ROOT',
 'is_alpha': True,
 'is_stop': True,
 'lemma': 'from',
 'shape': 'Xxxx',
 'tag': 'IN',
 'text': 'From'}
{'POS': 'PUNCT',
 'dep': 'punct',
 'is_alpha': False,
 'is_stop': False,
 'lemma': ':',
 'shape': ':',
 'tag': ':',
 'text': ':'}
{'POS': 'PROPN',
 'dep': 'pobj',
 'is_alpha': False,
 'is_stop': False,
 'lemma': 'lerxst@wam.umd.edu',
 'shape': 'xxxx@xxx.xxx.xxx',
 'tag': 'NNP',
 'text': 'lerxst@wam.umd.edu'}
{'POS': 'PUNCT',
 'dep': 'punct',
 'is_alpha': False,
 'is_stop': False,
 'lemma': '(',
 'shape': '(',
 'tag': '-LRB-',
 'text': '('}


### Visualize Spacy's dependency parse

In [7]:
from spacy import displacy

displacy.render(doc, style='dep')

### Let's define a preprocessing function that cleans our data

You'll notice that even the lemmatized text contains meaningless tokens. In the real world you're never going to get around having to do some feature engineering. In NLP this often means writing some regexes to transform text into a usable format. This has become less important in the deep learning era, but applying domain-specific knowledge is always beneficial. In the case of this dataset, we have text that originated in news feeds, some of which is messy. There are email addresses and URLs, grammatical errors, and a lot of punctuation and uninformative characters (e.g., the newline character `\n`). Below is a function that does some very basic regex (regular expression) matching to strip out emails, URLs, punctuation, and other junk.
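As a small illustration of how such a chain of substitutions behaves, the sketch below applies a subset of the patterns from the next cell to a made-up sentence. The helper name `clean` and the sample text are assumptions for this example only.

```python
import re

# Ordered (pattern, replacement) pairs; a subset of the lab's rules.
rules = [
    (r"(?<=\d),(?=\d)", ""),           # remove commas inside numbers
    (r"\d+", "numbr"),                 # map digit runs to a placeholder token
    (r"[\t\n\r\*\.\@\,\-\/]", " "),    # replace punctuation and junk with spaces
    (r"\s+", " "),                     # collapse repeated whitespace
]

def clean(text):
    # Apply every rule in order; later rules see the output of earlier ones.
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text.strip()

sample = "Lines: 15\n\nIt was a 2-door car, priced at 1,200 dollars."
print(clean(sample))
```

Order matters here: the comma inside `1,200` has to be removed before digits are mapped to `numbr`, otherwise the number would turn into two placeholder tokens.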

In [8]:
import re
from spacy.language import Language


# http://emailregex.com/
email_re = r"""(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"""

# replace = [ (pattern-to-replace, replacement),  ...]
replace = [
    (r"<a[^>]*>(.*?)</a>", r"\1"),  # Unwrap HTML anchor tags, keeping the link text
    (email_re, "email"),            # Replace email addresses with the token "email"
    (r"(?<=\d),(?=\d)", ""),        # Remove commas in numbers
    (r"\d+", "numbr"),              # Map digit runs to the special token "numbr"
    (r"[\t\n\r\*\.\@\,\-\/]", " "),   # Punctuation and other junk
    (r"\s+", " ")                   # Strips extra whitespace
]

train_text = ng_train.data
test_text = ng_test.data
for repl in replace:
    train_text = [re.sub(repl[0], repl[1], text) for text in train_text]
    test_text = [re.sub(repl[0], repl[1], text) for text in test_text]

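@Language.component("ng20")
def ng20_preprocess(doc):
    # Drop stop words and punctuation, then lemmatize and lowercase what is left.
    tokens = [token for token in doc 
              if not any((token.is_stop, token.is_punct))]
    tokens = [token.lemma_.lower().strip() for token in tokens]
    tokens = [token for token in tokens if token]
    # NOTE: this returns a plain string rather than a Doc, so the component
    # must stay last in the pipeline. Spacy documents components as returning
    # a Doc, and newer releases may raise an error otherwise.
    return " ".join(tokens)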

pipeline.add_pipe("ng20")

<function __main__.ng20_preprocess(doc)>

#### Peek at our processing pipeline

In [9]:
pipeline.analyze_pipes(pretty=True)


#   Component         Assigns               Requires   Scores             Retokenizes
-   ---------------   -------------------   --------   ----------------   -----------
0   tok2vec           doc.tensor                                          False      
                                                                                     
1   tagger            token.tag                        tag_acc            False      
                                                                                     
2   parser            token.dep                        dep_uas            False      
                      token.head                       dep_las                       
                      token.is_sent_start              dep_las_per_type              
                      doc.sents                        sents_p                       
                                                       sents_r                       
                                                

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'ng20': {'assigns': [], 'requires': [], 'scores': [], 'retokenizes': False}},
 'problems': {'tok2vec': [],
  'tagger': [],
 

### Now pass each training and test document through the pipeline

In [10]:
docs_train = [pipeline(doc) for doc in train_text[:500]]
docs_test = [pipeline(doc) for doc in test_text[:500]]

### Let's look at that first document following this transformation and compare it to the original text

In [11]:
ng_train.data[0]

"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

In [12]:
docs_train[0]

'email thing subject car nntp posting host racnumbr wam umd edu organization university maryland college park lines numbr wonder enlighten car see day numbr door sport car look late numbrs early numbrs call bricklin door small addition bumper separate rest body know tellme model engine spec year production car history info funky look car e mail thanks il bring neighborhood lerxst'

# Part II: Lexical diversity

Sometimes it's useful to understand how diverse the language in some body of text is. One simple heuristic to evaluate diversity is as follows:

$$ lexical\_diversity = \frac{ len(set(all\_words\_in\_doc)) }{ len(doc) }$$

Find the set of all words observed in the document, and divide its size by the total number of words in the document. Let's use this to evaluate the diversity of each category in the 20NG dataset.
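A minimal sketch of the heuristic on a toy token list may help make it concrete. The function name `lexical_diversity` and the sample sentence are assumptions for this example.

```python
def lexical_diversity(tokens):
    # Ratio of distinct words to total words; 0.0 for an empty document.
    tokens = list(tokens)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

short = "the car is the car".split()
print(lexical_diversity(short))        # 3 distinct words out of 5
print(lexical_diversity(short * 4))    # the same 3 distinct words out of 20
```

Repeating the same text four times leaves the vocabulary unchanged but quadruples the denominator, so the score drops. This is worth keeping in mind when comparing bodies of text of very different sizes.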

### (5 pts) Task I: 
In the cell below, compute the diversity of each category in the 20NG dataset using the above heuristic

In [13]:
import numpy as np
import pandas as pd

docs_train_target = ng_train.target[:500]  # keep the same length as docs_train
catgry = [[] for _ in range(20)]

def classify_catgry():
    for i in range(0, 20):
        x = np.where(docs_train_target == i)  # find documents by target label
        catgry[i] = x[0].tolist()             # store document indices for each category
classify_catgry()

contain = [None] * 20

def le_div():
    le_div = [None] * 20
    # NOTE: x is never reset inside the loop, so contain[i] holds the tokens of
    # categories 0..i combined, not of category i alone. The scores printed
    # below come from this cumulative list.
    x = []
    for i in range(0, 20):
        for n in catgry[i]:
            x0 = re.split(r" ", docs_train[n])
            x = x + x0
        contain[i] = x
        le_div[i] = len(set(contain[i])) / len(contain[i])
    # make the result readable
    le_div = pd.DataFrame(le_div, columns=['Lexical_diversity'], index=ng_train.target_names[0:20])
    print(le_div)
le_div()

                          Lexical_diversity
alt.atheism                        0.304685
comp.graphics                      0.289863
comp.os.ms-windows.misc            0.389441
comp.sys.ibm.pc.hardware           0.374093
comp.sys.mac.hardware              0.350357
comp.windows.x                     0.338102
misc.forsale                       0.326993
rec.autos                          0.306954
rec.motorcycles                    0.296546
rec.sport.baseball                 0.282256
rec.sport.hockey                   0.272312
sci.crypt                          0.265989
sci.electronics                    0.258077
sci.med                            0.251459
sci.space                          0.239051
soc.religion.christian             0.229840
talk.politics.guns                 0.224411
talk.politics.mideast              0.217683
talk.politics.misc                 0.215008
talk.religion.misc                 0.213283
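The helper above builds `contain[i]` from a list that is never reset between categories, so each score covers categories 0 through i together. A per-category computation might look like the following sketch. The function names and the toy inputs are hypothetical, and it assumes each document is already a whitespace-separated string, as `docs_train` is here.

```python
from collections import defaultdict

def group_tokens_by_label(texts, labels):
    # Collect the whitespace-separated tokens of each text under its own label.
    grouped = defaultdict(list)
    for text, label in zip(texts, labels):
        grouped[label].extend(text.split())
    return dict(grouped)

def diversity_per_label(texts, labels):
    # Distinct tokens divided by total tokens, computed separately per label.
    grouped = group_tokens_by_label(texts, labels)
    return {label: len(set(tokens)) / len(tokens)
            for label, tokens in grouped.items() if tokens}

toy_texts = ["car engine car", "space orbit", "engine wheel"]
toy_labels = [0, 1, 0]
scores = diversity_per_label(toy_texts, toy_labels)
print(scores)
```

Grouping first and scoring second keeps each category's token list independent of the order in which categories are visited.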


### Explain these scores: 

1. Is this result real or an artifact of some underlying problem with our data? 
2. What might you do to better evaluate lexical diversity on this data using this scoring function?
3. Is this heuristic a good metric for lexical diversity in general?


1. There are definitely some underlying problems with our data. First, there are some stop words without any meaning which shouldn't be counted. Second, this measure is better suited to comparing texts of equal length, because the ratio tends to fall as a text gets longer. Note, however, that lexical diversity is only one part of the assessment of lexical richness.
2. We might compare equal-sized samples from each category, or use additional indices of lexical diversity such as the Type-Token Ratio (TTR) and vocd.
3. In general, this heuristic can't be regarded as a good metric, because its value depends on text length.

### Entropy 
Entropy is another, perhaps more principled, way to evaluate how diverse, or varied, a piece of text is. Recall the definition of entropy, $H(p(x))$:

$$ H(p(x)) = \sum_{i=1}^{N} -p(x_{i}) \log p(x_{i}) $$

In the Bag-of-Words (BOW) feature representation of a document, each document is represented by a word count vector, ${x}_{i} \in \mathbb{R}^{N}$ where $N$ is the cardinality of the set of words in the document.
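To make the link between word counts and entropy concrete, here is a small sketch that normalizes counts into probabilities and applies the formula above with the natural logarithm. The function names and the toy documents are assumptions for this example.

```python
import math
from collections import Counter

def word_distribution(tokens):
    # Turn raw word counts into probabilities that sum to 1.
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def entropy(dist):
    # H(p) = -sum p * log p, in nats; zero-probability entries are skipped.
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

uniform = word_distribution("a b c d".split())
peaked = word_distribution("a a a b".split())
print(entropy(uniform))
print(entropy(peaked))
```

The uniform toy document reaches the maximum possible entropy for four words, $\log 4$, while the document dominated by one word scores lower.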

### (5 pts) Task II:
In order to compute an entropy from this representation, you'll first need to convert those count vectors into probability distributions. Then compute the entropy of the word distributions aggregated over each news category.

In [14]:
from collections import Counter

def calc_tf():
    Hp = [None] * 20
    for i in range(0, 20):
        # Count each word in the category and give it a fixed vector index.
        word_hist = Counter(contain[i])
        bow_featurizer = {word: idx for idx, word in enumerate(word_hist)}
        feature = np.zeros(shape=(len(bow_featurizer)))
        for word in contain[i]:
            feature[bow_featurizer[word]] += 1
        # Normalize the counts into a probability distribution.
        pi = feature / sum(feature)
        # Entropy in nats (natural log); every entry of pi is positive here.
        logpi = np.log(pi)
        Hp[i] = sum(-1 * pi * logpi)
    # make the result readable
    Hp = pd.DataFrame(Hp, columns=['Entropy'], index=ng_train.target_names[0:20])
    print(Hp)
calc_tf()

                           Entropy
alt.atheism               6.504698
comp.graphics             6.774229
comp.os.ms-windows.misc   7.495750
comp.sys.ibm.pc.hardware  7.519256
comp.sys.mac.hardware     7.524877
comp.windows.x            7.519840
misc.forsale              7.523058
rec.autos                 7.436629
rec.motorcycles           7.480803
rec.sport.baseball        7.386669
rec.sport.hockey          7.404271
sci.crypt                 7.447900
sci.electronics           7.457711
sci.med                   7.495092
sci.space                 7.534732
soc.religion.christian    7.586837
talk.politics.guns        7.600483
talk.politics.mideast     7.644440
talk.politics.misc        7.666143
talk.religion.misc        7.667443


### Explain this result

1. What does it mean for a distribution to have high or low entropy?
2. Do these scores make intuitive sense? Any more or less so than the heuristic from Task I?
3. Is entropy a good metric for evaluating lexical diversity in general?


1. High entropy means a high degree of uncertainty (or unpredictability) in a message: the probability mass is spread over many words. Low entropy means the distribution is concentrated on a few words, so the message is more predictable.
2. Yes. Entropy does better than the heuristic from Task I.
3. Yes. It addresses the problem that a text with fewer words appears to have less lexical diversity. However, it gives equal weight to all texts and can't reflect all aspects of lexical richness.

# Part III: Document Similarity

Throughout this course we will discuss the notion of *similarity* between texts and explore ways to measure it. This is a critical component of search and recommender systems. One such approach involves measuring how *close* two word distributions are using the notion of divergence, which we discussed in the first lecture.

### (10 pts) Task III

Using the definition below, compute the KL-divergence, $D_{KL}$, between the word distributions in each category. This will result in a $K \times K$ matrix of divergence values.

$$ D_{KL}(P||Q) = \sum_{i=1}^{N} P(x_{i}) \log \frac{P(x_{i})}{Q(x_{i})} $$
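One way to evaluate this on word counts is sketched below. It assumes add-one (Laplace) smoothing over the union vocabulary so that $Q(x_{i})$ is never zero. That smoothing choice, the function names, and the toy documents are assumptions for this example, not part of the lab.

```python
import math
from collections import Counter

def smoothed_distribution(tokens, vocab, alpha=1.0):
    # Add-alpha smoothing: every vocabulary word gets a nonzero probability.
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {word: (counts[word] + alpha) / total for word in vocab}

def kl_divergence(p, q):
    # D_KL(P||Q) = sum_i P(x_i) * log(P(x_i) / Q(x_i)), in nats.
    return sum(p[word] * math.log(p[word] / q[word]) for word in p)

doc_a = "car engine car wheel".split()
doc_b = "engine engine space orbit".split()
vocab = sorted(set(doc_a) | set(doc_b))

p = smoothed_distribution(doc_a, vocab)
q = smoothed_distribution(doc_b, vocab)
print(kl_divergence(p, q), kl_divergence(q, p))
```

Because both distributions are defined over the same vocabulary and each sums to 1, the divergence is non-negative, it is zero for identical distributions, and the two directions generally give different values.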

In [15]:
# Build one column of word probabilities per category, indexed by word
def get_wholelist():
    df = pd.DataFrame()
    for i in range(0, 20):
        word_hist = Counter(contain[i])
        bow_featurizer = {word: idx for idx, word in enumerate(word_hist)}
        feature = np.zeros(shape=(len(bow_featurizer)))
        for word in contain[i]:
            feature[bow_featurizer[word]] += 1
        pi = feature / sum(feature)
        rows = [k for k in bow_featurizer.keys()]
        cols = pi.tolist()
        df0 = pd.DataFrame(cols, index=rows, columns=[ng_train.target_names[i]])
        # Words missing from a category become NaN after the outer join.
        df = pd.concat([df, df0], axis=1)
    return df
holist = get_wholelist()
KL = pd.DataFrame(index=ng_train.target_names, columns=ng_train.target_names)

def calc_Dpq():
    for i in range(holist.shape[1]):
        for j in range(holist.shape[1]):
            # Keep only words seen in both categories. The two columns are not
            # renormalized afterwards, so they no longer sum to 1 and the
            # result can be negative, unlike a true KL divergence.
            new = holist.iloc[:, [i, j]].dropna(axis=0, how='any')
            dpqij = sum(new.iloc[:, 0] * np.log(new.iloc[:, 0] / new.iloc[:, 1]))
            KL.iloc[i, j] = dpqij
    return KL
calc_Dpq()

Unnamed: 0,alt.atheism,comp.graphics,comp.os.ms-windows.misc,comp.sys.ibm.pc.hardware,comp.sys.mac.hardware,comp.windows.x,misc.forsale,rec.autos,rec.motorcycles,rec.sport.baseball,rec.sport.hockey,sci.crypt,sci.electronics,sci.med,sci.space,soc.religion.christian,talk.politics.guns,talk.politics.mideast,talk.politics.misc,talk.religion.misc
alt.atheism,0.0,0.240381,1.10972,1.130215,1.160876,1.182874,1.211652,1.242372,1.255355,1.292129,1.30974,1.310369,1.327024,1.336233,1.354355,1.302681,1.297712,1.286945,1.288533,1.280324
comp.graphics,-0.035461,0.0,0.77699,0.784496,0.79972,0.810069,0.83032,0.856589,0.868665,0.898255,0.916783,0.918308,0.929384,0.940254,0.953806,0.925437,0.92664,0.928632,0.932699,0.928264
comp.os.ms-windows.misc,0.082197,-0.04817,0.0,0.028954,0.076146,0.101616,0.138192,0.183276,0.224986,0.268528,0.307135,0.331889,0.359342,0.393502,0.440556,0.473128,0.4936,0.528933,0.544724,0.550526
comp.sys.ibm.pc.hardware,0.082028,-0.050675,-0.00696,0.0,0.038416,0.061945,0.095183,0.138403,0.178244,0.218758,0.256294,0.279211,0.304379,0.337481,0.383205,0.415635,0.435744,0.470358,0.485645,0.491383
comp.sys.mac.hardware,0.10012,-0.049256,-0.008199,-0.005542,0.0,0.020104,0.047776,0.085608,0.121692,0.158641,0.193989,0.214649,0.236014,0.266797,0.309109,0.340567,0.359693,0.393541,0.408141,0.413646
comp.windows.x,0.124383,-0.041356,-0.008402,-0.009082,-0.008209,0.0,0.025813,0.061227,0.095904,0.13228,0.16756,0.187062,0.207113,0.236701,0.277064,0.308501,0.327381,0.360989,0.375214,0.380597
misc.forsale,0.137444,-0.035939,-0.008158,-0.010166,-0.013751,-0.007463,0.0,0.032112,0.064438,0.09628,0.130238,0.148746,0.16749,0.196309,0.235281,0.266614,0.285001,0.317207,0.330909,0.336235
rec.autos,0.23552,-0.00501,0.005534,0.002934,-0.007647,-0.004761,-0.001765,0.0,0.028469,0.058724,0.090432,0.108332,0.125355,0.151972,0.188767,0.219472,0.236665,0.268431,0.28179,0.286606
rec.motorcycles,0.216501,-0.013729,0.002592,-0.001755,-0.013606,-0.012267,-0.011887,-0.011208,0.0,0.029004,0.058901,0.075715,0.092015,0.117101,0.15238,0.182152,0.198035,0.228754,0.241143,0.245939
rec.sport.baseball,0.246681,0.013032,0.017106,0.008996,-0.007419,-0.007178,-0.012152,-0.013768,-0.00603,0.0,0.026401,0.043058,0.059065,0.083735,0.11746,0.147344,0.162824,0.192462,0.20434,0.209276
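Several entries in the matrix above are negative, which a true KL divergence cannot be. The toy sketch below shows how that can happen when the sum runs only over words that both distributions share; the two distributions here are made up for illustration.

```python
import math

# Two toy word distributions with different vocabularies.
p = {"a": 0.5, "b": 0.5}
q = {"a": 0.9, "c": 0.1}

# Restrict to shared words without renormalizing, as dropping rows with
# missing values does.
shared = set(p) & set(q)
partial_sum = sum(p[w] * math.log(p[w] / q[w]) for w in shared)

p_mass = sum(p[w] for w in shared)
q_mass = sum(q[w] for w in shared)
print(partial_sum, p_mass, q_mass)
```

Over the shared vocabulary the two sets of probabilities sum to 0.5 and 0.9 rather than 1, so the quantity being summed is no longer a divergence between probability distributions and its sign is not constrained.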


### Explain this result

1. What does it mean for two distributions to have high or low divergence?
2. Do these similarity scores make sense intuitively?
3. Is the resultant $K \times K$ matrix symmetric? Why is this the case?
4. Is $D_{KL}$ a good measure of the similarity between two distributions in general?

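1. The larger the divergence, the more the two distributions differ; a divergence of 0 means they are identical. A true $D_{KL}$ is never negative, so the negative entries above come from summing only over words shared by both categories without renormalizing.
2. No. It sums per-word differences in frequency, which is not the same as the notion of similarity we usually have in mind.
3. No. $D_{KL}(P||Q)$ consists of terms $P(x_{i}) \log \frac{P(x_{i})}{Q(x_{i})}$, which differ from the terms $Q(x_{i}) \log \frac{Q(x_{i})}{P(x_{i})}$ of $D_{KL}(Q||P)$.
4. Yes. But there are other indicators for measuring similarity.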