# Natural Language Processing & Text Mining  

For typical data mining and machine learning tasks, data are often presented in a "structured" form, that is, in tabular form.   
In a text mining task, by contrast, each data point is a sequence of text, which is "unstructured". We therefore need to transform the text from its "unstructured" form into a "structured" one.

The first step in making text data "structured" is to tokenize it. To tokenize text is to segment it into smaller units: words, characters, or punctuation marks. After recognizing all the tokens in a dataset, we can "tell" the computer what to look at when processing a line of text. One way to do this is to count how many times each token appears in a line of text, or to record whether each token appears in it at all (the bag-of-words model).
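
To make the idea concrete, here is a minimal sketch of tokenization and the two bag-of-words views in plain Python. The `tokenize` helper and the example sentence are hypothetical and only for illustration. Note that scikit-learn's vectorizers use their own default token pattern, which drops punctuation and single-character tokens, so their output will differ from this sketch.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text.lower())

# Hypothetical example sentence (not taken from the dataset).
sentence = "The cat sat on the mat, and the dog sat too."
tokens = tokenize(sentence)

# Term-frequency view: how many times each token appears.
counts = Counter(tokens)

# Boolean view: 1 if the token appears at all, regardless of how often.
presence = {token: 1 for token in counts}

print(tokens)
print(counts.most_common(3))
print(presence)
```

Both views discard word order, which is why the representation is called a "bag" of words.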

Load common packages for data transformation

In [None]:
import numpy as np
import pandas as pd

Load the ACL-ARC dataset from the data folder.  
The dataset can be downloaded from https://figshare.com/articles/dataset/ACL-ARC_dataset/12573872    
For this demo, we use only the training split and hold out part of it as a test set.

In [None]:
df = pd.read_json('~/datasets/s4/ACL-ARC/training.jsonl', lines=True)

Show the first 5 lines from the top

In [None]:
df.head()

Get the first line of text. According to its label, it does not contain a citation.

In [None]:
print(df['cur_sent'][0])
print(df['cur_has_citation'][0])

Here, we import the functionality we need from scikit-learn:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

There are several settings we can choose for the text vectorizer:

Unigram term frequency vectorizer: each token is one word, and the vectorizer counts how many times each word appears in the text.

In [None]:
unigram_count_vectorizer = CountVectorizer(encoding='latin-1', binary=False)

Unigram boolean vectorizer: instead of counting word frequency, it records whether each word appears in the text.

In [None]:
unigram_bool_vectorizer = CountVectorizer(encoding='latin-1', binary=True)

Unigram and bigram term frequency vectorizer: each token consists of up to 2 words. We also use the built-in English stop word list, so stopwords are not counted.

In [None]:
bigram_count_vectorizer = CountVectorizer(encoding='latin-1', ngram_range=(1,2), stop_words='english')
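
As a rough illustration of what `ngram_range=(1,2)` combined with stop word removal produces, the sketch below builds unigrams and bigrams in plain Python. The `ngrams` helper and the short `STOP_WORDS` set are hypothetical; scikit-learn's built-in English list is much longer. The sketch assumes stop words are filtered out before n-grams are formed, so a bigram can join two words that were originally separated by a stop word.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams with n_min <= n <= n_max, each joined by a space."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# Hypothetical stop word list, far shorter than a real one.
STOP_WORDS = {"the", "of", "a", "is"}

tokens = "the science of science is a growing field".split()

# Remove stop words first, then build unigrams and bigrams from what is left.
kept = [t for t in tokens if t not in STOP_WORDS]
features = ngrams(kept, 1, 2)

print(features)
```

Here "science science" becomes a bigram even though "of" stood between the two words in the original sentence.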

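tf-idf (term frequency-inverse document frequency) reweights raw term counts by how rare each token is across documents, so tokens that appear in almost every document count for less.     
Unigram tf-idf vectorizer:    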

In [None]:
unigram_tfidf_vectorizer = TfidfVectorizer(encoding='latin-1', use_idf=True, stop_words='english')
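
The weighting can be sketched by hand on a toy corpus. The example below assumes the smoothed formula that scikit-learn documents as its default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each document vector. The three tiny documents are hypothetical.

```python
import math
from collections import Counter

# Hypothetical pre-tokenized corpus.
docs = [
    ["citation", "needed", "here"],
    ["citation", "counts", "grow"],
    ["citation", "citation", "networks"],
]

n_docs = len(docs)

# Document frequency: the number of documents each token occurs in.
df = Counter(token for doc in docs for token in set(doc))

def idf(token):
    """Smoothed inverse document frequency (assumed scikit-learn default)."""
    return math.log((1 + n_docs) / (1 + df[token])) + 1

def tfidf(doc):
    """Return an L2-normalized {token: weight} mapping for one document."""
    tf = Counter(doc)
    weights = {t: tf[t] * idf(t) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

weights = tfidf(docs[0])
print(weights)
```

In the first document, "citation" receives a lower weight than "needed" because it occurs in every document, while "needed" occurs in only one.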

Fit the vocabulary on the texts and transform them into vectors. "fit" collects the unique tokens into the vocabulary; "transform" converts each document into a vector based on that vocabulary.

In [None]:
word_vector = unigram_count_vectorizer.fit_transform(df['cur_sent'].values.tolist())
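
The difference between the two steps can be sketched with a tiny stand-in class. `TinyCountVectorizer` is hypothetical and much simpler than scikit-learn's implementation (it splits on whitespace only), but it shows the key point: once the vocabulary is fitted, tokens that were never seen during "fit" are ignored by "transform".

```python
class TinyCountVectorizer:
    """Hypothetical, simplified stand-in for a count vectorizer."""

    def fit(self, documents):
        # Collect the unique tokens and assign each one a column index.
        tokens = sorted({tok for doc in documents for tok in doc.lower().split()})
        self.vocabulary_ = {tok: idx for idx, tok in enumerate(tokens)}
        return self

    def transform(self, documents):
        # Count tokens per document; tokens outside the vocabulary are skipped.
        rows = []
        for doc in documents:
            row = [0] * len(self.vocabulary_)
            for tok in doc.lower().split():
                if tok in self.vocabulary_:
                    row[self.vocabulary_[tok]] += 1
            rows.append(row)
        return rows

train = ["we cite prior work", "prior work exists"]
vec = TinyCountVectorizer().fit(train)
new_rows = vec.transform(["we cite new work"])

print(vec.vocabulary_)
print(new_rows)
```

The word "new" does not appear in the training texts, so it has no column and leaves no trace in the resulting vector.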

The shape of the vectorized dataset: there are 859636 data points (rows) and 261582 unigram tokens (columns)

In [None]:
print(word_vector.shape)

As we can see here, the vector for a line of text is sparse: most of the columns are 0, because the vector has one column for every token in the dataset, even for tokens that do not occur in this particular line of text.

In [None]:
print(word_vector[0].toarray())
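
Because almost every entry is zero, it is wasteful to store the zeros explicitly. `fit_transform` returns a SciPy sparse matrix for this reason, and that is why we need `.toarray()` to print a row. The sketch below illustrates the idea in plain Python with a hypothetical 20-column row; the helper names are made up for this example and are not SciPy's storage format.

```python
def to_sparse(row):
    """Keep only the (column, value) pairs whose value is non-zero."""
    return [(col, val) for col, val in enumerate(row) if val != 0]

def density(row):
    """Fraction of entries in the row that are non-zero."""
    return sum(1 for val in row if val != 0) / len(row)

# A toy dense row: a 20-token vocabulary, of which this sentence uses 3 tokens.
dense_row = [0] * 20
dense_row[2] = 1
dense_row[7] = 2
dense_row[15] = 1

sparse_row = to_sparse(dense_row)

print(sparse_row)
print(density(dense_row))
```

With a vocabulary of hundreds of thousands of tokens and sentences of a few dozen words, the density of a real row is far lower than in this toy case.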

The size of the vocabulary, in other words the number of distinct tokens in the dataset, is also the length of each vector.

In [None]:
print(len(unigram_count_vectorizer.vocabulary_))

## Classification Task with Vectorized Text  

Using the vectorized text, we can train a simple logistic regression classifier

To validate the model, we split the dataset into a training set (60%) and a test set (40%)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(word_vector, df['cur_has_citation'], test_size=0.4, random_state=0)

Import logistic regression model and performance metrics from scikit-learn

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, accuracy_score

Initialize the logistic regression model, setting the maximum iteration to 10000

In [None]:
clf = LogisticRegression(max_iter = 10000)

Fit the model with training split of the vectorized data

In [None]:
clf.fit(X_train, y_train)

Using the trained model, we make predictions on the test split

In [None]:
y_pred = clf.predict(X_test)

Calculate the F1 score for each class (negative and positive) separately

In [None]:
f1_score(y_test, y_pred, average=None)
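
With `average=None`, one F1 score is returned per class instead of a single averaged number. The following sketch recomputes per-class F1 by hand to show what goes into each score. The `f1_per_class` helper and the six toy labels are hypothetical and are not the dataset's actual labels.

```python
def f1_per_class(y_true, y_pred, labels):
    """Compute one F1 score per label, treating that label as the positive class."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return scores

# Hypothetical labels: True means the sentence has a citation.
y_true_toy = [False, False, False, True, True, False]
y_pred_toy = [False, False, True, True, False, False]

scores = f1_per_class(y_true_toy, y_pred_toy, labels=[False, True])
print(scores)
```

Looking at the classes separately matters when they are imbalanced: a model can score well on the majority class while doing poorly on the minority class.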

Calculate the accuracy

In [None]:
accuracy_score(y_test, y_pred)

Each word token corresponds to one coefficient in the logistic regression. A positive coefficient pushes the prediction toward the positive class and a negative one toward the negative class; the larger its magnitude, the more the token influences the decision. In the following dataframe, we sort the tokens by coefficient value in descending order.

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.concat([pd.DataFrame(unigram_count_vectorizer.get_feature_names_out(), columns=['word']), 
           pd.DataFrame(clf.coef_.transpose(), columns=['coef'])], axis = 1).sort_values(by = 'coef', ascending = False)
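
To see why the coefficients are interpretable, recall that a binary logistic regression predicts the probability of the positive class as the sigmoid of the intercept plus the sum of coefficient times feature value. The sketch below uses made-up coefficients for three tokens; the numbers are hypothetical and are not taken from the fitted model.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients and intercept (not from the trained classifier).
coef = {"proposed": 1.2, "et": 2.5, "we": -0.8}
intercept = -1.0

def predict_proba(token_counts):
    """P(positive class) = sigmoid(intercept + sum of coefficient * count)."""
    z = intercept + sum(coef.get(tok, 0.0) * n for tok, n in token_counts.items())
    return sigmoid(z)

p_without = predict_proba({"we": 1})
p_with = predict_proba({"we": 1, "et": 1})

print(p_without, p_with)
```

Adding a token with a large positive coefficient raises the predicted probability, while a token with a negative coefficient lowers it.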

Next, we will try out tf-idf: a normalized form of the bag-of-words representation

In [None]:
tfidf_word_vector = unigram_tfidf_vectorizer.fit_transform(df['cur_sent'].values.tolist())

Split the dataset into training and testing splits

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(tfidf_word_vector, df['cur_has_citation'], test_size=0.4, random_state=0)

Initialize the logistic regression model

In [None]:
clf = LogisticRegression(max_iter = 1000)

Fit the model with training split of the vectorized data

In [None]:
clf.fit(X_train, y_train)

Using the trained model, we make predictions on the test split

In [None]:
y_pred = clf.predict(X_test)

Calculate the F1 score for each class (negative and positive) separately

In [None]:
f1_score(y_test, y_pred, average=None)

Calculate the accuracy

In [None]:
accuracy_score(y_test, y_pred)

As before, each word token corresponds to one coefficient in the logistic regression, and tokens with larger coefficient magnitudes have more influence on the decision. In the following dataframe, we sort the tokens by coefficient value in descending order.

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.concat([pd.DataFrame(unigram_tfidf_vectorizer.get_feature_names_out(), columns=['word']), 
           pd.DataFrame(clf.coef_.transpose(), columns=['coef'])], axis = 1).sort_values(by = 'coef', ascending = False)

Different vectorization methods lead to different model performance and different model interpretations

## More Language Features with spaCy

Aside from word frequency, there are many more interesting features we can extract from a line of text.  
In the following section, we will explore more language features with the spaCy package

In [None]:
#installing spacy
!pip install spacy
!python -m spacy download en_core_web_lg

In [None]:
# Import spaCy
import spacy

# Loading a pre-trained Pipeline 
nlp = spacy.load("en_core_web_lg")

# Process the first sentence in our dataset with the loaded Pipeline
tokens = nlp(df['cur_sent'][0])

Print out the line of text we just passed to the Pipeline

In [None]:
print(tokens.text)

Getting all the features generated by the Pipeline from the line of text we passed

In [None]:
sentence_features = {}
sentence_features['word'] = []
sentence_features['lemma'] = []
sentence_features['pos_tag'] = []
sentence_features['shape'] = []
sentence_features['is_alphabetic'] = []
sentence_features['is_stopword'] = []

for token in tokens:
    sentence_features['word'].append(token.text)
    sentence_features['lemma'].append(token.lemma_)
    sentence_features['pos_tag'].append(token.pos_)
    sentence_features['shape'].append(token.shape_)
    sentence_features['is_alphabetic'].append(token.is_alpha)
    sentence_features['is_stopword'].append(token.is_stop)

In the table below, we see that the Pipeline tokenized the text into words.  
"lemma" is the base form of the token (word)  
"pos_tag" is the part-of-speech tag of the token  
"shape" shows the orthographic shape of the token (uppercase or lowercase letters, punctuation, digits)  
"is_alphabetic" shows whether the token is alphabetic  
"is_stopword" shows whether the token is a stopword  

In [None]:
pd.set_option('display.max_columns', None)
pd.DataFrame(sentence_features).T

In [None]:
import requests
import json

Getting the abstract of the famous "Science of Science" review paper    
(Fortunato, S., Bergstrom, C. T., Börner, K., Evans, J. A., Helbing, D., Milojević, S., ... & Barabási, A. L. (2018). Science of science. Science, 359(6379), eaao0185.)

In [None]:
SOS_paper = requests.get(
    'https://api.openalex.org/works/https://doi.org/10.1126/science.aao0185'
).json()

abstract_inverted_index = SOS_paper['abstract_inverted_index']

max_ids = 0
for k in abstract_inverted_index.keys():
    for i in abstract_inverted_index[k]:
        if i > max_ids:
            max_ids = i

abstract = [' '] * (max_ids + 1)

for k in abstract_inverted_index.keys():
    for i in abstract_inverted_index[k]:
        abstract[i] = k

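# Drop the section headings embedded in the abstract and any unfilled positions.
# Filtering does not raise an error if one of the headings is absent.
unwanted = {'BACKGROUND', 'ADVANCES', 'OUTLOOK', ' '}
abstract = [word for word in abstract if word not in unwanted]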
     
abstract = ' '.join(abstract)

In [None]:
abstract
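
The reconstruction above works because the abstract is stored as an inverted index: a mapping from each word to the list of positions where it occurs. A minimal sketch with a hypothetical four-word index shows the same idea in isolation; `rebuild_text` and `toy_index` are illustrative names, not part of any API.

```python
def rebuild_text(inverted_index):
    """Turn a {word: [positions]} mapping back into a space-joined string."""
    position_to_word = {}
    for word, positions in inverted_index.items():
        for pos in positions:
            position_to_word[pos] = word
    # Read the words back in position order.
    return " ".join(position_to_word[pos] for pos in sorted(position_to_word))

# Hypothetical inverted index: "science" occurs at positions 1 and 3.
toy_index = {"science": [1, 3], "The": [0], "of": [2]}

text = rebuild_text(toy_index)
print(text)
```

Sorting the positions avoids having to find the maximum index first and skips any positions that were never filled.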

### Tokenization

In [None]:
doc = nlp(abstract)

In [None]:
for token in doc:
    print(token.text)

### Lemmatization

Lemmatization is the process of reducing the inflected forms of a word, and sometimes its derivationally related forms, to a common base form. This base form is called a lemma.

In [None]:
text = "am are is"
[token.lemma_ for token in nlp(text)]

In [None]:
text = "look looks looked"
doc = nlp(text)
for token in doc:
    print("token:{} -> lemma:{}".format(token.text,token.lemma_ ))

In [None]:
doc = nlp(abstract)

In [None]:
for token in doc:
    print("token:{} -> lemma:{}".format(token.text,token.lemma_ ))

### Word Frequency

In [None]:
from collections import Counter

In [None]:
doc = nlp(abstract)
words = [token.text for token in doc if not token.is_punct]

In [None]:
word_counter = Counter(words)
word_counter.most_common(5)

### Stopwords Removal

Stopwords usually contribute little to the semantic meaning of a sentence, so in many cases we remove them.

In [None]:
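# Import the stop word set directly; reaching it through spacy.lang.en only works if that submodule is already loaded
from spacy.lang.en.stop_words import STOP_WORDS as spacy_stopwords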

In [None]:
len(spacy_stopwords)

In [None]:
list(spacy_stopwords)[:8]

In [None]:
no_stop_words = [token for token in doc if not token.is_stop]

In [None]:
no_stop_words

### Token Attributes

Each token in a spaCy document has a set of attributes that answer questions such as: what is the lemma of the token? Is the token a stopword? Is the token alphabetic?

In [None]:
cols = ("text", "lemma_","is_punct", "is_stop", "is_alpha", "is_space", "lower_")

In [None]:
rows = [] 
for t in doc:
    row = [t.text, t.lemma_,  t.is_punct,  t.is_stop,  t.is_alpha,  t.is_space,  t.lower_]
    rows.append(row)
attri_pdf = pd.DataFrame(rows, columns=cols)

In [None]:
attri_pdf

### Sentence Segmentation

spaCy breaks a document into sentences

In [None]:
for sent in doc.sents:
    print("start_pos={}, end_pos={}, text:{}".format(sent.start, sent.end, sent.text))

### Part of Speech Tagging

Associate each word in a text with its correct lexical-syntactic category

In [None]:
for token in doc:
    print(token, token.tag_, token.pos_, spacy.explain(token.tag_))

### Dependency Parsing

Dependency grammars represent syntactic dependency relations between words that show the syntactic structure

In [None]:
for token in next(doc.sents):
    print("{0}/{1} <--{2}-- {3}/{4}".format(
        token.text, token.tag_, token.dep_, token.head.text, token.head.tag_))

In [None]:
from spacy import displacy
displacy.render(next(doc.sents), style='dep', jupyter=True, options={'distance': 90})

### Named Entity Recognition

In [None]:
from spacy import displacy
displacy.render(doc, style="ent")

### Word Embeddings

Word embeddings, such as word2vec, are neural-network methods that represent each word as a fixed-size numerical vector encoding aspects of its meaning. The semantic information for each word is learned from the contexts in which the word appears in the corpus (the training data for the embedding model).

In [None]:
man = nlp.vocab["man"]
woman = nlp.vocab["woman"]
king = nlp.vocab["king"]
queen = nlp.vocab["queen"]

In [None]:
queen.vector

In [None]:
queen.similarity(king)

In [None]:
queen.similarity(woman)

In [None]:
man.similarity(woman)

In [None]:
import numpy as np
def cosine(x,y):
    return np.dot(x,y) / (np.sqrt(np.dot(x,x)) * np.sqrt(np.dot(y,y)))

In [None]:
cosine(king.vector-man.vector+woman.vector, queen.vector)
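
The analogy arithmetic is easier to follow with vectors small enough to read. In the sketch below, the 2-dimensional vectors are hypothetical: the first axis loosely stands for "royalty" and the second for "gender". Real embeddings have hundreds of dimensions and no such clean axes.

```python
import math

def cosine(x, y):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Hypothetical 2-d word vectors (not taken from any trained model).
vectors = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.8],
    "woman": [0.1, -0.8],
}

# king - man + woman, computed dimension by dimension.
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# Find the word whose vector points in the direction closest to the target.
best = max(vectors, key=lambda word: cosine(target, vectors[word]))

print(target, best)
```

Subtracting "man" and adding "woman" flips the second axis while leaving the first untouched, which lands on "queen". In practice, analogy searches usually exclude the three input words from the candidates.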

By default, Token.vector returns the vector for its underlying Lexeme, while Doc.vector and Span.vector return the average of the vectors of their tokens.

In [None]:
doc1 = nlp("The quick brown fox jumps over the lazy dog")
doc2 = nlp("The lazy dog jumps over the quick brown fox")

In [None]:
doc1.similarity(doc2)

In [None]:
doc1_vec=doc1.vector
doc2_vec=doc2.vector

In [None]:
cosine(doc2_vec,doc1_vec)
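
Since a document vector is an average of token vectors, it does not depend on word order. The two example sentences contain the same words, so with static word vectors their similarity should come out at or very near 1. A minimal sketch with hypothetical 2-dimensional word vectors illustrates why:

```python
# Hypothetical 2-d word vectors (not taken from any trained model).
word_vectors = {
    "quick": [1.0, 0.0],
    "brown": [0.0, 1.0],
    "fox": [1.0, 1.0],
    "dog": [-1.0, 2.0],
}

def mean_vector(words):
    """Average the word vectors dimension by dimension."""
    dims = len(next(iter(word_vectors.values())))
    return [sum(word_vectors[w][d] for w in words) / len(words)
            for d in range(dims)]

# The same four words in two different orders.
a = mean_vector(["quick", "brown", "fox", "dog"])
b = mean_vector(["dog", "fox", "brown", "quick"])

print(a, b)
```

This is a limitation to keep in mind: averaged embeddings cannot tell apart two sentences that use the same words to say different things.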

### Try it yourself!    
Use word embeddings as features for the citation-worthiness classification task.

References: 
- https://scikit-learn.org/stable/
- https://spacy.io/
- https://github.com/tong-zeng/spaCy_tutorial