## Text Mining  

In traditional data mining, data are usually "structured", that is, presented in tabular form.  
In a text mining task, by contrast, each data point is a sequence of text, which is "unstructured", as the first data point of the dataset used in this notebook shows. We therefore need to transform the text from its "unstructured" form into a "structured" one.

The first step in making text data "structured" is tokenization: segmenting the text into smaller units (tokens), such as words, characters or punctuation marks. Once all the tokens in a dataset have been identified, we can "tell" the computer what to look at when processing a line of text. One way to do this is to count how many times each token appears in a line of text; another is to record only whether each token appears at all.
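As a rough illustration of tokenization, the sketch below splits a made-up sentence into word-level and character-level tokens using a simple regular expression. The sentence and the pattern are assumptions chosen for this example; real tokenizers (including the ones used later in this notebook) apply more elaborate rules.

```python
import re

# A made-up sentence, not taken from the citation dataset
text = "Text mining is fun!"

# Word-level tokens: runs of word characters, or single punctuation marks
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character-level tokens for one word
char_tokens = list("fun!")

print(word_tokens)  # ['Text', 'mining', 'is', 'fun', '!']
print(char_tokens)  # ['f', 'u', 'n', '!']
```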

Load common packages for data transformation

In [2]:
import numpy as np
import pandas as pd

Loading the citation dataset from the data folder

In [None]:
df = pd.read_json('~/datasets/s4/ACL-ARC/training.jsonl', lines=True)

Show the first 5 lines from the top

In [None]:
df.head()

Get the first line of text. According to its label, it does not contain a citation.

In [None]:
print(df['cur_sent'][0])
print(df['cur_has_citation'][0])

Here, we import the functionality we need from scikit-learn:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

There are several settings we can choose for the text vectorizer:

Unigram term frequency vectorizer: each token is one word, and the vectorizer counts how many times each word appears in the text

In [None]:
unigram_count_vectorizer = CountVectorizer(encoding='latin-1', binary=False)

Unigram boolean vectorizer: instead of counting word frequency, it records whether the word appears in the text

In [None]:
unigram_bool_vectorizer = CountVectorizer(encoding='latin-1', binary=True)

Unigram and bigram term frequency vectorizer: each token consists of one or two words. We also use the built-in English stop word list, so stop words are not counted

In [None]:
bigram_count_vectorizer = CountVectorizer(encoding='latin-1', ngram_range=(1,2), stop_words='english')

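tf-idf is a weighted version of the word frequency count: each term frequency is scaled by the token's inverse document frequency, so tokens that appear in many lines of text receive less weight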

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

Unigram tf-idf vectorizer

In [None]:
unigram_tfidf_vectorizer = TfidfVectorizer(encoding='latin-1', use_idf=True, stop_words='english')

Fit the vocabulary on the texts and transform them into vectors. "fit" collects the unique tokens into the vocabulary; "transform" converts each document into a vector based on that vocabulary

In [None]:
word_vector = unigram_count_vectorizer.fit_transform(df['cur_sent'].values.tolist())

The size of the vectorized dataset: there are 859636 data points and 261582 unigram tokens

In [None]:
print(word_vector.shape)

As we can see here, the vector for a line of text is sparse: most of the columns are 0, because the vectorizer has one column for every token in the dataset, even when a token does not appear in that particular line of text

In [None]:
print(word_vector[0].toarray())
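The vectorizer returns a SciPy sparse matrix, which stores only the non-zero entries. The sketch below, on made-up toy data, measures how sparse such a matrix is through its `nnz` attribute without converting it to a dense array. Converting a single row with `toarray()` is harmless, but a dense copy of the full matrix above would have roughly 2.2 × 10^11 cells, so it is usually best avoided.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents invented for illustration; no token is shared between them
toy_docs = ["alpha beta", "gamma delta", "epsilon zeta"]
toy_matrix = CountVectorizer().fit_transform(toy_docs)

n_rows, n_cols = toy_matrix.shape
# nnz is the number of stored (non-zero) entries
density = toy_matrix.nnz / (n_rows * n_cols)

print(toy_matrix.shape)  # (3, 6)
print(toy_matrix.nnz)    # 6
print(density)           # about 0.33: two thirds of the cells are zero
```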

The size of the vocabulary, in other words the number of distinct tokens in the dataset, is also the length of each vector

In [None]:
print(len(unigram_count_vectorizer.vocabulary_))

## Classification Task with Vectorized Text  

Using the vectorized text, we can train a simple logistic regression classifier

In order to validate the model, we split the entire dataset into training dataset and testing dataset

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(word_vector, df['cur_has_citation'], test_size=0.4, random_state=0)
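As a quick check of what `test_size=0.4` does, the sketch below splits ten made-up samples with the same settings: 60% of the rows go to the training split and 40% to the testing split, and `random_state=0` makes the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten toy samples with two features each, invented for illustration
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.array([0, 1] * 5)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.4, random_state=0)

print(toy_X_train.shape, toy_X_test.shape)  # (6, 2) (4, 2)
```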

Import logistic regression model and performance metrics from scikit-learn

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, accuracy_score

Initialize the logistic regression model, setting the maximum iteration to 10000

In [None]:
clf = LogisticRegression(max_iter = 10000)

Fit the model with training split of the vectorized data

In [3]:
clf.fit(X_train, y_train)

Using the trained model, we make predictions on the test split

In [None]:
y_pred = clf.predict(X_test)

Calculate the F1 score for both the positive and the negative class

In [None]:
f1_score(y_test, y_pred, average=None)

Calculate the accuracy

In [None]:
accuracy_score(y_test, y_pred)
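To show how to read these two metrics, here is a small worked example on hand-made labels (not model output). With `average=None`, `f1_score` returns one score per class, ordered by class label, while `accuracy_score` returns the overall fraction of correct predictions.

```python
from sklearn.metrics import f1_score, accuracy_score

# Hand-made labels for illustration: 3 negatives, 2 positives
toy_true = [0, 0, 0, 1, 1]
toy_pred = [0, 0, 1, 1, 0]

# Class 0: precision 2/3, recall 2/3 -> F1 = 2/3
# Class 1: precision 1/2, recall 1/2 -> F1 = 1/2
per_class_f1 = f1_score(toy_true, toy_pred, average=None)
accuracy = accuracy_score(toy_true, toy_pred)

print(per_class_f1)
print(accuracy)  # 3 of 5 predictions are correct
```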

Each token corresponds to one coefficient in the logistic regression model. Roughly speaking, the more strongly a token pushes the prediction toward the positive class, the larger its coefficient; tokens with large negative coefficients push the prediction toward the negative class. In the following dataframe, we sort the tokens by coefficient value in descending order.

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
pd.DataFrame({'word': unigram_count_vectorizer.get_feature_names_out(),
              'coef': clf.coef_[0]}).sort_values(by='coef', ascending=False)

## More Language Features with spaCy

Besides word frequency, there are many more interesting features we can extract from a line of text.  
In the following section, we will explore more language features with the spaCy package

In [None]:
# Import spaCy
import spacy

# Loading a pre-trained Pipeline 
nlp = spacy.load("en_core_web_lg")

# Process the first line of sentence in our dataset with the loaded Pipeline
tokens = nlp(df['cur_sent'][0])

Print out the line of text we just passed to the Pipeline

In [None]:
print(tokens.text)

Getting all the features generated by the Pipeline from the line of text we passed

In [None]:
sentence_features = {}
sentence_features['word'] = []
sentence_features['lemma'] = []
sentence_features['pos_tag'] = []
sentence_features['shape'] = []
sentence_features['is_alphabetic'] = []
sentence_features['is_stopword'] = []

for token in tokens:
    sentence_features['word'].append(token.text)
    sentence_features['lemma'].append(token.lemma_)
    sentence_features['pos_tag'].append(token.pos_)
    sentence_features['shape'].append(token.shape_)
    sentence_features['is_alphabetic'].append(token.is_alpha)
    sentence_features['is_stopword'].append(token.is_stop)

In the table below, we see that the Pipeline tokenized the text into words.  
"lemma" is the base form of the token (word)  
"pos_tag" is the part-of-speech tag of the token  
"shape" shows the visual shape of the token (uppercase or lowercase letters, punctuation, digits)  
"is_alphabetic" shows whether the token consists only of alphabetic characters  
"is_stopword" shows whether the token is a stopword  

In [None]:
pd.set_option('display.max_columns', None)
pd.DataFrame(sentence_features).T
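Some of these attributes are purely lexical and do not need a trained model. As a sketch under that assumption, the example below uses `spacy.blank("en")`, a tokenizer-only English pipeline, on a made-up sentence. Attributes such as `lemma_` and `pos_` are not filled in by a blank pipeline, so they are left out here; they require a trained pipeline like the one loaded above.

```python
import spacy

# A blank English pipeline: tokenizer and lexical attributes only, no trained components
nlp_blank = spacy.blank("en")

# Toy sentence invented for illustration
toy_doc = nlp_blank("Apple shipped it in 2019")

words = [token.text for token in toy_doc]
shapes = [token.shape_ for token in toy_doc]
alphabetic = [token.is_alpha for token in toy_doc]
stopwords = [token.is_stop for token in toy_doc]

print(words)       # ['Apple', 'shipped', 'it', 'in', '2019']
print(shapes)      # ['Xxxxx', 'xxxx', 'xx', 'xx', 'dddd']
print(alphabetic)  # [True, True, True, True, False]
print(stopwords)   # [False, False, True, True, False]
```

In the shape string, `X` stands for an uppercase letter, `x` for a lowercase letter and `d` for a digit; runs of the same character class longer than four are truncated, which is why "shipped" appears as `xxxx`.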

## Activity

Try using tf-idf vectors to train the logistic regression model. In that case, which tokens are the most important?