# Purpose

This notebook preprocesses a text dataset in preparation for LDA and other techniques. It is intended to replace the designer preprocess-text module.

The output contains each document as a single string, with tokens separated by spaces. Where a token would itself contain spaces (as with n-grams), those spaces are replaced with underscores.
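
As a rough illustration of that output format, here is a minimal sketch. The helper name `join_tokens` and the example tokens are made up for this example and are not used elsewhere in the notebook.

```python
def join_tokens(tokens):
    """Join tokens with spaces, replacing spaces inside a token with underscores."""
    return " ".join(token.replace(" ", "_") for token in tokens)

# Hypothetical tokens; "group work" stands in for an n-gram token
example_tokens = ["group work", "feel", "rushed"]
print(join_tokens(example_tokens))  # group_work feel rushed
```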

## Current functionality of the notebook includes:

- Lemmatization using a spaCy pretrained model
- Defining words to remove and words to save, starting from spaCy's stop word list
- Defining custom regular expressions that match tokens to remove
- Viewers for the tokens to be removed at each step, sorted by count
- Viewer for random comments in their processed and unprocessed forms
- Removal of unicode escape sequences (such as dashes and other stray characters) before tokenization
- Viewer for finding responses whose tokens contain a particular string


## To Do

### Now
- Parenthesis handling. Sometimes we get word(word and the two words don't get separated. It doesn't seem to happen every time, so the cause is unclear.
- Change the display_side_by_side function so it does not run a general replace on "table". It ruins the display of words that include "table".
- Add word search for "Word I have" interact
- Add doc-id to random docs for that interact. Add unprocessed doc option to token search in the final interact.

### Later
- Language removal? Right now the general unicode string removal gets rid of Chinese answers. There are still German and Spanish answers at the very least.
- Number removal? Currently only single-digit numbers are removed, because all 1 character tokens are removed.
- Period handling. word.other currently comes out as a single token, but google.com should stay a single token. Unclear how to tell the two apart.
- N-gram generation. I would probably want this to occur before stop word removal. Would need to be clear how to disable it easily.

## This notebook is made with "20210625_SES_and_SET_comments.csv" in mind

Note that if you run this notebook on the entire dataset you will be waiting for about 5 hours on a single-core machine since spacy takes its time.

# Imports

## Libraries

In [None]:
import pandas as pd
import numpy as np
DF = pd.DataFrame

For spaCy, we use a pretrained English language processing pipeline called "en_core_web_sm".

The command to install it, if you don't have it, is below. On Azure I had to go into the terminal, activate conda, then run the command there. Somehow the notebook does not seem to run in the conda environment on Azure.

In [None]:
#!conda activate azureml_py38
#!python -m spacy download en_core_web_sm

In [None]:
import spacy
#nlp = spacy.load("en_core_web_sm")
import en_core_web_sm
nlp = en_core_web_sm.load()

In [None]:
from collections import Counter
import string
import multiprocessing as mp

In [None]:
from IPython.display import display, display_html
from ipywidgets import interact, interact_manual

## Data

In [None]:
data_path = "/home/azureuser/cloudfiles/code/Data/pp-20210625_SES_and_SET_comments.csv"
raw_text_col = "answer"
text_col = "Preprocessed answer"
index_col = "unique_comment_ID"

In [None]:
raw_data = pd.read_csv(data_path)
raw_data.set_index(index_col, inplace=True)
raw_data.dropna(inplace=True)

In [None]:
display(len(raw_data))
raw_data.head(3)

# Helper Functions

## Word Counts

In [None]:
def get_word_counts(texts):
    '''Takes an iterable (or DF) of iterables of words and returns a Counter of word counts.'''
    if isinstance(texts, DF):
        texts = texts[text_col] # Pull the token column out of the DataFrame that was passed in
    return Counter(term for doc in texts for term in doc)

In [None]:
def word_counts_DF(texts):
    counts = get_word_counts(texts)
    df = DF.from_dict(counts, orient="index")
    df.index.name = "word"
    df.columns = ["count"]
    return df
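
The counting helpers above boil down to flattening the token lists into a `Counter`. A small sketch on made-up token lists (the `toy_` names are only for this example):

```python
from collections import Counter

# Two hypothetical tokenized documents
toy_texts = [["class", "be", "great"], ["great", "class"]]

# Same flattening pattern as get_word_counts: every term of every document
toy_counts = Counter(term for doc in toy_texts for term in doc)
print(toy_counts["class"], toy_counts["be"])  # 2 1
```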

## Display

In [None]:
def display_side_by_side(*args):
    html_str=''
    for df in args:
        html_str+=df.to_html() + ("\xa0" * 5) # Spaces
    display_html(html_str.replace('table','table style="display:inline"'),raw=True)
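
One possible way to avoid touching the word "table" inside cells is to rewrite only the opening `<table` tag. This is a sketch on a hand-written HTML string, not a tested replacement for the function above; `inline_tables` is a hypothetical helper name.

```python
import re

def inline_tables(html):
    """Add an inline style to opening <table> tags only, leaving cell text alone."""
    return re.sub(r"<table\b", '<table style="display:inline"', html)

# Hand-written sample with the word "table" inside the cells
sample_html = '<table border="1"><tr><td>table</td><td>timetable</td></tr></table>'
print(inline_tables(sample_html))
```

Because the pattern requires a `<` directly before `table`, neither the closing `</table>` tag nor the cell contents are changed.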

In [None]:
def display_head_wide(df,num = 40,cols = 5):
    num = min(num,len(df)) # Just in case num is specified to be larger than the number of entries in df
    per_col = int(np.ceil(num/cols)) # Figure out how many to show per column
    # Partition the first num rows into chunks of per_col rows, capping the last chunk at num
    display_side_by_side(*[df.iloc[x: min(x + per_col, num)] for x in range(0,num,per_col)])
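
To make the column partitioning easier to follow, here is the same idea on a plain list. `split_into_columns` is a hypothetical helper for illustration; this version caps the last chunk at `num` so that no more than `num` items are shown.

```python
import math

def split_into_columns(items, num=40, cols=5):
    """Split the first num items into chunks of ceil(num / cols) items each."""
    num = min(num, len(items))
    if num == 0:
        return []
    per_col = math.ceil(num / cols)
    return [items[x:min(x + per_col, num)] for x in range(0, num, per_col)]

print(split_into_columns(list(range(10)), num=7, cols=3))  # [[0, 1, 2], [3, 4, 5], [6]]
```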

# Preprocessing to do on the raw text

Operations done in this section should mostly replace certain characters with spaces or special characters with an equivalent character. Problematic units of text should be removed as tokens.

In [None]:
raw_data[text_col] = raw_data[raw_text_col]

Replace every occurrence of the special apostrophe character with a plain apostrophe.

In [None]:
raw_data[text_col] = raw_data[text_col].str.replace(r"<U\+0092>", "'", regex=True) # regex=True is needed on pandas 2.0+, where the default is a literal match
# any(
#     raw_data[text_col].str.contains(r"<U\+0092>")
# )

There are a variety of weird ways to separate words, some of which spacy doesn't understand. We will replace them all with spaces.

In [None]:
separator_patterns = [
    r"--", # Double dash, likely caused by something Austin did on the import end
    r"<U\+0097>", # Some random dash character
    r"<U\+00A0>", # A weird space character
    r"<U\+200B>", # Zero width space??
    r"<U\+0093>|<U\+0094>", # Strange quotation marks. Could be replaced with actual quotes.
    r"<U\+00B7>", # Middle dot character
    r"-(\s|$)", # To get at word final dashes
    r"(^|\s)-", # To get at word initial dashes
    r"=", # Some people use = as a dash sort of thing
    r"<U\+(.){4}>", # Remove all other unicode things. Might just only do this going forward.
]

In [None]:
for pattern in separator_patterns:
    raw_data[text_col] = raw_data[text_col].str.replace(pattern," ",regex = True)
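
To see what the separator replacement does, here is a small sketch that applies a few of the patterns to a made-up comment with plain `re.sub`. The sample string and the `toy_patterns` subset are assumptions for illustration, not rows from the dataset.

```python
import re

# Made-up comment containing a double dash, an escaped non-breaking space and a lone dash
sample = "great class--loved it<U+00A0>really - thanks"

# A subset of the separator patterns, written as raw strings
toy_patterns = [r"--", r"<U\+00A0>", r"-(\s|$)", r"(^|\s)-"]

cleaned = sample
for pattern in toy_patterns:
    cleaned = re.sub(pattern, " ", cleaned)

# Extra spaces are left behind, so split on whitespace to view the words
print(cleaned.split())  # ['great', 'class', 'loved', 'it', 'really', 'thanks']
```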

Some code to check for a pattern in the unprocessed text.

In [None]:
# raw_data[
#     raw_data[raw_text_col].str.contains("U\+9017")
#     ].head(5)[text_col].tolist()

# Preprocessing on Tokens

## Defining texts

In [None]:
texts = raw_data.iloc[:][text_col].copy()

In [None]:
texts.head(3)

## Lemmatization and tokenization

In [None]:
def token_and_lemma(doc):
    return [token.lemma_.lower() for token in nlp(doc)]

In [None]:
with mp.Pool(mp.cpu_count()) as pool:
    mp_results = pool.map(
        func = token_and_lemma,
        iterable= texts)
texts = pd.Series(data = mp_results,index= texts.index,name = text_col, copy = True)

In [None]:
texts.head(3)

## Stop Word Removal

### Defining words to remove and save

In [None]:
words = list(word_counts_DF(texts).index)

In [None]:
custom_stopwords = [
    "-pron-", # spaCy v2 lemmatizes pronouns to this (v3 no longer does). Not very meaningful and very common
    "firstname","lastname", # Removed for same reason as pron
    "na" # I think pandas imports this as missing. Not useful anyways.
    ]


In [None]:
custom_savewords = [
    "enough" # Not sure why this is a stop word, I think sentiment words are still valuable for LDA
    ,"show" # This feels meaningful, as in a writing class they could talk about showing versus describing
    ,"please" # Seems like a strongly charged sentiment word
    ,"alone" # Can matter, especially for people talking about organization of group work
    ,"due" # Another word that is more relevant in school context than usual
    ,"see" # Meaningful and common word
    ,"serious" # Sentiment word, reasonably common
]

In [None]:
other_savewords_to_consider = ["why"]

### Creating the stopword list

In [None]:
punctuation = [mark for mark in string.punctuation] # Spacy puts each piece of punctuation as its own token
spacy_stopwords = list(nlp.Defaults.stop_words)
# nltk_stopwords = list(nltk.corpus.stopwords.words('english')) #Don't have nltk on Azure and I don't wanna get it.
stopwords = set(
    custom_stopwords+
    spacy_stopwords+
    punctuation
    ).intersection(words)
stopwords = stopwords - set(custom_savewords)
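
The set logic above can be checked on a tiny made-up vocabulary. All of the `toy_` lists below are stand-ins for illustration; in particular `toy_base_stopwords` stands in for the spaCy list.

```python
toy_vocabulary = ["the", "enough", "class", ".", "na", "teacher"]
toy_base_stopwords = ["the", "enough", "a"]  # Stand-in for the spaCy stop word list
toy_custom_stopwords = ["na"]
toy_punctuation = ["."]
toy_savewords = ["enough"]

# Union of all candidate stopwords, restricted to words that actually occur
toy_stopwords = set(
    toy_custom_stopwords + toy_base_stopwords + toy_punctuation
).intersection(toy_vocabulary)

# Savewords are rescued even if the base list marks them as stopwords
toy_stopwords -= set(toy_savewords)
print(sorted(toy_stopwords))  # ['.', 'na', 'the']
```

Note that "a" drops out because it never occurs in the vocabulary, and "enough" drops out because it is a saveword.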

### Looking at words to be removed via stopwords

In [None]:
words_to_remove = word_counts_DF(texts).loc[list(stopwords)].sort_values(by = "count", ascending = False) # .loc does not accept a set on recent pandas
display_html(f"<b>Words to remove via stopword removal: {len(words_to_remove)}</b>",raw=True)
@interact(n = [20,50,100,500,2000,10000])
def look_at_stopwords(n = 50):
    display_head_wide(df= words_to_remove,num = n,cols = 8)

### Actual Removal

In [None]:
texts = texts.apply(
    lambda text: [token for token in text if token not in stopwords]
)

## Regex Removal

### Build list of new tokens to remove

In [None]:
stop_regexes = [
    r"\s+", # Remove all tokens consisting of only whitespace characters
    r".{1}", # Remove all tokens that are one in length. Some important 2 length words are "ge" and "ok"
    r"\.+", # Tokens that are all periods
]

In [None]:
words = word_counts_DF(texts).index
regex_stopwords = []
for stop_regex in stop_regexes:
    new_stopwords = words[words.str.fullmatch(stop_regex)].tolist()
    regex_stopwords+= new_stopwords
regex_stopwords = list(set(regex_stopwords))
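
Since the patterns are applied with a full match, a token is only removed when the whole token fits the pattern. A sketch with the standard `re` module on made-up tokens (the `toy_` names are only for this example):

```python
import re

toy_stop_regexes = [r"\s+", r".{1}", r"\.+"]
toy_words = ["ok", "a", "...", "  ", "google.com", "7"]

# Keep a word as a regex stopword only if some pattern matches the entire token
toy_regex_stopwords = sorted(
    word for word in toy_words
    if any(re.fullmatch(stop_regex, word) for stop_regex in toy_stop_regexes)
)
print(toy_regex_stopwords)  # ['  ', '...', '7', 'a']
```

"google.com" survives even though it contains a period, because the period pattern has to match the whole token.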

### Check what we are getting rid of with stop regex

In [None]:
regex_words_to_remove = word_counts_DF(texts).loc[regex_stopwords].sort_values(by = "count",ascending=False)
display_html(f"<b>Words to remove via regex removal: {len(regex_words_to_remove)}</b>",raw = True)
@interact(num = [20,50,200,1000,10000])
def display_regex_to_remove(num = 50):
    display_head_wide(
        regex_words_to_remove,
        num = num,
        cols = 8
    )

### Actual Removal

In [None]:
texts = texts.apply(
    lambda text: [token for token in text if token not in regex_stopwords]
)

# Checking what words we have

In [None]:
sorted_counts = word_counts_DF(texts).sort_values(by = "count", ascending=False)
@interact(n = [2,10,20,50,100,500,2000])
def look_words_by_freq(n = 20):
    display_html(f"Total Word Count: <b>{len(sorted_counts)}</b>", raw=True)
    display_head_wide(df = sorted_counts, num = n, cols = 8) # Display most common words
    display_head_wide(df = sorted_counts.iloc[::-1], num = n, cols = 8) # Display least common words

# Checking texts again

In [None]:
@interact(n = [3,20,100,500], get_new_texts = False,show_unprocessed_text = False)
def look_at_final_texts(n = 3, show_unprocessed_text = False,get_new_texts=False):
    # get_new_texts is not used in the body; toggling it just reruns the function to draw a new sample
    sample_texts = texts.sample(min(n, len(texts))) # Cap n so a small dataset doesn't raise
    for doc_id in sample_texts.index:
        if show_unprocessed_text:
            display_html("<b>" + raw_data.loc[doc_id, raw_text_col] + "</b>",raw=True)
        display_html("• " + " ".join(sample_texts[doc_id]),raw=True)

In [None]:
@interact_manual(word = "", max_num = [20,100,1000])
def word_search(word,max_num):
    match_texts = texts[texts.apply(
        lambda text: any(word in token for token in text)
    )]
    if len(match_texts) == 0:
        display("No matches for:",word)
    if len(match_texts) > max_num:
        match_texts = match_texts[:max_num]
    for text in match_texts:
        display_html("<br>• "+" ".join(text),raw=True)
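
Note that the search matches substrings of tokens, not whole tokens. A small sketch on made-up documents (`search_tokens`, `toy_docs` and the ids are hypothetical):

```python
def search_tokens(docs, fragment):
    """Return the documents in which at least one token contains the fragment."""
    return {
        doc_id: tokens
        for doc_id, tokens in docs.items()
        if any(fragment in token for token in tokens)
    }

toy_docs = {
    "c1": ["timetable", "confusing"],
    "c2": ["great", "class"],
    "c3": ["table", "group"],
}
print(sorted(search_tokens(toy_docs, "table")))  # ['c1', 'c3']
```

An empty search string matches every non-empty document, since the empty string is contained in any token.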

# Exports

In [None]:
joined_texts = texts.apply(" ".join)

In [None]:
pp_data = raw_data.copy()
pp_data[text_col] = joined_texts


In [None]:
pp_data.tail(3)

In [None]:
export_path = "pp-SES_SET_July_31.csv"
#pp_data.to_csv(export_path)