## Anagram Word Cloud Generation for Qualitative Code Themes

The following script generates anagram word clouds, frequency word clouds, and comparative Venn diagrams from qualitative code text data. It begins by importing all of the packages needed to create these data visualizations.

In [None]:
import pandas as pd
import numpy as np
import NLPmod
import re
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import words
from nltk.corpus import stopwords
from nltk import pos_tag
import heapq
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.inspection import permutation_importance 
from sklearn import metrics
import collections
import six
import sys
import sklearn.neighbors._base
from scipy.stats import rankdata
try:
    from sklearn.utils import safe_indexing
except ImportError:
    from sklearn.utils import _safe_indexing
    sys.modules['sklearn.utils.safe_indexing'] = sklearn.utils._safe_indexing
sys.modules['sklearn.externals.six'] = six
sys.modules['sklearn.neighbors.base'] = sklearn.neighbors._base
import matplotlib.pyplot as plt
from wordcloud import WordCloud

### Loading Data & Tokenization

This preliminary block starts by loading the labeled text as a pandas DataFrame; this is also the block where you choose the column label for your anagram cloud. As a reminder, here are the options available for each of the test files:
- `pos_exp.csv`: achievement, affection, bonding, enjoy_the_moment, exercise, leisure, nature.
- `title_topic.csv`: Computer Science, Mathematics, Physics, Statistics.

With the selection made, the block tokenizes the text data. The text is cleaned using a general process, then broken into lemmatized tokens with unnecessary stop words filtered out. When `ngram` is set to `True`, bigrams and trigrams are added on top of the unigrams. Tokens that occur more than `minOcc` times across all responses are labeled as 'common'. The common tokens then define our vector space, so that each comment can be read as a TF-IDF vector.

In [None]:
file = 'pos_exp.csv' # select either pos_exp or title_topic (or your own data set!)
theme = 'achievement' # name of column that is used to make the anagram cloud

minOcc = 3 # what is the largest number of times a token can appear while still being excluded from consideration?
ngram = False # should we include bigrams and trigrams on top of our unigrams?

lemmatizer = WordNetLemmatizer() # initialize the lemmatizer as a variable.
stopset = set(stopwords.words('english')) # set up stopwords set.

lab_text_df = pd.read_csv(file)
lab_text_df["token"] = lab_text_df.text.apply(NLPmod.tokenizerPOS, args=(lemmatizer,stopset))
if ngram:
    lab_text_df["token"] = lab_text_df.token.apply(NLPmod.ngramsappend, args=([2,3],False))
fbank = NLPmod.wordfreq(lab_text_df.token) # generate bank of words with associated frequencies
comBank, rarBank = NLPmod.repbank(fbank, minOcc) # generate two words sets based on which words are recurring
wordHeap = [token for comment in lab_text_df.token for token in comment] # generate ordered multiset of words
stackHeap = collections.Counter(wordHeap) # condense down into a collection
vocabComm = {term: ind for ind, term in enumerate(comBank)}

tfidf = TfidfVectorizer(
    analyzer='word',
    tokenizer= lambda val: val,
    preprocessor= lambda val: val,
    token_pattern=None,
    vocabulary = vocabComm
)
transformed = tfidf.fit_transform(lab_text_df.token)
lab_text_df['tfidfVec'] = [row.tolist()[0] for row in transformed.todense()]
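
Because `NLPmod` is a project-local module whose source is not shown here, the following is only a sketch of what the frequency-bank step might look like. It assumes that `wordfreq` counts tokens and that `repbank` splits them at `minOcc`; the helper name `split_banks` and the toy comments are hypothetical. It then feeds the pre-tokenized comments into a `TfidfVectorizer` restricted to the common tokens, in the same way as the block above.

```python
import collections
from sklearn.feature_extraction.text import TfidfVectorizer

def split_banks(token_lists, min_occ):
    """Hypothetical stand-in for NLPmod.wordfreq followed by NLPmod.repbank."""
    counts = collections.Counter(tok for toks in token_lists for tok in toks)
    common = sorted(tok for tok, n in counts.items() if n > min_occ)
    rare = sorted(tok for tok, n in counts.items() if n <= min_occ)
    return common, rare

# Toy, already-tokenized comments (illustrative data only)
toy_tokens = [["run", "park", "friend"], ["run", "friend"], ["run", "exam"]]
common, rare = split_banks(toy_tokens, min_occ=1)

# Restrict the vector space to the common tokens
vocab = {term: ind for ind, term in enumerate(common)}
toy_tfidf = TfidfVectorizer(
    analyzer="word",
    tokenizer=lambda val: val,      # comments are already token lists
    preprocessor=lambda val: val,   # no further text cleaning
    token_pattern=None,
    vocabulary=vocab,
)
toy_matrix = toy_tfidf.fit_transform(toy_tokens)
print(common, rare, toy_matrix.shape)
```

Rare tokens are simply dropped from the vectors, so a comment made up only of rare tokens would become an all-zero row.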

### Parameter Testing

With the TF-IDF vectors, we will do a grid-search in order to find the optimal parameters. If you already know what parameters you want to use, but wish to run this block anyway as part of running all blocks: swap out the `grid` dictionary with the `skip_grid` dictionary.

In [None]:
grid = { # grid of parameter choices
    'learning_rate': np.arange(0.05,1.0,0.2),
    'n_estimators': np.arange(40,400,80),
    'subsample' : [0.25,0.5,0.75,1.0],
    'max_depth' : [3,4,5]
}
skip_grid = {'learning_rate' : [0.5]}


Xdata = np.array(list(lab_text_df.tfidfVec))
Xtarg = np.array(list(lab_text_df[theme]))
clf = GradientBoostingClassifier() 
clf_grid = GridSearchCV(clf, grid, verbose = 1) # replace `grid` with `skip_grid` here to skip the full search
clf_grid.fit(Xdata,Xtarg)

### Optimal Parameters for the Model

After doing the cross-validated grid search, we record the optimal parameters so that they may be called upon at a later time without having to run another grid search for hyper-parameters. Use this block to store the optimal parameters that were found from the grid search of the TF-IDF vectorized data points within the Gradient Boosting Machine model. Alter these as necessary based on whatever results you may obtain.

In [None]:
print(f"Optimal parameters for label '{theme}'")
print(f"The best accuracy score across ALL searched params: {clf_grid.best_score_}")
print(f"The best parameters across ALL searched params: {clf_grid.best_params_}")

LR = 0.05
nEst = 40
subSamp = 0.50
depth = 4
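
As an alternative to copying the values by hand, the tuned settings could be read straight from the fitted search object. The sketch below is one possible approach on synthetic data, not part of the original workflow; the `_toy` names and the small grid are illustrative assumptions. Reading from `best_estimator_.get_params()` also covers parameters that were left out of the grid, as happens with `skip_grid`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the TF-IDF vectors and theme labels
X_toy, y_toy = make_classification(n_samples=120, n_features=6, random_state=0)

toy_grid = {"learning_rate": [0.05, 0.5], "max_depth": [3, 4]}
toy_search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=20, random_state=0),
    toy_grid,
    cv=3,
)
toy_search.fit(X_toy, y_toy)

# The refitted best estimator carries every parameter, searched or not
fitted = toy_search.best_estimator_.get_params()
LR_toy = fitted["learning_rate"]
nEst_toy = fitted["n_estimators"]
subSamp_toy = fitted["subsample"]
depth_toy = fitted["max_depth"]
print(LR_toy, nEst_toy, subSamp_toy, depth_toy)
```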

### Theme Investigation with Influential Words

It is now time to investigate how permuting a column affects the accuracy of our predictions. To this end, we create a number of models that all share the same parameter settings. For each model, we measure how shuffling the values of a single feature (one dimension of the TF-IDF vectors) across all comments changes the model's prediction accuracy. Note that the code below scores each model on the same data it was fit to rather than on a held-out split.

In [None]:
trials = range(50) # How many trials shall we do?
permImp = [] # list that holds the permutation importance score results.

print(f"Completing Trials for theme '{theme}': (", end = "") 
for trial in trials: # make a model for each trial
    print('|',end="")
    scra_clf = GradientBoostingClassifier( # set parameters for model
        learning_rate = LR, 
        n_estimators = nEst,
        subsample = subSamp,
        max_depth = depth
    )
    scra_clf.fit(Xdata,Xtarg)
    permImp.append(permutation_importance( # create dictionary entry for permutation importance score
        scra_clf,
        Xdata,
        Xtarg,
        n_repeats=5
    ))
print(')')
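
To make the idea concrete, here is a minimal sketch on synthetic data in which only the first column determines the label. The data and the `_toy` names are illustrative assumptions; the calls mirror those used in the block above.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)  # only column 0 determines the label

toy_clf = GradientBoostingClassifier(n_estimators=20, random_state=0)
toy_clf.fit(X_toy, y_toy)

# Shuffle each column in turn and record the average drop in accuracy
toy_result = permutation_importance(
    toy_clf, X_toy, y_toy, n_repeats=5, random_state=0
)
print(toy_result["importances_mean"])
```

Shuffling the informative column should cost the model far more accuracy than shuffling either noise column, which is the signal the anagram cloud later uses to size its tokens.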

### Permutation Importance and Frequency DataFrames

This block creates a DataFrame for the selected theme. We average the permutation importance scores across all trials. The averaged scores are then ranked so that a more discretized, sorted version of the importance scores can be visualized. Frequency data is stored alongside these importance scores and ranked using the same system.

In [None]:
load_dict = {f'perm_imp_{i}' : trial['importances_mean'] for i, trial in enumerate(permImp)} # one column of mean importance scores per trial
perm_freq_df = pd.DataFrame(load_dict) # one row per feature, one column per trial
perm_freq_df['perm_imp_tot'] = perm_freq_df.mean(axis = 1) # compute averaged importance score over each trial
perm_freq_df['perm_imp_rank'] = rankdata(perm_freq_df['perm_imp_tot'], method = 'min') - 1 # create ranked version of importance scores

perm_freq_df['freq'] = [stackHeap[word] for word in list(tfidf.get_feature_names_out())] # calculate frequency data for all captured features
perm_freq_df['freq_rank'] = rankdata(perm_freq_df['freq'], method = 'min') - 1 # calculate similar frequency ranks from frequency data

perm_freq_df['feature'] = tfidf.get_feature_names_out() # token labels, in the same order as the TF-IDF columns (and so the importance scores)
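
The ranking step can be illustrated with a few made-up scores. With `method='min'`, tied values share the lowest rank they would occupy, and subtracting 1 makes the ranks start at zero.

```python
import numpy as np
from scipy.stats import rankdata

# Made-up importance scores; two of them are tied
scores = np.array([0.10, 0.30, 0.30, 0.00])
ranks = rankdata(scores, method="min") - 1
print(ranks)
```

Here the smallest score receives rank 0 and both tied scores receive rank 2, so no feature is assigned rank 3.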

### Filtering for Theme-Adhering Comments

We now want to look only at comments where the theme is present, in order to analyze what these comments have in common. We flatten these comments into a single list (a multiset) of all the tokens they contain, then condense that list into per-token counts using `collections.Counter`.

In [None]:
lab_filt = lab_text_df.groupby(theme).get_group(1)
heap_filt = [token_item for token_list in lab_filt.token for token_item in token_list] # generate ordered multiset of subset
count_filt = collections.Counter(heap_filt) # condense down into a collection

### Theme-Specific Frequency Table Generation

Looking at just the comments that adhere to the theme, we can use the counter to create two columns of a DataFrame. This table covers only the words that appear more than `minOcc` times across the comments which adhere to the theme. We track those words, their frequencies, and their frequency ranks.

We then merge the sub-frequency table with the general frequency table. We do this in order to create frequency ratios: the frequency of a feature within the subset of comments against the general frequency of a feature across all comments. 

In [None]:
scraWords = [k for k,v in count_filt.items() if v > minOcc] # words that appear more than minOcc times within the theme
scraFreq = [v for k,v in count_filt.items() if v > minOcc] # frequency of those words, in the same order
sub_freq_df = pd.DataFrame({ # make a sub-frame
    'feature' : scraWords, # Create column of these words
    'sub_freq' : scraFreq # Create column for matching frequency of those words
})
sub_freq_df['sub_freq_rank'] = rankdata(sub_freq_df['sub_freq'], method = 'min') - 1

df = perm_freq_df.merge(sub_freq_df, how = 'left', on = 'feature').copy() # Merge tables under feature
df.fillna(0, inplace = True) # fill in all Nan values with 0
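
The block above merges the two tables but does not itself store a ratio column. A minimal sketch of how such a ratio could be computed is given below; the toy tables and the `freq_ratio` column name are assumptions made for illustration.

```python
import pandas as pd

# Toy general and theme-specific frequency tables
toy_general = pd.DataFrame({"feature": ["run", "friend", "exam"], "freq": [10, 8, 4]})
toy_subset = pd.DataFrame({"feature": ["run", "exam"], "sub_freq": [5, 4]})

# Left merge keeps every general feature; missing sub-frequencies become 0
toy_df = toy_general.merge(toy_subset, how="left", on="feature")
toy_df["sub_freq"] = toy_df["sub_freq"].fillna(0)

# Share of a feature's occurrences that fall inside the theme subset
toy_df["freq_ratio"] = toy_df["sub_freq"] / toy_df["freq"]
print(toy_df)
```

A ratio near 1 means the token appears almost exclusively in comments labeled with the theme, while a ratio of 0 means it never passed the sub-frequency cutoff there.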

### Prepping the Plot Lists

The next block produces five scatter plots that compare columns of data relating to permutation importance and frequency, and prints the correlation for each pair.
- The first plot compares the permutation importance of each feature against the frequency of that feature.
- The second plot does the same thing except that it uses the ranked frequency and ranked permutation importance rather than the raw values.
- The third plot depicts similar information except that the frequency data only counts appearances of a feature within text that is labeled as belonging to the theme. Doing this helps home in on the comparison between permutation importance and frequency.
- The fourth plot is to the third what the second is to the first: it shows ranked permutation importance against ranked sub-frequency.
- The fifth plot compares the general frequency of each feature against its frequency within the theme subset.

In [None]:
comps = [
    ('perm_imp_tot', 'freq', f'Permutation Importance vs. Frequency for Theme {theme}'),
    ('perm_imp_rank', 'freq_rank', f'Ranked Permutation Importance vs. Ranked Frequency for Theme {theme}'),
    ('perm_imp_tot', 'sub_freq', f'Permutation Importance vs. Filtered Frequency for Theme {theme}'),
    ('perm_imp_rank', 'sub_freq_rank', f'Ranked Permutation Importance vs. Ranked Filtered Frequency for Theme {theme}'),
    ('freq', 'sub_freq', 'General Frequency vs. Theme Subset Frequency')
]

for x, y, title_lab in comps:
    df.plot.scatter(
        x,
        y,
        figsize=(8,8), 
        title = title_lab
    )
    corr = df[x].corr(df[y])
    print(f'Raw Correlation Between {x} and {y} is {corr}')

### Venn Diagram Design

We now look at a Venn diagram of the words that are frequent for the theme and the words deemed to have high permutation importance for the theme. We take the `mostAmo` top-ranked features from each list; the diagram shows the intersection of these two sets along with the set differences on either side. The block also prints the two ranked lists side by side.

In [None]:
from matplotlib_venn import venn2 
mostAmo = 100

most_perm_imp = df.sort_values(by = 'perm_imp_tot', ascending = False, ignore_index = True)[:mostAmo].copy()
most_sub_freq = df.sort_values(by = 'sub_freq', ascending = False, ignore_index = True)[:mostAmo].copy()
venn2(
    [
        set(most_perm_imp['feature']),
        set(most_sub_freq['feature'])
    ], 
    (
        'Tokens of High Permutation Importance',
        'Tokens with High Sub-Frequency of Occurrence'
    )
)
for ind, perm_w, freq_w in zip(range(mostAmo), list(most_perm_imp['feature']), list(most_sub_freq['feature'])):
    print(f'{ind+1} | perm_imp: {perm_w}, sub_freq: {freq_w}')
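
The three regions of the diagram correspond to plain set operations. A small sketch with made-up tokens (not drawn from either test file) shows how the region contents could be listed explicitly:

```python
# Made-up top-token sets for illustration
top_perm = {"win", "goal", "finish", "proud"}
top_freq = {"goal", "proud", "day", "work"}

both = top_perm & top_freq    # overlap of the two circles
only_perm = top_perm - both   # important but not especially frequent
only_freq = top_freq - both   # frequent but not especially important
print(sorted(both), sorted(only_perm), sorted(only_freq))
```

In the notebook, the same operations could be applied to `set(most_perm_imp['feature'])` and `set(most_sub_freq['feature'])`.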

### Comparison of Anagram and Word Clouds

Having looked at these other visual comparisons, it is now time to make the anagram and word clouds from the labeled data. We make each cloud by creating a dictionary from a pair of columns in our filtered permutation importance and sub-frequency DataFrames. The first block builds the word cloud from the theme sub-frequencies and the second block builds the anagram cloud from the permutation importance scores.

In [None]:
word_feed = {row['feature'] : row['sub_freq'] for i, row in most_sub_freq.iterrows()}
word_cloud = WordCloud( # generate wordcloud
    width = 400, 
    height = 400,
    max_words = 50,
    background_color ='white',
    stopwords = stopset,
    min_font_size = 10
).generate_from_frequencies(word_feed)

plt.figure(figsize = (5, 5), facecolor = None)
plt.imshow(word_cloud)
plt.axis("off")
plt.tight_layout(pad = 0)
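import os
os.makedirs('word_cloud', exist_ok = True) # make sure the output folder exists before saving
plt.savefig('word_cloud/example.png')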

In [None]:
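most_perm_imp = most_perm_imp[most_perm_imp.perm_imp_tot > 0].copy() # keep positive scores only; the log scaling below is undefined otherwise
min_perm_imp = most_perm_imp.perm_imp_tot.min() # smallest remaining importance score for the theme
most_perm_imp['perm_imp_coef'] = most_perm_imp.perm_imp_tot.apply(lambda val: np.floor(np.log2(val//min_perm_imp))+1)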
anagram_feed = {row['feature'] : row['perm_imp_coef'] for i, row in most_perm_imp.iterrows()}
anagram_cloud = WordCloud( # generate anagram cloud
    width = 400, 
    height = 400,
    max_words = 50,
    background_color ='white',
    stopwords = stopset,
    min_font_size = 10).generate_from_frequencies(anagram_feed)

plt.figure(figsize = (5, 5), facecolor = None)
plt.imshow(anagram_cloud)
plt.axis("off")
plt.tight_layout(pad = 0)
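import os
os.makedirs('anagram_cloud', exist_ok = True) # make sure the output folder exists before saving
plt.savefig('anagram_cloud/example.png')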
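
The `perm_imp_coef` weighting groups importance scores into doubling bands relative to the smallest score: a token only moves up one size step when its score is at least twice the lower edge of its current band. The sketch below applies the same formula to made-up scores, chosen so that the floating-point floor division is exact.

```python
import numpy as np

# Made-up positive importance scores
toy_scores = np.array([0.25, 0.5, 1.0, 2.5])
toy_min = toy_scores.min()

# Same formula as perm_imp_coef: floor(log2(score // min)) + 1
toy_coef = np.floor(np.log2(toy_scores // toy_min)) + 1
print(toy_coef)
```

The score ten times the minimum lands in band 4, because 10 lies between 8 and 16. Note that the formula only makes sense for strictly positive scores.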

Thus ends the functionality of the anagram cloud. Feel free to modify this notebook so that multiple labels may be simultaneously compared beside one another!