# Function of Notebook

This notebook attempts to identify words that are unique and relevant to a subset of a textual dataset using term frequency-inverse document frequency (tf-idf). The idea is to create an easy-to-interpret view of what people are talking about in a particular subset of the given text responses. We use this tool both as a sanity check for more advanced topic modeling and as a safe way to get quick initial results.

The correct interpretation of results is:
- The words shown are non-generic words commonly used in the subset of interest, but not commonly used in the entire textual dataset
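
To make that interpretation concrete, here is a minimal sketch on a toy corpus. It assumes one common tf-idf variant (raw term count times log2 of N over document frequency); the corpus and the helper name `toy_tfidf` are hypothetical and are not part of this notebook, and the library used below may weight or normalize scores differently.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is a list of preprocessed tokens.
docs = [
    ["course", "feedback", "helpful"],
    ["course", "lecture", "long"],
    ["course", "feedback", "late"],
]

def toy_tfidf(doc, all_docs):
    """Score each word in `doc` as term count * log2(N / document frequency)."""
    n_docs = len(all_docs)
    counts = Counter(doc)
    scores = {}
    for word, count in counts.items():
        doc_freq = sum(1 for d in all_docs if word in d)
        scores[word] = count * math.log2(n_docs / doc_freq)
    return scores

scores = toy_tfidf(docs[0], docs)
# "course" appears in every document, so it scores 0;
# "helpful" appears only in this document, so it scores highest.
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

A word used everywhere carries no information about any one document, which is why generic words drop out of the results.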

## How to use

It is highly recommended to run the preprocessing notebook on your text before using this notebook, because tf-idf assumes a very naive model of language use.

The user will need to edit the information in the data section to identify where the data lives on their computer and which column holds the text. Additionally, a column that describes the subset of the data to look at must be specified in the "Looking at results" section. For example, with a gender column, one might set `filter_col` to 'gender' and `filter_vals` to ['F'] in order to look at interesting words used commonly by women but not by men.
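
As a minimal sketch of that subset selection, the example below uses the `filter_col` and `filter_vals` names that the results section defines. The dataframe and its "gender" column are made up for illustration and do not come from the survey data.

```python
import pandas as pd

# Hypothetical responses with a made-up "gender" column.
toy = pd.DataFrame(
    {
        "Preprocessed answer": ["lab group helpful", "lecture fast", "office hour helpful"],
        "gender": ["F", "M", "F"],
    },
    index=["c1", "c2", "c3"],
)

filter_col = "gender"   # column that describes the subset
filter_vals = ["F"]     # values of that column to keep

# Boolean mask: True for rows that belong to the subset of interest.
in_subset = toy[filter_col].isin(filter_vals)

subset = toy[in_subset]       # responses of interest
complement = toy[~in_subset]  # everything else, used for comparison
print(subset.index.tolist(), complement.index.tolist())
```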

## Quick Links

[Click here to jump to data imports](#libraries)

[Click here to jump to Results and Subset selection](#looking-at-results)

# Imports

## Libraries

In [1]:
import pandas as pd
import numpy as np

In [2]:
from scipy import sparse
from itertools import chain

In [3]:
from IPython.display import display, display_html
from ipywidgets import interact

In [4]:
from gensim.models import TfidfModel
from gensim.corpora import Dictionary

## Data

In [5]:
index_col = "unique_comment_ID"
text_col = "Preprocessed answer"
data_path = "/home/azureuser/cloudfiles/code/Data/pp-20210830_SES_and_SET.csv"

In [6]:
raw_data = pd.read_csv(data_path)
raw_data.set_index(index_col, inplace=True)

In [7]:
raw_data.dropna(inplace=True)

In [8]:
#raw_data = raw_data.query("survey == 'SES'")

In [9]:
print("Number of textual responses:\t",len(raw_data))
raw_data.head()

Number of textual responses:	 296125


unique_comment_ID,answer,Preprocessed answer,survey,question_ID,question_text,question_category
5228445_15769_1_X840307,"On the first assignment, detailed feedback was...",assignment detailed feedback believe allow imp...,SES,X840307,What specific change in the clarity of instruc...,text_improvement
5228445_15769_1_X840321,Opportunities provided in this course to activ...,opportunity provide course actively engage dif...,SES,X840321,What specifically about the use of active lear...,text_beneficial
5228445_15769_1_X840298,I get frequent migraines and am trying to keep...,frequent migraine try plate air commute salem ...,SES,X840298,Why did you attend class 75-90% of the time?,attendance
5228445_15769_2_X840319,like the breadth and variety of topics relevan...,like breadth variety topic relevant provide co...,SES,X840319,What specifically about the quality of course ...,text_beneficial
5228445_15769_2_X840296,Have more class time for learning about UO lib...,class time learn uo library link directly fina...,SES,X840296,What else would you like to say about your lea...,open_final


### Add metadata

Metadata in this context is a secondary dataset that is joined to the primary textual data on the index. This is useful if the text is kept separate from other identifying characteristics.

In [10]:
# metadata_path = "/home/azureuser/cloudfiles/code/Data/pp-20210625_SES_and_SET_comments.csv"
# metadata = pd.read_csv(
#     metadata_path,
#     usecols= [index_col,"question_ID","survey","question_text"]
#     )
# metadata.set_index(index_col, inplace=True)

In [11]:
# raw_data = raw_data.join(metadata)

### Tokenize Documents

In [12]:
texts = raw_data[[text_col]].applymap(str.split)
texts.head(5)

unique_comment_ID,Preprocessed answer
5228445_15769_1_X840307,"[assignment, detailed, feedback, believe, allo..."
5228445_15769_1_X840321,"[opportunity, provide, course, actively, engag..."
5228445_15769_1_X840298,"[frequent, migraine, try, plate, air, commute,..."
5228445_15769_2_X840319,"[like, breadth, variety, topic, relevant, prov..."
5228445_15769_2_X840296,"[class, time, learn, uo, library, link, direct..."


### Generate Dictionary

In [13]:
dictionary = Dictionary(texts[text_col])

### Convert Tokenized text to Tokenized IDs

In [14]:
corpus = texts.applymap(dictionary.doc2bow)

In [15]:
corpus.head(5)

unique_comment_ID,Preprocessed answer
5228445_15769_1_X840307,"[(0, 1), (1, 3), (2, 1), (3, 1), (4, 1), (5, 2..."
5228445_15769_1_X840321,"[(11, 1), (12, 1), (13, 1), (14, 1), (15, 1), ..."
5228445_15769_1_X840298,"[(8, 1), (15, 1), (16, 1), (20, 1), (34, 1), (..."
5228445_15769_2_X840319,"[(29, 1), (51, 1), (52, 1), (53, 1), (54, 1), ..."
5228445_15769_2_X840296,"[(58, 1), (60, 1), (61, 1), (62, 1), (63, 1), ..."
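
For readers unfamiliar with the bag-of-words format shown above, the following plain-Python sketch mimics the idea: assign each distinct token an integer id, then represent a document as sorted (token id, count) pairs. It is an illustration only; the helper names `build_token_ids` and `to_bow` are hypothetical, and the ids gensim assigns may differ.

```python
from collections import Counter

def build_token_ids(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token_ids = {}
    for doc in tokenized_docs:
        for token in doc:
            token_ids.setdefault(token, len(token_ids))
    return token_ids

def to_bow(doc, token_ids):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(token_ids[token] for token in doc)
    return sorted(counts.items())

toy_docs = [["feedback", "assignment", "feedback"], ["assignment", "turn"]]
token_ids = build_token_ids(toy_docs)
toy_corpus = [to_bow(doc, token_ids) for doc in toy_docs]
print(toy_corpus)
```

Word order is discarded in this representation; only which words occur, and how often, is kept.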


# Helper Functions

In [34]:
def display_side_by_side(*args):
    html_str=''
    for df in args:
        html_str+=df.to_html() + ("\xa0" * 5) # Spaces
    display_html(html_str.replace('<table','<table style="display:inline"'),raw=True)

In [17]:
def display_head_wide(df,num = 40,cols = 5):
    num = min(num,len(df)) # Just in case num is specified to be larger than the number of entries in df
    per_col = int(np.ceil(num/cols)) # Figure out how many to show per column
    display_side_by_side(*[df.iloc[x: x + per_col] for x in range(0,num,per_col)]) # Display the columns. *[] used to partition the dataframe
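
The column-splitting arithmetic in `display_head_wide` can be checked without a dataframe. This sketch applies the same `ceil(num / cols)` chunking to a plain list, with two small additions that are assumptions of the sketch rather than part of the helper above: it caps the last chunk at `num` and returns an empty list for empty input. `split_into_columns` is a hypothetical name used only here.

```python
import math

def split_into_columns(items, num=40, cols=5):
    """Split the first `num` items into at most `cols` consecutive chunks."""
    num = min(num, len(items))       # never ask for more than exists
    if num == 0:
        return []                    # avoid a zero step in range()
    per_col = math.ceil(num / cols)  # rows shown in each column
    return [
        items[start:min(start + per_col, num)]  # cap the last chunk at num
        for start in range(0, num, per_col)
    ]

chunks = split_into_columns(list(range(12)), num=12, cols=5)
print(chunks)
```

Note that rounding up the rows per column can produce fewer than `cols` columns: 12 items with `cols=5` gives 4 columns of 3.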

# Calculate TF-IDF Scores

## Define and test function to map

### Initialize tf-idf model

In [18]:
# Convert the corpus to a single column and then to a plain list before passing it in.
# The TfidfModel code contains a check of the form `elif corpus`, and pandas raises an
# error when a Series is evaluated as a boolean instead of returning True.
tfidf = TfidfModel(corpus = corpus[text_col].tolist(),id2word = dictionary)

In [19]:
transformed_corpus = tfidf[corpus[text_col]]

### Create a sparse dataframe where each row is a document, and each column is a word, and each entry is a tf-idf score

The sparse matrix implementation is necessary if we want to keep the entirety of the tf-idf calculations in memory and have it easily accessible for computation. There is room for a better implementation in the future.
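
As a rough illustration of why a sparse layout helps, the sketch below stores only the non-zero scores as (row, column, value) triples, the same coordinate layout the following cells build, and compares that to the number of cells a dense table would need. The toy scores and the vocabulary size are made up for the example.

```python
# Hypothetical per-document tf-idf output: lists of (word_id, score) pairs.
toy_transformed = [
    [(0, 0.8), (3, 0.6)],
    [],                      # a document with no scored words
    [(1, 1.0)],
]
vocab_size = 5  # assumed vocabulary size for this toy example

# Flatten to (doc_idx, word_id, score) triples: one entry per non-zero cell.
triples = [
    (doc_idx, word_id, score)
    for doc_idx, doc in enumerate(toy_transformed)
    for word_id, score in doc
]

dense_cells = len(toy_transformed) * vocab_size
print(f"stored entries: {len(triples)} of {dense_cells} dense cells")
```

Each response uses only a handful of the words in the vocabulary, so nearly every cell of the dense table would be zero.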

#### Calculate the tf-idf scores and format them

In [20]:
# Create a dataframe where each word in each doc has a row giving its doc number, word id, and tf-idf score
result_corpus = []
for doc_idx, doc in enumerate(transformed_corpus): # doc_idx is the row index used to specify the sparse matrix
    new_doc = [(doc_idx,) + tup for tup in doc] # Prepend the doc idx to each (word id, score) tuple given by the tfidf model
    result_corpus.append(new_doc) # Add it to the correctly formatted corpus
tfidf_indices = pd.DataFrame(chain.from_iterable(result_corpus)) # Flatten the list so each word has an entry, as opposed to each document

#### Put the formatted scores into a sparse matrix

In [21]:
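[i,j,data] = tfidf_indices.T.to_numpy() # Split into row indices, column indices, and scores for the sparse matrix constructor
i = i.astype(int) # The transpose upcasts the mixed int/float columns to float, so convert rows back to int
j = j.astype(int) # Convert cols back to int
tfidf_sparse = sparse.coo_matrix((data,(i,j)),shape = (len(corpus),len(dictionary))) # Explicit shape keeps trailing documents or word ids that have no scores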

#### Convert sparse matrix to dataframe and name indices

In [22]:
tfidf_scores = pd.DataFrame.sparse.from_spmatrix(tfidf_sparse)
tfidf_scores.index = corpus.index
tfidf_scores.rename(columns = dictionary,inplace = True)

#### Examine the created dataframe

In [23]:
tfidf_scores.head(4)

unique_comment_ID,allow,assignment,believe,beneficial,detailed,feedback,improve,performance,quality,subsequent,...,explanitory,douglas,historical/,musing,colorado,tipping,warned,clarife,medley,jibe
5228445_15769_1_X840307,0.226158,0.370186,0.269053,0.192283,0.294254,0.332674,0.2121,0.3619,0.273516,0.483761,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5228445_15769_1_X840321,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5228445_15769_1_X840298,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.171679,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5228445_15769_2_X840319,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### Sanity check that the indices have been reassigned correctly

In [24]:
display(tfidf_scores.iloc[30][tfidf_scores.iloc[30]>0]) # Tf-IDF words and score
display(texts.iloc[30]) # Original text

assignment    0.323452
feedback      0.436013
professor     0.410875
turn          0.732429
Name: 6663782_15793_11_X840304, dtype: Sparse[float64, 0]

Preprocessed answer    [feedback, professor, assignment, turn]
Name: 6663782_15793_11_X840304, dtype: object

# Looking at results

### Look at question codes and question categories

In [25]:
filter_col = "question_ID"
verbose_filter_col = "question_text"

In [26]:
raw_data[[filter_col,verbose_filter_col]].drop_duplicates().style.hide_index() # hide_index() was removed in pandas 2.0; on newer versions use .style.hide(axis="index")

question_ID,question_text
X840307,What specific change in the clarity of instructions would help your learning?
X840321,What specifically about the use of active learning helped your learning?
X840298,Why did you attend class 75-90% of the time?
X840319,What specifically about the quality of course materials helped your learning?
X840296,What else would you like to say about your learning experience in the course? Please avoid personal comments about the instructor.
X840297,Why did you attend class 90-100% of the time?
X840302,What specific change in the inclusiveness of the course would help your learning?
X840304,What specific change in the feedback would help your learning?
X840316,What specifically about the support from the instructor helped your learning?
X840312,What specific change in the relevance of the course content would help your learning?


### Define the subset I am interested in

In [27]:
#filter_vals = ['X840315', 'X840302'] # Inclusiveness Questions
#filter_vals = ['X840314', 'X840327'] # Accessibility Questions
filter_vals = ['X840303','X840316'] # Support Questions

### Fast Viewer which removes words popular over the whole corpus including those from the desired category

In [35]:
filtered_scores1 = tfidf_scores[raw_data[filter_col].isin(filter_vals)]
top_words_filtered1 = pd.DataFrame(filtered_scores1.mean().sort_values(ascending = False), columns = ["tf-idf"])
top_words_all1 = pd.DataFrame(tfidf_scores.mean().sort_values(ascending = False),columns=["tf-idf"])
@interact(num_words = [5,20,40,60,100,1000,10000],remove_n = [0,5,20,50,100,200,1000])
def disp_top_words_filtered(num_words = 40, remove_n = 100, display_as_str = False):
    words_to_drop = top_words_all1.index.to_list()[:remove_n] # Select which words to remove
    display_html(f"<b>Top words in average tf-idf (not in the top {remove_n} for all data), where {filter_col} has values in {filter_vals}</b>",raw = True)
    top_words = top_words_filtered1.drop(index = words_to_drop)
    display_head_wide(top_words,num=num_words, cols= 5)
    if display_as_str:
        display_html(" ".join(top_words.index[:num_words]),raw = True)


### Viewer which only removes words that are popular in responses not in the desired category (slightly better but much slower)
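
The distinction between the two viewers can be sketched in plain Python. Words are ranked by mean score within the subset, and the words to hide are taken from the top of a reference ranking: the whole corpus for the fast viewer, or only the documents outside the subset for this one. All scores and names here (`mean_scores`, `top_words`) are hypothetical.

```python
from collections import defaultdict

def mean_scores(docs):
    """Mean score per word over `docs`, counting absent words as 0."""
    totals = defaultdict(float)
    for doc in docs:
        for word, score in doc.items():
            totals[word] += score
    return {word: total / len(docs) for word, total in totals.items()}

def top_words(subset, reference, remove_n):
    """Rank subset words by mean score, hiding the top `remove_n` reference words."""
    ref_means = mean_scores(reference)
    hidden = set(sorted(ref_means, key=ref_means.get, reverse=True)[:remove_n])
    sub_means = mean_scores(subset)
    ranked = sorted(sub_means, key=sub_means.get, reverse=True)
    return [word for word in ranked if word not in hidden]

# Toy scores: "mentor" is distinctive for the subset but absent elsewhere.
subset = [{"mentor": 0.9, "course": 0.2}, {"mentor": 0.8, "office": 0.4}]
others = [{"course": 0.5, "exam": 0.3}, {"course": 0.4, "exam": 0.2}]

print(top_words(subset, subset + others, remove_n=1))  # whole-corpus reference
print(top_words(subset, others, remove_n=1))           # complement reference
```

In this toy case the whole-corpus reference hides "mentor", the subset's most distinctive word, because the subset itself pushes it to the top of the overall ranking. The complement reference hides the generic "course" instead.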

In [36]:
filtered_scores = tfidf_scores[raw_data[filter_col].isin(filter_vals)] # Select documents from the subset of interest
complement_scores = tfidf_scores[~raw_data[filter_col].isin(filter_vals)] # All other documents from the corpus. Getting parts of a sparse matrix efficiently is weird.
top_words_filtered = pd.DataFrame(filtered_scores.mean().sort_values(ascending = False), columns = ["tf-idf"])
top_words_complement = pd.DataFrame(complement_scores.mean().sort_values(ascending = False),columns=["tf-idf"])
@interact(num_words = [5,20,40,60,100,1000,10000],remove_n = [0,5,20,50,100,200,1000])
def disp_top_words_filtered(num_words = 40, remove_n = 100, display_as_str = False):
    words_to_drop = top_words_complement.index.to_list()[:remove_n] # Select which words to remove
    display_html(f"<b>Top words in average tf-idf (not in the top {remove_n} for all data in the complement), where {filter_col} has values in {filter_vals}</b>",raw = True)
    top_words = top_words_filtered.drop(index = words_to_drop)
    display_head_wide(top_words,num=num_words, cols= 5)
    if display_as_str:
        display_html(", ".join(top_words.index[:num_words]),raw = True)
