## Qualitative Coding through Jupyter

This notebook lets the user qualitatively code a corpus of text documents. It is broken up into blocks that each perform a task on the corpus. The functionality currently provided is listed below:
- [Loading of data frame.](#loadBlock)
- [Adding authored tags to unseen documents in sequential order](#inputBlock)
- [Saving of data frame.](#saveBlock)
- [Filtering of data frame based on a provided list of tags.](#filtBlock)
- [Filtering of data frame based on whether tags are agreed upon between authors.](#diffBlock)
- [Listing of tags and stating their frequency of use across training set.](#viewTagBlock)
- [Editing individual documents where tags may be added and removed.](#editBlock)
- [Deleting of a tag across all documents.](#tagDelBlock)
- [Merging tags together into a single umbrella tag](#mergeBlock)
- [Saving a final personalized set of codes that is to be merged with others](#endSaveBlock)


To get started, run the first block in order to import the libraries needed for coding.

In [None]:
import pandas as pd
pd.options.mode.chained_assignment = None # block false positive warning
import numpy as np
import pprint
pp = pprint.PrettyPrinter(indent=5)

omit = ['text', 'completion', 'id', 'index', 'section'] # columns that are not tags.

### Function Library

**RUN THIS BLOCK**, as it contains the functions used throughout this notebook. Provided below is a short list of each function's purpose and its arguments.
1. `primeName` : translates integer values into a list of users who have tagged a document.
 - `val` : integer value describing users.
 - `users` : the list of all coders and their associated prime numbers.

2. `commPrint` : a function that prints out a document's text, its tags, and the authors of the tags in a somewhat aesthetic fashion.
 - `ind` : integer value that sorts documents within the same section.
 - `sec` : the section the document belongs to in the corpus.
 - `df` : the data frame.
 - `tagblind` : a boolean variable deciding whether existing tags are printed alongside a document. `True` hides existing tags and `False` reveals them.
 - `users` : the list of all coders and their associated prime numbers.


3. `commView` : a function that builds a row filter (a list of `True`/`False` values, one per row) based on a subset of provided tags.
 - `taglist` : a list of tags that are used as the criteria for filtration. 
 - `df` : the data frame.
 - `andOr` : an argument of critical importance. Setting it to `True` keeps only the documents that contain ___all___ of the tags that you listed. Setting it to `False` keeps the documents that contain ___any___ of the tags that you listed.


4. `commDiff` : a function that filters rows from the data frame, keeping the documents on which the authors have not agreed upon all of the tags.
 - `df` : the data frame.
 

5. `tagCount` : a function that counts the number of documents upon which a tag has been placed.
 - `tagCol` : a tag's column from the data frame. 


6. `tagSplitAdd` : a function that adds new tags to a document based on the contents of a raw string fed in as an argument. The addition of tags is tracked by a user's signature.
 - `ind` : integer value that sorts documents within the same section.
 - `sec` : the section the document belongs to in the corpus.
 - `df` : the data frame.
 - `tagstr` : a string list that contains all of the tags which are to be added to a document.
 - `user` : prime number that serves as a user signature.


7. `tagSplitDel` : a function that deletes tags currently assigned to a document subject to a raw string fed in as argument. Deletion is anonymized, so it is not possible to track who has deleted the tags of another user.
 - `ind` : integer value that sorts documents within the same section.
 - `sec` : the section the document belongs to in the corpus.
 - `df` : the data frame.
 - `tagstr` : a string list that contains all of the tags which are to be removed from a document.


8. `tagEdit` : a navigation function that prints a document with its current tags and then prompts for an edit operation: `a` for addition, `d` for deletion, or `b` for addition followed by deletion.
 - `ind` : integer value that sorts documents within the same section.
 - `sec` : the section the document belongs to in the corpus.
 - `df` : the data frame.
 - `user` : prime number that serves as a user signature.
 - `users` : the list of all coders and their associated prime numbers.
 
 
9. `tagWholeDel` : a function that deletes a selected tag from every document.
 - `tag` : the tag that is slated for deletion from the data frame.
 - `df` : the data frame.
  
 
10. `tagMerge` : a function that merges several tags together into a single umbrella tag. The authorship of each component tag will be inherited by the umbrella tag. If you wish to remove the component tags afterward, that may also be done automatically by this function.
 - `compTag` : a list of tags that are to be merged together.
 - `umbTag` : a label for the new tag that will serve as the amalgam of the component tags.
 - `df` : the data frame.
 - `clean` : boolean value. Setting it to `True` makes it delete all of the component tags after creating the new collective umbrella tag.

In [None]:
def primeName(val, users) : # function that translates numbers into list of users:
    taggers = '' # start with empty list
    for user_name, user_prime in users:
        if(val % user_prime == 0):
            taggers += f'{user_name}, '
    if(taggers.endswith(', ')) : # snip comma off end (assuming >0 names are present)
        taggers = taggers[:-2] # snip it off
    return taggers # return string list

def commPrint(ind, sec, df, tagblind, users) : # prints out a comment and its tags
    print(f" \n\t---- READING IN DOCUMENT {ind} FROM SECTION {sec} ---- \n")
    ent = df[(df['index'] == ind) & (df['section'] == sec)].index.tolist()[0]
    pp.pprint(df['text'][ent]) # pretty print the student comment
    print("\nCurrently associated tags:")
    if(not tagblind) : # are we allowed to look at the current tags?
        for tag in df.columns : # read each tag in the data frame.
            if(tag in omit) : # skip the non-tag columns
                continue
            if(df[tag][ent] > 1) : # has someone left this tag on this comment
                print(f'\t{tag} - {primeName(df[tag][ent], users)}') # print out the tag and its taggers

def commView(taglist, df, andOr): # function that returns certain comments based on the tags that are included in list
    boolMat = [ # list comp within list comp
        [
            (df[tag][ind]>1) for tag in taglist # look at each tag # see if appropriate tag has been left by anyone
        ] for ind, row in df.iterrows() # list comprehension over no. of rows
        
    ]
    if(andOr) : # and?
        filt = [all(row) for row in boolMat] # see if tag appears in all columns
    else : # or?
        filt = [any(row) for row in boolMat] # see if tag appears in any column
    return filt # return the boolean mask (one True/False entry per row).

def commDiff(df): # function that returns certain comments where there is discrepancy on tag assignment between authors
    boolMat = [ # list comp within list comp
        [
            (df[tag][ind] < df['completion'][ind] and df[tag][ind] != 1) for tag in df.columns if tag not in omit  # look at each tag # to see if unacceptable
        ] for ind, row in df.iterrows() # list comprehension over no. of rows
        
    ]
    filt = [any(row) for row in boolMat] # flag rows where any tag is signed by only some of the completing users
    return df.loc[filt] # return the filtered data frame.

def tagCount(tagCol) : # counts the non-one entries for a given tag
    count = tagCol.apply(lambda val: int(val > 1))
    return count.sum() # return summed value
            
def tagSplitAdd(ind, sec, df, tagstr, user) : # turns raw string into list of tags that are added to data frame.
    if(not tagstr) : # quit right away if blank string is entered.
        print("  skipping tag addition operation... ")
        return df
    taglist = tagstr.split(', ') # split string into list based on comma placement
    ent = df[(df['index'] == ind) & (df['section'] == sec)].index.tolist()[0]
    for tag in taglist : # look at each tag in the list
        if(tag not in df) : # if this tag is not present in the current dataframe
            df[tag] = 1 # create new unfilled column
        if(df.loc[ent, tag] % user != 0) : # failsafe to avoid over-tagging the same comment
            df.loc[ent, tag] *= user # adjust setting for this tag on this comment by this user (.loc avoids chained assignment)
        if(df.loc[ent, 'completion'] % user != 0) : # failsafe to avoid over-completing the same comment
            df.loc[ent, 'completion'] *= user # mark this comment as completed by this user
    return df

def tagSplitDel(ind, sec, df, tagstr) : # turns raw string into list of tags that are to be removed from the data frame.
    if(not tagstr) : # quit right away if blank string is entered. 
        print("  skipping tag deletion operation... ")
        return df
    taglist = tagstr.split(', ') # otherwise, split string into list based on comma placement
    ent = df[(df['index'] == ind) & (df['section'] == sec)].index.tolist()[0]
    for tag in taglist : # look at each tag in the list
        if(tag not in df.columns) : # if this tag is not present in the current dataframe
            print(f"The tag {tag} does not exist. Deletion of it has been skipped.")
            continue # skip this mistake that must have been a typo
        df.loc[ent, tag] = 1 # undo the placement of this tag by all users (.loc avoids chained assignment)
        if(df[tag].sum() == len(df)): # is this tag now never used? 
            print(f"The tag {tag} is no longer used and shall be discarded from the data frame.")
            df.drop(columns = tag, inplace = True) # drop column from the data frame
        tags = [col for col in df.columns if col not in omit]
        if(df[tags].sum(axis=1)[ent] == len(tags)) : # is this comment now untagged?
            print("This comment no longer has any tags and shall be re-assigned as incomplete.")
            df.loc[ent, "completion"] = 1 # reset completion value to 1.
    return df

def tagEdit(ind, sec, df, user, users) : # function that shows the current tags on a comment and allows you to edit said tags
    commPrint(ind, sec, df, False, users) # print out the current comment
    cho = input("\nEnter 'a' for adding tags, 'd' for removing tags, or 'b' for adding then deleting tags from this comment: ")
    if(not cho): # return a blank string
        print('Skipping tagging procedure.')
        return df 
    if(cho.lower() in 'ab') : # add tag protocol
        rawtags = input('Please list the tags that you would like to add to this comment: ')
        df = tagSplitAdd(ind, sec, df, rawtags, user)
    if(cho.lower() in 'bd') : # delete tag protocol
        rawtags = input('Please list the tags that you would like to remove from this comment: ')
        df = tagSplitDel(ind, sec, df, rawtags) 
    return df

def tagWholeDel(tag, df) : # function that deletes a tag across all comments.
    if(tag not in df.columns) : # Is this even a tag that exists
        print("  Attempted to delete non-existent tag. Now returning unchanged data frame.")
        return df
    df.drop(columns = tag, inplace = True) # delete the tag's column
    tags = [col for col in df.columns if col not in omit]
    for ind, row in df.iterrows() : # review each comment
        if(row[tags].sum() == len(tags)) : # is this comment now untagged?
            print(f"  Document {row['index']} from Section {row['section']} no longer has any tags and shall be re-assigned as incomplete.")
            df.loc[ind, "completion"] = 1 # reset completion value to 1 (write to df, since row is only a copy)
    return df

def tagMerge(compTag, umbTag, df, clean) : # a function that merges several existing tags together into one common tag.
    tagMat = [ # list comp within list comp
        [
            df[tag][ind] for tag in compTag if tag in df # copy each tag value for the relevant component tags
        ] for ind, row in df.iterrows() # list comprehension over no. of rows
        
    ]
    if(tagMat[0]): # do any of these merging tags even exist?
        df[umbTag] = [np.lcm.reduce(tagRow) for tagRow in tagMat] # set new column equal
    else: # if not, just create an empty nowhere present tag for our setup.
        df[umbTag] = np.full(len(tagMat),1) # mark it as such
    if(clean) : # are we purging all of the old columns?
        if(umbTag in compTag) : # is umbrella tag one of our component tags?
            compTag.remove(umbTag) # remove it so that we don't undo all of our efforts
        df.drop(columns = compTag, inplace = True) # drop all unnecessary tags.
    return df

### Picking the Relevant File

Adjust the variable in this block to change which `.csv` file is being coded. The initial loaded file must have the following columns:
- `text`: column filled with the string of information contained within a document.
- `completion`: column of integer values that tracks which users have completed a document.
- `id`: column of integer values used to merge coded data with other dataframes as required.
- `section`: column of integer values that can be used to organize documents by classification.
- `index`: column of integer values used to internally order documents in the same section. It is absolutely necessary that no duplicates of `(section,index)` exist across all rows of your dataframe.

**NOTE: do not include the `.csv` portion of the filename. Later code blocks handle that automatically.**
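
Before a first coding session it can help to confirm that a file meets these requirements. The sketch below checks for the required columns and for repeated `(section, index)` pairs. The `toy` data frame and its rows are made up for illustration; in practice the check would be run on the file you load.

```python
import pandas as pd

required = ["text", "completion", "id", "section", "index"]

# Made-up corpus; a completion value of 1 means no coder has finished the document.
toy = pd.DataFrame({
    "text": ["first comment", "second comment", "third comment"],
    "completion": [1, 1, 1],
    "id": [101, 102, 103],
    "section": [1, 1, 2],
    "index": [0, 1, 0],
})

# Required columns that are absent from the file.
missing = [col for col in required if col not in toy.columns]

# Every row whose (section, index) pair occurs more than once.
dupes = toy[toy.duplicated(subset=["section", "index"], keep=False)]

print("missing columns:", missing)
print("rows sharing a (section, index) pair:", len(dupes))
```

An empty `missing` list and zero duplicated rows indicate that the file is ready to be loaded.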

In [None]:
workingFile = 'example' # which file are we working on today

### Setting User Information

This block loads in all of the users currently associated with a project and allows one to select a user for the coding session. The users for the coding project are stored in a separate file `users.txt`. If you wish to add new users to the coding project, then do so through direct edits to this `txt` document. 

In order to add a new user to the coding project: create a new line in the document, write the name that the coder will be identified by (without including any spaces in the name), put a single space, and then write a unique prime number to be used as that coder's signature.
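
As a rough illustration of the file format described above, the sketch below parses a hypothetical `users.txt` body and checks that every signature is a distinct prime. The names, the primes, and the `is_prime` helper are assumptions made for this example; the notebook itself does not perform this validation.

```python
# Hypothetical contents of users.txt: one coder per line, a name and then a prime.
sample = """alice 2
bob 3
carol 5
"""

def is_prime(n):
    """Trial division; adequate for the small primes used as signatures."""
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))

users = []
for line in sample.splitlines():
    if not line.strip():  # ignore blank lines
        continue
    name, prime = line.split()
    users.append((name, int(prime)))

primes = [prime for _, prime in users]
problems = [prime for prime in primes if not is_prime(prime)]  # non-prime signatures
has_repeats = len(set(primes)) != len(primes)                  # shared signatures

print(users)
print("non-prime signatures:", problems, "| repeated signatures:", has_repeats)
```

Distinct primes matter because a composite or repeated signature would make one coder's mark indistinguishable from another's when a cell value is decoded.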

In [None]:
users = []
with open("users.txt", "r") as f: # the file is closed automatically when this block ends
    for line in f: # read through every user
        read = line.split()
        if(not read): # skip blank lines
            continue
        users.append((read[0], int(read[1])))
primes = [val[1] for val in users]
    
user = 1 # preset to unacceptable value
while(user not in primes) : # keep seeking input until it is done right
    user = int(input("Please enter prime value associated with user ID: ")) # acquire integer input
    if(user not in primes) : # fail safe checks for improper integer
        print("Incorrect input, review users txt document and try again")

### LOAD, INPUT, SAVE: The Three Main Blocks

The following three blocks are where we will be spending the bulk of our time as we generate the tags for our data set.
1. ***LOAD BLOCK***: loads in the `.csv` of partially completed tags and allows us to continue working upon it.
2. ***INPUT BLOCK***: Procedurally lets you code documents to which you have not yet added tags.
3. ***SAVE BLOCK***: saves the changes to the `.csv` that you have made within the ***INPUT BLOCK***. If you don't want to save your changes, then simply re-run the ***LOAD BLOCK*** after pausing the ***INPUT BLOCK***.

### More on LOAD BLOCK 
<a id='loadBlock'></a>

Be sure to revisit this block whenever you want to undo any changes that you have made to the coding prior to saving. Loading is irreversible, so a failsafe is in place to ensure the block isn't run by accident.

In [None]:
query = input("Are you sure you wish to overwrite your working data frame with the previous version? (y/n) ") # seek verification
if(query.lower() == 'y') : # acquired
    print("\nData frame has been loaded.")
    codeDF = pd.read_csv(f'{workingFile}.csv') # load in data frame to begin working
    codeDF['section'] = codeDF['section'].apply(lambda val: str(val)) # make sure section labels are uniformly strings
else : # denied
    print("\nPrevious data frame WAS NOT loaded.")

### More on INPUT BLOCK ###
<a id='inputBlock'></a>

As you run ***INPUT BLOCK***, you will be shown documents yet to be tagged by you. As documents are read, you have the option of being shown what tags have been left by your fellow coders. You will then be given an input box where you can type in tags for the document. As you input your tags be sure to adhere to a specific listing style.
- For consistency, it is recommended that all tags be written in lowercase.
- Write `, ` (comma then a single space) between each of your tags. If you don't do this, then the string splitter will confuse tags together as one long tag.
- If you ever want to take a break, enter `pausetag` for your next entry: this will halt the tagging procedure.

***NOTE*** : if you ever wish to skip over adding tags to a comment and leave it uncompleted, enter an empty string into the input window when prompted.
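
The listing style matters because the tag string is split on the exact two-character separator `', '`. The short sketch below mirrors that split with a hypothetical helper, `split_tags`, and made-up tag names to show what a missing space or inconsistent capitalization does.

```python
def split_tags(raw):
    """Split a raw tag string on ', ' in the same way as the tagging functions."""
    return raw.split(', ')

print(split_tags("time management, procrastination"))
# ['time management', 'procrastination']

print(split_tags("time management,procrastination"))
# ['time management,procrastination']  (no space after the comma: one long tag)

print(split_tags("Stress, stress"))
# ['Stress', 'stress']  (different capitalization: two separate tags)
```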

In [None]:
blind = False # should we be permitted to see the tags of others? (True for no, False for yes)

skip = input("What section would you like to resume coding from? ") # choose what section to work from
codeDF_sec = codeDF[codeDF['section'] == skip]
for ind, row in codeDF_sec.iterrows():
    if(row['completion'] % user != 0): # if comment is not completed:
        commPrint(row['index'], skip, codeDF, blind, users)
        rawtags = input("\t Please enter the appropriate tags for this document: ") # input tags as comma'd list
        if(rawtags == 'pausetag'): # special pausing operation
            print('\n\t ---- PAUSING TAG PROTOCOL ----\n') 
            break # abort the loop for now
        codeDF = tagSplitAdd(row['index'], skip, codeDF, rawtags, user)   

### More on SAVE BLOCK
<a id='saveBlock'></a>

Be sure to revisit this block whenever you want to save any changes that you have made to the coding. Saving is irreversible at this time, so a failsafe is included in case you run this block unintentionally.

In [None]:
query = input("Are you sure you wish to overwrite the previous data frame with your current version? (y/n) ") # seek verification
if(query.lower() == 'y') : # acquired
    print("\nData frame has been saved.")
    codeDF.to_csv(f'{workingFile}.csv', index = False) # overwrite .csv with current version
else :  # denied
    print("\nData frame WAS NOT saved.") # state that save did not occur

#### Viewing Tag Dictionary ####
<a id='viewTagBlock'></a>

This code block allows you to review all of the current tags that exist for your version of the dataframe. The tags are displayed in a dictionary format along with the number of times that they occur within the dataframe. When viewing tags, you can choose to view only a specific section's breakdown as opposed to the breakdown of all tags. Should you wish to view the breakdown of all tags, hit only return/enter at the section request prompt. At present, the tags may be viewed either alphabetically or in descending order of frequency.
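
The counting itself amounts to asking, for each tag column, how many cells hold a value greater than `1`. The sketch below shows a vectorized way to compute the same frequencies on a made-up data frame; `toy` and its tag names are assumptions for illustration only.

```python
import pandas as pd

omit = ['text', 'completion', 'id', 'index', 'section']  # columns that are not tags

# Made-up coded corpus: a cell value above 1 means at least one coder applied the tag.
toy = pd.DataFrame({
    "text": ["a", "b", "c"],
    "completion": [6, 2, 1],
    "id": [1, 2, 3],
    "section": ["1", "1", "2"],
    "index": [0, 1, 0],
    "stress": [6, 2, 1],
    "workload": [3, 1, 1],
})

tags = [col for col in toy.columns if col not in omit]

# Number of documents carrying each tag, most frequent first.
freq = (toy[tags] > 1).sum().sort_values(ascending=False)
print(freq)

# The same breakdown restricted to a single section.
freq_sec2 = (toy.loc[toy["section"] == "2", tags] > 1).sum()
print(freq_sec2)
```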

In [None]:
sec = input('Which section would you like to view tag frequency for? ')
if(sec == ''):
    viewDF = codeDF
else:
    viewDF = codeDF.groupby('section').get_group(sec)
tagView = {tag : tagCount(viewDF[tag]) for tag in viewDF.columns if tag not in omit}

print(f"{sec} tags sorted alphabetically: ")
for tag, freq in sorted(tagView.items(), key = lambda x: x[0]):
    if(freq > 0):
        print(f'  {tag}:  {freq}')

print(f"\n{sec} tags sorted by frequency: ")
for tag, freq in sorted(tagView.items(), key = lambda x: x[1], reverse=True) :
    if(freq > 0):
        print(f'  {tag}:  {freq}')
    


### Viewing Tagged Entries Subject to Tag
<a id='filtBlock'></a>

This block allows you to enter a list of tags. It then prints every document that has been labeled with all or any of the tags you listed. Whether all or any applies is determined by setting `andOr` to `True` or `False` respectively. To run a search, enter your tags at the prompt, optionally enter a section, choose the `andOr` setting, and let the block run. If an error occurs, check that each tag is spelled correctly and that the list has `', '` between each tag item.

After running this block, you may refer to the bulk edit block located below. It allows you to sequentially edit the filtered documents.
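
The all/any distinction can be sketched with a vectorized mask. This is an illustrative alternative to the nested list comprehension in `commView`, not a replacement for it, and the `toy` data frame and tag names are made up.

```python
import pandas as pd

# Made-up tag columns: a value above 1 means the tag has been applied by someone.
toy = pd.DataFrame({
    "text": ["a", "b", "c"],
    "stress": [2, 6, 1],
    "workload": [3, 1, 1],
})
taglist = ["stress", "workload"]

applied = toy[taglist] > 1                   # True where a tag is present
mask_all = applied.all(axis=1).tolist()      # corresponds to andOr = True
mask_any = applied.any(axis=1).tolist()      # corresponds to andOr = False

print(mask_all)  # [True, False, False]
print(mask_any)  # [True, True, False]
```

The first document carries both tags, the second carries only `stress`, and the third carries neither, so only the first survives the "all" filter.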

In [None]:
rawtags = input("Please enter list of tags by which to filter: ")
taglist = rawtags.split(", ")
cho = input("Enter a section to filter by: (or hit only return/enter to search across all sections): ")
andOr = bool(int(input("Enter 1/0 to filter for documents that contain all/any of the tags entered. ")))

filt = commView(taglist, codeDF, andOr) # create filtered sub-data frame
if(not(cho)): 
    for ind, row in codeDF.iterrows() : # look across filtered dataframe
        if(filt[ind]):
            commPrint(row['index'], row['section'], codeDF, False, users) # print out each of its comments with tags.
else: 
    for ind, row in codeDF.loc[codeDF['section'] == cho].iterrows() :
        if(filt[ind]):
            commPrint(row['index'], row['section'], codeDF, False, users) # print out each of its comments with tags.

### Editing Filtered Documents Subject to Tags

If you have examined a sequence of documents that are filtered based on which tags they have, you may then run this block in order to sequentially edit them without having to re-enter the standard input.

In [None]:
if(not(cho)): # no section was chosen: edit across all sections
    for ind, row in codeDF.iterrows() :
        if(filt[ind]):
            codeDF = tagEdit(row['index'], row['section'], codeDF, user, users)
else: # restrict the edits to the chosen section
    for ind, row in codeDF.loc[codeDF['section'] == cho].iterrows() :
        if(filt[ind]):
            codeDF = tagEdit(row['index'], row['section'], codeDF, user, users)

### Filtering for Agreement Through Tags
<a id='diffBlock'></a>

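Once tagging is completed, it is useful to quickly review the documents whose tag assignments differ between users. This block does exactly that: it filters for documents where at least one tag has been applied by some, but not all, of the coders who completed the document.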
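
The disagreement test compares each tag cell against the document's `completion` value: a tag that has been applied (value other than `1`) but whose value is smaller than `completion` is missing the signature of at least one coder who completed the document. A minimal vectorized sketch, assuming two hypothetical coders with primes 2 and 3 and a made-up data frame:

```python
import pandas as pd

omit = ['text', 'completion', 'id', 'index', 'section']  # columns that are not tags

# Hypothetical coders: prime 2 and prime 3. A completion of 6 means both have finished.
toy = pd.DataFrame({
    "text": ["a", "b", "c"],
    "completion": [6, 6, 2],
    "stress": [6, 2, 2],
    "workload": [1, 3, 1],
})
tags = [col for col in toy.columns if col not in omit]

# A cell is partial when the tag is applied (not 1) yet falls short of completion.
partial = toy[tags].lt(toy["completion"], axis=0) & toy[tags].ne(1)
disputed = toy[partial.any(axis=1)]

print(disputed.index.tolist())  # [1]
```

Only the second document is flagged: both coders completed it, but `stress` was applied by one and `workload` by the other.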

In [None]:
confDF = commDiff(codeDF)
for ind, row in confDF.iterrows() : # look across filtered dataframe
    commPrint(row['index'], row['section'], confDF, False, users) # print out each of its comments with tags.

### Editing Tags on an Existing Document
<a id='editBlock'></a>

If you wish to alter the existing tags on a document, you can use this block. Begin by entering the indexing information for the document that you wish to change. You are then prompted to enter a character describing the edit operation that you would like to perform.

***NOTE*** : if you ever wish to skip over adding or deleting tags, enter an empty string into the input window when prompted.

In [None]:
sec = input("Enter the section to which the document belongs: ")
ind = input("Enter the index number for the document: ")
codeDF = tagEdit(int(ind), sec, codeDF, user, users) # proceed with block

### Deleting Tags Across all Documents
<a id='tagDelBlock'></a>

If we ever decide that a tag is useless, we can delete it from all documents simultaneously. The feature will also reset the completion status of documents that no longer have any tags after the deletion.

In [None]:
tag = input("Which tag would you like to have deleted from the data frame? ")
codeDF = tagWholeDel(tag,codeDF)

#### Merging Similar Tags into Umbrella Tag ####
<a id='mergeBlock'></a>

As we complete our coding on each document, we will find that a large number of tagging discrepancies are due to us using different words for the same premise _(e.g. procrastinate, procrastination, procrastinating)_. We may also find that some tags are just typos of another, much more common tag. This block exists to quickly merge such tags together into a single unifying tag.

The code block is used by inputting a properly formatted list of tags that you wish to have merged together. You must then provide a name for the umbrella tag that you wish to have replace all of the component tags. Following this input, you will be given a data frame that condenses the selected tag columns into a single new column. As a side effect of this merging, all authors will be credited with writing the merged tag if they contributed to the writing of any component tag for a given document.

***NOTE*** : this block will crash if you attempt to merge a tag that does not exist.
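
Authorship is inherited through the least common multiple: the merged cell is divisible by a coder's prime exactly when at least one component cell was. The sketch below shows the arithmetic on made-up columns, assuming two hypothetical coders with primes 2 and 3.

```python
import numpy as np

# Hypothetical coders: prime 2 and prime 3. A value of 1 means the tag is absent.
procrastinate = np.array([2, 1, 3])
procrastinating = np.array([3, 1, 3])

# Element-wise least common multiple of the two component columns.
merged = np.lcm(procrastinate, procrastinating)
print(merged.tolist())  # [6, 1, 3]

# For more than two component values in a row, the reduction form applies.
print(int(np.lcm.reduce([2, 3, 1])))  # 6
```

The third document shows why the least common multiple is used rather than a plain product: multiplying would give 9, which counts the same coder twice, whereas the LCM stays at 3.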

In [None]:
rawComp = input("Please enter the properly formatted list of tags that you would like to have merged together: ")
comp = rawComp.split(', ')
umb = input("Please enter the new tag under which you would like your component tags merged: ")
clean = bool(int(input("Enter 0 if you would like to keep the component tags and enter 1 to delete them: ")))
tagMerge(comp, umb, codeDF, clean)

### Mass Tagging with a Dictionary

Tag merging could end up being a somewhat tedious process if inputted one at a time. As an alternative, one may use a dictionary variable as a guideline for the mass merging of many tags into a smaller number of tags. This feature is broken up into two blocks.
1. The first block involves you drafting a dictionary variable that we shall call `tagDict`. The dictionary ought to be designed so that the key is the umbrella tag that you wish to simplify to and the value is a list of component tags that you want to have reduced down to the single umbrella tag.
2. The second block takes the dictionary drafted in the first block as input and then does a tag merge for each key-value pair. Be sure that your dictionary is properly formatted before running this block, lest you accidentally undo some of your work.

In [None]:
tagDict = {
    'example1': [
        'test',
        'foo',
        'bar'
    ],
    'example2': [
        'high',
        'lo'
    ]    
}

In [None]:
for key,val in tagDict.items():
    tagMerge(val, key, codeDF, False)

### Final Saving of Personalized Copy
<a id='endSaveBlock'></a>

A current issue with this notebook is that authors who work simultaneously risk erasing each other's work should they attempt to share files. To facilitate simultaneous coding, there exists this *final* save option. This saving block takes all of the changes that you have made to your dataframe and saves them to *another* `.csv` file that is unique to you as a user. This way, you can make your final save without worrying about overwriting the work of another coder. The code block itself functions identically to the previous save block. By only sharing your personalized copy on GitHub, you can merge the files of yourself and your fellow authors into another dataframe that you can continue coding at a later time.

**BE SURE TO USE IT BEFORE FINISHING PRELIMINARY CODING AND PUSHING TO GITHUB!**
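
The notebook does not include a routine for combining personalized copies, so the following is only one possible sketch. It assumes that both copies list the same documents in the same row order and takes the least common multiple of each signed column, treating a tag column that is absent from one copy as all `1`s. The function `merge_copies`, the two toy copies, and the coder primes 2 and 3 are all hypothetical.

```python
import numpy as np
import pandas as pd

omit = ['text', 'completion', 'id', 'index', 'section']  # columns that are not tags

# Toy personalized copies from two hypothetical coders (primes 2 and 3).
copy_a = pd.DataFrame({"text": ["a", "b"], "completion": [2, 2], "id": [1, 2],
                       "section": ["1", "1"], "index": [0, 1],
                       "stress": [2, 1], "workload": [1, 2]})
copy_b = pd.DataFrame({"text": ["a", "b"], "completion": [3, 1], "id": [1, 2],
                       "section": ["1", "1"], "index": [0, 1],
                       "stress": [3, 1], "sleep": [3, 1]})

def merge_copies(left, right):
    """Combine two copies that hold the same documents in the same row order."""
    merged = left[[col for col in omit if col in left.columns]].copy()
    all_tags = sorted((set(left.columns) | set(right.columns)) - set(omit))
    for col in ["completion"] + all_tags:
        # A column missing from one copy counts as untouched (all ones).
        a = left[col].to_numpy() if col in left.columns else np.ones(len(left), dtype=int)
        b = right[col].to_numpy() if col in right.columns else np.ones(len(right), dtype=int)
        merged[col] = np.lcm(a, b)  # keep every coder's signature exactly once
    return merged

combined = merge_copies(copy_a, copy_b)
print(combined)
```

If the copies could differ in row order, aligning them on `id` before taking the least common multiple would be a necessary extra step that this sketch leaves out.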

In [None]:
query = input("Are you sure you wish to overwrite your personalized data frame with your current version? (y/n) ") # seek verification
if(query.lower() == 'y') : # acquired
    print("\nPersonalized data frame has been saved.")
    codeDF.to_csv(f'{workingFile}{primeName(user,users)}.csv', index = False) # overwrite .csv with current version
else :  # denied
    print("\nPersonalized data frame WAS NOT saved.") # state that save did not occur

And that is the complete functionality of the qualitative coder!