## Submittable Exploration

In _Telling Stories with Data_ we did a project for Submittable. The organizations that make up Submittable's client base use forms to receive submissions for many different purposes: works for literary publication submissions, contest entries, grant applications, award nominations, etc. Since Submittable's clients don't label their forms, we had to do it by hand. 

In these two notebooks, we'll do a little work with the form descriptions to see if there are patterns that might be used for labeling. This notebook holds the initial exploration; the second notebook will contain a classifier to attempt to automate the label creation. 

In [1]:
import nltk
import re
from collections import Counter, defaultdict
import numpy as np
from pprint import pprint

In [2]:
sw = set(nltk.corpus.stopwords.words("english"))

Let's start by reading in the data. 

In [3]:
with open("20191112_merged_labeled.txt",encoding="UTF-8") as infile :
    print(infile.readline().split("\t"))

['AdminID', 'OrgName', 'OrgDomain', 'OrgUsecases', 'FormID', 'LiveForm', 'FormName', 'FormDescription', 'UseCase', 'Student\n']


Our data set has the following columns: 

* AdminID 
* OrgName 
* OrgDomain 
* OrgUsecases 
* FormID 
* LiveForm 
* FormName 
* FormDescription 
* UseCase 
* Student

We're really interested in Form Description and UseCase, so we'll start by just working with those. 

Many of the forms have multiple use cases because they were labeled by two students. In the data these forms have use cases like `publishing|contest`. For our purposes, we'll consider both labels "correct" for now. If the use case has a pipe (`|`) character, we'll split the use cases and store the description twice.  

In [4]:
form_data = defaultdict(list)
num_doubled = 0

with open("20191112_merged_labeled.txt",encoding="UTF-8") as infile :
    next(infile)
    for row in infile : 
        row = row.strip().split("\t")
        
        use_case = row[8]
        description = row[7]
        
        if "|" in use_case :
            use_case = use_case.split("|")
            num_doubled += 1
        else :
            use_case = [use_case]
            
        for uc in use_case :
            # For now each use case will just be a list
            # of descriptions. 
            form_data[uc].append(description)
    
print(f'We had {num_doubled} cases with two different use cases.')

We had 444 cases with two different use cases.


Perhaps we can start by looking at how many descriptions we have, how many total words, how many unique words, and how many words that aren't stopwords. 

It'll be useful to have a function that turns a description into a clean, "bag of words". 

In [5]:
def bag(desc) : 
    """Turns a description into a bag of clean words."""
    words = [w.lower() for w in desc.split()]
    words = [w for w in words if w.isalpha() and w not in sw]
    return(words)

In [6]:
desc_words = []

for uc in form_data :
    for desc in form_data[uc] :
        desc_words.extend(bag(desc))
        
        
print(f"We have {len(desc_words)} total words.")
print(f"We have {len(set(desc_words))} unique words.")
print(f"We have {len(set(desc_words) - sw)} unique non-stopwords.")

We have 605730 total words.
We have 22108 unique words.
We have 22108 unique non-stopwords.


That looks promising. 1.1M words, 22K are unique and most of those aren't stopwords. What are the most common ones? 

In [7]:
Counter([w for w in desc_words if w not in sw]).most_common(20)

[('must', 5899),
 ('work', 5774),
 ('please', 5313),
 ('may', 4592),
 ('link', 4155),
 ('application', 3999),
 ('submit', 3940),
 ('one', 3322),
 ('arts', 2441),
 ('artists', 2393),
 ('submissions', 2381),
 ('program', 2330),
 ('include', 2311),
 ('new', 2292),
 ('submission', 2255),
 ('us', 2094),
 ('information', 2062),
 ('art', 2059),
 ('artist', 2048),
 ('project', 1995)]

Now, let's just count descriptions by use case, then calculate those summary statistics by use-case. 

In [8]:
print("use_case                num_forms  num_words   unique-non-sw   words_per_form")

# Note the tricky lambda function to sort by popularity descending
for uc in sorted(form_data,key=lambda u: -1*len(form_data[u]) )  :
    these_descs = form_data[uc]
    num_forms = len(these_descs)
    
    these_words = []
    for desc in these_descs :
        these_words.extend(bag(desc))
    
    num_words = len(these_words)
    num_uni_non_sw = len(set(these_words)-sw)
    
                
    print(f"{uc:20} {num_forms:12} {num_words:10} {num_uni_non_sw:15} {round(num_uni_non_sw/num_forms,2):16}")

use_case                num_forms  num_words   unique-non-sw   words_per_form
publishing                   1433     108880            8807             6.15
grants                        664      79398            7577            11.41
contest                       590      63833            6648            11.27
job applications              414      68469            7527            18.18
award/nomination              353      38478            5105            14.46
exhibition                    324      54317            6540            20.19
fellowships                   302      40979            5486            18.17
admissions                    265      24774            4176            15.76
festival or event             193      30061            4977            25.79
conference                    176      18573            4008            22.77
internal use                  170      10424            2888            16.99
residency                     143      24648            3675    

Some observations from these descriptive statistics: 

* Publishing makes up about 25% of total forms. Grants and contests are 12% and 10% respectively. There are 17 use cases, with most being a small percentage of the total. This is hard on classification algorithms, since guessing "publishing" will be right 1/4 of the time. 
* Most of the unique words we have aren't stopwords. Not surprising, since the stopword set isn't huge. 
* The total number of words starts getting small as we move past spot 10 in the list. It'd be surprising if we had enough information for, say, "internal use". 
* The words per form are variable. Publishing has a shocking small number of words. That's worthy of further scrutiny. Generally it seems like use cases with fewer forms use more unique words. That makes sense, probably, as submitters may need more information. 

Some of these use cases have low average number of unique words. Let's take a look at the distribution for publishing, grants, and contests. There's probably a histogram function in numpy somewhere, but I don't know it, so I'm going write my own. 

In [9]:
def get_distro(uc,fd) :
    """ Given a use case (uc) and our form data (fd),
        calculate the distribution of description lengths. 
        Returns a dictionary with bin labels and counts.
        """
    bin_cutoff = [10,25,50,75,125,200]
    results = defaultdict(int)
    
    for desc in fd[uc] :
        desc_words = bag(desc) 
        num = len(set(desc_words)) 
        
        lb = 0 
        
        for ub in bin_cutoff :
            if lb < num <= ub :
                break
            else :
                lb = ub 
                
        if lb != ub :
            label = str(lb) + "-" + str(ub-1) 
        else :
            label = str(ub) + "+"
            
        results[label] += 1
        
    return(results)


With that out of the way, let's look at the distribution of unique words for our three biggest use cases. 

In [10]:
ranges = ['0-9','10-24','25-49','50-74','75-124','125-199','200+']
pub_distro = get_distro("publishing",form_data)

# Test that I didn't change the bins
assert(set(ranges)-set(pub_distro.keys())==set())

print("Word Distribution for Publishers")
print('')

print('Num Words in Desc     Descriptions')
for rng in ranges :
    print(f'{rng:20} {pub_distro[rng]:13}')


Word Distribution for Publishers

Num Words in Desc     Descriptions
0-9                            178
10-24                          270
25-49                          376
50-74                          219
75-124                         218
125-199                        122
200+                            50


These bins were chosen for this category, so the distribution is pretty even. We see 178 very short descriptions (under 10 unique non-stopwords). There are about a similar number with descriptions longer than 125 words. Overall things don't look *too* concerning, although I'd like to see some of these very short descriptions. 

In [11]:
pub_distro = get_distro("grants",form_data)

print("Word Distribution for Grants")
print('')

print('Num Words in Desc     Descriptions')
for rng in ranges :
    print(f'{rng:20} {pub_distro[rng]:13}')

Word Distribution for Grants

Num Words in Desc     Descriptions
0-9                             36
10-24                           99
25-49                          152
50-74                          110
75-124                         132
125-199                         84
200+                            51


Grants appear to have a distribution closer to normal, with most falling in the 25-125 word range. Still some long ones out there. 

In [12]:
pub_distro = get_distro("contest",form_data)

print("Word Distribution for Contests")
print('')

print('Num Words in Desc     Descriptions')
for rng in ranges :
    print(f'{rng:20} {pub_distro[rng]:13}')

Word Distribution for Contests

Num Words in Desc     Descriptions
0-9                             51
10-24                           71
25-49                          135
50-74                          122
75-124                         113
125-199                         63
200+                            35


Contests look pretty similar to grants, with a slight skew toward shorter forms. 

Let's look at some of the very short form descriptions for publishers. 

In [13]:
num_printed = 0

for desc in form_data['publishing'] :
    uni_words = set(bag(desc))
    
    if len(uni_words) < 20 :
        print(desc)
        print()
        num_printed += 1
        
        if num_printed > 5 :
            break

<p>Submit one short screenplay or one-act play of no more than 10 pages. Please do not include identifying information on the submission. </p><br>

"<div class=""clearfix""><p>1 piece<br> Upload in high resolution or 300 dpi for scanning photography.</p></div>"

"<p>Submit one piece of creative nonfiction of up to 4,000 words.</p><p>Please include a third-person bio of fewer than 75 words.</p><br>"

"<div class=""clearfix""><p>Poems in traditional and experimental styles but no light verse (up to 6 poems).</p><p>Please include the following contact information in your cover letter and/or on your manuscript: mailing address, phone number, and email address if available.</p></div>"

"<p>We’re interested in <b>black-and-white</b> photographs. We’re not looking for photojournalism, just unique perspectives on the world around us — especially human interactions.</p><p>Please review our full <a target=""_blank"" rel=""nofollow"" href=""https://www.thesunmagazine.org/submit#photography"">subm

Well, I'm not sure what I learned from that. Some people use really, really short descriptions. 

One thing I _did_ notice is that some forms, like the first publishing one, have some typos brought on in the cleaning process. Look for text that contains `poemsthat`. Could definitely use the spell checking stuff to correct these, though I'm not going to worry about it now. Putting in a comment for posterity. 

---

### Frequency Distributions

Let's look at a few frequency distributions by use case.

In [14]:
for uc in sorted(form_data,key=lambda u: -1*len(form_data[u]) ) :
    these_descs = form_data[uc]
    these_words = [] # should I have just stored these? Maybe
    for desc in these_descs :
        these_words.extend(bag(desc))
        
    fd = nltk.FreqDist(these_words)
    
    print("---------- " + uc + " -----------")
    pprint(fd.most_common(15))
    print()

    if "award" in uc : 
        break


---------- publishing -----------
[('please', 1604),
 ('work', 1529),
 ('submissions', 1302),
 ('submit', 1267),
 ('us', 1125),
 ('one', 1081),
 ('submission', 997),
 ('may', 817),
 ('must', 773),
 ('include', 742),
 ('poems', 589),
 ('link', 585),
 ('published', 573),
 ('poetry', 551),
 ('short', 539)]

---------- grants -----------
[('grant', 1168),
 ('must', 970),
 ('application', 949),
 ('link', 797),
 ('project', 690),
 ('arts', 643),
 ('may', 610),
 ('please', 588),
 ('support', 530),
 ('program', 438),
 ('submit', 437),
 ('community', 408),
 ('work', 395),
 ('grants', 388),
 ('funding', 354)]

---------- contest -----------
[('must', 950),
 ('may', 714),
 ('submit', 555),
 ('link', 550),
 ('please', 512),
 ('entry', 494),
 ('work', 450),
 ('one', 374),
 ('entries', 356),
 ('submission', 347),
 ('submissions', 328),
 ('submitted', 302),
 ('use', 274),
 ('include', 267),
 ('artist', 262)]

---------- job applications -----------
[('work', 720),
 ('please', 426),
 ('experience', 33

Well, this seems pretty promising. 

One last piece of exploration. For each use case, I'd like to go through and find five words that are in the top 50 for that use case that aren't in the top 50 for any other. I'll try brute force first, going through each use case and getting the top $N$ words for it. 

In [15]:
num_top = 50
top_by_uc = dict()

for uc in sorted(form_data,key=lambda u: -1*len(form_data[u]) ) :
    these_descs = form_data[uc]
    these_words = [] # should I have just stored these? Maybe
    for desc in these_descs :
        these_words.extend(bag(desc))
        
    fd = nltk.FreqDist(these_words)

    top_by_uc[uc] = set([w for w,cnt in fd.most_common(num_top)])


Now a second pass, finding the ones that are unique. 

In [16]:
for uc in sorted(form_data,key=lambda u: -1*len(form_data[u]) ) :
    these_descs = form_data[uc]
    these_words = [] # should I have just stored these? Maybe
    for desc in these_descs :
        these_words.extend(bag(desc))
        
    fd = nltk.FreqDist(these_words)
    top_n = {w for w, cnt in fd.most_common(num_top)}

    other_tops = set()
    for other_uc in top_by_uc :
        if other_uc != uc :
            other_tops = other_tops.union(top_by_uc[other_uc])
    
    
    unique = top_n - other_tops

    if unique :
        print(uc + ": " + ",".join(unique))
        print()

publishing: read,previously,publish,accept,send,poetry,submitting,words,story

grants: cultural,fund,funds,organizations,grants,funding,report,final

contest: photos,competition,prize,contest

job applications: care,skills,quality,internship,ability,aged,national,management,position,communication

award/nomination: individual,nomination,winner,awards,nominations,best,annual

exhibition: additional,exhibition,works,scad,gallery,members,artwork

fellowships: completed,social,based,applying,fellows

admissions: focus,interested,course,pm

festival or event: agreement,light,shall,film,vendors,festival,insurance,vendor,bopa,market

conference: session,abstract,workshops,place,proposal,author,conference,poster,sessions,panel,presentations

internal use: people,ensure,payroll,prior,staff

residency: residency,file,statement,sample,senior,studio,want

peer review: pages,looking

scholarships: recipients,committee,least,financial,current,academic,ñ,scholarship,high,scholarships,awarded,college


Okay, this seems like it has real potential. Not sure if Naive Bayes will be able to categorize properly, but it seems possible. 