**AUTOMATED ESSAY SCORING: CS109A FINAL PROJECT**

By Anmol Gupta, Annie Hwang, Paul Lisker, and Kevin Loughlin

**THINGS TO CONSIDER**

- See README.md for a basic guide to what's in the repository.
- The training essays are scored on different scales. Perhaps the scoring scale should be an input to our model, or perhaps we should train a separate model for each scale.
- We should establish how to handle anonymized data (for example, business names were replaced with tokens such as @ORGANIZATION1).
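
One possible way to handle the different scoring scales is to rescale scores within each essay set before modeling. The sketch below is only an illustration: the helper `normalize_scores_by_set` and the toy data are hypothetical, and it scales by the minimum and maximum scores observed in each set rather than by the official rubric ranges.

```python
def normalize_scores_by_set(essay_sets, scores):
    """Min-max scale each score to [0, 1] within its own essay set."""
    # Collect the observed minimum and maximum score for every essay set.
    bounds = {}
    for essay_set, score in zip(essay_sets, scores):
        low, high = bounds.get(essay_set, (score, score))
        bounds[essay_set] = (min(low, score), max(high, score))

    normalized = []
    for essay_set, score in zip(essay_sets, scores):
        low, high = bounds[essay_set]
        # A set with a single distinct score has no spread; map it to 0.0.
        span = float(high - low)
        normalized.append((score - low) / span if span else 0.0)
    return normalized


# Hypothetical toy data: two essay sets graded on different scales.
toy_sets = [1, 1, 1, 2, 2, 2]
toy_scores = [2, 7, 12, 1, 3, 6]
print(normalize_scores_by_set(toy_sets, toy_scores))
```

With scores on a common scale, a single model could be trained across essay sets. Predictions would then need to be mapped back to each set's original scale.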

In [33]:
import matplotlib
import matplotlib.pyplot as plt

import pandas as pd
import numpy as np

# stopwords for English
from nltk.corpus import stopwords

# English 'dictionary'
from nltk.corpus import words

# Regular expressions might be useful
import re

# Beautiful soup might be useful
from bs4 import BeautifulSoup

from sklearn.feature_extraction.text import CountVectorizer

In [34]:
train_df = pd.read_csv('data/training_set_rel3.tsv', delimiter='\t')
print(train_df.head())

print('\n*****\n')

train_essays_np = train_df['essay'].values
print(train_essays_np.shape)

   essay_id  essay_set                                              essay  \
0         1          1  Dear local newspaper, I think effects computer...   
1         2          1  Dear @CAPS1 @CAPS2, I believe that using compu...   
2         3          1  Dear, @CAPS1 @CAPS2 @CAPS3 More and more peopl...   
3         4          1  Dear Local Newspaper, @CAPS1 I have found that...   
4         5          1  Dear @LOCATION1, I know having computers has a...   

   rater1_domain1  rater2_domain1  rater3_domain1  domain1_score  \
0               4               4             NaN              8   
1               5               4             NaN              9   
2               4               3             NaN              7   
3               5               5             NaN             10   
4               4               4             NaN              8   

   rater1_domain2  rater2_domain2  domain2_score      ...        \
0             NaN             NaN            NaN      ...    

In [35]:
test_df = pd.read_csv('data/test_set.tsv', delimiter='\t')
print(test_df.head())

print('\n*****\n')

test_essays_np = test_df['essay'].values
print(test_essays_np.shape)

   essay_id  essay_set                                              essay  \
0      2383          1  I believe that computers have a positive effec...   
1      2384          1  Dear @CAPS1, I know some problems have came up...   
2      2385          1  Dear to whom it @MONTH1 concern, Computers are...   
3      2386          1  Dear @CAPS1 @CAPS2, @CAPS3 has come to my atte...   
4      2387          1  Dear Local newspaper, I think that people have...   

   domain1_predictionid  domain2_predictionid  
0                  2383                   NaN  
1                  2384                   NaN  
2                  2385                   NaN  
3                  2386                   NaN  
4                  2387                   NaN  

*****

(4254,)


In [36]:
valid_df = pd.read_csv('data/valid_set.tsv', delimiter='\t')
print(valid_df.head())

print('\n*****\n')

valid_essays_np = valid_df['essay'].values
print(valid_essays_np.shape)

   essay_id  essay_set                                              essay  \
0      1788          1  Dear @ORGANIZATION1, @CAPS1 more and more peop...   
1      1789          1  Dear @LOCATION1 Time @CAPS1 me tell you what I...   
2      1790          1  Dear Local newspaper, Have you been spending a...   
3      1791          1  Dear Readers, @CAPS1 you imagine how life woul...   
4      1792          1  Dear newspaper, I strongly believe that comput...   

   domain1_predictionid  domain2_predictionid  
0                  1788                   NaN  
1                  1789                   NaN  
2                  1790                   NaN  
3                  1791                   NaN  
4                  1792                   NaN  

*****

(4218,)


In [39]:
# Store the English word list and the stop words as sets for fast membership tests
nltk_words_set = set(words.words())
stop_words = set(stopwords.words("english"))

# Extracts words from text: alphabetic characters only, all lowercase
# Inspired by a function in Kevin's CS51 final project, adapted from Kaggle's Popcorn Movie Reviews Project
# Used in Kevin's assignment 5
def clean_essay(text):
    # Replace every non-letter with a space; this also splits contractions
    # ("there's" -> "there s") and reduces tokens like @ORGANIZATION1 to "organization"
    str_only_letters = re.sub("[^a-zA-Z]", " ", text)

    # Convert to lower case and split into words
    # (named tokens so the imported nltk `words` corpus is not shadowed)
    tokens = str_only_letters.lower().split()

    # TODO If we want to remove stop words, we can do so here
    # tokens = [w for w in tokens if w not in stop_words]

    # TODO If we want to keep only dictionary words, we can do the below
    # tokens = [w for w in tokens if w in nltk_words_set]

    # Return the words as a single space-separated string
    return " ".join(tokens)
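
Because `clean_essay` keeps only letters, anonymization tokens such as @CAPS1 or @ORGANIZATION1 end up as ordinary-looking words ("caps", "organization"). One option, sketched here under assumptions, is to swap each token for a distinct letters-only placeholder before cleaning. The helper `replace_anonymized`, the `anon` prefix, and the assumption that every token has the form `@` + upper-case category + number are illustrative and not part of the project code.

```python
import re

# Assumed token shape: "@", an upper-case category name, then a number,
# e.g. @CAPS1, @LOCATION1, @ORGANIZATION1, @MONTH1.
ANON_PATTERN = re.compile(r"@([A-Z]+)\d+")


def replace_anonymized(text, prefix="anon"):
    """Replace each anonymization token with a letters-only placeholder.

    The placeholder keeps the category (e.g. "anoncaps") and contains only
    letters, so it survives a later letters-only cleaning step as one word.
    """
    return ANON_PATTERN.sub(lambda m: prefix + m.group(1).lower(), text)


sample = "Dear @CAPS1 @CAPS2, I visited @ORGANIZATION1 in @MONTH1."
print(replace_anonymized(sample))
```

Calling this on an essay before `clean_essay` would keep anonymized entities distinguishable from genuine uses of words like "organization". Whether that helps scoring is an open question.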

In [40]:
print("Cleaning training essays...")
train_clean = np.array([clean_essay(essay) for essay in train_essays_np])

print("Cleaning testing essays...")
test_clean = np.array([clean_essay(essay) for essay in test_essays_np])

print("Cleaning validation essays...")
valid_clean = np.array([clean_essay(essay) for essay in valid_essays_np])

print("Done.")

print(train_clean.shape)
print(train_clean[:3])

Cleaning training essays...
Cleaning testing essays...
Cleaning validation essays...
Done.
(12976,)
[ 'dear local newspaper i think effects computers have on people are great learning skills affects because they give us time to chat with friends new people helps us learn about the globe astronomy and keeps us out of troble thing about dont you think so how would you feel if your teenager is always on the phone with friends do you ever time to chat with your friends or buisness partner about things well now there s a new way to chat the computer theirs plenty of sites on the internet to do so organization organization caps facebook myspace ect just think now while your setting up meeting with your boss on the computer your teenager is having fun on the phone not rushing to get off cause you want to use it how did you learn about other countrys states outside of yours well i have by computer internet it s a new way to learn about what going on in our time you might think your child spend

In [44]:
# To limit the vectorizer to the most common words, set this number.
# Doing so will speed things up (and potentially increase accuracy).
# None indicates no limit.
max_words = None

# Bag-of-words vectorizer: one count column per vocabulary word, English stop words removed
vectorizer = CountVectorizer(stop_words='english', max_features=max_words)

# Fit the vocabulary on the training data only.
# Note: .toarray() converts the sparse result to a dense array, which is memory-hungry at this size
train_vec = vectorizer.fit_transform(train_clean).toarray()

# Apply the fitted vectorizer to the testing and validation sets
test_vec = vectorizer.transform(test_clean).toarray()
valid_vec = vectorizer.transform(valid_clean).toarray()

print(train_vec.shape)
print(test_vec.shape)
print(valid_vec.shape)

(12976, 37678)
(4254, 37678)
(4218, 37678)
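
To make the fit-then-transform step concrete, here is a simplified pure-Python sketch of a bag-of-words count matrix. It is not scikit-learn's implementation: the helpers `fit_vocabulary` and `transform` and the toy corpora are hypothetical, and the sketch omits stop-word removal and scikit-learn's own tokenization rules.

```python
def fit_vocabulary(documents):
    """Map each distinct word in the documents to a column index (sorted order)."""
    vocab = sorted({word for doc in documents for word in doc.split()})
    return {word: index for index, word in enumerate(vocab)}


def transform(documents, vocabulary):
    """Turn each document into a list of word counts, one column per vocabulary word."""
    rows = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for word in doc.split():
            # Words never seen during fitting have no column and are skipped.
            if word in vocabulary:
                row[vocabulary[word]] += 1
        rows.append(row)
    return rows


# Hypothetical toy corpora standing in for train_clean and test_clean.
toy_train = ["computers help people learn", "people chat on computers"]
toy_test = ["computers help teenagers chat chat"]

vocabulary = fit_vocabulary(toy_train)
print(sorted(vocabulary))
print(transform(toy_test, vocabulary))
```

The vocabulary is fixed by the training data, so every transformed set has the same number of columns. This is why the three shapes above all share 37678 as their second dimension.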
