# Project 1: Are You Happy Today?

![](../figs/1_BiVCmiQtCBIdBNcaOKjurg.png)

## Text Preprocessing

### Step 0 - Load all the required packages

General descriptions of the libraries, taken from their official documentation:  
+ `pandas`: Pandas provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
+ `numpy`: NumPy is the fundamental package for scientific computing in Python.
+ `matplotlib`: Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.
+ `nltk`: NLTK is a leading platform for building Python programs to work with human language data.
  
I will use a few other libraries and functions later; each is described in the cell where it first appears.

In [1]:
# load libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk

### Step 1 - Load the data and take a first look

In [2]:
# read the data from the `data` subfolder of the project
# (the relative path assumes this notebook lives in the `doc` subfolder)
hm_data = pd.read_csv('../data/HappyDB/happydb/data/cleaned_hm.csv')

First, let's get a general sense of the data!  
For convenience, here are the variable descriptions copied from the official GitHub documentation.  
+ __hmid (int)__: Happy moment ID
+ __wid (int)__: Worker ID
+ __reflection_period (str)__: Reflection period used in the instructions provided to the worker (3m or 24h)
+ __original_hm (str)__: Original happy moment
+ __cleaned_hm (str)__: Cleaned happy moment
+ __modified (bool)__: If True, original_hm is "cleaned up" to generate cleaned_hm (True or False)
+ __predicted_category (str)__: Happiness category label predicted by our classifier (7 categories. Please see the reference for details)
+ __ground_truth_category (str)__: Ground truth category label. The value is NaN if the ground truth label is missing for the happy moment
+ __num_sentence (int)__: Number of sentences in the happy moment

In [3]:
hm_data.head(10)

Unnamed: 0,hmid,wid,reflection_period,original_hm,cleaned_hm,modified,num_sentence,ground_truth_category,predicted_category
0,27673,2053,24h,I went on a successful date with someone I fel...,I went on a successful date with someone I fel...,True,1,,affection
1,27674,2,24h,I was happy when my son got 90% marks in his e...,I was happy when my son got 90% marks in his e...,True,1,,affection
2,27675,1936,24h,I went to the gym this morning and did yoga.,I went to the gym this morning and did yoga.,True,1,,exercise
3,27676,206,24h,We had a serious talk with some friends of our...,We had a serious talk with some friends of our...,True,2,bonding,bonding
4,27677,6227,24h,I went with grandchildren to butterfly display...,I went with grandchildren to butterfly display...,True,1,,affection
5,27678,45,24h,I meditated last night.,I meditated last night.,True,1,leisure,leisure
6,27679,195,24h,"I made a new recipe for peasant bread, and it ...","I made a new recipe for peasant bread, and it ...",True,1,,achievement
7,27680,740,24h,I got gift from my elder brother which was rea...,I got gift from my elder brother which was rea...,True,1,,affection
8,27681,3,24h,YESTERDAY MY MOMS BIRTHDAY SO I ENJOYED,YESTERDAY MY MOMS BIRTHDAY SO I ENJOYED,True,1,,enjoy_the_moment
9,27682,4833,24h,Watching cupcake wars with my three teen children,Watching cupcake wars with my three teen children,True,1,,affection


Clearly, `cleaned_hm` is the main focus.  
`ground_truth_category` has so many missing values that it is difficult to work with.  
Luckily, we have `predicted_category`, which has 7 categories and might provide some useful information!  
The official database also provides some affiliated `.csv` files (e.g., `senselabel.csv`, `pets-dict.csv`) that might be useful later.  
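
As a quick illustration of how sparse the ground-truth labels are, the sketch below counts the labels in the ten rows displayed above only. It is not a measurement on the full dataset, and the list names are placeholders introduced for this example.

```python
from collections import Counter

# Label columns of the ten rows shown by head(10) above (not the full data).
# An empty string stands for a missing ground-truth label.
ground_truth = ["", "", "", "bonding", "", "leisure", "", "", "", ""]
predicted = ["affection", "affection", "exercise", "bonding", "affection",
             "leisure", "achievement", "affection", "enjoy_the_moment",
             "affection"]

# Count rows without a ground-truth label.
missing = sum(1 for label in ground_truth if not label)
print(f"missing ground truth: {missing} of {len(ground_truth)}")

# Frequency of each predicted category in the same ten rows.
print(Counter(predicted).most_common())
```

In this small sample, 8 of the 10 rows have no ground-truth label, while every row has a predicted category.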
  
In any case, let's process `cleaned_hm` first.  
Text preprocessing generally follows several standard steps, summarized below:    
1. Split sentences into words (`tokenization`).  
2. Tag each word's part of speech (`POS tagging`, optional).  
3. Remove punctuation and non-alphabetic tokens (e.g. numbers, whitespace).  
4. Correct spelling mistakes.  
5. Transform all words to lowercase.  
6. Apply `lemmatization` or `stemming`.  
7. Remove very short words and stopwords.  
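
To make the order of these steps concrete, here is a minimal standard-library sketch of steps 1, 3, 5 and 7 on a single sentence. It is only an approximation: the regex tokenizer and the tiny `STOPWORDS` set are stand-ins assumed for this example, and POS tagging, spell checking and lemmatization are skipped entirely.

```python
import re

# Tiny stand-in stopword set (hypothetical); the notebook uses NLTK's list.
STOPWORDS = {"i", "to", "the", "this", "and", "did", "a"}

def toy_preprocess(sentence):
    # Step 1: tokenize with a crude regex (stand-in for nltk.word_tokenize).
    tokens = re.findall(r"[A-Za-z]+|\d+|[^\w\s]", sentence)
    # Step 3: keep alphabetic tokens only.
    tokens = [t for t in tokens if t.isalpha()]
    # Step 5: lowercase.
    tokens = [t.lower() for t in tokens]
    # Step 7: drop short words and stopwords.
    return [t for t in tokens if len(t) >= 3 and t not in STOPWORDS]

result = toy_preprocess("I went to the gym this morning and did yoga.")
print(result)  # ['went', 'gym', 'morning', 'yoga']
```

Because lemmatization is skipped here, `went` survives; with a lemmatizer it would typically become `go` and then be removed as a short word.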

### Step 2 - Text cleaning

In [4]:
# tokenization and POS tagging
ori_hm = hm_data['cleaned_hm'].copy()
pos_hm = [nltk.pos_tag(nltk.word_tokenize(sent)) for sent in ori_hm]

In [5]:
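# drop tokens made up only of digits and single-character punctuation tokens
# (multi-character tokens such as "n't" or "..." are kept at this stage)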
from string import punctuation
import re
remove_hm = [[tup for tup in sent if re.search(r'\D+', tup[0]) and tup[0] not in list(punctuation)] 
             for sent in pos_hm]
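
The sketch below applies the same filter to a handful of made-up tokens to show exactly what it keeps, next to a stricter `str.isalpha()` filter. The token list is hypothetical, and the stricter filter is an alternative shown for comparison, not what this notebook uses.

```python
import re
from string import punctuation

# Hypothetical tokens, shaped like typical tokenizer output.
tokens = ["did", "n't", "90", "%", "...", "mom", "'s", "!"]

# Same condition as the cell above: keep tokens that contain at least one
# non-digit character and are not a single punctuation character.
kept_notebook = [t for t in tokens
                 if re.search(r"\D+", t) and t not in list(punctuation)]

# Stricter alternative (an assumption for comparison): letters only.
kept_alpha = [t for t in tokens if t.isalpha()]

print(kept_notebook)  # ['did', "n't", '...', 'mom', "'s"]
print(kept_alpha)     # ['did', 'mom']
```

The notebook's filter deliberately leaves contraction pieces such as `n't` and `'s` in place; they are handled later in the stopword step.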

### Step 3 - Correct the spelling mistakes

`pyenchant` is a powerful tool for dealing with spelling mistakes.

In [6]:
import enchant
d = enchant.Dict('en_US') # US English
check_hm = [[d.check(tup[0]) for tup in sent] for sent in remove_hm]
print('There are {0} spelling mistakes.'.format(len([word for sent in check_hm for word in sent if word == False])))

There are 45095 spelling mistakes.


Let's look at what kinds of spelling mistakes these are.

In [7]:
# collect the [sentence, token] index pair of every flagged token (False)
indexes = [[i,j] for i in range(len(check_hm)) for j,x in enumerate(check_hm[i]) if not x]

# randomly choose 10 of the flagged tokens
import random
random.seed(1)
for i in range(10):
    choice = random.choice(range(len(indexes)))
    idx1 = indexes[choice][0]
    idx2 = indexes[choice][1]
    print('The mistake word is: {0}.\nThe sentence is: {1} \n'.format(remove_hm[idx1][idx2][0], ori_hm[idx1]))

The mistake word is: n't.
The sentence is: Mom brought me lunch home and got me coke, which she didn't forgot me home. 

The mistake word is: ramen.
The sentence is: I ate some ramen yesterday. I don't get to eat a lot of anymore despite how cheap it is and easy to make so the fact I got to actually eat some is refreshing. 

The mistake word is: chipotle.
The sentence is: My best friend brought me a cupcake and chipotle 

The mistake word is: 's.
The sentence is: My husband and I went to my mother's house for dinner and a movie. We had rented the movie and wanted to watch it on my mom's very large TV. I teased mom and my husband for talking during the movie, but it was a great time.  

The mistake word is: n't.
The sentence is: I saw that I had a shopping credit on Amazon I had forgotten about so I was able to treat myself to a fancy skincare lotion I didn't think I could afford before. 

The mistake word is: 'd.
The sentence is: I received a necklace in the mail I'd ordered that jingl

From the random examples above, some flagged tokens are not real mistakes but pieces of contractions (e.g. `don't` is tokenized into `do` and `n't`). These pieces are meaningful and can be handled with the `POS tagging` in `nltk`. Other flagged tokens are special words that the dictionary does not contain (e.g. `chipotle`, `ramen`).
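
One way to separate the two groups is to compare each flagged token against a small set of contraction pieces. The sketch below does this for the six tokens printed above; the `CLITICS` set is an illustrative, incomplete list assumed for this example, not something provided by `nltk` or `pyenchant`.

```python
# The flagged tokens shown in the sample output above.
flagged = ["n't", "ramen", "chipotle", "'s", "n't", "'d"]

# Contraction pieces a tokenizer may produce (illustrative, incomplete set).
CLITICS = {"n't", "'s", "'d", "'ll", "'re", "'ve", "'m"}

# Split the flagged tokens into contraction pieces and everything else.
clitics = [t for t in flagged if t in CLITICS]
others = [t for t in flagged if t not in CLITICS]

print(f"{len(clitics)} contraction pieces, {len(others)} other words: {others}")
```

In this sample, four of the six flagged tokens are contraction pieces rather than misspellings.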

In [8]:
print('''It's tough to deal with such spelling problems. But luckily, the mistake rate is only {0:.2f}% which won't lead to a disaster in results.'''.
      format(100 * len([word for sent in check_hm for word in sent if word == False])/len([word for sent in check_hm for word in sent])))

It's tough to deal with such spelling problems. But luckily, the mistake rate is only 2.44% which won't lead to a disaster in results.
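
The rate is simply the number of flagged tokens divided by the total number of tokens. The sketch below shows the same computation on a toy nested list (the `check` values are made up), then works backwards from the two figures printed above to estimate the corpus size.

```python
# Toy stand-in for check_hm: one list of booleans per sentence.
check = [[True, True, False], [True], [False, True, True, True]]

# Flatten once, then count the flagged (False) tokens.
flat = [ok for sent in check for ok in sent]
n_flagged = flat.count(False)
rate = 100 * n_flagged / len(flat)
print(f"{n_flagged} of {len(flat)} tokens flagged ({rate:.2f}%)")

# Back-of-the-envelope estimate from the printed figures:
# 45095 flagged tokens at a (rounded) rate of 2.44%.
approx_total = 45095 / 0.0244
print(f"roughly {approx_total / 1e6:.2f} million tokens")
```

Since 2.44% is a rounded figure, the estimate of about 1.85 million tokens after filtering is approximate.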


### Step 4 - Transform to lowercase

In [9]:
# transform to lowercase
lower_hm = [[(tup[0].lower(), tup[1]) for tup in sent] for sent in remove_hm]

### Step 5 - Lemmatization

In [10]:
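# define a function that converts the Penn Treebank tags returned by
# nltk.pos_tag into the POS constants that the WordNet lemmatizer expects
# (any other tag falls back to NOUN, the lemmatizer's default)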
from nltk.corpus import wordnet
def get_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
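
The same mapping can also be written as a lookup on the first letter of the tag. The sketch below is a hypothetical equivalent of `get_pos`; it spells out WordNet's single-character POS constants as literals so that it runs without NLTK installed, and the names `PREFIX_TO_POS` and `get_pos_from_table` are introduced only for this example.

```python
# WordNet's POS constants are single characters; they are written out here
# as literals so the sketch does not need NLTK.
ADJ, VERB, NOUN, ADV = "a", "v", "n", "r"

# First letter of the Penn Treebank tag -> WordNet POS.
PREFIX_TO_POS = {"J": ADJ, "V": VERB, "N": NOUN, "R": ADV}

def get_pos_from_table(tag):
    # Unknown prefixes (and empty tags) fall back to NOUN, as in get_pos.
    return PREFIX_TO_POS.get(tag[:1], NOUN)

for tag in ["JJ", "VBD", "NNS", "RB", "IN"]:
    print(tag, "->", get_pos_from_table(tag))
```

The tag matters for lemmatization because the same surface form can reduce to different lemmas depending on whether it is treated as a noun or a verb.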

In [11]:
wnl = nltk.WordNetLemmatizer()
lemma_hm = [[(wnl.lemmatize(tup[0], get_pos(tup[1])), tup[1]) for tup in sent] for sent in lower_hm]

### Step 6 - Remove too short words and stopwords

In [12]:
# remove words whose length < 3 and remove stopwords
from nltk.corpus import stopwords
words = stopwords.words('english') +['happy','ago','yesterday','lot','today','month','day',
                                     'last','week','past','get','make','one','take']
stop_hm = [[tup for tup in sent 
            if len(tup[0]) >= 3 and tup[0] not in words and not re.search(r"^\'[a-zA-Z]", tup[0])] 
           for sent in lemma_hm]
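
To see what each of the three conditions removes, the sketch below runs the same filter on a few made-up tokens. The small `stop` set is a stand-in assumed for this example; the notebook itself uses NLTK's English list plus the custom words above.

```python
import re

# Small stand-in stopword set (hypothetical).
stop = {"the", "and", "happy", "day"}

# Hypothetical lemmatized tokens.
tokens = ["the", "son", "go", "'ll", "n't", "happy", "examination"]

# Same three conditions as the cell above: minimum length, not a stopword,
# and not an apostrophe-initial contraction piece.
kept = [t for t in tokens
        if len(t) >= 3 and t not in stop
        and not re.search(r"^\'[a-zA-Z]", t)]
print(kept)  # ['son', "n't", 'examination']
```

Note that `n't` passes all three conditions in this toy run: it has three characters and does not start with an apostrophe. Whether it survives in the real pipeline depends on the stopword list in use; adding it to the custom words would remove it explicitly.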

### Step 7 - Combine the processed text with the original data and export

In [13]:
hm_data['processed_hm'] = [','.join([tup[0] for tup in sent]) for sent in stop_hm]
hm_data.to_csv('../output/processed_hm.csv', index=False)
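
Because the tokens are stored as one comma-joined string per row, anything that reads the file back has to split that string again. The sketch below shows the round trip with the standard `csv` module on an in-memory buffer; the third row (`hmid` 99999 with no tokens) is a hypothetical edge case added for this example.

```python
import csv
import io

# Two rows taken from the output table, plus a hypothetical empty row.
rows = [(27674, ["son", "mark", "examination"]),
        (27675, ["gym", "morning", "yoga"]),
        (99999, [])]

# Write: join each token list into a single comma-separated field.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["hmid", "processed_hm"])
for hmid, tokens in rows:
    writer.writerow([hmid, ",".join(tokens)])

# Read back: split the field, guarding against the empty string.
buf.seek(0)
restored = {}
for rec in csv.DictReader(buf):
    text = rec["processed_hm"]
    restored[int(rec["hmid"])] = text.split(",") if text else []
print(restored)
```

The guard matters because `"".split(",")` returns `[""]` rather than an empty list. When reading the file with `pandas` instead, an empty field is parsed as `NaN` by default, so it needs similar handling.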

In [14]:
hm_data.head()

Unnamed: 0,hmid,wid,reflection_period,original_hm,cleaned_hm,modified,num_sentence,ground_truth_category,predicted_category,processed_hm
0,27673,2053,24h,I went on a successful date with someone I fel...,I went on a successful date with someone I fel...,True,1,,affection,"successful,date,someone,felt,sympathy,connection"
1,27674,2,24h,I was happy when my son got 90% marks in his e...,I was happy when my son got 90% marks in his e...,True,1,,affection,"son,mark,examination"
2,27675,1936,24h,I went to the gym this morning and did yoga.,I went to the gym this morning and did yoga.,True,1,,exercise,"gym,morning,yoga"
3,27676,206,24h,We had a serious talk with some friends of our...,We had a serious talk with some friends of our...,True,2,bonding,bonding,"serious,talk,friend,flaky,lately,understand,go..."
4,27677,6227,24h,I went with grandchildren to butterfly display...,I went with grandchildren to butterfly display...,True,1,,affection,"grandchild,butterfly,display,crohn,conservatory"
