# Introduction

The purpose of this analysis is to find the products that are mentioned most often in the dataset.  

The most frequently mentioned products are presumed to contribute more than others to people's happiness.

**I'll use KOKO, a rule-based entity extraction system, to perform the task.**  

KOKO allows users to specify conditions of desirable entities with a declarative language (see [KOKO syntax](#koko_syntax)).
Each condition has a weight associated with it. 
The weight *w* of a condition *cond* is added to the score of an entity *ent*, 
whenever an instance of *ent* satisfies *cond* in the dataset.  

Finally, KOKO returns the entities whose scores are higher than a user-specified threshold.  
Because extraction is score-based, KOKO is especially suitable for entity extraction when evidence in the corpus is limited (e.g., extracting cafe names from only one or a few blogs). 

**The whole analysis described in this notebook comprises the following steps:**  

- Data preprocessing: the HappyDB dataset is read and converted to a text file that serves as input to KOKO.
- Information extraction: a KOKO query is written and evaluated, extracting product names in the dataset.

# 1. Data preprocessing

First, let's load the data and take a look at the happy moments inside.  

## Load HappyDB

In [30]:
import pandas as pds

data = pds.read_csv('./data/cleaned_hm.csv')
data.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1.1.1,hmid,wid,reflection_period,cleaned_hm,original_hm,modified
0,0,0,0,3833.0,31526.0,8962.0,24h,I found a silver coin from 1852 buried in the ...,I found a silver coin from 1852 buried in the ...,False
1,1,1,1,9336.0,37050.0,10252.0,24h,"This one is pretty minuscule, we had to go to ...","This one is pretty minuscule, we had to go to ...",False
2,2,2,2,18454.0,46196.0,586.0,24h,The word problem is never part of a happy pers...,The word aproblema is never part of a happy pe...,False
3,3,3,3,22355.0,50108.0,409.0,24h,I started studying bagavthgeetha .,I started studying bagavthgeetha .,True
4,4,4,4,27323.0,55093.0,731.0,24h,i go to old age home and service enjoy the mom...,i go to oldage home and service enjoy the moment.,False


Within the dataset, the most interesting part, which is also the input to our analysis, is the 'cleaned_hm' column.  

'cleaned_hm' stands for "cleaned happy moments". Let's take a look.

In [31]:
data_clean = data['cleaned_hm']
data_clean.head()

0    I found a silver coin from 1852 buried in the ...
1    This one is pretty minuscule, we had to go to ...
2    The word problem is never part of a happy pers...
3                 I started  studying  bagavthgeetha .
4    i go to old age home and service enjoy the mom...
Name: cleaned_hm, dtype: object
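
KOKO reads its corpus from a text file, so the 'cleaned_hm' column has to be written out before a query can run. That step is not shown in this notebook; the following is only a minimal sketch of how it might look, assuming one happy moment per line. The helper name `write_moments`, the sample data, and the output file name are all hypothetical.

```python
import os
import tempfile

import pandas as pd


def write_moments(moments, path):
    """Write one happy moment per line to a UTF-8 text file (assumed format)."""
    # Drop missing values and collapse internal whitespace/newlines so that
    # each moment occupies exactly one line in the output file.
    cleaned = moments.dropna().astype(str).map(lambda s: " ".join(s.split()))
    cleaned = cleaned[cleaned != ""]
    with open(path, "w", encoding="utf-8") as out:
        for moment in cleaned:
            out.write(moment + "\n")
    return len(cleaned)


# Toy stand-in for data['cleaned_hm']; the real column would be passed instead.
sample = pd.Series([
    "I bought a new car.",
    None,
    "I started  studying\nbagavthgeetha .",
])

# Hypothetical output location (a temporary directory keeps the demo harmless).
out_path = os.path.join(tempfile.mkdtemp(), "happyDB_clean_sample.txt")
n_written = write_moments(sample, out_path)
```

Collapsing whitespace is a design choice of this sketch: it keeps a moment that contains line breaks from being split across several lines of the corpus.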

## Identify purchasing-related moments

I used Python to identify happy moments related to purchasing behavior among the first 100 entries, 
i.e., moments that contain the keywords 'buy', 'bought' or 'purchase'.  

This process helped me understand the patterns in which products appear in happy moments, and facilitated condition specification in later steps.

In [32]:
num_moments = 100  # only scan the first 100 happy moments
assert num_moments < data_clean.size
keywords = ('buy', 'bought', 'purchase')

print('Happy moments involving purchasing:\n')
for i in range(num_moments):
    moment = data_clean.iloc[i]
    # Case-sensitive substring test against each keyword
    if any(keyword in moment for keyword in keywords):
        print("{}: {}".format(i, moment))

Happy moments involving purchasing:

38: I waited patiantly for my income tax return and finally received it. And no i haven't went rogue with impulse purchases. I just did  the noble thing to do and take care of home, made a couple of investments and put the rest in my bank account.
45: I am happy when i purchase a new vehicle 
94: My husband purchased a new nan for me.


<a id='koko_syntax'></a>
# 2. Introduction to KOKO

Before using KOKO to extract products from HappyDB, I'll give a brief introduction of KOKO's query language.

**Here's the query *Q_prod* that I'll use to extract products.**

In [33]:
with open('./product_names/koko_queries/products_v3.koko', 'r') as query:
    print(query.read())

extract "Ngrams(1,1)" x from "/Users/chen/Research/Playground/Github_Playground/happydb/data/happyDB_clean.txt" if
		   ("bought a new" x {0.01}) or
		   ("bought a few" x {0.01}) or
		   ("bought some" x {0.01}) or
		   ("bought a" x {0.01}) or
		   ("bought" x {0.01}) or		   
		   (x "I bought" {0.01}) or		   
		   ("purchase a new" x {0.01}) or
		   ("purchase a few" x {0.01}) or
		   ("purchase some" x {0.01}) or
		   ("purchase a" x {0.01}) or
		   ("purchase" x {0.01}) or		   
		   (x "I purchase" {0.01})
with threshold 0.0
excluding (str(x) matches ".*(new|NEW|few).*")
excluding (str(x) matches ".*(,|\.|;|!|\$|\(|\)|-).*")
excluding (str(x) matches ".*[0-9]+.*")
excluding (str(x) matches ".*(and|or|so).*")
excluding (str(x) matches ".*(\,|\.|\;|\!|\$|\(|\)|\-).*")
excluding (str(x) matches ".*(month|week|year|day|night|today).*")



To understand what *Q_prod* specifies, let's take a look at the syntax of KOKO.

**A KOKO query typically has the following syntax:**

**extract** ⟨keyword⟩ (x) **from** ⟨document name⟩ **if**  
⟨condition⟩  
(**with threshold** ⟨threshold⟩)  
[**excluding** ⟨e-condition⟩]

where the conditions are defined as follows:

*⟨condition⟩ ::= ⟨condition⟩ or ⟨condition⟩ |*  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
              *(x {⟨string⟩} ⟨weight⟩) |*  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
              *(x ⟨string⟩ ⟨weight⟩) |*  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
              *(x near ⟨string⟩ ⟨weight⟩) |*  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
              *(str(x) matches ⟨pattern⟩) |*  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
              *(str(x) [contains|mentions] {⟨string⟩} ⟨weight⟩) |*  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
              *(str(x) [contains|mentions] ⟨string⟩ ⟨weight⟩)*
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;  
*⟨weight⟩ ::= empty | number in [0,1]*
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;  
*⟨threshold⟩ ::= number in [0,1]*
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;  
*⟨pattern⟩ ::= ⟨regular expression⟩*

Specifically, a KOKO query extracts entities *x* of type ⟨keyword⟩ from the document ⟨document name⟩ 
if the score of *x* exceeds the threshold ⟨threshold⟩.  

The score of *x* is a running sum: each time an instance of *x* in ⟨document name⟩ matches one of the conditions in ⟨condition⟩, that condition's weight is added to the score.
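
To make the scoring rule concrete, here is a toy re-implementation in plain Python. It is not KOKO's code and does not try to reproduce KOKO's actual score values; it handles only conditions of the form ("phrase" x {weight}) over single-word candidates. The function name `score_unigrams` and the sample sentences are hypothetical.

```python
import re
from collections import defaultdict


def score_unigrams(sentences, conditions, threshold=0.0):
    """Toy scorer: add a condition's weight each time a word follows its phrase."""
    scores = defaultdict(float)
    for sentence in sentences:
        for phrase, weight in conditions:
            # Match the phrase followed by whitespace, and capture the next word.
            pattern = r"\b" + re.escape(phrase) + r"\s+(\w+)"
            for match in re.finditer(pattern, sentence):
                scores[match.group(1)] += weight
    # Keep only the candidates whose accumulated score exceeds the threshold.
    return {ent: s for ent, s in scores.items() if s > threshold}


conditions = [("bought a new", 0.01), ("bought a", 0.01), ("bought", 0.01)]
sentences = [
    "I bought a new car yesterday.",
    "My wife bought a car.",
    "We bought flowers.",
]
scores = score_unigrams(sentences, conditions)
```

In this toy run, "car" collects weight from two sentences, but so does "a" (via the "bought" condition), and "new" is picked up by "bought a". Words like these are what the "excluding" clauses of a query are meant to filter out.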

**Let's use the query *Q_prod* presented above as an example.**

In *Q_prod*, the ⟨keyword⟩ is "Ngrams(1,1)", which means all the unigrams (single words) in the document.  
We can also use "Ents" for named entities, or "Nps" for noun phrases.  

There are twelve conditions in ⟨condition⟩. For example, ("bought a new" x {0.01}) means that all entities *x* with a preceding string of "bought a new" will have their score increased by 0.01 -- i.e., the weight of the condition.  

The first "excluding" clause specifies that a matching entity must not be a word containing "new", "NEW", or "few". Given "bought a new car" or "bought a few cars", for instance, we are interested in "car" rather than in "new" or "few".
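
The effect of the "excluding" clauses can be previewed with Python's `re` module. This sketch assumes that KOKO's `matches` behaves like a full-string regular expression match (`re.fullmatch`); since every pattern is wrapped in `.*`, a substring search would give the same answers. The candidate words are made up for illustration, and the punctuation pattern, which *Q_prod* lists twice with different escaping, appears once.

```python
import re

# Exclusion patterns copied from Q_prod (duplicate punctuation pattern omitted).
EXCLUDE_PATTERNS = [
    r".*(new|NEW|few).*",
    r".*(,|\.|;|!|\$|\(|\)|-).*",
    r".*[0-9]+.*",
    r".*(and|or|so).*",
    r".*(month|week|year|day|night|today).*",
]


def is_excluded(word):
    """True if the word matches any exclusion pattern in full."""
    return any(re.fullmatch(p, word) for p in EXCLUDE_PATTERNS)


# Hypothetical candidate unigrams
candidates = ["car", "new", "sofa", "door", "fitbit", "2nd", "birthday", "laptop"]
kept = [w for w in candidates if not is_excluded(w)]
```

Under that assumption, `kept` contains only "car", "fitbit", and "laptop". Because the patterns test substrings, ".*(and|or|so).*" also removes plausible products such as "sofa" and "door", and the time-word pattern removes "birthday". Anchoring such patterns to whole words would avoid this, at the cost of a longer query.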

# 3. Entity extraction with KOKO

Now we are ready to run the query for product extraction.  

First, we need to install the KOKO package.

## Install KOKO

To install KOKO locally, simply run the following command:

    pip install pykoko

## Run KOKO

After KOKO is installed, we can run the example query *Q_prod*.  

Considering the size of the dataset, it might take several minutes to get results.

In [34]:
import koko

koko.run('./product_names/koko_queries/products_v3.koko')

Parsed query: extract "/Users/chen/Research/Playground/Github_Playground/happydb/data/happyDB_clean.txt" Ngrams(1,1) from "x" if
	("bought a new" x { 0.01 }) or
	("bought a few" x { 0.01 }) or
	("bought some" x { 0.01 }) or
	("bought a" x { 0.01 }) or
	("bought" x { 0.01 }) or
	(x "I bought" { 0.01 }) or
	("purchase a new" x { 0.01 }) or
	("purchase a few" x { 0.01 }) or
	("purchase some" x { 0.01 }) or
	("purchase a" x { 0.01 }) or
	("purchase" x { 0.01 }) or
	(x "I purchase" { 0.01 })   
with threshold 0.00
excluding
	(str(x) matches ".*(new|NEW|few).*")
	(str(x) matches ".*(,|\.|;|!|\$|\(|\)|-).*")
	(str(x) matches ".*[0-9]+.*")
	(str(x) matches ".*(and|or|so).*")
	(str(x) matches ".*(\,|\.|\;|\!|\$|\(|\)|\-).*")
	(str(x) matches ".*(month|week|year|day|night|today).*")


Results:

Entity name                                        Entity score
a                                                  1.000000
car                                                1.000000
me                  

excellent                                          0.010000
Shares                                             0.010000
rise                                               0.010000
implemented                                        0.010000
nan                                                0.010000
medicine                                           0.010000
parents                                            0.010000
Persimmon                                          0.010000
top                                                0.010000
freshly                                            0.010000
marked                                             0.010000
saw                                                0.010000
pounds                                             0.010000
fitbit                                             0.010000
limited                                            0.010000
material                                           0.010000
first                                   

The results show that expensive purchases, such as cars, houses or laptops, are mentioned most in HappyDB.

# 4. Conclusion

KOKO is easy to use, and it extracts qualifying entities both effectively and efficiently.