In this Jupyter notebook, we try to assign categories to the transaction text. Once that works, we hope to turn the notebook into a script.


# Setting Up

In [31]:
import pandas as pd
import re

In [2]:
data = pd.read_csv('Data/transactions.csv')
data = data.drop(['Unnamed: 0', 'Unnamed: 7'], axis=1)
data['Date'] = pd.to_datetime(data['Date'])

In [3]:
data.head()

| | payer | receiver | tran_text | Date | dow |
|---|---|---|---|---|---|
| 0 | abacbefabc833a96597289665268d99e58ee1b2e4550dd... | 9c044a510cd7333bda6da6a2a364372974a0a8a573ac3f... | :Italy: | 2019-01-21 | Monday |
| 1 | f0bf2b3ec1598e6ab6ef15ea0251328ca3731e237bdb49... | abacbefabc833a96597289665268d99e58ee1b2e4550dd... | Heroics | 2019-01-19 | Saturday |
| 2 | abacbefabc833a96597289665268d99e58ee1b2e4550dd... | 523c406e16c2856eefb065c2ace781a81a608b6e826ca4... | To regain Kosovo | 2019-01-18 | Friday |
| 3 | f0bf2b3ec1598e6ab6ef15ea0251328ca3731e237bdb49... | abacbefabc833a96597289665268d99e58ee1b2e4550dd... | Reading at a club | 2019-01-18 | Friday |
| 4 | abacbefabc833a96597289665268d99e58ee1b2e4550dd... | f0bf2b3ec1598e6ab6ef15ea0251328ca3731e237bdb49... | For the Culture | 2019-01-17 | Thursday |


In [7]:
data['tran_text'][:30]

0                                     :Italy:
1                                     Heroics
2                            To regain Kosovo
3                           Reading at a club
4                             For the Culture
5                           Man with the plan
6                                       Movie
7                               Irish delight
8                                :wine_glass:
9                              Biryani galore
10                            Testing reasons
11                                   yehdodnw
12                               GiVe Me FoDd
13                              Carbohydrates
14                                  Hnxodnnxj
15                               Boiled meats
16                                      Movie
17                 walk a flock a ! thank u !
18                                 Snail Thai
19                                          E
20                                       No u
21                                

# Basic Text Categorization

As we can see, human language is complicated, and it is made harder to process by the slang and acronyms that many users include in their writing.

However, we can still extract meaningful data from text using a Python library called [SpaCy](https://spacy.io/usage/spacy-101). We will also use another library called NLTK.

Here are some assumptions that we make for simplicity:
* people mean what they write (they are not lying about what they sent money for)
* an emoji indicates what the person is buying
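
Under the second assumption, an emoji shortcode such as `:wine_glass:` carries the purchase information in its name. As a rough sketch (the helper name `shortcode_to_words` and the pattern are illustrative assumptions, not part of the original notebook), the name can be pulled out and turned into plain words before any further analysis:

```python
import re

# Matches a string that consists only of one shortcode, e.g. ":wine_glass:" or ":Italy:".
# The allowed characters (word characters, "+" and "-") are an assumption about the data.
SHORTCODE = re.compile(r"^:([\w+-]+):$")


def shortcode_to_words(text):
    """Return the words inside an emoji shortcode, or None if text is not a shortcode."""
    match = SHORTCODE.match(text.strip())
    if match is None:
        return None
    # Underscores and hyphens separate words inside a shortcode name.
    return match.group(1).replace("_", " ").replace("-", " ")
```

Capitalization is left untouched here, since lowercasing a name like `Italy` could make it harder for a model to recognize it as a place.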

To install SpaCy, run:

```
pip install -U spacy
```

We also want to download a pretrained model to work with. See [here](https://spacy.io/usage/models) for more information on installing models.

```
python -m spacy download en_core_web_sm
```

In [12]:
import spacy

# load the english model
nlp = spacy.load('en_core_web_sm')

We can use SpaCy for named entity analysis.

See [here](https://spacy.io/usage/spacy-101#annotations-ner) for more information about named entities.

See [here](https://spacy.io/api/annotation#named-entities) for the different categories of outputs.

In [20]:
# Example of how to use this
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY


In [38]:
def find_name_entities(series):
    """
    Print the named-entity labels that SpaCy finds in each text of the series.
    
    :param series: series of text
    :type  series: pandas.Series
    """
    
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for _, val in series.items():
        # Replace each run of non-alphanumeric characters, and each underscore, with a space
        val = re.sub(r'\W+|_', ' ', val)
        doc = nlp(val)
        print("Text is:", val)
        for ent in doc.ents:
            print("Named entity is:", ent.label_)
        print('\n')

In [39]:
find_name_entities(data['tran_text'][:30])

Text is:  Italy 
Named entity is: GPE


Text is: Heroics


Text is: To regain Kosovo
Named entity is: GPE


Text is: Reading at a club


Text is: For the Culture


Text is: Man with the plan


Text is: Movie


Text is: Irish delight
Named entity is: NORP


Text is:  wine glass 


Text is: Biryani galore


Text is: Testing reasons


Text is: yehdodnw


Text is: GiVe Me FoDd


Text is: Carbohydrates


Text is: Hnxodnnxj
Named entity is: GPE


Text is: Boiled meats


Text is: Movie


Text is: walk a flock a thank u 


Text is: Snail Thai
Named entity is: PERSON


Text is: E


Text is: No u


Text is: Tenga
Named entity is: GPE


Text is: Venom


Text is:  South Korea 
Named entity is: GPE


Text is: Why isnt venmo used as social media 


Text is: Anra Bday


Text is: Food 


Text is: Bday dinner


Text is: Food


Text is: Somber religious meditation 




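As we can see, SpaCy does not output a named entity for every text. This is mainly because many of the texts are too short to give the model any context, and it also mislabels some of them (for example, the gibberish "Hnxodnnxj" is tagged as GPE and "Snail Thai" as PERSON). Still, named entity recognition is useful for picking out proper nouns such as place names.
In the next part, we will do more complex analysis with NLTK, which should be more useful for classifying common words but not very useful for analyzing proper nouns.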
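
Because the goal is to turn this notebook into a script, it may help to return the entity labels instead of printing them. The following is a minimal sketch, not the notebook's own code: `collect_entity_labels` is a hypothetical helper, and it takes the pipeline as an `nlp` argument (assumed to be the object returned by `spacy.load('en_core_web_sm')`) so that it does not rely on a global variable.

```python
import re


def collect_entity_labels(texts, nlp):
    """Return a list of (cleaned_text, labels) pairs, one per input text.

    `nlp` is assumed to be a callable that returns an object with an `ents`
    attribute, where each entity exposes a `label_` string (as a SpaCy
    pipeline does).
    """
    results = []
    for text in texts:
        # Same cleaning step as in find_name_entities
        cleaned = re.sub(r'\W+|_', ' ', text)
        doc = nlp(cleaned)
        labels = [ent.label_ for ent in doc.ents]
        results.append((cleaned, labels))
    return results
```

Called as `collect_entity_labels(data['tran_text'][:30], nlp)`, this would give one pair per transaction, and texts with an empty label list are the ones left for the later NLTK analysis.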