This notebook preprocesses the 'tweet' column of `data/combined_deduped.csv` into lemmas and vectors using [spaCy](https://spacy.io).

When run top to bottom, it creates four new files in the data directory: one lemmas file named `lemmas_<UTC DATETIME>.pkl.xz` and three vectors files named `vectors_[sm|md|lg]_<UTC DATETIME>.pkl.xz`. Each file contains a compressed (xz), pickled dataframe.

The lemmas file contains separate columns for lemmas created with spaCy's `en_core_web_sm`, `en_core_web_md`, and `en_core_web_lg` models. These columns are named 'sm_lemmas', 'md_lemmas', and 'lg_lemmas'.

The vectors are written to one file per model (`en_core_web_sm`, `en_core_web_md`, and `en_core_web_lg`), named `vectors_sm_...`, `vectors_md_...`, and `vectors_lg_...`. They are kept in separate files because of their large size.

Each dataframe also contains the 'inappropriate' column from `combined_deduped.csv`.

This notebook is intended to standardize whatever preprocessing of the data becomes necessary (stop words, normalization, etc.), so that the results are reproducible and trackable.

**NOTE:** This process takes ~1.5 hours on my local machine.

To load these files:

```python
import pandas as pd

df = pd.read_pickle(filename, compression='xz')
```
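
Because each run writes its own timestamped files, it can help to pick the most recent one programmatically. The following is a minimal sketch, not part of the original notebook: the helper name `latest_file` is hypothetical, and it assumes the files follow the `<prefix>_<UTC DATETIME>.pkl.xz` naming described above, whose zero-padded timestamp sorts chronologically as plain text.

```python
from pathlib import Path


def latest_file(prefix, data_dir='data'):
    """Return the newest '<prefix>_<UTC DATETIME>.pkl.xz' path in data_dir, or None.

    Hypothetical helper: relies on the zero-padded timestamp in the file name
    sorting chronologically when file names are compared as strings.
    """
    candidates = sorted(Path(data_dir).glob(f'{prefix}_*.pkl.xz'))
    return candidates[-1] if candidates else None
```

The result can be passed straight to the loader, for example `pd.read_pickle(latest_file('vectors_md'), compression='xz')`, after checking that it is not `None`.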

File sizes (approx.):
- lemmas: 12MB
- vectors_sm: 50MB
- vectors_md: 155MB
- vectors_lg: 15MB

In [1]:
import re

from datetime import datetime

import spacy
import pandas as pd

In [2]:
# Initialize spaCy models

nlp_sm = spacy.load('en_core_web_sm')
nlp_md = spacy.load('en_core_web_md')
nlp_lg = spacy.load('en_core_web_lg')

In [3]:
# Load source training data

training_data = pd.read_csv('data/combined_deduped.csv')

## Lemmatization

In [4]:
# Preprocessing for lemmatization
# add / remove stop words, normalize, any text processing as necessary

# additional tokens to ignore
STOP_WORDS = []

is_empty_pattern = re.compile(r'^\s*$')

def make_lemmas(nlp, docs):
    """Creates a list of documents containing the lemmas of each document in the input docs.

    :param nlp: spaCy NLP model to use
    :param docs: list of documents to lemmatize

    :returns: list of lemmatized documents
    """
    lemmas = []
    for doc in nlp.pipe(docs, batch_size=500):
        doc_lemmas = []
        for token in doc:
            if (
                not token.is_stop
                and not token.is_punct
                and token.pos_ != 'PRON'
                and not is_empty_pattern.match(token.text)
                and len(token.lemma_) > 2
                and token.lemma_ not in STOP_WORDS
            ):
                doc_lemmas.append(token.lemma_)
        lemmas.append(doc_lemmas)
    return lemmas
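
The condition inside `make_lemmas` can be read as a single keep-or-drop decision per token. The sketch below is illustrative only: `keep_token` and `fake_token` are hypothetical helpers, and the `SimpleNamespace` objects are stand-ins that mimic just the spaCy token attributes the filter reads (`is_stop`, `is_punct`, `pos_`, `text`, `lemma_`), so it runs without loading a model.

```python
import re
from types import SimpleNamespace

is_empty_pattern = re.compile(r'^\s*$')


def keep_token(token, stop_words=()):
    """Return True if the token passes the same checks used in make_lemmas."""
    return (
        not token.is_stop
        and not token.is_punct
        and token.pos_ != 'PRON'
        and not is_empty_pattern.match(token.text)
        and len(token.lemma_) > 2
        and token.lemma_ not in stop_words
    )


def fake_token(text, lemma, pos='NOUN', is_stop=False, is_punct=False):
    """Build a stand-in object with only the attributes keep_token reads."""
    return SimpleNamespace(text=text, lemma_=lemma, pos_=pos,
                           is_stop=is_stop, is_punct=is_punct)


tokens = [
    fake_token('parties', 'party'),
    fake_token('the', 'the', pos='DET', is_stop=True),   # stop word
    fake_token('!', '!', pos='PUNCT', is_punct=True),    # punctuation
    fake_token('go', 'go', pos='VERB'),                  # lemma shorter than 3 characters
    fake_token('ruled', 'rule', pos='VERB'),
]
kept = [t.lemma_ for t in tokens if keep_token(t)]
print(kept)  # ['party', 'rule']
```

Note that the length check drops every lemma of one or two characters, whether or not it is a stop word.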

In [5]:
# Create lemmas with each of the spaCy models

lemmas = pd.DataFrame()

print('Lemmatization with en-core-web-sm')
lemmas['sm_lemmas'] = make_lemmas(nlp_sm, training_data['tweet'])

print('Lemmatization with en-core-web-md')
lemmas['md_lemmas'] = make_lemmas(nlp_md, training_data['tweet'])

print('Lemmatization with en-core-web-lg')
lemmas['lg_lemmas'] = make_lemmas(nlp_lg, training_data['tweet'])

lemmas['inappropriate'] = training_data['inappropriate']

Lemmatization with en-core-web-sm
Lemmatization with en-core-web-md
Lemmatization with en-core-web-lg


In [6]:
# Quick glance verification of proper output

print('Training data shape:', training_data.shape)
print('Lemmas shape:', lemmas.shape)
print()
lemmas.head()

Training data shape: (146264, 2)
Lemmas shape: (146264, 4)



Unnamed: 0,sm_lemmas,md_lemmas,lg_lemmas,inappropriate
0,"[beat, Dr., Dre, urbeat, Wired, Ear, Headphone...","[beat, Dr., Dre, urbeat, Wired, ear, Headphone...","[beat, Dr., Dre, urBeats, wire, Ear, Headphone...",True
1,"[@Papapishu, man, fucking, rule, party, perpet...","[@Papapishu, man, fucking, rule, party, perpet...","[@Papapishu, man, fucking, rule, party, perpet...",True
2,"[time, draw, close, 128591;&#127995, Father, d...","[time, draw, close, 128591;&#127995, Father, d...","[time, draw, close, 128591;&#127995, Father, d...",False
3,"[notice, start, act, different, distant, peep,...","[notice, start, act, different, distant, peep,...","[notice, start, act, different, distant, peep,...",False
4,"[forget, unfollower, believe, grow, new, follo...","[forget, unfollower, believe, grow, new, follo...","[forget, unfollower, believe, grow, new, follo...",False
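
The sample rows suggest the three models mostly agree but not always (row 0 has 'urbeat' vs. 'urBeats' and 'Wired' vs. 'wire'). One way to quantify that is to flag the rows where two lemma columns differ. This is a rough sketch on a toy dataframe whose values are abbreviated from the rows above; the name `differs` is an assumption, and the same comparison could be run on the real `lemmas` dataframe.

```python
import pandas as pd

# Toy stand-in for the real `lemmas` dataframe (two rows, two models).
toy_lemmas = pd.DataFrame({
    'sm_lemmas': [['beat', 'urbeat', 'Wired'], ['time', 'draw', 'close']],
    'lg_lemmas': [['beat', 'urBeats', 'wire'], ['time', 'draw', 'close']],
})

# True where the two models produced different lemma lists for the same tweet.
differs = pd.Series(
    [sm != lg for sm, lg in zip(toy_lemmas['sm_lemmas'], toy_lemmas['lg_lemmas'])],
    index=toy_lemmas.index,
)
print(differs.tolist())  # [True, False]
print(differs.mean())    # 0.5
```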


## Vectorization

In [7]:
def make_vectors(nlp, docs):
    """Creates a list of documents containing the vectors of each document in the input docs.

    :param nlp: spaCy NLP model to use
    :param docs: list of documents to vectorize

    :returns: list of vectorized documents
    """
    vectors = []
    for doc in nlp.pipe(docs, batch_size=500):
        vectors.append(doc.vector)
    return vectors
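
By default, spaCy's `Doc.vector` is the average of the document's token vectors, so each tweet becomes a single fixed-length row regardless of how many tokens it has. The arithmetic can be sketched with NumPy alone; the three 4-dimensional token vectors below are made-up values for illustration, not output from any spaCy model.

```python
import numpy as np

# One row per token, one column per vector dimension (made-up values).
token_vectors = np.array([
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 4.0, 1.0, 2.0],
])

# Averaging over the token axis yields one vector for the whole document.
doc_vector = token_vectors.mean(axis=0)
print(doc_vector)  # [2. 2. 1. 2.]
```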

In [8]:
# Create vectors of the documents with each of the spaCy models

print('Vectorization with en-core-web-sm')
vectors_sm = pd.DataFrame(make_vectors(nlp_sm, training_data['tweet']))
vectors_sm['inappropriate'] = training_data['inappropriate']
print(' - shape:', vectors_sm.shape)

print('Vectorization with en-core-web-md')
vectors_md = pd.DataFrame(make_vectors(nlp_md, training_data['tweet']))
vectors_md['inappropriate'] = training_data['inappropriate']
print(' - shape:', vectors_md.shape)

print('Vectorization with en-core-web-lg')
vectors_lg = pd.DataFrame(make_vectors(nlp_lg, training_data['tweet']))
vectors_lg['inappropriate'] = training_data['inappropriate']
print(' - shape:', vectors_lg.shape)


Vectorization with en-core-web-sm
 - shape: (146264, 97)
Vectorization with en-core-web-md
 - shape: (146264, 301)
Vectorization with en-core-web-lg
 - shape: (146264, 301)
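
Each shape above is the vector width plus one, since the 'inappropriate' label is appended as the last column (96 + 1 for the small model, 300 + 1 for the medium and large models). Anything that consumes these dataframes has to split the label off again. A minimal sketch, assuming a toy dataframe with the same layout; `split_features_labels` is a hypothetical helper name:

```python
import pandas as pd


def split_features_labels(df, label_column='inappropriate'):
    """Separate the vector columns (X) from the label column (y)."""
    X = df.drop(columns=[label_column])
    y = df[label_column]
    return X, y


# Toy dataframe: two documents with 3-dimensional vectors plus the label.
toy = pd.DataFrame([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
toy['inappropriate'] = [True, False]

X, y = split_features_labels(toy)
print(X.shape, y.shape)  # (2, 3) (2,)
```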


## Export lemmas and vectors

In [9]:
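# UTC datetime stamp to correlate lemmas and vectors to a specific run
# (datetime.utcnow() is deprecated as of Python 3.12; a timezone-aware UTC datetime gives the same string)
from datetime import timezone

utc_now_formatted = datetime.now(timezone.utc).strftime(r'%Y-%m-%d-%H-%M-%SZ')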

lemmas_filename = f'data/lemmas_{utc_now_formatted}.pkl.xz'
vectors_sm_filename = f'data/vectors_sm_{utc_now_formatted}.pkl.xz'
vectors_md_filename = f'data/vectors_md_{utc_now_formatted}.pkl.xz'
vectors_lg_filename = f'data/vectors_lg_{utc_now_formatted}.pkl.xz'

print(f'Creating {lemmas_filename}')
lemmas.to_pickle(lemmas_filename, compression='xz')

print(f'Creating {vectors_sm_filename}')
vectors_sm.to_pickle(vectors_sm_filename, compression='xz')

print(f'Creating {vectors_md_filename}')
vectors_md.to_pickle(vectors_md_filename, compression='xz')

print(f'Creating {vectors_lg_filename}')
vectors_lg.to_pickle(vectors_lg_filename, compression='xz')

Creating data/lemmas_2020-05-04-16-27-18Z.pkl.xz
Creating data/vectors_sm_2020-05-04-16-27-18Z.pkl.xz
Creating data/vectors_md_2020-05-04-16-27-18Z.pkl.xz
Creating data/vectors_lg_2020-05-04-16-27-18Z.pkl.xz
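
A quick round trip can confirm that the export settings used above (`to_pickle` with `compression='xz'`) restore the list-valued lemma columns unchanged. This is a minimal sketch on a toy dataframe written to a temporary directory; the file name and the data are made up for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

original = pd.DataFrame({
    'sm_lemmas': [['beat', 'Dre'], ['time', 'draw', 'close']],
    'inappropriate': [True, False],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / 'lemmas_example.pkl.xz'
    original.to_pickle(path, compression='xz')
    restored = pd.read_pickle(path, compression='xz')

# Compare as plain Python lists so the nested lemma lists are checked element by element.
lists_match = restored['sm_lemmas'].tolist() == original['sm_lemmas'].tolist()
print(lists_match)  # True
```

Pickle keeps each lemma list as a Python list, whereas a CSV export would turn those lists into strings.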
