This notebook preprocesses the 'tweet' column of `data/combined_deduped.csv` into lemmas and vectors using [spaCy](https://spacy.io).

When run top to bottom, it creates four new CSV files in the data directory: one file with the lemmas of the documents, named `lemmas_<UTC DATETIME>.csv`, and three files with their vectors, named `vectors_[sm|md|lg]_<UTC DATETIME>.csv`.

The lemmas CSV file contains separate columns for lemmas created with spaCy's `en-core-web-sm`, `en-core-web-md`, and `en-core-web-lg` models. These columns are named 'sm_lemmas', 'md_lemmas', and 'lg_lemmas'.

The vectors are written to one CSV file per model (`en-core-web-sm`, `en-core-web-md`, and `en-core-web-lg`). These files are named 'vectors_sm_...', 'vectors_md_...', and 'vectors_lg_...'. The vectors are kept in separate files because of their large size.

Each CSV file also contains the 'inappropriate' column from `combined_deduped.csv`.
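
Because the timestamp fields are zero-padded and ordered from year down to second, file names from different runs sort chronologically as plain strings. A minimal sketch of picking out the files from the most recent run, assuming the naming scheme described above; `latest_run_files` and `RUN_FILE` are hypothetical names and are not defined anywhere in this notebook:

```python
import re
from pathlib import Path

# Matches lemmas_<stamp>.csv and vectors_[sm|md|lg]_<stamp>.csv
RUN_FILE = re.compile(
    r'^(lemmas|vectors_(?:sm|md|lg))_(\d{4}-\d{2}-\d{2}-\d{2}-\d{2}-\d{2}Z)\.csv$'
)

def latest_run_files(filenames):
    """Return {kind: filename} for the most recent timestamp found."""
    runs = {}
    for name in filenames:
        match = RUN_FILE.match(Path(name).name)
        if match:
            kind, stamp = match.groups()
            runs.setdefault(stamp, {})[kind] = name
    if not runs:
        return {}
    # Zero-padded stamps sort chronologically as strings
    return runs[max(runs)]

# Usage: latest_run_files(str(p) for p in Path('data').glob('*.csv'))
```

Files that do not follow the naming scheme, such as `combined_deduped.csv`, are ignored.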

This notebook is intended to standardize whatever preprocessing of the data becomes necessary (stop words, normalization, etc.) so that the results are reproducible and trackable.

**NOTE:** This process takes 3+ hours on my local machine.

CSV file sizes (approx.):
- lemmas: 44MB
- vectors_sm: 173MB
- vectors_md: 661MB
- vectors_lg: 661MB

In [1]:
import re

from datetime import datetime

import spacy
import pandas as pd

In [2]:
# Initialize spaCy models

nlp_sm = spacy.load('en_core_web_sm')
nlp_md = spacy.load('en_core_web_md')
nlp_lg = spacy.load('en_core_web_lg')

In [3]:
# Load source training data

training_data = pd.read_csv('data/combined_deduped.csv')

## Lemmatization

In [4]:
# Preprocessing for lemmatization
# add / remove stop words, normalize, any text processing as necessary

# additional tokens to ignore
STOP_WORDS = []

is_empty_pattern = re.compile(r'^\s*$')

def make_lemmas(nlp, docs):
    """Creates a list of documents containing the lemmas of each document in the input docs.

    :param nlp: spaCy NLP model to use
    :param docs: list of documents to lemmatize

    :returns: list of lemmatized documents
    """
    lemmas = []
    for doc in nlp.pipe(docs, batch_size=500):
        doc_lemmas = []
        for token in doc:
            if (
                not token.is_stop
                and not token.is_punct
                and token.pos_ != 'PRON'
                and not is_empty_pattern.match(token.text)
                and len(token.lemma_) > 2
                and token.lemma_ not in STOP_WORDS
            ):
                doc_lemmas.append(token.lemma_)
        lemmas.append(doc_lemmas)
    return lemmas
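
The filter inside `make_lemmas` can be exercised without loading a model. The sketch below is only an illustration: `FakeToken` is a hypothetical stand-in that mimics the few spaCy token attributes the filter reads, `keep_token` restates the condition from the function above, and the sample tokens are made up.

```python
import re
from dataclasses import dataclass

STOP_WORDS = []
is_empty_pattern = re.compile(r'^\s*$')

@dataclass
class FakeToken:
    """Hypothetical stand-in for a spaCy token (not a spaCy class)."""
    text: str
    lemma_: str
    pos_: str = 'NOUN'
    is_stop: bool = False
    is_punct: bool = False

def keep_token(token):
    """Same condition as the filter in make_lemmas."""
    return (
        not token.is_stop
        and not token.is_punct
        and token.pos_ != 'PRON'
        and not is_empty_pattern.match(token.text)
        and len(token.lemma_) > 2
        and token.lemma_ not in STOP_WORDS
    )

tokens = [
    FakeToken('The', 'the', pos_='DET', is_stop=True),   # stop word: dropped
    FakeToken('dogs', 'dog'),                            # kept
    FakeToken('!', '!', pos_='PUNCT', is_punct=True),    # punctuation: dropped
    FakeToken('  ', '  ', pos_='SPACE'),                 # whitespace only: dropped
    FakeToken('go', 'go', pos_='VERB'),                  # lemma too short: dropped
    FakeToken('running', 'run', pos_='VERB'),            # kept
]
kept = [t.lemma_ for t in tokens if keep_token(t)]
```

Note that the `len(token.lemma_) > 2` rule also drops short lemmas such as 'go', which may or may not be desirable for this data set.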

In [5]:
# Create lemmas with each of the spaCy models

lemmas = pd.DataFrame()

print('Lemmatization with en-core-web-sm')
lemmas['sm_lemmas'] = make_lemmas(nlp_sm, training_data['tweet'])

print('Lemmatization with en-core-web-md')
lemmas['md_lemmas'] = make_lemmas(nlp_md, training_data['tweet'])

print('Lemmatization with en-core-web-lg')
lemmas['lg_lemmas'] = make_lemmas(nlp_lg, training_data['tweet'])

lemmas['inappropriate'] = training_data['inappropriate']

Lemmatization with en-core-web-sm
Lemmatization with en-core-web-md
Lemmatization with en-core-web-lg


In [6]:
# Quick glance verification of proper output

print('Training data shape:', training_data.shape)
print('Lemmas shape:', lemmas.shape)
print()
lemmas.head()

Training data shape: (146264, 2)
Lemmas shape: (146264, 4)



| | sm_lemmas | md_lemmas | lg_lemmas | inappropriate |
|---|---|---|---|---|
| 0 | [beat, Dr., Dre, urbeat, Wired, Ear, Headphone... | [beat, Dr., Dre, urbeat, Wired, ear, Headphone... | [beat, Dr., Dre, urBeats, wire, Ear, Headphone... | True |
| 1 | [@Papapishu, man, fucking, rule, party, perpet... | [@Papapishu, man, fucking, rule, party, perpet... | [@Papapishu, man, fucking, rule, party, perpet... | True |
| 2 | [time, draw, close, 128591;&#127995, Father, d... | [time, draw, close, 128591;&#127995, Father, d... | [time, draw, close, 128591;&#127995, Father, d... | False |
| 3 | [notice, start, act, different, distant, peep,... | [notice, start, act, different, distant, peep,... | [notice, start, act, different, distant, peep,... | False |
| 4 | [forget, unfollower, believe, grow, new, follo... | [forget, unfollower, believe, grow, new, follo... | [forget, unfollower, believe, grow, new, follo... | False |


### Export Lemmas

The timestamp is created at this point so that the same value can be used when exporting both the lemmas and the vectors.

The lemmas are exported to CSV at this point because saving to CSV and reloading that CSV with pandas turns each document's list of lemmas into a single string (the printed form of the list), which spaCy can accept as input. This is the simplest method (quickest fix) I found to avoid a type error when vectorizing these lemmas.

In [17]:
# UTC datetime stamp to correlate lemmas and vectors to a specific run

utc_now_formatted = datetime.utcnow().strftime(r'%Y-%m-%d-%H-%M-%SZ')

lemmas_filename = f'data/lemmas_{utc_now_formatted}.csv'

print(f'Creating {lemmas_filename}')
lemmas.to_csv(lemmas_filename, index=False)

Creating data/lemmas_2020-04-30-18-10-45Z.csv
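
`datetime.utcnow()` returns a naive datetime and is deprecated as of Python 3.12. A sketch of a timezone-aware equivalent that produces the same file name format; `utc_stamp` is a hypothetical helper name, not something this notebook defines:

```python
from datetime import datetime, timezone

def utc_stamp(now=None):
    """Format a UTC timestamp as YYYY-MM-DD-HH-MM-SSZ for file names."""
    if now is None:
        now = datetime.now(timezone.utc)
    # Convert to UTC in case an aware datetime in another zone was passed in
    return now.astimezone(timezone.utc).strftime('%Y-%m-%d-%H-%M-%SZ')

example = utc_stamp(datetime(2020, 4, 30, 18, 10, 45, tzinfo=timezone.utc))
```

Calling `utc_stamp()` with no argument uses the current time, so it could stand in for the `utcnow()` expression above.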


## Vectorization

In [19]:
# Reload lemmas from csv
# This prevents a type error when parsing the lemmas for vectorization

lemmas = pd.read_csv(lemmas_filename)
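
After the round trip, each cell in the lemma columns is a single string such as `"['beat', 'Dr.', 'Dre']"` rather than a list, so the brackets, quotes, and commas are part of the text that spaCy processes during vectorization. If that is not wanted, one option is to parse the string back into a list and join the lemmas with spaces first. A sketch using only the standard library; `lemmas_to_text` is a hypothetical helper and the sample value is made up for illustration:

```python
import ast
import csv
import io

def lemmas_to_text(cell):
    """Turn the string form of a lemma list back into space-separated text."""
    return ' '.join(ast.literal_eval(cell))

# Simulate the CSV round trip: a list is written as its string form
buffer = io.StringIO()
csv.writer(buffer).writerow([['beat', 'Dr.', 'Dre']])
buffer.seek(0)
cell = next(csv.reader(buffer))[0]

text = lemmas_to_text(cell)
```

With pandas this could be applied as `lemmas['sm_lemmas'].map(lemmas_to_text)` before calling `make_vectors`. Whether it changes the quality of the resulting vectors has not been tested here.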

In [21]:
def make_vectors(nlp, docs):
    """Creates a list of documents containing the vectors of each document in the input docs.

    :param nlp: spaCy NLP model to use
    :param docs: list of documents to vectorize

    :returns: list of vectorized documents
    """
    vectors = []
    for doc in nlp.pipe(docs, batch_size=500):
        vectors.append(doc.vector)
    return vectors
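
By default, spaCy's `Doc.vector` is an average of the document's token vectors, so every document is reduced to one fixed-length vector regardless of how many tokens it has. A plain-Python sketch of that averaging with made-up 3-dimensional vectors; `average_vectors` is a hypothetical name and this is not spaCy's implementation:

```python
def average_vectors(token_vectors):
    """Element-wise mean of equal-length vectors."""
    if not token_vectors:
        return []
    count = len(token_vectors)
    return [sum(values) / count for values in zip(*token_vectors)]

# Three made-up token vectors -> one document vector
doc_vector = average_vectors([
    [1.0, 0.0, 2.0],
    [3.0, 0.0, 4.0],
    [2.0, 3.0, 0.0],
])
```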

In [24]:
# Create vectors of the lemmas with each of the spaCy models

vectors = pd.DataFrame()

print('Vectorization with en-core-web-sm')
vectors['sm_vectors'] = make_vectors(nlp_sm, lemmas['sm_lemmas'])

print('Vectorization with en-core-web-md')
vectors['md_vectors'] = make_vectors(nlp_md, lemmas['md_lemmas'])

print('Vectorization with en-core-web-lg')
vectors['lg_vectors'] = make_vectors(nlp_lg, lemmas['lg_lemmas'])

vectors['inappropriate'] = training_data['inappropriate']

Vectorization with en-core-web-sm
Vectorization with en-core-web-md
Vectorization with en-core-web-lg


In [25]:
# Quick glance verification of proper output

print('Training data shape:', training_data.shape)
print('Vectors shape:', vectors.shape)
print()
vectors.head()

Training data shape: (146264, 2)
Vectors shape: (146264, 4)



| | sm_vectors | md_vectors | lg_vectors | inappropriate |
|---|---|---|---|---|
| 0 | [-0.15469056, 0.94960415, -2.069193, 0.2554039... | [-0.16546369, 0.44611606, 0.01917303, -0.06901... | [-0.16230947, 0.46143216, 0.026662538, -0.0600... | True |
| 1 | [-0.16211522, 0.5824614, -1.6822865, -0.115870... | [-0.23224233, 0.35610408, 0.03614007, -0.11104... | [-0.23224233, 0.35610408, 0.03614007, -0.11104... | True |
| 2 | [-0.23576383, 0.645006, -1.8176227, -0.0978480... | [-0.14492889, 0.41942063, -0.024495453, -0.062... | [-0.14492889, 0.41942063, -0.024495453, -0.062... | False |
| 3 | [0.012096468, 0.84190094, -1.9023851, -0.28529... | [-0.17496027, 0.4259231, -0.05608107, -0.11582... | [-0.16725582, 0.42386648, -0.054639563, -0.116... | False |
| 4 | [-0.2324229, 0.63849, -1.5870256, -0.25167158,... | [-0.1704493, 0.46021652, -0.011923886, -0.1096... | [-0.1704493, 0.46021652, -0.011923886, -0.1096... | False |


### Export Vectors

In [27]:
# Use the same UTC datetime stamp from when lemmas were exported

vectors_sm_filename = f'data/vectors_sm_{utc_now_formatted}.csv'
vectors_md_filename = f'data/vectors_md_{utc_now_formatted}.csv'
vectors_lg_filename = f'data/vectors_lg_{utc_now_formatted}.csv'

print(f'Creating {vectors_sm_filename}')
vectors[['sm_vectors', 'inappropriate']].to_csv(vectors_sm_filename, index=False)

print(f'Creating {vectors_md_filename}')
vectors[['md_vectors', 'inappropriate']].to_csv(vectors_md_filename, index=False)

print(f'Creating {vectors_lg_filename}')
vectors[['lg_vectors', 'inappropriate']].to_csv(vectors_lg_filename, index=False)

Creating data/vectors_sm_2020-04-30-18-10-45Z.csv
Creating data/vectors_md_2020-04-30-18-10-45Z.csv
Creating data/vectors_lg_2020-04-30-18-10-45Z.csv
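
When these files are read back, each vector cell is a string, not an array. A sketch of converting one cell into a list of floats, assuming the cell holds a bracketed sequence of numbers separated by whitespace or commas (the forms that NumPy arrays and Python lists take when converted to strings); `parse_vector` is a hypothetical helper and the sample string is made up:

```python
def parse_vector(cell):
    """Parse a bracketed vector string into a list of floats."""
    inner = cell.strip().strip('[]')
    # Treat commas as whitespace so both separator styles are accepted
    return [float(part) for part in inner.replace(',', ' ').split()]

vector = parse_vector('[-0.5  0.25\n  1.5e-03]')
```

With pandas this could be applied to a reloaded column, for example `df['sm_vectors'].map(parse_vector)`, where `df` is whatever name the reloaded frame is given.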
