#### PGGM Bootcamp Text Analytics 2020
*Notebook by [Pedro V Hernandez Serrano](https://github.com/pedrohserrano)*

---
![](images/1_3.png)

# 1.3 Text Processing
* [1.3.1. Data cleaning](#1.3.1)
* [1.3.2. Corpus organization](#1.3.2)
* [1.3.3. Document-Term Matrix](#1.3.3)

---

- Data cleaning is a time-consuming and unenjoyable task, yet it's a very important one.
- Keep in mind, "garbage in, garbage out". 
- Feeding dirty data into a model will give us results that are meaningless.

Specifically, we'll be walking through:

1. **Getting the data** - in this case, the data we scraped from EDGAR filings
2. **Cleaning the data** - we will walk through popular techniques
3. **Organizing the data** - we will organize the cleaned data in a way that is easy to feed into other algorithms

The output of this notebook will be clean, organized data in two standard text formats:

1. **Corpus** - a collection of text
2. **Document-Term Matrix** - word counts in matrix format

**Common data cleaning steps on all text:**
* Make text all lower case
* Remove punctuation
* Remove or transform numerical values
* Remove common nonsensical text (such as the newline marker `\n`)
* Remove stop words
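
As a rough illustration of the steps listed above, the sketch below applies them in order to a single made-up sentence. The function name `basic_clean`, the sample text, and the tiny `STOP_WORDS` set are assumptions for this example only; real stop-word lists are much longer.

```python
import re
import string

# Hypothetical mini stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "to", "was"}

def basic_clean(text):
    """Apply the common cleaning steps to one string, in order."""
    text = text.lower()                                                # lower case
    text = text.replace("\n", " ")                                     # newline characters
    text = re.sub(r"[%s]" % re.escape(string.punctuation), " ", text)  # punctuation
    text = re.sub(r"\d+", " ", text)                                   # numerical values
    tokens = [t for t in text.split() if t not in STOP_WORDS]          # stop words
    return " ".join(tokens)

sample = "Revenue in 2019 was $4.2 billion,\nan increase of 8%."
cleaned = basic_clean(sample)
print(cleaned)  # revenue billion increase
```

Note how aggressive this is: the figures themselves disappear, which is why the list says "remove or transform" numerical values depending on the analysis.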

**NLP processing after tokenization:**
* Stemming / lemmatization
* Parts of speech tagging
* Create bi-grams or tri-grams
* Deal with typos
* And more...
(to be discussed in the second section)
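
To make one item of this list concrete, here is a minimal sketch of building bi-grams and tri-grams from an already tokenized sentence. The helper `ngrams` and the sample sentence are hypothetical, and a library such as NLTK would normally be used instead.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "net income increased this fiscal year".split()
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams[:2])   # [('net', 'income'), ('income', 'increased')]
print(len(trigrams)) # 4
```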

### Reading saved data

In [None]:
import pandas as pd
import pickle

In [None]:
df = pd.read_csv('datasets/table_companies_80.csv',sep='\t')

In [None]:
df.head()

In [None]:
# get a list of only tickers
tickers = list(df['ticker'])

In [None]:
# Load each pickled report into a dictionary keyed by ticker
data = {}
for t in tickers:
    with open("reports/" + t + ".txt", "rb") as file:
        data[t] = pickle.load(file)

In [None]:
len(df) == len(data)

**The data now lives in a dictionary, where each report can be accessed by its ticker.**
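
The same pattern can be tried without the course files. The sketch below writes two made-up reports to a temporary folder and reads them back into a dictionary keyed by ticker; the tickers `AAA` and `BBB` and their texts are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical tickers and report texts, used only for this illustration.
reports = {
    "AAA": "Annual report of company AAA ...",
    "BBB": "Annual report of company BBB ...",
}

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Write one pickled string per ticker.
    for ticker, text in reports.items():
        with open(folder / f"{ticker}.txt", "wb") as fh:
            pickle.dump(text, fh)

    # Read the files back into a dictionary keyed by ticker.
    loaded = {}
    for ticker in reports:
        with open(folder / f"{ticker}.txt", "rb") as fh:
            loaded[ticker] = pickle.load(fh)

print(loaded["AAA"][:13])  # Annual report
```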

In [None]:
# Print the first 800 characters of the report
# NOTE: printing the whole document in a cell might not be a good idea!!
print(data['ARX'][:800])

---
### 1.3.1. Data cleaning
<a id="1.3.1"></a>

When dealing with numerical data, data cleaning often involves removing null values and duplicate data, dealing with outliers, etc. With text data, there are some common data cleaning techniques, which are also known as text pre-processing techniques.

Note that this cleaning process can go on forever. There's always an exception to every cleaning step. So, we're going to follow the MVP (minimum viable product) approach: start simple and iterate.

In [None]:
# Let's take a look at our data again
data.keys()

In [None]:
# Pandas friendly dictionary form
data_formatted = {key: [value] for (key, value) in data.items()}

In [None]:
# We can either keep it in dictionary format or put it into a pandas dataframe
pd.set_option('display.max_colwidth', 1000)
data_df = pd.DataFrame.from_dict(data_formatted).transpose()

In [None]:
data_df.columns = ['report']
data_df = data_df.sort_index()
data_df.head()

In [None]:
# Let's take a look at the ARX report
print(data_df.report.loc['ARX'][0:800])

<br>
Applying a first round of text cleaning techniques

In [None]:
import re
import string

In [None]:
def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    text = re.sub(r'\[.*?\]', ' ', text)  # text in square brackets
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)  # punctuation
    text = re.sub(r'\w*\d\w*', ' ', text)  # words containing digits
    return text

In [None]:
# Let's apply and take a look at the updated text
data_clean = pd.DataFrame(data_df.report.apply(clean_text_round1))

In [None]:
data_clean.head()

<br>
Applying a second round of text cleaning techniques, removing leftover quote characters, XML tags, newlines and tabs

In [None]:
def clean_text_round2(text):
    '''Get rid of some additional punctuation and nonsensical text that was missed the first time around.'''
    text = re.sub('[‘’“”…]', '', text)  # curly quotes and ellipsis characters
    text = re.sub(r'<.*?>', ' ', text)  # markup tags
    text = re.sub(r'[\n\t]', ' ', text)  # newlines and tabs
    return text

In [None]:
# Apply the second round of cleaning
data_clean = pd.DataFrame(data_clean.report.apply(clean_text_round2))
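
One thing worth checking on your own data when cleaning in several passes is the order of operations. `string.punctuation` contains `<`, `>` and `/`, so a tag pattern such as `<.*?>` has nothing left to match if punctuation was already stripped. The toy string below is made up for this illustration.

```python
import re
import string

PUNCT = r"[%s]" % re.escape(string.punctuation)
TAG = r"<.*?>"

raw = "<p>total assets</p>"

# Punctuation first: '<', '>' and '/' become spaces, so the tag names survive as words.
punct_then_tags = re.sub(TAG, " ", re.sub(PUNCT, " ", raw)).split()

# Tags first: the '<p>' and '</p>' markers are removed before punctuation is touched.
tags_then_punct = re.sub(PUNCT, " ", re.sub(TAG, " ", raw)).split()

print(punct_then_tags)  # ['p', 'total', 'assets', 'p']
print(tags_then_punct)  # ['total', 'assets']
```

If stray tokens like `p` or `div` show up in the cleaned reports, moving the tag removal ahead of the punctuation step is one possible remedy.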

##### NOTE: we overwrite the `data_clean` object at each step to keep memory usage low

<br>
What does the report look like?

---
### 1.3.2. Corpus organization
<a id="1.3.2"></a>

*Text Corpus [Wikipedia definition](https://en.wikipedia.org/wiki/Text_corpus)*

- We already created a corpus in an earlier step. 
- A corpus is simply a collection of texts.
- Here, the texts are all put together neatly in a pandas dataframe.
- The idea is to preserve the corpora in reproducible formats such as `csv`, `txt`, or `pickle`.
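
As a small sketch of what "reproducible formats" means in practice, the example below saves a made-up two-document corpus as both `pickle` and `csv` and checks that each comes back unchanged. The file names and texts are hypothetical, and only the standard library is used here rather than pandas.

```python
import csv
import pickle
import tempfile
from pathlib import Path

# Hypothetical two-document corpus keyed by ticker.
corpus = {"AAA": "net income increased", "BBB": "net income decreased"}

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "corpus.pkl"
    csv_path = Path(tmp) / "corpus.csv"

    # pickle stores the Python object as-is, but is only readable from Python.
    with open(pkl_path, "wb") as fh:
        pickle.dump(corpus, fh)
    with open(pkl_path, "rb") as fh:
        from_pickle = pickle.load(fh)

    # csv is plain text and tool-agnostic: one row per document.
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["ticker", "report"])
        writer.writerows(corpus.items())
    with open(csv_path, newline="", encoding="utf-8") as fh:
        from_csv = {row["ticker"]: row["report"] for row in csv.DictReader(fh)}

print(from_pickle == corpus, from_csv == corpus)  # True True
```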

In [None]:
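# Align names on the ticker index: data_clean was sorted, so a positional list could mismatch
data_clean['company_name'] = df.set_index('ticker')['name']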

In [None]:
data_clean.head()

In [None]:
print(data_clean.report.iloc[0][:800])

In [None]:
# Save the corpus
data_clean.to_pickle("pickle/EDGAR_corpus.pkl")

---
### 1.3.3. Document-Term Matrix
<a id="1.3.3"></a>

*Document-Term Matrix [Wikipedia definition](https://en.wikipedia.org/wiki/Document-term_matrix)*

- For many of the techniques, the text must be tokenized (broken down into smaller pieces).
- The most common tokenization technique is to break down text into words. 
- We can do this using `scikit-learn`'s `CountVectorizer`, where every row will represent a different document and every column will represent a different word.
- In addition, with `CountVectorizer`, we can remove stop words. Stop words are common words that add little meaning to a text, such as 'a' or 'the'.
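
To see what a document-term matrix is before handing the work to a library, here is a minimal hand-rolled sketch using only the standard library. It is not how `CountVectorizer` is implemented; the toy documents and the three-word stop list are assumptions for this example.

```python
from collections import Counter

# Hypothetical toy corpus and stop-word list, for illustration only.
docs = {
    "AAA": "the company reported a profit",
    "BBB": "the company reported a loss and a warning",
}
stop_words = {"a", "the", "and"}

# Tokenize on whitespace and count the remaining words per document.
counts = {name: Counter(w for w in text.split() if w not in stop_words)
          for name, text in docs.items()}

# The sorted vocabulary defines the columns; each document is one row.
vocabulary = sorted(set().union(*counts.values()))
matrix = [[counts[name][word] for word in vocabulary] for name in docs]

print(vocabulary)  # ['company', 'loss', 'profit', 'reported', 'warning']
for name, row in zip(docs, matrix):
    print(name, row)
# AAA [1, 0, 1, 1, 0]
# BBB [1, 1, 0, 1, 1]
```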

In [None]:
# CountVectorizer library
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# Create a CountVectorizer that removes English stop words
cv = CountVectorizer(stop_words='english')

*See [stop words](https://en.wikipedia.org/wiki/Stop_words)*

A `CountVectorizer` object has 2 main methods:
- `fit_transform`
- `get_feature_names_out` (named `get_feature_names` in scikit-learn versions before 1.0)

In [None]:
data_cv = cv.fit_transform(data_clean.report)
data_tokens = cv.get_feature_names_out()

<br>
With these elements, we can simply create a dataframe using `pandas`

In [None]:
data_matrix = pd.DataFrame(data_cv.toarray(), columns=data_tokens)
data_matrix.index = data_clean.index

In [None]:
data_matrix.head()

<br>
We can now save these objects as binary pickle files

In [None]:
data_matrix.to_pickle("pickle/EDGAR_matrix.pkl")

In [None]:
# Let's also pickle the CountVectorizer object
with open("pickle/EDGAR_cv.pkl", "wb") as file:
    pickle.dump(cv, file)

---
#### *Learn more about CountVectorizer at the official [scikit-learn.org](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) documentation*