# NLP4datascience Module - Tutorial 1: Text Preprocessing
This Python module facilitates the use of natural language inputs (i.e. text) in data science and data analysis. Its foundational implementation builds on popular NLP components from the Python modules NLTK, scikit-learn and Gensim, and adds further proprietary functionality on top of them.

## 1. Preprocessing data

#### Import modules

In [None]:
import numpy as np
import pandas as pd
import scipy.sparse
from tqdm import tqdm
from nlp4datascience.datahandling.largepickle import pickle_load, pickle_dump
from nlp4datascience.preprocessors.nlp4datascience import BagOfWords
from nlp4datascience.preprocessors.nlp4datascience import DTM
from gensim.models.phrases import Phrases, Phraser

#### Load data

In [4]:
text = "<pathname>"

#### Text data preprocessing
BagOfWords creates an nlp4datascience object, which lets us preprocess and tokenize our text data with a few simple commands.
The option "min_length" defines the minimum character length of the words to be considered. Words with fewer characters are deleted automatically.
For now we extract unigrams only. We will extract bigrams, trigrams and longer ngrams later via a collocation approach (Mikolov et al., 2013).

In [None]:
preprocessed_text = BagOfWords(text, min_length=1)
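
To make the effect of "min_length" concrete, here is a minimal sketch in plain Python. The helper `filter_by_length` is hypothetical and not part of nlp4datascience; it only mirrors the behaviour described above, assuming the threshold is inclusive.

```python
def filter_by_length(tokens, min_length=1):
    """Keep only tokens with at least `min_length` characters (assumed inclusive)."""
    return [tok for tok in tokens if len(tok) >= min_length]


tokens = "a cat sat on the mat".split()

# With min_length=1 nothing is dropped; with min_length=3 the short words go.
kept_all = filter_by_length(tokens, min_length=1)
kept_long = filter_by_length(tokens, min_length=3)
print(kept_long)
```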

We now have our NLP4datascience object, called "preprocessed_text".
Let's clean the text by:
1. removing stopwords (based on NLTK stopword list + our own list)
2. lowercasing the corpus
3. removing punctuation
4. removing special characters (can be customized)
5. removing numbers (can be switched off)
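
The five steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the module's implementation: the function `clean_text`, the tiny `BASE_STOPWORDS` set (standing in for the NLTK list) and the ASCII-only character filter are all hypothetical. Note that the sketch lowercases before removing stopwords, so that "The" and "the" are treated alike.

```python
import re

# Tiny stand-in for the NLTK stopword list (assumption, for illustration only).
BASE_STOPWORDS = {"the", "a", "is", "of", "and"}


def clean_text(text, custom_stopwords=(), remove_numbers=True):
    """Lowercase, strip numbers/punctuation/special characters, drop stopwords."""
    stopwords = BASE_STOPWORDS | set(custom_stopwords)
    text = text.lower()                        # step 2: lowercase
    if remove_numbers:
        text = re.sub(r"\d+", " ", text)       # step 5: numbers (can be switched off)
    # steps 3 and 4: anything that is not an ASCII letter, digit or whitespace
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return [tok for tok in text.split() if tok not in stopwords]  # step 1


sentence = "The price of 3 apples is $4.50, and rising!"
cleaned = clean_text(sentence, custom_stopwords=["price"])
cleaned_with_numbers = clean_text(sentence, custom_stopwords=["price"], remove_numbers=False)
print(cleaned)
```

Because the character filter only keeps ASCII letters, a corpus with accented or non-Latin text would need a different pattern.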

If we want to add our own stopwords to the stopword list, we can do so with the following line.

In [None]:
preprocessed_text.custom_stopwords = ["my_stopword_1", "my_stopword_2", ..., "my_stopword_N"]  # "..." stands for further stopwords

Let's clean and tokenize our text corpus.

In [None]:
preprocessed_text.clean()
preprocessed_text.tokenize()

Let's create unigrams via stemming (we could also use lemmatization instead).

In [1]:
preprocessed_text.stemm() # use preprocessed_text.lemmatize() if you want to use lemmatization instead
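
To illustrate the difference between the two options, here is a deliberately tiny sketch. It is a toy: real stemmers (for example the Porter stemmer shipped with NLTK) apply many more rules, and real lemmatizers rely on a dictionary. The suffix list, `toy_stem`, `TOY_LEMMAS` and `toy_lemmatize` are all hypothetical names made up for this example.

```python
# Toy rule set (assumption): a real stemmer uses far more elaborate rules.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(token, min_stem_length=3):
    """Strip the first matching suffix, keeping at least `min_stem_length` characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem_length:
            return token[: -len(suffix)]
    return token


# Toy lookup table standing in for a dictionary-based lemmatizer (assumption).
TOY_LEMMAS = {"mice": "mouse", "ran": "run"}


def toy_lemmatize(token):
    """Map a token to its dictionary form if known, otherwise leave it unchanged."""
    return TOY_LEMMAS.get(token, token)


tokens = ["walking", "walked", "walks", "mice", "ran"]
stems = [toy_stem(tok) for tok in tokens]
lemmas = [toy_lemmatize(tok) for tok in tokens]
print(stems)
print(lemmas)
```

Stemming collapses regular inflections by chopping suffixes but leaves irregular forms such as "mice" alone, while a lemmatizer can resolve them if its dictionary knows the word.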

Let's visually inspect the preprocessed corpus. The graph plots the ranking of the ngrams either according to a tf, df or a tf-idf weighting. Let's choose the popular tf-idf ranking.

In [None]:
preprocessed_text.visualize(weight = "tf-idf")
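
For readers who want to see what such a ranking is built from, here is a minimal sketch of corpus-level tf, df and tf-idf scores in plain Python. The exact weighting used inside `visualize` is not documented here, so the formula below (total term frequency times log(N / df)) is an assumption, and `corpus_scores` is a hypothetical helper.

```python
import math
from collections import Counter

# Three tiny, already stemmed documents (made-up data).
docs = [
    ["econom", "growth", "rate"],
    ["growth", "rate", "rate"],
    ["inflat", "rate"],
]


def corpus_scores(docs):
    """Return corpus-level tf, df and tf-idf scores per token."""
    n_docs = len(docs)
    tf = Counter(tok for doc in docs for tok in doc)        # total occurrences
    df = Counter(tok for doc in docs for tok in set(doc))   # documents containing the token
    tfidf = {tok: tf[tok] * math.log(n_docs / df[tok]) for tok in tf}
    return tf, df, tfidf


tf, df, tfidf = corpus_scores(docs)

# Rank tokens from highest to lowest tf-idf score.
ranking = sorted(tfidf, key=tfidf.get, reverse=True)
for tok in ranking:
    print(f"{tok:8s} tf={tf[tok]} df={df[tok]} tf-idf={tfidf[tok]:.3f}")
```

A token that appears in every document, such as "rate" here, gets a tf-idf score of zero under this weighting no matter how frequent it is.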

If we want to cut the corpus at a minimum threshold of tf-idf (or tf or df) scores, we can use the following command. With "weight", we specify the ranking according to which tokens are removed; we choose "tf-idf" here. The "cutoff" sets the minimum ranking score: tokens that rank below it are removed from the corpus. The visual inspection from the previous step gives a good estimate for the cutoff value. The "items" option specifies which ngram type to modify. As we are working with unigrams here, we set it to "uni".

In [None]:
preprocessed_text.remove_tokens(weight="tf-idf", items="uni", cutoff = 15)
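
Conceptually, the cutoff works like the filter sketched below. The function `remove_below_cutoff` and the scores are hypothetical; in particular, whether a token scoring exactly at the cutoff is kept is an assumption of this sketch.

```python
def remove_below_cutoff(docs, scores, cutoff):
    """Drop every token whose score is below `cutoff` (tokens at the cutoff are kept)."""
    keep = {tok for tok, score in scores.items() if score >= cutoff}
    return [[tok for tok in doc if tok in keep] for doc in docs]


# Made-up tf-idf scores for three tokens.
scores = {"growth": 22.4, "rate": 3.1, "inflat": 15.0}
docs = [["growth", "rate"], ["inflat", "rate", "growth"]]

adjusted = remove_below_cutoff(docs, scores, cutoff=15)
print(adjusted)
```

The original documents are left untouched, which matches the idea of keeping both an unadjusted and an adjusted version of the corpus.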

Let's visually inspect our corpus after applying the tf-idf cutoff.

In [None]:
preprocessed_text.visualize_adj(weight = "tf-idf")

We could even apply another cutoff criterion, say based on document frequency, on top of the former one.

Once we are happy with our results, we can save them to our computer:
- "output_directory" defines the path where to save the file
- "filename" defines the name of the saved file
- "data_format" lets you choose to save the file either as a pickle file (.pkl) or as a csv file (.csv). If not specified, pickle is the default

The preprocessing is now completed. 

In [None]:
preprocessed_text.save(output_directory, "filename", data_format = "pkl")
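
The two output formats can be sketched with the standard library alone. `save_tokens` is a hypothetical helper that mimics the three options described above; it is not how `save` is implemented, and it stores a plain list of token lists rather than the full nlp4datascience object.

```python
import csv
import pickle
import tempfile
from pathlib import Path


def save_tokens(docs, output_directory, filename, data_format="pkl"):
    """Write tokenized documents to <output_directory>/<filename>.<data_format>."""
    out_dir = Path(output_directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{filename}.{data_format}"
    if data_format == "pkl":
        with open(path, "wb") as fh:
            pickle.dump(docs, fh)
    elif data_format == "csv":
        with open(path, "w", newline="", encoding="utf-8") as fh:
            csv.writer(fh).writerows(docs)  # one document per row
    else:
        raise ValueError(f"unsupported data_format: {data_format!r}")
    return path


docs = [["growth", "rate"], ["inflat"]]
with tempfile.TemporaryDirectory() as tmp:
    pkl_path = save_tokens(docs, tmp, "tokens")  # pickle is the default
    with open(pkl_path, "rb") as fh:
        restored = pickle.load(fh)
    csv_path = save_tokens(docs, tmp, "tokens", data_format="csv")
    csv_lines = csv_path.read_text(encoding="utf-8").splitlines()
    pkl_name, csv_name = pkl_path.name, csv_path.name
```

Pickle round-trips the Python objects exactly, whereas the csv file is readable by other tools but flattens each document to a row of strings.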

## 2. Create a Document-Term-Matrix (based on unigrams)

Load the preprocessed text corpus from step 1

In [None]:
ngrams = pickle_load("<preprocessed_data_from_step_1>.pkl")

Create a DTM object based on the NLP4datascience module. We can choose ".uni" to use the unadjusted corpus, or ".uni_adj" to use the corpus adjusted by our previously specified cutoff criteria.

In [None]:
dtm_uni_adj = DTM(ngrams.uni_adj)

Create the DTM based on unigrams and inspect the dtm and the tokens

In [2]:
dtm_uni_adj.unigram_dtm()
dtm_uni_adj.dtm
dtm_uni_adj.tokens
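
As a reminder of what these two attributes hold, here is a minimal sketch of a document-term matrix built by hand. `build_dtm` is a hypothetical helper; it returns a dense list of lists for readability and sorts the vocabulary alphabetically, which may differ from the ordering the module uses.

```python
from collections import Counter


def build_dtm(docs):
    """Return (dtm, tokens): dtm[i][j] counts how often tokens[j] occurs in docs[i]."""
    tokens = sorted({tok for doc in docs for tok in doc})
    column = {tok: j for j, tok in enumerate(tokens)}
    dtm = [[0] * len(tokens) for _ in docs]
    for i, doc in enumerate(docs):
        for tok, count in Counter(doc).items():
            dtm[i][column[tok]] = count
    return dtm, tokens


docs = [["growth", "rate", "rate"], ["inflat", "rate"]]
dtm, tokens = build_dtm(docs)
print(tokens)
for row in dtm:
    print(row)
```

Each row is one document and each column one vocabulary entry, so the token list is needed to interpret the columns.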

Save the dtm and the corresponding token vocabulary list

In [None]:
dtm_uni_adj.save(output_directory, "unigrams_adj")

For bigrams, we specify the minimum frequency with which bigrams (and the unigrams in them) must occur, as well as a scoring-function threshold. The higher the threshold, the fewer bigrams are formed (this follows the collocation approach and its Gensim implementation).

**IMPORTANT**: Feed in the unadjusted unigram corpus for bigram, trigram, ngram DTM creation

In [None]:
dtm_bi = DTM(ngrams.uni)
dtm_bi.ngram_dtm(min_count=10, threshold=10)
dtm_bi.save(output_directory, "bigrams")
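
To see how "min_count" and "threshold" interact, here is a minimal sketch of the collocation score from Mikolov et al. (2013), (count(a b) - min_count) / (count(a) * count(b)). The additional scaling by vocabulary size follows what Gensim's default scorer appears to do; treat it, the use of the number of distinct unigrams as that size, and the helper `score_bigrams` itself as assumptions of this sketch.

```python
from collections import Counter


def score_bigrams(docs, min_count, threshold):
    """Return {(a, b): score} for adjacent pairs whose score exceeds `threshold`."""
    unigram_counts = Counter(tok for doc in docs for tok in doc)
    bigram_counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigram_counts)  # assumption: distinct unigrams only
    accepted = {}
    for (a, b), n_ab in bigram_counts.items():
        score = (n_ab - min_count) / (unigram_counts[a] * unigram_counts[b]) * vocab_size
        if score > threshold:
            accepted[(a, b)] = score
    return accepted


# Made-up corpus in which "new york" and "york city" co-occur often.
docs = [["new", "york", "city"]] * 3 + [["new", "car"], ["york", "minster"]]

loose = score_bigrams(docs, min_count=1, threshold=0.5)
strict = score_bigrams(docs, min_count=1, threshold=0.7)
print(sorted(loose))
print(sorted(strict))
```

Raising the threshold from 0.5 to 0.7 drops "new york" and keeps only "york city", which illustrates why a higher threshold yields fewer bigrams.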

For ngrams (more specifically, up to 4-grams), we can use the following command

In [None]:
dtm_ngrams = DTM(ngrams.uni)
dtm_ngrams.ngram_dtm(min_count=10, threshold=10)
dtm_ngrams.save(output_directory, "ngrams")
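
One common way to reach longer ngrams with a collocation approach is to run the bigram merge twice, so that two already merged bigrams can join into a 4-gram. Whether `ngram_dtm` works this way internally is not stated here, so the sketch below is only an illustration; `merge_phrases` and the phrase sets are hypothetical.

```python
def merge_phrases(doc, phrases, delimiter="_"):
    """Greedily join adjacent token pairs listed in `phrases`, scanning left to right."""
    merged, i = [], 0
    while i < len(doc):
        if i + 1 < len(doc) and (doc[i], doc[i + 1]) in phrases:
            merged.append(doc[i] + delimiter + doc[i + 1])
            i += 2  # skip both tokens of the merged pair
        else:
            merged.append(doc[i])
            i += 1
    return merged


doc = ["federal", "open", "market", "committee", "meets"]

# First pass: unigrams are joined into bigrams (phrase set assumed to come from scoring).
first_pass = merge_phrases(doc, {("federal", "open"), ("market", "committee")})

# Second pass: the same procedure on the merged tokens can yield up to 4-grams.
second_pass = merge_phrases(first_pass, {("federal_open", "market_committee")})
print(first_pass)
print(second_pass)
```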

The inspection and saving of the data works the same way as for the unigram DTM described above.

###### Note: 
The DTM is stored as a SciPy Compressed Sparse Row (CSR) matrix. This is advantageous when the corpus becomes too large to be held in memory in uncompressed form, for example when we have millions of documents and hundreds of thousands of unique ngram terms.
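
To show why CSR saves memory, the sketch below builds the three CSR arrays by hand in plain Python: "data" holds the non-zero values, "indices" their column positions, and "indptr" where each row starts within "data". The helpers `to_csr` and `csr_to_dense` are hypothetical and written for illustration only; in practice SciPy handles this for you.

```python
def to_csr(dense_rows):
    """Convert a dense list-of-lists matrix into CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in dense_rows:
        for j, value in enumerate(row):
            if value != 0:
                data.append(value)
                indices.append(j)
        indptr.append(len(data))  # end of this row within `data`
    return data, indices, indptr


def csr_to_dense(data, indices, indptr, n_cols):
    """Rebuild the dense matrix from CSR arrays."""
    rows = []
    for i in range(len(indptr) - 1):
        row = [0] * n_cols
        for k in range(indptr[i], indptr[i + 1]):
            row[indices[k]] = data[k]
        rows.append(row)
    return rows


dense = [
    [1, 0, 2, 0],
    [0, 0, 0, 0],
    [0, 3, 0, 0],
]
data, indices, indptr = to_csr(dense)
print(data, indices, indptr)
roundtrip = csr_to_dense(data, indices, indptr, n_cols=4)
```

Only the three non-zero entries are stored instead of all twelve cells, and the saving grows with the share of zeros, which is typically very high in a DTM.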

If you want to convert your compressed DTM into a non-compressed (i.e. dense) format, use the command below.

In [None]:
# dtm_sparse: a SciPy CSR matrix, e.g. dtm_uni_adj.dtm from above (assumed)
dtm_dense = dtm_sparse.todense()