# Tutorial for NLP4datascience module
The purpose of this Python module is to facilitate the use of natural language inputs (i.e. text) in data science and data analysis. Its foundational implementation builds on popular NLP components from the Python modules NLTK, scikit-learn and Gensim, and it offers further proprietary functionality beyond those.

## 1. Preprocessing data

#### Import modules

In [None]:
import numpy as np
import pandas as pd
import scipy.sparse
from tqdm import tqdm
from nlp4datascience.datahandling.largepickle import pickle_load, pickle_dump
from nlp4datascience.preprocessors.nlp4datascience import BagOfWords
from nlp4datascience.preprocessors.nlp4datascience import DTM
from gensim.models.phrases import Phrases, Phraser


#### Load data

In [4]:
text = "<pathname>"

#### Text data preprocessing
BagOfWords creates an nlp4datascience object which lets us preprocess and tokenize our text data with a few simple commands.
min_length defines the minimum character length of the words being considered. Words with fewer characters are deleted automatically.
ngram_length defines the degree of ngrams we want to extract. Set this to 1 for now. We will extract bigrams, trigrams and higher-order ngrams later via a collocation approach.

In [None]:
preprocessed_text = BagOfWords(text, min_length=1, ngram_length=1)
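
To make the two parameters concrete, here is a minimal plain-Python sketch of what a minimum word length and an ngram degree do to a token list. The helper `filter_and_ngram` is hypothetical and is not part of nlp4datascience; the module's own implementation may differ.

```python
def filter_and_ngram(tokens, min_length=1, ngram_length=1):
    """Drop tokens shorter than min_length, then join adjacent tokens into ngrams.

    Hypothetical helper for illustration only; not the module's implementation.
    """
    kept = [t for t in tokens if len(t) >= min_length]
    return [
        " ".join(kept[i:i + ngram_length])
        for i in range(len(kept) - ngram_length + 1)
    ]


tokens = "the ecb left its key rates unchanged".split()

# Only words with at least 4 characters survive
print(filter_and_ngram(tokens, min_length=4, ngram_length=1))

# All words are kept and paired with their right-hand neighbour
print(filter_and_ngram(tokens, min_length=1, ngram_length=2))
```

With ngram_length=1 the function returns the filtered tokens unchanged, which matches the advice above to start with unigrams.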

We now have our NLP4datascience object, called "preprocessed_text".
Let's clean the text by:
1. removing stopwords (based on NLTK stopword list + our own list)
2. lowercasing the corpus
3. removing punctuation
4. removing special characters (can be customized)
5. removing numbers (can be switched off)
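
The five cleaning steps listed above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the module's code: the tiny `STOPWORDS` set stands in for the NLTK list, the `clean` function and its `remove_numbers` switch are hypothetical names, and "special characters" is taken to mean anything other than ASCII letters, digits and whitespace.

```python
import re
import string

# Hypothetical mini stopword list; the module uses NLTK's list plus custom words.
STOPWORDS = {"the", "of", "a", "in", "to", "and"}


def clean(text, stopwords=STOPWORDS, remove_numbers=True):
    """Illustrative cleaning pipeline; returns a list of remaining words."""
    text = text.lower()                                               # lowercase the corpus
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = re.sub(r"[^a-z0-9\s]", " ", text)                          # remove special characters
    if remove_numbers:
        text = re.sub(r"\d+", " ", text)                              # remove numbers (optional)
    # Stopwords are removed last so that matching happens on lowercased words
    return [word for word in text.split() if word not in stopwords]


print(clean("The ECB raised rates by 25 basis points in July!"))
print(clean("The ECB raised rates by 25 basis points in July!", remove_numbers=False))
```

Note that the sketch removes stopwords after lowercasing, so the order differs from the numbering in the list; the order used inside the module is not documented here.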

If we want to add our own stopwords to the stopword list, we can either write them out as in the following line of code or load a list of words and then assign it to preprocessed_text.custom_stopwords.

In [None]:
preprocessed_text.custom_stopwords = ["my_stopword_1","my_stopword_2",...,"my_stopword_N"]

Let's clean and tokenize our text corpus.

In [None]:
preprocessed_text.clean()
preprocessed_text.tokenize()

Let's create unigrams via stemming (we could also use lemmatization instead).

In [None]:
preprocessed_text.stemm() # use preprocessed_text.lemmatize() if you want to use this approach instead

Let's visually inspect the preprocessed corpus. The graph plots the ranking of the ngrams either according to a tf, df or a tf-idf weighting. Let's choose the popular tf-idf ranking.

In [None]:
preprocessed_text.visualize(weight = "tf-idf")
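
For reference, the three weightings can be computed by hand on a toy corpus. The sketch below assumes one common variant, corpus-wide term frequency multiplied by log(N / df); the exact formula used by the module is not stated in this tutorial, so treat the numbers as illustrative only.

```python
import math
from collections import Counter

# Toy corpus of already tokenized and stemmed documents
docs = [
    ["rate", "inflat", "rate"],
    ["inflat", "growth"],
    ["rate", "growth", "growth", "risk"],
]

# tf: how often a term occurs in the whole corpus
tf = Counter(term for doc in docs for term in doc)

# df: in how many documents a term occurs at least once
df = Counter(term for doc in docs for term in set(doc))

# tf-idf (assumed variant): tf * log(N / df)
n_docs = len(docs)
tfidf = {term: tf[term] * math.log(n_docs / df[term]) for term in tf}

# Rank terms from highest to lowest tf-idf score
ranking = sorted(tfidf, key=tfidf.get, reverse=True)
for term in ranking:
    print(f"{term:8s} tf={tf[term]} df={df[term]} tf-idf={tfidf[term]:.3f}")
```

A term that is frequent but concentrated in few documents ranks high, while a term spread evenly over all documents is pushed towards zero.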

If we want to cut the corpus at a certain minimum threshold of tf-idf scores, we can do so with the following command:

In [None]:
preprocessed_text.rank_remove(rank="tf-idf", items="uni", cutoff = 15)

Let's visually inspect our corpus after applying the tf-idf cutoff.

In [None]:
preprocessed_text.visualize(weight = "tf-idf")

We could even apply another cutoff criterion, say based on document frequency, on top of the former one.
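
As an illustration of a document-frequency criterion, the following plain-Python sketch drops every term that appears in fewer than `min_df` documents. The function `remove_rare_terms` and the parameter `min_df` are hypothetical names chosen for this example; they are not arguments of rank_remove.

```python
from collections import Counter


def remove_rare_terms(docs, min_df=2):
    """Keep only terms that occur in at least min_df documents (illustrative)."""
    df = Counter(term for doc in docs for term in set(doc))
    keep = {term for term, n_docs in df.items() if n_docs >= min_df}
    return [[term for term in doc if term in keep] for doc in docs]


docs = [
    ["rate", "inflat", "rate"],
    ["inflat", "growth"],
    ["rate", "growth", "risk"],
]

# "risk" occurs in a single document only and is therefore removed
filtered = remove_rare_terms(docs, min_df=2)
print(filtered)
```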

Once we are happy with our results, we can save them to our computer. The preprocessing is completed.

In [None]:
preprocessed_text.save("pkl", "<path_where_to_save_the_file>", "<filename>")

## 2. Create a Document-Term-Matrix (based on unigrams)

Load the preprocessed text corpus from step 1

In [None]:
ngrams = pickle_load("<preprocessed_data_from_step_1>.pkl")


Create a DTM object based on the NLP4datascience module.
The ngram_length specifies what length of ngrams should be considered.

In [None]:
dtm_unigrams = DTM(ngrams.uni_adj, ngram_length = 1)

Create the DTM

In [None]:
dtm_unigrams.create_dtm(ngrams.uni)
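
Conceptually, a document-term matrix has one row per document and one column per vocabulary term, with each cell holding the term's count in that document. The sketch below builds a small dense version in plain Python; `build_dtm` is a hypothetical helper and says nothing about how create_dtm is implemented internally.

```python
from collections import Counter


def build_dtm(docs):
    """Return (rows, vocab): rows[i][j] is the count of vocab[j] in document i."""
    vocab = sorted({term for doc in docs for term in doc})
    column = {term: j for j, term in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0] * len(vocab)
        for term, count in Counter(doc).items():
            row[column[term]] = count
        rows.append(row)
    return rows, vocab


docs = [["rate", "inflat", "rate"], ["inflat", "growth"]]
dtm, vocab = build_dtm(docs)
print(vocab)
for row in dtm:
    print(row)
```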


Save the DTM and the corresponding vocabulary list to your computer.

In [None]:
dtm_sparse = dtm_unigrams.dtm
scipy.sparse.save_npz('<path_where_to_store_the_compressed_dtm>.npz', dtm_sparse)
pd.Series(dtm_unigrams.tokens).to_pickle('<path_where_to_store_the_vocabulary>.pkl')
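
To read the saved files back in a later session, scipy.sparse.load_npz and pd.read_pickle are the counterparts of the two calls above. The round trip below is a small self-contained sketch that uses a toy matrix, a toy vocabulary and a temporary directory in place of the real DTM and paths.

```python
import os
import tempfile

import pandas as pd
import scipy.sparse

# Toy stand-ins for dtm_unigrams.dtm and dtm_unigrams.tokens
dtm_sparse = scipy.sparse.csr_matrix([[0, 1, 2], [1, 1, 0]])
tokens = ["growth", "inflat", "rate"]

with tempfile.TemporaryDirectory() as tmp:
    dtm_path = os.path.join(tmp, "dtm.npz")
    vocab_path = os.path.join(tmp, "vocab.pkl")

    # Save the matrix and the vocabulary
    scipy.sparse.save_npz(dtm_path, dtm_sparse)
    pd.Series(tokens).to_pickle(vocab_path)

    # Load both back
    dtm_loaded = scipy.sparse.load_npz(dtm_path)
    tokens_loaded = pd.read_pickle(vocab_path).tolist()

print(dtm_loaded.shape, tokens_loaded)
```

Keeping the matrix and the vocabulary in two files means the column order of the DTM is only meaningful together with the saved token list, so both files should be stored side by side.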

If we want to inspect the vocabulary, we can run the following command:

In [None]:
print(dtm_unigrams.tokens)

###### Note: 
The DTM is stored as a Compressed Sparse Row (CSR) scipy matrix. This is advantageous when the corpus becomes too large to be held in memory in uncompressed form, for example when we have millions of documents and thousands or hundreds of thousands of unique ngram terms.
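
To see why CSR saves memory, it helps to look at the three arrays the format keeps: the non-zero values, their column indices, and a pointer array marking where each row starts. The plain-Python sketch below builds these arrays for a tiny matrix; `to_csr` is a hypothetical helper, while scipy's csr_matrix exposes the same three arrays as `.data`, `.indices` and `.indptr`.

```python
def to_csr(rows):
    """Convert a dense list-of-lists matrix into CSR arrays (illustrative)."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for j, value in enumerate(row):
            if value != 0:
                data.append(value)     # non-zero value
                indices.append(j)      # its column index
        indptr.append(len(data))       # end of this row within data/indices
    return data, indices, indptr


dense = [
    [0, 1, 2],
    [1, 1, 0],
    [0, 0, 0],
]
data, indices, indptr = to_csr(dense)
print(data, indices, indptr)
```

Only the non-zero counts are stored, so the memory cost grows with the number of distinct terms per document rather than with the full vocabulary size.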

If you want to convert your compressed DTM into a non-compressed (i.e. dense) format, use the command below:

In [None]:
dtm_dense = dtm_sparse.todense()

## 3. Create a Document-Term-Matrix (based on ngrams)

In [None]:
ngrams = pickle_load("C:/Users/oxmanahrens/OneDrive - Nexus365/BTR/data/cbc_example/data/ecb_statements_ngrams(tfidf15).pkl")

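# Create bigrams: learn frequent word pairs from the unigram documents.
# Note: delimiter=b' ' is the bytes form used by gensim < 4.0; gensim >= 4.0 expects a str (' ').
bigram_model = Phrases(ngrams.uni, min_count=10, threshold=10, delimiter=b' ')
bigram_model_final = Phraser(bigram_model)
# Apply the frozen bigram model to every document
ngrams_bi_w2v = [bigram_model_final[doc] for doc in tqdm(ngrams.uni)]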
pd.Series(ngrams_bi_w2v).to_pickle('data/cbc_example/data/bigrams_w2v_tfidf15.pkl')
dtm_ngrams = DTM(ngrams_bi_w2v, ngram_length = 1)
dtm_ngrams.create_dtm(ngrams_bi_w2v)
print(len(dtm_ngrams.tokens))
dtm_ngrams.dtm
scipy.sparse.save_npz('C:/Users/oxmanahrens/OneDrive - Nexus365/BTR/data/cbc_example/data/dtm_bigrams_w2v_sparse_tfidf15.npz', dtm_ngrams.dtm)
pd.Series(dtm_ngrams.tokens).to_pickle('C:/Users/oxmanahrens/OneDrive - Nexus365/BTR/data/cbc_example/data/dtm_bigrams_w2v_tokens_tfidf15.pkl')
 
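# Create trigrams: run the collocation model again on the bigram documents.
# Note: delimiter=b' ' is the bytes form used by gensim < 4.0; gensim >= 4.0 expects a str (' ').
trigram_model = Phrases(ngrams_bi_w2v, min_count=10, threshold=10, delimiter=b' ')
trigram_model_final = Phraser(trigram_model)
# Apply the frozen trigram model to every document
ngrams_tri_w2v = [trigram_model_final[doc] for doc in tqdm(ngrams_bi_w2v)]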
pd.Series(ngrams_tri_w2v).to_pickle('data/cbc_example/data/trigrams_w2v_tfidf15.pkl')
dtm_ngrams = DTM(ngrams_tri_w2v, ngram_length = 1)
dtm_ngrams.create_dtm(ngrams_tri_w2v)
print(len(dtm_ngrams.tokens))
dtm_ngrams.dtm
scipy.sparse.save_npz('C:/Users/oxmanahrens/OneDrive - Nexus365/BTR/data/cbc_example/data/dtm_trigrams_w2v_sparse_tfidf15.npz', dtm_ngrams.dtm)
pd.Series(dtm_ngrams.tokens).to_pickle('C:/Users/oxmanahrens/OneDrive - Nexus365/BTR/data/cbc_example/data/dtm_trigrams_w2v_tokens_tfidf15.pkl')
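
The bigram and trigram steps above rely on the collocation scoring in gensim's Phrases, where min_count and threshold decide which adjacent word pairs are merged. The sketch below is a simplified plain-Python version of the default scoring rule described in the gensim documentation, (bigram_count - min_count) / (count_a * count_b) * vocabulary_size. It is an assumption-laden illustration: `score_bigrams` is a hypothetical helper, and the vocabulary size here counts unigrams only, which differs from gensim's internal bookkeeping.

```python
from collections import Counter


def score_bigrams(docs, min_count=10, threshold=10.0):
    """Return adjacent word pairs whose collocation score exceeds threshold."""
    unigram_counts = Counter(term for doc in docs for term in doc)
    bigram_counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    vocab_size = len(unigram_counts)  # simplification: unigrams only

    accepted = {}
    for (a, b), n_ab in bigram_counts.items():
        score = (n_ab - min_count) / unigram_counts[a] / unigram_counts[b] * vocab_size
        if score > threshold:
            accepted[(a, b)] = score
    return accepted


docs = [
    ["interest", "rate", "rise"],
    ["interest", "rate", "fall"],
    ["interest", "rate", "hold"],
    ["rate", "cut"],
]

# Small min_count and threshold so that the toy corpus yields a phrase
scores = score_bigrams(docs, min_count=1, threshold=0.5)
print(scores)
```

Raising min_count or threshold makes the model more conservative, so fewer pairs are merged; running the same procedure on the merged output is what produces trigrams in the second half of the cell above.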