# NLP Pre-Processing
Liz Gagne  
04/05/2018

## The Purpose of NLP Pre-Processing
Natural language processing (NLP) is often thought of as one of the main areas in Artificial Intelligence.  NLP techniques are at the core of AI-based products we use every day: chat bots, Google Translate, article summarizers, and the like.  However, NLP actually sits at the crossroads of AI/CS and computational linguistics, and its applications are more widespread than the obvious ones listed above.  NLP techniques allow us to derive things as complex as sentiment from text data, or to find patterns in text for any number of applications (fraud detection, topic segmentation, etc.).

NLP is characterized as a difficult problem in computer science, due mostly to the ambiguity of human language. Human speech is seldom precise or direct.  Understanding natural language means you need to understand the concepts beneath the words, how they go together, and how the words/order/concepts come together to create meaning.  

Before embarking on any NLP technique (e.g., sentiment analysis), we need to make sure the text data is in the proper format. If it isn't, the text won't be accepted by most models or processes.  Transforming your text data into something that an algorithm is able to ingest can be complicated, and it's helpful to have a solid grasp of the text data you're working with.  Generally, there are four stages within NLP pre-processing:
- ##### Cleaning  
Just like with non-text data, cleaning involves excluding the irrelevant or corrupt data points.  In NLP, this typically consists of removing stop words, punctuation, and other extraneous text.  Other cleaning tasks might involve dealing with capitalization rules, or other non-alphanumeric characters.
- ##### Annotation  
Annotation typically includes things like part-of-speech (POS) tagging, and is generally thought of as the application of a scheme to text data.
- ##### Normalization  
The translation or mapping of text within the scheme through stemming, lemmatization, or another method of standardization.
- ##### Analysis  
Applying basic statistical techniques to manipulate the data for more in-depth analysis.

## Tools and Methods
There are a lot of methods and tools available for pre-processing text data. This article is meant to give you a starting point, and is not an exhaustive list of all the options available. As with all data analysis, the analyst must understand the drawbacks and best uses of each technique and choose a method appropriate for the given dataset. The two main Python packages for NLP are `spaCy` and `NLTK`, and both have pros and cons.  `NLTK` is highly customizable, but wasn't built to be quick and simple.  `spaCy`, on the other hand, was designed specifically with efficiency in mind and as such is quick and easy to use.  

In [1]:
import pandas as pd
import pyodbc
import spacy
import en_core_web_sm  # or any other model you downloaded via `spacy download` or pip

# Load the small English model; `nlp` is the pipeline used for all parsing below
nlp = en_core_web_sm.load()

Let's pull in some TIP data to work with.  
By using pandas we can maintain the tabular structure of the data. This is especially helpful if you're used to working in SQL or SAS.

In [2]:
# Create the connection to all dbs
cnxn = pyodbc.connect('DRIVER={ODBC Driver 11 for SQL Server};SERVER=ES11vADOSQL006;DATABASE=master;Trusted_Connection=yes;')

In [40]:
# Pull data from APPRTIP
#Create an additional column with all text concatenated
sql3 = """
SELECT TIPImprovementPlan1, TIPActionPlan, TIPTimelinePlan, TIPSupportPlan, TIPAssessmentPlan,(TIPImprovementPlan1 + ' ' + TIPActionPlan + ' ' + TIPTimeLinePlan + ' ' + TIPSupportPlan + ' ' + TIPAssessmentPlan) as TIP_all_txt
FROM [APPR_EXT].[dbo].[APPRTIP]
where IsSubmitted = 'Y' and TIPEndedAppeal = 'N' and FiscalYear = 2017
"""
APPRTIP = pd.read_sql(sql3, cnxn)  # run the query and load the result into a pandas DataFrame called APPRTIP

We can print the first 5 rows of data to make sure our dataframe looks like we expected it to.

In [41]:
APPRTIP.head()

Unnamed: 0,TIPImprovementPlan1,TIPActionPlan,TIPTimelinePlan,TIPSupportPlan,TIPAssessmentPlan,TIP_all_txt
0,1E: Designing Coherent Instruction: Design les...,1E: Designing Coherent Instruction:\r\n\t• Des...,See above,1) You will schedule inter-visitations to obse...,"In our second and third meetings, we will revi...",1E: Designing Coherent Instruction: Design les...
1,Based on prior observations from the 2015-2016...,For 1e:\r\n\r\nA) Establish regular time(s) to...,See action steps/activities for specifics,1) Choose PD Cycle to support the steps in you...,You are responsible for gathering and providin...,Based on prior observations from the 2015-2016...
2,Based on prior observations and feedback from ...,For 1e:\r\n\r\n1. Establish regular time(s) to...,See action steps/activities for specific time ...,1) Choose to participate in a PD cycle to sup...,You are responsible for gathering and providin...,Based on prior observations and feedback from ...
3,"After reviewing last year's TIP, MOSL assessme...",1:Addressing the learning needs of small group...,Refer to the timelines included at the end of ...,1. Mr. Louie will participate in 1:1 coaching...,1. In our next 2 meetings we will review the ...,"After reviewing last year's TIP, MOSL assessme..."
4,1. Having learning activities aligned with the...,1. For improved alignment of learning activiti...,See above-ongoing,-Collaborate with your co-teachers to follow T...,1. Learning activities are aligned with the in...,1. Having learning activities aligned with the...


### Capitalization
Text data typically contains various capitalizations: the beginning of sentences, proper nouns, etc.  The most common approach is to reduce everything to lower case, though this can sometimes affect the fidelity of your data. Changing something like "US" to "us" can alter the meaning or affect how the string is tagged (this is where inside knowledge of the data you're working with and the analysis you're running comes in handy).
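One way to soften the "US" versus "us" problem is to lowercase everything except a short allowlist of tokens whose case carries meaning. The snippet below is only a minimal sketch: the helper `lowercase_except` and the `KEEP_CASE` set are hypothetical names made up for illustration, and the right allowlist depends entirely on your own data.

```python
def lowercase_except(tokens, keep_case):
    """Lowercase every token unless it appears in the keep_case allowlist."""
    return [tok if tok in keep_case else tok.lower() for tok in tokens]


# Hypothetical allowlist; build yours from knowledge of the data.
KEEP_CASE = {"US", "TIP"}

tokens = "The US team asked us to review the TIP".split()
normalized = lowercase_except(tokens, KEEP_CASE)
print(normalized)
```

Here "US" and "TIP" keep their capitals while "The" is lowered, so the country and the pronoun stay distinguishable.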

### Stop Words  
Most words within text data are connectors, which do little to show the subject, object, or other information within a sentence. Depending on the type of analysis you're running, excluding these stop words can be a necessary step.  Stop words are equivalent to noise within the data.  There are pre-built stop word lists out there, which can be used as is or altered to fit your particular data. For now we can use the English stop words list from spaCy.

In [42]:
from spacy.lang.en import STOP_WORDS
print(STOP_WORDS)

{'always', 'sometime', 'themselves', 'nobody', 'each', 'there', 'been', 'out', 'into', 'together', 'therefore', 'often', 'alone', 'otherwise', 'that', 'nevertheless', 'or', 'anyhow', 'one', 'upon', 'was', 'below', 'made', 'can', 'fifteen', 'put', 'myself', 'while', 'whoever', 'you', 'whose', 'still', 'three', 'so', 'the', 'such', 'latter', 'must', 'really', 'name', 'have', 'top', 'twelve', 'just', 'full', 'hereafter', 'thereby', 'unless', 'its', 'her', 'whenever', 'yet', 'anywhere', 'becomes', 'ours', 'himself', 'say', 'whereby', 'become', 'five', 'him', 'where', 'about', 'a', 'for', 'any', 'enough', 'everything', 'moreover', 'not', 'quite', 'see', 'me', 'until', 'yourselves', 'due', 'became', 'forty', 'should', 'be', 'make', 'meanwhile', 'own', 'somewhere', 'ever', 'many', 'us', 'four', 'whither', 'though', 'twenty', 'thereupon', 'why', 'doing', 'all', 'is', 'two', 'over', 'amongst', 'please', 'around', 'down', 'much', 'under', 'after', 'anyway', 'herein', 'latterly', 'except', 'amoun
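Since a stop word list is just a set of strings, altering it to fit your data comes down to ordinary set operations. The sketch below uses a tiny stand-in list (`BASE_STOP_WORDS`) rather than spaCy's full one so that it runs on its own. The specific additions and removals are hypothetical choices for illustration, not recommendations.

```python
# A tiny stand-in stop list; in practice you might start from spaCy's STOP_WORDS.
BASE_STOP_WORDS = {"a", "the", "and", "to", "of", "in", "not", "see", "above"}

# Hypothetical domain tweaks: "not" can flip sentiment, so keep it in the text;
# "teacher" is assumed to appear in nearly every document, so treat it as noise.
custom_stop_words = (BASE_STOP_WORDS - {"not"}) | {"teacher"}


def remove_stop_words(tokens, stop_words):
    """Return the tokens whose lowercase form is not in stop_words."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = "The teacher did not see the plan".split()
filtered = remove_stop_words(tokens, custom_stop_words)
print(filtered)
```

Note that both "not" and "see" are in the default list printed above, which matters if negation or phrases like "See above" are meaningful in your analysis.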

### Tokenization  
Tokenization is used for splitting sentences into individual words and/or splitting paragraphs into sentences. Splitting sentences into individual words and punctuation is most often done by splitting across white space or punctuation. This might cause problems when you're working with abbreviations, possessives, or proper nouns that use punctuation (like O'Brien or Sackville-West).  Splitting paragraphs into sentences accurately is equally challenging, largely due to the ambiguity of punctuation in the English language. The period alone can be used to denote the end of a sentence, an abbreviation, or be included in an email address. To accurately identify the boundaries of sentences, a pre-trained algorithm, like NLTK's Punkt models, should be used.
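To make the pitfalls concrete, here is a rough sketch comparing two regular-expression tokenizers. Neither is what spaCy or NLTK does internally; both patterns are simplified assumptions meant only to show why naive splitting breaks names like O'Brien.

```python
import re

text = "Dr. O'Brien met Sackville-West."

# Naive: every run of word characters, and every single punctuation mark,
# becomes its own token.
naive = re.findall(r"\w+|[^\w\s]", text)

# Slightly better: allow apostrophes and hyphens inside a word.
better = re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", text)

print(naive)
print(better)
```

The second pattern keeps "O'Brien" and "Sackville-West" whole, but it still separates "Dr" from its period, which is the kind of ambiguity that trained tokenizers are designed to handle.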

### Parts of Speech Tagging
Parts of Speech (POS) tags are useful for understanding the meaning of a sentence, or identifying speech patterns in text. POS tagging typically entails looking at the neighboring words using either a stochastic or rule-based method.  
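As a toy illustration of the rule-based idea, the sketch below tags each word from a small lexicon and then uses the previous word's tag to resolve ambiguous cases. The lexicon, the two rules, and the function name `tag` are all invented for this example. Real taggers (including spaCy's statistical one) are far more sophisticated.

```python
# Toy lexicon (hypothetical): each word maps to its possible tags, most likely first.
LEXICON = {
    "the": ["DET"],
    "we": ["PRON"],
    "will": ["AUX"],
    "plan": ["NOUN", "VERB"],
    "review": ["VERB", "NOUN"],
}


def tag(tokens):
    """Assign one tag per token, using the previous tag to resolve ambiguity."""
    tags = []
    for tok in tokens:
        options = LEXICON.get(tok.lower(), ["NOUN"])  # unknown words default to NOUN
        choice = options[0]
        prev = tags[-1] if tags else None
        # Contextual rules based on the neighboring (previous) word's tag.
        if prev == "DET" and "NOUN" in options:
            choice = "NOUN"
        elif prev in ("AUX", "PRON") and "VERB" in options:
            choice = "VERB"
        tags.append(choice)
    return tags


first = tag("We will review the plan".split())
second = tag("We plan the review".split())
print(first)
print(second)
```

The same words "plan" and "review" receive different tags in the two sentences purely because of what precedes them.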

### Stemming
Stemming is a process where words are reduced to their root, removing whatever inflection is present.  This is usually done by removing the suffix. There are a variety of models available for stemming, including Porter and Snowball. The main drawback to stemming is that words are often overstemmed to the point of uselessness. This happens when words are structurally similar but have vastly different meanings (e.g., "universe" and "university" both stem to "univers"). Since stemming doesn't take into consideration the inflection it's removing, this technique can be useful if you're planning to match words based on origin but really not for much else.
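The overstemming problem is easy to reproduce with even a crude suffix stripper. The function below is not the Porter or Snowball algorithm. It is a deliberately naive sketch, with a hypothetical suffix list, that shows how blind suffix removal collapses unrelated words.

```python
# Hypothetical suffix list, checked in this order.
SUFFIXES = ["ity", "ing", "ed", "es", "s", "e"]


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for w in ["universe", "university", "observations"]:
    print(w, "->", naive_stem(w))
```

Both "universe" and "university" end up as "univers", even though they mean very different things.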

### Lemmatization
Lemmatization is an alternative to stemming that (at least for the NYCDOE text data that I work with) gets better results. It is a more intensive process that uses POS tags, which makes it more accurate than stemming.  For example, broken/ADJ yields broken but broken/VERB yields break. This increased accuracy comes at a slight time cost, but I've found this to be worth the trade-off even for very large datasets.  Generally, lemmatization is a better choice when trying to determine sentiment or do any sort of linguistic analysis. One thing to keep in mind is that, as you'll see below, lemmas are lowercase.
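The role of the POS tag can be sketched as a lookup keyed on the word and its tag together. This is only a simplified picture, and the table and the `lemmatize` helper below are hypothetical. Real lemmatizers combine much larger tables with rules.

```python
# Hypothetical lookup table keyed on (lowercased word, POS tag).
LEMMA_TABLE = {
    ("broken", "VERB"): "break",
    ("broken", "ADJ"): "broken",
    ("meetings", "NOUN"): "meeting",
}


def lemmatize(word, pos):
    """Return the lemma for (word, pos), falling back to the lowercased word."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())


print(lemmatize("broken", "ADJ"))
print(lemmatize("Broken", "VERB"))
```

The same surface form maps to different lemmas depending on its tag, which a suffix-stripping stemmer has no way to do.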

In [43]:
tokens = []
lemma = []
pos = []

def keep(tok):
    # Drop punctuation, stop words, whitespace, and URLs
    return not (tok.is_punct or tok.is_stop or tok.is_space or tok.like_url)

# n_threads is omitted here: newer spaCy releases (2.1+) deprecate it
texts = APPRTIP['TIP_all_txt'].astype(str).values
for doc in nlp.pipe(texts, batch_size=9845):
    if doc.is_parsed:  # newer spaCy versions (v3+) prefer doc.has_annotation("DEP")
        kept = [n for n in doc if keep(n)]
        tokens.append([n.text for n in kept])
        lemma.append([n.lemma_ for n in kept])
        pos.append([n.pos_ for n in kept])
    else:
        # Keep the result lists the same length as the original DataFrame
        # by adding a placeholder in case the parse fails
        tokens.append(None)
        lemma.append(None)
        pos.append(None)

APPRTIP['tip_tokens'] = tokens
APPRTIP['tip_lemmas'] = lemma
APPRTIP['tip_pos'] = pos

APPRTIP.head()

Unnamed: 0,TIPImprovementPlan1,TIPActionPlan,TIPTimelinePlan,TIPSupportPlan,TIPAssessmentPlan,TIP_all_txt,tip_tokens,tip_lemmas,tip_pos
0,1E: Designing Coherent Instruction: Design les...,1E: Designing Coherent Instruction:\r\n\t• Des...,See above,1) You will schedule inter-visitations to obse...,"In our second and third meetings, we will revi...",1E: Designing Coherent Instruction: Design les...,"[1E, Designing, Coherent, Instruction, Design,...","[1e, designing, coherent, instruction, design,...","[ADP, PROPN, PROPN, PROPN, PROPN, VERB, ADV, V..."
1,Based on prior observations from the 2015-2016...,For 1e:\r\n\r\nA) Establish regular time(s) to...,See action steps/activities for specifics,1) Choose PD Cycle to support the steps in you...,You are responsible for gathering and providin...,Based on prior observations from the 2015-2016...,"[Based, prior, observations, 2015, 2016, schoo...","[base, prior, observation, 2015, 2016, school,...","[VERB, ADJ, NOUN, NUM, NUM, NOUN, NOUN, NOUN, ..."
2,Based on prior observations and feedback from ...,For 1e:\r\n\r\n1. Establish regular time(s) to...,See action steps/activities for specific time ...,1) Choose to participate in a PD cycle to sup...,You are responsible for gathering and providin...,Based on prior observations and feedback from ...,"[Based, prior, observations, feedback, previou...","[base, prior, observation, feedback, previous,...","[VERB, ADJ, NOUN, VERB, ADJ, NOUN, VERB, VERB,..."
3,"After reviewing last year's TIP, MOSL assessme...",1:Addressing the learning needs of small group...,Refer to the timelines included at the end of ...,1. Mr. Louie will participate in 1:1 coaching...,1. In our next 2 meetings we will review the ...,"After reviewing last year's TIP, MOSL assessme...","[After, reviewing, year, 's, TIP, MOSL, assess...","[after, review, year, 's, tip, mosl, assessmen...","[ADP, VERB, NOUN, PART, PROPN, PROPN, NOUN, AD..."
4,1. Having learning activities aligned with the...,1. For improved alignment of learning activiti...,See above-ongoing,-Collaborate with your co-teachers to follow T...,1. Learning activities are aligned with the in...,1. Having learning activities aligned with the...,"[1, Having, learning, activities, aligned, ins...","[1, have, learn, activity, align, instructiona...","[PUNCT, VERB, VERB, NOUN, VERB, ADJ, NOUN, VER..."


### Word Counts 
One of the more basic, but still powerful, tools for feature engineering is to calculate word, sentence, punctuation, character, and keyword counts. Again, this is where that knowledge of your data will serve you well - you can create your own list of keywords and then calculate the count of those specific words to store as a feature.  

In [44]:
APPRTIP['totalwords'] = APPRTIP['TIP_all_txt'].str.split().str.len()
APPRTIP.head()

Unnamed: 0,TIPImprovementPlan1,TIPActionPlan,TIPTimelinePlan,TIPSupportPlan,TIPAssessmentPlan,TIP_all_txt,tip_tokens,tip_lemmas,tip_pos,totalwords
0,1E: Designing Coherent Instruction: Design les...,1E: Designing Coherent Instruction:\r\n\t• Des...,See above,1) You will schedule inter-visitations to obse...,"In our second and third meetings, we will revi...",1E: Designing Coherent Instruction: Design les...,"[1E, Designing, Coherent, Instruction, Design,...","[1e, designing, coherent, instruction, design,...","[ADP, PROPN, PROPN, PROPN, PROPN, VERB, ADV, V...",984
1,Based on prior observations from the 2015-2016...,For 1e:\r\n\r\nA) Establish regular time(s) to...,See action steps/activities for specifics,1) Choose PD Cycle to support the steps in you...,You are responsible for gathering and providin...,Based on prior observations from the 2015-2016...,"[Based, prior, observations, 2015, 2016, schoo...","[base, prior, observation, 2015, 2016, school,...","[VERB, ADJ, NOUN, NUM, NUM, NOUN, NOUN, NOUN, ...",653
2,Based on prior observations and feedback from ...,For 1e:\r\n\r\n1. Establish regular time(s) to...,See action steps/activities for specific time ...,1) Choose to participate in a PD cycle to sup...,You are responsible for gathering and providin...,Based on prior observations and feedback from ...,"[Based, prior, observations, feedback, previou...","[base, prior, observation, feedback, previous,...","[VERB, ADJ, NOUN, VERB, ADJ, NOUN, VERB, VERB,...",674
3,"After reviewing last year's TIP, MOSL assessme...",1:Addressing the learning needs of small group...,Refer to the timelines included at the end of ...,1. Mr. Louie will participate in 1:1 coaching...,1. In our next 2 meetings we will review the ...,"After reviewing last year's TIP, MOSL assessme...","[After, reviewing, year, 's, TIP, MOSL, assess...","[after, review, year, 's, tip, mosl, assessmen...","[ADP, VERB, NOUN, PART, PROPN, PROPN, NOUN, AD...",829
4,1. Having learning activities aligned with the...,1. For improved alignment of learning activiti...,See above-ongoing,-Collaborate with your co-teachers to follow T...,1. Learning activities are aligned with the in...,1. Having learning activities aligned with the...,"[1, Having, learning, activities, aligned, ins...","[1, have, learn, activity, align, instructiona...","[PUNCT, VERB, VERB, NOUN, VERB, ADJ, NOUN, VER...",268
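The keyword counts mentioned above can be sketched in a similar way. The `KEYWORDS` set and the `keyword_counts` helper are hypothetical and shown on a made-up sentence. With real data you would choose keywords based on what you know about the text.

```python
import re
from collections import Counter

# Hypothetical keywords of interest.
KEYWORDS = {"observation", "feedback", "coaching"}


def keyword_counts(text, keywords):
    """Count exact, case-insensitive matches of each keyword in text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w in keywords)
    return {k: counts[k] for k in sorted(keywords)}


sample = "Feedback from the observation will guide coaching. More feedback follows."
result = keyword_counts(sample, KEYWORDS)
print(result)
```

Because the match is exact, "observations" would not count toward "observation". Counting over the lemma column rather than the raw text is one way around that. The function could be applied to each row with something like `APPRTIP['TIP_all_txt'].apply(lambda t: keyword_counts(t, KEYWORDS))`.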


## Conclusion  

While this is definitely not an exhaustive list of pre-processing techniques, preparing raw text data for analysis is a complicated process which requires the analyst to choose the optimal tools given both the data and the question being asked. Packages like `spaCy` and `NLTK` offer some great off-the-shelf functions, though you may need to manually alter the default parameters or lists for best results. Once you've prepped your data, you can go on to apply a variety of machine learning techniques, depending on the questions you're asking of the text data.