# NLP Pre-Processing
Liz Gagne  
04/05/2018

## The Purpose of NLP Pre-Processing
Natural language processing (NLP) is often thought of as one of the main areas in Artificial Intelligence.  NLP techniques are at the core of AI-based products we use every day - chat bots, Google Translate, article summarizers, and the like.  However, NLP actually sits at the crossroads of AI/CS and computational linguistics - its applications are more widespread than the obvious ones listed above.  NLP techniques allow us to derive things as complex as sentiment from text data, or to find patterns in text for any number of applications (fraud detection, topic segmentation, etc.).

NLP is characterized as a difficult problem in computer science, due mostly to the ambiguity of human language. Human speech is seldom precise or direct.  Understanding natural language means you need to understand the concepts beneath the words, how they go together, and how the words/order/concepts come together to create meaning.  

Before embarking on any NLP technique (e.g. sentiment analysis) we need to make sure the text data is in the proper format.
If not, the text won't be accepted by any models or processes.  Transforming your text data into something that an algorithm is able to ingest can be complicated, and it's helpful to have a solid grasp of the text data you're working with.  Generally, there are four stages within NLP pre-processing:
- ##### Cleaning  
Just like with non-text data, cleaning involves excluding the irrelevant or corrupt data points.  In NLP, this typically consists of removing stop words, punctuation, and other extraneous text.  Other cleaning tasks might involve dealing with capitalization rules, or other non-alphanumeric characters.
- ##### Annotation  
Annotation typically includes things like parts-of-speech (POS) tagging, and is generally thought of as the application of a scheme to text data.
- ##### Normalization  
The translation or mapping of text within the scheme through stemming, lemmatization, or another method of standardization.
- ##### Analysis  
Applying basic statistical techniques to manipulate the data for more in-depth analysis.

## Tools and Methods
There are a lot of methods and tools available for pre-processing text data. This article is meant to give you a starting point, and is not an exhaustive list of all the options available. As with all data analysis, the analyst must understand the drawbacks and best uses of each technique and choose a method appropriate for the given dataset. The two main Python packages for NLP are `spaCy` and `NLTK` - both have pros and cons.  `NLTK` is highly customizable, but wasn't built to be quick and simple.  `spaCy`, on the other hand, was designed specifically with efficiency in mind and as such is quick and easy to use.  

In [10]:
import pandas as pd
import pyodbc
import spacy
import nltk
from nltk import word_tokenize
# spaCy 2.x and later expose the English class and stop words under spacy.lang.en
# (the older spacy.en module path raises ModuleNotFoundError on those versions)
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS
import en_core_web_sm  # or any other model you downloaded via spacy download or pip

parser = English()
nlp = en_core_web_sm.load()

Let's pull in some TIP data to work with.  
By using pandas we can maintain the tabular structure of the data. This is especially helpful if you're used to working in SQL or SAS.

In [11]:
# Create the connection to all dbs
cnxn = pyodbc.connect('DRIVER={ODBC Driver 11 for SQL Server};SERVER=ES11vADOSQL006;DATABASE=master;Trusted_Connection=yes;')

In [12]:
# Pull data from APPRTIP
#Create an additional column with all text concatenated
sql3 = """
SELECT EmployeeID, FiscalYear, TIPImprovementPlan1, TIPActionPlan, TIPTimelinePlan, TIPSupportPlan, TIPAssessmentPlan,(TIPImprovementPlan1 + ' ' + TIPActionPlan + ' ' + TIPTimeLinePlan + ' ' + TIPSupportPlan + ' ' + TIPAssessmentPlan) as TIP_all_txt
FROM [APPR_EXT].[dbo].[APPRTIP]
where IsSubmitted = 'Y' and TIPEndedAppeal = 'N'
"""
APPRTIP = pd.read_sql(sql3, cnxn)  # load the query result into a pandas dataframe called APPRTIP

We can print the first 5 rows of data to make sure our dataframe looks like we expected it to.

In [13]:
APPRTIP.head()

| | EmployeeID | FiscalYear | TIPImprovementPlan1 | TIPActionPlan | TIPTimelinePlan | TIPSupportPlan | TIPAssessmentPlan | TIP_all_txt |
|---|---|---|---|---|---|---|---|---|
| 0 | 469849 | 2015 | 1) Developing strategies for consistently moni... | 1.\tFor Developing strategies for consistently... | Refer to the timelines included at the end of ... | 1)\tA Ramapo consultant will support you with ... | In our second and third meetings, we will revi... | 1) Developing strategies for consistently moni... |
| 1 | 469849 | 2017 | 1E: Designing Coherent Instruction: Design les... | 1E: Designing Coherent Instruction:\r\n\t• Des... | See above | 1) You will schedule inter-visitations to obse... | In our second and third meetings, we will revi... | 1E: Designing Coherent Instruction: Design les... |
| 2 | 964702 | 2015 | 1) Developing strategies for consistently moni... | 1.\tFor Developing strategies for consistently... | Refer to the timelines included at the end of ... | 1) Work with Deborah Flaum, PS15 Math coach, t... | In our second and third meetings, we will revi... | 1) Developing strategies for consistently moni... |
| 3 | 1408117 | 2016 | 1)\tDesigning coherent instruction by differen... | 1.\tFor Designing coherent instruction by diff... | Refer to the timelines included at the end of ... | 1) A Gifted and Talented coach will support yo... | In our second and third meetings, we will revi... | 1)\tDesigning coherent instruction by differen... |
| 4 | 1418900 | 2015 | 1) Developing strategies for consistently moni... | 1.\tFor Developing strategies for consistently... | Refer to the timelines included at the end of ... | 1)\tA Ramapo consultant will support you with ... | In our second and third meetings, we will revi... | 1) Developing strategies for consistently moni... |


### Capitalization
Text data typically contains various capitalizations - the beginning of sentences, proper nouns, etc.  The most common approach is to reduce everything to lower case, though this can sometimes affect the fidelity of your data - changing something like "US" to "us" can alter the meaning or affect how the string is tagged (this is where inside knowledge of the data you're working with and the analysis you're running comes in handy).
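
The "US" versus "us" problem can be sketched in a few lines. This is a minimal illustration rather than a `spaCy` or `NLTK` feature: the `PROTECTED` set and the `lowercase_except` helper are hypothetical names, and the sketch assumes words are separated by whitespace with no punctuation attached.

```python
# Hypothetical set of tokens whose capitalization carries meaning in this data.
PROTECTED = {"US", "IT", "PS15"}


def lowercase_except(text, protected=PROTECTED):
    """Lowercase every whitespace-separated word unless it is in `protected`."""
    return " ".join(
        word if word in protected else word.lower()
        for word in text.split()
    )


sample = "The US team will visit PS15 with us"
normalized = lowercase_except(sample)
print(normalized)  # the US team will visit PS15 with us
```

Which tokens belong in the protected set depends entirely on the data and the analysis being run.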
  
### Stop Words  
Most words within text data are connectors, which do little to show the subject, object, or other information within a sentence. Depending on the type of analysis you're running, excluding these stop words can be a necessary step.  Stop words are equivalent to noise within the data.  There are pre-built stop word lists out there, which can be used as is or altered to fit your particular data.
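
As a rough sketch of using a stop word list "as is" versus altering it, the snippet below filters the same tokens with two lists. The tiny `BASE_STOP_WORDS` set and the choice of "teacher" as a domain-specific stop word are assumptions made for illustration; a real list such as spaCy's `STOP_WORDS` is much longer.

```python
# A tiny illustrative stop list (an assumption); real lists hold a few hundred entries.
BASE_STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "will", "you", "with"}


def remove_stop_words(tokens, stop_words):
    """Return the tokens whose lowercase form is not in `stop_words`."""
    return [tok for tok in tokens if tok.lower() not in stop_words]


# Tailor the list to the data: add a word that appears in nearly every record
# and drop one that matters for this particular analysis.
custom_stop_words = (BASE_STOP_WORDS | {"teacher"}) - {"with"}

tokens = "The teacher will work with a math coach".split()
base_result = remove_stop_words(tokens, BASE_STOP_WORDS)
custom_result = remove_stop_words(tokens, custom_stop_words)
print(base_result)    # ['teacher', 'work', 'math', 'coach']
print(custom_result)  # ['work', 'with', 'math', 'coach']
```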
  
### Tokenization  
Tokenization is used for splitting sentences into individual words and/or splitting paragraphs into sentences. Splitting sentences into individual words and punctuation is most often done by splitting across white space or punctuation. This might cause problems when you're working with abbreviations, possessives, or proper nouns that use punctuation (like O'Brien or Sackville-West).  Splitting paragraphs into sentences accurately is equally challenging, largely due to the ambiguity of punctuation in the English language. The period alone can be used to denote the end of a sentence, an abbreviation, or be included in an email address. To accurately identify the boundaries of sentences a pre-trained algorithm, like NLTK's Punkt models, should be used.
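
The pitfalls described above are easy to reproduce with plain regular expressions. The sketch below uses only the standard library, not NLTK or spaCy; the function names and patterns are illustrative assumptions. It compares a naive word splitter with one that keeps internal apostrophes and hyphens, then shows a naive sentence splitter tripping over an abbreviation.

```python
import re


def naive_tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def tokenize(text):
    """Like naive_tokenize, but keep apostrophes and hyphens inside a word."""
    return re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", text)


def naive_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text)


text = "Mr. O'Brien met Vita Sackville-West."
print(naive_tokenize(text))
# ['Mr', '.', 'O', "'", 'Brien', 'met', 'Vita', 'Sackville', '-', 'West', '.']
print(tokenize(text))
# ['Mr', '.', "O'Brien", 'met', 'Vita', 'Sackville-West', '.']
print(naive_sentences(text + " They talked."))
# ['Mr.', "O'Brien met Vita Sackville-West.", 'They talked.']
```

The last result splits after "Mr.", which is the kind of error a trained sentence tokenizer is meant to avoid.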

### Parts of Speech Tagging
Parts of Speech (POS) tags are useful for understanding the meaning of a sentence, or identifying speech patterns in text. POS tagging typically entails looking at the neighboring words using either a stochastic or rule-based method.  
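
To make "looking at the neighboring words" concrete, here is a toy rule-based tagger. It is only a sketch: the `LEXICON`, the tag names, and the single contextual rule are all hypothetical, and production taggers (such as the one in a spaCy model) are trained on large annotated corpora.

```python
# Hypothetical mini-lexicon: each word maps to the tags it can take.
LEXICON = {
    "the": ["DET"], "a": ["DET"],
    "we": ["PRON"], "you": ["PRON"],
    "will": ["AUX"],
    "plan": ["NOUN", "VERB"],
    "support": ["NOUN", "VERB"],
    "lessons": ["NOUN"],
}


def tag(tokens):
    """Tag tokens using the lexicon plus one contextual rule for ambiguous words."""
    tags = []
    for tok in tokens:
        options = LEXICON.get(tok.lower(), ["NOUN"])  # unknown words default to NOUN
        if len(options) == 1:
            tags.append(options[0])
            continue
        prev = tags[-1] if tags else None
        # Contextual rule: after a pronoun or auxiliary, prefer the verb reading;
        # otherwise fall back to the first tag listed in the lexicon.
        if prev in {"PRON", "AUX"} and "VERB" in options:
            tags.append("VERB")
        else:
            tags.append(options[0])
    return list(zip(tokens, tags))


print(tag("We will plan lessons".split()))
# [('We', 'PRON'), ('will', 'AUX'), ('plan', 'VERB'), ('lessons', 'NOUN')]
print(tag("The plan will support you".split()))
# [('The', 'DET'), ('plan', 'NOUN'), ('will', 'AUX'), ('support', 'VERB'), ('you', 'PRON')]
```

The word "plan" receives a different tag in each sentence purely because of the tag of the word before it.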

### Stemming
Stemming is a process where words are reduced to their root by removing whatever inflection is present.  This is done by stripping unnecessary characters, usually the suffix. There are a variety of models available for stemming, including Porter and Snowball. The results can be used to identify relationships and commonalities across data. The main drawback to stemming is that sometimes words are overstemmed to the point of uselessness. This happens when words are structurally similar but have vastly different meanings (e.g. "universe" and "university" both stem to "univers"). 
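
A crude suffix stripper is enough to show both the benefit and the overstemming problem. Note that this is not the Porter or Snowball algorithm: the `SUFFIXES` list and the `crude_stem` helper are simplified assumptions for illustration only.

```python
# Illustrative suffix list, checked in order; NOT the Porter or Snowball rule set.
SUFFIXES = ("ization", "ation", "ity", "ing", "ies", "ed", "es", "s", "e")


def crude_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for w in ["designing", "designed", "universe", "university"]:
    print(w, "->", crude_stem(w))
# designing -> design
# designed -> design
# universe -> univers
# university -> univers
```

The first pair shows related words collapsing to a shared stem, while the second pair shows two unrelated words collapsing to the same stem.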

### Lemmatization
Lemmatization is an alternative to stemming that (at least for the NYCDOE text data that I work with) gets better results. Lemmatization is a more intensive process involving POS tags, and is often more accurate than stemming because it maps each word to its dictionary form (lemma) rather than chopping off characters.  This increased accuracy comes at a slight time cost, so depending on your dataset and what you're looking to extract from the text, consider what trade-off is acceptable for you.  Generally, stemming is more appropriate for text queries whereas lemmatization is a better choice when trying to determine sentiment.
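
The role POS tags play in lemmatization can be illustrated with a small lookup keyed on both the word and its tag. The `LEMMA_TABLE` below is a hypothetical, hand-written stand-in; real lemmatizers draw on far larger dictionaries and rules.

```python
# Hypothetical lookup table keyed by (word, POS tag).
LEMMA_TABLE = {
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
    ("meetings", "NOUN"): "meeting",
    ("better", "ADJ"): "good",
    ("universities", "NOUN"): "university",
}


def lemmatize(word, pos):
    """Return the dictionary form for (word, pos), or the lowercased word if unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())


print(lemmatize("meeting", "VERB"))       # meet
print(lemmatize("meeting", "NOUN"))       # meeting
print(lemmatize("Universities", "NOUN"))  # university
```

Unlike a stemmer, the lookup returns a real word, and the same surface form can resolve differently depending on its tag.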

### Word Counts 
One of the more basic, but still powerful, tools for feature engineering is to calculate word, sentence, punctuation, and keyword counts. Again, this is where that knowledge of your data will serve you well - you can create your own list of keywords and then calculate the count of those specific words to store as a feature.  
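
A minimal sketch of these count features, using only the standard library, might look like the following. The `KEYWORDS` set and the `count_features` helper are hypothetical, and the sentence count is deliberately crude (it counts runs of terminating punctuation).

```python
import re
import string
from collections import Counter

# Hypothetical keyword list chosen from knowledge of the data.
KEYWORDS = {"coach", "observe", "feedback"}


def count_features(text, keywords=KEYWORDS):
    """Return simple count-based features for one piece of text."""
    words = re.findall(r"[a-z']+", text.lower())
    word_freq = Counter(words)
    return {
        "n_words": len(words),
        "n_sentences": len(re.findall(r"[.!?]+", text)),  # crude: terminator runs
        "n_punct": sum(ch in string.punctuation for ch in text),
        "keyword_counts": {kw: word_freq[kw] for kw in sorted(keywords)},
    }


sample = "A coach will observe you. The coach will give feedback!"
features = count_features(sample)
print(features)
```

A function like this could be applied to a text column with pandas' `Series.apply` and the results stored as new feature columns.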

## Conclusion  

This is definitely not an exhaustive list of pre-processing techniques. Preparing raw text data for analysis is a complicated process which requires the analyst to choose the optimal tools given both the data and the question being asked. Packages like `spaCy` and `NLTK` offer some great off-the-shelf functions, though you may need to manually alter the default parameters or lists for best results. Once you've prepped your data you can go on to apply a variety of machine learning techniques, depending on the questions you're asking of the text data.

It's easiest to process the text if it's not in the dataframe.  However, ultimately we will appreciate having the dataframe structure.  So let's pull the text out of the dataframe, do some processing, and then stick it back in.

In [12]:
# First, let's create 3 empty lists. This is where we'll hold the processed data
# until we merge it back in with the dataframe.
tokens = []
lemma = []
pos = []

# Next we push our text through the nlp pipe.
# (n_threads is omitted: it was deprecated in spaCy 2.1 and removed in v3.)
for doc in nlp.pipe(APPRTIP['TIP_all_txt'].astype(str).values, batch_size=9845):
    # doc.has_annotation("DEP") is the spaCy v3 replacement for doc.is_parsed
    if doc.has_annotation("DEP"):
        # Keep a word only if it is not punctuation, not a stop word, and not extra whitespace
        kept = [n for n in doc if not n.is_punct and not n.is_stop and not n.is_space]
        tokens.append([n.text for n in kept])
        lemma.append([n.lemma_ for n in kept])
        pos.append([n.pos_ for n in kept])
    else:
        # We want the lists of parsed results to have the same number of entries
        # as the original dataframe, so add some blanks in case the parse fails
        tokens.append(None)
        lemma.append(None)
        pos.append(None)

# Now we create new columns in our dataframe and populate them with the lists
APPRTIP['tip_tokens'] = tokens
APPRTIP['tip_lemmas'] = lemma
APPRTIP['tip_pos'] = pos

APPRTIP.head()

Unnamed: 0,SchoolDBN,EmployeeID,FiscalYear,TIPImprovementPlan1,TIPActionPlan,TIPTimeLinePlan,TIPSupportPlan,TIPAssessmentPlan,TIPMeeting1,TIPMeeting2,...,UpdateDate,SubmitUserID,SubmitUserRole,SubmitDate,TIPImprovementPlan2,TIPImprovementPlan3,TIP_all_txt,s_tokens_IP,s_lemmas_IP,s_pos_IP
0,01M015,469849,2015,1) Developing strategies for consistently moni...,1.\tFor Developing strategies for consistently...,Refer to the timelines included at the end of ...,1)\tA Ramapo consultant will support you with ...,"In our second and third meetings, we will revi...",2014-09-12,2015-01-22,...,2014-09-17 12:26:14.157,isanchez11,Principal,2014-09-15 22:40:25.483,,,1) Developing strategies for consistently moni...,,,
1,01M015,469849,2017,1E: Designing Coherent Instruction: Design les...,1E: Designing Coherent Instruction:\r\n\t• Des...,See above,1) You will schedule inter-visitations to obse...,"In our second and third meetings, we will revi...",2016-09-16,2017-01-12,...,2016-09-22 10:40:52.870,isanchez11,Principal,2016-09-22 10:33:39.267,,,1E: Designing Coherent Instruction: Design les...,,,
2,01M015,964702,2015,1) Developing strategies for consistently moni...,1.\tFor Developing strategies for consistently...,Refer to the timelines included at the end of ...,"1) Work with Deborah Flaum, PS15 Math coach, t...","In our second and third meetings, we will revi...",2014-09-12,2015-01-21,...,2014-09-16 08:07:24.520,isanchez11,Principal,2014-09-14 12:05:23.447,,,1) Developing strategies for consistently moni...,,,
3,01M015,1408117,2016,1)\tDesigning coherent instruction by differen...,1.\tFor Designing coherent instruction by diff...,Refer to the timelines included at the end of ...,1) A Gifted and Talented coach will support yo...,"In our second and third meetings, we will revi...",2015-09-21,2016-04-22,...,2016-06-23 15:17:46.090,isanchez11,Principal,2015-09-28 11:27:59.373,,,1)\tDesigning coherent instruction by differen...,,,
4,01M015,1418900,2015,1) Developing strategies for consistently moni...,1.\tFor Developing strategies for consistently...,Refer to the timelines included at the end of ...,1)\tA Ramapo consultant will support you with ...,"In our second and third meetings, we will revi...",2014-09-17,2015-01-16,...,2014-09-17 12:51:48.293,isanchez11,Principal,2014-09-17 12:19:46.157,,,1) Developing strategies for consistently moni...,,,
