# Capstone Project: Improving NLTK's Lemmatization Process

Submitted by Shannon Bingham
February 2019


## Executive Summary
The Natural Language Toolkit (NLTK) is a widely used open source solution for computer processing and analysis of human language. The toolkit includes many useful utilities, including an interface to the WordNet lexical reference database. When I used the toolkit's stemming and lemmatizing programs for a Natural Language Processing (NLP) classification project at General Assembly, I grew curious about two things: the data being created and the logic running under the hood. Those questions led me to ask how I could improve the quality of my input data, especially since the sheer number of features in an NLP project makes it hard to see what the data looks like.

For my capstone project, I decided to improve the NLTK interface to WordNet. My work on this project led to many interesting insights about electronic dictionaries as well as about language processing in general. Over the course of the project, I became more deeply interested in NLP, an area of data science that is growing rapidly. The changes that I made to the toolkit lay a foundation that gives greater control over the lemmatization process and, thus, more opportunity to improve data. The changes will be available for implementation anywhere.

## Notebook Description
This notebook includes all the code I used to run and time the lemmatization process using the production NLTK version. Please note the Setup cell, which drives the variable substitutions used in the file names.

### Set up environment.

In [1]:
# Import libraries and modules.
import numpy  as np
import pandas as pd

import random
import time

# Import Natural Language Toolkit modules.
from nltk.stem     import WordNetLemmatizer

# Increase number of columns that can be viewed in notebook.
pd.set_option('display.max_columns', 500)

# Set random seed for reproducibility.
random.seed(42)

In [2]:
# Specify the number of reviews of each type (pos and neg) being sampled.
n = 5000

# Specify whether the lemmas are created by the production version of NLTK's WordNet interface (True or False).
prod = True
if prod:
    env = 'prod'
else:
    env = 'dev'

# Set file location for input file.
tokens_csv = f'./data/tokens_s{n*2}.csv'

# Set file location for output files.
lemmas_csv = f'./data/lemmas_{env}_s{n*2}.csv'
time_csv = f'./data/lemmas_{env}_s{n*2}_time.csv'

# Print messages.
print(f'** The tokenized  data will be loaded from "{tokens_csv}". **')
print()
print(f'** The lemmatized data will be saved in    "{lemmas_csv}". **')
print()
print(f'** The elapsed time data will be saved in  "{time_csv}". **')

** The tokenized  data will be loaded from "./data/tokens_s10000.csv". **

** The lemmatized data will be saved in    "./data/lemmas_prod_s10000.csv". **

** The elapsed time data will be saved in  "./data/lemmas_prod_s10000_time.csv". **
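
The naming scheme in the setup cell can also be wrapped in a small function, which makes it easier to generate the `prod` and `dev` paths side by side. The following is only a sketch: `build_paths` and its parameter names are hypothetical and do not appear in the original notebook.

```python
from pathlib import Path


def build_paths(n_per_class, prod, data_dir="./data"):
    """Return the input and output CSV paths for one run (hypothetical helper)."""
    env = "prod" if prod else "dev"
    sample_size = n_per_class * 2  # positive plus negative reviews
    base = Path(data_dir)
    return {
        "tokens": base / f"tokens_s{sample_size}.csv",
        "lemmas": base / f"lemmas_{env}_s{sample_size}.csv",
        "time": base / f"lemmas_{env}_s{sample_size}_time.csv",
    }


paths = build_paths(n_per_class=5000, prod=True)
for name, path in paths.items():
    print(f"{name}: {path}")
```

Calling the function with `prod=False` yields the matching `dev` file names, so the two runs cannot overwrite each other's output.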


### Load data.

In [3]:
# Load the data.
df = pd.read_csv(tokens_csv)

# Take a look.
df.head()

Unnamed: 0,label,tokens
0,1,sergio martino is the case of the scorpion is ...
1,1,this is a very good made for tv film it depict...
2,1,this is not a love song is a brilliant example...
3,1,i must admit at first i was not expecting anyt...
4,1,those individuals familiar with asian cinema a...


In [4]:
# Take a look at the data.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9986 entries, 0 to 9985
Data columns (total 2 columns):
label     9986 non-null int64
tokens    9986 non-null object
dtypes: int64(1), object(1)
memory usage: 156.1+ KB
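
The setup cell implies 10,000 sampled reviews (5,000 per label), while `df.info()` reports 9,986 rows. This notebook does not say why the counts differ. A quick check such as the sketch below can make the gap and the label balance explicit; `summarize_sample` is a hypothetical helper, and the small frame is made-up data standing in for the real tokens file.

```python
import pandas as pd


def summarize_sample(df, n_per_class):
    """Compare the expected sample size with what was actually loaded."""
    expected = n_per_class * 2
    return {
        "expected_rows": expected,
        "actual_rows": len(df),
        "missing_rows": expected - len(df),
        "label_counts": df["label"].value_counts().to_dict(),
    }


# Tiny made-up frame standing in for the real token data.
demo = pd.DataFrame({"label": [1, 1, 0], "tokens": ["a b", "c d", "e f"]})
print(summarize_sample(demo, n_per_class=2))
```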


## Lemmatization 

**Process:** 

- Lemmatize the tokens:
    - Use the production version of NLTK's WordNet interface to obtain lemmas.

In [5]:
# Get time at beginning of lemmatization.
t_begin = time.time()

# Instantiate Lemmatizer.
lemmatizer = WordNetLemmatizer()

# Initialize list for elapsed time data.
time_data = []

# Loop through the dataframe.
for i in range(df.shape[0]):
    
    # Start the clock.
    t0 = time.time()
   
    # Lemmatize words.
    lemmas = [lemmatizer.lemmatize(w) for w in df.iloc[i]['tokens'].split()]
    
    # Stop the clock.
    t_end = time.time()
    
    # Record the token count and elapsed time for this review.
    time_data.append([len(lemmas), (t_end - t0)]) 
    
    # Join the lemmas back into a single string and save it.
    df.loc[i,'lemmas'] = " ".join(lemmas) 
    
print('Total elapsed time processing lemmatization was ',(time.time() - t_begin))

# Take a look.
df.head()

Total elapsed time processing lemmatization was  23.5697078704834


Unnamed: 0,label,tokens,lemmas
0,1,sergio martino is the case of the scorpion is ...,sergio martino is the case of the scorpion is ...
1,1,this is a very good made for tv film it depict...,this is a very good made for tv film it depict...
2,1,this is not a love song is a brilliant example...,this is not a love song is a brilliant example...
3,1,i must admit at first i was not expecting anyt...,i must admit at first i wa not expecting anyth...
4,1,those individuals familiar with asian cinema a...,those individual familiar with asian cinema a ...
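
In the output above, "was" becomes "wa" while "individuals" becomes "individual". `WordNetLemmatizer.lemmatize` accepts an optional `pos` argument that defaults to noun, and the loop above never sets it, so every token is treated as a noun. That is one likely reason verbs such as "was" come out oddly. The sketch below shows one way to pass a part of speech per token. It assumes the tokens have already been tagged with Penn Treebank tags (for example by a tagger such as `nltk.pos_tag`, which this notebook does not use), and both function names are hypothetical.

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, which is also the lemmatizer's default


def lemmatize_tagged(tagged_tokens, lemmatizer):
    """Lemmatize (word, tag) pairs, passing each word's part of speech."""
    return [lemmatizer.lemmatize(word, penn_to_wordnet_pos(tag))
            for word, tag in tagged_tokens]
```

With the WordNet data installed, `lemmatize_tagged([("was", "VBD")], WordNetLemmatizer())` would look up "was" as a verb rather than as a noun. Tagging adds its own processing time, so any comparison with the timings in this notebook should account for that.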


In [6]:
# Keep only needed columns.
df = df[['label', 'lemmas']]

# Verify update.
df.head()

Unnamed: 0,label,lemmas
0,1,sergio martino is the case of the scorpion is ...
1,1,this is a very good made for tv film it depict...
2,1,this is not a love song is a brilliant example...
3,1,i must admit at first i wa not expecting anyth...
4,1,those individual familiar with asian cinema a ...


In [7]:
# Build a dataframe from the time data.
time_df = pd.DataFrame(time_data, columns=['lemma count', 'elapsed_time'])

# Take a look.
time_df.head()

Unnamed: 0,lemma count,elapsed_time
0,410,1.62495
1,190,0.001356
2,264,0.001319
3,150,0.000828
4,528,0.002244
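
The first row took about 1.6 seconds for 410 tokens, while the other rows took a few milliseconds each. A plausible explanation, stated here as an assumption rather than something measured in this notebook, is that NLTK loads the WordNet data lazily on the first call. The sketch below uses the five rows shown above to illustrate how much that single row shifts a per-token average; `seconds_per_token` is a hypothetical helper.

```python
# (lemma count, elapsed seconds) pairs copied from the output above.
sample = [(410, 1.62495), (190, 0.001356), (264, 0.001319),
          (150, 0.000828), (528, 0.002244)]


def seconds_per_token(rows):
    """Total elapsed time divided by total tokens processed."""
    total_tokens = sum(count for count, _ in rows)
    total_time = sum(elapsed for _, elapsed in rows)
    return total_time / total_tokens


with_first = seconds_per_token(sample)
without_first = seconds_per_token(sample[1:])
print(f"all rows:          {with_first:.6f} s/token")
print(f"first row dropped: {without_first:.6f} s/token")
```

When comparing the `prod` and `dev` timings, it may be worth reporting both figures, or lemmatizing one throwaway word before the timed loop so that the start-up cost falls outside the measurements.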


### Save data.

In [8]:
# Save the NLP data.
df.to_csv(lemmas_csv, encoding='utf-8', index=False)

# Save the elapsed time data.
time_df.to_csv(time_csv, encoding='utf-8', index=False)
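
Because both files are written with `index=False`, each CSV should contain only the named columns. A round-trip check like the sketch below can confirm that a saved frame reads back with the same columns and row count. It writes to an in-memory buffer instead of the real output paths, and `roundtrip_ok` is a hypothetical helper.

```python
import io

import pandas as pd


def roundtrip_ok(df):
    """Write df to CSV in memory, read it back, and compare shape and columns."""
    buffer = io.StringIO()
    df.to_csv(buffer, index=False)
    buffer.seek(0)
    reloaded = pd.read_csv(buffer)
    return list(reloaded.columns) == list(df.columns) and len(reloaded) == len(df)


# Small made-up frame with the same column names as the saved lemma data.
demo = pd.DataFrame({"label": [1, 0], "lemmas": ["this is a film", "i wa not"]})
print(roundtrip_ok(demo))
```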