# Processing Training Data

There are a lot of possible things we can do to this data to prep it for later steps. Before getting into it, here is my plan:
1. Make lowercase
2. Expand contractions
3. Noise Removal
4. Tokenization
5. Remove Stopwords
6. Simple normalization (if simple is possible)
7. Lemmatize

## Imports and Load Data

But first, we gotta import the modules and get the data in here.

In [1]:
import pandas as pd
import numpy as np

import contractions #for expanding contractions to full words
import re #we'll use especially for noise removal/scrubbing

#nltk = Natural Language Toolkit
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords


In [2]:
training_data = pd.read_csv('../data/interim/training_data.csv')

training_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8659 entries, 0 to 8658
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   question  8659 non-null   object
 1   schema    8659 non-null   object
dtypes: object(2)
memory usage: 135.4+ KB


In [3]:
training_data.head()

Unnamed: 0,question,schema
0,How many heads of the departments are older th...,department_management
1,"List the name, born state and age of the heads...",department_management
2,"List the creation year, name and budget of eac...",department_management
3,What are the maximum and minimum budget of the...,department_management
4,What is the average number of employees of the...,department_management


In [4]:
#I'm going to be adding columns to see the changes, so I'll swap the column order here
#.copy() makes this an independent DataFrame, so adding columns later is unambiguous
training_data = training_data[['schema','question']].copy()

In [5]:
training_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8659 entries, 0 to 8658
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   schema    8659 non-null   object
 1   question  8659 non-null   object
dtypes: object(2)
memory usage: 135.4+ KB


## Processing Work

### 1. Make Lowercase

In [6]:
training_data['lowercase'] = training_data['question'].str.lower()

In [7]:
training_data.head()

Unnamed: 0,schema,question,lowercase
0,department_management,How many heads of the departments are older th...,how many heads of the departments are older th...
1,department_management,"List the name, born state and age of the heads...","list the name, born state and age of the heads..."
2,department_management,"List the creation year, name and budget of eac...","list the creation year, name and budget of eac..."
3,department_management,What are the maximum and minimum budget of the...,what are the maximum and minimum budget of the...
4,department_management,What is the average number of employees of the...,what is the average number of employees of the...


### 2. Expand Contractions

Row 1443 contains "didnt'", so we can use it to test whether this works.

In [8]:
#break out contractions into their full words
training_data['no_contraction'] = training_data['lowercase'].apply(lambda x: [contractions.fix(word) for word in x.split()])

#join the words back into a single string to make tokenizing more straightforward
training_data['no_contraction'] = training_data['no_contraction'].apply(' '.join)

In [9]:
training_data.head()

Unnamed: 0,schema,question,lowercase,no_contraction
0,department_management,How many heads of the departments are older th...,how many heads of the departments are older th...,how many heads of the departments are older th...
1,department_management,"List the name, born state and age of the heads...","list the name, born state and age of the heads...","list the name, born state and age of the heads..."
2,department_management,"List the creation year, name and budget of eac...","list the creation year, name and budget of eac...","list the creation year, name and budget of eac..."
3,department_management,What are the maximum and minimum budget of the...,what are the maximum and minimum budget of the...,what are the maximum and minimum budget of the...
4,department_management,What is the average number of employees of the...,what is the average number of employees of the...,what is the average number of employees of the...


In [10]:
print(training_data.iloc[1443,2])
print(training_data.iloc[1443,3])

what are the ids of instructors who didnt' teach?
what are the ids of instructors who did not' teach?


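That worked! :) The contraction is expanded. The stray apostrophe is still there, but that is exactly the kind of thing noise removal is for.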
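To make it clearer what is going on with that row, here is a minimal sketch of dictionary-based contraction expansion using only the standard library. This is not how the `contractions` package is implemented: `CONTRACTION_MAP` and `expand_contractions` are hypothetical names I made up for illustration, and the mapping is deliberately tiny. It does show why the trailing apostrophe in "didnt'" survives: only the matched word gets replaced.

```python
import re

# Hypothetical, tiny mapping for illustration only; a real library ships a far larger one.
CONTRACTION_MAP = {
    "didn't": "did not",
    "didnt": "did not",
    "don't": "do not",
    "what's": "what is",
}

# Try longer keys first so "didn't" is preferred over "didnt" when both could match.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(key) for key in sorted(CONTRACTION_MAP, key=len, reverse=True))
    + r")\b"
)

def expand_contractions(text):
    """Replace every known contraction in text with its expanded form."""
    return _PATTERN.sub(lambda match: CONTRACTION_MAP[match.group(1)], text)

example = "what are the ids of instructors who didnt' teach?"
expanded = expand_contractions(example)
print(expanded)  # what are the ids of instructors who did not' teach?
```

The apostrophe sits outside the matched word, so it is left untouched, just like in the output above.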

### 3. Noise Removal

We don't need all that pesky punctuation garbage. Sorry, Oxford commas! You know I love you!

In [11]:
#define "scrubbing function"

def scrub_words(text):
    """Function for some basic scrubbing courtesy of Kavita Ganesan: https://github.com/kavgan/nlp-in-practice/blob/master/text-pre-processing/Text%20Preprocessing%20Examples.ipynb"""

    #remove html markup
    text = re.sub(r"(<.*?>)", "", text)

    #replace non-word characters (punctuation, symbols) and digits with a space
    text = re.sub(r"(\W|\d)", " ", text)

    #strip leading and trailing whitespace
    text = text.strip()
    return text

In [12]:
#apply the function to my latest column
#https://stackoverflow.com/questions/13331698/how-to-apply-a-function-to-two-columns-of-pandas-dataframe

training_data['scrubbed'] = training_data.apply(lambda x: scrub_words(x.no_contraction), axis=1)

training_data.head()

Unnamed: 0,schema,question,lowercase,no_contraction,scrubbed
0,department_management,How many heads of the departments are older th...,how many heads of the departments are older th...,how many heads of the departments are older th...,how many heads of the departments are older than
1,department_management,"List the name, born state and age of the heads...","list the name, born state and age of the heads...","list the name, born state and age of the heads...",list the name born state and age of the heads...
2,department_management,"List the creation year, name and budget of eac...","list the creation year, name and budget of eac...","list the creation year, name and budget of eac...",list the creation year name and budget of eac...
3,department_management,What are the maximum and minimum budget of the...,what are the maximum and minimum budget of the...,what are the maximum and minimum budget of the...,what are the maximum and minimum budget of the...
4,department_management,What is the average number of employees of the...,what is the average number of employees of the...,what is the average number of employees of the...,what is the average number of employees of the...
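Here is a small standalone check of what those regular expressions do to a single string. It is only a sketch: `scrub_and_collapse` is a hypothetical variant of `scrub_words`, and the extra step that collapses repeated spaces is my own addition, not part of the original function. Each removed character becomes a space, so "not' teach" would otherwise end up with two spaces in the middle.

```python
import re

def scrub_and_collapse(text):
    """Same substitutions as scrub_words, plus collapsing runs of whitespace (an added assumption)."""
    # remove html markup
    text = re.sub(r"(<.*?>)", "", text)
    # replace non-word characters and digits with a space
    text = re.sub(r"(\W|\d)", " ", text)
    # collapse runs of whitespace left behind by the replacements
    text = re.sub(r"\s+", " ", text)
    return text.strip()

print(scrub_and_collapse("what are the ids of instructors who did not' teach?"))
# what are the ids of instructors who did not teach

print(scrub_and_collapse("<b>how many</b> heads are older than 56?"))
# how many heads are older than

# Underscores count as word characters, so names like this survive intact.
print(scrub_and_collapse("department_management!"))
# department_management
```

Worth keeping in mind: digits are dropped too, so a question like "older than 56" loses the number.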


### 4. Tokenize

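With those contractions broken out, let's break the text into words for the future processing steps. Note that the cell below tokenizes `no_contraction` rather than `scrubbed`, so punctuation and digits still show up as tokens.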

In [13]:
training_data['tokenized'] = training_data['no_contraction'].apply(word_tokenize)
training_data.head()

Unnamed: 0,schema,question,lowercase,no_contraction,scrubbed,tokenized
0,department_management,How many heads of the departments are older th...,how many heads of the departments are older th...,how many heads of the departments are older th...,how many heads of the departments are older than,"[how, many, heads, of, the, departments, are, ..."
1,department_management,"List the name, born state and age of the heads...","list the name, born state and age of the heads...","list the name, born state and age of the heads...",list the name born state and age of the heads...,"[list, the, name, ,, born, state, and, age, of..."
2,department_management,"List the creation year, name and budget of eac...","list the creation year, name and budget of eac...","list the creation year, name and budget of eac...",list the creation year name and budget of eac...,"[list, the, creation, year, ,, name, and, budg..."
3,department_management,What are the maximum and minimum budget of the...,what are the maximum and minimum budget of the...,what are the maximum and minimum budget of the...,what are the maximum and minimum budget of the...,"[what, are, the, maximum, and, minimum, budget..."
4,department_management,What is the average number of employees of the...,what is the average number of employees of the...,what is the average number of employees of the...,what is the average number of employees of the...,"[what, is, the, average, number, of, employees..."
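The commas in the token lists above come from the input column, not from the tokenizer misbehaving. The sketch below uses a simple regex tokenizer as a stand-in to show the difference between tokenizing text that still has punctuation and text that has been scrubbed. `simple_tokenize` is a hypothetical helper and a big simplification; it is not NLTK's `word_tokenize` algorithm, and the two sample strings are made up to resemble the rows above.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

# Made-up samples: the same question before and after noise removal.
with_punctuation = "list the name, born state and age of the heads?"
scrubbed = "list the name born state and age of the heads"

print(simple_tokenize(with_punctuation))
# ['list', 'the', 'name', ',', 'born', 'state', 'and', 'age', 'of', 'the', 'heads', '?']

print(simple_tokenize(scrubbed))
# ['list', 'the', 'name', 'born', 'state', 'and', 'age', 'of', 'the', 'heads']
```

If punctuation-free tokens are what the later steps expect, tokenizing the `scrubbed` column instead would be one way to get them.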


### 5. Remove Stop Words

From what I understand, there are multiple ways to flag which words are stop words. I'll use the standard English stop words in the NLTK library.

In [14]:
stop_words = set(stopwords.words('english'))
training_data['stopwords_removed'] = training_data['tokenized'].apply(lambda x: [word for word in x if word not in stop_words])

training_data.head()

Unnamed: 0,schema,question,lowercase,no_contraction,scrubbed,tokenized,stopwords_removed
0,department_management,How many heads of the departments are older th...,how many heads of the departments are older th...,how many heads of the departments are older th...,how many heads of the departments are older than,"[how, many, heads, of, the, departments, are, ...","[many, heads, departments, older, 56, ?]"
1,department_management,"List the name, born state and age of the heads...","list the name, born state and age of the heads...","list the name, born state and age of the heads...",list the name born state and age of the heads...,"[list, the, name, ,, born, state, and, age, of...","[list, name, ,, born, state, age, heads, depar..."
2,department_management,"List the creation year, name and budget of eac...","list the creation year, name and budget of eac...","list the creation year, name and budget of eac...",list the creation year name and budget of eac...,"[list, the, creation, year, ,, name, and, budg...","[list, creation, year, ,, name, budget, depart..."
3,department_management,What are the maximum and minimum budget of the...,what are the maximum and minimum budget of the...,what are the maximum and minimum budget of the...,what are the maximum and minimum budget of the...,"[what, are, the, maximum, and, minimum, budget...","[maximum, minimum, budget, departments, ?]"
4,department_management,What is the average number of employees of the...,what is the average number of employees of the...,what is the average number of employees of the...,what is the average number of employees of the...,"[what, is, the, average, number, of, employees...","[average, number, employees, departments, whos..."
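Stop word removal is just a membership filter, which the sketch below reproduces on the first row using only the standard library. `STOP_WORDS` here is a hypothetical mini list (NLTK's English list is much longer), the token list is my reconstruction of row 0 from the output above, and the `keep_non_alpha` flag is an optional extra I added to show how leftover tokens like "56" and "?" could be dropped in the same pass.

```python
# Hypothetical mini stop word list for illustration; NLTK's English list is much longer.
STOP_WORDS = {"how", "of", "the", "are", "than", "what", "is", "and"}

def remove_stopwords(tokens, stop_words=STOP_WORDS, keep_non_alpha=True):
    """Drop stop words; optionally also drop tokens that are not purely alphabetic."""
    return [
        token
        for token in tokens
        if token not in stop_words and (keep_non_alpha or token.isalpha())
    ]

# Assumed tokens for row 0, reconstructed from the displayed output.
tokens = ["how", "many", "heads", "of", "the", "departments", "are", "older", "than", "56", "?"]

print(remove_stopwords(tokens))
# ['many', 'heads', 'departments', 'older', '56', '?']

print(remove_stopwords(tokens, keep_non_alpha=False))
# ['many', 'heads', 'departments', 'older']
```

Using a `set` for the stop words keeps each lookup fast, which matters a little more once this runs over every row.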
