<h1>Preprocessing</h1>
The sole purpose of this notebook is to pre-process all features so that they are readable by the model.

<h2>1. Import Data</h2>

In [3]:
import re
import operator
import numpy as np
import pandas as pd
from tqdm import tqdm
from collections import defaultdict

In [4]:
#loading data
train_df = pd.read_csv("train.csv")
resource_df = pd.read_csv("resources.csv")

Aggregating all resource rows by project id

In [5]:
#many rows have NaN in the description column; for the time being we fill those fields with "NA"
resource_df["description"] = resource_df["description"].fillna("NA")
#grouping on id: descriptions are joined with a space (a plain 'sum' would glue them together), quantity and price are summed
modified_resource_df = resource_df.groupby('id').agg({'description':' '.join, 'quantity':'sum', 'price':'sum'}).reset_index()
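A minimal sketch of the per-id aggregation on made-up rows may help here. The ids, item names, and prices below are hypothetical and only stand in for `resources.csv`; the sketch assumes descriptions should stay separated by a space once they are combined.

```python
import pandas as pd

# Hypothetical rows: two resource lines for project p1, one for p2
toy_resources = pd.DataFrame({
    "id": ["p1", "p1", "p2"],
    "description": ["iPod Nano", None, "Markers"],
    "quantity": [6, 1, 2],
    "price": [149.0, 9.5, 3.0],
})
toy_resources["description"] = toy_resources["description"].fillna("NA")

# One row per id: join the descriptions, add up quantity and price
toy_agg = toy_resources.groupby("id").agg(
    {"description": " ".join, "quantity": "sum", "price": "sum"}
).reset_index()
print(toy_agg)
```

Note that `price` is summed as listed, without multiplying by `quantity`, which mirrors the aggregation used in this notebook.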

Merging resource data with train data 

In [6]:
#join dataframes
train_df = pd.merge(train_df, modified_resource_df, on='id', how='left')
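As a hedged illustration of what the left join does, the sketch below merges two hypothetical toy frames. The `validate` and `indicator` arguments are optional safeguards that are not part of the original cell: `validate="many_to_one"` raises if an id appears more than once in the aggregated frame, and `indicator=True` flags training rows that found no matching resources.

```python
import pandas as pd

# Hypothetical stand-ins for train_df and modified_resource_df
toy_train = pd.DataFrame({"id": ["p1", "p2", "p3"],
                          "teacher_prefix": ["Ms.", "Mr.", "Mrs."]})
toy_agg = pd.DataFrame({"id": ["p1", "p2"],
                        "quantity": [7, 2],
                        "price": [158.5, 3.0]})

merged = pd.merge(toy_train, toy_agg, on="id", how="left",
                  validate="many_to_one", indicator=True)

# A left join never drops training rows; unmatched ids get NaN in the new columns
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), unmatched["id"].tolist())
```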

<h2>2. Categorical Features</h2>

For categorical features we follow the basic principles of text preprocessing:

1. lowercasing the text
2. removing punctuation and symbols (periods, ampersands, commas)
3. replacing spaces and hyphens with underscores


In [7]:
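#function to clean categorical data
def categorical_cleaning(given_data):
    #work on a copy so the original column is left untouched
    data = given_data.copy()
    data = data.str.lower()
    #regex=False treats every pattern as a literal string (as a regex, "." would match any character)
    data = data.str.replace(".", "", regex=False)
    data = data.str.replace(" & ", "_", regex=False)
    data = data.str.replace("-", "_", regex=False)
    data = data.str.replace(" the ", "", regex=False)
    data = data.str.replace(" ", "_", regex=False)
    data = data.str.replace(",", "_", regex=False)
    #collapse the double underscore left behind by ", "
    data = data.str.replace("__", "_", regex=False)
    return data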
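To make the order of the replacements concrete, here is a plain-string sketch of the same steps. The helper name `clean_label` is hypothetical and not part of the notebook; it applies the replacements to one string at a time, using three values that appear in the data.

```python
def clean_label(label):
    """Apply the same literal replacements, in the same order, to one string."""
    label = label.lower()
    replacements = [(".", ""), (" & ", "_"), ("-", "_"), (" the ", ""),
                    (" ", "_"), (",", "_"), ("__", "_")]
    for old, new in replacements:
        label = label.replace(old, new)
    return label

for raw in ["Mrs.", "Grades PreK-2", "Music & The Arts, Health & Sports"]:
    print(raw, "->", clean_label(raw))
# Mrs. -> mrs
# Grades PreK-2 -> grades_prek_2
# Music & The Arts, Health & Sports -> music_the_arts_health_sports
```

Because " & " is replaced before " the ", the word "the" in "Music & The Arts" is preceded by an underscore rather than a space and therefore survives in the cleaned label.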

<h3>2.1 teacher_prefix</h3>

In [8]:
#checking for unique categorical variables and missing values
print("Unique Values:",train_df["teacher_prefix"].unique())
print("Total rows having missing values:",len(train_df[train_df["teacher_prefix"].isnull()]))

#replacing NaN with "Mrs." because it is a high-frequency prefix
train_df["teacher_prefix"] = train_df["teacher_prefix"].fillna("Mrs.")
print("Total rows having missing values:",len(train_df[train_df["teacher_prefix"].isnull()]))

Unique Values: ['Ms.' 'Mrs.' 'Mr.' 'Teacher' 'Dr.' nan]
Total rows having missing values: 4
Total rows having missing values: 0
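Instead of hard-coding the fill value, it can be derived from the column itself. The following is a minimal sketch on a hypothetical toy Series, not on the real column, and it assumes that filling with the most frequent value is the intended strategy.

```python
import pandas as pd

# Hypothetical prefixes with one missing entry
prefixes = pd.Series(["Ms.", "Mrs.", "Mrs.", None, "Mr."])

# mode() ignores missing values; iloc[0] picks the first mode if there is a tie
most_common = prefixes.mode().iloc[0]
filled = prefixes.fillna(most_common)

print(most_common, filled.isnull().sum())
```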


In [9]:
#cleaning data
train_df["teacher_prefix_pre"] = categorical_cleaning(train_df["teacher_prefix"])
train_df["teacher_prefix_pre"].unique()

array(['ms', 'mrs', 'mr', 'teacher', 'dr'], dtype=object)

<h3>2.2 school_state</h3>

In [10]:
#checking for unique categorical variables and missing values
print("Unique Values:",train_df["school_state"].unique())
print("Total rows having missing values:",len(train_df[train_df["school_state"].isnull()]))
print("The number of codes:",len(train_df["school_state"].unique()))

Unique Values: ['NV' 'GA' 'UT' 'NC' 'CA' 'DE' 'MO' 'SC' 'IN' 'IL' 'VA' 'PA' 'NY' 'FL'
 'NJ' 'TX' 'LA' 'ID' 'OH' 'OR' 'MD' 'WA' 'MA' 'KY' 'AZ' 'MI' 'CT' 'AR'
 'WV' 'NM' 'WI' 'MN' 'OK' 'AL' 'TN' 'IA' 'KS' 'CO' 'DC' 'WY' 'NH' 'HI'
 'SD' 'MT' 'MS' 'RI' 'VT' 'ME' 'NE' 'AK' 'ND']
Total rows having missing values: 0
The number of codes: 51


In [11]:
#cleaning data
train_df["school_state_pre"] = categorical_cleaning(train_df["school_state"])
train_df["school_state_pre"].unique()

array(['nv', 'ga', 'ut', 'nc', 'ca', 'de', 'mo', 'sc', 'in', 'il', 'va',
       'pa', 'ny', 'fl', 'nj', 'tx', 'la', 'id', 'oh', 'or', 'md', 'wa',
       'ma', 'ky', 'az', 'mi', 'ct', 'ar', 'wv', 'nm', 'wi', 'mn', 'ok',
       'al', 'tn', 'ia', 'ks', 'co', 'dc', 'wy', 'nh', 'hi', 'sd', 'mt',
       'ms', 'ri', 'vt', 'me', 'ne', 'ak', 'nd'], dtype=object)

<h3>2.3 project_grade_category</h3>

In [12]:
#checking for unique categorical variables and missing values
print("Unique Values:",train_df["project_grade_category"].unique())
print("Total rows having missing values:",len(train_df[train_df["project_grade_category"].isnull()]))
print("The number of grade categories:",len(train_df["project_grade_category"].unique()))

Unique Values: ['Grades PreK-2' 'Grades 3-5' 'Grades 6-8' 'Grades 9-12']
Total rows having missing values: 0
The number of grade categories: 4


In [13]:
#cleaning data
train_df["project_grade_category_pre"] = categorical_cleaning(train_df["project_grade_category"])
train_df["project_grade_category_pre"].unique()

array(['grades_prek_2', 'grades_3_5', 'grades_6_8', 'grades_9_12'],
      dtype=object)

<h3>2.4 project_subject_categories</h3>

In [14]:
#checking for unique categorical variables and missing values
print("Unique Values:\n",train_df["project_subject_categories"].unique()[0:10])
print("Total rows having missing values:",len(train_df[train_df["project_subject_categories"].isnull()]))
print("The number of unique values:",len(train_df["project_subject_categories"].unique()))

Unique Values:
 ['Literacy & Language' 'Music & The Arts, Health & Sports'
 'Math & Science, Literacy & Language' 'Health & Sports'
 'Applied Learning, Literacy & Language' 'Math & Science'
 'Literacy & Language, Math & Science' 'Special Needs'
 'Applied Learning, Special Needs' 'Applied Learning, Music & The Arts']
Total rows having missing values: 0
The number of unique values: 51


In [15]:
#cleaning data
train_df["project_subject_categories_pre"] = categorical_cleaning(train_df["project_subject_categories"])
train_df["project_subject_categories_pre"].unique()[0:10]

array(['literacy_language', 'music_the_arts_health_sports',
       'math_science_literacy_language', 'health_sports',
       'applied_learning_literacy_language', 'math_science',
       'literacy_language_math_science', 'special_needs',
       'applied_learning_special_needs',
       'applied_learning_music_the_arts'], dtype=object)

<h3>2.5 project_subject_subcategories</h3>

In [16]:
#checking for unique categorical variables and missing values
print("Unique Values:\n",train_df["project_subject_subcategories"].unique()[0:10])
print("Total rows having missing values:",len(train_df[train_df["project_subject_subcategories"].isnull()]))
print("The number of unique values:",len(train_df["project_subject_subcategories"].unique()))

Unique Values:
 ['Literacy' 'Performing Arts, Team Sports'
 'Applied Sciences, Literature & Writing' 'Health & Wellness'
 'Character Education, Literature & Writing'
 'Early Development, Literature & Writing' 'Mathematics'
 'Literature & Writing, Mathematics' 'Literacy, Mathematics'
 'Character Education, Literacy']
Total rows having missing values: 0
The number of unique values: 407


In [17]:
#cleaning data
train_df["project_subject_subcategories_pre"] = categorical_cleaning(train_df["project_subject_subcategories"])
train_df["project_subject_subcategories_pre"].unique()[0:10]

array(['literacy', 'performing_arts_team_sports',
       'applied_sciences_literature_writing', 'health_wellness',
       'character_education_literature_writing',
       'early_development_literature_writing', 'mathematics',
       'literature_writing_mathematics', 'literacy_mathematics',
       'character_education_literacy'], dtype=object)

<h2>3. Text Features</h2>

The following columns are the text features.

<i>1. project_title</i>

<i>2. project_essay_1</i>

<i>3. project_essay_2</i>

<i>4. project_essay_3</i>

<i>5. project_essay_4</i>

<i>6. project_resource_summary</i>

<i>7. description</i>

Note:

1. All essay columns, along with project_resource_summary and project_title, will be merged into one text corpus, since together they describe the project's needs and details.
2. We ignore the description column, since the essay columns already cover what it contains.





<h4>With pre-trained embedding</h4>
<i>Here we preprocess the text for use with the "glove.840B.300d" embedding.</i>



In [19]:
#pre-saved glove embedding
import pickle
#the with-block closes the file once the embedding is loaded
with open("./PreprocessingFiles/glove_embedding.pkl", "rb") as f:
    gloveModel = pickle.load(f)

In [20]:
#merging the title, resource summary and essays into a single column
train_df["project_text"] = train_df["project_title"].map(str)  + " " + train_df["project_resource_summary"].map(str) + " " + \
                            train_df["project_essay_1"].map(str) + " " + train_df["project_essay_2"].map(str) + " " +\
                            train_df["project_essay_3"].map(str) + " " +train_df["project_essay_4"].map(str)
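Converting every column with `.map(str)` turns a missing essay into a literal placeholder word inside the merged text, which is why later cells strip tokens such as " nan ". One alternative, sketched below on hypothetical toy rows with a reduced column list, is to fill missing values with empty strings before joining. This is an assumption about a possible variant, not the approach used in this notebook.

```python
import pandas as pd

# Hypothetical rows; the first project has no third essay
toy = pd.DataFrame({
    "project_title": ["Super Sight Word Centers", "Flexible Seating"],
    "project_essay_1": ["Most of my students are at-risk.", "My class loves to move."],
    "project_essay_3": [float("nan"), "We need stools."],
})
text_cols = ["project_title", "project_essay_1", "project_essay_3"]

# Empty strings instead of missing values, then one space-joined string per row
joined = toy[text_cols].fillna("").apply(" ".join, axis=1).str.strip()
print(joined.tolist())
```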

In [21]:
train_df["project_text"].values[0]

'Super Sight Word Centers My students need 6 Ipod Nano\'s to create and differentiated and engaging way to practice sight words during a literacy station. Most of my kindergarten students come from low-income households and are considered \\"at-risk\\". These kids walk to school alongside their parents and most have never been further than walking distance from their house. For 80% of my students, English is not their first language or the language spoken at home. \\r\\n\\r\\nWhile my kindergarten kids have many obstacles in front of them, they come to school each day excited and ready to learn. Most students started the year out never being in a school setting. At the start of the year many had never been exposed to letters. Each day they soak up more knowledge and try their hardest to succeed. They are highly motivated to learn new things every day. We are halfway through the year and they are starting to take off. They know know all letters, some sight words, numbers to 20, and a ma

In [22]:
#Source: - https://en.wikipedia.org/wiki/Wikipedia:List_of_English_contractions
#some of the known contractions
contractions = {"'aight": 'alright', "ain't": 'am not', "amn't": 'am not', "aren't": 'are not', "can't": 'can not', "'cause": 
'because', "could've": 'could have', "couldn't": 'could not', "couldn't've": 'could not have', "daren't": 
'dare not', "daresn't": 'dare not', "dasn't": 'dare not', "didn't": 'did not', "doesn't": 'does not', 
"don't": 'do not', 'dunno': "don't know", "d'ye": 'do you', "e'er": 'ever', "everybody's": 'everybody is', 
"everyone's": 'everyone is', 'finna': 'fixing to', "g'day": 'good day', 'gimme': 'give me', "giv'n": 'given', 
'gonna': 'going to', "gon't": 'go not', 'gotta': 'got to', "hadn't": 'had not', "had've": 'had have', 
"hasn't": 'has not', "haven't": 'have not', "he'd": 'he had', "he'll": 'he will', "he's": 'he is', 
"he've": 'he have', "how'd": 'how did', 'howdy': 'how do you do', "how'll": 'how will', "how're": 'how are', 
"how's": 'how is', "I'd": 'I had', "I'd've": 'I would have', "I'll": 'I will', "I'm": 'I am', 
"I'm'a": 'I am about to', "I'm'o": 'I am going to', 'innit': 'is it not', "I've": 'I have', "isn't": 'is not', 
"it'd ": 'it would', "it'll": 'it will', "it's ": 'it is', 'iunno': "I don't know", "let's": 'let us', 
"ma'am": 'madam', "mayn't": 'may not', "may've": 'may have', 'methinks': 'me thinks', "mightn't": 'might not', 
"might've": 'might have', "mustn't": 'must not', "mustn't've": 'must not have', "must've": 'must have', 
"needn't": 'need not', 'nal': 'and all', "ne'er": 'never', "o'clock": 'of the clock', "o'er": 'over',
"ol'": 'old', "oughtn't": 'ought not', "'s": 'is', "shalln't": 'shall not', "shan't": 'shall not', 
"she'd": 'she would', "she'll": 'she will', "she's": 'she is', "should've": 'should have', 
"shouldn't": 'should not', "shouldn't've": 'should not have', "somebody's": 'somebody has', 
"someone's": 'someone has', "something's": 'something has', "so're": 'so are', "that'll": 'that will', 
"that're": 'that are', "that's": 'that is', "that'd": 'that would', "there'd": 'there would', 
"there'll": 'there will', "there're": 'there are', "there's": 'there is', "these're": 'these are', 
"they've": 'they have', "this's": 'this is', "those're": 'those are', "those've": 'those have', "'tis": 'it is', 
"to've": 'to have', "'twas": 'it was', 'wanna': 'want to', "wasn't": 'was not', "we'd": 'we would', 
"we'd've": 'we would have', "we'll": 'we will', "we're": 'we are', "we've": 'we have', "weren't": 'were not', 
"what'd": 'what did', "what'll": 'what will', "what're": 'what are', "what's": 'what does', "what've": 'what have',
"when's": 'when is', "where'd": 'where did', "where'll": 'where will', "where're": 'where are',
"where's": 'where is',"where've": 'where have', "which'd": 'which would', "which'll": 'which will', 
"which're": 'which are',"which's": 'which is', "which've": 'which have', "who'd": 'who would',
"who'd've": 'who would have', "who'll": 'who will', "who're": 'who are', "who'ves": 'who is', "who'": 'who have',
"why'd": 'why did', "why're": 'why are', "why's": 'why does', "willn't": 'will not', "won't": 'will not',
'wonnot': 'will not', "would've": 'would have', "wouldn't": 'would not', "wouldn't've": 'would not have',
"y'all": 'you all', "y'all'd've": 'you all would have', "y'all'd'n've": 'you all would not have',
"y'all're": 'you all are', "cause":"because","have't":"have not","cann't":"can not","ain't":"am not",
"you'd": 'you would', "you'll": 'you will', "you're": 'you are', "you've": 'you have', 'cannot': 'can not', 
'wont': 'will not', "You'": 'Am not', "Ain'": 'Am not', "Amn'": 'Am not', "Aren'": 'Are not',
"Can'": 'Because', "Could'": 'Could have', "Couldn'": 'Could not have', "Daren'": 'Dare not', 
"Daresn'": 'Dare not', "Dasn'": 'Dare not', "Didn'": 'Did not', "Doesn'": 'Does not', "Don'": "Don't know", 
"D'": 'Do you', "E'": 'Ever', "Everybody'": 'Everybody is', "Everyone'": 'Fixing to', "G'": 'Give me', 
"Giv'": 'Going to', "Gon'": 'Got to', "Hadn'": 'Had not', "Had'": 'Had have', "Hasn'": 'Has not', 
"Haven'": 'Have not', "He'": 'He have', "How'": 'How is', "I'": 'I have', "Isn'": 'Is not', "It'": "I don't know", 
"Let'": 'Let us', "Ma'": 'Madam', "Mayn'": 'May not', "May'": 'Me thinks', "Mightn'": 'Might not', 
"Might'": 'Might have', "Mustn'": 'Must not have', "Must'": 'Must have', "Needn'": 'And all', "Ne'": 'Never',
"O'": 'Old', "Oughtn'": 'Is', "Shalln'": 'Shall not', "Shan'": 'Shall not', "She'": 'She is', 
"Should'": 'Should have', "Shouldn'": 'Should not have', "Somebody'": 'Somebody has', "Someone'": 'Someone has', 
"Something'": 'Something has', "So'": 'So are', "That'": 'That would', "There'": 'There is',
"They'": 'They have', "This'": 'This is', "Those'": 'It is', "To'": 'Want to', "Wasn'": 'Was not',
"Weren'": 'Were not', "What'": 'What have', "When'": 'When is', "Where'": 'Where have', "Which'": 'Which have', 
"Who'": 'Who have', "Why'": 'Why does', "Willn'": 'Will not', "Won'": 'Will not', "Would'": 'Would have',
"Wouldn'": 'Would not have', "Y'": 'You all are',"What's":"What is","What're":"What are","what's":"what is",
"what're":"what are", "Who're":"Who are", "your're":"you are","you're":"you are", "You're":"You are",
"We're":"We are", "These'": 'These have', "we're":"we are","Why're":"Why are","How're":"How are ",
"how're ":"how are ","they're ":"they are ", "befo're":"before","'re ":" are ",'don"t ':"do not", 
"Won't ":"Will not ","could't":"could not", "would't":"would not", "We'": 'We have',"Hasn't":"Has not",
"n't":"not", 'who"s':"who is"}

#characters to keep
characters_keep = [']', '◾', '⅞', '☺', '✒', '″', '¬', '•', '=', '❣', '⚠', '©', '°', '}', '⅓', '"', '✏', '·',
                   '-', '?', '…', '|', '\x96', '.', '›', '¡', '℅', '$', '!', '_', '[', '✔', '{', '´', '’',
                   '—', '§', '“', '&', '*', '‘', '―', '¾', '●', '\\', '–', '¢', '>', '✂', '¨', '~', ':',
                   '”', '`', '™', '\x80', ',', '@', ';', '<', '(', '#', '❞', '¼', '◦', '❝', '%', '¿', "'",
                   '+', ')', '½', '®', '/', '→', '⅔', '^']

#characters to be removed
characters_remove = "🙏🎧‐﻿😩😋💛️😊🖍🤓‒📏💕😃😀🎶📚🤗💚🤔🏻􏰃‌📖😄💜📓‪💻🏾💗�­􏰁🎭􏰄😉􏰂👍🐣​🖇‑￼★❤■♥"
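The dictionary is applied token by token, so a key only matches when it equals a whole whitespace-delimited token exactly, including case and attached punctuation. The sketch below uses a hypothetical three-entry mapping and a hypothetical helper name to show this behaviour.

```python
def expand_contractions(text, mapping):
    """Replace every whitespace-delimited token that is a key of mapping."""
    return " ".join(mapping.get(token, token) for token in text.split())

# Hypothetical subset of the contraction dictionary
toy_mapping = {"can't": "can not", "we're": "we are", "it's": "it is"}

print(expand_contractions("we're sure it's fine", toy_mapping))
# we are sure it is fine
print(expand_contractions("We can't, sadly", toy_mapping))
# We can't, sadly   (the token is "can't," so the lookup misses)
```

This is why keys containing spaces can never match, and why regex fallbacks are still needed for contractions that carry punctuation.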

In [23]:
def text_preprocessing(data, contractions, remove_chars, add_chars):
    
    #translation table that deletes every character in remove_chars
    translate_table = dict((ord(char), None) for char in remove_chars) 
    cleaned = []
    for _, val in tqdm(enumerate(data.values)):
        #expand contractions found in the dictionary (exact token match)
        val = ' '.join(contractions.get(word, word) for word in val.split())
        #fallback rules for contractions that are not in the dictionary
        val = re.sub(r"n\'t", " not", val)
        val = re.sub(r"\'re", " are", val)
        val = re.sub(r"\'s", " is", val)
        val = re.sub(r"\'d", " would", val)
        val = re.sub(r"\'ll", " will", val)
        val = re.sub(r"\'t", " not", val)
        val = re.sub(r"\'ve", " have", val)
        val = re.sub(r"\'m", " am", val)
        #keep the trailing space so the next word is not glued to "am"
        val = re.sub(r"\"m ", " am ", val)
        #remove leftovers of missing essays and escaped control sequences
        val = val.replace(" nan "," ")
        val = val.replace("nannan", " ")
        val = val.replace('\\r', ' ')
        val = val.replace('\\n', ' ')
        val = val.replace('\\"', ' ')
        #remove unwanted characters
        val = val.translate(translate_table)
        #surround kept characters with spaces so they become separate tokens
        for char in add_chars:  
            if char in val:
                val = val.replace(char," "+char+" ")
                val = re.sub(r"\s+"," ",val)
        #adding preprocessed sentence
        val = re.sub(r"\s+"," ",val)
        cleaned.append(val.strip())
    
    #return a new Series instead of writing into data.values in place
    return pd.Series(cleaned, index=data.index)

In [26]:
train_df["project_text"] = text_preprocessing(train_df["project_text"], contractions, characters_remove, characters_keep)

182080it [02:01, 1504.39it/s]
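The two character-level steps, deleting unwanted symbols and padding the kept ones with spaces, can be sketched in isolation. The helper name and the example sentence are hypothetical; the idea is that a padded symbol becomes its own token, which makes it more likely that each token can be looked up in the embedding vocabulary.

```python
import re

def strip_and_pad(text, remove_chars, pad_chars):
    # Delete unwanted characters in a single pass
    text = text.translate({ord(ch): None for ch in remove_chars})
    # Surround each kept symbol with spaces so it becomes its own token
    for ch in pad_chars:
        text = text.replace(ch, f" {ch} ")
    # Collapse the extra whitespace introduced by the padding
    return re.sub(r"\s+", " ", text).strip()

print(strip_and_pad("Books★ for all, please!", "★", [",", "!"]))
# Books for all , please !
```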


In [24]:
#creating text vocab
def question_text_vocab(text):
    freq_dict = defaultdict(int)
    total_sent = text.apply(lambda x: x.split()).values
    for sent in total_sent:
        for token in sent:
            freq_dict[token] += 1
    return freq_dict
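The same frequency table can also be built with `collections.Counter`, which offers helpers such as `most_common`. This is only a sketch of an equivalent approach; the helper name and the two toy sentences are hypothetical.

```python
from collections import Counter

def count_tokens(texts):
    """Count whitespace-delimited tokens over an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts

toy_counts = count_tokens(["my students need books", "students love books"])
print(toy_counts.most_common(2))
```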

In [25]:
#checking how much of the text vocab is covered by the embedding
def coverage(vocab, embedding,print_statement=False):
    #Initializing values
    known_words = {}
    unknown_words = {}
    knownWordsVal = 0
    unknownWordsVal = 0
    #iterating words
    for word in vocab.keys():
        try:
            known_words[word] = embedding[word]
            knownWordsVal += vocab[word]
        except KeyError:
            #the word has no pre-trained vector
            unknown_words[word] = vocab[word]
            unknownWordsVal += vocab[word]
    
    if print_statement:
        #share of distinct vocab entries that have a vector
        print('Found {:.2%} of words in the embedding of the question text vocab'
           .format(len(known_words) / len(vocab)))
        #share of all word occurrences that have a vector
        print('Found {:.2%} of the words in the question text vocab'.
              format(knownWordsVal / (knownWordsVal + unknownWordsVal)))
    #out-of-vocabulary words, most frequent first
    unknown_words = sorted(unknown_words.items(), key=operator.itemgetter(1))[::-1]
    return unknown_words

In [27]:
text_vocab_tr = question_text_vocab(train_df["project_text"])
oov_glove = coverage(text_vocab_tr, gloveModel, print_statement=True)

Found 85.23% of words in the embedding of the question text vocab
Found 99.92% of the words in the question text vocab


As we can see, 99.92% of all word occurrences in the train text are now covered by our pre-trained embedding, along with 85.23% of the distinct vocabulary entries.
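The two percentages measure different things: the first counts distinct vocabulary entries, the second weights each entry by how often it occurs. A toy sketch (the vocabulary, the counts, and the helper name are all hypothetical) shows how a rare unknown word lowers the first figure much more than the second.

```python
def coverage_ratios(vocab, embedding):
    """Return (share of distinct words covered, share of word occurrences covered)."""
    known = [word for word in vocab if word in embedding]
    known_occurrences = sum(vocab[word] for word in known)
    return len(known) / len(vocab), known_occurrences / sum(vocab.values())

# Hypothetical frequency table and embedding lookup
toy_vocab = {"students": 8, "books": 1, "chromebookz": 1}
toy_embedding = {"students": [0.1], "books": [0.2]}

vocab_share, token_share = coverage_ratios(toy_vocab, toy_embedding)
print(f"{vocab_share:.2%} of distinct words, {token_share:.2%} of occurrences")
# 66.67% of distinct words, 90.00% of occurrences
```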

<h4>Without pre-trained embedding</h4>

In [29]:
#merging the title, resource summary and essays into a single column
train_df["project_text_wt_emb"] = train_df["project_title"].map(str)  + " " + train_df["project_resource_summary"].map(str) + " " + \
                                train_df["project_essay_1"].map(str) + " " + train_df["project_essay_2"].map(str) + " " +\
                                train_df["project_essay_3"].map(str) + " " +train_df["project_essay_4"].map(str)

In [30]:
#some defined stopwords
stopwords= ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",\
"you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', \
'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their',\
'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', \
'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', \
'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', \
'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',\
'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further',\
'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more',\
'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', \
's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', \
've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn',\
"hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn',\
"mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", \
'won', "won't", 'wouldn', "wouldn't"]

In [33]:
def text_preprocessing_without_embedding(data, contractions, stopwords):
    
    cleaned = []
    for _, val in tqdm(enumerate(data.values)):
        #expand contractions found in the dictionary (exact token match)
        val = ' '.join(contractions.get(word, word) for word in val.split())
        #fallback rules for contractions that are not in the dictionary
        val = re.sub(r"n\'t", " not", val)
        val = re.sub(r"\'re", " are", val)
        val = re.sub(r"\'s", " is", val)
        val = re.sub(r"\'d", " would", val)
        val = re.sub(r"\'ll", " will", val)
        val = re.sub(r"\'t", " not", val)
        val = re.sub(r"\'ve", " have", val)
        val = re.sub(r"\'m", " am", val)
        #keep the trailing space so the next word is not glued to "am"
        val = re.sub(r"\"m ", " am ", val)
        #remove leftovers of missing essays and escaped control sequences
        val = val.replace(" nan "," ")
        val = val.replace("nannan", " ")
        val = val.replace('\\r', ' ')
        val = val.replace('\\n', ' ')
        val = val.replace('\\"', ' ')
        #removing characters and punctuation
        val = re.sub('[^A-Za-z0-9]+', ' ', val)
        #removing stopwords
        val = ' '.join(e for e in val.split() if e.lower() not in stopwords)
        #adding preprocessed sentence
        val = re.sub(r"\s+"," ",val)
        cleaned.append(val.strip())
    
    #return a new Series instead of writing into data.values in place
    return pd.Series(cleaned, index=data.index)

In [34]:
train_df["project_text_wt_emb"] = text_preprocessing_without_embedding(train_df["project_text_wt_emb"], contractions, stopwords)

182080it [02:17, 1325.23it/s]
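The last two steps of this variant, dropping every non-alphanumeric character and filtering stopwords, can be sketched on a single sentence. The helper name and the three-word stopword set are hypothetical; a `frozenset` is used here because membership tests on a set do not scan the whole collection the way a list lookup does.

```python
import re

def strip_symbols_and_stopwords(text, stopword_set):
    # Replace every run of non-alphanumeric characters with one space
    text = re.sub(r"[^A-Za-z0-9]+", " ", text)
    # Keep tokens whose lowercase form is not a stopword
    return " ".join(tok for tok in text.split() if tok.lower() not in stopword_set)

# Hypothetical, much smaller stopword set
toy_stopwords = frozenset({"my", "the", "to"})
print(strip_symbols_and_stopwords("My students need the iPads to read!", toy_stopwords))
# students need iPads read
```

Since apostrophes are removed before the filter runs, stopword entries that contain an apostrophe can no longer match at that point.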


Saving pre-processed files

In [36]:
#saving data
train_df.to_csv("preprocessed_data.csv", index=False)
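One thing to keep in mind when this file is read back: by default `pd.read_csv` parses strings such as "NA" and empty fields as missing values. The sketch below round-trips a hypothetical two-row frame through an in-memory buffer to show the effect and one possible way around it, `keep_default_na=False`, which applies to every column.

```python
import io
import pandas as pd

# Hypothetical frame; "NA" is the placeholder used for missing descriptions
toy = pd.DataFrame({"id": ["p1", "p2"], "description": ["NA", "Markers"]})
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

buffer.seek(0)
default_read = pd.read_csv(buffer)                          # "NA" is parsed as NaN
buffer.seek(0)
literal_read = pd.read_csv(buffer, keep_default_na=False)   # "NA" stays a string

print(default_read["description"].isnull().sum())
print(literal_read["description"].tolist())
```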