<h1>Preprocessing for Thesis Work</h1>

Author: Joshua White

This notebook contains the code used to preprocess a CSV file for my CSCE 799 Experiment and my CSCE 623 Project. 

Sources:  
I used the following two articles to get started:  
https://towardsdatascience.com/the-basics-of-eda-with-candy-83b2e8ad9e63  
https://towardsdatascience.com/nlp-for-beginners-cleaning-preprocessing-text-data-ae8e306bef0f  
Useful source for pandas: https://www.geeksforgeeks.org/iterating-over-rows-and-columns-in-pandas-dataframe/  

First we will start with some Exploratory Data Analysis (EDA), or just looking at the data. In this notebook I will be looking at the nyc-jobs.csv file.


In [None]:
#Imports:
import pandas as pd
import nltk
from bs4 import BeautifulSoup
import string #Will be used for a list of punctuation

#NLTK Imports:
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

Now I'm going to use pandas to import the csv file into a data frame. 
*** IF YOU WANT TO CHANGE THE CSV FILENAME DO SO HERE *** The nyc-jobs.csv file will be what I'm initially working on and most comments in this notebook will reflect that. 

In [None]:
DFrame = pd.read_csv('nyc-jobs.csv')

<h2>Initial Look at the data</h2>

head() call just to peek at the top 5 rows of the file:

In [None]:
DFrame.head()

A shape call to look at the number of rows and columns in the data frame object:

In [None]:
DFrame.shape

<h2>Check for Missing Values</h2>

Next we can check how many missing values we have per column. To do this we call isnull() on the data frame, then .mean() to get the fraction of missing values in each column (multiply by 100 for a percentage), and finally .sort_values(ascending=False) to sort the result (you can leave the sort off if you prefer).

In [None]:
DFrame.isnull().mean().sort_values(ascending=False)
#Flip the comment if you want to sort the values or not
#DFrame.isnull().mean()
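
To make the output easier to read, here is a minimal sketch on a made-up three-row frame. The column names and values are hypothetical and are not taken from nyc-jobs.csv. isnull() gives True/False per cell and .mean() averages those per column, so the result is a fraction between 0 and 1 rather than a count.

```python
import pandas as pd

# Toy frame with hypothetical columns; None marks a missing value
toy = pd.DataFrame({
    "title": ["Analyst", None, "Engineer"],
    "salary": [50000.0, None, None],
    "contact": [None, None, None],
})

# Fraction of missing values per column, largest first
missing_fraction = toy.isnull().mean().sort_values(ascending=False)
print(missing_fraction)
```

A column that shows 1.0 here is completely empty, which is the situation described for "Recruitment Contact" in the next section.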

<h2> Remove Columns</h2>

After looking at the list above you can decide whether to remove any columns. In the nyc-jobs.csv file the "Recruitment Contact" column was 100% empty, so we will remove it using .drop(columns=['column name here'], inplace=True). Note that this only removes the column from the data frame object; as of right now the nyc-jobs.csv file itself still has 28 columns!

In [None]:
#If you want to look at the values in a column, with NaN (empty) entries counted too, uncomment the following line
#DFrame['Recruitment Contact'].value_counts(dropna=False)

In [None]:
DFrame.drop(columns=['Recruitment Contact'], inplace=True)
#Check to see that it was removed using shape
DFrame.shape

We can use .dtypes to see the data type of each column. When you see object as the type, the column normally holds strings. 

In [None]:
DFrame.dtypes

<h2>Fill NaNs</h2>

We have seen how much of our data frame is empty, or NaN, now we can use a for loop that will replace all of the missing values with a string "unknown" or a 0 for our int64 and float64 types. 

In [None]:
for col in DFrame:
    #Check the column's dtype; type(DFrame[col]) is always a pandas Series, so it can never equal 'object'
    if DFrame[col].dtype == 'object':
        #Fill all empty values with 'unknown', or change the placeholder here
        DFrame[col] = DFrame[col].fillna(value='unknown')
    else:
        #Here we replace all empty values with a 0
        DFrame[col] = DFrame[col].fillna(value=0)

Then check to make sure it worked; the total count of missing values should be 0:

In [None]:
DFrame.isnull().sum().sum()
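
The branch depends on each column's dtype, not on the Python type of the column object, which is always a pandas Series. As a hedged sketch on a made-up frame, pandas.api.types.is_numeric_dtype is one way to make the same decision without relying on how text columns are stored. The frame and its column names are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy frame: one text column, one numeric column
toy = pd.DataFrame({
    "agency": ["DOT", None, "DEP"],
    "positions": [1.0, None, 3.0],
})

for col in toy.columns:
    if is_numeric_dtype(toy[col]):
        # Numeric columns get 0
        toy[col] = toy[col].fillna(0)
    else:
        # Everything else gets the placeholder string
        toy[col] = toy[col].fillna("unknown")

# Total number of missing values left in the frame
remaining = int(toy.isnull().sum().sum())
print(toy)
print(remaining)
```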

<h2>Normalize Column Names</h2>

Now we can change the column names to be lowercase and free of special characters. To look at a list of the column names use .columns on the data frame object. 

In [None]:
DFrame.columns

Then we will use .lower() and .str.replace() to make any changes.

In [None]:
DFrame.columns = [x.lower() for x in DFrame.columns.str.replace(" ","_").str.replace("/","").str.replace("#","no")]
DFrame.columns
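
To see exactly what those replacements do, here is a small sketch that applies the same chain to plain strings. The header names are illustrative examples and the normalize helper is hypothetical.

```python
# Illustrative raw column names (assumed for this example)
raw_columns = ["Job ID", "# Of Positions", "Full-Time/Part-Time indicator"]

def normalize(name):
    # Same replacements as the cell above, applied to one plain string
    return name.replace(" ", "_").replace("/", "").replace("#", "no").lower()

clean_columns = [normalize(name) for name in raw_columns]
print(clean_columns)
```

Note that hyphens are left in place by this chain, so a name containing "-" would need one more replace if you want identifiers that only contain letters, digits and underscores.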

<h2>Next Steps</h2>

Now we are done with the EDA steps for the data frame and we will move on to other preprocessing. 

<h2>Remove Duplicated Lines</h2>

Looking through the nyc-jobs.csv file, some lines are duplicates, and an easy way to tell is the 'job_id' column value. You can use pandas to remove the duplicated lines. 

Source:
https://stackoverflow.com/questions/15741564/removing-duplicate-rows-from-a-csv-file-using-a-python-script &
https://www.geeksforgeeks.org/python-pandas-dataframe-drop_duplicates/


In [None]:
DFrame.drop_duplicates(subset = "job_id", inplace = True)

If you look at the original .head() call, lines 2 and 3 had the same "job_id". We can call .head() again to confirm that the duplicate removal worked. 

In [None]:
DFrame.head()

In [None]:
DFrame.shape

As you can see from the new head call the duplicate line was removed and we are down to 1661 rows of data. 
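
One detail worth knowing: with subset set, drop_duplicates only compares the listed columns and by default keeps the first row for each value, even if the other columns differ. A minimal sketch on made-up rows (the values are hypothetical):

```python
import pandas as pd

# Hypothetical rows; the second and third share a job_id
toy = pd.DataFrame({
    "job_id": [101, 102, 102, 103],
    "posting_type": ["Internal", "Internal", "External", "Internal"],
})

# How many rows would be dropped
n_duplicates = int(toy.duplicated(subset="job_id").sum())

# keep="first" is the default: the first row for each job_id survives
deduped = toy.drop_duplicates(subset="job_id").reset_index(drop=True)
print(n_duplicates)
print(deduped)
```

Counting with duplicated() before dropping is a cheap way to confirm that the change in shape matches what you expect.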

<h2>Remove HTML</h2>

I could not find any HTML tags in the nyc-jobs.csv file, but if there were we could remove them from a column with the following code. 

In [None]:
#First I will just define the function
def remove_html(text):
    soup = BeautifulSoup(text, 'lxml')
    html_free = soup.get_text()
    return html_free

#Now let's say we wanted to run it on the 'job_description' column; you could uncomment this line
#DFrame['job_description'] = DFrame['job_description'].apply(lambda x: remove_html(x))
#And could review the changes with this line
#DFrame['job_description'].head(10)
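
If BeautifulSoup or lxml is not installed, a rough stand-in can be built on the standard library's html.parser. This is a sketch of a different technique, not what remove_html does internally, and the TextExtractor and remove_html_stdlib names are hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text found between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text outside of tags
        self.parts.append(data)

def remove_html_stdlib(text):
    parser = TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)

sample = "<p>Manage <b>capital</b> projects.</p>"
print(remove_html_stdlib(sample))
```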

<h2>Remove Punctuation</h2>

Now we can remove punctuation from a column, treating it as a Series, and keep every character that is not in string.punctuation, a string containing all ASCII punctuation characters that comes from the import string at the beginning. 

For now I will only do this on the 'job_description' column, but it will be easy to modify in the future if necessary.

One thing to note is that this method does not account for typos, e.g. "however,the" will turn into "howeverthe". If necessary we can keep a copy of every column we perform this on and run any models on the columns separately if we find weird quirks in the data. 

In [None]:
#First I will just define the function
def remove_punctuation(text):
    no_punct = "".join([c for c in text if c not in string.punctuation])
    return no_punct

#Now I will just call it on the 'job_description' column and call .head() on it to make sure the punctuation is gone
DFrame['job_description'] = DFrame['job_description'].apply(lambda x: remove_punctuation(x))
DFrame['job_description'].head(10)
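
For the "however,the" case mentioned above, one possible variation is to replace punctuation with a space instead of deleting it. This is only a sketch of an alternative, and punctuation_to_space is a hypothetical name.

```python
import string

# Map every punctuation character to a space instead of deleting it
to_space = str.maketrans(string.punctuation, " " * len(string.punctuation))

def punctuation_to_space(text):
    # Replace punctuation with spaces, then collapse repeated whitespace
    return " ".join(text.translate(to_space).split())

sample = "however,the role (full-time) pays $50,000."
print(punctuation_to_space(sample))
```

The trade-off is that tokens such as "full-time" and "50,000" are split in two, so which version is better depends on what the later steps need.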

<h2>Tokenize</h2>

Now we will tokenize the text by column. This is when we break up the strings into a list of words or pieces using RegEx. The pattern also removes punctuation, and we can call .lower() in the lambda to make everything lowercase at this step as well. 

The RegexpTokenizer documentation: https://kite.com/python/docs/nltk.RegexpTokenizer  
The Python 2 re documentation describes \w as follows: "When the LOCALE and UNICODE flags are not specified, matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale. If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database."  
Source: https://docs.python.org/2/library/re.html  
Under Python 3, string patterns use Unicode matching by default, so \w also matches letters outside a-z.

Some other RegEx patterns we could use:
'\w+|\$[\d\.]+|\S+' = splits up by spaces or by periods that are not attached to a digit  
'\s+', gaps=True = grabs everything except spaces as a token  
'[A-Z]\w+' = only words that begin with a capital letter  
Source: https://towardsdatascience.com/nlp-for-beginners-cleaning-preprocessing-text-data-ae8e306bef0f

In [None]:
#Create the tokenizer
tokenizer = RegexpTokenizer(r'\w+')

#Now call .tokenize on the 'job_description' column. We could extend this to more columns as necessary
DFrame['job_description'] = DFrame['job_description'].apply(lambda x: tokenizer.tokenize(x.lower()))
#And let's look at the output of this
DFrame['job_description'].head(10)
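
To compare the patterns listed above, here is a sketch that uses the standard re module directly. It assumes RegexpTokenizer behaves like re.findall, or like re.split when gaps=True, which is my understanding of how it works. The sample sentence is made up.

```python
import re

sample = "Salary is $52.50 per hour. Apply at NYC Parks"

# r'\w+': runs of letters, digits and underscores; punctuation is dropped
words = re.findall(r"\w+", sample)

# Keeps dollar amounts together and emits leftover punctuation as its own token
money = re.findall(r"\w+|\$[\d\.]+|\S+", sample)

# Stand-in for gaps=True: the pattern describes the separators, not the tokens
by_space = [t for t in re.split(r"\s+", sample) if t]

# Only tokens made of a capital letter followed by at least one word character
capitalized = re.findall(r"[A-Z]\w+", sample)

for name, tokens in [("words", words), ("money", money),
                     ("by_space", by_space), ("capitalized", capitalized)]:
    print(name, tokens)
```

With r'\w+' the amount "$52.50" ends up as the two tokens "52" and "50", which is worth keeping in mind if numbers matter for the later keyword extraction.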

<h2>Remove Stop Words</h2>

We will use a stop word list from nltk.corpus, but we could use others if we wanted to by changing the function we are about to define here. If you would like to see a list of all the stop words you could run stopwords.words('english'), or pass any other language supported by NLTK.
**This step takes a while to run**

In [None]:
#First I will define the function to remove the stopwords
def remove_stopwords(text):
    words = [w for w in text if w not in stopwords.words('english')]
    return words

#Now we will just do it on the 'job_description' column for now, but could extend this to other columns easily
DFrame['job_description'] = DFrame['job_description'].apply(lambda x: remove_stopwords(x))
DFrame['job_description'].head(10)
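
Much of the run time likely comes from calling stopwords.words('english') once per token and from testing membership in a list. A common fix, sketched here, is to build a set once and reuse it. The small stop word set is a hypothetical stand-in so the example runs without the NLTK corpus, and remove_stopwords_fast is a hypothetical name.

```python
# Hypothetical stand-in; in the notebook this would be
# STOP_WORDS = set(stopwords.words('english'))
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}

def remove_stopwords_fast(tokens):
    # Set membership is a constant-time lookup on average
    return [w for w in tokens if w not in STOP_WORDS]

tokens = ["the", "analyst", "is", "responsible", "to", "review", "budgets"]
print(remove_stopwords_fast(tokens))
```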

<h2>Stemming and/or Lemmatization</h2>

Here we can perform either stemming or lemmatization on our list of words. Because we will be performing keyword extraction and then using the keywords as input to ConceptNet, I will perform lemmatization on the 'job_description' column: stemming can return strings that are not actual words, and we want to ensure a keyword gets a hit in ConceptNet. I will still include the function for stemming for future use (if necessary). 

In [None]:
#The code for lemmatization:
lemmatizer = WordNetLemmatizer()

def word_lemmatizer(text):
    lem_text = [lemmatizer.lemmatize(i) for i in text]
    return lem_text

DFrame['job_description'] = DFrame['job_description'].apply(lambda x: word_lemmatizer(x))
DFrame['job_description'].head(10)

In [None]:
# The function for Stemming and then the line to call it commented out
stemmer = PorterStemmer()
def word_stemmer(text):
    stem_text = " ".join([stemmer.stem(i) for i in text])
    return stem_text

#DFrame['job_description'] = DFrame['job_description'].apply(lambda x: word_stemmer(x))
#DFrame['job_description'].head(10)

<h2>Exporting the Data Frame</h2>

At this point if you would like to export the data frame to a csv file you could run the following code, just make sure you change the file path to the correct directory. 

Source: https://datatofish.com/export-dataframe-to-csv/

In [None]:
DFrame.to_csv(r'C:\Users\Joshua\Google Drive\Thesis Work\Python\nyc-jobs-cleaned.csv', index = False, header = True)
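
One thing to keep in mind: after tokenizing, each 'job_description' cell holds a Python list, and to_csv writes a list as its string representation. A sketch of joining the tokens back into one space-separated string before export, using a made-up frame and an in-memory buffer instead of a real file path:

```python
import io
import pandas as pd

# Hypothetical frame with an already tokenized text column
toy = pd.DataFrame({
    "job_id": [101, 102],
    "job_description": [["manage", "project"], ["review", "budget"]],
})

# Join each token list into a single string so the CSV stays easy to re-read
export = toy.copy()
export["job_description"] = export["job_description"].apply(" ".join)

# Write to an in-memory buffer; pass a file path instead to write to disk
buffer = io.StringIO()
export.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

When the file is read back, calling .str.split() on the column should recover the token lists.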