<h1>Feature Engineering</h1>
Author: Joshua White  

This notebook will handle the feature engineering for my CSCE 623 project. The steps followed are:  
1. **Label coding**: creation of a dictionary to map each category to a code.  
2. **Text representation**: use of TF-IDF scores to represent the text.  

Several preprocessing steps were applied to the data set before this notebook. First, the raw text was cleaned: HTML tags were removed, the text was lowercased, punctuation was stripped, and the result was tokenized and lemmatized. The data set was then split 80/20 into training and test data. Finally, the training subset was divided into training and validation subsets for k-fold cross-validation, with k = 5.
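
As a rough illustration of that splitting scheme (not the code that produced the CSV files loaded below), the sketch here applies an 80/20 split followed by 5-fold cross-validation to a placeholder array. The `docs` array, the `random_state` values, and the use of shuffling are all assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Hypothetical stand-in for the cleaned documents (100 rows).
docs = np.array(["document {}".format(i) for i in range(100)])

# 80/20 split into training and test data.
train_docs, test_docs = train_test_split(docs, test_size=0.2, random_state=0)

# 5-fold cross-validation over the training portion only.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(train_docs)):
    fold_sizes.append((len(train_idx), len(val_idx)))
    print("fold {}: {} training rows, {} validation rows".format(
        fold, len(train_idx), len(val_idx)))
```

Each fold uses four fifths of the training portion for fitting and the remaining fifth for validation, which would correspond to the five `training_set_*.csv` / `validation_set_*.csv` pairs named in the setup cell.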

**Source**: 
I used the following article for a lot of this process: 
https://towardsdatascience.com/text-classification-in-python-dd95d264c802

In [1]:
import pickle
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import chi2
import numpy as np

In [2]:
# Setting up some variables to be used:
input_training = "training_data_nyc-jobs.csv"

input_training_0 = "training_set_0.csv"
input_training_1 = "training_set_1.csv"
input_training_2 = "training_set_2.csv"
input_training_3 = "training_set_3.csv"
input_training_4 = "training_set_4.csv"

input_validation_0 = "validation_set_0.csv"
input_validation_1 = "validation_set_1.csv"
input_validation_2 = "validation_set_2.csv"
input_validation_3 = "validation_set_3.csv"
input_validation_4 = "validation_set_4.csv"

input_test = "test_data_nyc-jobs.csv"

# Load the full training set and the test set into data frames
DF_train = pd.read_csv(input_training)
DF_test = pd.read_csv(input_test)

In [3]:
# Just to look at the data frame:
DF_train.head()

Unnamed: 0,job_id,agency,posting_type,no_of_positions,business_title,civil_service_title,title_code_no,level,job_category,category,...,additional_information,to_apply,hours_shift,work_location_1,residency_requirement,posting_date,post_until,posting_updated,process_date,processed_text
0,424627,DEPARTMENT OF TRANSPORTATION,Internal,3,Capital Budget Analyst,ASSOCIATE STAFF ANALYST,12627,0,"Finance, Accounting, & Procurement",7,...,***IN ORDER TO BE CONSIDERED FOR THIS POSITION...,All resumes to be submitted electronically usi...,Office Hours: 9AM-5PM,55 Water St Ny Ny,New York City residency is generally required ...,2019-12-06T00:00:00.000,2019-12-26T00:00:00.000,2019-12-06T00:00:00.000,2019-12-17T00:00:00.000,capital budget analyst order considered positi...
1,423479,POLICE DEPARTMENT,Internal,2,City Custodial Assistant,CITY CUSTODIAL ASSISTANT,90644,0,Building Operations & Maintenance,2,...,This lateral opportunity is open to current Ci...,Please submit your resume and cover letter. P...,Shift depends on the Command.,Positions are available in the following comma...,New York City residency is generally required ...,2019-12-09T00:00:00.000,2020-01-23T00:00:00.000,2019-12-13T00:00:00.000,2019-12-17T00:00:00.000,city custodial assistant candidate selected re...
2,376405,NYC HOUSING AUTHORITY,Internal,1,Assistant Director for LHD Budget & Personnel,ADMINISTRATIVE STAFF ANALYST (,1002D,0,"Administration & Human Resources Policy, Resea...",1,...,"Employees applying for promotional, title or l...","Click the ""Apply Now"" button.",0,0,NYCHA has no residency requirements.,2019-05-21T00:00:00.000,0,2019-07-30T00:00:00.000,2019-12-17T00:00:00.000,assistant director budget personnel financial ...
3,397520,DEPT OF ENVIRONMENT PROTECTION,Internal,2,Plant Chief,SENIOR STATIONARY ENGINEER (EL,91639,0,"Engineering, Architecture, & Planning",6,...,Appointments are subject to OMB approval. For...,"Click ""Apply Now"" button",40 hours per week/day,Citywide,New York City residency is generally required ...,2019-06-10T00:00:00.000,0,2019-06-21T00:00:00.000,2019-12-17T00:00:00.000,plant chief department environmental protectio...
4,425079,DEPT OF ENVIRONMENT PROTECTION,Internal,1,Bureau Energy Manager,CITY RESEARCH SCIENTIST,21744,3,"Engineering, Architecture, & Planning Policy, ...",6,...,DEP is an equal opportunity employer with a st...,Click on “Apply Now” and submit a resume a...,35 hours per week,"59-17 Junction Blvd, Corona NY",New York City residency is generally required ...,2019-12-10T00:00:00.000,0,2019-12-10T00:00:00.000,2019-12-17T00:00:00.000,bureau energy manager department environmental...


<h2>1. Label Coding</h2>
Now create a dictionary with the label codification. These codes are already in our data set in the 'category' column. 

In [4]:
category_codes = {
    'admin' : 1,
    'maintenance' : 2,
    'clerical' : 3,
    'communications' : 4,
    'community' : 5,
    'engineering' : 6,
    'finance' : 7,
    'health' : 8,
    'technology' : 9,
    'legal' : 10,
    'policy' : 11,
    'public_safety' : 12
}
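
A small sketch of how this dictionary might be used in the other direction, turning numeric codes back into readable names (for example when reporting predictions). The `code_to_category` name and the sample codes are illustrative assumptions; the sample codes are the `category` values visible in the five rows shown by `head()` above.

```python
# Same mapping as in the cell above, repeated so this sketch runs on its own.
category_codes = {
    'admin': 1, 'maintenance': 2, 'clerical': 3, 'communications': 4,
    'community': 5, 'engineering': 6, 'finance': 7, 'health': 8,
    'technology': 9, 'legal': 10, 'policy': 11, 'public_safety': 12,
}

# Invert the dictionary: code -> category name (hypothetical helper mapping).
code_to_category = {code: name for name, code in category_codes.items()}

# Decode the 'category' codes of the five rows displayed earlier.
sample_codes = [7, 2, 1, 6, 6]
decoded = [code_to_category[code] for code in sample_codes]
print(decoded)  # ['finance', 'maintenance', 'admin', 'engineering', 'engineering']
```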

<h2>2. Text representation</h2>
There are many different ways to represent the text we have. We use a Bag of Words approach, with TF-IDF vectors as the features.  

We have a few different tuning knobs for the TF-IDF vectors that we need to set:  
* `ngram_range`: We want to consider both unigrams and bigrams.  
* `max_df`: When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (a float is read as a proportion of documents, an integer as an absolute count).  
* `min_df`: When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold (same float/integer convention).  
* `max_features`: If not None, build a vocabulary that only considers the top `max_features` terms ordered by term frequency across the corpus.  

Setting the `norm` argument also scales the data implicitly: with `norm='l2'`, each document's TF-IDF vector is normalized to unit length.
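
To make these knobs concrete, here is a minimal sketch on a made-up four-document corpus. The corpus and the value `min_df=2` are assumptions chosen so the filtering is easy to follow; they are not the notebook's settings. It shows `min_df` pruning rare terms and `norm='l2'` producing unit-length rows.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, already lowercased and cleaned.
toy_corpus = [
    "water supply engineer",
    "water quality engineer",
    "budget analyst",
    "budget water analyst",
]

# min_df=2 (an integer) drops every term found in fewer than 2 documents;
# max_df=1.0 (a float) is a proportion, so it removes nothing here.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2, max_df=1.0, norm="l2")
matrix = vectorizer.fit_transform(toy_corpus).toarray()

vocab = sorted(vectorizer.vocabulary_)
print(vocab)  # every bigram occurs in only one document, so none survive

# With norm='l2', each non-empty document vector has Euclidean length 1.
row_norms = np.linalg.norm(matrix, axis=1)
print(row_norms)
```

In this toy case only the unigrams `analyst`, `budget`, `engineer`, and `water` remain, which hints at why a high `min_df` on a small corpus can remove most bigrams.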

In [5]:
# Parameter setup:
ngram_range = (1,2) # Both unigrams and bigrams
min_df = 10
max_df = 1.
max_features = 300

These parameter values are a first approximation and can be changed later if required. 

Note: In this next block we fit and then transform the training set, but **only transform the held-out set** (here, the test set), so that its vocabulary and IDF weights come from the training data alone. 

In [7]:
# Set up the X and Y data for the training and test sets
X_train = DF_train['processed_text']
Y_train = DF_train['category']

X_test = DF_test['processed_text']
Y_test = DF_test['category']


# Set up the TF-IDF vectorizer object
tfidf = TfidfVectorizer(encoding='utf-8',
                        ngram_range=ngram_range,
                        stop_words=None,
                        lowercase=False,
                        max_df=max_df,
                        min_df=min_df,
                        max_features=max_features,
                        norm='l2',
                        sublinear_tf=True)

features_train = tfidf.fit_transform(X_train).toarray()
labels_train = Y_train
print(features_train.shape)

features_test = tfidf.transform(X_test).toarray()
labels_test = Y_test
print(features_test.shape)

(1328, 300)
(331, 300)
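
The fit-then-transform pattern used above can be seen on a toy example. This is only a sketch with made-up texts and default vectorizer settings: a word that appears only in the held-out text never enters the vocabulary, and both matrices end up with the same number of columns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical texts; "lawyer" never appears in the training data.
train_texts = ["budget analyst", "water engineer"]
test_texts = ["budget engineer lawyer"]

vec = TfidfVectorizer()
train_features = vec.fit_transform(train_texts)  # learns vocabulary and IDF weights
test_features = vec.transform(test_texts)        # reuses them, learns nothing new

print(sorted(vec.vocabulary_))
print(train_features.shape, test_features.shape)
```

The unseen word is silently dropped at transform time, which is why the training and test matrices in the notebook both have 300 columns.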


We can use the chi-squared test to see which unigrams and bigrams are most strongly associated with each category: 

In [8]:
from sklearn.feature_selection import chi2
import numpy as np

# Rank the TF-IDF features by chi-squared score for each category (one-vs-rest)
for category_name, category_id in sorted(category_codes.items()):
    features_chi2 = chi2(features_train, labels_train == category_id)
    indices = np.argsort(features_chi2[0])  # ascending, so the best features come last
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    feature_names = np.array(tfidf.get_feature_names_out())[indices]
    unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
    bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
    print("# '{}' category:".format(category_name))
    print("  . Most correlated unigrams:\n. {}".format('\n. '.join(unigrams[-5:])))
    print("  . Most correlated bigrams:\n. {}".format('\n. '.join(bigrams[-2:])))
    print("")

# 'admin' category:
  . Most correlated unigrams:
. training
. latitude
. associate
. personnel
. administrative
  . Most correlated bigrams:
. design construction
. project manager

# 'clerical' category:
  . Most correlated unigrams:
. attorney
. record
. delivery
. assigned
. associate
  . Most correlated bigrams:
. civil service
. special project

# 'communications' category:
  . Most correlated unigrams:
. specialist
. lead
. stakeholder
. strategy
. communication
  . Most correlated bigrams:
. design construction
. city agency

# 'community' category:
  . Most correlated unigrams:
. housing
. center
. education
. community
. outreach
  . Most correlated bigrams:
. drinking water
. per day

# 'engineering' category:
  . Most correlated unigrams:
. water
. design
. construction
. engineer
. engineering
  . Most correlated bigrams:
. project manager
. design construction

# 'finance' category:
  . Most correlated unigrams:
. payment
. procurement
. budget
. analyst
. financial
  . M

In [9]:
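# `bigrams` still holds the list from the last loop iteration: every bigram in the
# vocabulary, ordered by chi-squared score for the last category processed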
bigrams

['special project',
 'new york',
 'project manager',
 'city department',
 'city agency',
 'york city',
 'civil service',
 'per day',
 'include limited',
 'high quality',
 'country nearly',
 'nearly employee',
 'pollution combined',
 'combined municipal',
 'utility country',
 'noise hazardous',
 'reducing air',
 'material pollution',
 'quality drinking',
 'municipal water',
 'air noise',
 'water utility',
 'hazardous material',
 'department environmental',
 'water supply',
 'environmental protection',
 'billion gallon',
 'water sewer',
 'drinking water',
 'public health',
 'selected candidate',
 'design construction',
 'responsibility include']