<a href="https://colab.research.google.com/github/MalihaT111/ai-recruitment/blob/normalization-emily/AI_Assisted_Recruitment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import numpy as np
import os
import scipy.stats as stats

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
data_path = "/content/drive/MyDrive/cadence 1a/data"

print(os.listdir(data_path))



['data job posts.csv', 'data job posts.gsheet', 'Resume.csv', 'Resume.gsheet']


In [4]:
resume_df = pd.read_csv(f"{data_path}/Resume.csv")
resume_df.shape
resume_df.head()
# resume_df.isnull().sum()

Unnamed: 0,ID,Resume_str,Resume_html,Category
0,16852973,HR ADMINISTRATOR/MARKETING ASSOCIATE\...,"<div class=""fontsize fontface vmargins hmargin...",HR
1,22323967,"HR SPECIALIST, US HR OPERATIONS ...","<div class=""fontsize fontface vmargins hmargin...",HR
2,33176873,HR DIRECTOR Summary Over 2...,"<div class=""fontsize fontface vmargins hmargin...",HR
3,27018550,HR SPECIALIST Summary Dedica...,"<div class=""fontsize fontface vmargins hmargin...",HR
4,17812897,HR MANAGER Skill Highlights ...,"<div class=""fontsize fontface vmargins hmargin...",HR


In [5]:
job_posts_df = pd.read_csv(f"{data_path}/data job posts.csv")
job_posts_df.shape
job_posts_df.head()
# job_posts_df.isnull().sum()

Unnamed: 0,jobpost,date,Title,Company,AnnouncementCode,Term,Eligibility,Audience,StartDate,Duration,...,Salary,ApplicationP,OpeningDate,Deadline,Notes,AboutC,Attach,Year,Month,IT
0,AMERIA Investment Consulting Company\r\nJOB TI...,"Jan 5, 2004",Chief Financial Officer,AMERIA Investment Consulting Company,,,,,,,...,,"To apply for this position, please submit a\r\...",,26 January 2004,,,,2004,1,False
1,International Research & Exchanges Board (IREX...,"Jan 7, 2004",Full-time Community Connections Intern (paid i...,International Research & Exchanges Board (IREX),,,,,,3 months,...,,Please submit a cover letter and resume to:\r\...,,12 January 2004,,The International Research & Exchanges Board (...,,2004,1,False
2,Caucasus Environmental NGO Network (CENN)\r\nJ...,"Jan 7, 2004",Country Coordinator,Caucasus Environmental NGO Network (CENN),,,,,,Renewable annual contract\r\nPOSITION,...,,Please send resume or CV toursula.kazarian@......,,20 January 2004\r\nSTART DATE: February 2004,,The Caucasus Environmental NGO Network is a\r\...,,2004,1,False
3,Manoff Group\r\nJOB TITLE: BCC Specialist\r\n...,"Jan 7, 2004",BCC Specialist,Manoff Group,,,,,,,...,,Please send cover letter and resume to Amy\r\n...,,23 January 2004\r\nSTART DATE: Immediate,,,,2004,1,False
4,Yerevan Brandy Company\r\nJOB TITLE: Software...,"Jan 10, 2004",Software Developer,Yerevan Brandy Company,,,,,,,...,,Successful candidates should submit\r\n- CV; \...,,"20 January 2004, 18:00",,,,2004,1,True


In [6]:
# Columns with missing values, collected into a list
null_counts = job_posts_df.isnull().sum()
columnlist = list(null_counts[null_counts != 0].index)
columnlist

['Title',
 'Company',
 'AnnouncementCode',
 'Term',
 'Eligibility',
 'Audience',
 'StartDate',
 'Duration',
 'Location',
 'JobDescription',
 'JobRequirment',
 'RequiredQual',
 'Salary',
 'ApplicationP',
 'OpeningDate',
 'Deadline',
 'Notes',
 'AboutC',
 'Attach']

In [7]:
job_posts_df[columnlist].dtypes
# Every column with missing values has dtype object, so none of them is numeric

Unnamed: 0,0
Title,object
Company,object
AnnouncementCode,object
Term,object
Eligibility,object
Audience,object
StartDate,object
Duration,object
Location,object
JobDescription,object


In [8]:
#for column in columnlist:
    #print(job_posts_df[column].unique())



In [9]:
#outliers

## Addressing Null values in job_posts_df
Many columns in job_posts_df contain null values, so we decide case by case whether to drop the column or the affected rows. A column can be cut if it is mostly null or if it is not relevant to the problem. For example, the column "AnnouncementCode" has 17793 null values, and its unique non-null values look like random strings of letters; it was most likely used to identify the job posting on its original website. Likewise, "OpeningDate" and "Deadline" may not have many null values, but they are unlikely to help determine whether a candidate would be a good fit for a job.
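The reasoning above can be backed by a quick per-column summary. The following is a minimal sketch on a toy DataFrame: the helper name `null_summary` and the sample values are hypothetical and not part of the notebook.

```python
import pandas as pd

def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Null count, null fraction, and distinct non-null values per column."""
    summary = pd.DataFrame({
        "null_count": df.isnull().sum(),
        "null_fraction": df.isnull().mean(),
        "n_unique": df.nunique(dropna=True),
    })
    # Columns that are mostly null come first
    return summary.sort_values("null_fraction", ascending=False)

# Toy data standing in for job_posts_df (values are made up for illustration)
toy = pd.DataFrame({
    "Title": ["Chief Financial Officer", "Country Coordinator", None, "Software Developer"],
    "AnnouncementCode": [None, None, None, "XYZ"],
})

report = null_summary(toy)
print(report)
```

Running the same helper on job_posts_df would list the candidates for `columns_to_drop` at the top of the report, though relevance to the problem still has to be judged by hand.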

In [10]:
columns_to_drop = ['AnnouncementCode', 'Term', 'Eligibility', 'Audience', 'StartDate', 'Duration', 'OpeningDate', 'Deadline', 'Notes', 'Attach']
job_posts_df = job_posts_df.drop(columns=columns_to_drop)
job_posts_df.isnull().sum()

Unnamed: 0,0
jobpost,0
date,0
Title,28
Company,7
Location,32
JobDescription,3892
JobRequirment,2522
RequiredQual,484
Salary,9379
ApplicationP,60


There are still null values, but the remaining columns are too important to the ML problem to drop. Instead, we drop the rows that have null values in these columns, since they most likely lack the information we need to train the model accurately. After we drop these rows, job_posts_df no longer has any null values.
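Row-wise dropping can be illustrated on a small example. This is a sketch with made-up values, assuming only that `DataFrame.dropna(subset=...)` is used as in the notebook; note that it guarantees no nulls only in the listed columns.

```python
import pandas as pd

# Toy data standing in for job_posts_df (values are made up for illustration)
toy = pd.DataFrame({
    "Title": ["Accountant", None, "Developer"],
    "Salary": ["Competitive", "Negotiable", None],
    "Notes": [None, None, None],  # not in the subset, so its nulls are ignored
})

required = ["Title", "Salary"]

# Keep only rows where every required column is present
kept = toy.dropna(subset=required)

print(len(toy), "->", len(kept))
print(kept.isnull().sum())
```

Columns outside the subset can still contain nulls afterwards, so a final `isnull().sum()` check on the full DataFrame remains worthwhile.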

In [11]:
columns_to_check = ['Title', 'JobDescription', 'JobRequirment', 'Company', 'Location', 'RequiredQual', 'Salary', 'AboutC']

job_posts_df = job_posts_df.dropna(subset=columns_to_check)

job_posts_df.shape

(5459, 14)

In [12]:
job_posts_df.isnull().sum()

Unnamed: 0,0
jobpost,0
date,0
Title,0
Company,0
Location,0
JobDescription,0
JobRequirment,0
RequiredQual,0
Salary,0
ApplicationP,0


In [13]:
#Removing duplicates
job_posts_df = job_posts_df.drop_duplicates()
print(job_posts_df.duplicated().sum())
resume_df = resume_df.drop_duplicates()
print(resume_df.duplicated().sum())

0
0


In [14]:
#Check which column has HTML tags
import re
def has_html(text):
    if isinstance(text, str):
        return bool(re.search(r'<.*?>', text))
    return False

In [15]:
columns_with_html = [col for col in job_posts_df.columns if job_posts_df[col].apply(has_html).any()]
print(columns_with_html)

columns_with_html = [col for col in resume_df.columns if resume_df[col].apply(has_html).any()]
print(columns_with_html)
# No column in job_posts_df contains HTML tags; only the resume columns do

[]
['Resume_str', 'Resume_html']


In [16]:
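import re
def clean_html(text):
  # Strip anything that looks like an HTML tag
  return re.sub('<[^<]+?>', '', text)

# Clean the HTML column itself so it can be compared against Resume_str
resume_df['Resume_html'] = resume_df['Resume_html'].apply(clean_html)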

In [17]:
resume_df[['Resume_html', 'Resume_str']].head()
# With the tags stripped, Resume_html carries the same text as Resume_str, so drop it
resume_df.drop(columns=['Resume_html'], inplace=True)

# Checkpoint #2 - Text Normalization
Apply tokenization, lowercasing, stopword removal, and lemmatization.
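The four steps can be sketched in plain Python. This is an illustration under stated assumptions, not the notebook's pipeline: `STOPWORDS` is a deliberately tiny list, and `TOY_LEMMAS` / `toy_lemmatize` are hypothetical stand-ins for a real lemmatizer.

```python
import re
from typing import Callable, Iterable, List

# Small illustrative stopword set (an assumption; a real run would use a fuller list)
STOPWORDS = {"a", "an", "and", "the", "of", "in", "for", "with", "to"}

# Toy lemma table standing in for a real lemmatizer (hypothetical)
TOY_LEMMAS = {"managers": "manager", "teams": "team", "managed": "manage"}

def toy_lemmatize(token: str) -> str:
    """Look the token up in the toy table; unknown tokens pass through."""
    return TOY_LEMMAS.get(token, token)

def normalize(text: str,
              stopwords: Iterable[str] = STOPWORDS,
              lemmatize: Callable[[str], str] = toy_lemmatize) -> List[str]:
    """Lowercase, tokenize, remove stopwords, then lemmatize each token."""
    stop = set(stopwords)
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # lowercase, then tokenize
    tokens = [t for t in tokens if t not in stop]    # stopword removal
    return [lemmatize(t) for t in tokens]            # lemmatization

tokens = normalize("Managed the Teams of HR Managers")
print(tokens)
```

In a real run, `WordNetLemmatizer().lemmatize` could be passed as `lemmatize`, and a function like `normalize` could be supplied to `TfidfVectorizer` through its `tokenizer` parameter so that lemmatization actually happens during vectorization.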

In [18]:
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [19]:
lemmatizer = WordNetLemmatizer()

In [20]:
#Define the vectorizer
vectorizer = TfidfVectorizer(
    lowercase=True,
    stop_words='english',
    strip_accents='unicode'
)

In [21]:
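#Fit/transform the resumes

# This does everything in one call:
# 1. Normalizes (lowercasing, accent stripping, stopword removal)
# 2. Creates the vocabulary
# 3. Calculates TF-IDF vectors
# Lemmatization is not part of TfidfVectorizer; it needs a custom tokenizer.
# Pass the text column, not the DataFrame: iterating over a DataFrame yields
# its column labels, so only those labels would be vectorized.
tfidf_matrix_resumes = vectorizer.fit_transform(resume_df['Resume_str'])

print("Shape of tfidf_matrix_resumes:", tfidf_matrix_resumes.shape)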
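How the vectorizer consumes its input can be checked on a toy example. The following is a minimal sketch with made-up resume snippets: it shows that iterating over a DataFrame yields column labels, whereas a single text column gives one TF-IDF row per document.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data standing in for resume_df (values are made up for illustration)
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Resume_str": [
        "HR specialist with payroll experience",
        "Software developer skilled in Python",
        "HR manager and recruiter",
        "Python data analyst",
    ],
})

# Iterating over a DataFrame yields its column labels, not its rows
print(list(toy))

toy_vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")

# Fit on the text column: one row per resume, one column per vocabulary term
matrix = toy_vectorizer.fit_transform(toy["Resume_str"])

print(matrix.shape)
print(sorted(toy_vectorizer.vocabulary_))
```

A quick sanity check is that the first dimension of the matrix equals the number of resumes; a row count equal to the number of columns suggests the whole DataFrame was passed instead.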


In [22]:
# Next step: build the model?

