# Evaluating the applicants for the Data Science program (2022) *


Denise Carneiro

*This project is entirely fictional, and all data presented is fabricated. It does not reflect any real-life project or individuals.

## Explanation of the project 

The objective of this project is to build a candidate selection model for the Data Science program at the University of the Pacific. To do this, we extract information from fields in the application forms, such as "Skills", "GPA" and "Academic Record".

The first model only considers the candidate's skills. We extract the contents of the "Skills" field and compare them with the Criteria for Admission. For each desired skill that the candidate has, one point is added. We then rank the candidates by the number of points obtained (score) and generate a file with the candidates that have the 5 best scores.

In the second model, we also consider GPA and Academic Record. For each desired course (from the Academic_records_extra file) found in the candidate's Academic Record, the candidate gets one more point. In addition, candidates with a GPA of 4.5 or above earn 2 extra points, and candidates with a GPA of at least 3.5 but below 4.5 earn one extra point.
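
The scoring rule of the second model can be sketched as a small function. This is only an illustration: the names `gpa_points` and `total_score` and their arguments are hypothetical and do not appear in the notebook, which accumulates the points step by step.

```python
def gpa_points(gpa):
    """Extra points for GPA: 2 if >= 4.5, 1 if >= 3.5, otherwise 0."""
    if gpa >= 4.5:
        return 2
    if gpa >= 3.5:
        return 1
    return 0


def total_score(n_skills, n_courses, gpa):
    """Hypothetical helper: one point per matched skill and per matched course, plus GPA points."""
    return n_skills + n_courses + gpa_points(gpa)


# Made-up candidate: 4 matched skills, 1 desired course and a GPA of 4.6
example_score = total_score(n_skills=4, n_courses=1, gpa=4.6)
print(example_score)  # 7
```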

Additionally, we did some data cleaning to obtain the candidates' contact details and full names.

The project was developed in Python on the Google Colab platform.


## Explanation of data sources 

Fifteen application forms were used for the selection. The forms contain various data, such as name, email, phone, address, last jobs, etc. All these forms were opened using the Candidates.csv file, which contains the names of the candidates and the names of their respective application forms.

In addition, the desired skills were extracted from the Criteria_for_Admission.csv file, which contains a list of the desired skills.

The ranking of the top 5 candidates was exported to a file called ToAccept.csv.

In the second model, data extracted from the "Academic Record" field of the candidates' forms were used. We also used a file called Academic_records_extra to extract data from desired courses in the "Academic Record" field of candidates.


In [None]:
import pandas as pd
import re
import numpy as np

The file that contains the candidates' names and form file names (Candidates.csv) is read and shown.

In [None]:
# We used sep = ';' because the delimiter used to build Candidates.csv 
# was ';' instead of the default ','
# skiprows = 3 skips the first 3 lines of the file
candidates = pd.read_csv('Candidates.csv', sep = ';', skiprows = 3)
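
To see what `sep` and `skiprows` do without the real file, the sketch below parses a small in-memory text. The three preamble lines and the two data rows are assumptions made for illustration; the actual contents of the first lines of Candidates.csv are not shown in this notebook.

```python
import io
import pandas as pd

# Hypothetical file contents: 3 preamble lines, then a ';'-separated table
raw_text = "\n".join([
    "University of the Pacific",
    "Data Science program",
    "Applicants 2022",
    "Candidate Name;File Name",
    "Ana Cavalcante;Candidate_1.txt",
    "Andrea Faria;Candidate_2.txt",
])

# skiprows=3 drops the preamble, so the 4th line becomes the header
demo = pd.read_csv(io.StringIO(raw_text), sep=';', skiprows=3)
print(demo.columns.tolist())  # ['Candidate Name', 'File Name']
print(len(demo))              # 2
```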

In [None]:
candidates.head()

Unnamed: 0,Candidate Name,File Name
0,Ana Cavalcante,Candidate_1.txt
1,Andrea Faria,Candidate_2.txt
2,Agnes Navarro,Candidate_3.txt
3,Cintia Oliveira,Candidate_4.txt
4,Julian Solo,Candidate_5.txt


## Cleanup

In the form, the candidates were required to write their e-mail addresses, because the university would like to update them about their applications by e-mail. Nonetheless, there were some omissions, so the university will phone only those who did not provide an e-mail. This requires some data cleaning.

The function get_phone will be used only for candidates who do not have a valid e-mail.

In [None]:
def get_phone (n_candidate):
    #Read the candidate's form and return every phone number written as 000-000-0000
    with open(candidates['File Name'][n_candidate], 'r') as form:
        candidate_form = form.read()
    regex=r'[0-9]{3}-[0-9]{3}-[0-9]{4}'
    phone=re.findall(regex,candidate_form)
    return phone

The function get_email looks for a valid e-mail for each candidate. If there is no valid e-mail, it calls the function above (get_phone) and returns a phone number instead.

In [None]:
def get_email (n_candidate):
    with open(candidates['File Name'][n_candidate], 'r') as form:
        candidate_form = form.read()
    #Note: this pattern only accepts addresses that end in .com
    regex=r'[A-Za-z0-9._]+@[A-Za-z0-9.-]+\.com' 
    email=re.findall(regex,candidate_form)
    if email:
        return email
    else:
        #No valid e-mail was found, so fall back to the phone number
        return(get_phone(n_candidate))
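
The e-mail-then-phone fallback can be tried on plain strings, without opening any form. In the sketch below, `contact_from_text` and the sample form texts are hypothetical; only the two regular expressions are taken from the functions above.

```python
import re

EMAIL_REGEX = r'[A-Za-z0-9._]+@[A-Za-z0-9.-]+\.com'
PHONE_REGEX = r'[0-9]{3}-[0-9]{3}-[0-9]{4}'


def contact_from_text(form_text):
    """Return the e-mails found in the text, or the phone numbers if there is no e-mail."""
    emails = re.findall(EMAIL_REGEX, form_text)
    if emails:
        return emails
    return re.findall(PHONE_REGEX, form_text)


# Made-up form fragments, used only to exercise the two branches
with_email = "E-mail: jesseb@gmail.com\nPhone: 555-010-0199"
without_email = "E-mail: agnes.navarro(at)mail\nPhone: 563-874-7354"

print(contact_from_text(with_email))     # ['jesseb@gmail.com']
print(contact_from_text(without_email))  # ['563-874-7354']
```

Because the e-mail pattern requires the address to end in `.com`, an address such as `name@school.edu` would also be treated as missing and replaced by the phone number.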

The code below calls the function get_email and builds a contact list of e-mails or phones. There will be phones only for those who do not have a valid e-mail.

In [None]:
contacts = []
for n_candidate in range(len(candidates)):
    contacts.append(get_email(n_candidate))

As shown below, just 3 candidates have no valid e-mail (candidates 3, 11 and 15), so for them we replaced the contact e-mail with their phone numbers.

In [None]:
contacts

[['ac_calvacante@gmail.com'],
 ['andfafa@outlook.com'],
 ['563-874-7354'],
 ['oliva.cintia@gmail.com'],
 ['so_ju_lo@outlook.com'],
 ['jsc_carter@gmail.com'],
 ['jesseb@gmail.com'],
 ['melanieb@gmail.com'],
 ['susan_salmon@gmail.com'],
 ['thomasadler@gmail.com'],
 ['453-434-4343'],
 ['rajesh638736@gmail.com'],
 ['a987gozales@gmail.com'],
 ['jairmbolsonaro@gmail.com'],
 ['232-234-3365']]

A department of the university requested the candidates' full names, but the Candidates file contains only the first and last name of each candidate. To get the full names, we also have to do some data cleaning, because some candidates have no middle name.

This function gets the full name of a candidate.

In [None]:
def fullname (n_candidate):
    #n_candidate is the position of the candidate in the list candidates
    form = open(candidates['File Name'][n_candidate], 'r')
    candidate_form = form.read()
    regex=r'(?<=First Name:)(?:\s)*\w+' 
    f_name=re.findall(regex,candidate_form)
    regex=r'(?<=Last Name:)(?:\s)*\w+'
    l_name=re.findall(regex,candidate_form)
    #If there is no middle name, the algorithm will not capture one 
    regex=r'Middle Name:(.*?)Last Name:' 
    m_name=re.findall(regex,candidate_form)
    name_parts= f_name + m_name + l_name
    form.close()
    return(''.join(name_parts))

This code builds a list with the candidates' full names.

In [None]:
fullnames = []
for n_candidate in range(len(candidates)):
    fullnames.append(fullname(n_candidate))

The candidates' full names are shown below.

In [None]:
fullnames

[' Ana Clara   Cavalcante',
 ' Andrea Farina  Faria',
 'Agnes  Navarro',
 ' Cintia  Oliveira',
 ' Julian  Solo',
 '   John Smith  Carter',
 ' Jesse   Balduco',
 ' Melanie Ellis  Keller',
 ' Susan Ellis  Salmon',
 ' Thomas Keller  Adler',
 ' Thomas   Gasper',
 ' Rajesh   Koothrappali',
 ' Alice Hayes  Gonzales',
 ' Jair Messias  Bolsonaro',
 ' Justin Adler  Adams']
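
The names above still carry leading and repeated spaces. One possible post-processing step, not part of the original notebook, is to split each name on whitespace and join the pieces with single spaces. The sketch uses three of the strings shown above.

```python
# Three raw values copied from the output above
raw_names = [' Ana Clara   Cavalcante', 'Agnes  Navarro', '   John Smith  Carter']

# str.split() with no argument discards leading, trailing and repeated whitespace
clean_names = [' '.join(name.split()) for name in raw_names]
print(clean_names)
# ['Ana Clara Cavalcante', 'Agnes Navarro', 'John Smith Carter']
```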

## Algorithm that looks for the skills and makes a list of the best candidates

The file that contains the criteria for admission is read as 'criteria'.

In [None]:
criteria = pd.read_csv('Criteria_for_Admission.csv')

In [None]:
criteria

Unnamed: 0,Skills
0,R
1,PYTHON
2,SQL
3,STATISTICS
4,PROGRAMMING
5,DATA WRANGLING
6,MACHINE LEARNING
7,CALCULUS


A function was defined to find the skills that match the criteria.

In [None]:
#criteria is the skill that the function is going to look for
#i is the position of the candidate in the list candidates
def find_Skill(criteria, i):
    form = open(candidates['File Name'][i], 'r')
    candidate_form = form.read()
    #First, the field "Skills" is separated into the variable skills_of_candidate,
    #because we will look for the criteria only in the field Skills
    skills_of_candidate =re.search(r'Skills:(?:\s)*(?:\n)*(.*?)(?:\s)*(?:\n)*Passion:', candidate_form).group(1)
    #The line below converts all the text to upper case, so it can match lower and upper cases, 
    #for example, if the candidate typed Python, python or even PYthon
    skills_of_candidate =  ' '+ skills_of_candidate.upper() 
    #The code below finds the criteria in the field Skills, which is stored in skills_of_candidate
    regex = r'(?:\s)' + criteria + r'[\s,.]'
    all_skills=re.findall(regex,skills_of_candidate)
    form.close()
    #if there was a match, it returns 1, which will be added to the score of the candidate
    #if there was no match, it returns zero
    if all_skills: 
        return 1
    else:
        return 0
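
The matching pattern can be explored on plain strings to see why the leading space and the trailing delimiter matter. In this sketch, `has_skill` is a hypothetical helper that applies the same pattern as find_Skill to a text passed as an argument; the sample skill texts are made up.

```python
import re


def has_skill(skill, skills_text):
    """Same pattern as find_Skill, applied to a string instead of a form file."""
    padded = ' ' + skills_text.upper()
    pattern = r'\s' + skill + r'[\s,.]'
    return 1 if re.search(pattern, padded) else 0


print(has_skill('PYTHON', 'python, R, SQL.'))   # 1: case does not matter
print(has_skill('R', 'Programming, Python.'))   # 0: the R inside PROGRAMMING is not a match
print(has_skill('R', 'r, python.'))             # 1: the added leading space lets the first skill match
print(has_skill('SQL', 'Python, sql'))          # 0: nothing follows SQL, so the trailing delimiter is missing
```

The last case shows a property of the pattern: a skill written at the very end of the text with nothing after it is not counted, because a trailing space, comma or period is required. Whether this affects any real form depends on how the candidates ended their "Skills" field.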

The code below creates an array of zeros. It has the same length as candidates and will hold the score of each candidate.

In [None]:
candidate_score = np.zeros(shape=len(candidates))

The first 'for' loops over the criteria and the second one loops over the candidates, so each criterion is searched for in each candidate's form. The result, 1 if the skill matches and 0 if not, is added to the candidate's score.

In [None]:
for c in range(len(criteria)):
    for n_candidate in range(len(candidates)):
        match = find_Skill(criteria['Skills'][c], n_candidate)
        candidate_score[n_candidate] = candidate_score[n_candidate] + match

The scores of the candidates are shown below. Each score is the number of criteria matched by that candidate.

In [None]:
candidate_score

array([0., 2., 3., 3., 4., 5., 6., 1., 4., 0., 7., 4., 6., 2., 5.])

The list List_of_skills will keep all skills matched by each candidate. 

In [None]:
List_of_skills = []

In [None]:
#It works like the function find_Skill, but now we build a list of the matched skills instead of just summing them
for cand in range(len(candidates)):
    list1 = []
    for crit in range(len(criteria)):
        form = open(candidates['File Name'][cand], 'r')
        candidate_form = form.read()
        skills_of_candidate =re.search(r'Skills:(?:\s)*(?:\n)*(.*?)(?:\s)*(?:\n)*Passion:', candidate_form).group(1)
        skills_of_candidate =  ' '+ skills_of_candidate.upper() 
        regex = r'(?:\s)' + criteria['Skills'][crit] + r'[\s,.]'
        all_skills=re.findall(regex,skills_of_candidate)
        if all_skills:
            list1.append(criteria['Skills'][crit])
        form.close()
    #.join transforms list1 into a single string
    List_of_skills.append(". ".join(list1))
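
Since the score is simply the number of matched skills, the list and the score could also be produced in a single pass. The sketch below illustrates that idea on made-up data; `matched_skills`, the candidate labels and the skill texts are all hypothetical, and only the matching pattern comes from the notebook.

```python
import re

desired = ['R', 'PYTHON', 'SQL', 'STATISTICS']

# Hypothetical contents of the "Skills" field for two made-up candidates
skills_by_candidate = {
    'Candidate A': 'R, Python, Statistics.',
    'Candidate B': 'Excel, SQL.',
}


def matched_skills(skills_text, desired_skills):
    """Return the desired skills found in the text, using the notebook's pattern."""
    padded = ' ' + skills_text.upper()
    return [s for s in desired_skills if re.search(r'\s' + s + r'[\s,.]', padded)]


matches = {name: matched_skills(text, desired) for name, text in skills_by_candidate.items()}
scores = {name: len(found) for name, found in matches.items()}

print(matches)  # {'Candidate A': ['R', 'PYTHON', 'STATISTICS'], 'Candidate B': ['SQL']}
print(scores)   # {'Candidate A': 3, 'Candidate B': 1}
```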

Join all the data we have into the same DataFrame.

In [None]:
candidate_ranking = pd.DataFrame({
    'File': candidates['File Name'],
    'Name': candidates['Candidate Name'],
    'Score': candidate_score, 
    'Skills': List_of_skills,
})
candidate_ranking

Unnamed: 0,File,Name,Score,Skills
0,Candidate_1.txt,Ana Cavalcante,0.0,
1,Candidate_2.txt,Andrea Faria,2.0,R. PYTHON
2,Candidate_3.txt,Agnes Navarro,3.0,R. PYTHON. SQL
3,Candidate_4.txt,Cintia Oliveira,3.0,R. STATISTICS. CALCULUS
4,Candidate_5.txt,Julian Solo,4.0,R. PYTHON. STATISTICS. CALCULUS
5,Candidate_6.txt,John Carter,5.0,R. PYTHON. PROGRAMMING. DATA WRANGLING. MACHIN...
6,Candidate_7.txt,Jesse Balduco,6.0,SQL. STATISTICS. PROGRAMMING. DATA WRANGLING. ...
7,Candidate_8.txt,Melanie Keller,1.0,PYTHON
8,Candidate_9.txt,Susan Salmon,4.0,SQL. STATISTICS. PROGRAMMING. CALCULUS
9,Candidate_10.txt,Thomas Adler,0.0,


Sorting the DataFrame by the column 'Score' in descending order.

In [None]:
candidate_ranking = candidate_ranking.sort_values('Score', ascending = False)
candidate_ranking

Unnamed: 0,File,Name,Score,Skills
10,Candidate_11.txt,Thomas Gasper,7.0,R. PYTHON. SQL. STATISTICS. PROGRAMMING. MACHI...
6,Candidate_7.txt,Jesse Balduco,6.0,SQL. STATISTICS. PROGRAMMING. DATA WRANGLING. ...
12,Candidate_13.txt,Alice Gonzales,6.0,R. PYTHON. SQL. PROGRAMMING. DATA WRANGLING. M...
5,Candidate_6.txt,John Carter,5.0,R. PYTHON. PROGRAMMING. DATA WRANGLING. MACHIN...
14,Candidate_15.txt,Justin Adams,5.0,R. SQL. STATISTICS. DATA WRANGLING. MACHINE LE...
4,Candidate_5.txt,Julian Solo,4.0,R. PYTHON. STATISTICS. CALCULUS
8,Candidate_9.txt,Susan Salmon,4.0,SQL. STATISTICS. PROGRAMMING. CALCULUS
11,Candidate_12.txt,Rajesh Koothrappali,4.0,R. STATISTICS. MACHINE LEARNING. CALCULUS
2,Candidate_3.txt,Agnes Navarro,3.0,R. PYTHON. SQL
3,Candidate_4.txt,Cintia Oliveira,3.0,R. STATISTICS. CALCULUS


Making another DataFrame with just the first 5 rows of candidate_ranking (the 5 best scores).

In [None]:
Best_scores = candidate_ranking.head(n=5)
Best_scores

Unnamed: 0,File,Name,Score,Skills
10,Candidate_11.txt,Thomas Gasper,7.0,R. PYTHON. SQL. STATISTICS. PROGRAMMING. MACHI...
6,Candidate_7.txt,Jesse Balduco,6.0,SQL. STATISTICS. PROGRAMMING. DATA WRANGLING. ...
12,Candidate_13.txt,Alice Gonzales,6.0,R. PYTHON. SQL. PROGRAMMING. DATA WRANGLING. M...
5,Candidate_6.txt,John Carter,5.0,R. PYTHON. PROGRAMMING. DATA WRANGLING. MACHIN...
14,Candidate_15.txt,Justin Adams,5.0,R. SQL. STATISTICS. DATA WRANGLING. MACHINE LE...
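
In the ranking above the cutoff is clean: the fifth score is 5 and the sixth is 4. If several candidates were tied at the cutoff, `head(5)` would keep only some of them. The sketch below, on made-up scores with hypothetical names, contrasts `head` with pandas' `nlargest(..., keep='all')`, which keeps every row tied with the last selected score. It is shown as an alternative to consider, not as what the notebook does.

```python
import pandas as pd

# Made-up ranking with three candidates tied at the cutoff score of 5
ranking = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
    'Score': [7.0, 6.0, 6.0, 5.0, 5.0, 5.0, 4.0],
})

top_head = ranking.sort_values('Score', ascending=False).head(5)
top_with_ties = ranking.nlargest(5, 'Score', keep='all')

print(len(top_head))       # 5: one of the candidates with score 5 is left out
print(len(top_with_ties))  # 6: all candidates with score 5 are kept
```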


Creating a CSV file with the columns Name, Score and Skills.

In [None]:
Best_scores.to_csv('ToAccept.csv', columns = ['Name', 'Score', 'Skills'])

Reading the file we have just created.

In [None]:
ToAccept = pd.read_csv('ToAccept.csv')
ToAccept.head()

Unnamed: 0.1,Unnamed: 0,Name,Score,Skills
0,10,Thomas Gasper,7.0,R. PYTHON. SQL. STATISTICS. PROGRAMMING. MACHI...
1,6,Jesse Balduco,6.0,SQL. STATISTICS. PROGRAMMING. DATA WRANGLING. ...
2,12,Alice Gonzales,6.0,R. PYTHON. SQL. PROGRAMMING. DATA WRANGLING. M...
3,5,John Carter,5.0,R. PYTHON. PROGRAMMING. DATA WRANGLING. MACHIN...
4,14,Justin Adams,5.0,R. SQL. STATISTICS. DATA WRANGLING. MACHINE LE...
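
The extra `Unnamed: 0` column appears because `to_csv` writes the DataFrame index by default, and `read_csv` then reads it back as an unnamed column. The sketch below reproduces this with an in-memory buffer and two of the rows above, and shows that passing `index=False` avoids it. Whether the original index should be kept in ToAccept.csv is a choice left to the author.

```python
import io
import pandas as pd

# Two rows from the ranking above, keeping their original index labels
best = pd.DataFrame(
    {'Name': ['Thomas Gasper', 'Jesse Balduco'], 'Score': [7.0, 6.0]},
    index=[10, 6],
)

# Default behavior: the index is written as a first column without a name
with_index = io.StringIO()
best.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out of the file
without_index = io.StringIO()
best.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'Name', 'Score']
print(cols_without_index)  # ['Name', 'Score']
```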


## Advanced 

Now we are going to improve our model using more data. We will add 2 points for candidates whose GPA is 4.5 or higher, and one point if it is at least 3.5 but below 4.5. 

The function get_GPA looks for the GPA and returns 2 if it is 4.5 or higher, 1 if it is 3.5 or higher, and zero if it is lower than 3.5 (the candidate gets no extra points in this case).

In [None]:
def get_GPA(n):
    form = open(candidates['File Name'][n], 'r')
    candidate_form = form.read()
    GPA =float(re.search(r'GPA:(?:\s)*(?:\n)*(.*?)(?:\s)*(?:\n)*Academic', candidate_form).group(1))
    form.close()
    if GPA>=4.5: 
        return 2
    if GPA>=3.5:
        return 1
    else:
        return 0
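
get_GPA assumes that every form has a numeric GPA followed by the "Academic" field; if the pattern does not match, `re.search` returns `None` and the call to `.group(1)` fails. A more defensive variant of the extraction step could look like the sketch below. The function `parse_gpa` and the sample form texts are hypothetical; the pattern is the one used above.

```python
import re

GPA_REGEX = r'GPA:(?:\s)*(?:\n)*(.*?)(?:\s)*(?:\n)*Academic'


def parse_gpa(form_text):
    """Return the GPA as a float, or None if the field is missing or not a number."""
    found = re.search(GPA_REGEX, form_text)
    if found is None:
        return None
    try:
        return float(found.group(1))
    except ValueError:
        return None


# Made-up form fragments
sample_form = "GPA: 4.6\nAcademic Record: BA in Computer Science.\nSkills: Python, R."
bad_value_form = "GPA: n/a\nAcademic Record: none"

print(parse_gpa(sample_form))       # 4.6
print(parse_gpa(bad_value_form))    # None
print(parse_gpa("Name: Someone"))   # None
```

A caller would then have to decide how to score a missing GPA, for example as zero extra points.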

The candidate scores before the extra points are shown below.

In [None]:
candidate_score

array([0., 2., 3., 3., 4., 5., 6., 1., 4., 0., 7., 4., 6., 2., 5.])

This code calls the function get_GPA and adds the extra points.

In [None]:
for n_candidate in range(len(candidates)):
    match = get_GPA(n_candidate)
    candidate_score[n_candidate] = candidate_score[n_candidate] + match

The candidate scores after the extra points are shown below.

In [None]:
candidate_score

array([1., 4., 4., 4., 5., 7., 7., 2., 6., 0., 8., 6., 6., 2., 6.])

The Data Science program also decided to add extra points for candidates who have previously studied certain subjects. One extra point is given for each course listed in the file Academic_records_extra.csv that appears in the candidate's Academic Record.

We read the new criteria as academics.

In [None]:
academics = pd.read_csv('Academic_records_extra.csv')
academics

Unnamed: 0,Education
0,STATISTICS
1,COMPUTER SCIENCE
2,DATA SCIENCE


The function below extracts the Academic Record and then tries to find the Education criteria in it.

In [None]:
def find_academic_record(academics, i):
    form = open(candidates['File Name'][i], 'r')
    candidate_form = form.read()
    #First, the field "Academic Record" is separated into the variable education
    education =re.search(r'Record:(?:\s)*(?:\n)*(.*?)(?:\s)*(?:\n)*Skills:', candidate_form).group(1)
    #The line below converts all the text to upper case, so it can match lower and upper cases 
    education =  ' '+ education.upper()
    #The code below finds the criteria in the field "Academic Record", which is stored in education
    regex = r'(?:\s)' + academics + r'[\s,.]'
    match=re.findall(regex,education)
    form.close()
    #if there was a match, it will return 1, which will be added to the score of the candidate
    #if there was not a match, it will be zero
    if match: 
        return 1
    else:
        return 0
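
The two steps of find_academic_record, extracting the field and then matching each course, can be followed on a small example. The form text below is made up and its layout is an assumption; the patterns and the list of courses are the ones used in this notebook.

```python
import re

desired_courses = ['STATISTICS', 'COMPUTER SCIENCE', 'DATA SCIENCE']

# Hypothetical form fragment
sample_form = (
    "GPA: 4.6\n"
    "Academic Record: BA in Computer Science, minor in Statistics.\n"
    "Skills: Python, R.\n"
)

# Step 1: keep only the text between "Record:" and "Skills:"
education = re.search(
    r'Record:(?:\s)*(?:\n)*(.*?)(?:\s)*(?:\n)*Skills:', sample_form
).group(1)
education = ' ' + education.upper()

# Step 2: one point per desired course found in that text
course_points = sum(
    1 for course in desired_courses
    if re.search(r'(?:\s)' + course + r'[\s,.]', education)
)

print(education)      # ' BA IN COMPUTER SCIENCE, MINOR IN STATISTICS.'
print(course_points)  # 2
```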

The code below calls the function find_academic_record and adds one point for each Education criterion found in the field "Academic Record".

In [None]:
for acad in range(len(academics)):
    for n_candidate in range(len(candidates)):
        match = find_academic_record(academics['Education'][acad], n_candidate)
        candidate_score[n_candidate] = candidate_score[n_candidate] + match

In [None]:
candidate_score

array([1., 5., 4., 4., 5., 8., 8., 2., 7., 0., 9., 6., 7., 2., 6.])
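
As a quick consistency check, the three score arrays printed in this notebook can be subtracted from each other to recover the points each candidate earned at each step. The arrays below are copied from the outputs above; the variable names are chosen here for illustration.

```python
import numpy as np

# Scores copied from the outputs above (position = candidate index)
skills_only = np.array([0., 2., 3., 3., 4., 5., 6., 1., 4., 0., 7., 4., 6., 2., 5.])
after_gpa = np.array([1., 4., 4., 4., 5., 7., 7., 2., 6., 0., 8., 6., 6., 2., 6.])
final = np.array([1., 5., 4., 4., 5., 8., 8., 2., 7., 0., 9., 6., 7., 2., 6.])

gpa_points = (after_gpa - skills_only).astype(int).tolist()
course_points = (final - after_gpa).astype(int).tolist()

print(gpa_points)     # [1, 2, 1, 1, 1, 2, 1, 1, 2, 0, 1, 2, 0, 0, 1]
print(course_points)  # [0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0]
```

Every GPA increment is 0, 1 or 2, and every course increment is between 0 and 3, which is what the scoring rules allow.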

The code below creates a DataFrame with the name of the candidate and their new score.

In [None]:
candidate_new_ranking = pd.DataFrame({
    'Name': candidates['Candidate Name'],
    'Score': candidate_score, 
})
candidate_new_ranking

Unnamed: 0,Name,Score
0,Ana Cavalcante,1.0
1,Andrea Faria,5.0
2,Agnes Navarro,4.0
3,Cintia Oliveira,4.0
4,Julian Solo,5.0
5,John Carter,8.0
6,Jesse Balduco,8.0
7,Melanie Keller,2.0
8,Susan Salmon,7.0
9,Thomas Adler,0.0


Sorting the new ranking in a descending order.

In [None]:
candidate_new_ranking = candidate_new_ranking.sort_values('Score', ascending = False)
candidate_new_ranking

Unnamed: 0,Name,Score
10,Thomas Gasper,9.0
5,John Carter,8.0
6,Jesse Balduco,8.0
8,Susan Salmon,7.0
12,Alice Gonzales,7.0
11,Rajesh Koothrappali,6.0
14,Justin Adams,6.0
1,Andrea Faria,5.0
4,Julian Solo,5.0
2,Agnes Navarro,4.0


## Summary 

Two models were developed for candidate selection. The first model only evaluates skills listed in the "Skills" field, while the second model assigns additional points based on academic records and GPA.

As a result of the second model, the candidates' rankings changed. Justin Adams, who was ranked fifth in the first model, did not make it to the top 5 in the new ranking. Susan Salmon, who had a GPA above 3.5 and a BA in Computer Science, both of which earn points under the new criteria, entered the top 5 list.
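
The change in the top 5 can be verified directly from the two score arrays printed earlier. In this sketch, `top_n` is a hypothetical helper; positions in the arrays are candidate indices, so index 14 is Justin Adams and index 8 is Susan Salmon.

```python
import numpy as np

# Scores copied from the outputs shown in this notebook (position = candidate index)
first_model = np.array([0., 2., 3., 3., 4., 5., 6., 1., 4., 0., 7., 4., 6., 2., 5.])
second_model = np.array([1., 5., 4., 4., 5., 8., 8., 2., 7., 0., 9., 6., 7., 2., 6.])


def top_n(scores, n=5):
    """Indices of the n highest scores; tied candidates keep their original order."""
    return np.argsort(-scores, kind='stable')[:n].tolist()


top_first = top_n(first_model)
top_second = top_n(second_model)

left = sorted(set(top_first) - set(top_second))
entered = sorted(set(top_second) - set(top_first))

print(top_first)   # [10, 6, 12, 5, 14]
print(top_second)  # [10, 5, 6, 8, 12]
print(left)        # [14]
print(entered)     # [8]
```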

Considering that the second model takes into account important characteristics that are ignored in the first model, we believe it is more appropriate for candidate selection.