![](image.jpg)

As a data analyst at a global HR consultancy, your task is to search a large database of resumes for candidates suited to tech-focused roles. You will use regular expressions to extract key data points (job title, technical skills, and education) and data preprocessing techniques to organize them into a clean table.

## Dataset Summary

`resumes.csv`

| Column      | Data Type | Description                                                  |
|-------------|-----------|--------------------------------------------------------------|
| `ID`        | float     | Unique identifier for each resume.                           |
| `Resume_str`| object    | Full text of the resume, rich with details for analysis.     |
| `Category`  | object    | Job category of the resume, indicating the field of expertise. |
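
Because `ID` is stored as a float, identifiers display with a trailing `.0`. The following is a minimal sketch, assuming every ID is a whole number, of casting the column to pandas' nullable integer type. The `toy` frame and its values are made up for illustration and are not read from `resumes.csv`.

```python
import pandas as pd

# Toy stand-in for the ID column; values are illustrative only
toy = pd.DataFrame({"ID": [24574164.0, 17660419.0, None]})

# Nullable Int64 keeps a missing ID as <NA> instead of forcing the column to float
toy["ID"] = toy["ID"].astype("Int64")
print(toy["ID"].tolist())
```

If any ID had a fractional part, this cast would raise an error rather than silently truncate, which is usually the safer behavior for identifiers.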

## Let's Get Started!

This project applies regular expressions and data preprocessing to a real-world HR problem: helping tech talent find the right role. Let's begin!


In [19]:
import pandas as pd
import re

# Load the resume dataset from a CSV file into a DataFrame
resumes = pd.read_csv('resumes.csv')
resumes.sample(3)

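|      | ID         | Resume_str                              | Category      |
|------|------------|-----------------------------------------|---------------|
| 1245 | 24574164.0 | SENIOR DIRECTOR, PRODUCT MANAGEMENT ... | DIGITAL-MEDIA |
| 810  | 17660419.0 | GUEST LECTURER Accompli...              | FITNESS       |
| 766  | 29992154.0 | CASHIER Summary 3 years in ...          | HEALTHCARE    |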
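
The extraction functions in the next cell pass each `Resume_str` value straight to `re.search`, which raises a `TypeError` on non-string input such as a missing value. Whether `resumes.csv` actually contains missing resume text is not shown here, so the following is only a defensive sketch. The helper name `safe_search` and the sample strings are hypothetical.

```python
import re
import pandas as pd

def safe_search(pattern, text):
    """Return the first capture group, or None for no match or non-string input."""
    if not isinstance(text, str):
        return None  # e.g. NaN from an empty CSV cell
    match = re.search(pattern, text)
    return match.group(1) if match else None

# Made-up inputs: one labelled title, one missing value, one unlabelled line
toy = pd.Series(["Position: Data Engineer", None, "no label here"])
titles = toy.apply(lambda s: safe_search(r"(?i)(?:title|position):\s*([\w ]+)", s))
print(titles.tolist())
```

The pattern here uses `[\w ]+` rather than `[\w\s]+` so the capture stops at the end of the line; that is a design choice for this sketch, not the pattern used in the notebook.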


In [20]:
import pandas as pd
import re

# Load the dataset from CSV file into DataFrame
resumes = pd.read_csv('resumes.csv')

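def extract_job_title(resume_str):
    # Match the text that follows a "Title:" or "Position:" label (case-insensitive).
    # Simple example: [\w\s]+ is greedy and can run past the title, so it may need adjustment.
    title_pattern = r'(?i)(?:title|position):\s*([\w\s]+)'
    match = re.search(title_pattern, resume_str)
    # Strip the surrounding whitespace that the character class can capture
    return match.group(1).strip() if match else None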

def extract_tech_skills(resume_str):
    # Define common technical skills
    tech_skills = ['Python', 'SQL', 'R', 'Excel']
    # Case-sensitive, whole-word pattern; re.escape protects skills containing regex metacharacters
    skill_pattern = r'\b(' + '|'.join(re.escape(skill) for skill in tech_skills) + r')\b'
    found_skills = re.findall(skill_pattern, resume_str)
    return sorted(set(found_skills))  # Remove duplicates; sort for a stable order

def extract_education(resume_str):
    # Define common educational degrees
    education_levels = ['PhD', 'Master', 'Bachelor']
    # Whole-word pattern; re.search returns the first keyword that appears in the text,
    # which is not necessarily the highest degree
    edu_pattern = r'\b(' + '|'.join(education_levels) + r')\b'
    match = re.search(edu_pattern, resume_str)
    return match.group(1) if match else None

# Apply the extraction functions to each row in the DataFrame
resumes['job_title'] = resumes['Resume_str'].apply(extract_job_title)
resumes['tech_skills'] = resumes['Resume_str'].apply(extract_tech_skills)
resumes['education'] = resumes['Resume_str'].apply(extract_education)

# Create a new DataFrame with the required columns
candidates_df = resumes[['ID', 'job_title', 'tech_skills', 'education']].copy()
candidates_df = candidates_df.rename(columns={'ID': 'id'})  # Lowercase the ID column name

# Drop rows where job_title or education is null.
# tech_skills is always a list, so rows with an empty skills list are not dropped here.
candidates_df = candidates_df.dropna(subset=['job_title', 'tech_skills', 'education'])

# Display the first few records of the new DataFrame
print(candidates_df.head())

              id  ... education
266   28035460.0  ...    Master
948   55712978.0  ...       PhD
1134  11333660.0  ...       PhD

[3 rows x 4 columns]
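
Note that `dropna` only removes null values, while `extract_tech_skills` returns an empty list when nothing matches. Because the `tech_skills` column is elided in the output above, it is not visible whether any of the three rows has an empty list. The following sketch shows one way to add that filter, using an invented `toy` frame in place of `candidates_df`.

```python
import pandas as pd

# Toy frame mimicking candidates_df; the rows are invented for illustration
toy = pd.DataFrame({
    "id": [1.0, 2.0, 3.0],
    "job_title": ["Analyst", "Engineer", None],
    "tech_skills": [["Python", "SQL"], [], ["R"]],
    "education": ["Master", "PhD", "Bachelor"],
})

# dropna removes the row with a null job_title but keeps the empty skills list
after_dropna = toy.dropna(subset=["job_title", "tech_skills", "education"])

# Additionally keep only rows whose skills list has at least one entry
has_skills = after_dropna["tech_skills"].apply(len) > 0
filtered = after_dropna[has_skills]
print(filtered["id"].tolist())
```

An alternative would be to have the extraction function return `None` instead of an empty list, so that a single `dropna` call covers both cases.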
