![](image.jpg)

As a Data Analyst at a leading global HR consultancy, your mission is to delve into an extensive database of resumes to identify suitable candidates for tech-focused roles. This task involves using regular expressions to extract key data points and applying data preprocessing techniques to organize this information effectively.

## Dataset Summary

`resumes.csv`

| Column      | Data Type | Description                                                  |
|-------------|-----------|--------------------------------------------------------------|
| `ID`        | float     | Unique identifier for each resume.                           |
| `Resume_str`| object    | Full text of the resume, rich with details for analysis.     |
| `Category`  | object    | Job category of the resume, indicating the field of expertise. |

## Let's Get Started!

In this project you will extract each candidate's job title, technical skills, and education level from the resume text with regular expressions, then keep only the complete records so that tech candidates can be matched to suitable roles.


In [1]:
import pandas as pd
import re

# Load the resume dataset from a CSV file into a DataFrame
resumes = pd.read_csv('resumes.csv')
resumes.sample(3)

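| Unnamed: 0 | ID         | Resume_str                      | Category   |
|------------|------------|---------------------------------|------------|
| 80         | 25724495.0 | REGIONAL HR MANAGER Summary ... | HR         |
| 1160       | 95429627.0 | CONSULTANT Highlights ...       | CONSULTANT |
| 85         | 34740556.0 | SENIOR HR BUSINESS PARTNER ...  | HR         |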
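
The `Unnamed: 0` column in the sample is the name pandas assigns to a CSV column whose header is empty, which usually happens when a DataFrame was saved together with its row index. The sketch below shows one way to absorb it at load time with `index_col=0`. It assumes the column only duplicates the row index, which the sample rows cannot confirm, and the inline CSV text is a hypothetical stand-in for `resumes.csv`.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for resumes.csv; the leading comma in the
# header line is the unnamed first column.
csv_text = (
    ",ID,Resume_str,Category\n"
    "0,1.0,HR MANAGER Summary,HR\n"
    "1,2.0,CONSULTANT Highlights,CONSULTANT\n"
)

# Default load: the unnamed column shows up as 'Unnamed: 0'
default_df = pd.read_csv(io.StringIO(csv_text))

# Assumption: the first column is only a saved row index, so reuse it as the index
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_df.columns))
print(list(indexed_df.columns))
```

If the column turns out to carry information of its own, loading it normally and renaming it is the safer choice.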


In [2]:
# Function to extract the most recent job title.
# Heuristic: assumes the title is the first line and is shorter than 10 words.
def extract_job_title(resume):
    # str.split always returns at least one element, so indexing [0] is safe
    first_line = resume.strip().split('\n')[0].strip()
    if first_line and len(first_line.split()) < 10:
        return first_line
    return None

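# Function to extract technical skills (Python, SQL, R, Excel)
def extract_tech_skills(resume):
    # Canonical spellings: str.title() would turn 'SQL' into 'Sql'
    canonical = {'python': 'Python', 'sql': 'SQL', 'r': 'R', 'excel': 'Excel'}
    # 'R' is matched case-sensitively so a stray lowercase 'r' is not counted
    skills = re.findall(r'\b(?:(?i:Python|SQL|Excel)|R)\b', resume)
    return sorted({canonical[skill.lower()] for skill in skills})  # Deduplicate, stable order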

# Function to extract highest education level
def extract_education(resume):
    # Patterns are ordered from highest to lowest level, so the first hit
    # is the highest degree mentioned anywhere in the resume
    levels = [
        ('PhD', r'\b(?:Ph\.?D|Doctorate)\b'),
        ('Master', r'\b(?:Master|M\.Sc|MBA)\b'),
        ('Bachelor', r'\b(?:Bachelor|B\.Sc)\b'),
    ]
    for level, pattern in levels:
        if re.search(pattern, resume, flags=re.IGNORECASE):
            return level
    return None

# Apply the extraction functions
resumes['job_title'] = resumes['Resume_str'].apply(extract_job_title)
resumes['tech_skills'] = resumes['Resume_str'].apply(extract_tech_skills)
resumes['education'] = resumes['Resume_str'].apply(extract_education)

# Keep only complete records
candidates_df = resumes[['ID', 'job_title', 'tech_skills', 'education']].dropna()
candidates_df = candidates_df[
    (candidates_df['job_title'].str.strip() != '') &
    (candidates_df['tech_skills'].str.len() > 0) &
    (candidates_df['education'].str.strip() != '')
]

# Rename column for consistency
candidates_df = candidates_df.rename(columns={'ID': 'id'})

# Optional: Save to CSV
candidates_df.to_csv('filtered_candidates.csv', index=False)

print("Filtered candidates saved to 'filtered_candidates.csv'.")


Filtered candidates saved to 'filtered_candidates.csv'.
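
Before trusting the filtered file, it can help to run a skills pattern against a small hand-written string. The sketch below is a sanity check, not part of the notebook's pipeline: the sample text and the names `SAMPLE`, `SKILL_PATTERN`, and `find_skills` are hypothetical. It illustrates two pitfalls of this kind of extraction: a case-insensitive alternation that includes the single letter `R` also matches a stray lowercase `r`, and `str.title()` does not restore acronyms such as SQL or MBA.

```python
import re

# Hypothetical resume snippet written for this check; not taken from resumes.csv
SAMPLE = "DATA ANALYST\nUsed python, sql and EXCEL for r and d reporting. MBA, 2015."

# A case-insensitive skills alternation that includes the single letter R
SKILL_PATTERN = r'\b(Python|SQL|R|Excel)\b'

def find_skills(text):
    """Return the distinct matched skill tokens, lowercased and sorted."""
    matches = re.findall(SKILL_PATTERN, text, flags=re.IGNORECASE)
    return sorted({m.lower() for m in matches})

skills = find_skills(SAMPLE)
print(skills)          # the lone 'r' in "r and d" is counted as a skill
print("sql".title())   # title-casing yields 'Sql', not 'SQL'
print("MBA".title())   # and 'Mba', not 'MBA'
```

Mapping each lowercase match to a canonical spelling, and matching `R` case-sensitively, avoids both problems.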
