# Resume Extraction

---

## 1. Introduction

Objective: Extract structured information from resumes (.pdf or .docx)

Tools: spaCy for NER, regex for pattern matching, and file-specific parsers



## 2. Import Libraries

In [26]:
import fitz  # PyMuPDF
from docx import Document
import re
import os
import pandas as pd
from datetime import datetime
from transformers import pipeline
import re
import ipywidgets as widgets
from datetime import datetime
from IPython.display import display, HTML,Markdown

## 3. Resume File Text Extraction

## 3.1 Resume File Text Extraction (pdf)

In [4]:
import fitz  # PyMuPDF
import re

def extract_text_from_pdf(file_path):
    """
    Extracts and formats text from a PDF into a resume-friendly structure.

    Args:
        file_path (str): Path to the PDF file.

    Returns:
        str: Cleaned and well-formatted text suitable for parsing.
    """
    try:
        with fitz.open(file_path) as doc:
            raw_text = "\n".join(
                page.get_text("text").strip()
                for page in doc
                if page.get_text("text").strip()
            )

        if not raw_text:
            raise ValueError("PDF contains no extractable text (possibly scanned image).")

        # Clean and enhance layout
        text = raw_text

        # Normalize line endings and collapse excessive newlines
        text = re.sub(r'\n{2,}', '\n\n', text)
        lines = [line.strip() for line in text.splitlines()]
        text = "\n".join(lines)

        # Normalize bullets
        text = re.sub(r'^\s*[•·]', '-', text, flags=re.MULTILINE)

        # Normalize section headers (Education, Skills, etc.)
        section_keywords = [
            "Summary","Professional Summary","Career Objective" "Skills", "Experience", "Education", "Certifications",
            "Projects", "Languages", "Interests", "Volunteer", "References"
        ]
        for keyword in section_keywords:
            text = re.sub(rf"(?<!\w){keyword.upper()}(?!\w)", f"{keyword}:", text)

        return text

    except Exception as e:
        raise RuntimeError(f"PDF extraction failed for '{file_path}': {e}")


## 3.2 Resume File Text Extraction (docx)

In [5]:

def extract_text_from_docx(file_path):
    """
    Extracts text content from a DOCX file (paragraphs only).

    Args:
        file_path (str): Path to the .docx file.

    Returns:
        str: Cleaned, joined paragraph text from the document.

    Raises:
        RuntimeError: If the file cannot be read or is empty.
    """
    try:
        doc = Document(file_path)

        # Extract and clean paragraph text
        text_parts = [p.text.strip() for p in doc.paragraphs if p.text.strip()]
        combined_text = "\n".join(text_parts)

        if not combined_text:
            raise ValueError("DOCX file contains no extractable paragraph text.")

        return combined_text

    except Exception as e:
        raise RuntimeError(f"DOCX extraction failed for '{file_path}': {e}")

## 3.3 Resume File Text Extraction (docx or pdf)

In [6]:

def extract_resume_text(file_path):
    """
    Extracts text from a resume file (PDF or DOCX).

    Args:
        file_path (str): Path to the uploaded resume file.

    Returns:
        str: Extracted plain text from the file.

    Raises:
        ValueError: If the file extension is not supported.
        RuntimeError: If extraction from a supported file fails.
    """
    if not os.path.isfile(file_path):
        raise FileNotFoundError(f"The file does not exist: {file_path}")

    ext = os.path.splitext(file_path)[1].lower()

    if ext == ".pdf":
        try:
            return extract_text_from_pdf(file_path)
        except Exception as e:
            raise RuntimeError(f"Failed to extract text from PDF: {e}")

    elif ext == ".docx":
        try:
            return extract_text_from_docx(file_path)
        except Exception as e:
            raise RuntimeError(f"Failed to extract text from DOCX: {e}")

    else:
        raise ValueError("Unsupported file type. Only PDF and DOCX files are allowed.")

## 5. Clean Text (Minimal)

In [13]:
def clean_text(text):
    return re.sub(r'\s+', ' ', text.strip())

## 6. Extract Entities with spaCy (Name, Org, Location)

In [7]:
from transformers import pipeline

ner_pipeline = pipeline(
    "ner",
    model="dbmdz/bert-large-cased-finetuned-conll03-english",
    grouped_entities=True,
    framework="pt",
    device=0
)

def extract_transformer_entities(
    text: str,
    ner_model=None,
    return_first_only=True,
    label_map=None
) -> dict:
    """
    Extract entities (Name, Organization, Location) using a Hugging Face NER model.

    Args:
        text (str): Input text from resume or document.
        ner_model: Optional. A Hugging Face transformers NER pipeline. If None, it will be created.
        return_first_only (bool): If True, return only the first match per entity type.
        label_map (dict): Optional. Mapping of entity types to model-specific NER labels.

    Returns:
        dict: {
            "Name": str or List[str],
            "Organization": str or List[str],
            "Location": str or List[str]
        }
    """
    if not isinstance(text, str) or not text.strip():
        return {"Name": None, "Organization": None, "Location": None}

    if ner_model is None:
        ner_model = pipeline("ner", grouped_entities=True)

    # Default entity label mappings
    label_map = label_map or {
        "Name": ["PER", "PERSON"],
        "Organization": ["ORG"],
        "Location": ["LOC", "GPE"]
    }

    raw_entities = ner_model(text)
    result = {"Name": [], "Organization": [], "Location": []}

    for ent in raw_entities:
        label = ent.get("entity_group", "")
        word = ent.get("word", "").strip()

        # Skip empty/partial tokens
        if not word or word.startswith("##"):
            continue

        for key, valid_labels in label_map.items():
            if label.upper() in valid_labels and word not in result[key]:
                result[key].append(word)

    for key in result:
        if return_first_only:
            result[key] = result[key][0] if result[key] else None
        else:
            result[key] = result[key] if result[key] else []

    return result


To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
Error while downloading from https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english/resolve/main/model.safetensors: HTTPSConnectionPool(host='cas-bridge.xethub.hf.co', port=443): Read timed out.
Trying to resume download...
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassif

## 7. Regex-Based Extraction (Email, Phone, Skills)

---

## 7.1 Extract Skills

In [9]:
import re
from collections import defaultdict

def extract_skills(
    text: str,
    skill_set: dict = None,
    return_freq: bool = False,
    return_grouped: bool = False
):
    """
    Extract skills from resume or job description text with optional frequency and category grouping.

    Args:
        text (str): Resume or job description text.
        skill_set (dict): Optional skill categories and keywords (dict of lists).
        return_freq (bool): If True, return a frequency count of skills.
        return_grouped (bool): If True, return categorized skill matches.

    Returns:
        list or dict: Matched skills as a list, or grouped/frequency dictionary.
    """

    # === 1. Default skill categories and aliases ===
    default_skills = {
    "Programming Languages": [
        "Python", "Java", "JavaScript", "JS", "TypeScript", "C++", "C#", "SQL", "R", "Go", "Rust",
        "Scala", "Perl", "Ruby", "Dart", "Objective-C", "Swift", "Kotlin"
    ],

    "AI / ML / NLP": [
        "TensorFlow", "PyTorch", "Keras", "Scikit-learn", "XGBoost", "LightGBM", "CatBoost",
        "Machine Learning", "ML", "Deep Learning", "DL", "NLP", "LLM",
        "Transformers", "Hugging Face", "LangChain", "OpenAI API", "RAG", "Prompt Engineering",
        "LLM Fine-tuning", "AutoML", "spaCy", "NLTK", "BERT", "GPT", "Llama", "Claude",
        "FAISS", "ChromaDB", "Weaviate", "Haystack", "H2O.ai", "Vertex AI"
    ],

    "Web Development": [
        "HTML", "CSS", "SASS", "LESS", "Tailwind CSS", "Bootstrap", "Material-UI", "Chakra UI",
        "React", "Next.js", "Vue.js", "Angular", "Node.js", "Express", "Django", "Flask",
        "FastAPI", "ASP.NET", "Laravel", "Ruby on Rails", "REST API", "GraphQL", "WebSockets",
        "tRPC", "Zustand", "Redux", "React Query", "Vite", "Webpack", "Parcel", "Babel"
    ],

    "Cloud & DevOps": [
        "AWS", "Azure", "GCP", "DigitalOcean", "Heroku", "Vercel", "Netlify",
        "Docker", "Kubernetes", "Helm", "Terraform", "Ansible", "Pulumi", "CloudFormation",
        "GitHub Actions", "GitLab CI/CD", "Jenkins", "CircleCI", "ArgoCD",
        "Linux", "Bash", "Shell Scripting", "Serverless", "Prometheus", "Grafana", "Istio"
    ],

    "Databases": [
        "MySQL", "PostgreSQL", "MongoDB", "SQLite", "BigQuery", "Snowflake", "Oracle", "SQL Server",
        "Redis", "Firestore", "Cassandra", "DynamoDB", "InfluxDB", "MariaDB", "Redshift",
        "Neo4j", "Supabase", "ElasticSearch", "DuckDB"
    ],

    "Visualization & BI": [
        "Power BI", "Tableau", "Looker", "Google Data Studio", "Plotly", "Dash",
        "Matplotlib", "Seaborn", "D3.js", "Grafana", "Apache Superset", "Metabase"
    ],

    "Soft Skills": [
        "Communication", "Problem Solving", "Leadership", "Teamwork", "Adaptability",
        "Strategic Thinking", "Attention to Detail", "Time Management", "Creativity",
        "Critical Thinking", "Collaboration", "Decision Making", "Self-Motivation",
        "Work Ethic", "Conflict Resolution", "Public Speaking", "Presentation Skills",
        "Mentoring", "Accountability", "Customer Focus", "Project Management",
        "Empathy", "Emotional Intelligence"
    ]
}


    # === 2. Normalize input ===
    text = text.lower()
    skills_found = defaultdict(list if return_grouped else int)

    # === 3. Match skills ===
    skill_categories = skill_set or default_skills

    for category, skills in skill_categories.items():
        for skill in skills:
            skill_lower = skill.lower()
            pattern = r'\b' + re.escape(skill_lower) + r'\b'
            matches = re.findall(pattern, text)
            if matches:
                if return_grouped:
                    skills_found[category].append(skill)
                elif return_freq:
                    skills_found[skill] += len(matches)
                else:
                    skills_found[skill] = 1  # just to collect as set

    # === 4. Return appropriate output ===
    if return_grouped:
        return {k: sorted(set(v)) for k, v in skills_found.items() if v}
    elif return_freq:
        return dict(sorted(skills_found.items(), key=lambda x: -x[1]))
    else:
        return sorted(skills_found.keys()) if skills_found else None


## 7.2 Extract Education

In [10]:

def extract_education(text, known_degrees=None):
    """
    Extract degrees, institutions, and graduation years from resume text.

    Args:
        text (str): Resume text content.
        known_degrees (list, optional): Custom list of degrees.

    Returns:
        list: List of dicts with degree, institution, year, and raw line.
    """
    default_degrees = [
        "Bachelor", "Master", "B.Sc", "M.Sc", "B.S.", "M.S.", "BA", "MA", "PhD", "Ph.D", "B.E", "M.E",
        "B.Tech", "M.Tech", "MBA", "MCA", "BBA", "LLB", "LLM", "MD", "DDS", "Diploma", "High School",
        "Associate Degree", "Doctorate", "Postgraduate", "Undergraduate", "MBBS", "CFA", "CA", "M.Ed", "EdD"
    ]

    degrees = known_degrees or default_degrees
    lines = text.split('\n')
    education_data = []

    for line in lines:
        line_clean = line.strip()
        for degree in degrees:
            pattern = rf'\b{re.escape(degree)}\b'
            if re.search(pattern, line_clean, re.IGNORECASE):
                # Try to extract year (4-digit)
                year_match = re.search(r'\b(19|20)\d{2}\b', line_clean)
                year = year_match.group(0) if year_match else None

                # Try to extract institution name heuristically (everything after degree, if any)
                after_degree = re.split(pattern, line_clean, flags=re.IGNORECASE)[-1].strip()
                institution = after_degree.split(',')[0] if after_degree else None

                education_data.append({
                    "degree": degree,
                    "institution": institution if institution and not institution.lower().startswith("in") else None,
                    "year": year,
                    "raw": line_clean
                })
                break  # Avoid duplicate matches for the same line

    return education_data if education_data else None


## 7.3 Extract Email

In [11]:
import re

def extract_email(text, return_all=False):
    """
    Extract email addresses from text, handling obfuscations like [at], (dot), etc.

    Args:
        text (str): Raw input text (e.g., resume content).
        return_all (bool): If True, return a list of all found emails. Otherwise, return the first one.

    Returns:
        str or list or None: Extracted email(s), or None if not found.
    """
    if not text or not isinstance(text, str):
        return None

    # Normalize obfuscated formats (case insensitive)
    obfuscations = [
        (r'\s?\[at\]\s?', '@'),
        (r'\s?\(at\)\s?', '@'),
        (r'\s+at\s+', '@'),
        (r'\s?\[dot\]\s?', '.'),
        (r'\s?\(dot\)\s?', '.'),
        (r'\s+dot\s+', '.'),
    ]

    cleaned_text = text.lower()
    for pattern, replacement in obfuscations:
        cleaned_text = re.sub(pattern, replacement, cleaned_text, flags=re.IGNORECASE)

    # Match standard emails (remove trailing punctuation like . or ,)
    matches = re.findall(r'[\w\.-]+@[\w\.-]+\.\w+', cleaned_text)
    matches = [match.strip(".,;:") for match in matches]
    matches = sorted(set(matches))

    if not matches:
        return None

    return matches if return_all else matches[0]



# 7.4 Extract Certifications

In [12]:
import re

def extract_certifications(text, known_certs=None, return_lines=False):
    """
    Extract certifications from resume text.

    Args:
        text (str): Raw resume text.
        known_certs (list, optional): List of known cert names/acronyms.
        return_lines (bool): If True, return full matched lines; else return cert names only.

    Returns:
        list: Sorted, deduplicated list of certifications.
    """
    lines = [line.strip("•-–—•* ") for line in text.split('\n') if line.strip()]
    results = set()

    known_certs = known_certs or [
        "AWS Certified", "Google Cloud Certified", "Microsoft Certified", "Azure Fundamentals", "AZ-900",
        "Certified Scrum Master", "Scrum Master", "CompTIA A+", "CompTIA Security+", "CompTIA Network+",
        "Cisco Certified", "CCNA", "CKA", "CKAD", "Oracle Certified", "TOGAF", "ITIL", "CISSP",
        "Adobe Certified", "PMP", "PRINCE2", "Coursera", "Udemy", "edX", "DataCamp", "Trailhead",
        "IBM Data Science", "TensorFlow Developer", "Deep Learning Specialization", "Salesforce Certified",
        "LinkedIn Skill Assessment", "Superbadge", "Kubernetes Mastery", "AI For Everyone", "OCI Architect",
        "Google Cloud Professional", "Microsoft Azure Fundamentals", "OCI 2023 Architect Associate",
        "Certified Kubernetes Administrator"
    ]

    # Fallback pattern to catch any cert/badge/training line
    fallback_cert_keywords = re.compile(
        r"(cert(ification|ified)|cert\.|badge|exam|track|specialization|credential|training|academy)",
        re.IGNORECASE
    )

    for line in lines:
        normalized = re.sub(r'[^\w\s\-#@.:/()]', '', line).strip()

        # 1. Match known certifications
        matched = False
        for cert in known_certs:
            if cert.lower() in normalized.lower():
                results.add(line if return_lines else cert)
                matched = True
                break

        # 2. Match common acronyms
        if not matched:
            acronyms = re.findall(r'\b(AZ-\d{3}|CKA|CKAD|CSM|PMP|CCNA|OCI|CKS|ITIL)\b', normalized)
            for acr in acronyms:
                results.add(line if return_lines else acr)

        # 3. Match fallback keywords for custom/obscure certifications
        if not matched and fallback_cert_keywords.search(normalized):
            results.add(line)

    return sorted(results)


## 7.5 Extract Links

In [13]:
import re

def extract_links(text, classify=False, custom_domains=None, strict_mode=False):
    """
    Extract and optionally classify URLs from resume text.

    Args:
        text (str): Raw resume text.
        classify (bool): If True, return a dictionary of categorized links.
        custom_domains (dict): Custom classification domains. E.g., {"Kaggle": ["kaggle.com"]}
        strict_mode (bool): If True, ignore partial/bare domains like 'linkedin.com'.

    Returns:
        list or dict: Cleaned and optionally classified URLs.
    """
    raw_links = set()

    # Normalize common obfuscations
    text = text.replace("[dot]", ".").replace("(dot)", ".").replace(" dot ", ".")
    text = text.replace("[at]", "@").replace("(at)", "@").replace(" at ", "@")

    # --- 1. Match full links: http(s) ---
    full_links = re.findall(r'https?://[^\s\)\]\>\.,;]+', text)
    raw_links.update(link.strip(".,);>]") for link in full_links)

    # --- 2. Match www-prefixed domains ---
    www_links = re.findall(r'www\.[\w\-\.]+\.\w+', text)
    raw_links.update(f"https://{link.strip('.,);>]')}" for link in www_links)

    # --- 3. Match bare domains (like linkedin.com, github.com) ---
    if not strict_mode:
        bare_domains = re.findall(r'\b(?:[\w\-]+\.)+(?:com|org|io|net|co|ai|dev|info)\b', text, re.IGNORECASE)
        raw_links.update(f"https://{domain.strip('.,);>]')}" for domain in bare_domains)

    clean_links = sorted(set(raw_links))

    if not classify:
        return clean_links

    # --- 4. Classification ---
    categories = {
        "LinkedIn": [],
        "GitHub": [],
        "Portfolio": [],
        "Other": []
    }

    # Merge custom categories
    if custom_domains:
        for cat in custom_domains:
            categories.setdefault(cat, [])

    for link in clean_links:
        l = link.lower()
        if "linkedin.com" in l:
            categories["LinkedIn"].append(link)
        elif "github.com" in l:
            categories["GitHub"].append(link)
        elif any(sub in l for sub in ["about.me", "portfolio", "my.site", "personal", "me."]):
            categories["Portfolio"].append(link)
        elif custom_domains:
            matched = False
            for label, domains in custom_domains.items():
                if any(domain in l for domain in domains):
                    categories[label].append(link)
                    matched = True
                    break
            if not matched:
                categories["Other"].append(link)
        else:
            categories["Other"].append(link)

    return categories


## 7.5 Extract Phone

In [14]:

def extract_phone(text, return_all=False, format_output=False):
    """
    Extract and optionally format phone numbers from raw text using regex.

    Args:
        text (str): Raw text (e.g., resume content).
        return_all (bool): Return all matched numbers or just the first.
        format_output (bool): If True, format numbers like (123) 456-7890.

    Returns:
        str or list or None: Extracted phone number(s) or None if not found.
    """
    # Common obfuscations & digit words
    replacements = {
        "[dot]": ".", "[at]": "@", " at ": "@", "(at)": "@",
        "zero": "0", "one": "1", "two": "2", "three": "3",
        "four": "4", "five": "5", "six": "6", "seven": "7",
        "eight": "8", "nine": "9"
    }

    cleaned = text.lower()
    for word, digit in replacements.items():
        cleaned = cleaned.replace(word, digit)

    # Regex: detect phone numbers (10+ digits, with optional symbols)
    potential_numbers = re.findall(r'\+?\d[\d\s().-]{8,}\d', cleaned)

    results = set()
    for raw in potential_numbers:
        # Remove unwanted symbols but keep starting '+'
        cleaned_number = re.sub(r'(?!^\+)[^\d]', '', raw)
        if len(cleaned_number) >= 10:
            if format_output and cleaned_number.startswith('+'):
                formatted = cleaned_number  # leave international numbers as-is
            elif format_output:
                formatted = f"({cleaned_number[:3]}) {cleaned_number[3:6]}-{cleaned_number[6:10]}"
            else:
                formatted = cleaned_number
            results.add(formatted)

    sorted_phones = sorted(results)

    if not sorted_phones:
        return None

    return sorted_phones if return_all else sorted_phones[0]



## 7.6 Extract Links

In [15]:
def extract_links(text, classify=False, custom_domains=None):
    """
    Extract and optionally classify URLs from resume text.

    Args:
        text (str): Raw resume text.
        classify (bool): If True, return a dictionary of categorized links.
        custom_domains (dict): Optional dictionary to classify domains. E.g., {"Kaggle": ["kaggle.com"]}

    Returns:
        list or dict: Cleaned and optionally classified URLs.
    """
    raw_links = set()

    # --- 1. Match standard HTTP/HTTPS links ---
    matches_http = re.findall(r'https?://[^\s\)\]\>\.,;]*', text)
    raw_links.update(link.rstrip(".,);>]") for link in matches_http)

    # --- 2. Match www. and naked domain names ---
    matches_www = re.findall(r'www\.[a-zA-Z0-9\-\.]+\.\w+', text)
    raw_links.update(f"https://{link.rstrip('.,);>]')}" for link in matches_www)

    # --- 3. Match obfuscated links like name[dot]com or github[dot]io ---
    matches_obfuscated = re.findall(r'[\w\-]+\s?\[dot\]\s?[\w\-]+(?:\s?\[dot\]\s?[\w\-]+)?', text, re.IGNORECASE)
    for match in matches_obfuscated:
        clean = match.replace("[dot]", ".").replace(" ", "")
        if "." in clean:
            raw_links.add("https://" + clean)

    clean_links = sorted(raw_links)

    if not classify:
        return clean_links

    # --- 4. Classification ---
    categories = {
        "LinkedIn": [],
        "GitHub": [],
        "Portfolio": [],
        "Other": []
    }

    # Merge custom classification if provided
    if custom_domains:
        for category in custom_domains:
            if category not in categories:
                categories[category] = []

    # Standard + custom match logic
    for link in clean_links:
        lower = link.lower()
        if "linkedin.com" in lower:
            categories["LinkedIn"].append(link)
        elif "github.com" in lower:
            categories["GitHub"].append(link)
        elif any(d in lower for d in ["about.me", "portfolio", "me.", "my.", "personal"]):
            categories["Portfolio"].append(link)
        elif custom_domains:
            found = False
            for label, domains in custom_domains.items():
                if any(domain in lower for domain in domains):
                    categories[label].append(link)
                    found = True
                    break
            if not found:
                categories["Other"].append(link)
        else:
            categories["Other"].append(link)

    return categories



## 7.7 Extract Experiance 

In [16]:
import re

def extract_experience_lines(text, max_lines=40):
    """
    Extract structured work experience entries from resume text.

    Args:
        text (str): The full resume text.
        max_lines (int): Maximum lines to scan under the 'Experience' section.

    Returns:
        list: Structured list of experience dictionaries.
    """
    lines = [line.strip() for line in text.split('\n') if line.strip()]
    experience_section = []
    section_found = False
    keywords = ["experience", "work history", "employment", "professional background"]

    # Find experience section
    for i, line in enumerate(lines):
        if not section_found and any(k in line.lower() for k in keywords):
            experience_section = lines[i+1:i+1+max_lines]
            break

    if not experience_section:
        return []

    experiences = []
    current = {}
    buffer = []

    title_company_pattern = re.compile(
        r'^(?P<title>[A-Za-z\s/()\-]+?)\s+[-–]\s+(?P<company>.+)$'
    )
    date_pattern = re.compile(r'(\b\d{4}\b).{0,5}(\bPresent\b|\b\d{4}\b)', re.IGNORECASE)

    for line in experience_section:
        if "education" in line.lower() or "certification" in line.lower() or "project" in line.lower():
            break  # Stop parsing at next major section

        # Match title-company line
        tc_match = title_company_pattern.match(line)
        date_match = date_pattern.search(line)

        if tc_match:
            # Save previous block
            if current:
                current["Description"] = " ".join(buffer).strip() if buffer else None
                experiences.append(current)
                buffer = []
            current = {
                "Title": tc_match.group("title").strip(),
                "Company": tc_match.group("company").strip(),
                "Date": None,
                "Raw": line
            }

        elif date_match:
            if current:
                current["Date"] = date_match.group(0)

        elif line.startswith("-"):
            buffer.append(line)

    # Final flush
    if current:
        current["Description"] = " ".join(buffer).strip() if buffer else None
        experiences.append(current)

    return experiences


## Extract Projects

In [17]:
import re

def extract_projects(text, known_skills=None, return_structured=True, max_lines=10):
    """
    Extracts project information from resume text.
    
    Args:
        text (str): Resume full text.
        known_skills (list): Optional list of tech keywords to extract from descriptions.
        return_structured (bool): Return structured dicts with title, description, technologies.
        max_lines (int): Max lines to include after a project title.

    Returns:
        list[dict]: List of project dicts (Title, Description, Technologies).
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    project_start = None
    project_keywords = ["project", "projects", "personal project", "capstone"]
    stop_keywords = ["education", "experience", "certifications", "languages", "interests"]

    # Find where the Projects section starts
    for i, line in enumerate(lines):
        if any(kw in line.lower() for kw in project_keywords):
            project_start = i + 1
            break

    if project_start is None:
        return []

    # Collect all lines after project section until another major section
    block = []
    for line in lines[project_start:]:
        if any(kw in line.lower() for kw in stop_keywords):
            break
        block.append(line)

    # Identify project titles and descriptions
    projects = []
    current = {"Title": None, "Description": [], "Technologies": []}
    title_pattern = re.compile(r'^[A-Z].{3,40}$')  # Heuristic for title (capitalized short line)

    for line in block:
        if title_pattern.match(line) and not line.startswith("-"):
            if current["Title"]:
                # Finalize current project
                if return_structured:
                    current["Description"] = " ".join(current["Description"]).strip()
                    current["Technologies"] = extract_skills(current["Description"], known_skills)
                    projects.append(current)
                current = {"Title": None, "Description": [], "Technologies": []}
            current["Title"] = line
        else:
            current["Description"].append(line)

    # Append last project
    if current["Title"]:
        current["Description"] = " ".join(current["Description"]).strip()
        current["Technologies"] = extract_skills(current["Description"], known_skills)
        projects.append(current)

    return projects


## Extract Languages 

In [18]:

def extract_languages(text, known_languages=None):
    """
    Extracts spoken or written languages from resume text.

    Args:
        text (str): Raw resume content.
        known_languages (list, optional): Custom list of language names to detect.

    Returns:
        list: Sorted list of unique language names found in the resume.
    """
    default_languages = [
        "English", "Arabic", "French", "German", "Spanish", "Italian", "Mandarin",
        "Chinese", "Hindi", "Japanese", "Korean", "Portuguese", "Russian", "Turkish",
        "Dutch", "Bengali", "Urdu", "Polish", "Tamil", "Telugu", "Swedish", "Hebrew",
        "Malay", "Thai", "Vietnamese", "Greek", "Czech", "Romanian", "Hungarian",
        "Finnish", "Ukrainian", "Persian", "Punjabi", "Serbian", "Croatian"
    ]

    language_list = known_languages or default_languages
    results = set()

    # Lower the text for matching
    lower_text = text.lower()

    for lang in language_list:
        pattern = rf"\b{re.escape(lang.lower())}\b"
        if re.search(pattern, lower_text):
            results.add(lang)

    return sorted(results) if results else None


## Extract Hobbies

In [19]:
import re

def extract_interests(text):
    """
    Extract interests or hobbies from resume text.

    Args:
        text (str): Full resume content.

    Returns:
        list: A list of extracted interest strings, or None if not found.
    """
    lines = [line.strip() for line in text.split('\n') if line.strip()]
    interests_section = []
    start_collecting = False

    # Keywords that might indicate start of interests/hobbies section
    interest_headers = ["interests", "hobbies", "personal interests", "activities"]

    for idx, line in enumerate(lines):
        lower_line = line.lower().strip()

        if any(h in lower_line for h in interest_headers):
            start_collecting = True
            continue

        # Stop collecting if we hit another section (heuristic)
        if start_collecting and (re.match(r'^[A-Z][a-zA-Z ]+:$', line) or len(line.split()) <= 2):
            break

        if start_collecting:
            interests_section.append(line)

    # Flatten list and remove dashes
    cleaned = []
    for line in interests_section:
        # Remove leading bullet points or dashes
        line = re.sub(r'^[-•\s]+', '', line)
        # Split on bullet-like delimiters if needed
        parts = re.split(r'[•\-–•]', line)
        for part in parts:
            part = part.strip()
            if part:
                cleaned.append(part)

    return cleaned if cleaned else None


## 8. Upload File and Extract Info

In [20]:


def upload_resume_file():
    """
    Display a file upload widget and return the saved file path from local upload.

    Returns:
        dict: A dictionary containing 'file_name' key with the uploaded file path.
    """
    uploaded_file = {"file_name": None}  # Shared dict to hold result

    uploader = widgets.FileUpload(
        accept=".pdf,.docx,.jpg,.jpeg,.png",  # Allowed formats
        multiple=False
    )

    display(uploader)

    def handle_upload(change):
        if uploader.value:
            # Access the first uploaded file (assuming one file is uploaded)
            uploaded = list(uploader.value.values())[0]
            file_name = uploaded['metadata']['name']

            # Save the file content to the local disk
            with open(file_name, "wb") as f:
                f.write(uploaded['content'])
            print(f"File saved locally as: {file_name}")
            
            # Update the uploaded file dictionary with the file path
            uploaded_file["file_name"] = file_name  
            uploader.close()

    # Observe the file upload event
    uploader.observe(handle_upload, names='value')

    # Wait until the file is uploaded before returning the result
    while uploaded_file["file_name"] is None:
        pass  # Keep the loop running until the file is uploaded

    return uploaded_file  


## 9. Save Extracted Text in file

In [21]:
def save_extracted_resume_data(parsed_data: dict, output_path: str = "extracted_resume_data.csv") -> None:
    """
    Save extracted resume data to a CSV file.

    Args:
        parsed_data (dict): The dictionary containing resume fields.
        output_path (str): Path to save the CSV file.
    """
    try:
        df = pd.DataFrame([parsed_data])
        df.to_csv(output_path, index=False)
        print(f"Resume data saved to: {output_path}")
    except Exception as e:
        print(f"Failed to save resume data: {e}")


## 10. Run Extraction Pipeline

In [22]:

def extract_resume_data(text):
    ner = extract_transformer_entities(text)
    return {
        **ner,
        "Email": extract_email(text),
        "Phone": extract_phone(text),
        "Skills": extract_skills(text),
        "Education": extract_education(text),
        "Experience": extract_experience_lines(text),
        "Certifications": extract_certifications(text),
        "Projects": extract_projects(text),
        "Languages": extract_languages(text),
        "Interests": extract_interests(text),
        "Links": extract_links(text),
        "Raw Text": text,
        "Uploaded At": datetime.now()
    }



## 11. Resume pipeline

In [23]:
import time

def upload_extract_and_save_resume(save_dir="uploads", output_csv="extracted_resume_data.csv", wait=True):
    """
    Full pipeline to upload, extract, and save resume data.
    
    1. Uploads a resume file (.pdf or .docx)
    2. Extracts fields like name, email, skills, etc.
    3. Saves the results to CSV
    4. Returns extracted data as dict
    """
    uploaded_file = {"file_name": None}

    # File upload widget (PDF & DOCX only)
    uploader = widgets.FileUpload(
        accept=".pdf,.docx",
        multiple=False
    )

    display(HTML("<h4>Upload your resume (.pdf or .docx):</h4>"))
    display(uploader)

    def handle_upload(change):
        try:
            if uploader.value:
                uploaded = list(uploader.value.values())[0]
                file_name = uploaded['metadata']['name']
                ext = os.path.splitext(file_name)[1].lower()

                # Validate file type
                if ext not in [".pdf", ".docx"]:
                    display(HTML(f"<span style='color:red;'>❌ Unsupported file type: {ext}</span>"))
                    return

                os.makedirs(save_dir, exist_ok=True)
                full_path = os.path.join(save_dir, file_name)

                with open(full_path, "wb") as f:
                    f.write(uploaded['content'])

                uploaded_file["file_name"] = full_path
                display(HTML(f"<span style='color:green;'>File saved to: {full_path}</span>"))
                uploader.close()

                # Step 2: Extract text and parse data
                text = extract_resume_text(full_path)

                parsed_data = extract_resume_data(text)

                # Step 3: Save to CSV
                df = pd.DataFrame([parsed_data])
                csv_path = os.path.join(save_dir, output_csv)
                df.to_csv(csv_path, index=False)

                display(HTML(f"<span style='color:green;'>Data saved to: {csv_path}</span>"))
                display(HTML("<h5>Resume Parsing Complete!</h5>"))
                print(parsed_data)

            else:
                display(HTML("<span style='color:red;'>No file uploaded yet.</span>"))

        except Exception as e:
            display(HTML(f"<span style='color:red;'>Error: {e}</span>"))

    uploader.observe(handle_upload, names='value')

    if wait:
        while uploaded_file["file_name"] is None:
            time.sleep(0.2)

    return uploaded_file

    

---

## Call Functions

## Upload Resume

In [27]:
resume = upload_resume_file()


FileUpload(value=(), accept='.pdf,.docx,.jpg,.jpeg,.png', description='Upload')

KeyboardInterrupt: 

## Extract Text

In [37]:
resume_path = '../data/resumes/pdf/resume_1.pdf'
resume = extract_resume_text(resume_path)
display(Markdown(resume))

SUMMRY
MAHMoud Al-BANNA
Computer Science Teaching Assistant and NLP Researcher
-
Cairo, Egypt
mabdelrazekamin@eelu.edu.eg
+201026676511
linkedin.com/in/mahmoudalbanna
github.com/MahmoudBanna31
kaggle.com/mbanna31
mahmoudalbanna31@gmail.com


Talented Computer Science Teaching Assistant with +4 years of experience in teaching, research, and computer
science. A master's student in computer science with a great passion for machine learning, deep learning, data
science, and natural language processing and their integration into the medical field.

Experience:
Egyptian E-Learning University
Cairo, Egypt
Computer Science Teaching Assistant
December 2019 - Present
- Faculty of Computers and Information Technology. (Full-Time)
- I am teaching these courses. Pattern Recognition Course, Data Structure Course, Introduction to Operation
Research Course, Probability and Statistics Course, Mobile and Sensor Networks Course, Web Engineering
(3) Course, Introduction to Web Course, Programming (1) Java Basic Course, Programming (3) Java OOP
Course, Automata Models Course, Software Engineering (1) Course, Software Engineering (2) Course, Three-
dimensional Graphic Course, Computer Graphics Course, Microprocessors and Interfacing Course.
- I followed up on these courses. Math (0) Course, Math (1) Course, Physics (1) Course, Human-
Computer Interaction Course.
- Performed all assistant teaching duties, including mentoring, lecturing, researching, and clerical help.
- Organize and facilitate classroom lessons, activities, and presentations for 70+ undergraduate students each
semester.
- Prepared and delivered lab sessions throughout the academic year.
- I am helping the professors to keep records of grades, such as calculating the attendance and term work
grades, teaching the practical part in the section, and following up with students till implementation.
Arab Open University - Egypt
Cairo, Egypt
Computer Science Teaching Assistant
July 2020 - September 2020
- Faculty of Information and Computing. (Part-Time) Computer Science Department.
- I taught these Courses: Data Management and Analysis using Python, Database (SQL and NoSQL).
- I prepared the lab section of the course to explain it to students and followed up until
the implementation.
Education:

Ain Shams University
Cairo, Egypt
Faculty of Computer and Information Sciences
March 2021 – Present
Master's Student in the Computer Science department.
Master’s Title “Developing a System for Automatic Text Summarization”.
October 2022 - Present
Pre-Master’s Courses: -                                                                                                      March 2021 – February 2022
Grades: "Very Good", GPA: “3.62”
- Advanced Software Engineering Course.
- Advanced Selected Topics in Computer Science Course.
- Advanced Natural Language Processing Course.
- Intelligent Computer Algorithm Course.
- Advanced Computer Graphics and Animations Course.
- Advanced Artificial Intelligence Course.
- Distributed Computing Course.
- Robotics Course.
BSc Information Technology and Computing
Open University, UK
Grades: Second Class (1st Division) with Honors.
Sep 2015 - Jun 2019
Arab Open University - Egypt
Cairo, Egypt
BSc Information Technology and Computers
Sep 2015 - Jun 2019
Department: Computer Science
Grades: "Very Good with honors"
GPA: “3.65”
Class Ranked 4th.
SkILLS

Programming Languages:
Java, Python, JavaScript, C#, Shell Script, C
Database:
MySQL, XAMPP Server, MongoDB, No-SQL
Technologies:
Machine Learning, SKLearn, NLTK, TFIDF, Data Preprocessing
Concepts:
Data Science and Data Engineering, Data Mining, ER Diagram, OOP
Principles Concepts:
Deep Learning, Data structure and Algorithms, Data Management, and Analysis
Concepts:
Precision, Recall and F1-Score, Data Warehousing, SOLID Techniques
Interpersonal Skills:
Communication skills, Time management
Projects:

Sentiment Analysis for Customer Opinion in the Arabic Language Python, SKLearn, NLTK,
TDF-IDF, Machine Learning, Pandas, NumPy
I used NLTK and SKLearn for Data preprocessing. Then, using the TF-IDF algorithm for feature extraction.
And using different types of classification like SGDClassifier.
Software analysis of the hospital in a software engineering course UML Diagram, Java
I had an analysis system and determined the functional and non-functional requirements. And, Design
solutions such as use case diagrams, class diagrams, and so on.
Java OOP Project Java
During study course java OOP. I worked on a small project like a calculator.
Php Project. Php, MySQL
I worked on a small website consisting of some pages like the login page.
LICENSES & Certifications:
Data Science & Analytics Intro
IBM Digital Nation Website
Through self-paced learning, this badge earner has displayed an understanding of topics such as Data science,
analytics, gathering data, and predicting trends.

April 2020
Database Fundamentals
Mahara Tech-ITI
Introduction to the fundamentals of database using MySQL.
September 2020
Machine Learning Foundations: A Case Study Approach
Coursera
A case study approach by the University of Washington on Coursera in this course, I get hands-on experience
with machine learning from a series of practical case studies.
June 2019
Interests:
Watching Football, Travelling, and Reading.
Volunteer: WORKS
ACM AOU Community
Arab Open University, Egypt
Founder of the community. Some of my friends and I at the college have established the first ACM community
in the university.
Feb 2016 - Dec 2017
Enactus AOU
Arab Open University, Egypt
Member of HR Committee.
October 2017 - September 2018
IEEE AOU SE
Arab Open University, Egypt
Human Resources Recruiter.
Jun 2017 - March 2018
Languages:
Arabic: Mother Tongue.
English: Professional working proficiency.
References:
Dr. Walaa Elhady
Assistant. Prof. of Information Technology
Faculty of Information Technology and Computers, the Egyptian E-Learning University Cairo, Egypt
Mobile: +2-01097941593
Email: Welhady@eelu.edu.eg
Dr. Mustafa Abdul Salam
Associate. Prof. of Artificial Intelligence, Former Dean at Arab Open University
Faculty of Computers and Artificial Intelligence, Banha University, Arab Open University Cairo, Egypt
Mobile: +2-01015372448
Email: Mustafa.abdo@ymail.com
Dr. Sanaa Taha
Associate. Prof. of Information Technology
Faculty of Computers and Artificial Intelligence, Cairo University, Cairo, Egypt.
Mobile: +2-01117512722, Email: Staha@fci-cu.edu.eg

## 1. Regex Based Extraction

---

## 1.1 Extract Skills

In [44]:
text = """
Ahmed Mostafa
Senior Software Engineer | Backend & ML Specialist
Cairo, Egypt | ahmed.dev[at]gmail[dot]com | +20 100 123 4567 | linkedin.com/in/ahmedmostafa | github.com/ahmeddev

Summary:
Results-driven backend engineer with 6+ years of experience designing scalable systems using Python, FastAPI, PostgreSQL, and Docker. Proficient in building machine learning pipelines, deploying models to production with MLflow and Streamlit. Strong advocate for clean code, CI/CD, and agile development.

Skills:
- Programming: Python, JavaScript, SQL, C++
- ML/AI: Scikit-learn, PyTorch, Transformers, XGBoost, Hugging Face, LangChain
- Web: FastAPI, Flask, Django, React, Next.js, Tailwind CSS
- Databases: PostgreSQL, MySQL, MongoDB, Redis
- DevOps: Docker, Kubernetes, GitHub Actions, Terraform, AWS, GCP
- Tools: Jupyter, VS Code, Git, Postman, Slack, Notion
- Soft Skills: Problem Solving, Teamwork, Communication, Leadership, Critical Thinking

Experience:
Senior Backend Engineer – DataStack AI (Remote)
Aug 2021 – Present
- Built a scalable FastAPI backend for a recommendation engine with PostgreSQL & Redis.
- Containerized applications using Docker and deployed to GCP with Kubernetes.
- Developed automated CI/CD pipelines using GitHub Actions.
- Collaborated cross-functionally with frontend and ML teams using Agile.

Machine Learning Engineer – TechNova Labs
Jan 2019 – Jul 2021
- Deployed NLP models for resume parsing and classification using Hugging Face Transformers.
- Trained and tuned models with Scikit-learn, PyTorch, and MLflow.
- Developed an interactive data dashboard with Streamlit and Plotly.

Education:
Bachelor of Computer Science, Cairo University, 2018
GPA: 3.7 / 4.0

Certifications:
- AWS Certified Solutions Architect (2023–2026)
- Google Cloud Professional ML Engineer
- Microsoft Azure Fundamentals (AZ-900)
- Deep Learning Specialization – Coursera (Andrew Ng)
- ITIL v4 Foundation
- Certified Kubernetes Administrator (CKA)

Projects:
AI Resume Analyzer
- Built a Streamlit app that extracts and analyzes resume data using NLP and BERT NER.
- Features: skill matching, email & phone extraction, certification parser.

E-commerce REST API
- Designed a secure REST API using Django REST Framework.
- Integrated Stripe payments and user authentication with JWT.

Languages:
- Arabic (Native)
- English (Fluent)
- German (Intermediate)

Interests:
- Open source contributions
- AI ethics & fairness
- Playing chess & learning languages

"""


extract_skills(text)


['AWS',
 'Azure',
 'BERT',
 'CSS',
 'Communication',
 'Critical Thinking',
 'Deep Learning',
 'Django',
 'Docker',
 'FastAPI',
 'Flask',
 'GCP',
 'GitHub Actions',
 'Hugging Face',
 'JS',
 'JavaScript',
 'Kubernetes',
 'LangChain',
 'Leadership',
 'ML',
 'Machine Learning',
 'MongoDB',
 'MySQL',
 'NLP',
 'Next.js',
 'Plotly',
 'PostgreSQL',
 'Problem Solving',
 'PyTorch',
 'Python',
 'REST API',
 'React',
 'Redis',
 'SQL',
 'Scikit-learn',
 'Tailwind CSS',
 'Teamwork',
 'Terraform',
 'Transformers',
 'XGBoost']

## 1.2 Extract Education

In [45]:
text = """
Ahmed Mostafa
Senior Software Engineer | Backend & ML Specialist
Cairo, Egypt | ahmed.dev[at]gmail[dot]com | +20 100 123 4567 | linkedin.com/in/ahmedmostafa | github.com/ahmeddev

Summary:
Results-driven backend engineer with 6+ years of experience designing scalable systems using Python, FastAPI, PostgreSQL, and Docker. Proficient in building machine learning pipelines, deploying models to production with MLflow and Streamlit. Strong advocate for clean code, CI/CD, and agile development.

Skills:
- Programming: Python, JavaScript, SQL, C++
- ML/AI: Scikit-learn, PyTorch, Transformers, XGBoost, Hugging Face, LangChain
- Web: FastAPI, Flask, Django, React, Next.js, Tailwind CSS
- Databases: PostgreSQL, MySQL, MongoDB, Redis
- DevOps: Docker, Kubernetes, GitHub Actions, Terraform, AWS, GCP
- Tools: Jupyter, VS Code, Git, Postman, Slack, Notion
- Soft Skills: Problem Solving, Teamwork, Communication, Leadership, Critical Thinking

Experience:
Senior Backend Engineer – DataStack AI (Remote)
Aug 2021 – Present
- Built a scalable FastAPI backend for a recommendation engine with PostgreSQL & Redis.
- Containerized applications using Docker and deployed to GCP with Kubernetes.
- Developed automated CI/CD pipelines using GitHub Actions.
- Collaborated cross-functionally with frontend and ML teams using Agile.

Machine Learning Engineer – TechNova Labs
Jan 2019 – Jul 2021
- Deployed NLP models for resume parsing and classification using Hugging Face Transformers.
- Trained and tuned models with Scikit-learn, PyTorch, and MLflow.
- Developed an interactive data dashboard with Streamlit and Plotly.

Education:
Bachelor of Computer Science, Cairo University, 2018
GPA: 3.7 / 4.0

Certifications:
- AWS Certified Solutions Architect (2023–2026)
- Google Cloud Professional ML Engineer
- Microsoft Azure Fundamentals (AZ-900)
- Deep Learning Specialization – Coursera (Andrew Ng)
- ITIL v4 Foundation
- Certified Kubernetes Administrator (CKA)

Projects:
AI Resume Analyzer
- Built a Streamlit app that extracts and analyzes resume data using NLP and BERT NER.
- Features: skill matching, email & phone extraction, certification parser.

E-commerce REST API
- Designed a secure REST API using Django REST Framework.
- Integrated Stripe payments and user authentication with JWT.

Languages:
- Arabic (Native)
- English (Fluent)
- German (Intermediate)

Interests:
- Open source contributions
- AI ethics & fairness
- Playing chess & learning languages

"""

education_extracted = extract_education(text)
print("Extracted Education Degrees:", education_extracted)

Extracted Education Degrees: [{'degree': 'Bachelor', 'institution': 'of Computer Science', 'year': '2018', 'raw': 'Bachelor of Computer Science, Cairo University, 2018'}]


## 1.3 Extract Email 

In [46]:
text = """
Ahmed Mostafa
Senior Software Engineer | Backend & ML Specialist
Cairo, Egypt | ahmed.dev[at]gmail[dot]com | +20 100 123 4567 | linkedin.com/in/ahmedmostafa | github.com/ahmeddev

Summary:
Results-driven backend engineer with 6+ years of experience designing scalable systems using Python, FastAPI, PostgreSQL, and Docker. Proficient in building machine learning pipelines, deploying models to production with MLflow and Streamlit. Strong advocate for clean code, CI/CD, and agile development.

Skills:
- Programming: Python, JavaScript, SQL, C++
- ML/AI: Scikit-learn, PyTorch, Transformers, XGBoost, Hugging Face, LangChain
- Web: FastAPI, Flask, Django, React, Next.js, Tailwind CSS
- Databases: PostgreSQL, MySQL, MongoDB, Redis
- DevOps: Docker, Kubernetes, GitHub Actions, Terraform, AWS, GCP
- Tools: Jupyter, VS Code, Git, Postman, Slack, Notion
- Soft Skills: Problem Solving, Teamwork, Communication, Leadership, Critical Thinking

Experience:
Senior Backend Engineer – DataStack AI (Remote)
Aug 2021 – Present
- Built a scalable FastAPI backend for a recommendation engine with PostgreSQL & Redis.
- Containerized applications using Docker and deployed to GCP with Kubernetes.
- Developed automated CI/CD pipelines using GitHub Actions.
- Collaborated cross-functionally with frontend and ML teams using Agile.

Machine Learning Engineer – TechNova Labs
Jan 2019 – Jul 2021
- Deployed NLP models for resume parsing and classification using Hugging Face Transformers.
- Trained and tuned models with Scikit-learn, PyTorch, and MLflow.
- Developed an interactive data dashboard with Streamlit and Plotly.

Education:
Bachelor of Computer Science, Cairo University, 2018
GPA: 3.7 / 4.0

Certifications:
- AWS Certified Solutions Architect (2023–2026)
- Google Cloud Professional ML Engineer
- Microsoft Azure Fundamentals (AZ-900)
- Deep Learning Specialization – Coursera (Andrew Ng)
- ITIL v4 Foundation
- Certified Kubernetes Administrator (CKA)

Projects:
AI Resume Analyzer
- Built a Streamlit app that extracts and analyzes resume data using NLP and BERT NER.
- Features: skill matching, email & phone extraction, certification parser.

E-commerce REST API
- Designed a secure REST API using Django REST Framework.
- Integrated Stripe payments and user authentication with JWT.

Languages:
- Arabic (Native)
- English (Fluent)
- German (Intermediate)

Interests:
- Open source contributions
- AI ethics & fairness
- Playing chess & learning languages

"""


extract_email(text)

'ahmed.dev@gmail.com'

## 1.4 Extract Certifications

In [47]:
text = """
Ahmed Mostafa
Senior Software Engineer | Backend & ML Specialist
Cairo, Egypt | ahmed.dev[at]gmail[dot]com | +20 100 123 4567 | linkedin.com/in/ahmedmostafa | github.com/ahmeddev

Summary:
Results-driven backend engineer with 6+ years of experience designing scalable systems using Python, FastAPI, PostgreSQL, and Docker. Proficient in building machine learning pipelines, deploying models to production with MLflow and Streamlit. Strong advocate for clean code, CI/CD, and agile development.

Skills:
- Programming: Python, JavaScript, SQL, C++
- ML/AI: Scikit-learn, PyTorch, Transformers, XGBoost, Hugging Face, LangChain
- Web: FastAPI, Flask, Django, React, Next.js, Tailwind CSS
- Databases: PostgreSQL, MySQL, MongoDB, Redis
- DevOps: Docker, Kubernetes, GitHub Actions, Terraform, AWS, GCP
- Tools: Jupyter, VS Code, Git, Postman, Slack, Notion
- Soft Skills: Problem Solving, Teamwork, Communication, Leadership, Critical Thinking

Experience:
Senior Backend Engineer – DataStack AI (Remote)
Aug 2021 – Present
- Built a scalable FastAPI backend for a recommendation engine with PostgreSQL & Redis.
- Containerized applications using Docker and deployed to GCP with Kubernetes.
- Developed automated CI/CD pipelines using GitHub Actions.
- Collaborated cross-functionally with frontend and ML teams using Agile.

Machine Learning Engineer – TechNova Labs
Jan 2019 – Jul 2021
- Deployed NLP models for resume parsing and classification using Hugging Face Transformers.
- Trained and tuned models with Scikit-learn, PyTorch, and MLflow.
- Developed an interactive data dashboard with Streamlit and Plotly.

Education:
Bachelor of Computer Science, Cairo University, 2018
GPA: 3.7 / 4.0

Certifications:
- AWS Certified Solutions Architect (2023–2026)
- Google Cloud Professional ML Engineer
- Microsoft Azure Fundamentals (AZ-900)
- Deep Learning Specialization – Coursera (Andrew Ng)
- ITIL v4 Foundation
- Certified Kubernetes Administrator (CKA)

Projects:
AI Resume Analyzer
- Built a Streamlit app that extracts and analyzes resume data using NLP and BERT NER.
- Features: skill matching, email & phone extraction, certification parser.

E-commerce REST API
- Designed a secure REST API using Django REST Framework.
- Integrated Stripe payments and user authentication with JWT.

Languages:
- Arabic (Native)
- English (Fluent)
- German (Intermediate)

Interests:
- Open source contributions
- AI ethics & fairness
- Playing chess & learning languages

"""


extract_certifications(text)


['AWS Certified',
 'Azure Fundamentals',
 'CKA',
 'Certifications:',
 'Coursera',
 'Features: skill matching, email & phone extraction, certification parser.',
 'Google Cloud Professional',
 'ITIL']

# 1.5 Extract Links

In [48]:
text = """
Ahmed Mostafa
Senior Software Engineer | Backend & ML Specialist
Cairo, Egypt | ahmed.dev[at]gmail[dot]com | +20 100 123 4567 | linkedin.com/in/ahmedmostafa | github.com/ahmeddev

Summary:
Results-driven backend engineer with 6+ years of experience designing scalable systems using Python, FastAPI, PostgreSQL, and Docker. Proficient in building machine learning pipelines, deploying models to production with MLflow and Streamlit. Strong advocate for clean code, CI/CD, and agile development.

Skills:
- Programming: Python, JavaScript, SQL, C++
- ML/AI: Scikit-learn, PyTorch, Transformers, XGBoost, Hugging Face, LangChain
- Web: FastAPI, Flask, Django, React, Next.js, Tailwind CSS
- Databases: PostgreSQL, MySQL, MongoDB, Redis
- DevOps: Docker, Kubernetes, GitHub Actions, Terraform, AWS, GCP
- Tools: Jupyter, VS Code, Git, Postman, Slack, Notion
- Soft Skills: Problem Solving, Teamwork, Communication, Leadership, Critical Thinking

Experience:
Senior Backend Engineer – DataStack AI (Remote)
Aug 2021 – Present
- Built a scalable FastAPI backend for a recommendation engine with PostgreSQL & Redis.
- Containerized applications using Docker and deployed to GCP with Kubernetes.
- Developed automated CI/CD pipelines using GitHub Actions.
- Collaborated cross-functionally with frontend and ML teams using Agile.

Machine Learning Engineer – TechNova Labs
Jan 2019 – Jul 2021
- Deployed NLP models for resume parsing and classification using Hugging Face Transformers.
- Trained and tuned models with Scikit-learn, PyTorch, and MLflow.
- Developed an interactive data dashboard with Streamlit and Plotly.

Education:
Bachelor of Computer Science, Cairo University, 2018
GPA: 3.7 / 4.0

Certifications:
- AWS Certified Solutions Architect (2023–2026)
- Google Cloud Professional ML Engineer
- Microsoft Azure Fundamentals (AZ-900)
- Deep Learning Specialization – Coursera (Andrew Ng)
- ITIL v4 Foundation
- Certified Kubernetes Administrator (CKA)

Projects:
AI Resume Analyzer
- Built a Streamlit app that extracts and analyzes resume data using NLP and BERT NER.
- Features: skill matching, email & phone extraction, certification parser.

E-commerce REST API
- Designed a secure REST API using Django REST Framework.
- Integrated Stripe payments and user authentication with JWT.

Languages:
- Arabic (Native)
- English (Fluent)
- German (Intermediate)

Interests:
- Open source contributions
- AI ethics & fairness
- Playing chess & learning languages
"""

extract_links(text)

['https://gmail.com']

## 1.6 Extract Phone

In [94]:
text = """
Ahmed Mostafa
Senior Software Engineer | Backend & ML Specialist
Cairo, Egypt | ahmed.dev[at]gmail[dot]com | +20 100 123 4567 | linkedin.com/in/ahmedmostafa | github.com/ahmeddev

Summary:
Results-driven backend engineer with 6+ years of experience designing scalable systems using Python, FastAPI, PostgreSQL, and Docker. Proficient in building machine learning pipelines, deploying models to production with MLflow and Streamlit. Strong advocate for clean code, CI/CD, and agile development.

Skills:
- Programming: Python, JavaScript, SQL, C++
- ML/AI: Scikit-learn, PyTorch, Transformers, XGBoost, Hugging Face, LangChain
- Web: FastAPI, Flask, Django, React, Next.js, Tailwind CSS
- Databases: PostgreSQL, MySQL, MongoDB, Redis
- DevOps: Docker, Kubernetes, GitHub Actions, Terraform, AWS, GCP
- Tools: Jupyter, VS Code, Git, Postman, Slack, Notion
- Soft Skills: Problem Solving, Teamwork, Communication, Leadership, Critical Thinking

Experience:
Senior Backend Engineer – DataStack AI (Remote)
Aug 2021 – Present
- Built a scalable FastAPI backend for a recommendation engine with PostgreSQL & Redis.
- Containerized applications using Docker and deployed to GCP with Kubernetes.
- Developed automated CI/CD pipelines using GitHub Actions.
- Collaborated cross-functionally with frontend and ML teams using Agile.

Machine Learning Engineer – TechNova Labs
Jan 2019 – Jul 2021
- Deployed NLP models for resume parsing and classification using Hugging Face Transformers.
- Trained and tuned models with Scikit-learn, PyTorch, and MLflow.
- Developed an interactive data dashboard with Streamlit and Plotly.

Education:
Bachelor of Computer Science, Cairo University, 2018
GPA: 3.7 / 4.0

Certifications:
- AWS Certified Solutions Architect (2023–2026)
- Google Cloud Professional ML Engineer
- Microsoft Azure Fundamentals (AZ-900)
- Deep Learning Specialization – Coursera (Andrew Ng)
- ITIL v4 Foundation
- Certified Kubernetes Administrator (CKA)

Projects:
AI Resume Analyzer
- Built a Streamlit app that extracts and analyzes resume data using NLP and BERT NER.
- Features: skill matching, email & phone extraction, certification parser.

E-commerce REST API
- Designed a secure REST API using Django REST Framework.
- Integrated Stripe payments and user authentication with JWT.

Languages:
- Arabic (Native)
- English (Fluent)
- German (Intermediate)

Interests:
- Open source contributions
- AI ethics & fairness
- Playing chess & learning languages
"""

extract_phone(text)

'+201001234567'

## 1.7 Extract Experiance

In [36]:
text = """
Ahmed Mostafa
Senior Software Engineer | Backend & ML Specialist
Cairo, Egypt | ahmed.dev[at]gmail[dot]com | +20 100 123 4567 | linkedin.com/in/ahmedmostafa | github.com/ahmeddev

Summary:
Results-driven backend engineer with 6+ years of experience designing scalable systems using Python, FastAPI, PostgreSQL, and Docker. Proficient in building machine learning pipelines, deploying models to production with MLflow and Streamlit. Strong advocate for clean code, CI/CD, and agile development.

Skills:
- Programming: Python, JavaScript, SQL, C++
- ML/AI: Scikit-learn, PyTorch, Transformers, XGBoost, Hugging Face, LangChain
- Web: FastAPI, Flask, Django, React, Next.js, Tailwind CSS
- Databases: PostgreSQL, MySQL, MongoDB, Redis
- DevOps: Docker, Kubernetes, GitHub Actions, Terraform, AWS, GCP
- Tools: Jupyter, VS Code, Git, Postman, Slack, Notion
- Soft Skills: Problem Solving, Teamwork, Communication, Leadership, Critical Thinking

Experience:
Senior Backend Engineer – DataStack AI (Remote)
Aug 2021 – Present
- Built a scalable FastAPI backend for a recommendation engine with PostgreSQL & Redis.
- Containerized applications using Docker and deployed to GCP with Kubernetes.
- Developed automated CI/CD pipelines using GitHub Actions.
- Collaborated cross-functionally with frontend and ML teams using Agile.

Machine Learning Engineer – TechNova Labs
Jan 2019 – Jul 2021
- Deployed NLP models for resume parsing and classification using Hugging Face Transformers.
- Trained and tuned models with Scikit-learn, PyTorch, and MLflow.
- Developed an interactive data dashboard with Streamlit and Plotly.

Education:
Bachelor of Computer Science, Cairo University, 2018
GPA: 3.7 / 4.0

Certifications:
- AWS Certified Solutions Architect (2023–2026)
- Google Cloud Professional ML Engineer
- Microsoft Azure Fundamentals (AZ-900)
- Deep Learning Specialization – Coursera (Andrew Ng)
- ITIL v4 Foundation
- Certified Kubernetes Administrator (CKA)

Projects:
AI Resume Analyzer
- Built a Streamlit app that extracts and analyzes resume data using NLP and BERT NER.
- Features: skill matching, email & phone extraction, certification parser.

E-commerce REST API
- Designed a secure REST API using Django REST Framework.
- Integrated Stripe payments and user authentication with JWT.

Languages:
- Arabic (Native)
- English (Fluent)
- German (Intermediate)

Interests:
- Open source contributions
- AI ethics & fairness
- Playing chess & learning languages
"""

extract_experience_lines(text)

[{'Title': 'Senior Backend Engineer',
  'Company': 'DataStack AI (Remote)',
  'Date': '2021 – Present',
  'Raw': 'Senior Backend Engineer – DataStack AI (Remote)',
  'Description': '- Programming: Python, JavaScript, SQL, C++ - ML/AI: Scikit-learn, PyTorch, Transformers, XGBoost, Hugging Face, LangChain - Web: FastAPI, Flask, Django, React, Next.js, Tailwind CSS - Databases: PostgreSQL, MySQL, MongoDB, Redis - DevOps: Docker, Kubernetes, GitHub Actions, Terraform, AWS, GCP - Tools: Jupyter, VS Code, Git, Postman, Slack, Notion - Soft Skills: Problem Solving, Teamwork, Communication, Leadership, Critical Thinking - Built a scalable FastAPI backend for a recommendation engine with PostgreSQL & Redis. - Containerized applications using Docker and deployed to GCP with Kubernetes. - Developed automated CI/CD pipelines using GitHub Actions. - Collaborated cross-functionally with frontend and ML teams using Agile.'},
 {'Title': 'Machine Learning Engineer',
  'Company': 'TechNova Labs',
  'Dat

## 1.8 Extract Projects

In [49]:
text = """
Ahmed Mostafa
Senior Software Engineer | Backend & ML Specialist
Cairo, Egypt | ahmed.dev[at]gmail[dot]com | +20 100 123 4567 | linkedin.com/in/ahmedmostafa | github.com/ahmeddev

Summary:
Results-driven backend engineer with 6+ years of experience designing scalable systems using Python, FastAPI, PostgreSQL, and Docker. Proficient in building machine learning pipelines, deploying models to production with MLflow and Streamlit. Strong advocate for clean code, CI/CD, and agile development.

Skills:
- Programming: Python, JavaScript, SQL, C++
- ML/AI: Scikit-learn, PyTorch, Transformers, XGBoost, Hugging Face, LangChain
- Web: FastAPI, Flask, Django, React, Next.js, Tailwind CSS
- Databases: PostgreSQL, MySQL, MongoDB, Redis
- DevOps: Docker, Kubernetes, GitHub Actions, Terraform, AWS, GCP
- Tools: Jupyter, VS Code, Git, Postman, Slack, Notion
- Soft Skills: Problem Solving, Teamwork, Communication, Leadership, Critical Thinking

Experience:
Senior Backend Engineer – DataStack AI (Remote)
Aug 2021 – Present
- Built a scalable FastAPI backend for a recommendation engine with PostgreSQL & Redis.
- Containerized applications using Docker and deployed to GCP with Kubernetes.
- Developed automated CI/CD pipelines using GitHub Actions.
- Collaborated cross-functionally with frontend and ML teams using Agile.

Machine Learning Engineer – TechNova Labs
Jan 2019 – Jul 2021
- Deployed NLP models for resume parsing and classification using Hugging Face Transformers.
- Trained and tuned models with Scikit-learn, PyTorch, and MLflow.
- Developed an interactive data dashboard with Streamlit and Plotly.

Education:
Bachelor of Computer Science, Cairo University, 2018
GPA: 3.7 / 4.0

Certifications:
- AWS Certified Solutions Architect (2023–2026)
- Google Cloud Professional ML Engineer
- Microsoft Azure Fundamentals (AZ-900)
- Deep Learning Specialization – Coursera (Andrew Ng)
- ITIL v4 Foundation
- Certified Kubernetes Administrator (CKA)

Projects:
AI Resume Analyzer
- Built a Streamlit app that extracts and analyzes resume data using NLP and BERT NER.
- Features: skill matching, email & phone extraction, certification parser.

E-commerce REST API
- Designed a secure REST API using Django REST Framework.
- Integrated Stripe payments and user authentication with JWT.

Languages:
- Arabic (Native)
- English (Fluent)
- German (Intermediate)

Interests:
- Open source contributions
- AI ethics & fairness
- Playing chess & learning languages
"""

extract_projects(text)

[{'Title': 'AI Resume Analyzer',
  'Description': '- Built a Streamlit app that extracts and analyzes resume data using NLP and BERT NER. - Features: skill matching, email & phone extraction, certification parser.',
  'Technologies': ['BERT', 'NLP']},
 {'Title': 'E-commerce REST API',
  'Description': '- Designed a secure REST API using Django REST Framework. - Integrated Stripe payments and user authentication with JWT.',
  'Technologies': ['Django', 'REST API']}]

## 1.9 Extract Hobbies

In [50]:
text = """
Ahmed Mostafa
Senior Software Engineer | Backend & ML Specialist
Cairo, Egypt | ahmed.dev[at]gmail[dot]com | +20 100 123 4567 | linkedin.com/in/ahmedmostafa | github.com/ahmeddev

Summary:
Results-driven backend engineer with 6+ years of experience designing scalable systems using Python, FastAPI, PostgreSQL, and Docker. Proficient in building machine learning pipelines, deploying models to production with MLflow and Streamlit. Strong advocate for clean code, CI/CD, and agile development.

Skills:
- Programming: Python, JavaScript, SQL, C++
- ML/AI: Scikit-learn, PyTorch, Transformers, XGBoost, Hugging Face, LangChain
- Web: FastAPI, Flask, Django, React, Next.js, Tailwind CSS
- Databases: PostgreSQL, MySQL, MongoDB, Redis
- DevOps: Docker, Kubernetes, GitHub Actions, Terraform, AWS, GCP
- Tools: Jupyter, VS Code, Git, Postman, Slack, Notion
- Soft Skills: Problem Solving, Teamwork, Communication, Leadership, Critical Thinking

Experience:
Senior Backend Engineer – DataStack AI (Remote)
Aug 2021 – Present
- Built a scalable FastAPI backend for a recommendation engine with PostgreSQL & Redis.
- Containerized applications using Docker and deployed to GCP with Kubernetes.
- Developed automated CI/CD pipelines using GitHub Actions.
- Collaborated cross-functionally with frontend and ML teams using Agile.

Machine Learning Engineer – TechNova Labs
Jan 2019 – Jul 2021
- Deployed NLP models for resume parsing and classification using Hugging Face Transformers.
- Trained and tuned models with Scikit-learn, PyTorch, and MLflow.
- Developed an interactive data dashboard with Streamlit and Plotly.

Education:
Bachelor of Computer Science, Cairo University, 2018
GPA: 3.7 / 4.0

Certifications:
- AWS Certified Solutions Architect (2023–2026)
- Google Cloud Professional ML Engineer
- Microsoft Azure Fundamentals (AZ-900)
- Deep Learning Specialization – Coursera (Andrew Ng)
- ITIL v4 Foundation
- Certified Kubernetes Administrator (CKA)

Projects:
AI Resume Analyzer
- Built a Streamlit app that extracts and analyzes resume data using NLP and BERT NER.
- Features: skill matching, email & phone extraction, certification parser.

E-commerce REST API
- Designed a secure REST API using Django REST Framework.
- Integrated Stripe payments and user authentication with JWT.

Languages:
- Arabic (Native)
- English (Fluent)
- German (Intermediate)

Interests:
- Open source contributions
- AI ethics & fairness
- Playing chess & learning languages
"""

extract_interests(text)


['Open source contributions',
 'AI ethics & fairness',
 'Playing chess & learning languages']

## 1. 10 Extract Languages

In [118]:
text = """
Ahmed Mostafa
Senior Software Engineer | Backend & ML Specialist
Cairo, Egypt | ahmed.dev[at]gmail[dot]com | +20 100 123 4567 | linkedin.com/in/ahmedmostafa | github.com/ahmeddev

Summary:
Results-driven backend engineer with 6+ years of experience designing scalable systems using Python, FastAPI, PostgreSQL, and Docker. Proficient in building machine learning pipelines, deploying models to production with MLflow and Streamlit. Strong advocate for clean code, CI/CD, and agile development.

Skills:
- Programming: Python, JavaScript, SQL, C++
- ML/AI: Scikit-learn, PyTorch, Transformers, XGBoost, Hugging Face, LangChain
- Web: FastAPI, Flask, Django, React, Next.js, Tailwind CSS
- Databases: PostgreSQL, MySQL, MongoDB, Redis
- DevOps: Docker, Kubernetes, GitHub Actions, Terraform, AWS, GCP
- Tools: Jupyter, VS Code, Git, Postman, Slack, Notion
- Soft Skills: Problem Solving, Teamwork, Communication, Leadership, Critical Thinking

Experience:
Senior Backend Engineer – DataStack AI (Remote)
Aug 2021 – Present
- Built a scalable FastAPI backend for a recommendation engine with PostgreSQL & Redis.
- Containerized applications using Docker and deployed to GCP with Kubernetes.
- Developed automated CI/CD pipelines using GitHub Actions.
- Collaborated cross-functionally with frontend and ML teams using Agile.

Machine Learning Engineer – TechNova Labs
Jan 2019 – Jul 2021
- Deployed NLP models for resume parsing and classification using Hugging Face Transformers.
- Trained and tuned models with Scikit-learn, PyTorch, and MLflow.
- Developed an interactive data dashboard with Streamlit and Plotly.

Education:
Bachelor of Computer Science, Cairo University, 2018
GPA: 3.7 / 4.0

Certifications:
- AWS Certified Solutions Architect (2023–2026)
- Google Cloud Professional ML Engineer
- Microsoft Azure Fundamentals (AZ-900)
- Deep Learning Specialization – Coursera (Andrew Ng)
- ITIL v4 Foundation
- Certified Kubernetes Administrator (CKA)

Projects:
AI Resume Analyzer
- Built a Streamlit app that extracts and analyzes resume data using NLP and BERT NER.
- Features: skill matching, email & phone extraction, certification parser.

E-commerce REST API
- Designed a secure REST API using Django REST Framework.
- Integrated Stripe payments and user authentication with JWT.

Languages:
- Arabic (Native)
- English (Fluent)
- German (Intermediate)

Interests:
- Open source contributions
- AI ethics & fairness
- Playing chess & learning languages
"""

extract_languages(text)

['Arabic', 'English', 'German']