In [1]:
!pip install spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


Skills Extraction

In [2]:
import spacy

# Load the pre-trained NLP model
nlp = spacy.load("en_core_web_sm")

# Sample text from the skills description
text = """
Technical Proficiencies:
Advanced Machine Learning/Deep Learning: Demonstrated mastery in developing cutting-edge deep learning models tailored for sophisticated computer vision and NLP applications. Proficient in leveraging TensorFlow, Keras, and PyTorch frameworks to engineer models that push the boundaries of AI research and application.
Computer Vision Expertise: Expert-level proficiency in deploying convolutional neural networks (CNNs) for a variety of computer vision tasks, including but not limited to advanced image classification, precise object detection, and detailed image segmentation. Capable of optimizing algorithms for high accuracy and efficiency in real-world scenarios.
Natural Language Processing Mastery: Highly skilled in the application of recurrent neural networks (RNNs), state-of-the-art transformers, and BERT models for complex NLP challenges such as nuanced text classification, comprehensive sentiment analysis, and innovative language generation projects.
Data Engineering Excellence: Adept at orchestrating data engineering pipelines, from meticulous data preprocessing and innovative feature engineering to leveraging big data technologies like Hadoop and Spark. Specializes in architecting scalable data solutions that facilitate efficient processing and analysis of vast datasets.
Soft Skills:
Leadership & Team Development: Proven track record of exemplary leadership, successfully guiding data science teams towards achieving groundbreaking results. Committed to mentoring junior data scientists and analysts, fostering an environment of growth and innovation.
Strategic Business Alignment: Exceptional at strategic planning, adeptly aligning data science initiatives with overarching business goals to drive meaningful impact. Expertise in identifying opportunities for leveraging data science to solve critical business challenges.
Communicative Clarity: Outstanding communication skills, with a gift for demystifying complex data science principles for non-technical audiences. Excels at bridging the gap between technical teams and business stakeholders, ensuring clear understanding and collaborative success

"""

# Process the text with spaCy
doc = nlp(text)

# Function to extract keywords based on noun chunks and named entities
def extract_technical_keywords(doc):
    keywords = set()  # Use a set to avoid duplicates

    # Add specific technical terms based on entities and noun chunks
    for ent in doc.ents:
        if ent.label_ in ["ORG", "PRODUCT", "GPE"]:  # Focus on organizations, products, or technologies
            keywords.add(ent.text)

    for chunk in doc.noun_chunks:
        # Focus on chunks that likely represent technical concepts or tools
        if "model" in chunk.text.lower() or "network" in chunk.text.lower() or "data" in chunk.text.lower():
            keywords.add(chunk.text)

    return keywords

# Extract keywords
technical_keywords = extract_technical_keywords(doc)

# Print the extracted keywords
print("Extracted Technical Keywords:")
for keyword in sorted(technical_keywords):
    print(f"- {keyword}")


Extracted Technical Keywords:
- AI
- Advanced Machine Learning/Deep Learning
- BERT
- BERT models
- Data Engineering Excellence
- Data Engineering Excellence: Adept
- Hadoop
- Keras
- Leadership & Team Development
- NLP
- Natural Language Processing Mastery: Highly
- PyTorch
- Spark
- TensorFlow
- big data technologies
- complex data science principles
- convolutional neural networks
- cutting-edge deep learning models
- data engineering pipelines
- data science
- data science initiatives
- data science teams
- junior data scientists
- meticulous data preprocessing
- models
- recurrent neural networks
- scalable data solutions
- vast datasets
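
The noun-chunk heuristic above also returns heading fragments such as "Data Engineering Excellence: Adept" and generic chunks such as "models". One possible complement is a plain vocabulary lookup with whole-word matching. The sketch below is only an illustration: `KNOWN_SKILLS` and `match_known_skills` are hypothetical names, and the vocabulary is a hand-picked assumption rather than anything spaCy provides.

```python
import re

# Hypothetical, hand-curated vocabulary; extend it to suit the CVs being parsed.
KNOWN_SKILLS = ["TensorFlow", "Keras", "PyTorch", "Hadoop", "Spark", "BERT", "CNNs", "RNNs"]

def match_known_skills(text, vocabulary=KNOWN_SKILLS):
    """Return the vocabulary terms that occur in text as whole words, in vocabulary order."""
    found = []
    for term in vocabulary:
        # The lookarounds keep "Spark" from matching inside a longer word such as "Sparkline"
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, text, re.IGNORECASE):
            found.append(term)
    return found

sample = "Proficient in TensorFlow, Keras, and PyTorch; uses big data technologies like Hadoop and Spark."
print(match_known_skills(sample))
# ['TensorFlow', 'Keras', 'PyTorch', 'Hadoop', 'Spark']
```

A lookup like this only finds terms that are already in the list, so it is best combined with the spaCy-based extraction rather than used in place of it.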


In [3]:
import spacy
from dateutil.relativedelta import relativedelta
from datetime import datetime

# Load the pre-trained NLP model
nlp = spacy.load("en_core_web_sm")

# Example text (the actual CV content should be read from a file or other sources)
cv_text = """
Personal Information
Name: Arjun Patel
Gender: Male
Nationality: Indian
Contact Information:
Email: arjun.patel@example.com
Phone: +91 98200 12345
LinkedIn: linkedin.com/in/arjunpatel
Education
Master of Science in Artificial Intelligence, Indian Institute of Technology Bombay (IIT Bombay), Mumbai, India, 2017
Bachelor of Engineering in Computer Science, National Institute of Technology Karnataka (NITK), Surathkal, India, 2015
Skills
Technical Expertise:
Natural Language Processing & Large Language Models: Exhibits top-tier proficiency in architecting and executing NLP applications leveraging advanced transformer technologies such as BERT, GPT-3, and T5. Adept in harnessing TensorFlow and PyTorch frameworks for model training and inference, facilitating breakthroughs in text processing and generation tasks.
Machine Learning & Artificial Intelligence: Possesses a comprehensive grasp of core machine learning principles, adept at preprocessing data and applying sophisticated deep learning strategies to solve intricate NLP challenges. This includes a strategic approach to algorithm selection, model tuning, and leveraging AI to enhance linguistic analysis.
Programming and Development: Expert-level command of Python, fortified by solid programming skills in Java and C++, enabling the development of robust, efficient software solutions. Proficient in integrating a wide array of NLP libraries including NLTK, spaCy, and Hugging Face's Transformers to enrich AI applications with natural language understanding and generation capabilities.
DevOps & Cloud Solutions: Skilled in the deployment and management of NLP solutions within cloud environments such as AWS and Azure, employing Docker and Kubernetes for effective containerization and orchestration. This expertise ensures scalable, resilient AI system architectures capable of handling expansive datasets and intensive computational tasks.
Interpersonal Abilities:
Leadership in AI Project Execution: Proven track record of steering AI projects to success, adept at orchestrating project timelines and marshaling cross-disciplinary teams towards achieving ambitious technological objectives.
Data-Driven Problem Solving: Outstanding analytical abilities, specializing in untangling complex technical problems through methodical data-driven approaches. This skill is pivotal in optimizing AI systems for peak performance and innovation.
Communication & Team Dynamics: Distinguished by the ability to clearly convey complex AI and NLP concepts across varied audiences, enhancing stakeholder understanding and project alignment. Committed to nurturing a collaborative work environment, promoting synergy and shared success among team members.
Professional Experience
Senior NLP/LLMs Developer, AI Innovations Lab, Bangalore, India, January 2018 - Present
NLP Engineer, Tech Solutions Pvt. Ltd., Hyderabad, India, June 2015 - December 2017
"""

# Function to calculate total years of experience
def calculate_experience(text):
    # Define the current year and month for ongoing positions
    current_date = datetime.now()
    total_experience = relativedelta()

    # Find all date patterns and calculate durations
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ == "DATE":
            dates = ent.text.split("-")
            if len(dates) == 2:
                start_date = datetime.strptime(dates[0].strip(), "%B %Y")
                end_date = datetime.strptime(dates[1].strip(), "%B %Y") if "Present" not in dates[1] else current_date
                total_experience += relativedelta(end_date, start_date)

    # Convert total experience into years and months
    years = total_experience.years
    months = total_experience.months
    total_years = years + months / 12
    return total_years

# Process the CV text
doc = nlp(cv_text)

# Extract required information
info = {
    "Candidate Name": "",
    "Gender": "",
    "Nationality": "",
    "Email": "",
    "Mobile Numbers": "",
    "Skills": {"Technical Expertise": [], "Interpersonal Abilities": []},
    "Total Years of Experiences": "",
    "College Names": [],
    "Degrees": [],
    "Designations": [],
    "Last Company Names": ""
}

# Use entity recognition and text parsing to extract information
for ent in doc.ents:
    if ent.label_ == "PERSON" and "Name" in ent.sent.text:
        info["Candidate Name"] = ent.text
    elif ent.label_ == "NORP" and "Nationality" in ent.sent.text:
        info["Nationality"] = ent.text
    elif ent.label_ == "EMAIL":  # Note: en_core_web_sm has no EMAIL label, so this branch never fires
        info["Email"] = ent.text
    elif ent.label_ == "ORG":
        if "university" in ent.text.lower() or "institute" in ent.text.lower():
            info["College Names"].append(ent.text)
        elif "Bachelor" in ent.sent.text or "Master" in ent.sent.text:
            info["Degrees"].append(ent.sent.text)
        elif "Developer" in ent.text or "Engineer" in ent.text:
            info["Designations"].append(ent.text)
            info["Last Company Names"] = ent.text.split(",")[1] if "," in ent.text else ent.text
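    elif ent.label_ == "GPE":
        # Note: GPE marks places, not phone numbers, so this stores a location (e.g. "India") when it fires
        if "Phone" in ent.sent.text:
            info["Mobile Numbers"] = ent.text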

# Additional parsing for skills (manual text parsing)
skills_section = cv_text.split("Skills")[1].split("Professional Experience")[0]
tech_skills = skills_section.split("Interpersonal Abilities")[0]
interpersonal_skills = skills_section.split("Interpersonal Abilities")[1]

for line in tech_skills.split("\n"):
    if ":" in line:
        skill = line.split(":")[0].strip()
        if skill:
            info["Skills"]["Technical Expertise"].append(skill)

for line in interpersonal_skills.split("\n"):
    if ":" in line:
        skill = line.split(":")[0].strip()
        if skill:
            info["Skills"]["Interpersonal Abilities"].append(skill)

# Calculate total years of experience
info["Total Years of Experiences"] = calculate_experience(cv_text)

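# Output the information
# Gender is not an NER label, so read it from the "Gender:" line instead of hard-coding it
info["Gender"] = next((line.split(":", 1)[1].strip() for line in cv_text.splitlines() if line.startswith("Gender:")), "")
print(f"Candidate Name: {info['Candidate Name']}")
print(f"Gender: {info['Gender']}")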
print(f"Nationality: {info['Nationality']}")
print(f"Email: {info['Email']}")
print(f"Mobile Numbers: {info['Mobile Numbers']}")
print("Skills:")
print(f"Technical Expertise: {', '.join(info['Skills']['Technical Expertise'])}")
print(f"Interpersonal Abilities: {', '.join(info['Skills']['Interpersonal Abilities'])}")
print(f"Total Years of Experiences: {info['Total Years of Experiences']:.2f} years")
print(f"College Names: {', '.join(info['College Names'])}")
print(f"Degrees: {', '.join(info['Degrees'])}")
print(f"Designations: {', '.join(info['Designations'])}")
print(f"Last Company Names: {info['Last Company Names']}")


Candidate Name: Arjun Patel
Gender
Gender: Male
Nationality: 
Email: 
Mobile Numbers: India
Skills:
Technical Expertise: Technical Expertise, Natural Language Processing & Large Language Models, Machine Learning & Artificial Intelligence, Programming and Development, DevOps & Cloud Solutions
Interpersonal Abilities: Leadership in AI Project Execution, Data-Driven Problem Solving, Communication & Team Dynamics
Total Years of Experiences: 8.83 years
College Names: Indian Institute of Technology Bombay, National Institute of Technology Karnataka
Degrees: 
Personal Information
Name: Arjun Patel
Gender: Male
Nationality: Indian
Contact Information:
Email: arjun.patel@example.com
Phone: +91 98200 12345
LinkedIn: linkedin.com/in/arjunpatel
Education
Master of Science in Artificial Intelligence, Indian Institute of Technology Bombay (IIT Bombay), Mumbai, India, 2017
Bachelor of Engineering in Computer Science, National Institute of Technology Karnataka (NITK), Surathkal, India, 2015
Skills
Tec

In [4]:
info.keys()

dict_keys(['Candidate Name', 'Gender', 'Nationality', 'Email', 'Mobile Numbers', 'Skills', 'Total Years of Experiences', 'College Names', 'Degrees', 'Designations', 'Last Company Names'])

In [5]:
info['Nationality']

''

In [6]:
import spacy

# Load the pre-trained NLP model
nlp = spacy.load("en_core_web_sm")

In [7]:
def extracting_skills(text):


  # Process the text with spaCy
  doc = nlp(text)

  # Function to extract keywords based on noun chunks and named entities
  def extract_technical_keywords(doc):
      keywords = set()  # Use a set to avoid duplicates

      # Add specific technical terms based on entities and noun chunks
      for ent in doc.ents:
          if ent.label_ in ["ORG", "PRODUCT", "GPE"]:  # Focus on organizations, products, or technologies
              keywords.add(ent.text)

      for chunk in doc.noun_chunks:
          # Focus on chunks that likely represent technical concepts or tools
          if "model" in chunk.text.lower() or "network" in chunk.text.lower() or "data" in chunk.text.lower():
              keywords.add(chunk.text)

      return keywords

  # Extract and return the keywords; callers can sort or print them as needed
  return extract_technical_keywords(doc)

In [8]:
import spacy
from dateutil.relativedelta import relativedelta
from datetime import datetime

# Load the pre-trained NLP model
nlp = spacy.load("en_core_web_sm")

In [9]:
# Function to calculate total years of experience
def calculate_experience(text):
    # Define the current year and month for ongoing positions
    current_date = datetime.now()
    total_experience = relativedelta()

    # Find all date patterns and calculate durations
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ == "DATE":
            dates = ent.text.split("-")
            if len(dates) == 2:
                start_date = datetime.strptime(dates[0].strip(), "%B %Y")
                end_date = datetime.strptime(dates[1].strip(), "%B %Y") if "Present" not in dates[1] else current_date
                total_experience += relativedelta(end_date, start_date)

    # Convert total experience into years and months
    years = total_experience.years
    months = total_experience.months
    total_years = years + months / 12
    return total_years
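
The function above depends on spaCy tagging each full range (for example "June 2015 - December 2017") as a single DATE entity. A regex-based variant avoids that dependency. The sketch below assumes ranges are written as "Month YYYY - Month YYYY" or "Month YYYY - Present"; `RANGE_RE`, `months_of_experience`, and the `today` parameter are hypothetical names introduced for illustration, and the reference date in the usage lines is an arbitrary choice made for reproducibility.

```python
import re
from datetime import datetime

# Matches ranges such as "June 2015 - December 2017" or "January 2018 - Present"
RANGE_RE = re.compile(r"([A-Z][a-z]+ \d{4})\s*-\s*([A-Z][a-z]+ \d{4}|Present)")

def months_of_experience(text, today=None):
    """Sum the whole months covered by every date range found in text."""
    today = today or datetime.now()
    total = 0
    for start_raw, end_raw in RANGE_RE.findall(text):
        try:
            start = datetime.strptime(start_raw, "%B %Y")
            end = today if end_raw == "Present" else datetime.strptime(end_raw, "%B %Y")
        except ValueError:
            continue  # The capitalised word was not a month name
        total += (end.year - start.year) * 12 + (end.month - start.month)
    return total

sample = """Senior NLP/LLMs Developer, AI Innovations Lab, Bangalore, India, January 2018 - Present
NLP Engineer, Tech Solutions Pvt. Ltd., Hyderabad, India, June 2015 - December 2017"""

months = months_of_experience(sample, today=datetime(2024, 5, 1))
print(f"{months / 12:.2f} years")
# 8.83 years
```

Like the original, this simply adds the ranges together, so overlapping positions would be counted twice.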

In [10]:
def persona_extraction(cv_text):

  # Process the CV text
  doc = nlp(cv_text)

  # Extract required information
  info = {
      "Candidate Name": "",
      "Gender": "",
      "Nationality": "",
      "Email": "",
      "Mobile Numbers": "",
      "Skills": {"Technical Expertise": [], "Interpersonal Abilities": []},
      "Total Years of Experiences": "",
      "College Names": [],
      "Degrees": [],
      "Designations": [],
      "Last Company Names": ""
  }

  # Use entity recognition and text parsing to extract information
  for ent in doc.ents:
      if ent.label_ == "PERSON" and "Name" in ent.sent.text:
          info["Candidate Name"] = ent.text
      elif ent.label_ == "NORP" and "Nationality" in ent.sent.text:
          info["Nationality"] = ent.text
      elif ent.label_ == "EMAIL":  # Note: en_core_web_sm has no EMAIL label, so this branch never fires
          info["Email"] = ent.text
      elif ent.label_ == "ORG":
          if "university" in ent.text.lower() or "institute" in ent.text.lower():
              info["College Names"].append(ent.text)
          elif "Bachelor" in ent.sent.text or "Master" in ent.sent.text:
              info["Degrees"].append(ent.sent.text)
          elif "Developer" in ent.text or "Engineer" in ent.text:
              info["Designations"].append(ent.text)
              info["Last Company Names"] = ent.text.split(",")[1] if "," in ent.text else ent.text
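      elif ent.label_ == "GPE":
          # Note: GPE marks places, not phone numbers, so this stores a location (e.g. "India") when it fires
          if "Phone" in ent.sent.text:
              info["Mobile Numbers"] = ent.text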

  # Additional parsing for skills (manual text parsing)
  skills_section = cv_text.split("Skills")[1].split("Professional Experience")[0]
  tech_skills = skills_section.split("Interpersonal Abilities")[0]
  interpersonal_skills = skills_section.split("Interpersonal Abilities")[1]

  for line in tech_skills.split("\n"):
      if ":" in line:
          skill = line.split(":")[0].strip()
          if skill:
              info["Skills"]["Technical Expertise"].append(skill)

  for line in interpersonal_skills.split("\n"):
      if ":" in line:
          skill = line.split(":")[0].strip()
          if skill:
              info["Skills"]["Interpersonal Abilities"].append(skill)

  # Calculate total years of experience
  info["Total Years of Experiences"] = calculate_experience(cv_text)
  final_persona={}
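  # Output the information
  # Gender is not an NER label, so read it from the "Gender:" line instead of hard-coding it
  info["Gender"] = next((line.split(":", 1)[1].strip() for line in cv_text.splitlines() if line.startswith("Gender:")), "")
  print(f"Candidate Name: {info['Candidate Name']}")
  print(f"Gender: {info['Gender']}")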
  print(f"Nationality: {info['Nationality']}")
  print(f"Email: {info['Email']}")
  print(f"Mobile Numbers: {info['Mobile Numbers']}")
  print("Skills:")
  print(f"Technical Expertise: {', '.join(info['Skills']['Technical Expertise'])}")
  print(f"Interpersonal Abilities: {', '.join(info['Skills']['Interpersonal Abilities'])}")
  print(f"Total Years of Experiences: {info['Total Years of Experiences']:.2f} years")
  print(f"College Names: {', '.join(info['College Names'])}")
  print(f"Degrees: {', '.join(info['Degrees'])}")
  print(f"Designations: {', '.join(info['Designations'])}")
  print(f"Last Company Names: {info['Last Company Names']}")
  final_persona['Candidate Name']=info['Candidate Name']
  final_persona['Gender']=info['Gender']
  final_persona['Nationality']=info['Nationality']
  final_persona['Email']=info['Email']
  final_persona['Mobile Numbers']=info['Mobile Numbers']
  final_persona['Technical Skills']=', '.join(info['Skills']['Technical Expertise'])
  final_persona['Skills']=', '.join(info['Skills']['Interpersonal Abilities'])
  final_persona['Year of Experience']=info['Total Years of Experiences']
  final_persona['College Name']=', '.join(info['College Names'])
  final_persona['Degrees']=', '.join(info['Degrees'])
  final_persona['Designations']=', '.join(info['Designations'])
  final_persona['Last Company Names']=info['Last Company Names']
  return final_persona

In [11]:
cv_text = """
Personal Information
Name: Arjun Patel
Gender: Male
Nationality: Indian
Contact Information:
Email: arjun.patel@example.com
Phone: +91 98200 12345
LinkedIn: linkedin.com/in/arjunpatel
Education
Master of Science in Artificial Intelligence, Indian Institute of Technology Bombay (IIT Bombay), Mumbai, India, 2017
Bachelor of Engineering in Computer Science, National Institute of Technology Karnataka (NITK), Surathkal, India, 2015
Skills
Technical Expertise:
Natural Language Processing & Large Language Models: Exhibits top-tier proficiency in architecting and executing NLP applications leveraging advanced transformer technologies such as BERT, GPT-3, and T5. Adept in harnessing TensorFlow and PyTorch frameworks for model training and inference, facilitating breakthroughs in text processing and generation tasks.
Machine Learning & Artificial Intelligence: Possesses a comprehensive grasp of core machine learning principles, adept at preprocessing data and applying sophisticated deep learning strategies to solve intricate NLP challenges. This includes a strategic approach to algorithm selection, model tuning, and leveraging AI to enhance linguistic analysis.
Programming and Development: Expert-level command of Python, fortified by solid programming skills in Java and C++, enabling the development of robust, efficient software solutions. Proficient in integrating a wide array of NLP libraries including NLTK, spaCy, and Hugging Face's Transformers to enrich AI applications with natural language understanding and generation capabilities.
DevOps & Cloud Solutions: Skilled in the deployment and management of NLP solutions within cloud environments such as AWS and Azure, employing Docker and Kubernetes for effective containerization and orchestration. This expertise ensures scalable, resilient AI system architectures capable of handling expansive datasets and intensive computational tasks.
Interpersonal Abilities:
Leadership in AI Project Execution: Proven track record of steering AI projects to success, adept at orchestrating project timelines and marshaling cross-disciplinary teams towards achieving ambitious technological objectives.
Data-Driven Problem Solving: Outstanding analytical abilities, specializing in untangling complex technical problems through methodical data-driven approaches. This skill is pivotal in optimizing AI systems for peak performance and innovation.
Communication & Team Dynamics: Distinguished by the ability to clearly convey complex AI and NLP concepts across varied audiences, enhancing stakeholder understanding and project alignment. Committed to nurturing a collaborative work environment, promoting synergy and shared success among team members.
Professional Experience
Senior NLP/LLMs Developer, AI Innovations Lab, Bangalore, India, January 2018 - Present
NLP Engineer, Tech Solutions Pvt. Ltd., Hyderabad, India, June 2015 - December 2017
"""

fin_info=persona_extraction(cv_text)

Candidate Name: Arjun Patel
Gender
Gender: Male
Nationality: 
Email: 
Mobile Numbers: India
Skills:
Technical Expertise: Technical Expertise, Natural Language Processing & Large Language Models, Machine Learning & Artificial Intelligence, Programming and Development, DevOps & Cloud Solutions
Interpersonal Abilities: Leadership in AI Project Execution, Data-Driven Problem Solving, Communication & Team Dynamics
Total Years of Experiences: 8.83 years
College Names: Indian Institute of Technology Bombay, National Institute of Technology Karnataka
Degrees: 
Personal Information
Name: Arjun Patel
Gender: Male
Nationality: Indian
Contact Information:
Email: arjun.patel@example.com
Phone: +91 98200 12345
LinkedIn: linkedin.com/in/arjunpatel
Education
Master of Science in Artificial Intelligence, Indian Institute of Technology Bombay (IIT Bombay), Mumbai, India, 2017
Bachelor of Engineering in Computer Science, National Institute of Technology Karnataka (NITK), Surathkal, India, 2015
Skills
Tec

In [12]:
import spacy
from dateutil.relativedelta import relativedelta
from datetime import datetime
import re
# Load the pre-trained NLP model
nlp = spacy.load("en_core_web_sm")

# Example CV text (replace with your actual CV content)
cv_text = """
Personal Information
Name: Arjun Patel
Gender: Male
Nationality: Indian
Contact Information:
Email: arjun.patel@example.com
Phone: +91 98200 12345
LinkedIn: linkedin.com/in/arjunpatel
Education
Master of Science in Artificial Intelligence, Indian Institute of Technology Bombay (IIT Bombay), Mumbai, India, 2017
Bachelor of Engineering in Computer Science, National Institute of Technology Karnataka (NITK), Surathkal, India, 2015
Skills
Technical Expertise:
Natural Language Processing & Large Language Models: Exhibits top-tier proficiency in architecting and executing NLP applications leveraging advanced transformer technologies such as BERT, GPT-3, and T5. Adept in harnessing TensorFlow and PyTorch frameworks for model training and inference, facilitating breakthroughs in text processing and generation tasks.
Machine Learning & Artificial Intelligence: Possesses a comprehensive grasp of core machine learning principles, adept at preprocessing data and applying sophisticated deep learning strategies to solve intricate NLP challenges. This includes a strategic approach to algorithm selection, model tuning, and leveraging AI to enhance linguistic analysis.
Programming and Development: Expert-level command of Python, fortified by solid programming skills in Java and C++, enabling the development of robust, efficient software solutions. Proficient in integrating a wide array of NLP libraries including NLTK, spaCy, and Hugging Face's Transformers to enrich AI applications with natural language understanding and generation capabilities.
DevOps & Cloud Solutions: Skilled in the deployment and management of NLP solutions within cloud environments such as AWS and Azure, employing Docker and Kubernetes for effective containerization and orchestration. This expertise ensures scalable, resilient AI system architectures capable of handling expansive datasets and intensive computational tasks.
Interpersonal Abilities:
Leadership in AI Project Execution: Proven track record of steering AI projects to success, adept at orchestrating project timelines and marshaling cross-disciplinary teams towards achieving ambitious technological objectives.
Data-Driven Problem Solving: Outstanding analytical abilities, specializing in untangling complex technical problems through methodical data-driven approaches. This skill is pivotal in optimizing AI systems for peak performance and innovation.
Communication & Team Dynamics: Distinguished by the ability to clearly convey complex AI and NLP concepts across varied audiences, enhancing stakeholder understanding and project alignment. Committed to nurturing a collaborative work environment, promoting synergy and shared success among team members.
Professional Experience
Senior NLP/LLMs Developer, AI Innovations Lab, Bangalore, India, January 2018 - Present
NLP Engineer, Tech Solutions Pvt. Ltd., Hyderabad, India, June 2015 - December 2017
"""


In [13]:
"""Candidate Name": "",
      "Gender": "",
      "Nationality": "",
      "Email": "",
      "Mobile Numbers": "",
      "Skills": {"Technical Expertise": [], "Interpersonal Abilities": []},
      "Total Years of Experiences": "",
      "College Names": [],
      "Degrees": [],
      "Designations": [],
      "Last Company Names": """

'Candidate Name": "",\n      "Gender": "",\n      "Nationality": "",\n      "Email": "",\n      "Mobile Numbers": "",\n      "Skills": {"Technical Expertise": [], "Interpersonal Abilities": []},\n      "Total Years of Experiences": "",\n      "College Names": [],\n      "Degrees": [],\n      "Designations": [],\n      "Last Company Names": '

In [14]:
import re

def extract_candidate_name(cv_text):
  match = re.search(r"Name: (.*)", cv_text)
  if match:
    return match.group(1).strip()
  else:
    return ""


In [15]:
extract_candidate_name(cv_text)

'Arjun Patel'

In [16]:
def extract_gender(cv_text):
  match = re.search(r"Gender: (.*)", cv_text)
  if match:
    return match.group(1).strip()
  else:
    return ""

In [17]:
extract_gender(cv_text)

'Male'

In [18]:
def extract_nationality(cv_text):

  match = re.search(r"Nationality: (.*)", cv_text)
  if match:
    return match.group(1).strip()
  else:
    return ""

In [19]:
extract_nationality(cv_text)

'Indian'
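
The three extractors above differ only in the label they search for, so they could be folded into one helper. The following is a minimal sketch, assuming every field sits on its own line in "Label: value" form; `extract_field` is a hypothetical name, not part of the notebook.

```python
import re

def extract_field(label, text):
    """Return the value that follows 'label:' at the start of a line, or '' if absent."""
    # ^ with re.MULTILINE anchors at a line start, so "Name" does not match "Last Name:"
    match = re.search(rf"^{re.escape(label)}:[ \t]*(.*)$", text, re.MULTILINE)
    return match.group(1).strip() if match else ""

sample = "Name: Arjun Patel\nGender: Male\nNationality: Indian\n"
print({label: extract_field(label, sample) for label in ("Name", "Gender", "Nationality")})
# {'Name': 'Arjun Patel', 'Gender': 'Male', 'Nationality': 'Indian'}
```

Anchoring at the line start is the main difference from the versions above, which would also match a label that appears in the middle of a line.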

In [20]:
def extract_email(cv_text):
  emails = []
  for line in cv_text.splitlines():
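    # re.IGNORECASE lets the lowercase character classes also accept addresses written with capitals
    match = re.search(r"[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}", line, re.IGNORECASE)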
    if match:
      emails.append(match.group().strip())
  return emails


In [21]:
extract_email(cv_text)

['arjun.patel@example.com']

In [22]:
def extract_phone_number(cv_text):

  phone_numbers = []
  for line in cv_text.splitlines():
    match = re.search(r"\+?\d+[\s.-]\d+[\s.-]\d+", line)  # Accepts various phone number formats
    if match:
      phone_numbers.append(match.group().strip())
  return phone_numbers

In [23]:
extract_phone_number(cv_text)

['+91 98200 12345']

In [24]:

def extract_linkedin(cv_text):

  linkedin_urls = []

  # Regular-expression match; the handle is limited to word characters and hyphens
  # so that trailing text on the same line is not captured
  matches = re.findall(r"linkedin\.com/in/([\w-]+)", cv_text, re.IGNORECASE)
  linkedin_urls.extend([f"https://www.linkedin.com/in/{match}" for match in matches])
  return linkedin_urls

In [25]:
extract_linkedin(cv_text)

['https://www.linkedin.com/in/arjunpatel']

In [26]:
def extract_education(cv_text):

  education_details = []

  # Regular expression for degrees (can be adjusted)
  degree_regex = r"(?:(Bachelors?|Masters?|PhDs?) ?in ?(.+?)(?:, )?(?:(.+?))? ?\d+)"

  for line in cv_text.splitlines():
    match = re.search(degree_regex, line, re.IGNORECASE)
    if match:
      degree = match.group(1).strip()
      institution = match.group(2).strip()
      year = match.group(3).strip() if match.group(3) else None  # Optional year
      education_details.append({"Degree": degree, "Institution": institution, "Year": year})

  # Heuristic check for universities/colleges (might capture irrelevant text)
  for line in cv_text.splitlines():
    if any(word in line.lower() for word in ["university", "college", "institute"]):
      potential_institution = line.strip()
      # You can add logic here to filter out irrelevant lines (optional)
      education_details.append({"Institution": potential_institution})

  return education_details

In [27]:
extract_education(cv_text)

[{'Institution': 'Master of Science in Artificial Intelligence, Indian Institute of Technology Bombay (IIT Bombay), Mumbai, India, 2017'},
 {'Institution': 'Bachelor of Engineering in Computer Science, National Institute of Technology Karnataka (NITK), Surathkal, India, 2015'}]
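
Only the institution heuristic fired above: the degree regex expects "in" directly after the degree word, while the sample lines read "Master of Science in ...". Because the education lines are comma-separated, splitting on commas may be simpler than a single regex. The sketch below assumes the layout "Degree, Institution, ..., Year"; `DEGREE_START` and `parse_education_line` are hypothetical names.

```python
import re

# A line counts as an education entry only if it starts with a degree word
DEGREE_START = re.compile(r"^(Bachelors?|Masters?|Doctor|PhD)\b", re.IGNORECASE)

def parse_education_line(line):
    """Split 'Degree, Institution, ..., Year' into a dict, or return None for other lines."""
    line = line.strip()
    if not DEGREE_START.match(line):
        return None
    parts = [part.strip() for part in line.split(",")]
    if len(parts) < 2:
        return None
    # The year is optional: keep it only when the last field is exactly four digits
    year = parts[-1] if re.fullmatch(r"\d{4}", parts[-1]) else None
    return {"Degree": parts[0], "Institution": parts[1], "Year": year}

sample = "Master of Science in Artificial Intelligence, Indian Institute of Technology Bombay (IIT Bombay), Mumbai, India, 2017"
print(parse_education_line(sample))
# {'Degree': 'Master of Science in Artificial Intelligence', 'Institution': 'Indian Institute of Technology Bombay (IIT Bombay)', 'Year': '2017'}
```

This breaks if a degree or institution name itself contains a comma, so it is a starting point for this CV layout rather than a general parser.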

In [28]:

import re

def extract_degrees(cv_text):
  degrees = []

  # Regular expression for degrees (can be adjusted)
  degree_regex = r"(?:(Bachelors?|Masters?|PhDs?) ?(?:of|in) ?(.+?))"

  for line in cv_text.splitlines():
    match = re.search(degree_regex, line, re.IGNORECASE)
    if match:
      degree = match.group(1).strip() + " in " + match.group(2).strip()
      degrees.append(degree)
  return degrees


In [29]:
extract_degrees(cv_text)

['Master in S', 'Bachelor in E']
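
The truncated results come from the lazy group `(.+?)` at the end of the pattern: with nothing after it, a lazy quantifier stops after a single character. One way to capture the whole title is to read up to the first comma instead. A minimal sketch, assuming the degree title ends at a comma or at the end of the line (`DEGREE_RE` and `extract_degree_titles` are hypothetical names):

```python
import re

# Capture "<level> of|in <field>" up to the first comma or end of line
DEGREE_RE = re.compile(r"\b(Bachelor|Master|Doctor)s?\s+(?:of|in)\s+([^,\n]+)", re.IGNORECASE)

def extract_degree_titles(text):
    """Return full degree titles such as 'Master of Science in Artificial Intelligence'."""
    return [match.group(0).strip() for match in DEGREE_RE.finditer(text)]

sample = """Master of Science in Artificial Intelligence, Indian Institute of Technology Bombay (IIT Bombay), Mumbai, India, 2017
Bachelor of Engineering in Computer Science, National Institute of Technology Karnataka (NITK), Surathkal, India, 2015"""

print(extract_degree_titles(sample))
# ['Master of Science in Artificial Intelligence', 'Bachelor of Engineering in Computer Science']
```

Returning `match.group(0)` keeps the original wording ("of" or "in") instead of rebuilding the title with a fixed " in ".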

In [30]:
import spacy

nlp = spacy.load("en_core_web_sm")  # Load the NER model

def extract_degrees_ner(cv_text):
    doc = nlp(cv_text)
    degrees = []
    for ent in doc.ents:
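        # Note: en_core_web_sm defines no "DEGREE" label, and an entity's text is rarely a bare
        # "bachelor" or "master", so this check usually matches nothing
        if ent.label_ == "DEGREE" or ent.text.lower() in ["bachelor", "master", "phd", "doctorate"]: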
            degrees.append(ent.text)
    return degrees


In [31]:
extract_degrees_ner(cv_text)

[]

In [32]:
import spacy

nlp = spacy.load("en_core_web_sm")  # Load the NER model

# List of popular global degrees (can be significantly expanded)
global_degrees = [
    "B.Tech", "M.Tech", "BE", "ME", "BCA", "MCA", "BSc", "MSc", "BCom", "MCom",
    "BBA", "MBA", "PhD", "MPhil", "MPHil", "LLB", "LLM", "BArch", "MArch", "MBBS",
    "BDS", "BPharma", "MPharma", "JD", "EdD", "PsyD", "DDS", "DVM", "MD", "DO",
    "BSN", "MSN", "DNAP", "CRNA", "APN", "NP", "PA", "OT", "PT", "SLP",
    "ABA", "MFA", "MPA", "MPH", "MEd", "MSW", "MDiv", "MTh", "JD", "LLM"
]

def extract_degrees(cv_text):
  degrees = []

  # NER for degree entities (en_core_web_sm defines no "DEGREE" label, so this loop only helps with a custom-trained NER component)
  doc = nlp(cv_text)
  for ent in doc.ents:
      if ent.label_ == "DEGREE":
          degrees.append(ent.text)

  # Heuristics with global degree list (be cautious of false positives)
  for line in cv_text.splitlines():
      for degree in global_degrees:
          if degree in line and not any(word in line for word in ["http", "https"]):  # Avoid URLs
              degrees.append(degree)
  return degrees

In [33]:
extract_degrees(cv_text)

['BE', 'PT', 'LLM', 'LLM']
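
All four hits are false positives from substring matching: 'BE' comes from "BERT", 'PT' from "GPT-3", and 'LLM' from "LLMs", reported twice because the list contains "LLM" twice. Requiring a non-alphanumeric character on both sides and de-duplicating the list removes these. The sketch below uses a small illustrative subset of the list, and the second sample string is made up for demonstration; `DEGREE_ABBREVIATIONS` and `find_degree_abbreviations` are hypothetical names.

```python
import re

# Illustrative subset; dict.fromkeys drops duplicates while keeping the original order
DEGREE_ABBREVIATIONS = list(dict.fromkeys(
    ["B.Tech", "M.Tech", "BE", "MSc", "MBA", "PhD", "PT", "LLM", "LLM"]
))

def find_degree_abbreviations(text, abbreviations=DEGREE_ABBREVIATIONS):
    """Return abbreviations that appear in text as standalone tokens."""
    found = []
    for abbr in abbreviations:
        # Lookarounds demand a non-alphanumeric neighbour (or the string edge) on both sides
        pattern = r"(?<![A-Za-z0-9])" + re.escape(abbr) + r"(?![A-Za-z0-9])"
        if re.search(pattern, text):
            found.append(abbr)
    return found

print(find_degree_abbreviations("Transformer models such as BERT, GPT-3, and T5; Senior NLP/LLMs Developer"))
# []
print(find_degree_abbreviations("MBA, IIM Ahmedabad, 2019; B.Tech, NIT Trichy, 2016"))
# ['B.Tech', 'MBA']
```

Boundary checks cannot resolve a genuinely ambiguous token (a standalone "LLM" could mean Master of Laws or large language model), so restricting the search to the Education section is probably still worthwhile.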

In [34]:
def extract_designation(cv_text):
  designation = ""
  in_experience_section = False

  for line in cv_text.splitlines():
    line = line.strip()  # Remove leading/trailing whitespaces

    # Check if entering the "Professional Experience" section
    if line.lower() == "professional experience":
      in_experience_section = True
      continue  # Skip this line

    # Extract designation if within "Professional Experience" section
    if in_experience_section:
      # Assuming the designation is everything before the first comma
      designation = line.split(",", 1)[0].strip()
      break  # Stop after finding the first designation

  return designation


In [35]:
extract_designation(cv_text)

'Senior NLP/LLMs Developer'

In [36]:
def extract_last_company(cv_text):
  last_company = ""
  in_experience_section = False

  for line in cv_text.splitlines():
    line = line.strip()  # Remove leading/trailing whitespaces

    # Check if entering the "Professional Experience" section
    if line.lower() == "professional experience":
      in_experience_section = True
      continue  # Skip this line

    # Capture company name if within "Professional Experience" section
    if in_experience_section:
      # Take the last comma-separated field, then the part after the final " - ".
      # On a line that ends in a date range (e.g. "... - Present") this yields
      # the end date rather than the company.
      parts = line.split(",")[-1].strip().split(" - ")
      last_company = parts[-1]
      break  # Only the first line of the section is examined

  return last_company



In [37]:
extract_last_company(cv_text)

'Present'
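
The function returns 'Present' because it keeps the tail of the line, which here is the end of a date range. The sketch below shows one way to read both fields from the same line. It rests on an assumption the notebook does not confirm: that the first entry under "Professional Experience" has the shape "Title, Company, Location, Start - End". The function name `extract_title_and_company` is hypothetical.

```python
def extract_title_and_company(cv_text):
    """Return (title, company) from the first entry under 'Professional Experience'.

    Assumes (hypothetically) a line such as
    'Senior Data Scientist, Example Corp, Bangalore, Jan 2018 - Present'.
    """
    lines = [line.strip() for line in cv_text.splitlines()]
    for i, line in enumerate(lines):
        if line.lower() == "professional experience":
            # Use the first non-empty line after the heading
            for entry in lines[i + 1:]:
                if entry:
                    fields = [field.strip() for field in entry.split(",")]
                    title = fields[0]
                    company = fields[1] if len(fields) > 1 else ""
                    return title, company
    return "", ""

example = (
    "Professional Experience\n"
    "Senior Data Scientist, Example Corp, Bangalore, Jan 2018 - Present\n"
)
print(extract_title_and_company(example))
# ('Senior Data Scientist', 'Example Corp')
```

Unlike the original cell, this version skips blank lines directly after the heading; resumes that put the company on its own line would need a different rule.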

In [38]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [39]:
!pip install python-docx

Collecting python-docx
  Downloading python_docx-1.1.2-py3-none-any.whl (244 kB)
Installing collected packages: python-docx
Successfully installed python-docx-1.1.2


In [40]:
import pandas as pd
from pathlib import Path

from docx import Document  # provided by the python-docx package


In [41]:
df=pd.DataFrame(columns=['Name','Email','Gender','Phone_Number','Nationality','LinkedIn','Education','Degrees','Designation','Last_Company','Skills','Path'])
df.head()

Unnamed: 0,Name,Email,Gender,Phone_Number,Nationality,LinkedIn,Education,Degrees,Designation,Last_Company,Skills,Path


In [42]:

path = '/content/gdrive/MyDrive/Resume_bulk'
indexer = 0
for f in Path(path).iterdir():  # sub-folders such as DS_AI_ML
  for sf in Path(f).iterdir():
    print(sf)
    if not str(sf).endswith('.docx'):
      continue  # skip anything that is not a Word document

    # Join all paragraphs into one newline-separated string
    document = Document(sf)
    doc = ''.join(paragraph.text + '\n' for paragraph in document.paragraphs)

    # Run every extractor on the resume text
    name = extract_candidate_name(doc)
    email = extract_email(doc)
    gender = extract_gender(doc)
    phn_num = extract_phone_number(doc)
    nationality = extract_nationality(doc)
    degrees = extract_degrees(doc)
    education = extract_education(doc)
    designation = extract_designation(doc)
    linkedin = extract_linkedin(doc)
    last_company = extract_last_company(doc)
    skills = extracting_skills(doc)

    # Quick visual check of the extracted fields
    print(name)
    print(email)
    print(gender)
    print(phn_num)
    print(nationality)
    print(degrees)
    print(education)
    print(sorted(skills))

    # Store one row per resume; for list-valued fields keep the first hit only
    df.at[indexer, 'Name'] = name
    df.at[indexer, 'Email'] = email[0] if len(email) > 0 else ''
    df.at[indexer, 'Gender'] = gender
    df.at[indexer, 'Phone_Number'] = phn_num[0] if len(phn_num) > 0 else ''
    df.at[indexer, 'Nationality'] = nationality
    df.at[indexer, 'LinkedIn'] = linkedin[0] if len(linkedin) > 0 else ''
    df.at[indexer, 'Degrees'] = degrees
    df.at[indexer, 'Designation'] = designation
    df.at[indexer, 'Last_Company'] = last_company
    df.at[indexer, 'Skills'] = skills
    df.at[indexer, 'Education'] = education
    df.at[indexer, 'Path'] = sf
    indexer += 1

/content/gdrive/MyDrive/Resume_bulk/DS_AI_ML/DS_AIML_1.docx
Rohit Sharma
['rohit.sharma@example.com']
Male
['+91 98765 43210']
Indian
[]
[{'Institution': 'Master of Technology in Data Science, Indian Institute of Technology (IIT), New Delhi, India, 2015'}, {'Institution': 'Bachelor of Engineering in Computer Science, National Institute of Technology (NIT), Surathkal, India, 2013'}]
['2013\nSkills\nData Science', 'AI', 'AI-ML Engineer', 'AI/ML', 'AWS SageMaker', 'Azure', 'Bachelor of Engineering', 'Bangalore', 'Big Data Technologies', 'CI', 'Certified Data Scientist', 'Data Science', 'Data Science Council', 'Data Science Council of America', 'Data Scientist', 'DevOps', 'Education', 'Fraud Detection System', 'GCP', 'GitHub', 'Google AI Platform', 'Hadoop', 'IIT', 'India', 'Indian Institute of Technology', 'Infosys', 'Keras', 'Kubernetes', 'Lead data science projects', 'Leadership & Collaboration', 'LinkedIn', 'ML', 'Machine Learning & AI: Proficient', 'Master of Technology', 'Mumbai', 'N
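
The nested `iterdir()` loop above assumes every entry directly under the root is a folder; a stray file at the top level would raise an error on the inner `iterdir()` call. A minimal sketch of a more tolerant file listing using `Path.rglob` follows. The helper name `find_docx_files` is hypothetical, and skipping names that start with "~$" (Word's temporary lock files) is an added assumption about what might be in the folder.

```python
from pathlib import Path

def find_docx_files(root):
    """Return all .docx files under root, at any depth, in sorted order."""
    return sorted(
        p for p in Path(root).rglob("*.docx")
        if p.is_file() and not p.name.startswith("~$")  # skip Word lock files
    )

# Usage (path taken from the cell above):
# for sf in find_docx_files('/content/gdrive/MyDrive/Resume_bulk'):
#     print(sf)
```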

In [44]:
df

Unnamed: 0,Name,Email,Gender,Phone_Number,Nationality,LinkedIn,Education,Degrees,Designation,Last_Company,Skills,Path
0,Rohit Sharma,rohit.sharma@example.com,Male,+91 98765 43210,Indian,https://www.linkedin.com/in/rohitsharma-ai,[{'Institution': 'Master of Technology in Data...,[],Senior Data Scientist,Present,"{GitHub, Fraud Detection System, NIT, data mod...",/content/gdrive/MyDrive/Resume_bulk/DS_AI_ML/D...
1,Aisha Al-Farsi,aisha.alfarsi@example.com,Female,+966 50 123,Saudi,https://www.linkedin.com/in/aishaalfarsi-ai,[{'Institution': 'Master of Science in Data Sc...,[],Senior Data Scientist,Present,"{GitHub, Thuwal, a predictive maintenance mode...",/content/gdrive/MyDrive/Resume_bulk/DS_AI_ML/D...
2,Dr. Laila Al-Thani,laila.althani@example.com,Female,+974 5550 0102,Qatari,https://www.linkedin.com/in/lailaalthanidatasc...,[{'Institution': 'Ph.D. in Computer Science (D...,[],Senior Data Scientist,Present,"{data analysis, SQL, complex data projects, Le...",/content/gdrive/MyDrive/Resume_bulk/DS_AI_ML/S...
3,Aarav Singh,aarav.singh@example.com,Male,+91 90000 12345,Indian,https://www.linkedin.com/in/aaravsinghbigdata,[{'Institution': 'Master of Technology (M.Tech...,"[M.Tech, B.Tech]",Senior Big Data Engineer,Present,"{robust big data solutions, NIT, a big data pl...",/content/gdrive/MyDrive/Resume_bulk/DS_AI_ML/2...
4,Sofia Martinez,sofia.martinez@example.com,Female,+52 55 1234,Mexican,https://www.linkedin.com/in/sofiamartinezmlops,"[{'Institution': 'MSc in Data Science, Massach...","[MSc, BSc]",Senior MLOps Engineer,Present,"{data analysis, Communication & Collaboration:...",/content/gdrive/MyDrive/Resume_bulk/DS_AI_ML/M...
...,...,...,...,...,...,...,...,...,...,...,...,...
295,Amira Al-Mansoori,amira.almansoori@example.com,Female,+971 50 123,Emirati,https://www.linkedin.com/in/amiraalmansoori,[{'Institution': 'Master of Science in Compute...,[],Senior Full Stack Developer,Present,"{optimal data structure configuration, data, K...",/content/gdrive/MyDrive/Resume_bulk/Software_d...
296,Rahul Gupta,rahul.gupta@example.com,Male,+91 8800 123,Indian,https://www.linkedin.com/in/rahulguptadev,[{'Institution': 'Bachelor of Technology in Co...,[],Full Stack Developer,Present,"{GitHub, Front-End Development: Proficient, Pr...",/content/gdrive/MyDrive/Resume_bulk/Software_d...
297,Linh Nguyen,linh.nguyen@example.com,Female,+84 912 345,Vietnamese,https://www.linkedin.com/in/linhnguyendev,"[{'Degree': 'PhD', 'Institution': 'C', 'Year':...",[PhD],Senior Full Stack Developer,Present,"{Tokyo, optimal data structure configuration, ...",/content/gdrive/MyDrive/Resume_bulk/Software_d...
298,Anika Patel,anika.patel@example.com,Female,+91 98765 43210,Indian,https://www.linkedin.com/in/anikapateldev,"[{'Degree': 'PhD', 'Institution': 'C', 'Year':...",[PhD],Senior Full Stack Developer,Present,"{NIT, quick data access, SQL, JavaScript, Tech...",/content/gdrive/MyDrive/Resume_bulk/Software_d...


In [49]:
All_skills = []
for ind, d in df.iterrows():
  skills = d.Skills
  print(skills)
  All_skills.extend(skills)  # flatten every resume's skills into one list

{'GitHub', 'Fraud Detection System', 'NIT', 'data modeling', 'Google AI Platform', 'SQL', 'data science projects', 'a predictive maintenance model', 'data science', 'DevOps', 'healthcare', 'big data technologies', 'Data Science Council of America', 'linkedin.com/in/rohitsharma-ai', 'junior data scientists', 'machine learning models', 'ML', 'large datasets', 'predictive modeling', 'Keras', 'Kubernetes', '2013\nSkills\nData Science', 'Indian Institute of Technology', 'Python', 'AI/ML', 'Data Science Council', 'TCS', 'AI-ML Engineer', 'Data Science', 'AWS SageMaker', 'Lead data science projects', 'GCP', 'Machine Learning & AI: Proficient', 'CI', 'Hadoop', 'New Delhi', 'Master of Technology', 'LinkedIn', 'Surathkal', 'Data Scientist', 'a machine learning model', 'AI', 'Azure', 'Education', 'Infosys', 'model accuracy', 'Professional Experience\nSenior Data Scientist', 'National Institute of Technology', 'Bangalore', 'Leadership & Collaboration', 'India', 'Bachelor of Engineering', 'Certifie
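
The extracted sets mix real skills with noise: spans that straddle section boundaries (they contain a newline, e.g. '2013\nSkills\nData Science') and profile links. A minimal sketch of a filter-and-count step is shown below, assuming those two rules are an acceptable first pass. The helpers `clean_skills` and `count_skills` are hypothetical and not part of the notebook.

```python
from collections import Counter

def clean_skills(raw_skills):
    """Drop obviously noisy entries from one resume's skills."""
    cleaned = []
    for skill in raw_skills:
        text = skill.strip()
        if not text or "\n" in text:
            continue  # empty, or a span that crosses a section boundary
        lowered = text.lower()
        if "linkedin.com" in lowered or "http" in lowered:
            continue  # profile links are not skills
        cleaned.append(text)
    return cleaned

def count_skills(skill_lists):
    """Count in how many resumes each skill appears (case-insensitive)."""
    counts = Counter()
    for skills in skill_lists:
        counts.update({s.casefold() for s in clean_skills(skills)})
    return counts

# Usage with the DataFrame built above:
# count_skills(df['Skills']).most_common(20)
```

Place names and employers (for example 'New Delhi' or 'Infosys') pass this filter, so a curated skill vocabulary would still be needed for clean results.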

In [59]:
education = []
for ind, d in df.iterrows():
  educ = d.Education
  print(educ)
  education.extend(ed['Institution'] for ed in educ)  # collect every education entry

[{'Institution': 'Master of Technology in Data Science, Indian Institute of Technology (IIT), New Delhi, India, 2015'}, {'Institution': 'Bachelor of Engineering in Computer Science, National Institute of Technology (NIT), Surathkal, India, 2013'}]
[{'Institution': 'Master of Science in Data Science, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia, 2015'}, {'Institution': 'Bachelor of Science in Computer Science, King Saud University, Riyadh, Saudi Arabia, 2013'}, {'Institution': 'Advanced Machine Learning Specialization, Coursera (offered by National Research University Higher School of Economics), 2017'}]
[{'Institution': 'Ph.D. in Computer Science (Data Science), Qatar University, Doha, Qatar, 2020'}, {'Institution': 'Master of Science in Data Analytics, Qatar University, Doha, Qatar, 2016'}, {'Institution': 'Bachelor of Science in Computer Science, Qatar University, Doha, Qatar, 2014'}, {'Institution': 'Initiated and led a collaborative research proj

In [60]:
education

['Master of Technology in Data Science, Indian Institute of Technology (IIT), New Delhi, India, 2015',
 'Bachelor of Engineering in Computer Science, National Institute of Technology (NIT), Surathkal, India, 2013',
 'Master of Science in Data Science, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia, 2015',
 'Bachelor of Science in Computer Science, King Saud University, Riyadh, Saudi Arabia, 2013',
 'Advanced Machine Learning Specialization, Coursera (offered by National Research University Higher School of Economics), 2017',
 'Ph.D. in Computer Science (Data Science), Qatar University, Doha, Qatar, 2020',
 'Master of Science in Data Analytics, Qatar University, Doha, Qatar, 2016',
 'Bachelor of Science in Computer Science, Qatar University, Doha, Qatar, 2014',
 'Initiated and led a collaborative research project with a leading university, resulting in two peer-reviewed publications on advanced analytics techniques.',
 'Research Assistant, Qatar Univers

In [51]:
def save_list_to_text(data, filename):
  """Write each item of data to filename, one item per line."""
  with open(filename, 'w', encoding='utf-8') as f:
    for item in data:
      f.write(str(item) + '\n')

In [61]:
filename = "education.txt"
save_list_to_text(education,filename)

In [None]:
# df.to_csv('/content/resume_features.csv',index=False)

In [68]:

# Convert each row's skills from a set to a list
for ind, d in df.iterrows():
  skills = list(d.Skills)
  df.at[ind, 'Skills'] = skills
  print(skills)

['GitHub', 'Fraud Detection System', 'NIT', 'data modeling', 'Google AI Platform', 'SQL', 'data science projects', 'a predictive maintenance model', 'data science', 'DevOps', 'healthcare', 'big data technologies', 'Data Science Council of America', 'linkedin.com/in/rohitsharma-ai', 'junior data scientists', 'machine learning models', 'ML', 'large datasets', 'predictive modeling', 'Keras', 'Kubernetes', '2013\nSkills\nData Science', 'Indian Institute of Technology', 'Python', 'AI/ML', 'Data Science Council', 'TCS', 'AI-ML Engineer', 'Data Science', 'AWS SageMaker', 'Lead data science projects', 'GCP', 'Machine Learning & AI: Proficient', 'CI', 'Hadoop', 'New Delhi', 'Master of Technology', 'LinkedIn', 'Surathkal', 'Data Scientist', 'a machine learning model', 'AI', 'Azure', 'Education', 'Infosys', 'model accuracy', 'Professional Experience\nSenior Data Scientist', 'National Institute of Technology', 'Bangalore', 'Leadership & Collaboration', 'India', 'Bachelor of Engineering', 'Certifie

In [69]:
df.head()

Unnamed: 0,Name,Email,Gender,Phone_Number,Nationality,LinkedIn,Education,Degrees,Designation,Last_Company,Skills,Path
0,Rohit Sharma,rohit.sharma@example.com,Male,+91 98765 43210,Indian,https://www.linkedin.com/in/rohitsharma-ai,[{'Institution': 'Master of Technology in Data...,[],Senior Data Scientist,Present,"[GitHub, Fraud Detection System, NIT, data mod...",/content/gdrive/MyDrive/Resume_bulk/DS_AI_ML/D...
1,Aisha Al-Farsi,aisha.alfarsi@example.com,Female,+966 50 123,Saudi,https://www.linkedin.com/in/aishaalfarsi-ai,[{'Institution': 'Master of Science in Data Sc...,[],Senior Data Scientist,Present,"[GitHub, Thuwal, a predictive maintenance mode...",/content/gdrive/MyDrive/Resume_bulk/DS_AI_ML/D...
2,Dr. Laila Al-Thani,laila.althani@example.com,Female,+974 5550 0102,Qatari,https://www.linkedin.com/in/lailaalthanidatasc...,[{'Institution': 'Ph.D. in Computer Science (D...,[],Senior Data Scientist,Present,"[data analysis, SQL, complex data projects, Le...",/content/gdrive/MyDrive/Resume_bulk/DS_AI_ML/S...
3,Aarav Singh,aarav.singh@example.com,Male,+91 90000 12345,Indian,https://www.linkedin.com/in/aaravsinghbigdata,[{'Institution': 'Master of Technology (M.Tech...,"[M.Tech, B.Tech]",Senior Big Data Engineer,Present,"[robust big data solutions, NIT, a big data pl...",/content/gdrive/MyDrive/Resume_bulk/DS_AI_ML/2...
4,Sofia Martinez,sofia.martinez@example.com,Female,+52 55 1234,Mexican,https://www.linkedin.com/in/sofiamartinezmlops,"[{'Institution': 'MSc in Data Science, Massach...","[MSc, BSc]",Senior MLOps Engineer,Present,"[data analysis, Communication & Collaboration:...",/content/gdrive/MyDrive/Resume_bulk/DS_AI_ML/M...


In [None]:
import pandas as pd

# Sample list mixing text with non-text items
your_list = ["This is data point 1", "Another data point with text", 42, 3.14]

# Keep only the string items
text_data = [item for item in your_list if isinstance(item, str)]

# Create a DataFrame with a single column named "Text"
# (named text_df so it does not overwrite the resume DataFrame df)
text_df = pd.DataFrame(text_data, columns=["Text"])

print(text_df)
