# 📜 Project: Job Description Analyzer – Extracting Required Skills from Job Postings


## 📌 Objective
Use spaCy’s Named Entity Recognition (NER) and NLTK preprocessing to extract and categorize required skills from job descriptions. The goal is to identify trends in job requirements and analyze the most in-demand skills across industries.

## 🛠️ Project Steps & Instructions


### Step 1: Load the Dataset
#### 📌 Dataset: A provided CSV file containing job descriptions from different industries (IT, Healthcare, Finance, Marketing, etc.).

1. Download the dataset (link below).
2. Load it into Python using Pandas.
3. View the first few rows to understand its structure.

In [36]:
import pandas as pd

# Load the dataset
dataset_url = "https://raw.githubusercontent.com/binoydutt/Resume-Job-Description-Matching/refs/heads/master/data.csv"
df = pd.read_csv(dataset_url)

# Display basic information about the dataset
print("Dataset Information:\n")
df.info()

# Display the first few rows to understand the structure
print("\nFirst 5 Rows of the Dataset:\n")
df.head()

Dataset Information:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157 entries, 0 to 156
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Unnamed: 0       157 non-null    int64 
 1   company          157 non-null    object
 2   position         157 non-null    object
 3   url              157 non-null    object
 4   location         157 non-null    object
 5   headquaters      157 non-null    object
 6   employees        154 non-null    object
 7   founded          154 non-null    object
 8   industry         154 non-null    object
 9   Job Description  157 non-null    object
dtypes: int64(1), object(9)
memory usage: 12.4+ KB

First 5 Rows of the Dataset:



| | Unnamed: 0 | company | position | url | location | headquaters | employees | founded | industry | Job Description |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Visual BI Solutions Inc | Graduate Intern (Summer 2017) - SAP BI / Big D... | https://www.glassdoor.com/partner/jobListing.h... | Plano, TX | Plano, TX | 51 to 200 employees | 2010 | Information Technology | Location: Plano, TX or Oklahoma City, OK Dura... |
| 1 | 2 | Jobvertise | Digital Marketing Manager | https://www.glassdoor.com/partner/jobListing.h... | Dallas, TX | Berlin, Germany | 1 to 50 employees | 2011 | Unknown | The Digital Marketing Manager is the front li... |
| 2 | 3 | Santander Consumer USA | Manager, Pricing Management Information Systems | https://www.glassdoor.com/partner/jobListing.h... | Dallas, TX | Dallas, TX | 5001 to 10000 employees | 1995 | Finance | Summary of Responsibilities:The Manager Prici... |
| 3 | 4 | Federal Reserve Bank of Dallas | Treasury Services Analyst Internship | https://www.glassdoor.com/partner/jobListing.h... | Dallas, TX | Dallas, TX | 1001 to 5000 employees | 1914 | Finance | ORGANIZATIONAL SUMMARY: As part of the nati... |
| 4 | 5 | Aviall | Intern, Sales Analyst | https://www.glassdoor.com/partner/jobListing.h... | Dallas, TX | Dallas, TX | 1001 to 5000 employees | Boeing | Subsidiary or Business Segment | Aviall is the world's largest provider of n... |
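
Two things in the output above are worth checking before moving on. The `Unnamed: 0` column looks like a row index that was saved into the CSV, and row 4 has `founded` set to "Boeing" and `industry` set to "Subsidiary or Business Segment", which suggests that some rows have shifted columns. The sketch below uses a small made-up CSV (not the real dataset) to show one way to read the first column as the index and to flag rows whose `founded` value is not a four-digit year. The helper name `flag_misaligned_rows` is hypothetical.

```python
import io
import pandas as pd

# Toy CSV standing in for the real file (values are illustrative only)
toy_csv = io.StringIO(
    ",company,founded,industry\n"
    "1,Company A,2010,Information Technology\n"
    "2,Company B,Boeing,Subsidiary or Business Segment\n"
    "3,Company C,1914,Finance\n"
)

# index_col=0 uses the first column as the index instead of creating "Unnamed: 0"
toy_df = pd.read_csv(toy_csv, index_col=0, dtype=str)

def flag_misaligned_rows(frame):
    """Return rows whose 'founded' value is not a four-digit year.

    Rows with a missing 'founded' value are flagged as well.
    """
    is_year = frame["founded"].fillna("").str.fullmatch(r"\d{4}")
    return frame[~is_year]

suspect_rows = flag_misaligned_rows(toy_df)
print(suspect_rows)
```

Flagged rows can then be inspected by hand or excluded before the per-industry analysis in Step 5.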


### Step 2: Preprocessing the Job Descriptions
#### 📌 Goal: Clean the text by removing stopwords, punctuation, and unnecessary characters.

1. Use NLTK to tokenize the descriptions.
2. Remove stopwords and special characters.
3. Convert text to lowercase for consistency.

In [37]:
# Ensure NLTK resources are available
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

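# Build the stopword set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

# Preprocessing function
def preprocess_text(text):
    """Lowercase, tokenize, and drop punctuation and English stopwords."""
    if pd.isnull(text):
        return ""
    tokens = word_tokenize(text.lower())  # Lowercase, then tokenize
    # Keep purely alphanumeric tokens. This removes punctuation, but it also
    # drops tokens that contain punctuation, such as "e-commerce".
    tokens = [word for word in tokens if word.isalnum()]
    tokens = [word for word in tokens if word not in STOP_WORDS]  # Remove stopwords
    return " ".join(tokens)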

# Apply preprocessing to the 'Job Description' column
df['Cleaned_Job_Description'] = df['Job Description'].apply(preprocess_text)

# Display the first few cleaned job descriptions
print("\nCleaned Job Descriptions:\n")
df[['Job Description', 'Cleaned_Job_Description']].head()


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!



Cleaned Job Descriptions:



| | Job Description | Cleaned_Job_Description |
|---|---|---|
| 0 | Location: Plano, TX or Oklahoma City, OK Dura... | location plano tx oklahoma city ok duration in... |
| 1 | The Digital Marketing Manager is the front li... | digital marketing manager front line patient c... |
| 2 | Summary of Responsibilities:The Manager Prici... | summary responsibilities manager pricing mis r... |
| 3 | ORGANIZATIONAL SUMMARY: As part of the nati... | organizational summary part nation central ban... |
| 4 | Aviall is the world's largest provider of n... | aviall world largest provider new aviation par... |
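
The `isalnum()` filter in `preprocess_text` removes punctuation tokens, but it also discards any token that contains a non-alphanumeric character, such as "e-commerce". That matters later, because several entries in the skill list contain punctuation. A minimal sketch of a gentler filter, assuming the tokens have already been produced by a tokenizer (the token list and both function names are made up for illustration):

```python
def strict_filter(tokens):
    """Keep only purely alphanumeric tokens (the behavior of str.isalnum)."""
    return [tok for tok in tokens if tok.isalnum()]

def lenient_filter(tokens):
    """Keep any token that contains at least one letter or digit."""
    return [tok for tok in tokens if any(ch.isalnum() for ch in tok)]

# Hypothetical tokens from a job description
tokens = ["experience", "with", "e-commerce", ",", "node.js", "and", "sql", "."]

strict_tokens = strict_filter(tokens)
lenient_tokens = lenient_filter(tokens)
print(strict_tokens)
print(lenient_tokens)
```

The lenient version still drops pure punctuation, so it could replace the `isalnum()` step if punctuated skill names need to survive preprocessing.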


### Step 3: Extract Skills Using Named Entity Recognition (NER)
#### 📌 Goal: Use spaCy’s built-in NER to detect and extract skills from job descriptions.

1. Load spaCy’s English model.
2. Use NER to identify important keywords.
3. Extract words related to technical skills, tools, and expertise.

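Running `df['industry'].value_counts()` shows which industries appear in the dataset, so we can prepare a list of common skills for each of them before extraction. Note that a few values in the output (for example the revenue ranges and "Company - Public") are not industries; they come from rows whose columns are misaligned in the source CSV.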
```
industry
Business Services                            23
Information Technology                       21
Accounting & Legal                           20
Finance                                      17
Media                                        12
Manufacturing                                11
Health Care                                  10
Subsidiary or Business Segment                8
Unknown                                       8
Insurance                                     5
Arts, Entertainment & Recreation              4
Retail                                        3
$500 million to $1 billion (USD) per year     2
Aerospace & Defense                           2
Non-Profit                                    1
Travel & Tourism                              1
Unknown / Non-Applicable per year             1
Real Estate                                   1
Transportation & Logistics                    1
Company - Public                              1
Mining & Metals                               1
Construction, Repair & Maintenance            1
Name: count, dtype: int64
```

In [38]:
# Define a flat set of common skills; the comments below group the entries by industry
common_skills = {
    # Information Technology skills
    "python", "java", "javascript", "sql", "c++", "c#", "ruby", "php", "html", "css",
    "aws", "azure", "gcp", "docker", "kubernetes", "git", "jenkins", "jira", "agile", "scrum",
    "machine learning", "deep learning", "ai", "artificial intelligence", "data science",
    "data analysis", "data mining", "data visualization", "big data", "hadoop", "spark",
    "tableau", "power bi", "api", "rest", "soap", "microservices", "cloud", "devops",
    "database", "mysql", "postgresql", "mongodb", "nosql", "react", "angular", "vue",
    "node.js", "typescript", "full stack", "front end", "back end", "testing", "qa",

    # Business Services skills
    "consulting", "crm", "salesforce", "project management", "client management",
    "process improvement", "six sigma", "lean", "business analysis", "strategic planning",
    "operations management", "stakeholder management", "sap", "erp", "change management",

    # Accounting & Legal skills
    "gaap", "ifrs", "tax", "audit", "compliance", "regulatory", "contract", "law",
    "litigation", "intellectual property", "patents", "trademarks", "financial reporting",
    "bookkeeping", "quickbooks", "tax preparation", "corporate law", "legal research",

    # Finance skills
    "financial analysis", "modeling", "forecasting", "budgeting", "investment", "portfolio",
    "risk management", "valuation", "banking", "equity", "debt", "capital markets",
    "derivatives", "financial planning", "bloomberg", "excel", "vba", "accounting",
    "m&a", "mergers", "acquisitions", "underwriting", "trading",

    # Media skills
    "content creation", "social media", "digital marketing", "seo", "sem", "adobe",
    "photoshop", "illustrator", "indesign", "premiere", "after effects", "copywriting",
    "content strategy", "brand management", "public relations", "journalism", "editing",

    # Manufacturing skills
    "supply chain", "inventory management", "logistics", "quality control", "lean manufacturing",
    "six sigma", "production planning", "cad", "cam", "solidworks", "autocad", "plc",
    "robotics", "automation", "iso", "kaizen", "kanban", "just in time", "jit",

    # Health Care skills
    "patient care", "electronic health records", "ehr", "emr", "hipaa", "clinical",
    "medical coding", "billing", "healthcare management", "nursing", "pharmaceuticals",
    "medical devices", "research", "telehealth", "epic", "cerner", "medicaid", "medicare",

    # Insurance skills
    "underwriting", "claims", "policy", "actuarial", "risk assessment", "insurance regulations",
    "reinsurance", "life insurance", "health insurance", "property insurance", "casualty",

    # Retail skills
    "merchandising", "inventory", "pos", "point of sale", "customer service", "e-commerce",
    "retail management", "sales", "buying", "demand planning", "visual merchandising",

    # General business skills
    "communication", "leadership", "teamwork", "problem solving", "critical thinking",
    "time management", "customer service", "negotiation", "presentation", "analytical",
    "microsoft office", "excel", "powerpoint", "word", "outlook"
}
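
The set above is flat: the comments group the entries visually, but the industry of each skill is not available to later code. If that information is wanted in Step 5, one option is to store the skills in a dictionary keyed by industry and derive the flat set from it. The sketch below uses a shortened subset of the skills and hypothetical names (`skills_by_industry`, `flat_skills`, `industries_for`):

```python
# Shortened, illustrative subset of the skills listed above
skills_by_industry = {
    "Information Technology": {"python", "sql", "aws", "docker"},
    "Finance": {"financial analysis", "excel", "underwriting"},
    "Insurance": {"underwriting", "claims", "actuarial"},
}

# Flat set for matching; skills shared by several industries appear once
flat_skills = set().union(*skills_by_industry.values())

def industries_for(skill):
    """Return the industries whose skill set contains the given skill."""
    return sorted(name for name, skills in skills_by_industry.items() if skill in skills)

print(len(flat_skills))
print(industries_for("underwriting"))
```

Because the flat set is derived from the dictionary, duplicated entries across industries collapse automatically.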

In [39]:
import spacy
# Load spaCy's English model
nlp = spacy.load("en_core_web_sm")

'''
# Function to extract skills using Named Entity Recognition (NER)
def extract_skills(text):
    doc = nlp(text)
    skills = [ent.text for ent in doc.ents if ent.label_ in ["ORG", "PRODUCT", "SKILL", "LANGUAGE"]]  #['SKILL', 'TOOL', 'EXPERTISE'], ["ORG", "PRODUCT", "GPE", "NORP"], ["ORG", "PRODUCT", "SKILL", "LANGUAGE"]
    return list(set(skills))  # Remove duplicates
'''
'''
# Define a function to extract skills using NER
def extract_skills(text):
    doc = nlp(text)

    # Extract entities that might be skills
    entities = [(ent.text, ent.label_) for ent in doc.ents]

    # Extract noun chunks as potential skills
    noun_chunks = [chunk.text for chunk in doc.noun_chunks]

    # Combine and return unique skills
    all_potential_skills = [item[0] for item in entities] + noun_chunks
    return list(set(all_potential_skills))
'''
def extract_skills(text):
    doc = nlp(text)

    # Get named entities that might be skills
    entity_skills = [ent.text for ent in doc.ents if ent.label_ in ["ORG", "PRODUCT", "GPE", "LANGUAGE"]]

    # Extract noun phrases as potential skills
    noun_chunks = [chunk.text.lower() for chunk in doc.noun_chunks]

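    # Match skills from our predefined list.
    # Note: this is plain substring matching, so short entries also match inside
    # longer words (e.g. "pos" in "position", "erp" in "enterprise").
    text_lower = text.lower()
    skill_matches = [skill for skill in common_skills if skill in text_lower]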

    # Combine results and remove duplicates
    all_skills = entity_skills + noun_chunks + skill_matches
    # Clean up skills (remove extra spaces, convert to lowercase)
    cleaned_skills = [skill.lower().strip() for skill in all_skills]

    # Remove duplicates while preserving order; skip very short terms
    unique_skills = list(dict.fromkeys(s for s in cleaned_skills if len(s) > 2))

    return unique_skills

# Apply skill extraction to the cleaned job descriptions
df['Extracted_Skills'] = df['Cleaned_Job_Description'].apply(extract_skills)

# Display extracted skills
print("\nExtracted Skills from Job Descriptions:\n")
df[['Job Description', 'Extracted_Skills']].head()


Extracted Skills from Job Descriptions:



| | Job Description | Extracted_Skills |
|---|---|---|
| 0 | Location: Plano, TX or Oklahoma City, OK Dura... | [oklahoma city, gpa scores, hone bi analytics ... |
| 1 | The Digital Marketing Manager is the front li... | [digital, digital marketing manager front line... |
| 2 | Summary of Responsibilities:The Manager Prici... | [sas, mis responsible assisting pricing strate... |
| 3 | ORGANIZATIONAL SUMMARY: As part of the nati... | [central bank federal reserve bank, dallas, fe... |
| 4 | Aviall is the world's largest provider of n... | [dallas, aviall world largest provider new avi... |
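
The extracted lists above mix genuine skills with place names and long noun phrases (for example `oklahoma city` and `gpa scores`), because location entities (`GPE`) and every noun chunk are kept. If only known skills are wanted, one simple option is to keep just the candidates that appear in the skill vocabulary. A minimal sketch, with a made-up candidate list and a shortened vocabulary; `keep_known_skills` is a hypothetical helper:

```python
def keep_known_skills(candidates, vocabulary):
    """Return candidates found in the vocabulary, in first-seen order, without duplicates."""
    normalized = (c.lower().strip() for c in candidates)
    return list(dict.fromkeys(c for c in normalized if c in vocabulary))

# Shortened stand-in for common_skills
vocabulary = {"sql", "excel", "communication"}

# Illustrative candidates, similar in shape to the output of extract_skills
candidates = ["oklahoma city", "SQL", "gpa scores", "excel", "sql "]

known = keep_known_skills(candidates, vocabulary)
print(known)
```

The trade-off is that skills missing from the vocabulary are discarded, so this works best when the list is reviewed against a sample of real postings.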


### Step 4: Identify the Most In-Demand Skills
#### 📌 Goal: Count the most frequently mentioned skills in job descriptions.

1. Create a word frequency distribution of extracted skills.
2. Identify the top 10 most required skills.

In [40]:
from collections import Counter

# Flatten the list of skills and count occurrences
all_skills = [skill for sublist in df['Extracted_Skills'] for skill in sublist]
skill_counts = Counter(all_skills)

# Get the top 10 most in-demand skills
top_skills = skill_counts.most_common(10)

# Display extracted skills and their frequency
print("\nTop 10 Most In-Demand Skills:\n")
pd.DataFrame(top_skills, columns=["Skill", "Frequency"])



Top 10 Most In-Demand Skills:



| | Skill | Frequency |
|---|---|---|
| 0 | pos | 130 |
| 1 | excel | 123 |
| 2 | communication | 121 |
| 3 | analytical | 94 |
| 4 | erp | 88 |
| 5 | leadership | 65 |
| 6 | research | 64 |
| 7 | law | 63 |
| 8 | microsoft | 62 |
| 9 | word | 58 |
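
Short entries such as `pos` and `erp` near the top of this ranking are probably inflated: `extract_skills` uses plain substring matching, so `pos` also matches "position" and `erp` matches "enterprise". One way to reduce such false positives is to require that a skill is not directly surrounded by letters or digits. The sketch below is an assumption about how that could be done with the standard `re` module; `match_skills` and the sample sentences are made up:

```python
import re

def match_skills(text, skills):
    """Return the skills that occur in text as whole terms, not inside longer words."""
    text_lower = text.lower()
    found = set()
    for skill in skills:
        # Lookarounds instead of \b so that terms like "c++" or "node.js" also work
        pattern = r"(?<![a-z0-9])" + re.escape(skill) + r"(?![a-z0-9])"
        if re.search(pattern, text_lower):
            found.add(skill)
    return found

skills = {"pos", "erp", "git", "sql", "c++"}

noisy = "this position covers enterprise software and digital marketing"
clean = "experience with sql, c++ and erp systems"

noisy_matches = match_skills(noisy, skills)
clean_matches = match_skills(clean, skills)
print(noisy_matches)
print(clean_matches)
```

With plain substring matching the first sentence would yield `pos`, `erp`, and `git`; with the boundary check it yields nothing, while the second sentence still matches the three skills it actually names.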


### Step 5: Categorize Skills by Industry
#### 📌 Goal: Compare the most in-demand skills across different industries.

1. Group job descriptions by industry.
2. Extract and analyze skills for each industry.
3. Compare IT vs. Marketing vs. Healthcare, etc.
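
Raw counts are hard to compare across industries, because the groups differ a lot in size (23 postings for Business Services versus 1 for Real Estate in the `value_counts()` output in Step 3). One option is to report, for each industry, the share of postings that mention a skill. A minimal sketch, assuming each posting is represented as an (industry, list of skills) pair; the sample data and the function name `skill_share_by_industry` are illustrative:

```python
from collections import Counter, defaultdict

def skill_share_by_industry(postings):
    """Map industry -> {skill: fraction of that industry's postings mentioning it}."""
    posting_counts = Counter()
    skill_counts = defaultdict(Counter)
    for industry, skills in postings:
        posting_counts[industry] += 1
        # set() so a skill counts at most once per posting
        skill_counts[industry].update(set(skills))
    return {
        industry: {skill: n / posting_counts[industry] for skill, n in counts.items()}
        for industry, counts in skill_counts.items()
    }

# Made-up postings, not taken from the dataset
postings = [
    ("Information Technology", ["sql", "python"]),
    ("Information Technology", ["sql"]),
    ("Finance", ["excel", "sql"]),
]

shares = skill_share_by_industry(postings)
print(shares["Information Technology"])
```

Shares for industries with only one or two postings are still unreliable, so it may be sensible to restrict the comparison to the larger groups.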

In [41]:
# Categorize skills by industry
def get_top_skills_by_industry(df, industry):
    industry_df = df[df['industry'] == industry]
    all_skills = [skill for sublist in industry_df['Extracted_Skills'] for skill in sublist]
    skill_counts = Counter(all_skills)
    return skill_counts.most_common(10)  # Get top 10 skills

# Get unique industries
industries = df['industry'].dropna().unique()

# Analyze skills for each industry
industry_skill_analysis = {}
for industry in industries:
    industry_skill_analysis[industry] = get_top_skills_by_industry(df, industry)

# Display top skills per industry
print("\nTop Skills by Industry:\n")
for industry, skills in industry_skill_analysis.items():
    print(f"{industry}: {skills}\n")


Top Skills by Industry:

Information Technology: [('pos', 18), ('communication', 18), ('excel', 17), ('big data', 12), ('cloud', 12), ('erp', 11), ('dallas', 11), ('analytical', 11), ('law', 10), ('rest', 10)]

Unknown: [('communication', 8), ('excel', 6), ('git', 6), ('employees', 6), ('clients', 5), ('analytical', 5), ('pos', 4), ('atos leader digital services annual revenue', 4), ('12 billion employees', 4), ('72 countries', 4)]

Finance: [('pos', 14), ('excel', 12), ('communication', 10), ('erp', 9), ('sql', 8), ('analytical', 8), ('law', 7), ('research', 7), ('word', 6), ('inventory', 6)]

Subsidiary or Business Segment: [('organization', 8), ('pos', 8), ('dallas', 7), ('aviall world largest provider new aviation parts related aftermarket services', 7), ('aviall markets', 7), ('products', 7), ('240 manufacturers', 7), ('approximately catalog items', 7), ('40 customer service centers', 7), ('north america europe aviall prides culture', 7)]

Business Services: [('pos', 18), ('commu