
# 📜 Project: Job Description Analyzer – Extracting Required Skills from Job Postings


## 📌 Objective
Use spaCy’s Named Entity Recognition (NER) and NLTK preprocessing to extract and categorize required skills from job descriptions. The goal is to identify trends in job requirements and analyze the most in-demand skills across industries.

## 🛠️ Project Steps & Instructions


In [1]:
#📥 Download the Dataset
!wget https://raw.githubusercontent.com/binoydutt/Resume-Job-Description-Matching/refs/heads/master/data.csv

--2025-10-19 09:10:32--  https://raw.githubusercontent.com/binoydutt/Resume-Job-Description-Matching/refs/heads/master/data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 646072 (631K) [text/plain]
Saving to: ‘data.csv’


2025-10-19 09:10:32 (14.2 MB/s) - ‘data.csv’ saved [646072/646072]



### Step 1: Load the Dataset
#### 📌 Dataset: A provided CSV file containing job descriptions from different industries (IT, Healthcare, Finance, Marketing, etc.).

1. Download the dataset (see the `wget` cell above).
2. Load it into Python using Pandas.
3. View the first few rows to understand its structure.

In [2]:
import pandas as pd
df = pd.read_csv('data.csv')

print(df.head())

   Unnamed: 0                          company  \
0           1          Visual BI Solutions Inc   
1           2                       Jobvertise   
2           3           Santander Consumer USA   
3           4   Federal Reserve Bank of Dallas   
4           5                           Aviall   

                                            position  \
0  Graduate Intern (Summer 2017) - SAP BI / Big D...   
1                          Digital Marketing Manager   
2    Manager, Pricing Management Information Systems   
3               Treasury Services Analyst Internship   
4                              Intern, Sales Analyst   

                                                 url     location  \
0  https://www.glassdoor.com/partner/jobListing.h...    Plano, TX   
1  https://www.glassdoor.com/partner/jobListing.h...   Dallas, TX   
2  https://www.glassdoor.com/partner/jobListing.h...   Dallas, TX   
3  https://www.glassdoor.com/partner/jobListing.h...   Dallas, TX   
4  https://www.gl

### Step 2: Preprocessing the Job Descriptions
#### 📌 Goal: Clean the text by removing stopwords, punctuation, and unnecessary characters.

1. Use NLTK to tokenize the descriptions.
2. Remove stopwords and special characters.
3. Convert text to lowercase for consistency.

In [6]:
!pip install nltk -q

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import re
import pandas as pd

nltk.download('punkt_tab')
nltk.download('punkt')  # just to be safe
nltk.download('stopwords')

def preprocess_text(text):
    if pd.isna(text):
        return ""
    # Convert to lowercase
    text = text.lower()
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords, then strip non-letter characters and drop empty tokens
    stop_words = set(stopwords.words('english'))
    cleaned = (re.sub(r'[^a-zA-Z]', '', token) for token in tokens if token not in stop_words)
    tokens = [token for token in cleaned if token]
    # Join tokens back to text
    return ' '.join(tokens)

df['processed_description'] = df['Job Description'].apply(preprocess_text)

print(df[['Job Description', 'processed_description']].head())

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


                                     Job Description  \
0   Location: Plano, TX or Oklahoma City, OK Dura...   
1   The Digital Marketing Manager is the front li...   
2   Summary of Responsibilities:The Manager Prici...   
3   ORGANIZATIONAL SUMMARY:   As part of the nati...   
4     Aviall is the world's largest provider of n...   

                               processed_description  
0  location plano tx oklahoma city ok duration in...  
1  digital marketing manager front line patient c...  
2  summary responsibilities manager pricing mis r...  
3  organizational summary part nation central ban...  
4  aviall world s largest provider new aviation p...  
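
The stray `s` in `world s largest` is consistent with the order of operations in the cell above: stopwords are filtered first, and only afterwards are non-letter characters stripped, so a token such as `'s` passes the stopword check and is then reduced to `s`. The following is a minimal sketch of the difference, assuming the tokenizer emits `'s` as its own token. `STOP_WORDS` is a tiny hypothetical stand-in for NLTK's English list so that the example runs without any downloads, and both function names are illustrative.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list (assumed to contain "s").
STOP_WORDS = {"is", "the", "of", "s"}

def filter_then_clean(tokens):
    """Order used in the cell above: drop stopwords, then strip non-letters."""
    kept = (token.lower() for token in tokens if token.lower() not in STOP_WORDS)
    cleaned = (re.sub(r"[^a-z]", "", token) for token in kept)
    return [token for token in cleaned if token]

def clean_then_filter(tokens):
    """Alternative order: strip non-letters first, then drop blanks and stopwords."""
    cleaned = (re.sub(r"[^a-z]", "", token.lower()) for token in tokens)
    return [token for token in cleaned if token and token not in STOP_WORDS]

# Tokens as a Treebank-style tokenizer would split "Aviall is the world's largest provider".
tokens = ["Aviall", "is", "the", "world", "'s", "largest", "provider"]

print(filter_then_clean(tokens))  # ['aviall', 'world', 's', 'largest', 'provider']
print(clean_then_filter(tokens))  # ['aviall', 'world', 'largest', 'provider']
```

Cleaning before filtering lets the stopword list catch fragments that only become stopwords once punctuation is removed.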


### Step 3: Extract Skills Using Named Entity Recognition (NER)
#### 📌 Goal: Use spaCy’s built-in NER to detect and extract skills from job descriptions.

1. Load spaCy’s English model.
2. Use NER to identify important keywords.
3. Extract words related to technical skills, tools, and expertise.

In [7]:
!pip install spacy -q

import spacy
import pandas as pd

nlp = spacy.load("en_core_web_sm")

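def extract_skills(text):
    # Missing descriptions produce an empty string rather than NaN
    if pd.isna(text):
        return ""
    doc = nlp(text)
    skills = []
    for ent in doc.ents:
        # Note: "SKILL" is not a label in en_core_web_sm, so in practice only ORG and
        # PRODUCT entities, or entities containing one of the keywords, are kept.
        if ent.label_ in ["ORG", "PRODUCT", "SKILL"] or any(keyword in ent.text.lower() for keyword in ["skill", "tool", "expert", "experience", "knowledge"]):
            skills.append(ent.text)
    # De-duplicate while preserving first-seen order, then join into one string
    return ', '.join(list(dict.fromkeys(skills)))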

df['extracted_skills'] = df['processed_description'].apply(extract_skills)

print(df[['processed_description', 'extracted_skills']].head())

                               processed_description  \
0  location plano tx oklahoma city ok duration in...   
1  digital marketing manager front line patient c...   
2  summary responsibilities manager pricing mis r...   
3  organizational summary part nation central ban...   
4  aviall world s largest provider new aviation p...   

                                    extracted_skills  
0                                         gpa scores  
1                                            digital  
2                                                     
3  federal reserve bank, dallas treasury services...  
4                                                     
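
Two things limit the results above. The stock `en_core_web_sm` model has no `SKILL` entity label, so the filter effectively keeps organizations and products. The NER also runs on lowercased, stopword-stripped text, which likely removes the capitalization and context cues the model depends on. One common alternative, shown here only as a sketch and not as the method used in this notebook, is to match the text against a curated skill vocabulary. `SKILL_VOCAB`, `build_skill_pattern`, and `match_skills` are hypothetical names, and the vocabulary is a placeholder that a real project would replace with a much larger list.

```python
import re

# Hypothetical skill vocabulary; a real project would curate a much larger list.
SKILL_VOCAB = ["python", "sql", "microsoft office", "excel", "tableau", "sap bi"]

def build_skill_pattern(vocab):
    """Compile one regex that matches any vocabulary entry as a whole word or phrase."""
    # Longest entries first so a multi-word skill wins over a shorter overlapping one.
    ordered = sorted(vocab, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in ordered)
    # Lookarounds prevent matches inside longer words (e.g. "sql" inside "mysql").
    return re.compile(rf"(?<![a-z0-9])(?:{alternation})(?![a-z0-9])")

SKILL_PATTERN = build_skill_pattern(SKILL_VOCAB)

def match_skills(text):
    """Return vocabulary skills found in text, de-duplicated, in order of appearance."""
    if not isinstance(text, str):
        return []
    found = SKILL_PATTERN.findall(text.lower())
    return list(dict.fromkeys(found))

sample = "Strong SQL and Python skills; Microsoft Office (Excel) required. SQL a must."
print(match_skills(sample))  # ['sql', 'python', 'microsoft office', 'excel']
```

Because the matcher lowercases its input itself, it can be applied to the raw `Job Description` column as well as to `processed_description`.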


### Step 4: Identify the Most In-Demand Skills
#### 📌 Goal: Count the most frequently mentioned skills in job descriptions.

1. Create a word frequency distribution of extracted skills.
2. Identify the top 10 most required skills.

In [8]:
import pandas as pd
from collections import Counter

def get_skill_frequency(df):
    all_skills = []
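    for skills in df['extracted_skills'].dropna():
        # Note: rows with no extracted skills hold '', and ''.split(', ') returns [''],
        # so those rows are counted as a blank "skill" in the output below.
        skills_list = skills.split(', ')
        all_skills.extend(skills_list)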
    skill_freq = Counter(all_skills)
    return skill_freq

skill_frequency = get_skill_frequency(df)


top_10_skills = skill_frequency.most_common(10)


print("Top 10 Most In-Demand Skills:")
for skill, count in top_10_skills:
    print(f"{skill}: {count} times")

# Empty strings mean "no skills found", so count them as 0 rather than 1
df['skill_count'] = df['extracted_skills'].apply(lambda x: len(x.split(', ')) if isinstance(x, str) and x else 0)

Top 10 Most In-Demand Skills:
: 43 times
microsoft: 34 times
microsoft office: 23 times
deloitte: 11 times
texas usa: 7 times
ibm: 7 times
grant thornton international ltd one: 6 times
united states: 6 times
bachelor s master: 6 times
deloitte university: 4 times
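
The blank entry at the top of this list is an artifact rather than a skill: rows where no entity matched hold an empty string, and `"".split(", ")` returns `[""]`, so each such row contributes one blank item. A minimal sketch of a counter that skips blanks follows. `count_skills` is a hypothetical helper, and `rows` is made-up data in the same format as the `extracted_skills` column.

```python
from collections import Counter

def count_skills(skill_strings):
    """Count comma-separated skills, skipping blanks left by rows with no matches."""
    counts = Counter()
    for skills in skill_strings:
        if not isinstance(skills, str):
            continue  # skip NaN / None
        counts.update(part.strip() for part in skills.split(",") if part.strip())
    return counts

# Hypothetical rows in the same format as the extracted_skills column.
rows = ["microsoft, ibm", "", "microsoft office, microsoft", ""]

# Naive counting, as in the cell above: each empty row adds one blank item.
naive = Counter(skill for row in rows for skill in row.split(", "))
print(naive[""])                        # 2

filtered = count_skills(rows)
print("" in filtered)                   # False
print(filtered["microsoft"])            # 2
```

With the notebook's data this would be called as `count_skills(df['extracted_skills'])`, which would drop the blank row from the ranking.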


### Step 5: Categorize Skills by Industry
#### 📌 Goal: Compare the most in-demand skills across different industries.

1. Group job descriptions by industry.
2. Extract and analyze skills for each industry.
3. Compare IT vs. Marketing vs. Healthcare, etc.

In [9]:
import pandas as pd
from collections import Counter

grouped = df.groupby('industry')


def analyze_industry_skills(group):
    all_skills = []
    for skills in group['extracted_skills'].dropna():
        skills_list = skills.split(', ')
        all_skills.extend(skills_list)
    skill_freq = Counter(all_skills)
    return skill_freq.most_common(5)

industry_skills = {}
for industry, group in grouped:
    industry_skills[industry] = analyze_industry_skills(group)

print("Top 5 Skills by Industry:")
for industry, skills in industry_skills.items():
    print(f"\n{industry}:")
    for skill, count in skills:
        print(f"  {skill}: {count} times")

industries_to_compare = ['Information Technology', 'Unknown', 'Finance', 'Marketing']
print("\nComparison of Top Skills Across Selected Industries:")
for industry in industries_to_compare:
    if industry in industry_skills:
        print(f"\n{industry}:")
        for skill, count in industry_skills[industry][:3]:
            print(f"  {skill}: {count} times")

Top 5 Skills by Industry:

$500 million to $1 billion (USD) per year:
  : 2 times

Accounting & Legal:
  deloitte: 10 times
  grant thornton international ltd one: 6 times
  deloitte university: 4 times
  united states: 3 times
  microsoft: 3 times

Aerospace & Defense:
  : 1 times
  northrop: 1 times
  jsp: 1 times
  oracle sql server relational: 1 times

Arts, Entertainment & Recreation:
  penney: 4 times
  microsoft: 2 times
  android: 2 times

Business Services:
  : 9 times
  ibm: 6 times
  microsoft: 5 times
  dmv: 2 times
  k offer associates variety: 2 times

Company - Public:
  gm: 1 times

Construction, Repair & Maintenance:
  united states: 1 times
  habitat council: 1 times
  north america inc holcim us inc aggregate industries management inc affiliates: 1 times

Finance:
  : 6 times
  federal reserve bank: 3 times
  dallas treasury services department regularly apply analytical problem: 3 times
  microsoft: 3 times
  invesco ltd leading: 3 times

Health Care:
  : 8 times
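
Blank entries show up in these per-industry lists for the same reason as in Step 4: rows with no extracted skills hold an empty string. Labels such as `$500 million to $1 billion (USD) per year` and `Company - Public` also suggest that some rows may carry values from other columns in the `industry` field, which is worth checking before comparing industries. The sketch below shows one way to build the per-industry ranking while skipping blanks and dropping industries with no skills at all. `top_skills_by_industry` is a hypothetical helper and `records` is illustrative data only.

```python
from collections import Counter, defaultdict

def top_skills_by_industry(records, n=5):
    """Map each industry to its n most common non-blank skills.

    records: iterable of (industry, extracted_skills) pairs, where
    extracted_skills is a ", "-joined string as produced in Step 3.
    """
    per_industry = defaultdict(Counter)
    for industry, skills in records:
        if not isinstance(industry, str) or not isinstance(skills, str):
            continue  # skip missing values
        per_industry[industry].update(s.strip() for s in skills.split(",") if s.strip())
    # Industries whose rows yielded no skills are left out of the result.
    return {ind: counts.most_common(n) for ind, counts in per_industry.items() if counts}

# Hypothetical records for illustration only.
records = [
    ("Finance", "microsoft, federal reserve bank"),
    ("Finance", ""),
    ("Finance", "microsoft"),
    ("Health Care", ""),
]
print(top_skills_by_industry(records, n=1))  # {'Finance': [('microsoft', 2)]}
```

With the notebook's DataFrame, the input could be built as `zip(df['industry'], df['extracted_skills'])`.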
  

In [10]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive
