# 📜 Project: Job Description Analyzer – Extracting Required Skills from Job Postings


## 📌 Objective
Use spaCy’s Named Entity Recognition (NER) and NLTK preprocessing to extract and categorize required skills from job descriptions. The goal is to identify trends in job requirements and analyze the most in-demand skills across industries.

## 🛠️ Project Steps & Instructions


In [None]:
#📥 Download the Dataset
!wget https://raw.githubusercontent.com/binoydutt/Resume-Job-Description-Matching/refs/heads/master/data.csv

--2025-03-12 11:27:01--  https://raw.githubusercontent.com/binoydutt/Resume-Job-Description-Matching/refs/heads/master/data.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 646072 (631K) [text/plain]
Saving to: ‘data.csv’


2025-03-12 11:27:01 (10.5 MB/s) - ‘data.csv’ saved [646072/646072]



### Step 1: Load the Dataset
#### 📌 Dataset: A provided CSV file containing job descriptions from different industries (IT, Healthcare, Finance, Marketing, etc.).

1. Download the dataset (see the `wget` cell above).
2. Load it into Python using Pandas.
3. View the first few rows to understand its structure.

In [None]:
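# Load the job postings and inspect their structure
import pandas as pd
data = pd.read_csv('data.csv', encoding='utf-8')
data.head()  # only rendered when it is the last expression in a cell
data.info()  # column names, non-null counts, and dtypes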

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157 entries, 0 to 156
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Unnamed: 0       157 non-null    int64 
 1   company          157 non-null    object
 2   position         157 non-null    object
 3   url              157 non-null    object
 4   location         157 non-null    object
 5   headquaters      157 non-null    object
 6   employees        154 non-null    object
 7   founded          154 non-null    object
 8   industry         154 non-null    object
 9   Job Description  157 non-null    object
dtypes: int64(1), object(9)
memory usage: 12.4+ KB


### Step 2: Preprocessing the Job Descriptions
#### 📌 Goal: Clean the text by removing stopwords, punctuation, and unnecessary characters.

1. Use NLTK to tokenize the descriptions.
2. Remove stopwords and special characters.
3. Convert text to lowercase for consistency.

In [None]:
# Import the NLP libraries and download the NLTK resources used below
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
import spacy
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [None]:
# Identify text (object-dtype) columns
text_columns = data.select_dtypes(include=['object']).columns
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    """Lowercase, tokenize, and drop stopwords and non-alphanumeric tokens."""
    if pd.isna(text):
        return ''
    tokens = word_tokenize(str(text))
    tokens = [word.lower() for word in tokens
              if word.isalnum() and word.lower() not in stop_words]
    return ' '.join(tokens)

# Add a cleaned "<column>_tokens" column for every text column
for col in text_columns:
    data[f'{col}_tokens'] = data[col].apply(preprocess_text)
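
One side effect of the `isalnum()` filter in `preprocess_text` is worth checking before relying on the cleaned columns: any token containing punctuation is discarded, which can include skill names. The sketch below uses a hypothetical list of tokens (not taken from the dataset) and assumes the tokenizer hands them over whole, which may not hold for every case. The looser rule at the end is only one possible alternative.

```python
# Hypothetical tokens that might appear in a posting; not taken from the dataset
sample_tokens = ["Python", "C++", "C#", "SQL", "Node.js", "e-commerce", "Excel"]

# Same test as the list comprehension in preprocess_text (stopword check omitted)
kept = [tok.lower() for tok in sample_tokens if tok.isalnum()]
dropped = [tok for tok in sample_tokens if not tok.isalnum()]

# A looser rule: keep any token that has at least one letter or digit
kept_loose = [tok.lower() for tok in sample_tokens
              if any(ch.isalnum() for ch in tok)]

print("kept:      ", kept)
print("dropped:   ", dropped)
print("kept_loose:", kept_loose)
```

If skill names with symbols matter for the analysis, the looser rule (or a whitelist of known names) may be a better fit than `isalnum()`.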


In [None]:
columns = data.columns
print(columns)

Index(['Unnamed: 0', 'company', 'position', 'url', 'location', 'headquaters',
       'employees', 'founded', 'industry', 'Job Description', 'company_tokens',
       'position_tokens', 'url_tokens', 'location_tokens',
       'headquaters_tokens', 'employees_tokens', 'founded_tokens',
       'industry_tokens', 'Job Description_tokens'],
      dtype='object')


In [None]:
print(data['Job Description'].head())

0     Location: Plano, TX or Oklahoma City, OK Dura...
1     The Digital Marketing Manager is the front li...
2     Summary of Responsibilities:The Manager Prici...
3     ORGANIZATIONAL SUMMARY:   As part of the nati...
4       Aviall is the world's largest provider of n...
Name: Job Description, dtype: object


In [None]:
print(data['industry'].values)

['Information Technology' 'Unknown' 'Finance' 'Finance'
 'Subsidiary or Business Segment' 'Finance' 'Unknown'
 'Information Technology' 'Business Services' 'Finance'
 'Accounting & Legal' 'Health Care' 'Media' nan
 'Subsidiary or Business Segment' 'Accounting & Legal'
 'Accounting & Legal' 'Information Technology' 'Non-Profit'
 'Accounting & Legal' 'Information Technology' 'Information Technology'
 'Business Services' nan 'Unknown' 'Information Technology'
 'Business Services' 'Business Services' 'Unknown' 'Business Services'
 'Information Technology' 'Unknown' 'Finance' 'Accounting & Legal'
 'Finance' 'Accounting & Legal' 'Accounting & Legal' 'Accounting & Legal'
 'Information Technology' 'Arts, Entertainment & Recreation'
 'Information Technology' '$500 million to $1 billion (USD) per year'
 'Media' 'Finance' 'Media' 'Health Care' 'Business Services'
 'Information Technology' 'Business Services' 'Finance' 'Media'
 'Information Technology' 'Subsidiary or Business Segment'
 'Business S

### Step 3: Extract Skills Using Named Entity Recognition (NER)
#### 📌 Goal: Use spaCy’s built-in NER to detect and extract skills from job descriptions.

1. Load spaCy’s English model.
2. Use NER to identify important keywords.
3. Extract words related to technical skills, tools, and expertise.

In [None]:
# Load spaCy's small English pipeline
nlp = spacy.load('en_core_web_sm')

def extract_skills(text):
    """Return ORG and PRODUCT entities as a rough proxy for skills."""
    doc = nlp(text)
    # The built-in NER has no dedicated skill label; ORG and PRODUCT are the closest fit
    return [ent.text for ent in doc.ents if ent.label_ in {'ORG', 'PRODUCT'}]

data['skills_extraxted'] = data['Job Description'].apply(extract_skills)
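
Because the pretrained NER labels organizations and products rather than skills, a complementary approach is to match postings against a known skill vocabulary. The following is a minimal sketch using spaCy's `PhraseMatcher`. It is a different technique from the NER step above, and `SKILL_TERMS`, `nlp_blank`, and `match_skills` are hypothetical names introduced only for illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

# A blank English pipeline is enough here: only the tokenizer is needed
nlp_blank = spacy.blank("en")

# Hypothetical skill vocabulary; a real one would be much larger
SKILL_TERMS = ["SQL", "Excel", "Python", "Microsoft Office", "PowerPoint"]

# attr="LOWER" makes the match case-insensitive
matcher = PhraseMatcher(nlp_blank.vocab, attr="LOWER")
matcher.add("SKILL", [nlp_blank.make_doc(term) for term in SKILL_TERMS])

def match_skills(text):
    """Return every span of the text that matches a term in SKILL_TERMS."""
    doc = nlp_blank.make_doc(str(text))
    return [doc[start:end].text for _, start, end in matcher(doc)]

sample = "Strong SQL and Excel skills; experience with Microsoft Office required."
print(match_skills(sample))
```

A vocabulary-based matcher only finds terms it was given, so it trades recall for precision compared with the NER approach.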

### Step 4: Identify the Most In-Demand Skills
#### 📌 Goal: Count the most frequently mentioned skills in job descriptions.

1. Create a word frequency distribution of extracted skills.
2. Identify the top 10 most required skills.

In [None]:
# Flatten the per-posting skill lists and count every mention
all_skills = [skill for skills in data['skills_extraxted'] for skill in skills]
freq_dist_all = FreqDist(all_skills)
print(freq_dist_all.most_common(10))

[('Deloitte', 69), ('Excel', 64), ('CoreLogic', 49), ('GPA', 42), ('Microsoft Office', 36), ('PowerPoint', 35), ('SQL', 32), ('TX', 30), ('SAP', 25), ('IBM', 24)]
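
The counts above are raw mentions, so a single posting that repeats a name many times can dominate the ranking. A possible refinement is to count how many postings mention each item at least once. The sketch below uses `collections.Counter` (which NLTK's `FreqDist` subclasses) on hypothetical per-posting lists shaped like `data['skills_extraxted']`.

```python
from collections import Counter

# Hypothetical per-posting skill lists, shaped like data['skills_extraxted']
postings = [
    ["Deloitte", "Deloitte", "Deloitte", "Excel"],
    ["Excel", "SQL"],
    ["SQL", "Excel", "Excel"],
]

# Raw mentions: every occurrence counts
mention_counts = Counter(skill for skills in postings for skill in skills)

# Posting counts: each skill counts at most once per posting
posting_counts = Counter(skill for skills in postings for skill in set(skills))

print(mention_counts.most_common(3))
print(posting_counts.most_common(3))
```

Comparing the two rankings helps separate items that are widely requested from items that are merely repeated within a few postings.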


### Step 5: Categorize Skills by Industry
#### 📌 Goal: Compare the most in-demand skills across different industries.

1. Group job descriptions by industry.
2. Extract and analyze skills for each industry.
3. Compare IT vs. Marketing vs. Healthcare, etc.

In [None]:
def top_skills_by_industry(skills):
    """Return the 10 most frequent skills across one industry's postings."""
    all_skills = [skill for sub in skills for skill in sub]
    return FreqDist(all_skills).most_common(10)

# groupby drops rows whose industry is NaN by default
industry_skills = data.groupby('industry')['skills_extraxted'].apply(top_skills_by_industry)
for industry, skills in industry_skills.items():
    print(f"\n📌 Industry: {industry}")
    for skill, freq in skills:
        print(f"  - {skill}: {freq}")


📌 Industry: $500 million to $1 billion (USD) per year
  - MIS: 8
  - Decision Sciences: 6
  - the Pricing & Strategy and Profitability Teams: 2
  - Pricing & Strategy and Profitability: 2
  - Information Technology: 2
  - Data Visualization: 2
  - SharePoint: 2
  - Identify: 2
  - Bachelor: 2
  - SQL: 2

📌 Industry: Accounting & Legal
  - Deloitte: 67
  - Deloitte LLP: 16
  - GPA: 13
  - RAS: 8
  - Grant Thornton: 8
  - Deloitte University: 7
  - Deloitte Consulting LLP: 7
  - Grant Thornton International Ltd: 6
  - Partners: 6
  - The Leadership Center: 4

📌 Industry: Aerospace & Defense
  - Skills & Experience :: 2
  - US Citizens: 2
  - Federal Government: 2
  - SQL: 2
  - Data Science: 2
  - Northrop: 2
  - Grumman: 2
  - University Programs: Engineering Intern Raytheon Intelligence, Information and Services: 1
  - Contribute: 1
  - Data Scientist: 1

📌 Industry: Arts, Entertainment & Recreation
  - JCPenney: 14
  - Home Office: 6
  - TX: 4
  - Excel: 4
  - PowerPoint: 4
  - Starb
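
Some group names in this output, such as `$500 million to $1 billion (USD) per year`, appear to be revenue bands rather than industries, and `Unknown` also shows up in the `industry` column. One way to keep such rows out of the comparison is to filter them before grouping. The sketch below runs on a hypothetical toy frame, and the `is_valid_industry` rule is an assumption based only on the values visible in this notebook.

```python
import pandas as pd

# Hypothetical rows that mimic the kinds of values seen in the industry column
toy = pd.DataFrame({
    "industry": ["Finance", "Unknown", None,
                 "$500 million to $1 billion (USD) per year", "Finance"],
    "skills": [["SQL"], ["Excel"], ["SAP"], ["MIS"], ["Excel", "SQL"]],
})

def is_valid_industry(value):
    """Reject missing values, 'Unknown', and revenue-style strings."""
    if pd.isna(value):
        return False
    text = str(value).strip()
    return text != "Unknown" and "per year" not in text

clean = toy[toy["industry"].apply(is_valid_industry)]
print(clean)
```

The same mask could be applied to `data` before the `groupby('industry')` call, after checking which values the rule actually removes.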

In [None]:
# Turn the per-industry lists of (skill, frequency) tuples into one row per skill
df_skill = industry_skills.reset_index()
df_skill.columns = ['Industry', 'Top_Skills']
df_expended = df_skill.explode("Top_Skills")
df_expended[['Skill', 'Frequency']] = df_expended["Top_Skills"].apply(pd.Series)
df_expended.drop('Top_Skills', axis=1, inplace=True)
print(df_expended)

                                     Industry  \
0   $500 million to $1 billion (USD) per year   
0   $500 million to $1 billion (USD) per year   
0   $500 million to $1 billion (USD) per year   
0   $500 million to $1 billion (USD) per year   
0   $500 million to $1 billion (USD) per year   
..                                        ...   
21          Unknown / Non-Applicable per year   
21          Unknown / Non-Applicable per year   
21          Unknown / Non-Applicable per year   
21          Unknown / Non-Applicable per year   
21          Unknown / Non-Applicable per year   

                                             Skill  Frequency  
0                                              MIS          8  
0                                Decision Sciences          6  
0   the Pricing & Strategy and Profitability Teams          2  
0             Pricing & Strategy and Profitability          2  
0                           Information Technology          2  
..                         

In [None]:
print(df_expended.head())
print(df_expended.info())

                                    Industry  \
0  $500 million to $1 billion (USD) per year   
0  $500 million to $1 billion (USD) per year   
0  $500 million to $1 billion (USD) per year   
0  $500 million to $1 billion (USD) per year   
0  $500 million to $1 billion (USD) per year   

                                            Skill  Frequency  
0                                             MIS          8  
0                               Decision Sciences          6  
0  the Pricing & Strategy and Profitability Teams          2  
0            Pricing & Strategy and Profitability          2  
0                          Information Technology          2  
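
The same long-format table can also be built directly from the grouped result, without `explode` and without a repeated index. This is a minimal sketch on a hypothetical Series shaped like `industry_skills` (industry name mapped to a list of `(skill, frequency)` tuples). `toy_top` and `long_df` are illustrative names.

```python
import pandas as pd

# Hypothetical stand-in for industry_skills
toy_top = pd.Series({
    "Finance": [("SQL", 5), ("Excel", 3)],
    "Media": [("PowerPoint", 2)],
})

# One (industry, skill, frequency) row per tuple
rows = [
    (industry, skill, freq)
    for industry, pairs in toy_top.items()
    for skill, freq in pairs
]
long_df = pd.DataFrame(rows, columns=["Industry", "Skill", "Frequency"])
print(long_df)
```

Building the frame from plain tuples gives a fresh 0..n-1 index and lets pandas infer an integer dtype for `Frequency`.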


### Top 5 Skills in 10 Leading Industries

In [None]:
# Rank industries by the summed frequency of their listed skills and keep the 10 largest
top_industries = df_expended.groupby("Industry")['Frequency'].sum().nlargest(10).index
print(top_industries)

In [None]:
df_top=df_expended[df_expended['Industry'].isin(top_industries)]

In [None]:
top_skills_per_industry=df_top.groupby('Industry').apply(lambda x:x.nlargest(5,"Frequency"))
print(top_skills_per_industry)
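
An alternative that avoids `groupby(...).apply` and returns a flat index is to sort first and then take the head of each group. The sketch below shows the idea on a hypothetical toy frame with the same three columns. `toy_long`, `top_n`, and `top_flat` are illustrative names.

```python
import pandas as pd

# Hypothetical long-format table with the same columns as df_expended
toy_long = pd.DataFrame({
    "Industry": ["Finance"] * 3 + ["Media"] * 2,
    "Skill": ["SQL", "Excel", "SAP", "PowerPoint", "Excel"],
    "Frequency": [5, 3, 1, 2, 4],
})

top_n = 2  # the notebook uses 5; 2 keeps the toy output short

# Sort by frequency within each industry, then keep the first top_n rows per group
top_flat = (
    toy_long.sort_values(["Industry", "Frequency"], ascending=[True, False])
    .groupby("Industry")
    .head(top_n)
    .reset_index(drop=True)
)
print(top_flat)
```

Because the result keeps `Industry` as an ordinary column, it can be passed to a table formatter without further reshaping.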

In [None]:
from tabulate import tabulate

print(tabulate(top_skills_per_industry[['Industry', 'Skill', 'Frequency']], headers='keys', tablefmt='pretty'))



+----+----------------------------------+----------------------------------------------+-----------+
|    |             Industry             |                    Skill                     | Frequency |
+----+----------------------------------+----------------------------------------------+-----------+
| 0  |        Accounting & Legal        |                   Deloitte                   |    67     |
| 1  |        Accounting & Legal        |                 Deloitte LLP                 |    16     |
| 2  |        Accounting & Legal        |                     GPA                      |    13     |
| 3  |        Accounting & Legal        |                     RAS                      |     8     |
| 4  |        Accounting & Legal        |                Grant Thornton                |     8     |
| 5  | Arts, Entertainment & Recreation |                   JCPenney                   |    14     |
| 6  | Arts, Entertainment & Recreation |                 Home Office                  |   