# Resume Classification Project

This project aims to classify resumes into various job categories using natural language processing (NLP) techniques and machine learning. The dataset used for training the model is sourced from [Kaggle's Updated Resume Dataset](https://www.kaggle.com/datasets/jillanisofttech/updated-resume-dataset).

## Overview

The process involves several key steps:
1. **Data Loading:** The resume dataset is loaded into a DataFrame.
2. **Text Preprocessing:** A cleaning function is applied to remove unnecessary characters, URLs, and stopwords from the resume text.
3. **Feature Vectorization:** The cleaned text data is transformed into numerical feature vectors using `CountVectorizer`.
4. **Label Encoding:** Job categories are encoded into numeric labels for model training.
5. **Model Training:** A `MultinomialNB` classifier is trained using a One-vs-Rest approach. Multinomial Naive Bayes suits text classification because it models discrete features such as word counts. The One-vs-Rest wrapper fits one binary classifier per category and predicts the category whose classifier scores highest. `MultinomialNB` also handles multiple classes natively, so the wrapper is a design choice rather than a requirement.
6. **Model Evaluation:** The trained model's performance is evaluated using accuracy and classification metrics.
7. **Prediction Function:** A function is defined to predict the category of new resumes, outputting match percentages for various job categories.

This notebook will guide you through each step of the process, providing insights into the techniques used and the results obtained.
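
The steps above can be sketched end to end on a toy dataset. This is a minimal illustration, not the notebook's actual code: the texts and labels are made up, and `make_pipeline` is used here only to keep the sketch compact (the notebook calls the vectorizer and model separately).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the resume dataset (hypothetical texts and labels).
texts = [
    "python pandas machine learning models",
    "python statistics deep learning",
    "java spring hibernate microservices",
    "java maven spring boot",
    "recruitment payroll employee onboarding",
    "employee relations recruitment hiring",
]
categories = [
    "Data Science", "Data Science",
    "Java Developer", "Java Developer",
    "HR", "HR",
]

# Label encoding: map category names to integers.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(categories)

# Vectorization and training: word counts feed a One-vs-Rest Naive Bayes model.
pipeline = make_pipeline(CountVectorizer(), OneVsRestClassifier(MultinomialNB()))
pipeline.fit(texts, y)

# Prediction: classify an unseen text and map the integer back to a name.
predicted = label_encoder.inverse_transform(pipeline.predict(["spring boot java developer"]))
print(predicted[0])
```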


# Load the Dataset
This cell imports `pandas` (and `re`, which is used later for text cleaning) and loads the dataset from `data/UpdatedResumeDataSet.csv` into a DataFrame called `final_df`. The DataFrame contains the resume text along with the job category of each resume.


In [9]:
import pandas as pd
import re

# Load the dataset
final_df = pd.read_csv('data/UpdatedResumeDataSet.csv')
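
After loading, it can help to confirm the expected columns and look at the class balance before going further. The sketch below uses a tiny inline CSV as a stand-in for the real file (the rows are hypothetical); the column names `Category` and `Resume` are the ones the later cells rely on.

```python
import io
import pandas as pd

# Tiny inline CSV standing in for data/UpdatedResumeDataSet.csv (hypothetical rows).
csv_text = """Category,Resume
Data Science,"Skills: Python, pandas"
Data Science,"Skills: R, statistics"
HR,"Skills: recruitment, payroll"
"""
sample_df = pd.read_csv(io.StringIO(csv_text))

# Confirm the two columns the rest of the notebook relies on are present.
print(list(sample_df.columns))

# Count resumes per category to spot class imbalance before training.
category_counts = sample_df["Category"].value_counts()
print(category_counts)

# Check for missing resume text, which would break the cleaning step.
missing_resumes = int(sample_df["Resume"].isna().sum())
print(missing_resumes)
```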

# Import Additional Libraries and Download NLTK Stopwords
This cell imports additional libraries required for text processing, including `re`, `string`, `nltk.corpus.stopwords`, `CountVectorizer`, and `LabelEncoder`. It also downloads the NLTK stopwords if they haven't been downloaded already. The `stop_words` variable is created as a set of English stopwords, which will be used to clean the resume text later.

# Define the `clean_resume` Function
This cell defines a function named `clean_resume` that takes a single input parameter:

- **Input:**
  - `resume_text` (str): The raw text of a resume that needs to be cleaned.

- **Output:**
  - Returns a cleaned, lowercased string where:
    - URLs, mentions, hashtags, the tokens "RT" and "cc", non-ASCII characters, and extra whitespace are removed.
    - Punctuation and non-alphabet characters are eliminated.
    - Stopwords and specific words ("year", "years", "month", "months") are filtered out.
  
The cleaned text is returned as a single string, which will be used for further processing.
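
To make the effect of these rules concrete, here is a simplified sketch of the same idea applied to one sentence. It is not the notebook's `clean_resume`: the function name `clean_resume_demo` and the small `demo_stop_words` set are assumptions chosen so the example runs without downloading NLTK data, and the "RT"/"cc" and non-ASCII steps are left out.

```python
import re
import string

# Small hand-picked stopword set so the sketch runs without NLTK data;
# the notebook itself uses NLTK's full English list.
demo_stop_words = {"a", "an", "and", "at", "for", "in", "of", "the", "with"}
extra_words = {"year", "years", "month", "months"}

def clean_resume_demo(resume_text):
    # Drop URLs, hashtags, and mentions.
    text = re.sub(r"http\S+\s*", " ", resume_text)
    text = re.sub(r"#\S+", " ", text)
    text = re.sub(r"@\S+", " ", text)
    # Replace punctuation with spaces, then drop anything that is not a letter or whitespace.
    text = re.sub("[%s]" % re.escape(string.punctuation), " ", text)
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    # Lowercase, then filter stopwords and the extra date-related words.
    words = [w.lower() for w in text.split()]
    return " ".join(w for w in words if w not in demo_stop_words and w not in extra_words)

sample = "5 years of Python at https://example.com with @team #hiring, and SQL!"
print(clean_resume_demo(sample))  # python sql
```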

# Apply the Cleaning Function
This cell applies the `clean_resume` function to the 'Resume' column of the `final_df` DataFrame. The cleaned resume text replaces the original text in the DataFrame, ensuring that all resumes are preprocessed before they are vectorized.

# Encode Labels
This cell creates an instance of `LabelEncoder` and fits it to the 'Category' column of the `final_df` DataFrame. This transforms the categorical job categories into numeric labels, which will be used for training the machine learning model. The encoded categories replace the original categorical values in the DataFrame.
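
As a small illustration of what the encoder does, the sketch below fits `LabelEncoder` on a hypothetical list of categories. The integer assigned to each category is its position in the sorted `classes_` array, and `inverse_transform` recovers the names, which is what the prediction function relies on later.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical category values for illustration.
categories = ["Data Science", "HR", "Data Science", "Arts"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(categories)

# Classes are stored in sorted order; each label's integer is its index in classes_.
print(encoder.classes_.tolist())  # ['Arts', 'Data Science', 'HR']
print(encoded.tolist())           # [1, 2, 1, 0]

# inverse_transform maps integers back to the original category names.
print(encoder.inverse_transform([0, 2]).tolist())  # ['Arts', 'HR']
```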




In [10]:
import pandas as pd
import re
import string
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

# Assuming you have your simplified dataset loaded as final_df with 'Category' and 'Resume' columns

# Download stopwords if not already done
import nltk
nltk.download('stopwords')
stop_words = set(stopwords.words("english"))

# Define a comprehensive function to clean resume text
def clean_resume(resume_text):
    # Remove URLs, RT, cc, hashtags, mentions, non-ASCII characters, and extra whitespace
    # (raw strings avoid invalid-escape warnings for \S and \s)
    resume_text = re.sub(r'http\S+\s*', ' ', resume_text)
    resume_text = re.sub(r'RT|cc', ' ', resume_text)  # note: also matches inside words, e.g. "account"
    resume_text = re.sub(r'#\S+', '', resume_text)
    resume_text = re.sub(r'@\S+', ' ', resume_text)
    resume_text = re.sub(r'[^\x00-\x7f]', ' ', resume_text)
    resume_text = re.sub(r'\s+', ' ', resume_text)
    
    # Remove punctuation and non-alphabet characters
    resume_text = re.sub('[%s]' % re.escape(string.punctuation), ' ', resume_text)
    resume_text = re.sub(r'[^a-zA-Z\s]', '', resume_text)

    # Split into words and remove stop words, months, years
    words = resume_text.split()
    words = [word.lower() for word in words if word.lower() not in stop_words 
             and word.lower() not in ["year", "years", "month", "months"]
             and not word.isdigit()]
    
    # Join cleaned words back into a single string
    cleaned_text = ' '.join(words)
    return cleaned_text

# Apply the cleaning function
final_df['Resume'] = final_df['Resume'].apply(clean_resume)

# Encode labels
label_encoder = LabelEncoder()
final_df['Category'] = label_encoder.fit_transform(final_df['Category'])

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ashraf/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# Vectorize Text Data
This cell creates an instance of `CountVectorizer`, which is then used to convert the cleaned resume text data into numerical feature vectors. The variable `X` contains the feature vectors, while `y` contains the corresponding numeric labels of the job categories. This transformation prepares the data for model training.


In [11]:
# Vectorize text data
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(final_df['Resume'])
y = final_df['Category']

# Split Data into Training and Testing Sets
This cell uses `train_test_split` to split the feature vectors (`X`) and the labels (`y`) into training and testing sets. The training set will be used to train the model, while the test set will be used to evaluate its performance. The `test_size` parameter is set to 0.2, indicating that 20% of the data will be used for testing, and stratification is applied to maintain the proportion of categories in both sets.


In [12]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
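
The effect of `stratify` is easiest to see on imbalanced toy labels. The sketch below uses made-up data with an 80/20 class ratio and checks that both splits keep that ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 80 samples of class 0 and 20 of class 1.
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, the 80/20 class ratio is kept in both splits.
print(np.bincount(y_tr).tolist())  # [64, 16]
print(np.bincount(y_te).tolist())  # [16, 4]
```

Stratified splitting requires every class to have at least two samples, so very rare categories would need to be merged or removed first.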

# Define and Train the Model
This cell defines a machine learning model using `MultinomialNB` in a One-vs-Rest configuration with `OneVsRestClassifier`. The model is then fitted to the training data (`X_train` and `y_train`). After fitting, predictions are made on the test set (`y_pred`), which will be used to evaluate the model's performance.
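
To show what the One-vs-Rest wrapper adds, the following sketch fits it on a small hypothetical word-count matrix. It fits one binary `MultinomialNB` per class, and for single-label problems like this one `predict_proba` rescales each row so the class probabilities sum to 1.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB

# Toy word-count matrix: 6 documents, 3 vocabulary terms, 3 classes (hypothetical data).
X_demo = np.array([
    [3, 0, 0], [2, 1, 0],
    [0, 3, 0], [0, 2, 1],
    [0, 0, 3], [1, 0, 2],
])
y_demo = np.array([0, 0, 1, 1, 2, 2])

ovr_model = OneVsRestClassifier(MultinomialNB()).fit(X_demo, y_demo)

# One binary MultinomialNB is fitted per class.
print(len(ovr_model.estimators_))  # 3

# A document made only of the second term should be assigned to class 1.
new_doc = np.array([[0, 4, 0]])
probs = ovr_model.predict_proba(new_doc)
prediction = ovr_model.predict(new_doc)
print(probs.round(3))
print(prediction.tolist())  # [1]
```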


In [13]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Define the model
model = OneVsRestClassifier(MultinomialNB())
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=label_encoder.classes_))

Accuracy: 0.9792746113989638
                           precision    recall  f1-score   support

                 Advocate       1.00      0.50      0.67         4
                     Arts       0.88      1.00      0.93         7
       Automation Testing       1.00      0.80      0.89         5
               Blockchain       1.00      1.00      1.00         8
         Business Analyst       0.86      1.00      0.92         6
           Civil Engineer       1.00      1.00      1.00         5
             Data Science       1.00      1.00      1.00         8
                 Database       1.00      1.00      1.00         7
          DevOps Engineer       1.00      0.91      0.95        11
         DotNet Developer       1.00      1.00      1.00         5
            ETL Developer       1.00      1.00      1.00         8
   Electrical Engineering       0.86      1.00      0.92         6
                       HR       1.00      1.00      1.00         9
                   Hadoop       

# Make Predictions and Evaluate the Model
This cell evaluates the trained model by calculating the accuracy score and generating a classification report, which includes precision, recall, and F1-score for each category. Because these metrics are computed on the held-out test set, they indicate how well the model generalizes to resumes it did not see during training.
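
With many categories, it can be convenient to pull out the weak spots programmatically, such as the Advocate row with 0.50 recall in the report above. The sketch below is one possible approach on made-up labels: `output_dict=True` returns the report as nested dictionaries, and the `threshold` value is an arbitrary choice for illustration.

```python
from sklearn.metrics import classification_report

# Toy labels standing in for y_test and y_pred (hypothetical values).
y_true_demo = ["Advocate", "Advocate", "Arts", "Arts", "HR", "HR"]
y_pred_demo = ["Advocate", "Arts", "Arts", "Arts", "HR", "HR"]

# output_dict=True returns the report as nested dictionaries instead of text.
report = classification_report(y_true_demo, y_pred_demo, output_dict=True)

# Collect categories whose recall falls below a chosen threshold,
# skipping the summary rows that the report also contains.
threshold = 0.9  # arbitrary cutoff for illustration
summary_rows = ("micro avg", "macro avg", "weighted avg")
low_recall = {
    name: float(metrics["recall"])
    for name, metrics in report.items()
    if isinstance(metrics, dict)
    and name not in summary_rows
    and metrics["recall"] < threshold
}
print(low_recall)  # {'Advocate': 0.5}
```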

# Define the `predication_func` Function
This cell defines a function named `predication_func` that takes a single input parameter:

- **Input:**
  - `new_resume` (str): The raw text of a new resume that needs to be analyzed.

- **Output:**
  - The function prints the top match percentages for specified job categories and groups any remaining categories under "Other."

The function performs the following tasks:
1. Cleans the new resume text using the `clean_resume` function.
2. Vectorizes the cleaned resume using the same `CountVectorizer` used for the training data.
3. Generates category probabilities using the trained model.
4. Sums the probabilities of all other categories under "Other" and prints the match percentages in descending order.
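
The grouping and sorting logic can be looked at in isolation. The helper below, `group_probabilities`, is a hypothetical name introduced for this sketch, and the probabilities are made up. It returns the sorted pairs instead of printing them, which makes the logic easier to test.

```python
def group_probabilities(probs, names, display_categories):
    """Keep the listed categories and sum everything else under 'Other'."""
    grouped = {}
    other_total = 0.0
    for name, prob in zip(names, probs):
        if name in display_categories:
            grouped[name] = prob * 100
        else:
            other_total += prob * 100
    if other_total > 0:
        grouped["Other"] = other_total
    # Sort by percentage, highest first.
    return sorted(grouped.items(), key=lambda item: item[1], reverse=True)

# Hypothetical probabilities for four categories.
names = ["Advocate", "Data Science", "HR", "Python Developer"]
probs = [0.10, 0.60, 0.05, 0.25]
matches = group_probabilities(probs, names, ["Data Science", "Python Developer"])
for category, percent in matches:
    print(f"{category}: {percent:.2f}%")
```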


In [14]:
import numpy as np

def predication_func(new_resume):
    # Specific categories to display; all others are grouped under "Other"
    display_categories = [
        "Data Science",
        "Database",
        "DevOps Engineer",
        "DotNet Developer",
        "Java Developer",
        "Python Developer",
        "Testing",
        "Web Designing",
    ]

    # Clean the raw text with the same function used for the training data
    cleaned_resume = clean_resume(new_resume)

    # Vectorize the cleaned resume using the same vectorizer
    new_resume_vec = vectorizer.transform([cleaned_resume])

    # Generate category probabilities
    category_probs = model.predict_proba(new_resume_vec)

    # Get category names instead of numeric labels
    category_names = label_encoder.inverse_transform(model.classes_)

    # Group the categories and calculate probabilities for "Other"
    category_percentage = {}
    other_total = 0  # To accumulate percentages for categories marked as "Other"

    for i, prob in enumerate(category_probs[0]):
        category_name = category_names[i]
        if category_name in display_categories:
            category_percentage[category_name] = prob * 100
        else:
            other_total += prob * 100

    # Add "Other" category if there are remaining categories
    if other_total > 0:
        category_percentage["Other"] = other_total

    # # Display match percentages for each category
    # print("Match percentages for each category:")
    # for category, percent in category_percentage.items():
    #     print(f"{category}: {percent:.2f}%")

    # Sort and display top matches
    sorted_matches = sorted(category_percentage.items(), key=lambda x: x[1], reverse=True)
    print("\nTop category matches:")
    for category, percent in sorted_matches:
        print(f"{category}: {percent:.2f}%")

In [15]:

resume_1 = """
Skills: Python, Django, Flask, REST APIs, SQL, NoSQL, Git, Docker, AWS, Data Analysis, Pandas, NumPy

Experience:
- Developed RESTful APIs using Django and Flask for an e-commerce platform, reducing load times by 20%.
- Built automated scripts for data cleaning and ETL processes using Python and Pandas, saving 15 hours of manual work weekly.
- Integrated third-party services with OAuth and JWT authentication for secure data handling.

Education: Bachelor’s in Computer Science, University of California

Projects:
- Built a chatbot using Natural Language Processing techniques for customer support.
- Developed a web scraping tool using Python and BeautifulSoup to extract and analyze online reviews.
"""

resume_2 = """
Skills: C#, .NET Core, ASP.NET MVC, Entity Framework, LINQ, SQL Server, Azure, Agile, Git, REST APIs

Experience:
- Designed and implemented web applications using ASP.NET MVC and .NET Core, increasing user engagement by 30%.
- Worked with Azure DevOps to deploy and manage cloud-based applications.
- Used Entity Framework to create and maintain database connections and ensure data integrity.

Education: Bachelor’s in Information Technology, University of Texas

Projects:
- Developed a ticket booking system with ASP.NET Core and Entity Framework.
- Created a performance monitoring tool for web apps to enhance response times by 25%.
"""

resume_3 = """
Skills: Java, Spring Boot, Hibernate, SQL, Microservices, Maven, Git, RESTful APIs, Jenkins, AWS

Experience:
- Built scalable microservices with Spring Boot, handling over 1 million requests per day.
- Designed a notification system using Java and RabbitMQ, improving alert delivery efficiency by 40%.
- Collaborated with cross-functional teams on Java-based applications for e-commerce platforms.

Education: Bachelor’s in Software Engineering, University of Michigan

Projects:
- Developed a library management system using Java and Hibernate.
- Built an e-commerce recommendation engine with Java, improving user purchase rates by 18%.

"""

resume_4 = """
Skills: Python, R, SQL, Machine Learning, Deep Learning, Pandas, Scikit-learn, TensorFlow, Data Visualization, NLP

Experience:
- Built predictive models using machine learning algorithms to forecast sales, achieving 90% accuracy.
- Conducted data analysis for customer segmentation, resulting in a 15% increase in marketing efficiency.
- Created dashboards and visualizations in Tableau to present insights to stakeholders.

Education: Master’s in Data Science, Stanford University

Projects:
- Developed a sentiment analysis tool using NLP to analyze customer feedback.
- Created a recommendation system for a streaming service using collaborative filtering.

"""

resume_5 = """
Skills: Cybersecurity, Network Security, Ethical Hacking, Incident Response, Threat Analysis, Risk Assessment, Firewalls, IDS/IPS, Malware Analysis, Encryption, Forensics

Experience:
- Conducted vulnerability assessments and penetration testing, identifying and mitigating security risks in network systems.
- Managed incident response for security breaches, reducing downtime by 40%.
- Developed and implemented security policies and protocols, resulting in improved security posture and compliance.

Education: Bachelor’s in Cybersecurity, University of Texas

Projects:
- Built a tool to automate log analysis and identify suspicious activity patterns for enhanced threat detection.
- Led a team to design a secure network architecture for a financial institution, minimizing risk of cyber attacks.

"""

In [16]:
predication_func(resume_1)


Top category matches:
Data Science: 65.69%
Python Developer: 34.31%
Other: 0.00%
Database: 0.00%
DevOps Engineer: 0.00%
Java Developer: 0.00%
DotNet Developer: 0.00%
Web Designing: 0.00%
Testing: 0.00%


In [17]:
predication_func(resume_2)


Top category matches:
DotNet Developer: 100.00%
DevOps Engineer: 0.00%
Web Designing: 0.00%
Java Developer: 0.00%
Other: 0.00%
Data Science: 0.00%
Database: 0.00%
Python Developer: 0.00%
Testing: 0.00%


In [18]:
predication_func(resume_3)


Top category matches:
Java Developer: 100.00%
DevOps Engineer: 0.00%
Other: 0.00%
Database: 0.00%
Web Designing: 0.00%
Python Developer: 0.00%
Data Science: 0.00%
Testing: 0.00%
DotNet Developer: 0.00%


In [19]:
predication_func(resume_4)


Top category matches:
Data Science: 100.00%
Other: 0.00%
Python Developer: 0.00%
Database: 0.00%
DevOps Engineer: 0.00%
Java Developer: 0.00%
DotNet Developer: 0.00%
Web Designing: 0.00%
Testing: 0.00%


In [20]:
predication_func(resume_5)


Top category matches:
Other: 100.00%
Database: 0.00%
Data Science: 0.00%
Python Developer: 0.00%
Java Developer: 0.00%
Testing: 0.00%
DevOps Engineer: 0.00%
Web Designing: 0.00%
DotNet Developer: 0.00%
