<div align="center">
  <h1 style="font-size: 28px; font-weight: bold;">Automated Resume Screening and Contact Extraction System</h1>
</div>

<div style="background-color: #f5f5f5; padding: 20px; border-radius: 5px; font-weight: bold; font-size: 18px;">

**Problem Statement:**

In the dynamic and competitive field of recruitment, job recruiters are often overwhelmed with a large volume of resumes from prospective candidates. Identifying the most suitable candidates for specific job positions and efficiently reaching out to them is a time-consuming and resource-intensive process. There is a critical need for a streamlined solution that not only automates the resume screening process but also facilitates direct communication with selected candidates.

</div>


### Model Building:

In [1]:
import os
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
import pdfplumber
from sklearn.metrics import confusion_matrix, classification_report

In [2]:
dataset_path = "/Users/suka/Downloads/AICapstone 2"
text_data = []
label_data = []

In [3]:
for resume_folder in os.listdir(dataset_path):
    if os.path.isdir(os.path.join(dataset_path, resume_folder)):
        with pdfplumber.open(os.path.join(dataset_path, resume_folder, "resume.pdf")) as pdf:
            pdf_text = ""
            for page in pdf.pages:
                # extract_text() can return None for pages without a text layer
                pdf_text += page.extract_text() or ""
            text_data.append(pdf_text.strip())
        
        with open(os.path.join(dataset_path, resume_folder, "labels.txt"), "r") as f:
            labels = f.read().strip().split("\n")
            label_data.append(labels)

tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(text_data)
vocab_size = len(tokenizer.word_index) + 1

In [4]:
label_encoder = {
    "O": 0, "B-PERSON": 1, "I-PERSON": 2,
    "B-ADDRESS": 3, "I-ADDRESS": 4,
    "B-PHONE": 5, "I-PHONE": 6,
    "B-EMAIL": 7, "I-EMAIL": 8
}


encoded_labels = []
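
The tag set above follows the BIO scheme: `B-` marks the first token of an entity, `I-` a continuation, and `O` everything else. The sketch below shows how label lines might be encoded and how predicted ids could be grouped back into entities. It assumes each line of `labels.txt` has the form `token TAG` (inferred from `label.split()[1]` in the next cell); `encode_label_lines` and `decode_entities` are hypothetical helper names, not part of the notebook.

```python
label_encoder = {
    "O": 0, "B-PERSON": 1, "I-PERSON": 2,
    "B-ADDRESS": 3, "I-ADDRESS": 4,
    "B-PHONE": 5, "I-PHONE": 6,
    "B-EMAIL": 7, "I-EMAIL": 8,
}
id_to_label = {idx: name for name, idx in label_encoder.items()}


def encode_label_lines(lines):
    """Map 'token TAG' lines to integer label ids (assumed file format)."""
    return [label_encoder[line.split()[1]] for line in lines]


def decode_entities(tokens, label_ids):
    """Group B-/I- tagged tokens into (entity_type, text) pairs."""
    entities = []
    current_type, current_tokens = None, []
    for token, label_id in zip(tokens, label_ids):
        label = id_to_label[label_id]
        if label.startswith("B-"):
            # A new entity starts; flush the previous one, if any.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)
        else:
            # "O" tag or an inconsistent I- tag closes the open entity.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


lines = ["Jane B-PERSON", "Doe I-PERSON", "jane@example.com B-EMAIL", "Resume O"]
tokens = [line.split()[0] for line in lines]
ids = encode_label_lines(lines)
print(ids)                           # [1, 2, 7, 0]
print(decode_entities(tokens, ids))  # [('PERSON', 'Jane Doe'), ('EMAIL', 'jane@example.com')]
```
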

In [5]:
for labels in label_data:
    processed_labels = [label_encoder[label.split()[1]] for label in labels]
    encoded_labels.append(processed_labels)


# Tokenize once and reuse the result for both the labels and the inputs.
sequences = tokenizer.texts_to_sequences(text_data)
max_seq_length = max(len(seq) for seq in sequences)

# Pad (or truncate) every label list to max_seq_length so it aligns with the
# padded inputs; 0 is the "O" tag.
encoded_labels = [(labels + [0] * max_seq_length)[:max_seq_length] for labels in encoded_labels]

y_train_array = np.array(encoded_labels, dtype=np.int32)

padded_sequences = pad_sequences(sequences, maxlen=max_seq_length, padding="post")


X_train, X_val, y_train, y_val = train_test_split(padded_sequences, y_train_array, test_size=0.2, random_state=42)


model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 128, input_length=max_seq_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True)),
    tf.keras.layers.Dense(len(label_encoder), activation="softmax")
])

model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10, batch_size=32)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x15d544790>

In [12]:
model.save('/Users/suka/Downloads/Resume_deployment')

INFO:tensorflow:Assets written to: /Users/suka/Downloads/Resume_deployment/assets
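
A saved model is only usable at inference time if new text is mapped to the same integer ids that were used during training, so the tokenizer's vocabulary needs to be stored alongside the model. The following is a minimal sketch of one way to do that with plain JSON. The file name `word_index.json` and the helper names are assumptions for illustration, and the small dictionary stands in for `tokenizer.word_index`.

```python
import json
import os
import tempfile

OOV_INDEX = 1  # Keras' Tokenizer assigns index 1 to the oov_token when one is set


def save_word_index(word_index, path):
    """Write the word-to-id mapping to disk as JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(word_index, f)


def load_word_index(path):
    """Read the word-to-id mapping back from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def text_to_ids(text, word_index):
    """Map whitespace-separated words to ids, using OOV_INDEX for unknown words."""
    return [word_index.get(word, OOV_INDEX) for word in text.lower().split()]


# Stand-in for tokenizer.word_index after fitting on the training resumes.
word_index = {"<OOV>": 1, "python": 2, "engineer": 3}

path = os.path.join(tempfile.mkdtemp(), "word_index.json")  # hypothetical location
save_word_index(word_index, path)
restored = load_word_index(path)
print(text_to_ids("Python engineer Berlin", restored))  # [2, 3, 1]
```
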




### Model Evaluation:

In [6]:
loss, accuracy = model.evaluate(X_val, y_val)
print(f"Validation Loss: {loss:.4f}, Validation Accuracy: {accuracy:.4f}")

Validation Loss: 0.0574, Validation Accuracy: 0.9868


In [7]:
y_pred = model.predict(X_val)

y_pred_labels = np.argmax(y_pred, axis=2)
y_val_labels = y_val

confusion = confusion_matrix(y_val_labels.flatten(), y_pred_labels.flatten())
print("Confusion Matrix:")
print(confusion)

report = classification_report(y_val_labels.flatten(), y_pred_labels.flatten())
print("Classification Report:")
print(report)

Confusion Matrix:
[[8006    0    0    0    0    0    0    0    0]
 [   7    0    0    0    0    0    0    0    0]
 [  10    0    0    0    0    0    0    0    0]
 [   4    0    0    0    0    0    0    0    0]
 [  36    0    0    0    0    0    0    0    0]
 [   7    0    0    0    0    0    0    0    0]
 [  14    0    0    0    0    0    0    0    0]
 [   7    0    0    0    0    0    0    0    0]
 [  22    0    0    0    0    0    0    0    0]]
Classification Report:
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      8006
           1       0.00      0.00      0.00         7
           2       0.00      0.00      0.00        10
           3       0.00      0.00      0.00         4
           4       0.00      0.00      0.00        36
           5       0.00      0.00      0.00         7
           6       0.00      0.00      0.00        14
           7       0.00      0.00      0.00         7
           8       0.00      0.00      

  _warn_prf(average, modifier, msg_start, len(result))
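
The validation accuracy is easier to interpret next to the confusion matrix: every token falls in the first column, meaning the model predicts class 0 (`O`) everywhere. The short calculation below, using only the row totals printed above, shows that always predicting `O` already yields the reported accuracy, and that macro-averaged recall is far lower. It is a worked check of the printed numbers, not an additional experiment.

```python
# Row totals (support per class) from the confusion matrix above; index = label id.
support = [8006, 7, 10, 4, 36, 7, 14, 7, 22]
# Correct predictions per class (the diagonal of the confusion matrix).
correct = [8006, 0, 0, 0, 0, 0, 0, 0, 0]

total_tokens = sum(support)
majority_baseline = max(support) / total_tokens  # accuracy of always predicting "O"
entity_tokens = total_tokens - support[0]        # tokens that carry an entity tag
macro_recall = sum(c / s for c, s in zip(correct, support)) / len(support)

print(total_tokens)                # 8113
print(f"{majority_baseline:.4f}")  # 0.9868
print(entity_tokens)               # 107
print(f"{macro_recall:.4f}")       # 0.1111
```

With only 107 entity tokens out of 8113, accuracy is dominated by the `O` class, so per-class recall or F1 is the more informative metric for this tagger.
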


# -----------------------------------------------------------------------------------------

### Initializing the fields and their keywords:

In [8]:
skill_keywords_dict = {
    "Data Science": {
    "machine learning": 0.8,
    "Python": 0.7,
    "data analysis": 0.6,
    "statistical modeling": 0.5,
    "deep learning": 0.7,
    "data visualization": 0.6,
    "SQL": 0.5,
    "R": 0.4,
    "big data": 0.6,
    "natural language processing": 0.7,
    "data mining": 0.6,
    "data cleansing": 0.5,
    "feature engineering": 0.6,
    "time series analysis": 0.6,
    "predictive modeling": 0.7,
    "data preprocessing": 0.5,
    "Hadoop": 0.4,
    "Spark": 0.4,
},
    "Software Developer":{
    "programming": 0.8,
    "Java": 0.7,
    "C++": 0.6,
    "software development": 0.7,
    "coding": 0.6,
    "algorithm": 0.6,
    "web development": 0.5,
    "debugging": 0.5,
    "API": 0.4,
    "software engineering": 0.7,
    "agile methodology": 0.6,
    "version control": 0.6,
    "object-oriented programming": 0.7,
    "unit testing": 0.6,
    "database management": 0.5,
    "cloud computing": 0.5,
    "continuous integration": 0.6,
    "problem-solving": 0.7,
} ,
    "Cybersecurity": {
    "cybersecurity": 0.8,
    "network security": 0.7,
    "firewall": 0.6,
    "penetration testing": 0.7,
    "incident response": 0.6,
    "security audit": 0.5,
    "vulnerability assessment": 0.5,
    "encryption": 0.4,
    "ethical hacking": 0.7,
    "SIEM (Security Information and Event Management)": 0.6,
    "threat intelligence": 0.6,
    "identity and access management": 0.6,
    "cybersecurity policies": 0.5,
    "risk management": 0.5,
    "security awareness training": 0.4,
    "malware analysis": 0.6,
    "network monitoring": 0.7,
},
    "Web Developer": {
    "web development": 0.8,
    "HTML": 0.7,
    "CSS": 0.6,
    "JavaScript": 0.7,
    "front-end development": 0.6,
    "back-end development": 0.6,
    "React": 0.5,
    "Angular": 0.5,
    "Node.js": 0.4,
    "responsive design": 0.6,
    "RESTful API": 0.6,
    "web frameworks": 0.7,
    "UI/UX design": 0.6,
    "cross-browser compatibility": 0.5,
    "web security": 0.5,
    "performance optimization": 0.6,
    "version control (e.g., Git)": 0.7,
    "web hosting": 0.5,
    "content management systems": 0.5,
},
}
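
The weights in this dictionary could feed a per-field skill score. The sketch below is one possible way to use them, assuming a score defined as the share of total keyword weight that appears in the resume text. `weighted_keyword_score` is a hypothetical helper and the three-keyword dictionary is a shortened stand-in for one field's entry.

```python
import re


def weighted_keyword_score(text, keyword_weights):
    """Return the fraction of total keyword weight found in text (0.0 to 1.0)."""
    text = text.lower()
    total_weight = sum(keyword_weights.values())
    if total_weight == 0:
        return 0.0
    matched_weight = 0.0
    for keyword, weight in keyword_weights.items():
        # Match the keyword only when it is not embedded in a longer word,
        # so a short keyword such as "R" does not match every letter "r".
        pattern = r"(?<!\w)" + re.escape(keyword.lower()) + r"(?!\w)"
        if re.search(pattern, text):
            matched_weight += weight
    return matched_weight / total_weight


sample_keywords = {"machine learning": 0.8, "Python": 0.7, "SQL": 0.5}
resume_text = "Built machine learning pipelines in Python."
print(round(weighted_keyword_score(resume_text, sample_keywords), 2))  # 0.75
```

Keywords written with parenthetical explanations, such as `"version control (e.g., Git)"`, would only match that exact phrase under this scheme, so shorter keys are likely to work better.
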


### Resume Screening:

In [9]:
import os
import PyPDF2
import spacy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import pdfplumber
import re
import ssl
import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
from email.mime.application import MIMEApplication

In [10]:
model_path = "/Users/suka/Downloads/Resume_deployment"
model = tf.keras.models.load_model(model_path)

In [11]:
nlp = spacy.load("en_core_web_sm")

In [12]:
def extract_text_from_pdf(pdf_path):
    text = ""
    with open(pdf_path, "rb") as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        for page in pdf_reader.pages:
            text += page.extract_text()
    return text

In [13]:
email_regex = r'\S+@\S+'
phone_regex = r'\d{3}[-.\s]?\d{3}[-.\s]?\d{4}'
def extract_information(text):
    email = ""
    phone = ""

    emails = re.findall(email_regex, text)
    phones = re.findall(phone_regex, text)

    if emails:
        email = emails[0]
    if phones:
        phone = phones[0]

    return email, phone
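
The pattern `\S+@\S+` accepts any run of non-whitespace characters around an `@`, so it can capture trailing punctuation. The example below compares it with a stricter alternative on a made-up sample string; `strict_email` is an illustrative pattern rather than a full address validator.

```python
import re

loose_email = r'\S+@\S+'
strict_email = r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'
phone_regex = r'\d{3}[-.\s]?\d{3}[-.\s]?\d{4}'

sample = "Contact: jane.doe@example.com, tel. 555-123-4567."

print(re.findall(loose_email, sample))   # ['jane.doe@example.com,']
print(re.findall(strict_email, sample))  # ['jane.doe@example.com']
print(re.findall(phone_regex, sample))   # ['555-123-4567']
```

The phone pattern expects ten digits in a 3-3-4 grouping, so numbers with a country code or a different grouping would need a broader pattern.
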

In [14]:
def preprocess_text(text):
    text = text.lower()  
    doc = nlp(text)
    tokens = [token.text for token in doc if not token.is_stop and token.is_alpha]
    return " ".join(tokens)

In [15]:
def calculate_similarity(job_description, resume):
    # Bag-of-words count vectors over the combined vocabulary of both texts.
    count_matrix = CountVectorizer().fit_transform([job_description, resume])
    vectors = count_matrix.toarray()
    return cosine_similarity([vectors[0]], [vectors[1]])[0][0]
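
To make the score concrete, here is a dependency-free sketch of the same bag-of-words cosine similarity. The tokenizing regex mimics `CountVectorizer`'s default behaviour (lowercasing, tokens of two or more word characters); `bag_of_words` and `cosine` are hypothetical helper names.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase tokens of at least two word characters."""
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count mappings."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


job = bag_of_words("Python SQL Python")  # {'python': 2, 'sql': 1}
resume = bag_of_words("Python Java")     # {'python': 1, 'java': 1}
print(round(cosine(job, resume), 4))     # 0.6325, i.e. 2 / (sqrt(5) * sqrt(2))
```

Because single-character tokens are dropped by this tokenization, a skill such as "R" does not contribute to the score.
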

In [16]:
def send_email(sender_email, sender_password, recipient_email, subject, message):
    try:
        context = ssl.create_default_context()
        
        with smtplib.SMTP_SSL("smtp.gmail.com", 465, context=context) as server:
            server.login(sender_email, sender_password)
            msg = MIMEMultipart()
            msg["From"] = sender_email
            msg["To"] = recipient_email
            msg["Subject"] = subject
            msg.attach(MIMEText(message, "plain"))

            server.sendmail(sender_email, recipient_email, msg.as_string())
            print(f"Email sent to {recipient_email} successfully!")

    except Exception as e:
        print(f"Error sending email: {str(e)}")
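
Building the message separately from sending it makes the email step testable without a network connection. The sketch below assembles the same multipart message as `send_email` and rejects obviously invalid recipients, which matters when no address was extracted from a resume. `build_message` is a hypothetical helper and the addresses are placeholders.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_message(sender_email, recipient_email, subject, body):
    """Assemble a plain-text multipart message; raise if the recipient looks invalid."""
    if not recipient_email or "@" not in recipient_email:
        raise ValueError(f"Invalid recipient address: {recipient_email!r}")
    msg = MIMEMultipart()
    msg["From"] = sender_email
    msg["To"] = recipient_email
    msg["Subject"] = subject
    msg.attach(MIMEText(body, "plain"))
    return msg


msg = build_message("recruiter@example.com", "candidate@example.com",
                    "Interview invitation", "Hello")
print(msg["To"])                            # candidate@example.com
print(msg.get_payload()[0].get_payload())   # Hello
```

The returned object can then be passed to `server.sendmail(..., msg.as_string())` exactly as in the function above.
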

In [23]:
def process_resumes_for_field(selected_field, job_description, resumes_folder, count, sender_email, sender_password, email_subject, email_message):
    selected_resumes = []
    rejected_resumes = []

    skill_keywords = skill_keywords_dict.get(selected_field, {})

    for resume_file in os.listdir(resumes_folder):
        if resume_file.endswith(".pdf"):
            resume_path = os.path.join(resumes_folder, resume_file)
            resume_text = extract_text_from_pdf(resume_path)

            resume_text = resume_text.strip().lower()

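            # NOTE: a tokenizer fitted on a single resume assigns word ids that differ from the
            # training vocabulary, and the tagger's predictions below are not used in the
            # selection logic that follows.
            tokenizer = Tokenizer(oov_token="<OOV>")
            tokenizer.fit_on_texts([resume_text])
            vocab_size = len(tokenizer.word_index) + 1
            max_seq_length = 1159  # must equal the sequence length the saved model expects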
            padded_sequence = pad_sequences(tokenizer.texts_to_sequences([resume_text]), maxlen=max_seq_length, padding="post")

            predictions = model.predict(padded_sequence)

            email, phone = extract_information(resume_text)

            similarity_score = calculate_similarity(job_description, resume_text)

            resume_info = {"resume_file": resume_file, "similarity_score": similarity_score, "email": email, "phone": phone}

            if len(selected_resumes) < count:
                selected_resumes.append(resume_info)
            else:
                # Find the weakest resume currently selected.
                lowest = min(selected_resumes, key=lambda x: x["similarity_score"])
                if similarity_score > lowest["similarity_score"]:
                    # Swap it out and record the displaced resume as rejected so it is still reported.
                    selected_resumes[selected_resumes.index(lowest)] = resume_info
                    rejected_resumes.append(lowest)
                else:
                    rejected_resumes.append(resume_info)

    selected_resumes.sort(key=lambda x: x["similarity_score"], reverse=True)
    
    rejected_resumes.sort(key=lambda x: x["similarity_score"], reverse=True)
    
    print(f"\033[1m\033[4mSELECTED RESUMES FOR THE {selected_field}:\033[0m")
    for resume_info in selected_resumes:
        resume_file = resume_info["resume_file"]
        similarity_score = resume_info["similarity_score"]
        email = resume_info["email"]
        phone = resume_info["phone"]
        
        print(f"Selected Resume: {resume_file}")
        print(f"Similarity Score: {similarity_score * 100:.2f}%")
        print("Email:", email)
        print("Phone:", phone)
        print()
        
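        # Only attempt to notify candidates for whom an email address was extracted.
        if email:
            send_email(sender_email, sender_password, email, email_subject, email_message)
        else:
            print("No email address found; skipping notification.")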
        
        print("-" * 50) 

    print(f"\033[1m\033[4mREJECTED RESUMES FOR THE {selected_field}:\033[0m")
    for resume_info in rejected_resumes:
        resume_file = resume_info["resume_file"]
        similarity_score = resume_info["similarity_score"]
        email = resume_info["email"]
        phone = resume_info["phone"]
        
        print(f"Rejected Resume: {resume_file}")
        print(f"Similarity Score: {similarity_score * 100:.2f}%")
        print("Email:", email)
        print("Phone:", phone)
        print("-" * 50) 
        
        
if __name__ == "__main__":
    selected_field = input("Select a field from [Web Developer, Data Science, Cybersecurity, Software Developer]: ")
    job_description = input("Enter the job description: ")
    resumes_folder = input("Enter the folder path containing resumes: ")
    count = int(input("Enter the number of top matching resumes to select: "))
    sender_email = input("Enter your email address (sender): ")
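    # Prompt for the app password without echoing it; credentials should never be hard-coded.
    from getpass import getpass
    sender_password = getpass("Enter your email app password: ")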
    email_subject = input("Enter the email subject: ")
    email_message = input("Enter the email message: ")
    
    if selected_field not in skill_keywords_dict:
        print("Selected field is not recognized.")
    else:
        process_resumes_for_field(selected_field, job_description, resumes_folder, count, sender_email, sender_password, email_subject, email_message)

Select a field from [Web Developer, Data Science, Cybersecurity, Software Developer]: Data Science
Enter the job description: Although no two days at Accenture are the same, as an Organizational Analytics (OA) Scientist in our Talent & Organization (T&O) practice, a typical day might include:  Fetching information from various sources and analyzing it to better understand people behaviors  Run numeric simulations leveraging different statistic techniques Selecting features, building and optimizing classifiers using machine learning techniques Data mining using state-of-the-art methods Processing, cleansing, and verifying the integrity of data used for analysis Doing ad-hoc analysis and presenting results in a clear manner Doing custom analytics to deliver insights to clients Contribute to authoring of Thought leadership and research papers Contribute to innovation and new product development in the people and organization analytics space  REQUIRED EXPERIENCE/ SKILLS 1 to 4 years of exp
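
The top-n selection performed inside `process_resumes_for_field` can also be expressed with the standard library's `heapq.nlargest`, which avoids tracking the lowest selected score by hand. This is a sketch under the assumption that each resume is a dictionary with `resume_file` and `similarity_score` keys, as in the function above; `split_top_n` and the sample scores are made up for illustration.

```python
import heapq


def split_top_n(resumes, count):
    """Return (selected, rejected), both sorted by similarity score, highest first."""
    selected = heapq.nlargest(count, resumes, key=lambda r: r["similarity_score"])
    selected_files = {r["resume_file"] for r in selected}
    rejected = sorted(
        (r for r in resumes if r["resume_file"] not in selected_files),
        key=lambda r: r["similarity_score"],
        reverse=True,
    )
    return selected, rejected


resumes = [
    {"resume_file": "a.pdf", "similarity_score": 0.10},
    {"resume_file": "b.pdf", "similarity_score": 0.40},
    {"resume_file": "c.pdf", "similarity_score": 0.30},
    {"resume_file": "d.pdf", "similarity_score": 0.20},
]
selected, rejected = split_top_n(resumes, 2)
print([r["resume_file"] for r in selected])  # ['b.pdf', 'c.pdf']
print([r["resume_file"] for r in rejected])  # ['d.pdf', 'a.pdf']
```

Every resume ends up in exactly one of the two lists, including when `count` is 0 or exceeds the number of resumes.
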

### Conclusion:

In this project, I have developed a resume screening and selection system that leverages natural language processing, machine learning, and email communication to streamline the hiring process. The system demonstrates the following key functionalities:

- **Resume Extraction:** I successfully extract text from PDF resumes using PyPDF2 and pdfplumber libraries, allowing for further analysis.

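- **Information Extraction:** I use regular expressions to identify and extract email addresses and phone numbers from resumes. I also trained a bidirectional LSTM tagger to label name, address, phone, and email tokens, but on the validation set it predicted only the "O" (outside) class, so the contact details used in the screening step come from the regular expressions.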

- **Text Preprocessing:** I strip and lowercase the extracted resume text before analysis, and provide a spaCy-based helper (`preprocess_text`) that removes stop words and non-alphabetic tokens.

- **Cosine Similarity Matching:** I calculate the cosine similarity between bag-of-words count vectors of the job description and each resume, so resumes that share more vocabulary with the job description rank higher.

- **Skill-Based Resume Selection:** The system selects the top-n candidates whose resumes match the job description most closely, facilitating the shortlisting process.

- **Email Communication:** I integrate email functionality to send notifications to selected candidates, streamlining the communication process with applicants.

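- **Adaptability:** Supported job fields are defined in `skill_keywords_dict`, so new fields can be added by extending the dictionary with their skill keywords and weights. The current similarity score does not yet use these weights.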

In conclusion, my resume screening and selection system significantly reduces the manual effort required for identifying suitable candidates and contacting them for further evaluation. This automation not only saves time but also ensures a more efficient and standardized hiring process. I hope this project serves as a valuable tool for HR professionals and hiring managers in their recruitment endeavors.

Thank you for exploring my project!


# -----------------------------------------------------------------------------------------