
# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Use the provided *.ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of Friday, at 11:59 PM.

**Late submissions incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**

**Please check that the link you submitted can be opened and points to the correct assignment.**

## Question 1 (10 Points)
Describe an interesting **text classification or text mining task** and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features. **Your dataset must be text.**

In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Job resume classification sorts resumes into employment positions or industries based on the candidate's skills, experience, and credentials.
For example, resumes might be classified as "Software Engineer," "Data Scientist," "Marketing Manager," or "Sales Executive." This can help automate and speed up recruiting, allowing hiring managers to match candidates to suitable positions more easily.

To build a machine learning model for resume categorisation, we must first extract informative features from the resume text itself. Here are five feature types that would be useful for this task:

1) Skills and Keywords:
Description: Identify relevant skills (e.g., Python, Java, project management, machine learning) and domain-specific keywords.
Why it is useful: Skills and keywords are strong indicators of job fit. A resume that mentions "data analysis" and "Python" is more likely to match a Data Scientist post, whereas "marketing strategy" and "SEO" point to a Marketing Manager position. These keywords help the model recognise the candidate's core competencies.

2) Work Experience (Job Titles):
Description: Extract job titles from the candidate's prior employment (for example, "Software Engineer," "Data Analyst," and "Marketing Coordinator").
Why it is useful: Job titles give direct information about a candidate's professional history, and the model can use them to match the resume to an appropriate category. A resume with the title "Project Manager" would be more appropriate for a management position.

3) Education and Certifications:
Description: Extract the educational background ("Bachelor's in Computer Science," "MBA") and credentials ("Certified Scrum Master," "AWS Certified Solutions Architect").
Why it is useful: Degrees and certifications often indicate a candidate's subject knowledge and readiness for a role. A degree in Computer Science, for example, is better suited to software engineering, whilst an MBA is better suited to management roles.

4) Years of Experience:
Description: Extract the total number of years of professional experience the candidate has.
Why it is useful: Seniority can be inferred from years of experience. A person with 10+ years of experience is more likely to be a senior-level professional or manager, while someone with 1-3 years of experience might be categorized into junior or entry-level roles.

5) Language Proficiency:
Description: Identify the languages in which the candidate is proficient (for example, English, Spanish, and Mandarin).
Why it is useful: Some occupations or sectors require proficiency in specific languages. For example, a resume citing Mandarin fluency may be matched to roles that need communication with Chinese-speaking clients or stakeholders. It is also relevant for translation and localisation jobs.

'''

## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [None]:
# Your code here (Please add comments in the code):
!pip install spacy
!pip install nltk
!pip install pandas

import spacy
import re
import nltk
from collections import Counter


nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords


nlp = spacy.load("en_core_web_sm")


resumes = [
    """
    John Doe
    Software Engineer with 5 years of experience in Python, Java, and software development.
    Worked at XYZ Corp as a Senior Software Developer. Skilled in Machine Learning, Data Analysis, and Cloud Computing.
    Holds a Bachelor's degree in Computer Science from Stanford University. Certified AWS Solutions Architect.
    """,
    """
    Jane Smith
    Marketing Manager with 7 years of experience in digital marketing, SEO, and content strategy.
    Managed a team of 5 at ABC Ltd. Expertise in market research, social media marketing, and brand management.
    Holds an MBA from Harvard Business School. Fluent in Spanish and English.
    """,
]


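def extract_features(resume_text):
    """Extract the five feature types from one resume and return them as a dict."""
    # Run the spaCy pipeline once so the named entities can be reused below
    doc = nlp(resume_text)

    # 1) Skills: case-sensitive substring match against a small keyword list
    keywords = ["Python", "Java", "Machine Learning", "SEO", "Cloud Computing", "Marketing", "Data Analysis", "AWS"]
    extracted_skills = [kw for kw in keywords if kw in resume_text]

    # 2) Job titles: collected from ORG entities. spaCy's ORG label marks
    #    organisations, so this list holds employer and institution names
    #    rather than actual titles.
    job_titles = []
    for ent in doc.ents:
        if ent.label_ == "ORG":
            job_titles.append(ent.text)

    # 3) Education and certifications, both taken from entity text
    education = []
    certifications = []
    for ent in doc.ents:
        # Parentheses make the ORG check apply to both "University" and "School"
        if ent.label_ == "ORG" and ("University" in ent.text or "School" in ent.text):
            education.append(ent.text)
        # Only entity spans are checked, so phrases spaCy does not tag are missed
        if "Certified" in ent.text or "Certification" in ent.text:
            certifications.append(ent.text)

    # 4) Years of experience: largest number in an "N years of experience" phrase
    years_of_experience = re.findall(r"(\d+) years of experience", resume_text)
    if years_of_experience:
        years_of_experience = max(map(int, years_of_experience))
    else:
        years_of_experience = "Unknown"

    # 5) Languages: substring match against a fixed list
    languages = ["English", "Spanish", "Mandarin", "French", "German"]
    language_proficiency = [lang for lang in languages if lang in resume_text]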

    return {
        "Skills": extracted_skills,
        "Job Titles": job_titles,
        "Education": education,
        "Certifications": certifications,
        "Years of Experience": years_of_experience,
        "Language Proficiency": language_proficiency
    }


for idx, resume in enumerate(resumes):
    print(f"Resume {idx + 1}:")
    features = extract_features(resume)
    for feature, value in features.items():
        print(f"{feature}: {value}")
    print("\n")










Resume 1:
Skills: ['Python', 'Java', 'Machine Learning', 'Cloud Computing', 'Data Analysis', 'AWS']
Job Titles: ['XYZ Corp', 'Data Analysis', 'Bachelor', 'Computer Science', 'Stanford University']
Education: ['Stanford University']
Certifications: []
Years of Experience: 5
Language Proficiency: []


Resume 2:
Skills: ['SEO', 'Marketing']
Job Titles: ['ABC Ltd. Expertise', 'Harvard Business School']
Education: ['Harvard Business School']
Certifications: []
Years of Experience: 7
Language Proficiency: ['English', 'Spanish']
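
In the output above, "Job Titles" lists organisations and "Certifications" is empty, because both rely on spaCy entity spans. A minimal alternative sketch, assuming a small hand-written title vocabulary and a simple "Certified ..." pattern (`JOB_TITLE_PATTERNS`, `extract_job_titles`, and `extract_certifications` are hypothetical names, not part of the original cell), could match the text directly:

```python
import re

# Hypothetical title vocabulary; a real system would need a much larger list.
JOB_TITLE_PATTERNS = [
    "Senior Software Developer",
    "Software Engineer",
    "Marketing Manager",
    "Data Scientist",
]


def extract_job_titles(text):
    """Return the known job titles found in the text, in vocabulary order."""
    found = []
    for title in JOB_TITLE_PATTERNS:
        # Word boundaries stop a title from matching inside a longer word.
        if re.search(rf"\b{re.escape(title)}\b", text, flags=re.IGNORECASE):
            found.append(title)
    return found


def extract_certifications(text):
    """Capture the capitalised phrase that follows the word 'Certified'."""
    pattern = r"Certified\s+([A-Z][\w ]+?)(?=[.,\n]|$)"
    return [match.strip() for match in re.findall(pattern, text)]


sample = (
    "Software Engineer with 5 years of experience in Python. "
    "Worked at XYZ Corp as a Senior Software Developer. "
    "Certified AWS Solutions Architect."
)
print(extract_job_titles(sample))
print(extract_certifications(sample))
```

This trades recall for precision: only titles in the list are found, and certifications phrased differently (for example "AWS Certified Solutions Architect") would need an additional pattern.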




## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above, and rank them by importance in descending order.

In [None]:
# Your code here (Please add comments in the code):
import pandas as pd
from sklearn.feature_selection import chi2
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

# Sample resumes and corresponding job roles (target variable)
data = [
    {
        "resume": """
        John Doe
        Software Engineer with 5 years of experience in Python, Java, and software development.
        Worked at XYZ Corp as a Senior Software Developer. Skilled in Machine Learning, Data Analysis, and Cloud Computing.
        Holds a Bachelor's degree in Computer Science from Stanford University. Certified AWS Solutions Architect.
        """,
        "job_role": "Software Engineer"
    },
    {
        "resume": """
        Jane Smith
        Marketing Manager with 7 years of experience in digital marketing, SEO, and content strategy.
        Managed a team of 5 at ABC Ltd. Expertise in market research, social media marketing, and brand management.
        Holds an MBA from Harvard Business School. Fluent in Spanish and English.
        """,
        "job_role": "Marketing Manager"
    },
    {
        "resume": """
        Jack Brown
        Data Scientist with 4 years of experience in Python, R, and machine learning algorithms.
        Worked at DEF Inc. on building predictive models and data-driven solutions. Holds a Master's degree in Data Science from MIT.
        """,
        "job_role": "Data Scientist"
    }
]

# Define features based on the previously extracted features.
# For simplicity, they are hand-coded as binary indicators plus a numeric years_experience column
resumes_data = pd.DataFrame([
    {"skills_python": 1, "skills_java": 1, "skills_seo": 0, "skills_data_analysis": 1, "skills_machine_learning": 1,
     "cert_aws": 1, "education_cs": 1, "years_experience": 5, "language_english": 1, "language_spanish": 0, "job_role": "Software Engineer"},

    {"skills_python": 0, "skills_java": 0, "skills_seo": 1, "skills_data_analysis": 0, "skills_machine_learning": 0,
     "cert_aws": 0, "education_cs": 0, "years_experience": 7, "language_english": 1, "language_spanish": 1, "job_role": "Marketing Manager"},

    {"skills_python": 1, "skills_java": 0, "skills_seo": 0, "skills_data_analysis": 1, "skills_machine_learning": 1,
     "cert_aws": 0, "education_cs": 1, "years_experience": 4, "language_english": 1, "language_spanish": 0, "job_role": "Data Scientist"}
])

# Encoding the target variable (job roles)
label_encoder = LabelEncoder()
resumes_data['job_role_encoded'] = label_encoder.fit_transform(resumes_data['job_role'])

# Independent variables (features) and target variable (job role)
X = resumes_data.drop(columns=['job_role', 'job_role_encoded'])
y = resumes_data['job_role_encoded']

# Perform Chi-Square test
chi2_values, p_values = chi2(X, y)

# Create a dataframe to rank features based on Chi-Square scores
chi2_df = pd.DataFrame({
    'Feature': X.columns,
    'Chi2 Score': chi2_values,
    'P-value': p_values
})

# Sort by Chi2 score in descending order
chi2_df = chi2_df.sort_values(by='Chi2 Score', ascending=False)

print(chi2_df)






                   Feature  Chi2 Score   P-value
1              skills_java       2.000  0.367879
2               skills_seo       2.000  0.367879
5                 cert_aws       2.000  0.367879
9         language_spanish       2.000  0.367879
0            skills_python       1.000  0.606531
3     skills_data_analysis       1.000  0.606531
4  skills_machine_learning       1.000  0.606531
6             education_cs       1.000  0.606531
7         years_experience       0.875  0.645649
8         language_english       0.000  1.000000
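
To make the scores in the table easier to interpret, here is a minimal sketch that recomputes the chi-square statistic by hand with NumPy, assuming the same observed-versus-expected formulation used for non-negative features. The helper `chi2_score` is a hypothetical name introduced for illustration; it is not a scikit-learn function.

```python
import numpy as np


def chi2_score(feature, labels):
    """Chi-square statistic for one non-negative feature against class labels."""
    feature = np.asarray(feature, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # Observed: total feature value accumulated within each class.
    observed = np.array([feature[labels == c].sum() for c in classes])
    # Expected: the feature total split in proportion to class frequency.
    class_prob = np.array([(labels == c).mean() for c in classes])
    expected = class_prob * feature.sum()
    return float(((observed - expected) ** 2 / expected).sum())


labels = ["Software Engineer", "Marketing Manager", "Data Scientist"]
print(chi2_score([1, 0, 0], labels))  # skills_java column
print(chi2_score([1, 0, 1], labels))  # skills_python column
print(chi2_score([5, 7, 4], labels))  # years_experience column
```

A feature that appears in only one class scores highest, one shared by two classes scores lower, and `language_english`, present in every resume, scores zero. With only three samples, every p-value in the table is well above 0.05, so the ranking is illustrative rather than statistically reliable.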


## Question 4 (10 points):
Write Python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [None]:
!pip install transformers
!pip install torch
!pip install scikit-learn
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Load BERT model and tokenizer from transformers library
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to get BERT embeddings for a given text
def get_bert_embedding(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the mean of the last hidden state for embedding representation
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

# Sample resumes data
resumes = [
    """
    John Doe
    Software Engineer with 5 years of experience in Python, Java, and software development.
    Worked at XYZ Corp as a Senior Software Developer. Skilled in Machine Learning, Data Analysis, and Cloud Computing.
    Holds a Bachelor's degree in Computer Science from Stanford University. Certified AWS Solutions Architect.
    """,
    """
    Jane Smith
    Marketing Manager with 7 years of experience in digital marketing, SEO, and content strategy.
    Managed a team of 5 at ABC Ltd. Expertise in market research, social media marketing, and brand management.
    Holds an MBA from Harvard Business School. Fluent in Spanish and English.
    """
]

# Query that we want to match the resumes with
query = "Looking for a Software Engineer with experience in Python, Java, and Machine Learning. Cloud computing experience preferred."

# Get BERT embeddings for the query and each resume
query_embedding = get_bert_embedding(query)
resume_embeddings = [get_bert_embedding(resume) for resume in resumes]

# Calculate cosine similarity between the query and each resume
similarities = [cosine_similarity([query_embedding], [resume_embedding])[0][0] for resume_embedding in resume_embeddings]

# Rank resumes by similarity (in descending order)
ranked_indices = np.argsort(similarities)[::-1]

# Display ranked resumes
for idx in ranked_indices:
    print(f"Resume {idx + 1} (Similarity: {similarities[idx]:.4f}):\n{resumes[idx]}\n")






Resume 1 (Similarity: 0.8838):

    John Doe
    Software Engineer with 5 years of experience in Python, Java, and software development. 
    Worked at XYZ Corp as a Senior Software Developer. Skilled in Machine Learning, Data Analysis, and Cloud Computing. 
    Holds a Bachelor's degree in Computer Science from Stanford University. Certified AWS Solutions Architect.
    

Resume 2 (Similarity: 0.7768):

    Jane Smith
    Marketing Manager with 7 years of experience in digital marketing, SEO, and content strategy. 
    Managed a team of 5 at ABC Ltd. Expertise in market research, social media marketing, and brand management.
    Holds an MBA from Harvard Business School. Fluent in Spanish and English.
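
The ranking step itself is independent of BERT: once each text is a vector, the scores come from cosine similarity. A minimal NumPy sketch, using toy 3-dimensional vectors in place of the 768-dimensional embeddings (`rank_by_cosine` and the vectors are hypothetical, chosen only for illustration):

```python
import numpy as np


def rank_by_cosine(query_vec, doc_vecs):
    """Return (index, similarity) pairs sorted from most to least similar."""
    q = np.asarray(query_vec, dtype=float)
    d = np.asarray(doc_vecs, dtype=float)
    # Cosine similarity: dot product divided by the product of the norms.
    sims = d @ q / (np.linalg.norm(d, axis=1) * np.linalg.norm(q))
    order = np.argsort(sims)[::-1]
    return [(int(i), float(sims[i])) for i in order]


# Toy vectors standing in for BERT embeddings.
query = [1.0, 1.0, 0.0]
docs = [[0.0, 1.0, 1.0], [2.0, 2.0, 0.0]]
for idx, sim in rank_by_cosine(query, docs):
    print(f"Document {idx + 1}: {sim:.4f}")
```

Because cosine similarity ignores vector length, the second toy document ranks first even though it is a scaled copy of the query. Mean-pooled BERT embeddings of unrelated texts often still score fairly high, as the 0.7768 for the marketing resume suggests, so the relative order is more informative than the absolute values.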
    



# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [4]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:
Learning experience: Working on resume categorisation gave me practical experience in using NLP techniques to extract relevant features from text. Key concepts such as text preprocessing (noise removal, tokenisation, and lemmatisation), Named Entity Recognition (NER) for extracting job titles and degrees, and vectorisation techniques such as Bag of Words, TF-IDF, and word embeddings (Word2Vec, GloVe) were critical in feature extraction. Experimenting with classification techniques such as logistic regression, SVM, and neural networks also helped to develop the model, emphasising the need to choose appropriate methods and tune hyperparameters for successful resume categorisation.
Challenges encountered: One issue in resume classification was job title ambiguity, where generic titles like "Consultant" may span many industries, making categorisation difficult. Handling unstructured data in diverse formats (PDF, Word, and plain text) made consistent feature extraction harder. Imbalanced data also introduced bias, since some employment types were under-represented, which complicated model training. Furthermore, handling closely related skills or job titles, such as "Data Scientist" and "Data Analyst," needs considerable attention to keep resume categorisation accurate.
Relevance to field of study: This exercise is highly relevant to NLP, particularly in terms of feature extraction and text categorisation. Resume classification is a real-world problem in which NLP methods such as tokenisation, named entity recognition, and text vectorisation are vital. It is a form of document categorisation, a common NLP task, and the same approach may be extended to other domains such as email filtering, sentiment analysis, and chatbot building. Automating the resume screening process contributes to an important area of NLP applications in the recruiting and HR industries.
As I have a background in information science, I can use my knowledge to improve data processing, feature extraction, and model construction, all of which are necessary for developing a strong resume categorisation system.


'''
