In [1]:
pip install python-docx scikit-learn




In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import docx

In [3]:
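def read_resume(file_path):
    """Return the text of a .docx file's body paragraphs, one per line."""
    doc = docx.Document(file_path)
    # doc.paragraphs covers body paragraphs only; text inside tables is not included
    return '\n'.join(para.text for para in doc.paragraphs)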

In [4]:
def calculate_similarity(resume_text, job_description):
    """Return the TF-IDF cosine similarity between two texts."""
    # Convert the text into TF-IDF features
    documents = [resume_text, job_description]
    # Despite its name, this variable holds the TF-IDF matrix (one row per document), not the vectorizer
    tfidf_vectorizer = TfidfVectorizer().fit_transform(documents)
    # Calculate the cosine similarity between the resume and job description
    cosine_sim = cosine_similarity(tfidf_vectorizer[0:1], tfidf_vectorizer[1:2])
    return cosine_sim[0][0]

**Calculating Similarity**:
The `calculate_similarity` function uses TF-IDF vectorization to convert the resume and job description into numerical vectors.
It then calculates the cosine similarity between the two vectors, which gives a score between 0 and 1: a score of 1 means the two texts use the same terms in the same proportions, and 0 means they share no terms at all.
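
As a quick sanity check of the two extremes, the sketch below restates the function in compact form and runs it on made-up strings (the example texts are illustrative, not taken from the resume). It also shows that word order does not matter, because TF-IDF treats each text as a bag of words.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def calculate_similarity(resume_text, job_description):
    # Same logic as the notebook's function, written compactly.
    tfidf = TfidfVectorizer().fit_transform([resume_text, job_description])
    return cosine_similarity(tfidf[0:1], tfidf[1:2])[0][0]


# Identical texts: the score should be (numerically) 1.
same = calculate_similarity("data analytics sql", "data analytics sql")

# No term in common: the score should be 0.
disjoint = calculate_similarity("data analytics sql", "marketing bidding strategy")

# Same words in a different order: still (numerically) 1.
reordered = calculate_similarity("sql data analytics", "analytics sql data")

print(same, disjoint, reordered)
```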

In [5]:
resume_path = '/content/CV Rahul Setia BI&A June.docx'
resume_text = read_resume(resume_path)

In [6]:
print(resume_text)



Rahul Setia
Business Intelligence & Data Analytics
Seven years of expertise in data analytics and modelling, business intelligence, client communications, and reporting. Instrumental in driving project success at leading companies from sectors like Steel, Automotive, Construction, and Education. Collaborated with 50+ stakeholders across 10+ projects. Proﬁcient at converting complex business needs into effective data-layer solutions, delivering impactful results, and contributing to attaining organisational goals.
rahulsetia25@gmail.com


9643661480


Gurugram


linkedin.com/in/rahul-setia




AREA OF EXPERTISE


    

     

   

TECHNICAL SKILLS


Business Intelligence & Analytics
Advance Excel, Power BI, SAP Analytics Cloud
Data Transformation
Alteryx, Azure SQL


Coding Language	SQL, Python	Productivity Tools	MS Ofﬁce & Google Tools


Cloud	Microsoft Azure	Forecasting Models
ARIMA, Croston TSB, Holt-Winters


CERTIFICATES

Google Data Analytics	AZ-900 Microsoft Azure Fundamentals 

In [7]:
job_description = input("Enter the job description:")

Enter the job description:Drive expertise in Audience segmentation and Custom targeting leveraging Amex & Partner data assets. Optimize performance, drive efficient growth through personalized acquisition incentive, bidding strategies and Test & Learn program for various prospect segments. Enable robust measurement and optimize performance of Paid search through Attribution modelling and inform optimization of key KPIs. Collaborating to create world class analytical solutions and developing future-proof identity solution for performance marketing use-cases. Develop and foster cross functional relationships with partners across American Express to ensure project prioritization and implementation. Work in a dynamic, fast changing environment, with attention to detail and effective communication skills.


In [8]:
similarity_score = calculate_similarity(resume_text, job_description)
print(f"The similarity score between the resume and job description is: {similarity_score:.2f}")

The similarity score between the resume and job description is: 0.37


Let's break down the `calculate_similarity` function in detail to understand the technical aspects of how it computes the similarity score between a resume and a job description.

### 1. **Text Representation Using TF-IDF**

#### Term Frequency-Inverse Document Frequency (TF-IDF)
- **Term Frequency (TF):** This measures how frequently a term (word) appears in a document. The idea is that the more frequently a term appears in a document, the more important it is. However, simply counting the frequency might not be enough because some words (like "the", "and", etc.) are common across many documents.
  
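- **Inverse Document Frequency (IDF):** To address the issue of common words, IDF scales down the importance of words that appear in many documents of a collection and assigns higher importance to words that appear in only a few. In this notebook the collection consists of just two documents (the resume and the job description), so IDF can only distinguish terms that appear in both texts from terms that appear in one.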
  
- **TF-IDF:** This is the product of TF and IDF. TF-IDF gives a score to each word in a document, where higher scores indicate words that are important in that document but not common across other documents. This helps in distinguishing the meaningful words in the text.
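
As a minimal sketch of the arithmetic, the pure-Python snippet below computes TF-IDF weights for two made-up token lists. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalization that scikit-learn's `TfidfVectorizer` applies by default; the toy documents and the helper name `tfidf_weights` are illustrative only.

```python
import math
from collections import Counter

# Two toy "documents", already split into tokens (made-up data).
docs = [
    "sql python sql reporting".split(),
    "sql marketing".split(),
]
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for doc in docs for term in set(doc))


def idf(term):
    # Smoothed IDF (assumed to match scikit-learn's default settings).
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1


def tfidf_weights(doc):
    counts = Counter(doc)  # term frequency = raw count
    raw = {term: count * idf(term) for term, count in counts.items()}
    # L2-normalize so the document vector has length 1.
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}


for doc in docs:
    print({term: round(w, 3) for term, w in tfidf_weights(doc).items()})
```

In the second toy document both words occur once, yet "marketing" receives a higher weight than "sql" because "sql" appears in both documents.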

**Vectorization**:
- **TfidfVectorizer()**: In the `calculate_similarity` function, `TfidfVectorizer()` is used to convert the text of the resume and job description into a matrix of TF-IDF features.
  
  ```python
  tfidf_vectorizer = TfidfVectorizer().fit_transform(documents)
  ```
  - **`documents`:** This is a list containing the resume text and the job description text.
  - The result, `tfidf_vectorizer`, is (despite its name) a sparse matrix rather than the vectorizer object. Each row represents a document (resume or job description) and each column represents a unique word (term) in the combined text of the documents.
  - Each cell in this matrix contains the TF-IDF score for a particular term in a particular document.

### 2. **Cosine Similarity**

#### Concept of Cosine Similarity:
- **Cosine Similarity** is a measure of similarity between two non-zero vectors. It calculates the cosine of the angle between two vectors in a multidimensional space.
- In the context of text analysis, these vectors represent documents, and their dimensions correspond to the TF-IDF scores of terms.
- The cosine of 0° is 1, meaning the two vectors point in the same direction (the documents have the same relative term weights); the cosine of 90° is 0, meaning the vectors are orthogonal (the documents share no terms).

**Formula**:
\[ \text{Cosine Similarity} = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} \]
Where:
- \( A \cdot B \) is the dot product of vectors \( A \) (resume) and \( B \) (job description).
- \( \lVert A \rVert \) and \( \lVert B \rVert \) are the magnitudes (lengths) of the vectors.
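
The formula can be sketched in plain Python as follows. The helper name `cosine` and the example vectors are illustrative; the notebook itself relies on scikit-learn's `cosine_similarity`.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))    # A · B
    norm_a = math.sqrt(sum(x * x for x in a))  # length of A
    norm_b = math.sqrt(sum(y * y for y in b))  # length of B
    return dot / (norm_a * norm_b)


# Same direction, different length: approximately 1.
parallel = cosine([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])

# No shared non-zero dimension: 0.
orthogonal = cosine([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])

print(parallel, orthogonal)
```

Note that the first pair scores about 1 even though the vectors differ in length, which is why a long resume and a short job description can still be compared.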

**Application in the Code**:
- After generating the TF-IDF matrix, the next step is to calculate the cosine similarity between the resume vector and the job description vector.

  ```python
  cosine_sim = cosine_similarity(tfidf_vectorizer[0:1], tfidf_vectorizer[1:2])
  ```
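  - Here, `tfidf_vectorizer[0:1]` refers to the vector representation of the resume. The slice `0:1` selects the first row as a 1×n matrix, which is the two-dimensional input that `cosine_similarity` expects.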
  - `tfidf_vectorizer[1:2]` refers to the vector representation of the job description.
  - `cosine_similarity` computes the cosine similarity between these two vectors.
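
The sketch below, on made-up documents, shows the shape of the sliced rows. It also illustrates a consequence of `TfidfVectorizer`'s default L2 normalization: each row already has length 1, so the cosine similarity equals the plain dot product, computed here with scikit-learn's `linear_kernel`. The variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy documents (made up) with two shared terms.
documents = ["python sql reporting", "sql reporting marketing"]
tfidf = TfidfVectorizer().fit_transform(documents)

resume_vec = tfidf[0:1]  # first row, still two-dimensional
job_vec = tfidf[1:2]     # second row, still two-dimensional
print(resume_vec.shape)  # (1, number of unique terms)

cos = cosine_similarity(resume_vec, job_vec)[0][0]
dot = linear_kernel(resume_vec, job_vec)[0][0]  # plain dot product

# With unit-length rows the two values agree up to rounding.
print(cos, dot)
```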

### 3. **Output**

- The function returns the cosine similarity score as a single value. Because TF-IDF weights are never negative, the score lies between 0 and 1.
  
  ```python
  return cosine_sim[0][0]
  ```

  - `cosine_sim[0][0]` extracts the similarity score from the resulting similarity matrix, which is a 1x1 matrix (since we're comparing just two documents).

### Summary
- **TF-IDF** converts the textual content of both the resume and job description into numerical vectors that capture the importance of terms.
- **Cosine Similarity** then compares these vectors to produce a similarity score, indicating how closely the resume matches the job description based on their content.

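This score gives a quantitative measure of how much weighted vocabulary the resume shares with the job description. Because it is based on exact term overlap, it does not capture synonyms or related skills that are phrased differently.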