#  Data Analyst Jobs Dataset

###### A. Search for a fitting open-source dataset or document collection for analyzing the impact of stemming on an inverted index.

Kaggle link for dataset = https://www.kaggle.com/datasets/andrewmvd/data-analyst-jobs

This dataset has more than 2000 job listings for the position of data analyst. For our assignment, the job description is of particular interest, because it lets us find relevant jobs and the skills that are most commonly expected. Job descriptions can be quite long, and descriptions for the same role tend to share many identical or similar words.

For example: "The job requires one to be self-motivated, organised, and have a keen eye for detail. In this organisation, you are expected to attend collaborative events that the HR department is organising every quarter."
This example has several variations of the word "organise" (organised, organisation, organising), all of which can be stemmed to "organis", so the inverted index needs one entry for them instead of three.
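
The effect on vocabulary size can be sketched with a toy suffix stripper. This is only an illustration: `toy_stem` and its suffix list are made up for this example and are far simpler than the Porter stemmer used later in the notebook.

```python
# Toy suffix stripper for illustration only; this is NOT the Porter algorithm.
SUFFIXES = ("ation", "ing", "ed", "e")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["organise", "organised", "organising", "organisation"]
stems = {word: toy_stem(word) for word in words}

vocabulary_before = set(words)          # one index entry per surface form
vocabulary_after = set(stems.values())  # one index entry per stem

print(stems)
print(len(vocabulary_before), "entries before,", len(vocabulary_after), "after")
```

All four surface forms map to the same stem here, so four index entries collapse into one.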

In [25]:
# Importing necessary libraries and packages

import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from collections import defaultdict
import math
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re

import warnings
warnings.filterwarnings('ignore')

# Downloading more resources if not already downloaded
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\pc\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\pc\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [26]:
# Kaggle link for dataset = https://www.kaggle.com/datasets/andrewmvd/data-analyst-jobs?resource=download
# Reading the dataset from local path
df = pd.read_csv('DataAnalyst.csv')

# Exploring the data, column type etc.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2253 entries, 0 to 2252
Data columns (total 16 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         2253 non-null   int64  
 1   Job Title          2253 non-null   object 
 2   Salary Estimate    2253 non-null   object 
 3   Job Description    2253 non-null   object 
 4   Rating             2253 non-null   float64
 5   Company Name       2252 non-null   object 
 6   Location           2253 non-null   object 
 7   Headquarters       2253 non-null   object 
 8   Size               2253 non-null   object 
 9   Founded            2253 non-null   int64  
 10  Type of ownership  2253 non-null   object 
 11  Industry           2253 non-null   object 
 12  Sector             2253 non-null   object 
 13  Revenue            2253 non-null   object 
 14  Competitors        2253 non-null   object 
 15  Easy Apply         2253 non-null   object 
dtypes: float64(1), int64(2),

In [27]:
# Checking for null values in the columns
df.isnull().sum()

Unnamed: 0           0
Job Title            0
Salary Estimate      0
Job Description      0
Rating               0
Company Name         1
Location             0
Headquarters         0
Size                 0
Founded              0
Type of ownership    0
Industry             0
Sector               0
Revenue              0
Competitors          0
Easy Apply           0
dtype: int64

## PRE PROCESSING THE DATA

Preprocessing: we preprocess the data for the following reasons.

- Data Quality Improvement
- Identification of new features
- Handling missing data
- Outlier Detection
- Noise reduction


For this assignment, we consider only the Job Description column of the dataset to demonstrate stemming and its impact on the inverted index. Since the Job Description column has no NaN or null values, as seen above, we proceed with the following preprocessing steps:
- Remove rows whose description is purely numeric
- Convert the text to lower case
- Remove email IDs using a regular expression
- Remove URLs using a regular expression
- Remove HTML tags using a regular expression
- Replace non-alphabetic characters with spaces using a regular expression
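
As a rough sketch of what these steps do to a single string, the snippet below applies them with Python's `re` module. The patterns are simplified stand-ins rather than the ones used in the next cell, `clean_text` and the sample sentence are hypothetical, and the final whitespace collapse is an extra step added here for readability.

```python
import re

def clean_text(text):
    """Apply the cleaning steps to one string, using simplified patterns."""
    text = text.lower()
    text = re.sub(r"\S+@\S+", " ", text)        # email addresses
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"<[^>]+>", " ", text)        # HTML tags
    text = re.sub(r"[^a-z]", " ", text)         # anything non-alphabetic
    return " ".join(text.split())               # collapse repeated spaces (extra step)

sample = ("Email jobs@example.com or visit https://example.com/careers "
          "<br> Requires 3+ years of SQL.")
print(clean_text(sample))
```

The order matters: emails, URLs and tags are removed while their punctuation is still intact, because the final non-alphabetic replacement would otherwise break them into ordinary-looking words such as `example` and `com`.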

In [28]:
# Remove rows whose 'Job Description' is purely numeric (float or int).
# regex=False treats '.' as a literal dot on every pandas version, and .copy()
# avoids SettingWithCopyWarning when columns are added to `data` later.
data = df[~df['Job Description'].astype(str).str.replace('.', '', n=1, regex=False).str.isnumeric()].copy()

# Convert all text into lower case
df_text = data['Job Description'].str.lower()

# Remove email ids from the text (raw strings keep the regex backslashes intact;
# the character classes are written as [._-] and [A-Za-z] to avoid accidental ranges)
df_text = df_text.replace({r'<?([A-Za-z0-9]+[._-])*[A-Za-z0-9]+@[A-Za-z0-9-]+(\.[A-Za-z]{2,})+>?(\s\([A-Za-z ]*\))?': ''}, regex=True)

# Remove hyperlinks from the text
df_text = df_text.replace({r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+': ''}, regex=True)

# Remove html tags
df_text = df_text.replace({r'<.*?>': ''}, regex=True)

# Replace every non-alphabetic character with a space
df_text = df_text.replace({r'[^A-Za-z]': ' '}, regex=True)


In [29]:
#print(df_text)

# Print sample data
query_data=df_text[10]
query_data

'nyu grossman school of medicine is one of the nation s top ranked medical schools  for     years  nyu grossman school of medicine has trained thousands of physicians and scientists who have helped to shape the course of medical history and enrich the lives of countless people  an integral part of nyu langone health  the grossman school of medicine at its core is committed to improving the human condition through medical education  scientific research  and direct patient care  for more information  go to med nyu edu  and interact with us on facebook  twitter and instagram   position summary   we have an exciting opportunity to join our team as a data analyst   the data analyst will work directly with dr  horwitz  the director of the division of healthcare delivery science and the center for healthcare innovations and delivery science  as well as the rapid rct lab  we are seeking a qualified data analyst to provide support for data management and analyses across the rapid rct portfolio 

In [30]:
# Remove stop words from the text
STOP_WORDS = set(stopwords.words('english'))  # built once and reused for every row

def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in STOP_WORDS]
    return ' '.join(filtered_words)


# Applying the function on the entire dataset
df_text = df_text.apply(remove_stopwords)  # remove stop words
data['description after removal of stopwords'] = df_text

In [31]:
# Applying the function on the sample data - to print
query_data_after_removal_of_stopwords= remove_stopwords(query_data)
print("Query Text before stopword removal:\n")
print(query_data)
print("\n--------------------------------------------------------------------------------------------------------\n")
print("Query Text after stopwords removal:\n")
print(query_data_after_removal_of_stopwords)

Query Text before stopword removal:

nyu grossman school of medicine is one of the nation s top ranked medical schools  for     years  nyu grossman school of medicine has trained thousands of physicians and scientists who have helped to shape the course of medical history and enrich the lives of countless people  an integral part of nyu langone health  the grossman school of medicine at its core is committed to improving the human condition through medical education  scientific research  and direct patient care  for more information  go to med nyu edu  and interact with us on facebook  twitter and instagram   position summary   we have an exciting opportunity to join our team as a data analyst   the data analyst will work directly with dr  horwitz  the director of the division of healthcare delivery science and the center for healthcare innovations and delivery science  as well as the rapid rct lab  we are seeking a qualified data analyst to provide support for data management and anal

In [32]:
# Tokenise the text 
def tokenize(text):
    return word_tokenize(str(text))

# Applying the tokenise function to the entire dataset
data['preprocessed description']=data['description after removal of stopwords'].apply(lambda desc: tokenize(desc))

In [33]:
# Applying the tokenise function to the sample data - to print
tokens_query_text = tokenize(query_data)
print("Query Text before tokenization:\n")
print(query_data)
print("\n--------------------------------------------------------------------------------------------------------\n")
print("Query Text after tokenization:\n")
print(tokens_query_text)

Query Text before tokenization:

nyu grossman school of medicine is one of the nation s top ranked medical schools  for     years  nyu grossman school of medicine has trained thousands of physicians and scientists who have helped to shape the course of medical history and enrich the lives of countless people  an integral part of nyu langone health  the grossman school of medicine at its core is committed to improving the human condition through medical education  scientific research  and direct patient care  for more information  go to med nyu edu  and interact with us on facebook  twitter and instagram   position summary   we have an exciting opportunity to join our team as a data analyst   the data analyst will work directly with dr  horwitz  the director of the division of healthcare delivery science and the center for healthcare innovations and delivery science  as well as the rapid rct lab  we are seeking a qualified data analyst to provide support for data management and analyses

In [34]:
# Helper function to print data before stemming

def join_text(tokens):
    """Join a list of tokens back into one space-separated string."""
    return ' '.join(tokens)

###### B. a) Create a Python function that applies stemming to a set of words from the chosen dataset. Provide examples before and after stemming. Discuss how stemming impacts the construction of an inverted index.

Stemming affects the inverted index, and in turn information retrieval, in the following ways:
- It consolidates variations of a word, including plurals and verb forms, under a common stem.
- It reduces the index size, because variants that would each need their own entry share a single entry and posting list.
- A smaller vocabulary means fewer entries to search, and a query for any one variant retrieves the documents that contain any of the others.
- It generally increases recall, since more relevant documents are retrieved. Precision can suffer when unrelated words happen to share a stem.
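
The consolidation can be sketched directly on posting lists. In this minimal example, `merge_postings`, the three-term index and the hand-written `stem_of` mapping are all hypothetical; the notebook itself rebuilds the index from stemmed text rather than merging an existing one.

```python
from collections import defaultdict

def merge_postings(index, stem_of):
    """Collapse an unstemmed index {term: {doc_id: tf}} into a stemmed one."""
    merged = defaultdict(dict)
    for term, postings in index.items():
        stem = stem_of.get(term, term)  # terms without a mapping keep their form
        for doc_id, tf in postings.items():
            # Counts add up when one document holds several variants of a stem
            merged[stem][doc_id] = merged[stem].get(doc_id, 0) + tf
    return dict(merged)

# Made-up postings for three surface forms
unstemmed = {
    "depend":    {0: 1, 33: 1},
    "depends":   {6: 1, 33: 1},
    "dependent": {18: 2},
}
stem_of = {"depend": "depend", "depends": "depend", "dependent": "depend"}

stemmed = merge_postings(unstemmed, stem_of)
print(stemmed)
```

Three entries become one, the posting list grows to cover every document that held any variant, and the term frequency for document 33 rises because two variants occurred there.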

###### Stemming

In [35]:
def stemming_text(tokens):
    """Stem each token with the Porter stemmer and return one joined string."""
    stemmer = PorterStemmer()
    stemmed_words = [stemmer.stem(word) for word in tokens]
    return ' '.join(stemmed_words)

###### Create Inverted Index

In [36]:
def inverted_index(text, top_terms_count=5):
    """Build {term: {doc_id: term count}} and print the first few entries."""
    inv_index = defaultdict(dict)
    # doc_id is the position of the document in `text`, not the DataFrame index
    for doc_id, doc in enumerate(text):
        terms = tokenize(doc)
        for term in set(terms):
            inv_index[term][doc_id] = terms.count(term)

    # Display the inverted index for the first few terms
    print("\nInverted Index with TF For "+str(top_terms_count)+" Terms")
    count = 0
    for term, postings in inv_index.items():
        if count >= top_terms_count:
            break
        print(f"\nTerm: {term}")
        for doc_id, tf in postings.items():
            print(f"  DocID: {doc_id}, TF: {tf}")
        count += 1
    return inv_index

In [37]:
def get_term_count(doc):
    """Return the number of tokens in a single document string."""
    return len(tokenize(doc))

In [38]:
print(data['preprocessed description'][10])

['nyu', 'grossman', 'school', 'medicine', 'one', 'nation', 'top', 'ranked', 'medical', 'schools', 'years', 'nyu', 'grossman', 'school', 'medicine', 'trained', 'thousands', 'physicians', 'scientists', 'helped', 'shape', 'course', 'medical', 'history', 'enrich', 'lives', 'countless', 'people', 'integral', 'part', 'nyu', 'langone', 'health', 'grossman', 'school', 'medicine', 'core', 'committed', 'improving', 'human', 'condition', 'medical', 'education', 'scientific', 'research', 'direct', 'patient', 'care', 'information', 'go', 'med', 'nyu', 'edu', 'interact', 'us', 'facebook', 'twitter', 'instagram', 'position', 'summary', 'exciting', 'opportunity', 'join', 'team', 'data', 'analyst', 'data', 'analyst', 'work', 'directly', 'dr', 'horwitz', 'director', 'division', 'healthcare', 'delivery', 'science', 'center', 'healthcare', 'innovations', 'delivery', 'science', 'well', 'rapid', 'rct', 'lab', 'seeking', 'qualified', 'data', 'analyst', 'provide', 'support', 'data', 'management', 'analyses', 

In [39]:
data['preprocessed description before stemming'] = data['preprocessed description'].apply(lambda desc :join_text(desc))

inv_index_before_stemming = inverted_index(data['preprocessed description before stemming'])
print("The length of the inverted index before stemming: "+str(len(inv_index_before_stemming)))


Inverted Index with TF For 5 Terms

Term: career
  DocID: 0, TF: 1
  DocID: 3, TF: 2
  DocID: 22, TF: 2
  DocID: 48, TF: 1
  DocID: 55, TF: 2
  DocID: 57, TF: 1
  DocID: 61, TF: 1
  DocID: 62, TF: 1
  DocID: 63, TF: 1
  DocID: 73, TF: 2
  DocID: 78, TF: 1
  DocID: 87, TF: 1
  DocID: 88, TF: 1
  DocID: 89, TF: 2
  DocID: 92, TF: 1
  DocID: 101, TF: 1
  DocID: 112, TF: 1
  DocID: 115, TF: 3
  DocID: 116, TF: 1
  DocID: 121, TF: 3
  DocID: 125, TF: 3
  DocID: 127, TF: 1
  DocID: 129, TF: 1
  DocID: 131, TF: 2
  DocID: 136, TF: 1
  DocID: 165, TF: 1
  DocID: 173, TF: 1
  DocID: 177, TF: 1
  DocID: 180, TF: 1
  DocID: 185, TF: 1
  DocID: 186, TF: 3
  DocID: 188, TF: 2
  DocID: 191, TF: 1
  DocID: 193, TF: 1
  DocID: 203, TF: 2
  DocID: 218, TF: 2
  DocID: 228, TF: 2
  DocID: 235, TF: 1
  DocID: 240, TF: 3
  DocID: 243, TF: 1
  DocID: 251, TF: 3
  DocID: 254, TF: 1
  DocID: 256, TF: 1
  DocID: 259, TF: 3
  DocID: 261, TF: 1
  DocID: 270, TF: 1
  DocID: 275, TF: 1
  DocID: 276, TF: 2
  DocID

In [40]:
# For entire dataset
data['preprocessed description after stemming']=data['preprocessed description'].apply(lambda desc: stemming_text(desc))

In [41]:
print(data['preprocessed description after stemming'][10])

nyu grossman school medicin one nation top rank medic school year nyu grossman school medicin train thousand physician scientist help shape cours medic histori enrich live countless peopl integr part nyu langon health grossman school medicin core commit improv human condit medic educ scientif research direct patient care inform go med nyu edu interact us facebook twitter instagram posit summari excit opportun join team data analyst data analyst work directli dr horwitz director divis healthcar deliveri scienc center healthcar innov deliveri scienc well rapid rct lab seek qualifi data analyst provid support data manag analys across rapid rct portfolio project data analyst respons plan design implement statist analys success applic abl work autonom comfort academ medic center environ job respons maintain exist data collect analysi system support research protocol direct supervis divis director guidanc team lead biostatistician conduct basic statist analys present team support develop pub

In [42]:
inv_index_after_stemming = inverted_index(data['preprocessed description after stemming'])
print("The length of the inverted index after stemming: "+str(len(inv_index_after_stemming)))


Inverted Index with TF For 5 Terms

Term: programmat
  DocID: 0, TF: 1
  DocID: 55, TF: 1
  DocID: 116, TF: 1
  DocID: 184, TF: 1
  DocID: 203, TF: 1
  DocID: 221, TF: 1
  DocID: 272, TF: 1
  DocID: 407, TF: 1
  DocID: 509, TF: 1
  DocID: 564, TF: 1
  DocID: 602, TF: 1
  DocID: 620, TF: 1
  DocID: 771, TF: 1
  DocID: 930, TF: 1
  DocID: 1438, TF: 1
  DocID: 1640, TF: 1
  DocID: 1652, TF: 1
  DocID: 1717, TF: 1
  DocID: 1772, TF: 2

Term: career
  DocID: 0, TF: 1
  DocID: 3, TF: 2
  DocID: 22, TF: 2
  DocID: 48, TF: 1
  DocID: 55, TF: 2
  DocID: 57, TF: 1
  DocID: 61, TF: 1
  DocID: 62, TF: 1
  DocID: 63, TF: 1
  DocID: 73, TF: 2
  DocID: 75, TF: 2
  DocID: 78, TF: 1
  DocID: 87, TF: 1
  DocID: 88, TF: 1
  DocID: 89, TF: 4
  DocID: 92, TF: 1
  DocID: 100, TF: 1
  DocID: 101, TF: 1
  DocID: 112, TF: 1
  DocID: 115, TF: 3
  DocID: 116, TF: 1
  DocID: 121, TF: 4
  DocID: 125, TF: 3
  DocID: 127, TF: 1
  DocID: 129, TF: 1
  DocID: 131, TF: 2
  DocID: 136, TF: 1
  DocID: 165, TF: 1
  DocID:

###### Example Inverted Index of a Term - Before and After Stemming

#Before stemming
Term: depend
  DocID: 0, TF: 1
  DocID: 33, TF: 1
  DocID: 175, TF: 1
  DocID: 431, TF: 1
  DocID: 888, TF: 1
  
#After stemming
Term: depend
  DocID: 0, TF: 1
  DocID: 6, TF: 1
  DocID: 18, TF: 1
  DocID: 27, TF: 1
  DocID: 33, TF: 1
  DocID: 60, TF: 1
  DocID: 82, TF: 1
  DocID: 96, TF: 1
  DocID: 128, TF: 2
  DocID: 130, TF: 1
  DocID: 155, TF: 1
  DocID: 156, TF: 1
  DocID: 159, TF: 1
  DocID: 166, TF: 2
  DocID: 175, TF: 1
  DocID: 194, TF: 1
  DocID: 239, TF: 1
  DocID: 255, TF: 1
  DocID: 282, TF: 1
  DocID: 297, TF: 1
  DocID: 323, TF: 1
  DocID: 337, TF: 1
  DocID: 355, TF: 1
  DocID: 359, TF: 1
  DocID: 365, TF: 1
  DocID: 397, TF: 1
  DocID: 408, TF: 1
  DocID: 423, TF: 1
  DocID: 427, TF: 1
  DocID: 430, TF: 1
  DocID: 431, TF: 2
  DocID: 432, TF: 1
  DocID: 433, TF: 1
  DocID: 437, TF: 1
  DocID: 444, TF: 1
  DocID: 445, TF: 1
  DocID: 446, TF: 1
  DocID: 452, TF: 1
  DocID: 485, TF: 1
  DocID: 493, TF: 1
  DocID: 496, TF: 1
  DocID: 519, TF: 1
  DocID: 526, TF: 1
  DocID: 544, TF: 1
  DocID: 545, TF: 1
  DocID: 561, TF: 2
  DocID: 566, TF: 3
  DocID: 567, TF: 1
  DocID: 574, TF: 1
  DocID: 581, TF: 1
  DocID: 621, TF: 1
  DocID: 632, TF: 2
  DocID: 636, TF: 1
  DocID: 676, TF: 3
  DocID: 731, TF: 1
  DocID: 762, TF: 1
  DocID: 779, TF: 1
  DocID: 785, TF: 1
  DocID: 799, TF: 1
  DocID: 802, TF: 1
  DocID: 811, TF: 1
  DocID: 838, TF: 1
  DocID: 861, TF: 1
  DocID: 867, TF: 2
  DocID: 869, TF: 1
  DocID: 874, TF: 1
  DocID: 880, TF: 1
  DocID: 888, TF: 1
  DocID: 891, TF: 1
  DocID: 892, TF: 1
  DocID: 894, TF: 1
  DocID: 909, TF: 1
  DocID: 931, TF: 1
  DocID: 951, TF: 1
  DocID: 969, TF: 1
  DocID: 994, TF: 1
  DocID: 1009, TF: 1
  DocID: 1020, TF: 1
  DocID: 1022, TF: 1
  DocID: 1133, TF: 1
  DocID: 1138, TF: 1
  DocID: 1154, TF: 1
  DocID: 1164, TF: 1
  DocID: 1173, TF: 1
  DocID: 1182, TF: 1
  DocID: 1199, TF: 1
  DocID: 1200, TF: 1
  DocID: 1208, TF: 1
  DocID: 1225, TF: 1
  DocID: 1268, TF: 1
  DocID: 1302, TF: 1
  DocID: 1401, TF: 1
  DocID: 1423, TF: 1
  DocID: 1457, TF: 1
  DocID: 1462, TF: 1
  DocID: 1466, TF: 1
  DocID: 1510, TF: 1
  DocID: 1525, TF: 1
  DocID: 1548, TF: 1
  DocID: 1587, TF: 1
  DocID: 1613, TF: 1
  DocID: 1615, TF: 2
  DocID: 1670, TF: 1
  DocID: 1686, TF: 1
  DocID: 1710, TF: 1
  DocID: 1712, TF: 1
  DocID: 1768, TF: 1
  DocID: 1770, TF: 1
  DocID: 1797, TF: 1
  DocID: 1828, TF: 1
  DocID: 1841, TF: 1
  DocID: 1870, TF: 1
  DocID: 1900, TF: 1
  DocID: 1917, TF: 1
  DocID: 1920, TF: 1
  DocID: 1926, TF: 1
  DocID: 1945, TF: 1
  DocID: 1960, TF: 1
  DocID: 1963, TF: 1
  DocID: 1976, TF: 1
  DocID: 1986, TF: 1
  DocID: 1993, TF: 1
  DocID: 2000, TF: 1
  DocID: 2007, TF: 1
  DocID: 2014, TF: 1
  DocID: 2017, TF: 1
  DocID: 2035, TF: 1
  DocID: 2059, TF: 1
  DocID: 2063, TF: 1
  DocID: 2094, TF: 1
  DocID: 2099, TF: 1
  DocID: 2100, TF: 1
  DocID: 2105, TF: 1
  DocID: 2106, TF: 1
  DocID: 2122, TF: 1
  DocID: 2132, TF: 1
  DocID: 2133, TF: 1
  DocID: 2134, TF: 1
  DocID: 2140, TF: 1
  DocID: 2142, TF: 1
  DocID: 2145, TF: 1
  DocID: 2148, TF: 1
  DocID: 2157, TF: 1
  DocID: 2175, TF: 1
  DocID: 2191, TF: 2
  DocID: 2195, TF: 1
  DocID: 2207, TF: 1
  DocID: 2211, TF: 1
  DocID: 2246, TF: 1
  DocID: 2250, TF: 1

###### B. b) Write a Python function that calculates term frequency and document frequency for a given term in an inverted index using the selected dataset. Discuss the significance of these metrics in the context of information retrieval.

##### TF, DF, IDF, TF-IDF <br><br>
`Term Frequency`: How often a term occurs in a document. The raw count of occurrences is normalised by the document length, so the formula for term frequency (TF) is given by:

$$\text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} $$

This formula measures how often a term appears in a document relative to the total number of terms in that document.
<br><br><br>
`Document Frequency`: The number of documents in a collection that contain a specific term. The formula for document frequency (DF) is given by:

$$\text{DF}(t) = \text{Number of documents in the collection containing term } t $$



`The significance of term frequency`:
 - It gives a quantitative representation of the content of a document
 - It helps ranking documents according to relevance to a query
 - TF is a key component in the vector space model of a corpus
 - Representation of the local importance of a term.
 
`The significance of document frequency`:
 - Terms with high document frequency may be less discriminative as compared to terms with low document frequency. So document frequency helps in discriminating terms as per their usefulness.
 - Calculation of Inverse Document Frequency (IDF): IDF is used to weigh down the importance of more common words. This helps to give more weight or importance to terms that are unique and relevant to specific documents.
 - Representation of the global importance of a term.

Formula for `Inverse Document Frequency (IDF)` is as follows:
   $$\text{IDF}(t) = \log\left(\frac{N}{\text{DF}(t) + 1}\right)$$
   <center>where N is the number of documents in the collection.</center>
   <br><br>
TF and IDF together are used in the calculation of `TF-IDF`, which assigns a weight to each term in a document.
 - TF-IDF balances the local and global importance of a term.
 - By considering both the frequency of a term in a document and its rarity across the collection, TF-IDF helps improve the precision and recall of information retrieval systems.
<br><br>
Formula for `TF-IDF` is:
    $$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t) $$
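
A minimal sketch of these definitions on a made-up three-document corpus is shown below. The helper names (`term_freq`, `doc_freq`, `idf`, `tf_idf`) are hypothetical, and the natural logarithm is an assumption, since the formula above does not fix the base.

```python
import math
from collections import Counter

# Made-up corpus for illustration
docs = [
    "data analyst sql report",
    "data engineer sql pipeline",
    "data analyst dashboard analyst",
]
tokenised = [doc.split() for doc in docs]
n_docs = len(tokenised)

def term_freq(term, tokens):
    """Occurrences of the term divided by the document length."""
    return Counter(tokens)[term] / len(tokens)

def doc_freq(term):
    """Number of documents that contain the term."""
    return sum(1 for tokens in tokenised if term in tokens)

def idf(term):
    """log(N / (DF + 1)); natural log is an assumption."""
    return math.log(n_docs / (doc_freq(term) + 1))

def tf_idf(term, tokens):
    return term_freq(term, tokens) * idf(term)

for term in ("data", "analyst", "dashboard"):
    print(term, doc_freq(term), tf_idf(term, tokenised[2]))
```

With the `+1` in the denominator, a term that occurs in every document ("data") gets a negative IDF, and a term that occurs in all but one document gets an IDF of zero. On a corpus this small the effect is exaggerated, but it shows how the weighting pushes very common terms down and lets a rare term such as "dashboard" outrank a more frequent one.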

###### Calculate TF and DF

In [43]:
def calculate_tf(index, term, document):
    # Calculate Term Frequency (TF)

    # Total number of terms in the requested document: the sum of its counts
    # across every posting list in the index
    doc_term_count = sum(postings.get(document, 0) for postings in index.values())

    # .get avoids a KeyError (and, on a defaultdict, an accidental new entry)
    # when the term or the document is absent
    freq = index.get(term, {}).get(document, 0)
    term_frequency = freq / doc_term_count if doc_term_count else 0.0
    print("No. of times the term '"+term+"' appears in document id "+str(document)+" : "+ str(freq))
    print("No. of words in document id "+str(document)+" : "+ str(doc_term_count))
    return term_frequency


def calculate_df(index, term):
    # Calculate Document Frequency (DF): the length of the term's posting list
    document_frequency = len(index.get(term, {}))
    print("\nNumber of documents term '"+term+"' appears : "+ str(document_frequency))
    return document_frequency


In [44]:
# inverted index (named stemmed_index so the inverted_index() function is not overwritten)
stemmed_index = inv_index_after_stemming

In [45]:
# Enter the term to look for. The index holds stems, so enter the stemmed form
# (for example 'analys' rather than 'analysis').
term_to_check = input("Enter the term to check : ")

Enter the term to check : work


In [46]:
# Enter the document to look in 
document_to_check = int(input("Enter the document ID to check : "))

Enter the document ID to check : 10


In [47]:
# Calculate TF and DF for the term and document entered
# (tf_value and df_value avoid overwriting the DataFrame `df` loaded earlier)
tf_value = calculate_tf(stemmed_index, term_to_check, document_to_check)
df_value = calculate_df(stemmed_index, term_to_check)

print(f"\nTerm Frequency (TF) of '{term_to_check}' in 'document {document_to_check}': {tf_value}")
print(f"Document Frequency (DF) of '{term_to_check}': {df_value}")

No. of times the term 'work' appears in document id 10 : 7
No. of words in document id 10 : 430

Number of documents term 'work' appears : 2015

Term Frequency (TF) of 'work' in 'document 10': 0.01627906976744186
Document Frequency (DF) of 'work': 2015
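
To connect this output with the IDF and TF-IDF formulas, the reported numbers can be plugged in by hand. Two things in this sketch are assumptions rather than results from the run: that all 2253 rows survived preprocessing (so N = 2253), and that the logarithm is the natural log.

```python
import math

term_count = 7    # occurrences of 'work' in document 10 (reported above)
doc_length = 430  # terms in document 10 (reported above)
doc_freq = 2015   # documents containing 'work' (reported above)
n_docs = 2253     # ASSUMPTION: no rows were dropped during preprocessing

tf_value = term_count / doc_length
idf_value = math.log(n_docs / (doc_freq + 1))  # ASSUMPTION: natural logarithm
tf_idf_value = tf_value * idf_value

print(f"TF     = {tf_value:.5f}")
print(f"IDF    = {idf_value:.5f}")
print(f"TF-IDF = {tf_idf_value:.5f}")
```

Under these assumptions 'work' appears in roughly nine out of ten documents, so its IDF is small and the resulting TF-IDF weight is low even though the term occurs seven times in document 10.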
