# Building a `Search Engine` using an `Inverted Index`

![architecture](https://drive.google.com/uc?export=view&id=1Q3KoQOlAsiteEyCi1ov-obc2J7w33N8D)

### 1. Loading the Data

In [None]:
!pip install kagglehub

In [96]:
import kagglehub
import shutil

# Download the dataset; kagglehub returns the local cache path
path = kagglehub.dataset_download("ravindrasinghrana/job-description-dataset")

# Move the downloaded folder into /content
destination_path = '/content'
shutil.move(path, destination_path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/ravindrasinghrana/job-description-dataset?dataset_version_number=1...


100%|██████████| 457M/457M [00:07<00:00, 65.0MB/s]

Extracting files...





'/content/1'

In [97]:
print("Path to dataset files:", path)

Path to dataset files: /root/.cache/kagglehub/datasets/ravindrasinghrana/job-description-dataset/versions/1


In [98]:
print("Shifted Destination Path:", destination_path)

Shifted Destination Path: /content


In [114]:
#import necessary libraries
import pandas as pd
import spacy
from collections import defaultdict

nlp = spacy.load("en_core_web_sm")

In [136]:
df = pd.read_csv('/content/1/job_descriptions.csv')

In [137]:
df.shape

(1615940, 23)

In [138]:
df.head(3)

| | Job Id | Experience | Qualifications | Salary Range | location | Country | latitude | longitude | Work Type | Company Size | ... | Contact | Job Title | Role | Job Portal | Job Description | Benefits | skills | Responsibilities | Company | Company Profile |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1089843540111562 | 5 to 15 Years | M.Tech | $59K-$99K | Douglas | Isle of Man | 54.2361 | -4.5481 | Intern | 26801 | ... | 001-381-930-7517x737 | Digital Marketing Specialist | Social Media Manager | Snagajob | Social Media Managers oversee an organizations... | {'Flexible Spending Accounts (FSAs), Relocatio... | Social media platforms (e.g., Facebook, Twitte... | Manage and grow social media accounts, create ... | Icahn Enterprises | {"Sector":"Diversified","Industry":"Diversifie... |
| 1 | 398454096642776 | 2 to 12 Years | BCA | $56K-$116K | Ashgabat | Turkmenistan | 38.9697 | 59.5563 | Intern | 100340 | ... | 461-509-4216 | Web Developer | Frontend Web Developer | Idealist | Frontend Web Developers design and implement u... | {'Health Insurance, Retirement Plans, Paid Tim... | HTML, CSS, JavaScript Frontend frameworks (e.g... | Design and code user interfaces for websites, ... | PNC Financial Services Group | {"Sector":"Financial Services","Industry":"Com... |
| 2 | 481640072963533 | 0 to 12 Years | PhD | $61K-$104K | Macao | Macao SAR, China | 22.1987 | 113.5439 | Temporary | 84525 | ... | 9687619505 | Operations Manager | Quality Control Manager | Jobs2Careers | Quality Control Managers establish and enforce... | {'Legal Assistance, Bonuses and Incentive Prog... | Quality control processes and methodologies St... | Establish and enforce quality control standard... | United Services Automobile Assn. | {"Sector":"Insurance","Industry":"Insurance: P... |


In [139]:
# Keep only the first 100,000 rows (1 lakh); .copy() avoids
# SettingWithCopyWarning when the subset is modified in place later
df = df[:100000].copy()

In [140]:
df.shape

(100000, 23)
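
Slicing after loading still parses all 1.6 million rows first. As a minimal sketch of an alternative, the `nrows` parameter of `pd.read_csv` stops parsing early. The in-memory CSV and the names `sample_csv` and `sample_df` below are made up for illustration and are not the Kaggle file.

```python
import io
import pandas as pd

# Hypothetical stand-in for job_descriptions.csv (made-up rows for illustration)
sample_csv = io.StringIO(
    "Job Id,Job Title,Role\n"
    "1,Web Developer,Frontend Web Developer\n"
    "2,Data Analyst,Business Analyst\n"
    "3,Operations Manager,Quality Control Manager\n"
)

# nrows tells pandas to stop parsing after that many data rows
sample_df = pd.read_csv(sample_csv, nrows=2)
print(sample_df.shape)  # (2, 3)
```

On the real file the equivalent call would presumably be `pd.read_csv('/content/1/job_descriptions.csv', nrows=100000)`, which should give the same first 100,000 rows without holding the rest in memory.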

### 2. Data Cleaning and Preprocessing

**2.1 Deleting Unwanted Columns**

In [141]:
df.columns

Index(['Job Id', 'Experience', 'Qualifications', 'Salary Range', 'location',
       'Country', 'latitude', 'longitude', 'Work Type', 'Company Size',
       'Job Posting Date', 'Preference', 'Contact Person', 'Contact',
       'Job Title', 'Role', 'Job Portal', 'Job Description', 'Benefits',
       'skills', 'Responsibilities', 'Company', 'Company Profile'],
      dtype='object')

In [142]:
df.drop(columns=['Job Id','latitude', 'longitude','Job Portal', 'location','Company Profile', 'Job Description'],
        inplace = True)

In [143]:
df.isna().sum()

| | 0 |
|---|---|
| Experience | 0 |
| Qualifications | 0 |
| Salary Range | 0 |
| Country | 0 |
| Work Type | 0 |
| Company Size | 0 |
| Job Posting Date | 0 |
| Preference | 0 |
| Contact Person | 0 |
| Contact | 0 |


In [144]:
df.head(3)

| | Experience | Qualifications | Salary Range | Country | Work Type | Company Size | Job Posting Date | Preference | Contact Person | Contact | Job Title | Role | Benefits | skills | Responsibilities | Company |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 to 15 Years | M.Tech | $59K-$99K | Isle of Man | Intern | 26801 | 2022-04-24 | Female | Brandon Cunningham | 001-381-930-7517x737 | Digital Marketing Specialist | Social Media Manager | {'Flexible Spending Accounts (FSAs), Relocatio... | Social media platforms (e.g., Facebook, Twitte... | Manage and grow social media accounts, create ... | Icahn Enterprises |
| 1 | 2 to 12 Years | BCA | $56K-$116K | Turkmenistan | Intern | 100340 | 2022-12-19 | Female | Francisco Larsen | 461-509-4216 | Web Developer | Frontend Web Developer | {'Health Insurance, Retirement Plans, Paid Tim... | HTML, CSS, JavaScript Frontend frameworks (e.g... | Design and code user interfaces for websites, ... | PNC Financial Services Group |
| 2 | 0 to 12 Years | PhD | $61K-$104K | Macao SAR, China | Temporary | 84525 | 2022-09-14 | Male | Gary Gibson | 9687619505 | Operations Manager | Quality Control Manager | {'Legal Assistance, Bonuses and Incentive Prog... | Quality control processes and methodologies St... | Establish and enforce quality control standard... | United Services Automobile Assn. |


**2.2 Changing the Data Type of the Date Column**

In [145]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 16 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   Experience        100000 non-null  object
 1   Qualifications    100000 non-null  object
 2   Salary Range      100000 non-null  object
 3   Country           100000 non-null  object
 4   Work Type         100000 non-null  object
 5   Company Size      100000 non-null  int64 
 6   Job Posting Date  100000 non-null  object
 7   Preference        100000 non-null  object
 8   Contact Person    100000 non-null  object
 9   Contact           100000 non-null  object
 10  Job Title         100000 non-null  object
 11  Role              100000 non-null  object
 12  Benefits          100000 non-null  object
 13  skills            100000 non-null  object
 14  Responsibilities  100000 non-null  object
 15  Company           100000 non-null  object
dtypes: int64(1), object(15)
memory usage: 1

In [146]:
df['Job Posting Date'] = pd.to_datetime(df['Job Posting Date'])
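
The conversion above relies on pandas inferring the date layout and raises an error on any value it cannot parse. A more defensive variant, sketched here on made-up strings (the `dates` and `parsed` names are illustrative only), pins the layout with `format=` and turns unparseable values into `NaT` with `errors="coerce"`:

```python
import pandas as pd

# Made-up date strings; the last one is deliberately invalid
dates = pd.Series(["2022-04-24", "2022-12-19", "not a date"])

# format= fixes the expected layout; errors="coerce" maps bad values to NaT
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

print(parsed)
print("Unparseable values:", parsed.isna().sum())  # Unparseable values: 1
```

Counting the `NaT` values afterwards shows how many rows failed to parse, which is easier to act on than an exception partway through 100,000 rows.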

In [147]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 16 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   Experience        100000 non-null  object        
 1   Qualifications    100000 non-null  object        
 2   Salary Range      100000 non-null  object        
 3   Country           100000 non-null  object        
 4   Work Type         100000 non-null  object        
 5   Company Size      100000 non-null  int64         
 6   Job Posting Date  100000 non-null  datetime64[ns]
 7   Preference        100000 non-null  object        
 8   Contact Person    100000 non-null  object        
 9   Contact           100000 non-null  object        
 10  Job Title         100000 non-null  object        
 11  Role              100000 non-null  object        
 12  Benefits          100000 non-null  object        
 13  skills            100000 non-null  object        
 14  Respo

### 3. Building the Search Engine

**3.1 Building Inverted Index**

In [148]:
def build_inverted_index(df):
    """Map each lowercase 'Job Title' term to the list of row labels containing it."""
    inverted_index = defaultdict(list)

    # Iterate over the 'Job Title' column together with its row labels
    for idx, title in df['Job Title'].items():
        # Tokenize the job title (lowercase, split on whitespace);
        # set() drops repeated words so each row is added once per term
        # without a slow membership check on the growing list
        for term in set(str(title).lower().split()):
            inverted_index[term].append(idx)

    return inverted_index

# Build the inverted index
inverted_index = build_inverted_index(df)
print("\nInverted Index Built Successfully!")


Inverted Index Built Successfully!
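
To see what the index holds, here is a toy version built from three made-up job titles instead of `df['Job Title']` (the `titles` and `toy_index` names are illustrative only). Each term maps to the IDs of the titles that contain it:

```python
from collections import defaultdict

# Made-up job titles standing in for df['Job Title']
titles = ["Data Analyst", "Investment Analyst", "Data Engineer"]

toy_index = defaultdict(list)
for doc_id, title in enumerate(titles):
    # set() ensures each document ID is stored once per term
    for term in set(title.lower().split()):
        toy_index[term].append(doc_id)

print(toy_index["analyst"])  # [0, 1]
print(toy_index["data"])     # [0, 2]
```

Looking up a term is a single dictionary access, so a query never has to scan every row.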


**3.2 Searching in Inverted Index**

In [149]:
def search_with_inverted_index(query, inverted_index, df, max_results=5):
    query_terms = query.lower().split()
    matching_docs = set()

    # Collect matching document IDs from the inverted index.
    # The result is the union over all query terms, so a document
    # matching any single term is returned.
    for term in query_terms:
        if term in inverted_index:
            matching_docs.update(inverted_index[term])

    # Limit the number of results to max_results (default is 5)
    matching_docs = list(matching_docs)[:max_results]

    if matching_docs:
        print(f"\nDisplaying up to {max_results} results for search term '{query}':\n")
        for doc_id in matching_docs:
            row = df.loc[doc_id]  # the index stores row labels, so look up by label

            # Display the result
            print(f"Job Title : {row['Job Title']}")
            print(f"Role : {row['Role']}")
            print(f"Company Name : {row['Company']}")
            print(f"Location : {row['Country']}")
            print(f"Qualifications : {row['Qualifications']}")
            print(f"Experience : {row['Experience']}")
            print(f"Salary Range : {row['Salary Range']}")
            print(f"Work Type : {row['Work Type']}")
            print(f"Responsibilities : {row['Responsibilities']}")
            print(f"Skills : {row['skills']}")
            print(f"Benefits : {row['Benefits']}")
            print(f"Contact	: {row['Contact']}")
            print("-" * 50)
    else:
        print(f"No results found for search term '{query}'.")
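
Because the search takes the union of the postings for every query term, a query such as "Data Analyst" also returns titles that contain only "analyst". One possible refinement, sketched below, ranks documents by how many distinct query terms they match. The helper `rank_matches` and the hand-written `toy_index` are hypothetical and not part of the notebook.

```python
from collections import Counter

def rank_matches(query, index, max_results=5):
    """Return doc IDs ordered by how many distinct query terms they match."""
    hits = Counter()
    for term in set(query.lower().split()):
        for doc_id in index.get(term, []):
            hits[doc_id] += 1
    # Sort by match count (descending), then by doc ID for a stable order
    ranked = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ranked[:max_results]]

# Made-up index in the same shape as inverted_index (term -> list of row IDs)
toy_index = {"data": [0, 2], "analyst": [0, 1], "engineer": [2]}

print(rank_matches("Data Analyst", toy_index))  # [0, 1, 2]
```

Document 0 matches both terms and comes first. Documents 1 and 2 match one term each and follow in ID order. The returned IDs could then be passed to the same display loop used in `search_with_inverted_index`.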


In [151]:
if __name__ == "__main__":
    inverted_index = build_inverted_index(df)
    while True:
        user_query = input("\nEnter your search term (or type 'exit' to quit): ")
        if user_query.lower() == 'exit':
            break

        search_with_inverted_index(user_query, inverted_index, df)


Enter your search term (or type 'exit' to quit): Data Analyst

Displaying up to 5 results for search term 'Data Analyst':

Job Title : Investment Analyst
Role : Portfolio Manager
Company Name : Bajaj Auto
Location : Pakistan
Qualifications : BCA
Experience : 0 to 11 Years
Salary Range : $55K-$118K
Work Type : Contract
Responsibilities : Manage investment portfolios, making investment decisions and asset allocation. Conduct research and analysis to identify investment opportunities. Monitor portfolio performance and risk.
Skills : Investment management Financial analysis Risk assessment Asset allocation Portfolio optimization
Benefits : {'Flexible Spending Accounts (FSAs), Relocation Assistance, Legal Assistance, Employee Recognition Programs, Financial Counseling'}
Contact	: 405.473.3511x661
--------------------------------------------------
Job Title : Chemical Analyst
Role : Research Chemist
Company Name : Travis Perkins
Location : Jordan
Qualifications : M.Tech
Experience : 1 to 8 