Note 🚩 This project was built as a practice project to showcase DS skills. It has not gone through a rigorous testing process and should not be used in production.
- This project creates an index similar to a textbook index 📑: it lists terms (words) with the page numbers on which they appear. 📖
- It uses a text and a list of terms that need to be indexed to produce an alphabetical index 🔠.
- This project was inspired by problem 5.2.11 from the book 📘 "Data Structures using C and C++, 2nd edition" by Langsam, Augenstein, and Tenenbaum.
- I've used and implemented 🌴AVL trees and Tries🌳 as my data structures.
Note 🚩 I referred to the book "Data Structures using C & C++, 2nd ed." to write the code for AVL trees, and to internet resources to write the code for tries.
- To create this project, I drew on concepts from both generic programming and object-oriented programming.
- The process for creating the index is as follows ⤵️
- All special characters in the text, such as $, %, and !, were removed using simple tests of ASCII values.
- All stop words like and, or, not, why, if, … were eliminated. Each word in the output file of step 1 was looked up in a trie of stop words and discarded if found there.
- This step processes the output file of step 2. Every word is scanned; if it is found in a trie of terms, it is added to one AVL tree and its page number is added to another. The AVL tree of terms and the AVL tree of page numbers are linked.
- The resultant AVL trees are traversed in-order to produce the index.
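As a rough illustration of steps 1 and 2 above, here is a minimal sketch of a lowercase-only trie used as a stop-word filter. It is not the code from this project: the names `Trie`, `strip_special`, and `drop_stop_words` are hypothetical, and lowercasing letters and replacing removed characters with spaces are assumptions made for the example.

```cpp
#include <array>
#include <cassert>
#include <memory>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical trie over 'a'..'z'; the project's own class may look different.
class Trie {
    struct Node {
        std::array<std::unique_ptr<Node>, 26> next;  // one slot per letter
        bool is_word = false;                        // true if a word ends here
    };
    Node root_;

public:
    // Words must consist of lowercase letters only.
    void insert(const std::string& word) {
        Node* cur = &root_;
        for (char c : word) {
            assert(c >= 'a' && c <= 'z');
            std::unique_ptr<Node>& child = cur->next[c - 'a'];
            if (!child) child = std::make_unique<Node>();
            cur = child.get();
        }
        cur->is_word = true;
    }

    // Returns true only if the exact word was inserted earlier.
    bool contains(const std::string& word) const {
        const Node* cur = &root_;
        for (char c : word) {
            if (c < 'a' || c > 'z') return false;
            cur = cur->next[c - 'a'].get();
            if (!cur) return false;
        }
        return cur->is_word;
    }
};

// Step 1 (assumed behaviour): keep ASCII letters, lowercased;
// every other character becomes a space.
std::string strip_special(const std::string& text) {
    std::string out;
    for (unsigned char c : text) {
        if (c >= 'A' && c <= 'Z') out += static_cast<char>(c - 'A' + 'a');
        else if (c >= 'a' && c <= 'z') out += static_cast<char>(c);
        else out += ' ';
    }
    return out;
}

// Step 2: drop every word that is present in the stop-word trie.
std::vector<std::string> drop_stop_words(const std::string& cleaned,
                                         const Trie& stop_words) {
    std::vector<std::string> kept;
    std::istringstream in(cleaned);
    std::string word;
    while (in >> word) {
        if (!stop_words.contains(word)) kept.push_back(word);
    }
    return kept;
}
```

A trie lookup costs time proportional to the length of the word, regardless of how many stop words or terms are stored, which suits a pass that checks every word of a book.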
Note 🚩 - The GCC compiler for C++, i.e. g++, must be installed on your machine.
Note 🚩 - Works correctly only if the words to be indexed consist of lowercase letters (no special symbols).
- On Windows
- Save the source code for this project & compile it using g++.
- Open the command prompt and type
name_of_executable_file inpt_file_1.txt inpt_file_2.txt output_file_1.txt
- On Linux
- Save the source code for this project & compile it using g++.
- Open the Linux terminal and type
./program_name inpt_file_1.txt inpt_file_2.txt output_file_1.txt
- Where
- inpt_file_1.txt: contains all the words that need to be indexed from the text, one term per line.
- inpt_file_2.txt: contains the text for which the index is to be created, with each new page delimited by 10 '@' symbols.
- output_file_1.txt: file where the resultant index will be saved.
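To make the input format more concrete, here is a minimal sketch of how page numbers could be tracked while reading the text and how a sorted index could be produced. It is not the project's implementation: `std::map` and `std::set` stand in for the linked AVL trees (both keep their keys sorted, so iterating over them matches an in-order traversal), and `build_index`, `format_index`, and `kPageBreak` are hypothetical names. It also assumes that the 10 '@' symbols appear as their own whitespace-separated token, that pages are numbered from 1, and that each output line has the form `term: pages`.

```cpp
#include <map>
#include <set>
#include <sstream>
#include <string>

// Assumption: a page break is a standalone token of exactly ten '@'.
const std::string kPageBreak(10, '@');

// Stand-in for the linked AVL trees: term -> sorted set of page numbers.
using Index = std::map<std::string, std::set<int>>;

// Scans the text word by word and records the page of every indexed term.
Index build_index(const std::string& text, const std::set<std::string>& terms) {
    Index index;
    int page = 1;  // assumption: the first page is page 1
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        if (word == kPageBreak) {
            ++page;  // everything after the delimiter is on the next page
            continue;
        }
        if (terms.count(word) != 0) index[word].insert(page);
    }
    return index;
}

// Writes one line per term in alphabetical order (assumed output format).
std::string format_index(const Index& index) {
    std::ostringstream out;
    for (const auto& entry : index) {
        out << entry.first << ':';
        for (int p : entry.second) out << ' ' << p;
        out << '\n';
    }
    return out.str();
}
```

Storing pages in a set means a term that occurs several times on one page is listed once for that page.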
- I have run this program on the book "Software Engineering, 10th edition" written by Ian Sommerville.
- The text.txt file in this repository contains the same book in text format.
- The terms.txt file in this repository contains the list of words on which I have created the index.
- The opt.txt file in this repository contains the resulting index, sorted alphabetically.
- Some Screenshots:
⤵️