# Data Mining Project: Phase 1 and Phase 2

This data mining project builds a search engine over the Cranfield dataset. Phase 1 preprocesses and indexes the dataset; Phase 2 implements query processing with TF-IDF ranking.

## Phase 1: Preprocessing and Indexing

This section contains the Phase 1 implementation. It loads the Cranfield dataset, preprocesses the titles (cleaning, tokenization, stemming), builds an inverted index with PyTerrier, and provides a basic single-term search function.
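
As a rough illustration of what an inverted index stores, the following minimal sketch builds one in plain Python from a few toy titles. The names `build_inverted_index` and `toy_titles` are hypothetical, there is no stemming or stopword removal here, and PyTerrier's on-disk structures are considerably more elaborate than this dictionary.

```python
import re
from collections import defaultdict


def build_inverted_index(titles):
    """Map each term to the sorted list of document ids whose title contains it."""
    postings = defaultdict(set)
    for doc_id, title in enumerate(titles):
        # Lowercase and keep alphabetic runs only (a simplification of the cleaning step).
        for term in re.findall(r"[a-z]+", title.lower()):
            postings[term].add(doc_id)
    return {term: sorted(doc_ids) for term, doc_ids in postings.items()}


# Hypothetical toy titles, not taken from the Cranfield data.
toy_titles = [
    "shear flow past a flat plate",
    "boundary layer in shear flow",
    "heat conduction in a plate",
]
toy_index = build_inverted_index(toy_titles)
print(toy_index["flow"])   # documents 0 and 1
print(toy_index["plate"])  # documents 0 and 2
```

Looking up a term is then a single dictionary access, which is the property the rest of the notebook relies on when it queries the Terrier lexicon.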

In [94]:
import pandas as pd
import re
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

In [95]:
# Raw strings keep the backslashes in Windows paths from being read as escape sequences.
input_file = r"D:\DownLoad\projects\Search Engine\Olivia_Searchengine\datacollection\output\cran.all.1400.csv"
output_file = r"D:\DownLoad\projects\Search Engine\Olivia_Searchengine\datacollection\output\cran_preprocessed_modern.csv"

In [96]:
print("=== Loading the Cranfield Dataset ===")
# read_csv already returns a DataFrame, so no extra wrapping is needed.
df = pd.read_csv(input_file)
print("Dataset Info:")
print(df.info())
print("\nFirst 5 rows of raw data:")
print(df.head())

=== Loading the Cranfield Dataset ===
Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1400 entries, 0 to 1399
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Doc_NO  1400 non-null   int64 
 1   Title   1398 non-null   object
 2   Bib     1330 non-null   object
 3   Text    1398 non-null   object
dtypes: int64(1), object(3)
memory usage: 43.9+ KB
None

First 5 rows of raw data:
   Doc_NO                                              Title  \
0       1  experimental investigation of the aerodynamics...   
1       2  simple shear flow past a flat plate in an inco...   
2       3  the boundary layer in simple shear flow past a...   
3       4  approximate solutions of the incompressible la...   
4       5  one-dimensional transient heat conduction into...   

                                                 Bib  \
0                         j. ae. scs. 25, 1958, 324.   
1  department of aeronautical engineering, ren

In [97]:
print("\n=== Checking for Missing Values ===")
print("Missing values in 'Title':", df['Title'].isna().sum())
print("Missing values in 'Text':", df['Text'].isna().sum())
print("Total rows before dropping NaN:", len(df))


=== Checking for Missing Values ===
Missing values in 'Title': 2
Missing values in 'Text': 2
Total rows before dropping NaN: 1400


In [98]:
df = df.dropna(subset=['Title'])
print("Total rows after dropping NaN in Title:", len(df))
print("\nFirst 5 rows after dropping NaN:")
print(df.head())

Total rows after dropping NaN in Title: 1398

First 5 rows after dropping NaN:
   Doc_NO                                              Title  \
0       1  experimental investigation of the aerodynamics...   
1       2  simple shear flow past a flat plate in an inco...   
2       3  the boundary layer in simple shear flow past a...   
3       4  approximate solutions of the incompressible la...   
4       5  one-dimensional transient heat conduction into...   

                                                 Bib  \
0                         j. ae. scs. 25, 1958, 324.   
1  department of aeronautical engineering, rensse...   
2  department of mathematics, university of manch...   
3                         j. ae. scs. 22, 1955, 728.   
4                         j. ae. scs. 24, 1957, 924.   

                                                Text  
0  experimental investigation of the aerodynamics...  
1  simple shear flow past a flat plate in an inco...  
2  the boundary layer in simple sh

In [99]:
print("\n=== Step 1: Cleaning Titles ===")
cleaned_titles = []
for title in df['Title']:
    # Drop everything except letters and whitespace, then collapse repeated spaces.
    title_clean = re.sub(r'[^a-zA-Z\s]', '', str(title))
    title_clean = re.sub(r'\s+', ' ', title_clean).strip()
    cleaned_titles.append(title_clean.lower())
df['Cleaned_Title'] = cleaned_titles
print("Sample of cleaned Titles (first 2 rows):")
print(df[['Doc_NO', 'Cleaned_Title']].head(2))


=== Step 1: Cleaning Titles ===
Sample of cleaned Titles (first 2 rows):
   Doc_NO                                      Cleaned_Title
0       1  experimental investigation of the aerodynamics...
1       2  simple shear flow past a flat plate in an inco...
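
One side effect of the cleaning pattern above is worth noting: deleting every non-letter character fuses hyphenated compounds, which is presumably how `adiabaticwall` ends up in the vocabulary printed in Step 2. A possible variant, sketched here under the assumption that hyphens and other punctuation should act as word separators, substitutes a space instead. `clean_title_keep_boundaries` is a hypothetical helper name, not part of the notebook.

```python
import re


def clean_title_keep_boundaries(title):
    """Lowercase a title and turn every run of non-letters into a single space."""
    # Substituting a space (rather than the empty string) keeps "adiabatic-wall"
    # as two tokens instead of fusing it into "adiabaticwall".
    text = re.sub(r"[^a-zA-Z]+", " ", str(title))
    return text.strip().lower()


print(clean_title_keep_boundaries("One-dimensional transient heat conduction"))
# one dimensional transient heat conduction
```

Whether splitting or fusing is preferable depends on how queries are expected to spell such compounds; the same choice has to be applied to queries as well.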


In [100]:
print("\n=== Step 2: Tokenizing Titles and Vocabulary Analysis ===")
vectorizer = CountVectorizer(
    stop_words="english",
    lowercase=True,
    token_pattern=r'\b[a-zA-Z]+\b'
)
vector = vectorizer.fit_transform(df['Cleaned_Title'])
terms = vectorizer.get_feature_names_out()
print("Total unique terms in Titles:", len(terms))
print("First 20 terms in Title vocabulary:", terms[:20])


=== Step 2: Tokenizing Titles and Vocabulary Analysis ===
Total unique terms in Titles: 1804
First 20 terms in Title vocabulary: ['ablating' 'ablation' 'accelerated' 'accelerating' 'according'
 'accumulation' 'accuracy' 'acoustic' 'acoustical' 'acting' 'action'
 'active' 'adapted' 'addendum' 'addition' 'adiabatic' 'adiabaticwall'
 'adjacent' 'advances' 'advancing']


In [101]:
# Whitespace tokenization of the cleaned titles; stopwords are kept at this stage.
df['Title_Tokens'] = df['Cleaned_Title'].str.split()
print("\nSample tokenized Titles (first 2 rows):")
print(df[['Doc_NO', 'Title_Tokens']].head(2))


Sample tokenized Titles (first 2 rows):
   Doc_NO                                       Title_Tokens
0       1  [experimental, investigation, of, the, aerodyn...
1       2  [simple, shear, flow, past, a, flat, plate, in...


In [102]:
print("\n=== Step 3: Comparing Stemming Methods ===")
porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()


=== Step 3: Comparing Stemming Methods ===


In [103]:
# Stem every vocabulary term with each of the three stemmers for comparison.
porter_stemmed = [porter.stem(word) for word in terms]
snowball_stemmed = [snowball.stem(word) for word in terms]
lancaster_stemmed = [lancaster.stem(word) for word in terms]

In [104]:
print("\nStemming Comparison (First 5 Title Terms):")
print("-" * 60)
print(f"{'Original':<15} | {'Porter':<15} | {'Snowball':<15} | {'Lancaster':<15}")
print("-" * 60)
for i in range(min(5, len(terms))):
    print(f"{terms[i]:<15} | {porter_stemmed[i]:<15} | {snowball_stemmed[i]:<15} | {lancaster_stemmed[i]:<15}")
print("-" * 60)


Stemming Comparison (First 5 Title Terms):
------------------------------------------------------------
Original        | Porter          | Snowball        | Lancaster      
------------------------------------------------------------
ablating        | ablat           | ablat           | abl            
ablation        | ablat           | ablat           | abl            
accelerated     | acceler         | acceler         | accel          
accelerating    | acceler         | acceler         | accel          
according       | accord          | accord          | accord         
------------------------------------------------------------


In [105]:
print("\nApplying Snowball Stemming to Title Tokens...")
# Stem each token of each title with the Snowball stemmer.
df['Stemmed_Title_Tokens'] = [
    [snowball.stem(word) for word in tokens]
    for tokens in df['Title_Tokens']
]
print("Sample stemmed Titles (first 2 rows):")
print(df[['Doc_NO', 'Stemmed_Title_Tokens']].head(2))


Applying Snowball Stemming to Title Tokens...
Sample stemmed Titles (first 2 rows):
   Doc_NO                               Stemmed_Title_Tokens
0       1  [experiment, investig, of, the, aerodynam, of,...
1       2  [simpl, shear, flow, past, a, flat, plate, in,...


In [106]:
print("\n=== Step 4: Creating Processed_Text from Titles for Indexing ===")
# Join the stemmed tokens back into one space-separated string per title.
df['Processed_Text'] = df['Stemmed_Title_Tokens'].apply(" ".join)
print("Sample Processed_Text from Titles (first 2 rows):")
print(df[['Doc_NO', 'Processed_Text']].head(2))


=== Step 4: Creating Processed_Text from Titles for Indexing ===
Sample Processed_Text from Titles (first 2 rows):
   Doc_NO                                     Processed_Text
0       1  experiment investig of the aerodynam of a wing...
1       2  simpl shear flow past a flat plate in an incom...


In [107]:
print("\n=== Step 5: Saving Processed Data ===")
output_df = df[['Doc_NO', 'Title', 'Bib', 'Text', 'Processed_Text']]
output_df.to_csv(output_file, index=False)
print("Saved to:", output_file)
print("Final output (first 5 rows):")
print(output_df.head())


=== Step 5: Saving Processed Data ===
Saved to: D:\DownLoad\projects\Search Engine\Olivia_Searchengine\datacollection\output\cran_preprocessed_modern.csv
Final output (first 5 rows):
   Doc_NO                                              Title  \
0       1  experimental investigation of the aerodynamics...   
1       2  simple shear flow past a flat plate in an inco...   
2       3  the boundary layer in simple shear flow past a...   
3       4  approximate solutions of the incompressible la...   
4       5  one-dimensional transient heat conduction into...   

                                                 Bib  \
0                         j. ae. scs. 25, 1958, 324.   
1  department of aeronautical engineering, rensse...   
2  department of mathematics, university of manch...   
3                         j. ae. scs. 22, 1955, 728.   
4                         j. ae. scs. 24, 1957, 924.   

                                                Text  \
0  experimental investigation of the a

In [108]:
print("\n=== Step 6: Creative Title Insights ===")
# Compute the per-title token counts once and reuse them.
token_counts = df['Title_Tokens'].apply(len)
print("Average token count per Title:", round(token_counts.mean(), 2))
print("Longest Title (tokens):", token_counts.max(), "in Doc_NO:",
      df.loc[token_counts.idxmax(), 'Doc_NO'])
print("Most frequent term in Titles (before stemming):")
word_counts = vector.toarray().sum(axis=0)
top_term_idx = word_counts.argmax()
print(f"'{terms[top_term_idx]}' appears {word_counts[top_term_idx]} times")


=== Step 6: Creative Title Insights ===
Average token count per Title: 11.4
Longest Title (tokens): 40 in Doc_NO: 1082
Most frequent term in Titles (before stemming):
'flow' appears 322 times
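
For a look beyond the single most frequent term, a small frequency table can be built with `collections.Counter`. This is only a sketch on hypothetical token lists: `top_terms` is an illustrative helper, and unlike the `CountVectorizer` counts above it applies no stopword filtering.

```python
from collections import Counter


def top_terms(token_lists, k=3):
    """Return the k most common terms across all token lists, with their counts."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return counts.most_common(k)


# Hypothetical tokenized titles standing in for df['Title_Tokens'].
sample_tokens = [
    ["shear", "flow", "past", "plate"],
    ["boundary", "layer", "shear", "flow"],
    ["hypersonic", "flow"],
]
print(top_terms(sample_tokens, k=2))  # [('flow', 3), ('shear', 2)]
```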


In [109]:
!pip install python-terrier






In [110]:
import pyterrier as pt

In [111]:
if not pt.java.started():
    pt.java.init()
    print("Java Virtual Machine started!")

In [112]:
# Raw string so the backslashes in the Windows path are taken literally.
input_file = r"D:\DownLoad\projects\Search Engine\Olivia_Searchengine\datacollection\output\cran_preprocessed_modern.csv"

In [113]:
df = pd.read_csv(input_file)
print(df.head())

   Doc_NO                                              Title  \
0       1  experimental investigation of the aerodynamics...   
1       2  simple shear flow past a flat plate in an inco...   
2       3  the boundary layer in simple shear flow past a...   
3       4  approximate solutions of the incompressible la...   
4       5  one-dimensional transient heat conduction into...   

                                                 Bib  \
0                         j. ae. scs. 25, 1958, 324.   
1  department of aeronautical engineering, rensse...   
2  department of mathematics, university of manch...   
3                         j. ae. scs. 22, 1955, 728.   
4                         j. ae. scs. 24, 1957, 924.   

                                                Text  \
0  experimental investigation of the aerodynamics...   
1  simple shear flow past a flat plate in an inco...   
2  the boundary layer in simple shear flow past a...   
3  approximate solutions of the incompressible la...  

In [114]:
df["docno"] = df["Doc_NO"].astype(str)
print("\nSample with docno (first 2 rows):")
print(df[['docno', 'Title', 'Processed_Text']].head(2))


Sample with docno (first 2 rows):
  docno                                              Title  \
0     1  experimental investigation of the aerodynamics...   
1     2  simple shear flow past a flat plate in an inco...   

                                      Processed_Text  
0  experiment investig of the aerodynam of a wing...  
1  simpl shear flow past a flat plate in an incom...  


In [115]:
import os

print("\n=== Step 1: Creating and Indexing the Titles ===")

index_path = os.path.abspath("./CranfieldTitleIndex")
# exist_ok=True makes a separate existence check unnecessary.
os.makedirs(index_path, exist_ok=True)

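# With default settings the indexer applies Terrier's own term pipeline
# (stopword removal and Porter stemming), so the Snowball-stemmed
# Processed_Text is stemmed a second time at indexing.
indexer = pt.DFIndexer(index_path, overwrite=True)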
index_ref = indexer.index(df["Processed_Text"], df["docno"])
print("Index location:", index_ref.toString())
print("Indexing complete! Stored at:", index_ref.toString())


=== Step 1: Creating and Indexing the Titles ===
23:09:06.464 [main] ERROR org.terrier.structures.indexing.Indexer -- Could not rename index
java.io.IOException: Rename of index structure file 'd:\DownLoad\projects\Search Engine\Olivia_Searchengine\preprocessing\CranfieldTitleIndex/data_1.direct.bf' (exists) to 'd:\DownLoad\projects\Search Engine\Olivia_Searchengine\preprocessing\CranfieldTitleIndex/data.direct.bf' (exists) failed - likely that source file is still open. Possible indexing bug?
	at org.terrier.structures.IndexUtil.renameIndex(IndexUtil.java:379)
	at org.terrier.structures.indexing.Indexer.index(Indexer.java:388)
	at org.terrier.structures.indexing.Indexer.index(Indexer.java:355)
Index location: d:\DownLoad\projects\Search Engine\Olivia_Searchengine\preprocessing\CranfieldTitleIndex/data.properties
Indexing complete! Stored at: d:\DownLoad\projects\Search Engine\Olivia_Searchengine\preprocessing\CranfieldTitleIndex/data.properties


In [116]:
print("\n=== Step 2: Loading the Index ===")

index = pt.IndexFactory.of(index_ref)
print("Index loaded successfully!")
print(index)


=== Step 2: Loading the Index ===
Index loaded successfully!
<org.terrier.structures.Index at 0x1506f856a30 jclass=org/terrier/structures/Index jself=<LocalRef obj=0x6c957dc2 at 0x1502cdbe810>>


In [117]:
lexicon = index.getLexicon()

# Print the first 10 lexicon entries: document frequency (Nt), collection
# frequency (TF) and maximum within-document frequency (maxTF).
for i, kv in enumerate(lexicon):
    if i >= 10:
        break
    term = kv.getKey()
    entry = kv.getValue()
    print(f"{term} -> Nt={entry.getNumberOfEntries()} TF={entry.getFrequency()} maxTF={entry.getMaxFrequencyInDocuments()}")

ablat -> Nt=12 TF=12 maxTF=1
accel -> Nt=2 TF=2 maxTF=1
accord -> Nt=1 TF=1 maxTF=1
accumul -> Nt=1 TF=1 maxTF=1
accuraci -> Nt=2 TF=2 maxTF=1
acoust -> Nt=5 TF=5 maxTF=1
act -> Nt=1 TF=1 maxTF=1
action -> Nt=1 TF=1 maxTF=1
activ -> Nt=1 TF=1 maxTF=1
adapt -> Nt=1 TF=1 maxTF=1
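
The three statistics printed for each lexicon entry can be reproduced by hand on a toy collection. In the sketch below, `Nt` is the number of documents containing the term, `TF` is its total number of occurrences in the collection, and `maxTF` is its largest within-document count. `lexicon_stats` and the toy documents are hypothetical; the code only mirrors the meaning of the fields, not Terrier's implementation.

```python
from collections import Counter


def lexicon_stats(docs):
    """Compute (Nt, TF, maxTF) per term for a list of tokenized documents."""
    stats = {}
    for tokens in docs:
        for term, freq in Counter(tokens).items():
            nt, tf, max_tf = stats.get(term, (0, 0, 0))
            stats[term] = (nt + 1, tf + freq, max(max_tf, freq))
    return stats


# Hypothetical stemmed documents.
toy_docs = [
    ["shear", "flow", "flow"],
    ["boundari", "layer", "flow"],
    ["heat", "conduct"],
]
toy_stats = lexicon_stats(toy_docs)
for term, (nt, tf, max_tf) in sorted(toy_stats.items()):
    print(f"{term} -> Nt={nt} TF={tf} maxTF={max_tf}")
```

In the listing above every entry shown has `maxTF=1`, which is what one would expect for short titles in which a term rarely repeats.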


In [118]:
print("\n=== Step 3: Setting Up Search Function ===")
def search_term(term):
    stemmer = SnowballStemmer("english")
    term = term.lower()
    stemmed_term = stemmer.stem(term)

    print(f"\nSearching for: '{term}' (stemmed: '{stemmed_term}')")

    try:
        # The lexicon lookup raises KeyError when the term is not indexed.
        pointer = index.getLexicon()[stemmed_term]
        print(f"Found term '{stemmed_term}' with stats: {pointer.toString()}")
        print("Documents containing the term:")
        postings = index.getInvertedIndex().getPostings(pointer)

        for posting in postings:
            doc_id = posting.getId()
            doc_length = posting.getDocumentLength()
            # Assumes Terrier doc ids follow the DataFrame row order used at indexing time.
            print(f"- Doc ID: {doc_id} (docno: {df['docno'].iloc[doc_id]}), Length: {doc_length}")
    except KeyError:
        print(f"Term '{stemmed_term}' not found in the index.")


=== Step 3: Setting Up Search Function ===


In [119]:
search_term("information")
search_term("Omar")


Searching for: 'information' (stemmed: 'inform')
Found term 'inform' with stats: term700 Nt=1 TF=1 maxTF=1 @{0 5628 7}
Documents containing the term:
- Doc ID: 439 (docno: 440), Length: 8

Searching for: 'omar' (stemmed: 'omar')
Term 'omar' not found in the index.


## Phase 2: Query Processing with TF-IDF Ranking

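This section implements Phase 2 of the project: query processing. Each user query goes through the same preprocessing as in Phase 1 (lowercasing, cleaning, tokenization, Snowball stemming). The inverted index then returns the documents that contain all query terms, and those documents are ranked by TF-IDF score.

In the run recorded below, every sample query returns no documents. The most likely cause is a preprocessing mismatch: the Phase 1 index was built with Terrier's default term pipeline (stopword removal and Porter stemming) on top of the already Snowball-stemmed text, so a query token such as `experiment` is looked up in an index that stores `experi`. The lexicon listing in Phase 1 shows the same effect (`accel`, where Snowball alone produces `acceler`).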

In [120]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
import warnings
warnings.filterwarnings('ignore')

In [121]:
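def preprocess_query(query, stemmer=SnowballStemmer('english')):
    # Mirror the Phase 1 title cleaning: lowercase, keep letters only,
    # collapse whitespace, split on spaces, then Snowball-stem each token.
    query = query.lower()
    query = re.sub(r'[^a-zA-Z\s]', '', query)
    query = re.sub(r'\s+', ' ', query).strip()
    tokens = query.split()
    stemmed_tokens = [stemmer.stem(token) for token in tokens]
    return stemmed_tokens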


sample_query = 'Experimental Aerodynamics Wing'
print('Sample query:', sample_query)
print('Preprocessed query tokens:', preprocess_query(sample_query))

Sample query: Experimental Aerodynamics Wing
Preprocessed query tokens: ['experiment', 'aerodynam', 'wing']


In [122]:
def retrieve_documents(query_tokens, index, df):
    lexicon = index.getLexicon()
    doc_sets = []

    for token in query_tokens:
        try:
            pointer = lexicon[token]
            postings = index.getInvertedIndex().getPostings(pointer)
            doc_ids = [posting.getId() for posting in postings]
            doc_sets.append(set(doc_ids))
        except KeyError:
            print(f"Term '{token}' not found in index.")
            return []

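    if not doc_sets:
        return []
    # AND semantics: keep only documents present in every term's posting list.
    # Sorting makes the result order deterministic.
    common_docs = sorted(set.intersection(*doc_sets))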

    results = []
    for doc_id in common_docs:
        docno = df['docno'].iloc[doc_id]
        title = df['Title'].iloc[doc_id]
        processed_text = df['Processed_Text'].iloc[doc_id]
        results.append({
            'doc_id': doc_id,
            'docno': docno,
            'title': title,
            'processed_text': processed_text
        })

    return results

test_query = 'experimental investigation'
test_tokens = preprocess_query(test_query)
docs = retrieve_documents(test_tokens, index, df)
print(f'\nDocuments retrieved for query "{test_query}":')
for doc in docs[:2]:
    print(f"Docno: {doc['docno']}, Title: {doc['title']}")

Term 'experiment' not found in index.

Documents retrieved for query "experimental investigation":
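
The conjunctive (AND) retrieval step can be illustrated independently of PyTerrier. The sketch below assumes a hypothetical in-memory index that maps each term to a set of document ids; `retrieve_all_terms` and `toy_postings` are illustrative names. Intersecting the smallest posting sets first keeps the intermediate results small.

```python
def retrieve_all_terms(query_tokens, postings):
    """Return the sorted ids of documents that contain every query token."""
    doc_sets = []
    for token in query_tokens:
        if token not in postings:
            # One missing term makes a conjunctive query unsatisfiable.
            return []
        doc_sets.append(postings[token])
    if not doc_sets:
        return []
    # Start from the rarest term so the running intersection stays small.
    doc_sets.sort(key=len)
    result = set(doc_sets[0])
    for docs in doc_sets[1:]:
        result &= docs
    return sorted(result)


# Hypothetical postings: term -> set of document ids.
toy_postings = {
    "shear": {0, 1},
    "flow": {0, 1, 2},
    "plate": {0, 3},
}
print(retrieve_all_terms(["shear", "flow"], toy_postings))  # [0, 1]
print(retrieve_all_terms(["flow", "plate"], toy_postings))  # [0]
print(retrieve_all_terms(["flow", "wing"], toy_postings))   # []
```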


In [123]:
def rank_documents(documents, query_tokens):
    if not documents:
        return []

    corpus = [doc['processed_text'] for doc in documents]
    # Deduplicate query terms: TfidfVectorizer rejects a vocabulary with repeats.
    vocabulary = list(dict.fromkeys(query_tokens))

    # Note: IDF is fitted on the retrieved documents only, not on the full collection.
    vectorizer = TfidfVectorizer(vocabulary=vocabulary)
    try:
        tfidf_matrix = vectorizer.fit_transform(corpus)
        # Score = sum of the document's TF-IDF weights over the query terms.
        scores = tfidf_matrix.sum(axis=1).A1
    except ValueError as e:
        print('TF-IDF calculation failed:', e)
        scores = [0.0] * len(documents)

    for doc, score in zip(documents, scores):
        doc['tfidf_score'] = float(score)

    return sorted(documents, key=lambda d: d['tfidf_score'], reverse=True)

ranked_docs = rank_documents(docs, test_tokens)
print(f'\nTop ranked documents for query "{test_query}":')
for doc in ranked_docs[:2]:
    print(f"Docno: {doc['docno']}, Title: {doc['title']}, TF-IDF Score: {doc['tfidf_score']:.4f}")


Top ranked documents for query "experimental investigation":
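
Because every retrieved document should, in principle, contain all query terms, fitting the vectorizer on the retrieved documents alone gives each query term the same IDF, and the score reduces to a normalized term frequency. A common alternative is to compute IDF over the whole collection. The following is a minimal sketch of that idea in plain Python, using the smoothed form idf = ln((1 + N) / (1 + df)) + 1. The function names and toy data are hypothetical, and this is not the notebook's ranking code.

```python
import math
from collections import Counter


def corpus_idf(corpus):
    """Smoothed IDF per term, computed over the whole tokenized corpus."""
    n_docs = len(corpus)
    doc_freq = Counter(term for tokens in corpus for term in set(tokens))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def rank_by_tfidf(candidate_ids, query_tokens, corpus, idf):
    """Score each candidate as the sum of tf * idf over the distinct query terms."""
    scored = []
    for doc_id in candidate_ids:
        tf = Counter(corpus[doc_id])
        score = sum(tf[t] * idf.get(t, 0.0) for t in set(query_tokens))
        scored.append((doc_id, score))
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical stemmed corpus.
toy_corpus = [
    ["shear", "flow", "plate"],
    ["shear", "flow", "flow", "layer"],
    ["heat", "flow"],
]
idf = corpus_idf(toy_corpus)
ranking = rank_by_tfidf([0, 1], ["shear", "flow"], toy_corpus, idf)
print(ranking)  # document 1 ranks first because "flow" occurs twice in it
```

Here `shear` (2 of 3 documents) receives a higher IDF than `flow` (all 3 documents), so rarer query terms contribute more to the score.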


In [124]:
def search(query, index, df, top_k=5):
    print(f'\n=== Searching for: "{query}" ===')
    query_tokens = preprocess_query(query)
    print('Query tokens:', query_tokens)

    documents = retrieve_documents(query_tokens, index, df)
    if not documents:
        print('No documents found.')
        return
    print(f'Found {len(documents)} documents.')

    ranked_docs = rank_documents(documents, query_tokens)

    print(f'Top {min(top_k, len(ranked_docs))} results:')
    for i, doc in enumerate(ranked_docs[:top_k], 1):
        print(f'{i}. Docno: {doc["docno"]}, Title: {doc["title"]}, TF-IDF Score: {doc["tfidf_score"]:.4f}')

search('experimental investigation', index, df)
search('information retrieval', index, df)
search('nonexistent term', index, df)


=== Searching for: "experimental investigation" ===
Query tokens: ['experiment', 'investig']
Term 'experiment' not found in index.
No documents found.

=== Searching for: "information retrieval" ===
Query tokens: ['inform', 'retriev']
Term 'retriev' not found in index.
No documents found.

=== Searching for: "nonexistent term" ===
Query tokens: ['nonexist', 'term']
Term 'nonexist' not found in index.
No documents found.
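
When every query comes back empty, as in the run above, a quick diagnostic is to compare the preprocessed query tokens against the index vocabulary. The sketch below assumes the vocabulary has been collected into a plain Python set; `missing_terms` and `toy_vocabulary` are hypothetical. The toy vocabulary contains `experi`, the Porter stem of `experiment`, to mimic an index whose terms were stemmed once more than the query.

```python
def missing_terms(query_tokens, vocabulary):
    """Return the query tokens that have no entry in the index vocabulary."""
    return [token for token in query_tokens if token not in vocabulary]


# Hypothetical vocabulary of an index that applied Porter stemming on top of
# Snowball-stemmed text: "experiment" is stored as "experi".
toy_vocabulary = {"experi", "investig", "flow"}

# Query tokens as produced by a single Snowball pass.
query_tokens = ["experiment", "investig"]
print(missing_terms(query_tokens, toy_vocabulary))  # ['experiment']
```

If tokens go missing only because of stemming, the usual remedy is to apply exactly the same preprocessing on both sides: either index the cleaned, unstemmed text and let the indexer's pipeline handle stemming for documents and queries alike, or disable the indexer's term pipeline and keep the manual Snowball stemming.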
