# Experiment #10: Information Retrieval
<b>Mohammed Abed Alkareem</b>
<b>1210708</b>

## 1.2.1  Installation

In [11]:
# !pip install whoosh

## 1.2.2  Preparing the data

In [12]:
# !pip install kaggle

In [13]:
# !kaggle datasets download -d stackoverflow/stacksample

In [14]:
# !unzip stacksample.zip

In [43]:
import pandas as pd
questions=pd.read_csv("stacksample/Questions.csv", nrows=20000)
questions

Unnamed: 0,Id,OwnerUserId,CreationDate,ClosedDate,Score,Title,Body
0,80,26.0,2008-08-01T13:57:07Z,,26,SQLStatement.execute() - multiple queries in o...,<p>I've written a database generation script i...
1,90,58.0,2008-08-01T14:41:24Z,2012-12-26T03:45:49Z,144,Good branching and merging tutorials for Torto...,<p>Are there any really good tutorials explain...
2,120,83.0,2008-08-01T15:50:08Z,,21,ASP.NET Site Maps,<p>Has anyone got experience creating <strong>...
3,180,2089740.0,2008-08-01T18:42:19Z,,53,Function for creating color wheels,<p>This is something I've pseudo-solved many t...
4,260,91.0,2008-08-01T23:22:08Z,,49,Adding scripting functionality to .NET applica...,<p>I have a little game written in C#. It uses...
...,...,...,...,...,...,...,...
19995,1114470,82266.0,2009-07-11T19:37:06Z,,0,"Trim all chars off file name after first ""_""",<p>I'd like to trim these purchase order file ...
19996,1114540,2288585.0,2009-07-11T20:16:06Z,,7,Xcode question: Quickly jump to a particular s...,<p>What is the quickest way to jump to a parti...
19997,1114550,131128.0,2009-07-11T20:20:11Z,,3,Serializing a generic collection with XMLSeria...,<p>Why won't XMLSerializer process my generic ...
19998,1114580,87271.0,2009-07-11T20:35:46Z,,1,Using Yahoo Fire Eagle on Grails / Java,<p>Has anyone implemented the Yahoo Fire Eagle...


## 1.2.3  The Index and Schema objects

In [44]:
from whoosh.fields import Schema, TEXT, ID

# Defining index schema
schema = Schema(Id=ID(stored=True), Title=TEXT(stored=True),Body=TEXT(stored=True))

In [45]:
import os.path
index_dir = "indexdir"
if not os.path.exists(index_dir):
    os.mkdir(index_dir)

In [46]:
from whoosh.index import create_in
from whoosh.index import open_dir

# Creating the index
ix = create_in(index_dir, schema)

# Open the index writer
writer = ix.writer()

# Iterate over the DataFrame and add each question to the index
# (indexed fields: Id, Title and Body)
for index, row in questions.iterrows():
    writer.add_document(Id=str(row['Id']), Title=row['Title'], Body=row['Body'])
    
# Commit and close the writer
writer.commit()

## 1.2.4  How to search

In [47]:
from whoosh.qparser import QueryParser
from whoosh.scoring import TF_IDF
from whoosh import scoring

# create the query parser
qp = QueryParser("Title", schema=schema)

# parse the query
query_sentence = "How to install"
query = qp.parse(query_sentence)

# create a searcher object
searcher_tfidf = ix.searcher(weighting=scoring.TF_IDF())

# search documents and store them
# we are returning the top 3 documents
results_tfidf = searcher_tfidf.search(query, limit=3, scored=True)

# print the documents
for hit in results_tfidf:
    print(hit["Id"])
    print('\n')
    print(hit["Title"])
    print('\n')
    print('------------------\n')

102850


How can I install CPAN modules locally without root access (DynaLoader.pm line 229 error)?


------------------

145900


How can I determine that Windows Installer is performing an upgrade rather than a first time install?


------------------

351640


How to install Hibernate Tools in Eclipse?


------------------



#### Task 1:  
Test the previous search code with different queries. For each one, check how many matched results are returned.

In [52]:
from whoosh.qparser import QueryParser
from whoosh.scoring import TF_IDF
from whoosh import scoring

# create the query parser
qp = QueryParser("Title", schema=schema)

# parse the query
query_sentence = "Why won't"
query = qp.parse(query_sentence)

# create a searcher object
searcher_tfidf = ix.searcher(weighting=scoring.TF_IDF())

# check how many matched results are returned
results_tfidf = searcher_tfidf.search(query, scored=True)
print("Total results found:", len(results_tfidf))

print("\n\n")
# print the documents
for hit in results_tfidf:
    print(hit["Id"])
    print('\n')
    print(hit["Title"])
    print('\n')
    print('------------------\n')



Total results found: 10



118130


C# - Why won't a fullscreen winform app ALWAYS cover the taskbar?


------------------

615900


Any ideas why this NFS setup won't work?


------------------

627050


Why all text boxes won't select?


------------------

719240


Why won't this Javascript Function Work?


------------------

761220


Why won't this PHP iteration work?


------------------

882990


why won't this validate (jquery problem)?


------------------

902120


Why won't my website use the Tahoma font?


------------------

928090


Why won't this jQuery function perform properly?


------------------

953040


Why won't python allow me to delete files?


------------------

959050


Why won't the Auth Component username & password login automagic work as expected in CakePHP?


------------------



#### Task 2:  
Repeat the previous search using the BM25F scoring algorithm, which is used in the probabilistic retrieval model. Do you see any difference in the returned results?

In [53]:
from whoosh.qparser import QueryParser
from whoosh import scoring

# create the query parser
qp = QueryParser("Title", schema=schema)

# parse the query
query_sentence = "Why won't"
query = qp.parse(query_sentence)

# create a searcher object with BM25F weighting
searcher_bm25f = ix.searcher(weighting=scoring.BM25F())

# check how many matched results are returned
results_bm25f = searcher_bm25f.search(query, scored=True)
print("Total results found:", len(results_bm25f))

print("\n\n")
# print the documents
for hit in results_bm25f:
    print(hit["Id"])
    print('\n')
    print(hit["Title"])
    print('\n')
    print('------------------\n')


Total results found: 10



719240


Why won't this Javascript Function Work?


------------------

761220


Why won't this PHP iteration work?


------------------

882990


why won't this validate (jquery problem)?


------------------

627050


Why all text boxes won't select?


------------------

928090


Why won't this jQuery function perform properly?


------------------

615900


Any ideas why this NFS setup won't work?


------------------

902120


Why won't my website use the Tahoma font?


------------------

953040


Why won't python allow me to delete files?


------------------

118130


C# - Why won't a fullscreen winform app ALWAYS cover the taskbar?


------------------

959050


Why won't the Auth Component username & password login automagic work as expected in CakePHP?


------------------



- **Ranking Order**: BM25F ranks the documents differently from TF-IDF. In the TF-IDF output the hits appear in ascending Id order, which suggests that their scores are tied (each title contains the query terms about equally often). BM25F additionally applies term-frequency saturation and document-length normalization, so shorter titles such as "Why won't this Javascript Function Work?" move to the top, while the longest title (Id 959050) stays last.

- **Total Results**: Both searches report 10 results and return the same set of documents. Only the scoring and ranking differ, because the weighting model does not change which documents match the query.
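
The effect of length normalization described above can be illustrated with a minimal sketch. It uses textbook forms of TF-IDF and BM25 for a single term, not Whoosh's exact implementation, and the numbers (`df`, `avg_len`, and the two title lengths) are hypothetical values chosen only for illustration.

```python
import math

def tf_idf(tf, df, n_docs):
    """Textbook TF-IDF: raw term frequency times inverse document frequency."""
    return tf * math.log(n_docs / df)

def bm25(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    """Textbook BM25 for a single term in a single field."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    length_norm = 1 - b + b * doc_len / avg_len
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)

# Hypothetical numbers: a term that occurs once in a short and in a long title
n_docs, df, avg_len = 20000, 10, 9
short_title, long_title = 6, 16  # title lengths in tokens

# TF-IDF ignores length, so both titles get the same score
print("TF-IDF:", tf_idf(1, df, n_docs), tf_idf(1, df, n_docs))

# BM25 penalizes the longer title, so the shorter one scores higher
print("BM25  :", bm25(1, df, n_docs, short_title, avg_len),
      bm25(1, df, n_docs, long_title, avg_len))
```

With equal term frequencies, TF-IDF produces a tie while BM25 prefers the shorter title, which is consistent with the reordering observed in the two result lists.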

## 1.2.5  Query expansion

In [20]:
more_results = results_tfidf[0].more_like_this("Title")

for hit in more_results:
    print(hit["Id"])
    print('\n')
    print(hit["Title"])
    print('\n')
    print('------------------\n')

459590


What is the difference betwen including modules and embedding modules?


------------------

423330


Why can't DynaLoader.pm load SSleay.dll for Net::SSLeay and Crypt::SSLeay?


------------------

540640


How can I install a CPAN module into a local directory?


------------------

172040


How do you develop against OpenID locally


------------------

566290


Silverlight Development - Service URL while developing locally


------------------

766830


How can I locally manage C manuals?


------------------

799860


Using Mercurial locally, only with Subversion server


------------------

852280


Ubuntu: "Could not find rails locally or in a repository"


------------------

78900


How to check for memory leaks in Guile extension modules?


------------------

199180


Is there any way to get python omnicomplete to work with non-system modules in vim?


------------------



In [21]:
keywords = [keyword for keyword, score in results_tfidf.key_terms("Title", docs=10, numterms=5)]

keywords

['install', '229', 'cpan', 'dynaloader.pm', 'locally']
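
The extracted key terms could then be used to expand the original query. The sketch below only builds an expanded query string using the `OR` operator of Whoosh's default query language; the helper name `expand_query` is hypothetical, and passing the result to `qp.parse(...)` is an assumed next step rather than something done in this notebook.

```python
def expand_query(original, keywords):
    """Join the original query and the extra key terms with OR."""
    parts = [f"({original})"] + list(keywords)
    return " OR ".join(parts)

# Key terms as returned by key_terms() in the cell above
keywords = ['install', '229', 'cpan', 'dynaloader.pm', 'locally']

expanded = expand_query("How to install", keywords)
print(expanded)
```

An OR-based expansion trades precision for recall: documents that mention only `cpan` or `locally` would now match, even if they do not contain the original terms.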

## 1.2.6  Evaluating IR systems

In [54]:
queries = {
    'q1': "machine learning",
    'q2':"AI algorithms"
    }

relevance = {
    'q1': ["doc1", "doc2", "doc3"],
    'q2': ["doc1", "doc2", "doc3", "doc4", "doc5"]
    }

documents = {
            'doc1': '''Artificial Intelligence (AI) is transforming variousindustries through automation and advanced algorithms. Machinelearning, a subset of AI, enables computers to learn from data andmake predictions. Algorithms are at the core of AI systems, guidingdecision-making and problem-solving processes. AI-powered systemsare increasingly used in healthcare for diagnosis and treatmentplanning. The ethical implications of AI algorithms, such as biasand fairness, are important considerations in their development.''',
             'doc2': '''Deep learning, a branch of machine learning, uses neuralnetworks to process complex data. AI algorithms are capable ofanalyzing large datasets to extract meaningful insights. NaturalLanguage Processing (NLP) algorithms enable computers to understandand generate human language. AI-driven recommendation algorithmspersonalize user experiences in e-commerce and content platforms.Ensuring the transparency and accountability of AI algorithms isessential for building trust in AI technologies.''',
             'doc3': '''Reinforcement learning algorithms enable AI agents to learnthrough trial and error interactions with their environment. AIalgorithms are used in financial markets for high-frequency tradingand risk management. Computer vision algorithms enable machines tointerpret and analyze visual information. AI algorithms can enhancecybersecurity by detecting and mitigating cyber threats inreal-time. Continuous research and development are essential foradvancing AI algorithms and overcoming their limitations.''',
             'doc4': '''Evolutionary algorithms, inspired by natural selection, areused to optimize complex systems and processes. AI algorithms playa crucial role in autonomous vehicles for navigation anddecision-making. Quantum computing algorithms have the potential torevolutionize AI by solving complex problems exponentially faster.AI algorithms are employed in predictive maintenance to anticipateequipment failures and reduce downtime. Ethical guidelines andregulations are needed to govern the development and deployment ofAI algorithms.''',
             'doc5': '''Genetic algorithms are used to evolve solutions tooptimization and search problems inspired by natural selection. AIalgorithms enable personalized content recommendations in streamingservices and social media platforms. Swarm intelligence algorithmsmimic the collective behavior of social insects to solveoptimization problems. AI algorithms are used in drug discovery toaccelerate the identification of potential treatments.Collaborative efforts are essential for advancing AI algorithms andharnessing their full potential for societal benefit.'''
            }

In [55]:
from whoosh.fields import Schema, TEXT, ID
from whoosh.index import create_in
from whoosh.index import open_dir

# Defining index schema
schema = Schema(Id=ID(stored=True), Body=TEXT(stored=True))

import os.path

index_dir = "indexdir_toy"
if not os.path.exists(index_dir):
    os.mkdir(index_dir)

# Creating the index
ix = create_in(index_dir, schema)

# Open the index writer
writer = ix.writer()

for doc in documents:
    writer.add_document(Id=doc, Body=documents[doc])
    
# Commit and close the writer
writer.commit()

In [56]:
from whoosh.qparser import QueryParser
from whoosh.scoring import TF_IDF
from whoosh import scoring

# create the query parser
qp = QueryParser("Body", schema=schema)

# parse the query
query_sentence = queries['q1']
query = qp.parse(query_sentence)

# create a searcher object
searcher_tfidf = ix.searcher(weighting=scoring.TF_IDF())

# search documents and store them
# we are returning the top 3 documents
results_tfidf = searcher_tfidf.search(query, limit=3, scored=True)

print("Total results found:", len(results_tfidf))

# print the documents
for hit in results_tfidf:
    print(hit["Id"])
    print('\n')
    print(hit["Body"])
    print('\n')
    print('------------------\n')

Total results found: 1
doc2


Deep learning, a branch of machine learning, uses neuralnetworks to process complex data. AI algorithms are capable ofanalyzing large datasets to extract meaningful insights. NaturalLanguage Processing (NLP) algorithms enable computers to understandand generate human language. AI-driven recommendation algorithmspersonalize user experiences in e-commerce and content platforms.Ensuring the transparency and accountability of AI algorithms isessential for building trust in AI technologies.


------------------
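
A possible explanation for the single hit, offered as a sketch rather than a trace of Whoosh internals: the strings in `documents` contain words fused across line breaks (for example `Machinelearning` in doc1), and the default query parser requires every query term to match. The regex tokenizer below is a simplified stand-in for the real analyzer (an assumption); it only shows that `machine` never appears as a separate token in the doc1 excerpt.

```python
import re

def simple_tokens(text):
    """Lowercase word tokens; a simplified stand-in for a real analyzer."""
    return set(re.findall(r"\w+", text.lower()))

# Excerpts copied from the `documents` dictionary above
doc1_excerpt = "Machinelearning, a subset of AI, enables computers to learn from data"
doc2_excerpt = "Deep learning, a branch of machine learning, uses neuralnetworks"

query_terms = simple_tokens("machine learning")

for name, text in [("doc1", doc1_excerpt), ("doc2", doc2_excerpt)]:
    missing = query_terms - simple_tokens(text)
    print(name, "matches" if not missing else f"is missing {sorted(missing)}")
```

Under this simplification doc2 contains both query terms, while doc1 contains neither as a standalone token, so repairing the fused words in the source text would be expected to change the retrieved set.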



#### Task 3:  
Compute the precision and recall for the retrieved documents in the previous example.

In [39]:
# Compute the precision and recall for the retrieved documents in the previous example.
# Precision = (number of relevant documents retrieved) / (number of documents retrieved)
# Recall    = (number of relevant documents retrieved) / (number of relevant documents)


def compute_precision_recall(results, relevance):
    retrieved = {hit["Id"] for hit in results}
    relevant = set(relevance)
    intersection = retrieved & relevant
    # Guard against division by zero when nothing is retrieved or nothing is relevant
    precision = len(intersection) / len(retrieved) if retrieved else 0.0
    recall = len(intersection) / len(relevant) if relevant else 0.0
    return precision, recall

precision, recall = compute_precision_recall(results_tfidf, relevance['q1'])

print(f"Precision: {precision}")
print(f"Recall: {recall}")

Precision: 1.0
Recall: 0.3333333333333333
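
These values can be checked by hand. The TF-IDF search for q1 retrieved only doc2, while `relevance['q1']` lists doc1, doc2 and doc3, so:

$$
\text{Precision} = \frac{|\text{retrieved} \cap \text{relevant}|}{|\text{retrieved}|} = \frac{|\{\text{doc2}\}|}{|\{\text{doc2}\}|} = \frac{1}{1} = 1.0
$$

$$
\text{Recall} = \frac{|\text{retrieved} \cap \text{relevant}|}{|\text{relevant}|} = \frac{|\{\text{doc2}\}|}{|\{\text{doc1}, \text{doc2}, \text{doc3}\}|} = \frac{1}{3} \approx 0.333
$$

The single retrieved document is relevant, which gives perfect precision, but two of the three relevant documents are missed, which limits recall to one third.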


#### Task 4:  
Modify the last code to test all queries and then report the precision and recall.

In [40]:
from whoosh.qparser import QueryParser
from whoosh.scoring import TF_IDF
from whoosh import scoring


for query_key in queries.keys():
    # Create the query parser
    qp = QueryParser("Body", schema=schema)

    # Parse the query
    query_sentence = queries[query_key]
    parsed_query = qp.parse(query_sentence)

    # Create a searcher object
    searcher_tfidf = ix.searcher(weighting=scoring.TF_IDF())

    results_tfidf = searcher_tfidf.search(parsed_query, limit=3, scored=True)
   
    # Print the documents
    for hit in results_tfidf:
        print(hit["Id"])
        print('\n')
        print(hit["Body"])
        print('\n')
        print('------------------\n') 
        
    precision, recall = compute_precision_recall(results_tfidf, relevance[query_key])

    print(f"Precision: {precision}")
    print(f"Recall: {recall}")
    print('\n')
    print('='*50)

doc2


Deep learning, a branch of machine learning, uses neuralnetworks to process complex data. AI algorithms are capable ofanalyzing large datasets to extract meaningful insights. NaturalLanguage Processing (NLP) algorithms enable computers to understandand generate human language. AI-driven recommendation algorithmspersonalize user experiences in e-commerce and content platforms.Ensuring the transparency and accountability of AI algorithms isessential for building trust in AI technologies.


------------------

Precision: 1.0
Recall: 0.3333333333333333


doc1


Artificial Intelligence (AI) is transforming variousindustries through automation and advanced algorithms. Machinelearning, a subset of AI, enables computers to learn from data andmake predictions. Algorithms are at the core of AI systems, guidingdecision-making and problem-solving processes. AI-powered systemsare increasingly used in healthcare for diagnosis and treatmentplanning. The ethical implications of AI algorithms, s