<a href="https://colab.research.google.com/github/telsayed/IR-in-Arabic/blob/master/Summer2021/labs/day2/IR_in_Arabic_Lab2_Indexing%26ExploringIndexing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



# **Information Retrieval** - Winter 2024-2025 (SEM 1) lab notebook 2


This is one of a series of Colab notebooks created for the **Information Retrieval** course. It demonstrates how to index a collection and how to access the resulting index to explore and analyse its contents.

The **learning outcomes** of this notebook are:


*   PyTerrier setup.
*   Preprocessing.
*   Indexing a collection.
*   Accessing and exploring the index.

### **What is PyTerrier?**

**[PyTerrier](https://pyterrier.readthedocs.io/en/latest/)** is a Python framework that relies on the underlying [Terrier information retrieval](http://terrier.org/) toolkit for many indexing and retrieval operations. PyTerrier itself was new in 2020, whereas Terrier is written in Java and has a long history dating back to 2001. PyTerrier therefore makes it easy to run IR experiments in Python while delegating the expensive indexing and retrieval operations to the mature Terrier platform.


### **Setup**
We will first install PyTerrier as follows:

In [40]:
# install the PyTerrier framework
!pip install python-terrier

Collecting python-terrier
  Downloading python-terrier-0.11.0.tar.gz (119 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting matchpy (from python-terrier)
  Downloading matchpy-0.5.5-py3-none-any.whl.metadata (12 kB)
Collecting ir_datasets>=0.3.2 (from python-terrier)
  Downloading ir_datasets-0.5.8-py3-none-any.




The next step is to initialise PyTerrier. This is performed using PyTerrier's init() method. The init() method is needed as PyTerrier must download Terrier's jar file and start the Java virtual machine. We prevent init() from being called more than once by checking started().
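
The initialisation described above has no cell of its own in this notebook, so the following is a minimal sketch of it. It assumes that `python-terrier` is installed and that a Java runtime is available. The helper name `start_pyterrier` is ours and is not part of PyTerrier; `init()` and `started()` are the calls named in the text.

```python
def start_pyterrier():
    """Import PyTerrier and start the JVM once (hypothetical helper)."""
    # Imported lazily so that defining this helper does not itself require Java.
    import pyterrier as pt

    # init() downloads Terrier's jar file and starts the Java virtual machine;
    # started() guards against calling it a second time in the same session.
    if not pt.started():
        pt.init()
    return pt

# Usage in the notebook (uncomment once Java is installed):
# pt = start_pyterrier()
```

Calling `pt = start_pyterrier()` once, before the indexing cell, makes the `pt` name used later in the notebook available.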

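Another resource that we need for this lab is a stopword list. The cell below uses the English stopword list bundled with scikit-learn; the import of the Arabic-Stopwords library is left commented out.

We will now import all the Python libraries needed for this lab: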

In [54]:
#we need to import the following libraries.
import pandas as pd
#to display the full text on the notebook without truncation
pd.set_option('display.max_colwidth', 150)
import re
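from snowballstemmer import stemmer
# English stopword list from scikit-learn, accessed through its public text module
from sklearn.feature_extraction import text as stp
# Arabic alternative (Arabic-Stopwords library), currently disabled:
#import arabicstopwords.arabicstopwords as stp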
#make your loops show a smart progress meter 
from tqdm import tqdm

### **What are DataFrames?** 
[Pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html): Two-dimensional, size-mutable, potentially heterogeneous tabular data. Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects.

In [55]:
#create a new dataframe
my_df=pd.DataFrame([["Ahmed",25,50000],["Fatima",35,690000],["Nada",45,460000]],columns=['name','age','salary'])
my_df

| | name | age | salary |
|---|---|---|---|
| 0 | Ahmed | 25 | 50000 |
| 1 | Fatima | 35 | 690000 |
| 2 | Nada | 45 | 460000 |


In [56]:
# Insert a new row by building a one-row DataFrame and concatenating it
data = {'name': ["Salwa"], 'age': [24], 'salary': [90000]}
new_row = pd.DataFrame(data)

# ignore_index=True renumbers the rows of the result from 0
my_df = pd.concat([my_df, new_row], ignore_index=True)
my_df

| | name | age | salary |
|---|---|---|---|
| 0 | Ahmed | 25 | 50000 |
| 1 | Fatima | 35 | 690000 |
| 2 | Nada | 45 | 460000 |
| 3 | Salwa | 24 | 90000 |


In [57]:
#print just name and salary
my_df[['name','salary']]

| | name | salary |
|---|---|---|
| 0 | Ahmed | 50000 |
| 1 | Fatima | 690000 |
| 2 | Nada | 460000 |
| 3 | Salwa | 90000 |


In [58]:
#print the data about people with salary>60000
my_df[my_df['salary']>60000]

| | name | age | salary |
|---|---|---|---|
| 1 | Fatima | 35 | 690000 |
| 2 | Nada | 45 | 460000 |
| 3 | Salwa | 24 | 90000 |


In [59]:
#increase the salary of all by 1000
def increase_salary(salary):
    return salary+1000
    
my_df["salary"]=my_df["salary"].apply(increase_salary)
my_df

| | name | age | salary |
|---|---|---|---|
| 0 | Ahmed | 25 | 51000 |
| 1 | Fatima | 35 | 691000 |
| 2 | Nada | 45 | 461000 |
| 3 | Salwa | 24 | 91000 |


### **Data preparation**
We will first create five textual documents.

In [60]:
docs_df = pd.DataFrame(
    [
        ["d0", "This is the first day of the information retrieval course"],
        ["d1", "The course is in Arabic for Arab students"],
        ["d2", "Today is May 30, 2021"],
        ["d3", "We hope this course will benefit Arab students"],
        ["d4", "Are you happy with this experience?"],
    ],
    columns=["docno", "raw_text"],
)

docs_df

| | docno | raw_text |
|---|---|---|
| 0 | d0 | This is the first day of the information retrieval course |
| 1 | d1 | The course is in Arabic for Arab students |
| 2 | d2 | Today is May 30, 2021 |
| 3 | d3 | We hope this course will benefit Arab students |
| 4 | d4 | Are you happy with this experience? |


Before indexing our data we need to do the following processing steps:


1.   **Remove stopwords.**
2.   **Normalization.**
3.   **Stemming.**




Let's remove the stopwords.

In [27]:
stp.ENGLISH_STOP_WORDS

frozenset({'a',
           'about',
           'above',
           'across',
           'after',
           'afterwards',
           'again',
           'against',
           'all',
           'almost',
           'alone',
           'along',
           'already',
           'also',
           'although',
           'always',
           'am',
           'among',
           'amongst',
           'amoungst',
           'amount',
           'an',
           'and',
           'another',
           'any',
           'anyhow',
           'anyone',
           'anything',
           'anyway',
           'anywhere',
           'are',
           'around',
           'as',
           'at',
           'back',
           'be',
           'became',
           'because',
           'become',
           'becomes',
           'becoming',
           'been',
           'before',
           'beforehand',
           'behind',
           'being',
           'below',
           'beside',
           'besides'

In [28]:
len(stp.ENGLISH_STOP_WORDS)

318

In [30]:
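# Remove English stopwords from a sentence.
# Note: the stopword list is lowercase and the comparison is case-sensitive,
# so capitalised stopwords such as "This" or "The" are kept.
def remove_stopWords(sentence):
    stopWords = stp.ENGLISH_STOP_WORDS
    terms = [term for term in sentence.split() if term not in stopWords]
    return " ".join(terms)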

docs_df["text"]=docs_df["raw_text"].apply(remove_stopWords)
print("***************************************************************************documents after removing stopwords*********************************************************************")
docs_df

***************************************************************************documents after removing stopwords*********************************************************************


| | docno | raw_text | text |
|---|---|---|---|
| 0 | d0 | This is the first day of the information retrieval course | This day information retrieval course |
| 1 | d1 | The course is in Arabic for Arab students | The course Arabic Arab students |
| 2 | d2 | Today is May 30, 2021 | Today May 30, 2021 |
| 3 | d3 | We hope this course will benefit Arab students | We hope course benefit Arab students |
| 4 | d4 | Are you happy with this experience? | Are happy experience? |
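
In the table above, capitalised stopwords such as "This", "The", "We" and "Are" survive because the stopword list is lowercase and the membership test is case-sensitive. The following is a minimal sketch of a case-insensitive variant. `STOP_WORDS` is a small stand-in list chosen for illustration (an assumption, not the scikit-learn list), and `remove_stopwords_ci` is a hypothetical helper name.

```python
# Small stand-in for stp.ENGLISH_STOP_WORDS (assumption: illustration only).
STOP_WORDS = frozenset({
    "this", "is", "the", "first", "of", "we", "will",
    "are", "you", "with", "in", "for",
})

def remove_stopwords_ci(sentence, stop_words=STOP_WORDS):
    """Drop stopwords, comparing in lowercase but keeping the original casing."""
    return " ".join(t for t in sentence.split() if t.lower() not in stop_words)

print(remove_stopwords_ci("This is the first day of the information retrieval course"))
# day information retrieval course
```

An alternative design is to lowercase the whole text before stopword removal, that is, to run normalization first.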


After removing the stopwords the next step is to normalize our documents.

In [34]:
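# Normalize a document: lowercase, strip non-letters, collapse whitespace.
# The print() calls show each intermediate result.
def normalize(text):
    lower_string = text.lower()
    print(lower_string)
    # Remove everything except ASCII letters and whitespace: punctuation, digits,
    # and also any non-Latin letters such as Arabic script
    cleaned_string = re.sub(r'[^a-zA-Z\s]', '', lower_string)
    print(cleaned_string)
    # Collapse runs of whitespace into single spaces
    normalized_string = ' '.join(cleaned_string.split())
    print(normalized_string)
    return normalized_string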

docs_df["text"]=docs_df["text"].apply(normalize)
print("***************************************************************************documents after normalizing*********************************************************************")
docs_df  

this day information retrieval course
this day information retrieval course
this day information retrieval course
the course arabic arab students
the course arabic arab students
the course arabic arab students
today may 30, 2021
today may  
today may
we hope course benefit arab students
we hope course benefit arab students
we hope course benefit arab students
are happy experience?
are happy experience
are happy experience
***************************************************************************documents after normalizing*********************************************************************


| | docno | raw_text | text |
|---|---|---|---|
| 0 | d0 | This is the first day of the information retrieval course | this day information retrieval course |
| 1 | d1 | The course is in Arabic for Arab students | the course arabic arab students |
| 2 | d2 | Today is May 30, 2021 | today may |
| 3 | d3 | We hope this course will benefit Arab students | we hope course benefit arab students |
| 4 | d4 | Are you happy with this experience? | are happy experience |
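
The pattern `[^a-zA-Z\s]` used above keeps only ASCII letters, so it would also delete every Arabic letter. That matters for Exercise 4. The sketch below is one possible script-independent variant, not the notebook's method: it relies on `\w` matching Unicode word characters in Python 3 and replaces removed characters with a space instead of deleting them. The name `normalize_unicode` is hypothetical.

```python
import re

def normalize_unicode(text):
    """Lowercase, then keep only letters (of any script) separated by single spaces."""
    text = text.lower()
    # [^\w\s] matches punctuation and symbols; [\d_] matches digits and underscores.
    text = re.sub(r"[^\w\s]|[\d_]", " ", text)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())

print(normalize_unicode("Today May 30, 2021"))   # today may
print(normalize_unicode("مرحبا، بالعالم!"))       # مرحبا بالعالم
```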


The last processing step is to stem the terms in each document.

In [38]:
# Create an English Snowball stemmer; pass another supported language name
# (for example "arabic") to stem text in that language
stemmerObj = stemmer("english")
#define the stemming function
def stem(sentence):
    return " ".join([stemmerObj.stemWord(i) for i in sentence.split()])

docs_df['text']=docs_df['text'].apply(stem)
print("***************************************************************************documents after stemming*********************************************************************")
docs_df

***************************************************************************documents after stemming*********************************************************************


| | docno | raw_text | text |
|---|---|---|---|
| 0 | d0 | This is the first day of the information retrieval course | this day inform retriev cours |
| 1 | d1 | The course is in Arabic for Arab students | the cours arab arab student |
| 2 | d2 | Today is May 30, 2021 | today may |
| 3 | d3 | We hope this course will benefit Arab students | we hope cours benefit arab student |
| 4 | d4 | Are you happy with this experience? | are happi experi |
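
One visible effect of stemming is that "Arabic" and "Arab" collapse to the same stem, `arab`, so document d1 now contains that term twice. The following sketch counts term and document frequencies with the standard library only. The `stemmed_docs` dictionary is copied from the table above, and the variable names are ours.

```python
from collections import Counter

# Stemmed text copied from the table above.
stemmed_docs = {
    "d0": "this day inform retriev cours",
    "d1": "the cours arab arab student",
    "d2": "today may",
    "d3": "we hope cours benefit arab student",
    "d4": "are happi experi",
}

# Term frequency: how often each stem occurs inside one document.
term_freq = {docno: Counter(text.split()) for docno, text in stemmed_docs.items()}

# Document frequency: in how many documents each stem occurs.
doc_freq = Counter(term for counts in term_freq.values() for term in counts)

print(term_freq["d1"]["arab"])  # 2
print(doc_freq["cours"])        # 3
```

These two statistics are among the quantities an index typically stores for each term.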


Next, we will index the dataframe's documents. The index, with all its data structures, is saved into a directory called **myFirstIndex**.

In [133]:
indexer = pt.DFIndexer("./myFirstIndex", overwrite=True)
# The default tokeniser targets English text, so we switch to "UTFTokeniser", which also handles non-English text
indexer.setProperty("tokeniser", "UTFTokeniser")
# index the text, record the docnos as metadata
index_ref = indexer.index(docs_df["text"], docs_df["docno"])
index_ref.toString()

  indexer = pt.DFIndexer("./myFirstIndex", overwrite=True)


Exception: Unable to find JAVA_HOME
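
The exception above indicates that PyTerrier could not locate a Java installation on this machine. The following sketch, using only the standard library, reports whether the `JAVA_HOME` environment variable is set and whether a `java` executable is on the `PATH`. The helper name `find_java` is hypothetical, and the right value for `JAVA_HOME` depends on where Java is installed.

```python
import os
import shutil

def find_java(environ=None):
    """Return (java_home, java_on_path); either value may be None."""
    environ = os.environ if environ is None else environ
    # Treat an empty JAVA_HOME the same as an unset one.
    java_home = environ.get("JAVA_HOME") or None
    # shutil.which searches the given PATH string (or the process PATH if None).
    java_on_path = shutil.which("java", path=environ.get("PATH"))
    return java_home, java_on_path

java_home, java_on_path = find_java()
if java_home is None:
    print("JAVA_HOME is not set; point it at a Java installation before starting PyTerrier.")
else:
    print(f"JAVA_HOME = {java_home}")
print(f"java executable on PATH: {java_on_path}")
```

If Java is installed but `JAVA_HOME` is missing, setting the variable (for example through `os.environ["JAVA_HOME"]`) before PyTerrier is initialised is one way to proceed.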

In [135]:
import os

# Build a simple in-memory inverted index over the .txt files in a directory:
# each term maps to the list of files that contain it.
def create_inverted_index(directory_path):
    inverted_index = {}

    for filename in os.listdir(directory_path):
        if not filename.endswith(".txt"):
            continue
        file_path = os.path.join(directory_path, filename)
        with open(file_path, 'r') as file:
            # Lowercase the content and split it into terms on whitespace
            terms = file.read().lower().split()

        # set() records each file once per term, however often the term occurs in it
        for term in set(terms):
            inverted_index.setdefault(term, []).append(filename)

    return inverted_index

# Example usage: index the .txt files in the current directory
text_files_directory = '.'
inverted_index = create_inverted_index(text_files_directory)

# Look up a term to find the documents that contain it
search_term = 'python'
if search_term in inverted_index:
    print(f"Documents containing '{search_term}': {inverted_index[search_term]}")
else:
    print(f"'{search_term}' not found in any documents.")


Documents containing 'python': ['frequency_bigramdictionary_en_243_342.txt', 'frequency_dictionary_en_82_765.txt']
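
The same dictionary-based idea can be applied to the five in-memory documents of this lab, and extended to record where each term occurs. This is a pure-Python sketch for illustration only and says nothing about how Terrier stores positions internally. The names `docs` and `build_positional_index` are ours, and the stemmed text is copied from the preprocessing output above.

```python
from collections import defaultdict

# Stemmed documents from the preprocessing steps above.
docs = {
    "d0": "this day inform retriev cours",
    "d1": "the cours arab arab student",
    "d2": "today may",
    "d3": "we hope cours benefit arab student",
    "d4": "are happi experi",
}

def build_positional_index(docs):
    """Map each term to {docno: [positions]}, with 0-based token offsets."""
    index = defaultdict(dict)
    for docno, text in docs.items():
        for position, term in enumerate(text.split()):
            index[term].setdefault(docno, []).append(position)
    return dict(index)

index = build_positional_index(docs)
print(index["arab"])           # {'d1': [2, 3], 'd3': [4]}
print(sorted(index["cours"]))  # ['d0', 'd1', 'd3']
```

The number of keys in `index[term]` is the term's document frequency, and the length of each position list is its frequency within that document.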


### **Exercise 1**
How many documents mention your country's name? Which documents are those?

### **Exercise 2**
Select any document from the collection and check which of its terms appear in the index.


### **Exercise 3**
How can we update our index so that it also stores the positions of the terms? Hint: you can use the [PyTerrier documentation](https://pyterrier.readthedocs.io/_/downloads/en/latest/pdf/) as a reference.

### **Exercise 4**
Index an Arabic collection of your choice. You can use the Arabic datasets available at [Huggingface](https://huggingface.co/datasets?filter=languages:ar).

### **References**


* [Pandas DataFrames documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).
* IR From Bag-of-words to BERT and Beyond through Practical Experiments. [PyTerrier ECIR2021 Tutorial](https://github.com/terrier-org/ecir2021tutorial).
* [PyTerrier documentation](https://pyterrier.readthedocs.io/_/downloads/en/latest/pdf/).
* [Processing Arabic text in Python](https://alraqmiyyat.github.io/2013/01-02.html).

