# Lab Practice-III (Information Retrieval in AI Lab)


### Group A: CO1, 2, 3 (Any two)


#### 1. Implement a Conflation algorithm to generate a document representative of a text file.





**Conflation Algorithm**

Conflation is the process of treating different forms of a word as a single term. A conflation algorithm uses it to generate a document representative: a compact list of index terms that stands in for a text file. It works by removing affixes so that variants such as "connect", "connected", and "connection" reduce to a common stem and, optionally, by grouping words that have similar meanings.

The conflation algorithm can be implemented in the following steps:

1. **Tokenize the text.** This involves splitting the text into individual words and punctuation marks.

2. **Remove stop words**. Stop words are common words that do not add much meaning to a document, such as "the", "is", and "of".

3. **Stem the words.** This involves removing affixes from words, such as prefixes and suffixes. This can be done using a variety of stemming algorithms, such as the Porter stemmer or the Snowball stemmer.

4. **Identify synonyms.** This can be done using a variety of methods, such as using a wordnet or using a statistical method to identify words that have similar meanings.

5. **Group the words into classes.** This can be done using a variety of methods, such as clustering or using a statistical method to identify groups of words that are semantically related.

6. **Generate the document representative.** This can be done by selecting the most representative words from each class, or by generating a weighted vector of words, where the weight of a word represents its importance to the document.

**Theory**

The conflation algorithm is based on the assumption that words sharing a stem, or words with similar meanings, usually refer to the same concept and can therefore be treated as one index term. Merging them shrinks the vocabulary and lets a query match documents that use a different form of the same word.

Removing affixes from words helps to identify the stem of a word, which is often its most informative part. For example, the words "running" and "runs" both reduce to the stem "run", so they can be grouped together as related words. Irregular forms such as "ran" are not handled by affix removal: a stemmer leaves "ran" unchanged, and mapping it to "run" requires a dictionary-based method such as lemmatization.

Identifying words that have similar meanings is more difficult, but there are a variety of methods that can be used. One common method is to use a wordnet. A wordnet is a database of words that are linked together by their semantic relationships. For example, the word "dog" is linked to the words "animal", "canine", and "puppy" in a wordnet.

Another method for identifying words that have similar meanings is statistical. Each word is represented as a vector, for example of co-occurrence counts with other words, and two words are compared using the cosine similarity of their vectors. A cosine similarity close to 1 means the words occur in similar contexts and are therefore likely to have similar meanings.
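As a rough illustration of the statistical approach, the sketch below computes the cosine similarity between small hand-made word vectors. The vectors, and the context words they are assumed to count, are invented for the example; a real system would derive them from a corpus.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_u * norm_v)

# Hypothetical co-occurrence counts with the context words
# ("bark", "pet", "engine"); the numbers are made up.
vectors = {
    "dog":   [4, 5, 0],
    "puppy": [3, 4, 0],
    "car":   [0, 1, 6],
}

print(cosine_similarity(vectors["dog"], vectors["puppy"]))  # close to 1
print(cosine_similarity(vectors["dog"], vectors["car"]))    # much smaller
```

Pairs whose similarity exceeds a chosen threshold could then be placed in the same class.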

Once the words have been grouped together into classes, the document representative can be generated. This can be done by selecting the most representative words from each class, or by generating a weighted vector of words, where the weight of a word represents its importance to the document.
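The weighted-vector option can be sketched as follows, assuming relative term frequency as the weight. This is only one possible weighting scheme, and the function name and input stems are illustrative.

```python
from collections import Counter

def weighted_representative(stems):
    """Map each stem to its relative frequency in the document."""
    counts = Counter(stems)
    total = sum(counts.values())
    return {stem: count / total for stem, count in counts.items()}

# Stems as they might come out of the stemming step (illustrative input).
stems = ["retriev", "document", "index", "document", "retriev", "document"]
weights = weighted_representative(stems)

# Print the stems from most to least important.
for stem, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{stem}: {weight:.2f}")
```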

In [8]:
import re
import nltk

class ConflationAlgorithm:
    def __init__(self, stop_words=None):
        self.stop_words = stop_words or set()
        self.stemmer = nltk.stem.PorterStemmer()

    def conflate(self, text):
        # Tokenize the text: lowercase it and keep alphanumeric runs, so that
        # punctuation such as the final "." is not glued to a token.
        tokens = re.findall(r"[a-z0-9]+", text.lower())

        # Remove stop words.
        tokens = [token for token in tokens if token not in self.stop_words]

        # Stem the words.
        tokens = [self.stemmer.stem(token) for token in tokens]

        # Identify synonyms. For now every stem is only its own synonym.
        synonyms = {}
        for token in tokens:
            synonyms.setdefault(token, set()).add(token)

            # TODO: Use a wordnet or statistical method to identify synonyms.

        # Group the words into classes. A stem joins the first class that
        # already contains it; otherwise it starts a new class.
        classes = []
        for token in tokens:
            class_found = False
            for class_ in classes:
                if token in class_:
                    class_.add(token)
                    class_found = True
                    break

            if not class_found:
                classes.append(set(synonyms[token]))

        # Generate the document representative: one term per class.
        document_representative = []
        for class_ in classes:
            document_representative.append(max(class_, key=len))

        return document_representative

# Example usage:

conflation_algorithm = ConflationAlgorithm(stop_words={"this", "is", "a"})
document_representative = conflation_algorithm.conflate("This is a test document.")

print(document_representative)


['test', 'document']


#### 2. Implement Single-pass Algorithm for the clustering of files. (Consider 4 to 5 files).


#### 3. Implement a program for retrieval of documents using inverted files.

**Theory**

An inverted index is a data structure that maps words to the documents in which they appear. It is a common data structure used in information retrieval systems, such as search engines.

To create an inverted index, we first need to tokenize the documents, which means splitting them into individual words. We then need to remove stop words, which are common words that do not add much meaning to a document, such as "the", "is", and "of".

Once we have tokenized and removed stop words, we can start building the inverted index. For each word in each document, we add the document ID to the inverted index entry for that word.

To search for documents using an inverted index, we simply need to split the query into words and then look up each word in the inverted index. The documents that contain all of the query words will be the results of the search.

Inverted indexes are very efficient for searching large collections of documents. This is because we can quickly find all of the documents that contain a given word by simply looking up the word in the inverted index.

Advantages of using inverted indexes:

1. Inverted indexes are very efficient for searching large collections of documents.
2. Inverted indexes can be used to implement a variety of search features, such as Boolean search, phrase search, and proximity search.
3. Inverted indexes are relatively easy to implement.
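Phrase search needs more than a set of document IDs per word: the index must also record where each word occurs. The sketch below shows one way to do this with a positional index. The class and method names are illustrative, and tokenization is reduced to lowercasing and splitting on whitespace.

```python
from collections import defaultdict

class PositionalIndex:
    """Inverted index that also stores the position of each word occurrence."""

    def __init__(self):
        # word -> {document_id -> [positions]}
        self.index = defaultdict(lambda: defaultdict(list))

    def add_document(self, document_id, text):
        for position, word in enumerate(text.lower().split()):
            self.index[word][document_id].append(position)

    def phrase_search(self, phrase):
        words = phrase.lower().split()
        if not words or any(word not in self.index for word in words):
            return set()
        results = set()
        # Start from every occurrence of the first word and check that the
        # remaining words follow at consecutive positions.
        for document_id, positions in self.index[words[0]].items():
            for start in positions:
                if all(
                    start + offset in self.index[word].get(document_id, [])
                    for offset, word in enumerate(words[1:], start=1)
                ):
                    results.add(document_id)
                    break
        return results

index = PositionalIndex()
index.add_document(1, "information retrieval with inverted files")
index.add_document(2, "retrieval of information from files")
print(index.phrase_search("information retrieval"))
```

Both documents contain the two words, but only document 1 contains them next to each other in that order. A proximity search could relax the consecutive-position check to a maximum distance.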

Disadvantages of using inverted indexes:

1. Inverted indexes can be large, especially for large collections of documents.
2. Inverted indexes need to be updated whenever a document is added to or removed from the collection.
3. Inverted indexes can be complex to implement for more advanced search features, such as fuzzy search.

In [9]:
import re


class InvertedIndex:
    def __init__(self):
        # Maps each word to the set of IDs of the documents that contain it.
        self.index = {}

    @staticmethod
    def tokenize(text):
        # Lowercase and keep alphanumeric runs, so that "document." and
        # "Document" are both indexed as "document".
        return re.findall(r"[a-z0-9]+", text.lower())

    def add_document(self, document_id, document):
        for word in self.tokenize(document):
            self.index.setdefault(word, set()).add(document_id)

    def search(self, query):
        # Return the documents that contain all of the query words.
        words = self.tokenize(query)
        if not words:
            return set()
        results = set(self.index.get(words[0], set()))
        for word in words[1:]:
            results &= self.index.get(word, set())
        return results


def main():
    inverted_index = InvertedIndex()

    # Add documents to the inverted index.
    inverted_index.add_document(1, "This is the first document.")
    inverted_index.add_document(2, "This is the second document.")
    inverted_index.add_document(3, "This is the third document.")

    # Search for documents containing the query "document".
    results = inverted_index.search("document")

    # Print the results in ascending order of document ID.
    for document_id in sorted(results):
        print(document_id)


if __name__ == "__main__":
    main()

### Group B: CO3, 5 (Any two)




#### 1. Implement a program to calculate precision and recall for sample input. (Answer set A, Query q1, Relevant documents to query q1- Rq1 )

**Theory**

Precision and recall are two important metrics used to evaluate the performance of information retrieval systems.

Precision is the fraction of retrieved documents that are relevant to the user's query. Recall is the fraction of relevant documents that are retrieved.

Precision and recall are often at odds with each other. Retrieving fewer documents, keeping only the most confident matches, usually improves precision, but it reduces recall because some relevant documents are left out. Retrieving more documents has the opposite effect.

To achieve a good balance between precision and recall, we need to use a variety of techniques, such as ranking the retrieved documents by relevance and using a relevance threshold.
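The effect of a cut-off on a ranked result list can be sketched as follows. The function name, the ranking, and the relevant set are invented for the example; the cut-off `k` plays the role of the threshold.

```python
def precision_recall_at_k(ranked_documents, relevant_documents, k):
    """Precision and recall computed over the top-k ranked documents."""
    top_k = ranked_documents[:k]
    hits = sum(1 for document in top_k if document in relevant_documents)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_documents) if relevant_documents else 0.0
    return precision, recall

# Hypothetical ranking, best match first, and the relevant set for the query.
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d3", "d4"}

for k in range(1, len(ranked) + 1):
    precision, recall = precision_recall_at_k(ranked, relevant, k)
    print(f"k={k}: precision={precision:.2f}, recall={recall:.2f}")
```

As `k` grows, recall can only stay the same or increase, while precision tends to fall once non-relevant documents enter the list.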

Advantages of using precision and recall:

1. Precision and recall are easy to understand and calculate.
2. Precision and recall can be used to compare the performance of different information retrieval systems.
3. Precision and recall can be used to identify the areas where an information retrieval system needs to be improved.

Disadvantages of using precision and recall:

1. Precision and recall are not the only metrics that are important for evaluating the performance of information retrieval systems. Other metrics, such as user satisfaction, are also important.
2. Precision and recall can be biased towards certain types of queries. For example, precision and recall are often higher for short and specific queries than for long and general queries.

In [10]:
def calculate_precision_and_recall(relevant_documents, retrieved_documents):
  """Calculates the precision and recall for a given set of relevant and retrieved documents.

  Args:
    relevant_documents: A set of relevant documents.
    retrieved_documents: A set of retrieved documents.

  Returns:
    A tuple of (precision, recall).
  """

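  relevant_retrieved = relevant_documents & retrieved_documents
  # Guard against empty sets to avoid division by zero.
  precision = len(relevant_retrieved) / len(retrieved_documents) if retrieved_documents else 0.0
  recall = len(relevant_retrieved) / len(relevant_documents) if relevant_documents else 0.0
  return precision, recall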


# Sample input:

relevant_documents = {"1", "2", "3"}
retrieved_documents = {"1", "2", "4"}

# Calculate precision and recall.

precision, recall = calculate_precision_and_recall(relevant_documents, retrieved_documents)

# Print the results.

print("Precision:", precision)
print("Recall:", recall)

Precision: 0.6666666666666666
Recall: 0.6666666666666666


#### 2. Write a program to calculate the harmonic mean (F-measure) and E-measure for the above example.

**Theory**

The harmonic mean is a way of combining two numbers into a single number, where the two numbers are weighted equally. The harmonic mean of two numbers x and y is calculated as follows:

 **harmonic_mean(x, y) = 2 * xy / (x + y)**

The F-measure is the weighted harmonic mean of precision and recall. The beta parameter controls the importance of recall relative to precision: a beta greater than 1 gives more weight to recall, a beta less than 1 gives more weight to precision, and beta = 1 weights them equally.

The E-measure is the complement of the F-measure, E = 1 - F, so lower values indicate better performance. When the beta parameter is equal to 1, it is one minus the harmonic mean of precision and recall.
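Written out with P for precision and R for recall, and assuming the common definition of the E-measure as the complement of the F-measure, the two measures are:

```latex
F_\beta = \frac{(1 + \beta^2)\, P R}{\beta^2 P + R},
\qquad
E_\beta = 1 - F_\beta
```

For the earlier sample with P = R = 2/3 and beta = 1, this gives F = 2 · (4/9) / (4/3) = 2/3 and E = 1 - 2/3 = 1/3. As beta grows large, F approaches R; as beta approaches 0, F approaches P.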

Advantages of using the F-measure and E-measure:

1. The F-measure and E-measure are single metrics that combine precision and recall into a single number. This makes them easy to understand and interpret.
2. The F-measure and E-measure can be used to compare the performance of different information retrieval systems.
3. The F-measure and E-measure can be used to identify the areas where an information retrieval system needs to be improved.

Disadvantages of using the F-measure and E-measure:

1. The F-measure and E-measure are biased towards certain types of queries. For example, the F-measure and E-measure are often higher for short and specific queries than for long and general queries.
2. The F-measure and E-measure do not take into account other important factors, such as user satisfaction.

In [11]:
def calculate_f_measure(precision, recall, beta=1):
  """Calculates the F-measure for a given precision and recall.

  Args:
    precision: The precision.
    recall: The recall.
    beta: The beta parameter.

  Returns:
    The F-measure.
  """

  f_measure = (1 + beta**2) * (precision * recall) / ((beta**2 * precision) + recall)
  return f_measure


def calculate_e_measure(precision, recall, beta=1):
  """Calculates the E-measure for a given precision and recall.

  Args:
    precision: The precision.
    recall: The recall.
    beta: The beta parameter.

  Returns:
    The E-measure, which is 1 minus the F-measure.
  """

  e_measure = 1 - calculate_f_measure(precision, recall, beta)
  return e_measure


# Sample input:

precision = 0.6666666666666666
recall = 0.6666666666666666

# Calculate F-measure and E-measure.

f_measure = calculate_f_measure(precision, recall)
e_measure = calculate_e_measure(precision, recall)

# Print the results.

print("F-measure:", f_measure)
print("E-measure:", e_measure)


F-measure: 0.6666666666666666
E-measure: 0.33333333333333337


### Group C: CO4 (Any two)



#### 1. Build the web crawler to pull product information and links from an e-commerce website. (Python)


To build a web crawler to pull product information and links from an e-commerce website, you can follow these steps:

1. **Choose a programming language.** Python is a popular choice for web crawling, due to its simplicity and the availability of many libraries for web scraping.
2. **Install the necessary libraries.** For Python, you will need to install the following libraries:

    *   requests: This library allows you to make HTTP requests to websites.
    *   BeautifulSoup (installed as `beautifulsoup4`, imported from `bs4`): This library allows you to parse HTML and extract data from it.

3. **Identify the start URLs.** These are the URLs that your crawler will start crawling from. For example, if you want to crawl the Amazon website, you could start with the following start URL: https://www.amazon.com/.
4. **Write a function to crawl a URL.** This function should take a URL as input and return a list of URLs that it found on the page. You can use the requests library to download the HTML of the page and the BeautifulSoup library to parse it and extract the links.
5. **Write a function to extract product information.** This function should take a URL as input and return a dictionary of product information, such as the product name, price, and description. You can use the BeautifulSoup library to parse the HTML of the page and extract the product information.
6. **Write a main function.** This function should initialize the crawler and then start crawling the start URLs. It should also call the function to extract product information for each product that it finds.
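Step 4 can be tried out without any network access. The sketch below extracts links from an HTML string and resolves relative URLs. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra installs; the class name, function name, and sample page are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

# Made-up page fragment standing in for a downloaded product listing.
sample_html = """
<a href="/products/1">Product 1</a>
<a href="https://shop.example.com/products/2">Product 2</a>
<a name="top">No link here</a>
"""
print(extract_links(sample_html, "https://shop.example.com/catalog"))
```

Anchors without an `href` attribute are skipped, which avoids the lookup error that a plain `link['href']` would raise on such tags.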

In [None]:
!pip install requests beautifulsoup4

In [11]:
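import requests
from collections import deque
from urllib.parse import urljoin
from bs4 import BeautifulSoup

class WebCrawler:
    def __init__(self, start_urls, max_pages=10):
        self.queue = deque(start_urls)   # URLs waiting to be crawled
        self.crawled_urls = set()
        self.max_pages = max_pages       # stop after this many pages
        self.products = []

    @staticmethod
    def _text(element):
        # Return the stripped text of a tag, or None if the tag was not found.
        return element.get_text(strip=True) if element else None

    def crawl(self):
        while self.queue and len(self.crawled_urls) < self.max_pages:
            url = self.queue.popleft()
            if url in self.crawled_urls:
                continue
            self.crawled_urls.add(url)

            try:
                response = requests.get(url, timeout=10)
                response.raise_for_status()
            except requests.RequestException:
                continue  # skip pages that cannot be downloaded

            soup = BeautifulSoup(response.content, 'html.parser')

            # Extract product information. The tags, classes and ids below are
            # placeholders and must be adapted to the markup of the target site.
            name = self._text(soup.find('h1', class_='product-title'))
            if name:
                # Save the product information
                self.products.append({
                    'url': url,
                    'name': name,
                    'price': self._text(soup.find('span', class_='price-block__final-price')),
                    'description': self._text(soup.find('div', id='product-description')),
                })

            # Add the links on the page to the crawler's queue,
            # resolving relative URLs against the current page.
            for link in soup.find_all('a', href=True):
                self.queue.append(urljoin(url, link['href']))

        return self.products

if __name__ == '__main__':
    crawler = WebCrawler(['https://www.amazon.com/'])
    for product in crawler.crawl():
        print(product)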

#### 2. Write a program to find the live weather report (temperature, wind speed, description, and weather) of a given city. (Python).


To write a program to find the live weather report (temperature, wind speed, description, and weather) of a given city in Python, you can follow these steps:

1. **Install the necessary libraries**. You will need the following Python libraries:

    * requests: This library allows you to make HTTP requests to websites. Install it with `pip install requests`.
    * json: This library allows you to parse JSON data. It is part of the Python standard library, so it does not need to be installed.

2. **Identify the weather API that you want to use**. There are many different weather APIs available, such as OpenWeatherMap. (The Dark Sky API, formerly a common choice, has been shut down.) Choose an API that provides the data that you need in a format that you can easily parse.
3. **Write a function to make a request to the weather API**. This function should take the city name as input and return a JSON object containing the weather data. You can use the requests library to make the HTTP request and the json library to parse the JSON response.
4. **Write a function to extract the weather information from the JSON object**. This function should take the JSON object as input and return a dictionary containing the temperature, wind speed, description, and weather.
5. **Write a main function**. This function should initialize the program and then call the function to get the weather information for the given city. It should then print the weather information to the console.
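Step 4 can be tested offline against a sample payload. The sketch below assumes a response shaped like the OpenWeatherMap current-weather JSON, with the keys `main.temp`, `wind.speed`, and `weather[0]`; the function name and the sample values are invented.

```python
import json

def extract_weather(weather_data):
    """Pick the four requested fields out of a parsed API response."""
    condition = weather_data["weather"][0]
    return {
        "temperature": weather_data["main"]["temp"],
        "wind_speed": weather_data["wind"]["speed"],
        "description": condition["description"],
        "weather": condition["main"],
    }

# Hand-written sample payload; the values are invented.
sample_response = json.loads("""
{
  "weather": [{"main": "Clouds", "description": "scattered clouds"}],
  "main": {"temp": 27.4},
  "wind": {"speed": 3.6}
}
""")
print(extract_weather(sample_response))
```

Returning a dictionary rather than a tuple lets the caller refer to each value by name.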

In [11]:
import requests
import json

def get_weather_report(city_name):
    api_key = 'YOUR_API_KEY'
    url = 'https://api.openweathermap.org/data/2.5/weather'
    # units=metric requests degrees Celsius; without it the API returns Kelvin.
    params = {'q': city_name, 'appid': api_key, 'units': 'metric'}

    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()  # fail early on an unknown city or invalid key
    weather_data = json.loads(response.content)

    return weather_data['main']['temp'], weather_data['wind']['speed'], weather_data['weather'][0]['description'], weather_data['weather'][0]['main']

def main():
    city_name = input('Enter the city name: ')

    weather_report = get_weather_report(city_name)

    temperature = weather_report[0]
    wind_speed = weather_report[1]
    description = weather_report[2]
    weather = weather_report[3]

    print(f'The weather in {city_name} is {weather} ({description}) with a temperature of {temperature} degrees Celsius and a wind speed of {wind_speed} meters per second.')

if __name__ == '__main__':
    main()


#### 3. Case study on recommender system for a product / Doctor / Product price / Music.

This case study covers a recommender system for music, together with the underlying theory.

**Introduction**

Recommender systems are used to suggest items to users based on their past behavior or preferences. They are used in a variety of applications, such as e-commerce, streaming services, and social media.

Music recommender systems suggest songs or playlists to users based on their listening habits, preferences, and other factors. They can be used to help users discover new music, find similar songs to the ones they already like, and create personalized playlists.

**Theory**

There are two main types of music recommender systems: collaborative filtering and content-based filtering.

Collaborative filtering recommender systems suggest items to users based on the preferences of other users who have similar tastes. For example, if a user likes to listen to pop music, and other users who like pop music also like to listen to rock music, then a collaborative filtering recommender system might suggest rock music to that user.
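A minimal sketch of user-based collaborative filtering, assuming each user is described only by a set of liked genres and that user similarity is measured with the Jaccard coefficient. The user names, the data, and both function names are invented for the example.

```python
def jaccard(a, b):
    """Overlap between two sets, from 0 (disjoint) to 1 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(target_user, likes, top_n=3):
    """Score unseen items by the similarity of the users who like them."""
    scores = {}
    for other_user, other_items in likes.items():
        if other_user == target_user:
            continue
        similarity = jaccard(likes[target_user], other_items)
        for item in other_items - likes[target_user]:
            scores[item] = scores.get(item, 0.0) + similarity
    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, score in ranked[:top_n] if score > 0]

# Invented listening data: user -> set of liked genres.
likes = {
    "asha":  {"pop"},
    "bilal": {"pop", "rock"},
    "chen":  {"pop", "rock", "jazz"},
    "dara":  {"metal"},
}
print(recommend("asha", likes))
```

Rock ranks first because both users who share a taste for pop also like it; metal is never suggested because the only user who likes it has nothing in common with the target user.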

Content-based filtering recommender systems suggest items to users based on the features of the items themselves. For example, if a user likes to listen to songs with a fast tempo and loud vocals, then a content-based filtering recommender system might suggest other songs with similar features.
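Content-based filtering can be sketched as a nearest-neighbour search over song features. The example assumes two features per song, tempo and vocal loudness, both scaled to the range 0 to 1; the catalog and the function names are made up.

```python
import math

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similar_songs(liked_song, catalog, top_n=2):
    """Return the songs whose features lie closest to the liked song."""
    target = catalog[liked_song]
    others = [
        (distance(target, features), title)
        for title, features in catalog.items()
        if title != liked_song
    ]
    return [title for _, title in sorted(others)[:top_n]]

# Invented catalog: title -> (tempo, vocal loudness), both scaled to 0..1.
catalog = {
    "Song A": (0.90, 0.80),
    "Song B": (0.85, 0.75),
    "Song C": (0.20, 0.30),
    "Song D": (0.80, 0.90),
}
print(similar_songs("Song A", catalog))
```

Unlike collaborative filtering, this approach needs no data about other users, so it can also recommend songs that nobody has listened to yet.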

**Case study**

Spotify is a music streaming service that uses a combination of collaborative filtering and content-based filtering to recommend songs to its users.

Spotify's collaborative filtering recommender system is based on the idea that users who have similar listening habits are more likely to like the same songs. Spotify uses this information to generate personalized recommendations for each user.

Spotify's content-based filtering recommender system is based on the idea that songs with similar features are more likely to be liked by the same users. Spotify uses a variety of features, such as genre, tempo, and mood, to generate personalized recommendations for each user.

Spotify's recommender system is widely regarded as effective at helping users discover new music and find songs that they like. It also has to be efficient, because it generates personalized recommendations for millions of users.

**Benefits of music recommender systems**

Music recommender systems offer a number of benefits, including:

* **Discovery:** Music recommender systems can help users discover new music that they might not have otherwise found.
* **Personalization:** Music recommender systems can create personalized recommendations for each user, based on their individual tastes and preferences.
* **Convenience:** Music recommender systems can save users time and effort by suggesting songs that they are likely to enjoy.

**Challenges of music recommender systems**

One of the biggest challenges of music recommender systems is that they can be biased. This is because they are trained on data that is collected from real users, and this data can reflect the biases of those users.

Another challenge of music recommender systems is that they can be difficult to evaluate. Offline metrics such as precision and recall measure how well past listening is predicted, but they capture only part of what matters to users, such as novelty and satisfaction.

**Conclusion**

Music recommender systems are a powerful tool for helping users discover new music and find songs that they like. However, it is important to be aware of the biases that can exist in music recommender systems and the challenges of evaluating their performance.
