<a href="https://colab.research.google.com/github/Sam21sop/BE-AIML-LAB-PRACTICAL/blob/main/Lab_Practice_III_%26_IV.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#  Lab Practice-III (Information Retrieval in AI Lab)


#### Group A: CO1, 2, 3 (Any two)
1. Implement a Conflation algorithm to generate a document representative of a text file.
2. Implement Single-pass Algorithm for the clustering of files. (Consider 4 to 5 files)
3. Implement a program for retrieval of documents using inverted files.

### Implement a Conflation algorithm to generate a document representative of a text file.
A conflation algorithm reduces a text to a document representative: a compact list of index terms obtained by removing stop words, stripping suffixes so that variants of a word map to the same stem, and counting the resulting stems. The simple Python implementation below applies these steps to a list of text files and writes the stems, ordered from most to least frequent, as the representative document.
1. This implementation uses the NLTK library for text preprocessing. You can install it using `pip install nltk`.

2. The input text files are assumed to be in the same directory as the script and named doc1.txt, doc2.txt, etc. Replace these with your actual file paths, or modify the script to take the file paths as input.

3. The conflation process in this example simply combines the words from the input documents based on their frequency. You can modify the conflation logic to suit your specific requirements, such as considering sentence structures or more advanced language analysis techniques.

4. Depending on your use case, you might want to further refine the preprocessing steps and consider more advanced methods for document conflation.

In [None]:
!pip install nltk

In [4]:
import os
from collections import Counter
from nltk.corpus  import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer

In [None]:
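import nltk
nltk.download('stopwords')
# word_tokenize needs the Punkt tokenizer models; newer NLTK releases may ask for 'punkt_tab' instead
nltk.download('punkt')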

In [7]:
# Define the stop-word set and the stemmer
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

In [8]:
# Tokenize the text, drop punctuation and stop words, then stem each remaining word
def preprocess_text(text):
    words = word_tokenize(text.lower())
    words = [word for word in words if word.isalnum()]
    words = [word for word in words if word not in stop_words]
    words = [stemmer.stem(word) for word in words]
    return words

In [9]:
#preprocess each document and update word frequency
def conflate_documents(documents):
    word_freq = Counter()
    for doc in documents:
        with open(doc, 'r', encoding='utf-8') as file:
            content = file.read()
            words = preprocess_text(content)
            word_freq.update(words)
    #generate a representative document based on the word frequency
    representative_doc = ' '.join([word for word, freq in word_freq.most_common()])
    return representative_doc

In [None]:
# List of input text files
input_files = ['doc1.txt', 'doc2.txt', 'doc3.txt']

# Generate the representative document
representative_doc = conflate_documents(input_files)

# Save the representative document to a file
with open('file_name.txt', 'w', encoding='utf-8') as file:
    file.write(representative_doc)

### Implement Single-pass Algorithm for the clustering of files. (Consider 4 to 5 files)
1. We load the text content of each document.
2. The TF-IDF vectorizer is used to convert the documents into numerical vectors.
3. The first document starts the first cluster, and a similarity threshold decides whether a later document joins an existing cluster.
4. We iterate through the documents once and compute each document's cosine similarity with the existing cluster centroids.
5. If the highest similarity reaches the threshold, the document is assigned to that cluster and the cluster centroid is updated; otherwise the document starts a new cluster.



In [14]:
import os
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
# Load documents
documents = []
file_paths = ['document1.txt', 'document2.txt', 'document3.txt', 'document4.txt', 'document5.txt']

for file_path in file_paths:
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
        documents.append(content)


In [None]:
# Convert documents to TF-IDF vectors
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

In [17]:
# Similarity threshold for joining an existing cluster
# (an assumed starting value; adjust it for your files)
similarity_threshold = 0.2


In [18]:
# Process documents and perform single-pass clustering
dense_vectors = tfidf_matrix.toarray()
centroids = []            # running mean vector of each cluster
cluster_sizes = []        # number of documents in each cluster
cluster_assignments = []

for document_vector in dense_vectors:
    closest_cluster, best_similarity = None, 0.0
    if centroids:
        # Compute cosine similarity with existing cluster centroids
        similarities = cosine_similarity([document_vector], centroids)[0]
        closest_cluster = int(np.argmax(similarities))
        best_similarity = similarities[closest_cluster]

    if closest_cluster is not None and best_similarity >= similarity_threshold:
        # Assign document to the most similar cluster and update its centroid (running mean)
        n = cluster_sizes[closest_cluster]
        centroids[closest_cluster] = (centroids[closest_cluster] * n + document_vector) / (n + 1)
        cluster_sizes[closest_cluster] += 1
    else:
        # No cluster is similar enough: the document starts a new cluster
        closest_cluster = len(centroids)
        centroids.append(document_vector.copy())
        cluster_sizes.append(1)

    cluster_assignments.append(closest_cluster)



In [19]:
# Print cluster assignments
for i, cluster_id in enumerate(cluster_assignments):
    print(f"Document {file_paths[i]} belongs to Cluster {cluster_id}")
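
To try the single-pass procedure without any text files on disk, the following is a minimal standard-library sketch. It is an illustration under stated assumptions, not the notebook's method: it uses raw term counts instead of TF-IDF, keeps each centroid as the summed counts of its members (cosine similarity ignores vector length, so the sum points in the same direction as the mean), and the helper names `to_vector`, `cosine`, and `single_pass`, the sample sentences, and the threshold of 0.5 are all hypothetical.

```python
import math
import re
from collections import Counter


def to_vector(text):
    """Bag-of-words term counts for one document."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count mappings."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass(texts, threshold=0.5):
    """Visit each text once; join the most similar cluster or start a new one."""
    centroids = []    # summed term counts of each cluster (assumed representation)
    assignments = []
    for text in texts:
        vector = to_vector(text)
        scores = [cosine(vector, centroid) for centroid in centroids]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= threshold:
            centroids[best].update(vector)      # Counter.update adds the counts
        else:
            best = len(centroids)               # similarity too low: new cluster
            centroids.append(Counter(vector))
        assignments.append(best)
    return assignments


# Hypothetical sample documents
texts = [
    "cats chase mice",
    "cats chase small mice",
    "stocks fell sharply today",
    "stocks rose sharply today",
]
for text, cluster_id in zip(texts, single_pass(texts)):
    print(f"Cluster {cluster_id}: {text}")
```

With these sample sentences the two sentences about cats share one cluster and the two about stocks share another. Because the algorithm never revisits a document, the result depends on both the threshold and the order in which documents arrive.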

### Implement a program for retrieval of documents using inverted files.

1. We preprocess the text of each document to tokenize it into words and remove punctuation.

2. The build_inverted_index function creates an inverted index using a defaultdict. For each word, it maps the word to the document IDs where the word appears.

3. The retrieve_documents function takes a query and retrieves documents containing any of the query terms based on the inverted index.

4. We provide a sample set of documents and a sample user query to demonstrate the retrieval process.

In [20]:
import os
import re
from collections import defaultdict

In [21]:
def preprocess_text(text):
    '''Tokenize the text into words and remove punctuation'''
    words = re.findall(r'\w+', text.lower())
    return words

In [22]:
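def build_inverted_index(documents):
    # Map each word to the set of document IDs that contain it;
    # a set avoids duplicate IDs when a word repeats within one document
    inverted_index = defaultdict(set)
    for doc_id, document in enumerate(documents):
        words = preprocess_text(document)
        for word in words:
            inverted_index[word].add(doc_id)
    return inverted_index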

In [24]:
def retrieve_documents(inverted_index, query):
    query_terms = preprocess_text(query)
    matching_doc_ids = set()
    for term in query_terms:
        if term in inverted_index:
            matching_doc_ids.update(inverted_index[term])
    return matching_doc_ids

In [25]:
# Sample documents
documents = [
    "This is the first document.",
    "The second document is here.",
    "And here is the third document.",
    "This document contains important information.",
    "The fifth document concludes the set."
]


In [26]:
# Build inverted index
inverted_index = build_inverted_index(documents)

In [27]:
# User query
query = "document information"

In [28]:
# Retrieve matching documents
matching_doc_ids = retrieve_documents(inverted_index, query)

In [29]:
# Print matching documents
if matching_doc_ids:
    print("Matching Documents:")
    for doc_id in sorted(matching_doc_ids):
        print(f"Document {doc_id + 1}: {documents[doc_id]}")
else:
    print("No matching documents found.")

Matching Documents:
Document 1: This is the first document.
Document 2: The second document is here.
Document 3: And here is the third document.
Document 4: This document contains important information.
Document 5: The fifth document concludes the set.
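
The output above lists all five documents because `retrieve_documents` returns every document that contains any query term, and the word "document" occurs in each one. A stricter variant, sketched here as an assumption rather than as part of the assignment, intersects the posting sets so that only documents containing every query term are returned. The names `build_index` and `retrieve_all_terms` are hypothetical.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercased word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index


def retrieve_all_terms(index, query):
    """Return IDs of documents that contain every term of the query (AND semantics)."""
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return set()
    # index.get avoids creating empty entries for unknown terms
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


documents = [
    "This is the first document.",
    "The second document is here.",
    "And here is the third document.",
    "This document contains important information.",
    "The fifth document concludes the set.",
]

index = build_index(documents)
for doc_id in sorted(retrieve_all_terms(index, "document information")):
    print(f"Document {doc_id + 1}: {documents[doc_id]}")
```

For the same query, only the fourth document is printed, since it is the only one that contains both "document" and "information".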


#### Group B: CO3, 5 (Any two)
1. Implement a program to calculate precision and recall for sample input. (Answer set A, Query q1, Relevant documents to query q1: Rq1)
2. Write a program to calculate the harmonic mean (F-measure) and E-measure for the above example.
3. Implement a program for feature extraction in 2D color images (any features like color, texture, etc.): extract features from the input image and plot a histogram for the features.
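
As a starting point for the first two tasks, the following is a minimal sketch of the evaluation measures. It assumes the usual set-based definitions: precision = |A ∩ Rq1| / |A|, recall = |A ∩ Rq1| / |Rq1|, F = 2PR / (P + R), and E = 1 - (1 + b²) / (b²/R + 1/P), where b weights recall against precision. The function names and the sample document IDs are hypothetical.

```python
def precision_recall(answer_set, relevant_set):
    """Return (precision, recall) for a retrieved answer set and a relevant set."""
    hits = len(answer_set & relevant_set)
    precision = hits / len(answer_set) if answer_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def e_measure(precision, recall, b=1.0):
    """E-measure; with b = 1 it equals 1 - F."""
    if precision == 0 or recall == 0:
        return 1.0
    return 1 - (1 + b * b) / (b * b / recall + 1 / precision)


# Hypothetical sample input: answer set A and relevant documents Rq1 for query q1
A = {"d1", "d2", "d3", "d4", "d5"}
Rq1 = {"d2", "d4", "d6", "d8"}

p, r = precision_recall(A, Rq1)
print(f"Precision: {p:.3f}")
print(f"Recall:    {r:.3f}")
print(f"F-measure: {f_measure(p, r):.3f}")
print(f"E-measure: {e_measure(p, r):.3f}")
```

For this sample, two of the five retrieved documents are relevant, so precision is 0.4 and recall is 0.5; the F-measure is about 0.444 and the E-measure with b = 1 is about 0.556.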

#### Group C: CO4, 5 (Any two)
1. Build the web crawler to pull product information and links from an e-commerce website. (Python)
2. Write a program to find the live weather report (temperature, wind speed, description, and weather) of a given city. (Python)
3. Case study on recommender system for a product / Doctor / Product price / Music.


# Lab Practice-IV (Deep Learning for AI Lab)

Mapping of course outcomes for Group A assignments: CO1, CO2, CO3, CO4
#### Study of Deep Learning Packages: TensorFlow, Keras, Theano and PyTorch. Document the distinct features and functionality of the packages.

Note: Use a suitable dataset for the implementation of the following assignments.

#### Implementing Feed-forward neural networks with Keras and TensorFlow
a. Import the necessary packages

b. Load the training and testing data (MNIST/CIFAR10)

c. Define the network architecture using Keras

d. Train the model using SGD

e. Evaluate the network

f. Plot the training loss and accuracy

#### Build the Image classification model by dividing the model into the following four stages:
a. Loading and preprocessing the image data

b. Defining the model’s architecture

c. Training the model

d. Estimating the model’s performance

#### Use Autoencoder to implement anomaly detection. Build the model by using the following:
a. Import required libraries

b. Upload/access the dataset

c. The encoder converts it into a latent representation

d. Decoder networks convert it back to the original input

e. Compile the models with Optimizer, Loss, and Evaluation Metrics

#### Implement the Continuous Bag of Words (CBOW) Model. Stages can be:
a. Data preparation

b. Generate training data

c. Train model

d. Output
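
The "Generate training data" stage can be sketched as follows. This is a minimal illustration that assumes CBOW training examples are (context words, target word) pairs taken from a sliding window over the token sequence; the helper names `build_vocab` and `cbow_pairs`, the window size, and the sample sentence are hypothetical, and the model itself is not shown.

```python
def build_vocab(tokens):
    """Map each distinct token to an integer ID, in order of first appearance."""
    vocab = {}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab


def cbow_pairs(tokens, window=2):
    """Return (context, target) pairs: up to `window` words on each side of the target."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:  # skip a lone token that has no neighbours
            pairs.append((context, target))
    return pairs


# Hypothetical sample sentence
tokens = "the quick brown fox jumps".split()
vocab = build_vocab(tokens)

for context, target in cbow_pairs(tokens, window=1):
    context_ids = [vocab[word] for word in context]
    print(f"context={context} ids={context_ids} -> target={target} id={vocab[target]}")
```

In a full implementation, the context IDs would be fed to an embedding layer and averaged, and the network would be trained to predict the target ID.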

#### Object detection using Transfer Learning of CNN architectures
a. Load in a pre-trained CNN model trained on a large dataset

b. Freeze parameters (weights) in the model’s lower convolutional layers

c. Add a custom classifier with several layers of trainable parameters to the model

d. Train classifier layers on training data available for the task

e. Fine-tune hyperparameters and unfreeze more layers as needed

# References