Question Answering & Text Summarization
**Instructions**:
1. Wherever you are asked to code, insert a text block below your code block and explain what you have coded as per your own understanding.
2. If the code is provided by us, execute it, and add below a text block and provide your explanation in your own words.
3. Submit both the .ipynb and pdf files to canvas.
4. **The similarity score should be less than 15%.**

# Tutorial-1
# Reading Comprehension with ALBERT (and similar)


## Introduction

Reading comprehension, also known as question answering (QA), is one of the tasks that NLP tries to solve. The goal of this task is to answer an arbitrary question given a context. For instance, given the following context:

> New Zealand (Māori: Aotearoa) is a sovereign island country in the southwestern Pacific Ocean. It has a total land area of 268,000 square kilometres (103,500 sq mi), and a population of 4.9 million. New Zealand's capital city is Wellington, and its most populous city is Auckland.

We ask the question

> How many people live in New Zealand?

We expect the QA system to respond with something like this:

> 4.9 million

Since 2017, transformer models have been shown to outperform earlier approaches for this task. Many pretrained transformer models exist, including BERT, GPT-2, and XLNet. One of the newcomers to the group is ALBERT (A Lite BERT), which was published in September 2019. The research group claims that it outperforms BERT with far fewer parameters (shorter training and inference time).

This tutorial demonstrates how you can fine-tune ALBERT for the task of QA and use it for inference. For this tutorial, we will use the Transformers library built by [Hugging Face](https://huggingface.co/), which is an extremely nice implementation of the transformer models (including ALBERT) in both TensorFlow and PyTorch. You can just use a fine-tuned model from their [model repository](https://huggingface.co/models) (which I encourage in general to save money and reduce emissions). However, for educational purposes I will also show you how to fine-tune it yourself so you can adapt it for your own data.

Note that the goal of this is not to build an optimised, production ready system, but to demonstrate the concept with as little code as possible. Therefore a lot of code will be retrofitted for this purpose.


## 1.0 Setup

Let's check out what kind of GPU our friends at Google gave us. This notebook should be configured to give you a P100 😃 (saved in metadata)

In [None]:
!nvidia-smi

Here, we run `nvidia-smi` to check the current status and utilisation of the NVIDIA GPU on the system. It is commonly used to confirm that a GPU is available and that a machine learning job is actually using it.
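The same check can also be done from Python, which is handy when a script should adapt to whether a GPU is present. The following is a minimal sketch using only the standard library; the helper name `gpu_summary` is an illustrative assumption, and it simply returns `None` when `nvidia-smi` is missing or fails.

```python
import shutil
import subprocess


def gpu_summary():
    """Return one CSV line per GPU from nvidia-smi, or None if unavailable."""
    # nvidia-smi is only on PATH when an NVIDIA driver is installed
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        completed = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=name,memory.total,memory.used",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True, timeout=30,
        )
    except (subprocess.SubprocessError, OSError):
        # The tool exists but could not reach a driver or GPU
        return None
    return completed.stdout.strip()


summary = gpu_summary()
print(summary if summary else "No NVIDIA GPU detected.")
```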



First, we clone the Hugging Face Transformers library from GitHub.

Note that it checks out a specific commit only because that is the version I've tested.

In [None]:
!git clone https://github.com/huggingface/transformers \
&& cd transformers \
&& git checkout a3085020ed0d81d4903c50967687192e3101e770

Here, we clone the Hugging Face Transformers repository from GitHub and check out a specific commit by its hash.

Cloning the repository gives us access to the library's source code, which we can modify and explore. Pinning a specific commit fixes the library version, which is needed if we want to reproduce a specific result.


In [None]:
!pip install ./transformers
!pip install tensorboardX

Here, we use pip to install two libraries: transformers and tensorboardX.

First, we install transformers from the local `./transformers` directory, which contains the version of the library we just checked out, so the tested commit is the one that gets installed.

Then we install tensorboardX, which writes logs that can be viewed in the TensorBoard web application. It is used to monitor the progress of machine learning models during training and evaluation, and to visualise the model and its inputs and outputs.



## 2.0 Train Model

This is where we can train our own model. Note you can skip this step if you don't want to wait 1.5 hours!

### 2.1 Get Training and Evaluation Data

The SQuAD dataset contains question/answer pairs for training the ALBERT model on the QA task.

Now get the SQuAD V2.0 dataset. `train-v2.0.json` is for training and `dev-v2.0.json` is for evaluation to see how well your model trained.

Read more about this dataset here: https://rajpurkar.github.io/SQuAD-explorer/

In [None]:
!mkdir dataset \
&& cd dataset \
&& wget https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json \
&& wget https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json

Here, we create a new directory named `dataset` and use the `wget` command to download the SQuAD v2.0 training and development sets into it.

SQuAD stands for Stanford Question Answering Dataset, a popular dataset for training and evaluating question answering models. The file `train-v2.0.json` contains the training set, while `dev-v2.0.json` contains the development set. We download both files to train and evaluate the question answering model.
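To get a feel for what is inside these files, it helps to look at their nesting: articles contain paragraphs, and each paragraph has a context plus a list of questions. The sketch below uses a tiny hand-written sample in the SQuAD v2.0 layout (not real dataset rows); the helper name `count_questions` is an illustrative assumption.

```python
import json

# Hand-written sample that mirrors the SQuAD v2.0 JSON layout.
# With the real file you would instead do:
#   with open("dataset/train-v2.0.json") as f:
#       squad = json.load(f)
squad = {
    "version": "v2.0",
    "data": [{
        "title": "New_Zealand",
        "paragraphs": [{
            "context": "New Zealand's capital city is Wellington.",
            "qas": [
                {"id": "q1",
                 "question": "What is the capital of New Zealand?",
                 "is_impossible": False,
                 "answers": [{"text": "Wellington", "answer_start": 30}]},
                {"id": "q2",
                 "question": "When was Wellington founded?",
                 "is_impossible": True,
                 "answers": []},
            ],
        }],
    }],
}


def count_questions(squad):
    """Return (answerable, unanswerable) question counts."""
    answerable = unanswerable = 0
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                if qa.get("is_impossible", False):
                    unanswerable += 1
                else:
                    answerable += 1
    return answerable, unanswerable


counts = count_questions(squad)
print(json.dumps({"answerable": counts[0], "unanswerable": counts[1]}))

# answer_start is a character offset into the context
paragraph = squad["data"][0]["paragraphs"][0]
answer = paragraph["qas"][0]["answers"][0]
start = answer["answer_start"]
print(paragraph["context"][start:start + len(answer["text"])])
```

The `is_impossible` field is what distinguishes v2.0 from v1.1: unanswerable questions carry an empty `answers` list.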




### 2.2 Run training

We can now train the model with the training set.

### Notes about parameters:
`per_gpu_train_batch_size` specifies the number of training examples per iteration per GPU. *In general*, higher means more accuracy and faster training. However, the biggest limitation is the size of the GPU. 12 is what I use for a GPU with 16GB memory.

`save_steps` specifies number of steps before it outputs a checkpoint file. I've increased it to save disk space.

`num_train_epochs` I recommend two epochs here. It's currently set to one for the purpose of time

`version_2_with_negative` is required for SQuAD V2.0. If training with V1.1, take out this flag

Warning: it takes about 1.5 hours to train an epoch! If you don't want to wait this long, feel free to skip this step and note the comment in the code to use a pretrained model!

In [None]:
!export SQUAD_DIR=/content/dataset \
&& python transformers/examples/run_squad.py \
  --model_type albert \
  --model_name_or_path albert-base-v2 \
  --do_train \
  --do_eval \
  --do_lower_case \
  --train_file $SQUAD_DIR/train-v2.0.json \
  --predict_file $SQUAD_DIR/dev-v2.0.json \
  --per_gpu_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 1.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /content/model_output \
  --save_steps 1000 \
  --threads 4 \
  --version_2_with_negative

Here we run the Hugging Face `run_squad.py` example script to fine-tune a BERT-based model on the Stanford Question Answering Dataset, SQuAD 2.0. It fine-tunes the pretrained `albert-base-v2` model on the SQuAD 2.0 training data and then evaluates the fine-tuned model on the SQuAD 2.0 development data. Training and evaluation are controlled by hyperparameters such as the batch size, learning rate, and number of epochs, and the fine-tuned model is saved to the specified output path. The `--version_2_with_negative` flag indicates that the model is trained to handle questions whose answer is not available in the context.

Overall, we use separate datasets for training and evaluating the model, save the model to the given directory, and end up with a model that can handle questions that have no answer in the context.
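The training command sets `--max_seq_length 384` and `--doc_stride 128`. A rough, simplified sketch of what a stride does when a context is longer than one window is shown below. The function name `sliding_windows` and the toy sizes are illustrative assumptions; the real preprocessing in `run_squad.py` also reserves room for the question and special tokens, which this sketch ignores.

```python
def sliding_windows(tokens, window_size, stride):
    """Split tokens into overlapping windows that start every `stride` tokens."""
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window_size])
        # Stop once the current window reaches the end of the text
        if start + window_size >= len(tokens):
            break
        start += stride
    return windows


# Toy example: 10 tokens, windows of 4 tokens, a new window every 2 tokens
tokens = [f"t{i}" for i in range(10)]
windows = sliding_windows(tokens, window_size=4, stride=2)
for window in windows:
    print(window)
```

Because consecutive windows overlap, an answer that would be cut in half at one window boundary still appears whole in a neighbouring window.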

## 3.0 Setup prediction code

Now we can use the Hugging Face library to make predictions using our newly trained model. Note that a lot of the code is pulled from `run_squad.py` in the Hugging Face repository, with all the training parts removed. This modified code allows us to run predictions on text we pass in directly as strings, rather than in the .json format used by the training/test set.

NOTE: if you decided to train your own model, change the flag `use_own_model` to `True`.


In [None]:
!pip install --upgrade pip
!pip install transformers
!pip install sentencepiece
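# AlbertTokenizer, AlbertConfig and AlbertForQuestionAnswering are classes
# inside the transformers package, not separate pip packages, so they need
# no install step of their own. A separate "tokenizer" package is not
# needed either.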

Here we install the libraries needed for this natural language processing task.
transformers is the library for using pretrained language models and fine-tuning them for different NLP tasks.
sentencepiece is a library for subword text tokenization, which is commonly used with transformer models and is required by the ALBERT tokenizer.
`AlbertTokenizer`, `AlbertConfig` and `AlbertForQuestionAnswering` are classes provided by the transformers library; together they let us use a pretrained ALBERT model for question answering.

All in all, this cell sets up the libraries needed to use a pretrained ALBERT model for question answering.


In [None]:
import os
import torch
import time
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

from transformers import (
    AlbertConfig,
    AlbertForQuestionAnswering,
    AlbertTokenizer,
    squad_convert_examples_to_features
)

from transformers.data.processors.squad import SquadResult, SquadV2Processor, SquadExample

from transformers.data.metrics.squad_metrics import compute_predictions_logits

# READER NOTE: Set this flag to use own model, or use pretrained model in the Hugging Face repository
use_own_model = False

if use_own_model:
  model_name_or_path = "/content/model_output"
else:
  model_name_or_path = "ktrapeznikov/albert-xlarge-v2-squad-v2"

output_dir = ""

# Config
n_best_size = 1
max_answer_length = 30
do_lower_case = True
null_score_diff_threshold = 0.0

def to_list(tensor):
    return tensor.detach().cpu().tolist()

# Setup model
config_class, model_class, tokenizer_class = (
    AlbertConfig, AlbertForQuestionAnswering, AlbertTokenizer)
config = config_class.from_pretrained(model_name_or_path)
tokenizer = tokenizer_class.from_pretrained(
    model_name_or_path, do_lower_case=True)
model = model_class.from_pretrained(model_name_or_path, config=config)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model.to(device)

processor = SquadV2Processor()

def run_prediction(question_texts, context_text):
    """Setup function to compute predictions"""
    examples = []

    for i, question_text in enumerate(question_texts):
        example = SquadExample(
            qas_id=str(i),
            question_text=question_text,
            context_text=context_text,
            answer_text=None,
            start_position_character=None,
            title="Predict",
            is_impossible=False,
            answers=None,
        )

        examples.append(example)

    features, dataset = squad_convert_examples_to_features(
        examples=examples,
        tokenizer=tokenizer,
        max_seq_length=384,
        doc_stride=128,
        max_query_length=64,
        is_training=False,
        return_dataset="pt",
        threads=1,
    )

    eval_sampler = SequentialSampler(dataset)
    eval_dataloader = DataLoader(dataset, sampler=eval_sampler, batch_size=10)

    all_results = []

    for batch in eval_dataloader:
        model.eval()
        batch = tuple(t.to(device) for t in batch)

        with torch.no_grad():
            inputs = {
                "input_ids": batch[0],
                "attention_mask": batch[1],
                "token_type_ids": batch[2],
            }

            example_indices = batch[3]

            outputs = model(**inputs)

            for i, example_index in enumerate(example_indices):
                eval_feature = features[example_index.item()]
                unique_id = int(eval_feature.unique_id)

                output = [to_list(output[i]) for output in outputs]

                start_logits, end_logits = output
                result = SquadResult(unique_id, start_logits, end_logits)
                all_results.append(result)

    output_prediction_file = "predictions.json"
    output_nbest_file = "nbest_predictions.json"
    output_null_log_odds_file = "null_predictions.json"

    predictions = compute_predictions_logits(
        examples,
        features,
        all_results,
        n_best_size,
        max_answer_length,
        do_lower_case,
        output_prediction_file,
        output_nbest_file,
        output_null_log_odds_file,
        False,  # verbose_logging
        True,  # version_2_with_negative
        null_score_diff_threshold,
        tokenizer,
    )

    return predictions

Here we define the function `run_prediction`, which uses the ALBERT model loaded above to answer a list of questions based on a context.

The function takes two arguments: a list of question texts and a string of context text. It creates a list of `SquadExample` objects from the question texts and the context text, which hold the information the ALBERT model needs to answer the questions.

Next, `squad_convert_examples_to_features` is called to convert the examples into features, which are the inputs to the model. An `eval_dataloader` is created to load the features in batches and feed them to the model.

The model is put into evaluation mode with `model.eval()`, and each batch is passed through it inside `torch.no_grad()`. The outputs of the model are converted into lists using the `to_list` function, and a `SquadResult` is created from the start and end logits of each feature. These results are appended to the `all_results` list.

After all the batches are processed, `compute_predictions_logits` is called with the examples, features and `all_results` to calculate the predictions. The resulting `predictions` object is a dictionary with the question IDs as keys and the predicted answers as values.
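The core idea behind turning start and end logits into an answer can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the library's implementation: the helper names `best_span` and `answer_or_empty` are hypothetical, tokens are joined with spaces, and only a single feature is considered. It scores every valid span by the sum of its start and end logits, and returns an empty answer when the "no answer" score at position 0 beats the best span by more than `null_score_diff_threshold`.

```python
def best_span(start_logits, end_logits, max_answer_length):
    """Return (score, start, end) for the highest-scoring valid span."""
    best = None
    for start, start_logit in enumerate(start_logits):
        # The end must not precede the start or exceed the length limit
        last_end = min(start + max_answer_length, len(end_logits))
        for end in range(start, last_end):
            score = start_logit + end_logits[end]
            if best is None or score > best[0]:
                best = (score, start, end)
    return best


def answer_or_empty(tokens, start_logits, end_logits,
                    max_answer_length=30, null_score_diff_threshold=0.0):
    # Position 0 (the [CLS] token) stands for "no answer" in SQuAD v2.0 models
    null_score = start_logits[0] + end_logits[0]
    # Search real spans only, so skip position 0 and shift indices back by 1
    score, start, end = best_span(start_logits[1:], end_logits[1:],
                                  max_answer_length)
    if null_score - score > null_score_diff_threshold:
        return ""
    return " ".join(tokens[start + 1:end + 2])


tokens = ["[CLS]", "population", "of", "4.9", "million", "[SEP]"]

# Made-up logits where the model is confident about "4.9 million"
answerable = answer_or_empty(tokens,
                             [0.1, 0.2, 0.1, 5.0, 0.3, 0.0],
                             [0.2, 0.1, 0.1, 0.5, 6.0, 0.0])

# Made-up logits where the "no answer" position dominates
unanswerable = answer_or_empty(tokens,
                               [8.0, 0.1, 0.1, 0.2, 0.1, 0.0],
                               [7.5, 0.1, 0.1, 0.1, 0.3, 0.0])

print(repr(answerable), repr(unanswerable))
```

Raising `null_score_diff_threshold` makes the sketch more willing to commit to an answer; lowering it makes it abstain more often.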


## 4.0 Run predictions

Now for the fun part... testing out your model on different inputs. Pretty rudimentary example here. But the possibilities are endless with this function.

In [None]:
import torch
from transformers import pipeline

# Initialize pipeline
question_answering = pipeline("question-answering", model="distilbert-base-cased-distilled-squad", tokenizer="distilbert-base-cased")

def run_prediction(questions, corpus):
    # Store predictions in dictionary
    predictions = {}

    # Loop through each question and get answer
    for question in questions:
        # Get answer from pipeline
        result = question_answering(question=question, context=corpus)

        # Add result to dictionary
        predictions[question] = result['answer']

    return predictions

corpus = "New Zealand (Māori: Aotearoa) is a sovereign island country in the southwestern Pacific Ocean. It has a total land area of 268,000 square kilometres (103,500 sq mi), and a population of 4.9 million. New Zealand's capital city is Wellington, and its most populous city is Auckland."
questions = ["How many people live in New Zealand?", "What's the largest city?"]

# Run method
predictions = run_prediction(questions, corpus)

# Print results
print()
for key in predictions.keys():
  print(predictions[key])

Here, we perform question answering with the Hugging Face transformers library. We use the `pipeline` function to initialise a question answering pipeline with the `distilbert-base-cased-distilled-squad` model and the `distilbert-base-cased` tokenizer. Note that this cell redefines `run_prediction`, so the answers come from this DistilBERT pipeline rather than from the ALBERT model set up in section 3.0.

The `run_prediction` function takes two arguments: a list of questions and the corpus text to search for answers. It loops over the questions, calls the question answering pipeline on each question together with the corpus text, and finally returns a dictionary of the answers.

The corpus text and the list of questions are then defined, `run_prediction` is called with them, and the answers are stored in the dictionary called `predictions`. Finally, we print the answer to each question.


In [None]:
import torch
from transformers import pipeline

# Initialize pipeline
question_answering = pipeline("question-answering", model="distilbert-base-cased-distilled-squad", tokenizer="distilbert-base-cased")

def run_prediction(questions, corpus):
    # Store predictions in dictionary
    predictions = {}

    # Loop through each question and get answer
    for question in questions:
        # Get answer from pipeline
        result = question_answering(question=question, context=corpus)

        # Add result to dictionary
        predictions[question] = result['answer']

    return predictions

# Define corpus and questions
corpus = "New Zealand (Māori: Aotearoa) is a sovereign island country in the southwestern Pacific Ocean. It has a total land area of 268,000 square kilometres (103,500 sq mi), and a population of 4.9 million. New Zealand's capital city is Wellington, and its most populous city is Auckland."
questions = ["How many people live in New Zealand?", "What's the largest city?"]

# Run method
predictions = run_prediction(questions, corpus)

# Print results
for key in predictions.keys():
  print(key + ": " + predictions[key])


Here, we use the transformers library to build a question answering pipeline that answers questions based on a context. The pipeline is initialised with a pretrained DistilBERT model and tokenizer.

The `run_prediction` function takes a list of questions and the context corpus as input, and uses the pretrained model to generate an answer to each question. It returns a dictionary of question-answer pairs.

The context corpus is a paragraph of text describing New Zealand's geography and population, and there are two questions: "How many people live in New Zealand?" and "What's the largest city?".

The pipeline works by taking each question together with the context as input, encoding them as a sequence of tokens, and processing them with the pretrained model. The model produces a probability distribution over the possible answer spans in the context, and the pipeline selects the span with the highest probability.

Finally, we print the results by iterating over the dictionary of question-answer pairs and printing each question with its answer.


In conclusion, we predict the answers from the context of the corpus text.
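The "probability distribution over possible answers" can be made concrete with a small standard-library sketch. It assumes, as a simplification, that start and end logits are each normalised with a softmax and that a span's probability is the product of its start and end probabilities; the function names and the `max_answer_len` default are illustrative assumptions, not the pipeline's actual code.

```python
import math


def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_span(start_logits, end_logits, max_answer_len=15):
    """Return (probability, start, end) of the most likely span."""
    p_start = softmax(start_logits)
    p_end = softmax(end_logits)
    best = (0.0, 0, 0)
    for i, ps in enumerate(p_start):
        for j in range(i, min(i + max_answer_len, len(p_end))):
            prob = ps * p_end[j]
            if prob > best[0]:
                best = (prob, i, j)
    return best


# Made-up logits: token 2 is the likely start, token 3 the likely end
probability, start, end = top_span([0.0, 0.0, 5.0, 0.0],
                                   [0.0, 0.0, 0.0, 6.0])
print(f"span=({start}, {end}) probability={probability:.3f}")
```

A probability computed this way plays the same role as the `score` field that sits alongside `answer` in each pipeline result.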



# Task 1 (10 points):

Give a detailed analysis on the Tutorial-1 you have just run. Your explanation needs to contain every technical detail possible, starting from dataset, libraries used, explanations of every function, to the overall workflow.

**Answer:**
`!nvidia-smi` displays the current GPU usage and details such as memory usage and temperature.

Next, we clone the transformers repository from GitHub into the current directory. The `&&` shell operator runs multiple commands on one line, each only if the previous one succeeded. `cd transformers` changes the working directory to the newly cloned repository, and `git checkout` with a commit hash selects a specific version of the code. This is done to ensure that a known, tested version of the codebase is used.

We install two libraries: transformers, for training and evaluating the model, and tensorboardX, for logging and visualising metrics such as loss and accuracy during training. We then create a directory named `dataset` and use `wget` to download two JSON files from the SQuAD website: the first contains the training data and the second contains the development data. The aim of downloading the data is to train and evaluate a model that answers questions about a given context.

SQuAD is a widely used dataset for evaluating question answering models. It contains questions and their corresponding answers, along with the context from which each answer is taken. The dataset was created at Stanford University and is used for building natural language processing models.

In conclusion, we download the SQuAD dataset to train and evaluate the question answering model.
`SQUAD_DIR` is the path of the downloaded dataset directory, `/content/dataset`. `transformers/examples/run_squad.py` is the training script from the Hugging Face transformers library. The `--model_type` and `--model_name_or_path` flags select the ALBERT model; the pretrained `albert-base-v2` model is used as the starting point for training.

The `--do_train` and `--do_eval` flags specify that both training and evaluation are performed. `--do_lower_case` converts all input text to lowercase. The training and evaluation datasets are specified with `--train_file` and `--predict_file` respectively. `--per_gpu_train_batch_size` sets the batch size of each training iteration on each GPU (12 here). The learning rate is set to 3e-5 with `--learning_rate`, and `--num_train_epochs` sets the number of epochs to 1.0. `--max_seq_length` sets the maximum length of the tokenized input sequence (384), and `--doc_stride` sets the stride (128) used when splitting a long input text into chunks for processing. The trained model is saved to the directory given by `--output_dir`, and a checkpoint is saved every 1000 steps, as set by `--save_steps`. `--threads` specifies the number of CPU threads used for data processing (4). Finally, `--version_2_with_negative` indicates that the SQuAD 2.0 dataset, which includes unanswerable questions, is used.

After that, the next cell prepares the Python libraries needed for the ALBERT question answering model. It upgrades pip itself and installs the transformers library from Hugging Face, which is used for natural language processing tasks such as question answering. It installs sentencepiece, which is used for tokenization and detokenization. `AlbertTokenizer` is the tokenizer class specific to the ALBERT model, `AlbertConfig` holds the configuration settings used to initialise the ALBERT model, and `AlbertForQuestionAnswering` implements the ALBERT model for question answering; all three are classes that ship with transformers rather than separate libraries.

Next comes the `run_prediction()` function, which takes a list of question texts and a context text and returns the predicted answer for each question based on the context. First, it creates a `SquadExample` for each question and context pair. These are converted into features and a dataset using `squad_convert_examples_to_features`, and the dataset is wrapped in a `DataLoader`. For each batch from the `DataLoader`, the model predicts start and end logits for the answer. These are converted into `SquadResult` objects, which are appended to the `all_results` list. `compute_predictions_logits` then calculates the predicted answer for each question from the `SquadResult` objects, and the function returns these predictions. The code uses the transformers library and the `AlbertForQuestionAnswering` model, which is loaded from either a pretrained model or our own fine-tuned model, as given by the `model_name_or_path` variable. `AlbertTokenizer` is used to tokenize the input text. The model runs on the GPU when one is available, and `null_score_diff_threshold` is the threshold used to decide when a question has no answer in the context.


We call `run_prediction` to predict the answers, passing two arguments: the questions, defined as a list, and the text, defined in the `corpus` variable.
The function tokenizes these inputs, generates features and runs the pretrained ALBERT model to predict the answer to each question. Finally, it returns a dictionary of predictions that maps each question to its answer, and we print the answers using a loop.

The question answering pipeline is initialised with the DistilBERT model and tokenizer. `run_prediction` is defined to take a list of questions and the corpus text as input and to return a dictionary of question-answer pairs. It loops over the questions and passes each one to the pipeline, which generates an answer based on the information in the corpus. Each answer is stored in the dictionary with the question as its key, and at the end the dictionary of question-answer pairs is returned. The corpus is a short passage about New Zealand, and the questions ask about the population and the largest city of the country. The dictionary built from the pipeline output contains the answers to the questions.

This is a natural language processing task carried out with the transformers library. Specifically, it uses the pretrained DistilBERT model fine-tuned on the SQuAD dataset to answer the questions about New Zealand.

The `run_prediction` function takes two arguments, a list of questions and the corpus text, predicts the answers, and stores them in a dictionary keyed by question. It relies on the `pipeline` from transformers, which is initialised for question answering with the specified model and tokenizer. A loop iterates over the questions, passes each one through the pipeline together with the corpus, and extracts the predicted answer from the output. The answers are stored in the dictionary under their question keys, and the function finally returns the dictionary with all predicted answers.

The corpus of text used in this example is a brief description of New Zealand, and the two questions asked are "How many people live in New Zealand?" and "What's the largest city?". The code then calls the `run_prediction` function with these inputs and prints the predicted answers to the console.

In summary, this code demonstrates how pretrained language models can be used for natural language processing tasks such as question answering. By fine-tuning the model on a specific task or dataset, it can be adapted to provide accurate answers to a wide range of questions.


# Task 2 (20 points):
## Classifying the type of Answer
By this time, you should have observed that the answer predictions returned are just one- or two-word, to-the-point answers. Now, you need to adjust the code above so that, after fetching the answer, your code also returns the type of the answer detected.

**Example:**
**Input:** "How many people live in New Zealand?"
<br>**Answer predicted (what your code needs to do):** 4.9
<br>**Answer type:** Numeric Value.


In [None]:
##Your code here
import re
import torch
from transformers import pipeline

# Initialize pipeline
question_answering = pipeline("question-answering", model="distilbert-base-cased-distilled-squad", tokenizer="distilbert-base-cased")

def classify_answer_type(answer):
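    # Check if answer is a numeric value, allowing thousands separators and a
    # trailing magnitude word so that answers like "4.9 million" also match
    if re.match(r'^\d[\d,]*(\.\d+)?(\s+(hundred|thousand|million|billion|trillion))?$',
                answer.strip(), re.IGNORECASE):
        return "Numeric Value"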

    # Check if answer is a date
    if re.match(r'^\d{4}-\d{2}-\d{2}$', answer):
        return "Date"

    # Check if answer is a time
    if re.match(r'^\d{2}:\d{2}(:\d{2})?$', answer):
        return "Time"

    # Check if answer is a location
    if re.match(r'^[A-Z][a-z]+( [A-Z][a-z]+)*$', answer):
        return "Location"

    # If none of the above match, return "Other"
    return "Other"

def run_prediction(questions, corpus):
    # Store predictions in dictionary
    predictions = {}

    # Loop through each question and get answer
    for question in questions:
        # Get answer from pipeline
        result = question_answering(question=question, context=corpus)

        # Classify answer type
        answer_type = classify_answer_type(result['answer'])

        # Add result to dictionary
        predictions[question] = {"answer": result['answer'], "answer_type": answer_type}

    return predictions

# Define corpus and questions
corpus = "New Zealand (Māori: Aotearoa) is a sovereign island country in the southwestern Pacific Ocean. It has a total land area of 268,000 square kilometres (103,500 sq mi), and a population of 4.9 million. New Zealand's capital city is Wellington, and its most populous city is Auckland."
questions = ["How many people live in New Zealand?", "What's the largest city?", "What is the capital city of New Zealand?", "When was New Zealand founded?"]

# Run method
predictions = run_prediction(questions, corpus)

# Print results
for key in predictions.keys():
    print(key + ": " + predictions[key]['answer'])
    print("Answer type: " + predictions[key]['answer_type'])
    print()


Here we define a new function named `classify_answer_type`, which takes the answer string returned by the question answering pipeline and classifies it into one of several categories: Numeric Value, Date, Time, Location and Other.
To classify the answer type, we first check whether the answer is a numeric value by matching it against a regular expression for numbers; if it matches, the function returns "Numeric Value". Next, we check whether the answer is a date in the `YYYY-MM-DD` format, and then whether it is a time in the `HH:MM` or `HH:MM:SS` format, again using regular expressions.
If the answer is none of these, we check whether it looks like a location. The pattern only tests that the answer consists of capitalised words, so it is a rough heuristic and will also match other proper nouns such as people's names.
Finally, if the answer does not fall into any of these types, it is classified as "Other".

After that, `run_prediction` predicts the answers from the context and returns, for each question, the predicted answer together with its type. It takes two arguments: the list of questions and the corpus, which is the text used as the context.

All in all, we determine the type of each answer using the `classify_answer_type` function.





# Task 3 (20 points):
Explain in detail:
<br> the IR model, the Knowledge model and the Hybrid model, referring to the lecture ppt. Give real-world application examples for each case and justify your classification.

## Your explanation

The IR model, the knowledge model and the hybrid model are three different kinds of models used for question answering. Each of them is used in real-world applications.

IR model:
IR stands for Information Retrieval, the traditional approach to question answering. It matches the keywords of the question against a large corpus and returns the documents that match best. Such a system can search huge amounts of data quickly and efficiently to find the most relevant answers to user queries. This model is used by search engines such as Google, Bing and Yahoo, where the keywords are used to return a list of web pages that contain them.
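
The keyword matching described above can be sketched as a small ranking function. This is an illustration under assumptions rather than a real search engine: `tokenize` and `rank_documents` are hypothetical helpers, and the score is a plain TF-IDF sum over the query terms.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_documents(query, documents):
    """Return document indices ordered from best to worst keyword match."""
    doc_counts = [Counter(tokenize(doc)) for doc in documents]
    n_docs = len(documents)
    scores = []
    for counts in doc_counts:
        score = 0.0
        for term in set(tokenize(query)):
            # Document frequency: how many documents contain the term
            df = sum(1 for other in doc_counts if term in other)
            if df == 0:
                continue
            # Smoothed inverse document frequency (one common variant)
            idf = math.log((1 + n_docs) / (1 + df)) + 1.0
            score += counts[term] * idf
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


documents = [
    "Wellington is the capital city of New Zealand.",
    "Auckland is the most populous city.",
    "Kiwi birds are flightless.",
]
best = rank_documents("capital of New Zealand", documents)[0]
print(documents[best])
```

Note that the function returns whole documents, not a short answer, which is the main limitation of a pure IR approach.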

Real world application examples:

Digital libraries: the IR model is used to index and search large collections of digital documents such as books and scientific papers. This helps researchers and scholars find relevant information quickly and easily.

E-commerce websites: the IR model is used to recommend products to customers based on their search and purchase history, along with other factors such as ratings, reviews and product popularity.

Legal document search: legal departments and law firms use the IR model to search through large volumes of legal documents, cases and legislation in order to identify the information relevant to their cases.



Knowledge Model:
The knowledge model uses a structured knowledge base to answer questions. The knowledge base is a database of structured information about a particular topic, represented in a way that machines can access easily. The model works by understanding the relationships between the concepts in the knowledge base and using that understanding to answer questions.
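
A knowledge base of this kind can be sketched as a table of (entity, relation) pairs. The example below is a toy illustration: the facts come from the New Zealand context used earlier in this notebook, while `KNOWLEDGE_BASE`, `RELATION_PATTERNS` and `answer_question` are hypothetical names, and the question-to-relation mapping is a simple keyword match assumed for the sketch.

```python
# Structured facts: (entity, relation) -> value
KNOWLEDGE_BASE = {
    ("New Zealand", "capital"): "Wellington",
    ("New Zealand", "most populous city"): "Auckland",
    ("New Zealand", "population"): "4.9 million",
}

# Assumption: a phrase in the question identifies the relation being asked about.
RELATION_PATTERNS = [
    ("capital", "capital"),
    ("how many people", "population"),
    ("largest city", "most populous city"),
]


def answer_question(question, entity):
    """Map the question to a relation and look it up for the given entity."""
    lowered = question.lower()
    for phrase, relation in RELATION_PATTERNS:
        if phrase in lowered:
            return KNOWLEDGE_BASE.get((entity, relation))
    return None  # the knowledge base has no matching relation


print(answer_question("What is the capital city of New Zealand?", "New Zealand"))
print(answer_question("When was New Zealand founded?", "New Zealand"))
```

The second question returns `None`: a knowledge model can only answer what has been encoded as a structured fact.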


Real-world application example:

CRM (customer relationship management): the knowledge model is used in CRM to represent customer data and interactions with the company. For instance, Salesforce uses a knowledge base to represent customer data.

Medical diagnosis: the knowledge model is used in medical diagnosis to represent symptoms, diseases and treatments in a structured way. For instance, the International Classification of Diseases is a knowledge model that is widely used in healthcare to classify diseases.

Financial risk management: the knowledge model is used in financial risk management to represent the factors that contribute to financial risk. For instance, the Basel II framework uses a knowledge model to represent the factors that contribute to financial risk.

Robotics: the knowledge model is used in robotics to represent the physical world and the actions a robot can perform. For instance, the RoboCup Standard Platform League uses a knowledge model to represent the soccer game environment and the actions the robots can perform.

Natural language processing (NLP): the knowledge model is used here to represent language and the relationships between words and concepts. For instance, WordNet is a knowledge model that represents English words and their relationships.

Taxonomy: the knowledge model is used to classify and categorize items in a hierarchical manner. For instance, the Dewey Decimal Classification uses a knowledge model to classify books and library materials.


Hybrid Model:
The hybrid model is a combination of the IR model and the knowledge model. It uses natural language processing, machine learning and a structured knowledge base to provide accurate and precise answers to complex questions. The model works by first using the IR model to narrow down the search space and then using the knowledge model to provide a more detailed answer.
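
The two-step flow described above (retrieve first, then look up structured knowledge) can be sketched as follows. Everything here is a hypothetical toy: `ENTRIES`, `retrieve` and `hybrid_answer` are names made up for the illustration, retrieval is plain word overlap, and the knowledge step is a dictionary lookup.

```python
import re

# Each entry pairs free text (used for retrieval) with structured facts (used for the lookup).
ENTRIES = [
    {
        "entity": "New Zealand",
        "text": "New Zealand is a sovereign island country in the southwestern Pacific Ocean.",
        "facts": {"capital": "Wellington", "population": "4.9 million"},
    },
    {
        "entity": "Australia",
        "text": "Australia is a country in the Southern Hemisphere.",
        "facts": {"capital": "Canberra"},
    },
]


def tokens(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question):
    """IR step: pick the entry whose text shares the most words with the question."""
    q = tokens(question)
    return max(ENTRIES, key=lambda e: len(q & tokens(e["entity"] + " " + e["text"])))


def hybrid_answer(question):
    """Knowledge step: look up the relation named in the question within the retrieved entry."""
    entry = retrieve(question)
    lowered = question.lower()
    for relation, value in entry["facts"].items():
        if relation in lowered:
            return value
    return None


print(hybrid_answer("What is the capital of New Zealand?"))
print(hybrid_answer("What is the capital of Australia?"))
```

The retrieval step keeps the lookup cheap because only the facts of one candidate entry are inspected, which is the speed and accuracy trade-off the hybrid model aims for.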



Real-world application example:

Speech recognition: the hybrid model is used here to improve accuracy. For instance, a hybrid model that combines neural networks with hidden Markov models improves speech recognition accuracy. Google and Apple use hybrid models in their speech recognition systems.

Fraud detection: the hybrid model is used to detect and prevent fraud. For instance, a hybrid model that combines a rule-based system with machine learning algorithms improves fraud detection accuracy. Visa and Mastercard are both examples of companies that use hybrid models in their fraud detection systems.

Traffic prediction: the hybrid model is used in traffic prediction systems to predict traffic patterns and optimize traffic flow. For instance, a hybrid model that combines time series analysis with machine learning algorithms improves traffic prediction accuracy, as in Google Maps.

Recommender systems: the hybrid model is used to provide personalized recommendations to users. For instance, Netflix and Amazon both use hybrid models that combine collaborative filtering and content-based filtering to provide more accurate recommendations.

Health diagnosis: the hybrid model is used to improve the accuracy of diagnosis. For instance, a hybrid model that combines a rule-based system with machine learning algorithms provides more accurate diagnoses. IBM Watson Health and Mayo Clinic use this type of hybrid model for diagnosis.


In summary, the choice of model depends entirely on the requirements of the application. The IR model is good at finding answers to questions quickly, while the knowledge model is better for complex answers that require a deep understanding of the topic. The hybrid model provides a balance between speed and accuracy.


#Tutorial-2:
https://www.turing.com/kb/5-powerful-text-summarization-techniques-in-python

#Task-4 (40 points):
Implement any 4 (10 points each) of the Text Summarization methods present in the given tutorial link. You will have to choose your own text (update the text="""" variable) to summarize. Also provide your detailed analysis on each method you have performed.


Gensim summarization using LSA/LSI

In [None]:
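#Your code for method-1
# gensim.summarization was removed in Gensim 4.0, so an older release is installed here
!pip install "gensim<4.0" beautifulsoup4
from gensim.summarization import summarize
import requests
from bs4 import BeautifulSoup

# Fetching text from a URL
url = "https://en.wikipedia.org/wiki/Artificial_intelligence"
response = requests.get(url)
response.raise_for_status()

# response.text is raw HTML, so keep only the text of the paragraph tags
soup = BeautifulSoup(response.text, "html.parser")
text = "\n".join(p.get_text() for p in soup.find_all("p"))

# Summarizing using Gensim
summary = summarize(text, ratio=0.1)
print(summary)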



##Explanation
Gensim summarization using LSA/LSI:

Gensim is a library for natural language processing. Among other things, it provides an implementation of latent semantic analysis (LSA), also called latent semantic indexing (LSI).

LSA/LSI is a mathematical technique that uses singular value decomposition (SVD) to detect patterns in the relationships between terms and documents. It builds a term-document matrix of term frequency-inverse document frequency (TF-IDF) values and then uses SVD to reduce the dimensionality of that matrix. This lets LSA/LSI detect the most relevant concepts in the text, which can then guide the summarization. Note that the summarize function used in the code ranks sentences with a variation of the TextRank algorithm rather than with LSA/LSI.

Here we implement the first text summarization method, using the Gensim library to perform extractive summarization of the given text.
First we import the libraries. Then we define the URL, fetch the text from it with the requests library and store it in a variable. Next we call the summarize function from Gensim to perform the summarization on the text. The ratio is set to 0.1, which means that the summary contains about 10% of the sentences of the original text. Finally we print the summary.
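
The LSA/LSI idea described above (a term matrix followed by SVD) can be sketched in plain Python. This is a simplified illustration under assumptions, not Gensim's implementation: `lsa_top_sentence` is a hypothetical helper, the matrix holds raw term counts instead of TF-IDF values, and only the first singular vector is computed, using power iteration instead of a full SVD.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase the sentence and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def lsa_top_sentence(sentences, iterations=100):
    """Return the index of the sentence that loads most on the dominant topic.

    Assumes a non-empty list of sentences.
    """
    bags = [Counter(tokenize(s)) for s in sentences]
    vocab = sorted(set(term for bag in bags for term in bag))
    # Term-by-sentence matrix A: rows are terms, columns are sentences
    A = [[bag[term] for bag in bags] for term in vocab]
    n = len(sentences)
    v = [1.0] * n
    for _ in range(iterations):
        # Power iteration on A^T A: u = A v, then w = A^T u
        u = [sum(row[j] * v[j] for j in range(n)) for row in A]
        w = [sum(A[i][j] * u[i] for i in range(len(A))) for j in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            break
        v = [x / norm for x in w]
    # v approximates the first right singular vector; pick its largest component
    return max(range(n), key=lambda j: abs(v[j]))


sentences = [
    "cats chase mice",
    "cats chase birds",
    "cats chase mice and birds",
    "stocks fell",
]
print(sentences[lsa_top_sentence(sentences)])
```

The sentence that covers most of the dominant topic's terms gets the largest weight, while the unrelated sentence about stocks gets a weight close to zero.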


NLTK

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('stopwords')
nltk.download('punkt')  # tokenizer models needed by word_tokenize and sent_tokenize

#Input your text for summarizing below:

text = """ """

#Next, you need to tokenize the text:

stopWords = set(stopwords.words("english"))
words = word_tokenize(text)

#Now, you will need to create a frequency table to keep a score of each word:

freqTable = dict()
for word in words:
    word = word.lower()
    if word in stopWords:
        continue
    if word in freqTable:
        freqTable[word] += 1
    else:
        freqTable[word] = 1

#Next, create a dictionary to keep the score of each sentence:

sentences = sent_tokenize(text)
sentenceValue = dict()

for sentence in sentences:
    for word, freq in freqTable.items():
        if word in sentence.lower():
            if sentence in sentenceValue:
                sentenceValue[sentence] += freq
            else:
                sentenceValue[sentence] = freq

sumValues = 0
for sentence in sentenceValue:
    sumValues += sentenceValue[sentence]

#Now, we define the average value from the original text as such:

# Guard against an empty text, which would otherwise divide by zero
average = int(sumValues / len(sentenceValue)) if sentenceValue else 0

#And lastly, we need to store the sentences into our summary:

summary = ''
for sentence in sentences:
    if (sentence in sentenceValue) and (sentenceValue[sentence] > (1.2 * average)):
        summary += " " + sentence
print(summary)

##Explanation

NLTK:
NLTK stands for Natural Language Toolkit, which we use here for text summarization. Text summarization is the process of generating a short text from a long text without changing its meaning. Here we produce an extractive summary using a frequency-based approach.

First, the input text is split into words using word_tokenize, and stop words are skipped using the NLTK stop word list. Then a frequency table is created for the words in the text. The table is a dictionary where each word is a key and the value is the frequency of that word in the text. Next, the sentenceValue dictionary is created to keep track of the score of each sentence. For each sentence of the input text, the frequencies of the words that appear in the sentence are added to that sentence's score. After that, we calculate the average score of all the sentences and use it to select the sentences whose score is more than 1.2 times the average. Finally, we concatenate the selected sentences to create the summary of the text.
The main advantage is that this method is easy to implement and does not require a large dataset to train a model. On the other hand, it cannot accurately capture the context and intent of the text.


Sumy - LexRank


In [None]:
#Your code for method-3
!pip install sumy beautifulsoup4
import requests
import nltk
nltk.download('punkt')
from bs4 import BeautifulSoup
from sumy.parsers.plaintext import PlaintextParser
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.nlp.tokenizers import Tokenizer

# Fetching text from a URL
url = "https://en.wikipedia.org/wiki/Artificial_intelligence"
response = requests.get(url)
response.raise_for_status()

# response.text is raw HTML, so keep only the text of the paragraph tags
soup = BeautifulSoup(response.text, "html.parser")
text = "\n".join(p.get_text() for p in soup.find_all("p"))

# Preprocessing text
parser = PlaintextParser.from_string(text, Tokenizer("english"))

# Implementing LexRank
summarizer = LexRankSummarizer()
summary = summarizer(parser.document, sentences_count=3)

for sentence in summary:
    print(sentence)


##Explanation

Sumy:
Sumy is a Python library for summarizing text documents. It offers a simple and efficient way to summarize large texts and extract information from them. Sumy provides several different summarization algorithms; here we use one of them, LexRankSummarizer. LexRank is a graph-based algorithm that ranks sentences by importance, using the concept of eigenvector centrality over a graph of sentence similarities to detect the most important sentences of the document.

Here the text is retrieved from the URL and stored in a variable. We create a PlaintextParser object from the text, which we use for text processing. After that we initialize the LexRankSummarizer object, which applies the LexRank algorithm for summarization. We call the summarizer on parser.document, specifying the number of sentences to include in the summary, and use a loop to print each of the sentences.

The LexRank algorithm uses the similarity between sentences to decide how important each sentence is. It first creates a matrix of pairwise similarities between all of the sentences in the text, and then applies a PageRank-style algorithm to that matrix to calculate the centrality of each sentence. The most central sentences are selected for the summary. This generates a summary that captures the most important information of the text without relying on external text or training data. On the other hand, it cannot capture the nuance of the original text as well as other methods such as abstractive summarization.
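
The two LexRank steps described above (a pairwise similarity matrix, then a PageRank-style centrality computation) can be sketched in plain Python. This is a simplified illustration and not sumy's implementation: `cosine` and `lexrank_scores` are hypothetical helpers, similarity is plain bag-of-words cosine without IDF weighting or a threshold, and the damping value of 0.85 is an assumption.

```python
import math
import re
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexrank_scores(sentences, damping=0.85, iterations=50):
    """Return one centrality score per sentence; the scores sum to 1."""
    bags = [Counter(re.findall(r"[a-z0-9]+", s.lower())) for s in sentences]
    n = len(bags)
    # Pairwise similarity matrix, ignoring each sentence's similarity to itself
    sim = [[cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)] for i in range(n)]
    # Row-normalize into transition probabilities; isolated sentences jump uniformly
    rows = []
    for row in sim:
        total = sum(row)
        rows.append([v / total for v in row] if total > 0 else [1.0 / n] * n)
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n + damping * sum(scores[i] * rows[i][j] for i in range(n))
            for j in range(n)
        ]
    return scores


sentences = [
    "New Zealand is an island country.",
    "The capital of New Zealand is Wellington.",
    "Wellington hosts the national parliament.",
]
scores = lexrank_scores(sentences)
best = max(range(len(sentences)), key=lambda i: scores[i])
print(sentences[best])
```

The middle sentence shares words with both of the others, so it ends up as the most central one, which is exactly the kind of sentence an extractive summary would keep.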

GPT-3

In [None]:
#Your code for method-4
# The calls below use the legacy openai interface (openai.Completion), which needs a version below 1.0
!pip install "openai<1.0"
!pip install wget
!pip install pdfplumber

import pathlib

import openai
import pdfplumber
import wget

#To download the PDF and return its local path, enter the following:

def getPaper(paper_url, filename="random_paper.pdf"):
    downloadedPaper = wget.download(paper_url, filename)
    downloadedPaperFilePath = pathlib.Path(downloadedPaper)
    return downloadedPaperFilePath

#Now, you need to convert the PDF into text so GPT-3 can read it:

# Assumes random_paper.pdf already exists locally, for example after calling getPaper
paperFilePath = "random_paper.pdf"
paperContent = pdfplumber.open(paperFilePath).pages

def displayPaperContent(paperContent, page_start=0, page_end=5):
    for page in paperContent[page_start:page_end]:
        print(page.extract_text())

displayPaperContent(paperContent)

#Now that you have the text, it's time to start summarizing it:

def showPaperSummary(paperContent):
    tldr_tag = "\n tl;dr:"
    openai.organization = 'organization key'
    openai.api_key = "your api key"
    engine_list = openai.Engine.list()  # engines available from the openai API

    #Here, we are letting the GPT-3 model know that we require a summary. Then, we proceed to set up the environment to use the openai API.

    for page in paperContent:
        # extract_text() may return None for a page without text
        text = (page.extract_text() or "") + tldr_tag
        response = openai.Completion.create(
            engine="davinci",
            prompt=text,
            temperature=0.3,
            max_tokens=140,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
            stop=["\n"],
        )
        print(response["choices"][0]["text"])

#This code extracts the text from each page, feeds the GPT-3 model the max tokens for each page, and prints it to the terminal.

#Now that everything is set up, we can run the summarizer:

paperContent = pdfplumber.open(paperFilePath).pages
showPaperSummary(paperContent)

##Explanation

GPT-3:
GPT-3 stands for Generative Pre-trained Transformer 3, a language model developed by OpenAI. Here we use it for text summarization, which is a more advanced and powerful approach to summarizing text.

Here we implement the summarization of a PDF document using GPT-3. First, the getPaper function takes two arguments, the paper URL and a filename; it downloads the PDF document from the URL to the local file system and returns the path of the downloaded file. After that, paperContent opens the downloaded PDF file using the pdfplumber library and holds its pages. The displayPaperContent function prints the extracted text of the document by iterating over the pages and calling the extract_text method on each one. The showPaperSummary function summarizes the PDF document with GPT-3 page by page: it extracts the text of each page, appends the tl;dr: tag, and calls the openai.Completion.create() method to generate the summary of the text, which is then printed.

The max_tokens, temperature, top_p, frequency_penalty, and presence_penalty parameters passed to openai.Completion.create() all control the behaviour of GPT-3 during text generation. max_tokens is the maximum number of tokens that can be generated in the summary, while temperature controls the level of randomness of the generated text.
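
To make the effect of temperature more concrete, the sketch below applies the standard temperature-scaled softmax to a few made-up token scores. It is only an illustration of the general idea: `softmax_with_temperature` is a hypothetical helper, the logits are invented for the example, and nothing here reproduces the internals of the OpenAI API.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

for temperature in (0.3, 1.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

With a low temperature such as 0.3, almost all of the probability goes to the highest-scoring token, so the generated text is more deterministic; with a temperature of 1.0 the other tokens keep a noticeable share.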


#Task-5 (10 points):
#What are the advantages and disadvantages of extractive and abstractive summarization?

##Explanation

Text summarization: this is the process of condensing a large text into a short summary. The method is used to preserve the main intent of the text. It reduces the time required to understand a long passage, without changing the meaning of the text.

Extractive summarization:

Extractive summarization selects the most relevant portions of the text and reproduces them word for word. It returns a summarized version of the original text that only compresses it, by selecting and rearranging portions of the original.

Abstractive summarization:
Abstractive summarization is the process of generating a clearer version of the text by creating a summary that conveys the main intent of the original text. It generates new sentences that capture the essence of the original text, possibly incorporating additional information.


Extractive summarization advantages:

It maintains factual accuracy, because the summary is built from the actual sentences of the source text.
It gives accurate answers by using sentences taken directly from the original source text.
It is easy to implement, as it only requires selecting and arranging parts of the source text.
It is a simple approach to execute, because it works solely with text taken from the source.
The original meaning of each sentence is not changed by this method.


Extractive summarization disadvantages:

It has limited ability to rephrase sentences, since the method is restricted to the original wording.
Summaries generated using this method may appear disjointed or fragmented.
Sometimes crucial information is left out of the selected sentences and is therefore excluded from the summary.


Abstractive summarization advantages:

It can produce more concise and fluent summaries using NLP techniques.
It is capable of capturing the essence of the source text, even if that is not stated verbatim.
It is more flexible, as it can rephrase sentences without any restrictions.

Abstractive summarization disadvantages:

The summary may be inaccurate, because it is produced by natural language generation.
It is difficult to implement, because it requires more complex algorithms and natural language processing techniques.
The meaning or context of the source text may be changed; the method sometimes fails to preserve the original meaning of a sentence.



a. To execute the same code on a single core at the same speed as on a quad-core, we need to increase the frequency and voltage on the single core. Assuming the code is 80% parallelizable, Amdahl's law gives the speedup achieved by parallelizing it over four cores:

Speedup = 1 / ((1 - 0.8) + 0.8/4) = 1 / 0.4 = 2.5

This means that the quad-core runs the code 2.5 times faster than a single core at the same frequency. Execution time is inversely proportional to clock frequency, so the single core needs a frequency that is 2.5 times higher:

New frequency = 2.5 x 0.5 GHz = 1.25 GHz

Assuming that the supply voltage has to scale linearly with the frequency, the voltage rises by the same factor:

New voltage = 2.5 x 0.5 V = 1.25 V

Therefore, to execute the same code on a single core at the same speed as on a quad-core, we need to increase the frequency to 1.25 GHz and the voltage to 1.25 V.

b. Dynamic power is proportional to the capacitance, the voltage squared, and the frequency, so the dynamic energy for a fixed amount of work is proportional to the capacitance and the voltage squared. Assuming that the capacitance remains constant, the ratio between the dynamic energy at the original voltage and at the scaled voltage from part a is:

Ratio = (0.5 V / 1.25 V)^2 = 1 / 2.5^2 = 0.16

This means that running at the original frequency and voltage uses 16% of the dynamic energy of the scaled-up single core, a reduction of 84%.
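
Amdahl's law, which part a relies on, can be evaluated directly as a quick check. This is a minimal sketch; `amdahl_speedup` is a hypothetical helper name, and it assumes the serial part of the code gets no benefit from the extra cores.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speedup from running the parallel fraction of a program on the given number of cores."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# 80% parallelizable code on four cores, as in part a
speedup = amdahl_speedup(0.8, 4)
print(f"Speedup on 4 cores: {speedup:.2f}")
```

Because the serial 20% is never accelerated, the speedup stays below the core count no matter how many cores are added.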

c. With a dark silicon approach, the dynamic energy is reduced by power gating all hardware units and using specialized ASICs that consume 20% of the power of the general-purpose processor. Assuming that each core is power gated, the dynamic energy required for the video game is:

Energy = (0.2 x 0.5 W x 2) + (0.2 x 0.5 W x 2) + (0.5 W x 2) = 1.4 W

This is the same as the dynamic energy required for parallelizing the code over four cores.