<div align='center'><font size="6">Assessment for Prompt Engineering Role</font></div>



## Assignment-1(Natural Language Processing)

Aim:
    To implement functions that preprocess and tokenize text data using NLTK.

Procedure:
    
    *Data Cleaning: Convert all text to a single case (lowercase or uppercase) and remove noise (punctuation, links, text in square brackets, etc.).
    
    *Tokenisation: Tokenisation splits an input sequence into units called tokens, which can be words, sentences, paragraphs, etc. The type of tokenisation depends on the kind of token required.
    
    *Stop-Words Removal: Stop-words are words that occur very frequently but carry little meaning on their own, for example a, an, the, are. Removing them is an important preprocessing step.

Result:
    The cells below define user-defined functions that implement the preprocessing and tokenisation steps described above.

### Importing Necessary Modules

In [None]:
import re
import string
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords', quiet=True)  # Fetch the stopword corpus if it is not already installed

### Data Cleaning

In [None]:
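def clean_text(text):
    '''Make text lowercase and remove text in square brackets, links, HTML tags,
    punctuation, newlines and words containing numbers.'''
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)                 # text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)   # links
    text = re.sub(r'<.*?>+', '', text)                  # HTML tags
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\n', '', text)                      # newlines
    text = re.sub(r'\w*\d\w*', '', text)                # words containing digits
    return text

import pandas as pd  # Needed for Series.apply below

# Enter the text to be preprocessed and tokenized here.
# input() returns a plain string, which has no .apply method, so it is wrapped in a Series;
# for several texts, pass a list of strings to pd.Series instead.
input_data = pd.Series([input()])
data = input_data.apply(clean_text)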
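As a quick check of what `clean_text` does, the sketch below runs the same substitutions on one made-up sentence. The sample string is an assumption chosen for illustration and is not part of the assignment data.

```python
import re
import string

def clean_text(text):
    """Same cleaning steps as above, repeated so this sketch runs on its own."""
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

# Hypothetical sample containing a link, bracketed text, a digit word and punctuation.
sample = "Visit https://example.com [ad] for 3D models, NOW!"
cleaned = clean_text(sample)

print(repr(cleaned))
print(cleaned.split())  # ['visit', 'for', 'models', 'now']
```

The removed pieces leave runs of extra spaces behind. A tokenizer that matches `\w+` ignores them, so no separate whitespace clean-up is needed before the tokenisation step.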

### Tokenisation

In [None]:
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')  # Any tokenizer can be used; RegexpTokenizer is one example.
tokenized_text = data.apply(tokenizer.tokenize)
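To illustrate how the choice of token type changes the result, here is a minimal sketch that uses only the standard `re` module. The word pattern is the same `\w+` used with `RegexpTokenizer` above and should give the same tokens. The sentence splitter is a naive rule assumed for this example, not an NLTK function.

```python
import re

text = "Tokenisation splits text. It can produce words or sentences!"

# Word-level tokens: every run of letters, digits or underscores.
word_tokens = re.findall(r'\w+', text)

# Sentence-level tokens: split on whitespace that follows '.', '!' or '?'.
# This naive rule would also split after abbreviations such as "e.g.".
sentence_tokens = re.split(r'(?<=[.!?])\s+', text.strip())

print(word_tokens)
print(sentence_tokens)
```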

### Stopwords Removal

In [None]:
def remove_stopwords(text):
    """
    Remove English stopwords from a list of tokens.
    """
    stop_words = set(stopwords.words('english'))  # Build the set once instead of reloading the list for every token
    words = [w for w in text if w not in stop_words]
    return words

result = tokenized_text.apply(remove_stopwords)
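The filtering step can be tried without downloading any corpus. The sketch below uses a tiny hand-picked stop-word set, which is an assumption for illustration only (NLTK's English list is much longer), and keeps it in a `set` so that each membership test is a fast lookup.

```python
# Tiny illustrative stop-word set; NOT the NLTK list.
STOP_WORDS = {"a", "an", "the", "is", "are", "on"}

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop-word set."""
    return [t for t in tokens if t not in stop_words]

tokens = ["the", "cat", "is", "on", "a", "mat"]
filtered = remove_stopwords(tokens)
print(filtered)  # ['cat', 'mat']
```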

<hr>

## Assignment-2(Text Generation)

Aim:
    To create a basic text generation model using a pre-trained transformer (e.g., GPT-3).

Procedure:

    *Import the Hugging Face Transformers library and any other necessary packages.

    *Choose a suitable model and tokenizer, then initialise them.

    *Text Generation Function: This function encodes the input prompt, generates text from it, and decodes the generated output back into readable text.

Result:
       The cells below build the text generation model, which generates text that continues a given prompt.
       

### Importing Necessary Packages

In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

### Loading Pre-trained model and tokenizer

In [None]:
# "gpt3" is not a Hugging Face model ID and GPT-3 weights are not publicly available,
# so the openly available GPT-2 is used here. Any GPT-2 style checkpoint you have access to
# (for example "gpt2-medium") can be substituted.
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

### Text Generation

In [None]:
def generate_text(prompt, max_length=100):  # max_length counts prompt tokens plus generated tokens; adjust as required.
    input_ids = tokenizer.encode(prompt, return_tensors='pt')  # Encode the prompt into token IDs

    # pad_token_id is set explicitly because GPT-2 defines no padding token.
    output = model.generate(input_ids, max_length=max_length, num_return_sequences=1,
                            pad_token_id=tokenizer.eos_token_id)

    # The output is a tensor of token IDs, so decode it back into human-readable text.
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    return generated_text

In [None]:
prompt = input()  # Type the prompt when asked during execution, or replace input() with a string.
generated_text = generate_text(prompt, max_length=100) 
print(generated_text)
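To make the encode, generate and decode steps concrete without downloading a model, here is a toy sketch of greedy decoding. The five-word vocabulary and the score table are hypothetical stand-ins for a real tokenizer and transformer. Only the control flow mirrors the steps described above, including the fact that `max_length` counts the prompt tokens as well.

```python
# Hypothetical vocabulary and next-token scores; a real model computes these scores.
VOCAB = ["<eos>", "the", "cat", "sat", "down"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
NEXT_SCORES = {
    1: [0.0, 0.0, 0.9, 0.1, 0.0],  # after "the"  -> "cat"
    2: [0.1, 0.0, 0.0, 0.8, 0.1],  # after "cat"  -> "sat"
    3: [0.2, 0.0, 0.0, 0.0, 0.8],  # after "sat"  -> "down"
    4: [1.0, 0.0, 0.0, 0.0, 0.0],  # after "down" -> "<eos>"
}

def encode(text):
    """Map each whitespace-separated word to its token ID."""
    return [TOKEN_TO_ID[w] for w in text.lower().split()]

def decode(ids, skip_special_tokens=True):
    """Map token IDs back to text, optionally dropping the end marker."""
    tokens = [VOCAB[i] for i in ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if t != "<eos>"]
    return " ".join(tokens)

def generate(input_ids, max_length=10):
    """Greedy decoding: repeatedly append the highest-scoring next token."""
    ids = list(input_ids)
    while len(ids) < max_length:
        scores = NEXT_SCORES[ids[-1]]
        next_id = max(range(len(scores)), key=scores.__getitem__)
        ids.append(next_id)
        if next_id == TOKEN_TO_ID["<eos>"]:
            break
    return ids

full = decode(generate(encode("the")))
short = decode(generate(encode("the"), max_length=3))
print(full)   # the cat sat down
print(short)  # the cat sat
```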

<hr>


## Assignment-3(Prompt Engineering)

Aim:
    To design and evaluate prompts that improve the performance of an AI model on tasks such as summarization and question answering.

Procedure:

          *API Key: I generated an API key to authenticate requests to OpenAI and receive responses in my interface, then exported it as an environment variable in my IDE. Each user must perform this step individually; within an organisation, the key is typically generated and provided by the DevOps team.

          *Prompt Templates: After importing the necessary packages, I created different prompt templates: two designs for summarisation and two for question answering.

          *Model Selection: I used the OpenAI Completion model for this assignment, connecting to the respective engine to generate outputs from the user-supplied prompts. A helper function calls the OpenAI API to generate a response for a given prompt.

          *Evaluation: I used the BLEU and ROUGE metrics to evaluate the summarisation responses, and accuracy and F1 score for the question answering responses (the code below computes F1). Each metric compares the generated output with the correct (reference) output. Since I had no dataset containing these fields (text_for_summary and reference_summary for summarisation; context_for_qa, question_for_qa and reference_answer for question answering), I supplied them manually.

          *Note: The code below makes the previous point concrete. OpenAI generates the outputs, but evaluating them requires a predefined test set with the fields listed above. The OpenAI model is already trained, so only test data is needed; lacking a dataset, I entered the test examples manually for the time being. A predefined dataset can be used instead.

Result:
       The objective of this assignment has been achieved. A larger test set would make the evaluation more reliable.

### Exporting API Key

In [None]:
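# Windows shell command (the leading "!" runs it from a notebook cell).
# Replace the placeholder with your own key and never commit a real key.
# setx only affects new processes, so restart the IDE/kernel afterwards.
!setx OPENAI_API_KEY "your-api-key"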

### Importing Necessary Packages

In [None]:
import openai
from sklearn.metrics import f1_score
from nltk.translate.bleu_score import sentence_bleu
from rouge import Rouge

### Setting OpenAI API Key

In [None]:
import os

openai.api_key = os.environ["OPENAI_API_KEY"]  # Read the exported key instead of hard-coding it

### Defining Different Prompt Designs

In [None]:
prompts = {
    "summarization_1": "Summarize the following text: {}",
    "summarization_2": "Provide a concise summary of this text: {}",
    "question_answering_1": "What is the answer to the following question based on this context? Context: {} Question: {}",
    "question_answering_2": "Based on this information: {} What is your answer to this question: {}",
}
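The summarisation templates take one value while the question answering templates take two, which matters when they are filled with `str.format`. A minimal sketch follows; the variable names are illustrative, and the context and question are the ones used later in this assignment.

```python
qa_template = ("What is the answer to the following question based on this context? "
               "Context: {} Question: {}")
context = "The capital of France is Paris."
question = "What is the capital of France?"

# Two placeholders need two arguments, in order: context first, then question.
filled = qa_template.format(context, question)
print(filled)

# Passing a single combined string leaves the second placeholder unfilled,
# which str.format reports as an IndexError.
try:
    qa_template.format(context + " " + question)
    one_argument_works = True
except IndexError:
    one_argument_works = False
print(one_argument_works)  # False
```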

### Function to call OpenAI Model 

In [None]:
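def generate_response(prompt, text):
    """Fill the prompt template with `text` and return the model's completion."""
    # Note: openai.Completion is the legacy interface of the openai package (versions below 1.0).
    response = openai.Completion.create(
        engine="text-davinci-003",
        prompt=prompt.format(text),
        max_tokens=150, n=1, stop=None, temperature=0.7,
    )
    return response.choices[0].text.strip()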

### Function to evaluate summarisation

In [None]:
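def evaluate_summarization(reference_summary, generated_summary):
    """Return (BLEU, ROUGE-L F-score) for a generated summary against its reference."""
    # BLEU compares whitespace-separated tokens of the two summaries.
    bleu_score = sentence_bleu([reference_summary.split()], generated_summary.split())
    rouge = Rouge()
    rouge_score = rouge.get_scores(generated_summary, reference_summary)[0]
    return bleu_score, rouge_score['rouge-l']['f']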
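To show what a ROUGE-L style F-score measures, here is a simplified sketch based on the longest common subsequence (LCS) of the two token lists. It is an illustration under simple assumptions (lowercasing, whitespace tokens, plain F1); the `rouge` package may differ in its details, and the two sentences are made up for the example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(reference, generated):
    """Simplified ROUGE-L: F1 of LCS-based precision and recall."""
    ref, gen = reference.lower().split(), generated.lower().split()
    lcs = lcs_length(ref, gen)
    if lcs == 0:
        return 0.0
    precision = lcs / len(gen)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# The LCS is "the cat on the mat" (5 tokens out of 6 on each side).
score = rouge_l_f1("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 4))  # 0.8333
```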

### Function to evaluate Question Answering

In [None]:
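def evaluate_qa(reference_answer, generated_answer):
    # Each answer is treated as a single class label, so the score is 1.0 only
    # when the generated answer matches the reference string exactly.
    reference = [reference_answer]
    prediction = [generated_answer]
    f1 = f1_score(reference, prediction, average='weighted')
    return f1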
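Because a generated answer is often a full sentence rather than the bare reference string, a token-level F1 gives partial credit where a whole-string comparison gives none. The sketch below is one common way to compute it, in the style of extractive QA benchmarks. The helper names and the normalisation (lowercasing, stripping punctuation) are assumptions for illustration.

```python
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation and split into tokens."""
    text = text.lower().translate(str.maketrans('', '', string.punctuation))
    return text.split()

def token_f1(reference_answer, generated_answer):
    """F1 over the tokens shared by the reference and the generated answer."""
    ref_tokens = normalize(reference_answer)
    gen_tokens = normalize(generated_answer)
    overlap = sum((Counter(ref_tokens) & Counter(gen_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def exact_match(reference_answer, generated_answer):
    """True when both answers normalise to the same token list."""
    return normalize(reference_answer) == normalize(generated_answer)

partial = token_f1("Paris", "The capital of France is Paris.")
print(round(partial, 4))              # 0.2857 (precision 1/6, recall 1)
print(exact_match("Paris", "Paris."))  # True
```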

### Example texts and corresponding references

In [None]:
'''
Here I have manually provided one test example each for summarisation and QA evaluation.
A list of test examples can be supplied instead; in that case only the evaluation cells
need small changes to iterate over the list.
'''

text_for_summary = "ChatGPT is a conversational agent developed by OpenAI that uses artificial intelligence to engage in dialogue with users." 
reference_summary = "ChatGPT is developed by OpenAI as a conversational AI." 


context_for_qa = "The capital of France is Paris." 
reference_answer = "Paris" 
question_for_qa = "What is the capital of France?"

### Evaluating prompt designs(Summarisation)

In [None]:
for prompt_key, prompt_template in prompts.items():  # Iterate over the prompt designs one by one
    if 'summarization' in prompt_key:  # Select the summarisation prompt designs
        generated_summary = generate_response(prompt_template, text_for_summary)  # Generate a response from OpenAI
        bleu, rouge = evaluate_summarization(reference_summary, generated_summary)  # Evaluation
        print(f"{prompt_key} - BLEU: {bleu}, ROUGE: {rouge}")

### Evaluating prompt designs(Question Answering)

In [None]:
for prompt_key, prompt_template in prompts.items():
    if 'question_answering' in prompt_key:  # Select the QA prompt designs
        # QA templates have two placeholders. Fill in the context here and leave "{}"
        # for the question, which generate_response fills in via prompt.format(text).
        qa_template = prompt_template.format(context_for_qa, "{}")
        generated_answer = generate_response(qa_template, question_for_qa)  # Generate a response from OpenAI
        f1 = evaluate_qa(reference_answer, generated_answer)  # Evaluation
        print(f"{prompt_key} - F1 Score: {f1}")

<hr>

## Assignment-4(Data Analysis)

Aim:
    To analyze a dataset (Wine Quality from the UCI ML Repository) and generate insights using a combination of descriptive statistics and visualizations.

Procedure:

        *Loading Data: First import the necessary packages, then load the data with Pandas. The dataset can either be downloaded from the repository and read from a local copy, or read directly from its URL.

        *EDA: Exploratory Data Analysis (EDA) gives a first understanding of the data. Descriptive statistics are computed in this step, along with the quality count, that is, the number of occurrences of each wine quality rating.

        *Visualisation: To visualise the analysis and insights, I used three plots, each with its own purpose. The pairplot displays the relationships between selected features and wine quality. The boxplot examines the distribution of alcohol content across the different wine quality ratings. The heatmap displays the correlation matrix for all features in the dataset.


Result:
       The wine dataset has been analysed and visualised successfully, and data-driven insights have been derived from the process.

### Importing Necessary Packages

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


### Loading Data

In [None]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
wine_data = pd.read_csv(url, sep=';')  # The file uses ';' as its column separator

### EDA and descriptive statistics

In [None]:
print(wine_data.head())  # Print the first 5 rows to get a first look at the data

summary_statistics = wine_data.describe() #Summary Statistics
print("\nSummary Statistics:\n", summary_statistics)  

quality_count = wine_data['quality'].value_counts() # Count of wine quality ratings
print("\nQuality Count:\n", quality_count) 
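Raw counts are easier to compare when expressed as shares of the dataset. A minimal sketch using `value_counts(normalize=True)` follows. The six ratings are toy values assumed for illustration, not the real wine data; on the real data the same call would be applied to `wine_data['quality']`.

```python
import pandas as pd

# Toy ratings for illustration only (not taken from the wine dataset).
ratings = pd.Series([5, 5, 6, 6, 6, 7], name="quality")

# normalize=True returns proportions instead of counts; sort_index orders by rating.
share = ratings.value_counts(normalize=True).sort_index()
print(share)
```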

## Visualisation

### Pairplot(For Visualising Relationship)

In [None]:
sns.pairplot(wine_data, hue='quality',
             vars=['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'alcohol'])  # Only a few selected features
plt.suptitle('Pairplot of Wine Quality Dataset (Selected Features)', y=1.02)  # Figure-level title; plt.title would label only the last subplot
plt.show()

### Boxplot(For Visualising alcohol content by quality)

In [None]:
plt.figure(figsize=(10, 6)) #Size of the plot
sns.boxplot(x='quality', y='alcohol', data=wine_data)
plt.title('Boxplot of Alcohol Content by Wine Quality') 
plt.xlabel('Wine Quality')
plt.ylabel('Alcohol Content')
plt.show()

### Heatmap(Correlation Matrix)

In [None]:
plt.figure(figsize=(12, 8)) 
correlation_matrix = wine_data.corr() 
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5) 
plt.title('Correlation Heatmap of Wine Quality Features')
plt.show()
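The heatmap shows every pairwise correlation, but for insights about quality only the `quality` column of the matrix matters. A minimal sketch of ranking features by that column is given below. The five rows are toy values assumed for illustration; on the real data the same expression would be applied to `wine_data`.

```python
import pandas as pd

# Toy rows for illustration only (not taken from the wine dataset).
toy = pd.DataFrame({
    "alcohol": [9.0, 9.5, 10.0, 11.0, 12.0],
    "volatile acidity": [0.9, 0.7, 0.6, 0.5, 0.3],
    "quality": [3, 4, 5, 6, 7],
})

# Correlation of each feature with quality, from most negative to most positive.
ranked = toy.corr()["quality"].drop("quality").sort_values()
print(ranked)
```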

## Analysis Report

1. Dataset Overview:
- The dataset consists of 1,599 samples with 12 columns (11 features plus the target variable, "quality").
- The features are physicochemical properties such as acidity, sugars, pH, and alcohol content.
- The target variable "quality" is scored on a scale from 0 to 10; the ratings observed in this dataset range from 3 to 8.

2. Summary Statistics:
- **Fixed Acidity**: Ranges from approximately 4.6 to 15.9 g/dm³, with a mean of around 8.32.
- **Volatile Acidity**: Ranges from 0.12 to 1.58 g/dm³, a wide spread for a property that can noticeably affect a wine's taste.
- **Citric Acid**: Ranges from 0 to 1.0 g/dm³, with many wines having low citric acid levels.
- **Residual Sugar**: Shows a wide range, with values from 0.9 to 15.5 g/dm³, indicating variability in sweetness levels.
- **Alcohol Content**: Ranges between 8.4% and 14.9% with a mean of approximately 10.4%, suggesting a general tendency towards moderate alcohol content.

3. Quality Count Distribution:
- Quality ratings 5 and 6 are the most frequent, accounting for roughly 43% and 40% of the dataset respectively.
- Lower ratings (3-4) and higher ratings (7-8) are much less common, indicating that most wines are rated as average.

4. Key Visual Insights: 
- **Pairplot Analysis**: The pairplot shows some separation between quality ratings for features such as alcohol, volatile acidity, and citric acid, suggesting that these features are related to wine quality.

- **Boxplot Analysis of Alcohol Content**:
    - Higher-quality wines (ratings 7-8) tend to have higher alcohol content.
    - Lower-quality wines tend to show greater variability in alcohol content.

- **Correlation Heatmap**:
    - **Positive Correlation**: Alcohol content shows a positive correlation with wine quality (correlation coefficient ≈ 0.48). Higher alcohol content is generally associated with better wine quality.
    - **Negative Correlation**: Volatile acidity has a negative correlation with quality (correlation coefficient ≈ -0.39), meaning higher volatile acidity is associated with lower quality ratings.

<hr>