# HW4: Natural Language Processing

 <div class="alert alert-block alert-warning">Each assignment needs to be completed independently. Never ever copy others' work (even with minor modification, e.g. changing variable names). Anti-Plagiarism software will be used to check all submissions. </div>

## Problem Description

In this assignment, we'll use what we learned in the preprocessing module to compare ChatGPT-generated answers with human-generated answers. A dataset with 200 questions and answers has been provided for you to use. The full dataset can be found at https://huggingface.co/datasets/Hello-SimpleAI/HC3.


Please follow the instructions below to do the assessment step by step and answer all analysis questions.


In [1]:
import pandas as pd
import spacy
import nltk
import numpy as np
from sklearn.preprocessing import normalize


from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [2]:
data = pd.read_csv("C://Users//Rahul//Documents//Rahul Rajpurohit Assignments//BIA 660//HW_4_Rahul//qa.csv")
data.head()

Unnamed: 0,question,chatgpt_answer,human_answer
0,What happens if a parking ticket is lost / des...,If a parking ticket is lost or destroyed befor...,In my city you also get something by mail to t...
1,"why the waves do n't interfere ? first , I 'm ...",Interference is the phenomenon that occurs whe...,They do actually . That 's why a microwave ove...
2,Is it possible to influence a company's action...,"Yes, it is possible to influence a company's a...",Yes and no. This really should be taught at ju...
3,Why do taxpayers front the bill for sports sta...,Sports stadiums are usually built with public ...,That 's the bargaining chip that team owners u...
4,Why do clothing stores generally have a ton of...,There are a few reasons why clothing stores ma...,Your observation is almost certainly a matter ...


## Q1. Tokenize function

Define a function `tokenize(docs, lemmatized = True, remove_stopword = True, remove_punct = True)`  as follows:
   - Take four parameters: 
       - `docs`: a list of documents (e.g. questions)
       - `lemmatized`: an optional boolean parameter to indicate if tokens are lemmatized. The default value is True (i.e. tokens are lemmatized).
       - `remove_stopword`: an optional boolean parameter to remove stop words. The default value is True (i.e. remove stop words). 
       - `remove_punct`: an optional boolean parameter to remove punctuation tokens. The default value is True (i.e. remove punctuation tokens).
   - Split each input document into unigrams and also clean up tokens as follows:
       - if `lemmatized` is turned on, lemmatize all unigrams.
       - if `remove_stopword` is set to True, remove all stop words.
       - if `remove_punct` is set to True, remove all punctuation tokens.
       - remove all empty tokens and lowercase all the tokens.
   - Return the list of tokens obtained for each document after all the processing. 
   
(Hint: you can use the spaCy package for this task. For reference, check https://spacy.io/api/token#attributes)

In [3]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

# load the English language model in spaCy
nlp = spacy.load('en_core_web_sm')

def tokenize_a_doc(doc, nlp, lemmatized=True, remove_stopword=True, remove_punct=True):
    # initialize an empty list to store the tokens
    tokens = []
    # create a spacy document from the input text
    doc = nlp(doc)
    for token in doc:
        # if remove_punct is True, exclude punctuation tokens
        if remove_punct and token.is_punct:
            continue
        # if remove_stopword is True, exclude stop words
        if remove_stopword and token.is_stop:
            continue
        # if lemmatized is True, lemmatize the token
        if lemmatized:
            lemma = token.lemma_.lower().strip()  # lemmatize and convert to lowercase
            tokens.append(lemma)
        else:
            tokens.append(token.text.lower().strip())  # convert to lowercase and append to the list
    # exclude empty tokens
    tokens = [t for t in tokens if t]
    return tokens


def tokenize(docs, lemmatized=True, remove_stopword=True, remove_punct=True):
    # initialize an empty list to store the tokens for each document
    tokenized_docs = []
    for doc in docs:
        # tokenize each document and append to the list
        tokenized_doc = tokenize_a_doc(doc, nlp, lemmatized=lemmatized, remove_stopword=remove_stopword, remove_punct=remove_punct)
        tokenized_docs.append(tokenized_doc)
    return tokenized_docs


Test your function with different parameter configurations and observe the differences in the resulting tokens.

In [4]:
# For simplicity, we test on a single document

print(data["question"].iloc[0] + "\n")

print(f"1.lemmatized=True, remove_stopword=False, remove_punct = True:\n \
{tokenize(data['question'].iloc[0:1], lemmatized=True, remove_stopword=False, remove_punct = True)}\n")

print(f"2.lemmatized=True, remove_stopword=True, remove_punct = True:\n \
{tokenize(data['question'].iloc[0:1], lemmatized=True, remove_stopword=True, remove_punct = True)}\n")

print(f"3.lemmatized=False, remove_stopword=False, remove_punct = True:\n \
{tokenize(data['question'].iloc[0:1], lemmatized=False, remove_stopword=False, remove_punct = True)}\n")

print(f"4.lemmatized=False, remove_stopword=False, remove_punct = False:\n \
{tokenize(data['question'].iloc[0:1], lemmatized=False, remove_stopword=False, remove_punct = False)}\n")


What happens if a parking ticket is lost / destroyed before the owner is aware of the ticket , and it goes unpaid ? I 've always been curious . Please explain like I'm five.

1.lemmatized=True, remove_stopword=False, remove_punct = True:
 [['what', 'happen', 'if', 'a', 'parking', 'ticket', 'be', 'lose', 'destroy', 'before', 'the', 'owner', 'be', 'aware', 'of', 'the', 'ticket', 'and', 'it', 'go', 'unpaid', 'i', 've', 'always', 'be', 'curious', 'please', 'explain', 'like', 'i', 'be', 'five']]

2.lemmatized=True, remove_stopword=True, remove_punct = True:
 [['happen', 'parking', 'ticket', 'lose', 'destroy', 'owner', 'aware', 'ticket', 'go', 'unpaid', 've', 'curious', 'explain', 'like']]

3.lemmatized=False, remove_stopword=False, remove_punct = True:
 [['what', 'happens', 'if', 'a', 'parking', 'ticket', 'is', 'lost', 'destroyed', 'before', 'the', 'owner', 'is', 'aware', 'of', 'the', 'ticket', 'and', 'it', 'goes', 'unpaid', 'i', 've', 'always', 'been', 'curious', 'please', 'explain', 'like

## Q2. Sentiment Analysis


Let's check if there is any difference in sentiment between ChatGPT-generated and human-generated answers.


Define a function `compute_sentiment(gen_tokens, ref_tokens, pos, neg)` as follows:
- take four parameters:
    - `gen_tokens` is the tokenized ChatGPT-generated answers by the `tokenize` function in Q1.
    - `ref_tokens` is the tokenized human-generated answers by the `tokenize` function in Q1.
    - `pos` (`neg`) is the list of positive (negative) words, which can be found in the Canvas preprocessing module.
- for each ChatGPT-generated or human-generated answer, compute the sentiment as `(#pos - #neg )/(#pos + #neg)`, where `#pos`(`#neg`) is the number of positive (negative) words found in each answer. If an answer contains none of the positive or negative words, set the sentiment to 0.
- return the sentiment of ChatGPT-generated and human-generated answers as two columns of a DataFrame.
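
As a quick check of the formula, the sketch below scores a single tokenized answer against two tiny word lists. The helper `sentiment_score` and the lists `toy_pos` and `toy_neg` are illustrative assumptions; they are not the Canvas lexicons and not the required `compute_sentiment` function.

```python
def sentiment_score(tokens, pos_words, neg_words):
    """Return (#pos - #neg) / (#pos + #neg), or 0 if no sentiment word occurs."""
    n_pos = sum(t in pos_words for t in tokens)
    n_neg = sum(t in neg_words for t in tokens)
    total = n_pos + n_neg
    return 0 if total == 0 else (n_pos - n_neg) / total

# Hypothetical mini-lexicons, for illustration only
toy_pos = {"good", "great", "helpful"}
toy_neg = {"bad", "wrong"}

# One positive word and two negative words: (1 - 2) / (1 + 2)
print(sentiment_score(["a", "good", "but", "wrong", "and", "bad", "idea"], toy_pos, toy_neg))

# No sentiment words at all: the score falls back to 0
print(sentiment_score(["no", "sentiment", "here"], toy_pos, toy_neg))
```

Storing the word lists as sets keeps each membership test cheap, which matters once the real lexicons (thousands of words) are applied to every token of every answer.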


Analysis: 
- Try different tokenization parameter configurations (lemmatized, remove_stopword, remove_punct), and observe how sentiment results change.
- In general, which tokenization configuration do you think should be used? Why does this combination make the most sense?
- Overall, do you think ChatGPT-generated answers are more positive or more negative than human-generated ones? Use data to support your conclusion.


In [5]:
import pandas as pd

def compute_sentiment(gen_tokens, ref_tokens, pos, neg):
    # Convert the word lists to sets for fast membership tests
    pos_set, neg_set = set(pos), set(neg)

    # Compute the sentiment of a single tokenized answer
    def compute_single_sentiment(tokens):
        pos_count = sum(token in pos_set for token in tokens)
        neg_count = sum(token in neg_set for token in tokens)
        # No sentiment words found: the sentiment is defined as 0
        if pos_count + neg_count == 0:
            return 0
        return (pos_count - neg_count) / (pos_count + neg_count)

    # Compute sentiment for generated and reference tokens
    gen_sentiments = [compute_single_sentiment(tokens) for tokens in gen_tokens]
    ref_sentiments = [compute_single_sentiment(tokens) for tokens in ref_tokens]

    # Create DataFrame with results
    result = pd.DataFrame({'Generated Sentiment': gen_sentiments, 'Reference Sentiment': ref_sentiments})

    return result


In [45]:
gen_tokens = tokenize(data["chatgpt_answer"], lemmatized=False, remove_stopword=False, remove_punct = False)
ref_tokens = tokenize(data["human_answer"], lemmatized=False, remove_stopword=False, remove_punct = False)

In [46]:
pos = pd.read_csv("positive-words.txt", header = None)
pos.head()

neg = pd.read_csv("negative-words.txt", header = None)
neg.head()

Unnamed: 0,0
0,a+
1,abound
2,abounds
3,abundance
4,abundant


Unnamed: 0,0
0,2-faced
1,2-faces
2,abnormal
3,abolish
4,abominable


In [47]:
result = compute_sentiment(gen_tokens, 
                           ref_tokens, 
                           pos[0].values,
                           neg[0].values)
result.head()

Unnamed: 0,Generated Sentiment,Reference Sentiment
0,0.0,-0.5
1,-0.777778,0.076923
2,0.666667,0.2
3,1.0,0.2
4,0.6,-0.333333


In [48]:
from scipy.stats import wilcoxon

(result['Generated Sentiment'] - result['Reference Sentiment']).mean()

res = wilcoxon(result['Generated Sentiment'] - result['Reference Sentiment'], alternative='greater')
res.statistic, res.pvalue

0.14665715815970656

(10403.0, 0.0010660004805700114)

In [49]:
gen_tokens = tokenize(data["chatgpt_answer"], lemmatized=False, remove_stopword=True, remove_punct = False)
ref_tokens = tokenize(data["human_answer"], lemmatized=False, remove_stopword=True, remove_punct = False)

In [50]:
result = compute_sentiment(gen_tokens, 
                           ref_tokens, 
                           pos[0].values,
                           neg[0].values)
result.head()

Unnamed: 0,Generated Sentiment,Reference Sentiment
0,0.0,-0.5
1,-0.777778,0.0
2,0.666667,0.111111
3,1.0,0.2
4,0.6,-0.333333


In [51]:
from scipy.stats import wilcoxon

(result['Generated Sentiment'] - result['Reference Sentiment']).mean()

res = wilcoxon(result['Generated Sentiment'] - result['Reference Sentiment'], alternative='greater')
res.statistic, res.pvalue

0.1587041451334569

(10451.0, 0.0004926382674291639)

We observed that the positive and negative word lists already include inflected forms alongside base forms (e.g. `abound` and `abounds`), so lemmatization is not necessary here. We set `remove_stopword` to True because stop words carry no sentiment once the text is tokenized. We set `remove_punct` to False because some lexicon entries contain a hyphen (e.g. `2-faced`). With `lemmatized=False`, `remove_stopword=True`, and `remove_punct=False`, the mean paired difference (ChatGPT minus human) is about 0.159, and the one-sided Wilcoxon signed-rank test gives a p-value of about 0.0005, which is below conventional significance levels (e.g. 0.05). We can therefore reject the null hypothesis and conclude that there is a statistically significant difference between the two groups: ChatGPT-generated answers are more positive than human-generated ones, since their mean sentiment is higher.



## Q3: Performance Evaluation


Next, we evaluate how accurate the ChatGPT-generated answers are, compared to the human-generated answers. One widely used method is to calculate the `precision` and `recall` of n-grams. For simplicity, we only calculate bigrams here. You can try unigram, trigram, or n-grams in the same way.


Define a function `bigram_precision_recall(gen_tokens, ref_tokens)` as follows:
- take two parameters:
    - `gen_tokens` is the tokenized ChatGPT-generated answers by the `tokenize` function in Q1.
    - `ref_tokens` is the tokenized human answers by the `tokenize` function in Q1.
- generate bigrams from each tokenized document in `gen_tokens` and `ref_tokens`
- for each pair of ChatGPT-generated and human answers, find the overlapping bigrams between them
- compute `precision` as the number of overlapping bigrams divided by the total number of bigrams from the ChatGPT-generated answer. In other words, each generated bigram is treated as a predicted value, and `precision` measures the percentage of correctly generated bigrams out of all generated bigrams.
- compute `recall` as the number of overlapping bigrams divided by the total number of bigrams from the human answer. In other words, `recall` measures the percentage of bigrams from the human answer that are successfully retrieved.
- return the precision and recall for each pair of answers.
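
To make the two definitions concrete, here is a small worked example on one pair of token lists. It is a sketch under one assumption: bigrams are counted as unique pairs (sets), so repeated bigrams count once. The helper names `bigrams` and `toy_precision_recall` are illustrative and are not part of the assignment.

```python
def bigrams(tokens):
    """Return the set of adjacent token pairs in a token list."""
    return set(zip(tokens, tokens[1:]))

def toy_precision_recall(gen, ref):
    """Bigram precision and recall for one (generated, reference) pair."""
    gen_bi, ref_bi = bigrams(gen), bigrams(ref)
    overlap = gen_bi & ref_bi
    # Guard against documents with fewer than two tokens (no bigrams)
    precision = len(overlap) / len(gen_bi) if gen_bi else 0
    recall = len(overlap) / len(ref_bi) if ref_bi else 0
    return precision, recall

gen = ["the", "cat", "sat", "on", "the", "mat"]
ref = ["the", "cat", "lay", "on", "the", "mat", "today"]

# 3 shared bigrams; 5 generated bigrams and 6 reference bigrams
print(toy_precision_recall(gen, ref))  # (0.6, 0.5)
```

Because precision divides by the generated answer's bigrams and recall by the human answer's, a long generated answer tends to lower precision while a long human answer tends to lower recall.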


Analysis: 
- Try different tokenization parameter configurations (lemmatized, remove_stopword, remove_punct), and observe how precision and recall change.
- In general, which tokenization configuration do you think should be used? Why does this combination make the most sense?
- Overall, do you think ChatGPT is able to mimic humans in answering these questions?



In [55]:
from nltk import ngrams

def bigram_precision_recall(gen_tokens, ref_tokens):
    precision = []
    recall = []
    for gen_doc, ref_doc in zip(gen_tokens, ref_tokens):
        gen_bigrams = set(ngrams(gen_doc, 2))
        ref_bigrams = set(ngrams(ref_doc, 2))
        overlap = gen_bigrams.intersection(ref_bigrams)
        precision.append(len(overlap) / len(gen_bigrams) if len(gen_bigrams) > 0 else 0)
        recall.append(len(overlap) / len(ref_bigrams) if len(ref_bigrams) > 0 else 0)
    result = pd.DataFrame({'precision': precision, 'recall': recall})
    return result


In [56]:
gen_tokens = tokenize(data["chatgpt_answer"], lemmatized=False, remove_stopword=True, remove_punct = False)
ref_tokens = tokenize(data["human_answer"], lemmatized=False, remove_stopword=True, remove_punct = False)

In [57]:
result = bigram_precision_recall(gen_tokens, 
                                 ref_tokens)
result.head()

Unnamed: 0,precision,recall
0,0.0,0.0
1,0.035294,0.012048
2,0.015625,0.007143
3,0.0,0.0
4,0.018519,0.055556


In [58]:
result[["precision", "recall"]].mean(axis = 0)

precision    0.035239
recall       0.056134
dtype: float64

In [65]:
gen_tokens = tokenize(data["chatgpt_answer"], lemmatized=True, remove_stopword=True, remove_punct = False)
ref_tokens = tokenize(data["human_answer"], lemmatized=True, remove_stopword=True, remove_punct = False)

In [66]:
result = bigram_precision_recall(gen_tokens, 
                                 ref_tokens)
result.head()

Unnamed: 0,precision,recall
0,0.0,0.0
1,0.047619,0.016064
2,0.046875,0.021429
3,0.0,0.0
4,0.038835,0.111111


In [67]:
result[["precision", "recall"]].mean(axis = 0)

precision    0.040448
recall       0.062346
dtype: float64

We set `lemmatized` to True because it gives slightly better values: mean precision rises from about 0.035 to 0.040 and mean recall from about 0.056 to 0.062. Even so, both precision and recall remain low, so we conclude that ChatGPT is not able to mimic the human-generated answers, at least in terms of shared bigrams.

## Q4 Compute TF-IDF

Define a function `compute_tfidf(tokenized_docs)` as follows: 
- Take parameter `tokenized_docs`, i.e., a list of tokenized documents by `tokenize` function in Q1
- Calculate tf_idf weights as shown in lecture notes (Hint: feel free to reuse the code segment in Lecture Notes (II))
- Return the smoothed normalized `tf_idf` array, where each row stands for a document and each column denotes a word. 
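
One common reading of "smoothed normalized" tf-idf is a smoothed idf, `log((N + 1) / (df + 1)) + 1`, followed by L2 normalization of each row; this is the form scikit-learn's `TfidfTransformer` uses by default. Whether the lecture notes use exactly this form is an assumption, so treat the pure-Python sketch below as an illustration. The name `smoothed_tfidf` is hypothetical.

```python
import math

def smoothed_tfidf(tokenized_docs):
    """Return (vocabulary, rows); rows are L2-normalized smoothed tf-idf vectors."""
    vocab = sorted({w for doc in tokenized_docs for w in doc})
    index = {w: j for j, w in enumerate(vocab)}
    n_docs = len(tokenized_docs)

    # Document frequency: number of documents containing each word
    df = [0] * len(vocab)
    for doc in tokenized_docs:
        for w in set(doc):
            df[index[w]] += 1

    # Smoothed idf: log((N + 1) / (df + 1)) + 1
    idf = [math.log((n_docs + 1) / (d + 1)) + 1 for d in df]

    rows = []
    for doc in tokenized_docs:
        row = [0.0] * len(vocab)
        for w in doc:
            row[index[w]] += 1.0  # raw term count
        row = [tf * w_idf for tf, w_idf in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in row))
        # Leave all-zero rows (empty documents) as they are to avoid dividing by zero
        rows.append([v / norm for v in row] if norm > 0 else row)
    return vocab, rows

vocab, rows = smoothed_tfidf([["park", "ticket", "ticket"], ["ticket", "fine"], []])
print(vocab)
print(rows[0])
```

Words that do not occur in a document keep a weight of exactly 0 here, so documents with no shared vocabulary have a cosine similarity of 0. The dictionary `index` also avoids a linear `list.index` lookup for every token.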

In [68]:
import numpy as np

def compute_tfidf(tokenized_docs):
    # Step 1: Create the word set
    word_set = set()
    for doc in tokenized_docs:
        for word in doc:
            word_set.add(word)
    word_list = list(word_set)
    num_words = len(word_list)
    num_docs = len(tokenized_docs)

    # Step 2: Create the document-term matrix
    doc_term_mat = np.zeros((num_docs, num_words))
    for i, doc in enumerate(tokenized_docs):
        for word in doc:
            j = word_list.index(word)
            doc_term_mat[i, j] += 1

    # Step 3: Compute the term frequency and inverse document frequency
    tf_mat = np.zeros((num_docs, num_words))
    idf_vec = np.zeros(num_words)
    for j in range(num_words):
        idf_vec[j] = np.log(num_docs / np.count_nonzero(doc_term_mat[:, j]))
        for i in range(num_docs):
            tf_mat[i, j] = doc_term_mat[i, j] / np.sum(doc_term_mat[i, :])

    # Step 4: Compute the tf-idf matrix
    tfidf_mat = np.multiply(tf_mat, idf_vec)

    # Step 5: Smooth and normalize the tf-idf matrix
    smoothed_tf_idf = np.zeros((num_docs, num_words))
    for i in range(num_docs):
        tfidf_max = np.max(tfidf_mat[i, :])
        smoothed_tf_idf[i, :] = (0.5 + 0.5*tfidf_mat[i, :]/tfidf_max) / np.sqrt(np.sum(np.square(0.5 + 0.5*tfidf_mat[i, :]/tfidf_max)))

    return smoothed_tf_idf


Try different tokenization options to see how they affect the TF-IDF matrix:

In [69]:
# Test tfidf generation using questions

question_tokens = tokenize(data["question"], lemmatized=True, remove_stopword=False, remove_punct = True)
dtm = compute_tfidf(question_tokens)
print(f"1.lemmatized=True, remove_stopword=False, remove_punct = True\n \
Shape: {dtm.shape}\n")

question_tokens = tokenize(data["question"], lemmatized=True, remove_stopword=True, remove_punct = True)
dtm = compute_tfidf(question_tokens)
print(f"2.lemmatized=True, remove_stopword=True, remove_punct = True:\n \
Shape: {dtm.shape}\n")

question_tokens = tokenize(data["question"], lemmatized=False, remove_stopword=False, remove_punct = True)
dtm = compute_tfidf(question_tokens)
print(f"3.lemmatized=False, remove_stopword=False, remove_punct = True:\n \
Shape: {dtm.shape}\n")

question_tokens = tokenize(data["question"], lemmatized=False, remove_stopword=False, remove_punct = False)
dtm = compute_tfidf(question_tokens)
print(f"4.lemmatized=False, remove_stopword=False, remove_punct = False:\n \
Shape: {dtm.shape}\n")


1.lemmatized=True, remove_stopword=False, remove_punct = True
 Shape: (200, 1439)

2.lemmatized=True, remove_stopword=True, remove_punct = True:
 Shape: (200, 1272)

3.lemmatized=False, remove_stopword=False, remove_punct = True:
 Shape: (200, 1643)

4.lemmatized=False, remove_stopword=False, remove_punct = False:
 Shape: (200, 1665)



## Q5. Assess similarity. 


Define a function `assess_similarity(question_tokens, gen_tokens, ref_tokens)`  as follows: 
- Take three inputs:
   - `question_tokens`: tokenized questions by `tokenize` function in Q1
   - `gen_tokens`: tokenized ChatGPT-generated answers by `tokenize` function in Q1
   - `ref_tokens`: tokenized human answers by `tokenize` function in Q1
- Concatenate these three token lists into a single list to form a corpus
- Calculate the smoothed normalized tf_idf matrix for the concatenated list using the `compute_tfidf` function defined in Q4.
- Split the tf_idf matrix into sub-matrices corresponding to `question_tokens`, `gen_tokens`, and `ref_tokens` respectively
- For each question, find its similarities to the paired ChatGPT-generated answer and human answer.
- For each pair of ChatGPT-generated answer and human answer, find their similarity
- Print out the following:
    - the question which has the largest similarity to the ChatGPT-generated answer.
    - the question which has the largest similarity to the human answer.
    - the pair of ChatGPT-generated and human answers which have the largest similarity.
- Return a DataFrame with the three columns for the similarities among questions and answers.
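
The key step above is the pairing: question `i` is compared only with answer `i`, not with every answer. The pure-Python sketch below illustrates that idea on toy vectors. The helpers `cosine` and `paired_similarity` and the toy rows are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0

def paired_similarity(mat_a, mat_b):
    """Similarity of row i in mat_a to row i in mat_b, for every i."""
    return [cosine(u, v) for u, v in zip(mat_a, mat_b)]

# Toy tf-idf rows (hypothetical values): two questions and their paired answers
q_rows = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
a_rows = [[1.0, 1.0, 0.0], [0.0, 2.0, 2.0]]

sims = paired_similarity(q_rows, a_rows)
# Index of the question that is most similar to its own answer
best = max(range(len(sims)), key=sims.__getitem__)
print(sims, best)
```

If a full similarity matrix is computed instead, for example with `sklearn.metrics.pairwise.cosine_similarity(A, B)`, the paired similarities are its diagonal, whereas a row-wise maximum would compare each question with all answers.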



Analysis: 
- Try different tokenization parameter configurations (lemmatized, remove_stopword, remove_punct), and observe how similarities change.
- Based on similarity, do you think ChatGPT-generated answers are more relevant to questions than human answers?

In [126]:
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import numpy as np

def assess_similarity(question_tokens, gen_tokens, ref_tokens):
    # Concatenate the three token lists into a single corpus
    corpus = question_tokens + gen_tokens + ref_tokens
    
    # Calculate the smoothed normalized tf-idf matrix for the corpus
    tfidf_mat = compute_tfidf(corpus)
    
    # Split the tf-idf matrix into sub-matrices for questions, generated answers, and reference answers
    q_tfidf_mat = tfidf_mat[:len(question_tokens)]
    gen_tfidf_mat = tfidf_mat[len(question_tokens):len(question_tokens)+len(gen_tokens)]
    ref_tfidf_mat = tfidf_mat[len(question_tokens)+len(gen_tokens):]
    
    # Find the similarity between each question and its paired generated and reference answers
    q_gen_similarities = cosine_similarity(q_tfidf_mat, gen_tfidf_mat)
    q_ref_similarities = cosine_similarity(q_tfidf_mat, ref_tfidf_mat)
    
    # Find the similarity between each pair of generated and reference answers
    gen_ref_similarities = cosine_similarity(gen_tfidf_mat, ref_tfidf_mat)
    
    # Find the indices of the questions with highest similarity to generated and reference answers
    max_q_gen_pair = [question_tokens[i] for i in q_gen_similarities.argmax(axis=0)]
    max_q_ref_pair = [question_tokens[i] for i in q_ref_similarities.argmax(axis=0)]
    
    # Find the indices of the pair of generated and reference answers with highest similarity
    max_gen_ref_pair = gen_tokens[gen_ref_similarities.argmax(axis=None)//len(ref_tokens)], ref_tokens[gen_ref_similarities.argmax(axis=None)%len(ref_tokens)]
    
    # Create a DataFrame with the three columns for the similarities among each questions and answers
    similarities_df = pd.DataFrame({
        'question_to_generated_answer': q_gen_similarities.max(axis=1),
        'question_to_reference_answer': q_ref_similarities.max(axis=1),
        'generated_to_reference_answer': gen_ref_similarities.max(axis=1)
    })
    
    
    # Print out the results
    new_data = pd.concat([data, similarities_df], axis=1)
    print("Question with highest similarity to generated answer:") 
    print(new_data.iloc[new_data["question_to_generated_answer"].values.argmax(), 1])
    print("Question with the largest similarity to the human answer.:")
    print(new_data.iloc[new_data["question_to_reference_answer"].values.argmax(), 1])
    print("The pair of ChatGPT-generated and human answers which have the largest similarity.")
    print(new_data.iloc[new_data["generated_to_reference_answer"].values.argmax(), 1:3])

    
    return new_data, similarities_df


In [106]:
gen_tokens = tokenize(data["chatgpt_answer"], lemmatized=True, remove_stopword=True, remove_punct = True)
ref_tokens = tokenize(data["human_answer"], lemmatized=True, remove_stopword=True, remove_punct = True)
question_tokens = tokenize(data["question"], lemmatized=True, remove_stopword=True, remove_punct = True)


In [127]:
new_data, result = assess_similarity(question_tokens, gen_tokens, ref_tokens)
result.head()

Question with highest similarity to generated answer:
Sure! People's taste buds are all different, which means that some people might like different foods than other people. This is because everyone's taste buds process flavors differently. Some people might think a certain food tastes really good, while others might think it tastes bad. It's all just a matter of personal preference. So if you don't like a certain food, it doesn't mean there's something wrong with you. It just means that your taste buds don't enjoy that particular flavor as much as someone else's might.
Question with the largest similarity to the human answer.:
Customer lifetime value (CLV) is a measure of the total value that a customer will generate for a business over the course of their relationship with the company. It is an important concept in marketing and customer relationship management, as it helps businesses to understand the long-term value of their customers and to allocate resources accordingly.



To ca

Unnamed: 0,question_to_generated_answer,question_to_reference_answer,generated_to_reference_answer
0,0.999797,0.999612,0.999577
1,0.999864,0.999758,0.999695
2,0.999657,0.999557,0.999711
3,0.99971,0.999545,0.999696
4,0.999232,0.999156,0.999721


In [129]:
result.describe()

Unnamed: 0,question_to_generated_answer,question_to_reference_answer,generated_to_reference_answer
count,200.0,200.0,200.0
mean,0.999625,0.999571,0.999559
std,0.000171,0.00017,0.000193
min,0.998977,0.99898,0.998532
25%,0.999528,0.999467,0.999488
50%,0.999653,0.999607,0.999595
75%,0.999743,0.999701,0.999691
max,0.999943,0.999897,0.999868


Based on these similarities, we can say that ChatGPT answers are more relevant to the questions, since the mean of `question_to_generated_answer` (0.999625) is slightly greater than the mean of `question_to_reference_answer` (0.999571), although the gap is small.

## Q5 (Bonus): Further Analysis (Open question)


- Can you find at least three significant differences between ChatGPT-generated and human answers? Use data to support your answer.
- Based on these differences, are you able to design a classifier to identify ChatGPT generated answers? Implement your ideas using traditional machine learning models, such as SVM, decision trees.
