# HW 1: ChatGPT vs. Wikipedia

In this homework assignment, you will study the statistical characteristics of ChatGPT-generated text by conducting a _keyword analysis_ ([Hofland and Johansson, 1989](https://www.abebooks.com/first-edition/Frequency-Analysis-English-Vocabulary-Grammar-Based/32179496983/bd) and others) of a ChatGPT-generated corpus, using an analogous corpus of Wikipedia articles as a reference. When we conduct a keyword analysis, we try to identify token types (i.e., lexical items) that distinguish a _target text_ from a _reference text_. In class, we saw that "delve" was a keyword for ChatGPT-generated text; in this assignment, you will systematically identify other keywords that distinguish [Singh et al.'s (2024)](https://www.sciencedirect.com/science/article/pii/S294971912300047X) ChatGPT corpus from their Wikipedia corpus.

## Important: Read Before Starting

In the following exercises, you will need to implement (i.e., write code for) functions defined in the `hw1.py` file. **Please write all your code in this file.** You should not submit this notebook with your solutions, and we will not grade it if you do. Please be aware that code written in a Jupyter notebook may run differently when copied into a `.py` file.

The outputs shown in this notebook are the outputs that you should get **when all problems have been completed correctly**. You may obtain different results if you attempt to run the code cells before you have completed the problem set, or if you have completed one or more problems incorrectly. **Obtaining the outputs shown in this notebook does not guarantee that your code is correct.**

To begin, please run the following import statements.

In [1]:
import importlib
import pickle

from nltk.probability import FreqDist
from nltk.text import Text

## Problem 1: Python Exercises (8 Points in Total + 1 Point Extra Credit)

In these exercises, you will learn and practice Python concepts needed in order to conduct the keyword analysis of the ChatGPT corpus in Problem 2.

### Problem 1a: Understand Modules (Written, 1 Point)

The file `hw1.py` contains a collection of function definitions, most of which are incomplete. For this assignment, you will _implement_ (i.e., write the code for) the incomplete functions in this file.

The code contained in a `.py` file is called a _module_. Modules can be imported, just like packages, by using an `import` statement followed by the module's filename, excluding the `.py` suffix.

In [2]:
# Import the code for this assignment
import hw1

What does the following code do?

In [None]:
hw1.hello_world()

If you wanted to call the function `foo_bar` defined in the `hw1` module (i.e., the `hw1.py` file), how would you do so?

### Problem 1b: Understand Module Reloading (Written, 1 Point)

Please edit the function `my_name` in the `hw1` module by replacing the `_` with your name. Once you have done so, please save your changes, and then call the `my_name` function using the code cell below. What happens? Are your changes reflected when the `my_name` function is called again?

In [3]:
hw1.my_name()

Hello world!
My name is _.


If you wanted to call the edited version of your `my_name` function from this notebook without restarting it, how would you do so?

**Hint:** Please read this [Stack Overflow post](https://stackoverflow.com/questions/1254370/reimport-a-module-while-interactive), and then examine the `import` statements at the top of this notebook.

### Problem 1c: Understand Type Hints (Written, 2 Points)

The function `add_str` takes two strings, both representing numbers, and returns the two numbers' sum as a string. For example:

In [4]:
hw1.add_str("1", "1")

'2.0'

In the definition for `add_str`, the two parameters `a` and `b` are followed by `: str`, and there is a `-> str` just before the `:`. These pieces of code are called _type hints_. What are type hints? Are they requirements, or just suggestions? What would the definition for the `my_name` function look like if you were to add type hints to it?

**Hints:** 
- Read this [tutorial on type hints](https://pyrefly.org/en/docs/python-typing-for-beginners/).
- Try running the following code:

In [None]:
# Code that ignores type hints
hw1.add_str(1, 1)

**Important:** For all coding problems on all assignments, you are expected to comply with the type hints provided to you in order to receive full credit.

### Problem 1d: Understand Dict Methods (Written, 1 Point)

A `dict` is a data type that is like a `list`, except that each item has a _key_ (i.e., a name) rather than a position. For example:

In [5]:
# Create a dict
d = {"a": 1, "b": 2, "c": 3}

# Get item "b"
print(d["b"])

# Modify item "c"
d["c"] = 5
print(d)

# Add a new item with key "d"
d["d"] = 10
print(d)

2
{'a': 1, 'b': 2, 'c': 5}
{'a': 1, 'b': 2, 'c': 5, 'd': 10}


Each `dict` comes with the functions `keys`, `values`, and `items`. Functions that come with an object of a certain type are called _methods_. What do these methods do? 

In [6]:
print(d.keys())
print(d.values())
print(d.items())

dict_keys(['a', 'b', 'c', 'd'])
dict_values([1, 2, 5, 10])
dict_items([('a', 1), ('b', 2), ('c', 5), ('d', 10)])


What happens if you cast a `dict` to a `list` or a `set`?

In [7]:
list(d)

['a', 'b', 'c', 'd']

### Problem 1e: Practice Dict Comprehension (Code, 1 Point)

Recall from class, and from [Chapter 1, Subsection 4.2 of the textbook](https://www.nltk.org/book/ch01.html), that we can build `list`s and `set`s from other `list`s and `set`s using _comprehension_. For example:

In [8]:
x = [1, 1, 2, 3, 5]
{i + 1 for i in x}  # The set containing 1 plus each item in x

{2, 3, 4, 6}

Comprehension can also be applied to `dict`s. For example:

In [9]:
d = {"a": 1, "b": 2, "c": 3}
{k: v + 1 for k, v in d.items()}  # The dict mapping each key in d to its value plus 1

{'a': 2, 'b': 3, 'c': 4}

Using comprehension, please implement the function `swap_keys_values` in the `hw1` module, which takes a `dict` as input and returns a new `dict` that maps each of the values in the original `dict` to its corresponding key. For example:

In [10]:
e = hw1.swap_keys_values(d)
print(e)
print(e[1])

{1: 'a', 2: 'b', 3: 'c'}
a


### Problem 1f: Understand FreqDists (Written, 1 Point)

NLTK has a data type called `FreqDist`. Please read [Chapter 1, Section 3 of the textbook](https://www.nltk.org/book/ch01.html#sec-computing-with-language-simple-statistics) to learn about `FreqDist`s.

Just as a `Text` is really just a special kind of `list`, a `FreqDist` is really just a special kind of `dict`. A `FreqDist` is designed to map token types to the number of times they occur within a corpus. For example:

In [11]:
# Create a dict with counts for a hypothetical corpus
counts = {"the": 60,   # "the" occurs 60 times in this corpus
          "of": 30,    # "of"  occurs 30 times in this corpus
          "and": 20}   # "and" occurs 20 times in this corpus

# Turn the dict into a FreqDist
FreqDist(counts)

FreqDist({'the': 60, 'of': 30, 'and': 20})

A `FreqDist` can be _constructed_ (i.e., created) from a corpus, represented as a `list` or `Text`:

In [12]:
corpus = ["The", "cat", "on", "the", "boat", "smiled", "at", "the", "dog", "."]
FreqDist(corpus)

FreqDist({'the': 2, 'The': 1, 'cat': 1, 'on': 1, 'boat': 1, 'smiled': 1, 'at': 1, 'dog': 1, '.': 1})

A `FreqDist` can also be obtained by calling the `vocab` method of a `Text`:

In [13]:
Text(corpus).vocab()

FreqDist({'the': 2, 'The': 1, 'cat': 1, 'on': 1, 'boat': 1, 'smiled': 1, 'at': 1, 'dog': 1, '.': 1})

What happens if you add two `FreqDist`s? For example:

In [14]:
dist1 = FreqDist({"the": 60, "of": 30, "and": 20})
dist2 = FreqDist({"and": 1, "in": 3})
dist1 + dist2

FreqDist({'the': 60, 'of': 30, 'and': 21, 'in': 3})

### Problem 1g: Unintended Uses of FreqDists (Written, 1 Point)

Because `FreqDist`s are intended to contain counts of words within a corpus, most of the `FreqDist` methods are written under the assumption that a `FreqDist`'s values will always be `int`s. However, we don't necessarily have to use `FreqDist`s for their intended purpose. Let's say you wanted a `FreqDist` to contain `float` values instead of `int` values. Is this allowed?

**Hint:** Try creating a new `FreqDist` or modifying an existing one.

In [None]:
# Feel free to write code to help you answer Problem 1g.

### Problem 1h: Understand Mutable Objects (Written, 1 Point Extra Credit) 

In class, we saw that if you define a variable `y` to be equal to another variable `x`, then assigning a new value to `x` afterwards doesn't affect `y`:

In [15]:
x = 1
y = x
x = 2
print(f"x={x}, y={y}")

x=2, y=1


Now, try running the following code.

In [None]:
x = [1, 2, 3]
y = [x, [4, 5, 6], x]
print(f"x={x}, y={y}")

x[0] = 7  # Change the value of x
print(f"x={x}, y={y}")

Why does changing `x` change the value of `y`?

**Hints:** 
- Try reading about [mutable vs. immutable data types](https://realpython.com/python-mutable-vs-immutable-types/).
- Compare the above code with this code:

In [None]:
x = [1, 2, 3]
y = [x, [4, 5, 6], x]
print(f"x={x}, y={y}")

x = [7, 2, 3]  # Change the value of x
print(f"x={x}, y={y}")

**Note:** Read this [tutorial](https://www.w3schools.com/python/python_string_formatting.asp) if you're curious about these fancy `print` statements.
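
For a quick feel of how these `print` statements work, here is a minimal sketch of f-string formatting. The variable names and values are made up for illustration and are not part of the assignment.

```python
# Illustrative values only; these are not taken from either corpus.
token = "cat"
count = 3
share = 1 / 3

# Expressions inside {} are evaluated and inserted into the string.
# The part after the colon is a format specifier (.3f = three decimal places).
message = f"The token type \"{token}\" occurs {count} time(s), share = {share:.3f}"
print(message)
```

The text outside the braces is printed as written, so `\"` still produces a literal quotation mark, just as in the cells that report counts for "delve".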

## Problem 2: Keyword Analysis of the ChatGPT Corpus (12 Points in Total)

In this problem, you will implement and execute a keyword analysis of the ChatGPT corpus. To do so, we need to understand exactly what it means for a token type to be a "keyword." In class, we saw that "delve" is a keyword for the ChatGPT corpus, and we justified this by observing that "delve" is used somewhat often in the ChatGPT corpus, but almost never in the Wikipedia corpus.

In [16]:
# Load the Wikipedia vs. ChatGPT corpus
with open("wiki_vs_chatgpt_articles.p", "rb") as f:
    _dataset = pickle.load(f)
    wiki_articles = _dataset["wiki"]
    chatgpt_articles = _dataset["chatgpt"]

In [17]:
# Compare the counts of "delve" in our two corpora
t = "delve"
print(f"The token type \"{t}\" occurs...\n"
      f"\t{chatgpt_articles.count(t)} time(s) in the ChatGPT corpus and\n"
      f"\t{wiki_articles.count(t)} time(s) in the Wikipedia corpus.")

The token type "delve" occurs...
	68 time(s) in the ChatGPT corpus and
	1 time(s) in the Wikipedia corpus.


The idea that you will pursue in this problem is that we can measure the _keyness_ of a token type within a corpus by comparing its frequency in that corpus with its frequency in a _reference corpus_. Once you have a method for measuring the keyness of a token type, you can identify keywords automatically by finding the token types with the highest keyness values.

### Problem 2a: Understand Frequency Ratios (Written, 1 Point)

To calculate keyness, we will use _frequency ratios_ as a _keyness metric_ [(Kilgarriff, 2009)](https://www.sketchengine.eu/wp-content/uploads/2015/04/2009-Simple-maths-for-keywords.pdf). The _relative frequency_ of a token type $w$ in a corpus $c$ is defined as:
$$\texttt{rel\_freq}(w, c) = \frac{c\texttt{.count}(w)}{\texttt{len}(c)}$$

The _frequency ratio_ of a token type $w$ in _target corpus_ $c_1$ relative to _reference corpus_ $c_2$ is defined as:
$$\texttt{freq\_ratio}(w, c_1, c_2) = \frac{\texttt{rel\_freq}(w, c_1)}{\texttt{rel\_freq}(w, c_2)}$$
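
As a minimal sketch, the two definitions above can be worked through on a pair of tiny toy corpora stored as plain `list`s. The helper names `toy_rel_freq` and `toy_freq_ratio` and both toy corpora are hypothetical; they are not part of `hw1.py`, and the functions you implement later operate on `FreqDist`s rather than lists.

```python
def toy_rel_freq(w, c):
    """Relative frequency of token type w in corpus c (hypothetical helper)."""
    return c.count(w) / len(c)


def toy_freq_ratio(w, c1, c2):
    """Frequency ratio of w in target c1 relative to reference c2 (no smoothing)."""
    return toy_rel_freq(w, c1) / toy_rel_freq(w, c2)


# Toy corpora, made up for illustration
target = ["the", "cat", "saw", "the", "dog"]  # 5 tokens
reference = ["the", "bird", "sang", "a", "song",
             "and", "the", "cat", "slept", "."]  # 10 tokens

# "cat" occurs once in each corpus, but the target is half as long,
# so its relative frequency there is twice as high.
print(toy_rel_freq("cat", target))               # 1 / 5
print(toy_rel_freq("cat", reference))            # 1 / 10
print(toy_freq_ratio("cat", target, reference))  # (1 / 5) / (1 / 10)
```

A ratio above 1 means the token type is relatively more frequent in the target corpus; a ratio below 1 means it is relatively more frequent in the reference corpus.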

What is the frequency ratio of "delve" in the ChatGPT corpus relative to the Wikipedia corpus?

In [None]:
# Feel free to write code to help you answer Problem 2a.

### Problem 2b: Determine Applicability of Frequency Ratios (Written, 2 Points)

[Kilgarriff (2009, pp. 1–2)](https://www.sketchengine.eu/wp-content/uploads/2015/04/2009-Simple-maths-for-keywords.pdf) points out four potential problems with using frequency ratios as a keyness metric. The first two problems relate to whether frequency ratios are an appropriate keyness metric for the target and reference corpora.

Please read Kilgarriff's paper, at least until you have read the first two of his four potential problems. Do you think frequency ratios are applicable to the ChatGPT and Wikipedia corpora? That is, do you think Kilgarriff's first and second problems are relevant to this assignment? Why or why not?

### Problem 2c: Implement Smoothing (Code, 2 Points)

Kilgarriff's third problem is that if $w$ does not appear in the reference corpus (i.e., if $c_2\texttt{.count}(w) = 0$), then the formula for $\texttt{freq\_ratio}(w, c_1, c_2)$ will require you to divide by zero. To mitigate this problem, you will implement a popular tactic known as [_add-$k$ smoothing_](https://en.wikipedia.org/wiki/Additive_smoothing). When we apply add-$k$ smoothing to a corpus, we pretend that the corpus contains $k$ occurrences of each token type, in addition to what is actually in the text. The formula for relative frequency with add-$k$ smoothing is therefore:
$$\texttt{rel\_freq}_k(w, c) = \frac{c\texttt{.count}(w) + k}{\texttt{len}(c) + k\cdot\texttt{vocab\_size}(c)}$$
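where $\texttt{vocab\_size}(c) = \texttt{len}(\texttt{set}(c))$. Note that in the examples below, `smooth` adds $k$ to the count of every token type in the vocabulary passed to it (here, the joint vocabulary of both corpora), so the vocabulary size in the denominator is the size of that vocabulary.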
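
To make the formula concrete, here is a minimal sketch that applies it literally to a single toy corpus stored as a plain `list`, with $\texttt{vocab\_size}(c) = \texttt{len}(\texttt{set}(c))$. The helper name `toy_rel_freq_k` and the toy corpus are hypothetical; this is not the `smooth` function you are asked to write, which works on `FreqDist`s.

```python
def toy_rel_freq_k(w, c, k=1):
    """Relative frequency of w in c with add-k smoothing (hypothetical helper)."""
    vocab_size = len(set(c))  # number of distinct token types in c
    return (c.count(w) + k) / (len(c) + k * vocab_size)


# Toy corpus, made up for illustration: 5 tokens, 4 token types
corpus = ["the", "cat", "saw", "the", "dog"]

print(toy_rel_freq_k("the", corpus))         # (2 + 1) / (5 + 1 * 4)
print(toy_rel_freq_k("bird", corpus))        # (0 + 1) / (5 + 1 * 4): unseen, but nonzero
print(toy_rel_freq_k("bird", corpus, k=0))   # k = 0 recovers the unsmoothed value, 0.0
```

Because every smoothed relative frequency is greater than zero when $k > 0$, it can safely appear in the denominator of a frequency ratio.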

For this problem, please implement the functions `joint_vocab` and `smooth`. The `joint_vocab` function takes two `FreqDist`s and returns a set containing all the token types represented across both `FreqDist`s. For example:

In [18]:
dist1 = FreqDist({"the": 60, "of": 30, "and": 20})
dist2 = FreqDist({"and": 1, "in": 3})
hw1.joint_vocab(dist1, dist2)

{'and', 'in', 'of', 'the'}

The `smooth` function takes a `FreqDist` and a vocabulary (a set of token types) and applies add-$k$ smoothing to the `FreqDist`. For example:

In [19]:
# Applying add-1 smoothing
vocab = hw1.joint_vocab(dist1, dist2)
hw1.smooth(dist2, vocab)

FreqDist({'in': 4, 'and': 2, 'of': 1, 'the': 1})

`smooth` has an optional parameter `k`, with a default value of `1`, which represents the number of extra occurrences that will be added to the count of each token type in the `FreqDist`.

In [20]:
# Applying add-10 smoothing
hw1.smooth(dist2, vocab, k=10)

FreqDist({'in': 13, 'and': 11, 'of': 10, 'the': 10})

### Problem 2d: Normalize Distributions (Code, 1 Point)

If `c` is a `Text`, then `c.vocab()` returns the _absolute frequency distribution_ of `c`; that is, `c.vocab()` maps each token type in `c` to its count in `c`. For example:

In [21]:
sent = ["The", "cat", "on", "the", "boat", "smiled", "at", "the", "dog", "."]
Text(sent).vocab()

FreqDist({'the': 2, 'The': 1, 'cat': 1, 'on': 1, 'boat': 1, 'smiled': 1, 'at': 1, 'dog': 1, '.': 1})

Please implement the function `normalize`, which converts an absolute frequency distribution into a _relative frequency distribution_ that maps each token type to its relative frequency in `c`. For example:

In [22]:
hw1.normalize(Text(sent).vocab())

FreqDist({'the': 0.2, 'The': 0.1, 'cat': 0.1, 'on': 0.1, 'boat': 0.1, 'smiled': 0.1, 'at': 0.1, 'dog': 0.1, '.': 0.1})

### Problem 2e: Calculate Frequency Ratios (Code, 2 Points)

Finally, please implement the function `freq_ratio`, which takes counts for a target corpus and a reference corpus and computes the frequency ratio for each token type in the joint vocabulary of both corpora with add-$k$ smoothing. For example:

In [23]:
target_dist = FreqDist({"the": 60, "of": 30, "and": 20})
ref_dist = FreqDist({"and": 1, "in": 3})
hw1.freq_ratio(target_dist, ref_dist)

FreqDist({'the': 4.280701754385965, 'of': 2.175438596491228, 'and': 0.7368421052631579, 'in': 0.017543859649122806})

`freq_ratio` has an optional parameter `k`, which determines the number of extra occurrences added to counts during add-$k$ smoothing.

In [24]:
hw1.freq_ratio(target_dist, ref_dist, k=10)

FreqDist({'the': 2.0533333333333332, 'of': 1.1733333333333333, 'and': 0.8, 'in': 0.22564102564102562})

**Hint:** Your code should call the other functions you implemented during the previous parts of this exercise.

### Problem 2f: Identify ChatGPT Keywords (No Submission, 0 Points)

Now, let's use your code to identify keywords from the ChatGPT corpus.

In [25]:
freq_ratios = hw1.freq_ratio(chatgpt_articles.vocab(), wiki_articles.vocab())
freq_ratios.most_common(15)

[('stunning', 366.93441316302506),
 ('reminder', 254.4058933706665),
 ('breathtaking', 228.30163648567634),
 ('must-watch', 225.64696629398244),
 ('must-visit', 195.56070412145147),
 ('fascinating', 170.1201147843848),
 ('must-see', 160.16510156553264),
 ('resilience', 146.00686054316512),
 ('preparedness', 142.68852280354773),
 ('unforgettable', 141.58241022367525),
 ('well-maintained', 139.8126300958793),
 ('delves', 138.0428499680834),
 ('teamwork', 119.460158626226),
 ('thrilling', 114.00333656552185),
 ('delicious', 102.64724741216456)]

If you run the above code cell after completing Problems 2c–e and get the same output, that is a good sign that your code is correct!

### Problem 2g: Identify Wikipedia Keywords (Code and Written, 2 Points)

To identify keywords from the Wikipedia corpus, we want to retrieve the token types with the _smallest_ frequency ratios. To do so, please implement the function `least_common`, which retrieves the items in a `FreqDist` with the smallest values. For example:

In [26]:
d = FreqDist({"the": 60, "of": 30, "and": 20, "in": 10, "nltk": 0})
hw1.least_common(d, 3)

[('nltk', 0), ('in', 10), ('and', 20)]

What are the top five keywords for the Wikipedia corpus (i.e., the five token types with the smallest frequency ratios, in ascending order of frequency ratio)?

**Hint:** Try googling `how to sort a dict in python`.

### Problem 2h: Limitations of Frequency Ratios (Written, 2 Points)

According to [Kilgarriff's](https://www.sketchengine.eu/wp-content/uploads/2015/04/2009-Simple-maths-for-keywords.pdf) (2009) fourth problem, what is a limitation of using frequency ratios as a keyness metric? What solution does Kilgarriff propose for this problem?

How do the keywords identified for the ChatGPT corpus change if you use add-1000 smoothing instead of add-1 smoothing? Are they less rare in the Wikipedia corpus? Support your answer by including a table of the following format. The first two columns have been filled in for you; please fill in the analogous information for the remaining two columns.

|   | Keyword (k = 1) | Count in Wiki Corpus | Keyword (k = 1000) | Count in Wiki Corpus |
|---|-----------------|----------------------|--------------------|----------------------|
| 1 | stunning     | 2 | | |
| 2 | reminder     | 3 | | |
| 3 | breathtaking | 0 | | |
| 4 | must-watch   | 0 | | |
| 5 | must-visit   | 0 | | |
| | **Average** | **1** | **Average** | |

In [None]:
# Feel free to write code to help you answer Problem 2h.