# Project 3: Domain Data Preparation

With the growth of large-scale natural language processing systems like ChatGPT, more and more groups are trying to build their own natural language processing systems for personal use. As a machine learning engineer, it is critical to not only be able to perform advanced modelling tasks, but to also critically assess the datasets that you use, especially if they rely on using data from several different domains interchangeably.

In this project, we will build our own COVID-19 dataset from three separate sources (Twitter/X data, news article data, and research paper data) and use data engineering to reduce the [Proxy A-Distance](https://papers.nips.cc/paper_files/paper/2006/hash/b1b0432ceafb0ce714426e9114852ac7-Abstract.html) between them. Reducing this distance should let systems trained on this data focus on its content, which we care about, rather than on its source, which we do not. In doing so, we will demonstrate the importance of feature extraction in natural language processing.
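As a rough illustration of the metric, the Proxy A-Distance is commonly estimated by training a classifier to tell two domains apart and plugging its held-out error into `2 * (1 - 2 * error)`. The sketch below only encodes that formula; the function name `proxy_a_distance` is hypothetical and is not part of the project code.

```python
def proxy_a_distance(domain_classifier_error):
    """Estimate the Proxy A-Distance from a domain classifier's held-out error rate."""
    # error = 0.5 (chance level) gives 0: the domains look alike to the classifier
    # error = 0.0 (perfect separation) gives 2: the domains are trivially distinguishable
    return 2.0 * (1.0 - 2.0 * domain_classifier_error)


chance_level = proxy_a_distance(0.5)
perfect_split = proxy_a_distance(0.0)
print(chance_level, perfect_split)
```

Under this reading, every preprocessing step in the project aims to push the domain classifier's error toward 0.5.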

### Package import

In [1]:
import pandas as pd
import matplotlib
import numpy as np
import requests
import json, collections, time, re, string, os
from datetime import datetime
from tqdm import tqdm

import nltk
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet

import bs4
from bs4 import BeautifulSoup

from pdfminer import high_level

In [2]:
# this cell has been tagged with excluded_from_script
# it will not be run by the autograder
%matplotlib inline

## Part A: Social Media Mining

While its popularity has been waning, Twitter/X has been one of the most popular social media platforms; according to [Statista](https://www.statista.com/statistics/272014/global-social-networks-ranked-by-number-of-users/), it has 556 million monthly active users. Because tweets/posts are public, they can give us important hour-by-hour information on how the country felt during lockdowns.

In this section, you will perform basic preprocessing and feature extraction on tweet data. As we discussed in the primer, the goal of natural language processing is to convert sequences of text into vectors that we can then use in any machine learning algorithm. Since we ultimately want all three data sources to be indistinguishable under our distance metric, our goal here is both to get a glimpse of the data and to remove any obvious differences that we see.

To begin, let's load the Twitter response dataset and look at a few samples:

In [3]:
import json
with open("twitter.txt",'r') as mf:
  tweet_data = json.load(mf)

list((tweet_data[0],tweet_data[1000],tweet_data[8200],tweet_data[9000]))

[{'text': 'Good info from @FlexJobs: Create your own hand sanitizer. The recipe: https://t.co/5GLdNUNmg6 #coronavirus #coronavirusoutbreak #coronaoutbreak',
  'lang': 'en',
  'id': 101376,
  'time': '2020-03-09'},
 {'text': 'Letâ€™s see, during a virus crisis  do I want the guy whoâ€™s going to battle for Medicare for all or the guy who goes meh adequate is fine??? #CoronavirusOutbreak #COVID19 #Bernie2020',
  'lang': 'en',
  'id': 76040,
  'time': '2020-03-09'},
 {'text': "Be carefull guy's and wish you all happy holi to you &amp; your family. :) \n#HappyHoli #CoronavirusOutbreak #à¤¹à¥‹à¤²à¥€ #à¤¹à¥‹à¤²à¤¿à¤•à¤¾_à¤¦à¤¹à¤¨ #BankLooteriBJP #Coronavid19 #marketcrash  #reliance #colours #KurkureWithSidNaaz #MondayMorning #MereAngneMein ##RangBarseWithSid #à¤¬à¥�à¤°à¤¾_à¤¨_à¤®à¤¾à¤¨à¥‹_à¤¹à¥‹à¤²à¥€_à¤¹à¥ˆ https://t.co/Rg2SpMNKZD",
  'lang': 'en',
  'id': 73533,
  'time': '2020-03-09'},
 {'text': 'Latest Update on #coronavirus ðŸŒ� Wide\n\n3/9/2020, 6:33:16 AM\n\nTotal #Confirmed Cases: 11

### Question 1: Process tweet data
Looking at some of the tweets above, we see that:
1. Some tweets contain Twitter-shortened URLs, for example `https://t.co/DzhsXPxUDa`. These are always in the form of `http://t.co/` or `https://t.co/` followed by 10 alphanumeric characters. These links should be removed, as they are unlikely to be in any other data set.
1. Some tweets contain emoticons such as `:)` or `<3`. The characters in these emoticons should be removed, as again they are unlikely to be in any other dataset.

Implement the function `process_tweet` that takes as input the text of a tweet, performs these two steps, removes the leading and trailing whitespace, and then returns the processed text.

**Notes**:
* You should remove URLs before removing emoticons.
* We have provided a list of emoticons for you in the variable `emoticons`. You can assume that only elements in this list are considered emoticons and need to be removed.
* There may be no space between a shortened URL and the next word. However, you can assume that there are always 10 alphanumeric characters after `http://t.co/` or `https://t.co/`.
* When you finish processing the text, remember to remove all leading and trailing whitespace using `.strip()`.
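To see why the order of the two steps matters, consider a shortened URL whose 10-character code happens to contain an emoticon such as `XD`. The sketch below uses a made-up URL and a two-element subset of the emoticon list purely for illustration; the helper names are hypothetical.

```python
import re

URL_PATTERN = re.compile(r"https?://t\.co/[a-zA-Z0-9]{10}")
SAMPLE_EMOTICONS = ["XD", ":)"]  # small illustrative subset of the full list


def urls_then_emoticons(text):
    # recommended order: the URL is still intact when the regex runs
    text = URL_PATTERN.sub(" ", text)
    for emoticon in SAMPLE_EMOTICONS:
        text = text.replace(emoticon, " ")
    return text.strip()


def emoticons_then_urls(text):
    # wrong order: removing "XD" first splits the URL code, so the regex no longer matches
    for emoticon in SAMPLE_EMOTICONS:
        text = text.replace(emoticon, " ")
    text = URL_PATTERN.sub(" ", text)
    return text.strip()


sample = "stay safe https://t.co/abcXDefghi :)"  # made-up URL
print(repr(urls_then_emoticons(sample)))  # 'stay safe'
print(repr(emoticons_then_urls(sample)))  # 'stay safe https://t.co/abc efghi'
```

With the wrong order, a fragment of the URL survives in the output, which is exactly the kind of source-specific residue this step is meant to remove.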

In [22]:
emoticons = [
    ':-)', ':)', ';)', ':o)', ':]', ':3', ':c)', ':>', '=]', '8)', '=)', ':}',
    ':^)', ':-D', ':D', '8-D', '8D', 'x-D', 'xD', 'X-D', 'XD', '=-D', '=D',
    '=-3', '=3', ':-))', ":'-)", ":')", ':*', ':^*', '>:P', ':-P', ':P', 'X-P',
    'x-p', 'xp', 'XP', ':-p', ':p', '=p', ':-b', ':b', '>:)', '>;)', '>:-)', '<3',
    ':L', ':-/', '>:/', ':S', '>:[', ':@', ':-(', ':[', ':-||', '=L', ':<',
    ':-[', ':-<', '=\\', 'b=/', '>:(', ':(', '>.<', ":'-(", ":'(", ':\\', ':-c',
    ':c', ':{', '>:\\', ';('
]


def process_tweet(tweet_text):
    """
    Remove Twitter-shortened URLs and emoticons from a single tweet

    args:
        tweet_text (str) : the text of a tweet

    return:
        str :  the processed tweet
    """
    # one pattern covers both http:// and https:// shortened links
    url_pattern = r"https?://t\.co/[a-zA-Z0-9]{10}"
    processed_tweet = re.sub(url_pattern, " ", tweet_text)

    for emoticon in emoticons:
        processed_tweet = processed_tweet.replace(emoticon, " ")

    processed_tweet = processed_tweet.strip()
    return processed_tweet

# do not modify this function
def process_tweet_data(tweet_texts):
    return [process_tweet(tweet_text['text']) for tweet_text in tweet_texts]

In [23]:
def test_process_tweet():
    assert process_tweet("It's a great day :D") == "It's a great day"
    assert process_tweet("<3hello") == "hello"
    assert process_tweet("goodX-Dday") == "good day"
    assert process_tweet("http://t.co/WJs5bmRthU,http://t.co/WJs5bmRthU,") == ", ,"
    assert process_tweet("hellohttp://t.co/WJs5bmRthUworld") == "hello world"
    assert process_tweet("http://taco/WJs5bmRthU") == "http://taco/WJs5bmRthU"
    assert process_tweet(
        'Protect your child from #CoronavirusOutbreak.\n\nhttps://t.co/qPREVvM2C5\n\n#CoronaVirusUpdate #COVID2019 #COVID #Coronavid19 #outbreak #Italy #COVIDãƒ¼19 #BeSafe #Containment #Homeschooling #DigitalTransformation #InternationalSchooling #virtualschool #OnlineNOW #edtech #technology'
    ) == "Protect your child from #CoronavirusOutbreak.\n\n \n\n#CoronaVirusUpdate #COVID2019 #COVID #Coronavid19 #outbreak #Italy #COVIDãƒ¼19 #BeSafe #Containment #Homeschooling #DigitalTransformation #InternationalSchooling #virtualschool #OnlineNOW #edtech #technology"
    print("All tests passed!")

test_process_tweet()

All tests passed!


## Part B: Process Web Data

Let's now move on to extracting text from web articles, using BeautifulSoup to parse HTML data. More specifically, we have collected news articles on the same topic, the coronavirus, from [Nature](https://www.nature.com/), and want to wrangle this data as well. Through this exercise, you will learn how to navigate the HTML structure of different webpages in order to get the desired information.

To begin, we have provided you with a helper function `retrieve_url` that takes as input the path to a locally saved copy of a webpage and creates a BeautifulSoup object from the corresponding page content.

In [24]:
def retrieve_url(local_file_location):
  with open(local_file_location, 'r', encoding='utf-8') as mf:
    html_content = mf.read()
  soup = BeautifulSoup(html_content, 'html.parser')
  return soup

### Question 2: Parsing a single article from Nature
Implement the function `parse_page_nature` that takes as input a path pointing to a text dump of a Nature news article, and returns a JSON dictionary with the following format:

```python
{
    'Title': 'When will the coronavirus outbreak peak?',  # str
    'Author': ['David Cyranoski'],  # list of author names, in the same order as they appear on the page
    'Published Date': '2020-04-21',  # str, yyyy-mm-dd
    'Summary': '.....',  # str, the summary div between the title and author fields, or an empty string if no summary is available
    'Content': ['.....'],  # list of str, the content paragraphs that follow the author fields, or an empty list if no content is available
}
```

The values of `Summary` and `Content` should be raw text that does not contain any HTML tags. For example, if the input HTML code is `"<p><b>Hello</b><a href="https://google.com">World</a></p>"` then the output `Content` should be `"Hello World"`.

In the local test we have provided the full reference JSON files for some article pages. If your dictionary does not match the reference JSON, you should print out both and do a careful comparison to see where the difference is.

**Notes**:
* Occasionally there are some "Related" blocks embedded in the article text (example [here](http://clouddatascience.blob.core.windows.net/m20-foundation-data-science/p3-domain-data-preparation/nature_related.png)). These are characterized by the attribute `data-label="Related"` and should **not** be included in the parsing result.
* The `Published Date` field should be the original article date, not the updated date. For example, the `Published Date` for [this article](https://www.nature.com/articles/d41586-020-00166-6) is 2020-01-22.
* Remember to call `strip()` on all values in the returned dictionary so that there is no leading or trailing space anywhere. If a content paragraph becomes empty after `strip()`, it should not be included. You do not need to apply any of the text processing steps from Part A.
* Do not parse information from the `meta` tags, as they are not robust. All required information can be found within `body`.
* If an article has no authors (e.g., https://www.nature.com/articles/d41586-020-00589-1), the Author field should be an empty list.
* For the Content list, only text contents that come from the `p` tags in the article body should be included. You can start by identifying a `div` that corresponds to the entire article body (looking at the CSS class names may be helpful). Note that if an image caption is the child of a `p` tag, its content should be included as well.
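For the `Published Date` field, the date shown in the page body may need to be converted to the `yyyy-mm-dd` format. The sketch below assumes, purely for illustration, that the on-page text looks like `22 January 2020`; check the actual HTML before relying on this format. The helper name `to_iso_date` is hypothetical.

```python
from datetime import datetime


def to_iso_date(date_text, input_format="%d %B %Y"):
    """Convert a human-readable date string to yyyy-mm-dd."""
    # input_format is an assumption about how the page displays its date
    parsed = datetime.strptime(date_text.strip(), input_format)
    return parsed.strftime("%Y-%m-%d")


iso_date = to_iso_date(" 22 January 2020 ")
print(iso_date)  # 2020-01-22
```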

In [58]:
def remove_unused_content(soup):
    for aside in soup.find_all(True, {"data-label" : "Related"}):
        aside.extract()


def parse_page_nature(url):
    """
    Parse a single Nature news article stored in a local HTML file

    args:
        url (str) : the path to the local HTML file of the article

    return:
        dict : the parsed information stored in JSON format, which includes:
            Title, Author, Published Date, Summary and Content
    """
    soup = retrieve_url(url)
    remove_unused_content(soup)
    result = {}

    result["Title"] = soup.find("h1", class_ = "c-article-magazine-title").get_text().strip()
    
    result["Content"] = []

    if soup.find("p", class_ = "article__teaser") is not None:
      for p in soup.find("p", class_ = "article__teaser").find_all("p"):
          for child in p.find_all():
              child.unwrap()
          content = p.get_text().strip()
          if len(content) > 0:
              result["Content"].append(content)

    body = soup.find("div", class_ = "c-article-body")
    if body is not None:
      for p in body.find_all("p"):
          for child in p.find_all():
              child.unwrap()
          content = p.get_text().strip()
          if len(content) > 0:
              result["Content"].append(content)
    
    # TODO: Add the Author field to the result dict.
    result["Author"] = [" ".join(author.get("content").split(", ")[::-1]) for author in soup.find_all("meta", attrs={"name": "dc.creator"})]

    # TODO: Add the Summary field to the result dict.
    result["Summary"] = soup.find("meta", attrs={"name": "description"}).get("content")

    # TODO: Add the Published Date field to the result dict.
    result["Published Date"] = soup.find("meta", attrs={"name": "dc.date"}).get("content")

    return result

In [59]:
def test_parse_page_nature():
    nature0 = parse_page_nature("html_data/nature_0.html")
    nature0_ref = json.load(open('local_test_refs/0_ref.txt'))
    print(nature0_ref)
    assert nature0 == nature0_ref

    nature1 = parse_page_nature("html_data/nature_1.html")
    nature1_ref = json.load(open('local_test_refs/1_ref.txt'))
    assert nature1 == nature1_ref

    nature2 = parse_page_nature("html_data/nature_2.html")
    nature2_ref = json.load(open('local_test_refs/2_ref.txt'))
    assert nature2 == nature2_ref

    nature3 = parse_page_nature("html_data/nature_3.html")
    nature3_ref = json.load(open('local_test_refs/3_ref.txt'))
    assert nature3 == nature3_ref


    nature4 = parse_page_nature("html_data/nature_4.html")
    nature4_ref = json.load(open('local_test_refs/4_ref.txt'))
    assert nature4 == nature4_ref

    print("All tests passed!")

test_parse_page_nature()

{'Title': 'China coronavirus: Six questions scientists are asking', 'Author': ['Ewen Callaway', 'David Cyranoski'], 'Summary': 'Researchers are racing to find out more about the epidemiology and genetic sequence of the coronavirus spreading in Asia and beyond.', 'Published Date': '2020-01-22', 'Content': []}
All tests passed!


### Question 3: Process news articles data
While the JSON data format we constructed earlier is useful for checking the correctness of our parsing, eventually we would like each article to be represented by just a string. For our purpose, we will define the string representation of an article as

`"<title> <summary> <content paragraph 1> <content paragraph 2> <content paragraph 3> ..."`

where there is a single space separating each field (note that the content paragraphs come from the `"Content"` field of an article json, which is a list of paragraph strings).

Implement the function `process_news_article` that takes as input a JSON dictionary resulting from parsing a Nature article, and converts the JSON to the above string format.
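As a small worked example of this format, the sketch below builds the string for a made-up article dictionary. The field values are invented for illustration and do not come from the dataset.

```python
# hypothetical article, used only to illustrate the target string format
example_article = {
    "Title": "A title",
    "Summary": "A summary.",
    "Content": ["First paragraph.", "Second paragraph."],
}

# title, then summary, then every content paragraph, separated by single spaces
fields = [example_article["Title"], example_article["Summary"], *example_article["Content"]]
article_string = " ".join(fields)
print(article_string)  # A title A summary. First paragraph. Second paragraph.
```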



In [75]:
def process_news_article(article_json):
    """
    Convert an article JSON into a single string of the form
    "<title> <summary> <content paragraph 1> <content paragraph 2> ..."

    args:
        article_json (Dict[str, Any]) : JSON content of a news article

    return:
        str : the string representation of the input article JSON
    """
    return f"{article_json['Title']} {article_json['Summary']} {' '.join(article_json['Content'])}"


# do not modify this function
def process_news_articles_data(article_jsons):
    return [process_news_article(article) for article in article_jsons]

In [76]:
def test_process_news_article():
    nature_article = json.load(open("local_test_refs/4_ref.txt", 'r', encoding='utf-8'))
    nature_article_processed = process_news_article(nature_article)
    nature_expected = open("local_test_refs/nature4_processed.txt").read()
    assert nature_article_processed == nature_expected

    print("All tests passed!")

test_process_news_article()

All tests passed!


## Part C: Mining PDF Data
Having extracted data from Twitter and news articles, we now turn to our third source: research papers. We have provided you with 15 PDF files, collected from the [arXiv API](https://arxiv.org/help/api). These are located in the `pdfs` directory and labeled from `arxiv_01.pdf` to `arxiv_15.pdf`.

### Question 4: Parse a single Arxiv research paper
Implement the function `parse_and_clean_pdf` that takes as input a PDF file path and outputs the cleaned text content of that file. In particular, you should remove all URLs, i.e., strings that start with "http://" or "https://".

**Notes**:
* For this question, you should use the function [`extract_text`](https://pdfminersix.readthedocs.io/en/latest/reference/highlevel.html#extract-text) from the `pdfminer` package to convert a pdf file to string.
* Unlike in the tweet scenario, there is no limit on the length of a URL in this case. The URL pattern you should use here is: a string that starts with `http://` or `https://`, followed by any number of non-whitespace characters. Do not make any other assumptions (for example, don't assume a URL always contains a `.`).
* We have provided a template helper function `remove_url_regex`, where you can enter the regex for removing URLs. There are some local test cases in `test_remove_url_regex` to help you validate your regex. If your regex passes these tests, you can use it in `parse_and_clean_pdf`.
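One consequence of the "any number of non-whitespace characters" rule is that punctuation glued to the end of a URL is removed along with it. The sketch below shows this on a made-up sentence; the pattern is one possible way to express the rule, not the only one.

```python
import re

# "http://" or "https://" followed by a run of non-whitespace characters
url_pattern = re.compile(r"https?://\S+")

text = "Code is available (see https://example.org/repo). Data follow."
cleaned = url_pattern.sub("", text)
print(repr(cleaned))  # 'Code is available (see  Data follow.'
```

Note that the closing parenthesis and the period disappear together with the URL, which is the behavior the specification above asks for.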

In [103]:
def remove_url_regex():
    # enter your regex for capturing url strings here
    # you can call this function in parse_and_clean_pdf()
    regex = r"(?:https?://\S+)"
    return regex

def parse_and_clean_pdf(file):
    """
    Convert an input pdf file into processed and cleaned raw text.

    args:
        file (str) : the pdf file path

    return:
        str: the cleaned version of the input file content
    """

    text = re.sub(remove_url_regex(), "", high_level.extract_text(file))
    return text


In [104]:
def test_remove_url_regex():
    s = "http://abc"
    assert re.sub(remove_url_regex(), '', s) == ''

    s = 'hellohttps://github.com/lanagarmire/COVID19-Drugs-LungInjury'
    assert re.sub(remove_url_regex(), '', s) == 'hello'

    s = 'http://example.com https://cmu.edu'
    assert re.sub(remove_url_regex(), '', s) == ' '

    s = 'https://www.'
    assert re.sub(remove_url_regex(), '', s) == ''
    print("All tests passed!")

test_remove_url_regex()

All tests passed!


In [94]:
def test_parse_and_clean_pdf():
    pdf_text = parse_and_clean_pdf("pdfs/arxiv_01.pdf")
    with open("local_test_refs/parsed_arxiv_01.txt") as outfile:
        assert pdf_text == outfile.read()
    print("All tests passed!")

test_parse_and_clean_pdf()

All tests passed!


### Question 5: Parse several Arxiv research papers
Implement the function `process_arxiv_data` that takes as input the path to a directory. This function parses and cleans all pdf files in that directory, then returns a list of strings, where each string results from parsing one PDF file.

**Hint**: You might find your solution to the previous question useful.

**Notes**:
* The pdf files should be processed based on the alphabetical order of their name, e.g., `arxiv_01.pdf` before `arxiv_02.pdf`.
* Do not assume that `os.listdir` will return the filenames in sorted order; you should perform the sorting yourself.
* Do not assume every file in the input directory is a pdf file; only those whose names end in `.pdf` should be parsed.
* If you fail the test case here, it is likely that your URL removal regex from Question 4 is incorrect. Try to come up with more test cases for your regex.
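The file selection described in these notes can be sketched on a temporary directory, without touching the real `pdfs` folder. The helper name `sorted_pdf_names` and the file names below are hypothetical.

```python
import os
import tempfile


def sorted_pdf_names(directory):
    """Return the names of the .pdf files in a directory, in alphabetical order."""
    # os.listdir makes no ordering guarantee, so sort explicitly
    return sorted(name for name in os.listdir(directory) if name.endswith(".pdf"))


with tempfile.TemporaryDirectory() as tmp_dir:
    # create empty placeholder files, including one that is not a pdf
    for name in ["arxiv_02.pdf", "notes.txt", "arxiv_01.pdf"]:
        open(os.path.join(tmp_dir, name), "w").close()
    names = sorted_pdf_names(tmp_dir)

print(names)  # ['arxiv_01.pdf', 'arxiv_02.pdf']
```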

In [110]:
def process_arxiv_data(directory, num_files=15):
    """
    Parse and clean the text content of all pdf papers in a given directory, in alphabetical order

    args:
        directory (str) : the relative file path to a directory that contains the pdf papers
        num_files (int) : Only return the first num_files files

    return:
        List[str] : a list of cleaned paper texts, one string per pdf file
    """
    # keep only pdf files, and sort explicitly since os.listdir gives no ordering guarantee
    pdf_names = sorted(name for name in os.listdir(directory) if name.endswith(".pdf"))
    return [parse_and_clean_pdf(os.path.join(directory, name)) for name in pdf_names[:num_files]]


In [111]:
def test_process_arxiv_data():
    paper_contents = process_arxiv_data("pdfs")
    sample_of_contents = [paper[0:100] for paper in paper_contents]

    assert len(paper_contents) == 15
    assert sample_of_contents == ['Repurposed drugs for treating lung injury in COVID-19 \n\nBing He1, Lana Garmire1* \n\n1. Department of ', '0\n2\n0\n2\n\nr\na\n\nM\n1\n3\n\n]\nE\nP\n.\no\ni\nb\n-\nq\n[\n\n1\nv\n4\n8\n2\n4\n1\n.\n3\n0\n0\n2\n:\nv\ni\nX\nr\na\n\nThe fractal time grow', 'Coronavirus and financial volatility: 40 days of fasting and fear \n\nClaudiu Tiberiu ALBULESCU1,2\uf02a \n\n', '0\n2\n0\n2\n\nr\na\n\nM\n8\n\n]\nE\nP\n.\no\ni\nb\n-\nq\n[\n\n1\nv\n5\n7\n7\n3\n0\n.\n3\n0\n0\n2\n:\nv\ni\nX\nr\na\n\nData Analysis for the C', 'How  many  infections  of  COVID-19  there  will  be  in  the  “Diamond  Princess”-\n\nPredicted by a ', 'Parametric analysis of early data on COVID-19 expansion in selected\nEuropean countries\n\naInstitute o', '0\n2\n0\n2\n\nr\na\n\nM\n5\n2\n\n]\n\nR\n\nI\n.\ns\nc\n[\n\n2\nv\n7\n0\n1\n0\n0\n.\n3\n0\n0\n2\n:\nv\ni\nX\nr\na\n\nViewing the Progression o', 'Insights from early mathematical models of 2019-nCoV acute respiratory disease (COVID-\n\nEarly models', '0\n2\n0\n2\n\nb\ne\nF\n4\n2\n\n]\nh\np\n-\nc\no\ns\n.\ns\nc\ni\ns\ny\nh\np\n[\n\n1\nv\n2\n0\n3\n0\n1\n.\n2\n0\n0\n2\n:\nv\ni\nX\nr\na\n\nThe Recons', 'COVID-19  Docking  Server:  An  interactive  server  for \n\ndocking  small  molecules,  peptides  and', '0\n2\n0\n2\n\nb\ne\nF\n2\n2\n\n]\nE\nP\n.\no\ni\nb\n-\nq\n[\n\n1\nv\n0\n4\n6\n9\n0\n.\n2\n0\n0\n2\n:\nv\ni\nX\nr\na\n\nThe Outbreak Evaluatio', '0\n2\n0\n2\n\nb\ne\nF\n5\n2\n\n]\nh\np\n-\nc\no\ns\n.\ns\nc\ni\ns\ny\nh\np\n[\n\n2\nv\n9\n9\n1\n9\n0\n.\n2\n0\n0\n2\n:\nv\ni\nX\nr\na\n\nScaling fe', '0\n2\n0\n2\n\nb\ne\nF\n2\n1\n\n]\n\nG\nL\n.\ns\nc\n[\n\n1\nv\n4\n3\n5\n5\n0\n.\n2\n0\n0\n2\n:\nv\ni\nX\nr\na\n\nABNORMAL RESPIRATORY PATTER', 'Trend and forecasting of the COVID-19 outbreak in\nChina\n\nQiang Li1 Wei Feng2\n\n1School of Physical sc', 'Deep  Learning  System  to  Screen  Coronavirus  Disease  2019 \n\nPneumonia \n\nXiaowei Xu1, MD; Xianga']
    assert sum(len(paper) for paper in paper_contents) == 368635
    print("All tests passed!")

test_process_arxiv_data()

All tests passed!


## Part D: Text Processing

Now that we've converted our initial set of structured data into pure text, let's process all of the text in the same way. Text data on the internet is very messy. Typically there is a fair amount of processing work to do once you have collected any sizeable chunk of text data, in order to have it ready for subsequent analyses. To get you familiar with this kind of data, this section will walk you through some common processing tasks.

The first step is to import the lemmatizer and set of English stopwords from `nltk`:

In [112]:
nltk.download("stopwords", quiet = True)
nltk.download("wordnet", quiet = True)
nltk.download("punkt", quiet = True)
nltk.download('averaged_perceptron_tagger', quiet = True)
lemmatizer = WordNetLemmatizer()
english_stopwords = set(nltk.corpus.stopwords.words('english'))

### Question 6: Text cleaning and tokenization
Implement the three functions `clean_text`, `tokenize` and `lemmatize` that perform the following text preprocessing tasks:

1. `clean_text` should:
    * convert the string to lower case.
    * remove any instance of `'s` that is either followed by any whitespace character, or at the end of the string: `teacher's help` becomes `teacher help`, and `children's` becomes `children`.
    * remove apostrophe character `'`: `don't` becomes `dont`. For simplicity we will only consider the character `'` as apostrophe (so `’` is not).
    * remove leading and trailing space.

1. `tokenize` should:
    * use `nltk.word_tokenize` to tokenize the input text.
    * further break tokens at characters which are not digits 0-9 and not present in `string.ascii_letters`. For example, `a_b_c` becomes `['a', 'b', 'c']`.
    * maintain the token order as it appears in the original string.

1. `lemmatize` should:
    * lemmatize each token individually.
    * remove tokens that are stopwords or contain fewer than two characters (these two cases should be checked after the lemmatization step).
    
**Notes**:
* When lemmatizing a word, you should also specify the part-of-speech `pos` parameter. This can be obtained by calling `nltk.pos_tag` and using the first returned tag (in case there are multiple possibilities). You can interpret the returned tag as follows:
    * If it starts with "J", it is an adjective.
    * If it starts with "V", it is a verb.
    * If it starts with "R", it is an adverb.
    * Otherwise, it is a noun.
* `nltk.pos_tag` should be called on each individual token, instead of on the entire tokenized text. For example, if the input string is `"learning is fun"`, you should call `nltk.pos_tag(["learning"])` to get the part-of-speech of `'learning'`, and input that to the lemmatizer. You may notice that in this case `"learning"` is classified as a verb (while it is a noun in the original sentence). However, this is not a problem, since our end goal is to reduce each token to its base form, not to correctly classify its part-of-speech.
* If you use the regex character set `\w`, note that it matches alphanumeric characters **and** the underscore character `_`.
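Two of the rules above are easy to get subtly wrong, so here is a short sketch using only the standard `re` module: a lookahead that removes `'s` only before whitespace or the end of the string, and the difference between splitting on `\W` and splitting on an explicit character class. The patterns are one possible formulation, shown for illustration.

```python
import re

# 's is removed only when followed by whitespace or the end of the string
possessive = re.compile(r"'s(?=\s|$)")
print(possessive.sub("", "teacher's help"))  # teacher help
print(possessive.sub("", "'shed'"))          # 'shed' (unchanged: the 's is followed by "h")

# \W treats "_" as a word character, so it does not split on underscores
strict_split = [t for t in re.split(r"[^a-zA-Z0-9]+", "hello_world") if t]
loose_split = [t for t in re.split(r"\W+", "hello_world") if t]
print(strict_split)  # ['hello', 'world']
print(loose_split)   # ['hello_world']
```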

In [259]:
def get_pos(w):
    tag = nltk.pos_tag(w)[0][1]
    if tag.startswith("J"):
        return wordnet.ADJ
    if tag.startswith("V"):
        return wordnet.VERB
    if tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

def clean_text(text):
    """
    Clean the input string by converting it to lowercase, removing 's and apostrophe.

    args:
        text (str) : the input text

    return:
        str : the cleaned text
    """
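    cleaned_text = text.lower()
    # remove possessive 's when followed by whitespace or the end of the string
    pattern = r"'s(?=\s|$)"
    cleaned_text = re.sub(pattern, "", cleaned_text)
    cleaned_text = cleaned_text.replace("'", "")
    # remove leading and trailing whitespace, as required by the specification
    return cleaned_text.strip()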

def tokenize(cleaned_text):
    """
    Tokenize the input string.

    args:
        cleaned_text (str): the input text, output from clean_text

    return:
        List[str] : a list of tokens from the input text
    """
    tokens = []
    for token in nltk.word_tokenize(cleaned_text):
        tokens.extend([word for word in re.split(r'[^a-zA-Z0-9]+', token) if word])
    return tokens

def lemmatize(tokens, stopwords = {}):
    """
    Lemmatize each token in an input list of tokens

    args:
        tokens (List[str]) : a list of token, output from tokenize

    kwargs:
        stopwords (Set[str]) : the set of stopwords to exclude

    return:
        List[str] : a list of lemmatized and filtered tokens
    """
    lemmatized_tokens = []
    for token in tokens:
        lemmatized_token = lemmatizer.lemmatize(token, get_pos([token]))
        if (len(lemmatized_token) > 1) and (lemmatized_token not in stopwords):
            lemmatized_tokens.append(lemmatized_token)
    return lemmatized_tokens

def preprocess_text(text, stopwords = {}):
    # do not modify this function
    cleaned_text = clean_text(text)
    tokens = tokenize(cleaned_text)
    return lemmatize(tokens, stopwords)

In [260]:
def test_preprocess_text():
    # cleaning
    assert clean_text("I like Data Science") == "i like data science"
    assert clean_text("She's") == "she"
    assert clean_text("you've")== "youve"
    assert clean_text("car, cars, car's cars'")== "car, cars, car cars"
    assert clean_text("'shed'") == "shed"
    assert clean_text("'good news'") == "good news"
    assert clean_text("CMU's campus")== "cmu campus"
    assert preprocess_text("abc 'system") == ['abc', 'system']
    assert preprocess_text("O'Shea Jackson Jr. is an American actor and musician") == ['oshea', 'jackson', 'jr', 'be', 'an', 'american', 'actor', 'and', 'musician']

    # tokenization
    assert tokenize("ab..ab. .ab . ab.") == ['ab', 'ab', 'ab', 'ab'], tokenize("ab..ab. .ab . ab.")
    assert tokenize("word-of-mouth hello,world")== ['word', 'of', 'mouth', 'hello', 'world']
    assert tokenize("gotta")== ['got', 'ta']
    assert tokenize("hello_world") == ["hello", "world"]
    assert preprocess_text("hope this👏will work") == ['hope', 'this', 'will', 'work']

    # lemmatization
    assert lemmatize(["cats"]) == ['cat']
    assert lemmatize(["did"]) == ['do']
    assert lemmatize(["learning", "is", "fun"], english_stopwords) == ["learn", "fun"]

    # miscellaneous
    assert preprocess_text("the weather is really nice", english_stopwords) == ['weather', 'really', 'nice']
    assert preprocess_text(
        "To apply SVM learning in partial discharge classification, data input is very important!?",
        english_stopwords
    ) == 'apply svm learn partial discharge classification data input important'.split()
    assert preprocess_text("after all he's done", english_stopwords) == []
    assert preprocess_text("they didn’t have much chance of guessing what it was without further clues.", english_stopwords) == ['much', 'chance', 'guess', 'without', 'far', 'clue']
    assert preprocess_text("DUQUE'S", english_stopwords) == ["duque"]
    assert preprocess_text("the 'rona", english_stopwords) == ['rona']
    assert preprocess_text('MOTORCYCLES DONT FLY', english_stopwords)==['motorcycle', 'dont', 'fly']
    assert preprocess_text('“ Georg e\”', english_stopwords) == ['georg']
    text = "Harry leapt into the air; he’d trodden on something big and squashy on the doormat — something alive"
    assert preprocess_text(text, english_stopwords) == ['harry', 'leapt', 'air', 'trodden', 'something', 'big', 'squashy', 'doormat', 'something', 'alive']
    assert preprocess_text("Donâ€™t want to add to TRUMPâ€™s #COVID19 numbers. #CoronaVirus ðŸ¦  donâ€™t care.", english_stopwords) == ['want', 'add', 'trump', 'covid19', 'number', 'coronavirus', 'care']

    # test on long text string
    with open("local_test_refs/henrys_letter.txt", encoding = "utf-8") as infile, open("local_test_refs/processed_henrys_letter.txt", encoding = "utf-8") as outfile:
        processed_str = preprocess_text(infile.read())
        reference_str = outfile.read().splitlines()
        assert processed_str == reference_str
    print("All tests passed!")

test_preprocess_text()

All tests passed!


You may notice that the lemmatization functionality isn't perfect; for example, it would map `"as"` to `"a"` because `"as"` is being treated as a noun instead of a preposition (with tag `"IN"`). In general, identifying the correct part-of-speech tag is very context-dependent (for example, `"back"` can be an adjective, adverb, verb or noun). In the context of this project, we will not dive deep into these linguistic nuances, and will settle for the lemmatization rules above.


## Part E: Data Visualization and Feature Construction
Now that we have collected text data from three different sources (Twitter, news articles and research papers), let's put them all together in order to perform some simple exploratory data analyses and feature construction. From now we will define a *document* as a list of tokens coming from a single tweet, news article or arxiv paper, and a *corpus* as a list of documents.




### Question 7: Word frequency and word cloud
With any text corpus, you will first want to check for the word frequency distribution, in particular which words are the most common and which are the least. The former group may consist of terms that are relevant to the topic, or terms that simply appear frequently in general (e.g., stopwords). The latter group may consist of highly specialized terms or typos. Since stopwords and rare words are not useful to our analysis, we will remove both (where we define rare words as words that only appear *once in the corpus*).

Implement the function `word_frequency` which takes as input a text corpus and returns a `collections.Counter` object mapping each word to its frequency in the corpus. However, rare words that only appear once in the entire corpus should **not** be included in this mapping.

**Notes**:
* Recall that `preprocess_text` already handles stopword removal, so you only need to remove rare words in this step.
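
To make the rare-word rule concrete, here is a minimal sketch on a made-up three-document corpus (the tokens and the names `toy_corpus`, `counts` and `frequent` are illustrative assumptions, not part of the project data):

```python
import collections

# Toy corpus (hypothetical data): each inner list is one processed document.
toy_corpus = [
    ["covid", "case", "rise"],
    ["covid", "vaccine", "case"],
    ["vaccine", "trial"],
]

# Count every token across all documents.
counts = collections.Counter(token for doc in toy_corpus for token in doc)

# Keep only the words that appear at least twice in the whole corpus.
frequent = collections.Counter({w: c for w, c in counts.items() if c > 1})

print(frequent)
```

Here `"rise"` and `"trial"` each occur once and are dropped, while `"covid"`, `"case"` and `"vaccine"` are kept with a count of 2 each.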

In [263]:
def word_frequency(corpus):
    """
    Count the word frequency in a given corpus

    args:
        corpus (List[List[str]]) : a nested list of tokens, where each inner list is a processed document

    return:
        collections.Counter : a mapping between each word and its frequency in the corpus, excluding words that
            only appear once
    """
    counter = collections.Counter()
    for processed_document in corpus:
        counter.update(processed_document)

    # drop rare words, i.e., words that appear only once in the corpus
    for word in list(counter):
        if counter[word] <= 1:
            del counter[word]

    return counter


In [264]:
def test_word_frequency():
    tweet_corpus = [preprocess_text(elem, english_stopwords) for elem in process_tweet_data(
         tweet_data
    )[:100]]
    counter = word_frequency(tweet_corpus)
    assert len(counter) == 230
    assert counter["coronavirus"] == 37
    assert counter["coronavirusoutbreak"] == 74
    assert counter.get("the",0) == 0
    assert counter["say"] == 4
    assert min(counter.values()) == 2
    print("All tests passed!")

test_word_frequency()

All tests passed!


Now we will gather all three corpora together; we store them in a global cache to avoid having to construct them more than once. If you make any code change above this point, rerun the following cell to reset the cache. Note that this will take around 10 minutes to run, and that we've purposefully excluded the cell that runs this from the autograder.

**IMPORTANT NOTE**: For grading the functions after this point, we will cache 'local_corpus_store.pkl', which is just a simple file that contains all of the data processed through the functions you've written already. If you change anything above, you will need to re-run this cell in order to ensure that grading works.

In [265]:
corpuses = None

def get_corpuses():
    global corpuses
    if corpuses is None:
        # only load and preprocess the raw data when the cache is empty
        with open("twitter.txt", 'r') as mf:
            tweet_data = json.load(mf)
        twitter_corpus = [preprocess_text(elem, english_stopwords) for elem in process_tweet_data(
            tqdm(tweet_data[:250])
        )]
        news_corpus = [preprocess_text(elem, english_stopwords) for elem in tqdm(process_news_articles_data(
            [parse_page_nature(f'html_data/nature_{digit}.html') for digit in range(200)]))]
        arxiv_corpus = [preprocess_text(elem, english_stopwords) for elem in tqdm(process_arxiv_data("pdfs", num_files=50))]
        corpuses = (twitter_corpus, news_corpus, arxiv_corpus)
    return corpuses


In [266]:
with open("local_corpus_store.pkl", "wb") as myfile:
    import pickle
    pickle.dump(get_corpuses(), myfile)


100%|██████████| 250/250 [00:00<00:00, 62788.98it/s]
100%|██████████| 200/200 [00:09<00:00, 20.42it/s]
100%|██████████| 50/50 [00:46<00:00,  1.07it/s]


Let's first compare the frequency of a number of keywords across these three corpuses.

In [None]:
# this cell has been tagged with excluded_from_script
# it will not be run by the autograder
def get_word_frequency_across_corpuses(input_words):
    twitter_corpus, news_corpus, arxiv_corpus = get_corpuses()
    twitter_corpus_size = sum(len(d) for d in twitter_corpus)
    news_corpus_size = sum(len(d) for d in news_corpus)
    arxiv_corpus_size = sum(len(d) for d in arxiv_corpus)
    twitter_f, news_f, arxiv_f = word_frequency(twitter_corpus), word_frequency(news_corpus), word_frequency(arxiv_corpus)
    return pd.DataFrame({
        "Proportion in twitter corpus" : [twitter_f.get(word, 0) / twitter_corpus_size for word in input_words],
        "Proportion in news corpus" : [news_f.get(word, 0) / news_corpus_size for word in input_words],
        "Proportion in arxiv corpus" : [arxiv_f.get(word, 0) / arxiv_corpus_size for word in input_words]
    }, index = input_words)

df_frequency = get_word_frequency_across_corpuses([
    "coronavirus", "covid", "case", "health", "model", "say", "test",
    "2020", "19", "people", "vaccine"
])

display(df_frequency)

df_frequency.plot(kind='bar')

We see that there are differences across datasets in the relative frequency of each term. "Coronavirus" is used most frequently in tweets, "say" most frequently in the news corpus, and, perhaps unsurprisingly, "model" most frequently in arXiv papers. The more technical term "covid" isn't used as much in news articles, but is about equally popular in tweets and arXiv papers. On the other hand, "health" sees its most frequent usage in news articles, likely due to health advice-related articles. Feel free to edit the word list above and see what other insights you can derive!

We now move to the last step of data collection and preparation: constructing input features to be used for more formal analyses and language modeling. As language modeling will be covered later in the course, here we will only cover two simple feature construction methods: term frequency (TF) and term frequency - inverse document frequency (TF-IDF), and then use them for our initial task.

### Feature construction: term frequency (TF)
Implement the function `construct_tf_matrix` that takes as input a corpus and outputs a matrix $TF$ where each row corresponds to one document, and each column corresponds to one of the unique words in the entire corpus. $TF_{ij}$ is the number of times word $j$ appears in document $i$. Similar to the previous question, rare words that only appear once in the entire corpus should be removed, i.e., there should be no columns for those words.

**Notes**:
* The rows should be ordered based on the document ordering in the corpus. Row 0 corresponds to `corpus[0]`, row 1 to `corpus[1]`, and so on.
* The columns should be ordered based on the alphabetical order of their corresponding words. Column 0 corresponds to the alphabetically first word in the corpus, column 1 to the alphabetically second word, and so on.
* To ensure code efficiency, avoid using too many loops. Take advantage of Pandas and Numpy functionalities.
* We expect the returned matrix to have dtype `int64`. Using `.astype(np.int64)` will help here.
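
As an illustration of the row and column conventions in the notes above, the sketch below builds a TF matrix for a made-up corpus. The tokens and the names `toy_corpus`, `df` and `toy_tf` are assumptions chosen for the example:

```python
import collections

import numpy as np
import pandas as pd

# Toy corpus (hypothetical data): three short processed documents.
toy_corpus = [
    ["covid", "case", "covid"],
    ["vaccine", "case"],
    ["covid", "trial"],
]

# One Counter per document; the DataFrame aligns them into one column per word.
df = pd.DataFrame([collections.Counter(doc) for doc in toy_corpus]).fillna(0)

# Drop words whose total count in the corpus is 1, then sort columns alphabetically.
df = df.loc[:, df.sum(axis=0) > 1].sort_index(axis=1)
toy_tf = df.to_numpy(dtype=np.int64)

print(list(df.columns))
print(toy_tf)
```

Only `"case"` and `"covid"` occur more than once in this corpus, so the matrix has two columns in that order, and the first row is `[1, 2]` because `"covid"` appears twice in the first document.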

In [None]:
def construct_tf_matrix(corpus):
    """
    Construct a term frequency matrix from an input corpus

    args:
        corpus (List[List[str]]) : a nested list of word tokens, where each inner list is a document

    return:
        np.array[n_documents, n_words] : the term frequency matrix
    """
    counters = [collections.Counter(doc) for doc in corpus]
    df = pd.DataFrame(counters).fillna(0)
    return df.loc[:, df.sum(axis = 0) > 1].sort_index(axis = 1).to_numpy(dtype = np.int64)

### Feature construction: term frequency - inverse document frequency (TF-IDF)
We can now compute the TF-IDF matrix, which scales the columns of the term frequency matrix by their inverse document frequency. Recall that the inverse document frequency of a word $j$ is computed as
$$\text{IDF}_j = \log \left( \frac{\# \text{ of documents}}{\# \text{ of documents with word } j} \right),$$
and so the $\text{TF-IDF}_{ij}$ entry in the tf-idf matrix is computed as
$$\text{TF-IDF}_{ij} = \text{TF}_{ij} \times \text{IDF}_j.$$

Implement the function `construct_tf_idf_matrix` which takes as input a TF matrix and outputs the corresponding TF-IDF matrix.
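
The following sketch works through the two formulas above on a small, made-up TF matrix, assuming the natural logarithm. The matrix values and the names `toy_tf`, `doc_freq`, `idf` and `toy_tf_idf` are illustrative only:

```python
import numpy as np

# Hypothetical TF matrix: 3 documents (rows) x 2 words (columns).
# Word 0 appears in every document; word 1 appears in two of the three.
toy_tf = np.array([[1, 2],
                   [1, 0],
                   [3, 1]], dtype=np.int64)

n_documents = toy_tf.shape[0]

# Number of documents in which each word appears at least once.
doc_freq = np.count_nonzero(toy_tf, axis=0)

# IDF_j = log(n_documents / doc_freq_j), using the natural log.
idf = np.log(n_documents / doc_freq)

# Scale each column of the TF matrix by its IDF (broadcast across rows).
toy_tf_idf = toy_tf * idf

print(idf)
print(toy_tf_idf)
```

Note that a word present in every document gets an IDF of $\log(1) = 0$, so its whole column is zeroed out, while the second word is scaled by $\log(3/2)$.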

In [None]:
def construct_tf_idf_matrix(tf_matrix):
    """
    Compute the term frequency - inverse document frequency in a corpus

    args:
        tf_matrix (np.array[n_documents, n_words]) : the term frequency matrix of the corpus

    return:
        np.array[n_documents, n_words] : the tf-idf matrix
    """
    idf_matrix = np.log(tf_matrix.shape[0] / np.count_nonzero(tf_matrix, axis=0))
    return tf_matrix.astype(np.float64) * idf_matrix

## Dataset Similarity Comparison

Now that we have two separate feature construction pipelines, let's evaluate how similar our three datasets look through the lens of each pipeline. To do so, we shall implement a **very** simple model-based metric called the Proxy A-Distance (PAD).

The idea behind PAD is very simple:

1. Train a classification model to predict which dataset each example came from, given its features.
2. Compute the classification error **e** of that model on a held-out test set.
3. Report **2\*(1-2e)** as the metric itself.
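
To get a feel for the range of this metric, here is a tiny sketch of the last step only. The helper name `pad_from_error` is a hypothetical name used just for this illustration:

```python
def pad_from_error(error):
    """Map a classification error rate e in [0, 1] to the PAD value 2 * (1 - 2e)."""
    return 2 * (1 - 2 * error)

print(pad_from_error(0.5))  # classifier at chance level
print(pad_from_error(0.0))  # classifier that never makes a mistake
```

A classifier that does no better than chance (`e = 0.5`) gives a PAD of 0, meaning the two datasets are indistinguishable under the chosen features, while a perfect classifier (`e = 0`) gives the maximum PAD of 2.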

For our classification model, we shall have you implement a simple logistic regression based classifier. It is important to note that, traditionally, this approach uses a different model called a "Support Vector Machine" instead.

### Logistic Regression-Based Classification

Recall from the primer that logistic regression assumes the following hypothesis function:
$$h_\theta(x) = \sigma(b + \theta^T x)$$
where $\sigma(z) = (1+e^{-z})^{-1}$ is the sigmoid function.

With this hypothesis function, input data $X \in \mathbb{R}^{n \times d}$ and output labels $Y \in \{0,1\}^{n}$, logistic regression attempts to minimize the loss function
$$\mathcal{L}(\theta, b) = -\frac{1}{2n} \left[{\sum_{i=1}^{n}} y^{(i)} \log(h_\theta(x^{(i)})) + (1-y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] + \frac{\lambda}{2n} \|\theta\|_2^2,$$
where $\theta \in \mathbb{R}^{d}$ is the weight, $b$ is the intercept, and $\lambda \ge 0$ is the regularization parameter.

This optimization can be carried out by gradient descent. Given a learning rate $\alpha$, batch gradient descent for training logistic regression consists of two steps:

1. Initialize $b = 0$ and $\theta$ as a vector of 0s.
2. Repeat `n_iters` times:

\begin{align}
    b & := b - \alpha \cdot  \frac{1}{2n} \sum_{i=1}^n \left(h_\theta(x^{(i)}) - y^{(i)} \right), \\
    \theta & := \theta - \alpha \cdot \frac{1}{2n} \cdot \left[\sum_{i=1}^n (h_\theta(x^{(i)}) - y^{(i)})x^{(i)} + 2\lambda \theta \right]
\end{align}

After training, we can predict the label for a new data point $x$ as
$$\hat y = \mathbb{1}\left(h_\theta(x) \ge \frac 1 2 \right) = \mathbb{1}(b + \theta^T x \ge 0).$$
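
As a sanity check on the update rules, the sketch below carries out a single gradient descent step by hand with NumPy on a small five-point toy dataset, assuming $\lambda = 10^{-4}$ and $\alpha = 1$. It is a standalone illustration of the formulas, not the class you are asked to implement:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Small toy dataset: 5 points with 2 features each, and binary labels.
X = np.array([[-2, 4], [4, 1], [1, 6], [2, 4], [6, 2]], dtype=np.float64)
y = np.array([0, 0, 1, 1, 1])
n = len(y)
lam, alpha = 1e-4, 1.0  # assumed regularization strength and learning rate

# Step 1: initialize the intercept and weights to zero.
theta, b = np.zeros(X.shape[1]), 0.0

# Step 2 (one iteration): evaluate h, then apply both updates using the same h.
h = sigmoid(b + X @ theta)  # every entry is 0.5 at initialization
grad_b = (h - y).sum() / (2 * n)
grad_theta = ((h - y) @ X + 2 * lam * theta) / (2 * n)
b = b - alpha * grad_b
theta = theta - alpha * grad_theta

# Prediction rule: label 1 whenever b + theta^T x >= 0.
y_hat = (b + X @ theta >= 0).astype(int)

print(theta, b, y_hat)
```

After this single step, $\theta = (0.35, 0.35)$ and $b = 0.05$, and every point is still predicted as class 1; more iterations are needed before the two classes are separated.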

<hr>

Implement the class `LRClassifier` with 6 methods -- `__init__`, `loss`, `fit`, `get_params`, `decision_function` and `predict` -- to perform the above tasks. You can create instance variables as you see fit.

**Notes**:
* A `LRClassifier` instance may be created once and then trained on several datasets. Therefore, you should initialize $b$ and $\theta$ inside `.fit`, not in `__init__`.

In [None]:
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

class LRClassifier:
    def __init__(self, lam):
        """
        Class constructor.

        args:
            lam (float) : the regularizer value
        """
        self.lam = lam

    def loss(self, h, y):
        """
        Compute the average loss L(theta, b) based on the provided formula.

        args:
            h (np.array[n_samples]) : a vector of hypothesis function values on every input data point,
                this is the output of self.decision_function(X)
            y (np.array[n_samples]) : the output label vector, containing 0 and 1

        return:
            np.float64 : the average loss value
        """
        log_loss = np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h)) / 2
        regularizer = self.lam*(self.theta**2).sum()/(2*len(y))
        return log_loss + regularizer

    def fit(self, X, y, n_iters = 100, alpha = 1):
        """
        Train the model weights and intercept term using batch gradient descent.

        args:
            X (np.array[n_samples, n_dimensions]) : the input data matrix
            y (np.array[n_samples]) : the output label vector, containing 0 and 1

        kwargs:
            n_iters (int) : the number of iterations to train for
            alpha (float) : the learning rate

        return:
            List[np.float64] : a list of length (n_iters + 1) that contains the loss value
                before training and after each training iteration
        """
        self.theta, self.b = np.zeros(X.shape[1]), 0
        n = len(y)
        losses = []
        for _ in range(n_iters):
            h = self.decision_function(X)
            losses.append(self.loss(h, y))
            theta_gradient = 1/(2*n) * ((h - y) @ X) + self.lam*self.theta/n
            b_gradient = 1/2 * (h - y).mean()
            self.theta -= alpha * theta_gradient
            self.b -= alpha * b_gradient
        h = self.decision_function(X)
        losses.append(self.loss(h, y))
        return losses

    def get_params(self):
        """
        Get the model weights and intercept term.

        return:
            Tuple(theta, b):
                theta (np.array[n_dimensions]) : the weight vector
                b (np.float64) : the intercept term
        """
        return self.theta, self.b

    def decision_function(self, X):
        """
        Compute the hypothesis function values on every input data point.

        args:
            X (np.array[n_samples, n_dimensions]) : the input data matrix

        return:
            np.array[n_samples] : a vector of hypothesis function values on every input data point
        """
        return sigmoid(self.b + X @ self.theta)

    def predict(self, X):
        """
        Predict the label of every input data point.

        args:
            X (np.array[n_samples, n_dimensions]) : the input data matrix

        return:
            np.array[n_samples] : a vector of predicted output labels for every input data point
        """
        return np.where(self.b + X @ self.theta >= 0, 1, 0)

def binary_lr_classifier(lam = 1e-4):
    return LRClassifier(lam)

In [None]:
def test_binary_lr_classifier():
    X = np.array([[-2, 4], [4, 1], [1, 6], [2, 4], [6, 2]])
    y = np.array([0, 0, 1, 1, 1])
    lr = binary_lr_classifier(lam = 1e-4)

    # before gradient descent
    losses = lr.fit(X, y, n_iters = 0)
    theta, b = lr.get_params()
    assert np.allclose(theta, [0, 0])
    assert b == 0
    assert np.allclose(losses[-1], 0.34657359027997264)
    assert np.allclose(lr.decision_function(X), [0.5] * 5)
    assert list(lr.predict(X)) == [1] * len(y)

    # 1st iteration
    losses = lr.fit(X, y, n_iters = 1)
    theta, b = lr.get_params()
    assert np.allclose(theta, [0.35, 0.35])
    assert b == 0.05
    assert np.allclose(losses[-1], 0.33351806318231178)
    assert np.allclose(lr.decision_function(X), [0.6791786991753931, 0.8581489350995123, 0.9241418199787566, 0.8956687768809987, 0.9453186827840592])
    assert list(lr.predict(X)) == [1] * len(y)

    # 2 iterations
    losses = lr.fit(X, y, n_iters = 2)
    theta, b = lr.get_params()
    assert np.allclose(theta, [0.20383002, 0.09069029])
    assert np.allclose(b, -0.080246)
    assert np.allclose(losses[-1], 0.28778446849618766)
    assert np.allclose(lr.decision_function(X), [0.4687546229122032, 0.6954586477733905, 0.6609937974719609, 0.6660059656189242, 0.7898655199585818])
    assert list(lr.predict(X)) == [0, 1, 1, 1, 1]

    # 1000 iterations
    losses = lr.fit(X, y, n_iters = 1000)
    theta, b = lr.get_params()
    assert np.allclose(theta, [1.62475335, 2.97699553])
    assert np.allclose(b, -12.016701793625622)
    assert np.allclose(losses[-1], 0.0178892651602277)
    assert np.allclose(lr.decision_function(X), [0.0336268115487116, 0.07305423924580728, 0.9994304104089492, 0.9585441655688948, 0.9755365947084815])
    assert list(lr.predict(X)) == [0, 0, 1, 1, 1]

    print("All tests passed!")

test_binary_lr_classifier()

## Proxy A-Distance Calculation

Now to pull it all together. Let's implement a function ``calculate_pad_distance`` that does the following:

1. Given a matrix of features ``train_X`` and a vector of binary labels ``train_y`` (0 or 1, indicating which dataset each row came from), train a logistic regression classifier on that data, setting lambda to be `1e-4`.
2. Using a test matrix of features ``test_X`` and a vector of test labels ``test_y``, calculate the classification error **e**. If the classification error is greater than 0.5, let the classification error be **1 - e** instead, as we can always flip our classifier's predictions.
3. Return **2\*(1-2e)**.



In [None]:
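def calculate_pad_distance(train_X, train_y, test_X, test_y):
    """
    Compute the Proxy A-Distance using the LRClassifier

    args:
        train_X (np.array[n_train, n_dimensions]) : the training feature matrix
        train_y (np.array[n_train]) : the training labels, containing 0 and 1
        test_X (np.array[n_test, n_dimensions]) : the test feature matrix
        test_y (np.array[n_test]) : the test labels, containing 0 and 1

    return:
        float : the Proxy A-Distance, training on train_X and train_y and testing on test_X and test_y
    """
    model = LRClassifier(1e-4)
    model.fit(train_X, train_y)
    # classification error on the test set
    error = np.mean(model.predict(test_X) != test_y)
    # an error above 0.5 means the flipped classifier would do better
    error = min(error, 1 - error)
    return 2 * (1 - 2 * error)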

With that computed, let's now calculate the PAD between all three of our data sources.

In this case, we shall evaluate the PAD for each pair of datasets using just term frequency:

In [None]:
tf = construct_tf_matrix(corpuses[0] + corpuses[1] + corpuses[2])

np.seterr(all='ignore')

train_X = np.concatenate([tf[:200, :],tf[200:350,:]])
train_y = np.array([0]*200 + [1]*150)
test_X = np.concatenate([tf[200:250, :],tf[350:400,:]])
test_y = np.array([0]*50 + [1]*50)
print("PAD using TF between Twitter and News:", calculate_pad_distance(train_X, train_y, test_X, test_y))

train_X = np.concatenate([tf[:200, :],tf[400:450,:]])
train_y = np.array([0]*200 + [1]*50)
test_X = np.concatenate([tf[200:250, :],tf[450:500,:]])
test_y = np.array([0]*50 + [1]*50)
print("PAD using TF between Twitter and Arxiv:", calculate_pad_distance(train_X, train_y, test_X, test_y))

train_X = np.concatenate([tf[200:350,:],tf[400:450,:]])
train_y = np.array([0]*150 + [1]*50)
test_X = np.concatenate([tf[350:400,:],tf[450:500,:]])
test_y = np.array([0]*50 + [1]*50)
print("PAD using TF between News and Arxiv:", calculate_pad_distance(train_X, train_y, test_X, test_y))


The PAD between Twitter and News is ``1.44``, the PAD between Twitter and Arxiv is ``1.64``, and the PAD between News and Arxiv is ``0.16``.

Intuitively, this makes a lot of sense: News and Arxiv documents are both long-form text, so they resemble each other far more than either resembles Twitter/X posts, which are limited in length.

If we use tf-idf, however, we get a different picture:

In [None]:
tf_idf = construct_tf_idf_matrix(construct_tf_matrix(corpuses[0] + corpuses[1] + corpuses[2]))

np.seterr(all='ignore')

train_X = np.concatenate([tf_idf[:200, :],tf_idf[200:350,:]])
train_y = np.array([0]*200 + [1]*150)
test_X = np.concatenate([tf_idf[200:250, :],tf_idf[350:400,:]])
test_y = np.array([0]*50 + [1]*50)
print("PAD using TF-IDF between Twitter and News:", calculate_pad_distance(train_X, train_y, test_X, test_y))

train_X = np.concatenate([tf_idf[:200, :],tf_idf[400:450,:]])
train_y = np.array([0]*200 + [1]*50)
test_X = np.concatenate([tf_idf[200:250, :],tf_idf[450:500,:]])
test_y = np.array([0]*50 + [1]*50)
print("PAD using TF-IDF between Twitter and Arxiv:", calculate_pad_distance(train_X, train_y, test_X, test_y))

train_X = np.concatenate([tf_idf[200:350,:],tf_idf[400:450,:]])
train_y = np.array([0]*150 + [1]*50)
test_X = np.concatenate([tf_idf[350:400,:],tf_idf[450:500,:]])
test_y = np.array([0]*50 + [1]*50)
print("PAD using TF-IDF between News and Arxiv:", calculate_pad_distance(train_X, train_y, test_X, test_y))


This time, the PAD between Twitter and News is ``0.08``, the PAD between Twitter and Arxiv is ``0.76``, and the PAD between News and Arxiv is ``0.28``.

As we can see, the PAD is on average smaller when using TF-IDF than when using TF, although the PAD between News and Arxiv rises slightly (from ``0.16`` to ``0.28``). This suggests that it is harder to distinguish between the three data sources using TF-IDF features, so TF-IDF may be the better choice when we want a downstream model to perform similarly across all three data sources.