# Data Science | Lab: Natural Language Processing
**Table of Contents:**  <a name="toc"></a>
1. [Accessing Websites](#crawling)
2. [Vectorizing Text](#vect)
3. [Document Classification](#classification)

## Accessing Websites <a name="crawling"></a>

For this lab, we will rely on two basic libraries for accessing websites, [requests](https://docs.python-requests.org/en/latest/) and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), which together support all steps involved in collecting website data:
- **Crawler** Finding and downloading web pages automatically is called crawling. A program that downloads pages is called a web crawler.
- **Parser** Interpreting and reconstructing the structure of a website is called parsing.
- **Scraper** Extracting the content of a website (text only) is called scraping.

In addition, [boilerpy3](https://pypi.org/project/boilerpy3/), a Python port of boilerpipe, can be used to automatically remove boilerplate such as navigation bars and footers.

<div class="alert alert-block alert-info">
    <b>Politeness</b><br>
    Be polite when crawling web pages! Do not fetch more than one page at a time from a particular web server. Wait in between requests. Respect the <tt>robots.txt</tt> file if provided.
</div>
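
The politeness rules above can be sketched with the standard library alone. The following is a minimal illustration, not a complete crawler: the `robots.txt` content, the `example.org` URLs, and the helper names `make_parser` and `polite_urls` are made up for this example, and the rules are parsed from a string so that the sketch runs offline.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from memory so the sketch runs offline
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def make_parser(robots_txt):
    """Build a robots.txt parser from the raw file content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_urls(parser, urls, user_agent="*", delay=None):
    """Yield only the URLs that robots.txt allows, pausing before each one."""
    if delay is None:
        # Use the Crawl-delay from robots.txt, or one second if none is given
        delay = parser.crawl_delay(user_agent) or 1.0
    for url in urls:
        if not parser.can_fetch(user_agent, url):
            continue  # disallowed by robots.txt, so skip it
        time.sleep(delay)  # wait in between requests
        yield url
```

In a real crawl, the loop body would call `requests.get(url)` for each yielded URL, one page at a time.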

[Back to top](#toc)

In [None]:
#!pip install beautifulsoup4
#!pip install requests

import requests  # crawling
from bs4 import BeautifulSoup  # parsing, scraping
import pandas as pd
import time

# Suppress pandas FutureWarnings (e.g. the deprecation warning that pandas 1.4+ emits for `DataFrame.append()`)
import warnings
warnings.filterwarnings("ignore", category=FutureWarning) 

### Constructing the News Article Dataset
The following code reads and stores all links to articles categorized under the topics *artificial intelligence* and *neuroscience* on the [Wired UK](https://www.wired.co.uk) website. ``build_dataset`` takes a ``seed_url`` (a topic overview page) as its argument and returns a DataFrame with the topic, link, and header of each article.

In [None]:
def build_dataset(seed_url):
    # Download the topic overview page and parse its HTML
    data = requests.get(seed_url)
    soup = BeautifulSoup(data.content.decode('utf-8', 'ignore'), 'html.parser')
    # Headline links carry one of these two CSS classes
    candidates = soup.find_all("a", class_=["summary-item-tracking__hed-link", "summary-item__hed-link"])
    topic = seed_url.split('/')[-1]  # last URL segment, e.g. 'neuroscience'
    rows = []
    for article in candidates:
        header = article.get_text().strip()
        print(header)
        rows.append({'topic': topic, 'link': article['href'], 'header': header})
    # DataFrame.append() was removed in pandas 2.0, so collect the rows in a list first
    return pd.DataFrame(rows, columns=['topic', 'link', 'header'])

<div class="alert alert-block alert-info">
    <b>Caution:</b> Call the following code only once (or as rarely as possible), then store the DataFrame locally and read it from file.
</div>

In [None]:
# Crawl and parse the AI page
#seed_url = 'https://www.wired.co.uk/topic/artificial-intelligence'
#df = build_dataset(seed_url)

# Crawl and parse the neuroscience page
#seed_url = 'https://www.wired.co.uk/topic/neuroscience'
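#df = pd.concat([df, build_dataset(seed_url)], ignore_index=True)  # DataFrame.append() was removed in pandas 2.0

# Store the crawled headers locally after the first run ...
#df.to_csv('article_headers.csv', sep='\t', index=False)

# ... and read them from file on every later run
df = pd.read_csv('article_headers.csv', sep='\t')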

In [None]:
display(df.head())
display(df.tail())

[Back to top](#toc)

## Vectorizing Text <a name="vect"></a>
Now, it's time to preprocess and vectorize the data and create the Bag-of-Words (BoW) vectors for each document. Before building a larger dataset, we will work with the **article headers as documents**.

### The ``CountVectorizer``
With the ``CountVectorizer``, scikit-learn offers an all-in-one solution for the preprocessing and vectorization of text data. Read through the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and make yourself familiar with its usage. Make sure to understand
* how tokenization is applied
* how to remove low- and high-frequency words or stopwords
* how to enable lowercasing
* how to build a binary incidence matrix vs. a frequency matrix
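
To make these four points concrete, here is a hand-rolled sketch of what a bag-of-words vectorizer does conceptually. It is not scikit-learn's implementation: the function `bag_of_words` and the example headers are made up for illustration. Only the token pattern (two or more word characters) is taken from the documented default of the ``CountVectorizer``.

```python
import re
from collections import Counter

# Tokens of two or more word characters, as in CountVectorizer's documented default pattern
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def bag_of_words(docs, lowercase=True, min_df=1, binary=False):
    """Return (vocabulary, matrix) with one row of term counts per document."""
    tokenized = []
    for doc in docs:
        text = doc.lower() if lowercase else doc
        tokenized.append(TOKEN_PATTERN.findall(text))

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vocabulary = sorted(term for term, n in doc_freq.items() if n >= min_df)
    index = {term: i for i, term in enumerate(vocabulary)}

    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocabulary)
        for term in tokens:
            if term in index:
                row[index[term]] += 1
        if binary:
            # Incidence matrix: record only whether a term occurs at all
            row = [min(value, 1) for value in row]
        matrix.append(row)
    return vocabulary, matrix


# Made-up article headers
docs = ["AI beats AI at chess", "The brain and AI", "Brain scans reveal sleep"]
vocab, counts = bag_of_words(docs, min_df=2)
_, incidence = bag_of_words(docs, min_df=2, binary=True)
print(vocab)
print(counts)
print(incidence)
```

With ``min_df=2``, only terms that occur in at least two of the three headers survive, and the binary variant caps every count at 1.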

[Back to top](#toc)

Vectorize the text by adding/changing the parameters when initializing the ``CountVectorizer`` in the following code cell. Make sure to
* tokenize
* lowercase
* remove tokens that occur in fewer than two documents

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

train_data = df['header']

vectorizer = CountVectorizer(lowercase=True, analyzer="word", min_df=2, stop_words='english')
X_train = vectorizer.fit_transform(train_data)

print(X_train.shape)

After fitting, ``vectorizer`` holds the learned vocabulary, while ``X_train`` holds the bag-of-words vector for each document. Analyze the vectorizer further by inspecting the following properties and methods:
* ``vocabulary_``
* ``stop_words_``
* ``get_feature_names_out()``

In [None]:
print(vectorizer.vocabulary_)
# stop_words_ lists the terms discarded during fitting because their document frequency was
# too low/high (min_df, max_df) or because of max_features
print(vectorizer.stop_words_)
#print(vectorizer.get_feature_names_out())

Afterwards, make sure to understand the difference between the following outputs:

In [None]:
print(X_train)
print(X_train.todense())
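
The difference between the two outputs comes down to storage. The sparse matrix lists only its non-zero entries together with their positions, while ``todense()`` expands them into a full matrix that consists mostly of zeros. The following pure-Python sketch mimics both views on a made-up 3x4 count matrix. The helper names are hypothetical, and SciPy's actual internal layout is more compact than these triples.

```python
# Made-up document-term counts: 3 documents, 4 terms
dense = [
    [2, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 3, 0, 0],
]


def to_coordinates(dense_rows):
    """Keep only the non-zero cells as (row, column, value) triples."""
    return [
        (i, j, value)
        for i, row in enumerate(dense_rows)
        for j, value in enumerate(row)
        if value != 0
    ]


def to_dense(triples, n_rows, n_cols):
    """Expand (row, column, value) triples back into a full matrix."""
    rows = [[0] * n_cols for _ in range(n_rows)]
    for i, j, value in triples:
        rows[i][j] = value
    return rows


triples = to_coordinates(dense)
print(triples)                  # sparse view: three stored entries
print(to_dense(triples, 3, 4))  # dense view: all twelve cells
```

For a realistic vocabulary with thousands of terms, almost every cell is zero, which is why the vectorizer returns a sparse matrix by default.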

[Back to top](#toc)

### The Document-Term Matrix
Because this starter example uses only article headers, the resulting BoW vectors are low-dimensional, and the document-term matrix can be plotted as follows:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(20, 5))
sns.heatmap(X_train.todense(), ax=ax, cmap='gray_r')
ax.set_xticks([i + 0.5 for i in range(X_train.shape[1])])  # heatmap cells are centered at i + 0.5
ax.set_xticklabels(vectorizer.get_feature_names_out(), rotation=45)
ax.set_xlabel('Dictionary')
ax.set_ylabel(f'{X_train.shape[0]} Documents')
ax.set_title('Document-Term Matrix')
plt.show()

You can also fit a ``TfidfVectorizer`` to see how the BoW vectors change.
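
As a rough guide to what changes, the sketch below re-weights a made-up count matrix by hand. It assumes the weighting documented as the ``TfidfVectorizer`` default: raw term counts, a smoothed inverse document frequency of ln((1 + n) / (1 + df)) + 1, and L2 normalization of every row. The function name `tfidf` is hypothetical, and other vectorizer settings would give different numbers.

```python
import math


def tfidf(counts):
    """Re-weight a count matrix with smoothed idf and L2-normalize each row."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency per term (number of documents containing it)
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

    result = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(value * value for value in weighted))
        result.append([value / norm if norm else 0.0 for value in weighted])
    return result


# Made-up counts: 3 documents, 3 terms; the last term occurs in one document only
counts = [[2, 0, 1], [1, 1, 0], [0, 1, 0]]
weights = tfidf(counts)
for row in weights:
    print([round(value, 3) for value in row])
```

Terms that occur in few documents receive a larger idf factor, so a rare term gains weight relative to its raw count, and every document vector ends up with unit length.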

## Document Classification <a name="classification"></a>

<img src="https://www.nltk.org/images/supervised-classification.png" style="width: 400px;"/>

[Back to top](#toc)

Now, train a document classifier on the article headers to predict the article category (topic).

- Use the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to extract the features of the article headers.
- Use the MinDist classifier to predict the topics. Remember that in scikit-learn, the classifier is called [Nearest Centroid](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestCentroid.html?highlight=nearest%20centroid#sklearn.neighbors.NearestCentroid) classifier. Experiment with cosine distance instead of Euclidean distance.
- In order to assess the classifier's performance, use the test dataset provided as ``test_articles.csv`` on Moodle and plot the confusion matrix.
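
To see why the choice of distance matters, the following pure-Python sketch implements the idea behind the MinDist classifier: compute one mean vector (centroid) per class and assign a new document to the class with the closest centroid. It is an illustration under made-up data, not scikit-learn's implementation, and all function names are hypothetical.

```python
import math


def centroids(X, y):
    """Compute the mean vector of each class."""
    sums, sizes = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        sizes[label] = sizes.get(label, 0) + 1
    return {label: [v / sizes[label] for v in acc] for label, acc in sums.items()}


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def cosine_distance(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return 1.0 - dot / norm if norm else 1.0


def predict(x, class_centroids, distance):
    """Return the label of the centroid closest to x."""
    return min(class_centroids, key=lambda label: distance(x, class_centroids[label]))


# Made-up two-term count vectors: 'neuroscience' documents are short, 'ai' documents are long
X = [[1, 1], [1, 1], [9, 3], [7, 3]]
y = ["neuroscience", "neuroscience", "ai", "ai"]
model = centroids(X, y)

x_new = [6, 6]  # same term proportions as the short documents, but six times longer
print(predict(x_new, model, euclidean))
print(predict(x_new, model, cosine_distance))
```

Euclidean distance is sensitive to document length, so the long document lands next to the long `ai` centroid. Cosine distance compares only the direction of the vectors and therefore picks `neuroscience`.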

### Euclidean Distance Classifier

In [None]:
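from sklearn.neighbors import NearestCentroid

# Fit the nearest-centroid (MinDist) model with the default Euclidean distance
mindist = NearestCentroid()
mindist.fit(X_train, df["topic"])

# Vectorize the test headers with the vocabulary learned on the training data
df_test = pd.read_csv('test_articles.csv', sep=";")
x_test = vectorizer.transform(df_test["header"])
y_hat = mindist.predict(x_test)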

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# score() returns the mean accuracy on the test set
print(mindist.score(x_test, df_test["topic"]))

# Rows are true topics, columns are predicted topics
cm = confusion_matrix(df_test["topic"], y_hat, labels=mindist.classes_)
ConfusionMatrixDisplay(cm, display_labels=mindist.classes_).plot()

### Cosine Distance Classifier

In [None]:
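# Same setup as above, but with cosine distance.
# Note: according to the scikit-learn documentation, releases from 1.5 onward accept only
# "euclidean" and "manhattan" here, so metric="cosine" needs an older release.
mindist = NearestCentroid(metric="cosine")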
mindist.fit(X_train, df["topic"])
df_test = pd.read_csv('test_articles.csv', sep=";")
x_test = vectorizer.transform(df_test["header"])
y_hat = mindist.predict(x_test)
print(mindist.score(x_test, df_test["topic"]))

cm = confusion_matrix(df_test["topic"], y_hat, labels=mindist.classes_)
ConfusionMatrixDisplay(cm, display_labels=mindist.classes_).plot()

## Homework Assignment

Extend your code to include the following:
1. Add a third topic of your choice to the link list DataFrame. 
2. Access the website of each link in your list and extract the article text. Store it in a new column in your DataFrame. (Make sure to be polite! Add ``time.sleep`` before accessing the next page.)
3. Set up a grid search for at least three different parameters of the vectorizer.
4. Evaluate the grid with 5-fold cross-validation to find the best features for the MinDist classifier with respect to the F1 score when trained on the full article texts.
5. Reuse the best model setting to evaluate the performance on the original test dataset (``test_articles.csv``).
6. Plot the confusion matrix for the test dataset.
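
With three topics, the F1 score in step 4 has to be averaged over the classes. The assignment does not say which average to use, so the sketch below assumes the macro average, which weights every topic equally. It computes the score by hand from confusion counts to show what is being measured. The labels are made up and `macro_f1` is a hypothetical helper.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Compute the unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    confusion = Counter(zip(y_true, y_pred))  # (true label, predicted label) -> count
    scores = []
    for label in labels:
        tp = confusion[(label, label)]
        fp = sum(confusion[(other, label)] for other in labels if other != label)
        fn = sum(confusion[(label, other)] for other in labels if other != label)
        denominator = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN), defined as 0 if the class never occurs
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Made-up true and predicted topics for six articles
y_true = ["ai", "ai", "neuro", "neuro", "space", "space"]
y_pred = ["ai", "neuro", "neuro", "neuro", "space", "ai"]
print(round(macro_f1(y_true, y_pred), 4))
```

In scikit-learn, the same quantity is available as ``f1_score(y_true, y_pred, average="macro")``, and a grid search can optimize it via ``scoring="f1_macro"``.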

## Moodle Upload
Upload your notebook as ``firstname_lastname_nlp.html`` to Moodle. Make sure to consider the following:
* Have all your import statements in one single cell at the top of the notebook.
* Remove unnecessary code (such as plotting the document-term-matrix, experimenting with the properties and methods of the CountVectorizer, ...)
* Print the head of your final DataFrame (containing article texts) once.
* Include a markdown cell at the end where you:
    * give a short overview of what your notebook is about
    * describe and interpret your grid settings and justify your choices
    * analyze the final/best results