# Text Mining of BBC News Data

## Part 1: Reading Text Files

In this series of notebooks we will introduce some tools to analyse the topics of a collection of news documents from the BBC.

Here is the description of the dataset we will be using:

http://mlg.ucd.ie/datasets/bbc.html

In [None]:
from urllib.request import urlretrieve
from pathlib import Path

BBC_DATASET_URL = "http://mlg.ucd.ie/files/datasets/bbc-fulltext.zip"
archive_filepath = Path(BBC_DATASET_URL.rsplit("/", 1)[1])

if not archive_filepath.exists():
    print(f"Downloading {BBC_DATASET_URL} to {archive_filepath}...")
    urlretrieve(BBC_DATASET_URL, archive_filepath)
    print("done.")
else:
    print(f"{archive_filepath} exists.")

In [None]:
from zipfile import ZipFile

zf = ZipFile(archive_filepath)

In [None]:
zf.filelist[:10]

In [None]:
zf.extractall(path=".")

In [None]:
zf.close()

In [None]:
bbc_folder_path = Path("bbc")
bbc_folder_path.is_dir()

In [None]:
bbc_folder_path.iterdir()

In [None]:
list(bbc_folder_path.iterdir())

In [None]:
readme_path = bbc_folder_path / "README.TXT"
readme_path

In [None]:
type(readme_path.read_bytes())

In [None]:
type(readme_path.read_text())

In [None]:
print((bbc_folder_path / "README.TXT").read_text(encoding="utf-8"))

In [None]:
text_filepaths = sorted(bbc_folder_path.glob("*/*.txt"))

In [None]:
text_filepaths[:10]

In [None]:
text_filepaths[-10:]

In [None]:
len(text_filepaths)

In [None]:
first_filepath = text_filepaths[0]
first_filepath

In [None]:
print(first_filepath.read_text(encoding="utf-8"))

In [None]:
print(first_filepath.read_text(encoding="iso-8859-1"))

In [None]:
# Report every file that is not valid UTF-8, along with the decoding error.
for path in text_filepaths:
    try:
        path.read_text(encoding="utf-8")
    except UnicodeDecodeError as e:
        print(path)
        print(type(e), e)

In [None]:
b"\xa3".decode("utf-8")

In [None]:
b"\xa3".decode("cp1252")  # Western Europe (Windows code page)

In [None]:
b"\xa3".decode("cp1251")  # Cyrillic (Windows code page)

In [None]:
b"\xa3".decode("cp932")  # Japanese (Windows code page)

In [None]:
b"\xa3".decode("iso-8859-1")  # also known as latin-1

In [None]:
b"\xa3".decode("iso-8859-15")  # also known as latin-9

In [None]:
problematic_filepath = Path("bbc/sport/199.txt")
problematic_bytes = problematic_filepath.read_bytes()
position = problematic_bytes.index(b"\xa3")

In [None]:
position

In [None]:
problematic_bytes[position-20:position+20].decode("iso-8859-1")

In [None]:
problematic_bytes[position-20:position+20].decode("cp1251")

In [None]:
print(problematic_bytes.decode("iso-8859-1"))

In [None]:
print(problematic_filepath.read_text(encoding="cp1251"))

In the context of an English-speaking news site, a Western European code page makes more sense. However, the first article is clearly UTF-8 and the `bbc/sport/199.txt` article is clearly not UTF-8.

This means that not all articles were encoded with the same encoding. The documentation of the dataset does not give us any information on which encoding was used.
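To see why the same character can be valid in one encoding and invalid in another, here is a minimal sketch that encodes the pound sign both ways. The variable names are only illustrative:

```python
# The pound sign is a single byte in ISO-8859-1 but two bytes in UTF-8.
pound = "£"
latin1_bytes = pound.encode("iso-8859-1")
utf8_bytes = pound.encode("utf-8")

print(latin1_bytes)  # b'\xa3'
print(utf8_bytes)    # b'\xc2\xa3'

# On its own, the byte 0xA3 is not a valid UTF-8 sequence,
# so decoding the ISO-8859-1 bytes as UTF-8 fails.
try:
    latin1_bytes.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError as exc:
    decoded_as_utf8 = False
    print(type(exc).__name__, exc)
```

A file written in a single-byte Western European encoding therefore breaks a UTF-8 reader as soon as it contains a character outside ASCII.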

In this case we could try to guess, for instance with the `chardet.detect()` function, which uses statistical heuristics to guess the encoding of each document:

https://pypi.org/project/chardet/

In [None]:
%pip install chardet

In [None]:
import chardet

chardet.detect(problematic_filepath.read_bytes())

This seems to agree with our manual inspection of this file. However, if we try chardet on the first document, it gives a bad answer:

In [None]:
chardet.detect(first_filepath.read_bytes())

So we cannot trust this tool for this dataset: there is too much ambiguity. Since we know that all documents are in English, most of the words should be represented the same way in both encodings. Let's just assume that UTF-8 was used everywhere and ignore/skip characters that cannot be decoded with the UTF-8 encoding:

In [None]:
print(problematic_filepath.read_text(encoding="utf-8", errors="ignore"))

In [None]:
texts = [path.read_text(encoding="utf-8", errors="ignore")
         for path in text_filepaths]
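Ignoring undecodable bytes is simple, but it silently drops characters such as the pound sign. As a possible alternative (not what this notebook uses), one could try UTF-8 first and fall back to a Western European code page. The sketch below assumes `cp1252` is a reasonable fallback for this dataset, and `decode_with_fallback` is a hypothetical helper name:

```python
def decode_with_fallback(raw_bytes, encodings=("utf-8", "cp1252")):
    """Return (text, encoding_used), trying each candidate encoding in order."""
    for encoding in encodings:
        try:
            return raw_bytes.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: keep the text but mark undecodable bytes with U+FFFD.
    return raw_bytes.decode("utf-8", errors="replace"), "utf-8 (lossy)"


sample = b"\xa315m"  # hypothetical snippet: a pound sign stored as one byte

ignored = sample.decode("utf-8", errors="ignore")
recovered, used_encoding = decode_with_fallback(sample)

print(repr(ignored))    # the pound sign has been dropped
print(repr(recovered), used_encoding)
```

With a helper like this, the list comprehension above could call `decode_with_fallback(path.read_bytes())[0]` instead of `path.read_text(...)`. The trade-off is that a wrong fallback guess produces wrong characters rather than missing ones.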

Now that we have loaded all the text documents in memory, we can load the target labels (categories) of those documents by looking at the name of their parent folder.

## Extracting the Category Labels from the File Paths

In [None]:
text_filepaths[0]

In [None]:
def extract_label_from_path(filepath):
    return filepath.parent.name


extract_label_from_path(text_filepaths[0])

In [None]:
extract_label_from_path(text_filepaths[567])

In [None]:
categories = [extract_label_from_path(path) for path in text_filepaths]

In [None]:
len(categories)

In [None]:
from collections import Counter

counter = Counter(categories)
counter.most_common()

## A First Supervised Text Classification Pipeline

To end this section, we will quickly demo how to build a text classification model using scikit-learn.

In [None]:
from sklearn.model_selection import train_test_split

texts_train, texts_test, categories_train, categories_test = train_test_split(
    texts, categories, test_size=0.2, random_state=12)

In [None]:
len(texts_train)

In [None]:
len(texts_test)

In [None]:
len(categories_train)

In [None]:
len(categories_test)

In [None]:
categories_train[:10]

In [None]:
categories_test[:10]

Let's build a pipeline of two components:

- a vectorizer to turn text documents into vectors of TF-IDF-weighted word frequencies
- a linear classifier that tries to weight those frequencies so as to predict the category of the documents.
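To give an intuition for what the vectorizer step computes, here is a simplified, standard-library-only sketch of TF-IDF weighting on three made-up sentences. It is an illustration under simplifying assumptions: scikit-learn's `TfidfVectorizer` differs in details such as tokenization, smoothing and normalization.

```python
import math
from collections import Counter

# Hypothetical toy corpus, not taken from the BBC dataset.
toy_docs = [
    "the match ended in a draw",
    "the election ended in a recount",
    "the striker scored in the match",
]

tokenized = [doc.split() for doc in toy_docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
document_frequency = Counter(
    word for tokens in tokenized for word in set(tokens)
)


def tfidf(tokens):
    """Relative term frequency multiplied by log inverse document frequency."""
    counts = Counter(tokens)
    return {
        word: (count / len(tokens)) * math.log(n_docs / document_frequency[word])
        for word, count in counts.items()
    }


weights = tfidf(tokenized[0])
for word, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{word:10s} {weight:.3f}")
```

Words that occur in every document, such as "the", get a weight of zero, while words specific to one document get the largest weights. This is what lets the linear classifier focus on discriminative words.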

Scikit-learn makes it easy to fit the two components together and treat them as a single component that goes from text input to category output:

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

text_classifier = make_pipeline(
    TfidfVectorizer(min_df=5, max_df=0.7),
    SGDClassifier(max_iter=100, tol=1e-6,
                  early_stopping=True, n_iter_no_change=5),
)

Let's fit the model on the training set:

In [None]:
%%time
text_classifier = text_classifier.fit(texts_train, categories_train)

Let's compute the predictions on the test set:

In [None]:
%%time
predictions_test = text_classifier.predict(texts_test)

In [None]:
predictions_test[:10]

By comparing the predictions of the model to the true category labels, we can get an estimate of the test accuracy of our model:

In [None]:
import numpy as np

np.mean(predictions_test == categories_test)

In [None]:
from sklearn.metrics import classification_report

print(classification_report(categories_test, predictions_test))
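The precision and recall columns of the report can be understood from a small hand-computed example. The sketch below uses made-up labels and a hypothetical helper `precision_recall`. It is meant to illustrate the definitions, not to replace `classification_report`:

```python
from collections import Counter


def precision_recall(true_labels, predicted_labels, label):
    """Compute precision and recall for a single class label."""
    pairs = Counter(zip(true_labels, predicted_labels))
    true_positives = pairs[(label, label)]
    predicted_count = sum(n for (t, p), n in pairs.items() if p == label)
    actual_count = sum(n for (t, p), n in pairs.items() if t == label)
    precision = true_positives / predicted_count if predicted_count else 0.0
    recall = true_positives / actual_count if actual_count else 0.0
    return precision, recall


# Hypothetical labels, for illustration only.
toy_true = ["sport", "sport", "tech", "tech", "politics"]
toy_pred = ["sport", "tech", "tech", "tech", "politics"]

accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
print("accuracy:", accuracy)

for label in sorted(set(toy_true)):
    precision, recall = precision_recall(toy_true, toy_pred, label)
    print(f"{label:10s} precision={precision:.2f} recall={recall:.2f}")
```

Precision answers "of the documents predicted as this class, how many really belong to it?", while recall answers "of the documents that belong to this class, how many did the model find?".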

Note that we could also compute those performance metrics on the training set if we wanted:

In [None]:
predictions_train = text_classifier.predict(texts_train)

np.mean(predictions_train == categories_train)

In [None]:
print(classification_report(categories_train, predictions_train))

### Questions:

- The predictions on the training set are better than on the test set. Why?

- Why should we always evaluate the performance of a model on a test (or validation) set?

- Assume we have a model that predicts significantly better on the training set than on a test set. What is this situation called?

- Assume that the model has poor performance even on the training set. What is this situation called?

- Can you suggest typical solutions for each problem?