1.	Create a new directory, Exercise01.01, in the Chapter01 directory to store the files for this exercise.

2.	Open your terminal (macOS or Linux) or command window (Windows), navigate to the Chapter01 directory, and type `jupyter notebook`.

3.	In the Jupyter Notebook, click the Exercise01.01 directory and create a new notebook file with a Python 3 kernel.

4. Read in the dataset file and check how big it is as shown in the following code.


In [1]:
import os
dataset_filename = "../../Datasets/clickbait-headlines.tsv"

print("File: {} \nSize: {} MBs".format(dataset_filename, round(os.path.getsize(dataset_filename)/1024/1024, 2)))

File: ../../Datasets/clickbait-headlines.tsv 
Size: 0.55 MBs


We first import `os`, a module in Python's standard library for interacting with the operating system.

We define the path to the dataset file as the variable `dataset_filename`.

We then print the size of the file using the `getsize()` function from `os.path`. The output shows that the file is less than one megabyte in size.
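The byte-to-megabyte conversion used above can be checked on a file whose size is known in advance. This is a minimal sketch: the helper name `file_size_mb` and the temporary file are illustrative assumptions and are not part of the exercise files.

```python
import os
import tempfile


def file_size_mb(path):
    """Return the size of the file at `path` in MB (1 MB = 1024 * 1024 bytes)."""
    return os.path.getsize(path) / 1024 / 1024


# Write exactly 512 KB to a throwaway file so the expected size is known.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 512 * 1024)
    tmp_path = tmp.name

size_mb = file_size_mb(tmp_path)
print("Size: {} MBs".format(round(size_mb, 2)))  # Size: 0.5 MBs

# Clean up the throwaway file.
os.remove(tmp_path)
```

Dividing by 1024 twice converts bytes to kilobytes and then to megabytes, which is the same calculation the exercise applies to the dataset file.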

5. Read the contents of the file from disk and split each line into a data and label component, as shown in the following code.

In [2]:
import csv

data = []
labels = []

with open(dataset_filename) as f:
    reader = csv.reader(f, delimiter="\t")
    for line in reader:
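        try:
            # Read both fields before appending so a short row cannot
            # leave data and labels with different lengths.
            headline, label = line[0], line[1]
            data.append(headline)
            labels.append(label)
        except Exception as e:
            print(e)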

print(data[:3])
print(labels[:3])

["Egypt's top envoy in Iraq confirmed killed", 'Carter: Race relations in Palestine are worse than apartheid', 'After Years Of Dutiful Service, The Shiba Who Ran A Tobacco Shop Retires']
['0', '0', '1']


We import Python's `csv` library, which is useful for processing our file because it is in tab-separated values (TSV) format.

We define two empty arrays, `data` and `labels`.

We open the file, create a CSV reader, and indicate which delimiter (`"\t"`, a tab character) is used. We then loop through each line of the file, adding the first element to the data array and the second element to the labels array. If anything goes wrong, we print out an error message to indicate this.

Finally, we print out the first three elements of each of our arrays. They match up, so the first element in our data array is linked to the first element in our labels array. From the output we see that the first two elements are '0' or 'not clickbait' while the last element is identified as '1', indicating a clickbait headline.
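To see how the reader behaves when a row is malformed, the same loop can be run on a small in-memory sample instead of the real file. This is only a sketch: the three sample lines are made up for illustration, and the names `sample_tsv`, `sample_data`, `sample_labels`, and `skipped` are assumptions rather than part of the exercise.

```python
import csv
import io

# Made-up sample in the same layout: headline, a tab, then the label.
# The second line deliberately has no label column.
sample_tsv = (
    "Example headline about a budget vote\t0\n"
    "A line with no label column\n"
    "You Won't Believe This Example Headline\t1\n"
)

sample_data = []
sample_labels = []
skipped = []

reader = csv.reader(io.StringIO(sample_tsv), delimiter="\t")
for line_number, line in enumerate(reader, start=1):
    try:
        # Read both fields first; a missing label raises IndexError
        # before anything is appended, so the lists stay aligned.
        headline, label = line[0], line[1]
    except IndexError:
        skipped.append(line_number)
        continue
    sample_data.append(headline)
    sample_labels.append(label)

print(sample_data)
print(sample_labels)
print("Skipped lines:", skipped)
```

Because each headline is matched to its label by position, the important property is that both lists always end up the same length, even when a row is skipped.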

6. Create vectors from our text data using the sklearn library, while showing how long it takes, as shown in the code below.

In [3]:
%%time
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(data)
print("The dimensions of our vectors:")
print(vectors.shape)
print("- - -")


The dimensions of our vectors:
(10000, 13169)
- - -
CPU times: user 865 ms, sys: 176 ms, total: 1.04 s
Wall time: 1.02 s


The first line, `%%time`, is a special Jupyter Notebook command that reports how long the cell takes to run. Then we import `TfidfVectorizer` from the `sklearn` library. We initialise a vectorizer and call its `fit_transform()` function, which assigns each word to an index and creates the resulting vectors from the text data in a single step.

Then we print out the shape of the vectors, noticing that they are 10000 rows (the number of headlines) by 13169 columns (the number of unique words across all headlines).

We can see from the timing output above that it took around 1 second in total to run this code.
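The link between unique words and columns is easier to see on a tiny corpus. The sketch below uses three made-up headlines (not taken from the dataset); the names `toy_headlines`, `toy_vectorizer`, and `toy_vectors` are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up headlines containing 12 distinct words between them.
toy_headlines = [
    "council approves new budget",
    "you will not believe this budget trick",
    "new trick saves council money",
]

toy_vectorizer = TfidfVectorizer()
toy_vectors = toy_vectorizer.fit_transform(toy_headlines)

# One row per headline, one column per unique word.
print(toy_vectors.shape)  # (3, 12)

# vocabulary_ maps each word to its column index.
print(sorted(toy_vectorizer.vocabulary_))
```

Each row holds non-zero values only in the columns for words that appear in that headline, which is why most entries in the full 10000 by 13169 matrix are zero.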

7. Check how much memory our vectors are taking up in their sparse format and compare this to how much space they would have used if we had used a dense format.

In [4]:
print("The data type of our vectors")
print(type(vectors))
print("- - -")
print("The size of our vectors (MB):")
print(vectors.data.nbytes/1024/1024)
print("- - -")
print("The size of our vectors in dense format (MB):")
print(vectors.todense().nbytes/1024/1024)
print("- - - ")
print("Number of non zero elements in our vectors")
print(vectors.nnz)
print("- - -")

The data type of our vectors
<class 'scipy.sparse.csr.csr_matrix'>
- - -
The size of our vectors (MB):
0.6759414672851562
- - -
The size of our vectors in dense format (MB):
1004.7149658203125
- - - 
Number of non zero elements in our vectors
88597
- - -


We print out the type of the vectors and see that it is `csr_matrix`, or "compressed sparse row matrix", which is the sparse format that `TfidfVectorizer` returns. The stored values take up only 0.68 MB of memory (this figure counts the array of non-zero values; the index arrays that record their positions add a little more). Next we call the `todense()` function, which converts the data structure to a standard dense matrix. We check the size again and see that it is now around 1 GB (1004.7 MB).

Finally, we output the `nnz` (number of non-zero elements) and see that there were around 88 thousand non-zero elements that we store. Because we had 10000 rows and 13169 columns, the total number of elements is 131 690 000, which is why the dense matrix is so much larger in memory use.
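The two sizes printed above can be reproduced with plain arithmetic from the shape and the non-zero count. This sketch assumes 8 bytes per value (float64), which is consistent with the printed figures, and 4 bytes per index (int32), which is an assumption about how the index arrays are stored.

```python
rows, cols = 10000, 13169   # shape reported by vectors.shape
nnz = 88597                 # non-zero count reported by vectors.nnz
BYTES_PER_VALUE = 8         # assumption: float64 values
BYTES_PER_INDEX = 4         # assumption: int32 index arrays
MB = 1024 * 1024

# A dense matrix stores every cell, zero or not.
dense_mb = rows * cols * BYTES_PER_VALUE / MB

# The sparse matrix stores only the non-zero values...
values_mb = nnz * BYTES_PER_VALUE / MB
# ...plus one column index per non-zero value and rows + 1 row pointers.
index_mb = (nnz + rows + 1) * BYTES_PER_INDEX / MB

print(round(dense_mb, 2))              # 1004.71
print(round(values_mb, 4))             # 0.6759
print(round(values_mb + index_mb, 2))  # 1.05
print("{:.4%} of cells are non-zero".format(nnz / (rows * cols)))
```

Even with the index arrays included, the sparse representation stays at roughly 1 MB under these assumptions, about a thousandth of the dense size.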

8. For machine learning we need to split our data into a train portion for training and a test portion to evaluate how good our model is. Do this with the following code.

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(vectors, labels, test_size=0.2)

print(X_train.shape)
print(X_test.shape)

(8000, 13169)
(2000, 13169)


We imported the `train_test_split` function from sklearn and split our two arrays (vectors and labels) into four arrays (X_train, X_test, y_train, y_test). The `y` prefix indicates labels and the `X` prefix indicates vectorized data. We use the argument `test_size=0.2` to indicate that we want 20% of our data held back for testing.

We then print out each shape to show that 80% (8000) of the headlines are in the training set and 20% (2000) are in the test set. Because the headlines were all vectorized together before the split, each set still has 13169 dimensions, or possible words.
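Since `train_test_split` shuffles the data, each run produces a different split unless a seed is fixed. The sketch below, on made-up toy data, shows the optional `random_state` and `stratify` arguments; the exercise itself does not use them, and the names `toy_X`, `toy_y`, and `split` are illustrative assumptions.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the vectors and labels: ten items, two balanced classes.
toy_X = list(range(10))
toy_y = ["0", "1"] * 5


def split(seed):
    """Split the toy data with a fixed seed, keeping class proportions."""
    return train_test_split(
        toy_X, toy_y, test_size=0.2, random_state=seed, stratify=toy_y
    )


first = split(42)
second = split(42)
X_tr, X_te, y_tr, y_te = first

print(len(X_tr), len(X_te))  # 8 2
print(first == second)       # True: the same seed gives the same split
print(sorted(y_te))          # ['0', '1']: one test item from each class
```

Fixing the seed makes accuracy figures reproducible between runs, and stratifying keeps the share of clickbait headlines the same in both portions.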

9. Initialise the `LinearSVC` classifier, train it, and generate predictions with the following code.

In [6]:
%%time

from sklearn.svm import LinearSVC

svm_classifier = LinearSVC()
svm_classifier.fit(X_train, y_train)

predictions = svm_classifier.predict(X_test)

CPU times: user 36.7 ms, sys: 7.66 ms, total: 44.4 ms
Wall time: 50.6 ms


We import the `LinearSVC` model from `sklearn`, and initialise an instance of it. Then we give it the training data and training labels (note that it does not have access to the test data at this stage). 

Finally we give it the testing data, but without the testing labels, and ask it to guess which of the headlines in the held-out test set are clickbait. We call these `predictions`.

To get some insight into what is happening, let's take a look at some of these predictions and compare them to the real labels.

10. Output the predicted class and true class for the first 10 headlines in the test set by running the following code.

In [10]:
print("prediction, label")
for i in range(10):
    print(predictions[i], y_test[i])

prediction, label
1 1
0 0
1 1
0 0
1 1
1 1
1 1
1 1
1 1
1 1


We can see that for the first 10 cases, our predictions were spot on. Let's see how we did over all the test cases.

11. Evaluate how well the model performed using the following code.

In [None]:
from sklearn.metrics import accuracy_score, classification_report

print("Accuracy: {}\n".format(accuracy_score(y_test, predictions)))
print(classification_report(y_test, predictions))


We achieve 97% accuracy, which means around 1940 of the 2000 test cases were correctly classified by our model. This is a good summary score, but for a fuller picture we also print out the full classification report. The model can be wrong in two different ways: by classifying a clickbait headline as normal, or by classifying a normal headline as clickbait. Because the precision and recall scores are similar, we can confirm that the model is not biased towards a specific kind of mistake.
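To make the two kinds of mistake concrete, accuracy, precision, and recall can be computed by hand on a small example. The labels below are made up for illustration and are not output from the model; the names `true_labels` and `predicted` are assumptions.

```python
# Made-up labels for illustration only ('1' = clickbait, '0' = normal).
true_labels = ["1", "1", "1", "1", "0", "0", "0", "0", "0", "0"]
predicted = ["1", "1", "1", "0", "0", "0", "0", "0", "0", "1"]

pairs = list(zip(true_labels, predicted))
tp = sum(t == "1" and p == "1" for t, p in pairs)  # clickbait, caught
fn = sum(t == "1" and p == "0" for t, p in pairs)  # clickbait, missed
fp = sum(t == "0" and p == "1" for t, p in pairs)  # normal, flagged by mistake
tn = sum(t == "0" and p == "0" for t, p in pairs)  # normal, left alone

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)  # of the headlines flagged as clickbait, how many were
recall = tp / (tp + fn)     # of the real clickbait headlines, how many were flagged

print(accuracy)   # 0.8
print(precision)  # 0.75
print(recall)     # 0.75
```

Here the model makes one mistake of each kind, so precision and recall are equal. A model that flagged almost everything as clickbait would show high recall but low precision, which is the kind of imbalance the classification report helps to reveal.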