1. Import the pandas library and use it to read the dataset file into a dataframe, as shown in the following code.

In [1]:
import pandas as pd

df = pd.read_csv("../../Datasets/clickbait-headlines.tsv", sep="\t", names=["Headline", "Label"])
df

      Headline                                           Label
0     Egypt's top envoy in Iraq confirmed killed         0
1     Carter: Race relations in Palestine are worse ...  0
2     After Years Of Dutiful Service, The Shiba Who ...  1
3     In Books on Two Powerbrokers, Hints of the Future  0
4     These Horrifyingly Satisfying Photos Of "Baby ...  1
...   ...                                                ...
9995  What Is Your Weirdest Fear                         1
9996  Felipe Massa wins 2008 French Grand Prix           0
9997  Bottled water concerns health experts              0
9998  Death of Nancy Benoit rumour posted on Wikiped...  0


We import the pandas library and then use the `read_csv()` function to read the file into a dataframe called `df`. We pass `sep="\t"` to indicate that the file uses tab characters as separators, and we supply the column names through the `names` argument. The output is summarised to show only the first and last few entries, followed by a description of how many rows and columns there are in the entire dataframe.
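
As a minimal sketch of what `sep` and `names` do, the following reads a tiny tab-separated sample from memory instead of from disk. The use of `io.StringIO`, the `sample` string, and the `sample_df` name are assumptions made for illustration; the two headlines are borrowed from the output above.

```python
import io

import pandas as pd

# Two rows in the same layout as the dataset: headline, tab, label.
sample = (
    "Felipe Massa wins 2008 French Grand Prix\t0\n"
    "What Is Your Weirdest Fear\t1\n"
)

# sep="\t" splits each line on tab characters. Passing names= supplies the
# column labels and tells pandas that the data has no header row of its own.
sample_df = pd.read_csv(io.StringIO(sample), sep="\t", names=["Headline", "Label"])

print(sample_df.shape)
print(list(sample_df.columns))
```

Without `names`, pandas would treat the first headline as the header row, so the first record would be lost from the data.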

2. Using a for loop, calculate the length of each headline and print out the first ten lengths, timing how long it takes, as shown in the code below.


In [2]:
%%time

lengths = []

for i, row in df.iterrows():
    lengths.append(len(row["Headline"]))  # look up the headline by column name
print(lengths[:10])

[42, 60, 72, 49, 66, 51, 51, 58, 57, 76]
CPU times: user 1.48 s, sys: 18 ms, total: 1.5 s
Wall time: 1.65 s


We declare an empty list to store the lengths, then loop through each row in our dataframe using the `iterrows()` method. We append the length of each row's headline to our list, and finally print out the first ten results.
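
To see where the time goes in this loop, it helps to look at what `iterrows()` yields. The sketch below uses a two-row toy dataframe; the `toy` and `seen` names are illustrative, and the headlines are borrowed from the output above.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Headline": [
            "What Is Your Weirdest Fear",
            "Bottled water concerns health experts",
        ],
        "Label": [1, 0],
    }
)

seen = []
for i, row in toy.iterrows():
    # Each iteration yields the index label and a pandas Series for that row.
    seen.append((i, type(row).__name__, len(row["Headline"])))

print(seen)
```

Because a new `Series` is built for every row, the per-row overhead adds up quickly on a dataframe with thousands of rows.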

3. Now recalculate the length of each headline, but this time using vectorized operations, as shown in the code below.

In [3]:
%%time
lengths = df['Headline'].apply(len)
print(lengths[:10])

0    42
1    60
2    72
3    49
4    66
5    51
6    51
7    58
8    57
9    76
Name: Headline, dtype: int64
CPU times: user 10.5 ms, sys: 4.88 ms, total: 15.3 ms
Wall time: 16.3 ms


We use the `apply()` method to apply `len` to every headline in the `Headline` column, without writing a for loop. Then we print the results to verify that they match those from the for loop. From the output, we can see the results are the same, but this time the wall time was only 16.3 milliseconds instead of over one second.
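
Note that `apply(len)` still calls Python's `len` once per element. For string columns, pandas also provides the `.str` accessor, and `.str.len()` is an alternative way to express the same calculation. The sketch below, on an illustrative two-item `headlines` series, checks only that the two approaches agree; it makes no claim about which is faster on this dataset.

```python
import pandas as pd

headlines = pd.Series(
    [
        "Egypt's top envoy in Iraq confirmed killed",
        "What Is Your Weirdest Fear",
    ],
    name="Headline",
)

via_apply = headlines.apply(len)  # calls len() on each element
via_str = headlines.str.len()     # string accessor, no explicit function call

print(via_apply.tolist())
print(via_str.tolist())
```

One practical difference is that `.str.len()` returns a missing value for a missing headline, whereas `len` would raise an error on it.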

4. Try a different calculation. This time find the average length of all clickbait headlines and compare it to the average length of normal headlines, as shown in the code below.

In [4]:
%%time
from statistics import mean

normal_lengths = []
clickbait_lengths = []

for i, row in df.iterrows():
    if row["Label"] == 1:  # clickbait
        clickbait_lengths.append(len(row["Headline"]))
    else:
        normal_lengths.append(len(row["Headline"]))

print("Mean normal length is {}".format(mean(normal_lengths)))
print("Mean clickbait length is {}".format(mean(clickbait_lengths)))

Mean normal length is 52.0322
Mean clickbait length is 55.6876
CPU times: user 1.52 s, sys: 12.2 ms, total: 1.53 s
Wall time: 1.6 s


We import the `mean` function from the `statistics` library. This time we set up two empty lists, one for the lengths of normal headlines and one for clickbait. We use the `iterrows()` method again to visit every row and calculate the headline length, but this time we store the result in one of our two lists, based on whether the headline is clickbait or not. We then take the average of each list and print it out.

5. Now re-calculate this output using vectorized operations as shown in the code below.

In [5]:
%%time

print(df[df["Label"] == 0]['Headline'].apply(len).mean())
print(df[df["Label"] == 1]['Headline'].apply(len).mean())

52.0322
55.6876
CPU times: user 13 ms, sys: 3.35 ms, total: 16.4 ms
Wall time: 25.5 ms


In each line, we look at only a subset of the dataframe: first the rows where the label is 0, and then the rows where it is 1. We again apply the `len` function to each headline that matches the condition and then take the average of the entire result. We confirm that the output is the same as before.
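
The two filtered lines can also be written as a single grouped calculation. The following is a sketch of that alternative rather than the approach used above, run on a three-row toy dataframe; the `toy` and `mean_lengths` names are illustrative.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Headline": [
            "Bottled water concerns health experts",
            "What Is Your Weirdest Fear",
            "Felipe Massa wins 2008 French Grand Prix",
        ],
        "Label": [0, 1, 0],
    }
)

# Compute every headline length once, then average within each label.
mean_lengths = toy["Headline"].str.len().groupby(toy["Label"]).mean()

print(mean_lengths)
```

The result is a series indexed by label, so `mean_lengths.loc[0]` holds the normal average and `mean_lengths.loc[1]` the clickbait average.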

6. As a final test, calculate how often the word 'you' appears in each kind of headline, as shown in the following code.

In [6]:
%%time
normal_yous = 0
clickbait_yous = 0

for i, row in df.iterrows():
    # count() matches the substring, so words such as "your" are counted too
    num_yous = row["Headline"].lower().count("you")
    if row["Label"] == 1:  # clickbait
        clickbait_yous += num_yous
    else:
        normal_yous += num_yous

print("Total 'you's in normal headlines {}".format(normal_yous))
print("Total 'you's in clickbait headlines {}".format(clickbait_yous))


Total 'you's in normal headlines 43
Total 'you's in clickbait headlines 2527
CPU times: user 1.62 s, sys: 16.9 ms, total: 1.64 s
Wall time: 1.72 s


We define two variables, `normal_yous` and `clickbait_yous`, to count the total occurrences of "you" in each class of headline. We loop through the entire dataset again using a for loop and the `iterrows()` method. For each row, we lowercase the headline and use `count()` to count how often the substring "you" appears (this also matches longer words such as "your"), and then add this count to the relevant total. Finally we print out both results, seeing that "you" appears very often in clickbait headlines, but hardly ever in non-clickbait headlines.
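
Since `str.count()` counts substrings rather than whole words, the totals include words that merely contain "you". The sketch below contrasts the two on a single made-up headline; the headline and the word-boundary regular expression are illustrative assumptions and are not part of the analysis above.

```python
import re

# A made-up headline, used only to illustrate the difference.
headline = "You Won't Believe What Your Younger Self Would Say To You"
lowered = headline.lower()

# Substring matches: "you", "your", "younger", "you".
substring_count = lowered.count("you")

# Whole-word matches only: \b marks a word boundary on each side.
whole_word_count = len(re.findall(r"\byou\b", lowered))

print(substring_count, whole_word_count)
```

Either definition may be appropriate; the substring count is simply a broader measure of second-person language.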

7. Re-run the same analysis without using a for loop and compare times, as shown in the code below.

In [7]:
%%time
print(df[df["Label"] == 0]['Headline'].apply(lambda x: x.lower().count("you")).sum())
print(df[df["Label"] == 1]['Headline'].apply(lambda x: x.lower().count("you")).sum())

43
2527
CPU times: user 25.7 ms, sys: 3.89 ms, total: 29.6 ms
Wall time: 48.1 ms


As before, we break the dataset into two subsets and apply the same operation to each. This time our function is a bit more complicated than the `len` function we used earlier, so we define an anonymous function inline using `lambda`. We lowercase each headline, count how often "you" appears, and then sum the results.
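
The same count can be expressed without a `lambda` by chaining pandas string methods. This is a sketch of an alternative, shown on a toy dataframe; the `toy`, `you_counts`, and `totals` names are illustrative, and the third headline is made up.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Headline": [
            "What Is Your Weirdest Fear",
            "Bottled water concerns health experts",
            "Can You Guess Your Age",  # made-up example headline
        ],
        "Label": [1, 0, 1],
    }
)

# Lowercase every headline, then count matches of the pattern "you" in each.
you_counts = toy["Headline"].str.lower().str.count("you")

# Sum the per-headline counts within each label.
totals = you_counts.groupby(toy["Label"]).sum()

print(totals)
```

Note that `.str.count()` interprets its argument as a regular expression, which makes no difference for a plain word like "you" but matters for patterns containing special characters.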


The main takeaway is that vectorized operations can be many times faster than for loops: in the runs above, wall times fell from between 1.6 and 1.72 seconds to between 16.3 and 48.1 milliseconds. We also learned some interesting things about clickbait characteristics. For example, the word 'you' appears very often in clickbait headlines (2527 times), but hardly ever in normal headlines (43 times). Clickbait headlines are also slightly longer on average than non-clickbait headlines (55.7 versus 52.0 characters).
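
Outside a notebook, where the `%%time` magic is unavailable, a similar comparison can be sketched with `time.perf_counter()`. The synthetic dataframe below (10,000 copies of one headline) is an assumption standing in for the real dataset, and the timings it prints will vary by machine and pandas version.

```python
import time

import pandas as pd

# Synthetic stand-in for the dataset: one headline repeated 10,000 times.
toy = pd.DataFrame({"Headline": ["What Is Your Weirdest Fear"] * 10_000})

# Time the row-by-row loop.
start = time.perf_counter()
loop_lengths = [len(row["Headline"]) for _, row in toy.iterrows()]
loop_seconds = time.perf_counter() - start

# Time the column-wise apply.
start = time.perf_counter()
apply_lengths = toy["Headline"].apply(len)
apply_seconds = time.perf_counter() - start

print(f"iterrows: {loop_seconds:.4f} s, apply: {apply_seconds:.4f} s")
```

Checking that `loop_lengths` and `apply_lengths.tolist()` are equal confirms that only the speed differs, not the result.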