# Welcome

Practice text


# About This Notebook

The interactive document you're looking at is called a Jupyter Notebook. In it, information is divided into "cells" containing either explanatory text (like this cell) or code (like the next cell).

To run a cell of code, you can select a cell by clicking on it, then click the "Run" button on the toolbar at the top of the page. Alternatively, you can press the Ctrl and Enter keys together to run a cell of code after selecting it.

After the code in a cell has finished running, the output from the code will appear beneath its cell. Try running the code that's in the next cell. It should output a phrase, and a solution to a sum.

In [1]:
print("Hello World!")
print(17 + 6)

Hello World!
23


You might notice that each cell of code has a set of square brackets that appear to its left. When a cell hasn't been run, there is a space between the brackets. While a cell is currently running, an asterisk (\*) appears in the brackets. When it's finished running, a number appears in the brackets, indicating the order in which the cells have been run.

These cells can all be edited. We've labeled some lines of code with **"Try editing this code!"**, which means that after running the code the first time, you can try editing these lines to use your own choice of input and then re-run the code.

# About Python

The name of the programming language being used here is Python, a popular tool among data scientists. This exercise isn't a full introduction to Python, but it would be helpful to mention some of its basic properties.

If a line of code contains the `#` symbol, then everything on the line after `#` will be ignored when running the code. This is very useful for including descriptive comments within the code.

**Variables** are placeholder objects that can store values assigned to them. Variables can be named practically anything, but they should generally have names that indicate their purpose.

A value that is assigned to a variable could be a number, like `17`, `3.4`, or `-108`. Or the value could be text (also known as a **string**), which is always surrounded by quotation marks. Many other types of values are possible too, which we'll see later.

Variables are assigned values by using the symbol `=`. The name of the variable goes on the left, and the value being assigned to it goes on the right.

The next cell of code has an example of assigning values to variables, and then printing those values. Try running the code now.


In [2]:
# This cell has examples of variable assignment
# (Notice how this line starts with '#' so it is a comment ignored by the code.)

# vvv Try editing this code!
my_name = "Waldo" 
breakfast_food = "eggs"
fav_number = 17
# ^^^ Try editing this code!

print("Hello! My name is " + my_name + ".")
print("Today I had " + breakfast_food + " for breakfast.")
print(str(fav_number) + " is my favourite number.")
print("Twice my favourite number would be: " + str(fav_number * 2))

Hello! My name is Waldo.
Today I had eggs for breakfast.
17 is my favourite number.
Twice my favourite number would be: 34


Don't worry about understanding every word or line of code in this notebook. Our goal is simply to give you an idea of the types of output we can generate when analysing a data set.

Now it's your turn to write some code. Go back to the previous cell and change the values that are assigned to the variables. (This is the section labelled "Try editing this code!") Then run the code again, and see how the output has changed. You can also try adding more comments (using `#`), or writing new sentences for the code to print out.

# Code Setup

We'll start by importing some Python libraries that contain useful functions for interacting with our data. Please be sure to run this cell of code, so that the later code will be able to use these functions.

In [3]:
# Importing useful Python modules

print("About to import Python modules...")
import os
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import re
from wordcloud import WordCloud, STOPWORDS
print("Finished importing Python modules")


About to import Python modules...
Finished importing Python modules


In [20]:
from platform import python_version
python_version()

'3.9.13'

In [21]:
re.__version__

'2.2.1'

Next, we'll load in the data. This publicly available data set consists of over 400,000 news headlines from around the world, collected between 10th March and 10th August 2014.

(The data comes from the UCI Machine Learning Repository. Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.)

In [None]:
# This code will load in our data set.

# This is the name of the file, which is a csv file that has been compressed as a zip file.
data_file_name = "uci-news-aggregator-short.zip"

# Here we read the file.
news_df = pd.read_csv(data_file_name)

# We're only going to look at titles and publishers.
news_df = news_df.loc[:, ['TITLE','PUBLISHER']]

print("Finished reading the data.")

The data set is now stored in the variable named `news_df`. The variables we used in our earlier example only contained a single number or string of text. This variable contains an entire table of data!

This particular type of table is called a **data frame**. Data frames have some useful built-in functions for quick analysis. Let's look at the size of the data frame, and what the first few lines look like.

In [None]:
# See what the table of data looks like:
print(news_df)

There's too much data to print, so we see an abbreviated version here, containing the first and last few lines of the data frame.

Just from this quick look at the data, we see that each row of the printed table has 3 parts:
- a numbered label (also called the **index**) for each row, which is not counted as a column when measuring the table's size
- `TITLE`, containing the headline
- `PUBLISHER`, containing the source

We can refer to each row in a data frame as a **record** that contains a set of associated information. Each column is a **feature** which is a specific type of data that is provided for each record.

At the very end of the output from the previous cell's code, we see the size of the data frame (number of rows and columns). We can also get the size of the data frame like so:

In [None]:
# Get the size of a data frame:
print(news_df.shape)
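
To make the idea of a data frame's size more concrete, here is a minimal sketch using a tiny, made-up table. The three headlines and the publisher names are invented for illustration and are not taken from the real data set; the names `toy_df`, `n_rows`, `n_cols` and `first_two` are assumptions of this sketch.

```python
import pandas as pd

# A tiny, made-up table with the same two column names as news_df.
# (These rows are hypothetical, not from the real data set.)
toy_df = pd.DataFrame({
    "TITLE": ["Rain expected this weekend", "Local team wins final", "New bridge opens"],
    "PUBLISHER": ["Example Times", "Example Times", "Sample Daily"],
})

# shape is a (rows, columns) pair; the index is not counted as a column.
n_rows, n_cols = toy_df.shape
print(n_rows, n_cols)

# head(n) returns only the first n rows, which is handy for large tables.
first_two = toy_df.head(2)
print(first_two)
```

Notice that `shape` counts only `TITLE` and `PUBLISHER`. The index that appears on the left of the printout is a label for each row rather than a column of data.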

# Examining Publishers

Let's look at how many different publishers there are in the `PUBLISHER` column.

In [None]:
# Look at the number of headlines per publisher in our data set:

publisher_counts = news_df['PUBLISHER'].value_counts()
print(publisher_counts)

Again, this table is too large to easily print everything, but at a glance we can see that the publisher with the most headlines in this data set is Reuters, followed by Huffington Post and Businessweek.

Also, notice that the final line of output says `Length: 10985`. This means there are 10,985 publishers in this data set.
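
As a small illustration of how counting works, the sketch below applies `value_counts()` to a handful of made-up publisher names. The names and the variables `toy_publishers` and `toy_counts` are invented for this example.

```python
import pandas as pd

# Hypothetical publisher names, invented for illustration.
toy_publishers = pd.Series(
    ["Example Times", "Sample Daily", "Example Times", "Demo Post", "Example Times"]
)

# value_counts() tallies each distinct value, sorted from most to least frequent.
toy_counts = toy_publishers.value_counts()
print(toy_counts)

# The length of the result is the number of distinct publishers.
print(len(toy_counts))

# nunique() gives the same number without building the whole table.
print(toy_publishers.nunique())
```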

We saved this table of publisher counts in the variable named `publisher_counts`, so we can use it in the next cell's code to see how many records appear with a given publisher. You can change the value of `publisher_to_check` to the name of any other news source you think might appear in this data set. 

(Note: Python often makes use of indented code. If you edit the code we've written below, do not alter the indents at the start of lines. If you do, the code might not work correctly.)

In [None]:
# Look at how many headlines in this data set are from a given publisher.

# vvv Try editing this code!
publisher_to_check = "Yorkshire Post"
# ^^^ Try editing this code!

if publisher_to_check in publisher_counts:
    print(publisher_counts[publisher_to_check])
else:
    print("That name doesn't appear in the list of publishers.")
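
To show why the indents matter, here is a minimal sketch of the same kind of check, written as a hypothetical helper named `describe_publisher`. It is not part of this notebook's code, and it uses a plain dictionary with made-up numbers in place of the `publisher_counts` table.

```python
# A hypothetical helper (not part of the notebook) that mirrors the if/else above,
# using a plain dictionary in place of the publisher_counts table.
def describe_publisher(name, counts):
    if name in counts:
        # These indented lines run only when the name is found.
        message = name + " has " + str(counts[name]) + " headlines."
    else:
        # These indented lines run only when the name is missing.
        message = name + " doesn't appear in the list of publishers."
    # This line is back at the function's indent level, so it runs in both cases.
    return message

toy_counts = {"Example Times": 3, "Sample Daily": 1}  # made-up numbers
print(describe_publisher("Example Times", toy_counts))
print(describe_publisher("Demo Post", toy_counts))
```

The indent is how Python decides which lines belong to the `if` branch, which belong to the `else` branch, and which run either way.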

# Examining Words in Headlines

Now let's look at the headlines. Let's say we're interested in looking at the set of words in a headline. Here's a convenient method for splitting up a string of text into a list of words, separating it wherever there is whitespace (such as a space):

In [None]:
# Splitting a string of text into separate words

example_sentence = "This example sentence contains six words."
example_word_list = example_sentence.split()
print(example_sentence)
print(example_word_list)
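
For the curious, the short sketch below compares `split()` with no input against `split(" ")`, using a made-up string that has doubled spaces and a tab in it. The variable names are chosen just for this example.

```python
# split() with no argument splits on any run of whitespace
# (spaces, tabs, newlines) and never produces empty strings.
messy_sentence = "  Two  spaces\tand a tab  "
words_default = messy_sentence.split()
print(words_default)

# split(" ") splits on every single space, so repeated spaces
# leave empty strings in the result.
words_on_space = messy_sentence.split(" ")
print(words_on_space)
```

This is why plain `split()` tends to be the safer choice for headlines, where the spacing is not always tidy.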

We'll apply this to every headline in the `TITLE` column, and store all the results in a new column we'll call `WORDS`.

In [None]:
# Adding a new WORDS column to our table

news_df['WORDS'] = [headline.split() for headline in news_df['TITLE']]
print(news_df)

Here's some code that will show us how many times a given word appears in the headlines, along with the first few examples:

In [None]:
# Counting the number of headlines with a chosen word

# vvv Try editing this code!
word_to_check = "Sport"
# ^^^ Try editing this code!

news_words_df = news_df.explode("WORDS")
word_appearances = news_words_df.loc[news_words_df['WORDS']==word_to_check]
num_appearances = len(word_appearances)
print("Number of appearances of \"" + word_to_check + "\": " + str(num_appearances))
print("Examples of appearances (maximum 10):")
print(word_appearances.iloc[:10,[0,1]].to_string(index=False))

In the code above, you can change the value of `word_to_check` to be any word you'd like to search for. Try it yourself!
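
The search above relies on a data frame function called `explode`. The following minimal sketch shows what it does, using two made-up headlines; the names `toy_df`, `toy_words_df` and `matches` are assumptions of this example.

```python
import pandas as pd

# Two made-up headlines, already split into lists of words.
toy_df = pd.DataFrame({
    "TITLE": ["Sport news today", "Weather and Sport"],
    "WORDS": [["Sport", "news", "today"], ["Weather", "and", "Sport"]],
})

# explode() gives each list element its own row; the other columns
# (and the index label) are repeated for every word of a headline.
toy_words_df = toy_df.explode("WORDS")
print(toy_words_df)

# Comparing the WORDS column to a string keeps only the matching rows.
matches = toy_words_df.loc[toy_words_df["WORDS"] == "Sport"]
print(len(matches))
```

After exploding, the table has one row per word rather than one row per headline, which is what makes the word-by-word comparison possible.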

# Normalising Words

After you've tried searching some words, go back and check how many times the word "Sport" appears. Now try checking "sport" in all lowercase instead -- you'll get a different output!

Currently, the code distinguishes different capitalisation as different words, but that's not really what we want. When we're checking how frequently a word is used, we want all capitalisations to be treated equally. Also, try checking "Sport," with a comma after it -- that's treated differently too. So we should rewrite our code to ignore punctuation and capitalisation differences. That's what the code in the next cell does, saving the normalised words by overwriting the `WORDS` column.

In [None]:
# Removing all punctuation and converting all words to lowercase

nopunct_pattern = re.compile('[^a-zA-Z0-9 ]+')
news_df['WORDS'] = [set(nopunct_pattern.sub('', headline.lower()).split()) for headline in news_df['TITLE']]
print(news_df)
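
To see the normalisation one step at a time, here is a small sketch that applies the same pattern to a single made-up headline. The headline and the names `cleaned` and `unique_words` are invented for illustration.

```python
import re

# The same pattern as above: any run of characters that are not
# letters, digits, or spaces.
nopunct_pattern = re.compile('[^a-zA-Z0-9 ]+')

# A made-up headline for illustration.
headline = "Sport, SPORT and more sport: what's next?"

# Lowercase first, then delete the punctuation, then split into words.
cleaned = nopunct_pattern.sub('', headline.lower())
print(cleaned)

# Wrapping the list in set() keeps each distinct word once per headline.
unique_words = set(cleaned.split())
print(sorted(unique_words))
```

Two side effects are worth knowing about. Apostrophes are deleted along with other punctuation, so "what's" becomes "whats". Also, because each headline's words are stored as a set, a word that is repeated within one headline is only counted once for that headline.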

Now when we search for words, the headlines with all possible capitalisations will be included. Try your searches again in this next cell of code. Again, feel free to change `word_to_check` to any word you'd like.

In [None]:
# Counting the number of headlines with a chosen word (now with normalised words)

# vvv Try editing this code!
word_to_check = "sport"
# ^^^ Try editing this code!

word_to_check = word_to_check.lower() # Force lowercase
news_words_df = news_df.explode("WORDS")
word_appearances = news_words_df.loc[news_words_df['WORDS']==word_to_check]
num_appearances = len(word_appearances)
print("Number of appearances of \"" + word_to_check + "\": " + str(num_appearances))
print("Examples of appearances (maximum 10):")
print(word_appearances.iloc[:10,[0,1]].to_string(index=False))

# Exploring Word Frequency

Next, let's compare how frequently certain words appear in headlines from different news sources. First let's take a look through the 30 publishers with the most headlines in our data set.

In [None]:
# View 30 most common publishers in the data set
print(publisher_counts[:30])

From this list, we'll choose just a few to represent the UK and the US, in the code below. If you like, you could change these lists to your own choices of publishers, but make sure you write the publishers' names exactly as they appear in the data. Then, we'll isolate the numbers for how many headlines are from each of these sources.

In [None]:
# Declare lists of UK and US publishers that we'd like to examine

# vvv Try editing this code!
uk_publisher_list = ["Daily Mail", "Telegraph.co.uk", "The Guardian"]
us_publisher_list = ["Los Angeles Times", "USA TODAY", "CBS Local"]
# ^^^ Try editing this code!

print(publisher_counts[uk_publisher_list])
print("Total for these publishers:")
print(sum(publisher_counts[uk_publisher_list]))
print()
print(publisher_counts[us_publisher_list])
print("Total for these publishers:")
print(sum(publisher_counts[us_publisher_list]))
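
If a name in one of the lists doesn't exactly match the data, the lookup above can stop with an error in recent versions of the pandas library. One possible way to spot such names is sketched below, using made-up counts in place of `publisher_counts`; the names `toy_counts`, `wanted`, `safe_counts` and `missing` are assumptions of this example.

```python
import pandas as pd

# Made-up counts standing in for publisher_counts.
toy_counts = pd.Series({"Example Times": 3, "Sample Daily": 1})

# One name matches exactly; the other differs only in capitalisation.
wanted = ["Example Times", "sample daily"]

# reindex() looks up each label and fills in 0 where there is no exact match,
# instead of stopping with an error.
safe_counts = toy_counts.reindex(wanted, fill_value=0)
print(safe_counts)

# Names that are not in the index were not found in the data.
missing = [name for name in wanted if name not in toy_counts.index]
print(missing)
```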

Let's compare how many times the word "the" appears in these sources. You can also try replacing "the" with any other word you'd like to check.

In [None]:
# Counting instances of a certain word in headlines from selected publishers

# vvv Try editing this code!
word_to_check = "the"
# ^^^ Try editing this code!

word_to_check = word_to_check.lower() # Force lowercase
news_words_df = news_df.explode("WORDS")
word_appearances = news_words_df.loc[news_words_df['WORDS']==word_to_check].drop(['WORDS'], axis=1)

word_appearances_uk = word_appearances[word_appearances["PUBLISHER"].isin(uk_publisher_list)]
word_appearances_us = word_appearances[word_appearances["PUBLISHER"].isin(us_publisher_list)]

total_uk_count = len(news_df[news_df["PUBLISHER"].isin(uk_publisher_list)])
total_us_count = len(news_df[news_df["PUBLISHER"].isin(us_publisher_list)])

print("Number of UK headlines with \"" + word_to_check + "\": " + str(len(word_appearances_uk)) )
print("Total UK headlines: " + str(total_uk_count))
print("Proportion: " + str(len(word_appearances_uk)/total_uk_count))
print()
print("Number of US headlines with \"" + word_to_check + "\": " + str(len(word_appearances_us)) )
print("Total US headlines: " + str(total_us_count))
print("Proportion: " + str(len(word_appearances_us)/total_us_count))
print()
print(word_appearances_uk)
print(word_appearances_us)
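
The proportion is printed alongside the raw count because the two groups of publishers contain different numbers of headlines, so the counts alone can't be compared fairly. The arithmetic can be sketched in plain Python with a few made-up headlines (the word sets below are invented for illustration):

```python
# Made-up headlines, each already reduced to a set of normalised words.
toy_headlines = [
    {"the", "match", "was", "postponed"},
    {"rain", "stops", "play"},
    {"the", "final", "score"},
    {"new", "stadium", "opens"},
]

word_to_check = "the"

# Count the headlines whose word set contains the chosen word.
headlines_with_word = sum(1 for words in toy_headlines if word_to_check in words)

# Divide by the total number of headlines to get a proportion between 0 and 1.
proportion = headlines_with_word / len(toy_headlines)
print(headlines_with_word, len(toy_headlines), proportion)

# Formatting as a percentage can be easier to read.
print(f"{proportion:.1%}")
```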


Let's try to find which words appear in the most UK headlines and US headlines. You can change the value of the `top_words_number` variable to choose how many words you'd like to display, instead of only the top ten.

In [None]:
# Most common words in headlines from our selected publishers

# vvv Try editing this code!
top_words_number = 10
# ^^^ Try editing this code!

uk_words_df = news_words_df[news_words_df["PUBLISHER"].isin(uk_publisher_list)]
uk_word_counts = uk_words_df['WORDS'].value_counts()
print(str(top_words_number) + " most common words from selected UK publishers:")
print(uk_word_counts[:top_words_number])
print()
us_words_df = news_words_df[news_words_df["PUBLISHER"].isin(us_publisher_list)]
us_word_counts = us_words_df['WORDS'].value_counts()
print(str(top_words_number) + " most common words from selected US publishers:")
print(us_word_counts[:top_words_number])


Many of these words aren't very interesting: to, the, in, of, and so on. Let's consult a list of these common words (also called "stop words"), and ignore anything on that list when counting the most frequent words in the headlines.

In [None]:
# Most common words in headlines from our selected publishers, ignoring stop words

# vvv Try editing this code!
top_words_number = 10
# ^^^ Try editing this code!

stopwords_list = [nopunct_pattern.sub('', sw.lower()) for sw in STOPWORDS]
uk_goodwords_df = uk_words_df[~uk_words_df['WORDS'].isin(stopwords_list)]
uk_goodword_counts = uk_goodwords_df['WORDS'].value_counts()
us_goodwords_df = us_words_df[~us_words_df['WORDS'].isin(stopwords_list)]
us_goodword_counts = us_goodwords_df['WORDS'].value_counts()

print(str(top_words_number) + " most common words from selected UK publishers:")
print(uk_goodword_counts[:top_words_number])
print()
print(str(top_words_number) + " most common words from selected US publishers:")
print(us_goodword_counts[:top_words_number])
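
The idea of filtering out stop words can also be sketched without a data frame. The example below uses Python's built-in `Counter` with a very short, hypothetical stop-word list; the real `STOPWORDS` set is much longer, and the words being counted are made up.

```python
from collections import Counter

# A very short, hypothetical stop-word list (the real STOPWORDS set is much longer).
toy_stopwords = {"to", "the", "in", "of"}

# Made-up words, as if collected from several headlines.
toy_words = ["the", "storm", "in", "the", "north", "storm", "warning", "of", "storm"]

# Keep only the words that are not stop words, then count them.
kept_words = [word for word in toy_words if word not in toy_stopwords]
toy_word_counts = Counter(kept_words)

# most_common(n) lists the n most frequent words with their counts.
print(toy_word_counts.most_common(2))
```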


# Bar Charts

We can create bar charts showing some of the most common words in a group of publishers. Again, you can choose how many words we display in our chart by changing the value of the `top_words_number` variable.

In [None]:
# Bar chart of UK data

# This value is the number of most frequent words we'll include in our chart:
# vvv Try editing this code!
top_words_number = 10
# ^^^ Try editing this code!

word_counts_to_graph = uk_goodword_counts

top_words_data = word_counts_to_graph[:top_words_number]
fig, ax = plt.subplots()
ax.bar(top_words_data.index, top_words_data.values)
plt.show()


The word labels run too closely together along the bottom of that chart, so let's try a horizontal bar chart instead.

In [None]:
# Horizontal bar chart

# This value is the number of most frequent words we'll include in our chart:
# vvv Try editing this code!
top_words_number = 10
# ^^^ Try editing this code!

word_counts_to_graph = uk_goodword_counts
top_words_data = word_counts_to_graph[:top_words_number]
fig, ax = plt.subplots()
ax.barh(top_words_data.index[::-1], top_words_data.values[::-1])
plt.show()


There are several extra features that we can add to charts like this. Let's add a title for the chart, change the colour of the bars from the default blue, and display the size of each bar as a label on the bar itself.

In [None]:
# Adding extra features

# vvv Try editing this code!

# This value is the number of most frequent words we'll include in our chart:
top_words_number = 10
# This value is the colour of the bars:
bar_colour = "red"
# This value will be the title that appears at the top:
chart_title = "Most common words in UK headlines"

# ^^^ Try editing this code!


word_counts_to_graph = uk_goodword_counts

top_words_data = word_counts_to_graph[:top_words_number]
fig, ax = plt.subplots()
bars = ax.barh(top_words_data.index[::-1], top_words_data.values[::-1], color=bar_colour)
ax.bar_label(bars, label_type='center')
ax.set_title(chart_title)
plt.show()


This is a useful piece of code that we might want to try running with various different options. In Python, we can write our own function that reads a list of input values, and always performs the same actions with those values. Here, we'll want a function that always produces a horizontal bar chart by using these values that we can change:
- A list of word counts
- The number of top words from our data set
- A title for our chart
- A colour for the bars in our chart

Running this code won't immediately produce any output for us to see. Instead, running this code will just define a new function named `make_word_bar_chart` that we'll be able to use in later code.

In [None]:
# Defining a function for making our word chart

def make_word_bar_chart(word_counts_to_graph, top_words_number, chart_title, bar_colour):
    top_words_data = word_counts_to_graph[:top_words_number]
    fig, ax = plt.subplots()
    bars = ax.barh(top_words_data.index[::-1], top_words_data.values[::-1], color=bar_colour)
    ax.bar_label(bars, label_type='center')
    ax.set_title(chart_title)
    plt.show()

Now we can use the function we just wrote, and re-generate our previous graph with just a single line of code.

In [None]:
# Creating a bar chart with our make_word_bar_chart function

make_word_bar_chart(uk_goodword_counts, 10, "Most common words in UK headlines", "red")

We can also re-use our function to make a different graph, of the US word counts instead. Try this by running the code below. You can also change the input values to specify the UK or US data set, the number of top words to display, the title for the chart, and the colour of the bars.

In [None]:
# We can also re-use our function to make a different graph, of the US word counts instead

# vvv Try editing this code!
make_word_bar_chart(us_goodword_counts, 12, "Top words in US headlines", "cyan")
# ^^^ Try editing this code!
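
Functions can also give some of their inputs a default value, so that we only need to mention the options we want to change. The sketch below shows this with a hypothetical text-only function named `make_text_bar_chart`. It is not part of this notebook, it draws its bars with repeated symbols instead of a real chart, and the word counts are made up.

```python
# A hypothetical text-only cousin of make_word_bar_chart, written to show how
# default values and keyword arguments work. It is not part of the notebook.
def make_text_bar_chart(word_counts, top_words_number=3, chart_title="Top words", bar_symbol="#"):
    # Sort the (word, count) pairs from largest to smallest count.
    ranked = sorted(word_counts.items(), key=lambda pair: pair[1], reverse=True)
    lines = [chart_title]
    for word, count in ranked[:top_words_number]:
        # Repeat the symbol once per appearance to draw the bar.
        lines.append(word + " " + bar_symbol * count + " " + str(count))
    return lines

toy_counts = {"storm": 3, "north": 1, "warning": 2, "rain": 5}  # made-up numbers

# Only the first input is required; the others fall back to their defaults.
default_chart = make_text_bar_chart(toy_counts)
print("\n".join(default_chart))

# Keyword arguments let us change just the options we care about.
custom_chart = make_text_bar_chart(toy_counts, chart_title="Weather words", bar_symbol="*")
print("\n".join(custom_chart))
```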

# Word Clouds

We can also make word clouds that show the most common words. The algorithm for creating these word clouds has an element of randomness, so you can run the code multiple times to generate a slightly different-looking word cloud each time.

In [None]:
# Word cloud for UK headlines

# dropna() skips any empty entries left by headlines that had no words after normalising
text_for_cloud = " ".join(uk_goodwords_df['WORDS'].dropna().tolist())
wcloud = WordCloud(normalize_plurals=False).generate(text_for_cloud)
plt.figure()
plt.imshow(wcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
# Word cloud for US headlines

# dropna() skips any empty entries left by headlines that had no words after normalising
text_for_cloud = " ".join(us_goodwords_df['WORDS'].dropna().tolist())
wcloud = WordCloud(normalize_plurals=False).generate(text_for_cloud)
plt.figure()
plt.imshow(wcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

# Closing Comments

(Any parting thoughts)