### Data types

In Python, especially in DH, we are often manipulating data. You often need to ask yourself: what data type is it currently, and what data type do I need it to be?

When you create a *variable* Python infers its type from the value you give it, which then affects what you can do with it. You can use string methods on strings, as we did in week 1, but integers have their own operations and methods.

The main types are: *string*, *integer*, *float*, *Boolean*.

In [1]:
name = "Taylor Swift"
age = 35
shoe_size = 9.0
is_famous = True

In [2]:
print(type(name))
print(type(age))
print(type(shoe_size))
print(type(is_famous))

<class 'str'>
<class 'int'>
<class 'float'>
<class 'bool'>
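
The way a type affects what you can do with a value shows up even with simple operators. The following is a small sketch; the variable names are made up for illustration and are not used elsewhere in these notes.

```python
# The same operator does different things depending on the type.
text_sum = "19" + "89"      # strings are joined end to end
number_sum = 19 + 89        # integers are added
repeated = "ha" * 3         # a string times an integer repeats the string

print(text_sum)    # 1989
print(number_sum)  # 108
print(repeated)    # hahaha

# Methods belong to a type: strings have .upper(), integers do not.
print("emma".upper())        # EMMA
print(hasattr(35, "upper"))  # False
```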


If we want to use an operation or method that belongs to a different data type, we can convert the value to that type. Lots of data in DH comes as strings, even when it looks like numbers.

In [3]:
Swift_born = "1989"
current_year = 2025

In [4]:
current_year - Swift_born

TypeError: unsupported operand type(s) for -: 'int' and 'str'

In [5]:
current_year - int(Swift_born)

36

The subtraction ignores whether her birthday has already happened this year. How can we make the statement correct without doing a fancy calculation based on Swift's exact date of birth?

In [6]:
print(f"Taylor Swift is either {current_year - int(Swift_born)} or {current_year - int(Swift_born) - 1}.")

Taylor Swift is either 36 or 35.
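
For comparison, the exact calculation is not very long. The sketch below assumes a date of birth of 13 December 1989 and uses fixed reference dates rather than today's date, so that the result is reproducible. `age_on` is a hypothetical helper name, not something defined earlier in these notes.

```python
from datetime import date

def age_on(born: date, today: date) -> int:
    """Return age in whole years on the given day."""
    # Subtract one if this year's birthday has not happened yet.
    # Comparing (month, day) tuples checks that in one step.
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    return today.year - born.year - (0 if had_birthday else 1)

swift_born = date(1989, 12, 13)                # assumed date of birth
print(age_on(swift_born, date(2025, 11, 3)))   # 35
print(age_on(swift_born, date(2025, 12, 13)))  # 36
```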


Python also has container datatypes known as *Collections*. The ones to be comfortable with are *list*, *tuple*, *set* and *dictionary*.

Lists go in square brackets. They're a good data type to learn first because they match our natural intuition for lists. If you have a to-do list you can delete completed items and add new ones. If a task becomes more important you can move it up the list. In Python's terminology, this means they are *mutable*. They are also *ordered*. In the example below the order is chronological.

A list can also mix different types of data.
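
The to-do list analogy can be written out directly. This is a sketch with made-up tasks; the values in `mixed` are arbitrary and only there to show that one list can hold several types.

```python
to_do = ["mark essays", "email supervisor", "renew library books"]

to_do.remove("email supervisor")      # delete a completed item
to_do.append("book seminar room")     # add a new item at the end

# Move a task up the list: take it out, then insert it at position 0.
urgent = to_do.pop(to_do.index("renew library books"))
to_do.insert(0, urgent)
print(to_do)
# ['renew library books', 'mark essays', 'book seminar room']

# A list can hold a mix of types, though this is rarely a good idea.
mixed = ["Emma", 6, 4.5, True]
for item in mixed:
    print(type(item).__name__)   # str, int, float, bool (one per line)
```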

In [7]:
Austen_novels = ["Sense and Sensibility", "Pride and Prejudice", "Mansfield Park", "Emma", "Northanger Abbey", "Persuasion"]

In [9]:
Austen_novels[0]

'Sense and Sensibility'

In [8]:
Austen_novels[1]

'Pride and Prejudice'

Indexing in a Python list (and counting in Python generally) **starts at 0.** Getting this wrong is very common in coding, not just in Python. It's called an off-by-one error.
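
A short sketch of how the numbering works in practice, using a shortened list with an illustrative name (`novels`):

```python
novels = ["Sense and Sensibility", "Pride and Prejudice", "Mansfield Park"]

print(len(novels))    # 3 items...
print(novels[0])      # ...numbered 0, 1, 2
print(novels[2])      # the last item is at len(novels) - 1
print(novels[-1])     # negative indexes count from the end

# The classic off-by-one mistake: asking for item number len(novels).
try:
    novels[len(novels)]
except IndexError as error:
    print(f"IndexError: {error}")   # IndexError: list index out of range
```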



To make a new list, set it up with empty square brackets, `[]`, and then add items to it.

In [10]:
unfinished = []

In [11]:
unfinished.append("Sanditon")
unfinished.append("The Watsons")

In [12]:
unfinished

['Sanditon', 'The Watsons']

In [13]:
all_works = Austen_novels + unfinished

In [14]:
all_works

['Sense and Sensibility',
 'Pride and Prejudice',
 'Mansfield Park',
 'Emma',
 'Northanger Abbey',
 'Persuasion',
 'Sanditon',
 'The Watsons']

It's generally good to keep the older lists rather than mutate one list. We could do:

`Austen_novels = Austen_novels + unfinished`

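But then we have no easy way to recover just the finished novels. The same advice applies to data in Python generally: keep the original and build new variables from it, unless the datasets are so big that keeping several copies in memory becomes a problem.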
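
The difference between building a new list and changing an existing one can be sketched as follows. The names `finished`, `fragments` and `working_copy` are illustrative.

```python
finished = ["Emma", "Persuasion"]
fragments = ["Sanditon", "The Watsons"]

# `+` builds a brand-new list and leaves both originals untouched.
everything = finished + fragments
print(finished)      # ['Emma', 'Persuasion']

# .extend() changes a list in place.
# Working on a copy keeps the original intact.
working_copy = finished.copy()
working_copy.extend(fragments)
print(working_copy)  # ['Emma', 'Persuasion', 'Sanditon', 'The Watsons']
print(finished)      # still ['Emma', 'Persuasion']
```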

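As we've seen in previous weeks, we can loop over a list with a `for` loop. This is called *iterating* over the list: Python does the work of making sure every item in the list is handled. (Strictly speaking, the term *list comprehension* refers to a compact one-line form of such a loop that builds a new list.)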

In [15]:
for novel in Austen_novels:
    print(f"Austen wrote {novel}.")

Austen wrote Sense and Sensibility.
Austen wrote Pride and Prejudice.
Austen wrote Mansfield Park.
Austen wrote Emma.
Austen wrote Northanger Abbey.
Austen wrote Persuasion.
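
In Python's documentation the term *list comprehension* is used for a compact expression that runs a loop and collects the results in a new list. The following sketch shows the loop form and the comprehension form side by side; the list `novels` and the other names are illustrative only.

```python
novels = ["Emma", "Persuasion", "Mansfield Park"]

# Loop form: start with an empty list and append to it.
shouted = []
for novel in novels:
    shouted.append(novel.upper())

# Comprehension form: the same result in one expression.
shouted_again = [novel.upper() for novel in novels]

print(shouted_again)             # ['EMMA', 'PERSUASION', 'MANSFIELD PARK']
print(shouted == shouted_again)  # True
```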


We can add logic to the loop, to split items into different lists, filter out items, or just print them out slightly differently. This if/else syntax can be used in Python generally, not just inside loops.

In [16]:
for novel in Austen_novels:
    if novel == Austen_novels[0]:
        print(f"First, Austen published {novel}.")
    elif novel == Austen_novels[-1]:
        print(f"After Austen's death {novel} was published.")
    else:
        print(f"Then she published {novel}.")

First, Austen published Sense and Sensibility.
Then she published Pride and Prejudice.
Then she published Mansfield Park.
Then she published Emma.
Then she published Northanger Abbey.
After Austen's death Persuasion was published.
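
The same pattern can split items into different lists instead of printing them. This sketch sorts the titles by an arbitrary criterion chosen for illustration (whether the title contains " and "); the list names are made up.

```python
novels = ["Sense and Sensibility", "Pride and Prejudice", "Mansfield Park",
          "Emma", "Northanger Abbey", "Persuasion"]

paired_titles = []   # titles of the form "X and Y"
other_titles = []

for novel in novels:
    if " and " in novel:
        paired_titles.append(novel)
    else:
        other_titles.append(novel)

print(paired_titles)  # ['Sense and Sensibility', 'Pride and Prejudice']
print(other_titles)   # ['Mansfield Park', 'Emma', 'Northanger Abbey', 'Persuasion']
```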


Lists are ordered, mutable and can have duplicates. They are often a good choice if you're not sure which structure you want.

But if we want to compare the vocabulary used in different Austen novels, e.g. to find out which words are *only* used in Northanger Abbey, lists would be a painfully slow way to do it. For this task **we don't care about the order of the words and we don't care about duplicates.** This calls for a **set**. A set in Python is mutable, unordered and has no duplicates. Sets are fast partly because they are unordered: Python stores each item where it can find it again directly (by hashing) instead of keeping everything in sequence. Think about how quickly people board a bus (unordered) compared to a plane (ordered).

In [17]:
some_numbers = [1, 5, 3, 1, 5, 2, 3, 1, 1, 9, 2, 3, 1, 9, 2, 1, 3]
set(some_numbers)

{1, 2, 3, 5, 9}

In [18]:
set(some_numbers)[0] # you can't do this with sets!

TypeError: 'set' object is not subscriptable
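
What sets do support is membership testing and adding items. If you need a particular position, convert the set to a list first. A brief sketch, with illustrative names:

```python
some_numbers = [1, 5, 3, 1, 5, 2, 3, 1, 1, 9]
unique = set(some_numbers)

print(3 in unique)        # True: membership tests are what sets are good at
print(len(unique))        # 5

unique.add(7)             # sets are mutable...
unique.add(7)             # ...but adding a duplicate changes nothing
print(len(unique))        # 6

# To index into the values, convert to a list first, e.g. a sorted one.
in_order = sorted(unique)
print(in_order[0])        # 1
print(in_order)           # [1, 2, 3, 5, 7, 9]
```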

Let's get six sets of the words in the Austen novels.

In [19]:
!wget https://raw.githubusercontent.com/jonathanblaney/RSE-dropins-2025-6/refs/heads/main/week1-plain-text/emma.txt
!wget https://raw.githubusercontent.com/jonathanblaney/RSE-dropins-2025-6/refs/heads/main/week1-plain-text/persuasion.txt
!wget https://raw.githubusercontent.com/jonathanblaney/RSE-dropins-2025-6/refs/heads/main/week3-cli/mansfield_park.txt
!wget https://raw.githubusercontent.com/jonathanblaney/RSE-dropins-2025-6/refs/heads/main/week3-cli/northanger_abbey.txt
!wget https://raw.githubusercontent.com/jonathanblaney/RSE-dropins-2025-6/refs/heads/main/week3-cli/pride_and_prejudice.txt
!wget https://raw.githubusercontent.com/jonathanblaney/RSE-dropins-2025-6/refs/heads/main/week3-cli/sense_and_sensibility.txt

--2025-11-03 15:26:27--  https://raw.githubusercontent.com/jonathanblaney/RSE-dropins-2025-6/refs/heads/main/week1-plain-text/emma.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 933759 (912K) [text/plain]
Saving to: ‘emma.txt’


2025-11-03 15:26:27 (48.7 MB/s) - ‘emma.txt’ saved [933759/933759]

--2025-11-03 15:26:27--  https://raw.githubusercontent.com/jonathanblaney/RSE-dropins-2025-6/refs/heads/main/week1-plain-text/persuasion.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 497612 (486K) [text/plain]
Saving to: ‘persua

In [20]:
with open('emma.txt', 'r') as f:
    emma = f.read()
with open('pride_and_prejudice.txt', 'r') as f:
    pride = f.read()
with open('sense_and_sensibility.txt', 'r') as f:
    sense = f.read()
with open('mansfield_park.txt', 'r') as f:
    mansfield = f.read()
with open('northanger_abbey.txt', 'r') as f:
    northanger = f.read()
with open('persuasion.txt', 'r') as f:
    persuasion = f.read()

In [21]:
emma_words = set(emma.split())
pride_words = set(pride.split())
sense_words = set(sense.split())
mansfield_words = set(mansfield.split())
northanger_words = set(northanger.split())
persuasion_words = set(persuasion.split())

In [22]:
print(len(emma_words))
print(len(pride_words))
print(len(sense_words))
print(len(mansfield_words))
print(len(northanger_words))
print(len(persuasion_words))

18012
14702
13801
16717
11851
11457


With these sets we can now do set arithmetic. We can find out which words appear in all Austen novels, which appear in Sense and Sensibility and Emma but not in Pride and Prejudice...any combination we're interested in.

To find out which words are only in Northanger Abbey, we ask for Northanger Abbey minus all the others.
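
Before running this on whole novels, it may help to see the operators on tiny sets. The three vocabularies below are invented stand-ins, not real data from the texts.

```python
# Tiny made-up vocabularies standing in for the real novels.
abbey = {"gothic", "abbey", "muslin", "letter"}
emma_demo = {"picnic", "letter", "muslin"}
pride_demo = {"letter", "ball", "pride"}

print(abbey & emma_demo & pride_demo)      # in all three: {'letter'}
print((abbey & emma_demo) - pride_demo)    # in the first two only: {'muslin'}

# In the first set and in neither of the others.
# sorted() just makes the printed order predictable.
print(sorted(abbey - emma_demo - pride_demo))   # ['abbey', 'gothic']

# The same difference written as a method call taking several sets.
print(sorted(abbey.difference(emma_demo, pride_demo)))   # ['abbey', 'gothic']
```

`&` is intersection, `-` is difference and `|` (not shown) is union.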

In [23]:
northanger_words - persuasion_words - sense_words - mansfield_words - pride_words - persuasion_words

{'canary-bird,',
 'lantern—do',
 'pitiless',
 'stamp.”',
 'dwell;',
 'reviewers',
 'see-saw',
 'trouble,”',
 'muslin.”',
 'riot?”',
 'tolerated,',
 'phaetons,',
 'concerned;',
 'pitiful.',
 'ornamental',
 'idleness;',
 'smirk,',
 'refined',
 'puce-coloured',
 'Eleanor,”',
 'china.',
 '[1]',
 'jig',
 'mournfully',
 'impediment,',
 'purpose:',
 'sleeper.',
 'did—they',
 'testimony.',
 'females,',
 'unfounded,',
 'fed',
 'cavity',
 'York',
 'haste!',
 'her—and',
 'hunter.',
 'generality;',
 'remiss,',
 'melt',
 'included,',
 'touching',
 'parted—on',
 'contents,',
 'court;',
 'quarrels',
 'tidy',
 'unfavourably,',
 'journal!',
 'image,',
 'd——',
 'General”—to',
 '23',
 'specimens',
 'know”—twisting',
 'suite',
 'alleged,',
 'listened—the',
 'convent.',
 'room’s',
 'convent,',
 '_effect_',
 'awe,',
 'Leicestershire,',
 'entitled,',
 'eminence',
 'relator,',
 'Swiftly',
 '“Nobody’s,',
 'steal',
 'advice.”',
 'Allen.”',
 'dismay,',
 'divining',
 'demanded—Mr.',
 'else!—in',
 'conceive.',
 'h

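Some of these are a bit surprising. One reason is that the expression above subtracts `persuasion_words` twice and never subtracts `emma_words`, so words that also occur in Emma are not removed; the intended expression is `northanger_words - emma_words - pride_words - sense_words - mansfield_words - persuasion_words`. How can we do a quick check? Maybe via the command line!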

In [24]:
!grep "escort" *.txt

emma.txt:so that Emma found, on being escorted and followed into the second
emma.txt:Miss Bates and Miss Fairfax, escorted by the two gentlemen, walked into
mansfield_park.txt:sister in an article of such importance; but he escorted her, with the
mansfield_park.txt:soon made good. While she was gone Mr. Rushworth arrived, escorting his
northanger_abbey.txt:to escort them.
persuasion.txt:her. Lady Dalrymple and Miss Carteret, escorted by Mr Elliot and


In [25]:
!grep -w "escort" *.txt

northanger_abbey.txt:to escort them.


Quite a lot of DH is just counting stuff (and being analytical about what exactly we're counting, what those counts mean, what the limitations of counting are...). You don't need to know loads of Python to do it.
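
One limitation of the counts above is that `split()` keeps punctuation attached, so 'convent.' and 'convent,' count as two different words. A rough sketch of one way to tidy tokens before building a set follows. The helper name `clean_words` and the choice of characters to strip are assumptions for illustration, and lower-casing may or may not suit a given research question.

```python
import string

# Characters to trim from the ends of each token: ASCII punctuation plus
# curly double quotes, curly single quotes and the long dash.
STRIP_CHARS = string.punctuation + "\u201c\u201d\u2018\u2019\u2014"

def clean_words(text: str) -> set:
    """Split on whitespace, trim punctuation from token ends, lower-case."""
    words = set()
    for token in text.split():
        word = token.strip(STRIP_CHARS).lower()
        if word:                      # skip tokens that were only punctuation
            words.add(word)
    return words

sample = "The convent, the convent. “Eleanor,” she said; a see-saw!"
print(sorted(clean_words(sample)))
# ['a', 'convent', 'eleanor', 'said', 'see-saw', 'she', 'the']
```

Only the ends of each token are trimmed, so hyphenated words such as 'see-saw' survive intact.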

If we want a unique list of something while keeping the original order, how can we do it? Here's what might politely be called a naive implementation. Because we want the order, we stick with lists as our data structures.

The idea is to split the novel's words into two lists: first occurrences and duplicates. First create two empty lists.

In [26]:
first_occurrence = []
duplicates = []

We still have all of Persuasion in memory, but we don't have all the words in it as a list. Let's get that by splitting the text on whitespace (cell In [27], which was run before the preview in In [29]).

In [29]:
persuasion_all[124:300]

['EBOOK',
 'PERSUASION',
 '***',
 'Persuasion',
 'by',
 'Jane',
 'Austen',
 '(1818)',
 'Contents',
 'CHAPTER',
 'I.',
 'CHAPTER',
 'II.',
 'CHAPTER',
 'III.',
 'CHAPTER',
 'IV.',
 'CHAPTER',
 'V.',
 'CHAPTER',
 'VI.',
 'CHAPTER',
 'VII.',
 'CHAPTER',
 'VIII.',
 'CHAPTER',
 'IX.',
 'CHAPTER',
 'X.',
 'CHAPTER',
 'XI.',
 'CHAPTER',
 'XII.',
 'CHAPTER',
 'XIII.',
 'CHAPTER',
 'XIV.',
 'CHAPTER',
 'XV.',
 'CHAPTER',
 'XVI.',
 'CHAPTER',
 'XVII.',
 'CHAPTER',
 'XVIII.',
 'CHAPTER',
 'XIX.',
 'CHAPTER',
 'XX.',
 'CHAPTER',
 'XXI.',
 'CHAPTER',
 'XXII.',
 'CHAPTER',
 'XXIII.',
 'CHAPTER',
 'XXIV.',
 'CHAPTER',
 'I.',
 'Sir',
 'Walter',
 'Elliot,',
 'of',
 'Kellynch',
 'Hall,',
 'in',
 'Somersetshire,',
 'was',
 'a',
 'man',
 'who,',
 'for',
 'his',
 'own',
 'amusement,',
 'never',
 'took',
 'up',
 'any',
 'book',
 'but',
 'the',
 'Baronetage;',
 'there',
 'he',
 'found',
 'occupation',
 'for',
 'an',
 'idle',
 'hour,',
 'and',
 'consolation',
 'in',
 'a',
 'distressed',
 'one;',
 'there',
 'h

In [27]:
persuasion_all = persuasion.split()

Python has a built-in `in` operator, which checks if an item is in a list. So we can loop through all the words in Persuasion, check if they're already in `first_occurrence` and add them if not. We don't really need the duplicates list but let's use it for testing the results.

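Lists are slow for this, especially inside loops: to answer `word in first_occurrence` Python may have to compare the word with every item already in the list.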

In [30]:
for word in persuasion_all:
    if word in first_occurrence:
        duplicates.append(word)
    else:
        first_occurrence.append(word)

In [31]:
first_occurrence[124:174]

['Sir',
 'Walter',
 'Elliot,',
 'Kellynch',
 'Hall,',
 'Somersetshire,',
 'was',
 'a',
 'man',
 'who,',
 'his',
 'own',
 'amusement,',
 'never',
 'took',
 'up',
 'any',
 'book',
 'but',
 'Baronetage;',
 'there',
 'he',
 'found',
 'occupation',
 'an',
 'idle',
 'hour,',
 'consolation',
 'distressed',
 'one;',
 'faculties',
 'were',
 'roused',
 'into',
 'admiration',
 'respect,',
 'contemplating',
 'limited',
 'remnant',
 'earliest',
 'patents;',
 'unwelcome',
 'sensations,',
 'arising',
 'from',
 'domestic',
 'affairs',
 'changed',
 'naturally',
 'pity']

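This becomes slow on a long text because every `in` check compares the word, string by string, with the items already collected.

The result is a deduplicated list of strings; the repeats end up in `duplicates`.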

In [32]:
duplicates[40:50]

['CHAPTER',
 'CHAPTER',
 'CHAPTER',
 'CHAPTER',
 'CHAPTER',
 'CHAPTER',
 'CHAPTER',
 'CHAPTER',
 'CHAPTER',
 'CHAPTER']
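
A middle way keeps a list for the order but uses a set for the membership check, which avoids scanning the growing list on every word. This is a sketch on a short made-up sentence; `seen`, `first_seen` and `repeats` are illustrative names.

```python
words = "there he found occupation for an idle hour and there he found consolation".split()

seen = set()            # fast membership checks
first_seen = []         # keeps the order of first appearance
repeats = []

for word in words:
    if word in seen:
        repeats.append(word)
    else:
        seen.add(word)
        first_seen.append(word)

print(first_seen)
# ['there', 'he', 'found', 'occupation', 'for', 'an', 'idle', 'hour', 'and', 'consolation']
print(repeats)          # ['there', 'he', 'found']
```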

But this is a common task! For common tasks, we can rely on Python to have its own optimised solution. Here it is.

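It is quicker because a dictionary finds a key by its hash rather than by scanning through every item, so checking whether a word has been seen already is much more efficient.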

In [33]:
ordered_unique = list(dict.fromkeys(persuasion_all))

In [34]:
ordered_unique[124:174]

['Sir',
 'Walter',
 'Elliot,',
 'Kellynch',
 'Hall,',
 'Somersetshire,',
 'was',
 'a',
 'man',
 'who,',
 'his',
 'own',
 'amusement,',
 'never',
 'took',
 'up',
 'any',
 'book',
 'but',
 'Baronetage;',
 'there',
 'he',
 'found',
 'occupation',
 'an',
 'idle',
 'hour,',
 'consolation',
 'distressed',
 'one;',
 'faculties',
 'were',
 'roused',
 'into',
 'admiration',
 'respect,',
 'contemplating',
 'limited',
 'remnant',
 'earliest',
 'patents;',
 'unwelcome',
 'sensations,',
 'arising',
 'from',
 'domestic',
 'affairs',
 'changed',
 'naturally',
 'pity']
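
`dict.fromkeys` works for this because dictionary keys are unique and, from Python 3.7, dictionaries keep their keys in insertion order. A small sketch on a made-up word list shows the intermediate dictionary and checks the result against the loop approach.

```python
words = ["a", "man", "who", "a", "book", "man", "a"]

as_dict = dict.fromkeys(words)
print(as_dict)
# {'a': None, 'man': None, 'who': None, 'book': None}

ordered_unique_demo = list(as_dict)   # list() of a dict gives its keys
print(ordered_unique_demo)            # ['a', 'man', 'who', 'book']

# Same answer as the naive loop, without the repeated list scans.
by_loop = []
for word in words:
    if word not in by_loop:
        by_loop.append(word)
print(by_loop == ordered_unique_demo)  # True
```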

#### Tuples

Tuples are like lists in that they are *ordered* but they are unlike lists in that they are *immutable*. Tuples go in round brackets.

In [None]:
austen_canon = ("Sense and Sensibility", "Pride and Prejudice", "Emma", "Mansfield Park", "Northanger Abbey", "Persuasion")

As long as we're not changing a tuple, we can treat it like a list, for example by looping over it.

In [None]:
for novel in austen_canon:
    print(f"{novel} is part of the Austen canon.")

But we cannot delete from or add to a tuple. It's fixed. If you try you'll get an error.

In [None]:
austen_canon.append("Sanditon")

You may not often want to create your own tuples, but Python will often give you tuples. Tuples are efficient, because their length is known in advance. You can always make a tuple into a list if you want to change it.

In [None]:
variable_canon = list(austen_canon)

In [None]:
type(variable_canon)
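
The points about tuples above can be tried out in a few lines. This is a sketch using a shortened tuple with an illustrative name (`canon`); `enumerate()` is just one example of a built-in that hands back tuples.

```python
canon = ("Emma", "Persuasion")

# Tuples have no .append(), so the attempt fails with an AttributeError.
try:
    canon.append("Sanditon")
except AttributeError as error:
    print(error)   # 'tuple' object has no attribute 'append'

# Python hands out tuples in many places, for example enumerate()
# yields (position, item) pairs.
pairs = list(enumerate(canon))
print(pairs)       # [(0, 'Emma'), (1, 'Persuasion')]

# A tuple can be unpacked into separate names in one step.
position, title = pairs[1]
print(position, title)   # 1 Persuasion

# "Adding" to a tuple really means building a new one.
longer = canon + ("Sanditon",)
print(longer)      # ('Emma', 'Persuasion', 'Sanditon')
```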

#### Dictionaries

Dictionaries are pairs of keys and values separated by colons. The syntax is very similar (but not identical) to JSON. Keys must be unique within a dictionary. Values can be any of the data types above, including lists, or other dictionaries, so dictionaries can be nested inside dictionaries.

In command line week we got the number of words in the Austen novels. How would we store that in Python? The most obvious way is a dictionary. The name of the novel is the key and the number of words is the value:

In [35]:
novel_counts = {"Mansfield Park": 162666, "Pride and Prejudice": 130414, "Sense and Sensibility": 121891, "Persuasion": 86366, "Northanger Abbey": 80259, "Emma": 160590}

In [36]:
type(novel_counts)

dict

In [37]:
print(f"Mansfield Park has {novel_counts['Mansfield Park']} words")

Mansfield Park has 162666 words
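
Values can themselves be dictionaries. The sketch below nests a small dictionary inside each entry; `novel_info` is an illustrative name, the word counts are the ones used above, and the years are first-publication dates added for the example.

```python
# Each value is itself a dictionary.
novel_info = {
    "Pride and Prejudice": {"year": 1813, "words": 130414},
    "Mansfield Park": {"year": 1814, "words": 162666},
}

print(novel_info["Mansfield Park"]["year"])    # 1814

# Asking for a missing key with [] raises KeyError; .get() returns a default.
print(novel_info.get("Sanditon"))              # None
print(novel_info.get("Sanditon", "not in the dictionary"))

# Assigning to a new key adds an entry; assigning to an old key replaces it.
novel_info["Sense and Sensibility"] = {"year": 1811, "words": 121891}
print(len(novel_info))                         # 3
```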


We can iterate over a dictionary but the syntax is slightly more complicated than when looping over a list or tuple. Conventionally the key and the value are given the names `k` and `v`.

In [38]:
for k, v in novel_counts.items():
    print(k, v)

Mansfield Park 162666
Pride and Prejudice 130414
Sense and Sensibility 121891
Persuasion 86366
Northanger Abbey 80259
Emma 160590
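
There are also ways to get at just the keys or just the values. A short sketch, repeating the dictionary so that the block runs on its own:

```python
novel_counts = {"Mansfield Park": 162666, "Pride and Prejudice": 130414,
                "Sense and Sensibility": 121891, "Persuasion": 86366,
                "Northanger Abbey": 80259, "Emma": 160590}

# Looping over the dictionary itself gives just the keys.
for title in novel_counts:
    print(title)

# .values() gives just the values, which is handy for totals.
total = sum(novel_counts.values())
print(total)   # 742186
```

More descriptive loop names than `k` and `v` (such as `title` and `count`) work just as well.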


Just like with the list loop above, we can use logic to filter.

In [39]:
print("The shorter novels are:\n")
for k, v in novel_counts.items():
    if v < 100_000:
        print(k)

The shorter novels are:

Persuasion
Northanger Abbey


In [40]:
print(sorted(novel_counts.items()))

[('Emma', 160590), ('Mansfield Park', 162666), ('Northanger Abbey', 80259), ('Persuasion', 86366), ('Pride and Prejudice', 130414), ('Sense and Sensibility', 121891)]


Calling `sorted()` on the items, as above, sorts them by key (alphabetically here). Sorting by values is more difficult. The general advice is to think of Python dictionaries primarily in terms of keys, not values. But here is one method.

In [41]:
{k: v for k, v in sorted(novel_counts.items(), key=lambda item: item[1])}

{'Northanger Abbey': 80259,
 'Persuasion': 86366,
 'Sense and Sensibility': 121891,
 'Pride and Prejudice': 130414,
 'Emma': 160590,
 'Mansfield Park': 162666}
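
A variation on the same idea, as a sketch: `operator.itemgetter(1)` does the same job as the lambda, `reverse=True` puts the largest value first, and `dict()` can rebuild the dictionary without a comprehension. The names `longest_first` and `top_two` are illustrative.

```python
from operator import itemgetter

novel_counts = {"Mansfield Park": 162666, "Pride and Prejudice": 130414,
                "Sense and Sensibility": 121891, "Persuasion": 86366,
                "Northanger Abbey": 80259, "Emma": 160590}

# sorted() returns a list of (key, value) tuples; dict() turns that list
# back into a dictionary, so the comprehension is optional.
longest_first = dict(sorted(novel_counts.items(), key=itemgetter(1), reverse=True))
print(list(longest_first))
# ['Mansfield Park', 'Emma', 'Pride and Prejudice', 'Sense and Sensibility',
#  'Persuasion', 'Northanger Abbey']

# If only the top few are needed, slice the sorted list.
top_two = sorted(novel_counts.items(), key=itemgetter(1), reverse=True)[:2]
print(top_two)   # [('Mansfield Park', 162666), ('Emma', 160590)]
```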

Getting the minimum and maximum is also a bit fiddly, but here is one way to do it.

In [42]:
print(f"The longest novel is {max(novel_counts, key=novel_counts.get)}.")
print(f"The shortest novel is {min(novel_counts, key=novel_counts.get)}.")

The longest novel is Mansfield Park.
The shortest novel is Northanger Abbey.
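
The `key=` argument matters because `max()` and `min()` on a dictionary compare the keys by default. The sketch below shows that default, and an alternative that returns the title and the count together; the variable names are illustrative.

```python
novel_counts = {"Mansfield Park": 162666, "Pride and Prejudice": 130414,
                "Sense and Sensibility": 121891, "Persuasion": 86366,
                "Northanger Abbey": 80259, "Emma": 160590}

# Without key=, max() picks the alphabetically last title.
print(max(novel_counts))   # Sense and Sensibility

# Comparing (title, count) pairs by the count gives both pieces at once.
longest_title, longest_count = max(novel_counts.items(), key=lambda item: item[1])
print(f"{longest_title} has {longest_count} words")
# Mansfield Park has 162666 words
```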


#### Group work/homework

- use a `for` loop to add the Austen novels with fewer than 12 characters in the title to a new list
- use f-strings to print the number of unique words in each novel (the lengths of the sets) in a nicely formatted way
- use f-strings to print the word counts of the novels in a nicely formatted way
- modify the logic of the dictionary loop (the one that filters the shorter novels) to describe each novel as 'long', 'medium' or 'short'
- extend the list logic so that you can find all the words in Northanger Abbey which are not in Persuasion, but in the order in which they first occur (don't worry about optimising this!)