
# LELA32052 Computational Linguistics Week 1
## What is Python?

When we talk about data in computational linguistics we usually mean text data or audio files. While this data is meaningful to us, to the computer data is always just sequences of numbers. So to a computer the character i is 1101001. Fortunately, we never have to engage with this numeric representation - we can input an i into a computer and it will be converted into the computer's language for us.
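If you are curious, Python itself can show you this numeric representation. The short sketch below uses two built-in functions that are not otherwise covered in this seminar, ord() and format(), to look up the number behind the character i and write it in binary.

```python
# ord() gives the number the computer stores for a character
code_point = ord("i")
print(code_point)   # 105

# format(..., "b") writes that number in binary (base 2)
binary = format(code_point, "b")
print(binary)       # 1101001
```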

A similar situation arises with the instructions that a computer applies to data. The instructions that the computer sees are in a purely numeric format known as machine code. Fortunately, we as users do not need to engage with machine code. Instead, we use a programming language: a language that is much easier to read and write, and which is converted to machine code for us.

The programming language we will employ for this module is called Python.

## Variables and datatypes
Data in Python programs is manipulated using variables. A variable is like a box in which you can store information of any kind. You assign data to variables using the assignment operator =. For example, you can store the number 5 in the variable i as follows:

In [5]:
i = 5

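Once stored, these variables can be manipulated, for example using the standard mathematical operators for addition "+", subtraction "-", multiplication "*" and division "/". As in ordinary arithmetic, multiplication and division are carried out before addition and subtraction. For example: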

In [6]:
i+1*3/2

6.5

Data in Python has different types. i, as a whole number, is an integer, or int. The output from the operation above, since it contains a decimal point, is a floating point number, or float. There are other types of numbers in Python, but it is these two you will work with most.
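If you are ever unsure what type a value has, the built-in function type() will tell you. The following is a small sketch using the same numbers as above; the variable name result is just for illustration.

```python
i = 5
result = i + 1 * 3 / 2   # 1 * 3 / 2 is worked out first, then added to i

print(type(i))        # <class 'int'>
print(type(result))   # <class 'float'>

# Division with / always gives a float, even when the answer is a whole number
print(4 / 2)          # 2.0
```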

Variables can also be used to store a letter or a sequence of letters, known as a string:

In [7]:
j = "My name is "

The strings can be combined using the "+" symbol which when applied to strings functions as a concatenation operator:

In [8]:
k = j + "Jude Bellingham"

Once you have stored information in a variable, you can print it as follows:

In [9]:
print(k)

My name is Jude Bellingham


You can also print the result of an expression directly, without storing it in a variable first:

In [10]:
print(j + "Jude Bellingham")

My name is Jude Bellingham


The "+" operator, however, expects the values it concatenates to be of the same data type, so if you mix strings and numbers you will get an error message:

In [12]:
print(k + " and my shirt number is " + i)

TypeError: can only concatenate str (not "int") to str

You can avoid this by explicitly converting the type of your variables, as in the following example, where str(i) converts the integer i to a string. Similarly, to change an object stored in a variable x to an int or a float you would write int(x) or float(x) respectively.

In [13]:
print(k + " and my shirt number is " + str(i))

My name is Jude Bellingham and my shirt number is 5
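Conversion works in the other direction too. As a small sketch (the variable names here are made up for illustration), int() and float() turn a string of digits into a number you can do arithmetic with:

```python
shirt_text = "22"                # a string, not a number
shirt_number = int(shirt_text)   # now an int
print(shirt_number + 1)          # 23

half_text = "1.5"
half = float(half_text)          # now a float
print(half * 2)                  # 3.0

# Without conversion, + joins the strings instead of adding numbers
print(shirt_text + "1")          # 221
```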


Exercise: Add to the code above so that it prints the statement "My name is Jude Bellingham and my shirt number is 5 for Real Madrid but 22 for England."

In [19]:
q = 22
print(k + " and my shirt number is " + str(i) + " for Real Madrid but " + str(q) + " for England.")

My name is Jude Bellingham and my shirt number is 5 for Real Madrid but 22 for England.


## A brief mention of functions

The commands print() and str() we used above are what are known as functions. These are operations that can be applied to entities in our code. They are a very important part of Python programming, and later in the module we will look at writing our own functions. For now I just want to mention that there are a number of useful built-in functions that can be applied to strings. A list is here: https://www.w3schools.com/python/python_ref_string.asp

Here are a couple of examples:

In [48]:
name="tom"
str.capitalize(name)

'Tom'

In [23]:
str.upper(name)

'TOM'

Exercise: Use a function to turn the name tom into the name tommy

In [57]:
name.replace(name,name+"my")

'tommy'

In [59]:
str.replace(name,name,name+"my")

'tommy'
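Both answers above do the same thing. A string function can be written either as str.function(value) or, more commonly, as value.function(). A brief sketch of the equivalence, which also shows that these functions return a new string rather than changing the original:

```python
name = "tom"

# The two spellings call the same function
print(str.upper(name))   # TOM
print(name.upper())      # TOM

# The original string is untouched: name still holds "tom"
print(name)              # tom

# To keep the result, assign it to a variable
name = name.capitalize()
print(name)              # Tom
```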

## Lists

So far I have represented sentences as single strings. For most purposes in computational linguistics we don't want to do this. Instead, we want to represent them in ways that recognize the word boundaries. In order to do this we often represent sentences as lists of words, each of which is represented as a string. A list of strings can be created by putting words (as strings in quotes) inside square brackets, as in the following example:

In [60]:
sentence = ["this", "is", "a", "sentence"]

We can print lists of words as a single string when needed as follows. The first argument to str.join, given in quotes, sets the string to be placed between the elements of the list. Here we use a space.

In [61]:
print(str.join(" ", sentence))

this is a sentence
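The separator does not have to be a space. As a quick sketch, here is the same list joined with a few other separators, written both as str.join(separator, list) and in the more common separator.join(list) form:

```python
sentence = ["this", "is", "a", "sentence"]

print("-".join(sentence))         # this-is-a-sentence
print("".join(sentence))          # thisisasentence
print(str.join(", ", sentence))   # this, is, a, sentence
```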


However, we can also select elements from within the list. The entries in a list are indexed numerically, starting with zero. So the first element is sentence[0] and the last element of this four-element list is sentence[3]. These can then be selected for printing as follows:

In [62]:
print(sentence[2])

a


We can also select subsequences of entries by specifying a range as follows. Notice that the entry at the second number in the range isn't included, so 0:2 means from 0 up to, but not including, 2.

In [63]:
print(str.join(" ", sentence[0:2]))

this is
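A few more variations on indexing and ranges may be useful. This sketch assumes the same four-word list as above; leaving out a number in the range means "from the start" or "to the end", and negative numbers count back from the end of the list.

```python
sentence = ["this", "is", "a", "sentence"]

print(sentence[1:3])   # ['is', 'a']
print(sentence[:2])    # ['this', 'is']       (start defaults to 0)
print(sentence[2:])    # ['a', 'sentence']    (end defaults to the end of the list)
print(sentence[-1])    # sentence             (last element)
print(len(sentence))   # 4                    (number of elements)
```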


This allows us to, for example, insert elements in the middle of sentences as follows:

In [64]:
print(str.join(" ", sentence[0:3]) + " short " + sentence[3])

this is a short sentence


Exercise: Create a sentence in list form for the sequence "George is a cat". Then use sublist selection, as in the example above, to produce the sentence "George is a big cat".

In [69]:
sentence = ["George","is","a","cat."]
print(str.join(" ", sentence))
print(str.join(" ",sentence[0:3])+" big "+sentence[3])

George is a cat.
George is a big cat.


Like strings, lists have their own built-in functions that you can make use of. A list is here: https://www.w3schools.com/python/python_ref_list.asp
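As a brief illustration, here are three of those list functions: insert(), append() and index(). This is only a sketch of what is available, using the sentence from the exercise without its full stop.

```python
sentence = ["George", "is", "a", "cat"]

sentence.insert(3, "big")      # put "big" at position 3, moving "cat" along
print(sentence)                # ['George', 'is', 'a', 'big', 'cat']

sentence.append("today")       # add a word to the end
print(sentence)                # ['George', 'is', 'a', 'big', 'cat', 'today']

print(sentence.index("cat"))   # 4
```

Unlike the string functions we saw earlier, these change the list itself rather than returning a new one.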

## Loading data

Computational linguistics involves handling text data. We saw that we can type in a string of characters, or indeed lists of words, above. However instead of typing in data, we often want to load it from files. We are going to use this file: https://www.gutenberg.org/files/2554/2554-0.txt

First we can download it to our workspace:

In [70]:
!wget https://www.gutenberg.org/files/2554/2554-0.txt

--2025-01-27 17:35:33--  https://www.gutenberg.org/files/2554/2554-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1159924 (1.1M) [text/plain]
Saving to: ‘2554-0.txt’


2025-01-27 17:35:33 (11.9 MB/s) - ‘2554-0.txt’ saved [1159924/1159924]



Then we read it into Python.

In [71]:
# Open the file, read all of its text into one string, then close it
with open('2554-0.txt', encoding='utf-8') as f:
    raw = f.read()
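To check that a file has loaded, you can look at how long the resulting string is and print its first few characters. The sketch below is self-contained: rather than relying on the downloaded novel, it first writes a tiny file called example.txt (a made-up name for illustration) and then reads it back with open() and read().

```python
# Write a small file so that this example runs on its own
with open("example.txt", "w", encoding="utf-8") as f:
    f.write("He was crushed by poverty.\nThis evening, however, he went out.\n")

# Read the whole file into one string; "with" closes the file afterwards
with open("example.txt", encoding="utf-8") as f:
    raw = f.read()

print(type(raw))   # <class 'str'>
print(len(raw))    # number of characters, including line breaks
print(raw[:6])     # He was
```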

We will then extract a single chapter of the novel to work with:

In [73]:
chapter_one = raw[5464:23725]
print(chapter_one)

some time past he had been in an overstrained irritable condition,
verging on hypochondria. He had become so completely absorbed in
himself, and isolated from his fellows that he dreaded meeting, not
only his landlady, but anyone at all. He was crushed by poverty, but the
anxieties of his position had of late ceased to weigh upon him. He had
given up attending to matters of practical importance; he had lost all
desire to do so. Nothing that any landlady could do had a real terror
for him. But to be stopped on the stairs, to be forced to listen to her
trivial, irrelevant gossip, to pestering demands for payment, threats
and complaints, and to rack his brains for excuses, to prevaricate, to
lie--no, rather than that, he would creep down the stairs like a cat and
slip out unseen.

This evening, however, on coming out into the street, he became acutely
aware of his fears.

“I want to attempt a thing _like that_ and am frightened by these
trifles,” he thought, with an odd smile. “Hm... yes,

As you will notice, this reads the text in as a single string. As discussed before, we usually want to represent sentences in a way that reflects word boundaries. The simplest way to do this is to split on whitespace. As we will see, there are all sorts of problems with doing this, but for now we'll ignore them and use the built-in string function split(). To make sure full stops are treated as separate tokens, we will also use the replace function to put a space between them and the ends of words.

In [74]:
chapter_one = str.replace(chapter_one, "."," .")
chapter_one_tokens = str.split(chapter_one)
print(chapter_one_tokens)

['some', 'time', 'past', 'he', 'had', 'been', 'in', 'an', 'overstrained', 'irritable', 'condition,', 'verging', 'on', 'hypochondria', '.', 'He', 'had', 'become', 'so', 'completely', 'absorbed', 'in', 'himself,', 'and', 'isolated', 'from', 'his', 'fellows', 'that', 'he', 'dreaded', 'meeting,', 'not', 'only', 'his', 'landlady,', 'but', 'anyone', 'at', 'all', '.', 'He', 'was', 'crushed', 'by', 'poverty,', 'but', 'the', 'anxieties', 'of', 'his', 'position', 'had', 'of', 'late', 'ceased', 'to', 'weigh', 'upon', 'him', '.', 'He', 'had', 'given', 'up', 'attending', 'to', 'matters', 'of', 'practical', 'importance;', 'he', 'had', 'lost', 'all', 'desire', 'to', 'do', 'so', '.', 'Nothing', 'that', 'any', 'landlady', 'could', 'do', 'had', 'a', 'real', 'terror', 'for', 'him', '.', 'But', 'to', 'be', 'stopped', 'on', 'the', 'stairs,', 'to', 'be', 'forced', 'to', 'listen', 'to', 'her', 'trivial,', 'irrelevant', 'gossip,', 'to', 'pestering', 'demands', 'for', 'payment,', 'threats', 'and', 'complaints,