# Preprocessing

## Character Encoding

There are countless ways that text can be turned into binary data to be stored on disk; these are known as character encodings. Fortunately for us, more and more textual data is being standardised to UTF-8 (the dominant Unicode character encoding), and the majority of web pages are now in UTF-8. An extra win is that Python 3 strings are Unicode throughout, and UTF-8 is the default for `str.encode()` and, on most Linux and macOS systems, for reading and writing files. This means that a lot of the time you will not need to worry about character encodings, switching between them, and potentially corrupting text.

Because ASCII is a subset of Unicode (it forms the first block), if you are only dealing with ASCII characters you don't even need to know that you are using Unicode. If you want to use characters outside ASCII, Python makes this easy, as everything is in Unicode. All 137,374 characters (as of [Unicode 11.0](http://www.unicode.org/versions/Unicode11.0.0/)) are available for you to use, so if you want a string with jalapeño in it, you can go right ahead:

In [1]:
print("This is a spicy jalapeño.")

This is a spicy jalapeño.


Because Python 3 is Unicode throughout, you can even use non-ASCII characters as variable names ([with some restrictions](https://docs.python.org/3/reference/lexical_analysis.html#identifiers)):

In [2]:
π = 3.14
print(π)

3.14


You can use the characters directly (as above) if you can get them on your keyboard, or pasted in from somewhere (e.g. [Unicode charts](https://unicode.org/charts/)). Alternatively, you can use special escape sequences:

- `\xhh` : character with 8-bit hex value (code points U+0000 - U+00FF, the latin-1 range)
- `\uhhhh` : character with 16-bit hex value (code points U+0000 - U+FFFF)
- `\Uhhhhhhhh` : character with 32-bit hex value (any code point; required for U+10000 and above)
- `\N{name}` : character with the given name in the Unicode database: https://unicode.org/charts/

In [3]:
a = "\xf1"
b = "\u03C0"
c = "\U0001F643"
d = "\N{SAMARITAN LETTER YUT}"

print(a,b,c,d)

ñ π 🙃 ࠉ


Whichever you use, these equate to the same Unicode codepoint within Python:

In [4]:
print("ñ" == "\xf1" == "\u00f1" == "\U000000f1" == "\N{LATIN SMALL LETTER N WITH TILDE}")

True


For I/O (e.g. reading/writing files), Python 3 uses UTF-8 by default on most systems (the default for files follows the platform's locale settings). UTF-8, as described in the lecture, is a variable-width (multi-byte) character encoding capable of representing all Unicode characters. If we use Python's `encode` method to transform a string into a sequence of bytes, we can see this:

In [5]:
r = "jalapeño"
print(r.encode("utf-8"))

b'jalape\xc3\xb1o'


Defining the encoding as "utf-8" is actually redundant as it is the default.

In [8]:
print(r.encode())

b'jalape\xc3\xb1o'


The `b'...'` prefix signifies a sequence of bytes.

When Python displays a bytes object, bytes in the ASCII range are shown as their ASCII character (as for j, a, l, a, p, e, o), and all other bytes are shown as hex escapes. In UTF-8, ñ is encoded as two bytes, which are displayed as \xc3\xb1. See the lecture for more detail on this.
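
To make the variable width concrete, here is a minimal sketch that counts the bytes UTF-8 uses for some of the characters seen earlier in this notebook. The names `samples` and `byte_lengths` are only illustrative.

```python
# Characters used earlier in this notebook, written as escapes:
# j, ñ (U+00F1), π (U+03C0), Samaritan letter yut (U+0809), 🙃 (U+1F643)
samples = ["j", "\xf1", "\u03c0", "\u0809", "\U0001F643"]

byte_lengths = {}
for ch in samples:
    encoded = ch.encode("utf-8")
    byte_lengths[ch] = len(encoded)
    # Show the code point, the raw bytes and how many bytes were needed
    print(f"U+{ord(ch):04X}", encoded, len(encoded))
```

The lengths range from one byte (ASCII) up to four bytes (characters beyond U+FFFF).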

We can encode (and output) to different character encodings, as long as it is possible to do so:

In [11]:
r.encode("ascii")

UnicodeEncodeError: 'ascii' codec can't encode character '\xf1' in position 6: ordinal not in range(128)

- Check you understand why the above code produces an error.
- Try encoding to "latin-1", and check you understand why that is valid.
- Try encoding the string "jalapeno" (without the diacritic) as ASCII. Does this work?

Fill in your answers below.

In [17]:
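# ASCII cannot encode ñ: its code point (241) is outside the ASCII range (0-127).
r.encode("latin-1")  # valid: latin-1 covers code points 0-255, so ñ is a single byte
r = "jalapeno"
r.encode('ascii')    # valid: every character is in ASCII

r = "jalapeño"
# ord() returns the Unicode code point of each character
[ord(c) for c in r]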

[106, 97, 108, 97, 112, 101, 241, 111]
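
As an aside to the exercise, `str.encode` also accepts an `errors` argument that controls what happens to characters the target encoding cannot represent. A short sketch of three of the standard handlers (the variable names are just for illustration):

```python
r = "jalape\xf1o"  # "jalapeño"

# errors="strict" is the default and raises UnicodeEncodeError; these handlers do not.
replaced = r.encode("ascii", errors="replace")          # unencodable character becomes "?"
ignored = r.encode("ascii", errors="ignore")            # unencodable character is dropped
as_xml = r.encode("ascii", errors="xmlcharrefreplace")  # becomes a numeric character reference

print(replaced, ignored, as_xml)
```

All three lose or change information, so they are best used deliberately rather than as a way to silence errors.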

If you write to a file without specifying an encoding, Python uses the platform default, which is UTF-8 on most systems. Try the code below, and then open the file in the Atom text editor. You will see that the text displays correctly when opened as UTF-8. Try changing the encoding (bottom right in Atom) and you'll see other characters appear, as the same bytes are interpreted differently in other encodings.

In [4]:
r = "jalapeño"  # defined here so the cell runs on its own
with open("test-file.txt", 'w') as f:
    f.write(r)

When you open files for reading, the same default encoding is used:

In [28]:
with open("test-file.txt") as f:
    print(f.read())

jalapeño


You can explicitly set the encoding for both reading and writing files, as below.

In [20]:
with open("test-file.txt", encoding="utf-8") as f:
    print(f.read())

jalapeño


Try changing the encoding to "ascii" and "latin-1". Check you understand what happens, and write it down below.

In [34]:
with open("test-file.txt", 'w', encoding='ascii') as f:
    f.write('Test')

with open("test-file.txt", 'r', encoding="ascii") as f:
    print(f.read())

with open("test-file.txt", 'w', encoding='latin-1') as f:
    f.write('Test')

with open("test-file.txt", 'r', encoding="latin-1") as f:
    print(f.read())


Test
Test
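
The string 'Test' contains only ASCII characters, so it reads back identically in every one of these encodings. To see a difference, the file needs a non-ASCII character. The sketch below writes "jalapeño" as UTF-8 to a temporary file (the path and variable names are illustrative) and reads it back with the two other encodings:

```python
import os
import tempfile

# Write the word as UTF-8 into a temporary file.
path = os.path.join(tempfile.mkdtemp(), "encoding-demo.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("jalape\xf1o")

# latin-1 maps every byte to a character, so reading "succeeds" but garbles ñ.
with open(path, encoding="latin-1") as f:
    as_latin1 = f.read()
print(as_latin1)

# ASCII only covers bytes 0-127, so the two bytes of ñ cannot be decoded.
try:
    with open(path, encoding="ascii") as f:
        as_ascii = f.read()
except UnicodeDecodeError as err:
    as_ascii = None
    print("ascii failed:", err)
```

Reading with the wrong encoding can therefore fail loudly (ASCII) or silently produce wrong text (latin-1); the silent case is the more dangerous one.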


Hopefully all of the text you deal with will be UTF-8, as this makes life easier with Python. If you do have text other than UTF-8, you will need to carefully check you know what the encoding is and read it as such. Switching between encodings can cause problems, and should be done with care. There's a good guide here: http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html

Note: Even when using UTF-8, it is generally good practice to set the encoding explicitly, because the default used for files depends on the platform and locale settings and may not be UTF-8 on the machine running your code.
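
If you are unsure what the defaults are on a given machine, you can ask Python. This is a small sketch using the standard library; the second value is the one that varies between systems.

```python
import locale
import sys

# Encoding used by str.encode() / bytes.decode() when none is given.
print(sys.getdefaultencoding())

# Encoding that open() typically falls back to when no encoding argument is
# passed; this depends on the platform and locale settings.
print(locale.getpreferredencoding(False))
```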

There is a good overview available if you want to read more on Unicode and how it is dealt with in Python 3: https://docs.python.org/3/howto/unicode.html.

### Character variation

One issue to be aware of is that there are often multiple Unicode codepoints available to represent the same character, and sometimes it is impossible to visually distinguish between them. For example, [combining diacritical marks](https://www.unicode.org/charts/PDF/U0300.pdf) can be used after any character to add a diacritic to the preceding letter. So, `e` with an acute accent can be written as `é` (one codepoint) or `e` and ` ́` combined (two codepoints). When different codepoints are used, strings will not be equal, and if a different number of codepoints are used, their lengths will not be equal. For example:

In [36]:
r = "jalapeño"
s = "jalape\xf1o"
t = "jalapen\u0303o"  # n followed by a combining tilde (U+0303): https://www.unicode.org/charts/PDF/U0300.pdf

print(r, s, r == s)
print(s, t, s == t)
print(s,len(s), t, len(t))

jalapeño jalapeño True
jalapeño jalapeño False
jalapeño 8 jalapeño 9
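
One common way to deal with this is Unicode normalisation, available in the standard library as `unicodedata.normalize`. The sketch below uses the NFC form, which composes a base letter and combining mark into a single codepoint where one exists; whether NFC is the right form depends on your data and tools.

```python
import unicodedata

s = "jalape\xf1o"      # precomposed ñ (one codepoint)
t = "jalapen\u0303o"   # n followed by a combining tilde (two codepoints)

s_nfc = unicodedata.normalize("NFC", s)
t_nfc = unicodedata.normalize("NFC", t)  # composes n + combining tilde into ñ

print(s == t)                      # compare the raw strings
print(s_nfc == t_nfc, len(t_nfc))  # compare after normalisation
```

Normalising all input text to the same form before comparing or counting strings avoids this kind of mismatch.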


Another example of this is quotation marks. Many applications automatically "smart quote", replacing standard quotation marks with a fancier-looking glyph. This can cause problems with tokenisation (for example), as the tokeniser is expecting one particular character to mean a quotation mark and will miss the other varieties. Apostrophes can cause similar issues, and are also sometimes mixed up with quotation marks. The wide range of codepoints that can be used for quotation marks and apostrophes can be seen here: https://en.wikipedia.org/wiki/Quotation_mark#Unicode_code_point_table
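
A simple way to handle a known set of variants is a translation table. The sketch below maps four common "smart" quote codepoints to their ASCII equivalents; the table `QUOTE_MAP` is illustrative and far from exhaustive.

```python
# Map a few common "smart" quote codepoints to plain ASCII quotes.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (also used as an apostrophe)
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})

text = "\u201cI\u2019ve got a spicy jalape\xf1o,\u201d she said."
plain = text.translate(QUOTE_MAP)
print(plain)
```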

## Spelling variation


With user-generated content, spelling variation can be a big issue, especially with the more informal language found on social networks and on internet forums. There are two main problems:

1. It is difficult to assign part-of-speech tags and other annotations automatically as many of these techniques rely heavily on a lexicon lookup, which often fails when words are spelt inconsistently or incorrectly.

2. When counting the occurrences of a particular word (or concept) the frequency is spread across the different variant spellings. If frequencies are incorrect it limits the reliability of statistics built upon these frequencies.
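
The second problem can be illustrated with a toy example. The token list and the `VARIANTS` lookup below are invented for illustration only; real variant lists are much harder to build.

```python
from collections import Counter

# Invented tokens, as might appear in informal text.
tokens = ["tomorrow", "tmrw", "2moro", "tomorrow", "tommorow", "today"]

# Hypothetical hand-built lookup from variant spelling to a standard form.
VARIANTS = {"tmrw": "tomorrow", "2moro": "tomorrow", "tommorow": "tomorrow"}

raw_counts = Counter(tokens)
normalised_counts = Counter(VARIANTS.get(tok, tok) for tok in tokens)

# The raw count only sees the standard spelling; the normalised count sees all variants.
print(raw_counts["tomorrow"], normalised_counts["tomorrow"])
```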

We are not going to directly cover dealing with spelling variation in this course, but if you wish, you may want to look at [PyEnchant](https://github.com/rfk/pyenchant) and [PyHunSpell](https://github.com/blatinier/pyhunspell) for Python spellchecking libraries, and [VARD](http://ucrel.lancs.ac.uk/vard/about/), a Java tool originally developed for dealing with historical spelling variation, which can also be used for modern spelling variation.

## Platform-specific strings

A common issue is that textual data from the web contains things that are not "standard written language", for example hashtags, usernames, URLs, email addresses, emoticons and emojis. Traditionally, NLP tools have been built to deal with clean, edited language which has little variation, and many of these tools will not know what to do with some of the strings coming from social networks and other online sources. The situation is improving, with many NLP tools now having facilities for dealing with such things. Depending on the analysis to be undertaken and the NLP pipeline being employed, it is sometimes desirable to remove these "troublesome" strings: they may "break" existing tools, they may not be of interest for the study, or they may cause problems for the study. For example, usernames being present could cause privacy issues, or bias author prediction based on language style. On the other hand, we may want to keep such features when they are useful to the task in hand, e.g. hashtags for topic analysis or emoji use for authorship analysis.

## Text cleaning
There are many ways of dealing with the above issues, and decisions will be made based on the data being used, and the analysis being undertaken. Sometimes, it is enough (or all that is possible) to be aware of the issues and the impact they may have.

### FTFY: Fix text for you
One particularly good tool worth having in your arsenal to deal with some of the above issues is [ftfy: fix text for you](https://ftfy.readthedocs.io/en/latest/). It is particularly good for dealing with corrupted UTF-8 text, and the problem of "mojibake", whereby characters are nonsensical because of mistakes made when encoding/decoding Unicode. The tool also normalises character usage, including smart quotes, and will also replace HTML entities which shouldn't be needed in UTF-8 text for NLP analysis.

For most cases, running ftfy's `fix_text` function will fix all you need; for more specific options, [check the documentation](https://ftfy.readthedocs.io/en/latest/). To demonstrate, we'll output some text to a file in UTF-8, and read the file back in with the wrong encoding, "latin-1".

In [6]:
import sys
import ftfy
!{sys.executable} -m pip install emoji


In [11]:
with open("ftfy-test.txt", 'w', encoding="utf-8") as f:
    f.write("I’ve got a spicy jalapeño.")

with open("ftfy-test.txt", encoding="latin-1") as f:
    garbled = f.read()
    
print(garbled)

Iâve got a spicy jalapeÃ±o.


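Notice that both the 'smart' apostrophe and `ñ` are garbled. In UTF-8 each is stored as more than one byte, and latin-1 decodes every byte as a separate character, so `ñ` (two bytes) becomes `Ã±`, and the apostrophe (three bytes) becomes `â` followed by two invisible control characters.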
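
The same garbling can be reproduced without a file, which makes the mechanism easier to see. This is a minimal sketch using only `encode` and `decode`; the variable names are illustrative.

```python
original = "I\u2019ve got a spicy jalape\xf1o."  # contains ’ (U+2019) and ñ (U+00F1)

utf8_bytes = original.encode("utf-8")
garbled = utf8_bytes.decode("latin-1")  # each byte becomes its own character

# The garbled string has one character per byte, so it is longer than the original.
print(len(original), len(utf8_bytes), len(garbled))

# latin-1 maps bytes to characters one-to-one, so in this case the damage is reversible.
restored = garbled.encode("latin-1").decode("utf-8")
print(restored == original)
```

Real-world mojibake is often messier than this (several rounds of mis-decoding, or lossy encodings), which is where a tool like ftfy helps.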

In [12]:
fixed = ftfy.fix_text(garbled)
print(fixed)

I've got a spicy jalapeño.


The encoding issues are fixed, and the text is back to the intended Unicode characters. Also, the 'smart' apostrophe is replaced with a normal run-of-the-mill ASCII apostrophe. Much nicer, and easier to deal with when it comes to tokenisation.

### Filtering with regular expressions
Filtering unwanted strings can sometimes be dealt with during tokenisation (i.e. ignoring those sequences, such as emojis), but more often pre-processing will take place whereby the strings are removed or replaced. It is important to take care when removing strings from running text so that inadvertent damage is avoided, e.g. by replacing other wanted text unknowingly, or disrupting the flow of the text, which may impact annotation and other NLP tasks, as well as human readability. It is also important to decide carefully what to replace strings with: a marker (e.g. `#hashtag`), a space, or nothing? Again, the data, and the analysis and tools to be used, should influence this decision.

Regular expressions are the obvious choice to replace such strings in most cases.

To demonstrate this, a single tweet is provided below, along with code to replace hashtags with just `#`.

In [13]:
tweet = "This week we’re at a #careers event in #Blackpool @Pleasure_Beach, talking to students about #languages and language careers! Come have a go at some of our activities! 🌏#LoveLanguages #LoveLancaster @Lancaster_CI https://t.co/vQQWdrUuqh"
print(tweet)

This week we’re at a #careers event in #Blackpool @Pleasure_Beach, talking to students about #languages and language careers! Come have a go at some of our activities! 🌏#LoveLanguages #LoveLancaster @Lancaster_CI https://t.co/vQQWdrUuqh


In [14]:
import re #regular expressions package

In [15]:
#note we use r to denote a "raw string" for regular expression patterns, this is so we do not have to keep escaping \.
hashtag = re.compile(r'#\w+') #compile the regular expression, good idea to do this in advance if used multiple times.
cleaned = hashtag.sub("#", tweet) #use compiled regex to substitute matches for replacement "#", in tweet.
print(cleaned)

This week we’re at a # event in # @Pleasure_Beach, talking to students about # and language careers! Come have a go at some of our activities! 🌏# # @Lancaster_CI https://t.co/vQQWdrUuqh


You should be able to adapt the above code to replace usernames (@ mentions) with `@`. For both, try replacing with a space and removing completely. Can you process both in one regular expression?

In [24]:
at = re.compile(r'(@|#)\w+')
cleaned = at.sub(" ", tweet)
print(cleaned)

This week we’re at a   event in    , talking to students about   and language careers! Come have a go at some of our activities! 🌏      https://t.co/vQQWdrUuqh
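
Replacing with a space leaves runs of spaces behind, as in the output above. One option (a sketch, not the only approach) is to collapse whitespace afterwards. The string below is a shortened, hypothetical tweet-like example.

```python
import re

# A shortened, hypothetical tweet-like string, used only for illustration.
text = "We are at a #careers event in #Blackpool @Pleasure_Beach, come along!"

spaced = re.sub(r'[@#]\w+', ' ', text)           # replace each match with a space
collapsed = re.sub(r'\s+', ' ', spaced).strip()  # squeeze runs of whitespace to one space

print(collapsed)
```

Note that the comma is left stranded after the removed username, which is the kind of damage to running text worth checking for.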


Review the resulting text each time, and judge the readability of the text. Does it change the text's meaning or nature? Make notes below.

It's more readable without the @ and the #, but it does change the meaning when the hashtags and @ mentions are removed.

**Advanced**: In one expression, can you replace `#hashtags` with `#` and `@mentions` with `@`? You'll need to use groupings and backreferences: https://docs.python.org/3/howto/regex.html
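
One possible approach, sketched on a shortened, hypothetical tweet-like string: capture the leading symbol in a group and refer back to it in the replacement with `\1`.

```python
import re

# A shortened, hypothetical tweet-like string, used only for illustration.
text = "Talking about #languages with @Lancaster_CI #LoveLancaster"

# Group 1 captures the leading symbol; \1 in the replacement puts it back.
symbol_only = re.sub(r'([@#])\w+', r'\1', text)
print(symbol_only)
```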

## Exercise
Using text collected from last week's work, or the [provided text from mumsnet](mumsnet.txt), examine the text and note any potential issues in the text that may cause problems for later analysis. Then write code to pre-process the text. Justify and describe decisions made below. Items you could consider are:

- user mentions
- hashtags
- urls
- emojis and emoticons
- smart quotes and apostrophes

In [7]:
#exercise code here, read in text, process, and output text to a new file (can also print if not too long).
import re
import string

import emoji  # third-party; emoji.EMOJI_DATA is available in emoji 2.0 and later
import ftfy

# Read the raw text (latin-1 decoding never fails; ftfy can usually repair any resulting mojibake)
with open("mumsnet.txt", encoding="latin-1") as f:
    text = f.read()

# Sort out any encoding issues first, so later steps see the intended characters
result = ftfy.fix_text(text)

# Remove links
result = re.sub(r'(www|http)\S+', '', result)

# Lower-casing is not applied, so the output keeps its original case;
# uncomment the next line to apply it
# result = result.lower()

# Remove numbers
result = re.sub(r'\d+', '', result)

# Remove ASCII punctuation
result = result.translate(str.maketrans('', '', string.punctuation))

# Remove emojis
def give_emoji_free_text(text):
    """Drop every whitespace-separated token that contains an emoji character."""
    tokens = [tok for tok in text.split()
              if not any(ch in emoji.EMOJI_DATA for ch in tok)]
    return ' '.join(tokens)

result = give_emoji_free_text(result)

print(result)


My best friend has a month old and after years of her gleefully finding the biggest most annoying toys she could for my children I am desperate to get my own back What are the current most ear shattering noisy toys preferably plastic with flashing lights that you can buy now I remember vtech baby walkers were pretty appalling when mine were small Following with interest Im in a similar situation Might just be us Sjjr Im thinking the pink vtech walker looks horrific so might go for that Im bumping this as I refuse to believe we are the only vengeful people on here My month is currently playing on her Vtech bounce and spin frog Its very noisy I have the subtitles on TV The volume has settings though so she can always just switch it off Seems like vetch is the bestworst gift to give Lamaze Sunny Rabbit It doesnt have an off button The fisher price cookie jar is awful One of those that the song gets stuck in your head To be honestly theyre all horrendous after a while The worst we have is 

#### Justify and describe decisions here

