# Introduction to Python - Strings and Text Files

In [1]:
# Author: Alex Schmitt (schmitt@ifo.de)

import datetime
print('Last update: ' + str(datetime.datetime.today()))

Last update: 2017-04-03 14:35:41.912204


## Handles and Reading Files

In [2]:
fname = 'email.txt'
fh = open(fname)
text_all = fh.read()
print(type(text_all))
print('The text consists of {} characters.'.format(len(text_all)))

<class 'str'>
The text consists of 1680 characters.


**text_all** stores the contents of the text file as one large string. Sometimes it is more convenient to have a list of strings instead, where each element of the list represents one line of the text. Which alternative is better depends on the problem you want to solve. The **readlines** method returns such a list:

In [3]:
fh = open(fname)
text = fh.readlines()
type(text), len(text)   # not displayed: a cell only echoes the value of its last expression
print('The text consists of {} characters.'.format(sum(len(x) for x in text)))

The text consists of 1680 characters.
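
The cells above never close the file handle. A minimal sketch of the same two read patterns inside a **with** block, which closes the handle automatically; the file `sample.txt` and its contents are made up here so that the example runs on its own:

```python
import os
import tempfile

# Hypothetical sample file, created only to keep the sketch self-contained.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, 'sample.txt')
with open(sample_path, 'w') as fh:
    fh.write('first line\nsecond line\n')

# read() returns the whole file as a single string.
with open(sample_path) as fh:
    sample_all = fh.read()

# readlines() returns a list with one string per line (line breaks included).
with open(sample_path) as fh:
    sample_lines = fh.readlines()

print(type(sample_all), len(sample_all))
print(type(sample_lines), len(sample_lines))

# Both objects hold the same characters, so the totals agree.
print(len(sample_all) == sum(len(x) for x in sample_lines))
```

After each **with** block the handle is closed, so nothing needs to be cleaned up by hand.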


## String Methods

Here is an (incomplete) list of the most important string methods, for which we will see examples below:
- **text.split(char)** -> list: returns a list of the substrings of text, split at char (or at whitespace by default)
- **text.find(string)**, **text.index(string)** -> int: returns the position (index) of the first occurrence of string
- **text.count(string)** -> int: returns the number of occurrences of string in text
- **text.startswith(string)** -> boolean: returns True if text starts with string
- **text.strip()** -> str: returns a copy of text (not in place!) without leading and trailing whitespace
- **text.upper()**, **text.lower()** -> str: returns a copy of text (not in place!) with all characters in upper (lower) case
- **'text{}'.format(num)** -> str: inserts num into the string at the position of the placeholder {}
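
As a quick illustration of the list above, here is a small sketch that applies each method to a literal string. The string `sample` is invented for this example (it only resembles a header line of the email):

```python
# Hypothetical sample string with leading and trailing spaces.
sample = '  Received: from Exchange03.ifo.local  '

words = sample.split()            # list of substrings, split at whitespace
pos = sample.find('from')         # index of the first occurrence of 'from'
n_dots = sample.count('.')        # number of occurrences of '.'
clean = sample.strip()            # copy without leading/trailing whitespace
starts = clean.startswith('Received')
shout = clean.upper()             # copy in upper case
msg = 'The string has {} characters.'.format(len(sample))

print(words)
print(pos, n_dots, starts)
print(shout)
print(msg)

# None of the calls above modified the original string.
print(repr(sample))
```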


In [4]:
line = text[0]

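line.split()         # returns a list, but the result is discarded
print( type(line) )  # line is still a string

line = line.split()  # rebind line to the returned list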
print( type(line) )
print(line)

<class 'str'>
<class 'list'>
['Received:', 'from', 'Exchange03.ifo.local', '(192.168.0.103)', 'by', 'Exchange03.ifo.local']


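Recall that strings are immutable: methods do not work "in place". Calling **line.split()** on its own leaves **line** unchanged; to keep the result, assign it to a variable.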

You can parse strings and check whether they contain a certain substring by using the **find** and **index** methods. Both return the position (index) of the *first* occurrence of the substring. They differ only when the substring is not in the text: **find** returns -1, while **index** raises a ValueError.

In [5]:
pos = text_all.find('Schmitt')
print(text_all[pos : pos + 7])
print(text_all[pos : pos + 7].upper(), text_all[pos : pos + 7].lower())
print(text_all[pos + 1 : pos + 7].capitalize())

print(text_all.index('chmitt'))
print(text_all.find('Chmitt'))
# print(text_all.index('Chmitt')) -> raises a ValueError!

Schmitt
SCHMITT schmitt
Chmitt
817
-1
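
The different behavior of **find** and **index** for a missing substring can be sketched as follows. The string `sample` is a stand-in for the email text, and catching the exception with **try**/**except** is one possible way to handle it, not the only one:

```python
# Stand-in for the email text.
sample = 'Schmitt@ifo.de'

# find() signals a missing substring with -1 ...
pos_find = sample.find('Chmitt')

# ... while index() raises a ValueError, which can be caught.
try:
    pos_index = sample.index('Chmitt')
except ValueError:
    pos_index = None

# If only membership matters, the in operator is often the clearest test.
found = 'Chmitt' in sample

print(pos_find, pos_index, found)
```

Note that -1 is also a valid index (the last character), so the return value of **find** should be checked before it is used for slicing.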


If you are interested not in where a substring occurs in a string, but in how often, use the **count** method:

In [6]:
text_all.count('ifo')

12
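
Two properties of **count** are worth keeping in mind: it is case-sensitive, and it counts non-overlapping occurrences only. A small sketch with invented strings:

```python
# Invented sample string mixing lower and upper case.
sample = 'ifo.de, IFO, ifo.local'

n_exact = sample.count('ifo')              # case-sensitive count
n_any_case = sample.lower().count('ifo')   # ignore case by lowering first
n_overlap = 'aaaa'.count('aa')             # occurrences may not overlap

print(n_exact, n_any_case, n_overlap)
```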

## Iterating over a file handle

In many cases, you are not interested in the complete text, but only in certain parts of it or in specific information it contains. For example, assume you want to extract all email addresses in the text. One way to do this is to use the file handle as an iterator and store all lines that contain a '@' in a list:

In [7]:
fh = open(fname)

addresses = []
for line in fh:
    if '@' in line:   # find('@') > 0 would miss an '@' at position 0
        addresses.append(line)
    
print(addresses)  

['From: "Huber, Matthias" <Huber@ifo.de>\n', 'To: "Schmitt, Alex" <Schmitt@ifo.de>\n', 'Message-ID: <23211122c2f5403e81a78feb4d32a00e@ifo.de>\n', 'X-MS-TNEF-Correlator: <23211122c2f5403e81a78feb4d32a00e@ifo.de>\n', 'Return-Path: Huber@ifo.de\n']


In other words, this reduces a potentially long text to those lines that may contain relevant information. Closer inspection of the resulting list shows that there are two email addresses in lines that start with 'From: ' and with 'To: '. We can use this information to parse the text again, this time making our query more precise:

In [8]:
fh = open(fname)

addresses = []
for line in fh:
    if line.startswith(('From:', 'To:')):   # startswith accepts a tuple of prefixes
        addresses.append(line.strip())
    
print(addresses) 

['From: "Huber, Matthias" <Huber@ifo.de>', 'To: "Schmitt, Alex" <Schmitt@ifo.de>']
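
The selected lines still contain more than the bare address. One possible way to cut out the part between the angle brackets, using only **find** and slicing, is sketched below; the list `header_lines` is copied from the output above, while the variable names are chosen for this example:

```python
# The two header lines found above.
header_lines = ['From: "Huber, Matthias" <Huber@ifo.de>',
                'To: "Schmitt, Alex" <Schmitt@ifo.de>']

extracted = []
for header in header_lines:
    start = header.find('<')
    end = header.find('>', start)     # search for '>' from start onwards
    # Only slice if both brackets were found.
    if start != -1 and end != -1:
        extracted.append(header[start + 1 : end])

print(extracted)
```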


Note that there are better ways to search a text for specific patterns, as we will see in the section on regular expressions.

Often it is not necessary to parse the whole text. For example, if you are only interested in the subject of an email, you can stop the loop after the relevant line, using a **break** statement:

In [9]:
fh = open(fname)

for line in fh:
    if line.startswith('Subject'):
        # skip the 9 characters of 'Subject: ' and print the rest
        print(line[9:])
        break


github
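
Slicing with a hard-coded offset works, but it depends on the exact length of the field name. An alternative sketch uses **partition**, which splits a string at the first occurrence of a separator. The list `sample_lines` is reconstructed from the outputs shown in this notebook and is only an assumption about the file's exact contents:

```python
# Reconstructed sample lines (assumed, not read from email.txt).
sample_lines = ['From: "Huber, Matthias" <Huber@ifo.de>\n',
                'Subject: github\n',
                'Date: Tue, 28 Mar 2017 11:45:05 +0200\n']

subject = None
for sample_line in sample_lines:
    if sample_line.startswith('Subject:'):
        # partition returns (before, separator, after) for the first ':'
        _, _, value = sample_line.partition(':')
        subject = value.strip()       # drop the space and the line break
        break                         # no need to look at later lines

print(subject)
```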



In [10]:
fh = open(fname)

text = []
for line in fh:
    text.append(line.strip())
print(type(text))
print(type(text_all))
print(sum([len(line) for line in text]))    

<class 'list'>
<class 'str'>
1637


Note that the number of characters here is smaller than above, since **strip** removes the line break (and any other leading or trailing whitespace) from each line.
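
If only the line breaks should go, **rstrip('\n')** is the more targeted call. The following sketch compares the character counts for two invented lines:

```python
# Invented lines; the first has extra spaces around the text.
sample_lines = ['  indented line \n', 'plain line\n']

raw = sum(len(x) for x in sample_lines)
no_breaks = sum(len(x.rstrip('\n')) for x in sample_lines)   # line breaks only
stripped = sum(len(x.strip()) for x in sample_lines)         # all outer whitespace

print(raw, no_breaks, stripped)
```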

## Regular Expressions

In [11]:
import re

In [61]:
len(re.findall('ifo', text_all))

12

In [53]:
re.findall('[A-Za-z.]+@[a-z.]+', text_all)

['Huber@ifo.de', 'Schmitt@ifo.de', 'e@ifo.de', 'e@ifo.de', 'Huber@ifo.de']
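
The two 'e@ifo.de' matches are truncated message IDs: the character class before the '@' contains no digits, so the match can only start after the last digit. A sketch with a wider character class is shown below. The string `sample` is reconstructed from the output of cell 7, and the wider pattern is only an illustration, not a complete email address matcher:

```python
import re

# Two lines reconstructed from the output of cell 7.
sample = ('Message-ID: <23211122c2f5403e81a78feb4d32a00e@ifo.de>\n'
          'Return-Path: Huber@ifo.de\n')

# Pattern from above: letters and dots only before the '@'.
narrow = re.findall(r'[A-Za-z.]+@[a-z.]+', sample)

# Wider pattern: also allow digits, underscores and hyphens.
wider = re.findall(r'[A-Za-z0-9._-]+@[A-Za-z0-9.-]+', sample)

print(narrow)
print(wider)
```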

In [54]:
re.findall('Date: (.+)', text_all)

['Tue, 28 Mar 2017 11:45:05 +0200']

In [55]:
re.findall('Tue.+', text_all)

['Tue, 28 Mar 2017 11:45:05 +0200',
 'Tue, 28',
 'Tue, 28 Mar 2017 11:45:05 +0200',
 'Tue, 28 Mar 2017 11:45:05 +0200']
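
The last two cells differ in one detail: if the pattern contains a group in parentheses, **findall** returns only the content of the group; without a group it returns the whole match. In both cases '.+' stops at the end of the line, because '.' does not match a line break by default. A sketch on a reconstructed two-line string (the second line is an assumption based on cell 9):

```python
import re

sample = 'Date: Tue, 28 Mar 2017 11:45:05 +0200\nSubject: github\n'

whole = re.findall(r'Date: .+', sample)      # no group: the full match
group = re.findall(r'Date: (.+)', sample)    # one group: only the group

# With several groups, search() gives access to each of them.
match = re.search(r'(\d{1,2}) (\w{3}) (\d{4})', sample)
day, month, year = match.groups()

print(whole)
print(group)
print(day, month, year)
```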

In [56]:
re.findall('[0-9.]+', text_all)

['03.',
 '.',
 '192.168.0.103',
 '03.',
 '.',
 '192.168.0.103',
 '1',
 '2',
 '256',
 '384',
 '384',
 '15.1.544.27',
 '28',
 '2017',
 '11',
 '45',
 '05',
 '0200',
 '03.',
 '.',
 '192.168.0.103',
 '03.',
 '.',
 '192.168.0.103',
 '1',
 '2',
 '256',
 '384',
 '384',
 '15.1.544.27',
 '28',
 '2017',
 '11',
 '45',
 '05',
 '0200',
 '03.',
 '.',
 '80',
 '10',
 '8',
 '53',
 '646',
 '03.',
 '.',
 '80',
 '10',
 '8',
 '53',
 '646',
 '15',
 '15.01.0544.030',
 '28',
 '2017',
 '11',
 '45',
 '05',
 '0200',
 '.',
 '.',
 '.',
 '5',
 '2',
 '4',
 '6',
 '1',
 '28',
 '2017',
 '11',
 '45',
 '05',
 '0200',
 '23211122',
 '2',
 '5403',
 '81',
 '78',
 '4',
 '32',
 '00',
 '.',
 '1',
 '23211122',
 '2',
 '5403',
 '81',
 '78',
 '4',
 '32',
 '00',
 '.',
 '1.0',
 '03.',
 '.',
 '04',
 '192.168.2.216',
 '78661',
 '8',
 '17',
 '409',
 '1497',
 '08',
 '475',
 '1',
 '23',
 '.',
 '1.0',
 '00',
 '00',
 '00.2656386']
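
The many '.' and '03.' entries arise because the class [0-9.] also matches a dot that is not surrounded by digits. A stricter pattern, sketched here, requires digits on both sides of every dot. The string `sample` is invented, loosely following the first header line:

```python
import re

# Invented sample, loosely following the first header line.
sample = 'from Exchange03.ifo.local (192.168.0.103) version 15.1.544.27.'

# Pattern from above: any run of digits and dots.
loose = re.findall(r'[0-9.]+', sample)

# Stricter: digits, optionally followed by groups of '.digits'.
strict = re.findall(r'\d+(?:\.\d+)*', sample)

print(loose)
print(strict)
```

The group is written as (?:...) so that **findall** still returns the whole match rather than the content of the group.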

## Application: parsing a scientific text for numeric data

Parsing the text with **readlines** works, but since each paragraph of this file is stored on a single line, it actually returns a list of paragraphs rather than lines.

In [12]:
fname = 'jeem.txt'
fh = open(fname)
text = fh.readlines()

print('The text consists of {} characters.'.format(sum(len(x) for x in text)))

The text consists of 16808 characters.


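In order to obtain a list of lines rather than paragraphs (here, roughly one sentence per line), we can loop through the file handle and convert the object in each iteration (a paragraph) to a list of lines, using the **split** method. We use '. ' (i.e. a full stop followed by a space) as the separator at which to split. We can then add the contents of this list to a list called **text**, which collects all the previous lines. Before this step, I clean the paragraph of leading and trailing whitespace using **strip**. I also use **re.sub** to remove the full stop in 'et al.', which would otherwise be mistaken for the end of a sentence, and to remove numbers directly before ')' or ';', which appear to be the years of citations rather than numeric data.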

In [72]:
fh = open(fname)

text = []
for item in fh:
    ## eliminate whitespace
    paragraph = item.strip()
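    ## substitutions (raw strings; the stop in 'et al.' is escaped so it matches literally)
    paragraph = re.sub(r'et al\.', 'et al', paragraph)
    paragraph = re.sub(r'[0-9]+\)', ')', paragraph)
    paragraph = re.sub(r'[0-9]+;', ';', paragraph)
    paragraph = re.sub(r'[0-9]+ ;', ';', paragraph)

    ## append the lines of this paragraph to the overall list
    text.extend(paragraph.split('. '))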
print(type(text))
print(type(text_all))
print(sum([len(line) for line in text])) 

<class 'list'>
<class 'str'>
16439
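
To see why the 'et al.' substitution matters before splitting at '. ', consider the following sketch. The paragraph is invented for this example and is not taken from jeem.txt:

```python
import re

# Invented paragraph with an abbreviation in the middle of a sentence.
paragraph = ('Prices fell below 10 euros in 2013. '
             'Greenstone et al. discuss the social cost of carbon. '
             'Volatility remains high.')

# Splitting directly breaks the second sentence after 'et al'.
naive = paragraph.split('. ')

# Removing the stop of the abbreviation first keeps the sentence intact.
cleaned = re.sub(r'et al\.', 'et al', paragraph).split('. ')

print(len(naive), naive)
print(len(cleaned), cleaned)
```

Other abbreviations (such as 'i.e.' or 'Fig.') would need the same treatment, so this approach remains a heuristic.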


In [73]:
text[:10]

['Introduction',
 'The European Union Emissions Trading System (EU ETS) is currently the largest carbon trading system in the world, unless and until it is overtaken by the Chinese national carbon trading scheme planned for introduction in 2017 (Jotzo and Löschel, ;  Zhang et al, )',
 'Although the EU ETS is meeting its core objective – EU emissions covered by the scheme remain below the total emissions cap – it is sometimes described as having ‘failed’ because prices are too low to incentivise substantial short-run emissions reductions and too volatile to provide adequate long-run incentives for investments in clean technologies.',
 '',
 'European Allowances (EUAs) – the unit of compliance – have traded below €10 from 2013 onwards (EEX )',
 'The price is below most estimates of the social cost of carbon for example as used in US government regulatory analysis (Greenstone et al, ; Goulder and Williams, ; United States Interagency Group, )',
 'It is also low relative to the implicit pri

In [74]:
text_num = []
for item in text:
    if re.search('[0-9]+', item):
        text_num.append(item)
        print(item)

The European Union Emissions Trading System (EU ETS) is currently the largest carbon trading system in the world, unless and until it is overtaken by the Chinese national carbon trading scheme planned for introduction in 2017 (Jotzo and Löschel, ;  Zhang et al, )
European Allowances (EUAs) – the unit of compliance – have traded below €10 from 2013 onwards (EEX )
For instance, several multinational oil companies use internal screening prices of US $40/€35 or more (Kossoy et al, ), even though they operate in jurisdictions that are, on the whole, subject to lighter carbon regulation than in Europe.
Emissions allowances issued each year began to exceed actual annual emissions in 2009 (Redman and Convery, ) and a large surplus has been built up through banking
The 2030 Climate and Energy Reform Package (European Council, ) decided that the annual (linear) reduction factor for the EU ETS will be increased from 1.74 to 2.2 percent per annum from 2021-2030
In November 2012, the European Commi