# Introduction to Python - Strings and Text Files

In [1]:
# Author: Alex Schmitt (schmitt@ifo.de)

import datetime
print('Last update: ' + str(datetime.datetime.today()))

Last update: 2017-03-30 14:58:06.942743


In [26]:
fname = 'email.txt'
fh = open(fname)
text_all = fh.read()
print(len(text_all))

1813


In [35]:
pos = text_all.find('Schmitt')
print(text_all[pos : pos + 7])
print(text_all[pos : pos + 7].upper())

Schmitt
SCHMITT
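
One caveat with this approach: **find()** returns -1 when the substring is absent, and slicing with -1 quietly yields the wrong text instead of raising an error. The following is a minimal sketch of a guard against that. The sample string is one header line copied from the email, and the names `needle` and `positions` are only illustrative.

```python
# One header line from the email, used here instead of the full file.
sample = 'To: "Schmitt, Alex" <Schmitt@ifo.de>\n'

positions = {}
for needle in ('Schmitt', 'Mueller'):
    pos = sample.find(needle)
    positions[needle] = pos
    if pos == -1:
        # find() signals "not found" with -1 rather than raising an error
        print(needle, 'not found')
    else:
        # slice exactly as many characters as the search term has
        print(sample[pos : pos + len(needle)].upper())
```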


In many cases, you are not interested in the complete text, but only in certain parts of it or in specific information it contains. For example, assume you want to extract all email addresses in the text. One way to do this is to use the file handle as an iterator and store all lines that contain an '@' in a list:

In [31]:
fh = open(fname)

addresses = []
for line in fh:
    if '@' in line:  # find() > 0 would miss an '@' at position 0
        addresses.append(line)
    
print(addresses)  

['From: "Mazat, Andreas" <Mazat@ifo.de>\n', 'To: "Schmitt, Alex" <Schmitt@ifo.de>\n', 'Message-ID: <9e437462ecf842eea8e2ed7e97bb641c@ifo.de>\n', 'References: <93fd23b8b4444085bcfbfc3a60ee827f@ifo.de>\n', 'In-Reply-To: <93fd23b8b4444085bcfbfc3a60ee827f@ifo.de>\n', 'X-MS-TNEF-Correlator: <9e437462ecf842eea8e2ed7e97bb641c@ifo.de>\n', 'Return-Path: Mazat@ifo.de\n']
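
The same filter can be sketched in a self-contained way. Here a `with` block closes the handle automatically once the loop is done, and a list comprehension replaces the explicit `append`. As an assumption for illustration, `io.StringIO` stands in for the handle returned by `open(fname)`, and the sample lines are copied from the email headers.

```python
import io

# Stand-in for the contents of email.txt: four header lines from the email.
sample = (
    'Content-Transfer-Encoding: binary\n'
    'From: "Mazat, Andreas" <Mazat@ifo.de>\n'
    'To: "Schmitt, Alex" <Schmitt@ifo.de>\n'
    'Subject: AW: ifo Python Course\n'
)

# io.StringIO behaves like a text file handle; with a real file this
# line would read: with open(fname) as fh:
with io.StringIO(sample) as fh:
    # Iterating over the handle yields one line at a time, including '\n'.
    addresses = [line for line in fh if '@' in line]

print(addresses)
```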


In other words, this reduces a potentially long text to those lines that may contain relevant information. Closer inspection of the resulting list shows that the two email addresses appear in the lines that start with 'From: ' and 'To: '. We can use this information to parse the text again, this time making our query more precise:

In [33]:
fh = open(fname)

addresses = []
for line in fh:
    if line.startswith('From') or line.startswith('To'):
        addresses.append(line.strip())
    
print(addresses) 

['From: "Mazat, Andreas" <Mazat@ifo.de>', 'To: "Schmitt, Alex" <Schmitt@ifo.de>']
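
As a small variation, **startswith()** also accepts a tuple of prefixes, which avoids chaining several `or` conditions. The sketch below uses three header lines copied from the email. Including the colon in each prefix is an extra precaution chosen here, not part of the cell above.

```python
# Three header lines from the email; only the first two should be kept.
lines = [
    'From: "Mazat, Andreas" <Mazat@ifo.de>\n',
    'To: "Schmitt, Alex" <Schmitt@ifo.de>\n',
    'Thread-Topic: ifo Python Course\n',
]

# A tuple of prefixes: the test succeeds if the line starts with any of them.
prefixes = ('From:', 'To:')
addresses = [line.strip() for line in lines if line.startswith(prefixes)]

print(addresses)
```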


Note that there are better ways to search a text for specific patterns, as we will see in the section on regular expressions.

Finally, you can also iterate through the lines of a text to store the whole text in a list. This is an alternative to the **read()** method used above: instead of storing the whole text as one large string, you store it as a list of lines, which may facilitate further analysis. As so often, the best approach depends on the problem you want to solve.

In [28]:
fh = open(fname)

text = []
for line in fh:
    text.append(line.strip())
    
print(sum([len(line) for line in text]))    

1769
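
The total is smaller than the 1813 characters returned by **read()** because **strip()** discards the line break at the end of each line as well as any leading whitespace on continuation lines. The sketch below reproduces the effect on two lines copied from the email. It uses `splitlines()` to build the list from a string, which is an alternative to looping over a file handle.

```python
# Two lines from the email; the second one starts with a space.
sample = (
    'Received: from Exchange03.ifo.local (192.168.0.103) by Exchange03.ifo.local\n'
    ' (192.168.0.103) with Microsoft SMTP Server (version=TLS1_2,\n'
)

# splitlines() drops the line breaks, strip() drops the leading space.
lines = [line.strip() for line in sample.splitlines()]

total = len(sample)
stripped_total = sum(len(line) for line in lines)

# Two line breaks plus one leading space were removed.
print(total - stripped_total)
```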


Before moving on, let's summarize the string methods that we have used so far:
- **text.find(string)** -> int: returns the position of the first occurrence of string, or -1 if string does not occur in text
- **text.startswith(string)** -> boolean: returns True if text starts with string
- **text.strip()** -> str: returns a copy of text without leading and trailing white space, including line breaks
- **text.upper()** -> str: returns a copy of text with all letters converted to upper case

Strings are immutable, so none of these methods changes text itself:

In [42]:
a = 'schmitt\n'
a.strip()
a

'schmitt\n'
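
Since `a` still contains the line break after the call, the result of a string method has to be assigned to a name if it is needed later. A minimal sketch follows, where `b` and `c` are illustrative names.

```python
a = 'schmitt\n'

# Each method returns a new string; the original is left untouched.
b = a.strip()
# Method calls can be chained because each one returns a string.
c = a.strip().upper()

print(repr(a), repr(b), repr(c))
```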

## Regular Expressions

In [3]:
import re

In [16]:
len(re.findall('ifo', text_all))

16
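
For a fixed search term without any special characters, `re.findall` and the string method `count` give the same number of non-overlapping matches. The sketch below checks this on two header lines copied from the email instead of the full text.

```python
import re

# Two header lines from the email, each containing 'ifo' once.
sample = (
    'From: "Mazat, Andreas" <Mazat@ifo.de>\n'
    'Subject: AW: ifo Python Course\n'
)

n_regex = len(re.findall('ifo', sample))  # list of matches, then its length
n_count = sample.count('ifo')             # plain substring count

print(n_regex, n_count)
```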

In [14]:
re.findall('[A-Za-z.]+@[a-z.]+', text_all)

['Mazat@ifo.de',
 'Schmitt@ifo.de',
 'c@ifo.de',
 'f@ifo.de',
 'f@ifo.de',
 'c@ifo.de',
 'Mazat@ifo.de']
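
The short matches such as 'c@ifo.de' arise because the character class `[A-Za-z.]` contains no digits, so only the last run of letters in a hexadecimal message ID is picked up. The sketch below compares the pattern above with a wider character class on two lines copied from the email. The wider pattern is only an illustration and not a complete email address validator.

```python
import re

# Two header lines from the email: a real address and a message ID.
sample = (
    'From: "Mazat, Andreas" <Mazat@ifo.de>\n'
    'Message-ID: <9e437462ecf842eea8e2ed7e97bb641c@ifo.de>\n'
)

# Pattern from the cell above: letters and dots only before the '@'.
narrow = re.findall(r'[A-Za-z.]+@[a-z.]+', sample)

# Illustrative wider pattern: digits, '_' and '-' are allowed as well.
wider = re.findall(r'[A-Za-z0-9._-]+@[A-Za-z0-9.-]+', sample)

print(narrow)
print(wider)
```

With the wider class the message ID is matched in full, which makes it easier to tell it apart from the addresses of actual people.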

In [31]:
re.findall('Date: (.+)', text_all)

['Tue, 28 Mar 2017 12:38:10 +0200']
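
The parentheses in the pattern form a capture group: when a pattern contains exactly one group, `re.findall` returns only the text matched by the group and not the whole match. The sketch below shows the difference on two header lines copied from the email.

```python
import re

sample = (
    'Thread-Index: AdKnqiuN/Vpu61IJTTW+SBYboPjp7QABSPjA\n'
    'Date: Tue, 28 Mar 2017 12:38:10 +0200\n'
)

# With a group: only the part inside the parentheses is returned.
with_group = re.findall(r'Date: (.+)', sample)

# Without a group: the whole match, including the 'Date: ' prefix.
without_group = re.findall(r'Date: .+', sample)

print(with_group)
print(without_group)
```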

In [42]:
re.findall('Tue.+', text_all)

['Tue, 28 Mar 2017 12:38:10 +0200',
 'Tue, 28',
 'Tue, 28 Mar 2017 12:38:10 +0200 Content-Type: application/ms-tnef; name="winmail.dat"',
 'Tue, 28 Mar 2017 12:38:10 +0200']
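
The short match 'Tue, 28' appears because `.` does not match a line break by default, so `.+` stops at the end of the line even when the header continues on the next one. As a sketch of one possible workaround, the pattern below uses `\s+`, which also matches line breaks. It is tailored to this particular date layout and is an assumption for illustration.

```python
import re

# A folded header from the email: the date is split across two lines.
sample = (
    'cipher=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384_P384) id 15.1.544.27; Tue, 28\n'
    ' Mar 2017 12:38:10 +0200\n'
)

# '.' stops at the line break, so only the first half is found.
per_line = re.findall(r'Tue.+', sample)

# '\s+' matches spaces and line breaks alike, so the match spans both lines.
folded = re.findall(r'Tue,\s+\d{1,2}\s+\w+\s+\d{4}', sample)

# Collapse the line break and extra spaces into single spaces.
cleaned = [' '.join(match.split()) for match in folded]

print(per_line)
print(cleaned)
```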

In [6]:
re.findall('[0-9.]+', text_all)

['03.',
 '.',
 '192.168.0.103',
 '03.',
 '.',
 '192.168.0.103',
 '1',
 '2',
 '256',
 '384',
 '384',
 '15.1.544.27',
 '28',
 '2017',
 '12',
 '38',
 '10',
 '0200',
 '03.',
 '.',
 '192.168.0.103',
 '03.',
 '.',
 '192.168.0.103',
 '1',
 '2',
 '256',
 '384',
 '384',
 '15.1.544.27',
 '28',
 '2017',
 '12',
 '38',
 '10',
 '0200',
 '03.',
 '.',
 '80',
 '10',
 '8',
 '53',
 '646',
 '03.',
 '.',
 '80',
 '10',
 '8',
 '53',
 '646',
 '15',
 '15.01.0544.030',
 '28',
 '2017',
 '12',
 '38',
 '10',
 '0200',
 '.',
 '.',
 '.',
 '61',
 '7',
 '28',
 '2017',
 '12',
 '38',
 '10',
 '0200',
 '9',
 '437462',
 '842',
 '8',
 '2',
 '7',
 '97',
 '641',
 '.',
 '93',
 '23',
 '8',
 '4444085',
 '3',
 '60',
 '827',
 '.',
 '93',
 '23',
 '8',
 '4444085',
 '3',
 '60',
 '827',
 '.',
 '1',
 '9',
 '437462',
 '842',
 '8',
 '2',
 '7',
 '97',
 '641',
 '.',
 '1.0',
 '03.',
 '.',
 '04',
 '192.168.3.62',
 '2',
 '0753',
 '0',
 '41',
 '47',
 '246',
 '08',
 '475',
 '687',
 '3',
 '.',
 '1.0',
 '00',
 '00',
 '00.2184625']
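
The pattern `[0-9.]+` matches any run of digits and dots, which is why the result contains lone dots and fragments such as '03.'. If the goal is, for example, to extract the IP addresses, a more specific pattern helps. The following sketch is run on two lines copied from the email. The stricter pattern is an illustration only: it does not check that each number is at most 255, and it would also match four-part version strings such as 15.1.544.27.

```python
import re

sample = (
    'Received: from Exchange03.ifo.local (192.168.0.103) by Exchange03.ifo.local\n'
    ' (192.168.0.103) with Microsoft SMTP Server (version=TLS1_2,\n'
)

# Pattern from the cell above: any mix of digits and dots.
loose = re.findall(r'[0-9.]+', sample)

# Four groups of one to three digits separated by dots; '\b' keeps the
# match from starting or ending in the middle of a word or number.
strict = re.findall(r'\b(?:\d{1,3}\.){3}\d{1,3}\b', sample)

print(loose)
print(strict)
```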

In [17]:
text_all

'Received: from Exchange03.ifo.local (192.168.0.103) by Exchange03.ifo.local\n (192.168.0.103) with Microsoft SMTP Server (version=TLS1_2,\n cipher=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384_P384) id 15.1.544.27 via Mailbox\n Transport; Tue, 28 Mar 2017 12:38:10 +0200\nReceived: from Exchange03.ifo.local (192.168.0.103) by Exchange03.ifo.local\n (192.168.0.103) with Microsoft SMTP Server (version=TLS1_2,\n cipher=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384_P384) id 15.1.544.27; Tue, 28\n Mar 2017 12:38:10 +0200\nReceived: from Exchange03.ifo.local ([fe80::10a8:53dd:646d:ffef]) by\n Exchange03.ifo.local ([fe80::10a8:53dd:646d:ffef%15]) with mapi id\n 15.01.0544.030; Tue, 28 Mar 2017 12:38:10 +0200 Content-Type: application/ms-tnef; name="winmail.dat"\nContent-Transfer-Encoding: binary\nFrom: "Mazat, Andreas" <Mazat@ifo.de>\nTo: "Schmitt, Alex" <Schmitt@ifo.de>\nSubject: AW: ifo Python Course\nThread-Topic: ifo Python Course\nThread-Index: AdKnqiuN/Vpu61IJTTW+SBYboPjp7QABSPjA\nDate: Tue, 28 Mar 201

In [23]:
text

['Received: from Exchange03.ifo.local (192.168.0.103) by Exchange03.ifo.local',
 '(192.168.0.103) with Microsoft SMTP Server (version=TLS1_2,',
 'cipher=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384_P384) id 15.1.544.27 via Mailbox',
 'Transport; Tue, 28 Mar 2017 12:38:10 +0200',
 'Received: from Exchange03.ifo.local (192.168.0.103) by Exchange03.ifo.local',
 '(192.168.0.103) with Microsoft SMTP Server (version=TLS1_2,',
 'cipher=TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384_P384) id 15.1.544.27; Tue, 28',
 'Mar 2017 12:38:10 +0200',
 'Received: from Exchange03.ifo.local ([fe80::10a8:53dd:646d:ffef]) by',
 'Exchange03.ifo.local ([fe80::10a8:53dd:646d:ffef%15]) with mapi id',
 '15.01.0544.030; Tue, 28 Mar 2017 12:38:10 +0200 Content-Type: application/ms-tnef; name="winmail.dat"',
 'Content-Transfer-Encoding: binary',
 'From: "Mazat, Andreas" <Mazat@ifo.de>',
 'To: "Schmitt, Alex" <Schmitt@ifo.de>',
 'Subject: AW: ifo Python Course',
 'Thread-Topic: ifo Python Course',
 'Thread-Index: AdKnqiuN/Vpu61IJTT