# 02 - Working with Files, Texts, and Regular Expressions

First, we need to download the data (the Git repository) and install an external library (TextDirectory) using [pip](https://en.wikipedia.org/wiki/Pip_(package_manager)).

In [155]:
%%capture
!git clone https://github.com/IngoKl/python-programming-for-linguists
!pip install textdirectory

In [157]:
!ls

python-programming-for-linguists  results.txt  text.txt
results.tst			  sample_data


## 1. Reading and Writing Files

In [None]:
from pathlib import Path

In [None]:
data_folder = Path('python-programming-for-linguists/2020/data/wikipedia/')

In [158]:
!ls python-programming-for-linguists/2020/data/wikipedia/

cologne.txt  linguistics.txt  python.txt


In [171]:
with open(data_folder / 'python.txt', 'r', encoding='utf-8') as f:
  data = f.read()
  #data = f.read(10) # Read ten characters (text mode counts characters, not bytes)
  #data = f.readlines() # Read all lines into a list

In [172]:
data

'Python is an interpreted, high-level and general-purpose programming language. Python\'s design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.\nPython is dynamically typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly, procedural), object-oriented, and functional programming. Python is often described as a "batteries included" language due to its comprehensive standard library.\nPython was created in the late 1980s, and first released in 1991, by Guido van Rossum as a successor to the ABC programming language. Python 2.0, released in 2000, introduced new features, such as list comprehensions, and a garbage collection system with reference counting, and was discontinued with version 2.7 in 2020. Python 3.0, released in 2008, was a major revision of the la
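As a small variation on the cell above, `pathlib` can also read a file without an explicit `open()` call, and a file object can be iterated line by line. The sketch below is self-contained: it writes a throwaway file (the name `example.txt` and its content are made up for illustration) and reads it back, passing `encoding='utf-8'` explicitly so the result does not depend on the platform default.

```python
from pathlib import Path
import tempfile

# Create a throwaway file so the example runs anywhere.
# The file name and its content are hypothetical.
tmp_dir = Path(tempfile.mkdtemp())
example_file = tmp_dir / 'example.txt'
example_file.write_text('First line\nSecond line\n', encoding='utf-8')

# Read the whole file into a single string.
content = example_file.read_text(encoding='utf-8')

# Iterate over the file line by line; this avoids loading a large
# file into memory all at once.
lines = []
with open(example_file, 'r', encoding='utf-8') as f:
  for line in f:
    lines.append(line.rstrip('\n'))

print(repr(content))
print(lines)
```

Reading line by line is usually the better choice for large corpora, while `read_text()` is convenient for small files like the Wikipedia samples used here.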

### Writing Files

In [177]:
data = 'This is some text we want to store!'

with open('text.txt', 'w') as f:
  f.write(data)

In [178]:
!cat text.txt

This is some text we want to store!
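Note that mode `'w'` replaces the content of an existing file. A minimal sketch of appending instead, using mode `'a'`; the file name `notes.txt` is hypothetical and the file is created in a temporary directory so nothing in the working directory is touched.

```python
from pathlib import Path
import tempfile

notes_file = Path(tempfile.mkdtemp()) / 'notes.txt'  # hypothetical file

# 'w' creates the file, or truncates it if it already exists.
with open(notes_file, 'w', encoding='utf-8') as f:
  f.write('first entry\n')

# 'a' keeps the existing content and writes at the end.
with open(notes_file, 'a', encoding='utf-8') as f:
  f.write('second entry\n')

stored = notes_file.read_text(encoding='utf-8')
print(stored)
```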

## 2. Working with Text

### Basics

In [273]:
s = 'ABC'

In [182]:
s[0]

'A'

In [183]:
print(s[0:2])
print(len(s))

AB
3


In [274]:
if 'A' in s:
  print('"A" is in the string.')
else:
  print('"A" is not in the string.')

"A" is in the string.


### String Methods

A full list of string methods can be found in the [official Python documentation](https://docs.python.org/3/library/stdtypes.html#string-methods). 

In [199]:
s = 'Hello World'

In [192]:
s.upper()

'HELLO WORLD'

In [193]:
s.lower()

'hello world'

In [195]:
s.find('World')

6

In [200]:
s.isdigit()

False

In [203]:
s.split()

['Hello', 'World']

In [271]:
s.replace('e', 'E')

'HEllo World'
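String methods return a new string and leave the original untouched, which is why they can be chained. The following sketch combines a few of them into a rough normalisation step; the sample sentence and the set of punctuation characters are made up for illustration.

```python
raw = '  Hello World, hello Python!  '

# Each method returns a new string; `raw` itself is not modified.
cleaned = raw.strip().lower()

# Split on whitespace, then strip punctuation from both ends of each token.
tokens = [token.strip('.,!?') for token in cleaned.split()]

hello_count = cleaned.count('hello')
rejoined = ' '.join(tokens)

print(tokens)
print(hello_count)
print(rejoined)
```

To keep a result, assign it to a name, for example `s = s.replace('e', 'E')`.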

### *difflib*

difflib: "This module provides classes and functions for comparing sequences." ([Documentation](https://docs.python.org/3/library/difflib.html#))

In [208]:
import difflib

In [269]:
sequence_a = 'Linguistics is awesome'
sequence_b = 'Linguistics is great'

In [270]:
sm = difflib.SequenceMatcher(a=sequence_a, b=sequence_b)
sm.ratio()

0.7619047619047619
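The difflib documentation defines `ratio()` as `2.0 * M / T`, where `T` is the total number of elements in both sequences and `M` is the number of matching elements. The sketch below recomputes the value from `get_matching_blocks()` to make that relationship visible; the variable names `matched`, `total`, and `manual_ratio` are only illustrative.

```python
import difflib

sequence_a = 'Linguistics is awesome'
sequence_b = 'Linguistics is great'

sm = difflib.SequenceMatcher(a=sequence_a, b=sequence_b)

# Each block is (start in a, start in b, size); the last block is a
# zero-length sentinel that marks the end of the sequences.
blocks = sm.get_matching_blocks()
for block in blocks:
  print(repr(sequence_a[block.a:block.a + block.size]))

matched = sum(block.size for block in blocks)   # M
total = len(sequence_a) + len(sequence_b)       # T
manual_ratio = 2.0 * matched / total

print(manual_ratio, sm.ratio())
```

A ratio of 1.0 means the sequences are identical, and 0.0 means they have nothing in common.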

## 3. Regular Expressions

In [218]:
import re

In [229]:
text = 'Despite carefully cleaning the crime scene she was quickly captured by police.'

In [220]:
pattern = r'\w+ly'

In [221]:
matches = re.findall(pattern, text)

In [222]:
matches

['carefully', 'quickly']

### Groups

In [268]:
pattern = r'((\w+)ly)'
matches = re.findall(pattern, text)

In [267]:
for match in matches:
  print(match)

('carefully', 'careful')
('quickly', 'quick')


In [266]:
for match in matches:
  print(match[1])

careful
quick
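Accessing groups by position (`match[1]`) gets hard to read as patterns grow. One possible alternative is to name the groups and iterate with `re.finditer()`, which yields match objects instead of tuples. In this sketch the group names `adverb` and `stem` are arbitrary choices, and a `\b` is added so that the match has to end at a word boundary.

```python
import re

text = 'Despite carefully cleaning the crime scene she was quickly captured by police.'

# (?P<name>...) gives a group a name; \b requires a word boundary.
pattern = re.compile(r'(?P<adverb>(?P<stem>\w+)ly)\b')

stems = []
for match in pattern.finditer(text):
  # Groups can be read by name; start() is the position in the text.
  print(match.group('adverb'), '->', match.group('stem'), 'at', match.start())
  stems.append(match.group('stem'))

print(stems)
```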


### re.sub()

In [264]:
text = '$25 and $30'

In [265]:
re.sub(r'(\$)([0-9]*)', r'\2\1', text)

'25$ and 30$'
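Two variations on `re.sub()` that may be useful: the replacement can reference groups with the unambiguous `\g<n>` syntax, and it can also be a function that receives each match object and returns the replacement string. The helper name `shout` below is made up for illustration.

```python
import re

# \g<2>\g<1> is equivalent to \2\1, but stays unambiguous when a digit
# follows the reference.
swapped = re.sub(r'(\$)([0-9]+)', r'\g<2>\g<1>', '$25 and $30')
print(swapped)

text = 'Despite carefully cleaning the crime scene she was quickly captured by police.'

def shout(match):
  # Called once per match; the return value replaces the matched text.
  return match.group(0).upper()

highlighted = re.sub(r'\w+ly\b', shout, text)
print(highlighted)
```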

### Putting Things Together

In [243]:
with open(data_folder / 'linguistics.txt', 'r') as f:
  data = f.read()

In [246]:
words = re.findall(r'\w+ly\b', data)
words

['traditionally', 'directly', 'logically', 'particularly']
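The `\b` at the end of the pattern above is a word boundary. Without it, `\w+ly` also matches the beginning of longer words that merely contain "ly". A short sketch of the difference, using a made-up sample sentence:

```python
import re

sample = 'The fly was flying slowly.'

# Without \b, the "fly" inside "flying" is matched as well.
without_boundary = re.findall(r'\w+ly', sample)

# With \b, "ly" has to be followed by a word boundary.
with_boundary = re.findall(r'\w+ly\b', sample)

print(without_boundary)  # ['fly', 'fly', 'slowly']
print(with_boundary)     # ['fly', 'slowly']
```

Even with `\b`, the pattern is only a rough heuristic: it finds words ending in "ly" (such as "fly"), not adverbs as such.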

In [247]:
with open('results.txt', 'w') as f:
  f.writelines(words)

In [248]:
!cat results.txt

traditionallydirectlylogicallyparticularly
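The words run together because `writelines()` writes each item exactly as it is and does not add line endings. A minimal sketch that writes one word per line instead; it uses a temporary directory rather than the notebook's working directory, which is an assumption made only to keep the example self-contained.

```python
from pathlib import Path
import tempfile

words = ['traditionally', 'directly', 'logically', 'particularly']
results_file = Path(tempfile.mkdtemp()) / 'results.txt'

with open(results_file, 'w', encoding='utf-8') as f:
  # Append the newline to every word before writing it.
  f.writelines(word + '\n' for word in words)

written = results_file.read_text(encoding='utf-8')
print(written)
```

An equivalent option is `f.write('\n'.join(words))`, which leaves out the trailing newline.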

## 4. TextDirectory

[TextDirectory](https://github.com/IngoKl/textdirectory) allows you to combine multiple text files into one aggregated file. You can also filter files based on various criteria and run transformations before aggregating them.

In [249]:
import textdirectory

In [250]:
data_folder = Path('python-programming-for-linguists/2020/data/wikipedia/')

td = textdirectory.TextDirectory(directory=data_folder)

In [263]:
td.load_files(filetype='txt', sort=True)

In [261]:
td.print_aggregation()


|------------------------------------------------------------|
|path                     |characters|tokens|transformed_text|
|------------------------------------------------------------|
|python-programming-for-li|3061      |490   |False           |
|python-programming-for-li|2119      |306   |False           |
|------------------------------------------------------------|

Staged Transformations: [['transformation_uppercase'], ['transformation_uppercase']]


In [260]:
td.filter_by_min_chars(2000)
td.print_aggregation()


|------------------------------------------------------------|
|path                     |characters|tokens|transformed_text|
|------------------------------------------------------------|
|python-programming-for-li|3061      |490   |False           |
|python-programming-for-li|2119      |306   |False           |
|------------------------------------------------------------|

Staged Transformations: [['transformation_uppercase'], ['transformation_uppercase']]


In [259]:
td.stage_transformation(['transformation_uppercase'])
td.print_aggregation()


|------------------------------------------------------------|
|path                     |characters|tokens|transformed_text|
|------------------------------------------------------------|
|python-programming-for-li|3061      |490   |False           |
|python-programming-for-li|2119      |306   |False           |
|------------------------------------------------------------|

Staged Transformations: [['transformation_uppercase'], ['transformation_uppercase']]


In [258]:
td.aggregate_to_memory()

'COLOGNE IS THE LARGEST CITY OF GERMANY\'S MOST POPULOUS FEDERAL STATE OF NORTH RHINE-WESTPHALIA AND THE FOURTH-MOST POPULOUS CITY IN GERMANY. WITH SLIGHTLY OVER A MILLION INHABITANTS (1.09 MILLION) WITHIN ITS CITY BOUNDARIES, COLOGNE IS THE LARGEST CITY ON THE RHINE AND ALSO THE MOST POPULOUS CITY BOTH OF THE RHINE-RUHR METROPOLITAN REGION, WHICH IS GERMANY\'S LARGEST AND ONE OF EUROPE\'S MAJOR METROPOLITAN AREAS, AND OF THE RHINELAND. CENTERED ON THE LEFT BANK OF THE RHINE, COLOGNE IS ABOUT 45 KILOMETRES (28 MI) SOUTHEAST OF NORTH RHINE-WESTPHALIA\'S CAPITAL OF DÜSSELDORF AND 25 KILOMETRES (16 MI) NORTHWEST OF BONN. IT IS THE LARGEST CITY IN THE CENTRAL FRANCONIAN AND RIPUARIAN DIALECT AREAS.\nTHE CITY\'S COLOGNE CATHEDRAL (KÖLNER DOM) IS THE SEAT OF THE CATHOLIC ARCHBISHOP OF COLOGNE. THERE ARE MANY INSTITUTIONS OF HIGHER EDUCATION IN THE CITY, MOST NOTABLY THE UNIVERSITY OF COLOGNE (UNIVERSITÄT ZU KÖLN), ONE OF EUROPE\'S OLDEST AND LARGEST UNIVERSITIES, THE TECHNICAL UNIVERSITY OF 
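To make the load, filter, transform, and aggregate steps more concrete, here is a rough standard-library approximation. This is not how TextDirectory is implemented: the function name `aggregate`, its parameters, the stand-in files, and the newline used to join the texts are all assumptions made for this sketch.

```python
from pathlib import Path
import tempfile

# Build a small stand-in corpus; file names and contents are made up.
corpus_dir = Path(tempfile.mkdtemp())
(corpus_dir / 'a.txt').write_text('A short text.', encoding='utf-8')
(corpus_dir / 'b.txt').write_text(
  'A considerably longer text about language.', encoding='utf-8')
(corpus_dir / 'notes.md').write_text('Not a txt file.', encoding='utf-8')

def aggregate(directory, min_chars=0, transform=None):
  """Concatenate all .txt files that have at least min_chars characters."""
  texts = []
  for path in sorted(directory.glob('*.txt')):  # load, sorted by name
    text = path.read_text(encoding='utf-8')
    if len(text) < min_chars:                   # filter
      continue
    if transform is not None:                   # transform
      text = transform(text)
    texts.append(text)
  return '\n'.join(texts)                       # aggregate

aggregated = aggregate(corpus_dir, min_chars=20, transform=str.upper)
print(aggregated)
```

Compared with this sketch, TextDirectory additionally keeps per-file metadata (as shown by `print_aggregation()`) and lets you stage several filters and transformations before aggregating.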