# Exercises 3.

Deadline for submission: 24th of Oct, 2021

These exercises cover file operations.
To work in Google Colab, use the example inside the <b>extras</b> folder; from now on, however, it is recommended to use Jupyter Notebook or JupyterLab on your own computer.

A guide for installation:
https://test-jupyter.readthedocs.io/en/latest/install.html

## Merging files

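Given two text files, f1.txt and f2.txt, write a function that reads both files from their paths and merges their lines into a new text file, f3.txt, in alternating blocks of growing size (one line from each file, then two lines from each, then three, and so on):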

(1) f1_l1\
(2) f2_l1\
(3) f1_l2\
(4) f1_l3\
(5) f2_l2\
(6) f2_l3\
(7) f1_l4\
(8) f1_l5\
(9) f1_l6\
(10) f2_l4\
(11) f2_l5\
(12) f2_l6\
...

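where fi_lj denotes the jth line of fi.txt (i = 1, 2). If one file runs out of lines, the remaining blocks of the other file follow in order, as the example below shows.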

### Example inputs

<b>f1.txt</b>:

aaa\
bbb\
ccc\
ddd\
eee\
fff\
ggg

<b>f2.txt</b>:

zzz\
yyy\
xxx\
www

### Example output

<b>f3.txt</b>:

aaa\
zzz\
bbb\
ccc\
yyy\
xxx\
ddd\
eee\
fff\
www\
ggg

In [1]:
def read_file(path: str) -> str:
    ''' read the file at path and return its contents as a single string '''
    with open(path) as file:
        data = file.read()
    return data

Usage examples:

In [2]:
f1 = read_file('files/f1.txt').split('\n')
f1

['aaa', 'bbb', 'ccc', 'ddd', 'eee', 'fff', 'ggg']

In [3]:
f2 = read_file('files/f2.txt').split('\n')
f2

['zzz', 'yyy', 'xxx', 'www']
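
A detail to keep in mind when splitting file contents into lines: if a file ends with a newline character, `split('\n')` returns an extra empty string at the end, whereas `str.splitlines()` does not. The short sketch below uses a made-up string (not one of the exercise files) to show the difference.

```python
# hypothetical file content that ends with a trailing newline
text = 'aaa\nbbb\n'

lines_split = text.split('\n')        # keeps an empty string after the last newline
lines_splitlines = text.splitlines()  # drops it

print(lines_split)       # ['aaa', 'bbb', '']
print(lines_splitlines)  # ['aaa', 'bbb']
```

The example files above evidently do not end with a newline, so both approaches give the same lists for them.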

In [4]:
def group_list(input_list: list) -> list:
    ''' split a list into consecutive groups of sizes 1, 2, 3, ...; the last group may be shorter '''
    mod_list = input_list.copy()  # work on a copy so the caller's list is not modified
    n = 1
    output_list = []
    while mod_list:
        n = min(len(mod_list), n)  # never pop more items than remain
        output_list.append([mod_list.pop(0) for _ in range(n)])
        n += 1
    return output_list

Usage examples:

In [5]:
group_list(f1)

[['aaa'], ['bbb', 'ccc'], ['ddd', 'eee', 'fff'], ['ggg']]

In [6]:
group_list(f2)

[['zzz'], ['yyy', 'xxx'], ['www']]
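
The same grouping can also be sketched with `itertools.islice`, which takes the next `n` items from an iterator and so avoids the repeated `pop(0)` calls. This is only an alternative illustration; the name `group_by_growing_size` is made up for this example.

```python
from itertools import islice

def group_by_growing_size(items):
    """Split items into consecutive groups of sizes 1, 2, 3, ...; the last group may be shorter."""
    iterator = iter(items)
    groups = []
    size = 1
    while True:
        group = list(islice(iterator, size))  # take up to `size` items
        if not group:                         # the iterator is exhausted
            break
        groups.append(group)
        size += 1
    return groups

example_groups = group_by_growing_size(['aaa', 'bbb', 'ccc', 'ddd', 'eee', 'fff', 'ggg'])
print(example_groups)
# [['aaa'], ['bbb', 'ccc'], ['ddd', 'eee', 'fff'], ['ggg']]
```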

In [7]:
from itertools import chain, zip_longest

def file_merger_function(f1_path, f2_path):
    # 1) read file 2) split it into lines 3) group the lines into blocks of growing size
    f1 = group_list(read_file(f1_path).split('\n'))
    f2 = group_list(read_file(f2_path).split('\n'))

    # pair up the blocks of f1 and f2; zip_longest pads the shorter list with None, which is dropped
    result = [[x for x in t if x is not None] for t in zip_longest(f1, f2)]

    # flatten twice: pairs of blocks -> blocks -> lines
    result = list(chain.from_iterable(chain.from_iterable(result)))

    # join the lines with newline characters
    result = '\n'.join(result)

    # write to file
    with open('files/f3.txt', 'w') as f3:
        f3.write(result)
    return result

In [8]:
file_merger_function('files/f1.txt', 'files/f2.txt')

'aaa\nzzz\nbbb\nccc\nyyy\nxxx\nddd\neee\nfff\nwww\nggg'
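
As a cross-check, the whole merge can be sketched in a single pass that alternately takes 1, 2, 3, ... lines from each file. The function name `merge_files` and its `f3_path` parameter are assumptions made for this illustration, and the demo writes the example inputs to a temporary directory so that it does not depend on the `files` folder.

```python
import tempfile
from itertools import islice
from pathlib import Path

def merge_files(f1_path, f2_path, f3_path):
    """Merge two text files in alternating blocks of 1, 2, 3, ... lines."""
    lines1 = iter(Path(f1_path).read_text().splitlines())
    lines2 = iter(Path(f2_path).read_text().splitlines())
    merged = []
    size = 1
    while True:
        # take up to `size` lines from each file, f1 first
        block = list(islice(lines1, size)) + list(islice(lines2, size))
        if not block:  # both files are exhausted
            break
        merged.extend(block)
        size += 1
    Path(f3_path).write_text('\n'.join(merged))
    return merged

# demo on temporary copies of the example inputs
with tempfile.TemporaryDirectory() as tmp:
    p1, p2, p3 = (Path(tmp) / name for name in ('f1.txt', 'f2.txt', 'f3.txt'))
    p1.write_text('aaa\nbbb\nccc\nddd\neee\nfff\nggg')
    p2.write_text('zzz\nyyy\nxxx\nwww')
    merged_lines = merge_files(p1, p2, p3)
    written = p3.read_text()

print(merged_lines)
# ['aaa', 'zzz', 'bbb', 'ccc', 'yyy', 'xxx', 'ddd', 'eee', 'fff', 'www', 'ggg']
```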

## Word statistics

Compute the word distribution of 'The Jungle Book' by Rudyard Kipling:
- The file 'JB.txt' contains the book.
- Use a dictionary in which the individual words are the keys and the word counts are the values.
- Use only lower-case letters (convert upper-case letters to lower case), and omit typographical symbols and punctuation marks (such as ., :, !, ?).

In [9]:
with open('files/JB.txt', 'r') as file:
    data = file.read()

In [10]:
# raw string example
data[:450]

"The Jungle Book\nby\nRudyard Kipling\n\n\nMowgli's Brothers\n\nNow Rann the Kite brings home the night\nThat Mang the Bat sets free--\nThe herds are shut in byre and hut\nFor loosed till dawn are we.\nThis is the hour of pride and power,\nTalon and tush and claw.\nOh, hear the call!--Good hunting all\nThat keep the Jungle Law!\nNight-Song in the Jungle\n\nIt was seven o'clock of a very warm evening in the Seeonee hills\nwhen Father Wolf woke up from his day's rest"

In [11]:
# printed output example
print(data[:450])

The Jungle Book
by
Rudyard Kipling


Mowgli's Brothers

Now Rann the Kite brings home the night
That Mang the Bat sets free--
The herds are shut in byre and hut
For loosed till dawn are we.
This is the hour of pride and power,
Talon and tush and claw.
Oh, hear the call!--Good hunting all
That keep the Jungle Law!
Night-Song in the Jungle

It was seven o'clock of a very warm evening in the Seeonee hills
when Father Wolf woke up from his day's rest


In [12]:
from collections import Counter
from re import sub

def count_unique_words(string: str) -> dict:
    # convert the string to lower case
    string = string.lower()
    # remove all special characters (excluding whitespace);
    # the raw string keeps \s from being treated as a string escape
    string = sub(r'[^A-Za-z0-9\s]+', '', string)
    # split the string into words (the result is a list)
    words = string.split()
    # count the occurrences of each unique word and store them in a dictionary
    result = dict(Counter(words))
    return result
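
Deleting punctuation outright joins words that were separated only by dashes or hyphens; for instance, `call!--Good` becomes `callgood`. A possible alternative, sketched below under the assumption that such words should be counted separately, treats punctuation as a separator and only deletes apostrophes. The function name and the sample sentence are made up for this illustration, and the resulting counts for the book would differ from the ones shown in this notebook.

```python
import re
from collections import Counter

def count_words_split_on_punctuation(text):
    """Count words, treating punctuation as a separator instead of deleting it."""
    text = text.lower().replace("'", "")    # drop apostrophes so "o'clock" stays one word
    words = re.findall(r'[a-z0-9]+', text)  # every run of letters or digits is one word
    return dict(Counter(words))

sample = "The call!--Good hunting, the Night-Song at seven o'clock"
sample_counts = count_words_split_on_punctuation(sample)
print(sample_counts)
# {'the': 2, 'call': 1, 'good': 1, 'hunting': 1, 'night': 1, 'song': 1, 'at': 1, 'seven': 1, 'oclock': 1}
```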

In [13]:
JB_wordcount = count_unique_words(data)
JB_wordcount

{'the': 1141,
 'jungle': 89,
 'book': 1,
 'by': 59,
 'rudyard': 1,
 'kipling': 1,
 'mowglis': 8,
 'brothers': 6,
 'now': 46,
 'rann': 8,
 'kite': 4,
 'brings': 1,
 'home': 4,
 'night': 22,
 'that': 223,
 'mang': 3,
 'bat': 4,
 'sets': 3,
 'free': 16,
 'herds': 2,
 'are': 61,
 'shut': 7,
 'in': 233,
 'byre': 2,
 'and': 693,
 'hut': 2,
 'for': 157,
 'loosed': 1,
 'till': 22,
 'dawn': 5,
 'we': 74,
 'this': 45,
 'is': 189,
 'hour': 3,
 'of': 388,
 'pride': 8,
 'power': 4,
 'talon': 1,
 'tush': 1,
 'claw': 1,
 'oh': 7,
 'hear': 7,
 'callgood': 1,
 'hunting': 21,
 'all': 111,
 'keep': 10,
 'law': 30,
 'nightsong': 1,
 'it': 129,
 'was': 170,
 'seven': 1,
 'oclock': 1,
 'a': 349,
 'very': 44,
 'warm': 7,
 'evening': 3,
 'seeonee': 7,
 'hills': 5,
 'when': 74,
 'father': 29,
 'wolf': 79,
 'woke': 3,
 'up': 79,
 'from': 54,
 'his': 215,
 'days': 9,
 'rest': 2,
 'scratched': 2,
 'himself': 21,
 'yawned': 2,
 'spread': 1,
 'out': 44,
 'paws': 6,
 'one': 65,
 'after': 8,
 'other': 11,
 'to': 389,

## Invert the dictionary from the previous exercise

Create a new dictionary in which the word counts are the keys and each value is the list of words that occur that many times in the previous dictionary.

### Example

From

    d = {
        'a': 8,
        'b': 3,
        'c': 3,
        'd': 2,
        't': 1,
        'z': 1
    }

to

    {
        1: ['t', 'z'],
        2: ['d'],
        3: ['b', 'c'],
        8: ['a']
    }

In [14]:
d = { 'a' : 8, 'b' : 3, 'c' : 3, 'd' : 2, 't' : 1, 'z' : 1 }

In [15]:
def reverse_dict(input_dict: dict) -> dict:
    # empty dict for storing the results
    output_dict = {}

    # iterate through all key-value pairs and collect the keys that share a value
    for k, v in input_dict.items():
        output_dict.setdefault(v, []).append(k)
    # sort results (ascending by key)
    output_dict = dict(sorted(output_dict.items()))
    return output_dict

In [16]:
reverse_dict(d)

{1: ['t', 'z'], 2: ['d'], 3: ['b', 'c'], 8: ['a']}
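
For comparison, the inversion can also be sketched with `collections.defaultdict`, which creates the empty list for a new count automatically. The function name `invert_counts` is an assumption made for this illustration; it is not part of the exercise.

```python
from collections import defaultdict

def invert_counts(counts):
    """Map each count to the list of keys that have this count, sorted by count."""
    inverted = defaultdict(list)
    for word, count in counts.items():
        inverted[count].append(word)  # a missing count starts as an empty list
    # return a plain dict with the counts in ascending order
    return {count: inverted[count] for count in sorted(inverted)}

example = {'a': 8, 'b': 3, 'c': 3, 'd': 2, 't': 1, 'z': 1}
inverted_example = invert_counts(example)
print(inverted_example)
# {1: ['t', 'z'], 2: ['d'], 3: ['b', 'c'], 8: ['a']}
```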

In [17]:
reverse_dict(JB_wordcount)

{1: ['book',
  'rudyard',
  'kipling',
  'brings',
  'loosed',
  'talon',
  'tush',
  'claw',
  'callgood',
  'nightsong',
  'seven',
  'oclock',
  'spread',
  'rid',
  'tips',
  'tumbling',
  'squealing',
  'augrh',
  'crossed',
  'threshold',
  'chief',
  'jackaltabaqui',
  'dishlickerand',
  'eating',
  'rags',
  'leather',
  'rubbishheaps',
  'apt',
  'hides',
  'disgraceful',
  'overtake',
  'hydrophobia',
  'dewaneethe',
  'stiffly',
  'feast',
  'gidurlog',
  'scuttled',
  'cracking',
  'merrily',
  'unlucky',
  'compliment',
  'uncomfortable',
  'rejoicing',
  'spitefully',
  'angrilyby',
  'change',
  'due',
  'frighten',
  'birth',
  'scour',
  'alight',
  'gratitude',
  'thickets',
  'saved',
  'snarly',
  'singsong',
  'bullocks',
  'hsh',
  'quarter',
  'compass',
  'bewilders',
  'gypsies',
  'beetles',
  'forbids',
  'real',
  'mankilling',
  'arrival',
  'guns',
  'gongs',
  'rockets',
  'torches',
  'suffers',
  'weakest',
  'defenseless',
  'unsportsmanlike',
  'tooan