# File manipulation and searching

## Reading files

Files can be opened in Python with the built-in function ```open```, which takes the file path, a mode (e.g. reading or writing) and an encoding.

The statement ```with open(file) as f``` **automatically closes the file** when the block ends (otherwise, the file must be closed explicitly by calling the ```close()``` method).

In [4]:
with open("data/file.txt", encoding="utf-8") as f:
    for line in f:
        print(line)


Whose biscuit is that? I think I know.

Its owner is quite happy though.

Full of joy like a vivid rainbow,

I watch him laugh. I cry hello.
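
The blank lines in the output above appear because each line read from the file still ends with a newline, and ```print``` adds a second one. The following is a minimal sketch of how to strip that newline and check that the file is closed after the ```with``` block. The scratch file and its contents are assumptions made for illustration, not part of the original data.

```python
import os
import tempfile

# Hypothetical scratch file, so the example does not depend on data/file.txt
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("first line\nsecond line\n")

with open(path, encoding="utf-8") as f:
    # rstrip("\n") drops the newline that each line carries over from the file
    lines = [line.rstrip("\n") for line in f]

print(lines)     # ['first line', 'second line']
print(f.closed)  # True: the with block has already closed the file
```

Alternatively, ```print(line, end="")``` avoids the doubled newline without modifying the line itself.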



### JSON
JSON files contain structured data. A common workflow is to load them into Python objects, typically dictionaries and lists, for further manipulation (deserialization).

First, import the ```json``` library.

In [4]:
import json

# Load the JSON data file into a Python dictionary
with open('data/data.json') as data:
    d = json.load(data)

print(d)

{'quiz': {'sport': {'q1': {'question': 'Which one is correct team name in NBA?', 'options': ['New York Bulls', 'Los Angeles Kings', 'Golden State Warriros', 'Huston Rocket'], 'answer': 'Huston Rocket'}}, 'maths': {'q1': {'question': '5 + 7 = ?', 'options': ['10', '11', '12', '13'], 'answer': '12'}, 'q2': {'question': '12 - 8 = ?', 'options': ['1', '2', '3', '4'], 'answer': '4'}}}}


Another useful function is ```json.loads()```, which parses a JSON string (rather than a file) into a Python object, e.g. a dictionary.
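
As a small illustration of parsing from a string, the sketch below uses a made-up JSON text (an assumption, not taken from ```data/data.json```) and also shows ```json.dumps()```, which goes in the opposite direction.

```python
import json

# Illustrative JSON text, not taken from data/data.json
text = '{"name": "Alice", "age": 31, "tags": ["a", "b"]}'

record = json.loads(text)      # JSON string -> Python dict
print(type(record).__name__)   # dict
print(record["age"])           # 31

# json.dumps() serializes a Python object back into a JSON string
print(json.dumps(record))
```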


## Writing files

Files can be written in much the same way as they are read, by opening them in write mode (```"w"```).
**NOTE** The method ```writelines()``` does not add line separators, so everything ends up on a single line! **Make sure to add newlines and spacing as required!**

In [5]:

lines = ["Volli volli,",  "fortissimamente", "volli"]

with open("data/out_poem.txt", "w") as out:
    out.writelines(s + '\n' for s in lines)
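
To make the note about ```writelines()``` concrete, here is a short sketch comparing a call without explicit newlines to one that joins the lines with ```"\n"```. The temporary file path is an assumption used only so the example runs anywhere.

```python
import os
import tempfile

lines = ["Volli volli,", "fortissimamente", "volli"]
path = os.path.join(tempfile.mkdtemp(), "poem.txt")  # hypothetical scratch file

# Without explicit newlines, all strings are written back to back
with open(path, "w", encoding="utf-8") as out:
    out.writelines(lines)
with open(path, encoding="utf-8") as f:
    joined = f.read()

# Joining with "\n" is another way to add the separators
with open(path, "w", encoding="utf-8") as out:
    out.write("\n".join(lines) + "\n")
with open(path, encoding="utf-8") as f:
    separated = f.read().splitlines()

print(joined)     # Volli volli,fortissimamentevolli
print(separated)  # ['Volli volli,', 'fortissimamente', 'volli']
```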

### Writing JSON

Writing to a JSON file is very similar: pass the data and the open file to ```json.dump()```.


In [7]:
import json
data = [
    {"name": "Alice", "age": 31},
    {"name": "Bob", "age": 24},
    {"name": "Charlie", "age": 38 },
]

with open("data/json_output.json", "w") as out:
    json.dump(data, out)
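
A quick way to check what was written is to load the file back. The sketch below assumes a temporary file and also passes ```indent=2```, an optional argument of ```json.dump()``` that makes the output easier to read.

```python
import json
import os
import tempfile

data = [{"name": "Alice", "age": 31}, {"name": "Bob", "age": 24}]
path = os.path.join(tempfile.mkdtemp(), "people.json")  # hypothetical scratch file

with open(path, "w", encoding="utf-8") as out:
    json.dump(data, out, indent=2)  # indent=2 pretty-prints the file

with open(path, encoding="utf-8") as f:
    restored = json.load(f)

print(restored == data)  # True: the round trip preserves the data
```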

### Writing CSV

CSV is another common data format. 

The following snippet illustrates how to write data stored as a **list of lists** to a CSV file.

In [20]:
import csv

# Add Header
header = ["name", "age"]
# Add data
data = [
        ["Alice", 31],
        ["Bob", 24],
        ["Charlie", 30]
        ]

with open("data/lst_data.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(header)
    writer.writerows(data)

The following snippet shows how to write data stored as a **list of dictionaries** to a CSV file.

In [19]:
import csv

header = ['name', 'age']
data = [
    {"name": "Alice", "age": 31},
    {"name": "Bob", "age": 24},
    {"name": "Charlie", "age": 38 },
]

with open("data/dct_data.csv", "w", encoding="utf-8", newline="") as out:
    writer = csv.DictWriter(out, fieldnames = header)
    writer.writeheader()
    writer.writerows(data)
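
Reading such a file back works symmetrically with ```csv.DictReader```. One thing to keep in mind is that CSV stores everything as text, so numbers come back as strings. The sketch below writes to a temporary file (an assumption for illustration) and converts the ```age``` column back to ```int```.

```python
import csv
import os
import tempfile

rows = [{"name": "Alice", "age": 31}, {"name": "Bob", "age": 24}]
path = os.path.join(tempfile.mkdtemp(), "people.csv")  # hypothetical scratch file

with open(path, "w", encoding="utf-8", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=["name", "age"])
    writer.writeheader()
    writer.writerows(rows)

with open(path, encoding="utf-8", newline="") as f:
    loaded = [dict(row) for row in csv.DictReader(f)]

print(loaded)  # [{'name': 'Alice', 'age': '31'}, {'name': 'Bob', 'age': '24'}]

# Values are strings after reading, so convert them explicitly
typed = [{"name": r["name"], "age": int(r["age"])} for r in loaded]
print(typed == rows)  # True
```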

## Parsing and cleaning input from file

### Removing whitespace

In [10]:
def remove_space(file):
    """Split each non-blank line into words, dropping all whitespace."""
    cleaned = []

    with open(file, "r", encoding="utf-8") as f:
        for line in f:
            # Skip lines that contain only whitespace
            if not line.isspace():
                cleaned.append(line.split())
    return cleaned


f = remove_space("data/poem.txt")
print(f"remove_space: {f}\n")  # File as a list of lists of strings, each line a new list

remove_space: [['The', 'cheese-mites', 'asked', 'how', 'the', 'cheese', 'got', 'there,'], ['And', 'warmly', 'debated', 'the', 'matter;'], ['The', 'Orthodox', 'said', 'that', 'it', 'came', 'from', 'the', 'air,'], ['And', 'the', 'Heretics', 'said', 'from', 'the', 'platter.'], ['They', 'argued', 'it', 'long', 'and', 'they', 'argued', 'it', 'strong,'], ['And', 'I', 'hear', 'they', 'are', 'arguing', 'now;'], ['But', 'of', 'all', 'the', 'choice', 'spirits', 'who', 'lived', 'in', 'the', 'cheese,'], ['Not', 'one', 'of', 'them', 'thought', 'of', 'a', 'cow.']]



### Keeping only numbers from file



In [13]:
def keep_numbers(file):
    """Join the digits found on each line into one int per line."""
    cleaned = []
    with open(file, "r", encoding="utf-8") as f:
        for line in f:
            # Keep only the digit characters; isdecimal() guarantees int() accepts them
            digits = "".join(char for char in line if char.isdecimal())
            if digits:
                cleaned.append(int(digits))
    return cleaned

f = keep_numbers("data/complex_file.txt")
print(f'keep_numbers: {f}\n')  # File returned as a list of ints, no whitespace

keep_numbers: [465, 67, 9034, 21, 86255, 123178319829389, 12938]



### Keeping only alphabetic characters



In [14]:
def keep_alphanumeric(file):
    """Keep the alphabetic characters of the first word on each line.

    Despite its name, this function discards digits as well as symbols.
    """
    cleaned = []
    with open(file) as f:
        for line in f:
            words = line.split()
            if words:
                # Only the first whitespace-separated token is considered
                letters = "".join(char for char in words[0] if char.isalpha())
                if letters:
                    cleaned.append(letters)
    return cleaned


f = keep_alphanumeric("data/complex_file.txt")
print(f"keep_alphanumeric: {f}\n")

keep_alphanumeric: ['SQJSGqCIhfmYsQizUlhtKmFo', 'YkBYVe', 'sHmFxGlDcZypjXpF', 'Fr', 'tmbfYDLreNEasPrRo', 'mBvPjgyAzUDKMVZmo', 'fIjWQXyuCZZHbIozo', 'One', 'He', 'The', 'seemed', 'His']



### Remove special characters

In [15]:
def remove_special_chars(file):
    '''
    Replace every non-alphabetic character with a space, then split each line into words
    '''
    poem = []
    with open(file, encoding="utf8") as f:
        for line in f:
            if not line.isspace():
                # Iterate over characters: keep letters, turn anything else into a space
                filter_line = [char if char.isalpha() else ' ' for char in line]
                alpha_line = ''.join(filter_line)
                poem.append(alpha_line.split())
    return poem


f = remove_special_chars("data/complex_file.txt")
print(f"Remove special chars: {f}\n")

Remove special chars: [['S', 'QJSG', 'qC', 'Ih', 'fmYsQi', 'zUlhtKmFo'], ['Y', 'k', 'B', 'Y', 'Ve'], ['sHmFxG', 'l', 'DcZy', 'pjX', 'pF'], ['F', 'r'], ['t', 'm', 'bf', 'YD', 'Lr', 'e', 'NEasPr', 'R', 'o'], ['mB', 'vPjg', 'yA', 'z', 'U', 'D', 'KM', 'V', 'Zmo'], ['fIj', 'WQX', 'yu', 'CZZH', 'bI', 'o', 'z', 'o'], [], [], ['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin'], ['He', 'lay', 'on', 'his', 'armour', 'like', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections'], ['The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and'], ['seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment'], ['His', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved',

In [16]:
def clean_file(inputfilename):
    """
    Read a file, strip whitespace, split each line at commas, convert the numeric
    tokens to int, and group them into tuples of length 5 (the last may be shorter)
    """
    with open(inputfilename) as f:
        lst = []
        for line in f:
            # Remove all whitespace, then split the line on commas
            tokens = "".join(line.split()).split(",")
            # isdecimal() guarantees that int() accepts the token
            filtered = [int(token) for token in tokens if token.isdecimal()]
            split_filtered = [tuple(filtered[i:i+5]) for i in range(0, len(filtered), 5)]
            lst.append(split_filtered)
    return lst
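
Since the cell above has no output, the sketch below applies the same steps to a single made-up line (an assumption, not taken from any data file) to show what the grouping produces.

```python
# Illustrative input line, not taken from a real data file
line = "1, 2, 3, x, 4, 5, 6, 7\n"

# Remove all whitespace, then split on commas
tokens = "".join(line.split()).split(",")

# Keep only the tokens made of digits and convert them to int
numbers = [int(token) for token in tokens if token.isdecimal()]

# Slice the list into chunks of 5; the last chunk may be shorter
groups = [tuple(numbers[i:i + 5]) for i in range(0, len(numbers), 5)]

print(groups)  # [(1, 2, 3, 4, 5), (6, 7)]
```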




#### TODO

In [None]:
def replace_characters(file):
    """
    Replace multiple characters 
    """
    pass
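
One possible way to approach this TODO is ```str.translate```, which replaces several characters in a single pass. This is only a sketch: the helper name ```replace_chars```, its signature, and the mapping are hypothetical, and it works on a string rather than on a file.

```python
def replace_chars(text, mapping):
    """Replace every character that appears as a key in mapping (hypothetical helper)."""
    # str.maketrans accepts a dict whose keys are single characters;
    # an empty string as value deletes the character
    return text.translate(str.maketrans(mapping))


cleaned = replace_chars("a-b_c!", {"-": " ", "_": " ", "!": ""})
print(cleaned)  # a b c
```

To apply it to a file, the same call could be made on each line inside a ```with open(...)``` block.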