# File Management with Python

This is a _notebook_ with some simple code examples for __handling__ files.

Files are a simple way to handle data persistence.

## How to use plain text files

There are _three basic access modes_ for a file, depending on the need: __r__ (_read_), __w__ (_write_), and __a__ (_append_).

In [4]:
# how to write to a file ("w" creates the file or overwrites its content)
file_name = "names.txt"
with open(file_name, "w", encoding="utf-8") as f:
    text = "Alice Bob Claire Denisse"
    f.write(text)

In [6]:
# use append mode to add to the file without overwriting it
with open(file_name, "a", encoding="utf-8") as f:
    text = "\nErwin Fidelia Giselle"
    f.write(text)

In [None]:
# use r mode to read from file
# it will read the full text
with open(file_name, "r", encoding="utf-8") as f:
    text = f.read()
    print("Full Text:", text)

print("="*20)
# it will read the text line by line
with open(file_name, "r", encoding="utf-8") as f:
    lines = f.readlines()
    for line in lines:
        # each line keeps its trailing "\n"; strip it to avoid blank lines in the output
        print("Line:", line.rstrip("\n"))
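To make the difference between the three modes concrete, here is a minimal sketch that runs in a temporary directory. The file name `modes_demo.txt` is only an illustration and is not part of the notebook's data.

```python
import os
import tempfile

# Work in a temporary directory so no real file is touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "modes_demo.txt")  # hypothetical file name

    # "w" creates the file, or truncates it if it already exists.
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("first version\n")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("Alice Bob\n")

    # "a" keeps the existing content and writes at the end.
    with open(demo_path, "a", encoding="utf-8") as f:
        f.write("Claire Denisse\n")

    # "r" only reads; splitlines() drops the trailing newline of each line.
    with open(demo_path, "r", encoding="utf-8") as f:
        demo_lines = f.read().splitlines()

print(demo_lines)
```

The second `"w"` call discards `first version`, while the `"a"` call keeps what was already in the file.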

## CSV (Comma-Separated Values) Files

A __CSV__ file is a text format where each _line break_ starts a new __row__. Within each row, the values are _separated by commas_ (or another delimiter), and each position corresponds to a __column__. It is often used _instead of_ a __relational database__ when there is only a small amount of data in a single dataset.

In [12]:
# csv files -> comma-separated values

csv_file = "personas.csv"
text = "name,code,email\nalice,1,a@ud.co\nbob,2,b@ud.co\nclaire,3,c@ud.co"

with open(csv_file, "w", encoding="utf-8") as f:
    f.write(text)

In [None]:
# read from csv and put into a dictionary
with open(csv_file, "r", encoding="utf-8") as f:
    personas = f.readlines()

columns = personas[0].replace("\n", "").split(",")
personas_data = personas[1:] # slicing to remove columns names

final_data = []
for person in personas_data:
    person_data = {}
    person = person.replace("\n", "")
    temp_data = person.split(",")
    for i, value in enumerate(temp_data):
        person_data[columns[i]] = value
    final_data.append(person_data)

print(final_data)

In [None]:
# from the read data, print the name and email based on keys of the dictionary
for person in final_data:
    print(person["name"], person["email"])
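The manual parsing above can also be sketched with the standard-library `csv` module. This is an alternative to the notebook's approach, not a replacement for it. The text is kept in memory with `io.StringIO` so the example does not depend on `personas.csv` being present.

```python
import csv
import io

# Same content as personas.csv, kept in memory so the sketch is self-contained.
csv_text = "name,code,email\nalice,1,a@ud.co\nbob,2,b@ud.co\nclaire,3,c@ud.co"

# DictReader uses the first row as keys and yields one dictionary per data row.
reader = csv.DictReader(io.StringIO(csv_text))
rows = [dict(row) for row in reader]

for row in rows:
    print(row["name"], row["email"])
```

When reading from a real file, the `csv` documentation recommends opening it with `newline=""`. As in the manual version, every value is read as a string, so `code` would still need an explicit `int()` conversion.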

## JSON Files

A _JSON file_ is a file format that persists data as __key:value__ pairs.
It is the equivalent of a __dictionary__ or a __list of dictionaries__ in _Python_.

In [27]:
import json

# write a json file based on the data of the csv file

json_file = "personas.json"
with open(json_file, "w", encoding="utf-8") as f:
    json.dump(final_data, f)

In [None]:
# read from json and load into a list of dictionaries
with open(json_file, "r", encoding="utf-8") as f:
    data = json.load(f)

print(data)
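As a small complement, the sketch below does the same round trip in memory with `json.dumps` and `json.loads`. The sample records are hypothetical. They use an integer `code` to show that JSON preserves numeric types, whereas the CSV parsing earlier yields only strings.

```python
import json

# Hypothetical sample records; unlike CSV, JSON keeps numbers as numbers.
records = [
    {"name": "alice", "code": 1, "email": "a@ud.co"},
    {"name": "bob", "code": 2, "email": "b@ud.co"},
]

# dumps() serializes to a string; indent=2 makes the output human-readable.
json_text = json.dumps(records, indent=2)
print(json_text)

# loads() parses the string back into a list of dictionaries.
restored = json.loads(json_text)
print(restored == records)
```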

# Dummy Data

__Dummy data__ is a dataset of _fake data_ used only to _test_ the application. The idea is to __validate__ __code__, __workflows__, and __outputs__ _without exposing real data_ and without depending on a client's response.

In [None]:
# use the Faker Python package to generate fake data
!pip install faker
import faker

In [14]:
import json

fake_data = faker.Faker()
n = 1000000  # records per file; lower this value for a quick test run

# generate a dataset of n fake students and save it into a json file
def generate_fake_students(file_name, n):
    students = []
    for i in range(n):
        student = {}
        student["name"] = fake_data.name()
        student["lastname"] = fake_data.last_name()
        student["email"] = fake_data.email()
        student["address"] = fake_data.address()
        student["phone"] = fake_data.phone_number()
        student["birthday"] = str(fake_data.date_of_birth())
        student["country"] = fake_data.country()
        students.append(student)

    with open(file_name, "w", encoding="utf-8") as f:
        json.dump(students, f)

generate_fake_students("students_ind.json", n)
generate_fake_students("students_sist.json", n)
generate_fake_students("students_elec.json", n)
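If installing Faker is not an option, a much rougher stand-in can be sketched with only the standard library. This is a swapped-in technique, not what the notebook uses. The value pools, the `make_dummy_students` function, and the reduced set of fields are all assumptions made for illustration.

```python
import json
import random

# Hypothetical value pools; a real generator such as Faker offers far more variety.
FIRST_NAMES = ["Alice", "Bob", "Claire", "Denisse", "Erwin"]
LAST_NAMES = ["Gomez", "Smith", "Rojas", "Lee"]
COUNTRIES = ["Colombia", "Peru", "Chile"]

def make_dummy_students(n, seed=0):
    """Return n dummy student records; the same seed gives the same records."""
    rng = random.Random(seed)
    students = []
    for _ in range(n):
        first = rng.choice(FIRST_NAMES)
        last = rng.choice(LAST_NAMES)
        students.append({
            "name": f"{first} {last}",
            "lastname": last,
            "email": f"{first}.{last}@example.com".lower(),
            "country": rng.choice(COUNTRIES),
        })
    return students

sample = make_dummy_students(5)
print(json.dumps(sample[0]))
```

Seeding the generator makes the dummy dataset reproducible, which helps when a test has to be rerun on exactly the same data.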


## Sequential Access to Data

Here the idea is to measure how long it takes to count the repetitions of each name, including loading the three files, when the files are processed one by one as a linear (sequential) process.

In [None]:
def count_names(data, names):
    for student in data:
        name = student["name"]
        if name in names:
            names[name] += 1
        else:
            names[name] = 1
    return names

names = {}

with open("students_elec.json", "r", encoding="utf-8") as f:
    data_elec = json.load(f)
    names = count_names(data_elec, names)

with open("students_sist.json", "r", encoding="utf-8") as f:
    data_sist = json.load(f)
    names = count_names(data_sist, names)

with open("students_ind.json", "r", encoding="utf-8") as f:
    data_ind = json.load(f)
    names = count_names(data_ind, names)

final_students = data_elec + data_sist + data_ind
print(len(final_students))
print(len(names))
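The cell above does not record the elapsed time itself. One possible way to measure it is `time.perf_counter`, sketched below. The `timed` helper and the small `count_items` workload are hypothetical and only keep the example runnable without the large JSON files.

```python
import time

def timed(func, *args, **kwargs):
    """Hypothetical helper: run func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload so the sketch runs without the large JSON files.
def count_items(items):
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    return counts

counts, seconds = timed(count_items, ["Alice", "Bob", "Alice"])
print(counts, f"{seconds:.6f} s")
```

The same helper could wrap a function containing the three `with open(...)` blocks, so that loading and counting are timed together.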

## Parallel Access to Files

Here the idea is to use threads to load the files and count the repetitions of each name in each file. Afterwards, the partial results are merged.

In [21]:
import concurrent.futures

def read_file(file_name):
    names = {}
    with open(file_name, "r", encoding="utf-8") as f:
        data = json.load(f)
    for student in data:
        name = student["name"]
        if name in names:
            names[name] += 1
        else:
            names[name] = 1
    return names

In [22]:
def merge_names(names_lists):
    final_names = {}
    for names in names_lists:
        for name in names:
            if name in final_names:
                final_names[name] += names[name]
            else:
                final_names[name] = names[name]
    return final_names

In [24]:
def parallel_analysis():
    list_files = ["students_elec.json",
                  "students_sist.json",
                  "students_ind.json"]
    # each file is loaded and counted in its own worker thread
    with concurrent.futures.ThreadPoolExecutor() as executor:
        lists_results = list(executor.map(read_file, list_files))

    # combine the per-file counts into a single dictionary
    final_names = merge_names(lists_results)
    print(len(final_names))

parallel_analysis()
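The count-then-merge pattern above can also be sketched with `collections.Counter`, which makes the manual merge a simple addition. The three tiny datasets and their file names are hypothetical stand-ins for the large student files, written to a temporary directory so the example is self-contained.

```python
import concurrent.futures
import json
import os
import tempfile
from collections import Counter

def count_names_in_file(path):
    """Load one JSON file and count how often each name appears."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    return Counter(student["name"] for student in data)

# Tiny hypothetical datasets standing in for the three large student files.
datasets = {
    "elec.json": [{"name": "Alice"}, {"name": "Bob"}],
    "sist.json": [{"name": "Alice"}],
    "ind.json": [{"name": "Claire"}, {"name": "Alice"}],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    paths = []
    for file_name, students in datasets.items():
        path = os.path.join(tmp_dir, file_name)
        with open(path, "w", encoding="utf-8") as f:
            json.dump(students, f)
        paths.append(path)

    # One task per file; map() returns results in input order.
    with concurrent.futures.ThreadPoolExecutor() as executor:
        partial_counts = list(executor.map(count_names_in_file, paths))

# Adding Counters sums the counts key by key, which replaces the manual merge.
total_counts = sum(partial_counts, Counter())
print(total_counts.most_common(1))
```

In standard CPython, JSON parsing is CPU-bound work that runs under the global interpreter lock, so the speedup from threads may be modest. Timing both versions on the real files is the only reliable way to compare them.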

## Serialization

__Serialization__ is the process of taking an object and _converting it to binary_ (a byte stream), so that it can be saved as an encoded file and loaded back later.

In [25]:
import json
import pickle

with open("students_elec.json", "r", encoding="utf-8") as f:
    data_elec = json.load(f)

with open("students_elec.pkl", "wb") as f:
    pickle.dump(data_elec, f)

In [None]:
with open("students_elec.pkl", "rb") as f:
    data_elec = pickle.load(f)
print(data_elec[:6])
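The same round trip can be sketched in memory with `pickle.dumps` and `pickle.loads`, which shows the intermediate bytes directly. The sample records are hypothetical.

```python
import pickle

# Hypothetical records standing in for the student data.
records = [{"name": "Alice", "code": 1}, {"name": "Bob", "code": 2}]

# dumps() returns the serialized object as bytes instead of writing to a file.
payload = pickle.dumps(records)
print(type(payload), len(payload))

# loads() rebuilds an equal, but separate, Python object from those bytes.
restored = pickle.loads(payload)
print(restored == records, restored is records)
```

Pickle files are specific to Python, and the `pickle` documentation warns that unpickling data from an untrusted source can execute arbitrary code. For data exchanged with other systems, JSON is usually the safer choice.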