# Module 1: Reading files

## Objective
Students will be able to extract data from the following file types:
* Plain text.
* JSON.
* CSV.
* Parquet.



## Activity 2: Homework (where applicable)

* On Canvas you will find a set of .csv files named csv_file_1.csv, csv_file_2.csv, and so on. Download all of them.
* Read each file into a DataFrame and use the `pd.concat` function from the pandas package to concatenate them.
    * It is used like this: `nuevo_df = pd.concat([df1, df2, df3, ...], axis=0)`. This stacks all the DataFrames into a single one, row by row. More information here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
* Once you have your DataFrame, write down the answers to the following questions:
    * How many rows does the final DataFrame have?
    * What is the mean of the `Números` column?
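
As a rough illustration of how `pd.concat` behaves, the sketch below stacks three tiny DataFrames. The data is made up and stands in for the downloaded files; `ignore_index=True` is an optional extra, not something the homework requires.

```python
import pandas as pd

# Three small, made-up DataFrames standing in for the downloaded CSV files.
df1 = pd.DataFrame({"Números": [1, 2]})
df2 = pd.DataFrame({"Números": [3, 4]})
df3 = pd.DataFrame({"Números": [5]})

# axis=0 stacks the frames vertically (rows on top of rows).
# ignore_index=True renumbers the rows 0..n-1 instead of keeping
# each frame's original index (0, 1, 0, 1, 0).
nuevo_df = pd.concat([df1, df2, df3], axis=0, ignore_index=True)

n_rows = len(nuevo_df)                    # number of rows in the result
mean_value = nuevo_df["Números"].mean()   # mean of a single column
print(n_rows, mean_value)
```

The same two expressions, `len(...)` and `.mean()`, are one possible way to answer the questions above once the real files are concatenated.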

In [1]:
# Helper for building paths
import os
import pandas as pd
# For working with JSON files
import json
# For working with CSV files
import csv

## Text
### Ways to access files
https://www.geeksforgeeks.org/reading-writing-text-files-python/

The basic way to open a file is with `open(file, mode)`.

Every call to `open` takes an access mode, which determines how the resulting file object can be used once it is open.

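* Read only
    * The file must already exist.
    * Access mode: `r`
* Read and write
    * Keeps the existing content; writing starts at the beginning of the file.
    * Access mode: `r+`
* Write only
    * Creates the file, or empties it if it already exists.
    * Access mode: `w`
* Write and read
    * Like `w`, but the file can also be read.
    * Access mode: `w+`
* Append only
    * Adds information at the end of whatever already exists.
    * Access mode: `a`
* Append and read
    * Adds information at the end of whatever already exists, and also allows reading.
    * Access mode: `a+`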
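
The difference between these modes is easier to see on a throwaway file. The following is a minimal sketch, assuming a temporary directory and a hypothetical file name `demo.txt`; it is not part of the course files.

```python
import os
import tempfile

# Work in a throwaway directory so no real file is touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")  # hypothetical file name

    # 'w' creates the file (or empties it if it already exists).
    f = open(demo_path, "w")
    f.write("0123456789")
    f.close()

    # 'r+' keeps the content and starts writing at position 0,
    # so the new text overwrites the first characters.
    f = open(demo_path, "r+")
    f.write("abc")
    f.close()

    # 'a' always writes at the end of the file.
    f = open(demo_path, "a")
    f.write("XYZ")
    f.close()

    # 'r' only allows reading.
    f = open(demo_path, "r")
    content = f.read()
    f.close()

print(content)  # abc3456789XYZ
```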

In [2]:
# Set our directory
path = '.'
file_name = 'archivo_texto.txt'

# Join both variables into a single path
basedir = os.path.join(path, file_name)
basedir

# Read the file using "open"
my_file = open(basedir, 'r')
print(my_file.read())

my_file.close()

hoy es 2391200510079733 -0.26427996182377783 -0.9678349293737798 0.059188425297189665 1.5146199891809016 0.3368704439473176 -0.30236870196778015 -0.838641803575626 -0.5177312510310029 0.2535078637970811 -2.4847039860387983 -0.6514198681046239 -1.7754691520095647 0.234285827021381 1.2393998588645518 -0.6503342749455817 -0.2662149259177863 -0.5387197708014494 0.4207405283608752 -0.09528632594099248 -0.7698760981379379 -0.18372453368830574 0.17680894649065493 0.737056190505862 0.6938626422156343 -0.815530330059005 0.3876804541976694 0.10653700578986429 -0.3145769234612008 -0.6997128553310978 0.052988279611467834 0.9341078434725953 1.4598721347420383 -1.5788771165879707 0.952266433562546 -0.6024594416987342 0.4614449233389358 -0.003959146814871652 0.777348026060828 -0.4931128059547933 0.6085308646238607 0.3142341829166473 -0.7113037646184401 1.0052960444339158 0.21125896828592608 -0.8731488378408899 0.9841570691238846 -0.07020317217219083 1.7787220159391033 nuevo petróleo,0.446332377870045

##### Using `readline`
`readline(n)` reads at most `n` characters of the current line (bytes when the file is opened in binary mode). It never reads past the end of the line, even if `n` is larger than what remains.
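
A small sketch of this behaviour follows. It uses `io.StringIO` with made-up text as a stand-in for a file opened in text mode, so it does not depend on any file on disk.

```python
import io

# io.StringIO behaves like a text file opened with open(..., 'r').
fake_file = io.StringIO("first line\nsecond line\n")

part = fake_file.readline(5)     # at most 5 characters of the current line
rest = fake_file.readline(100)   # stops at the newline, although 100 > what is left
second = fake_file.readline()    # no size given: the whole next line
end = fake_file.readline()       # an empty string signals the end of the file

print(repr(part), repr(rest), repr(second), repr(end))
```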

In [3]:
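# 'r+' opens for reading and writing without emptying the file;
# writing starts at the beginning, so it overwrites the first characters
my_file = open(basedir, 'r+')

texto_a_agregar = 'hoy es 23'

my_file.write(texto_a_agregar)
my_file.close()

# Reopen in read mode to check the result
my_file = open(basedir, 'r')
print(my_file.read())
my_file.close()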






hoy es 2391200510079733 -0.26427996182377783 -0.9678349293737798 0.059188425297189665 1.5146199891809016 0.3368704439473176 -0.30236870196778015 -0.838641803575626 -0.5177312510310029 0.2535078637970811 -2.4847039860387983 -0.6514198681046239 -1.7754691520095647 0.234285827021381 1.2393998588645518 -0.6503342749455817 -0.2662149259177863 -0.5387197708014494 0.4207405283608752 -0.09528632594099248 -0.7698760981379379 -0.18372453368830574 0.17680894649065493 0.737056190505862 0.6938626422156343 -0.815530330059005 0.3876804541976694 0.10653700578986429 -0.3145769234612008 -0.6997128553310978 0.052988279611467834 0.9341078434725953 1.4598721347420383 -1.5788771165879707 0.952266433562546 -0.6024594416987342 0.4614449233389358 -0.003959146814871652 0.777348026060828 -0.4931128059547933 0.6085308646238607 0.3142341829166473 -0.7113037646184401 1.0052960444339158 0.21125896828592608 -0.8731488378408899 0.9841570691238846 -0.07020317217219083 1.7787220159391033 nuevo petróleo,0.446332377870045

## Comma Separated Values (CSV)

From here on we will use `with`. The `with` statement simplifies resource handling and makes the code clearer. When we use it, there is no need to call `.close()`.

It is used together with `open`: it closes the file automatically when the block ends, even if an exception is raised inside it.
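
To make the comparison concrete, here is a minimal sketch that writes a file the manual way and reads it back using `with`. The temporary directory and the name `demo.txt` are assumptions made only for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")  # hypothetical file name

    # Manual version: close() must run even if something fails,
    # hence the try/finally.
    f = open(demo_path, "w")
    try:
        f.write("hello")
    finally:
        f.close()

    # Version using `with`: the file is closed on leaving the block,
    # whether it ends normally or with an exception.
    with open(demo_path, "r") as g:
        content = g.read()

    closed_after_block = g.closed  # True: no explicit close() was needed

print(content, closed_after_block)
```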

Reading CSV files:
https://www.geeksforgeeks.org/reading-and-writing-csv-files-in-python/?ref=rp

Using `with`:
https://www.pythonforbeginners.com/files/with-statement-in-python


Reading with the `csv` package.



In [4]:
# Build the path to the file
basedir = os.path.join(path, 'archivo_csv.csv')
# Read the file with the "csv" package
# (the csv documentation recommends opening the file with newline='')
with open(basedir, 'r', newline='') as f:
    my_file = csv.reader(f)
    for line in my_file:
        print(line)

['', 'Unnamed: 0.3', 'Unnamed: 0.2', 'Unnamed: 0.1', 'Unnamed: 0', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'new col']
['0', '0', '0', '0', '0', '1.8775882904928456', '-0.2152626418256438', '0.0775797234935389', '-0.9594945128178668', '0.9365148610667244', '0.1740874663620915', '1.4642697058983043', '-0.8255243147243997', '0.4131271618426333', '-1.9278778416955609', '5']
['1', '1', '1', '1', '1', '1.008025009349364', '-0.0313357011013897', '-0.2646592571505047', '-0.4441678998728409', '1.5043917002220817', '0.6174847516476217', '0.6779758475581189', '0.8011824512450199', '1.2515351279099245', '-0.9031129673229872', '5']
['2', '2', '2', '2', '2', '1.5673737992027197', '0.5186690587389874', '-1.1309694989618595', '0.0449315830147497', '0.6497018188007067', '-0.0819084309101103', '1.6862723399865065', '0.8619962798077608', '0.7056029276512127', '1.5478794911197002', '5']
['3', '3', '3', '3', '3', '-0.3146656916474705', '-0.3962198923376598', '-0.7880982990178668', '-0.01408391167
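
As the output above shows, `csv.reader` returns every row as a list of strings, including the numbers. The sketch below, which uses made-up CSV text and `io.StringIO` in place of a real file, shows one way to separate the header and convert a column before doing arithmetic.

```python
import csv
import io

# Made-up CSV text; io.StringIO stands in for an opened file.
fake_csv = io.StringIO("name,value\na,1.5\nb,2.5\n")

reader = csv.reader(fake_csv)
header = next(reader)            # first row: the column names
rows = [row for row in reader]   # remaining rows, each a list of strings

# Values must be converted explicitly before doing arithmetic.
values = [float(row[1]) for row in rows]
mean_value = sum(values) / len(values)

print(header, rows, mean_value)
```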

In [5]:
# Let's drop all that and use pandas instead
df = pd.read_csv(basedir)
df

Unnamed: 0.5,Unnamed: 0.4,Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,0,1,2,3,4,5,6,7,8,9,new col
0,0,0,0,0,0,1.877588,-0.215263,0.07758,-0.959495,0.936515,0.174087,1.46427,-0.8255243147243997,0.413127,-1.927878,5
1,1,1,1,1,1,1.008025,-0.031336,-0.264659,-0.444168,1.504392,0.617485,0.677976,0.8011824512450199,1.251535,-0.903113,5
2,2,2,2,2,2,1.567374,0.518669,-1.130969,0.044932,0.649702,-0.081908,1.686272,0.8619962798077608,0.705603,1.547879,5
3,3,3,3,3,3,-0.314666,-0.39622,-0.788098,-0.014084,-1.998541,-1.948459,1.292662,-1.1968203217780395,1.000475,-0.213199,5
4,4,4,4,4,4,-0.437519,1.049135,-0.12843,-0.009903,-0.061985,0.729883,0.20991,0.12429447103437793,2.981896,0.77092,5
5,5,5,5,5,5,0.762222,-2.324741,0.627979,0.600839,-2.474727,0.309773,0.770974,-0.7620280161900589,-0.619762,-0.063917,5
6,6,6,6,6,6,0.391457,-0.465832,0.523149,0.839778,0.548082,-0.314888,0.346591,0.22083250229451548,0.029363,-0.74459,5
7,7,7,7,7,7,1.114658,-1.39929,0.497715,-0.305932,-0.51706,-0.221667,0.667825,en ellos.,0.444128,-1.470594,5
8,8,8,8,8,8,0.89663,-0.180905,0.956547,-1.938051,-0.451817,-0.541395,0.856869,-0.5975509194321982,0.452285,0.238506,5
9,9,9,9,9,9,-3.096334,0.251865,0.934592,0.485043,-0.379969,1.196651,-0.98111,-0.1509266836008704,-0.291596,0.30649,5


In [6]:
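# Add a constant column
df['new col'] = 5
# Note: to_csv also writes the index by default, so every save/load cycle
# adds one more "Unnamed: 0" column; pass index=False to avoid this
df.to_csv(basedir)

df = pd.read_csv(basedir)
df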



Unnamed: 0.6,Unnamed: 0.5,Unnamed: 0.4,Unnamed: 0.3,Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,0,1,2,3,4,5,6,7,8,9,new col
0,0,0,0,0,0,0,1.877588,-0.215263,0.07758,-0.959495,0.936515,0.174087,1.46427,-0.8255243147243997,0.413127,-1.927878,5
1,1,1,1,1,1,1,1.008025,-0.031336,-0.264659,-0.444168,1.504392,0.617485,0.677976,0.8011824512450199,1.251535,-0.903113,5
2,2,2,2,2,2,2,1.567374,0.518669,-1.130969,0.044932,0.649702,-0.081908,1.686272,0.8619962798077608,0.705603,1.547879,5
3,3,3,3,3,3,3,-0.314666,-0.39622,-0.788098,-0.014084,-1.998541,-1.948459,1.292662,-1.1968203217780395,1.000475,-0.213199,5
4,4,4,4,4,4,4,-0.437519,1.049135,-0.12843,-0.009903,-0.061985,0.729883,0.20991,0.12429447103437793,2.981896,0.77092,5
5,5,5,5,5,5,5,0.762222,-2.324741,0.627979,0.600839,-2.474727,0.309773,0.770974,-0.7620280161900589,-0.619762,-0.063917,5
6,6,6,6,6,6,6,0.391457,-0.465832,0.523149,0.839778,0.548082,-0.314888,0.346591,0.22083250229451548,0.029363,-0.74459,5
7,7,7,7,7,7,7,1.114658,-1.39929,0.497715,-0.305932,-0.51706,-0.221667,0.667825,en ellos.,0.444128,-1.470594,5
8,8,8,8,8,8,8,0.89663,-0.180905,0.956547,-1.938051,-0.451817,-0.541395,0.856869,-0.5975509194321982,0.452285,0.238506,5
9,9,9,9,9,9,9,-3.096334,0.251865,0.934592,0.485043,-0.379969,1.196651,-0.98111,-0.1509266836008704,-0.291596,0.30649,5
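
The growing list of `Unnamed: 0.x` columns in these outputs comes from writing the index on every save. The following sketch, on a small made-up DataFrame and in memory rather than on disk, contrasts the default behaviour with `index=False`.

```python
import io
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Default: the index is written as an extra, unnamed first column.
with_index = df.to_csv()
reloaded_with_index = pd.read_csv(io.StringIO(with_index))

# index=False: only the real columns are written.
without_index = df.to_csv(index=False)
reloaded_without_index = pd.read_csv(io.StringIO(without_index))

print(list(reloaded_with_index.columns))     # ['Unnamed: 0', 'a', 'b']
print(list(reloaded_without_index.columns))  # ['a', 'b']
```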


## Parquet

Parquet is a relatively "new" storage file format that serves a similar purpose to CSV, with a few specific differences:

* Parquet files are organized by columns, not by rows.
* They include a schema that defines the data type of each column.
* Reading specific columns is very fast.
    * A Parquet reader loads only the columns it needs, while a CSV reader has to go through the whole file.
* It is usually much more storage-efficient than CSV.
* It scales well (especially compared with CSV).
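
The idea of "organized by columns" can be pictured with plain Python structures. This is only a toy illustration on made-up data, not Parquet's actual on-disk format: it shows why fetching one column is cheaper when values are grouped by column.

```python
# The same small table stored in two layouts (made-up data).
row_oriented = [
    {"id": 1, "price": 10.0, "city": "A"},
    {"id": 2, "price": 20.0, "city": "B"},
    {"id": 3, "price": 30.0, "city": "C"},
]
column_oriented = {
    "id": [1, 2, 3],
    "price": [10.0, 20.0, 30.0],
    "city": ["A", "B", "C"],
}

# Row layout (CSV-like): every record is visited to extract one field.
prices_from_rows = [record["price"] for record in row_oriented]

# Column layout (Parquet-like): the wanted column is already one block.
prices_from_columns = column_oriented["price"]

print(prices_from_rows == prices_from_columns)
```

In pandas, the `columns` argument of `pd.read_parquet` is the way to request only some columns from a Parquet file.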


![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)

In [7]:
basedir = os.path.join(path, 'archivo_parquet.parquet')
df = pd.read_parquet(basedir)
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,1.877588,-0.215263,0.07758,-0.959495,0.936515,0.174087,1.46427,-0.8255243147243997,0.413127,-1.927878
1,1.008025,-0.031336,-0.264659,-0.444168,1.504392,0.617485,0.677976,0.8011824512450199,1.251535,-0.903113
2,1.567374,0.518669,-1.130969,0.044932,0.649702,-0.081908,1.686272,0.8619962798077608,0.705603,1.547879
3,-0.314666,-0.39622,-0.788098,-0.014084,-1.998541,-1.948459,1.292662,-1.1968203217780395,1.000475,-0.213199
4,-0.437519,1.049135,-0.12843,-0.009903,-0.061985,0.729883,0.20991,0.12429447103437793,2.981896,0.77092
5,0.762222,-2.324741,0.627979,0.600839,-2.474727,0.309773,0.770974,-0.7620280161900589,-0.619762,-0.063917
6,0.391457,-0.465832,0.523149,0.839778,0.548082,-0.314888,0.346591,0.22083250229451548,0.029363,-0.74459
7,1.114658,-1.39929,0.497715,-0.305932,-0.51706,-0.221667,0.667825,en ellos.,0.444128,-1.470594
8,0.89663,-0.180905,0.956547,-1.938051,-0.451817,-0.541395,0.856869,-0.5975509194321982,0.452285,0.238506
9,-3.096334,0.251865,0.934592,0.485043,-0.379969,1.196651,-0.98111,-0.1509266836008704,-0.291596,0.30649


## JSON

The acronym "JSON" stands for _JavaScript Object Notation_.

It is used to store and transfer information, and it is a very simple way to share data.

It is easy to read, and in Python a JSON object is loaded as a dictionary.
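
As a quick sketch of that mapping, the example below parses a made-up JSON string with `json.loads` (the string counterpart of `json.load`) and inspects the resulting Python types.

```python
import json

# Made-up JSON text with the usual value types.
raw = '{"name": "demo", "values": [1, 2.5, true, null], "nested": {"1.5": 2.25}}'

data = json.loads(raw)  # loads = "load from string"; json.load reads from a file

print(type(data))            # a dict
print(data["values"])        # [1, 2.5, True, None]
print(list(data["nested"]))  # keys are always strings: ['1.5']
```

JSON object keys are always strings, so numeric-looking keys stay as text after loading.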

In [8]:
basedir = os.path.join(path, 'archivo_json.json')

In [9]:
with open(basedir, 'r') as f:
    myJson = json.load(f)

In [10]:
myJson

{'1.8775882904928458': 3.5253377885958472,
 '-0.2152626418256438': 0.046338004965755415,
 '0.07757972349353895': 0.0060186134973339595,
 '-0.9594945128178669': 0.9206297201275958,
 '0.9365148610667243': 0.8770600849988259,
 '0.17408746636209152': 0.030306445944372346,
 '1.4642697058983043': 2.1440857716115067,
 '-0.8255243147243997': 0.6814903942011897,
 '0.4131271618426333': 0.17067405185214934,
 '-1.9278778416955606': 3.716712972500733,
 '1.0080250093493641': 1.0161144194737857,
 '-0.03133570110138973': 0.0009819261635156376,
 '-0.2646592571505047': 0.07004452239545697,
 '-0.4441678998728409': 0.19728512327745001,
 '1.5043917002220815': 2.263194387697085,
 '0.6174847516476217': 0.3812874185173251,
 '0.6779758475581189': 0.45965124987214967,
 '0.8011824512450199': 0.6418933201829786,
 '1.2515351279099243': 1.5663401763925104,
 '-0.9031129673229871': 0.8156130317469307,
 '1.5673737992027195': 2.456660626427167,
 '0.5186690587389874': 0.26901759249318713,
 '-1.1309694989618597': 1.27909

##### Writing a JSON file


In [11]:
ejemplo = {'key': {'key2': 'value'}}
ejemplo

{'key': {'key2': 'value'}}

In [12]:
with open(os.path.join('.', 'ejemplo.json'), 'w') as f:
    json.dump(ejemplo, f)

In [13]:
with open('ejemplo.json', 'r') as f:
    nuevo_ejemplo = json.load(f)

In [14]:
nuevo_ejemplo

{'key': {'key2': 'value'}}
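
When the written file is meant to be read by people, two optional arguments of `json.dump` / `json.dumps` may help. The sketch below works in memory with `json.dumps`; the extra `'texto'` entry is made up to show how non-ASCII characters are handled.

```python
import json

ejemplo = {'key': {'key2': 'value'}, 'texto': 'petróleo'}

# Default: a single line, with non-ASCII characters escaped.
compact = json.dumps(ejemplo)
# indent adds line breaks; ensure_ascii=False keeps accented characters as-is.
pretty = json.dumps(ejemplo, indent=2, ensure_ascii=False)

# Both strings decode back to the same dictionary.
same = json.loads(compact) == json.loads(pretty) == ejemplo

print(compact)
print(pretty)
print(same)
```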