
### How to read and write data files in Python

In [None]:
import numpy as np
import pandas as pd

In [None]:
# f = open(filename, mode)

### Example 1:

In [None]:
#Write
f = open('test.txt', 'w')
for i in range(5):
    f.write(f"This is line {i}\n")

f.close()

In [None]:
#Append
f = open('test.txt', 'a')
f.write("This is another line\n")
f.close()

In [None]:
#Read
f = open('./test.txt', 'r')
content = f.read()
f.close()
print(content)

This is line 0
This is line 1
This is line 2
This is line 3
This is line 4
This is another line



In [None]:
type(content)

str

### Types of modes:

"x" is similar to "w", with one difference: if the file already exists, "x" raises FileExistsError, whereas "w" truncates the existing file. If the file does not exist, both modes create it.

![Read_Write.png](attachment:Read_Write.png)


![ExWNT-white-bg.png](attachment:ExWNT-white-bg.png)
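
The difference between the "x", "a", and "w" modes can be checked with a small experiment. This is a minimal sketch: the file name `modes_demo.txt` is hypothetical, and the file lives in a temporary directory so that nothing existing is overwritten.

```python
import os
import tempfile

# Work in a throwaway directory so no existing file is touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'modes_demo.txt')  # hypothetical file name

    # "x" creates the file; this works because the file does not exist yet.
    with open(path, 'x') as f:
        f.write("first\n")

    # A second "x" on the same path raises FileExistsError.
    try:
        with open(path, 'x') as f:
            f.write("never written\n")
        x_failed = False
    except FileExistsError:
        x_failed = True

    # "a" keeps the existing content and adds to the end.
    with open(path, 'a') as f:
        f.write("second\n")
    with open(path, 'r') as f:
        after_append = f.read()

    # "w" truncates the file before writing.
    with open(path, 'w') as f:
        f.write("third\n")
    with open(path, 'r') as f:
        after_write = f.read()

print(x_failed)
print(repr(after_append))
print(repr(after_write))
```

The `with` statement closes the file automatically, even if an error occurs inside the block, so no explicit `f.close()` is needed.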

In [None]:
# Read line by line

f = open('./test.txt', 'r')
contents = f.readlines()
f.close()
print(contents)

['This is line 0\n', 'This is line 1\n', 'This is line 2\n', 'This is line 3\n', 'This is line 4\n', 'This is another line\n']


In [None]:
type(contents)

list

**Exercise:** Write (or save) a NumPy array as a txt file and read it back

In [None]:
import numpy as np

# Create a sample NumPy array
my_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Save the array to a text file
np.savetxt('my_array.txt', my_array, fmt='%d')  # Use fmt='%d' for integers, '%.2f' for floats with 2 decimal places, etc.

# Read the array from the text file
loaded_array = np.loadtxt('my_array.txt', dtype=int)  # Specify dtype if needed

# Print the loaded array
print(loaded_array)

[[1 2 3]
 [4 5 6]
 [7 8 9]]


#### A DataFrame is structured in a tidy data format.

Tidy data format means the data is organized in multiple columns, one per variable, with one observation per row, collected in a pandas DataFrame.

![tidy_data.png](attachment:tidy_data.png)

- [File formats](https://pandas.pydata.org/docs/user_guide/io.html) supported by pandas for tidy data.

- [File formats](https://numpy.org/doc/stable/reference/routines.io.html) supported by NumPy.

Some examples of file formats are given below:

![file_formats.png](attachment:file_formats.png)

### Example: CSV file

A comma-separated values (CSV) file is a delimited text file that uses commas to separate values.

In [None]:
data = np.random.random((10,5))
data

array([[0.5363556 , 0.28733151, 0.4397988 , 0.61779664, 0.08250741],
       [0.84437912, 0.48883304, 0.36310454, 0.28765151, 0.12731408],
       [0.90985425, 0.75578419, 0.35376214, 0.65763473, 0.26340448],
       [0.02515819, 0.94262891, 0.97195554, 0.78237137, 0.58696097],
       [0.77568157, 0.43769521, 0.36042672, 0.83574183, 0.21028676],
       [0.2683575 , 0.7082042 , 0.85061129, 0.01735171, 0.74484985],
       [0.88613379, 0.95833958, 0.74466398, 0.84982235, 0.97416325],
       [0.44927029, 0.51853773, 0.71422556, 0.74814156, 0.53014481],
       [0.03643762, 0.04933369, 0.59750776, 0.72697915, 0.59992761],
       [0.30891789, 0.1048588 , 0.01033696, 0.6728544 , 0.24177318]])

In [None]:
#Write
np.savetxt('test.csv', data, fmt = '%.2f', delimiter=',', header = 'c1, c2, c3, c4, c5')

In [None]:
#Read
my_csv = np.loadtxt('./test.csv', delimiter=',')
my_csv
# my_csv[:5, :]

array([[0.54, 0.29, 0.44, 0.62, 0.08],
       [0.84, 0.49, 0.36, 0.29, 0.13],
       [0.91, 0.76, 0.35, 0.66, 0.26],
       [0.03, 0.94, 0.97, 0.78, 0.59],
       [0.78, 0.44, 0.36, 0.84, 0.21],
       [0.27, 0.71, 0.85, 0.02, 0.74],
       [0.89, 0.96, 0.74, 0.85, 0.97],
       [0.45, 0.52, 0.71, 0.75, 0.53],
       [0.04, 0.05, 0.6 , 0.73, 0.6 ],
       [0.31, 0.1 , 0.01, 0.67, 0.24]])

**Exercise**: Pandas has a very convenient interface for writing and reading CSV files. Save a DataFrame to a CSV file with pandas and read it back.
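
One possible solution is sketched here, assuming the same column names as in the NumPy example. The file name `test_pandas.csv` and the seeded random generator are illustrative choices, not part of the original exercise.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Random data similar to the array above (seeded so the run is repeatable)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((10, 5)), columns=['c1', 'c2', 'c3', 'c4', 'c5'])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'test_pandas.csv')  # hypothetical file name
    # index=False leaves the row index out of the file
    df.to_csv(path, index=False)
    # The first row of the file is used as the column names
    df_read = pd.read_csv(path)

print(df_read.head())
```

Unlike `np.savetxt`, `to_csv` writes the column names as a regular header row, and `read_csv` restores them as the DataFrame columns.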

Caution! When working with floating point numbers you should be careful to save the data with enough decimal places so that you won't lose precision.
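
The effect of the format string on precision can be illustrated with a round trip through CSV text. This is a rough sketch: the helper `round_trip` is hypothetical, and an in-memory `io.StringIO` buffer stands in for a file on disk.

```python
import io

import numpy as np

values = np.array([[1 / 3, 2 / 3], [np.pi, np.e]])

def round_trip(arr, fmt):
    """Write arr as CSV text using fmt, then read it back as an array."""
    buffer = io.StringIO()
    np.savetxt(buffer, arr, fmt=fmt, delimiter=',')
    buffer.seek(0)
    return np.loadtxt(buffer, delimiter=',')

coarse = round_trip(values, '%.2f')    # two decimal places, as in the CSV example
precise = round_trip(values, '%.18e')  # the default format of np.savetxt

# Largest absolute error introduced by the two-decimal format
print(np.abs(coarse - values).max())
# Whether the default format restores the values exactly
print(np.array_equal(precise, values))
```

With `'%.2f'` every digit beyond the second decimal place is lost, while the default `'%.18e'` writes enough digits to restore a 64-bit float.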

### Example: Pickle file

In [None]:
import pickle

#### Pickle format

Type: __Binary__

- Use pickle when you want to store dictionaries, tuples, lists, or almost any other Python object on disk and use it later.
- Pickle serializes objects so that they can be saved to a file and loaded again later.
- __Serialization__ is the process of converting an object into a format that can be stored or transmitted.
- Pickle is a powerful serialization format in Python.
- However, it is not a format you will want to use for long-term storage or data sharing: it is Python-specific, and unpickling data from an untrusted source can execute arbitrary code.
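
Serialization can also be tried without touching the disk. The sketch below uses `pickle.dumps` and `pickle.loads`, which convert to and from a `bytes` object; the `record` dictionary is made-up example data.

```python
import pickle

# Made-up example data mixing several Python types
record = {'name': 'run_01', 'shape': (10, 5), 'tags': ['raw', 'test'], 'scale': 0.5}

payload = pickle.dumps(record)    # serialize to a bytes object
restored = pickle.loads(payload)  # rebuild an equal object from the bytes

print(type(payload))
print(restored == record)
print(type(restored['shape']))
```

The tuple stored under `'shape'` comes back as a tuple, which shows that pickle preserves Python types rather than mapping them to a smaller common set.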

In [None]:
# Write a pickle file

dict_a = {'A':0, 'B':1, 'C':2}
# The with block closes the file, so the data is flushed to disk
with open('test.pkl', 'wb') as f:
    pickle.dump(dict_a, f)

Note! The 'b' in 'wb' and 'rb' stands for binary mode, which pickle requires because it reads and writes bytes.

In [None]:
# Read a pickle file

with open('./test.pkl', 'rb') as f:
    my_dict = pickle.load(f)
my_dict

{'A': 0, 'B': 1, 'C': 2}

**Exercise**: Save a numpy array to a pickle file

In [None]:
import numpy as np
import pickle

# Create a sample NumPy array
my_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Save the array to a pickle file
with open('my_array.pkl', 'wb') as f:
    pickle.dump(my_array, f)

**Exercise**: Save a dataframe to a pickle file

In [None]:
import pandas as pd
import pickle

# Create a sample DataFrame
data = {'col1': [1, 2, 3], 'col2': [4, 5, 6]}
df = pd.DataFrame(data)

# Save the DataFrame to a pickle file
with open('my_dataframe.pkl', 'wb') as f:
    pickle.dump(df, f)

**Some other formats for storing tidy data**

**Feather**

- A (binary) file format for storing dataframes (tidy data) quickly.
- Compatible with Python, R, and Julia.
- Requires the pyarrow package (`pip install pyarrow`).

**Parquet**

- A (binary) file format for storing large tidy data.
- Parquet is used by many different languages (C, Java, Python, MATLAB, Julia, etc.).
- Requires the pyarrow package (`pip install pyarrow`).

### Example: JSON file

__JSON (JavaScript Object Notation)__

__Type: Text format__

- Extension “.json”.
- Human-readable.
- Common when dealing with web applications.
- Unlike pickle, which is Python-specific, JSON is a language-independent data format, which makes it attractive for sharing data.
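
Because JSON is language-independent, it only knows a small set of types, and Python objects are mapped onto them. The sketch below shows what that means for a round trip; the `record` dictionary is made-up example data.

```python
import json

# Made-up example data: a tuple value, an integer key, a boolean, and None
record = {'shape': (10, 5), 1: 'integer key', 'ok': True, 'missing': None}

text = json.dumps(record)    # serialize to a human-readable string
restored = json.loads(text)  # parse the string back into Python objects

print(text)
print(restored)
```

After the round trip the tuple has become a list and the integer key has become the string `'1'`, so `restored` is not equal to `record`. This is the price of a format that other languages can read.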

In [None]:
import json

In [None]:
#Write a dictionary to a JSON file

school = {
  "school": "UC Berkeley",
  "address": {
    "city": "Berkeley",
    "state": "California",
    "postal": "94720"
  },

  "list":[
      "student 1",
      "student 2",
      "student 3"
      ],

  "array":[1, 2, 3]
}

school

{'school': 'UC Berkeley',
 'address': {'city': 'Berkeley', 'state': 'California', 'postal': '94720'},
 'list': ['student 1', 'student 2', 'student 3'],
 'array': [1, 2, 3]}

In [None]:
with open('school.json', 'w') as f:
    json.dump(school, f)

In [None]:
#Read

with open('./school.json', 'r') as f:
    my_school = json.load(f)
my_school

{'school': 'UC Berkeley',
 'address': {'city': 'Berkeley', 'state': 'California', 'postal': '94720'},
 'list': ['student 1', 'student 2', 'student 3'],
 'array': [1, 2, 3]}

**Exercise**: Save a dataframe to a JSON file
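
One possible solution uses the pandas methods `to_json` and `read_json`. This is a sketch under assumptions: the sample DataFrame, the file name `my_dataframe.json`, and the choice of `orient='records'` are illustrative, and other orientations work as well.

```python
import os
import tempfile

import pandas as pd

# Create a sample DataFrame
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4.5, 5.5, 6.5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'my_dataframe.json')  # hypothetical file name
    # orient='records' writes a list with one JSON object per row
    df.to_json(path, orient='records')
    # Read the file back with the same orientation
    df_read = pd.read_json(path, orient='records')

print(df_read)
```

With `orient='records'` the row index is not stored, so the DataFrame that is read back gets a fresh default index.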

### Example: HDF5 file

__HDF5 (Hierarchical Data Format version 5)__

__Type: Binary format__

- Extension “.hdf5”.
- Packages needed: numpy, pandas, PyTables, h5py
- The h5py package is a Python library that provides an interface to the HDF5 format.
- **Best use cases**: Working with large datasets stored as arrays; the format itself places no limit on the file size.
- Performs low-level optimizations under the hood (such as chunked storage and optional compression) to make queries faster and storage requirements smaller.
- An HDF5 file saves two types of objects:
   - _datasets_, which are array-like collections of data (like NumPy arrays),
   - _groups_, which are folder-like containers that hold datasets and other groups.
- __Hierarchical__ means that data in an HDF5 file is organized like a file system: where a file system has folders and subfolders, HDF5 has groups and subgroups.
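
The group and dataset structure described above can be sketched with h5py as follows. The group name `experiment/run_1`, the dataset names, and the file name `grouped.hdf5` are all hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'grouped.hdf5')  # hypothetical file name

    with h5py.File(path, 'w') as hf:
        # Creating a nested group also creates the parent group 'experiment'
        run = hf.create_group('experiment/run_1')
        # Datasets live inside the group, like files inside a folder
        run.create_dataset('temperature', data=np.arange(5.0))
        run.create_dataset('pressure', data=np.ones((2, 3)))

    with h5py.File(path, 'r') as hf:
        # Collect the path of every group and dataset in the file
        names = []
        hf.visit(names.append)
        # Address a dataset by its full path, then read it into a NumPy array
        temperature = hf['experiment/run_1/temperature'][:]

print(sorted(names))
print(temperature)
```

Paths inside the file use forward slashes, just like paths in a file system, regardless of the operating system.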

In [None]:
import h5py

In [None]:
data = np.random.random((10,5))
data

array([[0.92979848, 0.36924849, 0.88344932, 0.09191288, 0.26448954],
       [0.14326202, 0.46494178, 0.75886112, 0.70736961, 0.80359027],
       [0.1747599 , 0.76875532, 0.67745347, 0.87798695, 0.27356942],
       [0.26791045, 0.39398309, 0.32157817, 0.70101487, 0.11110061],
       [0.72357322, 0.7665151 , 0.12226174, 0.97458899, 0.21783992],
       [0.80732575, 0.31933164, 0.32060713, 0.84037281, 0.35492333],
       [0.35921769, 0.52004266, 0.2898736 , 0.88644168, 0.398132  ],
       [0.53323623, 0.34909316, 0.86829307, 0.64648968, 0.88389846],
       [0.02158385, 0.36365644, 0.92881772, 0.36501569, 0.43142309],
       [0.02015551, 0.06352494, 0.59986819, 0.01295296, 0.90587664]])

In [None]:
## Write

with h5py.File('data_array.hdf5', 'w') as hf:
    hf.create_dataset('hdf5_data',data=data)

In [None]:
## Read

with h5py.File('data_array.hdf5', 'r') as hf:
    hdf5_read = hf['hdf5_data'][:]

hdf5_read

array([[0.92979848, 0.36924849, 0.88344932, 0.09191288, 0.26448954],
       [0.14326202, 0.46494178, 0.75886112, 0.70736961, 0.80359027],
       [0.1747599 , 0.76875532, 0.67745347, 0.87798695, 0.27356942],
       [0.26791045, 0.39398309, 0.32157817, 0.70101487, 0.11110061],
       [0.72357322, 0.7665151 , 0.12226174, 0.97458899, 0.21783992],
       [0.80732575, 0.31933164, 0.32060713, 0.84037281, 0.35492333],
       [0.35921769, 0.52004266, 0.2898736 , 0.88644168, 0.398132  ],
       [0.53323623, 0.34909316, 0.86829307, 0.64648968, 0.88389846],
       [0.02158385, 0.36365644, 0.92881772, 0.36501569, 0.43142309],
       [0.02015551, 0.06352494, 0.59986819, 0.01295296, 0.90587664]])

**Exercise**: Save a dataframe to an HDF5 file