# Reading and Writing Data

There are many common formats for storing data on a computer.

For images we have .jpg, .png, .bmp, etc.

For music or audio there are .mp3, .wav, .flac, etc.

For machine learning, we are usually interested in reading and storing general data - numbers, labels, etc. with no specific formatting requirements. Most of the data we use can be put into spreadsheet formats.

We will look at three common file formats today: HDF5 (.h5), CSV (.csv), and pickle (.p).

## Writing H5

H5 (HDF5) is one of the most widely used file formats in machine learning. It stores named datasets (n-dimensional arrays), which can be organized into groups, and it handles very large files well: reads and writes are fast, and you can load just the part of a dataset you need instead of pulling the whole file into memory.

In [102]:
import h5py as h5
import numpy as np

In [103]:
data1 = np.random.rand(5, 3, 2)
data2 = np.random.rand(8, 1, 3)

In [104]:
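# 'w' creates the file (overwriting any existing one); the "Files" folder must already exist
with h5.File('Files/write_example.h5', 'w') as out_file:
    out_file.create_dataset('dataset_1', data=data1)
    out_file.create_dataset('dataset_2', data=data2)
# The with-block closes the file automatically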

We have just created a file called "write_example.h5" with some data in it. Now let's read it.

## Reading H5

In [105]:
# 'r' opens the file read-only
in_file = h5.File("Files/write_example.h5", "r")

In [106]:
list(in_file.keys())

['dataset_1', 'dataset_2']

In [107]:
data1 = in_file['dataset_1']
data2 = in_file['dataset_2']

In [108]:
type(data1)

h5py._hl.dataset.Dataset

Indexing the file gives back an h5py Dataset object, not a numpy array. A Dataset is a handle to data that still lives in the file, so it only works while the file is open. Let's copy it into a numpy array.

In [109]:
data1 = np.array(data1)
data1

array([[[ 0.39235455,  0.77918144],
        [ 0.316126  ,  0.31887274],
        [ 0.90760118,  0.25840307]],

       [[ 0.95396898,  0.08199345],
        [ 0.06530931,  0.40856612],
        [ 0.81485616,  0.69140805]],

       [[ 0.1539845 ,  0.99988664],
        [ 0.47508733,  0.65746443],
        [ 0.75985667,  0.88467512]],

       [[ 0.64530204,  0.65743213],
        [ 0.04292335,  0.93546977],
        [ 0.24865096,  0.85549708]],

       [[ 0.86856698,  0.62103382],
        [ 0.01512661,  0.20191142],
        [ 0.55569286,  0.41404048]]])

In [110]:
in_file.close()
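Since a Dataset points at data that is still on disk, it can also be sliced to read only part of the data. The sketch below is a minimal illustration of that idea; the temporary file name `slice_example.h5` and the dataset name `values` are made up for this example and are not part of the course files.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file in a temporary folder, used only for this sketch
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "slice_example.h5")

values = np.arange(24, dtype=np.float64).reshape(4, 3, 2)

# Write one dataset; the with-block closes the file when it ends
with h5py.File(path, "w") as f:
    f.create_dataset("values", data=values)

with h5py.File(path, "r") as f:
    dset = f["values"]       # a handle to the data in the file
    shape = dset.shape       # shape and dtype are available from the handle
    dtype = dset.dtype
    first_block = dset[0]    # reads only the first 3x2 block, as a numpy array
    everything = dset[...]   # reads the whole dataset, as a numpy array

print(shape, dtype)
print(type(first_block))
```

Slicing like this is what makes H5 practical for files that are too large to load all at once.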

## H5 Practice

There is another file called "Files/read_example.h5". Open this file and see what's in it.

## Writing CSV

CSV (comma-separated values) is a common format for storing tabular, spreadsheet-like data. It's not as flexible as H5, but it's very easy to use: it's just a plain text file with one row per line and commas between the values in each row.
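To see what a CSV file looks like as raw text, here is a small sketch using Python's built-in `csv` module. It writes to an in-memory buffer instead of a real file, and the rows are made up for illustration. Note how a value that itself contains a comma gets wrapped in quotes.

```python
import csv
import io

# Made-up rows for illustration; "25,000" contains a comma on purpose
rows = [
    ["first_name", "postTestScore"],
    ["Jason", "25,000"],
    ["Tina", 57],
]

# Write the rows as CSV text into an in-memory buffer
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(rows)
raw_text = buffer.getvalue()
print(raw_text)

# Read the text back; every value comes back as a string
parsed = list(csv.reader(io.StringIO(raw_text)))
print(parsed)
```

Everything read from a CSV starts out as text, which is why tools like pandas have to guess the type of each column when loading a file.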

In [141]:
import pandas as pd

# Some entries are deliberately messy: "." stands in for a missing value,
# and two of the scores are strings with thousands separators
data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'last_name': ['Miller', 'Jacobson', ".", 'Milner', 'Cooze'],
        'age': [42, 52, 36, 24, 73],
        'preTestScore': [4, 24, 31, ".", "."],
        'postTestScore': ["25,000", "94,000", 57, 62, 70]}
df = pd.DataFrame(data, columns=['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])

In [142]:
df

   first_name  last_name  age  preTestScore  postTestScore
0       Jason     Miller   42             4         25,000
1       Molly   Jacobson   52            24         94,000
2        Tina          .   36            31             57
3        Jake     Milner   24             .             62
4         Amy      Cooze   73             .             70


In [143]:
# Note: to_csv also writes the row index, as an unnamed first column
df.to_csv('Files/write_example.csv')

## Reading CSV

In [129]:
import pandas as pd

In [144]:
file_name = "Files/write_example.csv"
# skipfooter=1 drops the last line of the file (the row for Amy); it requires engine="python"
data = pd.read_csv(file_name, delimiter=",", engine="python", skipfooter=1)

In [145]:
data

   Unnamed: 0  first_name  last_name  age  preTestScore  postTestScore
0           0       Jason     Miller   42             4         25,000
1           1       Molly   Jacobson   52            24         94,000
2           2        Tina          .   36            31             57
3           3        Jake     Milner   24             .             62
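The "Unnamed: 0" column in the output above is the row index that `to_csv` wrote into the file. One way to avoid it, sketched here on a smaller made-up table and an in-memory buffer rather than the course files, is to pass `index=False` when writing. The sketch also uses `na_values` so that the "." placeholders are read as missing values.

```python
import io

import pandas as pd

# Smaller made-up table; "." marks a missing score
df = pd.DataFrame({
    "first_name": ["Jason", "Molly", "Tina", "Jake", "Amy"],
    "preTestScore": [4, 24, 31, ".", "."],
})

# index=False leaves the row index out of the CSV text
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# na_values tells pandas to treat "." as a missing value (NaN)
restored = pd.read_csv(buffer, na_values=["."])
print(restored)
```

With the placeholders converted to NaN, pandas can treat `preTestScore` as a numeric column instead of a column of strings.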


## CSV Practice

Read what's in "Files/read_example.csv".

## Writing Pickle

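Pickle files can store almost any Python object. That's very convenient, but pickle is a poor choice for general data storage: the files can only be read from Python, they may stop loading if your code or library versions change, and unpickling a file from an untrusted source can run arbitrary code. Only load pickle files you trust.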
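As a quick illustration of how general pickle is, the sketch below round-trips a nested object through bytes in memory using `pickle.dumps` and `pickle.loads`. The `record` contents are made up for this example.

```python
import pickle

# A made-up object mixing several built-in types
record = {
    "name": "JoJo",
    "tags": {"dog", "pet"},   # a set
    "position": (3, 4),       # a tuple
    "scores": [1.5, 2.5],     # a list of floats
}

# dumps() turns the object into bytes; loads() rebuilds it
payload = pickle.dumps(record)
restored = pickle.loads(payload)

print(type(payload))
print(restored == record)
```

Unlike CSV, the types survive the round trip: the tuple comes back as a tuple and the set as a set.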

In [151]:
import pickle

In [152]:
pet_names = {"dog": "JoJo", "cat": "Rainbow"}
# The with-block closes the file once the object has been written
with open("Files/write_example.p", "wb") as out_file:
    pickle.dump(pet_names, out_file)

## Reading Pickle

In [173]:
# The with-block closes the file after loading
with open("Files/write_example.p", "rb") as in_file:
    data = pickle.load(in_file)

In [174]:
data

{'cat': 'Rainbow', 'dog': 'JoJo'}

In [175]:
cat_name = data['cat']
cat_name

'Rainbow'

## Pickle Practice

Read "Files/read_example.p".