In [1]:
%matplotlib inline
import matplotlib
import seaborn as sns

# Importing and Exporting Data

<!-- requirement: data/sample.txt -->
<!-- requirement: data/csv_sample.txt -->
<!-- requirement: data/bad_csv.csv -->


## Python file handles (`open`)

In Python we interact with files on disk using the built-in function `open` and the file method `close`. We've included a file in the `data` folder called `sample.txt`. Let's open it and read its contents.

In [2]:
f = open('./data/sample.txt', 'r')
data = f.read()
f.close()

print(data)
print(f)

Hello!
Congratulations!
You've read in data from a file.
<_io.TextIOWrapper name='./data/sample.txt' mode='r' encoding='cp1252'>


In [5]:
f = open('./data/sample.txt', 'r')
# f.read(i) reads the next i characters,
# picking up where the previous read stopped
for i in range(10):
    print(f.read(i))
f.close()


H
el
lo!

Con
gratu
lation
s!
You'
ve read 
in data f


Notice that we `open` the file and assign it to `f`, `read` the data from `f`, and then `close` `f`. What is `f`? It's called a **file handle**. It's an object that connects Python to the file we `open`. We `read` the data using this connection, and once we're done we `close` the connection. It's a good habit to `close` a file handle once we're done with it, so usually we will do it automatically using Python's `with` keyword.

In [6]:
# f is automatically closed
# at the end of the body of the with statement
with open('./data/sample.txt', 'r') as f:
    print(f.read())

print(f)

Hello!
Congratulations!
You've read in data from a file.
<_io.TextIOWrapper name='./data/sample.txt' mode='r' encoding='cp1252'>
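To check that the `with` statement really closes the handle, we can inspect its `closed` attribute. The following is a minimal sketch; it writes to a throwaway temporary file (the name `handle_demo.txt` is made up for illustration) so it does not touch the `data` folder.

```python
import os
import tempfile

# Write a small throwaway file so the example does not depend on ./data
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'handle_demo.txt')  # hypothetical file name

with open(path, 'w') as f:
    f.write('Hello!')
    inside = f.closed    # False: the handle is still open inside the block

outside = f.closed       # True: the with statement closed it for us

print(inside, outside)
```

The variable `f` still exists after the block, which is why `print(f)` works above, but the connection to the file is closed.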


We can also read individual lines of a file.

In [4]:
with open('./data/sample.txt', 'r') as f:
    print(f.readline())
    print(f.readline())

Hello!

Congratulations!



In [5]:
with open('./data/sample.txt', 'r') as f:
    print(f.readlines())

['Hello!\n', 'Congratulations!\n', "You've read in data from a file."]
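A file handle can also be looped over directly, which reads one line at a time instead of loading the whole file at once. Here is a small sketch of that pattern, using a temporary file with made-up contents so it runs on its own.

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'lines_demo.txt')  # hypothetical file name

with open(path, 'w') as f:
    f.write('Hello!\nCongratulations!\nYou have read in data from a file.')

lines = []
with open(path, 'r') as f:
    for line in f:                       # the handle yields one line at a time
        lines.append(line.rstrip('\n'))  # drop the trailing newline

print(lines)
```

This is usually the friendlier choice for large files, because only one line needs to be in memory at a time.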


Writing data to files is very similar. The main difference is when we `open` the file, we will use the `'w'` flag instead of `'r'`.

In [6]:
with open('./data/my_data.txt', 'w') as f:
    f.write('This is a new file.')
    f.write('I am practicing writing data to disk.')

with open('./data/my_data.txt', 'r') as f:
    my_data = f.read()

print(my_data)

This is a new file.I am practicing writing data to disk.


No matter how often I execute the above cell, the same output gets printed. Opening the file with the `'w'` flag will overwrite the contents of the file. If we want to add to what is already in the file, we have to open the file with the `'a'` flag (`'a'` stands for _append_).

In [7]:
with open('./data/my_data.txt', 'a') as f:
    f.write('\nAdding a new line to the file.')

with open('./data/my_data.txt', 'r') as f:
    my_data = f.read()

print(my_data)

This is a new file.I am practicing writing data to disk.
Adding a new line to the file.


We always need to be careful when writing to disk, because we could overwrite or alter data by accident. It is also easy to encounter errors when working with files, because we might not know ahead of time if the file we're trying to access exists, or we might mix up the `'r'`, `'w'`, and `'a'` flags.

In [11]:
# if a file doesn't exist
# we can't open it for reading
# (but we can open it for writing)

with open('./data/fail.txt', 'r') as f:
    f.read()

FileNotFoundError: [Errno 2] No such file or directory: './data/fail.txt'

In [12]:
# we can't read a file open for writing

with open('./data/fail.txt', 'w') as f:
    f.read()

UnsupportedOperation: not readable

In [13]:
# and we can't write to a file open for reading

with open('./data/sample.txt', 'r') as f:
    f.write('This will fail')

UnsupportedOperation: not writable

Can we prevent some of these errors? How do we find out what files are on disk?
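One way to guard against these errors is to catch the exception, or to open the file in a mode that refuses to overwrite. The sketch below is one possible approach, not the only one; the helper `read_text_or_default` is a hypothetical name introduced for illustration, and the `'x'` flag (exclusive creation) is a standard `open` mode that was not used above.

```python
import os
import tempfile

def read_text_or_default(path, default=''):
    """Return the file's text, or `default` if the file does not exist."""
    try:
        with open(path, 'r') as f:
            return f.read()
    except FileNotFoundError:
        return default

tmp_dir = tempfile.mkdtemp()
missing = os.path.join(tmp_dir, 'fail.txt')

# The file does not exist yet, so we get the default back instead of an error
result_missing = read_text_or_default(missing, default='(no such file)')

# 'x' creates a new file and raises FileExistsError rather than overwriting
with open(missing, 'x') as f:
    f.write('created once')

try:
    with open(missing, 'x') as f:
        f.write('this would overwrite')
    overwrote = True
except FileExistsError:
    overwrote = False

print(result_missing, os.path.exists(missing), overwrote)
```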

## `os` module

Python has a module for navigating the computer's file system called `os`. There are many useful tools in the `os` module, but there are two functions that are most useful for finding files.

In [14]:
import os

# list the contents of the current directory
# ('.' refers to the current directory)
os.listdir('.')

['(A) Variables , Expression and Flow.ipynb',
 '(B) Function.ipynb',
 '(C) Conditions and Iteration.ipynb',
 '(D) String.ipynb',
 '(E) Lists  Tuble and Set.ipynb',
 '(F) Dict.ipynb',
 '(G) IO.ipynb',
 '(H).ipynb',
 '.ipynb_checkpoints',
 'data',
 'Practice Problem.ipynb',
 'PY_datetime.ipynb',
 'PY_OOP.ipynb']

The command `listdir` is the simpler of the two functions we'll cover. It simply lists the contents of the directory path we specify. When we pass `'.'` as the argument, `listdir` will look in the current directory. It lists all the Jupyter notebooks we're using for the course, as well as the `data` subdirectory. We could find out what's in the `data` subdirectory by looking in `'./data'`.

In [16]:
os.listdir('./data')

['fail.txt', 'my_data.txt', 'sample.txt']

What if we wanted to find all the files and subdirectories below a directory somewhere on our computer? With `listdir` we only see the files and subdirectories under the particular directory we're looking in. We cannot use `listdir` to automatically search through subdirectories. For this we need to use `walk`, which "walks" through all the subdirectories below our chosen directory. We won't cover `walk` in this course, but it's one of the very useful tools (along with the `os.path` sub-module) for working with files in Python, particularly if you are working with many different data files at once.
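For the curious, here is a brief sketch of what `walk` looks like in practice. It builds a small temporary directory tree (all file and folder names are made up) and then collects every file below the top directory, including files in subdirectories.

```python
import os
import tempfile

# Build a small throwaway directory tree (names are hypothetical)
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'data', 'raw'))
for rel in ['notes.txt',
            os.path.join('data', 'a.csv'),
            os.path.join('data', 'raw', 'b.csv')]:
    with open(os.path.join(root, rel), 'w') as f:
        f.write('x')

found = []
# os.walk yields one (directory, subdirectories, files) triple per directory
for dirpath, dirnames, filenames in os.walk(root):
    for name in filenames:
        full = os.path.join(dirpath, name)
        found.append(os.path.relpath(full, root))  # path relative to root

found.sort()
print(found)
```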

## JSON

JSON stands for JavaScript Object Notation. JavaScript is a common language for creating web applications, and JSON files are used to collect and transmit information between JavaScript applications. As a result, a lot of data on the internet exists in the JSON file format. For example, Twitter and Google Maps use JSON.

A JSON file is essentially a data structure built out of nested dictionaries and lists. Let's make our own example and then we'll examine an example downloaded from the internet.

In [21]:
book1 = {'title': 'The Prophet',
         'author': 'Khalil Gibran',
         'genre': 'poetry',
         'tags': ['religion', 'spirituality', 'philosophy', 'Lebanon', 'Arabic', 'Middle East'],
         'book_id': '811.19',
         'copies': [{'edition_year': 1996,
                     'checkouts': 486,
                     'borrowed': False},
                    {'edition_year': 1996,
                     'checkouts': 443,
                     'borrowed': False}]
         }
         
book2 = {'title': 'The Little Prince',
         'author': 'Antoine de Saint-Exupery',
         'genre': 'children',
         'tags': ['fantasy', 'France', 'philosophy', 'illustrated', 'fable'],
         'id': '843.912',
         'copies': [{'edition_year': 1983,
                     'checkouts': 634,
                     'borrowed': True,
                     'due_date': '2017/02/02'},
                    {'edition_year': 2015,
                     'checkouts': 41,
                     'borrowed': False}]
         }

library = [book1, book2]
library

[{'title': 'The Prophet',
  'author': 'Khalil Gibran',
  'genre': 'poetry',
  'tags': ['religion',
   'spirituality',
   'philosophy',
   'Lebanon',
   'Arabic',
   'Middle East'],
  'book_id': '811.19',
  'copies': [{'edition_year': 1996, 'checkouts': 486, 'borrowed': False},
   {'edition_year': 1996, 'checkouts': 443, 'borrowed': False}]},
 {'title': 'The Little Prince',
  'author': 'Antoine de Saint-Exupery',
  'genre': 'children',
  'tags': ['fantasy', 'France', 'philosophy', 'illustrated', 'fable'],
  'id': '843.912',
  'copies': [{'edition_year': 1983,
    'checkouts': 634,
    'borrowed': True,
    'due_date': '2017/02/02'},
   {'edition_year': 2015, 'checkouts': 41, 'borrowed': False}]}]

We have two books in our `library`. Both books have some common properties: a title, an author, an id, and tags. Each book can have several tags, so we store that data as a list. Additionally, there can be multiple copies of each book, and each copy also has some unique information like the year it was printed and how many times it's been checked out. Notice that if a book is checked out, it also has a due date. It's convenient to store the information about the multiple copies as a list of dictionaries within the dictionary about the book, because every copy shares the same title, author, etc.

This structure is typical of JSON files. It has the advantage of reducing redundancy of data. We only store the author and title once, even though there are multiple copies of the book. Also, we don't store a due date for copies that aren't checked out.

If we were to put this data in a table, we would have to duplicate a lot of information. Also, since only one copy in our library is checked out, we would have a column with a lot of missing data.

|index|title|author|id|genre|tags|edition_year|checkouts|borrowed|due_date|
|:---:|:---:|:----:|:--:|:---:|:--:|:----------:|:-------:|:------:|:------:|
|0|The Prophet|Khalil Gibran|811.19|poetry|religion, spirituality, philosophy, Lebanon, Arabic, Middle East|1996|486|False|Null|
|1|The Prophet|Khalil Gibran|811.19|poetry|religion, spirituality, philosophy, Lebanon, Arabic, Middle East|1996|443|False|Null|
|2|The Little Prince|Antoine de Saint-Exupery|843.912|children|fantasy, France, philosophy, illustrated, fable|1983|634|True|2017/02/02|
|3|The Little Prince|Antoine de Saint-Exupery|843.912|children|fantasy, France, philosophy, illustrated, fable|2015|41|False|Null|

This is very wasteful. Since JSON files are meant to be shared quickly over the internet, it is important that they are small to reduce the amount of resources needed to store and transmit them.
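To make the duplication concrete, here is a sketch that flattens a trimmed-down version of the library (only the title, author, and copies are kept, to stay short) into one row per copy. The nested form stores each title once; the flat form repeats it for every copy and needs a placeholder for missing due dates.

```python
# A trimmed-down copy of the library from above (fewer fields, same shape)
library = [
    {'title': 'The Prophet', 'author': 'Khalil Gibran',
     'copies': [{'edition_year': 1996, 'checkouts': 486, 'borrowed': False},
                {'edition_year': 1996, 'checkouts': 443, 'borrowed': False}]},
    {'title': 'The Little Prince', 'author': 'Antoine de Saint-Exupery',
     'copies': [{'edition_year': 1983, 'checkouts': 634, 'borrowed': True,
                 'due_date': '2017/02/02'},
                {'edition_year': 2015, 'checkouts': 41, 'borrowed': False}]},
]

rows = []
for book in library:
    for item in book['copies']:
        rows.append({
            'title': book['title'],            # repeated for every copy
            'author': book['author'],          # repeated for every copy
            'edition_year': item['edition_year'],
            'checkouts': item['checkouts'],
            'borrowed': item['borrowed'],
            'due_date': item.get('due_date'),  # None when not checked out
        })

for row in rows:
    print(row)
```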

We can write our `library` to disk using the `json` module.

In [22]:
import json

with open('./data/library.json', 'w') as f:
    json.dump(library, f, indent=2)

In [23]:
!cat ./data/library.json

[
  {
    "title": "The Prophet",
    "author": "Khalil Gibran",
    "genre": "poetry",
    "tags": [
      "religion",
      "spirituality",
      "philosophy",
      "Lebanon",
      "Arabic",
      "Middle East"
    ],
    "book_id": "811.19",
    "copies": [
      {
        "edition_year": 1996,
        "checkouts": 486,
        "borrowed": false
      },
      {
        "edition_year": 1996,
        "checkouts": 443,
        "borrowed": false
      }
    ]
  },
  {
    "title": "The Little Prince",
    "author": "Antoine de Saint-Exupery",
    "genre": "children",
    "tags": [
      "fantasy",
      "France",
      "philosophy",
      "illustrated",
      "fable"
    ],
    "id": "843.912",
    "copies": [
      {
        "edition_year": 1983,
        "checkouts": 634,
        "borrowed": true,
        "due_date": "2017/02/02"
      },
      {
        "edition_year": 2015,
        "checkouts": 41,
        "borrowed": false
      }
    ]
  }
]

In [24]:
with open('./data/library.json', 'r') as f:
    reloaded_library = json.load(f)

reloaded_library

[{'title': 'The Prophet',
  'author': 'Khalil Gibran',
  'genre': 'poetry',
  'tags': ['religion',
   'spirituality',
   'philosophy',
   'Lebanon',
   'Arabic',
   'Middle East'],
  'book_id': '811.19',
  'copies': [{'edition_year': 1996, 'checkouts': 486, 'borrowed': False},
   {'edition_year': 1996, 'checkouts': 443, 'borrowed': False}]},
 {'title': 'The Little Prince',
  'author': 'Antoine de Saint-Exupery',
  'genre': 'children',
  'tags': ['fantasy', 'France', 'philosophy', 'illustrated', 'fable'],
  'id': '843.912',
  'copies': [{'edition_year': 1983,
    'checkouts': 634,
    'borrowed': True,
    'due_date': '2017/02/02'},
   {'edition_year': 2015, 'checkouts': 41, 'borrowed': False}]}]
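The `json` module also works with strings instead of files, through `json.dumps` and `json.loads`. The sketch below uses a made-up record (the `shelf` field is hypothetical) to show how Python values map onto JSON: `False` becomes `false`, `None` becomes `null`, and a tuple comes back as a list because JSON only has arrays.

```python
import json

record = {'title': 'The Prophet', 'borrowed': False, 'due_date': None,
          'shelf': (3, 14)}   # hypothetical 'shelf' field, stored as a tuple

text = json.dumps(record)     # Python object -> JSON string
restored = json.loads(text)   # JSON string -> Python object

print(text)
print(restored)
```

So a round trip through JSON preserves the data but not always the exact Python types.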

In [25]:
# note that if we loaded it in without JSON
# the file would be interpreted as plain text

with open('./data/library.json', 'r') as f:
    library_string = f.read()

# this isn't what we want
library_string

'[\n  {\n    "title": "The Prophet",\n    "author": "Khalil Gibran",\n    "genre": "poetry",\n    "tags": [\n      "religion",\n      "spirituality",\n      "philosophy",\n      "Lebanon",\n      "Arabic",\n      "Middle East"\n    ],\n    "book_id": "811.19",\n    "copies": [\n      {\n        "edition_year": 1996,\n        "checkouts": 486,\n        "borrowed": false\n      },\n      {\n        "edition_year": 1996,\n        "checkouts": 443,\n        "borrowed": false\n      }\n    ]\n  },\n  {\n    "title": "The Little Prince",\n    "author": "Antoine de Saint-Exupery",\n    "genre": "children",\n    "tags": [\n      "fantasy",\n      "France",\n      "philosophy",\n      "illustrated",\n      "fable"\n    ],\n    "id": "843.912",\n    "copies": [\n      {\n        "edition_year": 1983,\n        "checkouts": 634,\n        "borrowed": true,\n        "due_date": "2017/02/02"\n      },\n      {\n        "edition_year": 2015,\n        "checkouts": 41,\n        "borrowed": false\n      

In [26]:
# Pandas can also read_json
# notice how it constructs the table
# does it represent the data well?
import pandas as pd

pd.read_json('./data/library.json')

| |author|book_id|copies|genre|id|tags|title|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|0|Khalil Gibran|811.19|[{'edition_year': 1996, 'checkouts': 486, 'bor...|poetry| |[religion, spirituality, philosophy, Lebanon, ...|The Prophet|
|1|Antoine de Saint-Exupery| |[{'edition_year': 1983, 'checkouts': 634, 'bor...|children|843.912|[fantasy, France, philosophy, illustrated, fable]|The Little Prince|


In [27]:
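# and to_json
# (df is assumed to be a small example DataFrame,
# rebuilt here from the output shown below)
df = pd.DataFrame({'name': ['Dylan', 'Terrence', 'Mya'], 'age': [28, 54, 31]})
df.to_json('./data/example_df.json')

!head ./data/example_df.json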

{"name":{"0":"Dylan","1":"Terrence","2":"Mya"},"age":{"0":28,"1":54,"2":31}}

We can download JSON files in many ways. Sometimes we will download them manually, but we can also use `wget` like we did for the CSV example. Often we'll connect to a website's API, which will respond using JSON.

Pandas' `read_json` function can connect directly to a URL (whether it's the address of a JSON file or an API endpoint) and read the JSON without saving the file to our computer.

In [28]:
pd.read_json('https://api.github.com/repos/pydata/pandas/issues?per_page=5')

Unnamed: 0,assignee,assignees,author_association,body,closed_at,comments,comments_url,created_at,events_url,html_url,...,milestone,node_id,number,pull_request,repository_url,state,title,updated_at,url,user
0,,[],NONE,"#### Code Sample, a copy-pastable example if p...",NaT,0,https://api.github.com/repos/pandas-dev/pandas...,2019-07-23 07:34:01,https://api.github.com/repos/pandas-dev/pandas...,https://github.com/pandas-dev/pandas/issues/27539,...,,MDU6SXNzdWU0NzE1MzE4MTU=,27539,,https://api.github.com/repos/pandas-dev/pandas,open,Dataframe construction from numba typed list,2019-07-23 07:34:01,https://api.github.com/repos/pandas-dev/pandas...,"{'login': 'leohaim', 'id': 7847768, 'node_id':..."
1,,[],CONTRIBUTOR,- [ ] closes #xxxx\r\n- [ ] tests added / pass...,NaT,0,https://api.github.com/repos/pandas-dev/pandas...,2019-07-23 07:12:56,https://api.github.com/repos/pandas-dev/pandas...,https://github.com/pandas-dev/pandas/pull/27538,...,,MDExOlB1bGxSZXF1ZXN0MzAwMTU2ODk3,27538,{'url': 'https://api.github.com/repos/pandas-d...,https://api.github.com/repos/pandas-dev/pandas,open,COMPAT: remove Categorical pickle compat with ...,2019-07-23 09:09:57,https://api.github.com/repos/pandas-dev/pandas...,"{'login': 'topper-123', 'id': 26364415, 'node_..."
2,,[],MEMBER,… mixed integers/strings columns names\r\n\r\n...,NaT,0,https://api.github.com/repos/pandas-dev/pandas...,2019-07-23 04:17:32,https://api.github.com/repos/pandas-dev/pandas...,https://github.com/pandas-dev/pandas/pull/27537,...,,MDExOlB1bGxSZXF1ZXN0MzAwMTIxOTA4,27537,{'url': 'https://api.github.com/repos/pandas-d...,https://api.github.com/repos/pandas-dev/pandas,open,TST: label-based indexing fails with certain l...,2019-07-23 04:24:26,https://api.github.com/repos/pandas-dev/pandas...,"{'login': 'simonjayhawkins', 'id': 13159005, '..."
3,,[],MEMBER,to keep the wheels running while I troubleshoo...,NaT,2,https://api.github.com/repos/pandas-dev/pandas...,2019-07-23 03:06:51,https://api.github.com/repos/pandas-dev/pandas...,https://github.com/pandas-dev/pandas/pull/27536,...,{'url': 'https://api.github.com/repos/pandas-d...,MDExOlB1bGxSZXF1ZXN0MzAwMTEwNzQ3,27536,{'url': 'https://api.github.com/repos/pandas-d...,https://api.github.com/repos/pandas-dev/pandas,open,xfail to fix CI,2019-07-23 03:55:11,https://api.github.com/repos/pandas-dev/pandas...,"{'login': 'jbrockmendel', 'id': 8078968, 'node..."
4,,[],MEMBER,,NaT,0,https://api.github.com/repos/pandas-dev/pandas...,2019-07-23 02:48:23,https://api.github.com/repos/pandas-dev/pandas...,https://github.com/pandas-dev/pandas/pull/27535,...,,MDExOlB1bGxSZXF1ZXN0MzAwMTA3NjI1,27535,{'url': 'https://api.github.com/repos/pandas-d...,https://api.github.com/repos/pandas-dev/pandas,open,DEPR: remove .ix from tests/indexing/test_inde...,2019-07-23 02:48:23,https://api.github.com/repos/pandas-dev/pandas...,"{'login': 'simonjayhawkins', 'id': 13159005, '..."


## Compressed files (Gzip)

Another way we save storage and network resources is by using **compression**. Many times data sets will contain patterns that can be used to reduce the amount of space needed to store the information.

A simple example is the following list of numbers: 10, 10, 10, 2, 3, 3, 3, 3, 3, 50, 50, 1, 1, 50, 10, 10, 10, 10

Rather than writing out the full list of numbers (18 integers), we can represent the same information with only 14 numbers: (3, 10), (1, 2), (5, 3), (2, 50), (2, 1), (1, 50), (4, 10)

Here the first number in each pair is the number of repetitions, and the second number in the pair is the actual value. We've successfully reduced the amount of numbers we need to represent the same data. Most forms of compression use a similar idea, although actual implementations are usually more complex.
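This scheme is commonly called run-length encoding. A minimal sketch of it in Python follows; the function names `rle_encode` and `rle_decode` are our own, and this is only an illustration of the idea, not how Gzip works internally.

```python
from itertools import groupby

def rle_encode(values):
    """Collapse runs of equal values into (count, value) pairs."""
    return [(len(list(run)), value) for value, run in groupby(values)]

def rle_decode(pairs):
    """Expand (count, value) pairs back into the original list."""
    out = []
    for count, value in pairs:
        out.extend([value] * count)
    return out

numbers = [10, 10, 10, 2, 3, 3, 3, 3, 3, 50, 50, 1, 1, 50, 10, 10, 10, 10]
encoded = rle_encode(numbers)

print(encoded)
print(rle_decode(encoded) == numbers)
```

Decoding the pairs gives back exactly the original list, so no information is lost.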

In the world of data science, the most common compression is Gzip (which uses the [deflate algorithm](http://www.infinitepartitions.com/art001.html)). Gzip files end with the extension `.gz`.

In [29]:
!wget -P ./data/ https://archive.org/stream/TheEpicofGilgamesh_201606/eog_djvu.txt

--2019-07-23 09:11:13--  https://archive.org/stream/TheEpicofGilgamesh_201606/eog_djvu.txt
Resolving archive.org (archive.org)... 207.241.224.2
Connecting to archive.org (archive.org)|207.241.224.2|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘./data/eog_djvu.txt’

eog_djvu.txt            [ <=>                ] 159.76K  --.-KB/s    in 0.1s    

2019-07-23 09:11:14 (1.23 MB/s) - ‘./data/eog_djvu.txt’ saved [163595]



In [30]:
import gzip

with open('./data/eog_djvu.txt', 'r') as f:
    text = f.read()

with gzip.open('./data/eog_djvu.txt.gz', 'wb') as f:
    f.write(text.encode('utf-8'))

!ls -lh ./data/eog*

-rw-r--r-- 1 jovyan users 160K Jul 23 09:11 ./data/eog_djvu.txt
-rw-r--r-- 1 jovyan users  46K Jul 23 09:11 ./data/eog_djvu.txt.gz


We were able to compress the text of The Epic of Gilgamesh to less than a third of its original size (46K versus 160K)! Remember that compression depends on patterns in the data. Language has a lot of patterns, but what would happen if we scrambled all the letters in the text?

In [31]:
import numpy as np

with gzip.open('./data/eog_djvu_scrambled.txt.gz', 'wb') as f:
    f.write(np.random.permutation(list(text)))

!ls -lh ./data/eog*

-rw-r--r-- 1 jovyan users 160K Jul 23 09:11 ./data/eog_djvu.txt
-rw-r--r-- 1 jovyan users  46K Jul 23 09:11 ./data/eog_djvu.txt.gz
-rw-r--r-- 1 jovyan users 136K Jul 23 09:11 ./data/eog_djvu_scrambled.txt.gz


The scrambled version only compressed to about 85% of the size of the original (136K versus 160K). Compression won't perform very well on random data. Compression also doesn't work very well on data that is already small.

In [32]:
short_text = 'Hello'

with open('./data/short_text.txt', 'w') as f:
    f.write(short_text)

with gzip.open('./data/short_text.txt.gz', 'wb') as f:
    f.write(short_text.encode('utf-8'))

!ls -lh ./data/short_text*

-rw-r--r-- 1 jovyan users  5 Jul 23 09:11 ./data/short_text.txt
-rw-r--r-- 1 jovyan users 40 Jul 23 09:11 ./data/short_text.txt.gz


The compressed file is bigger than the plain text! That's because the compressed file includes a header, which takes up a small amount of extra space. Also, since the text is so short, it's not possible to use patterns to represent the text more efficiently. Therefore we usually save compression for large files.
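These three effects can also be seen without writing any files, using `gzip.compress` on bytes in memory. The sketch below uses made-up inputs: a highly repetitive byte string, random bytes of the same length from `os.urandom`, and the five bytes of `'Hello'`. Exact sizes may vary between Python versions, so only the ratios are printed.

```python
import gzip
import os

repetitive = b'the quick brown fox ' * 5000   # 100,000 bytes with an obvious pattern
random_bytes = os.urandom(len(repetitive))    # same length, no pattern to exploit
short = b'Hello'                              # too small to benefit

def ratio(data):
    """Compressed size divided by original size (below 1 means it shrank)."""
    return len(gzip.compress(data)) / len(data)

print(f'repetitive: {ratio(repetitive):.3f}')
print(f'random:     {ratio(random_bytes):.3f}')
print(f'short:      {ratio(short):.3f}')
```

The repetitive input shrinks dramatically, the random input barely changes, and the short input grows because of the fixed header and trailer.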

You may have noticed that when we write Gzip files, we have been using a `'wb'` flag instead of a plain `'w'` flag. This is because Gzip is not plain text. When compressing the file we write _binary_ files. The files are not readable as plain text.

In [33]:
# we have to uncompress the file
# before we can read it

!cat ./data/short_text.txt.gz

�3�6]�short_text.txt �H��� ����   

We should only use `'w'` for plain text files (which includes CSV and JSON). Using `'w'` instead of `'wb'` for Gzip files, or other files which are not plain text (e.g. images), could damage the file.
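To read a Gzip file back as text, we can let the `gzip` module do the decompression and decoding for us by opening it in text mode (`'rt'`; there is a matching `'wt'` for writing). This is a small sketch using a temporary file, so the path is not one of the course's data files.

```python
import gzip
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'short_text.txt.gz')

# 'wt' / 'rt' open the Gzip file in text mode, so gzip handles the encoding
with gzip.open(path, 'wt', encoding='utf-8') as f:
    f.write('Hello')

with gzip.open(path, 'rt', encoding='utf-8') as f:
    text_back = f.read()

# Plain open in binary mode returns the compressed bytes, not the text
with open(path, 'rb') as f:
    raw = f.read()

print(text_back)
print(raw[:2])   # every Gzip file starts with these two bytes
```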

## Serialization (`pickle`)

Often we will want to save our work in Python and come back to it later. However, that work might be a machine learning model or some other complex object in Python. How do we save complex Python objects? Python has a module for this purpose called `pickle`. We can use `pickle` to write a binary file that contains all the information about a Python object. Later we can load that pickle file and reconstruct the object in Python.

In [34]:
pickle_example = ['hello', {'a': 23, 'b': True}, (1, 2, 3), [['dogs', 'cats'], None]]

In [35]:
%%expect_exception TypeError

# we can't save this as text
with open('./data/pickle_example.txt', 'w') as f:
    f.write(pickle_example)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-35-dc175613edd9> in <module>()
      2 # we can't save this as text
      3 with open('./data/pickle_example.txt', 'w') as f:
----> 4     f.write(pickle_example)

TypeError: write() argument must be str, not list


In [36]:
import pickle

# we can save it as a pickle
with open('./data/pickle_example.pkl', 'wb') as f:
    pickle.dump(pickle_example, f)

with open('./data/pickle_example.pkl', 'rb') as f:
    reloaded_example = pickle.load(f)

reloaded_example

['hello', {'a': 23, 'b': True}, (1, 2, 3), [['dogs', 'cats'], None]]

In [37]:
# the reloaded example is the same as the original

reloaded_example == pickle_example

True

Pickle is an important tool for data scientists. Data processing and training machine learning models can take a long time, and it is useful to save checkpoints.
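A checkpoint pattern might look like the sketch below: load the pickle if it already exists, otherwise do the work and save it. The `checkpoint` dictionary stands in for the result of a slow computation, and `load_or_compute` is a hypothetical helper name, so treat this as one possible shape rather than a standard recipe.

```python
import os
import pickle
import tempfile

# A stand-in for the result of a slow computation (hypothetical structure)
checkpoint = {'step': 3, 'weights': [0.1, 0.5, -0.2], 'labels': ('cat', 'dog')}

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'checkpoint.pkl')

def load_or_compute(path):
    """Reload a saved result if present; otherwise 'compute' and save it."""
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f), 'loaded'
    with open(path, 'wb') as f:
        pickle.dump(checkpoint, f)
    return checkpoint, 'computed'

first, first_source = load_or_compute(path)     # nothing saved yet
second, second_source = load_or_compute(path)   # read back from disk

print(first_source, second_source, second == checkpoint)
```

Unlike JSON, pickle keeps Python-specific types such as tuples. One caution: loading a pickle can run arbitrary code, so only unpickle files that come from a source you trust.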

Pandas also has `to_pickle` and `read_pickle` methods.