## 2.6 EXPORTING AND IMPORTING DATA

In our examples so far, we have been importing data. It is also common practice to export or save out data sets while processing them. Data sets are either saved out as final cleaned versions of data or in intermediate steps. Both of these outputs can be used for analysis or as input to another part of the data processing pipeline.

2.6.1 pickle
Python has a way to pickle data. This is Python’s way of serializing and saving data in a binary format. Reading pickle data is also backwards compatible.

2.6.1.1 Series
Many of the export methods for a Series are also available for a DataFrame. Readers who have experience with numpy will know that a save method is available for ndarrays. The corresponding save method in Pandas has been deprecated, and the replacement is to use the to_pickle method.

In [2]:
import pandas as pd

In [3]:
scientists = pd.read_csv('data/scientists.csv')

In [4]:
print(scientists)

                   Name        Born        Died  Age          Occupation
0     Rosaline Franklin  1920-07-25  1958-04-16   37             Chemist
1        William Gosset  1876-06-13  1937-10-16   61        Statistician
2  Florence Nightingale  1820-05-12  1910-08-13   90               Nurse
3           Marie Curie  1867-11-07  1934-07-04   66             Chemist
4         Rachel Carson  1907-05-27  1964-04-14   56           Biologist
5             John Snow  1813-03-15  1858-06-16   45           Physician
6           Alan Turing  1912-06-23  1954-06-07   41  Computer Scientist
7          Johann Gauss  1777-04-30  1855-02-23   77       Mathematician


In [5]:
names = scientists['Name']

print(names)

0       Rosaline Franklin
1          William Gosset
2    Florence Nightingale
3             Marie Curie
4           Rachel Carson
5               John Snow
6             Alan Turing
7            Johann Gauss
Name: Name, dtype: object


In [6]:
# pass in a string to the path you want to save
names.to_pickle('output/scientists_names_series.pickle')

The pickle output is in a binary format. Thus, if you try to open it in a text editor, you will see a bunch of garbled characters.

If the object you are saving is an intermediate step in a set of calculations, or if you know that your data will stay in the Python world, saving it to a pickle is a good fit: the format is optimized for Python and for disk storage space. However, this approach means that people who do not use Python will not be able to read the data.

2.6.1.2 DataFrame

The same method can be used on DataFrame objects.

In [7]:
scientists.to_pickle('output/scientists_df.pickle')

In [8]:
# Read data from what we output
# for a Series

scientist_names_from_pickle = pd.read_pickle(
    'output/scientists_names_series.pickle')

print(scientist_names_from_pickle)

0       Rosaline Franklin
1          William Gosset
2    Florence Nightingale
3             Marie Curie
4           Rachel Carson
5               John Snow
6             Alan Turing
7            Johann Gauss
Name: Name, dtype: object


In [9]:
# for a DataFrame

scientists_from_pickle = pd.read_pickle(
    'output/scientists_df.pickle')

print(scientists_from_pickle)

                   Name        Born        Died  Age          Occupation
0     Rosaline Franklin  1920-07-25  1958-04-16   37             Chemist
1        William Gosset  1876-06-13  1937-10-16   61        Statistician
2  Florence Nightingale  1820-05-12  1910-08-13   90               Nurse
3           Marie Curie  1867-11-07  1934-07-04   66             Chemist
4         Rachel Carson  1907-05-27  1964-04-14   56           Biologist
5             John Snow  1813-03-15  1858-06-16   45           Physician
6           Alan Turing  1912-06-23  1954-06-07   41  Computer Scientist
7          Johann Gauss  1777-04-30  1855-02-23   77       Mathematician


Pickle files are conventionally saved with an extension of .p, .pkl, or .pickle.
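As a minimal sketch of what staying "in the Python world" buys you, the following round trip shows that column dtypes, including a datetime column, come back from a pickle as they were saved. The two-row DataFrame and the file name are hypothetical stand-ins for the scientists data, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row stand-in for the scientists data
df = pd.DataFrame({
    'Name': ['Rosaline Franklin', 'William Gosset'],
    'Born': pd.to_datetime(['1920-07-25', '1876-06-13']),
    'Age': [37, 61],
})

# Write to a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'scientists_small.pkl')
    df.to_pickle(path)
    df_from_pickle = pd.read_pickle(path)

# The dtypes of the loaded data match the dtypes that were saved
print(df_from_pickle.dtypes)
print(df_from_pickle.equals(df))
```

A CSV file, by contrast, stores only text, so a date column generally has to be parsed again when the file is read.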

2.6.2 CSV

Comma-separated values (CSV) are the most flexible data storage type. For each row, the column information is separated with a comma. The comma is not the only type of delimiter, however. Some files are delimited by a tab (TSV) or even a semicolon. The main reason why CSVs are a preferred data format when collaborating and sharing data is that any program can open this kind of data structure. It can even be opened in a text editor.

The Series and DataFrame have a to_csv method to write a CSV file. The documentation for Series¹¹ and DataFrame¹² identifies many different ways you can modify the resulting CSV file. For example, if you want to save a TSV file because there are commas in your data, you can change the sep parameter (Appendix O).

11. Saving a Series to a CSV file: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.to_csv.html

12. Saving a DataFrame to a CSV file: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html

In [10]:
# save a series into a CSV
names.to_csv('output/scientist_names_series.csv')

# save a dataframe into a TSV,
# a tab-separated value

scientists.to_csv('output/scientists_df.tsv', sep='\t')
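To make the effect of the sep parameter concrete, here is a small sketch using a hypothetical DataFrame in which one value contains a comma. It relies on to_csv returning the text as a string when no path is given, so no files are written.

```python
import pandas as pd

# Hypothetical data in which one value contains a comma
df = pd.DataFrame({
    'Name': ['Curie, Marie', 'John Snow'],
    'Occupation': ['Chemist', 'Physician'],
})

# With no path, to_csv returns the output as a string
csv_text = df.to_csv()
tsv_text = df.to_csv(sep='\t')

# In the comma-separated output the value has to be wrapped in quotes
print(csv_text)

# In the tab-separated output the comma is ordinary text
print(tsv_text)
```

In the comma-separated version the writer quotes the name so the embedded comma is not mistaken for a delimiter; in the tab-separated version no quoting is needed.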

2.6.2.1 Removing Row Numbers From Output

If you open the CSV or TSV file created, you will notice that the first “column” looks like the row number of the dataframe. Many times this is not needed, especially when you are collaborating with other people. Keep in mind that this “column” is really saving the “row label,” which may be important. The to_csv documentation describes an index parameter that controls whether the row names (the index) are written; setting it to False leaves them out.

In [11]:
# do not write the row names in the CSV output

scientists.to_csv('output/scientists_df_no_index.csv', index=False)
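The following sketch shows why the extra “column” matters when the file is read back. The small DataFrame is hypothetical, and the text is kept in memory with io.StringIO rather than written to disk.

```python
import io

import pandas as pd

# Hypothetical two-row example
df = pd.DataFrame({'Name': ['Marie Curie', 'John Snow'], 'Age': [66, 45]})

with_index = df.to_csv()                # row labels are written
without_index = df.to_csv(index=False)  # row labels are left out

# Compare the header rows of the two outputs
print(with_index.splitlines()[0])
print(without_index.splitlines()[0])

# Reading the text back shows which columns each version produces
cols_with = pd.read_csv(io.StringIO(with_index)).columns.tolist()
cols_without = pd.read_csv(io.StringIO(without_index)).columns.tolist()
print(cols_with)
print(cols_without)
```

When the row labels were written, the header starts with an empty field, and read_csv turns that field into an extra column named Unnamed: 0.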

2.6.2.2 Importing CSV Data
Importing CSV files was illustrated in Section 1.2. This operation uses the pd.read_csv function. In the documentation,¹³ you can see there are various ways to read in a CSV. Look at Appendix O if you need more information on using function parameters.

13. read_csv documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
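As an illustration of a few of those options, the sketch below reads a small piece of tab-separated text. The text itself is a hypothetical stand-in for a file on disk; sep, usecols, parse_dates, and index_col are parameters of pd.read_csv.

```python
import io

import pandas as pd

# Hypothetical tab-separated text standing in for a file on disk
tsv_text = (
    "Name\tBorn\tAge\n"
    "Marie Curie\t1867-11-07\t66\n"
    "John Snow\t1813-03-15\t45\n"
)

df = pd.read_csv(
    io.StringIO(tsv_text),
    sep='\t',                  # fields are separated by tabs
    usecols=['Name', 'Born'],  # read only these columns
    parse_dates=['Born'],      # convert this column to datetime values
    index_col='Name',          # use this column as the row labels
)

print(df)
print(df.dtypes)
```

Passing a path such as 'data/scientists.csv' in place of the in-memory text works the same way.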

2.6.3 Excel
Excel, which is probably the most commonly used data type (or the second most commonly used, after CSVs), has a bad reputation within the data science community, mainly because colors and other superfluous information can easily find their way into the data set, not to mention one-off calculations that ruin the rectangular structure of a data set. Other reasons are listed in Section 1.1. The goal of this book isn’t to bash Excel, but rather to teach you about a reasonable alternative tool for data analytics. In short, the more of your work you can do in a scripting language, the easier it will be to scale up to larger projects, catch and fix mistakes, and collaborate. However, Excel’s popularity and market share are unrivaled. If you absolutely have to work in Excel, it has its own scripting language, which will allow you to work with data in a more predictable and reproducible manner.

2.6.3.1 Series
The Series data structure does not have an explicit to_excel method. If you have a Series that needs to be exported to an Excel file, one option is to convert the Series into a one-column DataFrame.

In [12]:
# convert the Series into a DataFrame
# before saving it to an Excel file

names_df = names.to_frame()

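# xls file: the xlwt package needs to be installed.
# pandas 2.0 and later removed the xlwt engine, so this
# call works only on older versions of pandas
import xlwt

names_df.to_excel('output/scientists_names_series_df.xls')

# newer xlsx file: the openpyxl package needs to be installed
import openpyxl

names_df.to_excel('output/scientists_names_series_df.xlsx')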

2.6.3.2 DataFrame
From the preceding example, you can see how to export a DataFrame to an Excel file. The documentation¹⁴ shows several ways to further fine-tune the output. For example, you can output data to a specific “sheet” using the sheet_name parameter.

14. DataFrame to Excel documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_excel.html

In [13]:
# saving a DataFrame into Excel format

scientists.to_excel('output/scientists_df.xlsx',
                    sheet_name='scientists',
                    index=False)

2.6.4 Feather Format to Interface With R
The format called “feather” is used to save a binary object that can also be loaded into the R language. The main benefit of this approach is that it is faster than writing and reading a CSV file between the languages. The general rule of thumb is to use feather only as an intermediate data format, not for long-term storage. That is, use it in your code only to pass data into R; do not use it to save a final version of your data.

The feather formatter is installed via conda install -c conda-forge feather-format or pip install feather-format. You can use the to_feather method on a dataframe to save the feather object. Not every dataframe can be converted into a feather object. For example, our current data set contains a column of date values, which at the time of this writing is not supported by feather.¹⁵

15. Feather dates, ArrowNotImplementedError: https://github.com/wesm/feather/issues/121

2.6.5 Other Data Output Types
There are many ways Pandas can export and import data. Indeed, to_pickle, to_csv, to_excel, and to_feather are only some of the methods for exporting data from Pandas DataFrames. Table 2.4 lists some of the other export methods.

Table 2.4 DataFrame Export Methods

Export Method             Description

to_clipboard              Save data into the system clipboard for pasting

to_dense                  Convert data into a regular “dense” DataFrame

to_dict                   Convert data into a Python dict

to_gbq                    Convert data into a Google BigQuery table

to_hdf                    Save data into a hierarchical data format (HDF)

to_msgpack                Save data into a portable JSON-like binary

to_html                   Convert data into a HTML table

to_json                   Convert data into a JSON string

to_latex                  Convert data into a LaTeX tabular environment

to_records                Convert data into a record array

to_string                 Show DataFrame as a string for stdout

to_sparse                 Convert data into a SparseDataFrame

to_sql                    Save data into a SQL database

to_stata                  Convert data into a Stata dta file
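Two of the methods in Table 2.4 can be tried without any extra packages or files. The sketch below, which uses a hypothetical two-row DataFrame, converts the data with to_dict and to_json and then reads the JSON text back with pd.read_json. The orient='records' argument asks for one entry per row.

```python
import io

import pandas as pd

# Hypothetical two-row example
df = pd.DataFrame({'Name': ['Marie Curie', 'John Snow'], 'Age': [66, 45]})

# A list with one Python dict per row
records = df.to_dict(orient='records')
print(records)

# The same row-wise layout as a JSON string
json_text = df.to_json(orient='records')
print(json_text)

# Read the JSON text back into a DataFrame
df_from_json = pd.read_json(io.StringIO(json_text))
print(df_from_json)
```

Other values of orient, such as 'split' or 'index', arrange the same data differently; the documentation for each method lists the options.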

For more complicated and general data conversions (not necessarily just exporting data), the odo library¹⁶ has a consistent way to convert between data formats (Appendix T).

16. odo library: http://odo.readthedocs.org/en/latest/

2.7 CONCLUSION
This chapter went into a little more detail about how the Pandas Series and DataFrame objects work in Python. It showed some simpler examples of data cleaning, along with a few common ways to export data to share with others. Chapters 1 and 2 should give you a good foundation in how Pandas works as a library.

The next chapter covers the basics of plotting in Python and Pandas. Data visualization is used not only at the end of an analysis to plot results, but also heavily throughout the entire data pipeline.