# Chapter 6

In [12]:
import pandas as pd

# 6  Data Loading, Storage, and File Formats

#### Reading data and making it accessible is often called data loading.
#### The term parsing is also sometimes used to describe loading text data and interpreting it as tables and different data types.

# 6.1 Reading and Writing Data in Text Format

#### Table 6.1: Text and binary data loading functions in pandas

| Function | Description |
| --- | --- |
| read_csv | Load delimited data from a file, URL, or file-like object; use comma as default delimiter |
| read_fwf | Read data in fixed-width column format (i.e., no delimiters) |
| read_clipboard | Variation of read_csv that reads data from the clipboard; useful for converting tables from web pages |
| read_excel | Read tabular data from an Excel XLS or XLSX file |
| read_hdf | Read HDF5 files written by pandas |
| read_html | Read all tables found in the given HTML document |
| read_json | Read data from a JSON (JavaScript Object Notation) string representation, file, URL, or file-like object |
| read_feather | Read the Feather binary file format |
| read_orc | Read the Apache ORC binary file format |
| read_parquet | Read the Apache Parquet binary file format |
| read_pickle | Read an object stored by pandas using the Python pickle format |
| read_sas | Read a SAS dataset stored in one of the SAS system's custom storage formats |
| read_spss | Read a data file created by SPSS |
| read_sql | Read the results of a SQL query (using SQLAlchemy) |
| read_sql_table | Read a whole SQL table (using SQLAlchemy); equivalent to using a query that selects everything in that table using read_sql |
| read_stata | Read a dataset from Stata file format |
| read_xml | Read a table of data from an XML file |

## Indexing

#### Can treat one or more columns as the index of the returned DataFrame, and controls whether to get column names from the file, from arguments you provide, or not at all.

## Type inference and data conversion

#### Includes user-defined value conversions and custom lists of missing value markers.
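
A minimal sketch of both ideas, assuming a small in-memory CSV in place of a real file. The column names, the `parse_price` helper, and the `"missing"` marker are hypothetical and chosen only for illustration.

```python
import io

import pandas as pd

# Hypothetical inline data standing in for a file on disk.
raw = io.StringIO(
    "id,price,status\n"
    "1,$10.50,ok\n"
    "2,$3.00,missing\n"
    "3,$7.25,ok\n"
)


def parse_price(text):
    """Strip a leading dollar sign and convert the rest to float."""
    return float(text.lstrip("$"))


converted = pd.read_csv(
    raw,
    converters={"price": parse_price},  # user-defined value conversion
    na_values={"status": ["missing"]},  # extra missing-value marker for one column
)
```

Here `converters` receives each raw string from the `price` column, while `na_values` marks `"missing"` as a null only in the `status` column.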

## Date and time parsing

#### Includes combining date and time information spread over multiple columns into a single column in the result.
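
As a rough illustration, the sketch below combines a date column and a time column into one datetime column. It deliberately does the combining after reading, with `pandas.to_datetime`, rather than inside `pandas.read_csv`, since support for combining columns at read time varies between pandas versions. The inline data and column names are assumptions.

```python
import io

import pandas as pd

# Hypothetical data with the date and the time in separate columns.
raw = io.StringIO(
    "date,time,value\n"
    "2012-06-07,09:30:00,1.5\n"
    "2012-06-08,14:00:00,2.5\n"
)

# Read the date and time parts as plain strings first.
frame = pd.read_csv(raw, dtype={"date": str, "time": str})

# Join the two string columns and parse the result into a single datetime column.
frame["timestamp"] = pd.to_datetime(frame["date"] + " " + frame["time"])
frame = frame.drop(columns=["date", "time"])
```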

## Iterating

#### Support for iterating over chunks of very large files.

## Unclean data issues

#### Includes skipping rows or a footer, comments, or other minor things like numeric data with thousands separated by commas.
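
The sketch below shows how a few of these options might be combined on a small, made-up export. The data is hypothetical. `comment`, `thousands`, and `skipfooter` are `pandas.read_csv` arguments, and `skipfooter` requires the slower `"python"` engine.

```python
import io

import pandas as pd

# Hypothetical messy export: a comment line, quoted numbers with
# thousands separators, and a trailing summary line.
raw = io.StringIO(
    "# exported by a hypothetical reporting tool\n"
    "region,sales\n"
    'north,"1,250"\n'
    'south,"12,000"\n'
    "total rows: 2\n"
)

cleaned = pd.read_csv(
    raw,
    comment="#",      # ignore text after "#" (here, the whole first line)
    thousands=",",    # treat "," inside numbers as a thousands separator
    skipfooter=1,     # ignore the trailing summary line
    engine="python",  # skipfooter is not supported by the default C engine
)
```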

#### Some of these functions perform type inference, because the column data types are not part of the data format. That means you don’t necessarily have to specify which columns are numeric, integer, Boolean, or string. Other data formats, like HDF5, ORC, and Parquet, have the data type information embedded in the format.
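
A small sketch of type inference on made-up data: the same text is read once with inferred types and once with an explicit `dtype` for one column. The column names are assumptions, and the exact dtype names reported can differ between pandas versions.

```python
import io

import pandas as pd

csv_text = (
    "count,ratio,flag,label\n"
    "1,0.5,True,x\n"
    "2,1.5,False,y\n"
)

# No types are given, so pandas infers them from the text:
# count becomes an integer column, ratio a float column,
# flag a Boolean column, and label a string (object) column.
inferred = pd.read_csv(io.StringIO(csv_text))

# Inference can be overridden per column with an explicit dtype.
explicit = pd.read_csv(io.StringIO(csv_text), dtype={"count": "float64"})
```

Inspecting `inferred.dtypes` and `explicit.dtypes` shows the difference for the `count` column.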

In [14]:
# CSV text file
!cat examples/ex1.csv

# Since this is comma-delimited, we can then use pandas.read_csv to read it into a DataFrame:
df = pd.read_csv("examples/ex1.csv")
df

a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo

In [15]:
# A file will not always have a header row. Consider this file:
!cat examples/ex2.csv

# To read this file, you have a couple of options. You can allow pandas to assign default column names, or you can specify names yourself:
pd.read_csv("examples/ex2.csv", header=None)

1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

Unnamed: 0,0,1,2,3,4
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo

In [16]:
pd.read_csv("examples/ex2.csv", names=["a", "b", "c", "d", "message"])

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [17]:
# Suppose you wanted the message column to be the index of the returned DataFrame. You can either indicate you want the column at index 4 or named "message" using the index_col argument:
names = ["a", "b", "c", "d", "message"]
pd.read_csv("examples/ex2.csv", names=names, index_col="message")

message,a,b,c,d
hello,1,2,3,4
world,5,6,7,8
foo,9,10,11,12


In [21]:
# If you want to form a hierarchical index from multiple columns, pass a list of column numbers or names:
!cat examples/csv_mindex.csv

parsed = pd.read_csv("examples/csv_mindex.csv",
                     index_col=["key1", "key2"])
parsed

key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16


key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16


In [22]:
# A table might not have a fixed delimiter, using whitespace or some other pattern to separate fields. Consider a text file that looks like this:
!cat examples/ex3.txt


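# The fields here are separated by a variable amount of whitespace. In such cases you can pass a regular expression as the delimiter for pandas.read_csv; the regular expression \s+ matches one or more whitespace characters:
result = pd.read_csv("examples/ex3.txt", sep=r"\s+")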
result

            A         B         C
aaa -0.264438 -1.026059 -0.619500
bbb  0.927272  0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382  1.100491


Unnamed: 0,A,B,C
aaa,-0.264438,-1.026059,-0.6195
bbb,0.927272,0.302904,-0.032399
ccc,-0.264273,-0.386314,-0.217601
ddd,-0.871858,-0.348382,1.100491


In [23]:
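# The file parsing functions have many additional arguments to help you handle the wide variety of exceptional file formats that occur (see Table 6.2). For example, you can skip the first, third, and fourth rows of a file with skiprows: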
!cat examples/ex4.csv
pd.read_csv("examples/ex4.csv", skiprows=[0, 2, 3])

# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [24]:
# Handling missing values is an important and frequently nuanced part of the file reading process.
# Missing data is usually either not present (empty string) or marked by some sentinel (placeholder) value.
!cat examples/ex5.csv
result = pd.read_csv("examples/ex5.csv")
result

something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


In [25]:
# Recall that pandas outputs missing values as NaN, so we have two null or missing values in result:
pd.isna(result)

Unnamed: 0,something,a,b,c,d,message
0,False,False,False,False,False,True
1,False,False,False,True,False,False
2,False,False,False,False,False,False


In [26]:
# The na_values option accepts a sequence of strings to add to the default list of strings recognized as missing:
result = pd.read_csv("examples/ex5.csv", na_values=["NULL"])
result

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


In [27]:
# pandas.read_csv has a list of many default NA value representations, but these defaults can be disabled with the keep_default_na option:
result2 = pd.read_csv("examples/ex5.csv", keep_default_na=False)
result2

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3,4,NA
1,two,5,6,,8,world
2,three,9,10,11,12,foo


In [28]:
result2.isna()

Unnamed: 0,something,a,b,c,d,message
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False


In [30]:
result3 = pd.read_csv("examples/ex5.csv", keep_default_na=False,
                      na_values=["NA"])
result3

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3,4,
1,two,5,6,,8,world
2,three,9,10,11,12,foo


In [31]:
result3.isna()

Unnamed: 0,something,a,b,c,d,message
0,False,False,False,False,False,True
1,False,False,False,False,False,False
2,False,False,False,False,False,False


In [32]:
# Different NA sentinels can be specified for each column in a dictionary:
sentinels = {"message": ["foo", "NA"], "something": ["two"]}
pd.read_csv("examples/ex5.csv", na_values=sentinels,
            keep_default_na=False)

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3,4,
1,,5,6,,8,world
2,three,9,10,11,12,


#### Table 6.2: Some pandas.read_csv function arguments

| Argument | Description |
| --- | --- |
| path | String indicating filesystem location, URL, or file-like object. |
| sep or delimiter | Character sequence or regular expression to use to split fields in each row. |
| header | Row number to use as column names; defaults to 0 (first row), but should be None if there is no header row. |
| index_col | Column numbers or names to use as the row index in the result; can be a single name/number or a list of them for a hierarchical index. |
| names | List of column names for result. |
| skiprows | Number of rows at beginning of file to ignore or list of row numbers (starting from 0) to skip. |
| na_values | Sequence of values to replace with NA. They are added to the default list unless keep_default_na=False is passed. |
| keep_default_na | Whether to use the default NA value list or not (True by default). |
| comment | Character(s) to split comments off the end of lines. |
| parse_dates | Attempt to parse data to datetime; False by default. If True, will attempt to parse all columns. Otherwise, can specify a list of column numbers or names to parse. If element of list is tuple or list, will combine multiple columns together and parse to date (e.g., if date/time split across two columns). |
| keep_date_col | If joining columns to parse date, keep the joined columns; False by default. |
| converters | Dictionary containing column number or name mapping to functions (e.g., {"foo": f} would apply the function f to all values in the "foo" column). |
| dayfirst | When parsing potentially ambiguous dates, treat as international format (e.g., 7/6/2012 -> June 7, 2012); False by default. |
| date_parser | Function to use to parse dates. |
| nrows | Number of rows to read from beginning of file (not counting the header). |
| iterator | Return a TextFileReader object for reading the file piecemeal. This object can also be used with the with statement. |
| chunksize | For iteration, size of file chunks. |
| skipfooter | Number of lines to ignore at end of file. |
| verbose | Print various parsing information, like the time spent in each stage of the file conversion and memory use information. |
| encoding | Text encoding (e.g., "utf-8" for UTF-8 encoded text). Defaults to "utf-8" if None. |
| squeeze | If the parsed data contains only one column, return a Series. |
| thousands | Separator for thousands (e.g., "," or "."); default is None. |
| decimal | Decimal separator in numbers (e.g., "." or ","); default is ".". |
| engine | CSV parsing and conversion engine to use; can be one of "c", "python", or "pyarrow". The default is "c", though the newer "pyarrow" engine can parse some files much faster. The "python" engine is slower but supports some features that the other engines do not. |

## Reading Text Files in Pieces

In [34]:
pd.options.display.max_rows = 10
result = pd.read_csv("examples/ex6.csv")
result

Unnamed: 0,one,two,three,four,key
0,0.467976,-0.038649,-0.295344,-1.824726,L
1,-0.358893,1.404453,0.704965,-0.200638,B
2,-0.501840,0.659254,-0.421691,-0.057688,G
3,0.204886,1.074134,1.388361,-0.982404,R
4,0.354628,-0.133116,0.283763,-0.837063,Q
...,...,...,...,...,...
9995,2.311896,-0.417070,-1.409599,-0.515821,L
9996,-0.479893,-0.650419,0.745152,-0.646038,E
9997,0.523331,0.787112,0.486066,1.093156,K
9998,-0.362559,0.598894,-1.843201,0.887292,G


In [35]:
# To read a file in pieces, specify a chunksize as a number of rows:
chunker = pd.read_csv("examples/ex6.csv", chunksize=1000)
type(chunker)

pandas.io.parsers.readers.TextFileReader

In [None]:
# The TextFileReader object returned by pandas.read_csv allows you to iterate over the parts of the file according to the chunksize.
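
One way this iteration might look is sketched below: each chunk contributes value counts for a `key` column to a running total. A tiny in-memory CSV stands in for examples/ex6.csv so the snippet is self-contained. The data, the chunk size of 3, and the `totals` name are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for examples/ex6.csv: a single "key" column of letters.
keys = ["A", "B", "A", "C", "B", "A", "C", "A"]
raw = io.StringIO("key\n" + "\n".join(keys) + "\n")

totals = pd.Series(dtype="float64")

# The TextFileReader can be used as a context manager and iterated over;
# each piece is a DataFrame with at most `chunksize` rows.
with pd.read_csv(raw, chunksize=3) as chunker:
    for piece in chunker:
        # Count keys within this chunk and add them to the running totals.
        totals = totals.add(piece["key"].value_counts(), fill_value=0)

# Most frequent keys first.
totals = totals.sort_values(ascending=False)
```

Because only one chunk is held in memory at a time, the same pattern applies to files too large to load in a single call.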
