# Reading and Writing Data in Text Format
Python has become a beloved language for text and file munging due to its simple syntax for interacting with files, intuitive data structures, and convenient features like tuple packing and unpacking.

![Parsing functions in pandas](../../Pictures/Parsing%20functions%20in%20pandas.png)

I’ll give an overview of the mechanics of these functions, which are meant to convert text data into a DataFrame. The options for these functions fall into a few categories:
- Indexing: whether to treat one or more columns as the index of the returned DataFrame, and whether to get column names from the file, from the user, or not at all.
- Type inference and data conversion: includes user-defined value conversions and custom lists of missing value markers.
- Datetime parsing: includes the ability to combine date and time information spread over multiple columns into a single column in the result.
- Iterating: support for iterating over chunks of very large files.
- Unclean data issues: skipping rows or a footer, comments, or other minor things like numeric data with thousands separated by commas.
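
To make these categories concrete, here is a minimal sketch that exercises one option from each of them. The semicolon-separated data, the column names, and the 'missing' marker are all made up for illustration; io.StringIO stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated data standing in for a file on disk.
text = (
    "date;city;sales\n"
    "2024-01-01;Oslo;1,200\n"
    "2024-01-02;Oslo;missing\n"
    "2024-01-03;Rome;3,450\n"
)

df = pd.read_csv(
    io.StringIO(text),
    sep=";",
    index_col="date",       # indexing: use this column as the row index
    parse_dates=["date"],   # datetime parsing: convert it to timestamps
    na_values=["missing"],  # data conversion: extra missing-value marker
    thousands=",",          # unclean data: read "1,200" as 1200
)

# Iterating: read the same data in chunks of two rows at a time.
chunk_sizes = [
    len(chunk)
    for chunk in pd.read_csv(io.StringIO(text), sep=";", chunksize=2)
]
```

With chunksize set, read_csv returns an iterator of DataFrames instead of a single DataFrame, which is what makes very large files manageable.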

Type inference is one of the more important features of these functions; that means you don’t have to specify which columns are numeric, integer, boolean, or string. Handling dates and other custom types requires a bit more effort, though. Let’s start with a small comma-separated (CSV) text file:

In [4]:
import pandas as pd
import numpy as np
from pandas import Series, DataFrame

In [5]:
pd.read_csv('../../CSV Files/b1ch6.csv', header=0)

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


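You can also supply the column names yourself with names. Here header=0 tells the parser to discard the file's existing header row in favor of the supplied names:

In [6]:
pd.read_csv('../../CSV Files/b1ch6.csv', names=['a', 'b', 'c', 'd', 'message'], header=0)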

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo
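
The interaction between header and names is easier to see on inline data. The following is only a sketch: the strings below are hypothetical stand-ins for files with and without a header row.

```python
import io

import pandas as pd

names = ["a", "b", "c", "d", "message"]

# Hypothetical file contents without a header row.
no_header = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

# header=None: pandas assigns integer column labels 0..4.
auto = pd.read_csv(io.StringIO(no_header), header=None)

# names=...: supply the labels yourself; the first line is treated as data.
named = pd.read_csv(io.StringIO(no_header), names=names)

# Hypothetical file contents with a header row we want to replace.
with_header = "w,x,y,z,text\n" + no_header

# names plus header=0: the file's own header line is discarded.
replaced = pd.read_csv(io.StringIO(with_header), names=names, header=0)
```

Without header=0 in the last call, the old header line would be read as a row of data.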


Suppose you wanted the message column to be the index of the returned DataFrame. With the index_col argument you can identify that column either by position (index 4) or by name ('message'):

In [7]:
names=['a', 'b', 'c', 'd', 'message']

In [8]:
pd.read_csv('../../CSV Files/b1ch6.csv', header=0, names=names, index_col='message')

message,a,b,c,d
hello,1,2,3,4
world,5,6,7,8
foo,9,10,11,12
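
As a quick check that the two forms of index_col are interchangeable, here is a small sketch on inline data (the text below is a hypothetical stand-in for the CSV file):

```python
import io

import pandas as pd

# Hypothetical file contents with a header row.
text = "a,b,c,d,message\n1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

# Select the index column by name...
by_name = pd.read_csv(io.StringIO(text), index_col="message")

# ...or by zero-based position.
by_position = pd.read_csv(io.StringIO(text), index_col=4)

# Rows can now be looked up by their message label.
b_for_world = by_name.loc["world", "b"]
```

Both calls produce the same DataFrame, with the message column removed from the data and used as the row labels.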


In the event that you want to form a hierarchical index from multiple columns, just pass a list of column numbers or names:

In [9]:
arr = [['key1','key2','value1','value2'],
['one','a',1,2],
['one','b',3,4],
['one','c',5,6],
['one','d',7,8],
['two','a',9,10],
['two','b',11,12],
['two','c',13,14],
['two','d',15,16]]

In [10]:
# Mixed strings and integers are coerced to an array of strings
data = np.array(arr)

In [13]:
np.savetxt('../../CSV Files/b1ch62.csv', data, delimiter=',', fmt='%s')

In [17]:
parsed = pd.read_csv('../../CSV Files/b1ch62.csv', index_col=['key1', 'key2'])

parsed

key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16
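
Once the hierarchical index is in place, rows can be selected by one or both levels. A minimal sketch, using a shortened inline copy of the data as a hypothetical stand-in for the file:

```python
import io

import pandas as pd

# Hypothetical, shortened version of the key1/key2 data.
text = (
    "key1,key2,value1,value2\n"
    "one,a,1,2\n"
    "one,b,3,4\n"
    "two,a,9,10\n"
    "two,b,11,12\n"
)

parsed = pd.read_csv(io.StringIO(text), index_col=["key1", "key2"])

# Selecting on the outer level returns the rows for key1 == 'one',
# indexed by the remaining key2 level.
ones = parsed.loc["one"]

# A tuple addresses both levels at once.
cell = parsed.loc[("two", "b"), "value2"]
```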


In some cases, a table might not have a fixed delimiter, using whitespace or some other pattern to separate fields. In these cases, you can pass a regular expression as a delimiter for read_table. Consider a text file that looks like this:

In [22]:
l = [' A B C\n',
 'aaa -0.264438 -1.026059 -0.619500\n',
 'bbb 0.927272 0.302904 -0.032399\n',
 'ccc -0.264273 -0.386314 -0.217601\n',
 'ddd -0.871858 -0.348382 1.100491\n']

In [23]:
np.savetxt('../../CSV Files/b1ch63(list).txt', l, fmt = '%s')

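While you could do some munging by hand, the fields here are separated by a variable amount of whitespace. This can be expressed by the regular expression \s+, passed as a raw string so that the backslash is not treated as a string escape. Because the header row has one fewer name than the data rows have fields, read_table infers that the first column should be the index:

In [28]:
result = pd.read_table('../../CSV Files/b1ch63(list).txt', sep=r'\s+')

In [29]:
result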

Unnamed: 0,A,B,C
aaa,-0.264438,-1.026059,-0.6195
bbb,0.927272,0.302904,-0.032399
ccc,-0.264273,-0.386314,-0.217601
ddd,-0.871858,-0.348382,1.100491
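
The same behavior can be reproduced without touching the disk. This sketch assumes a two-row excerpt of the table above, held in a string rather than a file:

```python
import io

import pandas as pd

# Hypothetical excerpt: columns are padded with varying numbers of spaces.
text = (
    "            A         B         C\n"
    "aaa -0.264438 -1.026059 -0.619500\n"
    "bbb  0.927272  0.302904 -0.032399\n"
)

# r"\s+" matches any run of whitespace between fields.
result = pd.read_csv(io.StringIO(text), sep=r"\s+")

# The header has three names but each row has four fields,
# so the first field of each row becomes the index.
row_labels = list(result.index)
```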


The parser functions have many additional arguments to help you handle the wide variety of irregular file formats that occur. For example, you can skip the first, third, and fourth rows of a file with skiprows:

In [126]:
l =([['# hey!'],
['a b c d message'],
['# just wanted to make things more difficult for you'],
['# who reads CSV files with computers, anyway?'],
['1 2 3 4 hello'],
['5 6 7 8 world'],
['9 10 11 12 foo']])

np.savetxt('../../CSV Files/b1ch64.csv',l,  fmt= '%s')

In [124]:
pd.read_csv('../../CSV Files/b1ch64.csv', sep=r'\s+', skiprows=[0, 2, 3])

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo
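
When the unwanted rows all start with the same marker, the comment argument is an alternative to listing row numbers. The sketch below uses a comma-separated inline version of the file, which is an assumption made for illustration:

```python
import io

import pandas as pd

# Hypothetical comma-separated version of the file above.
text = (
    "# hey!\n"
    "a,b,c,d,message\n"
    "# just wanted to make things more difficult for you\n"
    "# who reads CSV files with computers, anyway?\n"
    "1,2,3,4,hello\n"
    "5,6,7,8,world\n"
    "9,10,11,12,foo\n"
)

# Skip rows by zero-based line number.
skipped = pd.read_csv(io.StringIO(text), skiprows=[0, 2, 3])

# Or ignore every line that starts with the comment character.
commented = pd.read_csv(io.StringIO(text), comment="#")
```

The comment form keeps working if more comment lines are added later, whereas skiprows depends on exact line positions.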


Handling missing values is an important and frequently nuanced part of the file parsing process. Missing data is usually either not present (empty string) or marked by some sentinel value. By default, pandas uses a set of commonly occurring sentinels, such as NA, -1.#IND, and NULL:

In [110]:
l = [['something a b c d message'],
['one 1 2 3 4'] ,
['two 5 6 8 NA world'],
['three 9 10 11 12 foo']]

In [111]:
np.savetxt('../../CSV Files/b1ch65.csv', l, fmt='%s')

In [130]:
res = pd.read_csv('../../CSV Files/b1ch65.csv', sep=r'\s+')

res

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3,4.0,
1,two,5,6,8,,world
2,three,9,10,11,12.0,foo


In [131]:
pd.isna(res)

Unnamed: 0,something,a,b,c,d,message
0,False,False,False,False,False,True
1,False,False,False,False,True,False
2,False,False,False,False,False,False
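
To see the default sentinels at work, and how to switch them off, here is a small sketch on hypothetical comma-separated data containing one empty field and one NA marker:

```python
import io

import pandas as pd

# Hypothetical data: one empty field (column c) and one NA marker (message).
text = (
    "something,a,b,c,d,message\n"
    "one,1,2,3,4,NA\n"
    "two,5,6,,8,world\n"
    "three,9,10,11,12,foo\n"
)

# Default behavior: empty strings and markers such as NA become NaN.
df = pd.read_csv(io.StringIO(text))
missing_count = int(df.isna().sum().sum())

# keep_default_na=False (with no na_values) disables the built-in
# sentinels, so 'NA' survives as an ordinary string.
strict = pd.read_csv(io.StringIO(text), keep_default_na=False)
strict_missing_count = int(strict.isna().sum().sum())
```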


The na_values option can take either a list or set of strings to treat as missing values, in addition to the default sentinels:

In [133]:
result = pd.read_csv('../../CSV Files/b1ch65.csv', sep=r'\s+', na_values=['NULL'])

result

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3,4.0,
1,two,5,6,8,,world
2,three,9,10,11,12.0,foo
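
Since NULL is already one of the default sentinels, a custom marker shows the effect of na_values more clearly. In this sketch the column names and the markers 'missing' and '-999' are hypothetical:

```python
import io

import pandas as pd

# Hypothetical data using two custom markers plus the default NULL.
text = "id,score\n1,10\n2,missing\n3,NULL\n4,-999\n"

# The listed strings are added to the default set of sentinels.
df = pd.read_csv(io.StringIO(text), na_values=["missing", "-999"])

is_missing = df["score"].isna().tolist()
```

Only the first row keeps a value; the custom markers and the default NULL all end up as NaN.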


Different NA sentinels can be specified for each column in a dict:

In [135]:
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}


sentinels

{'message': ['foo', 'NA'], 'something': ['two']}

In [137]:
result = pd.read_csv('../../CSV Files/b1ch65.csv', sep=r'\s+', na_values=sentinels)

result

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3,4.0,
1,,5,6,8,,world
2,three,9,10,11,12.0,
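
One detail worth checking is that a per-column sentinel applies only to the column it is listed under. A minimal sketch with hypothetical inline data, where the string 'two' appears in both columns:

```python
import io

import pandas as pd

# Hypothetical data: 'two' appears in both columns.
text = "something,a,message\none,1,foo\ntwo,2,world\nthree,3,two\n"

sentinels = {"message": ["foo", "NA"], "something": ["two"]}

df = pd.read_csv(io.StringIO(text), na_values=sentinels)

something_missing = df["something"].isna().tolist()
message_missing = df["message"].isna().tolist()
```

Here 'two' becomes NaN in the something column but is left alone in message, because it is not listed as a sentinel there.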


![read_csv/read_table function arguments](../../Pictures/read_csv%20or%20read_table%20function%20arguments.png)