# Reading Text Files in Pieces

When processing very large files, or when working out the right set of arguments to parse a large file correctly, you may want to read in only a small piece of the file or iterate through it in smaller chunks.

In [8]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [3]:
result = pd.read_csv('../../CSV Files/O_Reilly/ch06/ex6.csv')

In [4]:
result.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,9990,9991,9992,9993,9994,9995,9996,9997,9998,9999
one,0.467976,-0.358893,-0.50184,0.204886,0.354628,1.81748,-0.776764,-0.913135,0.35848,-1.740877,...,-0.370769,-0.40998,0.301214,1.821117,0.068804,2.311896,-0.479893,0.523331,-0.362559,-0.096376
two,-0.038649,1.404453,0.659254,1.074134,-0.133116,0.742273,0.935518,1.530624,-0.497572,-1.160417,...,0.404356,0.155627,-1.111203,0.416445,1.322759,-0.41707,-0.650419,0.787112,0.598894,-1.012999
three,-0.295344,0.704965,-0.421691,1.388361,0.283763,0.419395,-0.332872,-0.572657,-0.367016,-1.63783,...,-1.051628,-0.81899,0.668258,0.173874,0.802346,-1.409599,0.745152,0.486066,-1.843201,-0.657431
four,-1.824726,-0.200638,-0.057688,-0.982404,-0.837063,-2.251035,-1.875641,0.477252,0.507702,2.172201,...,-1.050899,1.27735,0.671922,0.505118,0.223618,-0.515821,-0.646038,1.093156,0.887292,-0.573315
key,L,B,G,R,Q,Q,U,K,S,G,...,8.0,W,A,X,H,L,E,K,G,0.0


To read only a small number of rows (rather than the entire file), specify the count with nrows:

In [5]:
pd.read_csv('../../CSV Files/O_Reilly/ch06/ex6.csv', nrows = 10)

Unnamed: 0,one,two,three,four,key
0,0.467976,-0.038649,-0.295344,-1.824726,L
1,-0.358893,1.404453,0.704965,-0.200638,B
2,-0.50184,0.659254,-0.421691,-0.057688,G
3,0.204886,1.074134,1.388361,-0.982404,R
4,0.354628,-0.133116,0.283763,-0.837063,Q
5,1.81748,0.742273,0.419395,-2.251035,Q
6,-0.776764,0.935518,-0.332872,-1.875641,U
7,-0.913135,1.530624,-0.572657,0.477252,K
8,0.35848,-0.497572,-0.367016,0.507702,S
9,-1.740877,-1.160417,-1.63783,2.172201,G
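
A short preview like this is also a convenient way to check column names and inferred dtypes before committing to a full parse. The following is a minimal sketch under stated assumptions: `csv_text` is a hypothetical in-memory stand-in for a large file, not ex6.csv itself.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for a large CSV file (not ex6.csv).
csv_text = "one,two,key\n" + "\n".join(
    f"{i},{i * 2},{'AB'[i % 2]}" for i in range(100)
)

# Parse only the first 5 data rows; the remaining rows are never loaded.
preview = pd.read_csv(io.StringIO(csv_text), nrows=5)

print(preview.shape)   # rows and columns actually read
print(preview.dtypes)  # dtypes inferred from the preview rows only
```

Dtypes inferred from a few rows may differ from those inferred from the whole file, so treat the preview as a hint rather than a guarantee.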


To read a file in pieces, specify chunksize as a number of rows:

In [7]:
pd.read_csv('../../CSV Files/O_Reilly/ch06/ex6.csv', chunksize= 1000)

<pandas.io.parsers.readers.TextFileReader at 0x12a945fde10>
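
No rows are parsed when the reader is created; data is read only as chunks are requested. As a minimal sketch, assuming a reasonably recent pandas (the context-manager form needs version 1.2 or later) and using a hypothetical in-memory `csv_text` in place of ex6.csv, chunks can be pulled one at a time with `get_chunk`:

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for a large CSV file (not ex6.csv).
csv_text = "one,key\n" + "\n".join(f"{i},{'AB'[i % 2]}" for i in range(10))

# Using the reader as a context manager closes the underlying file handle.
with pd.read_csv(io.StringIO(csv_text), chunksize=4) as reader:
    first = reader.get_chunk()   # next chunksize rows: here rows 0-3
    second = reader.get_chunk()  # the following chunk: rows 4-7

print(len(first), len(second))
```

Each chunk is an ordinary DataFrame, so anything that works on the full result also works on a piece.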

The TextFileReader object returned by read_csv lets you iterate over the file in pieces of chunksize rows. For example, we can iterate over ex6.csv, aggregating the value counts in the 'key' column like so:

In [18]:
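chunker = pd.read_csv('../../CSV Files/O_Reilly/ch06/ex6.csv', chunksize=1000)

# Start from an empty float Series; the explicit dtype avoids the empty-Series warning.
tot = Series([], dtype='float64')
for piece in chunker:
    # Add this chunk's counts to the running total, treating missing keys as 0.
    tot = tot.add(piece['key'].value_counts(), fill_value=0)

# Sort by key in descending order (Z first).
tot = tot.sort_index(ascending=False)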


In [19]:
tot[:10]

Z    288.0
Y    314.0
X    364.0
W    305.0
V    328.0
U    326.0
T    304.0
S    308.0
R    318.0
Q    340.0
dtype: float64
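
One way to gain confidence in a chunked aggregation is to compare it against a single-pass result on data small enough to load whole. The sketch below does that under stated assumptions: `csv_text` is a hypothetical in-memory file, and the chunk size of 6 is arbitrary.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in: 20 rows with keys cycling A, B, C, A.
csv_text = "key\n" + "\n".join("ABCA"[i % 4] for i in range(20))

# Chunked tally: accumulate per-chunk value counts into a running total.
total = pd.Series(dtype="float64")
with pd.read_csv(io.StringIO(csv_text), chunksize=6) as reader:
    for piece in reader:
        total = total.add(piece["key"].value_counts(), fill_value=0)

# Single-pass tally over the whole file, for comparison.
full = pd.read_csv(io.StringIO(csv_text))["key"].value_counts()

# The two tallies should agree key by key.
matches = all(total[k] == full[k] for k in full.index)
print(matches)
```

The chunked totals are floats because `add` with `fill_value` works on a float Series; cast with `astype(int)` if integer counts are needed.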