In [2]:
import pandas as pd
from io import StringIO

### StringIO
The `StringIO` class (from the `io` module) provides an in-memory file-like object. This object can be used as input or output for most functions that expect a standard file object. When a StringIO object is created, it can be initialized by passing a string to the constructor. If no string is passed, the StringIO starts empty. In both cases, the initial cursor position is zero.
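
As a minimal sketch of the behaviour described above (the variable names are illustrative, not part of the original notebook), the following shows a StringIO initialized with a string, one created empty, and where the cursor sits in each case:

```python
from io import StringIO

# Initialized with a string: the cursor starts at position 0,
# so the first read begins at the start of the text.
buf = StringIO("col1,col2\na,1\n")
start = buf.tell()
first_line = buf.readline()

# Created empty, then written to like a file opened for writing.
out = StringIO()
out.write("x,y\n")
out.write("1,2\n")
contents = out.getvalue()  # whole buffer, regardless of cursor position

# After writing, the cursor is at the end; rewind before reading.
out.seek(0)
rewound = out.read()
```

Rewinding with `seek(0)` matters when a buffer that was just written is handed to a reader such as `pd.read_csv`, which reads from the current cursor position.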

In [3]:
data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"
pd.read_csv(StringIO(data))

Unnamed: 0,col1,col2,col3
0,a,b,1
1,a,b,2
2,c,d,3


In [8]:
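'''
skiprows [list-like, integer or callable, default None] Line numbers to skip (0-indexed)
or number of lines to skip (int) at the start of the file. If callable, the function is
evaluated against each line index and the line is skipped when it returns True.
Here every odd-numbered line is skipped.
'''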
pd.read_csv(StringIO(data), skiprows=lambda x: x % 2 != 0)

Unnamed: 0,col1,col2,col3
0,a,b,2
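
The three accepted forms of `skiprows` can be compared side by side. This is a small sketch that reuses the `data` string from above; the result names `skip_int`, `skip_list` and `skip_odd` are illustrative only:

```python
import pandas as pd
from io import StringIO

data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"

# Integer: skip that many lines from the top of the file,
# so the line "a,b,1" is then treated as the header.
skip_int = pd.read_csv(StringIO(data), skiprows=1)

# List-like: skip specific 0-indexed lines (here lines 1 and 3).
skip_list = pd.read_csv(StringIO(data), skiprows=[1, 3])

# Callable: called with each line index; the line is skipped
# when the function returns True (here, odd-numbered lines).
skip_odd = pd.read_csv(StringIO(data), skiprows=lambda x: x % 2 != 0)
```

Note that line 0 is the header itself, so an integer `skiprows` discards the original column names unless `header` or `names` is adjusted accordingly.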


A file may or may not have a header row. pandas assumes the first row should be used as the column names:

In [9]:
pd.read_csv(StringIO(data))

Unnamed: 0,col1,col2,col3
0,a,b,1
1,a,b,2
2,c,d,3


By specifying the names argument in conjunction with header you can indicate other names to use and whether or not to throw away the header row (if any):

In [6]:
pd.read_csv(StringIO(data), names=["foo", "bar", "baz"], header=0)

Unnamed: 0,foo,bar,baz
0,a,b,1
1,a,b,2
2,c,d,3


In [7]:
pd.read_csv(StringIO(data), names=["foo", "bar", "baz"], header=None)

Unnamed: 0,foo,bar,baz
0,col1,col2,col3
1,a,b,1
2,a,b,2
3,c,d,3
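
For a file that has no header row at all, `header=None` keeps every line as data. The sketch below uses a hypothetical headerless string, `no_header`, which is not part of the original notebook:

```python
import pandas as pd
from io import StringIO

# Hypothetical input with no header row.
no_header = "a,b,1\na,b,2\nc,d,3"

# header=None without names: pandas labels the columns 0, 1, 2.
auto = pd.read_csv(StringIO(no_header), header=None)

# header=None with names: every line is kept as data under the given names.
named = pd.read_csv(StringIO(no_header), header=None, names=["foo", "bar", "baz"])
```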


If the file or header contains duplicate names, pandas will by default distinguish between them so as to prevent overwriting data:

In [10]:
data_2 = "a,b,a\n0,1,2\n3,4,5"
pd.read_csv(StringIO(data_2))

Unnamed: 0,a,b,a.1
0,0,1,2
1,3,4,5
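
A quick check, sketched here with illustrative variable names, confirms that both copies of the duplicated column survive and can be selected separately:

```python
import pandas as pd
from io import StringIO

data_2 = "a,b,a\n0,1,2\n3,4,5"
df = pd.read_csv(StringIO(data_2))

# The second "a" is renamed to "a.1", so neither column overwrites the other.
cols = list(df.columns)
first_a = df["a"].tolist()
second_a = df["a.1"].tolist()
```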


In [12]:
'''
The duplicate column is not overwritten because mangle_dupe_cols=True by default, which renames a series of
duplicate columns 'X', ..., 'X' to 'X', 'X.1', ..., 'X.N'.
With mangle_dupe_cols=False, later duplicate columns could overwrite earlier ones. To prevent users from
encountering this problem, a ValueError exception is raised if mangle_dupe_cols != True:
'''
pd.read_csv(StringIO(data_2), mangle_dupe_cols=False)

ValueError: Setting mangle_dupe_cols=False is not supported yet

In [13]:
'''
The usecols argument allows you to select any subset of the columns in a file, either using the column names,
position numbers or a callable:
'''
pd.read_csv(StringIO(data), usecols=['col1','col3'])

Unnamed: 0,col1,col3
0,a,1
1,a,2
2,c,3
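
The cell above selects columns by name. The other two forms mentioned, position numbers and a callable, might look like the following sketch (the result names are illustrative):

```python
import pandas as pd
from io import StringIO

data = "col1,col2,col3\na,b,1\na,b,2\nc,d,3"

# Position numbers: 0-indexed column positions in the file.
by_position = pd.read_csv(StringIO(data), usecols=[0, 2])

# Callable: called with each column name; the column is kept when it returns True.
by_callable = pd.read_csv(StringIO(data), usecols=lambda name: name != "col2")
```

Both calls should select the same two columns as `usecols=['col1','col3']`.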


In [14]:
'''
If the comment parameter is specified, then completely commented lines will be ignored. 
By default, completely blank lines will be ignored as well.
'''
data_3 = "a,b,c\n\n1,2,3\n\n\n4,5,6"
pd.read_csv(StringIO(data_3), skip_blank_lines=False)

Unnamed: 0,a,b,c
0,,,
1,1.0,2.0,3.0
2,,,
3,,,
4,4.0,5.0,6.0


In [15]:
pd.read_csv(StringIO(data_3), skip_blank_lines=True)

Unnamed: 0,a,b,c
0,1,2,3
1,4,5,6
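
The cells above exercise `skip_blank_lines` but not the `comment` parameter itself. As a sketch, assuming a hypothetical input `data_4` in which whole lines start with `#`:

```python
import pandas as pd
from io import StringIO

# Hypothetical input: two fully commented lines and one blank line.
data_4 = "a,b,c\n#first comment\n1,2,3\n\n#second comment\n4,5,6"

# Lines starting with the comment character are ignored entirely,
# and the blank line is dropped because skip_blank_lines defaults to True.
commented = pd.read_csv(StringIO(data_4), comment="#")
```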
