# Pandas DataFrame: Playing With CSV Files

In [8]:
# import pandas to access DataFrames
import pandas as pd

In [9]:
# define my_dict var to convert to a DataFrame
my_dict = {'name': ["a", "b", "c", "d", "e", "f", "g"],
           'age': [20, 27, 35, 55, 18, 21, 35],
           'designation': ["VP", "CEO", "CFO", "VP", "VP", "CEO", "MD"]}

In [10]:
# define df as your DataFrame with my_dict as its contents
df = pd.DataFrame(my_dict)

In [11]:
# the resulting table generated from the values in my_dict
df

Unnamed: 0,name,age,designation
0,a,20,VP
1,b,27,CEO
2,c,35,CFO
3,d,55,VP
4,e,18,VP
5,f,21,CEO
6,g,35,MD


## Persisting the DataFrame into a CSV file

In [12]:
# the .to_csv() method takes the filename as an argument and creates a .csv based on
# the DataFrame object that it is applied to.

# In our case, df is the DataFrame we created earlier.
df.to_csv('csv_example')

In [20]:
# To read the file we created, pandas provides the built-in function pd.read_csv(),
# which loads the file whose path is passed as its argument.
df_csv = pd.read_csv('csv_example')

In [21]:
df_csv

,Unnamed: 0,name,age,designation
0,0,a,20,VP
1,1,b,27,CEO
2,2,c,35,CFO
3,3,d,55,VP
4,4,e,18,VP
5,5,f,21,CEO
6,6,g,35,MD


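Notice that the frame now carries two indexes: the one that `.to_csv()` wrote into the file, which comes back as an ordinary column named `Unnamed: 0`, and a fresh one that pandas generates while loading the CSV file.

This can be avoided by passing `index=False` to `.to_csv()` when the file is written.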

In [22]:
# To do this, we need to recreate the file so that it excludes the automatically generated index.
df.to_csv('csv_example', index=False)

In [23]:
df_csv = pd.read_csv('csv_example')

In [24]:
df_csv

Unnamed: 0,name,age,designation
0,a,20,VP
1,b,27,CEO
2,c,35,CFO
3,d,55,VP
4,e,18,VP
5,f,21,CEO
6,g,35,MD
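
To see what `index=False` changes in the file itself, the sketch below inspects the text that `.to_csv()` produces. It relies on `.to_csv()` returning the CSV as a string when no path is given. The frame `small` and the use of `io.StringIO` in place of a file on disk are assumptions made for this illustration only. It also tries `index_col=0`, an alternative that keeps the saved index in the file and restores it on load.

```python
import io
import pandas as pd

# Illustrative data, smaller than the frame used above
small = pd.DataFrame({'name': ['a', 'b'], 'age': [20, 27]})

with_index = small.to_csv()                # index written as a first, unnamed column
without_index = small.to_csv(index=False)  # only the data columns are written

print(with_index.splitlines()[0])      # header line starts with a comma
print(without_index.splitlines()[0])   # header line holds just the column names

# Reading the first variant naively yields an extra 'Unnamed: 0' column ...
naive = pd.read_csv(io.StringIO(with_index))
# ... while index_col=0 turns that column back into the index.
restored = pd.read_csv(io.StringIO(with_index), index_col=0)

print(list(naive.columns))
print(list(restored.columns))
```

Which option fits depends on whether the index carries information worth keeping; for a plain 0..n-1 index, dropping it at write time is usually the simpler choice.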


## Column Header


By default, `header` is `'infer'`, which behaves like `header=0` when no `names` are passed, so the top row of the file is used as the header.

However, it is possible to use a different row, or even several rows, as the header.

In [30]:
# Defining multiple headers
df_csv = pd.read_csv('csv_example', header=[0,1,2])

In [31]:
df_csv

Unnamed: 0_level_0,name,age,designation
Unnamed: 0_level_1,a,20,VP
Unnamed: 0_level_2,b,27,CEO
0,c,35,CFO
1,d,55,VP
2,e,18,VP
3,f,21,CEO
4,g,35,MD


In [32]:
# Use a later row as the header; every row before it is skipped
df_csv = pd.read_csv('csv_example', header=[5])

In [33]:
df_csv

Unnamed: 0,e,18,VP
0,f,21,CEO
1,g,35,MD
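
A quick way to check what several header rows do to a frame is to compare shapes and column types. This is a minimal sketch on an in-memory CSV; `csv_text` is a made-up stand-in for the `csv_example` file, not the file itself.

```python
import io
import pandas as pd

# Hypothetical stand-in for the first lines of csv_example
csv_text = "name,age,designation\na,20,VP\nb,27,CEO\nc,35,CFO\n"

single = pd.read_csv(io.StringIO(csv_text))                  # first row is the header
stacked = pd.read_csv(io.StringIO(csv_text), header=[0, 1])  # first two rows are headers

print(single.shape)              # all three data rows remain
print(stacked.shape)             # one data row was consumed by the header
print(stacked.columns.nlevels)   # the columns now form a two-level MultiIndex
print(stacked.columns.tolist())
```

Every row listed in `header` stops being data, which is why the multi-header frame is shorter.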


## Customizing Column Names

Even when the CSV file already has column headers, we can still supply our own column names through the `names` argument.

In [34]:
df_csv = pd.read_csv('csv_example', names=['a','b','c'])

In [35]:
df_csv

Unnamed: 0,a,b,c
0,name,age,designation
1,a,20,VP
2,b,27,CEO
3,c,35,CFO
4,d,55,VP
5,e,18,VP
6,f,21,CEO
7,g,35,MD


Although we were successful in adding our own header, the top row still displays the undesired header.

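To replace this header, set the `header` argument to `0`. This tells pandas that the first row of the file is a header row, which is then discarded in favor of `names`. (Setting `header=1` would also drop the first data row.)

In [36]:
df_csv = pd.read_csv('csv_example', names=['a','b','c'], header=0)

In [37]:
df_csv

Unnamed: 0,a,b,c
0,a,20,VP
1,b,27,CEO
2,c,35,CFO
3,d,55,VP
4,e,18,VP
5,f,21,CEO
6,g,35,MD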
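
How `names` and `header` interact is easiest to verify by counting rows. The sketch below uses a three-row in-memory CSV (an assumption for illustration; `labels` is an arbitrary list of names) and reads it three ways.

```python
import io
import pandas as pd

# Hypothetical three-row file with a header line
csv_text = "name,age,designation\na,20,VP\nb,27,CEO\nc,35,CFO\n"
labels = ['x', 'y', 'z']

# names only: the file's header line is treated as data
names_only = pd.read_csv(io.StringIO(csv_text), names=labels)
# header=0: the file's header line is recognised and replaced by names
replaced = pd.read_csv(io.StringIO(csv_text), names=labels, header=0)
# header=1: lines 0 and 1 are both discarded, so a data row is lost
one_too_many = pd.read_csv(io.StringIO(csv_text), names=labels, header=1)

print(len(names_only), names_only['x'].tolist())
print(len(replaced), replaced['x'].tolist())
print(len(one_too_many), one_too_many['x'].tolist())
```

Only `header=0` keeps exactly the three data rows, so it is the value to use when the goal is to swap the header without touching the data.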


We can also write the CSV file without a header row, so that no `header` argument is needed when reading it back with `.read_csv()`.

In [38]:
df.to_csv('csv_example', index=False, header=False)

In [39]:
# The names must follow the order in which the columns were written to the file.
df_csv = pd.read_csv('csv_example', names=['NAME', 'AGE', 'DESIGNATION'])

In [41]:
df_csv

Unnamed: 0,NAME,AGE,DESIGNATION
0,a,20,VP
1,b,27,CEO
2,c,35,CFO
3,d,55,VP
4,e,18,VP
5,f,21,CEO
6,g,35,MD
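
Because a header-less file gives pandas nothing to match names against, the names are applied purely by position. One way to keep labels and data aligned is to derive the names from the original frame's columns. This is a sketch under that assumption; `people` is illustrative data, and the CSV text is kept in memory instead of being written to `csv_example`.

```python
import io
import pandas as pd

people = pd.DataFrame({'name': ['a', 'b', 'c'],
                       'age': [20, 27, 35],
                       'designation': ['VP', 'CEO', 'CFO']})

# Write without index and without header: the text holds data rows only
headerless = people.to_csv(index=False, header=False)

# Supply the names in the same order as the columns were written
restored = pd.read_csv(io.StringIO(headerless), names=list(people.columns))

# Deriving custom labels from the original ones keeps label and data aligned
upper = pd.read_csv(io.StringIO(headerless),
                    names=[c.upper() for c in people.columns])

print(restored)
print(list(upper.columns))
```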


## CSV to (Anything)-Separated Values

Although comma-separated values are the best known, `.read_csv()` can handle separators other than the comma. The only difference is that any other separator must be passed explicitly through `sep`, whereas the comma is used by default.

In [42]:
# This will recreate our `csv_example` file and separate the values by colons instead.
df.to_csv('csv_example', index=False, sep=':')

In [44]:
# To read a colon-separated file, pass the same `sep` to `.read_csv()`.
df_csv = pd.read_csv('csv_example', sep=':')

In [45]:
df_csv

Unnamed: 0,name,age,designation
0,a,20,VP
1,b,27,CEO
2,c,35,CFO
3,d,55,VP
4,e,18,VP
5,f,21,CEO
6,g,35,MD
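
What goes wrong when `sep` is forgotten on the reading side can be shown in a few lines. The following is a minimal sketch with made-up data held in memory; it does not touch `csv_example`.

```python
import io
import pandas as pd

people = pd.DataFrame({'name': ['a', 'b'], 'age': [20, 27]})
colon_text = people.to_csv(index=False, sep=':')

# Default separator (comma): each line is read as one single field
wrong = pd.read_csv(io.StringIO(colon_text))
# Matching separator: the columns are split as intended
right = pd.read_csv(io.StringIO(colon_text), sep=':')

print(wrong.shape, list(wrong.columns))
print(right.shape, list(right.columns))
```

No error is raised in the first case, so a frame with a single oddly named column is usually the sign of a separator mismatch.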


## Setting the Row Index

In [48]:
# .set_index() uses one of the DataFrame's columns as the row index.

# Setting the index this way happens after loading; it returns a new DataFrame
# and leaves df_csv itself unchanged.
df_csv.set_index('age')

age,name,designation
20,a,VP
27,b,CEO
35,c,CFO
55,d,VP
18,e,VP
21,f,CEO
35,g,MD
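
Since `.set_index()` returns a new object, the result has to be assigned to a name to be kept. A small sketch with illustrative data (`people` and `people_by_age` are names chosen for this example):

```python
import pandas as pd

people = pd.DataFrame({'name': ['a', 'b', 'c'], 'age': [20, 27, 35]})

# set_index returns a new DataFrame; `people` is not modified
people_by_age = people.set_index('age')

print(list(people.columns))         # 'age' is still an ordinary column here
print(list(people_by_age.columns))  # 'age' has moved into the index
print(people_by_age.index.name)
```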


In [51]:
# But it can also be set when the file is read.
df_csv = pd.read_csv('csv_example', sep=':', index_col=1)

In [52]:
df_csv

age,name,designation
20,a,VP
27,b,CEO
35,c,CFO
55,d,VP
18,e,VP
21,f,CEO
35,g,MD


In [54]:
# And as with headers, more than one `index_col` can be passed as a list.
df_csv = pd.read_csv('csv_example', sep=':', index_col=[1,2])

In [55]:
df_csv

age,designation,name
20,VP,a
27,CEO,b
35,CFO,c
55,VP,d
18,VP,e
21,CEO,f
35,MD,g
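
`index_col` also accepts column labels instead of positions, which can be easier to read. The sketch below assumes a small colon-separated text (`colon_text`, a made-up stand-in for the file) and shows a label-based index plus a lookup on the resulting two-level index.

```python
import io
import pandas as pd

# Hypothetical colon-separated content, held in memory
colon_text = "name:age:designation\na:20:VP\nb:27:CEO\nc:35:CFO\ng:35:MD\n"

by_age = pd.read_csv(io.StringIO(colon_text), sep=':', index_col='age')
by_both = pd.read_csv(io.StringIO(colon_text), sep=':',
                      index_col=['age', 'designation'])

print(by_age.index.name)
print(list(by_both.index.names))

# A full (age, designation) key selects exactly one row
print(by_both.loc[(35, 'CFO'), 'name'])
```

With two index levels, rows that share an age (here 35) remain distinguishable through their designation.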


## Specifying Output

### Specify the rows to output

In [56]:
# Load only 3 rows
df_csv = pd.read_csv('csv_example', sep=":", nrows=3)

In [57]:
df_csv

Unnamed: 0,name,age,designation
0,a,20,VP
1,b,27,CEO
2,c,35,CFO
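
`nrows` always counts from the first data row. To read a window from the middle of a file, it can be combined with `skiprows`. This is a sketch on in-memory data; `csv_text` is a made-up stand-in, and the skipped range is an arbitrary choice for the example.

```python
import io
import pandas as pd

# Hypothetical seven-row file with a header line
csv_text = "name,age\na,20\nb,27\nc,35\nd,55\ne,18\nf,21\ng,35\n"

# First three data rows
head = pd.read_csv(io.StringIO(csv_text), nrows=3)

# Skip file lines 1 and 2 (the first two data rows; line 0 is the header),
# then read the next three rows
window = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 3), nrows=3)

print(head['name'].tolist())
print(window['name'].tolist())
```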


### Skip Empty Lines

In [60]:
# By default, `.read_csv()` skips blank lines while loading the file and constructing the DataFrame.
# However, this behaviour can be disabled.
df_csv = pd.read_csv('csv_example', skip_blank_lines=False, sep=":")
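
The `csv_example` file written above contains no blank lines, so the effect is easier to see on a text that has one. A minimal sketch, assuming a made-up in-memory CSV with an empty line between two records:

```python
import io
import pandas as pd

# Hypothetical content with one blank line in the middle
csv_text = "name,age\na,20\n\nb,27\n"

skipped = pd.read_csv(io.StringIO(csv_text))                       # default behaviour
kept = pd.read_csv(io.StringIO(csv_text), skip_blank_lines=False)  # blank line kept

print(len(skipped))   # the blank line is ignored
print(len(kept))      # the blank line becomes a row of missing values
print(kept)
```

Keeping blank lines turns each of them into a row of `NaN` values, which also changes the `age` column from integers to floats.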