A pandas DataFrame is an in-memory representation of Excel-like tabular data. Let's import the package and try it out.

In [1]:
import pandas as pd

Data can be represented as a Python dictionary:

In [2]:
my_dict = {'name': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
           'age': [20, 27, 35, 55, 18, 21, 35],
           'designation': ['VP', 'CEO', 'CFO', 'VP', 'VP', 'CEO', 'MD']}


We can create a DataFrame from a Python dict with the **pd.DataFrame(...)** constructor:

In [3]:
df = pd.DataFrame(my_dict)

In [4]:
df

Unnamed: 0,name,age,designation
0,a,20,VP
1,b,27,CEO
2,c,35,CFO
3,d,55,VP
4,e,18,VP
5,f,21,CEO
6,g,35,MD


## Persisting the DataFrame into a CSV
Once a DataFrame has been created, it can be persisted (written) to a .csv file on the local disk with the **to_csv(...)** method:

In [5]:
df.to_csv('csv_example')

Just as we can persist the DataFrame in a CSV file, we can also load the DataFrame from a CSV file.

In [6]:
df_csv = pd.read_csv('csv_example')

In [7]:
df_csv

Unnamed: 0.1,Unnamed: 0,name,age,designation
0,0,a,20,VP
1,1,b,27,CEO
2,2,c,35,CFO
3,3,d,55,VP
4,4,e,18,VP
5,5,f,21,CEO
6,6,g,35,MD


The problem here is that the index now appears twice. The **Unnamed: 0** column is the old index that **to_csv(...)** wrote to the file (its header cell is empty, hence the name), while the leftmost index is generated automatically by pandas when the file is loaded into a DataFrame.
<br>
<br>
This problem can be avoided by writing our CSV files without the index, because **read_csv(...)** will generate one anyway. We do this by passing **index=False** to the **to_csv(...)** method.

In [8]:
df.to_csv('csv_example', index=False)

Now, if we read the file as:

In [9]:
df_csv = pd.read_csv('csv_example')

In [10]:
df_csv

Unnamed: 0,name,age,designation
0,a,20,VP
1,b,27,CEO
2,c,35,CFO
3,d,55,VP
4,e,18,VP
5,f,21,CEO
6,g,35,MD


Much better! Now the output matches our original DataFrame.
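
The round trip can also be checked without touching the disk. The sketch below is only an illustration: it uses a smaller, made-up DataFrame and an in-memory `io.StringIO` buffer in place of the `csv_example` file. It also shows `index_col=0`, an alternative way to recover a saved index when the file was already written with one.

```python
import io
import pandas as pd

# Small stand-in for the DataFrame used above (illustrative data).
df = pd.DataFrame({'name': ['a', 'b', 'c'],
                   'age': [20, 27, 35],
                   'designation': ['VP', 'CEO', 'CFO']})

# Written with the index (the default): the saved index comes back
# as an extra column named 'Unnamed: 0'.
with_index = pd.read_csv(io.StringIO(df.to_csv()))
print(list(with_index.columns))

# Written without the index: the columns round-trip unchanged.
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))
print(list(without_index.columns))

# If the file already contains the index, index_col=0 reads it back
# as the row index instead of as a data column.
restored = pd.read_csv(io.StringIO(df.to_csv()), index_col=0)
print(list(restored.columns))
```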

## Column Header
As we have seen, the first row is treated as the column header by default. This is controlled by the **header** parameter of **read_csv(...)**, which accepts a row number (**header=*int***) or, for a header spanning several rows, a list of row numbers.
<br>
<br> By default, the value is 0, meaning the top row (index position 0) will be considered the header.

In [11]:
df_csv = pd.read_csv('csv_example', header = 0)
df_csv

Unnamed: 0,name,age,designation
0,a,20,VP
1,b,27,CEO
2,c,35,CFO
3,d,55,VP
4,e,18,VP
5,f,21,CEO
6,g,35,MD


Here is an example of a DataFrame imported with more than one header row specified:
<br>
<br> *Note: the parameter value is a list of row numbers.*

In [14]:
df_csv = pd.read_csv('csv_example', header=[0,1,2])
df_csv

Unnamed: 0_level_0,name,age,designation
Unnamed: 0_level_1,a,20,VP
Unnamed: 0_level_2,b,27,CEO
0,c,35,CFO
1,d,55,VP
2,e,18,VP
3,f,21,CEO
4,g,35,MD


The header does not have to be the first row of the file. We can skip the first few rows and start reading the table from a specific row:

In [16]:
df_csv = pd.read_csv('csv_example', header=5)
df_csv

Unnamed: 0,e,18,VP
0,f,21,CEO
1,g,35,MD


The one drawback: the rows before the header row are discarded and cannot be part of this DataFrame.
<br>
<br> Even with a multi-row header, the data starts only after the last header row, and any rows between the listed header rows are skipped. In the example below, 'a', 'b' and 'e' become header rows, 'c' and 'd' are lost because they come before the final header row ('e'), and only 'f' and 'g' remain as values of the DataFrame:

In [19]:
df_csv = pd.read_csv('csv_example', header=[1,2,5])
df_csv

Unnamed: 0_level_0,a,20,VP
Unnamed: 0_level_1,b,27,CEO
Unnamed: 0_level_2,e,18,VP
0,f,21,CEO
1,g,35,MD


## Customizing Column Names
We can assign our own column names when reading a CSV file into a DataFrame. The parameter is called **names**, and, like a multi-row header, it takes a list:

In [20]:
df_csv = pd.read_csv('csv_example', names=['a','b','c'])
df_csv

Unnamed: 0,a,b,c
0,name,age,designation
1,a,20,VP
2,b,27,CEO
3,c,35,CFO
4,d,55,VP
5,e,18,VP
6,f,21,CEO
7,g,35,MD


Our custom names are applied, but the file's original header line now shows up as the first data row, which is not what we want.
<br>
<br> We can avoid this by also passing the **header** parameter to **read_csv(...)**. In this case we know that the first row, i.e. row 0, is the header, so we pass **header=0**: that row is consumed as the header and then replaced by our names. (Passing **header=1** instead would also discard the first data row, 'a'.)

In [22]:
df_csv = pd.read_csv('csv_example', names=['a', 'b', 'c'], header=0)
df_csv

Unnamed: 0,a,b,c
0,a,20,VP
1,b,27,CEO
2,c,35,CFO
3,d,55,VP
4,e,18,VP
5,f,21,CEO
6,g,35,MD


This output is what we are looking for with a customized header.
<br>
<br> Another way to achieve this is to skip the header entirely when writing the CSV file initially:

In [23]:
df.to_csv('csv_example', index=False, header=False)

Now, while reading the file, we can read it without having to skip the header:

In [24]:
# names must follow the column order in the file: name, age, designation
df_csv = pd.read_csv('csv_example', names=['NAME', 'AGE', 'DESIGNATION'])
df_csv

Unnamed: 0,NAME,AGE,DESIGNATION
0,a,20,VP
1,b,27,CEO
2,c,35,CFO
3,d,55,VP
4,e,18,VP
5,f,21,CEO
6,g,35,MD
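
The interaction between **names** and **header** can be compared side by side. This is a minimal sketch on a two-row in-memory sample (not the `csv_example` file), with `new_names` as an illustrative variable:

```python
import io
import pandas as pd

# Two-row sample with its own header line (illustrative data).
csv_text = "name,age,designation\na,20,VP\nb,27,CEO\n"
new_names = ['NAME', 'AGE', 'DESIGNATION']

# names only: the file's own header line is read as a data row.
header_kept = pd.read_csv(io.StringIO(csv_text), names=new_names)

# names plus header=0: row 0 is consumed as the header, then replaced.
header_replaced = pd.read_csv(io.StringIO(csv_text), names=new_names, header=0)

print(len(header_kept))      # includes the old header line as a row
print(len(header_replaced))  # data rows only
```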


## CSV to (Anything) Separated Value
**read_csv(...)** can also read files whose values are separated by something other than a comma. The only caveat is that the separator must be passed explicitly to the function.
<br>
<br> We will create a file that uses a colon as the separator:

In [25]:
df.to_csv('csv_example', index=False, sep=':')

This creates a file where a colon (**':'**) separates the values instead of a comma. We can read the file as:

In [26]:
df_csv = pd.read_csv('csv_example', sep=':')
df_csv

Unnamed: 0,name,age,designation
0,a,20,VP
1,b,27,CEO
2,c,35,CFO
3,d,55,VP
4,e,18,VP
5,f,21,CEO
6,g,35,MD
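
To see why the separator has to be passed explicitly, the sketch below reads the same colon-separated text with and without **sep**. It uses a short in-memory sample rather than the file written above:

```python
import io
import pandas as pd

# Colon-separated sample (illustrative data).
colon_text = "name:age:designation\na:20:VP\nb:27:CEO\n"

# Without sep, the default comma delimiter finds nothing to split on,
# so every line ends up in a single column.
wrong = pd.read_csv(io.StringIO(colon_text))
print(wrong.shape)

# With sep=':' the three columns are recovered.
right = pd.read_csv(io.StringIO(colon_text), sep=':')
print(right.shape)
```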


## Setting the Row Index
By default, pandas generates a row index automatically. We can replace it by setting any column as the index with **set_index(...)**, which returns a new DataFrame:

In [28]:
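# set_index returns a new DataFrame, so assign the result back
df_csv = df_csv.set_index('age')
df_csv

Unnamed: 0_level_0,name,designation
age,Unnamed: 1_level_1,Unnamed: 2_level_1
20,a,VP
27,b,CEO
35,c,CFO
55,d,VP
18,e,VP
21,f,CEO
35,g,MD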
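
Keep in mind that `set_index` does not modify the DataFrame it is called on unless the result is assigned back. A minimal sketch on a small, made-up DataFrame:

```python
import pandas as pd

# Small illustrative DataFrame.
df = pd.DataFrame({'name': ['a', 'b', 'c'], 'age': [20, 27, 35]})

# set_index returns a new DataFrame; df itself is left untouched.
indexed = df.set_index('age')

print(df.index.name)          # the original still has its default index
print(indexed.index.name)     # the new frame is indexed by 'age'
print(list(indexed.columns))  # 'age' is no longer a regular column
```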


Setting the index this way is a *post operation*, i.e. we already have a DataFrame with a predefined index and change it after the DataFrame is created.
<br>
<br> We can also do this while reading the CSV file by passing the **index_col** parameter, which makes the column at the given position (here 1, the **age** column) the row index:

In [30]:
df_csv = pd.read_csv('csv_example', sep=':', index_col=1)
df_csv

Unnamed: 0_level_0,name,designation
age,Unnamed: 1_level_1,Unnamed: 2_level_1
20,a,VP
27,b,CEO
35,c,CFO
55,d,VP
18,e,VP
21,f,CEO
35,g,MD


In [31]:
# We can even provide more than one index_col to be treated as index
df_csv = pd.read_csv('csv_example', sep=':', index_col=[0,2])
df_csv

Unnamed: 0_level_0,Unnamed: 1_level_0,age
name,designation,Unnamed: 2_level_1
a,VP,20
b,CEO,27
c,CFO,35
d,VP,55
e,VP,18
f,CEO,21
g,MD,35


**NOTE: If you don't need every row for your task, don't load them all**
<br>
<br>CSV files can be large, and loading one in full may run into memory constraints. The **nrows** parameter loads only the first few rows of the file.

In [33]:
# Load only the first 3 rows
df_csv = pd.read_csv('csv_example', sep=':', nrows=3)
df_csv

Unnamed: 0,name,age,designation
0,a,20,VP
1,b,27,CEO
2,c,35,CFO
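
When the whole file is needed but does not fit comfortably in memory, one option not covered above is the **chunksize** parameter of **read_csv(...)**, which yields the file in pieces. The sketch below is an illustration on generated in-memory data; the chunk size of 4 and the running sum are arbitrary choices.

```python
import io
import pandas as pd

# Ten generated rows (illustrative data), colon-separated.
csv_text = "name:age\n" + "".join(f"p{i}:{20 + i}\n" for i in range(10))

# nrows reads only the first N data rows.
head = pd.read_csv(io.StringIO(csv_text), sep=':', nrows=3)

# chunksize returns an iterator of DataFrames instead of one large frame,
# so only one chunk has to be held in memory at a time.
chunk_sizes = []
total_age = 0
for chunk in pd.read_csv(io.StringIO(csv_text), sep=':', chunksize=4):
    chunk_sizes.append(len(chunk))
    total_age += int(chunk['age'].sum())

print(len(head))
print(chunk_sizes)
print(total_age)
```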


## Skipping empty lines in CSV files
By default, **read_csv(...)** skips blank lines while loading the file and constructing the DataFrame.
<br>
<br> If you want blank lines to be loaded, for example to count them, set **skip_blank_lines=False**. Our example file contains no blank lines, so the output below is unchanged:

In [34]:
df_csv = pd.read_csv('csv_example', skip_blank_lines=False, sep=':')
df_csv

Unnamed: 0,name,age,designation
0,a,20,VP
1,b,27,CEO
2,c,35,CFO
3,d,55,VP
4,e,18,VP
5,f,21,CEO
6,g,35,MD
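
The effect is only visible on input that actually contains a blank line. A minimal sketch, using an in-memory sample with one empty line in place of the `csv_example` file:

```python
import io
import pandas as pd

# Sample with one blank line between the two data rows (illustrative).
text_with_blank = "name:age\na:20\n\nb:27\n"

# Default behaviour: the blank line is skipped.
blank_skipped = pd.read_csv(io.StringIO(text_with_blank), sep=':')

# skip_blank_lines=False: the blank line is kept as a row of missing values.
blank_kept = pd.read_csv(io.StringIO(text_with_blank), sep=':',
                         skip_blank_lines=False)

# Count the rows in which every value is missing.
n_blank = int(blank_kept.isna().all(axis=1).sum())

print(len(blank_skipped))
print(len(blank_kept))
print(n_blank)
```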
