## 3. Read and Write

Pandas provides reader functions (`pd.read_*`) and matching writer methods (`to_*`) for many data formats:

| Format type | Data description | Reader | Writer |
|:-------|:-----|:-------|:-------|
| text | CSV | read_csv | to_csv |
| text | HTML | read_html | to_html |
| binary | MS Excel | read_excel | to_excel |
| binary | OpenDocument | read_excel | |
| SQL | SQL | read_sql | to_sql |
| ... | ... | ... | ... |

**Documentation**: pandas [IO tools](https://pandas.pydata.org/docs/user_guide/io.html)
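
Every reader and writer pair in the table follows the same pattern. As a minimal sketch, the SQL pair can be tried with an in-memory SQLite database from the standard library; the table name `students` and the sample values are assumptions made for this illustration.

```python
import sqlite3

import pandas as pd

# Small illustrative table (values are made up for this sketch)
df = pd.DataFrame({"name": ["bob", "ann"], "age": [20, 21]})

# In-memory SQLite database, so nothing is written to disk
conn = sqlite3.connect(":memory:")

# Writer: store the DataFrame in a table (hypothetical table name)
df.to_sql("students", conn, index=False)

# Reader: load the result of a query back into a DataFrame
restored = pd.read_sql("SELECT * FROM students", conn)
conn.close()

print(restored)
```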

In [1]:
%%html
<style>
    table { display: inline-block }
</style>

In [2]:
import numpy as np
import pandas as pd

---
### Content

    3.1 Reading CSV Files
    3.2 Customized Reading
    3.3 Writing to a CSV File

---
### 3.1 Reading CSV Files

Often, the data we want to process is stored in a CSV (Comma Separated Values) file. A typical CSV file has the following format:

```
    id,name,age,program
    317,bob,20,math
    312,ann,21,art
    310,cat,22,physics
```

The first line often contains the column names, and each following line represents one data object. In this example, values are separated by commas, but other delimiters such as spaces, tabs, or semicolons are also common.

To read a CSV file in pandas, use the `pd.read_csv` function. Note that this function has about 50 parameters to customize the behavior of the file reading process.
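
The effect of the delimiter can be tried without any file on disk. The following sketch parses the sample above from a string via `io.StringIO` (a stand-in for a file, used here only for illustration) and then parses a semicolon-separated copy of the same text.

```python
import io

import pandas as pd

# The sample from above, held in memory instead of in a file
comma_text = (
    "id,name,age,program\n"
    "317,bob,20,math\n"
    "312,ann,21,art\n"
    "310,cat,22,physics\n"
)

# The same records, but separated by semicolons
semicolon_text = comma_text.replace(",", ";")

# The default delimiter is a comma
df_comma = pd.read_csv(io.StringIO(comma_text))

# Any other delimiter has to be passed via `sep`
df_semicolon = pd.read_csv(io.StringIO(semicolon_text), sep=";")

print(df_comma)
print(df_comma.equals(df_semicolon))
```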

#### 3.1.1 Example: `india.csv`

+ Values in `india.csv` are separated by a comma

+ `india.csv` has column names

+ `sep`: sets the delimiter; the default is a comma (`,`), so it can be omitted here

+ `header`: left at its default, so pandas infers the column names from the first line


**Note:** The dataset `data/india.csv` was taken from [Data analysis: female literacy in India](https://scipython.com/book2/chapter-9-data-analysis-with-pandas/examples/data-analysis-female-literacy-in-india/)

In [3]:
# relative path to file
filename = './data/india.csv'

# read csv file
df = pd.read_csv(filename)

# show first five rows
df.head()

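| | State/UT | Male Population | Female Population | Area (km2) | Male Literacy (%) | Fertility Rate | Female Literacy (%) |
|:--|:---------|:----------------|:------------------|:-----------|:------------------|:---------------|:--------------------|
| 0 | Uttar Pradesh | 104480510 | 95331831 | 240928 | 79.24 | 3.7 | 59.26 |
| 1 | Maharashtra | 58243056 | 54131277 | 307713 | 89.82 | 1.9 | 75.48 |
| 2 | Bihar | 54278157 | 49821295 | 94163 | 73.39 | 3.9 | 53.33 |
| 3 | West Bengal | 46809027 | 44467088 | 88752 | 82.67 | 1.9 | 71.16 |
| 4 | Madhya Pradesh | 37612306 | 35014503 | 308245 | 80.53 | 3.3 | 60.02 |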


---
#### 3.1.2 Example: `india_woh.csv`

`india_woh.csv` contains the same data as `india.csv`, but without a header line. Because `pd.read_csv` infers a header by default, the code in the following cell wrongly interprets the first row of data as the column names:

In [5]:
filename = 'data/india_woh.csv'
df = pd.read_csv(filename)
df.head()

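| | Uttar Pradesh | 104480510 | 95331831 | 240928 | 79.24 | 3.7 | 59.26 |
|:--|:--------------|:----------|:---------|:-------|:------|:----|:------|
| 0 | Maharashtra | 58243056 | 54131277 | 307713 | 89.82 | 1.9 | 75.48 |
| 1 | Bihar | 54278157 | 49821295 | 94163 | 73.39 | 3.9 | 53.33 |
| 2 | West Bengal | 46809027 | 44467088 | 88752 | 82.67 | 1.9 | 71.16 |
| 3 | Madhya Pradesh | 37612306 | 35014503 | 308245 | 80.53 | 3.3 | 60.02 |
| 4 | Tamil Nadu | 36137975 | 36009055 | 130058 | 86.81 | 1.7 | 73.86 |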


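Fix this issue by passing `header=None`, which tells pandas that the file has no header line. Pandas then assigns the default column labels 0, 1, 2, ...: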

In [6]:
filename = './data/india_woh.csv'
df = pd.read_csv(filename, header=None)
df.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|:--|:--|:--|:--|:--|:--|:--|:--|
| 0 | Uttar Pradesh | 104480510 | 95331831 | 240928 | 79.24 | 3.7 | 59.26 |
| 1 | Maharashtra | 58243056 | 54131277 | 307713 | 89.82 | 1.9 | 75.48 |
| 2 | Bihar | 54278157 | 49821295 | 94163 | 73.39 | 3.9 | 53.33 |
| 3 | West Bengal | 46809027 | 44467088 | 88752 | 82.67 | 1.9 | 71.16 |
| 4 | Madhya Pradesh | 37612306 | 35014503 | 308245 | 80.53 | 3.3 | 60.02 |
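
Numeric default labels are rarely convenient. One possible follow-up, sketched here on two rows copied from the output above, is to pass descriptive labels through `names` together with `header=None`. The label strings themselves are hypothetical and chosen only for this example.

```python
import io

import pandas as pd

# Two header-less rows in the style of india_woh.csv
text = (
    "Uttar Pradesh,104480510,95331831,240928,79.24,3.7,59.26\n"
    "Maharashtra,58243056,54131277,307713,89.82,1.9,75.48\n"
)

# Hypothetical column labels, one per column in the file
names = ["state", "m_pop", "f_pop", "area", "m_literacy", "fert_rate", "f_literacy"]

# header=None: the file has no header line; names: labels to use instead
df = pd.read_csv(io.StringIO(text), header=None, names=names)

print(df)
```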


---
### 3.2 Customized Reading

Often the form of the data is inconvenient for our purposes. Here, we customize reading the data in the following way: 

+ select a subset of columns

+ change column identifiers

+ use the states of India as index

In [6]:
# relative path to file
filename = './data/india.csv'

# new names for selected columns
names = ['state', 'm_literacy', 'fert_rate', 'f_literacy']

# read data
df = pd.read_csv(filename, 
                 header=0,                               # pass header=0 to be able to replace existing names
                 names=names,                            # replace existing names
                 index_col=0,                            # use column 0 as index
                 usecols=[0, 4, 5, 6])                   # return subset of the columns

# peek into data
df.head()

| state | m_literacy | fert_rate | f_literacy |
|:------|:-----------|:----------|:-----------|
| Uttar Pradesh | 79.24 | 3.7 | 59.26 |
| Maharashtra | 89.82 | 1.9 | 75.48 |
| Bihar | 73.39 | 3.9 | 53.33 |
| West Bengal | 82.67 | 1.9 | 71.16 |
| Madhya Pradesh | 80.53 | 3.3 | 60.02 |
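
The same customization can also be done after reading instead of inside `pd.read_csv`. The sketch below is an alternative, not the approach used in the cell above: it selects columns by their original names, renames them with `rename`, and sets the index with `set_index`. It runs on two rows copied from the output above, held in a string for illustration.

```python
import io

import pandas as pd

# Two rows with some of the india.csv columns, held in memory
text = (
    "State/UT,Area (km2),Male Literacy (%),Fertility Rate,Female Literacy (%)\n"
    "Uttar Pradesh,240928,79.24,3.7,59.26\n"
    "Maharashtra,307713,89.82,1.9,75.48\n"
)

# Map the original column names to the shorter identifiers
new_names = {
    "State/UT": "state",
    "Male Literacy (%)": "m_literacy",
    "Fertility Rate": "fert_rate",
    "Female Literacy (%)": "f_literacy",
}

df = (
    pd.read_csv(io.StringIO(text), usecols=list(new_names))  # select columns by name
    .rename(columns=new_names)                                # change identifiers
    .set_index("state")                                       # use the states as index
)

print(df)
```

Selecting by name is more robust against a changed column order in the file, while the positional variant is shorter when the original names are long.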


**Note:** 

The `header` argument specifies the row(s) of the file to be used as column names. The data starts on the row after the header.

The default behavior is to infer the column names: if no column names are passed, the behavior is identical to `header=0` and the column names are taken from the first line of the file.

If column names are passed explicitly via `names`, the behavior is identical to `header=None`, which tells pandas that the file contains no column names. In this case, the first line of the file is read as data. (With `header=None` and no `names`, pandas assigns the default column names 0, 1, 2, ...)

Therefore, `header=0` must be passed explicitly to replace existing names. Otherwise, the original header line would end up as the first row of data.
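
The three cases from this note can be compared side by side. The following sketch uses a tiny in-memory CSV; the replacement names `key` and `student` are made up for the illustration.

```python
import io

import pandas as pd

text = "id,name\n317,bob\n312,ann\n"

# 1) No names passed: the header is inferred from the first line (like header=0)
inferred = pd.read_csv(io.StringIO(text))

# 2) names passed, header left at its default: the first line is treated as data
as_data = pd.read_csv(io.StringIO(text), names=["key", "student"])

# 3) names passed together with header=0: the first line is dropped and replaced
replaced = pd.read_csv(io.StringIO(text), header=0, names=["key", "student"])

print(inferred)
print(as_data)
print(replaced)
```

In the second case the old header line `id,name` shows up as the first data row, which is the situation that `header=0` avoids.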


---
### 3.3 Writing to a CSV File

The Series and DataFrame objects have a method `to_csv` which allows storing the contents of the object as a CSV file.

#### 3.3.1 Example 1

If you write a DataFrame that was created without explicit index and column names to a CSV file, the file contains the default integer labels (0, 1, 2, ...) for both the index and the columns.

In [7]:
data = np.arange(8).reshape(-1, 4)
df = pd.DataFrame(data)
df.to_csv('./data/eggs_01.csv')

#### 3.3.2 Example 2

The code in the following cell saves data to a CSV file without including the index and column labels:

In [8]:
df.to_csv('./data/eggs_02.csv', index=False, header=False)
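
To see the difference between the two examples above without opening the files, `to_csv` can be called without a path, in which case it returns the CSV text as a string. A minimal sketch using the same data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(8).reshape(-1, 4))

# Without a path, to_csv returns the CSV text instead of writing a file
with_labels = df.to_csv()
without_labels = df.to_csv(index=False, header=False)

print(with_labels)
print(without_labels)
```

With the labels included, the first line starts with an empty field because the default index has no name.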

#### 3.3.3 Example 3

The following code saves the data to a CSV file including the index and column labels:

In [9]:
data = {
    'name': ['bob', 'ann', 'cat'],
    'age': [20, 21, 22],
    'program': ['math', 'art', 'physics'],
}

# use the student ids as index
df = pd.DataFrame(data, index=[317, 312, 310])
df.to_csv('./data/students.csv')
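
A file written with its index can be read back with `index_col=0`, so that the first column becomes the index again. The sketch below shows such a round trip. It writes to an in-memory `io.StringIO` buffer rather than to `./data/students.csv`, which is a choice made here only to keep the example self-contained.

```python
import io

import pandas as pd

df = pd.DataFrame(
    {
        "name": ["bob", "ann", "cat"],
        "age": [20, 21, 22],
        "program": ["math", "art", "physics"],
    },
    index=[317, 312, 310],
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer)

# Rewind the buffer and read it back, using the first column as index
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(restored)
```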