<hr style="margin-bottom: 40px;">

<img src="https://user-images.githubusercontent.com/7065401/68501079-0695df00-023c-11ea-841f-455dac84a089.jpg"
    style="width:400px; float: right; margin: 0 40px 40px 40px;"></img>

# Reading CSV and TXT files

Rather than creating `Series` or `DataFrame` structures from scratch, or even from Python core sequences or `ndarrays`, the most typical use of **pandas** is loading data from files or other sources for further exploration, transformation and analysis.

In this lecture we'll learn how to read comma-separated values files (.csv) and raw text files (.txt) into pandas `DataFrame`s.

<hr style="border:4px solid gray"> </hr>

## Hands on!

In [14]:
import pandas as pd

<hr style="border:4px solid gray"> </hr>

## Reading data with Python

We can read data using plain Python, without any extra libraries.

When you want to work with a file, the first thing to do is to open it. This is done by calling the built-in `open()` function.

`open()` has a single required argument, the path to the file, and returns a single value, the file object.

The `with` statement automatically takes care of closing the file when execution leaves the `with` block, even in cases of error.
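As a quick check of that behavior, the sketch below writes a small throwaway file (a temporary path, not the lecture's data file) and inspects the file object's `closed` attribute inside and after the `with` block:

```python
import os
import tempfile

# Work in a temporary directory so the example does not depend on ./data
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.csv')
    with open(path, 'w') as fp:
        fp.write('2017-04-02 00:00:00,1099.169125\n')

    with open(path, 'r') as fp:
        inside = fp.closed          # the file is still open here
        first_line = fp.readline()

    after = fp.closed               # leaving the block closed the file

print(inside, after)
```

The variable `fp` still exists after the block, but the file it refers to is closed, so further reads would raise a `ValueError`.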

In [15]:
with open('./data/btc-market-price.csv', 'r') as fp:
    print(fp)

<_io.TextIOWrapper name='./data/btc-market-price.csv' mode='r' encoding='UTF-8'>


Once the file is opened, we can read its content as follows:

In [16]:
with open('./data/btc-market-price.csv', 'r') as fp:
    for index, line in enumerate(fp.readlines()):
        # print just the first 10 lines
        if index < 10:
            print(index, line)

0 2017-04-02 00:00:00,1099.169125

1 2017-04-03 00:00:00,1141.813

2 2017-04-04 00:00:00,1141.6003625

3 2017-04-05 00:00:00,1133.0793142857142

4 2017-04-06 00:00:00,1196.3079375

5 2017-04-07 00:00:00,1190.45425

6 2017-04-08 00:00:00,1181.1498375

7 2017-04-09 00:00:00,1208.8005

8 2017-04-10 00:00:00,1207.744875

9 2017-04-11 00:00:00,1226.6170375
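Note that `readlines()` loads every line into a list before the loop starts, and each line keeps its trailing newline, which is why the output above is double-spaced. One possible alternative, sketched here with an in-memory `io.StringIO` object standing in for the real file, is to iterate over the file object lazily with `itertools.islice` and strip the newline:

```python
import io
from itertools import islice

# Stand-in for the opened file; values copied from the output above
fp = io.StringIO(
    '2017-04-02 00:00:00,1099.169125\n'
    '2017-04-03 00:00:00,1141.813\n'
    '2017-04-04 00:00:00,1141.6003625\n'
    '2017-04-05 00:00:00,1133.0793142857142\n'
)

# islice stops after 3 lines instead of reading the whole file
first_lines = [line.rstrip('\n') for line in islice(fp, 3)]
for index, line in enumerate(first_lines):
    print(index, line)
```

With a real file, the same pattern would be `islice(fp, 10)` inside the `with open(...)` block.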



How can we process the data read from the file using pure Python? It involves a lot of manual work, for example, splitting the values by the correct separator:

In [17]:
with open('./data/btc-market-price.csv', 'r') as fp:
    for index, line in enumerate(fp.readlines()):
        # print just the first 10 lines
        if index < 10:
            timestamp, price = line.split(',')
            print(f'{timestamp}: ${price}')

2017-04-02 00:00:00: $1099.169125

2017-04-03 00:00:00: $1141.813

2017-04-04 00:00:00: $1141.6003625

2017-04-05 00:00:00: $1133.0793142857142

2017-04-06 00:00:00: $1196.3079375

2017-04-07 00:00:00: $1190.45425

2017-04-08 00:00:00: $1181.1498375

2017-04-09 00:00:00: $1208.8005

2017-04-10 00:00:00: $1207.744875

2017-04-11 00:00:00: $1226.6170375
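The values produced by `split` are still plain strings, and `price` keeps the line's trailing newline (hence the blank lines in the output above). A minimal sketch of the extra manual steps, assuming the two-column `timestamp,price` layout shown above and using a hypothetical helper named `parse_line`:

```python
from datetime import datetime

def parse_line(line):
    """Turn one 'timestamp,price' line into a (datetime, float) pair."""
    timestamp, price = line.strip().split(',')
    return datetime.strptime(timestamp, '%Y-%m-%d %H:%M:%S'), float(price)

# Sample lines copied from the output above
lines = [
    '2017-04-02 00:00:00,1099.169125\n',
    '2017-04-03 00:00:00,1141.813\n',
]

for line in lines:
    when, price = parse_line(line)
    print(f'{when:%Y-%m-%d}: ${price:.2f}')
```

Every column needs its own conversion, and any change in the file layout means rewriting the helper.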



But what happens if the separator is not known in advance, as in the file `exam_review.csv`?

In [19]:
!head ./data/exam_review.csv

first_name>last_name>age>math_score>french_score
Ray>Morley>18>"68,000">"75,000"
Melvin>Scott>24>77>83
Amirah>Haley>22>92>67

Gerard>Mills>19>"78,000">72
Amy>Grimes>23>91>81


In this case, the separator is not a comma, but the `>` sign. It's still a "CSV", although not technically separated by commas.
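Splitting by hand still works once the separator is known, but it is only a partial fix. In the sketch below (rows copied from the output above), `str.split('>')` leaves the double quotes in place around values such as `"68,000"`:

```python
# Two rows copied from exam_review.csv as shown above
rows = [
    'Ray>Morley>18>"68,000">"75,000"',
    'Melvin>Scott>24>77>83',
]

for row in rows:
    fields = row.split('>')
    print(fields)

# The quoted values keep their quote characters after a plain split
first = rows[0].split('>')
print(first[3])
```

Removing those quotes correctly is one more manual step to handle.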

### The `csv` module

Python's standard library includes the `csv` module, which takes over more of the work of reading CSV files:

In [20]:
import csv

In [21]:
with open('./data/btc-market-price.csv', 'r') as fp:
    reader = csv.reader(fp)
    for index, (timestamp, price) in enumerate(reader):
        # print just the first 10 rows
        if index < 10:
            print(f'{timestamp}: {price}')

2017-04-02 00:00:00: 1099.169125
2017-04-03 00:00:00: 1141.813
2017-04-04 00:00:00: 1141.6003625
2017-04-05 00:00:00: 1133.0793142857142
2017-04-06 00:00:00: 1196.3079375
2017-04-07 00:00:00: 1190.45425
2017-04-08 00:00:00: 1181.1498375
2017-04-09 00:00:00: 1208.8005
2017-04-10 00:00:00: 1207.744875
2017-04-11 00:00:00: 1226.6170375


The `csv` module takes care of splitting the file using a given separator (called `delimiter`) and creating an iterator for us.
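`csv.reader` accepts any iterable of strings, not only file objects, so this behavior can be tried without the data file. The sketch below feeds it a short list of lines (copied from `exam_review.csv` as shown earlier) and pulls rows from the iterator one at a time with `next()`:

```python
import csv

lines = [
    'first_name>last_name>age>math_score>french_score',
    'Ray>Morley>18>"68,000">"75,000"',
]

reader = csv.reader(lines, delimiter='>')

header = next(reader)     # first row
first_row = next(reader)  # second row; the quotes around "68,000" are removed

print(header)
print(first_row)
```

Unlike a plain `str.split`, the reader also interprets the quote characters, so quoted values come back without their surrounding quotes.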

In [32]:
with open('./data/exam_review.csv', 'r') as fp:
    reader = csv.reader(fp, delimiter='>') # special delimiter
    next(reader) # skipping header
    for index, values in enumerate(reader):
        if not values:
            continue # skip empty lines
        fname, lname, age, math, french = values
        print(f'{fname} {lname} (age {age}) got {math} in Math and {french} in French')

Ray Morley (age 18) got 68,000 in Math and 75,000 in French
Melvin Scott (age 24) got 77 in Math and 83 in French
Amirah Haley (age 22) got 92 in Math and 67 in French
Gerard Mills (age 19) got 78,000 in Math and 72 in French
Amy Grimes (age 23) got 91 in Math and 81 in French
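When the delimiter is not known ahead of time, the standard library's `csv.Sniffer` class can try to guess it from a sample of the text. This is a heuristic and may fail or guess wrong on unusual files, so treat the sketch below (run on rows copied from `exam_review.csv`) as a starting point rather than a guarantee:

```python
import csv
import io

sample = (
    'first_name>last_name>age>math_score>french_score\n'
    'Ray>Morley>18>"68,000">"75,000"\n'
    'Melvin>Scott>24>77>83\n'
    'Amirah>Haley>22>92>67\n'
)

# Sniffer inspects the sample and returns a Dialect describing it
dialect = csv.Sniffer().sniff(sample)
print(repr(dialect.delimiter))

# The detected dialect can be passed straight to csv.reader
rows = list(csv.reader(io.StringIO(sample), dialect))
print(rows[1])
```

With a real file, the sample would typically be the first few kilobytes read from the file, followed by `fp.seek(0)` before creating the reader.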


<hr style='border:5px solid #5C6067'> </hr>