## Reading files with Python

In [2]:
import pandas as pd

To work with a file, you first need to open it by calling the built-in `open()` function.

`open()` takes one required argument, the path to the file, and returns a file object.

The `with` statement closes the file automatically when execution leaves the `with` block, even if an error is raised.
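The closing behavior can be checked directly. The following is a minimal sketch that assumes a throwaway temporary file (not the dataset used in this notebook) and a simulated error; the names `sample_path`, `inside_closed`, and `after_closed` are illustrative only.

```python
import os
import tempfile

# Create a small throwaway file so the example does not depend on any dataset.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write('2017-04-02,1099.17\n')
    sample_path = tmp.name

try:
    with open(sample_path, 'r') as reader:
        inside_closed = reader.closed  # False: the file is open inside the block
        raise ValueError('simulated failure while reading')
except ValueError:
    pass  # the error is expected; we only care about the file state

after_closed = reader.closed  # True: the file was closed despite the exception
print(inside_closed, after_closed)

os.remove(sample_path)  # clean up the temporary file
```

The file object still exists after the block, but it is closed, so no explicit `reader.close()` call is needed.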

In [1]:
filepath = '../data/btc-market-price.csv'

with open(filepath, 'r') as reader:
    print(reader)

<_io.TextIOWrapper name='../data/btc-market-price.csv' mode='r' encoding='cp1252'>


Once the file is open, we can read its contents line by line:

In [None]:
with open(filepath, 'r') as reader:
    for index, line in enumerate(reader):  # iterate lazily instead of loading every line
        if index >= 10:  # stop after the first 10 lines
            break
        print(index, line)
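An alternative way to take only the first few lines is `itertools.islice`, which stops reading as soon as it has enough. This is a small sketch on in-memory text; the `sample` data is made up for illustration and stands in for an open file object.

```python
import io
from itertools import islice

# Hypothetical in-memory "file" with 30 lines, standing in for an open file object.
sample = io.StringIO(
    ''.join(f'2017-04-{day:02d},{1000 + day}\n' for day in range(1, 31))
)

# islice consumes only the first 3 lines; the rest are never read.
first_lines = [line.rstrip('\n') for line in islice(sample, 3)]
print(first_lines)
```

The same pattern should work with the `reader` object returned by `open()`, since file objects are iterable line by line.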

For CSV files, pandas provides the `read_csv()` function. It accepts a broad set of parameters that control how the data is read and parsed, such as the structure and encoding of the file. The most common parameters are:

- `filepath_or_buffer`: Path, URL, or file-like object to read from.
- `sep`: Character(s) that are used as a field separator in the file.
- `header`: Index of the row containing the names of the columns (None if none).
- `index_col`: Index of the column or sequence of indexes that should be used as index of rows of the data.
- `names`: Sequence containing the names of the columns (used together with header = None).
- `skiprows`: Number of rows or sequence of row indexes to ignore in the load.
- `na_values`: Sequence of values that, if found in the file, should be treated as NaN.
- `dtype`: Dictionary mapping column names to the NumPy types their contents should be converted to.
- `parse_dates`: Controls date parsing. `True` makes pandas try to parse the index as dates; a list of column names or indexes parses each of those columns as a date. Older releases also accept a nested list to combine several columns into a single date.
- `date_parser`: Function to use to parse dates. Deprecated since pandas 2.0 in favor of `date_format`.
- `nrows`: Number of rows to read from the beginning of the file.
- `skipfooter`: Number of rows to ignore at the end of the file (not supported by the default C engine).
- `encoding`: Encoding to be expected from the file read.
- `squeeze`: Flag that indicates that, if the data read contains only one column, the result is a Series instead of a DataFrame. Removed in pandas 2.0; call `.squeeze("columns")` on the result instead.
- `thousands`: Character to use to detect the thousands separator.
- `decimal`: Character to use to detect the decimal separator.
- `skip_blank_lines`: Flag that indicates whether blank lines should be ignored.
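As a rough illustration of how several of these parameters combine, the sketch below parses a small in-memory text. The data, the column names (`day`, `amount`, `note`), and the European-style number format are assumptions made up for this example.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated export with a comment line on top,
# '.' as thousands separator and ',' as decimal separator.
raw = io.StringIO(
    "# exported by a hypothetical tool\n"
    "day;amount;note\n"
    "2017-04-02;1.099,17;ok\n"
    "2017-04-03;?;missing\n"
    "2017-04-04;1.141,60;ok\n"
    "2017-04-05;1.133,08;ok\n"
)

params_df = pd.read_csv(
    raw,
    sep=';',               # fields are separated by semicolons
    skiprows=1,            # ignore the leading comment line
    na_values=['?'],       # treat '?' as a missing value
    thousands='.',         # '1.099' means one thousand ninety-nine
    decimal=',',           # ',17' is the fractional part
    parse_dates=['day'],   # parse the 'day' column as dates
    index_col='day',       # and use it as the row index
    nrows=3,               # read only the first 3 data rows
)
print(params_df)
```

Passing a file-like object such as `io.StringIO` instead of a path is convenient for experimenting with parameters before pointing `read_csv()` at a real file.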

`read_csv()` can also read files through a URL:

In [4]:
csv_url = "https://raw.githubusercontent.com/datasets/gdp/master/data/gdp.csv"

pd.read_csv(csv_url).head()

| | Country Name | Country Code | Year | Value |
|---|---|---|---|---|
| 0 | Arab World | ARB | 1968 | 25760680000.0 |
| 1 | Arab World | ARB | 1969 | 28434200000.0 |
| 2 | Arab World | ARB | 1970 | 31385500000.0 |
| 3 | Arab World | ARB | 1971 | 36426910000.0 |
| 4 | Arab World | ARB | 1972 | 43316060000.0 |


### Using the parameters in `read_csv()`

In [8]:
df = pd.read_csv('../data/btc-market-price.csv',
                 header=None,  # tells pandas the file has no header row
                 na_values=['', '?', '-'],  # values to treat as NA
                 names=['Timestamp', 'Price'],  # column names
                 dtype={'Price': 'float'},  # dtype for the 'Price' column
                 parse_dates=[0],  # parse column 0 as dates
                 index_col=[0]  # use column 0 as the index
)
df.head(3)

| Timestamp | Price |
|---|---|
| 2017-04-02 | 1099.169125 |
| 2017-04-03 | 1141.813 |
| 2017-04-04 | 1141.600363 |
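To see what these parameters do without the data file at hand, the same call can be tried on in-memory text. The sketch below reuses the three values from the output above and adds one row with a `?` placeholder; that extra row and the name `sample_df` are assumptions for illustration, not part of the dataset.

```python
import io

import pandas as pd

# Headerless rows: three values from the notebook output plus one
# hypothetical row whose price is the '?' placeholder.
raw = io.StringIO(
    "2017-04-02,1099.169125\n"
    "2017-04-03,1141.813\n"
    "2017-04-04,1141.600363\n"
    "2017-04-05,?\n"
)

sample_df = pd.read_csv(
    raw,
    header=None,                   # the text has no header row
    na_values=['', '?', '-'],      # placeholders to treat as NA
    names=['Timestamp', 'Price'],  # column names
    dtype={'Price': 'float'},      # 'Price' is stored as float
    parse_dates=[0],               # parse column 0 as dates
    index_col=[0],                 # use column 0 as the index
)

print(sample_df.index.dtype)       # a datetime dtype
print(sample_df['Price'].dtype)    # float64
print(sample_df['Price'].isna().sum())  # number of missing prices
```

Checking the index and column dtypes after loading is a quick way to confirm that dates and missing values were parsed as intended.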
