Very often we will be getting data from *flat files* -- someone sends you a file and asks "Can you analyze this?". Reading data into Python is a fairly straightforward task -- usually. In this section we will cover how to read some of the most common file types and then look at what we can do when our data come to us in fairly ugly formats.

## Pandas `read_...`

Pandas has a number of functions for reading various file types. You can see a list of all supported file types [here](https://pandas.pydata.org/docs/user_guide/io.html#io).


### Comma-separated files

Comma-separated files, or CSVs, are text files where each entry in a row is separated by a comma and each row is separated by a newline (`\n`) character. This is one of the most common formats for flat files because it is easy to parse.
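To make the format concrete, here is a minimal sketch that parses a small CSV held in memory. The column names and values are made up for illustration, and `io.StringIO` simply stands in for a file on disk.

```python
import io

import pandas as pd

# A tiny, made-up CSV: commas separate entries, "\n" separates rows.
csv_text = "name,score\nalice,90\nbob,85\n"

# read_csv accepts a file-like object, so StringIO stands in for a real file.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```

The first line is treated as the column headers by default, and each later line becomes one row.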


### Tab-separated files

TSVs are similar to CSVs, except the entries in each row are separated by a tab (`\t`) character instead of a comma.
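As a rough sketch, a TSV can be read with the same `read_csv` function by passing `sep="\t"`. The data below are invented for illustration.

```python
import io

import pandas as pd

# The same made-up table as a TSV: tabs separate entries within a row.
tsv_text = "name\tscore\nalice\t90\nbob\t85\n"

# sep tells read_csv which delimiter to split on (the default is a comma).
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(df_tsv)
```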


### Excel files
Excel files, while one of the most common file types, can be challenging to work with programmatically. In particular, it is difficult to write out to Excel files. Fortunately, pandas makes it simple to work with Excel files.

## `read_csv`

We'll start by reading in some CSV files. The `read_csv` function accepts a wide variety of arguments. It's worthwhile reading the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv) to see these options.


*You'll find the data for this section in `data/examples`*

In [17]:
import pandas as pd

penguins = pd.read_csv('data/examples/penguins.csv')

In [18]:
penguins.dtypes

species               object
island                object
bill_length_mm       float64
bill_depth_mm        float64
flipper_length_mm    float64
body_mass_g          float64
sex                   object
dtype: object

When the data are clean, reading in a csv is simple. However, often the data will require some massaging to get it into a clean format. For example, your data may have empty rows at the start of the file.

In [19]:
pd.read_csv('data/examples/empty_cells.csv')

Unnamed: 0,"This Document was prepared by John Smith on January 1, 2020",Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6
0,,,,,,,
1,,,,,,,
2,,date,col1,col2,col3,,
3,,2020-01-01,foo,100,TRUE,,
4,,2020-01-02,bar,200,FALSE,,
5,,2020-01-03,fizz,300,TRUE,,
6,,2020-01-04,buzz,400,TRUE,,


We see the data actually start in the middle of the table. We can use the `header` argument to tell pandas which row (counting from 0) contains the column headers.

In [20]:
pd.read_csv('data/examples/empty_cells.csv', header=3)

Unnamed: 0.1,Unnamed: 0,date,col1,col2,col3,Unnamed: 5,Unnamed: 6
0,,2020-01-01,foo,100,True,,
1,,2020-01-02,bar,200,False,,
2,,2020-01-03,fizz,300,True,,
3,,2020-01-04,buzz,400,True,,


Alternatively, we can use the `skiprows` argument to skip the first three rows, which gives the same result here.

In [21]:
pd.read_csv('data/examples/empty_cells.csv', skiprows=3)

Unnamed: 0.1,Unnamed: 0,date,col1,col2,col3,Unnamed: 5,Unnamed: 6
0,,2020-01-01,foo,100,True,,
1,,2020-01-02,bar,200,False,,
2,,2020-01-03,fizz,300,True,,
3,,2020-01-04,buzz,400,True,,
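The equivalence of `header=3` and `skiprows=3` for a file like this can be checked with a small in-memory stand-in. The text below is a made-up imitation of a file with a title row and two empty rows, not the actual contents of `empty_cells.csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in: a title row, two rows of empty cells, then the table.
raw = (
    "Prepared by a hypothetical author,\n"
    ",\n"
    ",\n"
    "date,value\n"
    "2020-01-01,100\n"
    "2020-01-02,200\n"
)

# header=3: use row 3 (counting from 0) as the column names.
by_header = pd.read_csv(io.StringIO(raw), header=3)

# skiprows=3: discard the first three rows, then read the rest as usual.
by_skiprows = pd.read_csv(io.StringIO(raw), skiprows=3)

print(by_header.equals(by_skiprows))
```

The two arguments can diverge when a file contains completely blank lines, because `header` ignores blank lines when `skip_blank_lines=True` (the default).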


Let's also set the `date` column as our index when we read in the data.

In [22]:
dat = pd.read_csv('data/examples/empty_cells.csv', header=3, index_col='date')
dat

date,Unnamed: 0,col1,col2,col3,Unnamed: 5,Unnamed: 6
2020-01-01,,foo,100,True,,
2020-01-02,,bar,200,False,,
2020-01-03,,fizz,300,True,,
2020-01-04,,buzz,400,True,,


Now, all we need to do is drop the unnecessary `Unnamed` columns. If we knew ahead of time which columns we wanted to keep, we could pass in the argument `usecols=['date', 'col1', 'col2', 'col3']` (`date` must be included because it is used as the index). Otherwise we would just subset the data normally.

In [23]:
# Select every column whose name contains 'Unnamed' and drop it
drop_cols = dat.columns[dat.columns.str.contains('Unnamed')]
dat = dat.drop(columns=drop_cols)

dat

date,col1,col2,col3
2020-01-01,foo,100,True
2020-01-02,bar,200,False
2020-01-03,fizz,300,True
2020-01-04,buzz,400,True
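The `usecols` alternative mentioned above can be sketched as follows. The in-memory text is a made-up stand-in with one unwanted column (`extra`), not the actual example file.

```python
import io

import pandas as pd

# Hypothetical stand-in with one column we do not want to keep.
raw = (
    "date,col1,col2,col3,extra\n"
    "2020-01-01,foo,100,True,\n"
    "2020-01-02,bar,200,False,\n"
)

# usecols keeps only the listed columns; 'date' is listed too because
# index_col has to refer to a column that is actually read.
subset = pd.read_csv(
    io.StringIO(raw),
    usecols=["date", "col1", "col2", "col3"],
    index_col="date",
)
print(subset)
```

Filtering at read time avoids loading columns that would be dropped anyway, which can matter for wide files.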


## `read_excel`

You can use `read_excel` to read Excel files. This works for both `xlsx` and `xls` files. We can pass in the argument `sheet_name` to indicate which sheet in the workbook we want to read. `read_excel` also accepts most of the same arguments as `read_csv`.

*You may receive an error that you are missing a dependency `openpyxl`. You can install it using `pip install openpyxl`*

In [24]:
customers = pd.read_excel('data/examples/customers.xlsx', sheet_name='Customers', header=1)
sales = pd.read_excel('data/examples/customers.xlsx', sheet_name='Orders', header=1, index_col='date')

customers

Unnamed: 0,customer_number,Name
0,982374,Alice
1,208933,Bob
2,398740,Carol
3,148765,Dave
4,897143,Frank
