# Data I/O

___Resources___

https://bit.ly/2usZTCz - pandas documentation - IO Tools

https://bit.ly/2zuzt95 - Medium article - DataFrame IO Performance with Pandas, dask, fastparquet and HDF5

In [1]:
## Base imports

import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 50)

## Reading Tabular Data into Pandas

Pandas features a host of ready-made functions for reading tabular data into a `DataFrame` object.

In practice, the most commonly used are **[`read_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html)** and **[`read_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html)**, although some of the newer data storage formats can be faster to read and consume less memory.

The table below contains a _selection_ of the reader and writer functions available in pandas.

| Format Type | Data Description | Reader | Writer |
| --- | --- | --- | --- |
| text | Delimited data from a file or URL, comma as default delimiter | read_csv | to_csv |
| text | Data from a JSON (JavaScript Object Notation) string representation | read_json | to_json |
| text | All tables found in a given HTML document | read_html | to_html |
| text | Data straight from the local clipboard | read_clipboard | to_clipboard |
| binary | Tabular data from Excel xls or xlsx files | read_excel | to_excel |
| binary | HDF5 files written by pandas | read_hdf | to_hdf |
| SQL | Results from an SQL query | read_sql | to_sql |
| SQL | Google BigQuery | read_gbq | to_gbq |

In [2]:
# read_table uses the same parser as read_csv, but its default delimiter is a tab,
# so the comma delimiter is passed explicitly here
# absolute path reference - /home/nbuser/library/Notebooks/Data/worldstats.csv

worldstats = pd.read_table('./Data/worldstats.csv', delimiter=',')
worldstats.head()

Unnamed: 0,country,year,Population,GDP
0,Arab World,2015,392022276.0,2530102000000.0
1,Arab World,2014,384222592.0,2873600000000.0
2,Arab World,2013,376504253.0,2846994000000.0
3,Arab World,2012,368802611.0,2773270000000.0
4,Arab World,2011,361031820.0,2497945000000.0
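To make the relationship between the two readers concrete, here is a minimal sketch. It uses a small in-memory sample (an assumption, standing in for `worldstats.csv`) so that it runs without the data folder.

```python
import io
import pandas as pd

# Hypothetical in-memory sample standing in for ./Data/worldstats.csv
csv_text = (
    "country,year,Population\n"
    "Arab World,2015,392022276.0\n"
    "Arab World,2014,384222592.0\n"
)
tsv_text = csv_text.replace(",", "\t")  # same data, tab-delimited

from_csv = pd.read_csv(io.StringIO(csv_text))                   # comma is the default
from_table = pd.read_table(io.StringIO(tsv_text))               # tab is the default
from_table_sep = pd.read_table(io.StringIO(csv_text), sep=",")  # override the default

# All three calls should produce the same DataFrame
print(from_csv.equals(from_table), from_csv.equals(from_table_sep))
```

`sep` and `delimiter` are interchangeable names for the same argument, so either spelling works.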


### Reading HTML data into Pandas

In [3]:
# read_html scrapes all HTML tables from io and returns a list of DataFrames

tables = pd.read_html('https://en.wikipedia.org/wiki/2018_in_film')
len(tables)

10

In [4]:
# A single DataFrame can then be referenced through square bracket notation

movies_2018 = tables[2]
movies_2018

Unnamed: 0,0,1,2,3
0,Rank,Title,Distributor,Worldwide gross
1,1,Avengers: Infinity War,Disney,"$2,041,080,000"
2,2,Black Panther,"$1,346,554,297",
3,3,Jurassic World: Fallen Kingdom,Universal,"$1,134,697,215"
4,4,Incredibles 2,Disney,"$856,918,492"
5,5,Deadpool 2,20th Century Fox,"$730,840,378"
6,6,Ready Player One,Warner Bros.,"$582,018,455"
7,7,Operation Red Sea,Huaxia Film,"$579,220,560"
8,8,Detective Chinatown 2,Wanda Media,"$544,061,916"
9,9,Rampage,Warner Bros.,"$425,678,945"
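In the output above the real column names ended up in row 0, under integer column labels. One possible way to promote that row to a header is sketched below, using a small hand-built frame (hypothetical, standing in for `tables[2]`) so it runs offline.

```python
import pandas as pd

# Hypothetical stand-in for tables[2]: integer column labels, real header in row 0
raw = pd.DataFrame([
    ["Rank", "Title", "Distributor"],
    ["1", "Avengers: Infinity War", "Disney"],
    ["3", "Jurassic World: Fallen Kingdom", "Universal"],
])

cleaned = raw.iloc[1:].reset_index(drop=True)  # keep only the data rows
cleaned.columns = raw.iloc[0].tolist()         # use row 0 as the column labels
print(cleaned)
```

`read_html` also accepts a `header` argument (for example `header=0`), which may achieve the same thing at import time depending on how the page's table is marked up.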


### Optional arguments

Real-world data is messy, and every dataset is unique, with its own nuances and formatting issues. As such, the data parsing functions have evolved to include many optional parameters.

The arguments for these functions generally fall into a few categories.

__Indexing__ -  Subset the data on import/provide references for column names/index

__Type inference/Conversion__ -  Classify the datatypes of each column or provide custom lists of missing value markers

__Datetime Parsing__ - Includes combining date/time information from multiple columns into a single column

__Iterating__ -  Iterating over chunks of data for large files

__Unclean data issues__ -  Skip rows/footers or maybe specify that numeric data has a comma thousand separator

The pandas [IO Tools](http://pandas.pydata.org/pandas-docs/stable/io.html) documentation has many examples of how each of these works.
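As a rough illustration of the categories above, the sketch below parses a deliberately messy sample with `read_csv`. The data, the country names and the `--` missing-value marker are all made up for this example.

```python
import io
import pandas as pd

# Hypothetical messy export: a title line, ';' delimiters, '--' for missing
# values and a comma as the thousands separator
raw = """Exported report
country;date;population
Aland;2015-01-01;1,250
Borduria;2015-01-01;--
Aland;2016-01-01;1,300
"""

df = pd.read_csv(
    io.StringIO(raw),
    sep=";",
    skiprows=1,            # unclean data: skip the title line
    index_col="country",   # indexing: use a column as the index
    na_values=["--"],      # type inference: custom missing value marker
    thousands=",",         # unclean data: comma thousands separator
    parse_dates=["date"],  # datetime parsing
)
print(df.dtypes)

# Iterating: read the same data in chunks of two rows
chunks = pd.read_csv(io.StringIO(raw), sep=";", skiprows=1, chunksize=2)
rows_per_chunk = [len(chunk) for chunk in chunks]
print(rows_per_chunk)
```

With `chunksize` set, `read_csv` returns an iterator of DataFrames rather than a single DataFrame, which keeps memory use bounded for large files.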

## Reading Microsoft Excel Files

Pandas supports reading tabular data stored in Excel 2003 and higher.

In [6]:
# Reading in the same DataFrame as before but from an xlsx workbook
# A filename can be passed straight into read_excel, but reading multiple sheets
# is faster when you create the ExcelFile instance first

xlsx = pd.ExcelFile('./Data/Excel_WorldStats.xlsx')
pd.read_excel(xlsx, sheet_name='Sheet1').head()

Unnamed: 0,country,year,Population,GDP
0,Arab World,2015,392022276.0,2530102000000.0
1,Arab World,2014,384222592.0,2873600000000.0
2,Arab World,2013,376504253.0,2846994000000.0
3,Arab World,2012,368802611.0,2773270000000.0
4,Arab World,2011,361031820.0,2497945000000.0


## Writing Data to Text Formats

### Export `DataFrames` to CSV File with the [`to_csv`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html) Method

In [8]:
# New DataFrame from a GitHub-hosted csv file

baby_names = pd.read_csv("https://raw.githubusercontent.com/hadley/data-baby-names/master/baby-names.csv")
baby_names.head()

Unnamed: 0,year,name,percent,sex
0,1880,John,0.081541,boy
1,1880,William,0.080511,boy
2,1880,James,0.050057,boy
3,1880,Charles,0.045167,boy
4,1880,George,0.043292,boy


In [9]:
# export using the to_csv method - Note the extra arguments passed

baby_names.to_csv("./Data/Baby_Names.csv", index = False, columns = ["year", "name", "percent", "sex"], encoding = "utf-8")
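To see what `index` and `columns` do, the sketch below writes a small frame (a hypothetical miniature of `baby_names`) to an in-memory buffer instead of a file and reads it back.

```python
import io
import pandas as pd

# Hypothetical miniature of the baby_names DataFrame
sample = pd.DataFrame({
    "year": [1880, 1880],
    "name": ["John", "William"],
    "percent": [0.081541, 0.080511],
    "sex": ["boy", "boy"],
})

buffer = io.StringIO()
# index=False drops the row labels; columns selects and orders the output columns
sample.to_csv(buffer, index=False, columns=["year", "name"])
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back gives only the two exported columns
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip)
```

Without `index=False` the row labels are written as an extra unnamed first column, which is where the `Unnamed: 0` column often seen after re-importing a CSV comes from.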

### Export `DataFrames` to Excel File with the [`to_excel`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_excel.html) Method

In [10]:
# for Excel files create an ExcelWriter instance first or pass file path straight to to_excel

excel_file = pd.ExcelWriter("./Data/Baby_Name_Percentages.xlsx" )

In [11]:
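# after adding the dataframe to a sheet, close the writer to save the workbook
# (close() replaces the save() method, which was removed in pandas 2.0)

baby_names.head().to_excel(excel_file, index = False, sheet_name= 'Baby_Names')
excel_file.close()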

### Want to read and write to the same file in Excel?

While this is entirely possible, by default Pandas overwrites an existing file when writing to it, so the original data is lost.

For more in-depth control of Excel files with Python, check out [**openpyxl**](https://openpyxl.readthedocs.io/en/stable/).

This library allows much greater control of Excel files, including formatting and reading from and writing to the same workbook.

### Interacting with Web APIs & Databases

**Web APIs** - Many websites and applications provide data feeds via JSON. There are a number of ways to access these APIs but one of the easiest is the [**requests**](http://docs.python-requests.org/en/master/) package. Often the Python community has already written higher level wrappers specifically for the API that you want to access which populate the data straight into a Pandas DataFrame.
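A minimal sketch of turning a JSON response into a DataFrame is shown below. No real API is called: the payload is a hard-coded, hypothetical string so that the example runs offline. With **requests**, the parsed payload would typically come from something like `requests.get(url).json()` instead.

```python
import json
import pandas as pd

# Hypothetical JSON payload, standing in for the body of an API response
payload = """
{"results": [
    {"id": 1, "name": "Ada", "location": {"city": "London"}},
    {"id": 2, "name": "Grace", "location": {"city": "New York"}}
]}
"""

records = json.loads(payload)["results"]  # list of (nested) dictionaries
users = pd.json_normalize(records)        # nested keys become dotted column names
print(users)
```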

**Databases** - Likewise, data often lives in relational and non-relational databases. Most databases have their own Python connectors, and a popular Python SQL toolkit is the [**SQLAlchemy project**](https://www.sqlalchemy.org/).
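As a small, self-contained sketch of the database round trip, the example below uses Python's built-in `sqlite3` module with an in-memory database. The table name and the data are made up for illustration; for other databases a SQLAlchemy engine would normally take the place of the `sqlite3` connection.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database, so nothing is written to disk
conn = sqlite3.connect(":memory:")

# Hypothetical sample resembling the worldstats data
stats = pd.DataFrame({
    "country": ["Arab World", "Arab World"],
    "year": [2015, 2014],
    "Population": [392022276.0, 384222592.0],
})

stats.to_sql("worldstats", conn, index=False)  # write the DataFrame to a table
recent = pd.read_sql(
    "SELECT country, year FROM worldstats WHERE year >= 2015", conn
)
conn.close()
print(recent)
```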

## Exercises
***

__1)__ Give Wikipedia a quick trawl and find an interesting page that contains at least one HTML table. Use the Pandas `read_html` function to import this into a Pandas DataFrame or Series.

Notice any issues with the data, incorrect column names or erroneous data? 

Now export this as a CSV/Excel file. We will attempt to clean up the DataFrame in a future module.

__2) Talking point__ What sources of data do you work with? What's the normal output of your work - Excel, Tableau?

# Recap
***

1. Pandas provides many top level reader functions for different data formats. 


2. `read_csv` and `read_table` are the main functions for reading general delimited files; they differ only in their default delimiter (comma versus tab).  


3. Reader functions accept both relative and absolute file references. For relative references, **.** is the current folder and **..** is the folder above the current folder. 


4. Reading/Writing functions provide multiple parameters to deal with real world data.


5. For greater control of reading/writing data to Excel, look into the `openpyxl` library.  


6. Many Web APIs and databases have high-level Python wrappers that make Python integration easy.
