**Table of contents**<a id='toc0_'></a>    
- [Interacting with CSV/TSV files (and other similar filetypes)](#toc1_1_)    
- [Interacting with Excel Files](#toc1_2_)    


# Reading data from different sources and writing data to different file formats

It's great that we can create Series and DataFrame objects from our own custom data, but in practice we will mostly be working with data that already exists. Also, after cleaning up a data file, we may want to export the cleaned data to another file for future use.

In [2]:
# import statements
import numpy as np
import pandas as pd

Pandas has a family of functions for ingesting data, and they all begin with
`read_`. Similarly, there are analogous exporting methods on the dataframe object. These exporting
methods start with `.to_`.
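
As a quick way to see this naming convention, the sketch below lists the matching names by inspecting the `pd` module and the `DataFrame` class. Note that not every `to_` method writes a file (for example, `to_numpy` and `to_dict` convert in memory), so treat the second list as a superset of the exporters.

```python
import pandas as pd

# Top-level reader functions, e.g. read_csv, read_excel, read_json
readers = sorted(name for name in dir(pd) if name.startswith("read_"))

# DataFrame methods sharing the "to_" prefix, e.g. to_csv, to_excel, to_json
writers = sorted(name for name in dir(pd.DataFrame) if name.startswith("to_"))

print(readers)
print(writers)
```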

### <a id='toc1_1_'></a>[Interacting with CSV/TSV files (and other similar filetypes)](#toc0_)

> Reading from a **CSV** file: `pd.read_csv(filepath, sep, delimiter=None, index_col=None, dtype=None, na_values=None)`

<u>Function Parameters</u>

- `filepath:` Path to the file to read. A valid URL can also be passed.
- `sep:` Separator (e.g. for tsv files, sep='\t'). Separators longer than 1 character and different from '\s+' will be interpreted as regular expressions.
- `index_col:` Column(s) to use as the row labels of the DataFrame, either given as string (name) or column index. If a sequence of int / str is given, a MultiIndex is used.
- `dtype:` Data type to apply to the whole dataset or, when given as a dict of `{column: type}`, to individual columns.
- `na_values:` Additional strings to recognize as NA/NaN.
- `parse_dates:` Controls which columns are parsed as dates:
    - bool. If True, try parsing the index.
    - list of int or names, e.g. [1, 2, 3]: try parsing columns 1, 2, 3 each as a separate date column.
    - list of lists, e.g. [[1, 3]]: combine columns 1 and 3 and parse them as a single date column.
    - dict, e.g. {'foo': [1, 3]}: parse columns 1 and 3 as a date and call the result 'foo'.
    - Note that the two combining forms (list of lists and dict) are deprecated as of pandas 2.2; in newer versions, load the columns first and combine them with `pd.to_datetime`.
- `date_format:` Format string to use for parsing dates (e.g. `'%Y-%m-%d'`).
- `chunksize:` Set this to load the data in chunks instead of all at once, which is especially useful for large files. If specified, `read_csv` returns an iterator where `chunksize` is the number of rows in each chunk (each chunk is itself a dataframe). We can loop over this iterator to process the data in chunks, e.g. `for chunk in pd.read_csv('data.csv', chunksize=1000): process(chunk)`.
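
The parameters above can be tried out without any file on disk. The following is a minimal sketch that reads a small, made-up tab-separated table from an in-memory buffer; the column names (`day`, `station`, `snow`) and the `"missing"` marker are hypothetical and only serve the illustration.

```python
import io

import pandas as pd

# Hypothetical tab-separated data; "missing" marks an absent reading
raw = (
    "day\tstation\tsnow\n"
    "2020-01-01\tA1\t12.5\n"
    "2020-01-02\tA1\tmissing\n"
    "2020-01-03\tB2\t7.0\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    sep="\t",                   # tab-separated values
    index_col="day",            # use the "day" column as row labels
    dtype={"snow": "float64"},  # per-column data type
    na_values=["missing"],      # extra string to treat as NaN
    parse_dates=["day"],        # parse "day" as datetimes
)
print(df)

# With chunksize, read_csv returns an iterator of DataFrames instead
chunk_sizes = [
    len(chunk) for chunk in pd.read_csv(io.StringIO(raw), sep="\t", chunksize=2)
]
print(chunk_sizes)
```

Here the three data rows are split into one chunk of two rows and one chunk of one row.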

`read_csv` can also read a zip file that contains a single csv/tsv file without extracting it first. If the zip file contains multiple files, however, it must be unzipped before use.
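
A minimal sketch of the single-file zip case, assuming the default `compression='infer'` behaviour, where pandas picks zip compression from the `.zip` extension on both write and read. The file name `example.zip` and the temporary directory are illustrative only.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3.5, 4.5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = os.path.join(tmp_dir, "example.zip")

    # Compression is inferred from the ".zip" extension when writing
    df.to_csv(zip_path, index=False)

    # ...and when reading: the single CSV inside the archive is read directly
    restored = pd.read_csv(zip_path)

print(restored)
```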

**Note:** One thing to be aware of is that by default, pandas will write the index values in a CSV, but when reading a CSV it will create a new index unless we specify a column for the index.
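
To make the index behaviour concrete, here is a small sketch using an in-memory CSV. The frame and its `key` index name are made up for the example.

```python
import io

import pandas as pd

df = pd.DataFrame({"value": [10, 20]}, index=pd.Index(["x", "y"], name="key"))

# With no path given, to_csv returns the CSV text; the index is written as the first column
csv_text = df.to_csv()

# Without index_col, the old index comes back as an ordinary column
# and a fresh 0..n-1 index is created
without_index = pd.read_csv(io.StringIO(csv_text))

# Naming the column restores it as the index
with_index = pd.read_csv(io.StringIO(csv_text), index_col="key")

print(without_index)
print(with_index)
```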

In [3]:
alta_df = pd.read_csv("./Data/alta-noaa-1980-2019.csv", parse_dates=["DATE"]).set_index(
    "DATE"
)
# parse_dates already gives the DATE index a datetime64 dtype, so no extra conversion is needed
alta_df.head()

| DATE | STATION | NAME | LATITUDE | LONGITUDE | ELEVATION | DAPR | DASF | MDPR | MDSF | PRCP | ... | SNWD | TMAX | TMIN | TOBS | WT01 | WT03 | WT04 | WT05 | WT06 | WT11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1980-01-01 | USC00420072 | ALTA, UT US | 40.5905 | -111.6369 | 2660.9 | NaN | NaN | NaN | NaN | 0.1 | ... | 29.0 | 38.0 | 25.0 | 25.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 1980-01-02 | USC00420072 | ALTA, UT US | 40.5905 | -111.6369 | 2660.9 | NaN | NaN | NaN | NaN | 0.43 | ... | 34.0 | 27.0 | 18.0 | 18.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 1980-01-03 | USC00420072 | ALTA, UT US | 40.5905 | -111.6369 | 2660.9 | NaN | NaN | NaN | NaN | 0.09 | ... | 30.0 | 27.0 | 12.0 | 18.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 1980-01-04 | USC00420072 | ALTA, UT US | 40.5905 | -111.6369 | 2660.9 | NaN | NaN | NaN | NaN | 0.0 | ... | 30.0 | 31.0 | 18.0 | 27.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 1980-01-05 | USC00420072 | ALTA, UT US | 40.5905 | -111.6369 | 2660.9 | NaN | NaN | NaN | NaN | 0.0 | ... | 30.0 | 34.0 | 26.0 | 34.0 | NaN | NaN | NaN | NaN | NaN | NaN |


> Writing to a **CSV** file: `df.to_csv(path_or_buf, sep, na_rep, encoding, date_format)`

<u>Function Parameters</u>

- path_or_buf : filepath.
- sep : delimiter for the output file (str, default `,`).
- na_rep : Missing data representation (str, default `''`).
- mode : Python write mode (str, default `w`).
- encoding: character encoding to use in the output file (str, default `utf-8`).
- date_format: format string for datetime format. 
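
As a small illustration of these parameters, the sketch below writes a made-up two-row frame to a string instead of a file (when `path_or_buf` is omitted, `to_csv` returns the CSV text). The column names and the `index=False` choice are assumptions made for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "when": pd.to_datetime(["2020-01-01", "2020-01-02"]),
        "snow": [12.5, None],
    }
)

text = df.to_csv(
    sep=";",                 # use a semicolon instead of a comma
    na_rep="NA",             # write missing values as "NA"
    date_format="%d/%m/%Y",  # day/month/year for datetime columns
    index=False,             # leave out the row index
)
print(text)
```

The missing `snow` value is written as `NA`, and the dates appear as `01/01/2020` and `02/01/2020`.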

In [4]:
alta_df.to_csv("./tmp/alta_df.csv", na_rep="nan")  # be careful, ./tmp not /tmp

In [5]:
df = pd.read_csv("./tmp/alta_df.csv", index_col="DATE", na_values="nan")
df.index = pd.to_datetime(df.index)  # conversion returns a new index, so assign it back
df.head()

| DATE | STATION | NAME | LATITUDE | LONGITUDE | ELEVATION | DAPR | DASF | MDPR | MDSF | PRCP | ... | SNWD | TMAX | TMIN | TOBS | WT01 | WT03 | WT04 | WT05 | WT06 | WT11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1980-01-01 | USC00420072 | ALTA, UT US | 40.5905 | -111.6369 | 2660.9 | NaN | NaN | NaN | NaN | 0.1 | ... | 29.0 | 38.0 | 25.0 | 25.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 1980-01-02 | USC00420072 | ALTA, UT US | 40.5905 | -111.6369 | 2660.9 | NaN | NaN | NaN | NaN | 0.43 | ... | 34.0 | 27.0 | 18.0 | 18.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 1980-01-03 | USC00420072 | ALTA, UT US | 40.5905 | -111.6369 | 2660.9 | NaN | NaN | NaN | NaN | 0.09 | ... | 30.0 | 27.0 | 12.0 | 18.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 1980-01-04 | USC00420072 | ALTA, UT US | 40.5905 | -111.6369 | 2660.9 | NaN | NaN | NaN | NaN | 0.0 | ... | 30.0 | 31.0 | 18.0 | 27.0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 1980-01-05 | USC00420072 | ALTA, UT US | 40.5905 | -111.6369 | 2660.9 | NaN | NaN | NaN | NaN | 0.0 | ... | 30.0 | 34.0 | 26.0 | 34.0 | NaN | NaN | NaN | NaN | NaN | NaN |


### <a id='toc1_2_'></a>[Interacting with Excel Files](#toc0_)

**Note:** You will have to make sure `openpyxl` is installed to use Excel support. Simply installing the pandas library usually will not install full Excel support.

In [6]:
# ! pip install openpyxl

> Reading from **Excel** file: `pd.read_excel(io, sheet_name, header, names, index_col, usecols, dtype, na_values, parse_dates)`

<u>Function Parameters</u>
- io : filepath.
- sheet_name : default 0 (i.e., the 1st sheet as a DataFrame). You can pass any of the following:
    - str: Strings are used for sheet names. 
    - int: Integers are used in zero-indexed sheet positions (chart sheets do not count as a sheet position).
    - Lists of strings/integers are used to request multiple sheets.
    - None: Specify None to get all worksheets.
- header : Row (0-indexed) to use for the column labels of the parsed DataFrame (int, default 0). If a list of integers is passed those row positions will be combined into a MultiIndex.
- names : List of column names to use. If file contains no header row, then you should explicitly pass header=None.
- index_col : Column (0-indexed) to use as the row labels of the DataFrame (int, default None). If a subset of data is selected with usecols, index_col is based on the subset.
- usecols : str, list-like, or callable, default None.
- dtype : Type name or, dict of {column: type}, default None. 
- na_values: Additional values to treat as NaN.
- parse_dates: Columns to treat as datetime objects.
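
The sketch below shows `sheet_name=None` in action: it first writes a small two-sheet workbook to a temporary directory, then reads every sheet back as a dict of DataFrames. The sheet names, columns, and file name are made up, and the example assumes `openpyxl` is installed; if it is not, the Excel part is skipped and `sheets` stays empty.

```python
import importlib.util
import os
import tempfile

import pandas as pd

# Hypothetical data for two sheets
frames = {
    "first": pd.DataFrame({"id": [1, 2], "score": [0.5, None]}),
    "second": pd.DataFrame({"id": [3], "score": [1.5]}),
}
sheets = {}

# Excel I/O needs openpyxl, so only run this part when it is available
if importlib.util.find_spec("openpyxl") is not None:
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "example.xlsx")

        with pd.ExcelWriter(path, engine="openpyxl") as writer:
            for name, frame in frames.items():
                frame.to_excel(writer, sheet_name=name, index=False)

        # sheet_name=None returns {sheet name: DataFrame} for all worksheets;
        # index_col=0 uses the first column ("id") as the row labels
        sheets = pd.read_excel(path, sheet_name=None, index_col=0)

for name, frame in sheets.items():
    print(name)
    print(frame)
```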

> Writing to **Excel** file: `df.to_excel(excel_writer, sheet_name, na_rep, columns)`

<u>Function Parameters</u>
- excel_writer : To write a single object to an Excel .xlsx file it is only necessary to specify a target file name. To write to multiple sheets it is necessary to create an `ExcelWriter` object with a target file name, and specify a sheet in the file to write to.
- sheet_name : Name of sheet which will contain the DataFrame.
- na_rep : Missing data representation (str, default `''`).
- columns : Columns to write (optional, sequence or list of str).

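**Note:** If the dataframe contains any timezone-aware datetime columns, we first need to strip the timezone information before exporting to Excel, e.g. with `df.datetime_col.dt.tz_convert(tz=None)` (converts to UTC, then drops the timezone) or `df.datetime_col.dt.tz_localize(None)` (drops the timezone but keeps the local clock time).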
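
There are two common ways to drop timezone information, and they leave different values behind. The sketch below compares them on a made-up series; the `America/Denver` timezone and the timestamps are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical timezone-aware timestamps (8 AM local time in Denver)
aware = pd.Series(
    pd.to_datetime(["2012-01-01 08:00", "2012-07-01 08:00"])
).dt.tz_localize("America/Denver")

# tz_convert(None): convert to UTC first, then drop the timezone
utc_naive = aware.dt.tz_convert(None)

# tz_localize(None): drop the timezone but keep the local wall-clock time
local_naive = aware.dt.tz_localize(None)

print(utc_naive)
print(local_naive)
```

Which one to use depends on whether the Excel file should show UTC or local times; either result is timezone-naive and can be passed to `to_excel`.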

In [25]:
# writing alta_df data of 2012 in a sheet named '2012' and data of 2013 in a sheet named '2013'

alta_df_2012 = alta_df[
    alta_df.index.year == 2012
]  # or, alta_df.loc['2012': '2012-12-31']
alta_df_2013 = alta_df[alta_df.index.year == 2013]

# the context manager saves and closes the workbook on exit;
# without it the file is never written to disk
with pd.ExcelWriter("./tmp/alta_df.xlsx", engine="openpyxl") as writer:
    alta_df_2012.to_excel(excel_writer=writer, sheet_name="2012")
    alta_df_2013.to_excel(excel_writer=writer, sheet_name="2013")