# Data I/O

- Data input/output (I/O) is a key part of pandas, as you will often be obtaining data which must be imported into pandas, or exporting data you have cleaned in pandas.
- This data can come in a variety of file formats, although most of the time it will come from only a few of them, which we cover here.
- We will mainly cover importing from and exporting to CSV and XLSX files, which are the most common forms of data, with a brief look at JSON.
- There are other methods, such as those for HTML files and SQL tables, which you can find out about as and when you need them.
- The methods are all named read_ or to_ something: the read_ functions belong to pandas itself (pd.read_csv), while the to_ methods belong to a data frame (df.to_csv). Both take numerous optional arguments that you can specify as they are required.
- The essential argument is the filename as a string: the name alone is enough if the file is in the current working directory (usually the folder your notebook is in), otherwise the file path must be specified.
- When writing files, index=False is also useful. By default pandas writes the index out as an extra column, and because it automatically adds a fresh index when it reads in a file, the old index would otherwise come back as a redundant column.
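
The effect of index=False can be seen in a short round trip. This is a minimal sketch: the data frame and its column names are made up for illustration, and StringIO stands in for a real file on disk.

```python
import io

import pandas as pd

# a small made-up data frame; the column names are illustrative only
df = pd.DataFrame({"name": ["Ann", "Bob"], "score": [90, 85]})

# with no filename, to_csv returns the CSV text instead of writing a file
with_index = df.to_csv()                # default: the index is written too
without_index = df.to_csv(index=False)  # only the real columns are written

# read each version back in (StringIO stands in for a file on disk)
df_with = pd.read_csv(io.StringIO(with_index))
df_without = pd.read_csv(io.StringIO(without_index))

print(list(df_with.columns))     # ['Unnamed: 0', 'name', 'score']
print(list(df_without.columns))  # ['name', 'score']
```

The 'Unnamed: 0' column is the old index, read back in as if it were ordinary data.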

## CSV
- CSV __(comma-separated values)__ files are a very common way to store data.
- In their most common form they are plain text: a series of values separated by commas, as the name would indicate.
- All of the data for a single observation is on one line: each new line is a new observation.
- Hence the number of values in a line is the number of columns, and the number of lines is the number of rows.
- The first line often holds the column names, but it does not have to.
- The comma in this case is called the __'delimiter'__, as it marks the boundary (or limit) between one value and the next.
- Other common delimiters are semi-colons and tabs (tab-delimited files are also called TSV, tab-separated values).
- Data from mainland European countries (France, Spain etc.) often uses semi-colons, because the comma is used there as the decimal mark. Hence some people prefer to read CSV as __character__-separated values.
- We must be careful to check exactly what the delimiter is. A common error is reading in a file with the wrong delimiter, which often leaves each whole line crammed into a single column. read_csv assumes a comma unless we pass the sep argument.
- CSVs can also be read by Excel.
- The read_csv function only requires the CSV format, not necessarily the .csv filename extension, so it can read in .txt files in the same format.
- When using it we can give only the filename if the file is in the current working directory, but we must give the path (relative or absolute) if it is in another directory. The same applies to any file read or written using Python.
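
The points above about delimiters and header lines can be sketched as follows. The city and temperature data is made up, and StringIO stands in for a file on disk; sep, header and names are standard read_csv arguments.

```python
import io

import pandas as pd

# made-up semi-colon-delimited data, with column names on the first line
raw = "city;temp\nParis;18\nMadrid;24\n"

# wrong delimiter: read_csv looks for commas, finds none,
# and puts each whole line into a single column
wrong = pd.read_csv(io.StringIO(raw))

# right delimiter: sep tells read_csv what separates the values
right = pd.read_csv(io.StringIO(raw), sep=";")

# the same data with no header line: header=None stops the first row
# being used as column names, and names supplies them instead
no_header = pd.read_csv(
    io.StringIO("Paris;18\nMadrid;24\n"),
    sep=";",
    header=None,
    names=["city", "temp"],
)

print(wrong.shape)      # (2, 1)
print(right.shape)      # (2, 2)
print(no_header.shape)  # (2, 2)
```

For tab-separated files the same call works with sep="\t".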

The syntax for reading in a CSV to pandas is thus:

In [None]:
# import pandas under its usual alias
import pandas as pd

# read_csv returns a data frame, which we save to a variable
df = pd.read_csv('<filename>')

# to_csv is a method of the data frame
df.to_csv('<filename>', index=False)
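
As a rough illustration of the points about paths and extensions, the sketch below writes a data frame to a .txt file in another folder and reads it back. The temporary folder and the file name example.txt are assumptions made so that the example runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

# a small made-up data frame
df = pd.DataFrame({"id": [1, 2], "value": [10, 20]})

# a temporary folder stands in for "another directory"
with tempfile.TemporaryDirectory() as folder:
    # the .txt extension is deliberate: only the contents need to be CSV
    path = Path(folder) / "example.txt"
    df.to_csv(path, index=False)

    # read_csv accepts the path as a string or as a pathlib.Path
    df_back = pd.read_csv(path)

print(df_back)
```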

## XLSX
- .xlsx is the file format for Microsoft Excel spreadsheets.
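- This does not require much explanation, except to say that pandas cannot read in formulas, macros or graphs, only the data in the cells.
- Also, we should specify the sheet to read in as a data frame or write to, using the sheet_name argument of the read_excel and to_excel methods. If we leave it out, read_excel reads the first sheet and to_excel writes to a sheet called 'Sheet1'.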

In [None]:
# read_excel has the same file stipulations as all read_ methods
df = pd.read_excel('<filename>', sheet_name='<sheetname>')

# remember to specify the sheet name with Excel files; index=False leaves out the index column
df.to_excel('<filename>', sheet_name='<sheetname>', index=False)
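
A runnable version of the same round trip might look like the sketch below. It assumes the openpyxl package is installed, since pandas relies on it for .xlsx files, and it skips the round trip otherwise. The sheet name "scores" and the file name are made up.

```python
import importlib.util
import tempfile
from pathlib import Path

import pandas as pd

# a small made-up data frame
df = pd.DataFrame({"id": [1, 2], "value": [10, 20]})

# pandas hands .xlsx files to the openpyxl package, so check it is installed
have_openpyxl = importlib.util.find_spec("openpyxl") is not None

df_back = None
if have_openpyxl:
    with tempfile.TemporaryDirectory() as folder:
        path = Path(folder) / "example.xlsx"
        # "scores" is a made-up sheet name
        df.to_excel(path, sheet_name="scores", index=False)
        # read the same sheet back in by name
        df_back = pd.read_excel(path, sheet_name="scores")
    print(df_back)
else:
    print("openpyxl is not installed, so the Excel round trip was skipped")
```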

## JSON
- JSON (JavaScript Object Notation) is a text file format that stores data in a way that is easily readable by both humans and machines.
- It stores data in two kinds of structure: arrays (somewhat like lists) and objects (somewhat like dictionaries).
- In fact, Jupyter Notebook .ipynb files are stored in JSON format.

In [None]:
# json: read_json and to_json follow the same pattern as the CSV methods
df = pd.read_json('<filename>')
df.to_json('<filename>')
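
To make the arrays-and-objects idea concrete, here is a small sketch using made-up data. The orient="records" option is only one of several layouts to_json supports; it is chosen here because it gives one object per row. StringIO stands in for a file on disk.

```python
import io
import json

import pandas as pd

# a small made-up data frame
df = pd.DataFrame({"name": ["Ann", "Bob"], "score": [90, 85]})

# with no filename, to_json returns the JSON text;
# orient="records" writes an array of objects, one object per row
text = df.to_json(orient="records")
print(text)  # [{"name":"Ann","score":90},{"name":"Bob","score":85}]

# parsed with the standard library, the array becomes a list
# and each object becomes a dictionary
parsed = json.loads(text)

# read_json turns the text back into a data frame
df_back = pd.read_json(io.StringIO(text), orient="records")
print(df_back)
```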

In [None]:
# xml (read_xml needs pandas 1.3 or later)
df = pd.read_xml('<filename>')

In [None]:
# Intro to CSV XLSX JSON XML