pandas can read files in multiple formats, including:
1. csv
2. table
3. fwf (fixed-width column format i.e., no delimiters)
4. clipboard
5. excel
6. hdf (HDF5 file)
7. html
8. json (JavaScript Object Notation)
9. msgpack (pandas data encoded using MessagePack binary format; `read_msgpack` was removed in pandas 1.0)
10. pickle (an arbitrary object stored in Python pickle format)
11. sas
12. sql
13. stata
14. feather

Some functions like `pandas.read_csv` perform *type inference* because the column data types are not part of the data format.
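
As a minimal sketch of type inference, the snippet below parses a small in-memory CSV (the `csv_text` string is hypothetical data standing in for a file on disk) and inspects the dtypes pandas chose for each column:

```python
from io import StringIO

import pandas as pd

# Hypothetical data: one integer column, one float column, one text column.
csv_text = "a,b,c\n1,2.5,x\n3,4.5,y\n"

df = pd.read_csv(StringIO(csv_text))

# The CSV text carries no type information, so pandas infers a dtype per column.
print(df.dtypes)
```

The exact dtype reported for the text column depends on the pandas version, so the sketch only prints the result rather than assuming it.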

Calling `pd.read_csv('file_path')` on comma-delimited data is equivalent to `pd.read_table('file_path', sep = ',')`.

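When a file doesn't have a header row, pass `header = None` to let pandas assign default column names, or specify the names yourself with `names = []`. Without either option, the first data row is consumed as the header.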
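
The two options for a headerless file can be sketched as follows, assuming a small in-memory CSV (the `raw` string and the column names are made up for illustration):

```python
from io import StringIO

import pandas as pd

# Hypothetical headerless data: every line is a data row.
raw = "1,2,3,hello\n5,6,7,world\n"

# Option 1: let pandas assign default integer column names 0, 1, 2, 3.
default_names = pd.read_csv(StringIO(raw), header=None)

# Option 2: supply the column names explicitly.
custom_names = pd.read_csv(StringIO(raw), names=['a', 'b', 'c', 'message'])

print(default_names)
print(custom_names)
```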

To specify a column to be the index of the returned DataFrame, use `index_col = 'the_column_name'`.

To form a hierarchical index from multiple columns, use `index_col = ['column_1', 'column_2']`.
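
A short sketch of a hierarchical index built from two columns, assuming hypothetical column names `key1` and `key2` and in-memory data:

```python
from io import StringIO

import pandas as pd

# Hypothetical data with two key columns and two value columns.
raw = (
    "key1,key2,value1,value2\n"
    "one,a,1,2\n"
    "one,b,3,4\n"
    "two,a,5,6\n"
)

# Passing a list to index_col produces a two-level MultiIndex.
df = pd.read_csv(StringIO(raw), index_col=['key1', 'key2'])

print(df)
print(df.loc[('one', 'b'), 'value1'])  # look up one cell by both index levels
```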

If the fields are separated by a variable amount of whitespace, pass a regular expression as the delimiter, e.g. `sep = r'\s+'`.

`skiprows = []` skips the rows whose (0-based) line numbers are listed.
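
The whitespace delimiter and `skiprows` can be combined. The sketch below assumes a hypothetical file whose first line is a comment and whose columns are padded with uneven runs of spaces:

```python
from io import StringIO

import pandas as pd

# Hypothetical data: line 0 is a comment, fields are separated by uneven whitespace.
raw = (
    "# comment line to skip\n"
    "A    B   C\n"
    "1  2      3\n"
    "4     5 6\n"
)

# sep is a regular expression matching one or more whitespace characters;
# skiprows=[0] drops the first physical line before parsing.
df = pd.read_csv(StringIO(raw), sep=r'\s+', skiprows=[0])

print(df)
```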

Missing data is usually either not present (empty string) or marked by some *sentinel* value (e.g. NA, NULL).
1. Use `pd.isnull()` to check for missing data.
2. The `na_values` option takes a list or set of strings to treat as missing values, e.g. `na_values = ['NULL']`.
3. Different NA sentinels can be specified for each column in a dict, e.g. `dict1 = {'column1':['value1', 'value2'], 'column2':['value3']}`, which is then passed as `na_values = dict1`.
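
The per-column sentinels can be sketched like this. The column names and sentinel strings are hypothetical, chosen only to show that each column gets its own set of missing-value markers:

```python
from io import StringIO

import pandas as pd

# Hypothetical data: 'two' and 'foo' are ordinary strings unless declared as sentinels.
raw = (
    "something,a,message\n"
    "one,1,foo\n"
    "two,2,NA\n"
    "three,3,world\n"
)

# Each key is a column name; each value lists the strings to treat as missing there.
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}

df = pd.read_csv(StringIO(raw), na_values=sentinels)

# Boolean mask marking which cells were parsed as missing.
missing = pd.isnull(df)
print(df)
print(missing)
```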

### Reading Text Files in Pieces

If we don't want to look at the whole dataset every time we read a file, we can make the pandas display settings more compact, e.g. `pd.options.display.max_rows = 10`, or read only the first few rows by passing `nrows = ` to `read_csv`.
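
The difference between the two approaches is that the display option only limits what is printed, while `nrows` limits what is parsed. A minimal sketch, assuming a generated 100-row CSV in place of a real file:

```python
from io import StringIO

import pandas as pd

# Hypothetical 100-row dataset built in memory.
raw = "a,b\n" + "".join(f"{i},{i * 2}\n" for i in range(100))

# Display setting: long frames are truncated when printed, but all rows are still loaded.
pd.options.display.max_rows = 10
full = pd.read_csv(StringIO(raw))
print(full)

# nrows: only the first 5 data rows are read from the source at all.
head = pd.read_csv(StringIO(raw), nrows=5)
print(head)
```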

To read a file in pieces, pass the number of rows per piece as `chunksize = `.

The *TextParser* object (named `TextFileReader` in recent pandas versions) returned by *read_csv* allows you to iterate over the parts of the file according to the *chunksize*. For example, we can iterate over **ex6.csv**, aggregating the value counts in the 'key' column like:

In [1]:
import pandas as pd
chunker = pd.read_csv('ex6.csv', chunksize = 1000)

In [2]:
tot = pd.Series([], dtype = 'float64') # explicit dtype avoids the empty-Series dtype warning
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value = 0)

tot = tot.sort_values(ascending = False)

In [3]:
tot[:10]

E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
M    338.0
J    337.0
F    335.0
K    334.0
H    330.0
dtype: float64

*TextParser* is also equipped with a *get_chunk* method that enables you to read pieces of an arbitrary size.
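
A minimal sketch of `get_chunk`, assuming a small generated CSV (hypothetical data) in place of **ex6.csv**. Each call continues from where the previous one stopped:

```python
from io import StringIO

import pandas as pd

# Hypothetical 10-row dataset built in memory.
raw = "key,value\n" + "".join(f"k{i % 3},{i}\n" for i in range(10))

reader = pd.read_csv(StringIO(raw), chunksize=4)

# Ask for an arbitrary number of rows, independent of chunksize.
first = reader.get_chunk(2)

# With no argument, get_chunk falls back to the chunksize given above.
second = reader.get_chunk()

print(first)
print(second)
```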

### Writing Data to Text Format

The `to_csv` method writes a DataFrame out to a comma-separated file.

Pass `sys.stdout` instead of a file path to print the text result to the console (remember to `import sys` first); other delimiters, such as "|", can be set with `sep = '|'`.

Missing values appear as empty strings in the output by default; to denote them by another sentinel value, use `na_rep = 'NULL'`.

Disable writing of row and column labels with `index = False` and `header = False`.

Series has a `to_csv` method as well.
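
The options above can be sketched together. The DataFrame below is hypothetical, and the second call relies on `to_csv` returning the text as a string when no path is given:

```python
import sys

import pandas as pd

# Hypothetical frame with one missing value in column 'b'.
df = pd.DataFrame({'a': [1, 2], 'b': [float('nan'), 4.0], 'c': ['x', 'y']})

# Print to the console with a pipe delimiter and an explicit missing-value marker.
df.to_csv(sys.stdout, sep='|', na_rep='NULL')

# With no path argument, to_csv returns the CSV text; here without row or column labels.
text = df.to_csv(index=False, header=False, na_rep='NULL')
print(text.splitlines())
```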

### Working with Delimited Formats

For any file with a single-character delimiter, we can use Python's built-in *csv* module:

In [4]:
import csv
f = open('ex7.csv')

reader = csv.reader(f)

In [5]:
for line in reader:
    print(line)

['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3']


In [8]:
with open('ex7.csv') as f:
    lines = list(csv.reader(f)) # read the file into a list of lines

In [9]:
header, values = lines[0], lines[1:] # split lines into header and data

In [10]:
data_dict = {h: v for h, v in zip(header, zip(*values))} # zip transposes rows to columns
data_dict

{'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}

To define a customized csv format (a different delimiter, quoting convention, or line terminator), simply make a subclass of `csv.Dialect`:

In [None]:
class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL

reader = csv.reader(f, dialect = my_dialect) # f must be an open file object

For files with more complicated or multicharacter delimiters, we cannot use the *csv* module. Instead, we have to do the line splitting and other cleanup ourselves, using the string's *split* method or the regular expression function `re.split`.
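
As a sketch of the manual approach, the snippet below splits lines on a made-up delimiter convention (either `;;` or `|`, with optional surrounding whitespace). The sample lines and the pattern are assumptions for illustration only:

```python
import re

# Hypothetical lines that mix two delimiters and stray whitespace.
lines = ["a ;; b|c", "1;;2 | 3"]

# Match ';;' or '|' together with any whitespace around it.
pattern = re.compile(r'\s*(?:;;|\|)\s*')

# Strip each line, then split it into fields on the pattern.
rows = [pattern.split(line.strip()) for line in lines]

header, values = rows[0], rows[1:]
print(header)
print(values)
```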

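To write delimited files manually, use `csv.writer`. It accepts an open, writable file object and the same dialect and format options as `csv.reader`: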

In [None]:
with open('mydata.csv', 'w', newline = '') as f: # newline = '' lets the csv module control line endings
    writer = csv.writer(f, dialect = my_dialect)
    writer.writerow(('X', 'X', 'X'))

### JSON Data

JSON has become one of the standard formats for sending data by HTTP request between web browsers and other applications. To convert a JSON string to Python form, use `json.loads`:

In [1]:
obj = """
{"name": "Wes", 
 "places_lived": ["United States", "Spain", "Germany"], 
 "pet": null, 
 "siblings": [{"name": "Scott", "age": 30, "pets": ["Zeus", "Zuko"]}, 
              {"name": "Katie", "age": 38, "pets": ["Sixes", "Stache", "Cisco"]}] 
} 
"""

In [2]:
import json
result = json.loads(obj)
result

{'name': 'Wes',
 'places_lived': ['United States', 'Spain', 'Germany'],
 'pet': None,
 'siblings': [{'name': 'Scott', 'age': 30, 'pets': ['Zeus', 'Zuko']},
  {'name': 'Katie', 'age': 38, 'pets': ['Sixes', 'Stache', 'Cisco']}]}

In [None]:
asjson = json.dumps(result) # convert Python to JSON

`pandas.read_json` converts JSON datasets into a Series or DataFrame. By default, it assumes that each object in the JSON array is a row in the table.

To export data from pandas to JSON, use the `to_json` method.
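
A round trip through `read_json` and `to_json` can be sketched as follows. The JSON text is hypothetical, and `orient='records'` is one of several layouts `to_json` supports:

```python
import json
from io import StringIO

import pandas as pd

# Hypothetical JSON array: each object becomes one row.
json_text = '[{"a": 1, "b": 2, "c": 3}, {"a": 4, "b": 5, "c": 6}]'

df = pd.read_json(StringIO(json_text))
print(df)

# Default layout for a DataFrame: one JSON object per column, keyed by row label.
as_columns = df.to_json()

# Row-oriented layout: a JSON array with one object per row.
as_records = df.to_json(orient='records')

print(json.loads(as_columns))
print(json.loads(as_records))
```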

### XML and HTML: Web Scraping

Libraries for HTML & XML:
1. *lxml* - much faster in general
2. *Beautiful Soup* & *html5lib* - handle malformed files better

`pandas.read_html` by default searches for and attempts to parse all tabular data contained within *table* tags. The result is a list of DataFrame objects.

#### Parsing XML with lxml.objectify

Using `lxml.objectify` we can parse the file and get a reference to the root node of the XML file with `getroot`:

In [None]:
from lxml import objectify

path = 'file path'
parsed = objectify.parse(open(path))
root = parsed.getroot()

`root.INDICATOR` returns a generator yielding each `<INDICATOR>` XML element. For each record, we can populate a dict of tag names to data values:

In [None]:
data = []

skip_fields = ['PARENT_SEQ', 'INDICATOR_SEQ',
              'DESIRED_CHANGE', 'DECIMAL_PLACES']

for elt in root.INDICATOR:
    el_data = {}
    for child in elt.getchildren():
        if child.tag in skip_fields:
            continue
        el_data[child.tag] = child.pyval
    data.append(el_data)

XML data can get much more complicated than this, since each tag can also carry metadata in its attributes. Consider an HTML link tag, which is also valid XML:

In [None]:
from io import StringIO
tag = '<a href="http://www.google.com">Google</a>'
root = objectify.parse(StringIO(tag)).getroot()

After this, `root.get('href')` returns the value of the tag's `href` attribute (`'http://www.google.com'`) and `root.text` returns the link text (`'Google'`).