In [1]:
import numpy as np
import pandas as pd
import requests
np.random.seed(12345)

In [2]:
# Helper class to display several DataFrames side by side

class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
    
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)

##### Documentation of read_csv function
```Python
pandas.read_csv(
    filepath_or_buffer, sep=NoDefault.no_default, delimiter=None, header='infer', names=NoDefault.no_default,
    index_col=None, usecols=None, squeeze=None, prefix=NoDefault.no_default, mangle_dupe_cols=True, 
    dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, 
    skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, 
    verbose=False, skip_blank_lines=True, parse_dates=None, infer_datetime_format=False, keep_date_col=False,
    date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer',
    thousands=None, decimal='.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None,
    comment=None, encoding=None, encoding_errors='strict', dialect=None, error_bad_lines=None, warn_bad_lines=None,
    on_bad_lines=None, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None,
    storage_options=None)
```

Documentation for [pandas.read_csv](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas.read_csv)


## Reading and Writing Data in Text Format

pandas features a number of functions for reading tabular data as a DataFrame object. 

**Parsing functions in pandas**

| Function | Description |
| :--- | :--- |
| read_csv | Load delimited data from a file, URL, or file-like object; use comma as default delimiter |
| read_table | Load delimited data from a file, URL, or file-like object; use tab ('\t') as default delimiter |
| read_fwf | Read data in fixed-width column format (i.e., no delimiters) |
| read_clipboard | Version of read_table that reads data from the clipboard; useful for converting tables from web pages |
| read_excel | Read tabular data from an Excel XLS or XLSX file |
| read_hdf | Read HDF5 files written by pandas |
| read_html | Read all tables found in the given HTML document |
| read_json | Read data from a JSON (JavaScript Object Notation) string representation |
| read_msgpack | Read pandas data encoded using the MessagePack binary format (removed in pandas 1.0) |
| read_pickle | Read an arbitrary object stored in Python pickle format |
| read_sas | Read a SAS dataset stored in one of the SAS system’s custom storage formats |
| read_sql | Read the results of a SQL query (using SQLAlchemy) as a pandas DataFrame |
| read_stata | Read a dataset from Stata file format |
| read_feather | Read the Feather binary file format |

The optional arguments for these functions may fall into a few categories:
* **Indexing**: Can treat one or more columns as the index of the returned DataFrame, and controls whether to get column names from the file, from the user, or not at all.
* **Type inference and data conversion**: Includes user-defined value conversions and custom lists of missing value markers.
* **Datetime parsing**: Includes the ability to combine date and time information spread over multiple columns into a single column in the result.
* **Iterating**: Support for iterating over chunks of very large files. 
* **Unclean data issues**: Skipping rows or a footer, comments, or other minor things like numeric data with thousands separated by commas.
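
As a rough illustration of several of these categories at once, the sketch below reads a small in-memory CSV. The sample data and the column names (`id`, `date`, `amount`, `status`) are invented for this example and are not part of the chapter's data files.

```python
import io
import pandas as pd

# Hypothetical sample data, invented for illustration only.
raw = io.StringIO(
    "id,date,amount,status\n"
    'a1,2021-01-05,"1,200",ok\n'
    'a2,2021-01-06,"3,450",missing\n'
)

df_options = pd.read_csv(
    raw,
    index_col="id",                     # indexing: use a column as the index
    parse_dates=["date"],               # datetime parsing
    thousands=",",                      # unclean data: "1,200" -> 1200
    na_values={"status": ["missing"]},  # custom missing-value marker
)
print(df_options)
print(df_options.dtypes)
```

Each keyword maps to one of the categories above, so a file that needs only one kind of cleanup usually needs only the matching argument.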

Because of how messy data in the real world can be, some of the data loading functions (especially read_csv) have grown very complex in their options over time. It’s normal to feel overwhelmed by the number of different parameters (read_csv has more than 50). The online pandas documentation has many examples of how each of them works, so if you’re struggling to read a particular file, there might be a similar enough example to help you find the right parameters.

Some of these functions, like pandas.read_csv, perform type inference, because the column data types are not part of the data format. That means you don’t necessarily have to specify which columns are numeric, integer, boolean, or string. Other data formats, like HDF5, Feather, and msgpack, have the data types stored in the format.
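
A minimal sketch of type inference, using an invented in-memory CSV rather than one of the chapter's files. The column names are hypothetical. The second call shows that the `dtype` argument can override what pandas infers.

```python
import io
import pandas as pd

# Hypothetical sample data, invented for illustration only.
text = "n,x,flag,label\n1,0.5,True,a\n2,1.5,False,b\n"

# No types are specified: pandas infers one per column.
inferred = pd.read_csv(io.StringIO(text))
print(inferred.dtypes)

# Override the inferred type for a single column.
overridden = pd.read_csv(io.StringIO(text), dtype={"n": "float64"})
print(overridden.dtypes)
```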

Handling dates and other custom types can require extra effort. Let’s start with a small comma-separated (CSV) text file:

In [3]:
# Since this is comma-delimited, we can use read_csv to read it into a DataFrame
filepath='C:/MyLearn/DataSet/Pandas/Book_CH06/ex1.csv'
df = pd.read_csv(filepath)
df

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [4]:
# The same file can be read with read_table by specifying the delimiter
filepath='C:/MyLearn/DataSet/Pandas/Book_CH06/ex1.csv'
df = pd.read_table(filepath, sep=',')
df

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


A file will not always have a header row. To read such a file, you have a couple of options. You can allow pandas to assign default column names, or you can specify names yourself.

In [5]:
# Files with no header.

filepath='C:/MyLearn/DataSet/Pandas/Book_CH06/ex2.csv'
df = pd.read_csv(filepath, header=None)
df

Unnamed: 0,0,1,2,3,4
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [6]:
# File with no header: supply the column names manually

filepath='C:/MyLearn/DataSet/Pandas/Book_CH06/ex2.csv'
df = pd.read_csv(filepath, names=['a','b','c','d','message'])
df

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


Suppose you wanted the message column to be the index of the returned DataFrame. You can either indicate that you want the column at index 4 or the column named 'message' using the index_col argument:

In [7]:
filepath='C:/MyLearn/DataSet/Pandas/Book_CH06/ex2.csv'
names=['a','b','c','d','message']

df = pd.read_csv(filepath, names=names, index_col='message')
df

Unnamed: 0_level_0,a,b,c,d
message,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
hello,1,2,3,4
world,5,6,7,8
foo,9,10,11,12


In the event that you want to form a hierarchical index from multiple columns, pass a list of column numbers or names:

In [8]:
filepath='C:/MyLearn/DataSet/Pandas/Book_CH06/csv_mindex.csv'

parsed = pd.read_csv(filepath, index_col=['key1','key2'])
parsed

Unnamed: 0_level_0,Unnamed: 1_level_0,value1,value2
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16


In some cases, a table might not have a fixed delimiter, using whitespace or some other pattern to separate fields. Consider a text file that looks like this:
```
' A B C\n',
'aaa -0.264438 -1.026059 -0.619500\n',
'bbb 0.927272 0.302904 -0.032399\n',
'ccc -0.264273 -0.386314 -0.217601\n',
'ddd -0.871858 -0.348382 1.100491\n'
```

While you could do some munging by hand, the fields here are separated by a variable amount of whitespace. In these cases, you can pass a regular expression as a delimiter for read_table. Here the separator can be expressed by the regular expression \s+, so we have:

In [9]:
filepath='C:/MyLearn/DataSet/Pandas/Book_CH06/ex3.txt'
df = pd.read_table(filepath, sep=r"\s+")  # raw string avoids an invalid-escape warning
df

Unnamed: 0,A,B,C
aaa,-0.264438,-1.026059,-0.6195
bbb,0.927272,0.302904,-0.032399
ccc,-0.264273,-0.386314,-0.217601
ddd,-0.871858,-0.348382,1.100491


In `ex3.txt`, the header row has one fewer column name than the data rows have columns, so `read_table` infers that the first column should be the DataFrame’s index in this special case.

#### Handling Unwanted Rows

The example file `ex4.csv` contains comment lines that need to be skipped when reading the data. The comments are on lines 0, 2, and 3 (counting from zero).

```
# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
```

In [10]:
filepath='C:/MyLearn/DataSet/Pandas/Book_CH06/ex4.csv'

pd.read_csv(filepath, skiprows=[0, 2, 3])

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo
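
As an alternative to listing row numbers, `read_csv` also accepts a `comment` argument that drops everything from a given character to the end of the line. The sketch below is not from the original notebook: it assumes the comment lines all start with `#` and reproduces the file contents shown above in memory.

```python
import io
import pandas as pd

# Same contents as the ex4.csv listing above, held in memory.
ex4_text = (
    "# hey!\n"
    "a,b,c,d,message\n"
    "# just wanted to make things more difficult for you\n"
    "# who reads CSV files with computers, anyway?\n"
    "1,2,3,4,hello\n"
    "5,6,7,8,world\n"
    "9,10,11,12,foo\n"
)

# Lines that consist only of a comment are ignored entirely.
no_comments = pd.read_csv(io.StringIO(ex4_text), comment="#")
print(no_comments)
```

This approach does not depend on knowing where the comment lines are, but it would also truncate any data field that happens to contain `#`.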


#### Handling Missing Values

Handling missing values is an important and frequently nuanced part of the file parsing process. Missing data is usually either not present (an empty string) or marked by some sentinel value (for example, NA or ?). By default, pandas uses a set of commonly occurring sentinels, such as NA and NULL:

In [11]:
# Importing the data without any corrections

filepath='C:/MyLearn/DataSet/Pandas/Book_CH06/ex5.csv'
na_sentinels={
    'message':['NA','foo'],
    'something':['two']
}
Default_Behaviour = pd.read_csv(filepath)
Using_Sentinels = pd.read_csv(filepath, na_values=na_sentinels)
No_Filtering=pd.read_csv(filepath, na_values=na_sentinels, na_filter=False)
Default_NA_Filter_False_Sentinels_True = pd.read_csv(filepath, na_values=na_sentinels, keep_default_na=False)
display('Default_Behaviour','Using_Sentinels','No_Filtering','Default_NA_Filter_False_Sentinels_True')

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,,5,6,,8,world
2,three,9,10,11.0,12,

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,,5,6,,8,world
2,three,9,10,11.0,12,


Different NA sentinels can be specified for each column in a dict:

In [12]:
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
filepath='C:/MyLearn/DataSet/Pandas/Book_CH06/ex5.csv'
pd.read_csv(filepath, na_values=sentinels)

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,,5,6,,8,world
2,three,9,10,11.0,12,
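
To see the effect of the default sentinels in isolation, the following sketch compares the default behaviour with `keep_default_na=False` on a small invented CSV. The data is hypothetical and is not one of the chapter's files.

```python
import io
import pandas as pd

# Hypothetical sample data: "NA", "NULL" and an empty field.
na_text = "a,b,c\n1,NA,x\n2,NULL,\n3,4,z\n"

# Default: NA, NULL and empty fields all become NaN.
with_defaults = pd.read_csv(io.StringIO(na_text))

# keep_default_na=False with no na_values: nothing is treated as missing.
without_defaults = pd.read_csv(io.StringIO(na_text), keep_default_na=False)

print(with_defaults.isna().sum())
print(without_defaults.isna().sum())
```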


## Reading Text Files in Pieces

When processing very large files or figuring out the right set of arguments to correctly process a large file, you may only want to read in a small piece of a file or iterate through smaller chunks of the file.

Before we look at a large file, we make the pandas display settings more compact:

In [13]:
pd.options.display.max_rows = 10

In [14]:
filepath='C:/MyLearn/DataSet/Pandas/Book_CH06/ex6.csv'
result = pd.read_csv(filepath)
result

Unnamed: 0,one,two,three,four,key
0,0.467976,-0.038649,-0.295344,-1.824726,L
1,-0.358893,1.404453,0.704965,-0.200638,B
2,-0.501840,0.659254,-0.421691,-0.057688,G
3,0.204886,1.074134,1.388361,-0.982404,R
4,0.354628,-0.133116,0.283763,-0.837063,Q
...,...,...,...,...,...
9995,2.311896,-0.417070,-1.409599,-0.515821,L
9996,-0.479893,-0.650419,0.745152,-0.646038,E
9997,0.523331,0.787112,0.486066,1.093156,K
9998,-0.362559,0.598894,-1.843201,0.887292,G


If you want to read only a small number of rows (avoiding reading the entire file), specify that with nrows:

In [15]:
filepath='C:/MyLearn/DataSet/Pandas/Book_CH06/ex6.csv'
result = pd.read_csv(filepath, nrows=5)
result

Unnamed: 0,one,two,three,four,key
0,0.467976,-0.038649,-0.295344,-1.824726,L
1,-0.358893,1.404453,0.704965,-0.200638,B
2,-0.50184,0.659254,-0.421691,-0.057688,G
3,0.204886,1.074134,1.388361,-0.982404,R
4,0.354628,-0.133116,0.283763,-0.837063,Q


To read a file in pieces, specify a chunksize as a number of rows:

In [16]:
filepath='C:/MyLearn/DataSet/Pandas/Book_CH06/ex6.csv'
result = pd.read_csv(filepath, chunksize=1000)
result

<pandas.io.parsers.readers.TextFileReader at 0x2bf18b18250>

The TextFileReader object returned by read_csv allows you to iterate over the parts of the file according to the chunksize. For example, we can iterate over ex6.csv, aggregating the value counts in the 'key' column like so:

In [17]:
chunker = pd.read_csv(filepath, chunksize=1000)
tot = pd.Series([], dtype="float64")
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)

tot = tot.sort_values(ascending=False)
tot[:10]

E    368.0
X    364.0
L    346.0
O    343.0
Q    340.0
M    338.0
J    337.0
F    335.0
K    334.0
H    330.0
dtype: float64

TextFileReader is also equipped with a get_chunk method that enables you to read pieces of an arbitrary size.
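
A short sketch of `get_chunk`, using an invented in-memory CSV instead of `ex6.csv`. The data and column names are hypothetical. Passing `iterator=True` returns the reader without fixing a chunk size, so each call can ask for a different number of rows.

```python
import io
import pandas as pd

# Hypothetical sample data: ten rows with a key and a value column.
chunk_text = "key,value\n" + "".join(f"k{i % 3},{i}\n" for i in range(10))

# The reader is used as a context manager so it is closed afterwards.
with pd.read_csv(io.StringIO(chunk_text), iterator=True) as reader:
    first_piece = reader.get_chunk(4)   # the first four rows
    second_piece = reader.get_chunk(3)  # the next three rows

print(first_piece)
print(second_piece)
```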

## JSON Data

JSON (short for JavaScript Object Notation) has become one of the standard formats for sending data by HTTP request between web browsers and other applications. The pandas.read_json function can automatically convert JSON datasets in specific arrangements into a Series or DataFrame.


In [18]:
filepath='C:/MyLearn/DataSet/Pandas/Book_CH06/example.json'
result = pd.read_json(filepath)
result

Unnamed: 0,a,b,c
0,1,2,3
1,4,5,6
2,7,8,9
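
The contents of `example.json` are not shown here, so the sketch below assumes a list-of-records layout that would produce the same table. It reads the JSON from an in-memory string and writes it back out with `to_json`.

```python
import io
import json
import pandas as pd

# Assumed layout: a JSON array of objects, one object per row.
json_text = (
    '[{"a": 1, "b": 2, "c": 3},'
    ' {"a": 4, "b": 5, "c": 6},'
    ' {"a": 7, "b": 8, "c": 9}]'
)

# Wrapping the string in StringIO lets read_json treat it like a file.
from_json = pd.read_json(io.StringIO(json_text))
print(from_json)

# Export back to JSON, one object per row.
round_trip = from_json.to_json(orient="records")
print(round_trip)
```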


## HTML File Handling

pandas has a built-in function, read_html, which uses libraries like lxml and BeautifulSoup to automatically parse tables out of HTML files as DataFrame objects. First, you must install some additional libraries used by read_html: lxml, beautifulsoup4, and html5lib.

In [19]:
filepath='C:/MyLearn/DataSet/Pandas/Book_CH06/fdic_failed_bank_list.html'
table1 = pd.read_html(filepath, parse_dates=['Closing Date','Updated Date'])
table2 = pd.read_html(filepath)

In [20]:
len(table1)

1

In [21]:
len(table2)

1

In [22]:
# Because the table has many columns, pandas wraps the output
# and marks each continued line with a backslash (\).

table1

[                             Bank Name             City  ST   CERT  \
 0                          Allied Bank         Mulberry  AR     91   
 1         The Woodbury Banking Company         Woodbury  GA  11297   
 2               First CornerStone Bank  King of Prussia  PA  35312   
 3                   Trust Company Bank          Memphis  TN   9956   
 4           North Milwaukee State Bank        Milwaukee  WI  20364   
 ..                                 ...              ...  ..    ...   
 542                 Superior Bank, FSB         Hinsdale  IL  32646   
 543                Malta National Bank            Malta  OH   6629   
 544    First Alliance Bank & Trust Co.       Manchester  NH  34264   
 545  National State Bank of Metropolis       Metropolis  IL   3815   
 546                   Bank of Honolulu         Honolulu  HI  21029   
 
                    Acquiring Institution Closing Date Updated Date  
 0                           Today's Bank   2016-09-23   2016-11-17  
 1    

## Binary Data Formats

One of the easiest ways to store data efficiently in binary format (a process also known as serialization) is to use Python’s built-in pickle serialization. pandas objects all have a to_pickle method that writes the data to disk in pickle format:

In [23]:
filepath='C:/MyLearn/DataSet/Pandas/Book_CH06/ex1.csv'
frame = pd.read_csv(filepath)
frame

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [24]:
# Save the DataFrame in pickle Format

frame.to_pickle('C:/MyLearn/DataSet/Pandas/Book_CH06/frame_pickle')

In [25]:
# Retrieve the saved pickle file into a DataFrame

df_pickle = pd.read_pickle('C:/MyLearn/DataSet/Pandas/Book_CH06/frame_pickle')
df_pickle

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


**pickle is only recommended as a short-term storage format**. The problem is that it is hard to guarantee that the format will be stable over time; an object pickled today may not unpickle with a later version of a library. We have tried to maintain backward compatibility when possible, but at some point in the future it may be necessary to “break” the pickle format.

## Reading Microsoft Excel Files

pandas also supports reading tabular data stored in Excel 2003 (and higher) files using either the ExcelFile class or pandas.read_excel function. Internally these tools use the add-on packages xlrd and openpyxl to read XLS and XLSX files, respectively.

In [26]:
filepath='C:/MyLearn/DataSet/Pandas/Book_CH06/ex1.xlsx'
xlsx = pd.ExcelFile(filepath)
pd.read_excel(xlsx, 'Sheet1')

Unnamed: 0.1,Unnamed: 0,a,b,c,d,message
0,0,1,2,3,4,hello
1,1,5,6,7,8,world
2,2,9,10,11,12,foo


If you are reading multiple sheets in a file, then it is faster to create the ExcelFile, but you can also simply pass the filename to pandas.read_excel:

In [27]:
frame = pd.read_excel(filepath, 'Sheet1')
frame

Unnamed: 0.1,Unnamed: 0,a,b,c,d,message
0,0,1,2,3,4,hello
1,1,5,6,7,8,world
2,2,9,10,11,12,foo


## Interacting with Web APIs

Many websites have public APIs providing data feeds via JSON or some other format. There are a number of ways to access these APIs from Python; one easy-to-use method that I recommend is the requests package.

To fetch the 30 most recent pandas issues on GitHub, we can make a GET HTTP request using the add-on requests library:

In [28]:
url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
resp = requests.get(url)

In [29]:
resp

<Response [200]>

The Response object’s json method parses the JSON into native Python objects. For this endpoint it returns a list of dictionaries, one per issue:

In [30]:
data = resp.json()

In [31]:
data[0].keys()

dict_keys(['url', 'repository_url', 'labels_url', 'comments_url', 'events_url', 'html_url', 'id', 'node_id', 'number', 'title', 'user', 'labels', 'state', 'locked', 'assignee', 'assignees', 'milestone', 'comments', 'created_at', 'updated_at', 'closed_at', 'author_association', 'active_lock_reason', 'draft', 'pull_request', 'body', 'reactions', 'timeline_url', 'performed_via_github_app'])

In [32]:
issues = pd.DataFrame(data, columns=['number','title','labels','state'])

In [33]:
issues

Unnamed: 0,number,title,labels,state
0,46537,REF: Create StorageExtensionDtype,"[{'id': 127681, 'node_id': 'MDU6TGFiZWwxMjc2OD...",open
1,46536,CI: xfail geopandas downstream test on Windows...,"[{'id': 48070600, 'node_id': 'MDU6TGFiZWw0ODA3...",open
2,46535,TYP: Many typing constructs are invariant,[],open
3,46534,TST: Add test with large shape to check_below_...,[],open
4,46532,ENH: Use docker compose,"[{'id': 76812, 'node_id': 'MDU6TGFiZWw3NjgxMg=...",open
...,...,...,...,...
25,46501,Backport PR #46119 on branch 1.4.x (REF: isins...,"[{'id': 32815646, 'node_id': 'MDU6TGFiZWwzMjgx...",open
26,46500,"Revert ""REF: isinstance(x, int) -> is_integer(x)""","[{'id': 127681, 'node_id': 'MDU6TGFiZWwxMjc2OD...",open
27,46499,BUG: date_range puzzler on dst transition with...,"[{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ=...",open
28,46498,BUG: pd.test() fails on Python v3.9.6,"[{'id': 76811, 'node_id': 'MDU6TGFiZWw3NjgxMQ=...",open
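
The `labels` column holds a list of dictionaries for each issue. As a sketch of one way to flatten it, the code below pulls out only the label names. It runs on two invented records shaped like the response above, not on live API data; the `name` key is part of GitHub's label objects, while the values and the `label_names` column are made up for illustration.

```python
import pandas as pd

# Hypothetical records shaped like the GitHub response (values invented).
sample = [
    {"number": 1, "title": "BUG: example", "state": "open",
     "labels": [{"id": 10, "name": "Bug"}, {"id": 11, "name": "IO CSV"}]},
    {"number": 2, "title": "DOC: example", "state": "open", "labels": []},
]

sample_issues = pd.DataFrame(sample, columns=["number", "title", "labels", "state"])

# Keep only the human-readable name of each label.
sample_issues["label_names"] = sample_issues["labels"].apply(
    lambda labels: [label["name"] for label in labels]
)
print(sample_issues[["number", "label_names"]])
```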


## Interacting with Databases

In a business setting, most data may not be stored in text or Excel files. SQL-based relational databases (such as SQL Server, PostgreSQL, and MySQL) are in wide use, and many alternative databases have become quite popular. The choice of database is usually dependent on the performance, data integrity, and scalability needs of an application.

Loading data from SQL into a DataFrame is fairly straightforward, and pandas has some functions to simplify the process. As an example, I’ll create a SQLite database using Python’s built-in sqlite3 driver:

In [34]:
import sqlite3
query = """
CREATE TABLE students
(FName VARCHAR(20), SName VARCHAR(20), Age Integer,Height Real);"""
con = sqlite3.connect('mydata.sqlite')
con.execute(query)
con.commit()

In [35]:
data = [('Saravanan', 'Shanmugham', 50, 5.6),
        ('Sujaya', 'Saravanan', 46, 5.4),
        ('Karthick', 'Srinivas S', 18, 5.8)]
stmt = "INSERT INTO students VALUES(?, ?, ?, ?)"
con.executemany(stmt, data)
con.commit()

In [36]:
cursor = con.execute('select * from students')
rows = cursor.fetchall()
rows

[('Saravanan', 'Shanmugham', 50, 5.6),
 ('Sujaya', 'Saravanan', 46, 5.4),
 ('Karthick', 'Srinivas S', 18, 5.8)]

In [37]:
cursor.description

(('FName', None, None, None, None, None, None),
 ('SName', None, None, None, None, None, None),
 ('Age', None, None, None, None, None, None),
 ('Height', None, None, None, None, None, None))

In [38]:
pd.DataFrame(rows, columns=[x[0] for x in cursor.description])

Unnamed: 0,FName,SName,Age,Height
0,Saravanan,Shanmugham,50,5.6
1,Sujaya,Saravanan,46,5.4
2,Karthick,Srinivas S,18,5.8


In [39]:
import sqlalchemy as sqla
db = sqla.create_engine('sqlite:///mydata.sqlite')
pd.read_sql('select * from students', db)

Unnamed: 0,FName,SName,Age,Height
0,Saravanan,Shanmugham,50,5.6
1,Sujaya,Saravanan,46,5.4
2,Karthick,Srinivas S,18,5.8


In [40]:
con.close()
!del mydata.sqlite
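
For SQLite specifically, `pd.read_sql` also accepts a plain `sqlite3` connection, so SQLAlchemy is optional in that case. The sketch below uses an in-memory database, which leaves no file to delete afterwards. The table contents and the names `mem_con` and `adults` are invented for this example.

```python
import sqlite3
import pandas as pd

# In-memory database: nothing is written to disk.
mem_con = sqlite3.connect(":memory:")
mem_con.execute("CREATE TABLE students (FName VARCHAR(20), Age INTEGER)")
mem_con.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [("Ann", 30), ("Bob", 17)],  # hypothetical rows
)
mem_con.commit()

# Pass query parameters separately instead of formatting them into the SQL.
adults = pd.read_sql(
    "SELECT * FROM students WHERE Age >= ?", mem_con, params=(18,)
)
mem_con.close()
print(adults)
```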

***