### Data Loading 
In any real data analysis project, loading data from different sources is unavoidable. Python offers many ways to load data, but the result then has to be processed and converted into **Series** or **DataFrame** objects, which adds complexity to the code. Pandas provides functions that load data directly into these objects.

### CSV - Files
Comma-separated values (CSV) is probably the most famous file format. For an **initial look at a CSV file** we will use Python's built-in **csv module**
- **pd.read_csv( file_path )** - reads a CSV file
- **pd.read_csv( file_path, dtype = {'column':new_data_type} )** - sets the data types of the given columns when reading
- **pd.read_csv( file_path, header = 0, names = [ new_columns_names ] )** - replaces the column names when reading (header = 0 marks the existing header row to be replaced)
- **pd.read_csv( file_path, usecols = [ list_of_columns ] )** - reads only the given columns
- **pd.read_csv( file_path, skiprows = [ rows_indexes ] or number_of_rows, nrows = number_of_rows )** - skips the given rows and reads only nrows rows
- **pd.read_csv( file_path, skipfooter = number_of_rows, engine = 'python' )** - skips the given number of rows at the end of the file
- **df.to_csv( file_path.csv, index_label = 'name', sep = ' ' )** - saves a DataFrame to a CSV file; sep sets the field separator (a comma by default)
(**don't forget to provide index_label**, otherwise the index column is written without a name)
- **pd.read_csv( url )** - can read CSV tables from the Internet over HTTP
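
The options listed above can be tried without any file on disk. The following is a minimal sketch, assuming a small in-memory CSV (the sample rows and the names `csv_text`, `subset`, `renamed`, `window`, `trimmed` and `row_id` are made up for illustration); `io.StringIO` stands in for a real file path.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for a CSV file on disk.
csv_text = """Date,Open,Close,Volume
7/21/2014,83.46,81.93,2359300
7/18/2014,83.30,83.35,4020800
7/17/2014,84.35,83.63,1974000
"""

# Read only two columns and force Volume to be a float column.
subset = pd.read_csv(io.StringIO(csv_text),
                     usecols=["Date", "Volume"],
                     dtype={"Volume": "float64"})

# Replace the existing header row (row 0) with new column names.
renamed = pd.read_csv(io.StringIO(csv_text), header=0,
                      names=["date", "open", "close", "volume"])

# Skip file line 1 (the first data row) and read a single row after it.
window = pd.read_csv(io.StringIO(csv_text), skiprows=[1], nrows=1)

# Drop the last line of the file; skipfooter needs the Python engine.
trimmed = pd.read_csv(io.StringIO(csv_text), skipfooter=1, engine="python")

# Write with a named index column, then read it back as the index.
buffer = io.StringIO()
renamed.to_csv(buffer, index_label="row_id")
restored = pd.read_csv(io.StringIO(buffer.getvalue()), index_col="row_id")
```

Naming the index column on the way out is what makes the round trip clean: without `index_label`, the index is written under an empty header and comes back as an extra unnamed column.
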
### Excel Files

- **pd.read_excel(file_path)** - reads Excel file (by default the first sheet will be read) 
- **pd.read_excel(file_path, sheet_name = 'name')** - reads only the given sheet
- **df.to_excel(file_path.xlsx, sheet_name = 'name')** - saves a DataFrame as an Excel file
- **with ExcelWriter( file_path.xlsx ) as writer:** - saves several DataFrames into one file, each on its own sheet<br>
df_1.to_excel( writer, sheet_name = 'name_1' )<br>
df_2.to_excel( writer, sheet_name = 'name_2' )<br>
Like **pd.read_csv( )**, the function **pd.read_excel( )** lets you specify column names, data types and the index column. **Most of these parameters have the same names**

### JSON Files
Pandas can easily read files in JavaScript Object Notation (JSON) format
- **pd.read_json( file_path )** - reads a JSON file
- **df.to_json( file_path )** - saves a DataFrame as a JSON file
- **with open( file_path ) as file:** - use **pprint** and **json** to inspect the raw JSON <br>
**data = json.load( file )**<br>
**pprint( data )**
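
As a minimal sketch of the JSON round trip described above, assuming a small made-up DataFrame (the values and the names `json_text`, `raw` and `restored` are illustrative only): calling `to_json` without a path returns the JSON as a string, which can then be inspected with the `json` module or read back with `pd.read_json`.

```python
import io
import json
import pandas as pd

# Hypothetical two-row frame used only for illustration.
df = pd.DataFrame({"Open": [94.99, 93.62], "Close": [93.94, 94.43]})

# With no path argument, to_json returns the JSON text instead of writing a file.
json_text = df.to_json()

# The raw structure is a dict of columns, each mapping row labels to values.
raw = json.loads(json_text)

# read_json accepts a file-like object as well as a path.
restored = pd.read_json(io.StringIO(json_text))
```

Looking at `raw` first is a quick way to see how the file is laid out before deciding whether the default `pd.read_json` call fits it.
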

### HTML Files
Pandas reads HTML with the help of libraries such as **lxml**, **html5lib** and **BeautifulSoup4**. Make sure that they are **installed**
- **pd.read_html(url)** - reads all tables found in the HTML and returns them as a list of DataFrames
- **df.to_html( file_path.html )** - saves a DataFrame as a table in HTML format. Writes only the **table tags**, not a complete HTML document
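
The point about **table tags** can be checked directly. This is a minimal sketch with a made-up one-row DataFrame (the bank name and the variable `html_fragment` are hypothetical); calling `to_html` without a path returns the markup as a string.

```python
import pandas as pd

# Hypothetical one-row frame used only for illustration.
df = pd.DataFrame({"Bank Name": ["Example Bank"], "ST": ["WV"]})

# With no path argument, to_html returns the markup as a string.
# index=False leaves the row index out of the table.
html_fragment = df.to_html(index=False)

# The result is a bare <table> element, not a full page.
print(html_fragment)
```

Because the output is only a fragment, it has to be placed inside a page of your own if a complete HTML document is needed.
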

### HDF5 Files 
This format is not covered here because it is specialized

### SQL
Pandas can read data from and write data to different databases. Not covered here

### Pandas Data - Reader
This library (**pandas_datareader**) extracts data from many remote sources:<br> 
- **Stocks:** Google, Yahoo
- **Economic data** 

For more information, see the **library's documentation**

In [30]:
import pandas as pd
import numpy as np
import csv # For CSV initial Analysis 
from pandas import ExcelWriter # For saving into different sheets

# For dealing with JSON 
import json
from pprint import pprint

# For data extraction from remote resources
import pandas_datareader as pdr

In [18]:
# File Location 
path = 'D:/ML/Books/Learning_Pandas_russian_translation-1-master/Notebooks/Data/msft.csv'

# Initial Analysis 
with open(path) as file:
    reader = csv.reader(file, delimiter = ',')
    for indx, row in enumerate(reader):
        print(row)
        if (indx>=5):
            break
            
# For convenience, let's write a function to check a CSV file
def check_csv(file_path, sep=',', rows_num=5):
    # Print the header row plus the first rows_num data rows
    with open(file_path, newline='') as file:
        reader = csv.reader(file, delimiter=sep)
        for indx, row in enumerate(reader):
            print(row)
            if indx >= rows_num:
                break

['Date', 'Open', 'High', 'Low', 'Close', 'Volume']
['7/21/2014', '83.46', '83.53', '81.81', '81.93', '2359300']
['7/18/2014', '83.3', '83.4', '82.52', '83.35', '4020800']
['7/17/2014', '84.35', '84.63', '83.33', '83.63', '1974000']
['7/16/2014', '83.77', '84.91', '83.66', '84.91', '1755600']
['7/15/2014', '84.3', '84.38', '83.2', '83.58', '1874700']


In [20]:
# Reading CSV File
df = pd.read_csv(path,index_col=0)
df.head()

Date,Open,High,Low,Close,Volume
7/21/2014,83.46,83.53,81.81,81.93,2359300
7/18/2014,83.3,83.4,82.52,83.35,4020800
7/17/2014,84.35,84.63,83.33,83.63,1974000
7/16/2014,83.77,84.91,83.66,84.91,1755600
7/15/2014,84.3,84.38,83.2,83.58,1874700


In [3]:
# Reading Excel File
path = 'D:/ML/Books/Learning_Pandas_russian_translation-1-master/Notebooks/Data/stocks.xlsx'
df = pd.read_excel(path) # the first sheet is read by default
df.head()

Unnamed: 0,Date,Open,High,Low,Close,Volume
0,2014-07-21,83.46,83.53,81.81,81.93,2359300
1,2014-07-18,83.3,83.4,82.52,83.35,4020800
2,2014-07-17,84.35,84.63,83.33,83.63,1974000
3,2014-07-16,83.77,84.91,83.66,84.91,1755600
4,2014-07-15,84.3,84.38,83.2,83.58,1874700


In [16]:
# Reading JSON File 
file_path = 'D:/ML/Books/Learning_Pandas_russian_translation-1-master/Notebooks/Data/all_stocks.json'
with open(file_path) as file:
    data = json.load(file)
pprint(data)

# Using the standard pandas function
df = pd.read_json(file_path)
df.head()

{'Close': {'0': 93.94, '1': 94.43, '2': 93.09, '3': 94.78, '4': 95.32},
 'Date': {'0': 1405900800000,
          '1': 1405641600000,
          '2': 1405555200000,
          '3': 1405468800000,
          '4': 1405382400000},
 'High': {'0': 95.0, '1': 94.74, '2': 95.28, '3': 97.1, '4': 96.85},
 'Low': {'0': 93.72, '1': 93.02, '2': 92.57, '3': 94.74, '4': 95.03},
 'Open': {'0': 94.99, '1': 93.62, '2': 95.03, '3': 96.97, '4': 96.8},
 'Unnamed: 0': {'0': 0, '1': 1, '2': 2, '3': 3, '4': 4},
 'Volume': {'0': 38887700,
            '1': 49898600,
            '2': 57152000,
            '3': 53396300,
            '4': 45477900}}


Unnamed: 0.1,Unnamed: 0,Date,Open,High,Low,Close,Volume
0,0,2014-07-21,94.99,95.0,93.72,93.94,38887700
1,1,2014-07-18,93.62,94.74,93.02,94.43,49898600
2,2,2014-07-17,95.03,95.28,92.57,93.09,57152000
3,3,2014-07-16,96.97,97.1,94.74,94.78,53396300
4,4,2014-07-15,96.8,96.85,95.03,95.32,45477900


In [29]:
# Reading HTML File
url = 'http://www.fdic.gov/bank/individual/failed/banklist.html'
df = pd.read_html(url) # returns a list of DataFrames, one per table found
type(df) # list
df[0].head() # work with the first DataFrame in the list

Unnamed: 0,Bank Name,City,ST,CERT,Acquiring Institution,Closing Date
0,The First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.","April 3, 2020"
1,Ericson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,"February 14, 2020"
2,City National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,"November 1, 2019"
3,Resolute Bank,Maumee,OH,58317,Buckeye State Bank,"October 25, 2019"
4,Louisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,"October 25, 2019"
