# Importing Data in Python (Part 1)  
Import data from a variety of sources:  
* flat files, e.g. txt, csv
* files from other software, e.g. SAS, Excel, MATLAB
* relational databases  

Contents  
* Introduction and flat files
* Importing data from other file types 
* Working with relational databases in Python 

# 1 Introduction and flat files  

## 1.1 Basic reading of a txt file
A great tutorial from outside DataCamp: [Reading and Writing Files in Python](http://www.pythonforbeginners.com/files/reading-and-writing-files-in-python)

In [None]:
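# read a txt file
filename = 'huck_finn.txt'

file = open(filename, mode='r')  # read mode

file.read()  # get the whole file at once, as a single string

file.seek(0)  # rewind: after read() the position is at the end of the file

file.readline()  # get one line at a time

print(file.closed)  # False (note: the attribute is `closed`, not the method `close`)

file.close()  # why close? it releases the file handle held by the OS

print(file.closed)  # True (a bit curious how this is implemented in the source)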

In [None]:
# writing to a file (mode 'w' creates the file, or empties it if it already exists)
filename = 'huck_finn.txt'

file = open(filename, mode='w')

# do some writing

file.close()

In [None]:
# context manager with (don't have to do file.close() by hand)
with open(filename, 'r') as file:
    print(file.read())
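
As for why closing matters: an open file holds an operating-system handle, and buffered writes may not reach the disk until the file is flushed or closed. Below is a minimal sketch of the `with` pattern for both writing and reading. The temporary file and its two lines of text are made up for illustration and are not part of the course data.

```python
import os
import tempfile

# Hypothetical sample file, created only so the example is self-contained
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')

with open(path, 'w') as file:
    file.write('first line\nsecond line\n')

# Iterating over the file object yields one line at a time
with open(path, 'r') as file:
    lines = [line.rstrip('\n') for line in file]

print(lines)
print(file.closed)  # the with block closed the file on exit
```

Iterating over the file object is usually preferable to `file.read()` for large files, since only one line is held in memory at a time.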

## 1.2 the importance of flat files in data science

Flat files
* tabular data with no structured relationships between tables
* row: one record, in which each field holds at most one item;    column: an attribute
* in contrast to data in a relational database
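
To make the record/attribute vocabulary concrete, here is a small sketch using the standard-library `csv` module. The three-line sample and its column names are invented for illustration.

```python
import csv
import io

# Hypothetical flat file: one header row, then one record per row
flat_file = io.StringIO("name,sex,age\nAnn,female,29\nBob,male,41\n")

# Each record becomes a dict that maps attribute name -> single value
records = list(csv.DictReader(flat_file))

for record in records:
    print(record['name'], record['age'])
```

Note that `csv` returns every field as a string; the numpy and pandas readers in the next sections also take care of type conversion.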

## 1.3 Importing flat files using numpy  
np.loadtxt():  
loads all-numeric files,   
or flat files in which every value is of the same type.  
  
np.genfromtxt():  
for files whose rows contain mixed dtypes.  
(pandas is more commonly used for this)

In [None]:
import numpy as np

filename = 'MNIST.txt'

data = np.loadtxt(filename, 
                  delimiter = ',',
                  skiprows=1,  # skip the first row (the header)
                  usecols=[0, 2]  # keep only the columns you want
                 )

# or 
data = np.loadtxt(filename, 
                  delimiter = ',', 
                  dtype = str
                 )

data  # get ndarray
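
The cell above needs `MNIST.txt` on disk. A self-contained sketch of the same `np.loadtxt` options, reading from an in-memory string instead, might look like this (the sample values are made up):

```python
import io
import numpy as np

# Hypothetical all-numeric flat file with a header row
text = "label,px1,px2\n5,0,128\n0,255,64\n4,12,0\n"

data = np.loadtxt(io.StringIO(text),
                  delimiter=',',
                  skiprows=1,       # skip the header row
                  usecols=[0, 2])   # keep the first and third columns

print(data.shape)
print(data.dtype)
```

By default `np.loadtxt` converts every value to float, which is why the header row has to be skipped here.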

In [None]:
data = np.genfromtxt('titanic.csv', 
                     delimiter=',', 
                     names=True,  # one header row
                     dtype=None
                    )

data.shape  # get 1D array, one entire row is an element 

type(data[0])   # get numpy.void
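
With `names=True` and `dtype=None`, `np.genfromtxt` returns a structured array whose columns can be selected by header name. A minimal sketch on an invented three-row sample follows; `encoding='utf-8'` is passed explicitly because the default handling of string columns has differed between NumPy versions.

```python
import io
import numpy as np

# Hypothetical mixed-type flat file: one string column, one int, one float
csv_text = "name,age,fare\nAnn,29,7.25\nBob,41,71.28\nCid,35,8.05\n"

data = np.genfromtxt(io.StringIO(csv_text),
                     delimiter=',',
                     names=True,        # first row holds the column names
                     dtype=None,        # infer a dtype per column
                     encoding='utf-8')

print(data.shape)        # 1D: one element per row
print(data.dtype.names)  # column names taken from the header
print(data['age'])       # select a whole column by name
print(data[0])           # select a whole row by position
```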

## 1.4 Importing flat files using pandas

In [None]:
# (details omitted)
import pandas as pd

data = pd.read_csv(filename)
data_array = data.values   # extract values as a numpy ndarray
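
Since the details are omitted above, here is a brief sketch of a few `pd.read_csv` options that often come up with messy flat files. The tab-separated sample, its comment line, and the `'Nothing'` missing-value marker are all invented for illustration.

```python
import io
import pandas as pd

# Hypothetical tab-separated file with a comment line and a custom NA marker
raw = "# exported by a hypothetical tool\nid\tscore\n1\tNothing\n2\t4.5\n3\t3.0\n"

df = pd.read_csv(io.StringIO(raw),
                 sep='\t',              # field delimiter
                 comment='#',           # ignore everything after this character
                 na_values='Nothing',   # treat this string as missing
                 nrows=2)               # read only the first 2 data rows

print(df)

arr = df.to_numpy()   # same idea as df.values
print(arr.shape)
```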

# 2 Importing files from other software, e.g. SAS, Excel, Stata, Feather

## 2.1 Other file types, especially pickle and Excel

* Excel spreadsheets
* MATLAB files
* SAS files
* Stata files
* HDF5 files
* Pickled files
    * There are a number of datatypes that cannot be saved easily to flat files, such as lists and dictionaries. If you want your files to be human readable, you may want to save them as text files in a clever manner (JSON, which you will see in a later chapter, is appropriate for Python dictionaries).
    * If, however, you merely want to be able to import them into Python, you can serialize them. All this means is converting the object into a sequence of bytes, or bytestream.
    * File type native to Python
    * Motivation: many datatypes for which it isn't obvious how to store them
    * Pickled files are serialized (the object is converted to a bytestream)
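
The serialization described above can be sketched as a round trip: dump an object to a pickle file, load it back, and compare. The dictionary and the temporary path are made up for illustration; `json.dumps` is included only to show the human-readable alternative mentioned in the list.

```python
import json
import os
import pickle
import tempfile

# Hypothetical object that does not map naturally onto a flat file
d = {'name': 'huck', 'scores': [1, 2, 3]}

path = os.path.join(tempfile.mkdtemp(), 'data.pkl')

with open(path, 'wb') as file:   # 'wb' for 'write binary'
    pickle.dump(d, file)

with open(path, 'rb') as file:   # 'rb' for 'read binary'
    restored = pickle.load(file)

print(restored == d)    # the object survives the round trip
print(json.dumps(d))    # human-readable text form of the same dict
```

One caveat worth keeping in mind: unpickling can execute arbitrary code, so only load pickle files from sources you trust.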


In [None]:
# open a previously pickled data structure from a file and load it.

# Import pickle package
import pickle

# Open pickle file and load data: d
with open('data.pkl', 'rb') as file:   # 'rb' for 'read binary'
    d = pickle.load(file)

print(d)

print(type(d))

In [None]:
# dealing with Excel spreadsheets

import pandas as pd

file = 'battledeath.xlsx'
xl = pd.ExcelFile(file)
print(xl.sheet_names)  # get names of all sheets

# Load a sheet by name: df1
df1 = xl.parse('2004')   # can also load by position

df1 = xl.parse(0,             # sheet position
               skiprows=1,    # skip the first row
               names=['Country', 'AAM due to war']   # rename the cols
              )


df2 = xl.parse(1,              # sheet position
               usecols=[0],    # select cols by position (was `parse_cols` in older pandas)
               skiprows=1,     # skip the first row
               names=['Country']
              )

## 2.2 SAS and Stata

In [None]:
# import sas file

# Import sas7bdat package
from sas7bdat import SAS7BDAT

# Save file to a DataFrame: df_sas
with SAS7BDAT('sales.sas7bdat') as file:
    df_sas = file.to_data_frame()

print(df_sas.head())

In [None]:
# import a Stata file

import pandas as pd

data = pd.read_stata('urbanpop.dta')

## 2.3 HDF5 files (omitted)
Hierarchical Data Format version 5  
* Standard for storing large quantities of numerical data
* Datasets can be hundreds of gigabytes or terabytes 
* HDF5 can scale to exabytes

## 2.4 MATLAB files (omitted)

# 3 Working with relational databases in Python

## 3.1 Creating a database engine in Python

In [None]:
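# Import necessary functions
from sqlalchemy import create_engine, inspect

# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')

# Save the table names to a list: table_names
# (engine.table_names() was removed in SQLAlchemy 2.0; the inspector also works in older versions)
table_names = inspect(engine).get_table_names()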

# Print the table names to the shell
print(table_names)


## 3.2 Querying relational databases in Python
Workflow of SQL querying
* Import packages and functions 
* Create the database engine 
* Connect to the engine
* Query the database
* Save query results to a DataFrame 
* Close the connection
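
The workflow above can be sketched end to end. To stay self-contained, this sketch swaps SQLAlchemy for the standard-library `sqlite3` module and an in-memory database, so "create the engine" and "connect" collapse into a single `sqlite3.connect` call. The `Album` rows are invented and are not the real Chinook data.

```python
# 1. Import packages and functions
import sqlite3
import pandas as pd

# 2./3. Create and connect to an in-memory database (stand-in for the engine)
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE Album (AlbumId INTEGER PRIMARY KEY, Title TEXT, ArtistId INTEGER)')
con.executemany('INSERT INTO Album VALUES (?, ?, ?)',
                [(1, 'First', 1), (2, 'Second', 1), (3, 'Third', 2)])

# 4. Query the database
cur = con.execute('SELECT * FROM Album')

# 5. Save query results to a DataFrame, taking column names from the cursor
rows = cur.fetchall()
columns = [col[0] for col in cur.description]
df = pd.DataFrame(rows, columns=columns)

# 6. Close the connection
con.close()

print(df.head())
```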

In [None]:
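# close the connection manually
from sqlalchemy import create_engine, text
import pandas as pd

engine = create_engine('sqlite:///Chinook.sqlite')

con = engine.connect()

# SQLAlchemy 2.x requires text() around raw SQL strings; older versions accept it too
rs = con.execute(text('SELECT * FROM Album'))

df = pd.DataFrame(rs.fetchall())

type(rs.fetchall())  # list (empty by now: the rows were already consumed above)

con.close()

print(df.head())  # the table's column names may be missing here; the next cell sets them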

In [None]:
# skip the manual close by using `with`, plus some other options
from sqlalchemy import text

with engine.connect() as con:
    rs = con.execute(text('SELECT LastName, Title FROM Employee')) # select cols
    df = pd.DataFrame(rs.fetchmany(3))       # fetch 3 rows
    df.columns = rs.keys()                   # set column labels

print(df.head())     # now df has column names

## 3.3 Querying relational databases directly with pandas

In [None]:
# create engine, connect to it, execute the query, fetch data, turn into df
from sqlalchemy import text

with engine.connect() as con:
    rs = con.execute(text('SELECT LastName, Title FROM Employee')) # select cols
    df = pd.DataFrame(rs.fetchmany(3))       # fetch 3 rows
    df.columns = rs.keys() 

In [None]:
engine = create_engine('...')

# connect engine, exe query, fetch data, turn into df, all in one line
df = pd.read_sql_query("SELECT * FROM Orders", engine)

## 3.4 Advanced Querying: exploiting table relationships  
This section is mainly about JOIN, and about combining JOIN with WHERE and ORDER BY.

In [None]:
# Open a connection in a context manager
# Perform query and save results to DataFrame: df
from sqlalchemy import text

with engine.connect() as con:
    rs = con.execute(text('SELECT Title, Name FROM Album INNER JOIN Artist ON Album.ArtistID = Artist.ArtistID'))
    df = pd.DataFrame(rs.fetchall())
    df.columns = rs.keys()

# Print head of DataFrame df
print(df.head())
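
Combining the JOIN with WHERE and ORDER BY could look like the sketch below. It again uses an in-memory `sqlite3` database in place of the Chinook file so that it runs on its own, and passes the connection straight to `pd.read_sql_query`. The table contents are a small invented sample.

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(':memory:')
con.executescript("""
CREATE TABLE Artist (ArtistId INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Album (AlbumId INTEGER PRIMARY KEY, Title TEXT, ArtistId INTEGER);
INSERT INTO Artist VALUES (1, 'AC/DC'), (2, 'Accept');
INSERT INTO Album VALUES (1, 'Let There Be Rock', 1),
                         (2, 'Balls to the Wall', 2),
                         (3, 'For Those About To Rock', 1);
""")

# JOIN the two tables, keep one artist's albums, sort them by title
query = """
SELECT Title, Name
FROM Album
INNER JOIN Artist ON Album.ArtistId = Artist.ArtistId
WHERE Artist.Name = 'AC/DC'
ORDER BY Title
"""

df = pd.read_sql_query(query, con)
con.close()

print(df)
```

The clause order matters in SQL: JOIN comes before WHERE, and ORDER BY comes last.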

# Related resources  
In PythonLand, there are currently hundreds of *Python Enhancement Proposals*, commonly referred to as PEPs. [PEP8](https://www.python.org/dev/peps/pep-0008/), for example, is a standard style guide for Python, written by our sensei Guido van Rossum himself. It is the basis for how we here at DataCamp ask our instructors to style their code. Another one of my favorites is [PEP20](https://www.python.org/dev/peps/pep-0020/), commonly called the Zen of Python. Its abstract is as follows:

Long time Pythoneer Tim Peters succinctly channels the BDFL's guiding principles for Python's design into 20 aphorisms, only 19 of which have been written down.