# Importing Data in Python (Part 1)

## Chapter 1: Flat Files

#### Reading a text file

In [1]:
filename = 'datasets/huck_finn.txt'
file = open(filename, mode = 'r') # 'r' is to read
text = file.read()
file.close() # best practice: always close the file when you are done with it
print(file.closed) # to check whether the file is closed

True


In [3]:
print(text[:500])



The Project Gutenberg EBook of Adventures of Huckleberry Finn, Complete
by Mark Twain (Samuel Clemens)

This eBook is for the use of anyone anywhere at no cost and with almost
no restrictions whatsoever. You may copy it, give it away or re-use
it under the terms of the Project Gutenberg License included with this
eBook or online at www.gutenberg.net

Title: Adventures of Huckleberry Finn, Complete

Author: Mark Twain (Samuel Clemens)

Release Date: August 20, 2006 [EBook #76]

Last Updated: Oc


#### Context manager `with`

A `with` statement removes the need to close the file yourself. It creates a context in which the file is open and you can run commands against it; once execution leaves that context, the file is closed automatically. The file object returned by `open()` acts as a "context manager", and `as file` binds it to a variable for use inside the context. Using `with` is best practice because the file is closed for you, even if an error occurs inside the block.

In [4]:
with open('datasets/huck_finn.txt', 'r') as file:
    print(file.read()[:500])



The Project Gutenberg EBook of Adventures of Huckleberry Finn, Complete
by Mark Twain (Samuel Clemens)

This eBook is for the use of anyone anywhere at no cost and with almost
no restrictions whatsoever. You may copy it, give it away or re-use
it under the terms of the Project Gutenberg License included with this
eBook or online at www.gutenberg.net

Title: Adventures of Huckleberry Finn, Complete

Author: Mark Twain (Samuel Clemens)

Release Date: August 20, 2006 [EBook #76]

Last Updated: Oc
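To see the closing behaviour directly, here is a minimal sketch that does not depend on the Huck Finn file. It writes a throwaway file to a temporary directory (the file name `sample.txt` and its contents are made up for illustration) and checks `closed` after a normal block and after a block that raises.

```python
import os
import tempfile

# Create a small throwaway text file so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'sample.txt')  # hypothetical file name
with open(path, 'w') as f:
    f.write('first line\nsecond line\n')

# Normal case: the file is closed as soon as the block ends.
with open(path, 'r') as f:
    text = f.read()
closed_after_block = f.closed

# Error case: the file is closed even though the block raised.
try:
    with open(path, 'r') as g:
        raise ValueError('simulated failure while the file is open')
except ValueError:
    pass
closed_after_error = g.closed

print(closed_after_block, closed_after_error)
```

Both flags should come out `True`, which is the guarantee that a manual `file.close()` cannot give if an exception is raised before it runs.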


If you only want a few lines of the text file, use the `readline()` method. The first call to `file.readline()` returns the first line, the next call returns the second line, and so on. Each returned line keeps its trailing newline, so `print()` leaves a blank line after it.

In [5]:
with open('datasets/huck_finn.txt') as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())





The Project Gutenberg EBook of Adventures of Huckleberry Finn, Complete
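As a small illustration of how `readline()` advances through a file, the sketch below uses a made-up three-line file in a temporary directory (the name `lines.txt` is hypothetical). It also shows two related behaviours: iterating over the file object continues from the current position, and `readline()` returns an empty string at the end of the file.

```python
import os
import tempfile

# Write a made-up three-line file so the example runs anywhere.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'lines.txt')  # hypothetical file name
with open(path, 'w') as f:
    f.write('alpha\nbeta\ngamma\n')

with open(path) as f:
    first = f.readline()                       # 'alpha\n', newline included
    rest = [line.rstrip('\n') for line in f]   # iteration resumes at line 2
    at_end = f.readline()                      # '' signals end of file

print(repr(first), rest, repr(at_end))
```

Stripping the newline with `rstrip('\n')`, or printing with `print(line, end='')`, avoids the extra blank lines seen in the output above.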



### Flat files

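Flat files are plain text files that contain records, that is, table data, without structured relationships. Each line is one row, and the values in a row are separated by a delimiter such as a comma or a tab. A flat file can start with a header row that names the columns. Common flat file extensions are `.csv` and `.txt`.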
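A minimal sketch of what "records plus an optional header" looks like in practice, using Python's built-in `csv` module on an in-memory string. The column names and values are made up for illustration.

```python
import csv
import io

# A tiny flat file held in memory: one header row, then one record per line.
flat_file = io.StringIO('name,age\nAda,36\nLinus,28\n')

reader = csv.reader(flat_file, delimiter=',')
header = next(reader)    # the first row holds the column names
records = list(reader)   # remaining rows; every value is read as a string

print(header)
print(records)
```

Note that the `csv` module leaves every value as a string; converting columns to numbers is one of the things NumPy and pandas do for you in the cells that follow.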

#### Why NumPy?
* NumPy arrays: standard for storing numerical data
* Essential for other packages: e.g. scikit-learn

In [6]:
import numpy as np
filename = 'datasets/NMIST.txt'
data = np.loadtxt(filename, delimiter = ',') # default delimiter is any whitespace
data

array([[1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       ...,
       [2., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [5., 0., 0., ..., 0., 0., 0.]])

In [7]:
# if your data contains a header, skip the first row with skiprows
# you can also specify which columns to import with usecols
data = np.loadtxt(filename, delimiter = ',', skiprows = 1, usecols = [0, 2])
data[:5]

array([[0., 0.],
       [1., 0.],
       [4., 0.],
       [0., 0.],
       [0., 0.]])
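The effect of `skiprows` and `usecols` is easier to see on a file small enough to read by eye. This sketch passes an in-memory string to `np.loadtxt()` instead of a file name; the header and the numbers are made up and are not taken from the dataset above.

```python
import io

import numpy as np

# Made-up comma-separated data with a header row.
raw = 'label,px1,px2\n5,0,12\n0,7,0\n4,0,255\n'

# Skip the header, keep every column.
data = np.loadtxt(io.StringIO(raw), delimiter=',', skiprows=1)

# Skip the header, keep only the first and third columns.
subset = np.loadtxt(io.StringIO(raw), delimiter=',', skiprows=1, usecols=[0, 2])

print(data.shape, data.dtype)
print(subset)
```

Without `skiprows=1`, the call would fail here, because the header text cannot be converted to a float.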

In [8]:
# you can also make the entire import a string
data = np.loadtxt(filename, delimiter = ',', dtype = str)
data

array([['1.0', '0.0', '0.0', ..., '0.0', '0.0', '0.0'],
       ['0.0', '0.0', '0.0', ..., '0.0', '0.0', '0.0'],
       ['1.0', '0.0', '0.0', ..., '0.0', '0.0', '0.0'],
       ...,
       ['2.0', '0.0', '0.0', ..., '0.0', '0.0', '0.0'],
       ['0.0', '0.0', '0.0', ..., '0.0', '0.0', '0.0'],
       ['5.0', '0.0', '0.0', ..., '0.0', '0.0', '0.0']], dtype='<U5')

If your data contain columns of different types but you still want a NumPy array, use `np.genfromtxt()`. NumPy arrays must hold elements of a single type, so the result is a structured array: a 1D array in which each element is one row of the imported flat file, with a named, typed field per column. Passing `dtype=None` asks NumPy to infer the type of each column, and `names=True` takes the field names from the header row.

In [12]:
data = np.genfromtxt('datasets/titanic.csv', delimiter=',', names=True, dtype=None)
data[:5]

array([(1, 0, 3, b'male', 22., 1, 0, b'A/5 21171',  7.25  , b'', b'S'),
       (2, 1, 1, b'female', 38., 1, 0, b'PC 17599', 71.2833, b'C85', b'C'),
       (3, 1, 3, b'female', 26., 0, 0, b'STON/O2. 3101282',  7.925 , b'', b'S'),
       (4, 1, 1, b'female', 35., 1, 0, b'113803', 53.1   , b'C123', b'S'),
       (5, 0, 3, b'male', 35., 0, 0, b'373450',  8.05  , b'', b'S')],
      dtype=[('PassengerId', '<i8'), ('Survived', '<i8'), ('Pclass', '<i8'), ('Sex', 'S6'), ('Age', '<f8'), ('SibSp', '<i8'), ('Parch', '<i8'), ('Ticket', 'S18'), ('Fare', '<f8'), ('Cabin', 'S15'), ('Embarked', 'S1')])
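A structured array can be indexed by row position or by field name. The sketch below illustrates both on made-up data read from an in-memory string. It passes `encoding='utf-8'` explicitly, which is an addition to the call above: it makes text columns come back as `str` rather than the `bytes` values (`b'male'`) shown in the Titanic output.

```python
import io

import numpy as np

# Made-up mixed-type data: an integer, a string and a float column.
raw = 'id,name,score\n1,Ann,9.5\n2,Bob,7.0\n3,Cy,8.25\n'

data = np.genfromtxt(io.StringIO(raw), delimiter=',', names=True,
                     dtype=None, encoding='utf-8')

print(data.shape)         # 1D: one element per row of the file
print(data.dtype.names)   # field names taken from the header
print(data['name'])       # a whole column, selected by field name
print(data[0])            # a whole row, selected by position
```

Selecting a field returns an ordinary one-type NumPy array, so numerical operations such as `data['score'].mean()` work as usual.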

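`np.recfromcsv()` behaves like `np.genfromtxt()` with different defaults: `delimiter=','`, `names=True` and `dtype=None`, so a CSV file with a header row can be read without extra arguments. Note that NumPy 2.0 removed `np.recfromcsv()` from the main namespace; on recent versions, call `np.genfromtxt()` with those arguments instead.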

In [13]:
# Assign the filename: file
file = 'datasets/titanic.csv'

# Import file using np.recfromcsv: d
d = np.recfromcsv(file)

# Print out first three entries of d
print(d[:3])

[(1, 0, 3, b'male', 22., 1, 0, b'A/5 21171',  7.25  , b'', b'S')
 (2, 1, 1, b'female', 38., 1, 0, b'PC 17599', 71.2833, b'C85', b'C')
 (3, 1, 3, b'female', 26., 0, 0, b'STON/O2. 3101282',  7.925 , b'', b'S')]


#### What a data scientist needs
* Two-dimensional labeled data structure(s)
* Columns of potentially different types
* Manipulate, slice, reshape, groupby, join, merge
* Perform statistics
* Work with time series data

#### Manipulating pandas DataFrames
* Exploratory data analysis
* Data wrangling
* Data preprocessing
* Building models
* Visualization
* Standard and best practice to use pandas

In [14]:
import pandas as pd
filename = 'datasets/winequality-red.csv'
data = pd.read_csv(filename, sep = ";")
data.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [15]:
# the DataFrame's contents can also be retrieved as a NumPy array with .values
data.values[:5]

array([[7.400e+00, 7.000e-01, 0.000e+00, 1.900e+00, 7.600e-02, 1.100e+01,
        3.400e+01, 9.978e-01, 3.510e+00, 5.600e-01, 9.400e+00, 5.000e+00],
       [7.800e+00, 8.800e-01, 0.000e+00, 2.600e+00, 9.800e-02, 2.500e+01,
        6.700e+01, 9.968e-01, 3.200e+00, 6.800e-01, 9.800e+00, 5.000e+00],
       [7.800e+00, 7.600e-01, 4.000e-02, 2.300e+00, 9.200e-02, 1.500e+01,
        5.400e+01, 9.970e-01, 3.260e+00, 6.500e-01, 9.800e+00, 5.000e+00],
       [1.120e+01, 2.800e-01, 5.600e-01, 1.900e+00, 7.500e-02, 1.700e+01,
        6.000e+01, 9.980e-01, 3.160e+00, 5.800e-01, 9.800e+00, 6.000e+00],
       [7.400e+00, 7.000e-01, 0.000e+00, 1.900e+00, 7.600e-02, 1.100e+01,
        3.400e+01, 9.978e-01, 3.510e+00, 5.600e-01, 9.400e+00, 5.000e+00]])
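The same steps can be tried without the wine file. The sketch below reads a made-up, semicolon-separated excerpt from an in-memory string, so the column subset and values are illustrative only. It uses two things that do not appear in the cells above: the `nrows` argument of `pd.read_csv()`, and `DataFrame.to_numpy()` as the method form of `.values`.

```python
import io

import pandas as pd

# Made-up semicolon-separated excerpt with a header row.
raw = 'fixed acidity;pH;quality\n7.4;3.51;5\n7.8;3.20;5\n11.2;3.16;6\n'

# sep tells pandas which delimiter the file uses.
df = pd.read_csv(io.StringIO(raw), sep=';')

# nrows limits how many data rows are read, handy for very large files.
first_two = pd.read_csv(io.StringIO(raw), sep=';', nrows=2)

# Convert the DataFrame to a NumPy array.
as_array = df.to_numpy()

print(df.dtypes)
print(first_two.shape, as_array.shape)
```

Unlike `np.loadtxt()`, `pd.read_csv()` infers a type per column, so `quality` stays an integer column while the other two are floats.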

## Chapter 2: Importing data from other file types

### Other file types

* Excel spreadsheets
* MATLAB files
* SAS
* Stata files
* HDF5 files

#### Pickled files
* File type native to Python
* Motivation: many datatypes for which it isn't obvious how to store them
* Pickled files are serialized
* Serialize = convert object to bytestream

In [None]:
import pickle

# specify that the file is read-only and binary by using 'rb'
with open('pickled_fruit.pkl', 'rb') as file:
    data = pickle.load(file)
    
print(data)
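Since `pickled_fruit.pkl` may not be at hand, here is a sketch of the full round trip: serialize an object to a file with `pickle.dump()`, then load it back with `pickle.load()`. The dictionary contents and the file name `fruit.pkl` are made up for illustration.

```python
import os
import pickle
import tempfile

# A made-up Python object that has no obvious flat-file representation.
fruit = {'apple': 3, 'pear': [1, 2], 'ripe': True}

path = os.path.join(tempfile.mkdtemp(), 'fruit.pkl')  # hypothetical file name

# 'wb' = write, binary: pickle produces a byte stream, not text.
with open(path, 'wb') as f:
    pickle.dump(fruit, f)

# 'rb' = read, binary.
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == fruit, restored is fruit)
```

The restored object is an equal copy rather than the same object. Only unpickle files from sources you trust, because loading a pickle can execute arbitrary code.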

#### Importing Excel spreadsheets

In [16]:
import pandas as pd
file = 'datasets/battledeath.xlsx'
data = pd.ExcelFile(file)
print(data.sheet_names)
# df1 = data.parse('sheet_name1')
df1 = data.parse(0) # sheet index

['2002', '2004']


#### SAS and Stata files
* SAS: Statistical Analysis System
    * business analytics and biostatistics
    * Used for:
        * Advanced analytics
        * Multivariate analysis
        * Business intelligence
        * Data management
        * Predictive analytics
    * Standard for computational analysis
    * The most common SAS files have the extensions `.sas7bdat` (dataset files) and `.sas7bcat` (catalog files). A dataset file can be imported as shown below:

In [17]:
import pandas as pd

from sas7bdat import SAS7BDAT

with SAS7BDAT('datasets/sales.sas7bdat') as file:
    df_sas = file.to_data_frame()
    
df_sas.head()

| | YEAR | P | S |
|---|---|---|---|
| 0 | 1950.0 | 12.9 | 181.899994 |
| 1 | 1951.0 | 11.9 | 245.0 |
| 2 | 1952.0 | 10.7 | 250.199997 |
| 3 | 1953.0 | 11.3 | 265.899994 |
| 4 | 1954.0 | 11.2 | 248.5 |


    
* Stata: "Statistics" + "data"
    * academic social sciences research

In [18]:
import pandas as pd

data = pd.read_stata('datasets/disarea.dta')
data.head()

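| | wbcode | country | disa1 | disa2 | disa3 | disa4 | disa5 | disa6 | disa7 | disa8 | ... | disa16 | disa17 | disa18 | disa19 | disa20 | disa21 | disa22 | disa23 | disa24 | disa25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AFG | Afghanistan | 0.0 | 0.0 | 0.76 | 0.73 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.02 | 0.0 | 0.0 |
| 1 | AGO | Angola | 0.32 | 0.02 | 0.56 | 0.0 | 0.0 | 0.0 | 0.56 | 0.0 | ... | 0.0 | 0.4 | 0.0 | 0.61 | 0.0 | 0.0 | 0.99 | 0.98 | 0.61 | 0.0 |
| 2 | ALB | Albania | 0.0 | 0.0 | 0.02 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.16 |
| 3 | ARE | United Arab Emirates | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | ARG | Argentina | 0.0 | 0.24 | 0.24 | 0.0 | 0.0 | 0.23 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.05 | 0.0 | 0.0 | 0.01 | 0.0 | 0.11 |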


#### HDF5 files
* Hierarchical Data Format version 5
* Standard for storing large quantities of numerical data
* Datasets can be hundreds of gigabytes or terabytes
* HDF5 can scale to exabytes

In [19]:
import h5py
filename = 'datasets/LIGO.hdf5'
data = h5py.File(filename, 'r') # 'r' is to read
print(type(data))

<class 'h5py._hl.files.File'>


In [20]:
for key in data.keys():
    print(key)

meta
quality
strain


In [21]:
for key in data['meta'].keys():
    print(key)

Description
DescriptionURL
Detector
Duration
GPSstart
Observatory
Type
UTCstart
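The group and dataset navigation used in these cells can be tried without the LIGO file. This sketch writes a tiny HDF5 file and reads it back; the layout only loosely mirrors the one above, and the file name `demo.hdf5`, the dataset names and all values are made up for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), 'demo.hdf5')  # hypothetical file name

# Write a small file with two groups, each holding one dataset.
with h5py.File(path, 'w') as f:
    meta = f.create_group('meta')
    meta.create_dataset('Duration', data=4096)
    strain = f.create_group('strain')
    strain.create_dataset('Strain', data=np.linspace(0.0, 1.0, 5))

# Read it back: groups behave like dictionaries, datasets like arrays.
with h5py.File(path, 'r') as f:
    groups = sorted(f.keys())
    meta_keys = list(f['meta'].keys())
    duration = f['meta']['Duration'][()]   # [()] reads a scalar dataset
    samples = f['strain']['Strain'][:]     # slicing copies data into NumPy

print(groups, meta_keys, duration, samples.shape)
```

Values must be read while the file is open; after the `with` block only the copies (`duration`, `samples`) remain usable.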


In [22]:
print(data['meta']['Description'][()], data['meta']['Detector'][()]) # [()] reads the dataset; the older .value attribute was removed in h5py 3.0

b'Strain data time series from LIGO' b'L1'


#### MATLAB
* "Matrix Laboratory"
* Industry standard in engineering and science
* Saved as .mat files

In [None]:
import scipy.io
filename = 'workspace.mat'
mat = scipy.io.loadmat(filename)
print(type(mat))
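`scipy.io.loadmat()` returns a dictionary whose keys are MATLAB variable names and whose values are the objects assigned to them. Because `workspace.mat` may not be available, this sketch first writes a small file with `scipy.io.savemat()` and then loads it; the variable name `x` and the file name are made up for illustration.

```python
import os
import tempfile

import numpy as np
import scipy.io

path = os.path.join(tempfile.mkdtemp(), 'workspace_demo.mat')  # hypothetical name

# Save one made-up variable to a .mat file.
scipy.io.savemat(path, {'x': np.arange(3)})

# Load it back as a dictionary.
mat = scipy.io.loadmat(path)

# Keys that start with '__' are file metadata, not MATLAB variables.
variable_names = [key for key in mat if not key.startswith('__')]

print(type(mat))
print(variable_names, mat['x'].shape)
```

MATLAB has no 1D arrays, so the vector comes back as a 2D array with a single row.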

## Chapter 3: Introduction to Relational Databases

#### Relational Database Management Systems
* PostgreSQL
* MySQL
* SQLite
* SQL = Structured Query Language

#### Creating a database engine
* SQLite database
    * Fast and simple
    
* SQLAlchemy
    * Works with many Relational Database Management Systems

In [None]:
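from sqlalchemy import create_engine, inspect

# connection string: the type of database, then the path to the database file
engine = create_engine('sqlite:///datasets/Chinook.sqlite')

# engine.table_names() was removed in SQLAlchemy 2.0; the inspector works in both 1.4 and 2.0
table_names = inspect(engine).get_table_names()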

print(table_names)

#### Connecting to the database

In [None]:
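from sqlalchemy import text

con = engine.connect()

# SQLAlchemy 2.0 requires raw SQL strings to be wrapped in text()
rs = con.execute(text("SELECT * FROM ORDERS"))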

df = pd.DataFrame(rs.fetchall())

df.columns = rs.keys()

con.close()

df.head()
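The connect, query, fetch, label, close workflow can be sketched end to end with an in-memory database. To stay self-contained, this example swaps SQLAlchemy for Python's built-in `sqlite3` module, so the calls differ slightly from the cell above (column names come from `cursor.description` rather than `rs.keys()`). The table contents are made up.

```python
import sqlite3

import pandas as pd

# Build a throwaway in-memory database with a made-up Orders table.
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE Orders (OrderID INTEGER, ShipName TEXT)')
con.executemany('INSERT INTO Orders VALUES (?, ?)',
                [(1, 'Alpha'), (2, 'Beta'), (3, 'Gamma')])

# Query, fetch every row, and read the column names from the cursor.
cur = con.execute('SELECT * FROM Orders')
rows = cur.fetchall()
columns = [col[0] for col in cur.description]
con.close()

# Rows are plain tuples, so the column names must be supplied separately.
df = pd.DataFrame(rows, columns=columns)
print(df)
```

Replacing `fetchall()` with `fetchmany(2)` would return only the first two rows, which is useful for previewing a large table.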

#### Using the context manager

In [None]:
from sqlalchemy import create_engine, text

import pandas as pd

engine = create_engine('sqlite:///Northwind.sqlite')

with engine.connect() as con:
    # SQLAlchemy 2.0 requires raw SQL strings to be wrapped in text()
    rs = con.execute(text("SELECT OrderID, OrderDate, ShipName FROM Orders"))
    df = pd.DataFrame(rs.fetchmany(size=5))
    df.columns = rs.keys()

#### The pandas way to query

In [None]:
df = pd.read_sql_query("SELECT * FROM Orders", engine)
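`pd.read_sql_query()` collapses the execute, fetch and label steps into one call. As a runnable sketch, the example below passes it a connection from Python's built-in `sqlite3` module instead of a SQLAlchemy engine (pandas accepts both for SQLite); the table and its rows are made up.

```python
import sqlite3

import pandas as pd

# Build a throwaway in-memory database with a made-up Orders table.
con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE Orders (OrderID INTEGER, ShipName TEXT)')
con.executemany('INSERT INTO Orders VALUES (?, ?)',
                [(1, 'Alpha'), (2, 'Beta'), (3, 'Gamma')])
con.commit()

# One call runs the query and returns a labeled DataFrame.
df = pd.read_sql_query(
    'SELECT OrderID, ShipName FROM Orders WHERE OrderID > 1', con)
con.close()

print(df)
```

The column names come straight from the query, so no separate `df.columns = ...` step is needed.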