# Elements of Data Science: A First Course 

# COMS W4995 007 2018 3


## Week 2: Data Processing and Delivery: ETL and API

Reference
 - PDSH Chapters 2 and 3
 
Reading
 - Visualization with Matplotlib
     - General Matplotlib Tips
     - Simple Line Plots
     - Simple Scatter Plots
     - Multiple Subplots
     - Text and Annotation
     - Visualization with Seaborn

<img src="images/ds_heirarchy_of_needs2.png"/>

## ETL

<img src="images/etl_diagram.png?2"/>

## Extract

**excel**: `*.xls`, `*.xlsx`

**csv**: one row per record, delimited, possible header

    lastname,purchase_date,stars,price,favorite_flower
    PERKINS,2017-04-08,5,19.599885954165785,iris
    ROBINSON,2017-01-01,5,37.983903616820925,lilac
     

**json**: nested key/value pairs, like a Python dictionary
<pre>
{
  "colors": [
    {
      "color": "black",
      "category": "hue",
      "type": "primary",
      "code": {
        "rgba": [0,0,0,1],
        "hex": "#000"
      }
    },
    {
      "color": "white",
      "category": "value",
      "code": {
        "rgba": [255,255,255,1],
        "hex": "#FFF"
      }
    },
</pre>

## Read in using pandas

```python
pd.read_excel() 
pd.read_csv() 
pd.read_json() 
pd.read_html()
pd.read_sql()
```
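As a minimal sketch of the csv case, the two sample records shown earlier can be read from an in-memory string instead of a file. The variable names `csv_text` and `flowers` are illustrative assumptions, not part of the course data.

```python
import io
import pandas as pd

# The two sample csv records from above, held in memory instead of a file.
csv_text = """lastname,purchase_date,stars,price,favorite_flower
PERKINS,2017-04-08,5,19.599885954165785,iris
ROBINSON,2017-01-01,5,37.983903616820925,lilac
"""

# parse_dates converts the purchase_date strings into datetime values.
flowers = pd.read_csv(io.StringIO(csv_text), parse_dates=["purchase_date"])
print(flowers.dtypes)
print(flowers.shape)
```

Passing a file path instead of the `StringIO` object works the same way.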

#### Result is a dataframe, but first, arrays and matrices!

### Data Types

-  Continuous
    -  real, numeric, `float`
    -  Ex: 1.0, 32.34, $\pi$

-  Discrete
    - count, `int`
    - Ex: 1, 201, 0, -5

-  Categorical
    -  factor
    -  Ex: red/green/blue, flower_type

-  Ordinal
    -  ordered factor
    -  Ex: 5 star rating, small/medium/large

-  Binary
    -  boolean, indicator, `bool`
    -  Ex: 0/1, True/False, good/bad, positive/negative, Heads/Tails
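One way these types might map onto pandas dtypes is sketched below. The `orders` table and its column names are hypothetical; the point is only that categorical and ordinal data can be stored as (ordered) `category` columns.

```python
import pandas as pd

# Hypothetical records, one column per data type listed above.
orders = pd.DataFrame({
    "price": [19.6, 37.98, 22.18],            # continuous -> float
    "quantity": [1, 3, 2],                    # discrete -> int
    "flower": ["iris", "lilac", "iris"],      # categorical
    "size": ["small", "large", "medium"],     # ordinal
    "is_gift": [True, False, True],           # binary -> bool
})

# Unordered factor.
orders["flower"] = orders["flower"].astype("category")

# Ordered factor: the category order is stated explicitly.
orders["size"] = pd.Categorical(
    orders["size"], categories=["small", "medium", "large"], ordered=True
)

print(orders.dtypes)

# Ordered categories support comparisons.
bigger_than_small = (orders["size"] > "small").tolist()
print(bigger_than_small)
```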

## Dataset Structure

### Rows

aka: case, example, instance, observation, sample

### Columns

aka: features, predictors, independent variables

### Label

aka: outcome, target, dependent variable

## Lists and Arrays

- in built-in Python

In [2]:
L = [5,6,7,8]
print(type(L))

<class 'list'>


- in NumPy

In [3]:
import numpy as np
A = np.array([5,6,7,8])
print(type(A))

<class 'numpy.ndarray'>


### Indexing into arrays

In [4]:
print(L[0], A[0])

5 5


In [5]:
print(L[-1], A[-1])

8 8


In [6]:
print(L[:2], A[:2])

[5, 6] [5 6]


### Why numpy array instead of list: Reason 1, Indexing

Goal: select the first and last elements in one step

In [7]:
try:
    
    L[[0,-1]]

except TypeError as err:
    print('{}: {}'.format(type(err),err))

<class 'TypeError'>: list indices must be integers or slices, not list


In [8]:
A[[0,-1]]

array([5, 8])

Index using a boolean mask

In [9]:
A < 7

array([ True,  True, False, False])

In [10]:
A[A < 7]

array([5, 6])
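Masks can also be combined. The sketch below reuses the same four-element array; the names `between`, `outside` and `not_small` are illustrative.

```python
import numpy as np

A = np.array([5, 6, 7, 8])

# Combine conditions element-wise with & (and), | (or), ~ (not).
# Each comparison needs its own parentheses because & and | bind tighter than < and >.
between = A[(A > 5) & (A < 8)]
outside = A[(A < 6) | (A > 7)]
not_small = A[~(A < 7)]

print(between, outside, not_small)
```

The Python keywords `and` / `or` do not work here, since they try to reduce each whole array to a single truth value.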

### Why numpy array instead of list: Reason 2, Ufuncs

In [11]:
# aside: defining a function

def square(x):
    return x**2

In [12]:
L = list(range(100000))
print(type(L))

<class 'list'>


In [13]:
# Aside: list comprehensions
squares = [square(x) for x in L]

In [14]:
%timeit [square(x) for x in L]

29.4 ms ± 882 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [15]:
A = np.arange(100000)
print(type(A))

<class 'numpy.ndarray'>


In [16]:
%timeit [square(x) for x in A]

25.1 ms ± 683 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [17]:
# using a ufunc: ** on an array is vectorized (np.power)
%timeit A**2

47.7 µs ± 7.99 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
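For reference, a short sketch of calling ufuncs by name rather than through operators. `np.square` takes one input (unary), while `np.add` and `np.power` take two (binary).

```python
import numpy as np

A = np.arange(5)            # array([0, 1, 2, 3, 4])

squared = np.square(A)      # unary ufunc: one input array
summed = np.add(A, 10)      # binary ufunc: same as A + 10
powered = np.power(A, 2)    # binary ufunc: same as A ** 2

print(squared, summed, powered)
```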


In [18]:
try:
    
    L**2

except TypeError as err:
    print('{}: {}'.format(type(err), err))

<class 'TypeError'>: unsupported operand type(s) for ** or pow(): 'list' and 'int'


## Matrices

### Indexing into matrices

In [19]:
L = [[5,6],[7,8]]
print(L)

[[5, 6], [7, 8]]


In [20]:
A = np.array([[5,6],[7,8]])
print(A)

[[5 6]
 [7 8]]


Get the element in the first row, first column

In [21]:
L[0][0]

5

In [22]:
A[0,0]

5

Get values in the 2nd column

In [23]:
[x[1] for x in L]

[6, 8]

In [24]:
A[:,1]

array([6, 8])

In [25]:
# Bonus: getting a matrix's shape
A.shape

(2, 2)

## Pandas Series and DataFrames

### Series

In [26]:
import pandas as pd

In [27]:
S = pd.Series(np.random.randint(100,size=1000))

In [28]:
S.head()

0    91
1    31
2    54
3    39
4    94
dtype: int64

In [29]:
S.index

RangeIndex(start=0, stop=1000, step=1)

In [30]:
S

0      91
1      31
2      54
3      39
4      94
5      58
6      89
7      71
8      86
9      53
10     79
11     51
12     32
13     69
14     98
15     96
16     28
17     36
18     47
19     44
20     39
21     34
22     89
23     75
24     81
25     19
26     39
27     94
28     13
29     10
       ..
970    12
971    51
972     4
973    62
974    60
975     5
976    83
977    20
978    11
979    21
980    67
981    61
982    56
983    49
984    99
985    76
986    89
987    83
988    48
989    79
990     8
991    14
992    70
993    15
994    13
995    67
996    89
997    55
998    67
999    50
Length: 1000, dtype: int64

In [31]:
S = pd.Series(np.random.randint(100,size=5),index=['A','B','C','D','E'])
S

A    16
B    39
C    68
D    72
E    55
dtype: int64

## DataFrame

In [32]:
np.random.seed(seed=123)

In [33]:
df = pd.DataFrame(np.random.randint(50,100,size=(10,10)))

In [34]:
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,95,52,78,84,88,67,69,92,72,83
1,82,99,97,59,82,96,82,97,75,69
2,64,86,82,66,54,99,53,52,70,89
3,52,70,97,98,57,91,85,78,88,83
4,71,80,77,84,83,62,90,53,92,55


### Rename columns

In [35]:
columns = {x:'col' + str(x) for x in range(10)}

In [36]:
columns

{0: 'col0',
 1: 'col1',
 2: 'col2',
 3: 'col3',
 4: 'col4',
 5: 'col5',
 6: 'col6',
 7: 'col7',
 8: 'col8',
 9: 'col9'}

In [37]:
df.rename(columns, axis=1, inplace=True)

In [38]:
df.head()

Unnamed: 0,col0,col1,col2,col3,col4,col5,col6,col7,col8,col9
0,95,52,78,84,88,67,69,92,72,83
1,82,99,97,59,82,96,82,97,75,69
2,64,86,82,66,54,99,53,52,70,89
3,52,70,97,98,57,91,85,78,88,83
4,71,80,77,84,83,62,90,53,92,55


### Rename rows

In [39]:
rows = {x:'row'+str(x) for x in range(10)}

In [40]:
df.rename(rows,axis=0,inplace=True)

In [41]:
df.head()

Unnamed: 0,col0,col1,col2,col3,col4,col5,col6,col7,col8,col9
row0,95,52,78,84,88,67,69,92,72,83
row1,82,99,97,59,82,96,82,97,75,69
row2,64,86,82,66,54,99,53,52,70,89
row3,52,70,97,98,57,91,85,78,88,83
row4,71,80,77,84,83,62,90,53,92,55


### Indexing into a dataframe

  - by label (`.loc`)
  - by integer position (`.iloc`)

### Index by labels using .loc[]

In [42]:
df.loc['row0','col0']

95

In [43]:
df.loc['row5',:]

col0    50
col1    61
col2    84
col3    60
col4    72
col5    63
col6    68
col7    86
col8    65
col9    93
Name: row5, dtype: int64

In [44]:
df.loc[['row5','row6'],:]

Unnamed: 0,col0,col1,col2,col3,col4,col5,col6,col7,col8,col9
row5,50,61,84,60,72,63,68,86,65,93
row6,77,94,80,56,95,76,66,56,64,89


### Index by location with .iloc[]

In [45]:
df.iloc[0,0]

95

In [46]:
df.iloc[0,:]

col0    95
col1    52
col2    78
col3    84
col4    88
col5    67
col6    69
col7    92
col8    72
col9    83
Name: row0, dtype: int64

In [47]:
df.iloc[:2,:]

Unnamed: 0,col0,col1,col2,col3,col4,col5,col6,col7,col8,col9
row0,95,52,78,84,88,67,69,92,72,83
row1,82,99,97,59,82,96,82,97,75,69
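One difference between the two indexers that is easy to miss: label slices include the end label, while positional slices exclude the end position. A minimal sketch, assuming a small hypothetical frame named `small`:

```python
import numpy as np
import pandas as pd

# Hypothetical 3x2 frame with string labels on both axes.
small = pd.DataFrame(
    np.arange(6).reshape(3, 2),
    index=["row0", "row1", "row2"],
    columns=["col0", "col1"],
)

by_label = small.loc["row0":"row1", "col0"]   # label slice: end label included
by_position = small.iloc[0:1, 0]              # position slice: end excluded

print(by_label.tolist())
print(by_position.tolist())
```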


### Using a boolean mask

In [48]:
df.col0 > 80

row0     True
row1     True
row2    False
row3    False
row4    False
row5    False
row6    False
row7    False
row8    False
row9    False
Name: col0, dtype: bool

In [49]:
df.loc[df.col0 > 80]

Unnamed: 0,col0,col1,col2,col3,col4,col5,col6,col7,col8,col9
row0,95,52,78,84,88,67,69,92,72,83
row1,82,99,97,59,82,96,82,97,75,69


In [50]:
# want col0, col3 and col8 of rows where col0 > 50 and col3 < 80

df.loc[(df.col0 > 50) & (df.col3 < 80),['col0','col3','col8']]

Unnamed: 0,col0,col3,col8
row1,82,59,75
row2,64,66,70
row6,77,56,64


### DataFrame axes

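- axis=0 runs along the rows (think down): a reduction such as `df.mean(axis=0)` returns one value per column

- axis=1 runs along the columns (think right): a reduction such as `df.mean(axis=1)` returns one value per row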
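A small sketch of the two axes in action, using a hypothetical frame `grid`:

```python
import pandas as pd

grid = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

col_totals = grid.sum(axis=0)   # move down the rows: one total per column
row_totals = grid.sum(axis=1)   # move right across the columns: one total per row

print(col_totals.tolist())
print(row_totals.tolist())
```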

### Reading a csv into a DataFrame

csv: comma separated values

In [51]:
df = pd.read_csv('../data/week1_flowershop_data.csv',
                 header=0,
                 parse_dates=['purchase_date'],
                 delimiter=',')

In [52]:
df.head()

Unnamed: 0,lastname,purchase_date,stars,price,favorite_flower
0,PERKINS,2017-04-08,5,19.599886,iris
1,ROBINSON,2017-01-01,5,37.983904,lilac
2,WILLIAMSON,2017-03-20,4,19.339138,carnation
3,ROBINSON,2017-04-12,5,18.140616,lilac
4,RHODES,2017-03-24,1,22.179522,carnation


### Representing missing data: NaN

In [53]:
np.nan

nan

In [54]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
lastname           1000 non-null object
purchase_date      1000 non-null datetime64[ns]
stars              1000 non-null int64
price              978 non-null float64
favorite_flower    1000 non-null object
dtypes: datetime64[ns](1), float64(1), int64(1), object(2)
memory usage: 39.1+ KB


In [55]:
df[df.price.isnull()].head()

Unnamed: 0,lastname,purchase_date,stars,price,favorite_flower
20,CLARK,2017-01-05,3,,gardenia
41,PETERS,2017-02-01,4,,orchid
54,GREEN,2017-02-13,5,,daffodil
63,BARNETT,2017-08-27,4,,gardenia
145,CARROLL,2017-07-29,3,,tulip


### From dataframe back to matrix

In [56]:

df[df.price.isnull()].head().values

array([['CLARK', Timestamp('2017-01-05 00:00:00'), 3, nan, 'gardenia'],
       ['PETERS', Timestamp('2017-02-01 00:00:00'), 4, nan, 'orchid'],
       ['GREEN', Timestamp('2017-02-13 00:00:00'), 5, nan, 'daffodil'],
       ['BARNETT', Timestamp('2017-08-27 00:00:00'), 4, nan, 'gardenia'],
       ['CARROLL', Timestamp('2017-07-29 00:00:00'), 3, nan, 'tulip']],
      dtype=object)

## Transform

### Dealing with Missing Data

#### Method 1: Drop rows

In [57]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
lastname           1000 non-null object
purchase_date      1000 non-null datetime64[ns]
stars              1000 non-null int64
price              978 non-null float64
favorite_flower    1000 non-null object
dtypes: datetime64[ns](1), float64(1), int64(1), object(2)
memory usage: 39.1+ KB


In [58]:
df.dropna().info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 978 entries, 0 to 999
Data columns (total 5 columns):
lastname           978 non-null object
purchase_date      978 non-null datetime64[ns]
stars              978 non-null int64
price              978 non-null float64
favorite_flower    978 non-null object
dtypes: datetime64[ns](1), float64(1), int64(1), object(2)
memory usage: 45.8+ KB


`df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)`
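The parameters in that signature change which rows count as "missing". A minimal sketch on a hypothetical frame (the `orders` data is made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in different columns.
orders = pd.DataFrame({
    "price": [10.5, np.nan, 12.0, np.nan],
    "stars": [5, 4, np.nan, np.nan],
    "flower": ["iris", "lilac", "tulip", None],
})

any_missing = orders.dropna()                  # drop rows with any missing value
all_missing = orders.dropna(how="all")         # drop rows where every value is missing
price_only = orders.dropna(subset=["price"])   # only consider the price column
at_least_two = orders.dropna(thresh=2)         # keep rows with >= 2 non-null values

print(len(any_missing), len(all_missing), len(price_only), len(at_least_two))
```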

#### Method 2: Fill forward (ffill) or backward (bfill)

In [59]:
df.iloc[18:22]

Unnamed: 0,lastname,purchase_date,stars,price,favorite_flower
18,ROBINSON,2017-06-14,2,10.547645,lilac
19,RUIZ,2017-07-29,4,20.451789,iris
20,CLARK,2017-01-05,3,,gardenia
21,HARPER,2017-08-24,2,10.525912,tulip


In [60]:
df.iloc[18:22].ffill()

Unnamed: 0,lastname,purchase_date,stars,price,favorite_flower
18,ROBINSON,2017-06-14,2,10.547645,lilac
19,RUIZ,2017-07-29,4,20.451789,iris
20,CLARK,2017-01-05,3,20.451789,gardenia
21,HARPER,2017-08-24,2,10.525912,tulip


#### Method 3: Impute the value

In [61]:
df.price.fillna(df.price.mean()).iloc[18:22]

18    10.547645
19    20.451789
20    23.403241
21    10.525912
Name: price, dtype: float64
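The mean is only one possible choice of fill value. The sketch below shows two alternatives on hypothetical data: the overall median, and the mean of each row's own group. Which one is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with two missing entries.
prices = pd.DataFrame({
    "flower": ["iris", "iris", "tulip", "tulip", "tulip"],
    "price": [20.0, np.nan, 10.0, 12.0, np.nan],
})

# Fill with the overall median (less sensitive to outliers than the mean).
overall = prices["price"].fillna(prices["price"].median())

# Fill with the mean price of the same flower.
group_mean = prices.groupby("flower")["price"].transform("mean")
by_group = prices["price"].fillna(group_mean)

print(overall.tolist())
print(by_group.tolist())
```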

### Concatenate, Append and Join

In [62]:
A = pd.DataFrame(np.random.randint(0,10,size=(5,3)), columns=['col0','col1','col2'])
B = pd.DataFrame(np.random.rand(3,2), columns=['col0','col1'])

In [63]:
A

Unnamed: 0,col0,col1,col2
0,6,9,7
1,6,3,9
2,6,6,6
3,1,3,4
4,3,1,0


In [64]:
B

Unnamed: 0,col0,col1
0,0.309884,0.507204
1,0.280793,0.763837
2,0.108542,0.511655


### Append

In [65]:
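# Note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat (next section) is the replacement
A.append(B, sort=False)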

Unnamed: 0,col0,col1,col2
0,6.0,9.0,7.0
1,6.0,3.0,9.0
2,6.0,6.0,6.0
3,1.0,3.0,4.0
4,3.0,1.0,0.0
0,0.309884,0.507204,
1,0.280793,0.763837,
2,0.108542,0.511655,


### Concatenate

In [66]:
pd.concat([A,B], sort=False)

Unnamed: 0,col0,col1,col2
0,6.0,9.0,7.0
1,6.0,3.0,9.0
2,6.0,6.0,6.0
3,1.0,3.0,4.0
4,3.0,1.0,0.0
0,0.309884,0.507204,
1,0.280793,0.763837,
2,0.108542,0.511655,


In [67]:
pd.concat([A,B],axis=1)

Unnamed: 0,col0,col1,col2,col0.1,col1.1
0,6,9,7,0.309884,0.507204
1,6,3,9,0.280793,0.763837
2,6,6,6,0.108542,0.511655
3,1,3,4,,
4,3,1,0,,


### Join

In [68]:
A.join(B,how='outer',lsuffix='_L',rsuffix='_B')

Unnamed: 0,col0_L,col1_L,col2,col0_B,col1_B
0,6,9,7,0.309884,0.507204
1,6,3,9,0.280793,0.763837
2,6,6,6,0.108542,0.511655
3,1,3,4,,
4,3,1,0,,


In [69]:
A.join(B,how='inner',lsuffix='_L',rsuffix='_B')

Unnamed: 0,col0_L,col1_L,col2,col0_B,col1_B
0,6,9,7,0.309884,0.507204
1,6,3,9,0.280793,0.763837
2,6,6,6,0.108542,0.511655


### Manipulating indices

In [70]:
A.sort_values(by='col0')

Unnamed: 0,col0,col1,col2
3,1,3,4
4,3,1,0
0,6,9,7
1,6,3,9
2,6,6,6


In [71]:
A.sort_values(by='col0').reset_index()

Unnamed: 0,index,col0,col1,col2
0,3,1,3,4
1,4,3,1,0
2,0,6,9,7
3,1,6,3,9
4,2,6,6,6


In [72]:
A.set_index('col2')

col2,col0,col1
7,6,9
9,6,3
6,6,6
4,1,3
0,3,1


In [73]:
idx = pd.Index(['A','B','C','D','E'])
A.set_index(idx)

Unnamed: 0,col0,col1,col2
A,6,9,7
B,6,3,9
C,6,6,6
D,1,3,4
E,3,1,0


### Grouping, Aggregating and Summarizing

In [74]:
df.groupby('lastname').price.mean().head()

lastname
ADAMS        38.617753
ALEXANDER    30.106511
ALLEN        26.657993
ALVAREZ      20.676235
ANDERSON     14.653257
Name: price, dtype: float64

In [75]:
df.groupby('favorite_flower').stars.median().head()

favorite_flower
carnation    4.0
daffodil     4.0
daisy        3.5
gardenia     4.0
gerbera      4.0
Name: stars, dtype: float64
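Several summaries can be computed in one pass with named aggregation. A minimal sketch on a hypothetical `sales` table; the output column names (`mean_price`, `median_stars`, `n_orders`) are illustrative.

```python
import pandas as pd

sales = pd.DataFrame({
    "flower": ["iris", "iris", "tulip", "tulip"],
    "stars": [5, 3, 4, 2],
    "price": [20.0, 30.0, 10.0, 14.0],
})

# Each keyword is an output column: (input column, aggregation function).
summary = sales.groupby("flower").agg(
    mean_price=("price", "mean"),
    median_stars=("stars", "median"),
    n_orders=("price", "size"),
)
print(summary)
```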

### Dropping Duplicates

In [76]:
df.loc[:,['lastname','favorite_flower']].drop_duplicates().info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 253 entries, 0 to 984
Data columns (total 2 columns):
lastname           253 non-null object
favorite_flower    253 non-null object
dtypes: object(2)
memory usage: 5.9+ KB


### Creating Dummies for Missing or Categorical Variables

#### Dummies for missing values

In [77]:
df.price.isnull().head()

0    False
1    False
2    False
3    False
4    False
Name: price, dtype: bool

In [78]:
df['price_isnull'] = df.price.isnull()

In [79]:
df.iloc[18:22]

Unnamed: 0,lastname,purchase_date,stars,price,favorite_flower,price_isnull
18,ROBINSON,2017-06-14,2,10.547645,lilac,False
19,RUIZ,2017-07-29,4,20.451789,iris,False
20,CLARK,2017-01-05,3,,gardenia,True
21,HARPER,2017-08-24,2,10.525912,tulip,False


In [80]:
df.price_isnull = df.price_isnull.astype(int)

In [81]:
df.iloc[18:22]

Unnamed: 0,lastname,purchase_date,stars,price,favorite_flower,price_isnull
18,ROBINSON,2017-06-14,2,10.547645,lilac,0
19,RUIZ,2017-07-29,4,20.451789,iris,0
20,CLARK,2017-01-05,3,,gardenia,1
21,HARPER,2017-08-24,2,10.525912,tulip,0


#### Dummies for categorical features

In [82]:
pd.get_dummies(df.favorite_flower).head()

Unnamed: 0,carnation,daffodil,daisy,gardenia,gerbera,iris,jasmine,lilac,orchid,rose,sunflower,tulip
0,0,0,0,0,0,1,0,0,0,0,0,0
1,0,0,0,0,0,0,0,1,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,1,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,0,0


In [83]:
A = np.random.randint(1,4,5)
print(A)

[2 2 1 3 2]


In [84]:
dfA = pd.DataFrame([1,3,2,1])
dfA

Unnamed: 0,0
0,1
1,3
2,2
3,1


In [85]:
pd.get_dummies(dfA[0])

Unnamed: 0,1,2,3
0,1,0,0
1,0,0,1
2,0,1,0
3,1,0,0


In [86]:
pd.get_dummies(dfA[0], drop_first=True)

Unnamed: 0,2,3
0,0,0
1,0,1
2,1,0
3,0,0


### Standardize

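$z = \frac{x - \mu}{\sigma}$ 

or, when $\mu$ and $\sigma$ are estimated from a sample,

$z = \frac{x - \bar{x}}{s}$

`z = (x - np.mean(x)) / np.std(x)`

Note that `np.std` defaults to the population standard deviation (`ddof=0`); pass `ddof=1` to get the sample standard deviation $s$, which is what pandas' `.std()` uses by default.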
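To make the difference between the population and sample versions concrete, here is a small sketch on a made-up array. It only illustrates the effect of the `ddof` argument.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

z_pop = (x - x.mean()) / x.std()           # population std (ddof=0, NumPy default)
z_samp = (x - x.mean()) / x.std(ddof=1)    # sample std s (ddof=1)

print(x.mean(), x.std(), x.std(ddof=1))
print(z_pop.std(), z_samp.std(ddof=1))
```

Each set of z-scores has unit spread only when measured with the same `ddof` that produced it.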

In [87]:
A = np.random.normal(loc=100, scale=50, size=(5))
B = np.random.normal(loc=-5, scale=.1, size=(5))
print(A)
print(B)

[ 60.95824315 177.09179302  70.340075    33.49001535 204.05866417]
[-4.81961117 -5.03799838 -4.93128985 -5.06545081 -4.86984227]


In [88]:
print(np.mean(A), np.mean(B))
print(np.std(A), np.std(B))

109.18775813834819 -4.944838499519225
68.08351287981361 0.09456846935518677


In [89]:
from scipy.stats import zscore
print(zscore(A))
print(zscore(B))

[-0.70838758  0.99736386 -0.5705887  -1.11183662  1.39344905]
[ 1.32419743 -0.98510512  0.14326813 -1.27539672  0.79303627]


In [90]:
print(np.mean(zscore(A)), np.mean(zscore(B)))
print(np.std(zscore(A)), np.std(zscore(B)))

-2.6645352591003756e-16 3.774758283725532e-15
1.0 0.9999999999999999


In [91]:
df['price_z'] = (df.price - df.price.mean()).div(df.price.std())

### Normalize

Rescale so that all values lie between 0 and 1 (min-max scaling): $x' = \frac{x - x_{min}}{x_{max} - x_{min}}$

In [92]:
tmp = (df.price - df.price.min())

In [93]:
df['price_n'] = tmp / tmp.max()

In [94]:
df[['price','price_z','price_n']].describe()

Unnamed: 0,price,price_z,price_n
count,978.0,978.0,978.0
mean,23.403241,-2.728106e-15,0.446128
std,11.209242,1.0,0.316871
min,7.621566,-1.407916,0.0
25%,18.190466,-0.4650426,0.29877
50%,20.117401,-0.2931367,0.353242
75%,38.694134,1.364133,0.878383
max,42.996317,1.74794,1.0
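The two normalization steps above can be wrapped in a small helper. This is a sketch; `min_max` is a hypothetical name, not a pandas function.

```python
import pandas as pd

def min_max(s):
    """Rescale a numeric Series to the [0, 1] range; missing values stay missing."""
    return (s - s.min()) / (s.max() - s.min())

price = pd.Series([10.0, 15.0, None, 30.0])
scaled = min_max(price)
print(scaled.tolist())
```

A constant column would give a zero denominator, so a real pipeline would likely need to handle that case separately.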


### Dealing with date and time
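A minimal sketch of working with datetime columns through the `.dt` accessor, assuming a few hypothetical purchase dates:

```python
import pandas as pd

# Hypothetical purchase dates, parsed from strings.
dates = pd.Series(pd.to_datetime(["2017-04-08", "2017-01-01", "2017-03-20"]))

months = dates.dt.month                       # month number of each date
weekdays = dates.dt.day_name()                # weekday name of each date
days_since = (dates - dates.min()).dt.days    # whole days since the earliest date

print(months.tolist())
print(weekdays.tolist())
print(days_since.tolist())
```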

### Dealing with strings
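A minimal sketch of cleaning text columns through the `.str` accessor, assuming a few hypothetical, inconsistently formatted names:

```python
import pandas as pd

names = pd.Series(["PERKINS", " robinson ", "Rhodes"])

cleaned = names.str.strip().str.title()      # trim whitespace, normalize case
starts_with_r = cleaned.str.startswith("R")  # boolean mask, usable for filtering
lengths = cleaned.str.len()

print(cleaned.tolist())
print(starts_with_r.tolist())
print(lengths.tolist())
```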

### Transform Review

- Deal with missing data

- Concatenate, append and join

- Setting indices

- Grouping, Aggregating and Summarizing

- Dropping duplicates

- Creating dummy values
    - for missing values
    - for categorical variables

- Standardize and Normalize

## Load

- We've got our DataFrame, but we should save it

### Save to csv or pickle

In [95]:
df.to_csv('data_cleaned.csv')

In [96]:
df.to_pickle('data_cleaned.pkl')
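A sketch of a save-and-reload round trip, using a temporary directory and a small hypothetical frame `clean` so that nothing is written next to the notebook:

```python
import os
import tempfile
import pandas as pd

clean = pd.DataFrame({"lastname": ["PERKINS", "ROBINSON"], "stars": [5, 4]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "data_cleaned.csv")
    pkl_path = os.path.join(tmp, "data_cleaned.pkl")

    # index=False keeps the row index out of the file, so reloading
    # does not produce an extra "Unnamed: 0" column.
    clean.to_csv(csv_path, index=False)
    clean.to_pickle(pkl_path)

    from_csv = pd.read_csv(csv_path)
    from_pkl = pd.read_pickle(pkl_path)

print(from_csv.columns.tolist())
print(from_pkl.equals(clean))
```

csv is portable and human-readable but stores everything as text, so types such as dates have to be re-parsed on load; pickle keeps the pandas types but is Python-specific.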

### Save to db? Later

## API

### What is an API?

Application Programming Interface

For us: a tool for taking in requests and returning data

Used for both <b>getting data</b> and <b>delivering data</b>.

### JSON
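Most web APIs return JSON, which maps directly onto Python dictionaries and lists. A minimal sketch using the standard-library `json` module; the `raw` string is a shortened, hypothetical version of the colors document from the Extract section.

```python
import json

# Hypothetical, shortened version of the colors document.
raw = '''
{
  "colors": [
    {"color": "black", "code": {"hex": "#000"}},
    {"color": "white", "code": {"hex": "#FFF"}}
  ]
}
'''

doc = json.loads(raw)        # JSON object -> dict, JSON array -> list
first = doc["colors"][0]     # index into the list, then into nested dicts
print(first["color"], first["code"]["hex"])

# Collect one field from every record.
names = [c["color"] for c in doc["colors"]]
print(names)
```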

### Examples

In [97]:
import requests
from urllib.parse import quote

url = 'http://en.wikipedia.org/w/api.php?action=query&prop=info&format=json&titles='
title = quote('Data Science')  # percent-encode the space for use in a URL
print(title)
resp = requests.get(url+title)

Data%20Science


In [98]:
resp.json()

{'batchcomplete': '',
 'query': {'pages': {'49495124': {'pageid': 49495124,
    'ns': 0,
    'title': 'Data Science',
    'contentmodel': 'wikitext',
    'pagelanguage': 'en',
    'pagelanguagehtmlcode': 'en',
    'pagelanguagedir': 'ltr',
    'touched': '2018-09-20T04:32:13Z',
    'lastrevid': 706007296,
    'length': 26,
    'redirect': '',
    'new': ''}}}}
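The query string can also be built from a dictionary of parameters instead of by hand. The sketch below uses only the standard library and makes no network call; `body` is a trimmed copy of the response shown above, used to illustrate how the nested result can be navigated.

```python
import json
from urllib.parse import urlencode

base = "http://en.wikipedia.org/w/api.php"
params = {"action": "query", "prop": "info", "format": "json", "titles": "Data Science"}

# urlencode escapes the space in the title (as "+"), so no manual replace is needed.
url = base + "?" + urlencode(params)
print(url)

# Trimmed copy of the response above, parsed the same way resp.json() would parse it.
body = '{"query": {"pages": {"49495124": {"pageid": 49495124, "title": "Data Science", "length": 26}}}}'
pages = json.loads(body)["query"]["pages"]

# The page id is a key, not a fixed name, so iterate over the items.
for page_id, page in pages.items():
    print(page_id, page["title"], page["length"])
```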

In [99]:
# Read API credentials from local files (these paths are specific to the author's machine)
with open('/home/bgibson/Downloads/twitter_consumer_key.txt') as f:
    consumer_key = f.read().strip()
with open('/home/bgibson/Downloads/twitter_consumer_secret.txt') as f:
    consumer_secret = f.read().strip()
with open('/home/bgibson/Downloads/twitter_access_token.txt') as f:
    access_token = f.read().strip()
with open('/home/bgibson/Downloads/twitter_access_token_secret.txt') as f:
    access_token_secret = f.read().strip()

import tweepy
    
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)


In [None]:
# Note: tweepy 4.x renamed API.search to API.search_tweets
public_tweets = api.search('columbia')
for tweet in public_tweets:
    #print(type(tweet))
    print('-------')
    print(tweet.text)

In [None]:
tweet._json