<hr/>

# Intro to Data Science 

01/31/2020

**TA** - Dapeng Yao (dyao10@jhu.edu)   <br/>
**Office Hour** - Friday 12:30pm ~ 1:30pm Whitehead 212

- **Python:** NumPy, Pandas
- **Q & A**

<hr/>


[Install Python](https://www.python.org/) <br/>
[Install Anaconda](https://www.continuum.io/downloads)

<h2><font color="darkblue">Python</font></h2>
<hr/>

### NumPy
NumPy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object (`ndarray`) and tools for working with these arrays.

[Tutorial](https://docs.scipy.org/doc/numpy-dev/user/quickstart.html)
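
Before creating arrays with the helper functions in the next cells, it can help to see what an array object carries with it. The following is a minimal sketch; the array `a` and its values are made up for illustration.

```python
import numpy as np

# A small 2x3 array used only for illustration.
a = np.array([[1, 2, 3], [4, 5, 6]])

print(a.ndim)   # number of dimensions
print(a.shape)  # size along each dimension
print(a.size)   # total number of elements
print(a.dtype)  # element type; the default integer width depends on the platform
```

Every element of an array shares the same `dtype`, which is a large part of why array operations are fast compared with Python lists.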

In [1]:
import numpy as np

In [2]:
np.array([1,2,3])

array([1, 2, 3])

In [3]:
np.ones((2,2))

array([[1., 1.],
       [1., 1.]])

In [4]:
np.zeros((2,2))

array([[0., 0.],
       [0., 0.]])

In [5]:
np.eye(2)

array([[1., 0.],
       [0., 1.]])

#### Array Reshape

In [6]:
arr = np.arange(12)
arr

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [7]:
arr.shape

(12,)

In [8]:
arr.reshape((4,3), order='C')   # row major (default)

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

In [9]:
arr.reshape((4,3), order='F')    # column major

array([[ 0,  4,  8],
       [ 1,  5,  9],
       [ 2,  6, 10],
       [ 3,  7, 11]])

In [10]:
arr.reshape((4,-1))

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

In [11]:
arr.reshape((-1,3))

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

In [12]:
arr2 = np.ones((3,4))
arr2.shape

(3, 4)

In [13]:
arr.reshape(arr2.shape)

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [14]:
arr.ravel()      # Return a contiguous flattened array

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])
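
The reshape cells above rely on two behaviours that are easy to miss: `-1` lets NumPy infer one dimension, and `reshape`/`ravel` return views rather than copies when the memory layout allows it. The sketch below illustrates both under the assumption of a freshly created contiguous array; the names `m`, `view`, and `copy` are illustrative only.

```python
import numpy as np

arr = np.arange(12)

# -1 asks NumPy to infer that dimension from the total size (12 / 4 = 3).
m = arr.reshape((4, -1))
print(m.shape)

# For a contiguous array, reshape and ravel return views,
# so writing through the view also changes the original array.
view = m.ravel()
view[0] = 100
print(arr[0])

# flatten always returns a copy, so the original is unaffected.
copy = m.flatten()
copy[1] = -1
print(arr[1])
```

When an independent array is needed, `flatten()` or an explicit `.copy()` avoids accidental modification of the source data.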

#### Array Concatenate and Split

In [15]:
arr1 = np.array([[1,2], [3,4]])
arr2 = np.array([[5,6], [7,8]])
print(arr1)
print(arr2)

[[1 2]
 [3 4]]
[[5 6]
 [7 8]]


In [16]:
np.concatenate([arr1, arr2], axis = 0)   # row bind (default)

array([[1, 2],
       [3, 4],
       [5, 6],
       [7, 8]])

In [17]:
np.vstack((arr1, arr2)) # Stack arrays in sequence vertically

array([[1, 2],
       [3, 4],
       [5, 6],
       [7, 8]])

In [18]:
np.concatenate([arr1, arr2], axis = 1)

array([[1, 2, 5, 6],
       [3, 4, 7, 8]])

In [19]:
np.hstack((arr1, arr2)) # Stack arrays in sequence horizontally

array([[1, 2, 5, 6],
       [3, 4, 7, 8]])

In [20]:
arr = np.arange(1,6)
arr = arr.repeat(10)
arr = arr.reshape((5,-1))
arr

array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
       [2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
       [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
       [4, 4, 4, 4, 4, 4, 4, 4, 4, 4],
       [5, 5, 5, 5, 5, 5, 5, 5, 5, 5]])

In [21]:
np.split(arr,5)

[array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]),
 array([[2, 2, 2, 2, 2, 2, 2, 2, 2, 2]]),
 array([[3, 3, 3, 3, 3, 3, 3, 3, 3, 3]]),
 array([[4, 4, 4, 4, 4, 4, 4, 4, 4, 4]]),
 array([[5, 5, 5, 5, 5, 5, 5, 5, 5, 5]])]

In [22]:
np.split(arr, [2,4]) # split before row indices 2 and 4 (axis 0 by default)

[array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]]),
 array([[3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
        [4, 4, 4, 4, 4, 4, 4, 4, 4, 4]]),
 array([[5, 5, 5, 5, 5, 5, 5, 5, 5, 5]])]

In [23]:
arr1, arr2 = np.vsplit(arr, [2])
print(arr1)
print(arr2)

[[1 1 1 1 1 1 1 1 1 1]
 [2 2 2 2 2 2 2 2 2 2]]
[[3 3 3 3 3 3 3 3 3 3]
 [4 4 4 4 4 4 4 4 4 4]
 [5 5 5 5 5 5 5 5 5 5]]


In [24]:
arr1, arr2 = np.hsplit(arr, 2)
print(arr1)
print(arr2)

[[1 1 1 1 1]
 [2 2 2 2 2]
 [3 3 3 3 3]
 [4 4 4 4 4]
 [5 5 5 5 5]]
[[1 1 1 1 1]
 [2 2 2 2 2]
 [3 3 3 3 3]
 [4 4 4 4 4]
 [5 5 5 5 5]]


In [25]:
arr1, arr2 = np.split(arr,2,axis=1)
print(arr1)
print(arr2)

[[1 1 1 1 1]
 [2 2 2 2 2]
 [3 3 3 3 3]
 [4 4 4 4 4]
 [5 5 5 5 5]]
[[1 1 1 1 1]
 [2 2 2 2 2]
 [3 3 3 3 3]
 [4 4 4 4 4]
 [5 5 5 5 5]]
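
All of the splits above divide the array evenly. When an integer number of sections does not divide the axis length, `np.split` raises an error, while `np.array_split` allows pieces of unequal size. A minimal sketch using the same 5x10 array as above:

```python
import numpy as np

arr = np.arange(1, 6).repeat(10).reshape((5, -1))   # 5 rows, 10 columns

# 5 rows cannot be divided into 3 equal parts, so np.split raises ValueError.
try:
    np.split(arr, 3)
except ValueError as err:
    print("np.split failed:", err)

# np.array_split accepts an uneven division; the earlier pieces get the extra rows.
parts = np.array_split(arr, 3)
print([p.shape for p in parts])
```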


#### Array math

In [26]:
x = np.array([[1,2],[3,4]])
y = np.array([[5,6],[7,8]])

In [27]:
print(x+y)
print(np.add(x,y))

[[ 6  8]
 [10 12]]
[[ 6  8]
 [10 12]]


In [28]:
print(x-y)
print(np.subtract(x,y))

[[-4 -4]
 [-4 -4]]
[[-4 -4]
 [-4 -4]]


In [29]:
print(x/y)
print(np.divide(x,y)) #elementwise

[[0.2        0.33333333]
 [0.42857143 0.5       ]]
[[0.2        0.33333333]
 [0.42857143 0.5       ]]


In [30]:
print(x*y)
print(np.multiply(x,y)) #elementwise

[[ 5 12]
 [21 32]]
[[ 5 12]
 [21 32]]


In [31]:
print(x.dot(y))
print(np.dot(x,y))

[[19 22]
 [43 50]]
[[19 22]
 [43 50]]
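
Note the difference between `*` (elementwise) and `dot` (matrix product) in the two cells above. For 2-D arrays the `@` operator gives the same result as `np.dot`. A short sketch with the same `x` and `y`; the variable names `elementwise` and `matrix_product` are illustrative.

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])

elementwise = x * y      # multiplies matching entries
matrix_product = x @ y   # same result as np.dot(x, y) for 2-D arrays

print(elementwise)
print(matrix_product)
```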


In [32]:
print(np.sum(x))
print(np.sum(x, axis = 0))
print(np.sum(x, axis = 1))

10
[4 6]
[3 7]
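
The `axis` argument names the dimension that is collapsed by the reduction, which is a common source of confusion. The sketch below restates the sums above and adds `keepdims=True`, which keeps the collapsed axis with length 1; the variable names are illustrative.

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])

col_sums = x.sum(axis=0)                    # collapse the rows: one value per column
row_sums = x.sum(axis=1)                    # collapse the columns: one value per row
row_sums_2d = x.sum(axis=1, keepdims=True)  # same values, but with shape (2, 1)

print(col_sums, row_sums)
print(row_sums_2d.shape)
```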


In [33]:
print(x.T)

[[1 3]
 [2 4]]


#### Broadcasting
Broadcasting is the mechanism that lets NumPy perform arithmetic on arrays of different shapes. Typically a smaller array is combined with a larger one, and NumPy reuses the smaller array across the larger one without copying it. This works only when the shapes are compatible: compared from the last dimension backwards, each pair of sizes must be equal or one of them must be 1.

In [34]:
x = np.array([[1,2],[3,4],[5,6]])
y = np.array([0,1])
x + y   #add y to each row of x

array([[1, 3],
       [3, 5],
       [5, 7]])
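
The example above adds a vector to every row. Adding a vector to every column needs one extra step, because the shapes are compared from the trailing dimension. The following sketch shows a failing case and one possible fix; the arrays `row` and `col` and their values are made up for illustration.

```python
import numpy as np

x = np.array([[1, 2], [3, 4], [5, 6]])   # shape (3, 2)
row = np.array([0, 1])                   # shape (2,)
col = np.array([10, 20, 30])             # shape (3,)

# (3, 2) and (2,) are compatible: the trailing sizes match.
by_row = x + row
print(by_row)

# (3, 2) and (3,) are not compatible: the trailing sizes 2 and 3 differ.
try:
    x + col
except ValueError as err:
    print("broadcast failed:", err)

# Adding a length-1 axis gives shape (3, 1), which broadcasts across the columns.
by_col = x + col[:, np.newaxis]
print(by_col)
```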

### Pandas
Pandas is a high-level data manipulation library built on NumPy. Its key data structure is the DataFrame, a table with labeled rows and columns.

[Tutorial](http://pandas.pydata.org/pandas-docs/stable/tutorials.html)

In [35]:
import pandas as pd

#### Categorical Data

In [36]:
fruits = ['apple', 'orange', 'apple', 'apple'] * 2
df = pd.DataFrame({'fruit': fruits,
                   'basket_id': np.arange(len(fruits)),
                   'count': np.random.randint(3, 15, size=len(fruits)),
                   'weight': np.random.uniform(0, 4, size=len(fruits))},
                  columns=['basket_id', 'fruit', 'count', 'weight'])
df

   basket_id   fruit  count    weight
0          0   apple      6  2.971982
1          1  orange      4  1.288316
2          2   apple      5  1.974309
3          3   apple      8  0.469173
4          4   apple     13  2.465817
5          5  orange      6  3.100531
6          6   apple      7  1.404618
7          7   apple     13  3.168726


In [37]:
df['fruit']

0     apple
1    orange
2     apple
3     apple
4     apple
5    orange
6     apple
7     apple
Name: fruit, dtype: object

In [38]:
pd.unique(df['fruit'])

array(['apple', 'orange'], dtype=object)

In [39]:
df['count']

0     6
1     4
2     5
3     8
4    13
5     6
6     7
7    13
Name: count, dtype: int64

In [40]:
df['count'].values

array([ 6,  4,  5,  8, 13,  6,  7, 13])
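
The `fruit` column above holds a few labels repeated many times, stored with the generic `object` dtype. One way to represent such data, sketched here as an illustration rather than as part of the original notebook, is the pandas `category` dtype, which stores each distinct label once plus an integer code per row. The names `fruits` and `cat` are illustrative.

```python
import pandas as pd

fruits = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)

# Convert the object column to the category dtype.
cat = fruits.astype('category')

print(cat.cat.categories)       # the distinct labels
print(cat.cat.codes.tolist())   # integer code for each row
print(cat.value_counts())       # number of rows per label
```

For columns with few distinct values and many rows, the categorical form usually uses less memory than the `object` form.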

#### GroupBy

In [41]:
df = pd.DataFrame({'key': ['a', 'b', 'c'] * 4,
                   'value': np.arange(12)})
df

    key  value
0     a      0
1     b      1
2     c      2
3     a      3
4     b      4
5     c      5
6     a      6
7     b      7
8     c      8
9     a      9
10    b     10
11    c     11


In [42]:
g = df.groupby('key')['value']   # group the rows by 'key', then select the 'value' column
g.mean()

key
a    4.5
b    5.5
c    6.5
Name: value, dtype: float64

In [43]:
type(g)

pandas.core.groupby.groupby.SeriesGroupBy
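
A grouped object such as `g` computes nothing until an aggregation is requested. Besides a single method like `mean()`, several summaries can be requested at once with `agg`. A minimal sketch on the same data; the name `summary` is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key': ['a', 'b', 'c'] * 4,
                   'value': np.arange(12)})

# One row per group, one column per requested aggregation.
summary = df.groupby('key')['value'].agg(['count', 'mean', 'max'])
print(summary)
```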

#### Apply

In [44]:
df = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df

                b         d          e
Utah     1.177292  0.906751   0.818244
Ohio     1.185452  0.188023   2.546156
Texas   -0.878229  0.703432   0.779600
Oregon   0.091415  0.811602  -1.596234


In [45]:
f = lambda x: x.max() - x.min()   # anonymous function

In [46]:
df.apply(f)     # axis=0 (or 'index') is the default: f is applied to each column

b    2.063680
d    0.718728
e    4.142390
dtype: float64

In [48]:
df.apply(f, axis='columns')     # f is applied to each row

Utah      0.359048
Ohio      2.358133
Texas     1.657828
Oregon    2.407836
dtype: float64

In [49]:
g.apply(f)

key
a    9
b    9
c    9
Name: value, dtype: int64
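
For a simple range computation like `f`, `apply` is convenient but not required: the same result can be obtained from the built-in reductions. The sketch below compares the two on a small frame with fixed values chosen for illustration (the random frame above changes on every run).

```python
import pandas as pd

# Fixed values chosen for illustration only.
df = pd.DataFrame([[1.0, 5.0, 2.0],
                   [4.0, 0.0, 8.0]],
                  columns=list('bde'), index=['Utah', 'Ohio'])

f = lambda x: x.max() - x.min()

by_apply = df.apply(f)              # f is called once per column
vectorised = df.max() - df.min()    # same result from built-in reductions
by_row = df.apply(f, axis='columns')

print(by_apply)
print(vectorised)
print(by_row)
```

On large frames the built-in reductions are generally faster, because `apply` calls the Python function once per column or row.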

### References

- Wes McKinney, _Python for Data Analysis_, O'Reilly (2012).