# Data Science Numpy

## Tasks Today:

1) <b>Numpy</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) Python List Comparison <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) In-Class Exercise #1 <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) Importing <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) Creating an NDArray <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - np.array() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - np.zeros() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - np.ones() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - np.arange() <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Making Lists into NDArrays <br>
 &nbsp;&nbsp;&nbsp;&nbsp; e) Performing Calculations on NDArrays <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Summation <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Difference <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Multiplication <br>
 &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; - Division <br>
 &nbsp;&nbsp;&nbsp;&nbsp; f) Numpy Subsetting <br>
 &nbsp;&nbsp;&nbsp;&nbsp; g) Multi-dimensional Arrays <br>
 &nbsp;&nbsp;&nbsp;&nbsp; h) Indexing NDArrays <br>
 &nbsp;&nbsp;&nbsp;&nbsp; i) Checking NDArray Type <br>
 &nbsp;&nbsp;&nbsp;&nbsp; j) Altering NDArray Type <br>
 &nbsp;&nbsp;&nbsp;&nbsp; k) Checking the Shape <br>
 &nbsp;&nbsp;&nbsp;&nbsp; l) Altering the Shape <br>
 &nbsp;&nbsp;&nbsp;&nbsp; m) In-Class Exercise #2 <br>
 &nbsp;&nbsp;&nbsp;&nbsp; n) Complex Indexing & Assigning <br>
 &nbsp;&nbsp;&nbsp;&nbsp; o) Elementwise Multiplication <br>
 &nbsp;&nbsp;&nbsp;&nbsp; p) np.where() <br>
 &nbsp;&nbsp;&nbsp;&nbsp; q) Random Sampling <br>

2) <b>Working With CSV's</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) Imports <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) Reading a CSV <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) Loading a CSV's Data <br>
 &nbsp;&nbsp;&nbsp;&nbsp; d) Checking Number of Records <br>
 
3) <b>Exercises</b> <br>
 &nbsp;&nbsp;&nbsp;&nbsp; a) #1 - Calculate BMI with NDArrays <br>
 &nbsp;&nbsp;&nbsp;&nbsp; b) #2 - Random Matrix Function <br>
 &nbsp;&nbsp;&nbsp;&nbsp; c) #3 - Comparing Boston Red Sox Hitting Numbers <br>

## Numpy <br>

<p>NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.</p>
<ul>
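    <li>Shape = the size of an array along each dimension (rows and columns for a 2-D array)</li>
    <li>Matrix = a two-dimensional array</li>
    <li>Vector = a one-dimensional array of values (the same kind of vector used in physics)</li>
    <li>Array = similar to a list, but every element shares one data type</li>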
</ul>

#### Python List Comparison

<p>Lists are flexible, dynamic Python objects that do their job quite well, but they do not support some mathematical operations in an intuitive way. Consider the summation of two lists, $l_1$ and $l_2$:</p>

In [2]:
# create two lists and sum both of them together (results may not be what you expect)
l_1 = [1, 2, 3]
l_2 = [4, 5, 6]

print(f"Summation of lists = {l_1 + l_2}")
print(f"Difference of lists = {l_1 - l_2}")

Summation of lists = [1, 2, 3, 4, 5, 6]


TypeError: unsupported operand type(s) for -: 'list' and 'list'

<p>If we wanted to sum lists elementwise, we could write our own function that does the job entirely within the framework of Python.</p>

#### In-Class Exercise #1 - Write a function that sums two lists elementwise (index by index) <br>
<p>Ex: [2, 3, 4] + [1, 5, 2] = [3, 8, 6]</p>

In [13]:
def sum_index(list_a, list_b):
    list_c = []
    # only iterate over the indexes that exist in both lists
    for i in range(min(len(list_a), len(list_b))):
        list_c.append(list_a[i] + list_b[i])
    return list_c

# def sum_index(list_a, list_b):
#     # zip stops at the end of the shorter list
#     return [a + b for a, b in zip(list_a, list_b)]

# def sum_index(list_a, list_b):
#     zipped = zip(list_a, list_b)
#     return list(map(lambda x: x[0] + x[1], zipped))

sum_index([2, 3, 4, 5], [1, 5, 2])

[3, 8, 6]

We would have to write a similar function for every operator we might want for list arithmetic. This is time-consuming and inefficient. Moreover, once the lists in question become nested, mimicking the behavior of true matrices, the problem gets worse. Complicated indexing is necessary just to allow for the most basic matrix operations common throughout science and engineering. Imagine writing a matrix multiplication function using Python syntax in a general way, such that it returns a matrix-matrix or matrix-vector product:

\begin{align}
(n \times x) \times (x \times m) \rightarrow (n \times m)
\end{align}

\begin{align}
\begin{bmatrix}
c_{0,0} & ... & c_{0,m} \\
\vdots & \ddots & \vdots \\
c_{n,0} & ... & c_{n,m}
\end{bmatrix}
=
\begin{bmatrix}
a_{0,0} & ... & a_{0,x} \\
\vdots & \ddots & \vdots \\
a_{n,0} & ... & a_{n,x}
\end{bmatrix}
\begin{bmatrix}
b_{0,0} & ... & b_{0,m} \\
\vdots & \ddots & \vdots \\
b_{x,0} & ... & b_{x,m}
\end{bmatrix}
\end{align}


---

The "Dot Product" is where we multiply matching members, then sum up:

(1, 2, 3) • (7, 9, 11) = 1×7 + 2×9 + 3×11
    = 58

We match the 1st members (1 and 7), multiply them, likewise for the 2nd members (2 and 9) and the 3rd members (3 and 11), and finally sum them up.
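The dot product described above can be sketched in plain Python as follows. The function name `dot_product` is made up for this illustration, and the length check is an added assumption about how mismatched inputs should be handled:

```python
def dot_product(u, v):
    """Multiply matching members of two equal-length sequences, then sum."""
    if len(u) != len(v):
        raise ValueError("both vectors must have the same length")
    # zip pairs the 1st members, the 2nd members, and so on
    return sum(a * b for a, b in zip(u, v))


result = dot_product((1, 2, 3), (7, 9, 11))
print(result)  # 1*7 + 2*9 + 3*11 = 58
```

Each element of a matrix product is exactly this kind of dot product, taken between a row of the first matrix and a column of the second.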

<hr>

So, let's instantiate a matrix $\mathcal{M}$ and a vector $\vec{v}$ and write a function that does the multiplication ourselves.

In [15]:
def matrix_multiply(A,B):
    zero_list = [ [0 for i in range(len(B[0]))] for i in range(len(A))]
    inner_dimensions = len(A[0])
    n_dimensions = len(zero_list)
    m_dimensions = len(zero_list[0])
    
    print(zero_list)
    print(inner_dimensions)
    print('\n', n_dimensions)
    print('\n', m_dimensions)
    
    # for each row i of A and column j of B, sum the products over the inner dimension
    for i in range(n_dimensions):
        for j in range(m_dimensions):
            element = 0
            for x in range(inner_dimensions):
                element += A[i][x] * B[x][j]
            zero_list[i][j] = element
            
    return zero_list
    
M = [[0,1,0], [0,2,0],[0,3,0]]
v = [[1],[2],[3]]

matrix_multiply(M,v)

[[0], [0], [0]]
3

 3

 1


[[2], [4], [6]]

#### Importing

In [18]:
# always import as np, standard across all of data science
import numpy as np

#### Creating an NDArray <br>
<p>NumPy is based around a class called the $\textit{NDArray}$, which is a flexible vector / matrix class that implements the intuitive matrix and vector arithmetic lacking in basic Python. Let's start by creating some NDArrays:</p>

###### - np.array()

In [23]:
arr1 = np.array([1, 2, 3])
arr1

array([1, 2, 3])

###### - np.zeros()

In [33]:
# Shape -- np.zeros() has a parameter of "shape" that must be given
arr_zeros = np.zeros((3,3))
arr_zeros

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

###### - np.ones()

In [31]:
arr_ones = np.ones((3,4), int)
arr_ones


# notice no dots after the numbers, because we specified the datatype to be integer

array([[1, 1, 1, 1],
       [1, 1, 1, 1],
       [1, 1, 1, 1]])

###### - np.arange()

In [37]:
arr4 = np.arange(4, 13)
arr4

array([ 4,  5,  6,  7,  8,  9, 10, 11, 12])

###### - Making Lists into NDArrays

In [40]:
l_3 = [7, 8, 9]

arr5 = np.array(l_3)
arr5

array([7, 8, 9])

#### Performing Calculations on NDArrays

###### - Summation

In [41]:
arr_five = np.array([5, 10, 15, 20, 25])
arr_ten = np.array([10, 20, 30, 40, 50])

result = arr_five + arr_ten
result

array([15, 30, 45, 60, 75])

###### - Difference

In [42]:
result = arr_ten - arr_five
result

array([ 5, 10, 15, 20, 25])

###### - Multiplication

In [55]:
result3 = arr_ten * arr_five
result3

array([  50,  200,  450,  800, 1250])

###### - Division

In [44]:
result = arr_ten / arr_five
result

array([2., 2., 2., 2., 2.])

In [47]:
# Modulo
result = arr_five % 2
result

array([1, 0, 1, 0, 1])

In [51]:
# Floor Division
result = arr_ten // arr_five
result

array([2, 2, 2, 2, 2])

#### Numpy Subsetting

In [56]:
# conditional check: returns an array of True / False values
print(result3 <= 800)

# boolean mask: returns only the elements that meet the given condition
print(result3[result3 <= 800])


[ True  True  True  True False]
[ 50 200 450 800]
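Conditions can also be combined before subsetting. The sketch below is an illustration with made-up variable names (`values`, `in_range`, `extremes`); it rebuilds the same numbers as `result3` so that it runs on its own:

```python
import numpy as np

values = np.array([50, 200, 450, 800, 1250])

# & is elementwise "and", | is elementwise "or";
# each condition needs its own parentheses
in_range = values[(values > 100) & (values <= 800)]
extremes = values[(values < 100) | (values > 1000)]

print(in_range.tolist())   # [200, 450, 800]
print(extremes.tolist())   # [50, 1250]
```

Python's `and` / `or` keywords do not work here, because they try to reduce the whole array to a single True or False.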


#### Multi-dimensional Arrays <br>
<p>NumPy seamlessly supports multidimensional arrays and matrices of arbitrary dimension without nesting NDArrays. NDArrays themselves are flexible and extensible and may be defined with such dimensions, with a rich API of common functions to facilitate their use. Let's start by building a two dimensional 3x3 matrix by conversion from a nested group of core python lists $M = [l_0, l_1, l_2]$:</p>

In [60]:
aList = [0,1,2]
bList = [3,4,5]
cList = [6,7,8]

# first step - convert lists into Matrix
M = [aList, bList, cList]
print(f"Nested List Structure {M}")

# cast directly ... dimensions inferred
nDM = np.array(M)
print(f"NDArray Structure \n {nDM}")

Nested List Structure [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
NDArray Structure 
 [[0 1 2]
 [3 4 5]
 [6 7 8]]


#### Indexing NDArrays <br>
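<p>Indexing an NDArray is similar to indexing lists within lists, except that all of the indices go inside a single pair of brackets, separated by commas. For example, nDM[1, 2] accesses the second row, third element (the nested-list equivalent would be M[1][2]).</p>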

In [63]:
# print(M[1][2])
print(nDM[1,2])
print(nDM[2,0])

5
6


#### Assigning Values in NDArrays

In [66]:
nDM[1,1] = 99
nDM

array([[ 0,  1,  2],
       [ 3, 99,  5],
       [ 6,  7,  8]])

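<p>Note that nDM holds integers, so any value assigned to it is converted to an integer. Assigning 1.5 to an element, for example, would store a 1 in the target element's place. This is a data type issue. Every NDArray exposes its data type through the .dtype attribute and supports the .astype() method for casting between data types:</p>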
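A minimal sketch of the data type issue, using small made-up arrays (`int_arr` and `float_arr` are names chosen for this illustration):

```python
import numpy as np

# integer dtype is inferred from the values
int_arr = np.array([[0, 1, 2], [3, 4, 5]])

# the float is truncated so that it fits the integer dtype
int_arr[1, 1] = 1.5
print(int_arr[1, 1])     # 1

# astype returns a converted copy; the original array is unchanged
float_arr = int_arr.astype(np.float64)
float_arr[1, 1] = 1.5
print(float_arr[1, 1])   # 1.5
```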

#### Checking NDArray Type

In [67]:
t = nDM.dtype
t

dtype('int64')

#### Altering NDArray Type

In [69]:
print(f"result before change: {t}")
print(nDM)

changed = nDM.astype(np.float64)

print(f"result after change: {changed.dtype}")
print(changed)

result before change: int64
[[ 0  1  2]
 [ 3 99  5]
 [ 6  7  8]]
result after change: float64
[[ 0.  1.  2.]
 [ 3. 99.  5.]
 [ 6.  7.  8.]]


#### Checking the Shape <br>
<p>The behavior and properties of an NDArray are often sensitively dependent on the $\textit{shape}$ of the NDArray itself. The shape of an array can be found by calling the .shape attribute, which will return a tuple containing the array's dimensions:</p>

In [77]:
# for a 2-D array, shape is (rows, columns)
test = np.ones((4,5))
test.shape

(4, 5)
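The shape tuple has one entry per dimension, so its length changes with the number of dimensions. A short sketch, with variable names chosen only for this illustration:

```python
import numpy as np

vector = np.arange(6)        # 1-D: a single axis of length 6
matrix = np.ones((4, 5))     # 2-D: 4 rows, 5 columns

print(vector.shape)              # (6,)
print(matrix.shape)              # (4, 5)

# ndim is the number of dimensions, size is the total number of elements
print(matrix.ndim, matrix.size)  # 2 20
```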

#### Altering the Shape <br>
<p>As long as the number of elements remains fixed, we can reshape NDArrays at will:</p>

In [78]:
print(f"Reshaped NDArray: {changed.reshape(1,9)}")

Reshaped NDArray: [[ 0.  1.  2.  3. 99.  5.  6.  7.  8.]]


In [79]:
# keep in mind that the shape numbers do matter, (9, 1) is different than (1, 9)
print(f"Reshaped NDArray: {changed.reshape(9,1)}")

Reshaped NDArray: [[ 0.]
 [ 1.]
 [ 2.]
 [ 3.]
 [99.]
 [ 5.]
 [ 6.]
 [ 7.]
 [ 8.]]
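The rule that the number of elements must stay fixed can be checked directly. In this sketch (variable names are made up for the illustration), -1 asks NumPy to work out one dimension itself, and a shape that does not fit raises an error:

```python
import numpy as np

arr = np.arange(9)

# -1 tells NumPy to infer that dimension from the total element count
inferred = arr.reshape(3, -1)
print(inferred.shape)   # (3, 3)

# 9 elements cannot fill a 2x4 layout, so reshape raises a ValueError
try:
    arr.reshape(2, 4)
except ValueError as err:
    print(f"reshape failed: {err}")
```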


#### In-Class Exercise #2 - Create an array of the values 0 up to 16 (exclusive) and reshape it into a 4x4 matrix

In [82]:
arr_1 = np.arange(0,16)
print(arr_1)
re_shape = arr_1.reshape((4,4))
re_shape

[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]


array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

#### Complex Indexing & Assigning

In [89]:
a = [1, 2, 3, 4, 5]
a[:]

[1, 2, 3, 4, 5]

In [99]:
M = np.zeros((6,6))
print("Before assigning any new values:")
print(M)

# Set every first element in each row to 1
M[:,0] = 1
print("\nAfter assigning 1 to first element in each row:")
print(M)

# Set all elements in rows 3 through 5 to 5
M[2:5, :] = 5
print("\nAfter assigning 5 to rows 3-5:")
print(M)

# Reset all elements back to 0
M = M * 0
print("\nAfter resetting back to 0:")
print(M)

# Set the second and third columns equal to 2
M[:,1:3] = 2
print("\nAfter assigning 2 to columns 2-3:")
print(M)

# Create a vector in numpy
v = np.arange(6)
M *= 0

# Set the first row to my vector v
M[0, :] = v
print("\nAfter assigning vector to the first row:")
print(M)

###################
# M[rows,columns] #
###################

Before assigning any new values:
[[0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]]

After assigning 1 to first element in each row:
[[1. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0.]]

After assigning 5 to rows 3-5:
[[1. 0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0.]
 [5. 5. 5. 5. 5. 5.]
 [5. 5. 5. 5. 5. 5.]
 [5. 5. 5. 5. 5. 5.]
 [1. 0. 0. 0. 0. 0.]]

After resetting back to 0:
[[0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]]

After assigning 2 to columns 2-3:
[[0. 2. 2. 0. 0. 0.]
 [0. 2. 2. 0. 0. 0.]
 [0. 2. 2. 0. 0. 0.]
 [0. 2. 2. 0. 0. 0.]
 [0. 2. 2. 0. 0. 0.]
 [0. 2. 2. 0. 0. 0.]]

After assigning vector to the first row:
[[0. 1. 2. 3. 4. 5.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]]


#### Elementwise Multiplication

<p>As long as the shapes of NDArrays are 'compatible', they can be multiplied elementwise, broadcast against each other, or used in inner products. 'Compatible' in this context can mean compatible in the linear algebraic sense, i.e. for inner products and other matrix multiplication, or simply sharing a dimension in such a manner that broadcasting 'makes sense' (each pair of dimensions is either equal, or one of them is 1). Here are some examples of this:</p>

In [129]:
M[:,:] = 2

v1 = np.arange(6).reshape(1,6)
v2 = np.arange(6).reshape(6,1)

print(v1.shape)
print(v2.shape)

print(M)
print(v1)
print(v2)
print()
Matrix = M * v1
print(Matrix)
print()
add_M = M + v2
print(add_M)

(1, 6)
(6, 1)
[[2. 2. 2. 2. 2. 2.]
 [2. 2. 2. 2. 2. 2.]
 [2. 2. 2. 2. 2. 2.]
 [2. 2. 2. 2. 2. 2.]
 [2. 2. 2. 2. 2. 2.]
 [2. 2. 2. 2. 2. 2.]]
[[0 1 2 3 4 5]]
[[0]
 [1]
 [2]
 [3]
 [4]
 [5]]

[[ 0.  2.  4.  6.  8. 10.]
 [ 0.  2.  4.  6.  8. 10.]
 [ 0.  2.  4.  6.  8. 10.]
 [ 0.  2.  4.  6.  8. 10.]
 [ 0.  2.  4.  6.  8. 10.]
 [ 0.  2.  4.  6.  8. 10.]]

[[2. 2. 2. 2. 2. 2.]
 [3. 3. 3. 3. 3. 3.]
 [4. 4. 4. 4. 4. 4.]
 [5. 5. 5. 5. 5. 5.]
 [6. 6. 6. 6. 6. 6.]
 [7. 7. 7. 7. 7. 7.]]
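To make the two senses of 'compatible' concrete, here is a small sketch. It rebuilds arrays like the ones above under new names (`row`, `col`) so that it runs on its own, and it uses the `@` operator for the matrix product:

```python
import numpy as np

M = np.full((6, 6), 2.0)
row = np.arange(6).reshape(1, 6)
col = np.arange(6).reshape(6, 1)

# Broadcasting: (1, 6) and (6, 1) stretch to a common (6, 6) shape
print((row + col).shape)   # (6, 6)

# Linear algebra: (6, 6) @ (6, 1) gives (6, 1); the inner dimensions match
product = M @ col
print(product.shape)       # (6, 1)

# Trailing dimensions 6 and 5 differ and neither is 1, so this fails
try:
    M * np.arange(5).reshape(1, 5)
except ValueError as err:
    print(f"incompatible shapes: {err}")
```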


#### np.where() <br>
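<p>np.where() works like an if statement applied to the entire array. Called with only a condition, it returns the indices (one array per dimension) of the elements where the condition is true, and those indices can then be used to select the matching elements.</p>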

In [139]:
add_M[1,0] = 2
print(add_M)
print(np.where(add_M<3))
print(add_M[np.where(add_M<3)])

[[2. 2. 2. 2. 2. 2.]
 [2. 3. 3. 3. 3. 3.]
 [4. 4. 4. 4. 4. 4.]
 [5. 5. 5. 5. 5. 5.]
 [6. 6. 6. 6. 6. 6.]
 [7. 7. 7. 7. 7. 7.]]
(array([0, 0, 0, 0, 0, 0, 1]), array([0, 1, 2, 3, 4, 5, 0]))
[2. 2. 2. 2. 2. 2. 2.]
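np.where() also has a three-argument form that behaves like an elementwise if / else. A short sketch with a made-up 2x2 array:

```python
import numpy as np

arr = np.array([[2.0, 3.0], [4.0, 5.0]])

# Where the condition is True take the second argument,
# otherwise take the third argument
clipped = np.where(arr < 4, 0.0, arr)
print(clipped.tolist())   # [[0.0, 0.0], [4.0, 5.0]]
```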


#### Random Sampling <br>
<p>NumPy provides machinery to work with random numbers - something often needed in a broad spectrum of data science applications.</p>

In [140]:
# np.random.uniform()

# A single call generates a single random number between 0 and 1

print('Here is a random number: %s' % np.random.uniform())

# You can also pass some parameters or bounds

print('Here is a random number between 0 and 1 Million: %s' % np.random.uniform(0, 1e6))

# You can also generate a bunch of random numbers all at once

print('Here are 5 random numbers between 0 and 10: %s' % np.random.uniform(0, 10, 5))

# Even matrices with shapes as a parameter

print('Here is a 3x3 matrix with numbers between 0 and 10: \n %s' % np.random.uniform(0, 10, (3, 3)))

Here is a random number: 0.5402220393650664
Here is a random number between 0 and 1 Million: 842286.7862726124
Here are 5 random numbers between 0 and 10: [9.96914015 8.45691725 9.61669907 2.21101094 2.69962658]
Here is a 3x3 matrix with numbers between 0 and 10: 
 [[6.46087173 6.59918929 2.12012094]
 [7.52699962 6.48603738 2.94143131]
 [0.6231658  5.40380405 3.76163698]]
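The numbers above change on every run. When repeatable results are needed, one option is a seeded generator from np.random.default_rng(). This is a sketch rather than part of the lesson's code, and the seed value 42 is an arbitrary choice:

```python
import numpy as np

# Two generators created with the same seed produce the same numbers
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

sample_a = rng_a.uniform(0, 10, size=(3, 3))
sample_b = rng_b.uniform(0, 10, size=(3, 3))

print(np.array_equal(sample_a, sample_b))   # True

# integers() excludes the upper bound: five die rolls from 1 through 6
rolls = rng_a.integers(1, 7, size=5)
print(rolls)
```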


## Working With CSV's

#### Imports

In [141]:
import csv
import numpy as np

#### Loading a txt file's Data 

In [146]:
FIELDS = ['Rk', 'Pos', 'Name', 'Age', 'G', 'PA', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'SB', 'CS', 
          'BB', 'SO', 'BA', 'OBP', 'SLG', 'OPS', 'OPS+', 'TB', 'GDP', 'HBP', 'SH', 'SF', 'IBB']

DATATYPES = [('rk', 'i'), ('pos', '|S25'), ('name', '|S25'), ('age', 'i'), ('g', 'i'), ('pa', 'i'), ('ab', 'i'),
                ('r', 'i'), ('h', 'i'), ('2b', 'i'), ('3b', 'i'), ('hr', 'i'), ('rbi', 'i'), ('sb', 'i'), ('cs', 'i'),
                ('bb', 'i'), ('so', 'i'), ('ba', 'f'), ('obp', 'f'), ('slg', 'f'), ('ops', 'f'), ('opsp', 'i'),
                ('tb', 'i'), ('gdp', 'i'), ('hbp', 'i'), ('sh', 'i'), ('sf', 'i'), ('ibb', 'i')]


# instead of loading csv normally, let's load it into a numpy array to calculate results on
   
def load_data(filename, d = ','):
    data = np.genfromtxt(filename, delimiter=d, skip_header=1, invalid_raise = False, names=FIELDS, dtype=DATATYPES)
    return data


bs_2017 = load_data('./redsox_2017_hitting.txt')
bs_2017['Name']

# Total hits for 2017
total_hits = sum(bs_2017['H'])
total_hits

1459
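To try np.genfromtxt() without the data file, the same idea can be sketched with an in-memory file. The rows below are made up for illustration and are not taken from the Red Sox data; the field names and dtypes are likewise assumptions for this example:

```python
import io
import numpy as np

# Made-up sample rows, standing in for a small CSV file
sample = io.StringIO(
    "Name,AB,H\n"
    "Player A,10,3\n"
    "Player B,20,5\n"
)

# One (field name, type) pair per column: text, integer, integer
dtypes = [("name", "U25"), ("ab", "i4"), ("h", "i4")]
data = np.genfromtxt(sample, delimiter=",", skip_header=1, dtype=dtypes)

print(data["name"].tolist())   # ['Player A', 'Player B']
print(data["h"].sum())         # 8
```

Each column of the structured array is then available by field name, just like `bs_2017['H']` above.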

#### Finding the top 5 hitters for HR's

In [149]:
hr_nums = bs_2017['HR']
sorted_hr = sorted(hr_nums)
top5 = sorted_hr[-5:]
print(top5)

[17, 20, 22, 23, 24]


In [162]:
# find the top 5 hr hitters
hr = bs_2017['HR']
names = bs_2017['Name']
# print(hr)
# print(names)
player_hr = list(zip(names, hr))
# print(player_hr)
player_hr_sort = sorted(player_hr, key=lambda x: x[1], reverse=True)
# print(player_hr_sort)
top_player_hr = player_hr_sort[:5]
# print(top_player_hr)

for player, hr in top_player_hr:
    print(f"{player.decode('utf-8')}: {hr}")
# print(hr)

Mookie Betts: 24
Hanley Ramirez: 23
Mitch Moreland: 22
Andrew Benintendi: 20
Jackie Bradley Jr.: 17
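The same ranking can be done without leaving NumPy by using np.argsort(). The sketch below uses made-up names and totals rather than the Red Sox data:

```python
import numpy as np

# Made-up names and home run totals for illustration only
names = np.array(["A", "B", "C", "D", "E", "F"])
home_runs = np.array([10, 24, 3, 17, 22, 8])

# argsort returns the indices that would sort the array in ascending order;
# take the last three and reverse them to get a descending top 3
top3_idx = np.argsort(home_runs)[-3:][::-1]

for name, hr in zip(names[top3_idx], home_runs[top3_idx]):
    print(f"{name}: {hr}")
# B: 24
# E: 22
# D: 17
```

Indexing both arrays with the same index array keeps each name paired with its total.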


# Exercises To Complete.... <br>
<p>Given in separate file after completion of this file</p>