NumPy is the fundamental package for numeric computing with Python. It provides powerful ways to create, store, and manipulate data, and it integrates quickly and smoothly with a wide variety of databases and data formats. It is also the foundation that pandas is built on, a high-performance data-centric package that we're going to learn more about in this course. In this lecture, we're going to talk about creating arrays with certain data types, manipulating arrays, selecting elements from arrays, and loading data sets into arrays. These functions are useful for manipulating data and for understanding the functionality of other common Python data packages.

In [26]:
# You'll recall that we import a library using the 'import' keyword. numpy's common abbreviation is np
import numpy as np
import math

# Array Creation

In [27]:
# Arrays are displayed as a list or list of lists and can be created from lists as well. When creating an
# array, we pass a list as an argument to np.array
a = np.array([1,2,3])
print(a)
# We can print the number of dimensions of an array using the ndim attribute
print(a.ndim)

[1 2 3]
1


In [28]:
# If we pass a list of lists to np.array, we create a multi-dimensional array, for instance, a matrix
b=np.array([[1,2,3],[4,5,6]])
b

array([[1, 2, 3],
       [4, 5, 6]])

In [29]:
# We can print out the length of each dimension by calling the shape attribute, which returns a tuple
b.shape

(2, 3)

In [30]:
# We can also check the type of items in the array
a.dtype

dtype('int64')

In [31]:
# Besides integers, floats are also accepted in numpy arrays
c = np.array([2.2,5,1.1])
c.dtype.name

'float64'

In [32]:
#  Let's look at the data in our array
c

array([2.2, 5. , 1.1])
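If the type numpy infers isn't the one you want, you can request one explicitly. The cell below is a small sketch of that; the variable names (mixed, as_float32, truncated) are just illustrative and are not part of the lecture.

```python
import numpy as np

# Mixed ints and floats: numpy picks a type that can hold every value
mixed = np.array([2.2, 5, 1.1])
print(mixed.dtype)

# Ask for a specific type at creation time with the dtype argument
as_float32 = np.array([1, 2, 3], dtype=np.float32)
print(as_float32.dtype)

# Convert an existing array with astype; going from float to int
# drops the fractional part, so this direction does lose information
truncated = mixed.astype(np.int64)
print(truncated)
```

Converting down to integers is a choice you have to make yourself; numpy only upcasts automatically.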

In [33]:
# Note that numpy automatically converts integers, like 5, up to floats, since there is no loss of precision.
# Numpy will try to give you the best data type format possible to keep your data types homogeneous, which
# means all the same, in the array

# Sometimes we know the shape of an array that we want to create, but not what we want to be in it.
# numpy offers several functions to create arrays with initial placeholders, such as zeros or ones.
# Let's create two arrays, both the same shape but with different filler values
d= np.zeros((2,3))
print(d)

[[0. 0. 0.]
 [0. 0. 0.]]


In [34]:
e = np.ones((2,3))
print(e)

[[1. 1. 1.]
 [1. 1. 1.]]


In [35]:
# We can also generate an array with random numbers 
np.random.rand(2,3)

array([[0.09563509, 0.51690414, 0.53750288],
       [0.61748978, 0.7766706 , 0.43933554]])

In [36]:
# You'll see zeros, ones, and rand used quite often to create example arrays, especially in Stack Overflow
# posts and other forums.
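Random example arrays change on every run, which can make a forum post hard to reproduce. One way around that, sketched here as a suggestion rather than something the lecture uses, is numpy's seeded Generator interface (np.random.default_rng). The seed value 42 is arbitrary.

```python
import numpy as np

# A seeded generator produces the same "random" numbers on every run
rng = np.random.default_rng(seed=42)
sample = rng.random((2, 3))   # 2x3 array of floats in [0, 1)
print(sample)

# Re-creating the generator with the same seed reproduces the array
again = np.random.default_rng(seed=42).random((2, 3))
print(np.array_equal(sample, again))
```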

In [37]:
# We can also create a sequence of numbers in an array with the arange function.
# The first argument is the starting bound, the second argument is the ending bound, and the
# third argument is the difference between each consecutive number.
# So, let's create an array of every even number from 10 (inclusive) to 50 (exclusive).
# So, f equals np.arange: we're going to start at 10, end at 50, and step by 2
f = np.arange(10,50,2)
f

array([10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42,
       44, 46, 48])

In [38]:
# If we want to generate a sequence of evenly spaced floats, we use something called linspace.
# In this function, the third argument isn't the difference between two numbers, but
# the total number of items that you want to generate.
np.linspace(0,2,15) # 15 numbers from 0 (inclusive) to 2 (inclusive)

array([0.        , 0.14285714, 0.28571429, 0.42857143, 0.57142857,
       0.71428571, 0.85714286, 1.        , 1.14285714, 1.28571429,
       1.42857143, 1.57142857, 1.71428571, 1.85714286, 2.        ])
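To make the difference between the two functions concrete, here is a short side-by-side check. It's only a sketch; the names evens, points and step are illustrative. The retstep=True argument asks linspace to also return the spacing it worked out.

```python
import numpy as np

# arange: start, stop (exclusive), step
evens = np.arange(10, 50, 2)
print(evens[0], evens[-1], len(evens))     # first value, last value, count

# linspace: start, stop (inclusive), number of points
points, step = np.linspace(0, 2, 15, retstep=True)
print(points[0], points[-1], len(points))  # both endpoints are included
print(step)                                # spacing is (2 - 0) / (15 - 1)
```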

# Array Operations

In [39]:
# We can do many things with arrays, such as mathematical manipulation (addition,
# subtraction, square, exponents), as well as use Boolean arrays, which hold True/False values.
# We can also do matrix manipulation, such as product, transpose, inverse, and
# so forth.
# Arithmetic operators on arrays apply elementwise.
# So let's create a couple of arrays.

In [40]:
a = np.array([10,20,30,40])
b=np.array([1,2,3,4])

# Now let's look at a minus b
c = a-b
print(c)
# And a times b, elementwise
d = a * b
print(d)

[ 9 18 27 36]
[ 10  40  90 160]


So, with arithmetic manipulation, 
we can convert current data to the way we want it to be. 
So here's a real world problem that I faced. 
I moved down to the United States about six years ago from Canada. 
In Canada, we use Celsius for temperatures and 
my wife still hasn't converted to the US system, which uses Fahrenheit. 
With numpy I could easily convert a number of Fahrenheit values, say, 
the weather forecast to Celsius for her. 
So, let's create an array of typical Ann Arbor winter Fahrenheit values.

In [41]:
fahrenheit = np.array([0,-10,-5,-15,0])

# and the formula for conversion is ((F-32) * 5/9 = C)
celsius = (fahrenheit - 32) * (5/9)
celsius

array([-17.77777778, -23.33333333, -20.55555556, -26.11111111,
       -17.77777778])

Another useful and important manipulation is the Boolean array. 
We can apply a comparison operator to an array and a Boolean array will be returned, 
with True emitted for each element of the original that meets the condition. 
For instance, if we want to get a Boolean array to check the Celsius degrees that 
are greater than minus 20 degrees

In [42]:
celsius > -20

array([ True, False, False, False,  True])

Here's another example, we could use the modulus operator to check 
numbers in an array to see if they're even (none of these fractional values are, so every entry is False),

In [43]:
celsius%2 ==0

array([False, False, False, False, False])
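If you do this conversion often, it can be wrapped in a small function. This is just a sketch; fahrenheit_to_celsius is a hypothetical helper name, not something numpy provides. Checking it against temperatures we already know (32F is 0C, 212F is 100C, and -40 is the same on both scales) is a quick way to catch an off-by-one in the formula.

```python
import numpy as np

def fahrenheit_to_celsius(f):
    """Convert Fahrenheit values (scalar, list, or array) to Celsius."""
    return (np.asarray(f, dtype=float) - 32) * 5 / 9

# Sanity check against well-known reference points
known = fahrenheit_to_celsius([32, 212, -40])
print(known)

# Boolean arrays can be summed: True counts as 1, False as 0
forecast_c = fahrenheit_to_celsius([0, -10, -5, -15, 0])
colder_than_minus_20 = forecast_c < -20
print(colder_than_minus_20.sum())   # how many days are colder than -20C
```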

Besides elementwise manipulation, 
it's important to know that numpy supports matrix manipulation. 
Let's look at the matrix product. If we want to do an elementwise product, 
we use the asterisk sign

In [44]:
A = np.array([[1,1],[0,1]])
B = np.array([[2,0],[3,4]])
print(A*B)

# If we want to do a matrix product, we use the @ sign or the dot function
print(A@B)

[[2 0]
 [0 4]]
[[5 4]
 [3 4]]


So, you don't have to worry about complex matrix operations for this course. 
But it's important to know that numpy is the underpinning of scientific computing 
libraries in Python, 
and that it is capable of doing both elementwise operations, with 
the asterisk, and matrix-level operations, with the @ sign. 
There's more on this in subsequent courses. 
A few more linear algebra concepts are worth layering in here. 
You might recall that the product of two matrices is only possible when 
the inner dimensions of the two matrices are the same, that is, when the number of 
columns of the first equals the number of rows of the second. 
The dimensions refer to the number of elements, both horizontal and 
vertical, in the rendered matrices that you've been seeing here. 
So, we can use numpy to quickly see the shape of the matrix.

In [45]:
A.shape

(2, 2)
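Here is a small sketch of that inner-dimension rule in action. The arrays left and right are made-up examples filled with ones, and mismatch_raised is only a flag for illustration.

```python
import numpy as np

A = np.array([[1, 1], [0, 1]])
B = np.array([[2, 0], [3, 4]])

# For 2-D arrays, np.dot and the @ operator give the same matrix product
print(np.array_equal(np.dot(A, B), A @ B))

# (2, 3) @ (3, 4): the inner dimensions (3 and 3) match, result is (2, 4)
left = np.ones((2, 3))
right = np.ones((3, 4))
print((left @ right).shape)

# (3, 4) @ (2, 3): the inner dimensions (4 and 2) differ, so numpy refuses
mismatch_raised = False
try:
    right @ left
except ValueError as err:
    mismatch_raised = True
    print("cannot multiply:", err)
```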

When manipulating arrays of different types, the type of the resulting array 
will correspond to the more general of the two types. 
And this is called upcasting and you saw an example of that before, but 
let's see another one.

In [46]:
# Let's create an array of integers
array1 = np.array([[1,2,3],[4,5,6]])
print(array1.dtype)
# now create an array of floats
array2 = np.array([[7.1,8.2,9.1],[10.4,11.2,12.3]])
print(array2.dtype)

int64
float64


Integers (int) are whole numbers only, and floating 
point numbers (float) can have a whole number portion and a decimal portion. 
The 64 in this example refers to the number of bits that are 
reserved to represent each number, which determines the size or 
the precision of the numbers that can be represented. 

In [47]:
# addition for the two arrays
array3 = array1 + array2
print(array3)
print(array3.dtype)

[[ 8.1 10.2 12.1]
 [14.4 16.2 18.3]]
float64


Notice how the items in the resulting array have been upcast into floating point numbers.
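If you want to know the resulting type before doing the arithmetic, numpy can tell you with np.result_type. This is a brief sketch; the array names are illustrative.

```python
import numpy as np

ints = np.array([1, 2, 3])
floats = np.array([0.5, 0.5, 0.5])

# Ask numpy which type an operation between the two would produce
print(np.result_type(ints, floats))

# The actual addition agrees: the integer values are upcast to floats
total = ints + floats
print(total, total.dtype)
```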


In [48]:
# numpy arrays have many interesting aggregation functions on them
print(array3.sum())
print(array3.max())
print(array3.min())
print(array3.mean())


79.3
18.3
8.1
13.216666666666667


For two dimensional arrays, we could do the same thing for each row or column. 
So, let's create an array with 15 elements ranging from 1 to 15, 
with the dimension of 3 by 5

In [49]:
# np.arange used to generate a range of numbers
# start, stop, step
# reshape method used to reshape the dim of the array
b = np.arange(1,16,1).reshape(3,5)
print(b)

[[ 1  2  3  4  5]
 [ 6  7  8  9 10]
 [11 12 13 14 15]]
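To aggregate per row or per column, the aggregation functions take an axis argument. A minimal sketch on the same 3 by 5 array: axis=0 collapses the rows (one result per column) and axis=1 collapses the columns (one result per row). The name grid is used here only to avoid clobbering b.

```python
import numpy as np

grid = np.arange(1, 16, 1).reshape(3, 5)

# axis=0 works down the rows, giving one value per column
column_sums = grid.sum(axis=0)
print(column_sums)

# axis=1 works across the columns, giving one value per row
row_sums = grid.sum(axis=1)
row_means = grid.mean(axis=1)
print(row_sums)
print(row_means)
```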


Now, we often think about two dimensional arrays being made up of rows and columns. 
But you can also think of these arrays as just giant ordered lists of numbers, and 
the shape of the array, 
the number of rows and columns, is just an abstraction that we have for 
a particular purpose. 
Actually, this is exactly how basic images are stored in computer environments. 
So, let's take a look at an example and 
see how numpy comes into play in something like images.

For this demonstration, I'm going to use the Python Imaging Library, PIL and 
a function to display images in the Jupyter notebook.

In [50]:
from PIL import Image
from IPython.display import display

# Open the image file and display it inline in the notebook
im = Image.open('chris.tiff')
display(im)

FileNotFoundError: [Errno 2] No such file or directory: 'chris.tiff'
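The image file isn't available here, so as a stand-in this sketch builds a tiny synthetic grayscale "image" directly as an array. The values are made up and this is not the lecture's picture. It illustrates the same idea: pixels are just numbers, and the shape is an abstraction layered on top of them.

```python
import numpy as np

# A 4x6 "image" of 8-bit grayscale values (0 is black, 255 is white)
pixels = np.arange(24, dtype=np.uint8).reshape(4, 6) * 10
print(pixels)

# Inverting the image is just arithmetic on every pixel
inverted = 255 - pixels
print(inverted)

# Reshaping changes the rows-and-columns view, not the underlying order
flat = pixels.reshape(-1)       # one long ordered list of 24 numbers
reshaped = pixels.reshape(6, 4)
print(np.array_equal(flat, reshaped.ravel()))
```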

# Indexing, Slicing and Iterating

So, indexing, slicing and iterating are extremely important for 
data manipulation and analysis. 
Because these techniques allow us to select data based on conditions and 
copy or update the data. 

# Indexing 

In [51]:
# First we are going to look at integer indexing. A one-dimensional array works in a similar way to a list.
# To get an element in a one-dimensional array, we simply use the offset index.
a=np.array([1,3,5,7])
a[2]

5

In [52]:
# For a multidimensional array, we need to use integer array indexing. Let's create a new multidimensional array
a = np.array([[1,2],[3,4],[5,6]])
a

array([[1, 2],
       [3, 4],
       [5, 6]])

In [53]:
# if we want to select one certain element, we can do so by entering the index, which is comprised
# of two integers the first being the row, and the second the column
a[1,1] # remember in python we start at 0!

4

If we want to get multiple elements, for 
example, 1, 4, and 6, and put them into a one-dimensional array, 
we can enter the indices directly into the array function.

In [54]:
np.array([a[0,0],a[1,1],a[2,1]])

array([1, 4, 6])

In [55]:
# We can also do that by using another form of array indexing, which essentially "zips" 
# the first list and the second list up
print(a[[0,1,2],[0,1,1]])

[1 4 6]


In [56]:
# Boolean indexing allows us to select arbitrary elements based on conditions. For example, in the 
# matrix we want to find elements that are greater than 5, so we set up a condition a>5
print(a>5)

[[False False]
 [False False]
 [False  True]]


We can then place this array of booleans like a mask over the original array to return a 
one dimensional array relating to the true values.

In [57]:
print(a[a>5])

[6]
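Masks can also be combined and used for assignment. The sketch below uses the same small array; note that numpy uses & and | (with parentheses around each condition) rather than Python's and/or. The variable names are illustrative.

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])

# Combine conditions with & (and) and | (or); parentheses are required
middle = a[(a > 2) & (a < 6)]
extremes = a[(a < 2) | (a > 5)]
print(middle)
print(extremes)

# A mask on the left-hand side updates only the matching elements
capped = a.copy()
capped[capped > 4] = 0
print(capped)
```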


As we will see, this functionality is essential in the pandas toolkit, which is the bulk
of this course.

# Slicing

Slicing is a way to create a sub-array based on the original array. 
For one-dimensional arrays, slicing works in a similar way to a list. 
To slice, we use the colon. For instance, if we put :3 in 
the indexing brackets, we get the elements from index zero up to index three, 
excluding index three.

In [59]:
a=np.array([0,1,2,3,4,5])
print(a[:3])

[0 1 2]


In [60]:
# By putting 2:4 in the bracket, we get elements from index 2 to index 4 (excluding index 4)
print(a[2:4])

[2 3]


In [62]:
# For multi-dimensional arrays, it works similarly, 
a = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12]])
a

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

In [63]:
# First, if we put one argument in the brackets, i.e., a[:2], then we get all the elements from the 
# first (index 0) and second (index 1) rows
a[:2]

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

In [76]:
# If we add another argument to the array, i.e., 
# a[:2, 1:3], we get the first two rows but then the 
# second and third column values only
a[:2, 1:3]


array([[2, 3],
       [6, 7]])

In [77]:
# So, in multidimensional arrays, the first argument is for selecting rows, and the second
# argument is for selecting columns. A slice is a view onto the original array, not a copy.
# Here I'll change the element at position [0,0] of the sub array, which is 2, to 50; then we can see that
# the value in the original array is changed to 50 as well
sub_array = a[:2,1:3]
print("sub array index [0,0] value before change:", sub_array[0,0])

# Change the value from 2 to 50
sub_array[0,0]=50
print("sub array index [0,0] value after change:", sub_array[0,0])

print("original array index [0,1] value after change:",a[0,1])


sub array index [0,0] value before change: 2
sub array index [0,0] value after change: 50
original array index [0,1] value after change: 50
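If you don't want changes to a slice to reach back into the original, make an explicit copy. Here's a short sketch contrasting the two; np.shares_memory reports whether two arrays use the same underlying data. The names view and independent are illustrative.

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

view = a[:2, 1:3]                # a slice shares data with the original
independent = a[:2, 1:3].copy()  # a copy has its own data

print(np.shares_memory(a, view))
print(np.shares_memory(a, independent))

# Changing the copy leaves the original alone
independent[0, 0] = 50
value_after_copy_change = a[0, 1]
print(value_after_copy_change)

# Changing the view changes the original too
view[0, 0] = 50
value_after_view_change = a[0, 1]
print(value_after_view_change)
```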


# Trying Numpy with Datasets

Now that we've learned the essentials of numpy, let's use it on a couple of datasets. 
Here we have a very popular data set on wine quality. 
We're going to look only at red wines. The data fields include 
fixed acidity, volatile acidity, residual sugar, chlorides, and so forth. 
The important ones here are the alcohol content and 
the quality; that's how I buy wine anyway. 
To load a dataset into numpy, we can use the genfromtxt function. 
We can specify the data file name, the delimiter (which is optional, but 
we often use it), and the number of rows to skip if we have a header row, 
which is one here. The genfromtxt function also has a parameter called dtype for 
specifying the data type of each column, and this parameter is optional. 
Without specifying the types, 
all values will be cast to a more general or precise type, 
so there will be some inference done.

In [81]:
wines = np.genfromtxt("winequality-red.csv", delimiter=";", skip_header=1)
wines

array([[ 7.4  ,  0.7  ,  0.   , ...,  0.56 ,  9.4  ,  5.   ],
       [ 7.8  ,  0.88 ,  0.   , ...,  0.68 ,  9.8  ,  5.   ],
       [ 7.8  ,  0.76 ,  0.04 , ...,  0.65 ,  9.8  ,  5.   ],
       ...,
       [ 6.3  ,  0.51 ,  0.13 , ...,  0.75 , 11.   ,  6.   ],
       [ 5.9  ,  0.645,  0.12 , ...,  0.71 , 10.2  ,  5.   ],
       [ 6.   ,  0.31 ,  0.47 , ...,  0.66 , 11.   ,  6.   ]])
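If you don't have the CSV file handy, you can try genfromtxt on a few lines of text held in memory, since it accepts file-like objects as well as file names. The rows below are made-up sample values in the same semicolon-delimited layout, not the real dataset.

```python
import numpy as np
from io import StringIO

# A tiny made-up stand-in for the wine file: header row plus three records
csv_text = StringIO(
    "fixed acidity;alcohol;quality\n"
    "7.0;9.5;5\n"
    "8.1;10.0;6\n"
    "6.5;11.2;7\n"
)

# Same call pattern as above: semicolon delimiter, skip the header row
sample = np.genfromtxt(csv_text, delimiter=";", skip_header=1)
print(sample)
print(sample.shape)
print(sample.dtype)   # with no dtype given, every column is read as a float
```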

Recall that we can use integer indexing to get a certain column or row. 
For example, if we wanted to select the fixed acidity column, 
which is the first column, we can do so by entering the index into the array. 
Also remember that for multi-dimensional arrays, 
the first argument refers to the row and the second argument refers to the column. 
And if we give a single integer for one of them rather than a slice, we'll get a one-dimensional array back. 

In [82]:
# So all rows combined but only the first column from them would be
print("One integer 0 for slicing: ", wines[:,0])

# But if we wanted the same values but wanted to preserve that they sit in their own rows we would
# write
print("0 to 1 for slicing: \n", wines[:, 0:1])

One integer 0 for slicing:  [7.4 7.8 7.8 ... 6.3 5.9 6. ]
0 to 1 for slicing: 
 [[7.4]
 [7.8]
 [7.8]
 ...
 [6.3]
 [5.9]
 [6. ]]
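Checking the shape attribute makes the difference between the two forms explicit. This is a sketch on a small made-up array (table is an illustrative name) rather than the wine data.

```python
import numpy as np

table = np.arange(12).reshape(4, 3)   # 4 rows, 3 columns

# An integer index drops that axis: we get a flat, one-dimensional result
as_vector = table[:, 0]
print(as_vector.shape)

# A slice keeps the axis: we get a 4x1 column
as_column = table[:, 0:1]
print(as_column.shape)

# Both hold the same values, just arranged under different shapes
print(np.array_equal(as_vector, as_column.ravel()))
```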


So, this is another great example of how the shape of data is 
actually just an abstraction. 
Which we can layer intentionally on top of the data that we're working with. 

If we want a range of columns in order, say, columns zero through three 
(recall this means the first, second, and third, since we start at zero 
and we don't include the trailing index value), we could do that too.

In [83]:
wines[:,0:3]

array([[7.4  , 0.7  , 0.   ],
       [7.8  , 0.88 , 0.   ],
       [7.8  , 0.76 , 0.04 ],
       ...,
       [6.3  , 0.51 , 0.13 ],
       [5.9  , 0.645, 0.12 ],
       [6.   , 0.31 , 0.47 ]])

What if we want several non-consecutive columns? 
Well, we can place the indices of the columns that we want into a list and 
pass that list as the second argument. 
So here's an example: we take wines, we want all rows, so a colon, and then the list of column indices. 


In [84]:
wines[:,[0,2,4]]

array([[7.4  , 0.   , 0.076],
       [7.8  , 0.   , 0.098],
       [7.8  , 0.04 , 0.092],
       ...,
       [6.3  , 0.13 , 0.076],
       [5.9  , 0.12 , 0.075],
       [6.   , 0.47 , 0.067]])

So, we can also do some basic summarization of this data set. 
For example, if we wanted to find out the average quality of red wine, 
we can select the quality column. 
We could do this in a couple of ways, but 
the most appropriate is to use the minus one value for the index, 
since negative numbers index from the back of a list. 
And then we just call the aggregation functions on this data.

In [85]:
# Here -1 is the last column quality
wines[:,-1].mean()

5.6360225140712945

Let's take a look at another dataset, this time on graduate school admissions. 
It has fields such as GRE score, TOEFL score, university rating, and so 
forth, and it has a chance of admission at the end. 
With this dataset, we can do data manipulation and basic analysis to infer 
what conditions are associated with higher chances of admission. 
So, let's take a look. 
We can specify data field names using genfromtxt as it loads the CSV data. 
We can also have numpy try to infer the type of each column by setting the dtype 
parameter to None. Note that genfromtxt replaces the spaces in field names with 
underscores, so 'Chance of Admit' is later accessed as 'Chance_of_Admit'.

In [87]:
graduate_admission = np.genfromtxt('Admission_Predict.csv', dtype=None, delimiter=",", skip_header=1,
                                   names=('Serial No','GRE Score','TOEFL Score','University Rating',
                                          'SOP', 'LOR','CGPA', 'Research','Chance of Admit'))
graduate_admission

array([(  1, 337, 118, 4, 4.5, 4.5, 9.65, 1, 0.92),
       (  2, 324, 107, 4, 4. , 4.5, 8.87, 1, 0.76),
       (  3, 316, 104, 3, 3. , 3.5, 8.  , 1, 0.72),
       (  4, 322, 110, 3, 3.5, 2.5, 8.67, 1, 0.8 ),
       (  5, 314, 103, 2, 2. , 3. , 8.21, 0, 0.65),
       (  6, 330, 115, 5, 4.5, 3. , 9.34, 1, 0.9 ),
       (  7, 321, 109, 3, 3. , 4. , 8.2 , 1, 0.75),
       (  8, 308, 101, 2, 3. , 4. , 7.9 , 0, 0.68),
       (  9, 302, 102, 1, 2. , 1.5, 8.  , 0, 0.5 ),
       ( 10, 323, 108, 3, 3.5, 3. , 8.6 , 0, 0.45),
       ( 11, 325, 106, 3, 3.5, 4. , 8.4 , 1, 0.52),
       ( 12, 327, 111, 4, 4. , 4.5, 9.  , 1, 0.84),
       ( 13, 328, 112, 4, 4. , 4.5, 9.1 , 1, 0.78),
       ( 14, 307, 109, 3, 4. , 3. , 8.  , 1, 0.62),
       ( 15, 311, 104, 3, 3.5, 2. , 8.2 , 1, 0.61),
       ( 16, 314, 105, 3, 3.5, 2.5, 8.3 , 0, 0.54),
       ( 17, 317, 107, 3, 4. , 3. , 8.7 , 0, 0.66),
       ( 18, 319, 106, 3, 4. , 3. , 8.  , 1, 0.65),
       ( 19, 318, 110, 3, 4. , 3. , 8.8 , 0, 0.63),
       ( 20,

In [88]:
# Notice that the resulting array is actually a one dimensional array with 400 tuples
graduate_admission.shape

(400,)

In [90]:
# We can retrieve a column from the array using the column's name. Let's get the CGPA column and
# only the first five values
graduate_admission['CGPA'][0:5]

array([9.65, 8.87, 8.  , 8.67, 8.21])
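To see how the field names end up, here's a small sketch that loads a few made-up rows from memory instead of the real file. The values and the reduced set of columns are invented for illustration; the encoding argument is passed only to be explicit.

```python
import numpy as np
from io import StringIO

# Made-up stand-in for the admissions file: header plus three records
csv_text = StringIO(
    "GRE Score,CGPA,Chance of Admit\n"
    "330,9.1,0.90\n"
    "310,8.0,0.60\n"
    "300,7.5,0.35\n"
)

records = np.genfromtxt(csv_text, dtype=None, delimiter=",", skip_header=1,
                        names=("GRE Score", "CGPA", "Chance of Admit"),
                        encoding="utf-8")

# Spaces in the names we passed have been replaced with underscores
print(records.dtype.names)

# One-dimensional: one tuple-like record per row
print(records.shape)

# Columns are retrieved by their (underscored) names
print(records["GRE_Score"])
```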

In [91]:
# Since the GPA in the dataset ranges from 1 to 10, and in the US it's more common to use a scale
# of up to 4, a common task might be to convert the GPA by dividing by 10 and multiplying by 4
graduate_admission['CGPA'] = graduate_admission['CGPA'] / 10*4
graduate_admission['CGPA'][0:20] # lets get 20 values

array([3.86 , 3.548, 3.2  , 3.468, 3.284, 3.736, 3.28 , 3.16 , 3.2  ,
       3.44 , 3.36 , 3.6  , 3.64 , 3.2  , 3.28 , 3.32 , 3.48 , 3.2  ,
       3.52 , 3.4  ])

Remember Boolean masking? 
Well, we can use this to find out how many students have had research experience by 
creating a Boolean mask and passing it to the array indexing operator. 
So, we'll take graduate_admission['Research'] and compare that to one: 
if it's one, a True will be emitted, otherwise a False will be emitted. 
That creates a mask, 
which we then pass into graduate_admission using the indexing operator. 


In [93]:
len(graduate_admission[graduate_admission['Research']==1])
                       

219

Since we've got the data field chance of admission, 
which ranges from zero to one, we can try to 
see if students with a high chance of admission, let's say above 80%, 
have higher GRE scores on average than those with a lower chance of admission, let's say below 40%. 
So first we're going to use Boolean masking to pull out only 
those students that we're interested in based on their chance of admission. 
And then we pull out only their GRE scores and 
print the mean values.

In [94]:
print(graduate_admission[graduate_admission['Chance_of_Admit']>0.8]['GRE_Score'].mean())
print(graduate_admission[graduate_admission['Chance_of_Admit'] < 0.4]['GRE_Score'].mean())

328.7350427350427
302.2857142857143
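The same two-step pattern (mask the rows, then pick the column) can be tried on a tiny hand-built structured array. This is only a sketch: the five students below are made up, and the field names are chosen to mirror the ones used above.

```python
import numpy as np

# Five made-up records with two named fields
students = np.array(
    [(330, 0.90), (325, 0.85), (310, 0.60), (300, 0.35), (298, 0.30)],
    dtype=[("GRE_Score", "i8"), ("Chance_of_Admit", "f8")],
)

# Step 1: build Boolean masks from one column
high = students["Chance_of_Admit"] > 0.8
low = students["Chance_of_Admit"] < 0.4
print(high.sum(), low.sum())   # how many students fall in each group

# Step 2: apply the mask to the rows, then select the column to aggregate
high_mean = students[high]["GRE_Score"].mean()
low_mean = students[low]["GRE_Score"].mean()
print(high_mean)
print(low_mean)
```

Selecting the column first and masking second, as in students["GRE_Score"][high], gives the same numbers.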


So, take a moment here to reflect: 
do you understand what's happening in the calls above? 
When we do the Boolean masking, 
we are left with an array that still has tuples in it. 
And underneath this, numpy holds a list of the columns we specified, with 
their names and indexes. 


In [97]:
graduate_admission[graduate_admission['Chance_of_Admit']>0.8]

array([(  1, 337, 118, 4, 4.5, 4.5, 3.86 , 1, 0.92),
       (  6, 330, 115, 5, 4.5, 3. , 3.736, 1, 0.9 ),
       ( 12, 327, 111, 4, 4. , 4.5, 3.6  , 1, 0.84),
       ( 23, 328, 116, 5, 5. , 5. , 3.8  , 1, 0.94),
       ( 24, 334, 119, 5, 5. , 4.5, 3.88 , 1, 0.95),
       ( 25, 336, 119, 5, 4. , 3.5, 3.92 , 1, 0.97),
       ( 26, 340, 120, 5, 4.5, 4.5, 3.84 , 1, 0.94),
       ( 33, 338, 118, 4, 3. , 4.5, 3.76 , 1, 0.91),
       ( 34, 340, 114, 5, 4. , 4. , 3.84 , 1, 0.9 ),
       ( 35, 331, 112, 5, 4. , 5. , 3.92 , 1, 0.94),
       ( 36, 320, 110, 5, 5. , 5. , 3.68 , 1, 0.88),
       ( 44, 332, 117, 4, 4.5, 4. , 3.64 , 0, 0.87),
       ( 45, 326, 113, 5, 4.5, 4. , 3.76 , 1, 0.91),
       ( 46, 322, 110, 5, 5. , 4. , 3.64 , 1, 0.88),
       ( 47, 329, 114, 5, 4. , 5. , 3.72 , 1, 0.86),
       ( 48, 339, 119, 5, 4.5, 4. , 3.88 , 0, 0.89),
       ( 49, 321, 110, 3, 3.5, 5. , 3.54 , 1, 0.82),
       ( 71, 332, 118, 5, 5. , 5. , 3.856, 1, 0.94),
       ( 72, 336, 112, 5, 5. , 5. , 3.904, 1, 

In [98]:
print(graduate_admission[graduate_admission['Chance_of_Admit']>0.8]['CGPA'].mean())
print(graduate_admission[graduate_admission['Chance_of_Admit'] < 0.4]['CGPA'].mean())

3.7106666666666666
3.0222857142857142
