# Tutorial 3: Introduction to Numpy and Pandas

<b>Dr. Shadi Bani Taan</b>

## 3.1 Introduction to Numpy

Numpy, short for Numerical Python, is a Python library for numerical computation. Its basic data structure is a multi-dimensional array object called ndarray, and the library provides a suite of functions that manipulate ndarray elements efficiently.

### 3.1.1 Creating ndarray

An ndarray can be created from a list or tuple object.

In [1]:
import numpy as np

oneDim = np.array([1,2,3,4,5])   # a 1-dimensional array (vector)
print(oneDim)
print("#Dimensions =", oneDim.ndim)
print("Dimension =", oneDim.shape) # the number of elements in each dimension (1 dimension, 5 elements)
print("Size =", oneDim.size) # number of elements in the array.
print("Array type =", oneDim.dtype)

[1 2 3 4 5]
#Dimensions = 1
Dimension = (5,)
Size = 5
Array type = int32


In [3]:
twoDim = np.array([[1,2],[3,4],[5,6],[7,8]])  # a two-dimensional array (matrix)
print(twoDim)
print("#Dimensions =", twoDim.ndim)
print("Dimension =", twoDim.shape)
print("Size =", twoDim.size)
print("Array type =", twoDim.dtype)

[[1 2]
 [3 4]
 [5 6]
 [7 8]]
#Dimensions = 2
Dimension = (4, 2)
Size = 8
Array type = int32


In [55]:
arrFromTuple = np.array([(1,'a',3.0),(2,'b',3.5)])  # create ndarray from a list of tuples; mixed types are all converted to strings
print(arrFromTuple)
print("#Dimensions =", arrFromTuple.ndim)
print("Dimension =", arrFromTuple.shape)
print("Size =", arrFromTuple.size)

[['1' 'a' '3.0']
 ['2' 'b' '3.5']]
#Dimensions = 2
Dimension = (2, 3)
Size = 6
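
The quoted values in the output above arise because an ndarray holds a single element type: mixing integers, strings, and floats makes numpy convert every element to a string. The following sketch, with illustrative variable names that are not part of the tutorial, shows this type promotion and one possible way to keep the original Python objects by requesting `dtype=object`.

```python
import numpy as np

# Mixed integers, strings, and floats: every element becomes a string
mixed = np.array([(1, 'a', 3.0), (2, 'b', 3.5)])

# Integers mixed with a float: the integers are promoted to float
ints_and_floats = np.array([1, 2, 3.5])

# dtype=object keeps the original Python objects (at the cost of speed)
as_objects = np.array([(1, 'a', 3.0), (2, 'b', 3.5)], dtype=object)

print(mixed.dtype.kind)          # 'U' denotes a Unicode string dtype
print(ints_and_floats.dtype)     # float64
print(type(as_objects[0, 0]))    # <class 'int'>
```

Object arrays lose most of numpy's speed advantage, so they are usually a last resort for genuinely heterogeneous data.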


There are several built-in functions in numpy that can be used to create ndarrays:

In [6]:
print(np.random.rand(2))      # random numbers from a uniform distribution over [0, 1)
print(np.random.randn(5))     # random numbers from a standard normal distribution
print(np.arange(-10,10,2))    # similar to range, but returns ndarray instead of list
print(np.arange(12).reshape(3,4))  # reshape from 1-D to a 3 x 4 matrix

[0.11944336 0.06120736]
[ 1.40028942  1.04933153  0.98576573  1.20437881 -2.83766057]
[-10  -8  -6  -4  -2   0   2   4   6   8]
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]


In [10]:
print(np.zeros((2,3)))        # a matrix of zeros
print(np.ones((3,2)))         # a matrix of ones
print(np.eye(3))              # a 3 x 3 identity matrix

[[0. 0. 0.]
 [0. 0. 0.]]
[[1. 1.]
 [1. 1.]
 [1. 1.]]
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
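
The random values shown earlier change on every run. A minimal sketch, assuming a numpy version that provides `np.random.default_rng` (1.17 or later) and an arbitrarily chosen seed of 42, shows one way to make such output reproducible. It also uses `np.linspace` and `np.full`, two further creation functions not covered above.

```python
import numpy as np

rng = np.random.default_rng(42)            # seeded generator; the seed value is arbitrary
a = rng.random(3)                          # uniform samples in [0, 1)
b = np.random.default_rng(42).random(3)    # same seed, so the same samples
print(np.array_equal(a, b))                # True

evenly_spaced = np.linspace(0, 1, 5)       # 5 evenly spaced values from 0 to 1 inclusive
print(evenly_spaced)

sevens = np.full((2, 2), 7)                # a 2 x 2 matrix filled with the value 7
print(sevens)
```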


### 3.1.2 Element-wise Operations

You can apply standard operators such as addition and multiplication on each element of the ndarray.

In [5]:
x = np.array([1,2,3,4,5])

print(x + 1)      # addition
print(x - 1)      # subtraction
print(x * 2)      # multiplication
print(x // 2)     # integer division
print(x ** 2)     # square
print(x % 2)      # modulo  
print(1 / x)      # division

[2 3 4 5 6]
[0 1 2 3 4]
[ 2  4  6  8 10]
[0 1 1 2 2]
[ 1  4  9 16 25]
[1 0 1 0 1]
[1.         0.5        0.33333333 0.25       0.2       ]


In [4]:
x = np.array([2,4,6,8,10])
y = np.array([1,2,3,4,5])

print(x + y)
print(x - y)
print(x * y)
print(x / y)
print(x // y)
print(x ** y)

[ 3  6  9 12 15]
[1 2 3 4 5]
[ 2  8 18 32 50]
[2. 2. 2. 2. 2.]
[2 2 2 2 2]
[     2     16    216   4096 100000]
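
The same element-wise rules extend to arrays of different shapes through broadcasting, which the scalar examples above already rely on implicitly. A small sketch with illustrative names:

```python
import numpy as np

matrix = np.arange(6).reshape(2, 3)    # [[0 1 2], [3 4 5]]
row = np.array([10, 20, 30])           # shape (3,), matches the last dimension
col = np.array([[100], [200]])         # shape (2, 1), one value per row

plus_row = matrix + row    # row is added to every row of matrix
plus_col = matrix + col    # col is added to every column of matrix

print(plus_row)   # [[10 21 32], [13 24 35]]
print(plus_col)   # [[100 101 102], [203 204 205]]
```

Shapes are compared from the last dimension backwards; two dimensions are compatible when they are equal or when one of them is 1.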


### 3.1.3 Indexing and Slicing

There are various ways to select certain elements within an ndarray.

In [8]:
x = np.arange(-5,5)
print('first x = ', x)

y = x[3:5]     # y is a slice, i.e., a view onto a subarray of x (no data is copied)
print('first y = ', y)
y[:] = 1000    # modifying the value of y will change x
print('second y = ', y)
print('second x = ', x)

z = x[3:5].copy()   # makes a copy of the subarray
print('first z = ', z)
z[:] = 500          # modifying the value of z will not affect x
print('second z = ', z)
print('third x = ', x)

first x =  [-5 -4 -3 -2 -1  0  1  2  3  4]
first y =  [-2 -1]
second y =  [1000 1000]
second x =  [  -5   -4   -3 1000 1000    0    1    2    3    4]
first z =  [1000 1000]
second z =  [500 500]
third x =  [  -5   -4   -3 1000 1000    0    1    2    3    4]
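
Whether two arrays share data can also be checked directly rather than inferred from side effects. The sketch below uses `np.shares_memory` and the `base` attribute; the variable names are illustrative.

```python
import numpy as np

x = np.arange(10)
view = x[3:5]              # basic slicing returns a view
copied = x[3:5].copy()     # an explicit, independent copy

print(np.shares_memory(x, view))     # True
print(np.shares_memory(x, copied))   # False
print(view.base is x)                # True: the view refers back to x
```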


In [56]:
my2dlist = [[1,2,3,4],[5,6,7,8],[9,10,11,12]]   # a 2-dim list
print(my2dlist)
print(my2dlist[2])        # access the third sublist

my2darr = np.array(my2dlist)
print(my2darr)
print("row 3", my2darr[2][:])      # access the third row
print("row 3", my2darr[2,:])       # access the third row
print("row 3", my2darr[:][2])      # access the third row: my2darr[:] is the whole array, so this equals my2darr[2]
print("third column",my2darr[:,2])       # access the third column
print(my2darr[:2,2:])     # access the first two rows & last two columns

[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
[9, 10, 11, 12]
[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]
row 3 [ 9 10 11 12]
row 3 [ 9 10 11 12]
row 3 [ 9 10 11 12]
third column [ 3  7 11]
[[3 4]
 [7 8]]
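
The difference between chained indexing such as `[:][2]` and a single comma-separated index such as `[:,2]` is easy to overlook. The following sketch, using an illustrative array named `arr`, contrasts the two and adds a couple of further slicing forms.

```python
import numpy as np

arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

chained = arr[:][2]      # arr[:] is the full array, then row 2
column = arr[:, 2]       # every row, column 2
element = arr[2, 1]      # single element at row 2, column 1
strided = arr[-1, ::2]   # last row, every second column

print(chained)   # [ 9 10 11 12]
print(column)    # [ 3  7 11]
print(element)   # 10
print(strided)   # [ 9 11]
```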


More indexing examples.

In [26]:
my2darr = np.arange(1,13,1).reshape(4,3)
print(my2darr)

indices = [2,1,0,3]    # selected row indices
print("selected indices", my2darr[indices,:])

rowIndex = [0,0,1,2,3]     # row index into my2darr
columnIndex = [0,2,0,1,2]  # column index into my2darr
print(my2darr[rowIndex,columnIndex])

[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]
selected indices [[ 7  8  9]
 [ 4  5  6]
 [ 1  2  3]
 [10 11 12]]
[ 1  3  4  8 12]
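
Besides lists of integer positions, an ndarray can be indexed with a boolean mask. The sketch below is an illustration with made-up variable names; it also shows that integer-list indexing returns a copy rather than a view, unlike basic slicing.

```python
import numpy as np

arr = np.arange(1, 13).reshape(4, 3)

mask = arr % 2 == 0            # boolean array with the same shape as arr
evens = arr[mask]
print(evens)                   # [ 2  4  6  8 10 12]

big_rows = arr[:, 0] > 4       # one boolean per row: first column greater than 4
selected = arr[big_rows, :]
print(selected)                # rows [7 8 9] and [10 11 12]

picked = arr[[0, 1], :]        # integer-list indexing returns a copy
picked[:] = 0
print(arr[0])                  # [1 2 3]: the original is unchanged
```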


### 3.1.4 Numpy Arithmetic and Statistical Functions

There are many built-in mathematical functions available for manipulating the elements of an ndarray.

In [11]:
y = np.array([-1.4, 0.4, -3.2, 2.5, 3.4])    # a vector of fixed example values
print(y)

print(np.abs(y))          # convert to absolute values
print(np.sqrt(np.abs(y))) # apply square root to each absolute value
print(np.sign(y))         # get the sign of each element
print(np.exp(y))          # apply exponentiation
print(np.sort(y))         # sort array

[-1.4  0.4 -3.2  2.5  3.4]
[1.4 0.4 3.2 2.5 3.4]
[1.18321596 0.63245553 1.78885438 1.58113883 1.84390889]
[-1.  1. -1.  1.  1.]
[ 0.24659696  1.4918247   0.0407622  12.18249396 29.96410005]
[-3.2 -1.4  0.4  2.5  3.4]


In [None]:
x = np.arange(-2,3)
y = np.random.randn(5)
print(x)
print(y)

print(np.add(x,y))           # element-wise addition       x + y
print(np.subtract(x,y))      # element-wise subtraction    x - y
print(np.multiply(x,y))      # element-wise multiplication x * y
print(np.divide(x,y))        # element-wise division       x / y
print(np.maximum(x,y))       # element-wise maximum        max(x,y)

In [None]:
y = np.array([-3.2, -1.4, 0.4, 2.5, 3.4])    # a vector of fixed example values
print(y)

print("Min =", np.min(y))             # min 
print("Max =", np.max(y))             # max 
print("Average =", np.mean(y))        # mean/average
print("Std deviation =", np.std(y))   # standard deviation
print("Sum =", np.sum(y))             # sum 
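
For multi-dimensional arrays, these statistical functions accept an `axis` argument that controls whether the reduction runs over all elements, down the columns, or across the rows. A minimal sketch with an illustrative 2 x 3 matrix:

```python
import numpy as np

m = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

total = m.sum()              # all elements
col_sums = m.sum(axis=0)     # one value per column
row_means = m.mean(axis=1)   # one value per row
sample_std = np.std(m, ddof=1)   # sample standard deviation (divides by n - 1)

print(total)       # 21.0
print(col_sums)    # [5. 7. 9.]
print(row_means)   # [2. 5.]
print(sample_std)
```

Note that `np.std` computes the population standard deviation by default (`ddof=0`); passing `ddof=1` gives the sample estimate.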

### 3.1.5 Numpy Linear Algebra

Numpy provides many functions to support linear algebra operations.

In [30]:
X = np.random.randn(2,3)    # create a 2 x 3 random matrix
print("X: ",X)
print("X transpose:", X.T)             # matrix transpose operation X^T

y = np.random.randn(3) # random vector 
print(y)
print(X.dot(y))        # matrix-vector multiplication  X * y
print(X.dot(X.T))      # matrix-matrix multiplication  X * X^T

X:  [[ 0.51624387 -0.78571652 -0.62434021]
 [ 1.01123181  2.01096586  1.30555826]]
X transpose: [[ 0.51624387  1.01123181]
 [-0.78571652  2.01096586]
 [-0.62434021  1.30555826]]
[-1.46496307  1.36470673 -0.81534981]
[-1.31949516  0.1984747 ]
[[ 1.27365889 -1.8731194 ]
 [-1.8731194   6.77105585]]
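
Two further tools are often used alongside `dot`: the `@` operator, which performs the same matrix product, and the `np.linalg` module. The sketch below uses a small hand-picked matrix `A` and vector `b` (both illustrative) to solve a linear system.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

product = A @ b                  # same result as A.dot(b)
print(product)                   # [11. 18.]

x = np.linalg.solve(A, b)        # solves A x = b without forming the inverse
print(x)
print(np.allclose(A @ x, b))     # True

identity = np.linalg.inv(A) @ A  # approximately the identity matrix
print(identity)
```

For solving systems, `np.linalg.solve` is generally preferable to multiplying by `np.linalg.inv(A)`, as it tends to be faster and numerically more stable.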


## 3.2 Introduction to Pandas

Pandas provides two convenient data structures for storing and manipulating data: Series and DataFrame.<br> A Series is similar to a one-dimensional array, whereas a DataFrame is closer to a matrix or a spreadsheet table.

### 3.2.1 Series

* A Series is a one-dimensional array structure that stores homogeneous data, i.e., data of a single type.
* A Series object consists of a one-dimensional array of values, whose elements can be referenced using an index array.
* A Series object can be created from a list, a numpy array, or a Python dictionary. You can apply most of the numpy functions on the Series object.


In [33]:
from pandas import Series

s = Series([3.1, 2.4, -1.7, 0.2, -2.9, 4.5])   # creating a series from a list
print(s)
print('Values=', s.values)     # display values of the Series
print('Index=', s.index)       # display indices of the Series

0    3.1
1    2.4
2   -1.7
3    0.2
4   -2.9
5    4.5
dtype: float64
Values= [ 3.1  2.4 -1.7  0.2 -2.9  4.5]
Index= RangeIndex(start=0, stop=6, step=1)


In [34]:
import numpy as np

s2 = Series(np.random.randn(6))  # creating a series from a numpy ndarray
print(s2)
print('Values=', s2.values)   # display values of the Series
print('Index=', s2.index)     # display indices of the Series

0    1.719995
1   -0.153993
2    2.370376
3   -0.589321
4   -0.198099
5    0.055443
dtype: float64
Values= [ 1.71999507 -0.15399302  2.37037574 -0.58932056 -0.19809886  0.05544345]
Index= RangeIndex(start=0, stop=6, step=1)


In [35]:
s3 = Series([1.2,2.5,-2.2,3.1,-0.8,-3.2], 
            index = ['Jan 1','Jan 2','Jan 3','Jan 4','Jan 5','Jan 6',])
print(s3)
print('Values=', s3.values)   # display values of the Series
print('Index=', s3.index)     # display indices of the Series

Jan 1    1.2
Jan 2    2.5
Jan 3   -2.2
Jan 4    3.1
Jan 5   -0.8
Jan 6   -3.2
dtype: float64
Values= [ 1.2  2.5 -2.2  3.1 -0.8 -3.2]
Index= Index(['Jan 1', 'Jan 2', 'Jan 3', 'Jan 4', 'Jan 5', 'Jan 6'], dtype='object')


In [27]:
capitals = {'MI': 'Lansing', 'CA': 'Sacramento', 'TX': 'Austin', 'MN': 'St Paul'}

s4 = Series(capitals)   # creating a series from dictionary object
print(s4)
print('Values=', s4.values)   # display values of the Series
print('Index=', s4.index)     # display indices of the Series

MI       Lansing
CA    Sacramento
TX        Austin
MN       St Paul
dtype: object
Values= ['Lansing' 'Sacramento' 'Austin' 'St Paul']
Index= Index(['MI', 'CA', 'TX', 'MN'], dtype='object')


In [38]:
s3 = Series([1.2,2.5,-2.2,3.1,-0.8,-3.2], 
            index = ['Jan 1','Jan 2','Jan 3','Jan 4','Jan 5','Jan 6',])
print(s3)

# Accessing elements of a Series

print('\ns3[2]=', s3.iloc[2])   # display third element of the Series by position (.iloc avoids the deprecated positional s3[2])
print('s3[\'Jan 3\']=', s3['Jan 3'])   # indexing element of a Series 

print('\ns3[1:3]=')             # display a slice of the Series
print(s3[1:3])

Jan 1    1.2
Jan 2    2.5
Jan 3   -2.2
Jan 4    3.1
Jan 5   -0.8
Jan 6   -3.2
dtype: float64

s3[2]= -2.2
s3['Jan 3']= -2.2

s3[1:3]=
Jan 2    2.5
Jan 3   -2.2
dtype: float64
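
Because a Series with string labels can be indexed both by position and by label, the explicit `.iloc` and `.loc` accessors remove any ambiguity. A short sketch, using an illustrative Series `s`:

```python
from pandas import Series

s = Series([1.2, 2.5, -2.2], index=['Jan 1', 'Jan 2', 'Jan 3'])

by_position = s.iloc[2]                  # third element by position
by_label = s.loc['Jan 3']                # element with label 'Jan 3'
pos_slice = s.iloc[0:2]                  # positions 0 and 1; the end position is excluded
label_slice = s.loc['Jan 1':'Jan 2']     # label slices include the end label

print(by_position)    # -2.2
print(by_label)       # -2.2
print(pos_slice)
print(label_slice)
```

Both slices select the same two elements here, but for different reasons: the positional slice stops before position 2, while the label slice includes its end label.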


In [39]:
print('shape =', s3.shape)  # get the dimension of the Series
print('size =', s3.size)    # get the # of elements of the Series

shape = (6,)
size = 6


In [40]:
print(s3[s3 > 0])   # applying filter to select elements of the Series

Jan 1    1.2
Jan 2    2.5
Jan 4    3.1
dtype: float64


In [31]:
print(s3 + 4)       # applying scalar operation on a numeric Series
print(s3 / 4)    

Jan 1    5.2
Jan 2    6.5
Jan 3    1.8
Jan 4    7.1
Jan 5    3.2
Jan 6    0.8
dtype: float64
Jan 1    0.300
Jan 2    0.625
Jan 3   -0.550
Jan 4    0.775
Jan 5   -0.200
Jan 6   -0.800
dtype: float64
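
Arithmetic between two Series matches elements by index label rather than by position. The following sketch, with two illustrative Series `a` and `b`, shows the resulting missing values and one way to handle them with the `fill_value` argument of `Series.add`.

```python
from pandas import Series

a = Series([1.0, 2.0, 3.0], index=['x', 'y', 'z'])
b = Series([10.0, 20.0, 30.0], index=['y', 'z', 'w'])

total = a + b                     # labels are matched, not positions
print(total)                      # 'w' and 'x' are NaN, 'y' is 12.0, 'z' is 23.0

filled = a.add(b, fill_value=0)   # a label missing on one side is treated as 0
print(filled)                     # w 30.0, x 1.0, y 12.0, z 23.0
```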


### 3.2.2 DataFrame

A DataFrame object is a tabular, spreadsheet-like data structure containing a collection of columns, each of which can be of different types (numeric, string, boolean, etc). Unlike Series, a DataFrame has distinct row and column indices. There are many ways to create a DataFrame object (e.g., from a dictionary, list of tuples, or even numpy's ndarrays).

In [10]:
from pandas import DataFrame

cars = {'make': ['Ford', 'Honda', 'Toyota', 'Tesla'],
       'model': ['Taurus', 'Accord', 'Camry', 'Model S'],
       'MSRP': [27595, 23570, 23495, 68000]}          
carData = DataFrame(cars)   # creating DataFrame from dictionary
carData                     # display the table

     make    model   MSRP
0    Ford   Taurus  27595
1   Honda   Accord  23570
2  Toyota    Camry  23495
3   Tesla  Model S  68000


In [11]:
print(carData.index)       # print the row indices
print(carData.columns)     # print the column indices

RangeIndex(start=0, stop=4, step=1)
Index(['make', 'model', 'MSRP'], dtype='object')


In [12]:
carData2 = DataFrame(cars, index = [1,2,3,4])  # change the row index
carData2['year'] = 2018    # add column with same value
carData2['dealership'] = ['Courtesy Ford','Capital Honda','Spartan Toyota','N/A']
carData2                   # display table

     make    model   MSRP  year      dealership
1    Ford   Taurus  27595  2018   Courtesy Ford
2   Honda   Accord  23570  2018   Capital Honda
3  Toyota    Camry  23495  2018  Spartan Toyota
4   Tesla  Model S  68000  2018             N/A
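
Individual columns and filtered rows can be pulled out of such a table with square-bracket indexing. The sketch below rebuilds the car table under the illustrative name `cars_df`; the 25000 price threshold is an arbitrary choice.

```python
from pandas import DataFrame

cars_df = DataFrame({'make': ['Ford', 'Honda', 'Toyota', 'Tesla'],
                     'model': ['Taurus', 'Accord', 'Camry', 'Model S'],
                     'MSRP': [27595, 23570, 23495, 68000]})

makes = cars_df['make']                  # a single column is returned as a Series
subset = cars_df[['make', 'MSRP']]       # a list of names returns a DataFrame
cheap = cars_df[cars_df['MSRP'] < 25000] # rows where the price is below 25000

print(makes)
print(subset)
print(cheap)
print(cars_df.dtypes)                    # the data type of each column
```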


Creating DataFrame from a list of tuples.

In [13]:
tuplelist = [(2011,45.1,32.4),(2012,42.4,34.5),(2013,47.2,39.2),
              (2014,44.2,31.4),(2015,39.9,29.8),(2016,41.5,36.7)]
columnNames = ['year','temp','precip']
weatherData = DataFrame(tuplelist, columns=columnNames, index=[1,2,3,4,5,6])
weatherData

   year  temp  precip
1  2011  45.1    32.4
2  2012  42.4    34.5
3  2013  47.2    39.2
4  2014  44.2    31.4
5  2015  39.9    29.8
6  2016  41.5    36.7


Creating DataFrame from a numpy ndarray.

In [12]:
import numpy as np

npdata = np.random.randn(5,3)  # create a 5 by 3 random matrix
columnNames = ['x1','x2','x3']
rowNames = ['v1','v2','v3','v4','v5']
data = DataFrame(npdata, index = rowNames, columns=columnNames)
print(data)
data.loc[['v1','v2'],:]    # select rows 'v1' and 'v2'; not displayed because it is not the last expression in the cell

# .loc[] selects data by row and column labels.
print(data.loc[['v1'],['x1']]) # retrieving row named 'v1', column named 'x1'

# .iloc[] selects data by integer position. It is similar to .loc[] but works with positions instead of labels.
print(data.iloc[1,2])      # retrieving second row, third column

# accessing a slice of the DataFrame
print(data.iloc[1:3,1:3])

          x1        x2        x3
v1  0.990109  0.089774  1.420799
v2 -0.653840  1.660507 -0.176564
v3 -0.674726 -0.768856 -0.354946
v4 -0.147208 -0.672745 -0.548397
v5 -1.607483  0.676470 -1.676444
          x1
v1  0.990109
-0.17656384850123746
          x2        x3
v2  1.660507 -0.176564
v3 -0.768856 -0.354946
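
One difference between `.loc` and `.iloc` deserves a separate illustration: label slices include both endpoints, whereas position slices exclude the end. The sketch below uses a DataFrame `df` filled with consecutive integers (an illustrative stand-in for the random data) so the results are easy to follow.

```python
import numpy as np
from pandas import DataFrame

df = DataFrame(np.arange(15).reshape(5, 3),
               index=['v1', 'v2', 'v3', 'v4', 'v5'],
               columns=['x1', 'x2', 'x3'])

by_label = df.loc['v2':'v3', 'x2':'x3']   # label slices include both endpoints
by_position = df.iloc[1:3, 1:3]           # the same block selected by position
single = df.loc['v1', 'x1']               # scalar labels return a single value
filtered = df.loc[df['x1'] > 5, ['x1']]   # rows whose x1 exceeds 5

print(by_label)
print(by_position)
print(single)      # 0
print(filtered)    # rows v3, v4, v5
```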


The elements of a DataFrame can be accessed in many ways.

### 3.2.3 Arithmetic Operations

In [51]:
print(data)

print('Data transpose operation:')
print(data.T)    # transpose operation

print('Addition:')
print(data + 4)    # addition operation

print('Multiplication:')
print(data * 10)   # multiplication operation

         x1        x2        x3
0 -0.924972  0.088453  1.897823
1 -0.056519 -0.417261 -0.024445
2  0.702482 -2.188033  1.117382
3 -1.339030  2.274110  0.638500
4  0.080841 -1.424960  0.755171
Data transpose operation:
           0         1         2        3         4
x1 -0.924972 -0.056519  0.702482 -1.33903  0.080841
x2  0.088453 -0.417261 -2.188033  2.27411 -1.424960
x3  1.897823 -0.024445  1.117382  0.63850  0.755171
Addition:
         x1        x2        x3
0  3.075028  4.088453  5.897823
1  3.943481  3.582739  3.975555
2  4.702482  1.811967  5.117382
3  2.660970  6.274110  4.638500
4  4.080841  2.575040  4.755171
Multiplication:
          x1         x2         x3
0  -9.249718   0.884529  18.978226
1  -0.565195  -4.172611  -0.244447
2   7.024818 -21.880334  11.173821
3 -13.390303  22.741103   6.385000
4   0.808409 -14.249597   7.551709


In [52]:
print('data =')
print(data)

columnNames = ['x1','x2','x3']
data2 = DataFrame(np.random.randn(5,3), columns=columnNames)
print('\ndata2 =')
print(data2)

print('\ndata + data2 = ')
print(data.add(data2))

print('\ndata * data2 = ')
print(data.mul(data2))

data =
         x1        x2        x3
0 -0.924972  0.088453  1.897823
1 -0.056519 -0.417261 -0.024445
2  0.702482 -2.188033  1.117382
3 -1.339030  2.274110  0.638500
4  0.080841 -1.424960  0.755171

data2 =
         x1        x2        x3
0 -0.790838  0.866584  0.227547
1  0.573901 -2.482975 -0.626245
2  1.092818 -1.146424 -0.501450
3  1.176830 -0.443005  0.359190
4 -1.976991 -0.480053 -1.159900

data + data2 = 
         x1        x2        x3
0 -1.715810  0.955037  2.125370
1  0.517382 -2.900236 -0.650689
2  1.795300 -3.334458  0.615932
3 -0.162200  1.831105  0.997690
4 -1.896150 -1.905013 -0.404729

data * data2 = 
         x1        x2        x3
0  0.731503  0.076652  0.431844
1 -0.032437  1.036049  0.015308
2  0.767685  2.508415 -0.560311
3 -1.575811 -1.007442  0.229343
4 -0.159822  0.684056 -0.875923
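
As with Series, `add` and `mul` align the two DataFrames on their row and column labels, so they only combine values where the labels match. The sketch below uses two small illustrative frames, `left` and `right`, whose row labels differ, and shows one possible way to relabel before adding.

```python
import numpy as np
from pandas import DataFrame

left = DataFrame(np.ones((2, 2)), index=['v1', 'v2'], columns=['x1', 'x2'])
right = DataFrame(np.ones((2, 2)), index=['r1', 'r2'], columns=['x1', 'x2'])

mismatched = left.add(right)              # no shared row labels
print(mismatched.isna().all().all())      # True: every entry is NaN

# Relabel the rows of right to match left before adding
aligned = left.add(right.set_axis(left.index, axis=0))
print(aligned)                            # every entry is 2.0
```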


In [53]:
print(data.abs())    # get the absolute value for each element

print('\nMaximum value per column:')
print(data.max())    # get maximum value for each column

print('\nMinimum value per row:')
print(data.min(axis=1))    # get minimum value for each row

print('\nSum of values per column:')
print(data.sum())    # get sum of values for each column

print('\nAverage value per row:')
print(data.mean(axis=1))    # get average value for each row

print('\nCalculate max - min per row')
f = lambda x: x.max() - x.min()
print(data.apply(f, axis=1))

         x1        x2        x3
0  0.924972  0.088453  1.897823
1  0.056519  0.417261  0.024445
2  0.702482  2.188033  1.117382
3  1.339030  2.274110  0.638500
4  0.080841  1.424960  0.755171

Maximum value per column:
x1    0.702482
x2    2.274110
x3    1.897823
dtype: float64

Minimum value per row:
0   -0.924972
1   -0.417261
2   -2.188033
3   -1.339030
4   -1.424960
dtype: float64

Sum of values per column:
x1   -1.537199
x2   -1.667691
x3    4.384431
dtype: float64

Average value per row:
0    0.353768
1   -0.166075
2   -0.122723
3    0.524527
4   -0.196316
dtype: float64

Calculate max - min per row
0    2.822794
1    0.392816
2    3.305416
3    3.613141
4    2.180131
dtype: float64
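
The `axis` argument of `apply` decides whether the function receives each column (`axis=0`, the default) or each row (`axis=1`). A small sketch with an illustrative DataFrame `df` and the same max-minus-min function:

```python
from pandas import DataFrame

df = DataFrame({'x1': [1.0, 4.0, 2.0], 'x2': [10.0, 30.0, 20.0]})

spread = lambda v: v.max() - v.min()

per_column = df.apply(spread)            # axis=0 (default): one result per column
per_row = df.apply(spread, axis=1)       # axis=1: one result per row
vectorised = df.max() - df.min()         # equivalent to the column-wise apply

print(per_column)     # x1 3.0, x2 20.0
print(per_row)        # 9.0, 26.0, 18.0
print(vectorised)
print(df.describe())  # count, mean, std, min, quartiles, max per column
```

Where a built-in method such as `max` or `min` already exists, the vectorised form is usually faster than `apply` with a lambda.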
