# Scientific Computing With Python

![scientific_computing](sampleImages/scientific_computing.png)

>Scientific Computing is the **collection of tools, techniques, and theories required to solve on a computer mathematical models of problems in Science and Engineering**.

Scientific computing draws on **mathematics and computer science** to develop the best way to use computer systems to solve problems from science and engineering.

Scientific computing involves **model development and simulations** to understand natural systems. 

Scientific computing **requires knowledge of the subject of the underlying problem to be solved.**


Scientific computing is now regarded as the **Third Pillar of Science**, complementing and adding to **experimentation/observation and theory**.

A **computation tells** you what the **consequences of your theory are**, which **facilitates experimentation and observation work** because you can tell **what you are supposed to look for to judge whether your theory is valid**.

In [None]:
%%html
<iframe src="https://www.washingtonpost.com/graphics/2020/world/corona-simulator/" width="800" height="500"></iframe>

In [None]:
%%html
<iframe src="https://www.tacc.utexas.edu/-/supercomputers-create-world-s-most-detailed-simulations-of-tornadoes" width="800" height="500"></iframe>

**Two major characteristics associated with Scientific Computation**

1. Large Datasets/ Big Data.
2. Massive number of computations.

**So what do we need to venture into Scientific Computation?**

1. **Efficient form of representing data** (efficient data structures) which can then be **efficiently accessed** for performing computations on.

2. **Speed of computation** becomes a key factor in producing output in a reasonable amount of time.

3. **High Performance computing resources** (Computer Clusters, GPU, TPU etc).

But **who is using scientific computing?**

The people using scientific computing in their research are **not always computer scientists**. They include **biologists, chemists, physicists, climatologists, geographers, doctors, and engineers**.

So rather than wrestling with complex programming constructs and intricacies, researchers should be able to conduct scientific research efficiently on the large datasets typically associated with scientific computing.

## Python for Scientific Computing 

![scientific_computing](sampleImages/python_for_scientific_computing.jpg)

1. Easy to develop/code.
2. More readability (closer to English)
3. Micro-management tasks such as memory allocation and deallocation are taken care of automatically.

But for readability and ease of use, **Python pays a heavy price in speed**.


**Languages known for speed**

Mostly compiled languages such as 

1. C ---- very close to assembly language
2. Fortran ---- a long-standing giant of scientific computation
3. C++ ---- a bit friendlier than C, and still extremely fast

But then you have to deal with **complex programming constructs**.

1. Memory allocation
2. Manual memory deallocation
3. Pointer manipulation

This is not great news for scientists without a computer science background.

**So what’s the solution!**

What if a **computer scientist or expert programmer could develop efficient code in compiled languages such as C/C++/Fortran and somehow call it from Python**?

Now you get the **best of both worlds**.

You gain **speedups from the compiled code and at the same time enjoy the user-friendliness of Python**.

And that solution is

**Scientific Libraries in Python**

1. **Numpy** : for **efficient and fast numerical analysis**.
2. **Scipy** : Built on top of Numpy and supports a **lot of scientific methods**.
3. **Matplotlib** : For **plotting and producing high quality publishable diagrams**.
4. **Pandas** : For **data wrangling and analysis**. 

We will be covering Numpy and Pandas in this session.

## Numpy

![numpy](sampleImages/numpy_array_t.png)

1. **Numpy** is a **library for working with multi-dimensional arrays** in Python
2. Numpy stands for **Numerical Python**

### But what is an **Array**?

We have already learned about lists. 

Lists can store values of any type, for example a mix of strings and ints.

In [None]:
listExample = [1,2,3,4,5,'h','g','i'] #list can store pretty much anything

Lists can also store other lists (these are called nested lists).

Let's look at an example of creating a list of lists.

In this example we have a list of lists holding student attributes: id, height, weight, age, and grade (class).


In [None]:
students = [[1,176,180,15,10], [2,181,170,17,12],[3,167,176,19,10]] #student details id, height, weight, age, and grade

Now, how will we access the grade for all students? We probably have to **write a loop**.

In [None]:
grades = []
for student in students: #loop through all students
    grades.append(student[-1]) #we know that grade is the last value in each list
print (grades) # now grades will have grades for all students

Now, how will we access id, height, and weight together? Again, no prizes for guessing: **loops**.

In [None]:
idWeightHeight = []
for student in students: #loop through all students
    idWeightHeight.append([student[0],student[1],student[2]])#id height and weight
print (idWeightHeight) # now we will have a list of list

Now how will we find the maximum height (or minimum height, minimum weight, minimum age)? **Loops**, of course.

In [None]:
maxHeight = -1 # start below any possible height so the first student always replaces it
for student in students: #loop through all students
    if student[1]>maxHeight: # check whether this student's height is greater than maxHeight
        maxHeight =  student[1] #if so, replace maxHeight with this student's height
print (maxHeight)

How will we add two lists of the same length? **Loops** again.

In [None]:
firstList = [1,2,3,4,5]
secondList = [9,8,7,6,5]
result = []
for index in range(len(firstList)): # we will use index for looping
    result.append(firstList[index]+secondList[index]) #get values from both list and add
print (result)

How will we select all students whose age is greater than 16? **Loops with a condition**.

In [None]:
matchingRecords = []
ageThreshold = 16
for student in students: #loop through all students
    if student[-2]>ageThreshold: # check whether the student's age is greater than the threshold
        matchingRecords.append(student)
print (matchingRecords)

The crux is that **many list operations involve looping**...

And **Python loops are terribly slow** when compared to loops in compiled languages (C, Fortran, C++).

![numpy](sampleImages/slow_loops.jpeg)

So we need to **borrow faster loops** from other languages yet at the same time maintain the **user-friendliness of Python**

This is exactly what **Numpy** does through its rich Array datastructure.
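
To get a feel for the difference, the sketch below times a plain Python loop against the equivalent NumPy expression. The variable names and the choice of one million elements are illustrative assumptions, and the measured times depend on your machine, so treat the printed numbers as a rough indication only.

```python
import time
import numpy as np

n = 1_000_000                      # illustrative size, not from the original text
values = list(range(n))            # a plain Python list
array = np.arange(n, dtype=np.int64)  # the same numbers as a NumPy array

# sum of squares with an explicit Python loop
start = time.perf_counter()
loop_total = 0
for v in values:
    loop_total += v * v
loop_seconds = time.perf_counter() - start

# the same computation, with the loop running inside NumPy's compiled code
start = time.perf_counter()
numpy_total = int(np.sum(array ** 2))
numpy_seconds = time.perf_counter() - start

print('loop :', loop_total, 'in', loop_seconds, 'seconds')
print('numpy:', numpy_total, 'in', numpy_seconds, 'seconds')
```

Both approaches produce the same total; only the place where the loop runs differs.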

>An **array is a collection of similar data elements** stored at **contiguous memory locations**. It is the simplest data structure where **each data element can be accessed directly by only using its index number**.

The **key difference from a list** is **"collection of similar data elements"**. While a list can store any elements you want [1,'hi',True,[1,2],{'test':2},{'set'}], arrays can only store similar data elements.

While a list might be more convenient, the **slowness** due to **heterogeneity of elements** outweighs the convenience factor. An array stores only similar data elements, which allows a lot of optimizations (**homogeneity is really advantageous** in this case).
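
As a preview, here is a minimal sketch of how the loop-based list examples above could look once the data lives in a NumPy array. The indexing syntax used here (such as `[:, -1]`) is covered later in this session, and the variable names simply mirror the earlier cells.

```python
import numpy as np

# same student records as in the list examples: id, height, weight, age, grade
students = np.array([[1, 176, 180, 15, 10],
                     [2, 181, 170, 17, 12],
                     [3, 167, 176, 19, 10]])

grades = students[:, -1]                        # last column of every row
idHeightWeight = students[:, :3]                # first three columns of every row
maxHeight = students[:, 1].max()                # largest value in the height column
olderStudents = students[students[:, 3] > 16]   # rows where age is greater than 16

firstList = np.array([1, 2, 3, 4, 5])
secondList = np.array([9, 8, 7, 6, 5])
result = firstList + secondList                 # element-wise addition, no explicit loop

print(grades)
print(idHeightWeight)
print(maxHeight)
print(olderStudents)
print(result)
```

None of these lines contains a Python-level loop; the looping happens inside NumPy.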

### Creating arrays

#### Creating single dimensional arrays

![1darray](sampleImages/single_dimensional_array.png)

1. **From a list** (np.array())

A list can be **directly converted** to an array using the **np.array() method**

In [None]:
import numpy as np #you have to import numpy to use it
aList = [1,2,3,4,5]
anArray = np.array(aList) #converts a list to an array
print (anArray)
print (type(anArray))

2. Using **np.arange()** method

np.arange() method is **very similar to range() function**.

In [None]:
import numpy as np
arrayA = np.arange(5) # creates an array [0,1,2,3,4]
arrayB = np.arange(1,3) # creates an array [1,2]
arrayC = np.arange(1,10,2) # creates an array [1,3,5,7,9]
arrayD = np.arange(10,1,-2) # creates an array [10,8,6,4,2]

3. Using **np.linspace()** method to create evenly spaced numbers

linspace takes the arguments start and stop, plus num, the number of samples to generate. Both endpoints are included by default.

In [None]:
arrayE  = np.linspace(0,10,num = 6)  #here start parameter value is 0, and stop parameter value is 10 and num is number of samples
# so 6 evenly spaced numbers between 0 and 10

arrayF = np.linspace(0,5,num = 3) #three evenly spaced numbers between 0 and 5
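
A small sketch of what the two calls above produce. The `retstep=True` argument is an addition for illustration; it asks linspace to also return the spacing between samples.

```python
import numpy as np

# six evenly spaced numbers from 0 to 10, both endpoints included
arrayE, stepE = np.linspace(0, 10, num=6, retstep=True)
print(arrayE)   # values 0, 2, 4, 6, 8, 10 (as floats)
print(stepE)    # spacing between consecutive values

# three evenly spaced numbers from 0 to 5
arrayF = np.linspace(0, 5, num=3)
print(arrayF)   # values 0, 2.5, 5
```

Unlike np.arange(), where you choose the step, with np.linspace() you choose how many samples you want and the step follows from that.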


#### Multi dimensional arrays

![2darray](sampleImages/twodimensionalmatrix.png)

1. Converting **lists to two dimensional arrays** using **np.array()**

In [None]:
aListOfList = [[0,1],[2,3],[4,5],[5,6],[44,34]]
a2dArray = np.array(aListOfList) #convert list of list to an array

2. Converting a one dimensional array to a two dimensional array using **reshape()**. 

In [None]:
a1dArray = np.arange(8) #this will create [0,1,2,3,4,5,6,7]
a2dArray = a1dArray.reshape((4,2)) #reshapes the 1 d array to a 2d array with 4 rows and 2 columns
a3dArray = a1dArray.reshape((2,2,2)) #reshapes the 1 d array to a 3d array
print (a2dArray)
print ('-----------')
print (a3dArray)
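
Reshaping only works when the new shape holds exactly the same number of elements. The sketch below illustrates this, and also shows `-1`, which asks NumPy to work out one dimension for you. The variable names `twoRows` and `reshapeError` are illustrative.

```python
import numpy as np

a1dArray = np.arange(8)                # 8 elements

twoRows = a1dArray.reshape((2, -1))    # -1 lets NumPy infer the number of columns
print(twoRows.shape)

reshapeError = None
try:
    a1dArray.reshape((3, 3))           # 8 elements cannot fill a 3x3 array (9 slots)
except ValueError as error:
    reshapeError = str(error)
    print('Reshape failed:', reshapeError)
```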

#### Some more ways of creating arrays

1. **np.zeros()** for creating array of zeros 

In [None]:
array1d = np.zeros(5) #will create an array [0,0,0,0,0]
array2d = np.zeros((2,3)) #will create an array with two rows and 3 columns with all elements being zero
print (array1d)
print (array2d)

2. **np.ones()** for creating array of ones

In [None]:
array1d = np.ones(5) #will create an array [1,1,1,1,1]
array2d = np.ones((2,3)) #2d array of ones
print (array1d)
print (array2d)

2. **np.eye()** for creating a two dimensional array with 1's on diagonal (**Identity Matrix**)

In [None]:
array1 = np.eye(2) #2x2 array with 1 along diagonal
print (array1)
array2 = np.eye(3) #3x3 array with 1 along diagonal
print (array2)

3. Create an array of random values using **np.random**

In [None]:
rand1darray = np.random.rand(4) # create a 1d array of size 4 with random values between 0 and 1
rand2darray = np.random.rand(3,2) #random values between 0 and 1 for 3x2 array
print (rand2darray)
randIntegers = np.random.randint(2, size=10) #10 random integers, each 0 or 1 (the upper bound 2 is excluded)
print (randIntegers)
randIntegers2 = np.random.randint(2,4, size=10) #10 random integers, each 2 or 3 (the upper bound 4 is excluded)
print (randIntegers2)
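
Random values change on every run, which can make results hard to reproduce. One possible approach, sketched below, is NumPy's `np.random.default_rng` generator with a fixed seed. The seed value 42 and the variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)        # generator with a fixed seed
randFloats = rng.random(4)                  # 4 random floats in [0, 1)
randIntegers = rng.integers(2, 4, size=10)  # 10 random integers, each 2 or 3

# a second generator with the same seed reproduces the same sequence
rngAgain = np.random.default_rng(seed=42)
sameFloats = rngAgain.random(4)

print(randFloats)
print(sameFloats)
print(randIntegers)
```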

### Array Properties 

1. Get shape of an array using **shape property**.

In [None]:
arrayA= np.asarray([1,2,3,4])
print (arrayA.shape)
arrayB = np.asarray([[1,2,3],[4,5,6]])
print (arrayB.shape)

2. Get **data type of array elements** using dtype property

In [None]:
arrayA= np.asarray([1,2,3,4])
print (arrayA.dtype)
arrayB = np.asarray([1.0,2.0,3.0,4.0])
print (arrayB.dtype)
arrayC = np.arange(10)
print (arrayC.dtype)
arrayC = np.asarray(['This is string','Another string'])
print (arrayC.dtype)

3. The property **size** gives the total number of elements in an array 

In [None]:
array1d=np.asarray([1,2,3,4])
print (array1d.size)  #4
array2d=np.asarray([[1,2],[3,4]])
print (array2d.size)  #4


### Operations on Arrays

In [None]:
arrayA = np.arange (5)
squareA = arrayA**2 # all elements of array will be squared
print (squareA)
cubeA = arrayA**3 # all elements will be cubed
print (cubeA)
arrayC = arrayA/5 # will divide every element by 5
print (arrayC)
arrayD = 5 * arrayA # will multiply each element by 5 . Also called scaling
print (arrayD)
arrayE = arrayA+5 # will add 5 to each element.
print (arrayE)
#these operations are also possible with higher dimensional arrays
arrayF = np.arange(6).reshape(2,3) #an array with 2 rows and 3 columns
arrayG = 5 * arrayF #multiply each element by 5
print (arrayG)

![mathoperations](sampleImages/samplemathoperations.png)

More **mathematical operations**.

In [None]:
arrayA = np.array([1,4,9,16,25])
arrayB = np.sqrt(arrayA) #square root
print (arrayB)
arrayC = np.square(arrayB) #square of array
print (arrayC)
sinArrayA = np.sin(arrayA) #sine of array
print (sinArrayA)
arrayE = np.array([-1,4,-9,16,25])
absArray = np.abs(arrayE)
print (absArray)

**Minimum, Maximum and Mean**

In [None]:
arrayA = np.array([1,4,9,16,25])
arrayMax = np.max(arrayA) #should return 25
print (arrayMax)
arrayMin = np.min(arrayA) #should return 1
print (arrayMin)
arrayMean = np.mean(arrayA) #should return mean of all numbers in the array
print (arrayMean)
# Now we look into two dimensional arrays
arrayC = np.array([[1,2,3],[2,5,6]])
maxC = np.max(arrayC)  #should return 6
print (maxC)
maxCAxis0 = np.max(arrayC,axis=0)  #[2,5,6]. Returns the max of every column
print (maxCAxis0)
maxCAxis1 = np.max(arrayC,axis=1)  #[3,6]. Returns the max of every row
print (maxCAxis1)
meanCAxis0 = np.mean(arrayC,axis=0)  #[1.5 3.5 4.5]. Returns the mean of every column
print (meanCAxis0)
meanCAxis1 = np.mean(arrayC,axis=1)  #[2.,4.33333333]. Returns the mean of every row
print (meanCAxis1)
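
A way to remember the axis argument: the axis you name is the one that gets collapsed. The sketch below uses np.sum on the same 2x3 array and prints the resulting shapes; the variable names are illustrative.

```python
import numpy as np

arrayC = np.array([[1, 2, 3], [2, 5, 6]])   # shape (2, 3): 2 rows, 3 columns

columnSums = np.sum(arrayC, axis=0)   # collapses the rows, one value per column
rowSums = np.sum(arrayC, axis=1)      # collapses the columns, one value per row

print(columnSums, columnSums.shape)
print(rowSums, rowSums.shape)
```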

![maxArray](sampleImages/maxArray.png)

**Transposing Array**

**Reverse or permute** the axes of an array.

![transpose](sampleImages/transpose.png)

In [None]:
rowVector = np.array([[1,2,3,4,5]]) # note the double brackets: a 1x5 two dimensional array
columnVector = rowVector.T # T is the property for transposing an array
print (columnVector)
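
Checking shapes makes the effect of a transpose easy to see. In this sketch, `flatVector` and `matrix` are illustrative additions; note that transposing a one dimensional array leaves it unchanged because there is only one axis to reverse.

```python
import numpy as np

rowVector = np.array([[1, 2, 3, 4, 5]])   # shape (1, 5)
columnVector = rowVector.T                 # shape (5, 1)

flatVector = np.array([1, 2, 3, 4, 5])     # shape (5,), a one dimensional array
matrix = np.arange(6).reshape((2, 3))      # shape (2, 3)

print(rowVector.shape, columnVector.shape)
print(flatVector.shape, flatVector.T.shape)
print(matrix.shape, matrix.T.shape)
```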

![rowandcolumnvector](sampleImages/rowandcolumnvector.png)

### Two Dimensional Array aka Matrix!!!!

![matrix](sampleImages/matrix.jpg)

**Heavily used in scientific computing**

Matrices can be used to store

1. Images (Gray scale with a matrix and 3d Tensor for RGB images)
2. Text data (one hot encoding matrix)
3. Variables for solving linear equations, as well as features for ML models

Let's look at some realworld examples of matrices,

1. Number of items sold in a bakery for three different years. It's a **3x4 matrix**

![realworldexamplematrix](sampleImages/realworldexamplematrix.png)

2. An example of one-hot encoding. Each column indicates a word token and each row indicates a sentence. Each value is the number of times that word occurs in the particular sentence.

![onehotencoding](sampleImages/onehotencoding.png)

**Some matrix properties**

1. **Trace of a Matrix**. Defined as the **sum of diagonal elements**.

In [None]:
arrayA = np.array([[1,2],[2,4]])
traceA = np.trace(arrayA) #should return 5 as diagonal elements are 1 and 4
print (traceA)

2. **Symmetric matrix**

A symmetric matrix is a **square matrix that is equal to its own transpose**. You can also spot one by checking that the **elements above the diagonal mirror the elements below it**.

$$
\left[
\begin{array}{c c c}
 1 & 1 & 0 \\
 1 & 1 & 0 \\
 0 & 0 & 1
\end{array}
\right]
$$

A real world example of a symmetric matrix is a **distance matrix** between places. Suppose there are three places a, b, c and we use a matrix to represent the distances between them.

$$
\begin{array}{c c} 
& \begin{array}{c c c} a & b &c \\ \end{array} \\
\begin{array}{c c c}a\\b\\c \end{array} &
\left[
\begin{array}{c c c}
 0 & 10 & 15 \\
10 & 0 & 20 \\
15 & 20 & 0
\end{array}
\right]
\end{array}
$$
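
Following the definition, symmetry can be checked by comparing a matrix with its transpose. This is a small sketch using the distance matrix above; `notSymmetric` is an extra matrix added for contrast.

```python
import numpy as np

# distance matrix between the places a, b and c
distances = np.array([[0, 10, 15],
                      [10, 0, 20],
                      [15, 20, 0]])
isSymmetric = np.array_equal(distances, distances.T)
print(isSymmetric)

# a matrix that differs from its transpose
notSymmetric = np.array([[1, 2],
                         [3, 4]])
print(np.array_equal(notSymmetric, notSymmetric.T))
```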



3. **Determinant of a matrix**

The **determinant of a matrix is a calculation that involves all the coefficients of the matrix**, and whose output is a **single number**. Geometrically, the determinant tells us how much the linear transformation represented by the matrix scales area (2D) or volume (3D).

![det_solve](sampleImages/determinants_solve.png)


A **geometrical interpretation of determinant**
![det_solve](sampleImages/determinantexamples.png)


In [None]:
arrayA = np.array([[1,2],[2,4]]) # a 2x2 matrix
detarrayA = np.linalg.det(arrayA)  #should be zero. np.linalg.det is used to calculate determinant
print (detarrayA)
arrayB = np.array([[1,2,7],[2,2,7],[3,2,5]]) # a 3x3 matrix
detarrayB = np.linalg.det(arrayB)  # determinant of the 3x3 matrix
print (detarrayB)
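
For a 2x2 matrix with rows (a, b) and (c, d), the determinant is ad - bc. The sketch below compares that hand calculation with np.linalg.det; the example matrix is an arbitrary choice. Because np.linalg.det works in floating point, the comparison uses np.isclose rather than ==.

```python
import numpy as np

matrix = np.array([[3, 1],
                   [2, 4]])
a, b = matrix[0]
c, d = matrix[1]

byHand = a * d - b * c             # 3*4 - 1*2
byNumpy = np.linalg.det(matrix)    # floating point result

print(byHand)
print(byNumpy)
print(np.isclose(byHand, byNumpy))
```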

#### Operations on Multiple Arrays

![matrix_op](sampleImages/matrix_operations.png)

1. **Element wise addition**

**Addition is defined only when both the arrays are of same shape**

In [None]:
arrayA = np.array([1,2,3,4])
arrayB = np.array([10,10,10,10])
arrayC = arrayA+arrayB #addition
print (arrayC)
arrayD = np.array([1,2,3,4,5])
try:
    arrayE = arrayD+arrayB #raises ValueError as the arrays have different shapes, (5,) and (4,)
except ValueError as error:
    print ('Addition failed:', error)

**Adding two matrices together**

In [None]:
arrayA = np.array([1,2,3,4]).reshape((2,2)) # a 2x2 array
arrayB = np.array([10,10,10,10]).reshape((2,2)) # a 2x2 array
arrayC = arrayA+arrayB #addition
print (arrayC)
arrayD = np.array([1,2,3,4,5,6]).reshape((2,3))
try:
    arrayE = arrayD+arrayB #raises ValueError as a 2x2 matrix cannot be added to a 2x3 matrix
except ValueError as error:
    print ('Addition failed:', error)

**Adding a matrix and vector**

A matrix and a vector can be added if the **vector can be broadcast to a matrix** with the same shape as the other matrix.

![broadcasting](sampleImages/broadcasting.png)

In [None]:
arrayA = np.array([1,2,3,2,5,6]).reshape((2,3)) # a 2x3 array
arrayB = np.array([[4],[2]]) # a 2x1 array
arrayC = arrayA+arrayB #addition
print (arrayC)
arrayD = np.array([[4,2,3]]) # a 1x3 array
arrayE = arrayA+arrayD
print (arrayE)
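
To see what broadcasting does behind the scenes, the sketch below uses np.broadcast_to to stretch the 2x1 array to the shape of the 2x3 array explicitly. NumPy does not need this step in normal code; it is shown only to make the stretched values visible. The name `stretchedB` is illustrative.

```python
import numpy as np

arrayA = np.array([1, 2, 3, 2, 5, 6]).reshape((2, 3))   # a 2x3 array
arrayB = np.array([[4], [2]])                            # a 2x1 array

# each row of arrayB is repeated across the 3 columns
stretchedB = np.broadcast_to(arrayB, arrayA.shape)
print(stretchedB)

# adding the stretched array gives the same result as letting NumPy broadcast
print(arrayA + stretchedB)
print(np.array_equal(arrayA + arrayB, arrayA + stretchedB))
```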

2. **Element wise subtraction**

**Same as addition**

3. **Element wise multiplication (Hadamard product)**

Not to be confused with **dot product or matrix multiplication**

In [None]:
arrayA = np.array([1,2,3])
arrayB = np.array([10,10,10])
arrayC = arrayA * arrayB
arrayD = np.array([1,2,3,4]).reshape((2,2)) #2x2 matrix
arrayE = np.array([10,10,10,10]).reshape((2,2)) #2x2 matrix
arrayF = arrayD * arrayE
print (arrayF)

3. **Element wise division**

**Same as multiplication and addition**

4. **Dot Product between vectors (np.dot)**

Dot product takes **two vectors of same length and returns a single number**.
It is also called **inner product**.

![dotproduct](sampleImages/dotproduct.gif)

In [None]:
arrayA = np.array([4,2,3])
arrayB = np.array([3,2,1])
arrayDot = np.dot(arrayA,arrayB) #calculates the dot product
print (arrayDot)

**Dot product of vectors** also has a **geometrical interpretation**: it tells us whether two vectors point in the same direction (similar), are orthogonal (not related to each other), or point in opposite directions.

For two vectors $\overrightarrow{u}$ and $\overrightarrow{v}$, the dot product is defined as

$$
\overrightarrow{u} \cdot \overrightarrow{v} = \lVert u\rVert  \lVert v\rVert \cos \theta
$$

where $\lVert u\rVert$ and $\lVert v\rVert$ are the Euclidean norms of $\overrightarrow{u}$ and $\overrightarrow{v}$ respectively and $\theta$ is the angle between the vectors.

When $\theta = 90 ^\circ$ the vectors are orthogonal to each other (not related), when $\theta = 0 ^\circ$ the vectors point in the same direction, and when $\theta = 180 ^\circ$ the vectors point in opposite directions.

![dotproductgeometry](sampleImages/dotproductgeometry.png)
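
Rearranging the formula above gives $\cos \theta = \frac{\overrightarrow{u} \cdot \overrightarrow{v}}{\lVert u\rVert \lVert v\rVert}$, so the angle can be recovered from the dot product. The helper `angleBetween` below is a hypothetical function written for this illustration, not part of NumPy, and it assumes both vectors are non-zero.

```python
import numpy as np

def angleBetween(u, v):
    """Angle in degrees between two non-zero vectors (illustrative helper)."""
    cosTheta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # clip guards against tiny floating point overshoots outside [-1, 1]
    return np.degrees(np.arccos(np.clip(cosTheta, -1.0, 1.0)))

u = np.array([1, 0])
sameDirection = angleBetween(u, np.array([3, 0]))    # vectors point the same way
orthogonal = angleBetween(u, np.array([0, 2]))       # vectors at a right angle
opposite = angleBetween(u, np.array([-4, 0]))        # vectors point opposite ways

print(sameDirection, orthogonal, opposite)
```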

5. **Matrix Vector Multiplication (@)**

**Matrix vector multiplication** can be **geometrically** visualized as applying a **linear transformation (matrix) on a vector.**

For example, for rotating any vector by $90^\circ$, we apply a transformation of the form 
$$
\left[
\begin{array}{c c}
 0 & -1  \\
 1 & 0  
\end{array}
\right]
$$

i.e., we multiply the rotation matrix by a vector to rotate that vector in space.

**Matrix vector multiplication is only defined when the number of columns in the matrix is equal to the number of elements in the vector (column vector)**.

![mat_vector](sampleImages/mat_vector.gif)

Corresponding Geometrical Visualization

![rotation](sampleImages/rotation.gif)

In [None]:
vectorA = np.array([3,2])
rotationMatrix = np.array([[0,-1],[1,0]])
arrayDot = rotationMatrix@vectorA #use @ operator for matrix vector multiplication
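
As a quick check that the matrix really rotates the vector by $90^\circ$, the sketch below verifies two things: the rotated vector is orthogonal to the original (their dot product is zero) and its length is unchanged. The name `rotated` is illustrative.

```python
import numpy as np

vectorA = np.array([3, 2])
rotationMatrix = np.array([[0, -1],
                           [1, 0]])

rotated = rotationMatrix @ vectorA
print(rotated)

# a 90 degree rotation leaves the vector orthogonal to where it started
print(np.dot(vectorA, rotated))

# rotation does not change the length of the vector
print(np.linalg.norm(vectorA), np.linalg.norm(rotated))
```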

6. **Matrix Matrix Multiplication (@)**

Matrix matrix multiplication is one of the most commonly used binary operations that produces a matrix from two other matrices. 

Matrix multiplication is **only defined when the number of columns in the first matrix is equal to the number of rows in the second matrix**. The output will be a matrix with as many rows as the first matrix and as many columns as the second matrix.

| A | B | Output |
| --- | --- | --- |
| 2x3 | 3x2| 2x2 |
| 2x2 | 2x2| 2x2 |
| 4x2 | 2x3| 4x3 |
| 4x3 | 4x3| Undefined |
| 1x3 | 3x2| 1x2 |

Let's see how matrix-matrix multiplication is performed,

![matmul](sampleImages/matmul.gif)

In [None]:
matrixA = np.array([[0,-1,3],[1,0,2]])
matrixB = np.array([[2,3],[4,3],[6,2]])
matAB = matrixA@matrixB #matrix multiplication between two matrices: (2x3) @ (3x2) gives a 2x2 result
print (matAB)
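
The shape rule from the table can be checked directly. This sketch uses arrays of ones purely as placeholders, since only the shapes matter here; the variable names are illustrative.

```python
import numpy as np

fourByTwo = np.ones((4, 2))
twoByThree = np.ones((2, 3))

product = fourByTwo @ twoByThree       # inner sizes match (2 and 2)
print(product.shape)                   # rows of the first, columns of the second

matmulError = None
try:
    np.ones((4, 3)) @ np.ones((4, 3))  # inner sizes differ (3 and 4), so undefined
except ValueError as error:
    matmulError = str(error)
    print('Multiplication failed:', matmulError)
```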

#### Retrieving elements from one dimensional and two dimensional arrays using indexing and slicing

##### One dimensional array

Numpy arrays support **indexing and slicing similar to list**

![numpy_indexing](sampleImages/numpy_indexing.png)

Sample code snippet

In [None]:
A = np.arange(6) # will create an array [0,1,2,3,4,5]
print ('A[0]',A[0])
print ('A[-1]',A[-1])
print ('A[3]',A[3])
print ('A[:2]',A[:2])
print ('A[3:]',A[3:])
print ('A[2:6]',A[2:6])
print ('A[-2:]',A[-2:])
print ('A[::2]',A[::2])
print ('A[2:6:2]',A[2:6:2])
print ('A[::]',A[::])

But what makes numpy arrays more **powerful is the ability to select elements based on conditional operators**

![logical_numpy](sampleImages/logical_numpy.png)

Sample code snippet

In [None]:
A = np.array([11,17,77,53,35,1,25,27])
print ('A[A<10]',A[A<10])
print ('A[A>50]',A[A>50])
print ('A[A<=25]',A[A<=25])
print ('A[np.logical_and(A>10,A<50)]',A[np.logical_and(A>10,A<50)])
print ('A[np.logical_or(A<10,A>50)]',A[np.logical_or(A<10,A>50)])
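
It can help to look at the condition on its own. A comparison such as `A > 10` produces an array of True/False values (a mask), and indexing with that mask keeps only the elements where it is True. The sketch below also uses the `&` operator, which for boolean arrays behaves like np.logical_and; the parentheses around each comparison are required.

```python
import numpy as np

A = np.array([11, 17, 77, 53, 35, 1, 25, 27])

mask = (A > 10) & (A < 50)     # element-wise "and" of two conditions
print(mask)                    # one True/False per element of A
print(A[mask])                 # only the elements where mask is True
print(mask.sum())              # True counts as 1, so this counts the matches
```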

##### Two dimensional array

![2dindexing](sampleImages/2dindexing.png)

Sample code snippet

In [None]:
A = np.arange(1,26).reshape(5,5)
print ('A[0][0]',A[0][0])
print ('A[0][4]',A[0][4])
print ('A[3][4]',A[3][4])
print ('A[4][-1]',A[4][-1])
print ('A[-1][0]',A[-1][0])
print ('A[0]',A[0])
print ('A[-1]',A[-1])
print ('A[0:2]',A[0:2])
print ('A[:,0]',A[:,0])
print ('A[:,-1]',A[:,-1])
print ('A[:,-2:]',A[:,-2:])
print ('A[0:2,0:2]',A[0:2,0:2])

##### Higher Dimensional Arrays

Indexing or slicing higher dimensional arrays is similar to indexing two dimensional or one dimensional arrays, with additional subscripts for the additional axes.

The diagram below shows some examples of 3d array indexing. For this example we will take a color image as our 3d array. Color images have **height, width and channels** (3 for RGB and 4 for RGBA). You can imagine a color image as three two dimensional arrays stacked on top of each other. In the diagram the three channels are shown separately for convenience.

![3dindexing](sampleImages/3dindexing.png)

Sample code snippet

In [None]:
A = np.arange(1,28).reshape((3,3,3))
A = np.dstack((A[0],A[1],A[2]))
print ('A[0][0][0]',A[0][0][0])
print ('A[0][1][1]',A[0][1][1])
print ('A[0][2][2]',A[0][2][2])
print ('A[0,:,0]',A[0,:,0])
print ('A[0,0,0:2]',A[0,0,0:2])
print ('A[0:2,0:2,:]',A[0:2,0:2,:])
print ('A[:,0,:]',A[:,0,:])

#### Real world applications of matrices

1. **Solving systems of linear equations**

$$
\begin{aligned}
x + y + z &= 6\\
2y + 5z &= 4\\
2x + 5y - z &= 27
\end{aligned}
$$

Systems of linear equations are common in engineering, physics, chemistry, computer science, and economics. While many methods are available for solving systems of linear equations (such as substitution), the matrix method for solving is highly optimized for computation based solutions. The above equations can be represented with matrices as

$$
\begin{bmatrix} 
1 & 1 & 1 \\ 
0 & 2 & 5 \\ 
2 & 5 & -1 
\end{bmatrix} 
\begin{bmatrix} 
x  \\ 
y \\ 
z
\end{bmatrix} 
= 
\begin{bmatrix} 
6 \\ 
4 \\ 
27 
\end{bmatrix}
$$


If we represent this equation as 
$Ax = C$

Then x can be obtained by 

$x = A^{-1}C$

where $A^{-1}$ is the inverse of matrix A (we will define inverse in next section)

$$
\begin{bmatrix} 
x  \\ 
y \\ 
z
\end{bmatrix} 
= 
\begin{bmatrix} 
1 & 1 & 1 \\ 
0 & 2 & 5 \\ 
2 & 5 & -1 
\end{bmatrix}^{-1} 
\begin{bmatrix} 
6 \\ 
4 \\ 
27 
\end{bmatrix}
$$

Let's look at the Python code for solving this equation

In [None]:
A = np.array([[1,1,1],[0,2,5],[2,5,-1]]) #coefficient matrix, matching the equations above
C = np.array([[6],[4],[27]]) #right-hand side as a column vector
xyz = np.linalg.inv(A)@C #np.linalg.inv returns the inverse of a matrix
print (xyz)
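
NumPy also provides np.linalg.solve, which solves the system directly without forming the inverse explicitly and is generally the preferred route in numerical work. The sketch below applies it to the system of equations above and checks the answer by substituting it back; the 1d right-hand side is a choice made for this example.

```python
import numpy as np

# coefficients and right-hand side of the system of equations above
A = np.array([[1, 1, 1],
              [0, 2, 5],
              [2, 5, -1]], dtype=float)
C = np.array([6, 4, 27], dtype=float)

xyz = np.linalg.solve(A, C)     # solves A x = C for x
print(xyz)

# substituting the solution back should reproduce the right-hand side
print(np.allclose(A @ xyz, C))
```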

**Inverse of a matrix**

A matrix has an inverse only if its **determinant is not zero**. The inverse is the matrix for which **multiplying the matrix by its inverse, in either order**, gives the **Identity matrix**.

$A^{-1}$ is only defined if

$\left|A\right| \ne 0$ and 

$A A^{-1} = A^{-1} A = I$  

where $I$ is the identity matrix with diagonal elements as 1 and all other elements as 0.

In numpy we can use the method np.linalg.inv to find the inverse of a matrix. For example, let's take a 2x2 matrix and check whether it has an inverse

$$\begin{bmatrix} 
4 & 7 \\ 
2 & 6 \\  
\end{bmatrix}$$


In [None]:
A = np.array([[4,7],[2,6]])
determinantA = np.linalg.det(A) # for determining the determinant.
print ('Determinant of matrix is',determinantA)
if np.isclose(determinantA, 0): # the determinant is a float, so compare with a tolerance instead of ==
    print ("Inverse doesn't exist as determinant is zero")
else:
    inverseA = np.linalg.inv(A) #inverse of A
    A_inverseA = A@inverseA # A A(inv)
    inverseA_A = inverseA@A # A(inv) A
    #check A_inverseA and inverseA_A are equal and check whether A_inverseA equals Identity matrix
    if np.allclose(A_inverseA,inverseA_A) and np.allclose(A_inverseA,np.eye(A_inverseA.shape[0],A_inverseA.shape[1])):
        print ('Inverse exists and inverse of matrix is',inverseA)
    else:
        print ("Inverse doesn't exist")
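
For a 2x2 matrix with rows (a, b) and (c, d), the inverse has a well-known closed form: swap a and d, negate b and c, and divide by the determinant ad - bc. The sketch below works this out by hand for the matrix above and compares it with np.linalg.inv; `inverseByHand` is an illustrative name.

```python
import numpy as np

A = np.array([[4, 7],
              [2, 6]])
a, b = A[0]
c, d = A[1]

determinantA = a * d - b * c                      # 4*6 - 7*2
inverseByHand = np.array([[d, -b],
                          [-c, a]]) / determinantA

print(inverseByHand)
print(np.allclose(inverseByHand, np.linalg.inv(A)))   # matches NumPy's result
print(np.allclose(A @ inverseByHand, np.eye(2)))      # A times its inverse is the identity
```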

2. Storing input features for data analysis

Each row will be a record and each column will be a feature

Let's take a small extract of the Iris flower dataset, which data scientists use to study classification problems.

| Sepal Length | Sepal Width | Petal Length | Petal Width | Class label |
| --- | --- | --- | --- | --- |
| 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 4.9 | 3 | 1.4 | 0.2 | Iris-setosa |
| 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 7 | 3.2 | 4.7 | 1.4 | Iris-versicolor |
| 6.4 | 3.2 | 4.5 | 1.5 | Iris-versicolor |
| 6.3 | 2.9 | 5.6 | 1.8 | Iris-virginica |
| 7.1 | 3 | 5.9 | 2.1 | Iris-virginica |

In this dataset Sepal Length, Sepal Width, Petal Length, and Petal Width are the input features and Class label is the output label (which we use for training in this case). We could represent the input features as a matrix

$$
\begin{bmatrix} 
5.1 & 3.5 & 1.4 & 0.2 \\ 
4.9 & 3 & 1.4 & 0.2 \\ 
4.7 & 3.2 & 1.3 & 0.2\\
7 & 3.2 & 4.7 & 1.4\\
6.4 & 3.2 & 4.5 & 1.5\\
6.3 & 2.9 & 5.6 & 1.8\\
7.1 & 3 & 5.9 & 2.1
\end{bmatrix}
$$
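
Here is a short sketch of the same feature matrix as a NumPy array, using the seven rows from the extract above. The variable names are illustrative. Each row is a record and each column is a feature, so column-wise operations summarize one feature across all flowers.

```python
import numpy as np

# columns: sepal length, sepal width, petal length, petal width
features = np.array([[5.1, 3.5, 1.4, 0.2],
                     [4.9, 3.0, 1.4, 0.2],
                     [4.7, 3.2, 1.3, 0.2],
                     [7.0, 3.2, 4.7, 1.4],
                     [6.4, 3.2, 4.5, 1.5],
                     [6.3, 2.9, 5.6, 1.8],
                     [7.1, 3.0, 5.9, 2.1]])

print(features.shape)                  # (records, features)

columnMeans = features.mean(axis=0)    # mean of each feature over all records
print(columnMeans)

petalLengths = features[:, 2]          # the petal length column
print(petalLengths.max())
```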

3. Storing grayscale images

A grayscale image is an image in which the **value of each pixel is a single sample representing only an amount of light**; that is, **it carries only intensity information**.

Let's look at a grayscale image 

![grayscale](sampleImages/grayscale.jpg)


We will utilize **matplotlib library to read this image**.

In [None]:
import matplotlib.pyplot as plt
imageArray = plt.imread('sampleImages/grayscale.jpg') #to read an image into an array

In this case the image array will be a matrix. Let's check its shape

In [None]:
imageArray.shape

So it has 1333 rows and 1000 columns. Let's slice off the top 500 rows and then display the image

In [None]:
sliceArray = imageArray[500:,:] #500 th row onwards
plt.imshow(sliceArray,cmap=plt.get_cmap('gray'))

Let's slice off the first 300 columns and then display the image

In [None]:
sliceArray = imageArray[:,300:] #300 th column onwards
plt.imshow(sliceArray,cmap=plt.get_cmap('gray'))

Let's slice off the first 500 rows and 300 columns and then display the image

In [None]:
sliceArray = imageArray[500:,300:] #500 th row and 300 th column onwards
plt.imshow(sliceArray,cmap=plt.get_cmap('gray'))

As you can see, using basic array slicing we can manipulate our image.

Now let's check the min and max values for this image

In [None]:
maxVal = np.max(imageArray)
minVal = np.min(imageArray)
print (maxVal)
print (minVal)

As you can see, maxVal is 255 (white) and minVal is 0 (black).
Let's subtract every value from 255 and see what happens


In [None]:
newArray = 255 - imageArray # as easy as that. Will create a new array
plt.imshow(newArray,cmap=plt.get_cmap('gray'))

Wow, the colors are inverted and we have a negative of the image.
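
To see the arithmetic on actual numbers, here is a sketch on a tiny made-up array standing in for an image. It assumes 8-bit pixel values (0 to 255), which is what image arrays read from JPEG files typically hold; the values in `tinyImage` are arbitrary.

```python
import numpy as np

# a 2x3 "image" with made-up 8-bit pixel values
tinyImage = np.array([[0, 64, 128],
                      [192, 255, 32]], dtype=np.uint8)

negative = 255 - tinyImage     # dark pixels become bright and vice versa
print(negative)

# inverting twice brings back the original values
print(np.array_equal(255 - negative, tinyImage))
```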

### Higher Dimensional Arrays

Up to now we have looked mainly at one dimensional arrays (vectors) and two dimensional arrays (matrices). Let's look at some examples where we can have arrays of dimension greater than 2.

1. **Color Images (3d)**

Color images are stored as three dimensional arrays, with the third dimension holding one entry per color channel (R - Red, G - Green and B - Blue)

![rgb image](sampleImages/rgb.jpg)

As this is a three dimensional array there are three axes. The first axis represents the height of the image, the second axis represents the width of image and the third axis represents the total number of channels (in this case 3). You can imagine this as 3 Matrices stacked on top of each other. 

Let's look at a real example 

![tigerimage](sampleImages/tiger_new.jpg)

Again we will use **matplotlib to read the image**

In [None]:
import matplotlib.pyplot as plt
imageArray = plt.imread('sampleImages/tiger_new.jpg') #to read an image into an array

Lets check the shape of the array

In [None]:
print (imageArray.shape)

So this image has a height of 810 pixels and a width of 1080 pixels. There are three channels as expected.

Now lets do some slicing

Remove first 300 rows of pixels from the image

In [None]:
sliceArray = imageArray[300:,:,:] #now we have 3 dimensions so there will be an additional index
plt.imshow(sliceArray)

Remove first 300 columns of pixels from the image

In [None]:
sliceArray = imageArray[:,300:,:] #now we have 3 dimensions so there will be an additional index
plt.imshow(sliceArray)

2. **Videos and batch of color images (4d)**

A batch of images (commonly used for image analysis) can be represented as a four dimensional (4d) array. 

**These are of the form (totalimages, image height, image width, number of channels)**

You can think of it as a stack of RGB images.

**Videos can also be represented as a four dimensional array**. Each frame of a video is an RGB image with 3 dimensions, and stacking the frames gives the video (when played at a certain frame rate).
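
A minimal sketch of building such a 4d array with np.stack. The frames here are tiny placeholder arrays filled with a constant value, and the sizes (10 frames of 4x5 pixels) are arbitrary choices for illustration, not real video data.

```python
import numpy as np

height, width, channels = 4, 5, 3

# ten placeholder "frames"; frame i is filled with the value i
frames = [np.full((height, width, channels), fill_value=i, dtype=np.uint8)
          for i in range(10)]

video = np.stack(frames)            # new first axis: (frames, height, width, channels)
print(video.shape)

firstFrame = video[0]               # one RGB image
print(firstFrame.shape)

redChannel = video[:, :, :, 0]      # first channel of every frame
print(redChannel.shape)
```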

## Pandas

![pandas_start](sampleImages/pandas_new.jpg)

So we have learned that Numpy can store, manipulate and analyze numerical data efficiently, and as far as speed is concerned, it is leaps and bounds ahead of Python containers such as lists.

But what Numpy lacks is **user-friendly methods for accessing external data (such as from files or even websites) as well as ready-to-use methods for data aggregation and analysis (grouping, pivoting, plotting)**. While such tasks can be done with Numpy, the **initial learning curve** is a bit higher. 

**Enter Pandas!!**

>**Pandas** is a software library written for the **Python programming language for data manipulation and analysis**. It offers **data structures and operations for manipulating numerical tables and time series**.

Let's start with a simple task of reading a CSV file using Pandas

### Reading a CSV file using Pandas

In this example we will read a CSV file containing data about various cities in the USA

In [None]:
import pandas as pd
usCityData = pd.read_csv(r'data/us_major_cities.csv') #as simple as that

Now we have the data stored in the variable usCityData. Let's check the data type of usCityData

In [None]:
print (type(usCityData))

So we can see that usCityData is of type **DataFrame**

>**DataFrame is a 2-dimensional labeled data structure with columns of potentially different types**.

Let's see what the DataFrame looks like

In [None]:
usCityData

As you can see, there are 3886 rows and 11 columns. To see just the **first 5 rows**

In [None]:
usCityData.head()

And last **5 rows**

In [None]:
usCityData.tail()

Let's see **all the columns**

In [None]:
usCityData.columns

Let's see how we can access data from **different columns**

In [None]:
usCityData['NAME'] #the data under 'NAME' column

In [None]:
usCityData['CAPITAL'] #the data under 'CAPITAL' column

In [None]:
usCityData[['NAME','POPULATION']] #the data under 'NAME' and  'POPULATION' column

In [None]:
usCityData[['X','Y','NAME']] #the data under 'X', 'Y' and  'NAME' column

Now let's check the **data type of different columns**

In [None]:
usCityData.dtypes

**The index for this DataFrame**

In [None]:
usCityData.index

The index for this DataFrame starts at 0 and stops at 3886. The stop value is exclusive, so with 3886 rows the last row label is 3885.

Let's **sort this DataFrame** in ascending order of the POPULATION field using **sort_values()**

In [None]:
usCityData.sort_values(by="POPULATION")

As we can see the dataframe is sorted by population in ascending order. Now we will **sort the dataframe by population in descending order**. We will display only the city name and population

In [None]:
sortedFrame = usCityData.sort_values(by="POPULATION",ascending=False) #ascending is set to False, the result also is a dataframe
sortedFrame[['NAME','POPULATION']] #we want to only display name and population
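
sort_values() also accepts a list of columns, with one ascending flag per column. The sketch below uses a small hypothetical DataFrame (the rows and the STATE column are made up for illustration and are not taken from us_major_cities.csv).

```python
import pandas as pd

# Hypothetical data, not taken from us_major_cities.csv
cities = pd.DataFrame({
    'NAME': ['A', 'B', 'C', 'D'],
    'STATE': ['OH', 'OH', 'TX', 'TX'],
    'POPULATION': [500, 1500, 800, 300],
})

# Sort by state (A-Z) and, within each state, by population (largest first)
ranked = cities.sort_values(by=['STATE', 'POPULATION'], ascending=[True, False])

# sort_values returns a new DataFrame; the original row order is unchanged
print(ranked[['NAME', 'STATE', 'POPULATION']])
```

Because the result is a new DataFrame, assign it to a variable (as done with sortedFrame above) if you want to keep the sorted order.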

### Reading a CSV file from url.

It's easy to read a CSV file hosted on a server. Let's look at an example

In [None]:
import pandas as pd
populationData = pd.read_csv('https://population.un.org/wpp/Download/Files/1_Indicators%20(Standard)/CSV_FILES/WPP2019_TotalPopulationBySex.csv') # no trailing space inside the URL string
print (populationData)
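
Besides file paths and URLs, read_csv also accepts file-like objects. The sketch below reads CSV text held in memory, which is handy for trying options without a network connection. The column names and values are hypothetical stand-ins, not the actual contents of the UN file.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a downloaded file
csv_text = "Location,Time,PopTotal\nAland,2019,30.0\nBland,2019,12.5\n"

# read_csv accepts a path, a URL, or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

# usecols keeps only the named columns
subset = pd.read_csv(io.StringIO(csv_text), usecols=['Location', 'PopTotal'])

print(sample.shape, list(subset.columns))
```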

### Creating a DataFrame from scratch

You can create dataframes from **lists/arrays as well as dictionaries** using the DataFrame constructor. 

Let's create a DataFrame from a list/array and then from a list of lists/multidimensional array

**From a list**

In [None]:
ages = [1,2,3,4,5,6]
dataFrame = pd.DataFrame(ages,columns=['age']) #will create a dataframe with column age having 6 values
dataFrame

**From a list of lists**

In [None]:
data = [[1,'Jake',173,165],[2,'Ann',162,145],[3,'Matt',180,185]] #ids are all integers, so the 'id' column gets a numeric type
studentFrame = pd.DataFrame(data,columns=['id','name','height','weight'])
studentFrame

**From a matrix**

In [None]:
import numpy as np
data = np.arange(25).reshape(5,5) # 5x5 matrix
matrixDataFrame = pd.DataFrame(data,columns = ['a','b','c','d','e'])
matrixDataFrame

**From a dictionary**

In [None]:
data = {'animal':['Tiger','Lion','Elephant','Zebra','Jaguar'],'Type':['Carnivore','Carnivore','Herbivore','Herbivore','Carnivore']}
animalFrame = pd.DataFrame(data) #no need of columns as the dictionary keys will be used as columns

In [None]:
animalFrame
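
The constructor also accepts a list of dictionaries, where each dictionary is one row. This is a small sketch with made-up rows; the 'Panda' entry is a hypothetical record added to show what happens when a key is missing.

```python
import pandas as pd

# Each dictionary is one row; the keys become column names
rows = [
    {'animal': 'Tiger', 'Type': 'Carnivore'},
    {'animal': 'Zebra', 'Type': 'Herbivore'},
    {'animal': 'Panda'},               # 'Type' is missing for this row
]
rowFrame = pd.DataFrame(rows)

# The missing entry is stored as a missing value
print(rowFrame)
print(rowFrame['Type'].isna().tolist())
```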

### Selecting and indexing DataFrames

An important rule with selection is that the type of the result depends on how you select: a single column label returns a **Series**, while a list of column labels returns a **DataFrame** (even when the list holds only one label).
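
The rule can be checked on a tiny hypothetical DataFrame (the values are made up; the column names mirror the city data used earlier).

```python
import pandas as pd

frame = pd.DataFrame({'NAME': ['A', 'B'], 'POPULATION': [10, 20]})

one_label = frame['NAME']                     # single label -> Series
one_in_list = frame[['NAME']]                 # list with one label -> DataFrame
two_labels = frame[['NAME', 'POPULATION']]    # list of labels -> DataFrame

print(type(one_label).__name__, type(one_in_list).__name__, type(two_labels).__name__)
```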

For this section we will be using some more datasets

1. Daily covid summary data from ODH. (https://coronavirus.ohio.gov/wps/portal/gov/covid-19/dashboards/key-metrics/cases-by-zipcode) 

In [None]:
covidSummaryDataZip = pd.read_csv(r'data/COVIDSummaryDataZIP.csv')

In [None]:
covidSummaryDataZip

2. Vaccine data from ODH (https://coronavirus.ohio.gov/wps/portal/gov/covid-19/dashboards/covid-19-vaccine/covid-19-vaccination-dashboard)

In [None]:
vaccineData = pd.read_csv(r'data/vaccine_data.csv')

In [None]:
vaccineData

3. Average temperature of Earth for every year from 1880 (from NASA)

In [None]:
averageTemperatureData = pd.read_csv(r'data/globalwarming.csv')

In [None]:
averageTemperatureData

4. Ice mass data from the poles (NASA)

In [None]:
iceMassData = pd.read_csv(r'data/ice_mass.csv')

In [None]:
iceMassData

#### Selection by labels

We have already seen selection by label

Selecting 'Zip Code', 'Population' and 'Case Count - Cumulative' from the Ohio covid cases data.

Let's see a snippet of the covid cases data.

In [None]:
covidSummaryDataZip.head()

And the columns

In [None]:
covidSummaryDataZip.columns

In [None]:
covidSummaryDataZip[['Zip Code','Population','Case Count - Cumulative']] #this will raise a KeyError because the actual column name is 'Zip Code ' (with a trailing space)

In [None]:
covidSummaryDataZip[['Zip Code ','Population','Case Count - Cumulative']]
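
One way to avoid this kind of surprise is to strip the whitespace from the column names right after loading. The sketch below uses a small hypothetical DataFrame with the same style of headers; the rest of this section keeps the original names (with the trailing space) so that the cells match the file as downloaded.

```python
import pandas as pd

# Hypothetical frame whose headers carry trailing spaces, as in the ODH file
raw = pd.DataFrame({
    'Zip Code ': ['44106'],
    'Population': [100],
    'Case Count - Last 14 Days ': [5],
})

# Strip leading/trailing whitespace from every column name
cleaned = raw.rename(columns=str.strip)

print(list(cleaned.columns))
```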

##### Using .loc for selection based on labelling

Select first 100 rows of the covid case data

In [None]:
covidSummaryDataZip.loc[0:99] #here 0 to 99 refers to the index labels; .loc includes both endpoints, so this gives 100 rows

Select last 100 rows

In [None]:
covidSummaryDataZip.loc[covidSummaryDataZip.index[-100:]] #.loc works on labels, so we pass the last 100 index labels; .loc[-100:] would not count from the end

Select every 5th row

In [None]:
covidSummaryDataZip.loc[::5] #Every 5th row
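
The difference between label-based and position-based slicing is easy to miss, so here is a short sketch on a hypothetical 10-row DataFrame. It uses .iloc (selection by position, covered later in this section) for comparison.

```python
import pandas as pd

frame = pd.DataFrame({'value': range(10)})   # default index labels 0..9

by_label = frame.loc[0:4]       # labels 0,1,2,3,4 -> 5 rows (end label included)
by_position = frame.iloc[0:4]   # positions 0,1,2,3 -> 4 rows (end excluded)
last_three = frame.iloc[-3:]    # negative positions count from the end

print(len(by_label), len(by_position), last_three['value'].tolist())
```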

Using **logical operators in conjunction with loc** to select subset of rows

Selecting zip codes with a population greater than 10,000

In [None]:
covidSummaryDataZip.loc[covidSummaryDataZip['Population']>10000]  # for selecting records with population greater than 10K

So there are 378 zip codes. 

Now select the record for a zip code 44106 (CWRU comes under this zip code)

In [None]:
covidSummaryDataZip.loc[covidSummaryDataZip['Zip Code ']==44106]

We didn't get any results. Why? Let's look at the data type for 'Zip Code '

In [None]:
covidSummaryDataZip.dtypes

Zip Code is of type object (strings in pandas are by default stored as objects), so a comparison with the integer 44106 never matches.

So let's query with the string '44106'

In [None]:
covidSummaryDataZip.loc[covidSummaryDataZip['Zip Code ']=='44106']

Ah yes!! we got our required record

Now what if we want to select all records with Population greater than 10000 **and** Case Count - Last 14 Days greater than 100

In [None]:
covidSummaryDataZip.loc[(covidSummaryDataZip['Population']>10000)&(covidSummaryDataZip['Case Count - Last 14 Days ']>100)]

Why did we get an error? We know that Population is of type int. But what about 'Case Count - Last 14 Days '?

In [None]:
covidSummaryDataZip['Case Count - Last 14 Days '].dtype

It's of type object. So we are essentially comparing strings to an integer (100), which is not defined.

How do we convert an object column to numbers?

We will use the method pd.to_numeric from pandas, which converts a sequence of elements to a numeric type (if possible). Let's check what the method returns

In [None]:
caseCount14daysNumbers = pd.to_numeric(covidSummaryDataZip['Case Count - Last 14 Days '])

So we found the culprit: there is a record with the value 1,056, which is treated as a string rather than a number (a number would be written 1056 rather than 1,056).

There are several approaches we can follow here: 1) correct the value in the file, 2) use string processing to remove the commas and then call the method again, 3) ignore the error and assign the value **NaN** to the problematic value, and 4) re-read the file, indicating that a comma is used as the thousands separator. **"NaN", standing for Not a Number, is a member of a numeric data type that can be interpreted as a value that is undefined or unrepresentable, especially in floating-point arithmetic**.

We will use the third approach for now, but it might not be viable in all situations (sometimes you would want to preserve the record).

In [None]:
caseCount14daysNumbers = pd.to_numeric(covidSummaryDataZip['Case Count - Last 14 Days '],errors ='coerce') #here errors = 'coerce' suppresses the parsing error and converts values that cannot be parsed to NaN

In [None]:
caseCount14daysNumbers

It seems we have multiple entries with strings that cannot be converted to numbers. For now we will move ahead with this set.
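
The second approach (removing the commas before converting) keeps the record instead of turning it into NaN. The sketch below compares it with the third approach on a short hypothetical Series; the three values are made up for illustration.

```python
import pandas as pd

counts = pd.Series(['12', '1,056', '7'])   # hypothetical values

# Approach 3: entries that cannot be parsed become NaN
coerced = pd.to_numeric(counts, errors='coerce')

# Approach 2: remove the thousands separator first, then convert
stripped = pd.to_numeric(counts.str.replace(',', '', regex=False))

print(coerced.tolist())    # the middle value is NaN
print(stripped.tolist())   # [12, 1056, 7]
```

This only works when the comma is the sole reason a value fails to parse; other non-numeric strings would still raise an error unless errors='coerce' is added as well.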

Now assign this new sequence to the existing column in the dataset

In [None]:
covidSummaryDataZip['Case Count - Last 14 Days '] = caseCount14daysNumbers

In [None]:
covidSummaryDataZip.loc[(covidSummaryDataZip['Population']>10000)&(covidSummaryDataZip['Case Count - Last 14 Days ']>100)]

Now we will check the fourth approach 

In [None]:
covidSummaryDataZip = pd.read_csv(r'data/COVIDSummaryDataZIP.csv',thousands = ',') #thousands = ',' indicates that a comma is used for separating 1000's

Now check the data type

In [None]:
covidSummaryDataZip.dtypes

Every field except Zip Code has changed to a float (numeric type)
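
The effect of the thousands argument can be seen on a small in-memory file. The two rows and the column names below are hypothetical; the quoted field is what allows a comma to appear inside a value of a comma-separated file.

```python
import io
import pandas as pd

# Hypothetical two-row file; the quotes protect the comma inside "1,056"
csv_text = 'zip,cases\n44106,"1,056"\n44102,87\n'

plain = pd.read_csv(io.StringIO(csv_text))
parsed = pd.read_csv(io.StringIO(csv_text), thousands=',')

print(plain['cases'].tolist())    # values stay as strings
print(parsed['cases'].tolist())   # values are parsed as integers
```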

Now re-run the selection

In [None]:
covidSummaryDataZip.loc[(covidSummaryDataZip['Population']>10000)&(covidSummaryDataZip['Case Count - Last 14 Days ']>100)]

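So we use the '&' operator as the element-wise equivalent of Python's 'and'. Each condition must be wrapped in parentheses because '&' binds more tightly than the comparison operators. Now let's try the '|' operator ('or')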

Select zip codes with Case Count - Last 14 Days less than 20 or greater than 1000

In [None]:
covidSummaryDataZip.loc[(covidSummaryDataZip['Case Count - Last 14 Days ']<20)|(covidSummaryDataZip['Case Count - Last 14 Days ']>1000)]

Let's select records for the zip codes 44102,44103,44104,44105,44106

If we want to check against a sequence of elements we can use the **isin()** method.

In [None]:
covidSummaryDataZip.loc[covidSummaryDataZip['Zip Code '].isin(['44102','44103','44104','44105','44106'])]

To retrieve all records other than these five zip codes we can use the **~ (not) operator**

In [None]:
covidSummaryDataZip.loc[~(covidSummaryDataZip['Zip Code '].isin(['44102','44103','44104','44105','44106']))]

Retrieve all records with a Case Count - Last 14 Days not greater than 100. 
We can of course select records with Case Count - Last 14 Days <= 100

In [None]:
covidSummaryDataZip.loc[covidSummaryDataZip['Case Count - Last 14 Days ']<=100]

This won't select records that have a NaN value for 'Case Count - Last 14 Days ', because any comparison with NaN evaluates to False. We could use the **not operator (~)** to select the records with a NaN value as well

In [None]:
covidSummaryDataZip.loc[~(covidSummaryDataZip['Case Count - Last 14 Days ']>100)]
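
The behaviour of NaN in comparisons can be checked on a three-element hypothetical Series. The last mask is an alternative, more explicit way to include missing values.

```python
import numpy as np
import pandas as pd

cases = pd.Series([50.0, 150.0, np.nan])   # hypothetical values

at_most_100 = cases <= 100                    # NaN compares as False
not_above_100 = ~(cases > 100)                # NaN > 100 is False, so the negation is True
explicit = (cases <= 100) | cases.isna()      # states the intent directly

print(at_most_100.tolist())     # [True, False, False]
print(not_above_100.tolist())   # [True, False, True]
print(explicit.tolist())        # [True, False, True]
```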

#### Using .iloc for selection by position

Suppose we want to select first 100 rows and first 4 columns

In [None]:
covidSummaryDataZip.iloc[0:100,0:4]

As you can see, using iloc is similar to using slicing with numpy multidimensional arrays

Now we want to get the data from first and fourth column

In [None]:
covidSummaryDataZip.iloc[:,[0,3]]

Data from last two columns

In [None]:
covidSummaryDataZip.iloc[:,-2:]

Getting first value of first column

In [None]:
covidSummaryDataZip.iloc[0,0]

### Converting a DataFrame to List/Array

It might sometimes be beneficial to convert a DataFrame to a list or an array, for example when a function only accepts an array or a list.

**Convert a single column to a list/array**

In [None]:
list(covidSummaryDataZip['Zip Code '])#this will return the column as a list

In [None]:
u = np.array(covidSummaryDataZip['Zip Code '])#this will return the column as an array

**Convert multiple columns to a list of lists/matrix**

In [None]:
covidSummaryDataZip[['Zip Code ','Population']].values.tolist()#this will return a list of lists

In [None]:
covidSummaryDataZip[['Zip Code ','Population']].values#this will return a 2-D NumPy array
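
pandas also provides the methods tolist() and to_numpy() for the same conversions. The sketch below applies them to a small hypothetical DataFrame that reuses two of the column names from the covid data.

```python
import pandas as pd

frame = pd.DataFrame({'Zip Code ': ['44102', '44106'], 'Population': [100, 200]})

zip_list = frame['Zip Code '].tolist()                   # plain Python list
pop_array = frame['Population'].to_numpy()               # 1-D NumPy array
pairs = frame[['Zip Code ', 'Population']].to_numpy()    # 2-D array, one row per record

print(zip_list, pop_array.shape, pairs.shape)
```

When the selected columns have different types (strings and integers here), the resulting array uses a general-purpose element type, so numeric operations on it are slower than on a purely numeric array.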

### Manipulating DataFrame

#### Creating a new column in DataFrame

Let's create an array of random values (using NumPy's random module). It needs to have as many elements as the DataFrame has rows, so let's check the shape of the DataFrame

In [None]:
covidSummaryDataZip.shape

So there are 1187 rows and 8 columns. So we need to create an array having 1187 values. 

In [None]:
randomData = np.random.randint(0,100,size = covidSummaryDataZip.shape[0]) # 1187 random integers from 0 to 99 (the upper bound 100 is excluded)

In [None]:
covidSummaryDataZip['random'] = randomData

Now check your new column

In [None]:
covidSummaryDataZip

You can see the new random column

#### Creating a new column using existing columns

Let's create a new column 'normalized_case_count' which is obtained by taking the ratio between Case Count - Cumulative and Population

In [None]:
covidSummaryDataZip['normalized_case_count'] = covidSummaryDataZip['Case Count - Cumulative']/covidSummaryDataZip['Population']
covidSummaryDataZip

Now what if we want to multiply the result with 1000

In [None]:
covidSummaryDataZip['normalized_case_count'] = covidSummaryDataZip['normalized_case_count']*1000

In [None]:
covidSummaryDataZip

#### Setting new values to different columns

As a simple example we want to change the first 100 values of the random column to zero

In [None]:
covidSummaryDataZip.loc[:99,'random'] = 0 #a single .loc call avoids chained assignment; labels 0 to 99 are the first 100 rows
covidSummaryDataZip

Now let's create a new column isHighlyPopulated with a default value of 0

Note: if you want to create a list that repeats the same value, you can use the multiplication operator with a list. For example, to create a list of a hundred 0's we can write

```python
aList = [0]*100
```

For creating a list of lists having [0,0] as each element we could write

```python
aListOfList = [[0,0]]*100
```

Be aware that all 100 elements of aListOfList refer to the same inner list, so changing one of them changes all of them.

In [None]:
aList = [0]*100
aListOfList = [[0,0]]*100
print (aList)
print (aListOfList)
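
Multiplying a list that contains another list repeats references to one and the same inner list. The short sketch below shows the effect and a list comprehension that builds independent inner lists instead.

```python
shared = [[0, 0]] * 3                        # three references to one inner list
independent = [[0, 0] for _ in range(3)]     # three separate inner lists

shared[0][0] = 9
independent[0][0] = 9

print(shared)        # [[9, 0], [9, 0], [9, 0]]
print(independent)   # [[9, 0], [0, 0], [0, 0]]
```

For a list of plain numbers such as [0]*100 this is not a concern, because assigning to an element replaces that element only.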

In [None]:
covidSummaryDataZip['isHighlyPopulated'] = [0]*(covidSummaryDataZip.shape[0])

In [None]:
covidSummaryDataZip

Now we can change the value of isHighlyPopulated field based on a condition. For example change the value of isHighlyPopulated to 1, if total population is greater than 10,000

In [None]:
covidSummaryDataZip.loc[covidSummaryDataZip['Population']>10000,'isHighlyPopulated'] = 1
covidSummaryDataZip

Change first three values of the random column to 1,2,3 respectively

In [None]:
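covidSummaryDataZip.iloc[:3, covidSummaryDataZip.columns.get_loc('random')] = [1,2,3] #a single .iloc call; get_loc gives the position of the 'random' column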
covidSummaryDataZip
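
When setting values, selecting the rows and the column in one .loc or .iloc call writes directly into the DataFrame, whereas a chained form such as frame['col'].loc[...] = value may act on a temporary copy. The sketch below shows both single-call forms on a hypothetical five-row DataFrame.

```python
import pandas as pd

frame = pd.DataFrame({'random': [5, 6, 7, 8, 9], 'Population': [10, 20, 30, 40, 50]})

# Rows selected by label slice (end label included), column selected by name
frame.loc[:2, 'random'] = 0

# Rows selected by position; get_loc turns the column name into a position
frame.iloc[:2, frame.columns.get_loc('Population')] = [1, 2]

print(frame)
```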

**Finding records with a NaN value for the normalized_case_count column**

In [None]:
covidSummaryDataZip.loc[covidSummaryDataZip['normalized_case_count'].isna()]

So we use the isna() method to check whether a value is NaN

Now what if we want to change these records to zero (or any other value)? A method that we can use is **fillna()**

In [None]:
covidSummaryDataZip['normalized_case_count'] = covidSummaryDataZip['normalized_case_count'].fillna(0)
covidSummaryDataZip

### Operations on DataFrame

1. Finding mean of every numerical column using **mean()**

In [None]:
covidSummaryDataZip.mean(numeric_only=True) #numeric_only=True skips non-numeric columns such as 'Zip Code '

Finding mean of a particular column

In [None]:
covidSummaryDataZip['Population'].mean()

2. Finding max using **max()**

In [None]:
covidSummaryDataZip.max()

Max of particular column

In [None]:
covidSummaryDataZip['Population'].max()

3. Finding the row with the max value for a particular column using **idxmax()**

In [None]:
row = covidSummaryDataZip['Population'].idxmax()
covidSummaryDataZip.loc[row]

4. Finding frequency of elements using **value_counts()**

Lets calculate frequency of values in random column

In [None]:
covidSummaryDataZip['random'].value_counts()

5. Finding unique values in a column using **unique()**

In [None]:
covidSummaryDataZip['random'].unique()

6. Show duplicate records using **duplicated()**

For this example we will use a sample dataset with few rows

In [None]:
sampleFrame = pd.DataFrame({'name':['Tom','Tim','Martha','Tom'],'age':[23,24,22,23]})
sampleFrame

In [None]:
sampleFrame.loc[sampleFrame.duplicated()]

7. Remove duplicates using **drop_duplicates()**

In [None]:
sampleFrame.drop_duplicates(inplace=True) #use inplace True to drop the records from the same dataframe else a new dataframe is created
sampleFrame
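
Both methods compare entire rows by default. The sketch below, on a hypothetical DataFrame where the two 'Tom' rows differ in age, uses the subset and keep parameters to compare on the name column only.

```python
import pandas as pd

people = pd.DataFrame({'name': ['Tom', 'Tim', 'Martha', 'Tom'], 'age': [23, 24, 22, 30]})

# No row is a full duplicate, because the two 'Tom' rows differ in age
full_dupes = people.duplicated()

# Compare on 'name' only; keep=False flags every member of a duplicate group
name_dupes = people.duplicated(subset=['name'], keep=False)

# Without inplace=True a new DataFrame is returned and 'people' is left untouched
unique_names = people.drop_duplicates(subset=['name'], keep='first')

print(full_dupes.tolist(), name_dupes.tolist(), unique_names['name'].tolist())
```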

### Combining different DataFrames using merge()

We can combine multiple DataFrames based on keys common to both DataFrames. This is similar to joining tables in relational databases. Let's see an example

In [None]:
studentAttributes = pd.DataFrame({'id':[1,2,3,4],'name':['jim','pat','mat','jay'],'age':[20,18,20,19]})
studentAttributes

In [None]:
studentMarks = pd.DataFrame([[1,87],[2,77],[3,86],[4,32]],columns = ['id','marks'])
studentMarks

We can combine these DataFrames using the **merge()** method, with id as the key

In [None]:
studentRecords = studentAttributes.merge(studentMarks,on='id') #this is an inner join by default
studentRecords

Now what if one of the students didn't attend the exam.

In [None]:
studentMarks = pd.DataFrame([[1,87],[2,77],[3,86]],columns = ['id','marks'])
studentMarks

If we merge the two DataFrames, the merged DataFrame won't have that particular row even though the student's attributes are available.

In [None]:
studentRecords = studentAttributes.merge(studentMarks,on='id') #this is an inner join by default
studentRecords
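
To keep every student even when marks are missing, merge() accepts a how parameter. The sketch below repeats the example with smaller hypothetical frames (named attrs and marks here) and compares the default inner join with a left join.

```python
import pandas as pd

attrs = pd.DataFrame({'id': [1, 2, 3, 4], 'name': ['jim', 'pat', 'mat', 'jay']})
marks = pd.DataFrame({'id': [1, 2, 3], 'marks': [87, 77, 86]})

inner = attrs.merge(marks, on='id')               # only ids present in both frames
left = attrs.merge(marks, on='id', how='left')    # every row of attrs is kept

print(len(inner), len(left))
print(left.loc[left['marks'].isna(), 'name'].tolist())   # students without marks
```

In the left join the missing marks appear as NaN, which can then be found with isna() or replaced with fillna().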

### Grouping Records together

We will use vaccine_data as the prime dataset for this section

In [None]:
vaccineData

Suppose we want to calculate the total count for each county (the total number of times a particular county occurred in the records)

We could use **groupby()** along with the **size()** method

In [None]:
vaccineData.groupby('county').size()

Now we want to calculate the total number of vaccines_completed for each county

We can use **groupby()** along with **sum()**

In [None]:
vaccineData.groupby(['county'])['vaccines_completed'].sum()

For calculating average number of vaccines we can use **groupby()** along with **mean()**

In [None]:
vaccineData.groupby(['county'])['vaccines_completed'].mean()
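
Several aggregations can be computed in one pass with agg(). The sketch below uses a few hypothetical rows that reuse the column names county and vaccines_completed; the values are made up and do not come from vaccine_data.csv.

```python
import pandas as pd

# Hypothetical rows using the same column names as the vaccine data
sample = pd.DataFrame({
    'county': ['Cuyahoga', 'Cuyahoga', 'Franklin'],
    'vaccines_completed': [10, 30, 5],
})

# One row per county, one column per aggregation
summary = sample.groupby('county')['vaccines_completed'].agg(['count', 'sum', 'mean'])

print(summary)
```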

### Plotting

For plotting we will use the global warming and ice volume datasets

In [None]:
averageTemperatureData

In [None]:
iceMassData

Lets plot average temperature against year. 

In [None]:
averageTemperatureData.columns

In [None]:
averageTemperatureData.plot(x='Year',y='Lowess(5)')

Let's change the column name from Lowess(5) to avgtemperature

In [None]:
averageTemperatureData.rename(columns={'Lowess(5)':'avgtemperature'},inplace=True)

In [None]:
averageTemperatureData

In [None]:
averageTemperatureData.plot(x='Year',y='avgtemperature')

Lets plot the icemass data.

In [None]:
iceMassData.plot(x='Yeartime',y='gigatonne',rot=45)

In [None]:
vaccineData.groupby(['county'])['vaccines_completed'].sum().plot.bar(figsize=(20,10),xlabel="Counties", ylabel="Vaccines Completed",fontsize=13)

### Writing DataFrame to CSV file

A DataFrame can be easily written to a CSV file using the **to_csv()** method. The target directory must already exist.

In [None]:
vaccineData.to_csv('outputdirectory/vaccineData.csv',index=False)
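
Since to_csv() does not create missing folders, it can help to create the output directory first. The sketch below writes a small hypothetical DataFrame into a temporary directory (so it leaves no files behind) and reads it back to check the result.

```python
import tempfile
from pathlib import Path

import pandas as pd

frame = pd.DataFrame({'county': ['A', 'B'], 'vaccines_completed': [10, 20]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / 'outputdirectory'
    out_dir.mkdir(parents=True, exist_ok=True)   # to_csv does not create folders

    out_file = out_dir / 'vaccineData.csv'
    frame.to_csv(out_file, index=False)          # index=False omits the row labels

    restored = pd.read_csv(out_file)             # read the file back to check it

print(restored)
```

Without index=False the row labels would be written as an extra unnamed first column, which then shows up as a column when the file is read again.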