# Introduction to Numpy and Pandas

## Learning Objectives
By the end of this lesson, you should be able to:
* Perform basic operations in Numpy;
* Perform basic operations in Pandas; and
* Analyse data using Numpy and Pandas.

## Outline
In this train we will:
* Introduce the Numpy Python library and some of its functionality;
* Introduce the Pandas Python library and some of its functionality; and
* Load, manipulate, and analyse data using Numpy and Pandas.

## Introduction

Python has many packages for working with data, covering tasks such as loading, analysing, and storing data. Numpy and Pandas are two of the most commonly used. In this train, you will learn some of the basic operations in Numpy and Pandas.

# Numpy
Numpy is a Python package that provides high-performance multidimensional array objects and numerical computing tools. It is the core library for scientific computing in Python ([see the full documentation](https://numpy.org/)). We will start by defining what an array is, then look at how to create, access, and modify arrays, and at some useful functions on arrays. Finally, we will identify when an array is the right data structure for data analysis or manipulation.

### What is an Array?
Like a Python list, a Numpy array is a data structure that stores multiple values. The main difference between the two is that lists can contain heterogeneous data types (combinations of `str`, `int`, even `list`), whilst all the values in a Numpy array share a single data type. Numpy arrays can be thought of as a grid of values and can be multi-dimensional.

Numpy arrays are stored more efficiently than Python lists and allow mathematical operations to be vectorized, which results in significantly higher performance than with looping constructs in Python.
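To make the idea of vectorization concrete, here is a minimal sketch that doubles ten numbers twice: once with an explicit loop over a list, and once with a single array operation. The names `values_list` and `values_array` are made up for this illustration.

```python
import numpy as np

# Hypothetical data for illustration: the integers 0 to 9.
values_list = list(range(10))
values_array = np.arange(10)

# With a list, we loop over the elements explicitly.
doubled_list = [value * 2 for value in values_list]

# With an array, the operation is applied to every element at once.
doubled_array = values_array * 2

print(doubled_list)
print(doubled_array)
```

Both approaches give the same numbers. The array version avoids the Python-level loop, which is where the performance gain on large inputs comes from.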

### Creating a Numpy Array
If we want to work with any numpy objects or functions we first need to import the Numpy library.


In [None]:
# Import the numpy package.
import numpy as np

To make a Numpy array, you use the `np.array()` function. All you need to do is pass a list to it. Optionally, you can also specify the data type of the values. Let's look at an example:

In [None]:
# Create an array by passing in a list of lists.
ratings = np.array([[94,89,63], [93,92,48], [92,94,56]])

ratings

array([[94, 89, 63],
       [93, 92, 48],
       [92, 94, 56]])
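The optional data type can be set with the `dtype` argument of `np.array()`. The sketch below, which uses the illustrative names `ratings_float` and `mixed_numbers`, also shows what happens when integers and floats are mixed without specifying a type.

```python
import numpy as np

# Ask for floating-point storage explicitly with the dtype argument.
ratings_float = np.array([[94, 89, 63], [93, 92, 48]], dtype=float)
print(ratings_float.dtype)  # float64

# Mixed ints and floats are converted to a single common type.
mixed_numbers = np.array([1, 2.5, 3])
print(mixed_numbers.dtype)  # float64
```

Because an array holds one data type, Numpy converts the integers to floats here instead of keeping the two types side by side.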

In [None]:
# The same data as a plain Python list of lists, for comparison.
ratings_list = [[94, 89, 63], [93, 92, 48], [92, 94, 56]]

In [None]:
# len() only counts the inner lists (the rows).
len(ratings_list)

3

You can see that there are 3 rows and 3 columns in this array. We can inspect the shape of the array by using its `shape` attribute. This will return a tuple of integers giving the size of the array along each dimension.

In [None]:
ratings.shape

(3, 3)
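As a small side-by-side sketch, the block below contrasts `len()` on a list of lists with the `shape` attribute of the equivalent array. It also prints `ndim` and `size`, two further array attributes that are not covered elsewhere in this train and are included only for illustration.

```python
import numpy as np

ratings_list = [[94, 89, 63], [93, 92, 48], [92, 94, 56]]
ratings = np.array(ratings_list)

print(len(ratings_list))  # number of inner lists only
print(ratings.shape)      # size along every dimension
print(ratings.ndim)       # number of dimensions
print(ratings.size)       # total number of elements
```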

We can also create arrays using the following functions:

* `np.ones()` - array of ones
* `np.zeros()` - array of zeros
* `np.random.random()` - array with random values

Let's look at some examples:

In [None]:
# Create an array of ones - pass in shape as a tuple.
np.ones((3,4))

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

In [None]:
# Create an array of zeros - pass in shape as a tuple.
np.zeros((2,3))

array([[0., 0., 0.],
       [0., 0., 0.]])

In [None]:
# Create an array of random values - pass in shape as a tuple.
np.random.random((3,3))

array([[0.82512646, 0.43199511, 0.81864918],
       [0.92504834, 0.7403102 , 0.63134769],
       [0.04490926, 0.22525485, 0.57725473]])
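Random values differ on every run, so the numbers above will not match yours. If repeatable values are needed, one option is a seeded generator created with `np.random.default_rng()`. This is a sketch of an alternative to the `np.random.random()` call above, and the seed value 42 is an arbitrary choice.

```python
import numpy as np

# The seed value 42 is an arbitrary choice for illustration.
rng = np.random.default_rng(42)
random_values = rng.random((3, 3))

# A second generator with the same seed reproduces the same values.
rng_again = np.random.default_rng(42)
same_values = rng_again.random((3, 3))

print(np.array_equal(random_values, same_values))  # True
```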

### Accessing Numpy Arrays
Numpy offers several ways to index into arrays. Here, we'll work through slicing and Boolean array indexing.

#### Slicing
Similar to Python lists, Numpy arrays can be sliced. Since arrays may be multi-dimensional, you specify a slice for each dimension of the array, with the slices separated by commas. For a 2-D array, the first dimension is the vertical axis (rows) while the second dimension is the horizontal axis (columns). For a 2-D array:

* `array[vertical index, horizontal index]` - for one element
* `array[vertical start:vertical end, horizontal start:horizontal end]` - for more than one element

Let's look at a few examples:

In [None]:
# Whole array.
ratings

array([[94, 89, 63],
       [93, 92, 48],
       [92, 94, 56]])

In [None]:
# Select top left element.
ratings[0,0]

94

In [None]:
# Select first row.
ratings[0,:]

array([94, 89, 63])

In [None]:
# Select first column.
ratings[:,0]

array([94, 93, 92])

In [None]:
# Select first two rows and first two columns.
ratings[0:2,0:2]

array([[94, 89],
       [93, 92]])
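Slices also accept negative indices and a step, as with Python lists. The sketch below shows both, and adds one point worth knowing: a basic slice is a view onto the original array, so `.copy()` is used when an independent array is wanted. The variable names are illustrative.

```python
import numpy as np

ratings = np.array([[94, 89, 63], [93, 92, 48], [92, 94, 56]])

# A negative index counts from the end: this selects the last row.
last_row = ratings[-1, :]

# A step of 2 selects every second column (the first and the third).
first_and_last_cols = ratings[:, ::2]

# Copy the slice before changing it so that ratings is left untouched.
top_left = ratings[0:2, 0:2].copy()
top_left[0, 0] = 0

print(last_row)
print(first_and_last_cols)
print(ratings[0, 0])  # still 94
```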

#### Boolean Array Indexing
Boolean array indexing lets you pick out a selection of elements from an array. This type of indexing is often used to select the elements of an array which satisfy a specific condition. The syntax is as follows:

* `array[condition]`

Let's look at an example:

In [None]:
# Select all values greater than 90.
ratings[ratings > 90]

array([94, 93, 92, 92, 94])
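It can help to look at the condition on its own. The sketch below prints the Boolean array produced by `ratings > 90` and then combines two conditions with `&`. The name `mask` and the thresholds 60 and 90 are chosen for illustration.

```python
import numpy as np

ratings = np.array([[94, 89, 63], [93, 92, 48], [92, 94, 56]])

# The condition produces an array of True/False values with the same shape.
mask = ratings > 90
print(mask)

# Indexing with the mask keeps only the elements where it is True.
print(ratings[mask])

# Combine conditions with & (and) or | (or), wrapping each in parentheses.
middle = ratings[(ratings > 60) & (ratings < 90)]
print(middle)
```

The result of Boolean indexing is always one-dimensional, because the selected elements no longer form a regular grid.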

### Modifying Numpy Arrays
We will now look at how to add elements to an array, followed by how to remove elements from an array.

#### Adding Elements
Adding elements can be done by using the `np.append()` function. This returns a new array with the elements added to the end, leaving the original array unchanged. Let's look at an example:

In [None]:
# Append an extra row - note that axis=0.
ratings_extra_row = np.append(ratings, [[92, 88, 78]], axis=0)

ratings_extra_row

array([[94, 89, 63],
       [93, 92, 48],
       [92, 94, 56],
       [92, 88, 78]])

In [None]:
ratings_extra_row.shape

(4, 3)

In [None]:
# Append an extra column using np.c_, which joins arrays column-wise.
ratings_extra_col = np.c_[ratings_extra_row, np.array([70, 70, 70, 70])]

ratings_extra_col

array([[94, 89, 63, 70],
       [93, 92, 48, 70],
       [92, 94, 56, 70],
       [92, 88, 78, 70]])

In [None]:
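# Append another column using np.append - note that axis=1.
# The new values are reshaped into a 4x1 column to match the 4 rows.
extra_col = np.array([75, 75, 75, 75]).reshape(4, 1)
ratings_extra_col2 = np.append(ratings_extra_col, extra_col, axis=1)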

ratings_extra_col2

array([[94, 89, 63, 70, 75],
       [93, 92, 48, 70, 75],
       [92, 94, 56, 70, 75],
       [92, 88, 78, 70, 75]])
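Two details of `np.append()` are easy to miss, so here is a short sketch of them: the original array is not changed, and leaving out `axis` flattens the result to one dimension. The last lines show `np.concatenate()` as an alternative way to add a column. Names such as `with_row` and `new_col` are illustrative.

```python
import numpy as np

ratings = np.array([[94, 89, 63], [93, 92, 48], [92, 94, 56]])

# np.append returns a new array; ratings itself keeps its shape.
with_row = np.append(ratings, [[92, 88, 78]], axis=0)
print(ratings.shape)
print(with_row.shape)

# Without axis, both inputs are flattened before being joined.
flattened = np.append(ratings, [92, 88, 78])
print(flattened.shape)

# np.concatenate joins a list of arrays along an existing axis.
new_col = np.array([[70], [70], [70]])
with_col = np.concatenate([ratings, new_col], axis=1)
print(with_col)
```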

#### Removing Elements
Deleting elements can be done by using the `np.delete()` function. This returns a new array without the elements at the specified indices, leaving the original array unchanged. Let's look at an example:

In [None]:
ratings

array([[94, 89, 63],
       [93, 92, 48],
       [92, 94, 56]])

In [None]:
# Delete the 3rd row - note that axis=0.
ratings_del_row = np.delete(ratings, [2], axis=0)

ratings_del_row

array([[94, 89, 63],
       [93, 92, 48]])

In [None]:
# Delete the 3rd column - note that axis=1.
ratings_del_col = np.delete(ratings, [2], axis=1)

In [None]:
ratings_del_col

array([[94, 89],
       [93, 92],
       [92, 94]])
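`np.delete()` also accepts several indices at once. The sketch below removes the first and last rows in one call, confirms that `ratings` itself is unchanged, and shows that omitting `axis` works on a flattened copy. The variable names are illustrative.

```python
import numpy as np

ratings = np.array([[94, 89, 63], [93, 92, 48], [92, 94, 56]])

# Delete the 1st and 3rd rows in a single call.
middle_row_only = np.delete(ratings, [0, 2], axis=0)
print(middle_row_only)

# The original array still has all three rows.
print(ratings.shape)

# Without axis, the array is flattened and the element at index 0 is removed.
without_first = np.delete(ratings, 0)
print(without_first.shape)
```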

### Functions on Numpy Arrays
There are various functions we can apply to Numpy arrays. We will look at a few of them, namely:

* `array.sum()`
* `array.min()`
* `array.max()`

In all these functions we can pass an `axis` argument. This allows us to apply the function to each column or each row separately. Let's look at examples:

In [None]:
# Sum of all elements in array.
ratings.sum()

721

In [None]:
# Sum of each row.
ratings.sum(axis=1)

array([246, 233, 242])

In [None]:
# Sum of each column.
ratings.sum(axis=0)

array([279, 275, 167])

In [None]:
ratings

array([[94, 89, 63],
       [93, 92, 48],
       [92, 94, 56]])

In [None]:
# Min of each row.
ratings.min(axis=1)

array([63, 48, 56])

In [None]:
# Max of each column.
ratings.max(axis=0)

array([94, 94, 63])
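The same `axis` pattern applies to other array methods. As a brief sketch, the block below uses `mean()` and `argmax()`, which are not covered above and are included here only to show that they follow the same convention.

```python
import numpy as np

ratings = np.array([[94, 89, 63], [93, 92, 48], [92, 94, 56]])

# Mean of each column.
col_means = ratings.mean(axis=0)
print(col_means)

# Position of the largest value in each row (0 = first column).
best_col_per_row = ratings.argmax(axis=1)
print(best_col_per_row)
```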

### A Little Explanation of Axis as Used in Numpy

Imagine you have a big box of building blocks. Each row in the box is like a line of blocks going from left to right, and each column is like a stack of blocks going from top to bottom.

Now, when we talk about "axis 0," think of it as the direction from the top to the bottom of the box, like going down the columns of blocks. So, if we say, "let's do something along axis 0," it means we're looking at or doing something with all the stacks of blocks, like all the columns.

On the other hand, when we talk about "axis 1," think of it as the direction from the left to the right of the box, like going across the rows of blocks. So, if we say, "let's do something along axis 1," it means we're looking at or doing something with all the lines of blocks, like all the rows.

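So, axis 0 is like going down in columns, and axis 1 is like going across in rows. For example, `ratings.sum(axis=0)` adds the values going down each column and returns one total per column, while `ratings.sum(axis=1)` adds the values going across each row and returns one total per row. It's just a way to help us know which way we're looking or moving when we're playing with our blocks (or arrays in the case of Numpy).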
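The direction is easier to see on an array that is not square. The sketch below uses a made-up 2 x 3 array called `scores`: summing along axis 0 leaves one value per column (3 values), and summing along axis 1 leaves one value per row (2 values).

```python
import numpy as np

# Hypothetical 2 x 3 array: 2 rows, 3 columns.
scores = np.array([[1, 2, 3],
                   [4, 5, 6]])

# Moving down the rows collapses them: one total per column.
col_totals = scores.sum(axis=0)
print(col_totals)        # [5 7 9]
print(col_totals.shape)  # (3,)

# Moving across the columns collapses them: one total per row.
row_totals = scores.sum(axis=1)
print(row_totals)        # [ 6 15]
print(row_totals.shape)  # (2,)
```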

### When to use Numpy Arrays

A Numpy array is a data structure which stores multiple values, all of the same data type, and can be multi-dimensional. We should use a Numpy array if all of the following statements hold:

* We have multi-dimensional data
* All entries are of the same data type

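Numpy arrays are also preferred over lists of lists because they are stored more efficiently and support vectorized operations and functions such as `sum()`, `min()`, and `max()` along an axis.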

## Appendix
- [Basic Numpy Functionality](https://numpy.org/doc/stable/user/absolute_beginners.html#what-is-an-array)

- [Numpy package home page](https://numpy.org)

- [Pandas package home page](https://pandas.pydata.org/)