# 3.1 - Data Handling & Processing: NumPy and Pandas

In [1]:
import numpy as np
import pandas as pd

## 3.1.1 - NumPy

## Componentwise scalar multiplication with NumPy

NumPy [2] is one of the most popular and useful scientific packages for the Python language. Most notably, it adds a wide range of linear algebra tools and 'a powerful N-dimensional array object' (NumPy Developers, 20xx)[2] which allows for the easy creation of tensors in Python.

NumPy-generated tensors, called ndarrays, are designed to be easy and intuitive to manipulate. Unlike standard Python lists, addition and scalar multiplication are performed componentwise on NumPy arrays, so they behave like the tensors found in mathematics. The snippet of code below shows multiplication by a constant of (a) a standard Python list and (b) a NumPy array. Notice that the Python list simply repeats its elements, whereas the NumPy array performs componentwise multiplication, as we would expect of a vector in mathematics.

In [2]:
c = 2

x_standard = [1, 2, 3]
x_numpy = np.array([1,2,3])

print("NumPy ", c*x_numpy)
print("Standard ", c*x_standard)

NumPy  [2 4 6]
Standard  [1, 2, 3, 1, 2, 3]
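Addition shows the same contrast as multiplication by a constant. The following is a minimal sketch; the variable names and values are illustrative and not taken from the original notebook.

```python
import numpy as np

a_list, b_list = [1, 2, 3], [10, 20, 30]
a_arr, b_arr = np.array(a_list), np.array(b_list)

# '+' on Python lists concatenates them
list_sum = a_list + b_list
# '+' on NumPy arrays adds corresponding components
arr_sum = a_arr + b_arr

print("Standard ", list_sum)  # Standard  [1, 2, 3, 10, 20, 30]
print("NumPy ", arr_sum)      # NumPy  [11 22 33]
```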


## Broadcasting with NumPy

In addition to this behaviour, NumPy implements a technique called broadcasting. Broadcasting refers to NumPy's ability to make sense of arithmetic involving two or more differently shaped tensors by conceptually stretching the smaller array to match the shape of the larger one. [5]

__Example:__ Suppose we wish to compute $(1, 2, 3)+2$. Although there is no ambiguity in how we should interpret this sum, it is not defined in mathematics because a vector and a scalar are not compatible under addition. NumPy, however, will produce the expected result: $(1, 2, 3) + 2 = (3, 4, 5)$. It achieves this by broadcasting the smaller, zero-dimensional tensor $2$ over the one-dimensional tensor $(1, 2, 3)$; that is, by interpreting $2$ as $(2, 2, 2)$ before performing componentwise addition. The snippet of code below shows the Python implementation of this example. NumPy achieves the correct result by broadcasting, whereas standard Python raises a ```TypeError```, meaning it cannot make sense of an addition between an integer and a list.

In [3]:
c = 2

x_standard = [1, 2, 3]
x_numpy = np.array([1,2,3])

print("NumPy ", c+x_numpy)
print("Standard ", c+x_standard)

NumPy  [3 4 5]


TypeError: unsupported operand type(s) for +: 'int' and 'list'
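The "interpret $2$ as $(2, 2, 2)$" step can be made visible with `np.broadcast_to`, which returns a read-only view of its input stretched to a given shape. This is only an illustrative sketch: NumPy does not need this call when evaluating `x + c`, and the variable names are chosen here for the example.

```python
import numpy as np

x = np.array([1, 2, 3])
c = 2

# Stretch the scalar to the shape of x; this is a view, not a real copy
c_stretched = np.broadcast_to(c, x.shape)
print(c_stretched)  # [2 2 2]

explicit = x + c_stretched  # componentwise addition of equal shapes
implicit = x + c            # NumPy broadcasts c automatically

print(np.array_equal(explicit, implicit))  # True
```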

Broadcasting is only possible if the tensors under consideration are compatible. NumPy compares the shapes of two tensors dimension by dimension, starting from the trailing (rightmost) dimension. The tensors are compatible if, in every position, (a) the two dimensions are equal or (b) one of them is $1$; if one tensor has fewer dimensions than the other, its shape is treated as though it were padded with $1$s on the left [5]. For example, in case (a), NumPy allows us to add two $5 \times 5$ matrices together as both of them have exactly the same dimensions, but NumPy will not permit the addition of a $5 \times 5$ matrix to, say, a $3 \times 4$ matrix.

In [4]:
x = np.zeros((5, 5))
y = np.ones((5, 5))
z = np.ones((3, 4))

print(x+y)
print(x+z)

[[ 1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.]]


ValueError: operands could not be broadcast together with shapes (5,5) (3,4) 
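Compatibility can also be checked from the shapes alone, without building the arrays. The sketch below assumes NumPy 1.20 or later, where `np.broadcast_shapes` is available; `can_broadcast` is a hypothetical helper written for this example, not part of NumPy.

```python
import numpy as np

def can_broadcast(*shapes):
    """Return True if arrays with these shapes could be broadcast together."""
    try:
        np.broadcast_shapes(*shapes)  # raises ValueError if incompatible
        return True
    except ValueError:
        return False

print(can_broadcast((5, 5), (5, 5)))        # True
print(can_broadcast((5, 5), (3, 4)))        # False
print(np.broadcast_shapes((5, 5), (1, 5)))  # (5, 5)
```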

Furthermore, in case (b) NumPy will permit the addition of, say, a $5 \times 5$ matrix and a $1 \times 5$ matrix by adding the smaller matrix to each row of the larger one. This is because the dimensions of the arrays agree except in one place, where the dimension of the smaller array is $1$. By a similar argument, NumPy permits the addition of a $5 \times 5$ matrix and a scalar by adding the scalar to every component. A scalar is zero-dimensional, with the empty shape $()$, so NumPy treats it as though its shape were $(1, 1)$ and broadcasting is feasible.

In [5]:
x = np.zeros((5, 5))
y = np.ones((1, 5))
z = 4

print(x+y)
print()  # blank line between the two results
print(x+z)

[[ 1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.]
 [ 1.  1.  1.  1.  1.]]

[[ 4.  4.  4.  4.  4.]
 [ 4.  4.  4.  4.  4.]
 [ 4.  4.  4.  4.  4.]
 [ 4.  4.  4.  4.  4.]
 [ 4.  4.  4.  4.  4.]]
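Two related points can be checked directly: a Python scalar has the empty shape `()`, and a $5 \times 1$ column is broadcast across the columns of a $5 \times 5$ matrix in the same way that a $1 \times 5$ row is broadcast down its rows. The array `col` below is an illustrative addition, not part of the original notebook.

```python
import numpy as np

x = np.zeros((5, 5))
z = 4
col = np.arange(5).reshape(5, 1)  # a 5x1 column holding 0, 1, 2, 3, 4

print(np.shape(z))  # ()
print(np.ndim(z))   # 0

# The column is stretched along the axis where its dimension is 1
result = x + col
print(result.shape)   # (5, 5)
print(result[:, 0])   # [0. 1. 2. 3. 4.]
```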


NumPy's broadcasting technique ensures that, given a sequence of unambiguous operations performed on its arrays, it will attempt to perform calculations conforming to our mathematical intuition.

## 3.1.2 - Pandas

Pandas [3] is another popular scientific library written for Python. It is primarily used for data analysis and manipulation [6] and, most notably, it introduces a flexible DataFrame object: a two-dimensional, spreadsheet-like data structure for storing and meaningfully organising data in Python [6]. It also introduces various IO tools for reading data from and writing data to files [7]. These IO tools are compatible with a multitude of file types [7] including, but not limited to, CSV, HTML and JSON files. For example, the code snippet below demonstrates the ability of pandas to read and import data from a locally stored CSV file. We can also see the layout of the data using the DataFrame object. From this DataFrame we could access and extract specific columns, rows or cells, and we will see examples of this later.

In [7]:
data = pd.read_csv("C://Users//Sam Kettlewell//Google Drive//University Work//Third Year//MATH3001//Chapters//Chapter 3//testdata.csv")

In [8]:
data

| Unnamed: 0 | positive integers | prime numbers | UK cities beginning with a B |
|---|---|---|---|
| 0 | 1 | 2 | Birmingham |
| 1 | 2 | 3 | Bristol |
| 2 | 3 | 5 | Belfast |
| 3 | 4 | 7 | Brighton |
| 4 | 5 | 11 | Bangor |
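As a brief sketch of column, row and cell access, the example below rebuilds the same data from an in-memory string rather than from `testdata.csv`. The string `csv_text` is an assumed stand-in for the file's contents: the `Unnamed: 0` column above suggests the file's first column has no header, which pandas would otherwise name automatically. Passing `index_col=0` uses that column as the row index instead.

```python
import io
import pandas as pd

# Assumed stand-in for testdata.csv (first header field left empty)
csv_text = """,positive integers,prime numbers,UK cities beginning with a B
0,1,2,Birmingham
1,2,3,Bristol
2,3,5,Belfast
3,4,7,Brighton
4,5,11,Bangor
"""

# index_col=0 makes the unnamed first column the row index
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

primes = df["prime numbers"]                       # a single column (Series)
third_row = df.iloc[2]                             # a row, selected by position
city = df.loc[3, "UK cities beginning with a B"]   # a cell, selected by labels

print(primes.tolist())                             # [2, 3, 5, 7, 11]
print(third_row["UK cities beginning with a B"])   # Belfast
print(city)                                        # Brighton
```

Here `iloc` selects by integer position and `loc` selects by index and column labels; with the default integer index used in this data, the two happen to coincide.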
