<center> Business Analytics & Business Intelligence  <br>
Deepak KC, deepak.kc@hamk.fi, 2020 <br> Introduction to NumPy & Pandas  </center>

# NumPy

[NumPy](https://docs.scipy.org/doc/numpy/) is a Python library for handling multi-dimensional arrays. It contains both the data structures needed for storing and accessing arrays, and the operations and functions for computing with them. Although arrays are usually used for storing numbers, other types of data, such as strings, can be stored as well. Unlike lists in core Python, NumPy's fundamental data structure, the array, must have the same data type for all its elements. This homogeneity allows highly optimized functions that use arrays as their inputs and outputs.
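To make the homogeneity requirement concrete, here is a minimal sketch (the variable names are only illustrative) of how NumPy settles on one common element type when it is given mixed input:

```python
import numpy as np

# A list may mix types freely; an array converts everything to one common type.
mixed_numbers = np.array([1, 2.5, 3])     # the ints are converted to floats
mixed_text = np.array([1, "two", 3.0])    # everything is converted to strings

print(mixed_numbers.dtype)   # a floating point type
print(mixed_text.dtype)      # a Unicode string type
```

The `dtype` attribute used here is covered in more detail in the section on array types and attributes.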

There are several uses for high-dimensional arrays in data analysis. For instance, they can be used to:

* store matrices, solve systems of linear equations, find eigenvalues/vectors, find matrix decompositions, and solve other problems familiar from linear algebra
* store multi-dimensional measurement data. For example, an element `a[i,j]` in a 2-dimensional array might store the temperature $t_{ij}$ measured at coordinates $i$, $j$ on a 2-dimensional surface.
* represent images and videos as NumPy arrays:

  * a gray-scale image can be represented as a two dimensional array
  * a color image can be represented as a three dimensional array, where the third dimension contains the color components red, green, and blue
  * a color video can be represented as a four dimensional array
* store tabular data. A 2-dimensional table might store a sequence of *samples*, and each sample might be divided into *features*. For example, we could measure the weather conditions once per day, and the conditions could include the temperature, direction and speed of wind, and the amount of rain. Then we would have one sample per day, and the features would be the temperature, wind, and rain. In the standard representation of this kind of tabular data, the rows correspond to samples and the columns correspond to features. We will see more of this kind of data in the chapters on Pandas and Scikit-learn.

In this chapter we will go through:

* Creation of arrays
* Array types and attributes
* Accessing arrays with indexing and slicing
* Reshaping of arrays
* Combining and splitting arrays

We start by importing the NumPy library, and we use the standard abbreviation `np` for it.

In [None]:
import numpy as np

## Creation of arrays
There are several ways of creating NumPy arrays. One way is to give a (nested) list as a parameter to the `array` constructor:

In [None]:
np.array([1,2,3])   # one dimensional array

Note that leaving out the brackets from the above expression, i.e. calling `np.array(1,2,3)` will result in an error.
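The error mentioned above can be reproduced safely with a `try`/`except` block. This is only a small sketch; the exact wording of the error message depends on the NumPy version, so it is printed rather than assumed:

```python
import numpy as np

# np.array expects all the elements in its first argument,
# so passing the numbers as separate arguments fails.
try:
    np.array(1, 2, 3)
    call_failed = False
except TypeError as error:
    call_failed = True
    print("TypeError:", error)

# Wrapping the numbers in a list works as intended.
print(np.array([1, 2, 3]))
```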

A two dimensional array can be given by listing the rows of the array:

In [None]:
np.array([[1,2,3], [4,5,6]])

Similarly, a three dimensional array can be described as a list of lists of lists:

In [None]:
np.array([[[1,2], [3,4]], [[5,6], [7,8]]])

There are some helper functions to create common types of arrays:

In [None]:
np.zeros((3,4))

To specify that elements are `int`s instead of `float`s, use the parameter `dtype`:

In [None]:
np.zeros((3,4), dtype=int)

Similarly `ones` initializes all elements to one, `full` initializes all elements to a specified value, and `empty` leaves the elements uninitialized:

In [None]:
np.ones((2,3))

In [None]:
np.full((2,3), fill_value=7)

In [None]:
np.empty((2,4))

The `eye` function creates the identity matrix, that is, a matrix whose diagonal elements are set to one and whose non-diagonal elements are set to zero:

In [None]:
np.eye(5, dtype=int)

The `arange` function works like Python's `range` function, but produces an array instead of a `range` object.

In [None]:
np.arange(0,10,2)

For non-integer ranges it is better to use `linspace`:

In [None]:
np.linspace(0, np.pi, 5)  # Evenly spaced range with 5 elements

With `linspace` one does not have to compute the length of the step; instead one specifies the desired number of elements. By default, the endpoint is included in the result, unlike with `arange`.
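The difference between the two functions can be seen side by side in the following sketch. The interval from 0 to 1 and the variable names are chosen only for illustration; `retstep=True` asks `linspace` to also return the step it computed:

```python
import numpy as np

# linspace: specify the number of elements; the endpoint 1.0 is included.
points, step = np.linspace(0, 1, 5, retstep=True)
print(points)   # 5 values from 0.0 to 1.0
print(step)     # the step that linspace computed for us

# arange: specify the step; the endpoint 1.0 is excluded.
print(np.arange(0, 1, 0.25))
```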

### Arrays with random elements

To test our programs we might use real data as input. However, real data is not always available, and it may take time to gather. We could instead generate random numbers to use as a substitute. NumPy can easily produce arrays of a desired shape filled with random numbers sampled from several different distributions, of which we mention only a few below. Random data can simulate real data better than, for example, ranges or constant arrays. Sometimes we also need random numbers in our programs to choose a subset of real data (sampling). Below are a few examples.

In [None]:
np.random.random((3,4))          # Elements are uniformly distributed from half-open interval [0.0,1.0)

In [None]:
np.random.normal(0, 1, (3,4))    # Elements are normally distributed with mean 0 and standard deviation 1

In [None]:
np.random.randint(-2, 10, (3,4))  # Elements are uniformly distributed integers from the half-open interval [-2,10)

Sometimes it is useful to be able to recreate exactly the same data in every run of our program. For example, if there is a bug in our program, which manifests itself only with certain input, then to debug our program it needs to behave deterministically. We can create random numbers deterministically, if we always start from the same starting point. This starting point is usually an integer, and we call it a *seed*. Example of use:

In [None]:
np.random.seed(0)
print(np.random.randint(0, 100, 10))
print(np.random.normal(0, 1, 10))

If you run the above cell multiple times, it will always give the same numbers, unlike the earlier examples. Try rerunning them now!

The call to `np.random.seed` initializes the *global* random number generator. The calls `np.random.random`, `np.random.normal`, etc. all use this global random number generator. It is, however, possible to create new random number generators and use those to sample random numbers from a distribution. Example of use:

In [None]:
new_generator = np.random.RandomState(seed=123)  # RandomState is a class, so we give the seed to its constructor
new_generator.randint(0, 100, 10)

You will see these used later in the materials and in the exercises, so that we can agree on what the random input data is. How else could we agree on whether a result is correct, if we cannot agree on what the input is?
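As a small illustration of why a shared seed lets us agree on the input, the sketch below creates two separate generators with the same seed (the value 42 and the variable names are arbitrary choices) and checks that they produce the same numbers:

```python
import numpy as np

# Two separate generators created with the same (arbitrary) seed
first = np.random.RandomState(seed=42)
second = np.random.RandomState(seed=42)

draws_first = first.randint(0, 100, 5)
draws_second = second.randint(0, 100, 5)

print(draws_first)
print(draws_second)
print(np.array_equal(draws_first, draws_second))   # identical sequences
```

Because each generator keeps its own state, drawing numbers from one of them does not disturb the sequence produced by the other, or by the global generator.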

## Array types and attributes

An array has several attributes: `ndim` tells the number of dimensions, `shape` tells the size in each dimension, `size` tells the number of elements, and `dtype` tells the element type. Let's create a helper function to explore these attributes:

In [None]:
def info(name, a):
    print(f"{name} has dim {a.ndim}, shape {a.shape}, size {a.size}, and dtype {a.dtype}:")
    print(a)

In [None]:
b=np.array([[1,2,3], [4,5,6]])
info("b", b)

In [None]:
c=np.array([b, b])          # Creates a 3-dimensional array
info("c", c)

In [None]:
d=np.array([[1,2,3,4]])                # a row vector
info("d", d)

Note above how NumPy printed the three dimensional array `c`. The general rules of printing an n-dimensional array as a nested list are:

* the last dimension is printed from left to right,
* the second-to-last is printed from top to bottom,
* the rest are also printed from top to bottom, with each slice separated from the next by an empty line.
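These rules can be checked on a small three dimensional array. The sketch below uses an illustrative array `e` of shape (2, 3, 4): each row of 4 elements is printed from left to right, the 3 rows of a slice from top to bottom, and the 2 slices are separated by an empty line:

```python
import numpy as np

# Shape (2, 3, 4): 2 slices, each with 3 rows of 4 elements
e = np.arange(24).reshape(2, 3, 4)
print(e.shape)
print(e)
```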

## Indexing, slicing and reshaping

### Indexing
A one dimensional array behaves like a list in Python:

In [None]:
a=np.array([1,4,2,7,9,5])
print(a[1])
print(a[-2])

For a multi-dimensional array the index is a comma separated tuple instead of a single integer:

In [None]:
b=np.array([[1,2,3], [4,5,6]])
print(b)
print(b[1,2])    # row index 1, column index 2
print(b[0,-1])   # row index 0, column index -1

In [None]:
# As with lists, modification through indexing is possible
b[0,0] = 10
print(b)

Note that if you give only a single index to a multi-dimensional array, it indexes the first dimension of the array, that is, the rows. For example:

In [None]:
print(b[0])    # First row
print(b[1])    # Second row
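The following sketch, using an illustrative array `m`, shows that a single index is equivalent to an index followed by a full slice over the remaining dimension, and that indexing the resulting row again reaches the same element as the comma separated form:

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6]])

# A single index selects a whole row, so these two expressions agree
print(m[1])       # second row
print(m[1, :])    # the same row, with an explicit slice over the columns

# Indexing the row again gives the same element as the comma separated index
print(m[1][2], m[1, 2])
```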

### Slicing
Slicing works similarly to lists, but now we can have slices in different dimensions:

In [None]:
print(a)
print(a[1:3])
print(a[::-1])    # Reverses the array

In [None]:
print(b)
print(b[:,0])
print(b[0,:])
print(b[:,1:])

We can even assign to a slice:

In [None]:
b[:,1:] = 7
print(b)

A common idiom is to extract rows or columns from an array:

In [None]:
print(b[:,0])    # First column
print(b[1,:])    # Second row
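One difference from lists is worth keeping in mind when slicing: a slice of a list is a copy, whereas a basic slice of a NumPy array is a *view* that shares its data with the original array. The sketch below illustrates this with made-up values and variable names:

```python
import numpy as np

numbers = [1, 4, 2, 7]
array = np.array(numbers)

list_slice = numbers[1:3]    # slicing a list makes a copy
array_slice = array[1:3]     # slicing an array gives a view of the same data

list_slice[0] = 100
array_slice[0] = 100

print(numbers)   # the original list is unchanged
print(array)     # the original array has changed

# Use copy() when an independent array is needed
independent = array[1:3].copy()
```

This is also why assigning to a slice, as shown earlier, modifies the original array.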

### Reshaping

When an array is reshaped, its number of elements stays the same, but they are reinterpreted to have a different shape. An example of this is to interpret a one dimensional array as a two dimensional array:

In [None]:
a=np.arange(9)
anew=a.reshape(3,3)
info("anew", anew)
info("a", a)
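Because the number of elements must stay the same, a shape that does not match the size is rejected. The sketch below (with illustrative variable names) shows the resulting error, and also the convenience of giving one dimension as `-1` so that NumPy infers it:

```python
import numpy as np

values = np.arange(9)

# The new shape must account for all 9 elements, so (2, 5) is rejected
try:
    values.reshape(2, 5)
    reshape_failed = False
except ValueError as error:
    reshape_failed = True
    print("ValueError:", error)

# One dimension may be given as -1; NumPy then infers it from the size
inferred = values.reshape(3, -1)
print(inferred.shape)
```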

In [None]:
d=np.arange(4)             # 1d array
dr=d.reshape(1,4)          # row vector
dc=d.reshape(4,1)          # column vector
info("d", d)
info("dr", dr)
info("dc", dc)

<div class="alert alert-warning">
Note that the 1d array and the row and column vectors, which are 2d arrays, are fundamentally different objects, even though they look similar. They behave differently when we combine or otherwise operate on arrays of different shapes, as we shall see in the next section and later in this material.
</div>
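Two simple differences can already be observed with transposition and indexing. This is a minimal sketch with illustrative variable names, not an exhaustive list of the differences:

```python
import numpy as np

v = np.arange(4)            # 1d array, shape (4,)
row = v.reshape(1, 4)       # row vector, shape (1, 4)
col = v.reshape(4, 1)       # column vector, shape (4, 1)

# Transposing swaps the axes of a 2d array, but leaves a 1d array as it is
print(v.T.shape)
print(row.T.shape)

# A single index gives a number for the 1d array, but a whole row for the 2d arrays
print(v[0])
print(row[0])
print(col[0])
```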

An alternative way to create, for example, column or row vectors is to index with `np.newaxis`. Sometimes this is easier or more natural than using the `reshape` method:

In [None]:
info("d", d)
info("drow", d[np.newaxis, :])
info("dcol", d[:, np.newaxis])
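The sketch below, using an illustrative array `v`, confirms that indexing with `np.newaxis` gives the same result as the corresponding `reshape` call. It also shows that `np.newaxis` is simply an alias for `None`:

```python
import numpy as np

v = np.arange(4)

print(np.newaxis is None)   # np.newaxis is an alias for None

# Each comparison checks both the shape and the elements
print(np.array_equal(v[np.newaxis, :], v.reshape(1, 4)))   # row vector
print(np.array_equal(v[:, np.newaxis], v.reshape(4, 1)))   # column vector
```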

## Array concatenation, splitting and stacking

There are two ways of combining several arrays into one bigger array: `concatenate` and `stack`. `concatenate` takes n-dimensional arrays and returns an n-dimensional array, whereas `stack` takes n-dimensional arrays and returns an (n+1)-dimensional array. A few examples of these:

In [None]:
a=np.arange(2)
b=np.arange(2,5)
print(f"a has shape {a.shape}: {a}")
print(f"b has shape {b.shape}: {b}")
np.concatenate((a,b))  # concatenating 1d arrays

In [None]:
c=np.arange(1,5).reshape(2,2)
print(f"c has shape {c.shape}:", c, sep="\n")
np.concatenate((c,c))   # concatenating 2d arrays

By default `concatenate` joins the arrays along axis 0. To join the arrays horizontally, add parameter `axis=1`:

In [None]:
np.concatenate((c,c), axis=1)

If you want to concatenate arrays with different numbers of dimensions, for example to add a new column to a 2d array, you must first reshape the arrays to have the same number of dimensions:

In [None]:
print("New row:")
print(np.concatenate((c,a.reshape(1,2))))
print("New column:")
print(np.concatenate((c,a.reshape(2,1)), axis=1))

Use `stack` to create higher dimensional arrays from lower dimensional arrays:

In [None]:
np.stack((b,b))

In [None]:
np.stack((b,b), axis=1)
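Comparing the shapes of the results makes the difference between the two functions explicit. In this sketch the one dimensional array `u` is an illustrative stand-in:

```python
import numpy as np

u = np.arange(3)    # shape (3,)

print(np.concatenate((u, u)).shape)     # (6,): still one dimensional
print(np.stack((u, u)).shape)           # (2, 3): the inputs become rows
print(np.stack((u, u), axis=1).shape)   # (3, 2): the inputs become columns
```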

The inverse operation of `concatenate` is `split`. Its argument specifies either the number of equal parts the array is divided into, or the explicit break points.

In [None]:
d=np.arange(12).reshape(6,2)
print("d:")
print(d)
d1,d2 = np.split(d, 2)
print("d1:")
print(d1)
print("d2:")
print(d2)

In [None]:
d=np.arange(12).reshape(2,6)
print("d:")
print(d)
parts=np.split(d, (2,3,5), axis=1)
for i, p in enumerate(parts):
    print("part %i:" % i)
    print(p)
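To see that `split` really undoes `concatenate`, the sketch below splits an illustrative array `t` and joins the parts again. It also shows what happens when the array cannot be divided into equal parts, and how `np.array_split` relaxes that requirement:

```python
import numpy as np

t = np.arange(12).reshape(6, 2)

# Splitting and then concatenating along the same axis restores the original
parts = np.split(t, 3)
restored = np.concatenate(parts)
print(np.array_equal(t, restored))

# np.split requires equal parts; 6 rows cannot be divided into 4 equal parts
try:
    np.split(t, 4)
    split_failed = False
except ValueError as error:
    split_failed = True
    print("ValueError:", error)

# np.array_split allows parts of unequal size
print([p.shape for p in np.array_split(t, 4)])
```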

# Introduction to Pandas

[*pandas*](http://pandas.pydata.org/) is a column-oriented data analysis API. It's a great tool for handling and analyzing input data, and many ML frameworks support *pandas* data structures as inputs.
Although a comprehensive introduction to the *pandas* API would span many pages, the core concepts are fairly straightforward, and we'll present them below. For a more complete reference, the [*pandas* docs site](http://pandas.pydata.org/pandas-docs/stable/index.html) contains extensive documentation and many tutorials.

**Learning Objectives:**
  * Gain an introduction to the `DataFrame` and `Series` data structures of the *pandas* library
  * Access and manipulate data within a `DataFrame` and `Series`
  * Import CSV data into a *pandas* `DataFrame`
  * Reindex a `DataFrame` to shuffle data
  


In [None]:
# Let's start by importing the pandas library
import pandas as pd

# Check which version of pandas is installed
pd.__version__

'0.24.2'

### Primary Data structures in Pandas

The primary data structures in *pandas* are implemented as two classes:

  * **`DataFrame`**, which you can imagine as a relational data table, with rows and named columns.
  * **`Series`**, which is a single column. A `DataFrame` contains one or more `Series` and a name for each `Series`.

The data frame is a commonly used abstraction for data manipulation.

In [None]:
# One way to create a Series is to construct a Series object. For example:
pd.Series(['San Francisco', 'San Jose', 'Sacramento'])

0    San Francisco
1         San Jose
2       Sacramento
dtype: object

In [None]:
# creating Dataframe objects

city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199])

pd.DataFrame({ 'City name': city_names, 'Population': population })

But most of the time, you load an entire file into a `DataFrame`. The following example loads a file with California housing data. Run the following cell to load the data and display its first three rows:

In [None]:
california_housing_dataframe = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv", sep=",")
california_housing_dataframe.head(3)
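The learning objectives also mention accessing data and reindexing a `DataFrame` to shuffle it. A minimal sketch of both, on a small in-memory `DataFrame` that reuses the city figures from above so that no download is needed (the names `cities` and `shuffled` are only illustrative), might look like this:

```python
import numpy as np
import pandas as pd

cities = pd.DataFrame({
    'City name': ['San Francisco', 'San Jose', 'Sacramento'],
    'Population': [852469, 1015785, 485199],
})

# Access a column (a Series) by its name, and a single value by its index label
print(cities['Population'])
print(cities['City name'][1])

# Columns support NumPy-style arithmetic
print(cities['Population'] / 1000)

# Shuffle the rows by reindexing with a randomly permuted index
shuffled = cities.reindex(np.random.permutation(cities.index))
print(shuffled)
```

The row order of `shuffled` changes from run to run unless the random number generator is seeded first, as described in the NumPy part of this material.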

Another powerful feature of *pandas* is graphing. For example, `DataFrame.hist` lets you quickly study the distribution of values in a column:

In [None]:
california_housing_dataframe.hist('housing_median_age')

***Deepak KC, 2020***