# Introduction to Numpy and Pandas

## Learning Objectives
By the end of this lesson, you should be able to:
* Perform basic operations in Numpy;
* Perform basic operations in Pandas; and
* Analyse data using Numpy and Pandas.

## Outline
* Introduce the Numpy Python library and some of its functionality;
* Introduce the Pandas Python library and some of its functionality; and
* Load, manipulate, and analyse data using Numpy and Pandas.

## Introduction

Python has many packages for loading, analysing, and storing data. Numpy and Pandas are two of the most widely used. In this train, you will learn some of the basic operations in both.

## Numpy
Numpy is the core library for scientific computing in Python. It provides high-performance multidimensional array objects and tools for numerical computing ([see the full documentation](https://numpy.org/)). We will start by defining what an array is, then look at how to create, access, and modify arrays, and at some useful functions on arrays. Finally, we will identify when an array is the right data structure for data analysis or manipulation.

### What is an Array?
Like a Python list, a numpy array is a data structure which stores multiple values. The main difference between the two is that lists can contain heterogeneous data types (combinations of `str`, `int`, even `list`), whilst all values in a numpy array must share the same data type. Numpy arrays can be thought of as grids of values and can be multi-dimensional.

Numpy arrays are stored more efficiently than Python lists and allow mathematical operations to be vectorized, which results in significantly higher performance than with looping constructs in Python.
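
As a minimal sketch of what "vectorized" means, the snippet below scales a handful of values twice: once with a Python loop over a list, and once with a single array expression. The variable names and the divisor are chosen only for illustration, and the `import` line is explained in the next section.

```python
import numpy as np

# Illustrative values only.
values = [94, 89, 63, 93, 92, 48]

# Pure-Python approach: visit the elements one by one.
scaled_loop = [v / 100 for v in values]

# Vectorized approach: one expression is applied to every element at once.
scaled_array = np.array(values) / 100

print(scaled_loop)
print(scaled_array)
```

Both approaches produce the same numbers; the array version avoids the explicit loop, which is where the performance benefit on large data comes from.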

### Creating a Numpy Array
If we want to work with any numpy objects or functions we first need to import the Numpy library.


In [None]:
import numpy as np

To make a numpy array, use the `np.array()` function. Pass it a list and, optionally, the data type of the values via the `dtype` argument. Let's look at an example:

In [None]:
# Create an array by passing in a list of lists.
ratings = np.array([[94,89,63], [93,92,48], [92,94,56]])

ratings

You can see that there are 3 rows and 3 columns in this array. We can inspect the shape of the array by using its `shape` attribute. This returns a tuple of integers giving the size of the array along each dimension.

In [None]:
ratings.shape
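
The optional data type mentioned earlier can be sketched as follows. The example assumes we want floating-point storage; the variable names are illustrative. It also shows what Numpy does when a list mixes integers and floats and no data type is given.

```python
import numpy as np

# Explicitly request floating-point storage with the dtype argument.
ratings_float = np.array([[94, 89, 63], [93, 92, 48]], dtype=float)
print(ratings_float.dtype)  # float64
print(ratings_float.shape)  # (2, 3)

# Without dtype, Numpy picks a single type that can hold every value,
# so the integers here are stored as floats.
mixed_numbers = np.array([1, 2.5, 3])
print(mixed_numbers.dtype)  # float64
```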

We can also create arrays using the following functions:

* `np.ones()` - array of ones
* `np.zeros()` - array of zeros
* `np.random.random()` - array of random values drawn uniformly from the interval [0.0, 1.0)

Let's look at some examples:

In [None]:
# Create an array of ones - pass in shape as a tuple.
np.ones((3,4))

In [None]:
# Create an array of zeros - pass in shape as a tuple.
np.zeros((2,3))

In [None]:
# Create an array of random values - pass in shape as a tuple.
np.random.random((3,3))
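
Because the values are random, the output of the cell above changes on every run. If repeatable results are needed, one option is to fix the seed first. The sketch below uses `np.random.seed()`; the seed value 0 is arbitrary.

```python
import numpy as np

# Fixing the seed makes the sequence of random numbers repeatable.
np.random.seed(0)
first = np.random.random((3, 3))

# Resetting the same seed reproduces exactly the same values.
np.random.seed(0)
second = np.random.random((3, 3))

print(np.array_equal(first, second))  # True
```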

### Accessing Numpy Arrays
Numpy offers several ways to index into arrays. Here, we'll work through slicing and Boolean array indexing.

#### Slicing
Similar to Python lists, Numpy arrays can be sliced. Since arrays may be multi-dimensional, you can specify a slice for each dimension, separating the slices with commas. For a 2-D array, the first dimension is the vertical axis (rows) while the second dimension is the horizontal axis (columns). For a 2-D array:

* `np.array[vertical index , horizontal index]` - for one element
* `np.array[vertical start:vertical end , horizontal start:horizontal end]` - for more than one element

Let's look at a few examples:

In [None]:
# Whole array.
ratings

In [None]:
# Select top left element.
ratings[0,0]

In [None]:
# Select first row.
ratings[0,:]

In [None]:
# Select first column.
ratings[:,0]

In [None]:
# Select first two rows and first two columns.
ratings[0:2,0:2]
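
A few further slicing patterns are sketched below on a separate array, `scores`, which holds the same values as `ratings` so that the original is not modified. The point worth noting is that a slice is a view onto the original data, so writing to a slice also changes the array it came from.

```python
import numpy as np

scores = np.array([[94, 89, 63], [93, 92, 48], [92, 94, 56]])

# Negative indices count from the end: this is the last row.
print(scores[-1, :])  # [92 94 56]

# A step of 2 selects every other column (the first and the third).
print(scores[:, ::2])

# A slice is a view: assigning to it also changes the original array.
top_left = scores[0:2, 0:2]
top_left[0, 0] = 0
print(scores[0, 0])  # 0

# Use copy() when the original must stay untouched.
safe = scores[0:2, 0:2].copy()
safe[0, 1] = -1
print(scores[0, 1])  # 89
```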

#### Boolean Array Indexing
Boolean array indexing lets you pick out a selection of elements from an array. This type of indexing is often used to select the elements of an array which satisfy a specific condition. The syntax is as follows:

* `np.array[condition]`

Let's look at an example:

In [None]:
# Select all values greater than 90.
ratings[ratings > 90]
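
To see what happens behind the condition, the sketch below first prints the Boolean mask itself and then uses it for indexing. It also shows how two conditions might be combined with `&`; note that each condition needs its own parentheses. The array is redefined here so the snippet runs on its own.

```python
import numpy as np

ratings = np.array([[94, 89, 63], [93, 92, 48], [92, 94, 56]])

# The condition produces an array of True/False values with the same shape.
mask = ratings > 90
print(mask)

# Indexing with the mask returns the matching values as a 1-D array.
print(ratings[mask])  # [94 93 92 92 94]

# Combine conditions with & (and) or | (or).
print(ratings[(ratings > 90) & (ratings < 94)])  # [93 92 92]
```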

### Modifying Numpy Arrays
We will now look at how to add elements to an array, followed by how to remove elements from an array.

#### Adding Elements
Adding elements can be done by using the `np.append()` function. This returns a new array with the elements added to the end; the original array is left unchanged. Let's look at an example:

In [None]:
# Append an extra row - note that axis=0.
ratings_extra_row = np.append(ratings, [[92, 88, 78]], axis=0)

ratings_extra_row
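
The role of the `axis` argument can be sketched as follows. With `axis=0` the new row is stacked underneath the existing rows; if `axis` is left out, both inputs are flattened into a 1-D array first. The array is redefined so the snippet runs on its own, and the result names are illustrative.

```python
import numpy as np

ratings = np.array([[94, 89, 63], [93, 92, 48], [92, 94, 56]])

# With axis=0 the new row is added below the existing rows.
extended = np.append(ratings, [[92, 88, 78]], axis=0)
print(extended.shape)  # (4, 3)

# The input array itself is not modified.
print(ratings.shape)  # (3, 3)

# Without axis, everything is flattened into one dimension.
flattened = np.append(ratings, [[92, 88, 78]])
print(flattened.shape)  # (12,)
```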

#### Removing Elements
Deleting elements can be done by using the `np.delete()` function. This returns a new array without the elements at the specified indices; the original array is left unchanged. Let's look at an example:

In [None]:
# Delete the 3rd row - note that axis=0.
ratings_del_row = np.delete(ratings, [2], axis=0)

ratings_del_row
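
The same function can remove columns, or several rows at once, depending on the indices and the `axis` passed in. A short sketch, with illustrative result names:

```python
import numpy as np

ratings = np.array([[94, 89, 63], [93, 92, 48], [92, 94, 56]])

# Delete the middle column (index 1) - note that axis=1.
without_middle_col = np.delete(ratings, 1, axis=1)
print(without_middle_col)

# Delete the first and last rows by passing a list of indices.
without_outer_rows = np.delete(ratings, [0, 2], axis=0)
print(without_outer_rows)  # [[93 92 48]]
```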

### Functions on Numpy Arrays
Numpy arrays come with many built-in functions. We will look at a few of them, namely:

* `np.array.sum()`
* `np.array.min()`
* `np.array.max()`

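Each of these functions accepts an optional `axis` argument, which lets us aggregate over columns and rows separately. For a 2-D array, `axis=0` aggregates down each column (one result per column), while `axis=1` aggregates across each row (one result per row). Let's look at examples: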

In [None]:
# Sum of all elements in array.
ratings.sum()

In [None]:
# Sum of each row.
ratings.sum(axis=1)

In [None]:
# Min of each row.
ratings.min(axis=1)

In [None]:
# Max of each column.
ratings.max(axis=0)
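
To make the effect of `axis` concrete, the sketch below prints the column sums next to the row sums for the same array. It also uses `mean()` and `argmax()`, two further array methods that are not covered above but accept `axis` in the same way.

```python
import numpy as np

ratings = np.array([[94, 89, 63], [93, 92, 48], [92, 94, 56]])

# axis=0: one result per column.
print(ratings.sum(axis=0))  # [279 275 167]

# axis=1: one result per row.
print(ratings.sum(axis=1))  # [246 233 242]

# Mean of all elements in the array.
print(ratings.mean())

# Position of the largest value in each row.
print(ratings.argmax(axis=1))  # [0 0 1]
```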

### When to use Numpy Arrays

A Numpy array is a data structure which stores multiple values, all of the same data type, and can be multi-dimensional. We should use a Numpy array if all of the following statements hold:

* We have multi-dimensional data
* All entries are of the same data type

Numpy arrays are also preferred over lists of lists because they are more efficient and offer more functionality.

## Pandas

In this section we will look at Pandas dataframes, which are among the most commonly used data structures for data analysis in Python. We will define what a Pandas dataframe is, and show how to create and access dataframes.

### What is a Pandas Dataframe?
Pandas is a data manipulation tool which is built on the Numpy package. Pandas' key data structure is the `dataframe`. A dataframe allows for the storage and manipulation of tabular data. It is a two-dimensional labelled data structure.


A Pandas dataframe consists of three main components: the data, the index (row labels), and the columns (column labels). Let's walk through some examples to gain an understanding of these components.

#### Creating a Dataframe
We can create a dataframe by calling the `DataFrame()` constructor. The main arguments in the constructor are the data, index and columns. The data can come from other data structures (lists, dictionaries or numpy arrays) or from a loaded file.

Let's start our examples by first importing the Pandas library:

In [None]:
import pandas as pd

#### Dataframes from Other Data Structures

We now look at some examples of how to create a dataframe from various data structures in Python.

Using **lists**, we need to create a list of lists (nested list) with the relevant data. We then also need to pass an index (player names in this example) and column names.

In [None]:
# Create list of lists containing data.
list_df = [[32, 'Portugal', 94], [30, 'Argentina', 93], [25, 'Brazil', 92]]

# Create index - names of players.
index = ['Cristiano Ronaldo', 'Lionel Messi', 'Neymar']

# Create column names.
columns = ['Age', 'Nationality', 'Overall']

# Create dataframe by passing in data, index and columns.
pd.DataFrame(data=list_df, index=index, columns=columns)
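
The three components of a dataframe can be inspected directly once it has been created. In the sketch below the dataframe is stored in a variable called `players` (a name chosen only for this illustration) and its index, columns, and data are printed separately.

```python
import pandas as pd

players = pd.DataFrame(
    data=[[32, 'Portugal', 94], [30, 'Argentina', 93], [25, 'Brazil', 92]],
    index=['Cristiano Ronaldo', 'Lionel Messi', 'Neymar'],
    columns=['Age', 'Nationality', 'Overall'],
)

# The index holds the row labels.
print(players.index.tolist())

# The columns hold the column labels.
print(players.columns.tolist())

# The data itself, as a numpy array.
print(players.to_numpy())

# Like a numpy array, a dataframe has a shape.
print(players.shape)  # (3, 3)
```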

Using **dictionaries**, we need to create a dictionary with the relevant data.

The keys should be the column names, while the values should be the data entries for that column. We then also need to pass an index. Note that because the keys account for the column names, we don't have to pass in an argument for columns.

In [None]:
# Create dictionary containing data.
dict_df = {'Age':[32, 30, 25], 'Nationality':['Portugal', 'Argentina', 'Brazil'], 'Overall':[94, 93, 92]}

# Create index - names of players.
index = ['Cristiano Ronaldo', 'Lionel Messi', 'Neymar']

# Create dataframe by passing in data and index - the dictionary keys become the column names.
pd.DataFrame(data=dict_df, index=index)

Using **numpy arrays**, we need to create a numpy array with the relevant data. We then also need to pass an index (player names) and column names.

In [None]:
# Create numpy array containing data.
# Note: a numpy array holds a single data type, so these mixed values are all stored as strings.
array_df = np.array([[32, 'Portugal', 94], [30, 'Argentina', 93], [25, 'Brazil', 92]])

# Create index - names of players.
index = ['Cristiano Ronaldo', 'Lionel Messi', 'Neymar']

# Create column names.
columns = ['Age', 'Nationality', 'Overall']

# Create dataframe by passing in data, index and columns.
pd.DataFrame(data=array_df, index=index, columns=columns)
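
One consequence of building a dataframe from a numpy array with mixed values is that the numbers arrive as text, because the array can only hold one data type. The sketch below shows one possible way to restore numeric columns with `astype()`; the variable name `players` and the choice of `int64` are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

array_df = np.array([[32, 'Portugal', 94], [30, 'Argentina', 93], [25, 'Brazil', 92]])

# 'U' means a Unicode string type: every entry was converted to text.
print(array_df.dtype.kind)

players = pd.DataFrame(
    data=array_df,
    index=['Cristiano Ronaldo', 'Lionel Messi', 'Neymar'],
    columns=['Age', 'Nationality', 'Overall'],
)

# Convert the numeric columns back to integers.
players = players.astype({'Age': 'int64', 'Overall': 'int64'})
print(players.dtypes)

# Numeric operations now work as expected.
print(players['Age'].mean())  # 29.0
```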

#### Dataframes from Files
A dataframe can also be created by loading a file. Here we will load a full csv file. Pandas makes this very easy: you load the data using the `read_csv()` function and pass in the full path of the file as a string.

Pandas will use the first row of the file as the column names, so we don't need to pass them in. We can also specify the index when we load the data by passing the name of our index column as a string to the `index_col` argument.

In [None]:
# Load the football player data, using the 'Name' column as the index.
# Replace the placeholder below with the full path of the csv file.
df = pd.read_csv('PATH_TO_CSV_FILE', index_col='Name')

# Use the head() function to look at the first 5 rows.
df.head()

We use the `head()` function to look at only the first 5 records of our data. This is helpful when the dataframe has many rows and displaying all of them would be impractical.
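
For readers who do not have the csv file to hand, the loading step can be tried on a small in-memory stand-in. The sketch below assumes a file with a `Name` column and reuses the three players from the earlier examples; `io.StringIO` makes the text behave like a file, and `sample` is an illustrative variable name.

```python
import io

import pandas as pd

# A tiny stand-in for a csv file; the first row supplies the column names.
csv_text = """Name,Age,Nationality,Overall
Cristiano Ronaldo,32,Portugal,94
Lionel Messi,30,Argentina,93
Neymar,25,Brazil,92
"""

# index_col='Name' turns the Name column into the index.
sample = pd.read_csv(io.StringIO(csv_text), index_col='Name')

# head() also accepts the number of rows to show.
print(sample.head(2))
print(sample.index.name)  # Name
```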

### Accessing Dataframes

Accessing data within dataframes works a little differently from the previous data structures. We can access data by index, by column, or by both. Let's work through these methods.

#### By Index
To access rows by index only, we can use the `iloc` or `loc` indexers with the indices in square brackets. `iloc` refers to the integer position of a row, so we pass in its position number, while `loc` refers to the label of a row, so we pass in the index name. Use slicing if you want more than one row. For example:

* `dataframe.iloc[index i]` - returns series at index i
* `dataframe.iloc[index start: index end]` - returns dataframe from start to end (end not included)
* `dataframe.loc['index name']` - returns series of given index name

Let's look at a few examples:

In [None]:
# Select the 5th row using iloc[].
df.iloc[4]

In [None]:
# Select rows 5 to 10.
df.iloc[4:10]

In [None]:
# Select the M. Neuer index using loc[].
df.loc['M. Neuer']
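
One difference between the two indexers is easy to miss: a slice with `iloc` excludes the end position, whereas a slice with `loc` includes the end label. The sketch below illustrates this on a small dataframe called `players`, built from the three players used earlier so that it runs without the csv file.

```python
import pandas as pd

players = pd.DataFrame(
    data=[[32, 'Portugal', 94], [30, 'Argentina', 93], [25, 'Brazil', 92]],
    index=['Cristiano Ronaldo', 'Lionel Messi', 'Neymar'],
    columns=['Age', 'Nationality', 'Overall'],
)

# iloc slice: position 1 is excluded, so only one row is returned.
print(players.iloc[0:1])

# loc slice: the end label is included, so two rows are returned.
print(players.loc['Cristiano Ronaldo':'Lionel Messi'])

# loc also accepts a list of labels, returned in the order given.
print(players.loc[['Neymar', 'Lionel Messi']])
```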

#### By Column
To access by column only we can simply call `dataframe['Column Name']`. If we want more than one column we input a list of column names inside the square brackets:

* `dataframe['Column Name']` - returns series of given column
* `dataframe[['Column 1', 'Column 2']]` - returns dataframe with the given columns

Let's look at examples.

In [None]:
# Select the column 'Age'.
df['Age']

In [None]:
# Select the columns 'Age' and 'Nationality'.
df[['Age', 'Nationality']]

#### By Index and Column
We can also select a subset of the dataframe using indices and columns in combination. Let's look at a few examples:

In [None]:
# Select the first 5 rows of the 'Age' and 'Nationality' columns - Rows first.
df.iloc[0:5][['Age', 'Nationality']]

In [None]:
# Select the first 5 rows of the 'Age' and 'Nationality' columns - Columns first.
df[['Age', 'Nationality']].iloc[0:5]
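
As an alternative to chaining two selections, `loc` and `iloc` also accept a row selector and a column selector together, separated by a comma. The sketch below shows this on the small `players` dataframe (illustrative data from the earlier examples), including a Boolean condition as the row selector.

```python
import pandas as pd

players = pd.DataFrame(
    data=[[32, 'Portugal', 94], [30, 'Argentina', 93], [25, 'Brazil', 92]],
    index=['Cristiano Ronaldo', 'Lionel Messi', 'Neymar'],
    columns=['Age', 'Nationality', 'Overall'],
)

# loc: row label first, then a list of column names.
print(players.loc['Neymar', ['Age', 'Overall']])

# iloc: row positions first, then column positions.
print(players.iloc[0:2, 0:2])

# A Boolean condition can act as the row selector.
print(players.loc[players['Age'] >= 30, ['Age', 'Nationality']])
```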

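Finally, let's transform some of the player names: we find occurrences of 'L. Suárez' and replace them with 'L. Suarez'. Replacing the 'á' makes it easier to look up the player from your keyboard.

In [None]:
# 'Name' was loaded as the index, so the replacement is applied to df.index.
# regex=False makes sure the '.' is treated as a literal full stop.
df.index = df.index.str.replace('L. Suárez', 'L. Suarez', regex=False)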

### When to use Dataframes
Unlike the Numpy array, which is suited to storing and performing computations on homogeneous data (data of the same type), Pandas dataframes can accommodate heterogeneous data, with each column holding its own data type. This makes them the data structure of choice for manipulating often messy data (e.g. tabular data from spreadsheets, or SQL tables).

We should use a Pandas dataframe if all of the following statements hold:

* We have 2-dimensional data (rows and columns)
* The data type is the same within a column
* We are interested in the index (rows) and column names

## Exercises

Here are some exercises for you to try out. The solutions can be found at the end of this train.

#### Exercise 1:
Select rows 9-14 of the football player dataframe we've just been using.

In [None]:
# Answer here

#### Exercise 2:
Select, in order, only the "Preferred Positions", "Overall" and "Age" columns for the football player Neymar.

In [None]:
# Answer here

#### Exercise 3:
Select, in order, only the "Overall", "Age" and "Nationality" columns for all players with ages of 35 years or older.

In [None]:
# Answer here

## Conclusion
In this train, you learned to perform basic operations using Numpy and Pandas, and how to load, manipulate and analyse data using these packages. Complete the exercises before moving on to make sure you are familiar with both packages.

## Appendix
- [Basic Numpy Functionality](https://numpy.org/doc/stable/user/absolute_beginners.html#what-is-an-array)

- [Numpy package home page](https://numpy.org)

- [Pandas package home page](https://pandas.pydata.org/)