# Intro to Numpy and Pandas

© Explore Data Science Academy

# Introduction

Python has many packages for working with data and performing operations such as loading, analysing and storing data. Numpy and Pandas are two of the most commonly used and most useful packages you will come across. In this train, you will learn some of the basic operations in Numpy and Pandas.

## Learning Objectives:
* Learn how to perform basic operations in Numpy
* Learn how to perform basic operations in Pandas
* Learn how to analyse data using Numpy and Pandas

## Numpy

Numpy is the core library for scientific computing in Python (see the full documentation which is [available online](https://docs.scipy.org/doc/numpy-1.13.0/reference/)). It provides high-performance multidimensional array objects, and tools for working with them. We will start by defining an array, then show how to create, access and modify an array, and look at useful functions on arrays. Finally we will identify when an array should be the chosen data structure.

## What is an Array?

An array is a data structure that stores values of the same data type. In Python, this is the main difference between arrays and lists: while Python lists can contain values of different data types, arrays can only contain values of the same data type. Core Python has no built-in multi-dimensional array type, so without additional libraries we would have to fall back on Python lists.

An array can be multi-dimensional: 

<img src="http://community.datacamp.com.s3.amazonaws.com/community/production/ckeditor_assets/pictures/332/content_arrays-axes.png">

Above you can see 1-D, 2-D and 3-D arrays. A 1-D array is essentially a sequence (one row or one column). In a 2-D array you have rows and columns. The rows run along "axis 0" and the columns along "axis 1". The number of axes equals the number of dimensions. A 1-D array has only a single axis (axis 0), so the distinction between rows and columns only becomes meaningful for arrays with at least 2 dimensions.

We can account for multi-dimensional arrays by using lists within lists.

However, numpy arrays are stored more efficiently than Python lists and allow mathematical operations to be vectorized, which results in significantly higher performance than with looping constructs in Python.
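
To illustrate what "vectorized" means here, the sketch below divides every rating by 100, first with a list of lists and then with a numpy array. The variable names are only illustrative, and the example makes no claim about exact timings.

```python
import numpy as np

# Sample data stored as a list of lists (same values as the ratings used in this tutorial)
ratings_list = [[94, 89, 63], [93, 92, 48], [92, 94, 56]]

# Pure-Python approach: loop over every row and every value
scaled_list = [[value / 100 for value in row] for row in ratings_list]

# Numpy approach: one vectorized expression applied to every element
ratings_array = np.array(ratings_list)
scaled_array = ratings_array / 100

print(scaled_list[0])
print(scaled_array[0])
```

Both approaches give the same numbers; the numpy version simply expresses the operation once for the whole array instead of looping in Python.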

## Creating a Numpy Array

If we want to work with any numpy objects or functions we first need to import the numpy library.

In [1]:
import numpy as np

To make a numpy array, you can just use the `np.array()` function. All you need to do is pass a list to it and optionally, you can also specify the data type of the data. Let's look at an example:

In [2]:
# Create array by passing in a list of lists
ratings = np.array([[94,89,63], [93,92,48], [92,94,56]])

ratings

array([[94, 89, 63],
       [93, 92, 48],
       [92, 94, 56]])

You can see that there are 3 rows and 3 columns in this array. We can inspect the shape of the array by using the `shape` attribute of our array:

In [3]:
ratings.shape

(3, 3)

As you can see, `shape` returns a tuple of integers giving the size of the array along each dimension.
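
The optional data type mentioned above can be passed through the `dtype` argument of `np.array()`. A minimal sketch, reusing the ratings values (the name `ratings_float` is only illustrative):

```python
import numpy as np

# Same values as the ratings example, but stored as floats via the dtype argument
ratings_float = np.array([[94, 89, 63], [93, 92, 48], [92, 94, 56]], dtype=float)

print(ratings_float.dtype)   # data type shared by every element
print(ratings_float.ndim)    # number of dimensions (axes)
print(ratings_float.shape)   # size along each dimension
```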

We can also create arrays using the following functions:

* `np.ones()` - array of ones
* `np.zeros()` - array of zeros
* `np.random.random()` - array with random values

Let's look at some examples:

In [4]:
# Create array of ones - pass in shape as tuple
np.ones((3,4))

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

In [5]:
# Create array of zeros - pass in shape as tuple
np.zeros((2,3))

array([[0., 0., 0.],
       [0., 0., 0.]])

In [6]:
# Create array of random values - pass in shape as tuple
np.random.random((3,3))

array([[0.18020631, 0.05812659, 0.6801626 ],
       [0.47128673, 0.5187448 , 0.86380974],
       [0.87123085, 0.66194948, 0.14180403]])

## Accessing Numpy Arrays

Numpy offers several ways to index into arrays.

### Slicing

Similar to Python lists, numpy arrays can be sliced. Since arrays may be multi-dimensional, you specify a slice for each dimension of the array, with the slices separated by commas. For a 2-D array the first dimension is the vertical axis (rows) while the second dimension is the horizontal axis (columns). For a 2-D array:

* `array[vertical index, horizontal index]` - for one element
* `array[vertical start:vertical end, horizontal start:horizontal end]` - for more than one element (the end index is not included)

Let's look at a few examples:

In [7]:
# Whole array
ratings

array([[94, 89, 63],
       [93, 92, 48],
       [92, 94, 56]])

In [8]:
# Select top left element
ratings[0,0]

94

In [9]:
# Select first row
ratings[0,:]

array([94, 89, 63])

In [10]:
# Select first column
ratings[:,0]

array([94, 93, 92])

In [11]:
# Select first two rows and first two columns
ratings[0:2,0:2]

array([[94, 89],
       [93, 92]])
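
Slices in numpy follow the same start:end:step rules as Python lists, so negative indices and omitted bounds also work. The following sketch shows a few more variations on the same `ratings` array; the variable names are illustrative only.

```python
import numpy as np

ratings = np.array([[94, 89, 63], [93, 92, 48], [92, 94, 56]])

last_row = ratings[-1, :]          # negative index counts from the end
last_two_cols = ratings[:, 1:]     # omitted end means "up to the last element"
every_other_row = ratings[::2, :]  # step of 2 selects rows 0 and 2

print(last_row)
print(last_two_cols)
print(every_other_row)
```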

### Boolean Array Indexing

Boolean array indexing lets you pick out arbitrary elements of an array. Frequently this type of indexing is used to select the elements of an array that satisfy some condition. The syntax is as follows:

* `array[condition]`

Let's look at an example:

In [12]:
# Select all values greater than 90
ratings[ratings > 90]

array([94, 93, 92, 92, 94])
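
It can help to look at the condition on its own: it evaluates to a boolean array of the same shape, which is then used to pick elements. The sketch below also combines two conditions with `&`; the names `mask` and `between` are illustrative.

```python
import numpy as np

ratings = np.array([[94, 89, 63], [93, 92, 48], [92, 94, 56]])

# The condition itself is a boolean array with the same shape as ratings
mask = ratings > 90
print(mask)

# Combine conditions element-wise with & (and) or | (or); the parentheses are required
between = ratings[(ratings > 60) & (ratings < 92)]
print(between)
```

Note that the result of boolean indexing is always a 1-D array, because the selected elements need not form a rectangular block.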

## Modifying Numpy Arrays

We will now look at how to add elements to an array followed by how to remove elements from an array.

### Adding Elements

Adding elements can be done with the `np.append()` function. It returns a new array with the elements added to the end and leaves the original array unchanged. Let's look at an example:

In [13]:
# Append an extra row - note that axis=0
ratings_extra_row = np.append(ratings, [[92, 88, 78]], axis=0)

ratings_extra_row

array([[94, 89, 63],
       [93, 92, 48],
       [92, 94, 56],
       [92, 88, 78]])
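
The same function can append a column by using `axis=1`. In this sketch the values in `extra_col` are made up for illustration; the point is that the new data must have one entry per existing row.

```python
import numpy as np

ratings = np.array([[94, 89, 63], [93, 92, 48], [92, 94, 56]])

# Append an extra column - the new values need shape (3, 1) to line up with axis=1
extra_col = np.array([[89], [90], [96]])   # hypothetical values
ratings_extra_col = np.append(ratings, extra_col, axis=1)

print(ratings_extra_col.shape)
print(ratings.shape)   # the original array is unchanged
```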

### Removing Elements

Deleting elements can be done with the `np.delete()` function. It returns a new array with the elements at the specified indices removed. Let's look at an example:

In [14]:
# Delete the 3rd row - note that axis=0
ratings_del_row = np.delete(ratings, [2], axis=0)

ratings_del_row

array([[94, 89, 63],
       [93, 92, 48]])
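
Columns can be removed in the same way by switching to `axis=1`. A short sketch (the name `ratings_del_col` is illustrative):

```python
import numpy as np

ratings = np.array([[94, 89, 63], [93, 92, 48], [92, 94, 56]])

# Delete the 2nd column (index 1) - note that axis=1
ratings_del_col = np.delete(ratings, 1, axis=1)

print(ratings_del_col)
print(ratings.shape)   # np.delete returns a new array; ratings keeps its shape
```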

In [15]:
ratings[0].mean()

82.0

## Functions on Numpy Arrays

There are various functions we can apply to numpy arrays. We will look at a few of the array methods, namely:

* `sum()`
* `min()`
* `max()`

All of these methods accept an `axis` argument, which lets us aggregate over the rows or the columns separately: `axis=0` returns one value per column, while `axis=1` returns one value per row. Let's look at examples:

In [16]:
# Sum of all elements in array
ratings.sum()

721

In [17]:
# Sum of each row
ratings.sum(axis=1)

array([246, 233, 242])

In [18]:
# Min of each row
ratings.min(axis=1)

array([63, 48, 56])

In [19]:
# Max of each column
ratings.max(axis=0)

array([94, 94, 63])
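
One way to remember the `axis` argument is that it names the dimension that gets collapsed. The sketch below shows this for column totals and row means on the same `ratings` array; the variable names are illustrative.

```python
import numpy as np

ratings = np.array([[94, 89, 63], [93, 92, 48], [92, 94, 56]])

col_sums = ratings.sum(axis=0)    # collapse the rows: one total per column
row_means = ratings.mean(axis=1)  # collapse the columns: one mean per row

print(col_sums)
print(row_means)
```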

In [20]:
list(ratings)

[array([94, 89, 63]), array([93, 92, 48]), array([92, 94, 56])]

## When to use Numpy Arrays

A numpy array is a grid of values, all of the same type, and is indexed by a tuple of non-negative integers. We should use a numpy array if all of the following statements hold:

* We have multi-dimensional data
* All entries are of the same data type

Numpy arrays are also preferred over lists of lists due to their efficiency and functionality.

# Pandas Dataframes

In this section we will look at pandas `DataFrames`, the data structure most commonly used for data analysis in Python. We will define what a pandas dataframe is, show how to create dataframes and show how to access dataframes.

## What is a Pandas Dataframe?

Pandas is a high-level data manipulation tool. It is built on the Numpy package and its key data structure is called the `DataFrame`. `DataFrames` allow you to store and manipulate tabular data in rows of observations and columns of variables. It is therefore a two-dimensional labeled data structure. 

<img src="https://www.w3resource.com/w3r_images/pandas-data-structure.svg">

Basically you could say that the `Pandas DataFrame` consists of three main components: the data, index, and columns. 

**Data**

The data can consist of:

* Pandas DataFrames
* Pandas Series - a one-dimensional labeled array capable of holding any data type with axis labels or index.
* Numpy arrays
* Dictionaries 
* Lists
* Files

**Index**

The index holds the row labels. On the graphic it is on axis 0.

**Columns**

The columns hold the column names. On the graphic they are on axis 1. All values within a single column share the same data type, but different columns may have different types.
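
The three components can be inspected directly on any dataframe. The sketch below builds a small dataframe (how to create one is covered in the next section) and prints each component; the name `players` and its contents are illustrative.

```python
import pandas as pd

# Small illustrative dataframe with three players
players = pd.DataFrame(
    data={'Age': [32, 30, 25], 'Nationality': ['Portugal', 'Argentina', 'Brazil']},
    index=['Cristiano Ronaldo', 'Lionel Messi', 'Neymar'],
)

print(players.index)       # row labels (axis 0)
print(players.columns)     # column labels (axis 1)
print(players.to_numpy())  # the underlying data as a numpy array
print(players.dtypes)      # one data type per column
```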

## Creating a Dataframe

We can create a dataframe by using the `DataFrame()` function. The main arguments in the function are the data, index and columns. The data that is passed in can be in the form of other data structures (lists, dictionaries or numpy arrays) or by loading in a file.

But first we need to import the pandas library:

In [21]:
import pandas as pd

### From other Data Structures

Using **lists**, we need to create a list of lists with the relevant data. We then also need to pass an index (player names) and column names.

In [22]:
# Create list of lists containing data
list_df = [[32, 'Portugal', 94], [30, 'Argentina', 93], [25 , 'Brazil', 92]]

# Create index - names of players
index = ['Christiano Ronaldo', 'Lionel Messi', 'Neymar']

# Create column names 
columns = ['Age', 'Nationality', 'Overall']

# Create dataframe by passing in data, index and columns
pd.DataFrame(data=list_df, index=index, columns=columns)

|  | Age | Nationality | Overall |
|---|---|---|---|
| Christiano Ronaldo | 32 | Portugal | 94 |
| Lionel Messi | 30 | Argentina | 93 |
| Neymar | 25 | Brazil | 92 |


Using **dictionaries**, we need to create a dictionary with the relevant data. The keys should be the column names while the values should be the data entries for that column. We then also need to pass an index (player names). Note that because the keys account for the column names we don't have to pass in an argument for columns.

In [23]:
# Create dictionary containing data
dict_df = {'Age':[32, 30, 25], 'Nationality':['Portugal', 'Argentina', 'Brazil'], 'Overall':[94, 93, 92]}

# Create index - names of players
index = ['Christiano Ronaldo', 'Lionel Messi', 'Neymar']

# Create dataframe by passing in data and index (the keys supply the column names)
pd.DataFrame(data=dict_df, index=index)

|  | Age | Nationality | Overall |
|---|---|---|---|
| Christiano Ronaldo | 32 | Portugal | 94 |
| Lionel Messi | 30 | Argentina | 93 |
| Neymar | 25 | Brazil | 92 |


Using **numpy arrays**, we need to create a numpy array with the relevant data. We then also need to pass an index (player names) and column names.

In [24]:
import numpy as np

# Create numpy array containing data
array_df = np.array([[32, 'Portugal', 94], [30, 'Argentina', 93], [25 , 'Brazil', 92]])

# Create index - names of players
index = ['Christiano Ronaldo', 'Lionel Messi', 'Neymar']

# Create column names 
columns = ['Age', 'Nationality', 'Overall']

# Create dataframe by passing in data, index and columns
pd.DataFrame(data=array_df, index=index, columns=columns)

|  | Age | Nationality | Overall |
|---|---|---|---|
| Christiano Ronaldo | 32 | Portugal | 94 |
| Lionel Messi | 30 | Argentina | 93 |
| Neymar | 25 | Brazil | 92 |
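
One thing to keep in mind with this approach: a numpy array holds a single data type, so mixing numbers and strings converts everything to strings. The sketch below shows the effect and one possible way to restore the numeric columns with `astype()`; the name `df_from_array` is illustrative.

```python
import numpy as np
import pandas as pd

array_df = np.array([[32, 'Portugal', 94], [30, 'Argentina', 93], [25, 'Brazil', 92]])
print(array_df.dtype)   # a string dtype: the integers were converted to text

df_from_array = pd.DataFrame(array_df, columns=['Age', 'Nationality', 'Overall'])

# Convert the numeric columns back to integers
df_from_array = df_from_array.astype({'Age': int, 'Overall': int})
print(df_from_array.dtypes)
```

Creating the dataframe from a list of lists or a dictionary avoids this conversion, because pandas infers a data type for each column separately.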


### Loading a File

The data can also come from a file. We will load the full csv file. Pandas makes this very easy: you simply load the data using the `read_csv()` function and pass in the name (or URL) of the file as a string.

Pandas will use the first row as columns so we don't need to pass column names. We can also specify the index when we load the data by passing our index column as a string under the index_col argument.

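Note: The example below reads the csv file directly from a URL. The file can also be found under the additional material for this tutorial; if you use that copy, unzip it, put it in the same directory as this notebook and pass the file name instead of the URL.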

In [25]:
# Load data - pass 'Name' as our index column
load_df = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/fundamentals/football_players.csv', index_col='Name')

# Create dataframe called df
df = pd.DataFrame(load_df)

# Use the head() function to look at the first 5 rows
df.head()


| Name | Age | Nationality | Overall | Acceleration | Aggression | Agility | Balance | Ball control | Composure | Crossing | ... | Short passing | Shot power | Sliding tackle | Sprint speed | Stamina | Standing tackle | Strength | Vision | Volleys | Preferred Positions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cristiano Ronaldo | 32 | Portugal | 94 | 89 | 63 | 89 | 63 | 93 | 95 | 85 | ... | 83 | 94 | 23 | 91 | 92 | 31 | 80 | 85 | 88 | ST LW |
| L. Messi | 30 | Argentina | 93 | 92 | 48 | 90 | 95 | 95 | 96 | 77 | ... | 88 | 85 | 26 | 87 | 73 | 28 | 59 | 90 | 85 | RW |
| Neymar | 25 | Brazil | 92 | 94 | 56 | 96 | 82 | 95 | 92 | 75 | ... | 81 | 80 | 33 | 90 | 78 | 24 | 53 | 80 | 83 | LW |
| L. Suárez | 30 | Uruguay | 92 | 88 | 78 | 86 | 60 | 91 | 83 | 77 | ... | 83 | 87 | 38 | 77 | 89 | 45 | 80 | 84 | 88 | ST |
| M. Neuer | 31 | Germany | 92 | 58 | 29 | 52 | 35 | 48 | 70 | 15 | ... | 55 | 25 | 11 | 61 | 44 | 10 | 83 | 70 | 11 | GK |


For the first time we have access to our whole Fifa 2018 player dataset. You may have noticed that we used the `head()` function to only look at the first 5 records of our data. This is helpful if the dataframe has many rows and displaying all of them would be slow and hard to read.
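
If you want to experiment with `read_csv()` and `index_col` without downloading anything, a small in-memory example works too. The sketch below uses `io.StringIO` as a stand-in for a file; the csv text and the name `sample_df` are illustrative and contain only a few rows.

```python
import io
import pandas as pd

# Illustrative csv content standing in for a file on disk
csv_text = """Name,Age,Nationality,Overall
Cristiano Ronaldo,32,Portugal,94
L. Messi,30,Argentina,93
Neymar,25,Brazil,92
"""

# read_csv accepts a path, a URL or a file-like object; the first row becomes the header
sample_df = pd.read_csv(io.StringIO(csv_text), index_col='Name')

print(sample_df.head(2))   # head(n) returns the first n rows (default is 5)
print(sample_df.shape)
```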

## Accessing Dataframes

Accessing dataframes is not as straightforward as with the previous data structures.

### By Index

To access rows by index we can use the `iloc` or `loc` indexers with the indices in square brackets. `iloc` refers to the integer position of the row, so we pass in the row number (starting at 0), while `loc` refers to the index label, so we pass in the index name. Use slicing if you want more than one row. E.g.:

* dataframe.iloc[index no] - returns series at index no
* dataframe.iloc[index no start: index no end] - returns dataframe from start no to end no (end not included)
* dataframe.loc['index name'] - returns series of given index name

Let's look at a few examples:

In [26]:
# Select the 5th row using iloc[]
df.iloc[4]

Age                         31
Nationality            Germany
Overall                     92
Acceleration            58    
Aggression              29    
Agility                 52    
Balance                 35    
Ball control            48    
Composure               70    
Crossing                15    
Curve                   14    
Dribbling               30    
Finishing               13    
Free kick accuracy      11    
GK diving               91    
GK handling             90    
GK kicking              95    
GK positioning          91    
GK reflexes             89    
Heading accuracy        25    
Interceptions           30    
Jumping                 78    
Long passing            59    
Long shots              16    
Marking                 10    
Penalties               47    
Positioning             12    
Reactions               85    
Short passing           55    
Shot power              25    
Sliding tackle          11    
Sprint speed            61    
Stamina 

In [27]:
# Select rows 5 to 10
df.iloc[4:10]

| Name | Age | Nationality | Overall | Acceleration | Aggression | Agility | Balance | Ball control | Composure | Crossing | ... | Short passing | Shot power | Sliding tackle | Sprint speed | Stamina | Standing tackle | Strength | Vision | Volleys | Preferred Positions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| M. Neuer | 31 | Germany | 92 | 58 | 29 | 52 | 35 | 48 | 70 | 15 | ... | 55 | 25 | 11 | 61 | 44 | 10 | 83 | 70 | 11 | GK |
| R. Lewandowski | 28 | Poland | 91 | 79 | 80 | 78 | 80 | 89 | 87 | 62 | ... | 83 | 88 | 19 | 83 | 79 | 42 | 84 | 78 | 87 | ST |
| De Gea | 26 | Spain | 90 | 57 | 38 | 60 | 43 | 42 | 64 | 17 | ... | 50 | 31 | 13 | 58 | 40 | 21 | 64 | 68 | 13 | GK |
| E. Hazard | 26 | Belgium | 90 | 93 | 54 | 93 | 91 | 92 | 87 | 80 | ... | 86 | 79 | 22 | 87 | 79 | 27 | 65 | 86 | 79 | LW |
| T. Kroos | 27 | Germany | 90 | 60 | 60 | 71 | 69 | 89 | 85 | 85 | ... | 90 | 87 | 69 | 52 | 77 | 82 | 74 | 88 | 82 | CDM CM |
| G. Higuaín | 29 | Argentina | 90 | 78 | 50 | 75 | 69 | 85 | 86 | 68 | ... | 75 | 88 | 18 | 80 | 72 | 22 | 85 | 70 | 88 | ST |


In [28]:
# Select the M. Neuer index using loc[]
df.loc['M. Neuer']

Age                         31
Nationality            Germany
Overall                     92
Acceleration            58    
Aggression              29    
Agility                 52    
Balance                 35    
Ball control            48    
Composure               70    
Crossing                15    
Curve                   14    
Dribbling               30    
Finishing               13    
Free kick accuracy      11    
GK diving               91    
GK handling             90    
GK kicking              95    
GK positioning          91    
GK reflexes             89    
Heading accuracy        25    
Interceptions           30    
Jumping                 78    
Long passing            59    
Long shots              16    
Marking                 10    
Penalties               47    
Positioning             12    
Reactions               85    
Short passing           55    
Shot power              25    
Sliding tackle          11    
Sprint speed            61    
Stamina 
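
The difference between position-based and label-based access is easiest to see with slices: `iloc` excludes the end position, while a `loc` label slice includes the end label. A minimal sketch on a small illustrative dataframe (`sample_df` is not the full dataset):

```python
import pandas as pd

sample_df = pd.DataFrame(
    {'Age': [32, 30, 25, 30], 'Overall': [94, 93, 92, 92]},
    index=['Cristiano Ronaldo', 'L. Messi', 'Neymar', 'L. Suárez'],
)

by_position = sample_df.iloc[1:3]                 # positions 1 and 2; end excluded
by_label = sample_df.loc['L. Messi':'L. Suárez']  # label slice; end included

print(by_position)
print(by_label)
```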

### By Column

To access by column only, we simply use `dataframe['Column Name']`. If we want more than one column, we pass a list of column names inside the square brackets. E.g.:

* dataframe['Column Name'] - returns series of given column
* dataframe[['Column 1', 'Column 2']] - returns dataframe with the given columns

Let's look at examples:

In [29]:
# Select the column 'Age'
df['Age']

Name
Cristiano Ronaldo    32
L. Messi             30
Neymar               25
L. Suárez            30
M. Neuer             31
                     ..
A. Kelsey            17
B. Richardson        47
J. Young             17
J. Lundstram         18
L. Sackey            18
Name: Age, Length: 17981, dtype: int64

In [30]:
# Select the columns 'Age' and 'Nationality'
df[['Age', 'Nationality']]

| Name | Age | Nationality |
|---|---|---|
| Cristiano Ronaldo | 32 | Portugal |
| L. Messi | 30 | Argentina |
| Neymar | 25 | Brazil |
| L. Suárez | 30 | Uruguay |
| M. Neuer | 31 | Germany |
| ... | ... | ... |
| A. Kelsey | 17 | England |
| B. Richardson | 47 | England |
| J. Young | 17 | Scotland |
| J. Lundstram | 18 | England |


### By Index and Column

We can also select a subset of the dataframe using indices and columns in combination. Let's look at a few examples:

In [31]:
# Select the first 5 rows and first 2 columns - Rows first
df.iloc[0:5][['Age', 'Nationality']]

| Name | Age | Nationality |
|---|---|---|
| Cristiano Ronaldo | 32 | Portugal |
| L. Messi | 30 | Argentina |
| Neymar | 25 | Brazil |
| L. Suárez | 30 | Uruguay |
| M. Neuer | 31 | Germany |


In [32]:
# Select the first 5 rows and first 2 columns - Columns first
df[['Age', 'Nationality']].iloc[0:5]

| Name | Age | Nationality |
|---|---|---|
| Cristiano Ronaldo | 32 | Portugal |
| L. Messi | 30 | Argentina |
| Neymar | 25 | Brazil |
| L. Suárez | 30 | Uruguay |
| M. Neuer | 31 | Germany |
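
Both `iloc` and `loc` also accept a row selection and a column selection in a single pair of square brackets, separated by a comma. The sketch below shows this as an alternative to chaining two selections; it uses a small illustrative dataframe rather than the full dataset.

```python
import pandas as pd

sample_df = pd.DataFrame(
    {'Age': [32, 30, 25], 'Nationality': ['Portugal', 'Argentina', 'Brazil'], 'Overall': [94, 93, 92]},
    index=['Cristiano Ronaldo', 'L. Messi', 'Neymar'],
)

# Rows and columns by position in a single call
subset_iloc = sample_df.iloc[0:2, 0:2]

# Rows and columns by label in a single call
subset_loc = sample_df.loc[['Cristiano Ronaldo', 'L. Messi'], ['Age', 'Nationality']]

print(subset_iloc)
print(subset_loc)
```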


## When to use Dataframes

Unlike the numpy array, which is suited for storing and performing computations on homogeneous data (data of the same type), pandas dataframes can accommodate heterogeneous data. This makes them the data structure of choice for manipulating often messy statistical data (e.g. tabular data from spreadsheets or SQL tables).

We should use a pandas dataframe if all of the following statements hold:

* We have 2-dimensional data (rows and columns)
* The data type is the same within a column
* We are interested in the index (rows) and column names



## Exercises

Here are some exercises for you to try out. The solutions can be found at the end of this train.

Exercise 1:

Select rows 9-14 of the dataframe.

In [33]:
# Answer here

Exercise 2:

Select, in order, only the "Preferred Positions", "Overall" and "Age" columns for Neymar.

In [34]:
# Answer here

Exercise 3:

Select, in order, only the "Overall", "Age" and "Nationality" columns for all players with ages of 35 years or older.

In [35]:
# Answer here

# Conclusion
In this tutorial, you learned how to do the following:
* Basic operations in Numpy
* Basic operations in Pandas
* Load, manipulate and analyse data using Numpy and Pandas

That is the end of this tutorial. You should have a better understanding of what an array is and how to work with numpy arrays as well as what a pandas dataframe is and how to create and access it. In the future, we will look at useful functions of dataframes and when to use them as the chosen data structure.

## Solutions

In [36]:
# Exercise 1 solution

df.iloc[8:14]

| Name | Age | Nationality | Overall | Acceleration | Aggression | Agility | Balance | Ball control | Composure | Crossing | ... | Short passing | Shot power | Sliding tackle | Sprint speed | Stamina | Standing tackle | Strength | Vision | Volleys | Preferred Positions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| T. Kroos | 27 | Germany | 90 | 60 | 60 | 71 | 69 | 89 | 85 | 85 | ... | 90 | 87 | 69 | 52 | 77 | 82 | 74 | 88 | 82 | CDM CM |
| G. Higuaín | 29 | Argentina | 90 | 78 | 50 | 75 | 69 | 85 | 86 | 68 | ... | 75 | 88 | 18 | 80 | 72 | 22 | 85 | 70 | 88 | ST |
| Sergio Ramos | 31 | Spain | 90 | 75 | 84 | 79 | 60 | 84 | 80 | 66 | ... | 78 | 79 | 91 | 77 | 84 | 89 | 81 | 63 | 66 | CB |
| K. De Bruyne | 26 | Belgium | 89 | 76 | 68 | 80 | 75 | 87 | 84 | 90 | ... | 90 | 85 | 40 | 75 | 87 | 51 | 73 | 90 | 82 | RM CM CAM |
| T. Courtois | 25 | Belgium | 89 | 46 | 23 | 61 | 45 | 23 | 52 | 14 | ... | 32 | 36 | 16 | 52 | 38 | 18 | 70 | 44 | 12 | GK |
| A. Sánchez | 28 | Chile | 89 | 88 | 80 | 90 | 87 | 87 | 86 | 80 | ... | 81 | 84 | 35 | 84 | 85 | 39 | 72 | 83 | 83 | RM LW ST LM |


In [37]:
# Exercise 2 solution

df.iloc[2][['Preferred Positions', 'Overall', 'Age']]

Preferred Positions    LW 
Overall                 92
Age                     25
Name: Neymar, dtype: object

In [38]:
# Exercise 3 solution

df[['Overall', 'Age', 'Nationality']][df['Age'] >= 35]

| Name | Overall | Age | Nationality |
|---|---|---|---|
| G. Buffon | 89 | 39 | Italy |
| Z. Ibrahimović | 88 | 35 | Sweden |
| P. Čech | 86 | 35 | Czech Republic |
| A. Barzagli | 85 | 36 | Italy |
| Aduriz | 84 | 36 | Spain |
| ... | ... | ... | ... |
| R. Kawai | 52 | 38 | Japan |
| Y. Nishibe | 52 | 36 | Japan |
| A. Al Basisi | 52 | 39 | Saudi Arabia |
| P. Jennings | 52 | 37 | England |
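
The solutions above are not the only way to write these selections. As a possible alternative, `loc` can take the row selection (a label or a boolean condition) and the column list in one call. The sketch below demonstrates the idea on a small illustrative dataframe with three players taken from the outputs above, not on the full dataset.

```python
import pandas as pd

sample_df = pd.DataFrame(
    {'Age': [39, 25, 35], 'Nationality': ['Italy', 'Brazil', 'Sweden'], 'Overall': [89, 92, 88]},
    index=['G. Buffon', 'Neymar', 'Z. Ibrahimović'],
)

# Exercise 2 style: one row by label, columns in the requested order
neymar = sample_df.loc['Neymar', ['Overall', 'Age']]

# Exercise 3 style: boolean condition for the rows, list of labels for the columns
veterans = sample_df.loc[sample_df['Age'] >= 35, ['Overall', 'Age', 'Nationality']]

print(neymar)
print(veterans)
```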


# Appendix
Additional links:

[Numpy package home page](https://numpy.org/)

[Pandas package home page](https://pandas.pydata.org/)

