#   An Introduction to NumPy  ![title](numpy_logo.png)

***

All exercises here are based on Python courses on www.datacamp.com unless stated otherwise.

NumPy adds support for large, multi-dimensional arrays and matrices, and it includes high-level math functions to use with these arrays.

NumPy has functionality comparable to that of MATLAB. Both are interpreted, and both allow for fast program writing as long as most operations work on arrays or matrices rather than on scalars.

MATLAB has a large number of toolboxes, Simulink in particular. In contrast, NumPy is tightly integrated with Python, which is a more complete and more modern general-purpose programming language. Other Python packages add complementary functionality to NumPy that makes Python even more MATLAB-like. SciPy and Matplotlib are good examples of this.

The NumPy package is included with Anaconda.

I like football and found several datasets about it. I chose the EA FIFA 2017 players dataset. You can download it at: https://www.kaggle.com/artimous/complete-fifa-2017-player-dataset-global#FullData.csv

In [18]:
# We'll also use the pandas package, mainly to import files in these exercises; we'll talk about it at a later time
import pandas as pd
df = pd.read_csv("data/FullData2.csv",sep=",")
df

Unnamed: 0,Name,Nationality,National_Position,National_Kit,Club,Club_Position,Club_Kit,Club_Joining,Contract_Expiry,Rating,...,Long_Shots,Curve,Freekick_Accuracy,Penalties,Volleys,GK_Positioning,GK_Diving,GK_Kicking,GK_Handling,GK_Reflexes
0,Cristiano Ronaldo,Portugal,LS,7.0,Real Madrid,LW,7.0,07/01/2009,2021.0,94,...,90,81,76,85,88,14,7,15,11,11
1,Lionel Messi,Argentina,RW,10.0,FC Barcelona,RW,10.0,07/01/2004,2018.0,93,...,88,89,90,74,85,14,6,15,11,8
2,Neymar,Brazil,LW,10.0,FC Barcelona,LW,11.0,07/01/2013,2021.0,92,...,77,79,84,81,83,15,9,15,9,11
3,Luis Suárez,Uruguay,LS,9.0,FC Barcelona,ST,9.0,07/11/2014,2021.0,92,...,86,86,84,85,88,33,27,31,25,37
4,Manuel Neuer,Germany,GK,1.0,FC Bayern,GK,1.0,07/01/2011,2021.0,92,...,16,14,11,47,11,91,89,95,90,89
5,De Gea,Spain,GK,1.0,Manchester Utd,GK,1.0,07/01/2011,2019.0,90,...,12,21,19,40,13,86,88,87,85,90
6,Robert Lewandowski,Poland,LS,9.0,FC Bayern,ST,9.0,07/01/2014,2021.0,90,...,82,77,76,81,86,8,15,12,6,10
7,Gareth Bale,Wales,RS,11.0,Real Madrid,RW,11.0,09/02/2013,2022.0,90,...,90,86,85,76,76,5,15,11,15,6
8,Zlatan Ibrahimović,Sweden,,,Manchester Utd,ST,9.0,07/01/2016,2017.0,90,...,88,82,82,91,93,9,13,10,15,12
9,Thibaut Courtois,Belgium,GK,1.0,Chelsea,GK,13.0,07/26/2011,2019.0,89,...,17,19,11,27,12,86,84,69,91,89


That's not a convenient way to look at the data. I just want to see the first few lines.

In [46]:
print(df.head())

                Name Nationality National_Position  National_Kit  \
0  Cristiano Ronaldo    Portugal                LS           7.0   
1       Lionel Messi   Argentina                RW          10.0   
2             Neymar      Brazil                LW          10.0   
3        Luis Suárez     Uruguay                LS           9.0   
4       Manuel Neuer     Germany                GK           1.0   

           Club Club_Position  Club_Kit Club_Joining  Contract_Expiry  Rating  \
0   Real Madrid            LW       7.0   07/01/2009           2021.0      94   
1  FC Barcelona            RW      10.0   07/01/2004           2018.0      93   
2  FC Barcelona            LW      11.0   07/01/2013           2021.0      92   
3  FC Barcelona            ST       9.0   07/11/2014           2021.0      92   
4     FC Bayern            GK       1.0   07/01/2011           2021.0      92   

      ...      Long_Shots Curve Freekick_Accuracy Penalties  Volleys  \
0     ...              90    81 

To make things easier, for the time being I filtered out Height and Weight and saved them as separate files. We're going to import those instead. So now we import NumPy as np to save some writing.

In [9]:
import numpy as np
Height = pd.read_csv("data/Height.csv",sep=",")
np_H = np.array(Height)
np_H

array([[185],
       [170],
       [174],
       ...,
       [173],
       [180],
       [185]], dtype=int64)

Note: once we have entered data, defined a variable, or imported a package in our notebook, we don't need to do that again.

In [10]:
Weight = pd.read_csv("data/Weight.csv",sep=",")
np_W = np.array(Weight)
np_W

array([[80],
       [72],
       [68],
       ...,
       [61],
       [80],
       [77]], dtype=int64)

I want to know the players' BMI. Let's see what happens if we do it without NumPy, using BMI = W/(H^2), with W in kilograms and H in meters.

In [11]:
Weight = pd.read_csv("data/Weight.csv",sep=",")
Height = pd.read_csv("data/Height.csv",sep=",")
bmi = Weight / (Height / 100) ** 2  # heights are in cm, so convert to meters first
print(bmi.head())

   Height  Weight
0     NaN     NaN
1     NaN     NaN
2     NaN     NaN
3     NaN     NaN
4     NaN     NaN
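
Why all the NaN? The output has both a Height and a Weight column, which suggests that pandas matched the two tables by column label rather than by position. Here is a minimal sketch of that behavior. The two small DataFrames are hypothetical stand-ins for the CSV files, filled with the first three rows shown earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for Weight.csv and Height.csv (first three players)
weight = pd.DataFrame({"Weight": [80, 72, 68]})
height = pd.DataFrame({"Height": [185, 170, 174]})

# pandas aligns on column labels; "Weight" and "Height" never match,
# so every cell of the result is NaN
misaligned = weight / (height / 100) ** 2
print(misaligned)

# Dropping the labels leaves plain arrays that are divided position by position
bmi = weight.to_numpy() / (height.to_numpy() / 100) ** 2
print(bmi.round(2).ravel())
```

So the problem is the mismatched labels, not the math. Once the values are plain arrays, the same expression works.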


NumPy to the rescue

In [19]:
Weight = pd.read_csv("data/Weight.csv",sep=",")
#We now create a NumPy array
np_W = np.array(Weight)
Height = pd.read_csv("data/Height.csv",sep=",")
np_H = np.array(Height)
bmi = np_W / (np_H/100) ** 2
bmi

array([[23.37472608],
       [24.91349481],
       [22.46003435],
       ...,
       [20.38156971],
       [24.69135802],
       [22.49817385]])

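This works because a NumPy array carries no row or column labels, so arithmetic is applied element by element, by position. NumPy also assumes that all values in an array are of the same type (floats, booleans, etc.).
If you build an array from mixed types: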

In [14]:
array = np.array([10, "Pele", True])
array

array(['10', 'Pele', 'True'], dtype='<U11')

Everything was converted to strings.

A NumPy array is another kind of Python type, like floats and booleans, so it comes with its own methods, and operators can behave differently on it than they do on lists.

In [15]:
normal_list = [1, 2, 3]
numpy_array = np.array([1, 2, 3])
#without NumPy
nl=normal_list + normal_list
print(nl)
#with NumPy
na=numpy_array + numpy_array
print(na)

[1, 2, 3, 1, 2, 3]
[2 4 6]
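
The same difference shows up with other operators. A small sketch with the same two variables, comparing `*` on a list and on an array:

```python
import numpy as np

normal_list = [1, 2, 3]
numpy_array = np.array([1, 2, 3])

# "*" repeats a list, but multiplies every element of an array
print(normal_list * 2)   # [1, 2, 3, 1, 2, 3]
print(numpy_array * 2)   # [2 4 6]

# Element-wise math on a list needs an explicit loop or comprehension
doubled_list = [x * 2 for x in normal_list]
print(doubled_list)      # [2, 4, 6]
```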


## NumPy Subsetting

If you want to get an element from your array, you can use square brackets. (Remember, numbering starts at 0, not at 1.)

Specifically for NumPy, you can also use an array of booleans to do subsetting. For example, to get all BMI values from our previous exercise that are above 22.5, you could start with:

In [16]:
Weight = pd.read_csv("data/Weight.csv",sep=",")
np_W = np.array(Weight)
Height = pd.read_csv("data/Height.csv",sep=",")
np_H = np.array(Height)
bmi = np_W / (np_H/100) ** 2
bmi > 22.5

array([[ True],
       [ True],
       [False],
       ...,
       [False],
       [ True],
       [False]])

You can then use this boolean array inside square brackets to subset. Only those elements that are above 22.5 will be selected.

In [13]:
bmi[bmi>22.5]

array([23.37472608, 24.91349481, 25.66115203, ..., 24.48565201,
       23.37472608, 24.69135802])

This simple method is a great way to pre-process or get insights from your data!
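
Boolean arrays can also be combined and counted. A quick sketch, using made-up BMI values rather than the dataset:

```python
import numpy as np

# Made-up BMI values for illustration (not taken from the dataset)
bmi = np.array([23.4, 24.9, 22.5, 20.4, 24.7, 18.9])

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
in_range = (bmi > 21) & (bmi < 24.8)
print(bmi[in_range])    # [23.4 22.5 24.7]

# True counts as 1, so summing a mask counts the matches
print(in_range.sum())   # 3
```

Python's `and` / `or` keywords don't work element by element on arrays, which is why `&` and `|` are used here.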

# Operations with Arrays

We are now going to create a list of heights in centimeters, put that list in a NumPy array, and print its type.

In [17]:
Heights_in_Cm = [180, 210, 175, 165, 180, 182, 205, 190]
np_Heights_in_Cm = np.array(Heights_in_Cm)
print(type(np_Heights_in_Cm))

<class 'numpy.ndarray'>


After a quick search on the internet, your TA found the Complete FIFA 2017 Player dataset: https://www.kaggle.com/artimous/complete-fifa-2017-player-dataset-global#FullData.csv
For class purposes we're only going to deal with players' heights at the moment. We'll use pandas to read that file, but we'll come back to that in the future, so don't pay much attention to the pandas line for now.

1) We'll create a NumPy array with the Height data
2) We'll print that out
3) We'll convert that to inches, because... inches make so much sense (said no one ever), and we're doing it for (data) science
4) We'll print that again

In [31]:
Height = pd.read_csv("data/Height.csv",sep=",")
np_height = np.array(Height)
print(np_height)

[[185]
 [170]
 [174]
 ...
 [173]
 [180]
 [185]]


Let's convert it.

In [26]:
np_height_in = 0.393701*np_height
print(np_height_in)

[[72.834685]
 [66.92917 ]
 [68.503974]
 ...
 [68.110273]
 [70.86618 ]
 [72.834685]]
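
If you'd like to double check a conversion like this one, here is a small sketch. It uses the exact definition of 2.54 cm per inch instead of the rounded factor above, and the three heights are the first ones in our data. The name `CM_PER_INCH` is just something I made up for the example.

```python
import numpy as np

CM_PER_INCH = 2.54  # exact by definition

heights_cm = np.array([185, 170, 174])
heights_in = heights_cm / CM_PER_INCH

# np.round returns a tidier copy for display; heights_in itself is unchanged
print(np.round(heights_in, 1))

# Converting back should return the original values (up to floating-point error)
print(np.allclose(heights_in * CM_PER_INCH, heights_cm))
```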


Just to check, what kind of data is this?

In [20]:
print(type(np_height))

<class 'numpy.ndarray'>


So it's not a boolean, string, or float... it's a NumPy array.

From our intro, let's calculate BMI again.

In [5]:
Weight = pd.read_csv("data/Weight.csv",sep=",")
np_weight = np.array(Weight)
bmi = np_weight / (np_height/100) ** 2
bmi

array([[23.37472608],
       [24.91349481],
       [22.46003435],
       ...,
       [20.38156971],
       [24.69135802],
       [22.49817385]])

So let's find the skinny players, say those with a BMI under 21

In [6]:
light = bmi < 21
print(light)
print(bmi[light])

[[False]
 [False]
 [False]
 ...
 [ True]
 [False]
 [False]]
[20.75638717 20.97117202 20.9839876  ... 20.74755019 20.91070816
 20.38156971]


## Coercion

As mentioned earlier, NumPy assumes all elements within an array are of the same type, so when it finds one that is not, it coerces (forces) the values into a common type. So remember: Python lists and NumPy arrays can behave in different ways.

In [32]:
a = np.array([1, 2, True])
print(a)
b = np.array([1, 2, False])
print(b)

[1 2 1]
[1 2 0]
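
To see which type NumPy settled on, you can look at the array's `dtype`. A short sketch with a few mixes (the `kind` code is a one-letter summary of the dtype):

```python
import numpy as np

print(np.array([1, 2, True]).dtype.kind)     # "i": the bool became an integer
print(np.array([1, 2, 3.5]).dtype.kind)      # "f": the ints became floats
print(np.array([1, 2.5, "3"]).dtype.kind)    # "U": everything became text
print(np.array([True, False]).dtype.kind)    # "b": booleans only, so they stay booleans
```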


### Subsetting

You can subset within NumPy the same way you would in "normal" Python, by using square brackets. The first element is at index 0, and the end of a slice is non-inclusive. Let's use the weights again and print out what's at index 10, then print index 9 up to 15 inclusive.

In [35]:
print(np_weight[10])
print("And index 9 to 15")
print(np_weight[9:16])

[90]
And index 9 to 15
[[91]
 [90]
 [74]
 [65]
 [76]
 [92]
 [79]]
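
A few more slicing patterns, sketched on a small made-up array of weights so the results are easy to check by eye:

```python
import numpy as np

weights = np.array([80, 72, 68, 85, 92, 82, 79, 74])

print(weights[0])      # first element: 80
print(weights[-1])     # last element: 74
print(weights[2:5])    # indexes 2, 3 and 4: [68 85 92]
print(weights[:3])     # first three: [80 72 68]
print(weights[::2])    # every second element: [80 68 92 79]
```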


# 2D NumPy Arrays

So far we've worked with the arrays "np_height" and "np_weight", which each held a single column of values, but it's perfectly possible to have 2D, 3D, and higher-dimensional arrays. We'll now work with 2D arrays.

Let's make up a small list of lists and turn it into a NumPy array.

In [42]:
Futbol = [[175, 77],[212, 103], [205, 99], [188, 79]]
np_Futbol = np.array(Futbol)
np_Futbol

array([[175,  77],
       [212, 103],
       [205,  99],
       [188,  79]])

np_Futbol is a rectangular data structure: each sublist corresponds to a row in the two-dimensional NumPy array. To make sure, we can check the shape attribute of np_Futbol, which describes how the data structure is laid out.

In [39]:
np_Futbol.shape

(4, 2)

We can see that we have 4 rows and 2 columns
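
The shape also tells a single column stored as a 2D array apart from a truly one-dimensional array. A small sketch, assuming a column like the ones we got from the one-column CSV files:

```python
import numpy as np

# Shaped like the arrays built from a one-column DataFrame
column = np.array([[185], [170], [174]])
flat = column.ravel()              # same values as a 1D array

print(column.shape, column.ndim)   # (3, 1) 2
print(flat.shape, flat.ndim)       # (3,) 1
print(flat.reshape(3, 1).shape)    # (3, 1)
```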

As mentioned before, an array can only contain a single type. If we change one number to a string, all the others will be coerced to strings so that the array still has a single type.

In [40]:
Futbol = [[175, 77],[212, 103], [205, "99"], [188, 79]]
np_Futbol = np.array(Futbol)
np_Futbol

array([['175', '77'],
       ['212', '103'],
       ['205', '99'],
       ['188', '79']], dtype='<U11')
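
If numbers end up as text like this, one way back is `astype`. A minimal sketch, assuming every string really does contain a whole number:

```python
import numpy as np

as_text = np.array([[175, 77], [205, "99"]])
print(as_text.dtype.kind)      # "U" means Unicode text

# astype returns a converted copy; it raises an error if a string is not a number
as_numbers = as_text.astype(int)
print(as_numbers.dtype.kind)   # "i" means signed integer
print(as_numbers.sum())        # 556
```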

We conveniently have the weights and heights of our football heroes in one nice table. Again, ignore the pandas line; we're going to create a 2D NumPy array and print out its shape.

In [43]:
Football = pd.read_csv("data/FullData5.csv",sep=",")
np_Football = np.array(Football)
np_Football.shape

(17588, 3)

Wow, that's a lot of rows... and just 3 columns.

In [44]:
print(Football.head())

                Name  Height  Weight
0  Cristiano Ronaldo     185      80
1       Lionel Messi     170      72
2             Neymar     174      68
3        Luis Suárez     182      85
4       Manuel Neuer     193      92


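Subsetting is one of those areas where NumPy really shines. With a normal Python list of lists, subsetting can get complicated, but in NumPy it's quite intuitive: you give a row index and a column index separated by a comma, and a ":" on its own selects everything along that axis. Keep in mind that a whole row mixes the player's name (a string) with numbers, so arithmetic on a whole row would not work as it does on purely numeric data. Let's print the 50th row of our np_Football array: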

In [45]:
print(np_Football[49, :])

['Sergio Busquets' 189 76]


He plays with Spain, they got a Russian surprise... haha Spasiba.

Anyway, let's continue subsetting and select the entire third column.

In [46]:
np_weight = np_Football[:,2]
print(np_weight)

[80 72 68 ... 61 80 77]


Now, we'll find out how tall our 100th player is

In [47]:
print(np_Football[99,1])

181


Let's combine what we've learned. We're going to subset our data so we can perform an arithmetic operation: converting all measurements to the imperial system, just because we can. We have to subset the 2nd and 3rd columns, because the 1st holds strings.

In [48]:
HW=np_Football[:,1:]
conversion = np.array([0.393701, 2.20462])
print(HW*conversion)

[[72.83468500000001 176.3696]
 [66.92917 158.73263999999998]
 [68.503974 149.91415999999998]
 ...
 [68.110273 134.48182]
 [70.86618 176.3696]
 [72.83468500000001 169.75573999999997]]
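
What happened here is called broadcasting: the two-element conversion array was applied to every row. A small sketch with the first three players, using a float array so the output is tidier:

```python
import numpy as np

# Height (cm) and weight (kg) for the first three players shown earlier
hw = np.array([[185, 80], [170, 72], [174, 68]], dtype=float)
conversion = np.array([0.393701, 2.20462])   # cm -> in, kg -> lb

print(hw.shape, conversion.shape)   # (3, 2) (2,)

# The (2,) array is stretched across all rows:
# column 0 is multiplied by 0.393701, column 1 by 2.20462
imperial = hw * conversion
print(imperial.shape)               # (3, 2)
print(np.round(imperial, 1))
```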


Ta-daaaa

***

![title](img/stats-are-coming.jpeg)

don't worry... they're very simple

# Basic Stats

Looking at these basic stats from the very beginning is good practice. If you find that your players are 3 meters tall on average... something is wrong with your data.

Mean

In [54]:
mean=np.mean(HW[:,0])
mean

181.10546963838982

That's a lot of decimals, and it's very hard to know what that figure is just by the output. Let's make this figure more readable

In [55]:
print("Average " + str(round(mean,2)))

Average 181.11
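
An f-string is another way to get the same kind of readable output, with the rounding and the label in one place. A tiny sketch, reusing the mean printed above (the "cm" unit is taken from how the heights are stored):

```python
mean = 181.10546963838982  # value printed in the previous cell

# ":.2f" formats the number with two decimals
label = f"Average height: {mean:.2f} cm"
print(label)   # Average height: 181.11 cm
```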


Median (this time of the weights, column 1)

In [50]:
np.median(HW[:,1])

75.0

Std Dev

In [51]:
np.std(HW[:,0])

6.6749702781060405

Sum

In [52]:
np.sum(HW[:,0])

3185283

Sort (again the weights)

In [53]:
np.sort(HW[:,1])

array([48, 49, 50, ..., 107, 110, 110], dtype=object)
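
Notice the `dtype=object` in that last output: HW was sliced from an array that also held names, so NumPy still treats its values as generic Python objects. A hedged sketch of one way to tidy that up and get the stats for both columns at once, using three rows in the same layout as np_Football:

```python
import numpy as np

# Rows in the same layout as np_Football: name, height (cm), weight (kg)
players = np.array([["Cristiano Ronaldo", 185, 80],
                    ["Lionel Messi", 170, 72],
                    ["Neymar", 174, 68]], dtype=object)

# Keep the numeric columns and give them a numeric dtype
hw = players[:, 1:].astype(float)

# axis=0 computes one value per column (height, weight)
print(hw.mean(axis=0))
print(np.median(hw, axis=0))
print(hw.max(axis=0))
```

With `axis=0` each function runs down the rows, so you get one result for height and one for weight.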