# Introduction to Python

## Chapter 4 - NumPy

### NumPy
To perform calculations across multiple lists, you need more functionality than a plain Python list provides.

In [1]:
height = [1.73, 1.68, 1.71, 1.89, 1.79]
weight = [65.4, 59.2, 63.6, 88.4, 68.7]

weight/height**2

TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'

Out of the box, Python doesn't know how to do calculations on lists, so it throws the error above. The Numerical Python (NumPy) package provides an alternative to the regular Python list: the NumPy Array. A NumPy Array is similar to a list but has additional features, including the ability to perform calculations over all the elements in the array at once. To use NumPy, you first have to install it on your machine:
> pip3 install numpy 

You will then need to import NumPy into your Python script to start using its functionality.

In [2]:
import numpy as np
np_height = np.array(height)
np_height

array([1.73, 1.68, 1.71, 1.89, 1.79])

In [3]:
np_weight = np.array(weight)
np_weight

array([65.4, 59.2, 63.6, 88.4, 68.7])

In [4]:
bmi = np_weight/np_height**2
bmi

array([21.85171573, 20.97505669, 21.75028214, 24.7473475 , 21.44127836])

It is important to note that, unlike Python lists, NumPy Arrays can only contain values of a single type. If you try to create an array from values of different types, NumPy converts every element to one common type that can represent all of them. For example, mixing numbers with strings yields an array of strings.
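A small sketch of this type conversion, using made-up values that are not part of the course data. The `dtype` attribute reports the single type NumPy settled on:

```python
import numpy as np

# Illustrative values only (not from the course data).
numbers_only = np.array([1, 2.5, True])     # int, float and bool mixed
with_string = np.array([1.0, "is", True])   # a string forces everything to text

print(numbers_only)             # every element becomes a float
print(numbers_only.dtype)       # float64
print(with_string)              # every element becomes a string
print(with_string.dtype.kind)   # 'U' marks a Unicode string type
```

Checking `dtype` right after creating an array is a quick way to spot an accidental conversion to strings.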

A NumPy Array is just another Python object, which means it has its own methods and operators, and these can behave differently than you expect. For example, if you add a Python list to a Python list using the plus operator, Python pastes the two lists together:

In [5]:
python_list = [1,2,3]
python_list + python_list

[1, 2, 3, 1, 2, 3]

If you do the same thing with a NumPy Array, NumPy performs an element-wise sum of the arrays.

In [6]:
np_array = np.array([3,4,5])
np_array + np_array

array([ 6,  8, 10])
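The same difference shows up with other operators. As a short sketch (the values are arbitrary), multiplying by 2 repeats a list but doubles each element of an array:

```python
import numpy as np

python_list = [1, 2, 3]
np_array = np.array([1, 2, 3])

print(python_list * 2)   # repeats the list: [1, 2, 3, 1, 2, 3]
print(np_array * 2)      # multiplies each element: [2 4 6]
```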

#### NumPy Subsetting
Like lists, when you want to get elements from your NumPy Array, you use square brackets []. Suppose you want to get the second element from the bmi array.

In [7]:
bmi

array([21.85171573, 20.97505669, 21.75028214, 24.7473475 , 21.44127836])

In [8]:
bmi[1]

20.97505668934241

Specific to NumPy Arrays, you can also subset using booleans. First, a comparison on the array produces a boolean array.

In [9]:
bmi > 23

array([False, False, False,  True, False])

Next, you use the boolean array inside square brackets [] to subset the array. Using a comparison to get a subset of your data is very common.

In [10]:
bmi[bmi>23]

array([24.7473475])
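Comparisons can also be combined. The sketch below reuses the family height and weight values from earlier; the thresholds 21 and 22 are arbitrary and chosen only for illustration. NumPy's `&` operator combines two boolean arrays element by element:

```python
import numpy as np

np_height = np.array([1.73, 1.68, 1.71, 1.89, 1.79])
np_weight = np.array([65.4, 59.2, 63.6, 88.4, 68.7])
bmi = np_weight / np_height ** 2

# Each comparison needs its own parentheses because & binds tighter than > and <.
in_range = (bmi > 21) & (bmi < 22)

print(in_range)        # boolean array, True where both conditions hold
print(bmi[in_range])   # only the BMIs between 21 and 22
```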

In [1]:
#Behind the scenes work to import the baseball and soccer information that by default DataCamp provided
import pandas as pd
import numpy as np
MLB = pd.read_csv(r'C:\datacamp\01-PythonIntro\data\baseball.csv')
height_in = MLB.iloc[:,3].tolist()
weight_lb = MLB.iloc[:,4].tolist()
age = MLB.iloc[:,5].tolist()

updated_df = pd.read_csv(r'C:\datacamp\01-PythonIntro\data\update.csv', header=None)
updated = updated_df.values.tolist()

FIFA = pd.read_csv(r'C:\datacamp\01-PythonIntro\data\fifa.csv')
positions = FIFA.iloc[:,3].tolist()
positions =  [x.strip(' ') for x in positions]
heights = FIFA.iloc[:,4].tolist()

### Exercise 1

#### Your First NumPy Array
In this chapter, we're going to dive into the world of baseball. Along the way, you'll get comfortable with the basics of numpy, a powerful package to do data science. <br>
<br>
A list baseball has already been defined in the Python script, representing the height of some baseball players in centimeters. Can you add some code here and there to create a numpy array from it?

__Instructions:__
 - Import the numpy package as np, so that you can refer to numpy with np.
 - Use np.array() to create a numpy array from baseball. Name this array np_baseball.
 - Print out the type of np_baseball to check that you got it right.

In [2]:
# Create list baseball
baseball = [180, 215, 210, 210, 188, 176, 209, 200]

#Import NumPy as np
import numpy as np

#Create a numpy array from baseball: np_baseball
np_baseball = np.array(baseball)

print(type(np_baseball))

<class 'numpy.ndarray'>


#### Baseball players' height
You are a huge baseball fan. You decide to call the MLB (Major League Baseball) and ask around for some more statistics on the height of the main players. They pass along data on more than a thousand players, which is stored as a regular Python list: height_in. The height is expressed in inches. Can you make a numpy array out of it and convert the units to meters?
height_in is already available and the numpy package is loaded, so you can start straight away (Source: stat.ucla.edu). <br>

__Instructions:__
 - Create a numpy array from height_in. Name this new array np_height_in.
 - Print np_height_in.
 - Multiply np_height_in with 0.0254 to convert all height measurements from inches to meters. Store the new values in a new array, np_height_m.
 - Print out np_height_m and check if the output makes sense.

In [2]:
#Create a numpy array from height_in: np_height_in
np_height_in = np.array(height_in)
print(np_height_in)

#Convert np_height_in to meters: np_height_m
np_height_m = np_height_in * 0.0254
print(np_height_m)

[74 74 72 ... 75 75 73]
[1.8796 1.8796 1.8288 ... 1.905  1.905  1.8542]


#### Baseball players' BMI
The MLB also offers to let you analyze their weight data. Again, both are available as regular Python lists: height_in and weight_lb. height_in is in inches and weight_lb is in pounds. It's now possible to calculate the BMI of each baseball player. Python code to convert height_in to a numpy array with the correct units is already available in the workspace. Follow the instructions step by step and finish the game!<br>

__Instructions:__
 - Create a numpy array from the weight_lb list with the correct units. Multiply by 0.453592 to go from pounds to kilograms. Store the resulting numpy array as np_weight_kg.
 - Use np_height_m and np_weight_kg to calculate the BMI of each player. Use the following equation: $\mathrm{BMI} = \dfrac{\text{weight (kg)}}{\text{height (m)}^2}$

 - Save the resulting numpy array as bmi.
 - Print out bmi

In [3]:
#Create array from weight_lb with correct units: np_weight_kg
np_weight_kg = np.array(weight_lb) * .453592

#Calculate BMI: bmi
bmi = np_weight_kg/np_height_m**2
print(bmi)

[23.11037639 27.60406069 28.48080465 ... 25.62295933 23.74810865
 25.72686361]
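As a sanity check, the first value can be reproduced by hand. The first height printed earlier is 74 inches; the weight of 180 pounds is an assumption, inferred from the converted value 81.64656 kg that appears for the first player elsewhere in these notes:

```python
# Worked example for a single player (weight of 180 lb is an inferred assumption).
height_m = 74 * 0.0254         # inches to meters: 1.8796
weight_kg = 180 * 0.453592     # pounds to kilograms: 81.64656

bmi_first = weight_kg / height_m ** 2
print(round(bmi_first, 2))     # about 23.11, matching the first element of bmi
```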


#### Lightweight baseball players
To subset both regular Python lists and numpy arrays, you can use square brackets:
x = [4 , 9 , 6, 3, 1] <br>
x[1]<br>
import numpy as np<br>
y = np.array(x)<br>
y[1]<br>
<br>
For numpy specifically, you can also use boolean numpy arrays:<br>
high = y > 5<br>
y[high]<br>
<br>
The code that calculates the BMI of all baseball players is already included. Follow the instructions and reveal interesting things from the data!

__Instructions:__
 - Create a boolean numpy array: the element of the array should be True if the corresponding baseball player's BMI is below 21. You can use the < operator for this. Name the array light.
 - Print the array light.
 - Print out a numpy array with the BMIs of all baseball players whose BMI is below 21. Use light inside square brackets to do a selection on the bmi array.


In [4]:
# height_in and weight_lb are available as regular lists

# Import numpy
import numpy as np

# Calculate the BMI: bmi
np_height_m = np.array(height_in) * 0.0254
np_weight_kg = np.array(weight_lb) * 0.453592
bmi = np_weight_kg / np_height_m ** 2

# Create the light array
light = bmi < 21

# Print out light
print(light)

# Print out BMIs of all baseball players whose BMI is below 21
print(bmi[light])

[False False False ... False False False]
[20.54255679 20.54255679 20.69282047 20.69282047 20.34343189 20.34343189
 20.69282047 20.15883472 19.4984471  20.69282047 20.9205219 ]


#### Subsetting NumPy Arrays
You've seen it with your own eyes: Python lists and numpy arrays sometimes behave differently. Luckily, there are still certainties in this world. For example, subsetting (using the square bracket notation on lists or arrays) works exactly the same. To see this for yourself, try the following lines of code in the IPython Shell:<br>
x = ["a", "b", "c"]<br>
x[1]<br>
<br>
np_x = np.array(x)<br>
np_x[1]<br>
<br>
The script below already contains code that imports numpy as np, and stores both the height and weight of the MLB players as numpy arrays.

__Instructions:__
 - Subset np_weight_lb by printing out the element at index 50.
 - Print out a sub-array of np_height_in that contains the elements at index 100 up to and including index 110.

In [22]:
#Showing that square brackets work exactly the same for lists as they do for NumPy arrays

# height_in and weight_lb are available as regular lists

# Import numpy
import numpy as np

# Store weight and height lists as numpy arrays
np_weight_lb = np.array(weight_lb)
np_height_in = np.array(height_in)

# Print out the weight at index 50
print(np_weight_lb[50])

# Print out sub-array of np_height_in: index 100 up to and including index 110
print(np_height_in[100:111])

200
[73 74 72 73 69 72 73 75 75 73 72]
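The slice 100:111 is needed because the stop index is excluded. A minimal sketch with hypothetical data (the numbers 100 to 119, not the MLB values) shows the same rule for a list and an array:

```python
import numpy as np

values = list(range(100, 120))   # hypothetical data: 100, 101, ..., 119
np_values = np.array(values)

# start:stop includes start and excludes stop, so 5:8 covers indexes 5, 6 and 7.
print(values[5:8])       # [105, 106, 107]
print(np_values[5:8])    # [105 106 107]
print(np_values[-1])     # negative indexes count from the end: 119
```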


### 2D NumPy Arrays

When you look at the type of the baseball arrays, you see that the type() function returns numpy.ndarray. The numpy portion tells you that the type is defined by the NumPy package, and ndarray stands for N-dimensional array. The weight and height arrays are one-dimensional, but it is possible to create 2-, 3- or even 7-dimensional arrays with NumPy.

In [21]:
import numpy as np
np_height = np.array([1.73, 1.68, 1.71, 1.89, 1.79])

np_weight = np.array([65.4, 59.2, 63.6, 88.4, 68.7])

type(np_height)

numpy.ndarray

In [22]:
type(np_weight)

numpy.ndarray

You can create a NumPy 2D array from a regular Python list of lists. You can take the height and weight of everyone in your family, as a list of lists and convert it to a 2D NumPy array. Each sublist in the list corresponds to a row in the 2D NumPy array.

In [23]:
np_2d = np.array([[1.73, 1.68, 1.71, 1.89, 1.79], [65.4, 59.2, 63.6, 88.4, 68.7]])

np_2d

array([[ 1.73,  1.68,  1.71,  1.89,  1.79],
       [65.4 , 59.2 , 63.6 , 88.4 , 68.7 ]])

The .shape attribute of np_2d shows that we have 2 rows and 5 columns. An attribute gives you more information about an object. It is accessed with the same dot notation as a method, but attributes and methods are NOT the same: a method call ends in round brackets (), while an attribute does not. Like one-dimensional arrays, multi-dimensional arrays can only contain a single type.

In [24]:
np_2d.shape

(2, 5)
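Besides `shape`, NumPy arrays expose a few other attributes that describe their structure. A brief sketch using the same family data:

```python
import numpy as np

np_2d = np.array([[1.73, 1.68, 1.71, 1.89, 1.79],
                  [65.4, 59.2, 63.6, 88.4, 68.7]])

# Attributes are read without round brackets.
print(np_2d.shape)   # (2, 5): 2 rows, 5 columns
print(np_2d.ndim)    # 2: the number of dimensions
print(np_2d.size)    # 10: the total number of elements
print(np_2d.dtype)   # float64: the single type shared by all elements
```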

#### Subsetting 2D Arrays
              0      1      2      3      4
    array([[ 1.73,  1.68,  1.71,  1.89,  1.79],   0
           [65.4 , 59.2 , 63.6 , 88.4 , 68.7 ]])  1
       
With 2D arrays, you have more advanced ways of subsetting. Suppose you want the first row and then the third element in that row. To select the first row, you need the index 0 in square brackets.

In [25]:
np_2d[0]

array([1.73, 1.68, 1.71, 1.89, 1.79])

To then select the third element, you can extend the call with another pair of brackets, this time with the index 2. Basically, you select the row and then make another selection from that row.

In [26]:
np_2d[0][2]

1.71

There's also an alternative way to subset using single square brackets and a comma. This call returns the exact same value as before. The value before the comma specifies the row, the value after the comma specifies the column.

In [27]:
np_2d[0,2]

1.71

This type of subsetting gives you a lot of options for selecting specific data. Suppose you want the height and weight of just the second and third family members. You want both rows (height and weight), so you put a colon (:) before the comma. Since you only want the 2nd and 3rd columns, you put the slice 1:3 after the comma.

Remember that the end index of a slice is not included, so 1:3 selects the columns at index 1 and 2. The intersection gives us a 2D array with two rows and two columns.

In [28]:
np_2d[:,1:3]

array([[ 1.68,  1.71],
       [59.2 , 63.6 ]])

Similarly, suppose you only want the weight of the family members. That means you only want the second row, so put 1 before the comma and a colon (:) after the comma to get all the columns. The intersection gives us the entire 2nd row.

In [29]:
np_2d[1,:]

array([65.4, 59.2, 63.6, 88.4, 68.7])

Finally, 2D NumPy arrays also allow element-wise calculations.
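As a short sketch of element-wise calculations on the family array, the BMI can be computed directly from the two rows of np_2d. This reproduces the values calculated earlier from the separate height and weight arrays:

```python
import numpy as np

np_2d = np.array([[1.73, 1.68, 1.71, 1.89, 1.79],
                  [65.4, 59.2, 63.6, 88.4, 68.7]])

# A scalar is applied to every element of the 2D array.
print(np_2d * 2)

# Row 1 holds the weights and row 0 the heights; the division is element-wise.
bmi = np_2d[1, :] / np_2d[0, :] ** 2
print(bmi)
```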

### Exercise 2

#### Your First 2D NumPy Array
Before working on the actual MLB data, let's try to create a 2D numpy array from a small list of lists. In this exercise, baseball is a list of lists. The main list contains 4 elements. Each of these elements is a list containing the height and the weight of 4 baseball players, in this order. baseball is already coded for you in the script.

__Instructions:__
 - Use np.array() to create a 2D numpy array from baseball. Name it np_baseball.
 - Print out the type of np_baseball.
 - Print out the shape attribute of np_baseball. Use np_baseball.shape

In [8]:
# Create baseball, a list of lists
baseball = [[180, 78.4],
            [215, 102.7],
            [210, 98.5],
            [188, 75.2]]

# Import numpy
import numpy as np

#Create 2D array from baseball: np_baseball
np_baseball = np.array(baseball)

#Print np_baseball type
print(type(np_baseball))

#Print np_baseball shape
print(np_baseball.shape)

<class 'numpy.ndarray'>
(4, 2)


In [24]:
#Create the behind-the-scenes list of lists that DataCamp provides for this exercise
baseball = [[height_in[i], weight_lb[i]] for i in range(0, len(height_in))]

#### Baseball data in 2D form
You have another look at the MLB data and realize that it makes more sense to restructure all this information in a 2D numpy array. This array should have 1015 rows, corresponding to the 1015 baseball players you have information on, and 2 columns (for height and weight).<br>
<br>
The MLB was, again, very helpful and passed you the data in a different structure, a Python list of lists. In this list of lists, each sublist represents the height and weight of a single baseball player. The name of this embedded list is baseball.
Can you store the data as a 2D array to unlock numpy's extra functionality?

__Instructions:__
 - Use np.array() to create a 2D numpy array from baseball. Name it np_baseball.
 - Print out the shape attribute of np_baseball

In [25]:
# baseball is available as a regular list of lists

# Import numpy package
import numpy as np

# Create a 2D numpy array from baseball: np_baseball
np_baseball = np.array(baseball)

# Print out the shape of np_baseball
print(np_baseball.shape)

(1015, 2)


#### Subsetting 2D NumPy Arrays
If your 2D numpy array has a regular structure, i.e. each row and column has a fixed number of values, complicated ways of subsetting become very easy. Have a look at the code below where the elements "a" and "c" are extracted from a list of lists.
__regular list of lists__
x = [["a", "b"], ["c", "d"]]<br>
[x[0][0], x[1][0]]<br>
<br>
__numpy__
import numpy as np<br>
np_x = np.array(x)<br>
np_x[:,0]<br>
<br>
For regular Python lists, this is a real pain. For 2D numpy arrays, however, it's pretty intuitive! The indexes before the comma refer to the rows, while those after the comma refer to the columns. The : is for slicing; in this example, it tells Python to include all rows.<br>
<br>
The code that converts the pre-loaded baseball list to a 2D numpy array is already in the script. The first column contains the players' height in inches and the second column holds player weight, in pounds. Add some lines to make the correct selections. Remember that in Python, the first element is at index 0!

__Instructions:__
 - Print out the 50th row of np_baseball.
 - Make a new variable, np_weight_lb, containing the entire second column of np_baseball.
 - Select the height (first column) of the 124th baseball player in np_baseball and print it out.

In [26]:
# baseball is available as a regular list of lists

# Import numpy package
import numpy as np

# Create np_baseball (2 cols)
np_baseball = np.array(baseball)

# Print out the 50th row of np_baseball
print(np_baseball[49,:])

# Select the entire second column of np_baseball: np_weight_lb
np_weight_lb = np_baseball[:,1]

# Print out height of 124th player
print(np_baseball[123,0])

[ 70 195]
75


In [27]:
#Create the behind-the-scenes list of lists that DataCamp provides for this exercise
baseball = [[height_in[i], weight_lb[i], age[i]] for i in range(0, len(height_in))]

#### 2D Arithmetic
Remember how you calculated the Body Mass Index for all baseball players? numpy was able to perform all calculations element-wise (i.e. element by element). For 2D numpy arrays this isn't any different! You can combine matrices with single numbers, with vectors, and with other matrices. Execute the code below in the IPython shell and see if you understand:<br>
<br>
import numpy as np<br>
np_mat = np.array([[1, 2],
                   [3, 4],
                   [5, 6]])<br>
                   
np_mat * 2<br>
np_mat + np.array([10, 10])<br>
np_mat + np_mat<br>
<br>
np_baseball is coded for you; it's again a 2D numpy array with 3 columns representing height (in inches), weight (in pounds) and age (in years).

__Instructions:__
 - You managed to get hold of the changes in height, weight and age of all baseball players. It is available as a 2D numpy array, updated. Add np_baseball and updated and print out the result.
 - You want to convert the units of height and weight to metric (meters and kilograms respectively). As a first step, create a numpy array with three values: 0.0254, 0.453592 and 1. Name this array conversion.
 - Multiply np_baseball with conversion and print out the result.

In [59]:
import numpy as np
np_mat = np.array([[1, 2],
                   [3, 4],
                   [5, 6]])
np_mat * 2
np_mat + np.array([10, 10])
np_mat + np_mat

array([[ 2,  4],
       [ 6,  8],
       [10, 12]])
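The notebook cell above only displays the result of its last expression. To see all three results, a sketch like the following stores each one in a variable (the names doubled, shifted and summed are introduced here for illustration) and prints them in turn:

```python
import numpy as np

np_mat = np.array([[1, 2],
                   [3, 4],
                   [5, 6]])

doubled = np_mat * 2                    # the scalar is applied to every element
shifted = np_mat + np.array([10, 10])   # the 2-element vector is added to each row
summed = np_mat + np_mat                # matching elements are added together

print(doubled)   # [[ 2  4] [ 6  8] [10 12]]
print(shifted)   # [[11 12] [13 14] [15 16]]
print(summed)    # same values as doubled
```

The vector case works because its length (2) matches the number of columns in np_mat; this is the same mechanism that lets a 3-element conversion array scale the three columns of np_baseball.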

In [28]:
# baseball is available as a regular list of lists
# updated is available as 2D numpy array

# Import numpy package
import numpy as np

# Create np_baseball (3 cols)
np_baseball = np.array(baseball)

# Print out addition of np_baseball and updated
print(np_baseball + updated)

# Create numpy array: conversion
conversion = np.array([.0254, .453592, 1])

# Print out product of np_baseball and conversion
print(np_baseball*conversion)

[[ 75.2303559 168.837751   23.99     ]
 [ 75.0261425 231.0973231  35.69     ]
 [ 73.1544228 215.0816764  31.78     ]
 ...
 [ 76.0934993 209.2389078  26.19     ]
 [ 75.8228567 172.2179996  32.01     ]
 [ 73.9948422 203.1440271  28.92     ]]
[[ 1.8796  81.64656 22.99   ]
 [ 1.8796  97.52228 34.69   ]
 [ 1.8288  95.25432 30.78   ]
 ...
 [ 1.905   92.98636 25.19   ]
 [ 1.905   86.18248 31.01   ]
 [ 1.8542  88.45044 27.92   ]]


### NumPy Basic Statistics

When working with hundreds of thousands, or millions of rows, using summary statistics can give you a good overview of your data. NumPy provides a number of tools for generating summary statistics. We have updated the np_baseball array with updated height and weight and converted it to metric. Now we use NumPy's mean() function to find the average height of our baseball players.

In [70]:
np_baseball = (np_baseball + updated) * conversion

In [71]:
np.mean(np_baseball[:,0])

1.922932450984932

You can also find the median height, or the height directly in the middle if you sort all people from shortest to tallest, using NumPy's median() function.

In [72]:
np.median(np_baseball[:,0])

1.9218273491239999

These summary statistics give you a sanity check on your data. If you come up with an average weight of 2,000 kg, then you know something is wrong. Other functions include corrcoef(), which you can use to check whether height and weight are correlated, and std(), which returns the standard deviation. NumPy also offers functions such as sum() and sort().

In [73]:
np.corrcoef(np_baseball[:,0], np_baseball[:,1])

array([[1.        , 0.37601386],
       [0.37601386, 1.        ]])

In [74]:
np.std(np_baseball[:,0])

0.060061242312699555
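Instead of selecting one column at a time, these functions also accept an `axis` argument. The sketch below uses the small four-player height and weight list from Exercise 2 rather than the full MLB data; `axis=0` computes one statistic per column:

```python
import numpy as np

# Height (cm) and weight (kg) of four players, taken from Exercise 2.
small = np.array([[180, 78.4],
                  [215, 102.7],
                  [210, 98.5],
                  [188, 75.2]])

col_means = np.mean(small, axis=0)   # one mean per column
col_std = np.std(small, axis=0)      # one standard deviation per column

print(col_means)   # mean height 198.25, mean weight 88.7
print(col_std)
```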

In [71]:
#Behind the scenes data importing that DataCamp provides for this exercise
np_baseball = np.array(baseball)

In [76]:
print(np.mean(np_height_in))

73.6896551724138


### Exercise 3

#### Average versus median
You now know how to use numpy functions to get a better feeling for your data. It basically comes down to importing numpy and then calling several simple functions on the numpy arrays: <br>
import numpy as np<br>
x = [1, 4, 8, 10, 12]<br>
np.mean(x)<br>
np.median(x)<br>
<br>
The baseball data is available as a 2D numpy array with 3 columns (height, weight, age) and 1015 rows. The name of this numpy array is np_baseball. After restructuring the data, however, you notice that some height values are abnormally high. Follow the instructions and discover which summary statistic is best suited if you're dealing with so-called outliers.

__Instructions:__
 - Create numpy array np_height_in that is equal to first column of np_baseball.
 - Print out the mean of np_height_in.
 - Print out the median of np_height_in

In [61]:
import numpy as np
x = [1, 4, 8, 10, 12]
np.mean(x)
np.median(x)

8.0

In [29]:
# np_baseball is available

# Import numpy
import numpy as np

# Create np_height from np_baseball
np_height_in = np.array(np_baseball[:,0])

# Print out the mean of np_height
print(np.mean(np_height_in))

# Print out the median of np_height
print(np.median(np_height_in))

73.6896551724138
74.0
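To see why the median is the safer statistic when outliers are present, consider a sketch built on the small list from the exercise text. The extra value 1000 is a made-up outlier added for illustration:

```python
import numpy as np

x = [1, 4, 8, 10, 12]
x_with_outlier = x + [1000]   # hypothetical abnormally high value

print(np.mean(x), np.median(x))                             # 7.0 8.0
print(np.mean(x_with_outlier), np.median(x_with_outlier))   # 172.5 9.0
```

A single extreme value drags the mean far away from the bulk of the data, while the median barely moves.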


#### Explore the baseball data
Because the mean and median are so far apart, you decide to complain to the MLB. They find the error and send the corrected data over to you. It's again available as a 2D numpy array np_baseball, with three columns. The Python script below already includes code to print out informative messages with the different summary statistics. Can you finish the job?

__Instructions:__
 - The code to print out the mean height is already included. Complete the code for the median height. Replace None with the correct code.
 - Use np.std() on the first column of np_baseball to calculate stddev. Replace None with the correct code.
 - Do big players tend to be heavier? Use np.corrcoef() to store the correlation between the first and second column of np_baseball in corr. Replace None with the correct code.


In [30]:
# np_baseball is available

# Import numpy
import numpy as np

# Print mean height (first column)
avg = np.mean(np_baseball[:,0])
print("Average: " + str(avg))

# Print median height. Replace 'None'
med = np.median(np_baseball[:,0])
print("Median: " + str(med))

# Print out the standard deviation on height. Replace 'None'
stddev = np.std(np_baseball[:,0])
print("Standard Deviation: " + str(stddev))

# Print out correlation between first and second column. Replace 'None'
corr = np.corrcoef(np_baseball[:,0], np_baseball[:,1])
print("Correlation: " + str(corr))

Average: 73.6896551724138
Median: 74.0
Standard Deviation: 2.312791881046546
Correlation: [[1.         0.53153932]
 [0.53153932 1.        ]]
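np.corrcoef() returns a 2 x 2 matrix rather than a single number. The diagonal holds the correlation of each variable with itself (always 1), and the off-diagonal entries hold the correlation between the two variables. A sketch using the four-player data from Exercise 2 (not the full MLB data) extracts that single coefficient:

```python
import numpy as np

# Height (cm) and weight (kg) of four players, taken from Exercise 2.
heights = np.array([180, 215, 210, 188])
weights = np.array([78.4, 102.7, 98.5, 75.2])

corr = np.corrcoef(heights, weights)
r = corr[0, 1]   # correlation between heights and weights

print(corr.shape)   # (2, 2)
print(r)            # a single number between -1 and 1
```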


#### Blend it all together
In the last few exercises you've learned everything there is to know about heights and weights of baseball players. Now it's time to dive into another sport: soccer. <br>
<br>
You've contacted FIFA for some data and they handed you two lists. The lists are the following:<br>
positions = ['GK', 'M', 'A', 'D', ...]<br>
heights = [191, 184, 185, 180, ...]<br>
<br>
Each element in the lists corresponds to a player. The first list, positions, contains strings representing each player's position. The possible positions are: 'GK' (goalkeeper), 'M' (midfield), 'A' (attack) and 'D' (defense). The second list, heights, contains integers representing the height of the player in cm. The first player in the lists is a goalkeeper and is pretty tall (191 cm).<br>
<br>
You're fairly confident that the median height of goalkeepers is higher than that of other players on the soccer field. Some of your friends don't believe you, so you are determined to show them using the data you received from FIFA and your newly acquired Python skills.

__Instructions:__
 - Convert heights and positions, which are regular lists, to numpy arrays. Call them np_heights and np_positions.
 - Extract all the heights of the goalkeepers. You can use a little trick here: use np_positions == 'GK' as an index for np_heights. Assign the result to gk_heights.
 - Extract all the heights of all the other players. This time use np_positions != 'GK' as an index for np_heights. Assign the result to other_heights.
 - Print out the median height of the goalkeepers using np.median(). Replace None with the correct code.
 - Do the same for the other players. Print out their median height. Replace None with the correct code.

In [31]:
import numpy as np

#Convert the positions and heights lists to NumPy Arrays
np_positions = np.array(positions)
np_heights = np.array(heights)

#Subset the heights of goalkeepers
gk_heights = np.array(np_heights[np_positions == "GK"])

#Subset the heights of all the other positions
other_heights = np.array(np_heights[np_positions != "GK"])

#Print the median height of goalkeepers
print("Median height of goalkeepers:" + str(np.median(gk_heights)))

#Print the median height of all the other players
print("Median height of other positions:" + str(np.median(other_heights)))

Median height of goalkeepers:188.0
Median height of other positions:181.0
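The same pattern can be tried on a handful of values. In the sketch below, the first four players come from the exercise text, while the fifth (a second goalkeeper of 188 cm) is made up so that the goalkeeper group has more than one member:

```python
import numpy as np

# First four entries are from the exercise text; the fifth is hypothetical.
np_positions = np.array(['GK', 'M', 'A', 'D', 'GK'])
np_heights = np.array([191, 184, 185, 180, 188])

is_gk = np_positions == 'GK'      # boolean array marking the goalkeepers
gk_heights = np_heights[is_gk]
other_heights = np_heights[~is_gk]   # ~ inverts the boolean array

print(np.median(gk_heights))      # 189.5
print(np.median(other_heights))   # 184.0
```

Using `~is_gk` is an alternative to writing a second comparison with `!=`; both select the same players.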
