[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ML-Challenge/week2-data-analysis/blob/master/L1.Numpy.ipynb)

# Setup

In [None]:
# Download utils.py to working directory
import urllib.request
urllib.request.urlretrieve('https://raw.githubusercontent.com/ML-Challenge/week2-data-analysis/master/utils.py', 'utils.py')

In [None]:
# Import utils
# We'll be using this module throughout the lesson
import utils

## NumPy

NumPy is a fundamental Python package for doing data science efficiently. In this lesson we will learn to work with the NumPy array and get started with data exploration.

### Our First NumPy Array

We're going to dive into the world of baseball. Along the way, we'll get comfortable with the basics of ```numpy```, a powerful data science package.

The list ```baseball```, defined in the cell below, holds the heights of some baseball players in centimeters. Next, we'll see how to convert it to a NumPy array.

In [None]:
# Create list baseball
baseball = [180, 215, 210, 210, 188, 176, 209, 200]

In [None]:
# Import the numpy package as np
import numpy as np

In [None]:
# Create a numpy array from baseball: np_baseball
np_baseball = np.array(baseball)

In [None]:
# Print out type of np_baseball
print(type(np_baseball))

### Baseball players' height

We've decided to call the MLB (Major League Baseball) and ask around for some more statistics on the height of the main players. They pass along data on more than a thousand players, which is stored as a regular Python list: ```height_in```. The height is expressed in inches. Since we now know how to convert ```height_in``` to a NumPy array, how about some additional processing? We would rather not work with heights in inches, so let's convert them to meters.

In [None]:
# Create a numpy array from height: np_height
np_height = np.array(utils.height_in)

In [None]:
# Print out np_height
print(np_height)

Now that we have our NumPy array, how do we go about converting inches to meters?

The solution? **NumPy broadcasting**

In [None]:
# Convert np_height to m: np_height_m
np_height_m = np_height * 0.0254 

Awesome, right? We did the conversion in just one line of code.

Broadcasting makes arithmetic possible between arrays of different shapes or sizes. In the case above, broadcasting allowed us to multiply an array by a scalar: the scalar is applied to every element of the array.
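
As a rough illustration of what broadcasting saves us, the sketch below compares the one-line array multiplication with an explicit loop over a plain list. The three heights are made-up values, not the MLB data.

```python
import numpy as np

# Hypothetical heights in inches (not the MLB data)
sample_in = [70, 72, 75]
np_sample_in = np.array(sample_in)

# Broadcasting: the scalar is applied to every element of the array
np_sample_m = np_sample_in * 0.0254

# The equivalent with a plain list needs an explicit loop
sample_m_loop = [h * 0.0254 for h in sample_in]

print(np_sample_m)
print(sample_m_loop)
```

Both approaches give the same numbers, but the array version stays a single expression no matter how many players there are.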

In [None]:
# Print np_height_m
print(np_height_m)

### Baseball players' BMI

The MLB also gave us weight data. Both height and weight are available as regular Python lists: ```height_in``` and ```weight_lb```. ```height_in``` is in inches and ```weight_lb``` is in pounds.

It's now possible to calculate the BMI of each baseball player. We will use the following equation:

$$\mathrm{BMI} = \frac{\mathrm{weight (kg)}}{\mathrm{height (m)}^2}$$
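
Before applying the formula to whole arrays, it may help to work through it for a single player. The player below (74 inches, 180 pounds) is hypothetical; the conversion factors are the ones used in this lesson.

```python
# Hypothetical player: 74 inches tall, 180 pounds
height_m = 74 * 0.0254        # inches to meters
weight_kg = 180 * 0.453592    # pounds to kilograms

# BMI = weight (kg) / height (m) squared
bmi_one = weight_kg / height_m ** 2
print(round(bmi_one, 1))  # 23.1
```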

In [None]:
# Create array from weight with correct units: np_weight_kg
np_weight_kg = np.array(utils.weight_lb) * 0.453592

In [None]:
# Calculate the BMI: bmi
bmi = np_weight_kg / np_height_m ** 2

In [None]:
# Print out bmi
print(bmi)

### Lightweight baseball players

To subset both regular Python lists and ```numpy``` arrays, we can use square brackets:

```
x = [4, 9, 6, 3, 1]
x[1]
import numpy as np
y = np.array(x)
y[1]
```

For ```numpy``` specifically, we can also use boolean ```numpy``` arrays:

```
high = y > 5
y[high]
```

In [None]:
# Create the light array
light = bmi < 21

In [None]:
# Print out light
print(light)

In [None]:
# Print out BMIs of all baseball players whose BMI is below 21
print(bmi[light])

### NumPy Side Effects

```NumPy``` is great for doing vector arithmetic. If we compare its functionality with regular Python lists, however, some things have changed.

First of all, ```numpy``` arrays cannot contain elements with different types. If we try to build such an array, some of the elements' types are changed to end up with a homogeneous array. This is known as type coercion.

Second, the typical arithmetic operators, such as ```+```, ```-```, ```*``` and ```/``` have a different meaning for regular Python lists and ```numpy``` arrays.
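
Both effects can be seen in a small sketch. The values here are arbitrary and chosen only for illustration.

```python
import numpy as np

# Type coercion: the integer and the boolean are converted to floats
mixed = np.array([1, 2.5, True])
print(mixed.dtype)  # float64

# + concatenates lists but adds arrays element by element
list_sum = [1, 2, 3] + [4, 5, 6]
array_sum = np.array([1, 2, 3]) + np.array([4, 5, 6])
print(list_sum)   # [1, 2, 3, 4, 5, 6]
print(array_sum)  # [5 7 9]
```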

Have a look at this line of code:

```
np.array([True, 1, 2]) + np.array([3, 4, False])
```

In [None]:
np.array([True, 1, 2])

In [None]:
np.array([3, 4, False])

In [None]:
np.array([True, 1, 2]) + np.array([3, 4, False])

### Subsetting NumPy Arrays

Python lists and ```numpy``` arrays sometimes behave differently. Luckily, there are still certainties in this world. For example, subsetting (using the square bracket notation on lists or arrays) works exactly the same:

```
x = ["a", "b", "c"]
x[1]

np_x = np.array(x)
np_x[1]
```
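
Slices follow the same rule in both cases: the start index is included and the stop index is excluded. A minimal sketch with made-up values:

```python
import numpy as np

values = [10, 20, 30, 40, 50]  # hypothetical values
np_values = np.array(values)

# Indexes 1, 2 and 3 are selected; index 4 is excluded
print(values[1:4])     # [20, 30, 40]
print(np_values[1:4])  # [20 30 40]
```

This is why selecting index 100 up to and including index 110 requires a stop value of 111.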

In [None]:
# Print out the weight at index 50
print(np_weight_kg[50])

In [None]:
# Print out sub-array of np_height_m: index 100 up to and including index 110
print(np_height_m[100:111])

## 2D NumPy Arrays

### Our First 2D NumPy Array

Before working on the actual MLB data, let's try to create a 2D ```numpy``` array from a small list of lists.

In [None]:
# Create baseball_2d, a list of lists
baseball_2d = [[180, 78.4],
               [215, 102.7],
               [210, 98.5],
               [188, 75.2]]

In [None]:
# Create a 2D numpy array from baseball_2d: np_baseball_2d
np_baseball_2d = np.array(baseball_2d)

In [None]:
# Print out the type of np_baseball_2d
print(type(np_baseball_2d))

In [None]:
# Print out the shape of np_baseball_2d
print(np_baseball_2d.shape)

### Baseball data in 2D form

Looking at the MLB data, it makes more sense to restructure all this information in a 2D ```numpy``` array. This array should have 1015 rows, corresponding to the 1015 baseball players we have information on, and 2 columns (for height and weight).

The MLB was, again, very helpful and passed us the data in a different structure, a Python list of lists. In this list of lists, each sublist represents the height and weight of a single baseball player. The list is available as ```utils.baseball2D```.

In [None]:
# Create a 2D numpy array from baseball: np_baseball
np_baseball = np.array(utils.baseball2D)

In [None]:
# Print out the shape of np_baseball
print(np_baseball.shape)

### Subsetting 2D NumPy Arrays

If the 2D ```numpy``` array has a regular structure, i.e. each row and column has a fixed number of values, complicated ways of subsetting become very easy. Have a look at the code below where the elements ```"a"``` and ```"c"``` are extracted from a list of lists.

```
# regular list of lists
x = [["a", "b"], ["c", "d"]]
[x[0][0], x[1][0]]

# numpy
import numpy as np
np_x = np.array(x)
np_x[:,0]
```

For regular Python lists, this is a real pain. For 2D ```numpy``` arrays, however, it's pretty intuitive! The indexes before the comma refer to the rows, while those after the comma refer to the columns. The ```:``` is for slicing; in this example, it tells Python to include all rows.
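
The row-and-column notation can be tried out on a tiny array first. The three players below are hypothetical and stand in for the real data.

```python
import numpy as np

# Hypothetical 3x2 array: one row per player, columns are height and weight
players = np.array([[74, 180],
                    [72, 215],
                    [75, 210]])

second_row = players[1, :]   # all columns of row 1
heights = players[:, 0]      # all rows of column 0
one_value = players[2, 1]    # row 2, column 1

print(second_row)  # [ 72 215]
print(heights)     # [74 72 75]
print(one_value)   # 210
```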

In [None]:
# Print out the 50th row of np_baseball
print(np_baseball[49,:])

In [None]:
# Select the entire second column of np_baseball: np_weight_lb
np_weight_lb = np_baseball[:, 1]
print(np_weight_lb)

In [None]:
# Print out height of 124th player
print(np_baseball[123,0])

### 2D Arithmetic

Remember how we calculated the Body Mass Index for all baseball players? ```numpy``` was able to perform all calculations element-wise (i.e. element by element). For 2D ```numpy``` arrays this isn't any different! We can combine matrices with single numbers, with vectors, and with other matrices:

```
import numpy as np
np_mat = np.array([[1, 2],
                   [3, 4],
                   [5, 6]])
np_mat * 2
np_mat + np.array([10, 10])
np_mat + np_mat
```
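
When a 2D array is combined with a 1D vector, the vector is matched against the columns and applied to every row. The sketch below uses a different vector for each column to make this visible; the values are arbitrary.

```python
import numpy as np

np_mat = np.array([[1, 2],
                   [3, 4],
                   [5, 6]])       # shape (3, 2)
factors = np.array([10, 100])     # shape (2,), one factor per column

# Column 0 is multiplied by 10, column 1 by 100, in every row
scaled = np_mat * factors
print(scaled)
```

The same mechanism lets a vector of per-column conversion factors rescale a whole table in one expression.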

We managed to get hold of the changes in height, weight and age of all baseball players. They are loaded below as a 2D ```numpy``` array, ```np_updated```.

In [None]:
np_baseball = np.array(utils.baseball)
np_updated = np.array(utils.updated)

Let's add ```np_baseball``` and ```np_updated``` and print out the result.

In [None]:
# Print out addition of np_baseball and np_updated
print(np_baseball + np_updated)

We want to convert the units of height and weight to metric (meters and kilograms respectively). As a first step, create a ```numpy``` array with three values: ```0.0254```, ```0.453592``` and ```1```.

In [None]:
conversion = np.array([0.0254, 0.453592, 1])

In [None]:
# Print out product of np_baseball and conversion
print(np_baseball * conversion)

## NumPy: Basic Statistics

### Average versus median

Now we know how to use ```numpy``` functions to get a better feeling for our data. It basically comes down to importing ```numpy``` and then calling several simple functions on the ```numpy``` arrays:

```
import numpy as np
x = [1, 4, 8, 10, 12]
np.mean(x)
np.median(x)
```

After restructuring the data, however, we notice that some height values are abnormally high. Let's find out which summary statistic is best suited when we're dealing with such outliers.
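
To see how a single bad value affects each statistic, here is a small sketch with made-up heights in which one entry was mistyped as 760 instead of 76.

```python
import numpy as np

clean = np.array([72, 73, 74, 75, 76])          # hypothetical heights in inches
with_outlier = np.array([72, 73, 74, 75, 760])  # last value mistyped

print(np.mean(clean), np.median(clean))                # 74.0 74.0
print(np.mean(with_outlier), np.median(with_outlier))  # 210.8 74.0
```

The mean is pulled far away by the outlier, while the median does not move.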

In [None]:
np_baseball = np.array(utils.baseball_outliers)

In [None]:
# Create np_height_in from np_baseball
np_height_in = np_baseball[:, 0]

In [None]:
# Print out the mean of np_height_in
print(np.mean(np_height_in))

In [None]:
# Print out the median of np_height_in
print(np.median(np_height_in))

### Explore the baseball data

The mean and median are far apart, which points to a problem in the data. After we complained to the MLB, they found the error and sent us the corrected data. It's again available as a 2D NumPy array ```np_baseball```, with three columns.

In [None]:
np_baseball = np.array(utils.baseball)

In [None]:
# Print mean height (first column)
avg = np.mean(np_baseball[:,0])
print("Average: " + str(avg))

In [None]:
# Print median height.
med = np.median(np_baseball[:, 0])
print("Median: " + str(med))

In [None]:
# Print out the standard deviation on height.
stddev = np.std(np_baseball[:, 0])
print("Standard Deviation: " + str(stddev))

Do taller players tend to be heavier?
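
Note that ```np.corrcoef``` returns a correlation matrix rather than a single number. The sketch below, using hypothetical heights and weights, shows how the coefficient between the two inputs could be read from that matrix.

```python
import numpy as np

# Hypothetical heights (inches) and weights (pounds)
h = np.array([70, 72, 74, 76])
w = np.array([170, 185, 200, 230])

corr_matrix = np.corrcoef(h, w)  # 2x2 matrix, with 1.0 on the diagonal
r = corr_matrix[0, 1]            # correlation between h and w

print(corr_matrix.shape)  # (2, 2)
print(r)
```

A value of ```r``` close to 1 means that taller players in the sample tend to be heavier.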

In [None]:
# Print out correlation between first and second column.
corr = np.corrcoef(np_baseball[:, 0], np_baseball[:, 1])
print("Correlation: " + str(corr))

## Blend it all together

In the last few exercises we've learned everything there is to know about heights and weights of baseball players. Now it's time to dive into another sport: soccer.

We contacted FIFA for some data, and they handed us two lists:

```
positions = ['GK', 'M', 'A', 'D', ...]
heights = [191, 184, 185, 180, ...]
```

Each element in the lists corresponds to a player. The first list, positions, contains strings representing each player's position. The possible positions are: ```'GK'``` (goalkeeper), ```'M'``` (midfield), ```'A'``` (attack) and ```'D'``` (defense). The second list, heights, contains integers representing the height of the player in cm. The first player in the lists is a goalkeeper and is pretty tall (191 cm).

We're fairly confident that the median height of goalkeepers is higher than that of other players on the soccer field. Let's use our newly acquired NumPy skills to check this.
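
The idea is to build a boolean mask from one array and use it to select elements from the other. A minimal sketch on a six-player sample (hypothetical values, not the FIFA data):

```python
import numpy as np

# Hypothetical sample, not the FIFA data
sample_positions = np.array(['GK', 'M', 'A', 'D', 'GK', 'M'])
sample_heights = np.array([191, 184, 185, 180, 188, 176])

# True where the player is a goalkeeper
is_gk = sample_positions == 'GK'

# The mask selects goalkeepers; ~ inverts it to select everyone else
gk_median = np.median(sample_heights[is_gk])
other_median = np.median(sample_heights[~is_gk])

print(gk_median)     # 189.5
print(other_median)  # 182.0
```

This works because both arrays have the same length and the same player order.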

In [None]:
# Import numpy
import numpy as np

In [None]:
# Convert positions and heights to numpy arrays: np_positions, np_heights
np_positions = np.array(utils.positions)
np_heights = np.array(utils.heights)

In [None]:
# Heights of the goalkeepers: gk_heights
gk_heights = np_heights[np_positions == "GK"]

In [None]:
# Heights of the other players: other_heights
other_heights = np_heights[np_positions != 'GK']

In [None]:
# Print out the median height of goalkeepers
print("Median height of goalkeepers: " + str(np.median(gk_heights)))

In [None]:
# Print out the median height of other players
print("Median height of other players: " + str(np.median(other_heights)))

---
**[Week 2 - Data Analysis and Visualisation](https://radu-enuca.gitbook.io/ml-challenge/data-analysis-and-visualisation)**

*Have questions or comments? Visit the ML Challenge Mattermost Channel.*