Name: Nurkholis\
Source: Datacamp

# NumPy

NumPy is a fundamental Python package for practicing data science efficiently. Learn to work with powerful tools in the NumPy array, and get started with data exploration.

## Your First NumPy Array

In this chapter, we're going to dive into the world of baseball. Along the way, you'll get comfortable with the basics of $\color{blue}{\text{numpy}}$, a powerful package to do data science.

A **list baseball** has already been defined in the Python script, representing the height of some baseball players in centimeters. Can you add some code here and there to create a $\color{blue}{\text{numpy}}$ array from it?

In [1]:
# Import the numpy package as np
import numpy as np

# Create list baseball
baseball = [180, 215, 210, 210, 188, 176, 209, 200]

# Create a numpy array from baseball: np_baseball
np_baseball = np.array(baseball)

# Print out type of np_baseball
print("List baseball with numpy array\t:", np_baseball,
      "\ntype\t\t\t\t:", type(np_baseball))

List baseball with numpy array	: [180 215 210 210 188 176 209 200] 
type				: <class 'numpy.ndarray'>


## Baseball players' height

You are a huge baseball fan. You decide to call the MLB (Major League Baseball) and ask around for some more statistics on the height of the main players. They pass along data on more than a thousand players, which is stored as a regular Python list: $\color{blue}{\text{height\_in}}$. The height is expressed in inches. Can you make a $\color{blue}{\text{numpy}}$ array out of it and convert the units to meters?

$\color{blue}{\text{height\_in}}$ is already available and the numpy package is loaded, so you can start straight away (Source: <a href="https://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_MLB_HeightsWeights" target="_blank" rel="noopener noreferrer">$\color{blue}{\text{stat.ucla.edu}}$</a>).

In [2]:
#import pandas
import pandas as pd

# Import numpy
import numpy as np

# Read the CSV file
baseball_csv = pd.read_csv("baseball.csv")

# create np_baseball_csv as numpy arrays
np_baseball_csv = np.array(baseball_csv)

# Read just the list height.
height_in = np_baseball_csv[:, 3]

# Create a numpy array from height_in: np_height_in
np_height_in = np.array(height_in)

# Print out np_height_in 
print("baseball player height (in inches):", np_height_in)

# Convert np_height_in to m: np_height_m
np_height_m = np_height_in * 0.0254

# Convert np_height_m from object dtype to float: np_height_m_float
np_height_m_float = np_height_m.astype(float)

# Print np_height_m_float
print("baseball player height (in meters):", np.round(np_height_m_float, decimals=2))

baseball player height (in inches): [74 74 72 ... 75 75 73]
baseball player height (in meters): [1.88 1.88 1.83 ... 1.9  1.9  1.85]
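
The conversion to float in the cell above is needed because calling $\color{blue}{\text{np.array}}$ on a table with mixed column types typically produces an array of dtype object. A minimal sketch, assuming a hypothetical two-row table whose column names (Name, Height) are made up for illustration and are not the real layout of baseball.csv, shows the difference when only the numeric column is converted:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV; the column names are assumptions
df = pd.DataFrame({"Name": ["Player A", "Player B"], "Height": [74, 72]})

# Converting the whole mixed-type table gives a generic object array
whole = np.array(df)
print(whole.dtype)

# Converting only the numeric column keeps a numeric dtype
heights_in = df["Height"].to_numpy()
heights_m = heights_in * 0.0254
print(heights_in.dtype, heights_m.dtype)
```

With a numeric dtype from the start, no extra float conversion is required after the unit change.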


## Baseball players' BMI

The MLB also offers to let you analyze their weight data. Again, both are available as regular Python lists: $\color{blue}{\text{height\_in}}$ and $\color{blue}{\text{weight\_lb}}$. $\color{blue}{\text{height\_in}}$ is **in inches** and $\color{blue}{\text{weight\_lb}}$ is **in pounds**.

It's now possible to calculate the BMI of each baseball player.
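
For reference, BMI is weight in kilograms divided by the square of height in meters. A minimal sketch with two made-up players (the values are illustrative only) shows the element-wise calculation:

```python
import numpy as np

# Illustrative values only
height_in = np.array([74, 72])
weight_lb = np.array([180, 210])

height_m = height_in * 0.0254      # inches to meters
weight_kg = weight_lb * 0.453592   # pounds to kilograms

# ** binds tighter than /, so this is weight / (height squared)
bmi = weight_kg / height_m ** 2
print(np.round(bmi, 2))
```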

In [3]:
#import pandas
import pandas as pd

# Import numpy
import numpy as np

# Read the CSV file and just list height and weight
baseball_csv = pd.read_csv("baseball.csv")

# create np_baseball_csv as numpy arrays
np_baseball_csv = np.array(baseball_csv)

# Read just the list height and weight.
height_in = np_baseball_csv[:, 3]
weight_lb = np_baseball_csv[:, 4]

# Create array from height_in with metric units: np_height_m
np_height_m = (np.array(height_in) * 0.0254).astype(float)

# Create array from weight_lb with metric units: np_weight_kg
np_weight_kg = (np.array(weight_lb) * 0.453592).astype(float)

# Calculate the BMI: bmi
bmi = np.round(np_weight_kg / np_height_m **2, decimals=3)

# Print out bmi
print("BMI of each baseball player:", bmi)

BMI of each baseball player: [23.11  27.604 28.481 ... 25.623 23.748 25.727]


## Lightweight baseball players

To subset both regular Python lists and $\color{blue}{\text{numpy}}$ arrays, you can use square brackets:

In [4]:
x = [4 , 9 , 6, 3, 1]
x[1]

9

In [5]:
x = [4 , 9 , 6, 3, 1]

import numpy as np
y = np.array(x)
y[1]

9

For numpy specifically, you can also use **boolean** $\color{blue}{\text{numpy}}$ arrays:

In [6]:
y = np.array([4 , 9 , 6, 3, 1])
high = y > 5
y[high]

array([9, 6])

Follow the instructions and reveal interesting things from the data! $\color{blue}{\text{height\_in}}$ and $\color{blue}{\text{weight\_lb}}$ are available as regular lists.

In [7]:
#import pandas
import pandas as pd

# Import numpy
import numpy as np

# Read the CSV file
baseball_csv = pd.read_csv("baseball.csv")

# create np_baseball_csv as numpy arrays
np_baseball_csv = np.array(baseball_csv)

# Read just the list height and weight.
height_in = np_baseball_csv[:, 3]
weight_lb = np_baseball_csv[:, 4]

# Calculate the BMI: bmi
np_height_m = (np.array(height_in) * 0.0254).astype(float)
np_weight_kg = (np.array(weight_lb) * 0.453592).astype(float)
bmi = np.round(np_weight_kg / np_height_m ** 2, decimals = 2)

# Create the light array
light = bmi < 21

# Print out light
print("Boolean light of BMI\t:", light)

# Print out BMIs of all baseball players whose BMI is below 21
print("BMI\t\t\t:", bmi[light])

Boolean light of BMI	: [False False False ... False False False]
BMI			: [20.54 20.54 20.69 20.69 20.34 20.34 20.69 20.16 19.5  20.69 20.92]
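
Because True counts as 1 in arithmetic, a boolean array can also be summed to count how many elements match. A small sketch with made-up BMI values:

```python
import numpy as np

bmi = np.array([23.1, 20.5, 27.6, 19.5, 25.7])  # illustrative values
light = bmi < 21

print(light)          # the boolean mask itself
print(light.sum())    # how many values are below 21
print(bmi[light])     # the matching BMI values
```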


## NumPy Side Effects

As Hugo explained before, $\color{blue}{\text{numpy}}$ is great for doing vector arithmetic. If you compare its functionality with regular Python lists, however, some things have changed.

First of all, $\color{blue}{\text{numpy}}$ arrays cannot contain elements with different types. If you try to build such an array, some of the elements' types are changed to end up with a homogeneous array. This is known as type coercion.

Second, the typical arithmetic operators, **such as +, -, * and /** have a different meaning for regular Python lists and numpy arrays.

Have a look at this line of code:

In [8]:
np.array([True, 1, 2]) + np.array([3, 4, False])

array([4, 5, 2])

Can you tell which code chunk builds the exact same Python object? The $\color{blue}{\text{numpy}}$ package is already imported as $\color{blue}{\text{np}}$, so you can start experimenting in the Python shell!

**Possible answers:**

- np.array([True, 1, 2, 3, 4, False])
- **np.array([4, 3, 0]) + np.array([0, 2, 2])**
- np.array([1, 1, 2]) + np.array([3, 4, -1])
- np.array([0, 1, 2, 3, 4, 5])
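
Both side effects can be checked directly. The sketch below is only an illustration: booleans mixed with integers are coerced to integers (True becomes 1, False becomes 0), and + adds arrays element by element while it concatenates plain lists.

```python
import numpy as np

# Type coercion: booleans become integers
a = np.array([True, 1, 2])
b = np.array([3, 4, False])
print(a, b)

# Element-wise addition on arrays
print(a + b)

# The same operator on plain lists concatenates instead
print([True, 1, 2] + [3, 4, False])
```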

## Subsetting NumPy Arrays

You've seen it with your own eyes: Python lists and $\color{blue}{\text{numpy}}$ arrays sometimes behave differently. Luckily, there are still certainties in this world. For example, subsetting (using the square bracket notation on lists or arrays) works exactly the same. To see this for yourself, try the following lines of code in the Python shell:

In [9]:
x = ["a", "b", "c"]
x[1]

'b'

In [10]:
import numpy as np
x = ["a", "b", "c"]

np_x = np.array(x)
np_x[1]

'b'

The script in the editor already contains code that imports $\color{blue}{\text{numpy}}$ as $\color{blue}{\text{np}}$, and stores both the height and weight of the MLB players as numpy arrays. $\color{blue}{\text{height\_in}}$ and $\color{blue}{\text{weight\_lb}}$ are available as regular lists.

In [11]:
#import pandas
import pandas as pd

# Import numpy
import numpy as np

# Read the CSV file and just list height and weight
baseball_csv = pd.read_csv("baseball.csv")

# create np_baseball_csv as numpy arrays
np_baseball_csv = np.array(baseball_csv)

# Read just the list height and weight.
height_in = np_baseball_csv[:, 3]
weight_lb = np_baseball_csv[:, 4]

# Store weight and height lists as numpy arrays
np_weight_lb = np.array(weight_lb)
np_height_in = np.array(height_in)

# Print out the weight at index 50
print("the weight at index 50\t\t:", np_weight_lb[50])

# Print out sub-array of np_height_in: index 100 up to and including index 110
print("the height at index 100-110\t:", np_height_in[100:111])

the weight at index 50		: 200
the height at index 100-110	: [73 74 72 73 69 72 73 75 75 73 72]
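
The end of a slice is exclusive, which is why 100:111 is needed to include index 110. A quick check on a stand-in array ($\color{blue}{\text{np.arange}}$ is used here only so that each value equals its own index):

```python
import numpy as np

values = np.arange(200)   # the value at each position equals its index
sub = values[100:111]

# First element, last element, and number of elements in the slice
print(sub[0], sub[-1], len(sub))
```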


## Your First 2D NumPy Array

Before working on the actual MLB data, let's try to create a 2D $\color{blue}{\text{numpy}}$ array from a small list of lists.

In this exercise, $\color{blue}{\text{baseball}}$ is a list of lists. The main list contains 4 elements. Each of these elements is a list containing the height and the weight of 4 baseball players, in this order. $\color{blue}{\text{baseball}}$ is already coded for you in the script.

In [12]:
# Import numpy
import numpy as np

# Create baseball, a list of lists
baseball = [[180, 78.4],
            [215, 102.7],
            [210, 98.5],
            [188, 75.2]]

# Create a 2D numpy array from baseball: np_baseball
np_baseball = np.array(baseball)

# Print out the type of np_baseball
print("type of np_baseball\t:", type(np_baseball))

# Print out the shape of np_baseball
print("shape of np_baseball\t:", np_baseball.shape)

type of np_baseball	: <class 'numpy.ndarray'>
shape of np_baseball	: (4, 2)


## Baseball data in 2D form

You have another look at the MLB data and realize that it makes more sense to restructure all this information in a 2D $\color{blue}{\text{numpy}}$ array. This array should have 1015 rows, corresponding to the 1015 baseball players you have information on, and 2 columns (for height and weight).

The MLB was, again, very helpful and passed you the data in a different structure, a Python list of lists. In this list of lists, each sublist represents the height and weight of a single baseball player. The name of this embedded list is $\color{blue}{\text{baseball}}$.

Can you store the data as a 2D array to unlock $\color{blue}{\text{numpy}}$'s extra functionality? $\color{blue}{\text{baseball}}$ is available as a regular list of lists.

In [13]:
#import pandas
import pandas as pd

# Import numpy
import numpy as np

# Read the CSV file and just list height and weight
baseball_csv = pd.read_csv("baseball.csv")

# create np_baseball_csv as numpy arrays
np_baseball_csv = np.array(baseball_csv)

# Select the height and weight columns: np_baseball
np_baseball = np_baseball_csv[:, 3:5]

# Print out the shape of np_baseball
print("shape of np_baseball:", np_baseball.shape)

shape of np_baseball: (1015, 2)


## Subsetting 2D NumPy Arrays

If your 2D $\color{blue}{\text{numpy}}$ array has a regular structure, i.e. each row and column has a fixed number of values, complicated ways of subsetting become very easy. Have a look at the code below where the elements $\color{red}{\text{"a"}}$ and $\color{red}{\text{"c"}}$ are extracted from a **list of lists**.

In [14]:
# regular list of lists
x = [["a", "b"], ["c", "d"]]
[x[0][0], x[1][0]]

['a', 'c']

In [15]:
# regular list of lists
x = [["a", "b"], ["c", "d"]]

# numpy
import numpy as np
np_x = np.array(x)
np_x[:, 0]

array(['a', 'c'], dtype='<U1')

For regular Python lists, this is a real pain. For 2D $\color{blue}{\text{numpy}}$ arrays, however, it's pretty intuitive! The indexes before the comma refer to the rows, while those after the comma refer to the columns. The $\color{blue}{\text{:}}$ is for slicing; in this example, it tells Python to include all rows.
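
A short sketch of the row, column convention, reusing the same small array of letters:

```python
import numpy as np

np_x = np.array([["a", "b"], ["c", "d"]])

print(np_x[0, :])   # first row, all columns
print(np_x[:, 1])   # all rows, second column
print(np_x[1, 0])   # second row, first column
```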

The code that converts the pre-loaded $\color{blue}{\text{baseball}}$ list to a 2D $\color{blue}{\text{numpy}}$ array is already in the script. The first column contains the players' height in inches and the second column holds player weight, in pounds. Add some lines to make the correct selections. Remember that in Python, the first element is at index 0! $\color{blue}{\text{baseball}}$ is available as a regular list of lists.

In [16]:
#import pandas
import pandas as pd

# Import numpy
import numpy as np

# Read the CSV file and just list height and weight
baseball_csv = pd.read_csv("baseball.csv")

# create np_baseball_csv as numpy arrays
np_baseball_csv = np.array(baseball_csv)

# Select the height and weight columns (2 columns): np_baseball
np_baseball = np_baseball_csv[:, 3:5]

# Print out the first 5 rows of np_baseball
print("the first 5 rows of np_baseball:\n", np_baseball[:5])

# Select the entire second column of np_baseball: np_weight_lb
np_weight_lb = np_baseball[:, 1]

# Print out height of 124th player -> the 124th element is at index 123
print("height of 124th player is", np_baseball[123, 0])

the first 5 rows of np_baseball:
 [[74 180]
 [74 215]
 [72 210]
 [72 210]
 [73 188]]
height of 124th player is 75


## 2D Arithmetic

Remember how you calculated the Body Mass Index for all baseball players? $\color{blue}{\text{numpy}}$ was able to perform all calculations element-wise (i.e. element by element). For 2D $\color{blue}{\text{numpy}}$ arrays this isn't any different! You can combine matrices with single numbers, with vectors, and with other matrices.

Execute the code below in the Python shell and see if you understand:

In [17]:
import numpy as np
np_mat = np.array([[1, 2],
                   [3, 4],
                   [5, 6]])
np_mat * 2

array([[ 2,  4],
       [ 6,  8],
       [10, 12]])

In [18]:
import numpy as np
np_mat = np.array([[1, 2],
                   [3, 4],
                   [5, 6]])
np_mat + np.array([10, 10])

array([[11, 12],
       [13, 14],
       [15, 16]])

In [19]:
import numpy as np
np_mat = np.array([[1, 2],
                   [3, 4],
                   [5, 6]])
np_mat + np_mat

array([[ 2,  4],
       [ 6,  8],
       [10, 12]])

$\color{blue}{\text{np\_baseball}}$ is coded for you; it's again a 2D $\color{blue}{\text{numpy}}$ array with **3 columns** representing **height (in inches)**, **weight (in pounds)** and **age (in years)**.

In [20]:
#import pandas
import pandas as pd

# Import numpy package
import numpy as np

# Read the CSV file and just list height and weight
baseball_csv = pd.read_csv("baseball.csv")

# create np_baseball_csv as numpy arrays
np_baseball_csv = np.array(baseball_csv)

# Select the height, weight, and age columns (3 columns): np_baseball
np_baseball = (np_baseball_csv[:, 3:6]).astype(float)

# Create numpy array: conversion
conversion = (np.array([0.0254, 0.453592, 1])).astype(float)

# Print out product of np_baseball and conversion
print("baseball 3 columns(height, weight, age) conversion:\n", np.round(np_baseball * conversion, decimals=3))

baseball 3 columns(height, weight, age) conversion:
 [[ 1.88  81.647 22.99 ]
 [ 1.88  97.522 34.69 ]
 [ 1.829 95.254 30.78 ]
 ...
 [ 1.905 92.986 25.19 ]
 [ 1.905 86.182 31.01 ]
 [ 1.854 88.45  27.92 ]]
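
The multiplication above works because a length-3 vector lines up with the 3 columns of every row (NumPy calls this broadcasting). A minimal sketch with made-up numbers, including the case where the vector length matches the rows instead and has to be reshaped into a column first:

```python
import numpy as np

mat = np.array([[1, 2],
                [3, 4],
                [5, 6]])            # shape (3, 2)

per_column = np.array([10, 100])    # one factor per column, shape (2,)
print(mat * per_column)

per_row = np.array([1, 2, 3])       # one factor per row, shape (3,)
try:
    mat * per_row                   # shapes (3, 2) and (3,) do not line up
except ValueError as err:
    print("ValueError:", err)

# Turning the vector into a (3, 1) column applies it row by row
print(mat * per_row[:, np.newaxis])
```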


## Average versus median

You now know how to use $\color{blue}{\text{numpy}}$ functions to get a better feeling for your data. It basically comes down to importing $\color{blue}{\text{numpy}}$ and then calling several simple functions on the $\color{blue}{\text{numpy}}$ arrays:

In [21]:
import numpy as np
x = [1, 4, 8, 10, 12]
np.mean(x)

7.0

In [22]:
import numpy as np
x = [1, 4, 8, 10, 12]
np.median(x)

8.0

The baseball data is available as a 2D $\color{blue}{\text{numpy}}$ array with 3 columns (height, weight, age) and 1015 rows. The name of this $\color{blue}{\text{numpy}}$ array is $\color{blue}{\text{np\_baseball}}$.

In [23]:
#import pandas
import pandas as pd

# Import numpy package
import numpy as np

# Read the CSV file and just list height and weight
baseball_csv = pd.read_csv("baseball.csv")

# create np_baseball_csv as numpy arrays
np_baseball_csv = np.array(baseball_csv)

# Select the height, weight, and age columns (3 columns): np_baseball
np_baseball = np_baseball_csv[:, 3:6]

# Create np_height_in from np_baseball
np_height_in = np_baseball[:,0]

# Print out the mean of np_height_in
print(np.mean(np_height_in))

# Print out the median of np_height_in
print(np.median(np_height_in))

73.6896551724138
74.0
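
The median is less sensitive to extreme values than the mean, so a large gap between the two often hints at bad data points. A sketch with made-up heights, where one value is deliberately mistyped:

```python
import numpy as np

clean = np.array([73, 74, 72, 75, 74])
dirty = np.array([73, 74, 72, 75, 7400])   # last value entered wrongly

# The mean shifts a lot, the median stays put
print(np.mean(clean), np.median(clean))
print(np.mean(dirty), np.median(dirty))
```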


## Explore the baseball data

Because the mean and median are so far apart, you decide to complain to the MLB. They find the error and send the corrected data over to you. It's again available as a 2D NumPy array $\color{blue}{\text{np\_baseball}}$, with three columns.

The Python script in the editor already includes code to print out informative messages with the different summary statistics. Can you finish the job? $\color{blue}{\text{np\_baseball}}$ is available.

In [24]:
#import pandas
import pandas as pd

# Import numpy package
import numpy as np

# Read the CSV file and just list height and weight
baseball_csv = pd.read_csv("baseball.csv")

# create np_baseball_csv as numpy arrays
np_baseball_csv = np.array(baseball_csv)

# Select the height, weight, and age columns (3 columns): np_baseball
np_baseball = (np_baseball_csv[:, 3:6]).astype(float)

# Print mean height (first column)
avg = np.mean(np_baseball[:,0])
print("Average\t\t\t: " + str(avg))

# Print median height
med = np.median(np_baseball[:,0])
print("Median\t\t\t: " + str(med))

# Print out the standard deviation on height
stddev = np.std(np_baseball[:,0])
print("Standard Deviation\t: " + str(stddev))

# Print out correlation between first and second column
corr = np.corrcoef(np_baseball[:,0], np_baseball[:,1])
print("\nCorrelation:\n" + str(corr))

Average			: 73.6896551724138
Median			: 74.0
Standard Deviation	: 2.312791881046546

Correlation:
[[1.         0.53153932]
 [0.53153932 1.        ]]
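
$\color{blue}{\text{np.corrcoef}}$ returns a matrix: the diagonal holds each variable's correlation with itself (always 1), and the off-diagonal entries hold the coefficient of interest. Note also that $\color{blue}{\text{np.std}}$ divides by N by default (ddof=0); passing ddof=1 gives the sample standard deviation. A sketch with made-up values:

```python
import numpy as np

height = np.array([70.0, 72.0, 74.0, 76.0])
weight = 2.5 * height + 10          # perfectly linear, so r should be 1

corr = np.corrcoef(height, weight)
print(corr[0, 1])                   # the single coefficient

print(np.std(height))               # population form, ddof=0
print(np.std(height, ddof=1))       # sample form
```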


## Blend it all together

In the last few exercises you've learned everything there is to know about heights and weights of baseball players. Now it's time to dive into another sport: soccer.

You've contacted FIFA for some data and they handed you two lists. The lists are the following:

positions = ['GK', 'M', 'A', 'D', ...]
heights = [191, 184, 185, 180, ...]

Each element in the lists corresponds to a player. The first list, $\color{blue}{\text{positions}}$, contains strings representing each player's position. The possible positions are: $\color{red}{\text{'GK'}}$ (goalkeeper), $\color{red}{\text{'M'}}$ (midfield), $\color{red}{\text{'A'}}$ (attack) and $\color{red}{\text{'D'}}$ (defense). The second list, $\color{blue}{\text{heights}}$, contains integers representing the height of the player in cm. The first player in the lists is a goalkeeper and is pretty tall (191 cm).

You're fairly confident that the median height of goalkeepers is higher than that of other players on the soccer field. Some of your friends don't believe you, so you are determined to show them using the data you received from FIFA and your newly acquired Python skills. $\color{blue}{\text{heights}}$ and $\color{blue}{\text{positions}}$ are available as lists.

In [25]:
#import pandas
import pandas as pd

# Import numpy package
import numpy as np

# Read the CSV file
fifa_csv = pd.read_csv("fifa.csv")

# create np_fifa_csv as numpy arrays
np_fifa_csv = np.array(fifa_csv)

# Select the position and height columns (2 columns): np_fifa
np_fifa = np_fifa_csv[:, 3:5]
print("fifa shape\t\t:", np_fifa.shape)

# Convert positions and heights to numpy arrays: np_positions, np_heights
np_positions = (np_fifa[:,0]).astype(str)
np_heights   = (np_fifa[:,1]).astype(int)

# print : np_positions, np_heights
print("np_positions\t\t:", np_positions)
print("np_heights\t\t:", np_heights)

# The strings in np_positions have leading white space,
# so strip it with np.char.strip()
# wws = without white space
np_positions_wws = np.char.strip(np_positions)
print("np_positions_wws\t:", np_positions_wws)

# Heights of the goalkeepers: gk_heights
gk_heights = np_heights[np_positions_wws == "GK"]

# Heights of the other players: other_heights
other_heights = np_heights[np_positions_wws != "GK"]

# Print out the median height of goalkeepers
print("\nMedian height of goalkeepers\t: " + str(np.median(gk_heights)))

# Print out the median height of other players
print("Median height of other players\t: " + str(np.median(other_heights)))

fifa shape		: (8847, 2)
np_positions		: [' GK' ' M' ' A' ... ' D' ' D' ' M']
np_heights		: [191 184 185 ... 183 179 179]
np_positions_wws	: ['GK' 'M' 'A' ... 'D' 'D' 'M']

Median height of goalkeepers	: 188.0
Median height of other players	: 181.0
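
The same boolean-mask idea extends to every position. A minimal sketch on a handful of made-up players (not the FIFA data), looping over the distinct labels found by $\color{blue}{\text{np.unique}}$:

```python
import numpy as np

# Illustrative values only
positions = np.array(["GK", "M", "A", "D", "GK", "M"])
heights = np.array([191, 184, 185, 180, 188, 179])

# Median height for each distinct position label
medians = {}
for pos in np.unique(positions):
    medians[str(pos)] = float(np.median(heights[positions == pos]))

print(medians)
```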
