# 1. NumPy

NumPy is a fundamental Python package for doing data science efficiently. Learn to work with the tools built around the NumPy array, and get started with data exploration.

## 1.1 Import Libraries

In [4]:
import numpy as np
import pandas as pd

## 1.2 User Variables

In [5]:
df_baseball = pd.read_csv("../datasets/baseball.csv")
df_baseball.head()

Unnamed: 0,Name,Team,Position,Height,Weight,Age,PosCategory
0,Adam_Donachie,BAL,Catcher,74,180,22.99,Catcher
1,Paul_Bako,BAL,Catcher,74,215,34.69,Catcher
2,Ramon_Hernandez,BAL,Catcher,72,210,30.78,Catcher
3,Kevin_Millar,BAL,First_Baseman,72,210,35.43,Infielder
4,Chris_Gomez,BAL,First_Baseman,73,188,35.71,Infielder


In [25]:
df_updated = pd.read_csv("../datasets/updated_numpy_subset.csv")
df_updated.head()

Unnamed: 0,0,1,2
0,1.230356,-11.162249,1.0
1,1.026143,16.097323,1.0
2,1.154423,5.081676,1.0
3,0.644275,-5.095381,1.0
4,1.005901,2.243427,1.0


In [58]:
df_baseball_outlier = pd.read_csv("../datasets/baseball_outlier.csv")
df_baseball_outlier.head()

Unnamed: 0,0,1,2
0,74000.0,180.0,22.99
1,74.0,215.0,34.69
2,72.0,210.0,30.78
3,72.0,210.0,35.43
4,73.0,188.0,35.71


# 2. Exercises

## 2.1 Your First NumPy Array

### Description

You're now going to dive into the world of baseball. Along the way, you'll get comfortable with the basics of ``numpy``, a powerful package to do data science.

A list ``baseball`` has already been defined in the Python script, representing the height of some baseball players in centimeters. Can you add some code to create a ``numpy`` array from it?

### Instructions

* Import the ``numpy`` package as ``np``, so that you can refer to ``numpy`` with ``np``.
* Use ``np.array()`` to create a ``numpy`` array from ``baseball``. Name this array ``np_baseball``.
* Print out the type of ``np_baseball`` to check that you got it right.

In [2]:
# Import the numpy package as np
import numpy as np

baseball = [180, 215, 210, 210, 188, 176, 209, 200]

# Create a numpy array from baseball: np_baseball
np_baseball = np.array(baseball)

# Print out type of np_baseball
print(type(np_baseball))

<class 'numpy.ndarray'>
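
Beyond ``type()``, the array's attributes offer another way to confirm the conversion. The following is a small illustrative sketch, not part of the original exercise. The exact integer dtype depends on the platform, so the check only tests that the elements are some integer type.

```python
import numpy as np

# Same sample list as in the exercise above
baseball = [180, 215, 210, 210, 188, 176, 209, 200]
np_baseball = np.array(baseball)

# A 1D array with one entry per list element
print(np_baseball.shape)  # (8,)
print(np_baseball.ndim)   # 1

# All elements share one dtype; the exact integer width is platform-dependent
print(np.issubdtype(np_baseball.dtype, np.integer))  # True
```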


## 2.2 Baseball players' height

### Description

You are a huge baseball fan. You decide to call the MLB (Major League Baseball) and ask around for some more statistics on the height of the main players. They pass along data on more than a thousand players, which is stored as a regular Python list: ``height_in``. The height is expressed in inches. Can you make a ``numpy`` array out of it and convert the units to meters?

``height_in`` is already available and the ``numpy`` package is loaded, so you can start straight away (Source: ``stat.ucla.edu``).

### Instructions

* Create a ``numpy`` array from ``height_in``. Name this new array ``np_height_in``.
* Print ``np_height_in``.
* Multiply ``np_height_in`` with ``0.0254`` to convert all height measurements from inches to meters. Store the new values in a new array, ``np_height_m``.
* Print out ``np_height_m`` and check if the output makes sense.

In [7]:
height_in = df_baseball["Height"].to_list()

In [8]:
# Import numpy
import numpy as np

# Create a numpy array from height_in: np_height_in
np_height_in = np.array(height_in)

# Print out np_height_in
print(np_height_in)

# Convert np_height_in to m: np_height_m
np_height_m = 0.0254 * np_height_in

# Print np_height_m
print(np_height_m)

[74 74 72 ... 75 75 73]
[1.8796 1.8796 1.8288 ... 1.905  1.905  1.8542]
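
To see why the conversion to an array matters here, the sketch below compares the same unit conversion on a plain list and on an array. It uses only the first five heights shown in the table in section 1.2, not the full dataset; the names ``height_m_list`` and ``np_height_m`` are chosen for illustration.

```python
import numpy as np

# First five heights (inches) from the table in section 1.2
height_in = [74, 74, 72, 72, 73]

# Multiplying a list by a float is not defined in Python
try:
    height_in * 0.0254
except TypeError as err:
    print("list * float fails:", err)

# With a plain list, the conversion needs an explicit loop
height_m_list = [h * 0.0254 for h in height_in]

# With an array, the multiplication applies to every element at once
np_height_m = np.array(height_in) * 0.0254

print(np.allclose(height_m_list, np_height_m))  # True
```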


## 2.3 Quiz: NumPy Side Effects

### Description

``numpy`` is great for doing vector arithmetic. If you compare its functionality with regular Python lists, however, some things have changed.

First of all, ``numpy`` arrays cannot contain elements with different types. Second, the typical arithmetic operators, such as ``+``, ``-``, ``*`` and ``/`` have a different meaning for regular Python lists and ``numpy`` arrays.

### Instructions

Some lines of code have been provided for you. Try them out and select the one that produces the same result as this expression:

```py
np.array([True, 1, 2]) + np.array([3, 4, False])
```

The ``numpy`` package is already imported as ``np``.

### Answer

```py
np.array([4, 3, 0]) + np.array([0, 2, 2])
```

In [9]:
[True, 1, 2] + [3, 4, False]

[True, 1, 2, 3, 4, False]

In [10]:
np.array([True, 1, 2]) + np.array([3, 4, False])

array([4, 5, 2])
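
The result follows from type coercion: because an array holds a single type, the booleans are converted to integers (``True`` becomes ``1``, ``False`` becomes ``0``) before the element-wise addition. A minimal sketch of both effects, with variable names chosen for illustration:

```python
import numpy as np

# Booleans mixed with integers are coerced to integers
mixed = np.array([True, 1, 2])
print(mixed)  # [1 1 2]

# + adds element by element instead of concatenating
total = mixed + np.array([3, 4, False])
print(total)  # [4 5 2]

# Mixing in a string coerces every element to a string
as_text = np.array([1.0, "is", True])
print(as_text.dtype.kind)  # U (unicode string)
```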

## 2.4 Subsetting NumPy Arrays

### Description

For single indexes and slices, subsetting (using the square bracket notation) works the same way on lists and arrays.

This exercise already has two lists, ``height_in`` and ``weight_lb``, loaded in the background for you. These contain the height and weight of the MLB players as regular lists. It also has two ``numpy`` arrays, ``np_weight_lb`` and ``np_height_in``, prepared for you.

### Instructions

* Subset ``np_weight_lb`` by printing out the element at index 50.
* Print out a sub-array of ``np_height_in`` that contains the elements at index 100 up to and including index 110.

In [11]:
weight_lb = df_baseball["Weight"].to_list()

In [12]:
import numpy as np

np_weight_lb = np.array(weight_lb)
np_height_in = np.array(height_in)

# Print out the weight at index 50
print(np_weight_lb[50])

# Print out sub-array of np_height_in: index 100 up to and including index 110
print(np_height_in[100:111])

200
[73 74 72 73 69 72 73 75 75 73 72]
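
As a short sketch on a five-element sample (the first five weights from the table in section 1.2), integer indexes and slices give matching results for a list and an array. Arrays additionally accept a boolean mask, which lists do not; the threshold of 200 is an arbitrary value picked for illustration.

```python
import numpy as np

# First five weights (pounds) from the table in section 1.2
weights = [180, 215, 210, 210, 188]
np_weights = np.array(weights)

# Single index: same element from the list and the array
print(weights[1], np_weights[1])  # 215 215

# Slice: the end index is excluded in both cases
print(weights[1:4])     # [215, 210, 210]
print(np_weights[1:4])  # [215 210 210]

# Boolean mask: supported by arrays only
heavy = np_weights[np_weights > 200]
print(heavy)  # [215 210 210]
```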


## 2.5 Your First 2D NumPy Array

### Description

Before working on the actual MLB data, let's try to create a 2D ``numpy`` array from a small list of lists.

In this exercise, ``baseball`` is a list of lists. The main list contains 4 elements, one per baseball player. Each element is a list containing that player's height and weight, in this order. ``baseball`` is already coded for you in the script.

### Instructions

* Use ``np.array()`` to create a 2D ``numpy`` array from ``baseball``. Name it ``np_baseball``.
* Print out the type of ``np_baseball``.
* Print out the shape attribute of ``np_baseball``. Use ``np_baseball.shape``.

In [13]:
import numpy as np

baseball = [[180, 78.4],
            [215, 102.7],
            [210, 98.5],
            [188, 75.2]]

# Create a 2D numpy array from baseball: np_baseball
np_baseball = np.array(baseball)

# Print out the type of np_baseball
print(type(np_baseball))

# Print out the shape of np_baseball
print(np_baseball.shape)

<class 'numpy.ndarray'>
(4, 2)
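
The shape tuple reads as (rows, columns). As an illustrative sketch on the same small list, the snippet below unpacks it and also shows that mixing integers and floats produces a float array:

```python
import numpy as np

baseball = [[180, 78.4],
            [215, 102.7],
            [210, 98.5],
            [188, 75.2]]
np_baseball = np.array(baseball)

# shape is (number of rows, number of columns)
rows, cols = np_baseball.shape
print(rows, cols)  # 4 2

# Two dimensions, eight values in total
print(np_baseball.ndim)  # 2
print(np_baseball.size)  # 8

# Integers are upcast so that every element is a float
print(np_baseball.dtype)  # float64
```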


## 2.6 Baseball data in 2D form

### Description

You realize that it makes more sense to restructure all this information in a 2D ``numpy`` array.

You have a Python list of lists. In this list of lists, each sublist represents the height and weight of a single baseball player. The name of this list is ``baseball`` and it has been loaded for you already (although you can't see it).

Store the data as a 2D array to unlock ``numpy``'s extra functionality.

### Instructions

* Use ``np.array()`` to create a 2D ``numpy`` array from ``baseball``. Name it ``np_baseball``.
* Print out the ``shape`` attribute of ``np_baseball``.

In [19]:
height_in = df_baseball["Height"]
weight_lb = df_baseball["Weight"]

In [15]:
baseball = [[x,y] for (x,y) in zip(height_in, weight_lb)]

In [16]:
import numpy as np

# Create a 2D numpy array from baseball: np_baseball
np_baseball = np.array(baseball)

# Print out the shape of np_baseball
print(np_baseball.shape)

(1015, 2)
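
The list comprehension with ``zip`` is one way to pair the two columns. As an alternative sketch (not the approach used in the notebook), ``np.column_stack`` builds the same 2D array directly from the two sequences. The sample below uses only the first five rows from section 1.2.

```python
import numpy as np

# First five heights and weights from the table in section 1.2
height_in = [74, 74, 72, 72, 73]
weight_lb = [180, 215, 210, 210, 188]

# Approach used above: pair the values, then convert
from_zip = np.array([[h, w] for h, w in zip(height_in, weight_lb)])

# Alternative: stack the 1D sequences as columns
stacked = np.column_stack((height_in, weight_lb))

print(stacked.shape)                       # (5, 2)
print(np.array_equal(from_zip, stacked))   # True
```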


## 2.7 Subsetting 2D NumPy Arrays

### Description

If your 2D ``numpy`` array has a regular structure, i.e. each row and column has a fixed number of values, complicated ways of subsetting become very easy. Have a look at the code below where the elements ``"a"`` and ``"c"`` are extracted from a list of lists.

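```py
# regular list of lists (example values; "a" and "c" form the first column)
x = [["a", "b"], ["c", "d"]]
[x[0][0], x[1][0]]

# numpy
import numpy as np
np_x = np.array(x)
np_x[:, 0]
```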

The indexes before the comma refer to the rows, while those after the comma refer to the columns. The ``:`` is for slicing; in this example, it tells Python to include all rows.

### Instructions

* Print out the 50th row of ``np_baseball``.
* Make a new variable, ``np_weight_lb``, containing the entire second column of ``np_baseball``.
* Select the height (first column) of the 124th baseball player in ``np_baseball`` and print it out.

In [17]:
import numpy as np

np_baseball = np.array(baseball)

# Print out the 50th row of np_baseball
print(np_baseball[49])

# Select the entire second column of np_baseball: np_weight_lb
np_weight_lb = np_baseball[:,1]

# Print out height of 124th player
print(np_baseball[123,0])

[ 70 195]
75
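
Note that indexes start at zero, so the 50th row is at index 49. The same patterns can be tried on a small array; the sketch below uses the first five height and weight pairs from section 1.2, with variable names chosen for illustration.

```python
import numpy as np

# First five (height, weight) rows from the table in section 1.2
np_baseball = np.array([[74, 180],
                        [74, 215],
                        [72, 210],
                        [72, 210],
                        [73, 188]])

# Second row: index 1
second_row = np_baseball[1]
print(second_row)  # [ 74 215]

# Entire second column: all rows, column index 1
weights = np_baseball[:, 1]
print(weights)  # [180 215 210 210 188]

# Height (column 0) of the third player (row index 2)
third_height = np_baseball[2, 0]
print(third_height)  # 72

# Slices work on both axes: rows 1 and 2, all columns
block = np_baseball[1:3, :]
print(block.shape)  # (2, 2)
```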


## 2.8 2D Arithmetic

### Description

2D ``numpy`` arrays can perform calculations element by element, just like 1D ``numpy`` arrays.

``np_baseball`` is coded for you; it's again a 2D ``numpy`` array with 3 columns representing height (in inches), weight (in pounds) and age (in years). ``baseball`` is available as a regular list of lists and ``updated`` is available as a 2D ``numpy`` array.

### Instructions

* You managed to get hold of the changes in height, weight and age of all baseball players. It is available as a 2D ``numpy`` array, ``updated``. Add ``np_baseball`` and ``updated`` and print out the result.
* You want to convert the units of height and weight to metric (meters and kilograms, respectively). As a first step, create a ``numpy`` array with three values: ``0.0254``, ``0.453592`` and ``1``. Name this array ``conversion``.
* Multiply ``np_baseball`` with ``conversion`` and print out the result.

In [None]:
height_in = df_baseball["Height"]
weight_lb = df_baseball["Weight"]
age = df_baseball["Age"]

In [35]:
baseball = [[x,y,z] for (x,y,z) in zip(height_in, weight_lb, age)]
baseball

[[74, 180, 22.99],
 [74, 215, 34.69],
 [72, 210, 30.78],
 [72, 210, 35.43],
 [73, 188, 35.71],
 [69, 176, 29.39],
 [69, 209, 30.77],
 [71, 200, 35.07],
 [76, 231, 30.19],
 [71, 180, 27.05],
 [73, 188, 23.88],
 [73, 180, 26.96],
 [74, 185, 23.29],
 [74, 160, 26.11],
 [69, 180, 27.55],
 [70, 185, 34.27],
 [73, 189, 27.99],
 [75, 185, 22.38],
 [78, 219, 22.89],
 [79, 230, 25.76],
 [76, 205, 36.33],
 [74, 230, 31.17],
 [76, 195, 32.31],
 [72, 180, 31.03],
 [71, 192, 29.26],
 [75, 225, 29.47],
 [77, 203, 32.46],
 [74, 195, 35.67],
 [73, 182, 25.89],
 [74, 188, 26.55],
 [78, 200, 24.17],
 [73, 180, 26.69],
 [75, 200, 25.13],
 [73, 200, 27.9],
 [75, 245, 30.17],
 [75, 240, 31.36],
 [74, 215, 30.99],
 [69, 185, 32.24],
 [71, 175, 27.61],
 [74, 199, 28.2],
 [73, 200, 28.85],
 [73, 215, 24.21],
 [76, 200, 22.02],
 [74, 205, 24.97],
 [74, 206, 26.78],
 [70, 186, 32.51],
 [72, 188, 30.95],
 [77, 220, 33.09],
 [74, 210, 32.74],
 [70, 195, 30.69],
 [73, 200, 23.45],
 [75, 200, 24.94],
 ... (output truncated)

In [36]:
updated = np.array([[x,y,z] for (x,y,z) in zip(df_updated["0"], df_updated["1"], df_updated["2"])])

In [37]:
import numpy as np

np_baseball = np.array(baseball)

# Print out addition of np_baseball and updated
print(np_baseball + updated)

# Create numpy array: conversion
conversion = np.array([0.0254, 0.453592, 1])

# Print out product of np_baseball and conversion
print(np_baseball * conversion)

[[ 75.2303559  168.83775102  23.99      ]
 [ 75.02614252 231.09732309  35.69      ]
 [ 73.1544228  215.08167641  31.78      ]
 ...
 [ 76.09349925 209.23890778  26.19      ]
 [ 75.82285669 172.21799965  32.01      ]
 [ 73.99484223 203.14402711  28.92      ]]
[[ 1.8796  81.64656 22.99   ]
 [ 1.8796  97.52228 34.69   ]
 [ 1.8288  95.25432 30.78   ]
 ...
 [ 1.905   92.98636 25.19   ]
 [ 1.905   86.18248 31.01   ]
 [ 1.8542  88.45044 27.92   ]]
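
The multiplication works even though ``conversion`` has only three values, because NumPy broadcasts the 1D array across every row of the 2D array. The sketch below checks this on the first three rows of the data against an explicit column-by-column version; ``metric`` and ``manual`` are names introduced for illustration.

```python
import numpy as np

# First three (height, weight, age) rows from the list above
np_baseball = np.array([[74, 180, 22.99],
                        [74, 215, 34.69],
                        [72, 210, 30.78]])

conversion = np.array([0.0254, 0.453592, 1])

# Broadcasting: each row is multiplied element-wise by conversion
metric = np_baseball * conversion
print(metric.shape)  # (3, 3)

# Equivalent explicit version, one column at a time
manual = np.column_stack((np_baseball[:, 0] * 0.0254,
                          np_baseball[:, 1] * 0.453592,
                          np_baseball[:, 2] * 1))

print(np.allclose(metric, manual))  # True
```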


## 2.9 Average versus median

### Description

You now know how to use ``numpy`` functions to get a better feeling for your data.

The baseball data is available as a 2D ``numpy`` array with 3 columns (height, weight, age) and 1015 rows. The name of this ``numpy`` array is ``np_baseball``. After restructuring the data, however, you notice that some height values are abnormally high. Follow the instructions and discover which summary statistic is best suited if you're dealing with so-called *outliers*. ``np_baseball`` is available.

### Instructions

* Create a ``numpy`` array ``np_height_in`` that is equal to the first column of ``np_baseball``.
* Print out the mean of ``np_height_in``.
* Print out the median of ``np_height_in``.

In [60]:
np_baseball = np.array(df_baseball_outlier)

In [61]:
import numpy as np

# Create np_height_in from np_baseball
np_height_in = np_baseball[:,0]

# Print out the mean of np_height_in
print(np.mean(np_height_in))

# Print out the median of np_height_in
print(np.median(np_height_in))

1586.4610837438424
74.0


An average height of 1586 inches doesn't sound right, does it? The median, however, does not seem affected by the outliers: 74 inches makes perfect sense. It's always a good idea to check both the median and the mean to get an idea of the overall distribution of the entire dataset.
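
The effect is easy to reproduce on a tiny sample. The sketch below takes the first five heights from section 1.2 and replaces one value with ``74000.0``, mirroring the first row of the outlier dataset. It is an illustration only, not the full data.

```python
import numpy as np

# First five heights (inches) from the table in section 1.2
heights = np.array([74.0, 74.0, 72.0, 72.0, 73.0])

# Copy the data and corrupt a single value
with_outlier = heights.copy()
with_outlier[0] = 74000.0

# Clean data: mean and median agree
print(np.mean(heights), np.median(heights))  # 73.0 73.0

# One bad value drags the mean far away; the median stays at 73.0
print(np.mean(with_outlier))    # roughly 14858.2
print(np.median(with_outlier))  # 73.0
```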

## 2.10 Explore the baseball data

### Description

Because the mean and median are so far apart, you decide to complain to the MLB. They find the error and send the corrected data over to you. It's again available as a 2D NumPy array ``np_baseball``, with three columns.

The Python script in the editor already includes code to print out informative messages with the different summary statistics and ``numpy`` is already loaded as ``np``. Can you finish the job? ``np_baseball`` is available.

### Instructions

* The code to print out the mean height is already included. Complete the code for the median height.
* Use ``np.std()`` on the first column of ``np_baseball`` to calculate ``stddev``.
* Do big players tend to be heavier? Use ``np.corrcoef()`` to store the correlation between the first and second column of ``np_baseball`` in ``corr``.

In [68]:
baseball = [[x,y,z] for (x,y,z) in zip(height_in, weight_lb, age)]
np_baseball = np.array(baseball)

In [69]:
avg = np.mean(np_baseball[:,0])
print("Average: " + str(avg))

# Print median height
med = np.median(np_baseball[:,0])
print("Median: " + str(med))

# Print out the standard deviation on height
stddev = np.std(np_baseball[:,0])
print("Standard Deviation: " + str(stddev))

# Print out correlation between first and second column
corr = np.corrcoef(np_baseball[:,0], np_baseball[:,1])
print("Correlation: " + str(corr))

Average: 73.6896551724138
Median: 74.0
Standard Deviation: 2.312791881046546
Correlation: [[1.         0.53153932]
 [0.53153932 1.        ]]
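
``np.corrcoef()`` returns a full correlation matrix, so the single coefficient between the two columns sits at position ``[0, 1]``. Also note that ``np.std()`` computes the population standard deviation by default; passing ``ddof=1`` gives the sample version. A small sketch on the first five rows from section 1.2, with names chosen for illustration:

```python
import numpy as np

# First five heights and weights from the table in section 1.2
heights = np.array([74, 74, 72, 72, 73])
weights = np.array([180, 215, 210, 210, 188])

# 2x2 matrix: ones on the diagonal, the coefficient off the diagonal
corr_matrix = np.corrcoef(heights, weights)
print(corr_matrix.shape)  # (2, 2)

# Extract the single height-weight correlation coefficient
corr = corr_matrix[0, 1]

# Population (default, ddof=0) versus sample (ddof=1) standard deviation
std_population = np.std(heights)
std_sample = np.std(heights, ddof=1)
print(std_sample)  # 1.0
```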
