<h1>Numpy</h1>

<p>Numpy is a Python package for working with lists and tables of numerical data, and it provides many of the statistical calculations we will need. We call such a list or table of data a numpy array.</p>

<p>We will often take the data from our pandas DataFrame and put it in numpy arrays. Pandas DataFrames are great because the column names and other text data make them human readable, but a DataFrame is not the ideal format for doing calculations. Numpy arrays are generally less human readable, but they are in a format that enables the necessary computation.</p>

In [1]:
import pandas as pd
df = pd.read_csv('../data/titanic.csv')

<h3>Convert the Fare column to a numpy array.</h3>

In [2]:
print(df['Fare'].values)

[  7.25    71.2833   7.925   53.1      8.05     8.4583  51.8625  21.075
  11.1333  30.0708  16.7     26.55     8.05    31.275    7.8542  16.
  29.125   13.      18.       7.225   26.      13.       8.0292  35.5
  21.075   31.3875   7.225  263.       7.8792   7.8958  27.7208 146.5208
   7.75    10.5     82.1708  52.       7.2292   8.05    18.      11.2417
   9.475   21.      41.5792   7.8792   8.05    15.5      7.75    21.6792
  17.8     39.6875   7.8     76.7292  26.      61.9792  35.5     10.5
   7.2292  27.75    46.9      7.2292  80.      83.475   27.9     27.7208
  15.2458  10.5      8.1583   7.925    8.6625  10.5     46.9     73.5
  14.4542  56.4958   7.65     7.8958   8.05    29.      12.475    9.
   9.5      7.7875  47.1     10.5     15.85    34.375    8.05   263.
   8.05     8.05     7.8542  61.175   20.575    7.25     8.05    34.6542
  63.3583  23.      26.       7.8958   7.8958  77.2875   8.6542   7.925
   7.8958   7.65     7.775    7.8958  24.15    52.      14.4542   8.05
   

<strong>The result is a 1-dimensional array. You can tell since there's only one set of brackets and it only expands across the page (not down as well).</strong>
<br/>
<strong>The values attribute of a Pandas Series gives the data as a numpy array.</strong>
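<p>As a small self-contained sketch of the conversion described above, the following builds a tiny DataFrame and pulls one column out as a numpy array. The name <code>toy_df</code> and its contents are made up for illustration and are not the Titanic file. It also shows the <code>to_numpy()</code> method, which pandas offers alongside the <code>values</code> attribute.</p>

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only (not read from titanic.csv).
toy_df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy"],
    "Fare": [7.25, 71.2833, 8.05],
})

# The attribute used in this notebook.
fare_values = toy_df["Fare"].values

# The method form; it returns the same numbers as a numpy array.
fare_numpy = toy_df["Fare"].to_numpy()

print(type(fare_values))
print(fare_values)
print(np.array_equal(fare_values, fare_numpy))
```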

<h3>We can also select multiple columns and get a 2-dimensional numpy array.</h3>

In [3]:
arr = df[['Pclass', 'Fare', 'Age']].values
print(arr)

[[ 3.      7.25   22.    ]
 [ 1.     71.2833 38.    ]
 [ 3.      7.925  26.    ]
 ...
 [ 3.     23.45    7.    ]
 [ 1.     30.     26.    ]
 [ 3.      7.75   32.    ]]


<h3>We use the numpy shape attribute to determine the size of our numpy array. The size tells us how many rows and columns are in our data.</h3>

In [4]:
print(arr.shape)

(887, 3)


<strong>This result means we have 887 rows and 3 columns.</strong>
<br/>
<strong>You can also use the shape attribute on a pandas DataFrame (df.shape).</strong>
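<p>To see how shape behaves for 1-dimensional and 2-dimensional data, here is a minimal sketch on a made-up four-row table (the name <code>toy_df</code> and its values are assumptions for illustration). Because shape is a tuple, it can be unpacked into a row count and a column count.</p>

```python
import pandas as pd

# Made-up data for illustration only.
toy_df = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Fare": [7.25, 71.2833, 7.925, 13.0],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

toy_arr = toy_df[["Pclass", "Fare", "Age"]].values

# shape is a tuple: (number of rows, number of columns).
n_rows, n_cols = toy_arr.shape
print(toy_arr.shape)   # (4, 3)
print(toy_arr.ndim)    # 2, the number of dimensions

# A single column is 1-dimensional, so its shape has one entry.
fare_arr = toy_df["Fare"].values
print(fare_arr.shape)  # (4,)

# The DataFrame reports the same shape as the 2-dimensional array.
print(toy_df.shape)    # (4, 3)
```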

<h3>We can select a single element from a numpy array by giving its row index and column index (both start at 0):</h3>

In [5]:
print(arr[0, 1])

7.25


<h3>We can also select a single row, for example, the whole row of the first passenger:</h3>

In [6]:
print(arr[0])

[ 3.    7.25 22.  ]


<h3>To select a single column (in this case the Age column), we put a colon in the row position to take every row:</h3>

In [7]:
age_col = arr[:, 2]
print(age_col)

[22.   38.   26.   35.   35.   27.   54.    2.   27.   14.    4.   58.
 20.   39.   14.   55.    2.   23.   31.   22.   35.   34.   15.   28.
  8.   38.   26.   19.   24.   23.   40.   48.   18.   66.   28.   42.
 18.   21.   18.   14.   40.   27.    3.   19.   30.   20.   27.   16.
 18.    7.   21.   49.   29.   65.   46.   21.   28.5   5.   11.   22.
 38.   45.    4.   64.    7.   29.   19.   17.   26.   32.   16.   21.
 26.   32.   25.   23.   28.    0.83 30.   22.   29.   31.   28.   17.
 33.   16.   20.   23.   24.   29.   20.   46.   26.   59.   22.   71.
 23.   34.   34.   28.   29.   21.   33.   37.   28.   21.   29.   38.
 28.   47.   14.5  22.   20.   17.   21.   70.5  29.   24.    2.   21.
 19.   32.5  32.5  54.   12.   19.   24.    2.   45.   33.   20.   47.
 29.   25.   23.   19.   37.   16.   24.   40.   22.   24.   19.   18.
 19.   27.    9.   36.5  42.   51.   22.   55.5  40.5  27.   51.   16.
 30.   37.    5.   44.   40.   26.   17.    1.    9.   48.   45.   60.
 28.  
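<p>The three kinds of selection shown so far (one element, one row, one column) can be compared side by side on a small array. This sketch uses only the first three rows printed earlier; the name <code>toy_arr</code> is introduced here for illustration.</p>

```python
import numpy as np

# Columns are Pclass, Fare, Age (first three rows of the printed array).
toy_arr = np.array([
    [3.0, 7.25, 22.0],
    [1.0, 71.2833, 38.0],
    [3.0, 7.925, 26.0],
])

single_value = toy_arr[0, 1]  # row 0, column 1: the first passenger's fare
first_row = toy_arr[0]        # every column of row 0
age_col = toy_arr[:, 2]       # ":" means all rows; 2 picks the Age column

print(single_value)  # 7.25
print(first_row)
print(age_col)
```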

<h3>Masking</h3>
<p>Often you want to select all the rows that meet a certain criterion.</p>
<h3>We create what we call a mask first. This is an array of boolean values (True/False) indicating whether each passenger is a child (under 18) or not.</h3>

In [8]:
mask = arr[:, 2] < 18
mask

array([False, False, False, False, False, False, False,  True, False,
        True,  True, False, False, False,  True, False,  True, False,
       False, False, False, False,  True, False,  True, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False,  True, False, False,  True, False, False,
       False, False,  True, False,  True, False, False, False, False,
       False, False, False,  True,  True, False, False, False,  True,
       False,  True, False, False,  True, False, False,  True, False,
       False, False, False, False, False,  True, False, False, False,
       False, False,  True, False,  True, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False,  True, False, False,  True, False, False, False,
       False,  True, False, False, False, False, False,  True, False,
       False,  True,

<strong>The False values mean adult and the True values mean child, so the first 7 passengers are adults, the 8th is a child, and the 9th is an adult.</strong>
<h3>Now we use our mask to select just the rows we care about:</h3>

In [9]:
arr[mask]

array([[  3.    ,  21.075 ,   2.    ],
       [  2.    ,  30.0708,  14.    ],
       [  3.    ,  16.7   ,   4.    ],
       [  3.    ,   7.8542,  14.    ],
       [  3.    ,  29.125 ,   2.    ],
       [  3.    ,   8.0292,  15.    ],
       [  3.    ,  21.075 ,   8.    ],
       [  3.    ,  11.2417,  14.    ],
       [  2.    ,  41.5792,   3.    ],
       [  3.    ,  21.6792,  16.    ],
       [  3.    ,  39.6875,   7.    ],
       [  2.    ,  27.75  ,   5.    ],
       [  3.    ,  46.9   ,  11.    ],
       [  3.    ,  27.9   ,   4.    ],
       [  3.    ,  15.2458,   7.    ],
       [  3.    ,   7.925 ,  17.    ],
       [  3.    ,  46.9   ,  16.    ],
       [  2.    ,  29.    ,   0.83  ],
       [  2.    ,  10.5   ,  17.    ],
       [  3.    ,  34.375 ,  16.    ],
       [  3.    ,  14.4542,  14.5   ],
       [  3.    ,  14.4583,  17.    ],
       [  3.    ,  31.275 ,   2.    ],
       [  3.    ,  11.2417,  12.    ],
       [  3.    ,  22.3583,   2.    ],
       [  3.    ,   9.216

<strong>If we recall that the third column is the passenger's age, we see that all the rows here are for passengers who are children.</strong>

<strong>Generally, we don't need to define the mask variable and can do the above in just a single line:</strong>

In [10]:
arr[arr[:, 2] < 18] 

array([[  3.    ,  21.075 ,   2.    ],
       [  2.    ,  30.0708,  14.    ],
       [  3.    ,  16.7   ,   4.    ],
       [  3.    ,   7.8542,  14.    ],
       [  3.    ,  29.125 ,   2.    ],
       [  3.    ,   8.0292,  15.    ],
       [  3.    ,  21.075 ,   8.    ],
       [  3.    ,  11.2417,  14.    ],
       [  2.    ,  41.5792,   3.    ],
       [  3.    ,  21.6792,  16.    ],
       [  3.    ,  39.6875,   7.    ],
       [  2.    ,  27.75  ,   5.    ],
       [  3.    ,  46.9   ,  11.    ],
       [  3.    ,  27.9   ,   4.    ],
       [  3.    ,  15.2458,   7.    ],
       [  3.    ,   7.925 ,  17.    ],
       [  3.    ,  46.9   ,  16.    ],
       [  2.    ,  29.    ,   0.83  ],
       [  2.    ,  10.5   ,  17.    ],
       [  3.    ,  34.375 ,  16.    ],
       [  3.    ,  14.4542,  14.5   ],
       [  3.    ,  14.4583,  17.    ],
       [  3.    ,  31.275 ,   2.    ],
       [  3.    ,  11.2417,  12.    ],
       [  3.    ,  22.3583,   2.    ],
       [  3.    ,   9.216

<strong>In short, a mask is a boolean array (True/False values) that tells us which values from the array we’re interested in.</strong>
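<p>Here is a minimal sketch of masking on a made-up four-row array (the values and the names <code>toy_arr</code> and <code>is_child</code> are assumptions for illustration). It also shows two things that follow from a mask being a boolean array: it can be inverted with <code>~</code>, and two conditions can be combined with <code>&amp;</code> as long as each one is wrapped in parentheses.</p>

```python
import numpy as np

# Columns are Pclass, Fare, Age. Made-up rows for illustration only.
toy_arr = np.array([
    [3.0, 21.075, 2.0],
    [1.0, 71.2833, 38.0],
    [2.0, 30.0708, 14.0],
    [3.0, 7.25, 22.0],
])

# One True/False value per row.
is_child = toy_arr[:, 2] < 18

children = toy_arr[is_child]   # rows where the mask is True
adults = toy_arr[~is_child]    # "~" flips True and False

# Two conditions combined with "&"; each needs its own parentheses.
third_class_children = toy_arr[(toy_arr[:, 0] == 3) & (toy_arr[:, 2] < 18)]

print(is_child)
print(children)
print(adults)
print(third_class_children)
```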

<h3>Summing and counting</h3>

<p>Let’s say we want to know how many of our passengers are children.</p>
<p>In arithmetic, True values are treated as 1 and False values are treated as 0. So we can just sum up the mask, and that’s equivalent to counting the number of True values.</p>

In [11]:
print(mask.sum())
print((arr[:, 2] < 18).sum())

130
130
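<p>The same counting idea can be checked by hand on a short made-up list of ages (the array <code>ages</code> below is an assumption for illustration). The sketch also shows <code>np.count_nonzero</code>, which gives the same count, and <code>mean()</code>, which turns the mask into the fraction of True values.</p>

```python
import numpy as np

# Made-up ages for illustration only: two of the five are under 18.
ages = np.array([22.0, 2.0, 38.0, 14.0, 35.0])

is_child = ages < 18

n_children = is_child.sum()               # True counts as 1, False as 0
n_children_alt = np.count_nonzero(is_child)
fraction_children = is_child.mean()       # share of True values

print(n_children)         # 2
print(n_children_alt)     # 2
print(fraction_children)
```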


<h3>Let's say we want to know the number of passengers who did and did not survive in each class.</h3>

In [12]:
smaller_df = df[['Pclass', 'Survived']]
arr = smaller_df.values

s_1 = arr[(arr[:, 0] == 1) & (arr[:, 1] == 1)]
s_2 = arr[(arr[:, 0] == 2) & (arr[:, 1] == 1)]
s_3 = arr[(arr[:, 0] == 3) & (arr[:, 1] == 1)]

print("survived in first class: ", len(s_1))
print("survived in second class: ", len(s_2))
print("survived in third class: ", len(s_3))

ns_1 = arr[(arr[:, 0] == 1) & (arr[:, 1] == 0)]
ns_2 = arr[(arr[:, 0] == 2) & (arr[:, 1] == 0)]
ns_3 = arr[(arr[:, 0] == 3) & (arr[:, 1] == 0)]

print("did not survive in first class: ", len(ns_1))
print("did not survive in second class: ", len(ns_2))
print("did not survive in third class: ", len(ns_3))

print("Total number of passengers: ", len(smaller_df))


survived in first class:  136
survived in second class:  87
survived in third class:  119
did not survive in first class:  80
did not survive in second class:  97
did not survive in third class:  368
Total number of passengers:  887
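<p>The six nearly identical lines above could also be written as a loop. The following is only a sketch of that alternative, run on a made-up six-row array rather than the Titanic data; the names <code>toy_arr</code> and <code>counts</code> are introduced for illustration. It sums each mask instead of selecting the rows and taking their length, which gives the same count.</p>

```python
import numpy as np

# Columns are Pclass, Survived. Made-up rows for illustration only.
toy_arr = np.array([
    [1, 1],
    [1, 0],
    [2, 1],
    [3, 0],
    [3, 0],
    [3, 1],
])

counts = {}
for pclass in (1, 2, 3):
    for survived in (1, 0):
        # Both conditions must hold for a row to be counted.
        mask = (toy_arr[:, 0] == pclass) & (toy_arr[:, 1] == survived)
        counts[(pclass, survived)] = int(mask.sum())
        label = "survived" if survived == 1 else "did not survive"
        print(f"class {pclass}, {label}: {counts[(pclass, survived)]}")

# Every row falls into exactly one group, so the counts add up to the row count.
print("Total number of passengers: ", sum(counts.values()))
```

<p>Writing the class and survival values once in the loop avoids copy-and-paste slips in the labels.</p>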
