# NumPy and Pandas: Essential data science packages

## Why NumPy and Pandas instead of regular Python lists?
A vector can be represented in many ways in Python (e.g. as a list of numbers). Because machine learning requires a lot of computation, it is much better to use NumPy’s ndarray, which is more convenient and provides optimized implementations of essential mathematical operations on vectors.

## NumPy
NumPy stands for ‘Numerical Python’ or ‘Numeric Python’. It is an open-source Python module that provides fast mathematical computation on arrays and matrices. Since arrays and matrices are an essential part of machine learning, NumPy, together with modules like Scikit-learn, Pandas, Matplotlib and TensorFlow, forms the core of the Python machine learning ecosystem.

Let's assume we want to multiply a Python list by a scalar. For a list, "*" repeats the list instead of multiplying its elements:

In [1]:
a_list = [1, 3, 4, 10, -42.3]
a_list * 5

[1,
 3,
 4,
 10,
 -42.3,
 1,
 3,
 4,
 10,
 -42.3,
 1,
 3,
 4,
 10,
 -42.3,
 1,
 3,
 4,
 10,
 -42.3,
 1,
 3,
 4,
 10,
 -42.3]
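
Without NumPy, the elementwise result has to be computed explicitly, for example with a list comprehension. A minimal sketch (the name scaled is only illustrative and not part of the original notebook):

```python
a_list = [1, 3, 4, 10, -42.3]

# Multiply every element by 5 instead of repeating the list
scaled = [x * 5 for x in a_list]
print(scaled)
```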

With NumPy, the multiplication is applied to every element:

In [2]:
import numpy as np

In [3]:
a_np = np.array([1, 3, 4, 10, -42.3])
a_np = a_np * 5

print(type(a_np))
print(a_np.dtype)
a_np

<class 'numpy.ndarray'>
float64


array([   5. ,   15. ,   20. ,   50. , -211.5])

In [4]:
a_np = a_np + 1
a_np

array([   6. ,   16. ,   21. ,   51. , -210.5])

Determining the dimensions and size of an array:

In [5]:
print(a_np.ndim) # number of dimensions
print(a_np.shape) # tuple with the length of each dimension
print(a_np.size) # total number of elements
print(a_np.itemsize) # required bytes to store each element

1
(5,)
5
8


#### Important: NumPy does not display trailing zeros (e.g. 5. instead of 5.0)

### Iterating through elements

In [6]:
integers = np.array([[1, 2, 3], [4, 5, 6]])

In [7]:
for row in integers:
    for column in row:
        print(column, end='  ')
    print() 

1  2  3  
4  5  6  


In [8]:
for i in integers.flat: # as if it was one-dimensional
    print(i, end='  ')

1  2  3  4  5  6  

### Creating different arrays

In [9]:
np.zeros(5)

array([0., 0., 0., 0., 0.])

In [10]:
np.ones((2, 4), dtype=int)

array([[1, 1, 1, 1],
       [1, 1, 1, 1]])

In [11]:
np.full((3, 5), 13)

array([[13, 13, 13, 13, 13],
       [13, 13, 13, 13, 13],
       [13, 13, 13, 13, 13]])

In [12]:
np.arange(5)

array([0, 1, 2, 3, 4])

In [13]:
np.arange(5, 10)

array([5, 6, 7, 8, 9])

In [14]:
np.arange(10, 1, -2)

array([10,  8,  6,  4,  2])

In [15]:
np.linspace(0.0, 1.0, num=5)

array([0.  , 0.25, 0.5 , 0.75, 1.  ])

### Reshaping Arrays
New shape must have the same number of elements as the original:

In [16]:
np.arange(1, 21)

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20])

In [17]:
np.arange(1, 21).reshape(4, 5)

array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20]])
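
The rule about the number of elements can be checked directly. The sketch below also uses -1 as one dimension, which asks NumPy to infer that dimension from the element count; the variable names are illustrative.

```python
import numpy as np

values = np.arange(1, 21)          # 20 elements

# -1 lets NumPy infer the missing dimension (here: 20 / 4 = 5 columns)
inferred = values.reshape(4, -1)
print(inferred.shape)

# A shape with a different element count raises a ValueError
try:
    values.reshape(3, 7)           # 3 * 7 = 21 != 20
except ValueError as error:
    print("reshape failed:", error)
```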

We can use reshape to display large arrays in a readable form. If an array has more than 1000 elements, NumPy summarizes the output and drops the middle rows, columns or both:

In [18]:
print(np.arange(1, 100001).reshape(4, 25000))
print(np.arange(1, 100001).reshape(100, 1000))

[[     1      2      3 ...  24998  24999  25000]
 [ 25001  25002  25003 ...  49998  49999  50000]
 [ 50001  50002  50003 ...  74998  74999  75000]
 [ 75001  75002  75003 ...  99998  99999 100000]]
[[     1      2      3 ...    998    999   1000]
 [  1001   1002   1003 ...   1998   1999   2000]
 [  2001   2002   2003 ...   2998   2999   3000]
 ...
 [ 97001  97002  97003 ...  97998  97999  98000]
 [ 98001  98002  98003 ...  98998  98999  99000]
 [ 99001  99002  99003 ...  99998  99999 100000]]
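
If the full output is needed, the summarization limit can be raised temporarily. A minimal sketch, assuming the default print options are active (the array large is made up for illustration):

```python
import numpy as np

large = np.arange(1, 2001)

# Default: arrays with more than 1000 elements are summarized with "..."
summarized = np.array2string(large)

# Temporarily raise the threshold so that every element is printed
with np.printoptions(threshold=large.size + 1):
    full = np.array2string(large)

print("..." in summarized, "..." in full)
```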


### Speed: Simple Python vs. Numpy

In [19]:
%%timeit
for i in range(len(a_list)):
    a_list[i] = a_list[i] * 5

The slowest run took 4.10 times longer than the fastest. This could mean that an intermediate result is being cached.
290 µs ± 134 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [20]:
%%timeit
a_np * 5

1.94 µs ± 229 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


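#### => In this run, NumPy is ~150 times faster!

Note that the loop above overwrites a_list in place, so its values keep growing with every timeit repetition. The measured list time is therefore inflated by arithmetic on very large numbers.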

Another possible notation is the line magic %timeit, which times a single statement:

In [21]:
%timeit rolls_array = np.random.randint(1, 7, 6_000_000)

109 ms ± 8.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
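
Outside of Jupyter, the magic commands are not available, but the standard library module timeit can be used instead. The following sketch compares a list comprehension with the NumPy expression on 100,000 values; the array size and the repetition count are arbitrary choices, and the timings depend on the machine.

```python
import timeit
import numpy as np

values_list = list(range(100_000))
values_np = np.arange(100_000)

# Each run builds a new result, so the inputs stay unchanged between repetitions
list_time = timeit.timeit(lambda: [x * 5 for x in values_list], number=20)
numpy_time = timeit.timeit(lambda: values_np * 5, number=20)

print(f"list comprehension: {list_time:.4f} s")
print(f"NumPy:              {numpy_time:.4f} s")
```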


### Array operators

In [22]:
A = np.array([[2, 4], [5, -6]])
B = np.array([[9, -3], [3, 6]])
print(A)
print() # for a blank line
print(B)

[[ 2  4]
 [ 5 -6]]

[[ 9 -3]
 [ 3  6]]


Applying a scalar value to all values in an array works with all arithmetic operators:

In [23]:
print(A * 3)
A - 3

[[  6  12]
 [ 15 -18]]


array([[-1,  1],
       [ 2, -9]])

#### Hadamard [elementwise] product (broadcasting)
If we multiply two matrices in NumPy using "*", they are multiplied element by element. This is also called the Hadamard product.

<img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/4eb9bb54b2820fb3583901ec05bc4b474b6d90bc">

In [24]:
A*B

array([[ 18, -12],
       [ 15, -36]])

These elementwise operations work with all arithmetic operators if the arrays have the same shape (or shapes that can be broadcast to a common shape):

In [25]:
A + B

array([[11,  1],
       [ 8,  0]])
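
Broadcasting also covers arrays of different but compatible shapes. In the sketch below, row and column are made-up arrays that are stretched to the shape of A before the elementwise operation is applied.

```python
import numpy as np

A = np.array([[2, 4], [5, -6]])
row = np.array([10, 100])        # shape (2,)
column = np.array([[1], [2]])    # shape (2, 1)

scaled = A * row        # each row of A is multiplied elementwise by `row`
shifted = A + column    # 1 is added to the first row, 2 to the second row

print(scaled)
print(shifted)
```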

#### Dot Product
If we want the matrix product instead, we can use NumPy's dot function. For two one-dimensional vectors, dot computes the dot product (also called the inner product).

In [26]:
np.dot(A,B)

array([[ 30,  18],
       [ 27, -51]])
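
As a small illustration of the two cases: for 2-D arrays the "@" operator gives the same matrix product as np.dot, and for 1-D arrays np.dot returns a single number. The vectors v and w are made up for this sketch.

```python
import numpy as np

A = np.array([[2, 4], [5, -6]])
B = np.array([[9, -3], [3, 6]])

# For 2-D arrays, `@` and np.dot both compute the matrix product
same_result = np.array_equal(A @ B, np.dot(A, B))
print(same_result)

# For 1-D arrays, np.dot returns the scalar inner product
v = np.array([1, 2, 3])
w = np.array([4, 5, 6])
inner = np.dot(v, w)   # 1*4 + 2*5 + 3*6
print(inner)
```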

### NumPy calculation methods

These calculations are performed over all elements:

In [27]:
grades = np.array([[87, 96, 70], [100, 87, 90],
                   [94, 77, 90], [100, 81, 82]])

In [28]:
grades

array([[ 87,  96,  70],
       [100,  87,  90],
       [ 94,  77,  90],
       [100,  81,  82]])

In [29]:
grades.sum()

1054

In [30]:
grades.min()

70

In [31]:
grades.max()

100

In [32]:
grades.mean()

87.83333333333333

In [33]:
grades.std()

8.792357792739987

In [34]:
grades.var()

77.30555555555556

We can specify that the calculations should be performed only over columns or rows:

In [35]:
grades.mean(axis=0) # column-by-column

array([95.25, 85.25, 83.  ])

In [36]:
grades.mean(axis=1) # row-by-row

array([84.33333333, 92.33333333, 87.        , 87.66666667])
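
The axis argument works the same way for the other calculation methods. The sketch below assumes, purely for illustration, that each row holds one student's grades and each column one exam.

```python
import numpy as np

grades = np.array([[87, 96, 70], [100, 87, 90],
                   [94, 77, 90], [100, 81, 82]])

best_per_exam = grades.max(axis=0)        # one value per column
best_per_student = grades.max(axis=1)     # one value per row
best_exam_index = grades.argmax(axis=1)   # column index of each row's maximum

print(best_per_exam)
print(best_per_student)
print(best_exam_index)
```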

## Pandas
Pandas is one of the most widely used Python libraries in data science. It provides high-performance, easy-to-use data structures and data analysis tools. Unlike NumPy, which provides objects for multi-dimensional arrays, Pandas provides an in-memory 2D table object called DataFrame. It is like a spreadsheet with column names and row labels.

In [37]:
import pandas as pd

Pandas provides two key collections: Series and DataFrames.

### Series

The series is one of the core data structures in pandas. You can think of it as a cross between a list and a dictionary. The items are all stored in order, and there are labels by which you can retrieve them.

#### Creating series

In [38]:
animals = ['Tiger', 'Bear', 'Moose']
pd.Series(animals)

0    Tiger
1     Bear
2    Moose
dtype: object

In [39]:
numbers = [1, 2, 3]
pd.Series(numbers)

0    1
1    2
2    3
dtype: int64

In [40]:
animals = ['Tiger', 'Bear', None]
pd.Series(animals)

0    Tiger
1     Bear
2     None
dtype: object

In [41]:
numbers = [1, 2, None]
pd.Series(numbers)

0    1.0
1    2.0
2    NaN
dtype: float64

In [42]:
import numpy as np
np.nan == None

False

In [43]:
np.nan == np.nan

False

In [44]:
np.isnan(np.nan)

True
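
Because comparisons with NaN are always False, missing values in a series are usually located with isna(). A minimal sketch using the numbers from above:

```python
import numpy as np
import pandas as pd

numbers = pd.Series([1, 2, None])

# `== np.nan` never matches, so it cannot be used to find missing values
found_by_comparison = (numbers == np.nan).any()
print(found_by_comparison)

# isna() marks the missing entries instead
missing = numbers.isna()
print(missing)

# None and NaN are both treated as missing by pd.isna
print(pd.isna(None), pd.isna(np.nan))
```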

In [45]:
sports = {'Football': 'Germany',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports)
print(s)
print(s.index)

Football         Germany
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
dtype: object
Index(['Football', 'Golf', 'Sumo', 'Taekwondo'], dtype='object')


In [46]:
s = pd.Series(['Tiger', 'Bear', 'Moose'], index=['India', 'America', 'Canada'])
s

India      Tiger
America     Bear
Canada     Moose
dtype: object

#### Querying series (change or add data)

In [47]:
s.iloc[2] # by integer index

'Moose'

In [48]:
s[2] # positional fallback: doesn't work if we used integer labels and is deprecated in recent pandas versions, prefer s.iloc[2]

'Moose'

In [49]:
s.loc['Canada'] # by label

'Moose'

In [50]:
s['Canada']

'Moose'
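
The difference between labels and positions becomes visible once the labels are integers themselves. The series s_int below is a hypothetical example whose labels do not match the positions.

```python
import pandas as pd

# Hypothetical series whose labels are integers that do not match the positions
s_int = pd.Series(['Tiger', 'Bear', 'Moose'], index=[10, 20, 30])

print(s_int.loc[10])    # label 10    -> first element
print(s_int.iloc[0])    # position 0  -> first element

# With integer labels, s_int[0] is looked up as a label and fails
try:
    s_int[0]
except KeyError:
    print("0 is not a label of s_int")
```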

#### Working with series

In [51]:
s = pd.Series([100.00, 120.00, 101.00, 3.00])
s

0    100.0
1    120.0
2    101.0
3      3.0
dtype: float64

In [52]:
np.sum(s) # much faster than a for loop

324.0

In [53]:
len(s)

4

In [54]:
s+=2 # broadcasting (NumPy) also works in Pandas
s

0    102.0
1    122.0
2    103.0
3      5.0
dtype: float64

Appending one series to another does not change the original series; it creates a new one:

In [55]:
original_sports = pd.Series({'Football': 'Germany',
                             'Golf': 'Scotland',
                             'Sumo': 'Japan',
                             'Taekwondo': 'South Korea'})

cricket_loving_countries = pd.Series(['Australia',
                                      'Barbados',
                                      'Pakistan',
                                      'England'], 
                                   index=['Cricket',
                                          'Cricket',
                                          'Cricket',
                                          'Cricket'])

all_countries = pd.concat([original_sports, cricket_loving_countries]) # Series.append was removed in pandas 2.0
all_countries

Football         Germany
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
Cricket        Australia
Cricket         Barbados
Cricket         Pakistan
Cricket          England
dtype: object

In [56]:
all_countries.loc['Cricket'] # returns a series itself

Cricket    Australia
Cricket     Barbados
Cricket     Pakistan
Cricket      England
dtype: object

### Dataframes

The DataFrame is conceptually a two-dimensional series object: it has an index and multiple columns of content, and each column has a label. The distinction between a column and a row is largely conceptual, so you can think of the DataFrame as a labeled array with two axes, or simply as a table.

#### Creating dataframes

In [57]:
import pandas as pd
purchase_1 = pd.Series({'Name': 'Matthias',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Thomas',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Christina',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})

df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
df

              Name  Item Purchased  Cost
Store 1   Matthias        Dog Food  22.5
Store 1     Thomas    Kitty Litter   2.5
Store 2  Christina       Bird Seed   5.0
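
A dataframe can also be built column by column from a dictionary of lists. The sketch below recreates the same table this way; the name df_from_dict is only illustrative.

```python
import pandas as pd

# Same data as above, organized by column instead of by row
df_from_dict = pd.DataFrame(
    {'Name': ['Matthias', 'Thomas', 'Christina'],
     'Item Purchased': ['Dog Food', 'Kitty Litter', 'Bird Seed'],
     'Cost': [22.50, 2.50, 5.00]},
    index=['Store 1', 'Store 1', 'Store 2'])

print(df_from_dict.shape)
print(df_from_dict.dtypes)
```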


#### Selecting values / Creating Subsets
We can use this, for example, to create a smaller subset of our original dataset, e.g. because we don't need all columns or because we just want to experiment with some of the rows first.

In [58]:
df

              Name  Item Purchased  Cost
Store 1   Matthias        Dog Food  22.5
Store 1     Thomas    Kitty Litter   2.5
Store 2  Christina       Bird Seed   5.0


In [59]:
df['Cost'] # select 1 column

Store 1    22.5
Store 1     2.5
Store 2     5.0
Name: Cost, dtype: float64

In [60]:
print(df.T)
df.T.loc['Cost'] # (bad) alternative to select 1 column

                 Store 1       Store 1    Store 2
Name            Matthias        Thomas  Christina
Item Purchased  Dog Food  Kitty Litter  Bird Seed
Cost                22.5           2.5          5


Store 1    22.5
Store 1     2.5
Store 2       5
Name: Cost, dtype: object

In [61]:
columns = ['Item Purchased','Cost'] # select multiple columns
df[columns]

         Item Purchased  Cost
Store 1        Dog Food  22.5
Store 1    Kitty Litter   2.5
Store 2       Bird Seed   5.0


In [62]:
df[['Item Purchased','Cost']] # select multiple columns (compact)

         Item Purchased  Cost
Store 1        Dog Food  22.5
Store 1    Kitty Litter   2.5
Store 2       Bird Seed   5.0


But this doesn't work as well when we want to specify rows and columns at the same time. Instead, we can use the iloc and loc indexers.

With iloc we select the cell(s) by integer position (remember that positions start at 0!).

With loc we select by the labels of a pandas dataframe (e.g. the row labels or the column names).

In [63]:
df

              Name  Item Purchased  Cost
Store 1   Matthias        Dog Food  22.5
Store 1     Thomas    Kitty Litter   2.5
Store 2  Christina       Bird Seed   5.0


In [64]:
df.iloc[0].iloc[1]

'Dog Food'

In [65]:
df.iloc[1]

Name                    Thomas
Item Purchased    Kitty Litter
Cost                       2.5
Name: Store 1, dtype: object

In [66]:
df.iloc[[0, 1]]

             Name  Item Purchased  Cost
Store 1  Matthias        Dog Food  22.5
Store 1    Thomas    Kitty Litter   2.5


In [67]:
df.iloc[0:2]

             Name  Item Purchased  Cost
Store 1  Matthias        Dog Food  22.5
Store 1    Thomas    Kitty Litter   2.5


In [68]:
df.loc['Store 2']

Name              Christina
Item Purchased    Bird Seed
Cost                      5
Name: Store 2, dtype: object

In [69]:
df.loc['Store 1']

             Name  Item Purchased  Cost
Store 1  Matthias        Dog Food  22.5
Store 1    Thomas    Kitty Litter   2.5


In [70]:
df.loc[['Store 1', 'Store 2']]

              Name  Item Purchased  Cost
Store 1   Matthias        Dog Food  22.5
Store 1     Thomas    Kitty Litter   2.5
Store 2  Christina       Bird Seed   5.0


In [71]:
df.loc["Store 1":"Store 2"]

              Name  Item Purchased  Cost
Store 1   Matthias        Dog Food  22.5
Store 1     Thomas    Kitty Litter   2.5
Store 2  Christina       Bird Seed   5.0


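Depending on what we pass to iloc and loc, we get back different types:

A single row and a single column (two scalars) -> single value

A single row or a single column, combined with a range, list or colon for the other axis -> pandas series (like a one-dimensional dataframe)

Two ranges, lists, colons or combinations of these -> another dataframe

A row label that occurs more than once (like 'Store 1') behaves like a list of rows.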

In [72]:
type(df.loc['Store 2'])

pandas.core.series.Series

In [73]:
type(df.loc['Store 1'])

pandas.core.frame.DataFrame
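
The return types can be verified with a few calls. The sketch below rebuilds a two-column version of the dataframe so that it runs on its own; the result names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Matthias', 'Thomas', 'Christina'],
     'Cost': [22.50, 2.50, 5.00]},
    index=['Store 1', 'Store 1', 'Store 2'])

single_value = df.iloc[2, 1]           # one row position, one column position
one_row = df.loc['Store 2']            # unique row label -> Series
rows_as_list = df.loc[['Store 2']]     # list of labels -> DataFrame
repeated = df.loc['Store 1', 'Cost']   # repeated row label, one column -> Series

for result in (single_value, one_row, rows_as_list, repeated):
    print(type(result))
```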

We can combine those concepts to select rows and columns

In [74]:
df.iloc[0]

Name              Matthias
Item Purchased    Dog Food
Cost                  22.5
Name: Store 1, dtype: object

In [75]:
df.iloc[0]['Cost'] # chaining - select rows with iloc[] and then columns with []

22.5

We can also select ranges of values

In [76]:
df.iloc[0:2]

             Name  Item Purchased  Cost
Store 1  Matthias        Dog Food  22.5
Store 1    Thomas    Kitty Litter   2.5


In [77]:
df.iloc[0:2]["Cost"] # chaining - select rows with iloc[] and then columns with []

Store 1    22.5
Store 1     2.5
Name: Cost, dtype: float64

Instead of chaining, we can also pass the row and column indexers to loc or iloc directly:

In [78]:
df.loc['Store 1','Cost'] # loc[row_indexer, column_indexer]

Store 1    22.5
Store 1     2.5
Name: Cost, dtype: float64

In [79]:
df.iloc[0:2, 0:2] # iloc[row_indexer, column_indexer]

             Name  Item Purchased
Store 1  Matthias        Dog Food
Store 1    Thomas    Kitty Litter


Entering a colon (":") for either of the index values means "choose all"

In [80]:
df.loc[:,"Cost"]

Store 1    22.5
Store 1     2.5
Store 2     5.0
Name: Cost, dtype: float64

In [81]:
df.loc[:,['Name', 'Cost']]

              Name  Cost
Store 1   Matthias  22.5
Store 1     Thomas   2.5
Store 2  Christina   5.0


We can also check for logical conditions and thereby search for cells where certain conditions are fulfilled.

In [82]:
df

              Name  Item Purchased  Cost
Store 1   Matthias        Dog Food  22.5
Store 1     Thomas    Kitty Litter   2.5
Store 2  Christina       Bird Seed   5.0


Logical checks can be done just like you might expect using the comparison operators "==", "!=", "<=" etc.

When we check for a condition, we get a dataframe (or series) full of boolean values. Let's check which values of the column Cost are greater than or equal to 5:

In [83]:
df['Cost'] >= 5.0

Store 1     True
Store 1    False
Store 2     True
Name: Cost, dtype: bool

We can also apply these boolean values to select only certain rows or columns of the dataset where the values fulfill a condition.

In [84]:
mask = df['Cost'] >= 5.0
mask

Store 1     True
Store 1    False
Store 2     True
Name: Cost, dtype: bool

In [85]:
df[mask]

              Name  Item Purchased  Cost
Store 1   Matthias        Dog Food  22.5
Store 2  Christina       Bird Seed   5.0


We can even skip a step here and insert the logic check in the selection command

In [86]:
df[df['Cost'] >= 5.0] # we can also use logical operators to make more than 1 logic check

              Name  Item Purchased  Cost
Store 1   Matthias        Dog Food  22.5
Store 2  Christina       Bird Seed   5.0
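
Several checks can be combined with "&" (and), "|" (or) and "~" (not). Each condition needs its own parentheses because these operators bind more tightly than the comparisons. A minimal sketch with made-up conditions:

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Matthias', 'Thomas', 'Christina'],
     'Item Purchased': ['Dog Food', 'Kitty Litter', 'Bird Seed'],
     'Cost': [22.50, 2.50, 5.00]},
    index=['Store 1', 'Store 1', 'Store 2'])

# Rows that are cheap OR contain dog food
cheap_or_dog = df[(df['Cost'] < 5.0) | (df['Item Purchased'] == 'Dog Food')]

# Rows that belong to Store 1 AND cost at least 5
store1_and_expensive = df[(df.index == 'Store 1') & (df['Cost'] >= 5.0)]

print(cheap_or_dog)
print(store1_and_expensive)
```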


Instead of applying the boolean mask directly, we could also use .where(). This returns a dataframe of the original shape in which the rows that do not fulfill the condition are filled with NaN, so we have to call .dropna() afterwards to remove them.

In [87]:
above_5 = df.where(df['Cost'] >= 5.0)
above_5

              Name  Item Purchased  Cost
Store 1   Matthias        Dog Food  22.5
Store 1        NaN             NaN   NaN
Store 2  Christina       Bird Seed   5.0


In [88]:
above_5 = above_5.dropna()
above_5

              Name  Item Purchased  Cost
Store 1   Matthias        Dog Food  22.5
Store 2  Christina       Bird Seed   5.0


#### Manipulating values
Just like we can select values, we can also manipulate them or use logical conditions to 'mask' the dataframe.

Let's start with manipulation.

Instead of just selecting the data, we can also overwrite it:

In [89]:
df

              Name  Item Purchased  Cost
Store 1   Matthias        Dog Food  22.5
Store 1     Thomas    Kitty Litter   2.5
Store 2  Christina       Bird Seed   5.0


In [90]:
df.iloc[:2,0] = 'Dominik' # sets the first 2 rows of the first column to Dominik
df

              Name  Item Purchased  Cost
Store 1    Dominik        Dog Food  22.5
Store 1    Dominik    Kitty Litter   2.5
Store 2  Christina       Bird Seed   5.0
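
Overwriting can be combined with a boolean mask so that only the cells fulfilling a condition are changed. In the sketch below, the minimum price of 5.0 is a made-up rule and adjusted is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Matthias', 'Thomas', 'Christina'],
     'Item Purchased': ['Dog Food', 'Kitty Litter', 'Bird Seed'],
     'Cost': [22.50, 2.50, 5.00]},
    index=['Store 1', 'Store 1', 'Store 2'])

adjusted = df.copy()                  # keep the original dataframe unchanged
cheap = adjusted['Cost'] < 5.0        # boolean mask selecting the rows
adjusted.loc[cheap, 'Cost'] = 5.0     # overwrite only the selected cells
print(adjusted)
```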


#### Deleting data
We have two options to drop data:

In [91]:
df.drop('Store 1') # option 1 / we can also remove a column by specifying axis = 1

              Name  Item Purchased  Cost
Store 2  Christina       Bird Seed   5.0


In [92]:
df

              Name  Item Purchased  Cost
Store 1    Dominik        Dog Food  22.5
Store 1    Dominik    Kitty Litter   2.5
Store 2  Christina       Bird Seed   5.0


In [93]:
copy_df = df.copy()
copy_df = copy_df.drop('Store 1')
copy_df

              Name  Item Purchased  Cost
Store 2  Christina       Bird Seed   5.0


In [94]:
del copy_df['Name'] # option 2
copy_df

         Item Purchased  Cost
Store 2       Bird Seed   5.0
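
The drop method also accepts the keywords index and columns, which can be easier to read than the axis argument. A short sketch showing that drop returns a new dataframe while del changes the dataframe in place:

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Matthias', 'Thomas', 'Christina'],
     'Item Purchased': ['Dog Food', 'Kitty Litter', 'Bird Seed'],
     'Cost': [22.50, 2.50, 5.00]},
    index=['Store 1', 'Store 1', 'Store 2'])

without_name = df.drop(columns=['Name'])      # same as df.drop('Name', axis=1)
without_store1 = df.drop(index='Store 1')     # same as df.drop('Store 1')

print(list(without_name.columns))
print(list(without_store1.index))
print(list(df.columns))                       # df itself is unchanged
```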


#### Creating new columns
By "selecting" columns that don't exist yet we can also create new columns. This is especially useful if we want to generate new features out of existing data in the dataset.

Often you can also use this to add comments, timestamps or class labels.

In [95]:
from datetime import datetime as dt

In [96]:
df['Timestamp'] = dt.now() # adds a column with same values; we could also add different values (SAME LENGTH!)
df

              Name  Item Purchased  Cost                   Timestamp
Store 1    Dominik        Dog Food  22.5  2022-11-24 16:43:57.651632
Store 1    Dominik    Kitty Litter   2.5  2022-11-24 16:43:57.651632
Store 2  Christina       Bird Seed   5.0  2022-11-24 16:43:57.651632
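
New features are typically computed from existing columns. In the sketch below, the tax rate of 19 % and the price limit of 10 are made-up values, and the new column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Matthias', 'Thomas', 'Christina'],
     'Item Purchased': ['Dog Food', 'Kitty Litter', 'Bird Seed'],
     'Cost': [22.50, 2.50, 5.00]},
    index=['Store 1', 'Store 1', 'Store 2'])

# Numeric feature derived from an existing column (hypothetical tax rate)
df['Cost incl. tax'] = df['Cost'] * 1.19

# Boolean feature derived from a condition (hypothetical price limit)
df['Expensive'] = df['Cost'] > 10

print(df)
```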


#### Reading a CSV

In [97]:
df = pd.read_csv('data/olympics.csv') # raises an error unless data/olympics.csv exists relative to the notebook

FileNotFoundError: [Errno 2] No such file or directory: 'data/olympics.csv'

In [98]:
df = pd.read_csv('data/olympics.csv', index_col=0, skiprows=1) # first column becomes the row labels; the first line is skipped, so the second line provides the column names

FileNotFoundError: [Errno 2] No such file or directory: 'data/olympics.csv'
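
The effect of index_col and skiprows can be tried without the file by reading from an in-memory string. The CSV content below is entirely made up and is not the olympics dataset.

```python
import io
import pandas as pd

# Hypothetical file content; the first line is a title row that should be skipped
csv_text = """Medal table (made-up numbers)
Country,Gold,Silver
A,3,1
B,0,2
"""

# skiprows=1 drops the title line, index_col=0 uses 'Country' as row labels
medals = pd.read_csv(io.StringIO(csv_text), index_col=0, skiprows=1)
print(medals)
```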

### Working with Pandas

We can now use all the functionality that Pandas offers (minimum/maximum (idxmin/idxmax), sorting, counting, ...). Keep in mind that in most cases we have to specify along which axis a function should be applied.
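
A few of these functions, sketched on the small purchases table from above since the olympics file is not available here (the result names are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Matthias', 'Thomas', 'Christina'],
     'Item Purchased': ['Dog Food', 'Kitty Litter', 'Bird Seed'],
     'Cost': [22.50, 2.50, 5.00]},
    index=['Store 1', 'Store 1', 'Store 2'])

most_expensive_store = df['Cost'].idxmax()           # index label of the highest cost
by_cost = df.sort_values('Cost', ascending=False)    # rows sorted by cost
rows_per_store = df.index.value_counts()             # number of rows per index label
column_maximum = df[['Cost']].max(axis=0)            # axis=0: maximum of each column

print(most_expensive_store)
print(by_cost)
print(rows_per_store)
print(column_maximum)
```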

#### Check the Lecture slides for a summary of Pandas Dataframes!