## 3. Numerical Analysis Library 🔢
Understanding the data is an integral part of any end-to-end machine learning project. Libraries such as NumPy and Pandas let you do this with little effort.

### 3.1 NumPy
NumPy (Numerical Python) is the core library for scientific computing in Python. It provides multi-dimensional arrays and lets users perform mathematical computations on them efficiently and easily.

#### 3.1.1 Housekeeping

Before starting, make sure you have installed the NumPy package by running this shell command:

In [104]:
!pip install numpy



In [105]:
import numpy as np      # Import the NumPy library and refer to it as np (a convention)
print(np.__version__)   # Print the installed NumPy version; useful when looking up the matching documentation

1.21.0


#### 3.1.2 Initializing Array / Matrix

NumPy offers a very intuitive way of representing matrices as multidimensional arrays.
Shown below are some of the ways of initializing arrays:
- `np.array()` : Create an array/matrix by specifying each element
- `np.ones()` : Create a matrix full of ones
- `np.zeros()` : Create a matrix full of zeros
- `np.eye()` : Create an identity matrix
- `np.random.random()` : Create a matrix filled with random numbers in the interval [0, 1)
- `np.arange()` : Create an array of evenly spaced values in increasing order (for example 0, 1, 2, ...)
- `np.empty()` : Create an uninitialized matrix placeholder

Further explanation can be found in NumPy's documentation at https://numpy.org/doc/
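
The list above includes `np.array()`, which the exercises below do not cover. A minimal sketch of building a matrix from nested lists (the values are arbitrary and chosen only for illustration):

```python
import numpy as np

# Build a 2x3 matrix by listing every element explicitly
a = np.array([[1, 2, 3],
              [4, 5, 6]])

print(a)
print(a.shape)  # (rows, columns) -> (2, 3)
print(a.ndim)   # number of dimensions -> 2
```

Each inner list becomes one row, so all inner lists should have the same length.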

In [54]:
# TODO: Create a 3x3 matrix full of ones

# Expected outcome:
# [[1. 1. 1.]
#  [1. 1. 1.]
#  [1. 1. 1.]]

b = np.ones((3, 3))
print("Matrix b")
print(b)

Matrix b
[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]


In [55]:
# TODO: Create a 5x5 matrix full of zeros

# Expected outcome:
# [[0. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 0.]]

c = np.zeros((5, 5))
print("Matrix c")
print(c)

Matrix c
[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]


In [57]:
# TODO: Create a 2x2 identity matrix

# Expected outcome:
# [[1. 0.]
#  [0. 1.]]

d = np.eye(2)
print("Matrix d")
print(d)

Matrix d
[[1. 0.]
 [0. 1.]]


In [73]:
# TODO: Create a 3x3 matrix filled with random numbers between 0 and 1

# Expected outcome (seed = 41):
#  [[0.25092362 0.04609582 0.67681624]
#  [0.04346949 0.1164237  0.60386569]
#  [0.19093066 0.66851572 0.91744785]]

# Further information on what a 'seed' is and how NumPy generates random numbers is given below:
# 'https://www.w3schools.com/python/numpy/numpy_random.asp'

np.random.seed(41)
e = np.random.random((3, 3))
print("\nMatrix e")
print(e)


Matrix e
[[0.25092362 0.04609582 0.67681624]
 [0.04346949 0.1164237  0.60386569]
 [0.19093066 0.66851572 0.91744785]]
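
Seeding is what makes the "random" numbers reproducible. As an aside, recent NumPy versions also offer a generator-based interface, `np.random.default_rng`. The sketch below (seed value chosen arbitrarily) shows two generators created with the same seed producing identical matrices. The values will not match those from `np.random.seed(41)` above, because this interface uses a different underlying generator.

```python
import numpy as np

# Two independent generators created with the same seed
rng_a = np.random.default_rng(41)
rng_b = np.random.default_rng(41)

e_a = rng_a.random((3, 3))   # 3x3 matrix of floats in [0, 1)
e_b = rng_b.random((3, 3))

print(np.array_equal(e_a, e_b))  # True: same seed, same numbers
```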


In [74]:
# TODO: Create an array which has 0-9 as its elements in sorted order

# expected output: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

f = np.arange(10)
print("\nMatrix f")
print(f)



Matrix f
[0 1 2 3 4 5 6 7 8 9]
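
`np.arange()` also accepts a start, a stop (excluded) and a step, much like Python's built-in `range`. A short sketch with arbitrarily chosen bounds:

```python
import numpy as np

evens = np.arange(2, 11, 2)       # start=2, stop=11 (excluded), step=2
countdown = np.arange(5, 0, -1)   # a negative step counts downwards

print(evens)      # [ 2  4  6  8 10]
print(countdown)  # [5 4 3 2 1]
```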


In [82]:
# TODO: Create a 5x3 matrix placeholder without initializing its entries (the elements of the matrix).
# np.empty only allocates memory: the entries are NOT randomly generated, they are whatever values
# happened to be in that memory, and they are meant to be overwritten at a later stage (hence "placeholder").
# Because no random numbers are involved, np.random.seed() has no effect on the result.

# Example outcome (the actual values are arbitrary and will differ between runs):
#    [[0.         0.         0.4472136 ]
#    [0.0531494  0.18257419 0.4472136 ]
#    [0.2125976  0.36514837 0.4472136 ]
#    [0.4783446  0.54772256 0.4472136 ]
#    [0.85039041 0.73029674 0.4472136 ]]

g = np.empty((5, 3))
print("\nMatrix g")
print(g)


Matrix g
[[0.         0.         0.4472136 ]
 [0.0531494  0.18257419 0.4472136 ]
 [0.2125976  0.36514837 0.4472136 ]
 [0.4783446  0.54772256 0.4472136 ]
 [0.85039041 0.73029674 0.4472136 ]]
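
`np.empty()` only allocates memory, so the values printed above are whatever was left in that memory and will generally differ between runs. A sketch of the usual pattern, assuming you intend to overwrite every entry before reading it (the fill rule here is made up for illustration):

```python
import numpy as np

g = np.empty((5, 3))          # contents are arbitrary at this point

# Overwrite every row before using the array
for i in range(g.shape[0]):
    g[i, :] = i               # row i is filled with the value i

print(g)
```

If you need known starting values instead, `np.zeros()` or `np.ones()` is the safer choice.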


#### 3.1.3 Matrix Operations

In machine learning, we deal with a lot of matrix calculations, so it is worth getting accustomed to some of the common operations. Here are the first few:

- `np.transpose()` : Find the transpose of an array
- `np.dot(a, b)` : Calculate the dot product of two arrays (matrix multiplication for two-dimensional arrays)
- `np.linalg.inv()` : Calculate the inverse of a matrix (only valid for square, non-singular matrices of dimension n * n)
- `np.diagonal()` : Extract the diagonal elements of a two-dimensional array
- `a.reshape((x, y))` : Reshape an array into x rows and y columns

Now let's check what each of them does by filling in the cells below.

In [43]:
# Initialise the data we will use below
x = np.array([
    [3, 11, 1, 4],
    [7, 5, 2, 7],
    [6, 8, 9, 7],
    [0, 10, 4, 2]
])
x

array([[ 3, 11,  1,  4],
       [ 7,  5,  2,  7],
       [ 6,  8,  9,  7],
       [ 0, 10,  4,  2]])

In [44]:
# TODO: Transpose the array

# Expected outcome:
# [[ 3  7  6  0]
#  [11  5  8 10]
#  [ 1  2  9  4]
#  [ 4  7  7  2]]

transposed = np.transpose(x)
transposed

array([[ 3,  7,  6,  0],
       [11,  5,  8, 10],
       [ 1,  2,  9,  4],
       [ 4,  7,  7,  2]])

In [52]:
# TODO: Dot product of two arrays: the original x and its transpose

# Expected outcome:
# [[147 106 143 122]
#  [106 127 149  72]
#  [143 149 230 130]
#  [122  72 130 120]]

y = np.dot(x, transposed)
y

array([[147, 106, 143, 122],
       [106, 127, 149,  72],
       [143, 149, 230, 130],
       [122,  72, 130, 120]])
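
For two-dimensional arrays, `np.dot` performs matrix multiplication. A short sketch, reusing the same values as `x` above, showing the equivalent `@` operator and the `.T` shorthand for the transpose:

```python
import numpy as np

x = np.array([
    [3, 11, 1, 4],
    [7, 5, 2, 7],
    [6, 8, 9, 7],
    [0, 10, 4, 2]
])

y_dot = np.dot(x, np.transpose(x))
y_at = x @ x.T                       # same computation, shorter syntax

print(np.array_equal(y_dot, y_at))   # True
print(np.array_equal(y_at, y_at.T))  # True: a matrix times its transpose is symmetric
```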

In [47]:
# TODO: Calculate the inverse matrix of x

# Expected outcome: 
#    [[ 1.38095238 -1.14285714  0.80952381 -1.5952381 ]
#    [ 0.29365079 -0.21428571  0.1031746  -0.1984127 ]
#    [ 0.07142857 -0.21428571  0.21428571 -0.14285714]
#    [-1.61111111  1.5        -0.94444444  1.77777778]]

inverse = np.linalg.inv(x)
print(inverse)

[[ 1.38095238 -1.14285714  0.80952381 -1.5952381 ]
 [ 0.29365079 -0.21428571  0.1031746  -0.1984127 ]
 [ 0.07142857 -0.21428571  0.21428571 -0.14285714]
 [-1.61111111  1.5        -0.94444444  1.77777778]]
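
One way to check an inverse is to multiply it by the original matrix and compare the result with the identity matrix. The sketch below does this for the same `x`, and also shows what happens with a singular matrix (a small made-up example), which has no inverse:

```python
import numpy as np

x = np.array([
    [3, 11, 1, 4],
    [7, 5, 2, 7],
    [6, 8, 9, 7],
    [0, 10, 4, 2]
], dtype=float)

inverse = np.linalg.inv(x)

# x times its inverse equals the identity matrix, up to rounding error
is_identity = np.allclose(x @ inverse, np.eye(4))
print(is_identity)  # True

# The second row is twice the first, so this matrix is singular
singular = np.array([[1.0, 2.0],
                     [2.0, 4.0]])
try:
    np.linalg.inv(singular)
    inverted = True
except np.linalg.LinAlgError:
    inverted = False
print(inverted)  # False
```

`np.allclose` is used instead of `==` because floating-point results are rarely exactly equal to the identity.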


In [48]:
# TODO: Extract the diagonal elements of an array x

# Expected outcome: [3 5 9 2]

diagonal = np.diag(x)
print(diagonal)

[3 5 9 2]


In [51]:
# TODO: Reshape an array x to one that has 8 rows and 2 columns

# Expected outcome: 
#    [[ 3 11]
#    [ 1  4]
#    [ 7  5]
#    [ 2  7]
#    [ 6  8]
#    [ 9  7]
#    [ 0 10]
#    [ 4  2]]

reshaped = x.reshape((8, 2))
print(reshaped)

[[ 3 11]
 [ 1  4]
 [ 7  5]
 [ 2  7]
 [ 6  8]
 [ 9  7]
 [ 0 10]
 [ 4  2]]
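
Reshaping only works when the new shape holds exactly the same number of elements. A minimal sketch on a made-up 16-element array, also showing `-1`, which asks NumPy to infer one dimension from the others:

```python
import numpy as np

values = np.arange(16)               # 16 elements: 0, 1, ..., 15

auto_rows = values.reshape(-1, 2)    # -1 lets NumPy infer the row count
print(auto_rows.shape)               # (8, 2)

try:
    values.reshape(3, 5)             # 3 * 5 = 15 does not match 16 elements
    reshaped_ok = True
except ValueError:
    reshaped_ok = False
print(reshaped_ok)                   # False
```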


#### 3.1.4 Statistics in Numpy

When we deal with large amounts of data, we often want to know things about the data as a whole. This is where NumPy's statistics functions come to the rescue. Most of them are self-explanatory:

- `np.sum()` : Sum of all elements in an array
- `np.max()` : Maximum value of an array
- `np.min()` : Minimum value of an array
- `np.mean()` : Mean of the elements in an array
- `np.median()` : Median value of the elements in an array
- `np.var()` : Variance of the elements in an array
- `np.std()` : Standard deviation of the elements in an array

As before, fill in the cells below to get used to these methods.

In [95]:
x = np.array(
    [34, 56, 6, 3, 9, 89, 120, 12, 201],
    dtype = np.int32
)

In [96]:
# TODO: Summation of elements 
# Expected outcome: 530

summation = np.sum(x)
print(summation)

530
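
The functions listed above reduce over all elements by default. On a multi-dimensional array they also accept an `axis` argument to reduce along rows or columns only. A small sketch with made-up values:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

total = np.sum(m)                 # all elements
column_sums = np.sum(m, axis=0)   # collapse the rows: one sum per column
row_sums = np.sum(m, axis=1)      # collapse the columns: one sum per row

print(total)        # 21
print(column_sums)  # [5 7 9]
print(row_sums)     # [ 6 15]
```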


In [97]:
# TODO: Maximum element in the array
# Expected outcome: 201

maximum = x.max()
print(maximum)

201


In [98]:
# TODO: Minimum element in the array
# Expected outcome: 3

minimum = x.min()
print(minimum)

3


In [99]:
# TODO: Average value of elements in the array
# Expected outcome: 58.89

mean = x.mean()
print(mean)

58.888888888888886


In [100]:
# TODO: Median element in the array
# Expected outcome: 34.0

median = np.median(x)
print(median)

34.0


In [101]:
# TODO: Variance of x
# Expected outcome: 4008.098765432099

variance = np.var(x)
print(variance)

4008.098765432099


In [102]:
# TODO: Standard deviation of the array
# Expected outcome: 63.3095471902311

std = np.std(x)
print(std)

63.3095471902311
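
By default `np.var()` and `np.std()` compute the population statistics (dividing by n). The sketch below, using the same values as `x`, shows how the two relate and how the `ddof` argument switches to the sample variance (dividing by n - 1):

```python
import numpy as np

x = np.array([34, 56, 6, 3, 9, 89, 120, 12, 201], dtype=np.int32)
n = x.size

population_var = np.var(x)        # divides by n (ddof=0 is the default)
sample_var = np.var(x, ddof=1)    # divides by n - 1

# The standard deviation is the square root of the variance
print(np.isclose(np.std(x) ** 2, population_var))            # True
# The two variances differ only by the factor n / (n - 1)
print(np.isclose(sample_var, population_var * n / (n - 1)))  # True
```

This matters when comparing against Pandas, whose `var()` and `std()` use `ddof=1` by default.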


#### 3.1.5 Further Resources
Shown below are some further resources you can dive into to improve your fluency with the NumPy library:
- [NumPy Quickstart Tutorial](https://numpy.org/devdocs/user/quickstart.html)
- [NumPy Tutorial - *by Nicolas Rougier*](https://github.com/rougier/numpy-tutorial)
- [Stanford CS231 - *by Justin Johnson*](https://cs231n.github.io/python-numpy-tutorial/)
- [Cheat Sheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf)
- [Documentation](https://numpy.org/doc/) 

### 3.2 Pandas

#### 3.2.1 Housekeeping
Make sure you've installed the Pandas library:

In [106]:
!pip install pandas



In [108]:
import pandas as pd       # Import the Pandas library and refer to it as pd (a convention)
print(pd.__version__)     # Print the installed Pandas version; useful when looking up the matching documentation

1.4.3


#### 3.2.2 Series
A Series in Pandas is a one-dimensional labeled array whose elements share a single data type.

Properties: index, values, dtype

To create a Series, use the function below:
- `pd.Series()`

In [147]:
# Creating a Series by passing a list of values along with its index:
s1 = pd.Series([1, 3, 5, np.nan, 6, 8], index=['A','B','C','D','E','F'])
print("-----Series 1-----\n", s1, "\n")


# Creating a Series from a list stored in a variable, along with its index and a name:
data = [1, 2, 3]
index = ['X', 'Y', 'Z']
s2 = pd.Series(data=data, index=index, name='series')
print("-----Series 2-----\n", s2, "\n")


# Creating a Series by passing a list of values, letting pandas create a default integer index:
s3 = pd.Series([1, 3, 5, np.nan, 6, 8])
print("-----Series 3-----\n", s3, "\n")


# Creating a Series by passing a dictionary (the keys become the index)
s4 = pd.Series(data = {'a':1, 'b':2, 'c':3})
print("-----Series 4-----\n", s4, "\n")

-----Series 1-----
 A    1.0
B    3.0
C    5.0
D    NaN
E    6.0
F    8.0
dtype: float64 

-----Series 2-----
 X    1
Y    2
Z    3
Name: series, dtype: int64 

-----Series 3-----
 0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64 

-----Series 4-----
 a    1
b    2
c    3
dtype: int64 



In [130]:
# Further information contained in 's1'
print("-----Series 1-----\n")
print("example.name:\n",s1.name,"\n")
print("example.values:\n",s1.values,"\n")
print("example.dtypes:\n",s1.dtypes,"\n")

-----Series 1-----

example.name:
 None 

example.values:
 [ 1.  3.  5. nan  6.  8.] 

example.dtypes:
 float64 
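
Beyond inspecting its properties, a Series can be accessed by label or by position, and its missing values can be counted. A minimal sketch, rebuilding a Series with the same values as `s1`:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8], index=['A', 'B', 'C', 'D', 'E', 'F'])

by_label = s['B']            # access by index label
by_position = s.iloc[0]      # access by integer position
n_missing = s.isna().sum()   # count NaN entries
average = s.mean()           # NaN values are skipped by default

print(by_label)     # 3.0
print(by_position)  # 1.0
print(n_missing)    # 1
print(average)      # 4.6
```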



#### 3.2.3 DataFrame

A DataFrame is the bread and butter of Pandas. It holds two-dimensional tabular data, and each column may have its own data type.

Properties: index, columns, values, dtypes

To create a DataFrame, use the function below:
- `pd.DataFrame()`

In [141]:
# Creating a DataFrame by passing a NumPy array and labeled columns:
df1 = pd.DataFrame(np.random.randn(6, 4), columns=['Price1','Price2','Price3','Price4'])
print("-----DataFrame 1-----\n", df1, "\n")

# Creating a DataFrame by passing a dictionary of objects that can be converted into a series-like structure:
df2 = pd.DataFrame({
    "A": 1.0,
    "B": pd.Timestamp("20130102"),
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": "foo",
})
print("-----DataFrame 2-----\n", df2, "\n")

-----DataFrame 1-----
      Price1    Price2    Price3    Price4
0  0.184946 -0.651056 -1.120507 -0.346607
1 -0.911345 -0.746135 -0.770969 -0.413954
2 -0.291811  0.126773 -1.247336  0.317171
3 -1.818421  0.055692  0.080866 -2.319848
4 -0.341121  1.089846  0.450531 -0.494299
5 -0.546577  0.292026  0.154205  0.817306 

-----DataFrame 2-----
      A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo 
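
The properties of a DataFrame can be inspected directly as attributes. A small sketch on a made-up two-column DataFrame (the column names are chosen for illustration only):

```python
import pandas as pd

df_small = pd.DataFrame({
    "name": ["a", "b", "c"],
    "score": [1.5, 2.5, 3.5],
})

print(df_small.index)    # row labels (a default integer index here)
print(df_small.columns)  # column labels
print(df_small.dtypes)   # one dtype per column
print(df_small.shape)    # (3, 2)
```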



#### 3.2.4 Reading File and Export DataFrame
In a real project, we will mostly be using a provided dataset. Hence, we need a way of importing that data (CSV/HDF5/Excel) into a DataFrame so that it can be easily processed and manipulated later. On top of that, we will also want a way to export our processed data to a file.

Functions for importing a dataset:
- `pd.read_csv()`
- `pd.read_hdf()`
- `pd.read_excel()`

Functions for exporting a DataFrame:
- `pd.DataFrame.to_csv()`
- `pd.DataFrame.to_hdf()`
- `pd.DataFrame.to_excel()`

In [156]:
df_import = pd.read_csv('username.csv', sep=';')     # Import a semicolon-separated CSV file called 'username.csv'
df_import.head()                                     # Show the first 5 rows of the dataset

    Username  Identifier First name Last name
0   booker12        9012     Rachel    Booker
1     grey07        2070      Laura      Grey
2  johnson81        4081      Craig   Johnson
3  jenkins46        9346       Mary   Jenkins
4    smith79        5079      Jamie     Smith


In [158]:
df_import.to_csv('username2.csv')           # Export a DataFrame into a file called 'username2.csv' 
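
By default `to_csv()` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back. A sketch of the round trip, using an in-memory buffer instead of a real file and made-up data; passing `index=False` avoids the extra column:

```python
import io
import pandas as pd

df_demo = pd.DataFrame({"Username": ["booker12", "grey07"],
                        "Identifier": [9012, 2070]})

# Default: the index is written as an extra, unnamed column
with_index = io.StringIO()
df_demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)     # ['Unnamed: 0', 'Username', 'Identifier']

# index=False: only the data columns are written
without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)  # ['Username', 'Identifier']
```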

#### 3.2.5 Data Manipulation
Once your data is in a DataFrame, there are many things you can do with it, including:
- Reshaping a DataFrame
- Slicing
- Querying (taking a subset of a DataFrame that meets certain criteria)
- Handling missing data (very handy in the data preprocessing step)
- Summarizing a DataFrame

Below are some of the functions you can call to carry out these tasks:
- `df.sort_index()` : Sort a DataFrame by its index
- `df.sort_values()` : Sort the DataFrame by the values in a particular column
- `df.drop(columns=['A', 'B'])` : Drop columns from the DataFrame
- `df.head(n)` : Select the first n rows; default: n = 5
- `df.tail(n)` : Select the last n rows; default: n = 5
- `df.loc[]` : Access a group of rows and columns by label(s) or a boolean array
- `df.iloc[]` : Access a group of rows and columns by integer position
- `df.sample(n)` : Randomly select n rows
- `df[['A', 'B']]` : Select multiple columns by name (show only column 'A' and column 'B')
- `df.query('A > 7')` : Take the rows that meet the specified criteria (rows in which the value of column A is greater than 7)
- `df.dropna()` : Drop rows in which any column has NA/null data
- `df.fillna(value)` : Replace all NA/null data with value
- `len(df)` : Return the number of rows in the DataFrame
- `df.shape` : Return a tuple of the number of rows and the number of columns in the DataFrame (an attribute, so no parentheses)
- `df.describe()` : Provide basic descriptive statistics for each numeric column
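
Several of the entries above (label and position access, column selection, and missing-data handling) are not covered by the exercises that follow. A minimal sketch on a made-up DataFrame; the names `toy`, `r1`, `r2` and `r3` are illustrative only:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"A": [5, 9, 12], "B": [1.0, np.nan, 3.0]},
    index=["r1", "r2", "r3"],
)

by_label = toy.loc["r2", "A"]     # row label "r2", column label "A"
by_position = toy.iloc[1, 0]      # second row, first column
both_columns = toy[["A", "B"]]    # note the double brackets: a list of column names
dropped = toy.dropna()            # removes row "r2", which contains a NaN
filled = toy.fillna(0)            # replaces the NaN with 0

print(by_label, by_position)      # 9 9
print(dropped.index.tolist())     # ['r1', 'r3']
print(filled.loc["r2", "B"])      # 0.0
```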

In [201]:
# Initialise the data we will use below
df = pd.read_csv('username.csv', sep=';')     # Import a CSV file called 'username.csv'
display(df)

          Username  Identifier First name Last name  Age
0          alpha01        1035     Romane     Alpha   21
1         booker12        9012     Rachel    Booker   13
2           grey07        2070      Laura      Grey   42
3    hedgehogger14        2456        Joe    Hogger   26
4        johnson81        4081      Craig   Johnson   43
5        jenkins46        9346       Mary   Jenkins   62
6          smith79        5079      Jamie     Smith   17
7  midnighteagle10        8274     Donald     Eagle   25
8      mavericky00        1035     Thomas  Maverick   35
9         laxman10        9821     Trevor       Dos   41


In [212]:
# TODO: Find the shape of the DataFrame df

# Expected outcome: (10, 5)

df.shape

(10, 5)

In [202]:
# TODO: Extract the first 3 rows of the DataFrame

# Expected outcome: 
#       Username    Identifier  First name  Last name   Age
#    0  alpha01     1035        Romane      Alpha       21
#    1  booker12    9012        Rachel      Booker      13
#    2  grey07      2070        Laura       Grey        42

df_extract1 = df.head(3)
print(df_extract1)

   Username  Identifier First name Last name  Age
0   alpha01        1035     Romane     Alpha   21
1  booker12        9012     Rachel    Booker   13
2    grey07        2070      Laura      Grey   42


In [203]:
# TODO: Extract the rows at positions 3 to 6 (inclusive)

# Expected outcome: 
#       Username          Identifier    First name  Last name   Age
#    3  hedgehogger14     2456          Joe         Hogger      26
#    4  johnson81         4081          Craig       Johnson     43
#    5  jenkins46         9346          Mary        Jenkins     62
#    6  smith79           5079          Jamie       Smith       17

df_extract2 = df.iloc[3:7]
print(df_extract2)

        Username  Identifier First name Last name  Age
3  hedgehogger14        2456        Joe    Hogger   26
4      johnson81        4081      Craig   Johnson   43
5      jenkins46        9346       Mary   Jenkins   62
6        smith79        5079      Jamie     Smith   17
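
Note that `iloc[3:7]` stops before position 7, as with ordinary Python slicing. Label-based slicing with `loc` behaves differently: the end label is included. A short sketch on a made-up DataFrame with the default integer index:

```python
import pandas as pd

df_demo = pd.DataFrame({"value": range(10)})   # default index 0..9

by_position = df_demo.iloc[3:7]   # positions 3, 4, 5, 6 (end excluded)
by_label = df_demo.loc[3:6]       # labels 3 through 6 (end included)

print(by_position.equals(by_label))  # True
```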


In [204]:
# TODO: Produce a basic statistical description of the DataFrame

# Expected outcome: 
#       	Identifier	    Age
#   count	10.000000	    10.000000
#   mean	5220.900000	    32.500000
#   std	    3587.123529	    14.908983
#   min	    1035.000000	    13.000000
#   25%	    2166.500000	    22.000000
#   50%	    4580.000000	    30.500000
#   75%	    8827.500000	    41.750000
#   max	    9821.000000	    62.000000

df.describe()

        Identifier        Age
count    10.000000  10.000000
mean   5220.900000  32.500000
std    3587.123529  14.908983
min    1035.000000  13.000000
25%    2166.500000  22.000000
50%    4580.000000  30.500000
75%    8827.500000  41.750000
max    9821.000000  62.000000


In [205]:
# TODO: Extract the last 3 rows with only 'Username' and 'Identifier' as its column  

# Expected outcome: 
#       Username	        Identifier
#   7	midnighteagle10	    8274
#   8	mavericky00     	1035
#   9	laxman10	        9821

df_extract3 = df.tail(3)[['Username','Identifier']]
display(df_extract3)

          Username  Identifier
7  midnighteagle10        8274
8      mavericky00        1035
9         laxman10        9821


In [207]:
df.columns

Index(['Username', 'Identifier', 'First name', 'Last name', 'Age'], dtype='object')

In [210]:
# TODO: Sort the dataframe df by age in descending order

# Expected outcome: 
#       Username        Identifier  First name  Last name   Age
#   5   jenkins46       9346        Mary        Jenkins     62
#   4   johnson81       4081        Craig       Johnson     43
#   2   grey07          2070        Laura       Grey        42
#   9   laxman10        9821        Trevor      Dos         41
#   8   mavericky00     1035        Thomas      Maverick    35
#   3   hedgehogger14   2456        Joe         Hogger      26
#   7   midnighteagle10 8274        Donald      Eagle       25
#   0   alpha01         1035        Romane      Alpha       21
#   6   smith79         5079        Jamie       Smith       17
#   1   booker12        9012        Rachel      Booker      13

df_sorted = df.sort_values(by=['Age'], ascending=False)
print(df_sorted)

          Username  Identifier First name Last name  Age
5        jenkins46        9346       Mary   Jenkins   62
4        johnson81        4081      Craig   Johnson   43
2           grey07        2070      Laura      Grey   42
9         laxman10        9821     Trevor       Dos   41
8      mavericky00        1035     Thomas  Maverick   35
3    hedgehogger14        2456        Joe    Hogger   26
7  midnighteagle10        8274     Donald     Eagle   25
0          alpha01        1035     Romane     Alpha   21
6          smith79        5079      Jamie     Smith   17
1         booker12        9012     Rachel    Booker   13


In [215]:
# TODO: Find the rows in which the identifier is greater than 8000

# Expected outcome:
#   	Username	    Identifier	First name	Last name	Age
#   1	booker12	    9012	    Rachel	    Booker	    13
#   5	jenkins46	    9346	    Mary	    Jenkins	    62
#   7	midnighteagle10	8274	    Donald	    Eagle	    25
#   9	laxman10	    9821	    Trevor	    Dos	        41

df.query('Identifier > 8000')

          Username  Identifier First name Last name  Age
1         booker12        9012     Rachel    Booker   13
5        jenkins46        9346       Mary   Jenkins   62
7  midnighteagle10        8274     Donald     Eagle   25
9         laxman10        9821     Trevor       Dos   41
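
The same filter can be written with a boolean mask instead of `query()`. The sketch below, on a made-up subset of the data, shows both forms giving the same result, and how to refer to a column whose name contains a space inside `query()` by wrapping it in backticks:

```python
import pandas as pd

people = pd.DataFrame({
    "Identifier": [1035, 9012, 2070, 9346],
    "First name": ["Romane", "Rachel", "Laura", "Mary"],
})

via_query = people.query("Identifier > 8000")
via_mask = people[people["Identifier"] > 8000]
print(via_query.equals(via_mask))   # True

# Column names containing spaces need backticks inside query()
mary = people.query('`First name` == "Mary"')
print(mary["Identifier"].tolist())  # [9346]
```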


#### 3.2.6 Further Resources

Shown below are some further resources you can dive into to improve your fluency with the Pandas library:
- [Pandas Workshop - *by Stefanie Molin*](https://github.com/stefmolin/pandas-workshop)
- [Pandas Cookbook - *by Julia Evans*](https://github.com/jvns/pandas-cookbook)
- [Pandas Exercises - *by Guilherme Samora*](https://github.com/guipsamora/pandas_exercises)
- [Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
- [Documentation](https://pandas.pydata.org/docs/)