# Python Modules

* What makes Python so powerful? Modules!
* Python is open source, which means that developers all around the world can contribute to its development.
* Modules are collections of Python functions and classes that you can import and use in your code.
* As mentioned earlier, before spending a lot of time writing your own functions, try searching online: there is probably a module out there that already does what you need.
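
As a minimal sketch of what "import and use" looks like, the example below relies on two modules from Python's standard library (`math` and `statistics`). The variable names are only illustrative.

```python
import math
import statistics as stats  # "as" gives the module a shorter alias

radius = 2.0
area = math.pi * radius ** 2      # use a constant defined in the module
avg = stats.mean([1, 2, 3, 4])    # use a function defined in the module

print(area)
print(avg)
```

The same `import ... as ...` pattern is used for NumPy and pandas in the rest of this notebook.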

## Part One: NumPy

### Why NumPy?

<div>
<img src="img/numpy.gif" width="600"/>
</div>

Read more: https://towardsdatascience.com/why-is-numpy-awesome-3f8f011abf70#:~:text=NumPy%20can%20be%20used%20to,calculations%20you%20can%20use%20np

In [7]:
# Importing the Numpy Module

import numpy as np

In [8]:
np.__version__

'1.22.4'

In [70]:
# let's make an array using the array() function and assign it to variable "a"

a = np.array([1,2,3])
print(a)

[1 2 3]


An **array** object represents a multidimensional, homogeneous array of fixed-size items; that is, every element in the array has the same data type.


In [10]:
# The variable is a np.array object, use the type() function to confirm that

type(a)

numpy.ndarray

An associated **data-type** object describes the format of each element in the array (its byte-order, how many bytes it occupies in memory, whether it is an integer, a floating point number, or something else, etc.)

In [11]:
# The type of the data stored in the array can be checked using array.dtype attribute

a.dtype

dtype('int64')

In [12]:
# We can explicitly specify the data type we want
a = np.array([1,2,3], dtype = 'int64')

a.dtype

dtype('int64')
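
To make the data-type idea more concrete, here is a small sketch that requests a different dtype and inspects how many bytes each element occupies. The variable names `x` and `y` are hypothetical and not part of the cells above.

```python
import numpy as np

x = np.array([1, 2, 3], dtype='float32')   # request 32-bit floats explicitly
print(x.dtype, x.dtype.itemsize)           # itemsize = bytes per element

y = x.astype('int64')                      # astype() returns a converted copy
print(y.dtype, y.dtype.itemsize)
```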

Arrays have different **dimensions**, **sizes**, and **shapes**:

* dimension (`ndim`): the number of axes
* size: the total number of elements
* shape: the length along each axis, e.g. (rows, columns) for a 2D array

In [13]:
print("Array a is ", a.ndim, "D")
print("Array a has", a.size, " elements")
print("Array a has a shape of: ", a.shape)
print("Transposed array a has a shape of: ", np.transpose(a).shape)  # transposing a 1D array changes nothing
print(np.transpose(a))

Array a is  1 D
Array a has 3  elements
Array a has a shape of:  (3,)
Transposed array a has a shape of:  (3,)
[1 2 3]


In [14]:
# We can construct 2D arrays 

b = np.array([[1,2,3,4],
              [5,6,7,8],
              [4,3,2,1]]
            )

print("Array b is ", b.ndim, " dimensional")
print("Array b has", b.size, " elements")
print("Array b has a shape of: ", b.shape, "[3 Rows & 4 Columns]")

Array b is  2  dimensional
Array b has 12  elements
Array b has a shape of:  (3, 4) [3 Rows & 4 Columns]


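**Can arrays contain heterogeneous data types?**

No. NumPy stores a single data type per array, so mixed inputs are converted to one common type (here, strings), and forcing an integer type raises an error: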

In [15]:
b = np.array([['a',2,3,4],
              [5,6,7,8],
              [4,3,2,1]]
            )

In [16]:
b = np.array([['a',2,3,4],
              [5,6,7,8],
              [4,3,2,1],], dtype='int64'
            )

ValueError: invalid literal for int() with base 10: 'a'
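
The conversion can be checked directly. The sketch below builds a small mixed array (the name `mixed` is hypothetical) and inspects the resulting dtype.

```python
import numpy as np

mixed = np.array([['a', 2], [3, 4]])   # one string among integers
print(mixed.dtype.kind)                # 'U' means a Unicode string dtype
print(mixed)                           # every element is now a string
```

Because the numbers have become strings, arithmetic on such an array no longer works as it would on numeric data.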

Knowing the shape of your arrays is essential for performing operations on them. This concept also extends to **Tensors**, which are at the heart of deep learning packages such as **PyTorch** and **TensorFlow**. Let's take a look at this in an example:

In [17]:
# Let's try to create an array "c" by adding our 1D array ("a") & 2D array ("b")

c = a + b

UFuncTypeError: ufunc 'add' did not contain a loop with signature matching types (dtype('int64'), dtype('<U21')) -> None

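**Array shapes must be the same (or broadcast-compatible) for element-wise operations.** Note that the error above is raised because `b` currently holds strings (dtype `<U21`); with numeric data, adding arrays of shapes `(3,)` and `(3, 4)` fails with a broadcasting `ValueError` instead.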

<div>
<img src="img/dim_img.gif" width="200"/>
</div>

Read more: https://towardsdatascience.com/understanding-dimensions-in-pytorch-6edf9972d3be
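
The shape rule can be tried out on small numeric arrays. This is a sketch with hypothetical names (`m`, `row`, `col`): NumPy compares shapes from the last axis backwards, and an axis is compatible when the lengths are equal or one of them is 1.

```python
import numpy as np

m = np.arange(12).reshape(3, 4)      # shape (3, 4)
row = np.array([10, 20, 30, 40])     # shape (4,): matches the last axis of m
col = np.array([1, 2, 3])            # shape (3,): does NOT match the last axis

print((m + row).shape)               # row is broadcast across all 3 rows

try:
    m + col                          # trailing dimensions 4 and 3 disagree
except ValueError as err:
    print("ValueError:", err)

print((m + col.reshape(3, 1)).shape) # a (3, 1) column broadcasts across the columns
```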

**What if you need to change the dimensionality of your array?**
##### We can use the reshape() function to change a 1D array into a 2D array

In [40]:
a2 = np.random.random(12)  # Generate 1D array with random numbers 
print(a2)

[0.71852024 0.17010373 0.62556292 0.74335347 0.16350417 0.92046097
 0.33007963 0.05799293 0.99189453 0.68607281 0.32268093 0.2591686 ]


In [20]:
print(type(a2))
print(a2.dtype)
print(a2.size)

<class 'numpy.ndarray'>
float64
12


In [47]:
b = np.array([[1,2,3,4],[5,6,7,8],[4,3,2,1]])
print(b)

c = a2 + b   # raises ValueError: shapes (12,) and (3, 4) cannot be broadcast together

[[1 2 3 4]
 [5 6 7 8]
 [4 3 2 1]]


In [43]:
print("Array a2 has the following shape: ",a2.shape)
print("Remember array b's shape is: ", b.shape)

Array a2 has the following shape:  (12,)
Remember array b's shape is:  (3, 4)


In [48]:
a2 = a2.reshape(3,4)   #reshape function: array.reshape()
print(a2)
print(b)

[[0.71852024 0.17010373 0.62556292 0.74335347]
 [0.16350417 0.92046097 0.33007963 0.05799293]
 [0.99189453 0.68607281 0.32268093 0.2591686 ]]
[[1 2 3 4]
 [5 6 7 8]
 [4 3 2 1]]


##### Now let's try adding our 2D arrays:

In [49]:
c = a2+b
c

array([[1.71852024, 2.17010373, 3.62556292, 4.74335347],
       [5.16350417, 6.92046097, 7.33007963, 8.05799293],
       [4.99189453, 3.68607281, 2.32268093, 1.2591686 ]])

**We can also convert the 2D array into a 1D array using the ravel() function**

In [50]:
a2 = a2.ravel()
b = b.ravel()

In [51]:
print(a2)
print(b)

[0.71852024 0.17010373 0.62556292 0.74335347 0.16350417 0.92046097
 0.33007963 0.05799293 0.99189453 0.68607281 0.32268093 0.2591686 ]
[1 2 3 4 5 6 7 8 4 3 2 1]


In [52]:
c = a2 + b 
print(c)

[1.71852024 2.17010373 3.62556292 4.74335347 5.16350417 6.92046097
 7.33007963 8.05799293 4.99189453 3.68607281 2.32268093 1.2591686 ]


In [53]:
print(type(c))

<class 'numpy.ndarray'>


In [54]:
c = c.reshape(3,4)
print(c)

[[1.71852024 2.17010373 3.62556292 4.74335347]
 [5.16350417 6.92046097 7.33007963 8.05799293]
 [4.99189453 3.68607281 2.32268093 1.2591686 ]]
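
Two related details are worth a quick sketch (the names `grid`, `auto`, and `flat_copy` are hypothetical): `reshape()` accepts `-1` for one dimension and infers its length, and `flatten()` is an alternative to `ravel()` that always returns a copy, whereas `ravel()` may return a view of the original data where possible.

```python
import numpy as np

grid = np.arange(6).reshape(2, 3)

auto = grid.reshape(-1, 2)       # -1 lets NumPy infer this dimension (here 3)
print(auto.shape)

flat_copy = grid.flatten()       # flatten() always returns a new copy
flat_copy[0] = 99
print(grid[0, 0])                # the original array is unchanged
```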


**Another common transformation you may want to do is transposition [Row <--> Column]:**

<div>
    <img src="img/transpose.gif" width="200">
</div>

Source: https://commons.wikimedia.org/wiki/File:Matrix_transpose.gif

In [57]:
b = b.reshape(3,4)
print(type(b))


<class 'numpy.ndarray'>


In [58]:
# Let's transpose our array (b):

b_t = b.transpose()
print(b_t)

[[1 5 4]
 [2 6 3]
 [3 7 2]
 [4 8 1]]


In [71]:
print(b.shape, b_t.shape)

(3, 4) (4, 3)
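
Transposition only swaps axes that exist, so a 1D array is left unchanged. A minimal sketch, with hypothetical names `v` and `column`, of how to get an explicit column vector instead:

```python
import numpy as np

v = np.array([1, 2, 3])
print(v.T.shape)                 # still (3,): a 1D array has no second axis to swap

column = v.reshape(-1, 1)        # explicit column vector with shape (3, 1)
print(column.shape, column.T.shape)
```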


### Other ways to create Arrays:

You already saw the **random()** function of the **numpy.random** module; 
it generates an array of a given shape filled with random numbers drawn uniformly from [0, 1): 

In [72]:
a = np.random.random(3)   # 1D array of size 3 
print(a)
print(type(a))

[0.98669805 0.51126114 0.66684471]
<class 'numpy.ndarray'>


We can also specify a shape:

In [75]:
a = np.random.random((4,4))  # 2D array of size 16 (4X4)
print(a)

[[0.12616638 0.18551453 0.31504743 0.27660476]
 [0.88736839 0.49262451 0.37070413 0.97170731]
 [0.75524699 0.16465069 0.23872915 0.23262825]
 [0.2234598  0.41471642 0.03445069 0.94992823]]


Another very useful function for making NumPy arrays is **arange()**. It generates arrays of evenly spaced numerical sequences whose start, stop, and step are set by the arguments passed.

* For example, we can make an array with values ranging from 0 to 10:

In [76]:
a = np.arange(0,11)   # np.arange(starting value, ending value excluded)
print(type(a))
print(a.ndim)
print(a)

<class 'numpy.ndarray'>
1
[ 0  1  2  3  4  5  6  7  8  9 10]


In [77]:
a = np.arange(10,101,10)  #The third value is the size of the intervals between values
print(a)
print(a.size)

[ 10  20  30  40  50  60  70  80  90 100]
10
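
`arange()` fixes the step between values. When it is more natural to fix the number of points, `np.linspace()` is a closely related function, shown here as a brief comparison; it is not used elsewhere in this notebook, and the variable names are hypothetical.

```python
import numpy as np

by_step = np.arange(0, 1, 0.25)     # fixed step of 0.25; the stop value is excluded
by_count = np.linspace(0, 1, 5)     # exactly 5 points; the stop value is included

print(by_step)
print(by_count)
```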


We can also make multidimensional arrays by combining **arange()** with **reshape()**

In [78]:
a = np.arange(0,8).reshape(2,2,2)      #3D arrays made by using arange() + reshape()
print(a)

[[[0 1]
  [2 3]]

 [[4 5]
  [6 7]]]


Sometimes you may want to make **zeros** and **ones** Arrays

In [81]:
d = np.zeros((3, 3))   # Make a 3x3 array filled with zeros:
print(d)

[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]


In [82]:
e = np.ones((2,2,2))  # Make a 2x2x2 array filled with 1s:
print(e)

[[[1. 1.]
  [1. 1.]]

 [[1. 1.]
  [1. 1.]]]


## Some Basic Operations

### Element-wise operators

The most basic operators are element-wise operators i.e., applying operations on individual elements of arrays.

In [87]:
a = np.arange(0,9)
print(a)

[0 1 2 3 4 5 6 7 8]


In [84]:
a+10

array([10, 11, 12, 13, 14, 15, 16, 17, 18])

In [89]:
a*2 + a

array([ 0,  3,  6,  9, 12, 15, 18, 21, 24])

In [90]:
b = np.random.random(9)
a+b

array([0.41756909, 1.69927488, 2.52108236, 3.31213896, 4.66137343,
       5.93063736, 6.07888158, 7.80764459, 8.32855082])

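**Useful functions (element-wise and aggregating):**
* np.sqrt(): element-wise square root
* np.square(): element-wise square
* np.log(): element-wise natural logarithm
* np.sum(): sum of the elements
* np.mean(): average of the elements
* np.min(): smallest element
* np.max(): largest element
* np.std(): standard deviation of the elements
* np.argmin(): returns the index of the minimum element, optionally along a particular axis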
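
The aggregating functions accept an `axis` argument that controls which direction is collapsed. A short sketch on a small 2D array (the name `m` is hypothetical):

```python
import numpy as np

m = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [4, 3, 2, 1]])

print(np.sum(m))             # sum over every element
print(np.sum(m, axis=0))     # collapse the rows: one sum per column
print(np.sum(m, axis=1))     # collapse the columns: one sum per row
print(np.argmin(m, axis=1))  # position of the minimum within each row
```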

In [91]:
c= np.random.random(4)
c_sqrt = np.sqrt(c)

print(c)
print(c_sqrt)

[0.12812245 0.99791796 0.10177875 0.48678672]
[0.35794196 0.99895844 0.31902782 0.69770102]


In [92]:
c= np.random.random(4)
c_argmin = np.argmin(c)   # index of the smallest element

print(c)
print(c_argmin)

[0.5787129  0.01476661 0.08491039 0.88972531]
1


### Matrix operators

In [93]:
#Determinant

import numpy as np
a = np.array([[3, 1],
               [1, 3]])
print(a)
det = np.linalg.det(a)
print("\nDeterminant:", np.round(det))

[[3 1]
 [1 3]]

Determinant: 8.0


In [94]:
#Eigenvalues and eigenvectors 

w, v = np.linalg.eig(a)
print("\nEigenvalues:")
print(w)
print("\nEigenvectors:")
print(v)

e = np.linalg.eig(a)
e


Eigenvalues:
[4. 2.]

Eigenvectors:
[[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]]


(array([4., 2.]),
 array([[ 0.70710678, -0.70710678],
        [ 0.70710678,  0.70710678]]))
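
A quick way to sanity-check the result is to verify the defining relation A v = λ v for every eigenpair. This sketch reuses the same matrix; each column of `v` is the eigenvector that belongs to the eigenvalue at the same position in `w`.

```python
import numpy as np

a = np.array([[3, 1],
              [1, 3]])
w, v = np.linalg.eig(a)

# v * w scales column i of v by eigenvalue w[i], so both sides should agree.
print(np.allclose(a @ v, v * w))
```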

## Indexing & Slicing

We can extract elements and sections of the arrays just like with lists

### Indexing
Array indexing refers to the use of square brackets (‘[ ]’) to grab elements individually for various uses such as extracting a value, selecting items, or even assigning a new value. Let's look at this in practice.

In [99]:
a = np.arange(0,101)
a

array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99, 100])

In [98]:
# Extract first and last value
print(a[0], a[-1])

0 100


To select multiple items at once, you can pass an array (or list) of indices within the square brackets

In [100]:
print(a[[1,22,-1]]) #only takes ONE argument in the index bracket. This case, it's a list

[  1  22 100]


**Can we select multiple items the same way in lists?**

In [110]:
a = np.arange(0,101)
a = list(a)
type(a)

list

In [111]:
print(a[[1,2,-1]])

TypeError: list indices must be integers or slices, not list
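
With a plain list, the same selection has to be done one index at a time, for example with a list comprehension. A minimal sketch (the names `values`, `wanted`, and `picked` are hypothetical):

```python
values = list(range(0, 101))     # a plain Python list holding 0..100
wanted = [1, 2, -1]

picked = [values[i] for i in wanted]   # index one element at a time
print(picked)
```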

**In the case of 2D arrays, rows and columns are treated like coordinates.**
* They are represented as rectangular matrices consisting of rows and columns.
    * Defined by two axes, where axis 0 is represented by the rows and axis 1 is represented by the columns.
    * Index with two values [row index, column index].

In [112]:
A = np.arange(0,16).reshape(4,4)
print(A)

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]


In [114]:
print(A[0,3])          # indexing single values
print(A[[0,3],[2,0]])  # indexing multiple values [Row list] , [Column List]

3
[ 2 12]


### Slicing

Slicing is the operation which allows you to extract portions of an array to generate new ones

**Array[Start:End]**

In [115]:
a = np.arange(0,11)   #Create an array 
a[0:5]                # Take a slice from index 0 up to, but not including, index 5

array([0, 1, 2, 3, 4])

In [119]:
# step size

a[0:16:3]   # Every 3rd value starting at index 0; the stop (16) is past the end, so the slice simply ends at the last element

array([0, 3, 6, 9])
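
Slicing extends to 2D arrays by giving one slice per axis, separated by a comma. A short sketch using the same 4x4 array as in the indexing section:

```python
import numpy as np

A = np.arange(0, 16).reshape(4, 4)

print(A[1, :])        # second row
print(A[:, 2])        # third column
print(A[1:3, 0:2])    # rows 1-2 and columns 0-1 as a 2x2 block
print(A[::-1, 0])     # first column, read from the bottom row upward
```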

## Part Two: Pandas

* **Pandas (the name derives from "panel data") is the workhorse for data analysis & manipulation in Python.** 
    * Provides a tabular interface to interact with data - feels like Excel. 
    * Open-source library 
    * Built on top of NumPy, providing high-performance, easy-to-use data structures and data analysis tools 

* **Has 2 main data objects/containers: Data Series & Data Frames.** 
    * Data Series - Deals with 1D data
    * Data Frames - Deal with 2D (tabular) data
    
    
* **Useful references:**
    * Documentation: https://pandas.pydata.org/docs/  


In [120]:
# Installing pandas. To install, uncomment either line below.
# (%pip and %conda are Jupyter magics; typing "pip install pandas" as plain Python is a SyntaxError.)

# %pip install pandas
# %conda install pandas

In [121]:
# Import our modules:

import pandas as pd         
import numpy as np

In [122]:
pd.__version__

'1.4.2'

## Data Series

The Series is made up of 2 arrays (index & value) linked to each other. The **value column** can hold data of any NumPy type, and each value is associated with a label stored in the **index column**. 

<div>
<img src="img/series_spreadsheet.png" width="200">
</div>
    

**Source:** https://codechalleng.es/bites/251/
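
As a minimal sketch of the index/value pairing, the Series below uses three win counts with shortened, hypothetical labels:

```python
import pandas as pd

wins = pd.Series([6, 11, 5], index=['Boston', 'Detroit', 'Edmonton'], name='Wins')

print(wins.index.tolist())   # the labels
print(wins.values)           # the underlying NumPy array of values
print(wins['Detroit'])       # look a value up by its label
```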

## Data Frames

A DataFrame is a combination of multiple Data Series/columns, each of which can contain a different data type (numeric, string, Boolean, etc.). Because a DataFrame is two-dimensional, each value is linked to 2 different indices: a row label and a column label.

<div>
    <img src="img/series-and-dataframe.png" width="600">
</div>

**Source:** https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/

### Creating Data Frames

In [123]:
# Easiest way is to convert a dictionary to DF

data = {'Countries' : ['Mexico','Spain','England','Argentina','New Zealand'],'Avg Age' : [92.1,55.3,81.5,63,74.5]}
type(data)

dict

In [124]:
df = pd.DataFrame(data)
df

Unnamed: 0,Countries,Avg Age
0,Mexico,92.1
1,Spain,55.3
2,England,81.5
3,Argentina,63.0
4,New Zealand,74.5


In [126]:
# We can select one or more columns

df = pd.DataFrame(data, columns=['Avg Age']) # keep just one column
df

Unnamed: 0,Avg Age
0,92.1
1,55.3
2,81.5
3,63.0
4,74.5


In [127]:
# We can explicitly provide labels for indices:

df = pd.DataFrame(data, index = ['one', 'two', 'three', 'four', 'five'])
df

Unnamed: 0,Countries,Avg Age
one,Mexico,92.1
two,Spain,55.3
three,England,81.5
four,Argentina,63.0
five,New Zealand,74.5


## Data Extraction (Indexing/Slicing)

In [129]:
#importing data from excel file

file = "Stanely_cup_winners.xlsx"
df = pd.read_excel(file)
df

Unnamed: 0,Team,Wins
0,Anaheim Ducks,1
1,Boston Bruins,6
2,Calgary Flames,1
3,Carolina Hurricanes,1
4,Chicago Blackhawks,6
5,Colorado Avalanche,2
6,Dallas Stars,1
7,Detroit Red Wings,11
8,Edmonton Oilers,5
9,Los Angeles Kings,2


In [130]:
# Obtain the column labels using df.columns 

df.columns

Index(['Team', 'Wins'], dtype='object')

In [131]:
# Obtain the index labels using df.index 

df.index

RangeIndex(start=0, stop=22, step=1)

In [132]:
# Obtain the values using df.values (row by row):

df.values

array([['Anaheim Ducks', 1],
       ['Boston Bruins', 6],
       ['Calgary Flames', 1],
       ['Carolina Hurricanes', 1],
       ['Chicago Blackhawks', 6],
       ['Colorado Avalanche', 2],
       ['Dallas Stars', 1],
       ['Detroit Red Wings', 11],
       ['Edmonton Oilers', 5],
       ['Los Angeles Kings', 2],
       ['Montréal Canadiens', 23],
       ['Montreal Maroons', 2],
       ['New Jersey Devils', 3],
       ['New York Islanders', 4],
       ['New York Rangers', 4],
       ['Philadelphia Flyers', 2],
       ['Pittsburgh Penguins', 5],
       ['St. Louis Blues', 1],
       ['St. Louis Eagles', 4],
       ['Tampa Bay Lightning', 3],
       ['Toronto Maple Leafs', 13],
       ['Washington Capitals', 1]], dtype=object)

In [133]:
# Obtain the values in a given column: df['column label']:

df['Team']

0           Anaheim Ducks
1           Boston Bruins
2          Calgary Flames
3     Carolina Hurricanes
4      Chicago Blackhawks
5      Colorado Avalanche
6            Dallas Stars
7       Detroit Red Wings
8         Edmonton Oilers
9       Los Angeles Kings
10     Montréal Canadiens
11       Montreal Maroons
12      New Jersey Devils
13     New York Islanders
14       New York Rangers
15    Philadelphia Flyers
16    Pittsburgh Penguins
17        St. Louis Blues
18       St. Louis Eagles
19    Tampa Bay Lightning
20    Toronto Maple Leafs
21    Washington Capitals
Name: Team, dtype: object

In [134]:
# The result of selecting a single column is a Series:

type(df['Team'])

pandas.core.series.Series

In [135]:
# We can also extract the values in a column by calling the column label as an attribute of the Dataframe:

df.Team


0           Anaheim Ducks
1           Boston Bruins
2          Calgary Flames
3     Carolina Hurricanes
4      Chicago Blackhawks
5      Colorado Avalanche
6            Dallas Stars
7       Detroit Red Wings
8         Edmonton Oilers
9       Los Angeles Kings
10     Montréal Canadiens
11       Montreal Maroons
12      New Jersey Devils
13     New York Islanders
14       New York Rangers
15    Philadelphia Flyers
16    Pittsburgh Penguins
17        St. Louis Blues
18       St. Louis Eagles
19    Tampa Bay Lightning
20    Toronto Maple Leafs
21    Washington Capitals
Name: Team, dtype: object

In [137]:
# Extracting rows by position using iloc

df.iloc[1]

Team    Boston Bruins
Wins                6
Name: 1, dtype: object

In [145]:
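# This can be used for multiple rows:

# What are the first three rows of the table? (The table is ordered by team name, not by wins.)

df.iloc[[0,1,2]]   # iloc selects by integer position
df.loc[[0,1,2]]    # loc selects by index label (identical here because the labels are 0, 1, 2, ...)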

Unnamed: 0,Team,Wins
0,Anaheim Ducks,1
1,Boston Bruins,6
2,Calgary Flames,1


In [146]:
# We can also obtain a range using a slicing approach:

df.loc[20:]

Unnamed: 0,Team,Wins
20,Toronto Maple Leafs,13
21,Washington Capitals,1
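
One difference between the two indexers is easy to miss: a `loc` slice is label-based and includes both end points, while an `iloc` slice is position-based and excludes the stop. A sketch on a small made-up table with string labels (`df_demo` is hypothetical):

```python
import pandas as pd

df_demo = pd.DataFrame({'Team': ['A', 'B', 'C', 'D'], 'Wins': [1, 6, 1, 11]},
                       index=['w', 'x', 'y', 'z'])

by_label = df_demo.loc['x':'y']      # label slice: both end points are included
by_position = df_demo.iloc[1:2]      # position slice: the stop is excluded

print(len(by_label), len(by_position))
```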


In [158]:
# Get a single value from cell index

df['Team'][0]

'Anaheim Ducks'

In [141]:
#Another way of doing this

df.iloc[0, 0] # [row position, column position]

'Anaheim Ducks'

In [156]:
# Can we get the index of a certain value, similar to the Find function in Excel?

df['Team'].where(df['Team'] == 'Toronto Maple Leafs').dropna().index[0]

20
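
The same lookup can also be written with a boolean mask, which some readers may find easier to follow. This is a sketch on a three-row excerpt (`df_demo` and `mask` are hypothetical names), not a replacement for the cell above.

```python
import pandas as pd

df_demo = pd.DataFrame({'Team': ['Anaheim Ducks', 'Boston Bruins', 'Calgary Flames'],
                        'Wins': [1, 6, 1]})

mask = df_demo['Team'] == 'Boston Bruins'   # boolean Series, True where the team matches
print(df_demo.index[mask].tolist())         # labels of all matching rows
print(df_demo.loc[mask, 'Wins'].iloc[0])    # value from another column in that row
```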

## Assigning new Values

### Adding & Deleting Columns:

In [147]:
# Adding a column is simple, provide the name followed by = value(s)

df['Country'] = [
                    'US',
                    'US',
                    'US',
                    'US',
                    'US',
                    'US',
                    'US',
                    'US',
                    'US',
                    'US',
                    'US',
                    'US',
                    'US',
                    'US',
                    'US',
                    'US',
                    'US',
                    'US',
                    'US',
                    'US',
                    'US',
                    'US',
                ]
df

Unnamed: 0,Team,Wins,Country
0,Anaheim Ducks,1,US
1,Boston Bruins,6,US
2,Calgary Flames,1,US
3,Carolina Hurricanes,1,US
4,Chicago Blackhawks,6,US
5,Colorado Avalanche,2,US
6,Dallas Stars,1,US
7,Detroit Red Wings,11,US
8,Edmonton Oilers,5,US
9,Los Angeles Kings,2,US


In [148]:
# Deleting is also easy: del df['column name']. Note: it is not recommended to modify the original dataframe.

del df['Country']
df

Unnamed: 0,Team,Wins
0,Anaheim Ducks,1
1,Boston Bruins,6
2,Calgary Flames,1
3,Carolina Hurricanes,1
4,Chicago Blackhawks,6
5,Colorado Avalanche,2
6,Dallas Stars,1
7,Detroit Red Wings,11
8,Edmonton Oilers,5
9,Los Angeles Kings,2


In [149]:
df['Country'] = [
                    'US',
                    'US',
                    'US',
                    'US',
                    'US',
                    'US',
                    'US',
                    'US',
                    'US',
                    'US',
                    'US',
                    'US',
                    'US',
                    'US',
                    'US',
                    'US',
                    'US',
                    'US',
                    'US',
                    'US',
                    'US',
                    'US',
                ]

In [150]:
# We can also delete the columns using the df.drop() function:

df2 = df.drop(['Country'], axis=1)       # Axis 1 = columns, 0 = Rows
df2

Unnamed: 0,Team,Wins
0,Anaheim Ducks,1
1,Boston Bruins,6
2,Calgary Flames,1
3,Carolina Hurricanes,1
4,Chicago Blackhawks,6
5,Colorado Avalanche,2
6,Dallas Stars,1
7,Detroit Red Wings,11
8,Edmonton Oilers,5
9,Los Angeles Kings,2


In [151]:
df['Country']

0     US
1     US
2     US
3     US
4     US
5     US
6     US
7     US
8     US
9     US
10    US
11    US
12    US
13    US
14    US
15    US
16    US
17    US
18    US
19    US
20    US
21    US
Name: Country, dtype: object

In [152]:
# Use .loc[row labels, column label] to change several values in one step.
# This avoids chained assignment such as df['Country'][2] = 'Canada', which pandas warns about.
df.loc[[2, 8, 10, 11, 20], 'Country'] = 'Canada'

In [153]:
df

Unnamed: 0,Team,Wins,Country
0,Anaheim Ducks,1,US
1,Boston Bruins,6,US
2,Calgary Flames,1,Canada
3,Carolina Hurricanes,1,US
4,Chicago Blackhawks,6,US
5,Colorado Avalanche,2,US
6,Dallas Stars,1,US
7,Detroit Red Wings,11,US
8,Edmonton Oilers,5,Canada
9,Los Angeles Kings,2,US


In [154]:
# Fill column with a single constant value:

df['Last year won'] = 0
df

Unnamed: 0,Team,Wins,Country,Last year won
0,Anaheim Ducks,1,US,0
1,Boston Bruins,6,US,0
2,Calgary Flames,1,Canada,0
3,Carolina Hurricanes,1,US,0
4,Chicago Blackhawks,6,US,0
5,Colorado Avalanche,2,US,0
6,Dallas Stars,1,US,0
7,Detroit Red Wings,11,US,0
8,Edmonton Oilers,5,Canada,0
9,Los Angeles Kings,2,US,0


### Looking for Values within the DataFrames: ***the isin() function***

In [159]:
df.isin(['US', 
        'Canada',1])#.any()

Unnamed: 0,Team,Wins,Country,Last year won
0,False,True,True,False
1,False,False,True,False
2,False,True,True,False
3,False,True,True,False
4,False,False,True,False
5,False,False,True,False
6,False,True,True,False
7,False,False,True,False
8,False,False,True,False
9,False,False,True,False
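
`isin()` also works on a single column, where the resulting boolean Series can be used to filter rows. A minimal sketch on a made-up three-row table (`df_demo` and `canadian` are hypothetical names):

```python
import pandas as pd

df_demo = pd.DataFrame({'Team': ['Calgary Flames', 'Boston Bruins', 'Edmonton Oilers'],
                        'Country': ['Canada', 'US', 'Canada']})

canadian = df_demo[df_demo['Country'].isin(['Canada'])]   # keep rows whose country is listed
print(canadian['Team'].tolist())
```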


### Sorting, Filtering, Transposition

In [160]:
df3 = df.sort_values(by=['Wins'], ascending=False)
df3

Unnamed: 0,Team,Wins,Country,Last year won
10,Montréal Canadiens,23,Canada,0
20,Toronto Maple Leafs,13,Canada,0
7,Detroit Red Wings,11,US,0
1,Boston Bruins,6,US,0
4,Chicago Blackhawks,6,US,0
8,Edmonton Oilers,5,Canada,0
16,Pittsburgh Penguins,5,US,0
18,St. Louis Eagles,4,US,0
13,New York Islanders,4,US,0
14,New York Rangers,4,US,0


In [161]:
# Resetting the index so the sorted rows are numbered 0, 1, 2, ...

df3 = df3.reset_index(drop=True)
df3

Unnamed: 0,Team,Wins,Country,Last year won
0,Montréal Canadiens,23,Canada,0
1,Toronto Maple Leafs,13,Canada,0
2,Detroit Red Wings,11,US,0
3,Boston Bruins,6,US,0
4,Chicago Blackhawks,6,US,0
5,Edmonton Oilers,5,Canada,0
6,Pittsburgh Penguins,5,US,0
7,St. Louis Eagles,4,US,0
8,New York Islanders,4,US,0
9,New York Rangers,4,US,0


In [162]:
# Filtering is done just like in the case of series:

df3[df3['Wins'] > 1]     

Unnamed: 0,Team,Wins,Country,Last year won
0,Montréal Canadiens,23,Canada,0
1,Toronto Maple Leafs,13,Canada,0
2,Detroit Red Wings,11,US,0
3,Boston Bruins,6,US,0
4,Chicago Blackhawks,6,US,0
5,Edmonton Oilers,5,Canada,0
6,Pittsburgh Penguins,5,US,0
7,St. Louis Eagles,4,US,0
8,New York Islanders,4,US,0
9,New York Rangers,4,US,0
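
Several conditions can be combined with `&` (and) and `|` (or); each condition needs its own parentheses. A sketch on a four-row excerpt (`df_demo`, `both`, and `either` are hypothetical names):

```python
import pandas as pd

df_demo = pd.DataFrame({'Team': ['Montréal Canadiens', 'Toronto Maple Leafs',
                                 'Detroit Red Wings', 'Boston Bruins'],
                        'Wins': [23, 13, 11, 6],
                        'Country': ['Canada', 'Canada', 'US', 'US']})

both = df_demo[(df_demo['Wins'] > 10) & (df_demo['Country'] == 'US')]    # both must hold
either = df_demo[(df_demo['Wins'] > 20) | (df_demo['Country'] == 'US')]  # at least one holds

print(both['Team'].tolist())
print(either['Team'].tolist())
```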


In [163]:
# Transposition can be obtained by simply using the df.T attribute:

df.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,12,13,14,15,16,17,18,19,20,21
Team,Anaheim Ducks,Boston Bruins,Calgary Flames,Carolina Hurricanes,Chicago Blackhawks,Colorado Avalanche,Dallas Stars,Detroit Red Wings,Edmonton Oilers,Los Angeles Kings,...,New Jersey Devils,New York Islanders,New York Rangers,Philadelphia Flyers,Pittsburgh Penguins,St. Louis Blues,St. Louis Eagles,Tampa Bay Lightning,Toronto Maple Leafs,Washington Capitals
Wins,1,6,1,1,6,2,1,11,5,2,...,3,4,4,2,5,1,4,3,13,1
Country,US,US,Canada,US,US,US,US,US,Canada,US,...,US,US,US,US,US,US,US,US,Canada,US
Last year won,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Statistical Analysis

In [164]:
# Statistical summary can be obtained using the df.describe() function:

df3.describe()

Unnamed: 0,Wins,Last year won
count,22.0,22.0
mean,4.590909,0.0
std,5.18844,0.0
min,1.0,0.0
25%,1.25,0.0
50%,3.0,0.0
75%,5.0,0.0
max,23.0,0.0


In [165]:
# You can still obtain the sum and mean like we did in Numpy:

print(df.sum())
#print(df.mean())

Team             Anaheim DucksBoston BruinsCalgary FlamesCaroli...
Wins                                                           101
Country          USUSCanadaUSUSUSUSUSCanadaUSCanadaCanadaUSUSUS...
Last year won                                                    0
dtype: object


In [169]:
# Which country has won the most Stanley Cups to date?

df4 = df3.groupby(by=['Country'])[['Wins', 'Last year won']].sum()   # sum only the numeric columns
df4

Unnamed: 0_level_0,Wins,Last year won
Country,Unnamed: 1_level_1,Unnamed: 2_level_1
Canada,44,0
US,57,0
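
`groupby()` is not limited to sums. As a sketch, `agg()` can compute several statistics per group at once; the four-row table and the name `summary` below are made up for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({'Country': ['Canada', 'US', 'Canada', 'US'],
                        'Wins': [23, 11, 13, 6]})

summary = df_demo.groupby('Country')['Wins'].agg(['sum', 'mean', 'count'])
print(summary)
```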


# Let us have some fun with the Data

#### Update the column with the last year each team won a Stanley Cup, using the data from another Excel sheet

In [170]:
#1 import the new excel sheet with the data from every year

file = "Stanely_cup_winners_by_year.xlsx"
df_yr = pd.read_excel(file)
df_yr

Unnamed: 0,Year,Team,Runner up
0,2020,Tampa Bay Lightning,Dallas Stars
1,2019,St. Louis Blues,Boston Bruins
2,2018,Washington Capitals,Vegas Golden Knights
3,2017,Pittsburgh Penguins,Nashville Predators
4,2016,Pittsburgh Penguins,San Jose Sharks
...,...,...,...
103,1915,Vancouver Millionaires,Ottawa Senators
104,1914,Toronto Blueshirts,Victoria Cougars
105,1913,Quebec Bulldogs,Sydney Miners
106,1912,Quebec Bulldogs,Moncton Victories


In [173]:
#2 Run through the data to extract the last year each team won the Stanley Cup

for team, year in zip(df_yr['Team'], df_yr['Year']):
    matches = df3.index[df3['Team'] == team]   # row labels in df3 for this team (empty if not listed)
    if len(matches) > 0:
        indx = matches[0]
        if year > df3.loc[indx, 'Last year won']:
            df3.loc[indx, 'Last year won'] = year   # .loc avoids chained assignment
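
The same update can be sketched without an explicit loop: take the most recent year per team, then map it onto the team column. The two small tables below are made-up stand-ins for `df_yr` and `df3`, and the names `years`, `teams`, and `last_win` are hypothetical.

```python
import pandas as pd

# Miniature, made-up stand-ins for the yearly and per-team tables
years = pd.DataFrame({'Year': [2017, 2016, 2011, 1972],
                      'Team': ['Pittsburgh Penguins', 'Pittsburgh Penguins',
                               'Boston Bruins', 'Boston Bruins']})
teams = pd.DataFrame({'Team': ['Boston Bruins', 'Pittsburgh Penguins', 'St. Louis Eagles']})

last_win = years.groupby('Team')['Year'].max()           # most recent year per team
teams['Last year won'] = teams['Team'].map(last_win)     # NaN where a team never appears
print(teams)
```

Teams that are missing from the yearly table end up as NaN directly, rather than as a placeholder 0.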


In [174]:
df3

Unnamed: 0,Team,Wins,Country,Last year won
0,Montréal Canadiens,23,Canada,0
1,Toronto Maple Leafs,13,Canada,1967
2,Detroit Red Wings,11,US,2008
3,Boston Bruins,6,US,2011
4,Chicago Blackhawks,6,US,2015
5,Edmonton Oilers,5,Canada,1990
6,Pittsburgh Penguins,5,US,2017
7,St. Louis Eagles,4,US,0
8,New York Islanders,4,US,1983
9,New York Rangers,4,US,1994


In [175]:
# Calculate the correlation between the number of wins and the last year won (Pearson correlation coefficient):

df3['Wins'].corr(df3['Last year won'])

-0.5582200870235183

## Dealing with Missing Values (NaN Values)

As mentioned in the series section, experimental datasets often have missing values. We can deal with such data in many different ways. For example, if you have a large dataset you may simply choose to remove such data points, or, if the data concerned is numerical, you may replace the NaN value with the average of the column. Let's take a look at how to do these operations:

### Dropping NaN values: 

In [178]:
#Creating a dataframe with NaN values: Converting 0 in the year column to NaN

df3['Last year won'] = df3['Last year won'].replace({0:np.nan})
df3

Unnamed: 0,Team,Wins,Country,Last year won
0,Montréal Canadiens,23,Canada,
1,Toronto Maple Leafs,13,Canada,1967.0
2,Detroit Red Wings,11,US,2008.0
3,Boston Bruins,6,US,2011.0
4,Chicago Blackhawks,6,US,2015.0
5,Edmonton Oilers,5,Canada,1990.0
6,Pittsburgh Penguins,5,US,2017.0
7,St. Louis Eagles,4,US,
8,New York Islanders,4,US,1983.0
9,New York Rangers,4,US,1994.0


In [177]:
# We could just drop all NaN data points:

df_fix = df3.dropna()
df_fix

# Deleted the row with NaN value(s)

Unnamed: 0,Team,Wins,Country,Last year won
1,Toronto Maple Leafs,13,Canada,1967.0
2,Detroit Red Wings,11,US,2008.0
3,Boston Bruins,6,US,2011.0
4,Chicago Blackhawks,6,US,2015.0
5,Edmonton Oilers,5,Canada,1990.0
6,Pittsburgh Penguins,5,US,2017.0
8,New York Islanders,4,US,1983.0
9,New York Rangers,4,US,1994.0
10,Tampa Bay Lightning,3,US,2020.0
11,New Jersey Devils,3,US,2003.0


In [179]:
# We could instead drop every column that contains NaN values, using the axis argument:

df_fix = df3.dropna(axis=1)
df_fix

Unnamed: 0,Team,Wins,Country
0,Montréal Canadiens,23,Canada
1,Toronto Maple Leafs,13,Canada
2,Detroit Red Wings,11,US
3,Boston Bruins,6,US
4,Chicago Blackhawks,6,US
5,Edmonton Oilers,5,Canada
6,Pittsburgh Penguins,5,US
7,St. Louis Eagles,4,US
8,New York Islanders,4,US
9,New York Rangers,4,US


Note: by default, df.dropna() drops every row (or column) that contains at least one missing value. To change this, we can use the how argument, which takes two possible inputs:

* ‘any’ : If any NA values are present, drop that row or column. [This is the Default]

* ‘all’ : If all values are NA, drop that row or column.

In [180]:
df_fix = df3.dropna(how='all')
df_fix 

# The rows are kept because not all of their values are NaN

Unnamed: 0,Team,Wins,Country,Last year won
0,Montréal Canadiens,23,Canada,
1,Toronto Maple Leafs,13,Canada,1967.0
2,Detroit Red Wings,11,US,2008.0
3,Boston Bruins,6,US,2011.0
4,Chicago Blackhawks,6,US,2015.0
5,Edmonton Oilers,5,Canada,1990.0
6,Pittsburgh Penguins,5,US,2017.0
7,St. Louis Eagles,4,US,
8,New York Islanders,4,US,1983.0
9,New York Rangers,4,US,1994.0
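
The difference between the two `how` settings is easier to see on a table that contains a fully empty row. The sketch below also uses the `subset` argument of `dropna()`, which restricts the check to chosen columns; `df_demo` is a made-up example.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'Team': ['A', 'B', None],
                        'Last year won': [1967.0, np.nan, np.nan]})

print(len(df_demo.dropna()))                  # how='any' (default): only row 'A' survives
print(len(df_demo.dropna(how='all')))         # only the fully empty row is dropped
print(len(df_demo.dropna(subset=['Team'])))   # only the 'Team' column is checked
```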


In this case, we may want to simply replace the missing values. We can achieve this using the fillna() function:

In [181]:
df_fix1 = df3.fillna(0)  #Replace with 0
df_fix1 

Unnamed: 0,Team,Wins,Country,Last year won
0,Montréal Canadiens,23,Canada,0.0
1,Toronto Maple Leafs,13,Canada,1967.0
2,Detroit Red Wings,11,US,2008.0
3,Boston Bruins,6,US,2011.0
4,Chicago Blackhawks,6,US,2015.0
5,Edmonton Oilers,5,Canada,1990.0
6,Pittsburgh Penguins,5,US,2017.0
7,St. Louis Eagles,4,US,0.0
8,New York Islanders,4,US,1983.0
9,New York Rangers,4,US,1994.0


In [182]:
df_fix1 = df3.fillna('Unknown')  # Replace with a different category
df_fix1 

Unnamed: 0,Team,Wins,Country,Last year won
0,Montréal Canadiens,23,Canada,Unknown
1,Toronto Maple Leafs,13,Canada,1967.0
2,Detroit Red Wings,11,US,2008.0
3,Boston Bruins,6,US,2011.0
4,Chicago Blackhawks,6,US,2015.0
5,Edmonton Oilers,5,Canada,1990.0
6,Pittsburgh Penguins,5,US,2017.0
7,St. Louis Eagles,4,US,Unknown
8,New York Islanders,4,US,1983.0
9,New York Rangers,4,US,1994.0


In [183]:
# Might replace with mean

df_mean = df3.fillna(df3['Last year won'].mean())
df_mean

Unnamed: 0,Team,Wins,Country,Last year won
0,Montréal Canadiens,23,Canada,1998.55
1,Toronto Maple Leafs,13,Canada,1967.0
2,Detroit Red Wings,11,US,2008.0
3,Boston Bruins,6,US,2011.0
4,Chicago Blackhawks,6,US,2015.0
5,Edmonton Oilers,5,Canada,1990.0
6,Pittsburgh Penguins,5,US,2017.0
7,St. Louis Eagles,4,US,1998.55
8,New York Islanders,4,US,1983.0
9,New York Rangers,4,US,1994.0


In [184]:
# Might replace with median:

df_median = df3.fillna(df3['Last year won'].median())
df_median

Unnamed: 0,Team,Wins,Country,Last year won
0,Montréal Canadiens,23,Canada,2004.5
1,Toronto Maple Leafs,13,Canada,1967.0
2,Detroit Red Wings,11,US,2008.0
3,Boston Bruins,6,US,2011.0
4,Chicago Blackhawks,6,US,2015.0
5,Edmonton Oilers,5,Canada,1990.0
6,Pittsburgh Penguins,5,US,2017.0
7,St. Louis Eagles,4,US,2004.5
8,New York Islanders,4,US,1983.0
9,New York Rangers,4,US,1994.0
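
Calling `fillna()` with a single value fills missing entries in every column. To use a different replacement per column, a dictionary keyed by column name can be passed instead. A sketch on a made-up two-row table (`df_demo` and `filled` are hypothetical names):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'Team': ['A', None],
                        'Last year won': [1990.0, np.nan]})

filled = df_demo.fillna({'Team': 'Unknown',                                # text column
                         'Last year won': df_demo['Last year won'].median()})  # numeric column
print(filled)
```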


## Reading & Writing Data to CSV:

### Writing to CSV

In [185]:
# To save the data we can use the to_csv() function:

df3.to_csv('Stanely_cup_analysis.csv')  # Saves the dataframe as Stanely_cup_analysis.csv in the current working directory

### Reading CSV

In [186]:
# Assuming the file is already in the current working directory:

df_final = pd.read_csv('Stanely_cup_analysis.csv')

# You can explicitly specify column names:
# syntax: pd.read_csv('path/filename.csv', names=['label1', 'label2', 'etc'])
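
By default `to_csv()` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again. The sketch below shows the effect and how `index=False` avoids it; it writes to an in-memory string instead of a file, and `df_demo` is a made-up table.

```python
import io
import pandas as pd

df_demo = pd.DataFrame({'Team': ['Boston Bruins', 'Detroit Red Wings'], 'Wins': [6, 11]})

with_index = df_demo.to_csv()                 # no path given: returns the CSV text
print(pd.read_csv(io.StringIO(with_index)).columns.tolist())

without_index = df_demo.to_csv(index=False)   # skip writing the row labels
print(pd.read_csv(io.StringIO(without_index)).columns.tolist())
```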

In [187]:
# We can look at the imported data using the df.head() and df.tail() functions:

df_final.head(10) # display the first 10 entries in the dataframe; leaving the () empty defaults to 5

Unnamed: 0.1,Unnamed: 0,Team,Wins,Country,Last year won
0,0,Montréal Canadiens,23,Canada,
1,1,Toronto Maple Leafs,13,Canada,1967.0
2,2,Detroit Red Wings,11,US,2008.0
3,3,Boston Bruins,6,US,2011.0
4,4,Chicago Blackhawks,6,US,2015.0
5,5,Edmonton Oilers,5,Canada,1990.0
6,6,Pittsburgh Penguins,5,US,2017.0
7,7,St. Louis Eagles,4,US,
8,8,New York Islanders,4,US,1983.0
9,9,New York Rangers,4,US,1994.0
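
The extra `Unnamed: 0` column in the output above appears because `to_csv()` writes the row index as an unnamed first column by default, and `read_csv()` then reads it back as ordinary data. A minimal sketch of avoiding this follows; `small` is a hypothetical two-row stand-in for `df3`, and in-memory buffers replace real files:

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for df3.
small = pd.DataFrame({"Team": ["Boston Bruins", "Chicago Blackhawks"], "Wins": [6, 6]})

# Default behaviour: the row index is written as an unnamed first column.
with_index = io.StringIO()
small.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)

# index=False leaves the row index out of the file.
without_index = io.StringIO()
small.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)
```

If the file has already been written with the index, `pd.read_csv(path, index_col=0)` is another option: it uses the first column as the index when reading.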


In [188]:
df_final.tail(10) # display the last 10 rows of the DataFrame; with empty parentheses, tail() defaults to 5 rows

Unnamed: 0.1,Unnamed: 0,Team,Wins,Country,Last year won
12,12,Los Angeles Kings,2,US,2014.0
13,13,Colorado Avalanche,2,US,2001.0
14,14,Philadelphia Flyers,2,US,1975.0
15,15,Montreal Maroons,2,Canada,1935.0
16,16,St. Louis Blues,1,US,2019.0
17,17,Anaheim Ducks,1,US,2007.0
18,18,Dallas Stars,1,US,1999.0
19,19,Carolina Hurricanes,1,US,2006.0
20,20,Calgary Flames,1,Canada,1989.0
21,21,Washington Capitals,1,US,2018.0


In [189]:
# Can also look at the number of columns: 

len(df_final.columns)


5

In [190]:
# Can get a statistical summary: 

df_final.describe()

Unnamed: 0.1,Unnamed: 0,Wins,Last year won
count,22.0,22.0,20.0
mean,10.5,4.590909,1998.55
std,6.493587,5.18844,21.17465
min,0.0,1.0,1935.0
25%,5.25,1.25,1989.75
50%,10.5,3.0,2004.5
75%,15.75,5.0,2014.25
max,21.0,23.0,2020.0


### Read more about the JSON file structure. It will be very useful for your projects
https://realpython.com/python-json/
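
As a quick taste, here is a minimal sketch using Python's built-in `json` module. The `record` dictionary and its keys are hypothetical and chosen only for illustration:

```python
import json

# Hypothetical record; JSON stores nested key/value pairs and lists.
record = {"team": "Toronto Maple Leafs", "wins": 13, "last_year_won": 1967}

text = json.dumps(record, indent=2)   # Python dict -> JSON string
restored = json.loads(text)           # JSON string -> Python dict

print(text)
```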

# Part Three: Fingerprinting 

## ML Packages Only Understand Numbers

Machine learning models operate on numbers, so they cannot use non-numerical inputs directly. In real life, however, many things are not numerical; the trick is to find numerical proxies. In the above example, we can give the robot the latitude and longitude of Canada.
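
One way to picture such a proxy is to replace a country label with a pair of coordinates. The sketch below assumes rough, illustrative coordinates stored in a hypothetical `approx_coords` dictionary; treat the numbers as placeholders rather than reference values:

```python
import pandas as pd

# Rough, illustrative coordinates in degrees (latitude, longitude).
# These are placeholders, not reference values.
approx_coords = {
    "Canada": (56.1, -106.3),
    "US": (37.1, -95.7),
}

teams = pd.DataFrame({
    "Team": ["Edmonton Oilers", "Boston Bruins"],
    "Country": ["Canada", "US"],
})

# Replace the categorical label with two numerical proxy columns.
teams["lat"] = teams["Country"].map(lambda c: approx_coords[c][0])
teams["lon"] = teams["Country"].map(lambda c: approx_coords[c][1])

print(teams)
```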

## Numerical Proxies are Also Important in Science

Non-numerical (categorical) labels are everywhere in science. To feed this information to a model as input, we need to find numerical proxies. This featurization of datasets is an active research topic, and many research groups have come up with their own solutions. It is hard to tell which solution is best, since each one is based on a different dataset!

## One Hot Encoding

* One of the most common methods of dealing with categorical data, and also one of the simplest.
* Implemented in many packages, including Pandas and scikit-learn.
* Process:
    1. Create a column for each category of data.
    2. Go through the rows with the categorical data and fill them with dummy/indicator variables (1 or 0):
        1. Enter 1 into column if the category matches column category.
        2. Enter 0 into column if the category does not match column category.
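
The process above can be sketched by hand before reaching for a library function. This is an illustration of the idea under simple assumptions (a single hypothetical `Country` column), not how Pandas implements it internally:

```python
import pandas as pd

# Hypothetical categorical column.
countries = pd.Series(["Canada", "US", "US", "Canada"], name="Country")

# Step 1: one column per category.
categories = sorted(countries.unique())

# Step 2: 1 where the row's category matches the column, 0 otherwise.
encoded = pd.DataFrame(
    {f"Country_{cat}": (countries == cat).astype(int) for cat in categories}
)

print(encoded)
```

Each row ends up with exactly one 1, which is where the name "one hot" comes from.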

<div>
    <img src="img/encoding.png">
</div>

**Source:** https://towardsdatascience.com/building-a-one-hot-encoding-layer-with-tensorflow-f907d686bf39

## One Hot Encoding in Action - Using the Pandas get_dummies function:

In [191]:
# Let's import the modules we will use for this lecture:

import pandas as pd         # Pandas
import numpy as np          # NumPy

In [192]:
# Create a DataFrame to work with:

data = df3
df = pd.DataFrame(data)
df

Unnamed: 0,Team,Wins,Country,Last year won
0,Montréal Canadiens,23,Canada,
1,Toronto Maple Leafs,13,Canada,1967.0
2,Detroit Red Wings,11,US,2008.0
3,Boston Bruins,6,US,2011.0
4,Chicago Blackhawks,6,US,2015.0
5,Edmonton Oilers,5,Canada,1990.0
6,Pittsburgh Penguins,5,US,2017.0
7,St. Louis Eagles,4,US,
8,New York Islanders,4,US,1983.0
9,New York Rangers,4,US,1994.0


In [193]:
# Implement One hot Encoding using the get_dummies() function:

df = pd.get_dummies(df)
df

Unnamed: 0,Wins,Last year won,Team_Anaheim Ducks,Team_Boston Bruins,Team_Calgary Flames,Team_Carolina Hurricanes,Team_Chicago Blackhawks,Team_Colorado Avalanche,Team_Dallas Stars,Team_Detroit Red Wings,...,Team_New York Rangers,Team_Philadelphia Flyers,Team_Pittsburgh Penguins,Team_St. Louis Blues,Team_St. Louis Eagles,Team_Tampa Bay Lightning,Team_Toronto Maple Leafs,Team_Washington Capitals,Country_Canada,Country_US
0,23,,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,13,1967.0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,0
2,11,2008.0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1
3,6,2011.0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,6,2015.0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,1
5,5,1990.0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
6,5,2017.0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1
7,4,,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,1
8,4,1983.0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
9,4,1994.0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
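
Calling `get_dummies()` on the whole DataFrame encodes every non-numerical column, which is why each team gets its own column above. A minimal sketch of encoding only selected columns follows; `small` is a hypothetical three-row stand-in for the DataFrame above:

```python
import pandas as pd

# Hypothetical three-row stand-in for the DataFrame above.
small = pd.DataFrame({
    "Team": ["Edmonton Oilers", "Boston Bruins", "Chicago Blackhawks"],
    "Wins": [5, 6, 6],
    "Country": ["Canada", "US", "US"],
})

# Encode only 'Country'; dtype=int requests 0/1 indicators
# (pandas 2.0 and later return True/False by default).
encoded = pd.get_dummies(small, columns=["Country"], dtype=int)

print(encoded)
```

The `Team` column is left as text here; whether it should be encoded, dropped, or used as an identifier depends on the modelling task.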


### Oxides example

In [None]:
data = {'formula':['IrO2', 'RuO2', 'TiO2', 'Ni3O4'], 'overpotential':[300, 250, 400, 280]}
df2 = pd.DataFrame(data)
df2

## Matminer

Matminer is a powerful Python library that aims to simplify the machine learning pipeline for materials science. It has three key capabilities:

1. Data retrieval tools: import data from various online databases in the form of DataFrames.
2. Data descriptor tools: utilities that describe a material from its composition or structure and represent it in a numerical format that is readily usable as features.
3. Plotting tools: create plots to visualize the data.

The inputs prepared by Matminer can be easily fed into ML packages such as scikit-learn and Keras to conduct machine learning.

**Matminer Homepage:** https://hackingmaterials.lbl.gov/matminer/ 

**Table of Featurizers**: https://hackingmaterials.lbl.gov/matminer/featurizer_summary.html

<div>
    <img src="img/matminer.png" width=600>
 </div>

In [198]:
# Uncomment and run the line below to install Matminer

# %pip install matminer

### 1. Get composition

In [196]:
from matminer.featurizers.conversions import StrToComposition
from tqdm.autonotebook import tqdm # This import suppresses some warnings. You can ignore it.

# Convert the formula from a string into a chemical composition.
# df2 must be the oxides DataFrame defined above (with a 'formula' column);
# otherwise featurize_dataframe raises a KeyError.
df2 = StrToComposition().featurize_dataframe(df2, "formula")
df2.head()

KeyError: "None of [Index(['formula'], dtype='object')] are in the [columns]"

In [197]:
from matminer.featurizers.composition import ElementFraction

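# Get the element fractions (stoichiometry) from the composition.
# Requires the 'composition' column created by StrToComposition in the previous cell.
df2 = ElementFraction().featurize_dataframe(df2, "composition")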
df2

KeyError: "None of [Index(['composition'], dtype='object')] are in the [columns]"

In [None]:
df2['O']
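
Conceptually, an element-fraction featurization turns each formula into the atomic fraction of every element it contains. The library-free sketch below illustrates the idea for the oxides above. `element_fractions` is a hypothetical helper, not part of Matminer, and it assumes simple formulas without parentheses or charges:

```python
import re

def element_fractions(formula):
    """Return {element: atomic fraction} for a simple formula such as 'IrO2'.

    Illustrative helper (not part of Matminer): handles only element symbols
    followed by optional integer counts, with no parentheses or charges.
    """
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    total = sum(counts.values())
    return {symbol: count / total for symbol, count in counts.items()}

for formula in ["IrO2", "RuO2", "TiO2", "Ni3O4"]:
    print(formula, element_fractions(formula))
```

For `IrO2` this gives an oxygen fraction of 2/3, and for `Ni3O4` an oxygen fraction of 4/7.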

In [None]:
from matminer.featurizers.composition import BandCenter

# Estimate the band center of each compound from its composition
df2 = BandCenter().featurize_dataframe(df2, col_id="composition")
df2.head()

In [None]:
from matminer.featurizers.conversions import CompositionToOxidComposition
from matminer.featurizers.composition import OxidationStates

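# Add oxidation states to each composition (creates the 'composition_oxid' column)
df2 = CompositionToOxidComposition().featurize_dataframe(df2, "composition")

# Compute oxidation-state features from the decorated composition
os_feat = OxidationStates()
df2 = os_feat.featurize_dataframe(df2, "composition_oxid")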
df2

In [None]:
from matminer.featurizers.composition import ElectronegativityDiff

# Compute features based on electronegativity differences between cations and anions
df2 = ElectronegativityDiff().featurize_dataframe(df2, col_id="composition_oxid")
df2.head()