Working with NumPy Arrays
======================

Pandas dataframes are built on top of a data structure known as the NumPy array. If you completed the first MolSSI Python scripting workshop, you are already familiar with some properties of NumPy arrays.

In general, you should use a pandas dataframe when working with data that is:
  - Two dimensional (rows and columns).
  - Labeled.
  - Mixed type.
  - Something for which you would like to be able to easily get statistics.

You should work with NumPy arrays when:
  - You have higher dimensional data (for example, a collection of two dimensional arrays).
  - Your data is all numerical.
  - You are using a library which requires NumPy arrays (for example, scikit-learn).
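
As a rough illustration of the two lists above, the sketch below builds one object of each kind. The names `elements` and `grids` and the values in them are made up for this example and are not part of the workshop data set.

```python
import numpy as np
import pandas as pd

# Hypothetical labeled, mixed-type table: a natural fit for a dataframe.
elements = pd.DataFrame(
    {"Symbol": ["H", "He", "Li"], "AtomicMass": [1.008, 4.0026, 7.0]}
)

# Column labels make it easy to pick out data and get statistics.
mean_mass = elements["AtomicMass"].mean()
print(mean_mass)

# Hypothetical stack of four 2x3 numerical grids: a natural fit for a NumPy array.
grids = np.zeros((4, 2, 3))
print(grids.ndim, grids.shape)
```

The dataframe keeps text and numbers side by side under column labels, while the array holds a single numerical type in three dimensions.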

In [1]:
import os
import pandas as pd

import numpy as np

In [2]:
file_path = os.path.join("data", "PubChemElements_all.csv")

df = pd.read_csv(file_path)
df.head()

Unnamed: 0,AtomicNumber,Symbol,Name,AtomicMass,CPKHexColor,ElectronConfiguration,Electronegativity,AtomicRadius,IonizationEnergy,ElectronAffinity,OxidationStates,StandardState,MeltingPoint,BoilingPoint,Density,GroupBlock,YearDiscovered
0,1,H,Hydrogen,1.008,FFFFFF,1s1,2.2,120.0,13.598,0.754,"+1, -1",Gas,13.81,20.28,9e-05,Nonmetal,1766
1,2,He,Helium,4.0026,D9FFFF,1s2,,140.0,24.587,,0,Gas,0.95,4.22,0.000179,Noble gas,1868
2,3,Li,Lithium,7.0,CC80FF,[He]2s1,0.98,182.0,5.392,0.618,+1,Solid,453.65,1615.0,0.534,Alkali metal,1817
3,4,Be,Beryllium,9.012183,C2FF00,[He]2s2,1.57,153.0,9.323,,+2,Solid,1560.0,2744.0,1.85,Alkaline earth metal,1798
4,5,B,Boron,10.81,FFB5B5,[He]2s2 2p1,2.04,192.0,8.298,0.277,+3,Solid,2348.0,4273.0,2.37,Metalloid,1808


In [3]:
# convert the whole dataframe to a NumPy array
df_array = df.to_numpy()

In [4]:
df.shape

(118, 17)

In [5]:
df_array.shape

(118, 17)
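
The shape is unchanged by the conversion, but the element type may not be what you expect. The sketch below uses a small stand-in dataframe (`small_df`, a hypothetical name, with values copied from the first rows shown earlier) to show that converting mixed-type columns gives an array of generic Python objects, while converting only numerical columns gives a numerical array.

```python
import numpy as np
import pandas as pd

# Small stand-in for the elements table (not the full CSV file).
small_df = pd.DataFrame(
    {
        "Symbol": ["H", "He", "Li"],
        "AtomicNumber": [1, 2, 3],
        "AtomicMass": [1.008, 4.0026, 7.0],
    }
)

# Text and numbers together: NumPy falls back to the generic object dtype.
mixed = small_df.to_numpy()
print(mixed.shape, mixed.dtype)      # (3, 3) object

# Numerical columns only: integers and floats are combined into float64.
numeric = small_df[["AtomicNumber", "AtomicMass"]].to_numpy()
print(numeric.shape, numeric.dtype)  # (3, 2) float64
```

This is one reason to select the numerical columns before converting, as the next section does.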

## Reshaping Arrays

In [6]:
# select three numerical columns and convert them to a NumPy array
numbers = df[["AtomicNumber", "Electronegativity", "AtomicRadius"]].to_numpy()

In [7]:
# reshape to (118, 1, 3): slicing the last axis then gives each "category" as its own two dimensional (118, 1) array
numbers = numbers.reshape(-1, 1, 3)
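
To see what this kind of reshape does, here is a minimal sketch on a much smaller array. The array `small` is a hypothetical stand-in for `numbers`, with 4 rows instead of 118.

```python
import numpy as np

# Hypothetical 4x3 array standing in for the 118x3 `numbers` array.
small = np.arange(12, dtype=float).reshape(4, 3)

# -1 asks NumPy to work out that dimension from the total number of elements.
stacked = small.reshape(-1, 1, 3)
print(stacked.shape)       # (4, 1, 3)

# Indexing the last axis picks one column and keeps a two dimensional result.
first_column = stacked[:, :, 0]
print(first_column.shape)  # (4, 1)

# Reshaping rearranges the same values; none are added or lost.
print(np.array_equal(stacked.reshape(4, 3), small))  # True
```

Only one dimension can be given as `-1` in a call to `reshape`, and the product of the other dimensions must divide the number of elements evenly.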

In [8]:
# atomic numbers
numbers[:, :, 0]

array([[  1.],
       [  2.],
       [  3.],
       [  4.],
       [  5.],
       [  6.],
       [  7.],
       [  8.],
       [  9.],
       [ 10.],
       [ 11.],
       [ 12.],
       [ 13.],
       [ 14.],
       [ 15.],
       [ 16.],
       [ 17.],
       [ 18.],
       [ 19.],
       [ 20.],
       [ 21.],
       [ 22.],
       [ 23.],
       [ 24.],
       [ 25.],
       [ 26.],
       [ 27.],
       [ 28.],
       [ 29.],
       [ 30.],
       [ 31.],
       [ 32.],
       [ 33.],
       [ 34.],
       [ 35.],
       [ 36.],
       [ 37.],
       [ 38.],
       [ 39.],
       [ 40.],
       [ 41.],
       [ 42.],
       [ 43.],
       [ 44.],
       [ 45.],
       [ 46.],
       [ 47.],
       [ 48.],
       [ 49.],
       [ 50.],
       [ 51.],
       [ 52.],
       [ 53.],
       [ 54.],
       [ 55.],
       [ 56.],
       [ 57.],
       [ 58.],
       [ 59.],
       [ 60.],
       [ 61.],
       [ 62.],
       [ 63.],
       [ 64.],
       [ 65.],
       [ 66.],
       ...

In [9]:
# A pandas Series converts to a one-dimensional array, so we sometimes have to reshape it into a single column.
names = df["Name"].to_numpy()
names = names.reshape(-1, 1)
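
The effect of this reshape can be checked on a short array. In the sketch below, `names_1d` is a hypothetical three-element stand-in for the `Name` column. Indexing with `np.newaxis` is shown as an equivalent alternative, not as the approach used in this lesson.

```python
import numpy as np

# Hypothetical stand-in for df["Name"].to_numpy()
names_1d = np.array(["Hydrogen", "Helium", "Lithium"])
print(names_1d.shape)  # (3,)

# One column, with as many rows as needed.
column = names_1d.reshape(-1, 1)
print(column.shape)    # (3, 1)

# np.newaxis adds the same length-one axis through indexing.
same_column = names_1d[:, np.newaxis]
print(np.array_equal(column, same_column))  # True

# ravel() flattens the column back to one dimension.
flat = column.ravel()
print(flat.shape)      # (3,)
```

A column shape such as `(3, 1)` is what many libraries expect when each row is one sample with a single feature.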