Working with NumPy Arrays
======================

Pandas dataframes are built on top of a data structure known as the NumPy array. If you completed the first MolSSI Python scripting workshop, you are already familiar with some properties of NumPy arrays.

In general, you should use a pandas dataframe when working with data that is:
  - Two-dimensional (rows and columns)
  - Labeled
  - Mixed type
  - Something for which you would like to be able to easily get statistics
    
You should work with NumPy arrays when:
  - You have higher-dimensional data (for example, a collection of two-dimensional arrays)
  - Your data is all the same type
  - You are doing something very computationally intensive (for example, using scikit-learn)
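
The "mixed type" versus "all the same type" distinction can be seen directly in the dtypes. The following is a minimal sketch that uses a small hypothetical dataframe (`toy`, with made-up column choices) rather than the PubChem file, so it runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration, not read from the PubChem file
toy = pd.DataFrame({
    "Symbol": ["H", "He", "Li"],
    "AtomicNumber": [1, 2, 3],
    "AtomicMass": [1.008, 4.0026, 7.0],
})

# A dataframe keeps a separate dtype for each column
column_dtypes = toy.dtypes

# A NumPy array has a single dtype, so mixing text and numbers falls back to object
mixed_array = toy.to_numpy()

# Selecting only numeric columns gives a numeric array (integers are upcast to float)
numeric_array = toy[["AtomicNumber", "AtomicMass"]].to_numpy()

print(column_dtypes)
print(mixed_array.dtype)
print(numeric_array.dtype)
```

An `object` array loses most of the speed advantage of NumPy, which is one reason to select the numeric columns before converting.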

In [None]:
import os
import pandas as pd

import numpy as np

In [None]:
file_path = os.path.join("data", "PubChemElements_all.csv")

df = pd.read_csv(file_path)
df.head()

In [None]:
df_array = df.to_numpy()

In [None]:
df.shape

In [None]:
df_array.shape
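
The two shapes match, but the array no longer carries row or column labels. One possible way to keep track of columns is to save the column names before converting. This is a sketch on a hypothetical dataframe named `toy`; the helper variables `column_names` and `col` are illustrative only.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({
    "AtomicNumber": [1, 3, 4],
    "Electronegativity": [2.2, 0.98, 1.57],
})

toy_array = toy.to_numpy()

# The array has the same shape as the dataframe but only positional indices
same_shape = toy_array.shape == toy.shape

# Keep the labels separately so a column can be looked up by name
column_names = list(toy.columns)
col = column_names.index("Electronegativity")
electronegativity = toy_array[:, col]

print(same_shape)
print(electronegativity)
```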

# Reshaping Arrays

In [None]:
numbers = df[["AtomicNumber", "Electronegativity", "AtomicRadius"]].to_numpy()

In [None]:
# Reshape to (n_elements, 1, 3): slicing the last axis then returns each "category" as its own two-dimensional array
numbers = numbers.reshape(-1, 1, 3)

In [None]:
# atomic numbers
numbers[:, :, 0]
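
The `-1` in `reshape` tells NumPy to infer that axis length from the total number of values. A minimal sketch of the same reshape on a small stand-in array (`values` is hypothetical and replaces the three dataframe columns):

```python
import numpy as np

# Hypothetical stand-in for the three selected columns: 4 rows, 3 categories
values = np.arange(12, dtype=float).reshape(4, 3)

# -1 asks NumPy to work out this axis: 12 values / (1 * 3) = 4
stacked = values.reshape(-1, 1, 3)

# Slicing the last axis returns one category as a two-dimensional column
first_category = stacked[:, :, 0]

print(stacked.shape)         # (4, 1, 3)
print(first_category.shape)  # (4, 1)
```

Reshaping never changes the number of values, so it fails with a `ValueError` if the requested shape cannot hold exactly the same number of elements.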

In [None]:
# A pandas Series converts to a one-dimensional array. Reshape it to a single column, shape (n, 1), when a two-dimensional array is needed.
names = df["Name"].to_numpy()
names = names.reshape(-1, 1)
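
The difference between the one-dimensional array and the reshaped column is easiest to see in the shapes. A small sketch with a hypothetical Series (`names_series`) standing in for `df["Name"]`:

```python
import pandas as pd

# Hypothetical Series standing in for df["Name"]
names_series = pd.Series(["Hydrogen", "Helium", "Lithium"], name="Name")

# Converting a Series gives a one-dimensional array
flat = names_series.to_numpy()

# reshape(-1, 1) turns it into a single column with one row per element
column = flat.reshape(-1, 1)

# ravel() flattens the column back to one dimension
restored = column.ravel()

print(flat.shape)    # (3,)
print(column.shape)  # (3, 1)
```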