# Python Modules

* What makes Python so powerful? Modules!
* Python is open source, which means that developers all around the world can contribute to its development!
* Modules are collections of Python functions and classes that you can import and use in your code!
* As mentioned earlier, before spending a lot of time writing your own functions, try searching online; there is probably a module out there that can do what you need.
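
As a minimal illustration of importing and using a module, the sketch below uses `math` and `statistics` from Python's standard library. They are chosen here only as examples and are not part of this lecture's material:

```python
# Import a whole module, optionally under a shorter alias
import math
import statistics as stats

# Import a single function from a module
from math import sqrt

radius = 2.0
area = math.pi * radius ** 2          # use a constant defined in the module
root = sqrt(16)                       # function imported directly
average = stats.mean([1, 2, 3, 4])    # function accessed through the alias

print(area, root, average)
```

The alias form (`import ... as ...`) is the same pattern used for `import numpy as np` below.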

## Part One: NumPy

### Why NumPy?

<div>
<img src="img/numpy.gif" width="600"/>
</div>

Read more: https://towardsdatascience.com/why-is-numpy-awesome-3f8f011abf70#:~:text=NumPy%20can%20be%20used%20to,calculations%20you%20can%20use%20np

In [1]:
# Importing the Numpy Module

import numpy as np

In [2]:
np.__version__

'2.1.2'

In [3]:
# let's make an array using the array() function and assign it to variable "a"

a = np.array([1,2,3])
a

array([1, 2, 3])

An **array** object represents a multidimensional, homogeneous array of fixed-size items


In [4]:
# The variable is a np.array object, use the type() function to confirm that

type(a)

numpy.ndarray

An associated **data-type** object describes the format of each element in the array (its byte-order, how many bytes it occupies in memory, whether it is an integer, a floating point number, or something else, etc.)

In [5]:
# The type of the data stored in the array can be checked using array.dtype attribute

a.dtype

dtype('int64')

In [6]:
# We can explicitly specify the data type we want
a = np.array([1,2,3], dtype = 'int64')

a.dtype

dtype('int64')
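
The attributes of a data-type object can be inspected directly. A small sketch (the array names are illustrative):

```python
import numpy as np

small = np.array([1, 2, 3], dtype='int32')    # 4 bytes per element
large = np.array([1, 2, 3], dtype='float64')  # 8 bytes per element

print(small.dtype, small.dtype.itemsize, small.dtype.kind)
print(large.dtype, large.dtype.itemsize, large.dtype.kind)

# astype() returns a converted copy; the original array is unchanged
as_float = small.astype('float64')
print(as_float.dtype, small.dtype)
```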

Arrays have different **dimensions**, **sizes**, and **shapes**:

* 1- dimension (`ndim`): number of dimensions (axes)
* 2- size: total number of elements
* 3- shape: number of elements along each dimension, e.g. (rows, columns) for a 2D array

In [7]:
print("Array a is ", a.ndim, "D")
print("Array a has", a.size, " elements")
print("Array a has a shape of: ", a.shape)

Array a is  1 D
Array a has 3  elements
Array a has a shape of:  (3,)


In [8]:
# We can construct 2D arrays 

b = np.array([[1,2,3,4],
              [5,6,7,8],
              [4,3,2,1]]
            )

print("Array b is ", b.ndim, " dimensional")
print("Array b has", b.size, " elements")
print("Array b has a shape of: ", b.shape, "[3 Rows & 4 Columns]")

Array b is  2  dimensional
Array b has 12  elements
Array b has a shape of:  (3, 4) [3 Rows & 4 Columns]


**Can arrays contain heterogeneous data types?**

In [9]:
b = np.array([['a',2,3,4],
              [5,6,7,8],
              [4,3,2,1]]
            )

In [10]:
b = np.array([['a',2,3,4],
              [5,6,7,8],
              [4,3,2,1],], dtype='int64'
            )

ValueError: invalid literal for int() with base 10: 'a'
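
A short sketch of what happens in the two cells above: without an explicit `dtype`, NumPy picks a single common type that can hold every element, so mixing strings and numbers yields an array of strings rather than a mixed array.

```python
import numpy as np

mixed = np.array(['a', 2, 3])      # every element is converted to a string
print(mixed, mixed.dtype.kind)     # kind 'U' means Unicode string

numbers = np.array([1, 2.5, 3])    # ints are promoted to floats
print(numbers, numbers.dtype)
```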

Knowledge of the shape of the arrays is integral to performing operations on them. This concept also extends to **Tensors**, which are at the heart of deep learning packages such as **PyTorch** and **TensorFlow**. Let's take a look at this in an example:

In [11]:
# let's try create an array "c" by adding our 1D array ("a") & 2D array ("b")

c = a + b

UFuncTypeError: ufunc 'add' did not contain a loop with signature matching types (dtype('int64'), dtype('<U21')) -> None

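**Array shapes need to be the same, or compatible under NumPy's broadcasting rules, for element-wise operations.** Note that the error above is raised because of the data types: `b` still holds strings from the previous cell (dtype `<U21`), and strings cannot be added to integers. The shapes of `a` (3,) and `b` (3, 4) are not compatible either, so the addition would still fail with numeric data.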

<div>
<img src="img/dim_img.gif" width="200"/>
</div>

Read more: https://towardsdatascience.com/understanding-dimensions-in-pytorch-6edf9972d3be
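
To make the shape rule concrete, here is a minimal sketch with freshly defined arrays (`grid`, `row`, and `short` are illustrative names). NumPy aligns shapes from the last axis backwards; an axis is compatible when the sizes match or one of them is 1:

```python
import numpy as np

grid = np.array([[1, 2, 3, 4],
                 [5, 6, 7, 8],
                 [4, 3, 2, 1]])        # shape (3, 4)
row = np.array([10, 20, 30, 40])      # shape (4,) matches the last axis of grid
short = np.array([1, 2, 3])           # shape (3,) does not match 4

print(grid + row)                     # row is added to every row of grid

try:
    grid + short
except ValueError as err:
    print("Incompatible shapes:", err)

# Turning short into a column of shape (3, 1) makes it compatible
print(grid + short.reshape(3, 1))
```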

**What if you need to change the dimensionality of your array?**
##### We can use the reshape() function to change 1D arrays to 2D

In [12]:
a2 = np.random.random(12)  # Generate 1D array with random numbers 
print(a2)

[0.66398067 0.07230728 0.74471117 0.05378114 0.18860354 0.87486186
 0.3384429  0.73011804 0.14667808 0.06769395 0.91763343 0.36112798]


In [13]:
print(type(a2))
print(a2.dtype)
print(a2.size)

<class 'numpy.ndarray'>
float64
12


In [14]:
b = np.array([[1,2,3,4],
              [5,6,7,8],
              [4,3,2,1]]
            )
c = a2 + b

ValueError: operands could not be broadcast together with shapes (12,) (3,4) 

In [15]:
print("Array a2 has the following shape: ",a2.shape)
print("Remember array b's shape is: ", b.shape)

Array a2 has the following shape:  (12,)
Remember array b's shape is:  (3, 4)


In [16]:
a2 = a2.reshape(3,4)   #reshape function: array.reshape()
print(a2)
print(b)

[[0.66398067 0.07230728 0.74471117 0.05378114]
 [0.18860354 0.87486186 0.3384429  0.73011804]
 [0.14667808 0.06769395 0.91763343 0.36112798]]
[[1 2 3 4]
 [5 6 7 8]
 [4 3 2 1]]


##### Now let's try adding our 2D arrays:

In [17]:
c = a2+b
c

array([[1.66398067, 2.07230728, 3.74471117, 4.05378114],
       [5.18860354, 6.87486186, 7.3384429 , 8.73011804],
       [4.14667808, 3.06769395, 2.91763343, 1.36112798]])

**We can also convert the 2D array into a 1D array using the ravel() function**

In [18]:
a2 = a2.ravel()
b = b.ravel()

In [19]:
print(a2)
print(b)

[0.66398067 0.07230728 0.74471117 0.05378114 0.18860354 0.87486186
 0.3384429  0.73011804 0.14667808 0.06769395 0.91763343 0.36112798]
[1 2 3 4 5 6 7 8 4 3 2 1]


In [20]:
c = a2 + b 
print(c)

[1.66398067 2.07230728 3.74471117 4.05378114 5.18860354 6.87486186
 7.3384429  8.73011804 4.14667808 3.06769395 2.91763343 1.36112798]


In [21]:
print(type(c))

<class 'numpy.ndarray'>


In [22]:
c = c.reshape(3,4)
print(c)

[[1.66398067 2.07230728 3.74471117 4.05378114]
 [5.18860354 6.87486186 7.3384429  8.73011804]
 [4.14667808 3.06769395 2.91763343 1.36112798]]
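
Two details of `reshape()` and `ravel()` that are easy to miss, shown as a small sketch with illustrative arrays: one dimension can be given as `-1` so NumPy infers it from the array size, and `ravel()` returns a view of the original data when it can, whereas `flatten()` always returns a copy.

```python
import numpy as np

x = np.arange(12)

m = x.reshape(3, -1)        # -1 is inferred as 4 because 12 / 3 = 4
print(m.shape)

flat_view = m.ravel()       # view: shares memory with m for a contiguous array
flat_copy = m.flatten()     # copy: independent of m

flat_view[0] = 100          # also changes m (and x)
flat_copy[1] = -1           # does not change m

print(m[0])
```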


**Another common transformation you may want to do is transposition [Row <--> Column]:**

<div>
    <img src="img/transpose.gif" width="200">
</div>

Source: https://commons.wikimedia.org/wiki/File:Matrix_transpose.gif

In [23]:
b = b.reshape(3,4)
print(b)

[[1 2 3 4]
 [5 6 7 8]
 [4 3 2 1]]


In [24]:
# Let's transpose our array (b):

b_t = b.transpose()
print(b_t)

[[1 5 4]
 [2 6 3]
 [3 7 2]
 [4 8 1]]


In [None]:
print(b.shape, b_t.shape)

### Other ways to create Arrays:

You already saw the use of the **random()** function of the **numpy.random** module, 
it will generate an array of a given shape with random numbers: 

In [25]:
a = np.random.random(3)   # 1D array of size 3 
print(a)
print(type(a))

[0.84245157 0.00512282 0.70705285]
<class 'numpy.ndarray'>


We can also specify a shape:

In [26]:
a = np.random.random((4,4))  # 2D array of size 16 (4X4)
print(a)

[[0.09821719 0.72168684 0.66745803 0.40147631]
 [0.08877    0.6621942  0.42933352 0.66505953]
 [0.90618392 0.78469907 0.44493132 0.16620762]
 [0.07951974 0.70670697 0.24490607 0.74824536]]


Another very useful function for making NumPy arrays is **arange()**. This function generates NumPy arrays with numerical sequences that follow particular rules depending on the passed arguments.

* For example, we can make an array with values ranging from 0 to 10:

In [27]:
a = np.arange(0,11)   # np.arange(starting value, ending value excluded)
print(type(a))
print(a.ndim)
print(a)

<class 'numpy.ndarray'>
1
[ 0  1  2  3  4  5  6  7  8  9 10]


In [28]:
a = np.arange(10,101,10)  #The third value is the size of the intervals between values
print(a)
print(a.size)

[ 10  20  30  40  50  60  70  80  90 100]
10


We can also make multidimensional arrays by combining **arange()** with **reshape()**

In [29]:
a = np.arange(0,8).reshape(2,2,2)      #3D arrays made by using arange() + reshape()
print(a)

[[[0 1]
  [2 3]]

 [[4 5]
  [6 7]]]


Sometimes you may want to make **zeros** and **ones** Arrays

In [30]:
d = np.zeros((3, 3))   # Make a 3x3 array filled with zeros:
print(d)

[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]


In [31]:
e = np.ones((2,2,2))  # Make a 2x2x2 array filled with 1s:
print(e)

[[[1. 1.]
  [1. 1.]]

 [[1. 1.]
  [1. 1.]]]


## Some Basic Operations

### Element-wise operators

The most basic operators are element-wise operators i.e., applying operations on individual elements of arrays.

In [32]:
a = np.arange(0,9)
a

array([0, 1, 2, 3, 4, 5, 6, 7, 8])

In [33]:
a+10

array([10, 11, 12, 13, 14, 15, 16, 17, 18])

In [34]:
a*2

array([ 0,  2,  4,  6,  8, 10, 12, 14, 16])

In [35]:
b = np.random.random(9)
a+b

array([0.62777636, 1.471232  , 2.32234804, 3.79320946, 4.38055023,
       5.86576162, 6.89814412, 7.17705486, 8.50476304])

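**Useful functions:**
* Element-wise (return one value per element):
    * np.sqrt()
    * np.square()
    * np.log()
* Reductions (summarize the whole array, or one axis of it):
    * np.sum()
    * np.mean()
    * np.min()
    * np.max()
    * np.std()
    * np.argmin(): returns the index of the minimum element of the array, optionally along a particular axis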

In [36]:
c= np.random.random(4)
c_sqrt = np.sqrt(c)

print(c)
print(c_sqrt)

[0.14936247 0.02286064 0.79677855 0.18043916]
[0.38647441 0.15119735 0.89262453 0.42478131]


In [37]:
c= np.random.random(4)
c_argmin = np.argmin(c)   # index of the smallest element

print(c)
print(c_argmin)

[0.28162425 0.09167325 0.50567924 0.14727631]
1
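
For multidimensional arrays, the reductions in the list above accept an `axis` argument that selects the direction along which they operate. A brief sketch with an illustrative 2D array:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

print(np.sum(m))             # sum of all elements
print(np.sum(m, axis=0))     # collapse the rows: one sum per column
print(np.sum(m, axis=1))     # collapse the columns: one sum per row
print(np.mean(m, axis=0))    # mean of each column
print(np.argmin(m, axis=1))  # position of the minimum within each row
```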


### Matrix operators

In [38]:
#Determinant

import numpy as np
a = np.array([[3, 1],
               [1, 3]])
print(a)
det = np.linalg.det(a)
print("\nDeterminant:", np.round(det))

[[3 1]
 [1 3]]

Determinant: 8.0


In [39]:
#Eigenvalues and eigenvectors 

w, v = np.linalg.eig(a)
print("\nEigenvalues:")
print(w)
print("\nEigenvectors:")
print(v)

e = np.linalg.eig(a)
e


Eigenvalues:
[4. 2.]

Eigenvectors:
[[ 0.70710678 -0.70710678]
 [ 0.70710678  0.70710678]]


EigResult(eigenvalues=array([4., 2.]), eigenvectors=array([[ 0.70710678, -0.70710678],
       [ 0.70710678,  0.70710678]]))
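
One way to sanity-check the result is to verify the defining relation `A v = λ v` for each eigenpair. Note that the eigenvectors are the columns of the returned matrix. A minimal sketch:

```python
import numpy as np

a = np.array([[3, 1],
              [1, 3]])
w, v = np.linalg.eig(a)

for i in range(len(w)):
    lhs = a @ v[:, i]        # matrix times the i-th eigenvector (a column of v)
    rhs = w[i] * v[:, i]     # eigenvalue times the same eigenvector
    print(np.allclose(lhs, rhs))

# The determinant equals the product of the eigenvalues
print(np.isclose(np.prod(w), np.linalg.det(a)))
```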

## Indexing & Slicing

We can extract elements and sections of the arrays just like with lists

### Indexing
Array indexing refers to the use of square brackets (‘[ ]’) to grab elements individually for various uses such as extracting a value, selecting items, or even assigning a new value. Let's look at this in practice.

In [None]:
a = np.arange(0,101)

In [None]:
# Extract first and last value
print(a[0], a[-1])

To select multiple items at once, you can pass a list of indices within the square brackets

In [None]:
print(a[[1,22,-1]])

**Can we select multiple items the same way in lists?**

In [None]:
a = np.arange(0,101)
a = list(a)
type(a)

In [None]:
print(a[[1,2,-1]])   # TypeError: list indices must be integers or slices, not list

**In the case of 2D arrays, rows and columns are treated like coordinates.**
* They are represented as rectangular matrices consisting of rows and columns.
    * Defined by two axes, where axis 0 is represented by the rows and axis 1 is represented by the columns.
    * Index with two values [row index, column index].

In [None]:
A = np.arange(0,16).reshape(4,4)
print(A)

In [None]:
print(A[0,3])          # indexing single values
print(A[[0,3],[2,0]])  # indexing multiple values [Row list] , [Column List]
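
A short sketch of a few more 2D indexing patterns on the same 4x4 array: selecting a whole row or column, pairing row and column lists, and assigning through an index.

```python
import numpy as np

A = np.arange(0, 16).reshape(4, 4)

print(A[1])                # second row
print(A[:, 1])             # second column
print(A[[0, 3], [2, 0]])   # pairs up positions (0, 2) and (3, 0)

A[0, 0] = -1               # indexing can also assign a new value
print(A[0])
```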

### Slicing

Slicing is the operation which allows you to extract portions of an array to generate new ones

**Array[Start:End]**

In [None]:
a = np.arange(0,11)   #Create an array 
a[0:5]                # Take a slice from index 0 up to, but not including, index 5

In [None]:
# step size

a[0:10:3]   # Take a slice of the array from 0 to 9 - every 3rd value starting with 0  
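
A few more slicing patterns, sketched on the same kind of array. Unlike list slices, a NumPy slice is a view of the original data, so call `.copy()` when an independent array is needed:

```python
import numpy as np

a = np.arange(0, 11)

print(a[:3])       # start omitted: from the beginning
print(a[8:])       # end omitted: through the last element
print(a[::-1])     # negative step: reversed

view = a[0:3]
view[0] = 99       # modifies a as well, because view shares its data
safe = a[0:3].copy()
safe[1] = -5       # a is unaffected

print(a[:3])
```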

## Part Two: Pandas

* **Pandas [the name is derived from "panel data"] is the workhorse for data analysis & manipulation in Python.** 
    * Provides a tabular interface to interact with data - feels like Excel. 
    * Open-source library 
    * Built on top of NumPy, providing high-performance, easy-to-use data structures and data analysis tools 

* **Has 2 main data objects/containers: Data Series & Data Frames.** 
    * Data Series - 1D labeled data
    * Data Frames - 2D tabular data (rows and columns)
    
    
* **Useful references:**
    * Documentation: https://pandas.pydata.org/docs/  


#Installing Pandas

#pip install pandas
#conda install pandas

In [None]:
# Import our modules:

import pandas as pd         
import numpy as np

In [None]:
pd.__version__

## Data Series

The Series is made up of 2 arrays (index & value) linked to each other. You have a **value column**, which can hold data of any NumPy type, and each of these values is associated with a label provided in the **index column**. 

<div>
<img src="img/series_spreadsheet.png" width="200">
</div>
    

**Source:** https://codechalleng.es/bites/251/
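
A minimal sketch of building a Series by hand; the values and labels below are made up for illustration:

```python
import pandas as pd

s = pd.Series([4.2, 7.1, 5.5], index=['a', 'b', 'c'])

print(s.index)       # the labels
print(s.values)      # the underlying NumPy array of values
print(s['b'])        # look up a value by its label
print(s[s > 5])      # boolean filtering keeps the matching labels
```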

## Data Frames

A combination of multiple Data Series/Columns, each of which can contain different data types (numeric, string, Boolean, etc.). Given the multidimensional nature of DataFrames, the data values are now linked to 2 different indices: row number and column number

<div>
    <img src="img/series-and-dataframe.png" width="600">
</div>

**Source:** https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/

### Creating Data Frames

In [None]:
# Easiest way is to convert a dictionary to DF

data = {'Countries' : ['Mexico','Spain','England','Argentina','New Zealand'],'Avg Age' : [92.1,55.3,81.5,63,74.5]}
type(data)

In [None]:
df = pd.DataFrame(data)
df

In [None]:
# We can select one or more columns

df = pd.DataFrame(data, columns=['Avg Age'])
df

In [None]:
# We can explicitly provide labels for indices:

df = pd.DataFrame(data, index = ['one', 'two', 'three', 'four', 'five'])
df

## Data Extraction (Indexing/Slicing)

In [None]:
#importing data from excel file

file = "Stanely_cup_winners.xlsx"
df = pd.read_excel(file)
df

In [None]:
# Obtain the column labels using Df.columns 

df.columns

In [None]:
# Obtain the index labels using Df.index 

df.index

In [None]:
# Obtain the values using Df.values (row by row):

df.values

In [None]:
# Obtain the values in a given column: df['column label']:

df['Team']

In [None]:
# The result of slicing a column is a Series:

type(df['Team'])

In [None]:
# We can also extract the values in a column by calling the column label as an attribute of the Dataframe:

df.Team


In [None]:
# Extracting rows using the iloc

df.iloc[1]

In [None]:
# This can be used for multiple rows:

#What are the top three teams that won Stanley Cup?

df.iloc[[0,1,2]]

In [None]:
# We can also obtain a range using a slicing approach:

df.iloc[20:]

In [None]:
# Get a single value from cell index

df['Team'][0]

In [None]:
#Another way of doing this

df.iloc[0, 0]   # [row position, column position]

In [None]:
#Can we get the index of a certain value? similar to the find function in Excel

df['Team'].where(df['Team'] == 'Toronto Maple Leafs').dropna().index[0]

# You can also achieve the same result with the method below
df.index[df['Team'] == 'Toronto Maple Leafs'][0]
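
The difference between `.loc` (label-based) and `.iloc` (position-based) only becomes visible when the index is not the default 0, 1, 2, ... The sketch below uses a small made-up table rather than the Stanley Cup data:

```python
import pandas as pd

toy = pd.DataFrame({'Team': ['A', 'B', 'C'], 'Wins': [5, 3, 9]},
                   index=[10, 20, 30])

print(toy.loc[20, 'Team'])                # row labelled 20, column labelled 'Team'
print(toy.iloc[1, 0])                     # second row, first column (same cell)
print(toy.loc[toy['Wins'] > 4, 'Team'])   # boolean mask selects rows
print(toy.index[toy['Team'] == 'C'][0])   # index label of a matching row
```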

## Assigning new Values

### Adding & Deleting Columns:

In [None]:
# Adding a column is simple, provide the name followed by = value(s)

df['Country'] = ['US'] * 22   # one 'US' entry for each of the 22 rows
df

In [None]:
# Deleting is also easy: del df['column name']. Note: it is not recommended to modify the original dataframe.

del df['Country']
df

In [None]:
df['Country'] = ['US'] * 22   # re-create the column deleted above

In [None]:
# We can also delete the columns using the df.drop() function:

df2 = df.drop(['Country'], axis=1)       # Axis 1 = columns, 0 = Rows
df2

In [None]:
# Assign with .loc[row labels, column label]; this avoids chained indexing and its warnings
df.loc[[2, 8, 10, 11, 20], 'Country'] = 'Canada'

In [None]:
df
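
Assigning through a single `.loc[rows, column]` call avoids chained indexing such as `df['col'][i] = value`, which newer pandas versions warn about. A sketch on a made-up table (not the Stanley Cup data):

```python
import pandas as pd

toy = pd.DataFrame({'Team': ['A', 'B', 'C', 'D'],
                    'Country': ['US', 'US', 'US', 'US']})

toy.loc[2, 'Country'] = 'Canada'                 # one cell
toy.loc[[0, 3], 'Country'] = 'Canada'            # several rows at once
toy.loc[toy['Team'] == 'B', 'Country'] = 'N/A'   # rows chosen by a condition

print(toy)
```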

In [None]:
# Fill column with a single constant value:

df['Last year won'] = 0
df

### Looking for Values within the DataFrames: ***the isin() function***

In [None]:
df.isin(['US', 
        'Canada',1])#.any()
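
`isin()` returns a table of booleans. Applied to a single column, the result can be used as a mask to filter rows. A sketch with made-up data:

```python
import pandas as pd

toy = pd.DataFrame({'Team': ['A', 'B', 'C'],
                    'Country': ['US', 'Canada', 'US']})

mask = toy['Country'].isin(['Canada', 'Mexico'])   # one boolean per row
print(mask)
print(toy[mask])          # rows whose country is in the list
print(toy[~mask])         # ~ inverts the mask
```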

### Sorting, Filtering, Transposition

In [None]:
df3 = df.sort_values(by=['Wins'], ascending=False)
df3

In [None]:
#Setting a new index column

df3=df3.set_index(np.arange(0,22))
df3

In [None]:
# Filtering is done just like in the case of series:

df3[df3['Wins'] > 1]     

In [None]:
# Transposition can be obtained by simply using the df.T attribute:

df.T

## Statistical Analysis

In [None]:
# Statistical summary can be obtained using the df.describe() function:

df3.describe()

In [None]:
# You can still obtain the sum and mean like we did in NumPy:

print(df.sum())
#print(df.mean(numeric_only=True))   # numeric_only=True skips the text columns

In [None]:
#Which country won the most Stanley Cups to date?

df4 = df3.groupby(by=['Country']).sum()
df4
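
Selecting the numeric column before aggregating keeps text columns such as the team name out of the sum. A sketch with made-up numbers (the column names mirror the ones used above):

```python
import pandas as pd

toy = pd.DataFrame({'Team': ['A', 'B', 'C', 'D'],
                    'Country': ['US', 'Canada', 'US', 'Canada'],
                    'Wins': [5, 13, 3, 24]})

wins_by_country = toy.groupby('Country')['Wins'].sum()
print(wins_by_country)
print(wins_by_country.idxmax())     # label of the largest total

# Several aggregations at once
print(toy.groupby('Country')['Wins'].agg(['sum', 'mean', 'count']))
```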

# Let us have some fun with the Data

#### Update the column with the last year the team won a Stanley Cup using the data from the other Excel sheet

In [None]:
#1 import the new excel sheet with the data from every year

file = "Stanely_cup_winners_by_year.xlsx"
df_yr = pd.read_excel(file)
df_yr

In [None]:
#2 run through the data to extract the last year each team won the Stanley Cup

for team, year in zip(df_yr['Team'], df_yr['Year']):
    if team in df3['Team'].values:
        # Row label of this team in df3
        indx = df3.index[df3['Team'] == team][0]
        # Keep the most recent year; .loc avoids chained assignment
        if year > df3.loc[indx, 'Last year won']:
            df3.loc[indx, 'Last year won'] = year


In [None]:
df3
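
The same result can usually be reached without an explicit loop. The sketch below is one possible alternative, shown on made-up data with the same column names (`Team`, `Year`): group the yearly table by team, take the maximum year, and map it onto the summary table.

```python
import pandas as pd

yearly = pd.DataFrame({'Team': ['A', 'B', 'A', 'C', 'B'],
                       'Year': [1990, 1995, 2001, 2010, 2015]})
summary = pd.DataFrame({'Team': ['A', 'B', 'D']})

last_win = yearly.groupby('Team')['Year'].max()            # Series: team -> latest year
summary['Last year won'] = summary['Team'].map(last_win)   # NaN if the team is absent

print(summary)
```

Teams that never appear in the yearly table end up with NaN rather than 0, which fits the missing-value handling discussed next.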

In [None]:
# Correlation (Pearson Correlation Coefficient):

df3['Wins'].corr(df3['Last year won'])

## Dealing with Missing Values (NaN Values)

As mentioned in the series section, experimental datasets often have missing values. We can deal with such data in many different ways. For example, if you have a large dataset you may simply choose to remove such data points, or, if the data concerned is numerical, you may replace the NaN value with the average of the column. Let's take a look at how to do these operations:

### Dropping NaN values: 

In [None]:
#Creating a dataframe with NaN values: Converting 0 in the year column to NaN

df3['Last year won'] = df3['Last year won'].replace({0:np.nan})
df3

In [None]:
# We could just drop all NaN data points:

df_fix = df3.dropna()
df_fix

# Deleted the row with NaN value(s)

In [None]:
# We can also drop every column that contains NaN values, using the axis argument:

df_fix = df3.dropna(axis=1)
df_fix

Note: df.dropna() essentially drops all the values in the row or column with the missing data. To avoid this, we can use the how argument. This argument takes two possible inputs:

* ‘any’ : If any NA values are present, drop that row or column. [This is the Default]

* ‘all’ : If all values are NA, drop that row or column.

In [None]:
df_fix = df3.dropna(how='all')
df_fix 

# So we don't drop the row, because not all of its values were NaN
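
The `how` and `axis` arguments, together with `subset` (which restricts the check to chosen columns), can be compared side by side on a small made-up table:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Team': ['A', 'B', 'C'],
                    'Wins': [5, np.nan, np.nan],
                    'Year': [2001, 1990, np.nan]})

print(toy.dropna())                  # how='any': keeps only row 0
print(toy.dropna(how='all'))         # no row is entirely NaN, so nothing is dropped
print(toy.dropna(subset=['Year']))   # only look at the 'Year' column: drops row 2
print(toy.dropna(axis=1))            # drops the columns that contain any NaN
```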

In this case, we may want to simply replace the missing values. We can achieve this using the fillna() function:

In [None]:
df_fix1 = df3.fillna(0)  #Replace with 0
df_fix1 

In [None]:
df_fix1 = df3.fillna('Unknown')  # Replace with a different category
df_fix1 

In [None]:
# Might replace with mean

df_mean = df3.fillna(df3['Last year won'].mean())
df_mean

In [None]:
# Might replace with median:

df_median = df3.fillna(df3['Last year won'].median())
df_median
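
Note that calling `fillna()` on the whole DataFrame fills missing values in every column with the same number. To restrict the replacement to one column, fill that column only. A sketch with made-up data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Wins': [5.0, np.nan, 3.0],
                    'Year': [2000.0, 1990.0, np.nan]})

filled = toy.copy()                  # keep the original untouched
filled['Year'] = filled['Year'].fillna(filled['Year'].mean())

print(filled)
```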

## Reading & Writing Data to CSV:

### Writing to CSV

In [None]:
# To save the data we can use the to_csv() function:

df3.to_csv('Stanely_cup_analysis.csv')  # Will save the dataframe as a csv with this file name in the directory you are in

### Reading CSV

In [None]:
# Assuming you have the data already in the same directory/file:

df_final = pd.read_csv('Stanely_cup_analysis.csv')

# Can  explicitly specify column names:
# syntax: pd.read_csv('path/filename.csv', names=['label1', 'label2', 'etc'])
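
By default `to_csv()` also writes the index, which then comes back as an extra unnamed column when the file is read. Two common ways to handle this are sketched below; an in-memory buffer (`io.StringIO`) stands in for a file on disk so the example is self-contained:

```python
import io
import pandas as pd

toy = pd.DataFrame({'Team': ['A', 'B'], 'Wins': [5, 3]})

# Option 1: do not write the index at all
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
back1 = pd.read_csv(buffer)

# Option 2: write the index, then tell read_csv which column holds it
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)
back2 = pd.read_csv(buffer, index_col=0)

print(list(back1.columns), list(back2.columns))
```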

In [None]:
# We can look at the imported data using the df.head() and df.tail() functions:

df_final.head(10) # display the top 10 entries in the dataframe - note can leave the () empty, this will default a value of 5

In [None]:
df_final.tail(10) # display the bottom 10 entries in the dataframe - note can leave the () empty, this will default a value of 5

In [None]:
# Can also look at the number of columns: 

len(df_final.columns)


In [None]:
# Can get a statistical summary: 

df_final.describe()

### Read more about the JSON file structure. It will be very useful for your projects
https://realpython.com/python-json/
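
As a quick taste, the sketch below uses Python's built-in `json` module to convert a dictionary to a JSON string and back. The record itself is made up:

```python
import json

record = {'team': 'A', 'wins': 5, 'years': [1990, 2001]}

text = json.dumps(record, indent=2)   # Python dict -> JSON string
print(text)

restored = json.loads(text)           # JSON string -> Python dict
print(restored == record)
```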

# Part Three: Fingerprinting 

## ML Packages Only Understand Numbers

Machine learning doesn’t understand non-numerical inputs. However, in real life many things are not numerical; the trick is to find proxies. In the above example, we can provide the robot the latitude and longitude of Canada. 

## Numerical Proxies are Also Important in Science

Non-numerical (categorical) labels are everywhere in science. To feed this information as inputs we need to find some numerical proxies. This featurization of the dataset is among the hottest research topics, and many different research groups have come up with their own solutions. It’s hard to really tell what is good and what is not, as each solution is based on a different dataset! 

## One Hot Encoding

* One of the most common methods of dealing with categorical data. I think it’s also one of the simplest. 
* Implemented in many packages including Pandas & scikit-learn. 
* Process:
    1. Create a column for each category of data.
    2. Go through the rows with the categorical data and fill them with dummy/indicator variables (1 or 0):
        1. Enter 1 into column if the category matches column category.
        2. Enter 0 into column if the category does not match column category.
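
The process above can be sketched by hand on a made-up column, next to the equivalent `pd.get_dummies()` call:

```python
import pandas as pd

toy = pd.DataFrame({'Country': ['US', 'Canada', 'US']})

# Steps 1 and 2: one column per category, filled with 1 (match) or 0 (no match)
manual = pd.DataFrame()
for category in sorted(toy['Country'].unique()):
    manual['Country_' + category] = (toy['Country'] == category).astype(int)
print(manual)

# The same encoding using pandas; dtype=int gives 1/0 instead of True/False
auto = pd.get_dummies(toy, columns=['Country'], dtype=int)
print(auto)
```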

<div>
    <img src="img/encoding.png">
</div>

**Source:** https://towardsdatascience.com/building-a-one-hot-encoding-layer-with-tensorflow-f907d686bf39

## One Hot Encoding in Action - Using the Pandas get_dummies function:

In [None]:
# Let's import the modules we will use for this lecture:

import pandas as pd         # Pandas
import numpy as np          # NumPy

In [None]:
# Create a DataFrame to work with:

data = df3
df = pd.DataFrame(data)
df

In [None]:
# Implement One hot Encoding using the get_dummies() function:

df = pd.get_dummies(df)
df

---------

### Lab Report P2

**Which team has the most runner-up finishes in the history of NHL?**
Create a table with the teams and their runner-up finishes. The table must contain the following columns:
- Team
- Total number of runner-up finishes
- Years [list of the years they finished as a runner-up]

<u>You have to use **Jupyter** and you must **ONLY** use the information in the two sheets provided in the folder of today's practical</u>

Submit your Jupyter notebook and output table as .csv by the upcoming Wed 12:00 pm together with our first assignment (to be released)

---------



## Meaningful Encoding

* Dummy variables don’t carry much scientific knowledge. 
* You should try to find proxies that are meaningful: 
    * Electronegativity, bandgap, band center, XRD, RDF etc.

### Oxides example

In [None]:
data = {'formula':['IrO2', 'RuO2', 'TiO2', 'Ni3O4'], 'overpotential':[300, 250, 400, 280]}
df2 = pd.DataFrame(data)
df2
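
Before turning to a dedicated library, the idea can be sketched with a simple lookup table. The dictionary below holds approximate Pauling electronegativities typed in by hand for illustration, and the `metal` column is a hypothetical helper column that is not part of the original example:

```python
import pandas as pd

data = {'formula': ['IrO2', 'RuO2', 'TiO2', 'Ni3O4'],
        'overpotential': [300, 250, 400, 280]}
oxides = pd.DataFrame(data)

# Approximate Pauling electronegativities (hand-entered, for illustration only)
electronegativity = {'Ir': 2.20, 'Ru': 2.20, 'Ti': 1.54, 'Ni': 1.91}

# Hypothetical helper column: in these formulas the metal is the leading two-letter symbol
oxides['metal'] = oxides['formula'].str[:2]
oxides['metal_electronegativity'] = oxides['metal'].map(electronegativity)

print(oxides)
```

Unlike a dummy variable, the new column places the four oxides on a physically meaningful scale.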

## Matminer

Matminer is a powerful Python library which aims to simplify the machine learning pipeline for materials science. It has 3 key capabilities:

1. Data retrieval tools: allow you to import data from various online databases in the form of DataFrames. 
2. Data descriptor tools: utilities to describe a material from its composition or structure, and represent them in numerical format such that they are readily usable as features.
3. Plotting tools: create plots to visualize the data.

The inputs prepared by Matminer can be easily fed into ML packages such as scikit-learn and Keras to conduct machine learning.

**Matminer Homepage:** https://hackingmaterials.lbl.gov/matminer/ 

**Table of Featurizers**: https://hackingmaterials.lbl.gov/matminer/featurizer_summary.html

<div>
    <img src="img/matminer.png" width=600>
 </div>

In [None]:
#pip install matminer

### 1. Get composition

In [None]:
from matminer.featurizers.conversions import StrToComposition

#convert the formula from a string into chemical composition
df2 = StrToComposition().featurize_dataframe(df2, "formula")
df2.head()

In [None]:
from matminer.featurizers.composition import ElementFraction

# Get stoichiometry from formula
df2 = ElementFraction().featurize_dataframe(df2, "composition")
df2

In [None]:
df2['O']

In [None]:
from matminer.featurizers.composition import BandCenter

df2 = BandCenter().featurize_dataframe(df2, col_id="composition")
df2.head()

In [None]:
from matminer.featurizers.conversions import CompositionToOxidComposition
from matminer.featurizers.composition import OxidationStates

df2 = CompositionToOxidComposition().featurize_dataframe(df2, "composition")

os_feat = OxidationStates()
df2 = os_feat.featurize_dataframe(df2, "composition_oxid")
df2

In [None]:
from matminer.featurizers.composition import ElectronegativityDiff

df2 = ElectronegativityDiff().featurize_dataframe(df2, col_id="composition_oxid")  
df2.head()