# Context - Machine learning for material development

To accelerate materials design, we investigate machine learning as a tool for predicting new functional materials in relevant fields. In this case, we train a machine learning algorithm to judge whether a compound is a prospective thermoelectric material. Thermoelectric materials generate electricity from a temperature gradient across them. The band gap is the property used here to decide whether a material may be thermoelectric, and it is influenced by the input descriptors. The details of the dataset are given below.

Three main components are needed for this task: descriptors (features), training data, and a machine learning algorithm.

## Descriptors : 
These are properties of a material that can be used to compare one compound with another. In this case, the descriptors carry information about the chemical elements that form the compound. We have downloaded information for about 70,000 compounds. A compound is made of chemical elements; for example, water is a compound with the formula H2O. In machine learning terms, descriptors are the input features used by the algorithm. Descriptors thus act as prior chemical knowledge that helps the model look "where" in the periodic table the pattern is, allowing the algorithm to determine "what" the pattern is. The target attribute and the descriptors are:

**Attribute Information:**
1. Classification based on the "Band Gap" attribute [**y1c** variable in the code below]

    Class 1 is a potential thermoelectric material (band gap > 0); encoded as 1 in **y1c**

    Class 2 is a non-thermoelectric material (band gap = 0); encoded as 0 in **y1c**


**Input features - 55 features influencing the Band-gap (and hence the class):**
1. Average atomic number 
2. Average group number  
3. Valence electron variance
4. Electronegativity variance 
5. Average atomic radius 
6.  
7. 
8. 
55. Electronic configuration  
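
To make descriptors such as the average atomic number concrete, here is a minimal sketch of how an atom-count-weighted mean and variance could be computed for a single formula. The `ATOMIC_NUMBER` table and the `parse_formula` and `mean_and_variance` helpers are illustrative assumptions; the in-house script that built the dataset is not shown here and may compute these quantities differently.

```python
import re

# Tiny illustrative lookup table (assumption: a real pipeline would cover
# the full periodic table and many more elemental properties).
ATOMIC_NUMBER = {"H": 1, "Li": 3, "O": 8, "P": 15, "Fe": 26}

def parse_formula(formula):
    """Return {element: count} for a simple formula such as 'H2O' or 'LiFePO4'.

    Parentheses and fractional counts are not handled in this sketch.
    """
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts

def mean_and_variance(formula, table):
    """Atom-count-weighted mean and population variance of an elemental property."""
    counts = parse_formula(formula)
    total = sum(counts.values())
    mean = sum(table[el] * n for el, n in counts.items()) / total
    variance = sum(n * (table[el] - mean) ** 2 for el, n in counts.items()) / total
    return mean, variance

mean_z, var_z = mean_and_variance("H2O", ATOMIC_NUMBER)
print(f"H2O: average atomic number = {mean_z:.3f}, variance = {var_z:.3f}")
```

For water, the two hydrogen atoms (Z = 1) and one oxygen atom (Z = 8) give an average atomic number of 10/3, about 3.333.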


##  Training data [Variables : X_train, X_test, y_train, y_test]: 
The data is downloaded from the Materials Project website (https://materialsproject.org/) and arranged using an in-house script. It is made available to you below for testing different machine learning algorithms.

##  Machine learning algorithm : 
You have to test the performance of different machine learning algorithms on the variables below
*X_train, X_test, y_train, y_test*
and see which one classifies best.
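
One way to carry out such a comparison is sketched below. To stay self-contained, the sketch uses a synthetic stand-in from `make_classification` with a roughly 80/20 class ratio, stored under the hypothetical names `X_tr`, `X_te`, `y_tr`, `y_te`. The three candidate classifiers and the balanced-accuracy metric are assumptions chosen for illustration, not a prescribed list.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption); swap in the notebook's
# X_train, X_test, y_train, y_test for the real comparison.
X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.2, 0.8], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2
)

# Candidate models to compare (illustrative choices).
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "k_nearest_neighbors": KNeighborsClassifier(n_neighbors=5),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = balanced_accuracy_score(y_te, model.predict(X_te))

# Report the models from best to worst.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:22s} balanced accuracy = {score:.3f}")
```

Keeping the models in a dictionary makes it easy to add or remove candidates without touching the evaluation loop.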

##  Class distribution:
**Nearly 80% in Class 1 and 20% in Class 2**
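
With a split this uneven, plain accuracy can look good even for a model that learns nothing. The sketch below illustrates this on a made-up label vector with exactly 80 Class 1 and 20 Class 2 samples (an assumption standing in for the real labels): a baseline that always predicts the majority class reaches 0.8 accuracy but only 0.5 balanced accuracy.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Made-up labels with an 80/20 ratio (1 = Class 1, 0 = Class 2).
y_demo = np.array([1] * 80 + [0] * 20)
X_demo = np.zeros((len(y_demo), 1))  # the dummy baseline ignores the features

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
y_pred = baseline.predict(X_demo)

acc = accuracy_score(y_demo, y_pred)
bal_acc = balanced_accuracy_score(y_demo, y_pred)
print("accuracy:", acc)
print("balanced accuracy:", bal_acc)
```

Any classifier tested on this dataset should therefore be judged against the majority-class baseline, ideally with a metric that weights both classes equally.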

In [1]:
import numpy as np 
import pandas as pd 
%matplotlib inline 
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec

In [2]:
# Load the raw database and keep only the lithium-containing compounds.
# .copy() gives an independent DataFrame, which avoids SettingWithCopyWarning.
datamldb = pd.read_csv('./data/initial_db.csv', sep=',')
datamldb2 = datamldb[datamldb['full_formula'].str.contains("Li")].copy()

# Drop unused columns, then drop every row that still has a missing value.
datamldb2 = datamldb2.drop(['Unnamed: 63', 'electrical_resistivity variance', 'electrical_resistivity mean', 'n_units_in_cell', 'density'], axis=1)
datamldb = datamldb2.dropna(axis=0, how='any')

# Feature table: everything except the targets and the formula labels.
S300099 = datamldb.drop(['band_gap', 'e_above_hull', 'pretty_formula', 'full_formula'], axis=1).copy()

print("Input data feature", S300099.shape)

# One-hot encoding for "Symmetry Group"
S300099_symmetry = pd.get_dummies(S300099['symmetry_group'])
S300099new = S300099.drop(['symmetry_group'], axis=1).copy()
S300099new = pd.concat([S300099new, S300099_symmetry], axis=1)

print("  ")
print("Peek into data structure")
print("  ")
print(S300099new.head())

# Standardize every feature column to zero mean and unit variance.
from sklearn import preprocessing
S300099new_scaled = preprocessing.scale(S300099new)
y2 = datamldb[['band_gap', 'e_above_hull']]
y2 = np.array(y2.values)

# Binary target: 1 if band gap > 0 (Class 1), else 0 (Class 2).
y1c = [1 if 0 < value else 0 for value in y2[:, 0]]
print("  ")
print('percent of class 1', y1c.count(1) / len(y1c) * 100)

# Stratified 80/20 split keeps the class ratio the same in the train and test sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(S300099new_scaled, y1c, test_size=0.2, stratify=y1c, random_state=2)


Input data feature (12950, 55)
  
Peek into data structure
  
    atomic_mass variance  atomic_mass mean  atomic_radius variance  \
23            146.747994         22.277706                0.047900   
24              6.687521         11.795160                0.096600   
35              3.147166         11.286457                0.068878   
47           1111.119263         64.021250                0.031875   
49           1798.090760         47.445250                0.050625   

    atomic_radius mean  X variance    X mean  valence_electrons variance  \
23            1.172800    0.561188  1.915680                    5.533696   
24            0.830000    0.575096  2.432000                    2.160000   
35            0.807143    0.301824  2.325714                    1.102041   
47            1.325000    0.501819  1.852500                    5.187500   
49            1.225000    0.428075  2.025000                    4.187500   

    valence_electrons mean  Total group #1  Total group #2 .