<div style="text-align: center;">
    <a href="https://www.dataia.eu/">
        <img border="0" src="https://github.com/ramp-kits/template-kit/raw/main/img/DATAIA-h.png" width="90%"></a>
</div>

# Template Kit for RAMP challenge

<i> Thomas Moreau (Inria) </i>

## Introduction

Superconductors are materials that can conduct electricity without resistance when cooled below a specific critical temperature (Tc). This unique property has significant implications in various technological and industrial applications, including Magnetic Resonance Imaging (MRI), high-speed maglev trains, and energy-efficient power grids. However, the widespread adoption of superconductors is hindered by the challenge of predicting their critical temperature, which typically requires extensive experimental research.

The goal of this challenge is to develop predictive models that can accurately estimate the critical temperature of superconducting materials based on their chemical composition and physical properties. By leveraging machine learning techniques, participants will contribute to advancing materials science by creating data-driven models that could accelerate the discovery of new superconductors with desirable properties.

### Where the data comes from

The dataset for this challenge originates from the ***Superconducting Material Database maintained by Japan's National Institute for Materials Science (NIMS)***, available at http://supercon.nims.go.jp/index_en.html. This database aggregates information about superconductors, including their chemical composition and experimentally determined critical temperatures.

The dataset used in this challenge was curated and analyzed by ***Kam Hamidieh*** in his research paper *"A Data-Driven Statistical Model for Predicting the Critical Temperature of a Superconductor"* published in 2018 in Computational Materials Science. It contains information on 21,263 superconductors and 81 extracted features that are useful for modeling the critical temperature. The features are derived from elemental properties such as atomic mass, thermal conductivity, valence, and electron affinity.

The dataset contains two files:

1. `train.csv`: Contains 81 features extracted from 21,263 superconductors along with their corresponding critical temperatures.

2. `unique_m.csv`: Provides the chemical formulas of the superconductors in the dataset.

For this challenge, participants will use only the extracted features and their corresponding critical temperatures; the chemical formulas are not used.

### The task of the challenge

Participants are required to build a machine learning model that predicts the critical temperature (Tc) of superconducting materials based on their extracted features. The primary objective is to develop an accurate predictive model that can generalize well to unseen superconductors.  

Each superconducting material in the dataset is represented by 81 numerical features, which are derived from its chemical composition and elemental properties. These features include atomic mass, electron affinity, thermal conductivity, valence, and other physicochemical characteristics. The challenge is to identify patterns and relationships within this high-dimensional feature space that contribute to determining the critical temperature of a given material.  

To achieve this, participants will preprocess the dataset, explore feature selection or transformation techniques, and experiment with different machine learning models. The success of the model will depend on effectively capturing complex interactions between the material’s properties and its superconducting behavior.  

This competition presents a unique opportunity to leverage data science in materials research, where data-driven approaches can accelerate the discovery of new superconducting materials and improve our understanding of the underlying physics governing their behavior.

### Motivation

Superconductors have the potential to revolutionize multiple industries, particularly in energy transmission and advanced computing. However, predicting the critical temperature of new superconducting materials remains a complex challenge due to the absence of a well-defined theoretical model. Traditional approaches rely on experimental testing, which is both time-consuming and expensive.

This challenge presents an opportunity to harness the power of data science and machine learning to address this problem. By analyzing a large dataset of superconductors and their properties, participants can build predictive models that could aid scientists in identifying promising materials without the need for exhaustive experimental testing. A successful model could significantly accelerate the discovery of high-temperature superconductors, which are crucial for developing more energy-efficient technologies and advancing scientific research in condensed matter physics.

By participating in this challenge, data scientists and researchers can contribute to an important scientific endeavor while gaining hands-on experience in applying machine learning techniques to real-world physics and chemistry problems.

## Requirements

In [None]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.colors as colors

%matplotlib inline
pd.set_option('display.max_columns', None)


# Exploratory data analysis

The goal of this section is to show what is in the data and how to work with it.
This is the first step in any data science project, and this section should give a sense of the data the participants will be working with.

You can first load and describe the data, and then show some interesting properties of it.

In [None]:
# Load the data
import problem
X_df, y = problem.get_train_data()

In [None]:
# pandas exposes the CSV reader as read_csv (there is no readCSV)
df_train = pd.read_csv("data/X_train.csv")
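As a possible first look at the data, the sketch below computes a per-feature summary (missing values, mean, standard deviation, and correlation with the target) and sorts features by how strongly they correlate with the target. The function name `summarize_features`, the synthetic data, and the column names `feature_a` to `feature_c` are illustrative assumptions and are not part of the kit.

```python
import numpy as np
import pandas as pd


def summarize_features(X, y):
    """Per-feature summary, sorted by absolute correlation with the target."""
    # Align the target with the feature index so corrwith pairs rows correctly.
    target = pd.Series(np.asarray(y, dtype=float), index=X.index)
    summary = pd.DataFrame({
        "n_missing": X.isna().sum(),
        "mean": X.mean(),
        "std": X.std(),
        "corr_with_target": X.corrwith(target),
    })
    order = summary["corr_with_target"].abs().sort_values(ascending=False).index
    return summary.loc[order]


# Synthetic stand-in data (hypothetical column names, not the real features).
rng = np.random.default_rng(0)
n_samples = 200
X_demo = pd.DataFrame({
    "feature_a": rng.normal(size=n_samples),
    "feature_b": rng.normal(size=n_samples),
    "feature_c": rng.normal(size=n_samples),
})
# The synthetic target depends only on feature_a, plus a little noise.
y_demo = 3.0 * X_demo["feature_a"] + 0.1 * rng.normal(size=n_samples)

summary_demo = summarize_features(X_demo, y_demo)
print(summary_demo)
```

With the challenge data, the equivalent call would presumably be `summarize_features(X_df, y)`, assuming `X_df` contains only numeric columns.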

# Challenge evaluation

A particularly important point in a challenge is to describe how it is evaluated. This is the section where you should describe the metric that will be used to evaluate the participants' submissions, as well as your evaluation strategy, in particular if there is some complexity in the way the data should be split to ensure valid results.
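The metric for this challenge is not specified in this section. As an illustration only, the sketch below assumes root mean squared error (RMSE), a common choice for regression targets such as a critical temperature, together with a shuffled 5-fold split. The helper `rmse`, the toy target values, and the mean-prediction baseline are hypothetical and do not come from the kit.

```python
import numpy as np
from sklearn.model_selection import KFold


def rmse(y_true, y_pred):
    """Root mean squared error between two 1-D sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Toy target values (made up for the example, not real critical temperatures).
y_toy = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0])
X_toy = np.zeros((len(y_toy), 1))  # placeholder features, only used for splitting

# Assumed evaluation strategy: shuffled 5-fold cross-validation.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

fold_scores = []
for train_idx, test_idx in cv.split(X_toy):
    # Baseline: predict the mean of the training fold for every test sample.
    baseline = y_toy[train_idx].mean()
    predictions = np.full(len(test_idx), baseline)
    fold_scores.append(rmse(y_toy[test_idx], predictions))

print(fold_scores)
```

A constant baseline like this gives a reference score that any submitted model should beat under the same split.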

# Submission format

Here, you should describe the submission format. This is the format the participants should follow to submit their predictions on the RAMP platform.

This section also shows how to use the `ramp-workflow` library to test the submission locally.

## The pipeline workflow

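The input data are stored in a pandas dataframe. A scikit-learn column transformer can select or preprocess a subset of columns before the data reach the model. The starting-kit estimator, loaded from `submissions/starting_kit/estimator.py`, is a pipeline returned by `get_estimator()` that standardizes every feature before fitting a baseline model.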
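The column selection mentioned above might look like the following sketch, which uses `ColumnTransformer` to keep and scale two columns and drop the rest before fitting a linear regression. The function name `get_subset_estimator`, the column names, and the synthetic data are assumptions made for illustration; this is not the starting-kit estimator.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical column names, used only for illustration.
SELECTED_COLUMNS = ["feature_a", "feature_b"]


def get_subset_estimator(columns=SELECTED_COLUMNS):
    """Pipeline that scales a subset of columns and fits a linear regression."""
    selector = ColumnTransformer(
        transformers=[("scaled_subset", StandardScaler(), list(columns))],
        remainder="drop",  # every column not listed above is discarded
    )
    return make_pipeline(selector, LinearRegression())


# Synthetic stand-in data: three columns, of which only two are selected.
rng = np.random.default_rng(0)
X_subset_demo = pd.DataFrame(
    rng.normal(size=(50, 3)),
    columns=["feature_a", "feature_b", "feature_c"],
)
y_subset_demo = X_subset_demo["feature_a"] - 2.0 * X_subset_demo["feature_b"]

subset_model = get_subset_estimator().fit(X_subset_demo, y_subset_demo)
subset_pred = subset_model.predict(X_subset_demo)

# The first pipeline step outputs a numpy array with the selected columns only.
selected = subset_model[0].transform(X_subset_demo)
print(selected.shape)
```

Because the transformer addresses columns by name, the same pipeline accepts the full dataframe and ignores any column that is not listed.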

In [2]:
# %load submissions/starting_kit/estimator.py

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression


def get_estimator():
    # The target (critical temperature) is continuous, so the baseline must
    # be a regressor. LinearRegression is an assumed replacement for the
    # classifier (LogisticRegression) left over from the template.
    pipe = make_pipeline(
        StandardScaler(),
        LinearRegression()
    )

    return pipe


## Testing using a scikit-learn pipeline

In [None]:
from sklearn.model_selection import cross_val_score

# Accuracy is a classification metric; score the regression baseline with R^2.
scores = cross_val_score(get_estimator(), X_df, y, cv=5, scoring='r2')
print(scores)


## Submission

To submit your code, you can refer to the [online documentation](https://paris-saclay-cds.github.io/ramp-docs/ramp-workflow/stable/using_kits.html).