<a href="https://colab.research.google.com/github/MARC27-Internet-Private-Limited/MXene-LLM/blob/main/MXene.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<div style="text-align: center;">
  <img src="https://research.marc27.com/_app/immutable/assets/logo_marc27.B__kGcan.svg"
       alt="Our logo" width="50">
  <h1>Death of P.R.I.S.M.</h1>
  <h3><u>P</u>latform for <u>R</u>esearch in <u>I</u>ntelligent <u>S</u>ynthesis of <u>M</u>Xenes</h3>
</div>




---


Welcome to the **Death of P.R.I.S.M.** notebook, a tongue-in-cheek nod to our once-thriving MXene research project. This notebook sets up a data pipeline that:

- **Scrapes and Aggregates Research Data:** Automatically fetches relevant academic articles and patents related to MXene synthesis.
- **Trains a Small LLM:** Uses supervised fine-tuning (SFT), plus an optional reinforcement learning from human feedback (RLHF) stage, to teach a model to classify abstracts and extract structured fields such as composition, lattice parameters, and bandgap. The base model is:

```
deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
```


- **Populates a Structured Database:** Organizes the curated data for downstream use in training advanced AI models (MatterGen and GNoME) for material discovery.

Running on Google Colab, this notebook bridges raw research with AI-driven insights, reviving our legacy in a new, smarter way.


---



In [None]:
from google.colab import drive
drive.mount('/content/drive')
!nvidia-smi
!pip install beautifulsoup4 requests transformers datasets accelerate trl bitsandbytes gradio mp-api biopython pandas peft

# Import libraries and load model
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import gradio as gr

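# Load quantized model (4-bit for GPU efficiency).
# bitsandbytes 4-bit loading needs a CUDA GPU, so select a GPU runtime in Colab before running this cell.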
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype="float16")
model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    quantization_config=bnb_config,
    device_map="auto"  # Auto-assign to GPU
)
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")
print("Setup complete - model loaded on GPU")

Mounted at /content/drive

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/680 [00:00<?, ?B/s]

ERROR:bitsandbytes.cextension:Could not load bitsandbytes native library: /lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.32' not found (required by /usr/local/lib/python3.11/dist-packages/bitsandbytes/libbitsandbytes_cpu.so)
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/bitsandbytes/cextension.py", line 85, in <module>
    lib = get_native_library()
          ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/bitsandbytes/cextension.py", line 72, in get_native_library
    dll = ct.cdll.LoadLibrary(str(binary_path))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/ctypes/__init__.py", line 454, in LoadLibrary
    return self._dlltype(name)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/ctypes/__init__.py", line 376, in __init__
    self._handle = _dlopen(self._name, mode)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: /lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4

RuntimeError: CUDA is required but not available for bitsandbytes. Please consider installing the multi-platform enabled version of bitsandbytes, which is currently a work in progress. Please check currently supported platforms and installation instructions at https://huggingface.co/docs/bitsandbytes/main/en/installation#multi-backend
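
The `RuntimeError` above is raised when 4-bit bitsandbytes quantization is requested on a runtime with no CUDA GPU. A minimal sketch of a guard that could run before the model load is shown below. The helper names `detect_cuda` and `choose_load_settings` are hypothetical and not part of the original notebook, and the CPU fallback is only illustrative: a 7B model in full precision is likely too large for a standard Colab CPU runtime, so in practice the remedy is to switch the runtime to GPU.

```python
def detect_cuda() -> bool:
    """Report whether PyTorch can see a CUDA device (False if torch is not installed)."""
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


def choose_load_settings(cuda_available: bool) -> dict:
    """Hypothetical helper: decide whether to request 4-bit quantization.

    `use_4bit` would control whether `quantization_config=bnb_config` is passed
    to `from_pretrained`; `device_map` is the placement to request.
    """
    if cuda_available:
        return {"use_4bit": True, "device_map": "auto"}
    # No GPU: skip bitsandbytes entirely instead of failing inside its loader.
    return {"use_4bit": False, "device_map": "cpu"}


settings = choose_load_settings(detect_cuda())
print(settings)
```

Checking the device first turns a long native-library traceback into a one-line decision that can be reported to the user.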

In [None]:
# Import scraping tools
import requests
from bs4 import BeautifulSoup
import pandas as pd
from mp_api.client import MPRester
from Bio import Entrez

abstracts = []

# Scrape arXiv
print("Scraping arXiv...")
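url = "https://arxiv.org/search/"
params = {"query": '"MXene" OR "MAX phase"', "searchtype": "all"}  # requests URL-encodes the quotes and spaces
response = requests.get(url, params=params, timeout=30)
response.raise_for_status()  # stop here on an HTTP error instead of parsing an error page
soup = BeautifulSoup(response.text, 'html.parser')
papers = soup.find_all('li', class_='arxiv-result')[:15]
# Skip results without a full-abstract span instead of crashing on None
arxiv_data = [{"source": "arXiv", "abstract": span.text.strip()}
              for p in papers if (span := p.find('span', class_='abstract-full'))]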
abstracts.extend(arxiv_data)
print(f"arXiv scraped: {len(arxiv_data)} abstracts")

# Scrape Materials Project
print("Scraping Materials Project...")
with MPRester() as mpr:  # reads the key from the MP_API_KEY environment variable; do not hard-code API keys
    try:
        docs = mpr.materials.summary.search(
            chemsys=["Ti-Al-C", "Cr-Al-C", "Nb-Al-C", "Mo-Al-C", "V-Al-C",
                     "Ti-Ga-N", "Cr-Ga-N", "Nb-Si-C", "Mo-Si-N", "V-Si-B",
                     "Ti-Si-C", "Cr-Zn-C", "Nb-Ga-N", "Mo-Zn-N", "V-Ga-C"],
            fields=["formula_pretty", "structure"]
        )[:15]
        mp_data = [{"source": "MP", "abstract": f"{doc.formula_pretty} - lattice a={doc.structure.lattice.a:.3f}Å" if doc.structure else f"{doc.formula_pretty} - no structure data"} for doc in docs]
        abstracts.extend(mp_data)
        print(f"Materials Project scraped: {len(mp_data)} abstracts")
    except Exception as e:
        print(f"Materials Project error: {e}")
        abstracts.extend([{"source": "MP", "abstract": "Failed to fetch - check API"}])

# Scrape PubMed
print("Scraping PubMed...")
Entrez.email = "siddharthayashkovid@gmail.com"  # Replace with your real email
handle = Entrez.esearch(db="pubmed", term="\"MXene\" OR \"MAX phase\"", retmax=15)
record = Entrez.read(handle)
ids = record["IdList"]
handle = Entrez.efetch(db="pubmed", id=ids, rettype="abstract", retmode="text")
pubmed_text = handle.read()
pubmed_data = [{"source": "PubMed", "abstract": abstract.strip()} for abstract in pubmed_text.split('\n\n') if abstract.strip() and ("MXene" in abstract or "MAX phase" in abstract)][:15]
abstracts.extend(pubmed_data)
print(f"PubMed scraped: {len(pubmed_data)} abstracts")

# Save to CSV
df = pd.DataFrame(abstracts)
df.to_csv('/content/drive/MyDrive/mxene_project/abstracts.csv', index=False)
print("Scraped abstracts - MXene/MAX dataset ready")

Scraping arXiv...
arXiv scraped: 15 abstracts
Scraping Materials Project...


Retrieving SummaryDoc documents:   0%|          | 0/25 [00:00<?, ?it/s]

Materials Project scraped: 15 abstracts
Scraping PubMed...
PubMed scraped: 15 abstracts
Scraped abstracts - MXene/MAX focused
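
The arXiv query combines quoted phrases and spaces, which must be percent-encoded in the final URL. The sketch below shows what the encoded search URL looks like when built with the standard library. The helper name `build_arxiv_url` is an assumption for illustration; the endpoint and the `query` and `searchtype` parameters are the ones used in the scraping cell.

```python
from urllib.parse import urlencode

ARXIV_SEARCH = "https://arxiv.org/search/"


def build_arxiv_url(terms, searchtype="all"):
    """Join phrases as quoted OR-terms and percent-encode them (hypothetical helper)."""
    query = " OR ".join(f'"{term}"' for term in terms)
    encoded = urlencode({"query": query, "searchtype": searchtype})
    return f"{ARXIV_SEARCH}?{encoded}"


url = build_arxiv_url(["MXene", "MAX phase"])
print(url)
# https://arxiv.org/search/?query=%22MXene%22+OR+%22MAX+phase%22&searchtype=all
```

Passing a `params` dictionary to `requests.get` produces the same encoding without building the string by hand.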


In [None]:
# Import training tools
from datasets import load_dataset
from transformers import TrainingArguments
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer

# Training data - sample from abstracts.csv
data = [
    {"input": "Ti₃AlC₂ synthesized with HF etching, a=3.07Å, c=18.5Å.",
     "output": '{"category": "properties", "composition": "Ti3AlC2", "lattice_a": "3.07Å", "lattice_c": "18.5Å", "etchant": "HF"}'},
    {"input": "Cr₂GaN manufacturing via sputtering.",
     "output": '{"category": "manufacturing", "composition": "Cr2GaN"}'},
    {"input": "Nb₄C₃Tₓ tested for supercapacitors.",
     "output": '{"category": "testing", "mxene": "Nb4C3Tx"}'},
    {"input": "Mo₂TiC₂ synthesized, a=3.01Å, bandgap 0.9 eV.",
     "output": '{"category": "properties", "composition": "Mo2TiC2", "lattice_a": "3.01Å", "bandgap": "0.9 eV"}'},
    {"input": "V₂SiC etched to V₂CTₓ, conductivity high.",
     "output": '{"category": "properties", "composition": "V2SiC", "mxene": "V2CTx"}'}
]
df = pd.DataFrame(data)
df.to_csv('/content/drive/MyDrive/mxene_project/train_data.csv', index=False)

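# Add LoRA adapters to quantized model
from peft import prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model)  # PEFT's preparation step for training on a k-bit quantized model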
lora_config = LoraConfig(
    r=16,  # Rank of adapters
    lora_alpha=32,  # Scaling factor
    target_modules=["q_proj", "v_proj"],  # Attention layers
    lora_dropout=0.05,  # Regularization
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # Verify trainable params

# SFT: join each input/output pair into one "text" field; SFTTrainer tokenizes that field itself.
# (Tokenizing inputs and outputs separately gave labels of length 128 against input_ids of length 512.)
dataset = load_dataset('csv', data_files='/content/drive/MyDrive/mxene_project/train_data.csv')
def to_text(example):
    # The "Abstract:" / "JSON:" layout is an assumed prompt template, not something the model requires
    return {"text": f"Abstract: {example['input']}\nJSON: {example['output']}{tokenizer.eos_token}"}
tokenized_dataset = dataset.map(to_text)  # name kept for later cells; "input" and "output" columns are preserved
training_args = TrainingArguments(
    output_dir='/content/drive/MyDrive/mxene_project/model_sft',
    per_device_train_batch_size=1,
    num_train_epochs=3,
    save_steps=50,
    logging_steps=5,
    fp16=True
)
trainer = SFTTrainer(model=model, args=training_args, train_dataset=tokenized_dataset["train"])
trainer.train()
print("SFT with LoRA complete - MXene/MAX classifier ready")

Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/5 [00:00<?, ? examples/s]

Converting train dataset to ChatML:   0%|          | 0/5 [00:00<?, ? examples/s]

Applying chat template to train dataset:   0%|          | 0/5 [00:00<?, ? examples/s]

Truncating train dataset:   0%|          | 0/5 [00:00<?, ? examples/s]

ValueError: You cannot perform fine-tuning on purely quantized models. Please attach trainable adapters on top of the quantized model to correctly perform fine-tuning. Please see: https://huggingface.co/docs/transformers/peft for more details
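
When input/output pairs are tokenized by hand for a causal LM, `labels` has to line up position by position with `input_ids`. One common scheme concatenates prompt and completion tokens and masks the prompt positions with `-100`, the value PyTorch's cross-entropy loss ignores by default, so that only the completion contributes to the loss. The sketch below illustrates that scheme on plain Python lists; `build_causal_lm_example` is a hypothetical helper and the token ids are made up for the example.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def build_causal_lm_example(prompt_ids, completion_ids, max_length):
    """Concatenate prompt and completion ids and mask the prompt in the labels."""
    input_ids = (prompt_ids + completion_ids)[:max_length]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + completion_ids)[:max_length]
    return {"input_ids": input_ids, "labels": labels}


# Toy ids: three prompt tokens followed by two completion tokens
example = build_causal_lm_example([11, 12, 13], [21, 22], max_length=8)
print(example["input_ids"])  # [11, 12, 13, 21, 22]
print(example["labels"])     # [-100, -100, -100, 21, 22]
```

Both lists always have the same length, which is the property the loss computation depends on.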

In [None]:
# Import Gradio and download tools
import gradio as gr
import requests

# Process function - classify and download
def process_abstract(text):
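    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=100)
    # Decode only the newly generated tokens so the echoed prompt cannot trigger the keyword check below
    result = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)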
    if "properties" in result.lower():
        url = "https://arxiv.org/pdf/2101.00001.pdf"  # Placeholder - update later
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # avoid saving an HTML error page as a PDF
        with open('/content/drive/MyDrive/mxene_project/last_paper.pdf', 'wb') as f:
            f.write(response.content)
        return result, "Paper downloaded (last_paper.pdf)!"
    return result, "Not properties - no download."

# Launch Gradio UI
interface = gr.Interface(
    fn=process_abstract,
    inputs=gr.Textbox(lines=2, placeholder="Enter abstract (e.g., Ti₃AlC₂ synthesized, a=3.07Å)..."),
    outputs=[gr.Textbox(label="Extracted Data"), gr.Textbox(label="Download Status")],
    title="MXene/MAX Phase Classifier",
    description="Classifies abstracts and downloads properties papers - trained on MXene/MAX data."
)
interface.launch(share=True)
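
The classifier decides whether to download a paper by searching the generated text for the word "properties". Since the training targets are JSON objects with a `category` field, a stricter check can parse that field instead. The sketch below assumes the model emits one JSON object somewhere in its output; `extract_category` is a hypothetical helper, not part of the original notebook.

```python
import json


def extract_category(generated_text):
    """Return the "category" of the first JSON object found in the text, or None."""
    decoder = json.JSONDecoder()
    start = generated_text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at index `start`
            obj, _ = decoder.raw_decode(generated_text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict) and "category" in obj:
            return obj["category"]
        start = generated_text.find("{", start + 1)
    return None


sample = 'Useful properties reported. {"category": "testing", "mxene": "Nb4C3Tx"}'
print(extract_category(sample))  # testing
```

With a substring test the sample above would count as a "properties" abstract; reading the parsed field avoids that false positive.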

In [None]:
# # Import RLHF tools
# from trl import PPOTrainer, PPOConfig
#
# # Validation data for rewards
# validation_data = [
#     {"input": "Ti₂AlC etched to Ti₂CTₓ, a=3.05Å.", "correct_output": '{"category": "properties", "composition": "Ti2AlC", "lattice_a": "3.05Å", "mxene": "Ti2CTx"}'},
#     {"input": "Cr₂CBr₂ tested, bandgap 0 eV.", "correct_output": '{"category": "properties", "composition": "Cr2CBr2", "bandgap": "0 eV"}'}
# ]
#
# # Reward function
# def compute_rewards(predicted, correct):
#     return 1.0 if predicted == correct else -1.0
#
# # RLHF with PPO
# ppo_config = PPOConfig(model_name="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B", learning_rate=1e-5, batch_size=1)
# ppo_trainer = PPOTrainer(model=model, config=ppo_config, tokenizer=tokenizer, dataset=tokenized_dataset["train"])
#
# for epoch in range(10):  # 10 iterations
#     for batch in tokenized_dataset["train"]:
#         inputs = tokenizer(batch["input"], return_tensors="pt").to("cuda")
#         outputs = model.generate(**inputs, max_new_tokens=100)
#         pred = tokenizer.decode(outputs[0], skip_special_tokens=True)
#         reward = compute_rewards(pred, batch["output"])
#         ppo_trainer.step([inputs["input_ids"][0]], [outputs[0]], [reward])
# print("RLHF complete - MXene/MAX extraction refined")
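
The commented-out reward function returns 1.0 only on an exact string match, so a prediction with the right fields in a different key order or spacing scores the same as garbage. A minimal sketch of a field-wise alternative is shown below. It is an assumption about how partial credit could be given, not the notebook's method, and `field_reward` is a hypothetical name.

```python
import json


def field_reward(predicted, correct):
    """Score in [-1, 1]: fraction of reference fields reproduced exactly, rescaled."""
    try:
        pred = json.loads(predicted)
    except (json.JSONDecodeError, TypeError):
        return -1.0  # unparseable output gets the minimum reward
    ref = json.loads(correct)
    if not isinstance(pred, dict) or not ref:
        return -1.0
    matched = sum(1 for key, value in ref.items() if pred.get(key) == value)
    return 2.0 * matched / len(ref) - 1.0


reference = '{"category": "properties", "composition": "Cr2CBr2"}'
print(field_reward('{"composition": "Cr2CBr2", "category": "properties"}', reference))  # 1.0
print(field_reward('{"category": "properties"}', reference))                            # 0.0
print(field_reward("not json", reference))                                              # -1.0
```

A graded signal like this gives the policy something to improve on when it gets only some fields right.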

In [None]:
# Requires mxene_db.csv from the diegonti/MXene-DB GitHub repository (upload it to Drive manually).
# Import pandas for merging
import pandas as pd

# Load and merge datasets
mxene_db = pd.read_csv('/content/drive/MyDrive/mxene_project/mxene_db.csv')  # Upload from GitHub
scraped_df = pd.read_csv('/content/drive/MyDrive/mxene_project/abstracts.csv')
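mxene_db_subset = (
    mxene_db[['full_name', 'a', 'Eg_PBE']]
    .rename(columns={'full_name': 'composition', 'a': 'lattice_a', 'Eg_PBE': 'bandgap'})
    .assign(source="MXene-DB")  # label the rows so they can be told apart from scraped abstracts
)
# The two frames share only the "source" column; concat fills the remaining columns with NaN
combined_df = pd.concat([scraped_df, mxene_db_subset], ignore_index=True)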
combined_df.to_csv('/content/drive/MyDrive/mxene_project/combined_data.csv', index=False)
print("MXene-DB merged - full MXene/MAX database ready")