## **DATA 6250**
# **Machine Learning for Data Science**
## **Final Project**
## **Pre-Processing of Data**
## **Filling of Missing Values in Data**
### ***REFERENCE: EPOCH AI***
### ***Links to Dataset:***
- *Notable AI Models* : https://epoch.ai/data/notable_ai_models.csv
- *Large-Scale AI Models* : https://epoch.ai/data/large_scale_ai_models.csv
- *ML Hardware* : https://epoch.ai/data/ml_hardware.csv

#### Done By: Rohan Pratap Reddy Ravula
#### School of Computing and Data Science
#### Wentworth Institute of Technology

### Mount Google Drive in the Colab Notebook

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Data Processing
### Load the required Libraries

In [None]:
import pandas as pd
import numpy as np
import os
from tqdm import tqdm

### Load the CSV file

In [None]:
input_path = "/content/drive/MyDrive/DATA 6250/Datasets/Updated/Normalized/large_scale_ai_models_normalized.csv"
df = pd.read_csv(input_path)

### Rename the Country Names

In [None]:
rename_vals = {'United States of America':'USA',
               'United Kingdom of Great Britain and Northern Ireland':'UK',
               'Korea (Republic of)':'South Korea',
               'United Arab Emirates':'UAE'}

df['Country'] = df['Country'].replace(rename_vals)
for val in df['Country'].unique():
  print(val)

USA
France
China
UK
Multinational
South Korea
Germany
Japan
UAE
Hong Kong
Canada
Finland
Russia
Saudi Arabia
Singapore
Israel


### Find all the Columns in the Data

In [None]:
for col in df.columns:
    print(repr(col))

'Model'
'Domain'
'Country'
'Organization'
'Date'
'Category'
'Task'
'Parameters'
'data size'
'Training time (hours)'
'Confidence'
'Hardware quantity'
'Training hardware'
'Authors'
'accessibility'
'Parameters notes'
'Training compute (FLOP)'
'Training compute notes'
'Training dataset'
'Training dataset notes'
'Dataset size notes'
'Training time notes'
'Abstract'
'Finetune compute (FLOP)'
'Finetune compute notes'
'Training code accessibility'


### Create a function to merge all text columns in the data

In [None]:
def create_feature_extraction_notes(row_vals):
  notes = ""
  for col in ['Parameters notes','Training compute notes',
              'Training dataset notes','Dataset size notes',
              'Training time notes','Finetune compute notes']:
      if not pd.isna(row_vals[col]):
        notes += f"{col}: {row_vals[col]}\n"
  return notes
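
To make the merged format concrete, here is a minimal sketch on a toy two-row frame. The toy data, the `merge_notes` name, and the two-column `NOTE_COLS` list are illustrative assumptions; the logic mirrors the function above.

```python
import pandas as pd

# Hypothetical subset of the notes columns, for illustration only.
NOTE_COLS = ["Parameters notes", "Training time notes"]

def merge_notes(row, cols=NOTE_COLS):
    """Join every non-missing notes column into 'column: text' lines."""
    return "".join(f"{col}: {row[col]}\n" for col in cols if not pd.isna(row[col]))

toy = pd.DataFrame({
    "Model": ["model-a", "model-b"],
    "Parameters notes": ["7B parameters", None],
    "Training time notes": [None, None],
})
toy["Overall_notes"] = toy.apply(merge_notes, axis=1)
print(toy["Overall_notes"].tolist())
```

A row with no notes at all ends up with an empty string rather than NaN, which is why the extraction function later checks for an empty or whitespace-only `Overall_notes`.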

### Apply it and delete the original columns

In [None]:
tqdm.pandas()
df_new = df.copy()
df_new['Overall_notes'] = df_new.progress_apply(create_feature_extraction_notes,axis=1)
df_new.drop(['Parameters notes','Training compute notes',
             'Training dataset notes','Dataset size notes',
             'Training time notes','Finetune compute notes'],
            axis=1,inplace=True)
df = df_new.copy()
del df_new

100%|██████████| 2078/2078 [00:00<00:00, 48811.95it/s]


### Find all columns in the new data

In [None]:
for col in df.columns:
    print(repr(col))

'Model'
'Domain'
'Country'
'Organization'
'Date'
'Category'
'Task'
'Parameters'
'data size'
'Training time (hours)'
'Confidence'
'Hardware quantity'
'Training hardware'
'Authors'
'accessibility'
'Training compute (FLOP)'
'Training dataset'
'Abstract'
'Finetune compute (FLOP)'
'Training code accessibility'
'Overall_notes'


### Create a New DataFrame with the Numeric Features

In [None]:
df_new = df[['Model','Parameters','data size','Training time (hours)',
             'Training compute (FLOP)', 'Finetune compute (FLOP)','Overall_notes' ]].copy()
df_new.drop_duplicates(inplace=True)
features = ['Parameters','data size','Training time (hours)', 'Training compute (FLOP)', 'Finetune compute (FLOP)']
df_new.head()

| | Model | Parameters | data size | Training time (hours) | Training compute (FLOP) | Finetune compute (FLOP) | Overall_notes |
|---|---|---|---|---|---|---|---|
| 0 | Llama 4 Scout | 109000000000.0 | 30000000000000.0 | NaN | 4.08e+24 | NaN | Parameters notes: "Our smaller model, Llama 4 ... |
| 4 | Llama 4 Maverick | 400000000000.0 | 30000000000000.0 | NaN | 1.4916e+25 | NaN | Parameters notes: "Llama 4 Maverick models hav... |
| 8 | GPT-4.5 | NaN | NaN | NaN | NaN | NaN | Training dataset notes: "GPT-4.5 was pre-train... |
| 29 | Claude 3.7 Sonnet | NaN | NaN | NaN | 3.35e+25 | NaN | Training compute notes: https://docs.google.co... |
| 50 | Evo 2 40B | 40300000000.0 | 9300000000000.0 | NaN | 2.25e+24 | NaN | Parameters notes: Table 1 lists 40.3B paramter... |


## Processing Using Language Models
### 'llmware/dragon-qwen-7b-ov' for RAG
### 'all-mpnet-base-v2' for sentence embeddings
### Install all required libraries

In [None]:
!pip install -U sentence-transformers transformers
!pip install "llmware[full]" openvino openvino_genai


### Load the required Libraries

In [None]:
from sentence_transformers import SentenceTransformer, util
from huggingface_hub import snapshot_download
import torch

### Environment Setup

In [None]:
import os
os.environ["PYDEVD_DISABLE_FILE_VALIDATION"] = "1"

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

Using device: cuda


### Initializing SentenceTransformer

In [None]:
sen_model = SentenceTransformer('all-mpnet-base-v2',device=device)


### Create Embeddings and Store Them in the New DataFrame

In [None]:
embeddings = sen_model.encode(df_new['Model'].tolist(),convert_to_tensor=True)
df_new['Embeddings'] = [row for row in embeddings.cpu()]

### Create a Function to Stack the Embeddings of Rows with Known Values

In [None]:
def get_unmasked_embeddings(unmasked_df):
  unmasked_df = unmasked_df.reset_index(drop=True)
  emb_list = []
  for emb in unmasked_df['Embeddings']:
    if not isinstance(emb, torch.Tensor):
        emb = torch.tensor(emb)
    emb_list.append(emb)
  unmask_embeddings = torch.stack(emb_list, dim=0)
  return unmask_embeddings,unmasked_df

### Create a Function to Fill a Feature from the Most Similar Model with a Known Value

In [None]:
def assign_values_to_features(row_vals,unmask_df,feature,confidence_val=0.5):
  emb = row_vals['Embeddings']
  if not isinstance(emb, torch.Tensor):
    emb = torch.tensor(emb)
  emb = emb.unsqueeze(0)
  unmask_embeddings,unmask = get_unmasked_embeddings(unmask_df)
  similarities = util.cos_sim(emb, unmask_embeddings)[0]
  max_val, max_idx = torch.max(similarities, dim=0)
  max_val = max_val.item()
  max_idx = max_idx.item()
  if max_val > confidence_val:
    return (unmask.iloc[max_idx][feature])
  else:
    return None
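
The nearest-neighbour rule used above can be illustrated without any model downloads. The following is a minimal sketch in plain Python, assuming toy 2-D vectors in place of the real sentence embeddings; `cosine` and `nearest_value` are hypothetical helper names, not part of the notebook or of `sentence_transformers`.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_value(query, known, threshold=0.5):
    """Return the value of the most similar known embedding, or None if the
    best similarity does not exceed the threshold.

    known: non-empty list of (embedding, value) pairs.
    """
    best_sim, best_val = max(
        ((cosine(query, emb), val) for emb, val in known),
        key=lambda pair: pair[0],
    )
    return best_val if best_sim > threshold else None

# Toy "models" with known parameter counts.
known = [([1.0, 0.0], 7e9), ([0.0, 1.0], 70e9)]
close = nearest_value([0.9, 0.1], known)    # points almost the same way as the first entry
far = nearest_value([-1.0, -1.0], known)    # points away from both entries
print(close, far)
```

In the notebook version, `get_unmasked_embeddings` is called once per row, so the known embeddings are re-stacked for every missing value. Stacking them once outside `apply` and passing the resulting tensor in would likely be faster on larger frames.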

### Initializing RAG Model

In [None]:
from llmware.models import ModelCatalog
rag_model = ModelCatalog().load_model("dragon-qwen-7b-ov",temperature=0.01)


INFO:llmware.models:OVGenerativeModel - loading - could not find GPU - setting device for CPU


### Creating a Generalized function to extract values using RAG model

In [None]:
def extract_feature_from_notes(row_vals, feature):
    if not row_vals['Overall_notes'] or not row_vals['Overall_notes'].strip():
        return "na"

    # Define feature-specific configurations
    feature_configs = {
        "Parameters": {
            "question": f"What is the number of parameters for the model '{row_vals['Model']}'?",
            "description": "number of parameters",
            "examples": [
                (
                    f"Context for the model 'GLM-4 (0116)':\n"
                    f"'GLM-4 has 130 billion parameters,...'\n"
                    f"Question: What is the number of parameters for the model 'GLM-4 (0116)'?\n"
                    f"Answer: 130000000000"
                ),
                (
                    f"Context for the model 'GPT-4.5':\n"
                    f"'Not Found._x000D_ Is the training time noted:_x000D_ <bot>: Not Found._x000D_...'\n"
                    f"Question: What is the number of parameters for the model 'GPT-4.5'?\n"
                    f"Answer: na"
                )
            ],
            "format_note": "For example, for '130 billion parameters' return '130000000000', for '1.6B parameters' return '1600000000'."
        },
        "data size": {
            "question": f"What is the size of the training dataset for the model '{row_vals['Model']}'?",
            "description": "dataset size (e.g., number of tokens, words, or samples)",
            "examples": [
                (
                    f"Context for the model 'Qwen2.5-Max':\n"
                    f"'Concurrently, we are developing Qwen2.5-Max, a large-scale MoE model that has been pretrained on over 20 trillion tokens...'\n"
                    f"Question: What is the size of the training dataset for the model 'Qwen2.5-Max'?\n"
                    f"Answer: 20000000000000"
                ),
                (
                    f"Context for the model 'GPT-4.5':\n"
                    f"'Not Found._x000D_ Is the training time noted:_x000D_ <bot>: Not Found._x000D_ Is the training dataset noted:_x000D_ <bot>: biomedical research papers._x000D_...'\n"
                    f"Question: What is the size of the training dataset for the model 'GPT-4.5'?\n"
                    f"Answer: na"
                )
            ],
            "format_note": "For example, for '10 trillion tokens' return '10000000000000', for '1.2 billion words' return '1200000000', or for '10.7M video-caption pairs' return '10700000'."
        },
        "Training time (hours)": {
            "question": f"What is the training time in hours for the model '{row_vals['Model']}'?",
            "description": "training time in hours",
            "examples": [
                (
                    f"Context for the model 'Mistral Large':\n"
                    f"'...trained for approximately 2500 hours on 4000 H100s...'\n"
                    f"Question: What is the training time in hours for the model 'Mistral Large'?\n"
                    f"Answer: 2500"
                ),
                (
                    f"Context for the model 'GPT-4.5':\n"
                    f"'Not Found._x000D_ Is the training time noted:_x000D_ <bot>: Not Found._x000D_...'\n"
                    f"Question: What is the training time in hours for the model 'GPT-4.5'?\n"
                    f"Answer: na"
                )
            ],
            "format_note": "For example, for '2500 hours' return '2500', for '3 months' convert to hours (assume 30 days per month) and return '2160'."
        },
        "Training compute (FLOP)": {
            "question": f"What is the training compute in FLOPs for the model '{row_vals['Model']}'?",
            "description": "training compute in FLOPs",
            "examples": [
                (
                    f"Context for the model 'Grok-3':\n"
                    f"'...trained on around 464000000000000000000000000 FLOPs...'\n"
                    f"Question: What is the training compute in FLOPs for the model 'Grok-3'?\n"
                    f"Answer: 464000000000000000000000000"
                ),
                (
                    f"Context for the model 'GPT-4.5':\n"
                    f"'Not Found._x000D_ Is the training compute noted:_x000D_ <bot>: Not Found._x000D_...'\n"
                    f"Question: What is the training compute in FLOPs for the model 'GPT-4.5'?\n"
                    f"Answer: na"
                )
            ],
            "format_note": "For example, for '464e24 FLOPs' return '464000000000000000000000000', for '1.2e25 FLOPs' return '12000000000000000000000000'."
        },
        "Finetune compute (FLOP)": {
            "question": f"What is the finetune compute in FLOPs for the model '{row_vals['Model']}'?",
            "description": "finetune compute in FLOPs",
            "examples": [
                (
                    f"Context for the model 'AFM-server':\n"
                    f"'...finetuned with 1e24 FLOPs on a specialized dataset...'\n"
                    f"Question: What is the finetune compute in FLOPs for the model 'AFM-server'?\n"
                    f"Answer: 1000000000000000000000000"
                ),
                (
                    f"Context for the model 'GPT-4.5':\n"
                    f"'Not Found._x000D_ Is the fine-tune compute noted:_x000D_ <bot>: Not Found._x000D_...'\n"
                    f"Question: What is the finetune compute in FLOPs for the model 'GPT-4.5'?\n"
                    f"Answer: na"
                )
            ],
            "format_note": "For example, for '1e24 FLOPs' return '1000000000000000000000000', for '2.5e25 FLOPs' return '25000000000000000000000000'."
        }
    }

    # Validate feature
    if feature not in feature_configs:
        raise ValueError(f"Invalid feature: {feature}. Supported features are: {list(feature_configs.keys())}")

    config = feature_configs[feature]

    try:
        context = (
            f"Context for the model '{row_vals['Model']}':\n"
            f"{row_vals['Overall_notes']}"
        )
        question = (
            f"{config['question']}\n"
            f"Provide the answer as a numerical value representing the {config['description']}. "
            f"{config['format_note']} "
            f"If the {config['description']} is not explicitly mentioned or cannot be determined, return 'na'.\n\n"
            f"Example 1:\n"
            f"{config['examples'][0]}\n\n"
            f"Example 2:\n"
            f"{config['examples'][1]}"
        )
        prompt = f"<human>:{context}\n{question}\n<bot>:"
        output = rag_model.inference(prompt)
        answer = output.get("llm_response", output).strip()
        return answer if answer else "na"
    except Exception as e:
        print(f"QA extraction error for feature {feature}: {e}")
        return "na"
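
The filling step converts the model's answers with `pd.to_numeric(..., errors='coerce')`, so any answer that is not a bare number (for example `130 billion` instead of `130000000000`) becomes NaN and is lost. A small post-processing helper can recover such answers. The sketch below is one possible approach, not part of the original notebook: `parse_numeric_answer`, `_SCALE`, and `_NUM` are hypothetical names, and the helper assumes the answer starts with the value of interest, since it reads only the first number it finds.

```python
import re

# Hypothetical multipliers; extend as needed for the wording in your notes.
_SCALE = {"k": 1e3, "thousand": 1e3, "m": 1e6, "million": 1e6,
          "b": 1e9, "billion": 1e9, "t": 1e12, "trillion": 1e12}

# A number (optional thousands separators, decimals, exponent) and an optional word after it.
_NUM = re.compile(r"([-+]?\d[\d,]*\.?\d*(?:e[-+]?\d+)?)\s*([a-z]+)?", re.IGNORECASE)

def parse_numeric_answer(text):
    """Return a float for answers like '130 billion' or '1.6B', else None."""
    if text is None:
        return None
    match = _NUM.search(str(text))
    if not match:
        return None  # e.g. 'na' or 'Not Found.'
    number = float(match.group(1).replace(",", ""))
    suffix = (match.group(2) or "").lower()
    # Unknown words such as 'hours' or 'FLOPs' leave the number unscaled.
    return number * _SCALE.get(suffix, 1.0)

for answer in ["130 billion parameters", "1.6B", "2,500 hours", "1e24 FLOPs", "na"]:
    print(repr(answer), "->", parse_numeric_answer(answer))
```

Applying such a helper to the RAG output before `pd.to_numeric` would keep more of the extracted values, at the cost of trusting the unit words in the answer.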

### Fill the Missing Values

In [None]:
features = ['Parameters','data size','Training time (hours)',
             'Training compute (FLOP)', 'Finetune compute (FLOP)']
for feature in tqdm(features, desc="Processing features"):
  # Step 1: try to extract the value from the notes with the RAG model
  mask = df_new[df_new[feature].isna()].copy()

  mask[feature] = mask.progress_apply(lambda row: extract_feature_from_notes(row,feature),axis=1)

  # Non-numeric answers such as "na" are coerced to NaN
  df_new.loc[mask.index,feature] = pd.to_numeric(mask[feature],errors='coerce')

  # Step 2: copy the value of the most similar model (cosine similarity above 0.5)
  mask = df_new[df_new[feature].isna()].copy()
  unmask = df_new[~df_new[feature].isna()].copy()

  mask[feature] = mask.progress_apply(lambda row: assign_values_to_features(row,unmask,feature),axis=1)

  df_new.loc[mask.index,feature] = pd.to_numeric(mask[feature],errors='coerce')

  # Step 3: fill what is still missing with the root mean square (RMS) of the known values
  known = df_new[feature].dropna()
  # Assign the result back; fillna(inplace=True) on a selected column acts on a copy
  df_new[feature] = df_new[feature].fillna(np.sqrt((known ** 2).mean()))
print('Given Features are filled')



Given Features are filled
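
The last fallback in the loop fills the remaining gaps with the root mean square, RMS = sqrt(mean(x²)), of the known values. The RMS is never smaller than the arithmetic mean, and for heavy-tailed quantities such as parameter counts or FLOP it is pulled strongly toward the largest entries. The sketch below compares it with the mean and the median on made-up numbers; the data and variable names are assumptions for illustration only.

```python
import math
import statistics

def rms(values):
    """Root mean square: the square root of the mean of the squared values."""
    return math.sqrt(sum(v * v for v in values) / len(values))

# Made-up, heavily skewed sample: three small values and one very large one.
known = [1.0, 1.0, 1.0, 1000.0]

fill_rms = rms(known)
fill_mean = statistics.mean(known)
fill_median = statistics.median(known)
print(f"RMS={fill_rms:.2f}  mean={fill_mean:.2f}  median={fill_median:.2f}")
```

Whether the RMS is the right fallback depends on the downstream analysis. The median, or a mean taken in log space, are alternatives that may be worth comparing on this dataset.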





In [None]:
df_new.drop('Overall_notes',axis=1,inplace=True)

In [None]:
features = ['Parameters','data size','Training time (hours)',
             'Training compute (FLOP)', 'Finetune compute (FLOP)']
df.drop(columns=features,inplace=True)

In [None]:
df.drop(columns=['Overall_notes','Authors','Abstract'],inplace=True)

In [None]:
for col in df.columns:
    print(repr(col))

'Model'
'Domain'
'Country'
'Organization'
'Date'
'Category'
'Task'
'Confidence'
'Hardware quantity'
'Training hardware'
'accessibility'
'Training dataset'
'Training code accessibility'


In [None]:
df = pd.merge(df, df_new, on='Model', how='left')
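
A left merge on `Model` keeps the row count of `df` only if each model appears once in the right-hand frame. The `validate` argument of `pd.merge` can check that assumption. The sketch below uses toy frames, not the project data, to show the check passing and then failing when a key is duplicated.

```python
import pandas as pd
from pandas.errors import MergeError

# Toy stand-ins: 'left' repeats models (one row per task), 'right' has one row per model.
left = pd.DataFrame({"Model": ["a", "a", "b"], "Task": ["chat", "code", "chat"]})
right = pd.DataFrame({"Model": ["a", "b"], "Parameters": [7e9, 70e9]})

# many_to_one: 'Model' may repeat on the left but must be unique on the right.
merged = pd.merge(left, right, on="Model", how="left", validate="many_to_one")

# With a duplicated key on the right, an unvalidated merge would multiply rows;
# validate turns that into an explicit error instead.
right_dup = pd.concat([right, right.iloc[[0]]], ignore_index=True)
try:
    pd.merge(left, right_dup, on="Model", how="left", validate="many_to_one")
    duplicate_detected = False
except MergeError:
    duplicate_detected = True

print(len(merged), duplicate_detected)
```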

In [None]:
df.drop(columns=['Embeddings'],inplace=True)

In [None]:
for col in df.columns:
    print(repr(col),f"\tdtype: {df[col].dtype}")

'Model' 	dtype: object
'Domain' 	dtype: object
'Country' 	dtype: object
'Organization' 	dtype: object
'Date' 	dtype: object
'Category' 	dtype: object
'Task' 	dtype: object
'Confidence' 	dtype: object
'Hardware quantity' 	dtype: float64
'Training hardware' 	dtype: object
'accessibility' 	dtype: object
'Training dataset' 	dtype: object
'Training code accessibility' 	dtype: object
'Parameters' 	dtype: float64
'data size' 	dtype: float64
'Training time (hours)' 	dtype: float64
'Training compute (FLOP)' 	dtype: float64
'Finetune compute (FLOP)' 	dtype: float64


In [None]:
del df_new

In [None]:
cat_cols = df.select_dtypes(include='object').columns
# Assign the result back instead of calling fillna(inplace=True) on a selected column,
# which acts on a copy and will stop working in pandas 3.0
df[cat_cols] = df[cat_cols].fillna('Not-defined')


In [None]:
output_path_large_models = "/content/drive/MyDrive/DATA 6250/Datasets/Updated/Filled/large_scale_ai_models_filled.csv"
# to_csv overwrites an existing file, so only the target folder has to exist
os.makedirs(os.path.dirname(output_path_large_models), exist_ok=True)
df.to_csv(output_path_large_models, index=False)

### Filling Training Hardware

In [None]:
output_path_large_models = "/content/drive/MyDrive/DATA 6250/Datasets/Updated/Filled/large_scale_ai_models_filled.csv"
df_new = pd.read_csv(output_path_large_models)
df_new.head()

| | Model | Domain | Country | Organization | Date | Category | Task | Confidence | Hardware quantity | Training hardware | accessibility | Training dataset | Training code accessibility | Parameters | data size | Training time (hours) | Training compute (FLOP) | Finetune compute (FLOP) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Llama 4 Scout | Multimodal | USA | Meta AI | 2025-04-05 | Industry | Chat | Unverified | NaN | Not-defined | Open weights (restricted use) | Not-defined | Not-defined | 109000000000.0 | 30000000000000.0 | 1728.0 | 4.08e+24 | 1877.162158 |
| 1 | Llama 4 Scout | Multimodal | USA | Meta AI | 2025-04-05 | Industry | Code generation | Unverified | NaN | Not-defined | Open weights (restricted use) | Not-defined | Not-defined | 109000000000.0 | 30000000000000.0 | 1728.0 | 4.08e+24 | 1877.162158 |
| 2 | Llama 4 Scout | Language | USA | Meta AI | 2025-04-05 | Industry | Chat | Unverified | NaN | Not-defined | Open weights (restricted use) | Not-defined | Not-defined | 109000000000.0 | 30000000000000.0 | 1728.0 | 4.08e+24 | 1877.162158 |
| 3 | Llama 4 Scout | Language | USA | Meta AI | 2025-04-05 | Industry | Code generation | Unverified | NaN | Not-defined | Open weights (restricted use) | Not-defined | Not-defined | 109000000000.0 | 30000000000000.0 | 1728.0 | 4.08e+24 | 1877.162158 |
| 4 | Llama 4 Maverick | Multimodal | USA | Meta AI | 2025-04-05 | Industry | Chat | Unverified | NaN | Not-defined | Open weights (restricted use) | Not-defined | Not-defined | 400000000000.0 | 30000000000000.0 | 1728.0 | 1.4916e+25 | 1877.162158 |


In [None]:
embeddings = sen_model.encode(df_new['Model'].tolist(),convert_to_tensor=True)
df_new['Embeddings'] = [row for row in embeddings.cpu()]

## Filling values based on similarity

In [None]:
mask = df_new[df_new['Training hardware'] == 'Not-defined'].copy()
unmask = df_new[df_new['Training hardware'] != 'Not-defined'].copy()

In [None]:
tqdm.pandas()
mask['Training hardware'] = mask.progress_apply(lambda row: assign_values_to_features(row,unmask,'Training hardware'),axis=1)

100%|██████████| 982/982 [00:01<00:00, 632.48it/s]


In [None]:
# Only the text column is filled; numeric columns keep NaN for missing values
mask['Training hardware'] = mask['Training hardware'].fillna('Not-defined')


In [None]:
df_new.loc[mask.index,'Training hardware'] = mask['Training hardware']

In [None]:
mask = df_new[df_new['Hardware quantity'].isna()].copy()
unmask = df_new[~df_new['Hardware quantity'].isna()].copy()
mask['Hardware quantity'] = mask.progress_apply(lambda row: assign_values_to_features(row,unmask,'Hardware quantity'),axis=1)

100%|██████████| 1693/1693 [00:01<00:00, 1224.37it/s]


In [None]:
df_new.loc[mask.index,'Hardware quantity'] = mask['Hardware quantity']
df_new['Hardware quantity'] = pd.to_numeric(df_new['Hardware quantity'],errors='coerce')
# 'Hardware quantity' is numeric, so rows still missing are NaN rather than 'Not-defined'
known = df_new['Hardware quantity'].dropna()
# Fall back to the root mean square (RMS) of the known values
df_new['Hardware quantity'] = df_new['Hardware quantity'].fillna(np.sqrt((known ** 2).mean()))


In [None]:
df_new.drop(columns=['Embeddings'],inplace=True)

## Store the output data frame

In [None]:
output_path_large_models = "/content/drive/MyDrive/DATA 6250/Datasets/Updated/Filled/large_scale_ai_models_filled.csv"
# to_csv overwrites an existing file, so only the target folder has to exist
os.makedirs(os.path.dirname(output_path_large_models), exist_ok=True)
df_new.to_csv(output_path_large_models, index=False)