# Exploratory Data Analysis

### Introduction

We have data from a CSV file found [here](https://crfm.stanford.edu/ecosystem-graphs/index.html?mode=table). Let's try to understand it. The motivating goal of the Data Engineer is to clean and structure the data so that it is accessible to our Data Analyst and Software Engineer teams.

### Diving In

We start by importing pandas, our tool for working with tabular data, and loading the data from the CSV source.

In [1]:
import pandas as pd # type: ignore



In [13]:
df = pd.read_csv('../data/assets.csv')

In [54]:
df.iloc[33, 3]

'Large language models trained on up to 1.5 trillion tokens.'

Limit the data to rows whose `type` is `model`; columns that carry no information are dropped further below:

In [15]:
df = df[(df['type']=='model')]
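Filtering with a boolean mask returns an object that pandas may treat as a slice of the original frame, which is a likely source of the `SettingWithCopyWarning` messages in later cells. A minimal sketch of one common remedy, taking an explicit copy right after the filter; the toy frame and its values are made up for illustration:

```python
import pandas as pd

# Hypothetical stand-in for the assets table; values are illustrative only.
raw = pd.DataFrame({
    "type": ["model", "dataset", "model"],
    "name": ["A", "B", "C"],
})

# .copy() makes the filtered frame independent of `raw`, so later
# column assignments are not flagged as writes to a slice.
models = raw[raw["type"] == "model"].copy()
models["id"] = range(1, 1 + len(models))
```

The original `raw` frame is left untouched by the assignment.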

In [91]:
df

Unnamed: 0,name,organization,description,created_date,url,size,analysis,dependencies,quality_control,access,...,prohibited_uses,monitoring,feedback,model_card,training_emissions,training_time,training_hardware,id,input_modality,output_modality
3,Lag-LLaMA,"Morgan Stanley, ServiceNow Research, Universit...",Lag-LLaMA is a general-purpose foundation mode...,2024-02-08,https://time-series-foundation-models.github.i...,unknown,Evaluated on previously unseen time series dat...,[],,open,...,,unknown,https://huggingface.co/time-series-foundation-...,https://huggingface.co/time-series-foundation-...,unknown,unknown,A single NVIDIA Tesla-P100 GPU,1,[text],[text]
4,Prithvi,IBM,Prithvi is a first-of-its-kind temporal Vision...,2023-08-03,https://github.com/NASA-IMPACT/hls-foundation-os,100M parameters (dense),,[NASA HLS data],,open,...,,,https://huggingface.co/ibm-nasa-geospatial/Pri...,https://huggingface.co/ibm-nasa-geospatial/Pri...,,,,2,"[text, video]","[text, video]"
6,Granite,IBM,Granite is a set of multi-size foundation mode...,2023-09-28,https://www.ibm.com/blog/building-ai-for-busin...,13B parameters (dense),unknown,[],"Training data passed through IBM HAP detector,...",limited,...,,,,,unknown,unknown,unknown,3,[text],"[code, text]"
7,Animagine XL 3.1,Cagliostro Research Lab,"An open-source, anime-themed text-to-image mod...",2024-03-18,https://cagliostrolab.net/posts/animagine-xl-v...,unknown,unknown,[Animagine XL 3.0],"The model undergoes pretraining, first stage f...",open,...,Not suitable for creating realistic photos or ...,unknown,https://huggingface.co/cagliostrolab/animagine...,https://huggingface.co/cagliostrolab/animagine...,unknown,"Approximately 15 days, totaling over 350 GPU h...",2x A100 80GB GPUs,4,[text],[image]
11,Bark,Suno,Bark is a text-to-audio model that can generat...,2023-04-20,https://github.com/suno-ai/bark,,,[AudioLM],,open,...,,,https://huggingface.co/spaces/suno/bark/discus...,https://github.com/suno-ai/bark/blob/main/mode...,unknown,unknown,,5,[text],[audio]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
559,Samba 1,Samba Nova Systems,Samba 1 is a trillion parameter generative AI ...,2024-02-28,https://sambanova.ai/blog/samba-1-composition-...,1T parameters (dense),unknown,"[Llama 2, Mistral, Falcon-180B, Deepseek, BLOO...",,limited,...,,unknown,,,unknown,unknown,unknown,355,[text],[text]
561,SciPhi Mistral,SciPhi,SciPhi Mistral is a Large Language Model (LLM)...,2023-11-07,https://huggingface.co/SciPhi/SciPhi-Mistral-7...,7B parameters (dense),,[Mistral],,open,...,,unknown,https://huggingface.co/SciPhi/SciPhi-Mistral-7...,https://huggingface.co/SciPhi/SciPhi-Mistral-7...,unknown,unknown,unknown,356,[text],[text]
562,Notus,Argilla,"Notus is an open source LLM, fine-tuned using ...",2023-12-01,https://argilla.io/blog/notus7b/,7B parameters (dense),Evaluated on MT-Bench and AlphaEval benchmarks.,"[UltraFeedback, Zephyr]",,open,...,,,https://huggingface.co/argilla/notus-7b-v1/dis...,https://huggingface.co/argilla/notus-7b-v1,unknown,unknown,8 x A100 40GB GPUs,357,[text],[text]
563,Amber,LLM360,"Amber is the first model in the LLM360 family,...",2023-12-12,https://www.llm360.ai/,7B parameters (dense),Evaluated on several benchmark LLM tasks,"[Arxiv, Books, C4, RefinedWeb, StarCoder, Stac...",,open,...,,unknown,https://huggingface.co/LLM360/Amber/discussions,https://huggingface.co/LLM360/Amber,unknown,unknown,"56 DGX A100 nodes, each equipped with 4 80GB A...",358,[text],[text]


In [56]:
df.columns

Index(['name', 'organization', 'description', 'created_date', 'url',
       'modality', 'size', 'analysis', 'dependencies', 'quality_control',
       'access', 'license', 'intended_uses', 'prohibited_uses', 'monitoring',
       'feedback', 'model_card', 'training_emissions', 'training_time',
       'training_hardware'],
      dtype='object')

Description of the dataset columns:
- name: name of the model (must be a unique identifier).
- organization: organization that created the model.
- description: description of the model.
- created_date: when the model was created.
- url: link to the website or paper that details the model.
- model_card: link to the model card that describes the model.
- modality: modalities represented in the model (e.g. Text, Text (English), Video, Code, Code (Python), Image).
- analysis: description of any analysis that was done on the model.
- size: size (and shape) of the model (e.g. number of parameters).
- dependencies: list of assets that were used to create the model (applications, models, datasets).
- training_emissions: estimate of the carbon emissions produced while creating the model.
- training_time: how long it took to train the model.
- training_hardware: hardware used to train the model.
- quality_control: measures taken to ensure safety and quality and to mitigate harm.
- access: who can access (and use) the model.
- license: license of the model.
- intended_uses: description of what the model can be used for downstream.
- prohibited_uses: description of what the model should not be used for downstream.
- monitoring: description of measures taken to monitor the model.
- feedback: how downstream issues with the model should be reported.

In [57]:
# Percentage of unique model names. Are the duplicates needed? They differ in the datasets used to train them and are domain-specific.
len(df['name'].unique()) / len(df) * 100


99.72144846796658

In [58]:
df['size'].describe()

count         355
unique        110
top       unknown
freq           78
Name: size, dtype: object

Can we plot the top 10 models by number of parameters?
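One way to approach this would be to convert the free-text `size` values into numbers first. A rough sketch, assuming sizes follow the `<number><K|M|B|T> parameters` pattern seen in the sample rows above; `parse_parameter_count` is a hypothetical helper, and anything that does not match (such as `unknown`) becomes NaN:

```python
import math
import re

import pandas as pd

# Multipliers for the size suffixes assumed to occur in the column.
_SUFFIX = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}
_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*([KMBT])\s+parameters", re.IGNORECASE)

def parse_parameter_count(raw) -> float:
    """Return the parameter count as a float, or NaN if the text does not match."""
    match = _PATTERN.search(str(raw))
    if match is None:
        return math.nan
    value, suffix = match.groups()
    return float(value) * _SUFFIX[suffix.upper()]

# A few size strings copied from the rows displayed earlier.
sizes = pd.DataFrame({
    "name": ["Prithvi", "Granite", "Samba 1", "Lag-LLaMA"],
    "size": ["100M parameters (dense)", "13B parameters (dense)",
             "1T parameters (dense)", "unknown"],
})
sizes["parameters"] = sizes["size"].map(parse_parameter_count)

# Drop unparseable sizes, then keep the largest models.
top = sizes.dropna(subset=["parameters"]).nlargest(10, "parameters")
```

With 78 of the 355 non-null sizes recorded as `unknown`, any such ranking would cover only part of the dataset.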

Is the dataset outdated? 

In [55]:
# Drop columns that hold at most one distinct value (NaN counts as a value).
for column_name in list(df.columns):
    if df[column_name].nunique(dropna=False) <= 1:
        df.drop(columns=column_name, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.drop(columns=column_name, inplace=True)

What is the size of the data?

In [77]:
row_count, column_count = df.shape
print(f'The data has {row_count} rows and {column_count} columns.')

The data has 359 rows and 20 columns.


That is quite a lot of columns. Let's list them, along with the number of non-null values and their data type.

In [78]:
webpage_columns = [
    'type'
    , 'name'
    , 'organization'
    , 'created_date'
    , 'size'
    , 'modality'
    , 'access'
    , 'license'
    , 'dependencies'
]

In [79]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 359 entries, 3 to 564
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   name                359 non-null    object
 1   organization        359 non-null    object
 2   description         299 non-null    object
 3   created_date        357 non-null    object
 4   url                 357 non-null    object
 5   modality            357 non-null    object
 6   size                355 non-null    object
 7   analysis            234 non-null    object
 8   dependencies        359 non-null    object
 9   quality_control     85 non-null     object
 10  access              359 non-null    object
 11  license             343 non-null    object
 12  intended_uses       129 non-null    object
 13  prohibited_uses     86 non-null     object
 14  monitoring          118 non-null    object
 15  feedback            161 non-null    object
 16  model_card          170 non-nul

Can we find a primary key?

In [8]:
len(df['name'].unique()) == df.shape[0]

False
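To see why the check fails, it can help to list the rows whose `name` repeats. A small sketch on a made-up frame (the names and organizations below are placeholders, not values from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["Alpha", "Beta", "Alpha"],
    "organization": ["Org 1", "Org 2", "Org 3"],
})

# Series.is_unique is a direct way to ask the primary-key question.
name_is_key = toy["name"].is_unique

# keep=False marks every member of a duplicate group, not just the repeats.
repeated = toy[toy["name"].duplicated(keep=False)]
```

Inspecting `repeated` shows whether the duplicates are true repeats or distinct models that happen to share a name.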

Never mind: `name` is not unique, so let's make our own key.

In [80]:
df['id'] = range(1, 1+len(df))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['id'] = range(1, 1+len(df))


We can see that `created_date` is stored as `object` but should be a datetime.

In [81]:
df['created_date'] = pd.to_datetime(df['created_date'], errors='coerce')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['created_date'] = pd.to_datetime(df['created_date'], errors='coerce')


In [82]:
df['created_date'].describe()

count                              357
mean     2023-04-11 18:21:10.588235008
min                2019-10-01 00:00:00
25%                2022-10-27 00:00:00
50%                2023-07-11 00:00:00
75%                2023-11-20 00:00:00
max                2024-04-29 00:00:00
Name: created_date, dtype: object
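`errors='coerce'` silently turns unparseable values into `NaT`, so it may be worth checking whether anything was lost. Here the non-null count is 357 both before and after the conversion, which suggests nothing was. A sketch of how such a check could look, using made-up date strings:

```python
import pandas as pd

# Illustrative inputs: one valid date, one bad string, one missing value.
raw_dates = pd.Series(["2024-02-08", "not a date", None])
parsed = pd.to_datetime(raw_dates, errors="coerce")

# Values that were present in the input but failed to parse.
lost = raw_dates[parsed.isna() & raw_dates.notna()]
```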

We see that `modality` and `dependencies` are not atomic: each cell packs several values into a single string.

In [83]:
df[['modality']].dropna(inplace=False).head()

|    | modality |
|----|----------|
| 3  | text; text |
| 4  | text, video; text, video |
| 6  | text; code, text |
| 7  | text; image |
| 11 | text; audio |


In [84]:
def repackage_modality(raw: str) -> tuple[list[str], list[str]]:
    """Split an 'inputs; outputs' modality string into two lists."""
    raw = str(raw)
    semicolon_count = raw.count(';')
    assert semicolon_count <= 1, 'LLM modality invalid.'
    # No semicolon: use the same modalities for both input and output.
    if semicolon_count == 0:
        raw = raw + ';' + raw
    modal_input_str, modal_output_str = raw.split(';')
    modal_inputs = [s.strip() for s in modal_input_str.split(',')]
    modal_outputs = [s.strip() for s in modal_output_str.split(',')]
    return modal_inputs, modal_outputs

In [85]:
df[['input_modality', 'output_modality']] = df['modality'].apply(repackage_modality).apply(pd.Series)
df.drop(columns='modality', inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[['input_modality', 'output_modality']] = df['modality'].apply(repackage_modality).apply(pd.Series)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.drop(columns='modality', inplace=True)
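The same split can also be expressed with pandas string methods instead of a row-wise `apply(pd.Series)`. A sketch under the assumption that every value contains exactly one `;`; `split_items` is a hypothetical helper, and the sample strings are taken from the output above:

```python
import pandas as pd

def split_items(text: str) -> list[str]:
    """Split a comma-separated string and trim whitespace around each item."""
    return [item.strip() for item in text.split(",")]

toy = pd.DataFrame({
    "modality": ["text; text", "text, video; text, video", "text; code, text"],
})

# Assumes exactly one ';' per value; rows without one would need the
# fallback that repackage_modality applies.
parts = toy["modality"].str.split(";", n=1, expand=True)
toy["input_modality"] = parts[0].map(split_items)
toy["output_modality"] = parts[1].map(split_items)
```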


In [86]:
df[['dependencies']].dropna(inplace=False).tail()

|     | dependencies |
|-----|--------------|
| 559 | ['Llama 2', 'Mistral', 'Falcon-180B', 'Deepsee... |
| 561 | ['Mistral'] |
| 562 | ['UltraFeedback', 'Zephyr'] |
| 563 | ['Arxiv', 'Books', 'C4', 'RefinedWeb', 'StarCo... |
| 564 | ['SlimPajama', 'StarCoder'] |


In [87]:
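def repackage_dependencies(raw: str) -> list[str]:
    # Strip the surrounding brackets, split on commas, then trim spaces and quotes.
    # Note: an empty list '[]' becomes [''], not [].
    return [s.strip(' ').strip('\'') for s in str(raw)[1:-1].split(',')]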
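Slicing off the brackets and splitting on commas would break if a dependency name itself contained a comma. If the column really holds Python list literals, as the sample rows suggest, `ast.literal_eval` may be a sturdier option. A sketch only; `parse_dependencies` is a hypothetical name, and returning an empty list for unparseable input is an assumption about the desired behavior:

```python
import ast

def parse_dependencies(raw) -> list[str]:
    """Parse a string such as "['A', 'B']" into a list of names."""
    try:
        parsed = ast.literal_eval(str(raw))
    except (ValueError, SyntaxError):
        # Not a valid Python literal (e.g. 'nan'): treat as no dependencies.
        return []
    if not isinstance(parsed, list):
        return []
    return [str(item) for item in parsed]
```

Unlike the slicing approach, this returns `[]` for `'[]'`, so a later `explode` would produce a missing value for that row instead of an empty string.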

In [88]:
len(df.dependencies)

359

In [89]:
df[['dependencies']] = df[['dependencies']].copy().map(repackage_dependencies)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[['dependencies']] = df[['dependencies']].copy().map(repackage_dependencies)


In [90]:
df.shape

(359, 22)

In [19]:
df = df.explode('input_modality').explode('output_modality').explode('dependencies')

In [20]:
df.shape

(898, 22)
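Chaining `explode` over three list columns yields one row per combination of input modality, output modality, and dependency, which is why the row count grows from 359 to 898. An alternative that some schemas use is one narrow table per list column, keyed by `id`. A sketch on a single toy row modeled on the Prithvi entry; the `*_links` names are hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [2],
    "input_modality": [["text", "video"]],
    "output_modality": [["text", "video"]],
    "dependencies": [["NASA HLS data"]],
})

# Chained explode: 2 inputs x 2 outputs x 1 dependency = 4 rows.
crossed = (
    toy.explode("input_modality")
       .explode("output_modality")
       .explode("dependencies")
)

# One link table per list column keeps each fact on a single row.
dependency_links = toy[["id", "dependencies"]].explode("dependencies")
input_links = toy[["id", "input_modality"]].explode("input_modality")
```

In the crossed frame the single dependency appears four times, whereas `dependency_links` holds it once.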

In [98]:
df.head(2)

Unnamed: 0,name,organization,description,created_date,url,size,analysis,dependencies,quality_control,access,...,prohibited_uses,monitoring,feedback,model_card,training_emissions,training_time,training_hardware,id,input_modality,output_modality
3,Lag-LLaMA,"Morgan Stanley, ServiceNow Research, Universit...",Lag-LLaMA is a general-purpose foundation mode...,2024-02-08,https://time-series-foundation-models.github.i...,unknown,Evaluated on previously unseen time series dat...,[],,open,...,,unknown,https://huggingface.co/time-series-foundation-...,https://huggingface.co/time-series-foundation-...,unknown,unknown,A single NVIDIA Tesla-P100 GPU,1,[text],[text]
4,Prithvi,IBM,Prithvi is a first-of-its-kind temporal Vision...,2023-08-03,https://github.com/NASA-IMPACT/hls-foundation-os,100M parameters (dense),,[NASA HLS data],,open,...,,,https://huggingface.co/ibm-nasa-geospatial/Pri...,https://huggingface.co/ibm-nasa-geospatial/Pri...,,,,2,"[text, video]","[text, video]"


In [22]:
df.iloc[5:10, :11]

Unnamed: 0,name,organization,description,created_date,url,size,analysis,dependencies,quality_control,access,license
6,Granite,IBM,Granite is a set of multi-size foundation mode...,2023-09-28,https://www.ibm.com/blog/building-ai-for-busin...,13B parameters (dense),unknown,,"Training data passed through IBM HAP detector,...",limited,
6,Granite,IBM,Granite is a set of multi-size foundation mode...,2023-09-28,https://www.ibm.com/blog/building-ai-for-busin...,13B parameters (dense),unknown,,"Training data passed through IBM HAP detector,...",limited,
7,Animagine XL 3.1,Cagliostro Research Lab,"An open-source, anime-themed text-to-image mod...",2024-03-18,https://cagliostrolab.net/posts/animagine-xl-v...,unknown,unknown,Animagine XL 3.0,"The model undergoes pretraining, first stage f...",open,Fair AI Public License 1.0-SD
11,Bark,Suno,Bark is a text-to-audio model that can generat...,2023-04-20,https://github.com/suno-ai/bark,,,AudioLM,,open,MIT
13,GPT-JT,Together,,2022-11-29,https://www.together.xyz/blog/releasing-v1-of-...,6B parameters (dense),,GPT-J,,open,Apache 2.0


In [23]:
df[['id', 'dependencies']]

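|     | id  | dependencies |
|-----|-----|--------------|
| 3   | 1   |  |
| 4   | 2   | NASA HLS data |
| 4   | 2   | NASA HLS data |
| 4   | 2   | NASA HLS data |
| 4   | 2   | NASA HLS data |
| ... | ... | ... |
| 563 | 358 | Wikipedia |
| 564 | 359 | SlimPajama |
| 564 | 359 | StarCoder |
| 564 | 359 | SlimPajama |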


In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 898 entries, 3 to 564
Data columns (total 22 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   name                898 non-null    object        
 1   organization        898 non-null    object        
 2   description         732 non-null    object        
 3   created_date        896 non-null    datetime64[ns]
 4   url                 896 non-null    object        
 5   size                892 non-null    object        
 6   analysis            584 non-null    object        
 7   dependencies        898 non-null    object        
 8   quality_control     159 non-null    object        
 9   access              898 non-null    object        
 10  license             866 non-null    object        
 11  intended_uses       330 non-null    object        
 12  prohibited_uses     208 non-null    object        
 13  monitoring          232 non-null    object        
 14 

In [97]:
models_list = list(df['name'])
models_list.sort()
models_list

['A.X',
 'ACT-1',
 'Alpaca',
 'AlphaCode',
 'AlphaFold2',
 'Amber',
 'Animagine XL 3.1',
 'Anthropic RLHF models',
 'AudioGen',
 'AudioLM',
 'Aurora-M',
 'Aya',
 'BEiT-3',
 'BGE M3 Embedding',
 'BLIP',
 'BLIP-2',
 'BLOOM',
 'BLOOMZ',
 'BLUUMI',
 'Baichuan 2',
 'Bark',
 'BigTrans',
 'BioGPT',
 'BioMedLM',
 'BioMistral',
 'BiomedGPT',
 'Bittensor Language Model',
 'BloombergGPT',
 'CLIP',
 'CORGI',
 'COSMO',
 'CPM Bee',
 'Camel',
 'CausalLM',
 'Cerebras-GPT',
 'ChatGLM',
 'Chinchilla',
 'Chronos',
 'Claude',
 'Claude 2',
 'Claude 2.1',
 'Claude 3',
 'Claude Instant',
 'Code LLaMA',
 'Code Tulu 2',
 'CodeGeeX',
 'CodeGen',
 'CodeParrot',
 'Codex',
 'CogVLM',
 'CogVideo',
 'CogView',
 'CogView 2',
 'Cohere Base',
 'Cohere Command',
 'Cohere Embed (English)',
 'Cohere Embed (Multilingual)',
 'Cohere Embedv3 (English)',
 'Command-R',
 'CommonCanvas',
 'Composer',
 'Conformer-1',
 'CosmicMan',
 'CrystalCoder',
 'DALL·E',
 'DALL·E 2',
 'DALL·E 3',
 'DBRX',
 'DeciLM',
 'DeepFloyd IF',
 'Deepsee