# Exploratory Data Analysis

### Introduction

We have data from a CSV file found [here](https://crfm.stanford.edu/ecosystem-graphs/index.html?mode=table). Let's try to understand it. The motivating goal of the Data Engineer is to clean and structure the data so that it is accessible to our Data Analyst and Software Engineer teams.

### Diving In

We start by importing pandas, our tool for working with tabular data, and loading the data from the CSV source.

In [1]:
import pandas as pd # type: ignore

In [2]:
df = pd.read_csv('../data/assets.csv')

Limit the data to rows that describe models, then strip away columns that carry no information (those with at most one distinct value):

In [3]:
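# Copy the filtered rows so later in-place edits do not act on a view of the original frame.
df = df[df['type'] == 'model'].copy()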

In [4]:
# Drop columns with at most one distinct value (NaN counts as a value here).
constant_columns = [c for c in df.columns if len(df[c].unique()) <= 1]
df = df.drop(columns=constant_columns)
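
A minimal sketch on a made-up toy frame shows the effect of this step, including why `type` itself disappears once every remaining row is a model. The column names and values below are illustrative only and are not taken from `assets.csv`; `nunique(dropna=False)` is used as an equivalent of `len(unique())`.

```python
import pandas as pd

# Toy frame for illustration only; not taken from assets.csv.
toy = pd.DataFrame({
    'type': ['model', 'model', 'model'],  # constant after filtering
    'name': ['A', 'B', 'C'],              # varies between rows
    'empty': [None, None, None],          # entirely missing
})

# nunique(dropna=False) counts NaN/None as a value, like len(unique()).
varying = toy.nunique(dropna=False) > 1
trimmed = toy.loc[:, varying]

print(list(trimmed.columns))
```

Only `name` survives: a column that is constant or entirely missing has a single distinct value and is dropped.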

What is the size of the data?

In [5]:
row_count, column_count = df.shape
print(f'The data has {row_count} rows and {column_count} columns.')

The data has 359 rows and 20 columns.


That is quite a lot of columns. Let's list them, along with their non-null counts and data types.

In [6]:
webpage_columns = [
    'type'
    , 'name'
    , 'organization'
    , 'created_date'
    , 'size'
    , 'modality'
    , 'access'
    , 'license'
    , 'dependencies'
]

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 359 entries, 3 to 564
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   name                359 non-null    object
 1   organization        359 non-null    object
 2   description         299 non-null    object
 3   created_date        357 non-null    object
 4   url                 357 non-null    object
 5   modality            357 non-null    object
 6   size                355 non-null    object
 7   analysis            234 non-null    object
 8   dependencies        359 non-null    object
 9   quality_control     85 non-null     object
 10  access              359 non-null    object
 11  license             343 non-null    object
 12  intended_uses       129 non-null    object
 13  prohibited_uses     86 non-null     object
 14  monitoring          118 non-null    object
 15  feedback            161 non-null    object
 16  model_card          170 non-null    object

Can we find a primary key?

In [8]:
df['name'].is_unique

False
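
To see which names repeat, one option is `Series.duplicated`. The sketch below runs on made-up names rather than the real `name` column, so the values are purely illustrative.

```python
import pandas as pd

# Made-up names for illustration; the real column lives in df['name'].
names = pd.Series(['Alpha', 'Beta', 'Alpha', 'Gamma'], name='name')

# keep=False marks every occurrence of a repeated value, not just the later ones.
repeated = names[names.duplicated(keep=False)]

print(names.is_unique)
print(sorted(repeated.unique()))
```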

Never mind: `name` contains duplicates, so it cannot serve as a primary key. Let's create our own surrogate key.

In [9]:
df['id'] = range(1, 1+len(df))

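The `created_date` column holds dates stored as strings (dtype `object`), so we convert it to a datetime. With `errors='coerce'`, values that cannot be parsed become `NaT`: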

In [10]:
df['created_date'] = pd.to_datetime(df['created_date'], errors='coerce')
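
Because `errors='coerce'` silently turns unparseable values into `NaT`, it can be worth counting the missing values after conversion. A small sketch, assuming a made-up series in which one entry is the placeholder string `'unknown'` (a hypothetical value, not confirmed for this column):

```python
import pandas as pd

# Illustrative values only: one valid date, one placeholder, one missing entry.
raw_dates = pd.Series(['2024-02-08', 'unknown', None])

parsed = pd.to_datetime(raw_dates, errors='coerce')

# Both the placeholder and the missing entry end up as NaT.
print(parsed.isna().sum())
```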

The `modality` and `dependencies` columns are not atomic: each cell packs several values into a single string.

In [11]:
df[['modality']].dropna().head()

index,modality
3,text; text
4,"text, video; text, video"
6,"text; code, text"
7,text; image
11,text; audio


In [12]:
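def repackage_modality(raw: str) -> tuple[list[str], list[str]]:
    """Split 'inputs; outputs' into a list of input and a list of output modalities."""
    raw = str(raw)  # note: a missing value (NaN) becomes the string 'nan'
    semicolon_count = raw.count(';')
    assert semicolon_count <= 1, 'LLM modality invalid.'
    if semicolon_count == 0:
        # No separator: use the same modalities for both input and output.
        raw = raw + ';' + raw
    modal_input_str, modal_output_str = raw.split(';')
    modal_inputs = [s.strip() for s in modal_input_str.split(',')]
    modal_outputs = [s.strip() for s in modal_output_str.split(',')]
    return modal_inputs, modal_outputs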

In [13]:
df[['input_modality', 'output_modality']] = df['modality'].apply(repackage_modality).apply(pd.Series)
df.drop(columns='modality', inplace=True)
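
One caveat: `str(raw)` converts a missing value into the string `'nan'`, so the two rows without a `modality` end up with a literal `'nan'` modality. A hedged variant that passes missing values through as `None` is sketched below. The name `split_modality` and the choice of `None` as the placeholder are assumptions, not part of the original notebook.

```python
import pandas as pd

def split_modality(raw) -> tuple[list, list]:
    """Hypothetical variant of repackage_modality that keeps missing values."""
    if pd.isna(raw):
        return [None], [None]
    text = str(raw)
    if text.count(';') > 1:
        raise ValueError(f'Unexpected modality format: {text!r}')
    inputs, separator, outputs = text.partition(';')
    if not separator:
        # No semicolon: same modalities for input and output.
        outputs = inputs

    def to_list(part: str) -> list[str]:
        return [item.strip() for item in part.split(',')]

    return to_list(inputs), to_list(outputs)

print(split_modality('text; code, text'))
print(split_modality(float('nan')))
```

Raising `ValueError` instead of using `assert` keeps the check active even when Python runs with assertions disabled.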

In [14]:
df[['dependencies']].dropna().tail()

index,dependencies
559,"['Llama 2', 'Mistral', 'Falcon-180B', 'Deepsee..."
561,['Mistral']
562,"['UltraFeedback', 'Zephyr']"
563,"['Arxiv', 'Books', 'C4', 'RefinedWeb', 'StarCo..."
564,"['SlimPajama', 'StarCoder']"


In [15]:
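def repackage_dependencies(raw: str) -> list[str]:
    # Naive parse of "['A', 'B']": drop the brackets, split on commas, trim quotes.
    # Assumes no name contains a comma; '[]' yields [''] rather than [].
    return [s.strip(' ').strip('\'') for s in str(raw)[1:-1].split(',')]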

In [16]:
len(df.dependencies)

359

In [17]:
df['dependencies'] = df['dependencies'].map(repackage_dependencies)
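
Splitting on commas works only while no dependency name contains a comma and every name is single-quoted. If the column really holds Python-style list literals, as the output above suggests, `ast.literal_eval` may be a more robust parser. The function name `parse_dependencies` and the sample strings are hypothetical. Note one behavioural difference: an empty list parses to `[]`, which `explode` turns into `NaN` instead of an empty string.

```python
import ast

def parse_dependencies(raw: str) -> list[str]:
    """Parse a string such as "['A', 'B']" into a list of names."""
    parsed = ast.literal_eval(raw)
    if not isinstance(parsed, list):
        raise ValueError(f'Expected a list literal, got: {raw!r}')
    return [str(item) for item in parsed]

print(parse_dependencies("['UltraFeedback', 'Zephyr']"))
print(parse_dependencies("['name, with comma', 'B']"))
print(parse_dependencies('[]'))
```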

In [18]:
df.shape

(359, 22)

In [19]:
df = df.explode('input_modality').explode('output_modality').explode('dependencies')

In [20]:
df.shape

(898, 22)
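
The jump from 359 to 898 rows comes from chaining `explode`: each model gets one row per combination of input modality, output modality, and dependency. The sketch below reproduces the multiplication on a single made-up row. It also shows one possible alternative, which is an assumption and not part of the original notebook: a separate link table keyed by `id`, which grows only with the number of dependencies.

```python
import pandas as pd

# One made-up model with two values in each multi-valued column.
toy = pd.DataFrame({
    'id': [1],
    'input_modality': [['text', 'video']],
    'output_modality': [['text', 'video']],
    'dependencies': [['D1', 'D2']],
})

# Chained explode yields every combination: 2 * 2 * 2 rows.
combined = (
    toy.explode('input_modality')
       .explode('output_modality')
       .explode('dependencies')
)
print(len(combined))

# A link table per multi-valued column avoids the cross product.
dependency_links = toy[['id', 'dependencies']].explode('dependencies')
print(len(dependency_links))
```

This also explains why some rows appear several times with identical values in the columns that are displayed: the copies differ only in one of the exploded columns.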

In [21]:
df.head(2)

index,name,organization,description,created_date,url,size,analysis,dependencies,quality_control,access,...,prohibited_uses,monitoring,feedback,model_card,training_emissions,training_time,training_hardware,id,input_modality,output_modality
3,Lag-LLaMA,"Morgan Stanley, ServiceNow Research, Universit...",Lag-LLaMA is a general-purpose foundation mode...,2024-02-08,https://time-series-foundation-models.github.i...,unknown,Evaluated on previously unseen time series dat...,,,open,...,,unknown,https://huggingface.co/time-series-foundation-...,https://huggingface.co/time-series-foundation-...,unknown,unknown,A single NVIDIA Tesla-P100 GPU,1,text,text
4,Prithvi,IBM,Prithvi is a first-of-its-kind temporal Vision...,2023-08-03,https://github.com/NASA-IMPACT/hls-foundation-os,100M parameters (dense),,NASA HLS data,,open,...,,,https://huggingface.co/ibm-nasa-geospatial/Pri...,https://huggingface.co/ibm-nasa-geospatial/Pri...,,,,2,text,text


In [22]:
df.iloc[5:10, :11]

index,name,organization,description,created_date,url,size,analysis,dependencies,quality_control,access,license
6,Granite,IBM,Granite is a set of multi-size foundation mode...,2023-09-28,https://www.ibm.com/blog/building-ai-for-busin...,13B parameters (dense),unknown,,"Training data passed through IBM HAP detector,...",limited,
6,Granite,IBM,Granite is a set of multi-size foundation mode...,2023-09-28,https://www.ibm.com/blog/building-ai-for-busin...,13B parameters (dense),unknown,,"Training data passed through IBM HAP detector,...",limited,
7,Animagine XL 3.1,Cagliostro Research Lab,"An open-source, anime-themed text-to-image mod...",2024-03-18,https://cagliostrolab.net/posts/animagine-xl-v...,unknown,unknown,Animagine XL 3.0,"The model undergoes pretraining, first stage f...",open,Fair AI Public License 1.0-SD
11,Bark,Suno,Bark is a text-to-audio model that can generat...,2023-04-20,https://github.com/suno-ai/bark,,,AudioLM,,open,MIT
13,GPT-JT,Together,,2022-11-29,https://www.together.xyz/blog/releasing-v1-of-...,6B parameters (dense),,GPT-J,,open,Apache 2.0


In [23]:
df[['id', 'dependencies']]

index,id,dependencies
3,1,
4,2,NASA HLS data
4,2,NASA HLS data
4,2,NASA HLS data
4,2,NASA HLS data
...,...,...
563,358,Wikipedia
564,359,SlimPajama
564,359,StarCoder
564,359,SlimPajama


In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 898 entries, 3 to 564
Data columns (total 22 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   name                898 non-null    object        
 1   organization        898 non-null    object        
 2   description         732 non-null    object        
 3   created_date        896 non-null    datetime64[ns]
 4   url                 896 non-null    object        
 5   size                892 non-null    object        
 6   analysis            584 non-null    object        
 7   dependencies        898 non-null    object        
 8   quality_control     159 non-null    object        
 9   access              898 non-null    object        
 10  license             866 non-null    object        
 11  intended_uses       330 non-null    object        
 12  prohibited_uses     208 non-null    object        
 13  monitoring          232 non-null    object        
 14 