# Exploratory Data Analysis

### Introduction

Our data is a CSV file available [here](https://crfm.stanford.edu/ecosystem-graphs/index.html?mode=table). Before using it, we need to understand it. The Data Engineer's goal is to clean and structure the data so that our Data Analyst and Software Engineer teams can access it easily.

### Diving In

We start by importing pandas, our tool for working with tabular data, and loading the data from the CSV source.

In [1]:
import pandas as pd

In [2]:
source_data = pd.read_csv('assets.csv')

What is the size of the data?

In [5]:
row_count, column_count = source_data.shape
print(f'The data has {row_count} rows and {column_count} columns.')

The data has 568 rows and 31 columns.


That is quite a lot of columns. Let's list them, along with each column's non-null count and data type.

In [6]:
source_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 568 entries, 0 to 567
Data columns (total 31 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   type                  568 non-null    object
 1   name                  568 non-null    object
 2   organization          568 non-null    object
 3   description           471 non-null    object
 4   created_date          548 non-null    object
 5   url                   562 non-null    object
 6   datasheet             39 non-null     object
 7   modality              467 non-null    object
 8   size                  467 non-null    object
 9   sample                112 non-null    object
 10  analysis              283 non-null    object
 11  dependencies          568 non-null    object
 12  included              38 non-null     object
 13  excluded              38 non-null     object
 14  quality_control       149 non-null    object
 15  access                568 non-null    ob
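The non-null counts above vary widely between columns. As a minimal sketch, the share of missing values per column can be ranked directly. The `sample` frame below is a hypothetical stand-in for `source_data` (the name `Example` and the cell values are invented for illustration), so the numbers it prints say nothing about the real file.

```python
import pandas as pd

# Hypothetical stand-in for source_data; the real frame comes from assets.csv.
sample = pd.DataFrame({
    'name': ['ToyMix', 'StableLM', 'Example'],
    'description': ['a dataset', 'a model', None],
    'datasheet': [None, None, None],
})

# Fraction of missing values per column, most complete columns first.
missing_share = sample.isna().mean().sort_values()
print(missing_share)
```

On the real data, the same expression applied to `source_data` would list the sparsest columns last.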

We can see that some data types are incorrect:

- `created_date` should be a datetime
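One way to make that conversion is `pd.to_datetime`. The sketch below runs on a small hypothetical frame rather than on `source_data`: two of the dates appear in records shown in this notebook, while the missing and `'unknown'` entries are invented to show how bad values behave. Whether the real column contains malformed values is an assumption that still needs checking.

```python
import pandas as pd

# Hypothetical stand-in for source_data. The None and 'unknown' entries are
# invented to illustrate missing and malformed values.
sample = pd.DataFrame({
    'created_date': ['2023-04-20', '2023-10-09', None, 'unknown'],
})

# errors='coerce' turns unparseable values into NaT instead of raising.
sample['created_date'] = pd.to_datetime(sample['created_date'], errors='coerce')

print(sample['created_date'].dtype)
print(sample['created_date'].isna().sum())
```

Comparing the count of `NaT` values after conversion with the null count before it shows how many entries failed to parse.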

In [7]:
web_columns = [
    'type',
    'name',
    'organization',
    'created_date',
    'size',
    'modality',
    'access',
    'license',
    'dependencies',
]
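The notebook does not show how `web_columns` is used afterwards. One plausible next step, sketched here as an assumption, is to narrow the frame to those columns while reporting any that are absent. The `sample` frame and the `wanted` list are hypothetical and much smaller than the real data.

```python
import pandas as pd

# Hypothetical stand-in for source_data with only a few of its columns.
sample = pd.DataFrame({
    'type': ['model'],
    'name': ['StableLM'],
    'organization': ['Stability AI'],
    'url': ['https://github.com/Stability-AI/StableLM'],
})
wanted = ['type', 'name', 'organization', 'license']

# Keep only the requested columns that exist, and report the rest.
present = [c for c in wanted if c in sample.columns]
missing = [c for c in wanted if c not in sample.columns]
subset = sample[present]
print(subset.columns.tolist(), missing)
```

Checking for absent names first avoids the `KeyError` that plain `sample[wanted]` would raise.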

In [12]:
for c in source_data.columns:
    print(source_data[c].iloc[50])

model
StableLM
Stability AI
Large language models trained on up to 1.5 trillion tokens.
2023-04-20
https://github.com/Stability-AI/StableLM
nan
text; text
7B parameters (dense)
nan
nan
['StableLM-Alpha dataset', 'Alpaca dataset', 'gpt4all dataset', 'ShareGPT52K dataset', 'Dolly dataset', 'HH dataset']
nan
nan
nan
open
Apache 2.0
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
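The loop above prints values without their column names, so the `nan` lines are hard to attribute. A possible alternative, sketched on a hypothetical four-column frame built from a few of the values printed above, is to take the row as a labelled Series and drop its empty fields.

```python
import pandas as pd

# Hypothetical stand-in for source_data, using a few values printed above.
sample = pd.DataFrame({
    'type': ['model'],
    'name': ['StableLM'],
    'organization': ['Stability AI'],
    'datasheet': [None],
})

# .iloc[0] returns the row as a Series indexed by column name
# (the notebook uses position 50 on the full data).
record = sample.iloc[0]

# Drop empty fields so only populated columns remain.
populated = record.dropna()
print(populated)
```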


In [10]:
source_data.head(1)

|  | type | name | organization | description | created_date | url | datasheet | modality | size | sample | ... | model_card | training_emissions | training_time | training_hardware | adaptation | output_space | terms_of_service | monthly_active_users | user_distribution | failures |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | dataset | ToyMix | Mila-Quebec AI Institute | ToyMix is the smallest dataset of three extens... | 2023-10-09 | https://arxiv.org/pdf/2310.04292.pdf |  | molecules, tasks | 13B labels of quantum and biological nature. | [] | ... |  |  |  |  |  |  |  |  |  |  |


In [13]:
source_data[source_data.name == 'Sonic']

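|  | type | name | organization | description | created_date | url | datasheet | modality | size | sample | ... | model_card | training_emissions | training_time | training_hardware | adaptation | output_space | terms_of_service | monthly_active_users | user_distribution | failures |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |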
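The exact-match filter returns no rows. If the asset is listed under a slightly different name, a looser search might find it. The sketch below is only an illustration: `sample` is a hypothetical stand-in for `source_data`, and `'Sonic-v1'` is an invented name, not a value known to exist in the file.

```python
import pandas as pd

# Hypothetical stand-in for source_data; 'Sonic-v1' is an invented name.
sample = pd.DataFrame({'name': ['ToyMix', 'StableLM', 'Sonic-v1', None]})

# Case-insensitive substring match; na=False treats missing names as non-matches.
mask = sample['name'].str.contains('sonic', case=False, na=False)
matches = sample[mask]
print(matches)
```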
