# Exploratory Data Analysis

### Introduction

The data comes from a CSV file available [here](https://crfm.stanford.edu/ecosystem-graphs/index.html?mode=table). Let's try to understand it. As the Data Engineer, our goal is to clean and structure the data so that it is accessible to our Data Analyst and Software Engineer teams.

### Diving In

We start by importing pandas, our tool for working with tabular data, and loading the data from the CSV source.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('assets.csv')

Limit the data to assets of type `model`, then strip away columns that carry no information (those with at most one distinct value):

In [3]:
# Keep only model assets; .copy() gives an independent frame for later edits.
df = df[df['type'] == 'model'].copy()

In [4]:
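# Drop columns with at most one distinct value (NaN counts as a value).
constant_columns = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
df = df.drop(columns=constant_columns)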
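To see what the two cells above do, here is a minimal sketch on a toy frame. The frame, its values, and the `empty_col` column are made up for illustration and are not taken from `assets.csv`.

```python
import pandas as pd

# Toy frame standing in for assets.csv; all values are invented.
toy = pd.DataFrame({
    "type": ["model", "model", "dataset"],
    "name": ["a", "b", "c"],
    "empty_col": [None, None, None],
})

# Keep only model rows; .copy() avoids working on a view of the original.
models = toy[toy["type"] == "model"].copy()

# nunique(dropna=False) counts NaN as a value, matching len(Series.unique()).
constant_columns = [
    c for c in models.columns if models[c].nunique(dropna=False) <= 1
]
models = models.drop(columns=constant_columns)

print(constant_columns)      # columns that carried no information
print(list(models.columns))  # columns that remain
```

Note that after filtering on `type`, the `type` column itself has a single value and is dropped along with any entirely empty columns.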

What is the size of the data?

In [5]:
row_count, column_count = df.shape
print(f'The data has {row_count} rows and {column_count} columns.')

The data has 359 rows and 20 columns.


That is quite a lot of columns. Let's list them, along with their non-null counts and data types.

In [6]:
webpage_columns = [
    'type'
    , 'name'
    , 'organization'
    , 'created_date'
    , 'size'
    , 'modality'
    , 'access'
    , 'license'
    , 'dependencies'
]
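The list `webpage_columns` is not used in the cells shown here. One possible use, sketched below under the assumption that it names the columns we want to focus on, is to narrow the frame to those columns. Because `type` was dropped as a constant column in cell 4, the sketch first checks which names are still present. The toy frame and its values are hypothetical.

```python
import pandas as pd

webpage_columns = [
    'type', 'name', 'organization', 'created_date', 'size',
    'modality', 'access', 'license', 'dependencies',
]

# Hypothetical frame holding only some of the listed columns.
toy = pd.DataFrame({
    'name': ['a'],
    'organization': ['x'],
    'url': ['http://example.com'],
})

# Split the wanted columns into those available and those already dropped.
present = [c for c in webpage_columns if c in toy.columns]
missing = [c for c in webpage_columns if c not in toy.columns]

subset = toy[present]
print(present)
print(missing)
```

Indexing with `toy[webpage_columns]` directly would raise a `KeyError` for the absent names, which is why the list is filtered first.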

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 359 entries, 3 to 564
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   name                359 non-null    object
 1   organization        359 non-null    object
 2   description         299 non-null    object
 3   created_date        357 non-null    object
 4   url                 357 non-null    object
 5   modality            357 non-null    object
 6   size                355 non-null    object
 7   analysis            234 non-null    object
 8   dependencies        359 non-null    object
 9   quality_control     85 non-null     object
 10  access              359 non-null    object
 11  license             343 non-null    object
 12  intended_uses       129 non-null    object
 13  prohibited_uses     86 non-null     object
 14  monitoring          118 non-null    object
 15  feedback            161 non-null    object
 16  model_card          170 non-nul
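The non-null counts above are easier to compare as percentages. A minimal sketch, on a toy frame with invented values, of building a per-column missingness summary sorted from most to least sparse:

```python
import pandas as pd

# Toy frame with invented values; column names borrowed from the listing above.
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "license": ["MIT", None, "Apache-2.0", None],
    "model_card": [None, None, None, "url"],
})

# One row per column: non-null count, share of missing values, and dtype.
summary = pd.DataFrame({
    "non_null": toy.notna().sum(),
    "missing_pct": (toy.isna().mean() * 100).round(1),
    "dtype": toy.dtypes.astype(str),
}).sort_values("missing_pct", ascending=False)

print(summary)
```

Applied to `df`, the same summary would put sparse columns such as `quality_control` and `prohibited_uses` near the top.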

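We can see that `created_date` is stored as `object` (strings) but should be a datetime. Passing `errors='coerce'` turns any value that cannot be parsed into `NaT` instead of raising an error.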

In [8]:
df['created_date'] = pd.to_datetime(df['created_date'], errors='coerce')
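Because `errors='coerce'` hides parse failures, it can be worth checking which values were lost. A small sketch with made-up date strings (not values from the dataset):

```python
import pandas as pd

# Invented raw values: two valid dates, one unparseable string, one missing.
raw = pd.Series(["2023-03-14", "2022-11-30", "unknown", None])
parsed = pd.to_datetime(raw, errors="coerce")

# Values that were present in the raw data but became NaT after parsing.
newly_missing = raw.notna() & parsed.isna()
print(raw[newly_missing].tolist())
```

Running the same comparison on the raw `created_date` column before overwriting it would show whether coercion discarded anything beyond the values that were already null.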

Remaining dtype cleaning to do:

```
- created_date -> date
- modality: list -> str
- size -> int (number of parameters for type model, not type dataset)
- monthly_active_users: str -> int
- sample: list of websites -> ??
```
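As a starting point for the `size` item, here is a rough sketch of a parser. It assumes that `size` holds strings such as `"175B parameters (dense)"`, with a number followed by a K/M/B/T suffix; that format is an assumption and should be checked against the actual values first. The helper name `parse_parameter_count` is hypothetical.

```python
import re

# Multipliers for the suffixes this sketch assumes appear in `size`.
_MULTIPLIERS = {"K": 10**3, "M": 10**6, "B": 10**9, "T": 10**12}
_SIZE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*([KMBT])\b", re.IGNORECASE)


def parse_parameter_count(size):
    """Return the parameter count as an int, or None if it cannot be parsed."""
    if not isinstance(size, str):
        return None  # NaN or other non-string input
    match = _SIZE_PATTERN.search(size)
    if match is None:
        return None  # no "<number><suffix>" found
    number, suffix = match.groups()
    return round(float(number) * _MULTIPLIERS[suffix.upper()])


print(parse_parameter_count("175B parameters (dense)"))
print(parse_parameter_count("unknown"))
```

The function could then be applied with `df['size'].map(parse_parameter_count)`. Returning `None` for anything unrecognized keeps unparsed entries visible as missing values rather than guessing.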