# get-dataset-characteristics-updated
2.10.23

A more sophisticated and reproducible version of the original `get-dataset-characteristics.ipynb` notebook. 
One open question is how much the matrices in the `data/peptides_data/` subdir differ from those in `data/maxquant-peptides`. For example, are the missingness fractions significantly different?
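One way to approach this comparison, once the per-dataset statistics exist for both directories, is to join the two tables on `PXD` and look at the difference in `mv frac`. The sketch below is only an illustration: the dataset names and fractions are made-up placeholders, and `processed_stats` and `maxquant_stats` are hypothetical names that do not appear elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical per-dataset missingness fractions for each directory.
# The dataset names and values are placeholders, not real results.
processed_stats = pd.DataFrame(
    {"PXD": ["dataset_A", "dataset_B"], "mv frac": [0.70, 0.45]}
)
maxquant_stats = pd.DataFrame(
    {"PXD": ["dataset_A", "dataset_B"], "mv frac": [0.65, 0.50]}
)

# Align the two tables on the dataset identifier.
comparison = processed_stats.merge(
    maxquant_stats, on="PXD", suffixes=(" processed", " maxquant")
)

# Positive values mean the processed matrix has more missing entries.
comparison["mv frac diff"] = (
    comparison["mv frac processed"] - comparison["mv frac maxquant"]
)
print(comparison)
```

A side-by-side difference like this only describes the gap. Deciding whether it is statistically significant would need a separate test.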

In [1]:
import pandas as pd
import numpy as np

#### Configs

In [2]:
processed_data_path = "../../../../data/peptides-data/"
# NOTE: "PXD014525" is listed twice, so it is also processed twice in the loop below
pxds = [
    "PXD013792", "PXD014156", "PXD006348", "PXD011961",
    "PXD014525", "PXD016079", "PXD006109", "PXD014525",
    "PXD034525", "PXD014815", "Satpathy2020",
    "Petralia2020", "PXD007683",
]
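The identifier list above contains `PXD014525` twice, which produces a duplicated row in the results. A quick check along the following lines could catch repeated identifiers before the loop runs. This is a minimal sketch: `duplicates` and `unique_pxds` are names introduced here for illustration, and whether the repeated entry should be dropped or replaced by a different dataset is not something the notebook states.

```python
from collections import Counter

pxds = [
    "PXD013792", "PXD014156", "PXD006348", "PXD011961",
    "PXD014525", "PXD016079", "PXD006109", "PXD014525",
    "PXD034525", "PXD014815", "Satpathy2020",
    "Petralia2020", "PXD007683",
]

# Identifiers that occur more than once in the list.
duplicates = [pxd for pxd, count in Counter(pxds).items() if count > 1]
print("duplicated identifiers:", duplicates)

# Order-preserving de-duplication (dict keys keep insertion order).
unique_pxds = list(dict.fromkeys(pxds))
print(len(pxds), "entries,", len(unique_pxds), "unique")
```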

#### Init the dataset characteristics dataframe

In [3]:
cols = ["PXD", "n samples", "n peptides", "n present", "n missing", "mv frac"]
dataset_stats = pd.DataFrame(columns=cols)

dataset_stats["PXD"] = pxds

#### Loop through every dataset, store results

In [4]:
for i, pxd in enumerate(pxds):
    df = pd.read_csv(processed_data_path + pxd + "_peptides.csv")

    # Treat exact zeros as missing values
    df[df == 0.0] = np.nan
    n_nans = np.count_nonzero(np.isnan(df))
    n_present = df.size - n_nans
    mv_frac = np.around(n_nans / df.size, 3)

    # Rows are peptides, columns are samples
    dataset_stats.loc[i, "n missing"] = n_nans
    dataset_stats.loc[i, "n present"] = n_present
    dataset_stats.loc[i, "n samples"] = df.shape[1]
    dataset_stats.loc[i, "n peptides"] = df.shape[0]
    dataset_stats.loc[i, "mv frac"] = mv_frac
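The per-dataset computation in the loop can also be pulled into a small function, which makes it easy to check on a toy matrix without touching the files on disk. The sketch below assumes the same conventions as the loop (zeros count as missing, rows are peptides, columns are samples). `summarize_matrix` and the toy values are hypothetical and only for illustration.

```python
import numpy as np
import pandas as pd


def summarize_matrix(df: pd.DataFrame) -> dict:
    """Return the counts computed in the loop for a single peptide matrix."""
    values = df.replace(0.0, np.nan)  # zeros count as missing
    n_missing = int(values.isna().to_numpy().sum())
    return {
        "n samples": values.shape[1],
        "n peptides": values.shape[0],
        "n present": values.size - n_missing,
        "n missing": n_missing,
        "mv frac": round(n_missing / values.size, 3),
    }


# Toy matrix: 4 peptides x 2 samples, with two zeros and one NaN.
toy = pd.DataFrame(
    {"s1": [1.0, 0.0, 3.0, 4.0], "s2": [np.nan, 5.0, 6.0, 0.0]}
)
stats = summarize_matrix(toy)
print(stats)
```

Each returned dict uses the same keys as the `dataset_stats` columns, so one could be written into a row of that table.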

In [5]:
dataset_stats

| | PXD | n samples | n peptides | n present | n missing | mv frac |
|---|---|---|---|---|---|---|
| 0 | PXD013792 | 12 | 2224 | 7373 | 19315 | 0.724 |
| 1 | PXD014156 | 20 | 697 | 6263 | 7677 | 0.551 |
| 2 | PXD006348 | 24 | 10362 | 70307 | 178381 | 0.717 |
| 3 | PXD011961 | 23 | 23415 | 290232 | 248313 | 0.461 |
| 4 | PXD014525 | 36 | 17208 | 47224 | 572264 | 0.924 |
| 5 | PXD016079 | 31 | 32999 | 560332 | 462637 | 0.452 |
| 6 | PXD006109 | 20 | 38124 | 637008 | 125472 | 0.165 |
| 7 | PXD014525 | 36 | 17208 | 47224 | 572264 | 0.924 |
| 8 | PXD034525 | 10 | 40346 | 352593 | 50867 | 0.126 |
| 9 | PXD014815 | 42 | 24204 | 726182 | 290386 | 0.286 |
