# CORD-19 Data Analysis
## Part 1: Data Loading and Exploration
Load metadata.csv (~257 MB) and explore the dataset's structure, dimensions, data types, missing values, and basic statistics. The snapshot used here contains 192,509 rows and 19 columns.

In [4]:
import pandas as pd

# Load metadata.csv; low_memory=False infers each column's dtype from the whole file, avoiding mixed-type warnings
df = pd.read_csv('metadata.csv', low_memory=False)

# First few rows
df.head()

Unnamed: 0,cord_uid,sha,source_x,title,doi,pmcid,pubmed_id,license,abstract,publish_time,authors,journal,mag_id,who_covidence_id,arxiv_id,pdf_json_files,pmc_json_files,url,s2_id
0,ug7v899j,d1aafb70c066a2068b02786f8929fd9c900897fb,PMC,Clinical features of culture-proven Mycoplasma...,10.1186/1471-2334-1-6,PMC35282,11472636,no-cc,OBJECTIVE: This retrospective chart review des...,2001-07-04,"Madani, Tariq A; Al-Ghamdi, Aisha A",BMC Infect Dis,,,,document_parses/pdf_json/d1aafb70c066a2068b027...,document_parses/pmc_json/PMC35282.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3...,
1,02tnwd4m,6b0567729c2143a66d737eb0a2f63f2dce2e5a7d,PMC,Nitric oxide: a pro-inflammatory mediator in l...,10.1186/rr14,PMC59543,11667967,no-cc,Inflammatory diseases of the respiratory tract...,2000-08-15,"Vliet, Albert van der; Eiserich, Jason P; Cros...",Respir Res,,,,document_parses/pdf_json/6b0567729c2143a66d737...,document_parses/pmc_json/PMC59543.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,
2,ejv2xln0,06ced00a5fc04215949aa72528f2eeaae1d58927,PMC,Surfactant protein-D and pulmonary host defense,10.1186/rr19,PMC59549,11667972,no-cc,Surfactant protein-D (SP-D) participates in th...,2000-08-25,"Crouch, Erika C",Respir Res,,,,document_parses/pdf_json/06ced00a5fc04215949aa...,document_parses/pmc_json/PMC59549.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,
3,2b73a28n,348055649b6b8cf2b9a376498df9bf41f7123605,PMC,Role of endothelin-1 in lung disease,10.1186/rr44,PMC59574,11686871,no-cc,Endothelin-1 (ET-1) is a 21 amino acid peptide...,2001-02-22,"Fagan, Karen A; McMurtry, Ivan F; Rodman, David M",Respir Res,,,,document_parses/pdf_json/348055649b6b8cf2b9a37...,document_parses/pmc_json/PMC59574.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,
4,9785vg6d,5f48792a5fa08bed9f56016f4981ae2ca6031b32,PMC,Gene expression in epithelial cells in respons...,10.1186/rr61,PMC59580,11686888,no-cc,Respiratory syncytial virus (RSV) and pneumoni...,2001-05-11,"Domachowske, Joseph B; Bonville, Cynthia A; Ro...",Respir Res,,,,document_parses/pdf_json/5f48792a5fa08bed9f560...,document_parses/pmc_json/PMC59580.xml.json,https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5...,
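
Reading every column of a file this size costs time and memory. As a minimal sketch, assuming only a handful of columns are needed for the exploration, `usecols` restricts parsing to those columns. The `load_metadata` helper, the `KEY_COLUMNS` list, and the in-memory sample rows are hypothetical and exist only for illustration.

```python
import io
import pandas as pd

# Columns assumed to be sufficient for the exploration; adjust as needed.
KEY_COLUMNS = ['cord_uid', 'title', 'abstract', 'publish_time', 'authors', 'journal']

def load_metadata(source, columns=KEY_COLUMNS):
    """Read only the requested columns from a CSV path or file-like object."""
    return pd.read_csv(source, usecols=columns, low_memory=False)

# Tiny in-memory stand-in for metadata.csv (made-up rows, not real data)
sample_csv = io.StringIO(
    "cord_uid,sha,title,abstract,publish_time,authors,journal\n"
    "aaa111,x1,Title A,Abstract A,2001-07-04,Author A,Journal A\n"
    "bbb222,x2,Title B,,2000-08-15,Author B,Journal B\n"
)
sample = load_metadata(sample_csv)
print(sample.shape)  # (2, 6): the sha column is skipped
```

On the real file the call would be `load_metadata('metadata.csv')`. Whether the saving matters depends on which columns later steps rely on.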


In [5]:
# DataFrame dimensions (rows, columns)
print('Shape:', df.shape)

Shape: (192509, 19)


In [6]:
# Data types of each column
df.info()  # info() prints its report directly and returns None, so no print() is needed

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 192509 entries, 0 to 192508
Data columns (total 19 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   cord_uid          192509 non-null  object 
 1   sha               79755 non-null   object 
 2   source_x          192509 non-null  object 
 3   title             192459 non-null  object 
 4   doi               137235 non-null  object 
 5   pmcid             86510 non-null   object 
 6   pubmed_id         117304 non-null  object 
 7   license           192509 non-null  object 
 8   abstract          137643 non-null  object 
 9   publish_time      192491 non-null  object 
 10  authors           186032 non-null  object 
 11  journal           181791 non-null  object 
 12  mag_id            0 non-null       float64
 13  who_covidence_id  50325 non-null   object 
 14  arxiv_id          2464 non-null    object 
 15  pdf_json_files    79755 non-null   object 
 16  pmc_json_files    62

In [7]:
# Missing values in key columns
key_columns = ['title', 'abstract', 'publish_time', 'authors', 'journal']
print('Missing values:', df[key_columns].isnull().sum(), sep='\n')

Missing values:
title              50
abstract        54866
publish_time       18
authors          6477
journal         10718
dtype: int64
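
Raw counts are easier to judge next to the row total: 54,866 missing abstracts out of 192,509 rows is about 28.5%. The sketch below shows one way to report counts and percentages together. The `missing_summary` helper and the toy rows are hypothetical and are not taken from metadata.csv.

```python
import pandas as pd

def missing_summary(frame, columns):
    """Return missing-value counts and percentages for the given columns."""
    counts = frame[columns].isnull().sum()
    percent = (counts / len(frame) * 100).round(2)
    return pd.DataFrame({'missing': counts, 'percent': percent})

# Made-up rows for illustration only
toy = pd.DataFrame({
    'title': ['A', 'B', None, 'D'],
    'abstract': [None, 'text', None, 'text'],
})
summary = missing_summary(toy, ['title', 'abstract'])
print(summary)
```

Applied to the notebook's data, the call would be `missing_summary(df, key_columns)`.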


In [8]:
# Basic statistics for numerical columns
print(df.describe())

       mag_id         s2_id
count     0.0  1.646440e+05
mean      NaN  1.458086e+08
std       NaN  9.326480e+07
min       NaN  9.600000e+01
25%       NaN  2.983436e+07
50%       NaN  2.156159e+08
75%       NaN  2.188923e+08
max       NaN  2.205264e+08
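
The only numeric columns are `mag_id`, which is entirely empty, and `s2_id`, which appears to be an identifier, so the mean and quartiles above say little about the papers themselves. A possible complement, sketched here on made-up rows, is to summarize the non-numeric columns instead: `describe` then reports the count, the number of unique values, the most frequent value, and its frequency. The toy DataFrame is an assumption for illustration.

```python
import pandas as pd

# Made-up rows for illustration; not taken from metadata.csv
toy = pd.DataFrame({
    'journal': ['Respir Res', 'Respir Res', 'BMC Infect Dis'],
    'license': ['no-cc', 'no-cc', 'no-cc'],
    's2_id': [1.0, 2.0, 3.0],
})

# Exclude numeric columns so describe() reports count, unique, top, freq
text_summary = toy.describe(exclude=['number'])
print(text_summary)
```

On the full dataset, `df.describe(exclude=['number'])` would cover columns such as `journal`, `license`, and `source_x`.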
