Joseph-Selasie/Python-week8

# CORD-19 Data Analysis & Streamlit App

## Overview
This project analyzes the CORD-19 metadata dataset, which covers COVID-19 research papers. It demonstrates basic data loading, cleaning, and visualization, and builds a simple Streamlit web app for interactive exploration.

## Steps & Code Comments

1. **Data Loading & Exploration**
   - Loaded `metadata.csv` using pandas.
   - Displayed first few rows, checked shape, data types, and missing values.
   - Example code comment:
     ```python
     import pandas as pd

     # Load the dataset and show basic info
     df = pd.read_csv('metadata.csv')
     print(df.head())  # Display first 5 rows
     print(df.isnull().sum())  # Count missing values per column
     ```

2. **Data Cleaning**
   - Dropped rows with missing publication dates.
   - Converted `publish_time` to datetime and extracted year.
   - Added abstract word count.
   - Example code comment:
     ```python
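     # Remove rows with missing publish_time; .copy() avoids SettingWithCopyWarning
     df_clean = df.dropna(subset=['publish_time']).copy()
     # Convert publish_time to datetime; unparseable values become NaT
     df_clean['publish_time'] = pd.to_datetime(df_clean['publish_time'], errors='coerce')
     # Extract the year (NaN where the date could not be parsed)
     df_clean['year'] = df_clean['publish_time'].dt.year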
     ```

3. **Analysis & Visualization**
   - Counted papers by year.
   - Identified top journals.
   - Created a word cloud of paper titles.
   - Example code comment:
     ```python
     import matplotlib.pyplot as plt

     # Plot number of publications per year (uses the cleaned DataFrame)
     year_counts = df_clean['year'].value_counts().sort_index()
     plt.bar(year_counts.index, year_counts.values)
     plt.show()
     ```

4. **Streamlit App**
   - Built an interactive app to select year range and view visualizations.
   - Example code comment:
     ```python
     import streamlit as st

     # Streamlit slider for year range selection (min_year/max_year come from the data)
     year_range = st.slider("Select year range", min_year, max_year, (min_year, max_year))
     ```
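
The snippets above do not show the abstract word count, the top-journal count, or how the slider's year range might filter the data. The following is a minimal sketch of those three steps on a tiny made-up DataFrame. The sample rows, the `abstract_word_count` column name, and the `year_range` value are assumptions for illustration, not the project's actual code or data.

```python
import pandas as pd

# Tiny made-up sample standing in for the cleaned metadata (illustrative values only)
sample = pd.DataFrame({
    "title": ["Paper A", "Paper B", "Paper C", "Paper D"],
    "abstract": ["Short abstract here", None,
                 "Another slightly longer abstract text", "One"],
    "journal": ["Journal X", "Journal Y", "Journal X", None],
    "year": [2019, 2020, 2020, 2021],
})

# Abstract word count: treat missing abstracts as empty strings (0 words).
# The column name 'abstract_word_count' is an assumption.
sample["abstract_word_count"] = (
    sample["abstract"].fillna("").str.split().str.len()
)

# Top journals: value_counts() skips missing journal names by default
top_journals = sample["journal"].value_counts().head(10)

# Year range filter, using a (start, end) tuple like the one st.slider returns
year_range = (2020, 2021)  # hypothetical selection
filtered = sample[sample["year"].between(year_range[0], year_range[1])]

print(top_journals)
print(filtered[["title", "year", "abstract_word_count"]])
```

In the Streamlit app, `year_range` would come from the slider rather than being hard-coded, and the filtered frame would feed the charts.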

## Findings

- Most COVID-19 research papers were published in 2020 and 2021.
- A few sources (e.g., the preprint servers *medRxiv* and *bioRxiv*) accounted for the most papers.
- Common words in titles include "COVID", "SARS", "coronavirus", and "pandemic".
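
One way to check the common-title-words finding without rendering a word cloud is to count word frequencies directly. The sketch below uses only the standard library. The stop-word list, the helper name `title_word_counts`, and the sample titles are all hypothetical.

```python
import re
from collections import Counter

# Small hypothetical stop-word list; a real analysis would likely use a longer one
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "for", "on", "to", "with"}

def title_word_counts(titles):
    """Count lowercase words across titles, skipping missing titles and stop words."""
    counts = Counter()
    for title in titles:
        if not isinstance(title, str):
            continue  # missing titles (None/NaN) are ignored
        # Words are runs of letters/digits, optionally joined by hyphens (e.g. covid-19)
        words = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", title.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts

# Made-up titles for illustration
titles = [
    "Coronavirus transmission in households",
    "A review of SARS coronavirus vaccines",
    None,
    "The pandemic and the coronavirus response",
]
counts = title_word_counts(titles)
print(counts.most_common(3))
```

A frequency table like this can also be passed to the `wordcloud` package (for example via `WordCloud().generate_from_frequencies(counts)`) instead of joining every title into one large string.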

## Challenges

- The dataset had many missing values, especially in abstracts and journals.
- Some date formats were inconsistent and required careful conversion.
- Visualizing large text data (word cloud) needed extra libraries and memory.
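
To illustrate the date-format challenge, here is a minimal standard-library sketch that tries several formats in turn and returns the year. The format list is an assumption about what `publish_time` may contain (full dates, year-month, or year only), and `parse_publish_year` is a hypothetical helper, not the project's code.

```python
from datetime import datetime

# Assumed formats, tried from most to least specific
CANDIDATE_FORMATS = ("%Y-%m-%d", "%Y-%m", "%Y")

def parse_publish_year(value):
    """Return the year as an int, or None if the value matches no known format."""
    if not isinstance(value, str):
        return None  # missing values (None/NaN) cannot be parsed
    text = value.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).year
        except ValueError:
            continue  # try the next format
    return None

# Made-up inputs for illustration
for raw in ["2020-03-15", "2021-07", "2019", "not a date", None]:
    print(repr(raw), "->", parse_publish_year(raw))
```

Returning `None` for unparseable values mirrors what `errors='coerce'` does in `pd.to_datetime`, so bad rows can be dropped or inspected afterwards rather than raising an error.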

## How to Run

1. Install dependencies:
   ```
   pip install pandas matplotlib seaborn streamlit wordcloud
   ```
2. Run analysis scripts for data cleaning and visualization.
3. Start the Streamlit app:
   ```
   streamlit run app.py
   ```

## Reflection

This project helped me practice real-world data cleaning, basic analysis, and building a simple web app. Handling missing data and date formats was challenging but improved my pandas skills. Streamlit made sharing results easy and interactive.
