GitHub - Joseph-Selasie/Python-week8

# CORD-19 Data Analysis & Streamlit App

## Overview
This project analyzes the CORD-19 metadata dataset, focusing on COVID-19 research papers. It demonstrates basic data loading, cleaning, and visualization, and builds a simple Streamlit web app for interactive exploration.

## Steps & Code Comments

1. **Data Loading & Exploration**
   - Loaded `metadata.csv` using pandas.
   - Displayed first few rows, checked shape, data types, and missing values.
   - Example code comment:
     ```python
     import pandas as pd

     # Load the dataset and show basic info
     df = pd.read_csv('metadata.csv')
     print(df.head())  # Display first 5 rows
     print(df.shape)   # (number of rows, number of columns)
     ```
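
The shape, data type, and missing-value checks mentioned above could look roughly like the sketch below. It uses a small made-up DataFrame in place of `metadata.csv`, and `summarize` is a hypothetical helper introduced for illustration, not a function from the app.

```python
import pandas as pd

# Tiny stand-in for metadata.csv (assumption: the real file has many more rows and columns).
sample = pd.DataFrame({
    "title": ["Paper A", "Paper B", None],
    "abstract": ["Some text", None, None],
    "publish_time": ["2020-03-15", "2021-01-02", None],
    "journal": ["medRxiv", None, "bioRxiv"],
})


def summarize(df):
    """Return the shape, the column dtypes, and missing-value counts (largest first)."""
    missing = df.isnull().sum().sort_values(ascending=False)
    return df.shape, df.dtypes, missing


shape, dtypes, missing = summarize(sample)
print("Shape:", shape)
print(dtypes)
print(missing)
```

Sorting the missing-value counts puts the sparsest columns at the top, which helps decide which columns need cleaning first.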

2. **Data Cleaning**
   - Dropped rows with missing publication dates.
   - Converted `publish_time` to datetime and extracted year.
   - Added abstract word count.
   - Example code comment:
     ```python
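     # Convert publish_time to datetime; unparseable values become NaT
     df['publish_time'] = pd.to_datetime(df['publish_time'], errors='coerce')
     # Remove rows with missing or unparseable publish_time
     # (.copy() avoids SettingWithCopyWarning on later column assignments)
     df_clean = df.dropna(subset=['publish_time']).copy()
     # Extract the publication year
     df_clean['year'] = df_clean['publish_time'].dt.year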
     ```
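
The abstract word count is not shown above. One minimal way to compute it, assuming missing abstracts should count as zero words, is sketched here. The column name `abstract_word_count` and the sample rows are illustrative assumptions.

```python
import pandas as pd

# Made-up rows standing in for the cleaned dataset.
df_clean = pd.DataFrame({
    "abstract": ["Coronavirus spreads quickly", None, "  Two   words "],
})

# Treat missing abstracts as empty strings so they count as 0 words,
# then split on whitespace and count the resulting tokens.
df_clean["abstract_word_count"] = (
    df_clean["abstract"].fillna("").str.split().str.len()
)

print(df_clean["abstract_word_count"].tolist())  # [3, 0, 2]
```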

3. **Analysis & Visualization**
   - Counted papers by year.
   - Identified top journals.
   - Created a word cloud of paper titles.
   - Example code comment:
     ```python
     import matplotlib.pyplot as plt

     # Plot number of publications per year (uses the cleaned data with the 'year' column)
     year_counts = df_clean['year'].value_counts().sort_index()
     plt.bar(year_counts.index, year_counts.values)
     plt.show()
     ```
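
The top-journal count and the word frequencies behind a title word cloud can be sketched as follows. The sample rows, the `STOPWORDS` set, and the `title_word_counts` helper are assumptions made for illustration; the app itself may tokenize titles differently.

```python
import re
from collections import Counter

import pandas as pd

# Made-up rows standing in for the cleaned dataset.
df_clean = pd.DataFrame({
    "title": [
        "COVID-19 and SARS",
        "Coronavirus pandemic response",
        "SARS coronavirus genome",
        None,
    ],
    "journal": ["medRxiv", "bioRxiv", "medRxiv", None],
})

# Top journals: value_counts() ignores missing values and sorts by frequency.
top_journals = df_clean["journal"].value_counts().head(10)

# Hypothetical minimal stopword list; a real analysis would use a fuller one.
STOPWORDS = {"and", "the", "of", "in", "a", "for"}


def title_word_counts(titles):
    """Count lowercase words across all non-missing titles, skipping stopwords."""
    counts = Counter()
    for title in titles.dropna():
        words = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", title.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts


word_counts = title_word_counts(df_clean["title"])
print(top_journals)
print(word_counts.most_common(5))
```

A frequency mapping like `word_counts` can be passed to `WordCloud.generate_from_frequencies` from the `wordcloud` package to render the image.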

4. **Streamlit App**
   - Built an interactive app to select year range and view visualizations.
   - Example code comment:
     ```python
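     import streamlit as st
     # Streamlit slider for year range selection (bounds assumed to come from df_clean['year'])
     min_year, max_year = int(df_clean['year'].min()), int(df_clean['year'].max())
     year_range = st.slider("Select year range", min_year, max_year, (min_year, max_year))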
     ```
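
The slider returns a `(start, end)` tuple, which then has to be applied to the data before plotting. A possible filtering step is sketched below. `filter_by_year_range` is a hypothetical helper and the sample rows are made up. It is kept free of Streamlit calls so it can be tried in a plain Python session.

```python
import pandas as pd


def filter_by_year_range(df, year_range):
    """Keep rows whose 'year' lies within the inclusive (start, end) range."""
    start, end = year_range
    return df[df["year"].between(start, end)]


# Made-up rows standing in for the cleaned dataset.
papers = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "year": [2019, 2020, 2021, 2022],
})

# In the app, this tuple would be the value returned by st.slider.
selected = filter_by_year_range(papers, (2020, 2021))
print(selected["year"].value_counts().sort_index())
```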

## Findings

- Most COVID-19 research papers were published in 2020 and 2021.
- A few sources (e.g., the preprint servers *medRxiv* and *bioRxiv*) accounted for the most papers.
- Common words in titles include "COVID", "SARS", "coronavirus", and "pandemic".

## Challenges

- The dataset had many missing values, especially in abstracts and journals.
- Some date formats were inconsistent and required careful conversion.
- Building the word cloud from a large volume of title text required the extra `wordcloud` library and considerable memory.
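
When only the publication year is needed, one way to sidestep inconsistent date formats is to pull the leading four-digit year out of the raw string instead of parsing the full date. This is an alternative to the `pd.to_datetime` approach used above, not the method from the app, and the sample values and the `extract_year` helper are assumptions for illustration.

```python
import pandas as pd

# Made-up examples of inconsistent publish_time values.
raw = pd.Series(["2020-03-15", "2020", "2019-12", None, "unknown"])


def extract_year(publish_time):
    """Return the leading 4-digit year as a nullable integer, or <NA> if absent."""
    year_text = publish_time.str.extract(r"^(\d{4})", expand=False)
    return pd.to_numeric(year_text, errors="coerce").astype("Int64")


years = extract_year(raw)
print(years)
```

The nullable `Int64` dtype keeps years as integers while still allowing missing entries, so rows without a usable year can be dropped or counted explicitly.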

## How to Run

1. Install dependencies:
   ```
   pip install pandas matplotlib seaborn streamlit wordcloud
   ```
2. Run analysis scripts for data cleaning and visualization.
3. Start the Streamlit app:
   ```
   streamlit run App.py
   ```

## Reflection

This project helped me practice real-world data cleaning, basic analysis, and building a simple web app. Handling missing data and date formats was challenging but improved my pandas skills. Streamlit made sharing results easy and interactive.