# CORD-19 Data Analysis & Streamlit App

## Overview

This project analyzes the CORD-19 metadata dataset, focusing on COVID-19 research papers. It demonstrates basic data loading, cleaning, and visualization, and builds a simple Streamlit web app for interactive exploration.

## Steps & Code Comments

### 1. Data Loading & Exploration

- Loaded `metadata.csv` using pandas.
- Displayed the first few rows and checked the shape, data types, and missing values.
- Example code:

```python
# Load the dataset and show basic info
df = pd.read_csv('metadata.csv')
print(df.head())  # Display first 5 rows
```

### 2. Data Cleaning

- Dropped rows with missing publication dates.
- Converted `publish_time` to datetime and extracted the year.
- Added an abstract word count.
- Example code:

```python
# Remove rows with missing publish_time
df_clean = df.dropna(subset=['publish_time'])

# Convert publish_time to datetime
df_clean['publish_time'] = pd.to_datetime(df_clean['publish_time'], errors='coerce')
```

### 3. Analysis & Visualization

- Counted papers by year.
- Identified the top journals.
- Created a word cloud of paper titles.
- Example code:

```python
# Plot the number of publications per year
year_counts = df['year'].value_counts().sort_index()
plt.bar(year_counts.index, year_counts.values)
```

### 4. Streamlit App

- Built an interactive app to select a year range and view the visualizations.
- Example code:

```python
# Streamlit slider for year range selection
year_range = st.slider("Select year range", min_year, max_year, (min_year, max_year))
```

## Findings

- Most COVID-19 research papers were published in 2020 and 2021.
- Certain journals (e.g., *medRxiv*, *bioRxiv*) published the most papers.
- Common words in titles include "COVID", "SARS", "coronavirus", and "pandemic".

## Challenges

- The dataset had many missing values, especially in abstracts and journal names.
- Some date formats were inconsistent and required careful conversion.
- Visualizing large text data (the word cloud) required extra libraries and memory.

## How to Run

1. Install dependencies:

   ```
   pip install pandas matplotlib seaborn streamlit wordcloud
   ```

2. Run the analysis scripts for data cleaning and visualization.
3. Start the Streamlit app:

   ```
   streamlit run app.py
   ```

## Reflection

This project helped me practice real-world data cleaning, basic analysis, and building a simple web app. Handling missing data and date formats was challenging but improved my pandas skills. Streamlit made sharing results easy and interactive.
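## End-to-End Cleaning Sketch

The cleaning and per-year analysis steps described above can be run end to end as a small self-contained sketch. The tiny inline DataFrame below stands in for `metadata.csv` (its values are illustrative, not from the real dataset); only the column names `publish_time`, `title`, and `abstract` follow the CORD-19 metadata schema:

```python
import pandas as pd

# Illustrative stand-in for metadata.csv (the real file has many more columns)
df = pd.DataFrame({
    'title': ['COVID-19 spread', 'SARS-CoV-2 genome', 'Coronavirus origins', 'Pandemic response'],
    'publish_time': ['2020-03-01', '2021-07-15', '2020-11-30', 'not a date'],
    'abstract': ['a short abstract here', 'two words', 'one', None],
})

# Drop rows with missing publish_time, then coerce unparseable dates to NaT and drop those too
df_clean = df.dropna(subset=['publish_time']).copy()
df_clean['publish_time'] = pd.to_datetime(df_clean['publish_time'], errors='coerce')
df_clean = df_clean.dropna(subset=['publish_time'])

# Derived columns used in the analysis step
df_clean['year'] = df_clean['publish_time'].dt.year
df_clean['abstract_word_count'] = df_clean['abstract'].fillna('').str.split().str.len()

# Papers per year, sorted chronologically
year_counts = df_clean['year'].value_counts().sort_index()
print(year_counts)
```

Note that `errors='coerce'` turns the malformed date into `NaT` rather than raising, which is why a second `dropna` follows the conversion; the same pattern handles the inconsistent date formats mentioned under Challenges.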