# COGS 108 - Data Checkpoint

# Names

- Alan Tsui
- Edmond Choi
- Keith Ho
- Kelly Kong 
- Nari Kim

<a id='research_question'></a>
# Research Question

Is there a positive correlation between pandemics and depressive symptoms among the population in heavily affected countries, such as the United States and China?


# Dataset(s)


Dataset Name: Cleaned COVID Cases
- Link to the dataset: https://ourworldindata.org/coronavirus
- Number of observations: 67548 rows x 6 columns
- The clean_covid.csv file contains the total number of COVID-19 cases recorded from 2/24/2020 to 2/8/2021. It also includes the number of new cases, total deaths, new deaths, the date, and the location.

Dataset Name: Cleaned Healthcare Tweets Data
- Link to the dataset: https://www.kaggle.com/mindyng/healthcareworkersburnout
- Number of observations: 1879 rows x 2 columns
- The cleaned_healthcare_tweets_data.csv file contains tweets from healthcare workers along with the date each tweet was posted.

Dataset Name: Prevalence of Depression During Covid (n=475)
- Link to the dataset: https://globalizationandhealth.biomedcentral.com/articles/10.1186/s12992-020-00621-z/tables/4 
- Number of observations: 3 rows x 7 columns
- The cleaned prevalence_of_depression_during_covid.csv file contains the prevalence of normal, borderline, and abnormal depression scores among healthcare workers. It splits healthcare workers into doctors, nurses, and other health workers.

Dataset Name: depressioninhealthcareworkerschina.csv
- Link to the dataset: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7206431/
- Number of observations: 52 rows x 11 columns
- The depressioninhealthcareworkerschina.csv file contains the percentage of first-response healthcare workers dealing with COVID-19 who have depression, by region in China.

Dataset Name: Global Depression Trend
- Link to the dataset: https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(18)32279-7/fulltext#seccestitle210 
- Number of observations: 3 rows x 7 columns
- The global_depression_trend.csv file contains global depression trends from 1990 to 2017. It lets us check whether previous pandemics coincided with a spike in depression rates, as we hypothesize COVID-19 has. We will group it with the datasets we find on previous pandemics such as SARS and H1N1.

Dataset Name: H1N1 Case Summary
- Link to the dataset: https://www.kaggle.com/imdevskp/h1n1-swine-flu-2009-pandemic-dataset?select=data.csv
- Number of observations: 2491 rows x 4 columns 
- The h1n1_summary.csv file contains a day-by-day count of the cumulative number of cases and deaths from 4/24/2009 to 7/6/2009, organized by country. It will be useful for looking at the trend of depressive symptoms as it relates to previous pandemic data.




# Setup

In [2]:
import numpy as np
import pandas as pd

# Data Cleaning


In [3]:
# All of these files were cleaned in Excel, so this notebook displays the cleaned data and
# describes the steps we took to clean it rather than performing the cleaning in code.

clean_covid = pd.read_csv('data/clean_covid.csv')
clean_covid.describe()

# Data cleaning: After inspecting the dataset, we noticed NaN values in the first few days recorded.
# These NaN values stood in for 0 (e.g., there were no new cases or new deaths on February 24, 2020), so we replaced them with 0.
# Lastly, we kept only the columns we needed to answer our research question.

| | Unnamed: 0 | total_cases | new_cases | total_deaths | new_deaths |
|---|---|---|---|---|---|
| count | 67548.0 | 66953.0 | 66951.0 | 58094.0 | 58252.0 |
| mean | 33773.5 | 537445.1 | 5040.466192 | 16768.17 | 128.289655 |
| std | 19499.572329 | 3811515.0 | 32222.900524 | 96601.77 | 707.591636 |
| min | 0.0 | 1.0 | -46076.0 | 1.0 | -1918.0 |
| 25% | 16886.75 | 621.0 | 1.0 | 33.0 | 0.0 |
| 50% | 33773.5 | 6611.0 | 54.0 | 211.0 | 1.0 |
| 75% | 50660.25 | 72700.0 | 623.0 | 2002.0 | 15.0 |
| max | 67547.0 | 106478000.0 | 858062.0 | 2325512.0 | 17882.0 |
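
The Excel steps described for this dataset could also be reproduced in pandas so that the cleaning is repeatable. The following is only a minimal sketch under assumptions, not the procedure we actually ran: `raw` is a small made-up stand-in for the original Our World in Data export, `extra_column` is a hypothetical column we would not need, and the `date` and `location` column names are assumed from the dataset description.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the raw export; the values are illustrative only.
raw = pd.DataFrame({
    "location": ["A", "A", "B"],
    "date": ["2020-02-24", "2020-02-25", "2020-02-24"],
    "total_cases": [1.0, 3.0, 10.0],
    "new_cases": [np.nan, 2.0, 4.0],
    "total_deaths": [np.nan, np.nan, 1.0],
    "new_deaths": [np.nan, np.nan, 1.0],
    "extra_column": ["x", "y", "z"],  # hypothetical column we do not need
})

keep = ["location", "date", "total_cases", "new_cases", "total_deaths", "new_deaths"]
count_cols = ["total_cases", "new_cases", "total_deaths", "new_deaths"]

# Keep only the columns needed for the research question.
cleaned = raw[keep].copy()

# Treat missing counts as 0 (assumption: NaN means nothing was reported that day).
cleaned[count_cols] = cleaned[count_cols].fillna(0)

# Parse the date strings so later joins and time filters work on real dates.
cleaned["date"] = pd.to_datetime(cleaned["date"])

print(cleaned)
```

Filling every missing count with 0 is only appropriate if a missing value really means that nothing was reported, so that assumption would need to be checked against the source before relying on it.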


In [4]:
cleaned_Healthcare_Tweets_Data = pd.read_csv('data/cleaned_healthcare_tweets_data.csv')
cleaned_Healthcare_Tweets_Data.head()

# Data Cleaning: After looking at the original dataset, we realized that most of the columns were not relevant to our question,
# so we dropped them. For example, we did not think that the tweet source or word count would help answer our question.
# The column we needed was the tweet text itself, so we kept it to help answer our research question.


| | Tweet Date | Tweet Text | sentiment |
|---|---|---|---|
| 0 | 2021-01-19 | I'm a big music person. It speaks to me in all... | -0.0625 |
| 1 | 2021-01-19 | @JohnWHarris15 I adore you and your unlimited ... | 0.0 |
| 2 | 2021-01-19 | @bubblydncer I have replied to countless texts... | -0.125 |
| 3 | 2021-01-19 | @MelBeer93 Ahhh I love this! | 0.625 |
| 4 | 2021-01-19 | @TheKimClub https://t.co/UUx3417HcD | 0.0 |


In [5]:
prevalence_Of_Depression_During_Covid = pd.read_csv('data/prevalence_of_depression_during_covid.csv')
prevalence_Of_Depression_During_Covid.head()

# Data Cleaning: The dataset contained rows that did not apply to our research question, namely the rows for Anxiety and Insomnia.
# We specifically want to focus on the impact COVID has had on depression rates among healthcare workers, so we
# dropped the rows for those 2 outcomes.


| | Mental health outcomes | Categories | Total N (%) | Doctor (n = 161) | Nurse (n = 167) | Other health workers\n(n = 147) | P-value * |
|---|---|---|---|---|---|---|---|
| 0 | Depression | Normal | 297 (62.5) | 122 (75.3) | 89 (53.3) | 86 (58.9) | 0.001 |
| 1 | | Borderline | 114 (24.0) | 27 (16.7) | 46 (27.5) | 41 (28.1) | |
| 2 | | Abnormal | 64 (13.5) | 13 (8.0) | 32 (19.2) | 19 (13.0) | |
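
The cells in this table store a count and a percentage together as text (for example `297 (62.5)`), and the outcome label appears only on the first row. Before any analysis, these would likely need to be split into numeric columns. The following is a minimal sketch of one way to do that, using the `Total N (%)` values shown above; the new column names `n` and `pct` are our own illustrative choices.

```python
import pandas as pd

# The three depression rows as displayed above (only one value column shown).
table = pd.DataFrame({
    "Mental health outcomes": ["Depression", None, None],
    "Categories": ["Normal", "Borderline", "Abnormal"],
    "Total N (%)": ["297 (62.5)", "114 (24.0)", "64 (13.5)"],
})

# The outcome label is only on the first row, so carry it down to the rows below.
table["Mental health outcomes"] = table["Mental health outcomes"].ffill()

# Split "count (percent)" into two numeric columns with a regular expression.
parts = table["Total N (%)"].str.extract(r"(?P<n>\d+)\s*\((?P<pct>[\d.]+)\)")
table["n"] = parts["n"].astype(int)
table["pct"] = parts["pct"].astype(float)

print(table[["Mental health outcomes", "Categories", "n", "pct"]])
```

The same extraction could be applied to the Doctor, Nurse, and Other health workers columns, since they follow the same `count (percent)` pattern.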


In [6]:
depressioninhealthcareworkerschina = pd.read_csv('data/depressioninhealthcareworkerschina.csv')
depressioninhealthcareworkerschina.head()

# The original dataset also contained data on other mental health issues among the workers, which were irrelevant to our research question. We
# dropped those columns because our question asks specifically whether there is a relation between depression and healthcare workers.


| | Author | Study Population | Response rate (%) | Region | Health care workers | Unnamed: 5 | Unnamed: 6 | Male% | Assessment | Cut-off | Depression% (n) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | | | | Physicians | Nurses | Other | | | | |
| 1 | Du et al. (2020) | 134.0 | 43·2% | China | 35·1% | 41·0% | 23·9% | 39·6% | BDI-II\nBAI | ≥14\n≥8 | 12·7%\n(17) |
| 2 | Guo et al. (2020) | 11118.0 | N.A. | China | 30·28% | 53·07% | 16·65% | 25·2% | SAS\nSDS | ≥50\n≥50 | 31·45%\n(3497) |
| 3 | Huang et al. (2020a) | 230.0 | 93·5% | Fuyang | 30·4% | 69·6% | 0·0% | 18·7% | SAS | ≥50 | N.A. |
| 4 | Huang and Zhao (2020) | 2250.0 | 85·3% | China | N.A. | N.A. | N.A. | N.A. | CES-D\nGAD-7 | ≥28\n≥9 | 19·8%\n(446) |
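
The percentages in this table use a raised dot as the decimal separator (for example `43·2%`) and mark missing values as `N.A.`, so pandas reads them as text. The following is a minimal sketch of how such values might be converted to numbers before analysis; `parse_percent` is a hypothetical helper of our own, and the sample values are the `Response rate (%)` entries shown above.

```python
import pandas as pd

def parse_percent(value):
    """Convert text such as '43·2%' to 43.2; return NaN for 'N.A.' or non-text input."""
    if not isinstance(value, str):
        return float("nan")
    # Swap the raised dot for a normal decimal point and drop the percent sign.
    text = value.replace("·", ".").replace("%", "").strip()
    try:
        return float(text)
    except ValueError:
        return float("nan")

# Response rates for the four studies displayed above.
response_rate = pd.Series(["43·2%", "N.A.", "93·5%", "85·3%"])
parsed = response_rate.map(parse_percent)

print(parsed)
```

The `Depression% (n)` column also contains a line break and a count in parentheses, so it would need an extra step to separate the percentage from the count.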


In [7]:
global_depression_trend = pd.read_csv('data/global_depression_trend.csv')

global_depression_trend.drop(global_depression_trend.columns[1], axis = 1, inplace = True)
global_depression_trend.rename(columns = {global_depression_trend.columns[0]: "Disorder"}, inplace = True)
global_depression_trend.head()
# The original dataset was very large because it covered many physical and mental health conditions besides depression. Those
# observations were not relevant, so we dropped those rows and kept only the 3 rows concerning depression among the population
# from 1990-2017. We kept the years from 1990 onward because we are considering using other pandemic datasets to see whether there
# is a correlation between pandemics and depression. We plan to inspect both the SARS and H1N1 pandemics, so we would like to keep
# the data about depression trends during that time in our cleaned dataset. We also dropped the column at index 1 because the
# file's formatting created an extra column. We then renamed the first column to "Disorder" so that its title is more descriptive.



| | Disorder | Prevalence (thousands) 2017 counts | Incidence (thousands) 2017 counts | 2017 counts | Percentage change in counts, 1990–2007 | Percentage change in counts, 2007–17 | Percentage change in age-standardised rates, 1990–2007 | Percentage change in age-standardised rates, 2007–17 |
|---|---|---|---|---|---|---|---|---|
| 0 | Depressive disorders | 264 455.6 | 258 164.5 | 43 099.9 | 33·40% | 14.30% | −1.9% | −2.6% |
| 1 | Major depressive disorder | 163 044.1 | 241 893.3 | 32 846.7 | 32.10% | 12.60% | −2.4% | −3.6% |
| 2 | Dysthymia | 106 904.4 | 16 271.1 | 10 253.2 | 38.30% | 20.40% | −0.3% | 0.80% |


In [13]:
h1n1_summary = pd.read_csv('data/h1n1_summary.csv')
h1n1_summary

# The original dataset included a column of links to the data source, which we did not need to answer our research question, so we dropped it.
# We kept the other countries for now in case we want to cover a wider range of countries later in the project.

| | Date | Country | Cumulative no. of cases | Cumulative no. of deaths |
|---|---|---|---|---|
| 0 | 2009-04-24 | Mexico | 18 | 0 |
| 1 | 2009-04-24 | United States of America | 7 | 0 |
| 2 | 2009-04-26 | Mexico | 18 | 0 |
| 3 | 2009-04-26 | United States of America | 20 | 0 |
| 4 | 2009-04-27 | Canada | 6 | 0 |
| ... | ... | ... | ... | ... |
| 2485 | 2009-07-06 | Venezuela | 206 | 0 |
| 2486 | 2009-07-06 | Viet Nam | 181 | 0 |
| 2487 | 2009-07-06 | Virgin Islands | 1 | 0 |
| 2488 | 2009-07-06 | West Bank and Gaza Strip | 60 | 0 |
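
Because this dataset stores cumulative counts for every country, narrowing it to the countries in our research question and deriving new cases per report could be a useful first step. The following is a minimal sketch using the first five rows displayed above. The `New cases` column name is our own, and the label `"China"` is an assumption about how that country is spelled in the full file.

```python
import pandas as pd

# First five rows of the H1N1 summary as displayed above.
h1n1 = pd.DataFrame({
    "Date": ["2009-04-24", "2009-04-24", "2009-04-26", "2009-04-26", "2009-04-27"],
    "Country": ["Mexico", "United States of America", "Mexico",
                "United States of America", "Canada"],
    "Cumulative no. of cases": [18, 7, 18, 20, 6],
    "Cumulative no. of deaths": [0, 0, 0, 0, 0],
})
h1n1["Date"] = pd.to_datetime(h1n1["Date"])

# Countries from the research question ("China" is an assumed label).
countries = ["United States of America", "China"]
subset = (
    h1n1[h1n1["Country"].isin(countries)]
    .sort_values(["Country", "Date"])
    .copy()
)

# Difference of the cumulative count within each country gives new cases per report.
# The first report for each country has no earlier row, so its value is NaN.
subset["New cases"] = subset.groupby("Country")["Cumulative no. of cases"].diff()

print(subset)
```

In this small sample only the United States rows remain, and the second report shows 13 new cases (20 minus 7).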


**Combining Datasets:** 
We will join the Global Depression Trend dataset with the H1N1 Case Summary dataset by date to compare the trends between the two. We decided not to combine the other datasets and will analyze them separately. We are also considering grouping the Prevalence of Depression During Covid dataset with the depressioninhealthcareworkerschina.csv dataset, since the latter contains data from China and together they would give us a more complete picture of depression rates globally. However, we are still deciding which metric to combine them on and how to merge the data cleanly.


# Project Proposal (updated)

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/25  |  3 PM | Read & Think about COGS 108 expectations; brainstorm topics/questions  | Discuss and finalize topic ideas for the project proposal. Finish project proposal. | 
| 2/1  |  3 PM |  Each member must find at least two datasets to use for project. |Discuss datasets we’ve found and finalize which ones to use. Finish Checkpoint #1 (due Feb 12). | 
| 2/10  | 3 PM  | Edit, finalize, and submit proposal; Search for datasets  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 2/14  | 3 PM  | Import & Wrangle Data; EDA | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 2/23  | 3 PM  | Finalize wrangling/EDA; Begin Analysis| Discuss/edit Analysis; Complete project check-in |
| 3/13  | 3 PM  | Complete analysis; Draft results/conclusion/discussion | Discuss/edit full project |
| 3/19  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |