<a href="https://colab.research.google.com/github/AnkitRajSri/Effects-of-Lockdown-on-Mental-Health/blob/master/WHO_Guidelines_Data_Exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Data Sourcing and Preparation

In [0]:
import pandas as pd
import numpy as np
import re
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
import textract

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [0]:
# Read the content of the PDF as text (textract returns a bytes object)
text = textract.process('mental-health-considerations.pdf')
# The numbered points (1., 2., ...) serve as paragraph delimiters in the next cell
print(text)

b' \n \nMental health and psychosocial considerations during the \nCOVID-19 outbreak  \n \n18 March 2020  \n \nIn January 2020 the World Health Organization (WHO) declared the outbreak of a new coronavirus \ndisease, COVID-19, to be a Public Health Emergency of International Concern. WHO stated that \nthere is a high risk of COVID-19 spreading to other countries around the world. In March 2020, \nWHO made the assessment that COVID-19 can be characterized as a pandemic. \n \nWHO and public health authorities around the world are acting to contain the COVID-19 outbreak. \nHowever, this time of crisis is generating stress throughout the population. The considerations \npresented in this document have been developed by the WHO Department of Mental Health and \nSubstance Use as a series of messages that can be used in communications to support mental and \npsychosocial well-being in different target groups during the outbreak. \n \nMessages for the general population \n \n1. COVID-19 has an

In [0]:
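# Split the decoded text wherever a digit is followed by one or more dots
pdf_data = pd.DataFrame(re.split(r'[0-9]\.+', text.decode('utf-8')))
# Keep rows 1 to 31, which drops the introduction held in row 0
pdf_data = pdf_data.loc[1:31, ]
# Rejoin rows 5 and 6 into a single message, then drop the leftover row 6
message = pdf_data.loc[5:6,].values[0] + '. ' + pdf_data.loc[5:6,].values[1]
pdf_data.loc[5,] = message
pdf_data = pdf_data.drop([6]).reset_index(drop = True)
pdf_data.head()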

Unnamed: 0,0
0,COVID-19 has and is likely to affect people f...
1,Do not refer to people with the disease as “C...
2,"Minimize watching, reading or listening to ne..."
3,Protect yourself and be supportive to others....
4,Find opportunities to amplify positive and ho...
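
The split pattern used above matches any single digit followed by a dot, not only the list numbers. The sketch below uses a made-up sample string (not the WHO text) to show the extra splits this can produce, for example inside `COVID-19.` and `10.`. It then contrasts this with a stricter pattern anchored to the start of a line. The stricter pattern, `sample`, `loose_parts` and `strict_parts` are all assumptions for illustration; the notebook does not use them, and switching patterns would change the row indices the following cells rely on.

```python
import re

# Hypothetical sample text, standing in for the decoded PDF content.
sample = (
    "Intro text\n"
    "1. First message about COVID-19. It continues here.\n"
    "2. Second message.\n"
    "10. Tenth message."
)

# Pattern from the notebook: one digit followed by one or more dots.
loose_parts = re.split(r'[0-9]\.+', sample)

# Assumed alternative: a whole number and a dot at the start of a line.
strict_parts = re.split(r'^\s*\d+\.\s+', sample, flags=re.MULTILINE)

print(len(loose_parts))   # 5 pieces: "COVID-19." and "10." add unwanted cuts
print(len(strict_parts))  # 4 pieces: the intro plus three messages
for part in strict_parts:
    print(repr(part))
```

With the loose pattern, the first message is cut after `COVID-1` and the `1` of `10.` is left at the end of the preceding piece, which is the kind of fragment that has to be merged back by hand.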


In [0]:
pdf_data.columns = ['WHO_message']
pdf_data.loc[0:5, 'concerned_individuals'] = 'General Population'
pdf_data.loc[6:10, 'concerned_individuals'] = 'Healthcare Workers'
pdf_data.loc[11:16, 'concerned_individuals'] = 'Team Leaders/Managers in Health Facilities'
pdf_data.loc[17:20, 'concerned_individuals'] = 'Carers of Children'
pdf_data.loc[21:26, 'concerned_individuals'] = 'Older Adults/People with underlying health conditions/Carers'
pdf_data.loc[27:29, 'concerned_individuals'] = 'People in isolation'
pdf_data.head()

Unnamed: 0,WHO_message,concerned_individuals
0,COVID-19 has and is likely to affect people f...,General Population
1,Do not refer to people with the disease as “C...,General Population
2,"Minimize watching, reading or listening to ne...",General Population
3,Protect yourself and be supportive to others....,General Population
4,Find opportunities to amplify positive and ho...,General Population


### Data Preprocessing

In [0]:
from nltk.tokenize import regexp_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Use a separate name so the imported `stopwords` module is not overwritten
stop_words = stopwords.words('english')
# Tokens are purely alphabetic, so only single-word entries such as 'covid' can match
stop_words.extend(['covid', 'covid-19', 'covid 19', 'coronavirus', 'outbreak', 'pandemic'])
word_lemmatizer = WordNetLemmatizer()

def cleanText(text):
    # Remove URLs
    text = re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', text)
    # Lowercase and keep alphabetic tokens only
    words = regexp_tokenize(text.lower(), r'[A-Za-z]+')
    # Drop single letters, the token 'rt', and stop words
    words = [w for w in words if len(w) > 1 and w != 'rt' and w not in stop_words]
    # Lemmatize each remaining token (as a noun, the WordNetLemmatizer default)
    words = [word_lemmatizer.lemmatize(w) for w in words]
    cleaned_text = ' '.join(words)
    return cleaned_text

In [0]:
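pdf_data['cleaned_message'] = pdf_data['WHO_message'].apply(cleanText)
# Intended to strip the trailing footer text from the last message. cleanText has already
# removed 'covid', so this pattern does not match and the footer text remains in the output.
pdf_data.loc[29, 'cleaned_message'] = re.sub('stay informed find latest information covid spreading advice guidance covid addressing social stigma', '', pdf_data.loc[29, 'cleaned_message'])
pdf_data.loc[29, 'cleaned_message']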

'near constant stream news report cause anyone feel anxious distressed seek information update practical guidance specific time day health professional website avoid listening following rumour make feel uncomfortable stay informed find latest information spreading advice guidance addressing social stigma stigma guide'
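
The output above still ends with the footer words. A likely reason is that the removal pattern contains `covid`, which the cleaning step has already dropped as a stop word. The following is a minimal sketch of that interaction. It uses a simplified stand-in for `cleanText` (no NLTK, no lemmatization) and a made-up sentence; `demo_stop_words`, `demo_clean` and both patterns are hypothetical names for illustration only.

```python
import re

# Hypothetical, shortened stop-word set; the notebook uses NLTK's English list plus extras.
demo_stop_words = {'the', 'on', 'and', 'of', 'covid', 'covid-19', 'covid 19'}

def demo_clean(text):
    # Same tokenization idea as the notebook: lowercase, alphabetic runs only.
    tokens = re.findall(r'[a-z]+', text.lower())
    return ' '.join(t for t in tokens if len(t) > 1 and t not in demo_stop_words)

raw = "Stay informed: find the latest information on COVID-19 and advice on the stigma of COVID-19."
cleaned = demo_clean(raw)
print(cleaned)  # stay informed find latest information advice stigma

# A pattern written against the uncleaned wording still contains 'covid' and never matches.
pattern_with_covid = 'latest information covid advice'
# The same pattern written against the cleaned wording does match.
pattern_without_covid = 'latest information advice'

unchanged = re.sub(pattern_with_covid, '', cleaned)
stripped = re.sub(pattern_without_covid, '', cleaned)
print(unchanged == cleaned)  # True: nothing was removed
print(repr(stripped))        # 'stay informed find  stigma'
```

Because `COVID-19` is tokenized to the single token `covid`, the multi-character stop-word entries `covid-19` and `covid 19` never match anything either.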

In [0]:
grouped_pdf_data = pdf_data.groupby('concerned_individuals')
grouped_pdf_data.head(2)

Unnamed: 0,WHO_message,concerned_individuals,cleaned_message
0,COVID-19 has and is likely to affect people f...,General Population,likely affect people many country many geograp...
1,Do not refer to people with the disease as “C...,General Population,refer people disease case victim family diseas...
6,Feeling under pressure is a likely experience...,Healthcare Workers,feeling pressure likely experience many collea...
7,Take care of yourself at this time. Try and u...,Healthcare Workers,take care time try use helpful coping strategy...
11,Keeping all staff protected from chronic stre...,Team Leaders/Managers in Health Facilities,keeping staff protected chronic stress poor me...
12,Ensure that good quality communication and ac...,Team Leaders/Managers in Health Facilities,ensure good quality communication accurate inf...
17,Help children find positive ways to express f...,Carers of Children,help child find positive way express feeling f...
18,Keep children close to their parents and fami...,Carers of Children,keep child close parent family considered safe...
21,"Older adults, especially in isolation and tho...",Older Adults/People with underlying health con...,older adult especially isolation cognitive dec...
22,Share simple facts about what is going on and...,Older Adults/People with underlying health con...,share simple fact going give clear information...


In [0]:
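# gensim.summarization was removed in gensim 4.0, so this cell requires gensim < 4.0
from gensim.summarization import summarize, keywords

# keywords() returns one keyword per line; join them with commas
pdf_data['keywords'] = pdf_data['cleaned_message'].apply(lambda x : re.sub(r'\n', ', ', keywords(x, words = 3)))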
pdf_data.head()

Unnamed: 0,WHO_message,concerned_individuals,cleaned_message,keywords
0,COVID-19 has and is likely to affect people f...,General Population,likely affect people many country many geograp...,"affected, affect people, deserve"
1,Do not refer to people with the disease as “C...,General Population,refer people disease case victim family diseas...,"family, people, separate"
2,"Minimize watching, reading or listening to ne...",General Population,minimize watching reading listening news cause...,"information, news, help"
3,Protect yourself and be supportive to others....,General Population,protect supportive others assisting others tim...,"need, supportive, support, solidarity"
4,Find opportunities to amplify positive and ho...,General Population,find opportunity amplify positive hopeful stor...,"people, positive, story"


In [0]:
group1_data = grouped_pdf_data.get_group('Carers of Children')
group2_data = grouped_pdf_data.get_group('General Population')
group3_data = grouped_pdf_data.get_group('Healthcare Workers')
group4_data = grouped_pdf_data.get_group('Older Adults/People with underlying health conditions/Carers')
group5_data = grouped_pdf_data.get_group('People in isolation')
group6_data = grouped_pdf_data.get_group('Team Leaders/Managers in Health Facilities')

### Data Exploration

##### Most common vocabulary

In [0]:
from collections import Counter

def getMostCommonWords(text):
  # Return the 5 most frequent alphabetic tokens as (word, count) pairs
  words = regexp_tokenize(text.lower(), r'[A-Za-z]+')
  frequent_words = Counter(words).most_common(5)
  return frequent_words

def createVocabDf(df):
  # Pool every cleaned message of the group into one string before counting
  text = ' '.join(df['cleaned_message'].values)
  frequent_words = getMostCommonWords(text)
  vocab_df = pd.DataFrame(frequent_words, columns = ['word', 'count'])
  return vocab_df
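
As a small illustration of what `getMostCommonWords` returns, the sketch below runs `Counter.most_common` on a made-up token list (the words and counts are invented, not taken from the guidelines). Each `(word, count)` pair becomes one row of the vocabulary DataFrame.

```python
from collections import Counter

# Hypothetical tokens, standing in for one group's pooled cleaned messages.
demo_words = "child feeling child routine feeling child".split()

# most_common(n) returns the n highest counts as (word, count) pairs, highest first.
demo_top = Counter(demo_words).most_common(2)
print(demo_top)  # [('child', 3), ('feeling', 2)]
```

Words with equal counts are returned in the order they were first seen, so the fifth bar in a chart can depend on word order when several words tie.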

In [0]:
group1_vocab = createVocabDf(group1_data)
group2_vocab = createVocabDf(group2_data)
group3_vocab = createVocabDf(group3_data)
group4_vocab = createVocabDf(group4_data)
group5_vocab = createVocabDf(group5_data)
group6_vocab = createVocabDf(group6_data)

In [0]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go

fig = make_subplots(rows = 2, cols = 3, shared_yaxes = True, subplot_titles = ('Carers of Children', 'General Population', 'Healthcare Workers', 
                                                                               'Older Adults/People with health conditions/Carers', 
                                                                               'People in isolation', 'Team Leaders/Managers in Health Facilities'))

fig.add_trace(go.Bar(x = group1_vocab['word'], y = group1_vocab['count'], name = 'Carers of Children'), row = 1, col = 1)
fig.add_trace(go.Bar(x = group2_vocab['word'], y = group2_vocab['count'], name = 'General Population'), row = 1, col = 2)
fig.add_trace(go.Bar(x = group3_vocab['word'], y = group3_vocab['count'], name = 'Healthcare Workers'), row = 1, col = 3)
fig.add_trace(go.Bar(x = group4_vocab['word'], y = group4_vocab['count'], name = 'Older Adults'), row = 2, col = 1)
fig.add_trace(go.Bar(x = group5_vocab['word'], y = group5_vocab['count'], name = 'People in isolation'), row = 2, col = 2)
fig.add_trace(go.Bar(x = group6_vocab['word'], y = group6_vocab['count'], name = 'Team Leaders'), row = 2, col = 3)
fig.update_layout(title_text="Most commonly used words in the WHO guidelines")
fig.show()

##### Topic Modelling

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation as LDA

vector = TfidfVectorizer()

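def extractTopics(tfidf):
  # Relies on the module-level `vector` having been fitted on the same messages as `tfidf`
  list_of_topics = []
  # No random_state is set, so the extracted topics can vary between runs
  model = LDA(n_components = 3).fit(tfidf)
  # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() exists since 1.0
  terms = vector.get_feature_names_out()
  for i, comp in enumerate(model.components_):
    terms_comp = zip(terms, comp)
    # Keep the 4 highest-weighted terms of each topic
    sorted_terms = sorted(terms_comp, key = lambda x:x[1], reverse = True)[:4]
    topics = ' '.join([w[0] for w in sorted_terms])
    list_of_topics.append(topics)
  return list_of_topics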
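
The ranking step inside `extractTopics` can be seen in isolation with a toy example. The sketch below replaces the fitted model with a hand-written weight matrix, so `demo_terms`, `demo_components` and `top_terms` are hypothetical and the numbers are invented. It only shows how each topic row is paired with the vocabulary and reduced to its highest-weighted terms.

```python
# Hypothetical vocabulary and topic-term weights (one row per topic).
demo_terms = ['child', 'feeling', 'routine', 'adult']
demo_components = [
    [0.10, 0.90, 0.30, 0.20],  # topic 0
    [0.80, 0.10, 0.70, 0.05],  # topic 1
]

def top_terms(terms, weights, n):
    # Pair each term with its weight, sort by weight descending, keep the first n terms.
    ranked = sorted(zip(terms, weights), key=lambda pair: pair[1], reverse=True)
    return ' '.join(term for term, _ in ranked[:n])

demo_topics = [top_terms(demo_terms, row, 2) for row in demo_components]
print(demo_topics)  # ['feeling routine', 'child routine']
```

A term such as `routine` can appear in several topics, since each row is ranked independently.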

In [0]:
for group in grouped_pdf_data.groups.keys():
  df = grouped_pdf_data.get_group(group)
  list_of_messages = df['cleaned_message'].tolist()
  tfidf = vector.fit_transform(list_of_messages)
  representative_topics = ', '.join(extractTopics(tfidf))

  print('The representative topics in the guidelines for', '\033[3m', '%s'%group, '\033[0m', 'are:')
  print('\033[1m', representative_topics, '\033[0m', '\n')

The representative topics in the guidelines for Carers of Children are:
express feeling way child, child possible social routine, child adult time emotion

The representative topics in the guidelines for General Population are:
people healthcare worker positive, need others together community, country many fact information

The representative topics in the guidelines for Healthcare Workers are:
strategy coping sufficient using, health support may mental, feeling managing psychosocial way

The representative topics in the guidelines for Older Adults/People with underlying health conditions/Carers are:
quarantine isolation daily health, information make needed sure, regular keep new one

The representative topics in the guidelines for People in isolation are:
feel guidance information stigma, keep routine social healthy, stay social health find

The representative topics in the guidel