# <div style="text-align: center">Quest for COVID-19 Data</div>

# Acquisition Phase
## Web Scraping

<div style="text-align: justify"> In this phase, data is extracted from the Twitter accounts of the Ministry of Health (MOH) and the Ministry of Interior (MOI), as well as from the MOH website. The raw data is then transformed into structured data for further analysis. These sources were chosen because they are official sources of accurate COVID-19 information and updates.</div>

In [1]:
# Importing necessary libraries (duplicates removed)
import json
import os
import re
import sys
import time

import pandas as pd
import requests
import tweepy as tw
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

In [2]:
# Defining keys
# Keys were removed for security reasons. If submitting keys is required, kindly contact me. 
consumer_key = '--'
consumer_secret = '--'
access_token = '--'
access_token_secret = '--'

# Authorizing Twitter credentials
auth = tw.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tw.API(auth, wait_on_rate_limit=True)
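
Since the keys were removed from the notebook, one common way to keep them out of the source entirely is to read them from environment variables. The following is a minimal sketch of that idea, not the approach used above; the variable names in `REQUIRED_VARS` and the helper `load_twitter_credentials` are hypothetical.

```python
import os

# Hypothetical environment variable names; any naming scheme would work.
REQUIRED_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_twitter_credentials(env=None):
    """Return the four credentials as a dict, or raise if any is missing."""
    env = os.environ if env is None else env
    # Treat unset and empty variables the same way
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

The returned values could then be passed to `tw.OAuthHandler` and `auth.set_access_token` in place of the `'--'` placeholders.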

## Collecting MOH Tweets

In [3]:
# Collecting Ministry of Health (MOH) tweets
tweets = api.user_timeline(screen_name='SaudiMOH', 
                           # 200 is the maximum allowed count
                           count=200,
                           include_rts = False,
                           # Necessary to get full_text;
                           # otherwise the text is truncated to the first 140 characters
                           tweet_mode = 'extended'
                           )
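
Because a single call returns at most 200 tweets, older tweets have to be requested page by page. The sketch below illustrates the usual `max_id` technique with a stand-in data source so it runs without network access; `collect_timeline`, `fetch_page`, and `FakeTweet` are hypothetical names introduced only for this illustration.

```python
from collections import namedtuple


def collect_timeline(fetch_page, page_size=200, max_pages=5):
    """Collect tweets page by page, walking backwards in time with max_id.

    fetch_page(count, max_id) is a hypothetical callable that returns a list of
    objects with an integer .id, newest first; max_id=None means "start from
    the newest tweet".
    """
    collected = []
    max_id = None
    for _ in range(max_pages):
        page = fetch_page(page_size, max_id)
        if not page:
            break
        collected.extend(page)
        # Next, ask only for tweets strictly older than the oldest one seen
        max_id = page[-1].id - 1
    return collected


# Stand-in for the Twitter API: 450 fake tweets with ids 450 down to 1
FakeTweet = namedtuple("FakeTweet", "id")
ALL_TWEETS = [FakeTweet(i) for i in range(450, 0, -1)]


def fake_fetch(count, max_id):
    eligible = [t for t in ALL_TWEETS if max_id is None or t.id <= max_id]
    return eligible[:count]


older = collect_timeline(fake_fetch)
print(len(older))
```

With tweepy, `fetch_page` would presumably wrap `api.user_timeline`, which accepts a `max_id` argument. tweepy also provides a `Cursor` helper that handles this paging itself, which may be the simpler option in practice.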

In [2]:
# Checking MOH tweets
# for tweet in tweets:
#     print(tweet._json)
#     print('---')

In [5]:
# Closely examining MOH tweets
for info in tweets[:3]:
     print("ID: {}".format(info.id))
     print(info.created_at)
     print(info.full_text)
     print(info.lang)
     print("\n")

ID: 1249064845359689729
2020-04-11 20:00:40
من غرف عملياتنا إلى غرف عملياتكم، شكراً لكم. 
#كلنا_مسؤول https://t.co/upj9Sj19P1
ar


ID: 1249007249084866560
2020-04-11 16:11:48
Highlights of the press conference of the official spokesperson. https://t.co/hK5b9K8bm1
en


ID: 1249007209733914625
2020-04-11 16:11:38
أبرز ما جاء في المؤتمر الصحفي للمتحدث الرسمي لوزارة الصحة. https://t.co/3wAJVkdTlF
ar




## Preprocessing
In this section, the raw MOH tweets are converted into a structured dataframe for later processing.

In [6]:
# Transforming the tweepy tweets into a dataframe
outtweets = [[tweet.id_str,
              tweet.created_at,
              "Ministry of Health",
              tweet.favorite_count,
              tweet.retweet_count,
              tweet.full_text,
              tweet.lang]
             for tweet in tweets]
df = pd.DataFrame(outtweets,columns=["ID","Created_at","Source","Favorite_Count","Retweet_Count", "Tweets", "Language"])
df.head(9)

Unnamed: 0,ID,Created_at,Source,Favorite_Count,Retweet_Count,Tweets,Language
0,1249064845359689729,2020-04-11 20:00:40,Ministry of Health,1229,964,من غرف عملياتنا إلى غرف عملياتكم، شكراً لكم. \...,ar
1,1249007249084866560,2020-04-11 16:11:48,Ministry of Health,373,190,Highlights of the press conference of the offi...,en
2,1249007209733914625,2020-04-11 16:11:38,Ministry of Health,619,440,أبرز ما جاء في المؤتمر الصحفي للمتحدث الرسمي ل...,ar
3,1248954430550151173,2020-04-11 12:41:55,Ministry of Health,5718,6646,#الصحة تعلن عن تسجيل (382) حالة إصابة جديدة بف...,ar
4,1248942622913236994,2020-04-11 11:55:00,Ministry of Health,757,645,لأن حماية أحبائنا من كبار السن من فيروس ⁧#كورو...,ar
5,1248933857593831427,2020-04-11 11:20:10,Ministry of Health,837,806,بادر بالإفصاح وكن عضوًا فاعلاً في الحد من تفشي...,ar
6,1248714652667740160,2020-04-10 20:49:07,Ministry of Health,719,595,لكل العائدين من السفر نشارككم دليل شامل للإجرا...,ar
7,1248712412762714112,2020-04-10 20:40:13,Ministry of Health,2361,1507,حمداً لله على سلامتكم 🇸🇦\n#عودة_آمنة https://t...,ar
8,1248675865526964225,2020-04-10 18:15:00,Ministry of Health,609,292,Know the difference between medical quarantine...,en


In [7]:
# Checking the 'Tweets' column
df[['Tweets']]

Unnamed: 0,Tweets
0,من غرف عملياتنا إلى غرف عملياتكم، شكراً لكم. \...
1,Highlights of the press conference of the offi...
2,أبرز ما جاء في المؤتمر الصحفي للمتحدث الرسمي ل...
3,#الصحة تعلن عن تسجيل (382) حالة إصابة جديدة بف...
4,لأن حماية أحبائنا من كبار السن من فيروس ⁧#كورو...
...,...
154,#الصحة تعلن عن تسجيل (92) حالة إصابة جديدة بفي...
155,لسلامتك وسلامة أصدقائك #لا_تقول_تم \n#الوقاية_...
156,اكشف عن أعراض فيروس #كورونا الجديد بكل سهولة ع...
157,"الاسبانية \nSi sufre de fiebre alta, tos o dif..."


## Collecting MOI Tweets

In [8]:
# Collecting Ministry of Interior (MOI) tweets
tweets_moi = api.user_timeline(screen_name='MOISaudiArabia', 
                           # 200 is the maximum allowed count
                           count=200,
                           include_rts = False,
                           # Necessary to get full_text;
                           # otherwise the text is truncated to the first 140 characters
                           tweet_mode = 'extended'
                           )

In [3]:
# # Checking MOI tweets
# for tweet in tweets_moi:
#     print(tweet._json)
#     print('---')

In [10]:
# Closely examining MOI tweets
for info in tweets_moi[:3]:
     print("ID: {}".format(info.id))
     print(info.created_at)
     print(info.full_text)
     print(info.lang)
     print("\n")

ID: 1248913715229048832
2020-04-11 10:00:08
أبرز أخبار وزارة الداخلية خلال الفترة من 12 وحتى 18 شعبان 1441هـ .

#الداخلية_في_أسبوع https://t.co/DeQeXr0TVj
ar


ID: 1248597515940909056
2020-04-10 13:03:40
مصدر مسؤول بوزارة الداخلية: تقرر منع التجول والتنقل الكلي وعدم الخروج من المنازل في أحياء " الشريبات، بني ظفر، قربان، الجمعة، جزء من الإسكان، بني خدرة" بالمدينة المنورة، ابتداءً من اليوم وحتى إشعار آخر. https://t.co/Lle5ERIS6r
ar


ID: 1248551321571033089
2020-04-10 10:00:06
للتواصل مع إمارات المناطق وقطاعات #وزارة_الداخلية .. #أبشر https://t.co/PVS56VYgXl
ar




## Preprocessing
In this section, the raw MOI tweets are converted into a structured dataframe, combined with the MOH tweets, and saved.

In [11]:
# Transforming the tweepy tweets into a dataframe
outtweets_moi = [[tweet.id_str,
                  tweet.created_at,
                  "Ministry of Interior",
                  tweet.favorite_count,
                  tweet.retweet_count,
                  tweet.full_text,
                  tweet.lang]
                 for tweet in tweets_moi]
moi_df = pd.DataFrame(outtweets_moi,columns=["ID","Created_at","Source","Favorite_Count","Retweet_Count", "Tweets", "Language"])
moi_df.head(9)

Unnamed: 0,ID,Created_at,Source,Favorite_Count,Retweet_Count,Tweets,Language
0,1248913715229048832,2020-04-11 10:00:08,Ministry of Interior,912,590,أبرز أخبار وزارة الداخلية خلال الفترة من 12 وح...,ar
1,1248597515940909056,2020-04-10 13:03:40,Ministry of Interior,1714,1907,مصدر مسؤول بوزارة الداخلية: تقرر منع التجول وا...,ar
2,1248551321571033089,2020-04-10 10:00:06,Ministry of Interior,1089,810,للتواصل مع إمارات المناطق وقطاعات #وزارة_الداخ...,ar
3,1248295592440315905,2020-04-09 17:03:56,Ministry of Interior,2561,2010,مصدر مسؤول بوزارة الداخلية: \nسريان السماح لجم...,ar
4,1247639667698098176,2020-04-07 21:37:31,Ministry of Interior,2036,1178,الأمير عبدالعزيز بن سعود يشارك في الاجتماع الط...,ar
5,1247517368445571077,2020-04-07 13:31:33,Ministry of Interior,2722,2958,مصدر مسؤول بوزارة الداخلية: تقديم ساعات منع ال...,ar
6,1247509451948556289,2020-04-07 13:00:05,Ministry of Interior,3165,2567,من منسوبي الداخلية لأبطال الصحة: نحن معكم و #ك...,ar
7,1247363557462880256,2020-04-07 03:20:21,Ministry of Interior,3878,3198,منع التجول من أجل سلامتكم ... و #كلنا_مسؤول ht...,ar
8,1247237019132141568,2020-04-06 18:57:32,Ministry of Interior,6206,6012,مصدر مسؤول بوزارة الداخلية : منع التجول على مد...,ar


In [12]:
# Concatenating the MOH and MOI dataframes
# and checking the shape of the result.
# ignore_index=True avoids duplicate index labels in the combined dataframe
tweets = pd.concat([df, moi_df], ignore_index=True)
tweets.shape

(316, 7)

In [13]:
# Saving the concatenated MOH and MOI dataframe
tweets.to_csv('SaudiMOI_MOH.csv')
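
The saved file contains Arabic text, and some spreadsheet tools misread UTF-8 CSV files unless they start with a byte order mark. The sketch below uses the standard `csv` module on two sample rows to show a round trip with the `utf-8-sig` encoding; the file name `tweets_sample.csv` is hypothetical. pandas' `to_csv` accepts an `encoding` argument as well, so the same idea should carry over to the cell above.

```python
import csv
import os
import tempfile

# Two sample rows modelled on the columns used in this notebook
rows = [
    {"ID": "1249064845359689729", "Source": "Ministry of Health",
     "Language": "ar", "Tweets": "#كلنا_مسؤول"},
    {"ID": "1249007249084866560", "Source": "Ministry of Health",
     "Language": "en",
     "Tweets": "Highlights of the press conference of the official spokesperson."},
]

path = os.path.join(tempfile.mkdtemp(), "tweets_sample.csv")

# utf-8-sig writes a byte order mark so spreadsheet tools can detect UTF-8
with open(path, "w", newline="", encoding="utf-8-sig") as handle:
    writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)

# Reading with the same encoding strips the mark again
with open(path, newline="", encoding="utf-8-sig") as handle:
    restored = [dict(row) for row in csv.DictReader(handle)]

print(restored == rows)
```

Storing the tweet IDs as strings, as the notebook does with `id_str`, also keeps the long numeric IDs from being rounded by tools that read them as floating-point numbers.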

## Collecting Frequently Asked Questions (FAQs) from the MOH Website

In [14]:
# Scraping the FAQs page from the MOH website
url = "https://www.moh.gov.sa/en/HealthAwareness/EducationalContent/Corona/Pages/corona.aspx"

In [15]:
# Opening the MOH FAQs page in Chrome and parsing the rendered HTML
driver = webdriver.Chrome('chromedriver/chromedriver')
driver.get(url)
html = driver.page_source
# Closing the browser once the page source has been captured
driver.quit()
soup = BeautifulSoup(html, 'html.parser')

In [16]:
# Assigning the scraped data to a variable
results = soup.find("div", {"id": "ctl00_PlaceHolderMain_Content__ControlWrapper_RichHtmlField"})

In [4]:
# # Checking the Scraped data
# print(results)

In [18]:
# Defining a function to clean the extracted data
def strip_string(var):
    var = str(var)
    var = var.rstrip()
    var = var.replace('\n', '')
    return var

# Extracting each question (in <strong>) and the <div> that follows it
data = []
for result in results:
    question = result.find("strong")
    answer = None
    if question is not None:
        answer = result.findNext("div")

    # Keeping only complete question/answer pairs
    if (answer is not None) and (question is not None):
        datum = {}
        datum['Questions'] = strip_string(question)
        datum['Answers'] = strip_string(answer)
        data.append(datum)

## Preprocessing
In this section, the scraped FAQs are loaded into a dataframe, stripped of HTML tags, and labelled with tags.

In [19]:
# Creating a dataframe
qna = pd.DataFrame(data, columns = ['Questions','Answers', 'Tags'])

In [20]:
# Defining a function to clean tags
def cleanTags(text):
    clean = re.compile('<.*?>')
    data =[]
    for i in text:
        data.append(re.sub(clean,'',i))
    return data
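
To make the behaviour of the `<.*?>` pattern concrete, the sketch below compares it with its greedy counterpart on a sample question. It also decodes HTML entities, on the assumption that entities such as `&amp;` may survive in the scraped strings; `strip_markup` is a hypothetical helper, not part of the notebook.

```python
import re
from html import unescape

TAG_PATTERN = re.compile(r"<.*?>")    # non-greedy: each match stops at the first '>'
GREEDY_PATTERN = re.compile(r"<.*>")  # greedy: one match runs to the last '>'

raw = "<strong>What is (COVID-19)?</strong>"

print(TAG_PATTERN.sub("", raw))     # What is (COVID-19)?
print(GREEDY_PATTERN.sub("", raw))  # prints an empty line: the whole string matched


def strip_markup(text):
    """Remove tags, then decode entities such as &amp; into plain characters."""
    return unescape(TAG_PATTERN.sub("", text)).strip()


print(strip_markup("<div>Wash hands &amp; avoid crowds</div>"))
# Wash hands & avoid crowds
```

The non-greedy form is what makes `cleanTags` keep the text between an opening and a closing tag.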

In [21]:
# Cleaning tags
qna['Questions'] = cleanTags(qna['Questions'])
qna['Answers'] = cleanTags(qna['Answers'])

In [22]:
# Dropping the repeated and empty rows
qna = qna[~qna.index.isin([0,1])]

In [23]:
# Resetting the index so it is sequential again after dropping rows
qna.reset_index(inplace = True, drop = True)
qna

Unnamed: 0,Questions,Answers,Tags
0,What are Coronaviruses?,Coronaviruses (CoV) are a large family of viru...,
1,What are the species of coronaviruses that hav...,The SARS-CoV was transmitted from civet cats t...,
2,What is (COVID-19)?,It is the new coronavirus and most cases appea...,
3,How the virus identified?,The virus was identified through genetic seque...,
4,What is the origin of the virus?,It is believed that the COVID-19 originated in...,
5,Can the virus spread from person to person?,"Yes, the virus can spread from the infected pe...",
6,I just came from China having high temperature...,"Visit the nearest health facility, for more in...",
7,Can the COVID-19 spread through shipments comi...,"According to available information, the goods ...",
8,What are the symptoms of COVID-19?,The common symptoms of COVID-19 include: fever...,
9,Sneezing etiquette to prevent infection:,Use tissue papers for sneezing or coughing and...,


In [24]:
# Adding a 'Tags' column
qna['Tags'] = ['definition', 'species', 'definition', 'identify', 'origin', 'person', 'temperature', 'shipments', 'indication', 'etiquette', 'tips']    
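
Assigning a plain list as a column only works when the list has exactly one entry per row; pandas raises a `ValueError` otherwise. The sketch below shows a small guard that reports a mismatch with a clearer message before any assignment; `attach_tags` is a hypothetical helper operating on plain lists.

```python
def attach_tags(questions, tags):
    """Pair each question with its tag, refusing mismatched lengths."""
    if len(questions) != len(tags):
        raise ValueError(
            "{} questions but {} tags".format(len(questions), len(tags))
        )
    return list(zip(questions, tags))


pairs = attach_tags(
    ["What are Coronaviruses?", "What is the origin of the virus?"],
    ["definition", "origin"],
)
print(pairs)
```

In the notebook, the same check could be done by comparing `len(qna)` with the length of the tag list before the assignment.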

In [25]:
# Saving the MOH's FAQs in a csv file 
qna.to_csv('moh_qna.csv')