# Dataset Creation with Web Scraping 

Typeform is an online SaaS company for creating web forms and surveys. OSMI has used this software for its surveys since 2016. On the publicly available survey webpage, each question is followed by the distribution of its responses. We scrape the question data to create a CSV file for each year's survey, using the `requests` and `BeautifulSoup` libraries. The responses of each participant come from a separate dataset; for the 2018 survey, we load one obtained from Kaggle.

**Data Sources:**
1. [OSMI 2016 Survey]()
2. [OSMI 2017 Survey]()
3. [OSMI 2018 Survey]()
4. [OSMI 2020 Survey]()

## 01 OSMI 2018 Survey

In [194]:
# Import libraries necessary for web scraping and dataset creation
import re
import sys
import json
import requests
import pandas as pd
from bs4 import BeautifulSoup
from collections import Counter

# Request typeform survey webpage for contents
res = requests.get(
      'https://osmi.typeform.com/report/xztgPT/NFu2PHjwsMUkkL3h')

# Extract response status and validate for successful transaction
status = res.status_code
if status != 200:
    sys.exit(1)
else:
    print("Web scraping response status:\n", status)

# Parse HTML title, head and body contents using BeautifulSoup
soup = BeautifulSoup(res.content, 'html.parser')

print("Dataset Title:\n", soup.title.text)

Web scraping response status:
 200
Dataset Title:
 OSMI Mental Health in Tech Survey 2018


In [195]:
# Select content inside the script element that contains information about the survey questions and answers
script = soup.select('script')[11]

# Set a regex pattern to extract the report's payload and apply it to the script text.
# The lookahead stops before the final "};", so the closing brace is left out of the match.
pattern = re.compile(r"(?<=window\.__REPORT_PAYLOAD = ).*(?=};)")
fields = re.findall(pattern, script.text)

# Restore the closing brace so the string is valid input for the JSON parser
fields[0] = fields[0] + '}'

# Convert the string to JSON
json_param = json.loads(fields[0])

# Print all the questions asked in the survey
questions = json_param['blocks']
print("Number of questions in the survey:", len(questions))
print('-'*100)

# Strip the asterisks that mark bolded words in each question title
for question in questions:
    question['title'] = question['title'].replace('*', '')
    print(question['title'])

Number of questions in the survey: 68
----------------------------------------------------------------------------------------------------
Are you self-employed?
How many employees does your company or organization have?
Is your employer primarily a tech company/organization?
Is your primary role within your company related to tech/IT?
Does your employer provide mental health benefits as part of healthcare coverage?
Do you know the options for mental health care available under your employer-provided health coverage?
Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?
Does your employer offer resources to learn more about mental health disorders and options for seeking help?
Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?
If a mental health issue prompted you to request a medical leave from work, how easy or di
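
The cell above depends on the payload living in the twelfth `script` element and on re-attaching a closing brace by hand. A minimal sketch of a less position-dependent alternative follows, assuming the page still assigns a JSON object to `window.__REPORT_PAYLOAD`. The helper name `extract_payload` and the sample script strings are hypothetical and exist only for illustration.

```python
import json
import re


def extract_payload(script_texts, marker="window.__REPORT_PAYLOAD"):
    """Return the first JSON value assigned to `marker` in any script body.

    `extract_payload` is a hypothetical helper, not part of the notebook.
    """
    assign = re.compile(re.escape(marker) + r"\s*=\s*")
    decoder = json.JSONDecoder()
    for text in script_texts:
        match = assign.search(text)
        if match is None:
            continue
        # raw_decode parses one JSON value starting at the given index
        # and ignores whatever follows it (here, the trailing ";").
        payload, _end = decoder.raw_decode(text, match.end())
        return payload
    raise ValueError(marker + " not found in any script")


# Made-up script bodies standing in for the scraped page
sample_scripts = [
    "console.log('unrelated');",
    'window.__REPORT_PAYLOAD = {"totalResponsesCount": 2, '
    '"blocks": [{"title": "*Example* question?", "type": "yes_no"}]};',
]
payload = extract_payload(sample_scripts)
print(len(payload["blocks"]), payload["totalResponsesCount"])
```

With the notebook's objects, the call would presumably look like `extract_payload(s.text for s in soup.select('script'))`, which searches every script element instead of a fixed index.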

In [196]:
# Analyze the types of questions asked in the survey
question_types = []
for question in questions:
    question_types.append(question['type'])
Counter(question_types)

Counter({'yes_no': 16,
         'multiple_choice': 40,
         'opinion_scale': 7,
         'rating': 1,
         'dropdown': 4})

In [197]:
# Get the total number of participants in the survey
total_resp = json_param['totalResponsesCount']

# Get the number of participants that answered each question
counts = []
for question in questions:
    counts.append(question['summary']['count'])
print("Number of respondents for each question out of {0} participants:\n {1}".format(total_resp, counts))

Number of respondents for each question out of 417 participants:
 [417, 361, 361, 361, 361, 361, 361, 361, 361, 361, 361, 361, 361, 361, 360, 360, 361, 361, 56, 56, 56, 56, 56, 56, 56, 41, 417, 363, 363, 363, 363, 363, 363, 363, 363, 363, 363, 362, 363, 363, 363, 417, 191, 0, 81, 188, 415, 417, 417, 417, 417, 417, 417, 417, 417, 417, 51, 16, 417, 417, 417, 417, 417, 417, 311, 311, 417, 314]
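
The per-question counts vary widely, which suggests that many questions are shown only to some participants. As a rough sketch, the counts can be turned into response rates to flag sparsely answered questions. The helper `low_response_questions`, the 0.5 threshold, and the placeholder titles are assumptions made for illustration; the four counts are values that appear in the output above.

```python
def low_response_questions(titles, counts, total, threshold=0.5):
    """Return (title, rate) pairs for questions answered by less than
    `threshold` of all participants. Hypothetical helper for illustration."""
    flagged = []
    for title, count in zip(titles, counts):
        rate = count / total if total else 0.0
        if rate < threshold:
            flagged.append((title, round(rate, 3)))
    return flagged


# Placeholder titles paired with a few counts from the output above
example_titles = ["Question A", "Question B", "Question C", "Question D"]
example_counts = [417, 361, 56, 0]
flagged = low_response_questions(example_titles, example_counts, total=417)
print(flagged)  # [('Question C', 0.134), ('Question D', 0.0)]
```

In the notebook, `titles` could be built from `question['title']` for each entry in `questions`, with `counts` and `total_resp` reused as they are.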


In [198]:
# Import Kaggle dataset and examine the columns
data_2018 = pd.read_csv('datasets/2018_dataset_from_Kaggle.csv')
print("Number of columns:", len(data_2018.columns))
print('-'*100)
for count, i in enumerate(data_2018.columns):
    if '?' not in i:
        print(count, i)

Number of columns: 123
----------------------------------------------------------------------------------------------------
0 #
14 Describe the conversation you had with your employer about your mental health, including their reactions and what actions were taken to address your mental health issue/questions.
17 Describe the conversation with coworkers you had about your mental health including their reactions.
19 Describe the conversation your coworker had with you about their mental health (please do not use names).
40 Describe the conversation you had with your previous employer about your mental health, including their reactions and actions taken to address your mental health issue/questions.
43 Describe the conversation you had with your previous coworkers about your mental health including their reactions.
45 Describe the conversation your coworker had with you about their mental health (please do not use names)..1
50 Anxiety Disorder (Generalized, Social, Phobia, etc)
51 Mood Di

Because we aggregate columns from datasets across several years, we merge the columns of the descriptive (free-text) questions into a single column: `comments`. We also drop the first column, which holds the participant ID, along with the start and submit timestamps, which are not relevant to our study.
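
The merge into a single `comments` column could be sketched as follows. This is one possible approach under stated assumptions, not necessarily the one used later in the notebook: the helper `merge_comments`, the `" | "` separator, and the toy DataFrame are all hypothetical.

```python
import pandas as pd


def merge_comments(df, text_cols, target="comments", sep=" | "):
    """Join the non-empty values of `text_cols` row by row into one column
    and drop the originals. Hypothetical helper for illustration."""
    def join_row(row):
        parts = [str(v).strip() for v in row if pd.notna(v) and str(v).strip()]
        return sep.join(parts)

    out = df.drop(columns=text_cols)
    out[target] = df[text_cols].apply(join_row, axis=1)
    return out


# Toy data with made-up column names and answers
demo = pd.DataFrame({
    "Describe conversation A": ["felt supported", None],
    "Describe conversation B": ["went fine", "was ignored"],
    "Are you self-employed?": [0, 1],
})
merged = merge_comments(
    demo, ["Describe conversation A", "Describe conversation B"]
)
print(merged)
```

Passing the column names explicitly avoids guessing which columns are free-text; the `'?' not in i` filter used above also matches non-descriptive columns such as the disorder names.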

In [199]:
# Drop the participant ID and timestamp columns, then list the remaining columns
data_2018.drop(['#', 'Start Date (UTC)', 'Submit Date (UTC)'], axis=1, inplace=True)
data_2018.columns

Index(['<strong>Are you self-employed?</strong>',
       'How many employees does your company or organization have?',
       'Is your employer primarily a tech company/organization?',
       'Is your primary role within your company related to tech/IT?',
       'Does your employer provide mental health benefits as part of healthcare coverage?',
       'Do you know the options for mental health care available under your employer-provided health coverage?',
       'Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?',
       'Does your employer offer resources to learn more about mental health disorders and options for seeking help?',
       'Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?',
       'If a mental health issue prompted you to request a medical leave from work, how easy or difficult would it be to a

The columns with indices 50 to 62, 63 to 75, and 76 to 88 correspond, respectively, to the following questions:

`What disorder(s) have you been diagnosed with?`

`If possibly, what disorder(s) do you believe you have?`

`If so, what disorder(s) were you diagnosed with?`.
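
To keep track of which question each of these columns belongs to, the three index ranges can be recorded in a small lookup. This is only a sketch: it assumes the ranges are inclusive (13 columns each) and refer to the column positions stated above. The names `DISORDER_GROUPS` and `question_for_column` are hypothetical.

```python
# Inclusive (start, stop) column index ranges, as stated in the text above
DISORDER_GROUPS = {
    "What disorder(s) have you been diagnosed with?": (50, 62),
    "If possibly, what disorder(s) do you believe you have?": (63, 75),
    "If so, what disorder(s) were you diagnosed with?": (76, 88),
}


def question_for_column(index):
    """Return the parent question for a column index, or None if the
    index falls outside all three groups. Hypothetical helper."""
    for question, (start, stop) in DISORDER_GROUPS.items():
        if start <= index <= stop:
            return question
    return None


group_sizes = [stop - start + 1 for start, stop in DISORDER_GROUPS.values()]
print(group_sizes)  # [13, 13, 13]
print(question_for_column(63))
```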