# Dataset Creation and Cleaning using Web Scraping Techniques

Typeform is a SaaS company for building online forms and surveys. OSMI (Open Sourcing Mental Illness) has used it to create and publish its surveys since 2016. On each publicly available survey report page, every question is followed by the distribution of its responses. We scrape the questions from these pages with the `requests` and `BeautifulSoup` libraries and merge them with the Kaggle dataset for the same year, which holds the individual responses of each participant.

### Data Sources
1. [OSMI 2016 Survey on Typeform](https://osmi.typeform.com/report/Ao6BTw/U76z) and the corresponding [Kaggle Dataset](https://www.kaggle.com/osmi/mental-health-in-tech-2016)
2. [Kaggle Dataset for 2017](https://www.kaggle.com/osmihelp/osmi-mental-health-in-tech-survey-2017)
3. [OSMI 2018 Survey on Typeform](https://osmi.typeform.com/report/xztgPT/NFu2PHjwsMUkkL3h) and the corresponding [Kaggle Dataset](https://www.kaggle.com/osmihelp/osmi-mental-health-in-tech-survey-2018)
4. [OSMI 2020 Survey on Typeform](https://osmi.typeform.com/report/A7mlxC/itVHRYbNRnPqDI9C) and the corresponding [Kaggle Dataset](https://www.kaggle.com/osmihelp/osmi-2020-mental-health-in-tech-survey-results)

### Table of Contents

1. [Work on the 2018 Dataset](#01-Work-on-the-2018-Dataset)

    1. [Acquisition of Survey Questions with Web Scraping](#1.-Acquisition-of-Survey-Questions-with-Web-Scraping)
    2. [Cleaning and Aggregation of Columns](#2.-Cleaning-and-Aggregating-Columns-on-the-Dataset-from-Kaggle)
    3. [Final Dataset](#3.-Final-Attribute-wise-Cleaned-Kaggle-Dataset)
2. [Work on the 2020 Dataset](#02-Work-on-the-2020-Dataset)
3. [Work on the 2019 Dataset](#03-Work-on-the-2019-Dataset)

# 01 Work on the 2018 Dataset

## 1. Acquisition of Survey Questions with Web Scraping

In [52]:
# Import libraries necessary for web scraping and dataset creation
import re
import sys
import json
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from collections import Counter

# Request typeform survey webpage for contents
res = requests.get(
      'https://osmi.typeform.com/report/xztgPT/NFu2PHjwsMUkkL3h')

# Extract response status and validate for successful transaction
status = res.status_code
if status != 200:
    sys.exit(1)
else:
    print("Web scraping response status:\n", status)

# Parse HTML title, head and body contents using BeautifulSoup
soup = BeautifulSoup(res.content, 'html.parser')

print("Dataset Title:\n", soup.title.text)

Web scraping response status:
 200
Dataset Title:
 OSMI Mental Health in Tech Survey 2018


In [53]:
# Select content inside the script element that contains information about the survey questions and answers
script = soup.select('script')[11]

# Set a regex pattern to extract the report's payload and apply the pattern on the script text
# (raw string with the literal dot escaped, so it matches only "window.__REPORT_PAYLOAD")
pattern = re.compile(r"(?<=window\.__REPORT_PAYLOAD = ).*(?=};)")
fields = re.findall(pattern, script.text)

# Complete the string to be able to input to the JSON parser
fields[0] = fields[0] + '}'

# Convert the string to JSON
json_param = json.loads(fields[0])

# Print all the questions asked in the survey
questions = json_param['blocks']
print("Number of questions in the survey:", len(questions))
print('-'*100)

# Process each question title to remove the * and _ characters used as emphasis markers (e.g. bolded words)
question_titles = []
for question in questions:
    question['title'] = question['title'].replace('*', '')
    question['title'] = question['title'].replace('_', '')
    question_titles.append(question['title'])
    print(question['title'])

Number of questions in the survey: 68
----------------------------------------------------------------------------------------------------
Are you self-employed?
How many employees does your company or organization have?
Is your employer primarily a tech company/organization?
Is your primary role within your company related to tech/IT?
Does your employer provide mental health benefits as part of healthcare coverage?
Do you know the options for mental health care available under your employer-provided health coverage?
Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?
Does your employer offer resources to learn more about mental health disorders and options for seeking help?
Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?
If a mental health issue prompted you to request a medical leave from work, how easy or di
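
The cell above depends on the payload sitting in the twelfth `<script>` element and on the assignment ending in `};`. As a minimal sketch of a less position-dependent alternative, assuming only that the page source contains the literal text `window.__REPORT_PAYLOAD = ` followed directly by a JSON object, one can locate that marker in the raw HTML and let `json.JSONDecoder.raw_decode` find where the object ends. The helper name `extract_report_payload` and the sample HTML are hypothetical and used for illustration only.

```python
import json

MARKER = "window.__REPORT_PAYLOAD = "

def extract_report_payload(html: str) -> dict:
    """Return the JSON object assigned to window.__REPORT_PAYLOAD in `html`."""
    start = html.find(MARKER)
    if start == -1:
        raise ValueError("report payload marker not found")
    start += len(MARKER)
    # raw_decode parses one JSON value beginning at `start` and ignores what follows it
    payload, _end = json.JSONDecoder().raw_decode(html, start)
    return payload

# Hypothetical page fragment, not taken from the real survey page
sample_html = (
    "<script>window.__REPORT_PAYLOAD = "
    '{"totalResponsesCount": 2, "blocks": [{"title": "Are you *self-employed*?"}]};'
    "</script>"
)
payload = extract_report_payload(sample_html)
print(payload["totalResponsesCount"], len(payload["blocks"]))
```

With the real page, `res.text` would take the place of `sample_html`; whether the marker format stays stable over time is an assumption.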

In [54]:
# Analyze the types of questions asked in the survey
question_types = []
for question in questions:
    question_types.append(question['type'])
Counter(question_types)

Counter({'yes_no': 16,
         'multiple_choice': 40,
         'opinion_scale': 7,
         'rating': 1,
         'dropdown': 4})

In [55]:
# Get the total number of participants in the survey
total_resp = json_param['totalResponsesCount']

# Get the number of participants that answered each question
counts = []
for question in questions:
    counts.append(question['summary']['count'])
print("Number of respondents for each question out of {0} participants:\n {1}".format(total_resp, counts))

Number of respondents for each question out of 417 participants:
 [417, 361, 361, 361, 361, 361, 361, 361, 361, 361, 361, 361, 361, 361, 360, 360, 361, 361, 56, 56, 56, 56, 56, 56, 56, 41, 417, 363, 363, 363, 363, 363, 363, 363, 363, 363, 363, 362, 363, 363, 363, 417, 191, 0, 81, 188, 415, 417, 417, 417, 417, 417, 417, 417, 417, 417, 51, 16, 417, 417, 417, 417, 417, 417, 311, 311, 417, 314]
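
The raw counts are easier to read next to the question titles. A question answered by far fewer than all participants may have been shown only to a subset of them, although the payload fields read here do not confirm that. The sketch below uses hypothetical sample blocks that only mirror the keys read above (`title`, `type`, `summary.count`); the function name `response_rates` is likewise an assumption.

```python
# Hypothetical blocks mirroring the keys read from json_param['blocks']
sample_blocks = [
    {"title": "Are you self-employed?", "type": "yes_no", "summary": {"count": 417}},
    {"title": "How many employees does your company or organization have?",
     "type": "multiple_choice", "summary": {"count": 361}},
    {"title": "Hypothetical follow-up question", "type": "yes_no", "summary": {"count": 56}},
]
sample_total = 417

def response_rates(blocks, total):
    """Map each question title to the fraction of participants who answered it."""
    return {b["title"]: b["summary"]["count"] / total for b in blocks}

rates = response_rates(sample_blocks, sample_total)
for title, rate in rates.items():
    print(f"{rate:6.1%}  {title}")
```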


## 2. Cleaning and Aggregating Columns on the Dataset from Kaggle

### Importing and Overview

In [56]:
# Import Kaggle dataset and examine the columns
data_2018 = pd.read_csv('datasets/2018_data_kaggle.csv')

# What is the total number of columns?
print("Number of columns:", len(data_2018.columns))
print('-'*100)

# What are the columns that aren't in question format?
for count, i in enumerate(data_2018.columns):
    if '?' not in i:
        print(count, i)

Number of columns: 123
----------------------------------------------------------------------------------------------------
0 #
14 Describe the conversation you had with your employer about your mental health, including their reactions and what actions were taken to address your mental health issue/questions.
17 Describe the conversation with coworkers you had about your mental health including their reactions.
19 Describe the conversation your coworker had with you about their mental health (please do not use names).
40 Describe the conversation you had with your previous employer about your mental health, including their reactions and actions taken to address your mental health issue/questions.
43 Describe the conversation you had with your previous coworkers about your mental health including their reactions.
45 Describe the conversation your coworker had with you about their mental health (please do not use names)..1
50 Anxiety Disorder (Generalized, Social, Phobia, etc)
51 Mood Di

### Analysis
The output shows that the Kaggle dataset contains columns that are not among the questions scraped from the survey webpage. Some of these columns are not questions at all but the individual answer options of a question (for example, `Anxiety Disorder (Generalized, Social, Phobia, etc)`). We therefore merge each such group of columns into a single column named after the question it belongs to.

Because we intend to aggregate the datasets from several years, we also merge the free-text descriptive questions into a single column, `comments`. Finally, we drop the first column, which holds the participant ID, and a few other columns that are not relevant to our study.
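
The suffixed names that appear in the Kaggle export (for example `Other.3` or `Why or why not?.1`) are consistent with how `pandas.read_csv` treats repeated header names: by default, later duplicates receive a numeric suffix. A minimal sketch with a made-up CSV, assuming default `read_csv` settings:

```python
import io
import pandas as pd

# Made-up CSV in which the same header appears three times, as can happen when
# one answer option is listed under several survey questions
csv_text = "Other,Other,Other\na,b,c\n"
toy = pd.read_csv(io.StringIO(csv_text))
print(list(toy.columns))
```

This is why the answer options of the second and third disorder questions can be addressed by appending `.1` and `.2` to the option names of the first one.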

### Column Name Cleaning

In [57]:
data_2018.drop(['#', 'Start Date (UTC)', 'Submit Date (UTC)', 'Network ID'], axis=1, inplace=True)

The columns with indices 50 to 62, 63 to 75 and 76 to 88 correspond, respectively, to the questions

`What disorder(s) have you been diagnosed with?`

`If possibly, what disorder(s) do you believe you have?`

`If so, what disorder(s) were you diagnosed with?`.

We therefore merge each of these groups of columns into a new column whose heading is the corresponding question.

In [58]:
# Setup the column names that are to be merged

# Columns to be merged and replaced by `What disorder(s) have you been diagnosed with?`
cols_1 = ['Anxiety Disorder (Generalized, Social, Phobia, etc)',
          'Mood Disorder (Depression, Bipolar Disorder, etc)',
          'Psychotic Disorder (Schizophrenia, Schizoaffective, etc)',
          'Eating Disorder (Anorexia, Bulimia, etc)',
          'Attention Deficit Hyperactivity Disorder',
          'Personality Disorder (Borderline, Antisocial, Paranoid, etc)',
          'Obsessive-Compulsive Disorder',
          'Post-Traumatic Stress Disorder',
          'Stress Response Syndromes',
          'Dissociative Disorder',
          'Substance Use Disorder',
          'Addictive Disorder',
          'Other']
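# In the second and third groups this option is spelled 'Post-traumatic' (lowercase t),
# so rename it to follow the '.1' / '.2' suffix scheme used for cols_2 and cols_3
data_2018.rename(columns = {'Post-traumatic Stress Disorder':'Post-Traumatic Stress Disorder.1',
                           'Post-traumatic Stress Disorder.1':'Post-Traumatic Stress Disorder.2'}, inplace = True)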

# Columns to be merged and replaced by `If possibly, what disorder(s) do you believe you have?`
cols_2 = [i + '.1' for i in cols_1]

# Columns to be merged and replaced by `If so, what disorder(s) were you diagnosed with?`
cols_3 = [i + '.2' for i in cols_1]

# Columns to be merged and replaced by `comments`
cols_4 = ['Describe the conversation you had with your employer about your mental health, including their reactions and what actions were taken to address your mental health issue/questions.',
          'Describe the conversation with coworkers you had about your mental health including their reactions.',
          'Describe the conversation your coworker had with you about their mental health (please do not use names).',
          'Describe the conversation you had with your previous employer about your mental health, including their reactions and actions taken to address your mental health issue/questions.',
          'Describe the conversation you had with your previous coworkers about your mental health including their reactions.',
          'Describe the conversation your coworker had with you about their mental health (please do not use names)..1',
          'Describe the circumstances of the badly handled or unsupportive response.',
          'Describe the circumstances of the supportive or well handled response.',
          'Briefly describe what you think the industry as a whole and/or employers could do to improve mental health support for employees.',
          'If there is anything else you would like to tell us that has not been covered by the survey questions, please use this space to do so.',
          'Other.3']

# Names of the attributes after transformation 
q_1 = 'What disorder(s) have you been diagnosed with?'
q_2 = 'If possibly, what disorder(s) do you believe you have?'
q_3 = 'If so, what disorder(s) were you diagnosed with?'
q_4 = 'comments'

In [59]:
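# Merge a list of columns into a single new column with the given name
def merge_columns(df, new_col, col_list):
    '''
    Merge the columns in col_list into one comma-separated column, in place.

    Input:
        df - Dataframe being worked on (Pandas Dataframe)
        new_col - Name of the merged column to create (str)
        col_list - Names of the columns to merge and then drop (list of str)
    '''
    # Join the non-missing values of each row; rows without any value become NaN
    df[new_col] = df[col_list].apply(lambda x: ', '.join(x.dropna().astype(str)), axis=1)
    df[new_col] = df[new_col].apply(lambda value: np.nan if value == '' else value)
    df.drop(col_list, axis=1, inplace=True)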

In [60]:
merge_columns(data_2018, q_1, cols_1)
merge_columns(data_2018, q_2, cols_2)
merge_columns(data_2018, q_3, cols_3)
merge_columns(data_2018, q_4, cols_4)
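
To make the effect of the merge concrete, here is a small self-contained sketch of the same idea on a toy dataframe. The column names and values are made up, and the code is an equivalent illustration rather than the `merge_columns` function itself.

```python
import numpy as np
import pandas as pd

# Toy data: each option column holds its own label when ticked, NaN otherwise
toy = pd.DataFrame({
    "Anxiety Disorder": ["Anxiety Disorder", np.nan, np.nan],
    "Mood Disorder": ["Mood Disorder", "Mood Disorder", np.nan],
    "age": [30, 41, 25],
})
option_cols = ["Anxiety Disorder", "Mood Disorder"]
question = "What disorder(s) have you been diagnosed with?"

# Join the non-missing labels of each row; rows without any label become NaN
merged = toy[option_cols].apply(lambda row: ", ".join(row.dropna().astype(str)), axis=1)
toy[question] = merged.replace("", np.nan)
toy = toy.drop(columns=option_cols)
print(toy)
```

The first row ends up with both labels joined by a comma, the second with one label, and the third stays missing.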

In [61]:
print("Number of columns:", len(data_2018.columns))
for i in range(len(data_2018.columns)):
    for string in ['<strong>', '</strong>', '<em>', '</em>']:
        data_2018.rename(columns = {data_2018.columns[i]:data_2018.columns[i].replace(string, '')}, inplace=True)

Number of columns: 73
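
The tag removal above renames one column per tag inside a nested loop. As a sketch of a more compact alternative, assuming the only markup in the column names is `<strong>` and `<em>`, a single regular expression can strip all four tags at once. The helper name `strip_markup` and the sample column names are hypothetical.

```python
import re

# Matches <strong>, </strong>, <em> and </em>
TAG_PATTERN = re.compile(r"</?(?:strong|em)>")

def strip_markup(name: str) -> str:
    """Remove <strong>/<em> tags from a column name."""
    return TAG_PATTERN.sub("", name)

sample_columns = ["Are you <strong>self-employed</strong>?", "What is your <em>age</em>?"]
cleaned = [strip_markup(c) for c in sample_columns]
print(cleaned)
```

Since `DataFrame.rename` accepts a callable for `columns`, such a function could be applied as `data_2018.rename(columns=strip_markup)`.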


In [62]:
# Analyze the attributes that are present in the Kaggle dataset and not in the scraped dataset
print([i for i in data_2018.columns if i not in question_titles])

['Why or why not?', 'Why or why not?.1', 'What is your age?', 'What is your gender?', 'comments']


The attributes age, gender and `comments` are needed for the final aggregation, so we retain them as they are. We do not need the explanatory `Why or why not?` questions or the question about willingness to be interviewed, so we drop them.

In [63]:
data_2018.drop(['Why or why not?', 'Why or why not?.1',
               'Would you be willing to talk to one of us more extensively about your experiences with mental health issues in the tech industry? (Note that all interview responses would be used anonymously and only with your permission.)']
               , axis=1, inplace=True)

## 3. Final Attribute-wise Cleaned Kaggle Dataset

In [64]:
print(*[i for i in data_2018.columns], sep='\n')

Are you self-employed?
How many employees does your company or organization have?
Is your employer primarily a tech company/organization?
Is your primary role within your company related to tech/IT?
Does your employer provide mental health benefits as part of healthcare coverage?
Do you know the options for mental health care available under your employer-provided health coverage?
Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?
Does your employer offer resources to learn more about mental health disorders and options for seeking help?
Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?
If a mental health issue prompted you to request a medical leave from work, how easy or difficult would it be to ask for that leave?
Would you feel more comfortable talking to your coworkers about your physical health or your men

In [65]:
print("Questions on the Kaggle dataset and not on the survey web page:")
print(*[(str(count) + '. ' + str(i)) for count, i in enumerate(data_2018.columns) 
        if i not in question_titles], sep='\n')
print("Questions on the survey web page and not on the Kaggle dataset:")
print(*[(str(count) + '. ' + str(i)) for count, i in enumerate(question_titles) 
        if i not in data_2018.columns], sep='\n')

Questions on the Kaggle dataset and not on the survey web page:
59. What is your age?
60. What is your gender?
69. comments
Questions on the survey web page and not on the Kaggle dataset:
62. Would you be willing to talk to one of us more extensively about your experiences with mental health issues in the tech industry? (Note that all interview responses would be used anonymously and only with your permission.)


As observed, three columns in the cleaned Kaggle dataset (age, gender and `comments`) are not present on the survey webpage, while the interview question that we dropped appears only on the webpage.

In [66]:
data_2018.head()

Unnamed: 0,Are you self-employed?,How many employees does your company or organization have?,Is your employer primarily a tech company/organization?,Is your primary role within your company related to tech/IT?,Does your employer provide mental health benefits as part of healthcare coverage?,Do you know the options for mental health care available under your employer-provided health coverage?,"Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?",Does your employer offer resources to learn more about mental health disorders and options for seeking help?,Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?,"If a mental health issue prompted you to request a medical leave from work, how easy or difficult would it be to ask for that leave?",...,What is your gender?,What country do you live in?,What US state or territory do you live in?,What is your race?,What country do you work in?,What US state or territory do you work in?,What disorder(s) have you been diagnosed with?,"If possibly, what disorder(s) do you believe you have?","If so, what disorder(s) were you diagnosed with?",comments
0,0,More than 1000,1.0,0.0,Yes,Yes,Yes,Yes,Yes,Somewhat difficult,...,Female,Canada,,,Canada,,,"Anxiety Disorder (Generalized, Social, Phobia,...",,I informed my employer that I was very sick du...
1,0,More than 1000,1.0,1.0,Yes,Yes,No,I don't know,I don't know,Somewhat difficult,...,male,United States of America,Massachusetts,White,United States of America,Massachusetts,,,"Anxiety Disorder (Generalized, Social, Phobia,...","it's a comfort thing at this point imo, needs ..."
2,0,6-25,0.0,1.0,Yes,Yes,No,No,I don't know,Somewhat easy,...,Male,United States of America,Florida,White,United States of America,Florida,,,,We had a coworker have a drastic turn in behav...
3,0,6-25,1.0,1.0,No,No,No,No,I don't know,Neither easy nor difficult,...,male,Norway,,,Norway,,,,,
4,0,26-100,1.0,1.0,Yes,Yes,Yes,Yes,Yes,Somewhat easy,...,Ostensibly Male,United States of America,Tennessee,White,United States of America,Tennessee,,,"Anxiety Disorder (Generalized, Social, Phobia,...","""Hey, I have a lot of anxiety sometimes."" ""Oh..."


In [67]:
data_2018.to_csv('datasets/2018_data.csv')

# 02 Work on the 2020 Dataset

## 1. Acquisition of Survey Questions with Web Scraping

In [68]:
# Request typeform survey webpage for contents
res = requests.get(
      'https://osmi.typeform.com/report/A7mlxC/itVHRYbNRnPqDI9C')

# Extract response status and validate for successful transaction
status = res.status_code
if status != 200:
    sys.exit(1)
else:
    print("Web scraping response status:\n", status)

# Parse HTML title, head and body contents using BeautifulSoup
soup = BeautifulSoup(res.content, 'html.parser')

print("Dataset Title:\n", soup.title.text)

Web scraping response status:
 200
Dataset Title:
 OSMI Mental Health in Tech Survey 2020


In [69]:
# Select content inside the script element that contains information about the survey questions and answers
script = soup.select('script')[11]

# Set a regex pattern to extract the report's payload and apply the pattern on the script text
# (raw string with the literal dot escaped, so it matches only "window.__REPORT_PAYLOAD")
pattern = re.compile(r"(?<=window\.__REPORT_PAYLOAD = ).*(?=};)")
fields = re.findall(pattern, script.text)

# Complete the string to be able to input to the JSON parser
fields[0] = fields[0] + '}'

# Convert the string to JSON
json_param = json.loads(fields[0])

# Print all the questions asked in the survey
questions = json_param['blocks']
print("Number of questions in the survey:", len(questions))
print('-'*100)

# Process each question title to remove the * and _ characters used as emphasis markers (e.g. bolded words)
question_titles = []
for question in questions:
    question['title'] = question['title'].replace('*', '')
    question['title'] = question['title'].replace('_', '')
    question_titles.append(question['title'])
    print(question['title'])

Number of questions in the survey: 68
----------------------------------------------------------------------------------------------------
Are you self-employed?
How many employees does your company or organization have?
Is your employer primarily a tech company/organization?
Is your primary role within your company related to tech/IT?
Does your employer provide mental health benefits as part of healthcare coverage?
Do you know the options for mental health care available under your employer-provided health coverage?
Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?
Does your employer offer resources to learn more about mental health disorders and options for seeking help?
Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?
If a mental health issue prompted you to request a medical leave from work, how easy or di

In [70]:
# Analyze the types of questions asked in the survey
question_types = []
for question in questions:
    question_types.append(question['type'])
Counter(question_types)

Counter({'yes_no': 16,
         'multiple_choice': 40,
         'opinion_scale': 7,
         'rating': 1,
         'dropdown': 4})

In [71]:
# Get the total number of participants in the survey
total_resp = json_param['totalResponsesCount']

# Get the number of participants that answered each question
counts = []
for question in questions:
    counts.append(question['summary']['count'])
print("Number of respondents for each question out of {0} participants:\n {1}".format(total_resp, counts))

Number of respondents for each question out of 180 participants:
 [180, 155, 155, 155, 155, 155, 155, 155, 155, 155, 155, 155, 155, 155, 155, 154, 155, 155, 25, 25, 25, 25, 25, 25, 25, 20, 180, 135, 135, 135, 135, 135, 135, 135, 135, 135, 135, 133, 133, 135, 135, 180, 51, 0, 47, 47, 176, 180, 180, 180, 180, 180, 180, 180, 180, 180, 25, 15, 180, 180, 180, 180, 180, 180, 65, 65, 180, 67]


## 2. Cleaning and Aggregating Columns on the Dataset from Kaggle

### Importing and Overview

In [72]:
# Import Kaggle dataset and examine the columns
data_2020 = pd.read_csv('datasets/2020_data_kaggle.csv')

# What is the total number of columns?
print("Number of columns:", len(data_2020.columns))
print('-'*100)

# What are the columns that aren't in question format?
for count, i in enumerate(data_2020.columns):
    if '?' not in i:
        print(count, i)

Number of columns: 120
----------------------------------------------------------------------------------------------------
0 #
14 Describe the conversation you had with your employer about your mental health, including their reactions and what actions were taken to address your mental health issue/questions.
17 Describe the conversation with coworkers you had about your mental health including their reactions.
19 Describe the conversation your coworker had with you about their mental health (please do not use names).
40 Describe the conversation you had with your previous employer about your mental health, including their reactions and actions taken to address your mental health issue/questions.
43 Describe the conversation you had with your previous coworkers about your mental health including their reactions.
45 Describe the conversation your coworker had with you about their mental health (please do not use names)..1
50 Anxiety Disorder (Generalized, Social, Phobia, etc)
51 Mood Di

### Analysis
As with the 2018 data, the output shows that the Kaggle dataset contains columns that are not among the questions scraped from the survey webpage. Some of these columns are not questions at all but the individual answer options of a question (for example, `Anxiety Disorder (Generalized, Social, Phobia, etc)`). We therefore merge each such group of columns into a single column named after the question it belongs to.

Because we intend to aggregate the datasets from several years, we also merge the free-text descriptive questions into a single column, `comments`. Finally, we drop the first column, which holds the participant ID, and a few other columns that are not relevant to our study.

### Column Name Cleaning

In [73]:
data_2020.drop(['#'], axis=1, inplace=True)

The columns with indices 50 to 62, 63 to 75 and 76 to 88 correspond, respectively, to the questions

`What disorder(s) have you been diagnosed with?`

`If possibly, what disorder(s) do you believe you have?`

`If so, what disorder(s) were you diagnosed with?`.

We therefore merge each of these groups of columns into a new column whose heading is the corresponding question.

In [74]:
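# In the second and third groups this option is spelled 'Post-traumatic' (lowercase t),
# so rename it to follow the '.1' / '.2' suffix scheme used for cols_2 and cols_3
data_2020.rename(columns = {'Post-traumatic Stress Disorder':'Post-Traumatic Stress Disorder.1',
                           'Post-traumatic Stress Disorder.1':'Post-Traumatic Stress Disorder.2'}, inplace = True)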

In [75]:
cols_4 = ['Describe the conversation you had with your employer about your mental health, including their reactions and what actions were taken to address your mental health issue/questions.',
          'Describe the conversation with coworkers you had about your mental health including their reactions.',
          'Describe the conversation your coworker had with you about their mental health (please do not use names).',
          'Describe the conversation you had with your previous employer about your mental health, including their reactions and actions taken to address your mental health issue/questions.',
          'Describe the conversation you had with your previous coworkers about your mental health including their reactions.',
          'Describe the conversation your coworker had with you about their mental health (please do not use names)..1',
          'Describe the circumstances of the badly handled or unsupportive response.',
          'Describe the circumstances of the supportive or well handled response.',
          'Briefly describe what you think the industry as a whole and/or employers could do to improve mental health support for employees.',
          'If there is anything else you would like to tell us that has not been covered by the survey questions, please use this space to do so.',
          'Other.3']

In [76]:
merge_columns(data_2020, q_1, cols_1)
merge_columns(data_2020, q_2, cols_2)
merge_columns(data_2020, q_3, cols_3)
merge_columns(data_2020, q_4, cols_4)

In [77]:
print("Number of columns:", len(data_2020.columns))
for i in range(len(data_2020.columns)):
    for string in ['<strong>', '</strong>', '<em>', '</em>', '*', '_']:
        data_2020.rename(columns = {data_2020.columns[i]:data_2020.columns[i].replace(string, '')}, inplace=True)

Number of columns: 73


In [78]:
# Analyze the attributes that are present in the Kaggle dataset and not in the scraped dataset
print([i for i in data_2020.columns if i not in question_titles])

['Why or why not?', 'Why or why not?.1', 'What is your age?', 'What is your gender?', 'comments']


The attributes age, gender and `comments` are needed for the final aggregation, so we retain them as they are. We do not need the explanatory `Why or why not?` questions or the question about willingness to be interviewed, so we drop them.

In [79]:
data_2020.drop(['Why or why not?', 'Why or why not?.1',
               'Would you be willing to talk to one of us more extensively about your experiences with mental health issues in the tech industry? (Note that all interview responses would be used anonymously and only with your permission.)'], axis=1, inplace=True)

## 3. Final Attribute-wise Cleaned Kaggle Dataset

In [80]:
print(len(data_2020.columns))
print(*[i for i in data_2020.columns], sep='\n')

70
Are you self-employed?
How many employees does your company or organization have?
Is your employer primarily a tech company/organization?
Is your primary role within your company related to tech/IT?
Does your employer provide mental health benefits as part of healthcare coverage?
Do you know the options for mental health care available under your employer-provided health coverage?
Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?
Does your employer offer resources to learn more about mental health disorders and options for seeking help?
Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?
If a mental health issue prompted you to request a medical leave from work, how easy or difficult would it be to ask for that leave?
Would you feel more comfortable talking to your coworkers about your physical health or your 

In [81]:
print("Questions on the Kaggle dataset and not on the survey web page:")
print(*[(str(count) + '. ' + str(i)) for count, i in enumerate(data_2020.columns) 
        if i not in question_titles], sep='\n')
print("Questions on the survey web page and not on the Kaggle dataset:")
print(*[(str(count) + '. ' + str(i)) for count, i in enumerate(question_titles) 
        if i not in data_2020.columns], sep='\n')

Questions on the Kaggle dataset and not on the survey web page:
59. What is your age?
60. What is your gender?
69. comments
Questions on the survey web page and not on the Kaggle dataset:
62. Would you be willing to talk to one of us more extensively about your experiences with mental health issues in the tech industry? (Note that all interview responses would be used anonymously and only with your permission.)


As observed, three columns in the cleaned Kaggle dataset (age, gender and `comments`) are not present on the survey webpage, while the interview question that we dropped appears only on the webpage.

In [82]:
print("Questions on the 2018 dataset and not on the 2020 one:")
print(*[(str(count) + '. ' + str(i)) for count, i in enumerate(data_2018.columns) 
        if i not in data_2020.columns], sep='\n')

Questions on the 2018 dataset and not on the 2020 one:
55. If they knew you suffered from a mental health disorder, how do you think that team members/co-workers would react?
57. Have you observed or experienced supportive or well handled response to a mental health issue in your current or previous workplace?


In [83]:
data_2020.rename(columns = {
    'Have you observed or experienced a supportive or well handled response to a mental health issue in your current or previous workplace?'
    :'Have you observed or experienced supportive or well handled response to a mental health issue in your current or previous workplace?',
    'If they knew you suffered from a mental health disorder, how do you think that your team members/co-workers would react?':
    'If they knew you suffered from a mental health disorder, how do you think that team members/co-workers would react?'}, inplace=True)
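
The two renamed questions differ from their 2018 counterparts by only a word or two. As a sketch of how such near-matches could be spotted automatically instead of by eye, the standard library's `difflib.get_close_matches` can propose the closest name for every column that has no exact counterpart. The helper name `suggest_renames`, the cutoff of 0.9 and the shortened column lists are assumptions for illustration; any suggestion should be reviewed by hand before renaming.

```python
import difflib

# Shortened, illustrative column lists for the two years
cols_2018 = [
    "Have you observed or experienced supportive or well handled response to a mental health issue in your current or previous workplace?",
    "What is your age?",
]
cols_2020 = [
    "Have you observed or experienced a supportive or well handled response to a mental health issue in your current or previous workplace?",
    "What is your age?",
]

def suggest_renames(source_cols, target_cols, cutoff=0.9):
    """Map each source column missing from target_cols to its closest target name."""
    unmatched_targets = [c for c in target_cols if c not in source_cols]
    suggestions = {}
    for col in source_cols:
        if col in target_cols:
            continue  # exact match, nothing to rename
        match = difflib.get_close_matches(col, unmatched_targets, n=1, cutoff=cutoff)
        if match:
            suggestions[col] = match[0]
    return suggestions

renames = suggest_renames(cols_2020, cols_2018)
print(renames)
```

The resulting dictionary has the same shape as the mapping passed to `data_2020.rename(columns=...)` above.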

In [84]:
print("Questions on the 2018 dataset and not on the 2020 one:")
print(*[(str(count) + '. ' + str(i)) for count, i in enumerate(data_2018.columns) 
        if i not in data_2020.columns], sep='\n')
print("Questions on the 2020 dataset and not on the 2018 one:")
print(*[(str(count) + '. ' + str(i)) for count, i in enumerate(data_2020.columns) 
        if i not in data_2018.columns], sep='\n')

Questions on the 2018 dataset and not on the 2020 one:

Questions on the 2020 dataset and not on the 2018 one:



In [85]:
data_2020.head()

Unnamed: 0,Are you self-employed?,How many employees does your company or organization have?,Is your employer primarily a tech company/organization?,Is your primary role within your company related to tech/IT?,Does your employer provide mental health benefits as part of healthcare coverage?,Do you know the options for mental health care available under your employer-provided health coverage?,"Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?",Does your employer offer resources to learn more about mental health disorders and options for seeking help?,Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?,"If a mental health issue prompted you to request a medical leave from work, how easy or difficult would it be to ask for that leave?",...,What is your gender?,What country do you live in?,What US state or territory do you live in?,What is your race?,What country do you work in?,What US state or territory do you work in?,What disorder(s) have you been diagnosed with?,"If possibly, what disorder(s) do you believe you have?","If so, what disorder(s) were you diagnosed with?",comments
0,1,,,,,,,,,,...,Male,United States of America,Connecticut,White,United States of America,Connecticut,,,,
1,1,,,,,,,,,,...,female,Russia,,,Russia,,,,"Anxiety Disorder (Generalized, Social, Phobia,...",
2,1,,,,,,,,,,...,Male,India,,,India,,,,,"Some times with my boss, he will response prop..."
3,1,,,,,,,,,,...,Female,Canada,,,Canada,,,,,
4,1,,,,,,,,,,...,F,Canada,,,Canada,,,,,"taking less to that person or ignorance, behav..."


In [86]:
data_2020.to_csv('datasets/2020_data.csv')

# 03 Work on the 2019 Dataset

In [87]:
data_2019 = pd.read_csv('datasets/2019_data_kaggle.csv')

# What is the total number of columns?
print("Number of columns:", len(data_2019.columns))
print('-'*100)

# Which columns are not phrased as questions?
for count, i in enumerate(data_2019.columns):
    if '?' not in i:
        print(count, i)

Number of columns: 82
----------------------------------------------------------------------------------------------------
13 Describe the conversation you had with your employer about your mental health, including their reactions and what actions were taken to address your mental health issue/questions.
16 Describe the conversation with coworkers you had about your mental health including their reactions.
18 Describe the conversation your coworker had with you about their mental health (please do not use names).
39 Describe the conversation you had with your previous employer about your mental health, including their reactions and actions taken to address your mental health issue/questions.
42 Describe the conversation you had with your previous coworkers about your mental health including their reactions.
44 Describe the conversation your coworker had with you about their mental health (please do not use names)..1
68 Describe the circumstances of the badly handled or unsupportive res

In [88]:
cols_5 = ['Describe the conversation you had with your employer about your mental health, including their reactions and what actions were taken to address your mental health issue/questions.',
          'Describe the conversation with coworkers you had about your mental health including their reactions.',
          'Describe the conversation your coworker had with you about their mental health (please do not use names).',
          'Describe the conversation you had with your previous employer about your mental health, including their reactions and actions taken to address your mental health issue/questions.',
          'Describe the conversation you had with your previous coworkers about your mental health including their reactions.',
          'Describe the conversation your coworker had with you about their mental health (please do not use names)..1',
          'Describe the circumstances of the badly handled or unsupportive response.',
          'Describe the circumstances of the supportive or well handled response.',
          'Briefly describe what you think the industry as a whole and/or employers could do to improve mental health support for employees.',
          'If there is anything else you would like to tell us that has not been covered by the survey questions, please use this space to do so.']

merge_columns(data_2019, 'comments', cols_5)
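
`merge_columns` is defined earlier in the notebook and is not reproduced here. The column count dropping from 82 to 73 is consistent with ten free-text columns being replaced by a single `comments` column. As a rough illustration only, here is a minimal sketch of such a helper, assuming it joins the non-null answers of each row and drops the originals. The name `merge_text_columns`, the separator, and the demo frame are hypothetical and are not the notebook's implementation.

```python
import pandas as pd

def merge_text_columns(df, new_col, cols, sep=' | '):
    """Hypothetical stand-in for merge_columns: join non-null text answers per row."""
    def join_row(row):
        # Keep only the answers that were actually filled in
        parts = [str(value) for value in row if pd.notna(value)]
        return sep.join(parts) if parts else None

    df[new_col] = df[cols].apply(join_row, axis=1)
    # Remove the original free-text columns once they are merged
    df.drop(columns=cols, inplace=True)
    return df

# Tiny demo frame (made-up data, not survey responses)
demo = pd.DataFrame({'q1': ['a', None], 'q2': ['b', None], 'keep': [1, 2]})
merge_text_columns(demo, 'comments', ['q1', 'q2'])
print(demo)
```

Rows with no free-text answers end up with a missing value in the merged column rather than an empty string, which keeps them easy to filter with `isna()`.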

In [89]:
print("Number of columns:", len(data_2019.columns))
# Strip leftover HTML/markdown emphasis markup from the column names
for string in ['<strong>', '</strong>', '<em>', '</em>', '*', '_']:
    data_2019.columns = data_2019.columns.str.replace(string, '', regex=False)

Number of columns: 73


In [90]:
# Analyze the attributes that are present in the Kaggle dataset and not in the scraped dataset
print([i for i in data_2019.columns if i not in question_titles])

['Why or why not?', 'Why or why not?.1', 'What is your age?', 'What is your gender?', 'comments']


In [91]:
# Drop the two 'Why or why not?' follow-ups and the interview follow-up question
data_2019.drop(['Why or why not?', 'Why or why not?.1',
                'Would you be willing to talk to one of us more extensively about your experiences with mental health issues in the tech industry? (Note that all interview responses would be used anonymously and only with your permission.)'], axis=1, inplace=True)

In [92]:
print(len(data_2019.columns))
print(*data_2019.columns, sep='\n')

70
Are you self-employed?
How many employees does your company or organization have?
Is your employer primarily a tech company/organization?
Is your primary role within your company related to tech/IT?
Does your employer provide mental health benefits as part of healthcare coverage?
Do you know the options for mental health care available under your employer-provided health coverage?
Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?
Does your employer offer resources to learn more about mental health disorders and options for seeking help?
Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?
If a mental health issue prompted you to request a medical leave from work, how easy or difficult would it be to ask for that leave?
Would you feel more comfortable talking to your coworkers about your physical health or your 

In [94]:
print("Questions on the 2018 dataset and not on the 2019 one:")
print(*[(str(count) + '. ' + str(i)) for count, i in enumerate(data_2018.columns) 
        if i not in data_2019.columns], sep='\n')
print("Questions on the 2019 dataset and not on the 2018 one:")
print(*[(str(count) + '. ' + str(i)) for count, i in enumerate(data_2019.columns) 
        if i not in data_2018.columns], sep='\n')

Questions on the 2018 dataset and not on the 2019 one:
55. If they knew you suffered from a mental health disorder, how do you think that team members/co-workers would react?
57. Have you observed or experienced supportive or well handled response to a mental health issue in your current or previous workplace?
Questions on the 2019 dataset and not on the 2018 one:
58. If they knew you suffered from a mental health disorder, how do you think that your team members/co-workers would react?
60. Have you observed or experienced a supportive or well handled response to a mental health issue in your current or previous workplace?


In [95]:
data_2019.rename(columns = {
    'Have you observed or experienced a supportive or well handled response to a mental health issue in your current or previous workplace?'
    :'Have you observed or experienced supportive or well handled response to a mental health issue in your current or previous workplace?',
    'If they knew you suffered from a mental health disorder, how do you think that your team members/co-workers would react?':
    'If they knew you suffered from a mental health disorder, how do you think that team members/co-workers would react?'}, inplace=True)
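
The two renamed questions differ from their 2018 counterparts only by a single word ("a", "your"), and the mapping was written by hand after reading both lists. A minimal sketch of how such near-identical titles could be surfaced automatically with the standard library's `difflib.get_close_matches` follows. The helper name `suggest_renames` and the `0.9` cutoff are assumptions for illustration; any suggested mapping should still be reviewed before it is applied.

```python
from difflib import get_close_matches

def suggest_renames(reference_cols, other_cols, cutoff=0.9):
    """Map each unmatched column in other_cols to its closest unmatched reference column."""
    reference_only = [c for c in reference_cols if c not in other_cols]
    mapping = {}
    for col in other_cols:
        if col in reference_cols:
            continue  # already identical, nothing to rename
        match = get_close_matches(col, reference_only, n=1, cutoff=cutoff)
        if match:
            mapping[col] = match[0]
    return mapping

# Question titles taken from the comparison output above, plus one shared column
react_2018 = 'If they knew you suffered from a mental health disorder, how do you think that team members/co-workers would react?'
react_2019 = 'If they knew you suffered from a mental health disorder, how do you think that your team members/co-workers would react?'
support_2018 = 'Have you observed or experienced supportive or well handled response to a mental health issue in your current or previous workplace?'
support_2019 = 'Have you observed or experienced a supportive or well handled response to a mental health issue in your current or previous workplace?'

cols_2018 = ['What is your age?', react_2018, support_2018]
cols_2019 = ['What is your age?', react_2019, support_2019]

renames = suggest_renames(cols_2018, cols_2019)
for old, new in renames.items():
    print(old, '->', new, sep='\n', end='\n\n')
```

After a manual check, the resulting dictionary can be passed straight to `data_2019.rename(columns=renames, inplace=True)`.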

In [96]:
print("Questions on the 2018 dataset and not on the 2019 one:")
print(*[(str(count) + '. ' + str(i)) for count, i in enumerate(data_2018.columns) 
        if i not in data_2019.columns], sep='\n')
print("Questions on the 2019 dataset and not on the 2018 one:")
print(*[(str(count) + '. ' + str(i)) for count, i in enumerate(data_2019.columns) 
        if i not in data_2018.columns], sep='\n')

Questions on the 2018 dataset and not on the 2019 one:

Questions on the 2019 dataset and not on the 2018 one:



In [97]:
data_2019.to_csv('datasets/2019_data.csv')
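
One detail to keep in mind when these saved files are loaded again: `to_csv` writes the row index as an unnamed first column by default, so a plain `read_csv` brings it back as an extra `Unnamed: 0` column. The sketch below illustrates this with a made-up two-row frame and an in-memory buffer instead of the files on disk; it is not part of the original workflow.

```python
import io
import pandas as pd

# Made-up frame standing in for one of the cleaned yearly datasets
df = pd.DataFrame({'Are you self-employed?': [1, 0]})

buffer = io.StringIO()
df.to_csv(buffer)  # the index is written as an unnamed first column

buffer.seek(0)
naive = pd.read_csv(buffer)  # index comes back as a regular column

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # index is restored as the index

print(list(naive.columns))
print(list(restored.columns))
```

Reading with `index_col=0` (or saving with `index=False`) keeps the column lists of the yearly datasets comparable when they are merged later.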