# Mental Health in the Tech Industry Overview

Source: https://www.kaggle.com/datasets/anth7310/mental-health-in-the-tech-industry

Let's go through the database to see what it is about and do some cleaning. Please treat it as a scratchpad. Remember to back up the database before applying any changes.
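A minimal sketch of how the backup mentioned above could be done with the standard library. The function name `backup_database` and the backup file name in the usage note are assumptions for illustration, not part of the original notebook.

```python
import sqlite3

def backup_database(src_path, dst_path):
    ''' Copy the SQLite database at src_path into the file at dst_path. '''
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    try:
        # Connection.backup copies the whole database (Python 3.7+)
        src.backup(dst)
    finally:
        dst.close()
        src.close()
```

It could be called once before the cleaning cells, for example `backup_database('DB/mental_health.sqlite', 'DB/mental_health_backup.sqlite')`, where the second path is a hypothetical name.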

In [1]:
import sqlite3

Check the table names in the database.

In [2]:
def get_tables(db_name):
    ''' Get table names from the database. '''
    conn = sqlite3.connect(f'DB/{db_name}.sqlite')
    c = conn.cursor()
    # single quotes mark a string literal in SQL; double quotes are meant for identifiers
    c.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    tables = c.fetchall()
    conn.commit()
    conn.close()
    return tables

In [3]:
get_tables('mental_health')

[('Answer',), ('Question',), ('Survey',)]

Check the tables' content.

In [4]:
def get_columns(db_name, table):
    ''' Get the column names of a table from the database. '''
    conn = sqlite3.connect(f'DB/{db_name}.sqlite')
    c = conn.cursor()
    c.execute(f"SELECT * FROM {table} LIMIT 1")
    cols = [x[0] for x in c.description]
    conn.commit()
    conn.close()
    return cols

In [5]:
def get_table(db_name, table):
    ''' Get table content from the database. '''
    conn = sqlite3.connect(f'DB/{db_name}.sqlite')
    c = conn.cursor()
    c.execute(f"SELECT * FROM {table}")
    table = c.fetchall()
    conn.commit()
    conn.close()
    return table
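The helpers above repeat the same connect/close steps and put the table name straight into the SQL text. A minimal sketch of one way to share that boilerplate and to check a table name before using it in an f-string follows. `run_query` and `check_table_name` are hypothetical helpers, not part of the original notebook.

```python
import sqlite3
from contextlib import closing

def run_query(db_path, sql, params=()):
    ''' Run a read-only query and return all rows; the connection is always closed. '''
    with closing(sqlite3.connect(db_path)) as conn:
        return conn.execute(sql, params).fetchall()

def check_table_name(db_path, table):
    ''' Return table unchanged if it exists in the database, otherwise raise ValueError. '''
    rows = run_query(db_path, "SELECT name FROM sqlite_master WHERE type = 'table'")
    if table not in {name for (name,) in rows}:
        raise ValueError(f'Unknown table: {table}')
    return table
```

Table and column names cannot be passed as `?` parameters, so checking them against `sqlite_master` first is one way to keep the f-string queries safe.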

In [6]:
get_columns('mental_health', 'Answer')

['AnswerText', 'SurveyID', 'UserID', 'QuestionID']

In [7]:
get_table('mental_health', 'Answer')[:10]

[('37', 2014, 1, 1),
 ('44', 2014, 2, 1),
 ('32', 2014, 3, 1),
 ('31', 2014, 4, 1),
 ('31', 2014, 5, 1),
 ('33', 2014, 6, 1),
 ('35', 2014, 7, 1),
 ('39', 2014, 8, 1),
 ('42', 2014, 9, 1),
 ('23', 2014, 10, 1)]

In [8]:
get_columns('mental_health', 'Question')

['questiontext', 'questionid']

In [14]:
get_table('mental_health', 'Question')

[('What is your age?', 1),
 ('What is your gender?', 2),
 ('What country do you live in?', 3),
 ('If you live in the United States, which state or territory do you live in?',
  4),
 ('Are you self-employed?', 5),
 ('Do you have a family history of mental illness?', 6),
 ('Have you ever sought treatment for a mental health disorder from a mental health professional?',
  7),
 ('How many employees does your company or organization have?', 8),
 ('Is your employer primarily a tech company/organization?', 9),
 ('Does your employer provide mental health benefits as part of healthcare coverage?',
  10),
 ('Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?',
  11),
 ('Would you bring up a mental health issue with a potential employer in an interview?',
  12),
 ('Is your primary role within your company related to tech/IT?', 13),
 ('Do you know the options for mental health care available under your empl

In [19]:
get_columns('mental_health', 'Survey')

['SurveyID', 'Description']

In [15]:
get_table('mental_health', 'Survey')

[(2014, 'mental health survey for 2014'),
 (2016, 'mental health survey for 2016'),
 (2017, 'mental health survey for 2017'),
 (2018, 'mental health survey for 2018'),
 (2019, 'mental health survey for 2019')]

Let's limit the research to a few questions for this exercise (there are 118 questions!).

In [6]:
def get_all_answers_per_q(q_no):
    ''' Get all answers from Answer table for particular question 
        represented by QuestionID number from the Question table. '''
    conn = sqlite3.connect('DB/mental_health.sqlite')
    c = conn.cursor()
    c.execute("SELECT AnswerText FROM Answer WHERE QuestionID = ?", (q_no,))
    answers = set(c.fetchall())
    conn.commit()
    conn.close()
    return answers

In [32]:
get_all_answers_per_q(3)

{('-1',),
 ('Afghanistan',),
 ('Algeria',),
 ('Argentina',),
 ('Australia',),
 ('Austria',),
 ('Bahamas, The',),
 ('Bangladesh',),
 ('Belarus',),
 ('Belgium',),
 ('Bosnia and Herzegovina',),
 ('Brazil',),
 ('Brunei',),
 ('Bulgaria',),
 ('Canada',),
 ('Chile',),
 ('China',),
 ('Colombia',),
 ('Costa Rica',),
 ('Croatia',),
 ('Czech Republic',),
 ('Denmark',),
 ('Ecuador',),
 ('Estonia',),
 ('Ethiopia',),
 ('Finland',),
 ('France',),
 ('Georgia',),
 ('Germany',),
 ('Ghana',),
 ('Greece',),
 ('Guatemala',),
 ('Hong Kong',),
 ('Hungary',),
 ('Iceland',),
 ('India',),
 ('Indonesia',),
 ('Iran',),
 ('Ireland',),
 ('Israel',),
 ('Italy',),
 ('Japan',),
 ('Jordan',),
 ('Kenya',),
 ('Latvia',),
 ('Lithuania',),
 ('Macedonia',),
 ('Mauritius',),
 ('Mexico',),
 ('Moldova',),
 ('Netherlands',),
 ('New Zealand',),
 ('Nigeria',),
 ('Norway',),
 ('Other',),
 ('Pakistan',),
 ('Philippines',),
 ('Poland',),
 ('Portugal',),
 ('Romania',),
 ('Russia',),
 ('Saudi Arabia',),
 ('Serbia',),
 ('Singapore',),


How many survey participants are there for a particular country?

In [7]:
def get_users_no_for_q_and_answer(q_no, answer):
    ''' Get number of users from Answer table for particular question number (QuestionID) and answer (AnswerText). '''
    conn = sqlite3.connect('DB/mental_health.sqlite')
    c = conn.cursor()
    c.execute("SELECT UserID FROM Answer WHERE QuestionID = ? and AnswerText = ?", (q_no, answer))
    user_no = len(c.fetchall())
    conn.commit()
    conn.close()
    return user_no

In [36]:
get_users_no_for_q_and_answer(3, 'Poland')

21

Let's check the number of participants per country to find a better example.

In [2]:
def get_users_no_per_answer(q_no):
    ''' Get number of users from Answer table for particular question number (QuestionID) grouped by answer. '''
    conn = sqlite3.connect('DB/mental_health.sqlite')
    c = conn.cursor()
    c.execute("SELECT AnswerText, count(UserID) FROM Answer WHERE QuestionID = ? GROUP BY AnswerText", (q_no,))
    user_no = c.fetchall()
    conn.commit()
    conn.close()
    return user_no

In [47]:
get_users_no_per_answer(3)

[('-1', 2),
 ('Afghanistan', 3),
 ('Algeria', 2),
 ('Argentina', 4),
 ('Australia', 73),
 ('Austria', 10),
 ('Bahamas, The', 1),
 ('Bangladesh', 3),
 ('Belarus', 1),
 ('Belgium', 17),
 ('Bosnia and Herzegovina', 3),
 ('Brazil', 37),
 ('Brunei', 1),
 ('Bulgaria', 13),
 ('Canada', 199),
 ('Chile', 3),
 ('China', 2),
 ('Colombia', 6),
 ('Costa Rica', 2),
 ('Croatia', 4),
 ('Czech Republic', 6),
 ('Denmark', 9),
 ('Ecuador', 1),
 ('Estonia', 5),
 ('Ethiopia', 1),
 ('Finland', 13),
 ('France', 51),
 ('Georgia', 2),
 ('Germany', 136),
 ('Ghana', 1),
 ('Greece', 7),
 ('Guatemala', 1),
 ('Hong Kong', 2),
 ('Hungary', 4),
 ('Iceland', 2),
 ('India', 50),
 ('Indonesia', 3),
 ('Iran', 1),
 ('Ireland', 51),
 ('Israel', 9),
 ('Italy', 19),
 ('Japan', 9),
 ('Jordan', 1),
 ('Kenya', 1),
 ('Latvia', 2),
 ('Lithuania', 2),
 ('Macedonia', 1),
 ('Mauritius', 1),
 ('Mexico', 12),
 ('Moldova', 1),
 ('Netherlands', 98),
 ('New Zealand', 24),
 ('Nigeria', 2),
 ('Norway', 12),
 ('Other', 2),
 ('Pakistan', 7),

There are two names for the USA: 'United States' and 'United States of America'. 'United States' will be replaced by 'United States of America'.

In [9]:
def change_table_value(db_name, table, column, v1, v2):
    ''' Change every occurrence of v1 in the given column of table (in the db_name database) to v2. '''
    conn = sqlite3.connect(f'DB/{db_name}.sqlite')
    c = conn.cursor()
    c.execute(f"UPDATE {table} SET {column} = ? WHERE {column} = ?", (v2, v1))
    conn.commit()
    conn.close()

In [52]:
change_table_value('mental_health', 'Answer', 'AnswerText', 'United States', 'United States of America')

Verify the change.

In [53]:
get_users_no_per_answer(3)

[('-1', 2),
 ('Afghanistan', 3),
 ('Algeria', 2),
 ('Argentina', 4),
 ('Australia', 73),
 ('Austria', 10),
 ('Bahamas, The', 1),
 ('Bangladesh', 3),
 ('Belarus', 1),
 ('Belgium', 17),
 ('Bosnia and Herzegovina', 3),
 ('Brazil', 37),
 ('Brunei', 1),
 ('Bulgaria', 13),
 ('Canada', 199),
 ('Chile', 3),
 ('China', 2),
 ('Colombia', 6),
 ('Costa Rica', 2),
 ('Croatia', 4),
 ('Czech Republic', 6),
 ('Denmark', 9),
 ('Ecuador', 1),
 ('Estonia', 5),
 ('Ethiopia', 1),
 ('Finland', 13),
 ('France', 51),
 ('Georgia', 2),
 ('Germany', 136),
 ('Ghana', 1),
 ('Greece', 7),
 ('Guatemala', 1),
 ('Hong Kong', 2),
 ('Hungary', 4),
 ('Iceland', 2),
 ('India', 50),
 ('Indonesia', 3),
 ('Iran', 1),
 ('Ireland', 51),
 ('Israel', 9),
 ('Italy', 19),
 ('Japan', 9),
 ('Jordan', 1),
 ('Kenya', 1),
 ('Latvia', 2),
 ('Lithuania', 2),
 ('Macedonia', 1),
 ('Mauritius', 1),
 ('Mexico', 12),
 ('Moldova', 1),
 ('Netherlands', 98),
 ('New Zealand', 24),
 ('Nigeria', 2),
 ('Norway', 12),
 ('Other', 2),
 ('Pakistan', 7),

Let's choose a couple of questions and review them.

In [None]:
# OK, "a couple" is an understatement. Perhaps some of them will be eliminated during the process.
# q_no (29): 1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 13, 22, 28, 30, 32, 33, 34, 48, 49, 50, 51, 54, 55, 56, 78, 79, 89, 93, 118

# ('What is your age?', 1),
# ('What is your gender?', 2),
# ('What country do you live in?', 3),
# ('If you live in the United States, which state or territory do you live in?', 4),
# ('Are you self-employed?', 5),
# ('Do you have a family history of mental illness?', 6),
# ('Have you ever sought treatment for a mental health disorder from a mental health professional?', 7),
# ('How many employees does your company or organization have?', 8),
# ('Is your employer primarily a tech company/organization?', 9),
# ('Would you bring up a mental health issue with a potential employer in an interview?', 12),
# ('Is your primary role within your company related to tech/IT?', 13),
# ('Do you have previous employers?', 22),
# ('Would you have been willing to discuss your mental health with your direct supervisor(s)?', 28),
# ('How willing would you be to share with friends and family that you have a mental illness?', 30),
# ('Have you had a mental health disorder in the past?', 32),
# ('Do you currently have a mental health disorder?', 33),
# ('Have you ever been diagnosed with a mental health disorder?', 34),
# ('If you have a mental health disorder, how often do you feel that it interferes with your work when being treated effectively?', 48),
# ('If you have a mental health disorder, how often do you feel that it interferes with your work when not being treated effectively (i.e., when you are experiencing symptoms)?', 49),
# ('What country do you work in?', 50),
# ('What US state or territory do you work in?', 51),
# ('Do you believe your productivity is ever affected by a mental health issue?', 54),
# ('If yes, what percentage of your work time (time performing primary or secondary job functions) is affected by a mental health issue?', 55),
# ('Have you observed or experienced an unsupportive or badly handled response to a mental health issue in your current or previous workplace?', 56),
# ('Are you openly identified at work as a person with a mental health issue?', 78),
# ('Has being identified as a person with a mental health issue affected your career?', 79),
# ('What is your race?', 89),
# ('Do you work remotely (outside of an office) at least 50% of the time?', 93),
# ('Do you work remotely?', 118)

Check the answers for each chosen question one by one.

In [54]:
get_users_no_per_answer(1)

[('-1', 5),
 ('-29', 1),
 ('0', 1),
 ('11', 1),
 ('15', 1),
 ('17', 1),
 ('18', 9),
 ('19', 20),
 ('20', 17),
 ('21', 39),
 ('22', 74),
 ('23', 107),
 ('24', 128),
 ('25', 147),
 ('26', 194),
 ('27', 197),
 ('28', 220),
 ('29', 229),
 ('3', 1),
 ('30', 250),
 ('31', 223),
 ('32', 227),
 ('323', 1),
 ('329', 1),
 ('33', 201),
 ('34', 202),
 ('35', 201),
 ('36', 147),
 ('37', 184),
 ('38', 160),
 ('39', 137),
 ('40', 122),
 ('41', 88),
 ('42', 100),
 ('43', 82),
 ('44', 68),
 ('45', 74),
 ('46', 58),
 ('47', 38),
 ('48', 29),
 ('49', 36),
 ('5', 1),
 ('50', 30),
 ('51', 22),
 ('52', 17),
 ('53', 15),
 ('54', 17),
 ('55', 22),
 ('56', 13),
 ('57', 14),
 ('58', 5),
 ('59', 6),
 ('60', 5),
 ('61', 7),
 ('62', 3),
 ('63', 5),
 ('64', 3),
 ('65', 3),
 ('66', 2),
 ('67', 2),
 ('70', 1),
 ('72', 1),
 ('74', 1),
 ('8', 1),
 ('99', 1)]

Let's delete users younger than 18 and older than 100. It is hard to believe that someone 99 years old is still working, but it could be. If that is true - respect, sir!

Wait a moment. Should only the answers to this question be deleted, or the whole user with all their answers? To avoid losing too much data, let's proceed with the checks and decide afterwards. Deleting a user with all their answers can be harsh on the dataset. Deleting only a particular answer can lead to inconsistent data. Perhaps changing these answers to 'n/a' would be a better solution?
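To weigh these options, it can help to count how many users and how many answer rows each option would touch. The sketch below does this on a made-up in-memory table with the same columns as Answer. The function name `preview_age_cleanup` and the toy rows are assumptions, and it relies on UserID identifying one participant across the whole table.

```python
import sqlite3

def preview_age_cleanup(conn, low, high):
    ''' Return (users with age out of range, all answer rows belonging to those users). '''
    out_of_range = """SELECT UserID FROM Answer WHERE QuestionID = 1
                      AND (CAST(AnswerText AS INTEGER) < ? OR CAST(AnswerText AS INTEGER) > ?)"""
    users = conn.execute(f"SELECT count(*) FROM ({out_of_range})", (low, high)).fetchone()[0]
    rows = conn.execute(f"SELECT count(*) FROM Answer WHERE UserID IN ({out_of_range})",
                        (low, high)).fetchone()[0]
    return users, rows

# Toy data (made up): user 1 is in range, users 2 and 3 are not
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE Answer (AnswerText TEXT, SurveyID INTEGER, UserID INTEGER, QuestionID INTEGER)')
conn.executemany('INSERT INTO Answer VALUES (?, ?, ?, ?)', [
    ('37', 2014, 1, 1), ('Male', 2014, 1, 2),
    ('-29', 2014, 2, 1), ('Female', 2014, 2, 2),
    ('329', 2014, 3, 1), ('Male', 2014, 3, 2), ('Poland', 2014, 3, 3),
])
result = preview_age_cleanup(conn, 18, 100)
print(result)  # (2, 5): two users, five rows in total
conn.close()
```

The first number is what deleting or relabelling only the age answers would touch; the second is what deleting whole users would remove.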

In [55]:
get_users_no_per_answer(2)

[('-1', 24),
 ('43', 1),
 ('A little about you', 1),
 ('AFAB', 1),
 ('Agender', 4),
 ('Agender trans woman', 1),
 ('Agender/genderfluid', 1),
 ('All', 1),
 ('Androgyne', 1),
 ('Androgynous', 1),
 ('Bigender', 1),
 ('Cishet male', 1),
 ('Contextual', 1),
 ('Demiguy', 1),
 ('Enby', 2),
 ('Female', 914),
 ('Female (trans)', 2),
 ('Female assigned at birth', 1),
 ('Female or Multi-Gender Femme', 1),
 ('Female-identified', 1),
 ('Female-ish', 1),
 ('Female/gender non-binary.', 1),
 ('Fluid', 1),
 ('Genderfluid', 3),
 ('Genderfluid (born female)', 1),
 ('Genderflux demi-girl', 1),
 ('Genderqueer', 4),
 ('Genderqueer demigirl', 1),
 ('Genderqueer/non-binary', 1),
 ('God King of the Valajar', 1),
 ('Guy (-ish) ^_^', 1),
 ('Human', 1),
 ('I am a Wookie', 1),
 ('I have a penis', 1),
 ('MALE', 1),
 ('Male', 2830),
 ('Male (or female, or both)', 1),
 ('Male (trans, FtM)', 1),
 ('Male-ish', 2),
 ('Male/genderqueer', 1),
 ('Masculine', 1),
 ('NB', 1),
 ('Nah', 1),
 ('Neuter', 1),
 ('Non binary', 2),

'-1', '43' and all answers given by fewer than five participants are going to be 'Other'. However, before that some answers have to be combined.

In [None]:
# Female <-- female
# Male <-- MALE, I have a penis
# Non-binary <-- Non binary, Nonbinary

In [56]:
get_users_no_per_answer(4)

[('-1', 1622),
 ('Alabama', 19),
 ('Alaska', 4),
 ('Arizona', 19),
 ('California', 382),
 ('Colorado', 60),
 ('Connecticut', 13),
 ('DC', 4),
 ('Delaware', 1),
 ('District of Columbia', 7),
 ('Florida', 52),
 ('Georgia', 49),
 ('Idaho', 8),
 ('Illinois', 223),
 ('Indiana', 96),
 ('Iowa', 18),
 ('Kansas', 27),
 ('Kentucky', 14),
 ('Louisiana', 7),
 ('Maine', 9),
 ('Maryland', 42),
 ('Massachusetts', 76),
 ('Michigan', 108),
 ('Minnesota', 101),
 ('Mississippi', 1),
 ('Missouri', 42),
 ('Montana', 2),
 ('Nebraska', 61),
 ('Nevada', 9),
 ('New Hampshire', 15),
 ('New Jersey', 27),
 ('New Mexico', 9),
 ('New York', 146),
 ('North Carolina', 51),
 ('North Dakota', 7),
 ('Ohio', 109),
 ('Oklahoma', 20),
 ('Oregon', 99),
 ('Pennsylvania', 94),
 ('Rhode Island', 6),
 ('South Carolina', 10),
 ('South Dakota', 8),
 ('Tennessee', 121),
 ('Texas', 119),
 ('Utah', 29),
 ('Vermont', 9),
 ('Virginia', 46),
 ('Washington', 168),
 ('West Virginia', 3),
 ('Wisconsin', 43),
 ('Wyoming', 3)]

-1 --> 'n/a' and DC --> Washington

In [57]:
get_users_no_per_answer(5)

[('-1', 18), ('0', 3550), ('1', 650)]

-1 --> 'n/a'

In [58]:
get_users_no_per_answer(6)

[("I don't know", 649), ('No', 1701), ('Yes', 1868)]

In [59]:
get_users_no_per_answer(7)

[('0', 1806), ('1', 2412)]

In [60]:
get_users_no_per_answer(8)

[('-1', 504),
 ('1-5', 254),
 ('100-500', 788),
 ('26-100', 824),
 ('500-1000', 247),
 ('6-25', 689),
 ('More than 1000', 912)]

-1 --> 'n/a'

In [61]:
get_users_no_per_answer(9)

[('-1', 504), ('0', 826), ('1', 2888)]

-1 --> 'n/a'

In [62]:
get_users_no_per_answer(12)

[('Maybe', 1036), ('No', 2951), ('Yes', 231)]

In [63]:
get_users_no_per_answer(13)

[('-1', 1387), ('0', 100), ('1', 1471)]

-1 --> 'n/a'

In [64]:
get_users_no_per_answer(22)

[('0', 368), ('1', 2590)]

In [65]:
get_users_no_per_answer(28)

[('-1', 368),
 ("I don't know", 190),
 ('No, at none of my previous employers', 416),
 ('No, none of my previous supervisors', 485),
 ('Some of my previous employers', 654),
 ('Some of my previous supervisors', 654),
 ('Yes, all of my previous supervisors', 98),
 ('Yes, at all of my previous employers', 93)]

-1 --> 'n/a'

In [66]:
get_users_no_per_answer(30)

[('Neutral', 329),
 ('Not applicable to me (I do not have a mental illness)', 112),
 ('Not open at all', 332),
 ('Somewhat not open', 290),
 ('Somewhat open', 791),
 ('Very open', 1104)]

In [67]:
get_users_no_per_answer(32)

[('-1', 15),
 ("Don't Know", 109),
 ('Maybe', 246),
 ('No', 896),
 ('Possibly', 275),
 ('Yes', 1417)]

-1 --> 'n/a'

In [68]:
get_users_no_per_answer(33)

[("Don't Know", 124),
 ('Maybe', 327),
 ('No', 969),
 ('Possibly', 301),
 ('Yes', 1237)]

In [69]:
get_users_no_per_answer(34)

[('-1', 863), ('No', 732), ('Yes', 1363)]

-1 --> 'n/a'

In [70]:
get_users_no_per_answer(48)

[('Never', 165),
 ('Not applicable to me', 1119),
 ('Often', 166),
 ('Rarely', 700),
 ('Sometimes', 808)]

In [14]:
get_users_no_per_answer(49)

[('Never', 24),
 ('Not applicable to me', 966),
 ('Often', 1183),
 ('Rarely', 113),
 ('Sometimes', 672)]

In [15]:
get_users_no_per_answer(50)

[('-1', 2),
 ('Afghanistan', 3),
 ('Algeria', 1),
 ('Argentina', 3),
 ('Australia', 50),
 ('Austria', 8),
 ('Bangladesh', 3),
 ('Belgium', 9),
 ('Bosnia and Herzegovina', 2),
 ('Botswana', 1),
 ('Brazil', 30),
 ('Brunei', 1),
 ('Bulgaria', 9),
 ('Canada', 122),
 ('Chile', 3),
 ('China', 1),
 ('Colombia', 4),
 ('Costa Rica', 1),
 ('Croatia', 2),
 ('Czech Republic', 4),
 ('Denmark', 7),
 ('Ecuador', 1),
 ('Eritrea', 1),
 ('Estonia', 5),
 ('Ethiopia', 1),
 ('Finland', 10),
 ('France', 34),
 ('Georgia', 1),
 ('Germany', 90),
 ('Ghana', 1),
 ('Greece', 4),
 ('Guatemala', 1),
 ('Hong Kong', 2),
 ('Hungary', 2),
 ('Iceland', 2),
 ('India', 39),
 ('Indonesia', 3),
 ('Iran', 1),
 ('Ireland', 23),
 ('Israel', 4),
 ('Italy', 10),
 ('Japan', 6),
 ('Jordan', 1),
 ('Kenya', 1),
 ('Latvia', 1),
 ('Lithuania', 1),
 ('Luxembourg', 1),
 ('Macedonia', 1),
 ('Mauritius', 1),
 ('Mexico', 10),
 ('Netherlands', 70),
 ('New Zealand', 16),
 ('Nigeria', 1),
 ('Norway', 11),
 ('Other', 3),
 ('Pakistan', 6),
 ('P

-1 --> 'n/a'

In [16]:
get_users_no_per_answer(51)

[('-1', 1087),
 ('Alabama', 10),
 ('Alaska', 4),
 ('Arizona', 14),
 ('California', 269),
 ('Colorado', 54),
 ('Connecticut', 10),
 ('Delaware', 1),
 ('District of Columbia', 15),
 ('Florida', 36),
 ('Georgia', 35),
 ('Hawaii', 1),
 ('Idaho', 6),
 ('Illinois', 197),
 ('Indiana', 67),
 ('Iowa', 15),
 ('Kansas', 23),
 ('Kentucky', 10),
 ('Louisiana', 5),
 ('Maine', 7),
 ('Maryland', 28),
 ('Massachusetts', 61),
 ('Michigan', 84),
 ('Minnesota', 81),
 ('Mississippi', 1),
 ('Missouri', 27),
 ('Montana', 2),
 ('Nebraska', 58),
 ('Nevada', 6),
 ('New Hampshire', 8),
 ('New Jersey', 20),
 ('New Mexico', 6),
 ('New York', 101),
 ('North Carolina', 37),
 ('North Dakota', 7),
 ('Ohio', 78),
 ('Oklahoma', 15),
 ('Oregon', 64),
 ('Pennsylvania', 63),
 ('Rhode Island', 4),
 ('South Carolina', 5),
 ('South Dakota', 5),
 ('Tennessee', 74),
 ('Texas', 75),
 ('Utah', 17),
 ('Vermont', 6),
 ('Virginia', 31),
 ('Washington', 96),
 ('West Virginia', 1),
 ('Wisconsin', 30),
 ('Wyoming', 1)]

-1 --> 'n/a'

In [17]:
get_users_no_per_answer(54)

[('-1', 2454),
 ('No', 26),
 ('Not applicable to me', 51),
 ('Unsure', 60),
 ('Yes', 367)]

-1 --> 'n/a'

In [18]:
get_users_no_per_answer(55)

[('-1', 2591),
 ('1-25%', 164),
 ('26-50%', 125),
 ('51-75%', 53),
 ('76-100%', 25)]

-1 --> 'n/a'

In [19]:
get_users_no_per_answer(56)

[('-1', 91),
 ("I've always been self-employed", 15),
 ('Maybe/Not sure', 748),
 ('No', 1207),
 ('Yes, I experienced', 356),
 ('Yes, I observed', 541)]

-1 --> 'n/a'

In [20]:
get_users_no_per_answer(78)

[('-1', 2), ('0', 1340), ('1', 183)]

-1 --> 'n/a'

In [21]:
get_users_no_per_answer(79)

[('-1', 1345), ('0', 119), ('1', 61)]

-1 --> 'n/a'

In [22]:
get_users_no_per_answer(89)

[('-1', 537),
 ('American Indian or Alaska Native', 1),
 ('Asian', 31),
 ('Black or African American', 15),
 ('Caucasian', 1),
 ('European American', 1),
 ('Hispanic', 1),
 ('I prefer not to answer', 29),
 ('More than one of the above', 35),
 ('White', 873),
 ('White Hispanic', 1)]

-1 --> 'n/a'

In [23]:
get_users_no_per_answer(93)

[('No', 884), ('Yes', 376)]

In [24]:
get_users_no_per_answer(118)

[('Always', 343), ('Never', 333), ('Sometimes', 757)]

Change the values for the questions above. Many of them need the same fix (-1 --> 'n/a'). That could be achieved by replacing '-1' in the entire column, but for safety let's specify the question IDs.

In [49]:
question_ids = [2, 4, 5, 8, 9, 13, 28, 32, 34, 50, 51, 54, 55, 56, 78, 79, 89]

In [10]:
def change_answer_value(q_ids, v1, v2):
    ''' Update the answer value for the questions provided in the q_ids list.
        v1 is the value that will be changed.
        v2 is the target value. '''
    values_no = ', '.join('?' * len(q_ids))  # one '?' placeholder per question id
    conn = sqlite3.connect('DB/mental_health.sqlite')
    c = conn.cursor()
    c.execute(f"UPDATE Answer SET AnswerText = ? WHERE AnswerText = ? and QuestionID IN ({values_no})", (v2, v1, *q_ids))
    conn.commit()
    conn.close()

In [29]:
change_answer_value(question_ids, '-1', 'n/a')

Verify the changes.

In [50]:
for q in question_ids:
    print(get_users_no_per_answer(q)[0]) # '-1' was always the first row, so it should be gone now

('Female', 1024)
('Alabama', 19)
('0', 3550)
('1-5', 254)
('0', 826)
('0', 100)
("I don't know", 190)
("Don't Know", 109)
('No', 732)
('Afghanistan', 3)
('Alabama', 10)
('No', 26)
('1-25%', 164)
("I've always been self-employed", 15)
('0', 1340)
('0', 119)
('American Indian or Alaska Native', 1)
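Looking at the first row works because '-1' sorted first in every list. A more direct check is to count the remaining '-1' rows per question. The sketch below runs on a made-up in-memory table, and `count_value_per_question` is a hypothetical helper, not part of the original notebook.

```python
import sqlite3

def count_value_per_question(conn, q_ids, value):
    ''' Return {QuestionID: number of rows whose AnswerText equals value}. '''
    placeholders = ', '.join('?' * len(q_ids))
    rows = conn.execute(
        f"SELECT QuestionID, count(*) FROM Answer "
        f"WHERE AnswerText = ? AND QuestionID IN ({placeholders}) GROUP BY QuestionID",
        (value, *q_ids)).fetchall()
    counts = dict.fromkeys(q_ids, 0)  # questions without a match stay at 0
    counts.update(rows)
    return counts

# Toy data (made up): question 5 is already cleaned, question 8 still has one '-1'
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE Answer (AnswerText TEXT, SurveyID INTEGER, UserID INTEGER, QuestionID INTEGER)')
conn.executemany('INSERT INTO Answer VALUES (?, ?, ?, ?)', [
    ('n/a', 2014, 1, 5), ('0', 2014, 2, 5), ('-1', 2016, 3, 8), ('1-5', 2016, 4, 8),
])
leftovers = count_value_per_question(conn, [5, 8], '-1')
print(leftovers)  # {5: 0, 8: 1}
conn.close()
```

After a successful cleanup every count should be 0.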


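Change 'DC' to 'District of Columbia'. The note for question 4 suggested 'Washington', but in that list 'Washington' is the state, and 'District of Columbia' already exists as a separate answer. That can be done in the entire table.

In [34]:
change_table_value('mental_health', 'Answer', 'AnswerText', 'DC', 'District of Columbia')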

Verify the change.

In [11]:
def get_value_from_column(table, column, value):
    ''' Get particular answer from defined table. '''
    conn = sqlite3.connect('DB/mental_health.sqlite')
    c = conn.cursor()
    c.execute(f"SELECT {column} FROM {table} WHERE {column} = ?", (value,))
    value = c.fetchall()
    conn.commit()
    conn.close()
    return value

In [36]:
get_value_from_column('Answer', 'AnswerText', 'DC')

[]

For question 2: -1 --> 'n/a' and all answers given by fewer than five participants are going to be 'Other'. However, before that the following answers have to be combined:

In [None]:
# Female <-- female
# Male <-- MALE, I have a penis, male
# Non-binary <-- Non binary, Nonbinary

In [48]:
# change_answer_value('2', '-1', 'n/a') # already done above

In [37]:
# with more values this could be handled with a dict (.keys, .values)
change_table_value('mental_health', 'Answer', 'AnswerText', 'female', 'Female')
change_table_value('mental_health', 'Answer', 'AnswerText', 'MALE', 'Male')
change_table_value('mental_health', 'Answer', 'AnswerText', 'I have a penis', 'Male')
change_table_value('mental_health', 'Answer', 'AnswerText', 'male', 'Male')
change_table_value('mental_health', 'Answer', 'AnswerText', 'Non binary', 'Non-binary')
change_table_value('mental_health', 'Answer', 'AnswerText', 'Nonbinary', 'Non-binary')
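The dict idea from the comment above could look roughly like this. It is a sketch on a made-up in-memory table. `apply_replacements` is a hypothetical helper, and unlike the calls above it limits the update to one question, which is an assumption about what is wanted.

```python
import sqlite3

# Same replacements as in the cell above: raw value -> canonical value
replacements = {
    'female': 'Female',
    'MALE': 'Male',
    'I have a penis': 'Male',
    'male': 'Male',
    'Non binary': 'Non-binary',
    'Nonbinary': 'Non-binary',
}

def apply_replacements(conn, q_id, mapping):
    ''' Replace each key of mapping with its value in the answers to question q_id. '''
    conn.executemany(
        "UPDATE Answer SET AnswerText = ? WHERE AnswerText = ? AND QuestionID = ?",
        [(new, old, q_id) for old, new in mapping.items()])
    conn.commit()

# Toy data (made up): the last row belongs to another question and must stay untouched
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE Answer (AnswerText TEXT, SurveyID INTEGER, UserID INTEGER, QuestionID INTEGER)')
conn.executemany('INSERT INTO Answer VALUES (?, ?, ?, ?)', [
    ('female', 2014, 1, 2), ('MALE', 2014, 2, 2), ('Nonbinary', 2014, 3, 2), ('male', 2014, 4, 3),
])
apply_replacements(conn, 2, replacements)
result = conn.execute('SELECT AnswerText, QuestionID FROM Answer ORDER BY UserID').fetchall()
print(result)
conn.close()
```

Keeping the mapping in one dict makes it easy to review and extend, and `executemany` runs all updates on a single connection.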

In [45]:
def change_answer_for_q_smaller(q_id, t_value, qty):
    ''' Change answers for the provided question q_id to t_value where the answer
        occurs qty times or fewer (answers occurring more than qty times are kept). '''
    conn = sqlite3.connect('DB/mental_health.sqlite')
    c = conn.cursor()
    # answers frequent enough to keep
    c.execute('SELECT AnswerText, count(UserID) as UNo FROM Answer WHERE QuestionID = ? GROUP BY AnswerText HAVING UNo > ?',
              (q_id, qty))
    sel = [k for k, _ in c.fetchall()]
    values_no = ', '.join('?' * len(sel))
    # every other answer to this question becomes t_value
    c.execute(f'UPDATE Answer SET AnswerText = ? WHERE QuestionID = ? and AnswerText NOT IN ({values_no})',
              (t_value, q_id, *sel))
    conn.commit()
    conn.close()

In [46]:
change_answer_for_q_smaller(2, 'Other', 5)

Verify the result.

In [3]:
get_users_no_per_answer(2)

[('Female', 1024),
 ('Male', 3044),
 ('Non-binary', 13),
 ('Other', 107),
 ('n/a', 24),
 ('non-binary', 6)]

Regarding question 1, the age: the decision is to delete the users (with all their answers) whose age is below 18 or above 100. That is 15 of 4218 users, so less than 0.4%.

In [78]:
def delete_participant_out_of_age_range(age_range):
    ''' Delete users, with all their answers, whose age is out of the provided range.
        age_range is a list. Ages below [0] and above [1] are deleted. '''
    conn = sqlite3.connect('DB/mental_health.sqlite')
    c = conn.cursor()
    # the subquery finds the users by their age answer (QuestionID = 1);
    # all rows of those users are deleted, not only the age answer
    c.execute("""DELETE FROM Answer WHERE UserID IN (
                     SELECT UserID FROM Answer WHERE QuestionID = 1
                     and (CAST(AnswerText AS INTEGER) < ? or CAST(AnswerText AS INTEGER) > ?))""",
              (age_range[0], age_range[1]))
    conn.commit()
    conn.close()

In [79]:
delete_participant_out_of_age_range([18, 100])

Verify the changes.

In [83]:
get_users_no_per_answer(1)

[('18', 9),
 ('19', 20),
 ('20', 17),
 ('21', 39),
 ('22', 74),
 ('23', 107),
 ('24', 128),
 ('25', 147),
 ('26', 194),
 ('27', 197),
 ('28', 220),
 ('29', 229),
 ('30', 250),
 ('31', 223),
 ('32', 227),
 ('33', 201),
 ('34', 202),
 ('35', 201),
 ('36', 147),
 ('37', 184),
 ('38', 160),
 ('39', 137),
 ('40', 122),
 ('41', 88),
 ('42', 100),
 ('43', 82),
 ('44', 68),
 ('45', 74),
 ('46', 58),
 ('47', 38),
 ('48', 29),
 ('49', 36),
 ('50', 30),
 ('51', 22),
 ('52', 17),
 ('53', 15),
 ('54', 17),
 ('55', 22),
 ('56', 13),
 ('57', 14),
 ('58', 5),
 ('59', 6),
 ('60', 5),
 ('61', 7),
 ('62', 3),
 ('63', 5),
 ('64', 3),
 ('65', 3),
 ('66', 2),
 ('67', 2),
 ('70', 1),
 ('72', 1),
 ('74', 1),
 ('99', 1)]

The overview and cleaning for the chosen questions can be closed here. Let's go to the next step, where the pipeline will be built for cleaning (only the chosen questions are supported). The steps after that will be classes for gathering data and visualization.