# Mental Health in the Tech Industry Data Cleansing

Source: https://www.kaggle.com/datasets/anth7310/mental-health-in-the-tech-industry

Let's build the data cleaning class for future reference. It can be reused for next year's survey, but the new data must be reviewed first to check whether any adaptations are required.

The supported questions are: 1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 13, 22, 28, 30, 32, 33, 34, 48, 49, 50, 51, 54, 55, 56, 78, 79, 89, 93, 118
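
Before the class is reused on a new survey year, it helps to look at which answer values actually occur for a question. The following is a minimal sketch, assuming the `Answer` table has the `AnswerText`, `UserID` and `QuestionID` columns that the queries in this notebook rely on. The helper name `answer_counts` and the in-memory sample rows are made up for illustration and are not taken from the real data.

```python
import sqlite3


def answer_counts(conn, question_id):
    """Return (AnswerText, count) pairs for one question, most frequent first."""
    cur = conn.execute(
        "SELECT AnswerText, COUNT(*) AS n FROM Answer "
        "WHERE QuestionID = ? GROUP BY AnswerText ORDER BY n DESC, AnswerText",
        (question_id,),
    )
    return cur.fetchall()


# Illustrative in-memory table (hypothetical rows, not the real survey).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Answer (AnswerText TEXT, SurveyID INTEGER, UserID INTEGER, QuestionID INTEGER)"
)
conn.executemany(
    "INSERT INTO Answer VALUES (?, ?, ?, ?)",
    [("Male", 2016, 1, 2), ("male", 2016, 2, 2), ("Male", 2016, 3, 2), ("-1", 2016, 4, 2)],
)
counts = answer_counts(conn, 2)
print(counts)
conn.close()
```

Spelling variants such as `male` next to `Male`, or placeholder values such as `-1`, show up directly in such a listing and indicate which mappings a new survey year would need.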

In [1]:
import sqlite3
import shutil

In [2]:
class SurveyCleaning:
    ''' The class supports data cleaning for the Mental Health in the Tech Industry database 
        for the following questions: 1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 13, 22, 28, 30, 32, 33, 
        34, 48, 49, 50, 51, 54, 55, 56, 78, 79, 89, 93, 118 
        The changes are made directly in the provided database, so please make sure a backup exists first.
        The backup method can be used for that.
        path is the directory where the database is stored, e.g. DB/ 
        db_name is the database file name without the .sqlite extension. '''
    
    def __init__(self, path, db_name):
        self.path = path
        self.db_name = db_name
        self.supported_questions = [1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 13, 22, 28, 30, 32, 33, 
        34, 48, 49, 50, 51, 54, 55, 56, 78, 79, 89, 93, 118]
        
    def backup(self):
        ''' Create a backup for the database. '''
        # Copy the contents (no metadata) of the file named src to a file named dst.
        # https://docs.python.org/3/library/shutil.html#shutil.copyfile
        # Any error raised by copyfile propagates to the caller unchanged.
        shutil.copyfile(f'{self.path}{self.db_name}.sqlite', f'{self.path}{self.db_name}_backup.sqlite')
        print(f'Backup saved successfully in {self.path} directory.')
        print(f'Saved as {self.db_name}_backup.sqlite')
    
    def clear_all_at_once(self, questions_list):
        ''' The cleaning follows the conclusions of the Overview, which are:
            Update -1 --> 'n/a' for questions 2, 4, 5, 8, 9, 13, 28, 32, 34, 50, 51, 54, 55, 56, 78, 79, 89.
            Update 'DC' --> 'Washington' for question 4.
            Update female --> Female; MALE, male, I have a penis --> Male; Non binary, Nonbinary, non-binary --> Non-binary for question 2.
            Update every answer to question 2 that occurs fewer than five times (e.g. -1, 43, \\-) to 'Other'.
            Delete users (with their answers) whose age (question 1) is below 18 or above 100.
            Update 'Maybe' --> 'Possibly' for questions 32 and 33.
            questions_list is a list of the ids of the questions to be cleaned.
            The supported questions are 1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 13, 22, 28, 30, 32, 33, 
            34, 48, 49, 50, 51, 54, 55, 56, 78, 79, 89, 93, 118. 
            To clean other questions, please use the dedicated methods. '''
        
        # verify if questions_list fits the self.supported_questions (supported questions)
        self.verify_questions_list(questions_list)
        
        # -1 --> 'n/a'
        questions_ids = [2, 4, 5, 8, 9, 13, 28, 32, 34, 50, 51, 54, 55, 56, 78, 79, 89]
        supp_questions = self.get_question_from_supported(questions_ids, questions_list)
        if supp_questions:
            self.change_answer_value(supp_questions, '-1', 'n/a')
            print(f'For following questions: {self.convert_list_to_string(supp_questions)}.')
            print('"-1" has been changed to "n/a".')
        
        # 'DC' --> 'Washington'
        questions_ids = [4]
        supp_questions = self.get_question_from_supported(questions_ids, questions_list)
        if supp_questions:
            self.change_answer_value(supp_questions, 'DC', 'Washington')
            print(f'For following questions: {self.convert_list_to_string(supp_questions)}.')
            print('"DC" has been changed to "Washington".')
            
        # A couple changes for question 2
        question_ids = [2]
        supp_questions = self.get_question_from_supported(question_ids, questions_list)
        if supp_questions:
            
            # Female, Male and Non binary adjustment
            female = ['female']
            male = ['MALE', 'I have a penis', 'male']
            non_binary = ['Non binary', 'Nonbinary', 'non-binary']
            self.change_answer_values(supp_questions, Female = female, Male = male, Non_binary = non_binary)
            print(f'For following questions: {self.convert_list_to_string(supp_questions)}.')
            print('"female" has been changed to "Female".')
            print('"MALE", "male" and "I have a penis" has been changed to "Male".')
            print('"Non binary", "Nonbinary" and "non-binary" has been changed to "Non-binary".')
        
            # answers that occur fewer than five times become 'Other'
            t_value = 'Other'
            qty = 5
            self.change_answer_for_qs_smaller(question_ids, t_value, qty)
            print(f'''All answers with quantity below {qty} have been changed to "{t_value}".''')
        
        # delete users (with their answers) whose age (question 1) is below 18 or above 100;
        # this concerns question 1, so it must not depend on question 2 being requested
        if 1 in questions_list:
            age_range = [18, 100] # delete below [0] and above [1]
            self.delete_participant_out_of_age_range(age_range)
            print(f'All users (with their answers) for these below {age_range[0]} and above {age_range[1]} years old have been deleted.')
        
        # 'Maybe' --> 'Possibly'
        questions_ids = [32, 33]
        supp_questions = self.get_question_from_supported(questions_ids, questions_list)
        if supp_questions:
            self.change_answer_value(supp_questions, 'Maybe', 'Possibly')
            print(f'For following questions: {self.convert_list_to_string(supp_questions)}.')
            print('"Maybe" has been changed to "Possibly".')
    
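    def verify_questions_list(self, requested_questions):
        ''' Print which of the requested questions are supported and which are not.
            The method only reports; it does not modify requested_questions. '''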
        supported_q_set = set(self.supported_questions)
        requested_q_set = set(requested_questions)
        not_supported = requested_q_set.difference(supported_q_set) # cannot be supported
        supported = supported_q_set - supported_q_set.difference(requested_q_set) # can be supported
        if not_supported:
            print(f'The following questions cannot be supported: {self.convert_list_to_string(not_supported)} due to implementation limitation.')
        if supported:
            print(f'The following questions are supported: {self.convert_list_to_string(supported)}. Cleansing of these questions started. ')
        else:
            print(f'None of the {self.convert_list_to_string(requested_questions)} are supported')
        
    @staticmethod
    def get_question_from_supported(supported, requested):
        ''' Return the questions that occur in both lists (their intersection). '''
        return list(set(supported) - set(supported).difference(set(requested)))
    
    @staticmethod
    def convert_list_to_string(list_in):
        return ", ".join(str(e) for e in list_in)
    
    def change_answer_value(self, q_ids, v1, v2):
        ''' Update an answer value for the given list of questions.
            v1 is the value to be replaced.
            v2 is the target value. '''
        values_no = ('?, '*len(q_ids))[:-2]
        conn = sqlite3.connect(f'{self.path}{self.db_name}.sqlite')
        c = conn.cursor()
        c.execute(f"UPDATE Answer SET AnswerText = ? WHERE AnswerText = ? and QuestionID IN ({values_no})", (v2, v1, *q_ids))
        conn.commit()
        conn.close()
        
    def change_answer_values(self, q_ids, **kwargs):
        ''' Update the answer values for the given list of questions.
            Can be used to map several answer values to one target value.
            Each keyword is a target value (underscores become hyphens) and
            its argument is the list of values to be replaced. '''
        for key, values in kwargs.items():
            target_value = key.replace('_','-')
            for prev_value in values:
                self.change_answer_value(q_ids, prev_value, target_value)
    
    def change_answer_for_qs_smaller(self, q_ids, t_value, qty):
        ''' Change answers to the provided questions q_ids to t_value where the answer occurs fewer than qty times. '''
        for q_id in q_ids:
            self.change_answer_for_q_smaller(q_id, t_value, qty)

    def change_answer_for_q_smaller(self, q_id, t_value, qty):
        ''' Change answers to the provided question q_id to t_value where the answer occurs fewer than qty times. '''
        conn = sqlite3.connect(f'{self.path}{self.db_name}.sqlite')
        c = conn.cursor()
        # collect the answers that occur at least qty times; these are kept as they are
        c.execute('SELECT AnswerText, count(UserID) as UNo FROM Answer WHERE QuestionID = ? GROUP BY AnswerText HAVING UNo >= ?', 
                  (q_id, qty))
        sel = c.fetchall()
        sel = [k for k, _ in sel]
        values_no = ('?, '*len(sel))[:-2]
        # every other answer to this question is replaced by t_value
        c.execute(f'UPDATE Answer SET AnswerText = ? WHERE QuestionID = ? and AnswerText NOT IN ({values_no})', 
                  (t_value, q_id, *sel))
        conn.commit()
        conn.close()
    
    def delete_participant_out_of_age_range(self, age_range):
        ''' Delete users, together with all their answers, whose age is outside the provided range.
            age_range is a list. Users below [0] or above [1] are deleted. '''
        conn = sqlite3.connect(f'{self.path}{self.db_name}.sqlite')
        c = conn.cursor()
        # the age is the answer to question 1; remove every answer of the users outside the range
        c.execute("DELETE FROM Answer WHERE UserID IN (SELECT UserID FROM Answer WHERE QuestionID = 1 and (CAST(AnswerText AS INTEGER) < ? or CAST(AnswerText AS INTEGER) > ?))", (age_range[0], age_range[1]))
        conn.commit()
        conn.close()
        
    def change_column_value(self, column, v1, v2):
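        ''' Change value v1 to v2 in the given column of the Answer table.
            The column name is inserted into the SQL text as is, so it must be
            a trusted column name and never user input. '''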
        conn = sqlite3.connect(f'{self.path}{self.db_name}.sqlite')
        c = conn.cursor()
        c.execute(f"UPDATE Answer SET {column} = ? WHERE {column} = ?", (v2, v1))
        conn.commit()
        conn.close()
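
The rare-answer step in `change_answer_for_q_smaller` first selects the answers to keep and then updates everything else. As an alternative sketch, the same idea can be written as a single `UPDATE` with a subquery. The function name `collapse_rare_answers` and the in-memory sample rows below are illustrative only and are not part of the class above.

```python
import sqlite3


def collapse_rare_answers(conn, q_id, target, qty):
    """Set AnswerText to `target` for answers to `q_id` that occur fewer than `qty` times."""
    cur = conn.execute(
        "UPDATE Answer SET AnswerText = ? "
        "WHERE QuestionID = ? AND AnswerText IN ("
        "    SELECT AnswerText FROM Answer WHERE QuestionID = ? "
        "    GROUP BY AnswerText HAVING COUNT(*) < ?)",
        (target, q_id, q_id, qty),
    )
    conn.commit()
    return cur.rowcount  # number of rows that were rewritten


# Illustrative in-memory table (hypothetical rows, not the real survey).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Answer (AnswerText TEXT, UserID INTEGER, QuestionID INTEGER)")
rows = [
    ("Male", 1, 2), ("Male", 2, 2), ("Male", 3, 2),
    ("Female", 4, 2), ("Female", 5, 2),
    ("43", 6, 2),  # occurs once for question 2, so it is rare for qty=2
    ("43", 7, 1),  # same text under another question stays untouched
]
conn.executemany("INSERT INTO Answer VALUES (?, ?, ?)", rows)
changed = collapse_rare_answers(conn, 2, "Other", 2)
result = conn.execute(
    "SELECT AnswerText, COUNT(*) FROM Answer WHERE QuestionID = 2 "
    "GROUP BY AnswerText ORDER BY AnswerText"
).fetchall()
untouched = conn.execute("SELECT AnswerText FROM Answer WHERE QuestionID = 1").fetchall()
conn.close()
```

All values are bound through `?` placeholders, in the same way as the other queries in the class.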

In [3]:
answer_cleansing = SurveyCleaning('DB/', 'mental_health')

In [31]:
answer_cleansing.backup()

Backup saved successfully in DB/ directory.
Saved as mental_health_backup.sqlite
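
Because the cleaning updates the database in place, the backup file is the only way back to the raw data. A minimal sketch of the reverse step follows, assuming the same `<db_name>_backup.sqlite` naming that `backup()` uses. `restore_backup` is a hypothetical helper and not part of the class above; the demonstration runs on throwaway files in a temporary directory.

```python
import os
import shutil
import tempfile


def restore_backup(path, db_name):
    """Overwrite <db_name>.sqlite with <db_name>_backup.sqlite and return the restored path."""
    src = os.path.join(path, f"{db_name}_backup.sqlite")
    dst = os.path.join(path, f"{db_name}.sqlite")
    if not os.path.exists(src):
        raise FileNotFoundError(f"No backup found at {src}")
    shutil.copyfile(src, dst)
    return dst


# Demonstration with dummy file contents instead of a real database.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "demo_backup.sqlite"), "wb") as f:
        f.write(b"original")
    with open(os.path.join(tmp, "demo.sqlite"), "wb") as f:
        f.write(b"modified")
    restored = restore_backup(tmp, "demo")
    with open(restored, "rb") as f:
        restored_bytes = f.read()
```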


In [4]:
all_questions = [1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 13, 22, 28, 30, 32, 33, 
        34, 48, 49, 50, 51, 54, 55, 56, 78, 79, 89, 93, 118]

In [5]:
answer_cleansing.clear_all_at_once(all_questions)

The following questions are supported: 1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 13, 78, 79, 22, 89, 28, 93, 30, 32, 33, 34, 48, 49, 50, 51, 54, 55, 56, 118. Cleansing of these questions started. 
For following questions: 2, 4, 5, 8, 9, 13, 78, 79, 89, 28, 32, 34, 50, 51, 54, 55, 56.
"-1" has been changed to "n/a".
For following questions: 4.
"DC" has been changed to "Washington".
For following questions: 2.
"female" has been changed to "Female".
"MALE", "male" and "I have a penis" has been changed to "Male".
"Non binary", "Nonbinary" and "non-binary" has been changed to "Non-binary".
All answers with quantity below 5 have been changed to "Other".
All users (with their answers) for these below 18 and above 100 years old have been deleted.
For following questions: 32, 33.
"Maybe" has been changed to "Possibly".
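
After a run it is worth confirming that the replaced values are really gone. The sketch below counts leftover values for a list of questions. `count_leftovers` is a made-up helper name, and the check runs on an illustrative in-memory table; for the real data the connection would point at the cleaned `.sqlite` file instead.

```python
import sqlite3


def count_leftovers(conn, q_ids, value):
    """Count rows that still hold `value` for the given question ids."""
    if not q_ids:
        return 0
    placeholders = ", ".join("?" for _ in q_ids)  # one "?" per question id
    row = conn.execute(
        f"SELECT COUNT(*) FROM Answer "
        f"WHERE AnswerText = ? AND QuestionID IN ({placeholders})",
        (value, *q_ids),
    ).fetchone()
    return row[0]


# Illustrative in-memory table (hypothetical rows, not the real survey).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Answer (AnswerText TEXT, UserID INTEGER, QuestionID INTEGER)")
conn.executemany(
    "INSERT INTO Answer VALUES (?, ?, ?)",
    [("n/a", 1, 2), ("Possibly", 2, 32), ("Maybe", 3, 33), ("-1", 4, 1)],
)
leftover_minus_one = count_leftovers(conn, [2, 4, 5], "-1")
leftover_maybe = count_leftovers(conn, [32, 33], "Maybe")
conn.close()
```

A non-zero count, as for `"Maybe"` in this sample, points at a question whose mapping did not run or did not match.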
