# Mental Health in the Tech Industry Data Cleaning

Source: https://www.kaggle.com/datasets/anth7310/mental-health-in-the-tech-industry

Let's build a data cleaning class for future reference. It can be reused for next year's survey, but the new data should be reviewed first to check whether any adaptations are required.

The supported questions are: 1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 13, 22, 28, 30, 32, 33, 34, 48, 49, 50, 51, 54, 55, 56, 78, 79, 89, 93, 118

In [19]:
import sqlite3
import io
import shutil

In [29]:
class SurveyCleaning:
    ''' The class supports data cleaning for the Mental Health in the Tech Industry database 
        for the following questions: 1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 13, 22, 28, 30, 32, 33, 
        34, 48, 49, 50, 51, 54, 55, 56, 78, 79, 89, 93, 118 
        The changes are made directly to the provided database, so make sure a backup exists first.
        The backup method can be used for that.
        path is the directory where the database is stored, including the trailing slash, e.g. DB/ 
        db_name is the database file name without the .sqlite extension. '''
    
    def __init__(self, path, db_name):
        self.path = path
        self.db_name = db_name
        self.question_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 13, 22, 28, 30, 32, 33, 
        34, 48, 49, 50, 51, 54, 55, 56, 78, 79, 89, 93, 118]
        
    def backup(self):
        ''' Create a backup for the database. '''
        # shutil.copyfile copies the file contents (no metadata) from src to dst.
        # https://docs.python.org/3/library/shutil.html
        # Any error raised by the copy propagates to the caller.
        src = f'{self.path}{self.db_name}.sqlite'
        dst = f'{self.path}{self.db_name}_backup.sqlite'
        shutil.copyfile(src, dst)
        print(f'Backup saved successfully in {self.path} directory.')
        print(f'Saved as {self.db_name}_backup.sqlite')
    
    def change_table_value(self, column, v1, v2):
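        ''' In the Answer table, change every value v1 in the given column to v2.
            column is inserted into the SQL text as-is, so pass only trusted column names. '''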
        conn = sqlite3.connect(f'{self.path}{self.db_name}.sqlite')
        c = conn.cursor()
        c.execute(f"UPDATE Answer SET {column} = ? WHERE {column} = ?", (v2, v1))
        conn.commit()
        conn.close()
    
    def change_answer_value(self, q_ids, v1, v2):
        ''' Update the answer text for the questions listed in q_ids.
            v1 is the value to be replaced.
            v2 is the target value. '''
        # One '?' placeholder per question ID, e.g. '?, ?, ?' for three IDs.
        values_no = ', '.join('?' * len(q_ids))
        conn = sqlite3.connect(f'{self.path}{self.db_name}.sqlite')
        c = conn.cursor()
        c.execute(f"UPDATE Answer SET AnswerText = ? WHERE AnswerText = ? and QuestionID IN ({values_no})", (v2, v1, *q_ids))
        conn.commit()
        conn.close()
    
    def change_answer_for_q_smaller(self, q_id, t_value, qty):
        ''' Change the answer for question q_id to t_value wherever the answer text 
            occurs qty times or fewer (answers occurring more than qty times are kept). '''
        conn = sqlite3.connect(f'{self.path}{self.db_name}.sqlite')
        c = conn.cursor()
        c.execute(f'SELECT AnswerText, count(UserID) as UNo FROM Answer WHERE QuestionID = ? GROUP BY AnswerText HAVING UNo > ?', 
                  (q_id, qty))
        sel = c.fetchall()
        sel = [k for k, _ in sel]
        values_no = ('?, '*len(sel))[:-2]
        c.execute(f'UPDATE Answer SET AnswerText = ? WHERE QuestionID = ? and AnswerText NOT IN ({values_no})', 
                  (t_value, q_id, *sel))
        conn.commit()
        conn.close()
    
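    def delete_participant_out_of_age_range(self, age_range):
        ''' Delete the age answers (QuestionID = 1) that fall outside age_range,
            given as (min_age, max_age). Only the age rows are removed;
            the other answers of the same user are left untouched. '''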
        conn = sqlite3.connect(f'{self.path}{self.db_name}.sqlite')
        c = conn.cursor()
        c.execute(f"DELETE FROM Answer WHERE QuestionID = 1 and (CAST(AnswerText AS INTEGER) < ? or CAST(AnswerText AS INTEGER) > ?)", (age_range[0], age_range[1]))
        conn.commit()
        conn.close()
    
    def clear_all_at_once(self):
        ''' The cleaning follows the Overview conclusions, which are:
            Update -1 --> 'n/a' for questions 2, 4, 5, 8, 9, 13, 28, 32, 34, 50, 51, 54, 55, 56, 78, 79, 89
            Update 'DC' --> 'Washington' for question 4
            Update female --> Female; MALE, I have a penis --> Male; Non binary, Nonbinary --> Non-binary for question 2
            Update -1, 43, - and every answer given fewer than five times --> 'Other' for question 2
            Delete users whose age answer is below 18 or above 100. '''
        

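The rules listed in the `clear_all_at_once` docstring can be tried out on a toy table before touching the real database. The following is a minimal sketch, not the notebook's implementation: the in-memory table, its sample rows, and the helper `replace_answer` are hypothetical, the table holds only the three columns the class queries, and only a few of the listed rules are applied.

```python
import sqlite3

# Hypothetical miniature of the Answer table (assumption: only the columns
# used by SurveyCleaning are modelled here).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Answer (AnswerText TEXT, UserID INTEGER, QuestionID INTEGER)"
)
conn.executemany(
    "INSERT INTO Answer VALUES (?, ?, ?)",
    [
        ("17", 1, 1), ("35", 2, 1), ("323", 3, 1),          # question 1: age
        ("female", 1, 2), ("Female", 2, 2), ("MALE", 3, 2),  # question 2: gender
        ("DC", 1, 4), ("-1", 2, 4),                          # question 4: state
    ],
)


def replace_answer(conn, q_ids, old, new):
    """Hypothetical helper: replace old with new for the given question IDs."""
    placeholders = ", ".join("?" * len(q_ids))
    conn.execute(
        f"UPDATE Answer SET AnswerText = ? "
        f"WHERE AnswerText = ? AND QuestionID IN ({placeholders})",
        (new, old, *q_ids),
    )


# A subset of the rules from the docstring, limited to the questions in the toy data.
replace_answer(conn, [4], "-1", "n/a")
replace_answer(conn, [4], "DC", "Washington")
replace_answer(conn, [2], "female", "Female")
replace_answer(conn, [2], "MALE", "Male")

# Remove age answers outside the 18 to 100 range.
conn.execute(
    "DELETE FROM Answer WHERE QuestionID = 1 "
    "AND (CAST(AnswerText AS INTEGER) < ? OR CAST(AnswerText AS INTEGER) > ?)",
    (18, 100),
)
conn.commit()

result = conn.execute(
    "SELECT QuestionID, AnswerText FROM Answer ORDER BY QuestionID, UserID"
).fetchall()
print(result)
conn.close()
```

Running the rules against an in-memory copy like this makes it easy to inspect the outcome of each update before applying the same statements to the real file.
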
In [30]:
test = SurveyCleaning('DB/', 'mental_health')

In [31]:
test.backup()

Backup saved successfully in DB/ directory.
Saved as mental_health_backup.sqlite
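
As an alternative to copying the file with `shutil.copyfile`, the standard library's `sqlite3.Connection.backup` method (Python 3.7+) can copy a database through SQLite itself. This is a different technique from the one used in the `backup` method above, shown only as a sketch: the function name `backup_sqlite`, the temporary directory, and the demo table contents are assumptions made for illustration.

```python
import os
import sqlite3
import tempfile


def backup_sqlite(src_path, dst_path):
    """Hypothetical helper: copy a SQLite database with Connection.backup."""
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    try:
        src.backup(dst)  # copies every page of src into dst
    finally:
        dst.close()
        src.close()


# Demonstration on a throwaway database, not on the survey data.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "demo.sqlite")
dst_path = os.path.join(tmp_dir, "demo_backup.sqlite")

conn = sqlite3.connect(src_path)
conn.execute("CREATE TABLE Answer (AnswerText TEXT, QuestionID INTEGER)")
conn.execute("INSERT INTO Answer VALUES ('35', 1)")
conn.commit()
conn.close()

backup_sqlite(src_path, dst_path)

# Read the copy back to confirm the row arrived.
check = sqlite3.connect(dst_path)
copied_rows = check.execute("SELECT AnswerText, QuestionID FROM Answer").fetchall()
check.close()
print(copied_rows)
```

Reading the copy back, as in the last lines, is a cheap way to confirm that a backup is usable before any cleaning step modifies the original.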
