# Mental Health Analysis 

KATE expects your code to define variables with specific names that correspond to certain things we are interested in.

KATE will run your notebook from top to bottom and check the latest value of those variables, so make sure you don't overwrite them.

* Remember to uncomment the line assigning the variable to your answer and don't change the variable or function names.
* Use copies of the original or previous DataFrames to make sure you do not overwrite them by mistake.

You will find instructions below about how to define each variable.

Once you're happy with your code, upload your notebook to KATE to check your feedback.

# How to complete and submit
Each exercise will look something like this:

```python
example_query = ''
#example_result = pd.read_sql(example_query, conn)
```

In each exercise you will need to define a query variable by writing the SQL code that you think will solve the problem. Once you have your query, uncomment the second line; this executes the query and loads the resulting data into a DataFrame.

Nothing else needs to be changed in the 2nd line besides uncommenting it. 

After running the cell, inspect the result to check that it is what you expect. KATE will look for variables with the names defined in this notebook, so it is important not to rename them.
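
As a minimal sketch of that workflow, the block below writes a query string and runs it against a throwaway in-memory database. The `demo_conn` connection, the `Demo` table, and its rows are made up for illustration; in the exercises you would use the `conn` defined in the Setup section and the provided `pd.read_sql` line instead.

```python
import sqlite3

# Hypothetical in-memory database, used only so this sketch is self-contained.
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute('CREATE TABLE Demo (id INTEGER, label TEXT)')
demo_conn.executemany('INSERT INTO Demo VALUES (?, ?)', [(1, 'a'), (2, 'b'), (3, 'c')])

# Step 1: write the SQL inside the query string.
demo_query = '''
SELECT id, label
FROM Demo
WHERE id IN (1, 3)
ORDER BY id
'''

# Step 2: execute the query and look at what comes back.
demo_rows = demo_conn.execute(demo_query).fetchall()
print(demo_rows)  # [(1, 'a'), (3, 'c')]
```

The sketch uses its own variable names so that it cannot overwrite any of the variables KATE checks.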

Once you've completed the exercises, upload this notebook to **KATE** to get feedback. You can also upload a partially completed notebook; if you do, leave the `pd.read_sql` lines commented out for any exercise where you don't have a query yet.

Refer to the instructions on **KATE** for more details on the dataset.

# Introduction to the Mental Health dataset 

This dataset contains Open Source Mental Illness (OSMI) survey data.

It has been collected using surveys from 2014, 2016, 2017, 2018 and 2019.

The surveys are a way of understanding the mental health situation and the frequency of mental health disorders in the tech industry.

The dataset is available in SQLite format and can be downloaded from [here](https://www.kaggle.com/anth7310/mental-health-in-the-tech-industry).

Some preprocessing was performed before making the dataset available: similar questions were merged together, values for answers were made consistent (for example  1 == 1.0), and spelling errors were fixed. 
The raw data was processed using Python, SQL and Excel for cleaning and manipulation.


# Setup

The code below sets up a connection to the SQLite database.

**Do not change this code!** The `conn` variable will be used throughout the notebook to query the database.

In [1]:
import pandas as pd
import sqlite3

conn = sqlite3.connect('data/mental_health.sqlite')


# Queries

Now that we have created a connection to the database, let's make some queries.

The database contains three tables: Survey, Question, and Answer.

  1. **Survey**, containing columns:
    - `PRIMARY KEY INT SurveyID`
    - `TEXT Description`


  2. **Question**, containing columns: 
    - `PRIMARY KEY QuestionID`
    - `TEXT QuestionText`


  3. **Answer**, containing columns:
    - `PRIMARY/FOREIGN KEY SurveyID`
    - `PRIMARY KEY UserID`
    - `PRIMARY/FOREIGN KEY QuestionID`
    - `TEXT AnswerText`


SurveyID contains the survey year (2014, 2016, 2017, 2018, or 2019), and the same question can be used in multiple surveys.

The Answer table uses a composite key made up of several columns. SurveyID and QuestionID are also [`FOREIGN KEYS`](https://www.w3schools.com/sql/sql_foreignkey.asp) that reference the Survey and Question tables.

Some questions allow multiple answers, so the same user can appear more than once for a given QuestionID.

You can find more information [here](https://www.kaggle.com/anth7310/mental-health-in-the-tech-industry).
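
The structure described above can be sketched on a small scale. The block below builds a throwaway in-memory database with the same three table names, lists its tables and the columns of `Answer`, and shows how a multiple-answer question affects counts. The column types, the constraints, and every inserted row are assumptions made for illustration, not the dataset's actual definition or contents.

```python
import sqlite3

# Throwaway in-memory database: it never touches `conn` or the real data file.
# Column types and constraints are assumptions based on the description above.
schema_conn = sqlite3.connect(':memory:')
schema_conn.executescript('''
CREATE TABLE Survey (SurveyID INTEGER PRIMARY KEY, Description TEXT);
CREATE TABLE Question (QuestionID INTEGER PRIMARY KEY, QuestionText TEXT);
CREATE TABLE Answer (
    AnswerText TEXT,
    SurveyID INTEGER REFERENCES Survey (SurveyID),
    UserID INTEGER,
    QuestionID INTEGER REFERENCES Question (QuestionID)
);
INSERT INTO Survey VALUES (2016, 'made-up survey row');
INSERT INTO Question VALUES (5, 'made-up multiple-choice question');
-- User 1 gives two answers to the same question; user 2 gives one.
INSERT INTO Answer VALUES ('option A', 2016, 1, 5);
INSERT INTO Answer VALUES ('option B', 2016, 1, 5);
INSERT INTO Answer VALUES ('option A', 2016, 2, 5);
''')

# sqlite_master is SQLite's built-in catalogue of tables, indexes, and views.
table_names = [row[0] for row in schema_conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]

# PRAGMA table_info returns one row per column; index 1 holds the column name.
answer_columns = [row[1] for row in schema_conn.execute('PRAGMA table_info(Answer)')]

# Counting rows and counting distinct users differ when users answer twice.
counts = schema_conn.execute('''
SELECT COUNT(UserID), COUNT(DISTINCT UserID)
FROM Answer
WHERE QuestionID = 5
''').fetchone()

print(table_names)     # ['Answer', 'Question', 'Survey']
print(answer_columns)  # ['AnswerText', 'SurveyID', 'UserID', 'QuestionID']
print(counts)          # (3, 2)
```

The same two inspection queries should also work against the real `conn` if you want to look at the actual tables before writing your answers.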


**1. Write a SQL query that finds all the records within the Question table where the QuestionID is equal to 2 or 3. The columns should be called `Question` and `ID`**

In [22]:
# Add your code below
question_2_3_query = '''
SELECT
    QuestionText AS Question,
    QuestionID AS ID
FROM Question
WHERE QuestionID IN (2, 3)
'''
question_2_3_result = pd.read_sql(question_2_3_query, conn)
question_2_3_result

Unnamed: 0,Question,ID
0,What is your gender?,2
1,What country do you live in?,3


**2. Write a SQL query to retrieve the surveys from 2014 and 2017. The columns should be called `Year` and `Year_Description`**

In [13]:
# Add your code below
survey_years_query = '''
SELECT SurveyID AS Year, Description AS Year_Description
FROM Survey
WHERE SurveyID IN (2014, 2017)
'''
survey_years_result = pd.read_sql(survey_years_query, conn)
survey_years_result

Unnamed: 0,Year,Year_Description
0,2014,mental health survey for 2014
1,2017,mental health survey for 2017
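
When filtering on a numeric column such as SurveyID, an exact match (`=` or `IN`) and a pattern match (`LIKE`) are not always interchangeable. The sketch below illustrates the difference on a made-up table; the `match_conn` connection and the extra ID `20170` are hypothetical and do not exist in the real dataset, where both approaches happen to return the same rows.

```python
import sqlite3

# Throwaway in-memory table; the 20170 row is invented to show the difference.
match_conn = sqlite3.connect(':memory:')
match_conn.execute('CREATE TABLE Survey (SurveyID INTEGER PRIMARY KEY, Description TEXT)')
match_conn.executemany('INSERT INTO Survey VALUES (?, ?)', [
    (2014, 'made-up row'),
    (2017, 'made-up row'),
    (20170, 'hypothetical ID that merely contains the digits 2017'),
])

# LIKE compares the ID as text, so any ID containing "2017" matches.
like_ids = [row[0] for row in match_conn.execute(
    "SELECT SurveyID FROM Survey WHERE SurveyID LIKE '%2017%' ORDER BY SurveyID")]

# IN compares the ID as a number, so only exact matches are returned.
in_ids = [row[0] for row in match_conn.execute(
    'SELECT SurveyID FROM Survey WHERE SurveyID IN (2014, 2017) ORDER BY SurveyID')]

print(like_ids)  # [2017, 20170]
print(in_ids)    # [2014, 2017]
```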


**3. Write a SQL query to find out how many answers in total have been given throughout the years. Your result should contain one column, called `answers_count`**

In [23]:
# Add your code below
number_of_answers_query = '''
SELECT 
    COUNT(QuestionID) AS answers_count
FROM Answer
'''
number_of_answers_result = pd.read_sql(number_of_answers_query, conn)
number_of_answers_result

Unnamed: 0,answers_count
0,236898


**4. Write a SQL query to find out how many answers have been given in 2017 and 2019. Your result should contain one column, called `answers_count`**

In [26]:
# Add your code below
number_of_answers_17_19_query = '''
SELECT
    COUNT(QuestionID) AS answers_count
FROM Answer
WHERE SurveyID IN (2017, 2019)
'''
number_of_answers_17_19_result = pd.read_sql(number_of_answers_17_19_query, conn)
number_of_answers_17_19_result

Unnamed: 0,answers_count
0,84208


**5. Write a SQL query to extract the first 100 answers for the year 2014. Your result should contain one column (the answer text)**

In [32]:
# Add your code below
answer_2014_query = '''
SELECT 
    AnswerText
FROM Answer
WHERE SurveyID = 2014
LIMIT 100
''' 
answer_2014_result = pd.read_sql(answer_2014_query, conn)
answer_2014_result

Unnamed: 0,AnswerText
0,37
1,44
2,32
3,31
4,31
...,...
95,29
96,24
97,31
98,33


**6. For each year of the survey, how many questions have been asked? Return a table containing the survey year and the number of unique questions that have been asked for each year. Call the survey year column `year` and the second column `survey_answers`**

In [40]:
# Add your code below
answer_per_survey_query = '''
SELECT 
    SurveyID AS year,
    COUNT(DISTINCT QuestionID) AS survey_answers
FROM Answer
GROUP BY SurveyID
'''

answer_per_survey_result = pd.read_sql(answer_per_survey_query, conn)
answer_per_survey_result

Unnamed: 0,year,survey_answers
0,2014,26
1,2016,60
2,2017,76
3,2018,76
4,2019,76
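
The difference between counting rows and counting unique questions within each group can be sketched on a small made-up table. The `group_conn` connection and its six rows are assumptions for illustration only.

```python
import sqlite3

# Throwaway in-memory table with invented rows (SurveyID, UserID, QuestionID).
group_conn = sqlite3.connect(':memory:')
group_conn.execute('CREATE TABLE Answer (SurveyID INTEGER, UserID INTEGER, QuestionID INTEGER)')
group_conn.executemany('INSERT INTO Answer VALUES (?, ?, ?)', [
    (2014, 1, 1), (2014, 2, 1), (2014, 1, 2),  # question 1 answered by two users
    (2016, 3, 1), (2016, 3, 2), (2016, 4, 3),  # three different questions
])

group_query = '''
SELECT
    SurveyID AS year,
    COUNT(QuestionID) AS all_rows,
    COUNT(DISTINCT QuestionID) AS unique_questions
FROM Answer
GROUP BY SurveyID
ORDER BY SurveyID
'''
group_rows = group_conn.execute(group_query).fetchall()
print(group_rows)  # [(2014, 3, 2), (2016, 3, 3)]
```

Both years have three answer rows, but 2014 only covers two distinct questions, which is why `DISTINCT` is needed when the exercise asks for unique questions.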


**7. Select the maximum age of the participants for each survey year. Return a table containing the survey year and the maximum age of participants for that year. Your result should contain two columns: one called `year` and one called `max_age`**

**Hint**: Have a look at the Question table first to find which question asks participants about their age.
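
One possible way to follow this hint is to search the question text for a keyword. The sketch below does so on a made-up copy of the Question table; the `lookup_conn` connection is hypothetical, and the wording of question 1 is an assumption (questions 2 and 3 are taken from the result of exercise 1).

```python
import sqlite3

# Throwaway in-memory copy of a few Question rows.
lookup_conn = sqlite3.connect(':memory:')
lookup_conn.execute('CREATE TABLE Question (QuestionID INTEGER PRIMARY KEY, QuestionText TEXT)')
lookup_conn.executemany('INSERT INTO Question VALUES (?, ?)', [
    (1, 'What is your age?'),  # assumed wording
    (2, 'What is your gender?'),
    (3, 'What country do you live in?'),
])

# LIKE with % wildcards finds the keyword anywhere in the question text.
lookup_query = '''
SELECT QuestionID, QuestionText
FROM Question
WHERE QuestionText LIKE '%age%'
'''
matches = lookup_conn.execute(lookup_query).fetchall()
print(matches)  # [(1, 'What is your age?')]
```

Against the real table a keyword search may return several candidate questions, so read the matching text before settling on a QuestionID.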

In [53]:
# Add your code below
max_age_query = '''
SELECT
    SurveyID AS year,
    MAX(CAST(AnswerText AS INT)) AS max_age
FROM Answer
WHERE QuestionID = 1
GROUP BY SurveyID
'''
max_age_result = pd.read_sql(max_age_query, conn)
max_age_result

Unnamed: 0,year,max_age
0,2014,329
1,2016,323
2,2017,67
3,2018,67
4,2019,64
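
Because AnswerText is a TEXT column, converting it to a number before taking the maximum matters: text values are compared character by character rather than numerically. The sketch below shows the difference on three made-up answers; the `cast_conn` connection and its values are assumptions for illustration.

```python
import sqlite3

# Throwaway in-memory table holding ages as text, as AnswerText does.
cast_conn = sqlite3.connect(':memory:')
cast_conn.execute('CREATE TABLE Answer (AnswerText TEXT)')
cast_conn.executemany('INSERT INTO Answer VALUES (?)', [('9',), ('37',), ('100',)])

# Without CAST the comparison is textual, and '9' sorts after '37' and '100'.
text_max = cast_conn.execute('SELECT MAX(AnswerText) FROM Answer').fetchone()[0]

# With CAST the values are compared as integers.
int_max = cast_conn.execute(
    'SELECT MAX(CAST(AnswerText AS INTEGER)) FROM Answer').fetchone()[0]

print(repr(text_max))  # '9'
print(int_max)         # 100
```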


**8. Write a SQL query that finds out how many people always, never, or sometimes work remotely. Your result should have one column called `answer`, and one called `count`**

**Hint**: Have a look at the Question table first to find which question asks participants about how often they work remotely. Note that always, never, and sometimes are the three possible answers.

In [67]:
# Add your code below
work_remotely_query = '''
SELECT
    AnswerText AS answer,
    COUNT(UserID) AS count
FROM Answer
WHERE QuestionID = 118
GROUP BY AnswerText
'''
work_remotely_result = pd.read_sql(work_remotely_query, conn)
work_remotely_result

Unnamed: 0,answer,count
0,Always,343
1,Never,333
2,Sometimes,757


**9. Write a SQL query that returns the given age of 2016 survey participants as well as the count of participants for each age. Call the age column `participant_age` and the count column `number_of_participants`**

In [73]:
# Add your code below
age_freq_query = '''
SELECT
    AnswerText AS participant_age,
    COUNT(UserID) AS number_of_participants
FROM Answer
WHERE SurveyID = 2016
AND QuestionID = 1
GROUP BY participant_age
'''
age_freq_result = pd.read_sql(age_freq_query, conn)
age_freq_result

Unnamed: 0,participant_age,number_of_participants
0,15,1
1,17,1
2,19,4
3,20,6
4,21,15
5,22,32
6,23,24
7,24,42
8,25,44
9,26,64


**10. Now let's make Question 6 a little bit more complicated and sort the results by year in descending order. Call the survey year column `year` and the count column `survey_answers`**

In [75]:
# Add your code below
answer_per_survey_advanced_query = '''
SELECT 
    SurveyID AS year,
    COUNT(DISTINCT QuestionID) AS survey_answers
FROM Answer
GROUP BY SurveyID
ORDER BY year DESC
'''
answer_per_survey_advanced_result = pd.read_sql(answer_per_survey_advanced_query, conn)
answer_per_survey_advanced_result

Unnamed: 0,year,survey_answers
0,2019,76
1,2018,76
2,2017,76
3,2016,60
4,2014,26
