# Statistical Analysis Experiment for Depression

This notebook produces the results for the statistical analysis experiment of my final year project, "<i>Statistical Analysis for Detection Depression</i>". The statistical method used in this project is descriptive analysis.<br><br>

Descriptive analysis covers two kinds of measure: <mark><b>central tendency</b></mark> and <mark><b>variability</b></mark>. This project uses the mean, median and mode for central tendency, reports the minimum and maximum alongside them to show the range of the data, and uses the standard deviation and variance for variability.<br><br>

Deliverable requirements for statistical analysis experiment:
- [x] Use descriptive analysis on the MADRS score data to analyse the central tendency and variability of the number of days all patients have measured
- [ ] Split the MADRS score data into two dataframes, one for depressed and one for non-depressed patients, and use descriptive analysis to analyse the central tendency and variability of the number of days each group measured
- [ ] Calculate the difference in the MADRS score for the depressed patients
- [ ] Use descriptive analysis to analyse the MADRS score at the start and end of measurement and its difference for patients with depression
- [ ] Use data visualisation to show the MADRS score for all depressed patients
- [ ] Use data visualisation to show the number of days that depressed and non-depressed patients have measured

## Initial stages

First, import the relevant libraries, load the <code>scores.csv</code> file into a DataFrame using pandas, and display the data.

In [1]:
# Import the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# Load the scores.csv file and display the data
scores_df = pd.read_csv("data/scores.csv")
scores_df

Unnamed: 0,number,days,gender,age,afftype,melanch,inpatient,edu,marriage,work,madrs1,madrs2
0,condition_1,11,2,35-39,2.0,2.0,2.0,6-10,1.0,2.0,19.0,19.0
1,condition_2,18,2,40-44,1.0,2.0,2.0,6-10,2.0,2.0,24.0,11.0
2,condition_3,13,1,45-49,2.0,2.0,2.0,6-10,2.0,2.0,24.0,25.0
3,condition_4,13,2,25-29,2.0,2.0,2.0,11-15,1.0,1.0,20.0,16.0
4,condition_5,13,2,50-54,2.0,2.0,2.0,11-15,2.0,2.0,26.0,26.0
5,condition_6,7,1,35-39,2.0,2.0,2.0,6-10,1.0,2.0,18.0,15.0
6,condition_7,11,1,20-24,1.0,,2.0,11-15,2.0,1.0,24.0,25.0
7,condition_8,5,2,25-29,2.0,,2.0,11-15,1.0,2.0,20.0,16.0
8,condition_9,13,2,45-49,1.0,,2.0,6-10,1.0,2.0,26.0,26.0
9,condition_10,9,2,45-49,2.0,2.0,2.0,6-10,1.0,2.0,28.0,21.0


In [3]:
# Display the first 5 rows
scores_df.head()

Unnamed: 0,number,days,gender,age,afftype,melanch,inpatient,edu,marriage,work,madrs1,madrs2
0,condition_1,11,2,35-39,2.0,2.0,2.0,6-10,1.0,2.0,19.0,19.0
1,condition_2,18,2,40-44,1.0,2.0,2.0,6-10,2.0,2.0,24.0,11.0
2,condition_3,13,1,45-49,2.0,2.0,2.0,6-10,2.0,2.0,24.0,25.0
3,condition_4,13,2,25-29,2.0,2.0,2.0,11-15,1.0,1.0,20.0,16.0
4,condition_5,13,2,50-54,2.0,2.0,2.0,11-15,2.0,2.0,26.0,26.0


In [4]:
# Display the last 5 rows
scores_df.tail()

Unnamed: 0,number,days,gender,age,afftype,melanch,inpatient,edu,marriage,work,madrs1,madrs2
50,control_28,16,2,45-49,,,,,,,,
51,control_29,13,2,50-54,,,,,,,,
52,control_30,9,2,35-39,,,,,,,,
53,control_31,13,1,20-24,,,,,,,,
54,control_32,14,2,25-29,,,,,,,,


In the <code>scores.csv</code> file, there are 55 patients in total. 23 patients are depressed (labelled <mark><b>condition</b></mark>) and 32 patients are non-depressed (labelled <mark><b>control</b></mark>).

In [5]:
# Summarise the columns, their non-null counts and their dtypes
scores_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 12 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   number     55 non-null     object 
 1   days       55 non-null     int64  
 2   gender     55 non-null     int64  
 3   age        55 non-null     object 
 4   afftype    23 non-null     float64
 5   melanch    20 non-null     float64
 6   inpatient  23 non-null     float64
 7   edu        53 non-null     object 
 8   marriage   23 non-null     float64
 9   work       23 non-null     float64
 10  madrs1     23 non-null     float64
 11  madrs2     23 non-null     float64
dtypes: float64(7), int64(2), object(3)
memory usage: 5.3+ KB


As the summary above shows, the majority of the columns have missing values. For this statistical experiment, however, the main focus is on the <code>days</code>, <code>madrs1</code> and <code>madrs2</code> columns. We want to know the central tendency and variability of the number of days patients have measured, and of the MADRS scores for depressed patients, in order to establish a baseline understanding of the data.
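The extent of the missing data can be read directly from the `info()` output. The following is a minimal sketch, assuming the non-null counts are copied by hand from that output rather than computed from `scores.csv`; the names `non_null`, `total_rows` and `missing` are introduced here for illustration only.

```python
import pandas as pd

# Non-null counts copied from the scores_df.info() output above
non_null = pd.Series(
    {
        "number": 55, "days": 55, "gender": 55, "age": 55,
        "afftype": 23, "melanch": 20, "inpatient": 23, "edu": 53,
        "marriage": 23, "work": 23, "madrs1": 23, "madrs2": 23,
    }
)
total_rows = 55  # RangeIndex: 55 entries

# Missing values per column = total rows minus non-null entries
missing = total_rows - non_null

# Show only the columns that actually have gaps
print(missing[missing > 0])

# On the loaded DataFrame, the same figures should come from:
# scores_df.isna().sum()
```

The `days` column has no gaps, while `madrs1` and `madrs2` are populated only for the 23 depressed patients, which is why the MADRS analysis is restricted to that group.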

## Central Tendency and Variability for the number of days all patients measured their activity

For this statistical analysis experiment, I will use descriptive analysis to examine the central tendency and variability of the number of days all patients have measured their activity.<br>

In this deliverable, the <code>days</code> column is the one we are focusing on, as we want to know the mean, mode, median, minimum, maximum, standard deviation and variance for all patients.

In [6]:
# Index 0 is condition_1 and index 22 is condition_23
# Index 23 is control_1 and index 54 is control_32
# First look at the number of days all patients measure their activity
scores_df["days"]

0     11
1     18
2     13
3     13
4     13
5      7
6     11
7      5
8     13
9      9
10    14
11    12
12    14
13    14
14    13
15    16
16    13
17    13
18    13
19    13
20    13
21    14
22    16
23     8
24    20
25    12
26    13
27    13
28    13
29    13
30    13
31    13
32     8
33    13
34    14
35    13
36    13
37    11
38    13
39     9
40    13
41    13
42    13
43     8
44    13
45    13
46    13
47    13
48    13
49    13
50    16
51    13
52     9
53    13
54    14
Name: days, dtype: int64

In [7]:
# Count of the number of days all patients measure their activity
scores_df["days"].value_counts()

13    31
14     6
11     3
9      3
16     3
8      3
12     2
18     1
7      1
5      1
20     1
Name: days, dtype: int64
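
The counts above can also be expressed as shares of the 55 patients. This is a minimal sketch, assuming the `days` values are rebuilt from the counts printed above instead of being read from `scores.csv`; `counts` and `shares` are names introduced here for illustration.

```python
import pandas as pd

# Frequency table copied from the value_counts() output above
counts = {13: 31, 14: 6, 11: 3, 9: 3, 16: 3, 8: 3, 12: 2, 18: 1, 7: 1, 5: 1, 20: 1}

# Rebuild the 55 individual values from the frequency table
days = pd.Series([d for d, n in counts.items() for _ in range(n)], name="days")

# normalize=True returns relative frequencies instead of raw counts
shares = days.value_counts(normalize=True)
print(shares.head(3).round(3))
```

With the real DataFrame, `scores_df["days"].value_counts(normalize=True)` should give the same proportions.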

In [8]:
# Mean (average) number of days of all patients
scores_df["days"].mean()

12.6

In [9]:
# Mode (most common) number of days of all patients
scores_df["days"].mode()

0    13
Name: days, dtype: int64

In [10]:
# Median (middle) number of days of all patients
scores_df["days"].median()

13.0

In [11]:
# Minimum number of days of all patients
scores_df["days"].min()

5

In [12]:
# Maximum number of days of all patients
scores_df["days"].max()

20

In [13]:
# Standard deviation of number of days of all patients
scores_df["days"].std()

2.4914669187864833

In [14]:
# Variance of number of days of all patients
scores_df["days"].var()

6.2074074074074135
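
The standard deviation and variance above are sample statistics: pandas divides by n - 1 by default (`ddof=1`), whereas NumPy's `np.var` and `np.std` divide by n by default (`ddof=0`). The sketch below illustrates the difference, assuming the `days` values are rebuilt from the frequency table shown earlier; the variable names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the 55 "days" values from the value_counts() output
values = [13, 14, 11, 9, 16, 8, 12, 18, 7, 5, 20]
freqs = [31, 6, 3, 3, 3, 3, 2, 1, 1, 1, 1]
days = pd.Series(np.repeat(values, freqs), name="days")

sample_var = days.var()               # pandas default: ddof=1, divides by n - 1
population_var = days.var(ddof=0)     # divides by n
numpy_var = np.var(days.to_numpy())   # NumPy default: ddof=0

print(sample_var, population_var, numpy_var)
```

The sample variance matches the value reported above, while the population variance is slightly smaller. Which one is appropriate depends on whether the 55 patients are treated as a sample or as the whole population of interest.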

In conclusion, the results for the number of days all patients have measured their activity are:
* Mean = 12.6
* Mode = 13
* Median = 13
* Minimum = 5
* Maximum = 20
* Standard deviation = 2.4914669187864833 (approx. 2.5)
* Variance = 6.2074074074074135 (approx. 6.2)

The most common duration was 13 days, recorded for 31 of the 55 patients. Across all 55 patients, the shortest measurement period was 5 days and the longest was 20 days.
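
The individual cells above can be condensed into a single call. This is a minimal sketch, not the method used in this notebook, and it assumes the `days` values are rebuilt from the frequency table instead of being loaded from `scores.csv`; `summary` and `mode_value` are illustrative names.

```python
import numpy as np
import pandas as pd

# Rebuild the 55 "days" values from the value_counts() output
values = [13, 14, 11, 9, 16, 8, 12, 18, 7, 5, 20]
freqs = [31, 6, 3, 3, 3, 3, 2, 1, 1, 1, 1]
days = pd.Series(np.repeat(values, freqs), name="days")

# Series.agg with a list of function names returns one Series of results
summary = days.agg(["mean", "median", "min", "max", "std", "var"])

# mode() returns a Series because ties are possible; take the first entry
mode_value = days.mode().iat[0]

print(summary.round(2))
print("mode:", mode_value)
```

On the loaded data, `scores_df["days"].agg([...])` or `scores_df["days"].describe()` should give the same mean, minimum, maximum and standard deviation.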

## Central Tendency and Variability for the number of days depressed and non-depressed patients measured their activity separately

For this statistical analysis experiment, I will use descriptive analysis to examine the number of days depressed and non-depressed patients have measured their activity separately.<br>

In this deliverable, the <code>days</code> column is the one we are focusing on, as we want to know the mean, mode, median, minimum, maximum, standard deviation and variance for both categories separately. This is done by splitting the data into two dataframes: one for the depressed group and the other for the non-depressed group of patients.

### Number of days for depressed patients

We will filter the DataFrame loaded from <code>scores.csv</code> so that it contains the depressed patients only. The depressed patients are labelled as <code>condition</code>.

In [15]:
# Keep only the rows whose patient label contains 'condition'
scores_condition_df = scores_df[scores_df.number.str.contains('condition')]
scores_condition_df

Unnamed: 0,number,days,gender,age,afftype,melanch,inpatient,edu,marriage,work,madrs1,madrs2
0,condition_1,11,2,35-39,2.0,2.0,2.0,6-10,1.0,2.0,19.0,19.0
1,condition_2,18,2,40-44,1.0,2.0,2.0,6-10,2.0,2.0,24.0,11.0
2,condition_3,13,1,45-49,2.0,2.0,2.0,6-10,2.0,2.0,24.0,25.0
3,condition_4,13,2,25-29,2.0,2.0,2.0,11-15,1.0,1.0,20.0,16.0
4,condition_5,13,2,50-54,2.0,2.0,2.0,11-15,2.0,2.0,26.0,26.0
5,condition_6,7,1,35-39,2.0,2.0,2.0,6-10,1.0,2.0,18.0,15.0
6,condition_7,11,1,20-24,1.0,,2.0,11-15,2.0,1.0,24.0,25.0
7,condition_8,5,2,25-29,2.0,,2.0,11-15,1.0,2.0,20.0,16.0
8,condition_9,13,2,45-49,1.0,,2.0,6-10,1.0,2.0,26.0,26.0
9,condition_10,9,2,45-49,2.0,2.0,2.0,6-10,1.0,2.0,28.0,21.0


Then we will analyse the central tendency and variability for the depressed patients.
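
Before computing the statistics, the split itself can be sanity-checked. The sketch below is an alternative to the filter used in this notebook, shown on five rows copied from the tables in this notebook; `demo`, `is_condition`, `condition_df` and `control_df` are illustrative names. It uses `str.startswith`, which anchors the match at the start of the label, whereas `str.contains` matches anywhere in the string and treats the pattern as a regular expression by default.

```python
import pandas as pd

# Five rows copied from the tables in this notebook, for illustration only
demo = pd.DataFrame(
    {
        "number": ["condition_1", "condition_2", "control_1", "control_2", "control_3"],
        "days": [11, 18, 8, 20, 12],
    }
)

# Boolean mask: True for depressed patients, False for controls
is_condition = demo["number"].str.startswith("condition")

# .copy() makes each subset independent of demo; reset_index restarts at 0
condition_df = demo[is_condition].copy().reset_index(drop=True)
control_df = demo[~is_condition].copy().reset_index(drop=True)

# Every row should land in exactly one of the two groups
assert len(condition_df) + len(control_df) == len(demo)
print(len(condition_df), len(control_df))
```

Applied to the full data, the same check should report 23 depressed and 32 non-depressed patients.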

In [16]:
# First look at the number of days depressed patients measure their activity
scores_condition_df["days"]

0     11
1     18
2     13
3     13
4     13
5      7
6     11
7      5
8     13
9      9
10    14
11    12
12    14
13    14
14    13
15    16
16    13
17    13
18    13
19    13
20    13
21    14
22    16
Name: days, dtype: int64

In [17]:
# Count of the number of days depressed patients measure their activity
scores_condition_df["days"].value_counts()

13    10
14     4
11     2
16     2
18     1
7      1
5      1
9      1
12     1
Name: days, dtype: int64

In [18]:
# Mean (average) number of days depressed patients measure their activity
scores_condition_df["days"].mean()

12.652173913043478

In [19]:
# Mode (most common) number of days depressed patients measure their activity
scores_condition_df["days"].mode()

0    13
Name: days, dtype: int64

In [20]:
# Median (middle) number of days depressed patients measure their activity
scores_condition_df["days"].median()

13.0

In [21]:
# Minimum number of days depressed patients measure their activity
scores_condition_df["days"].min()

5

In [22]:
# Maximum number of days depressed patients measure their activity
scores_condition_df["days"].max()

18

In [23]:
# Standard deviation of number of days depressed patients measure their activity
scores_condition_df["days"].std()

2.773391354414858

In [24]:
# Variance of number of days depressed patients measure their activity
scores_condition_df["days"].var()

7.691699604743081

The results for the number of days depressed patients have measured their activity are:
* Mean = 12.652173913043478 (approx. 12.7)
* Mode = 13
* Median = 13
* Minimum = 5
* Maximum = 18
* Standard deviation = 2.773391354414858 (approx. 2.8)
* Variance = 7.691699604743081 (approx. 7.7)

The most common duration among depressed patients was 13 days, recorded for 10 of the 23 patients. Across all 23 depressed patients, the shortest measurement period was 5 days and the longest was 18 days.

### Number of days for non-depressed patients

Next, we will filter the DataFrame loaded from <code>scores.csv</code> so that it contains the non-depressed patients only, and reset the index so that it starts from 0. The non-depressed patients are labelled as <code>control</code>.

In [25]:
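# Keep only the rows whose patient label contains 'control'
scores_control_df = scores_df[scores_df.number.str.contains('control')]
# Renumber the rows from 0 and discard the old index
scores_control_df = scores_control_df.reset_index(drop=True)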
scores_control_df

Unnamed: 0,number,days,gender,age,afftype,melanch,inpatient,edu,marriage,work,madrs1,madrs2
0,control_1,8,2,25-29,,,,,,,,
1,control_2,20,1,30-34,,,,,,,,
2,control_3,12,2,30-34,,,,,,,,
3,control_4,13,1,25-29,,,,,,,,
4,control_5,13,1,30-34,,,,,,,,
5,control_6,13,1,25-29,,,,,,,,
6,control_7,13,1,20-24,,,,,,,,
7,control_8,13,2,40-44,,,,,,,,
8,control_9,13,2,30-34,,,,,,,,
9,control_10,8,1,30-34,,,,,,,,


In [26]:
# First look at the number of days non-depressed patients measure their activity
scores_control_df["days"]

0      8
1     20
2     12
3     13
4     13
5     13
6     13
7     13
8     13
9      8
10    13
11    14
12    13
13    13
14    11
15    13
16     9
17    13
18    13
19    13
20     8
21    13
22    13
23    13
24    13
25    13
26    13
27    16
28    13
29     9
30    13
31    14
Name: days, dtype: int64

In [27]:
# Count of the number of days non-depressed patients measure their activity
scores_control_df["days"].value_counts()

13    21
8      3
14     2
9      2
20     1
12     1
11     1
16     1
Name: days, dtype: int64

In [28]:
# Mean (average) number of days non-depressed patients measure their activity
scores_control_df["days"].mean()

12.5625

In [29]:
# Mode (most common) number of days non-depressed patients measure their activity
scores_control_df["days"].mode()

0    13
Name: days, dtype: int64

In [30]:
# Median (middle) number of days non-depressed patients measure their activity
scores_control_df["days"].median()

13.0

In [31]:
# Minimum number of days non-depressed patients measure their activity
scores_control_df["days"].min()

8

In [32]:
# Maximum number of days non-depressed patients measure their activity
scores_control_df["days"].max()

20

In [33]:
# Standard deviation of number of days non-depressed patients measure their activity
scores_control_df["days"].std()

2.313181024393228

In [34]:
# Variance of number of days non-depressed patients measure their activity
scores_control_df["days"].var()

5.350806451612903

The results for the number of days non-depressed patients have measured their activity are:
* Mean = 12.5625 (approx. 12.6)
* Mode = 13
* Median = 13
* Minimum = 8
* Maximum = 20
* Standard deviation = 2.313181024393228 (approx. 2.3)
* Variance = 5.350806451612903 (approx. 5.4)

The most common duration among non-depressed patients was 13 days, recorded for 21 of the 32 patients. Across all 32 non-depressed patients, the shortest measurement period was 8 days and the longest was 20 days.
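
The two sets of results can be placed side by side with a single `groupby`. This is a minimal sketch under the assumption that the `days` values for each group are rebuilt from the two `value_counts()` outputs above; the `group` column and the names `combined` and `comparison` are introduced here for illustration and are not part of `scores.csv`.

```python
import numpy as np
import pandas as pd

# Rebuild each group's "days" values from its value_counts() output
condition_days = np.repeat([13, 14, 11, 16, 18, 7, 5, 9, 12],
                           [10, 4, 2, 2, 1, 1, 1, 1, 1])
control_days = np.repeat([13, 8, 14, 9, 20, 12, 11, 16],
                         [21, 3, 2, 2, 1, 1, 1, 1])

# Stack both groups into one frame with a hypothetical "group" label
combined = pd.DataFrame(
    {
        "group": ["condition"] * len(condition_days) + ["control"] * len(control_days),
        "days": np.concatenate([condition_days, control_days]),
    }
)

# One row per group, one column per statistic
comparison = combined.groupby("group")["days"].agg(
    ["count", "mean", "median", "min", "max", "std", "var"]
)
print(comparison.round(2))
```

Both groups share a median and mode of 13 days and have similar means. The depressed group shows the larger spread, with a sample variance of about 7.7 against about 5.4 for the non-depressed group.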