# Mental Health Analysis & Prediction

In this project, we will work with a dataset on the distribution of mental disorders in different countries. The data includes the percentage of the population affected by five disorders: schizophrenia, bipolar disorder, depression, anxiety, and eating disorders. We will preprocess the data and apply machine learning algorithms to analyze and make predictions on this important mental health issue.

## Data Visualization & Understanding👀

**Workspace configuration** 🌏
1. Add the *mental-illnesses-prevalence.csv* file to your current workspace (the cell below downloads it as *mental.csv*)
2. Check that the file is correctly placed

In [62]:
!curl https://raw.githubusercontent.com/Angelaruizalvarez/Mental-Health-ML-Algorithms/main/mental-illnesses-prevalence.csv > mental.csv



In [63]:
import os

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data from the workspace
df = pd.read_csv("mental.csv")

print(os.listdir())  # Shows files in the current folder

['.config', 'deviation_boolean.csv', 'entity_code_mapping.csv', 'df_countries.csv', 'df_income.csv', 'df_continents.csv', 'deviation_errors.csv', 'mental.csv', 'sample_data']


**Data Visualization**👀



In [64]:
# Preview the dataset and its columns
print("\nFirst rows of the DataFrame:")
df.head()


First rows of the DataFrame:


Unnamed: 0,Entity,Code,Year,Schizophrenia disorders (share of population) - Sex: Both - Age: Age-standardized,Depressive disorders (share of population) - Sex: Both - Age: Age-standardized,Anxiety disorders (share of population) - Sex: Both - Age: Age-standardized,Bipolar disorders (share of population) - Sex: Both - Age: Age-standardized,Eating disorders (share of population) - Sex: Both - Age: Age-standardized
0,Afghanistan,AFG,1990,0.223206,4.996118,4.713314,0.703023,0.1277
1,Afghanistan,AFG,1991,0.222454,4.98929,4.7021,0.702069,0.123256
2,Afghanistan,AFG,1992,0.221751,4.981346,4.683743,0.700792,0.118844
3,Afghanistan,AFG,1993,0.220987,4.976958,4.673549,0.700087,0.115089
4,Afghanistan,AFG,1994,0.220183,4.977782,4.67081,0.699898,0.111815


In [65]:
# General info of the dataset (columns, data types, null values)
print("\nDataset info:")
df.info(show_counts=True)


Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6420 entries, 0 to 6419
Data columns (total 8 columns):
 #   Column                                                                             Non-Null Count  Dtype  
---  ------                                                                             --------------  -----  
 0   Entity                                                                             6420 non-null   object 
 1   Code                                                                               6150 non-null   object 
 2   Year                                                                               6420 non-null   int64  
 3   Schizophrenia disorders (share of population) - Sex: Both - Age: Age-standardized  6420 non-null   float64
 4   Depressive disorders (share of population) - Sex: Both - Age: Age-standardized     6420 non-null   float64
 5   Anxiety disorders (share of population) - Sex: Both - Age: Age-standardized        6420 no

In [66]:
# Data Formats
print("\nData formats:")
df.dtypes


Data formats:


Unnamed: 0,0
Entity,object
Code,object
Year,int64
Schizophrenia disorders (share of population) - Sex: Both - Age: Age-standardized,float64
Depressive disorders (share of population) - Sex: Both - Age: Age-standardized,float64
Anxiety disorders (share of population) - Sex: Both - Age: Age-standardized,float64
Bipolar disorders (share of population) - Sex: Both - Age: Age-standardized,float64
Eating disorders (share of population) - Sex: Both - Age: Age-standardized,float64


In [67]:
# Descriptive statistics of numerical columns
print("\nDescriptive statistics:")
df.describe()


Descriptive statistics:


Unnamed: 0,Year,Schizophrenia disorders (share of population) - Sex: Both - Age: Age-standardized,Depressive disorders (share of population) - Sex: Both - Age: Age-standardized,Anxiety disorders (share of population) - Sex: Both - Age: Age-standardized,Bipolar disorders (share of population) - Sex: Both - Age: Age-standardized,Eating disorders (share of population) - Sex: Both - Age: Age-standardized
count,6420.0,6420.0,6420.0,6420.0,6420.0,6420.0
mean,2004.5,0.266604,3.767036,4.10184,0.636968,0.195664
std,8.656116,0.039383,0.925286,1.050543,0.233391,0.13838
min,1990.0,0.188416,1.522333,1.879996,0.181667,0.04478
25%,1997.0,0.242267,3.080036,3.425846,0.520872,0.096416
50%,2004.5,0.273477,3.636772,3.939547,0.579331,0.14415
75%,2012.0,0.286575,4.366252,4.564164,0.844406,0.251167
max,2019.0,0.462045,7.645899,8.624634,1.50673,1.031688


## Data Cleaning & Preparation✅

After a first look at the data, and before working with it, we must clean and preprocess it. We will look for null values and fill or discard them, drop redundant information, correct typos, and prepare the dataset for the prediction models that follow.


### Null values & duplicated rows✅

In [68]:
# Count duplicated rows (they would be dropped if any existed)
duplicated = df.duplicated().sum()
print(f"\nNumber of duplicated rows: {duplicated}")


Number of duplicated rows: 0
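
The dataset has no duplicated rows, so nothing is dropped. If the count had been non-zero, the discard step could look like the following minimal sketch. The `toy` frame and its values are hypothetical and only stand in for `df`.

```python
import pandas as pd

# Toy data (hypothetical values) with one exact duplicate row
toy = pd.DataFrame({
    "Code": ["AFG", "AFG", "ALB"],
    "Year": [1990, 1990, 1990],
    "Depressive disorders": [4.99, 4.99, 2.21],
})

# Rows that are identical to an earlier row
n_duplicated = int(toy.duplicated().sum())

# Keep the first occurrence of each row and renumber the index
toy_clean = toy.drop_duplicates().reset_index(drop=True)

print(f"Duplicated rows: {n_duplicated}")
print(f"Rows after cleaning: {len(toy_clean)}")
```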


In [69]:
# Count null values per column (they are corrected or discarded in later steps)
print("\nNull values per column:")
df.isnull().sum()



Null values per column:


Unnamed: 0,0
Entity,0
Code,270
Year,0
Schizophrenia disorders (share of population) - Sex: Both - Age: Age-standardized,0
Depressive disorders (share of population) - Sex: Both - Age: Age-standardized,0
Anxiety disorders (share of population) - Sex: Both - Age: Age-standardized,0
Bipolar disorders (share of population) - Sex: Both - Age: Age-standardized,0
Eating disorders (share of population) - Sex: Both - Age: Age-standardized,0


### Data consistency✅

In [70]:
# Check that all countries have data from 1990 to 2019.
date_range = df.groupby('Code')['Year'].agg(['min', 'max'])
print("\nYear range by code:")
print(date_range)


Year range by code:
       min   max
Code            
AFG   1990  2019
AGO   1990  2019
ALB   1990  2019
AND   1990  2019
ARE   1990  2019
...    ...   ...
WSM   1990  2019
YEM   1990  2019
ZAF   1990  2019
ZMB   1990  2019
ZWE   1990  2019

[205 rows x 2 columns]
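
Two caveats apply to the check above. `groupby` skips rows whose key is null by default, so the entities with a missing `Code` are not among the 205 rows shown. Also, matching `min` and `max` years do not rule out gaps in between. A stricter check, sketched here on hypothetical toy data, groups by `Entity` and counts the distinct years per group. For the real dataset the expected range would be `range(1990, 2020)`.

```python
import pandas as pd

# Toy data (hypothetical): entity "B" is missing the year 1991
toy = pd.DataFrame({
    "Entity": ["A", "A", "A", "B", "B"],
    "Year": [1990, 1991, 1992, 1990, 1992],
})

expected_years = range(1990, 1993)  # 1990..1992 for the toy data

# Number of distinct years available for each entity
years_per_entity = toy.groupby("Entity")["Year"].nunique()

# Entities with fewer distinct years than expected have gaps
incomplete = years_per_entity[years_per_entity < len(expected_years)]
print(incomplete)
```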


### Data corrections & Imputations✅

As only "Code" values are missing, we identify the affected entities and impute their codes manually.

In [71]:
# Identify missing codes and their corresponding entities
missing_codes = df[df['Code'].isnull()]['Entity'].unique()
print("\nEntities with missing codes:")
print(missing_codes)


Entities with missing codes:
['Africa (IHME GBD)' 'America (IHME GBD)' 'Asia (IHME GBD)'
 'Europe (IHME GBD)' 'European Union (27)' 'High-income countries'
 'Low-income countries' 'Lower-middle-income countries'
 'Upper-middle-income countries']


In [72]:
# Manual imputation
df.loc[df['Entity'] == 'Africa (IHME GBD)', 'Code'] = 'AFR_IHME'
df.loc[df['Entity'] == 'America (IHME GBD)', 'Code'] = 'AMR_IHME'
df.loc[df['Entity'] == 'Asia (IHME GBD)', 'Code'] = 'ASIA_IHME'
df.loc[df['Entity'] == 'Europe (IHME GBD)', 'Code'] = 'EUR_IHME'
df.loc[df['Entity'] == 'European Union (27)', 'Code'] = 'EU27'
df.loc[df['Entity'] == 'High-income countries', 'Code'] = 'HIGH_INC'
df.loc[df['Entity'] == 'Low-income countries', 'Code'] = 'LOW_INC'
df.loc[df['Entity'] == 'Lower-middle-income countries', 'Code'] = 'LOW_MID_INC'
df.loc[df['Entity'] == 'Upper-middle-income countries', 'Code'] = 'UP_MID_INC'
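
The same imputation can be written more compactly with a dictionary. This is a sketch rather than the notebook's original method: the mapping is copied from the assignments above, while the `toy` frame is a hypothetical stand-in for `df`.

```python
import pandas as pd

# Entity -> code mapping, copied from the manual assignments above
aggregate_codes = {
    "Africa (IHME GBD)": "AFR_IHME",
    "America (IHME GBD)": "AMR_IHME",
    "Asia (IHME GBD)": "ASIA_IHME",
    "Europe (IHME GBD)": "EUR_IHME",
    "European Union (27)": "EU27",
    "High-income countries": "HIGH_INC",
    "Low-income countries": "LOW_INC",
    "Lower-middle-income countries": "LOW_MID_INC",
    "Upper-middle-income countries": "UP_MID_INC",
}

# Toy frame (hypothetical rows) standing in for df
toy = pd.DataFrame({
    "Entity": ["Afghanistan", "Africa (IHME GBD)", "European Union (27)"],
    "Code": ["AFG", None, None],
})

# Fill only the missing codes; existing codes are left untouched
toy["Code"] = toy["Code"].fillna(toy["Entity"].map(aggregate_codes))
print(toy)
```

Keeping the mapping in one place makes it easier to review and to reuse if the source file is updated.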

### Data redundancy✅

The 'Entity' and 'Code' columns are largely redundant, so we save the Entity-Code mapping to a CSV file and then drop the 'Entity' column. We keep 'Code' because its notation is much shorter.

In [73]:
# Save the Entity-Code mapping to a CSV file
entity_code_df = df[['Entity', 'Code']].drop_duplicates()
entity_code_df.to_csv('entity_code_mapping.csv', index=False)

print("\nUpdated Entity and Code pairs:")
entity_code_df


Updated Entity and Code pairs:


Unnamed: 0,Entity,Code
0,Afghanistan,AFG
30,Africa (IHME GBD),AFR_IHME
60,Albania,ALB
90,Algeria,DZA
120,America (IHME GBD),AMR_IHME
...,...,...
6270,Vietnam,VNM
6300,World,OWID_WRL
6330,Yemen,YEM
6360,Zambia,ZMB


In [74]:
# Check that the file was saved correctly (optional double check)
print(os.listdir())
table = "entity_code_mapping.csv"
rel = pd.read_csv(table)
rel.head()

['.config', 'deviation_boolean.csv', 'entity_code_mapping.csv', 'df_countries.csv', 'df_income.csv', 'df_continents.csv', 'deviation_errors.csv', 'mental.csv', 'sample_data']


Unnamed: 0,Entity,Code
0,Afghanistan,AFG
1,Africa (IHME GBD),AFR_IHME
2,Albania,ALB
3,Algeria,DZA
4,America (IHME GBD),AMR_IHME


In [75]:
# Drop the 'Entity' column
df.drop(columns=['Entity'], inplace=True)
df.head()

Unnamed: 0,Code,Year,Schizophrenia disorders (share of population) - Sex: Both - Age: Age-standardized,Depressive disorders (share of population) - Sex: Both - Age: Age-standardized,Anxiety disorders (share of population) - Sex: Both - Age: Age-standardized,Bipolar disorders (share of population) - Sex: Both - Age: Age-standardized,Eating disorders (share of population) - Sex: Both - Age: Age-standardized
0,AFG,1990,0.223206,4.996118,4.713314,0.703023,0.1277
1,AFG,1991,0.222454,4.98929,4.7021,0.702069,0.123256
2,AFG,1992,0.221751,4.981346,4.683743,0.700792,0.118844
3,AFG,1993,0.220987,4.976958,4.673549,0.700087,0.115089
4,AFG,1994,0.220183,4.977782,4.67081,0.699898,0.111815


### Data coherence✅

I have shortened the column names to simplify the dataset, since the removed suffix says the same thing for every disorder: each value is a share of the population, for both sexes, age-standardized.

In [76]:
import re # Regular expression operations library
df.columns = [re.sub(r' \(share of population\) - Sex: Both - Age: Age-standardized', '', col) for col in df.columns]
df.head()


Unnamed: 0,Code,Year,Schizophrenia disorders,Depressive disorders,Anxiety disorders,Bipolar disorders,Eating disorders
0,AFG,1990,0.223206,4.996118,4.713314,0.703023,0.1277
1,AFG,1991,0.222454,4.98929,4.7021,0.702069,0.123256
2,AFG,1992,0.221751,4.981346,4.683743,0.700792,0.118844
3,AFG,1993,0.220987,4.976958,4.673549,0.700087,0.115089
4,AFG,1994,0.220183,4.977782,4.67081,0.699898,0.111815
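
As an alternative to `re.sub`, the suffix can be removed as literal text with the pandas string accessor, which avoids escaping the parentheses. This is a sketch on a hypothetical empty frame that only carries column names in the notebook's format.

```python
import pandas as pd

suffix = " (share of population) - Sex: Both - Age: Age-standardized"

# Toy frame (hypothetical) with column names in the original format
toy = pd.DataFrame(columns=[
    "Code",
    "Year",
    "Schizophrenia disorders" + suffix,
    "Eating disorders" + suffix,
])

# regex=False treats the suffix as plain text, not as a pattern
toy.columns = toy.columns.str.replace(suffix, "", regex=False)
print(list(toy.columns))
```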


### Maximum Deviation error

(Extra information that ended up not being used.)

deviation_errors.csv: difference between the highest and lowest value per code for each disorder.

deviation_boolean.csv: boolean flags marking the values that exceed a high deviation threshold (75% of the largest difference).

In [77]:
columns_disorders = df.columns[2:]  # Excluding 'Code' and 'Year'
deviation_errors = df.groupby('Code')[columns_disorders].agg(lambda x: x.max() - x.min())
deviation_errors.to_csv('deviation_errors.csv')

print(os.listdir())
table2 = "deviation_errors.csv"
dev = pd.read_csv(table2)
dev.head()

['.config', 'deviation_boolean.csv', 'entity_code_mapping.csv', 'df_countries.csv', 'df_income.csv', 'df_continents.csv', 'deviation_errors.csv', 'mental.csv', 'sample_data']


Unnamed: 0,Code,Schizophrenia disorders,Depressive disorders,Anxiety disorders,Bipolar disorders,Eating disorders
0,AFG,0.009145,0.077512,0.186022,0.003659,0.033153
1,AFR_IHME,0.002214,0.202847,0.098193,0.005538,0.015287
2,AGO,0.008523,0.212984,0.052193,0.000244,0.034507
3,ALB,0.007158,0.091062,0.178392,0.001349,0.038032
4,AMR_IHME,0.014339,0.334991,0.943326,0.028284,0.017213


In [78]:
threshold = deviation_errors.values.max() * 0.75  # 75% of the maximum value
deviation_boolean = deviation_errors > threshold

deviation_boolean.to_csv('deviation_boolean.csv')

print(os.listdir())
table3 = "deviation_boolean.csv"
dev_bool = pd.read_csv(table3)
print("\nDeviation values that exceed the threshold:")
dev_bool.head()

['.config', 'deviation_boolean.csv', 'entity_code_mapping.csv', 'df_countries.csv', 'df_income.csv', 'df_continents.csv', 'deviation_errors.csv', 'mental.csv', 'sample_data']

Deviation values that exceed the threshold:


Unnamed: 0,Code,Schizophrenia disorders,Depressive disorders,Anxiety disorders,Bipolar disorders,Eating disorders
0,AFG,False,False,False,False,False
1,AFR_IHME,False,False,False,False,False
2,AGO,False,False,False,False,False
3,ALB,False,False,False,False,False
4,AMR_IHME,False,False,False,False,False
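
The threshold above is a single number taken from the largest range across all disorders. Because the disorders differ in scale (the descriptive statistics show means of about 4.1 for anxiety and about 0.27 for schizophrenia), the small-scale columns can hardly exceed it. One possible variant, sketched here on hypothetical toy values and not part of the original analysis, uses a separate threshold per disorder.

```python
import pandas as pd

# Toy ranges (hypothetical values), indexed by code like deviation_errors
toy_ranges = pd.DataFrame(
    {
        "Schizophrenia disorders": [0.01, 0.04],
        "Anxiety disorders": [0.2, 1.0],
    },
    index=["AAA", "BBB"],
)

# One threshold per disorder: 75% of that column's own maximum range
per_column_threshold = toy_ranges.max() * 0.75

# Compare each column against its own threshold
flags = toy_ranges.gt(per_column_threshold, axis="columns")
print(flags)
```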


### Data organization✅

Some codes refer to individual countries, others to continents or regions, and others to groups of countries by income level. I have therefore split the data into three separate DataFrames so that different algorithms can be run on each one and yield more conclusive results.

In [79]:
# Individual countries (three-letter codes)
df_countries = df[df['Code'].str.len() == 3].reset_index(drop=True)
df_countries.to_csv('df_countries.csv')
# Continents and regional aggregates
df_continents = df[df['Code'].isin(['AFR_IHME', 'AMR_IHME', 'ASIA_IHME', 'EUR_IHME', 'EU27'])].reset_index(drop=True)
df_continents.to_csv('df_continents.csv')
# Income classification groups
df_income = df[df['Code'].isin(['HIGH_INC', 'LOW_INC', 'LOW_MID_INC', 'UP_MID_INC'])].reset_index(drop=True)
df_income.to_csv('df_income.csv')
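
The three filters do not necessarily cover every code. For example, `OWID_WRL` (World), which appears in the Entity-Code mapping, is neither three characters long nor in either list. A quick way to list such leftover codes is sketched below on a hypothetical toy frame; the code lists are copied from the cell above.

```python
import pandas as pd

# Toy stand-in for df (hypothetical rows; codes taken from the notebook)
toy = pd.DataFrame({"Code": ["AFG", "EU27", "HIGH_INC", "OWID_WRL"]})

continent_codes = ["AFR_IHME", "AMR_IHME", "ASIA_IHME", "EUR_IHME", "EU27"]
income_codes = ["HIGH_INC", "LOW_INC", "LOW_MID_INC", "UP_MID_INC"]

# Same three conditions used to build the DataFrames
is_country = toy["Code"].str.len() == 3
is_continent = toy["Code"].isin(continent_codes)
is_income = toy["Code"].isin(income_codes)

# Codes that fall into none of the three groups
unassigned = toy.loc[~(is_country | is_continent | is_income), "Code"].unique()
print(unassigned)
```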

In [80]:
df_countries.head()

Unnamed: 0,Code,Year,Schizophrenia disorders,Depressive disorders,Anxiety disorders,Bipolar disorders,Eating disorders
0,AFG,1990,0.223206,4.996118,4.713314,0.703023,0.1277
1,AFG,1991,0.222454,4.98929,4.7021,0.702069,0.123256
2,AFG,1992,0.221751,4.981346,4.683743,0.700792,0.118844
3,AFG,1993,0.220987,4.976958,4.673549,0.700087,0.115089
4,AFG,1994,0.220183,4.977782,4.67081,0.699898,0.111815


In [81]:
df_continents.head()

Unnamed: 0,Code,Year,Schizophrenia disorders,Depressive disorders,Anxiety disorders,Bipolar disorders,Eating disorders
0,AFR_IHME,1990,0.219527,4.602806,3.696839,0.607027,0.111027
1,AFR_IHME,1991,0.219559,4.598041,3.695416,0.60709,0.110425
2,AFR_IHME,1992,0.219579,4.593013,3.693819,0.607127,0.109845
3,AFR_IHME,1993,0.219583,4.588568,3.692097,0.607134,0.109305
4,AFR_IHME,1994,0.219556,4.586263,3.690115,0.607082,0.108813


In [82]:
df_income.head()

Unnamed: 0,Code,Year,Schizophrenia disorders,Depressive disorders,Anxiety disorders,Bipolar disorders,Eating disorders
0,HIGH_INC,1990,0.325683,3.471353,4.729571,0.732095,0.363852
1,HIGH_INC,1991,0.326571,3.480142,4.724091,0.731622,0.369117
2,HIGH_INC,1992,0.327345,3.49173,4.71983,0.731283,0.373884
3,HIGH_INC,1993,0.327982,3.505563,4.716575,0.731041,0.378027
4,HIGH_INC,1994,0.32849,3.521293,4.714491,0.730872,0.381161


## Machine Learning Algorithms

### Regression line

We fit a linear regression to the evolution of each mental disorder over time and use it to predict values up to 2050.
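
As a minimal sketch of the idea, the example below fits a straight line to one disorder for one entity and extrapolates it to 2050. It uses `numpy.polyfit` for the least-squares fit, which is an assumption and may differ from the model used later in the notebook. The `toy` values are hypothetical. Extending a linear trend 30 years past the data is a strong assumption, so the result should be read as a trend projection rather than a forecast.

```python
import numpy as np
import pandas as pd

# Toy series (hypothetical values) shaped like one code in df_countries
years = np.arange(1990, 2020)
toy = pd.DataFrame({
    "Year": years,
    "Depressive disorders": 4.0 + 0.01 * (years - 1990),
})

# Centre the years on 1990 to keep the fit numerically well behaved
x = toy["Year"].to_numpy() - 1990
y = toy["Depressive disorders"].to_numpy()

# Ordinary least squares fit of y = slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)

# Extrapolate the fitted line to 2020..2050
future_years = np.arange(2020, 2051)
forecast = pd.DataFrame({
    "Year": future_years,
    "Depressive disorders": slope * (future_years - 1990) + intercept,
})
print(forecast.tail())
```

On the real data, the same fit could be repeated for each code and each disorder column.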
