# World Mental Health Indicators

### 📋 Introduction

Mental health disorders constitute a significant portion of the global burden of disease and disability. According to the World Health Organization (WHO), mental health conditions such as depression, anxiety, and substance use disorders are among the leading causes of ill health and disability worldwide. Despite increasing awareness and policy efforts, many individuals still suffer in silence due to stigma, lack of resources, and inadequate mental health infrastructure.

This analysis leverages a longitudinal dataset capturing the **estimated prevalence (% of the population)** of various mental health conditions across countries and territories from **1990 onwards**. By examining trends over time and across regions, this analysis aims to provide insights into how these disorders are distributed globally, how they have changed historically, and what patterns might emerge when disorders are compared or correlated.

Understanding these patterns is crucial for guiding effective mental health interventions, shaping public policy, allocating healthcare resources, and promoting mental well-being at both national and international levels.

---

### 📂 Dataset Description

The dataset consists of annual estimates of mental health disorder prevalence for multiple countries over several decades. Each observation represents the percentage of a country's population affected by a specific disorder in a given year. The dataset includes the following variables:

- `index`: A unique identifier for each row in the dataset.
- `Entity`: The name of the country, region, or global aggregate.
- `Code`: The ISO 3166-1 alpha-3 code for the entity (where applicable).
- `Year`: The calendar year of the observation.
- `Schizophrenia (%)`: Estimated percentage of the population diagnosed with schizophrenia.
- `Bipolar disorder (%)`: Estimated percentage affected by bipolar disorder.
- `Eating disorders (%)`: Estimated percentage affected by conditions like anorexia nervosa or bulimia.
- `Anxiety disorders (%)`: Estimated percentage affected by anxiety disorders (e.g., generalized anxiety, phobias).
- `Drug use disorders (%)`: Estimated percentage with disorders related to drug misuse.
- `Depression (%)`: Estimated percentage affected by depressive disorders (e.g., major depression).
- `Alcohol use disorders (%)`: Estimated percentage with alcohol-related disorders.

All prevalence values are reported as percentages and reflect modeled estimates based on epidemiological studies, surveys, and statistical methods. The scope and quality of the data may vary between countries and over time due to differences in reporting, diagnostic standards, and data availability.
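Because every prevalence column is a percentage, a quick range check is a cheap first sanity test. The sketch below is illustrative only: it uses a hypothetical two-row sample (values copied from the first and last rows shown later in this notebook) and assumes that all prevalence columns end in `(%)`.

```python
import pandas as pd

# Hypothetical two-row sample mirroring some of the columns described above.
sample = pd.DataFrame({
    "Entity": ["Afghanistan", "Zimbabwe"],
    "Year": [1991, 2017],
    "Depression (%)": [4.079531, 3.192789],
    "Anxiety disorders (%)": [4.82974, 3.110926],
})

# Assumption: every prevalence column name ends with "(%)".
prevalence_cols = [c for c in sample.columns if c.endswith("(%)")]

# A prevalence expressed as a percentage must lie between 0 and 100.
in_range = sample[prevalence_cols].apply(lambda s: s.between(0, 100)).all()
print(in_range)
```

A column that fails this check would point to a unit or parsing problem rather than a real epidemiological signal.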

---

### 🎯 Objectives of the Analysis

This analysis aims to explore the following research questions and themes:

1. **Temporal Trends**:  
   How have the prevalence rates of mental health disorders changed globally and regionally from 1990 to the present? Are there disorders showing significant increases or decreases over time?

2. **Geographical Distribution**:  
   Which regions or countries exhibit the highest or lowest prevalence of specific mental health disorders? Are there noticeable regional patterns or clusters?

3. **Disorder-Specific Patterns**:  
   How do disorders compare in terms of their prevalence across different entities? Are some disorders more universally common while others show localized spikes?

4. **Correlations and Co-occurrence**:  
   Are certain mental health conditions likely to co-occur within populations? For example, is there a strong relationship between anxiety and depression prevalence?

5. **Public Health Implications**:  
   What do these trends imply for mental health services, awareness campaigns, and future interventions? Where should resources be prioritized?
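The first, second, and fourth questions map onto simple pandas operations. The following is a minimal sketch on made-up toy data (the entities `A` and `B` and all values are hypothetical), not the analysis performed later in the notebook.

```python
import pandas as pd

# Hypothetical toy data: two entities, three years each (values are made up).
toy = pd.DataFrame({
    "Entity": ["A", "A", "A", "B", "B", "B"],
    "Year": [1990, 1991, 1992, 1990, 1991, 1992],
    "Anxiety disorders (%)": [3.0, 3.1, 3.2, 5.0, 5.2, 5.4],
    "Depression (%)": [2.0, 2.1, 2.2, 4.0, 4.1, 4.2],
})

# Temporal trend: mean prevalence across entities for each year.
yearly_mean = toy.groupby("Year")[["Anxiety disorders (%)", "Depression (%)"]].mean()

# Geographical distribution: entities ranked by mean depression prevalence.
ranking = toy.groupby("Entity")["Depression (%)"].mean().sort_values(ascending=False)

# Co-occurrence: Pearson correlation between two prevalence columns.
corr = toy["Anxiety disorders (%)"].corr(toy["Depression (%)"])

print(yearly_mean)
print(ranking)
print(corr)
```

A high correlation at the population level says that the two prevalences move together across entities and years; it does not show that the same individuals have both conditions.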

---

### 📌 Scope and Limitations

While this dataset provides valuable insights into global mental health trends, it's important to acknowledge the following limitations:

- **Estimation Errors**: Prevalence values are based on modeled estimates and may not reflect precise real-world values.
- **Reporting Disparities**: Mental health data may be underreported or inconsistently documented in certain regions due to stigma, lack of access to healthcare, or limited data collection infrastructure.
- **No Demographic Breakdown**: The dataset does not include demographic details such as age, gender, or socioeconomic status, which are critical for deeper stratified analysis.
- **Causality vs Correlation**: Observed patterns should not be interpreted as causal relationships without further in-depth study and contextual information.

Despite these constraints, the dataset offers a valuable high-level perspective on mental health disorder prevalence and provides a strong starting point for more focused research.

---

### 🧭 Next Steps

The following sections will include:

- **Data Cleaning and Preparation**  
- **Exploratory Data Analysis (EDA)**  
- **Trend and Correlation Analysis**  
- **Key Insights and Visualizations**  
- **Recommendations and Future Work**

Let’s begin by preparing the dataset for analysis.


In [131]:
# Importing libraries
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

The dataset is loaded into a pandas DataFrame here. The file contains two parts, split at index 6469, and the machine learning process focuses on the first part. Note that the slice `df[1:6468]` used below keeps rows 1 through 6467, as the shape and head outputs confirm.

In [132]:
# loading the dataset using pandas
df = pd.read_csv("data.csv")
data = df[1:6468].copy()
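# A hard-coded slice depends on knowing exactly where the second part starts.
# The sketch below locates the boundary programmatically instead. It ASSUMES the
# second part begins with a repeated header-like row, which makes `Year`
# non-numeric there; the miniature CSV and its values are hypothetical.

```python
import io

import pandas as pd

# Hypothetical miniature of a CSV file that stacks two tables.
raw = io.StringIO(
    "index,Entity,Year,Depression (%)\n"
    "0,Afghanistan,1990,4.07\n"
    "1,Afghanistan,1991,4.08\n"
    "2,Entity,Year,Some other measure\n"
    "3,Afghanistan,1990,12345\n"
)
df_demo = pd.read_csv(raw)

# Rows whose Year cannot be parsed as a number mark the start of the second table.
year_numeric = pd.to_numeric(df_demo["Year"], errors="coerce")
is_bad = year_numeric.isna()
split_at = int(is_bad.idxmax()) if is_bad.any() else len(df_demo)

# Keep only the first table, including row 0.
first_part = df_demo.iloc[:split_at].copy()
print(split_at, len(first_part))
```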



## Data Understanding

In [133]:
# checking the data head
data.head()

|  | index | Entity | Code | Year | Schizophrenia (%) | Bipolar disorder (%) | Eating disorders (%) | Anxiety disorders (%) | Drug use disorders (%) | Depression (%) | Alcohol use disorders (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Afghanistan | AFG | 1991 | 0.160312 | 0.697961 | 0.099313 | 4.82974 | 1.684746 | 4.079531 | 0.671768 |
| 2 | 2 | Afghanistan | AFG | 1992 | 0.160135 | 0.698107 | 0.096692 | 4.831108 | 1.694334 | 4.088358 | 0.670644 |
| 3 | 3 | Afghanistan | AFG | 1993 | 0.160037 | 0.698257 | 0.094336 | 4.830864 | 1.70532 | 4.09619 | 0.669738 |
| 4 | 4 | Afghanistan | AFG | 1994 | 0.160022 | 0.698469 | 0.092439 | 4.829423 | 1.716069 | 4.099582 | 0.66926 |
| 5 | 5 | Afghanistan | AFG | 1995 | 0.160076 | 0.698695 | 0.09098 | 4.828337 | 1.728112 | 4.104207 | 0.668746 |


In [134]:
# Checking the data shape
data.shape

(6467, 11)

In [135]:
# Checking the data tail
data.tail()

|  | index | Entity | Code | Year | Schizophrenia (%) | Bipolar disorder (%) | Eating disorders (%) | Anxiety disorders (%) | Drug use disorders (%) | Depression (%) | Alcohol use disorders (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 6463 | 6463 | Zimbabwe | ZWE | 2013 | 0.15567 | 0.607993 | 0.117248 | 3.090168 | 0.76628 | 3.128192 | 1.515641 |
| 6464 | 6464 | Zimbabwe | ZWE | 2014 | 0.155993 | 0.60861 | 0.118073 | 3.093964 | 0.768914 | 3.14029 | 1.51547 |
| 6465 | 6465 | Zimbabwe | ZWE | 2015 | 0.156465 | 0.609363 | 0.11947 | 3.098687 | 0.771802 | 3.15571 | 1.514751 |
| 6466 | 6466 | Zimbabwe | ZWE | 2016 | 0.157111 | 0.610234 | 0.121456 | 3.104294 | 0.772275 | 3.174134 | 1.513269 |
| 6467 | 6467 | Zimbabwe | ZWE | 2017 | 0.157963 | 0.611242 | 0.124443 | 3.110926 | 0.772648 | 3.192789 | 1.510943 |


In [136]:
# Checking the statistical description of the data
# describe must be called with parentheses; without them pandas only prints the bound method.
# Columns still stored as object dtype at this point are left out of the numeric summary.
data.describe()

## Data Cleaning

In this section, the data is checked against the following four data-quality dimensions:
- Completeness
- Validity
- Consistency
- Accuracy

### Validity

#### - Checking datatypes are correct

In [137]:
# Checking datatypes of each column
data.dtypes

index                          int64
Entity                        object
Code                          object
Year                          object
Schizophrenia (%)             object
Bipolar disorder (%)          object
Eating disorders (%)          object
Anxiety disorders (%)        float64
Drug use disorders (%)       float64
Depression (%)               float64
Alcohol use disorders (%)    float64
dtype: object

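`Entity` and `Code` are correctly stored as text. The `Year` column should be a datetime type, and `Schizophrenia (%)`, `Bipolar disorder (%)`, and `Eating disorders (%)` should be `float64` like the other prevalence columns.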

In [138]:
# Listing the columns whose datatypes need to change:
# 'Year' should be datetime64[ns];
# 'Schizophrenia (%)', 'Bipolar disorder (%)', and 'Eating disorders (%)' should be float64.
# Every other column keeps its current datatype.
changing_datatypes = {
    'Year': 'datetime64[ns]',
    'Schizophrenia (%)': 'float64',
    'Bipolar disorder (%)': 'float64',
    'Eating disorders (%)': 'float64'
}

# Changing the datatypes of the specified columns
data = data.astype(changing_datatypes)

# Checking the datatypes after conversion
data.dtypes

index                                 int64
Entity                               object
Code                                 object
Year                         datetime64[ns]
Schizophrenia (%)                   float64
Bipolar disorder (%)                float64
Eating disorders (%)                float64
Anxiety disorders (%)               float64
Drug use disorders (%)              float64
Depression (%)                      float64
Alcohol use disorders (%)           float64
dtype: object
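`astype` raises an error as soon as it meets a value it cannot parse. A more defensive variant, sketched here on a hypothetical three-row frame, uses `pd.to_numeric` with `errors="coerce"` and `pd.to_datetime` with an explicit year format; the string `"not available"` is an invented placeholder for a malformed entry.

```python
import pandas as pd

# Hypothetical frame with numbers stored as strings, as in an object-dtype column.
frame = pd.DataFrame({
    "Year": ["1991", "1992", "1993"],
    "Schizophrenia (%)": ["0.160312", "0.160135", "not available"],
})

# errors="coerce" turns unparseable entries into NaN instead of raising.
frame["Schizophrenia (%)"] = pd.to_numeric(frame["Schizophrenia (%)"], errors="coerce")

# An explicit format states that the strings are four-digit calendar years.
frame["Year"] = pd.to_datetime(frame["Year"], format="%Y")

print(frame.dtypes)
print(frame)
```

Coerced values become missing values, so they surface in the completeness check instead of stopping the notebook.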

#### - Uniqueness

At this stage, we ensure there are no duplicate rows in the dataset.

In [139]:
# Checking for uniqueness
data.duplicated().sum()

0

The data contains no fully duplicated rows.
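One caveat is worth illustrating: `duplicated()` compares whole rows, and the `index` column differs on every row, so a full-row check cannot flag a repeated observation. A stricter check, sketched below on hypothetical toy rows, looks for repeated `(Entity, Year)` pairs; treating that pair as the natural key is an assumption about the dataset.

```python
import pandas as pd

# Hypothetical toy rows: the first two describe the same entity and year.
toy = pd.DataFrame({
    "index": [0, 1, 2],
    "Entity": ["Afghanistan", "Afghanistan", "Zimbabwe"],
    "Year": [1991, 1991, 2017],
    "Depression (%)": [4.079531, 4.079531, 3.192789],
})

# Full-row comparison: the differing 'index' values hide the repetition.
full_row_dupes = int(toy.duplicated().sum())

# Key-based comparison: assumes (Entity, Year) should identify a row uniquely.
key_dupes = int(toy.duplicated(subset=["Entity", "Year"]).sum())

print(full_row_dupes, key_dupes)
```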

### Completeness

This step ensures there are no missing values in the data by counting the null values in each column.

In [140]:
# Checking for null values in the dataset
data.isnull().sum()

index                          0
Entity                         0
Code                         980
Year                           0
Schizophrenia (%)              0
Bipolar disorder (%)           0
Eating disorders (%)           0
Anxiety disorders (%)          0
Drug use disorders (%)         0
Depression (%)                 0
Alcohol use disorders (%)      0
dtype: int64

Only the `Code` column has missing values (980). It is not needed for this analysis, so it is dropped.

In [141]:
# Dropping the 'Code' column as it is not necessary for our analysis
data.drop(columns=['Code'], inplace=True)

No null values remain after dropping `Code`. In anticipation of future data, however, we build a pipeline with a KNN imputer that fills any null values should they appear.

In [142]:
# Importing necessary libraries for preprocessing
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

In [143]:
# Splitting the data into numerical and categorical features
data['Year'] = data['Year'].dt.year
numerical_columns = data.select_dtypes(include=['int64', 'float64']).columns
catergorical_columns = data.select_dtypes(include=['object']).columns

In [144]:
# Checking numerical columns
numerical_columns

Index(['index', 'Year', 'Schizophrenia (%)', 'Bipolar disorder (%)',
       'Eating disorders (%)', 'Anxiety disorders (%)',
       'Drug use disorders (%)', 'Depression (%)',
       'Alcohol use disorders (%)'],
      dtype='object')

In [145]:
# Checking categorical columns
catergorical_columns

Index(['Entity'], dtype='object')

In [146]:
# Creating a preprocessing pipeline for all columns
preprocessor = ColumnTransformer(
    transformers = [
        ('num', MinMaxScaler(), numerical_columns), # Scaling the data to a range of [0, 1]
        ('cat', OneHotEncoder(handle_unknown = 'ignore', sparse_output=False), catergorical_columns)  # encoding categorical variables
         
    ],
     remainder = 'passthrough' 
)

In [147]:
# Complete pipeline including the KNN imputer
pipeline = Pipeline([
    # Adding the preprocessing pipeline
    ('preprocessor', preprocessor), 
    # Using KNN imputer with uniform weights
    ('imputer', KNNImputer(n_neighbors=5, weights='uniform'))
  ]
  )  
    

In [148]:
# Fitting the pipeline and transforming the data in one step
fitted = pipeline.fit_transform(data)
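The dataset has no nulls in the remaining columns at this point, so the imputer's effect is invisible in the notebook. The standalone sketch below shows what `KNNImputer` does on a hypothetical array with one missing value; `n_neighbors=2` is chosen only to keep the toy example small.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical scaled data with one missing value in the second column.
X_demo = np.array([
    [0.0, 0.0],
    [0.1, 0.2],
    [0.2, np.nan],
    [0.3, 0.6],
    [1.0, 2.0],
])

# The gap is filled with the mean of that column over the nearest rows,
# where distance is measured on the columns that are present.
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_filled = imputer.fit_transform(X_demo)

print(X_filled)
```

Because the distances are computed on the features themselves, scaling them to a common range first, as the pipeline does with `MinMaxScaler`, keeps a single wide-ranging column from dominating the neighbour search.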


In [149]:
# Get the column names for the transformed array
onehot_cols = pipeline.named_steps['preprocessor'].named_transformers_['cat'].get_feature_names_out(catergorical_columns)
imputed_col_names = list(numerical_columns) + list(onehot_cols)

# Create a DataFrame from the imputed data
data_imputed_df = pd.DataFrame(fitted, columns=imputed_col_names)

# Inverse-transform the numerical columns to their original scale
scaler = pipeline.named_steps['preprocessor'].named_transformers_['num']
num_indices = [data_imputed_df.columns.get_loc(col) for col in numerical_columns]
imputed_numerical_scaled = data_imputed_df.iloc[:, num_indices]
imputed_numerical_original = scaler.inverse_transform(imputed_numerical_scaled)

# Fill only the nulls in the original data with the imputed values
data_filled = data.copy()
for i, col in enumerate(numerical_columns):
    missing = data_filled[col].isnull().to_numpy()
    data_filled.loc[missing, col] = imputed_numerical_original[missing, i]

# data_filled now holds the original values plus any KNN-imputed values
data_filled.head()

|  | index | Entity | Year | Schizophrenia (%) | Bipolar disorder (%) | Eating disorders (%) | Anxiety disorders (%) | Drug use disorders (%) | Depression (%) | Alcohol use disorders (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Afghanistan | 1991 | 0.160312 | 0.697961 | 0.099313 | 4.82974 | 1.684746 | 4.079531 | 0.671768 |
| 2 | 2 | Afghanistan | 1992 | 0.160135 | 0.698107 | 0.096692 | 4.831108 | 1.694334 | 4.088358 | 0.670644 |
| 3 | 3 | Afghanistan | 1993 | 0.160037 | 0.698257 | 0.094336 | 4.830864 | 1.70532 | 4.09619 | 0.669738 |
| 4 | 4 | Afghanistan | 1994 | 0.160022 | 0.698469 | 0.092439 | 4.829423 | 1.716069 | 4.099582 | 0.66926 |
| 5 | 5 | Afghanistan | 1995 | 0.160076 | 0.698695 | 0.09098 | 4.828337 | 1.728112 | 4.104207 | 0.668746 |


In [150]:
# The imputed DataFrame with original values and KNN-imputed nulls is already available as 'data_filled'
print("\nFinal Imputed DataFrame:")
print(data_filled)
print("\nMissing values after full reconstruction:")
print(data_filled.isnull().sum())


Final Imputed DataFrame:
      index       Entity  Year  Schizophrenia (%)  Bipolar disorder (%)  \
1         1  Afghanistan  1991           0.160312              0.697961   
2         2  Afghanistan  1992           0.160135              0.698107   
3         3  Afghanistan  1993           0.160037              0.698257   
4         4  Afghanistan  1994           0.160022              0.698469   
5         5  Afghanistan  1995           0.160076              0.698695   
...     ...          ...   ...                ...                   ...   
6463   6463     Zimbabwe  2013           0.155670              0.607993   
6464   6464     Zimbabwe  2014           0.155993              0.608610   
6465   6465     Zimbabwe  2015           0.156465              0.609363   
6466   6466     Zimbabwe  2016           0.157111              0.610234   
6467   6467     Zimbabwe  2017           0.157963              0.611242   

      Eating disorders (%)  Anxiety disorders (%)  Drug use disorders (%)

#### Fixing outliers

Outliers in the data will be identified with an Isolation Forest model before they are fixed.

In [151]:
from sklearn.ensemble import IsolationForest

In [152]:
numerical_columns = data_filled.select_dtypes(include=['int64', 'float64']).columns
catergorical_columns = data_filled.select_dtypes(include=['object']).columns

In [153]:
numerical_columns

Index(['index', 'Year', 'Schizophrenia (%)', 'Bipolar disorder (%)',
       'Eating disorders (%)', 'Anxiety disorders (%)',
       'Drug use disorders (%)', 'Depression (%)',
       'Alcohol use disorders (%)'],
      dtype='object')

In [154]:
catergorical_columns

Index(['Entity'], dtype='object')

In [155]:
X = data_filled[numerical_columns]

In [156]:
model = IsolationForest(contamination=0.1, random_state=42)

In [157]:
model.fit(X)

IsolationForest(contamination=0.1, random_state=42)


In [158]:
data_filled['outlier_label'] = model.predict(X)

In [159]:
data_filled['anomaly_score'] = model.decision_function(X)
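To make the two new columns easier to read, here is a self-contained sketch on synthetic data (the points and the five planted outliers are invented for illustration). `predict` returns `-1` for outliers and `1` for inliers, `decision_function` is negative for the flagged rows, and `contamination=0.1` fixes the flagged share at roughly 10% regardless of how unusual the data really is.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: 200 points from a standard normal, with five shifted far away.
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X_demo[:5] += 8.0  # planted outliers (hypothetical)

forest = IsolationForest(contamination=0.1, random_state=42).fit(X_demo)

labels = forest.predict(X_demo)            # -1 = outlier, 1 = inlier
scores = forest.decision_function(X_demo)  # lower = more anomalous

# Share of rows flagged as outliers; driven by the contamination parameter.
flagged_share = float((labels == -1).mean())
print(flagged_share)
```

The same reasoning applies to the notebook's model: with `contamination=0.1`, about one row in ten is labelled an outlier by construction, so the count of flagged rows is a setting rather than a finding. Identifier-like columns such as `index` also take part in the scoring when they are left in `X`.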

In [160]:
# 'Year' is already part of numerical_columns, so it appears twice in the printed output
print(data_filled[['Entity', 'Year'] + list(numerical_columns) + ['outlier_label', 'anomaly_score']].head())


        Entity  Year  index  Year  Schizophrenia (%)  Bipolar disorder (%)  \
1  Afghanistan  1991      1  1991           0.160312              0.697961   
2  Afghanistan  1992      2  1992           0.160135              0.698107   
3  Afghanistan  1993      3  1993           0.160037              0.698257   
4  Afghanistan  1994      4  1994           0.160022              0.698469   
5  Afghanistan  1995      5  1995           0.160076              0.698695   

   Eating disorders (%)  Anxiety disorders (%)  Drug use disorders (%)  \
1              0.099313               4.829740                1.684746   
2              0.096692               4.831108                1.694334   
3              0.094336               4.830864                1.705320   
4              0.092439               4.829423                1.716069   
5              0.090980               4.828337                1.728112   

   Depression (%)  Alcohol use disorders (%)  outlier_label  anomaly_score  
1        

In [161]:
# Filter to see only the detected outliers
outliers = data_filled[data_filled['outlier_label'] == -1]
print("\nDetected Outliers:")
print(outliers[['Entity', 'Year'] + ['anomaly_score']])


Detected Outliers:
             Entity  Year  anomaly_score
1       Afghanistan  1991      -0.031105
2       Afghanistan  1992      -0.029858
3       Afghanistan  1993      -0.029858
4       Afghanistan  1994      -0.027390
5       Afghanistan  1995      -0.026250
...             ...   ...            ...
6101  United States  2015      -0.141099
6102  United States  2016      -0.142989
6103  United States  2017      -0.142069
6410          Yemen  2016      -0.006501
6411          Yemen  2017      -0.007290

[647 rows x 3 columns]


In [162]:
# sort the outliers by anomaly score
outliers_sorted = outliers.sort_values(by='anomaly_score', ascending=True)
print("\nSorted Outliers by Anomaly Score:")
print(outliers_sorted[['Entity', 'Year'] + ['anomaly_score']])


Sorted Outliers by Anomaly Score:
             Entity  Year  anomaly_score
307     Australasia  2017      -0.144461
6102  United States  2016      -0.142989
6103  United States  2017      -0.142069
6101  United States  2015      -0.141099
335       Australia  2017      -0.139885
...             ...   ...            ...
1692      East Asia  2002      -0.000287
3268          Libya  2010      -0.000285
2069         France  2015      -0.000221
1258          Chile  2016      -0.000058
4281         Norway  2015      -0.000017

[647 rows x 3 columns]
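A per-entity summary can be easier to scan than 647 individual rows. The sketch below uses a hypothetical four-row excerpt whose values are copied from the outputs above; the column names `n_flagged` and `worst_score` are introduced here for illustration.

```python
import pandas as pd

# Hypothetical excerpt of the flagged rows (values copied from the printed output).
flagged = pd.DataFrame({
    "Entity": ["United States", "United States", "Australia", "Yemen"],
    "Year": [2016, 2017, 2017, 2016],
    "anomaly_score": [-0.142989, -0.142069, -0.139885, -0.006501],
})

# Count flagged years per entity and keep the most negative score for each.
summary = (
    flagged.groupby("Entity")["anomaly_score"]
    .agg(n_flagged="count", worst_score="min")
    .sort_values("worst_score")
)
print(summary)
```

Entities with many flagged years and strongly negative scores are candidates for a closer look, while scores just below zero sit at the edge of the threshold.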


In [163]:
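import os

import google.generativeai as genai

# Read the API key from an environment variable instead of hard-coding it in the notebook.
# The variable name GOOGLE_API_KEY is an assumption; use whichever name your environment defines.
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])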


In [164]:
# Make sure you have the latest version of google-generativeai installed
# If not, run: %pip install -U google-generativeai

# Initialize Gemini model
model = genai.GenerativeModel('gemini-2.5-pro')

prompt = f"""
You are a data analyst. This is the structure of the dataset:

{data.head(3).to_string(index=False)}

Users will ask you questions about this dataset. Reply with Python code that uses pandas and matplotlib/plotly to analyze and visualize it.
Always provide visualizations where relevant.
"""


In [165]:
# Start the conversation with the model
chat = model.start_chat(history=[])

# Ask a specific question about the data
user_question = "What are the top 5 countries by total sales?"

# Send the prompt + question to Gemini
response = chat.send_message(prompt + "\n\n" + user_question)

# Print the response (should be Python code)
print(response.text)

Of course.

The provided dataset contains information about the prevalence of mental health disorders, not sales data.

However, I can reinterpret your request to find the "top 5 countries" based on a relevant metric from this dataset. A good proxy for "total" would be the average prevalence of all listed disorders combined for each country across all years. This will show which countries have, on average, the highest overall prevalence of these conditions.

Here is the Python code to perform this analysis and create a visualization.

### Python Code

```python
import pandas as pd
import plotly.express as px

# Load the dataset
# Assuming the data is in a CSV file named 'mental_health_data.csv'
# If your file has a different name, please change it below.
try:
    df = pd.read_csv('mental_health_data.csv')
except FileNotFoundError:
    print("Please make sure your data file is named 'mental_health_data.csv' and is in the same directory.")
    # Creating a sample dataframe for demonstrat
```

In [166]:
for m in genai.list_models():
    print(m.name)


models/embedding-gecko-001
models/gemini-1.5-pro-latest
models/gemini-1.5-pro-002
models/gemini-1.5-pro
models/gemini-1.5-flash-latest
models/gemini-1.5-flash
models/gemini-1.5-flash-002
models/gemini-1.5-flash-8b
models/gemini-1.5-flash-8b-001
models/gemini-1.5-flash-8b-latest
models/gemini-2.5-pro-preview-03-25
models/gemini-2.5-flash-preview-05-20
models/gemini-2.5-flash
models/gemini-2.5-flash-lite-preview-06-17
models/gemini-2.5-pro-preview-05-06
models/gemini-2.5-pro-preview-06-05
models/gemini-2.5-pro
models/gemini-2.0-flash-exp
models/gemini-2.0-flash
models/gemini-2.0-flash-001
models/gemini-2.0-flash-exp-image-generation
models/gemini-2.0-flash-lite-001
models/gemini-2.0-flash-lite
models/gemini-2.0-flash-preview-image-generation
models/gemini-2.0-flash-lite-preview-02-05
models/gemini-2.0-flash-lite-preview
models/gemini-2.0-pro-exp
models/gemini-2.0-pro-exp-02-05
models/gemini-exp-1206
models/gemini-2.0-flash-thinking-exp-01-21
models/gemini-2.0-flash-thinking-exp
models/ge