**CAPSTONE PROJECT MENTAL HEALTH ISSUE IDENTIFICATION SYSTEM**

Please fill out:
* Student names: 
* Student pace:  **PART TIME**
* Scheduled project review date/time: **18/11/2024**
* Instructor name: ****

**1.BUSINESS UNDERSTANDING**

**1.  Introduction**

Mental health has become an urgent public health concern across the globe, and Kenya is no exception. Approximately 25% of outpatients and 40% of inpatients in Kenyan healthcare facilities are affected by mental health conditions, according to the Kenya National Commission on Human Rights. Depression, substance abuse, stress, and anxiety disorders are among the most commonly diagnosed mental health issues in hospital settings, a reflection of an alarming national trend. The situation is compounded by limited data on mental health, neurological issues, and substance use (MNS) in Kenya, making it challenging to address these concerns effectively.


The World Health Organization (WHO) ranks Kenya among the African nations with the highest depression rates, with estimates suggesting that around two million Kenyans are impacted by depression alone. Disturbingly, one in four Kenyans will experience a mental health disorder at some point in their lives.


Given the urgent need to address mental health concerns, this project aims to leverage artificial intelligence to identify and analyze mental health indicators within social media text.

By capturing and analyzing patterns of mental health issues expressed in public discourse, the project seeks to provide insights that can inform policymakers, healthcare providers, and support systems. In doing so, it contributes to a broader understanding of mental health in Kenya and aligns with the national objective of prioritizing mental well-being.

**Problem Statement**

Mental health issues like depression, anxiety, and suicidal tendencies often go unnoticed in daily conversations, especially in online forums, social media posts, or text-based support systems. Existing tools are either too general or overly reliant on structured input, missing subtle signs of mental distress embedded in unstructured conversations. This project aims to identify potential mental health concerns based on users’ language and conversational patterns in online texts.

**Goals and Objectives**

**1.Identify and Categorize Mental Health Issues:**

Develop a model that can accurately classify different mental health issues (e.g., depression, anxiety, suicidal tendencies) based on text data in Reddit posts and comments.

**2.Analyze Language Patterns Linked to Mental Distress:**

Detect and analyze linguistic features and conversational patterns commonly associated with mental health issues to help distinguish subtle indicators of distress.

**3.Assess Sentiment and Emotional Intensity:**

Implement sentiment analysis to assess the emotional intensity and tone of the posts and comments, helping to prioritize urgent cases or severe distress.

**4.Provide Actionable Insights for Intervention:**

Generate insights that could support mental health professionals and social media moderators in identifying and addressing potential cases of mental health crises on forums and social platforms.

**STAKEHOLDERS**

### 1.Government and Health Agencies ###

i. **Ministry of Health (Kenya)**: As a primary body responsible for public health policies, they are key stakeholders in using the project's insights to shape mental health policies and interventions.

ii. **Kenya National Commission on Human Rights**: Involved in advocacy for better mental health services and safeguarding human rights for those affected by mental health issues.

iii. **National Authority for the Campaign Against Alcohol and Drug Abuse (NACADA)**: Given the links between substance abuse and mental health, NACADA's involvement could help tailor intervention programs.

### 2.Healthcare Providers ###

i. **Psychiatrists, Psychologists, and Therapists**: As frontline workers in diagnosing and treating mental health disorders, they would benefit from insights into prevalent issues and potential trends in patient symptoms.

ii. **Healthcare Facilities (Hospitals, Clinics)**: Understanding the mental health landscape can help facilities prepare resources and adapt treatment protocols to better address patient needs.

iii. **Public Health Organizations**: Including organizations like the World Health Organization (WHO), which can leverage findings to inform global and regional strategies on mental health.

### 3.Mental Health Advocacy Groups and NGOs ###

i. **Basic Needs Kenya, Mental Health Kenya, and Befrienders Kenya**: These advocacy groups work on awareness, support, and outreach programs, so insights from the project can help them tailor their initiatives and better support affected individuals.

ii. **Kenya Red Cross**: Often involved in providing mental health support during crises, they could use the data to identify areas with higher mental health needs.

### 4.Policy Makers and Legislators ###

i. **National Assembly's Health Committee**: To help in reviewing and proposing mental health legislation that aligns with the insights gathered from the analysis.

ii. **County Health Administrators**: Local level officials who can use insights for tailored mental health programs at the community level.


**2.DATA COLLECTION**

To gather a robust dataset for the Mindcheck project, we utilized the Reddit API through the Python Reddit API Wrapper (PRAW). This approach enabled us to collect a wide range of posts and comments relevant to mental health discussions, positive expressions, and neutral content, which would support the accurate identification and classification of mental health concerns.

We used keyword-based search queries and collected up to 5,000 posts per subreddit. Each post’s title, body, comments, and metadata (e.g., author information, comment scores, timestamps, and subreddit details) were captured to support downstream text analysis. We also included additional post attributes, such as flair, upvote ratios, and crosspost counts, which may serve as helpful features in identifying mental health patterns.

The final dataset was structured and saved as a CSV file for convenient access, providing a comprehensive sample of mental health, positive, and neutral content from Reddit. 

The data structure supports a comprehensive analysis of mental health discussions on social media, allowing for insights into engagement, sentiment, and topic categorization.
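
The collection script itself is not part of this notebook. The sketch below shows one way the steps above could look, assuming an authenticated `praw.Reddit` instance is passed in as `reddit`. The helper names `flatten_comment` and `collect_rows`, the subset of columns, and the way labels are attached are illustrative assumptions, not the project's actual script.

```python
def flatten_comment(submission, comment, label):
    """Build one dataset row from a PRAW-style submission/comment pair."""
    return {
        "post_id": submission.id,
        "title": submission.title,
        "post_body": submission.selftext,
        "post_score": submission.score,
        "upvote_ratio": submission.upvote_ratio,
        "subreddit": str(submission.subreddit),
        "comment_id": comment.id,
        "comment_body": comment.body,
        "comment_score": comment.score,
        "created": comment.created_utc,
        "label": label,  # assumed: one label per search query
    }


def collect_rows(reddit, subreddit_name, query, label, limit=100):
    """Search one subreddit by keyword and return one row per comment."""
    rows = []
    for submission in reddit.subreddit(subreddit_name).search(query, limit=limit):
        # Drop "load more comments" placeholders so only real comments remain
        submission.comments.replace_more(limit=0)
        for comment in submission.comments.list():
            rows.append(flatten_comment(submission, comment, label))
    return rows


# Hypothetical usage (credentials are placeholders):
# import praw, pandas as pd
# reddit = praw.Reddit(client_id="...", client_secret="...", user_agent="...")
# rows = collect_rows(reddit, "depression", "anxiety", "mental_health_issue")
# pd.DataFrame(rows).to_csv("broad_reddit_search_with_labels.csv", index=False)
```

Keeping the row-building logic separate from the API calls makes it possible to test the flattening step without network access.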

**DATA LOADING AND IMPORTING RELEVANT LIBRARIES**


In [8]:
# IMPORTING RELEVANT LIBRARIES
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score



In [None]:
#LOADING THE DATASET

data = pd.read_csv("broad_reddit_search_with_labels.csv")

In [None]:
#VIEW FIRST FIVE ROWS
data.head()

**2.2 DATA DESCRIPTION**

In [None]:
#GETTING GENERAL INFORMATION ON NON-NULL COUNTS AND DATA TYPES FOR EACH COLUMN
data.info()

**Description of the data:**

Total Entries: 92,395

Columns: 27, with various data types including object (text), int64 (integer), float64 (floating-point), and bool (boolean).


**Data Columns Overview**

**1.Post and Comment Content:**

* **title:** The title of the post, which may provide a summary of the content.
* **post_body:** The main content or body of the post.
* **comment_body:** The content of a specific comment on the post.

**2.Engagement and Score:**

* **post_score:** The score or upvotes received by the post, which may indicate popularity.
* **comment_score:** The score or upvotes received by the comment.
* **upvote_ratio:** The ratio of upvotes to total votes for the post.
* **num_crossposts:** The number of times this post has been cross-posted to other subreddits.
* **post_num_comments:** The number of comments on the post, indicating engagement.

**3.Metadata:**

* **post_url:** The URL of the post, useful for tracking or referencing.
* **created:** The timestamp when the post or comment was created.
* **subreddit:** The subreddit where the post or comment was made, which helps in filtering data by community focus.
* **label:** This could represent a manual or model-assigned label (e.g., sentiment, topic, or mental health category).

**4.User Information:**

* **author:** The username of the post’s author.
* **comment_author:** The username of the comment’s author.
* **author_premium:** Indicates if the author has a premium account.
* **distinguished:** A flag indicating if the post is from a moderator or other special status.

**5.Post and Comment Attributes:**

* **over_18:** A flag indicating if the content is marked as NSFW (Not Safe For Work).
* **is_self_post:** Indicates if the post is a self-post (text-only) rather than a link.
* **post_flair and link_flair_text:** Text tags applied to the post, which may reflect topic categories or sentiments.
* **author_flair_text:** A flair assigned to the author, possibly indicating affiliation or status in the subreddit.

**6.Awards and Other Engagement Indicators:**

* **all_awardings and total_awards_received:** Data on awards given to the post or comment, reflecting user appreciation.
* **post_thumbnail:** A thumbnail image associated with the post, if available.

**7.Identifiers:**

* **post_id and comment_id:** Unique identifiers for each post and comment, respectively. These help in tracking specific posts or comments.


In [None]:
#CHECK NUMBER OF ROWS AND COLUMNS
data.shape

The dataset has 92,395 rows and 27 columns.

**DROPPING IRRELEVANT COLUMNS**

Based on our project goal of identifying and understanding mental health discussions in social media text, we grouped the columns into relevant and less relevant columns:

**Relevant Columns(11)**

**Text Content:** title, post_body, comment_body - Primary text fields for analyzing mental health topics and sentiment.

**Engagement Metrics:** post_score, comment_score, upvote_ratio - Indicate community engagement and post relevance.

**Categorization:** label, subreddit, post_flair, link_flair_text - Useful for identifying mental health categories, topics, or sentiment.

**Timestamp:** created - Helps in analyzing trends over time.

**Less Relevant Columns(16)**

**Identifiers and URLs:** post_url, post_id, comment_id, author, comment_author - Useful for tracking but not for text analysis.

**Other Metadata:** over_18, is_self_post, author_premium, distinguished, post_thumbnail, all_awardings, total_awards_received, author_flair_text, post_num_comments, num_crossposts - Provide limited insight into mental health content.

In [None]:
#Dropping Columns
# List of columns to drop  based on our analysis
columns_to_drop = [
    'post_url', 'post_id', 'comment_id', 'author', 'comment_author',
    'post_num_comments', 'over_18', 'author_premium', 'is_self_post', 'distinguished',
    'post_thumbnail', 'all_awardings', 'total_awards_received',
    'author_flair_text', 'num_crossposts'
]

# Dropping irrelevant columns from the DataFrame
data = data.drop(columns=columns_to_drop)

# Display the DataFrame to verify
data.head()


**2.3 DATA CLEANING**

In [None]:
#CHECKING FOR MISSING VALUES
missing_values = data.isnull().sum()
missing_values

In [None]:
# Fill missing values in 'post_body' with an empty string because it's a text field, 
# and missing text can be assumed to have no content.
data['post_body'] = data['post_body'].fillna('')

# Fill missing values in 'post_flair' and 'link_flair_text' with 'No Flair' 
# or 'Unknown' since flairs are categorical and a missing value here likely means 
# that the post didn't have any flair assigned.
data['post_flair'] = data['post_flair'].fillna('No Flair')
data['link_flair_text'] = data['link_flair_text'].fillna('No Flair')

# Display the result to verify the filling of missing values
data.isnull().sum()
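
Before choosing a fill strategy, it can help to see how much of each column is actually missing. The following is a minimal sketch on a toy DataFrame; the helper name `missing_summary` and the toy values are assumptions for illustration only.

```python
import pandas as pd


def missing_summary(df):
    """Return the count and percentage of missing values per column, worst first."""
    summary = pd.DataFrame({
        "missing": df.isnull().sum(),
        "percent": (df.isnull().mean() * 100).round(2),
    })
    # Keep only columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy data standing in for the Reddit dataset
toy = pd.DataFrame({
    "post_body": ["text", None, None, "more"],
    "post_flair": [None, "Vent", "Advice", "Vent"],
    "label": ["happy", "neutral", "neutral", "happy"],
})
report = missing_summary(toy)
print(report)
```

A column that is mostly empty (for example a flair field) may be better treated as a "has flair / no flair" indicator than as a category in its own right.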


In [None]:
#CHECK FOR DUPLICATES

duplicates = data.duplicated()
duplicate_count = duplicates.sum()
duplicate_count

In [None]:
# Remove duplicate rows based on all columns
data = data.drop_duplicates()

# Reset the index after dropping duplicates
data = data.reset_index(drop=True)

# Display the number of rows after removing duplicates
print(f"Number of rows after removing duplicates: {len(data)}")


**DEALING WITH OUTLIERS**

We decided to plot histograms and box plots to understand the distribution of the data so that we can choose an effective method to deal with outliers.

In [None]:
#Visualizing distribution of our data
import matplotlib.pyplot as plt
import seaborn as sns

# Select numeric columns only
numeric_columns = data.select_dtypes(include=['float64', 'int64']).columns

# Set up the plotting area
plt.figure(figsize=(15, len(numeric_columns) * 4))

for i, column in enumerate(numeric_columns, 1):
    plt.subplot(len(numeric_columns), 2, 2*i - 1)
    sns.histplot(data[column], kde=True)
    plt.title(f'Histogram of {column}')
    
    plt.subplot(len(numeric_columns), 2, 2*i)
    sns.boxplot(x=data[column])
    plt.title(f'Boxplot of {column}')

plt.tight_layout()
plt.show()


From the histograms and boxplots, the data appears to be heavily skewed. Here’s a breakdown of the observations:

**Histograms:**

Most histograms show a right (positive) skew, with a concentration of values on the left and a long tail extending to the right.
This pattern suggests that there are many lower values and a few extreme higher values, which is common in social media engagement metrics like scores and upvote ratios.

**Boxplots:**

The boxplots show many outliers on the right side, which is a characteristic of positively skewed data.
The interquartile range (IQR) is narrow for several columns, but there are many data points beyond the upper whisker, indicating the presence of high-value outliers.
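
The visual impression of skew can also be checked numerically. The sketch below uses pandas' built-in sample skewness on toy values (the numbers are made up for illustration); a clearly positive result corresponds to the long right tail described above.

```python
import pandas as pd

# Toy engagement data: one extreme post_score, fairly even upvote_ratio values
toy = pd.DataFrame({
    "post_score": [1, 2, 2, 3, 3, 4, 250],
    "upvote_ratio": [0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 1.0],
})

# Sample skewness per numeric column; values well above 0 indicate a right skew
skewness = toy.skew(numeric_only=True)
print(skewness)
```

On the real data, the same call would be `data.skew(numeric_only=True)`.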

In [None]:
# Identify numeric columns in the DataFrame
numeric_columns = data.select_dtypes(include=['float64', 'int64']).columns

# Initialize a dictionary to store outliers
iqr_outliers = {}

for column in numeric_columns:
    # Calculate Q1 (25th percentile) and Q3 (75th percentile) for each column
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1  # Calculate the IQR

    # Define outlier boundaries
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Identify outliers
    outliers = data[column][(data[column] < lower_bound) | (data[column] > upper_bound)]
    iqr_outliers[column] = outliers

    # Print the number of outliers for each column
    print(f"{column}: {len(outliers)} outliers")

# Optional: Display specific outlier values for a particular column
# Uncomment and replace 'column_name' with the name of the column to inspect
# print(iqr_outliers['column_name'])


Given the nature and purpose of our analysis, removing these outliers may lead to a loss of valuable information about high-engagement or high-impact posts. We decided to retain these rows rather than drop them.
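
The IQR rule is applied several times in this notebook, so it could be wrapped in a small reusable helper. This is a sketch only; the function names `iqr_bounds` and `iqr_outlier_mask` are hypothetical and the scores are toy values.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) outlier boundaries for the IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def iqr_outlier_mask(series, k=1.5):
    """Boolean mask that is True where a value lies outside the IQR boundaries."""
    lower, upper = iqr_bounds(series, k)
    return (series < lower) | (series > upper)


# Toy scores with a single extreme value
scores = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
bounds = iqr_bounds(scores)
mask = iqr_outlier_mask(scores)
print(bounds, scores[mask].tolist())
```

Returning a mask instead of the outlier values keeps the original index, which makes it easy to inspect, cap, or keep the flagged rows later.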

In [None]:
import numpy as np

# Function to calculate the percentile rank (0-100) of each outlier value in a column
def calculate_percentile_ranks(outliers, column):
    # Percentile rank = percentage of values in the column that are <= the outlier value.
    # (Feeding this rank back into np.percentile would return roughly the original
    # value again, not its rank.)
    sorted_values = np.sort(data[column].dropna().to_numpy())
    positions = np.searchsorted(sorted_values, outliers.to_numpy(), side='right')
    return positions / len(sorted_values) * 100

# Initialize dictionary to store percentile ranks of outliers for each column
outlier_percentiles = {}

# List of numeric columns to check
columns_to_check = ['comment_score', 'post_score', 'upvote_ratio']

# Detect outliers and calculate their percentile ranks
for column in columns_to_check:
    # Calculate Q1 and Q3 for the IQR method
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    
    # Define outlier boundaries
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    # Identify outliers
    outliers = data[column][(data[column] < lower_bound) | (data[column] > upper_bound)]
    
    # Calculate the percentile ranks of the outliers
    percentile_ranks = calculate_percentile_ranks(outliers, column)
    
    # Store results in a dictionary
    outlier_percentiles[column] = percentile_ranks

    # Print summaries: raw outlier values first, then their percentile ranks
    print(f"{column}: summary of outlier values (in the column's own units):")
    print(outliers.describe())
    print(f"{column}: summary of outlier percentile ranks (0-100):")
    print(pd.Series(percentile_ranks).describe())


**Analysis of Outlier Summary Statistics**

The figures below summarize the outlier values in each column's own units. For example, a mean of 28,390 for post_score is a score, not a rank on a 0–100 percentile scale.

**comment_score:**

Mean: 74.98, suggesting that most outliers are relatively high values.
Range: The outliers span a wide range, including some at the low end of the distribution, around a mean of about 75.

**Recommendation:** Cap the upper end at the 95th percentile. Since there are also some low outliers, cap the lower end at the 5th percentile.

**post_score:**

Mean: 28,390, suggesting that most outliers have very high values.
Range: Outliers vary significantly, with many extreme values, as indicated by a high standard deviation (25,245).

**Recommendation:** Cap the upper end at the 90th or 95th percentile. Given the wide range of values, capping the lower end may not be necessary if the focus is on reducing high extremes.

**upvote_ratio:**

Mean: 0.43, with values clustered around 0.34–0.54.
Range: Since this is a ratio, outliers are not as extreme as in comment_score or post_score.

**Recommendation:** Cap the upper end at the 90th percentile. This should reduce outliers without affecting the main distribution too much.

**CAPPING OUTLIERS**

In [None]:
# Apply capping based on the recommended percentiles
data['comment_score'] = data['comment_score'].clip(lower=data['comment_score'].quantile(0.05),
                                                   upper=data['comment_score'].quantile(0.95))

data['post_score'] = data['post_score'].clip(upper=data['post_score'].quantile(0.95))

data['upvote_ratio'] = data['upvote_ratio'].clip(upper=data['upvote_ratio'].quantile(0.90))
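
To make the effect of percentile capping concrete, here is a small worked example on toy values (the integers 1 to 100 are made up for illustration). Capping replaces values beyond the chosen quantiles with the quantile itself, so no rows are dropped.

```python
import pandas as pd

# Toy scores: the integers 1..100
scores = pd.Series(range(1, 101))

# Quantile-based caps, mirroring the 5th/95th percentile rule used for comment_score
lower = scores.quantile(0.05)
upper = scores.quantile(0.95)
capped = scores.clip(lower=lower, upper=upper)

# Count how many values were moved to a cap
changed = int((capped != scores).sum())
print(f"caps: [{lower}, {upper}], values changed: {changed}, rows kept: {len(capped)}")
```

Because the row count is unchanged, capping is compatible with the earlier decision to keep high-engagement posts in the dataset.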



In [None]:
# List of capped columns to check for remaining outliers
columns_to_check = ['comment_score', 'post_score', 'upvote_ratio']

print("Outliers detected using IQR method after capping:")

for column in columns_to_check:
    # Calculate Q1 and Q3 for the IQR method
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1

    # Define outlier boundaries
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Identify outliers
    outliers = data[column][(data[column] < lower_bound) | (data[column] > upper_bound)]
    
    # Print the number of remaining outliers for each column after capping
    print(f"{column}: {len(outliers)} remaining outliers after capping")


**3.0 EXPLORATORY DATA ANALYSIS**

**3.1 Bar Plot showing Distribution of Labels**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Plot the distribution of labels
plt.figure(figsize=(10, 6))
sns.countplot(data=data, x='label')
plt.title("Distribution of Mental Health Labels")
plt.xlabel("Mental Health Issue Label")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.show()


**Insights:**

The chart shows three categories with varying frequencies. The neutral category (green bar) has the highest count, followed by the happy/positive category (orange bar), and the lowest is the mental health issue category (blue bar).


**Implications**

Some categories, namely neutral and happy, are more prevalent in the dataset. This could impact model training, as an imbalanced dataset may lead the model to perform better on the majority category and worse on the minority.

We may need to consider balancing techniques, such as oversampling the minority class or using class weights, to ensure that the model performs well across all categories.
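
As a rough illustration of the class-weight option, the sketch below computes "balanced" weights with scikit-learn on a toy label array. The 6/3/1 class counts are invented for the example and are not the dataset's real proportions.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: the minority class is mental_health_issue
labels = np.array(["neutral"] * 6 + ["happy"] * 3 + ["mental_health_issue"] * 1)
classes = np.unique(labels)

# "balanced" weights are n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
class_weights = dict(zip(classes, weights))
print(class_weights)
```

Passing `class_weight="balanced"` directly to an estimator such as `LogisticRegression` applies the same weighting without computing it by hand.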


**3.2 Histogram showing Post Length Distribution**

In [None]:
# Calculate the length of each post
data['post_length'] = data['post_body'].apply(lambda x: len(x.split()))

# Plot the distribution of post lengths
plt.figure(figsize=(15, 12))
sns.histplot(data['post_length'], bins=30, kde=True)
plt.title("Distribution of Post Lengths")
plt.xlabel("Post Length (words)")
plt.ylabel("Frequency")
plt.show()


**Insights**

The data is highly skewed to the right, with a large concentration of posts on the left side (lower range).
This indicates that most posts are shorter than 1,000 words, while fewer posts exceed 1,000 words.

However, there are some outliers that deviate from the typical post length and run up to about 6,000 words.

**Implications**

For modeling, we may handle outliers (very long posts) separately or exclude them if they don't contribute meaningfully to our analysis.
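
One simple way to handle very long posts separately is to split the data on a word-count threshold. This is a sketch on toy data; the 1,000-word cut-off (`MAX_WORDS`) is an assumed value taken from the histogram reading above, not a tuned parameter.

```python
import pandas as pd

# Toy posts: a short one, a very long one, and an empty body
toy = pd.DataFrame({"post_body": ["short post", "word " * 1200, ""]})

# Word count per post (an empty string yields 0)
toy["post_length"] = toy["post_body"].str.split().str.len()

MAX_WORDS = 1000  # assumed threshold for "very long" posts
long_posts = toy[toy["post_length"] > MAX_WORDS]
regular_posts = toy[toy["post_length"] <= MAX_WORDS]
print(len(regular_posts), "regular posts,", len(long_posts), "long posts")
```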

**3.3 Histogram Showing Distribution of Sentiment Scores**

In [None]:

# Initialize the sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

# Apply VADER sentiment analysis to each post
data['sentiment'] = data['post_body'].apply(lambda x: analyzer.polarity_scores(x)['compound'])

# Plot the sentiment distribution
plt.figure(figsize=(10, 6))
sns.histplot(data['sentiment'], bins=30, kde=True)
plt.title("Distribution of Sentiment Scores")
plt.xlabel("Sentiment Score")
plt.ylabel("Frequency")
plt.show()


**Insights**

This histogram appears to show a bimodal distribution with two distinct peaks.

The two prominent peaks indicate that there are two distinct groups within the data. This could imply two different populations or behaviors within the dataset.

The first peak represents one group whose sentiment score is centered around -0.25 (fairly negative), while the second peak represents a different group whose sentiment score is centered around 1 (very positive).

Both ends of the histogram show smaller bars, which could represent outliers or infrequent behaviors not fitting into the main clusters.

**Implications**

Given the clear separation between the two peaks, it may be beneficial to treat the two groups separately in our analysis.

This segmentation could allow for more targeted insights or better model performance, especially if the behaviors or language used in each group differ.


**Modeling Considerations**

A single model may not capture the nuances across both clusters effectively. We may need to consider building separate models or using clustering techniques to handle each group independently.
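
A lightweight first step toward treating the groups separately is to bucket the VADER compound score. The sketch below uses the ±0.05 cut-offs conventionally suggested for VADER compound scores; the function name `sentiment_bucket` is hypothetical, and the thresholds would likely need adjusting for this dataset.

```python
def sentiment_bucket(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment bucket."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Example scores near the two peaks described above, plus a near-zero score
examples = [-0.25, 0.98, 0.01]
buckets = [sentiment_bucket(score) for score in examples]
print(buckets)
```

On the real data this could be applied as `data['sentiment'].apply(sentiment_bucket)` and compared against the `label` column before committing to separate models.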

**FEATURE ENGINEERING**

We decided to create a feature named high_engagement, which flags high-engagement posts based on post_score, and to visualize it with a count plot. High-engagement posts might indicate popular or impactful discussions.

In [None]:

# Step 1: Define high engagement based on the 90th percentile of post_score
high_engagement_threshold = data['post_score'].quantile(0.90)
data['high_engagement'] = data['post_score'] > high_engagement_threshold

# Step 2: Filter data for the specific labels: mental_health_issue, happy, and neutral
filtered_data = data[data['label'].isin(['mental_health_issue', 'happy', 'neutral'])]

# Step 3: Plot high vs low engagement for each label
plt.figure(figsize=(10, 6))
sns.countplot(x='label', hue='high_engagement', data=filtered_data)
plt.title('Comparison of High vs. Low Engagement for mental_health_issue, happy, and neutral Labels')
plt.xlabel('Label')
plt.ylabel('Number of Posts')
plt.legend(title='Engagement Level', labels=['Low Engagement', 'High Engagement'])
plt.show()


**Insights**

**Engagement Distribution:**

1.Neutral content has the highest number of low-engagement posts (blue), indicating that neutral topics generally receive less interaction from the audience.

2.Mental health-related content has no high-engagement posts. One possible explanation is that people experiencing mental health issues are reluctant to discuss them on social platforms, which would be consistent with the high number of low-engagement posts.

3.The Happy label has a moderate number of both high- and low-engagement posts, indicating a balanced engagement level for positive content.
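
Because the labels have very different sizes, raw counts can be hard to compare. The sketch below shows the same comparison as row-normalized shares, using a toy DataFrame whose values are invented for illustration.

```python
import pandas as pd

# Toy data standing in for the labelled posts
toy = pd.DataFrame({
    "label": ["neutral", "neutral", "neutral", "happy", "happy",
              "mental_health_issue", "mental_health_issue"],
    "high_engagement": [False, False, True, True, False, False, False],
})

# Readable column names instead of True/False
toy["engagement"] = toy["high_engagement"].map({True: "high", False: "low"})

# Share of high vs. low engagement within each label (each row sums to 1)
share = pd.crosstab(toy["label"], toy["engagement"], normalize="index")
print(share)
```

Shares make it clearer whether a label has few high-engagement posts simply because it has few posts overall.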

In [None]:
#viewing columns
data.columns

**CORRELATION MATRIX FOR NUMERICAL FEATURES**

In [None]:


# Select only numerical columns for the correlation matrix
numerical_columns = [
    'comment_score', 'created', 'post_score', 'post_created', 
    'upvote_ratio', 'post_length', 'high_engagement'
]

# Calculate the correlation matrix
correlation_matrix = data[numerical_columns].corr()

# Plot the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix of Engagement-Related Features')
plt.show()


Due to the strong correlation between post_score and high_engagement, we'll consider removing one of these columns to reduce multicollinearity.

We'll keep one of them based on its importance to our analysis.
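
Highly correlated pairs can also be listed programmatically rather than read off the heatmap. This is a minimal sketch on toy columns; the helper name `correlated_pairs` and the 0.8 threshold are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, threshold=0.8):
    """Return column pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is skipped
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Toy columns: b is an exact multiple of a, c is only weakly related to both
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 1, 4, 2, 3],
})
pairs = correlated_pairs(toy)
print(pairs)
```

Since high_engagement is derived directly from post_score, a strong correlation between the two is expected by construction.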


**WORD CLOUD VISUALIZATIONS PER LABEL**

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from matplotlib import font_manager

# Define path to a TrueType font
font_path = font_manager.findSystemFonts(fontpaths=None, fontext='ttf')[0]

# Function to generate word cloud for a specific label
def generate_word_cloud(data, label, column='post_body'):
    # Filter the data for the specified label and join all text in the column
    text = " ".join(data[data['label'] == label][column].dropna())
    
    # Generate word cloud
    wordcloud = WordCloud(width=800, height=400, background_color='white', colormap='viridis', font_path=font_path).generate(text)
    
    # Plot word cloud
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(f'Word Cloud for {label} Label')
    plt.show()

# Generate word clouds for each label
generate_word_cloud(data, 'mental_health_issue')
generate_word_cloud(data, 'happy')
generate_word_cloud(data, 'neutral')
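
Word clouds are easier to compare when paired with plain word counts. The sketch below counts the most frequent words in a list of texts using only the standard library; the helper name `top_words` and the tiny stopword set are illustrative assumptions.

```python
import re
from collections import Counter


def top_words(texts, n=3, stopwords=frozenset({"the", "a", "i", "and", "to"})):
    """Return the n most common non-stopword tokens across the given texts."""
    counter = Counter()
    for text in texts:
        # Lowercase and keep alphabetic tokens (apostrophes allowed)
        tokens = re.findall(r"[a-z']+", text.lower())
        counter.update(token for token in tokens if token not in stopwords)
    return counter.most_common(n)


# Toy posts standing in for one label's post_body values
posts = ["I feel anxious", "anxious and tired", "feel tired, anxious"]
most_common = top_words(posts, n=1)
print(most_common)
```

For a given label, the input could be `data[data['label'] == 'mental_health_issue']['post_body']`.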
