**CAPSTONE PROJECT: MENTAL HEALTH ISSUE IDENTIFICATION SYSTEM**

Please fill out:
* Student names: Issac Wanganga, Cynthia Jerono, Jim Akoko, Bestina Mutisya, Victor Maina, Beryl Wafula
* Student pace:  **PART TIME**
* Scheduled project review date/time: **18/11/2024**
* Instructor name: Mildred Jepkosgei

**1. BUSINESS UNDERSTANDING**

**1.1 Introduction**

Mental health has become an urgent public health concern across the globe, and Kenya is no exception. According to the Kenya National Commission on Human Rights, approximately 25% of outpatients and 40% of inpatients in Kenyan healthcare facilities are affected by mental health conditions. Depression, substance abuse, stress, and anxiety disorders are among the most commonly diagnosed mental health issues in hospital settings, reflecting an alarming national trend. The situation is compounded by limited data on mental, neurological, and substance use (MNS) conditions in Kenya, which makes these concerns difficult to address effectively.


The World Health Organization (WHO) ranks Kenya among the African nations with the highest depression rates, with estimates suggesting that around two million Kenyans are impacted by depression alone. Disturbingly, one in four Kenyans will experience a mental health disorder at some point in their lives.


Given the urgent need to address mental health concerns, this project aims to leverage artificial intelligence to identify and analyze mental health indicators within social media text.

By capturing and analyzing patterns of mental health issues expressed in public discourse, the project seeks to provide insights that can inform policymakers, healthcare providers, and support systems. In doing so, it contributes to a broader understanding of mental health in Kenya and aligns with the national objective of prioritizing mental well-being.

**Problem Statement**

Signs of mental health issues such as depression, anxiety, and suicidal tendencies often go unnoticed in everyday conversations, especially in online forums, social media posts, and text-based support systems. Existing tools are either too general or rely heavily on structured input, so they miss subtle signs of distress embedded in unstructured conversations. This project aims to identify potential mental health concerns from users’ language and conversational patterns in online text.

**Goals and Objectives**

**1. Identify and Categorize Mental Health Issues:**

Develop a model that can accurately classify different mental health issues (e.g., depression, anxiety, suicidal tendencies) based on text data in Reddit posts and comments.

**2. Analyze Language Patterns Linked to Mental Distress:**

Detect and analyze linguistic features and conversational patterns commonly associated with mental health issues to help distinguish subtle indicators of distress.

**3. Assess Sentiment and Emotional Intensity:**

Implement sentiment analysis to assess the emotional intensity and tone of posts and comments, helping to prioritize urgent cases or severe distress.

**4. Provide Actionable Insights for Intervention:**

Generate insights that could support mental health professionals and social media moderators in identifying and addressing potential cases of mental health crises on forums and social platforms.

**STAKEHOLDERS**

### 1. Government and Health Agencies ###

i. **Ministry of Health (Kenya)**: As a primary body responsible for public health policies, they are key stakeholders in using the project's insights to shape mental health policies and interventions.

ii. **Kenya National Commission on Human Rights**: Advocates for better mental health services and safeguards the human rights of those affected by mental health issues.

iii. **National Authority for the Campaign Against Alcohol and Drug Abuse (NACADA)**: Given the links between substance abuse and mental health, NACADA's involvement could help tailor intervention programs.

### 2. Healthcare Providers ###

i. **Psychiatrists, Psychologists, and Therapists**: As frontline workers in diagnosing and treating mental health disorders, they would benefit from insights into prevalent issues and potential trends in patient symptoms.

ii. **Healthcare Facilities (Hospitals, Clinics)**: Understanding the mental health landscape can help facilities prepare resources and adapt treatment protocols to better address patient needs.

iii. **Public Health Organizations**: Including organizations like the World Health Organization (WHO), which can leverage findings to inform global and regional strategies on mental health.

### 3. Mental Health Advocacy Groups and NGOs ###

i. **Basic Needs Kenya, Mental Health Kenya, and Befrienders Kenya**: These advocacy groups work on awareness, support, and outreach programs, so insights from the project can help them tailor their initiatives and better support affected individuals.

ii. **Kenya Red Cross**: Often involved in providing mental health support during crises, they could use the data to identify areas with higher mental health needs.

### 4. Policy Makers and Legislators ###

i. **National Assembly's Health Committee**: To help in reviewing and proposing mental health legislation that aligns with the insights gathered from the analysis.

ii. **County Health Administrators**: Local-level officials who can use the insights to design tailored mental health programs at the community level.


**2.DATA COLLECTION**

To gather a robust dataset for the Mindcheck project, we utilized the Reddit API through the Python Reddit API Wrapper (PRAW). This approach enabled us to collect a wide range of posts and comments relevant to mental health discussions, positive expressions, and neutral content, which would support the accurate identification and classification of mental health concerns.

We used keyword-based search queries and collected up to 5,000 posts per subreddit. Each post’s title, body, comments, and metadata (e.g., author information, comment scores, timestamps, and subreddit details) were captured to support downstream text analysis. We also included additional post attributes, such as flair, upvote ratios, and crosspost counts, which may serve as helpful features in identifying mental health patterns.

The final dataset was structured and saved as a CSV file for convenient access, providing a broad sample of mental health, positive, and neutral content from Reddit. Its structure supports analysis of mental health discussions on social media, including engagement, sentiment, and topic categorization.
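
The collection step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual collection script: the helper name `rows_from_post`, the choice of fields, and the mapping of `created` to the comment timestamp are assumptions, and stand-in objects replace live PRAW results so the example runs without credentials.

```python
from types import SimpleNamespace


def rows_from_post(post, comments, label):
    """Flatten one post and its comments into one row per comment.

    `post` and `comments` only need the attributes read below, which
    match the names PRAW exposes on submissions and comments.
    """
    rows = []
    for comment in comments:
        rows.append({
            "title": post.title,
            "post_body": post.selftext,
            "comment_body": comment.body,
            "comment_score": comment.score,
            "created": comment.created_utc,      # assumed: comment timestamp
            "subreddit": str(post.subreddit),
            "label": label,
            "post_score": post.score,
            "post_created": post.created_utc,
            "upvote_ratio": post.upvote_ratio,
            "post_id": post.id,
            "comment_id": comment.id,
        })
    return rows


# Stand-in objects (hypothetical values) so the sketch runs offline.
post = SimpleNamespace(
    title="Example title", selftext="Example body", score=3,
    created_utc=1730482000.0, upvote_ratio=1.0,
    subreddit="mentalhealth", id="p1",
)
comments = [
    SimpleNamespace(body="First reply", score=2, created_utc=1730484000.0, id="c1"),
    SimpleNamespace(body="Second reply", score=1, created_utc=1730485000.0, id="c2"),
]

rows = rows_from_post(post, comments, label="mental_health_issue")
print(len(rows), rows[0]["comment_body"])

# With PRAW, the same helper could be fed roughly like this
# (credentials and the search keyword are placeholders):
#   reddit = praw.Reddit(client_id=..., client_secret=..., user_agent=...)
#   for post in reddit.subreddit("mentalhealth").search("anxiety", limit=100):
#       post.comments.replace_more(limit=0)
#       rows += rows_from_post(post, post.comments.list(), "mental_health_issue")
```

The resulting list of dictionaries can be passed to `pd.DataFrame(rows).to_csv(...)` to produce a CSV file with one row per comment, which is the layout seen in the loaded dataset.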

**2.1 DATA LOADING AND IMPORTING RELEVANT LIBRARIES**


In [3]:
# IMPORTING RELEVANT LIBRARIES
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, accuracy_score, precision_score, recall_score, f1_score



In [6]:
#LOADING THE DATASET

data = pd.read_csv("broad_reddit_search_with_labels.csv")

In [7]:
#VIEW FIRST FIVE ROWS OF THE DATASET
data.head()

| | title | post_body | comment_body | comment_score | post_url | created | subreddit | label | post_score | post_num_comments | ... | author_premium | distinguished | all_awardings | num_crossposts | total_awards_received | post_thumbnail | link_flair_text | post_id | comment_id | author_flair_text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | I don't know what's wrong with me | I'm finding it really hard to keep myself toge... | Have you got a therapist on board? Sounds like... | 2 | https://www.reddit.com/r/mentalhealth/comments... | 1730484000.0 | mentalhealth | mental_health_issue | 1 | 1 | ... | False | | [] | 0 | 0 | self | Venting | 1ghb2bs | luw5fgj | |
| 1 | Friends who distance themselves from you or cu... | Please tell me if I sound entitled or selfish.... | I think it’s a combination of factors, and I s... | 2 | https://www.reddit.com/r/mentalhealth/comments... | 1730480000.0 | mentalhealth | mental_health_issue | 1 | 1 | ... | False | | [] | 0 | 0 | self | Venting | 1gh9prb | luvspbv | |
| 2 | sometimes my brain just keeps telling me bad t... | Does anyone else ever get like this? Or have a... | When I have a panic attack, I tend to think th... | 1 | https://www.reddit.com/r/mentalhealth/comments... | 1730486000.0 | mentalhealth | mental_health_issue | 2 | 2 | ... | False | | [] | 0 | 0 | self | Venting | 1gh98yg | luwcwu7 | |
| 3 | sometimes my brain just keeps telling me bad t... | Does anyone else ever get like this? Or have a... | Im wondering if it was simply a panic attack. ... | 2 | https://www.reddit.com/r/mentalhealth/comments... | 1730487000.0 | mentalhealth | mental_health_issue | 2 | 2 | ... | False | | [] | 0 | 0 | self | Venting | 1gh98yg | luwep2n | |
| 4 | Need objective support. I’m in over my head | I feel in over my head and I’m not sure what t... | Please consider seeing a psychologist. Good luck! | 1 | https://www.reddit.com/r/mentalhealth/comments... | 1730477000.0 | mentalhealth | mental_health_issue | 1 | 4 | ... | False | | [] | 0 | 0 | self | Need Support | 1gh8q6w | luvir86 | |


**2.2 DATA DESCRIPTION**

In [8]:
#GETTING GENERAL INFORMATION ON NON-NULL COUNTS AND DATA TYPES PER COLUMN
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92395 entries, 0 to 92394
Data columns (total 27 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   title                  92395 non-null  object 
 1   post_body              63730 non-null  object 
 2   comment_body           92395 non-null  object 
 3   comment_score          92395 non-null  int64  
 4   post_url               92395 non-null  object 
 5   created                92395 non-null  float64
 6   subreddit              92395 non-null  object 
 7   label                  92395 non-null  object 
 8   post_score             92395 non-null  int64  
 9   post_num_comments      92395 non-null  int64  
 10  author                 92395 non-null  object 
 11  comment_author         86180 non-null  object 
 12  post_created           92395 non-null  float64
 13  post_flair             35211 non-null  object 
 14  upvote_ratio           92395 non-null  float64
 15  ov

**Description of the data:**

Total Entries: 92,395

Columns: 27, with various data types including object (text), int64 (integer), float64 (floating-point), and bool (boolean).


**Data Columns Overview**

**1. Post and Comment Content:**

- **title:** The title of the post, which may provide a summary of the content.
- **post_body:** The main content or body of the post.
- **comment_body:** The content of a specific comment on the post.

**2. Engagement and Score:**

- **post_score:** The score (upvotes minus downvotes) of the post, which may indicate popularity.
- **comment_score:** The score (upvotes minus downvotes) of the comment.
- **upvote_ratio:** The ratio of upvotes to total votes for the post.
- **num_crossposts:** The number of times the post has been cross-posted to other subreddits.
- **post_num_comments:** The number of comments on the post, indicating engagement.

**3. Metadata:**

- **post_url:** The URL of the post, useful for tracking or referencing.
- **created:** The Unix timestamp (in seconds) at which the comment was created; rows that share a post carry different values.
- **post_created:** The Unix timestamp (in seconds) at which the post was created.
- **subreddit:** The subreddit where the post or comment was made, which helps in filtering data by community focus.
- **label:** The category label for each row (e.g., mental_health_issue), which serves as the classification target.

**4. User Information:**

- **author:** The username of the post’s author.
- **comment_author:** The username of the comment’s author.
- **author_premium:** Indicates if the author has a premium account.
- **distinguished:** A flag indicating if the post is from a moderator or other special status.

**5. Post and Comment Attributes:**

- **over_18:** A flag indicating if the content is marked as NSFW (Not Safe For Work).
- **is_self_post:** Indicates if the post is a self-post (text-only) rather than a link.
- **post_flair and link_flair_text:** Text tags applied to the post, which may reflect topic categories or sentiments.
- **author_flair_text:** A flair assigned to the author, possibly indicating affiliation or status in the subreddit.

**6. Awards and Other Engagement Indicators:**

- **all_awardings and total_awards_received:** Data on awards given to the post or comment, reflecting user appreciation.
- **post_thumbnail:** A thumbnail image associated with the post, if available.

**7. Identifiers:**

- **post_id and comment_id:** Unique identifiers for each post and comment, respectively. These help in tracking specific posts or comments.
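
The `created` and `post_created` values in the preview are floats such as 1730484000.0, which read as Unix epoch seconds. As a hedged sketch of how they could be made human-readable, the example below converts them with `pd.to_datetime`. The toy frame, the `_dt` column suffix, and the `reply_delay_min` feature are illustrative assumptions and not part of the original notebook.

```python
import pandas as pd

# Toy frame mirroring two timestamp columns from the dataset preview.
sample = pd.DataFrame({
    "created": [1730484000.0, 1730480000.0],
    "post_created": [1730482000.0, 1730479000.0],
})

# Interpret the floats as seconds since the Unix epoch, in UTC.
for col in ["created", "post_created"]:
    sample[col + "_dt"] = pd.to_datetime(sample[col], unit="s", utc=True)

# Hypothetical derived feature: minutes between the post and the comment.
sample["reply_delay_min"] = (
    sample["created_dt"] - sample["post_created_dt"]
).dt.total_seconds() / 60

print(sample[["created_dt", "post_created_dt", "reply_delay_min"]])
```

The first value converts to 1 November 2024, 18:00 UTC, which is consistent with the period in which the data was collected.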


In [10]:
#CHECK NUMBER OF ROWS AND COLUMNS
data.shape

(92395, 27)

The dataset has 92,395 rows and 27 columns.

**DROPPING IRRELEVANT COLUMNS**


In [6]:
import pandas as pd

# Reload the dataset so this cell can run on its own
data = pd.read_csv('broad_reddit_search_with_labels.csv')

# Columns judged irrelevant to the text analysis
# (identifiers, user details, awards, and other post metadata)
columns_to_drop = [
    'post_url', 'post_id', 'comment_id', 'author', 'comment_author',
    'post_num_comments', 'over_18', 'author_premium', 'is_self_post', 'distinguished',
    'post_thumbnail', 'all_awardings', 'total_awards_received',
    'author_flair_text', 'num_crossposts'
]

# Drop the irrelevant columns from the DataFrame
data = data.drop(columns=columns_to_drop)

# Display the first few rows to verify
data.head()

| | title | post_body | comment_body | comment_score | created | subreddit | label | post_score | post_created | post_flair | upvote_ratio | link_flair_text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | I don't know what's wrong with me | I'm finding it really hard to keep myself toge... | Have you got a therapist on board? Sounds like... | 2 | 1730484000.0 | mentalhealth | mental_health_issue | 1 | 1730482000.0 | Venting | 1.0 | Venting |
| 1 | Friends who distance themselves from you or cu... | Please tell me if I sound entitled or selfish.... | I think it’s a combination of factors, and I s... | 2 | 1730480000.0 | mentalhealth | mental_health_issue | 1 | 1730479000.0 | Venting | 1.0 | Venting |
| 2 | sometimes my brain just keeps telling me bad t... | Does anyone else ever get like this? Or have a... | When I have a panic attack, I tend to think th... | 1 | 1730486000.0 | mentalhealth | mental_health_issue | 2 | 1730478000.0 | Venting | 1.0 | Venting |
| 3 | sometimes my brain just keeps telling me bad t... | Does anyone else ever get like this? Or have a... | Im wondering if it was simply a panic attack. ... | 2 | 1730487000.0 | mentalhealth | mental_health_issue | 2 | 1730478000.0 | Venting | 1.0 | Venting |
| 4 | Need objective support. I’m in over my head | I feel in over my head and I’m not sure what t... | Please consider seeing a psychologist. Good luck! | 1 | 1730477000.0 | mentalhealth | mental_health_issue | 1 | 1730477000.0 | Need Support | 1.0 | Need Support |
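
When a drop list is maintained by hand, it can help to check it against the DataFrame before dropping. The sketch below is one possible way to do that: it de-duplicates the list, drops only the columns that exist, and reports any names that were not found. The helper name `drop_known_columns` and the toy frame are hypothetical.

```python
import pandas as pd


def drop_known_columns(df, candidates):
    """Drop the candidate columns that exist in df and report the rest."""
    unique = list(dict.fromkeys(candidates))  # de-duplicate, keep order
    present = [c for c in unique if c in df.columns]
    missing = [c for c in unique if c not in df.columns]
    return df.drop(columns=present), missing


# Toy frame with a subset of the dataset's column names.
toy = pd.DataFrame({"title": ["a"], "post_url": ["u"], "post_id": ["p"]})

trimmed, missing = drop_known_columns(
    toy, ["post_url", "post_id", "post_url", "comment_id"]
)
print(list(trimmed.columns))  # ['title']
print(missing)                # ['comment_id']
```

Returning the unmatched names instead of silently ignoring them makes typos in the list visible while keeping the cell safe to re-run.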


In [8]:
#CONFIRM THE REMAINING COLUMNS, NON-NULL COUNTS AND DATA TYPES
data.info()


**2.3 DATA CLEANING**

In [9]:
#CHECKING FOR MISSING VALUES
missing_values = data.isnull().sum()
missing_values

title                  0
post_body          28665
comment_body           0
comment_score          0
created                0
subreddit              0
label                  0
post_score             0
post_created           0
post_flair         57184
upvote_ratio           0
link_flair_text    57184
dtype: int64

The output shows the number of missing (null) values for each column in the data DataFrame:

- title: No missing values (0).
- post_body: 28,665 missing values (about 31% of the 92,395 rows), indicating that many posts have no body text.
- comment_body: No missing values (0), meaning every comment has text.
- comment_score: No missing values (0); all comments have a score.
- created: No missing values (0); every row has a creation timestamp.
- subreddit: No missing values (0), indicating that every row is associated with a subreddit.
- label: No missing values (0), meaning every row is labeled.
- post_score: No missing values (0); all posts have a score.
- post_created: No missing values (0), indicating every post has a creation timestamp.
- post_flair: 57,184 missing values (about 62% of rows), meaning most posts have no flair.
- upvote_ratio: No missing values (0), indicating the upvote ratio is available for every row.
- link_flair_text: 57,184 missing values, the same count as post_flair.
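
Raw counts are easier to compare as percentages, and the identical counts for post_flair and link_flair_text raise the question of whether the two columns carry the same values. The sketch below shows one way to check both on a toy frame. The toy data is made up, and whether the two flair columns match in the real dataset is an assumption to verify, not a finding.

```python
import numpy as np
import pandas as pd

# Toy frame using three column names from the dataset (values are made up).
toy = pd.DataFrame({
    "post_body": ["text", np.nan, "more text", np.nan],
    "post_flair": ["Venting", np.nan, np.nan, np.nan],
    "link_flair_text": ["Venting", np.nan, np.nan, np.nan],
})

# Missing count and percentage per column.
summary = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "percent": (toy.isnull().mean() * 100).round(1),
})
print(summary)

# Series.equals treats NaNs in the same position as equal, so this is True
# only when the two flair columns match row for row.
same_flair = toy["post_flair"].equals(toy["link_flair_text"])
print(same_flair)
```

If the same check returns True on the full dataset, one of the two flair columns could be dropped without losing information.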

In [10]:
# Fill missing 'post_body' values with an empty string: it is a text field,
# and a missing body is treated as a post with no body text.
data['post_body'] = data['post_body'].fillna('')

# Fill missing 'post_flair' and 'link_flair_text' values with 'No Flair':
# flairs are categorical, and a missing value most likely means
# that no flair was assigned to the post.
data['post_flair'] = data['post_flair'].fillna('No Flair')
data['link_flair_text'] = data['link_flair_text'].fillna('No Flair')

# Re-count missing values to verify the fill
data.isnull().sum()

title              0
post_body          0
comment_body       0
comment_score      0
created            0
subreddit          0
label              0
post_score         0
post_created       0
post_flair         0
upvote_ratio       0
link_flair_text    0
dtype: int64

The output shows that after filling, every column, including post_body, post_flair, and link_flair_text, reports 0 missing values. The retained columns of the dataset are therefore complete.
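
The same fill can also be expressed in a single call by passing a column-to-value mapping to `DataFrame.fillna`. The sketch below shows this on a toy frame; it is an equivalent alternative under the same fill values, not a change to the approach used in the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with the three columns that contained missing values.
toy = pd.DataFrame({
    "post_body": ["text", np.nan],
    "post_flair": [np.nan, "Venting"],
    "link_flair_text": [np.nan, "Venting"],
})

# One fill value per column, matching the choices made in the notebook.
fill_values = {
    "post_body": "",
    "post_flair": "No Flair",
    "link_flair_text": "No Flair",
}
filled = toy.fillna(value=fill_values)

# Fail loudly if any null survived the fill.
assert filled.isnull().sum().sum() == 0
print(filled)
```

Keeping the fill values in one dictionary makes the per-column choices easy to review and to extend if further columns need handling.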