# Mental Support Chatbot

This project aims to create a chatbot for Mental Health Support using Cosine Similarity for matching answers to user's questions. The chatbot is trained on a dataset of questions and answers related to mental health. The chatbot is implemented using Python and, further, it's going to be deployed using Flask.

## Dataset

The dataset used for training the chatbot is a csv file that contains three columns: 'Question_ID', 'Question' and 'Answer'. The dataset was downloaded from [Kaggle](https://www.kaggle.com/datasets/narendrageek/mental-health-faq-for-chatbot).

The following code is going to load the dataset and print one of the intents.

In [2]:
#import necessary libraries
import json
import numpy as np
import pandas as pd

#load the data from csv file
data = pd.read_csv('data.csv')

#display first 5 rows of the data
print(data.head())

   Question_ID                                          Questions  \
0      1590140        What does it mean to have a mental illness?   
1      2110618                    Who does mental illness affect?   
2      6361820                        What causes mental illness?   
4      7657263            Can people with mental illness recover?   

                                             Answers  
0  Mental illnesses are health conditions that di...  
1  It is estimated that mental illness affects 1 ...  
2  It is estimated that mental illness affects 1 ...  
3  Symptoms of mental health disorders vary depen...  
4  When healing from mental illness, early identi...  


In [3]:
##  Imports
!pip install nltk
!pip install scikit-learn
import nltk
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Punkt for sentence tokenization
nltk.download('punkt_tab')

#WordNet for lemmatization
nltk.download('wordnet')

#Stopwords
nltk.download('stopwords')

# Preprocessing

def preprocess_text(text):
    #Lemmatization
    lematizer = nltk.stem.WordNetLemmatizer()
    #Stemming
    stemmer = nltk.stem.PorterStemmer()
    #Remove special characters
    text = re.sub(r'[^\w\s]', '', text)
    tokens = nltk.word_tokenize(text.lower())
    #Remove stopwords
    tokens = [word for word in tokens if word not in nltk.corpus.stopwords.words('english')]
    tokens = [lematizer.lemmatize(token) for token in tokens]
    tokens = [stemmer.stem(token) for token in tokens]
    return ''.join(tokens)

def preprocess_with_stopwords(text):
#Lemmatization
    lematizer = nltk.stem.WordNetLemmatizer()
    #Stemming
    stemmer = nltk.stem.PorterStemmer()
    #Remove special characters
    text = re.sub(r'[^\w\s]', '', text)
    tokens = nltk.word_tokenize(text.lower())
    #Remove stopwords
    tokens = [lematizer.lemmatize(token) for token in tokens]
    tokens = [stemmer.stem(token) for token in tokens]
    return ''.join(tokens)





[nltk_data] Downloading package punkt_tab to
[nltk_data]     /home/l1ghtzao/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to /home/l1ghtzao/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/l1ghtzao/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [6]:
# Get the questions and answers from the data
questions = list(data['Questions'])
answers = list(data['Answers'])

In [24]:
vectorizer = TfidfVectorizer(tokenizer=nltk.word_tokenize)
X = vectorizer.fit_transform([preprocess_text(q) for q in questions])

def get_response(text):
    processed_text = preprocess_text(text)
    print("processed_text: ", processed_text)
    vectorized_text = vectorizer.transform([processed_text])
    similarities = cosine_similarity(X, vectorized_text)
    print("similarities: ", similarities)
    max_similarity = np.max(similarities)
    print("max_similarity: ", max_similarity)
    if(max_similarity > 0.6):
        high_similarity_questions = [q for q, s in zip(questions, similarities) if s > 0.6]
        print("high_similarity_questions: ", high_similarity_questions)

        target_answers = []
        for q in high_similarity_questions:
            q_index = questions.index(q)
            target_answers.append(answers[q_index])
        print("target_answers: ", target_answers)

        Z = vectorizer.transform([preprocess_with_stopwords(q) for q in high_similarity_questions])
        processed_text_with_stopwords = preprocess_with_stopwords(text)
        print("processed_text_with_stopwords:", processed_text_with_stopwords)
        vectorized_text_with_stopwords = vectorizer.transform([processed_text_with_stopwords])
        final_similarities = cosine_similarity(vectorized_text_with_stopwords, Z)
        closest = np.argmax(final_similarities)
        return target_answers[closest]
    else:
        print("Sorry, I don't understand...")

print(get_response("Who does mental illness affect?"))
print(get_response("worried about my mental health?"))
print(get_response("What do I do if I’m worried about my mental health?"))


processed_text:  mentalillaffect
similarities:  [[0.]
 [1.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]]
max_similarity:  1.0
high_similarity_questions:  ['Who does mental illness affect?']
target_answers:  ['It is estimated that mental illness affects 1 in 5 adults in America, and that 1 in 24 adults have a serious mental illness. Mental illness does not discriminate; it can affect anyone, regardless of gender, age, income, social status, ethnicity, religion, sex

