<h2>Business Understanding</h2>
Before building a model, it is crucial to understand the business need that the solution is meant to address.

<h4>Assess Situation:</h4>
The dataset contains the scores of 120 patients on the 17 essential symptoms psychiatrists use to diagnose the described disorders: Bipolar Disorder Type-1, Bipolar Disorder Type-2, and Major Depressive Disorder. The behavioral symptoms are rated by level: Sadness, Exhaustion, Euphoria, Sleep disorder, Mood swings, Suicidal thoughts, Anorexia, Anxiety, Try-explaining, Nervous breakdown, Ignore & Move-on, Admitting mistakes, Overthinking, Aggressive response, Optimism, Sexual activity, and Concentration.
<br><br>
The Normal category refers to individuals who use therapy time for specialized counseling, personal development, and life-skill enrichment.
While such individuals may also have minor mental health problems, they differ from those suffering from Major Depressive Disorder or Bipolar Disorder.

<h4>Business Objectives:</h4>
Determine the likelihood that a potential patient has a mental disorder, based on their answers for the 17 essential symptoms.
If the performance is satisfactory, the model can be used as a tool in the future.

<h2>Data Understanding</h2>
Examining the data before preparing it for modelling.

In [1]:
#importing the pandas library
import pandas as pd

# Load the dataset
url='Dataset-Mental-Disorders.csv'
df=pd.read_csv(url)

In [2]:
# Display the first 5 rows of the dataset
df.head()

Unnamed: 0,Patient Number,Sadness,Euphoric,Exhausted,Sleep dissorder,Mood Swing,Suicidal thoughts,Anorxia,Authority Respect,Try-Explanation,Aggressive Response,Ignore & Move-On,Nervous Break-down,Admit Mistakes,Overthinking,Sexual Activity,Concentration,Optimisim,Expert Diagnose
0,Patiant-01,Usually,Seldom,Sometimes,Sometimes,YES,YES,NO,NO,YES,NO,NO,YES,YES,YES,3 From 10,3 From 10,4 From 10,Bipolar Type-2
1,Patiant-02,Usually,Seldom,Usually,Sometimes,NO,YES,NO,NO,NO,NO,NO,NO,NO,NO,4 From 10,2 From 10,5 From 10,Depression
2,Patiant-03,Sometimes,Most-Often,Sometimes,Sometimes,YES,NO,NO,NO,YES,YES,NO,YES,YES,NO,6 From 10,5 From 10,7 From 10,Bipolar Type-1
3,Patiant-04,Usually,Seldom,Usually,Most-Often,YES,YES,YES,NO,YES,NO,NO,NO,NO,NO,3 From 10,2 From 10,2 From 10,Bipolar Type-2
4,Patiant-05,Usually,Usually,Sometimes,Sometimes,NO,NO,NO,NO,NO,NO,NO,YES,YES,YES,5 From 10,5 From 10,6 From 10,Normal


In [None]:
# Describe the dataset
df.describe(include = 'all')

In [None]:
# Check for missing values
df.isnull().sum()

# No missing values detected. Awesome!

In [None]:
#delete the first column in the dataframe since it is irrelevant for modelling
df = df.drop(df.columns[0], axis=1)
df.head()

In [None]:
# The unique values for each column of the dataset
for column in df.columns:
    print("Unique values in", column, "are:")
    print(df[column].unique())
    print('\n')

# 'YES' appears twice in the column 'Suicidal thoughts' because one variant has a trailing space ('YES ')
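
The duplicated `YES` noted above comes from trailing whitespace. One way to surface such values in every text column at once, rather than by eye, is sketched below. The small `demo` frame is a made-up stand-in for `df`, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for df: one value carries a trailing space
demo = pd.DataFrame({
    "Suicidal thoughts": ["YES", "YES ", "NO"],
    "Mood Swing": ["YES", "NO", "NO"],
})

# For each column, collect the values that change when whitespace is stripped
padded = {}
for col in demo.columns:
    values = demo[col].astype(str)
    changed = values[values != values.str.strip()]
    if not changed.empty:
        padded[col] = sorted(changed.unique())

print(padded)
```

Columns that do not appear in `padded` contain no values with leading or trailing whitespace.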

In [None]:
# Count of values in Expert Diagnose
df['Expert Diagnose'].value_counts()

<h2>Data Preparation</h2>
Selecting the data that will be used for the modelling phase, cleaning the data where necessary,
converting categorical string inputs to integers, and standardizing the data scales.

In [None]:
#cleaned version of the dataframe
clean_df = df.copy()
clean_df.columns

#Remove spaces from column names
clean_df.columns = clean_df.columns.str.replace(' ', '_')
clean_df.head()


In [None]:
# Column names replaced with the correct spelling for readability
clean_df.columns = clean_df.columns.str.replace('Sleep_dissorder', 'Sleep_disorder')
clean_df.columns = clean_df.columns.str.replace('Anorxia', 'Anorexia')
clean_df.columns = clean_df.columns.str.replace('Try-Explanation', 'Try_explanation')
clean_df.columns = clean_df.columns.str.replace('Ignore_&_Move-On', 'Ignore_MoveOn')
clean_df.columns = clean_df.columns.str.replace('Nervous_Break-down', 'Nervous_Breakdown')
clean_df.columns = clean_df.columns.str.replace('Optimisim', 'Optimism')
clean_df.columns = clean_df.columns.str.replace('Expert_Diagnose', 'Expert_Diagnosis')

In [None]:
#overview of the cleaned dataset
clean_df.describe(include = 'all')

In [None]:
#extra space removed in the column Suicidal_thoughts
clean_df['Suicidal_thoughts'] = clean_df['Suicidal_thoughts'].str.replace('YES ', 'YES')

#unique values for the column Suicidal_thoughts
print("Unique values in Suicidal_thoughts are:", clean_df['Suicidal_thoughts'].unique())

In [None]:
#change the binary features
clean_df.replace(('YES', 'NO'), (1, 0), inplace=True)
clean_df.head()

In [None]:
# Encode the categorical frequency answers on a Likert-style scale
clean_df.replace(('Usually', 'Most-Often','Sometimes','Seldom'), ('3','2','1','0'), inplace=True)
clean_df.head()

In [None]:
#Replaces 'From 10' in the values with nothing
clean_df.replace('From 10', '', regex=True, inplace=True)


In [None]:
#changes the categorical values of Expert Diagnosis to numerical values
clean_df.replace(('Normal','Bipolar Type-1','Bipolar Type-2','Depression'),
                 ('0','1','2','3'), inplace=True)
clean_df.head()
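
The chained `replace` calls above leave the encoded values as strings (for example `'3'` or `'4 '`). A sketch of a more explicit encoding with dictionaries and integer output follows. The frequency order used here (Seldom < Sometimes < Usually < Most-Often) is an assumption about the questionnaire and differs from the coding used above, and `demo` is a made-up sample rather than the real data.

```python
import pandas as pd

# Made-up sample with one column of each kind found in the dataset
demo = pd.DataFrame({
    "Sadness": ["Usually", "Seldom", "Most-Often"],
    "Mood_Swing": ["YES", "NO", "YES"],
    "Optimism": ["4 From 10", "5 From 10", "7 From 10"],
})

# Assumed ordering of the frequency answers (lowest to highest)
frequency_map = {"Seldom": 0, "Sometimes": 1, "Usually": 2, "Most-Often": 3}
binary_map = {"NO": 0, "YES": 1}

encoded = pd.DataFrame({
    "Sadness": demo["Sadness"].map(frequency_map),
    "Mood_Swing": demo["Mood_Swing"].map(binary_map),
    # Keep only the leading number of "N From 10" and convert it to int
    "Optimism": demo["Optimism"].str.extract(r"(\d+)", expand=False).astype(int),
})

print(encoded.dtypes)
print(encoded)
```

Because `map` returns `NaN` for any value missing from the dictionary, unexpected answers show up as missing values instead of passing through silently.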

In [None]:
#shows the distribution in percentages for all columns in a pie chart
import matplotlib.pyplot as plt
for column in clean_df.columns:
    clean_df[column].value_counts().plot.pie(textprops={'fontsize': 12, 'fontweight':'bold', 'color':'white'}, autopct='%1.1f%%', figsize=(6, 6))
    plt.title(label=column, fontsize=15, fontweight='bold', color='black')
    plt.legend(loc='upper right', fontsize=12, title='Values', title_fontproperties ={'weight': 'bold'}, shadow=True, fancybox=True, bbox_to_anchor=(1.2, 1))
    plt.show()


In [None]:
import matplotlib.pyplot as plt

# Sort the column names alphabetically
sorted_columns = clean_df.columns.sort_values()

# Plot a bar chart of the value counts for each column
for column in sorted_columns:
    # Create the figure before plotting so the chart is drawn on it
    plt.figure(figsize=(6, 4))
    clean_df[column].value_counts().sort_index().plot.bar()
    plt.title(column)
    plt.show()


<h2>Data Modelling</h2>
After preparing the data, we split it into train and test sets to evaluate the model's performance.
For the model, Multinomial Naive Bayes (MultinomialNB) will be used.

In [None]:
from sklearn.model_selection import train_test_split

#Split the data into independent variables (X) and dependent variable (y)
X = clean_df.drop('Expert_Diagnosis', axis=1)
y = clean_df['Expert_Diagnosis']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
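
With only 120 rows, a purely random 80/20 split can leave the test set with an uneven mix of diagnoses. A possible variant is to pass `stratify=y` so that each class keeps roughly the same share in both sets. The sketch below uses synthetic data with 30 rows per class (an assumption) instead of the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 120 rows, 17 features, 4 balanced classes (assumed)
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 4, size=(120, 17))
y_demo = np.repeat(["0", "1", "2", "3"], 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Count how many rows of each class ended up in the test set
classes, counts = np.unique(y_te, return_counts=True)
print(dict(zip(classes, counts)))
```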



In [None]:
# Create a Multinomial Naive Bayes model and train it
from sklearn.naive_bayes import MultinomialNB

nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

#predictions
y_pred = nb_model.predict(X_test)

In [None]:
# Accuracy score and Classification report of the model
from sklearn.metrics import accuracy_score, classification_report

#accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy of the model is:", accuracy)

#confusion matrix using heatmap
from sklearn.metrics import confusion_matrix
import seaborn as sns

conf_matrix = confusion_matrix(y_test, y_pred)
heatmap_confusion = sns.heatmap(conf_matrix, annot=True)
heatmap_confusion.set(xlabel="Predicted", ylabel="Actual")

# Classification report
print(classification_report(y_test, y_pred))


<h4>Accuracy evaluation:</h4>
Currently the overall model has an acceptable accuracy of 87.5%, classifying all people with a mental disorder correctly.<br>
However, it misclassifies some people with a 'normal' diagnosis as either Bipolar Type-1 (1 person) or Depression (2 people).
<br><br>
This indicates that the model is somewhat cautious about classifying people as 'normal'. <br>
A misdiagnosis of 'normal' could lead to someone not getting the care they need, so this caution is seen as a good thing.<br>


<h4>Discussion on accuracy score:</h4>
The accuracy score, however, does not take the class distribution of the data into account. In this case, where the diagnoses are evenly distributed, this is less of a problem.<br>
If a representative sample were taken from the population, the distribution would shift, since the prevalence of mental disorders is much lower in real life.<br>
Depressive disorders occur in only 3.4% of the population in a given year, and Bipolar disorder in only 0.5%.<br>
Since this dataset is not representative of the total population, the F1-score should be reported as well if the model continues to be used on a growing dataset.<br>
The weighted-average F1-score, which does take the distribution of the data into account, is currently 86%, which is still very good.<br>
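
To make the point about class distribution concrete, here is a small illustration on made-up labels (not taken from this dataset). A classifier that always predicts the majority class scores a high accuracy on imbalanced data, while the macro-averaged F1-score exposes that the minority class is never found.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: 95 'normal' (0) and 5 'disorder' (1) cases
y_true = [0] * 95 + [1] * 5
# A trivial classifier that always predicts 'normal'
y_always_normal = [0] * 100

acc = accuracy_score(y_true, y_always_normal)
f1_macro = f1_score(y_true, y_always_normal, average="macro", zero_division=0)
f1_weighted = f1_score(y_true, y_always_normal, average="weighted", zero_division=0)

print(f"accuracy:    {acc:.3f}")
print(f"macro F1:    {f1_macro:.3f}")
print(f"weighted F1: {f1_weighted:.3f}")
```

In this example the weighted average stays close to the accuracy because it weights each class by its support. The macro average is the stricter check when the minority class matters most.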


<h4>Next steps:</h4>
With the data currently available, the accuracy of the model and the way its false predictions occur suit the business objective that was set out.<br>
Next, we test some more models to see if the accuracy can be improved further.
<br><br>

1. Optimizing the alpha parameter of the model:
Tuning the alpha parameter, which controls the smoothing in the Naive Bayes model, is worth considering since the dataset is small, with limited counts of each diagnosis.
<br>

2. Combining the disorders into a single 'abnormal' class to compare against the 'normal' diagnosis. Since the diagnosis variable is then imbalanced, the F1-score is used to judge any improvement.
<br>

3. Trying more machine learning models for increased accuracy and F1-scores

In [None]:
# 1. Optimizing the Alpha parameter of the model (using GridSearchCV)

from sklearn.model_selection import GridSearchCV

# Create a dictionary of parameters for alpha to test in Grid Search
param_grid = {
    'alpha': [100, 10, 1, 0.8, 0.6, 0.4, 0.2, 0.1, 0.01, 0.001]
}

# Create a base MultinomialNB model; GridSearchCV sets alpha for each candidate
nb_model2 = MultinomialNB()

# Create a GridSearchCV object
nb_grid = GridSearchCV(estimator=nb_model2, param_grid=param_grid, cv=5)

# Train the model
nb_grid.fit(X_train, y_train)

# Get the best parameters
print("Best parameters are:", nb_grid.best_params_)
print("Best score is:", nb_grid.best_score_)
print("Best estimator is:", nb_grid.best_estimator_)
print("Best index is:", nb_grid.best_index_)
print("Scorer is:", nb_grid.scorer_)
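
`GridSearchCV` scores candidates with the estimator's default `score` method, which is accuracy for classifiers. Because the F1-score is also tracked in this analysis, a variant that optimizes the weighted F1-score on stratified folds might look like the sketch below. The data is synthetic and the shorter alpha grid is only illustrative.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB

# Synthetic stand-in data: non-negative integer features, as MultinomialNB requires
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 4, size=(96, 17))
y_demo = np.repeat(["0", "1", "2", "3"], 24)

search = GridSearchCV(
    estimator=MultinomialNB(),
    param_grid={"alpha": [10, 1, 0.1, 0.01]},
    scoring="f1_weighted",  # optimize weighted F1 instead of accuracy
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X_demo, y_demo)

print("Best alpha:", search.best_params_["alpha"])
print("Best weighted F1:", round(search.best_score_, 3))
```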


In [None]:
# Test the new model with the best parameters
nb_model2 = MultinomialNB(alpha=nb_grid.best_params_['alpha'])
nb_model2.fit(X_train, y_train)

y_pred2 = nb_model2.predict(X_test)

# accuracy of the model
accuracy2 = accuracy_score(y_test, y_pred2)
print("Accuracy of the model is:", accuracy2)

# confusion matrix using heatmap
conf_matrix2 = confusion_matrix(y_test, y_pred2)
heatmap_confusion2 = sns.heatmap(conf_matrix2, annot=True)
heatmap_confusion2.set(xlabel="Predicted", ylabel="Actual")

# Classification report
print(classification_report(y_test, y_pred2))

# Using GridSearchCV to optimize the model did not improve the accuracy of the model. 
# The accuracy of the model is still the same as before, since the default alpha parameter of the model is already the best parameter for the model.

In [None]:
# 2. Combining the disorders into a single 'abnormal' to compare against 'normal' diagnosis.

# Combining disorders into a single 'abnormal' diagnosis
y_train_abnormal = y_train.replace(['1', '2', '3'], '1')
y_test_abnormal = y_test.replace(['1', '2', '3'], '1')

# New Multinomial Naive Bayes model, trained on the combined labels
nb_model_3 = MultinomialNB()
nb_model_3.fit(X_train, y_train_abnormal)

# predictions
y_pred_3 = nb_model_3.predict(X_test)

# accuracy of the model
accuracy_3 = accuracy_score(y_test_abnormal, y_pred_3)
print("Accuracy of the model is:", accuracy_3)

# confusion matrix using heatmap
conf_matrix_3 = confusion_matrix(y_test_abnormal, y_pred_3)
heatmap_confusion_3 = sns.heatmap(conf_matrix_3, annot=True)
heatmap_confusion_3.set(xlabel="Predicted", ylabel="Actual")

# Classification report
print(classification_report(y_test_abnormal, y_pred_3))

# The accuracy of the model is 83%, which is lower than the previous model.
# The weighted average F1-score of the model is 0.81, which is also lower than the previous model.
# The original model is better than the model with combined disorders.

In [None]:
# 3. Trying more machine learning models for increased accuracy and F1-scores

# Import the libraries for the classification models to test
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.svm import SVC


# Create a list of models to test to use in for-loop later
listmodels = [
    ('Logistic Regression', LogisticRegression),
    ('Decision Tree', DecisionTreeClassifier),
    ('Random Forest', RandomForestClassifier),
    ('Support Vector Machine', SVC),
    ('Gradient Boosting', GradientBoostingClassifier),
    ('AdaBoost', AdaBoostClassifier)
]

# Test models and add the insights and the used setting to lists for later use
model_results = []
model_settings = []

# Loop through the models listed above and test them with the original train-test split
# (separate variable names keep the Naive Bayes y_pred and accuracy from being overwritten)
for model_name, model_class in listmodels:
    model = model_class()
    model.fit(X_train, y_train)
    model_pred = model.predict(X_test)
    model_accuracy = accuracy_score(y_test, model_pred)
    model_f1 = classification_report(y_test, model_pred, output_dict=True)['weighted avg']['f1-score']

    # Save the results and settings of the model to the lists created above
    model_results.append((model_name, model_accuracy, model_f1))
    model_settings.append((model_name, model.get_params()))

# Print the results of the models
for model in model_results:
    print(model)

# print the settings of the models
for model in model_settings:
    print(model)
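
A single 24-row test split gives each model only one, fairly noisy, score. A possible refinement is to compare the candidates with k-fold cross-validation and look at the mean and spread of the scores. The sketch below uses synthetic data in which the features are noisy copies of the label (an assumption made purely so the models have something to learn), and `max_iter=1000` is an illustrative setting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 120 rows, 17 features, 4 classes
rng = np.random.default_rng(1)
y_demo = np.repeat([0, 1, 2, 3], 30)
X_demo = y_demo[:, None] + rng.normal(0, 1.0, size=(120, 17))

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Collect mean and standard deviation of the 5-fold accuracy per model
cv_summary = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```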


In [None]:
# compare the models results in a bar chart based on accuracy and F1-scores

# Create a DataFrame of the model results to use in the bar chart
model_results_df = pd.DataFrame(model_results, columns=['Model', 'Accuracy', 'F1-Score'])
model_results_df.set_index('Model', inplace=True)
model_results_df.plot(kind='bar')

# Add labels, titles and other styling for aesthetics
plt.title(label='Comparison of Classification Models', fontsize=15, fontweight='bold', color='black')
plt.xlabel('Classification Models', fontsize=12, fontweight='bold', color='black')
plt.ylabel('Scores', fontsize=12, fontweight='bold', color='black')
plt.legend(loc='upper right', fontsize=12, title='Scores', title_fontproperties ={'weight': 'bold'}, shadow=True, fancybox=True, bbox_to_anchor=(1.3, 1))

# Show the bar chart
plt.show()

In [None]:
# Create a new logistic regression model; the settings printed in the model comparison were all scikit-learn defaults
best_model = LogisticRegression()

# train the new model and make predictions
best_model.fit(X_train, y_train)
y_pred_lr = best_model.predict(X_test)

# the accuracy of the logistic regression model and confusion matrix using heatmap
accuracy_lr = accuracy_score(y_test, y_pred_lr)
print("Accuracy of the model is:", accuracy_lr)
print(classification_report(y_test, y_pred_lr))

conf_matrix_lr = confusion_matrix(y_test, y_pred_lr)
heatmap_confusion_lr = sns.heatmap(conf_matrix_lr, annot=True)
heatmap_confusion_lr.set(xlabel="Predicted", ylabel="Actual")

In [None]:
# Compare the original Naive Bayes model with the Logistic Regression model

# Recompute the Naive Bayes predictions so that later cells cannot have overwritten them
y_pred_nb = nb_model.predict(X_test)
accuracy_nb = accuracy_score(y_test, y_pred_nb)

# List of the two models to compare: the original MultinomialNB and the new Logistic Regression
model_results = [
    ('MultinomialNB', accuracy_nb, classification_report(y_test, y_pred_nb, output_dict=True)['weighted avg']['f1-score']),
    ('Logistic Regression', accuracy_lr, classification_report(y_test, y_pred_lr, output_dict=True)['weighted avg']['f1-score'])
]

# compare the models and put the results in a bar chart
model_results_df = pd.DataFrame(model_results, columns=['Model', 'Accuracy', 'F1-Score'])
model_results_df.set_index('Model', inplace=True)
model_results_df.plot(kind='bar')

# Labels and other aesthetic styling
plt.title(label='Comparison of Classification Models', fontsize=15, fontweight='bold', color='black')
plt.xlabel('Classification Models', fontsize=12, fontweight='bold', color='black')
plt.ylabel('Scores', fontsize=12, fontweight='bold', color='black')
plt.legend(loc='upper right', fontsize=12, title='Scores', title_fontproperties ={'weight': 'bold'}, shadow=True, fancybox=True, bbox_to_anchor=(1.3, 1))

plt.show()


Interpretation of the comparison, based on the results:<br>

Accuracy Score:<br>
The Naive Bayes model achieved an accuracy of 87.5%, while the Logistic Regression model achieved a higher accuracy of 91.67%.<br>
This indicates that the Logistic Regression model performed better in terms of correctly classifying the instances.
<br><br>
F1-score:<br>
The Naive Bayes model had an F1-score of 86%, while the Logistic Regression model had a higher F1-score of 91.44%. <br>
This suggests that the Logistic Regression model performed better in terms of overall model performance.
<br><br>
Based on these results, it can be concluded that the Logistic Regression model outperformed the Naive Bayes model in terms of accuracy, F1-score, and overall predictive performance.<br>
Therefore, the Logistic Regression model would be the preferred model.
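
It helps to translate these percentages back into counts. Assuming the 24-row test set implied by `test_size=0.2` on 120 patients, the two accuracies differ by a single prediction. The sketch below also computes a rough 95% Wilson score interval for each accuracy. The interval is a standard approximation added here for illustration and is not part of the original analysis.

```python
import math

n_test = 24  # 20% of 120 patients


def wilson_interval(correct, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = correct / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


intervals = {}
for name, correct in [("Naive Bayes", 21), ("Logistic Regression", 22)]:
    low, high = wilson_interval(correct, n_test)
    intervals[name] = (low, high)
    print(f"{name}: {correct}/{n_test} = {correct / n_test:.4f}, "
          f"95% interval roughly {low:.2f} to {high:.2f}")
```

The two intervals overlap, so the ranking of the models could change with a different split or with more data.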

TO DO: ADD WEIYING'S MODEL TO THIS FILE AND COMPARE THIS WITH THE CHART ABOVE<br>
TO DO: REVIEW EACH OTHER'S NOTEBOOKS<br>
TO DO: ADD WEIYING'S MODEL TO DISCUSSION ABOVE<br>

FINAL TODO: SUGGESTIONS FOR NEXT STEPS

<h2>Suggested next steps</h2>

Bart's Suggestions
1. Implementation/deployment of the model<br>
A use case where the model uses the responses from sign-up forms filled in by people who contact a psychiatrist. Based on the responses to the 17 evaluation questions that are used for predictions, the psychiatrist will have an idea of the probable mental disorder of the patient, if any. This is, of course, only if the accuracy of the model is satisfactory to the psychiatrist.<br>
2. Gathering more data<br>
Since the current dataset is very limited, more data would help with building a more robust model for the prediction of mental disorders. This can be done, for example, by manually entering data from new patients into the system.<br>
3. Experiment with new classification models<br>
The current Logistic Regression model is only one of several classification models, and it is the one that performed best in the comparison. However, this holds only for the current situation. As the dataset evolves, trying different models regularly is advised.
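
For the deployment idea in the first suggestion, the psychiatrist would likely want a probability per diagnosis rather than a single label. A minimal sketch with `predict_proba` follows. The training data is synthetic, the label order is assumed to match the numeric encoding used earlier (0 = Normal, 1 = Bipolar Type-1, 2 = Bipolar Type-2, 3 = Depression), and `new_patient` is a hypothetical, already encoded sign-up form.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

labels = ["Normal", "Bipolar Type-1", "Bipolar Type-2", "Depression"]

# Synthetic stand-in for the encoded training data (17 features, 4 classes)
rng = np.random.default_rng(7)
y_demo = np.repeat(np.arange(4), 30)
X_demo = y_demo[:, None] + rng.normal(0, 1.0, size=(120, 17))

model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# One hypothetical new sign-up form, already encoded as 17 numbers
new_patient = X_demo[:1]
probabilities = model.predict_proba(new_patient)[0]

# Report the estimated probability for each diagnosis
for label, p in zip(labels, probabilities):
    print(f"{label}: {p:.2f}")
```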