This is the Jupyter notebook for team F. We are using a comprehensive dataset of 1500 patients diagnosed with Obsessive Compulsive Disorder (OCD). This dataset is available on Kaggle.  
  
Reference:   
Haque O, Alamgir Z. OCD Patient Dataset: Demographics and Clinical Data [Internet]. 2023 [cited 2024-09-08]. Available from: https://www.kaggle.com/datasets/ohinhaque/ocd-patient-dataset-demographics-and-clinical-data/data

The dataset looks as follows:

In [30]:
# read dataset
import pandas as pd
df = pd.read_csv("ocd_patient_dataset.csv")
df.head()

Unnamed: 0,Patient ID,Age,Gender,Ethnicity,Marital Status,Education Level,OCD Diagnosis Date,Duration of Symptoms (months),Previous Diagnoses,Family History of OCD,Obsession Type,Compulsion Type,Y-BOCS Score (Obsessions),Y-BOCS Score (Compulsions),Depression Diagnosis,Anxiety Diagnosis,Medications
0,1018,32,Female,African,Single,Some College,2016-07-15,203,MDD,No,Harm-related,Checking,17,10,Yes,Yes,SNRI
1,2406,69,Male,African,Divorced,Some College,2017-04-28,180,,Yes,Harm-related,Washing,21,25,Yes,Yes,SSRI
2,1188,57,Male,Hispanic,Divorced,College Degree,2018-02-02,173,MDD,No,Contamination,Checking,3,4,No,No,Benzodiazepine
3,6200,27,Female,Hispanic,Married,College Degree,2014-08-25,126,PTSD,Yes,Symmetry,Washing,14,28,Yes,Yes,SSRI
4,5824,56,Female,Hispanic,Married,High School,2022-02-20,168,PTSD,Yes,Hoarding,Ordering,39,18,No,No,


The dataset includes demographic information such as patient ID, age, gender, ethnicity, marital status and education level, along with clinical details such as OCD diagnosis date, symptom duration, previous psychiatric diagnoses, and family history of OCD. Symptom severity is assessed with the Yale-Brown Obsessive Compulsive Scale (Y-BOCS), scored separately for obsessions and compulsions, and the obsession and compulsion types are recorded. The dataset also records other mental health conditions such as anxiety and depression, and the medications prescribed to the patients.  
  
Reference:  
Goodman WK, Lawrence HP, Rasmussen SA, Mazure C, Fleischmann RL, Hill CL, et al. The Yale-Brown Obsessive Compulsive Scale. Arch Gen Psychiatry. 1989;46:1006-11

Over the next few sections, the following tasks have been carried out:  
1. The variable datatypes have been checked  
2. The variables that had "Yes" or "No" answers were converted to Boolean  
3. The missing values have been taken care of.  


In [31]:
# display variable datatypes
df.dtypes

Patient ID                        int64
Age                               int64
Gender                           object
Ethnicity                        object
Marital Status                   object
Education Level                  object
OCD Diagnosis Date               object
Duration of Symptoms (months)     int64
Previous Diagnoses               object
Family History of OCD            object
Obsession Type                   object
Compulsion Type                  object
Y-BOCS Score (Obsessions)         int64
Y-BOCS Score (Compulsions)        int64
Depression Diagnosis             object
Anxiety Diagnosis                object
Medications                      object
dtype: object

In [32]:
# convert to boolean
df = pd.read_csv("ocd_patient_dataset.csv", true_values =["Yes"], false_values=["No"])
df.head()
df.dtypes

Patient ID                        int64
Age                               int64
Gender                           object
Ethnicity                        object
Marital Status                   object
Education Level                  object
OCD Diagnosis Date               object
Duration of Symptoms (months)     int64
Previous Diagnoses               object
Family History of OCD              bool
Obsession Type                   object
Compulsion Type                  object
Y-BOCS Score (Obsessions)         int64
Y-BOCS Score (Compulsions)        int64
Depression Diagnosis               bool
Anxiety Diagnosis                  bool
Medications                      object
dtype: object
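
As the output above shows, only the columns whose values are all "Yes" or "No" end up as bool, while the other text columns keep their original dtype. A minimal sketch of that behaviour on a hypothetical three-row extract (the rows are made up for illustration; only the column names come from the dataset):

```python
import io
import pandas as pd

# Hypothetical three-row extract; column names follow the dataset above.
toy_csv = io.StringIO(
    "Gender,Family History of OCD,Depression Diagnosis\n"
    "Female,No,Yes\n"
    "Male,Yes,Yes\n"
    "Male,Yes,No\n"
)

# Columns made up entirely of "Yes"/"No" are parsed as booleans;
# "Gender" contains other strings and is left as text.
toy = pd.read_csv(toy_csv, true_values=["Yes"], false_values=["No"])
print(toy.dtypes)
```

Re-reading the file this way converts all Yes/No columns in one step, without a separate mapping per column.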

In [33]:
# missing values
df.isnull().sum()

Patient ID                         0
Age                                0
Gender                             0
Ethnicity                          0
Marital Status                     0
Education Level                    0
OCD Diagnosis Date                 0
Duration of Symptoms (months)      0
Previous Diagnoses               248
Family History of OCD              0
Obsession Type                     0
Compulsion Type                    0
Y-BOCS Score (Obsessions)          0
Y-BOCS Score (Compulsions)         0
Depression Diagnosis               0
Anxiety Diagnosis                  0
Medications                      386
dtype: int64

In [34]:
# 'Previous Diagnoses' and 'Medications' have missing values, however we have assumed that it implies that the patient does not have a diagnosis or is not taking any medication respectively.
# Therefore filling missing values with the word None
df['Previous Diagnoses'] = df['Previous Diagnoses'].fillna('None')
df['Medications'] = df['Medications'].fillna('None')
df.isnull().sum()

Patient ID                       0
Age                              0
Gender                           0
Ethnicity                        0
Marital Status                   0
Education Level                  0
OCD Diagnosis Date               0
Duration of Symptoms (months)    0
Previous Diagnoses               0
Family History of OCD            0
Obsession Type                   0
Compulsion Type                  0
Y-BOCS Score (Obsessions)        0
Y-BOCS Score (Compulsions)       0
Depression Diagnosis             0
Anxiety Diagnosis                0
Medications                      0
dtype: int64

The Y-BOCS is a standardized clinical assessment tool designed to evaluate the severity and type of symptoms in individuals with OCD. It has 10 items, 5 grading obsessions and 5 grading compulsions, each scored from 0 to 4. Scores for obsessions and compulsions therefore range from 0 to 20 each, and the total score ranges from 0 to 40. The dataset does not include the total score, so we calculate it below by adding Y-BOCS Score (Obsessions) and Y-BOCS Score (Compulsions).
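
The score ranges follow directly from the item structure. The sketch below spells out the arithmetic and wraps it in a small range check; `is_valid_ybocs` is a hypothetical helper introduced only for illustration and is not part of the notebook's pipeline.

```python
# Y-BOCS structure as described above: 5 items per subscale, each scored 0-4.
ITEMS_PER_SUBSCALE = 5
MAX_ITEM_SCORE = 4

SUBSCALE_MAX = ITEMS_PER_SUBSCALE * MAX_ITEM_SCORE  # 5 * 4 = 20
TOTAL_MAX = 2 * SUBSCALE_MAX                        # 2 * 20 = 40


def is_valid_ybocs(obsessions: int, compulsions: int) -> bool:
    """Hypothetical helper: True when both subscale scores lie in 0..20."""
    return 0 <= obsessions <= SUBSCALE_MAX and 0 <= compulsions <= SUBSCALE_MAX


# Scores taken from the first two rows of the dataset preview.
print(is_valid_ybocs(17, 10))
print(is_valid_ybocs(21, 25))
```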

In [35]:
# calculate total score
df['Total_Score'] = df['Y-BOCS Score (Obsessions)'] + df['Y-BOCS Score (Compulsions)']

We found instances in our dataset with a total score above 40, or with individual Obsessions or Compulsions scores above 20, which is not possible under the scoring system of the scale. Hence, we have excluded those rows.

In [36]:
# count above 40 for Total_Score
count_above_40 = (df['Total_Score'] > 40).sum()
print(f'Number of Total_Score values above 40: {count_above_40}')

# count above 20 for Y-BOCS Score (Obsessions)
count_above_20_O = (df['Y-BOCS Score (Obsessions)'] > 20).sum()
print(f'Number of Y-BOCS Score (Obsessions) values above 20: {count_above_20_O}')

# count above 20 for Y-BOCS Score (Compulsions)
count_above_20_C = (df['Y-BOCS Score (Compulsions)'] > 20).sum()
print(f'Number of Y-BOCS Score (Compulsions) values above 20: {count_above_20_C}')



Number of Total_Score values above 40: 720
Number of Y-BOCS Score (Obsessions) values above 20: 717
Number of Y-BOCS Score (Compulsions) values above 20: 722


In [37]:
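# creating a dataframe with valid scores
# .copy() makes filtered_df an independent dataframe, so adding columns
# to it later does not raise SettingWithCopyWarning
filtered_df = df[
    (df['Y-BOCS Score (Obsessions)'] <= 20) &
    (df['Y-BOCS Score (Compulsions)'] <= 20) &
    (df['Total_Score'] <= 40)
].copy()
filtered_df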

Unnamed: 0,Patient ID,Age,Gender,Ethnicity,Marital Status,Education Level,OCD Diagnosis Date,Duration of Symptoms (months),Previous Diagnoses,Family History of OCD,Obsession Type,Compulsion Type,Y-BOCS Score (Obsessions),Y-BOCS Score (Compulsions),Depression Diagnosis,Anxiety Diagnosis,Medications,Total_Score
0,1018,32,Female,African,Single,Some College,2016-07-15,203,MDD,False,Harm-related,Checking,17,10,True,True,SNRI,27
2,1188,57,Male,Hispanic,Divorced,College Degree,2018-02-02,173,MDD,False,Contamination,Checking,3,4,False,False,Benzodiazepine,7
6,9861,38,Female,Hispanic,Single,College Degree,2017-03-13,110,MDD,False,Contamination,Praying,12,16,True,False,SNRI,28
11,7905,73,Female,Hispanic,Divorced,High School,2017-01-13,233,GAD,False,Religious,Counting,4,16,True,True,Benzodiazepine,20
19,2637,66,Female,Asian,Divorced,College Degree,2018-08-14,73,Panic Disorder,False,Harm-related,Washing,0,12,False,True,SNRI,12
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1484,9561,24,Female,Asian,Married,Some College,2018-01-08,95,Panic Disorder,True,Contamination,Praying,15,9,False,True,,24
1485,3419,62,Female,African,Single,College Degree,2020-12-11,162,Panic Disorder,False,Hoarding,Checking,20,15,False,True,SNRI,35
1493,1819,58,Female,Hispanic,Divorced,Some College,2016-07-07,22,,True,Contamination,Praying,10,1,False,True,SNRI,11
1497,6089,40,Male,Asian,Married,Some College,2018-03-13,100,,True,Contamination,Counting,2,15,True,True,Benzodiazepine,17
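
Because each subscale is capped at 20, a row that passes both subscale checks cannot have a total above 40, so the third condition in the filter is implied by the first two. The sketch below illustrates this on a toy frame built from the score pairs visible in the dataset preview; the frame itself is an assumption made for illustration.

```python
import pandas as pd

# Score pairs taken from the five preview rows shown earlier.
toy = pd.DataFrame({
    "Y-BOCS Score (Obsessions)": [17, 21, 3, 14, 39],
    "Y-BOCS Score (Compulsions)": [10, 25, 4, 28, 18],
})
toy["Total_Score"] = (
    toy["Y-BOCS Score (Obsessions)"] + toy["Y-BOCS Score (Compulsions)"]
)

# Mask using only the two subscale limits.
subscales_ok = (
    (toy["Y-BOCS Score (Obsessions)"] <= 20)
    & (toy["Y-BOCS Score (Compulsions)"] <= 20)
)
# Mask that also checks the total, as in the notebook's filter.
all_three = subscales_ok & (toy["Total_Score"] <= 40)

# Both masks select exactly the same rows.
same_rows = subscales_ok.equals(all_three)

# .copy() returns an independent frame that is safe to modify later.
valid = toy[subscales_ok].copy()
print(same_rows, len(valid))
```

Keeping the explicit total check does no harm and documents the intent, so it can stay in the filter.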


We were interested in predicting the severity of OCD, given by "Total_Score". Based on a literature review, we found that age, duration of symptoms, family history of OCD, and a diagnosis of anxiety or depression affect it. Gender, being a basic demographic feature, has been retained even though research does not clearly indicate its influence on OCD. Obsession Type and Compulsion Type reflect the patient's symptoms, which may affect the severity of the disorder, and have therefore been retained.  

References:  
1. Mathes BM, Morabito DM, Schmidt NB. Epidemiological and Clinical Gender Differences in OCD. Curr Psychiatry Rep. 2019 Apr 23;21(5):36. doi: 10.1007/s11920-019-1015-2.
2. Riddle DB, Guzick A, Minhajuddin A, Smárason O, Armstrong GM, Slater H, et al. Obsessive-compulsive disorder in youth and young adults with depression: Clinical characteristics of comorbid presentations. J Obsessive Compuls Relat Disord. 2023 Jul;38:100820. doi: 10.1016/j.jocrd.2023.100820. 
3. Zheng H, Zhang Z, Huang C, Luo G. Medical status of outpatients with obsessive-compulsive disorder in psychiatric department and its influencing factors. Zhong Nan Da Xue Xue Bao Yi Xue Ban. 2022 Oct 28;47(10):1418-1424. English, Chinese. doi: 10.11817/j.issn.1672-7347.2022.220125. 
4. Mahjani B, Bey K, Boberg J, Burton C. Genetics of obsessive-compulsive disorder. Psychol Med. 2021 Oct;51(13):2247-2259. doi: 10.1017/S0033291721001744. Epub 2021 May 25. PMID: 34030745; PMCID: PMC8477226.

With the goal of predicting OCD severity, it was necessary to label the Total_Score for a classification task. We examined the distribution of scores across bins of width 5 from 0 to 40 in order to define score categories.

In [38]:
# Define bin edges 0, 5, ..., 40 (bins of width 5)
bins = range(0, 45, 5)  # With right=False this gives [0, 5), [5, 10), ..., [35, 40)

# Use pd.cut to categorize the Total_Score of the filtered rows into bins
filtered_df['Score_Bins'] = pd.cut(filtered_df['Total_Score'], bins=bins, right=False)

# Calculate the count for each bin
bin_counts = filtered_df['Score_Bins'].value_counts().sort_index()

# Display the bin counts
print(bin_counts)

Score_Bins
[0, 5)       11
[5, 10)      50
[10, 15)     61
[15, 20)     70
[20, 25)    102
[25, 30)     61
[30, 35)     35
[35, 40)     28
Name: count, dtype: int64
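
With `right=False` the bins are closed on the left and open on the right, so a total score of exactly 40 falls outside the last bin and is left unbinned. The bin counts above sum to 418, while the gender counts printed later in this notebook sum to 419, which would be consistent with one row scoring exactly 40. The sketch below, on made-up scores, compares left-closed bins with right-closed bins that also include the lowest edge; the right-closed variant is shown as one possible alternative, not as the binning used here.

```python
import pandas as pd

# Made-up scores chosen to sit on the bin edges.
scores = pd.Series([0, 5, 20, 39, 40])
edges = list(range(0, 45, 5))  # 0, 5, ..., 40

# Left-closed bins [0, 5), ..., [35, 40): a score of 40 has no bin.
left_closed = pd.cut(scores, bins=edges, right=False)

# Right-closed bins, with the lowest edge included so that 0 is kept.
right_closed = pd.cut(scores, bins=edges, right=True, include_lowest=True)

print(left_closed.isna().tolist())
print(right_closed.isna().tolist())
```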







Based on the distribution above, we have decided to categorise Total_Score as follows:  
0-20: Low
21-40: High

In [39]:
# Categorize Total_Score into 'Low' and 'High' based on the given range
filtered_df['Score_Category'] = pd.cut(filtered_df['Total_Score'], 
                              bins=[0, 20, 40], 
                              labels=['Low', 'High'], 
                              include_lowest=True)

# View a concise dataframe based on the feature selection
# .copy() keeps the selection independent, so later column assignments do not raise SettingWithCopyWarning
filtered_df = filtered_df[["Age", "Gender", "Family History of OCD", "Duration of Symptoms (months)", "Obsession Type", "Compulsion Type", "Total_Score", "Depression Diagnosis", "Anxiety Diagnosis", "Score_Category"]].copy()
filtered_df.head()






Unnamed: 0,Age,Gender,Family History of OCD,Duration of Symptoms (months),Obsession Type,Compulsion Type,Total_Score,Depression Diagnosis,Anxiety Diagnosis,Score_Category
0,32,Female,False,203,Harm-related,Checking,27,True,True,High
2,57,Male,False,173,Contamination,Checking,7,False,False,Low
6,38,Female,False,110,Contamination,Praying,28,True,False,High
11,73,Female,False,233,Religious,Counting,20,True,True,Low
19,66,Female,False,73,Harm-related,Washing,12,False,True,Low


### Descriptive Analytics  
In this section, we have tried to analyse if the literature review results hold true for our dataset.

In [40]:
import plotly.express as px
import plotly.graph_objects as go

We evaluated whether our data is evenly distributed across age and gender, which are important demographic features, and across Score_Category, which is the prediction target.

In [41]:
# Checking if the data is equally distributed for age and gender
# Creating age groups (bins)
filtered_df['Age Group'] = pd.cut(filtered_df['Age'], bins=[0, 18, 30, 50, 70, 100], 
                         labels=['0-18', '19-30', '31-50', '51-70', '71+'])

# Count the number of males and females in each age group
gender_age_group = filtered_df.groupby(['Age Group', 'Gender']).size().unstack().fillna(0)

# Create a figure
fig0 = go.Figure()

# Add bar plot for males
fig0.add_trace(
    go.Bar(
        x=gender_age_group.index,
        y=gender_age_group['Male'],
        name='Male',
        marker_color='blue'
    )
)

# Add bar plot for females
fig0.add_trace(
    go.Bar(
        x=gender_age_group.index,
        y=gender_age_group['Female'],
        name='Female',
        marker_color='pink'
    )
)

# Update layout to create stacked bars
fig0.update_layout(
    barmode='stack',  # Stacked bar mode
    title='Number of People by Age Group and Gender',
    xaxis_title='Age Group',
    yaxis_title='Number of People',
    legend_title='Gender',
    xaxis_tickangle=0,
    yaxis=dict(showgrid=True)
)

# Show the plot
fig0.show()








In [42]:
# Count the instances for each Score_Category
score_category_counts = filtered_df['Score_Category'].value_counts()

# Create a bar plot using Plotly to visualize the counts of 'Low' and 'High'
fig00 = px.bar(x=score_category_counts.index, 
             y=score_category_counts.values, 
             labels={'x': 'Score Category', 'y': 'Count of Instances'}, 
             title='Count of Instances in Low and High Score Categories',
             text=score_category_counts.values)  # Display counts on bars

# Customize the layout
fig00.update_layout(xaxis_title='Score Category', 
                  yaxis_title='Number of Instances',
                  showlegend=False)

# Show the figure
fig00.show()

1. What is the distribution of OCD severity (Total_Score) across different age groups?  
This question helps to determine if certain age groups tend to experience more severe OCD symptoms. It might reveal whether younger or older people are more affected.  
The age has been split into age groups and we have tried to evaluate how our dataset is represented across the various age bins.

In [43]:
# Box plot to show distribution
fig = px.box(filtered_df, x='Age Group', y='Total_Score', title='Distribution of OCD Severity Across Age Groups',
             labels={'Age Group': 'Age Group', 'Total_Score': 'OCD Severity (Y-BOCS Total Score)'},
             category_orders={'Age Group': ['0-18', '19-30', '31-50', '51-70', '71+']})  # Ensure proper order

fig.show()

Below, we evaluate the correlation between Age and Total_Score.

In [44]:
# Calculate the Pearson correlation coefficient
correlation_coefficient = filtered_df['Age'].corr(filtered_df['Total_Score'])
print(f'Pearson correlation coefficient between Age and Total_Score: {correlation_coefficient:.2f}')

# Create a scatter plot to show the relationship
fig1 = px.scatter(filtered_df, x='Age', y='Total_Score', 
                 title='Scatter Plot of Age vs Total Score',
                 labels={'Age': 'Age', 'Total_Score': 'OCD Severity (Y-BOCS Total Score)'},
                 trendline='ols')  # Add a trendline to visualize the linear relationship

# Show the figure
fig1.show()

Pearson correlation coefficient between Age and Total_Score: 0.07


2. Is there a significant difference in OCD severity based on gender?  
Literature does not clearly indicate a relation between gender and OCD. Gender-based analysis can uncover any significant differences between males and females in terms of OCD severity. This could help in targeting gender-specific interventions.

In [45]:
# checking for gender distribution in the dataset
gender_counts = filtered_df['Gender'].value_counts()
print(gender_counts)

fig2 = px.bar(gender_counts, 
             x=gender_counts.index, 
             y=gender_counts.values, 
             title='Gender Distribution',
             labels={'x': 'Gender', 'y': 'Count'},
             color=gender_counts.index)

# Show the figure
fig2.show()

Gender
Female    224
Male      195
Name: count, dtype: int64


In [46]:
# Box plot for OCD severity (Total_Score) by Gender
# Use filtered_df so that rows with invalid Y-BOCS scores are excluded
fig3 = px.box(filtered_df, x='Gender', y='Total_Score', title='Distribution of OCD Severity by Gender',
              labels={'Total_Score': 'OCD Severity (Y-BOCS Total Score)', 'Gender': 'Gender'})
fig3.show()
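
The box plot gives a visual comparison only. To address whether the difference is significant, one option is a permutation test on the difference in group means. The sketch below is an illustrative assumption rather than the analysis performed in this notebook: `permutation_test_mean_diff` is a hypothetical helper, and the scores are synthetic values generated only to make the example runnable.

```python
import numpy as np


def permutation_test_mean_diff(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_permutations):
        # Reassign group labels at random and recompute the difference.
        shuffled = rng.permutation(pooled)
        diff = shuffled[: len(a)].mean() - shuffled[len(a):].mean()
        if abs(diff) >= abs(observed):
            hits += 1
    # The add-one correction keeps the p-value strictly positive.
    p_value = (hits + 1) / (n_permutations + 1)
    return observed, p_value


# Synthetic scores for illustration only; NOT taken from the dataset.
rng = np.random.default_rng(42)
female_scores = rng.integers(0, 41, size=224)
male_scores = rng.integers(0, 41, size=195)

diff, p = permutation_test_mean_diff(female_scores, male_scores, n_permutations=2000)
print(f"mean difference = {diff:.2f}, p = {p:.3f}")
```

With the real data, the two inputs would be the `Total_Score` values of `filtered_df` for each gender, for example `filtered_df.loc[filtered_df['Gender'] == 'Female', 'Total_Score']`.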

3. How does family history of OCD influence the Total_Score?  
We try to examine whether individuals with a family history of OCD are more likely to have severe symptoms, potentially highlighting genetic or environmental influences.

In [47]:
import plotly.io as pio

In [48]:
# Count the number of people with and without family history of OCD
family_history_counts = filtered_df['Family History of OCD'].value_counts()

# Create a bar plot for the number of people with and without family history of OCD
bar_trace = go.Bar(
    x=family_history_counts.index, 
    y=family_history_counts.values, 
    name='Number of People',
    yaxis='y1',
    marker_color='lightblue'
)

# Create a box plot for the distribution of Total_Score for each category
box_trace = go.Box(
    x=filtered_df['Family History of OCD'], 
    y=filtered_df['Total_Score'], 
    name='Distribution of Total Score',
    yaxis='y2',
    marker_color='orange'
)

# Combine the bar and box plot using secondary y-axes
fig4 = go.Figure(data=[bar_trace, box_trace])

# Update layout for dual y-axes
fig4.update_layout(
    title='Number of People with Family History of OCD and Distribution of Their Scores',
    xaxis_title='Family History of OCD',
    yaxis=dict(
        title='Number of People',
        showgrid=False
    ),
    yaxis2=dict(
        title='Distribution of Total Score',
        overlaying='y',  # Overlay on the same plot
        side='right'
    ),
    legend=dict(x=0.1, y=1.1)
)

# Show the figure
fig4.show()

4. What is the correlation between the duration of symptoms and OCD severity?

In [49]:
# Calculate the Pearson correlation coefficient
correlation_coefficient = filtered_df['Duration of Symptoms (months)'].corr(filtered_df['Total_Score'])
print(f'Pearson correlation coefficient between Duration of symptoms and Total_Score: {correlation_coefficient:.2f}')

# Create a scatterplot
# Use filtered_df so that the plot matches the correlation computed above
fig5 = px.scatter(filtered_df, x='Duration of Symptoms (months)', y='Total_Score', 
                  title='Relationship between Duration of Symptoms and OCD Severity',
                  labels={'Duration of Symptoms (months)': 'Duration of Symptoms (months)', 'Total_Score': 'OCD Severity (Y-BOCS Total Score)'}, 
                  trendline='ols')
fig5.show()

Pearson correlation coefficient between Duration of symptoms and Total_Score: 0.03


5. How does diagnosis of Anxiety or Depression affect Total_Score?

In [50]:
# Add a 'Condition' column for easy visualization
filtered_df['Condition'] = filtered_df.apply(lambda row: 
                            'Anxiety T, Depression T' if row['Anxiety Diagnosis'] and row['Depression Diagnosis'] else
                            'Anxiety T, Depression F' if row['Anxiety Diagnosis'] and not row['Depression Diagnosis'] else
                            'Anxiety F, Depression T' if not row['Anxiety Diagnosis'] and row['Depression Diagnosis'] else
                            'Anxiety F, Depression F', axis=1)

# Create a boxplot to visualize the total score distribution across different conditions using Plotly
fig6 = px.box(filtered_df, 
             x='Condition', 
             y='Total_Score', 
             title='Total Score Distribution by Anxiety and Depression Conditions',
             labels={'Condition': 'Anxiety and Depression Conditions', 'Total_Score': 'Total Score'},
             category_orders={'Condition': ['Anxiety T, Depression T', 'Anxiety T, Depression F', 
                                            'Anxiety F, Depression T', 'Anxiety F, Depression F']}
            )

# Update layout to rotate the x-axis labels and add gridlines
fig6.update_layout(
    xaxis_title='Condition',
    yaxis_title='Total Score',
    xaxis_tickangle=45,
    yaxis=dict(showgrid=True)
)

# Show the plot
fig6.show()
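
The row-wise `apply` above is easy to read but evaluates a Python function once per row. A vectorised alternative is `numpy.select`, sketched below on a hypothetical four-row frame covering each combination. It assumes the two diagnosis columns are still boolean, as they are at this point in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one row per combination of the two diagnoses.
toy = pd.DataFrame({
    "Anxiety Diagnosis": [True, True, False, False],
    "Depression Diagnosis": [True, False, True, False],
})

anx = toy["Anxiety Diagnosis"]
dep = toy["Depression Diagnosis"]

# Conditions are checked in order; rows matching none get the default.
# Note: ~ is a logical NOT only for boolean columns, not for 0/1 integers.
toy["Condition"] = np.select(
    [anx & dep, anx & ~dep, ~anx & dep],
    ["Anxiety T, Depression T", "Anxiety T, Depression F", "Anxiety F, Depression T"],
    default="Anxiety F, Depression F",
)
print(toy["Condition"].tolist())
```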






6. Are the Obsession types represented equally in this dataset? What is the average Total_Score for each type?

In [51]:
# Count the number of instances for each obsession type
obsession_counts = filtered_df['Obsession Type'].value_counts()

# Calculate the average Total Score for each obsession type
obsession_avg_scores = filtered_df.groupby('Obsession Type')['Total_Score'].mean()

# Create a figure
fig7 = go.Figure()

# Add bar plot for the number of instances (left y-axis)
fig7.add_trace(
    go.Bar(
        x=obsession_counts.index,
        y=obsession_counts.values,
        name='Number of Instances',
        marker_color='lightblue',
        text=obsession_counts.values,  # Show count on the bars
        textposition='auto'
    )
)

# Add line plot for the average total score (right y-axis)
fig7.add_trace(
    go.Scatter(
        x=obsession_avg_scores.index,
        y=obsession_avg_scores.values,
        name='Average Total Score',
        yaxis='y2',  # Associate this with the second y-axis
        mode='lines+markers',
        line=dict(color='orange', width=3),
        marker=dict(size=8),
        text=obsession_avg_scores.values,  # Show average scores on the points
        textposition='top center'
    )
)

# Update layout to include a second y-axis
fig7.update_layout(
    title='Number of Instances and Average Total Score by Obsession Type',
    xaxis_title='Obsession Type',
    yaxis_title='Number of Instances',
    yaxis2=dict(
        title='Average Total Score',
        overlaying='y',  # Overlay y-axis 2 on the same plot as y-axis 1
        side='right'
    ),
    legend=dict(x=0.1, y=1.1),
    xaxis_tickangle=45,
    yaxis=dict(showgrid=True)
)

# Show the plot
fig7.show()

7. Are the Compulsion types represented equally in this dataset? What is the average Total_Score for each type?

In [52]:
# Count the number of instances for each compulsion type
compulsion_counts = filtered_df['Compulsion Type'].value_counts()

# Calculate the average Total Score for each compulsion type
compulsion_avg_scores = filtered_df.groupby('Compulsion Type')['Total_Score'].mean()

# Create a figure
fig8 = go.Figure()

# Add bar plot for the number of instances (left y-axis)
fig8.add_trace(
    go.Bar(
        x=compulsion_counts.index,
        y=compulsion_counts.values,
        name='Number of Instances',
        marker_color='lightblue',
        text=compulsion_counts.values,  # Show count on the bars
        textposition='auto'
    )
)

# Add line plot for the average total score (right y-axis)
fig8.add_trace(
    go.Scatter(
        x=compulsion_avg_scores.index,
        y=compulsion_avg_scores.values,
        name='Average Total Score',
        yaxis='y2',  # Associate this with the second y-axis
        mode='lines+markers',
        line=dict(color='orange', width=3),
        marker=dict(size=8),
        text=compulsion_avg_scores.values,  # Show average scores on the points
        textposition='top center'
    )
)

# Update layout to include a second y-axis
fig8.update_layout(
    title='Number of Instances and Average Total Score by Compulsion Type',
    xaxis_title='Compulsion Type',
    yaxis_title='Number of Instances',
    yaxis2=dict(
        title='Average Total Score',
        overlaying='y',  # Overlay y-axis 2 on the same plot as y-axis 1
        side='right'
    ),
    legend=dict(x=0.1, y=1.1),
    xaxis_tickangle=45,
    yaxis=dict(showgrid=True)
)

# Show the plot
fig8.show()

### Predictive Analytics

Input features: Age, Gender, Family history of OCD, Duration of symptoms, Obsession type, Compulsion Type, Diagnosis of Anxiety, Diagnosis of Depression  
Task: Predict the severity of OCD

To prepare the dataframe for prediction, the following has been done over the next few cells:
1. Conversion of boolean values to integers
2. Normalisation of numerical features
3. Conversion of categorical features to numerical by one hot encoding

In [53]:
# Convert boolean values to integers (0 for False, 1 for True) for Family history of OCD
filtered_df['Family History of OCD'] = filtered_df['Family History of OCD'].astype(int)


# Convert boolean values to integers (0 for False, 1 for True) for Depression Diagnosis
filtered_df['Depression Diagnosis'] = filtered_df['Depression Diagnosis'].astype(int)

# Convert boolean values to integers (0 for False, 1 for True) for Anxiety Diagnosis
filtered_df['Anxiety Diagnosis'] = filtered_df['Anxiety Diagnosis'].astype(int)

# Convert Gender to Boolean: Male as True, Female as False
filtered_df['Gender_Boolean'] = filtered_df['Gender'].map({'Male': True, 'Female': False})

# Convert Boolean to Integers (True -> 1, False -> 0)
filtered_df['Gender_Boolean'] = filtered_df['Gender_Boolean'].astype(int)




In [54]:
# Normalisation of numerical features
from sklearn.preprocessing import MinMaxScaler

# Initialize Min-Max Scaler
scaler = MinMaxScaler()

# Select the columns you want to normalize
columns_to_normalize = ['Age', 'Duration of Symptoms (months)']

# Apply the Min-Max scaling
filtered_df[columns_to_normalize] = scaler.fit_transform(filtered_df[columns_to_normalize])






In [55]:
# One-hot encoding for Obsession Type and Compulsion Type

from sklearn.preprocessing import OneHotEncoder

# Initialize OneHotEncoder with sparse_output=False to return dense array
encoder = OneHotEncoder(sparse_output=False)

# Apply One-Hot Encoding to the 'Obsession Type' column
one_hot_encoded = encoder.fit_transform(filtered_df[["Obsession Type"]])

# Get the feature names for the encoded columns
encoded_feature_names = encoder.get_feature_names_out(["Obsession Type"])

# Convert the encoded data into a DataFrame
df_one_hot_O = pd.DataFrame(one_hot_encoded, columns=encoded_feature_names)

# Reset the index to avoid row misalignment during concatenation
df_one_hot_O.reset_index(drop=True, inplace=True)

# Concatenate the original dataframe with the new one-hot encoded columns
filtered_df.reset_index(drop=True, inplace=True)  # Reset index in original dataframe too
filtered_df = pd.concat([filtered_df, df_one_hot_O], axis=1)

# Drop the original 'Obsession Type' column if you no longer need it
filtered_df.drop(columns=['Obsession Type'], inplace=True)

# Apply One-Hot Encoding to the 'Compulsion Type' column
one_hot_encoded = encoder.fit_transform(filtered_df[["Compulsion Type"]])

# Get the feature names for the encoded columns
encoded_feature_names = encoder.get_feature_names_out(["Compulsion Type"])

# Convert the encoded data into a DataFrame
df_one_hot_C = pd.DataFrame(one_hot_encoded, columns=encoded_feature_names)

# Reset the index to avoid row misalignment during concatenation
df_one_hot_C.reset_index(drop=True, inplace=True)

# Concatenate the original dataframe with the new one-hot encoded columns
filtered_df.reset_index(drop=True, inplace=True)  # Reset index in original dataframe too
dataset_final = pd.concat([filtered_df, df_one_hot_C], axis=1)

# Drop the original 'Compulsion Type' column if you no longer need it
dataset_final.drop(columns=['Compulsion Type'], inplace=True)

# Display the updated dataframe
filtered_df.head()


Unnamed: 0,Age,Gender,Family History of OCD,Duration of Symptoms (months),Compulsion Type,Total_Score,Depression Diagnosis,Anxiety Diagnosis,Score_Category,Age Group,Condition,Gender_Boolean,Obsession Type_Contamination,Obsession Type_Harm-related,Obsession Type_Hoarding,Obsession Type_Religious,Obsession Type_Symmetry
0,0.245614,Female,0,0.84188,Checking,27,1,1,High,31-50,"Anxiety T, Depression T",0,0.0,1.0,0.0,0.0,0.0
1,0.684211,Male,0,0.713675,Checking,7,0,0,Low,51-70,"Anxiety F, Depression F",1,1.0,0.0,0.0,0.0,0.0
2,0.350877,Female,0,0.444444,Praying,28,1,0,High,31-50,"Anxiety F, Depression T",0,1.0,0.0,0.0,0.0,0.0
3,0.964912,Female,0,0.970085,Counting,20,1,1,Low,71+,"Anxiety T, Depression T",0,0.0,0.0,0.0,1.0,0.0
4,0.842105,Female,0,0.286325,Washing,12,0,1,Low,51-70,"Anxiety T, Depression F",0,0.0,1.0,0.0,0.0,0.0


In [56]:
# Build a concise dataframe for regression from dataset_final
# The original 'Obsession Type' and 'Compulsion Type' columns were dropped above, so select their one-hot columns instead
onehot_columns = [col for col in dataset_final.columns if col.startswith(("Obsession Type_", "Compulsion Type_"))]
dataset_final_R = dataset_final[["Age", "Gender_Boolean", "Family History of OCD", "Duration of Symptoms (months)", "Total_Score", "Depression Diagnosis", "Anxiety Diagnosis"] + onehot_columns].copy()
# Convert boolean values to integers (0 for False, 1 for True)
dataset_final_R['Depression Diagnosis'] = dataset_final_R['Depression Diagnosis'].astype(int)
dataset_final_R['Anxiety Diagnosis'] = dataset_final_R['Anxiety Diagnosis'].astype(int)
dataset_final_R

In [None]:
from sklearn.model_selection import train_test_split

# Separate features (x) and target variable (y)
target_column = 'Total_Score'

# Features
X = dataset_final_R.drop(columns=[target_column])

# Target
y = dataset_final_R[target_column]

# Split the data into training and testing sets
# test_size=0.2 means 20% of the data will be used for testing
# random_state ensures reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Print the shapes of the resulting datasets
print(f"Training features shape: {X_train.shape}")
print(f"Testing features shape: {X_test.shape}")
print(f"Training target shape: {y_train.shape}")
print(f"Testing target shape: {y_test.shape}")

In [None]:
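import matplotlib.pyplot as plt
import seaborn as sns

# Histogram of the regression target y (Total_Score)
plt.figure(figsize=(5, 4))
sns.histplot(y, bins=10, kde=True, color='skyblue')
plt.title('Total_Score Distribution')
plt.xlabel('Total_Score')
plt.ylabel('Frequency')
plt.show()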

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)

rf_regressor.fit(X_train, y_train)

# Make predictions on the test data
y_pred = rf_regressor.predict(X_test)

# Evaluate the model
# Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

# R-squared (R2) Score
r2 = r2_score(y_test, y_pred)
print(f"R2 Score: {r2}")

# Optional: Display predictions and actual values for comparison
comparison_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(comparison_df.head())

In [None]:
# Evaluate on the training data (compare with the test scores above to gauge overfitting)

y_pred = rf_regressor.predict(X_train)

# Evaluate the model
# Mean Squared Error (MSE)
mse = mean_squared_error(y_train, y_pred)
print(f"Mean Squared Error: {mse}")

# R-squared (R2) Score
r2 = r2_score(y_train, y_pred)
print(f"R2 Score: {r2}")

# Optional: Display predictions and actual values for comparison
comparison_df = pd.DataFrame({'Actual': y_train, 'Predicted': y_pred})
print(comparison_df.head())

In [None]:
rf_regressor = RandomForestRegressor(n_estimators=25, random_state=42)

rf_regressor.fit(X_train, y_train)

# Make predictions on the test data
y_pred = rf_regressor.predict(X_test)

# Evaluate the model
# Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

# R-squared (R2) Score
r2 = r2_score(y_test, y_pred)
print(f"R2 Score: {r2}")

# Optional: Display predictions and actual values for comparison
comparison_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(comparison_df.head())

In [None]:
rf_regressor = RandomForestRegressor(n_estimators=50, random_state=42)

rf_regressor.fit(X_train, y_train)

# Make predictions on the test data
y_pred = rf_regressor.predict(X_test)

# Evaluate the model
# Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

# R-squared (R2) Score
r2 = r2_score(y_test, y_pred)
print(f"R2 Score: {r2}")

# Optional: Display predictions and actual values for comparison
comparison_df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(comparison_df.head())
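
The three cells above differ only in `n_estimators`. A loop that collects train and test metrics in one table makes the comparison easier to read. The sketch below is a minimal version that uses synthetic data from `make_regression` as a stand-in (an assumption); in the notebook, `X_train`, `X_test`, `y_train` and `y_test` would take the place of `X_tr`, `X_te`, `y_tr` and `y_te`.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): replace with the notebook's X and y.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

rows = []
for n in (25, 50, 100):
    model = RandomForestRegressor(n_estimators=n, random_state=42)
    model.fit(X_tr, y_tr)
    test_pred = model.predict(X_te)
    rows.append({
        "n_estimators": n,
        "train_r2": r2_score(y_tr, model.predict(X_tr)),  # fit quality on seen data
        "test_r2": r2_score(y_te, test_pred),             # generalisation estimate
        "test_mse": mean_squared_error(y_te, test_pred),
    })

results = pd.DataFrame(rows)
print(results)
```

A large gap between `train_r2` and `test_r2` in such a table is a common sign of overfitting.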

In [None]:
plt.figure(figsize=(10, 5))

# Histogram for y_train
plt.subplot(1, 2, 1)
sns.histplot(y_train, bins=10, kde=True, color='skyblue')
plt.title('y_train Distribution')
plt.xlabel('y_train values')
plt.ylabel('Frequency')

# Histogram for y_test
plt.subplot(1, 2, 2)
sns.histplot(y_test, bins=10, kde=True, color='lightcoral')
plt.title('y_test Distribution')
plt.xlabel('y_test values')
plt.ylabel('Frequency')

# Display the plot
plt.tight_layout()  # Ensures proper spacing between plots
plt.show()

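# L1-based feature selection, following the example in the scikit-learn user guide (iris data).
# Separate variable names are used so the notebook's X and y are not overwritten.
from sklearn.svm import LinearSVC
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
X_iris, y_iris = load_iris(return_X_y=True)
print(X_iris.shape)
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X_iris, y_iris)
model = SelectFromModel(lsvc, prefit=True)
X_iris_new = model.transform(X_iris)
print(X_iris_new.shape)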
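
The same `SelectFromModel` idea can be applied to a regression problem like the one in this notebook by using random forest feature importances instead of L1 coefficients. The sketch below is illustrative only: it uses synthetic data (an assumption) and the `"mean"` importance threshold, and it has not been run on the OCD features.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data (assumption): 10 features, only 3 of them informative.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, random_state=0
)

# Fit the forest inside the selector; features whose importance is below
# the mean importance are discarded.
selector = SelectFromModel(
    RandomForestRegressor(n_estimators=50, random_state=42), threshold="mean"
)
selector.fit(X_demo, y_demo)

X_selected = selector.transform(X_demo)
kept = selector.get_support()  # boolean mask over the original columns
print(X_demo.shape, "->", X_selected.shape)
```

With a dataframe as input, `X.columns[kept]` would give the names of the retained features.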

In [None]:
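# Add the Score_Category labels back (assumes dataset_final_R keeps the row index of dataset_final)
dataset_final_R['Score_Category'] = dataset_final['Score_Category']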
dataset_final_R


In [782]:
dropped_column = dataset_final_R['Total_Score']  # Save the column before dropping
dataset_final_R = dataset_final_R.drop(columns=['Total_Score'])

In [783]:
from sklearn.utils import resample
from sklearn.preprocessing import OrdinalEncoder

In [None]:
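# Encode 'Score_Category' using OrdinalEncoder to maintain order.
# Category names must match the data exactly: the labels in the dataframe are capitalised
# ('Low', 'High'); 'Medium' is assumed to follow the same capitalisation.
encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])  # Define the order of categories
dataset_final_R['Score_Category_Encoded'] = encoder.fit_transform(dataset_final_R[['Score_Category']])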

# Check class imbalance
print("Class distribution before resampling:")
print(dataset_final_R['Score_Category_Encoded'].value_counts())

# Separate each class into different DataFrames
df_low = dataset_final_R[dataset_final_R['Score_Category_Encoded'] == 0]
df_medium = dataset_final_R[dataset_final_R['Score_Category_Encoded'] == 1]
df_high = dataset_final_R[dataset_final_R['Score_Category_Encoded'] == 2]

In [None]:
# Resample the low and medium classes (with replacement) to the size of the high class,
# which is assumed here to be the largest class
df_low_upsampled = resample(df_low, replace=True, n_samples=len(df_high), random_state=42)
df_medium_upsampled = resample(df_medium, replace=True, n_samples=len(df_high), random_state=42)

# Combine the upsampled data
df_balanced = pd.concat([df_low_upsampled, df_medium_upsampled, df_high])

# Check class distribution after resampling
print("\nClass distribution after resampling:")
print(df_balanced['Score_Category_Encoded'].value_counts())
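
Because the cell above upsamples before the later train/test split, copies of the same row can end up on both sides of the split, which tends to inflate test scores. One common alternative is to split first and upsample only the training part. The sketch below shows that order on a toy frame (an assumption standing in for `dataset_final_R`); `toy`, `train_df`, `test_df` and `train_balanced` are illustrative names.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced frame (assumption): class sizes 10 / 30 / 60.
toy = pd.DataFrame({
    "feature": range(100),
    "Score_Category_Encoded": [0] * 10 + [1] * 30 + [2] * 60,
})

# Split first so that duplicated rows can never land in both train and test.
train_df, test_df = train_test_split(
    toy, test_size=0.2, random_state=0, stratify=toy["Score_Category_Encoded"]
)

# Upsample every class in the training split to the size of its largest class.
target_size = train_df["Score_Category_Encoded"].value_counts().max()
parts = [
    resample(group, replace=True, n_samples=target_size, random_state=42)
    for _, group in train_df.groupby("Score_Category_Encoded")
]
train_balanced = pd.concat(parts)

print(train_balanced["Score_Category_Encoded"].value_counts())
# Number of rows shared between the balanced training set and the test set.
overlap = len(set(train_balanced["feature"]) & set(test_df["feature"]))
print(overlap)
```

The test split keeps its original class proportions, so the reported metrics reflect the real class balance.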

In [None]:
dropped_column = df_balanced['Score_Category']  # Save the column before dropping
df_balanced = df_balanced.drop(columns=['Score_Category'])
df_balanced

In [None]:

# Separate features (X) and target variable (y)
target_column = 'Score_Category_Encoded'

# Features
X = df_balanced.drop(columns=[target_column])

# Target
y = df_balanced[target_column]

# Split the data into training and testing sets
# test_size=0.2 means 20% of the data will be used for testing
# random_state ensures reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Print the shapes of the resulting datasets
print(f"Training features shape: {X_train.shape}")
print(f"Testing features shape: {X_test.shape}")
print(f"Training target shape: {y_train.shape}")
print(f"Testing target shape: {y_test.shape}")

In [None]:
# Create and train the RandomForestRegressor model
# (the target is the ordinal code 0/1/2, so predictions are continuous values in that range)
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)
rf_regressor.fit(X_train, y_train)

# Make predictions on the test data
y_pred = rf_regressor.predict(X_test)

# Calculate Mean Squared Error and R2 Score
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R2 Score: {r2}")

# Display actual vs predicted results
df_results = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(df_results)
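
Since `Score_Category_Encoded` takes only three values, the task can also be treated as classification, where accuracy and a confusion matrix are easier to interpret than MSE on class codes. The sketch below is one possible variant, not the approach used in this notebook; it runs on synthetic three-class data from `make_classification` (an assumption).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic three-class stand-in data (assumption): replace with the notebook's X and y.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

acc = accuracy_score(y_te, y_hat)
cm = confusion_matrix(y_te, y_hat)  # rows: actual class, columns: predicted class
print(f"Accuracy: {acc:.3f}")
print(cm)
```

A classifier ignores the ordering of the categories, so the regressor above remains a reasonable choice when the distance between predicted and actual category matters.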

In [None]:
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler


# Standardize the features for SVM (important for SVR performance)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Define the SVR model with an RBF (Radial Basis Function) kernel
svr_model = SVR(kernel='rbf')

# Train the SVR model on the training data
svr_model.fit(X_train_scaled, y_train)

# Make predictions on the test data
y_pred = svr_model.predict(X_test_scaled)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R2 Score: {r2}")
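
The scaling and SVR steps above can be bundled into a single scikit-learn pipeline, which also makes cross-validation straightforward because the scaler is refit on each training fold. The sketch below uses synthetic data (an assumption) and default SVR hyperparameters; `svr_pipeline` and `scores` are illustrative names.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): replace with the notebook's X and y.
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)

# The pipeline standardises the features and then fits an RBF-kernel SVR.
svr_pipeline = make_pipeline(StandardScaler(), SVR(kernel="rbf"))

# 5-fold cross-validated R2; scaling statistics come from the training folds only.
scores = cross_val_score(svr_pipeline, X_demo, y_demo, cv=5, scoring="r2")
print(scores)
print(f"Mean R2: {scores.mean():.3f}")
```

Averaging over folds gives a more stable estimate than a single 80/20 split, especially with only 1500 patients.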
