In [12]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import LocalOutlierFactor

In [14]:
data_df = pd.read_csv('../data/cleaned/South_East_Asia_Social_Media_MentalHealth_cleaned.csv')

In [36]:
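# Select the three numeric attributes to examine.
columns = data_df[['Daily SM Usage (hrs)', 'Peer Comparison Frequency (1-10)', 'Social Anxiety Level (1-10)']]

# Standardize the features so each contributes comparably to the distance calculation.
scaler = StandardScaler()
scaled_columns = scaler.fit_transform(columns)

# contamination is the expected proportion of anomalies in the data.
lof = LocalOutlierFactor(n_neighbors=10, contamination=0.001)

# fit_predict labels each row: -1 for an anomaly, 1 for an inlier.
predictions = lof.fit_predict(scaled_columns)
data_df['anomaly'] = predictions

# Keep only the rows flagged as anomalies.
anomalies = data_df[data_df['anomaly'] == -1]

print("Total number of anomalies detected:", len(anomalies))
print("Anomalies detected:\n", anomalies[['Daily SM Usage (hrs)', 'Peer Comparison Frequency (1-10)', 'Social Anxiety Level (1-10)']].head(10))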

Total number of anomalies detected: 341
Anomalies detected:
        Daily SM Usage (hrs)  Peer Comparison Frequency (1-10)  \
1554                   8.96                                 8   
1942                   3.34                                 6   
4267                   2.60                                 7   
4316                  11.65                                 5   
6225                   5.06                                 5   
6920                  10.10                                 5   
8772                   1.90                                 6   
9100                  12.00                                 7   
10275                  2.63                                 7   
10627                 11.06                                 5   

       Social Anxiety Level (1-10)  
1554                             2  
1942                             3  
4267                             9  
4316                             7  
6225                             7  
6

The steps followed to identify potential anomalies with the local outlier factor (LOF) algorithm are listed below.
1) Select the columns from the dataset to investigate.
2) If the data contains categorical values, convert them to numeric values by encoding. (The three columns used here are already numeric, so no encoding was needed.)
3) Scale the selected columns.
4) Create the local outlier factor model.
5) Fit the model and predict which rows are anomalies.
6) Filter the rows flagged as anomalies (label -1).
7) Print the total number of anomalies detected.
8) Print the rows with the detected anomalies.
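
The steps above can be sketched as a single reusable function. This is a minimal illustration rather than the notebook's actual code: the helper name `detect_lof_anomalies`, the `lof_score` column, the synthetic `toy_df` data, and the use of one-hot encoding for step 2 are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler


def detect_lof_anomalies(df, feature_cols, n_neighbors=10, contamination=0.001):
    """Hypothetical helper: return a copy of df with LOF labels and scores."""
    # Steps 1-2: select the columns; one-hot encode any categorical ones.
    features = pd.get_dummies(df[feature_cols], dtype=float)
    # Step 3: scale so every feature contributes comparably to distances.
    scaled = StandardScaler().fit_transform(features)
    # Steps 4-5: create the model, fit it, and label each row (-1 = anomaly).
    lof = LocalOutlierFactor(n_neighbors=n_neighbors, contamination=contamination)
    result = df.copy()
    result["anomaly"] = lof.fit_predict(scaled)
    # Larger score = more anomalous (scikit-learn stores the negated value).
    result["lof_score"] = -lof.negative_outlier_factor_
    return result


# Synthetic stand-in data (not the study's dataset), plus one planted outlier.
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "usage_hrs": rng.normal(5.0, 1.5, 500),
    "anxiety": rng.normal(5.0, 1.5, 500),
    "platform": rng.choice(["A", "B"], 500),
})
outlier = pd.DataFrame({"usage_hrs": [40.0], "anxiety": [40.0], "platform": ["A"]})
toy_df = pd.concat([toy_df, outlier], ignore_index=True)

result = detect_lof_anomalies(
    toy_df, ["usage_hrs", "anxiety", "platform"], contamination=0.01
)

# Steps 6-8: filter the flagged rows, then report the count and the rows.
flagged = result[result["anomaly"] == -1]
print("Total number of anomalies detected:", len(flagged))
print(flagged.sort_values("lof_score", ascending=False).head(10))
```

Sorting by the score is optional, but it puts the most isolated rows first instead of showing them in index order.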

A total of 341 anomalies were detected using the local outlier factor algorithm. The algorithm was applied to three attributes: daily social media usage, peer comparison frequency, and social anxiety level. Two patterns stand out among the flagged rows. The first is a user with high daily social media usage and high peer comparison frequency but a low social anxiety level (for example, row 1554: 8.96 hours, comparison frequency 8, anxiety level 2). The second is a user with low daily social media usage but high peer comparison frequency and a high social anxiety level (for example, row 4267: 2.60 hours, comparison frequency 7, anxiety level 9). One possible explanation for the first pattern is that the user manages anxiety well despite heavy use. For the second, the user may avoid heavy social media use but still struggle to manage social anxiety.
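
To see how many flagged rows fit each of the two patterns described above, the rows can be labeled with simple rules. The sketch below is illustrative only: the helper `label_pattern` and every cut-off (usage of at least 8 or at most 4 hours, scores of at least 7 or at most 3) are assumptions, not values from the analysis. The sample frame holds the five flagged rows that are fully visible in the printed output.

```python
import pandas as pd

USAGE = "Daily SM Usage (hrs)"
PEER = "Peer Comparison Frequency (1-10)"
ANXIETY = "Social Anxiety Level (1-10)"


def label_pattern(row, high_usage=8.0, low_usage=4.0, high_score=7, low_score=3):
    """Hypothetical helper; all cut-offs are illustrative assumptions."""
    if row[USAGE] >= high_usage and row[PEER] >= high_score and row[ANXIETY] <= low_score:
        return "high usage, high comparison, low anxiety"
    if row[USAGE] <= low_usage and row[PEER] >= high_score and row[ANXIETY] >= high_score:
        return "low usage, high comparison, high anxiety"
    return "other"


# The five flagged rows that are fully visible in the printed output.
sample = pd.DataFrame(
    {
        USAGE: [8.96, 3.34, 2.60, 11.65, 5.06],
        PEER: [8, 6, 7, 5, 5],
        ANXIETY: [2, 3, 9, 7, 7],
    },
    index=[1554, 1942, 4267, 4316, 6225],
)
sample["pattern"] = sample.apply(label_pattern, axis=1)
print(sample["pattern"].value_counts())
```

In the notebook, the same function could be applied to the full `anomalies` frame with `anomalies.apply(label_pattern, axis=1)`. Rows labeled "other" would need a separate look, since they were flagged for reasons these two rules do not capture.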