# <span style="color:Blue">Assignment-5 of COSC5806: Data Analysis with Python</span>

## <span style="color:Purple">You are allowed to use core Python built-in modules/packages/libraries, NumPy, Pandas, scikit-learn, matplotlib, Seaborn, NLTK, Gensim, etc. Please read the instructions carefully and do not hesitate to contact me if you have any questions.</span>

### <span style="color:Red">Examples and Resources for this assignment:</span>
<ul>
    <li><span style="color:Red">Chapter 7 from <a href="https://github.com/amueller/introduction_to_ml_with_python/tree/main">Working with Text Data</a></span></li>
</ul>

### <span style="color:Green">Context</span>
CSSRS-Suicide: This Reddit dataset is comprised of 2181 user data from the timeframe between 2005 and 2016 and from 15 mental health-related subreddits. Four practicing psychiatrists followed the guidelines outlined in the Columbia Suicide Severity Rating Scale (C-SSRS) and annotated 500 users' data on suicide risks in five levels: Supportive, Indicator, Ideation, Behavior, and Attempt.

The following <a href="https://scholarcommons.sc.edu/cgi/viewcontent.cgi?params=/context/aii_fac_pub/article/1002/&path_info=knowledge_aware_assessment_of_severity_of_suicide_risk_for_early_intervention.pdf">link</a> might be useful to know more about the dataset.

# <span style="color:Green">P1: Load the dataset.</span>

In [44]:
#Codes of P1 here
#Load the libraries
import pandas as pd

#read the data from csv file
reddit = pd.read_csv(r'D:\Algoma\COSC5806001_DataAnalysis\500_Reddit_users_posts_labels.csv')
#get dataframe
df_reddit = pd.DataFrame(reddit)
print(df_reddit.info)


<bound method DataFrame.info of          User                                               Post       Label
0      user-0  ['Its not a viable option, and youll be leavin...  Supportive
1      user-1  ['It can be hard to appreciate the notion that...    Ideation
2      user-2  ['Hi, so last night i was sitting on the ledge...    Behavior
3      user-3  ['I tried to kill my self once and failed badl...     Attempt
4      user-4  ['Hi NEM3030. What sorts of things do you enjo...    Ideation
..        ...                                                ...         ...
495  user-495  ['Its not the end, it just feels that way. Or ...  Supportive
496  user-496  ['It was a skype call, but she ended it and Ve...   Indicator
497  user-497  ['That sounds really weird.Maybe you were Dist...  Supportive
498  user-498  ['Dont know there as dumb as it sounds I feel ...     Attempt
499  user-499  ['&gt;It gets better, trust me.Ive spent long ...    Behavior

[500 rows x 3 columns]>


# <span style="color:Green">P2: Print the number of posts and percentages for each label.</span>

In [40]:
#Codes of P2 here
#the number of posts for each label
posts_by_label = df_reddit['Label'].value_counts()

#percentage of each label
percent_of_label = (posts_by_label/ len(df_reddit))*100

# Print number of posts and percentages by each label
print("The number of posts and percentages for each label are: ")
for label, count in posts_by_label.items():
    print(f"{label}: {count} _ {percent_of_label[label]}%")

The number of posts and percentages for each label are: 
Ideation: 171 _ 34.2%
Supportive: 108 _ 21.6%
Indicator: 99 _ 19.8%
Behavior: 77 _ 15.4%
Attempt: 45 _ 9.0%


# <span style="color:Green">P3: Convert to lowercase. Remove contractions and punctuations. Remove leading and trailing whitespaces. </span>

In [11]:
#Codes of P3 here
import re
import contractions
def clean_post_data(text):  
    text = text.lower()  # lower text and assign back to text
    text = contractions.fix(text)  # Remove contractions
    text = re.sub(r"[^\w\s]", "", text)  # Remove punctuation
    text = text.strip()  # Strip whitespace
    return text
    
# post column
df_reddit['cleaned_post'] = df_reddit['Post'].apply(clean_post_data)
print(df_reddit['cleaned_post'])


0      its not a viable option and you will be leavin...
1      it can be hard to appreciate the notion that y...
2      hi so last night i was sitting on the ledge of...
3      i tried to kill my self once and failed badly ...
4      hi nem3030 what sorts of things do you enjoy d...
                             ...                        
495    its not the end it just feels that way or at l...
496    it was a skype call but she ended it and ventr...
497    that sounds really weirdmaybe you were distrac...
498    do not know there as dumb as it sounds i feel ...
499    gtit gets better trust mei have spent long eno...
Name: cleaned_post, Length: 500, dtype: object


# <span style="color:Green">P4: Tokenize the text based on whitespaces and remove single-character tokens like 'b', 'c', 'd', etc.</span>

In [13]:
#Codes of P4 here
def tokenize_remove(text):
    tokens = text.split() #tokenize based on whitespaces
    filtered_text = [token for token in tokens if len(token)>1] #remove single-character
    return filtered_text
    
# tokenize and remove single character on cleaned post column
df_reddit['cleaned_post'] = df_reddit['cleaned_post'].apply(tokenize_remove)# use cleaned_post column with cleaned data from previous step
print(df_reddit['cleaned_post'])

0      [its, not, viable, option, and, you, will, be,...
1      [it, can, be, hard, to, appreciate, the, notio...
2      [hi, so, last, night, was, sitting, on, the, l...
3      [tried, to, kill, my, self, once, and, failed,...
4      [hi, nem3030, what, sorts, of, things, do, you...
                             ...                        
495    [its, not, the, end, it, just, feels, that, wa...
496    [it, was, skype, call, but, she, ended, it, an...
497    [that, sounds, really, weirdmaybe, you, were, ...
498    [do, not, know, there, as, dumb, as, it, sound...
499    [gtit, gets, better, trust, mei, have, spent, ...
Name: cleaned_post, Length: 500, dtype: object


# <span style="color:Green">P5: Remove stopwords.</span>

In [15]:
#Codes of P5 here
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

#remove stopwords using english_stop_words
def remove_stopwords(tokens):
    filtered_tokens = [token for token in tokens if token not in ENGLISH_STOP_WORDS]
    return filtered_tokens

df_reddit['cleaned_post'] = df_reddit['cleaned_post'].apply(remove_stopwords) # use cleaned_post column with cleaned data from previous step

print(df_reddit['cleaned_post'])


0      [viable, option, leaving, wife, pain, comprehe...
1      [hard, appreciate, notion, meet, make, happy, ...
2      [hi, night, sitting, ledge, window, contemplat...
3      [tried, kill, self, failed, badly, moment, wan...
4      [hi, nem3030, sorts, things, enjoy, doing, per...
                             ...                        
495    [end, just, feels, way, does, entire, lifetime...
496    [skype, ended, ventricular, dysfunction, left,...
497    [sounds, really, weirdmaybe, distractibility, ...
498    [know, dumb, sounds, feel, hyperactive, behavi...
499    [gtit, gets, better, trust, mei, spent, long, ...
Name: cleaned_post, Length: 500, dtype: object


# <span style="color:Green">P6: Use POS tagging to lemmatize each token. Use <a href="https://www.nltk.org/api/nltk.stem.WordNetLemmatizer.html?highlight=wordnet">WordNetLemmatizer</a>.</span>

In [17]:
#Codes of P6 here
import nltk
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer() # get wordnetlemmatizer 

def lemmatize_tokens(tokens):
    tagged = pos_tag(tokens)  # pos tag to all tokens
    lemmatized = []
    for word, tag in tagged: # modify tagging with single word
        if tag.startswith('J'):
            pos= wordnet.ADJ
        elif tag.startswith('V'):
            pos= wordnet.VERB
        elif tag.startswith('N'):
            pos= wordnet.NOUN
        elif tag.startswith('R'):
            pos= wordnet.ADV
        else:
            pos= wordnet.NOUN
        lemmatized.append(lemmatizer.lemmatize(word, pos=pos)) #lemmatize using WordNetLemmatizer
    # print(lemmatized)
    return lemmatized

df_reddit['lemmatized_post'] = df_reddit['cleaned_post'].apply(lemmatize_tokens) #add new col with lemmatized values
print(df_reddit['lemmatized_post'])


0      [viable, option, leave, wife, pain, comprehens...
1      [hard, appreciate, notion, meet, make, happy, ...
2      [hi, night, sit, ledge, window, contemplate, j...
3      [try, kill, self, fail, badly, moment, want, r...
4      [hi, nem3030, sort, thing, enjoy, do, personal...
                             ...                        
495    [end, just, feels, way, do, entire, lifetime, ...
496    [skype, end, ventricular, dysfunction, leave, ...
497    [sound, really, weirdmaybe, distractibility, s...
498    [know, dumb, sound, feel, hyperactive, behavio...
499    [gtit, get, good, trust, mei, spend, long, bli...
Name: lemmatized_post, Length: 500, dtype: object


# <span style="color:Green">P7: Perform label-wise topic modelling using Latent Dirichlet Allocation (LDA). Find an optimal number of topics using a coherence score. </span>

In [21]:
#Codes of P7 here
import numpy as np
import mglearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from gensim.models.coherencemodel import CoherenceModel
from gensim import corpora

# Group by label
grouped = {label: group['lemmatized_post'].tolist() for label, group in df_reddit.groupby('Label')}
#print(grouped)

# Function to calculate coherence score
def compute_coherence(lda_model, texts, vectorizer):
    # Create topics_get features
    feature_names = np.array(vectorizer.get_feature_names_out())
    topics = []
    for topic in lda_model.components_:
        top_word_indices = topic.argsort()[:-11:-1]
        top_words = [feature_names[i] for i in top_word_indices]
        topics.append(top_words)

    # Create dictionary and corpus for Gensim
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    # Compute coherence
    cm = CoherenceModel(topics=topics, texts=texts, dictionary=dictionary, coherence='c_v')
    coherence = cm.get_coherence()
    return coherence

# LDA for each label
for label, docs in grouped.items():
    print(f"""{label}
    ==================""")
    # Join tokens back into text for CountVectorizer
    text_data = [" ".join(post) for post in docs]
    best_score = -1
    best_n = 0
    # Convert to matrix
    vectorizer = CountVectorizer(max_features=10000, max_df=.15)
    X = vectorizer.fit_transform(text_data)

    for n_topics in range(2, 11):      
        # Fit LDA model
        lda = LatentDirichletAllocation(n_components= n_topics, learning_method="batch",
                                    max_iter=25, random_state=0)
        lda.fit_transform(X)
    
        #compute coherence score
        coherence_score = compute_coherence(lda, docs, vectorizer)
        print(f"Coherence Score: {coherence_score:.4f}")

        if coherence_score > best_score:
            best_score = coherence_score
            best_n = n_topics
    print(f"Best number of topics for '{label}': {best_n} (Coherence = {best_score:.4f})")
    
    sorting = np.argsort(lda.components_, axis=1)[:, ::-1]
    # get the feature names from the vectorizer:
    feature_names = np.array(vectorizer.get_feature_names_out())
        
    # Print out the 10 topics:
    mglearn.tools.print_topics(topics=range(best_n), feature_names=feature_names,
                           sorting=sorting, topics_per_chunk=5, n_words=10)


Attempt
Coherence Score: 0.4281
Coherence Score: 0.4353
Coherence Score: 0.4184
Coherence Score: 0.5095
Coherence Score: 0.4432
Coherence Score: 0.5300
Coherence Score: 0.4829
Coherence Score: 0.4889
Coherence Score: 0.4924
Best number of topics for 'Attempt': 7 (Coherence = 0.5300)
topic 0       topic 1       topic 2       topic 3       topic 4       
--------      --------      --------      --------      --------      
cry           antony        daily         gt            gt            
dark          vorenus       church        adderall      hug           
abnormal      socially      faith         attack        recommend     
reflex        asshole       anytime       test          qualify       
okay          unhappy       mention       homeless      accept        
psych         earth         sub           boat          unhappiness   
nausea        student       prove         waste         rsuicidewatch 
dysfunction   everyday      failure       as            alcoholic     
ventri

# <span style="color:Green">P8: Analyze the topics extracted in the previous steps. See the example from pages 349-350 of this <a href="https://www.nrigroupindia.com/e-book/Introduction%20to%20Machine%20Learning%20with%20Python%20(%20PDFDrive.com%20)-min.pdf">link</a>.</span>

In [42]:
#Codes of P8 here
#select some topic
selected_topics = {
    "Attempt": ['cry', 'dark', 'abnormal', 'reflex','okay','psych','nausea', 'dysfunction', 'ventricular','join'],
    "Behavior": ['severe', 'memory', 'exercise', 'treatment', 'internet', 'wife', 'brother','everyday','hug','blah'],
    "Ideation":['recommend','ache','car','hotline','lie','fault','uni','treatment','bully','suppose'],
    "Indicator":['counselor','harm','fault','brain','illness','solution','hotline','amp','tonight','achieve'],
    "Supportive":['counselor','police','authority','med','hotlines','upset','local','depend','specific','caller']
}

for label, keywords in selected_topics.items():
    min_matches = 5 # at least 5 keyword matches
    df_label = df_reddit[df_reddit['Label'] == label]  #filter with label to get relevant post
    for _, row in df_label.iterrows():
        match_count = sum(1 for word in keywords if word in row['lemmatized_post'])
        if match_count >= min_matches:
            post= row['Post']
            user = row['User']
    print(f"""Label "{label}" : {user}
            {post}""")



Label "Attempt" : user-490
            ['She has Parkinsons. Im so sorry to hear that about your mom I cannot imagine. We always think we have endless time and we never do. I need to appreciate the time I have. What did your mom have?', 'Thank you for taking the time to reply I really appreciate you advice and tips. Thank you! An youre right Im sure it isnt the worst they could imagine for me- good point! Thank you again.', 'My mom knows and is Anxiety. I made an extremely halfhearted attempt when I was 20....looking back now I know it was just to get the attention of my parents. I was in college, and my university threatened to expel me if I did not sign a contract stating to see a psych, counselor weekly, group therepy weekly and I was not allowed to talk to friends about my bad feelings because I was distracting them from their studies... Everything but the last part was a blessing in disguise- so I have these coping skills-I n\xc3\xa9e to get in the habit of using them now that I a