# **Final Project**
## *DATA 5420/6420*
## Name: Stephanie Liechty

The purpose of the final project is to produce an MVP that is a culmination of the skills you have learned in each of the previous units. This MVP should be a cohesive product that combines methods in a logical pipeline; it should NOT simply be a collection of methods implemented independently/separately with no clear end goal/state. You will be tasked with applying at least four methods from across the four units, which I've outlined below:

### Unit 1

* Chatbots
* Basic Text Statistics
* NLP Pipeline (Preprocessing & Normalization)
* Compiling Corpora via APIs

### Unit 2

* Bag of Words Models (TF-IDF and Count Vectorization)
* Document Classification
* Sentiment Analysis

### Unit 3

* Document Summarization
* Topic Modeling
* Text Similarity
> * Information Retrieval (Search)
> * Recommendation Systems
* Document Clustering
> * KMeans
> * Affinity Prop
> * Wards Agglomerative Hierarchical

### Unit 4

* Word Embeddings
* Pretrained Transformers
* Question-Answering Systems
* Speech-to-Text (hopefully)

You will of course need to perform some form of cleaning/text normalization and feature engineering (bag of words and/or word embeddings), but the way you go about that will be problem dependent -- on top of those two steps, you will need to incorporate at least two other model types as well that form some coherent end-stage MVP.

For example:

1) a corpus of news articles pulled from the Bing News API that is cleaned/normalized

2) uses word embeddings to feature engineer the text

3) performs sentiment analysis to score sentiment of all articles

4) articles are sortable by sentiment, and ranked based on their relevance to keywords/search queries (information retrieval)

The MVP is a NewsFeed showing a table of articles displayed in an interactive dashboard

As you are performing your analyses consider:

* What cleaning and normalization steps are necessary for my text, and which are not?
* What sort of feature engineering do I need to utilize, both in terms of using BoW or word embeddings, and in terms of document or word vectorization? Do I need to use different methods for different analysis types?
* What is the purpose of performing your selected methods and how do they meaningfully build on one another?
* What are the practical applications of the models you developed?

### **What four (+) methods have you chosen and how do they fit together?**



1.   NLP Pipeline
2.   Sentiment Analysis
3.   Text Similarity
4.   Question-Answering System

**Description of how these methods will be meaningfully combined**:



First, the NLP Pipeline cleans and gets the text ready, making sure everything's uniform for deeper digging. This tidy text is then the base for Sentiment Analysis, which figures out if the advice is positive or negative, and Text Similarity, which finds posts that are alike. This helps you easily find and compare different advice topics and feelings. The Question-Answering System takes this neat text and lets you ask specific questions, giving you personalized answers. All these steps come together in a dashboard that's easy to use. You can sift through, understand, and get involved with the world of parenting advice, showcasing how mixing different text mining methods can unlock useful insights from loads of messy data.

In [None]:
# General imports for data manipulation and numerical operations
import pandas as pd
import numpy as np

# NLP Pipeline imports
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import spacy
import re

# Sentiment Analysis imports
from transformers import pipeline
from collections import Counter

# Text Similarity imports
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import gensim

# Question-Answering System imports (using OpenAI)
!pip install openai==0.28
import openai
import ipywidgets as widgets
from IPython.display import display, clear_output

# Data Fetching for Reddit
!pip install praw
import praw
from datetime import datetime
from datetime import date

# Initialize NLTK (if you haven't already)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('vader_lexicon')

# Setting up Secret Key usage
from google.colab import userdata

# Setting up OpenAI
openai.api_key = userdata.get('API')

# Setting up PRAW
reddit = praw.Reddit(client_id='cvFqWtnTSzLAF_sqtXLa_g',
                     client_secret= userdata.get('Reddit'),
                     user_agent='reddit_app/v1')




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


# **Method 1: NLP Pipeline**

Implementing a Natural Language Processing pipeline to preprocess and normalize text data will involve steps such as tokenization, removing stopwords, lemmatization, and removing special characters.

In [None]:
# Specify the subreddit names for parenting advice
subreddit_names = ['beyondthebump', 'Mom', 'NewParents',
                   'Parenting', 'AskParents']

In [None]:
# Create empty lists to store post attributes and comments
post_attributes = []
comment_attributes = []

# Iterate through each subreddit
for subreddit_name in subreddit_names:
    subreddit = reddit.subreddit(subreddit_name) # set subreddit
    posts = subreddit.top(time_filter='month', limit=20)  # set post parameters from top posts from the past month, up to 20 posts

    # Iterate through the top posts in the subreddit
    for post in posts:
        # Append post attributes to the list
        post_attributes.append({
            'Post_ID': post.id,
            'Post_Title': post.title,
            'Post_Content': post.selftext or 'No Content',  # Ensure no null values in Post_Content
            'Post_URL': post.url,
            'Post_Date': datetime.utcfromtimestamp(post.created_utc).strftime('%Y-%m-%d'),
            'Provider': subreddit_name
        })

        # Fetch the top comments for the current post
        post.comments.replace_more(limit=0)
        top_comments = post.comments[:1]

        # Iterate through the top comments and append attributes to the list
        for comment in top_comments:
            comment_attributes.append({
                'Post_ID': post.id,  # Add Post_ID to link comments back to the posts
                'Post_Content': post.selftext or 'No Content',
                'Comment_Content': comment.body if comment.body else 'No Content',  # Ensure no null values in Comment_Content
                'Comment_Score': comment.score,
                'Comment_Date': datetime.utcfromtimestamp(comment.created_utc).strftime('%Y-%m-%d'),
                'Provider': subreddit_name
            })

# Create DataFrames for comments
df_comments = pd.DataFrame(comment_attributes)

# Adjust 'Provider' column for DataFrame
df_comments['Provider'] = 'r/' + df_comments['Provider']

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.


In [None]:
df_comments.head()  # display dataframe

Unnamed: 0,Post_ID,Post_Content,Comment_Content,Comment_Score,Comment_Date,Provider
0,1c2aazi,Just a scenario i thought was sad but also a l...,Sorry I laughed out loud at the title alone. B...,721,2024-04-12,r/beyondthebump
1,1bm36ak,Went to bed around 11 after bub had his bottle...,"I mean, you did it. You hit the partner teamwo...",1623,2024-03-24,r/beyondthebump
2,1brvfep,I was angry at you today. You asked me to go p...,This is.. rough. In your writing I can feel th...,796,2024-03-30,r/beyondthebump
3,1bnfst8,Mine is that part of the reason newborns cry i...,That colic is a lazy diagnosis and synonymous ...,1249,2024-03-25,r/beyondthebump
4,1bqtnfg,It's a frequent topic in this sub that healthc...,Yup. Postpartum is the worst time because we g...,703,2024-03-29,r/beyondthebump
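
The `Post_Date` and `Comment_Date` columns above are built with `datetime.utcfromtimestamp`, which is deprecated as of Python 3.12. A minimal sketch of a timezone-aware alternative follows; the helper name `utc_date` and the sample timestamp are made up for illustration and are not part of the notebook.

```python
from datetime import datetime, timezone

def utc_date(epoch_seconds):
    """Format a Unix timestamp (in seconds) as a UTC calendar date, YYYY-MM-DD."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime('%Y-%m-%d')

# Hypothetical timestamp: exactly one day after the Unix epoch
print(utc_date(86_400))  # 1970-01-02
```

In the collection loop this would stand in for the `datetime.utcfromtimestamp(post.created_utc).strftime('%Y-%m-%d')` expression, assuming `created_utc` is a Unix timestamp in seconds.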


## **Part 1 (Lowercase, Special Character & Whitespace Removal) - NLP Pipeline**

In the first part, we convert the text to lowercase, replace special characters and digits with spaces, and collapse extra whitespace. This makes the text more uniform and easier to analyze.

In [None]:
def clean_text(text):
    # Convert text to lowercase
    text = text.lower()

    # Remove special characters and digits
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)

    # Remove extra spaces, tabs, and new lines
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Apply the cleaning function to both the Post_Content and Comment_Content columns
df_comments['Cleaned_Post_Content'] = df_comments['Post_Content'].apply(clean_text)
df_comments['Cleaned_Comment_Content'] = df_comments['Comment_Content'].apply(clean_text)

# Display the first few rows of the DataFrame to verify the changes
df_comments.head()

Unnamed: 0,Post_ID,Post_Content,Comment_Content,Comment_Score,Comment_Date,Provider,Cleaned_Post_Content,Cleaned_Comment_Content
0,1c2aazi,Just a scenario i thought was sad but also a l...,Sorry I laughed out loud at the title alone. B...,721,2024-04-12,r/beyondthebump,just a scenario i thought was sad but also a l...,sorry i laughed out loud at the title alone bless
1,1bm36ak,Went to bed around 11 after bub had his bottle...,"I mean, you did it. You hit the partner teamwo...",1623,2024-03-24,r/beyondthebump,went to bed around after bub had his bottle di...,i mean you did it you hit the partner teamwork...
2,1brvfep,I was angry at you today. You asked me to go p...,This is.. rough. In your writing I can feel th...,796,2024-03-30,r/beyondthebump,i was angry at you today you asked me to go pi...,this is rough in your writing i can feel the l...
3,1bnfst8,Mine is that part of the reason newborns cry i...,That colic is a lazy diagnosis and synonymous ...,1249,2024-03-25,r/beyondthebump,mine is that part of the reason newborns cry i...,that colic is a lazy diagnosis and synonymous ...
4,1bqtnfg,It's a frequent topic in this sub that healthc...,Yup. Postpartum is the worst time because we g...,703,2024-03-29,r/beyondthebump,it s a frequent topic in this sub that healthc...,yup postpartum is the worst time because we go...
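
The last row above shows a side effect of replacing every non-letter with a space: "It's" becomes "it s". The sketch below reproduces that behavior on a made-up sentence and compares it with a variant that deletes apostrophes first. The variant, `clean_text_keep_contractions`, is an assumption for illustration and not the method used in this notebook.

```python
import re

def clean_text(text):
    """Same steps as the notebook: lowercase, non-letters to spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)
    return re.sub(r'\s+', ' ', text).strip()

def clean_text_keep_contractions(text):
    """Illustrative variant: drop straight and curly apostrophes before the other steps."""
    text = text.lower()
    text = re.sub(r"['’]", '', text)
    text = re.sub(r'[^a-z\s]', ' ', text)
    return re.sub(r'\s+', ' ', text).strip()

sample = "It's 11pm and baby isn’t asleep!"  # made-up example sentence
print(clean_text(sample))                    # it s pm and baby isn t asleep
print(clean_text_keep_contractions(sample))  # its pm and baby isnt asleep
```

Whether the variant is preferable depends on the later steps: a stopword list that contains fragments such as "s" and "t" will clean up the first form, but may not recognize "isnt".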


## **Part 2 (Tokenization & Stopword Removal) - NLP Pipeline**

In the second part, stopwords are going to be removed. Tokenization will split the string text into individual words, which helps with stopword removal. This is a necessary task because it will reduce dataset size, bring more focus on meaningful words, and improve the performance of our NLP models.

In [None]:
def remove_stopwords(text):
    # Tokenize the text into words
    tokens = word_tokenize(text)

    # Get the list of English stopwords
    stop_words = set(stopwords.words('english'))

    # Remove stopwords from the tokens
    filtered_tokens = [word for word in tokens if word not in stop_words]

    # Rejoin the filtered tokens back into a string
    filtered_text = ' '.join(filtered_tokens)

    return filtered_text

df_comments['Cleaned_Post_Content'] = df_comments['Cleaned_Post_Content'].apply(remove_stopwords)
df_comments['Cleaned_Comment_Content'] = df_comments['Cleaned_Comment_Content'].apply(remove_stopwords)

df_comments.head()

Unnamed: 0,Post_ID,Post_Content,Comment_Content,Comment_Score,Comment_Date,Provider,Cleaned_Post_Content,Cleaned_Comment_Content
0,1c2aazi,Just a scenario i thought was sad but also a l...,Sorry I laughed out loud at the title alone. B...,721,2024-04-12,r/beyondthebump,scenario thought sad also little funny embarra...,sorry laughed loud title alone bless
1,1bm36ak,Went to bed around 11 after bub had his bottle...,"I mean, you did it. You hit the partner teamwo...",1623,2024-03-24,r/beyondthebump,went bed around bub bottle diaper change swadd...,mean hit partner teamwork jackpot sleep night ...
2,1brvfep,I was angry at you today. You asked me to go p...,This is.. rough. In your writing I can feel th...,796,2024-03-30,r/beyondthebump,angry today asked go pick oldest grandparents ...,rough writing feel love support husband damn f...
3,1bnfst8,Mine is that part of the reason newborns cry i...,That colic is a lazy diagnosis and synonymous ...,1249,2024-03-25,r/beyondthebump,mine part reason newborns cry hormonal one tal...,colic lazy diagnosis synonymous know probably ...
4,1bqtnfg,It's a frequent topic in this sub that healthc...,Yup. Postpartum is the worst time because we g...,703,2024-03-29,r/beyondthebump,frequent topic sub healthcare women kinda suck...,yup postpartum worst time go seen every week w...
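
Two details of stopword removal are worth a quick sketch. The stopword set only needs to be built once rather than on every call, and NLTK's English list includes negations such as "not", which can matter for sentiment later on. The example below uses a tiny hand-written stopword set and a hypothetical `keep` parameter, both assumptions for illustration, so that it runs without NLTK.

```python
# Tiny hand-written stopword set for illustration only;
# the notebook uses NLTK's much longer English list.
STOP_WORDS = frozenset({'i', 'was', 'at', 'you', 'the', 'a', 'not', 'is', 'this'})
NEGATIONS = frozenset({'not', 'no', 'never'})

def remove_stopwords(text, stop_words=STOP_WORDS, keep=frozenset()):
    """Drop stopwords from whitespace-separated text, except words listed in `keep`."""
    tokens = text.split()
    return ' '.join(t for t in tokens if t not in stop_words or t in keep)

text = 'this is not the worst advice'          # made-up example
print(remove_stopwords(text))                  # worst advice
print(remove_stopwords(text, keep=NEGATIONS))  # not worst advice
```

The first result reads as the opposite of the original sentence, which is one reason to be cautious about applying a sentiment model to text that has already had its stopwords removed.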


## **Part 3 (Lemmatization) - NLP Pipeline**

In the third part, we reduce words to their dictionary form (lemmatization). This should improve model performance because different inflections of the same word are no longer treated as separate terms.

In [None]:
# Load the English language model in spaCy
nlp = spacy.load('en_core_web_sm')

def lemmatize_text(text):
    # Process the text using spaCy
    doc = nlp(text)

    # Extract the lemma for each token and join back into a string
    lemmatized_text = ' '.join([token.lemma_ for token in doc])

    return lemmatized_text

# Apply the lemmatization function to the cleaned post and comment columns
df_comments['Cleaned_Post_Content'] = df_comments['Cleaned_Post_Content'].apply(lemmatize_text)
df_comments['Cleaned_Comment_Content'] = df_comments['Cleaned_Comment_Content'].apply(lemmatize_text)

# Display the first few rows to verify the changes
df_comments.head()

Unnamed: 0,Post_ID,Post_Content,Comment_Content,Comment_Score,Comment_Date,Provider,Cleaned_Post_Content,Cleaned_Comment_Content
0,1c2aazi,Just a scenario i thought was sad but also a l...,Sorry I laughed out loud at the title alone. B...,721,2024-04-12,r/beyondthebump,scenario think sad also little funny embarrass...,sorry laugh loud title alone bless
1,1bm36ak,Went to bed around 11 after bub had his bottle...,"I mean, you did it. You hit the partner teamwo...",1623,2024-03-24,r/beyondthebump,go bed around bub bottle diaper change swaddle...,mean hit partner teamwork jackpot sleep night ...
2,1brvfep,I was angry at you today. You asked me to go p...,This is.. rough. In your writing I can feel th...,796,2024-03-30,r/beyondthebump,angry today ask go pick old grandparent meet f...,rough writing feel love support husband damn f...
3,1bnfst8,Mine is that part of the reason newborns cry i...,That colic is a lazy diagnosis and synonymous ...,1249,2024-03-25,r/beyondthebump,mine part reason newborn cry hormonal one talk...,colic lazy diagnosis synonymous know probably ...
4,1bqtnfg,It's a frequent topic in this sub that healthc...,Yup. Postpartum is the worst time because we g...,703,2024-03-29,r/beyondthebump,frequent topic sub healthcare woman kinda suck...,yup postpartum bad time go see every week week...


## **Part 4 (Vectorization) - NLP Pipeline**

Vectorization is a crucial step that transforms text data into a numerical format, making it understandable and processable by machine learning algorithms.

In [None]:
# Initialize the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(
    max_df=0.85,  # discard words appearing in more than 85% of the documents
    min_df=2,     # discard words appearing in fewer than 2 documents
    ngram_range=(1, 2),  # consider unigrams and bigrams
)
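
To make the `min_df` and `max_df` settings concrete, here is a minimal plain-Python sketch of document-frequency filtering on a made-up four-document corpus. It covers unigrams only, and `filter_vocabulary` is a hypothetical helper, not a scikit-learn function; it mirrors the rule that an integer `min_df` is a document count and a float `max_df` is a proportion of documents.

```python
from collections import Counter

def filter_vocabulary(tokenized_docs, min_df=2, max_df=0.85):
    """Keep terms found in at least `min_df` documents and at most `max_df` of all documents."""
    n_docs = len(tokenized_docs)
    # Count each term once per document, however often it occurs inside that document
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return sorted(t for t, df in doc_freq.items()
                  if df >= min_df and df <= max_df * n_docs)

# Made-up corpus for illustration
docs = [
    'baby sleep night'.split(),
    'baby sleep bottle'.split(),
    'baby colic cry'.split(),
    'baby night feed'.split(),
]
print(filter_vocabulary(docs))  # ['night', 'sleep']
```

Here "baby" is dropped for appearing in every document, and the words that appear only once are dropped for being too rare.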

In [None]:
# Concatenate post and comment content into a new column
df_comments['Combined_Text'] = df_comments['Cleaned_Post_Content'].str.cat(df_comments['Cleaned_Comment_Content'], sep=' ')

# Proceed with vectorization on the 'Combined_Text' column
tfidf_matrix = tfidf_vectorizer.fit_transform(df_comments['Combined_Text'])

# The resulting 'tfidf_matrix' is a sparse matrix representation of the TF-IDF values.
# You can convert it to a DataFrame for better readability (optional):
feature_names = tfidf_vectorizer.get_feature_names_out()
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)

tfidf_df.head()

Unnamed: 0,abandon,abandon user,ability,able,absolute,absolutely,accept,acceptable,accident,accidentally,...,year old,year tell,yell,yes,yesterday,yet,yet spend,yo,young,zero
0,0.0,0.0,0.0,0.0,0.0,0.08483,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.080879,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.059844,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.068389,0.0,0.0,0.0,0.0
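
The weights in the table come from TF-IDF. As a rough sketch of the arithmetic, assuming scikit-learn's default settings (raw term counts, smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row), the same kind of vector can be computed by hand. The corpus and the helper name `tfidf_vectors` are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Return (vocabulary, L2-normalized TF-IDF rows) for a list of token lists."""
    n_docs = len(tokenized_docs)
    vocab = sorted({t for doc in tokenized_docs for t in doc})
    doc_freq = Counter(t for doc in tokenized_docs for t in set(doc))
    # Smoothed inverse document frequency: rarer terms receive larger weights
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0  # guard against empty documents
        vectors.append([w / norm for w in raw])
    return vocab, vectors

# Made-up corpus for illustration
docs = ['baby sleep night'.split(), 'baby colic cry'.split(), 'baby sleep bottle'.split()]
vocab, vectors = tfidf_vectors(docs)
first = dict(zip(vocab, vectors[0]))
print({t: round(w, 3) for t, w in first.items() if w > 0})
```

In the first document, "night" (one document) outweighs "sleep" (two documents), which in turn outweighs "baby" (all three), even though each word occurs exactly once there.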


# **Method 2: Sentiment Analysis**

Assessing the emotional tone of the parenting advice comments will help identify whether posts express positive, negative, or neutral sentiments towards parenting issues and advice.

In [None]:
# Using a pretrained RoBERTa model (fine-tuned on tweets) for sentiment analysis
sentiment_model = pipeline("sentiment-analysis", model='cardiffnlp/twitter-roberta-base-sentiment-latest')

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
def truncate_sequence(sequence, max_length):
    if len(sequence) > max_length:
        sequence = sequence[:max_length]
    return sequence # Anything that is after the max length position is cut off

# Truncate each cleaned comment to 512 characters, then apply the sentiment model and keep the predicted label
df_comments['Comment_Sentiment'] = df_comments['Cleaned_Comment_Content'].apply(lambda x: sentiment_model(truncate_sequence(x, 512))[0]['label'])


# Display the first few rows with the new sentiment column
df_comments.head()

Unnamed: 0,Post_ID,Post_Content,Comment_Content,Comment_Score,Comment_Date,Provider,Cleaned_Post_Content,Cleaned_Comment_Content,Combined_Text,Comment_Sentiment
0,1c2aazi,Just a scenario i thought was sad but also a l...,Sorry I laughed out loud at the title alone. B...,721,2024-04-12,r/beyondthebump,scenario think sad also little funny embarrass...,sorry laugh loud title alone bless,scenario think sad also little funny embarrass...,neutral
1,1bm36ak,Went to bed around 11 after bub had his bottle...,"I mean, you did it. You hit the partner teamwo...",1623,2024-03-24,r/beyondthebump,go bed around bub bottle diaper change swaddle...,mean hit partner teamwork jackpot sleep night ...,go bed around bub bottle diaper change swaddle...,positive
2,1brvfep,I was angry at you today. You asked me to go p...,This is.. rough. In your writing I can feel th...,796,2024-03-30,r/beyondthebump,angry today ask go pick old grandparent meet f...,rough writing feel love support husband damn f...,angry today ask go pick old grandparent meet f...,negative
3,1bnfst8,Mine is that part of the reason newborns cry i...,That colic is a lazy diagnosis and synonymous ...,1249,2024-03-25,r/beyondthebump,mine part reason newborn cry hormonal one talk...,colic lazy diagnosis synonymous know probably ...,mine part reason newborn cry hormonal one talk...,neutral
4,1bqtnfg,It's a frequent topic in this sub that healthc...,Yup. Postpartum is the worst time because we g...,703,2024-03-29,r/beyondthebump,frequent topic sub healthcare woman kinda suck...,yup postpartum bad time go see every week week...,frequent topic sub healthcare woman kinda suck...,negative
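
Once every comment has a label, a simple tally gives a first view of the overall tone. The sketch below uses `Counter` on the five labels displayed above; it is only an illustration of the counting step, not a result for the full dataset.

```python
from collections import Counter

# Labels copied from the five rows displayed above
labels = ['neutral', 'positive', 'negative', 'neutral', 'negative']

counts = Counter(labels)
total = sum(counts.values())
# Share of each label, listed from most to least common
shares = {label: count / total for label, count in counts.most_common()}
print(shares)
```

On the full DataFrame, the same breakdown per subreddit could presumably be obtained with `df_comments.groupby('Provider')['Comment_Sentiment'].value_counts(normalize=True)`.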


# **Method 3: Text Similarity**

Text Similarity measures the degree of similarity between pairs of text documents. This method will help identify parenting advice posts that are closely related or similar in content, allowing users to explore similar discussions or solutions to common parenting challenges.

In [None]:
# Compute pairwise cosine similarity between the TF-IDF document vectors
doc_sim = cosine_similarity(tfidf_matrix)
# Convert the similarity matrix to a DataFrame
doc_sim_df = pd.DataFrame(doc_sim)
# Display the first few rows
doc_sim_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,1.0,0.100629,0.079585,0.060444,0.080766,0.008945,0.132982,0.076545,0.075344,0.20121,...,0.147932,0.133601,0.059208,0.049732,0.054449,0.05851,0.011585,0.066652,0.050993,0.052673
1,0.100629,1.0,0.063205,0.112776,0.055801,0.0,0.071298,0.060771,0.074459,0.073426,...,0.08196,0.058028,0.049632,0.029674,0.031672,0.057339,0.046221,0.052938,0.048166,0.137769
2,0.079585,0.063205,1.0,0.055986,0.10506,0.0,0.035473,0.105133,0.161393,0.15054,...,0.179365,0.160478,0.067809,0.042154,0.074197,0.09874,0.063085,0.09394,0.102509,0.116079
3,0.060444,0.112776,0.055986,1.0,0.041319,0.0,0.054639,0.089344,0.06893,0.040689,...,0.101695,0.071701,0.082598,0.0157,0.031751,0.055013,0.072061,0.050135,0.03146,0.155869
4,0.080766,0.055801,0.10506,0.041319,1.0,0.014587,0.068611,0.298174,0.138665,0.125213,...,0.136404,0.058954,0.069356,0.06477,0.059499,0.044228,0.044574,0.086681,0.099333,0.0578
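
Each cell in the matrix above is the cosine of the angle between two document vectors: 1.0 on the diagonal (a document compared with itself) and 0.0 where two documents share no terms. A minimal plain-Python sketch of the calculation, using made-up three-dimensional vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; zero vectors are treated as similarity 0 in this sketch."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up vectors for illustration
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]  # same direction as a, twice the length
c = [0.0, 0.0, 3.0]  # no overlapping dimensions with a

print(cosine(a, b))  # approximately 1: length does not matter, only direction
print(cosine(a, c))  # 0.0: nothing in common
```

Because only direction matters, a long post and a short comment that use the same vocabulary in similar proportions still score as highly similar.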


In [None]:
# save the text of every advice comment to an array
advice_list = df_comments['Comment_Content'].values
advice_list

array(['Sorry I laughed out loud at the title alone. Bless. ❤️😂',
       'I mean, you did it. You hit the partner teamwork jackpot. Any sleep through the night is a good one. Even if baby isn’t there yet.',
       "This is.. rough. In your writing I can feel the love and support you have for your husband, but damn doesn't it just feel unfair sometimes? \n\nI don't resent that my husband goes to the gym 3-5 times a week, I resent that I have to ask for any time I get to myself. \n\nI resent that it's assumed I've got the kids when he's getting ready for work, but when I'm getting ready for work, he will walk in the door 2 minutes after I need to leave which leaves me no time to communicate any of their needs, nevermind mine. \n\nI resent that every decision I make factors in the needs of the other 3 people in my house,  when it feels like most of his world is still his own.",
       'That colic is a lazy diagnosis and synonymous with “I don’t know it’ll probably be fine eventually good 

In [None]:
def advice_recommender(advice_title, advice=advice_list, doc_sims=doc_sim_df):

    advice_idx = np.where(advice == advice_title)[0][0] # find the index of the chosen advice

    advice_similarities = doc_sims.iloc[advice_idx].values # cosine similarities between the chosen advice and every comment

    # negate so the sort is descending (np.argsort is ascending, which would return the least similar comments);
    # position 0 is the chosen advice itself, so take positions 1-5
    similar_advice_idxs = np.argsort(-advice_similarities)[1:6]

    similar_advice = advice[similar_advice_idxs] # look up the text of the recommended advice

    print("Based on your interest in \"{}\", I'd recommend checking out:".format(advice_title))
    for recommendation in similar_advice:
        print("- {}".format(recommendation))

In [None]:
advice_recommender('If you really really feel something is wrong, call someone.') # input advice

Based on your interest in "If you really really feel something is wrong, call someone.", I'd recommend checking out:
- So hard to loose it! No advice but I fell your pain. I never lost my baby weight 2nd time around
- Unrelated but holy cow! That baby has so much hair! Mine came out bald but with sideburns 😂
- I’d say that’s a strong positive.
- Sounds like your mom has an awesome sense of humour.
- I would be blasting them so hard on their FB right now.


How rude!
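
When ranking by similarity, the sort direction matters: `np.argsort` sorts in ascending order, so the most similar items end up at the end of the result unless the scores are negated or the result is reversed. The plain-Python sketch below picks the top-k neighbors in descending order and excludes the query by index rather than by assuming it sits in first place. The function name and similarity values are made up for illustration.

```python
def top_k_similar(similarities, query_idx, k=5):
    """Return indices of the k most similar items, most similar first, excluding the query itself."""
    candidates = [i for i in range(len(similarities)) if i != query_idx]
    candidates.sort(key=lambda i: similarities[i], reverse=True)
    return candidates[:k]

# Made-up similarity row for document 0 (illustrative values only)
row = [1.0, 0.10, 0.08, 0.30, 0.0, 0.20]
print(top_k_similar(row, query_idx=0, k=3))  # [3, 5, 1]
```

Excluding the query by index also stays correct when another document is an exact duplicate and ties with it at a similarity of 1.0.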


In [None]:
def get_top_recommended_advice(similarity_matrix, df):
    # Initialize an empty list to store the top recommended advice
    top_recommendations = []

    # Iterate over each row in the similarity matrix
    for idx, similarities in enumerate(similarity_matrix):
        # We skip the first index since it's the similarity with itself, which should be the highest
        sorted_indices = np.argsort(-similarities)[1:]
        # Find the index of the top recommended advice that is not the same as the advice itself
        top_recommendation_idx = sorted_indices[0]
        # Retrieve the top recommended advice using the index
        top_recommendations.append(df.iloc[top_recommendation_idx]['Comment_Content'])

    return top_recommendations

# Assuming doc_sim_df.values gives us the numpy array of the similarity matrix
# We now call the function and assign the result to a new column in df_comments
df_comments['Top_Recommended_Advice'] = get_top_recommended_advice(doc_sim_df.values, df_comments)

In [None]:
# Remove rows where any of the columns have a null value
df_comments = df_comments.dropna()

# If you want to remove rows where specific columns have null values, you can specify the subset
# For example, if you only want to remove rows where 'Comment_Content' is null:
df_comments = df_comments.dropna(subset=['Comment_Content'])


In [None]:
# Save the DataFrame to a CSV file
df_comments.to_csv('tm_dataset.csv', index=False)

df_comments.head()

Unnamed: 0,Post_ID,Post_Content,Comment_Content,Comment_Score,Comment_Date,Provider,Cleaned_Post_Content,Cleaned_Comment_Content,Combined_Text,Comment_Sentiment,Top_Recommended_Advice
0,1c2aazi,Just a scenario i thought was sad but also a l...,Sorry I laughed out loud at the title alone. B...,721,2024-04-12,r/beyondthebump,scenario think sad also little funny embarrass...,sorry laugh loud title alone bless,scenario think sad also little funny embarrass...,neutral,"I felt the same way, and felt so guilty. For t..."
1,1bm36ak,Went to bed around 11 after bub had his bottle...,"I mean, you did it. You hit the partner teamwo...",1623,2024-03-24,r/beyondthebump,go bed around bub bottle diaper change swaddle...,mean hit partner teamwork jackpot sleep night ...,go bed around bub bottle diaper change swaddle...,positive,Didn't wake them up: Me falling down the hardw...
2,1brvfep,I was angry at you today. You asked me to go p...,This is.. rough. In your writing I can feel th...,796,2024-03-30,r/beyondthebump,angry today ask go pick old grandparent meet f...,rough writing feel love support husband damn f...,angry today ask go pick old grandparent meet f...,negative,Some husbands manage to get through the whole ...
3,1bnfst8,Mine is that part of the reason newborns cry i...,That colic is a lazy diagnosis and synonymous ...,1249,2024-03-25,r/beyondthebump,mine part reason newborn cry hormonal one talk...,colic lazy diagnosis synonymous know probably ...,mine part reason newborn cry hormonal one talk...,neutral,"I'm so, so sorry you're going through this. My..."
4,1bqtnfg,It's a frequent topic in this sub that healthc...,Yup. Postpartum is the worst time because we g...,703,2024-03-29,r/beyondthebump,frequent topic sub healthcare woman kinda suck...,yup postpartum bad time go see every week week...,frequent topic sub healthcare woman kinda suck...,negative,Sending solidarity from another mum with plant...


# **Method 4: Question-Answering System**

A question-answering system enables users to pose questions in natural language and receive relevant answers extracted from parenting advice posts.

In [None]:
# Function to handle question asking and log management
def ask_question(question, chat_log):
    if len(chat_log) == 0:
        chat_log.append({"role": "system", "content": "Welcome! You can ask any question you like about parenting."})

    # Trim the chat log if it exceeds 12 entries to manage size and relevance
    if len(chat_log) >= 12:
        chat_log = chat_log[-12:]

    # Append the user's question to the chat log
    chat_log.append({"role": "user", "content": question})

    # Assuming you've set up your API key correctly
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=chat_log,
        max_tokens=150
    )

    # Extracting the generated response
    answer = response.choices[0].message['content']

    # Append the assistant's response to the chat log
    chat_log.append({"role": "assistant", "content": answer})

    return answer, chat_log
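
One detail of the trimming step: slicing with `chat_log[-12:]` keeps only the most recent entries, so the system message at the start of the log is dropped once the conversation grows long enough. A minimal sketch of a trim that preserves it follows; `trim_chat_log` is a hypothetical helper and the messages are made up.

```python
def trim_chat_log(chat_log, max_turns=12):
    """Keep the first system message plus the `max_turns` most recent other messages."""
    system = [m for m in chat_log if m['role'] == 'system'][:1]
    rest = [m for m in chat_log if m['role'] != 'system']
    return system + rest[-max_turns:]

# Made-up conversation: one system message followed by ten question/answer pairs
log = [{'role': 'system', 'content': 'You answer parenting questions.'}]
for i in range(10):
    log.append({'role': 'user', 'content': f'question {i}'})
    log.append({'role': 'assistant', 'content': f'answer {i}'})

trimmed = trim_chat_log(log)
print(len(trimmed), trimmed[0]['role'], trimmed[1]['content'])  # 13 system question 4
```

Keeping the system message means the instructions given to the model stay in force for the whole session, at the cost of one extra entry in every request.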


In [None]:
# Widgets for input and output
text_input = widgets.Text(
    placeholder='Type your question here',
    description='Question:',
    disabled=False
)

ask_button = widgets.Button(description='Ask')

output = widgets.Output()

chat_log = []

# Button click handler
def on_ask_button_click(b):
    global chat_log
    with output:
        clear_output(wait=True)  # Clear the output to maintain a clean state
        question = text_input.value
        if question:
            answer, chat_log = ask_question(question, chat_log)
            # Print only the last user question and bot answer
            for message in chat_log[-2:]:  # Adjust to display more history if needed
                print(f"{message['role'].title()}: {message['content']}")
        text_input.value = ''

ask_button.on_click(on_ask_button_click)

# Display the widgets
display(text_input, ask_button, output)

Text(value='', description='Question:', placeholder='Type your question here')

Button(description='Ask', style=ButtonStyle())

Output()