### The purpose of this project is to build a chatbot that can uncover insights from Reddit comments about "Data Science"

#### Section 1: Get Reddit posts related to data science

In [3]:
# import the Reddit API wrapper (praw) and other libraries
import praw
import pandas as pd
import os

In [19]:
def get_top_posts(subreddit_list, limit=500, time_filter='all'):
    """Return the top posts of the given subreddits (names joined with '+') as a DataFrame."""

    # connect to the Reddit API; client_id, client_secret, and user_agent are read from environment variables
    reddit = praw.Reddit(client_id=os.environ['reddit_client'],
                         client_secret=os.environ['reddit_secret'],
                         user_agent=os.environ['reddit_user'],
                         redirect_uri='http://localhost:8080')

    # fetch the top posts of the given subreddits from the API
    posts = reddit.subreddit(subreddit_list).top(time_filter=time_filter, limit=limit)

    # collect one record per post
    post_records = []

    # keep the post attributes needed later, including post_id for the comment API call
    for post in posts:
        post_records.append({'post_id': post.id,
                             'subreddit': post.subreddit.display_name,  # store the name, not the Subreddit object
                             'created_utc': post.created_utc,
                             'selftext': post.selftext,
                             'post_url': post.url,
                             'post_title': post.title,
                             'link_flair_text': post.link_flair_text,
                             'score': post.score,
                             'num_comments': post.num_comments,
                             'upvote_ratio': post.upvote_ratio
                             })

    return pd.DataFrame(post_records)


# retrieve the top 100 posts from the subreddits "MachineLearning", "artificial", and "datascience"
posts_df = get_top_posts(subreddit_list='MachineLearning+artificial+datascience', limit=100, time_filter='all')
posts_df.to_csv('DS_ML_AI_posts.csv', header=True, index=False)
posts_df

Unnamed: 0,post_id,subreddit,created_utc,selftext,post_url,post_title,link_flair_text,score,num_comments,upvote_ratio
0,gh1dj9,MachineLearning,1.589117e+09,,https://v.redd.it/v492uoheuxx41,[Project] From books to presentations in 10s w...,Project,7847,186,0.99
1,kuc6tz,MachineLearning,1.610275e+09,,https://v.redd.it/25nxi9ojfha61,[D] A Demo from 1993 of 32-year-old Yann LeCun...,Discussion,5878,133,0.98
2,g7nfvb,MachineLearning,1.587789e+09,,https://v.redd.it/rlmmjm1q5wu41,[R] First Order Motion Model applied to animat...,Research,4760,111,0.97
3,lui92h,MachineLearning,1.614525e+09,,https://v.redd.it/ikd5gjlbi8k61,[N] AI can turn old photos into moving Images ...,News,4702,230,0.97
4,ohxnts,MachineLearning,1.625977e+09,,https://i.redd.it/34sgziebfia71.jpg,[D] This AI reveals how much time politicians ...,Discussion,4602,228,0.96
...,...,...,...,...,...,...,...,...,...,...
95,cqffii,datascience,1.565815e+09,,https://i.redd.it/4f71u8ti5hg31.jpg,Expectation vs reality,Fun/Trivia,1780,94,0.96
96,g6og9l,MachineLearning,1.587655e+09,# DICK-RNN\n\nA recurrent neural network train...,https://www.reddit.com/r/MachineLearning/comme...,[P] I trained a recurrent neural network train...,Project,1775,123,0.96
97,ijkkbb,MachineLearning,1.598822e+09,,https://v.redd.it/47g1f9cuf7k51,[P] Cross-Model Interpolations between 5 Style...,Project,1773,104,0.97
98,orybjg,datascience,1.627305e+09,,https://i.redd.it/u3ngf9tw2kd71.png,Me showing off a suspiciously well-performing ...,Fun/Trivia,1760,27,0.98
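
The `created_utc` column holds Unix timestamps in seconds, which is why it prints in scientific notation in the table above. The following is a minimal sketch of turning such a value into a readable UTC datetime. The helper name `utc_from_epoch` and the sample values are assumptions made for illustration; they are not part of the notebook.

```python
from datetime import datetime, timezone


def utc_from_epoch(seconds):
    """Convert seconds since the Unix epoch into a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Illustrative values only; these are not timestamps taken from the table above.
examples = [0.0, 1_600_000_000.0]
converted = [utc_from_epoch(s) for s in examples]

for dt in converted:
    print(dt.isoformat())
```

With pandas, `pd.to_datetime(posts_df['created_utc'], unit='s', utc=True)` applies the same conversion to the whole column at once.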


#### Section 2: Get comments of these posts

In [20]:
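# create a Reddit client for the comment calls (the client inside get_top_posts is local to that function)
reddit = praw.Reddit(client_id=os.environ['reddit_client'],
                     client_secret=os.environ['reddit_secret'],
                     user_agent=os.environ['reddit_user'])

# create a list to store the comments of all posts
comments_list = []
for post_id in posts_df['post_id']:

    # fetch the submission for this post
    submission = reddit.submission(post_id)

    # expand up to 10 "load more comments" placeholders; any that remain are dropped from the tree
    submission.comments.replace_more(limit=10)

    # record each comment together with the id of its post
    for comment in submission.comments.list():
        comments_list.append({'post_id': post_id, 'comment': comment.body})

comments_df = pd.DataFrame(comments_list)
comments_df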

Unnamed: 0,post_id,comment
0,gh1dj9,Twitter thread: [https://twitter.com/cyrildiag...
1,gh1dj9,The future 🤯
2,gh1dj9,Simple yet very useful. Thank you for sharing ...
3,gh1dj9,"Almost guaranteed, Apple will copy your idea i..."
4,gh1dj9,Ohh the nightmare of making this into a stable...
...,...,...
13612,xtd8kc,I don't think we are compatible with the webui...
13613,xtd8kc,Cool. I’ve already accepted it but I can doubl...
13614,xtd8kc,You need the token even for the local one. (It...
13615,xtd8kc,"No worries, I’ve got a token. 😂\nJust installe..."
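
Reddit replaces the body of a deleted or moderator-removed comment with the placeholder `[deleted]` or `[removed]`, and such entries add noise to any text built from the comments. The sketch below shows one way to filter them out of records shaped like the entries of `comments_list`. The function name `drop_placeholder_comments` and the toy records are hypothetical; the notebook itself does not perform this step.

```python
# Placeholders Reddit shows in place of deleted or removed comment bodies.
PLACEHOLDERS = {"[deleted]", "[removed]"}


def drop_placeholder_comments(comments):
    """Return only the records whose 'comment' text has real content."""
    kept = []
    for record in comments:
        body = record["comment"].strip()
        # Skip empty bodies and Reddit's placeholder strings.
        if body and body not in PLACEHOLDERS:
            kept.append(record)
    return kept


# Toy records for illustration; these are not real Reddit comments.
toy_comments = [
    {"post_id": "aaa111", "comment": "Great write-up, thanks!"},
    {"post_id": "aaa111", "comment": "[deleted]"},
    {"post_id": "bbb222", "comment": "   "},
    {"post_id": "bbb222", "comment": "[removed]"},
]

cleaned = drop_placeholder_comments(toy_comments)
print(len(cleaned), "of", len(toy_comments), "comments kept")
```

The filtered list can be passed to `pd.DataFrame` in the same way as `comments_list`.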


#### Section 3: Build a chatbot that answers questions based on the retrieved comments using the OpenAI API

In [22]:
# import llama_index classes to convert the comment text into an index that the model can query
from llama_index import SimpleDirectoryReader, GPTSimpleVectorIndex, LLMPredictor, PromptHelper, ServiceContext

# import the OpenAI wrapper from langchain to call the language model
from langchain import OpenAI

In [27]:
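# combine all comments and save them to a local file
comment_text = ' '.join(comments_df['comment'])
os.makedirs('textdata', exist_ok=True)  # make sure the target folder exists
with open('textdata/all_text_reddit.txt', 'w', encoding='utf-8') as f:  # utf-8 keeps emoji in comments intact
    f.write(comment_text)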

In [28]:
# convert comment text into index to be used by AI
def construct_index(directory_path):
    
    # cap the context size and the answer length to avoid long queries and long answers (tokens cost money)
    max_input_size = 4096
    num_outputs = 256

    # use the GPT-3.5 Turbo model to answer questions
    llm_predictor = LLMPredictor(llm=OpenAI(temperature=0,
                                            model_name='gpt-3.5-turbo',
                                            max_tokens=num_outputs))
    
    # limit the prompts delivered to the model under the hood to save money and increase speed
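    # max_chunk_overlap=20 is an assumed value; llama_index releases with this API expect the argument
    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap=20)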
    
    # bundle the settings defined above
    service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper)
    
    # turn local file into document objects for creating index
    documents = SimpleDirectoryReader(directory_path).load_data()
    
    # build index and save to local file
    index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)
    index.save_to_disk('index.json')
    
    return index
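
The two limits in `construct_index` interact: the tokens reserved for the answer come out of the same window that holds the prompt. The sketch below works through that arithmetic, assuming the prompt and the completion share a single context window. The function `context_budget` and its `prompt_overhead` parameter are hypothetical names introduced here; this is not how llama_index computes its chunk sizes internally.

```python
def context_budget(max_input_size, num_outputs, prompt_overhead=0):
    """Tokens left for retrieved comment text after reserving room for
    the answer and for any fixed prompt text (question, instructions)."""
    remaining = max_input_size - num_outputs - prompt_overhead
    if remaining <= 0:
        raise ValueError("No room left for context; lower num_outputs or prompt_overhead.")
    return remaining


# Values from construct_index: a 4096-token window with 256 tokens kept for the answer.
print(context_budget(4096, 256))

# Same window, additionally assuming 100 tokens of question and instruction text.
print(context_budget(4096, 256, prompt_overhead=100))
```

With the notebook's settings, 3840 tokens remain for the question and the retrieved comment text.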

In [29]:
# create "ask_me_anything" function as user interface
def ask_me_anything(question):
    index = GPTSimpleVectorIndex.load_from_disk('index.json')
    response = index.query(question, response_mode = 'compact')
    print(response.response)
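
Conceptually, a vector index answers a question by embedding it, finding the stored text chunks whose embeddings are closest to it, and passing those chunks to the model as context. The toy example below illustrates only the nearest-chunk step, using hand-written 3-dimensional vectors and cosine similarity. It is a sketch of the general idea, not llama_index's implementation, and all names and vectors in it are made up for illustration; real embeddings come from an embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunks, k=1):
    """Return the k chunks whose embeddings are most similar to the query."""
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, chunk["embedding"]),
        reverse=True,
    )
    return ranked[:k]


# Hand-written toy embeddings; a real index stores vectors from an embedding model.
toy_chunks = [
    {"text": "comment about neural networks", "embedding": [0.9, 0.1, 0.0]},
    {"text": "comment about salaries", "embedding": [0.0, 0.2, 0.9]},
    {"text": "comment about deep learning", "embedding": [0.8, 0.3, 0.1]},
]

query = [1.0, 0.0, 0.0]  # stand-in for the embedded question
best = top_k(query, toy_chunks, k=2)
print([chunk["text"] for chunk in best])
```

In this toy setup the two chunks pointing in roughly the same direction as the query are returned, and the unrelated one is left out.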

In [104]:
# load comment data to build index
construct_index('textdata')

INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 664312 tokens


<llama_index.indices.vector_store.vector_indices.GPTSimpleVectorIndex at 0x7fa9d1199bb0>

In [32]:
ask_me_anything("What's the trend of machine learning?")

INFO:llama_index.token_counter.token_counter:> [query] Total LLM token usage: 3945 tokens
INFO:llama_index.token_counter.token_counter:> [query] Total embedding token usage: 8 tokens




The trend of machine learning is that it is steadily improving over time as long as the input data set is consistent and the evaluation of the performance is correct. This is achieved by using programs that guess and check at scale, with the results being graded and aggregated into better guesses. This process is repeated until the desired outcome is achieved. Machine learning is also being used to identify and mitigate cognitive biases such as regression towards the mean and the gambler's fallacy, which can help improve the accuracy of predictions and decisions.
