## Research Question
How does post engagement, measured by upvotes and comments, compare across three subreddits:
1. r/technology
2. r/Futurology
3. r/science?


### Prompts for Contextualising AI Use
To begin this project, I used two AI tools (Claude and ChatGPT) to help clarify the task requirements, define a plan, and suggest appropriate code snippets. Below is a detailed record of the prompts I used, the rationale behind them, and my reflections on the outputs.


#### Prompt 1: Understanding the Assignment


**Prompt:**

> I have to do this Data Science project. Can you explain it to me in simple terms and tell me what is required?  
> Here is the project link: https://lse-dsi.github.io/DS105/2024/winter-term/summative/mini-project-2.html



**AI-Response:**

> While I couldn't directly access the specific URL provided due to access restrictions, I've gathered extensive information about DS105 web scraping summative assessments that align with your requirements........


**Comment:**

> Claude's response was unclear because it could not access the specific webpage. As a result, it lacked essential detail, misrepresented key parts of the task, and produced hallucinated output. I realised I needed to provide more context myself to guide the AI effectively.


#### Prompt 2: Clarifying Deliverables Using Contextual Prompting

**Prompt:**  

> I am a Data Science student working on a project titled *Reddit Engagement Analysis*. Based on the provided task description [], could you explain what needs to be done and outline the key deliverables in order?  
> You are an expert Data Scientist.

**AI Output:**  
> The assistant listed all necessary deliverables, GitHub setup, Reddit API configuration, data collection, storage, EDA, and report writing along with brief explanations.


**Comment:**  
> I used two contextual prompting frameworks, **STAR (Situation-Task-Action-Result)** and **CAR (Context-Action-Result)**, to make my prompt clearer.

> I also assigned the AI the persona of an “expert” to get more structured and useful responses.


**Reflection:**  
This was a useful interaction. The output helped me understand the project before I gave the brief a thorough read, and it helped me draft a project outline.


#### Prompt 3: Validating Project Plan

**Prompt:**  

> Here I have written all the key steps. On the basis of requirements of the task, Lemme know if my approach is right.

> a. Set up GitHub repo as per instructions  
> b. Create Reddit developer account  
> c. Configure Reddit app for API access  
> d. Define research question  
> e. Collect data  
> f. Store in SQLite database  
> g. Conduct EDA  
> h. Visualise data  
> i. Write final report  

> Does this cover all necessary steps or am i missing out on something?

**Reflection:**  
> The tool (ChatGPT) confirmed my structure and pointed out smaller elements I was overlooking, such as hiding API keys securely and validating the schema of my database before loading data.
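
One way to act on the schema-validation suggestion is to check each record's keys against an expected column set before anything is written to the database. The sketch below is illustrative only: `EXPECTED_POST_COLUMNS` and `find_schema_problems` are hypothetical names, the column set mirrors the post fields collected later in this notebook, and the sample rows are placeholders rather than real data.

```python
# Hypothetical expected schema for post records (an assumption for this sketch).
EXPECTED_POST_COLUMNS = {
    "subreddit", "post_id", "title", "author",
    "created_utc", "score", "upvote_ratio", "num_comments",
}


def find_schema_problems(records, expected_columns):
    """Return a list of readable problems; an empty list means every record matches."""
    problems = []
    for i, record in enumerate(records):
        keys = set(record)
        missing = expected_columns - keys
        extra = keys - expected_columns
        if missing:
            problems.append(f"row {i}: missing {sorted(missing)}")
        if extra:
            problems.append(f"row {i}: unexpected {sorted(extra)}")
    return problems


# Placeholder rows: the first is complete, the second lacks most fields.
sample = [
    {
        "subreddit": "technology", "post_id": "abc123", "title": "Example",
        "author": "someone", "created_utc": 0.0, "score": 1,
        "upvote_ratio": 1.0, "num_comments": 0,
    },
    {"subreddit": "science", "post_id": "def456"},
]

problems = find_schema_problems(sample, EXPECTED_POST_COLUMNS)
for problem in problems:
    print(problem)
```

Running a check like this before loading means a malformed record is reported with its row index instead of silently producing a table with missing columns.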



#### Prompt 4: Choosing a Research Question

**Prompt 1:**  

> Suggest some research questions for this project. I want to explore how different subreddits relate to each other.



**Reflection:**  
> The AI initially gave me generic suggestions that weren’t relevant to my interests. I realised I needed to specify my focus.

**Prompt 2:**  

> I'm interested in topics related to the future and technology. Can you suggest a better phrasing for my research question involving subreddit engagement differences?


**AI Output (Final Research Question):**  
> **How does post engagement (score, comments) vary across r/technology, r/Futurology, and r/science?**



**Reflection:**  
> This question closely aligned with my interests and met the project’s requirement of being measurable through engagement metrics such as score and comments. I decided to adopt it as my research question going forward.
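
To make "measurable via engagement metrics" concrete, the sketch below groups post records by subreddit and reports the median score and median comment count. It is a minimal illustration, not the analysis used in this project: `summarise_engagement` is a hypothetical helper, and the toy records are placeholders rather than collected data. The median is used here because engagement counts tend to be skewed by a few very popular posts, which is an assumption about the data rather than a finding.

```python
from collections import defaultdict
from statistics import median


def summarise_engagement(posts):
    """Group post records by subreddit and compute median score and comments."""
    grouped = defaultdict(lambda: {"score": [], "num_comments": []})
    for post in posts:
        bucket = grouped[post["subreddit"]]
        bucket["score"].append(post["score"])
        bucket["num_comments"].append(post["num_comments"])
    return {
        subreddit: {
            "posts": len(values["score"]),
            "median_score": median(values["score"]),
            "median_comments": median(values["num_comments"]),
        }
        for subreddit, values in grouped.items()
    }


# Toy placeholder records, NOT real Reddit data.
toy_posts = [
    {"subreddit": "technology", "score": 10, "num_comments": 2},
    {"subreddit": "technology", "score": 30, "num_comments": 4},
    {"subreddit": "science", "score": 5, "num_comments": 1},
]

summary = summarise_engagement(toy_posts)
for subreddit, stats in summary.items():
    print(subreddit, stats)
```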



#### Prompt 5: Guidance on Coding Structure

**Prompt:**

> Now that we have our research question, please guide me step-by-step through the code I need for the data gathering notebook.  
> Act as a professional Machine Learning engineer. Include comments and markdowns to document each step clearly.


**Reflection:**  
> This was one of the most helpful interactions. The AI provided modular code with proper explanations for API authentication, data parsing, dataframe construction, and database writing.


# Coding

**Prompt:**

> Lets start now. First give me all the requried libraries/Dependencies i need for the project and also code to handle secrte keys in colab.

In [None]:
!pip install praw --quiet

import os
import sqlite3

import pandas as pd  # needed by later cells that build DataFrames
import praw
from google.colab import userdata

os.environ["REDDIT_CLIENT_ID"] = userdata.get("REDDIT_CLIENT_ID")
os.environ["REDDIT_CLIENT_SECRET"] = userdata.get("REDDIT_CLIENT_SECRET")
os.environ["REDDIT_USERNAME"] = userdata.get("REDDIT_USERNAME")
os.environ["REDDIT_PASSWORD"] = userdata.get("REDDIT_PASSWORD")

# Now use os.getenv in your PRAW setup
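
Because the later authentication cells concatenate `os.getenv("REDDIT_USERNAME")` into a string, an unset variable would surface there as a `TypeError` rather than a clear message. A small check like the one below can report missing credentials up front. This is a sketch under assumptions: `missing_credentials` is a hypothetical helper, and the demonstration uses a plain dictionary instead of the real environment so that no secrets are needed to run it.

```python
import os

# The four variable names used in this notebook's PRAW setup.
REQUIRED_VARS = (
    "REDDIT_CLIENT_ID",
    "REDDIT_CLIENT_SECRET",
    "REDDIT_USERNAME",
    "REDDIT_PASSWORD",
)


def missing_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Demonstration with a stand-in mapping instead of real secrets.
demo_env = {"REDDIT_CLIENT_ID": "placeholder-id"}
missing = missing_credentials(demo_env)
if missing:
    print("Missing credentials:", ", ".join(missing))
```

In the notebook itself, calling `missing_credentials()` with no argument would inspect `os.environ` directly.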


**Prompt:**
> Provide a Python snippet using PRAW in Colab to authenticate with Reddit, sourcing credentials from Colab Secrets (REDDIT_CLIENT_ID, REDDIT_CLIENT_SECRET, REDDIT_USERNAME, REDDIT_PASSWORD).



In [None]:
reddit = praw.Reddit(
    client_id=os.getenv("REDDIT_CLIENT_ID"),
    client_secret=os.getenv("REDDIT_CLIENT_SECRET"),
    user_agent="DS105 Project by /u/" + os.getenv("REDDIT_USERNAME"),
    username=os.getenv("REDDIT_USERNAME"),
    password=os.getenv("REDDIT_PASSWORD")
)

print("Read-only?", reddit.read_only)
# Retrieve 5 hot posts from r/python
for post in reddit.subreddit("python").hot(limit=5):
    print(post.title, post.score)


It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.



Read-only? False
Sunday Daily Thread: What's everyone working on this week? 9
Saturday Daily Thread: Resource Request and Sharing! Daily Thread 3
Premier: Instantly Turn Your ASGI App into an API Gateway 8
Building an ERP: ready-made platforms vs custom development 2
New in coding world. Need recommendations of tutorials for python in finance. 2


**Prompt**

> Give me code to Authenticate it using PRAW

In [None]:
reddit = praw.Reddit(
    client_id=os.getenv("REDDIT_CLIENT_ID"),
    client_secret=os.getenv("REDDIT_CLIENT_SECRET"),
    user_agent="ds105:reddit.engagement.analysis:v1 (by /u/" + os.getenv("REDDIT_USERNAME") + ")",
    username=os.getenv("REDDIT_USERNAME"),
    password=os.getenv("REDDIT_PASSWORD"),
)

print("Authenticated:", not reddit.read_only)


Authenticated: True


**Prompt**:

> Write code to fetch 100 hot posts plus subreddit metadata: id, display_name, subscribers, created_utc, public_description


In [None]:
subreddits = ["technology", "Futurology", "science"]
posts = []

for sr in subreddits:
    for post in reddit.subreddit(sr).hot(limit=100):
        posts.append({
            "subreddit": sr,
            "post_id": post.id,
            "title": post.title,
            "author": str(post.author),
            "created_utc": post.created_utc,
            "score": post.score,
            "upvote_ratio": post.upvote_ratio,
            "num_comments": post.num_comments
        })

df_posts = pd.DataFrame(posts)
df_posts.head()
print("Total posts:", len(df_posts))





Total posts: 300


**Prompt**:

> Write code to fetch up to 20 top-level comments per post and normalize into a list.

In [None]:
comments = []
for _, row in df_posts.iterrows():
    submission = reddit.submission(id=row.post_id)
    submission.comment_sort = 'top'
    submission.comments.replace_more(limit=0)
    for c in submission.comments[:20]:
        comments.append({
            "comment_id": c.id,
            "post_id": row.post_id,
            "author": str(c.author),
            "created_utc": c.created_utc,
            "score": c.score,
            "body": c.body
        })
df_comments = pd.DataFrame(comments)
print("Total comments:", len(df_comments))
df_comments.head()



Total comments: 3279


| | comment_id | post_id | author | created_utc | score | body |
|---|---|---|---|---|---|---|
| 0 | mxocdxt | 1laxbzg | Affectionate_Lie5601 | 1749868000.0 | 1789 | bro \n\n3.409M ads blocked for me since instal... |
| 1 | mxoa9b7 | 1laxbzg | the_cat_did_it | 1749867000.0 | 1385 | At some point won't it just be faster to downl... |
| 2 | mxoeuxq | 1laxbzg | VeryGayLopunny | 1749869000.0 | 567 | It's abhorrent on Roku. 60-75-second ad breaks... |
| 3 | mxoc4dt | 1laxbzg | romjpn | 1749868000.0 | 494 | The era of free access to many websites is end... |
| 4 | mxo8nzu | 1laxbzg | EzeakioDarmey | 1749867000.0 | 445 | Meanwhile; Brave users are just watching videos. |
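
The `created_utc` column holds Unix timestamps (seconds since 1970-01-01 UTC), which are hard to read directly. A minimal sketch of converting one to a timezone-aware datetime follows; `utc_from_epoch` is a hypothetical helper name, and the input is the rounded value as displayed in the output above, so the result is only approximate to the real comment time.

```python
from datetime import datetime, timezone


def utc_from_epoch(seconds):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Rounded value copied from the displayed comments output.
example = utc_from_epoch(1749868000.0)
print(example.isoformat())
```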


In [None]:
sub_data = []
for sr in subreddits:
    info = reddit.subreddit(sr)
    sub_data.append({
        "subreddit_id": info.id,
        "name": info.display_name,
        "subscribers": info.subscribers,
        "created_utc": info.created_utc,
        "description": info.public_description
    })

df_subreddits = pd.DataFrame(sub_data)
df_subreddits.head()





| | subreddit_id | name | subscribers | created_utc | description |
|---|---|---|---|---|---|
| 0 | 2qh16 | technology | 19468200 | 1201232000.0 | Subreddit dedicated to the news and discussion... |
| 1 | 2t7no | Futurology | 21574310 | 1323681000.0 | A subreddit devoted to the field of Future(s) ... |
| 2 | mouw | science | 34144890 | 1161180000.0 | This community is a place to share and discuss... |


**Prompt:**

> This is how my df () look like. Now i want to save it to sqlite. As an expert data scienstist gimme well commented code

In [None]:
# Directory that will hold the SQLite file
database_dir = "data"

# Create the directory if it doesn't already exist
os.makedirs(database_dir, exist_ok=True)

# Connect to the database file inside that directory
conn = sqlite3.connect(os.path.join(database_dir, "database.db"))

# Write each DataFrame to its own table, replacing any existing table of the same name
df_subreddits.to_sql("SUBREDDITS", conn, if_exists="replace", index=False)
df_posts.to_sql("POSTS", conn, if_exists="replace", index=False)
df_comments.to_sql("COMMENTS", conn, if_exists="replace", index=False)
conn.close()

**Prompt:**
> I want to confirm if the data has actually been stored in the db. Gimme code to verify that

In [None]:
# Connect to the database to verify the data
conn = sqlite3.connect(os.path.join(database_dir, "database.db"))

# Verify the tables exist and count rows
try:
    cursor = conn.cursor()

    # Check SUBREDDITS table
    cursor.execute("SELECT COUNT(*) FROM SUBREDDITS;")
    count_subreddits = cursor.fetchone()[0]
    print(f"Number of rows in SUBREDDITS table: {count_subreddits}")

    # Check POSTS table
    cursor.execute("SELECT COUNT(*) FROM POSTS;")
    count_posts = cursor.fetchone()[0]
    print(f"Number of rows in POSTS table: {count_posts}")

    # Check COMMENTS table
    cursor.execute("SELECT COUNT(*) FROM COMMENTS;")
    count_comments = cursor.fetchone()[0]
    print(f"Number of rows in COMMENTS table: {count_comments}")

except sqlite3.Error as e:
    print(f"Database error: {e}")
finally:
    conn.close()


Number of rows in SUBREDDITS table: 3
Number of rows in POSTS table: 300
Number of rows in COMMENTS table: 3279
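
Row counts confirm that the tables were written, but not that they link up. A further check, sketched below, joins comments to posts on `post_id` to look for comments whose post is absent and to count stored comments per subreddit. To stay self-contained it builds a tiny in-memory database with placeholder rows; the table and column names follow this notebook, but the rows are invented for illustration, and the same queries would need to be pointed at `data/database.db` to check the real data.

```python
import sqlite3

# In-memory stand-in for the project database, with placeholder rows only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE POSTS (post_id TEXT, subreddit TEXT)")
conn.execute("CREATE TABLE COMMENTS (comment_id TEXT, post_id TEXT)")
conn.executemany(
    "INSERT INTO POSTS VALUES (?, ?)",
    [("p1", "technology"), ("p2", "science")],
)
conn.executemany(
    "INSERT INTO COMMENTS VALUES (?, ?)",
    [("c1", "p1"), ("c2", "p1"), ("c3", "p2"), ("c4", "no_such_post")],
)

# Comments whose post_id has no matching row in POSTS.
orphan_count = conn.execute(
    """
    SELECT COUNT(*)
    FROM COMMENTS AS c
    LEFT JOIN POSTS AS p ON p.post_id = c.post_id
    WHERE p.post_id IS NULL
    """
).fetchone()[0]

# Number of stored comments per subreddit.
comments_per_subreddit = conn.execute(
    """
    SELECT p.subreddit, COUNT(c.comment_id)
    FROM POSTS AS p
    LEFT JOIN COMMENTS AS c ON c.post_id = p.post_id
    GROUP BY p.subreddit
    ORDER BY p.subreddit
    """
).fetchall()
conn.close()

print("Orphan comments:", orphan_count)
print("Comments per subreddit:", comments_per_subreddit)
```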


**Prompt**:

> Give me a brief summary of what has been done in the project so far , in this data gathering file.

**Summary**

*   Implemented API calls to fetch posts + comments.
*   Converted JSON → Pandas DataFrames.
*   Populated SQLite database with 3 tables.
*   Fully documented AI involvement & commit rationale.

**Final Note on AI Use**

Throughout this notebook, I documented my prompts alongside reflections and comments to show how iterative the process was.

The early context-setting prompts were particularly useful: giving the tool more background led to noticeably better responses.

It is therefore important to know what you want to achieve with the tool and to analyse its output critically. Using AI mindlessly not only risks accepting hallucinated output but also does little for one's own learning.