# Reddit Data Collection

The script collects Reddit posts from various AI-related subreddits using the Reddit API.

The data is gathered from multiple subreddits that focus on artificial intelligence, deep learning, machine learning, and related technological advancements.

The collection process retrieves posts sorted by 'new' to ensure the dataset contains the latest discussions.
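
Because the listing is newest-first, a collector can stop as soon as it meets a post older than a chosen cutoff. The sketch below illustrates that idea with plain dictionaries in place of Reddit API objects; the `take_recent` helper and the sample posts are hypothetical and exist only for illustration.

```python
from datetime import datetime, timedelta, timezone


def take_recent(posts_newest_first, cutoff):
    """Collect posts from a newest-first iterable until one is older than cutoff."""
    recent = []
    for post in posts_newest_first:
        if post["created"] < cutoff:
            break  # in a newest-first stream, everything after this is older still
        recent.append(post)
    return recent


# Made-up sample data, ordered newest first.
now = datetime(2025, 2, 22, tzinfo=timezone.utc)
sample = [
    {"title": "a", "created": now - timedelta(days=1)},
    {"title": "b", "created": now - timedelta(days=5)},
    {"title": "c", "created": now - timedelta(days=40)},
]

recent = take_recent(sample, cutoff=now - timedelta(days=30))
print([post["title"] for post in recent])
```

Stopping at the first too-old post avoids walking the rest of the listing, which only works because the stream is sorted by creation time.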

**Data Collection Process:**
The data is extracted from the following subreddits:

* MachineLearning
* ArtificialInteligence
* artificial
* deeplearning
* DeepLearningPapers
* datascience
* AIethics
* AGI
* compling
* neuralnetworks
* learnmachinelearning
* CharacterAI
* singularity
* AI_Agents
* technology
* ainews
* AItoolsCatalog
* AI_Tools_Land
* tech
* Futurology
* robotics
* computerscience
* programming
* technews
* Automate
* Innovation
* techsupport
* AskTechnology

The script fetches posts from these subreddits using the Reddit API.

**Metadata Collected (CSV Columns & Their Meaning):**
* title – The title of the Reddit post.
* url – The post's URL: the Reddit permalink for text posts, or the linked external URL for link posts.
* score – The net score (upvotes minus downvotes) received by the post.
* num_awards – The total number of awards given to the post.
* created_utc – The timestamp of when the post was created, formatted as YYYY-MM-DD HH:MM:SS.
* num_comments – The total number of comments on the post.
* subreddit – The subreddit from which the post was collected.
* text – The body of the post, if available (otherwise, it remains empty for link-based posts).

This dataset is designed to capture trends in AI discussions, monitor engagement, and analyze public perception over time.
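
As a rough illustration of how these columns could feed a trend analysis, the sketch below loads a CSV with the same header and counts posts and comments per subreddit per month. The rows are made-up stand-ins held in memory, not collected data, and the `monthly` aggregation is only one possible way to slice the dataset.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the collected CSV (made-up rows, same columns).
sample_csv = io.StringIO(
    "title,url,score,num_awards,created_utc,num_comments,subreddit,text\n"
    "Post A,https://example.com/a,10,0,2025-02-20 12:00:00,3,MachineLearning,\n"
    "Post B,https://example.com/b,5,1,2025-02-21 08:30:00,7,MachineLearning,body\n"
    "Post C,https://example.com/c,2,0,2025-01-15 09:00:00,1,artificial,body\n"
)

# created_utc is stored as 'YYYY-MM-DD HH:MM:SS', so it parses directly to datetimes.
df = pd.read_csv(sample_csv, parse_dates=["created_utc"])

# Number of posts and total comments per subreddit per calendar month.
monthly = (
    df.assign(month=df["created_utc"].dt.to_period("M"))
    .groupby(["subreddit", "month"])
    .agg(posts=("title", "count"), comments=("num_comments", "sum"))
    .reset_index()
)
print(monthly)
```

To run this against the real file, replace `sample_csv` with the path of the CSV written by the collection script.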

In [None]:
!pip install asyncpraw
!pip install nest_asyncio


Collecting asyncpraw
  Downloading asyncpraw-7.8.1-py3-none-any.whl.metadata (9.0 kB)
Collecting aiofiles (from asyncpraw)
  Downloading aiofiles-24.1.0-py3-none-any.whl.metadata (10 kB)
Collecting aiosqlite<=0.17.0 (from asyncpraw)
  Downloading aiosqlite-0.17.0-py3-none-any.whl.metadata (4.1 kB)
Collecting asyncprawcore<3,>=2.4 (from asyncpraw)
  Downloading asyncprawcore-2.4.0-py3-none-any.whl.metadata (5.5 kB)
Collecting update_checker>=0.18 (from asyncpraw)
  Downloading update_checker-0.18.0-py3-none-any.whl.metadata (2.3 kB)
Downloading asyncpraw-7.8.1-py3-none-any.whl (196 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 196.4/196.4 kB 9.6 MB/s eta 0:00:00
Downloading aiosqlite-0.17.0-py3-none-any.whl (15 kB)
Downloading asyncprawcore-2.4.0-py3-none-any.whl (19 kB)
Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Downloading aiofiles-24.1.0-py3-none-any.whl (15 kB)
Installing collected packages: aiosqlite, aiofiles, update

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import asyncpraw
import pandas as pd
import asyncio
import nest_asyncio
from datetime import datetime, timedelta
import os

nest_asyncio.apply()

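# Reddit API credentials: read from environment variables instead of hardcoding secrets.
# The variable names below are a suggested convention; set them before running this cell.
CLIENT_ID = os.environ["REDDIT_CLIENT_ID"]
CLIENT_SECRET = os.environ["REDDIT_CLIENT_SECRET"]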
USER_AGENT = 'python:collect-ai-data:v1.0 (by u/MissShik)'

reddit = asyncpraw.Reddit(
    client_id=CLIENT_ID,
    client_secret=CLIENT_SECRET,
    user_agent=USER_AGENT
)

# subreddits list - extract posts from these subreddits only
SUBREDDITS = [
    "MachineLearning", "ArtificialInteligence",
    "artificial", "deeplearning",
    "DeepLearningPapers", "datascience",
    "AIethics", "AGI",
    "compling", "neuralnetworks",
    "learnmachinelearning", "CharacterAI",
    "singularity", "AI_Agents",
    "technology", "ainews",
    "AItoolsCatalog", "AI_Tools_Land",
    "tech","Futurology",
    "robotics","computerscience",
    "programming","technews",
    "Automate","Innovation",
    "techsupport","AskTechnology"
]

# collect data retrieval stats
class CollectionStats:
    def __init__(self):
        self.total_queries = 0
        self.total_posts = 0
        self.start_time = datetime.now()
        self.oldest_post_date = None

    def update(self, new_posts):
        self.total_queries += 1
        self.total_posts += len(new_posts)

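        if new_posts:
            # Track the oldest post seen across all batches, not just the latest batch
            batch_oldest = min(post['created_utc'] for post in new_posts)
            if self.oldest_post_date is None or batch_oldest < self.oldest_post_date:
                self.oldest_post_date = batch_oldest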

    def print_stats(self):
        duration = (datetime.now() - self.start_time).total_seconds() / 60
        print("\n=== Collection Statistics ===")
        print(f"Duration: {duration:.2f} minutes")
        print(f"Total queries made: {self.total_queries}")
        print(f"Total posts collected: {self.total_posts}")
        print(f"Oldest post date: {self.oldest_post_date}")
        print(f"Average collection rate: {self.total_queries/duration:.2f} queries per minute")
        print("===========================\n")

unique_post_urls = set()

async def fetch_posts(subreddit, stats, days_back=None):
    post_data = []
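    # Work in naive UTC so the cutoff and the created_utc column share one time base,
    # regardless of the machine's local timezone
    utc_now = pd.Timestamp.now(tz="UTC").tz_localize(None)
    cutoff_date = utc_now - timedelta(days=days_back) if days_back else None

    try:
        post_count = 0
        async for post in subreddit.new(limit=None):  # Fetch posts sorted by 'new'
            post_date = pd.to_datetime(post.created_utc, unit="s")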

            if cutoff_date and post_date < cutoff_date:
                print(f"Reached posts older than {days_back} days")
                break

            if post.url in unique_post_urls:
                continue

            unique_post_urls.add(post.url)

            num_awards = post.total_awards_received

            post_data.append({
                'title': post.title,
                'url': post.url,
                'score': post.score,  # Net score (upvotes - downvotes); the API does not expose the actual upvote and downvote counts
                'num_awards': num_awards,
                'created_utc': post_date,
                'num_comments': post.num_comments,
                'subreddit': post.subreddit.display_name,
                'text': post.selftext if hasattr(post, 'selftext') else ''
            })
            post_count += 1

        stats.update(post_data)  # post_data holds exactly the posts gathered in this call
        print(f"Found {post_count} posts in r/{subreddit.display_name}")

    except Exception as e:
        print(f"Error fetching posts: {e}")
        await asyncio.sleep(2)

    return post_data

async def save_to_csv(data, filename):
    if data:
        df = pd.DataFrame(data)
        df = df.sort_values('created_utc', ascending=False)
        df['created_utc'] = df['created_utc'].dt.strftime('%Y-%m-%d %H:%M:%S')

        save_path = "/content/drive/My Drive/AITrendAnalysis-project"
        os.makedirs(save_path, exist_ok=True)
        full_path = os.path.join(save_path, filename)

        df.to_csv(full_path, index=False, mode='w', header=True)
        print(f"Saved {len(df)} rows to {full_path}")

async def collect_data(duration_hours=1, save_threshold=50, days_back=None):
    end_time = datetime.now() + timedelta(hours=duration_hours)
    all_posts = []
    requests_in_window = 0
    window_start = datetime.now()
    stats = CollectionStats()

    print(f"\nStarting collection at: {datetime.now()}")
    print(f"Collection period: {duration_hours} hours")
    print(f"Collecting posts from the past {days_back if days_back else 'all'} days")

    for subreddit_name in SUBREDDITS:
        print(f"\nFetching posts from subreddit: r/{subreddit_name}")
        subreddit = await reddit.subreddit(subreddit_name)

        while datetime.now() < end_time:
            # Handle request limits (due to Reddit API restrictions)
            try:
                if (datetime.now() - window_start).total_seconds() >= 600:
                    requests_in_window = 0
                    window_start = datetime.now()

                if requests_in_window >= 95:
                    wait_time = 600 - (datetime.now() - window_start).total_seconds()
                    if wait_time > 0:
                        print(f"Approaching rate limit. Waiting {wait_time:.2f} seconds...")
                        await asyncio.sleep(wait_time)
                        requests_in_window = 0
                        window_start = datetime.now()

                new_posts = await fetch_posts(subreddit, stats, days_back)
                if not new_posts:
                    print(f"No new posts found in r/{subreddit_name}. Moving to the next subreddit.")
                    break

                all_posts.extend(new_posts)
                requests_in_window += 1

                if len(all_posts) >= save_threshold:
                    await save_to_csv(all_posts, filename="reddit-data-02-25_3650_days.csv")

                if (datetime.now() - stats.start_time).total_seconds() % 300 < 1:
                    stats.print_stats()

                await asyncio.sleep(1)

            except Exception as e:
                print(f"Error during collection: {e}")
                await asyncio.sleep(5)
                continue

    await save_to_csv(all_posts, filename="reddit-data-02-25_3650_days.csv")
    stats.print_stats()
    print(f"\nCollection completed at: {datetime.now()}")
    print(f"Total posts collected: {len(all_posts)}")

async def main():
    collection_hours = 12
    save_threshold = 500
    days_back = 3650  # Look back about 10 years; set to None to collect all available posts

    try:
        await collect_data(
            duration_hours=collection_hours,
            save_threshold=save_threshold,
            days_back=days_back
        )
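    except KeyboardInterrupt:
        # all_posts lives inside collect_data, so only rows already written to the CSV survive
        print("\nCollection interrupted by user. Rows already saved to the CSV are kept.")
    except Exception as e:
        print(f"\nUnexpected error: {e}")
    finally:
        await reddit.close()  # Release the underlying HTTP session
        print("Script finished.")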

loop = asyncio.get_event_loop()
loop.run_until_complete(main())


Starting collection at: 2025-02-22 16:10:21.912900
Collection period: 12 hours
Collecting posts from the past 3650 days

Fetching posts from subreddit: r/MachineLearning
Found 881 posts in r/MachineLearning
Saved 881 rows to /content/drive/My Drive/AITrendAnalysis-project/reddit-data-02-25_3650_days.csv
Found 0 posts in r/MachineLearning
No new posts found in r/MachineLearning. Moving to the next subreddit.

Fetching posts from subreddit: r/ArtificialInteligence
Found 870 posts in r/ArtificialInteligence
Saved 1751 rows to /content/drive/My Drive/AITrendAnalysis-project/reddit-data-02-25_3650_days.csv
Found 0 posts in r/ArtificialInteligence
No new posts found in r/ArtificialInteligence. Moving to the next subreddit.

Fetching posts from subreddit: r/artificial
Found 922 posts in r/artificial
Saved 2673 rows to /content/drive/My Drive/AITrendAnalysis-project/reddit-data-02-25_3650_days.csv
Found 0 posts in r/artificial
No new posts found in r/artificial. Moving to the next subreddit.
