# Step 1: Data Crawling
In this step, we will scrape data from Reddit using the [Pushshift API](https://github.com/pushshift/api), which allows us to access Reddit posts programmatically. We will extract at least 150 posts from a chosen subreddit (e.g. r/technology), capturing fields such as `id`, `title`, `author`, `score`, `num_comments`, `created_utc`, and `selftext`.


In [1]:
# Install dependencies (run only if not installed; matplotlib is needed for Step 3)
#!pip install requests pandas matplotlib

import requests
import pandas as pd

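# Parameters
subreddit = "technology"
size = 200  # Number of posts to request (the API may return fewer per call)

# Fetch data from Pushshift API (Reddit)
url = "https://api.pushshift.io/reddit/search/submission/"
response = requests.get(url, params={"subreddit": subreddit, "size": size}, timeout=30)
response.raise_for_status()  # Fail early on HTTP errors instead of parsing an error body
data = response.json()["data"]

# Extract relevant columns (reindex keeps a column as NaN if the API omits that field)
df = pd.DataFrame(data)
columns_of_interest = ['id', 'title', 'author', 'score', 'num_comments', 'created_utc', 'selftext']
df = df.reindex(columns=columns_of_interest)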

# Save to CSV
df.to_csv("reddit_technology_posts.csv", index=False)
print(f"Data scraped and saved to 'reddit_technology_posts.csv'. Number of rows: {df.shape[0]}")
df.head(3)



## Data Source and Collection
- **Source**: [Pushshift Reddit API](https://github.com/pushshift/api), which provides free programmatic access to Reddit posts.
- **Variables extracted**:
    - `id`: Unique post identifier.
    - `title`: The title of the Reddit post.
    - `author`: The user who posted.
    - `score`: Upvotes minus downvotes.
    - `num_comments`: Number of comments.
    - `created_utc`: Unix timestamp (seconds, UTC) of post creation.
    - `selftext`: The main text of the post.
- **Collection Method**: Data was collected via web API call, returned as JSON and converted to a pandas DataFrame.
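
A single call may return fewer posts than requested, so reaching a target such as 150 posts can require several calls. The sketch below shows one way to page backwards in time. It assumes the endpoint accepts a `before` parameter holding a Unix timestamp and returns posts newest first, which should be checked against the API documentation. `collect_posts` and `fetch_page` are hypothetical names introduced here; `fetch_page` is passed in as an argument so the paging logic can be tried without network access.

```python
from typing import Callable, Dict, List, Optional

Post = Dict[str, object]


def collect_posts(
    fetch_page: Callable[[Optional[int]], List[Post]],
    target: int = 150,
) -> List[Post]:
    """Call fetch_page repeatedly until `target` unique posts are collected.

    fetch_page(before) is assumed to return posts created strictly before the
    Unix timestamp `before` (or the newest posts when `before` is None),
    ordered newest first.
    """
    posts: List[Post] = []
    seen_ids = set()
    before: Optional[int] = None

    while len(posts) < target:
        page = fetch_page(before)
        if not page:
            break  # No more data available

        added = 0
        for post in page:
            if post["id"] not in seen_ids:
                seen_ids.add(post["id"])
                posts.append(post)
                added += 1
        if added == 0:
            break  # Page contained only duplicates; stop to avoid looping forever

        # Continue from the oldest timestamp seen on this page
        before = min(int(post["created_utc"]) for post in page)

    return posts[:target]
```

In the notebook, `fetch_page` would wrap the `requests.get` call from the cell above, adding `before` to the query parameters when it is not `None`, and the returned list would be passed to `pd.DataFrame` as before.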


# Step 2: Data Preparation and Cleaning
We will now load our CSV, check for missing data, remove implausible values, and enrich our dataset by converting timestamps to readable datetimes and adding a text-length column.


In [None]:
# Load the dataset
df = pd.read_csv("reddit_technology_posts.csv")

# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())

# Fill missing selftext with empty string, drop rows with missing critical fields
df['selftext'] = df['selftext'].fillna('')
df = df.dropna(subset=['id', 'title', 'author', 'score', 'num_comments', 'created_utc'])

# Convert created_utc to datetime
df['created_utc'] = pd.to_datetime(df['created_utc'], unit='s')

# Add a new column: text_length (number of characters in selftext)
df['text_length'] = df['selftext'].astype(str).str.len()

# Remove rows with implausible values (negative scores or comment counts)
df = df[(df['score'] >= 0) & (df['num_comments'] >= 0)]

# Show a summary
df.describe()


## Data Preparation Steps
- **Missing Data**: Filled missing `selftext` with an empty string. Dropped rows missing any critical field (`id`, `title`, `author`, `score`, `num_comments`, `created_utc`).
- **Date Parsing**: Converted `created_utc` from a Unix timestamp (seconds) to a pandas datetime.
- **Feature Creation**: Added a `text_length` column holding the number of characters in the post's text.
- **Outliers**: Removed any rows with negative scores or comment counts, which are not plausible in Reddit data.
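
To make the effect of each step visible, the preparation steps above can be sketched as a single function and run on a handful of rows. The helper name `clean_posts`, the constant `CRITICAL`, and the rows in `toy` are made up for illustration and are not part of the collected dataset.

```python
import pandas as pd

CRITICAL = ["id", "title", "author", "score", "num_comments", "created_utc"]


def clean_posts(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the four preparation steps and return a new DataFrame."""
    df = raw.copy()
    # Missing data: empty string for selftext, drop rows missing critical fields
    df["selftext"] = df["selftext"].fillna("")
    df = df.dropna(subset=CRITICAL)
    # Date parsing: Unix seconds -> pandas datetime
    df["created_utc"] = pd.to_datetime(df["created_utc"], unit="s")
    # Feature creation: number of characters in the post text
    df["text_length"] = df["selftext"].astype(str).str.len()
    # Implausible values: keep only non-negative scores and comment counts
    df = df[(df["score"] >= 0) & (df["num_comments"] >= 0)]
    return df.reset_index(drop=True)


# Made-up rows for illustration only (not real Reddit data)
toy = pd.DataFrame({
    "id": ["a1", "a2", "a3", "a4"],
    "title": ["First", "Second", None, "Fourth"],
    "author": ["u1", "u2", "u3", "u4"],
    "score": [10, -3, 5, 0],
    "num_comments": [2, 1, 0, 7],
    "created_utc": [1700000000, 1700000060, 1700000120, 1700000180],
    "selftext": ["hello", None, "text", None],
})

cleaned = clean_posts(toy)
print(cleaned[["id", "created_utc", "text_length"]])
```

Of the four toy rows, `a2` is removed for its negative score and `a3` for its missing title, leaving `a1` and `a4`.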


# Step 3: Exploratory Data Analysis (EDA)
We will examine statistics, distributions, and relationships in our data, visualising where appropriate.


In [None]:
import matplotlib.pyplot as plt

# 1. Basic statistics
print("Mean score:", df['score'].mean())
print("Mean comments:", df['num_comments'].mean())
print("Text length range:", df['text_length'].min(), "-", df['text_length'].max())

# 2. Histograms
plt.figure(figsize=(10,4))
plt.hist(df['score'], bins=30)
plt.title('Distribution of Post Scores')
plt.xlabel('Score')
plt.ylabel('Frequency')
plt.show()

plt.figure(figsize=(10,4))
plt.hist(df['num_comments'], bins=30)
plt.title('Distribution of Number of Comments')
plt.xlabel('Number of Comments')
plt.ylabel('Frequency')
plt.show()

plt.figure(figsize=(10,4))
plt.hist(df['text_length'], bins=30)
plt.title('Distribution of Post Text Length')
plt.xlabel('Text Length (characters)')
plt.ylabel('Frequency')
plt.show()

# 3. Relationship between score and comments
plt.figure(figsize=(8,6))
plt.scatter(df['score'], df['num_comments'], alpha=0.5)
plt.title('Score vs. Number of Comments')
plt.xlabel('Score')
plt.ylabel('Number of Comments')
plt.show()
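
The scatter plot shows the relationship between score and comments visually; a correlation coefficient can put a number on it. The sketch below is one possible complement, not part of the original analysis. It reports the Pearson correlation alongside a rank-based (Spearman) correlation, which is less sensitive to a few very large values. The function name `score_comment_correlation` and the values in `toy` are made up for illustration.

```python
import pandas as pd


def score_comment_correlation(df: pd.DataFrame) -> dict:
    """Return Pearson and rank-based (Spearman) correlation of score and num_comments."""
    pearson = df["score"].corr(df["num_comments"])
    # Spearman correlation equals the Pearson correlation computed on ranks
    spearman = df["score"].rank().corr(df["num_comments"].rank())
    return {"pearson": pearson, "spearman": spearman}


# Made-up values for illustration only (not real Reddit data)
toy = pd.DataFrame({
    "score": [1, 5, 20, 100, 400],
    "num_comments": [0, 2, 3, 30, 35],
})
print(score_comment_correlation(toy))
```

On the cleaned dataset, the same function would be called as `score_comment_correlation(df)`.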
