# Pup Inflation: Analysing Tweets

This notebook analyzes tweets from the `@dog_rates` Twitter account to investigate potential "grade inflation" in dog cuteness ratings over time.

In [1]:
# Cell 1: Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
from scipy.stats import linregress
%matplotlib inline

## 1. Load Data
Load the data from `dog_rates_tweets.csv`, which is assumed to be in the same directory as the notebook. If the file is not found, the cell prints a message and re-raises the error, since every later cell depends on `df_tweets`.

In [None]:
# Cell 2: Load Data
try:
    df_tweets = pd.read_csv('dog_rates_tweets.csv')
    print("Successfully loaded dog_rates_tweets.csv")
except FileNotFoundError:
    print("Error: dog_rates_tweets.csv not found. Please ensure the file is in the correct directory.")
    raise  # Stop here: later cells need df_tweets, and no fallback data is defined

# Display initial data structure (optional)
print("\nInitial data sample (first 5 rows):")
print(df_tweets.head())
print("\nInitial data info:")
df_tweets.info()

## 2. Extract Ratings
Define a function to find and extract numeric ratings (e.g., "12/10") from tweet text using regular expressions. Apply this function to create a new 'rating' column. Tweets without a valid rating are excluded.
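
As a quick illustration of what the pattern accepts, the sketch below runs the same regular expression on a few made-up tweet strings. The sample texts and the helper name `first_rating` are hypothetical and not taken from the dataset. Because `re.search` returns the leftmost match, a tweet containing several `n/10` fragments yields only the first one.

```python
import re

# Same pattern as in the extraction cell: an integer or decimal followed by '/10'
RATING_PATTERN = re.compile(r'(\d+(\.\d+)?)/10')

def first_rating(text):
    """Return the first 'n/10' rating in text as a float, or None if absent."""
    match = RATING_PATTERN.search(text)
    return float(match.group(1)) if match else None

# Hypothetical sample tweets, invented for illustration
samples = [
    "This is Bella. She is a very good girl. 12/10",
    "Solid effort from this pupper, 13.5/10 would pet",
    "No rating in this one, just a photo",
    "First dog 11/10, second dog 9/10",
]
ratings = [first_rating(s) for s in samples]
print(ratings)  # → [12.0, 13.5, None, 11.0]
```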

In [None]:
# Cell 3: Extract Ratings
def extract_rating(tweet_text):
    """Extracts a numeric rating from a tweet string."""
    if pd.isna(tweet_text):
        return None
    # Regex to find patterns like '12/10' or '13.5/10'
    match = re.search(r'(\d+(\.\d+)?)/10', str(tweet_text))
    if match:
        try:
            rating = float(match.group(1)) # group(1) captures the number before /10
            return rating
        except ValueError:
            return None # Should not happen with this regex but good practice
    return None

# Apply the function to the 'text' column
df_tweets['rating'] = df_tweets['text'].apply(extract_rating)

# Exclude tweets that don't contain a rating (where 'rating' is NaN)
# Use .copy() to avoid SettingWithCopyWarning on the slice
df_tweets_rated = df_tweets[df_tweets['rating'].notnull()].copy()

print("\nData after extracting ratings (sample of tweets with ratings):")
print(df_tweets_rated[['text', 'rating']].head())

## 3. Remove Outliers
Exclude ratings greater than 25/10. Values that large are likely jokes or outliers, and they would skew the regression.
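
The filter is an ordinary boolean mask. A minimal sketch on invented ratings (the `demo` DataFrame and its values are hypothetical) shows which rows survive the cutoff:

```python
import pandas as pd

# Hypothetical ratings, invented for illustration; 420 and 1776 stand in for joke ratings
demo = pd.DataFrame({'rating': [12.0, 13.0, 420.0, 10.0, 1776.0, 25.0]})

MAX_RATING = 25
keep = demo['rating'] <= MAX_RATING   # Boolean mask, True where the rating is plausible
demo_cleaned = demo[keep].copy()      # .copy() gives an independent DataFrame

print(demo_cleaned['rating'].tolist())
print(f"Dropped {len(demo) - len(demo_cleaned)} of {len(demo)} rows")
```

Note that the comparison is `<=`, so a rating of exactly 25 is kept.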

In [None]:
# Cell 4: Remove Outliers
# Define a reasonable upper limit for ratings to exclude outliers
MAX_RATING = 25
df_tweets_cleaned = df_tweets_rated[df_tweets_rated['rating'] <= MAX_RATING].copy()

print(f"\nNumber of tweets before outlier removal: {len(df_tweets_rated)}")
print(f"Number of tweets after removing ratings > {MAX_RATING}: {len(df_tweets_cleaned)}")
print("\nData after removing outliers (sample):")
print(df_tweets_cleaned[['text', 'rating']].head())

## 4. Convert 'created_at' to Datetime
Ensure the 'created_at' column is in datetime format for time-based plotting and analysis.
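
The conversion can happen either after loading or during loading. The sketch below compares the two on a tiny in-memory CSV; the rows are invented for illustration, and only the `created_at` and `text` column names are taken from this notebook.

```python
import io
import pandas as pd

# Hypothetical two-row CSV, invented for illustration
csv_text = (
    "id,created_at,text\n"
    "1,2018-05-01 12:30:00,Good dog 12/10\n"
    "2,2018-05-02 08:15:00,Great dog 13/10\n"
)

# Option 1: load as strings, then convert the column
after = pd.read_csv(io.StringIO(csv_text))
after['created_at'] = pd.to_datetime(after['created_at'])

# Option 2: let read_csv parse the column while loading
during = pd.read_csv(io.StringIO(csv_text), parse_dates=['created_at'])

# Both columns should now hold datetimes with identical values
both_datetime = (
    pd.api.types.is_datetime64_any_dtype(after['created_at'])
    and pd.api.types.is_datetime64_any_dtype(during['created_at'])
)
same_values = bool((after['created_at'] == during['created_at']).all())
print(both_datetime, same_values)
```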

In [None]:
# Cell 5: Convert 'created_at' to datetime
# This conversion could also be done at load time by passing parse_dates=['created_at'] to read_csv.
# Since it was not done there, it has to happen here before any time-based analysis.
df_tweets_cleaned['created_at'] = pd.to_datetime(df_tweets_cleaned['created_at'])

print("\nData info after datetime conversion:")
df_tweets_cleaned.info()
print("\nSample 'created_at' values after conversion:")
print(df_tweets_cleaned['created_at'].head())

## 5. Initial Scatter Plot
Create a scatter plot of date versus rating to visually inspect the data distribution over time.

In [None]:
# Cell 6: Initial Scatter Plot (Date vs Rating)
plt.figure(figsize=(10, 6))
if not df_tweets_cleaned.empty:
    plt.plot(df_tweets_cleaned['created_at'], df_tweets_cleaned['rating'], 'o', alpha=0.5, markersize=5)
else:
    plt.text(0.5, 0.5, 'No data to plot', ha='center', va='center')
plt.title('Dog Ratings Over Time (Initial Plot)')
plt.xlabel('Date')
plt.ylabel('Rating (/10)')
plt.xticks(rotation=25) # Rotate x-axis labels for better readability
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout() # Adjust layout to prevent labels from overlapping
plt.show()

## 6. Prepare for Linear Regression
Convert 'created_at' datetime objects to numeric timestamps (seconds since epoch) because `scipy.stats.linregress` requires numerical input.
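
To make the unit concrete, here is a small standard-library sketch. The tweet times are invented, and attaching UTC explicitly is an assumption made for the example: Python's own `datetime.timestamp()` interprets a datetime without timezone information as local time, so setting the timezone keeps the result unambiguous.

```python
from datetime import datetime, timezone

# Hypothetical tweet time, invented for illustration; tzinfo is set explicitly to UTC
tweet_time = datetime(2018, 5, 1, 12, 30, 0, tzinfo=timezone.utc)
seconds = tweet_time.timestamp()   # Seconds since 1970-01-01 00:00:00 UTC, as a float
print(seconds)

# Converting back recovers the original datetime
restored = datetime.fromtimestamp(seconds, tz=timezone.utc)
print(restored == tweet_time)

# A tweet exactly one day later is 86400 seconds further along the axis
next_day = datetime(2018, 5, 2, 12, 30, 0, tzinfo=timezone.utc)
gap = next_day.timestamp() - seconds
print(gap)
```

Because the x-axis is measured in seconds, a slope fitted against it is expressed in rating points per second and will be a very small number.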

In [None]:
# Cell 7: Prepare for Linear Regression (Timestamp)
# Define a function to convert datetime to timestamp
def to_timestamp(dt_object):
    return dt_object.timestamp()

if not df_tweets_cleaned.empty:
    df_tweets_cleaned['timestamp'] = df_tweets_cleaned['created_at'].apply(to_timestamp)
    print("\nData with 'timestamp' column (sample):")
    print(df_tweets_cleaned[['created_at', 'timestamp', 'rating']].head())
else:
    # If DataFrame is empty, create an empty 'timestamp' column to avoid errors later
    df_tweets_cleaned['timestamp'] = pd.Series(dtype='float64') 
    print("\nDataFrame is empty, 'timestamp' column created as empty.")

## 7. Perform Linear Regression
Use `scipy.stats.linregress` to find the best-fit line for ratings over time. Calculate the predicted ratings based on this model and store them. Display the slope and intercept, and a sample of the data with predictions, mimicking the example output.
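
`linregress` computes a linear least-squares fit. For intuition, the sketch below evaluates the textbook closed-form formulas (slope = covariance / variance, intercept = mean of y minus slope times mean of x) on invented data. It is an illustration of the technique, not scipy's actual implementation, and `least_squares_line` is a hypothetical helper name.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum of cross-deviations / sum of squared x-deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: timestamps one day apart, ratings rising by 0.5 per day
SECONDS_PER_DAY = 86400
timestamps = [day * SECONDS_PER_DAY for day in range(5)]
ratings = [10.0, 10.5, 11.0, 11.5, 12.0]

slope, intercept = least_squares_line(timestamps, ratings)
# Same prediction formula as in the regression cell: y = slope * x + intercept
predictions = [slope * t + intercept for t in timestamps]

print(slope * SECONDS_PER_DAY)  # Slope converted to rating points per day
print(intercept)
```

Multiplying the per-second slope by the number of seconds in a day (or a year) gives a figure that is easier to interpret than the raw value.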

In [None]:
# Cell 8: Perform Linear Regression
# Ensure there's enough data to fit (linregress needs at least 2 points)
if not df_tweets_cleaned.empty and len(df_tweets_cleaned['timestamp'].dropna()) >= 2:
    # Drop NaNs in timestamp or rating if any slipped through, for linregress
    regression_data = df_tweets_cleaned[['timestamp', 'rating']].dropna()
    
    if len(regression_data) >= 2:
        fit = linregress(regression_data['timestamp'], regression_data['rating'])
        
        # Add prediction to DataFrame (as shown in the example output image)
        # This calculation is: y = slope * x + intercept
        df_tweets_cleaned['prediction'] = df_tweets_cleaned['timestamp'] * fit.slope + fit.intercept
        
        # Display the subset of columns as in the example image 'Out[9]'
        print("Data with predictions (first 5 rows):")
        columns_to_show = ['created_at', 'text', 'rating', 'timestamp', 'prediction']
        if 'id' in df_tweets_cleaned.columns: # Check if 'id' column exists
            columns_to_show.insert(0, 'id')
        print(df_tweets_cleaned[columns_to_show].head())
        print("\n---") # Separator similar to the prompt's image style
        
        # Display slope and intercept as in 'Out[10]'
        print("fit.slope, fit.intercept") # Mimicking the In[10] text
        print(f"Out[10]: ({fit.slope}, {fit.intercept})") # Mimicking the Out[10] text
    else:
        print("Not enough valid data points (after dropping NaNs) to perform linear regression.")
        # Create placeholder fit and prediction column if regression can't be run
        class PlaceholderFit:
            slope = 0
            intercept = np.nanmean(df_tweets_cleaned['rating']) if not df_tweets_cleaned.empty else 10
        fit = PlaceholderFit()
        df_tweets_cleaned['prediction'] = fit.intercept
else:
    print("Not enough data or timestamps to perform linear regression.")
    # Create placeholder fit and prediction column if DataFrame is too small or empty
    class PlaceholderFit:
        slope = 0
        intercept = np.nanmean(df_tweets_cleaned['rating']) if not df_tweets_cleaned.empty else 10
    fit = PlaceholderFit()
    if not df_tweets_cleaned.empty:
        df_tweets_cleaned['prediction'] = fit.intercept  # Flat line at the mean rating
    else:
        df_tweets_cleaned['prediction'] = pd.Series(dtype='float64')  # Empty column

## 8. Final Scatter Plot with Best-Fit Line
Create the final plot showing actual ratings as blue dots and the best-fit line as a red line. Style the plot to match the example screenshot (`dog-rates-results.png`).

In [None]:
# Cell 9: Final Scatter Plot with Best-Fit Line (Matching example style)
plt.figure(figsize=(12, 7)) # Adjusted figure size for better clarity
plt.xticks(rotation=25) # Rotate x-axis labels

if not df_tweets_cleaned.empty and 'rating' in df_tweets_cleaned.columns:
    # Plot actual ratings: blue dots, semi-transparent
    plt.plot(df_tweets_cleaned['created_at'], df_tweets_cleaned['rating'], 'b.', alpha=0.5, label='Actual Ratings') 
    
    # Plot the best-fit line: red line, thicker
    # The 'prediction' column already holds (timestamp * slope + intercept)
    if 'prediction' in df_tweets_cleaned.columns and not df_tweets_cleaned['prediction'].isnull().all():
        # Sort by date to ensure the line plots correctly if data isn't already sorted
        plot_df_sorted = df_tweets_cleaned.sort_values(by='created_at')
        plt.plot(plot_df_sorted['created_at'], plot_df_sorted['prediction'], 'r-', linewidth=3, label='Best-Fit Line')
    else:
        print("No prediction line to plot.")
else:
    plt.text(0.5, 0.5, 'No data to plot', ha='center', va='center')

plt.title('Dog Ratings Over Time with Best-Fit Line')
plt.xlabel('Date')
plt.ylabel('Rating (/10)')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout() # Adjust layout to make sure everything fits without overlapping
plt.show()

## End of Notebook