# Wrangle Report

This report summarizes the data wrangling process for tweets from the WeRateDogs Twitter account. The process involved three main steps (gathering, assessing, and cleaning), followed by column reduction and storage:

## 1. Gathering Data
The data was collected from three sources:
- **Twitter Archive**: A CSV file containing tweets and metadata.
- **Image Predictions**: A TSV file with predictions of dog breeds in tweet images.
- **Twitter API Data**: Additional tweet data (e.g., retweet and favorite counts) extracted from a JSON file due to API limitations.
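
The loading step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper `load_sources`, the column names, and the one-JSON-object-per-line layout of the API file are assumptions, and small in-memory samples stand in for the real files.

```python
import io
import json

import pandas as pd


def load_sources(archive_src, predictions_src, tweets_json_src):
    """Load the three sources from open file-like objects (hypothetical helper)."""
    archive = pd.read_csv(archive_src)                # comma-separated archive
    predictions = pd.read_csv(predictions_src, sep="\t")  # tab-separated predictions

    # Assumption: the JSON file stores one tweet object per line.
    rows = []
    for line in tweets_json_src:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        rows.append(
            {
                "tweet_id": tweet["id"],
                "retweet_count": tweet["retweet_count"],
                "favorite_count": tweet["favorite_count"],
            }
        )
    api_counts = pd.DataFrame(
        rows, columns=["tweet_id", "retweet_count", "favorite_count"]
    )
    return archive, predictions, api_counts


# Tiny made-up samples so the sketch runs without the real files.
archive_csv = io.StringIO("tweet_id,text\n1,Good dog 13/10\n2,Also good 12/10\n")
predictions_tsv = io.StringIO("tweet_id\tp1\tp1_conf\tp1_dog\n1\tpug\t0.9\tTrue\n")
tweets_json = io.StringIO(
    '{"id": 1, "retweet_count": 5, "favorite_count": 10}\n'
    '{"id": 2, "retweet_count": 7, "favorite_count": 20}\n'
)

archive, predictions, api_counts = load_sources(
    archive_csv, predictions_tsv, tweets_json
)
```

With the real files, the same function would take handles returned by `open(...)` in place of the `io.StringIO` samples.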

## 2. Assessing Data
The datasets were assessed for quality and tidiness issues:
- **Quality Issues**:
  - Missing values in columns like `name`.
  - Invalid dog names (e.g., "a", "the").
  - Out-of-range values in `rating_numerator`.
  - Inappropriate data types for ID and timestamp columns.
  - Rows in the image predictions dataset not classified as dogs.
  - One row with an invalid prediction confidence of exactly `1`.
- **Tidiness Issues**:
  - Dog stages distributed across multiple columns (`doggo`, `floofer`, etc.).
  - The `rating_denominator` column does not measure a variable.
  - The three datasets needed merging for analysis.
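
Several of the quality checks above can be expressed programmatically. The sketch below is one possible approach under stated assumptions: the helper `assess_archive` is hypothetical, the `timestamp` column name is assumed, and treating all-lowercase names as invalid is a heuristic suggested by the examples "a" and "the". The threshold of 17 is the maximum used in the cleaning step.

```python
import pandas as pd


def assess_archive(archive, max_numerator=17):
    """Count a few quality issues in the archive (hypothetical helper)."""
    names = archive["name"]
    return {
        # Names that are missing outright.
        "missing_names": int(names.isna().sum()),
        # Heuristic (assumption): real names are capitalized, so
        # all-lowercase values such as "a" or "the" are not names.
        "lowercase_names": int(names.fillna("").str.islower().sum()),
        # Ratings above the chosen maximum.
        "numerators_above_max": int(
            (archive["rating_numerator"] > max_numerator).sum()
        ),
        # True only if timestamps are already parsed as datetimes.
        "timestamp_is_datetime": pd.api.types.is_datetime64_any_dtype(
            archive["timestamp"]
        ),
    }


# Made-up rows for illustration only.
sample = pd.DataFrame(
    {
        "name": ["Rex", "a", None, "the"],
        "rating_numerator": [13, 12, 420, 10],
        "timestamp": ["2017-08-01 16:23:56 +0000"] * 4,
    }
)
report = assess_archive(sample)
```

Visual inspection remains useful alongside counts like these, since a heuristic can miss invalid names that happen to be capitalized.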

## 3. Cleaning Data
The cleaning process addressed the identified issues using the **Define-Code-Test** framework:
- **Quality Fixes**:
  - Converted ID and timestamp columns to appropriate data types.
  - Removed retweets and replies to focus on original tweets.
  - Replaced missing or invalid dog names with "Unknown".
  - Removed out-of-range `rating_numerator` values, using 17 as the maximum allowed.
  - Filtered out rows not classified as dogs and the row with the invalid prediction confidence.
- **Tidiness Fixes**:
  - Merged dog stages into a single column.
  - Dropped the `rating_denominator` column as it was redundant.
  - Combined the three datasets into one for easier analysis.
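
The fixes above could be implemented along these lines. This is a sketch under assumptions, not the code behind this report. Assumed details: the column names `tweet_id`, `timestamp`, `retweeted_status_id`, `in_reply_to_status_id`, `p1_conf`, and `p1_dog`; the stage columns `pupper` and `puppo`; string IDs; inner joins; and a stage column holding its own name when the stage applies. All helper functions are hypothetical.

```python
import pandas as pd

# Assumption: four stage columns; the report names only `doggo` and `floofer`.
STAGES = ["doggo", "floofer", "pupper", "puppo"]


def merge_stages(df, stage_cols=STAGES):
    """Collapse one column per stage into a single `dog_stage` column."""
    # Assumption: a stage column holds its own name when the stage applies.
    labels = [
        ",".join(s for s, v in zip(stage_cols, values) if isinstance(v, str) and v == s)
        for values in df[stage_cols].itertuples(index=False, name=None)
    ]
    out = df.drop(columns=stage_cols)
    # No stage becomes a missing value; multiple stages are comma-joined.
    out["dog_stage"] = [label or None for label in labels]
    return out


def clean_archive(archive, max_numerator=17):
    """Apply the archive-level quality and tidiness fixes."""
    # Original tweets only: retweets and replies carry a non-null reference ID.
    original = (
        archive["retweeted_status_id"].isna()
        & archive["in_reply_to_status_id"].isna()
    )
    in_range = archive["rating_numerator"] <= max_numerator
    df = archive[original & in_range].copy()

    # IDs are labels, not quantities, so keep them as strings (assumption).
    df["tweet_id"] = df["tweet_id"].astype(str)
    df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)

    # Missing names and all-lowercase words ("a", "the") become "Unknown".
    name = df["name"]
    invalid = name.isna() | name.fillna("").str.islower()
    df["name"] = name.mask(invalid, "Unknown")

    df = df.drop(columns=["rating_denominator"])
    return merge_stages(df).reset_index(drop=True)


def filter_predictions(predictions):
    """Keep rows whose top prediction is a dog with confidence below 1."""
    keep = predictions["p1_dog"].astype(bool) & (predictions["p1_conf"] < 1)
    return predictions[keep].copy()


def combine(archive, predictions, api_counts):
    """Inner-join the three tables on `tweet_id` (archive IDs already strings)."""
    merged = archive
    for frame in (predictions, api_counts):
        frame = frame.assign(tweet_id=frame["tweet_id"].astype(str))
        merged = merged.merge(frame, on="tweet_id", how="inner")
    return merged
```

Each function corresponds to one define-code-test cycle, so a test can be as simple as asserting that the affected column or row count changed as intended. An inner join keeps only tweets present in all three sources; a left join would be the alternative if tweets without images should be retained.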

## Dimensionality Reduction
Unnecessary columns (e.g., `source`, `expanded_urls`, and unused prediction columns) were dropped to streamline the dataset.
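
As a rough illustration, the column drop might look like the sketch below. Only `source` and `expanded_urls` are named in this report; the prediction column names (`p2`, `p3`, and so on) and the helper `drop_unused` are assumptions for the example.

```python
import pandas as pd

# Assumption: names other than `source` and `expanded_urls` are illustrative.
UNUSED_COLUMNS = [
    "source", "expanded_urls",
    "p2", "p2_conf", "p2_dog",
    "p3", "p3_conf", "p3_dog",
]


def drop_unused(df, columns=UNUSED_COLUMNS):
    """Return a copy of `df` without the listed columns (hypothetical helper)."""
    # errors="ignore" skips listed names that are not present in `df`.
    return df.drop(columns=columns, errors="ignore")


# Made-up row for illustration only.
sample = pd.DataFrame(
    {
        "tweet_id": ["1"],
        "source": ["web"],
        "expanded_urls": ["https://example.com"],
        "p1": ["pug"],
        "p2": ["beagle"],
    }
)
slim = drop_unused(sample)
```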

## Storing Data
The cleaned dataset was saved as `twitter_archive_master.csv` for future analysis and visualization. It combines the cleaned tweet archive, image predictions, and retweet and favorite counts in a single table.
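
A minimal sketch of the save step follows, with a round trip to show one detail worth noting: if IDs are kept as strings (an assumption of this sketch), they need an explicit `dtype` when the file is read back, otherwise pandas parses them as integers. The helper `save_master` and the sample row are made up for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_master(df, directory="."):
    """Write the master table to `twitter_archive_master.csv` (hypothetical helper)."""
    path = Path(directory) / "twitter_archive_master.csv"
    # index=False keeps the row index out of the file.
    df.to_csv(path, index=False)
    return path


# Made-up row for illustration only.
master = pd.DataFrame(
    {"tweet_id": ["1000000000000000001"], "name": ["Rex"], "rating_numerator": [13]}
)

# A temporary directory keeps the example from writing into the project folder.
with tempfile.TemporaryDirectory() as tmp:
    path = save_master(master, tmp)
    # Read IDs back as strings so they are not converted to integers.
    reloaded = pd.read_csv(path, dtype={"tweet_id": str})
```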