## Reporting: wrangle_report


This report documents the data cleaning performed on a dataset of WeRateDogs tweets about dog ratings and breeds. The goal of the project is to analyze these tweets, but the analysis is only reliable once the dataset has been cleaned and prepared. The sections below list the issues we found and the steps taken to fix them.

### Data Gathering and Assessment

The data came from several sources, including Twitter's API and predictions from an external neural network. An initial assessment of the three resulting datasets (t_arch_orig, t_json_orig, img_pred_orig) revealed the following quality and tidiness issues:

<center>Quality issues</center>

| check | Dataset      | Issue                                                                            |
|:-----:|:-------------|:---------------------------------------------------------------------------------|
| <span style="border: 2px solid green; border-radius: 50%; display: inline-block; width: 20px; height: 20px;"></span>   | t_arch_orig  | source string contains HTML elements                                            |
| <span style="border: 2px solid green; border-radius: 50%; display: inline-block; width: 20px; height: 20px;"></span>   | t_arch_orig  | Incorrect dog names                                                             |
| <span style="border: 2px solid green; border-radius: 50%; display: inline-block; width: 20px; height: 20px;"></span>   | t_arch_orig  | No dog_stage column                                                           |
| <span style="border: 2px solid green; border-radius: 50%; display: inline-block; width: 20px; height: 20px;"></span>   | t_arch_orig  | If the text contains "We only rate dogs" that tweet should be excluded           |
| <span style="border: 2px solid green; border-radius: 50%; display: inline-block; width: 20px; height: 20px;"></span>   | t_arch_orig  | Incorrect data type in timestamp column                                         |
| <span style="border: 2px solid green; border-radius: 50%; display: inline-block; width: 20px; height: 20px;"></span>  | t_json_orig  | Incorrect data type in timestamp column                                         |
| <span style="border: 2px solid green; border-radius: 50%; display: inline-block; width: 20px; height: 20px;"></span>   | t_arch_orig  | Missing values in expanded_urls; the number of missing values differs from t_json |
| <span style="border: 2px solid green; border-radius: 50%; display: inline-block; width: 20px; height: 20px;"></span>   | t_arch_orig  | Incorrect data type in timestamp column, in_reply_to_status_id, ...              |
| <span style="border: 2px solid green; border-radius: 50%; display: inline-block; width: 20px; height: 20px;"></span>   | t_arch_orig  | Non-null values in retweeted* columns indicate non-original tweets              |
| <span style="border: 2px solid green; border-radius: 50%; display: inline-block; width: 20px; height: 20px;"></span>  | img_pred_orig| Dataset doesn't have information for ~300 tweet ids                              |
| <span style="border: 2px solid green; border-radius: 50%; display: inline-block; width: 20px; height: 20px;"></span>   | img_pred_orig| Underscores are used instead of whitespace in p1, p2, p3                        |
| <span style="border: 2px solid green; border-radius: 50%; display: inline-block; width: 20px; height: 20px;"></span>   | t_arch_orig  | Incorrect data types in in_reply_to_status_id_json, in_reply_to_user_id_json... |
| <span style="border: 2px solid green; border-radius: 50%; display: inline-block; width: 20px; height: 20px;"></span>   | t_arch_orig  | Inaccurate data in the retweeted_status_id column                               |
| <span style="border: 2px solid green; border-radius: 50%; display: inline-block; width: 20px; height: 20px;"></span>   | t_arch_orig  | Inaccurate data in the rating_denominator column                                |

--------------------------------------------------------------------------
<center>Tidiness issues</center>


| check | Dataset           | Issue                                                             |
|:-----:|:------------------|:------------------------------------------------------------------|
| <span style="border: 2px solid green; border-radius: 50%; display: inline-block; width: 20px; height: 20px;"></span> | t_arch_orig       | unnecessary columns: doggo, floofer, pupper, puppo                 |
| <span style="border: 2px solid green; border-radius: 50%; display: inline-block; width: 20px; height: 20px;"></span> | t_json_orig, t_arch_orig, img_pred_orig | unify all 3 datasets into one, eliminate repeated columns         |

------------------------------------------
1. HTML Elements in the 'source' Column: The 'source' column contained HTML elements. We stripped the markup and kept only the plain source information.

2. Incorrect Dog Names: Comparison with the 'text' column revealed incorrect dog names. To fix them, we re-extracted the names from the text with the regular expression `r"(?:This is|name is)\s+([A-Z][a-z]+)(?!\s+[A-Z][a-z]+)"`.

3. Absence of 'dog_stage' Column with Correct Values: We observed that the dataset lacked a 'dog_stage' column that could categorize dogs into different stages (e.g., doggo, puppo, floof, pupper, snoot, and blep). To address this, we extracted relevant data from the 'text' column and assigned appropriate values to the 'dog_stage' column.

4. Removing Data Without Dog Ratings: Our analysis focuses on dog ratings, so rows with irrelevant data or without a proper rating had to be removed. In particular, tweets whose text contains "We only rate dogs" were excluded.

5. Inaccurate Data in 'retweeted_status_id' and 'rating_denominator' Columns: The 'retweeted_status_id' and 'rating_denominator' columns contained inaccurate data. To correct the ratings, we re-extracted them from the 'text' column using the regex pattern `r'\d+(.\d+)?/\d+'`. We then reviewed every questionable match by hand to understand its cause (for example, a cumulative rating for multiple dogs, or a date mistaken for a rating) and replaced it as follows:
    - 11/15 replaced with NaN
    - 960/00 replaced with 13/10
    - 9/11 replaced with 14/10
    - 24/7 replaced with NaN
    - 84/70 replaced with 12/10
    - 165/150 replaced with 11/10
    - 204/170 replaced with 12/10
    - 4/20 replaced with 13/10
    - 50/50 replaced with 11/10
    - 80/80 replaced with 10/10
    - 99/90 replaced with 11/10
    - 45/50 replaced with 9/10
    - 60/50 replaced with 12/10
    - 44/40 replaced with 11/10
    - 4/20 replaced with 2/10
    - 143 replaced with 11/10
    - 121/110 replaced with 11/10
    - 7/11 replaced with 10/10
    - 20/16 replaced with 12.5/10
    - 144/120 replaced with 12/10
    - 88/80 replaced with 11/10
    - 1/2 replaced with 9/10
    - 007/10 replaced with 0.07/10
    - 0/10 replaced with NaN

6. Incorrect Data Types: Several columns had incorrect data types, which we converted. Notably, we converted 'in_reply_to_status_id' and 'in_reply_to_user_id' to integers, and 'timestamp' and 'retweeted_status_timestamp' to datetimes.

7. String Formatting Issues in 'p1', 'p2', and 'p3' Columns: The 'p1', 'p2', and 'p3' columns contain dog breed predictions written with underscores instead of spaces. We replaced the underscores with spaces for readability and consistency.

8. Separate Datasets, Unnecessary Columns, Retweets, Replies, and Missing IDs: The data was spread across three separate datasets and included unnecessary columns, retweets, replies, and tweet IDs missing from some of the datasets. To resolve these issues, we combined the three datasets with an inner merge and dropped the irrelevant columns, retweets, replies, and rows with missing IDs, leaving a single consistent dataset.
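
The text-based steps (1, 2, 3 and 5) can be sketched with pandas string methods. This is a minimal illustration under stated assumptions, not the code used for this report: the helper name `clean_text_columns`, the output column `rating`, the stage keyword list and the two-entry override table are hypothetical. Only the name pattern is taken verbatim from step 2. The rating pattern escapes the decimal point (`\.`), which the pattern quoted in step 5 does not.

```python
import re

import numpy as np
import pandas as pd

# Name pattern quoted in step 2.
NAME_RE = r"(?:This is|name is)\s+([A-Z][a-z]+)(?!\s+[A-Z][a-z]+)"
# Rating pattern from step 5, with the decimal point escaped (assumption).
RATING_RE = r"(\d+(?:\.\d+)?/\d+)"
# Stage keywords; the exact list used in the report is not shown (assumption).
STAGE_RE = r"\b(doggo|floofer|pupper|puppo)\b"
# Two of the manual corrections from step 5, keyed by the extracted string.
OVERRIDES = {"84/70": "12/10", "24/7": np.nan}


def clean_text_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with source, name, dog_stage and rating derived from text."""
    out = df.copy()
    # Step 1: keep only the text between the HTML tags in 'source'.
    out["source"] = out["source"].str.extract(r">([^<]+)<", expand=False)
    # Step 2: capitalised word following "This is" or "name is".
    out["name"] = out["text"].str.extract(NAME_RE, expand=False)
    # Step 3: first stage keyword in the text, matched case-insensitively.
    out["dog_stage"] = (
        out["text"]
        .str.extract(STAGE_RE, flags=re.IGNORECASE, expand=False)
        .str.lower()
    )
    # Step 5: first rating-like fraction, then the manual overrides.
    extracted = out["text"].str.extract(RATING_RE, expand=False)
    out["rating"] = extracted.map(lambda r: OVERRIDES.get(r, r))
    return out
```

Keying the overrides by the extracted string keeps the sketch short, but it cannot distinguish two tweets that share the same raw match (such as the two 4/20 entries in step 5); keying by tweet ID would avoid that.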

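Steps 4, 6, 7 and 8 can be sketched in the same hedged way. The function name `build_master` and the join key `tweet_id` are assumptions about the layout of the three datasets, not something this report states. The nullable `Int64` dtype is used here because a plain integer column cannot hold the missing reply IDs.

```python
import pandas as pd


def build_master(arch: pd.DataFrame, tweets: pd.DataFrame, preds: pd.DataFrame) -> pd.DataFrame:
    """Retype, filter and merge the three tables on tweet_id (assumed key)."""
    arch = arch.copy()
    preds = preds.copy()
    # Step 6: parse timestamps; nullable Int64 tolerates missing IDs.
    arch["timestamp"] = pd.to_datetime(arch["timestamp"])
    for col in ("in_reply_to_status_id", "in_reply_to_user_id"):
        arch[col] = arch[col].astype("Int64")
    # Step 7: underscores to spaces in the breed predictions.
    for col in ("p1", "p2", "p3"):
        preds[col] = preds[col].str.replace("_", " ", regex=False)
    # Step 4: tweets containing "We only rate dogs" are not dog ratings.
    not_dog = arch["text"].str.contains("We only rate dogs", case=False, na=False)
    # Step 8: original tweets have no retweet or reply status ID.
    original = arch["retweeted_status_id"].isna() & arch["in_reply_to_status_id"].isna()
    arch = arch[~not_dog & original]
    # Step 8: chained inner joins keep only tweet_ids present in all three tables.
    return arch.merge(tweets, on="tweet_id", how="inner").merge(
        preds, on="tweet_id", how="inner"
    )
```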
---------
The cleaning process addressed the quality and tidiness issues listed above. The resulting "twitter_archive_master.csv" file is the basis for the subsequent analysis of dog ratings and breeds on Twitter.

The cleaned dataset supports investigating the average retweet count for each source platform, identifying the most and least popular dog breeds, and determining the highest-rated dog breeds according to WeRateDogs.

This work also provided practical experience with a range of data problems and reinforced the importance of thorough data validation and transformation.

The forthcoming data analysis report will present the insights and visualizations derived from the cleaned dataset.
