# Data Wrangling Report: WeRateDogs Twitter Archive

## Introduction
This report documents the data wrangling process for the WeRateDogs Twitter archive analysis. The dataset, consisting of tweets from the WeRateDogs account, required cleaning and restructuring before it was usable for analysis.

## Data Gathering Process
The data was collected from three distinct sources, each requiring different approaches:
1. The Twitter archive CSV file, manually downloaded from Udacity
2. The image predictions TSV file, programmatically downloaded using Python's requests library
3. Additional tweet data obtained through the Twitter API using tweepy
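
The programmatic download in step 2 might look roughly like the sketch below. It is a minimal illustration, not the notebook's actual code: the helper names `download_file` and `read_tsv` are hypothetical, the URL is left to the caller, and the `http_get` parameter is an assumption added so the helper can be exercised without network access.

```python
import csv
from pathlib import Path


def download_file(url, dest, http_get=None):
    """Fetch `url` and write the response body to `dest` (hypothetical helper)."""
    if http_get is None:
        import requests  # third-party; imported lazily so the helper can be tested offline
        http_get = requests.get
    response = http_get(url, timeout=30)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of saving an error page
    dest = Path(dest)
    dest.write_bytes(response.content)
    return dest


def read_tsv(path):
    """Load a tab-separated file into a list of dicts keyed by the header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

Writing the raw bytes to disk before parsing keeps an untouched copy of the source file, so later cleaning steps can be re-run without downloading again.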

## Data Assessment Findings
### Quality Issues
The initial assessment revealed several quality issues that needed addressing:
1. Missing values in critical columns such as ratings and dog names
2. Incorrect data types, particularly timestamps stored as strings
3. Outliers in numerical columns, especially in rating values
4. Duplicated entries across the dataset
5. Inconsistent capitalization in dog names
6. Invalid or incorrect data in rating denominators
7. Incorrect formatting in textual fields
8. Inconsistent date entries
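
Several of these issues can be counted programmatically rather than spotted by eye. The sketch below assumes each record is a dict with `tweet_id`, `name`, and `rating_denominator` keys, and that a missing name is stored as an empty string or the literal text `"None"`; those column names and conventions are assumptions, not details taken from the report. Whether a denominator other than 10 is actually invalid is a judgment call, so the function only counts such rows.

```python
from collections import Counter


def assess(rows):
    """Count a few of the quality issues listed above (illustrative only)."""
    id_counts = Counter(row["tweet_id"] for row in rows)
    report = {"missing_name": 0, "lowercase_name": 0, "denominator_not_10": 0}
    for row in rows:
        name = (row.get("name") or "").strip()
        if name in ("", "None"):  # assumed placeholders for a missing name
            report["missing_name"] += 1
        elif name[0].islower():  # capitalization inconsistency
            report["lowercase_name"] += 1
        if int(row["rating_denominator"]) != 10:
            report["denominator_not_10"] += 1
    # Number of distinct tweet IDs that appear more than once.
    report["duplicate_ids"] = sum(1 for count in id_counts.values() if count > 1)
    return report
```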

### Tidiness Issues
Two main tidiness issues were identified:
1. One variable (dog stage) spread across multiple columns (`doggo`, `floofer`, `pupper`, `puppo`)
2. One observational unit (tweets) split across the three source datasets
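
For the dog-stage item, one way to tidy the data is to collapse the stage indicators into a single `dog_stage` field. The sketch below assumes the stage lives in four columns named `doggo`, `floofer`, `pupper`, and `puppo`, with the text `"None"` marking an absent stage; the function name and the comma-joined format for tweets with more than one stage are illustrative choices.

```python
STAGE_COLUMNS = ("doggo", "floofer", "pupper", "puppo")  # assumed column names


def collapse_stages(row, stage_columns=STAGE_COLUMNS, missing="None"):
    """Return a copy of `row` with the stage columns replaced by one `dog_stage` field."""
    stages = [col for col in stage_columns if row.get(col) not in (None, "", missing)]
    tidy = {key: value for key, value in row.items() if key not in stage_columns}
    # Join multiple stages so no information is dropped; None when no stage is recorded.
    tidy["dog_stage"] = ",".join(stages) if stages else None
    return tidy
```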

## Cleaning Process
The cleaning process followed a systematic approach:
1. First, I addressed the quality issues one by one, ensuring each change was properly tested
2. Then, I tackled the tidiness issues by restructuring the data
3. Finally, I merged the datasets into a single, clean master dataset
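
As a rough illustration of steps 1 and 3, the sketch below converts a timestamp string to a `datetime`, normalizes name capitalization, and joins two datasets on a shared key. The timestamp format, the `tweet_id` key, and both function names are assumptions made for the example; the merge keeps only keys present in both inputs (an inner join), which may or may not match the choice made in the notebook.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S %z"  # assumed format, e.g. "2017-08-01 16:23:56 +0000"


def clean_row(row):
    """Return a cleaned copy of `row`; the input dict is left unchanged."""
    cleaned = dict(row)
    cleaned["timestamp"] = datetime.strptime(row["timestamp"], TIMESTAMP_FORMAT)
    name = (row.get("name") or "").strip()
    # Uppercase only the first letter so the rest of the name keeps its original casing.
    cleaned["name"] = name[0].upper() + name[1:] if name and name != "None" else None
    return cleaned


def merge_on_key(left, right, key="tweet_id"):
    """Inner-join two lists of dicts on `key`; right-hand values win on column clashes."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]} for row in left if row[key] in right_by_key]
```

Working on copies rather than mutating the input mirrors the common practice of cleaning a copy of each dataset so the originals stay available for comparison.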

## Challenges Faced
Several challenges were encountered during the wrangling process:
1. Handling missing values while preserving data integrity
2. Standardizing inconsistent data formats
3. Ensuring proper merging of datasets without data loss
4. Maintaining data quality throughout the cleaning process
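
One way to check for data loss during a merge (challenge 3) is to compare the key sets of the two inputs before joining. This is a minimal sketch assuming both datasets share a `tweet_id` key; `merge_audit` is a hypothetical helper name.

```python
def merge_audit(left, right, key="tweet_id"):
    """Report which keys would match and which would be dropped by an inner join."""
    left_keys = {row[key] for row in left}
    right_keys = {row[key] for row in right}
    return {
        "matched": len(left_keys & right_keys),
        "left_only": sorted(left_keys - right_keys),   # rows lost from the left dataset
        "right_only": sorted(right_keys - left_keys),  # rows lost from the right dataset
    }
```

If `left_only` or `right_only` is non-empty, the dropped rows can be inspected before deciding whether an inner join is acceptable.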

## Conclusion
The wrangling process produced a clean, well-structured dataset ready for analysis. Working through the issues systematically addressed the quality and tidiness problems identified during assessment while preserving the integrity of the original data.

