# Tales of Tails: Analyzing Dog Ratings on Twitter
#### Prepared by: Jose Carlos Moreno Ramirez
#### Institution: Western Governors University
#### Course: Data Wrangling - D309 
#### Project: Wrangle and Analyze Data
---

### Introduction
This report details the data wrangling I did for the "We Rate Dogs" project. I gathered data from several sources, assessed it, and cleaned it so that it was ready for further analysis and visualization.

---
### Data Gathering
Compiling the information for this project required extracting data from three separate sources:

1. Twitter Archive: `twitter_archive_enhanced.csv`, a CSV file used as the primary dataset. It includes tweet IDs, timestamps, tweet text, and dog ratings.

2. Image Predictions: `image_predictions.tsv`, a file downloaded programmatically from a URL. This dataset contains the dog breed predicted for the image in each tweet.

3. Twitter API: More information, such as favorite and retweet counts, was gathered via the Twitter API and saved in the text file `tweet_json.txt`.
>*Note: Unfortunately, the Twitter API is locked behind a paywall after Elon Musk took over as Twitter CEO. Therefore, Western Governors University provided me [the student] with a text file containing all the tweet data that would have otherwise been retrieved from the API.*
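
A minimal sketch of how a file like `tweet_json.txt` could be read, assuming it holds one JSON object per line and that each object carries `id`, `retweet_count`, and `favorite_count` keys. The function name `parse_tweet_lines` and the sample lines are hypothetical and only serve the illustration.

```python
import json

def parse_tweet_lines(lines):
    """Extract the tweet ID and engagement counts from line-delimited JSON."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        records.append({
            "tweet_id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })
    return records

# Two made-up lines standing in for the contents of tweet_json.txt
sample_lines = [
    '{"id": 1, "retweet_count": 10, "favorite_count": 25}',
    '{"id": 2, "retweet_count": 3, "favorite_count": 7}',
]
records = parse_tweet_lines(sample_lines)
```

Because an open file object yields its lines one at a time, the same function can be given the file handle directly, and the resulting list of dictionaries can then be passed to `pandas.DataFrame`.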

---
### Data Assessment
Before cleaning the data, I analyzed the datasets to identify quality and tidiness issues. The assessment revealed the following key points:

1. **Archive Table**:
    - Adjust the datatype for certain columns to ensure consistency.
    - Address missing data in the DataFrame.
    - Replace missing values in the `name` column, currently represented as the string "None", with appropriate null values.
    - Review and correct expanded URLs that contain multiple URLs.
    - Improve the formatting of the `text` column.
    - Remove retweets from the data.
    - Address HTML tags in the `source` column.

2. **Image Table**:
    - Standardize the capitalization of the labels in the `p1`, `p2`, and `p3` columns where necessary.
    - Ensure proper formatting for the `p1`, `p2`, and `p3` columns within the image table.
    
3. **Tweet Table**:
    - Extract the date component from the `date_created` column.
    - Consider renaming the `date_created` column to `timestamp` for consistency across datasets.
    
4. **Tidiness issues**:
    - Combine multiple DataFrames to ensure that all tweet-related information is in one place for analysis.
    - Combine the dog stage information spread across four columns (`doggo`, `floofer`, `pupper`, and `puppo`) into a single stage column.
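
The stage tidiness issue can be sketched as follows. This is one possible approach, not necessarily the one used in the notebook. It assumes each stage column holds either its own name (for example `"doggo"`) or the string `"None"`, and that rows with more than one stage should keep both, joined by a comma. The helper `combine_stages` and the sample rows are hypothetical.

```python
import pandas as pd

STAGE_COLS = ["doggo", "floofer", "pupper", "puppo"]

def combine_stages(df):
    """Collapse the four stage columns into a single `stage` column."""
    out = df.copy()

    def row_stage(row):
        # Assumption: a stage column contains its own name when the stage applies
        found = [col for col in STAGE_COLS if row[col] == col]
        return ", ".join(found) if found else None

    out["stage"] = out[STAGE_COLS].apply(row_stage, axis=1)
    return out.drop(columns=STAGE_COLS)

# Made-up rows: no stage, one stage, and two stages
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["None", "doggo", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})
tidy = combine_stages(archive)
```

Keeping multi-stage rows as a joined string is a design choice made for this sketch; picking a single stage or dropping such rows would be equally defensible depending on the analysis.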

---

### Data Cleaning
The data cleaning step addressed the quality and tidiness issues identified during assessment. The actions taken were as follows:

1. Changed the data types for some columns.
2. Handled missing data within their applicable DataFrames.
3. Replaced missing data with acceptable null values.
4. Efficiently extracted data values separated by the pipe delimiter.
5. Standardized the capitalization for columns in the Image table where necessary.
6. Improved the formatting of the `text` column in the Archive table.
7. Extracted month, day, year components from the `date_created` column.
8. Renamed the `date_created` column to `timestamp` for consistency across datasets.
9. Ensured proper formatting for `p1`, `p2`, and `p3` columns within the Image table.
10. Eliminated HTML tags from the `source` column in the Archive table.
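
A few of the steps above can be sketched with pandas. This is a hedged illustration, not the notebook's actual code. The helper names `clean_archive` and `clean_predictions` are hypothetical, the sample rows are made up, and title case with spaces is only an assumed target format for the breed labels.

```python
import pandas as pd

def clean_archive(df):
    """Null out placeholder names and strip HTML tags from `source`."""
    out = df.copy()
    # Keep real names; the "None" placeholder becomes a null value
    out["name"] = out["name"].where(out["name"] != "None")
    # Remove anything that looks like an HTML tag, leaving the link text
    out["source"] = out["source"].str.replace(r"<[^>]+>", "", regex=True)
    return out

def clean_predictions(df):
    """Standardize breed labels, e.g. "golden_retriever" -> "Golden Retriever"."""
    out = df.copy()
    for col in ["p1", "p2", "p3"]:
        out[col] = out[col].str.replace("_", " ", regex=False).str.title()
    return out

# Made-up sample rows
archive = pd.DataFrame({
    "name": ["None", "Bella"],
    "source": [
        '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
        '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
    ],
})
predictions = pd.DataFrame({
    "p1": ["golden_retriever", "Chihuahua"],
    "p2": ["Labrador_retriever", "pug"],
    "p3": ["kuvasz", "toy_poodle"],
})
cleaned = clean_archive(archive)
preds = clean_predictions(predictions)
```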

---
### Data Storage
I stored the collected data as follows:
1. I made copies of each DataFrame to manipulate freely.
2. Then, I combined all the DataFrames into a single DataFrame named `master_df`.
3. Finally, I saved the dataset in a CSV file named `twitter_master_archive.csv`.
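
The storage steps above could look roughly like this. The sketch assumes the three DataFrames share a `tweet_id` column and that an inner join is wanted, neither of which is stated in this report. The function name `build_master` and the sample data are hypothetical, and the temporary directory is used only so the example does not write into the working directory.

```python
import os
import tempfile

import pandas as pd

def build_master(archive, predictions, tweets):
    """Copy each DataFrame, then merge the copies on `tweet_id`."""
    archive_clean = archive.copy()
    predictions_clean = predictions.copy()
    tweets_clean = tweets.copy()
    # Assumption: inner joins keep only tweets present in all three tables
    return (
        archive_clean
        .merge(predictions_clean, on="tweet_id", how="inner")
        .merge(tweets_clean, on="tweet_id", how="inner")
    )

# Made-up sample data
archive = pd.DataFrame({"tweet_id": [1, 2, 3], "rating_numerator": [12, 13, 11]})
predictions = pd.DataFrame({"tweet_id": [1, 2], "p1": ["Pug", "Chihuahua"]})
tweets = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweet_count": [10, 3, 8],
    "favorite_count": [25, 7, 30],
})

master_df = build_master(archive, predictions, tweets)

# Save without the index so the CSV reloads with the same columns
out_path = os.path.join(tempfile.mkdtemp(), "twitter_master_archive.csv")
master_df.to_csv(out_path, index=False)
```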

---
### Conclusion
For the "We Rate Dogs" project, data had to be gathered from a variety of sources, assessed for quality and tidiness, and then cleaned so that it was ready for analysis. The resulting clean datasets, including the cleaned Twitter API data, allowed me to create interesting visualizations.

---
### References
Any references I used are listed right after the cell block output.

---
