## Wrangle Report

## 1. Introduction

This data wrangling project is the second project of the ALX-T Data Analyst Nanodegree programme on Udacity. The project focused on wrangling and analysing data from the @WeRateDogs Twitter account.

This involved:
- gathering data from multiple sources (downloading the @WeRateDogs Twitter archive dataset and the image predictions data from URLs provided to Udacity by @WeRateDogs, and querying the Twitter API for additional data that wasn't in the Twitter archive),
- assessing the data (visually and programmatically),
- cleaning and merging the datasets, and then 
- performing analysis on the tweets to extract insights.

## 2. Data Gathering

This step required me to gather three datasets:
- the @WeRateDogs Twitter archive dataset (`twitter-archive-enhanced.csv`),
- the tweet image predictions (`image_predictions.tsv`), and
- additional data from the Twitter API.

The first dataset, `twitter-archive-enhanced.csv`, was provided by Udacity. I downloaded it manually from the URL provided, uploaded it to the Jupyter Notebook folder, and read it into a pandas DataFrame using `pd.read_csv()`.

The second dataset, `image_predictions.tsv`, was downloaded programmatically using the Requests library and a URL provided by Udacity. I then read it into a pandas DataFrame, using `io.StringIO` to read the contents of the `requests.Response` object.
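The download-and-parse step can be sketched roughly as follows. This is a minimal illustration, not the notebook's exact code: the function names `parse_tsv` and `download_tsv` are hypothetical, and the URL is left as a parameter because the report does not state it.

```python
import io

import pandas as pd
import requests


def parse_tsv(text: str) -> pd.DataFrame:
    """Parse tab-separated text into a DataFrame via an in-memory buffer."""
    return pd.read_csv(io.StringIO(text), sep="\t")


def download_tsv(url: str) -> pd.DataFrame:
    """Fetch a TSV file over HTTP and return it as a DataFrame."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop early on a 4xx/5xx response
    return parse_tsv(response.text)


# Hypothetical usage; the real URL is the one supplied by Udacity:
# img_prediction_df = download_tsv(image_predictions_url)
```

Wrapping `response.text` in `io.StringIO` lets `pd.read_csv()` treat the downloaded text like an open file, so nothing has to be written to disk first.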

Finally, I queried the Twitter API using tweepy to gather additional data (`retweet_count` and `favorite_count`) about the tweets in the @WeRateDogs Twitter archive dataset.
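A query loop of this kind might look like the sketch below. It assumes `api` behaves like an authenticated `tweepy.API` object (one that exposes `get_status()`); the function name `fetch_engagement` and the decision to record failed IDs are illustrative choices, not details taken from the notebook.

```python
import pandas as pd


def fetch_engagement(api, tweet_ids):
    """Collect retweet and favourite counts for each tweet ID.

    `api` is assumed to behave like a tweepy.API instance, i.e. it
    provides get_status(tweet_id) returning an object that carries
    retweet_count and favorite_count attributes.
    """
    rows, failed_ids = [], []
    for tweet_id in tweet_ids:
        try:
            status = api.get_status(tweet_id)
        except Exception:
            # tweepy's error class depends on the version (TweepError in
            # v3, TweepyException in v4); catch the narrower class if the
            # version is known. Deleted tweets typically end up here.
            failed_ids.append(tweet_id)
            continue
        rows.append(
            {
                "tweet_id": str(tweet_id),
                "retweet_count": status.retweet_count,
                "favorite_count": status.favorite_count,
            }
        )
    columns = ["tweet_id", "retweet_count", "favorite_count"]
    return pd.DataFrame(rows, columns=columns), failed_ids


# Hypothetical usage with placeholder credentials:
# auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
# auth.set_access_token(access_token, access_secret)
# api = tweepy.API(auth, wait_on_rate_limit=True)
# tweets_df, failed_ids = fetch_engagement(api, archive_df["tweet_id"])
```

Keeping the IDs that fail makes it easy to see afterwards how many archived tweets could not be retrieved.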

## 3. Data Assessing

To check for quality and tidiness issues, I assessed each of the three datasets:
- visually, by printing as much of each dataset as Jupyter Notebook would display and scrolling through it, and
- programmatically, using `.head()`, `.sample()`, `.info()`, `.describe()`, etc.
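As an illustration of the programmatic checks, the sketch below runs them on a few hypothetical rows standing in for `archive_df`. The lowercase-name filter is an assumed heuristic for spotting entries such as "a" or "then"; it is not necessarily the check used in the notebook.

```python
import pandas as pd

# Hypothetical rows standing in for the real archive
archive_df = pd.DataFrame(
    {
        "tweet_id": [101, 102, 103, 104],
        "name": ["Charlie", "a", "None", "then"],
        "rating_denominator": [10, 10, 50, 10],
    }
)

# Structural overview: dtypes, non-null counts, summary statistics
archive_df.info()
print(archive_df.describe())

# Targeted checks for specific quality issues
odd_denominators = archive_df[archive_df["rating_denominator"] != 10]
suspect_names = archive_df[archive_df["name"].str.islower()]
duplicate_ids = archive_df["tweet_id"].duplicated().sum()

print(odd_denominators)
print(suspect_names["name"].tolist())
print(duplicate_ids)
```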

I found 13 issues in the data assessing process:

| Issue ID | Table | Issue Type | Column | Description |
| --- | --- | --- | --- | --- |
| 1 | `archive_df` | Quality | `name` | Invalid dog names e.g. "a", "then", etc. |
| 2 | `archive_df` | Quality | `rating_denominator` | Invalid denominators (values other than 10) |
| 3 | `archive_df` | Quality | `source` | Contains an href; extract the href string |
| 4 | `archive_df` | Quality | `text` | Irrelevant column |
| 5 | `archive_df` | Quality | `retweeted_status_id` | Retweets present, so the same dog is recorded twice or more |
| 6 | `archive_df` | Quality | `in_reply_to_status_id` | Replies present, so the same dog is recorded twice or more |
| 7 | `archive_df` | Quality | `timestamp` | Convert to `datetime` |
| 8 | `archive_df` | Quality | `doggo`, `floofer`, `pupper`, `puppo` | Convert to `category` datatype |
| 9 | `archive_df` | Quality | `tweet_id` | Convert to `string` datatype |
| 10 | `img_prediction_df` | Quality | `img_num` | Irrelevant column |
| 11 | `img_prediction_df` | Quality | `jpg_url` | Duplicate values |
| 12 | `archive_df`, `img_prediction_df`, `tweets_df` | Tidiness | `tweet_id` | Merge tables into one |
| 13 | `archive_df` | Tidiness | `doggo`, `floofer`, `pupper`, `puppo` | Merge columns into one, `dog_stage` |

## 4. Data Cleaning

Before cleaning, I first made copies of the original datasets. 

During cleaning, I dropped all rows that did not contain original tweets (e.g. retweets), so that the same dog would not be recorded more than once. I fixed the other quality and tidiness issues listed in the table above using the define-code-test methodology.
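The define-code-test pattern can be sketched on a small hypothetical sample as follows. The sample rows, the assumption that the stage columns hold either the stage name or the string "None", and the comma-joined handling of multi-stage dogs are all illustrative and may differ from the notebook.

```python
import pandas as pd

# Hypothetical sample; stage columns are assumed to hold the stage
# name or the string "None"
archive_df = pd.DataFrame(
    {
        "tweet_id": [101, 102, 103, 104],
        "timestamp": [
            "2017-08-01 16:23:56 +0000",
            "2017-07-31 00:18:03 +0000",
            "2017-07-30 15:58:51 +0000",
            "2017-07-29 16:00:24 +0000",
        ],
        "retweeted_status_id": [None, 9.0, None, None],
        "in_reply_to_status_id": [None, None, 7.0, None],
        "doggo": ["None", "None", "None", "doggo"],
        "floofer": ["None", "None", "None", "None"],
        "pupper": ["pupper", "None", "None", "pupper"],
        "puppo": ["None", "None", "None", "None"],
    }
)
archive_clean = archive_df.copy()  # clean a copy, keep the original

# Define: keep only original tweets (no retweets, no replies)
# Code
is_original = (
    archive_clean["retweeted_status_id"].isna()
    & archive_clean["in_reply_to_status_id"].isna()
)
archive_clean = archive_clean[is_original].reset_index(drop=True)
# Test
assert archive_clean["retweeted_status_id"].isna().all()
assert archive_clean["in_reply_to_status_id"].isna().all()

# Define: store tweet_id as a string and timestamp as a datetime
# Code
archive_clean["tweet_id"] = archive_clean["tweet_id"].astype(str)
archive_clean["timestamp"] = pd.to_datetime(archive_clean["timestamp"])
# Test
assert pd.api.types.is_datetime64_any_dtype(archive_clean["timestamp"])

# Define: collapse the four stage columns into a single dog_stage column
# Code
stage_cols = ["doggo", "floofer", "pupper", "puppo"]


def combine_stages(row):
    """Join every stage flagged in the row; fall back to "None"."""
    found = [stage for stage in stage_cols if row[stage] == stage]
    return ",".join(found) if found else "None"


archive_clean["dog_stage"] = (
    archive_clean.apply(combine_stages, axis=1).astype("category")
)
archive_clean = archive_clean.drop(columns=stage_cols)
# Test
assert not set(stage_cols) & set(archive_clean.columns)
print(archive_clean[["tweet_id", "dog_stage"]])
```

Each issue gets its own define, code and test step, so a failed assertion points directly at the fix that did not work.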

Finally, I merged the datasets into one dataset, `df_merged_clean`, which was ready for analysis and visualisation.
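A merge along these lines could be written as below. The three small frames and their non-key columns are hypothetical stand-ins, and the inner join is an assumption: it keeps only tweets present in all three tables, which may or may not match the join used in the notebook.

```python
import pandas as pd

# Hypothetical cleaned tables sharing a string tweet_id key
archive_clean = pd.DataFrame(
    {"tweet_id": ["1", "2", "3"], "rating_numerator": [12, 13, 11]}
)
img_prediction_clean = pd.DataFrame(
    {"tweet_id": ["1", "2"], "prediction": ["golden_retriever", "pug"]}
)
tweets_clean = pd.DataFrame(
    {
        "tweet_id": ["1", "2", "3"],
        "retweet_count": [50, 20, 35],
        "favorite_count": [300, 120, 210],
    }
)

# Chain two merges on the shared key; how="inner" drops tweets that
# are missing from any one of the tables
df_merged_clean = archive_clean.merge(
    img_prediction_clean, on="tweet_id", how="inner"
).merge(tweets_clean, on="tweet_id", how="inner")

print(df_merged_clean)
```

The key column needs the same datatype in every table before merging, which is one reason to convert `tweet_id` to a string consistently.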

## 5. Saving work

The cleaned dataset, `df_merged_clean`, was saved to a CSV file, `twitter_archive_master.csv`.
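Saving and reloading can be sketched as follows, using a hypothetical two-row frame in place of the real `df_merged_clean`. Passing `index=False` and re-reading `tweet_id` as a string are assumptions about sensible defaults, not details stated in the notebook.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned, merged dataset
df_merged_clean = pd.DataFrame(
    {
        "tweet_id": ["100000000000000001", "100000000000000002"],
        "favorite_count": [300, 120],
    }
)

# index=False avoids writing the row index as an extra unnamed column
df_merged_clean.to_csv("twitter_archive_master.csv", index=False)

# CSV files carry no datatype information, so tweet_id has to be read
# back as a string explicitly or pandas will parse it as an integer
reloaded = pd.read_csv("twitter_archive_master.csv", dtype={"tweet_id": str})
print(reloaded.dtypes)
```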

## 6. Conclusions

I assessed, documented, and addressed 13 issues (11 quality issues and 2 tidiness issues). Then, I saved the final dataset (which contains 1851 observations and 19 variables) as `twitter_archive_master.csv`.

While the wrangling was satisfactory for my project, the final dataset is not free of issues. For example, there are still null values (written as None) in the `dog_stage` and `name` columns.