<h2>Reporting: wrangle_report</h2>


<h2>This is a report on the wrangling and analysis of the WeRateDogs Twitter dataset.</h2>

<h2>Gathering Data</h2>

For this analysis, three datasets were gathered using the following techniques:

1. The _".csv"_ file `twitter_archive_enhanced.csv` was downloaded manually, loaded with `pd.read_csv`, and assigned to the variable `df_1`.

2. The _".tsv"_ file `image_prediction.tsv`, hosted on Udacity's servers, was downloaded programmatically with the Requests library, loaded with `pd.read_csv`, and assigned to the variable `df_2`.

3. Python's __Tweepy__ module (Twitter API) was used to query each tweet's JSON data, which was stored in the _".txt"_ file `tweet_json.txt` and then loaded as `df_3` with `pd.read_json`.
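The three loading steps can be sketched as follows. This is a minimal illustration, not the project's actual code: the in-memory buffers stand in for the real files, the column names and values are made up, and it assumes that `tweet_json.txt` holds one JSON object per line. The download step with Requests is left out because the source URL is not stated in this report.

```python
import io

import pandas as pd

# Stand-ins for the real files so the sketch runs on its own.
# In the project these would be paths such as "twitter_archive_enhanced.csv".
archive_csv = io.StringIO("tweet_id,text\n1,This is Bo. 12/10\n")
predictions_tsv = io.StringIO("tweet_id\tp1\n1\tgolden_retriever\n")
tweets_txt = io.StringIO(
    '{"id": 1, "favorite_count": 10}\n'
    '{"id": 2, "favorite_count": 5}\n'
)

# Comma-separated file: read_csv defaults are enough.
df_1 = pd.read_csv(archive_csv)

# Tab-separated file: read_csv needs an explicit tab separator.
df_2 = pd.read_csv(predictions_tsv, sep="\t")

# Assumption: one JSON object per line, hence lines=True.
df_3 = pd.read_json(tweets_txt, lines=True)
```

Note that `pd.read_csv` does not infer the delimiter from the file extension, so a tab-separated file needs `sep="\t"`.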

<h2>Assessing Data</h2>

Both visual and programmatic assessment were used to assess the data. The programmatic assessment was carried out using the following __pandas__ and __NumPy__ methods and attributes:

- `.shape`


- `.info()`


- `.describe()`


- `.value_counts()`


- `.loc`, `.iloc`


- `.isna()`, `.isnull()`


- `.sum()`


- `.duplicated()`


- `.sample()`


- `.nunique()`


- `.head()`


- `.tail()`
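As a rough illustration of how several of these calls work together, the sketch below runs them on a small made-up data frame. The column names and values are hypothetical and only mimic the kind of problems discussed in this report.

```python
import pandas as pd

# Hypothetical toy data: one duplicated row, one missing name,
# one lowercase "name", and one unusual denominator.
df = pd.DataFrame({
    "tweet_id": [1, 2, 3, 1],
    "name": ["Bo", "a", None, "Bo"],
    "rating_denominator": [10, 10, 70, 10],
})

n_rows, n_cols = df.shape                      # size of the table
df.info()                                      # dtypes and non-null counts
print(df.describe())                           # summary statistics

missing_per_column = df.isna().sum()           # missing values per column
n_duplicate_rows = df.duplicated().sum()       # fully duplicated rows
n_unique_ids = df["tweet_id"].nunique()        # distinct tweet ids
denominator_counts = df["rating_denominator"].value_counts()

print(df.head(2))                              # first rows
print(df.sample(2, random_state=0))            # random rows for spot checks
```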


Several __quality issues__ were found in the datasets during the assessment.

`df_1` revealed the following:

1. The timestamp is stored as an object, not a datetime.

2. The rating denominator contains outliers.

3. Names sometimes begin with a lowercase letter.

4. Missing values (name, doggo, floofer, pupper, and puppo).

5. Unwanted columns need to be removed (in_reply_to_status_id, in_reply_to_user_id, retweeted_status_user_id, retweeted_status_id, retweeted_status_timestamp, and others).

6. Missing names need to be extracted from the tweet text, and incomplete names need to be corrected.


The following was noticed in `df_2`:

7. Underscores and inconsistent letter casing in the _p1, p2, and p3_ columns.

8. The dog breed and its confidence level need to be extracted.


Additionally, it was found in `df_3` that:

9. Missing values (contributors, coordinates, in_reply_to_screen_name, in_reply_to_status_id, in_reply_to_status_id_str, in_reply_to_user_id, in_reply_to_user_id_str, geo, place, quoted_status, quoted_status_id, quoted_status_id_str, quoted_status_permalink, and retweeted_status).

10. The display_text_range column needs to be converted to display_text_length.

11. Unwanted columns need to be dropped.


Some __tidiness issues__ were also discovered:

1. The doggo, floofer, pupper, and puppo columns in `df_1` should be merged into a single column named dog_stage.

2. `df_1`, `df_2`, and `df_3` should be merged into a single data frame.

<h2>Data Cleaning</h2>

The data cleaning addressed the quality and tidiness issues identified during the assessment phase. This section summarizes what was done for each group of issues.

<h3>Tidiness Issues</h3>

The doggo, floofer, pupper, and puppo columns were combined into a single column named dog_stage. Unneeded columns were then dropped, leaving 9 columns rather than 17.

After these tidiness issues were resolved, the three data frames were merged into a single data frame called `master_tweet`.
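One way to combine the four stage columns is sketched below. It is an illustration under assumptions rather than the exact code used: it assumes that a missing stage is stored as the string `"None"`, and the helper `join_stages` and the toy rows are hypothetical.

```python
import numpy as np
import pandas as pd

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Hypothetical toy rows; "None" marks a stage that does not apply (assumption).
df_1 = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})


def join_stages(row):
    """Join every stage present in the row, or return NaN if there is none."""
    found = [stage for stage in row if stage != "None"]
    return ", ".join(found) if found else np.nan


df_1["dog_stage"] = df_1[stage_cols].apply(join_stages, axis=1)

# The four original columns are redundant once dog_stage exists.
df_1 = df_1.drop(columns=stage_cols)
```

Joining with a comma keeps tweets that mention more than one stage instead of silently picking one.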
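The merge can be sketched as two chained joins. This assumes that all three data frames share a `tweet_id` key and that an inner join is wanted; neither detail is stated in the report, and the toy columns are hypothetical.

```python
import pandas as pd

# Hypothetical toy frames sharing a tweet_id key (assumption).
df_1 = pd.DataFrame({"tweet_id": [1, 2, 3], "rating_numerator": [12, 13, 10]})
df_2 = pd.DataFrame({"tweet_id": [1, 2], "dog_breed": ["Pug", "Chow"]})
df_3 = pd.DataFrame({"tweet_id": [1, 2, 3], "favorite_count": [100, 250, 40]})

# An inner join keeps only tweets present in all three frames.
master_tweet = (
    df_1.merge(df_2, on="tweet_id", how="inner")
        .merge(df_3, on="tweet_id", how="inner")
)
```

With `how="inner"`, tweet 3 is dropped here because it has no image prediction; `how="left"` would keep it with missing values instead.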

<h3>Quality Issues</h3>

Closer inspection of `df_1` showed that some dog names were missing, while others were recorded incorrectly, either because the wrong word was taken from the tweet or because the tweet did not mention the dog's name. Further investigation covered 19 names that had been incorrectly recorded as "a"; the correct names were later retrieved from their respective tweets.
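The report does not say how the names were retrieved. One possible approach, shown here purely as an assumption, is to treat lowercase names as invalid and look for the pattern "named X" in the tweet text. The regular expression and the toy tweets are hypothetical.

```python
import pandas as pd

# Hypothetical toy tweets; "a" stands for an incorrectly recorded name.
df_1 = pd.DataFrame({
    "text": [
        "This is a Siberian heavily armored polar bear named Daryl. 10/10",
        "This is Bella. She's a good girl. 13/10",
        "Here is a very happy pup. 12/10",
    ],
    "name": ["a", "Bella", "a"],
})

# Assumption: a real name starts with a capital letter, so an
# all-lowercase "name" is not a name.
is_lowercase = df_1["name"].str.islower()

# Look for "named <Name>" in the text; rows without a match give NaN.
named = df_1["text"].str.extract(r"named\s+([A-Z][a-z]+)", expand=False)

# Overwrite only the invalid names; valid ones are left untouched.
df_1.loc[is_lowercase, "name"] = named[is_lowercase]
```

Rows where no name can be recovered end up as missing values rather than keeping the misleading "a".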

Predictions in `df_2` that were identified as dogs were extracted into a single column named dog_breed, with the corresponding confidence level in a column named dog_conf.
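A possible way to do this extraction is sketched below. It assumes that each prediction column `p1`, `p2`, `p3` comes with a confidence column (`p1_conf`, ...) and a boolean flag (`p1_dog`, ...), and that the first prediction flagged as a dog should win. These column names, the `dog_breed` spelling, and the toy values are assumptions, not taken from the report.

```python
import pandas as pd

# Hypothetical toy predictions with assumed *_conf and *_dog columns.
df_2 = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "p1": ["golden_retriever", "paper_towel", "seat_belt"],
    "p1_conf": [0.92, 0.60, 0.50],
    "p1_dog": [True, False, False],
    "p2": ["Labrador_retriever", "pug", "banana"],
    "p2_conf": [0.05, 0.30, 0.20],
    "p2_dog": [True, True, False],
    "p3": ["kuvasz", "Chihuahua", "orange"],
    "p3_conf": [0.01, 0.05, 0.10],
    "p3_dog": [True, True, False],
})

# Start from p1 where it is a dog, then fall back to p2 and p3.
breed = df_2["p1"].where(df_2["p1_dog"])
conf = df_2["p1_conf"].where(df_2["p1_dog"])
for p in ("p2", "p3"):
    is_dog = df_2[f"{p}_dog"]
    breed = breed.fillna(df_2[p].where(is_dog))
    conf = conf.fillna(df_2[f"{p}_conf"].where(is_dog))

# Tidy the labels: underscores become spaces, casing is made consistent.
df_2["dog_breed"] = breed.str.replace("_", " ").str.title()
df_2["dog_conf"] = conf
```

Tweets with no dog prediction at all keep missing values in both new columns.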

The timestamp data type was changed from object to pandas datetime to support time-series analysis.
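A minimal sketch of the conversion, assuming the column is called `timestamp` and holds strings like the made-up ones below:

```python
import pandas as pd

# Hypothetical timestamps stored as strings (object dtype).
df_1 = pd.DataFrame({
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000"],
})

# Parse the strings into timezone-aware datetimes.
df_1["timestamp"] = pd.to_datetime(df_1["timestamp"])

# Datetime columns expose the .dt accessor for time-based analysis.
tweet_months = df_1["timestamp"].dt.month
```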

Finally, the outliers in the rating denominator were converted to a standardized rating, since the denominator was otherwise consistently 10. In addition, display_text_length was extracted from display_text_range in `df_3`.
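Both steps can be sketched as below, under two assumptions that the report does not confirm: the standardized rating is the numerator rescaled to a denominator of 10, and display_text_range is a `[start, end]` pair whose difference gives the length. The `rating` column name and the toy values are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows: two ratings with unusual denominators.
df = pd.DataFrame({
    "rating_numerator": [12, 84, 165],
    "rating_denominator": [10, 70, 150],
    "display_text_range": [[0, 85], [0, 140], [27, 105]],
})

# Assumption: rescale every rating so that it is out of 10.
df["rating"] = df["rating_numerator"] * 10 / df["rating_denominator"]

# Assumption: the range is [start, end], so length = end - start.
df["display_text_length"] = df["display_text_range"].apply(
    lambda text_range: text_range[1] - text_range[0]
)
```

For example, 84/70 becomes 12 out of 10, which makes it comparable with the ordinary ratings.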
