## Reporting: wrangle_report

### Table of Contents
<ul>
<li> <a href='#data-gathering'> Data Gathering </a> </li>
<li> <a href='#data-cleaning'> Data Cleaning </a> </li>
<ul>
<li> <a href='#assessing'> Assessing </a> </li>
<li> <a href='#cleaning'> Cleaning </a> </li>
<ul>
<li> <a href='#quality'> Quality </a> </li>
<li> <a href='#tidiness'> Tidiness </a> </li>
</ul>
</ul>
</ul>


The steps I took to wrangle the archived tweet data of `@dogrates` can be summarised as follows:

<a id='data-gathering'> </a>


1.	### Data Gathering
- downloaded the `twitter_archive_enhanced` dataset
- used the Requests library to retrieve `image_predictions`
- retrieved tweet ids, retweet counts and favorite counts from the instructor-provided `.json` file
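
The third step can be sketched as follows. This is a minimal sketch, assuming the instructor-provided file stores one JSON object per line and uses the field names `id`, `retweet_count` and `favorite_count`; the helper name `parse_tweet_counts` and the sample lines are hypothetical.

```python
import json


def parse_tweet_counts(lines):
    """Collect the id, retweet count and favorite count from JSON lines.

    Assumes one JSON object per line (an assumption about the file layout).
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        tweet = json.loads(line)
        records.append({
            "tweet_id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })
    return records


# Hypothetical sample standing in for the real file contents
sample_lines = [
    '{"id": 1, "retweet_count": 10, "favorite_count": 25, "lang": "en"}',
    '{"id": 2, "retweet_count": 3, "favorite_count": 7, "lang": "en"}',
]
tweet_counts = parse_tweet_counts(sample_lines)
```

With a real file, the same function can be passed an open file handle, and the resulting list of records can be handed to `pandas.DataFrame`.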

<a id='data-cleaning'> </a>


2.	### Data Cleaning

<a id='assessing'> </a>

##### Assessing
I used visual and programmatic assessments to identify quality and tidiness issues within the three datasets.


<a id='cleaning'> </a>

##### Cleaning
I used the **Define**/**Code**/**Test** framework to resolve the issues I identified in the assessing stage. Before starting any cleaning operation, I made copies of all the original datasets.
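
One pass through the framework might look like the sketch below. It assumes a tiny stand-in dataframe with made-up values, and the name `archive_clean` is hypothetical; the issue being cleaned is the removal of the `source` column.

```python
import pandas as pd

# Hypothetical stand-in for one of the original datasets
twitter_archive = pd.DataFrame({"tweet_id": [1, 2], "source": ["iPhone", "Web"]})

# Work on a copy so the original stays available for comparison
archive_clean = twitter_archive.copy()

# Define: remove the `source` column, which is not needed for the analysis
# Code:
archive_clean = archive_clean.drop(columns=["source"])

# Test:
assert "source" not in archive_clean.columns
assert "source" in twitter_archive.columns  # the original is untouched
```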


<a id='quality'> </a>

##### Quality
1. Dissimilar ordering of `tweet_id` values across the datasets: I resolved the disparity by:
    a. creating a list of `tweet_ids`
    b. setting this list to be the index of the `image_predictions` dataset
    c. resetting the index of the `image_predictions` dataset
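
One way to bring `image_predictions` into the `tweet_id` order of another dataframe is sketched below. It uses `set_index`, `reindex` and `reset_index`, which may differ from the exact calls used in the project; the sample values and the name `aligned` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins: same ids, different row order
twitter_archive = pd.DataFrame({"tweet_id": [30, 10, 20]})
image_predictions = pd.DataFrame({
    "tweet_id": [10, 20, 30],
    "p1": ["pug", "corgi", "beagle"],
})

# The list of ids defines the target order
tweet_ids = twitter_archive["tweet_id"].tolist()

aligned = (
    image_predictions.set_index("tweet_id")
    .reindex(tweet_ids)        # reorder the rows to follow the list
    .rename_axis("tweet_id")   # keep the index name explicit
    .reset_index()             # turn the ids back into a regular column
)
```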

2. Unnecessary columns, e.g. `source` in `twitter_archive`: This column was irrelevant to my analysis because the device used by the account holder has no bearing on the rating score or tweet popularity. I used pandas' `drop` method to remove the column from the dataframe.

3. Retweets included in the `twitter_archive` dataset: I found that three columns, `retweeted_status_id`, `retweeted_status_user_id` and `retweeted_status_timestamp`, each had 181 non-null values. I inferred that the rows with non-null values were retweets. To address this issue, I:
    a. filtered `twitter_archive` to keep only the rows where `retweeted_status_id` was null
    b. dropped the `retweeted_status_user_id` and `retweeted_status_timestamp` columns.
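
The two steps above can be sketched as follows, assuming a small stand-in dataframe with made-up values; the name `originals` is hypothetical.

```python
import pandas as pd

# Hypothetical stand-in: the second row is a retweet
twitter_archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweeted_status_id": [None, 99.0, None],
    "retweeted_status_user_id": [None, 42.0, None],
    "retweeted_status_timestamp": [None, "2017-01-01 00:00:00 +0000", None],
})

# Keep only the rows that are not retweets
originals = twitter_archive[twitter_archive["retweeted_status_id"].isnull()]

# Drop the two retweet columns named in step b
originals = originals.drop(
    columns=["retweeted_status_user_id", "retweeted_status_timestamp"]
)
```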

4. Tweets with no images: I wanted to keep only tweets with images so that every tweet in my master dataframe would have image predictions. Since the `twitter_archive` dataframe contained no images column, I inferred that tweets whose ids were absent from the `image_predictions` dataframe contained no images. To resolve this issue, I:
    a. collected:
        - all ids in `twitter_archive` into a list
        - all ids in `image_predictions` into a second list
        - all ids present in the `twitter_archive` list but absent from the `image_predictions` list into a third list
    b. excluded the ids in the third list from `twitter_archive`.
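
The id comparison can be sketched as follows, assuming both dataframes carry a `tweet_id` column; the sample values and the names `missing_ids` and `archive_with_images` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins: tweets 2 and 4 have no image predictions
twitter_archive = pd.DataFrame({"tweet_id": [1, 2, 3, 4]})
image_predictions = pd.DataFrame({"tweet_id": [1, 3]})

archive_ids = twitter_archive["tweet_id"].tolist()
prediction_ids = set(image_predictions["tweet_id"])  # a set makes lookups fast

# Ids present in the archive but absent from the predictions
missing_ids = [i for i in archive_ids if i not in prediction_ids]

# Exclude those ids from the archive
archive_with_images = twitter_archive[~twitter_archive["tweet_id"].isin(missing_ids)]
```

Filtering directly with `twitter_archive["tweet_id"].isin(image_predictions["tweet_id"])` would give the same rows without the intermediate lists.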

5. Missing values: The `in_reply_to_user_id` and `in_reply_to_status_id` columns contained null values. I filled these with the word 'Empty' instead of dropping the affected rows because there were too many of them.
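
A minimal sketch of the fill, assuming the two reply columns are named `in_reply_to_status_id` and `in_reply_to_user_id` and hold floats with nulls; the sample values are made up. The columns are cast to `object` first so that a string can sit alongside the numeric ids.

```python
import pandas as pd

# Hypothetical stand-in: the first tweet is not a reply
twitter_archive = pd.DataFrame({
    "tweet_id": [1, 2],
    "in_reply_to_status_id": [None, 555.0],
    "in_reply_to_user_id": [None, 777.0],
})

reply_cols = ["in_reply_to_status_id", "in_reply_to_user_id"]
for col in reply_cols:
    # Cast to object so the column can hold both numbers and the filler string
    twitter_archive[col] = twitter_archive[col].astype(object).fillna("Empty")
```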

6. Non-dog predictions: The most reliable way to identify non-dog predictions was to find the rows where none of the three predictions was True for dog. So, I:
    a. created a queried dataframe containing the rows from the `predictions` dataframe where at least one of the three predictions was True for dog
    b. replaced the predictions dataframe with the queried dataframe, which removed the non-dog rows
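
The query can be sketched as follows. The boolean column names `p1_dog`, `p2_dog` and `p3_dog` are assumptions, as are the sample values and the name `dog_predictions`.

```python
import pandas as pd

# Hypothetical stand-in: tweet 2 has no dog among its three predictions
image_predictions = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "p1_dog": [True, False, False],
    "p2_dog": [False, False, True],
    "p3_dog": [False, False, False],
})

# True where at least one of the three predictions is a dog
is_dog = image_predictions[["p1_dog", "p2_dog", "p3_dog"]].any(axis=1)

# Keep only those rows
dog_predictions = image_predictions[is_dog]
```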

7 & 8. `date` and `time` in `twitter_archive` stored as strings and combined in one column: One rule of tidy data is that every variable must form a column. Also, the values were in string rather than datetime format. Therefore, I:
    a. converted the `timestamp` column to a datetime object
    b. used the `datetime` function to extract `date` and `time` into two new columns
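
A minimal sketch of the conversion and split, assuming timestamps formatted like the made-up samples below. It uses `pd.to_datetime` and the `.dt` accessor, which may differ from the exact calls used in the project.

```python
import pandas as pd

# Hypothetical stand-in with string timestamps
twitter_archive = pd.DataFrame({
    "tweet_id": [1, 2],
    "timestamp": ["2017-08-01 16:23:56 +0000", "2016-12-25 08:05:00 +0000"],
})

# Convert the strings to datetime values
twitter_archive["timestamp"] = pd.to_datetime(twitter_archive["timestamp"])

# Split the two variables into their own columns
twitter_archive["date"] = twitter_archive["timestamp"].dt.date
twitter_archive["time"] = twitter_archive["timestamp"].dt.time
```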
    

9. Unequal lengths of dataframes: To resolve this, I merged the three dataframes using an inner join, which keeps only the rows that have a match in all three dataframes.
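
The merge can be sketched as follows, assuming `tweet_id` is the join key; the sample values and the names `tweet_counts` and `master` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins of unequal length
twitter_archive = pd.DataFrame({"tweet_id": [1, 2, 3], "rating": [12, 13, 11]})
image_predictions = pd.DataFrame({"tweet_id": [1, 3, 4], "p1": ["pug", "corgi", "beagle"]})
tweet_counts = pd.DataFrame({"tweet_id": [1, 2, 3, 4], "retweet_count": [5, 6, 7, 8]})

# An inner join keeps only the ids found in every dataframe
master = (
    twitter_archive
    .merge(image_predictions, on="tweet_id", how="inner")
    .merge(tweet_counts, on="tweet_id", how="inner")
)
```

Here only tweets 1 and 3 appear in all three inputs, so `master` has two rows.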


<a id='tidiness'> </a>

##### Tidiness
1. Dog breed predictions in the `image_predictions` dataframe did not have a uniform letter case, so I applied `.str.lower()` to them.

2. `dog_stage` variable spread across four columns: To resolve this issue, I:
    a. wrote a function to search through the four dog-stage columns in each row and return the non-null values
    b. created a new column and assigned the function's return values to it
    c. dropped the columns that were no longer needed
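
The three steps can be sketched as follows. The four column names (`doggo`, `floofer`, `pupper`, `puppo`), the use of nulls for missing stages, and the helper name `find_stage` are assumptions; the sample values are made up.

```python
import pandas as pd

# Assumed names of the four dog-stage columns
stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Hypothetical stand-in: tweet 3 has no stage recorded
twitter_archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", None, None],
    "floofer": [None, None, None],
    "pupper": [None, "pupper", None],
    "puppo": [None, None, None],
})


def find_stage(row):
    """Return the non-null stage(s) in a row, or None if there are none."""
    stages = [row[col] for col in stage_cols if pd.notnull(row[col])]
    return ", ".join(stages) if stages else None


# Collapse the four columns into one variable, then drop the originals
twitter_archive["dog_stage"] = twitter_archive.apply(find_stage, axis=1)
twitter_archive = twitter_archive.drop(columns=stage_cols)
```

Joining with a comma keeps rows that carry more than one stage readable instead of silently picking one.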
