## Reporting: wrangle_report


This report is split into three sections:
- Data Retrieval
- Data Assessment
- Data Cleaning

### Data Retrieval

Before any analysis could begin, the data had to be gathered. Three methods were used.
First, the CSV file "twitter-archive-enhanced.csv" was downloaded directly.
The file was made available via the Nanodegree resources folder on Udacity. This CSV file was used to create the first dataframe, "df_archive".


Next, data was retrieved by programmatically downloading it from the internet using the requests module. The data was read from the URL "[https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv)" and the contents were stored locally in the file "image-predictions.tsv". This TSV file was used to create the second dataframe, "df_images".
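
The download step can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper name `download_file` and the 30-second timeout are assumptions made for the example.

```python
import requests


def download_file(url, dest_path, timeout=30):
    """Download `url` and store the raw bytes at `dest_path`.

    `download_file` and the default timeout are illustrative choices,
    not names taken from the project notebook.
    """
    response = requests.get(url, timeout=timeout)
    # Fail loudly on HTTP errors instead of saving an error page to disk.
    response.raise_for_status()
    with open(dest_path, "wb") as fh:
        fh.write(response.content)
    return dest_path
```

Calling `download_file(url, "image-predictions.tsv")` with the URL given above would then leave a local file that can be read into a dataframe with a tab separator.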

Lastly, additional data for analysis was obtained via the Twitter API. Using the tweepy module, a client library for the Twitter REST API, requests for tweet information were made directly to Twitter and the output was stored locally in a flat file, "tweet_json.txt".
To access the Twitter API, it was necessary to acquire secret keys, consumer keys, bearer tokens and the like. These keys were obtained from the Twitter Developer Portal after Twitter approved my request for elevated access to their API.
Note also that the tweet_json.txt file had to be built iteratively, as the Twitter API allows a maximum of 900 requests within a 15-minute interval.
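
One way to build the file iteratively is to make each run resumable: skip the tweets already saved, stop once the request budget for the window is used up, and append the rest on a later run. The sketch below is an assumption about how this could be done, not the notebook's code. `append_tweets` and `fetch_tweet` are hypothetical names; with tweepy, `fetch_tweet` might wrap a call such as `api.get_status(tweet_id, tweet_mode="extended")._json`.

```python
import json
import os


def append_tweets(tweet_ids, fetch_tweet, out_path, max_requests=900):
    """Append missing tweets to `out_path`, one JSON object per line.

    `fetch_tweet` is a hypothetical callable that takes a tweet id and
    returns a dict (or raises if the tweet cannot be retrieved).
    At most `max_requests` calls are made per run.
    Returns (ids fetched, ids that failed) for this run.
    """
    # Collect the ids that earlier runs have already written.
    done = set()
    if os.path.exists(out_path):
        with open(out_path, encoding="utf-8") as fh:
            done = {json.loads(line)["id"] for line in fh if line.strip()}

    fetched, failed = [], []
    with open(out_path, "a", encoding="utf-8") as fh:
        for tweet_id in tweet_ids:
            if tweet_id in done:
                continue
            if len(fetched) + len(failed) >= max_requests:
                break  # budget for this window is spent; resume later
            try:
                tweet = fetch_tweet(tweet_id)
            except Exception:
                # Deleted or protected tweets end up here.
                failed.append(tweet_id)
                continue
            fh.write(json.dumps(tweet) + "\n")
            fetched.append(tweet_id)
    return fetched, failed
```

Running the function again after the 15-minute window has passed picks up where the previous run stopped. Ids that failed are retried on every run, which may or may not be desirable.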

With these three files in hand, three dataframes were created ("df_archive", "df_images", "df_all_json").
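
A rough sketch of how the three dataframes could be created from the three files is shown below. The helper names `load_tweet_json` and `load_frames` are hypothetical, and the choice to keep only `id`, `retweet_count` and `favorite_count` from each tweet is an assumption for the example.

```python
import json

import pandas as pd


def load_tweet_json(path):
    """Read a file with one tweet JSON object per line into a dataframe.

    Only three fields are kept; this selection is an assumption.
    """
    rows = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            tweet = json.loads(line)
            rows.append({
                "id": tweet["id"],
                "retweet_count": tweet["retweet_count"],
                "favorite_count": tweet["favorite_count"],
            })
    return pd.DataFrame(rows, columns=["id", "retweet_count", "favorite_count"])


def load_frames(archive_path, images_path, json_path):
    """Create the three dataframes from the three gathered files."""
    df_archive = pd.read_csv(archive_path)
    df_images = pd.read_csv(images_path, sep="\t")  # tab-separated file
    df_all_json = load_tweet_json(json_path)
    return df_archive, df_images, df_all_json
```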

[Click for Data Gathering code](wrangle_act.ipynb#Data-Gathering)

---

### Data Assessment

Data assessment combined visual (manual) inspection with programmatic checks.

The aim of the data assessment stage was to check whether the acquired data was free of **quality** and **tidiness** issues.
Such issues can hinder accurate data analysis and lead to wrong inferences and conclusions. A list of quality and tidiness issues was therefore identified and documented during this stage. The identified issues were then addressed in the data cleaning stage.

Below is the list of issues identified during data assessment.

#### Quality issues

1. twitter-archive-enhanced.csv includes replies and retweets, which are not meant to feature in the analysis (df_archive)

2. timestamp column should be represented as a datetime instead of as a string (df_archive)

3. The rating_numerator and rating_denominator columns should be represented as floats, as these columns could contain decimal values (df_archive)

4. 23 records in twitter-archive-enhanced.csv have inaccurate data, as the denominators are less than 10 (df_archive)
    Re-compute the inaccurate ratings using the tweet text

5. 543 records in image-predictions.tsv were not recognized as dogs by the first prediction (df_images)

6. The tweet_id column in twitter-archive-enhanced.csv should be represented as an object instead of an integer, as it is not treated as a numerical field. The same applies to the tweet_id column in image-predictions.tsv and the id column in tweet_json.txt (df_archive, df_images, df_all_json)

7. 55 dogs are recorded as having the name "**a**" in twitter-archive-enhanced.csv. This is an inaccurate name and needs to be looked into. Some other inaccurate names were also spotted (df_archive)

8. 31 tweet records could not be retrieved into tweet_json.txt using tweepy, so their favorite counts cannot be derived (df_all_json)
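
Several of the quality issues above (the retweets and replies, and the three data type issues) lend themselves to a short pandas sketch. The function name `apply_basic_quality_fixes` is hypothetical, and the column names `retweeted_status_id` and `in_reply_to_status_id` are assumptions about how the archive marks retweets and replies.

```python
import pandas as pd


def apply_basic_quality_fixes(df_archive):
    """Return a cleaned copy of the archive; the input is left untouched."""
    # Assumption: a non-null value in either column marks a retweet or reply.
    is_retweet = df_archive["retweeted_status_id"].notna()
    is_reply = df_archive["in_reply_to_status_id"].notna()
    df = df_archive[~(is_retweet | is_reply)].copy()

    # Timestamps arrive as strings; parse them into datetimes.
    df["timestamp"] = pd.to_datetime(df["timestamp"])

    # Ratings may carry decimals, so store both parts as floats.
    df["rating_numerator"] = df["rating_numerator"].astype(float)
    df["rating_denominator"] = df["rating_denominator"].astype(float)

    # Tweet ids are identifiers, not quantities, so keep them as strings.
    df["tweet_id"] = df["tweet_id"].astype(str)

    return df.reset_index(drop=True)
```

Working on a copy keeps the original dataframe available for comparison after cleaning.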

#### Tidiness issues
1. The dog classifications doggo, pupper, puppo and floofer should be collapsed into one column

2. Since a dog has only one breed (1-to-1 relationship), the dog breed name should be extracted from image-predictions.tsv and appended to twitter-archive-enhanced.csv. This also applies to the jpg_url field. The retweet_count and favorites_count fields should likewise be extracted from the tweet_json.txt dataset and appended to the twitter-archive-enhanced dataset, as this is also a 1-to-1 relationship.
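
Both tidiness fixes can be sketched in pandas as below. This is an illustration under several assumptions: the stage columns are spelled `doggo`, `floofer`, `pupper` and `puppo` and hold either the stage name or the string "None"; the first prediction lives in a column called `p1`; the counts column is named `favorite_count`; and the id columns already share one dtype. The function names and the new `dog_stage` and `breed` columns are hypothetical.

```python
import pandas as pd

STAGES = ["doggo", "floofer", "pupper", "puppo"]  # assumed column names


def collapse_stages(df):
    """Replace the four stage columns with a single `dog_stage` column."""
    df = df.copy()

    def join_stages(row):
        # A stage counts as present when the cell repeats the column name.
        found = [stage for stage in STAGES if row[stage] == stage]
        return ",".join(found) if found else None

    df["dog_stage"] = df[STAGES].apply(join_stages, axis=1)
    return df.drop(columns=STAGES)


def merge_tables(df_archive, df_images, df_all_json):
    """Left-join breed, image URL and counts onto the archive by tweet id."""
    images = df_images[["tweet_id", "jpg_url", "p1"]].rename(columns={"p1": "breed"})
    counts = df_all_json[["id", "retweet_count", "favorite_count"]].rename(
        columns={"id": "tweet_id"}
    )
    merged = df_archive.merge(images, on="tweet_id", how="left")
    return merged.merge(counts, on="tweet_id", how="left")
```

A left join keeps every archive row, so tweets without an image prediction or without retrieved counts show up with missing values rather than disappearing silently.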


---
---

### Data Cleaning

In the data cleaning stage, the quality and tidiness issues identified earlier were either eliminated or better understood.
Issues such as wrong ratings were addressed by extracting the correct ratings from the tweet text and replacing the recorded values with the extracted ones.
Other issues, such as the presence of null values as documented in issues 1, 2 and 3, were addressed by simply dropping those columns.
As the analysis was focused on dog ratings, it became imperative during data cleaning to drop all records whose images were not predicted to be dog images.

Methods used in the cleaning process ranged from substituting values to dropping unused columns and dropping null values.
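
Extracting a rating from the tweet text can be sketched with a regular expression. The pattern and the helper name `extract_rating` below are assumptions for illustration; the notebook may use a different expression.

```python
import re

# A numerator with an optional decimal part, a slash, then the denominator.
RATING_RE = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")


def extract_rating(text):
    """Return (numerator, denominator) as floats, or None if no rating is found.

    Only the first match is used, so text such as "24/7" appearing before
    the real rating would be picked up instead.
    """
    match = RATING_RE.search(text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))
```

For example, `extract_rating("This is Bella. 13.5/10 would pet")` returns `(13.5, 10.0)`, which preserves the decimal part of the numerator.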

[Click for Consolidated Cleaning Code](wrangle_act.ipynb#Consolidated-quality-cleaning)

---
---



Following data retrieval, data assessment and data cleaning, more accurate insights and visualizations could be derived from the datasets.