## Wrangle Report

This document briefly describes the wrangling efforts for the WeRateDogs Twitter dataset in the wrangle_act.ipynb notebook.

### About the Dataset(s)

> The dataset I'll be wrangling is the tweet archive of Twitter user @dog_rates
(https://twitter.com/dog_rates), also known as WeRateDogs. The archive contains basic data for 2356
tweets posted from November 2015 to August 2017. WeRateDogs is a Twitter account that rates people's
dogs and adds a humorous comment about each dog.
A second dataset, built from the images in the WeRateDogs Twitter archive, contains image
predictions (the top three only) alongside each tweet ID, image URL, and the image number that
corresponded to the most confident prediction (numbered 1 to 4, since tweets can have up to four
images). Although no wrangling is done directly on this image predictions dataset, it provides
additional data for the main tweet archive dataset.

## Gathering Data

The data was gathered from the following sources:

- File on hand - twitter_archive_enhanced.csv
- File hosted on Udacity's servers - image_predictions.tsv
- Query of Twitter API - tweet_json.txt

- ***File on hand - twitter_archive_enhanced.csv***

Using the link provided by Udacity, I manually downloaded the WeRateDogs Twitter archive as
twitter_archive_enhanced.csv
(https://d17h27t6h515a5.cloudfront.net/topher/2017/August/59a4e958_twitter-archiveenhanced/twitter-archive-enhanced.csv) and imported the file into a dataframe (df).


- ***File hosted on Udacity's servers - image_predictions.tsv***

I downloaded the tweet image predictions file hosted on Udacity's servers programmatically using
Python's Requests library and saved it locally as image_predictions.tsv. I then imported this file
into a pandas dataframe (df_image).
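
The download-and-load step could look roughly like the sketch below. The function names `download_file` and `load_predictions` and the placeholder URL are illustrative assumptions, not taken from the notebook; the real file address is the one supplied by Udacity.

```python
from pathlib import Path

import pandas as pd
import requests


def download_file(url: str, destination: Path, timeout: float = 30.0) -> Path:
    """Download `url` and write the response body to `destination`."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors
    destination.write_bytes(response.content)
    return destination


def load_predictions(path: Path) -> pd.DataFrame:
    """Read the tab-separated predictions file into a dataframe."""
    return pd.read_csv(path, sep="\t")


# Hypothetical usage; PREDICTIONS_URL is a placeholder, not the real address:
# PREDICTIONS_URL = "https://example.com/image_predictions.tsv"
# df_image = load_predictions(
#     download_file(PREDICTIONS_URL, Path("image_predictions.tsv"))
# )
```

Keeping the download and the parsing in separate functions means the file can be re-read later without hitting the server again.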


- ***Query of Twitter API - tweet_json.txt***

I could not get access to the Twitter API, so I downloaded the tweet_json.txt file provided on Udacity's servers using Python's Requests library.
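
As a rough sketch of how such a file can be turned into a table, the example below assumes that tweet_json.txt holds one JSON object per line and that each object carries `id`, `retweet_count`, and `favorite_count` fields. Both the layout and the helper name `read_tweet_json` are assumptions for illustration; the sample values are made up.

```python
import io
import json

import pandas as pd


def read_tweet_json(lines) -> pd.DataFrame:
    """Build a dataframe from an iterable of JSON strings, one tweet per line."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        records.append(
            {
                "tweet_id": tweet["id"],
                "retweet_count": tweet["retweet_count"],
                "favorite_count": tweet["favorite_count"],
            }
        )
    columns = ["tweet_id", "retweet_count", "favorite_count"]
    return pd.DataFrame(records, columns=columns)


# Tiny in-memory stand-in for tweet_json.txt (made-up values).
sample = io.StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 25}\n'
    '{"id": 2, "retweet_count": 3, "favorite_count": 7}\n'
)
df_counts = read_tweet_json(sample)
print(df_counts)
```

With the real file, the same function can be called on an open file handle, for example `with open("tweet_json.txt") as f: read_tweet_json(f)`.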

## Assessing Data

First, I identified two quality issues just by reading the Key Points on the Project
Motivation page.

- ***Visual Assessment***

> I opened twitter_archive_enhanced.csv and image_predictions.tsv using pandas and scrolled
through them, looking for quality and tidiness issues. I observed the following:

### Quality

- The columns in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, and retweeted_status_timestamp contain mostly null values in the `twitter archive` table
- Remove the columns with retweet IDs from the `twitter archive` table, as we need only original ratings, not retweets
- The tweet_id column name is inconsistent across the three tables
- Some of the predictions are not dogs in the `image prediction` table

### Tidiness

- The doggo, pupper, puppo, and floofer columns should form a single column called dog stage in the `twitter archive` table
- The name and dog stage can be extracted from the text column in the `twitter archive` table
- In the `additional file` table, tweet_ID should be renamed to tweet_id to match the other two tables

- ***Programmatic Assessment***

I then assessed the data programmatically using pandas and the following methods:
- .head()
- .describe()
- .info()
- .duplicated()
- .value_counts()
- .query()
- .sum()
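
To illustrate how these methods surface issues, here is a small sketch on made-up rows that mimic a few archive columns. The values are invented for the example and are not taken from the real archive.

```python
import pandas as pd

# Made-up rows that mimic a few columns of the archive.
df = pd.DataFrame(
    {
        "tweet_id": [1, 2, 3, 3],
        "rating_numerator": [12, 5, 420, 420],
        "rating_denominator": [10, 10, 10, 10],
        "name": ["Bella", "a", "None", "None"],
    }
)

df.info()             # dtypes and non-null counts (prints directly)
print(df.head())      # first rows, for a quick look at the values
print(df.describe())  # summary statistics for the numeric columns

# Count fully duplicated rows.
n_duplicates = df.duplicated().sum()

# Frequency of each name; placeholder values such as "None" stand out.
name_counts = df["name"].value_counts()

# Rows with unusually large numerators or denominators other than 10.
odd_ratings = df.query("rating_numerator > 20 or rating_denominator != 10")
print(n_duplicates, len(odd_ratings))
```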

### Quality

##### twitter archive table

- tweet_id should be a string
- The columns retweeted_status_id, retweeted_status_user_id, and retweeted_status_timestamp contain only null values, with no duplicates; remove the retweet ID columns, as we need only original ratings, not retweets
- rating_denominator contains the value 00, which should be changed to 0
- rating_denominator should be an int, as it is a rating number rather than a string
- rating_numerator should be an int, as it is a rating number rather than a string
- Unnecessary HTML tags in the source column
- rating_numerator column has values less than 10 as well as some very large numbers
- rating_denominator column has values other than 10
- Remove the row at index 8, as it has many missing values
- Remove rows for tweets posted after August 1st, 2017
- Remove the rows for retweets, since retweets are essentially duplicates of the original tweets and may skew the results of the analysis
- Erroneous values observed in the rating column
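
A few of the fixes listed above can be sketched as follows. This is a minimal example on invented rows, assuming a retweet is marked by a non-null retweeted_status_id and that the source column holds a single HTML anchor tag; the notebook's actual code may differ.

```python
import pandas as pd

# Invented rows; the second one stands in for a retweet.
archive = pd.DataFrame(
    {
        "tweet_id": [101, 102, 103],
        "retweeted_status_id": [None, 555.0, None],
        "source": [
            '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
            '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
            '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
        ],
    }
)

# Keep original tweets only (assumption: retweets have a non-null
# retweeted_status_id), then drop the now-empty retweet column.
archive_clean = archive[archive["retweeted_status_id"].isna()].copy()
archive_clean = archive_clean.drop(columns=["retweeted_status_id"])

# Store identifiers as strings so they are never treated as numbers.
archive_clean["tweet_id"] = archive_clean["tweet_id"].astype(str)

# Strip the HTML tags and keep only the visible link text.
archive_clean["source"] = archive_clean["source"].str.replace(
    r"<[^>]+>", "", regex=True
)
print(archive_clean)
```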

### Tidiness

##### twitter archive table
- The doggo, pupper, puppo, and floofer columns should form a single column called dog stage
- The name and dog stage can be extracted from the text column
- Convert the timestamp column to datetime
- Remove tweets posted after August 1st, 2017 in order to merge successfully with image_id, as there are no algorithm results for dates after August 1st, 2017.
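
The stage and timestamp fixes can be sketched as below. The example assumes each stage column holds either the stage name or the string "None", and it reads "after August 1st, 2017" as any timestamp on or after August 2nd (UTC). The column name `dog_stage` and the helper `combine_stages` are illustrative choices.

```python
import pandas as pd

# Invented rows; "None" marks a stage that does not apply (assumption).
archive = pd.DataFrame(
    {
        "tweet_id": ["1", "2", "3"],
        "timestamp": [
            "2017-08-05 16:23:56 +0000",
            "2017-07-30 15:58:51 +0000",
            "2016-06-01 12:00:00 +0000",
        ],
        "doggo": ["doggo", "None", "doggo"],
        "floofer": ["None", "None", "None"],
        "pupper": ["None", "None", "pupper"],
        "puppo": ["None", "None", "None"],
    }
)

stage_cols = ["doggo", "floofer", "pupper", "puppo"]


def combine_stages(row):
    """Join every stage present in the row; return None when there is none."""
    stages = [row[col] for col in stage_cols if row[col] != "None"]
    return ", ".join(stages) if stages else None


tidy = archive.copy()
tidy["dog_stage"] = tidy[stage_cols].apply(combine_stages, axis=1)
tidy = tidy.drop(columns=stage_cols)

# Parse the timestamps as timezone-aware datetimes.
tidy["timestamp"] = pd.to_datetime(tidy["timestamp"], utc=True)

# Keep tweets posted up to and including August 1st, 2017 (UTC).
cutoff = pd.Timestamp("2017-08-02", tz="UTC")
tidy = tidy[tidy["timestamp"] < cutoff]
print(tidy)
```

Joining multiple stages with a comma keeps tweets that mention two stages from being silently assigned to only one of them.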

##### images


##### twtcount
- tweet_ID should be renamed to tweet_id to conform with the two tables
- twtcount and df should be joined together and then joined with image data
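
The rename and the two joins might look like the sketch below. The rows are invented, the prediction column `p1` is included only for illustration, and the choice of inner joins is an assumption, since the report does not state the join type.

```python
import pandas as pd

# Invented stand-ins for the three tables.
df = pd.DataFrame({"tweet_id": ["1", "2", "3"], "rating_numerator": [12, 13, 11]})
twtcount = pd.DataFrame(
    {
        "tweet_ID": ["1", "2", "4"],
        "retweet_count": [10, 3, 8],
        "favorite_count": [25, 7, 30],
    }
)
df_image = pd.DataFrame({"tweet_id": ["1", "3"], "p1": ["golden_retriever", "pug"]})

# Align the key name with the other two tables.
twtcount = twtcount.rename(columns={"tweet_ID": "tweet_id"})

# Join the archive with the counts, then with the image predictions.
# Inner joins (an assumption) keep only tweets present in all three tables.
master = df.merge(twtcount, on="tweet_id", how="inner").merge(
    df_image, on="tweet_id", how="inner"
)
print(master)
```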

## Cleaning Data

Because all the quality and tidiness issues related to the `df_tweet_clean` table, I created a copy of only this table
and named it archive_clean. For each quality or tidiness issue, I cleaned the data programmatically
in three stages: Define, Code, and Test. During the cleaning process, I converted the source column
and the newly created stage column of archive_clean to the category datatype.
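
One Define, Code, Test cycle for the category conversion could be sketched as follows. The rows are invented and the column name `stage` follows the wording above; the notebook's own cells may be organised differently.

```python
import pandas as pd

# Invented rows standing in for the table being cleaned.
df_tweet_clean = pd.DataFrame(
    {
        "source": ["Twitter for iPhone", "Twitter Web Client", "Twitter for iPhone"],
        "stage": ["doggo", "pupper", "doggo"],
    }
)

# Define: source and stage hold a small set of repeated labels,
# so convert both columns to the category datatype.

# Code: work on a copy so the original table stays untouched.
archive_clean = df_tweet_clean.copy()
for column in ["source", "stage"]:
    archive_clean[column] = archive_clean[column].astype("category")

# Test: confirm that both columns now use a categorical dtype.
converted = [
    isinstance(archive_clean[column].dtype, pd.CategoricalDtype)
    for column in ["source", "stage"]
]
print(converted)
```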

## Storing Data

After completing the cleaning process, I stored the `df_tweet_clean` table in the
twitter_archive_master.csv file.
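
A minimal sketch of the storing step is shown below, using invented rows and a temporary directory so the example does not overwrite a real file. Passing `index=False` and re-reading tweet_id as a string are assumptions about sensible defaults, not details confirmed by the notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Invented rows standing in for the cleaned table.
df_tweet_clean = pd.DataFrame(
    {"tweet_id": ["101", "102"], "rating_numerator": [12, 13]}
)

# Write to a temporary directory for this example; the report stores
# the file as twitter_archive_master.csv in the project folder.
output_path = Path(tempfile.mkdtemp()) / "twitter_archive_master.csv"

# index=False avoids writing the dataframe index as an extra column.
df_tweet_clean.to_csv(output_path, index=False)

# Reload with tweet_id as a string, since CSV files do not store dtypes.
reloaded = pd.read_csv(output_path, dtype={"tweet_id": str})
print(reloaded)
```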