# Reporting: wrangle_report
* Create a **300-600 word written report** called "wrangle_report.pdf" or "wrangle_report.html" that briefly describes your wrangling efforts. This is to be framed as an internal document.

This document briefly describes the wrangling efforts for the WeRateDogs Twitter dataset in the wrangle_act.ipynb notebook.

# Gathering Data

The data were gathered from three sources:
1. A file on hand: `twitter_archive_enhanced.csv`
2. A file hosted on a Udacity server: `image_predictions.tsv`
3. A query of the Twitter API: `tweet_json.txt`

twitter_archive_enhanced.csv: This .csv file was read directly into a DataFrame using the pandas library.

image_predictions.tsv: This file is hosted on the internet, so it was downloaded programmatically using the requests library. After the download, the pandas library was used to read the .tsv file into a DataFrame.

tweet_json.txt: I used the tweet_json.txt file provided by Udacity because I could not obtain the necessary secret keys from the Twitter developer dashboard; my sign-up request had not yet been approved. The text file was read line by line, and the tweet_id, favorite_count, and retweet_count of each tweet were appended to a DataFrame.

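The line-by-line read of tweet_json.txt can be sketched as follows. This is a minimal sketch, not the notebook's exact code: the helper name `read_tweet_json` is hypothetical, and it assumes that each line holds one JSON tweet object with `id`, `favorite_count`, and `retweet_count` keys. A two-line in-memory sample stands in for the real file.

```python
import io
import json

import pandas as pd


def read_tweet_json(lines):
    """Collect tweet_id, favorite_count, and retweet_count from JSON lines.

    Hypothetical helper; the key names are assumed from the Twitter tweet object.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        rows.append({
            "tweet_id": tweet["id"],
            "favorite_count": tweet["favorite_count"],
            "retweet_count": tweet["retweet_count"],
        })
    return pd.DataFrame(rows, columns=["tweet_id", "favorite_count", "retweet_count"])


# In-memory stand-in for tweet_json.txt (made-up values)
sample = io.StringIO(
    '{"id": 1, "favorite_count": 10, "retweet_count": 2}\n'
    '{"id": 2, "favorite_count": 5, "retweet_count": 1}\n'
)
tweet_counts = read_tweet_json(sample)
```

With the real file, the same helper could be called on an open file handle, for example `with open("tweet_json.txt", encoding="utf-8") as f: tweet_counts = read_tweet_json(f)`.
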
# Assessing Data


The data were assessed both visually and programmatically for quality and tidiness issues.

Programmatic methods:
* `data.head()`
* `data.describe()`
* `data.info()`
* `data.duplicated()`
* `data.value_counts()`

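As an illustration of the programmatic methods listed above, here is a minimal sketch on a small made-up DataFrame. The data and the variable name `archive` are assumptions for the example and do not come from the notebook.

```python
import pandas as pd

# Made-up sample standing in for the archive table
archive = pd.DataFrame({
    "tweet_id": [1, 2, 2, 3],
    "name": ["Bella", "a", "a", "None"],
    "rating_numerator": [12, 13, 13, 1776],
    "rating_denominator": [10, 10, 10, 10],
})

print(archive.head())      # first rows, for a quick visual check
archive.info()             # column dtypes and non-null counts
print(archive.describe())  # summary statistics of the numeric columns

# Number of fully duplicated rows
dup_count = int(archive.duplicated().sum())

# Frequency of each value in one column
name_counts = archive["name"].value_counts()

# Lowercase "names" are a hint that the value is not a real dog name
lowercase_names = archive.loc[archive["name"].str.islower(), "name"].unique()
```
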
## Tidiness issues

1. The tweet_archive and image_prediction tables can be joined into one dataset.

2. The dog stage is a single variable that is currently spread across several columns; only one stage column is needed.

## Quality issues

1. The dataset contains some retweets; these will be removed.

2. The source column will have to be changed from an HTML link (URL) to plain text.

3. Non-dog names should be converted to 'None'.

4. Tweets without images will be removed.

5. Unnecessary columns will be removed.

6. The timestamp will be converted to datetime.

7. There are empty values in several columns.

8. Some of the rating_numerator and rating_denominator values are unusual.

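Some of the quality issues above can be surfaced with a few pandas checks. The sketch below uses made-up data; the numerator threshold of 20 is an arbitrary assumption chosen only to illustrate the idea, and a denominator other than 10 is treated as unusual on the same basis.

```python
import numpy as np
import pandas as pd

# Made-up sample; column names follow the archive table
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "retweeted_status_id": [np.nan, 111.0, np.nan, np.nan],
    "rating_numerator": [12, 13, 1776, 9],
    "rating_denominator": [10, 10, 10, 11],
})

# A non-null retweeted_status_id marks a retweet
n_retweets = int(archive["retweeted_status_id"].notna().sum())

# Missing values per column
missing_per_column = archive.isna().sum()

# Ratings worth a manual look (thresholds are assumptions)
odd_denominator = archive[archive["rating_denominator"] != 10]
high_numerator = archive[archive["rating_numerator"] > 20]
```
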
# Cleaning

This section contains the cleaning process performed on the datasets.

## Tidiness issues
1. The three tables were reduced to two by inner-joining archived_tweet_copy with img_dataframe_copy to create a single tweet observation table.
2. The doggo, floofer, pupper, and puppo columns in tweet_combined were combined into one stage column, and the four original columns were then dropped.

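The two tidiness fixes can be sketched as follows. This is an illustration on made-up data, not the notebook's code: it assumes the tables share a `tweet_id` key, that the stage columns hold either the stage name or the string 'None', and the helper `combine_stages` is hypothetical. Rows with more than one stage are joined with a comma, which is one possible choice.

```python
import pandas as pd

# Made-up stand-ins for archived_tweet_copy and img_dataframe_copy
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})
images = pd.DataFrame({
    "tweet_id": [1, 3, 4],
    "p1": ["golden_retriever", "pug", "chihuahua"],
})

# Inner join keeps only tweets present in both tables
tweet_combined = archive.merge(images, on="tweet_id", how="inner")

stage_cols = ["doggo", "floofer", "pupper", "puppo"]


def combine_stages(row):
    """Join every stage set on the row; fall back to 'None' if there is none."""
    stages = [row[col] for col in stage_cols if row[col] != "None"]
    return ", ".join(stages) if stages else "None"


tweet_combined["stage"] = tweet_combined[stage_cols].apply(combine_stages, axis=1)
tweet_combined = tweet_combined.drop(columns=stage_cols)
```
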
## Quality issues
1. Retweet rows and the 'retweeted_status_id', 'retweeted_status_user_id', and 'retweeted_status_timestamp' columns were removed.
2. The source column was stripped of its HTML link, leaving only the text.
3. Non-dog names were replaced with 'None', as were lowercase names. Names that were not found were replaced with NaN (not a number).
4. Rows in which none of the breed predictions was a dog were dropped.
5. I removed the jpg_url, in_reply_to_status_id, in_reply_to_user_id, and expanded_urls columns from the img_dataframe and archived_tweet DataFrames.
6. Timestamps were converted to datetime.
7. Rows with missing values were removed.
8. Variable types were changed to appropriate types.

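Several of the quality fixes above can be sketched on a small made-up table. This is only an illustration under stated assumptions: the source values are assumed to be HTML anchor tags, the regular expression is one possible way to strip them, and lowercase names are treated as non-dog names.

```python
import numpy as np
import pandas as pd

link = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'

# Made-up sample of the archive table
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweeted_status_id": [np.nan, 99.0, np.nan],
    "retweeted_status_user_id": [np.nan, 7.0, np.nan],
    "retweeted_status_timestamp": [None, "2017-01-01 00:00:00 +0000", None],
    "source": [link, link, link],
    "name": ["Bella", "Max", "a"],
    "timestamp": [
        "2017-08-01 16:23:56 +0000",
        "2017-07-31 00:18:03 +0000",
        "2017-07-30 15:58:51 +0000",
    ],
})

# 1. Keep only original tweets, then drop the retweet columns
retweet_cols = [
    "retweeted_status_id",
    "retweeted_status_user_id",
    "retweeted_status_timestamp",
]
clean = archive[archive["retweeted_status_id"].isna()].drop(columns=retweet_cols)

# 2. Keep only the text between the opening and closing anchor tags
clean["source"] = clean["source"].str.extract(r">([^<]*)<", expand=False)

# 3. Replace lowercase "names" with 'None'
clean["name"] = clean["name"].where(~clean["name"].str.islower(), "None")

# 6. Parse the timestamp strings into timezone-aware datetimes
clean["timestamp"] = pd.to_datetime(clean["timestamp"], utc=True)
```
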

# Storing data


The gathered, assessed, and cleaned master dataset was saved to a CSV file named "twitter_archive_master.csv".
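
The save step can be sketched as below. The DataFrame contents are made up, and a temporary directory is used here only so the example does not overwrite a real file; passing `index=False` keeps the row index out of the CSV.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for the cleaned master DataFrame
master = pd.DataFrame({
    "tweet_id": [1, 3],
    "stage": ["doggo", "pupper"],
})

# Write to a temporary directory for the purpose of this example
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "twitter_archive_master.csv")
master.to_csv(out_path, index=False)

# Read the file back to confirm it round-trips
reloaded = pd.read_csv(out_path)
```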