## WeRateDogs - Twitter Data

### I. Gather Data

I followed the instructions given by Udacity on how to gather the data for this data wrangling exercise.

- I downloaded the WeRateDogs™ Twitter archive as a .csv file (twitter-archive-enhanced.csv) from the Udacity section **3. Project Details**.
- Afterwards, I programmatically downloaded the tweet image predictions (image_predictions.tsv), as advised.
- Because I didn't create a Twitter developer account, I downloaded the additional Twitter data (tweet-json.txt) from Udacity.

Once I had gathered the different data sources, I created a DataFrame for each of them.

- *df_twitter* - This is the dataset "twitter-archive-enhanced.csv" and gives information on basic tweet data.  

- *df_predict* - This is the dataset "image_predictions.tsv" and contains image predictions per tweet.

- *df_add* - This is the dataset "tweet-json.txt" and contains additional info about the tweets. Here I preprocessed the id column to match the other DataFrames (renamed "id" to "tweet_id").
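
The preprocessing of "tweet-json.txt" could look roughly like the sketch below. It assumes that the file holds one JSON object per line and that the counts live in the standard Twitter fields `retweet_count`, `favorite_count`, `user.followers_count` and `user.friends_count`. The helper name `parse_tweet_lines` and the sample lines are made up for illustration and are not taken from the notebook.

```python
import json


def parse_tweet_lines(lines):
    """Turn an iterable of JSON strings (one tweet per line) into flat records."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        user = tweet.get("user", {})
        records.append({
            "tweet_id": tweet["id"],  # "id" renamed to match the other tables
            "retweets": tweet.get("retweet_count"),
            "favorites": tweet.get("favorite_count"),
            "followers": user.get("followers_count"),
            "friends": user.get("friends_count"),
        })
    return records


# Made-up sample lines standing in for the real file content
sample_lines = [
    '{"id": 1, "retweet_count": 10, "favorite_count": 25, '
    '"user": {"followers_count": 100, "friends_count": 5}}',
    "",
    '{"id": 2, "retweet_count": 3, "favorite_count": 7, '
    '"user": {"followers_count": 100, "friends_count": 5}}',
]
records = parse_tweet_lines(sample_lines)
```

The resulting list of dictionaries can be passed to `pd.DataFrame(records)` to obtain a table with the same columns as *df_add*.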

### II. Assessing the data

Below is an overview of the columns in each dataset and their descriptions. For every DataFrame I looked at .info(), got a glimpse of the data with .head(), .tail() or .sample(), and tried to understand it that way.

#### 1. `Enhanced Twitter Archive`

The WeRateDogs™ Twitter archive contains basic tweet data for 5000+ of their tweets, but it does not include every tweet. 
One column does contain the text of each tweet, though, which was used to extract the rating, dog name, and dog "stage" (doggo, floofer, pupper, and puppo).
During the assessment I found that:
- not all tweets could be classified correctly
- the source column contains HTML code
- some records/tweets have the name "None" or names that are not real names (like "a", "by" etc.)
- the timestamp columns have incorrect datatypes: they should be datetime instead of object


`df_twitter` columns and their description:
    
- **tweet_id**: the unique identifier of each tweet
- **in_reply_to_status_id**: if the tweet is a reply, the ID of the tweet it replies to
- **in_reply_to_user_id**: if the tweet is a reply, the ID of the user it replies to
- **timestamp**: timestamp of the tweet
- **source**: source of the tweet
- **text**: the text of the tweet
- **retweeted_status_id**: if the tweet is a retweet, the ID of the original tweet
- **retweeted_status_user_id**: if the tweet is a retweet, the ID of the user who posted the original tweet
- **retweeted_status_timestamp**: if the tweet is a retweet, the timestamp of the original tweet
- **expanded_urls**: full URL of the tweet
- **rating_numerator**: the numerator of the rating (5 if the rating is 5/10)
- **rating_denominator**: the denominator of the rating (10 if the rating is 5/10)
- **name**: name of the dog
- **doggo**, **floofer**, **pupper**, **puppo**: the stage of the dog

#### `Quality - df_twitter`

1. Missing values in the columns in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id,
   retweeted_status_timestamp and expanded_urls

2. Incorrect data types for timestamp columns

3. Some names are not real names, like "None", "a", "the" etc. - there are more! Basically those are normal words and not names. 

4. Dogs could only be correctly classified 16.13% of the time (380/(1976+380))

5. There is a wide range of rating_numerators in this dataset, going as high as 1776! The scale only goes from one to ten though...

6. Checking some of the entries with unusual ratings, we can see that they are joke tweets (a chicken, for example)

7. Inconsistent missing values in the doggo, floofer etc. columns (sometimes "None" sometimes real NULL values)
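
Some of these checks can be reproduced with a few lines of pandas. The sketch below uses a small made-up frame (`df_sample`) rather than the real archive, and it assumes that "not real" names are either the string "None" or written entirely in lowercase. That is a heuristic, not a rule confirmed by the data.

```python
import pandas as pd

# Made-up rows that mimic the structure of df_twitter
df_sample = pd.DataFrame({
    "timestamp": [
        "2017-08-01 16:23:56 +0000",
        "2017-08-01 00:17:27 +0000",
        "2017-07-31 00:18:03 +0000",
    ],
    "name": ["Phineas", "a", "None"],
    "doggo": ["None", "doggo", "None"],
})

# Issue 2: the timestamp column is stored as text, not as datetime
timestamp_is_datetime = pd.api.types.is_datetime64_any_dtype(df_sample["timestamp"])

# Issue 3: flag names that are "None" or all lowercase (heuristic)
suspicious_name = df_sample["name"].str.islower() | df_sample["name"].eq("None")

# Issue 7: count the "None" placeholder strings in a stage column
placeholder_count = int(df_sample["doggo"].eq("None").sum())
```

Rows where `suspicious_name` is True are candidates for manual review before any names are removed.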

#### 2. `Image Predictions`

The WeRateDogs Twitter archive was run through a neural network that can classify breeds of dogs. The result is a table of image predictions (the top three only) alongside each tweet ID, image URL, and the image number that corresponds to the most confident prediction (numbered 1 to 4, since tweets can have up to four images).

`df_predict` columns and their description:
    
- **tweet_id**: the last part of the tweet URL after "status/"
- **jpg_url**: image link or URL
- **img_num**: number of the image (1 to 4) that the most confident prediction refers to
- **p1**: the algorithm's #1 prediction for the image in the tweet
- **p1_conf**: how confident the algorithm is in its #1 prediction
- **p1_dog**: whether or not the #1 prediction is a breed of dog
- **p2**: the algorithm's #2 prediction for the image in the tweet
- **p2_conf**: how confident the algorithm is in its #2 prediction
- **p2_dog**: whether or not the #2 prediction is a breed of dog
- **p3**: the algorithm's #3 prediction for the image in the tweet
- **p3_conf**: how confident the algorithm is in its #3 prediction
- **p3_dog**: whether or not the #3 prediction is a breed of dog


#### `Quality - df_predict` table

- No NULL values

- But we can clearly see that there is no standardized format for the dog breed predictions: some start with a capital letter and some do not (German_shepherd but miniature_pinscher)

- Additionally, the words are separated by an underscore instead of whitespace

- Lastly, there are pictures for which none of the predictions is a dog breed (p1_dog, p2_dog and p3_dog are all False)
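
The formatting issues above can be illustrated with two small helpers. Both function names are hypothetical, and lowercasing with spaces is only one possible convention for a standardized breed name.

```python
def normalize_breed(label):
    """Lowercase a predicted label and replace underscores with spaces."""
    return label.replace("_", " ").strip().lower()


def has_dog_prediction(p1_dog, p2_dog, p3_dog):
    """Return True if at least one of the three predictions is a dog breed."""
    return bool(p1_dog or p2_dog or p3_dog)


# The two labels mentioned in the assessment
examples = ["German_shepherd", "miniature_pinscher"]
normalized = [normalize_breed(label) for label in examples]
```

On a DataFrame, `normalize_breed` could be applied with `df_predict["p1"].map(normalize_breed)`, and rows where `has_dog_prediction` is False are the pictures without any dog breed among the three predictions.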


#### 3. `df_add`

`df_add` columns and their description:

- **tweet_id**: the unique identifier of each tweet
- **retweets**: the count of retweets
- **favorites**: the count of favorites
- **followers**: the count of followers
- **friends**: the count of friends

#### `Quality - df_add` table

- No NULL values again!

### III. Cleaning

These are the cleaning steps I performed; everything is documented in more detail in wrangle_act.ipynb:


**1. Merge tables - TIDINESS ISSUE!** 
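
A minimal sketch of such a merge on `tweet_id`, using tiny made-up stand-ins for the three DataFrames. The inner join is an assumption on my side: it keeps only tweets that appear in all three tables, which may differ from the join used in the notebook.

```python
import pandas as pd

# Tiny stand-ins for the three tables (values are made up)
df_twitter = pd.DataFrame({"tweet_id": [1, 2, 3], "text": ["a", "b", "c"]})
df_predict = pd.DataFrame({"tweet_id": [1, 2], "p1": ["pug", "seat_belt"]})
df_add = pd.DataFrame({"tweet_id": [1, 2, 3], "retweets": [10, 0, 5]})

# Join all three tables on the shared key
df_master = (
    df_twitter
    .merge(df_predict, on="tweet_id", how="inner")
    .merge(df_add, on="tweet_id", how="inner")
)
```

With an inner join, tweet 3 is dropped here because it has no image prediction.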

**2. Drop everything we found before: retweets, replies, tweets without an image or tweets without dogs - QUALITY ISSUE!**

**3. Fixing datatypes  - QUALITY ISSUE!**

**4. Clean the source column - QUALITY ISSUE!**
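
One way to clean the source column is to keep only the link text of the HTML anchor. The sketch below assumes that each value is a single `<a ...>label</a>` tag; the helper name `extract_source_label` is hypothetical.

```python
import re

# Matches the text between an opening <a ...> tag and its closing </a>
ANCHOR_TEXT = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)


def extract_source_label(source_html):
    """Return the visible label of an HTML anchor, or the input if no tag is found."""
    match = ANCHOR_TEXT.search(source_html)
    return match.group(1).strip() if match else source_html.strip()


raw = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
label = extract_source_label(raw)
```

Applied to a DataFrame, this could be used as `df["source"].map(extract_source_label)`.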

**5. Split the text range into two separate columns - TIDINESS ISSUE!**

**6. Merging the different classification columns (doggo, floofer etc.) into one column and removing "None" entries - TIDINESS ISSUE!**
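
The merge of the four stage columns could be sketched as below. It assumes that each stage column holds either its own name (for example "doggo") or a placeholder such as "None". The helper name `combine_stages` and the comma-separated format for tweets with several stages are my own choices.

```python
STAGES = ("doggo", "floofer", "pupper", "puppo")


def combine_stages(row):
    """Collapse the four stage columns of one row into a single value."""
    found = [stage for stage in STAGES if row.get(stage) == stage]
    # Real missing value instead of the "None" string when no stage is set
    return ", ".join(found) if found else None


# Rows are shown as plain dicts; a pandas Series supports .get() as well
single = combine_stages({"doggo": "None", "floofer": "None", "pupper": "pupper", "puppo": "None"})
double = combine_stages({"doggo": "doggo", "floofer": "None", "pupper": "pupper", "puppo": "None"})
missing = combine_stages({"doggo": "None", "floofer": "None", "pupper": "None", "puppo": "None"})
```

On a DataFrame this could be applied row-wise with `df.apply(combine_stages, axis=1)`, after which the four original columns can be dropped.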

**7. Remove incorrect names - QUALITY ISSUE!**

**8. Keep only the prediction with the highest confidence (%) and its confidence value - TIDINESS ISSUE!**
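
One possible reading of this step is to keep, per tweet, the most confident prediction that is flagged as a dog breed. The sketch below follows that interpretation, which may differ from the notebook; the helper name `best_dog_prediction` is hypothetical.

```python
def best_dog_prediction(row):
    """Return (breed, confidence) of the most confident dog prediction, or (None, None)."""
    candidates = [
        (row[f"p{i}_conf"], row[f"p{i}"])
        for i in (1, 2, 3)
        if row[f"p{i}_dog"]  # ignore predictions that are not dog breeds
    ]
    if not candidates:
        return None, None
    confidence, breed = max(candidates)
    return breed, confidence


# Made-up example row: the top prediction is not a dog, the second one is
row = {
    "p1": "seat_belt", "p1_conf": 0.6, "p1_dog": False,
    "p2": "pug", "p2_conf": 0.3, "p2_dog": True,
    "p3": "chihuahua", "p3_conf": 0.05, "p3_dog": True,
}
breed, confidence = best_dog_prediction(row)
```

With `df.apply(best_dog_prediction, axis=1, result_type="expand")` the two return values can be written into two new columns, replacing the nine original prediction columns.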

**9. Standardize the breed names, because some are capitalized and some are not - QUALITY ISSUE!**


### IV. Store

I stored the final DataFrame in a .csv file named **twitter_archive_master.csv** (as instructed by Udacity).