In [None]:
import pandas as pd
pd.set_option('display.max_columns', None)
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Gather Data:

### Import WeRateDogs Twitter archive

In [None]:
df_WRD_twitter = pd.read_csv('twitter-archive-enhanced.csv')

### Import image prediction file from url

In [None]:
# Reference: https://www.codementor.io/aviaryan/downloading-files-from-urls-in-python-77q3bs0un

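import requests
import os
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
r = requests.get(url)
r.raise_for_status()  # stop here if the download failed
# Use a context manager so the file is flushed and closed after writing
with open('image-predictions.tsv', 'wb') as f:
    f.write(r.content)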

In [None]:
df_img_pred = pd.read_csv('image-predictions.tsv', sep='\t')

### Import Tweet JSON data

In [None]:
# Reference: Dhaval P's answer on his question: https://knowledge.udacity.com/questions/47704
import json
data = []
# Each line of the file holds one tweet as a JSON object
with open('tweet_json.txt') as f:
    for line in f:
        data.append(json.loads(line))
df_twit_JSON = pd.DataFrame(data)


## Assess Data:

### Assessing WeRateDogs Twitter Archive Data:

#### Quality Issues:
- Of the 2356 entries, only about 400 have a declared dog type (e.g. doggo, puppo). Possible causes: the available categories do not cover the wide variety of dogs, most tweets do not use a dog category, or the extraction missed some category mentions in the tweets.
- The 'name' column has 745 entries extracted as the non-null string 'None', and several entries extracted as 'a', 'the', or 'an'. Most of the 'None' values are appropriate, and most of the 'a', 'the', and 'an' entries should also be changed to 'None'.
- Entry at index 2204 has to be renamed to 'Berta'
- There are 181 retweet entries, and the project requires original tweets only. These should be removed.
- There are 78 reply tweet entries, and I'm not sure whether they fit the definition of 'original tweet', even when they include a new photo, name, and rating. Better to err on the side of caution and remove them.
- Entry at index 313 extracted a rating of '960/0', and needs to be changed to the revised rating of '13/10'
- Entries at indices 340 and 695 extracted a rating of '75/10', and need to be changed to the actual rating of '9.75/10'
- Entry at index 342 actually doesn't have a rating ('11/15' was extracted, while it was simply a description of time). Row needs to be removed.
- Entry at index 516 actually doesn't have a rating ('24/7' was extracted, while it was simply a description of time). Row needs to be removed.
- Entry at index 763 extracted a rating of '27/10', and needs to be changed to the actual rating of '11.27/10'
- Entry at index 1068 extracted a rating of '9/11', and needs to be changed to the actual rating of '14/10'
- Entry at index 1165 extracted a rating of '4/20', and needs to be changed to the actual rating of '13/10'
- Entry at index 1202 extracted a rating of '50/50', and needs to be changed to the actual rating of '11/10'
- Entries at indices 1598 and 1663 were technically not officially given ratings by WeRateDogs, and should be removed.
- Entry at index 1662 extracted a rating of '7/11', and needs to be changed to the actual rating of '10/10'
- Entry at index 1712 extracted a rating of '26/10', and needs to be changed to the actual rating of '11.26/10'
- Entry at index 2335 extracted a rating of '1/2', and needs to be changed to the actual rating of '9/10'
- Since some correct ratings contain decimal values, 'rating_numerator' and 'rating_denominator' need to be changed from int to float
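Several of the rating issues listed above come from the extraction ignoring decimals or picking up an unrelated fraction. A minimal sketch of a re-extraction that allows decimal numerators is shown here. The toy tweet texts are invented for illustration, and the rule "keep the last pair whose denominator is 10" is an assumption, not the project's prescribed method. Rows with no such pair would still need manual review.

```python
import pandas as pd

# Toy rows standing in for the 'text' column; the wording is invented.
toy = pd.DataFrame({
    'text': [
        'This is Logan. He is very fluffy. 9.75/10 would pet',
        'Happy 4/20 from the squad! 13/10 for all',
        'Plain example with no decimals 12/10',
    ]
})

# Capture every "number/number" pair, allowing an optional decimal numerator.
pattern = r'(\d+(?:\.\d+)?)/(\d+)'
pairs = toy['text'].str.extractall(pattern).astype(float)
pairs.columns = ['rating_numerator', 'rating_denominator']

# Assumption: when several pairs appear, the last one with denominator 10
# is the intended rating.
ratings = pairs[pairs['rating_denominator'] == 10].groupby(level=0).last()
toy = toy.join(ratings)
print(toy[['rating_numerator', 'rating_denominator']])
```

Storing both columns as float also covers the int-to-float conversion noted in the last bullet.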

#### Tidiness Issues:
- Dog types (e.g. doggo, puppo) are in separate variable columns: if a dog is described as a given type, the value is the dog type, and otherwise the value is a non-null 'None'. Instead, the columns could either be framed as Boolean 1's and 0's, or all be placed into one 'dog_type' variable column.
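One way the single-column option could look, as a sketch on toy data. The column names 'doggo', 'floofer', 'pupper', and 'puppo' are assumed here, and joining multiple labels with a comma is just one possible convention.

```python
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']  # assumed column names

toy = pd.DataFrame({
    'doggo':   ['doggo', 'None', 'None', 'doggo'],
    'floofer': ['None', 'None', 'None', 'None'],
    'pupper':  ['None', 'pupper', 'None', 'pupper'],
    'puppo':   ['None', 'None', 'None', 'None'],
})

# Join whatever labels are present in each row, skipping the 'None' strings.
combined = toy[stage_cols].apply(
    lambda row: ', '.join(v for v in row if v != 'None'), axis=1
)

# Rows with no label at all become missing values instead of empty strings.
toy['dog_type'] = combined.where(combined != '')
toy = toy.drop(columns=stage_cols)
print(toy)
```

Keeping rows with two labels as a combined value avoids silently discarding information; they could also be reviewed by hand.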

In [None]:
df_WRD_twitter.head()

In [None]:
df_WRD_twitter.info()

In [None]:
#Reference: https://stackoverflow.com/questions/33042633/selecting-last-n-columns-and-excluding-last-n-columns-in-dataframe
dog_type_cols = df_WRD_twitter.columns[-5:].values

for i in dog_type_cols:
    print(df_WRD_twitter[i].value_counts())

In [None]:
# Reference: https://stackoverflow.com/questions/25351968/how-to-display-full-non-truncated-dataframe-information-in-html-when-convertin/25352191
# None means "no truncation"; passing -1 is deprecated in newer pandas versions
pd.set_option('display.max_colwidth', None)

df_WRD_twitter[df_WRD_twitter['name']=='None']['text'].head(10)

In [None]:
df_WRD_twitter[df_WRD_twitter['name']=='a']['text'].head(10)

In [None]:
df_WRD_twitter[df_WRD_twitter['name']=='an']['text']

In [None]:
df_WRD_twitter[df_WRD_twitter['name']=='the']['text']

In [None]:
df_WRD_twitter.rating_numerator.value_counts()

In [None]:
df_WRD_twitter[df_WRD_twitter['rating_numerator'] >= 20][['rating_numerator','rating_denominator','text']]

In [None]:
df_WRD_twitter.rating_denominator.value_counts()

In [None]:
df_WRD_twitter[df_WRD_twitter['rating_denominator'] != 10][['rating_numerator','rating_denominator','text']]

### Assessing Image Prediction Data:

#### Quality Issues:
- There are 324 images that returned predictions that were not dogs ('p1_dog', 'p2_dog', 'p3_dog' all False). These rows are either evidence of the neural network discovering pictures that indeed don't contain dogs, or of the neural network doing a poor job of finding the dog in the image.

In [None]:
df_img_pred.head()

In [None]:
df_img_pred.info()

In [None]:
df_img_pred.img_num.value_counts()

In [None]:
df_img_pred.tweet_id.duplicated().value_counts()

In [None]:
# Rows where none of the three predictions is a dog
all_pict_false = df_img_pred[~df_img_pred.p1_dog & ~df_img_pred.p2_dog & ~df_img_pred.p3_dog]
all_pict_false.shape[0]

### Assessing Twitter JSON Data:

#### Quality Issues:
- There are 179 tweets that are retweets. These should be removed, as they are not originals.
- There are 28-29 tweets that are original responses to other tweets. Because they are not necessarily stand-alone originals, they may be up for removal, unless the image dataset has extracted photos associated with the tweet.
- Essentially empty columns that should be dropped or ignored from merging: 'contributors', 'coordinates', 'geo', 'place'.

#### Tidiness Issues:
- Columns in which entries contain multiple pieces of information: 'entities', 'extended_entities', 'quoted_status', 'retweeted_status', 'user'. These columns could be made into their own datasets, or their contents could be sorted into unique variables that would be attached onto the end of the main JSON dataset entries to which they belong.
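The second option, spreading a nested column into its own variables, can be sketched with `pd.json_normalize`. The toy frame and the field names inside 'user' are illustrative assumptions, not the actual contents of the JSON file.

```python
import pandas as pd

# Toy stand-in for the JSON data; the values are invented.
toy = pd.DataFrame({
    'id': [1, 2],
    'favorite_count': [10, 20],
    'user': [
        {'screen_name': 'dog_rates', 'followers_count': 100},
        {'screen_name': 'dog_rates', 'followers_count': 101},
    ],
})

# Expand each dict in 'user' into columns, prefixed to avoid name clashes.
user_cols = pd.json_normalize(toy['user'].tolist()).add_prefix('user_')
user_cols.index = toy.index  # keep rows aligned with the original frame

flat = toy.drop(columns='user').join(user_cols)
print(flat.columns.tolist())
```

Columns such as 'retweeted_status' contain missing values for most rows, so they would need the non-null rows selected first before being normalized this way.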

In [None]:
df_twit_JSON.head()

In [None]:
df_twit_JSON.info()

In [None]:
# None means "no truncation"; passing -1 is deprecated in newer pandas versions
pd.set_option('display.max_colwidth', None)
df_twit_JSON[df_twit_JSON.retweeted_status.notna()]['retweeted_status'].iloc[0]

In [None]:
df_twit_JSON[df_twit_JSON.retweeted_status.notna()].iloc[0]

In [None]:
df_twit_JSON[df_twit_JSON.quoted_status.notna()].iloc[0]

In [None]:
df_twit_JSON.favorited.value_counts()

### Summary of Quality Issues:

#### WeRateDogs Twitter Archive Data:
- Of the 2356 entries, only about 400 have a declared dog type (e.g. doggo, puppo). Possible causes: the available categories do not cover the wide variety of dogs, most tweets do not use a dog category, or the extraction missed some category mentions in the tweets.
- The 'name' column has 745 entries extracted as the non-null string 'None', and several entries extracted as 'a', 'the', or 'an'. Most of the 'None' values are appropriate, and most of the 'a', 'the', and 'an' entries should also be changed to 'None'.
- Entry at index 2204 has to be renamed to 'Berta'
- There are 181 retweet entries, and the project requires original tweets only. These should be removed.
- There are 78 reply tweet entries, and I'm not sure whether they fit the definition of 'original tweet', even when they include a new photo, name, and rating. Better to err on the side of caution and remove them.
- Entry at index 313 extracted a rating of '960/0', and needs to be changed to the revised rating of '13/10'
- Entries at indices 340 and 695 extracted a rating of '75/10', and need to be changed to the actual rating of '9.75/10'
- Entry at index 342 actually doesn't have a rating ('11/15' was extracted, while it was simply a description of time). Row needs to be removed.
- Entry at index 516 actually doesn't have a rating ('24/7' was extracted, while it was simply a description of time). Row needs to be removed.
- Entry at index 763 extracted a rating of '27/10', and needs to be changed to the actual rating of '11.27/10'
- Entry at index 1068 extracted a rating of '9/11', and needs to be changed to the actual rating of '14/10'
- Entry at index 1165 extracted a rating of '4/20', and needs to be changed to the actual rating of '13/10'
- Entry at index 1202 extracted a rating of '50/50', and needs to be changed to the actual rating of '11/10'
- Entries at indices 1598 and 1663 were technically not officially given ratings by WeRateDogs, and should be removed.
- Entry at index 1662 extracted a rating of '7/11', and needs to be changed to the actual rating of '10/10'
- Entry at index 1712 extracted a rating of '26/10', and needs to be changed to the actual rating of '11.26/10'
- Entry at index 2335 extracted a rating of '1/2', and needs to be changed to the actual rating of '9/10'
- Since some correct ratings contain decimal values, 'rating_numerator' and 'rating_denominator' need to be changed from int to float
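For the retweet and reply entries in this list, a possible filter is sketched here on toy data. It assumes the archive marks retweets and replies through non-null 'retweeted_status_id' and 'in_reply_to_status_id' columns; check `df_WRD_twitter.info()` to confirm the actual column names before relying on it.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the archive; the column names are assumptions.
toy = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'retweeted_status_id': [np.nan, 111.0, np.nan, np.nan],
    'in_reply_to_status_id': [np.nan, np.nan, 222.0, np.nan],
})

is_retweet = toy['retweeted_status_id'].notna()
is_reply = toy['in_reply_to_status_id'].notna()

# Keep only rows that are neither retweets nor replies.
originals = toy[~is_retweet & ~is_reply].copy()
print(originals['tweet_id'].tolist())
```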

#### Image Prediction Data:
- There are 324 images that returned predictions that were not dogs ('p1_dog', 'p2_dog', 'p3_dog' all False). These rows are either evidence of the neural network discovering pictures that indeed don't contain dogs, or of the neural network doing a poor job of finding the dog in the image.

#### Twitter JSON Data:
- There are 179 tweets that are retweets. These should be removed, as they are not originals.
- There are 28-29 tweets that are original responses to other tweets. Because they are not necessarily stand-alone originals, they may be up for removal, unless the image dataset has extracted photos associated with the tweet.
- Essentially empty columns that should be dropped or ignored from merging: 'contributors', 'coordinates', 'geo', 'place'.

### Summary of Tidiness Issues:

#### WeRateDogs Twitter Archive Data:
- Dog types (e.g. doggo, puppo) are in separate variable columns: if a dog is described as a given type, the value is the dog type, and otherwise the value is a non-null 'None'. Instead, the columns could either be framed as Boolean 1's and 0's, or all be placed into one 'dog_type' variable column.

#### Twitter JSON Data:
- Columns in which entries contain multiple pieces of information: 'entities', 'extended_entities', 'quoted_status', 'retweeted_status', 'user'. These columns could be made into their own datasets, or their contents could be sorted into unique variables that would be attached onto the end of the main JSON dataset entries to which they belong.

## Clean:

| Example | Dataset | Variable name | Variable type |
|---|---|---|---|
| 666020888022790149 | Image data | 'tweet_id' | int64 |
| 892420643555336193 | WRD data | 'tweet_id' | int64 |
| 892420643555336193 | JSON data | 'id' | int64 |
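Since all three datasets carry the tweet identifier as an int64, they could be joined on it once the JSON key is renamed. The following is a sketch on toy frames: the non-key columns and values are invented, and the inner join is an assumption about which rows should survive.

```python
import pandas as pd

# Toy stand-ins for the three datasets; only the key names follow the table.
archive = pd.DataFrame({'tweet_id': [1, 2, 3], 'rating_numerator': [12, 13, 10]})
images = pd.DataFrame({'tweet_id': [1, 2], 'p1_dog': [True, False]})
tweets = pd.DataFrame({'id': [1, 2, 3], 'favorite_count': [5, 7, 9]})

# Align the JSON key name with the other two datasets before merging.
tweets = tweets.rename(columns={'id': 'tweet_id'})

# An inner join keeps only tweets present in all three sources.
merged = (
    archive
    .merge(images, on='tweet_id', how='inner')
    .merge(tweets, on='tweet_id', how='inner')
)
print(merged)
```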

#### Define

MK

#### Code

In [None]:
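# Pass the names via columns= (drop() targets row labels by default)
# and assign the result back, since drop() returns a new DataFrame.
df_twit_JSON = df_twit_JSON.drop(columns=['contributors',
                                          'coordinates',
                                          'geo',
                                          'id',
                                          'in_reply_to_screen_name',
                                          'in_reply_to_status_id',
                                          'in_reply_to_status_id_str',
                                          'in_reply_to_user_id',
                                          'in_reply_to_user_id_str',
                                          'place',
                                          'quoted_status',
                                          'quoted_status_id',
                                          'quoted_status_id_str',
                                         ])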

#### Test