### Introduction

This project examines the tweet archive for WeRateDogs to draw insights from the dataset regarding the ratings of dogs. The Gather, Assess and Clean process is followed with the aim of answering the following questions:

- What type of dogs have the highest ratings?
- What dog stage has the highest ratings?
- What type of dogs are the most popular in terms of retweet count and favorite count?

## Gather

In [1]:
import json
import numpy as np
import os
import pandas as pd
import requests
import tweepy
from timeit import default_timer as timer
from tweepy import OAuthHandler

In [2]:
# Load twitter archive file
archive_enhanced = pd.read_csv('twitter-archive-enhanced.csv')

In [3]:
# Download image-predictions file
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
with open('image_predictions.tsv', mode='wb') as file:
    file.write(response.content)

In [4]:
# Import image predictions file
image_predictions = pd.read_csv('image_predictions.tsv', sep='\t')

In [5]:
# Query Twitter API for each tweet in the Twitter archive and save JSON in a text file
consumer_key = 'HIDDEN'
consumer_secret = 'HIDDEN'
access_token = 'HIDDEN'
access_secret = 'HIDDEN'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

In [6]:
# Tweet IDs for which to gather additional data via Twitter's API
tweet_ids = archive_enhanced.tweet_id.values
len(tweet_ids)

2356

In [7]:
# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
if not os.path.exists('tweet_json.txt'):
    count = 0
    fails_dict = {}
    start = timer()
    # Save each tweet's returned JSON as a new line in a .txt file
    with open('tweet_json.txt', 'w') as outfile:
        # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
        for tweet_id in tweet_ids:
            count += 1
            print(str(count) + ": " + str(tweet_id))
            try:
                tweet = api.get_status(tweet_id, tweet_mode='extended')
                print("Success")
                json.dump(tweet._json, outfile)
                outfile.write('\n')
            except tweepy.TweepError as e:
                # Record the failure and continue with the next tweet ID
                print("Fail")
                fails_dict[tweet_id] = e
    end = timer()
    print(end - start)
    print(fails_dict)


In [8]:
# DELETE
# Temp - save fails_dict to file for working with
'''
import csv
fails_list = []
for key in fails_dict.keys():
    fails_list.append(key)
fails_list
with open('fails_dict.csv', 'w') as outfile:
    wr = csv.writer(outfile, dialect='excel')
    wr.writerow(fails_list)
'''

In [9]:
# DELETE
# Working - uncomment out if necessary
'''
failures = []
with open('fails_dict.csv', 'r') as f:
    reader = csv.reader(f)
    for line in reader:
        failures.append(line)
'''

In [10]:
# Load extended tweet data
tweets = []
with open('tweet_json.txt') as json_file:
    for line in json_file:
        tweets.append(json.loads(line))

# Place extended tweet data into a dataframe
extended_data = pd.DataFrame(tweets)

## Assess

In [11]:
# View column types and missing data in twitter_archive_enhanced
archive_enhanced.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

In [12]:
# DETERMINE IF TO KEEP
# Visually assess data in archive
archive_enhanced.sample(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
691,787322443945877504,,,2016-10-15 16:01:13 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Lincoln. He forgot to use his blinker ...,,,,https://twitter.com/dog_rates/status/787322443...,10,10,Lincoln,,,,
1806,676936541936185344,,,2015-12-16 01:27:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we see a rare pouched pupper. Ample stora...,,,,https://twitter.com/dog_rates/status/676936541...,8,10,,,,pupper,
1745,679148763231985668,,,2015-12-22 03:57:37 +0000,"<a href=""http://twitter.com/download/iphone"" r...",I know everyone's excited for Christmas but th...,,,,https://twitter.com/dog_rates/status/679148763...,8,10,,,,,
291,838085839343206401,8.380855e+17,2894131000.0,2017-03-04 17:56:49 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@bragg6of8 @Andy_Pace_ we are still looking fo...,,,,,15,10,,,,,
2085,670804601705242624,,,2015-11-29 03:20:54 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Mason. He's a total frat boy. Pretends to...,,,,https://twitter.com/dog_rates/status/670804601...,10,10,Mason,,,,


In [13]:
# Check for duplicated tweets
len(archive_enhanced[archive_enhanced.tweet_id.duplicated()])

0

In [14]:
# Check for inclusion of retweets
sum(~archive_enhanced.retweeted_status_id.isnull())

181

There are 181 tweets that are retweets and need to be removed.

In [15]:
# Check values in rating_denominator - should be 10
archive_enhanced.rating_denominator.value_counts().sort_index()

0         1
2         1
7         1
10     2333
11        3
15        1
16        1
20        2
40        1
50        3
70        1
80        2
90        1
110       1
120       1
130       1
150       1
170       1
Name: rating_denominator, dtype: int64

In [16]:
# Check for incorrect rating_denominator values
incorrect_denominators = archive_enhanced[archive_enhanced['rating_denominator'] != 10]
list(incorrect_denominators['tweet_id'])

[835246439529840640,
 832088576586297345,
 820690176645140481,
 810984652412424192,
 775096608509886464,
 758467244762497024,
 740373189193256964,
 731156023742988288,
 722974582966214656,
 716439118184652801,
 713900603437621249,
 710658690886586372,
 709198395643068416,
 704054845121142784,
 697463031882764288,
 686035780142297088,
 684225744407494656,
 684222868335505415,
 682962037429899265,
 682808988178739200,
 677716515794329600,
 675853064436391936,
 666287406224695296]

Some incorrect denominators were found:

- 835246439529840640 should be 13 numerator and 10 denominator
- 832088576586297345 does not contain a rating and should be dropped
- 810984652412424192 does not contain a rating and should be dropped
- 775096608509886464 is a retweet and will be deleted
- 740373189193256964 should be 14 numerator and 10 denominator
- 722974582966214656 should be 13 numerator and 10 denominator
- 716439118184652801 should be 11 numerator and 10 denominator
- 686035780142297088 does not contain a rating and should be dropped
- 682962037429899265 should be 10 numerator and 10 denominator
- 682808988178739200 does not contain a rating and should be dropped
- 666287406224695296 should be 9 numerator and 10 denominator

The remaining tweets in the list have accurate ratings even though their denominator is not 10.
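
The manual corrections listed above could be applied along these lines. This is only a sketch: the tweet IDs and corrected values come from the list, while the `toy` dataframe and its zero placeholder ratings are invented for illustration and are not the real archive values.

```python
import pandas as pd

# Corrected (numerator, denominator) pairs taken from the list above
corrections = {
    835246439529840640: (13, 10),
    740373189193256964: (14, 10),
    722974582966214656: (13, 10),
    716439118184652801: (11, 10),
    682962037429899265: (10, 10),
    666287406224695296: (9, 10),
}

# Tweets that contain no rating and should be dropped
no_rating_ids = [
    832088576586297345,
    810984652412424192,
    686035780142297088,
    682808988178739200,
]

# Toy stand-in for archive_enhanced_clean; the zeros are placeholders only
toy = pd.DataFrame({
    'tweet_id': [835246439529840640, 832088576586297345, 666287406224695296],
    'rating_numerator': [0, 0, 0],
    'rating_denominator': [0, 0, 0],
})

# Overwrite the rating of each tweet that needs a correction
for tweet_id, (numerator, denominator) in corrections.items():
    mask = toy['tweet_id'] == tweet_id
    toy.loc[mask, 'rating_numerator'] = numerator
    toy.loc[mask, 'rating_denominator'] = denominator

# Remove the tweets that have no rating at all
toy = toy[~toy['tweet_id'].isin(no_rating_ids)].reset_index(drop=True)
```

Keeping the corrections in a dictionary keeps the manual decisions in one place, so they are easy to review against the tweet text.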

In [17]:
# Examine data types and missing data in image_predictions
image_predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [18]:
# DETERMINE IF TO KEEP
image_predictions.sample(10)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
523,676588346097852417,https://pbs.twimg.com/media/CWO5gmCUYAAX4WA.jpg,1,Boston_bull,0.976577,True,French_bulldog,0.014324,True,Chihuahua,0.002302,True
1670,813081950185472002,https://pbs.twimg.com/media/C0ilsa1XUAEHK_k.jpg,2,Doberman,0.909951,True,kelpie,0.042649,True,miniature_pinscher,0.023004,True
1711,818536468981415936,https://pbs.twimg.com/media/C1wGkYoVQAAuC_O.jpg,1,swing,0.999403,False,Welsh_springer_spaniel,6.2e-05,True,bow,3e-05,False
1940,860924035999428608,https://pbs.twimg.com/media/C_KVJjDXsAEUCWn.jpg,2,envelope,0.933016,False,oscilloscope,0.012591,False,paper_towel,0.011178,False
896,699691744225525762,https://pbs.twimg.com/media/CbXN7aPWIAE0Xt1.jpg,1,hippopotamus,0.982269,False,sea_lion,0.006295,False,dugong,0.005768,False
1419,771500966810099713,https://pbs.twimg.com/media/CrTsCPHWYAANdzC.jpg,1,Labrador_retriever,0.833952,True,golden_retriever,0.103223,True,soccer_ball,0.012094,False
1059,714982300363173890,https://pbs.twimg.com/media/CewgnHAXEAAdbld.jpg,1,Brittany_spaniel,0.944376,True,beagle,0.025435,True,Ibizan_hound,0.009962,True
1978,870656317836468226,https://pbs.twimg.com/media/DBUxSSTXsAA-Jn1.jpg,4,Pembroke,0.945495,True,Cardigan,0.045875,True,beagle,0.004329,True
359,672622327801233409,https://pbs.twimg.com/media/CVWicBbUYAIomjC.jpg,1,golden_retriever,0.952773,True,Labrador_retriever,0.010835,True,clumber,0.008786,True
842,695051054296211456,https://pbs.twimg.com/media/CaVRP4GWwAERC0v.jpg,1,Boston_bull,0.761454,True,pug,0.075395,True,Chihuahua,0.041598,True


In [19]:
# Check column types and missing values in extended_data
extended_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2339 entries, 0 to 2338
Data columns (total 32 columns):
contributors                     0 non-null object
coordinates                      0 non-null object
created_at                       2339 non-null object
display_text_range               2339 non-null object
entities                         2339 non-null object
extended_entities                2065 non-null object
favorite_count                   2339 non-null int64
favorited                        2339 non-null bool
full_text                        2339 non-null object
geo                              0 non-null object
id                               2339 non-null int64
id_str                           2339 non-null object
in_reply_to_screen_name          77 non-null object
in_reply_to_status_id            77 non-null float64
in_reply_to_status_id_str        77 non-null object
in_reply_to_user_id              77 non-null float64
in_reply_to_user_id_str          77 non-null obj

In [20]:
# DETERMINE IF TO KEEP
extended_data.head(5)

Unnamed: 0,contributors,coordinates,created_at,display_text_range,entities,extended_entities,favorite_count,favorited,full_text,geo,...,quoted_status,quoted_status_id,quoted_status_id_str,quoted_status_permalink,retweet_count,retweeted,retweeted_status,source,truncated,user
0,,,Tue Aug 01 16:23:56 +0000 2017,"[0, 85]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892420639486877696, 'id_str'...",37783,False,This is Phineas. He's a mystical boy. Only eve...,,...,,,,,8233,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
1,,,Tue Aug 01 00:17:27 +0000 2017,"[0, 138]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892177413194625024, 'id_str'...",32453,False,This is Tilly. She's just checking pup on you....,,...,,,,,6084,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
2,,,Mon Jul 31 00:18:03 +0000 2017,"[0, 121]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891815175371796480, 'id_str'...",24437,False,This is Archie. He is a rare Norwegian Pouncin...,,...,,,,,4027,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
3,,,Sun Jul 30 15:58:51 +0000 2017,"[0, 79]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891689552724799489, 'id_str'...",41113,False,This is Darla. She commenced a snooze mid meal...,,...,,,,,8384,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
4,,,Sat Jul 29 16:00:24 +0000 2017,"[0, 138]","{'hashtags': [{'text': 'BarkWeek', 'indices': ...","{'media': [{'id': 891327551943041024, 'id_str'...",39326,False,This is Franklin. He would like you to stop ca...,,...,,,,,9089,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."


In [21]:
# Find number of retweets in extended_data
len(extended_data[~extended_data['retweeted_status'].isnull()])

167

There are 167 tweets that are retweets and will need to be removed from the data.

In [22]:
# Check for duplication between archive_enhanced and image_predictions
all_columns = pd.Series(list(archive_enhanced) + list(image_predictions))
all_columns[all_columns.duplicated()]

17    tweet_id
dtype: object

Only the tweet_id column is duplicated, so it can serve as the link between the two tables.

#### Quality
##### `archive_enhanced` table
- data missing such as retweet_count and favorite_count
- tweet_id is an int (should be str)
- presence of retweets (retweeted_status_id is non-null)
- rating_numerator and rating_denominator columns are ints (should be floats)
- several incorrect ratings to be corrected

##### `image_predictions` table
- tweet_id is an int (should be str)
- dog names in p1, p2, p3 columns inconsistent - upper and lowercase names, inclusion of underscores etc.
- predictions include non-dog objects which should be removed prior to analysis

##### `extended_data` table
- presence of retweets (retweeted_status is non-null) (Note: will be cleaned at the same time that the archive_enhanced file is cleaned, as the two dataframes will have been merged at that point).

#### Tidiness
- `archive_enhanced` contains multiple columns for the dog stage which should be one column with the dog stage as the variable
- `image_predictions` has multiple predictions per row

## Clean

In [23]:
# Make copies of each dataframe for cleaning
archive_enhanced_clean = archive_enhanced.copy()
image_predictions_clean = image_predictions.copy()
extended_data_clean = extended_data.copy()

### Missing Data

#### `extended_data`: favorite_count and retweet_count need to be joined to `archive_enhanced`.

##### Define
- Drop the unwanted columns from `extended_data_clean` so just id, favorite_count and retweet_count remain
- Rename id to tweet_id in `extended_data_clean`
- Merge `extended_data_clean` with `archive_enhanced_clean` on the tweet_id column, keeping only the rows (tweet_id) that are present in both dataframes

##### Code

In [24]:
# Remove unwanted columns
columns = ['id', 'favorite_count', 'retweet_count']
extended_data_clean = extended_data_clean[columns]

# Rename id column
extended_data_clean = extended_data_clean.rename(columns={'id':'tweet_id'})
extended_data_clean.head()

Unnamed: 0,tweet_id,favorite_count,retweet_count
0,892420643555336193,37783,8233
1,892177421306343426,32453,6084
2,891815181378084864,24437,4027
3,891689557279858688,41113,8384
4,891327558926688256,39326,9089


In [25]:
# Merge archive_enhanced_clean and extended_data_clean
archive_enhanced_clean = pd.merge(archive_enhanced_clean, extended_data_clean,
                                  on='tweet_id', how='inner')

##### Test

In [26]:
# Check that retweet_count and favorite_count columns have been added and there are 2339 rows
archive_enhanced_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2339 entries, 0 to 2338
Data columns (total 19 columns):
tweet_id                      2339 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2339 non-null object
source                        2339 non-null object
text                          2339 non-null object
retweeted_status_id           167 non-null float64
retweeted_status_user_id      167 non-null float64
retweeted_status_timestamp    167 non-null object
expanded_urls                 2280 non-null object
rating_numerator              2339 non-null int64
rating_denominator            2339 non-null int64
name                          2339 non-null object
doggo                         2339 non-null object
floofer                       2339 non-null object
pupper                        2339 non-null object
puppo                         2339 non-null object
favorite_count                23

### Tidiness

#### `archive_enhanced` contains multiple columns for the dog stage.

##### Define
- Create a column in `archive_enhanced` called dog_stage that holds the concatenation of each value in the doggo, floofer, pupper and puppo columns.
- Extract the correct dog stage from the dog_stage column and update
- For entries with more than one dog stage, record these as 'mixed'

##### Code

In [27]:
# Add dog_stage column and populate with concatenated values for dog stages
archive_enhanced_clean['dog_stage'] = archive_enhanced_clean['doggo'] \
    + archive_enhanced_clean['floofer'] + archive_enhanced_clean['pupper'] \
    + archive_enhanced_clean['puppo']

In [28]:
# Check values present in dog_stage column
archive_enhanced_clean['dog_stage'].value_counts()

NoneNoneNoneNone        1961
NoneNonepupperNone       244
doggoNoneNoneNone         82
NoneNoneNonepuppo         29
doggoNonepupperNone       12
NoneflooferNoneNone        9
doggoNoneNonepuppo         1
doggoflooferNoneNone       1
Name: dog_stage, dtype: int64

In [29]:
# Function to extract dog_stage
def get_dog_stage(current):
    '''Extract dog_stage from passed string'''
    if current == 'NoneNoneNoneNone':
        return 'None'
    elif current == 'doggoNoneNoneNone':
        return 'doggo'
    elif current == 'NoneflooferNoneNone':
        return 'floofer'
    elif current == 'NoneNonepupperNone':
        return 'pupper'
    elif current == 'NoneNoneNonepuppo':
        return 'puppo'
    else:
        return 'mixed'

In [30]:
# Update dog_stage column with call to get_dog_stage()
archive_enhanced_clean['dog_stage'] = archive_enhanced_clean['dog_stage'].apply(get_dog_stage)
# Drop individual dog stage columns
archive_enhanced_clean = archive_enhanced_clean.drop(
    ['doggo', 'puppo', 'pupper', 'floofer'], axis=1)

##### Test

In [31]:
# Confirm that column has been updated correctly
archive_enhanced_clean['dog_stage'].value_counts()

None       1961
pupper      244
doggo        82
puppo        29
mixed        14
floofer       9
Name: dog_stage, dtype: int64

In [32]:
# Confirm that old dog stage columns have been dropped
archive_enhanced_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2339 entries, 0 to 2338
Data columns (total 16 columns):
tweet_id                      2339 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2339 non-null object
source                        2339 non-null object
text                          2339 non-null object
retweeted_status_id           167 non-null float64
retweeted_status_user_id      167 non-null float64
retweeted_status_timestamp    167 non-null object
expanded_urls                 2280 non-null object
rating_numerator              2339 non-null int64
rating_denominator            2339 non-null int64
name                          2339 non-null object
favorite_count                2339 non-null int64
retweet_count                 2339 non-null int64
dog_stage                     2339 non-null object
dtypes: float64(4), int64(5), object(7)
memory usage: 310.6+ KB


#### `image_predictions` has multiple predictions per row.

##### Define
- Create separate dataframes from `image_predictions_clean` for p1, p2 and p3 data and keep only tweet_id, img_num, and the columns for that probability (`p1_df`, `p2_df`, `p3_df`)
- For each new dataframe, create a column that holds the p_order value (e.g. 1 for p1)
- Rename p1 column to 'p_type'
- Rename p1_conf to 'p_conf'
- Rename p1_dog to 'p_dog'
- Repeat for p2 and p3
- Append `p2_df` and `p3_df` to `p1_df`
- Store the result as `image_predictions_clean`
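
The reshaping steps above can be sketched in a single loop that also adds the p_order column from the second bullet. This is a minimal sketch on an invented two-row `toy` dataframe; the column names match `image_predictions`, but the values and the name `long_predictions` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for image_predictions_clean (values invented)
toy = pd.DataFrame({
    'tweet_id': [1, 2],
    'img_num': [1, 1],
    'p1': ['golden_retriever', 'swing'],
    'p1_conf': [0.9, 0.8],
    'p1_dog': [True, False],
    'p2': ['Labrador_retriever', 'bow'],
    'p2_conf': [0.05, 0.1],
    'p2_dog': [True, False],
    'p3': ['beagle', 'envelope'],
    'p3_conf': [0.01, 0.05],
    'p3_dog': [True, False],
})

frames = []
for order in (1, 2, 3):
    prefix = f'p{order}'
    # Keep the identifying columns plus the three columns for this prediction
    part = toy[['tweet_id', 'img_num', prefix, f'{prefix}_conf', f'{prefix}_dog']]
    # Give every prediction the same column names
    part = part.rename(columns={prefix: 'p_type',
                                f'{prefix}_conf': 'p_conf',
                                f'{prefix}_dog': 'p_dog'})
    # Record which prediction (1, 2 or 3) the row came from
    part = part.assign(p_order=order)
    frames.append(part)

# Stack the three tables into one long table
long_predictions = pd.concat(frames, ignore_index=True)
```

Keeping p_order makes it possible to select only the top prediction later, for example with `long_predictions[long_predictions['p_order'] == 1]`.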

##### Code

In [33]:
# Create new dataframes
p1_df = image_predictions_clean[['tweet_id', 'img_num', 'p1', 'p1_conf', 'p1_dog']]
p2_df = image_predictions_clean[['tweet_id', 'img_num', 'p2', 'p2_conf', 'p2_dog']]
p3_df = image_predictions_clean[['tweet_id', 'img_num', 'p3', 'p3_conf', 'p3_dog']]

# Rename p columns so consistent
p1_df = p1_df.rename(columns={'p1':'p_type', 'p1_conf':'p_conf', 'p1_dog':'p_dog'})
p2_df = p2_df.rename(columns={'p2':'p_type', 'p2_conf':'p_conf', 'p2_dog':'p_dog'})
p3_df = p3_df.rename(columns={'p3':'p_type', 'p3_conf':'p_conf', 'p3_dog':'p_dog'})

# Combine the three tables and assign the result to image_predictions_clean
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
image_predictions_clean = pd.concat([p1_df, p2_df, p3_df])

##### Test

In [34]:
# Test dataframe contains correct columns (tweet_id, img_num, p_type, p_conf, p_dog)
image_predictions_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6225 entries, 0 to 2074
Data columns (total 5 columns):
tweet_id    6225 non-null int64
img_num     6225 non-null int64
p_type      6225 non-null object
p_conf      6225 non-null float64
p_dog       6225 non-null bool
dtypes: bool(1), float64(1), int64(2), object(1)
memory usage: 249.2+ KB


In [35]:
# Test data correct
image_predictions_clean.sample(10)

Unnamed: 0,tweet_id,img_num,p_type,p_conf,p_dog
1524,788178268662984705,2,Arctic_fox,0.036072,False
639,681297372102656000,1,Shih-Tzu,0.113672,True
314,671729906628341761,1,Samoyed,0.117122,True
1917,854732716440526848,1,chow,0.028411,True
536,676946864479084545,1,golden_retriever,0.055655,True
651,682003177596559360,1,triceratops,0.249872,False
1456,777641927919427584,1,golden_retriever,0.964929,True
1346,759197388317847553,1,kuvasz,0.511341,True
110,667886921285246976,1,Pekinese,0.168445,True
1397,768193404517830656,1,ram,0.300851,False


### Quality

#### `archive_enhanced`: tweet_id is an int

##### Define
- Convert tweet_id column to str using .astype()

##### Code

In [36]:
archive_enhanced_clean.tweet_id = archive_enhanced_clean.tweet_id.astype(str)

##### Test

In [37]:
# Check that tweet_id is an object ('O')
archive_enhanced_clean.tweet_id.dtype

dtype('O')

#### `image_predictions`: tweet_id is an int

##### Define
- Convert tweet_id column to str using .astype()

##### Code

In [38]:
image_predictions_clean.tweet_id = image_predictions_clean.tweet_id.astype(str)

##### Test

In [39]:
# Check that tweet_id is an object ('O')
image_predictions_clean.tweet_id.dtype

dtype('O')

#### `archive_enhanced`: presence of retweets

##### Define
- Remove rows where retweeted_status_id != null (presence of a value means that the row is a retweet)

##### Code

In [40]:
# Check the number of rows in archive_enhanced_clean
archive_enhanced_clean.shape

(2339, 16)

In [41]:
# Check the number of null rows in retweeted_status_id (i.e. non-retweets) 
sum(archive_enhanced_clean['retweeted_status_id'].isnull())

2172

In [42]:
# Remove non-null rows (retweeted_status_id) from archive_enhanced
archive_enhanced_clean = archive_enhanced_clean[archive_enhanced_clean[
    'retweeted_status_id'].isnull()]

##### Test

In [43]:
# Check that number of rows = 2172
archive_enhanced_clean.shape

(2172, 16)

In [44]:
# Check that there are no rows with a non-null value in the retweeted_status_id column
sum(~archive_enhanced_clean['retweeted_status_id'].isnull())

0

#### rating_numerator and rating_denominator columns are ints

##### Define
- Convert rating_numerator and rating_denominator columns to float using .astype()

##### Code
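
This cell has not been written yet. One way the conversion described in the Define step might look is sketched here on an invented `toy` dataframe rather than on `archive_enhanced_clean` itself.

```python
import pandas as pd

# Toy stand-in for archive_enhanced_clean (values invented)
toy = pd.DataFrame({
    'rating_numerator': [13, 10],
    'rating_denominator': [10, 10],
})

# Convert both rating columns from int to float
toy = toy.astype({'rating_numerator': float, 'rating_denominator': float})
```

A matching test could check that `toy.dtypes` reports `float64` for both columns.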

##### Test

#### Incorrect ratings (numerators and denominators)

##### Define
- Write a regex to find the numerator and denominator from the text column
- Save as floats to numerator and denominator columns
- Test against some of the known incorrect values to see whether the regex improves on the original extraction

##### Code
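
This cell has not been written yet. A possible starting point for the regex extraction is sketched here. The pattern, the helper name `extract_rating`, the rule of preferring a match with denominator 10, and the sample texts are all assumptions made for illustration; none of them come from the archive.

```python
import re

import numpy as np
import pandas as pd

# Optional decimal numerator, a slash, then an integer denominator
RATING_PATTERN = re.compile(r'(\d+(?:\.\d+)?)/(\d+)')


def extract_rating(text):
    """Return (numerator, denominator) as floats, or (nan, nan) if none is found."""
    matches = RATING_PATTERN.findall(text)
    if not matches:
        return (np.nan, np.nan)
    # Assumption: a fraction over 10 is more likely to be the actual rating
    for numerator, denominator in matches:
        if denominator == '10':
            return (float(numerator), float(denominator))
    # Otherwise fall back to the first fraction in the text
    numerator, denominator = matches[0]
    return (float(numerator), float(denominator))


# Invented texts covering a decimal rating, a date-like fraction,
# a group rating, and a tweet with no rating
toy = pd.DataFrame({'text': [
    'Invented example: 13.5/10 would pet',
    'Invented example: happy 4/20, still 13/10',
    'Invented example: 84/70 for the whole pack',
    'Invented example: no rating here',
]})

ratings = pd.DataFrame(toy['text'].apply(extract_rating).tolist(),
                       columns=['rating_numerator', 'rating_denominator'],
                       index=toy.index)
toy = toy.join(ratings)
```

Tweets that come back as NaN are candidates for dropping, and the result can be compared with the manually checked tweet IDs from the Assess section.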

##### Test

#### Inconsistent format for dog names in p columns of `image_predictions`.

##### Define
- Replace '_' with ' ' in p_type column of `image_predictions_clean`
- Capitalise each word in p_type column of `image_predictions_clean`

##### Code
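
This cell has not been written yet. The two Define steps might be chained as in this sketch, shown on a small `toy` dataframe whose names are taken from the sample output earlier in the notebook.

```python
import pandas as pd

# Toy stand-in for image_predictions_clean
toy = pd.DataFrame({'p_type': ['golden_retriever', 'Shih-Tzu', 'Boston_bull']})

# Replace underscores with spaces, then capitalise each word
toy['p_type'] = toy['p_type'].str.replace('_', ' ', regex=False).str.title()
```

Passing `regex=False` treats the underscore as a literal character, and `.str.title()` leaves hyphenated names such as Shih-Tzu intact.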

##### Test

#### Predictions include non-dog objects in `image_predictions`.

##### Define
- Drop rows where p_dog == False in `image_predictions_clean`

##### Code
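
This cell has not been written yet. Dropping the non-dog predictions could be done with a boolean filter, sketched here on an invented `toy` dataframe; the name `dogs_only` is also an assumption.

```python
import pandas as pd

# Toy stand-in for image_predictions_clean (values invented)
toy = pd.DataFrame({
    'tweet_id': ['1', '1', '2'],
    'p_type': ['golden_retriever', 'soccer_ball', 'swing'],
    'p_dog': [True, False, False],
})

# Keep only the rows where the prediction is a dog
dogs_only = toy[toy['p_dog']].reset_index(drop=True)
```

A matching test could confirm that `dogs_only['p_dog'].all()` is true.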

##### Test

## Storage

(store each table as a csv file with main file called twitter_archive_master.csv)
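
A sketch of how the storage step might look, assuming the intended main file name is twitter_archive_master.csv. The second file name `image_predictions_clean.csv`, the temporary output directory, and the placeholder values in the toy dataframes are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy stand-ins for the cleaned dataframes (rating and p_type are placeholders)
archive_toy = pd.DataFrame({'tweet_id': ['892420643555336193'],
                            'rating_numerator': [13.0]})
predictions_toy = pd.DataFrame({'tweet_id': ['892420643555336193'],
                                'p_type': ['Golden Retriever']})

# Write to a temporary directory so the sketch does not touch the project folder
out_dir = tempfile.mkdtemp()
master_path = os.path.join(out_dir, 'twitter_archive_master.csv')
predictions_path = os.path.join(out_dir, 'image_predictions_clean.csv')

# index=False avoids writing the dataframe index as an extra column
archive_toy.to_csv(master_path, index=False)
predictions_toy.to_csv(predictions_path, index=False)

# Read tweet_id back as a string so the long IDs keep their cleaned type
reloaded = pd.read_csv(master_path, dtype={'tweet_id': str})
```

Specifying `dtype={'tweet_id': str}` on reload matters because `read_csv` would otherwise infer the IDs as integers again.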

## Analysis

(make at least 3 insights)
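
As a starting point for the question about dog stages, the comparison might be computed as in this sketch. All values in `toy` are invented and say nothing about the real results; the `rating` column and the names `staged` and `summary` are assumptions.

```python
import pandas as pd

# Toy data only: these numbers are invented, not taken from the archive
toy = pd.DataFrame({
    'dog_stage': ['pupper', 'pupper', 'doggo', 'puppo', 'None'],
    'rating_numerator': [10.0, 12.0, 13.0, 14.0, 11.0],
    'rating_denominator': [10.0, 10.0, 10.0, 10.0, 10.0],
    'retweet_count': [100, 300, 500, 700, 200],
})

# Normalise ratings so that different denominators are comparable
toy['rating'] = toy['rating_numerator'] / toy['rating_denominator']

# Ignore tweets without a recorded dog stage
staged = toy[toy['dog_stage'] != 'None']

# Mean rating and mean retweet count per dog stage, highest rating first
summary = (staged.groupby('dog_stage')[['rating', 'retweet_count']]
           .mean()
           .sort_values('rating', ascending=False))
```

The same pattern, grouped by `p_type` instead of `dog_stage`, would address the questions about dog types once the predictions are merged with the archive.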