## Introduction

This project examines the WeRateDogs tweet archive to draw insights about dog ratings. It follows the Gather, Assess and Clean process to answer the following questions:

- Which types of dog have the highest ratings?
- Which dog stage has the highest ratings?
- Which types of dog are the most popular in terms of retweet count and favorite count?

## Gather

In [1]:
import json
import numpy as np
import os
import pandas as pd
import re
import requests
import tweepy
from timeit import default_timer as timer
from tweepy import OAuthHandler

In [2]:
# Load twitter archive file
archive_enhanced = pd.read_csv('twitter-archive-enhanced.csv')

In [3]:
# Download image-predictions file
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
with open('image_predictions.tsv', mode='wb') as file:
    file.write(response.content)

In [4]:
# Import image predictions file
image_predictions = pd.read_csv('image_predictions.tsv', sep='\t')

In [5]:
# Query Twitter API for each tweet in the Twitter archive and save JSON in a text file
consumer_key = 'HIDDEN'
consumer_secret = 'HIDDEN'
access_token = 'HIDDEN'
access_secret = 'HIDDEN'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

In [6]:
# Tweet IDs for which to gather additional data via Twitter's API
tweet_ids = archive_enhanced.tweet_id.values
len(tweet_ids)

2356

In [7]:
# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
if not os.path.exists('tweet_json.txt'):
    count = 0
    fails_dict = {}
    start = timer()
    # Save each tweet's returned JSON as a new line in a .txt file
    with open('tweet_json.txt', 'w') as outfile:
        # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
        for tweet_id in tweet_ids:
            count += 1
            print(str(count) + ": " + str(tweet_id))
            try:
                tweet = api.get_status(tweet_id, tweet_mode='extended')
                print("Success")
                json.dump(tweet._json, outfile)
                outfile.write('\n')
            except tweepy.TweepError as e:
                print("Fail")
                fails_dict[tweet_id] = e
    end = timer()
    print(end - start)
    print(fails_dict)


In [8]:
# DELETE
# Temp - save fails_dict to file for working with
'''
import csv
fails_list = []
for key in fails_dict.keys():
    fails_list.append(key)
fails_list
with open('fails_dict.csv', 'w') as outfile:
    wr = csv.writer(outfile, dialect='excel')
    wr.writerow(fails_list)
'''

In [9]:
# DELETE
# Working - uncomment out if necessary
'''
failures = []
with open('fails_dict.csv', 'r') as f:
    reader = csv.reader(f)
    for line in reader:
        failures.append(line)
'''

In [10]:
# Load extended tweet data
tweets = []
with open('tweet_json.txt') as json_file:
    for line in json_file:
        tweets.append(json.loads(line))

# Place extended tweet data into a dataframe
extended_data = pd.DataFrame(tweets)
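
Since only a few of the fields in each tweet's JSON are used later, the loading step could also keep just those fields. The following is a minimal sketch under stated assumptions: the two JSON lines are made-up stand-ins for `tweet_json.txt`, and the names `json_lines`, `wanted` and `slim` are hypothetical.

```python
import io
import json

import pandas as pd

# Two made-up JSON lines standing in for the contents of tweet_json.txt
json_lines = io.StringIO(
    '{"id": 1, "favorite_count": 10, "retweet_count": 2, "lang": "en"}\n'
    '{"id": 2, "favorite_count": 30, "retweet_count": 5, "lang": "en"}\n'
)

# Fields to keep from each tweet
wanted = ['id', 'favorite_count', 'retweet_count']

rows = []
for line in json_lines:
    tweet = json.loads(line)
    # Keep only the wanted keys from the full tweet dictionary
    rows.append({key: tweet[key] for key in wanted})

slim = pd.DataFrame(rows, columns=wanted)
print(slim)
```

This keeps the dataframe small from the start, at the cost of having to reload the file if other fields turn out to be needed.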

## Assess

In [11]:
# View column types and missing data in twitter_archive_enhanced
archive_enhanced.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

In [12]:
# Visually assess data in archive
archive_enhanced.sample(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1197,717009362452090881,,,2016-04-04 15:22:08 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Smokey. He's having some sort of exist...,,,,https://twitter.com/dog_rates/status/717009362...,10,10,Smokey,,,pupper,
953,751830394383790080,,,2016-07-09 17:28:29 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tucker. He's very camera shy. 12/10 wo...,,,,https://twitter.com/dog_rates/status/751830394...,12,10,Tucker,,,,
946,752568224206688256,,,2016-07-11 18:20:21 +0000,"<a href=""http://vine.co"" rel=""nofollow"">Vine -...",Here are three doggos completely misjudging an...,,,,https://vine.co/v/5W0bdhEUUVT,9,10,,,,,
391,826204788643753985,,,2017-01-30 23:05:46 +0000,"<a href=""http://twitter.com/download/iphone"" r...","Here's a little more info on Dew, your favorit...",,,,http://us.blastingnews.com/news/2017/01/kentuc...,13,10,,doggo,,,
532,808001312164028416,,,2016-12-11 17:31:39 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Cooper. He likes to stick his tongue o...,,,,https://twitter.com/dog_rates/status/808001312...,12,10,Cooper,,,,


In [13]:
# Check for duplicated tweets
len(archive_enhanced[archive_enhanced.tweet_id.duplicated()])

0

In [14]:
# Check for inclusion of retweets
sum(~archive_enhanced.retweeted_status_id.isnull())

181

There are 181 tweets that are retweets and need to be removed.

In [15]:
# Check values for rating_numerator
archive_enhanced.rating_numerator.value_counts().sort_index()

0         2
1         9
2         9
3        19
4        17
5        37
6        32
7        55
8       102
9       158
10      461
11      464
12      558
13      351
14       54
15        2
17        1
20        1
24        1
26        1
27        1
44        1
45        1
50        1
60        1
75        2
80        1
84        1
88        1
99        1
121       1
143       1
144       1
165       1
182       1
204       1
420       2
666       1
960       1
1776      1
Name: rating_numerator, dtype: int64

All of the numerators are stored as whole numbers. However, a visual inspection of the data shows that some ratings in the dataset have fractional numerators.
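
One way to back up the visual inspection programmatically is to search the tweet text for a decimal number followed by a slash. The sketch below uses made-up sample rows rather than the real archive; the names `sample`, `decimal_pattern` and `decimal_rows` are hypothetical.

```python
import pandas as pd

# Made-up rows; in the notebook the text would come from archive_enhanced.text
sample = pd.DataFrame({
    'text': [
        'This is Bella. She hopes her smile made you smile. 13.5/10',
        'This is Tucker. Very camera shy. 12/10',
        'Sophie is 9.75/10 and deserves it.',
    ],
    'rating_numerator': [5, 12, 75],
})

# Digits, a dot, digits, then a slash and the denominator
decimal_pattern = r'\d+\.\d+/\d+'

# Boolean mask of rows whose text contains a decimal rating
has_decimal = sample['text'].str.contains(decimal_pattern, regex=True)
decimal_rows = sample[has_decimal]
print(decimal_rows)
```

Rows flagged this way are candidates for a stored numerator that holds only the digits after the decimal point.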

In [16]:
# Check values in rating_denominator - should be 10
archive_enhanced.rating_denominator.value_counts().sort_index()

0         1
2         1
7         1
10     2333
11        3
15        1
16        1
20        2
40        1
50        3
70        1
80        2
90        1
110       1
120       1
130       1
150       1
170       1
Name: rating_denominator, dtype: int64

Not all of the denominators are 10. However, a denominator of 10 is not a hard requirement, so these values can be allowed.
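
If ratings with different denominators are kept, one possible way to compare them is to look at the ratio of numerator to denominator. This is only an illustrative sketch with made-up values; the `rating_ratio` column is a hypothetical name and is not part of the cleaning below.

```python
import numpy as np
import pandas as pd

# Made-up ratings, including a zero denominator like the one seen above
ratings = pd.DataFrame({
    'rating_numerator': [12, 84, 9, 960],
    'rating_denominator': [10, 70, 11, 0],
})

# Treat a zero denominator as missing so the division does not produce inf
denominator = ratings['rating_denominator'].replace(0, np.nan)

# Ratio of numerator to denominator, comparable across denominators
ratings['rating_ratio'] = ratings['rating_numerator'] / denominator
print(ratings)
```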

In [17]:
# Examine data types and missing data in image_predictions
image_predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


No data is missing as each column has the full number of entries (2075).

In [18]:
# Visually examine image_predictions data
image_predictions.sample(10)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1295,751937170840121344,https://pbs.twimg.com/media/Cm9q2d3XEAAqO2m.jpg,1,Lakeland_terrier,0.424168,True,teddy,0.260562,False,golden_retriever,0.127432,True
1429,772877495989305348,https://pbs.twimg.com/ext_tw_video_thumb/77287...,1,tabby,0.218303,False,Norwegian_elkhound,0.138523,True,wombat,0.074217,False
1965,867421006826221569,https://pbs.twimg.com/media/DAmyy8FXYAIH8Ty.jpg,1,Eskimo_dog,0.616457,True,Siberian_husky,0.38133,True,malamute,0.00167,True
1683,813944609378369540,https://pbs.twimg.com/media/Cveg1-NXgAASaaT.jpg,1,Labrador_retriever,0.427742,True,Great_Dane,0.190503,True,curly-coated_retriever,0.146427,True
1803,832040443403784192,https://pbs.twimg.com/media/Cq9guJ5WgAADfpF.jpg,1,miniature_pinscher,0.796313,True,Chihuahua,0.155413,True,Staffordshire_bullterrier,0.030943,True
167,668986018524233728,https://pbs.twimg.com/media/CUi3PIrWoAAPvPT.jpg,1,doormat,0.976103,False,Chihuahua,0.00564,True,Norfolk_terrier,0.003913,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
24,666353288456101888,https://pbs.twimg.com/media/CT9cx0tUEAAhNN_.jpg,1,malamute,0.336874,True,Siberian_husky,0.147655,True,Eskimo_dog,0.093412,True
739,687127927494963200,https://pbs.twimg.com/media/CYkrNIVWcAMswmP.jpg,1,pug,0.178205,True,Chihuahua,0.149164,True,Shih-Tzu,0.120505,True
942,704113298707505153,https://pbs.twimg.com/media/CcWDTerUAAALORn.jpg,2,otter,0.945537,False,mink,0.018231,False,sea_lion,0.015861,False


The p1, p2 and p3 columns are inconsistent in their string formatting (mixed case, underscores in place of spaces), and the table is untidy because each row holds multiple predictions.

In [19]:
# Check column types and missing values in extended_data
extended_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2339 entries, 0 to 2338
Data columns (total 32 columns):
contributors                     0 non-null object
coordinates                      0 non-null object
created_at                       2339 non-null object
display_text_range               2339 non-null object
entities                         2339 non-null object
extended_entities                2065 non-null object
favorite_count                   2339 non-null int64
favorited                        2339 non-null bool
full_text                        2339 non-null object
geo                              0 non-null object
id                               2339 non-null int64
id_str                           2339 non-null object
in_reply_to_screen_name          77 non-null object
in_reply_to_status_id            77 non-null float64
in_reply_to_status_id_str        77 non-null object
in_reply_to_user_id              77 non-null float64
in_reply_to_user_id_str          77 non-null obj

In [20]:
# Visually examine extended_data
extended_data.head(5)

Unnamed: 0,contributors,coordinates,created_at,display_text_range,entities,extended_entities,favorite_count,favorited,full_text,geo,...,quoted_status,quoted_status_id,quoted_status_id_str,quoted_status_permalink,retweet_count,retweeted,retweeted_status,source,truncated,user
0,,,Tue Aug 01 16:23:56 +0000 2017,"[0, 85]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892420639486877696, 'id_str'...",37783,False,This is Phineas. He's a mystical boy. Only eve...,,...,,,,,8233,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
1,,,Tue Aug 01 00:17:27 +0000 2017,"[0, 138]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892177413194625024, 'id_str'...",32453,False,This is Tilly. She's just checking pup on you....,,...,,,,,6084,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
2,,,Mon Jul 31 00:18:03 +0000 2017,"[0, 121]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891815175371796480, 'id_str'...",24437,False,This is Archie. He is a rare Norwegian Pouncin...,,...,,,,,4027,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
3,,,Sun Jul 30 15:58:51 +0000 2017,"[0, 79]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891689552724799489, 'id_str'...",41113,False,This is Darla. She commenced a snooze mid meal...,,...,,,,,8384,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
4,,,Sat Jul 29 16:00:24 +0000 2017,"[0, 138]","{'hashtags': [{'text': 'BarkWeek', 'indices': ...","{'media': [{'id': 891327551943041024, 'id_str'...",39326,False,This is Franklin. He would like you to stop ca...,,...,,,,,9089,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."


Most of the columns will not be needed, so little cleaning is required here beyond converting the id column to a string. The id_str column won't be used for the merge because it is a string, whereas the tweet_id column in the archive_enhanced data is currently an int, so the two would not match.
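
As a small illustration of why the IDs are better handled as strings (or at least kept as integers), the sketch below converts two of the IDs shown above to `str` and also shows what happens if they pass through a float. The variable names are hypothetical.

```python
import pandas as pd

# Two tweet IDs taken from the extended_data sample
ids = pd.Series([892420643555336193, 892177421306343426], name='id')

# Converting int64 to str keeps every digit
ids_as_str = ids.astype(str)

# A float64 cannot represent these values exactly, so a round trip changes them
round_trip = ids.astype('float64').astype('int64')

print(ids_as_str.tolist())
print((round_trip != ids).tolist())
```

This is one reason to avoid any step that silently turns an ID column into a float, for example a merge or load that introduces missing values into an integer column.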

In [21]:
# Find number of retweets in extended_data
len(extended_data[~extended_data['retweeted_status'].isnull()])

167

There are 167 tweets that are retweets and will need to be removed from the data. These will be cleaned in the merged archive_enhanced data.

In [22]:
# Check for duplication between archive_enhanced and image_predictions
all_columns = pd.Series(list(archive_enhanced) + list(image_predictions))
all_columns[all_columns.duplicated()]

17    tweet_id
dtype: object

Only the tweet_id column was duplicated, which we can use as the link between the two tables.
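
Before relying on tweet_id as the link, it can help to check how many IDs the tables actually share. The following sketch uses two tiny made-up tables; `archive` and `predictions` are hypothetical stand-ins for `archive_enhanced` and `image_predictions`.

```python
import pandas as pd

# Tiny stand-ins for the two tables
archive = pd.DataFrame({'tweet_id': [1, 2, 3, 4]})
predictions = pd.DataFrame({'tweet_id': [2, 3, 5]})

archive_ids = set(archive['tweet_id'])
prediction_ids = set(predictions['tweet_id'])

# IDs present in both tables, and IDs present in only one of them
shared = archive_ids & prediction_ids
only_archive = archive_ids - prediction_ids
only_predictions = prediction_ids - archive_ids

print(len(shared), len(only_archive), len(only_predictions))
```

IDs that appear in only one table would be dropped by an inner join, so these counts indicate how many rows such a join would lose.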

#### Quality
##### `archive_enhanced` table
- data missing such as retweet_count and favorite_count
- tweet_id is an int (should be str)
- presence of retweets (retweeted_status_id is non-null)
- rating_numerator and rating_denominator columns are ints (should be floats)
- several incorrect ratings to be corrected

##### `image_predictions` table
- tweet_id is an int (should be str)
- dog names in p1, p2, p3 columns inconsistent - upper and lowercase names, inclusion of underscores etc.
- predictions include non-dog objects which should be removed prior to analysis

##### `extended_data` table
- presence of retweets (retweeted_status is non-null) (Note: will be cleaned at the same time that the archive_enhanced file is cleaned, as the two dataframes will have been merged at that point).

#### Tidiness
- `archive_enhanced` contains multiple columns for the dog stage which should be one column with the dog stage as the variable
- `image_predictions` has multiple predictions per row

## Clean

In [23]:
# Make copies of each dataframe for cleaning
archive_enhanced_clean = archive_enhanced.copy()
image_predictions_clean = image_predictions.copy()
extended_data_clean = extended_data.copy()

### Missing Data

#### `extended_data`: favorite_count and retweet_count need to be joined to `archive_enhanced`.

##### Define
- Drop the unwanted columns from `extended_data_clean` so just id, favorite_count and retweet_count remain
- Rename id to tweet_id in `extended_data_clean`
- Merge `extended_data_clean` with `archive_enhanced_clean` on the tweet_id column, keeping only the rows (tweet_id) that are present in both dataframes

##### Code

In [24]:
# Remove unwanted columns
columns = ['id', 'favorite_count', 'retweet_count']
extended_data_clean = extended_data_clean[columns]

# Rename id column
extended_data_clean = extended_data_clean.rename(columns={'id':'tweet_id'})
extended_data_clean.head()

Unnamed: 0,tweet_id,favorite_count,retweet_count
0,892420643555336193,37783,8233
1,892177421306343426,32453,6084
2,891815181378084864,24437,4027
3,891689557279858688,41113,8384
4,891327558926688256,39326,9089


In [25]:
# Merge archive_enhanced_clean and extended_data_clean
archive_enhanced_clean = pd.merge(archive_enhanced_clean,
                                  extended_data_clean, on='tweet_id', how='inner')

##### Test

In [26]:
# Check that retweet_count and favorite_count columns have been added and there are 2339 rows
archive_enhanced_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2339 entries, 0 to 2338
Data columns (total 19 columns):
tweet_id                      2339 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2339 non-null object
source                        2339 non-null object
text                          2339 non-null object
retweeted_status_id           167 non-null float64
retweeted_status_user_id      167 non-null float64
retweeted_status_timestamp    167 non-null object
expanded_urls                 2280 non-null object
rating_numerator              2339 non-null int64
rating_denominator            2339 non-null int64
name                          2339 non-null object
doggo                         2339 non-null object
floofer                       2339 non-null object
pupper                        2339 non-null object
puppo                         2339 non-null object
favorite_count                23

### Tidiness

#### `archive_enhanced` contains multiple columns for the dog stage.

##### Define
- Create a column in `archive_enhanced` called dog_stage that holds the concatenation of each value in the doggo, floofer, pupper and puppo columns.
- Extract the correct dog stage from the dog_stage column and update
- For entries with more than one dog stage, record these as 'mixed'

##### Code

In [27]:
# Add dog_stage column and populate with concatenated values for dog stages
archive_enhanced_clean['dog_stage'] = archive_enhanced_clean['doggo'] \
    + archive_enhanced_clean['floofer'] + archive_enhanced_clean['pupper'] \
    + archive_enhanced_clean['puppo']

In [28]:
# Check values present in dog_stage column
archive_enhanced_clean['dog_stage'].value_counts()

NoneNoneNoneNone        1961
NoneNonepupperNone       244
doggoNoneNoneNone         82
NoneNoneNonepuppo         29
doggoNonepupperNone       12
NoneflooferNoneNone        9
doggoNoneNonepuppo         1
doggoflooferNoneNone       1
Name: dog_stage, dtype: int64

In [29]:
# Function to extract dog_stage
def get_dog_stage(current):
    '''Extract dog_stage from passed string'''
    if current == 'NoneNoneNoneNone':
        return 'None'
    elif current == 'doggoNoneNoneNone':
        return 'doggo'
    elif current == 'NoneflooferNoneNone':
        return 'floofer'
    elif current == 'NoneNonepupperNone':
        return 'pupper'
    elif current == 'NoneNoneNonepuppo':
        return 'puppo'
    else:
        return 'mixed'

In [30]:
# Update dog_stage column with call to get_dog_stage()
archive_enhanced_clean['dog_stage'] = archive_enhanced_clean['dog_stage'].apply(get_dog_stage)
# Drop individual dog stage columns
archive_enhanced_clean = archive_enhanced_clean.drop(['doggo', 'puppo', 'pupper', 'floofer'], axis=1)
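
The lookup on concatenated strings works for the combinations present in this dataset. A slightly more general alternative, sketched here on made-up rows, is to count how many stage columns are set for each row. The function name `combine_stages` and the sample data are hypothetical; the sketch assumes the archive's convention that the string 'None' marks an absent stage.

```python
import pandas as pd

stages = ['doggo', 'floofer', 'pupper', 'puppo']

# Made-up rows following the archive's convention of 'None' for an absent stage
df = pd.DataFrame({
    'doggo':   ['None', 'doggo', 'doggo', 'None'],
    'floofer': ['None', 'None', 'None', 'None'],
    'pupper':  ['None', 'None', 'pupper', 'pupper'],
    'puppo':   ['None', 'None', 'None', 'None'],
})

def combine_stages(row):
    '''Return the single stage set in the row, 'mixed' for several, 'None' for none'''
    found = [stage for stage in stages if row[stage] == stage]
    if not found:
        return 'None'
    if len(found) > 1:
        return 'mixed'
    return found[0]

df['dog_stage'] = df[stages].apply(combine_stages, axis=1)
print(df['dog_stage'].tolist())
```

This version does not depend on the column order used when concatenating, so it would still work if a further stage column were added.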

##### Test

In [31]:
# Confirm that column has been updated correctly
archive_enhanced_clean['dog_stage'].value_counts()

None       1961
pupper      244
doggo        82
puppo        29
mixed        14
floofer       9
Name: dog_stage, dtype: int64

In [32]:
# Confirm that old dog stage columns have been dropped
archive_enhanced_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2339 entries, 0 to 2338
Data columns (total 16 columns):
tweet_id                      2339 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2339 non-null object
source                        2339 non-null object
text                          2339 non-null object
retweeted_status_id           167 non-null float64
retweeted_status_user_id      167 non-null float64
retweeted_status_timestamp    167 non-null object
expanded_urls                 2280 non-null object
rating_numerator              2339 non-null int64
rating_denominator            2339 non-null int64
name                          2339 non-null object
favorite_count                2339 non-null int64
retweet_count                 2339 non-null int64
dog_stage                     2339 non-null object
dtypes: float64(4), int64(5), object(7)
memory usage: 310.6+ KB


#### `image_predictions` has multiple predictions per row.

##### Define
- Create separate dataframes from `image_predictions_clean` for the p1, p2 and p3 data, keeping only tweet_id, img_num and the columns for that prediction (`p1_df`, `p2_df`, `p3_df`)
- Rename the p1 column to 'p_type'
- Rename p1_conf to 'p_conf'
- Rename p1_dog to 'p_dog'
- Repeat for p2 and p3
- Append `p2_df` and `p3_df` to `p1_df`
- Store the combined dataframe as `image_predictions_clean`

##### Code

In [33]:
# Create new dataframes
p1_df = image_predictions_clean[['tweet_id', 'img_num', 'p1', 'p1_conf', 'p1_dog']]
p2_df = image_predictions_clean[['tweet_id', 'img_num', 'p2', 'p2_conf', 'p2_dog']]
p3_df = image_predictions_clean[['tweet_id', 'img_num', 'p3', 'p3_conf', 'p3_dog']]

# Rename p columns so consistent
p1_df = p1_df.rename(columns={'p1':'p_type', 'p1_conf':'p_conf', 'p1_dog':'p_dog'})
p2_df = p2_df.rename(columns={'p2':'p_type', 'p2_conf':'p_conf', 'p2_dog':'p_dog'})
p3_df = p3_df.rename(columns={'p3':'p_type', 'p3_conf':'p_conf', 'p3_dog':'p_dog'})

# Combine the three tables and assign to image_predictions_clean
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
image_predictions_clean = pd.concat([p1_df, p2_df, p3_df])
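
The same reshaping can be written as a loop, which also makes it easy to record which prediction (first, second or third) each row came from. This is a sketch on a made-up two-row table; the `p_order` column and the names `wide`, `frames` and `tidy` are hypothetical additions and are not used in the rest of this notebook.

```python
import pandas as pd

# Made-up wide table with the same column layout as image_predictions
wide = pd.DataFrame({
    'tweet_id': [1, 2],
    'img_num': [1, 1],
    'p1': ['pug', 'otter'], 'p1_conf': [0.6, 0.9], 'p1_dog': [True, False],
    'p2': ['Chihuahua', 'mink'], 'p2_conf': [0.3, 0.05], 'p2_dog': [True, False],
    'p3': ['Shih-Tzu', 'sea_lion'], 'p3_conf': [0.1, 0.01], 'p3_dog': [True, False],
})

frames = []
for order in (1, 2, 3):
    prefix = f'p{order}'
    # Select the columns for this prediction and give them common names
    part = wide[['tweet_id', 'img_num', prefix, f'{prefix}_conf', f'{prefix}_dog']]
    part = part.rename(columns={prefix: 'p_type',
                                f'{prefix}_conf': 'p_conf',
                                f'{prefix}_dog': 'p_dog'})
    # Record which prediction the row came from
    part = part.assign(p_order=order)
    frames.append(part)

# Stack the three parts into one long table with a fresh index
tidy = pd.concat(frames, ignore_index=True)
print(tidy)
```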

##### Test

In [34]:
# Test dataframe contains correct columns (tweet_id, img_num, p_type, p_conf, p_dog)
image_predictions_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6225 entries, 0 to 2074
Data columns (total 5 columns):
tweet_id    6225 non-null int64
img_num     6225 non-null int64
p_type      6225 non-null object
p_conf      6225 non-null float64
p_dog       6225 non-null bool
dtypes: bool(1), float64(1), int64(2), object(1)
memory usage: 249.2+ KB


In [35]:
# Test data correct
image_predictions_clean.sample(10)

Unnamed: 0,tweet_id,img_num,p_type,p_conf,p_dog
1424,772152991789019136,2,golden_retriever,0.275318,True
1972,869596645499047938,1,muzzle,0.006296,False
1870,844704788403113984,1,beagle,0.003147,True
1803,832040443403784192,1,Staffordshire_bullterrier,0.030943,True
129,668286279830867968,1,Cardigan,0.11301,True
736,687102708889812993,1,rock_crab,0.001513,False
1710,818259473185828864,1,miniature_schnauzer,0.367368,True
431,674271431610523648,1,bloodhound,0.003019,True
1819,834209720923721728,1,Labrador_retriever,0.008654,True
1337,758405701903519748,4,notebook,0.032727,False


### Quality

#### `archive_enhanced`: tweet_id is an int

##### Define
- Convert tweet_id column to str using .astype()

##### Code

In [36]:
archive_enhanced_clean.tweet_id = archive_enhanced_clean.tweet_id.astype(str)

##### Test

In [37]:
# Check that tweet_id is an object ('O')
archive_enhanced_clean.tweet_id.dtype

dtype('O')

#### `image_predictions`: tweet_id is an int

##### Define
- Convert tweet_id column to str using .astype()

##### Code

In [38]:
image_predictions_clean.tweet_id = image_predictions_clean.tweet_id.astype(str)

##### Test

In [39]:
# Check that tweet_id is an object ('O')
image_predictions_clean.tweet_id.dtype

dtype('O')

#### `archive_enhanced`: presence of retweets

##### Define
- Remove rows where retweeted_status_id != null (presence of a value means that the row is a retweet)

##### Code

In [40]:
# Check the number of rows in archive_enhanced_clean
archive_enhanced_clean.shape

(2339, 16)

In [41]:
# Check the number of null rows in retweeted_status_id (i.e. non-retweets) 
sum(archive_enhanced_clean['retweeted_status_id'].isnull())

2172

In [42]:
# Remove non-null rows (retweeted_status_id) from archive_enhanced
archive_enhanced_clean = archive_enhanced_clean[archive_enhanced_clean[
    'retweeted_status_id'].isnull()]

##### Test

In [43]:
# Check that number of rows = 2172
archive_enhanced_clean.shape

(2172, 16)

In [44]:
# Check that there are no rows with a non-null value in the retweeted_status_id column
sum(~archive_enhanced_clean['retweeted_status_id'].isnull())

0

#### rating_numerator and rating_denominator columns are ints

##### Define
- Convert rating_denominator column to float using .astype()

##### Code

In [45]:
# Convert rating_denominator to a float
archive_enhanced_clean.rating_denominator = archive_enhanced_clean.rating_denominator.astype(float)

Note that the rating_numerator column will be addressed when the regex for extracting the numerator from the text column is updated.

##### Test

In [46]:
# Test that rating_denominator column type has changed to float
archive_enhanced_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2172 entries, 0 to 2338
Data columns (total 16 columns):
tweet_id                      2172 non-null object
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2172 non-null object
source                        2172 non-null object
text                          2172 non-null object
retweeted_status_id           0 non-null float64
retweeted_status_user_id      0 non-null float64
retweeted_status_timestamp    0 non-null object
expanded_urls                 2114 non-null object
rating_numerator              2172 non-null int64
rating_denominator            2172 non-null float64
name                          2172 non-null object
favorite_count                2172 non-null int64
retweet_count                 2172 non-null int64
dog_stage                     2172 non-null object
dtypes: float64(5), int64(3), object(8)
memory usage: 288.5+ KB


#### Incorrect rating_numerator values

##### Define
- Write a regex to find the numerator from the text column
- Save numerator to rating_numerator column
- Convert rating_numerator column to float using .astype()

##### Code

In [47]:
# Get value counts for rating_numerator
archive_enhanced_clean.rating_numerator.value_counts().sort_index()

0         2
1         8
2         9
3        19
4        17
5        36
6        32
7        54
8        98
9       156
10      441
11      426
12      499
13      306
14       43
15        1
17        1
20        1
24        1
26        1
27        1
44        1
45        1
50        1
60        1
75        1
80        1
84        1
88        1
99        1
121       1
143       1
144       1
165       1
182       1
204       1
420       2
666       1
960       1
1776      1
Name: rating_numerator, dtype: int64

In [48]:
# Use a regex to capture the numerator, including any decimal part
# (raw string avoids invalid-escape warnings; only the numerator group is captured)
correct_numerator = archive_enhanced_clean.text.str.extract(r'((?:\d+\.)?\d+)/\d+', expand=False)
archive_enhanced_clean['rating_numerator'] = correct_numerator

# Convert the rating_numerator column from str to float
archive_enhanced_clean['rating_numerator'] = archive_enhanced_clean['rating_numerator'].astype(float)

##### Test

In [49]:
# Check values in rating_numerator column now include fractional numerators
archive_enhanced_clean.rating_numerator.value_counts().sort_index()

0.00         2
1.00         8
2.00         9
3.00        19
4.00        17
5.00        34
6.00        32
7.00        54
8.00        98
9.00       156
9.50         1
9.75         1
10.00      441
11.00      426
11.26        1
11.27        1
12.00      499
13.00      306
13.50        1
14.00       43
15.00        1
17.00        1
20.00        1
24.00        1
44.00        1
45.00        1
50.00        1
60.00        1
80.00        1
84.00        1
88.00        1
99.00        1
121.00       1
143.00       1
144.00       1
165.00       1
182.00       1
204.00       1
420.00       2
666.00       1
960.00       1
1776.00      1
Name: rating_numerator, dtype: int64

We can see that some fractional numerators are now included, such as 9.5, 9.75 and 11.26.

In [50]:
# Check data type for rating_numerator column
archive_enhanced_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2172 entries, 0 to 2338
Data columns (total 16 columns):
tweet_id                      2172 non-null object
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2172 non-null object
source                        2172 non-null object
text                          2172 non-null object
retweeted_status_id           0 non-null float64
retweeted_status_user_id      0 non-null float64
retweeted_status_timestamp    0 non-null object
expanded_urls                 2114 non-null object
rating_numerator              2172 non-null float64
rating_denominator            2172 non-null float64
name                          2172 non-null object
favorite_count                2172 non-null int64
retweet_count                 2172 non-null int64
dog_stage                     2172 non-null object
dtypes: float64(6), int64(2), object(8)
memory usage: 288.5+ KB


The rating_numerator column is now a float.

#### Inconsistent format for dog names in p columns of `image_predictions`.

##### Define
- Replace _ with ' ' in p_type column of `image_predictions_clean`
- Capitalise each word in p_type column of `image_predictions_clean`

##### Code

In [51]:
# Replace '_'
image_predictions_clean.p_type = image_predictions_clean.p_type.str.replace("_", " ")

# Capitalize first letters of each word
image_predictions_clean.p_type = image_predictions_clean.p_type.str.title()

##### Test

In [52]:
image_predictions_clean.sample(15)

Unnamed: 0,tweet_id,img_num,p_type,p_conf,p_dog
74,667393430834667520,1,Papillon,0.557009,True
118,668154635664932864,1,Wallaby,0.261411,False
1021,710269109699739648,1,German Shepherd,0.178157,True
69,667188689915760640,1,Vacuum,0.33583,False
540,676975532580409345,1,Eskimo Dog,0.125547,True
18,666268910803644416,1,Desktop Computer,0.086502,False
1317,755206590534418437,1,Web Site,0.906673,False
591,679158373988876288,1,Pug,0.272205,True
92,667546741521195010,1,Toy Poodle,0.787424,True
93,667549055577362432,1,Spotlight,0.007737,False


#### Predictions include non-dog objects in `image_predictions`.

##### Define
- Drop rows where p_dog == False in `image_predictions_clean`

##### Code

In [53]:
# Get initial number of rows in image_predictions_clean
image_predictions_clean.shape

(6225, 5)

In [54]:
# Keep only rows where 'p_dog' == True
image_predictions_clean = image_predictions_clean[image_predictions_clean['p_dog'] == True]

##### Test

In [55]:
# Test that rows have been dropped
image_predictions_clean.shape

(4584, 5)

In [56]:
# Test that there are no False in 'p_dog' column
image_predictions_clean.p_dog.value_counts()

True    4584
Name: p_dog, dtype: int64

## Storage

In [57]:
archive_enhanced_clean.to_csv('twitter_archive_master.csv', index=False)
image_predictions_clean.to_csv('image_predictions_master.csv', index=False)
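
One thing to keep in mind when reading these files back is that `read_csv` infers column types, so a tweet_id stored as a string would be read as an integer again unless the type is given. The sketch below shows this with an in-memory buffer instead of a file; the single row is made up (the ID is taken from the sample above, the rating is a placeholder).

```python
import io

import pandas as pd

# One made-up row with tweet_id held as a string
master = pd.DataFrame({'tweet_id': ['892420643555336193'],
                       'rating_numerator': [13.0]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
master.to_csv(buffer, index=False)

# Reading with default settings infers an integer type for tweet_id
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Passing dtype keeps tweet_id as a string
buffer.seek(0)
typed_read = pd.read_csv(buffer, dtype={'tweet_id': str})

print(default_read['tweet_id'].dtype)
print(typed_read['tweet_id'].iloc[0])
```

For the files written above, the equivalent call would be along the lines of `pd.read_csv('twitter_archive_master.csv', dtype={'tweet_id': str})`.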

## Analysis

(make at least 3 insights)