# Udacity WeRateDogs Data Wrangling Project

## Introduction
This is just some text to remind me to write an introduction for the project.

In [1]:
import requests
import os
import json
import time
import tweepy
import pandas as pd

## Data Wrangling

In this section of the report I will gather the necessary data, understand its general properties, and identify and clean quality and tidiness issues such as missing or incorrect values.

### Gather

Here I will be gathering each of the three pieces of data for this project.  
  
1. **WeRateDogs Twitter archive:** provided by Udacity as a `.csv` file called `twitter-archive-enhanced.csv`.
2. **Image predictions:** the predicted dog breed or object shown in each tweet's photo. It is stored on Udacity's servers in the file `image-predictions.tsv`.
3. **Additional data about each tweet:** information such as the number of likes and retweets of each tweet. It can be accessed using the Twitter API with tweepy.

#### WeRateDogs Twitter Archive

In [2]:
# Converting csv file to Pandas DataFrame
twitter_archive_df = pd.read_csv('data/twitter-archive-enhanced.csv')
twitter_archive_df.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


#### Image Predictions

In [3]:
# Getting the file from the Udacity server and saving it in the data folder
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
with open(os.path.join('data', url.split('/')[-1]), mode='wb') as file:
    file.write(response.content)

In [4]:
# Converting tsv file to Pandas DataFrame
image_predictions_df = pd.read_csv('data/image-predictions.tsv', sep='\t')
image_predictions_df.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


#### Additional data about each tweet

In [5]:
# Keys and tokens provided by Twitter
consumer_key = 'YOUR CONSUMER KEY HERE'
consumer_secret = 'YOUR CONSUMER SECRET HERE'
access_token = 'YOUR ACCESS TOKEN HERE'
access_secret = 'YOUR ACCESS SECRET HERE'

# Creating Twitter API object that waits when the rate limit is reached
# (wait_on_rate_limit_notify is a tweepy 3.x parameter)
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

In [6]:
# Creating empty list to store tweet data (json format)
tweet_json_list = []
# Creating empty dict to store the tweets that can't be accessed
errors_dict = {}
# Defining the start time to check how long it took to access the data
start_time = time.time()

for tweet_id in twitter_archive_df.tweet_id.values:
    try:
        # Adding tweet info to tweet_json_list
        tweet = api.get_status(tweet_id, tweet_mode = 'extended')
        tweet_json_list.append(tweet._json)
    except tweepy.TweepError as error:
        # Adding tweets that couldn't be accessed to the errors_dict
        errors_dict[tweet_id] = error
            
# Checking how much time was spent
elapsed_time = time.time() - start_time

# Printing elapsed time in HH:MM:SS format
hms_elapsed_time = time.strftime("%H:%M:%S", time.gmtime(elapsed_time))
print('\n Time elapsed (HH:MM:SS): ' + hms_elapsed_time)
print('-'*55)

# Printing each tweet_id and error in the errors_dict
for tweet_id in errors_dict:
    print('ID:', tweet_id, 'ERROR:', errors_dict[tweet_id])
print('-'*55)

# Checking number of errors
print('\n Number of errors: ' + str(len(errors_dict)))


 Time elapsed (HH:MM:SS): 00:40:27
-------------------------------------------------------
ID: 888202515573088257 ERROR: [{'code': 144, 'message': 'No status found with that ID.'}]
ID: 873697596434513921 ERROR: [{'code': 144, 'message': 'No status found with that ID.'}]
ID: 872668790621863937 ERROR: [{'code': 144, 'message': 'No status found with that ID.'}]
ID: 872261713294495745 ERROR: [{'code': 144, 'message': 'No status found with that ID.'}]
ID: 869988702071779329 ERROR: [{'code': 144, 'message': 'No status found with that ID.'}]
ID: 866816280283807744 ERROR: [{'code': 144, 'message': 'No status found with that ID.'}]
ID: 861769973181624320 ERROR: [{'code': 144, 'message': 'No status found with that ID.'}]
ID: 856602993587888130 ERROR: [{'code': 144, 'message': 'No status found with that ID.'}]
ID: 851953902622658560 ERROR: [{'code': 144, 'message': 'No status found with that ID.'}]
ID: 845459076796616705 ERROR: [{'code': 144, 'message': 'No status found with that ID.'}]
ID: 8447

In [7]:
# Writing tweet data to a .txt file (one JSON object per line)
with open('data/tweet-json.txt', 'w') as outfile:
    for tweet_json in tweet_json_list:
        json.dump(tweet_json, outfile)
        outfile.write('\n')

In [9]:
# Creating empty list that will be used to hold fav and rt count for each tweet
tweet_extra_data_list = []

# Reading .txt file 
with open('data/tweet-json.txt', 'r') as json_file:
    # Reading first line
    line = json_file.readline()
    
    # While there's a next line execute following code
    while line:
        # Select tweet id, fav and rt count
        tweet = json.loads(line)
        tweet_id = tweet['id']
        tweet_retweet_count = tweet['retweet_count']
        tweet_favorite_count = tweet['favorite_count']
        
        # Save selected data to a dict
        tweet_data = {'tweet_id': tweet_id, 
                      'retweet_count': tweet_retweet_count, 
                      'favorite_count': tweet_favorite_count,
                     }
        
        # Add tweet_data to tweet_extra_data_list
        tweet_extra_data_list.append(tweet_data)

        # Read next line
        line = json_file.readline()

tweet_extra_data_list[0]

{'tweet_id': 892420643555336193,
 'retweet_count': 7725,
 'favorite_count': 36276}
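
The `while`/`readline()` loop above can also be written by iterating over the file object directly, which yields one line at a time. The following is only a sketch of that alternative: `extract_counts` is a hypothetical helper name, and the two JSON lines are made-up sample data standing in for `data/tweet-json.txt`.

```python
import io
import json


def extract_counts(lines):
    """Return one dict per non-empty JSON line with the tweet id and its counts."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        records.append({
            'tweet_id': tweet['id'],
            'retweet_count': tweet['retweet_count'],
            'favorite_count': tweet['favorite_count'],
        })
    return records


# Made-up sample standing in for the real file (assumption, not project data)
sample_file = io.StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 20}\n'
    '{"id": 2, "retweet_count": 3, "favorite_count": 4}\n'
)
sample_records = extract_counts(sample_file)
print(sample_records[0])
```

With the real file, the same helper could be called as `extract_counts(json_file)` inside the `with open(...)` block.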

In [13]:
# Creating Pandas DataFrame with tweet_extra_data_list
tweet_extra_data_df = pd.DataFrame(tweet_extra_data_list, 
                                   columns = ['tweet_id',
                                              'retweet_count',
                                              'favorite_count'])

# Creating a csv file with the extra data so I don't have to access the Twitter API every time
tweet_extra_data_df.to_csv('data/tweet_extra_data.csv', index=False)


tweet_extra_data_df.head()

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,7725,36276
1,892177421306343426,5712,31281
2,891815181378084864,3786,23550
3,891689557279858688,7878,39567
4,891327558926688256,8496,37787
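
Since the CSV file exists so that the Twitter API does not have to be queried on every run, one possible way to use it is a small load-or-build helper. This is a sketch under assumptions: `load_or_build` and `build_sample` are hypothetical names, and the sample values and temporary directory only stand in for the real API download and the `data` folder.

```python
import os
import tempfile

import pandas as pd


def load_or_build(csv_path, build_fn):
    """Load the cached CSV if it exists; otherwise build the DataFrame and cache it."""
    if os.path.exists(csv_path):
        return pd.read_csv(csv_path)
    df = build_fn()
    df.to_csv(csv_path, index=False)
    return df


build_calls = []


def build_sample():
    """Stand-in for the slow API download; records each call."""
    build_calls.append(1)
    return pd.DataFrame({'tweet_id': [1, 2],
                         'retweet_count': [10, 3],
                         'favorite_count': [20, 4]})


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, 'tweet_extra_data.csv')
    first_df = load_or_build(cache_path, build_sample)   # builds and writes the file
    second_df = load_or_build(cache_path, build_sample)  # reads the cached file

print(len(build_calls))
```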


### Assess
The main objective of the Assess part is to better understand each piece of data and identify possible issues that must be cleaned.  
All identified issues will be listed after assessing the data and divided between quality and tidiness problems for each piece of data, to make the report easier to read.

#### WeRateDogs Twitter Archive

In [23]:
# Visualizing 5 random rows
twitter_archive_df.sample(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1215,715009755312439296,,,2016-03-30 02:56:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Siba. She's remarkably mobile. Very sl...,,,,https://twitter.com/dog_rates/status/715009755...,12,10,Siba,,,,
274,840698636975636481,8.406983e+17,8.405479e+17,2017-03-11 22:59:09 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@0_kelvin_0 &gt;10/10 is reserved for puppos s...,,,,,10,10,,,,,
498,813130366689148928,8.131273e+17,4196984000.0,2016-12-25 21:12:41 +0000,"<a href=""http://twitter.com/download/iphone"" r...",I've been informed by multiple sources that th...,,,,,12,10,,,,,
584,800141422401830912,,,2016-11-20 00:59:15 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Peaches. She's the ultimate selfie sid...,,,,https://twitter.com/dog_rates/status/800141422...,13,10,Peaches,,,,
2172,669327207240699904,,,2015-11-25 01:30:16 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Just got home from college. Dis my dog. She do...,,,,https://twitter.com/dog_rates/status/669327207...,13,10,,,,,


In [15]:
# Checking df number of rows and columns
twitter_archive_df.shape

(2356, 17)

In [21]:
# Checking df datatypes
twitter_archive_df.dtypes

tweet_id                        int64
in_reply_to_status_id         float64
in_reply_to_user_id           float64
timestamp                      object
source                         object
text                           object
retweeted_status_id           float64
retweeted_status_user_id      float64
retweeted_status_timestamp     object
expanded_urls                  object
rating_numerator                int64
rating_denominator              int64
name                           object
doggo                          object
floofer                        object
pupper                         object
puppo                          object
dtype: object

In [20]:
# Checking missing values
twitter_archive_df.isnull().sum()

tweet_id                         0
in_reply_to_status_id         2278
in_reply_to_user_id           2278
timestamp                        0
source                           0
text                             0
retweeted_status_id           2175
retweeted_status_user_id      2175
retweeted_status_timestamp    2175
expanded_urls                   59
rating_numerator                 0
rating_denominator               0
name                             0
doggo                            0
floofer                          0
pupper                           0
puppo                            0
dtype: int64

In [122]:
# I found it strange that there were None values in some columns and wanted to understand why
twitter_archive_df.query('name == "None"').head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
5,891087950875897856,,,2017-07-29 00:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a majestic great white breaching ...,,,,https://twitter.com/dog_rates/status/891087950...,13,10,,,,,
7,890729181411237888,,,2017-07-28 00:22:40 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you watch your owner call another dog a g...,,,,https://twitter.com/dog_rates/status/890729181...,13,10,,,,,
12,889665388333682689,,,2017-07-25 01:55:32 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here's a puppo that seems to be on the fence a...,,,,https://twitter.com/dog_rates/status/889665388...,13,10,,,,,puppo
24,887343217045368832,,,2017-07-18 16:08:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",You may not have known you needed to see this ...,,,,https://twitter.com/dog_rates/status/887343217...,13,10,,,,,
25,887101392804085760,,,2017-07-18 00:07:08 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This... is a Jubilant Antarctic House Bear. We...,,,,https://twitter.com/dog_rates/status/887101392...,12,10,,,,,
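
The query above matches rows because missing names are stored as the string `"None"`, which `isnull()` does not count as missing. A minimal sketch for counting these placeholders per column follows; `sample_df` is made-up data, not the archive itself.

```python
import pandas as pd

# Made-up rows that mimic the archive's "None" placeholders (assumption)
sample_df = pd.DataFrame({
    'name': ['Phineas', 'None', 'None'],
    'doggo': ['None', 'doggo', 'None'],
})

# isnull() sees no missing values because "None" is an ordinary string
null_counts = sample_df.isnull().sum()

# Comparing against the placeholder string counts them explicitly
placeholder_counts = (sample_df == 'None').sum()
print(placeholder_counts)
```

The same comparison could be applied to `twitter_archive_df[['name', 'doggo', 'floofer', 'pupper', 'puppo']]`.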


In [32]:
# Looking at the unique sources
twitter_archive_df.source.unique()

array(['<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
       '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
       '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
       '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>'],
      dtype=object)
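
Each source value wraps a readable label in an HTML anchor tag. One possible way to pull out just the label is a regular expression, sketched here with a hypothetical helper named `source_label` applied to the four values shown above.

```python
import re


def source_label(html):
    """Return the text inside the <a> tag, or the input unchanged if no tag is found."""
    match = re.search(r'>([^<]+)</a>', html)
    return match.group(1) if match else html


sources = [
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
    '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
    '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>',
]
labels = [source_label(source) for source in sources]
print(labels)
```

On the DataFrame, the helper could be applied with `twitter_archive_df.source.apply(source_label)`.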

In [108]:
# Trying to understand what this 'Vine - Make a Scene' value is
twitter_archive_df[twitter_archive_df.source == 
                   '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>'].head(1)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
529,808344865868283904,,,2016-12-12 16:16:49 +0000,"<a href=""http://vine.co"" rel=""nofollow"">Vine -...",This is Seamus. He's very bad at entering pool...,,,,https://vine.co/v/5QWd3LZqXxd,11,10,Seamus,,,,


In [68]:
# Selecting replies that have urls so I can look at the tweet
twitter_archive_df[twitter_archive_df.in_reply_to_status_id.notnull() &
                   twitter_archive_df.expanded_urls.notnull()].head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
149,863079547188785154,6.671522e+17,4196984000.0,2017-05-12 17:12:53 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Ladies and gentlemen... I found Pipsy. He may ...,,,,https://twitter.com/dog_rates/status/863079547...,14,10,,,,,
184,856526610513747968,8.558181e+17,4196984000.0,2017-04-24 15:13:52 +0000,"<a href=""http://twitter.com/download/iphone"" r...","THIS IS CHARLIE, MARK. HE DID JUST WANT TO SAY...",,,,https://twitter.com/dog_rates/status/856526610...,14,10,,,,,
251,844979544864018432,7.590995e+17,4196984000.0,2017-03-23 18:29:57 +0000,"<a href=""http://twitter.com/download/iphone"" r...",PUPDATE: I'm proud to announce that Toby is 23...,,,,https://twitter.com/dog_rates/status/844979544...,13,10,,,,,
565,802265048156610565,7.331095e+17,4196984000.0,2016-11-25 21:37:47 +0000,"<a href=""http://twitter.com/download/iphone"" r...","Like doggo, like pupper version 2. Both 11/10 ...",,,,https://twitter.com/dog_rates/status/802265048...,11,10,,doggo,,pupper,
1016,746906459439529985,7.468859e+17,4196984000.0,2016-06-26 03:22:31 +0000,"<a href=""http://twitter.com/download/iphone"" r...","PUPDATE: can't see any. Even if I could, I cou...",,,,https://twitter.com/dog_rates/status/746906459...,0,10,,,,,


In [118]:
# Selecting a tweet to look at
twitter_archive_df.iloc[251].expanded_urls.split(',')

['https://twitter.com/dog_rates/status/844979544864018432/photo/1',
 'https://twitter.com/dog_rates/status/844979544864018432/photo/1',
 'https://twitter.com/dog_rates/status/844979544864018432/photo/1']
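
The output above shows the same URL repeated three times within a single `expanded_urls` value. A minimal sketch for dropping the repeats while keeping the original order is shown below; `unique_urls` is a hypothetical helper name.

```python
def unique_urls(expanded_urls):
    """Split a comma-separated URL string and drop duplicates, keeping first-seen order."""
    return list(dict.fromkeys(expanded_urls.split(',')))


# The value printed above, joined back into one comma-separated string
sample_urls = ','.join(
    ['https://twitter.com/dog_rates/status/844979544864018432/photo/1'] * 3
)
deduplicated = unique_urls(sample_urls)
print(deduplicated)
```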

In [71]:
# Selecting retweets that have urls so I can look at the tweet
twitter_archive_df[twitter_archive_df.retweeted_status_id.notnull() &
                   twitter_archive_df.expanded_urls.notnull()].head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
19,888202515573088257,,,2017-07-21 01:02:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Canela. She attempted s...,8.87474e+17,4196984000.0,2017-07-19 00:47:34 +0000,https://twitter.com/dog_rates/status/887473957...,13,10,Canela,,,,
32,886054160059072513,,,2017-07-15 02:45:48 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @Athletics: 12/10 #BATP https://t.co/WxwJmv...,8.860537e+17,19607400.0,2017-07-15 02:44:07 +0000,https://twitter.com/dog_rates/status/886053434...,12,10,,,,,
36,885311592912609280,,,2017-07-13 01:35:06 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Lilly. She just paralle...,8.305833e+17,4196984000.0,2017-02-12 01:04:29 +0000,https://twitter.com/dog_rates/status/830583320...,13,10,Lilly,,,,
68,879130579576475649,,,2017-06-26 00:13:58 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Emmy. She was adopted t...,8.780576e+17,4196984000.0,2017-06-23 01:10:23 +0000,https://twitter.com/dog_rates/status/878057613...,14,10,Emmy,,,,
73,878404777348136964,,,2017-06-24 00:09:53 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: Meet Shadow. In an attempt to r...,8.782815e+17,4196984000.0,2017-06-23 16:00:04 +0000,"https://www.gofundme.com/3yd6y1c,https://twitt...",13,10,Shadow,,,,


In [115]:
# Selecting a tweet to look at
twitter_archive_df.iloc[32].expanded_urls.split(',')

['https://twitter.com/dog_rates/status/886053434075471873',
 'https://twitter.com/dog_rates/status/886053434075471873']

In [125]:
# Checking doggo values that aren't "None"
twitter_archive_df.query('doggo != "None"').head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
9,890240255349198849,,,2017-07-26 15:59:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Cassie. She is a college pup. Studying...,,,,https://twitter.com/dog_rates/status/890240255...,14,10,Cassie,doggo,,,
43,884162670584377345,,,2017-07-09 21:29:42 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Yogi. He doesn't have any important dog m...,,,,https://twitter.com/dog_rates/status/884162670...,12,10,Yogi,doggo,,,
99,872967104147763200,,,2017-06-09 00:02:31 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here's a very large dog. He has a date later. ...,,,,https://twitter.com/dog_rates/status/872967104...,12,10,,doggo,,,
108,871515927908634625,,,2017-06-04 23:56:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Napolean. He's a Raggedy East Nicaragu...,,,,https://twitter.com/dog_rates/status/871515927...,12,10,Napolean,doggo,,,
110,871102520638267392,,,2017-06-03 20:33:19 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Never doubt a doggo 14/10 https://t.co/AbBLh2FZCH,,,,https://twitter.com/animalcog/status/871075758...,14,10,,doggo,,,


#### Quality issues observed
`WeRateDogs Twitter Archive`:
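1. ID columns (`in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id`) are floats, not integers.
2. `timestamp` is a string, not a datetime object.
3. `source` contains HTML anchor tags and should be cleaned to make it more readable.
4. `retweeted_status_timestamp` is a string, not a datetime object.
5. `doggo`, `floofer`, `pupper` and `puppo` should be ints, not strings.
6. There are many NaN values, and missing values in `name` and the dog stage columns are stored as the string "None".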
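
As an illustration of how the two timestamp issues listed above might be addressed, here is a small sketch using `pd.to_datetime`. The two rows in `sample_df` are made-up stand-ins in the archive's timestamp format, not the cleaning code of this project.

```python
import pandas as pd

# Made-up rows in the archive's timestamp format (assumption)
sample_df = pd.DataFrame({
    'timestamp': ['2017-08-01 16:23:56 +0000', '2017-07-21 01:02:36 +0000'],
    'retweeted_status_timestamp': [None, '2017-07-19 00:47:34 +0000'],
})

cleaned_df = sample_df.copy()
for column in ['timestamp', 'retweeted_status_timestamp']:
    # Strings become timezone-aware datetimes; missing values become NaT
    cleaned_df[column] = pd.to_datetime(cleaned_df[column])

print(cleaned_df.dtypes)
```

Working on a copy keeps the original DataFrame available for comparison in a later test step.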

#### Tidiness issues observed
Write the tidiness issues observed

### Clean
Explain purpose of Clean

#### Define
Define how you will address each problem identified

#### Code
Solve the problems

#### Test
Test if your code worked

## Exploratory Data Analysis
Explain EDA

## Conclusion

Write conclusions