# Project: Wrangling and Analyze Data

In [5]:
# Import necessary python libraries.
import pandas as pd
import requests 
import os
import matplotlib.pyplot as plt

## Data Gathering
In the cell below, gather **all** three pieces of data for this project and load them in the notebook. **Note:** the methods required to gather each data are different.
1. Directly download the WeRateDogs Twitter archive data (twitter_archive_enhanced.csv)

In [6]:
# Read the twitter archive file
twitter_archive = pd.read_csv('data/twitter-archive-enhanced.csv')

In [7]:
twitter_archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


2. Use the Requests library to download the tweet image prediction (image_predictions.tsv)

In [8]:
# Save file to download folder
folder_name = 'data'

url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
with open(os.path.join(folder_name, url.split('/')[-1]), 'wb') as file:
    file.write(response.content)

ConnectionError: ('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))

In [9]:
# Read the image prediction file
image_prediction = pd.read_csv('data/image-predictions.tsv', sep='\t')
image_prediction.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)

In [10]:
import tweepy
from tweepy import OAuthHandler
import json
from timeit import default_timer as timer

# Query Twitter API for each tweet in the Twitter archive and save JSON in a text file
# These are hidden to comply with Twitter's API terms and conditions
consumer_key = 'HIDDEN'
consumer_secret = 'HIDDEN'
access_token = 'HIDDEN'
access_secret = 'HIDDEN'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True)

# NOTE TO STUDENT WITH MOBILE VERIFICATION ISSUES:
# df_1 is a DataFrame with the twitter_archive_enhanced.csv file. You may have to
# change line 17 to match the name of your DataFrame with twitter_archive_enhanced.csv
# NOTE TO REVIEWER: this student had mobile verification issues so the following
# Twitter API code was sent to this student from a Udacity instructor
# Tweet IDs for which to gather additional data via Twitter's API
tweet_ids = twitter_archive.tweet_id.values
len(tweet_ids)

# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
count = 0
fails_dict = {}
start = timer()
# Save each tweet's returned JSON as a new line in a .txt file
with open('tweet_json.txt', 'w') as outfile:
    # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
    for tweet_id in tweet_ids:
        count += 1
        print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            print("Success")
            json.dump(tweet._json, outfile)
            outfile.write('\n')
        except tweepy.TweepError as e:
            print("Fail")
            fails_dict[tweet_id] = e
            pass
end = timer()
print(end - start)
print(fails_dict)

1: 892420643555336193


AttributeError: module 'tweepy' has no attribute 'TweepError'

In [12]:
# Create a dataframe from the tweet-json file
folder_name = 'data'
df_list = []

all_tweet = [json.loads(line) for line in open(os.path.join(folder_name, 'tweet-json.txt'))]
for tweet in all_tweet:
    tweet_id = tweet['id']
    text = tweet['full_text']
    only_url = text[text.find('https'):] 
    retweet_count = tweet['retweet_count']
    favorite_count = tweet['favorite_count']
    followers_count = tweet['user']['followers_count']
    friends_count = tweet['user']['friends_count']
    whole_source = tweet['source']
    source=whole_source[whole_source.find('rel="nofollow">') + 15:-4]
    retweeted = tweet.get('retweeted', 'This is a retweet')
    if retweeted == False:
        retweeted_status = 'Original tweet'
    else:
        retweeted_status = retweeted
    

    df_list.append({'tweet_id': tweet_id,
                    'url': only_url,
                    'retweet_count': retweet_count,
                    'favorite_count': favorite_count,
                    'followers_count': followers_count,
                    'friends_count': friends_count,
                    'source': source,
                    'retweeted_status': retweeted_status})
        
tweet_json = pd.DataFrame(df_list, columns = ['tweet_id', 'retweet_count', 'favorite_count', 'followers_count',
                                              'friends_count', 'source', 'retweeted_status', 'url'])

## Assessing Data

In [13]:
# increasing the column width so that the whole text in the 'text' column is visible
pd.set_option('display.max_colwidth', None)

* ##### `Visual assessment`: 
Each piece of gathered data is displayed for visual assessment purposes.

In [14]:
twitter_archive

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU,,,,https://twitter.com/dog_rates/status/892420643555336193/photo/1,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV",,,,https://twitter.com/dog_rates/status/892177421306343426/photo/1,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB,,,,https://twitter.com/dog_rates/status/891815181378084864/photo/1,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ,,,,https://twitter.com/dog_rates/status/891689557279858688/photo/1,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Franklin. He would like you to stop calling him ""cute."" He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f",,,,"https://twitter.com/dog_rates/status/891327558926688256/photo/1,https://twitter.com/dog_rates/status/891327558926688256/photo/1",12,10,Franklin,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2351,666049248165822465,,,2015-11-16 00:24:50 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here we have a 1949 1st generation vulpix. Enjoys sweat tea and Fox News. Cannot be phased. 5/10 https://t.co/4B7cOc1EDq,,,,https://twitter.com/dog_rates/status/666049248165822465/photo/1,5,10,,,,,
2352,666044226329800704,,,2015-11-16 00:04:52 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is a purebred Piers Morgan. Loves to Netflix and chill. Always looks like he forgot to unplug the iron. 6/10 https://t.co/DWnyCjf2mx,,,,https://twitter.com/dog_rates/status/666044226329800704/photo/1,6,10,a,,,,
2353,666033412701032449,,,2015-11-15 23:21:54 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here is a very happy pup. Big fan of well-maintained decks. Just look at that tongue. 9/10 would cuddle af https://t.co/y671yMhoiR,,,,https://twitter.com/dog_rates/status/666033412701032449/photo/1,9,10,a,,,,
2354,666029285002620928,,,2015-11-15 23:05:30 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is a western brown Mitsubishi terrier. Upset about leaf. Actually 2 dogs here. 7/10 would walk the shit out of https://t.co/r7mOb2m0UI,,,,https://twitter.com/dog_rates/status/666029285002620928/photo/1,7,10,a,,,,


In [15]:
image_prediction

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.072010,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
...,...,...,...,...,...,...,...,...,...,...,...,...
2070,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,basset,0.555712,True,English_springer,0.225770,True,German_short-haired_pointer,0.175219,True
2071,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,paper_towel,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False
2072,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,Chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True
2073,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,Chihuahua,0.323581,True,Pekinese,0.090647,True,papillon,0.068957,True


In [16]:
tweet_json

Unnamed: 0,tweet_id,retweet_count,favorite_count,followers_count,friends_count,source,retweeted_status,url
0,892420643555336193,8853,39467,3200889,104,Twitter for iPhone,Original tweet,https://t.co/MgUWQ76dJU
1,892177421306343426,6514,33819,3200889,104,Twitter for iPhone,Original tweet,https://t.co/0Xxu71qeIV
2,891815181378084864,4328,25461,3200889,104,Twitter for iPhone,Original tweet,https://t.co/wUnZnhtVJB
3,891689557279858688,8964,42908,3200889,104,Twitter for iPhone,Original tweet,https://t.co/tD36da7qLQ
4,891327558926688256,9774,41048,3200889,104,Twitter for iPhone,Original tweet,https://t.co/AtUZn91f7f
...,...,...,...,...,...,...,...,...
2349,666049248165822465,41,111,3201018,104,Twitter for iPhone,Original tweet,https://t.co/4B7cOc1EDq
2350,666044226329800704,147,311,3201018,104,Twitter for iPhone,Original tweet,https://t.co/DWnyCjf2mx
2351,666033412701032449,47,128,3201018,104,Twitter for iPhone,Original tweet,https://t.co/y671yMhoiR
2352,666029285002620928,48,132,3201018,104,Twitter for iPhone,Original tweet,https://t.co/r7mOb2m0UI


* #### `Programmatic assessment`: 
Pandas' functions and/or methods are used to assess the data.

In [17]:
twitter_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [18]:
image_prediction.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [19]:
tweet_json.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   tweet_id          2354 non-null   int64 
 1   retweet_count     2354 non-null   int64 
 2   favorite_count    2354 non-null   int64 
 3   followers_count   2354 non-null   int64 
 4   friends_count     2354 non-null   int64 
 5   source            2354 non-null   object
 6   retweeted_status  2354 non-null   object
 7   url               2354 non-null   object
dtypes: int64(5), object(3)
memory usage: 147.2+ KB


### Twitter Archive Assessment

In [20]:
# The value count of rating numerator
twitter_archive.rating_numerator.value_counts()

12      558
11      464
10      461
13      351
9       158
8       102
7        55
14       54
5        37
6        32
3        19
4        17
1         9
2         9
420       2
0         2
15        2
75        2
80        1
20        1
24        1
26        1
44        1
50        1
60        1
165       1
84        1
88        1
144       1
182       1
143       1
666       1
960       1
1776      1
17        1
27        1
45        1
99        1
121       1
204       1
Name: rating_numerator, dtype: int64

In [21]:
# Print the text of rating numerator above 100.
de
print(twitter_archive.loc[twitter_archive.rating_numerator == 420, 'text'])
print(twitter_archive.loc[twitter_archive.rating_numerator == 165, 'text'])
print(twitter_archive.loc[twitter_archive.rating_numerator == 144, 'text'])
print(twitter_archive.loc[twitter_archive.rating_numerator == 182, 'text'])
print(twitter_archive.loc[twitter_archive.rating_numerator == 143, 'text']) 
print(twitter_archive.loc[twitter_archive.rating_numerator == 666, 'text']) 
print(twitter_archive.loc[twitter_archive.rating_numerator == 960, 'text']) 
print(twitter_archive.loc[twitter_archive.rating_numerator == 1776, 'text'])
print(twitter_archive.loc[twitter_archive.rating_numerator == 121, 'text'])
print(twitter_archive.loc[twitter_archive.rating_numerator == 204, 'text'])
print(twitter_archive.loc[twitter_archive.rating_numerator == 0, 'text'])

188     @dhmontgomery We also gave snoop dogg a 420/10 but I think that predated your research
2074       After so many requests... here you go.\n\nGood dogg. 420/10 https://t.co/yfAAo1gdeY
Name: text, dtype: object
902    Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE
Name: text, dtype: object
1779    IT'S PUPPERGEDDON. Total of 144/120 ...I think https://t.co/ZanVtAtvIq
Name: text, dtype: object
290    @markhoppus 182/10
Name: text, dtype: object
1634    Two sneaky puppers were not initially seen, moving the rating to 143/130. Please forgive us. Thank you https://t.co/kRK51Y5ac3
Name: text, dtype: object
189    @s8n You tried very hard to portray this good boy as not so good, but you have ultimately failed. His goodness shines through. 666/10
Name: text, dtype: object
313    @jonnysun @Lin_Manuel ok jomny I know you're excited but 960/00 isn't a valid rating, 13/10 is tho
Name: text, dtype: object
979    This is Atticus. He's quite simply America af. 1

In [None]:
text_index_list = [188, 189, 290, 313, 902, 1779, 1634, 979, 1635, 1120, 315, 1016]

full_text = lambda i: twitter_archive['text'][i]

for i in text_index_list:
    print(full_text(i))

In [42]:
# Print the whole text to verify numerators
# no picture, this will be ignored when cleaning data
print(twitter_archive['text'][188])
print(twitter_archive['text'][189])
print(twitter_archive['text'][290])

# just a tweet to explain actual ratings, this will be ignored when cleaning
print(twitter_archive['text'][313])
print(twitter_archive['text'][902])
print(twitter_archive['text'][1779])
print(twitter_archive['text'][1634])
print(twitter_archive['text'][979])
print(twitter_archive['text'][1635])
print(twitter_archive['text'][1120])
print(twitter_archive['text'][315])
print(twitter_archive['text'][1016])

@dhmontgomery We also gave snoop dogg a 420/10 but I think that predated your research
@s8n You tried very hard to portray this good boy as not so good, but you have ultimately failed. His goodness shines through. 666/10
@markhoppus 182/10
@jonnysun @Lin_Manuel ok jomny I know you're excited but 960/00 isn't a valid rating, 13/10 is tho
Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE
IT'S PUPPERGEDDON. Total of 144/120 ...I think https://t.co/ZanVtAtvIq
Two sneaky puppers were not initially seen, moving the rating to 143/130. Please forgive us. Thank you https://t.co/kRK51Y5ac3
This is Atticus. He's quite simply America af. 1776/10 https://t.co/GRXwMxLBkh
Someone help the girl is being mugged. Several are distracting her while two steal her shoes. Clever puppers 121/110 https://t.co/1zfnTJLt55
Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv
When you're so blinded by your systematic 

In [23]:
twitter_archive.rating_denominator.value_counts()

10     2333
11        3
50        3
80        2
20        2
2         1
16        1
40        1
70        1
15        1
90        1
110       1
120       1
130       1
150       1
170       1
7         1
0         1
Name: rating_denominator, dtype: int64

In [24]:
print(twitter_archive.loc[twitter_archive.rating_denominator == 110, 'text']) 
print(twitter_archive.loc[twitter_archive.rating_denominator == 120, 'text']) 
print(twitter_archive.loc[twitter_archive.rating_denominator == 130, 'text']) 
print(twitter_archive.loc[twitter_archive.rating_denominator == 150, 'text'])
print(twitter_archive.loc[twitter_archive.rating_denominator == 170, 'text'])
print(twitter_archive.loc[twitter_archive.rating_denominator == 0, 'text'])

1635    Someone help the girl is being mugged. Several are distracting her while two steal her shoes. Clever puppers 121/110 https://t.co/1zfnTJLt55
Name: text, dtype: object
1779    IT'S PUPPERGEDDON. Total of 144/120 ...I think https://t.co/ZanVtAtvIq
Name: text, dtype: object
1634    Two sneaky puppers were not initially seen, moving the rating to 143/130. Please forgive us. Thank you https://t.co/kRK51Y5ac3
Name: text, dtype: object
902    Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE
Name: text, dtype: object
1120    Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv
Name: text, dtype: object
313    @jonnysun @Lin_Manuel ok jomny I know you're excited but 960/00 isn't a valid rating, 13/10 is tho
Name: text, dtype: object


In [25]:
# Print the whole text to verify denominators

#retweet - it will be deleted when delete all retweets
print(twitter_archive['text'][1635]) 
#actual rating 14/10 need to change manually
print(twitter_archive['text'][1779]) 
#actual rating 10/10 need to change manually
print(twitter_archive['text'][1634]) 
#actual rating 9/10 need to change manually
print(twitter_archive['text'][902]) 
#tweet to explain rating
print(twitter_archive['text'][1120]) 
# this tweet of 0 denominator will be neglected
print(twitter_archive['text'][313]) 

Someone help the girl is being mugged. Several are distracting her while two steal her shoes. Clever puppers 121/110 https://t.co/1zfnTJLt55
IT'S PUPPERGEDDON. Total of 144/120 ...I think https://t.co/ZanVtAtvIq
Two sneaky puppers were not initially seen, moving the rating to 143/130. Please forgive us. Thank you https://t.co/kRK51Y5ac3
Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE
Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv
@jonnysun @Lin_Manuel ok jomny I know you're excited but 960/00 isn't a valid rating, 13/10 is tho


In [26]:
twitter_archive.name.value_counts()

None        745
a            55
Charlie      12
Lucy         11
Cooper       11
           ... 
Boots         1
Fletcher      1
Sky           1
Derby         1
Jordy         1
Name: name, Length: 957, dtype: int64

In [27]:
# Check for duplicate
twitter_archive[twitter_archive.tweet_id.duplicated()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


### Image prediction Assessment

In [28]:
image_prediction

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.072010,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
...,...,...,...,...,...,...,...,...,...,...,...,...
2070,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,basset,0.555712,True,English_springer,0.225770,True,German_short-haired_pointer,0.175219,True
2071,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,paper_towel,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False
2072,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,Chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True
2073,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,Chihuahua,0.323581,True,Pekinese,0.090647,True,papillon,0.068957,True


In [29]:
image_prediction.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [30]:
# Check for duplicate
image_prediction[image_prediction.tweet_id.duplicated()]

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog


### Tweet json Assessment

In [31]:
tweet_json

Unnamed: 0,tweet_id,retweet_count,favorite_count,followers_count,friends_count,source,retweeted_status,url
0,892420643555336193,8853,39467,3200889,104,Twitter for iPhone,Original tweet,https://t.co/MgUWQ76dJU
1,892177421306343426,6514,33819,3200889,104,Twitter for iPhone,Original tweet,https://t.co/0Xxu71qeIV
2,891815181378084864,4328,25461,3200889,104,Twitter for iPhone,Original tweet,https://t.co/wUnZnhtVJB
3,891689557279858688,8964,42908,3200889,104,Twitter for iPhone,Original tweet,https://t.co/tD36da7qLQ
4,891327558926688256,9774,41048,3200889,104,Twitter for iPhone,Original tweet,https://t.co/AtUZn91f7f
...,...,...,...,...,...,...,...,...
2349,666049248165822465,41,111,3201018,104,Twitter for iPhone,Original tweet,https://t.co/4B7cOc1EDq
2350,666044226329800704,147,311,3201018,104,Twitter for iPhone,Original tweet,https://t.co/DWnyCjf2mx
2351,666033412701032449,47,128,3201018,104,Twitter for iPhone,Original tweet,https://t.co/y671yMhoiR
2352,666029285002620928,48,132,3201018,104,Twitter for iPhone,Original tweet,https://t.co/r7mOb2m0UI


In [32]:
tweet_json.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   tweet_id          2354 non-null   int64 
 1   retweet_count     2354 non-null   int64 
 2   favorite_count    2354 non-null   int64 
 3   followers_count   2354 non-null   int64 
 4   friends_count     2354 non-null   int64 
 5   source            2354 non-null   object
 6   retweeted_status  2354 non-null   object
 7   url               2354 non-null   object
dtypes: int64(5), object(3)
memory usage: 147.2+ KB


In [33]:
# Check for duplicate
tweet_json[tweet_json.tweet_id.duplicated()]

Unnamed: 0,tweet_id,retweet_count,favorite_count,followers_count,friends_count,source,retweeted_status,url


### Quality issues

1. Timestamp is not in correct datetime format

2. Drop unnecessary columns (in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp, expanded_urls)

3. Erroneous datatype tweet_id for tweet (combined twitter_archive, tweet_json)

4. Source column should be categorical datatype

5. Wrong name like `a` and `None` (naming issues)

6. Erroneous datatype tweet_id for image_prediction

7. p1_conf, p2_conf and p3_conf is decimal in image prediction table

8. Some name in the p1, p2, p3 columns are capitalize

### Tidiness issues
1. Merge the `twitter archive`, and `tweet json` dataframe

2. Two columns in `twitter archive` table (rating_numerator and rating_denominator) combine to one (ratings)

3. Two variable in the timestamp column (date and time).

4. Create new dog type column with doggo,floofer,pupper,puppo as its values

## Cleaning Data


In [38]:
# Make copies of original pieces of data
twitter_archive_clean = twitter_archive.copy()
image_prediction_clean = image_prediction.copy()
tweet_json_clean = tweet_json.copy()

### Tidiness

### Issue #1: 
Merge the `twitter archive`, and `tweet json` dataframe

#### Define:
Merge the twitter archive and tweet json to a single dataframe

#### Code

In [None]:
tweet_clean = pd.merge(twitter_archive_clean, tweet_json_clean, on='tweet_id', how='inner')

#### Test

In [None]:
tweet_clean.sample(3)

In [None]:
tweet_clean.columns

### Issue #2: 
Two columns in `twitter archive` table (rating_numerator and rating_denominator) combine to one (ratings)

#### Define:
create a new rating column from rating_numerator and rating_denominator, and drop the rating_numerator and rating_denominator

#### Code

In [None]:
tweet_clean['rating'] =  (tweet_clean.rating_numerator / tweet_clean.rating_denominator)

In [None]:
tweet_clean = tweet_clean.drop(['rating_numerator', 'rating_denominator'], axis=1)

#### Test

In [None]:
tweet_clean.sample(3)

### Issue #3: 
Two variable in the timestamp column (date and time).


#### Define:
Extract the date from the timestamp column, and drop the timestamp column.

#### Code

In [None]:
tweet_clean['date'] = pd.to_datetime(tweet_clean['timestamp']).dt.date

In [None]:
tweet_clean = tweet_clean.drop('timestamp', axis=1)

#### Test

In [None]:
tweet_clean.info()

### Issue #4:
Combine four columns (doggo,floofer,pupper,puppo) into one (dog type) `twitter archive`

#### Define:
Melt the doggo, floofer, pupper, and puppo columns to a dog type column and drop

#### Code

In [None]:
# Create dog type column.
tweet_clean['dog_type'] = tweet_clean.text.str.extract('(doggo|floofer|pupper|puppo)')

In [None]:
# Drop the doggo, floofer, pupper, and puppo columns.
tweet_clean = tweet_clean.drop(['doggo', 'floofer', 'pupper', 'puppo'], axis=1)

#### Test

In [None]:
tweet_clean.sample(9)

In [None]:
tweet_clean.dog_type.value_counts()

In [None]:
tweet_clean.info()

### Quality

### Issue #1:
Timestamp(date) is not in correct datetime format

#### Define:
Convert the extracted date columns to datetime format.

#### Code

In [None]:
tweet_clean.date = pd.to_datetime(tweet_clean.date)

#### Test

In [None]:
tweet_clean.info()

### Issue #2:
Drop unnecessary columns (in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp, expanded_urls, source_x)

#### Define:
Drop the in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp, expanded_urls, source_x columns and rename the other source_y to source.

#### Code

In [None]:
tweet_clean = tweet_clean.drop(['in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id',
                                'retweeted_status_user_id', 'retweeted_status_timestamp', 'expanded_urls', 
                                'source_x'], axis=1)

In [None]:
# Rename the source_y to column.
tweet_clean = tweet_clean.rename(columns={'source_y': 'source'})

#### Test

In [None]:
tweet_clean.columns

### Issue #3:
Erroneous datatype tweet_id for tweet (combined twitter_archive, tweet_json)

#### Define:
Change tweet_id to object for the tweet table(combined twitter archive and tweet json)

#### Code

In [None]:
# Change tweet table tweet id to object
tweet_clean.tweet_id = tweet_clean.tweet_id.astype(object)

#### Test

In [None]:
tweet_clean.info()

### Issue #4:
Source column should be categorical datatype

#### Define:
Convert source column to categorical datatype.

#### Code

In [None]:
tweet_clean.source = tweet_clean.source.astype('category')

#### Test

In [None]:
tweet_clean.info()

### Issue #5:
wrong name like `a` and `None` (naming issues)

#### Define:
Drop names like `a` and `none`.

#### Code

In [None]:
# Drop rows that have `a` in the name column
tweet_clean.name = tweet_clean.name[tweet_clean.name != 'a']

In [None]:
# Drop rows that have `None` in the name column
tweet_clean.name = tweet_clean.name[tweet_clean.name != 'None']

#### Test

In [None]:
tweet_clean.name.value_counts()

### Issue #6:
Erroneous datatype tweet_id for image_prediction

#### Define:
Change tweet_id to object for image prediction

#### Code

In [None]:
# Change image prediction table tweet id to object
image_prediction_clean.tweet_id = image_prediction_clean.tweet_id.astype(object)

#### Test

In [None]:
image_prediction_clean.info()

### Issue #7:
p1_conf, p2_conf and p3_conf is decimal in image prediction table

#### Define:
Change the p1_conf, p2_conf and p3_conf columns to percentage 

#### Code

In [None]:
# Using applu, multiplying 100 to each rows in the columns.
image_prediction_clean.p1_conf = image_prediction_clean.p1_conf.apply(lambda x: round(x * 100, 2))
image_prediction_clean.p2_conf = image_prediction_clean.p2_conf.apply(lambda x: round(x * 100, 2))
image_prediction_clean.p3_conf = image_prediction_clean.p3_conf.apply(lambda x: round(x * 100, 2))

#### Test

In [None]:
image_prediction_clean.sample(3)

### Issue #8:
Some name in the p1, p2, p3 columns are capitalize

#### Define:
Change all name first letter in the p1, p2, p3 columns to capital

#### Code

In [None]:
# Capitalize using the str.title method.
image_prediction_clean.p1 = image_prediction_clean.p1.str.title()
image_prediction_clean.p2 = image_prediction_clean.p2.str.title()
image_prediction_clean.p3 = image_prediction_clean.p3.str.title()

#### Test

In [None]:
image_prediction.sample(3)

## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".

In [None]:
tweet_clean.to_csv('data/twitter_archive_master.csv', index=False)
image_prediction_clean.to_csv('data/image_prediction_master.csv', index=False)

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

In [None]:
# read the cleaned twitter archive files for analysis.
twitter = pd.read_csv('data/twitter_archive_master.csv')
image_prediction = pd.read_csv('data/image_prediction_master.csv')

In [None]:
twitter.head()

In [None]:
image_prediction.head()

### Insights:
1. Most popular dog name

In [None]:
twitter.name.value_counts()

The top five dog names are Charlie, Cooper, Oliver, Lucy, and Lola respectively

2. Descriptive statistic of dog type favourite count

In [None]:
twitter.groupby('dog_type').favorite_count.describe()

The table displays all the descriptive statistics for favorite count for the dog type category. From this, the puppo were the most popular dogs. The mean and max count values are the highest.

3. Top tweet source

In [None]:
twitter.source.value_counts()

From the table, majority of the tweet are from Iphone users.

### Visualization

In [None]:
# creating the bar plot
pd.value_counts(twitter.dog_type).plot.bar(rot=0, figsize=(6,8))
plt.xlabel("Dog types")
plt.ylabel("Counts of dog type")
plt.title("Most common Dog type")
plt.show();

Most popular dog type is Pupper