# Project : Data analysis on WeRateDogs twitter page
## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#gather">Gathering the data</a></li>
<li><a href="#assess">Assessing the Data</a></li>
<li><a href="#clean"> Cleaning the Date</a> </li> 
<li><a href="#storing"> Storing the data</a></li>     
</ul>

<a id='intro'> </a>
## Introduction
<p> The twitter account of WeRateDogs rates the dogs with a unique rating system with a humourous comment on the dog.
    Here data wragling will be performed for this twitter account which would help in bringing up interesting insights and visulaizations </p>

### Gathering
<a id='gather'> </a>
Here the data is gathered from three different sources:
1. Twitter archive-contains basic data on the twitter account
2. Through a webpage- contains image based predictions on the dogs 
3. Twitter API- Contains the retweet and favourite count information that would link upto twitter archive. 

In [1]:
import requests #download data
import numpy as np #array functions
import pandas as pd #data handling
import tweepy #twitter api
import json #handle json data
import matplotlib.pyplot as plt #data visualization
import seaborn as sns #data visualization
import re #text processing
%matplotlib inline

1a. <b> Downloaded the file from Udacity and loading the WeRateDogs data here using pandas<b>

In [2]:
#reading twitter csv file
df_archive=pd.read_csv('twitter-archive-enhanced.csv')
df_archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


<b> 1b. Getting the image prediction data from the url provided <b>

In [3]:
url= "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"
response=requests.get(url)

with open('reponse.tsv',mode='wb') as file:
    file.write(response.content)

In [4]:
df_pred=pd.read_csv('reponse.tsv',sep='\t')
df_pred.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


<b> 1c. Getting the twitter data from the API <b>

In [5]:
import tweepy

consumer_key="mykey"
consumer_secret="mykey"
access_token="mykey"
access_token_secret="mykey"

def connect_to_twitter_OAuth():
 auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
 auth.set_access_token(access_token, access_token_secret) 
 api = tweepy.API(auth)
 return api
 


api = connect_to_twitter_OAuth()


In [6]:
#in order to check the api
public_tweets = api.home_timeline()
for tweet in public_tweets:
    print(tweet.text)

RT @RainyTeaWrites: shelter dogs are the best dogs https://t.co/sUF6c7u8nT
I hope you're h*ckin ready... we're featuring adoptable dogs all day tomorrow https://t.co/YXI4v8M63w
RT @MicahR_: Ash’s moment as a calendar model for @dog_rates is now framed for posterity. 

He got an extra treat for posing for this photo…
RT @__betzaida: huskies in texas, it’s your time to shine let me see your puppies
This is Doc. He wants to be a beekeeper. Definitely didn’t get stuck going through the trash. 13/10 dream big buddy https://t.co/KHWsFpADVj
CALLING ALL NEW PET PARENTS!
@Trupanion is hosting an educational webinar Thursday afternoon to help you handle tha… https://t.co/fQaTidu1VI
This is Stan. He’s very serious about you having a good day. Any pawblems just send them his way. He’ll take care o… https://t.co/FV0VgAZkCw
This is Lola. She was recently diagnosed with bone cancer in her leg. Thankfully, it hadn’t spread far and an amput… https://t.co/3mQ0IFynEe
This is Shanel. She has lymphoma canc

In [27]:
#Extracting the data from the API from the df_archive tweet_ids using the api.get_status command and 
#appending it to retweet_count_and_favorite_count by 
#loading it as JSON and storing it in tweet_json.txt file
#columns are selected based on the info on 
#https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/tweet
'''page_no_exist = []
tweets = []

with open('tweet_json.txt', mode="w") as file:
    for i in list(df_archive['tweet_id']):
        try:
            tweet = api.get_status(str(i))
            file.write(json.dumps(tweet._json))
            tweets.append({
                "tweet_id" : str(i),
                "retweet_count" : tweet._json['retweet_count'],
                "favorite_count" : tweet._json['favorite_count']
            })
        except:
            page_no_exist.append(i)'''

In [28]:
#number of tweet ids that were found via API
#len(tweets), len(page_no_exist)

(877, 1479)

<b> As it can see from above cell, the tweets extracted were 877 whereas the other tweets were deleted/missing,hence using the 
tweet._json.txt provided by udacity, However I understood on how to collecy data using tweepy </b>

In [35]:
with open('tweet_json.txt')as f:
    for line in f: 
            print(line)
            status=json.loads(line)
            tweet_id=status['id_str']
            retweet_count=status['retweet_count']
            favorite_count=status['favorite_count']
            retweeted=status=['retweeted']
            df_api=df_api.append(pd.DataFrame([[tweet_id,retweet_count,favorite_count,retweeted]],columns=['tweet_id','favorite_count','retweet_count','retweeted']))
       

{"created_at": "Tue Aug 01 16:23:56 +0000 2017", "id": 892420643555336193, "id_str": "892420643555336193", "full_text": "This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU", "truncated": false, "display_text_range": [0, 85], "entities": {"hashtags": [], "symbols": [], "user_mentions": [], "urls": [], "media": [{"id": 892420639486877696, "id_str": "892420639486877696", "indices": [86, 109], "media_url": "http://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg", "media_url_https": "https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg", "url": "https://t.co/MgUWQ76dJU", "display_url": "pic.twitter.com/MgUWQ76dJU", "expanded_url": "https://twitter.com/dog_rates/status/892420643555336193/photo/1", "type": "photo", "sizes": {"large": {"w": 540, "h": 528, "resize": "fit"}, "thumb": {"w": 150, "h": 150, "resize": "crop"}, "small": {"w": 540, "h": 528, "resize": "fit"}, "medium": {"w": 540, "h": 528, "resize": "fit"}}}]}, "extended_entities": {"media

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  sort=sort)


{"created_at": "Sun Jun 25 18:56:45 +0000 2017", "id": 879050749262655488, "id_str": "879050749262655488", "full_text": "This is Steven. He has trouble relating to other dogs. Quite shy. Neck longer than average. Tropical probably. 11/10 would still pet https://t.co/2mJCDEJWdD", "truncated": false, "display_text_range": [0, 132], "entities": {"hashtags": [], "symbols": [], "user_mentions": [], "urls": [], "media": [{"id": 879050744279834628, "id_str": "879050744279834628", "indices": [133, 156], "media_url": "http://pbs.twimg.com/media/DDMD_phXoAQ1qf0.jpg", "media_url_https": "https://pbs.twimg.com/media/DDMD_phXoAQ1qf0.jpg", "url": "https://t.co/2mJCDEJWdD", "display_url": "pic.twitter.com/2mJCDEJWdD", "expanded_url": "https://twitter.com/dog_rates/status/879050749262655488/photo/1", "type": "photo", "sizes": {"medium": {"w": 674, "h": 1200, "resize": "fit"}, "thumb": {"w": 150, "h": 150, "resize": "crop"}, "small": {"w": 382, "h": 680, "resize": "fit"}, "large": {"w": 899, "h": 1600,

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



{"created_at": "Thu Jun 30 02:45:28 +0000 2016", "id": 748346686624440324, "id_str": "748346686624440324", "full_text": "\"So... we meat again\" (I'm so sorry for that pun I couldn't resist pls don't unfollow) 10/10 https://t.co/XFBrrqapZa", "truncated": false, "display_text_range": [0, 92], "entities": {"hashtags": [], "symbols": [], "user_mentions": [], "urls": [], "media": [{"id": 748346678575497217, "id_str": "748346678575497217", "indices": [93, 116], "media_url": "http://pbs.twimg.com/media/CmKpVtlWAAEnyHm.jpg", "media_url_https": "https://pbs.twimg.com/media/CmKpVtlWAAEnyHm.jpg", "url": "https://t.co/XFBrrqapZa", "display_url": "pic.twitter.com/XFBrrqapZa", "expanded_url": "https://twitter.com/dog_rates/status/748346686624440324/photo/1", "type": "photo", "sizes": {"large": {"w": 1024, "h": 768, "resize": "fit"}, "small": {"w": 680, "h": 510, "resize": "fit"}, "thumb": {"w": 150, "h": 150, "resize": "crop"}, "medium": {"w": 1024, "h": 768, "resize": "fit"}}}]}, "extended_entitie

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)




{"created_at": "Sun Nov 29 03:20:54 +0000 2015", "id": 670804601705242624, "id_str": "670804601705242624", "full_text": "Meet Mason. He's a total frat boy. Pretends to be Hawaiian. Head is unbelievably round. 10/10 would pet so damn well https://t.co/DM3ZP3AA7b", "truncated": false, "display_text_range": [0, 140], "entities": {"hashtags": [], "symbols": [], "user_mentions": [], "urls": [], "media": [{"id": 670804590565150720, "id_str": "670804590565150720", "indices": [117, 140], "media_url": "http://pbs.twimg.com/media/CU8tOJZWUAAlNoF.jpg", "media_url_https": "https://pbs.twimg.com/media/CU8tOJZWUAAlNoF.jpg", "url": "https://t.co/DM3ZP3AA7b", "display_url": "pic.twitter.com/DM3ZP3AA7b", "expanded_url": "https://twitter.com/dog_rates/status/670804601705242624/photo/1", "type": "photo", "sizes": {"medium": {"w": 600, "h": 800, "resize": "fit"}, "large": {"w": 768, "h": 1024, "resize": "fit"}, "thumb": {"w": 150, "h": 150, "resize": "crop"}, "small": {"w": 340, "h": 453, "resize": "fit"

In [38]:
#checking on the data received
df_api.head()

Unnamed: 0,favorite_count,retweet_count,retweeted,tweet_id
0,8853,39467,[retweeted],892420643555336193
0,6514,33819,[retweeted],892177421306343426
0,4328,25461,[retweeted],891815181378084864
0,8964,42908,[retweeted],891689557279858688
0,9774,41048,[retweeted],891327558926688256


In [39]:
##alternative approach,wherein the data is retrieved from api.get_status.converted into json and written into tweet_json.txt2 file
'''

#tweets that can be found
list_of_tweets = []
#Tweets that can't be found are saved in the list below:
cant_find_tweets_for_those_ids = []

#getting the details of all tweet ids
for tweet_id in df_archive['tweet_id']:   
    try:
        list_of_tweets.append(api.get_status(tweet_id))
    except Exception as e:
        cant_find_tweets_for_those_ids.append(tweet_id) 
        
#converting into json        
my_list_of_dicts = []
for each_json_tweet in list_of_tweets:
    my_list_of_dicts.append(each_json_tweet._json)
    
#We write this list into a txt file
with open('tweet_json.txt2', 'w') as file:
        file.write(json.dumps(my_list_of_dicts, indent=4)) '''

"\n\n#tweets that can be found\nlist_of_tweets = []\n#Tweets that can't be found are saved in the list below:\ncant_find_tweets_for_those_ids = []\n\n#getting the details of all tweet ids\nfor tweet_id in df_archive['tweet_id']:   \n    try:\n        list_of_tweets.append(api.get_status(tweet_id))\n    except Exception as e:\n        cant_find_tweets_for_those_ids.append(tweet_id) \n        \n#converting into json        \nmy_list_of_dicts = []\nfor each_json_tweet in list_of_tweets:\n    my_list_of_dicts.append(each_json_tweet._json)\n    \n#We write this list into a txt file\nwith open('tweet_json.txt2', 'w') as file:\n        file.write(json.dumps(my_list_of_dicts, indent=4)) "

In [40]:
#len(my_list_of_dicts),len(cant_find_tweets_for_those_ids)

In [41]:
#df_api2

<b> Final three datasets <b>

In [45]:
df_api.head()

Unnamed: 0,favorite_count,retweet_count,retweeted,tweet_id
0,8853,39467,[retweeted],892420643555336193
0,6514,33819,[retweeted],892177421306343426
0,4328,25461,[retweeted],891815181378084864
0,8964,42908,[retweeted],891689557279858688
0,9774,41048,[retweeted],891327558926688256


In [46]:
df_archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [47]:
df_pred.head(1)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True


### Assessing the data
<a id='assess'/>

<b> 1a Assessing the archive data </b>

In [48]:
#checking out on the Column  data types and Nans
df_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

In [49]:
df_archive.sample(2)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
543,805958939288408065,,,2016-12-06 02:15:59 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Penny. She fought a bee...,7.827226e+17,4196984000.0,2016-10-02 23:23:04 +0000,https://twitter.com/dog_rates/status/782722598...,10,10,Penny,,,,
1868,675166823650848770,,,2015-12-11 04:14:49 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Arnold. He broke his leg saving a hand...,,,,https://twitter.com/dog_rates/status/675166823...,10,10,Arnold,,,,


In [50]:
#checking on duplicate ids and rows
sum(df_archive['tweet_id'].duplicated()),sum(df_archive.duplicated())

(0, 0)

In [51]:
#checking on the number of  unique values on the rating_denominator 
df_archive['rating_denominator'].value_counts()

10     2333
11        3
50        3
80        2
20        2
2         1
16        1
40        1
70        1
15        1
90        1
110       1
120       1
130       1
150       1
170       1
7         1
0         1
Name: rating_denominator, dtype: int64

In [52]:
#checking on the number of  unique values on the rating_denominator 
df_archive['rating_numerator'].value_counts()

12      558
11      464
10      461
13      351
9       158
8       102
7        55
14       54
5        37
6        32
3        19
4        17
1         9
2         9
420       2
0         2
15        2
75        2
80        1
20        1
24        1
26        1
44        1
50        1
60        1
165       1
84        1
88        1
144       1
182       1
143       1
666       1
960       1
1776      1
17        1
27        1
45        1
99        1
121       1
204       1
Name: rating_numerator, dtype: int64

In [53]:
#checking the number of rows having the name of the dogs as none
len(df_archive[df_archive['name']=='None'])

745

In [54]:
#checking out on the dog names 
df_archive['name'].value_counts()

None          745
a              55
Charlie        12
Oliver         11
Cooper         11
Lucy           11
Tucker         10
Penny          10
Lola           10
Winston         9
Bo              9
the             8
Sadie           8
Toby            7
Daisy           7
Bailey          7
an              7
Buddy           7
Oscar           6
Koda            6
Milo            6
Stanley         6
Jax             6
Scout           6
Bella           6
Leo             6
Rusty           6
Dave            6
Jack            6
Bentley         5
             ... 
Cupid           1
Shiloh          1
Lillie          1
Mike            1
Ivar            1
Thor            1
Kody            1
Geoff           1
Ricky           1
Joshwa          1
Petrick         1
Hermione        1
Dixie           1
Franq           1
Combo           1
Darby           1
Noah            1
Maks            1
Grey            1
Rupert          1
Tycho           1
Lulu            1
Brandonald      1
Tug             1
Ebby      

In [55]:
#checking out the number of rows of dogs types columns having None values
len(df_archive[(df_archive['doggo']=='None')&(df_archive['floofer']=='None')&
                 (df_archive['pupper']=='None')&(df_archive['puppo']=='None')])

1976

In [56]:
# storing the re-tweets data for future analysis,if needed
df_retweet=df_archive[df_archive['retweeted_status_id'].isnull()==False]


In [57]:
#total duplicated rows in source columns
sum(df_archive['source'].duplicated())

2352

<b> 1b Assessing the API data </b>

In [114]:
#checking the data types and null values if any
df_api.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4708 entries, 0 to 0
Data columns (total 4 columns):
favorite_count    4708 non-null object
retweet_count     4708 non-null object
retweeted         4708 non-null object
tweet_id          4708 non-null object
dtypes: object(4)
memory usage: 183.9+ KB


<b> 1c Assessing the image prediction data </b>

In [121]:
#checking the data types and null values if any
df_pred.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [122]:
#Assessing the number of tweets that were predicted as dogs by algorithm 1,2 and 3  
len(df_pred[(df_pred['p1_dog']==False)|(df_pred['p2_dog']==False)|(df_pred['p3_dog']==False)])

832

In [123]:
#checking for duplicates apart from tweet_id which would be unique
sum(df_pred.iloc[:,1:].duplicated())

66

In [124]:
#checking on the word cases for dog breeds
df_pred['p1'].sample(5)

307     hen                        
1263    sulphur-crested_cockatoo   
220     German_short-haired_pointer
1879    golden_retriever           
223     Shetland_sheepdog          
Name: p1, dtype: object

## Inference
### <b> Quality issues </b> <br>
<i> <b> from Twitter archive data (df_archive)</i></b>
<ol>
    <li> The rating_denominator has values not equal to 10</li>
    <li> The timestamp column is in string format </li>
    <li> dog stage columns has none values</li>
    <li> Dog names has incomplete names(a,an)</li>
    <li> contains retweets </li>
    <li> Source column has duplicated values </li>
    </ol>
 <i><b> from API data (df_api)</i></b>
   <ol>
    <li>The tweet id column is in string, should be in Integer </li> </ol>
<b><i> from Image prediction data (df_pred)</i></b>
<ol>
<li> Have rows sets which are not predicted as dogs </li>
<li> Predicted dog breeds contains both lower case and upper case </li>
<li> Duplicate values in image predictions </li>   
</ol>

### <b> Tidyness issues </b> <br>
<ol>
    <li> The dog types columns in archive data can be merged into two columns (after assessing the data in the spreadsheet)</li>
    <li> The API data containing the favourites and retweet counts column should be merged with Archive data</li>

## <b> Cleaning data </b>
<a id='clean'/> 

In [184]:
#making a copy of the datasets
df_archive_clean=df_archive.copy()
df_api_clean=df_api.copy()
df_pred_clean=df_pred.copy()
df_archive_clean.columns

Index(['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
       'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator',
       'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo'],
      dtype='object')

<b> 1. The rating_denominator has values not equal to 10 </b>

<b> Define :</b>
filtering out the values which are greater than 10 in rating_denominator

<b> Code </b>

In [185]:

df_archive_clean=df_archive_clean[df_archive_clean['rating_denominator']==10]

<b> Test </b>

In [186]:
#should return 0
len(df_archive_clean[df_archive_clean['rating_denominator']!=10])

0

<b> 2. The timestamp column is in string format</b><br>
<b> Define :</b>
Using pandas to convert the timestamp column to datetime format

<b> Code </b>

In [187]:
df_archive_clean['timestamp']=pd.to_datetime(df_archive_clean['timestamp'])

<b> Test </b>

In [188]:
#should return as datetime type
df_archive_clean['timestamp'].dtype

dtype('<M8[ns]')

<b>3.Dog stage columns has none values </b><br>
<b> Define: The None values values would be replaced by empty space. So that the columns will remain in string and will be easier to merge these columns in Tidyness(1) </b> <br>


<b> Code </b>

In [189]:
df_archive_clean.replace("None","",inplace=True)

<b> Test </b>

In [190]:
#all should return zero
columns=['doggo','floofer','pupper','puppo']
for x in columns:
    non=len(df_archive_clean[df_archive_clean[x].str.contains("None")])
    print(non)   

0
0
0
0


<b> 4. Dog names has incomplete names(a,an)</b><br>
<b> Define: </b>certain names that are incomplete or with white spaces will be replaced by "None"

<b> Code</b>

In [191]:
n_index=df_archive_clean['name'][(df_archive_clean['name']=='a')|(df_archive_clean['name']=='an')|(df_archive_clean['name']=='')].index

In [192]:
df_archive_clean.loc[n_index,'name']='None'

<b> Test </b>

In [193]:
#should return an empty list
df_archive_clean['name'][(df_archive_clean['name']=='a')|(df_archive_clean['name']=='an')|(df_archive_clean['name']=='')].index


Int64Index([], dtype='int64')

<b> 5. contains retweets </b><br>
<b> Define: </b> The data should not contain retweets so rows which have values in 'retweeted_status_id' indicates those are 
retweets,hence removed

<b> Code </b>

In [194]:
re_index=df_archive_clean[df_archive_clean['retweeted_status_id'].isnull()==False].index
df_archive_clean.drop(re_index,axis=0,inplace=True)

<b> Test </b>

In [195]:
#should return 0
len(df_archive_clean[df_archive_clean['retweeted_status_id'].isnull()==False])


0

<b> 6. Source column has more than 2000 duplicated values <br>
Define:</b> This column has to be deleted 

<b> Code </b>

In [196]:
df_archive_clean.drop('source',axis=1,inplace=True)

<b> Test</b>

In [197]:
#sourve column should not be present
df_archive_clean.columns

Index(['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
       'text', 'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator',
       'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo'],
      dtype='object')

#### For df_api Quality issues


<b> 7.The tweet id column is in string, should be in Integer.<br>
    Define: </b> Tweet_id column will get converted into integer

<b> Code </b>

In [198]:
df_api_clean['tweet_id']=df_api_clean['tweet_id'].astype('int64')

<b> Test </b>

In [199]:
#should return int64
df_api_clean['tweet_id'].dtype

dtype('int64')

### For df_pred_clean quality issue
<b> 8 .Have rows sets which are not predicted as dogs</b><br>
<b> Define:</b> Delete rows which is not predicted as dogs,as the data should only contain dogs tweet<br>

<b> Code </b>

In [200]:
d_index=df_pred_clean[(df_pred_clean['p1_dog']==False)|(df_pred_clean['p2_dog']==False)|(df_pred_clean['p3_dog']==False)].index

In [201]:
df_pred_clean.drop(d_index,axis=0,inplace=True)

<b> Test </b>

In [202]:
#should return 0
len(df_pred_clean[(df_pred_clean['p1_dog']==False)|(df_pred_clean['p2_dog']==False)|(df_pred_clean['p3_dog']==False)])

0

<b> 9. Predicted dog breeds contains both lower case and upper case <br>
Define:</b> The column will have to be either in upper or lower case, here i will be converting the names into lower case
    for uniformity 

<b> Code </b>

In [203]:
cols=['p1','p2','p3']
for x in cols:
  df_pred_clean[x]=df_pred_clean[x].str.lower()

<b> Test </b>

In [204]:
#should have lowercases in p1,p2 and p3 columns
df_pred_clean[['p1','p2','p3']].sample(5)

Unnamed: 0,p1,p2,p3
1076,miniature_pinscher,italian_greyhound,beagle
41,labrador_retriever,chihuahua,french_bulldog
291,rottweiler,kelpie,appenzeller
176,pug,french_bulldog,chihuahua
1065,golden_retriever,labrador_retriever,chow


<b> 10. Duplicate values in image predictions </b><br>
<b> Define:</b> Dropping the duplicates predictions

<b> Code </b>

In [205]:
df_pred_clean['jpg_url'].drop_duplicates(inplace=True)

<b> Test </b>

In [206]:
#should return 0
sum(df_pred_clean['jpg_url'].duplicated())

0

<b> Checking on final table </b> 

In [207]:
df_api_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4708 entries, 0 to 0
Data columns (total 4 columns):
favorite_count    4708 non-null object
retweet_count     4708 non-null object
retweeted         4708 non-null object
tweet_id          4708 non-null int64
dtypes: int64(1), object(3)
memory usage: 183.9+ KB


In [208]:
df_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2153 entries, 0 to 2355
Data columns (total 16 columns):
tweet_id                      2153 non-null int64
in_reply_to_status_id         73 non-null float64
in_reply_to_user_id           73 non-null float64
timestamp                     2153 non-null datetime64[ns]
text                          2153 non-null object
retweeted_status_id           0 non-null float64
retweeted_status_user_id      0 non-null float64
retweeted_status_timestamp    0 non-null object
expanded_urls                 2099 non-null object
rating_numerator              2153 non-null int64
rating_denominator            2153 non-null int64
name                          2153 non-null object
doggo                         2153 non-null object
floofer                       2153 non-null object
pupper                        2153 non-null object
puppo                         2153 non-null object
dtypes: datetime64[ns](1), float64(4), int64(3), object(8)
memory usage: 285.9+ K

In [209]:
df_pred_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1243 entries, 0 to 2073
Data columns (total 12 columns):
tweet_id    1243 non-null int64
jpg_url     1243 non-null object
img_num     1243 non-null int64
p1          1243 non-null object
p1_conf     1243 non-null float64
p1_dog      1243 non-null bool
p2          1243 non-null object
p2_conf     1243 non-null float64
p2_dog      1243 non-null bool
p3          1243 non-null object
p3_conf     1243 non-null float64
p3_dog      1243 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 100.5+ KB


### Tidyness

<b> 1. The dog types columns in archive data can be merged into two columns</b>
<b> Define: </b> <br>
1. First all the dog stages columns (doggo,floffer,pupper,puppo) has to appended into single column-DogStage
2. Extract the values of each classes defined and put it in a column-DogStage_test 
3. Now compare the column DogStage_test with DogStage and identify the double class detection (pupperdoggo,flofferdoggo etc) and None.
4. Correct the double classed manually for the respective index obtained from comparison above

<b> Code </b> <br>
Step 1

In [213]:
df_archive_clean['DogStage']=df_archive_clean['doggo'].map(str)+df_archive_clean['floofer'].map(str)+df_archive_clean['pupper'].map(str)+df_archive_clean['puppo'].map(str)

Step 2

In [214]:
df_archive_clean["DogStage_test"] = df_archive_clean.text.str.extract(
    r'(\bpuppo\b|\bdoggo\b|\bfloofer\b|\bpupper\b)', expand = True)

Checking on the value counts

In [212]:
df_archive_clean['DogStage'].value_counts()


                1809
pupper          224 
doggo           75  
puppo           24  
doggopupper     10  
floofer         9   
doggofloofer    1   
doggopuppo      1   
Name: DogStage, dtype: int64

Dropping the NaN in DogStage_test column

In [215]:
df_archive_clean.dropna(subset =["DogStage_test"],inplace=True)


Step 3

In [216]:
pd.set_option('display.max_colwidth', -1)
df_archive_clean[['text','DogStage','DogStage_test']][df_archive_clean['DogStage']!=df_archive_clean['DogStage_test']]

Unnamed: 0,text,DogStage,DogStage_test
191,Here's a puppo participating in the #ScienceMarch. Cleverly disguising her own doggo agenda. 13/10 would keep the planet habitable for https://t.co/cMhq16isel,doggopuppo,puppo
200,"At first I thought this was a shy doggo, but it's actually a Rare Canadian Floofer Owl. Amateurs would confuse the two. 11/10 only send dogs https://t.co/TXdT3tmuYk",doggofloofer,doggo
531,Here we have Burke (pupper) and Dexter (doggo). Pupper wants to be exactly like doggo. Both 12/10 would pet at same time https://t.co/ANBpEYHaho,doggopupper,pupper
565,"Like doggo, like pupper version 2. Both 11/10 https://t.co/9IxWAXFqze",doggopupper,doggo
575,This is Bones. He's being haunted by another doggo of roughly the same size. 12/10 deep breaths pupper everything's fine https://t.co/55Dqe0SJNj,doggopupper,doggo
705,This is Pinot. He's a sophisticated doggo. You can tell by the hat. Also pointier than your average pupper. Still 10/10 would pet cautiously https://t.co/f2wmLZTPHd,doggopupper,doggo
889,"Meet Maggie &amp; Lila. Maggie is the doggo, Lila is the pupper. They are sisters. Both 12/10 would pet at the same time https://t.co/MYwR4DQKll",doggopupper,doggo
956,Please stop sending it pictures that don't even have a doggo or pupper in them. Churlish af. 5/10 neat couch tho https://t.co/u2c9c7qSg8,doggopupper,doggo
1063,This is just downright precious af. 12/10 for both pupper and doggo https://t.co/o5J479bZUC,doggopupper,pupper
1113,"Like father (doggo), like son (pupper). Both 12/10 https://t.co/pG2inLaOda",doggopupper,doggo


Step-4

In [217]:
#sum of the tweets contains info on more than 1 type of dogs, hence i have retained the double classes in Dogstages in few rows 
df_archive_clean.loc[191, "DogStage"] = "puppo"
df_archive_clean.loc[200, "DogStage"] = "floofer"
df_archive_clean.loc[531, "DogStage"] = 'doggopupper'
df_archive_clean.loc[565, "DogStage"] = "doggopupper"
df_archive_clean.loc[575, "DogStage"] = 'doggopupper'
df_archive_clean.loc[705, "DogStage"] = 'doggo'
df_archive_clean.loc[778, "DogStage"] = 'doggopupper'
df_archive_clean.loc[822, "DogStage"] = 'doggopupper'
df_archive_clean.loc[889, "DogStage"] = 'doggopupper'
df_archive_clean.loc[956, "DogStage"] = np.nan
df_archive_clean.loc[1063, "DogStage"] = 'doggopupper'
df_archive_clean.loc[1113, "DogStage"] ='doggopupper'

In [218]:
#deleting the original columns
df_archive_clean.drop(columns=['doggo','floofer','pupper','puppo'],inplace=True,axis=1)

<b> Test</B>

In [219]:
#should return only the proper classes
df_archive_clean['DogStage'].value_counts()

pupper         218
doggo          70 
puppo          24 
doggopupper    8  
floofer        5  
Name: DogStage, dtype: int64

<b> 2. The API data containing the favourites and retweet counts column should be merged with Archive data </b> <br>
<b> Define: Left Merge the datasets with archive data being the primary dataset  </b> <br>

<b> Code </b>

In [220]:
len(df_archive_clean),len(df_api_clean)

(326, 4708)

In [221]:
df_archive_clean=df_archive_clean.merge(right=df_api_clean,how='left',on='tweet_id')

<b> Test </b>

In [222]:
#should display the retweet_count and favourite_count columns
df_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 544 entries, 0 to 543
Data columns (total 17 columns):
tweet_id                      542 non-null float64
in_reply_to_status_id         14 non-null float64
in_reply_to_user_id           14 non-null float64
timestamp                     542 non-null datetime64[ns]
text                          542 non-null object
retweeted_status_id           0 non-null float64
retweeted_status_user_id      0 non-null float64
retweeted_status_timestamp    0 non-null object
expanded_urls                 531 non-null object
rating_numerator              542 non-null float64
rating_denominator            542 non-null float64
name                          542 non-null object
DogStage                      542 non-null object
DogStage_test                 542 non-null object
favorite_count                436 non-null object
retweet_count                 436 non-null object
retweeted                     436 non-null object
dtypes: datetime64[ns](1), float64(7),

In [223]:
#deleting the unwanted columns
df_archive_clean.drop(columns=['in_reply_to_status_id','in_reply_to_user_id','retweeted_status_id','retweeted_status_user_id',
                               'retweeted_status_timestamp','expanded_urls','retweeted','DogStage_test'],axis=1,inplace=True)

In [224]:
#checking on the retained columns
df_archive_clean.columns

Index(['tweet_id', 'timestamp', 'text', 'rating_numerator',
       'rating_denominator', 'name', 'DogStage', 'favorite_count',
       'retweet_count'],
      dtype='object')

In [225]:
#setting the index
df_archive_clean.set_index('tweet_id',inplace=True)

In [226]:
df_pred_clean.set_index('tweet_id',inplace=True)

### Storing Data

<a id="storing"/>


In [227]:
df_archive_clean.to_csv('Twitter_data.csv',header=True)

In [228]:
df_pred_clean.to_csv('prediction_data.csv',header=True)