# Project: Wrangle and analyze 'WeRateDogs' Twitter Data 

## Introduction:


**Step 1:** First, all the necessary packages for wrangling, analyzing and visualizing data must be imported.

In [1]:
#import necessary packages
import numpy as np
import pandas as pd
import requests
import os
import json
import tweepy
import matplotlib.pyplot as plt
import seaborn as sns

#display visualizations in this notebook
%matplotlib inline

#format all visualization backgrounds with seaborn
sns.set()


## Part I: Gather Data

**Step 1:** The `twitter-archive-enhanced.csv` file on hand is read into a pandas dataframe.

In [2]:
#read the .csv file as a pandas dataframe and assign it to the variable twdf
twdf = pd.read_csv('twitter-archive-enhanced.csv')
#preview first few lines
twdf.head()


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


**Step 2:** Next, the Udacity-hosted file `image-predictions.tsv` is downloaded via the requests library, saved locally as `image_predictions.tsv`, and read into a pandas dataframe.

In [3]:
#download image-predictions.tsv and write the response to the `image_predictions.tsv` file
def download_preds():
    '''
    First, the file location is assigned to the url variable.
    Then the requests library is used to download the url and assign the result to the response variable.
    Finally, with the file open, the response is written to the `image_predictions.tsv` file.
    '''
    url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
    response = requests.get(url)
    with open('image_predictions.tsv', 'wb') as file:
        file.write(response.content)
#download_preds()

In [4]:
#read the predictions file to a pandas dataframe and assign to pred_df variable
pred_df = pd.read_csv('image_predictions.tsv', sep='\t')
pred_df.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


**Step 3:** The Tweepy library is used to download each tweet's JSON data into a single text file, `tweet_json.txt`. The information on each line of the file is then read into a pandas dataframe.

In [5]:
#read the twitter developer keys from environment variables so they are not stored in the notebook
#(the environment variable names below are placeholders), authorize them, then assign the twitter API to the api object variable
consumer_key = os.environ['TWITTER_CONSUMER_KEY']
consumer_secret = os.environ['TWITTER_CONSUMER_SECRET']
access_token = os.environ['TWITTER_ACCESS_TOKEN']
access_secret = os.environ['TWITTER_ACCESS_SECRET']

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

In [6]:
#save all tweets json data to a single text document
def save_tweet_jsons():
    '''
    -writes each tweet's JSON data as a one-line string to 'tweet_json.txt', queried by each tweet_id in the dataframe twdf
    -also prints a countdown to completion
    -a try block is used to continue in case there is no data to write
    '''
    count = twdf.tweet_id.count()
    with open('tweet_json.txt', 'w') as file:
        for tweet_id in twdf['tweet_id']:
            try:
                tweet = api.get_status(tweet_id, tweet_mode='extended')._json
                json.dump(tweet, file)
                file.write('\n')
                count -= 1 
                print(count)
            except tweepy.error.TweepError:
                count -= 1 
                print('TweepError', count)
                continue
#save_tweet_jsons()

In [7]:
#read each tweet's JSON data as a line from `tweet_json.txt` and append the desired fields to a list
df_list = []
with open('tweet_json.txt') as full_json_file:
    #iterate over the file directly instead of loading every line into memory first
    for line in full_json_file:
        data = json.loads(line)
        tweet_id = data['id']
        retweet_count = data['retweet_count']
        favorite_count = data['favorite_count']
        df_list.append({'tweet_id': tweet_id,
                        'retweet_count': retweet_count,
                        'favorite_count': favorite_count})

In [8]:
#converts the list with json data to a pandas dataframe assigned to the variable: json_df
json_df = pd.DataFrame(df_list)
json_df.head()

Unnamed: 0,favorite_count,retweet_count,tweet_id
0,37524,8188,892420643555336193
1,32256,6057,892177421306343426
2,24300,4005,891815181378084864
3,40872,8330,891689557279858688
4,39065,9033,891327558926688256


## Part II: Assess Data

**Step 1:** First, a random sample of five observations is displayed for each of the three dataframes in order to become familiar with the data.

In [9]:
#inspect five lines of the twdf dataframe
twdf.sample(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
181,857029823797047296,,,2017-04-26 00:33:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Zeke. He performs group cheeky wink tu...,,,,https://twitter.com/dog_rates/status/857029823...,12,10,Zeke,,,,
1722,680115823365742593,,,2015-12-24 20:00:22 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Ozzy. He woke up 2 minutes before he h...,,,,https://twitter.com/dog_rates/status/680115823...,9,10,Ozzy,,,,
405,823939628516474880,,,2017-01-24 17:04:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Cash. He's officially given pup on tod...,,,,https://twitter.com/dog_rates/status/823939628...,12,10,Cash,,,,
2168,669354382627049472,,,2015-11-25 03:18:15 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Dug. Dug fucken loves peaches. 8/10 https...,,,,https://twitter.com/dog_rates/status/669354382...,8,10,Dug,,,,
58,880935762899988482,,,2017-06-30 23:47:07 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Louis. He's crossing. It's a big deal....,,,,https://twitter.com/dog_rates/status/880935762...,13,10,Louis,,,,


In [10]:
#inspect five lines of the pred_df dataframe
pred_df.sample(5)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1065,715680795826982913,https://pbs.twimg.com/media/Ce6b4MPWwAA22Xm.jpg,1,golden_retriever,0.990715,True,Labrador_retriever,0.002228,True,chow,0.001197,True
215,670055038660800512,https://pbs.twimg.com/media/CUyDgChWUAAmNSI.jpg,1,snail,0.563631,False,slug,0.296649,False,bolete,0.031839,False
442,674632714662858753,https://pbs.twimg.com/media/CVzG3yOVAAAqi9I.jpg,1,jellyfish,0.432748,False,goldfish,0.113111,False,coral_reef,0.087047,False
1705,817423860136083457,https://pbs.twimg.com/ext_tw_video_thumb/81742...,1,ice_bear,0.3362,False,Samoyed,0.201358,True,Eskimo_dog,0.186789,True
1425,772193107915964416,https://pbs.twimg.com/media/Crdhh_1XEAAHKHi.jpg,1,Pembroke,0.367945,True,Chihuahua,0.223522,True,Pekinese,0.164871,True


In [11]:
#inspect five lines of the json_df dataframe
json_df.sample(5)

Unnamed: 0,favorite_count,retweet_count,tweet_id
939,5246,1344,751538714308972544
1318,2657,837,705223444686888960
579,0,7138,798705661114773508
532,0,8686,805823200554876929
44,44561,9571,883482846933004288


**Step 2:** One thing that catches the eye is the inconsistent capitalization of dog names in the column `pred_df['p1']`, so the fifteen most common values are inspected below. Interestingly, the fifteenth is not even a dog.

In [12]:
#inspect the fifteen most common dog prediction names in pred_df
pred_df.p1.value_counts().head(15)

golden_retriever            150
Labrador_retriever          100
Pembroke                     89
Chihuahua                    83
pug                          57
chow                         44
Samoyed                      43
toy_poodle                   39
Pomeranian                   38
cocker_spaniel               30
malamute                     30
French_bulldog               26
miniature_pinscher           23
Chesapeake_Bay_retriever     23
seat_belt                    22
Name: p1, dtype: int64
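
To put a number on the capitalization inconsistency, one possible check compares each label with its lowercased form. This is only a sketch: `toy_preds` is a hypothetical five-row stand-in that mirrors the shape of `pred_df['p1']`, with labels copied from the output above.

```python
import pandas as pd

# Hypothetical stand-in for pred_df; labels copied from the value counts above
toy_preds = pd.DataFrame({'p1': ['golden_retriever', 'Labrador_retriever',
                                 'Pembroke', 'pug', 'seat_belt']})

# A label contains uppercase letters if lowercasing it changes it
has_upper = toy_preds['p1'] != toy_preds['p1'].str.lower()
n_mixed = int(has_upper.sum())
print(n_mixed, 'of', len(toy_preds), 'labels contain uppercase letters')
```

Applied to the real `pred_df`, the same comparison could be repeated for the `p2` and `p3` columns.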

**Step 3:** To better understand the quality of data, the pandas `.info()` method is used on all three dataframes.  From this, various issues of wrong datatypes and missing values are revealed.

In [13]:
#view shape, column names and datatypes of twdf dataframe
twdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), object(10)

In [14]:
#view shape, column names and datatypes of pred_df dataframe
pred_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [15]:
#view shape, column names and datatypes of json_df dataframe
json_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2337 entries, 0 to 2336
Data columns (total 3 columns):
favorite_count    2337 non-null int64
retweet_count     2337 non-null int64
tweet_id          2337 non-null int64
dtypes: int64(3)
memory usage: 54.9 KB


**Step 4:** Since few rows have data in the `twdf['retweeted_status_id']` column, a sample of five of these rows is queried. These rows appear to be retweets.

In [16]:
#sample retweet observations in the twdf dataframe for closer inspection
twdf[~twdf['retweeted_status_id'].isnull()].sample(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
171,858860390427611136,,,2017-05-01 01:47:28 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: Meet Winston. He knows he's a l...,8.395493e+17,4196984000.0,2017-03-08 18:52:12 +0000,https://twitter.com/dog_rates/status/839549326...,12,10,Winston,,,,
310,835309094223372289,,,2017-02-25 02:03:02 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: So this just changed my life. 1...,7.530398e+17,4196984000.0,2016-07-13 01:34:21 +0000,"https://vine.co/v/5W2Dg3XPX7a,https://vine.co/...",13,10,,,,,
165,860177593139703809,,,2017-05-04 17:01:34 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: Ohboyohboyohboyohboyohboyohboyo...,7.61673e+17,4196984000.0,2016-08-05 21:19:27 +0000,https://twitter.com/dog_rates/status/761672994...,10,10,,,,,
612,796904159865868288,,,2016-11-11 02:35:32 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Tyrone. He's a leaf wiz...,6.873173e+17,4196984000.0,2016-01-13 16:56:30 +0000,https://twitter.com/dog_rates/status/687317306...,11,10,Tyrone,,,,
583,800188575492947969,,,2016-11-20 04:06:37 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Bo. He's a Benedoop Cum...,6.816941e+17,4196984000.0,2015-12-29 04:31:49 +0000,https://twitter.com/dog_rates/status/681694085...,11,10,Bo,,,pupper,


**Step 5:** Since the three dataframes do not have the same number of rows, it helps to check how many `tweet_id` values do not match across dataframes, which informs the tidiness assessment. This operation is performed below.

In [17]:
#create a list of tweet_id's in pred_df in order to count how many are not in the twdf dataframe

pred_ids = list(pred_df['tweet_id'])
count = 0
for tweet_id in twdf['tweet_id']:
    if tweet_id not in pred_ids:
        count += 1
count   

281

In [18]:
#then count how many json_df tweet_id's are not in the pred_df dataframe

count = 0
for tweet_id in json_df['tweet_id']:
    if tweet_id not in pred_ids:
        count += 1
count

272
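
The same kind of count can also be obtained without an explicit loop. The sketch below uses the pandas `Series.isin` method on two hypothetical miniature frames (`toy_archive` and `toy_preds` are illustrative names, not part of the notebook), so the numbers differ from the real results above.

```python
import pandas as pd

# Hypothetical miniature versions of twdf and pred_df
toy_archive = pd.DataFrame({'tweet_id': [1, 2, 3, 4, 5]})
toy_preds = pd.DataFrame({'tweet_id': [2, 4, 6]})

# Boolean mask of archive ids that have no matching prediction row
missing_mask = ~toy_archive['tweet_id'].isin(toy_preds['tweet_id'])
n_missing = int(missing_mask.sum())
print(n_missing)
```

Because `isin` hashes the lookup values, this avoids scanning a Python list once per `tweet_id`, which matters as the tables grow.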

**Step 6:**  Finally, the quality and tidiness issues are summarized as follows.

### Quality Issues:

- tweet_id is the wrong data type in all three tables
- `twdf` timestamps are wrong data type
- dog names are 'None' strings when they should be null in the `twdf` table
- some of the `twdf` observations are retweets
- missing `twdf['expanded_urls']` data
- some of the `pred_df` observations are probably not dogs
- dog types in the p1, p2 and p3 columns of the `pred_df` table are inconsistently capitalized
- since some tweet_ids in `twdf` and `json_df` are not in `pred_df`, there is missing image data

### Tidiness Issues:

- the four 'doggo', 'floofer', 'pupper' and 'puppo' columns represent one categorical variable
- some of the `json_df` tweet_ids have no prediction data
- there is no prediction info for some of the tweet_ids
- all data should be in a single table without the 'in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id', and 'retweeted_status_user_id' columns
- p2 and p3 info from `pred_df` does not need to be included since it is not relevant to the desired statistical analysis and visualization

## Part III: Clean Data

**Step 1:** Before cleaning, copies of all three dataframes are created. The copy of `twdf` becomes `master_df`, and `json_df` and `pred_df` are copied to `json_clean` and `pred_clean`.

In [20]:
#create a copy of twdf for cleaning as the master_df
#then copy pred_df and json_df for cleaning
master_df = twdf.copy()
json_clean = json_df.copy()
pred_clean = pred_df.copy()
master_df.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


**Step 2:** Address the following datatype issues:

- tweet_id is the wrong data type in all three tables
- `twdf` timestamps are wrong data type

##### Definition:
- use the pandas `.astype()` method to convert each table's `tweet_id` column to string datatype
- use the pandas `to_datetime()` function to convert the timestamps in `master_df` to datetime

##### Code:
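
A minimal sketch of the two conversions defined above, run on a hypothetical two-row frame (`toy_master`) rather than the real `master_df`. The values are copied from the preview earlier in the notebook, and the same `.astype(str)` call would be repeated for `json_clean` and `pred_clean`.

```python
import pandas as pd

# Hypothetical two-row stand-in for master_df
toy_master = pd.DataFrame({
    'tweet_id': [892420643555336193, 892177421306343426],
    'timestamp': ['2017-08-01 16:23:56 +0000', '2017-08-01 00:17:27 +0000'],
})

# tweet_id is an identifier, not a quantity, so store it as a string
toy_master['tweet_id'] = toy_master['tweet_id'].astype(str)

# parse the timestamp strings into datetime values
toy_master['timestamp'] = pd.to_datetime(toy_master['timestamp'])

print(toy_master.dtypes)
```

Printing `.dtypes` afterwards (or calling `.info()`) is one way to fill in the Test step.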

##### Test:

**Step 3:** Address the tidiness and cleanliness issue in the `doggo`, `floofer`, `pupper`, and `puppo` columns. For convenience, the remaining quality issues in `master_df` are addressed as well.

- dog names are 'None' strings when they should be null in the `twdf` table
- the four 'doggo', 'floofer', 'pupper' and 'puppo' columns represent one categorical variable
- some of the `twdf` observations are retweets
- missing `twdf['expanded_urls']` data

##### Definition:

- replace 'None' with '' in all four columns
- melt the four columns into a single column

##### Code:
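
One way the definition above could be carried out is sketched below on a hypothetical three-row frame (`toy_master`). Note that this sketch combines the columns by string concatenation instead of calling `pd.melt`, which is a swapped-in simplification, and that the new column name `stage` is an assumption.

```python
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Hypothetical three-row stand-in for master_df
toy_master = pd.DataFrame({
    'tweet_id': ['1', '2', '3'],
    'doggo':   ['None', 'doggo', 'None'],
    'floofer': ['None', 'None', 'None'],
    'pupper':  ['None', 'None', 'pupper'],
    'puppo':   ['None', 'None', 'None'],
})

# Replace the 'None' placeholder with an empty string in all four columns
toy_master[stage_cols] = toy_master[stage_cols].replace('None', '')

# Concatenate the four columns into one ('stage' is a hypothetical name)
toy_master['stage'] = (toy_master['doggo'] + toy_master['floofer']
                       + toy_master['pupper'] + toy_master['puppo'])

# Rows with no stage become null instead of an empty string
toy_master['stage'] = toy_master['stage'].where(toy_master['stage'] != '')

# The four original columns are now redundant
toy_master = toy_master.drop(columns=stage_cols)
print(toy_master)
```

A tweet that mentions two stages would produce a concatenated value such as `doggopupper` under this approach, so such rows would need separate handling.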

##### Test:

**Step 4:** Address the remaining two quality issues in the `pred_clean` table.

- some of the `pred_df` observations are probably not dogs
- dog types in the p1, p2 and p3 columns of the `pred_df` table are inconsistently capitalized

##### Definition:

##### Code:
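
A possible approach to these two issues, sketched on a hypothetical three-row frame (`toy_pred`). It assumes that "probably not dogs" is judged by the `p1_dog` flag being `False`, and that lowercase is the target capitalization; neither choice is stated in the notebook.

```python
import pandas as pd

# Hypothetical three-row stand-in for pred_clean
toy_pred = pd.DataFrame({
    'tweet_id': ['1', '2', '3'],
    'p1': ['Labrador_retriever', 'seat_belt', 'pug'],
    'p1_dog': [True, False, True],
})

# Keep only rows whose top prediction is flagged as a dog (assumed criterion)
toy_pred = toy_pred[toy_pred['p1_dog']].copy()

# Normalize capitalization by lowercasing the label (assumed convention)
toy_pred['p1'] = toy_pred['p1'].str.lower()
print(toy_pred)
```

On the real table the `.str.lower()` call would also be applied to `p2` and `p3` if those columns are kept.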

##### Test:

**Step 5:** The remaining quality issue and four tidiness issues are addressed by joining tables.

- since some tweet_ids in `twdf` and `json_df` are not in the pred_df, there is missing image data
- some of the `json_df` tweet_id's have no prediction data
- there is no prediction info for some of the tweet_ids
- all data should be in a single table without the 'in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id', and 'retweeted_status_user_id' columns
- p2 and p3 info from `pred_df` does not need to be included since it is not relevant to the desired statistical analysis and visualization

##### Definition:

##### Code:
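
The join could look roughly like the sketch below, which uses hypothetical miniature tables (`toy_master`, `toy_json`, `toy_pred`). It assumes inner joins on `tweet_id`, so tweets missing from any table are dropped, and it keeps only the `p1` prediction column; both choices are assumptions rather than the notebook's confirmed method.

```python
import pandas as pd

# Hypothetical miniature versions of master_df, json_clean and pred_clean
toy_master = pd.DataFrame({
    'tweet_id': ['1', '2', '3'],
    'in_reply_to_status_id': [None, None, 5.0],
    'rating_numerator': [13, 12, 10],
})
toy_json = pd.DataFrame({
    'tweet_id': ['1', '2'],
    'retweet_count': [8188, 6057],
    'favorite_count': [37524, 32256],
})
toy_pred = pd.DataFrame({
    'tweet_id': ['1', '3'],
    'p1': ['golden_retriever', 'pug'],
    'p2': ['labrador_retriever', 'chow'],
})

# Inner joins keep only tweets present in all three tables;
# selecting ['tweet_id', 'p1'] leaves the p2 (and p3) info behind
merged = (
    toy_master
    .merge(toy_json, on='tweet_id', how='inner')
    .merge(toy_pred[['tweet_id', 'p1']], on='tweet_id', how='inner')
    .drop(columns=['in_reply_to_status_id'])
)
print(merged)
```

In this toy data only tweet `'1'` appears in all three tables, so it is the only row that survives the joins.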

##### Test:

## Part IV: Analyze and Visualize Data