# Project: DATA WRANGLING - WeRateDogs

## Data Gathering

The first step of the wrangling process is data gathering.

In this step I will gather the three pieces of data needed for this project:
- Manually read in the `twitter-archive-enhanced.csv` file, which is already downloaded to my workstation
- Programmatically download the `image-predictions.tsv` file from this [link](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv)
- Query the Twitter API with each tweet ID in `twitter-archive-enhanced.csv`, using Python's tweepy library, and store each tweet's entire set of JSON data in a file called `tweet_json.txt`

**First, we import all the packages needed for this project**

In [1]:
#import packages
import json
import os

import numpy as np
import pandas as pd
import requests
import tweepy

As noted earlier, the first dataset is already downloaded, so we read it straight into a dataframe

In [2]:
#read in twitter archive enhanced data our first dataset
df = pd.read_csv('twitter-archive-enhanced.csv')

Using the Requests library, we programmatically download the `image-predictions.tsv` data and save its content to a tsv file, which will be read into a dataframe later on

In [3]:
#download image prediction data
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
response.raise_for_status()  #stop here if the download failed

In [4]:
#Save data into file using response.content
with open(url.split('/')[-1], mode='wb') as file:
    file.write(response.content)
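
The file name above comes from the last segment of the URL. As a minimal sketch of the same idea using only the standard library, the helpers below derive the name with `urllib.parse` (so a query string would not end up in the file name) and write raw bytes to disk. The function names `filename_from_url` and `save_bytes` are hypothetical and are not part of the notebook.

```python
from pathlib import Path
from posixpath import basename
from urllib.parse import urlparse


def filename_from_url(url: str) -> str:
    """Return the last path segment of a URL, ignoring any query string."""
    return basename(urlparse(url).path)


def save_bytes(content: bytes, filename: str, directory: str = ".") -> Path:
    """Write raw bytes (for example response.content) to directory/filename."""
    target = Path(directory) / filename
    target.write_bytes(content)
    return target


url = ("https://d17h27t6h515a5.cloudfront.net/topher/2017/August/"
       "599fd2ad_image-predictions/image-predictions.tsv")
print(filename_from_url(url))  # image-predictions.tsv
```

In the notebook this would be called as `save_bytes(response.content, filename_from_url(url))` once the download has succeeded.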

Finally, using the Tweepy library, I will query the Twitter API to gather each tweet's JSON data and store the contents in the file `tweet_json.txt`

In [5]:
#Set consumer key, secret, and access_token and secret
#They will be hidden to comply with Twitter API rules
consumer_key = 'XXXXXXXXXXXXXXXXXXXXX'
consumer_secret = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
access_token = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
access_secret = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'

#Set Authentication
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit= True)

In [6]:
#Query Twitter's API for JSON data for each tweet id in the dataframe
'''
id_of_tweet = df.tweet_id
count = 0
failed = {}
#Save output in a newline in a txt file
with open('tweet_json.txt', mode= 'w') as outputfile:
    for idx in id_of_tweet:
        count += 1
        try:
            tweet = api.get_status(idx, tweet_mode= 'extended')
            json.dump(tweet._json, outputfile)
            outputfile.write('\n')
        
        except tweepy.errors.TweepyException as e:
            print('No Data found')
            failed[idx] = e
            pass
print(failed)
'''

No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found
No Data found


Rate limit reached. Sleeping for: 95


No Data found


Rate limit reached. Sleeping for: 176


{888202515573088257: NotFound('404 Not Found\n144 - No status found with that ID.',), 873697596434513921: NotFound('404 Not Found\n144 - No status found with that ID.',), 872668790621863937: NotFound('404 Not Found\n144 - No status found with that ID.',), 872261713294495745: NotFound('404 Not Found\n144 - No status found with that ID.',), 869988702071779329: NotFound('404 Not Found\n144 - No status found with that ID.',), 866816280283807744: NotFound('404 Not Found\n144 - No status found with that ID.',), 861769973181624320: NotFound('404 Not Found\n144 - No status found with that ID.',), 856602993587888130: NotFound('404 Not Found\n144 - No status found with that ID.',), 856330835276025856: NotFound('404 Not Found\n144 - No status found with that ID.',), 851953902622658560: NotFound('404 Not Found\n144 - No status found with that ID.',), 851861385021730816: NotFound('404 Not Found\n144 - No status found with that ID.',), 845459076796616705: NotFound('404 Not Found\n144 - No status fou
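
The loop above writes one JSON object per line and records the IDs that could not be fetched. The sketch below shows that pattern in isolation so it can be run without API credentials. It assumes a stand-in `fetch` callable in place of the real API call; `dump_json_lines` and `fake_fetch` are hypothetical names, and the counts in the fake record are placeholder values, not real data.

```python
import io
import json


def dump_json_lines(tweet_ids, fetch, outputfile):
    """Write one JSON object per line; return {tweet_id: error} for failures.

    `fetch` is a hypothetical stand-in for the real API call: any callable
    that returns a dict for a tweet ID or raises an exception.
    """
    failed = {}
    for tweet_id in tweet_ids:
        try:
            record = fetch(tweet_id)
        except Exception as error:  # the notebook catches tweepy's own exception type
            failed[tweet_id] = error
            continue
        json.dump(record, outputfile)
        outputfile.write("\n")
    return failed


def fake_fetch(tweet_id):
    """Pretend API used only for this demonstration (placeholder values)."""
    if tweet_id == 888202515573088257:
        raise LookupError("No status found with that ID.")
    return {"id": tweet_id, "retweet_count": 0, "favorite_count": 0}


buffer = io.StringIO()
failed = dump_json_lines([892420643555336193, 888202515573088257], fake_fetch, buffer)
print(len(buffer.getvalue().splitlines()), "saved;", len(failed), "failed")
```

Passing the fetcher in as an argument keeps the file-writing logic testable without touching the network.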

Now read the `tweet_json.txt` file line by line into a pandas dataframe, keeping only the variables of interest

In [11]:
json_list = []
with open('tweet_json.txt', mode= 'r') as json_file:
    for text in json_file:
        texts = json.loads(text)
        tweet_id = texts['id']
        retweet_count = texts['retweet_count']
        favorite_count = texts['favorite_count']
        tweet_date = texts['created_at']
        tweet_source = texts['source']
        json_list.append({'tweet_id' : tweet_id,
                       'retweet_count' : retweet_count,
                       'favorite_count' : favorite_count,
                       'tweet_date' : tweet_date,
                       'tweet_source' : tweet_source})

df_json = pd.DataFrame(json_list, columns = ['tweet_id', 'retweet_count','favorite_count','tweet_date','tweet_source'])

In [12]:
df_json

Unnamed: 0,tweet_id,retweet_count,favorite_count,tweet_date,tweet_source
0,892420643555336193,6981,33737,Tue Aug 01 16:23:56 +0000 2017,"<a href=""http://twitter.com/download/iphone"" r..."
1,892177421306343426,5284,29265,Tue Aug 01 00:17:27 +0000 2017,"<a href=""http://twitter.com/download/iphone"" r..."
2,891815181378084864,3468,22000,Mon Jul 31 00:18:03 +0000 2017,"<a href=""http://twitter.com/download/iphone"" r..."
3,891689557279858688,7203,36844,Sun Jul 30 15:58:51 +0000 2017,"<a href=""http://twitter.com/download/iphone"" r..."
4,891327558926688256,7727,35231,Sat Jul 29 16:00:24 +0000 2017,"<a href=""http://twitter.com/download/iphone"" r..."
...,...,...,...,...,...
2322,666049248165822465,36,88,Mon Nov 16 00:24:50 +0000 2015,"<a href=""http://twitter.com/download/iphone"" r..."
2323,666044226329800704,115,246,Mon Nov 16 00:04:52 +0000 2015,"<a href=""http://twitter.com/download/iphone"" r..."
2324,666033412701032449,36,100,Sun Nov 15 23:21:54 +0000 2015,"<a href=""http://twitter.com/download/iphone"" r..."
2325,666029285002620928,39,112,Sun Nov 15 23:05:30 +0000 2015,"<a href=""http://twitter.com/download/iphone"" r..."


In [14]:
df_json.tweet_source.unique()

array(['<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
       '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
       '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
       '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>'],
      dtype=object)
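
Each `tweet_source` value is an HTML anchor tag wrapped around a readable client name. As a possible way to pull that name out, the sketch below uses a regular expression from the standard library; `source_label` is a hypothetical helper, and the pattern assumes every value has the single-anchor shape shown in the output above.

```python
import re

# Capture the text between the opening <a ...> tag and the closing </a>.
SOURCE_PATTERN = re.compile(r"<a\b[^>]*>(.*?)</a>")


def source_label(source_html: str) -> str:
    """Return the visible text of the anchor, or the input if nothing matches."""
    match = SOURCE_PATTERN.search(source_html)
    return match.group(1) if match else source_html


sources = [
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
    '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
    '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>',
]
labels = [source_label(source) for source in sources]
print(labels)
```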

## Assessing Data
Assessing our data is the second phase of the data wrangling process.
In this section, we detect and document quality and tidiness issues in our datasets. These issues can be detected either **visually** or **programmatically.**
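
To illustrate what a programmatic check can look like, here is a small sketch that flags `name` values that do not start with an uppercase letter. The sample values are the names visible in the `df` display in this notebook; `suspicious_names` is a hypothetical helper, and treating a lowercase first letter as suspicious is an assumption for illustration, not a rule taken from the data.

```python
def suspicious_names(names):
    """Return the values that do not start with an uppercase letter."""
    return [name for name in names if not name[:1].isupper()]


# Names visible in the head and tail of the archive dataframe display.
sample_names = ["Phineas", "Tilly", "Archie", "Darla", "Franklin", "a", "a", "a"]
print(suspicious_names(sample_names))  # ['a', 'a', 'a']
```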

For the purpose of this project we will be documenting at least **eight (8) quality issues and two (2) tidiness issues.**

The `twitter-archive-enhanced.csv` and `tweet_json.txt` files were already read into this notebook during the gathering phase. The `image-predictions.tsv` file was only downloaded, so we read it in now

In [15]:
#read in image predictions dataset
df_image = pd.read_csv('image-predictions.tsv', sep= '\t')

In [17]:
#Visually inspect all three datasets
df

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2351,666049248165822465,,,2015-11-16 00:24:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a 1949 1st generation vulpix. Enj...,,,,https://twitter.com/dog_rates/status/666049248...,5,10,,,,,
2352,666044226329800704,,,2015-11-16 00:04:52 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a purebred Piers Morgan. Loves to Netf...,,,,https://twitter.com/dog_rates/status/666044226...,6,10,a,,,,
2353,666033412701032449,,,2015-11-15 23:21:54 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here is a very happy pup. Big fan of well-main...,,,,https://twitter.com/dog_rates/status/666033412...,9,10,a,,,,
2354,666029285002620928,,,2015-11-15 23:05:30 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a western brown Mitsubishi terrier. Up...,,,,https://twitter.com/dog_rates/status/666029285...,7,10,a,,,,


In [18]:
df_json

Unnamed: 0,tweet_id,retweet_count,favorite_count,tweet_date,tweet_source
0,892420643555336193,6981,33737,Tue Aug 01 16:23:56 +0000 2017,"<a href=""http://twitter.com/download/iphone"" r..."
1,892177421306343426,5284,29265,Tue Aug 01 00:17:27 +0000 2017,"<a href=""http://twitter.com/download/iphone"" r..."
2,891815181378084864,3468,22000,Mon Jul 31 00:18:03 +0000 2017,"<a href=""http://twitter.com/download/iphone"" r..."
3,891689557279858688,7203,36844,Sun Jul 30 15:58:51 +0000 2017,"<a href=""http://twitter.com/download/iphone"" r..."
4,891327558926688256,7727,35231,Sat Jul 29 16:00:24 +0000 2017,"<a href=""http://twitter.com/download/iphone"" r..."
...,...,...,...,...,...
2322,666049248165822465,36,88,Mon Nov 16 00:24:50 +0000 2015,"<a href=""http://twitter.com/download/iphone"" r..."
2323,666044226329800704,115,246,Mon Nov 16 00:04:52 +0000 2015,"<a href=""http://twitter.com/download/iphone"" r..."
2324,666033412701032449,36,100,Sun Nov 15 23:21:54 +0000 2015,"<a href=""http://twitter.com/download/iphone"" r..."
2325,666029285002620928,39,112,Sun Nov 15 23:05:30 +0000 2015,"<a href=""http://twitter.com/download/iphone"" r..."


In [19]:
df_image

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.072010,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
...,...,...,...,...,...,...,...,...,...,...,...,...
2070,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,basset,0.555712,True,English_springer,0.225770,True,German_short-haired_pointer,0.175219,True
2071,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,paper_towel,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False
2072,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,Chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True
2073,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,Chihuahua,0.323581,True,Pekinese,0.090647,True,papillon,0.068957,True


In [21]:
df.sample(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1387,700505138482569216,,,2016-02-19 02:20:14 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Kaia. She's just cute as hell. 12/10 I...,,,,https://twitter.com/dog_rates/status/700505138...,12,10,Kaia,,,,
493,813202720496779264,,,2016-12-26 02:00:11 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here's a doggo who has concluded that Christma...,,,,https://twitter.com/dog_rates/status/813202720...,11,10,,doggo,,,
330,833124694597443584,,,2017-02-19 01:23:00 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Gidget. She's a spy pupper. Stealthy a...,,,,https://twitter.com/dog_rates/status/833124694...,12,10,Gidget,,,pupper,
1178,719551379208073216,,,2016-04-11 15:43:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Harnold. He accidentally opened the fr...,,,,https://twitter.com/dog_rates/status/719551379...,10,10,Harnold,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [22]:
df_image.sample(5)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
701,684880619965411328,https://pbs.twimg.com/media/CYEvSaRWwAAukZ_.jpg,1,clog,0.081101,False,spindle,0.066957,False,agama,0.060884,False
1685,814530161257443328,https://pbs.twimg.com/media/C03K2-VWIAAK1iV.jpg,1,miniature_poodle,0.626913,True,toy_poodle,0.265582,True,soft-coated_wheaten_terrier,0.041614,True
518,676470639084101634,https://pbs.twimg.com/media/CWNOdIpWoAAWid2.jpg,1,golden_retriever,0.790386,True,borzoi,0.022885,True,dingo,0.015343,False
950,704859558691414016,https://pbs.twimg.com/media/CcgqBNVW8AE76lv.jpg,1,pug,0.284428,True,teddy,0.156339,False,mitten,0.138915,False
1361,761227390836215808,https://pbs.twimg.com/media/CpBsRleW8AEfO8G.jpg,1,cougar,0.306512,False,French_bulldog,0.280802,True,boxer,0.054523,True


In [23]:
df_json.sample(5)

Unnamed: 0,tweet_id,retweet_count,favorite_count,tweet_date,tweet_source
1679,680801747103793152,740,2187,Sat Dec 26 17:25:59 +0000 2015,"<a href=""http://twitter.com/download/iphone"" r..."
453,816697700272001025,2049,9219,Wed Jan 04 17:27:59 +0000 2017,"<a href=""http://twitter.com/download/iphone"" r..."
2281,666786068205871104,409,658,Wed Nov 18 01:12:41 +0000 2015,"<a href=""http://twitter.com/download/iphone"" r..."
1812,675878199931371520,1238,3779,Sun Dec 13 03:21:34 +0000 2015,"<a href=""http://twitter.com/download/iphone"" r..."
1631,683098815881154561,588,1996,Sat Jan 02 01:33:43 +0000 2016,"<a href=""http://twitter.com/download/iphone"" r..."


In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [25]:
df_image.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [26]:
df_json.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2327 entries, 0 to 2326
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   tweet_id        2327 non-null   int64 
 1   retweet_count   2327 non-null   int64 
 2   favorite_count  2327 non-null   int64 
 3   tweet_date      2327 non-null   object
 4   tweet_source    2327 non-null   object
dtypes: int64(3), object(2)
memory usage: 91.0+ KB
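
The `info()` summaries list `timestamp` and `tweet_date` as `object`, meaning both are stored as strings, and the two columns use different layouts for the same moment. The sketch below parses the first row's value from each dataset with the standard library to show that they describe the same instant. The two format strings are my reading of the values displayed above, so treat them as assumptions.

```python
from datetime import datetime

ARCHIVE_FORMAT = "%Y-%m-%d %H:%M:%S %z"   # e.g. 2017-08-01 16:23:56 +0000
API_FORMAT = "%a %b %d %H:%M:%S %z %Y"    # e.g. Tue Aug 01 16:23:56 +0000 2017

archive_time = datetime.strptime("2017-08-01 16:23:56 +0000", ARCHIVE_FORMAT)
api_time = datetime.strptime("Tue Aug 01 16:23:56 +0000 2017", API_FORMAT)

print(archive_time == api_time)  # True
print(api_time.isoformat())      # 2017-08-01T16:23:56+00:00
```

Note that `%a` and `%b` match English day and month abbreviations only under an English or default C locale.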


### Quality issues
1.

2.

3.

4.

5.

6.

7.

8.

### Tidiness issues
1.

2.

## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

In [None]:
# Make copies of original pieces of data


### Issue #1:

#### Define:

#### Code

#### Test

### Issue #2:

#### Define

#### Code

#### Test

## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

### Insights:
1.

2.

3.

### Visualization