# Project: Wrangle and Analyze Data

In [32]:
# import the packages used in this project
import pandas as pd  # data wrangling
import numpy as np  # numerical computing
import requests  # downloading files programmatically
import os  # working with file paths
import tweepy  # querying the Twitter API
import json  # reading and writing the queried tweet data as JSON
import time  # time-related utilities
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import sklearn
import datetime
import io

## Data Gathering
In the cells below, gather **all** three pieces of data for this project and load them in the notebook. **Note:** the method required to gather each piece of data is different.
1. Directly download the WeRateDogs Twitter archive data (twitter-archive-enhanced.csv)

In [33]:
# read the downloaded file into pandas
twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')

In [34]:
# view the loaded file
twitter_archive

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2351,666049248165822465,,,2015-11-16 00:24:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a 1949 1st generation vulpix. Enj...,,,,https://twitter.com/dog_rates/status/666049248...,5,10,,,,,
2352,666044226329800704,,,2015-11-16 00:04:52 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a purebred Piers Morgan. Loves to Netf...,,,,https://twitter.com/dog_rates/status/666044226...,6,10,a,,,,
2353,666033412701032449,,,2015-11-15 23:21:54 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here is a very happy pup. Big fan of well-main...,,,,https://twitter.com/dog_rates/status/666033412...,9,10,a,,,,
2354,666029285002620928,,,2015-11-15 23:05:30 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a western brown Mitsubishi terrier. Up...,,,,https://twitter.com/dog_rates/status/666029285...,7,10,a,,,,


2. Use the Requests library to download the tweet image predictions (image_predictions.tsv)

In [35]:
# request the image predictions file
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)

# write the content of the response to a local file
with open('image_predictions.tsv', mode='wb') as file:
    file.write(response.content)

# read the image predictions file into pandas
image_predictions = pd.read_csv('image_predictions.tsv', sep='\t')
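
The download step could also parse the response in memory and fail loudly on a bad HTTP status. The sketch below is one possible variant, not the notebook's own code: `load_tsv_bytes` is a hypothetical helper, and the commented usage assumes the same `url` as in the cell above.

```python
import io

import pandas as pd


def load_tsv_bytes(content: bytes) -> pd.DataFrame:
    """Parse raw tab-separated bytes (e.g. response.content) into a DataFrame."""
    return pd.read_csv(io.BytesIO(content), sep='\t')


# Hypothetical usage with the request made above (not executed here):
# response = requests.get(url, timeout=30)
# response.raise_for_status()  # raise on 4xx/5xx instead of saving an error page
# image_predictions = load_tsv_bytes(response.content)

# Small in-memory sample so the helper can be tried without network access
sample = (b"tweet_id\tjpg_url\timg_num\n"
          b"666020888022790149\thttps://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg\t1\n")
sample_df = load_tsv_bytes(sample)
```

Writing the file to disk, as the notebook does, has the advantage that later runs can skip the download.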

In [36]:
# view image_predictions file
image_predictions

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.072010,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
...,...,...,...,...,...,...,...,...,...,...,...,...
2070,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,basset,0.555712,True,English_springer,0.225770,True,German_short-haired_pointer,0.175219,True
2071,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,paper_towel,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False
2072,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,Chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True
2073,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,Chihuahua,0.323581,True,Pekinese,0.090647,True,papillon,0.068957,True


3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)

In [37]:
from tweepy import OAuthHandler
from timeit import default_timer as timer

In [38]:
# NOTE TO REVIEWER: this student had mobile verification issues so the following
# Twitter API code was sent to this student from a Udacity instructor
# Tweet IDs for which to gather additional data via Twitter's API
tweet_ids = twitter_archive.tweet_id.values
len(tweet_ids)

2356
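
The notebook does not show the API query itself, so the following is only a sketch of how the gathering loop could write `tweet_json.txt`, one JSON object per line. `save_tweets_as_json_lines` and `fake_fetch` are hypothetical names. With Tweepy, `fetch_tweet` could wrap a call such as `api.get_status(tweet_id, tweet_mode='extended')._json`; that wiring is an assumption and is not taken from this notebook.

```python
import json
import os
import tempfile


def save_tweets_as_json_lines(tweet_ids, fetch_tweet, path):
    """Write one JSON object per line for every tweet that can be fetched.

    fetch_tweet is any callable mapping a tweet ID to a dict. IDs for which
    it raises are collected and returned instead of stopping the loop.
    """
    failed = {}
    with open(path, 'w', encoding='utf-8') as outfile:
        for tweet_id in tweet_ids:
            try:
                tweet = fetch_tweet(tweet_id)
            except Exception as error:  # e.g. deleted or protected tweets
                failed[tweet_id] = str(error)
                continue
            json.dump(tweet, outfile)
            outfile.write('\n')
    return failed


# Stand-in for the real API call, so the sketch runs without credentials
def fake_fetch(tweet_id):
    if tweet_id == 2:
        raise ValueError('tweet not found')
    return {'id': tweet_id, 'retweet_count': 0, 'favorite_count': 0}


demo_path = os.path.join(tempfile.mkdtemp(), 'tweet_json_demo.txt')
failed_ids = save_tweets_as_json_lines([1, 2, 3], fake_fetch, demo_path)
with open(demo_path, encoding='utf-8') as infile:
    saved = [json.loads(line) for line in infile]
```

Collecting the failures rather than aborting means a few unavailable tweets do not cost the whole run, and the returned dictionary documents which IDs are missing from the file.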

In [43]:
# Extract 'id', 'retweet_count', 'favorite_count', 'followers_count',
# 'friends_count' and 'listed_count' from 'tweet_json.txt'
# and convert the result to a DataFrame

# create an empty list to hold the extracted data
df_list = []

# open the .txt file for reading
with open('tweet_json.txt', 'r') as jsonfile:
    for line in jsonfile:
        # parse the JSON string into a dictionary
        tweet_line = json.loads(line)
        # get the required fields
        tweet_ID = tweet_line['id']
        retweet_count = tweet_line['retweet_count']
        friends_count = tweet_line['user']['friends_count']
        fav_count = tweet_line['favorite_count']
        followers_count = tweet_line['user']['followers_count']
        listed_count = tweet_line['user']['listed_count']

        # append to the list of dictionaries
        df_list.append({'id': tweet_ID,
                        'retweet_count': retweet_count,
                        'friends_count': friends_count,
                        'favorite_count': fav_count,
                        'followers_count': followers_count,
                        'listed_count': listed_count})

# create a DataFrame from the list of dictionaries
tweet_json = pd.DataFrame(df_list, columns=['id', 'retweet_count',
                                            'friends_count',
                                            'favorite_count', 'followers_count',
                                            'listed_count'])
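
As an alternative to picking each field by hand, the nested `user` dictionary can be flattened with `pd.json_normalize`. The sketch below uses two made-up lines in the same shape as the fields read above; the sample values and the `tweet_json_demo` name are illustrative assumptions.

```python
import io
import json

import pandas as pd

# Two made-up lines in the same shape as the fields read above
sample_lines = io.StringIO(
    '{"id": 1, "retweet_count": 5, "favorite_count": 9, '
    '"user": {"friends_count": 104, "followers_count": 10, "listed_count": 2}}\n'
    '{"id": 2, "retweet_count": 7, "favorite_count": 3, '
    '"user": {"friends_count": 104, "followers_count": 11, "listed_count": 2}}\n'
)

records = [json.loads(line) for line in sample_lines]

# json_normalize flattens nested keys into dotted names such as "user.friends_count"
flat = pd.json_normalize(records)
flat.columns = [name.replace('user.', '') for name in flat.columns]

wanted = ['id', 'retweet_count', 'friends_count',
          'favorite_count', 'followers_count', 'listed_count']
tweet_json_demo = flat[wanted]
```

The explicit loop in the notebook is easier to follow for a handful of fields; flattening pays off when many nested fields are needed.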

In [44]:
df_list

[{'id': 892420643555336193,
  'retweet_count': 8853,
  'friends_count': 104,
  'favorite_count': 39467,
  'followers_count': 3200889,
  'listed_count': 2784},
 {'id': 892177421306343426,
  'retweet_count': 6514,
  'friends_count': 104,
  'favorite_count': 33819,
  'followers_count': 3200889,
  'listed_count': 2784},
 {'id': 891815181378084864,
  'retweet_count': 4328,
  'friends_count': 104,
  'favorite_count': 25461,
  'followers_count': 3200889,
  'listed_count': 2784},
 {'id': 891689557279858688,
  'retweet_count': 8964,
  'friends_count': 104,
  'favorite_count': 42908,
  'followers_count': 3200889,
  'listed_count': 2784},
 {'id': 891327558926688256,
  'retweet_count': 9774,
  'friends_count': 104,
  'favorite_count': 41048,
  'followers_count': 3200889,
  'listed_count': 2784},
 {'id': 891087950875897856,
  'retweet_count': 3261,
  'friends_count': 104,
  'favorite_count': 20562,
  'followers_count': 3200889,
  'listed_count': 2784},
 {'id': 890971913173991426,
  'retweet_count':

In [45]:
tweet_json

Unnamed: 0,id,retweet_count,friends_count,favorite_count,followers_count,listed_count
0,892420643555336193,8853,104,39467,3200889,2784
1,892177421306343426,6514,104,33819,3200889,2784
2,891815181378084864,4328,104,25461,3200889,2784
3,891689557279858688,8964,104,42908,3200889,2784
4,891327558926688256,9774,104,41048,3200889,2784
...,...,...,...,...,...,...
2349,666049248165822465,41,104,111,3201018,2812
2350,666044226329800704,147,104,311,3201018,2812
2351,666033412701032449,47,104,128,3201018,2812
2352,666029285002620928,48,104,132,3201018,2812


## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issues**. You must use **both** visual assessment and
programmatic assessment to assess the data.

**Note:** pay attention to the following key points when you assess the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.
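
The first key point (original ratings that have images) can be expressed as two boolean masks. This is a minimal sketch on made-up toy frames, assuming that a missing `retweeted_status_id` marks an original tweet and that a tweet has an image when its `tweet_id` appears in the image predictions; `archive_demo` and `images_demo` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the archive and the image predictions (values are made up)
archive_demo = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'retweeted_status_id': [np.nan, 99.0, np.nan, np.nan],
    'in_reply_to_status_id': [np.nan, np.nan, np.nan, 55.0],
})
images_demo = pd.DataFrame({'tweet_id': [1, 2, 4]})

# Retweets carry a retweeted_status_id; originals do not
is_original = archive_demo['retweeted_status_id'].isna()
# A tweet has an image if it appears in the image predictions table
has_image = archive_demo['tweet_id'].isin(images_demo['tweet_id'])

originals_with_images = archive_demo[is_original & has_image]
```

Replies are kept in this sketch; whether to drop them as well is a separate decision that a mask on `in_reply_to_status_id` would handle in the same way.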



In [47]:
# for easier assessment, give the dataframes short aliases
df1 = twitter_archive
df2 = image_predictions
df3 = tweet_json
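
Note that plain assignment gives a DataFrame a second name rather than creating a copy. The toy example below (made-up values, hypothetical variable names) shows the difference, which matters once cleaning starts.

```python
import pandas as pd

original = pd.DataFrame({'tweet_id': [1, 2], 'name': ['Phineas', 'a']})

alias = original            # same object, just a second name
snapshot = original.copy()  # independent copy

alias.loc[1, 'name'] = 'unknown'  # this edit also changes `original`
```

After the assignment, `original` and `alias` both show `'unknown'`, while `snapshot` still holds `'a'`.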

### Visual Assessment:
Here, a directed visual assessment of the dataframes will be carried out, aiming to explain the columns and check for anomalous data.

In [54]:
df1.head(3)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,


**`df1`** columns: 

1. `tweet_id`: the unique tweet identifier.
2. `in_reply_to_status_id`: if the tweet is a reply, the integer representation of the original tweet's ID.
3. `in_reply_to_user_id`: if the tweet is a reply, the integer representation of the original tweet's author ID.
4. `timestamp`: the time when the tweet was created.
5. `source`: the device or app through which the tweet was posted, stored as an HTML anchor tag.
6. `text`: the text of the tweet.
7. `retweeted_status_id`: if the tweet is a retweet, the integer representation of the original tweet's ID.
8. `retweeted_status_user_id`: if the tweet is a retweet, the integer representation of the original tweet's author ID.
9. `retweeted_status_timestamp`: if the tweet is a retweet, the time when the original tweet was created.
10. `expanded_urls`: the tweet's URL.
11. `rating_numerator`: the numerator of the dog's rating. Numerators are almost always greater than 10.
12. `name`: the name of the dog.
13. `

In [55]:
df2.head(3)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True


**`df2`** columns:

1. **`tweet_id`**: the unique identifier for each tweet
2. **`jpg_url`**: dog's image URL
3. **`img_num`**: the image number that corresponded to the most confident prediction (numbered 1 to 4 since tweets can have up to four images).
4. **`p1`**: algorithm's #1 prediction for the image in the tweet
5. **`p1_conf`**: how confident the algorithm is in its #1 prediction.
6. **`p1_dog`**: whether or not the #1 prediction is a breed of dog
7. **`p2`**: algorithm's #2 prediction for the image in the tweet
8. **`p2_conf`**: how confident the algorithm is in its #2 prediction.
9. **`p2_dog`**: whether or not the #2 prediction is a breed of dog.
10. **`p3`**: algorithm's #3 prediction for the image in the tweet.
11. **`p3_conf`**: how confident the algorithm is in its #3 prediction.
12. **`p3_dog`**: whether or not the #3 prediction is a breed of dog
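
One way to use these columns is to keep, for each tweet, the highest-ranked prediction that is actually a dog breed. The sketch below works on made-up rows laid out like `df2`; the `breed` column and the `'unknown'` default are assumptions for illustration, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy rows in the same layout as df2 (breeds are made up)
preds_demo = pd.DataFrame({
    'p1': ['redbone', 'paper_towel', 'spatula'],
    'p1_dog': [True, False, False],
    'p2': ['collie', 'Labrador_retriever', 'banana'],
    'p2_dog': [True, True, False],
    'p3': ['malinois', 'kelpie', 'orange'],
    'p3_dog': [True, True, False],
})

# np.select takes the first condition that is True for each row, so p1 wins
# over p2, and p2 over p3; rows with no dog prediction get the default
conditions = [preds_demo[col].to_numpy() for col in ('p1_dog', 'p2_dog', 'p3_dog')]
choices = [preds_demo[col].to_numpy() for col in ('p1', 'p2', 'p3')]
preds_demo['breed'] = np.select(conditions, choices, default='unknown')
```

Because the predictions are already ordered by confidence, taking the first dog prediction is the same as taking the most confident one.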

In [56]:
df3.head(4)

Unnamed: 0,id,retweet_count,friends_count,favorite_count,followers_count,listed_count
0,892420643555336193,8853,104,39467,3200889,2784
1,892177421306343426,6514,104,33819,3200889,2784
2,891815181378084864,4328,104,25461,3200889,2784
3,891689557279858688,8964,104,42908,3200889,2784


**`df3`** columns:

1. **`id`**: the unique identifier for each tweet.
2. **`retweet_count`**: the number of times the original tweet was retweeted.
3. **`favorite_count`**: the number of times the original tweet was liked.
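4. **`followers_count`**: the number of followers of the WeRateDogs account as recorded in the gathered tweet data. The values barely differ between tweets, so they appear to reflect the time of gathering rather than the time of each tweet.
5. **`friends_count`**: the number of profiles the WeRateDogs account was following, as recorded in the gathered tweet data.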
6. **`listed_count`**:|

### Programmatic Assessment
Here, different pandas functions are used to assess the three (3) dataframes programmatically.

In [52]:
# first, use `.info()` to get a summary of each dataframe, starting with `df1`
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [57]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [58]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   id               2354 non-null   int64
 1   retweet_count    2354 non-null   int64
 2   friends_count    2354 non-null   int64
 3   favorite_count   2354 non-null   int64
 4   followers_count  2354 non-null   int64
 5   listed_count     2354 non-null   int64
dtypes: int64(6)
memory usage: 110.5 KB


In [59]:
df1.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


In [61]:
df2.describe()

Unnamed: 0,tweet_id,img_num,p1_conf,p2_conf,p3_conf
count,2075.0,2075.0,2075.0,2075.0,2075.0
mean,7.384514e+17,1.203855,0.594548,0.1345886,0.06032417
std,6.785203e+16,0.561875,0.271174,0.1006657,0.05090593
min,6.660209e+17,1.0,0.044333,1.0113e-08,1.74017e-10
25%,6.764835e+17,1.0,0.364412,0.05388625,0.0162224
50%,7.119988e+17,1.0,0.58823,0.118181,0.0494438
75%,7.932034e+17,1.0,0.843855,0.1955655,0.09180755
max,8.924206e+17,4.0,1.0,0.488014,0.273419


In [62]:
df3.describe()

Unnamed: 0,id,retweet_count,friends_count,favorite_count,followers_count,listed_count
count,2354.0,2354.0,2354.0,2354.0,2354.0,2354.0
mean,7.426978e+17,3164.797366,104.0,8080.968564,3200942.0,2799.480884
std,6.852812e+16,5284.770364,0.0,11814.771334,44.57302,11.178223
min,6.660209e+17,0.0,104.0,0.0,3200799.0,2724.0
25%,6.783975e+17,624.5,104.0,1415.0,3200898.0,2788.0
50%,7.194596e+17,1473.5,104.0,3603.5,3200945.0,2803.0
75%,7.993058e+17,3652.0,104.0,10122.25,3200953.0,2805.0
max,8.924206e+17,79515.0,104.0,132810.0,3201018.0,2846.0


In [73]:
df1.isna().sum()

tweet_id                         0
in_reply_to_status_id         2278
in_reply_to_user_id           2278
timestamp                        0
source                           0
text                             0
retweeted_status_id           2175
retweeted_status_user_id      2175
retweeted_status_timestamp    2175
expanded_urls                   59
rating_numerator                 0
rating_denominator               0
name                             0
doggo                            0
floofer                          0
pupper                           0
puppo                            0
dtype: int64

In [74]:
df2.isna().sum()

tweet_id    0
jpg_url     0
img_num     0
p1          0
p1_conf     0
p1_dog      0
p2          0
p2_conf     0
p2_dog      0
p3          0
p3_conf     0
p3_dog      0
dtype: int64

In [72]:
df3.isna().sum()

id                 0
retweet_count      0
friends_count      0
favorite_count     0
followers_count    0
listed_count       0
dtype: int64

In [75]:
df1.duplicated().sum()

0

In [77]:
df2.duplicated().sum()

0

In [78]:
df3.duplicated().sum()

0
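
The summaries above hint at a few things worth checking directly: names such as `a`, denominators ranging from 0 to 170, and a `timestamp` column stored as text. The sketch below runs these checks on made-up rows; `checks_demo` is a hypothetical stand-in for `df1`, and treating lowercase names as invalid is an assumption.

```python
import pandas as pd

# Toy rows (made up) mimicking patterns visible in the outputs above
checks_demo = pd.DataFrame({
    'name': ['Phineas', 'a', 'Tilly', 'a'],
    'rating_denominator': [10, 10, 170, 0],
    'timestamp': ['2017-08-01 16:23:56 +0000', '2015-11-16 00:04:52 +0000',
                  '2017-08-01 00:17:27 +0000', '2015-11-15 23:21:54 +0000'],
})

# Entirely lowercase names are probably ordinary words, not dog names
lowercase_names = checks_demo.loc[checks_demo['name'].str.islower(), 'name']
lowercase_counts = lowercase_names.value_counts()

# Denominators other than 10 deserve a manual look at the tweet text
odd_denominators = checks_demo[checks_demo['rating_denominator'] != 10]

# timestamp is stored as text; parsing it confirms every value is a valid date
parsed = pd.to_datetime(checks_demo['timestamp'], utc=True)
```

Each check returns the offending rows rather than a single number, which makes it easier to decide case by case whether a value is an error.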

### Quality issues
1.

2.

3.

4.

5.

6.

7.

8.

### Tidiness issues
1.

2.

## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).
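
The copy-then-merge step described in the note can be sketched as follows. The frames are made-up stand-ins for the three gathered tables, and the choice of inner joins on `tweet_id` is an assumption; a different join type may suit the analysis better.

```python
import pandas as pd

# Toy stand-ins for df1, df2 and df3 (values are made up)
archive = pd.DataFrame({'tweet_id': [1, 2, 3], 'rating_numerator': [13, 12, 11]})
images = pd.DataFrame({'tweet_id': [1, 3], 'p1': ['redbone', 'collie']})
counts = pd.DataFrame({'id': [1, 2, 3], 'retweet_count': [10, 20, 30]})

# Work on copies so the gathered data stays untouched
archive_clean = archive.copy()
images_clean = images.copy()
counts_clean = counts.rename(columns={'id': 'tweet_id'})  # rename returns a new frame

# Inner joins keep only tweets present in all three tables
master = (archive_clean
          .merge(images_clean, on='tweet_id', how='inner')
          .merge(counts_clean, on='tweet_id', how='inner'))
```

Renaming `id` to `tweet_id` before merging gives all three tables the same key name, so the merged result carries a single identifier column.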

In [None]:
# Make copies of original pieces of data


### Issue #1:

#### Define

#### Code

#### Test

### Issue #2:

#### Define

#### Code

#### Test

## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".
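
A minimal sketch of the storing step, assuming the cleaned data lives in a single DataFrame (here the made-up `master_demo`). A temporary folder is used so the example can run anywhere; in the project the file would be written next to the notebook.

```python
import os
import tempfile

import pandas as pd

master_demo = pd.DataFrame({'tweet_id': [892420643555336193],
                            'rating_numerator': [13]})

# A temporary folder is used here instead of the notebook's working directory
out_path = os.path.join(tempfile.mkdtemp(), 'twitter_archive_master.csv')

# index=False avoids writing the row index as an extra unnamed column
master_demo.to_csv(out_path, index=False)

# Reading tweet_id back as text keeps the 18-digit IDs exact
reloaded = pd.read_csv(out_path, dtype={'tweet_id': str})
```

Treating `tweet_id` as text is a design choice: the IDs are labels rather than quantities, and text avoids any loss of precision if the column is ever converted to floats.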

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

### Insights:
1.

2.

3.

### Visualization