# Project: Wrangle and Analyze Data

## Importing Libraries

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
import tweepy
import json
import requests
import os
%matplotlib inline

# Data Gathering
In the cells below, gather **all** three pieces of data for this project and load them in the notebook. **Note:** the methods required to gather each piece of data are different.
1. Directly download the WeRateDogs Twitter archive data (twitter_archive_enhanced.csv)

### Mounting Google Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Defining Path of Working Directory

In [2]:
# Working Directory Path
working_dir = '/content/drive/My Drive/Colab_Notebooks/ALX_2/'

## 1. Reading downloaded dataset into Pandas Dataframe

#### Renaming file downloaded/provided: twitter-archive-enhanced.csv

In [None]:
# Defining File Path
old_name = working_dir + 'twitter-archive-enhanced.csv' #filename with hyphen (-)
new_name = working_dir + 'twitter_archive_enhanced.csv' #filename with underscore (_)

# Renaming the file
os.rename(old_name, new_name)

In [None]:
# Confirming file rename
os.listdir(working_dir)

['wrangle_act.ipynb',
 'project2.ipynb',
 'twitter_archive_enhanced.csv',
 'image_predictions_folder',
 'tweet_json.txt']

In [None]:
# Specifying path of twitter_archive dataset
twitter_archive_path = '/content/drive/My Drive/Colab_Notebooks/ALX_2/twitter_archive_enhanced.csv'

# Reading path
twitter_archive_df = pd.read_csv(twitter_archive_path)

In [None]:
# Checking columns in dataframe

twitter_archive_df.columns

Index(['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
       'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator',
       'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo'],
      dtype='object')

In [None]:
twitter_archive_df.shape

(2356, 17)

## 2. Use the Requests library to download the tweet image prediction (image_predictions.tsv)

In [None]:
# Defining path of folder
image_predictions_folder = '/content/drive/My Drive/Colab_Notebooks/ALX_2/image_predictions_folder'

# Creating directory if non-existent
if not os.path.exists(image_predictions_folder):
    os.makedirs(image_predictions_folder)

In [None]:
# Checking if directory was created
os.listdir(working_dir)

['wrangle_act.ipynb',
 'project2.ipynb',
 'twitter-archive-enhanced.csv',
 'image_predictions_folder',
 'tweet_json.txt']

In [None]:
# Checking contents of (created) directory
os.listdir(image_predictions_folder)

['image-predictions.tsv']

In [None]:
# Defining URL
url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"

# Using get method
image_predictions_resp = requests.get(url)

In [None]:
# Checking type of object returned
type(image_predictions_resp)

requests.models.Response

In [None]:
#print(image_predictions_resp.text)

#### Accessing Content and Writing to File

In [None]:
# Writing to a file
with open(os.path.join(image_predictions_folder, url.split('/')[-1]), mode = 'wb') as file:
    file.write(image_predictions_resp.content)
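
The download-and-save step above can be wrapped in a small helper. The sketch below is only an illustration: `save_download` is a hypothetical name, the example URL is a placeholder, and the file name is assumed to be the last URL segment, as in the cell above. In the notebook it would also be reasonable to call `image_predictions_resp.raise_for_status()` before writing, so that an HTTP error page is not saved as if it were data.

```python
import os
import tempfile


def save_download(content: bytes, url: str, folder: str) -> str:
    """Write downloaded bytes to `folder`, naming the file after the last URL segment."""
    os.makedirs(folder, exist_ok=True)  # create the folder if it is missing
    file_path = os.path.join(folder, url.split('/')[-1])
    with open(file_path, mode='wb') as file:
        file.write(content)
    return file_path


# Stand-in bytes instead of a real HTTP response, so the sketch runs offline
demo_folder = tempfile.mkdtemp()
demo_path = save_download(
    b"tweet_id\tjpg_url\n",
    "https://example.com/files/image-predictions.tsv",  # placeholder URL
    demo_folder,
)
print(os.path.basename(demo_path))
```

Returning the path makes it easy to pass the saved file straight to `pd.read_csv(..., sep='\t')`.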

In [None]:
# Checking file was saved to disk
os.listdir(image_predictions_folder)

['image_predictions.tsv']

#### Renaming file downloaded programmatically: image-predictions.tsv

In [None]:
# Defining File Path
old_name = working_dir + 'image_predictions_folder/image-predictions.tsv' #filename with hyphen (-)
new_name = working_dir + 'image_predictions_folder/image_predictions.tsv' #filename with underscore (_)

# Renaming the file
os.rename(old_name, new_name)

In [None]:
# Confirming file rename
os.listdir(image_predictions_folder)

['image_predictions.tsv']

## 3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)

### Defining Tweepy Credentials

In [None]:
# Consumer (API) key authentication
# Credentials are read from environment variables rather than hard-coded in the notebook
# (the variable names below are placeholders chosen for this notebook)
consumer_key = os.environ['TWITTER_CONSUMER_KEY']
consumer_secret = os.environ['TWITTER_CONSUMER_SECRET']

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)


# Access key authentication
access_token = os.environ['TWITTER_ACCESS_TOKEN']
access_secret = os.environ['TWITTER_ACCESS_SECRET']

auth.set_access_token(access_token, access_secret)


# Set up the API with the authentication handler
api = tweepy.API(auth)

In [None]:
tweet_id_list = twitter_archive_df['tweet_id'].tolist()

In [None]:
len(tweet_id_list)

2356

In [None]:
# Creating list to hold Tweet JSON Objects
tweet_status_json_list = []

In [None]:
# Iterating to get tweet extended information from tweet_ids in tweet_id_list

for tweet_id in tweet_id_list:

  try:
    # Getting Status Object from API by tweet_id
    tweet_status = api.get_status(tweet_id, tweet_mode='extended') #Mode = Extended; for more information
  except tweepy.TweepError:
    continue

  #Converting Status Object to JSON Object
  json_str = json.dumps(tweet_status._json)

  #Appending JSON Object to list of JSON Objects
  tweet_status_json_list.append(json_str)


In [None]:
# Checking length of the list of JSON strings
len(tweet_status_json_list)

3230
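
The list holds 3230 entries although only 2356 tweet IDs were queried. One possible explanation (an assumption, not something the notebook confirms) is that the loop cell ran more than once while the list kept accumulating results. The sketch below shows a loop that builds its lists from scratch on every call and records failed IDs instead of skipping them silently. `collect_statuses`, `fetch_status` and `fake_fetch` are hypothetical names; in the notebook, `fetch_status` would wrap `api.get_status(tweet_id, tweet_mode='extended')._json`.

```python
import json


def collect_statuses(tweet_ids, fetch_status):
    """Return (json_lines, failed_ids), building both lists from scratch on every call."""
    json_lines, failed_ids = [], []
    for tweet_id in tweet_ids:
        try:
            status = fetch_status(tweet_id)
        except Exception:  # the notebook catches tweepy.TweepError at this point
            failed_ids.append(tweet_id)
            continue
        json_lines.append(json.dumps(status))
    return json_lines, failed_ids


def fake_fetch(tweet_id):
    """Stand-in for the API call: pretends tweet 2 has been deleted."""
    if tweet_id == 2:
        raise LookupError("tweet not found")
    return {"id": tweet_id, "retweet_count": 0}


lines, failed = collect_statuses([1, 2, 3], fake_fetch)
print(len(lines), failed)
```

Keeping the failed IDs makes it possible to check that the number of JSON lines plus the number of failures equals the number of IDs queried.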

In [None]:
# Defining path of tweet_json.txt File

path = working_dir + 'tweet_json.txt'

In [None]:
# Saving List of JSON Objects to tweet_json.txt File

with open(path, "w") as fhandle:
  for line in tweet_status_json_list:
    fhandle.write(f'{line}\n')

## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issues**. You must use **both** visual assessment and programmatic assessment to assess the data.

**Note:** pay attention to the following key points when you assess the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.



### Reading Datasets Gathered

#### 1. Reading twitter-archive-enhanced.csv Dataset (File provided)

In [5]:
# Reading dataset: twitter_archive_enhanced.csv (renamed from twitter-archive-enhanced.csv)
twitter_archive_df = pd.read_csv(working_dir + 'twitter_archive_enhanced.csv')

In [6]:
# Checking columns in dataframe
twitter_archive_df.columns

Index(['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
       'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator',
       'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo'],
      dtype='object')

In [7]:
# Shape of Dataframe
twitter_archive_df.shape

(2356, 17)

In [9]:
# Checking info
twitter_archive_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [None]:
# Checking first 5 rows
twitter_archive_df.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [None]:
# Checking datatypes of columns
twitter_archive_df.dtypes

tweet_id                        int64
in_reply_to_status_id         float64
in_reply_to_user_id           float64
timestamp                      object
source                         object
text                           object
retweeted_status_id           float64
retweeted_status_user_id      float64
retweeted_status_timestamp     object
expanded_urls                  object
rating_numerator                int64
rating_denominator              int64
name                           object
doggo                          object
floofer                        object
pupper                         object
puppo                          object
dtype: object

In [53]:
# Summary statistics of rating_numerator column
twitter_archive_df['rating_numerator'].describe()

count    2356.000000
mean       13.126486
std        45.876648
min         0.000000
25%        10.000000
50%        11.000000
75%        12.000000
max      1776.000000
Name: rating_numerator, dtype: float64

* The rating_numerator may exceed 10 (the maximum here is 1776); this is part of the account's rating system ("they're good dogs Brent")
* Numerators above 10 are therefore not inaccurate in themselves, so no cleaning is required

In [54]:
# Summary statistics of rating_denominator column
twitter_archive_df['rating_denominator'].describe()

count    2356.000000
mean       10.455433
std         6.745237
min         0.000000
25%        10.000000
50%        10.000000
75%        10.000000
max       170.000000
Name: rating_denominator, dtype: float64

* The rating_denominator is expected to be 10, but the column ranges from 0 to 170
* Values other than 10 are therefore treated as inaccurate and require cleaning
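
To see which records are affected, the rows whose denominator differs from 10 can be isolated with a boolean mask. This is a minimal sketch on made-up rows (the `ratings` frame is illustrative, not taken from the archive). One caveat: the sampled archive row with a rating of 80/80 for "a brigade of puppers" suggests that some unusual denominators may be ratings for several dogs at once, so inspecting the rows before changing them seems prudent.

```python
import pandas as pd

# Illustrative rows only; the real data lives in twitter_archive_df
ratings = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "rating_numerator": [13, 80, 9, 960],
    "rating_denominator": [10, 80, 10, 0],
})

# Keep only the rows whose denominator is not the expected value of 10
odd_denominators = ratings[ratings["rating_denominator"] != 10]
print(odd_denominators["tweet_id"].tolist())
```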

In [68]:
#check names in twitter archive dataset
twitter_archive_df['name'].value_counts()

None          745
a              55
Charlie        12
Cooper         11
Lucy           11
             ... 
Dex             1
Ace             1
Tayzie          1
Grizzie         1
Christoper      1
Name: name, Length: 957, dtype: int64

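* "None" is the most frequent value in the name column (745 of 2356 records), and the next most frequent value, "a" (55 records), is not a name either
* The "None" records are explored further below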

In [69]:
none_names_df = twitter_archive_df[twitter_archive_df['name'] == 'None']
none_names_df.shape

(745, 17)

In [71]:
# Displaying some of the records with none names
none_names_df.sample(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1051,742534281772302336,,,2016-06-14 01:49:03 +0000,"<a href=""http://vine.co"" rel=""nofollow"">Vine -...","For anyone who's wondering, this is what happe...",,,,https://vine.co/v/iLTZmtE1FTB,11,10,,doggo,,,
2166,669363888236994561,,,2015-11-25 03:56:01 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a Gingivitis Pumpernickel named Z...,,,,https://twitter.com/dog_rates/status/669363888...,10,10,,,,,
1622,684914660081053696,,,2016-01-07 01:49:14 +0000,"<a href=""http://twitter.com/download/iphone"" r...","""Hello yes I'll just get one of each color tha...",,,,https://twitter.com/dog_rates/status/684914660...,12,10,,,,,
1834,676121918416756736,,,2015-12-13 19:30:01 +0000,"<a href=""http://vine.co"" rel=""nofollow"">Vine -...",Here we are witnessing a very excited dog. Cle...,,,,https://vine.co/v/iZXg7VpeDAv,8,10,,,,,
814,771014301343748096,,,2016-08-31 15:58:28 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Another pic without a dog in it? What am I sup...,,,,https://twitter.com/dog_rates/status/771014301...,7,10,,,,,


* The records with "None" names need to be cleaned: "None" is a placeholder for a missing name, not an actual dog name
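
One way to handle placeholder names is to replace them with a proper missing value. The sketch below works on a made-up Series and rests on an assumption that is not established by the notebook: that a valid name starts with an uppercase letter, so all-lowercase entries such as "a" or "an" are extraction errors.

```python
import numpy as np
import pandas as pd

# Illustrative values only; the real column is twitter_archive_df['name']
names = pd.Series(["Phineas", "None", "a", "Tilly", "an"], name="name")

# Assumption: the literal "None" and all-lowercase words are not real names
is_placeholder = (names == "None") | names.str.islower()

# Replace the placeholders with NaN so pandas treats them as missing
cleaned_names = names.mask(is_placeholder, np.nan)
print(cleaned_names.isna().sum())
```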

In [55]:
# Checking if there are retweets
retweets_archive_df = twitter_archive_df[twitter_archive_df['text'].str.contains("RT @")]
retweets_archive_df.shape

(181, 17)

In [56]:
# Displaying retweets
retweets_archive_df.head(3)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
19,888202515573088257,,,2017-07-21 01:02:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Canela. She attempted s...,8.87474e+17,4196984000.0,2017-07-19 00:47:34 +0000,https://twitter.com/dog_rates/status/887473957...,13,10,Canela,,,,
32,886054160059072513,,,2017-07-15 02:45:48 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @Athletics: 12/10 #BATP https://t.co/WxwJmv...,8.860537e+17,19607400.0,2017-07-15 02:44:07 +0000,https://twitter.com/dog_rates/status/886053434...,12,10,,,,,
36,885311592912609280,,,2017-07-13 01:35:06 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Lilly. She just paralle...,8.305833e+17,4196984000.0,2017-02-12 01:04:29 +0000,https://twitter.com/dog_rates/status/830583320...,13,10,Lilly,,,,


* There are retweets in the twitter_archive dataset, which require cleaning
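
The text search finds 181 rows, which matches the 181 non-null values that `info()` reports for `retweeted_status_id`. Filtering on that column is an alternative that does not depend on how the tweet text is worded. The sketch below uses a made-up frame; the third row is a hypothetical original tweet that merely contains "RT @" in its text.

```python
import numpy as np
import pandas as pd

# Illustrative rows only
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "text": [
        "This is Phineas. 13/10",
        "RT @dog_rates: This is Canela.",
        "Replying with RT @ in the middle of the text",
    ],
    "retweeted_status_id": [np.nan, 8.87e17, np.nan],
})

# Original tweets are the rows without a retweeted status
originals = archive[archive["retweeted_status_id"].isna()]
print(originals["tweet_id"].tolist())
```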

#### Checking Dog Stages Columns

In [45]:
twitter_archive_df.sample(10)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
420,822163064745328640,,,2017-01-19 19:25:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Mattie. She's extremely...,7.86234e+17,4196984000.0,2016-10-12 15:55:59 +0000,https://twitter.com/dog_rates/status/786233965...,11,10,Mattie,,,,
32,886054160059072513,,,2017-07-15 02:45:48 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @Athletics: 12/10 #BATP https://t.co/WxwJmv...,8.860537e+17,19607400.0,2017-07-15 02:44:07 +0000,https://twitter.com/dog_rates/status/886053434...,12,10,,,,,
483,814986499976527872,,,2016-12-31 00:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Cooper. Someone attacked him with a sh...,,,,https://twitter.com/dog_rates/status/814986499...,11,10,Cooper,,,pupper,
2200,668655139528511488,,,2015-11-23 04:59:42 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Say hello to Winifred. He is a Papyrus Hydrang...,,,,https://twitter.com/dog_rates/status/668655139...,11,10,Winifred,,,,
1756,678767140346941444,,,2015-12-21 02:41:11 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Mia. She makes awful decisions. 8/10 h...,,,,https://twitter.com/dog_rates/status/678767140...,8,10,Mia,,,,
759,778396591732486144,,,2016-09-21 00:53:04 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is an East African Chalupa...,7.030419e+17,4196984000.0,2016-02-26 02:20:37 +0000,https://twitter.com/dog_rates/status/703041949...,10,10,an,,,,
830,768855141948723200,,,2016-08-25 16:58:45 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Jesse. He really wants a belly rub. Wi...,,,,https://twitter.com/dog_rates/status/768855141...,11,10,Jesse,,,,
1327,705975130514706432,,,2016-03-05 04:36:02 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Adele. Her tongue flies out of her mou...,,,,https://twitter.com/dog_rates/status/705975130...,10,10,Adele,,,pupper,
1254,710658690886586372,,,2016-03-18 02:46:49 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here's a brigade of puppers. All look very pre...,,,,https://twitter.com/dog_rates/status/710658690...,80,80,,,,,
1035,744995568523612160,,,2016-06-20 20:49:19 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Abby. She got her face stuck in a glas...,,,,https://twitter.com/dog_rates/status/744995568...,9,10,Abby,,,,puppo


In [48]:
# Pupper Stage
pupper_df = twitter_archive_df[twitter_archive_df['pupper'] == 'pupper']
pupper_df.shape

(257, 17)

In [49]:
# Puppo Stage
puppo_df = twitter_archive_df[twitter_archive_df['puppo'] == 'puppo']
puppo_df.shape

(30, 17)

In [50]:
# Doggo Stage
doggo_df = twitter_archive_df[twitter_archive_df['doggo'] == 'doggo']
doggo_df.shape

(97, 17)

In [51]:
# Floofer Stage
floofer_df = twitter_archive_df[twitter_archive_df['floofer'] == 'floofer']
floofer_df.shape

(10, 17)

* The dog stages are spread across four separate columns (doggo, floofer, pupper, puppo), which requires cleaning (unpivoting into a single column)
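
The unpivoting could be done with `melt`. The sketch below is a hedged illustration on made-up rows; it assumes absent stages are stored as the string "None" (as the name column is), and the output column name `dog_stage` is hypothetical.

```python
import pandas as pd

# Illustrative rows only; assumes a missing stage is the string "None"
stages = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["None", "doggo", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["pupper", "pupper", "None"],
    "puppo": ["None", "None", "None"],
})
stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Unpivot to one row per (tweet, stage column), then keep only stages that are present
long_form = stages.melt(id_vars="tweet_id", value_vars=stage_cols, value_name="stage")
long_form = long_form[long_form["stage"] != "None"]

# A tweet can mention several stages, so join them into one value per tweet
dog_stage = (
    long_form.groupby("tweet_id")["stage"]
    .agg(", ".join)
    .rename("dog_stage")
    .reset_index()
)

# Left merge keeps tweets without any stage (their dog_stage becomes NaN)
tidy = stages[["tweet_id"]].merge(dog_stage, on="tweet_id", how="left")
print(tidy)
```

Joining multiple stages into one string is one design choice among several; keeping only the first stage, or storing one row per stage, would also give a single column.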

#### 2. Reading image_predictions.tsv Dataset (File Downloaded Programmatically)

In [10]:
# Reading dataset: image_predictions.tsv
image_pred_df = pd.read_csv(working_dir + 'image_predictions_folder/image_predictions.tsv', sep='\t', header=0)

In [11]:
# Checking columns in dataframe
image_pred_df.columns

Index(['tweet_id', 'jpg_url', 'img_num', 'p1', 'p1_conf', 'p1_dog', 'p2',
       'p2_conf', 'p2_dog', 'p3', 'p3_conf', 'p3_dog'],
      dtype='object')

In [12]:
# Shape of Dataframe
image_pred_df.shape

(2075, 12)

In [13]:
# Checking info
image_pred_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [14]:
# Checking datatypes of columns
image_pred_df.dtypes

tweet_id      int64
jpg_url      object
img_num       int64
p1           object
p1_conf     float64
p1_dog         bool
p2           object
p2_conf     float64
p2_dog         bool
p3           object
p3_conf     float64
p3_dog         bool
dtype: object

In [15]:
# Checking first 5 rows
image_pred_df.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [61]:
# Checking sampled rows
image_pred_df.sample(10)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1556,793210959003287553,https://pbs.twimg.com/media/CwINKJeW8AYHVkn.jpg,1,doormat,0.874431,False,French_bulldog,0.018759,True,Boston_bull,0.015134,True
539,676957860086095872,https://pbs.twimg.com/ext_tw_video_thumb/67695...,1,Labrador_retriever,0.772423,True,beagle,0.055902,True,golden_retriever,0.031152,True
465,675006312288268288,https://pbs.twimg.com/media/CV4aqCwWsAIi3OP.jpg,1,boxer,0.654697,True,space_heater,0.043389,False,beagle,0.042848,True
429,674265582246694913,https://pbs.twimg.com/media/CVt49k_WsAAtNYC.jpg,1,slug,0.998075,False,ice_lolly,0.000984,False,leafhopper,9.7e-05,False
41,666701168228331520,https://pbs.twimg.com/media/CUCZLHlUAAAeAig.jpg,1,Labrador_retriever,0.887707,True,Chihuahua,0.029307,True,French_bulldog,0.020756,True
1425,772193107915964416,https://pbs.twimg.com/media/Crdhh_1XEAAHKHi.jpg,1,Pembroke,0.367945,True,Chihuahua,0.223522,True,Pekinese,0.164871,True
1569,794355576146903043,https://pbs.twimg.com/media/CvJCabcWgAIoUxW.jpg,1,cocker_spaniel,0.500509,True,golden_retriever,0.272734,True,jigsaw_puzzle,0.041476,False
1732,821149554670182400,https://pbs.twimg.com/ext_tw_video_thumb/82114...,1,German_shepherd,0.515933,True,malinois,0.203651,True,Irish_setter,0.091055,True
1272,750011400160841729,https://pbs.twimg.com/media/CmfmvGUWgAAuVKD.jpg,1,muzzle,0.23762,False,Boston_bull,0.08715,True,sombrero,0.06851,False
1258,748692773788876800,https://pbs.twimg.com/media/CmPkGhFXEAABO1n.jpg,1,ox,0.337871,False,plow,0.269287,False,oxcart,0.245653,False


* Some breed names in the p1, p2 and p3 columns start with a lowercase letter while others are capitalized. The casing needs to be made consistent
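
A possible normalization is sketched below on a few values copied from the output above. Replacing underscores with spaces and title-casing every word is an assumption about the desired format; the notebook itself only asks for consistent casing.

```python
import pandas as pd

# A few prediction values taken from the sample output
preds = pd.DataFrame({
    "p1": ["Welsh_springer_spaniel", "redbone", "miniature_pinscher"],
    "p2": ["collie", "malinois", "Rottweiler"],
})

# Replace underscores with spaces and capitalize each word
for col in ["p1", "p2"]:
    preds[col] = preds[col].str.replace("_", " ", regex=False).str.title()

print(preds)
```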

#### 3. Reading tweet_json.txt Dataset (Dataset of tweets queried using Twitter API)

In [16]:
# create pandas DataFrame: 
tweet_json_df = pd.read_json(working_dir + 'tweet_json.txt', lines=True)

In [17]:
# Checking columns in dataframe
tweet_json_df.columns

Index(['created_at', 'id', 'id_str', 'full_text', 'truncated',
       'display_text_range', 'entities', 'extended_entities', 'source',
       'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'is_quote_status', 'retweet_count', 'favorite_count',
       'favorited', 'retweeted', 'possibly_sensitive',
       'possibly_sensitive_appealable', 'lang', 'retweeted_status',
       'quoted_status_id', 'quoted_status_id_str', 'quoted_status_permalink',
       'quoted_status'],
      dtype='object')

In [18]:
# Shape of Dataframe
tweet_json_df.shape

(3230, 32)

In [19]:
# Checking info
tweet_json_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3230 entries, 0 to 3229
Data columns (total 32 columns):
 #   Column                         Non-Null Count  Dtype              
---  ------                         --------------  -----              
 0   created_at                     3230 non-null   datetime64[ns, UTC]
 1   id                             3230 non-null   int64              
 2   id_str                         3230 non-null   int64              
 3   full_text                      3230 non-null   object             
 4   truncated                      3230 non-null   bool               
 5   display_text_range             3230 non-null   object             
 6   entities                       3230 non-null   object             
 7   extended_entities              2917 non-null   object             
 8   source                         3230 non-null   object             
 9   in_reply_to_status_id          66 non-null     float64            
 10  in_reply_to_status_id_st

In [20]:
# Checking datatypes of columns
tweet_json_df.dtypes

created_at                       datetime64[ns, UTC]
id                                             int64
id_str                                         int64
full_text                                     object
truncated                                       bool
display_text_range                            object
entities                                      object
extended_entities                             object
source                                        object
in_reply_to_status_id                        float64
in_reply_to_status_id_str                    float64
in_reply_to_user_id                          float64
in_reply_to_user_id_str                      float64
in_reply_to_screen_name                       object
user                                          object
geo                                          float64
coordinates                                  float64
place                                         object
contributors                                 f

In [21]:
# Checking first 5 rows
tweet_json_df.head()

Unnamed: 0,created_at,id,id_str,full_text,truncated,display_text_range,entities,extended_entities,source,in_reply_to_status_id,...,favorited,retweeted,possibly_sensitive,possibly_sensitive_appealable,lang,retweeted_status,quoted_status_id,quoted_status_id_str,quoted_status_permalink,quoted_status
0,2017-08-01 16:23:56+00:00,892420643555336193,892420643555336192,This is Phineas. He's a mystical boy. Only eve...,False,"[0, 85]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892420639486877696, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,0.0,0.0,en,,,,,
1,2017-08-01 00:17:27+00:00,892177421306343426,892177421306343424,This is Tilly. She's just checking pup on you....,False,"[0, 138]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892177413194625024, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,0.0,0.0,en,,,,,
2,2017-07-31 00:18:03+00:00,891815181378084864,891815181378084864,This is Archie. He is a rare Norwegian Pouncin...,False,"[0, 121]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891815175371796480, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,0.0,0.0,en,,,,,
3,2017-07-30 15:58:51+00:00,891689557279858688,891689557279858688,This is Darla. She commenced a snooze mid meal...,False,"[0, 79]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891689552724799489, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,0.0,0.0,en,,,,,
4,2017-07-29 16:00:24+00:00,891327558926688256,891327558926688256,This is Franklin. He would like you to stop ca...,False,"[0, 138]","{'hashtags': [{'text': 'BarkWeek', 'indices': ...","{'media': [{'id': 891327551943041024, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,0.0,0.0,en,,,,,


#### Checking Tweets Gathered Beyond August 1st, 2017

In [22]:
max(tweet_json_df['created_at'])

Timestamp('2017-08-01 16:23:56+0000', tz='UTC')

In [23]:
max(twitter_archive_df['timestamp'])

'2017-08-01 16:23:56 +0000'

* There are no tweets gathered beyond August 1st, 2017
* Hence, the datasets conform to the specified time window

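
Taking `max` of the archive's string timestamps works here only because this format happens to sort chronologically as text. Parsing the strings first is a more robust check. The sketch below uses three timestamps copied from the outputs above; the variable names are illustrative.

```python
import pandas as pd

# Timestamps copied from the archive output, still stored as strings
timestamps = pd.Series([
    "2017-08-01 16:23:56 +0000",
    "2017-07-31 00:18:03 +0000",
    "2016-06-14 01:49:03 +0000",
])

# Parse to timezone-aware datetimes so comparisons are chronological, not textual
parsed = pd.to_datetime(timestamps, utc=True)

# "Not beyond August 1st, 2017" means strictly before the start of August 2nd (UTC)
cutoff = pd.Timestamp("2017-08-02", tz="UTC")
within_window = bool((parsed < cutoff).all())
print(parsed.max(), within_window)
```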
#### Checking if there are duplicates by ID

In [24]:
# counting unique values by 'tweet_id': twitter_archive dataset
n = len(pd.unique(twitter_archive_df['tweet_id']))
  
print("Number of unique values :", n)

Number of unique values : 2356


* There are no duplicates by tweet_id in the dataset (number of unique values (2356) corresponds to number of records (2356) in the dataset)

In [25]:
# counting unique values by 'tweet_id': image_pred dataset
n = len(pd.unique(image_pred_df['tweet_id']))
  
print("Number of unique values :", n)

Number of unique values : 2075


* There are no duplicates by tweet_id in the dataset (number of unique values (2075) corresponds to number of records (2075) in the dataset)

In [26]:
# counting unique values by 'id': tweet_json dataset
n = len(pd.unique(tweet_json_df['id']))
  
print("Number of unique values :", n)

Number of unique values : 874


* There are many duplicates by id in the dataset (the number of unique values (874) is far lower than the number of records (3230))

In [27]:
# Checking how often each id occurs in tweet_json dataset
item_counts = tweet_json_df['id'].value_counts()
item_counts

758828659922702336    1458
852189679701164033       4
812781120811126785       3
872486979161796608       3
872820683541237760       3
                      ... 
834458053273591808       2
834209720923721728       2
834167344700198914       2
834089966724603904       2
716439118184652801       1
Name: id, Length: 874, dtype: int64
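
Duplicated IDs like these could be removed with `drop_duplicates`. The sketch below uses a made-up frame. Restricting the comparison to the `id` column is deliberate: the real frame has dictionary-valued columns such as `entities`, and comparing whole rows that contain dictionaries is likely to fail because dictionaries are not hashable.

```python
import pandas as pd

# Illustrative rows only; the real data lives in tweet_json_df
tweets = pd.DataFrame({
    "id": [10, 10, 10, 20, 30, 30],
    "retweet_count": [5, 5, 5, 7, 1, 1],
})

# Keep the first row for each id and renumber the index
deduped = tweets.drop_duplicates(subset="id", keep="first").reset_index(drop=True)
print(deduped)
```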

#### Checking rows of one duplicated ID: tweet_json dataset

In [28]:
df_dup_id = tweet_json_df.loc[tweet_json_df['id'] == 852189679701164033]

In [39]:
df_dup_id

Unnamed: 0,created_at,id,id_str,full_text,truncated,display_text_range,entities,extended_entities,source,in_reply_to_status_id,...,favorited,retweeted,possibly_sensitive,possibly_sensitive_appealable,lang,retweeted_status,quoted_status_id,quoted_status_id_str,quoted_status_permalink,quoted_status
210,2017-04-12 16:00:27+00:00,852189679701164033,852189679701164032,This is Sailor. He has collected the best dirt...,False,"[0, 135]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 852189646159327233, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,0.0,0.0,en,,,,,
211,2017-04-12 16:00:27+00:00,852189679701164033,852189679701164032,This is Sailor. He has collected the best dirt...,False,"[0, 135]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 852189646159327233, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,0.0,0.0,en,,,,,
212,2017-04-12 16:00:27+00:00,852189679701164033,852189679701164032,This is Sailor. He has collected the best dirt...,False,"[0, 135]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 852189646159327233, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,0.0,0.0,en,,,,,
2557,2017-04-12 16:00:27+00:00,852189679701164033,852189679701164032,This is Sailor. He has collected the best dirt...,False,"[0, 135]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 852189646159327233, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,0.0,0.0,en,,,,,


In [58]:
# Checking if there are retweets in tweet_json dataset
df_retweeted_json = tweet_json_df[tweet_json_df['full_text'].str.contains("RT @")]
df_retweeted_json.shape

(297, 32)

In [60]:
# Displaying retweets
df_retweeted_json.head(3)

Unnamed: 0,created_at,id,id_str,full_text,truncated,display_text_range,entities,extended_entities,source,in_reply_to_status_id,...,favorited,retweeted,possibly_sensitive,possibly_sensitive_appealable,lang,retweeted_status,quoted_status_id,quoted_status_id_str,quoted_status_permalink,quoted_status
32,2017-07-15 02:45:48+00:00,886054160059072513,886054160059072512,RT @Athletics: 12/10 #BATP https://t.co/WxwJmv...,False,"[0, 50]","{'hashtags': [{'text': 'BATP', 'indices': [21,...",,"<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,0.0,0.0,und,{'created_at': 'Sat Jul 15 02:44:07 +0000 2017...,8.860534e+17,8.860534e+17,"{'url': 'https://t.co/WxwJmvjfxo', 'expanded':...",
36,2017-07-13 01:35:06+00:00,885311592912609280,885311592912609280,RT @dog_rates: This is Lilly. She just paralle...,False,"[0, 133]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 830583314243268608, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,0.0,0.0,en,{'created_at': 'Sun Feb 12 01:04:29 +0000 2017...,,,,
68,2017-06-26 00:13:58+00:00,879130579576475649,879130579576475648,RT @dog_rates: This is Emmy. She was adopted t...,False,"[0, 140]","{'hashtags': [], 'symbols': [], 'user_mentions...",,"<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,,,en,{'created_at': 'Fri Jun 23 01:10:23 +0000 2017...,,,,


* There are retweets in the tweet_json dataset, which require cleaning

### Quality issues
1. The identifier column in the tweet_json dataset is named id, whereas the twitter_archive and image_pred datasets both name it tweet_id [Consistency Issue]

2. Time columns have different names in the twitter_archive dataset (timestamp) and the tweet_json dataset (created_at) [Consistency Issue]

3. The timestamp column in the twitter_archive dataset has datatype "object", while the created_at column in the tweet_json dataset has datatype "datetime64" [Validity Issue]

4. The "id" column in the tweet_json dataset has many duplicated values (874 unique values in 3230 records) [Validity Issue]

5. The image_pred dataset has fewer records than the twitter_archive dataset (2075 instead of 2356) [Completeness Issue]

6. The dog stages (doggo, floofer, pupper, puppo) in the twitter_archive dataset are in separate columns instead of one column [Consistency Issue]

7. Some values in the rating_denominator column of the twitter_archive dataset differ from 10 (maximum is 170, minimum is 0), although the denominator is expected to be 10 [Accuracy Issue]

8. There are retweets in both the twitter_archive dataset and the tweet_json dataset, which require cleaning [Accuracy/Validity Issue]

9. Some breed names in the p1, p2 and p3 columns start with a lowercase letter while others are capitalized [Consistency Issue]
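
The identifier mismatch in issue 1 matters mainly when the three datasets are combined. The sketch below renames `id` to `tweet_id` and merges made-up frames. The inner joins are an assumption that follows from the requirement to keep only ratings that have images; they drop tweets missing from any of the three sources.

```python
import pandas as pd

# Illustrative frames only, standing in for the three gathered datasets
archive = pd.DataFrame({"tweet_id": [1, 2, 3], "rating_numerator": [13, 12, 11]})
tweet_json = pd.DataFrame({"id": [1, 2], "retweet_count": [100, 50], "favorite_count": [400, 300]})
image_pred = pd.DataFrame({"tweet_id": [1, 3], "p1": ["redbone", "collie"]})

# Align the identifier name with the other two datasets
tweet_json = tweet_json.rename(columns={"id": "tweet_id"})

# Inner joins keep only tweets present in all three datasets
master = (
    archive
    .merge(tweet_json, on="tweet_id", how="inner")
    .merge(image_pred, on="tweet_id", how="inner")
)
print(master)
```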

### Tidiness issues
1.

2.

## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

In [None]:
# Make copies of original pieces of data


### Issue #1:

#### Define:

#### Code

#### Test

### Issue #2:

#### Define

#### Code

#### Test

## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".
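
A minimal sketch of the storing step, assuming the cleaned result is held in a DataFrame (called `master` here, a hypothetical name) and written to a temporary folder so the example does not touch the working directory. Passing `index=False` keeps the row index out of the file, so reading it back does not produce an extra unnamed column.

```python
import os
import tempfile

import pandas as pd

# Illustrative one-row frame standing in for the cleaned master dataset
master = pd.DataFrame({"tweet_id": [892420643555336193], "rating_numerator": [13]})

# Write without the index, then read back to confirm the round trip
out_path = os.path.join(tempfile.mkdtemp(), "twitter_archive_master.csv")
master.to_csv(out_path, index=False)
reloaded = pd.read_csv(out_path)
print(list(reloaded.columns))
```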

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

### Insights:
1.

2.

3.

### Visualization