# Project: Wrangling and Analyzing Data

## Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
import tweepy
import json
import requests
import os
%matplotlib inline

## Data Gathering
In the cell below, gather **all** three pieces of data for this project and load them in the notebook. **Note:** the methods required to gather each piece of data are different.
1. Directly download the WeRateDogs Twitter archive data (twitter_archive_enhanced.csv)

### Mounting Google Drive

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Defining Path of Working Directory

In [3]:
# Working Directory Path
working_dir = '/content/drive/My Drive/Colab_Notebooks/ALX_2/'

## 1. Reading downloaded dataset into Pandas Dataframe

#### Renaming file downloaded/provided: twitter-archive-enhanced.csv

In [None]:
# Defining file paths
old_name = working_dir + 'twitter-archive-enhanced.csv'  # filename with hyphens (-)
new_name = working_dir + 'twitter_archive_enhanced.csv'  # filename with underscores (_)

# Renaming the file
os.rename(old_name, new_name)

In [None]:
# Confirming file rename
os.listdir(working_dir)

['wrangle_act.ipynb',
 'project2.ipynb',
 'twitter_archive_enhanced.csv',
 'image_predictions_folder',
 'tweet_json.txt']
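
Running the rename cell a second time would fail, because the hyphenated file no longer exists after the first run. A minimal sketch of a guarded rename follows; the helper name `rename_if_needed` is hypothetical, and the demonstration uses a temporary directory rather than the Drive folder.

```python
import os
import tempfile


def rename_if_needed(folder: str, old: str, new: str) -> bool:
    """Rename folder/old to folder/new; return True only if a rename happened."""
    old_path = os.path.join(folder, old)
    new_path = os.path.join(folder, new)
    # Only rename when the source exists and the target is not already there.
    if os.path.exists(old_path) and not os.path.exists(new_path):
        os.rename(old_path, new_path)
        return True
    return False


# Demonstration in a throwaway directory (not the project's working_dir).
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "twitter-archive-enhanced.csv"), "w").close()
    first = rename_if_needed(
        demo_dir, "twitter-archive-enhanced.csv", "twitter_archive_enhanced.csv"
    )
    # The second call is a no-op instead of raising FileNotFoundError.
    second = rename_if_needed(
        demo_dir, "twitter-archive-enhanced.csv", "twitter_archive_enhanced.csv"
    )
    files = os.listdir(demo_dir)

print(first, second, files)
```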

In [None]:
# Specifying path of twitter_archive dataset (built from working_dir)
twitter_archive_path = working_dir + 'twitter_archive_enhanced.csv'

# Reading the CSV file into a DataFrame
twitter_archive_df = pd.read_csv(twitter_archive_path)

In [None]:
# Checking columns in dataframe

twitter_archive_df.columns

Index(['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
       'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator',
       'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo'],
      dtype='object')

In [None]:
twitter_archive_df.shape

(2356, 17)

## 2. Use the Requests library to download the tweet image predictions (image_predictions.tsv)

In [None]:
# Defining path of folder
image_predictions_folder = '/content/drive/My Drive/Colab_Notebooks/ALX_2/image_predictions_folder'

# Creating directory if non-existent
if not os.path.exists(image_predictions_folder):
    os.makedirs(image_predictions_folder)

In [None]:
# Checking if directory was created
os.listdir(working_dir)

['wrangle_act.ipynb',
 'project2.ipynb',
 'twitter-archive-enhanced.csv',
 'image_predictions_folder',
 'tweet_json.txt']

In [None]:
# Checking contents of (created) directory
os.listdir(image_predictions_folder)

['image-predictions.tsv']

In [None]:
# Defining URL
url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"

# Using get method
image_predictions_resp = requests.get(url)

In [None]:
# Checking type of object returned
type(image_predictions_resp)

requests.models.Response

In [None]:
#print(image_predictions_resp.text)

#### Accessing Content and Writing to File

In [None]:
# Writing to a file
with open(os.path.join(image_predictions_folder, url.split('/')[-1]), mode = 'wb') as file:
    file.write(image_predictions_resp.content)

In [None]:
# Checking file was saved to disk
os.listdir(image_predictions_folder)

['image_predictions.tsv']
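
The download-and-save step above can be sketched as a small reusable helper. This is an illustration under stated assumptions, not the notebook's own code: `save_download` is a hypothetical name, the URL is a placeholder, and the bytes stand in for `image_predictions_resp.content`. In the notebook, calling `image_predictions_resp.raise_for_status()` before saving would stop an error page from being written to disk.

```python
import os
import tempfile
from urllib.parse import urlparse


def save_download(content: bytes, url: str, folder: str) -> str:
    """Write downloaded bytes to folder, named after the last URL path segment."""
    os.makedirs(folder, exist_ok=True)  # create the folder if it is missing
    filename = os.path.basename(urlparse(url).path)
    path = os.path.join(folder, filename)
    with open(path, mode="wb") as file:  # binary mode: content is raw bytes
        file.write(content)
    return path


# Demonstration with stand-in bytes and a placeholder URL (no network access).
with tempfile.TemporaryDirectory() as demo_dir:
    saved = save_download(
        b"tweet_id\tjpg_url\n",
        "https://example.com/data/image-predictions.tsv",
        demo_dir,
    )
    saved_name = os.path.basename(saved)
    with open(saved, "rb") as handle:
        saved_bytes = handle.read()

print(saved_name)
```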

#### Renaming file downloaded programmatically: image-predictions.tsv

In [None]:
# Defining file paths
old_name = working_dir + 'image_predictions_folder/image-predictions.tsv'  # filename with hyphen (-)
new_name = working_dir + 'image_predictions_folder/image_predictions.tsv'  # filename with underscore (_)

# Renaming the file
os.rename(old_name, new_name)

In [None]:
# Confirming file rename
os.listdir(image_predictions_folder)

['image_predictions.tsv']

## 3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)

### Defining Tweepy Credentials

In [None]:
# Credentials are read from environment variables so that the secrets
# are not stored in the notebook.
# NOTE: the variable names below are placeholders chosen for this notebook.

# Consumer (API) key authentication
consumer_key = os.environ['TWITTER_CONSUMER_KEY']
consumer_secret = os.environ['TWITTER_CONSUMER_SECRET']

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)

# Access key authentication
access_token = os.environ['TWITTER_ACCESS_TOKEN']
access_secret = os.environ['TWITTER_ACCESS_SECRET']

auth.set_access_token(access_token, access_secret)

# Set up the API with the authentication handler
api = tweepy.API(auth)

In [None]:
tweet_id_list = twitter_archive_df['tweet_id'].tolist()

In [None]:
len(tweet_id_list)

2356

In [None]:
# Creating list to hold Tweet JSON Objects
tweet_status_json_list = []

In [None]:
# Iterating to get extended tweet information for each tweet_id in tweet_id_list

for tweet_id in tweet_id_list:

  try:
    # Getting Status object from the API by tweet_id (extended mode returns the full tweet text)
    tweet_status = api.get_status(tweet_id, tweet_mode='extended')
  except tweepy.TweepError:
    # Skipping tweets that cannot be retrieved
    continue

  # Converting Status object to a JSON string
  json_str = json.dumps(tweet_status._json)

  # Appending JSON string to the list
  tweet_status_json_list.append(json_str)


In [None]:
# Checking length of the list of JSON strings
len(tweet_status_json_list)

3230
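
The list holds 3230 entries although only 2356 tweet IDs were queried. One possible cause, stated here as an assumption, is that entries were appended more than once, for example by re-running the loop without resetting the list. The sketch below keeps one entry per tweet ID and records the IDs that failed. `collect_statuses` and `fake_fetch` are hypothetical names; `fake_fetch` stands in for a call such as `api.get_status(...)._json` so the example runs without network access.

```python
import json
from typing import Callable, Dict, Iterable, List, Tuple


def collect_statuses(
    tweet_ids: Iterable[int],
    fetch: Callable[[int], dict],
) -> Tuple[List[str], List[int]]:
    """Fetch each ID once; return (JSON strings, IDs that could not be fetched)."""
    by_id: Dict[int, str] = {}  # keyed by tweet ID, so no ID is stored twice
    failed: List[int] = []
    for tweet_id in tweet_ids:
        if tweet_id in by_id:
            continue  # already collected
        try:
            status = fetch(tweet_id)
        except Exception:  # with tweepy, catch the library's own error class instead
            failed.append(tweet_id)
            continue
        by_id[tweet_id] = json.dumps(status)
    return list(by_id.values()), failed


def fake_fetch(tweet_id: int) -> dict:
    """Stand-in fetcher for demonstration only."""
    if tweet_id == 3:
        raise RuntimeError("tweet not available")
    return {"id": tweet_id, "retweet_count": 0}


lines, failed = collect_statuses([1, 2, 2, 3], fake_fetch)
print(len(lines), failed)
```

Recording the failed IDs, rather than silently skipping them, makes it possible to report how many tweets could not be retrieved.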

In [None]:
# Defining path of tweet_json.txt File

path = working_dir + 'tweet_json.txt'

In [None]:
# Saving List of JSON Objects to tweet_json.txt File

with open(path, "w") as fhandle:
  for line in tweet_status_json_list:
    fhandle.write(f'{line}\n')

## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issues**. You must use **both** visual assessment and programmatic assessment to assess the data.

**Note:** pay attention to the following key points when you assess the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.



### Reading Datasets Gathered

#### 1. Reading twitter_archive_enhanced.csv Dataset (File provided)

In [4]:
# Reading dataset: twitter_archive_enhanced.csv
twitter_archive_df = pd.read_csv(working_dir + 'twitter_archive_enhanced.csv')

In [5]:
# Checking columns in dataframe
twitter_archive_df.columns

Index(['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
       'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator',
       'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo'],
      dtype='object')

In [6]:
# Shape of Dataframe
twitter_archive_df.shape

(2356, 17)

In [9]:
# Checking datatypes of columns
twitter_archive_df.dtypes

tweet_id                        int64
in_reply_to_status_id         float64
in_reply_to_user_id           float64
timestamp                      object
source                         object
text                           object
retweeted_status_id           float64
retweeted_status_user_id      float64
retweeted_status_timestamp     object
expanded_urls                  object
rating_numerator                int64
rating_denominator              int64
name                           object
doggo                          object
floofer                        object
pupper                         object
puppo                          object
dtype: object

In [7]:
# Checking first 5 rows
twitter_archive_df.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


#### 2. Reading image_predictions.tsv Dataset (File Downloaded Programmatically)

In [10]:
# Reading dataset: image_predictions.tsv
image_pred_df = pd.read_csv(working_dir + 'image_predictions_folder/image_predictions.tsv', sep='\t', header=0)

In [11]:
# Checking columns in dataframe
image_pred_df.columns

Index(['tweet_id', 'jpg_url', 'img_num', 'p1', 'p1_conf', 'p1_dog', 'p2',
       'p2_conf', 'p2_dog', 'p3', 'p3_conf', 'p3_dog'],
      dtype='object')

In [12]:
# Shape of Dataframe
image_pred_df.shape

(2075, 12)

In [13]:
# Checking datatypes of columns
image_pred_df.dtypes

tweet_id      int64
jpg_url      object
img_num       int64
p1           object
p1_conf     float64
p1_dog         bool
p2           object
p2_conf     float64
p2_dog         bool
p3           object
p3_conf     float64
p3_dog         bool
dtype: object

In [None]:
# Checking first 5 rows
image_pred_df.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


#### 3. Reading tweet_json.txt Dataset (Dataset of tweets queried using Twitter API)

In [14]:
# Reading dataset: tweet_json.txt (one JSON object per line)
tweet_json_df = pd.read_json(working_dir + 'tweet_json.txt', lines=True)

In [15]:
# Checking columns in dataframe
tweet_json_df.columns

Index(['created_at', 'id', 'id_str', 'full_text', 'truncated',
       'display_text_range', 'entities', 'extended_entities', 'source',
       'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'is_quote_status', 'retweet_count', 'favorite_count',
       'favorited', 'retweeted', 'possibly_sensitive',
       'possibly_sensitive_appealable', 'lang', 'retweeted_status',
       'quoted_status_id', 'quoted_status_id_str', 'quoted_status_permalink',
       'quoted_status'],
      dtype='object')

In [16]:
# Shape of Dataframe
tweet_json_df.shape

(3230, 32)

In [17]:
# Checking datatypes of columns
tweet_json_df.dtypes

created_at                       datetime64[ns, UTC]
id                                             int64
id_str                                         int64
full_text                                     object
truncated                                       bool
display_text_range                            object
entities                                      object
extended_entities                             object
source                                        object
in_reply_to_status_id                        float64
in_reply_to_status_id_str                    float64
in_reply_to_user_id                          float64
in_reply_to_user_id_str                      float64
in_reply_to_screen_name                       object
user                                          object
geo                                          float64
coordinates                                  float64
place                                         object
contributors                                 f

In [None]:
# Checking first 5 rows
tweet_json_df.head()

Unnamed: 0,created_at,id,id_str,full_text,truncated,display_text_range,entities,extended_entities,source,in_reply_to_status_id,...,favorited,retweeted,possibly_sensitive,possibly_sensitive_appealable,lang,retweeted_status,quoted_status_id,quoted_status_id_str,quoted_status_permalink,quoted_status
0,2017-08-01 16:23:56+00:00,892420643555336193,892420643555336192,This is Phineas. He's a mystical boy. Only eve...,False,"[0, 85]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892420639486877696, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,0.0,0.0,en,,,,,
1,2017-08-01 00:17:27+00:00,892177421306343426,892177421306343424,This is Tilly. She's just checking pup on you....,False,"[0, 138]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892177413194625024, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,0.0,0.0,en,,,,,
2,2017-07-31 00:18:03+00:00,891815181378084864,891815181378084864,This is Archie. He is a rare Norwegian Pouncin...,False,"[0, 121]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891815175371796480, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,0.0,0.0,en,,,,,
3,2017-07-30 15:58:51+00:00,891689557279858688,891689557279858688,This is Darla. She commenced a snooze mid meal...,False,"[0, 79]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891689552724799489, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,0.0,0.0,en,,,,,
4,2017-07-29 16:00:24+00:00,891327558926688256,891327558926688256,This is Franklin. He would like you to stop ca...,False,"[0, 138]","{'hashtags': [{'text': 'BarkWeek', 'indices': ...","{'media': [{'id': 891327551943041024, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,False,False,0.0,0.0,en,,,,,
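
In the first two rows above, `id` and `id_str` differ in their last digits (for example ...193 versus ...192). One plausible explanation, offered as an assumption rather than a confirmed diagnosis, is that the string IDs passed through a 64-bit float during parsing. Floats of this magnitude cannot represent every integer, as this short check shows:

```python
tweet_id = 892420643555336193        # `id` value from the first row above
via_float = int(float(tweet_id))     # round trip through a 64-bit float
print(via_float)                     # 892420643555336192, matching `id_str` in that row
print(str(tweet_id))                 # keeping the ID as text preserves every digit
```

If that is what happened here, the `id` column is the safer join key than `id_str`.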


#### Checking Tweets Gathered Beyond August 1st, 2017

In [21]:
max(tweet_json_df['created_at'])

Timestamp('2017-08-01 16:23:56+0000', tz='UTC')

In [22]:
max(twitter_archive_df['timestamp'])

'2017-08-01 16:23:56 +0000'
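
The `max` of the archive's `timestamp` column above is a string comparison. It happens to give the right answer because the format sorts lexicographically, but parsing to datetimes makes the check explicit. A minimal sketch on two stand-in values copied from the archive output; the `cutoff` variable is an assumption introduced for illustration.

```python
import pandas as pd

# Stand-in for twitter_archive_df['timestamp'] (values copied from the head() output).
timestamps = pd.Series(["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000"])

# Parse the strings into timezone-aware datetimes (UTC).
parsed = pd.to_datetime(timestamps, utc=True)
latest = parsed.max()

# Anything at or after the start of August 2nd would be "beyond August 1st, 2017".
cutoff = pd.Timestamp("2017-08-02", tz="UTC")
print(latest, latest < cutoff)
```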

#### Checking if there are duplicates by ID

In [41]:
# counting unique values by 'tweet_id': twitter_archive dataset
n = len(pd.unique(twitter_archive_df['tweet_id']))
  
print("Number of unique values :", n)

Number of unique values : 2356


* There are no duplicates by tweet_id in the dataset (number of unique values (2356) corresponds to number of records (2356) in the dataset)

In [42]:
# counting unique values by 'tweet_id': image_pred dataset
n = len(pd.unique(image_pred_df['tweet_id']))
  
print("Number of unique values :", n)

Number of unique values : 2075


* There are no duplicates by tweet_id in the dataset (number of unique values (2075) corresponds to number of records (2075) in the dataset)

In [43]:
# counting unique values by 'id': tweet_json dataset
n = len(pd.unique(tweet_json_df['id']))
  
print("Number of unique values :", n)

Number of unique values : 874


* There are many duplicates by id in the dataset (number of unique values (874) is far lower than number of records (3230) in the dataset)

In [44]:
# Checking counts of unique values in tweet_json dataset
item_counts = tweet_json_df['id'].value_counts()
item_counts

758828659922702336    1458
852189679701164033       4
812781120811126785       3
872486979161796608       3
872820683541237760       3
                      ... 
834458053273591808       2
834209720923721728       2
834167344700198914       2
834089966724603904       2
716439118184652801       1
Name: id, Length: 874, dtype: int64
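
The duplicate check above can be paired with a de-duplication step. The sketch below runs on a small stand-in frame, not on `tweet_json_df`, and keeps the first row for each `id`; whether "first" is the right row to keep for the real data is an assumption that should be verified.

```python
import pandas as pd

# Stand-in data with repeated ids (the values are illustrative only).
demo = pd.DataFrame(
    {"id": [10, 10, 10, 11, 12, 12], "favorite_count": [5, 5, 5, 7, 9, 9]}
)

n_rows = len(demo)
n_unique = demo["id"].nunique()                         # distinct ids
n_duplicate_rows = int(demo["id"].duplicated().sum())   # rows repeating an earlier id

# Keep one row per id and rebuild a clean 0..n-1 index.
deduped = demo.drop_duplicates(subset="id", keep="first").reset_index(drop=True)
print(n_rows, n_unique, n_duplicate_rows, len(deduped))
```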

### Quality issues
1. The identifier (ID) column in the tweet_json dataset is named id, whereas the twitter_archive and image_pred datasets both name it tweet_id

2. The time column is named differently in the twitter_archive dataset (timestamp) and the tweet_json dataset (created_at)

3. The timestamp column in the twitter_archive dataset has datatype "object", while the created_at column in the tweet_json dataset has datatype "datetime64"

4. The "id" column in the tweet_json dataset contains many duplicated values (874 unique values across 3230 records)

5. The twitter_archive dataset includes retweets and tweets without images; only original ratings (no retweets) that have images are wanted

6.

7.

8.
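
Issue 5 above can be checked programmatically. A minimal sketch on stand-in frames, assuming that a non-null `retweeted_status_id` marks a retweet and that presence in the image predictions table means the tweet has an image:

```python
import numpy as np
import pandas as pd

# Stand-in frames using the real column names; the values are illustrative.
archive = pd.DataFrame(
    {"tweet_id": [1, 2, 3], "retweeted_status_id": [np.nan, 9.0, np.nan]}
)
images = pd.DataFrame({"tweet_id": [1, 2]})

# Original tweets have no retweeted_status_id.
originals = archive[archive["retweeted_status_id"].isna()]

# Keep only tweets that also appear in the image predictions.
with_images = originals[originals["tweet_id"].isin(images["tweet_id"])]
print(with_images["tweet_id"].tolist())
```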

### Tidiness issues
1.

2.

## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).
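
The merge mentioned in the note could look roughly like the sketch below. It uses stand-in frames, renames `id` to `tweet_id` so that all three tables share a key, and uses inner joins. The inner join is an assumption made for illustration: it keeps only tweets present in every table.

```python
import pandas as pd

# Stand-in frames mirroring the three datasets (values are illustrative).
archive = pd.DataFrame({"tweet_id": [1, 2, 3], "rating_numerator": [13, 12, 11]})
tweets = pd.DataFrame(
    {"id": [1, 2], "retweet_count": [100, 50], "favorite_count": [300, 200]}
)
images = pd.DataFrame({"tweet_id": [1, 3], "p1": ["redbone", "collie"]})

# Align the key name with the other two tables.
tweets = tweets.rename(columns={"id": "tweet_id"})

# Combine the three tables on the shared key.
master = archive.merge(tweets, on="tweet_id", how="inner").merge(
    images, on="tweet_id", how="inner"
)
print(master)
```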

In [None]:
# Make copies of original pieces of data


### Issue #1:

#### Define

#### Code

#### Test

### Issue #2:

#### Define

#### Code

#### Test

## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".
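
A minimal sketch of the storing step on a stand-in frame; `master` is a hypothetical name for the cleaned dataset, and a temporary directory stands in for the working directory. Passing `index=False` prevents the row index from being written as an extra column, which would otherwise reappear as `Unnamed: 0` when the file is read back.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the cleaned master DataFrame.
master = pd.DataFrame({"tweet_id": [892420643555336193], "rating_numerator": [13]})

with tempfile.TemporaryDirectory() as demo_dir:
    out_path = os.path.join(demo_dir, "twitter_archive_master.csv")
    master.to_csv(out_path, index=False)   # do not write the row index
    restored = pd.read_csv(out_path)       # read back to confirm the round trip

print(list(restored.columns))
```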

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

### Insights:
1.

2.

3.

### Visualization