# Project: Wrangle and analyze WeRateDogs Twitter data

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#Gathering">Gathering Data</a></li>
<li><a href="#Assessing">Assessing Data</a></li>
<li><a href="#Cleaning">Cleaning Data</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

This notebook focuses on data wrangling, which consists of three steps:
- Gathering 
- Assessing
- Cleaning

After wrangling the data thoroughly, we will perform a brief analysis.

The data comes from Twitter and is drawn from multiple sources: a manually downloaded file, a file hosted on an online server, and data from Twitter's API. Wrangling it takes a fair amount of processing. In the final steps, we will look at tweets from the popular WeRateDogs account and derive trends from them.

#### Set up the environment

In [1]:
# Import all the libraries used in this python notebook for the following analysis
import pandas as pd
import numpy as np
import requests
import tweepy
import os
import timeit
import json

<a id='Gathering'></a>
## Gathering Data

In this part, we will gather data from 3 different sources in 3 different formats:
- *A downloadable CSV file*: the Twitter enhanced archive data
- *A TSV file hosted on an online server*: image prediction data based on tweets from the archive
- *JSON data from an API, saved to a txt file*: additional data queried from the API for tweets in the archive (our main focus is the retweet count and favorite count)

Each dataset will then be loaded into a pandas DataFrame.

### Dataset number 1 : downloadable CSV file source

The first dataset to be used for the following analysis is a *Twitter enhanced archive data file from the WeRateDogs Twitter profile*. It is saved in CSV format in the working folder.

#### Step 1 (and only) : load data into pandas dataframe

In [2]:
# Create a dataframe and View Twitter enhanced archive dataset using pandas

df_twit_archive = pd.read_csv('twitter-archive-enhanced.csv')
df_twit_archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


> *this file was manually downloaded from Udacity's platform as part of the project materials*

### Dataset number 2 : online server source 

The second dataset to be used for the following analysis is a *tweet image prediction TSV file* hosted on Udacity servers

#### Step 1 : Download file programmatically

In [3]:
# Programmatically download the tweet image prediction file using the Requests library

url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'

if os.path.exists('tweet_image_pred.tsv'):
    # Skip the download if the file is already there, so re-running the cell is harmless
    print('file exists')
else:
    r = requests.get(url)
    r.raise_for_status() # stop here rather than save an error page as the TSV file
    with open('tweet_image_pred.tsv', mode='wb') as file:
        file.write(r.content)
    print('file created')


file exists
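
The download-if-missing pattern above can be wrapped in a small helper. The following is a minimal sketch, not the notebook's actual code: `download_if_missing` and its `fetch` parameter are hypothetical names, the URL and file content are made up, and the fetch step is passed in as a callable so the logic can be exercised without network access.

```python
import os
import tempfile


def download_if_missing(url, path, fetch):
    """Save fetch(url) to path unless path already exists.

    `fetch` is any callable that takes a URL and returns bytes.
    Returns True if the file was written, False if it was already there.
    """
    if os.path.exists(path):
        return False
    content = fetch(url)
    with open(path, "wb") as file:
        file.write(content)
    return True


# Stand-in fetcher returning made-up bytes (no network access needed)
def fake_fetch(url):
    return b"tweet_id\tjpg_url\n"


demo_path = os.path.join(tempfile.mkdtemp(), "demo.tsv")
first_call = download_if_missing("https://example.com/demo.tsv", demo_path, fake_fetch)
second_call = download_if_missing("https://example.com/demo.tsv", demo_path, fake_fetch)
print(first_call, second_call)
```

In the notebook, `fetch` could be something like `lambda u: requests.get(u).content`, ideally after checking the response status.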


#### Step 2 : load data into pandas dataframe

In [4]:
# Create a dataframe and View Tweet image prediction dataset using pandas

df_image_pred = pd.read_csv('tweet_image_pred.tsv', sep="\t")
df_image_pred.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


### Dataset number 3 : API source

The third and last dataset to be used for the following analysis is a *tweet 'retweet count' and 'favorite count' (Likes) dataset* in JSON format, retrieved by querying the Twitter API.

#### Step 1 : Authentication

In [None]:
# Authenticate to access API data

consumer_key = 'YOUR CONSUMER KEY'
consumer_secret = 'YOUR CONSUMER SECRET'
access_token = 'YOUR ACCESS TOKEN'
access_secret = 'YOUR ACCESS SECRET'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True , parser=tweepy.parsers.JSONParser())

# The JSON parser makes the API status call return the response as a JSON object,
# which makes it easier to read the retweet_count and favorite_count values from it later
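
One way to keep real keys out of the notebook is to read them from environment variables. This is a minimal sketch under stated assumptions: `load_credentials` is a hypothetical helper, and the four environment variable names are invented for illustration rather than required by Twitter or tweepy.

```python
import os


def load_credentials(env=None):
    """Collect the four credentials from a mapping (os.environ by default).

    The environment variable names are assumptions made for this sketch.
    Raises KeyError naming the first variable that is missing.
    """
    if env is None:
        env = os.environ
    names = {
        "consumer_key": "TWITTER_CONSUMER_KEY",
        "consumer_secret": "TWITTER_CONSUMER_SECRET",
        "access_token": "TWITTER_ACCESS_TOKEN",
        "access_secret": "TWITTER_ACCESS_SECRET",
    }
    return {key: env[variable] for key, variable in names.items()}


# Demonstration with a made-up mapping instead of the real environment
demo_env = {
    "TWITTER_CONSUMER_KEY": "ck",
    "TWITTER_CONSUMER_SECRET": "cs",
    "TWITTER_ACCESS_TOKEN": "at",
    "TWITTER_ACCESS_SECRET": "as",
}
creds = load_credentials(demo_env)
print(sorted(creds))
```

The returned values could then be passed to the authentication calls in place of the placeholder strings.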

> **Some useful links**
>
> - Tweepy code documentation: [here](https://buildmedia.readthedocs.org/media/pdf/tweepy/latest/tweepy.pdf)
> - My link to the Twitter developer page: [here](https://developer.twitter.com/en/apps/17388315)
> - Twitter WeRateDogs page: [here](https://twitter.com/dog_rates)
> - Twitter API, getting tweets with a specific id (Stack Overflow): [here](https://stackoverflow.com/questions/28384588/twitter-api-get-tweets-with-specific-id)
> - Converting a tweepy status object into JSON (Stack Overflow): [here](https://stackoverflow.com/questions/27900451/convert-tweepy-status-object-into-json)

#### Step 2 : Load API JSON data into txt file

In [None]:
# Start timer for the following operation
start = timeit.default_timer() # clock reading; timeit.timeit() would benchmark an empty statement instead
print("Start timer")

# List for tweet ids in the archive that were not found via the API
tweet_id_without_record = []
# Create the "tweet_json.txt" file, or empty it if it already exists, before the loop starts
open("tweet_json.txt", 'w').close()

# Loop to build a txt file where each line represents a tweet from the WeRateDogs Twitter archive in JSON format
for tweet_id in df_twit_archive.tweet_id:
    try:
        tweet = api.get_status(tweet_id, tweet_mode='extended') # gets tweet status in JSON format
        tweet_str = json.dumps(tweet) # tweet JSON serialized into a str

        # append each tweet status from the loop to the tweet_json.txt file
        with open("tweet_json.txt", "a") as file:
            file.write(tweet_str + '\n') # '\n' puts each tweet on its own line
        print(tweet['id']) # print tweet id to check loop advancement
    # keep a record of the tweet id if the tweet can't be retrieved via the API
    except Exception as e:
        print(str(tweet_id) + " error tweet not found: " + str(e))
        tweet_id_without_record.append(tweet_id)

# End the timer after the loop is complete
end = timeit.default_timer()
print("End timer and time to process in seconds:")
print(end - start)


> The operation above took more than one hour. Expect a similar wait if you re-run it.
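
The gathering loop follows a general pattern: request each id, append the result as one JSON line, and record the ids that fail. Below is a minimal sketch of that pattern with a stand-in for the API call, so it runs offline. `collect_json_lines` and `fake_get_status` are hypothetical names and the tweet data is made up. The elapsed time is measured with `timeit.default_timer()`, which returns a clock reading; calling `timeit.timeit()` with no arguments benchmarks an empty statement and does not give a timestamp.

```python
import json
import os
import tempfile
import timeit


def collect_json_lines(ids, get_status, path):
    """Write one JSON document per line for each id; return the ids that failed."""
    failed = []
    with open(path, "w") as file:
        for item_id in ids:
            try:
                record = get_status(item_id)
            except Exception:
                # Remember the id and move on to the next one
                failed.append(item_id)
                continue
            file.write(json.dumps(record) + "\n")
    return failed


# Stand-in for the API call: id 2 is treated as a tweet that cannot be found
def fake_get_status(tweet_id):
    if tweet_id == 2:
        raise LookupError("tweet not found")
    return {"id": tweet_id, "retweet_count": 0, "favorite_count": 0}


demo_path = os.path.join(tempfile.mkdtemp(), "demo_json.txt")
start = timeit.default_timer()
failed_ids = collect_json_lines([1, 2, 3], fake_get_status, demo_path)
elapsed = timeit.default_timer() - start

print(failed_ids)
print("seconds elapsed:", elapsed)
```

Opening the file once around the loop, rather than once per tweet, is a design choice of this sketch; the trade-off is that lines may sit in a buffer until the file is closed.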

#### Step 3 : Perform multiple checks on the txt file to verify it looks as expected

In [5]:
# Check what the file looks like by printing its first 10,000 characters

with open("tweet_json.txt","r") as file:
    print(file.read(10000))

{"created_at": "Tue Aug 01 16:23:56 +0000 2017", "id": 892420643555336193, "id_str": "892420643555336193", "full_text": "This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU", "truncated": false, "display_text_range": [0, 85], "entities": {"hashtags": [], "symbols": [], "user_mentions": [], "urls": [], "media": [{"id": 892420639486877696, "id_str": "892420639486877696", "indices": [86, 109], "media_url": "http://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg", "media_url_https": "https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg", "url": "https://t.co/MgUWQ76dJU", "display_url": "pic.twitter.com/MgUWQ76dJU", "expanded_url": "https://twitter.com/dog_rates/status/892420643555336193/photo/1", "type": "photo", "sizes": {"thumb": {"w": 150, "h": 150, "resize": "crop"}, "medium": {"w": 540, "h": 528, "resize": "fit"}, "small": {"w": 540, "h": 528, "resize": "fit"}, "large": {"w": 540, "h": 528, "resize": "fit"}}, "features": {"orig": {"faces": 

In [6]:
# Count lines in JSON file - 1 line is one tweet
with open("tweet_json.txt") as file:
    num_lines = sum(1 for line in file)
print(num_lines)

2331


In [9]:
# Count tweets in tweet ids archive
num_tweets = df_twit_archive.tweet_id.count()
num_tweets

2356

In [109]:
# Count tweets that were not found via the API
num_tweet_id_without_record = len(tweet_id_without_record)
num_tweet_id_without_record

25

In [110]:
# Check if the number of lines in txt JSON file is correct based on the archive of tweets we queried
num_lines == num_tweets - num_tweet_id_without_record

True

> OK, the number of lines in the tweet_json.txt file matches the number of rows in the Twitter archive dataset minus the tweet ids that could not be found via the API. It looks like we can go on and use this file to build the dataframe containing:
> - tweet_id
> - retweet_count
> - favorite_count
>
> Each of these values is found on every line of the txt file, since 1 line represents 1 tweet.
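
Beyond counting lines, it can be worth confirming that every line parses as JSON and carries the three fields needed later. The sketch below is one possible check, run on a small made-up file rather than tweet_json.txt; `check_json_lines` is a hypothetical helper.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("id", "retweet_count", "favorite_count")


def check_json_lines(path, required=REQUIRED_KEYS):
    """Return (number of valid lines, 1-based line numbers with problems)."""
    valid, bad = 0, []
    with open(path) as file:
        for number, line in enumerate(file, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                bad.append(number)  # line is not valid JSON
                continue
            if isinstance(record, dict) and all(key in record for key in required):
                valid += 1
            else:
                bad.append(number)  # valid JSON, but a required field is missing
    return valid, bad


# Made-up file: one good line, one missing a field, one that is not JSON
demo_path = os.path.join(tempfile.mkdtemp(), "demo_json.txt")
with open(demo_path, "w") as file:
    file.write(json.dumps({"id": 1, "retweet_count": 2, "favorite_count": 3}) + "\n")
    file.write(json.dumps({"id": 2, "retweet_count": 5}) + "\n")
    file.write("not json\n")

valid_count, bad_lines = check_json_lines(demo_path)
print(valid_count, bad_lines)
```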

#### Step 4 : Gather the selected data from the txt file in a list

In [10]:
tweet_API_list = [] # lines read from the tweet_json.txt file
tweet_API_list_for_df = [] # list of dictionaries that will be used to build the final dataframe

# Open the txt file in read mode and put each line of the txt file in a list
with open("tweet_json.txt","r") as f:
    tweet_API_list = f.readlines() # readlines() returns a list of items, each item is a line in the tweet_json.txt file

# Loop to retrieve values for the dataframe for each tweet
for tweet_line in tweet_API_list:
    tweet = json.loads(tweet_line) # JSON representing one tweet
    tweet_id = tweet['id'] # get the tweet id
    retweet_count = tweet['retweet_count'] # get the retweet count
    favorite_count = tweet['favorite_count'] # get the favorite count

    # Append to list of dictionaries
    tweet_API_list_for_df.append({'tweet_id': tweet_id,
                                  'retweet_count': retweet_count,
                                  'favorite_count': favorite_count})

    

#### Step 5 : load data into pandas dataframe

In [11]:
# Create DataFrame from list of dictionaries
df_tweet_API = pd.DataFrame(tweet_API_list_for_df, columns = ['tweet_id', 'retweet_count', 'favorite_count'])

df_tweet_API

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,7802,36638
1,892177421306343426,5784,31572
2,891815181378084864,3826,23782
3,891689557279858688,7973,39967
4,891327558926688256,8609,38166
...,...,...,...
2326,666049248165822465,41,99
2327,666044226329800704,133,273
2328,666033412701032449,41,115
2329,666029285002620928,43,121


> It looks like our dataframe also has the correct number of rows. As expected, it has the same number of rows as the JSON txt file has lines.

<a id='Assessing'></a>
## Assessing Data
- Detect and document at least eight (8) quality issues and two (2) tidiness issues

Following these requirements: 

- You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
- Cleaning includes merging individual pieces of data according to the rules of tidy data.
- The fact that the rating numerators are greater than the denominators does not need to be cleaned. This unique rating system is a big part of the popularity of WeRateDogs.
- You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.
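
The first requirement (original ratings only, with images) can be expressed as a filter. The sketch below uses tiny made-up frames with the same column names as the archive and image prediction tables. Two assumptions are made about how the requirement maps onto the columns: a non-null `retweeted_status_id` marks a retweet, and a tweet "has an image" if its `tweet_id` appears in the image prediction table.

```python
import pandas as pd

# Made-up miniature versions of the archive and image prediction tables
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "retweeted_status_id": [None, 10.0, None, None],
})
images = pd.DataFrame({"tweet_id": [1, 2, 4]})

is_original = archive["retweeted_status_id"].isnull()     # not a retweet
has_image = archive["tweet_id"].isin(images["tweet_id"])  # has an image prediction row
originals_with_images = archive[is_original & has_image]

print(originals_with_images["tweet_id"].tolist())
```

Here tweet 2 is dropped as a retweet and tweet 3 is dropped for having no image.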

In [12]:
df_twit_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), object(10)

In [13]:
df_image_pred.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [14]:
df_tweet_API.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2331 entries, 0 to 2330
Data columns (total 3 columns):
tweet_id          2331 non-null int64
retweet_count     2331 non-null int64
favorite_count    2331 non-null int64
dtypes: int64(3)
memory usage: 54.8 KB
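
`info()` shows column types and non-null counts; a few further one-liners can help surface quality issues programmatically. The following is a minimal sketch on made-up data that borrows column names from the archive table. The specific checks (duplicate ids, null counts, denominators other than 10) are illustrative choices, not findings about the real dataset.

```python
import pandas as pd

# Made-up sample with archive-style column names
sample = pd.DataFrame({
    "tweet_id": [1, 2, 2, 3],
    "expanded_urls": ["u1", None, "u2", "u3"],
    "rating_denominator": [10, 10, 10, 50],
})

duplicate_ids = int(sample["tweet_id"].duplicated().sum())  # repeated tweet ids
null_counts = sample.isnull().sum()                         # missing values per column
unusual = sample[sample["rating_denominator"] != 10]        # rows worth a closer look

print(duplicate_ids)
print(null_counts)
print(unusual["tweet_id"].tolist())
```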


<a id='Cleaning'></a>
## Cleaning Data
- Clean each of the issues you documented while assessing.
- The result should be a high quality and tidy master pandas DataFrame
- Store the clean DataFrame(s) in a CSV file with the main one named twitter_archive_master.csv. If additional files exist because multiple tables are required for tidiness, name these files appropriately. Additionally, you may store the cleaned data in a SQLite database (which is to be submitted as well if you do).
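
Since all three tables gathered earlier share a `tweet_id` column, one way to build the master table is to merge on that key and write the result to CSV. This is a sketch on made-up miniature tables, not the final cleaning code: the use of inner joins (keeping only tweets present in every table) is an assumption, and the file is written to a temporary folder here rather than the project folder.

```python
import os
import tempfile

import pandas as pd

# Made-up miniature versions of the three gathered tables
archive = pd.DataFrame({"tweet_id": [1, 2, 3], "rating_numerator": [13, 12, 11]})
images = pd.DataFrame({"tweet_id": [1, 3], "p1": ["collie", "redbone"]})
counts = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweet_count": [5, 6, 7],
    "favorite_count": [50, 60, 70],
})

# Inner joins keep only tweets that appear in all three tables
master = (archive
          .merge(images, on="tweet_id", how="inner")
          .merge(counts, on="tweet_id", how="inner"))

out_path = os.path.join(tempfile.mkdtemp(), "twitter_archive_master.csv")
master.to_csv(out_path, index=False)  # index=False avoids an extra unnamed column

reloaded = pd.read_csv(out_path)
print(reloaded.shape)
```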

<a id='eda'></a>
## Exploratory Data Analysis
- At least three (3) insights and one (1) visualization must be produced.

<a id='conclusions'></a>
## Conclusions



## Reporting for this Project
- Create a 300-600 word written report called wrangle_report.pdf or wrangle_report.html that briefly describes your wrangling efforts. This is to be framed as an internal document.
- Create a 250-word-minimum written report called act_report.pdf or act_report.html that communicates the insights and displays the visualization(s) produced from your wrangled data. This is to be framed as an external document, like a blog post or magazine article, for example.