In [1]:
# PROJECT 4 - Wrangle Twitter data via API

## Table of Contents
* [Introduction](#intro)
* [Initial Brief](#1.1-initial-brief)
* [General Outline](#general_outline)
* [Import Libraries](#)

`Note: Fill at the end. Automate with python library/extension.`

# Introduction
This project gathers readily available data from an existing web source to provide first-hand experience of data wrangling.<br>
Wrangling is a significant task because data is not always provided, and when it is: <br>
 - Best case: it contains spelling mistakes or similar minor errors,
 - Worst case: it has no schema or format, and contains duplicate, incomplete and/or incorrect values.

## Initial Brief
- The user has provided archived Twitter data for analysis
 - [ ] Twitter archive export in CSV
 - [ ] URL to Machine Learning image predictions
<br>
- Identify minimum:
 - [ ] 8 quality issues
 - [ ] 2 tidiness issues
<br>
- Out of scope:
 - [ ] Unique rating system
 - [ ] No gathering required past 01 Aug 2017

## General outline
- [ ] Read-in CSV data
- [ ] Access URL data (_rather than manually downloading the file_)

In [2]:
## install modules via terminal
#pip install pandas # also downloads numpy
#pip install requests
#pip install tweepy

## Optional - provides TOC
#pip install jupyter_contrib_nbextensions

## Import Libraries

In [3]:
import pandas as pd
import numpy as np

import requests
import os

import sys

import tweepy

## Defined Functions

- add_files(*filename)        `Created for the ability to scale`
- get_values(df, col, name)   `Created to flag duplicate values in a column`
- go_assess(df)               `Created to repeat the assessment steps for every column`

In [4]:
filelist = []  # tracks every file name registered for import
print('{} Files in list'.format(len(filelist)))  # initial print

# Adds one or more file names to filelist and reports the running total
def add_files(*filename):  # PARAMETERS: one or more <string> file names
    file = None  # returned as-is when no file names are passed
    for file in filename:
        filelist.append(file)
        print('{} added to file list.'.format(file))

    if len(filelist) > 1:
        print('{} files now in list.'.format(len(filelist)))
    else:
        print('{} file now in list.'.format(len(filelist)))
    return file  # last file name added

0 Files in list


In [5]:
def get_values(df, col, name):  # df is unused; kept so existing calls still work
    export = []
    value_cnt = col.value_counts()  # occurrences of each distinct non-null value
    value = value_cnt.values
    # Duplicate test: without duplicates, the total count equals the number of distinct values
    if value.sum() > value.shape[0]:  # there are duplicates
        txt_result = 'Duplicates found in column \'{}\', the max duplicate item repeats {} times.'.format(name, value.max())
    else:  # no duplicates
        txt_result = 'No duplicates found in column \'{}\'.'.format(name)
    # pack the value counts and the summary text into a list
    export.append(value_cnt)
    export.append(txt_result)
    
    return export
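
The duplicate test used in `get_values` can be tried on a small column. The sketch below uses made-up values (not taken from the archive) and also shows what is likely an equivalent check with `count()` and `nunique()`, plus one way to list the rows involved.

```python
import pandas as pd

# Toy column, made up for illustration (not from the archive).
ids = pd.Series([101, 102, 102, 103, None], name='tweet_id')

# Same test as get_values: value_counts() ignores missing values by default.
counts = ids.value_counts()
has_duplicates = counts.sum() > counts.shape[0]

# Equivalent check without building the full value_counts table.
has_duplicates_alt = ids.count() > ids.nunique()

# Rows that belong to a duplicate group (keep=False marks every member).
duplicate_rows = ids[ids.duplicated(keep=False) & ids.notna()]

print(has_duplicates, has_duplicates_alt)
print(duplicate_rows.tolist())
```

The `duplicated(keep=False)` mask is useful when the offending rows need to be inspected, not only counted.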

In [6]:
def go_assess(df):
    # lists are rebuilt on every call so results do not accumulate across calls
    results = []     # value_counts per column
    summary = []     # duplicate summary text per column
    assessment = []  # [summary, results], returned to the caller
    print('Dataframe contains the following columns:')
    print('{}\n'.format(df.columns) )

    for i, col in enumerate(df.columns):
        # copy into message
        print('Column {} - \'{}\' has been assessed. Assessment saved in results[{}] and summary[{}]'.format(i, col, i, i))
        
        # call and get results
        val_sum = get_values(df, df[col], col)

        # append results
        summary.append(val_sum[1])
        results.append(val_sum[0])

    assessment.append(summary)
    assessment.append(results)
    print('NOTE: To access variables, set a series name e.g below:\nseries[0][x] to access summary details.\nseries[1][x] to access the value_counts results.\nx represents column number')
    return assessment #

In [7]:
## BLANK

## Data Wrangling

## Iteration 1
Import data from a twitter user archive provided by the end-user

`Note: Add edit# upon addition of new issue.`

### Gathering 1
#### Initialize
Enter Known Input Info
Format: file name inside ''

In [8]:
# FILE 1 - TWITTER ARCHIVE DATA
folder = 'Incoming Files/'
twitter_file = 'twitter-archive-enhanced-2.csv'
add_files(twitter_file)

twitter-archive-enhanced-2.csv added to file list.
1 file now in list.


'twitter-archive-enhanced-2.csv'

In [9]:
# FILE 2 - TWITTER ML IMAGE PREDICTIONS
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'

# request the file and stop early if the download failed
response = requests.get(url)
response.raise_for_status()

# the file name is the text after the last '/' in the URL
image_predictions = url.split('/')[-1]

# create the target folder if it does not exist yet
os.makedirs(folder, exist_ok=True)

# 'with open' closes the file automatically when the block completes
with open(os.path.join(folder, image_predictions), mode='wb') as file:
    # write the downloaded bytes to disk
    file.write(response.content)
    print('{} has been saved in: "/{}"'.format(image_predictions, folder) )

# call function and add name to end of list
add_files(image_predictions)

image-predictions.tsv has been saved in: "/Incoming Files/"
image-predictions.tsv added to file list.
2 files now in list.


'image-predictions.tsv'

#### Import into dataframes

In [10]:
# create empty lists
df_raw = []
file_extensions = []

# dataframe to contain original imports
for num, file in enumerate(filelist):
    ext = file.split('.')[-1]
    file_extensions.append(ext)
    # choose the reader from the extension; only CSV and TSV are handled here
    if ext == 'csv':
        df_raw.append(pd.read_csv(folder + file) )
    elif ext == 'tsv':
        df_raw.append(pd.read_csv(folder + file, sep='\t') )
    else:
        print('filelist({}) - "{}", could not be read into a dataframe.'.format(num, filelist[num]) )
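
One possible way to extend the extension check to further formats is a dictionary that maps each extension to a reader. This is only a sketch: `READERS` and `read_any` are hypothetical names, and the `json` entry assumes line-delimited JSON (one object per line). The demo files are throwaway files with made-up contents.

```python
import os
import tempfile
import pandas as pd

# Map extension -> reader. The 'json' entry assumes line-delimited JSON.
READERS = {
    'csv': lambda path: pd.read_csv(path),
    'tsv': lambda path: pd.read_csv(path, sep='\t'),
    'json': lambda path: pd.read_json(path, lines=True),
}

def read_any(path):
    """Return a DataFrame for a supported extension, otherwise None."""
    ext = os.path.splitext(path)[1].lstrip('.').lower()
    reader = READERS.get(ext)
    if reader is None:
        print('"{}" could not be read into a dataframe.'.format(path))
        return None
    return reader(path)

# Tiny demo with temporary files (contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'demo.csv')
    tsv_path = os.path.join(tmp, 'demo.tsv')
    with open(csv_path, 'w') as f:
        f.write('a,b\n1,2\n')
    with open(tsv_path, 'w') as f:
        f.write('a\tb\n3\t4\n')
    demo_csv = read_any(csv_path)
    demo_tsv = read_any(tsv_path)
    demo_other = read_any(os.path.join(tmp, 'demo.xlsx'))  # unsupported
```

Adding a format then means adding one dictionary entry instead of another `elif` branch.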

In [11]:
print(filelist)

['twitter-archive-enhanced-2.csv', 'image-predictions.tsv']


In [12]:
df_raw[0].sample(3)  # visually assess file was read in correctly

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
83,876537666061221889,,,2017-06-18 20:30:39 +0000,"<a href=""http://twitter.com/download/iphone"" r...",I can say with the pupmost confidence that the...,,,,https://twitter.com/mpstowerham/status/8761629...,14,10,,,,,
149,863079547188785154,6.671522e+17,4196984000.0,2017-05-12 17:12:53 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Ladies and gentlemen... I found Pipsy. He may ...,,,,https://twitter.com/dog_rates/status/863079547...,14,10,,,,,
2209,668623201287675904,,,2015-11-23 02:52:48 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Jomathan. He is not thrilled about the...,,,,https://twitter.com/dog_rates/status/668623201...,10,10,Jomathan,,,,


In [13]:
df_twitter = df_raw[0].copy()

In [14]:
df_raw[1].sample(3)  # visually assess file was read in correctly

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1848,839549326359670784,https://pbs.twimg.com/media/C6atpTLWYAIL7bU.jpg,1,swing,0.393527,False,Norwich_terrier,0.05248,True,Pembroke,0.049901,True
232,670417414769758208,https://pbs.twimg.com/media/CU3NE8EWUAEVdPD.jpg,1,sea_urchin,0.493257,False,porcupine,0.460565,False,cardoon,0.008146,False
902,700029284593901568,https://pbs.twimg.com/media/CbcA673XIAAsytQ.jpg,1,West_Highland_white_terrier,0.726571,True,Maltese_dog,0.176828,True,Dandie_Dinmont,0.070134,True


In [15]:
df_image_predictor = df_raw[1].copy() # create copy

### Gathering 2

In [16]:
import tweepy
from tweepy import OAuthHandler
import json
from timeit import default_timer as timer

# NOTE: consumer_key, consumer_secret, access_token and access_secret are not
# defined anywhere in this notebook; they must be set before running this cell.
auth = OAuthHandler(consumer_key.strip(), consumer_secret.strip())
auth.set_access_token(access_token.strip(), access_secret.strip())

api = tweepy.API(auth, wait_on_rate_limit=True)

# Tweepy 4 renamed TweepError to TweepyException; use whichever this version provides
TweepError = getattr(tweepy, 'TweepyException', None) or tweepy.TweepError

tweet_ids = df_twitter.tweet_id.values

# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
count = 0
fails_dict = {}
start = timer()
# Save each tweet's returned JSON as a new line in a .txt file
with open('tweet_json.txt', 'w') as outfile:
    # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
    for tweet_id in tweet_ids:
        count += 1
        print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            print("Success")
            json.dump(tweet._json, outfile)
            outfile.write('\n')
        except TweepError as e:
            print("Fail")
            fails_dict[tweet_id] = e
end = timer()
print(end - start)  # elapsed time in seconds
print(fails_dict)

1: 892420643555336193
Success
2: 892177421306343426
Success
3: 891815181378084864
Success
4: 891689557279858688
Success
5: 891327558926688256
Success
6: 891087950875897856
Success
7: 890971913173991426
Success
8: 890729181411237888
Success
9: 890609185150312448


KeyboardInterrupt: 

In [None]:
# FILE 3 - TWITTER API JSON
#%run twitter-api.py df_twitter # run python script, pass dataframe
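
Once `tweet_json.txt` exists, it can be read back line by line. The sketch below is one way to do that; `load_tweet_json` is a hypothetical helper, and the selected fields (`id`, `retweet_count`, `favorite_count`) are an assumption about which parts of each tweet object are of interest. The demo uses two made-up records written in the same one-object-per-line layout.

```python
import json
import os
import tempfile
import pandas as pd

def load_tweet_json(path, fields=('id', 'retweet_count', 'favorite_count')):
    """Read one JSON object per line and keep only the selected fields."""
    rows = []
    with open(path, encoding='utf-8') as infile:
        for line in infile:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            tweet = json.loads(line)
            rows.append({field: tweet.get(field) for field in fields})
    return pd.DataFrame(rows, columns=list(fields))

# Demo with two made-up records, one JSON object per line.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'tweet_json.txt')
    with open(demo_path, 'w', encoding='utf-8') as outfile:
        for record in ({'id': 1, 'retweet_count': 5, 'favorite_count': 9},
                       {'id': 2, 'retweet_count': 0, 'favorite_count': 3}):
            json.dump(record, outfile)
            outfile.write('\n')
    df_api_demo = load_tweet_json(demo_path)
```

Reading line by line keeps memory use low and tolerates a file that was cut short by an interrupted run, as long as every written line is complete.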

## Assessing data
### Assess 1 - Twitter Data Archive
#### Define:<br>


**Visual and programmatic summary**<br>
Exceptions:
1. ratings (numerator, denominator)

_Tidiness_<br>
1 Structure:<br>
1.1 The timestamp contains both date and time and can be split further<br>
1.2 Columns 13-16 can be combined into a single `Dog_Category` column; each value only repeats its column name, so the separate columns are redundant
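
The two tidiness points could be addressed roughly as sketched below. The rows are made up, the new column names (`dog_category`, `date`, `time`) are assumptions, and the helper treats both missing values and the literal string `'None'` as "no category", since it is not certain which of the two the archive uses.

```python
import numpy as np
import pandas as pd

# Made-up rows shaped like the archive (not real data).
demo = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'timestamp': ['2017-06-18 20:30:39 +0000',
                  '2017-05-12 17:12:53 +0000',
                  '2015-11-23 02:52:48 +0000'],
    'doggo':   ['doggo', np.nan, np.nan],
    'floofer': [np.nan, np.nan, np.nan],
    'pupper':  ['pupper', 'pupper', np.nan],
    'puppo':   [np.nan, np.nan, np.nan],
})
stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

def combine_stages(row):
    """Join the categories present in a row; NaN when none is set."""
    found = [v for v in row if isinstance(v, str) and v != 'None']
    return ','.join(found) if found else np.nan

# 1.2: collapse the four columns into one categorical-style column.
demo['dog_category'] = demo[stage_cols].apply(combine_stages, axis=1)
demo = demo.drop(columns=stage_cols)

# 1.1: parse the timestamp once, then split it into date and time.
ts = pd.to_datetime(demo['timestamp'])
demo['date'] = ts.dt.date
demo['time'] = ts.dt.time
```

Joining with a comma keeps rows that carry more than one category visible, so they can be reviewed instead of being silently reduced to one value.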

_Cleanliness_<br>
1 Missing information, columns ordered by severity:<br>
1.1 Columns 1-2 have only 78 non-null values, so a significant amount is missing<br>
1.2 Columns 6-8 contain 181 non-null values<br>
1.3 Column 9 contains 2297 non-null values<br>
2 Datatypes:<br>
2.1 Columns 1-2 are stored as float, but the IDs are on the order of 1e+17 and need no decimal precision, so float is not required
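
A minimal sketch of how the non-null counts and the datatype point might be checked, using made-up rows and assuming pandas' nullable `Int64` dtype is an acceptable target for the ID columns:

```python
import numpy as np
import pandas as pd

# Made-up values shaped like the reply columns of the archive.
demo = pd.DataFrame({
    'tweet_id': [863079547188785154, 668623201287675904, 876537666061221889],
    'in_reply_to_status_id': [6.671522e17, np.nan, np.nan],
    'in_reply_to_user_id': [4196984000.0, np.nan, np.nan],
})

# Non-null count per column, the same figure df.info() reports.
non_null = demo.notna().sum()

# Nullable integers keep missing values as <NA> instead of forcing float.
id_cols = ['in_reply_to_status_id', 'in_reply_to_user_id']
demo[id_cols] = demo[id_cols].astype('Int64')

print(non_null)
print(demo.dtypes)
```

Converting after the fact does not restore digits that were already rounded when the IDs were read as float. Reading those columns with an explicit dtype at import time may be the safer option.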


In [None]:
df_twitter.info()

In [None]:
df_twitter.describe()

In [None]:
# call go_assess
archive_assessed = go_assess(df_twitter)

### Column 0 - tweet_id

In [None]:
### Column 0 - 
archive_assessed[0][0], archive_assessed[1][0]

### Column 1 - in reply

In [None]:
### Column 1 - 
archive_assessed[0][1], archive_assessed[1][1]

In [None]:
df_twitter[df_twitter.in_reply_to_status_id.notna()]['in_reply_to_status_id'].sample(5)

In [None]:
### Column 2 - 
archive_assessed[0][2], archive_assessed[1][2]

In [None]:
### Column 3 - 

In [None]:
archive_assessed[0][3], archive_assessed[1][3]

In [None]:
### Column 4 - 
archive_assessed[0][4], archive_assessed[1][4]

In [None]:
### Column 5 - 
archive_assessed[0][5], archive_assessed[1][5]

In [None]:
### Column 6 - 
archive_assessed[0][6], archive_assessed[1][6]

In [None]:
### Column 7 - 
archive_assessed[0][7], archive_assessed[1][7]

In [None]:
### Column 12 - name

In [None]:
archive_assessed[0][12], archive_assessed[1][12]

## Assess 2 - Twitter Image Predictions
### Define:

In [None]:
df_image_predictor.sample(3)

In [None]:
df_image_predictor.info()

In [None]:
df_image_predictor.describe()

In [None]:
img_assessed = go_assess(df_image_predictor)

In [None]:
### Column 0
img_assessed[0][0], img_assessed[1][0]

In [None]:
### Column 1
# search for files other than .jpg
# regex=False matches '.jpg' literally; by default '.' would match any character
not_jpg = df_image_predictor[~df_image_predictor.jpg_url.str.contains('.jpg', regex=False)]
not_jpg.jpg_url
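
The pattern passed to `str.contains` matters here, because it is interpreted as a regular expression by default. The sketch below uses made-up URLs to compare a regex match, a literal match, and a suffix check.

```python
import pandas as pd

# Made-up URLs for illustration.
urls = pd.Series([
    'https://example.com/a.jpg',
    'https://example.com/b.png',
    'https://example.com/xjpg/c.png',  # has 'xjpg' in the path, not a .jpg file
])

regex_hit = urls.str.contains('.jpg')                 # '.' matches any character
literal_hit = urls.str.contains('.jpg', regex=False)  # plain substring match
suffix_hit = urls.str.endswith('.jpg')                # checks the extension only

print(pd.DataFrame({'regex': regex_hit,
                    'literal': literal_hit,
                    'suffix': suffix_hit}))
```

For an extension check, `str.endswith` is probably the closest fit, since a literal `.jpg` could still appear in the middle of a URL.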

In [None]:
img_assessed[0][1], img_assessed[1][1]

In [None]:
### Column 2
img_assessed[0][2], img_assessed[1][2]

In [None]:
### Column 3
img_assessed[0][3], img_assessed[1][3]

In [None]:
# rows whose p1 label contains a space
is_ws = df_image_predictor[df_image_predictor.p1.str.contains(' ', regex=False)]
is_ws

In [None]:
mask = img_assessed[1][3] == 1
img_assessed[1][3][mask]

In [None]:
### Column 4
img_assessed[0][4], img_assessed[1][4]

In [None]:
### Column 5
img_assessed[0][5], img_assessed[1][5]

In [None]:
df_image_predictor.query('p1_dog == False').iloc[:, [0,1,3,5]]

In [None]:
p1_df = df_image_predictor.query('p1_dog == False').iloc[:,:6]
p1_df

In [None]:
p1_df.groupby(['p1']).size()

In [None]:
### Column 6
img_assessed[0][6], img_assessed[1][6]

In [None]:
### Column 7
img_assessed[0][7], img_assessed[1][7]

In [None]:
### Column 8
img_assessed[0][8], img_assessed[1][8]

In [None]:
### Column 9
img_assessed[0][9], img_assessed[1][9]

In [None]:
### Column 10
img_assessed[0][10], img_assessed[1][10]

In [None]:
### Column 11
img_assessed[0][11], img_assessed[1][11]

## Cleaning data


In [None]:
# columns 1-2 (in_reply_to_status_id, in_reply_to_user_id), flagged above for a datatype change
df_twitter.iloc[:,1:3]

In [None]:
# Misc: Workspace

## Save clean data