In [39]:
# PROJECT 4 - Wrangle Twitter data via API

## Table of Contents
* [Introduction](#intro)
* [Initial Brief](#1.1-initial-brief)
* [General Outline](#general_outline)
* [Import Libraries](#)

`Note: Fill at the end. Automate with python library/extension.`

# Introduction
Gather readily available data from an existing source on the web to gain first-hand experience of wrangling data.<br>
Wrangling is a significant task because data is not always provided, and when it is, it is rarely clean: <br>
 - Best case: spelling mistakes or similar minor errors,
 - Worst case: no schema or format, duplicates, and incomplete or incorrect recorded values.

## Initial Brief
- User has provided archived Twitter data for analysis
 - [ ] Twitter archive export in CSV
 - [ ] URL to Machine Learning image predictions
<br>
- Identify minimum:
 - [ ] 8 quality issues
 - [ ] 2 tidiness issues
<br>
- Out of scope:
 - [ ] Unique rating system
 - [ ] No gathering required past 01 Aug 2017

## General outline
- [ ] Read-in CSV data
- [ ] Access URL data (_rather than downloading the file manually_)

In [40]:
## install modules via terminal
#pip install pandas # also downloads numpy
#pip install requests
#pip install tweepy

## Optional - provides TOC
#pip install jupyter_contrib_nbextensions

## Import Libraries

In [41]:
import pandas as pd
import numpy as np

import requests
import os

import json # json encoder and decoder

## Defined Functions

- add_files(*filename)         `Created for the ability to scale`
- get_values(df, col, name)    `Created to report duplicates in a single column`
- go_assess(df)                `Created to repeat the assessment steps for every column`

In [42]:
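filelist = []  # tracks every file added during the session
print('{} Files in list'.format(len(filelist)))  # initial print

# Adds and tracks files
def add_files(*filenames):  # PARAMETER: one or more <string> file names
    for file in filenames:
        filelist.append(file)
        print('{} added to file list.'.format(file))

    # singular/plural wording for the running total
    if len(filelist) > 1:
        print('{} files now in list.'.format(len(filelist)))
    else:
        print('{} file now in list.'.format(len(filelist)))
    # return the last name added (None when called without arguments)
    return filenames[-1] if filenames else None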

0 Files in list


In [43]:
def get_values(df, col, name):
    # df: source dataframe (currently unused), col: the column as a Series, name: column label
    export = []
    value_cnt = col.value_counts()  # occurrences of each distinct non-null value
    value = value_cnt.values
    # test for duplicates: without duplicates, the sum of the counts equals the number of distinct values
    if value.sum() > value.shape[0]: # there are duplicates
        txt_result = "Duplicates found in column '{}', the max duplicate item repeats {} times.".format(name, value.max())
    else: # no duplicates
        txt_result = "No duplicates found in column '{}'.".format(name)
    # pack variables into list: [value counts, summary text]
    export.append(value_cnt)
    export.append(txt_result)

    return export
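
`get_values` reports whether duplicates exist and how often the most common value repeats, but not which rows are involved. As a minimal sketch, assuming the aim is to inspect the repeated rows themselves (the helper name `find_duplicates` and the toy data are hypothetical, not part of the notebook), `Series.duplicated(keep=False)` flags every occurrence of a repeated value:

```python
import pandas as pd

def find_duplicates(col):
    """Return the non-null entries of `col` whose value occurs more than once."""
    non_null = col.dropna()
    # keep=False marks all occurrences of a repeated value, not just the later ones
    return non_null[non_null.duplicated(keep=False)]

# Toy data, not taken from the archive
sample = pd.Series(['a', 'b', 'a', None, 'c'], name='letters')
dupes = find_duplicates(sample)
print(dupes.index.tolist())
```

The returned Series keeps the original index, so it can be passed to `df.loc[...]` to view the full rows.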

In [44]:
def go_assess(df):
    # lists are rebuilt on every function call, to prevent them from accumulating over time
    results = []     # value_counts Series, one per column
    summary = []     # duplicate-check text, one per column
    assessment = []  # [summary, results], returned to the caller
    print('Dataframe contains the following columns:')
    print('{}\n'.format(df.columns) )

    for i, col in enumerate(df.columns):
        # copy into message
        print('Column {} - \'{}\' has been assessed. Assessment saved in results[{}] and summary[{}]'.format(i, col, i, i))

        # call and get results: [value counts, summary text]
        val_sum = get_values(df, df[col], col)

        # append results
        summary.append(val_sum[1])
        results.append(val_sum[0])

    assessment.append(summary)
    assessment.append(results)
    print('NOTE: To access variables, set a series name e.g below:\nseries[0][x] to access summary details.\nseries[1][x] to access the value_counts results.\nx represents column number')
    return assessment
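
The nested list returned by `go_assess` is indexed by position, so the caller has to remember each column number. A possible alternative, sketched here under the assumption that lookup by column name is preferable (the function `assess_by_name` and the toy dataframe are illustrative only), stores the same two pieces of information in a dictionary:

```python
import pandas as pd

def assess_by_name(df):
    """Map each column name to (summary text, value_counts Series)."""
    report = {}
    for col in df.columns:
        counts = df[col].value_counts()
        # any count above 1 means a value occurs more than once
        has_dupes = bool((counts > 1).any())
        summary = 'Duplicates found' if has_dupes else 'No duplicates found'
        report[col] = (summary, counts)
    return report

# Toy dataframe, not the Twitter archive
toy = pd.DataFrame({'id': [1, 2, 3], 'kind': ['x', 'x', 'y']})
report = assess_by_name(toy)
summary, counts = report['kind']
print(summary)
```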

In [45]:
## BLANK

## Data Wrangling

## Iteration 1
Import data from a twitter user archive provided by the end-user

`Note: Add edit# upon addition of new issue.`

### Gathering 1
#### Initialize
Enter the known input information.<br>
Format: file name inside single quotes `''`

In [46]:
# FILE 1 - TWITTER ARCHIVE DATA
folder = 'Incoming Files/'
twitter_file = 'twitter-archive-enhanced-2.csv'
add_files(twitter_file)

twitter-archive-enhanced-2.csv added to file list.
1 file now in list.


'twitter-archive-enhanced-2.csv'

In [47]:
# FILE 2 - TWITTER ML IMAGE PREDICTIONS
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'

# assign to a response object
response = requests.get(url)
response.raise_for_status()  # stop here if the download failed

# split after last delimiter /, indicating file name
image_predictions = url.split('/')[-1] # extract file name

# with open closes the file automatically when the block completes
with open(os.path.join(folder, image_predictions), mode='wb') as file:
    # write the downloaded bytes to disk
    file.write(response.content)
    print('{} has been saved in: "/{}"'.format(image_predictions, folder) )

# call function and add name to end of list
add_files(image_predictions)

image-predictions.tsv has been saved in: "/Incoming Files/"
image-predictions.tsv added to file list.
2 files now in list.


'image-predictions.tsv'
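
The download cell assumes that the target folder already exists and that the URL has no query string. The sketch below shows one way those two assumptions could be relaxed; `filename_from_url` and `save_bytes` are hypothetical helper names, and the example URL is a placeholder rather than the project URL.

```python
import os
import posixpath
from urllib.parse import urlparse

def filename_from_url(url):
    """Return the last path component of a URL, ignoring any query string."""
    return posixpath.basename(urlparse(url).path)

def save_bytes(content, folder, filename):
    """Write bytes to folder/filename, creating the folder if it is missing."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)
    with open(path, mode='wb') as fh:
        fh.write(content)
    return path

# Placeholder URL for illustration only
print(filename_from_url('https://example.com/data/image-predictions.tsv?v=2'))
```

In the notebook, `save_bytes(response.content, folder, filename_from_url(url))` would take the place of the `with open(...)` block.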

#### Import into dataframes

In [48]:
# create empty lists
df_raw = []
file_extensions = []

# dataframe to contain original imports
for num, file in enumerate(filelist):
    ext = file.split('.')[-1]
    file_extensions.append(ext)
    # read extension type
    ## catch CSV and TSV with if/elif; JSON is not handled yet and falls through to else
    if ext == 'csv':
        df_raw.append(pd.read_csv(folder + file) )
    elif ext == 'tsv':
        df_raw.append(pd.read_csv(folder + file, sep='\t') )
    else:
        print('filelist({}) - "{}", could not be read into a dataframe.'.format(num, filelist[num]) )
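
As the list of supported file types grows, the `if`/`elif` chain can be replaced by a lookup table. This is a sketch only: `READ_CSV_OPTIONS` and `read_table` are hypothetical names, and it covers the same two extensions as the cell above.

```python
import os
import pandas as pd

# Extension -> keyword arguments passed to pd.read_csv
READ_CSV_OPTIONS = {
    '.csv': {},
    '.tsv': {'sep': '\t'},
}

def read_table(path):
    """Read a delimited file into a DataFrame, or return None if unsupported."""
    ext = os.path.splitext(path)[1].lower()
    options = READ_CSV_OPTIONS.get(ext)
    if options is None:
        print('"{}" could not be read into a dataframe.'.format(path))
        return None
    return pd.read_csv(path, **options)
```

Supporting another delimited format then means adding one dictionary entry instead of another branch.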

In [49]:
print(filelist)

['twitter-archive-enhanced-2.csv', 'image-predictions.tsv']


In [50]:
df_raw[0].sample(3)  # visually assess file was read in correctly

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
2186,668981893510119424,,,2015-11-24 02:38:07 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Unique dog here. Oddly shaped tail. Long pink ...,,,,https://twitter.com/dog_rates/status/668981893...,4,10,,,,,
1899,674670581682434048,,,2015-12-09 19:22:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Jeb &amp; Bush. Jeb is somehow stuck in t...,,,,https://twitter.com/dog_rates/status/674670581...,9,10,Jeb,,,,
461,817536400337801217,,,2017-01-07 01:00:41 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Say hello to Eugene &amp; Patti Melt. No matte...,,,,https://twitter.com/dog_rates/status/817536400...,12,10,Eugene,,,,


In [51]:
df_twitter = df_raw[0].copy()

In [52]:
df_raw[1].sample(3)  # visually assess file was read in correctly

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
783,690015576308211712,https://pbs.twimg.com/media/CZNtgWhWkAAbq3W.jpg,2,malamute,0.949609,True,Siberian_husky,0.033084,True,Eskimo_dog,0.016663,True
1728,820690176645140481,https://pbs.twimg.com/media/C2OtWr0VQAEnS9r.jpg,2,West_Highland_white_terrier,0.872064,True,kuvasz,0.059526,True,Samoyed,0.0374,True
328,672068090318987265,https://pbs.twimg.com/media/CVOqW8eUkAESTHj.jpg,1,pug,0.863385,True,shopping_cart,0.125746,False,Border_terrier,0.002972,True


In [53]:
df_image_predictor = df_raw[1].copy() # create copy

## Assessing data
### Assess 1 - Twitter Data Archive
#### Define:


**Visual and programmatic summary**<br>
Exceptions:
1. ratings (numerator, denominator)

_Tidiness_<br>
1 Datatypes:<br>
1.1 `timestamp` contains both date and time, so it can be split further<br>
1.2 Columns 13-16 can be combined into a single `Dog_Category` column; each value only repeats its column name, which makes the separate columns redundant
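
The two tidiness points can be sketched on toy rows as follows. This is not the cleaning code of the project: the `'None'` placeholder strings in the stage columns are an assumption about the archive, and `combine_stages` is a hypothetical helper.

```python
import pandas as pd

STAGES = ['doggo', 'floofer', 'pupper', 'puppo']

# Toy rows; the 'None' placeholder strings are an assumption about the archive
toy = pd.DataFrame({
    'timestamp': ['2015-11-24 02:38:07 +0000', '2017-01-07 01:00:41 +0000'],
    'doggo': ['None', 'doggo'],
    'floofer': ['None', 'None'],
    'pupper': ['None', 'None'],
    'puppo': ['None', 'None'],
})

# 1.1 Parse the text timestamps into timezone-aware datetimes
toy['timestamp'] = pd.to_datetime(toy['timestamp'], utc=True)

def combine_stages(row):
    """Join whichever stage columns are set; return None when none are."""
    found = [stage for stage in STAGES if row[stage] == stage]
    return ','.join(found) if found else None

# 1.2 Collapse the four stage columns into one
toy['Dog_Category'] = toy[STAGES].apply(combine_stages, axis=1)
tidy = toy.drop(columns=STAGES)
print(tidy.columns.tolist())
```

Once parsed, date and time parts are available through the `.dt` accessor (for example `tidy['timestamp'].dt.date`), so separate columns are only needed if the analysis calls for them.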

_Cleanliness_<br>
1 Missing information, columns ordered by severity:<br>
1.1 Columns 1-2 have only 78 non-null values out of 2356, so most entries are missing<br>
1.2 Columns 6-8 contain 181 non-null values<br>
1.3 Column 9 contains 2297 non-null values<br>
2 Datatypes:<br>
2.1 Columns 1-2 are stored as float, which is not required: the IDs are of order 10^17 and do not need decimal precision

In [54]:
df_twitter.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [55]:
df_twitter.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


In [56]:
# call go_assess
archive_assessed = go_assess(df_twitter)

Dataframe contains the following columns:
Index(['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
       'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator',
       'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo'],
      dtype='object')

Column 0 - 'tweet_id' has been assessed. Assessment saved in results[0] and summary[0]
Column 1 - 'in_reply_to_status_id' has been assessed. Assessment saved in results[1] and summary[1]
Column 2 - 'in_reply_to_user_id' has been assessed. Assessment saved in results[2] and summary[2]
Column 3 - 'timestamp' has been assessed. Assessment saved in results[3] and summary[3]
Column 4 - 'source' has been assessed. Assessment saved in results[4] and summary[4]
Column 5 - 'text' has been assessed. Assessment saved in results[5] and summary[5]
Column 6 - 'retweeted_status_id' has been assessed. Assessment saved in results[6] and 

### Column 0 - tweet_id

In [57]:
### Column 0 - 
archive_assessed[0][0], archive_assessed[1][0]

("No duplicates found in column 'tweet_id'.",
 749075273010798592    1
 741099773336379392    1
 798644042770751489    1
 825120256414846976    1
 769212283578875904    1
                      ..
 715360349751484417    1
 666817836334096384    1
 794926597468000259    1
 673705679337693185    1
 700151421916807169    1
 Name: tweet_id, Length: 2356, dtype: int64)

### Column 1 - in_reply_to_status_id

In [58]:
### Column 1 - 
archive_assessed[0][1], archive_assessed[1][1]

("Duplicates found in column 'in_reply_to_status_id', the max duplicate item repeats 2 times.",
 6.671522e+17    2
 8.562860e+17    1
 8.131273e+17    1
 6.754971e+17    1
 6.827884e+17    1
                ..
 8.482121e+17    1
 6.715449e+17    1
 6.936422e+17    1
 6.849598e+17    1
 7.331095e+17    1
 Name: in_reply_to_status_id, Length: 77, dtype: int64)

In [59]:
df_twitter[df_twitter.in_reply_to_status_id.notna()]['in_reply_to_status_id'].sample(5)

576     8.008580e+17
274     8.406983e+17
967     7.501805e+17
2298    6.670655e+17
1005    7.476487e+17
Name: in_reply_to_status_id, dtype: float64

In [60]:
### Column 2 - 
archive_assessed[0][2], archive_assessed[1][2]

("Duplicates found in column 'in_reply_to_user_id', the max duplicate item repeats 47 times.",
 4.196984e+09    47
 2.195506e+07     2
 7.305050e+17     1
 2.916630e+07     1
 3.105441e+09     1
 2.918590e+08     1
 2.792810e+08     1
 2.319108e+09     1
 1.806710e+08     1
 3.058208e+07     1
 2.625958e+07     1
 1.943518e+08     1
 3.589728e+08     1
 8.405479e+17     1
 2.894131e+09     1
 2.143566e+07     1
 2.281182e+09     1
 1.648776e+07     1
 4.717297e+09     1
 2.878549e+07     1
 1.582854e+09     1
 4.670367e+08     1
 4.738443e+07     1
 1.361572e+07     1
 1.584641e+07     1
 2.068372e+07     1
 1.637468e+07     1
 1.185634e+07     1
 1.198989e+09     1
 1.132119e+08     1
 7.759620e+07     1
 Name: in_reply_to_user_id, dtype: int64)

In [61]:
### Column 3 - 

In [62]:
archive_assessed[0][3], archive_assessed[1][3]

("No duplicates found in column 'timestamp'.",
 2015-11-23 02:19:29 +0000    1
 2015-11-16 01:52:02 +0000    1
 2016-01-30 03:52:58 +0000    1
 2016-07-25 23:54:05 +0000    1
 2015-11-16 20:32:58 +0000    1
                             ..
 2015-11-17 02:46:43 +0000    1
 2015-12-26 17:25:59 +0000    1
 2015-12-01 19:10:13 +0000    1
 2016-09-01 00:04:38 +0000    1
 2016-08-25 00:43:02 +0000    1
 Name: timestamp, Length: 2356, dtype: int64)

In [63]:
### Column 4 - 
archive_assessed[0][4], archive_assessed[1][4]

("Duplicates found in column 'source', the max duplicate item repeats 2221 times.",
 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     2221
 <a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                          91
 <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                       33
 <a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      11
 Name: source, dtype: int64)
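
The `source` values above are complete HTML anchor tags, and only the link text distinguishes the four clients. One possible way to reduce them to that text, shown on two values copied from the output above (the variable names are illustrative), is a regular expression with a single capture group:

```python
import pandas as pd

source = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
])

# Capture the text between the opening <a ...> tag and the closing </a>
labels = source.str.extract(r'>([^<]+)</a>', expand=False)
print(labels.tolist())
```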

In [64]:
### Column 5 - 
archive_assessed[0][5], archive_assessed[1][5]

("No duplicates found in column 'text'.",
 Was just informed about this hero pupper and others like her. Another 14/10, would be an absolute honor to pet https://t.co/hBTzPmj36Z                      1
 This is Huck. He's addicted to caffeine. Hope it's not too latte to seek help. 11/10 stay strong pupper https://t.co/iJE3F0VozW                             1
 This is Oliver. Bath time is upon him. His fear of the wetness postpones his ultimate pupper destiny. 11/10 https://t.co/AFzzKqR4tT                         1
 RT @dog_rates: Not familiar with this breed. No tail (weird). Only 2 legs. Doesn't bark. Surprisingly quick. Shits eggs. 1/10 https://t.co/…                1
 This is an Albanian 3 1/2 legged  Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv                    1
                                                                                                                                                            ..
 Unb

In [65]:
### Column 6 - 
archive_assessed[0][6], archive_assessed[1][6]

("No duplicates found in column 'retweeted_status_id'.",
 7.757333e+17    1
 7.507196e+17    1
 6.742918e+17    1
 6.833919e+17    1
 8.269587e+17    1
                ..
 7.848260e+17    1
 7.806013e+17    1
 8.305833e+17    1
 7.047611e+17    1
 7.331095e+17    1
 Name: retweeted_status_id, Length: 181, dtype: int64)

In [66]:
### Column 7 - 
archive_assessed[0][7], archive_assessed[1][7]

("Duplicates found in column 'retweeted_status_user_id', the max duplicate item repeats 156 times.",
 4.196984e+09    156
 4.296832e+09      2
 5.870972e+07      1
 6.669901e+07      1
 4.119842e+07      1
 7.475543e+17      1
 7.832140e+05      1
 7.266347e+08      1
 4.871977e+08      1
 5.970642e+08      1
 4.466750e+07      1
 1.228326e+09      1
 7.992370e+07      1
 2.488557e+07      1
 7.874618e+17      1
 3.638908e+08      1
 5.128045e+08      1
 8.117408e+08      1
 1.732729e+09      1
 1.960740e+07      1
 1.547674e+08      1
 3.410211e+08      1
 7.124572e+17      1
 2.804798e+08      1
 1.950368e+08      1
 Name: retweeted_status_user_id, dtype: int64)

In [67]:
### Column 12 - name

In [68]:
archive_assessed[0][12], archive_assessed[1][12]

("Duplicates found in column 'name', the max duplicate item repeats 745 times.",
 None        745
 a            55
 Charlie      12
 Cooper       11
 Lucy         11
            ... 
 Brutus        1
 Gilbert       1
 Amélie        1
 Winifred      1
 Andru         1
 Name: name, Length: 957, dtype: int64)
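
The two most frequent values, `None` and `a`, are unlikely to be real dog names. The sketch below marks such entries as missing. It rests on a stated assumption, not a verified rule: that a real name starts with an uppercase letter and is not the placeholder string `'None'`. The sample values are taken from the output above.

```python
import numpy as np
import pandas as pd

names = pd.Series(['None', 'a', 'Charlie', 'Amélie'])

# Assumption: a real name starts with an uppercase letter and is not the
# placeholder string 'None'
is_placeholder = names.eq('None')
is_lowercase = names.str[0].str.islower()
cleaned = names.mask(is_placeholder | is_lowercase, np.nan)
print(cleaned.isna().tolist())
```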

### Assess 2 - Twitter Image Predictions
#### Define:

In [69]:
df_image_predictor.sample(3)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
47,666817836334096384,https://pbs.twimg.com/media/CUEDSMEWEAAuXVZ.jpg,1,miniature_schnauzer,0.496953,True,standard_schnauzer,0.285276,True,giant_schnauzer,0.073764,True
2002,876838120628539392,https://pbs.twimg.com/media/DCsnnZsVwAEfkyi.jpg,1,bloodhound,0.575751,True,redbone,0.24097,True,Tibetan_mastiff,0.088935,True
1352,759923798737051648,https://pbs.twimg.com/media/CovKqSYVIAAUbUW.jpg,1,Labrador_retriever,0.324579,True,seat_belt,0.109168,False,pug,0.102466,True


In [70]:
df_image_predictor.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [71]:
df_image_predictor.describe()

Unnamed: 0,tweet_id,img_num,p1_conf,p2_conf,p3_conf
count,2075.0,2075.0,2075.0,2075.0,2075.0
mean,7.384514e+17,1.203855,0.594548,0.1345886,0.06032417
std,6.785203e+16,0.561875,0.271174,0.1006657,0.05090593
min,6.660209e+17,1.0,0.044333,1.0113e-08,1.74017e-10
25%,6.764835e+17,1.0,0.364412,0.05388625,0.0162224
50%,7.119988e+17,1.0,0.58823,0.118181,0.0494438
75%,7.932034e+17,1.0,0.843855,0.1955655,0.09180755
max,8.924206e+17,4.0,1.0,0.488014,0.273419


In [72]:
img_assessed = go_assess(df_image_predictor)

Dataframe contains the following columns:
Index(['tweet_id', 'jpg_url', 'img_num', 'p1', 'p1_conf', 'p1_dog', 'p2',
       'p2_conf', 'p2_dog', 'p3', 'p3_conf', 'p3_dog'],
      dtype='object')

Column 0 - 'tweet_id' has been assessed. Assessment saved in results[0] and summary[0]
Column 1 - 'jpg_url' has been assessed. Assessment saved in results[1] and summary[1]
Column 2 - 'img_num' has been assessed. Assessment saved in results[2] and summary[2]
Column 3 - 'p1' has been assessed. Assessment saved in results[3] and summary[3]
Column 4 - 'p1_conf' has been assessed. Assessment saved in results[4] and summary[4]
Column 5 - 'p1_dog' has been assessed. Assessment saved in results[5] and summary[5]
Column 6 - 'p2' has been assessed. Assessment saved in results[6] and summary[6]
Column 7 - 'p2_conf' has been assessed. Assessment saved in results[7] and summary[7]
Column 8 - 'p2_dog' has been assessed. Assessment saved in results[8] and summary[8]
Column 9 - 'p3' has been assessed. Assessm

In [77]:
### Column 0
img_assessed[0][0], img_assessed[1][0]

("No duplicates found in column 'tweet_id'.",
 685532292383666176    1
 826598365270007810    1
 692158366030913536    1
 714606013974974464    1
 715696743237730304    1
                      ..
 816829038950027264    1
 847971574464610304    1
 713175907180089344    1
 670338931251150849    1
 700151421916807169    1
 Name: tweet_id, Length: 2075, dtype: int64)

In [74]:
### Column 1
# search for files other than .jpg; regex=False matches '.jpg' literally instead of treating '.' as a wildcard
not_jpg = df_image_predictor[~df_image_predictor.jpg_url.str.contains('.jpg', regex=False)]
not_jpg.jpg_url

320    https://pbs.twimg.com/tweet_video_thumb/CVKtH-...
815    https://pbs.twimg.com/tweet_video_thumb/CZ0mhd...
Name: jpg_url, dtype: object
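
`Series.str.contains` interprets its pattern as a regular expression by default, so the `.` in `'.jpg'` matches any character. The toy URLs below (placeholders, not from the dataset) show how the three possible checks differ:

```python
import pandas as pd

urls = pd.Series([
    'https://example.com/media/photo.jpg',
    'https://example.com/media/notjpg',      # no dot before 'jpg'
    'https://example.com/media/clip.png',
])

# Default regex=True: '.' matches any character, so 'tjpg' also matches
as_regex = urls.str.contains('.jpg')
# Literal substring match
as_literal = urls.str.contains('.jpg', regex=False)
# Strictest check: the URL must end with the extension
as_suffix = urls.str.endswith('.jpg')

print(as_regex.tolist(), as_literal.tolist(), as_suffix.tolist())
```

For a file-extension test, `str.endswith` is probably the closest match to the intent, since it ignores `.jpg` appearing elsewhere in the URL.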

In [78]:
img_assessed[0][1], img_assessed[1][1]

("Duplicates found in column 'jpg_url', the max duplicate item repeats 2 times.",
 https://pbs.twimg.com/media/CkNjahBXAAQ2kWo.jpg    2
 https://pbs.twimg.com/media/CU1zsMSUAAAS0qW.jpg    2
 https://pbs.twimg.com/media/CdHwZd0VIAA4792.jpg    2
 https://pbs.twimg.com/media/CVgdFjNWEAAxmbq.jpg    2
 https://pbs.twimg.com/media/Cx5R8wPVEAALa9r.jpg    2
                                                   ..
 https://pbs.twimg.com/media/DBAePiVXcAAqHSR.jpg    1
 https://pbs.twimg.com/media/DFrEyVuW0AAO3t9.jpg    1
 https://pbs.twimg.com/media/ClCQzFUUYAA5vAu.jpg    1
 https://pbs.twimg.com/media/CZIr5gFUsAAvnif.jpg    1
 https://pbs.twimg.com/media/CfznaXuUsAAH-py.jpg    1
 Name: jpg_url, Length: 2009, dtype: int64)

In [79]:
### Column 2
img_assessed[0][2], img_assessed[1][2]

("Duplicates found in column 'img_num', the max duplicate item repeats 1780 times.",
 1    1780
 2     198
 3      66
 4      31
 Name: img_num, dtype: int64)

In [80]:
### Column 3
img_assessed[0][3], img_assessed[1][3]

("Duplicates found in column 'p1', the max duplicate item repeats 150 times.",
 golden_retriever      150
 Labrador_retriever    100
 Pembroke               89
 Chihuahua              83
 pug                    57
                      ... 
 candle                  1
 binoculars              1
 prayer_rug              1
 suit                    1
 canoe                   1
 Name: p1, Length: 378, dtype: int64)

In [82]:
# check whether any p1 label contains whitespace
is_ws = df_image_predictor[df_image_predictor.p1.str.contains(' ')]
is_ws

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog


In [130]:
mask = img_assessed[1][3] == 1
img_assessed[1][3][mask]

espresso        1
book_jacket     1
ocarina         1
picket_fence    1
teapot          1
               ..
candle          1
binoculars      1
prayer_rug      1
suit            1
canoe           1
Name: p1, Length: 175, dtype: int64

In [132]:
### Column 4
img_assessed[0][4], img_assessed[1][4]

("Duplicates found in column 'p1_conf', the max duplicate item repeats 2 times.",
 0.366248    2
 0.713293    2
 0.375098    2
 0.636169    2
 0.611525    2
            ..
 0.713102    1
 0.765266    1
 0.491022    1
 0.905334    1
 1.000000    1
 Name: p1_conf, Length: 2006, dtype: int64)

In [133]:
### Column 5
img_assessed[0][5], img_assessed[1][5]

("Duplicates found in column 'p1_dog', the max duplicate item repeats 1532 times.",
 True     1532
 False     543
 Name: p1_dog, dtype: int64)

In [149]:
df_image_predictor.query('p1_dog == False').iloc[:, [0,1,3,5]]

Unnamed: 0,tweet_id,jpg_url,p1,p1_dog
6,666051853826850816,https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg,box_turtle,False
8,666057090499244032,https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg,shopping_cart,False
17,666104133288665088,https://pbs.twimg.com/media/CT56LSZWoAAlJj2.jpg,hen,False
18,666268910803644416,https://pbs.twimg.com/media/CT8QCd1WEAADXws.jpg,desktop_computer,False
21,666293911632134144,https://pbs.twimg.com/media/CT8mx7KW4AEQu8N.jpg,three-toed_sloth,False
...,...,...,...,...
2026,882045870035918850,https://pbs.twimg.com/media/DD2oCl2WAAEI_4a.jpg,web_site,False
2046,886680336477933568,https://pbs.twimg.com/media/DE4fEDzWAAAyHMM.jpg,convertible,False
2052,887517139158093824,https://pbs.twimg.com/ext_tw_video_thumb/88751...,limousine,False
2071,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,paper_towel,False


In [161]:
p1_df = df_image_predictor.query('p1_dog == False').iloc[:,:6]
p1_df

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog
6,666051853826850816,https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg,1,box_turtle,0.933012,False
8,666057090499244032,https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg,1,shopping_cart,0.962465,False
17,666104133288665088,https://pbs.twimg.com/media/CT56LSZWoAAlJj2.jpg,1,hen,0.965932,False
18,666268910803644416,https://pbs.twimg.com/media/CT8QCd1WEAADXws.jpg,1,desktop_computer,0.086502,False
21,666293911632134144,https://pbs.twimg.com/media/CT8mx7KW4AEQu8N.jpg,1,three-toed_sloth,0.914671,False
...,...,...,...,...,...,...
2026,882045870035918850,https://pbs.twimg.com/media/DD2oCl2WAAEI_4a.jpg,1,web_site,0.949591,False
2046,886680336477933568,https://pbs.twimg.com/media/DE4fEDzWAAAyHMM.jpg,1,convertible,0.738995,False
2052,887517139158093824,https://pbs.twimg.com/ext_tw_video_thumb/88751...,1,limousine,0.130432,False
2071,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,paper_towel,0.170278,False


In [178]:
p1_df.groupby(['p1']).size()

p1
African_crocodile      1
African_grey           1
African_hunting_dog    1
American_black_bear    1
Angora                 2
                      ..
wombat                 4
wood_rabbit            3
wooden_spoon           1
wool                   2
zebra                  1
Length: 267, dtype: int64

In [179]:
### Column 6
img_assessed[0][6], img_assessed[1][6]

("Duplicates found in column 'p2', the max duplicate item repeats 104 times.",
 Labrador_retriever    104
 golden_retriever       92
 Cardigan               73
 Chihuahua              44
 Pomeranian             42
                      ... 
 ice_lolly               1
 wombat                  1
 spotlight               1
 cardigan                1
 spotted_salamander      1
 Name: p2, Length: 405, dtype: int64)

In [180]:
### Column 7
img_assessed[0][7], img_assessed[1][7]

("Duplicates found in column 'p2_conf', the max duplicate item repeats 3 times.",
 0.069362    3
 0.027907    2
 0.193654    2
 0.271929    2
 0.003143    2
            ..
 0.138331    1
 0.254884    1
 0.090644    1
 0.219323    1
 0.016301    1
 Name: p2_conf, Length: 2004, dtype: int64)

In [183]:
### Column 8
img_assessed[0][8], img_assessed[1][8]

("Duplicates found in column 'p2_dog', the max duplicate item repeats 1553 times.",
 True     1553
 False     522
 Name: p2_dog, dtype: int64)

In [182]:
### Column 9
img_assessed[0][9], img_assessed[1][9]

("Duplicates found in column 'p3', the max duplicate item repeats 79 times.",
 Labrador_retriever    79
 Chihuahua             58
 golden_retriever      48
 Eskimo_dog            38
 kelpie                35
                       ..
 rhinoceros_beetle      1
 notebook               1
 eel                    1
 park_bench             1
 toyshop                1
 Name: p3, Length: 408, dtype: int64)

In [184]:
### Column 10
img_assessed[0][10], img_assessed[1][10]

("Duplicates found in column 'p3_conf', the max duplicate item repeats 2 times.",
 0.094759    2
 0.035711    2
 0.000428    2
 0.044660    2
 0.162084    2
            ..
 0.024007    1
 0.132820    1
 0.002099    1
 0.083643    1
 0.033835    1
 Name: p3_conf, Length: 2006, dtype: int64)

In [185]:
### Column 11
img_assessed[0][11], img_assessed[1][11]

("Duplicates found in column 'p3_dog', the max duplicate item repeats 1499 times.",
 True     1499
 False     576
 Name: p3_dog, dtype: int64)

## Cleaning data


In [190]:
# inspect columns 1-2 (in_reply_to_*) before changing their datatype
df_twitter.iloc[:,1:3]

Unnamed: 0,in_reply_to_status_id,in_reply_to_user_id
0,,
1,,
2,,
3,,
4,,
...,...,...
2351,,
2352,,
2353,,
2354,,


In [75]:
# Misc: Workspace

## Save clean data