In [None]:
# PROJECT 4 - Wrangle Twitter data via API (made in Vscode)

## Table of Contents
* [Introduction](#intro)
* [Initial Brief](#1.1-initial-brief)
* [General Outline](#general_outline)
* [Import Libraries](#)
* [](#)

`Note: Fill at the end. Automate with python library/extension.`

# Introduction
Gather readily available data from an existing source on the web to allow first hand experience of wrangling data.<br>
It is a significant task as data will not always be provided and if it is: <br>
 - Best case: Spelling mistakes and/or equivalent,
 - Worst case: No schema/format, duplicates, incomplete and/or incorrect values recorded.

## Initial Brief
- User has provided archived twitter data for analysis
 - [ ] Twitter archive export in CSV
 - [ ] URL to Machine Learning image predictions
<br>
- Identify minimum:
 - [ ] 8 quality issues
 - [ ] 2 tidiness issues
<br>
- Out of scope:
 - [ ] Unique rating system
 - [ ] No gathering required past 01 Aug 2017

## General outline
- [ ] Read-in CSV data
- [ ] Access URL data (_over manually downloading file_)

In [2]:
## install modules via terminal
#pip install pandas # also downloads numpy
#pip install requests
#pip install tweepy
#pip install pandasgui
#pip install autoviz
#pip install pandas-profiling
#pip install sweetviz
#pip install bs4

## Optional - provides Table of Contents and minimizes lines (only on Jupyter)
#pip install jupyter_contrib_nbextensions && jupyter contrib nbextension install

## Import Libraries

In [3]:
import pandas as pd
import numpy as np

import requests
import os

import msvcrt
import sys

from pandasgui import show

from bs4 import BeautifulSoup

import tweepy

import json

from pathlib import Path

## Defined Functions

- addFiles(filename)    `Created for the ability to scale`
- go_assess(df)         `Created to reiterate through assessment steps`

In [4]:
filelist = [] # declare
print('{} Files in list'.format(len(filelist)) ) # initial print

# Adds and tracks files
def add_files(*filename): # PARAMETER: <string>
    for file in filename:
        filelist.append(file)
        print('{} added to file list.'.format(file) )

    if len(filelist) > 1:
        print('{} files now in list.'.format( len(filelist)) )
    else:
        print('{} file now in list.'.format( len(filelist)) )
    return file

0 Files in list


In [5]:
def get_values(df, col, name): # 
    export = []
    value_cnt = col.value_counts()
    value = value_cnt.values
# test for duplicates, no duplicates should be equal to .series size
    if value.sum() > value.shape[0]: # there are duplicates
        txt_result = ('Duplicates found in column \'{}\', the max duplicate item repeats {} times.'.format(name, value.max()) ) # print results, return indexes
    else: # no duplicates
        txt_result = ('No duplicates found in column \'{}\'.'.format(name) )
        #print('{}: No duplicates found.'.format(col) )
    # pack variables into list
    export.append(value_cnt)
    export.append(txt_result)
    
    return export

In [6]:
def go_assess(df):
    # empty every function call, to prevent list from accumulating over time
    results = [] #
    summary = [] #
    val_sum = [] # 
    assessment = []
    print('Dataframe contains the following columns:')
    print('{}\n'.format(df.columns) )

    for i, col in enumerate(df.columns):
        # copy into message
        print('Column {} - \'{}\' has been assessed. Assessment saved in results[{}] and summary[{}]'.format(i, col, i, i))
        
        # call and get results
        val_sum = get_values(df, df[col], col)

        # append results
        summary.append(val_sum[1])
        results.append(val_sum[0])

    assessment.append(summary)
    assessment.append(results)
    print('NOTE: To access variables, set a series name e.g below:\nseries[0][x] to access summary details.\nseries[1][x] to access the value_counts results.\nx represents column number')
    return assessment #

In [7]:
def trim_strings(df):
    for col in df:
        if df[col].dtype == 'object':
            startcount = df[col].str.len()
            df[col].str.strip()
            endcount = df[col].str.len()
            print

            if (startcount.sum() - endcount.sum()) > 0:
                print('Whitespaces were present in {}.'.format(col) )
            else:
                print('No whitespaces in {}.'.format(col) )

## Data Wrangling

## Iteration 1
Import data from a twitter user archive provided by the end-user

`Note: Add edit# upon addition of new issue.`

### Gathering 1
#### Initialize
Enter Known Input Info
Format: file name inside ''

In [8]:
# FILE 1 - TWITTER ARCHIVE DATA
folder = 'Incoming_Files/'
twitter_file = 'twitter-archive-enhanced-2.csv'
add_files(twitter_file)

twitter-archive-enhanced-2.csv added to file list.
1 file now in list.


'twitter-archive-enhanced-2.csv'

In [9]:
# FILE 2 - TWITTER ML IMAGE PREDICTIONS
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'

# assign to a response object
response = requests.get(url)

image_predictions = url.split('/')[-1] # extract file name

# with open, allows for the auto close file when complete
# split after last delimiter /, indicating file name
with open(os.path.join(folder, image_predictions), mode='wb') as file:
    # read file 
    file.write(response.content)
    print('{} has been saved in: "/{}"'.format(image_predictions, folder) )

# call function and add name to end of list
add_files(image_predictions)

image-predictions.tsv has been saved in: "/Incoming_Files/"
image-predictions.tsv added to file list.
2 files now in list.


'image-predictions.tsv'

In [10]:
run_script = str(input('Run script to Access Twitter API (Y/N)?'))
valid_input = ['n', 'N', 'y', 'Y']
yes_list = ['y','Y']
no_list = ['n','N']

In [11]:
# FILE 3 - TWITTER API JSON
# run python script, pass dataframe name (dataframe could not be passed)
while run_script not in valid_input:
    run_script = input('Wrong input. Run script to Access Twitter API (Y/N)?')

if run_script in yes_list:
    folderarg = folder.replace(' ', '_')
    print('Running. Will indicate when complete.\n')
    %run twitter-api.py $folderarg $filelist[0]
elif run_script in no_list:
    print('Script not running.')

Script not running.


In [12]:
## Twitter API data
API_cols = ['tweet_id', 'retweet_count', 'fav_count']

# read in txt and convert to json
API_export = 'tweet_json.txt'
# call function and add name to end of list
add_files(API_export)

json_keys, json_id, json_fav_count, json_retweet_cnt = [], [],[],[]

with open(API_export) as txt_file:
    for line in txt_file:
        #print(line)
        json_obj = json.loads(line)
        #append to list then combine lists 
        json_keys.append(json_obj)
        json_id.append(json_obj['id_str'])
        json_fav_count.append(json_obj['favorite_count'])
        json_retweet_cnt.append(json_obj['retweet_count'])
        

tweet_json.txt added to file list.
3 files now in list.


In [13]:
json_obj

{'created_at': 'Sun Nov 15 22:32:08 +0000 2015',
 'id': 666020888022790149,
 'id_str': '666020888022790149',
 'full_text': 'Here we have a Japanese Irish Setter. Lost eye in Vietnam (?). Big fan of relaxing on stair. 8/10 would pet https://t.co/BLDqew2Ijj',
 'truncated': False,
 'display_text_range': [0, 131],
 'entities': {'hashtags': [],
  'symbols': [],
  'user_mentions': [],
  'urls': [],
  'media': [{'id': 666020881337073664,
    'id_str': '666020881337073664',
    'indices': [108, 131],
    'media_url': 'http://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg',
    'media_url_https': 'https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg',
    'url': 'https://t.co/BLDqew2Ijj',
    'display_url': 'pic.twitter.com/BLDqew2Ijj',
    'expanded_url': 'https://twitter.com/dog_rates/status/666020888022790149/photo/1',
    'type': 'photo',
    'sizes': {'medium': {'w': 960, 'h': 720, 'resize': 'fit'},
     'thumb': {'w': 150, 'h': 150, 'resize': 'crop'},
     'large': {'w': 960, 'h': 720, 'resize': 'fit

#### Import into dataframes

In [14]:
 # IMPROVE
 # create empty list
df_raw = []
file_extensions = []

# dataframe to contain original imports
for num, file in enumerate(filelist):
    ext = file.split('.')[-1]
    file_extensions.append(ext)
    # read extension type
    ## catch CSV, TSV, JSON, no Switch/Case in Python
    if ext == 'csv':
        df_raw.append(pd.read_csv(folder + file) )
    elif ext == 'tsv':
        df_raw.append(pd.read_csv(folder + file, sep='\t') )
    elif file == 'tweet_json.txt': # improve for general twitter api scrap import
        df_raw.append(pd.DataFrame(zip(json_id, json_fav_count, json_retweet_cnt), columns=API_cols))
    else:
        print('filelist({}) - "{}", could not be read into a dataframe.'.format(num, filelist[num]) )

In [15]:
print(filelist)

['twitter-archive-enhanced-2.csv', 'image-predictions.tsv', 'tweet_json.txt']


In [16]:
df_raw[0].sample(3)  # visually assess file was read in correctly

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
755,778748913645780993,,,2016-09-22 00:13:04 +0000,"<a href=""http://twitter.com/download/iphone"" r...","This is Mya (pronounced ""mmmyah?""). Her head i...",,,,https://twitter.com/dog_rates/status/778748913...,11,10,Mya,,,,
2060,671182547775299584,,,2015-11-30 04:22:44 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This pup holds the secrets of the universe in ...,,,,https://twitter.com/dog_rates/status/671182547...,12,10,,,,,
408,823581115634085888,,,2017-01-23 17:20:14 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Crawford. He's quite h*ckin good at th...,,,,https://twitter.com/dog_rates/status/823581115...,11,10,Crawford,,,,


In [17]:
df_twitter = df_raw[0].copy()

In [18]:
df_raw[1].sample(3)  # visually assess file was read in correctly

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1153,732726085725589504,https://pbs.twimg.com/media/CisqdVcXEAE3iW7.jpg,1,Pomeranian,0.961902,True,Samoyed,0.024289,True,chow,0.005772,True
1160,734776360183431168,https://pbs.twimg.com/media/CjJzMlBUoAADMLx.jpg,1,Siberian_husky,0.304902,True,Eskimo_dog,0.155147,True,malamute,0.050942,True
2035,884162670584377345,https://pbs.twimg.com/media/DEUtQbzW0AUTv_o.jpg,1,German_shepherd,0.707046,True,malinois,0.199396,True,Norwegian_elkhound,0.049148,True


In [19]:
df_image_predictor = df_raw[1].copy() # create copy

In [20]:
df_raw[2].sample(3)  # visually assess file was read in correctly

Unnamed: 0,tweet_id,retweet_count,fav_count
659,789280767834746880,0,4890
1975,672488522314567680,1047,404
2302,666407126856765440,98,31


In [21]:
df_twitter_api = df_raw[2].copy()

In [22]:
df_twitter_api.sample(3)

Unnamed: 0,tweet_id,retweet_count,fav_count
334,831670449226514432,10261,1771
688,785533386513321988,9050,1975
1350,701952816642965504,3738,993


In [23]:
# DO - search Incoming Files directory, files not in list to be added.

## Assessing data
### Assess 1 - Twitter Data Archive
#### Define:<br>

In [24]:
# external windows open
twitter_gui = show(df_raw[0])

In [25]:
df_twitter.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [26]:
df_twitter.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


In [27]:
# call go_assess function
archive_assessed = go_assess(df_twitter)

Dataframe contains the following columns:
Index(['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
       'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator',
       'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo'],
      dtype='object')

Column 0 - 'tweet_id' has been assessed. Assessment saved in results[0] and summary[0]
Column 1 - 'in_reply_to_status_id' has been assessed. Assessment saved in results[1] and summary[1]
Column 2 - 'in_reply_to_user_id' has been assessed. Assessment saved in results[2] and summary[2]
Column 3 - 'timestamp' has been assessed. Assessment saved in results[3] and summary[3]
Column 4 - 'source' has been assessed. Assessment saved in results[4] and summary[4]
Column 5 - 'text' has been assessed. Assessment saved in results[5] and summary[5]
Column 6 - 'retweeted_status_id' has been assessed. Assessment saved in results[6] and 

### Column 0 - tweet_id

In [28]:
### Column 0 - 
archive_assessed[0][0], archive_assessed[1][0]

("No duplicates found in column 'tweet_id'.",
 749075273010798592    1
 741099773336379392    1
 798644042770751489    1
 825120256414846976    1
 769212283578875904    1
                      ..
 715360349751484417    1
 666817836334096384    1
 794926597468000259    1
 673705679337693185    1
 700151421916807169    1
 Name: tweet_id, Length: 2356, dtype: int64)

### Column 1 - in reply

In [29]:
### Column 1 - 
archive_assessed[0][1], archive_assessed[1][1]

("Duplicates found in column 'in_reply_to_status_id', the max duplicate item repeats 2 times.",
 6.671522e+17    2
 8.562860e+17    1
 8.131273e+17    1
 6.754971e+17    1
 6.827884e+17    1
                ..
 8.482121e+17    1
 6.715449e+17    1
 6.936422e+17    1
 6.849598e+17    1
 7.331095e+17    1
 Name: in_reply_to_status_id, Length: 77, dtype: int64)

In [30]:
df_twitter[df_twitter.in_reply_to_status_id.notna()]['in_reply_to_status_id'].sample(5)

189     8.558585e+17
234     8.476062e+17
1339    6.671522e+17
611     7.971238e+17
30      8.862664e+17
Name: in_reply_to_status_id, dtype: float64

In [31]:
### Column 2 - 
archive_assessed[0][2], archive_assessed[1][2]

("Duplicates found in column 'in_reply_to_user_id', the max duplicate item repeats 47 times.",
 4.196984e+09    47
 2.195506e+07     2
 7.305050e+17     1
 2.916630e+07     1
 3.105441e+09     1
 2.918590e+08     1
 2.792810e+08     1
 2.319108e+09     1
 1.806710e+08     1
 3.058208e+07     1
 2.625958e+07     1
 1.943518e+08     1
 3.589728e+08     1
 8.405479e+17     1
 2.894131e+09     1
 2.143566e+07     1
 2.281182e+09     1
 1.648776e+07     1
 4.717297e+09     1
 2.878549e+07     1
 1.582854e+09     1
 4.670367e+08     1
 4.738443e+07     1
 1.361572e+07     1
 1.584641e+07     1
 2.068372e+07     1
 1.637468e+07     1
 1.185634e+07     1
 1.198989e+09     1
 1.132119e+08     1
 7.759620e+07     1
 Name: in_reply_to_user_id, dtype: int64)

In [32]:
### Column 3 - 

In [33]:
archive_assessed[0][3], archive_assessed[1][3]

("No duplicates found in column 'timestamp'.",
 2015-12-21 00:53:29 +0000    1
 2017-03-07 01:17:48 +0000    1
 2016-11-20 04:06:37 +0000    1
 2016-11-08 22:25:27 +0000    1
 2015-12-31 22:57:47 +0000    1
                             ..
 2016-03-02 16:23:36 +0000    1
 2016-07-28 19:06:01 +0000    1
 2016-08-26 16:37:54 +0000    1
 2015-12-23 03:26:43 +0000    1
 2015-12-09 02:56:22 +0000    1
 Name: timestamp, Length: 2356, dtype: int64)

In [34]:
### Column 4 - 
archive_assessed[0][4], archive_assessed[1][4]

("Duplicates found in column 'source', the max duplicate item repeats 2221 times.",
 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     2221
 <a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                          91
 <a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                       33
 <a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      11
 Name: source, dtype: int64)

In [35]:
### Column 5 - 
archive_assessed[0][5], archive_assessed[1][5]

("No duplicates found in column 'text'.",
 Please don't send in photos without dogs in them. We're not @porch_rates. Insubordinate and churlish. Pretty good porch tho 11/10 https://t.co/HauE8M3Bu4    1
 This is Duchess. She uses dark doggo forces to levitate her toys. 13/10 magical af https://t.co/maDNMETA52                                                   1
 Meet Saydee. She's a Rochester  Ecclesiastical. Jumped off cliff and caught stick on way down. 11/10 1st round pick https://t.co/Eh2v0AyJbi                  1
 Meet Lucky. He was showing his friends an extreme pogo stick trick when he completely lost control. 10/10 still rad https://t.co/K55XrIoePl                  1
 This is all I want in my life. 12/10 for super sleepy pupper https://t.co/4RlLA5ObMh                                                                         1
                                                                                                                                                             .

In [36]:
### Column 6 - 
archive_assessed[0][6], archive_assessed[1][6]

("No duplicates found in column 'retweeted_status_id'.",
 7.757333e+17    1
 7.507196e+17    1
 6.742918e+17    1
 6.833919e+17    1
 8.269587e+17    1
                ..
 7.848260e+17    1
 7.806013e+17    1
 8.305833e+17    1
 7.047611e+17    1
 7.331095e+17    1
 Name: retweeted_status_id, Length: 181, dtype: int64)

In [37]:
### Column 7 - 
archive_assessed[0][7], archive_assessed[1][7]

("Duplicates found in column 'retweeted_status_user_id', the max duplicate item repeats 156 times.",
 4.196984e+09    156
 4.296832e+09      2
 5.870972e+07      1
 6.669901e+07      1
 4.119842e+07      1
 7.475543e+17      1
 7.832140e+05      1
 7.266347e+08      1
 4.871977e+08      1
 5.970642e+08      1
 4.466750e+07      1
 1.228326e+09      1
 7.992370e+07      1
 2.488557e+07      1
 7.874618e+17      1
 3.638908e+08      1
 5.128045e+08      1
 8.117408e+08      1
 1.732729e+09      1
 1.960740e+07      1
 1.547674e+08      1
 3.410211e+08      1
 7.124572e+17      1
 2.804798e+08      1
 1.950368e+08      1
 Name: retweeted_status_user_id, dtype: int64)

In [38]:
### Column 8 - 
archive_assessed[0][8], archive_assessed[1][8]

("No duplicates found in column 'retweeted_status_timestamp'.",
 2016-08-08 17:19:51 +0000    1
 2016-01-25 00:26:41 +0000    1
 2017-01-11 02:15:36 +0000    1
 2017-02-24 23:04:14 +0000    1
 2017-02-16 23:23:38 +0000    1
                             ..
 2016-03-01 20:11:59 +0000    1
 2017-01-06 17:33:29 +0000    1
 2016-08-28 16:51:16 +0000    1
 2016-10-13 23:23:56 +0000    1
 2015-12-28 17:12:42 +0000    1
 Name: retweeted_status_timestamp, Length: 181, dtype: int64)

In [39]:
### Column 9 - 
archive_assessed[0][9], archive_assessed[1][9]

("Duplicates found in column 'expanded_urls', the max duplicate item repeats 2 times.",
 https://twitter.com/dog_rates/status/786233965241827333/photo/1                                                                                                                                    2
 https://twitter.com/dog_rates/status/816450570814898180/photo/1,https://twitter.com/dog_rates/status/816450570814898180/photo/1                                                                    2
 https://twitter.com/dog_rates/status/667138269671505920/photo/1                                                                                                                                    2
 https://twitter.com/dog_rates/status/679462823135686656/photo/1                                                                                                                                    2
 https://twitter.com/dog_rates/status/866334964761202691/photo/1,https://twitter.com/dog_rates/status/866334964761202691

In [40]:
### Column 10 - 
archive_assessed[0][10], archive_assessed[1][10]

("Duplicates found in column 'rating_numerator', the max duplicate item repeats 558 times.",
 12      558
 11      464
 10      461
 13      351
 9       158
 8       102
 7        55
 14       54
 5        37
 6        32
 3        19
 4        17
 1         9
 2         9
 420       2
 0         2
 15        2
 75        2
 80        1
 20        1
 24        1
 26        1
 44        1
 50        1
 60        1
 165       1
 84        1
 88        1
 144       1
 182       1
 143       1
 666       1
 960       1
 1776      1
 17        1
 27        1
 45        1
 99        1
 121       1
 204       1
 Name: rating_numerator, dtype: int64)

In [41]:
### Column 11 - 
archive_assessed[0][11], archive_assessed[1][11]

("Duplicates found in column 'rating_denominator', the max duplicate item repeats 2333 times.",
 10     2333
 11        3
 50        3
 80        2
 20        2
 2         1
 16        1
 40        1
 70        1
 15        1
 90        1
 110       1
 120       1
 130       1
 150       1
 170       1
 7         1
 0         1
 Name: rating_denominator, dtype: int64)

In [42]:
### Column 12 - name

In [43]:
archive_assessed[0][12], archive_assessed[1][12]

("Duplicates found in column 'name', the max duplicate item repeats 745 times.",
 None        745
 a            55
 Charlie      12
 Oliver       11
 Cooper       11
            ... 
 Ralf          1
 Pumpkin       1
 Carll         1
 Ralphson      1
 Lilah         1
 Name: name, Length: 957, dtype: int64)

## Assess 2 - Twitter Image Predictions

In [44]:
# external windows open
predictions_gui = show(df_raw[1])

In [45]:
df_image_predictor.sample(3)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
36,666447344410484738,https://pbs.twimg.com/media/CT-yU5QWwAEjLX5.jpg,1,curly-coated_retriever,0.322084,True,giant_schnauzer,0.287955,True,Labrador_retriever,0.166331,True
1842,838476387338051585,https://pbs.twimg.com/media/C6Ld0wYWgAQQqMC.jpg,3,Great_Pyrenees,0.997692,True,kuvasz,0.001001,True,Newfoundland,0.000405,True
1542,791312159183634433,https://pbs.twimg.com/media/CvtONV4WAAAQ3Rn.jpg,4,miniature_pinscher,0.892925,True,toy_terrier,0.095524,True,Doberman,0.003544,True


In [46]:
df_image_predictor.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [47]:
df_image_predictor.describe()

Unnamed: 0,tweet_id,img_num,p1_conf,p2_conf,p3_conf
count,2075.0,2075.0,2075.0,2075.0,2075.0
mean,7.384514e+17,1.203855,0.594548,0.1345886,0.06032417
std,6.785203e+16,0.561875,0.271174,0.1006657,0.05090593
min,6.660209e+17,1.0,0.044333,1.0113e-08,1.74017e-10
25%,6.764835e+17,1.0,0.364412,0.05388625,0.0162224
50%,7.119988e+17,1.0,0.58823,0.118181,0.0494438
75%,7.932034e+17,1.0,0.843855,0.1955655,0.09180755
max,8.924206e+17,4.0,1.0,0.488014,0.273419


In [48]:
img_assessed = go_assess(df_image_predictor)

Dataframe contains the following columns:
Index(['tweet_id', 'jpg_url', 'img_num', 'p1', 'p1_conf', 'p1_dog', 'p2',
       'p2_conf', 'p2_dog', 'p3', 'p3_conf', 'p3_dog'],
      dtype='object')

Column 0 - 'tweet_id' has been assessed. Assessment saved in results[0] and summary[0]
Column 1 - 'jpg_url' has been assessed. Assessment saved in results[1] and summary[1]
Column 2 - 'img_num' has been assessed. Assessment saved in results[2] and summary[2]
Column 3 - 'p1' has been assessed. Assessment saved in results[3] and summary[3]
Column 4 - 'p1_conf' has been assessed. Assessment saved in results[4] and summary[4]
Column 5 - 'p1_dog' has been assessed. Assessment saved in results[5] and summary[5]
Column 6 - 'p2' has been assessed. Assessment saved in results[6] and summary[6]
Column 7 - 'p2_conf' has been assessed. Assessment saved in results[7] and summary[7]
Column 8 - 'p2_dog' has been assessed. Assessment saved in results[8] and summary[8]
Column 9 - 'p3' has been assessed. Assessm

In [50]:
### Column 0
img_assessed[0][0], img_assessed[1][0]

("No duplicates found in column 'tweet_id'.",
 685532292383666176    1
 826598365270007810    1
 692158366030913536    1
 714606013974974464    1
 715696743237730304    1
                      ..
 816829038950027264    1
 847971574464610304    1
 713175907180089344    1
 670338931251150849    1
 700151421916807169    1
 Name: tweet_id, Length: 2075, dtype: int64)

In [51]:
### Column 1
# search for files other then .jpg, use .split and sift through values
not_jpg = df_image_predictor[~df_image_predictor.jpg_url.str.contains('.jpg',)]
not_jpg.jpg_url

320    https://pbs.twimg.com/tweet_video_thumb/CVKtH-...
815    https://pbs.twimg.com/tweet_video_thumb/CZ0mhd...
Name: jpg_url, dtype: object

In [52]:
img_assessed[0][1], img_assessed[1][1]

("Duplicates found in column 'jpg_url', the max duplicate item repeats 2 times.",
 https://pbs.twimg.com/ext_tw_video_thumb/815965888126062592/pu/img/JleSw4wRhgKDWQj5.jpg    2
 https://pbs.twimg.com/media/CrXhIqBW8AA6Bse.jpg                                            2
 https://pbs.twimg.com/media/Cp6db4-XYAAMmqL.jpg                                            2
 https://pbs.twimg.com/media/CV_cnjHWUAADc-c.jpg                                            2
 https://pbs.twimg.com/media/CvaYgDOWgAEfjls.jpg                                            2
                                                                                           ..
 https://pbs.twimg.com/media/CXAiiHUWkAIN_28.jpg                                            1
 https://pbs.twimg.com/media/C8m3-iQVoAAETnF.jpg                                            1
 https://pbs.twimg.com/ext_tw_video_thumb/744234667679821824/pu/img/1GaWmtJtdqzZV7jy.jpg    1
 https://pbs.twimg.com/media/CrHqwjWXgAAgJSe.jpg                        

In [53]:
### Column 2
img_assessed[0][2], img_assessed[1][2]

("Duplicates found in column 'img_num', the max duplicate item repeats 1780 times.",
 1    1780
 2     198
 3      66
 4      31
 Name: img_num, dtype: int64)

In [54]:
### Column 3
img_assessed[0][3], img_assessed[1][3]

("Duplicates found in column 'p1', the max duplicate item repeats 150 times.",
 golden_retriever      150
 Labrador_retriever    100
 Pembroke               89
 Chihuahua              83
 pug                    57
                      ... 
 canoe                   1
 lawn_mower              1
 fountain                1
 tricycle                1
 shopping_basket         1
 Name: p1, Length: 378, dtype: int64)

In [55]:
is_ws = df_image_predictor[df_image_predictor.p1.str.contains(' ',)]
is_ws

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog


In [56]:
mask = img_assessed[1][3] == 1
img_assessed[1][3][mask]

PandasGUI INFO — numexpr.utils — Note: NumExpr detected 12 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
PandasGUI INFO — numexpr.utils — NumExpr defaulting to 8 threads.


cup                1
cuirass            1
clumber            1
soccer_ball        1
loupe              1
                  ..
canoe              1
lawn_mower         1
fountain           1
tricycle           1
shopping_basket    1
Name: p1, Length: 175, dtype: int64

In [57]:
### Column 4
img_assessed[0][4], img_assessed[1][4]

("Duplicates found in column 'p1_conf', the max duplicate item repeats 2 times.",
 0.366248    2
 0.713293    2
 0.375098    2
 0.636169    2
 0.611525    2
            ..
 0.713102    1
 0.765266    1
 0.491022    1
 0.905334    1
 1.000000    1
 Name: p1_conf, Length: 2006, dtype: int64)

In [58]:
### Column 5
img_assessed[0][5], img_assessed[1][5]

("Duplicates found in column 'p1_dog', the max duplicate item repeats 1532 times.",
 True     1532
 False     543
 Name: p1_dog, dtype: int64)

In [59]:
df_image_predictor.query('p1_dog == False').iloc[:, [0,1,3,5]]

Unnamed: 0,tweet_id,jpg_url,p1,p1_dog
6,666051853826850816,https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg,box_turtle,False
8,666057090499244032,https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg,shopping_cart,False
17,666104133288665088,https://pbs.twimg.com/media/CT56LSZWoAAlJj2.jpg,hen,False
18,666268910803644416,https://pbs.twimg.com/media/CT8QCd1WEAADXws.jpg,desktop_computer,False
21,666293911632134144,https://pbs.twimg.com/media/CT8mx7KW4AEQu8N.jpg,three-toed_sloth,False
...,...,...,...,...
2026,882045870035918850,https://pbs.twimg.com/media/DD2oCl2WAAEI_4a.jpg,web_site,False
2046,886680336477933568,https://pbs.twimg.com/media/DE4fEDzWAAAyHMM.jpg,convertible,False
2052,887517139158093824,https://pbs.twimg.com/ext_tw_video_thumb/88751...,limousine,False
2071,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,paper_towel,False


In [60]:
p1_false_results = df_image_predictor.query('p1_dog == False').iloc[:,:6]
p1_false_results

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog
6,666051853826850816,https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg,1,box_turtle,0.933012,False
8,666057090499244032,https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg,1,shopping_cart,0.962465,False
17,666104133288665088,https://pbs.twimg.com/media/CT56LSZWoAAlJj2.jpg,1,hen,0.965932,False
18,666268910803644416,https://pbs.twimg.com/media/CT8QCd1WEAADXws.jpg,1,desktop_computer,0.086502,False
21,666293911632134144,https://pbs.twimg.com/media/CT8mx7KW4AEQu8N.jpg,1,three-toed_sloth,0.914671,False
...,...,...,...,...,...,...
2026,882045870035918850,https://pbs.twimg.com/media/DD2oCl2WAAEI_4a.jpg,1,web_site,0.949591,False
2046,886680336477933568,https://pbs.twimg.com/media/DE4fEDzWAAAyHMM.jpg,1,convertible,0.738995,False
2052,887517139158093824,https://pbs.twimg.com/ext_tw_video_thumb/88751...,1,limousine,0.130432,False
2071,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,paper_towel,0.170278,False


In [61]:
p1_false_results.groupby(['p1']).size()

p1
African_crocodile      1
African_grey           1
African_hunting_dog    1
American_black_bear    1
Angora                 2
                      ..
wombat                 4
wood_rabbit            3
wooden_spoon           1
wool                   2
zebra                  1
Length: 267, dtype: int64

In [62]:
### Column 6
img_assessed[0][6], img_assessed[1][6]

("Duplicates found in column 'p2', the max duplicate item repeats 104 times.",
 Labrador_retriever    104
 golden_retriever       92
 Cardigan               73
 Chihuahua              44
 Pomeranian             42
                      ... 
 sulphur_butterfly       1
 affenpinscher           1
 basketball              1
 television              1
 hair_slide              1
 Name: p2, Length: 405, dtype: int64)

In [63]:
### Column 7
img_assessed[0][7], img_assessed[1][7]

("Duplicates found in column 'p2_conf', the max duplicate item repeats 3 times.",
 0.069362    3
 0.027907    2
 0.193654    2
 0.271929    2
 0.003143    2
            ..
 0.138331    1
 0.254884    1
 0.090644    1
 0.219323    1
 0.016301    1
 Name: p2_conf, Length: 2004, dtype: int64)

In [64]:
### Column 8
img_assessed[0][8], img_assessed[1][8]

("Duplicates found in column 'p2_dog', the max duplicate item repeats 1553 times.",
 True     1553
 False     522
 Name: p2_dog, dtype: int64)

In [65]:
### Column 9
img_assessed[0][9], img_assessed[1][9]

("Duplicates found in column 'p3', the max duplicate item repeats 79 times.",
 Labrador_retriever    79
 Chihuahua             58
 golden_retriever      48
 Eskimo_dog            38
 kelpie                35
                       ..
 shoji                  1
 can_opener             1
 guillotine             1
 cowboy_boot            1
 plastic_bag            1
 Name: p3, Length: 408, dtype: int64)

In [66]:
### Column 10
img_assessed[0][10], img_assessed[1][10]

("Duplicates found in column 'p3_conf', the max duplicate item repeats 2 times.",
 0.094759    2
 0.035711    2
 0.000428    2
 0.044660    2
 0.162084    2
            ..
 0.024007    1
 0.132820    1
 0.002099    1
 0.083643    1
 0.033835    1
 Name: p3_conf, Length: 2006, dtype: int64)

In [67]:
### Column 11
img_assessed[0][11], img_assessed[1][11]

("Duplicates found in column 'p3_dog', the max duplicate item repeats 1499 times.",
 True     1499
 False     576
 Name: p3_dog, dtype: int64)

## Assess 3 - Twitter API Raw Data

In [68]:
# external windows open
api_gui = show(df_raw[2])

In [69]:
df_twitter_api.sample(3)

Unnamed: 0,tweet_id,retweet_count,fav_count
2204,668248472370458624,894,437
1306,705786532653883392,1877,499
300,835172783151792128,25684,5611


In [70]:
df_twitter_api.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2331 entries, 0 to 2330
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   tweet_id       2331 non-null   object
 1   retweet_count  2331 non-null   int64 
 2   fav_count      2331 non-null   int64 
dtypes: int64(2), object(1)
memory usage: 54.8+ KB


In [71]:
df_twitter_api.describe()

Unnamed: 0,retweet_count,fav_count
count,2331.0,2331.0
mean,7386.187473,2624.406692
std,11471.42746,4439.587366
min,0.0,1.0
25%,1283.0,532.0
50%,3207.0,1225.0
75%,9036.5,3044.5
max,152335.0,75413.0


In [72]:
#assess raw json api data
api_assessed = go_assess(df_twitter_api)

Dataframe contains the following columns:
Index(['tweet_id', 'retweet_count', 'fav_count'], dtype='object')

Column 0 - 'tweet_id' has been assessed. Assessment saved in results[0] and summary[0]
Column 1 - 'retweet_count' has been assessed. Assessment saved in results[1] and summary[1]
Column 2 - 'fav_count' has been assessed. Assessment saved in results[2] and summary[2]
NOTE: To access variables, set a series name e.g below:
series[0][x] to access summary details.
series[1][x] to access the value_counts results.
x represents column number


In [73]:
api_assessed = go_assess(df_twitter_api)

Dataframe contains the following columns:
Index(['tweet_id', 'retweet_count', 'fav_count'], dtype='object')

Column 0 - 'tweet_id' has been assessed. Assessment saved in results[0] and summary[0]
Column 1 - 'retweet_count' has been assessed. Assessment saved in results[1] and summary[1]
Column 2 - 'fav_count' has been assessed. Assessment saved in results[2] and summary[2]
NOTE: To access variables, set a series name e.g below:
series[0][x] to access summary details.
series[1][x] to access the value_counts results.
x represents column number


In [74]:
api_assessed[0][0], api_assessed[1][0]

("No duplicates found in column 'tweet_id'.",
 771171053431250945    1
 674737130913071104    1
 669603084620980224    1
 708853462201716736    1
 798697898615730177    1
                      ..
 669375718304980992    1
 704364645503647744    1
 671729906628341761    1
 767191397493538821    1
 701981390485725185    1
 Name: tweet_id, Length: 2331, dtype: int64)

In [75]:
api_assessed[0][1], api_assessed[1][1]

("Duplicates found in column 'retweet_count', the max duplicate item repeats 163 times.",
 0        163
 355        4
 1974       4
 2562       3
 1253       3
         ... 
 6544       1
 401        1
 403        1
 405        1
 22527      1
 Name: retweet_count, Length: 1954, dtype: int64)

In [76]:
api_assessed[0][2], api_assessed[1][2]

("Duplicates found in column 'fav_count', the max duplicate item repeats 5 times.",
 51       5
 445      5
 589      5
 41       5
 1161     4
         ..
 16705    1
 329      1
 4427     1
 339      1
 4096     1
 Name: fav_count, Length: 1676, dtype: int64)

### Assess Iteration 2

In [77]:
df_clean = []
df_clean.append(df_twitter)
df_clean.append(df_image_predictor)
df_clean.append(df_twitter_api)

In [78]:
for df in df_clean:
    print(df.shape)

(2356, 17)
(2075, 12)
(2331, 3)


## Cleaning data
### Quality Issue 1:
#### Define:
col0: tweet_id data type change to string, all dataframes

#### Code:

In [79]:
q1 = 'tweet_id'

In [80]:
# Print previous data types 
df_image_predictor[q1].head(1)
for df in df_clean:
    print(df[q1].head(1))

0    892420643555336193
Name: tweet_id, dtype: int64
0    666020888022790149
Name: tweet_id, dtype: int64
0    892420643555336193
Name: tweet_id, dtype: object


In [81]:
# Convert to string
for df in df_clean:
    df[q1] = df[q1].astype(str)

#### Test

In [82]:
df_image_predictor[q1].head(1)
for i, df in enumerate(df_clean):
    print(df[q1].head(1))

0    892420643555336193
Name: tweet_id, dtype: object
0    666020888022790149
Name: tweet_id, dtype: object
0    892420643555336193
Name: tweet_id, dtype: object


### Quality issue 2:
#### Define:
col3: change timestamp datatype to datetime

#### Code:

In [83]:
df_clean[0].timestamp = pd.to_datetime(df_clean[0].timestamp)

#### Test:

In [84]:
df_clean[0].timestamp

0      2017-08-01 16:23:56+00:00
1      2017-08-01 00:17:27+00:00
2      2017-07-31 00:18:03+00:00
3      2017-07-30 15:58:51+00:00
4      2017-07-29 16:00:24+00:00
                  ...           
2351   2015-11-16 00:24:50+00:00
2352   2015-11-16 00:04:52+00:00
2353   2015-11-15 23:21:54+00:00
2354   2015-11-15 23:05:30+00:00
2355   2015-11-15 22:32:08+00:00
Name: timestamp, Length: 2356, dtype: datetime64[ns, UTC]

### Quality Issue 3:
#### Define:
col4: split string to remove html tag and extract content within

In [85]:
archive_assessed[1][4]

<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     2221
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                          91
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                       33
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      11
Name: source, dtype: int64

#### Code:
strip string prior to splitting

In [86]:
df_clean[0].iloc[:,4] = df_clean[0].iloc[:,4].str.strip()

In [87]:
# RESET COLUMN if coded incorrectly
df_clean[0].iloc[:,4] = df_raw[0].iloc[:,4]

In [88]:
df_clean[0].iloc[:,4] = df_clean[0].iloc[:,4].apply(lambda text: BeautifulSoup(text, 'html.parser').get_text())

In [89]:
df_clean[0].rename(columns={'source':'source_app'}, inplace=True)

#### Test:

In [90]:
df_clean[0].iloc[:,4].value_counts()

Twitter for iPhone     2221
Vine - Make a Scene      91
Twitter Web Client       33
TweetDeck                11
Name: source_app, dtype: int64

### Quality Issue 4:
#### Define:
col1,2,6,7: change datatype from float to int
#### Code:

In [91]:
q4 = list(df_twitter.iloc[:0, [1,2,6,7]])
# Print previous data types 
for column in q4:
    print(df_twitter[column].head(0))

Series([], Name: in_reply_to_status_id, dtype: float64)
Series([], Name: in_reply_to_user_id, dtype: float64)
Series([], Name: retweeted_status_id, dtype: float64)
Series([], Name: retweeted_status_user_id, dtype: float64)


In [92]:
# Convert to string
for column in q4:
    df_twitter[column] = df_twitter[column].astype(str)

#### Test:

In [93]:
for column in q4:
    print(df_twitter[column].head(0))

Series([], Name: in_reply_to_status_id, dtype: object)
Series([], Name: in_reply_to_user_id, dtype: object)
Series([], Name: retweeted_status_id, dtype: object)
Series([], Name: retweeted_status_user_id, dtype: object)


### Quality Issue 5:
#### Define:
remove potential whitespaces across all string/objects, trim front and end as visual inspection appeared to show start of strings not inline when scrolling down.

In [94]:
df_clean[0].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype              
---  ------                      --------------  -----              
 0   tweet_id                    2356 non-null   object             
 1   in_reply_to_status_id       2356 non-null   object             
 2   in_reply_to_user_id         2356 non-null   object             
 3   timestamp                   2356 non-null   datetime64[ns, UTC]
 4   source_app                  2356 non-null   object             
 5   text                        2356 non-null   object             
 6   retweeted_status_id         2356 non-null   object             
 7   retweeted_status_user_id    2356 non-null   object             
 8   retweeted_status_timestamp  181 non-null    object             
 9   expanded_urls               2297 non-null   object             
 10  rating_numerator            2356 non-null   int64           

#### Code & Test:
`Improvement opportunity. Scan for object/string dtype and return if true to easily filter`

In [95]:
# call function
trim_strings(df_clean[0])

No whitespaces in tweet_id.
No whitespaces in in_reply_to_status_id.
No whitespaces in in_reply_to_user_id.
No whitespaces in source_app.
No whitespaces in text.
No whitespaces in retweeted_status_id.
No whitespaces in retweeted_status_user_id.
No whitespaces in retweeted_status_timestamp.
No whitespaces in expanded_urls.
No whitespaces in name.
No whitespaces in doggo.
No whitespaces in floofer.
No whitespaces in pupper.
No whitespaces in puppo.


In [96]:
trim_strings(df_clean[1])

No whitespaces in tweet_id.
No whitespaces in jpg_url.
No whitespaces in p1.
No whitespaces in p2.
No whitespaces in p3.


In [97]:
trim_strings(df_clean[2])

No whitespaces in tweet_id.


### Quality Issue 6:
#### Define:
df_image_predictor<br>
col3,6,9: change to lower case

In [98]:
q6 = list(df_clean[1].iloc[:0, [3,6,9]])

#### Code:

In [99]:
q6 = list(df_clean[1].iloc[:0, [3,6,9]])
# Print previous data types 
for column in q6:
    print(df_clean[1][column].head(5))

0    Welsh_springer_spaniel
1                   redbone
2           German_shepherd
3       Rhodesian_ridgeback
4        miniature_pinscher
Name: p1, dtype: object
0                collie
1    miniature_pinscher
2              malinois
3               redbone
4            Rottweiler
Name: p2, dtype: object
0      Shetland_sheepdog
1    Rhodesian_ridgeback
2             bloodhound
3     miniature_pinscher
4               Doberman
Name: p3, dtype: object


In [100]:
# RESET COLUMN if coded incorrectly
df_clean[1].iloc[:, [3,6,9]] = df_raw[1].iloc[:, [3,6,9]]

In [101]:
for column in q6:
    df_clean[1][column] = df_clean[1][column].str.lower()

#### Test:

In [102]:
for column in q6:
    print(df_clean[1][column].head(5))

0    welsh_springer_spaniel
1                   redbone
2           german_shepherd
3       rhodesian_ridgeback
4        miniature_pinscher
Name: p1, dtype: object
0                collie
1    miniature_pinscher
2              malinois
3               redbone
4            rottweiler
Name: p2, dtype: object
0      shetland_sheepdog
1    rhodesian_ridgeback
2             bloodhound
3     miniature_pinscher
4               doberman
Name: p3, dtype: object


### Quality Issue 7:
#### Define:
col1: rename from jpg_url to img_url

In [103]:
df_clean[1].iloc[:0, 1]

Series([], Name: jpg_url, dtype: object)

#### Code:

In [104]:
df_clean[1].rename(columns={'jpg_url':'img_url'}, inplace=True)

#### Test:

In [105]:
df_clean[1].iloc[:0, 1]

Series([], Name: img_url, dtype: object)

### Quality Issue 8:
#### Define:
col2: rename from img_num to conf_tweet_img

#### Code:

In [106]:
df_clean[1].iloc[:0, 2]

Series([], Name: img_num, dtype: int64)

In [107]:
### Quality Issue 8:
df_clean[1].rename(columns={'img_num':'conf_tweet_img'}, inplace=True)

#### Test:

In [108]:
df_clean[1].iloc[:0, 2]

Series([], Name: conf_tweet_img, dtype: int64)

### Quality Issue 9:
#### Define:
<br>check col12 to remove/replace incorrect names with None

#### Code:

In [109]:
names_list = df_clean[0].name.value_counts().index
names_list

Index(['None', 'a', 'Charlie', 'Oliver', 'Cooper', 'Lucy', 'Lola', 'Tucker',
       'Penny', 'Winston',
       ...
       'Stefan', 'Alfy', 'Crumpet', 'Callie', 'Todo', 'Ralf', 'Pumpkin',
       'Carll', 'Ralphson', 'Lilah'],
      dtype='object', length=957)

In [110]:
# extract names - regex test
name_mask = df_clean[0].name.str.match('[^A-Z]')
name_mask.value_counts()

False    2247
True      109
Name: name, dtype: int64

In [112]:
df_clean[0].name[name_mask].value_counts()

a               55
the              8
an               7
very             5
quite            4
just             4
one              4
actually         2
not              2
getting          2
mad              2
incredibly       1
all              1
light            1
such             1
officially       1
old              1
this             1
his              1
my               1
space            1
unacceptable     1
infuriating      1
life             1
by               1
Name: name, dtype: int64

In [113]:
df_clean[0].name.where(~name_mask)

0        Phineas
1          Tilly
2         Archie
3          Darla
4       Franklin
          ...   
2351        None
2352         NaN
2353         NaN
2354         NaN
2355        None
Name: name, Length: 2356, dtype: object

In [114]:
df_clean[0].name = df_clean[0].name.where(~name_mask,None)

#### Test:

In [115]:
df_clean[0].name

0        Phineas
1          Tilly
2         Archie
3          Darla
4       Franklin
          ...   
2351        None
2352        None
2353        None
2354        None
2355        None
Name: name, Length: 2356, dtype: object

### Quality Issue 10:
#### Define:
<br>check numerator rating value is correct.

In [116]:
# preview of strings in text column
list(df_clean[0].text.sample(5) )

['Please stop sending in non-canines like this Very Pettable Dozing Bath Tortoise. We only rate dogs. Only send dogs... 12/10 https://t.co/mcagPeENIh',
 'Say hello to Oliver. He thought what was inside the pillow should be outside the pillow. Blurry since birth. 8/10 https://t.co/lFU9W31Fg9',
 'This is Linda. She fucking hates trees. 7/10 https://t.co/blaY85FIxR',
 'Finally some constructive political change in this country. 11/10 https://t.co/mvQaETHVSb',
 'RT @dog_rates: Ohboyohboyohboyohboyohboyohboyohboyohboyohboyohboyohboyohboyohboyohboyohboy. 10/10 for all (by happytailsresort) https://t.c…']

#### Code:

In [117]:
# regex filter to extract 123.34/123 found visually and programmatically
df_clean[0]['rating'] = df_clean[0].text.str.extract(r'(\b\d{0,3}\.?\d{1,2}\/\d{2,3})', expand=True)

In [118]:
df_clean[0]['rating'].value_counts(dropna=False)

12/10       558
11/10       463
10/10       460
13/10       350
9/10        157
8/10        102
14/10        54
7/10         53
5/10         35
6/10         32
3/10         19
4/10         15
2/10          9
1/10          8
420/10        2
4/20          2
15/10         2
9.75/10       2
9/11          2
0/10          2
144/120       1
60/50         1
9.5/10        1
11.26/10      1
204/170       1
50/50         1
17/10         1
11.27/10      1
11/15         1
99/90         1
121/110       1
960/00        1
.13/10        1
.10/10        1
007/10        1
80/80         1
143/130       1
20/16         1
13.5/10       1
84/70         1
7/11          1
165/150       1
182/10        1
45/50         1
1776/10       1
44/40         1
88/80         1
666/10        1
NaN           1
Name: rating, dtype: int64

In [119]:
#remove .13 and .10 manually
# remove .13
x = df_clean[0].query('rating==".13/10"').rating.index[0]
df_clean[0]['rating'].iloc[x] = df_clean[0]['rating'].iloc[x].split('.')[1]
df_clean[0]['rating'].iloc[x]

'13/10'

In [120]:
# remove .10
x = df_clean[0].query('rating==".10/10"').rating.index[0]
df_clean[0]['rating'].iloc[x] = df_clean[0]['rating'].iloc[x].split('.')[1]
df_clean[0]['rating'].iloc[x]

'10/10'

In [121]:
df_clean[0].rating.value_counts(dropna=False), df_clean[0].rating.shape

(12/10       558
 11/10       463
 10/10       461
 13/10       351
 9/10        157
 8/10        102
 14/10        54
 7/10         53
 5/10         35
 6/10         32
 3/10         19
 4/10         15
 2/10          9
 1/10          8
 15/10         2
 9/11          2
 9.75/10       2
 0/10          2
 4/20          2
 420/10        2
 11.27/10      1
 99/90         1
 60/50         1
 144/120       1
 11/15         1
 17/10         1
 11.26/10      1
 121/110       1
 50/50         1
 204/170       1
 9.5/10        1
 960/00        1
 165/150       1
 666/10        1
 88/80         1
 44/40         1
 1776/10       1
 45/50         1
 182/10        1
 7/11          1
 84/70         1
 13.5/10       1
 20/16         1
 143/130       1
 80/80         1
 007/10        1
 NaN           1
 Name: rating, dtype: int64,
 (2356,))

In [122]:
# check for non regex matches
checknull = df_clean[0].rating.isnull()

In [123]:
list(df_clean[0][checknull].text), list(df_clean[0][checknull].rating)

(['Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. \nKeep Sam smiling by clicking and sharing this link:\nhttps://t.co/98tB8y7y7t https://t.co/LouL5vdvxx'],
 [nan])

In [124]:
decimal_mask = df_clean[0].rating.str.contains('\.', na=False) # to remove error
decimal_mask.value_counts()

False    2350
True        6
Name: rating, dtype: int64

In [125]:
df_clean[0][decimal_mask].rating

45       13.5/10
340      9.75/10
695      9.75/10
763     11.27/10
1689      9.5/10
1712    11.26/10
Name: rating, dtype: object

In [126]:
decimal_index = df_clean[0][decimal_mask].index
decimal_index

Int64Index([45, 340, 695, 763, 1689, 1712], dtype='int64')

In [127]:
clean_ratings = df_clean[0].rating.str.split('/', n=2, expand=True).astype(float)
mid = clean_ratings[0].median()
clean_ratings[0].fillna(mid, inplace=True)
clean_ratings[1].fillna(10, inplace=True)
clean_ratings

Unnamed: 0,0,1
0,13.0,10.0
1,13.0,10.0
2,12.0,10.0
3,13.0,10.0
4,12.0,10.0
...,...,...
2351,5.0,10.0
2352,6.0,10.0
2353,9.0,10.0
2354,7.0,10.0


In [128]:
clean_ratings[0].dtype, clean_ratings[0].dtype

(dtype('float64'), dtype('float64'))

In [129]:
clean_ratings[0].value_counts()

12.00      558
11.00      465
10.00      461
13.00      351
9.00       159
8.00       102
7.00        55
14.00       54
5.00        35
6.00        32
3.00        19
4.00        17
2.00         9
1.00         8
0.00         2
420.00       2
9.75         2
15.00        2
960.00       1
84.00        1
17.00        1
13.50        1
143.00       1
50.00        1
121.00       1
182.00       1
165.00       1
45.00        1
204.00       1
1776.00      1
666.00       1
99.00        1
11.27        1
11.26        1
88.00        1
144.00       1
9.50         1
20.00        1
44.00        1
60.00        1
80.00        1
Name: 0, dtype: int64

In [130]:
clean_ratings[1].value_counts()

10.0     2335
50.0        3
11.0        3
80.0        2
20.0        2
150.0       1
110.0       1
90.0        1
130.0       1
70.0        1
170.0       1
120.0       1
16.0        1
40.0        1
15.0        1
0.0         1
Name: 1, dtype: int64

In [131]:
df_clean[0].rating_numerator = clean_ratings[0].astype(float)
df_clean[0].rating_denominator = clean_ratings[1].astype(int)

#### Test:

In [132]:
df_clean[0].rating_denominator.isnull().values.any(), df_clean[0].rating_numerator.isnull().values.any()

(False, False)

In [133]:
df_clean[0].drop('rating', axis=1, inplace=True)

In [134]:
df_clean[0].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype              
---  ------                      --------------  -----              
 0   tweet_id                    2356 non-null   object             
 1   in_reply_to_status_id       2356 non-null   object             
 2   in_reply_to_user_id         2356 non-null   object             
 3   timestamp                   2356 non-null   datetime64[ns, UTC]
 4   source_app                  2356 non-null   object             
 5   text                        2356 non-null   object             
 6   retweeted_status_id         2356 non-null   object             
 7   retweeted_status_user_id    2356 non-null   object             
 8   retweeted_status_timestamp  181 non-null    object             
 9   expanded_urls               2297 non-null   object             
 10  rating_numerator            2356 non-null   float64         

In [135]:
df_clean[0].rating_numerator.value_counts()

12.00      558
11.00      465
10.00      461
13.00      351
9.00       159
8.00       102
7.00        55
14.00       54
5.00        35
6.00        32
3.00        19
4.00        17
2.00         9
1.00         8
0.00         2
420.00       2
9.75         2
15.00        2
960.00       1
84.00        1
17.00        1
13.50        1
143.00       1
50.00        1
121.00       1
182.00       1
165.00       1
45.00        1
204.00       1
1776.00      1
666.00       1
99.00        1
11.27        1
11.26        1
88.00        1
144.00       1
9.50         1
20.00        1
44.00        1
60.00        1
80.00        1
Name: rating_numerator, dtype: int64

In [136]:
df_clean[0].rating_denominator.value_counts()

10     2335
11        3
50        3
80        2
20        2
15        1
170       1
150       1
130       1
120       1
110       1
90        1
70        1
40        1
16        1
0         1
Name: rating_denominator, dtype: int64

In [137]:
df_clean[0].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype              
---  ------                      --------------  -----              
 0   tweet_id                    2356 non-null   object             
 1   in_reply_to_status_id       2356 non-null   object             
 2   in_reply_to_user_id         2356 non-null   object             
 3   timestamp                   2356 non-null   datetime64[ns, UTC]
 4   source_app                  2356 non-null   object             
 5   text                        2356 non-null   object             
 6   retweeted_status_id         2356 non-null   object             
 7   retweeted_status_user_id    2356 non-null   object             
 8   retweeted_status_timestamp  181 non-null    object             
 9   expanded_urls               2297 non-null   object             
 10  rating_numerator            2356 non-null   float64         

### Quality Issue 11:
#### Define:
remove retweets, i.e. 'RT @' in text

In [138]:
df_clean[0].in_reply_to_user_id.value_counts()

nan                      2278
4196983835.0               47
21955058.0                  2
29166305.0                  1
2281181600.0                1
20683724.0                  1
77596200.0                  1
2894131180.0                1
2319108198.0                1
30582082.0                  1
180670967.0                 1
26259576.0                  1
13615722.0                  1
21435658.0                  1
291859009.0                 1
358972768.0                 1
4717297476.0                1
113211856.0                 1
15846407.0                  1
16374678.0                  1
28785486.0                  1
1198988510.0                1
16487760.0                  1
47384430.0                  1
11856342.0                  1
1582853809.0                1
279280991.0                 1
3105440746.0                1
467036706.0                 1
194351775.0                 1
7.305050141505823e+17       1
8.405478643549184e+17       1
Name: in_reply_to_user_id, dtype: int64

#### Code:

In [139]:
df_clean[0] = df_clean[0].query('in_reply_to_user_id=="nan" & retweeted_status_user_id=="nan"')

#### Test:

In [140]:
df_clean[0].in_reply_to_user_id.value_counts()

nan    2097
Name: in_reply_to_user_id, dtype: int64

In [141]:
df_clean[0].retweeted_status_id.value_counts()

nan    2097
Name: retweeted_status_id, dtype: int64

### Tidiness Issue 1:
#### Define:
timestamp split into three columns, date, time, timezone

In [142]:
df_clean[0].shape, df_clean[1].shape, df_clean[2].shape

((2097, 17), (2075, 12), (2331, 3))

#### Code:


In [143]:
df_twitter.timestamp.sample(5)

449    2017-01-11 02:15:36+00:00
584    2016-11-20 00:59:15+00:00
1116   2016-05-17 14:57:41+00:00
568    2016-11-25 16:22:55+00:00
873    2016-08-04 22:52:29+00:00
Name: timestamp, dtype: datetime64[ns, UTC]

In [144]:
df_clean[0]['date'] = df_clean[0]['timestamp'].dt.date 
df_clean[0]['time'] = df_clean[0]['timestamp'].dt.time 
df_clean[0]['timezone'] = df_clean[0]['timestamp'].astype(str).str[-6:]
df_clean[0].drop(labels='timestamp', axis=1, inplace = True)

#### Test:


In [145]:
df_clean[0].iloc[:,16:]

Unnamed: 0,date,time,timezone
0,2017-08-01,16:23:56,+00:00
1,2017-08-01,00:17:27,+00:00
2,2017-07-31,00:18:03,+00:00
3,2017-07-30,15:58:51,+00:00
4,2017-07-29,16:00:24,+00:00
...,...,...,...
2351,2015-11-16,00:24:50,+00:00
2352,2015-11-16,00:04:52,+00:00
2353,2015-11-15,23:21:54,+00:00
2354,2015-11-15,23:05:30,+00:00


### Tidiness Issue 2:
#### Define:
categorize dog type into one column, and drop redundant columns.
### Quality Issue #:
#### Define:
change datatype into categorical

In [146]:
df_clean[0].iloc[:,11:16].sample(10)

Unnamed: 0,name,doggo,floofer,pupper,puppo
86,Goose,,,,
1112,Hermione,,,,
40,Kevin,,,,
436,,,,,
853,Louie,,,pupper,
34,Maisey,,,,
1436,Charlie,,,,
410,Wyatt,,,,
208,Wiggles,,,,
329,Poppy,,,,


#### Code:

In [147]:
df_clean[0]['dog_type'] = df_clean[0].text.str.extract('(doggo|floofer|pupper|puppo)', expand=False)
df_clean[0]['dog_type'] = df_clean[0]['dog_type'].astype('category')

In [148]:
drop_cols = list(df_clean[0].iloc[:1,12:16])
drop_cols

['doggo', 'floofer', 'pupper', 'puppo']

In [149]:
df_clean[0]['dog_type'].value_counts(dropna=False)

NaN        1744
pupper      240
doggo        80
puppo        29
floofer       4
Name: dog_type, dtype: int64

In [150]:
df_clean[0]['dog_type'].fillna('doggo', inplace=True)

In [151]:
df_clean[0]['dog_type'].value_counts(dropna=False)

doggo      1824
pupper      240
puppo        29
floofer       4
Name: dog_type, dtype: int64

In [152]:
df_clean[0].drop(drop_cols, axis=1, inplace=True)

#### Test:

In [153]:
df_clean[0]['dog_type'].value_counts(dropna=False)

doggo      1824
pupper      240
puppo        29
floofer       4
Name: dog_type, dtype: int64

In [154]:
list(df_clean[0].iloc[:0,:])

['tweet_id',
 'in_reply_to_status_id',
 'in_reply_to_user_id',
 'source_app',
 'text',
 'retweeted_status_id',
 'retweeted_status_user_id',
 'retweeted_status_timestamp',
 'expanded_urls',
 'rating_numerator',
 'rating_denominator',
 'name',
 'date',
 'time',
 'timezone',
 'dog_type']

### Tidiness Issue 3:
#### Define:
merge dataframes to contain the relevant columns required for analysis ensuring each is relevant to the information it pertains. Two dataframes in total.
One observation consisting of Twitter Data, another consisting of image predictions.

#### Code:


In [155]:
df_clean[0].shape

(2097, 16)

In [156]:
# merge twitter archive and api data first, keep predictions at the end (width wise) of data frame
twitter_archive_master = pd.merge(df_clean[0], df_clean[2], on='tweet_id', how='inner')

In [157]:
twitter_archive_master = pd.merge(twitter_archive_master, df_clean[1], on='tweet_id', how='inner')

#### Test:

In [158]:
twitter_archive_master.shape

(1964, 29)

In [159]:
twitter_archive_master.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,source_app,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,...,conf_tweet_img,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,892420643555336193,,,Twitter for iPhone,This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13.0,...,1,orange,0.097049,False,bagel,0.085851,False,banana,0.07611,False
1,892177421306343426,,,Twitter for iPhone,This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13.0,...,1,chihuahua,0.323581,True,pekinese,0.090647,True,papillon,0.068957,True
2,891815181378084864,,,Twitter for iPhone,This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12.0,...,1,chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True
3,891689557279858688,,,Twitter for iPhone,This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13.0,...,1,paper_towel,0.170278,False,labrador_retriever,0.168086,True,spatula,0.040836,False
4,891327558926688256,,,Twitter for iPhone,This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12.0,...,2,basset,0.555712,True,english_springer,0.22577,True,german_short-haired_pointer,0.175219,True


In [160]:
twitter_archive_master.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1964 entries, 0 to 1963
Data columns (total 29 columns):
 #   Column                      Non-Null Count  Dtype   
---  ------                      --------------  -----   
 0   tweet_id                    1964 non-null   object  
 1   in_reply_to_status_id       1964 non-null   object  
 2   in_reply_to_user_id         1964 non-null   object  
 3   source_app                  1964 non-null   object  
 4   text                        1964 non-null   object  
 5   retweeted_status_id         1964 non-null   object  
 6   retweeted_status_user_id    1964 non-null   object  
 7   retweeted_status_timestamp  0 non-null      object  
 8   expanded_urls               1964 non-null   object  
 9   rating_numerator            1964 non-null   float64 
 10  rating_denominator          1964 non-null   int32   
 11  name                        1866 non-null   object  
 12  date                        1964 non-null   object  
 13  time              

### Tidiness Issue 4:
#### Define:
drop redundant columns, retweeted and in_reply columns, four (4) in total

#### Code:

In [161]:
drop_cols = list(twitter_archive_master.iloc[:0,[1,2,5,6,7,8]])
drop_cols

['in_reply_to_status_id',
 'in_reply_to_user_id',
 'retweeted_status_id',
 'retweeted_status_user_id',
 'retweeted_status_timestamp',
 'expanded_urls']

In [162]:
twitter_archive_master.drop(columns=drop_cols, inplace=True)

#### Test:

In [163]:
twitter_archive_master.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1964 entries, 0 to 1963
Data columns (total 23 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   tweet_id            1964 non-null   object  
 1   source_app          1964 non-null   object  
 2   text                1964 non-null   object  
 3   rating_numerator    1964 non-null   float64 
 4   rating_denominator  1964 non-null   int32   
 5   name                1866 non-null   object  
 6   date                1964 non-null   object  
 7   time                1964 non-null   object  
 8   timezone            1964 non-null   object  
 9   dog_type            1964 non-null   category
 10  retweet_count       1964 non-null   int64   
 11  fav_count           1964 non-null   int64   
 12  img_url             1964 non-null   object  
 13  conf_tweet_img      1964 non-null   int64   
 14  p1                  1964 non-null   object  
 15  p1_conf             1964 non-null   fl

In [164]:
df_clean[1].head()

Unnamed: 0,tweet_id,img_url,conf_tweet_img,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,welsh_springer_spaniel,0.465074,True,collie,0.156665,True,shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,german_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,rottweiler,0.243682,True,doberman,0.154629,True


## Save Clean data

In [165]:
directory = 'Working_Files'
#from pathlib import Path
Path(directory).mkdir(parents=True, exist_ok=True)

In [166]:
### Twitter Master file
filename_out1 = 'twitter_archive_master.csv'
twitter_archive_master.to_csv(directory+'/'+filename_out1, index=True)

In [167]:
import shutil
def move_file(folder, file):
    path = os.getcwd()
    dest = path+'/'+folder+'/'+file
    source = path+'/'+file

    shutil.move(source, dest)
    print('{} moved.'.format(file))

## Exploratory Data Analysis and Visualization

In [170]:
twitter_archive_master.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1964 entries, 0 to 1963
Data columns (total 23 columns):
 #   Column              Non-Null Count  Dtype   
---  ------              --------------  -----   
 0   tweet_id            1964 non-null   object  
 1   source_app          1964 non-null   object  
 2   text                1964 non-null   object  
 3   rating_numerator    1964 non-null   float64 
 4   rating_denominator  1964 non-null   int32   
 5   name                1866 non-null   object  
 6   date                1964 non-null   object  
 7   time                1964 non-null   object  
 8   timezone            1964 non-null   object  
 9   dog_type            1964 non-null   category
 10  retweet_count       1964 non-null   int64   
 11  fav_count           1964 non-null   int64   
 12  img_url             1964 non-null   object  
 13  conf_tweet_img      1964 non-null   int64   
 14  p1                  1964 non-null   object  
 15  p1_conf             1964 non-null   fl

### Automated EDA

#### Sweetviz

In [171]:
import sweetviz as sv

sweetviz_file1 = 'SweetViz-Twitter_Data_Report.html'
sweetviz_file2 = 'SweetViz-Img_Predictions_Report.html'

# Clean
clean_report1 = sv.analyze([twitter_archive_master,'Twitter_Data'])
clean_report1.show_html(filepath=sweetviz_file1, open_browser=False)
print('')
# Raw
raw_report1 = sv.analyze([df_raw[0],'Twitter_Data_Raw'])
raw_report1.show_html(filepath='Raw_'+sweetviz_file1, open_browser=False)
print('')
raw_report2 = sv.analyze([df_raw[0], 'Image_Predictions_Raw'])
raw_report2.show_html(filepath='Raw_'+sweetviz_file2, open_browser=False)

PandasGUI INFO — matplotlib.font_manager — Generating new fontManager, this may take some time...
Done! Use 'show' commands to display/save.   |██████████| [100%]   00:00 -> (00:00 left)
Feature: tweet_id                            |▌         | [  6%]   00:00 -> (00:00 left)Report SweetViz-Twitter_Data_Report.html was generated.

Done! Use 'show' commands to display/save.   |██████████| [100%]   00:00 -> (00:00 left)
Feature: tweet_id                            |▌         | [  6%]   00:00 -> (00:00 left)Report Raw_SweetViz-Twitter_Data_Report.html was generated.

Done! Use 'show' commands to display/save.   |██████████| [100%]   00:00 -> (00:00 left)Report Raw_SweetViz-Img_Predictions_Report.html was generated.



In [172]:
clean_report2 = sv.analyze([df_image_predictor, 'Image_Predictions'])
clean_report2.show_html(filepath=sweetviz_file2, open_browser=False)

Done! Use 'show' commands to display/save.   |██████████| [100%]   00:00 -> (00:00 left)Report SweetViz-Img_Predictions_Report.html was generated.



In [173]:
# move into Reports Folder
#dest = path+'/'+folder+'/'+file
#source = path+'/'+file
move_file('Reports', sweetviz_file1)
move_file('Reports', sweetviz_file2)
move_file('Reports', 'Raw_'+sweetviz_file1)
move_file('Reports', 'Raw_'+sweetviz_file2)

SweetViz-Twitter_Data_Report.html moved.
SweetViz-Img_Predictions_Report.html moved.
Raw_SweetViz-Twitter_Data_Report.html moved.
Raw_SweetViz-Img_Predictions_Report.html moved.


#### Pandas Profiling
Limitation of n=10000 data points to be analysed

from pandas_profiling import ProfileReport

##### Twitter data

In [174]:
from pandas_profiling import ProfileReport
twitter_profile = ProfileReport(twitter_archive_master, title="Pandas_Profiling-Twitter_Data_Report")

In [175]:
def save_profile(profile, rep_name):
    prof_directory = 'Reports'
    Path(prof_directory).mkdir(parents=True, exist_ok=True)
    profile.to_file(prof_directory + '/' + rep_name + '.html')

In [176]:
twitter_profile.to_widgets() # to display report above

Summarize dataset: 100%|██████████| 37/37 [00:17<00:00,  2.13it/s, Completed]
Generate report structure: 100%|██████████| 1/1 [00:08<00:00,  8.35s/it]


VBox(children=(Tab(children=(Tab(children=(GridBox(children=(VBox(children=(GridspecLayout(children=(HTML(valu…

In [191]:
# save report as html, provide profile and report name
pandasprof = 'Twitter_Report.html'
twitter_profile.to_file(pandasprof)
move_file('Reports', pandasprof)

Export report to file: 100%|██████████| 1/1 [00:00<00:00,  7.81it/s]Twitter_Report.html moved.



### Manual Visualization
#### Tableau Public

In [190]:
%%html
<div class='tableauPlaceholder' id='viz1607920798943' style='position: relative'><noscript><a href='#'><img alt=' ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;We&#47;WeLoveDogsTwitterData&#47;Dashboard1&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='WeLoveDogsTwitterData&#47;Dashboard1' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;We&#47;WeLoveDogsTwitterData&#47;Dashboard1&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='language' value='en' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1607920798943');                    var vizElement = divElement.getElementsByTagName('object')[0];                    if ( divElement.offsetWidth > 800 ) { vizElement.style.width='1000px';vizElement.style.height='827px';} else if ( divElement.offsetWidth > 500 ) { vizElement.style.width='1000px';vizElement.style.height='827px';} else { vizElement.style.width='100%';vizElement.style.height='2077px';}                     var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>