#### Key points to keep in mind when data wrangling for this project:

- You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.

- You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.
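The two key points above can be sketched as a single filter. This is a minimal illustration under stated assumptions, not the notebook's actual cleaning code: the two small DataFrames are hypothetical stand-ins for the archive and the image predictions, the function name `keep_original_ratings_with_images` is invented for this example, and the cutoff is read as "up to and including August 1st, 2017".

```python
import pandas as pd

# Hypothetical miniature of the archive (column names follow the archive CSV)
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "timestamp": [
        "2017-07-30 12:00:00 +0000",  # original, in range, has image
        "2017-08-05 09:00:00 +0000",  # after the cutoff
        "2016-12-29 21:06:41 +0000",  # retweet
        "2016-01-19 03:32:10 +0000",  # no image prediction
    ],
    "retweeted_status_id": [None, None, 6.98e17, None],
})

# Hypothetical miniature of the image predictions table
predictions = pd.DataFrame({"tweet_id": [1, 2, 3],
                            "jpg_url": ["a.jpg", "b.jpg", "c.jpg"]})


def keep_original_ratings_with_images(archive, predictions, cutoff="2017-08-01"):
    """Keep tweets that are not retweets, have an image, and are not past the cutoff day."""
    df = archive.copy()
    df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)

    is_original = df["retweeted_status_id"].isna()
    # Assumption: the cutoff day itself is still included
    end = pd.Timestamp(cutoff, tz="UTC") + pd.Timedelta(days=1)
    in_range = df["timestamp"] < end
    has_image = df["tweet_id"].isin(predictions["tweet_id"])

    return df[is_original & in_range & has_image].reset_index(drop=True)


filtered = keep_original_ratings_with_images(archive, predictions)
print(filtered["tweet_id"].tolist())
```

Using `isin` against the predictions table is only one way to express "has an image"; an inner merge on `tweet_id` would work as well.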

#### Your tasks in this project are as follows:

- Data wrangling, which consists of:
    - Gathering data (downloadable file in the Resources tab in the leftmost panel of your classroom and linked in step 1 below).
    - Assessing data
    - Cleaning data
- Storing, analyzing, and visualizing your wrangled data
- Reporting on 1) your data wrangling efforts and 2) your data analyses and visualizations

##### Cleaning Data for this Project
Clean each of the issues you documented while assessing. Perform this cleaning in wrangle_act.ipynb as well. The result should be a high-quality, tidy master pandas DataFrame (or DataFrames, if appropriate). Again, the issues that satisfy the Project Motivation must be cleaned.

##### Storing, Analyzing, and Visualizing Data for this Project
Store the clean DataFrame(s) in a CSV file with the main one named twitter_archive_master.csv. If additional files exist because multiple tables are required for tidiness, name these files appropriately. Additionally, you may store the cleaned data in a SQLite database (which is to be submitted as well if you do).
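A minimal sketch of the storing step, assuming a cleaned DataFrame already exists. The tiny `master` frame, the temporary output directory, the database file name `twitter_archive_master.db`, and the table name `master` are all placeholders chosen for this example; only `twitter_archive_master.csv` is prescribed above.

```python
import os
import sqlite3
import tempfile

import pandas as pd

# Placeholder for the cleaned master DataFrame
master = pd.DataFrame({"tweet_id": [1, 2], "rating_numerator": [13, 12]})

out_dir = tempfile.mkdtemp()  # stand-in for the project directory
csv_path = os.path.join(out_dir, "twitter_archive_master.csv")
db_path = os.path.join(out_dir, "twitter_archive_master.db")  # hypothetical name

# index=False avoids an extra 'Unnamed: 0' column when the file is read back
master.to_csv(csv_path, index=False, encoding="utf-8")

# Optional: store the same data in a SQLite database
conn = sqlite3.connect(db_path)
try:
    master.to_sql("master", conn, if_exists="replace", index=False)
    restored = pd.read_sql("SELECT * FROM master", conn)
finally:
    conn.close()

reloaded = pd.read_csv(csv_path)
print(reloaded.columns.tolist())
```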

Analyze and visualize your wrangled data in your wrangle_act.ipynb Jupyter Notebook. At least three (3) insights and one (1) visualization must be produced.

##### Reporting for this Project
Create a 300-600 word written report called wrangle_report.pdf or wrangle_report.html that briefly describes your wrangling efforts. This is to be framed as an internal document.

Create a 250-word-minimum written report called act_report.pdf or act_report.html that communicates the insights and displays the visualization(s) produced from your wrangled data. This is to be framed as an external document, like a blog post or magazine article, for example.

Both of these documents can be created in separate Jupyter Notebooks using the Markdown functionality of Jupyter Notebooks, then downloading those notebooks as PDF files or HTML files. You might prefer to use a word processor like Google Docs or Microsoft Word, however.

# Wrangle and Analyze Data: ...

This project ....

## Table of Contents
- [0. Introduction](#intro)
- [1. Gather data](#gather)
- [2. Assess data](#assess)
- [3. Clean data](#clean)
- [4. Analysis & Visualization](#analysis)


<a id='intro'></a>
## 0. Introduction

bli bla bluo

In [67]:
# Import necessary libraries
import numpy as np
import pandas as pd
import requests  # to download files programmatically
import os  # to work with local directory
import tweepy  # for Twitter API
import time  # for timer
import json  # to read and write JSON data

<a id='gather'></a>
## 1. Gather data

#### Data is gathered from three different sources, as described in the steps below:

1. The WeRateDogs Twitter archive. The file `twitter-archive-enhanced.csv` is available as a comma-separated file in the local directory.
2. The tweet image predictions, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. This file (`image-predictions.tsv`) is hosted on Udacity's servers at [URL](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv).
3. Each tweet's retweet count and favorite ("like") count. Using the tweet IDs in the WeRateDogs Twitter archive, the Twitter API is queried for each tweet's JSON data with Python's Tweepy library, and each tweet's entire set of JSON data is stored in a file called `tweet_json.txt`, one tweet per line. This .txt file is then read line by line into a pandas DataFrame with tweet ID, retweet count, and favorite count.

### a. Read data from csv-file

In [2]:
# Read WeRateDogs Twitter archive from csv
df_twitter_archive_enhanced = pd.read_csv('twitter-archive-enhanced.csv')

### b. Programmatically download file from url

In [3]:
# Create request-object with url of file
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)

In [4]:
# Create file with content of response-object
with open(url.split('/')[-1], mode='wb') as file: 
    file.write(response.content)

In [5]:
# Read tweet image predictions from tsv
df_image_predictions = pd.read_csv(url.split('/')[-1], sep='\t')

### c. Query additional information for tweets via Twitter API

In [6]:
# Load credentials for Twitter API from local file
try:
    with open("pw.json", "r") as read_file:
        pw = json.load(read_file)

# Report missing file and stop, since the credentials below cannot be set without it
except FileNotFoundError as fnf_error:
    print(fnf_error)
    print('File pw.json with twitter credentials missing')
    raise

consumer_key = pw['consumer_key']
consumer_secret = pw['consumer_secret']
access_token = pw['access_token']
access_secret = pw['access_secret']

In [7]:
# Initiate session with Twitter API
# Rate-limit handling is configured on the API object (Tweepy 3.x), not per request
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

In [8]:
# Query for every tweet id in enhanced twitter archive and save tweet-information in json-format to 'tweet_json.txt'
tweet_id_errors = []
start = time.time()
count = 0


with open('tweet_json.txt', 'w') as outfile:

    for tweet_id in df_twitter_archive_enhanced['tweet_id']:
        count += 1
        try:
            # Query API for data of tweet id, setting tweet_mode parameter to 'extended'
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            # Measure elapsed time
            mid_s = time.time()
            # Print id and time elapsed
            print(str(count) + ": " + str(tweet_id) + " was successfully queried at " + str(mid_s - start))
            # Write json of tweet to 'tweet_json.txt', one tweet per line
            json.dump(tweet._json, outfile)
            outfile.write("\n")

        # Not best practice to catch all exceptions but fine for this short script
        except Exception as error:
            mid_f = time.time()
            print(str(count) + ": " + str(tweet_id) + " could not be queried " + str(mid_f - start) + str(error))
            # Collect position and id of tweets without status
            tweet_id_errors.append([count, str(tweet_id)])

    end = time.time()
    print(end - start)
df_tweet_id_errors = pd.DataFrame(tweet_id_errors, columns=['tweet_index', 'tweet_id'])

1: 892420643555336193 was successfully queried at 0.21577000617980957
2: 892177421306343426 was successfully queried at 0.4254758358001709
3: 891815181378084864 was successfully queried at 0.6118710041046143
4: 891689557279858688 was successfully queried at 0.8170490264892578
5: 891327558926688256 was successfully queried at 1.0295610427856445
6: 891087950875897856 was successfully queried at 1.2252051830291748
7: 890971913173991426 was successfully queried at 1.4242031574249268
8: 890729181411237888 was successfully queried at 1.617460012435913
9: 890609185150312448 was successfully queried at 1.8179621696472168
10: 890240255349198849 was successfully queried at 2.020556926727295
11: 890006608113172480 was successfully queried at 2.2189810276031494
12: 889880896479866881 was successfully queried at 2.4155139923095703
13: 889665388333682689 was successfully queried at 2.622896909713745
14: 889638837579907072 was successfully queried at 2.826932191848755
15: 889531135344209921 was succe

116: 870374049280663552 was successfully queried at 23.579715967178345
117: 870308999962521604 was successfully queried at 23.782437086105347
118: 870063196459192321 was successfully queried at 23.970773220062256
119: 869988702071779329 could not be queried 24.1567120552063[{'code': 144, 'message': 'No status found with that ID.'}]
120: 869772420881756160 was successfully queried at 24.358447074890137
121: 869702957897576449 was successfully queried at 24.55802297592163
122: 869596645499047938 was successfully queried at 24.754987955093384
123: 869227993411051520 was successfully queried at 24.945891857147217
124: 868880397819494401 was successfully queried at 25.145478010177612
125: 868639477480148993 was successfully queried at 25.35995602607727
126: 868622495443632128 was successfully queried at 25.56156301498413
127: 868552278524837888 was successfully queried at 25.761414051055908
128: 867900495410671616 was successfully queried at 25.962762117385864
129: 867774946302451713 was su

230: 848212111729840128 was successfully queried at 46.893795013427734
231: 847978865427394560 was successfully queried at 47.09711194038391
232: 847971574464610304 was successfully queried at 47.30740308761597
233: 847962785489326080 was successfully queried at 47.5085871219635
234: 847842811428974592 was successfully queried at 47.706080198287964
235: 847617282490613760 was successfully queried at 47.91905403137207
236: 847606175596138505 was successfully queried at 48.12530493736267
237: 847251039262605312 was successfully queried at 48.321462869644165
238: 847157206088847362 was successfully queried at 48.52312612533569
239: 847116187444137987 was successfully queried at 48.727954149246216
240: 846874817362120707 was successfully queried at 48.933881998062134
241: 846514051647705089 was successfully queried at 49.155434131622314
242: 846505985330044928 was successfully queried at 49.379660844802856
243: 846153765933735936 was successfully queried at 49.582399129867554
244: 84613971

344: 832040443403784192 was successfully queried at 70.63008499145508
345: 832032802820481025 was successfully queried at 70.82989978790283
346: 831939777352105988 was successfully queried at 71.03329491615295
347: 831926988323639298 was successfully queried at 71.23503184318542
348: 831911600680497154 was successfully queried at 71.43138599395752
349: 831670449226514432 was successfully queried at 71.63415598869324
350: 831650051525054464 was successfully queried at 71.83104300498962
351: 831552930092285952 was successfully queried at 72.03388404846191
352: 831322785565769729 was successfully queried at 72.23496985435486
353: 831315979191906304 was successfully queried at 72.44022011756897
354: 831309418084069378 was successfully queried at 72.63799405097961
355: 831262627380748289 was successfully queried at 72.83961892127991
356: 830956169170665475 was successfully queried at 73.05034303665161
357: 830583320585068544 was successfully queried at 73.25653100013733
358: 830173239259324

460: 817827839487737858 was successfully queried at 94.86386394500732
461: 817777686764523521 was successfully queried at 95.06653094291687
462: 817536400337801217 was successfully queried at 95.27443218231201
463: 817502432452313088 was successfully queried at 95.48503112792969
464: 817423860136083457 was successfully queried at 95.69137692451477
465: 817415592588222464 was successfully queried at 95.90943598747253
466: 817181837579653120 was successfully queried at 96.1217839717865
467: 817171292965273600 was successfully queried at 96.35417890548706
468: 817120970343411712 was successfully queried at 96.55534291267395
469: 817056546584727552 was successfully queried at 96.8473789691925
470: 816829038950027264 was successfully queried at 97.0552978515625
471: 816816676327063552 was successfully queried at 97.26139402389526
472: 816697700272001025 was successfully queried at 97.46466183662415
473: 816450570814898180 was successfully queried at 97.67183089256287
474: 816336735214911488

575: 801127390143516673 was successfully queried at 118.90158987045288
576: 801115127852503040 was successfully queried at 119.1111249923706
577: 800859414831898624 was successfully queried at 119.31323409080505
578: 800855607700029440 was successfully queried at 119.52591300010681
579: 800751577355128832 was successfully queried at 119.73229098320007
580: 800513324630806528 was successfully queried at 119.93235278129578
581: 800459316964663297 was successfully queried at 120.14312887191772
582: 800443802682937345 was successfully queried at 120.35606479644775
583: 800388270626521089 was successfully queried at 120.56418704986572
584: 800188575492947969 was successfully queried at 120.77439403533936
585: 800141422401830912 was successfully queried at 120.98233795166016
586: 800018252395122689 was successfully queried at 121.1839849948883
587: 799774291445383169 was successfully queried at 121.38874506950378
588: 799757965289017345 was successfully queried at 121.59289622306824
589: 799

691: 787397959788929025 was successfully queried at 142.90555095672607
692: 787322443945877504 was successfully queried at 143.10715198516846
693: 787111942498508800 was successfully queried at 143.31190586090088
694: 786963064373534720 was successfully queried at 143.5086328983307
695: 786729988674449408 was successfully queried at 143.72936582565308
696: 786709082849828864 was successfully queried at 143.9288830757141
697: 786664955043049472 was successfully queried at 144.15094685554504
698: 786595970293370880 was successfully queried at 144.35995984077454
699: 786363235746385920 was successfully queried at 144.5602011680603
700: 786286427768250368 was successfully queried at 144.76409101486206
701: 786233965241827333 was successfully queried at 144.96937203407288
702: 786051337297522688 was successfully queried at 145.1658420562744
703: 786036967502913536 was successfully queried at 145.38082599639893
704: 785927819176054784 was successfully queried at 145.5803301334381
705: 785872

806: 772114945936949249 was successfully queried at 169.14868807792664
807: 772102971039580160 was successfully queried at 169.35522484779358
808: 771908950375665664 was successfully queried at 169.56849813461304
809: 771770456517009408 was successfully queried at 169.7668640613556
810: 771500966810099713 was successfully queried at 169.96728682518005
811: 771380798096281600 was successfully queried at 170.1743519306183
812: 771171053431250945 was successfully queried at 170.40803599357605
813: 771136648247640064 was successfully queried at 170.61159300804138
814: 771102124360998913 was successfully queried at 170.81341409683228
815: 771014301343748096 was successfully queried at 171.0265109539032
816: 771004394259247104 could not be queried 171.22540616989136[{'code': 179, 'message': 'Sorry, you are not authorized to see this status.'}]
817: 770787852854652928 was successfully queried at 171.42598700523376
818: 770772759874076672 was successfully queried at 171.63095998764038
819: 770

Rate limit reached. Sleeping for: 711


900: 758828659922702336 was successfully queried at 188.73040294647217
901: 758740312047005698 was successfully queried at 905.0845489501953
902: 758474966123810816 was successfully queried at 905.2930109500885
903: 758467244762497024 was successfully queried at 905.4997470378876
904: 758405701903519748 was successfully queried at 905.7061100006104
905: 758355060040593408 was successfully queried at 905.9147610664368
906: 758099635764359168 was successfully queried at 906.1248700618744
907: 758041019896193024 was successfully queried at 906.332661151886
908: 757741869644341248 was successfully queried at 906.5333950519562
909: 757729163776290825 was successfully queried at 906.7359449863434
910: 757725642876129280 was successfully queried at 906.9494199752808
911: 757611664640446465 was successfully queried at 907.1668040752411
912: 757597904299253760 was successfully queried at 907.379497051239
913: 757596066325864448 was successfully queried at 907.5879919528961
914: 7574001623775928

1017: 746906459439529985 was successfully queried at 931.1211109161377
1018: 746872823977771008 was successfully queried at 931.3158721923828
1019: 746818907684614144 was successfully queried at 931.5140240192413
1020: 746790600704425984 was successfully queried at 931.7137308120728
1021: 746757706116112384 was successfully queried at 931.9259841442108
1022: 746726898085036033 was successfully queried at 932.1522290706635
1023: 746542875601690625 was successfully queried at 932.3607649803162
1024: 746521445350707200 was successfully queried at 932.5775129795074
1025: 746507379341139972 was successfully queried at 932.7879540920258
1026: 746369468511756288 was successfully queried at 932.9947538375854
1027: 746131877086527488 was successfully queried at 933.1973049640656
1028: 746056683365994496 was successfully queried at 933.3997778892517
1029: 745789745784041472 was successfully queried at 933.6002111434937
1030: 745712589599014916 was successfully queried at 933.8041508197784
1031: 

1133: 728760639972315136 was successfully queried at 956.1762239933014
1134: 728751179681943552 was successfully queried at 956.3746678829193
1135: 728653952833728512 was successfully queried at 956.5803978443146
1136: 728409960103686147 was successfully queried at 956.8000028133392
1137: 728387165835677696 was successfully queried at 956.999712228775
1138: 728046963732717569 was successfully queried at 957.2041010856628
1139: 728035342121635841 was successfully queried at 957.4001581668854
1140: 728015554473250816 was successfully queried at 957.6063599586487
1141: 727685679342333952 was successfully queried at 957.8141269683838
1142: 727644517743104000 was successfully queried at 958.0191268920898
1143: 727524757080539137 was successfully queried at 958.2244918346405
1144: 727314416056803329 was successfully queried at 958.4156630039215
1145: 727286334147182592 was successfully queried at 958.6222460269928
1146: 727175381690781696 was successfully queried at 958.839879989624
1147: 72

1249: 711363825979756544 was successfully queried at 985.5558910369873
1250: 711306686208872448 was successfully queried at 985.75461602211
1251: 711008018775851008 was successfully queried at 985.9437878131866
1252: 710997087345876993 was successfully queried at 986.1553728580475
1253: 710844581445812225 was successfully queried at 986.3582348823547
1254: 710833117892898816 was successfully queried at 986.5671720504761
1255: 710658690886586372 was successfully queried at 986.7745509147644
1256: 710609963652087808 was successfully queried at 986.9948019981384
1257: 710588934686908417 was successfully queried at 987.2204520702362
1258: 710296729921429505 was successfully queried at 987.4216091632843
1259: 710283270106132480 was successfully queried at 987.6289420127869
1260: 710272297844797440 was successfully queried at 987.8263449668884
1261: 710269109699739648 was successfully queried at 988.0299670696259
1262: 710153181850935296 was successfully queried at 988.241012096405
1263: 710

1366: 702684942141153280 was successfully queried at 1009.6783430576324
1367: 702671118226825216 was successfully queried at 1009.8837749958038
1368: 702598099714314240 was successfully queried at 1010.0980648994446
1369: 702539513671897089 was successfully queried at 1010.2989721298218
1370: 702332542343577600 was successfully queried at 1010.5056610107422
1371: 702321140488925184 was successfully queried at 1010.7660050392151
1372: 702276748847800320 was successfully queried at 1010.9699108600616
1373: 702217446468493312 was successfully queried at 1011.175920009613
1374: 701981390485725185 was successfully queried at 1011.3750779628754
1375: 701952816642965504 was successfully queried at 1011.5758419036865
1376: 701889187134500865 was successfully queried at 1011.7710649967194
1377: 701805642395348998 was successfully queried at 1011.9951620101929
1378: 701601587219795968 was successfully queried at 1012.226958990097
1379: 701570477911896070 was successfully queried at 1012.43023705

1481: 693486665285931008 was successfully queried at 1033.6323218345642
1482: 693280720173801472 was successfully queried at 1033.8367059230804
1483: 693267061318012928 was successfully queried at 1034.036866903305
1484: 693262851218264065 was successfully queried at 1034.2409629821777
1485: 693231807727280129 was successfully queried at 1034.4473769664764
1486: 693155686491000832 was successfully queried at 1034.6468839645386
1487: 693109034023534592 was successfully queried at 1034.8509509563446
1488: 693095443459342336 was successfully queried at 1035.059639930725
1489: 692919143163629568 was successfully queried at 1035.263878107071
1490: 692905862751522816 was successfully queried at 1035.476938009262
1491: 692901601640583168 was successfully queried at 1035.7038280963898
1492: 692894228850999298 was successfully queried at 1035.9157540798187
1493: 692828166163931137 was successfully queried at 1036.1121361255646
1494: 692752401762250755 was successfully queried at 1036.3111250400

1596: 686358356425093120 was successfully queried at 1057.437684059143
1597: 686286779679375361 was successfully queried at 1057.6273591518402
1598: 686050296934563840 was successfully queried at 1057.833577156067
1599: 686035780142297088 was successfully queried at 1058.03884100914
1600: 686034024800862208 was successfully queried at 1058.2441120147705
1601: 686007916130873345 was successfully queried at 1058.447802066803
1602: 686003207160610816 was successfully queried at 1058.652930021286
1603: 685973236358713344 was successfully queried at 1058.8489849567413
1604: 685943807276412928 was successfully queried at 1059.1358230113983
1605: 685906723014619143 was successfully queried at 1059.342798948288
1606: 685681090388975616 was successfully queried at 1059.5597579479218
1607: 685667379192414208 was successfully queried at 1059.7645049095154
1608: 685663452032069632 was successfully queried at 1059.9702010154724
1609: 685641971164143616 was successfully queried at 1060.21932888031
1

1711: 680583894916304897 was successfully queried at 1082.6210660934448
1712: 680497766108381184 was successfully queried at 1082.8275380134583
1713: 680494726643068929 was successfully queried at 1083.0322229862213
1714: 680473011644985345 was successfully queried at 1083.2371218204498
1715: 680440374763077632 was successfully queried at 1083.4408509731293
1716: 680221482581123072 was successfully queried at 1083.6505501270294
1717: 680206703334408192 was successfully queried at 1083.8566517829895
1718: 680191257256136705 was successfully queried at 1084.0665078163147
1719: 680176173301628928 was successfully queried at 1084.275149822235
1720: 680161097740095489 was successfully queried at 1084.4710948467255
1721: 680145970311643136 was successfully queried at 1084.6729249954224
1722: 680130881361686529 was successfully queried at 1084.8770608901978
1723: 680115823365742593 was successfully queried at 1085.0840730667114
1724: 680100725817409536 was successfully queried at 1085.2876460

Rate limit reached. Sleeping for: 703


1800: 677187300187611136 was successfully queried at 1101.9412891864777
1801: 676975532580409345 was successfully queried at 1810.2902932167053
1802: 676957860086095872 was successfully queried at 1810.497533082962
1803: 676949632774234114 was successfully queried at 1810.699380159378
1804: 676948236477857792 was successfully queried at 1810.9067649841309
1805: 676946864479084545 was successfully queried at 1811.113217830658
1806: 676942428000112642 was successfully queried at 1811.3182151317596
1807: 676936541936185344 was successfully queried at 1811.5174098014832
1808: 676916996760600576 was successfully queried at 1811.7242140769958
1809: 676897532954456065 was successfully queried at 1811.914677143097
1810: 676864501615042560 was successfully queried at 1812.1192281246185
1811: 676821958043033607 was successfully queried at 1812.3267250061035
1812: 676819651066732545 was successfully queried at 1812.5515098571777
1813: 676811746707918848 was successfully queried at 1812.7579829692

1915: 674330906434379776 was successfully queried at 1833.6524748802185
1916: 674318007229923329 was successfully queried at 1833.8667919635773
1917: 674307341513269249 was successfully queried at 1834.0837769508362
1918: 674291837063053312 was successfully queried at 1834.285611152649
1919: 674271431610523648 was successfully queried at 1834.4757888317108
1920: 674269164442398721 was successfully queried at 1834.6966660022736
1921: 674265582246694913 was successfully queried at 1834.89950299263
1922: 674262580978937856 was successfully queried at 1835.1032269001007
1923: 674255168825880576 was successfully queried at 1835.2910869121552
1924: 674082852460433408 was successfully queried at 1835.4957129955292
1925: 674075285688614912 was successfully queried at 1835.6950180530548
1926: 674063288070742018 was successfully queried at 1835.8949210643768
1927: 674053186244734976 was successfully queried at 1836.1049790382385
1928: 674051556661161984 was successfully queried at 1836.311996936

2030: 671855973984772097 was successfully queried at 1857.1859800815582
2031: 671789708968640512 was successfully queried at 1857.39279794693
2032: 671768281401958400 was successfully queried at 1857.5914080142975
2033: 671763349865160704 was successfully queried at 1857.804949760437
2034: 671744970634719232 was successfully queried at 1858.004625082016
2035: 671743150407421952 was successfully queried at 1858.2142579555511
2036: 671735591348891648 was successfully queried at 1858.4094030857086
2037: 671729906628341761 was successfully queried at 1858.609092950821
2038: 671561002136281088 was successfully queried at 1858.8165969848633
2039: 671550332464455680 was successfully queried at 1859.0221500396729
2040: 671547767500775424 was successfully queried at 1859.241600036621
2041: 671544874165002241 was successfully queried at 1859.4457519054413
2042: 671542985629241344 was successfully queried at 1859.6428377628326
2043: 671538301157904385 was successfully queried at 1859.847010850906

2145: 669942763794931712 was successfully queried at 1880.863960981369
2146: 669926384437997569 was successfully queried at 1881.0678181648254
2147: 669923323644657664 was successfully queried at 1881.2726402282715
2148: 669753178989142016 was successfully queried at 1881.4679100513458
2149: 669749430875258880 was successfully queried at 1881.6658790111542
2150: 669684865554620416 was successfully queried at 1881.8636360168457
2151: 669683899023405056 was successfully queried at 1882.0547459125519
2152: 669682095984410625 was successfully queried at 1882.2592961788177
2153: 669680153564442624 was successfully queried at 1882.4725379943848
2154: 669661792646373376 was successfully queried at 1882.6815440654755
2155: 669625907762618368 was successfully queried at 1882.8736219406128
2156: 669603084620980224 was successfully queried at 1883.0817170143127
2157: 669597912108789760 was successfully queried at 1883.2914819717407
2158: 669583744538451968 was successfully queried at 1883.4970810

2260: 667550904950915073 was successfully queried at 1904.3377392292023
2261: 667550882905632768 was successfully queried at 1904.5453469753265
2262: 667549055577362432 was successfully queried at 1904.7740387916565
2263: 667546741521195010 was successfully queried at 1904.9788389205933
2264: 667544320556335104 was successfully queried at 1905.1842200756073
2265: 667538891197542400 was successfully queried at 1905.3819317817688
2266: 667534815156183040 was successfully queried at 1905.5889801979065
2267: 667530908589760512 was successfully queried at 1905.7877111434937
2268: 667524857454854144 was successfully queried at 1905.985810995102
2269: 667517642048163840 was successfully queried at 1906.1807341575623
2270: 667509364010450944 was successfully queried at 1906.3788530826569
2271: 667502640335572993 was successfully queried at 1906.5803639888763
2272: 667495797102141441 was successfully queried at 1906.7924029827118
2273: 667491009379606528 was successfully queried at 1906.9980759

In [68]:
# Create empty list
tweet_api_info = []

# Open file with handle as best practice and for every json-object append interesting info to tweet_api_info
try:
    with open('tweet_json.txt') as file:
        
        for jsonObj in file:
            tweet = json.loads(jsonObj)
            tweet_api_info.append([tweet['id'], tweet['retweet_count'], tweet['favorite_count']])

# Catch exception, if file doesn't exist
except FileNotFoundError as fnf_error:
    print(fnf_error)

# Convert list to pandas dataframe
df_tweet_api_info = pd.DataFrame(tweet_api_info, columns=['tweet_id', 'retweet_count', 'favorite_count'])

In [69]:
df_tweet_api_info.to_csv('df_tweet_api_info.csv', encoding='utf-8')
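The query log shows that a few tweet IDs could not be retrieved (for example, "No status found with that ID"). One way to cross-check this after loading `tweet_json.txt` is to compare the archive IDs with the IDs that made it into the API table. The sketch below uses three IDs taken from the log; the retweet and favorite counts are placeholder zeros, not real values, and the variable names are chosen for this example.

```python
import pandas as pd

# Three IDs from the query log: one successful, two that could not be queried
archive_ids = pd.Series(
    [892420643555336193, 869988702071779329, 771004394259247104],
    name="tweet_id",
)

# Stand-in for df_tweet_api_info; the counts are placeholders, not real data
api_info = pd.DataFrame({
    "tweet_id": [892420643555336193],
    "retweet_count": [0],
    "favorite_count": [0],
})

# IDs present in the archive but absent from the API results
missing_ids = archive_ids[~archive_ids.isin(api_info["tweet_id"])].tolist()
print(missing_ids)
```

In the notebook itself, the same comparison could be run on `df_twitter_archive_enhanced['tweet_id']` and `df_tweet_api_info['tweet_id']`.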

<a id='assess'></a>
## 2. Assess data

##### Assessing Data for this Project
After gathering, each of the above datasets is assessed visually and programmatically for quality and tidiness issues. The assessment targets:

- 8 quality issues
- 2 tidiness issues
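A programmatic assessment typically combines a few quick checks. The following is a small sketch under stated assumptions: the `sample` frame is invented for illustration, the helper name `assess` is hypothetical, and the three checks are examples rather than the full list of issues for this project.

```python
import pandas as pd

# Invented sample with one duplicated id, one missing name, one unusual denominator
sample = pd.DataFrame({
    "tweet_id": [1, 2, 2, 3],
    "rating_denominator": [10, 10, 10, 50],
    "name": ["Spark", "a", "a", None],
})


def assess(df):
    """Return a few simple quality indicators for a DataFrame."""
    return {
        "duplicate_ids": int(df["tweet_id"].duplicated().sum()),
        "missing_per_column": df.isna().sum().to_dict(),
        "odd_denominators": int((df["rating_denominator"] != 10).sum()),
    }


report = assess(sample)
print(report)
```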

### a. Visual assessment

In [39]:
# Assess df_twitter_archive_enhanced visually
df_twitter_archive_enhanced.sample(n=5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1545,689289219123089408,,,2016-01-19 03:32:10 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Someone sent me this without any context and e...,,,,https://twitter.com/dog_rates/status/689289219...,13,10,,,,,
2216,668537837512433665,,,2015-11-22 21:13:35 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Spark. He's nervous. Other dog hasn't ...,,,,https://twitter.com/dog_rates/status/668537837...,8,10,Spark,,,,
485,814578408554463233,,,2016-12-29 21:06:41 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: Meet Beau &amp; Wilbur. Wilbur ...,6.981954e+17,4196984000.0,2016-02-12 17:22:12 +0000,https://twitter.com/dog_rates/status/698195409...,9,10,Beau,,,,
15,889278841981685760,,,2017-07-24 00:19:32 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Oliver. You're witnessing one of his m...,,,,https://twitter.com/dog_rates/status/889278841...,13,10,Oliver,,,,
1018,746818907684614144,6.914169e+17,4196984000.0,2016-06-25 21:34:37 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Guys... Dog Jesus 2.0\n13/10 buoyant af https:...,,,,https://twitter.com/dog_rates/status/746818907...,13,10,,,,,


In [29]:
# Assess df_image_predictions visually
df_image_predictions.sample(n=10)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
2029,882762694511734784,https://pbs.twimg.com/media/DEAz_HHXsAA-p_z.jpg,1,Labrador_retriever,0.85005,True,Chesapeake_Bay_retriever,0.074257,True,flat-coated_retriever,0.015579,True
1623,803638050916102144,https://pbs.twimg.com/ext_tw_video_thumb/80363...,1,Labrador_retriever,0.372776,True,golden_retriever,0.343666,True,Great_Pyrenees,0.067242,True
1045,712809025985978368,https://pbs.twimg.com/media/CeRoBaxWEAABi0X.jpg,1,Labrador_retriever,0.868671,True,carton,0.095095,False,pug,0.007651,True
654,682059653698686977,https://pbs.twimg.com/media/CXcpovWWMAAMcfv.jpg,2,jigsaw_puzzle,0.995873,False,Siamese_cat,0.000781,False,pizza,0.000432,False
1722,819711362133872643,https://pbs.twimg.com/media/C2AzHjQWQAApuhf.jpg,2,acorn_squash,0.848704,False,toilet_seat,0.044348,False,toy_poodle,0.022009,True
1265,749317047558017024,https://pbs.twimg.com/ext_tw_video_thumb/74931...,1,wire-haired_fox_terrier,0.155144,True,Lakeland_terrier,0.108382,True,buckeye,0.074617,False
664,682697186228989953,https://pbs.twimg.com/media/CXltdtaWYAIuX_V.jpg,1,bald_eagle,0.097232,False,torch,0.096621,False,cliff,0.090385,False
395,673636718965334016,https://pbs.twimg.com/media/CVk9ApFWUAA-S1s.jpg,1,wombat,0.880257,False,corn,0.019421,False,pug,0.019044,True
302,671504605491109889,https://pbs.twimg.com/media/CVGp4LKWoAAoD03.jpg,1,toy_poodle,0.259115,True,bath_towel,0.177669,False,Maltese_dog,0.071712,True
1805,832273440279240704,https://pbs.twimg.com/ext_tw_video_thumb/83227...,1,Pembroke,0.134081,True,ice_bear,0.051928,False,pug,0.044311,True


In [13]:
# Assess df_tweet_api_info visually
df_tweet_api_info.sample(n=10)

Unnamed: 0,tweet_id,retweet_count,favorite_count
1886,674410619106390016,448,1168
1436,695051054296211456,787,2671
367,827199976799354881,2258,10715
464,815736392542261248,2326,10062
1999,671891728106971137,559,1281
2116,669993076832759809,81,314
1216,712085617388212225,484,3265
2024,671504605491109889,3424,6774
2218,667902449697558528,362,827
2257,667211855547486208,229,464


### b. Programmatic assessment

In [14]:
# Check datatypes and non-null count of columns in df_twitter_archive_enhanced
df_twitter_archive_enhanced.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [15]:
# Check if column rating_denominator of df_twitter_archive_enhanced only consists of value 10
df_twitter_archive_enhanced.rating_denominator.value_counts()

10     2333
11        3
50        3
80        2
20        2
2         1
16        1
40        1
70        1
15        1
90        1
110       1
120       1
130       1
150       1
170       1
7         1
0         1
Name: rating_denominator, dtype: int64
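
The counts above show 23 rows whose denominator is not 10. A minimal sketch of how such rows could be pulled out for manual inspection follows; the small frame and its texts are made up for illustration and are not taken from the archive.

```python
import pandas as pd

# Hypothetical miniature of the archive; texts and values are invented for illustration
sample = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'text': ['Good boy 13/10', 'Happy 4/20 from the squad! 13/10', 'A whole litter 84/70'],
    'rating_numerator': [13, 4, 84],
    'rating_denominator': [10, 20, 70],
})

# Select rows whose denominator deviates from the usual value of 10
odd_denominator = sample[sample['rating_denominator'] != 10]
print(odd_denominator[['text', 'rating_numerator', 'rating_denominator']])
```

Reading the `text` of these rows side by side with the parsed values helps decide whether the rating was misparsed (a date or fraction in the text) or is a genuine group rating.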

In [25]:
# Check column rating_numerator for realistic values
df_twitter_archive_enhanced.rating_numerator.value_counts() 

12      558
11      464
10      461
13      351
9       158
8       102
7        55
14       54
5        37
6        32
3        19
4        17
1         9
2         9
420       2
0         2
15        2
75        2
80        1
20        1
24        1
26        1
44        1
50        1
60        1
165       1
84        1
88        1
144       1
182       1
143       1
666       1
960       1
1776      1
17        1
27        1
45        1
99        1
121       1
204       1
Name: rating_numerator, dtype: int64

In [28]:
# Check for most common values in column name of df_twitter_archive_enhanced
df_twitter_archive_enhanced.name.value_counts() 

None         745
a             55
Charlie       12
Cooper        11
Oliver        11
            ... 
Alejandro      1
Tessa          1
Arlo           1
Ron            1
Cupid          1
Name: name, Length: 957, dtype: int64
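
Entries such as `None` and `a` are probably not real dog names. The following sketch flags them as missing, under the assumption that real names are capitalized; the sample values (including `the`) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample of the `name` column
names = pd.Series(['None', 'a', 'Charlie', 'Cooper', 'the', 'Oliver'])

# Assumption: the placeholder 'None' and all-lowercase words are not real names
is_invalid = names.eq('None') | names.str.islower()
cleaned = names.mask(is_invalid, np.nan)
print(cleaned)
```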

In [16]:
# Check datatypes and non-null count of columns in df_image_predictions
df_image_predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [17]:
# Check datatypes and non-null count of columns in df_tweet_api_info
df_tweet_api_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2331 entries, 0 to 2330
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   tweet_id        2331 non-null   int64
 1   retweet_count   2331 non-null   int64
 2   favorite_count  2331 non-null   int64
dtypes: int64(3)
memory usage: 54.8 KB


In [None]:
df_tweet_id_errors

In [26]:
# Check for duplicate column names across three dataframes
all_columns = pd.Series(list(df_twitter_archive_enhanced) + list(df_image_predictions) + list(df_tweet_api_info))
all_columns[all_columns.duplicated()]

17    tweet_id
29    tweet_id
dtype: object

In [35]:
# Statistical description of dataframe, min-values, max-values, quartiles
df_tweet_api_info.describe()

Unnamed: 0,tweet_id,retweet_count,favorite_count
count,2331.0,2331.0,2331.0
mean,7.419079e+17,2718.11154,7600.672673
std,6.82317e+16,4597.897045,11794.699699
min,6.660209e+17,1.0,0.0
25%,6.78267e+17,549.0,1327.5
50%,7.182469e+17,1275.0,3309.0
75%,7.986692e+17,3156.5,9312.5
max,8.924206e+17,78265.0,157126.0


### Requirements for Clean data:

#### Quality requirements:

- Completeness: All necessary records are present; no rows, columns or cells are missing.
- Validity: No records that violate the defined schema.
- Accuracy: No data that is valid but wrong.
- Consistency: No data that is valid and accurate but represented in multiple ways.

#### Tidiness requirements (as defined by Hadley Wickham):
- each variable is a column
- each observation is a row
- each type of observational unit is a table.



### Findings, which contradict requirements:

##### `df_twitter_archive_enhanced` table
- Completeness: Not all rows have a value in one of the following columns: `doggo`, `floofer`, `pupper`, `puppo`.
- Completeness: Column `name` contains 745 'None' entries and 55 'a' entries instead of actual names.
- Validity: Not all values in column `rating_denominator` are equal to 10.
- Validity: Not all tweets are original dog ratings. 181 are retweets (non-null values in `retweeted_status_id`) and 78 are replies (non-null values in `in_reply_to_status_id` and `in_reply_to_user_id`).
- Validity: Ratings higher than 20 in `rating_numerator` do not seem to fit the rating system.
- Validity: In column `expanded_urls` the URLs mainly match `https://twitter.com/dog_rates/status/`, yet there is one example where the URL starts with `https://gofundme.com/ydvmve-surgery-for-`.
- Accuracy: Column `timestamp` is of type object (string) instead of datetime.
- Consistency: Column `text` duplicates information held in columns `rating_numerator`, `rating_denominator`, `name`, `doggo`, `floofer`, `pupper`, `puppo`.
- New variable `dog stage` should be of categorical datatype.
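
As a rough illustration of the `timestamp` finding, the sketch below parses two timestamp strings in the format shown in the archive samples. Passing `utc=True` is an assumption made here so that the result is timezone-aware.

```python
import pandas as pd

# Timestamp strings in the format used by the archive
timestamps = pd.Series(['2017-07-24 00:19:32 +0000', '2016-06-25 21:34:37 +0000'])

# Parse the strings into timezone-aware datetimes
parsed = pd.to_datetime(timestamps, utc=True)
print(parsed.dtype)
print(parsed.dt.year.tolist())
```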

##### `df_image_predictions` table
- ...

##### `df_tweet_api_info` table
- ...

#### Tidiness Observations:
- Variable `text` contains multiple variables: `rating_numerator`, `rating_denominator`, `name`, `doggo`, `floofer`, `pupper`, `puppo`.
- One variable (dog stage) is spread across four columns (`doggo`, `floofer`, `pupper` and `puppo`) in the `df_twitter_archive_enhanced` table; it should be a single column of categorical type.
- One observational unit is spread across three tables (`df_twitter_archive_enhanced`, `df_image_predictions` and `df_tweet_api_info`) with differing numbers of rows.
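
The last observation could be addressed by joining the three tables on `tweet_id`. A minimal sketch with tiny stand-in frames (all values invented; the variable names `archive`, `predictions`, `api_info` and `master` are hypothetical):

```python
import pandas as pd

# Tiny stand-ins for the three tables; only `tweet_id` is shared
archive = pd.DataFrame({'tweet_id': [1, 2, 3], 'rating_numerator': [13, 12, 11]})
predictions = pd.DataFrame({'tweet_id': [1, 2], 'p1': ['pug', 'Pembroke']})
api_info = pd.DataFrame({'tweet_id': [1, 3], 'retweet_count': [448, 787]})

# Inner joins keep only tweets that are present in all three tables
master = (archive
          .merge(predictions, on='tweet_id', how='inner')
          .merge(api_info, on='tweet_id', how='inner'))
print(master)
```

An inner join fits the requirement that only ratings with images are wanted; a left join would instead keep every archive row and leave the missing values empty.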

<a id='clean'></a>
## 3. Clean data

In [299]:
# Create copies for cleaning process to preserve original dataframes
df_twitter_archive_enhanced_clean = df_twitter_archive_enhanced.copy()
df_image_predictions_clean = df_image_predictions.copy()
df_tweet_api_info_clean = df_tweet_api_info.copy()

1. Clean tidiness issues.
2. Clean completeness issues.
3. Clean remaining quality issues.

### Issue 1:
#### Observe:
Validity: Not all tweets are dog ratings, some are retweets. There are 181 retweets, identified by values in columns `retweeted_status_id`, `retweeted_status_user_id` and `retweeted_status_timestamp`.

#### Define:
Drop rows that can be identified as retweets by non-null values in columns `retweeted_status_id`, `retweeted_status_user_id` and `retweeted_status_timestamp`. Then drop these three columns, as they are no longer needed.

#### Code:

In [300]:
# Keep only rows without values in 'retweeted_status_id'
df_twitter_archive_enhanced_clean = df_twitter_archive_enhanced_clean[
    df_twitter_archive_enhanced_clean['retweeted_status_id'].isnull()]

# Drop columns which are only relevant for retweets
# (reassigning instead of inplace=True avoids modifying a filtered slice in place)
df_twitter_archive_enhanced_clean = df_twitter_archive_enhanced_clean.drop(
    ['retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp'], axis=1)

#### Test:

In [301]:
# Check that the number of rows was reduced by the 181 retweets
assert len(df_twitter_archive_enhanced_clean) == len(df_twitter_archive_enhanced)-181

In [302]:
# Check that columns `retweeted_status_id`, `retweeted_status_user_id` and `retweeted_status_timestamp` were dropped
list(df_twitter_archive_enhanced_clean)

['tweet_id',
 'in_reply_to_status_id',
 'in_reply_to_user_id',
 'timestamp',
 'source',
 'text',
 'expanded_urls',
 'rating_numerator',
 'rating_denominator',
 'name',
 'doggo',
 'floofer',
 'pupper',
 'puppo']

### Issue 2:
#### Observe:
Validity: Not all tweets are dog ratings, some are replies. There are 78 replies, identified by values in columns `in_reply_to_status_id` and `in_reply_to_user_id`.

#### Define:
Drop rows that can be identified as replies by non-null values in columns `in_reply_to_status_id` and `in_reply_to_user_id`. Then drop these two columns, as they are no longer needed.

#### Code:

In [303]:
# Keep only rows without values in 'in_reply_to_status_id'
df_twitter_archive_enhanced_clean = df_twitter_archive_enhanced_clean[
    df_twitter_archive_enhanced_clean['in_reply_to_status_id'].isnull()]

# Drop columns which are only relevant for replies
# (reassigning instead of inplace=True avoids modifying a filtered slice in place)
df_twitter_archive_enhanced_clean = df_twitter_archive_enhanced_clean.drop(
    ['in_reply_to_status_id', 'in_reply_to_user_id'], axis=1)

#### Test:

In [304]:
# Check that the number of rows was reduced by the 181 retweets and 78 replies
assert len(df_twitter_archive_enhanced_clean) == len(df_twitter_archive_enhanced)-181-78

In [305]:
# Check that the retweet columns as well as `in_reply_to_status_id` and `in_reply_to_user_id` were dropped
list(df_twitter_archive_enhanced_clean)

['tweet_id',
 'timestamp',
 'source',
 'text',
 'expanded_urls',
 'rating_numerator',
 'rating_denominator',
 'name',
 'doggo',
 'floofer',
 'pupper',
 'puppo']

In [306]:
# Check if any row left without value in rating_numerator column
len(df_twitter_archive_enhanced_clean[df_twitter_archive_enhanced_clean['rating_numerator'].isnull()])

0

### Issue 3:
#### Observe:

Tidiness: Variable `text` contains multiple variables: `rating_numerator`, `rating_denominator`, `name`, `doggo`, `floofer`, `pupper`, `puppo`.


#### Define
- Extract `rating_numerator`, `rating_denominator`, `name`, `doggo`, `floofer`, `pupper`, `puppo` variables from the `text` column using regular expressions and pandas' `str.extract` method. Drop the helper column `rating` when done.
- [`str.extract` method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.extract.html).
- [regex tutorial](https://regexone.com/)
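
To illustrate the `str.extract` approach, here is a minimal sketch with two capture groups that yields numerator and denominator directly. The pattern is an assumption for illustration and differs from the one used in the code cell of this issue: it also accepts single-digit and decimal numerators. The sample texts are made up.

```python
import pandas as pd

# Made-up tweet texts
texts = pd.Series([
    'This is Oliver. 13/10 would pet',
    'Solid 9/10 effort',
    'Very precise rating 11.27/10',
    'No rating here',
])

# Two capture groups: numerator (optionally decimal) and a denominator fixed at 10
pattern = r'(\d+(?:\.\d+)?)/(10)\b'
extracted = texts.str.extract(pattern, expand=True)
extracted.columns = ['rating_numerator', 'rating_denominator']

# Texts without a match yield NaN; convert the remaining strings to numbers
extracted = extracted.apply(pd.to_numeric)
print(extracted)
```

With `expand=True`, each capture group becomes its own column, so no separate split step is required.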


#### Code

In [307]:
# Extract the rating (numerator/denominator) from the tweet text via regex.
# Note: the pattern requires at least two digits in the numerator,
# so plain single-digit ratings such as 9/10 do not match.
df_twitter_archive_enhanced_clean['rating'] = df_twitter_archive_enhanced_clean.text.str.extract(
    r'(\d{1,2}\.?\d{1,2}?\/10)', expand=True)

# Split the values in 'rating' column into columns 'rating_numerator' and 'rating_denominator'
df_twitter_archive_enhanced_clean[['rating_numerator', 'rating_denominator']] = (
    df_twitter_archive_enhanced_clean['rating'].str.split('/', n=1, expand=True))

# Drop rows whose text does not match the rating pattern
df_twitter_archive_enhanced_clean = df_twitter_archive_enhanced_clean[df_twitter_archive_enhanced_clean.rating_numerator.notnull()]
df_twitter_archive_enhanced_clean = df_twitter_archive_enhanced_clean[df_twitter_archive_enhanced_clean.rating_denominator.notnull()]

# Convert columns to numeric (the numerator becomes float because of decimal ratings)
df_twitter_archive_enhanced_clean.rating_numerator = pd.to_numeric(df_twitter_archive_enhanced_clean.rating_numerator)
df_twitter_archive_enhanced_clean.rating_denominator = pd.to_numeric(df_twitter_archive_enhanced_clean.rating_denominator)

# Drop the helper column 'rating' (axis=1 refers to columns, not rows)
df_twitter_archive_enhanced_clean.drop('rating', axis=1, inplace=True)



#### Test:

In [308]:
# Check that columns `rating_numerator` and `rating_denominator` have numeric datatypes
df_twitter_archive_enhanced_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1673 entries, 0 to 2350
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   tweet_id            1673 non-null   int64  
 1   timestamp           1673 non-null   object 
 2   source              1673 non-null   object 
 3   text                1673 non-null   object 
 4   expanded_urls       1670 non-null   object 
 5   rating_numerator    1673 non-null   float64
 6   rating_denominator  1673 non-null   int64  
 7   name                1673 non-null   object 
 8   doggo               1673 non-null   object 
 9   floofer             1673 non-null   object 
 10  pupper              1673 non-null   object 
 11  puppo               1673 non-null   object 
dtypes: float64(1), int64(2), object(9)
memory usage: 169.9+ KB


In [312]:
# Check for values in rating_numerator
df_twitter_archive_enhanced_clean.rating_numerator.value_counts()

12.00      486
10.00      438
11.00      415
13.00      289
14.00       39
420.00       1
1776.00      1
11.27        1
11.26        1
9.75         1
13.50        1
Name: rating_numerator, dtype: int64

In [313]:
# Check for values in rating_denominator
df_twitter_archive_enhanced_clean.rating_denominator.value_counts()

10    1673
Name: rating_denominator, dtype: int64

### Issue 4:
#### Observe:

Tidiness: Variable `text` contains multiple variables: `rating_numerator`, `rating_denominator`, `name`, `doggo`, `floofer`, `pupper`, `puppo`.

In [98]:
# name

In [92]:
# doggo, floofer, pupper, puppo

#### Test

### Issue 5:
One variable (stage) in four columns of the `df_twitter_archive_enhanced` table (`doggo`, `floofer`, `pupper` and `puppo`).

#### Define
- Melt the *doggo*, *floofer*, *pupper* and *puppo* columns into a single *stage* column with the [melt function](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.melt.html).
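
The melt step could look roughly like the sketch below. It works on a made-up three-row frame; the intermediate names `melted`, `tidy` and `stage_column` are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Made-up miniature of the archive with the four stage columns
df = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'doggo': ['doggo', 'None', 'None'],
    'floofer': ['None', 'None', 'None'],
    'pupper': ['None', 'pupper', 'None'],
    'puppo': ['None', 'None', 'None'],
})
stages = ['doggo', 'floofer', 'pupper', 'puppo']

# Replace the 'None' placeholder with NaN so empty stages can be dropped after melting
df[stages] = df[stages].replace('None', np.nan)

# Melt the four columns into long form and keep only rows that carry a stage
melted = pd.melt(df, id_vars=['tweet_id'], value_vars=stages,
                 var_name='stage_column', value_name='stage').dropna(subset=['stage'])

# Join back so that tweets without any stage are kept with a missing value
tidy = df[['tweet_id']].merge(melted[['tweet_id', 'stage']], on='tweet_id', how='left')
tidy['stage'] = tidy['stage'].astype('category')
print(tidy)
```

Note that a tweet mentioning two stages would appear twice after this step, so such cases would need a separate decision.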

#### Code

In [81]:
list(df_twitter_archive_enhanced_clean)

['tweet_id',
 'in_reply_to_status_id',
 'in_reply_to_user_id',
 'timestamp',
 'source',
 'text',
 'retweeted_status_id',
 'retweeted_status_user_id',
 'retweeted_status_timestamp',
 'expanded_urls',
 'rating_numerator',
 'rating_denominator',
 'name',
 'doggo',
 'floofer',
 'pupper',
 'puppo']

In [87]:
stages = ['doggo', 'floofer', 'pupper', 'puppo']

for stage in stages: 
    df_twitter_archive_enhanced_clean[stage] = df_twitter_archive_enhanced_clean[stage].replace('None', np.nan)

# Use melt function to melt doggo, floofer, pupper and puppo into new column 'stage' and put their values into 'type'
#df_twitter_archive_enhanced_clean = pd.melt(df_twitter_archive_enhanced_clean, 
#                                            id_vars=['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 
#                                                     'timestamp', 'source', 'text','retweeted_status_id', 
#                                                     'retweeted_status_user_id', 'retweeted_status_timestamp', 
#                                                     'expanded_urls', 'rating_numerator', 'rating_denominator', 
#                                                     'name'], var_name='stage', value_name='type')


#### Test

In [83]:
list(df_twitter_archive_enhanced_clean)

['tweet_id',
 'in_reply_to_status_id',
 'in_reply_to_user_id',
 'timestamp',
 'source',
 'text',
 'retweeted_status_id',
 'retweeted_status_user_id',
 'retweeted_status_timestamp',
 'expanded_urls',
 'rating_numerator',
 'rating_denominator',
 'name',
 'doggo',
 'floofer',
 'pupper',
 'puppo']

In [100]:
df_twitter_archive_enhanced_clean.head(n=15)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo,rating
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,,13/10
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,,13/10
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,,12/10
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,,13/10
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,,12/10
5,891087950875897856,,,2017-07-29 00:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a majestic great white breaching ...,,,,https://twitter.com/dog_rates/status/891087950...,13,10,,,,,,13/10
6,890971913173991426,,,2017-07-28 16:27:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Jax. He enjoys ice cream so much he gets ...,,,,"https://gofundme.com/ydvmve-surgery-for-jax,ht...",13,10,Jax,,,,,13/10
7,890729181411237888,,,2017-07-28 00:22:40 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you watch your owner call another dog a g...,,,,https://twitter.com/dog_rates/status/890729181...,13,10,,,,,,13/10
8,890609185150312448,,,2017-07-27 16:25:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Zoey. She doesn't want to be one of th...,,,,https://twitter.com/dog_rates/status/890609185...,13,10,Zoey,,,,,13/10
9,890240255349198849,,,2017-07-26 15:59:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Cassie. She is a college pup. Studying...,,,,https://twitter.com/dog_rates/status/890240255...,14,10,Cassie,doggo,,,,14/10


<a id='analysis'></a>
## 4. Analysis & Visualization

In [19]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'wrangle_act.ipynb'])

1

In [20]:
data

NameError: name 'data' is not defined