#### Key points to keep in mind when data wrangling for this project:

- You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.

- You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.
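
The two filters described above (originals only, images required) can be sketched with pandas as follows. This is a minimal illustration on made-up toy data; it assumes that retweets are recognizable by a non-null `retweeted_status_id` and that "has an image" means the tweet ID appears in the image-prediction table.

```python
import pandas as pd

# Toy stand-ins for the archive and the image-prediction table (values are made up)
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "retweeted_status_id": [None, 99.0, None, None],
    "text": ["original with image", "RT @dog_rates: ...",
             "original without image", "another original"],
})
images = pd.DataFrame({"tweet_id": [1, 2, 4],
                       "jpg_url": ["a.jpg", "b.jpg", "d.jpg"]})

# Keep originals only: retweets carry a non-null retweeted_status_id
originals = archive[archive["retweeted_status_id"].isna()]

# Keep only tweets that also appear in the image-prediction table
rated_with_images = originals[originals["tweet_id"].isin(images["tweet_id"])]

print(rated_with_images["tweet_id"].tolist())
```

Tweet 2 is dropped as a retweet and tweet 3 is dropped because it has no image prediction.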

#### Your tasks in this project are as follows:

- Data wrangling, which consists of:
    - Gathering data (downloadable file in the Resources tab in the leftmost panel of your classroom and linked in step 1 below)
    - Assessing data
    - Cleaning data
- Storing, analyzing, and visualizing your wrangled data
- Reporting on 1) your data wrangling efforts and 2) your data analyses and visualizations

##### Cleaning Data for this Project
Clean each of the issues you documented while assessing. Perform this cleaning in wrangle_act.ipynb as well. The result should be a high quality and tidy master pandas DataFrame (or DataFrames, if appropriate). Again, the issues that satisfy the Project Motivation must be cleaned.
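
As an illustration of what a tidiness fix might look like, the sketch below collapses the four dog-stage columns of the archive (`doggo`, `floofer`, `pupper`, `puppo`) into a single column. The toy data, the helper `combine_stages`, and the new column name `stage` are assumptions made for this example, and missing stages are assumed to be stored as missing values rather than as text.

```python
import pandas as pd

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Toy data: one stage, two stages, and no stage at all
df = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": [None, "doggo", None],
    "floofer": [None, None, None],
    "pupper": ["pupper", "pupper", None],
    "puppo": [None, None, None],
})


def combine_stages(row):
    """Join the stage names present in a row; return NaN if there are none."""
    found = [value for value in row if isinstance(value, str)]
    return ", ".join(found) if found else float("nan")


# Replace the four stage columns with one 'stage' column
tidy = df.drop(columns=stage_cols).assign(
    stage=df[stage_cols].apply(combine_stages, axis=1)
)
print(tidy)
```

Rows with more than one stage keep both names, so they can be reviewed separately instead of being silently overwritten.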

##### Storing, Analyzing, and Visualizing Data for this Project
Store the clean DataFrame(s) in a CSV file with the main one named twitter_archive_master.csv. If additional files exist because multiple tables are required for tidiness, name these files appropriately. Additionally, you may store the cleaned data in a SQLite database (which is to be submitted as well if you do).
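
A minimal sketch of the storing step, assuming a cleaned DataFrame named `master` (toy data here). The temporary folder, the database file name, and the table name are hypothetical choices for the example; only `twitter_archive_master.csv` is prescribed above.

```python
import os
import sqlite3
import tempfile

import pandas as pd

master = pd.DataFrame({"tweet_id": [1, 2], "rating_numerator": [12, 13]})  # toy data

out_dir = tempfile.mkdtemp()  # stand-in for the project folder
csv_path = os.path.join(out_dir, "twitter_archive_master.csv")
db_path = os.path.join(out_dir, "twitter_archive_master.db")  # hypothetical name

# index=False keeps the row index out of the file
master.to_csv(csv_path, index=False)

# Optional: store the same table in a SQLite database
conn = sqlite3.connect(db_path)
try:
    master.to_sql("twitter_archive_master", conn, if_exists="replace", index=False)
    from_db = pd.read_sql("SELECT * FROM twitter_archive_master", conn)
finally:
    conn.close()

# Read the CSV back to confirm the round trip
restored = pd.read_csv(csv_path)
print(restored.equals(master))
```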

Analyze and visualize your wrangled data in your wrangle_act.ipynb Jupyter Notebook. At least three (3) insights and one (1) visualization must be produced.

##### Reporting for this Project
Create a 300-600 word written report called wrangle_report.pdf or wrangle_report.html that briefly describes your wrangling efforts. This is to be framed as an internal document.

Create a 250-word-minimum written report called act_report.pdf or act_report.html that communicates the insights and displays the visualization(s) produced from your wrangled data. This is to be framed as an external document, like a blog post or magazine article, for example.

Both of these documents can be created in separate Jupyter Notebooks using Markdown cells and then downloaded as PDF or HTML files. Alternatively, you may prefer to use a word processor such as Google Docs or Microsoft Word.

# Wrangle and Analyze Data: ...

This project ....

## Table of Contents
- [0. Introduction](#intro)
- [1. Gather data](#gather)
- [2. Assess data](#assess)
- [3. Clean data](#clean)
- [4. Analysis & Visualization](#analysis)


<a id='intro'></a>
## 0. Introduction

bli bla bluo

In [280]:
# Import necessary libraries
import pandas as pd
import requests  # to download files programmatically
import os  # to work with the local directory
import tweepy  # for the Twitter API
import time  # for timing the API queries
import json  # to read and write JSON

<a id='gather'></a>
## 1. Gather data

#### Data is gathered from three different sources, as described below:

1. The WeRateDogs Twitter archive: the comma-separated file `twitter-archive-enhanced.csv`, read from the local directory.
2. The tweet image predictions, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. This file (`image-predictions.tsv`) is hosted on Udacity's servers at [URL](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv).
3. Each tweet's retweet count and favorite ("like") count. Using the tweet IDs in the WeRateDogs Twitter archive, the Twitter API is queried for each tweet's JSON data with Python's Tweepy library, and each tweet's entire set of JSON data is stored in a file called `tweet_json.txt`, one tweet per line. This .txt file is then read line by line into a pandas DataFrame with tweet ID, retweet count, and favorite count.

### a. Read data from csv-file

In [281]:
# Read WeRateDogs Twitter archive from csv
df_twitter_archive_enhanced = pd.read_csv('twitter-archive-enhanced.csv')

### b. Programmatically download file from url

In [282]:
# Create request-object with url of file
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)

In [283]:
# Create file with content of response-object
with open(url.split('/')[-1], mode='wb') as file: 
    file.write(response.content)

In [284]:
# Read tweet image predictions from tsv
df_image_predictions = pd.read_csv(url.split('/')[-1], sep='\t')
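
The download steps can also be wrapped in a small helper that fails loudly on HTTP errors instead of writing an error page to disk. This is a sketch only: the function names `filename_from_url` and `download_file`, the `folder` parameter, and the 30-second timeout are assumptions made for the example.

```python
import os

import requests


def filename_from_url(url):
    """Use the last path segment of the URL as the local file name."""
    return url.split("/")[-1]


def download_file(url, folder=".", timeout=30):
    """Download url into folder and return the local path."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx responses
    path = os.path.join(folder, filename_from_url(url))
    with open(path, mode="wb") as file:
        file.write(response.content)
    return path
```

With this helper, the three cells above reduce to `pd.read_csv(download_file(url), sep='\t')`.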

### c. Query additional information for tweets via Twitter API

In [285]:
# Load credentials for the Twitter API from a local JSON file
try:
    with open("pw.json", "r") as read_file:
        pw = json.load(read_file)

# Stop here if the file doesn't exist; the keys below cannot be read without it
except FileNotFoundError as fnf_error:
    print(fnf_error)
    print('File pw.json with Twitter credentials missing')
    raise

consumer_key = pw['consumer_key']
consumer_secret = pw['consumer_secret']
access_token = pw['access_token']
access_secret = pw['access_secret']
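
A variation on the cell above, sketched under the assumption that `pw.json` is a flat JSON object with exactly these four keys: a small loader that also reports missing keys before any API call is made. The names `REQUIRED_KEYS` and `load_credentials` are hypothetical.

```python
import json

REQUIRED_KEYS = ("consumer_key", "consumer_secret", "access_token", "access_secret")


def load_credentials(path="pw.json"):
    """Load Twitter API credentials and fail early if a key is missing."""
    with open(path, "r") as read_file:
        credentials = json.load(read_file)
    missing = [key for key in REQUIRED_KEYS if key not in credentials]
    if missing:
        raise KeyError(f"{path} is missing: {', '.join(missing)}")
    return credentials
```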

In [286]:
# Initiate session with Twitter API
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth)

In [287]:
# Query for every tweet id in enhanced twitter archive and save tweet-information in json-format to 'tweet_json.txt'
tweet_jsons = {}
tweet_id_errors = {}
start = time.time()
count = 0


with open('tweet_json.txt', 'w') as outfile:
    
    for tweet_id in df_twitter_archive_enhanced['tweet_id']:
        count +=1
        try:
            # Query API for data of tweet id, setting tweet_mode parameter to 'extended'
            tweet = api.get_status(tweet_id, tweet_mode='extended', wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
            # Measure elapsed time
            mid_s = time.time()
            # Print id and time elapsed
            print(str(count) + ": " + str(tweet_id) + " was successfully queried at " + str(mid_s - start) )
            # Write json of tweet to 'tweet_json.txt'
            json.dump(tweet._json, outfile)
            # New line
            outfile.write("\n")

        # Not best practice to catch all exceptions but fine for this short script
        except Exception as error:
            mid_f = time.time()
            print(str(count) + ": " + str(tweet_id) + " could not be queried " + str(mid_f - start) + str(error))
            # Record the failed tweet id together with its error message
            tweet_id_errors[tweet_id] = str(error)
            
    end = time.time()
    print(end - start)


1: 892420643555336193 was successfully queried at 0.21903204917907715
2: 892177421306343426 was successfully queried at 0.473801851272583
3: 891815181378084864 was successfully queried at 0.6876981258392334
...
119: 869988702071779329 could not be queried 25.704071760177612[{'code': 144, 'message': 'No status found with that ID.'}]
...
816: 771004394259247104 could not be queried 174.7197139263153[{'code': 179, 'message': 'Sorry, you are not authorized to see this status.'}]
...
Rate limit reached. Sleeping for: 708
...
901: 758740312047005698 was successfully queried at 905.7743539810181
...
Rate limit reached. Sleeping for: 705
...
1801: 676975532580409345 was successfully queried at 1811.2901821136475
...
2272: 667495797102141441 was successfully queried at 1910.7688219547272
...

In [288]:
# Collect tweet id, retweet count, and favorite count for every tweet
tweet_api_info = []

# Read the file line by line; each line holds the JSON of one tweet
try:
    with open('tweet_json.txt') as file:

        for line in file:
            tweet = json.loads(line)
            tweet_api_info.append([tweet['id'], tweet['retweet_count'], tweet['favorite_count']])

# Catch exception, if file doesn't exist
except FileNotFoundError as fnf_error:
    print(fnf_error)

# Convert list to pandas DataFrame
df_tweet_api_info = pd.DataFrame(tweet_api_info, columns=['tweet_id', 'retweet_count', 'favorite_count'])

In [289]:
df_tweet_api_info.to_csv('df_tweet_api_info.csv', encoding='utf-8')
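
Because some tweet IDs could not be queried, it may be worth checking which archive IDs are absent from the API results. The sketch below uses made-up toy data; `archive_ids` and `api_info` stand in for `df_twitter_archive_enhanced['tweet_id']` and `df_tweet_api_info`.

```python
import pandas as pd

# Toy stand-ins for the archive IDs and the API results
archive_ids = pd.Series([10, 20, 30, 40], name="tweet_id")
api_info = pd.DataFrame({
    "tweet_id": [10, 30],
    "retweet_count": [5, 7],
    "favorite_count": [50, 70],
})

# IDs present in the archive but not returned by the API
missing_ids = archive_ids[~archive_ids.isin(api_info["tweet_id"])].tolist()

# Share of archive tweets for which API data is available
coverage = len(api_info) / len(archive_ids)

print(missing_ids, f"{coverage:.0%}")
```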

<a id='assess'></a>
## 2. Assess data

##### Assessing Data for this Project
After gathering, each of the three datasets is assessed visually and programmatically for quality and tidiness issues. The target is to document:

- 8 quality issues
- 2 tidiness issues
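
A few programmatic checks that might surface such issues are sketched below on made-up toy data. The specific checks (duplicate IDs, missing names, denominators other than 10, all-lowercase names) are assumptions about what could be worth inspecting, not findings from the real dataset.

```python
import pandas as pd

# Toy data with deliberately planted problems
df = pd.DataFrame({
    "tweet_id": [1, 2, 2, 4],
    "rating_denominator": [10, 10, 10, 50],
    "name": ["Shadoe", "a", "a", None],
})

# Repeated tweet IDs
duplicate_ids = int(df["tweet_id"].duplicated().sum())

# Missing dog names
missing_names = int(df["name"].isna().sum())

# Ratings whose denominator is not the usual 10
odd_denominators = df.loc[df["rating_denominator"] != 10, "tweet_id"].tolist()

# Names that are all lowercase and therefore unlikely to be real names
lowercase_names = df["name"].dropna().loc[lambda s: s.str.islower()].unique().tolist()

print(duplicate_ids, missing_names, odd_denominators, lowercase_names)
```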

### a. Visual assessment

In [290]:
# Assess df_twitter_archive_enhanced visually
df_twitter_archive_enhanced.sample(n=10)

|  | tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 845 | 766423258543644672 |  |  | 2016-08-18 23:55:18 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Shadoe. Her tongue flies out of her mo... |  |  |  | https://twitter.com/dog_rates/status/766423258... | 9 | 10 | Shadoe |  |  |  |  |
| 592 | 798933969379225600 |  |  | 2016-11-16 17:01:16 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Iroh. He's in a predicament. 12/10 som... |  |  |  | https://twitter.com/dog_rates/status/798933969... | 12 | 10 | Iroh |  |  |  |  |
| 2291 | 667165590075940865 |  |  | 2015-11-19 02:20:46 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Churlie. AKA Fetty Woof. Lost eye savi... |  |  |  | https://twitter.com/dog_rates/status/667165590... | 10 | 10 | Churlie |  |  |  |  |
| 1084 | 738402415918125056 |  |  | 2016-06-02 16:10:29 +0000 | `<a href="http://twitter.com/download/iphone" r...` | "Don't talk to me or my son ever again" ...10/... |  |  |  | https://twitter.com/dog_rates/status/738402415... | 10 | 10 |  |  |  |  |  |
| 671 | 789960241177853952 |  |  | 2016-10-22 22:42:52 +0000 | `<a href="http://twitter.com/download/iphone" r...` | RT @dog_rates: This is Buddy. His father was a... | 7.624645e+17 | 4196984000.0 | 2016-08-08 01:44:46 +0000 | https://twitter.com/dog_rates/status/762464539... | 12 | 10 | Buddy |  |  |  |  |
| 1023 | 746521445350707200 |  |  | 2016-06-25 01:52:36 +0000 | `<a href="http://twitter.com/download/iphone" r...` | RT @dog_rates: This is Shaggy. He knows exactl... | 6.678667e+17 | 4196984000.0 | 2015-11-21 00:46:50 +0000 | https://twitter.com/dog_rates/status/667866724... | 10 | 10 | Shaggy |  |  |  |  |
| 297 | 837110210464448512 |  |  | 2017-03-02 01:20:01 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Clark. He passed pupper training today... |  |  |  | https://twitter.com/dog_rates/status/837110210... | 13 | 10 | Clark |  |  | pupper |  |
| 1337 | 705102439679201280 |  |  | 2016-03-02 18:48:16 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Terrenth. He just stubbed his toe. 10/... |  |  |  | https://twitter.com/dog_rates/status/705102439... | 10 | 10 | Terrenth |  |  |  |  |
| 132 | 866816280283807744 |  |  | 2017-05-23 00:41:20 +0000 | `<a href="http://twitter.com/download/iphone" r...` | RT @dog_rates: This is Jamesy. He gives a kiss... | 8.664507e+17 | 4196984000.0 | 2017-05-22 00:28:40 +0000 | https://twitter.com/dog_rates/status/866450705... | 13 | 10 | Jamesy |  |  | pupper |  |
| 533 | 807621403335917568 |  |  | 2016-12-10 16:22:02 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Ollie Vue. He was a 3 legged pupper on... |  |  |  | https://twitter.com/dog_rates/status/807621403... | 14 | 10 | Ollie |  |  | pupper |  |


In [291]:
# Assess df_image_predictions visually
df_image_predictions.sample(n=10)

|  | tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 806 | 691820333922455552 | https://pbs.twimg.com/media/CZnW7JGW0AA83mn.jpg | 1 | minivan | 0.332756 | False | sports_car | 0.129452 | False | limousine | 0.073936 | False |
| 135 | 668496999348633600 | https://pbs.twimg.com/media/CUb6ebKWcAAJkd0.jpg | 1 | Staffordshire_bullterrier | 0.412879 | True | miniature_pinscher | 0.161488 | True | American_Staffordshire_terrier | 0.112495 | True |
| 921 | 701981390485725185 | https://pbs.twimg.com/media/Cb3wWWbWEAAy06k.jpg | 1 | Pomeranian | 0.491022 | True | weasel | 0.130879 | False | Yorkshire_terrier | 0.099241 | True |
| 875 | 698195409219559425 | https://pbs.twimg.com/media/CbB9BTqW8AEVc2A.jpg | 1 | Labrador_retriever | 0.64369 | True | American_Staffordshire_terrier | 0.102684 | True | dalmatian | 0.050008 | True |
| 1953 | 863907417377173506 | https://pbs.twimg.com/media/C_03NPeUQAAgrMl.jpg | 1 | marmot | 0.358828 | False | meerkat | 0.174703 | False | weasel | 0.123485 | False |
| 1577 | 796116448414461957 | https://pbs.twimg.com/media/CwxfrguUUAA1cbl.jpg | 1 | Cardigan | 0.700182 | True | Pembroke | 0.260738 | True | papillon | 0.01711 | True |
| 900 | 699801817392291840 | https://pbs.twimg.com/media/CbYyCMcWIAAHHjF.jpg | 2 | golden_retriever | 0.808978 | True | Irish_setter | 0.042428 | True | Labrador_retriever | 0.023536 | True |
| 1070 | 716285507865542656 | https://pbs.twimg.com/media/CfDB3aJXEAAEZNv.jpg | 1 | Yorkshire_terrier | 0.43042 | True | silky_terrier | 0.196769 | True | cairn | 0.072676 | True |
| 58 | 667090893657276420 | https://pbs.twimg.com/media/CUH7oLuUsAELWib.jpg | 1 | Chihuahua | 0.959514 | True | Italian_greyhound | 0.00537 | True | Pomeranian | 0.002641 | True |
| 776 | 689661964914655233 | https://pbs.twimg.com/media/CZIr5gFUsAAvnif.jpg | 1 | Italian_greyhound | 0.322818 | True | whippet | 0.246966 | True | Chihuahua | 0.122541 | True |


In [299]:
# Assess df_tweet_api_info visually
df_tweet_api_info.sample(n=10)

Unnamed: 0,tweet_id,retweet_count,favorite_count
108,870804317367881728,5754,31984
2080,670474236058800128,712,1462
314,833722901757046785,3254,20980
2212,668142349051129856,270,557
1117,727644517743104000,1745,5879
34,885518971528720385,3422,19301
1460,693231807727280129,735,2876
644,790987426131050500,2196,10093
605,796031486298386433,3827,11087
204,851591660324737024,3417,16048


### b. Programmatic assessment

In [293]:
df_twitter_archive_enhanced.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [300]:
df_twitter_archive_enhanced.rating_denominator.value_counts()

10     2333
11        3
50        3
80        2
20        2
2         1
16        1
40        1
70        1
15        1
90        1
110       1
120       1
130       1
150       1
170       1
7         1
0         1
Name: rating_denominator, dtype: int64
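
The counts above show 23 of the 2356 rows with a denominator other than 10. A minimal sketch of how those rows could be pulled out for manual inspection follows. The `archive_sample` frame and its placeholder values are assumptions for illustration, standing in for `df_twitter_archive_enhanced`.

```python
import pandas as pd

# Hypothetical stand-in for df_twitter_archive_enhanced (placeholder values only)
archive_sample = pd.DataFrame({
    "tweet_id": [101, 102, 103, 104],
    "text": ["placeholder 13/10", "placeholder 9/11", "placeholder 24/7", "placeholder 5/0"],
    "rating_numerator": [13, 9, 24, 5],
    "rating_denominator": [10, 11, 7, 0],
})

# Keep only the rows whose denominator deviates from the usual 10
unusual = archive_sample[archive_sample["rating_denominator"] != 10]

# Show the text next to the parsed rating so each case can be judged by eye
print(unusual[["tweet_id", "text", "rating_numerator", "rating_denominator"]])
```

Reading the tweet text next to the parsed values helps to decide whether a row is a parsing error or a legitimate non-standard rating.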

In [294]:
df_image_predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [295]:
df_tweet_api_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2331 entries, 0 to 2330
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   tweet_id        2331 non-null   int64
 1   retweet_count   2331 non-null   int64
 2   favorite_count  2331 non-null   int64
dtypes: int64(3)
memory usage: 54.8 KB
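
The three `info()` outputs report different row counts (2356, 2075 and 2331), so some tweets are not present in every table. The sketch below shows one way to list the IDs that are missing from the other tables. The three ID series are hypothetical stand-ins for the `tweet_id` columns of the real dataframes.

```python
import pandas as pd

# Hypothetical stand-ins for the tweet_id columns of the three dataframes
archive_ids = pd.Series([101, 102, 103, 104], name="tweet_id")
prediction_ids = pd.Series([101, 103], name="tweet_id")
api_ids = pd.Series([101, 102, 103], name="tweet_id")

# IDs in the archive that have no image prediction
missing_predictions = archive_ids[~archive_ids.isin(prediction_ids)]

# IDs in the archive that have no retweet/favorite counts
missing_api_info = archive_ids[~archive_ids.isin(api_ids)]

print(len(missing_predictions), len(missing_api_info))
```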


### Requirements for Clean data:

#### Quality requirements:

- Completeness: All necessary records are present in the dataframes; no rows, columns or cells are missing.
- Validity: No records that fail to conform to the defined schema.
- Accuracy: No data that is valid but wrong.
- Consistency: No data that is valid and accurate but represented in multiple ways.

#### Tidiness requirements (as defined by Hadley Wickham):
- each variable is a column
- each observation is a row
- each type of observational unit is a table.



### Findings that contradict the requirements:

##### `df_twitter_archive_enhanced` table
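- Not all values in column `rating_denominator` are equal to 10 (23 of 2356 rows differ, including one row with a denominator of 0).
- Only 78 of 2356 rows have non-null values in columns `in_reply_to_status_id` and `in_reply_to_user_id`; these tweets are replies.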
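
Since the project only calls for original ratings, the non-null reply and retweet columns can serve as flags. The sketch below shows one possible filter. The `archive_sample` frame is a hypothetical stand-in for `df_twitter_archive_enhanced`, and dropping replies in addition to retweets is an assumption, not a stated requirement.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_twitter_archive_enhanced (placeholder IDs only)
archive_sample = pd.DataFrame({
    "tweet_id": [101, 102, 103, 104],
    "in_reply_to_status_id": [np.nan, 5.0e17, np.nan, np.nan],
    "retweeted_status_id": [np.nan, np.nan, 7.0e17, np.nan],
})

# A non-null value marks the tweet as a reply or as a retweet
is_reply = archive_sample["in_reply_to_status_id"].notna()
is_retweet = archive_sample["retweeted_status_id"].notna()

# Keep only tweets that are neither replies nor retweets
originals = archive_sample[~is_reply & ~is_retweet]
print(is_reply.sum(), is_retweet.sum(), len(originals))
```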


##### `df_image_predictions` table
- ...

##### `df_tweet_api_info` table
- ...

#### Tidiness Observations:
- One variable (dog stage) is spread across four columns in the `df_twitter_archive_enhanced` table (`doggo`, `floofer`, `pupper` and `puppo`).
- One observational unit is split across three tables (`df_twitter_archive_enhanced`, `df_image_predictions` and `df_tweet_api_info`), which have different numbers of rows.
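
Both tidiness observations can be addressed with a few pandas operations. The following is a sketch under stated assumptions, not the final cleaning code: the three sample frames are hypothetical stand-ins for the real tables, the column name `dog_stage` and the helper `combine_stages` are made up for illustration, and a missing stage is assumed to be stored as the string `"None"`, an empty string or NaN.

```python
import pandas as pd

STAGE_COLS = ["doggo", "floofer", "pupper", "puppo"]

# Hypothetical stand-ins for the three tables (placeholder values only)
archive_sample = pd.DataFrame({
    "tweet_id": [101, 102, 103],
    "doggo": ["None", "None", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["pupper", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})
predictions_sample = pd.DataFrame({"tweet_id": [101, 103], "p1": ["Chihuahua", "Pomeranian"]})
api_sample = pd.DataFrame({"tweet_id": [101, 102, 103], "retweet_count": [5, 7, 9]})


def combine_stages(row):
    """Join every stage named in the row; return None when no stage is given."""
    stages = [row[c] for c in STAGE_COLS
              if isinstance(row[c], str) and row[c] not in ("", "None")]
    return ",".join(stages) if stages else None


# Collapse the four stage columns into a single dog_stage column
tidy = archive_sample.copy()
tidy["dog_stage"] = tidy[STAGE_COLS].apply(combine_stages, axis=1)
tidy = tidy.drop(columns=STAGE_COLS)

# Inner joins keep only tweets that appear in all three tables
master = (
    tidy.merge(predictions_sample, on="tweet_id", how="inner")
        .merge(api_sample, on="tweet_id", how="inner")
)
print(master)
```

An inner join drops tweets without an image prediction, which matches the requirement to keep only ratings that have images. A left join would keep them with missing values instead.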

<a id='clean'></a>
## 3. Clean data

In [296]:
# Create copies for cleaning process to preserve original dataframes
df_twitter_archive_enhanced_clean = df_twitter_archive_enhanced.copy()
df_image_predictions_clean = df_image_predictions.copy()
df_tweet_api_info_clean = df_tweet_api_info.copy()


### Issue 1

#### Define

- 

#### Code

#### Test

<a id='analysis'></a>
## 4. Analysis & Visualization

In [297]:
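from subprocess import call

# Export this notebook with nbconvert; call() returns the exit code (0 = success, non-zero = failure)
call(['python', '-m', 'nbconvert', 'wrangle_act.ipynb'])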

1

In [298]:
data

{'created_at': 'Tue Aug 01 16:23:56 +0000 2017',
 'id': 892420643555336193,
 'id_str': '892420643555336193',
 'full_text': "This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU",
 'truncated': False,
 'display_text_range': [0, 85],
 'entities': {'hashtags': [],
  'symbols': [],
  'user_mentions': [],
  'urls': [],
  'media': [{'id': 892420639486877696,
    'id_str': '892420639486877696',
    'indices': [86, 109],
    'media_url': 'http://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg',
    'media_url_https': 'https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg',
    'url': 'https://t.co/MgUWQ76dJU',
    'display_url': 'pic.twitter.com/MgUWQ76dJU',
    'expanded_url': 'https://twitter.com/dog_rates/status/892420643555336193/photo/1',
    'type': 'photo',
    'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'},
     'medium': {'w': 540, 'h': 528, 'resize': 'fit'},
     'small': {'w': 540, 'h': 528, 'resize': 'fit'},
     'large': {'w': 