# Project: Wrangle and analyze WeRateDogs Twitter data

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#Gathering">Gathering Data</a></li>
<li><a href="#Assessing">Assessing Data</a></li>
<li><a href="#Cleaning">Cleaning Data</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

This notebook focuses on the process of data wrangling, which consists of 3 steps:
- Gathering 
- Assessing
- Cleaning

After wrangling the data thoroughly, we perform a quick analysis.

The data in this notebook comes from Twitter, through multiple sources: a manually downloaded file, a file hosted on an online server, and Twitter's API. Wrangling it takes a fair amount of processing. In the final steps, we look at tweets from the popular WeRateDogs profile and derive trends from them.

#### Set up the environment

In [1]:
# Import all the libraries used in this python notebook for the following analysis
import pandas as pd
import numpy as np
import requests
import tweepy
import os
import timeit
import json

<a id='Gathering'></a>
## Gathering Data

In this part, we gather data from 3 different sources in 3 different formats:
- *A downloadable CSV file*: Twitter enhanced archive data
- *A TSV file hosted on an online server*: image prediction data based on tweets from the archive
- *JSON data from an API, saved to a txt file*: additional data for the tweets in the archive, queried through the API (mainly retweet count and favorite count)

Each dataset is then read into a pandas DataFrame.

### Dataset number 1 : downloadable CSV file source

The first dataset to be used for the following analysis is a *Twitter enhanced archive data file from the WeRateDogs Twitter profile*. It is saved in CSV format in our folder.

#### Step 1 and only : load data into pandas dataframe

In [2]:
# Create a dataframe and View Twitter enhanced archive dataset using pandas

df_twit_archive = pd.read_csv('twitter-archive-enhanced.csv')
df_twit_archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


> *this file was manually downloaded from Udacity's platform as part of the project materials*

### Dataset number 2 : online server source 

The second dataset to be used for the following analysis is a *tweet image prediction TSV file* hosted on Udacity servers

#### Step 1 : Download file programmatically

In [3]:
# Download programmatically the tweet image prediction file using Requests Library

url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'

if os.path.exists('tweet_image_pred.tsv'):
    # Skip the download when the code is re-run and the file is already there
    print('file exists')
else:
    r = requests.get(url)
    r.raise_for_status()  # stop here rather than save an error page as the dataset
    with open('tweet_image_pred.tsv', mode='wb') as file:
        file.write(r.content)
    print('file created')


file exists


#### Step 2 : load data into pandas dataframe

In [4]:
# Create a dataframe and View Tweet image prediction dataset using pandas

df_image_pred = pd.read_csv('tweet_image_pred.tsv', sep="\t")
df_image_pred.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


### Dataset number 3 : API source

The third and last dataset to be used for the following analysis is a *tweet 'retweet count' and 'favorite count' (Likes) dataset* in JSON format, retrieved by querying the Twitter API.

#### Step 1 : Authentication

In [None]:
# Authenticate to access API data

consumer_key = 'YOUR CONSUMER KEY'
consumer_secret = 'YOUR CONSUMER SECRET'
access_token = 'YOUR ACCESS TOKEN'
access_secret = 'YOUR ACCESS SECRET'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True , parser=tweepy.parsers.JSONParser())

# The JSON parser makes the API status call return a JSON object (a dict) later on,
# which makes it easier to read the values we are looking for in retweet_count and favorite_count

> **Some useful links**
>
> - Tweepy code documentation: [here](https://buildmedia.readthedocs.org/media/pdf/tweepy/latest/tweepy.pdf)
> - My link to the Twitter developer page: [here](https://developer.twitter.com/en/apps/17388315)
> - Twitter WeRateDogs page: [here](https://twitter.com/dog_rates)
> - Twitter API, getting tweets with a specific id: [here](https://stackoverflow.com/questions/28384588/twitter-api-get-tweets-with-specific-id)
> - Converting a tweepy status object into JSON: [here](https://stackoverflow.com/questions/27900451/convert-tweepy-status-object-into-json)

#### Step 2 : Load API JSON data into txt file

In [None]:
# Start timer for the following operation
# (default_timer() reads the clock; timeit.timeit() would run a benchmark instead)
start = timeit.default_timer()
print("Start timer")

# List for tweet ids in the archive that were not found via the API
tweet_id_without_record = []
# Creates "tweet_json.txt" file or empties it before the loop starts if file exists already
open("tweet_json.txt", 'w').close()

# Loop to build a txt file where each line represents a tweet from the WeRateDogs archive in JSON format
for tweet_id in df_twit_archive.tweet_id:
    try:
        tweet = api.get_status(tweet_id, tweet_mode='extended')  # gets tweet status in JSON format
        tweet_str = json.dumps(tweet)  # tweet JSON serialized into a str

        # append each tweet status from the loop to the tweet_json.txt file
        with open("tweet_json.txt", "a") as file:
            file.write(tweet_str + '\n')  # '\n' ends the line so that each tweet sits on its own line
        print(tweet['id'])  # print tweet id to check loop advancement
    # keep record of the tweet id if the tweet can't be retrieved via the API
    except Exception:
        print(str(tweet_id) + " error tweet not found")
        tweet_id_without_record.append(tweet_id)

# End the timer after the loop is complete
end = timeit.default_timer()
print("End timer and time to process in seconds:")
print(end - start)


> This operation above took more than one hour. Expect it if you re-run it.
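
The gathering loop can also be written so that the output file is opened only once and the API call is passed in as a function, which lets the logic be tried without credentials. The following is a minimal sketch under assumptions: `dump_tweets`, `fake_fetch` and the demo file name are hypothetical, and `fetch` merely stands in for a call such as `api.get_status`.

```python
import json
import os
import tempfile
from timeit import default_timer


def dump_tweets(tweet_ids, fetch, path):
    """Write one JSON document per line; return (failed ids, seconds elapsed).

    `fetch` is a hypothetical stand-in for the API call: it takes a tweet id
    and returns a dict, or raises if the tweet cannot be retrieved.
    """
    failed = []
    start = default_timer()  # clock reading, not a benchmark run
    with open(path, "w") as file:  # opened once, truncating any old content
        for tweet_id in tweet_ids:
            try:
                tweet = fetch(tweet_id)
            except Exception:
                failed.append(tweet_id)  # remember ids without a record
                continue
            file.write(json.dumps(tweet) + "\n")
    return failed, default_timer() - start


def fake_fetch(tweet_id):
    # Illustrative stub: pretend that the tweet with id 2 has been deleted
    if tweet_id == 2:
        raise LookupError("tweet not found")
    return {"id": tweet_id, "retweet_count": 0, "favorite_count": 0}


demo_path = os.path.join(tempfile.mkdtemp(), "tweet_json_demo.txt")
failed_ids, seconds = dump_tweets([1, 2, 3], fake_fetch, demo_path)
print(failed_ids, round(seconds, 3))
```

Passing the fetch function as an argument is only one possible design; it keeps the file handling and error bookkeeping separate from the network call.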

#### Step 3 : Perform multiple checks on the txt file to verify it looks as expected

In [5]:
# Check what the file looks like by printing its first 10000 characters

with open("tweet_json.txt", "r") as file:
    print(file.read(10000))

{"created_at": "Tue Aug 01 16:23:56 +0000 2017", "id": 892420643555336193, "id_str": "892420643555336193", "full_text": "This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU", "truncated": false, "display_text_range": [0, 85], "entities": {"hashtags": [], "symbols": [], "user_mentions": [], "urls": [], "media": [{"id": 892420639486877696, "id_str": "892420639486877696", "indices": [86, 109], "media_url": "http://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg", "media_url_https": "https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg", "url": "https://t.co/MgUWQ76dJU", "display_url": "pic.twitter.com/MgUWQ76dJU", "expanded_url": "https://twitter.com/dog_rates/status/892420643555336193/photo/1", "type": "photo", "sizes": {"thumb": {"w": 150, "h": 150, "resize": "crop"}, "medium": {"w": 540, "h": 528, "resize": "fit"}, "small": {"w": 540, "h": 528, "resize": "fit"}, "large": {"w": 540, "h": 528, "resize": "fit"}}, "features": {"orig": {"faces": 

In [6]:
# Count lines in JSON file - 1 line is one tweet
with open("tweet_json.txt") as file:  # the with block closes the file after counting
    num_lines = sum(1 for line in file)
print(num_lines)

2331


In [9]:
# Count tweets in tweet ids archive
num_tweets = df_twit_archive.tweet_id.count()
num_tweets

2356

In [109]:
# Count tweets that were not found via the API
num_tweet_id_without_record = len(tweet_id_without_record)
num_tweet_id_without_record

25

In [110]:
# Check if the number of lines in txt JSON file is correct based on the archive of tweets we queried
num_lines == num_tweets - num_tweet_id_without_record

True

> OK, the number of lines in the tweet_json.txt file matches the number of rows in the Twitter archive dataset minus the exceptions where the tweet id was not found via the API. It looks like we can go on and use this file to build the dataframe containing:
> - tweet_id
> - retweet_count
> - favorite_count
>
> Each of these values is found in every line of the txt file, since 1 line represents 1 tweet.
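
The line count alone does not show that every line is valid JSON carrying the three fields. A small sketch of such a check follows; the helper name `check_json_lines` and the demo file are hypothetical and only illustrate the idea.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("id", "retweet_count", "favorite_count")


def check_json_lines(path, required_keys=REQUIRED_KEYS):
    """Return (number of lines, 1-based numbers of the lines with a problem)."""
    total = 0
    bad_lines = []
    with open(path) as file:
        for line_number, line in enumerate(file, start=1):
            total += 1
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                bad_lines.append(line_number)  # line is not valid JSON
                continue
            # A usable record is a JSON object holding every required key
            if not isinstance(record, dict) or not all(k in record for k in required_keys):
                bad_lines.append(line_number)
    return total, bad_lines


# Demo on a throwaway file: one good line, one missing a key, one not JSON
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "w") as file:
    file.write('{"id": 1, "retweet_count": 2, "favorite_count": 3}\n')
    file.write('{"id": 2, "retweet_count": 5}\n')
    file.write('not json\n')

total, bad_lines = check_json_lines(demo_path)
print(total, bad_lines)
```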

#### Step 4 : Gather the selected data from the txt file in a list

In [10]:
tweet_API_list = [] # list for the loop to find lines in tweet_json.txt file
tweet_API_list_for_df = [] # list of dictionaries that will be used to build the final dataframe

# Open the txt file in read mode and put each line of the txt file in a list
with open("tweet_json.txt", "r") as f:
    tweet_API_list = f.readlines()  # readlines() returns a list of items, each item is a line in the tweet_json.txt file

# Loop to retrieve values for the dataframe for each tweet
for line in tweet_API_list:
    tweet = json.loads(line)  # JSON representing one tweet

    # Append the tweet id, retweet count and favorite count to the list of dictionaries
    tweet_API_list_for_df.append({'tweet_id': tweet['id'],
                                  'retweet_count': tweet['retweet_count'],
                                  'favorite_count': tweet['favorite_count']})

    

#### Step 5 : load data into pandas dataframe

In [11]:
# Create DataFrame from list of dictionaries
df_tweet_API = pd.DataFrame(tweet_API_list_for_df, columns = ['tweet_id', 'retweet_count', 'favorite_count'])

df_tweet_API

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,7802,36638
1,892177421306343426,5784,31572
2,891815181378084864,3826,23782
3,891689557279858688,7973,39967
4,891327558926688256,8609,38166
...,...,...,...
2326,666049248165822465,41,99
2327,666044226329800704,133,273
2328,666033412701032449,41,115
2329,666029285002620928,43,121


> It looks like our dataframe also has the correct number of rows. As expected, it is the same as the number of lines in the JSON txt file.

<a id='Assessing'></a>
## Assessing Data
- Detect and document at least eight (8) quality issues and two (2) tidiness issues

Following these requirements: 

- You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
- Cleaning includes merging individual pieces of data according to the rules of tidy data.
- The fact that the rating numerators are greater than the denominators does not need to be cleaned. This unique rating system is a big part of the popularity of WeRateDogs.
- You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.

#### Dataset 1: Twitter archive data

##### Info summary of the table

In [12]:
df_twit_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

> At first glance it looks like there is some missing data in at least 6 columns:
- in_reply_to_status_id
- in_reply_to_user_id
- retweeted_status_id
- retweeted_status_user_id
- retweeted_status_timestamp
- expanded_urls

> Secondly, the datatypes of these columns look suboptimal (*the more suitable type is shown* **in bold**):
- in_reply_to_status_id **as int**
- in_reply_to_user_id **as int**
- timestamp **as datetime**
- retweeted_status_id **as int**
- retweeted_status_user_id **as int**
- retweeted_status_timestamp **as datetime** 
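
One caveat on the id columns listed above: a float64 value cannot represent every 18-digit integer exactly, so ids stored as floats may already have lost their last digits. The sketch below illustrates this with the first `tweet_id` shown in the archive, using plain Python floats (which are float64).

```python
tweet_id = 892420643555336193  # first tweet_id shown in the archive

as_float = float(tweet_id)   # what a float64 column would store
round_trip = int(as_float)   # converting back to an integer afterwards

print(tweet_id)
print(round_trip)
print(round_trip == tweet_id)  # the round trip does not give the id back
```

If these ids are needed later, one option (an assumption, not something done in this notebook) is to read the columns as strings by passing a `dtype` mapping to `pd.read_csv`, rather than converting the floats afterwards.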

##### Let's have a look at the dataframe view

In [17]:
df_twit_archive

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2351,666049248165822465,,,2015-11-16 00:24:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a 1949 1st generation vulpix. Enj...,,,,https://twitter.com/dog_rates/status/666049248...,5,10,,,,,
2352,666044226329800704,,,2015-11-16 00:04:52 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a purebred Piers Morgan. Loves to Netf...,,,,https://twitter.com/dog_rates/status/666044226...,6,10,a,,,,
2353,666033412701032449,,,2015-11-15 23:21:54 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here is a very happy pup. Big fan of well-main...,,,,https://twitter.com/dog_rates/status/666033412...,9,10,a,,,,
2354,666029285002620928,,,2015-11-15 23:05:30 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a western brown Mitsubishi terrier. Up...,,,,https://twitter.com/dog_rates/status/666029285...,7,10,a,,,,


##### Explore rows with missing data

In [21]:
# check values in in_reply_to_status_... columns
df_twit_archive[df_twit_archive.in_reply_to_status_id.notna()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
30,886267009285017600,8.862664e+17,2.281182e+09,2017-07-15 16:51:35 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@NonWhiteHat @MayhewMayhem omg hello tanner yo...,,,,,12,10,,,,,
55,881633300179243008,8.816070e+17,4.738443e+07,2017-07-02 21:58:53 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@roushfenway These are good dogs but 17/10 is ...,,,,,17,10,,,,,
64,879674319642796034,8.795538e+17,3.105441e+09,2017-06-27 12:14:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@RealKentMurphy 14/10 confirmed,,,,,14,10,,,,,
113,870726314365509632,8.707262e+17,1.648776e+07,2017-06-02 19:38:25 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@ComplicitOwl @ShopWeRateDogs &gt;10/10 is res...,,,,,10,10,,,,,
148,863427515083354112,8.634256e+17,7.759620e+07,2017-05-13 16:15:35 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@Jack_Septic_Eye I'd need a few more pics to p...,,,,,12,10,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2038,671550332464455680,6.715449e+17,4.196984e+09,2015-12-01 04:44:10 +0000,"<a href=""http://twitter.com/download/iphone"" r...",After 22 minutes of careful deliberation this ...,,,,,1,10,,,,,
2149,669684865554620416,6.693544e+17,4.196984e+09,2015-11-26 01:11:28 +0000,"<a href=""http://twitter.com/download/iphone"" r...",After countless hours of research and hundreds...,,,,,11,10,,,,,
2169,669353438988365824,6.678065e+17,4.196984e+09,2015-11-25 03:14:30 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tessa. She is also very pleased after ...,,,,https://twitter.com/dog_rates/status/669353438...,10,10,Tessa,,,,
2189,668967877119254528,6.689207e+17,2.143566e+07,2015-11-24 01:42:25 +0000,"<a href=""http://twitter.com/download/iphone"" r...",12/10 good shit Bubka\n@wane15,,,,,12,10,,,,,


> A priori, it looks normal that these columns are mostly empty. The column names suggest that there is an id value only if the tweet is a reply to another status. With so few rows carrying a value compared to the size of the dataset, this does not look like valuable information for the later analysis. Let's keep it as is for now.

In [20]:
# check values in retweeted_status_... columns
df_twit_archive[df_twit_archive.retweeted_status_id.notna()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
19,888202515573088257,,,2017-07-21 01:02:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Canela. She attempted s...,8.874740e+17,4.196984e+09,2017-07-19 00:47:34 +0000,https://twitter.com/dog_rates/status/887473957...,13,10,Canela,,,,
32,886054160059072513,,,2017-07-15 02:45:48 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @Athletics: 12/10 #BATP https://t.co/WxwJmv...,8.860537e+17,1.960740e+07,2017-07-15 02:44:07 +0000,https://twitter.com/dog_rates/status/886053434...,12,10,,,,,
36,885311592912609280,,,2017-07-13 01:35:06 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Lilly. She just paralle...,8.305833e+17,4.196984e+09,2017-02-12 01:04:29 +0000,https://twitter.com/dog_rates/status/830583320...,13,10,Lilly,,,,
68,879130579576475649,,,2017-06-26 00:13:58 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Emmy. She was adopted t...,8.780576e+17,4.196984e+09,2017-06-23 01:10:23 +0000,https://twitter.com/dog_rates/status/878057613...,14,10,Emmy,,,,
73,878404777348136964,,,2017-06-24 00:09:53 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: Meet Shadow. In an attempt to r...,8.782815e+17,4.196984e+09,2017-06-23 16:00:04 +0000,"https://www.gofundme.com/3yd6y1c,https://twitt...",13,10,Shadow,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1023,746521445350707200,,,2016-06-25 01:52:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Shaggy. He knows exactl...,6.678667e+17,4.196984e+09,2015-11-21 00:46:50 +0000,https://twitter.com/dog_rates/status/667866724...,10,10,Shaggy,,,,
1043,743835915802583040,,,2016-06-17 16:01:16 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: Extremely intelligent dog here....,6.671383e+17,4.196984e+09,2015-11-19 00:32:12 +0000,https://twitter.com/dog_rates/status/667138269...,10,10,,,,,
1242,711998809858043904,,,2016-03-21 19:31:59 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @twitter: @dog_rates Awesome Tweet! 12/10. ...,7.119983e+17,7.832140e+05,2016-03-21 19:29:52 +0000,https://twitter.com/twitter/status/71199827977...,12,10,,,,,
2259,667550904950915073,,,2015-11-20 03:51:52 +0000,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",RT @dogratingrating: Exceptional talent. Origi...,6.675487e+17,4.296832e+09,2015-11-20 03:43:06 +0000,https://twitter.com/dogratingrating/status/667...,12,10,,,,,


> These columns may serve to identify retweets, so it looks normal that most rows have no value in them. Retweets would flaw the analysis of dog ratings.
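
Assuming that a non-null `retweeted_status_id` does mark a retweet, the original tweets could later be selected as sketched here. The helper `drop_retweets` and the three-row demo frame are hypothetical; the demo texts are shortened versions of rows shown above.

```python
import pandas as pd


def drop_retweets(df):
    """Return only the rows whose retweeted_status_id is missing."""
    return df[df["retweeted_status_id"].isna()].copy()


# Tiny illustrative frame, not the real archive
demo = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweeted_status_id": [float("nan"), 8.874740e+17, float("nan")],
    "text": ["This is Tilly.", "RT @dog_rates: This is Canela.", "This is Archie."],
})

originals = drop_retweets(demo)
print(len(demo), len(originals))
```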

##### Check for other issues

> Some dogs seem to have no name and the columns doggo, floofer, pupper and puppo seem to rarely have a qualifier. Let's check it.

In [62]:
df_twit_archive.name.value_counts()

None        745
a            55
Charlie      12
Cooper       11
Oliver       11
           ... 
Sparky        1
Humphrey      1
Ridley        1
JD            1
Ziva          1
Name: name, Length: 957, dtype: int64

In [36]:
df_twit_archive.query('name == "a"')['text']

56      Here is a pupper approaching maximum borkdrive...
649     Here is a perfect example of someone who has t...
801     Guys this is getting so out of hand. We only r...
1002    This is a mighty rare blue-tailed hammer sherk...
1004    Viewer discretion is advised. This is a terrib...
1017    This is a carrot. We only rate dogs. Please on...
1049    This is a very rare Great Alaskan Bush Pupper....
1193    People please. This is a Deadly Mediterranean ...
1207    This is a taco. We only rate dogs. Please only...
1340    Here is a heartbreaking scene of an incredible...
1351    Here is a whole flock of puppers.  60/50 I'll ...
1361    This is a Butternut Cumberfloof. It's not wind...
1368    This is a Wild Tuscan Poofwiggle. Careful not ...
1382    "Pupper is a present to world. Here is a bow f...
1499    This is a rare Arctic Wubberfloof. Unamused by...
1737    Guys this really needs to stop. We've been ove...
1785    This is a dog swinging. I really enjoyed it so...
1853    This i

> The name "a" is invalid. It looks like the related texts contain a phrase such as "This is **a**..." or "Here is **a**...", so the article ended up in the name column.

> Let's check whether lowercase names are wrong in general

In [66]:
df_twit_archive[df_twit_archive.name.str.islower()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
22,887517139158093824,,,2017-07-19 03:39:09 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",I've yet to rate a Venezuelan Hover Wiener. This is such an honor. 14/10 paw-inspiring af (IG: roxy.thedoxy) https://t.co/20VrLAA8ba,,,,https://twitter.com/dog_rates/status/887517139158093824/video/1,14,10,such,,,,
56,881536004380872706,,,2017-07-02 15:32:16 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here is a pupper approaching maximum borkdrive. Zooming at never before seen speeds. 14/10 paw-inspiring af \n(IG: puffie_the_chow) https://t.co/ghXBIIeQZF,,,,https://twitter.com/dog_rates/status/881536004380872706/video/1,14,10,a,,,pupper,
118,869988702071779329,,,2017-05-31 18:47:24 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: We only rate dogs. This is quite clearly a smol broken polar bear. We'd appreciate if you only send dogs. Thank you... 12/10…,8.591970e+17,4.196984e+09,2017-05-02 00:04:57 +0000,https://twitter.com/dog_rates/status/859196978902773760/video/1,12,10,quite,,,,
169,859196978902773760,,,2017-05-02 00:04:57 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",We only rate dogs. This is quite clearly a smol broken polar bear. We'd appreciate if you only send dogs. Thank you... 12/10 https://t.co/g2nSyGenG9,,,,https://twitter.com/dog_rates/status/859196978902773760/video/1,12,10,quite,,,,
193,855459453768019968,,,2017-04-21 16:33:22 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","Guys, we only rate dogs. This is quite clearly a bulbasaur. Please only send dogs. Thank you... 12/10 human used pet, it's super effective https://t.co/Xc7uj1C64x",,,,"https://twitter.com/dog_rates/status/855459453768019968/photo/1,https://twitter.com/dog_rates/status/855459453768019968/photo/1",12,10,quite,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2349,666051853826850816,,,2015-11-16 00:35:11 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is an odd dog. Hard on the outside but loving on the inside. Petting still fun. Doesn't play catch well. 2/10 https://t.co/v5A4vzSDdc,,,,https://twitter.com/dog_rates/status/666051853826850816/photo/1,2,10,an,,,,
2350,666050758794694657,,,2015-11-16 00:30:50 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is a truly beautiful English Wilson Staff retriever. Has a nice phone. Privileged. 10/10 would trade lives with https://t.co/fvIbQfHjIe,,,,https://twitter.com/dog_rates/status/666050758794694657/photo/1,10,10,a,,,,
2352,666044226329800704,,,2015-11-16 00:04:52 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is a purebred Piers Morgan. Loves to Netflix and chill. Always looks like he forgot to unplug the iron. 6/10 https://t.co/DWnyCjf2mx,,,,https://twitter.com/dog_rates/status/666044226329800704/photo/1,6,10,a,,,,
2353,666033412701032449,,,2015-11-15 23:21:54 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here is a very happy pup. Big fan of well-maintained decks. Just look at that tongue. 9/10 would cuddle af https://t.co/y671yMhoiR,,,,https://twitter.com/dog_rates/status/666033412701032449/photo/1,9,10,a,,,,
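
The lowercase entries above ("such", "a", "quite", "an") are ordinary words that follow "This is" or "Here is" in the text. A stricter extraction could accept a name only when a capitalized word follows the intro phrase. The sketch below is an assumption about how that might be done, not the method used to build the archive; the pattern covers just the "This is" and "Meet" openings seen in the sample texts.

```python
import re

# Assumed rule: a name is a capitalized word right after "This is" or "Meet"
NAME_PATTERN = re.compile(r"\b(?:This is|Meet) ([A-Z][a-z]+)\b")


def extract_name(text):
    """Return the capitalized word after the intro phrase, or None."""
    match = NAME_PATTERN.search(text)
    return match.group(1) if match else None


samples = [
    "This is Phineas. He's a mystical boy.",
    "This is a carrot. We only rate dogs.",
    "Meet Sam. She smiles 24/7",
    "This is such an honor.",
]
names = [extract_name(text) for text in samples]
print(names)
```

This pattern would miss names that are not a single capitalized word (for example the "JD" entry in the counts above), so it is a starting point rather than a complete rule.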


In [29]:
df_twit_archive.doggo.value_counts()

None     2259
doggo      97
Name: doggo, dtype: int64

In [30]:
df_twit_archive.floofer.value_counts()

None       2346
floofer      10
Name: floofer, dtype: int64

> It looks like the values of the 4 variables doggo, floofer, pupper and puppo would fit better in a single category column
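
A single category column could be derived from the four columns as sketched below. The function name `combine_stages` and the `-` separator for rows with more than one stage are assumptions made for illustration.

```python
STAGE_COLUMNS = ("doggo", "floofer", "pupper", "puppo")


def combine_stages(row, missing="None"):
    """Collapse the four stage columns of one row into a single value.

    `row` is any mapping from column name to value; the archive marks a
    missing stage with the string "None".
    """
    stages = [row[col] for col in STAGE_COLUMNS if row[col] != missing]
    return "-".join(stages) if stages else None


no_stage = {"doggo": "None", "floofer": "None", "pupper": "None", "puppo": "None"}
one_stage = {"doggo": "None", "floofer": "None", "pupper": "pupper", "puppo": "None"}
two_stages = {"doggo": "doggo", "floofer": "None", "pupper": "pupper", "puppo": "None"}

print(combine_stages(no_stage), combine_stages(one_stage), combine_stages(two_stages))
```

With pandas, the same function could be applied row by row, for instance through `DataFrame.apply` with `axis=1`.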

In [37]:
df_twit_archive.rating_numerator.value_counts()

12      558
11      464
10      461
13      351
9       158
8       102
7        55
14       54
5        37
6        32
3        19
4        17
1         9
2         9
420       2
0         2
15        2
75        2
80        1
20        1
24        1
26        1
44        1
50        1
60        1
165       1
84        1
88        1
144       1
182       1
143       1
666       1
960       1
1776      1
17        1
27        1
45        1
99        1
121       1
204       1
Name: rating_numerator, dtype: int64

> Ratings above 20 look suspicious. Let's check.

In [49]:
# Enlarge view of the text column to verify ratings associated
pd.options.display.max_colwidth = 200

In [50]:
df_twit_archive.query('rating_numerator > 20')[['text', 'rating_numerator', 'rating_denominator']]

Unnamed: 0,text,rating_numerator,rating_denominator
188,@dhmontgomery We also gave snoop dogg a 420/10 but I think that predated your research,420,10
189,"@s8n You tried very hard to portray this good boy as not so good, but you have ultimately failed. His goodness shines through. 666/10",666,10
290,@markhoppus 182/10,182,10
313,"@jonnysun @Lin_Manuel ok jomny I know you're excited but 960/00 isn't a valid rating, 13/10 is tho",960,0
340,"RT @dog_rates: This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wu…",75,10
433,The floofs have been released I repeat the floofs have been released. 84/70 https://t.co/NIYC820tmd,84,70
516,Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. \nKeep Sam smiling by clicking and sharing this link:\nhttps://t.co/98tB8y7y7t https://t.co/LouL5vdvxx,24,7
695,"This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wuqaPS",75,10
763,This is Sophie. She's a Jubilant Bush Pupper. Super h*ckin rare. Appears at random just to smile at the locals. 11.27/10 would smile back https://t.co/QFaUiIHxHq,27,10
902,Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE,165,150


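> All tweets showing a denominator not equal to 10 look like their ratings are wrong. There are also decimal ratings at index 340, 695, 763 and 1712 where only the digits after the decimal point were kept as the numerator (for example 9.75/10 stored as 75/10). These can be corrected in the dataset.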
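
The decimal ratings could be recovered from the text with a pattern that allows an optional decimal part in the numerator. This is a sketch under assumptions (the helper `find_ratings` is hypothetical); it returns every candidate, so texts with several matches or with false positives such as "24/7" still need a manual look.

```python
import re

# A numerator with an optional decimal part, a slash, then an integer denominator
RATING_PATTERN = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")


def find_ratings(text):
    """Return every 'numerator/denominator' pair in the text as (float, int)."""
    return [(float(num), int(den)) for num, den in RATING_PATTERN.findall(text)]


decimal_rating = find_ratings("H*ckin magical af 9.75/10")
two_candidates = find_ratings("960/00 isn't a valid rating, 13/10 is tho")
false_positive = find_ratings("Meet Sam. She smiles 24/7")

print(decimal_rating)
print(two_candidates)
print(false_positive)
```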

In [52]:
df_twit_archive[df_twit_archive.text.str.contains("@dog_rates")]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
19,888202515573088257,,,2017-07-21 01:02:36 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: This is Canela. She attempted some fancy porch pics. They were unsuccessful. 13/10 someone help her https://t.co/cLyzpcUcMX,8.874740e+17,4.196984e+09,2017-07-19 00:47:34 +0000,"https://twitter.com/dog_rates/status/887473957103951883/photo/1,https://twitter.com/dog_rates/status/887473957103951883/photo/1,https://twitter.com/dog_rates/status/887473957103951883/photo/1,http...",13,10,Canela,,,,
36,885311592912609280,,,2017-07-13 01:35:06 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: This is Lilly. She just parallel barked. Kindly requests a reward now. 13/10 would pet so well https://t.co/SATN4If5H5,8.305833e+17,4.196984e+09,2017-02-12 01:04:29 +0000,"https://twitter.com/dog_rates/status/830583320585068544/photo/1,https://twitter.com/dog_rates/status/830583320585068544/photo/1,https://twitter.com/dog_rates/status/830583320585068544/photo/1,http...",13,10,Lilly,,,,
68,879130579576475649,,,2017-06-26 00:13:58 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: This is Emmy. She was adopted today. Massive round of pupplause for Emmy and her new family. 14/10 for all involved https://…,8.780576e+17,4.196984e+09,2017-06-23 01:10:23 +0000,"https://twitter.com/dog_rates/status/878057613040115712/photo/1,https://twitter.com/dog_rates/status/878057613040115712/photo/1",14,10,Emmy,,,,
73,878404777348136964,,,2017-06-24 00:09:53 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","RT @dog_rates: Meet Shadow. In an attempt to reach maximum zooming borkdrive, he tore his ACL. Still 13/10 tho. Help him out below\n\nhttps:/…",8.782815e+17,4.196984e+09,2017-06-23 16:00:04 +0000,"https://www.gofundme.com/3yd6y1c,https://twitter.com/dog_rates/status/878281511006478336/photo/1",13,10,Shadow,,,,
74,878316110768087041,,,2017-06-23 18:17:33 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: Meet Terrance. He's being yelled at because he stapled the wrong stuff together. 11/10 hang in there Terrance https://t.co/i…,6.690004e+17,4.196984e+09,2015-11-24 03:51:38 +0000,https://twitter.com/dog_rates/status/669000397445533696/photo/1,11,10,Terrance,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
988,748977405889503236,,,2016-07-01 20:31:43 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",What jokester sent in a pic without a dog in it? This is not @rock_rates. This is @dog_rates. Thank you ...10/10 https://t.co/nDPaYHrtNX,,,,https://twitter.com/dog_rates/status/748977405889503236/photo/1,10,10,not,,,,
1012,747242308580548608,,,2016-06-27 01:37:04 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: This pupper killed this great white in an epic sea battle. Now wears it as a trophy. Such brave. Much fierce. 13/10 https://…,7.047611e+17,4.196984e+09,2016-03-01 20:11:59 +0000,"https://twitter.com/dog_rates/status/704761120771465216/photo/1,https://twitter.com/dog_rates/status/704761120771465216/photo/1",13,10,,,,pupper,
1023,746521445350707200,,,2016-06-25 01:52:36 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: This is Shaggy. He knows exactly how to solve the puzzle but can't talk. All he wants to do is help. 10/10 great guy https:/…,6.678667e+17,4.196984e+09,2015-11-21 00:46:50 +0000,https://twitter.com/dog_rates/status/667866724293877760/photo/1,10,10,Shaggy,,,,
1043,743835915802583040,,,2016-06-17 16:01:16 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: Extremely intelligent dog here. Has learned to walk like human. Even has his own dog. Very impressive 10/10 https://t.co/0Dv…,6.671383e+17,4.196984e+09,2015-11-19 00:32:12 +0000,https://twitter.com/dog_rates/status/667138269671505920/photo/1,10,10,,,,,


#### Assess duplicates

In [26]:
df_twit_archive[df_twit_archive.tweet_id.duplicated()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


> No duplicated tweet IDs

#### Dataset 2 : Image prediction data

##### Info summary of the table

In [13]:
df_image_pred.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


##### Let's have a look at the dataframe view

In [53]:
df_image_pred

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.072010,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
...,...,...,...,...,...,...,...,...,...,...,...,...
2070,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,basset,0.555712,True,English_springer,0.225770,True,German_short-haired_pointer,0.175219,True
2071,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,paper_towel,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False
2072,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,Chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True
2073,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,Chihuahua,0.323581,True,Pekinese,0.090647,True,papillon,0.068957,True


##### Explore rows with missing data

> No apparent missing data

##### Check for other issues

In [58]:
df_image_pred.p1.value_counts()

golden_retriever            150
Labrador_retriever          100
Pembroke                     89
Chihuahua                    83
pug                          57
                           ... 
electric_fan                  1
cuirass                       1
American_black_bear           1
lynx                          1
sulphur-crested_cockatoo      1
Name: p1, Length: 378, dtype: int64

> No striking data quality issue appears.

> One thing that can hamper the rest of the analysis, and that can be treated as a quality issue, is the layout of the predictions. With 3 different predictions per tweet, it is unclear which one should be used in the analysis. It would be more consistent to have a single prediction per tweet in one column.
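One possible way to collapse the 3 predictions, sketched here under the assumption that the "best" prediction is the first one (in order p1, p2, p3) flagged as a dog. The helper `first_dog_value` and the `preds` sample are hypothetical names; the first two rows reuse values from the dataframe view above, and the last row is invented to show the no-dog case.

```python
import pandas as pd

# Two rows taken from the view above, plus one invented row with no dog prediction
preds = pd.DataFrame({
    "p1": ["Chihuahua", "paper_towel", "electric_fan"],
    "p1_conf": [0.716012, 0.170278, 0.5],
    "p1_dog": [True, False, False],
    "p2": ["malamute", "Labrador_retriever", "spatula"],
    "p2_conf": [0.078253, 0.168086, 0.2],
    "p2_dog": [True, True, False],
    "p3": ["kelpie", "spatula", "lynx"],
    "p3_conf": [0.031379, 0.040836, 0.1],
    "p3_dog": [True, False, False],
})


def first_dog_value(df, suffix=""):
    """Return, per row, the value of the first prediction flagged as a dog."""
    # Keep p1 only where p1_dog is True, otherwise leave a missing value
    result = df[f"p1{suffix}"].where(df["p1_dog"])
    for n in (2, 3):
        # Fill the remaining gaps with the next prediction, if it is a dog
        candidate = df[f"p{n}{suffix}"].where(df[f"p{n}_dog"])
        result = result.combine_first(candidate)
    return result


preds["breed"] = first_dog_value(preds)
preds["confidence"] = first_dog_value(preds, suffix="_conf")

print(preds[["breed", "confidence"]])
```

Rows where none of the 3 predictions is a dog end up with a missing breed, which also makes the non-dog images easy to filter out.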

#### Assess duplicates

In [54]:
df_image_pred[df_image_pred.tweet_id.duplicated()]

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog


#### Dataset 3 : Tweet API data

##### Info summary of the table

In [14]:
df_tweet_API.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2331 entries, 0 to 2330
Data columns (total 3 columns):
tweet_id          2331 non-null int64
retweet_count     2331 non-null int64
favorite_count    2331 non-null int64
dtypes: int64(3)
memory usage: 54.8 KB


##### Let's have a look at the dataframe view

In [59]:
df_tweet_API

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,7802,36638
1,892177421306343426,5784,31572
2,891815181378084864,3826,23782
3,891689557279858688,7973,39967
4,891327558926688256,8609,38166
...,...,...,...
2326,666049248165822465,41,99
2327,666044226329800704,133,273
2328,666033412701032449,41,115
2329,666029285002620928,43,121


##### Explore rows with missing data

> No apparent missing data

##### Check for other issues

#### Assess duplicates

In [60]:
df_tweet_API[df_tweet_API.tweet_id.duplicated()]

Unnamed: 0,tweet_id,retweet_count,favorite_count


In [16]:
# check duplicated columns across tables to analyse
all_columns = pd.Series(list(df_twit_archive)+list(df_image_pred)+list(df_tweet_API))
all_columns[all_columns.duplicated()]

17    tweet_id
29    tweet_id
dtype: object

#### Assessment Conclusions

##### Quality


**1. `twitter archive` dataframe**
- Datatypes look non-optimal for: in_reply_to_status_id, in_reply_to_user_id, timestamp, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp
- The columns in_reply_to_status_id and in_reply_to_user_id appear to be of no use in this analysis
- There are 181 rows corresponding to retweets, which are effectively duplicates of our dog ratings
- Dog names such as "a", or any name written only in lower case, are wrong values because they are not actual dog names
- Rating denominators different from 10 look wrong
- The rating numerator at row index 340 is 75 when it should be 9.75 (or 10 if rounded). The rows at index 340, 695, 763 and 1712 all have wrong numerators because their ratings were decimal values.
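The lowercase-name issue listed above could be flagged along these lines. This is only a sketch on an illustrative `names` series (the values "a" and "not" are examples of the wrong names mentioned in this assessment); it assumes that every all-lowercase value is an invalid name.

```python
import pandas as pd

# Illustrative sample mixing real dog names and wrongly extracted words
names = pd.Series(["Canela", "a", "not", "Shadow"], name="name")

# str.islower() is True when all cased characters in the string are lowercase
is_lowercase = names.str.islower()

# Replace the invalid names with a missing value
cleaned = names.mask(is_lowercase)

print(cleaned)
```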

**2. `image prediction` dataframe**
- The predictions can be confusing because there are several per tweet, whereas the analysis should rely on a single column for consistency. It would be better to keep the best of the 3 predictions in one column
- Some predicted dog breeds are invalid because they are not a dog breed but something totally unrelated

**3. `tweet API` dataframe**

##### Tidiness

- In the `twitter archive` dataframe, the values of the columns doggo, floofer, pupper and puppo would fit better in a single category column
- In the `image prediction` dataframe, the 3 predictions should be grouped into one prediction. This will be handled with the quality issues, because deciding which of the 3 predictions is the best one is a matter of quality and consistency
- All the data should be in one table because it all relates to the same object, *the tweet id*
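The two tidiness points on the archive (one stage column, one table) could be sketched as follows. The `archive` and `api` samples, the `dog_stage` column name and the choice of an inner join are assumptions made for illustration. The tweet ids and counts come from the views above, but the stage values are invented, and the sketch assumes an absent stage is stored as the string "None".

```python
import pandas as pd

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Illustrative archive sample; stage values are invented for the example
archive = pd.DataFrame({
    "tweet_id": [892177421306343426, 891815181378084864, 891689557279858688],
    "doggo": ["None", "doggo", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "pupper", "pupper"],
    "puppo": ["None", "None", "None"],
})

# Illustrative API sample with counts taken from the dataframe view above
api = pd.DataFrame({
    "tweet_id": [892177421306343426, 891815181378084864],
    "retweet_count": [5784, 3826],
    "favorite_count": [31572, 23782],
})

# Turn the "None" placeholders into missing values
stages = archive[stage_cols]
stages = stages.where(stages != "None")

# Join the stages present on each row; rows without any stage become missing
dog_stage = stages.apply(lambda row: ",".join(row.dropna()), axis=1)
archive["dog_stage"] = dog_stage.mask(dog_stage == "")

# One table keyed on tweet_id; the inner join keeps only tweets present in both
master = archive.drop(columns=stage_cols).merge(api, on="tweet_id", how="inner")

print(master)
```

Joining with a comma keeps tweets that mention two stages visible instead of silently picking one of them.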

<a id='Cleaning'></a>
## Cleaning Data
- Clean each of the issues you documented while assessing.
- The result should be a high quality and tidy master pandas DataFrame
- Store the clean DataFrame(s) in a CSV file with the main one named twitter_archive_master.csv. If additional files exist because multiple tables are required for tidiness, name these files appropriately. Additionally, you may store the cleaned data in a SQLite database (which is to be submitted as well if you do).

In [83]:
# Create a copy of the datasets to be cleaned
df_twit_archive_clean = df_twit_archive.copy()
df_image_pred_clean = df_image_pred.copy()
df_tweet_API_clean = df_tweet_API.copy()

##### Start with the data to remove

> **Quality issue 1**: *There are 181 rows corresponding to retweets, which are effectively duplicates of our dog ratings*

**Define**

Remove the 181 retweet rows, because they duplicate original tweets and should not be used in the analysis

**Code**

In [84]:
# First drop the retweet rows, i.e. the rows where retweeted_status_id is filled in
df_twit_archive_clean.drop(df_twit_archive_clean[df_twit_archive_clean.retweeted_status_id.notna()].index, axis=0, inplace=True)

In [85]:
# Then get rid of these columns as they will not be useful
df_twit_archive_clean.drop(labels= ['retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp'], axis= 1, inplace=True)

**Test**

In [86]:
df_twit_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 2355
Data columns (total 14 columns):
tweet_id                 2175 non-null int64
in_reply_to_status_id    78 non-null float64
in_reply_to_user_id      78 non-null float64
timestamp                2175 non-null object
source                   2175 non-null object
text                     2175 non-null object
expanded_urls            2117 non-null object
rating_numerator         2175 non-null int64
rating_denominator       2175 non-null int64
name                     2175 non-null object
doggo                    2175 non-null object
floofer                  2175 non-null object
pupper                   2175 non-null object
puppo                    2175 non-null object
dtypes: float64(2), int64(3), object(9)
memory usage: 254.9+ KB


> **Quality issue 2**: *The columns in_reply_to_status_id, in_reply_to_user_id look like they have no use in this analysis*

**Define**

Remove the columns *in_reply_to_status_id, in_reply_to_user_id* because they look like noise in the dataset

**Code**

In [87]:
# Get rid of these columns as they will not be useful
df_twit_archive_clean.drop(labels= ['in_reply_to_status_id', 'in_reply_to_user_id'], axis= 1, inplace=True)

**Test**

In [88]:
df_twit_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 2355
Data columns (total 12 columns):
tweet_id              2175 non-null int64
timestamp             2175 non-null object
source                2175 non-null object
text                  2175 non-null object
expanded_urls         2117 non-null object
rating_numerator      2175 non-null int64
rating_denominator    2175 non-null int64
name                  2175 non-null object
doggo                 2175 non-null object
floofer               2175 non-null object
pupper                2175 non-null object
puppo                 2175 non-null object
dtypes: int64(3), object(9)
memory usage: 220.9+ KB


##### Correction of data types

> **Quality issue 3** : *Datatypes look non-optimal for: in_reply_to_status_id, in_reply_to_user_id , timestamp, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp*

Note: Only the `timestamp` column is left to correct, because the other columns were dropped from the dataset in the cleaning steps just above

**Define**

Correct the `timestamp` column datatype from object to datetime

**Code**
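A minimal sketch of this conversion on an illustrative `sample` dataframe that reuses two timestamps from the archive view. The same `pd.to_datetime` call would apply to `df_twit_archive_clean.timestamp`. Passing `utc=True` is a choice made here so that the result is always timezone-aware.

```python
import pandas as pd

# Illustrative sample reusing two timestamps from the archive view
sample = pd.DataFrame({
    "timestamp": ["2017-07-21 01:02:36 +0000", "2016-06-17 16:01:16 +0000"],
})

# Parse the strings into timezone-aware datetimes (UTC)
sample["timestamp"] = pd.to_datetime(sample["timestamp"], utc=True)

print(sample.dtypes)
```

A check such as `pd.api.types.is_datetime64_any_dtype(sample["timestamp"])` can then confirm the new datatype.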

**Test**


<a id='eda'></a>
## Exploratory Data Analysis
- At least three (3) insights and one (1) visualization must be produced.

<a id='conclusions'></a>
## Conclusions



## Reporting for this Project
- Create a 300-600 word written report called wrangle_report.pdf or wrangle_report.html that briefly describes your wrangling efforts. This is to be framed as an internal document.
- Create a 250-word-minimum written report called act_report.pdf or act_report.html that communicates the insights and displays the visualization(s) produced from your wrangled data. This is to be framed as an external document, like a blog post or magazine article, for example.