# Project: Wrangle and Analyze Data

## Data Gathering
In the cell below, gather **all** three pieces of data for this project and load them in the notebook. **Note:** the methods required to gather each piece of data are different.
1. Directly download the WeRateDogs Twitter archive data (twitter_archive_enhanced.csv)

In [2]:
# Importing needed libraries
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

# import tweepy
import requests as r

In [3]:
# The Twitter archive CSV was already uploaded to the workspace, so it is read directly

tweet_data = pd.read_csv('twitter-archive-enhanced.csv')

2. Use the Requests library to download the tweet image prediction (image_predictions.tsv)

In [161]:
# Using the requests library to download the tweet image prediction programmatically

url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = r.get(url)

with open('image_predictions.tsv', mode='wb') as file:
    file.write(response.content)
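
The download above writes whatever the server returns, including an error page if the request fails. A minimal sketch of a guard, assuming a hypothetical helper named `save_response` (not part of the original notebook) and relying on the `raise_for_status()` method of a Requests response:

```python
from pathlib import Path


def save_response(response, path):
    """Write the response body to `path`, but only if the request succeeded."""
    # raise_for_status() raises an exception for 4xx/5xx responses,
    # so an error page is never written to disk as if it were the dataset.
    response.raise_for_status()
    target = Path(path)
    target.write_bytes(response.content)
    return target.stat().st_size  # number of bytes written
```

In the notebook this could be called as `save_response(r.get(url), 'image_predictions.tsv')`.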

In [4]:
image_prediction = pd.read_csv('image_predictions.tsv', sep='\t')

3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)

In [164]:
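# tweepy itself must be imported: the code below uses tweepy.API and tweepy.TweepError
import tweepy
from tweepy import OAuthHandler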
import json
from timeit import default_timer as timer

# Query Twitter API for each tweet in the Twitter archive and save JSON in a text file
# These are hidden to comply with Twitter's API terms and conditions
consumer_key = 'HIDDEN'
consumer_secret = 'HIDDEN'
access_token = 'HIDDEN'
access_secret = 'HIDDEN'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True)

# NOTE TO STUDENT WITH MOBILE VERIFICATION ISSUES:
# tweet_data is the DataFrame loaded from the Twitter archive CSV above; the
# original script called it df_1, which is not defined in this notebook.
# NOTE TO REVIEWER: this student had mobile verification issues so the following
# Twitter API code was sent to this student from a Udacity instructor
# Tweet IDs for which to gather additional data via Twitter's API
tweet_ids = tweet_data.tweet_id.values
len(tweet_ids)

# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
count = 0
fails_dict = {}
start = timer()
# Save each tweet's returned JSON as a new line in a .txt file
with open('tweet_json.txt', 'w') as outfile:
    # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
    for tweet_id in tweet_ids:
        count += 1
        print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            print("Success")
            json.dump(tweet._json, outfile)
            outfile.write('\n')
        except tweepy.TweepError as e:
            print("Fail")
            fails_dict[tweet_id] = e
            pass
end = timer()
print(end - start)
print(fails_dict)


In [5]:
import json

tweet_list = []
with open('tweet-json.txt','r') as file:
    for i in file:
        tweets = json.loads(i)
        tweet_list.append(tweets)

In [6]:
tweet_list;

In [7]:
additional_tweet_data = pd.DataFrame(tweet_list)

In [8]:
additional_tweet_data = additional_tweet_data[['id','geo','retweet_count','favorite_count']]
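
The cells above load every field of every tweet and only then select four columns. As a rough sketch, the same selection can be made while reading the line-delimited JSON. The two-line `sample` and the `wanted` tuple are illustrative assumptions, not data from the project file:

```python
import io
import json

# Hypothetical two-line sample standing in for the tweet JSON file.
sample = io.StringIO(
    '{"id": 1, "retweet_count": 5, "favorite_count": 9, "lang": "en"}\n'
    '{"id": 2, "retweet_count": 0, "favorite_count": 3, "lang": "en"}\n'
)

wanted = ('id', 'retweet_count', 'favorite_count')

records = []
for line in sample:
    tweet = json.loads(line)
    # Keep only the fields needed later; .get() returns None for a missing key.
    records.append({key: tweet.get(key) for key in wanted})
```

Passing `records` to `pd.DataFrame` would then give a frame with only the selected columns.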

## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issues**. You must use **both** visual assessment and programmatic assessment to assess the data.

**Note:** pay attention to the following key points when you assess the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.



> #### Visual Assessment

In [9]:
tweet_data.sample(10)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
637,793286476301799424,,,2016-11-01 03:00:09 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Moreton. He's the Good Boy Who Lived. ...,,,,https://twitter.com/dog_rates/status/793286476...,13,10,Moreton,,,,
1786,677573743309385728,,,2015-12-17 19:39:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Sandy. He's sexually confused. Thinks ...,,,,https://twitter.com/dog_rates/status/677573743...,10,10,Sandy,,,,
420,822163064745328640,,,2017-01-19 19:25:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Mattie. She's extremely...,7.86234e+17,4196984000.0,2016-10-12 15:55:59 +0000,https://twitter.com/dog_rates/status/786233965...,11,10,Mattie,,,,
292,838083903487373313,,,2017-03-04 17:49:08 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Daisy. She's puppears to be rare as al...,,,,https://twitter.com/dog_rates/status/838083903...,13,10,Daisy,,,,
1338,705066031337840642,,,2016-03-02 16:23:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Reese. He's a Chilean Sohcahtoa. Loves...,,,,https://twitter.com/dog_rates/status/705066031...,12,10,Reese,,,,
1403,699413908797464576,,,2016-02-16 02:04:04 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Miley. She's a Scandinavian Hollabackgirl...,,,,https://twitter.com/dog_rates/status/699413908...,11,10,Miley,,,,
1265,709901256215666688,,,2016-03-16 00:37:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",WeRateDogs stickers are here and they're 12/10...,,,,"http://goo.gl/ArWZfi,https://twitter.com/dog_r...",12,10,,,,,
563,802572683846291456,,,2016-11-26 18:00:13 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Winnie. She's h*ckin ferocious. Dandel...,,,,https://twitter.com/dog_rates/status/802572683...,12,10,Winnie,,,,
1355,703611486317502464,,,2016-02-27 16:03:45 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Scooter. He's experiencing the pupper equ...,,,,https://twitter.com/dog_rates/status/703611486...,10,10,Scooter,,,pupper,
644,793180763617361921,,,2016-10-31 20:00:05 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Newt. He's a strawberry. 11/10 https:/...,,,,https://twitter.com/dog_rates/status/793180763...,11,10,Newt,,,,


In [10]:
additional_tweet_data.sample(10)

Unnamed: 0,id,geo,retweet_count,favorite_count
1442,696713835009417216,,757,2613
555,803638050916102144,,4828,12270
2285,667177989038297088,,58,200
1842,675849018447167488,,172,1027
697,786363235746385920,,4072,12189
2336,666104133288665088,,6871,14765
2142,669942763794931712,,183,536
1658,683078886620553216,,634,2176
1069,740214038584557568,,2220,7335
53,881666595344535552,,11099,51522


In [11]:
image_prediction.sample(10)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1567,794205286408003585,https://pbs.twimg.com/media/CwWVe_3WEAAHAvx.jpg,3,pedestal,0.66266,False,fountain,0.294827,False,brass,0.020371,False
1621,803276597545603072,https://pbs.twimg.com/media/CyXPzXRWgAAvd1j.jpg,1,Pembroke,0.457086,True,chow,0.307801,True,golden_retriever,0.049988,True
284,671141549288370177,https://pbs.twimg.com/media/CVBfrU9WUAApDeV.jpg,1,guinea_pig,0.387728,False,wood_rabbit,0.171681,False,borzoi,0.075358,True
598,679722016581222400,https://pbs.twimg.com/media/CW7bkW6WQAAksgB.jpg,1,boxer,0.459604,True,Boston_bull,0.197913,True,French_bulldog,0.087023,True
1696,816450570814898180,https://pbs.twimg.com/media/C1SddosXUAQcVR1.jpg,1,web_site,0.352857,False,envelope,0.060107,False,nail,0.031291,False
1845,838921590096166913,https://pbs.twimg.com/media/C6Ryuf7UoAAFX4a.jpg,1,Border_terrier,0.664538,True,Brabancon_griffon,0.170451,True,Yorkshire_terrier,0.087824,True
327,671896809300709376,https://pbs.twimg.com/media/CVMOlMiWwAA4Yxl.jpg,1,chow,0.243529,True,hamster,0.22715,False,Pomeranian,0.056057,True
1098,720389942216527872,https://pbs.twimg.com/media/Cf9W1J-UMAErahM.jpg,1,Pembroke,0.873977,True,Cardigan,0.043339,True,Eskimo_dog,0.019197,True
1225,744995568523612160,https://pbs.twimg.com/media/ClbBg4WWEAMjwJu.jpg,1,Old_English_sheepdog,0.427481,True,Shih-Tzu,0.146336,True,Tibetan_terrier,0.134269,True
1015,709852847387627521,https://pbs.twimg.com/media/CdnnZhhWAAEAoUc.jpg,2,Chihuahua,0.945629,True,Pomeranian,0.019204,True,West_Highland_white_terrier,0.010134,True


> #### Programmatic Assessment

> Checking for columns in which every value is missing

In [12]:
image_prediction.isna().all()

tweet_id    False
jpg_url     False
img_num     False
p1          False
p1_conf     False
p1_dog      False
p2          False
p2_conf     False
p2_dog      False
p3          False
p3_conf     False
p3_dog      False
dtype: bool

In [13]:
tweet_data.isna().all()

tweet_id                      False
in_reply_to_status_id         False
in_reply_to_user_id           False
timestamp                     False
source                        False
text                          False
retweeted_status_id           False
retweeted_status_user_id      False
retweeted_status_timestamp    False
expanded_urls                 False
rating_numerator              False
rating_denominator            False
name                          False
doggo                         False
floofer                       False
pupper                        False
puppo                         False
dtype: bool

In [14]:
additional_tweet_data.isna().all()

id                False
geo                True
retweet_count     False
favorite_count    False
dtype: bool
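
Note that `isna().all()` is `True` only when an entire column is missing, which is why only `geo` is flagged above even though `info()` reports several partly empty columns. A small sketch on a made-up frame (the `toy` data is an assumption, reusing two column names from the project) contrasts it with `any()` and `sum()`:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'geo': [np.nan, np.nan, np.nan],        # entirely missing
    'expanded_urls': ['u1', np.nan, 'u3'],  # partly missing
    'text': ['a', 'b', 'c'],                # complete
})

all_missing = toy.isna().all()     # True only where every value is missing
any_missing = toy.isna().any()     # True where at least one value is missing
missing_counts = toy.isna().sum()  # number of missing values per column
```

For an assessment of partly missing columns, `isna().sum()` is usually the more informative of the three.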

In [15]:
image_prediction.info();

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [16]:
tweet_data.info();

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [17]:
additional_tweet_data.info();

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   id              2354 non-null   int64 
 1   geo             0 non-null      object
 2   retweet_count   2354 non-null   int64 
 3   favorite_count  2354 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 73.7+ KB


In [18]:
tweet_data.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


> Checking number of dogs with zero rating

In [19]:
tweet_data[(tweet_data.rating_numerator<1) & (tweet_data.rating_denominator>1)].all().sum()

16
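
The value 16 above appears to count columns, not dogs: `.all()` reduces each column of the filtered frame to one boolean, and `.sum()` then counts the columns that came out `True`. A sketch on a made-up numeric frame (the `toy` data is an assumption) shows the difference between that expression and a row count taken from the mask:

```python
import pandas as pd

toy = pd.DataFrame({
    'rating_numerator': [0, 0, 0, 12],
    'rating_denominator': [10, 10, 10, 10],
    'retweet_count': [5, 7, 9, 11],
})

mask = (toy.rating_numerator < 1) & (toy.rating_denominator > 1)

columns_all_truthy = int(toy[mask].all().sum())  # counts columns, not rows
zero_rated_rows = int(mask.sum())                # counts matching rows
```

Here three rows match the filter, while the first expression returns 2 because two of the three columns contain only truthy values.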

> Ratings with a denominator __greater than 10__

In [20]:
tweet_data[tweet_data.rating_denominator>10].count()

tweet_id                      20
in_reply_to_status_id          4
in_reply_to_user_id            4
timestamp                     20
source                        20
text                          20
retweeted_status_id            1
retweeted_status_user_id       1
retweeted_status_timestamp     1
expanded_urls                 17
rating_numerator              20
rating_denominator            20
name                          20
doggo                         20
floofer                       20
pupper                        20
puppo                         20
dtype: int64

> Checking the dimensions of the 3 tables

In [21]:
image_prediction.shape

(2075, 12)

In [22]:
tweet_data.shape

(2356, 17)

In [23]:
additional_tweet_data.shape

(2354, 4)

In [24]:
image_prediction.describe()

Unnamed: 0,tweet_id,img_num,p1_conf,p2_conf,p3_conf
count,2075.0,2075.0,2075.0,2075.0,2075.0
mean,7.384514e+17,1.203855,0.594548,0.1345886,0.06032417
std,6.785203e+16,0.561875,0.271174,0.1006657,0.05090593
min,6.660209e+17,1.0,0.044333,1.0113e-08,1.74017e-10
25%,6.764835e+17,1.0,0.364412,0.05388625,0.0162224
50%,7.119988e+17,1.0,0.58823,0.118181,0.0494438
75%,7.932034e+17,1.0,0.843855,0.1955655,0.09180755
max,8.924206e+17,4.0,1.0,0.488014,0.273419


In [25]:
additional_tweet_data.describe()

Unnamed: 0,id,retweet_count,favorite_count
count,2354.0,2354.0,2354.0
mean,7.426978e+17,3164.797366,8080.968564
std,6.852812e+16,5284.770364,11814.771334
min,6.660209e+17,0.0,0.0
25%,6.783975e+17,624.5,1415.0
50%,7.194596e+17,1473.5,3603.5
75%,7.993058e+17,3652.0,10122.25
max,8.924206e+17,79515.0,132810.0


> Checking for tweets with zero favorites but more than 1,000 retweets in the __additional_tweet_data__ dataset

In [26]:
additional_tweet_data[(additional_tweet_data.favorite_count<1) & (additional_tweet_data.retweet_count>1000)].sum()

id               -3672632073111277423
geo                                 0
retweet_count                 1445718
favorite_count                      0
dtype: int64

> Checking for rows where the __name__ of the dog in __tweet_data__ is equal to None

In [27]:
len(tweet_data[tweet_data.name =='None'].index)

745

> Checking for duplicate data

In [28]:
print(f'Duplicated data for tweet_data: {tweet_data.tweet_id.duplicated().any()}\n Duplicated data for image_predictions: {image_prediction.tweet_id.duplicated().any()} \n Duplicated data for Additional Tweet Data: {additional_tweet_data.id.duplicated().any()}')

Duplicated data for tweet_data: False
 Duplicated data for image_predictions: False 
 Duplicated data for Additional Tweet Data: False


### Quality issues
1. Tweet IDs are stored as integers instead of strings.


2. The ID column is not set as the index in the datasets.


3. Replies to tweets should be dropped.


4. Retweets should be dropped.



5. Extraneous columns in the datasets.


6. Zero favorite count with a huge retweet count in the __additional tweet data__ sample.


7. Some dog ratings have a value of zero.


8. Some dog ratings have a denominator greater than 10.


9. Prediction confidence in the __Image Predictions__ dataset should be expressed as a percentage.






### Tidiness issues
1. Rating denominator and numerator should be in one column


2. Dog stages are in different columns


3. All the datasets should be merged into one dataset

## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

In [493]:
# Make copies of original pieces of data
tweet_data_copy = tweet_data.copy()
additional_tweet_data_copy = additional_tweet_data.copy()
image_prediction_copy = image_prediction.copy()

### Issue #1:   
Tweet IDs are stored as integers instead of strings.


#### Define:  
Convert the data type of the __"tweet_id"__, __"id"__ and __"tweet_id"__ columns in the __tweet_data_copy__, __additional_tweet_data_copy__ and __image_prediction_copy__ datasets respectively from *__integer__* to *__string__*.

#### Code

In [494]:
tweet_data_copy['tweet_id'] = tweet_data_copy['tweet_id'].astype(str);

In [495]:
additional_tweet_data_copy['id'] = additional_tweet_data_copy['id'].astype(str);

In [496]:
image_prediction_copy['tweet_id'] = image_prediction_copy['tweet_id'].astype(str);

#### Test

In [497]:
tweet_data_copy['tweet_id'].dtype

dtype('O')

In [498]:
additional_tweet_data_copy['id'].dtype

dtype('O')

In [499]:
image_prediction_copy['tweet_id'].dtype

dtype('O')
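
One reason to store the IDs as strings is that they are identifiers, not quantities, and an 18-digit integer does not survive a detour through floating point (the scientific-notation values in the `describe()` output hint at this). A short pure-Python sketch, using one ID from the archive:

```python
tweet_id = 892420643555336193  # an ID taken from the archive

# A float64 carries 53 bits of precision, so an 18-digit ID does not
# survive a round trip through float.
round_tripped = int(float(tweet_id))
survives_float = round_tripped == tweet_id

# A string keeps every digit and cannot be averaged or summed by accident.
as_text = str(tweet_id)
```

`survives_float` is `False` here, while `as_text` keeps all 18 digits.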

### Issue #2:
The ID column is not set as the index in the datasets.

#### Define:
Set the ID column (__tweet_id__ or __id__) in each dataset as the index.

#### Code

In [500]:
tweet_data_copy = tweet_data_copy.set_index('tweet_id');

In [501]:
additional_tweet_data_copy = additional_tweet_data_copy.set_index('id');

In [502]:
image_prediction_copy = image_prediction_copy.set_index('tweet_id');

#### Test:

In [503]:
tweet_data_copy.index

Index(['892420643555336193', '892177421306343426', '891815181378084864',
       '891689557279858688', '891327558926688256', '891087950875897856',
       '890971913173991426', '890729181411237888', '890609185150312448',
       '890240255349198849',
       ...
       '666058600524156928', '666057090499244032', '666055525042405380',
       '666051853826850816', '666050758794694657', '666049248165822465',
       '666044226329800704', '666033412701032449', '666029285002620928',
       '666020888022790149'],
      dtype='object', name='tweet_id', length=2356)

In [504]:
additional_tweet_data_copy.index

Index(['892420643555336193', '892177421306343426', '891815181378084864',
       '891689557279858688', '891327558926688256', '891087950875897856',
       '890971913173991426', '890729181411237888', '890609185150312448',
       '890240255349198849',
       ...
       '666058600524156928', '666057090499244032', '666055525042405380',
       '666051853826850816', '666050758794694657', '666049248165822465',
       '666044226329800704', '666033412701032449', '666029285002620928',
       '666020888022790149'],
      dtype='object', name='id', length=2354)

In [505]:
image_prediction_copy.index

Index(['666020888022790149', '666029285002620928', '666033412701032449',
       '666044226329800704', '666049248165822465', '666050758794694657',
       '666051853826850816', '666055525042405380', '666057090499244032',
       '666058600524156928',
       ...
       '890240255349198849', '890609185150312448', '890729181411237888',
       '890971913173991426', '891087950875897856', '891327558926688256',
       '891689557279858688', '891815181378084864', '892177421306343426',
       '892420643555336193'],
      dtype='object', name='tweet_id', length=2075)

### Issue #3:

#### Define:
Replies to tweets should be dropped

#### Code

In [506]:
tweet_data_copy.shape

(2356, 16)

In [507]:
tweet_data_copy = tweet_data_copy[tweet_data_copy.in_reply_to_status_id.isna()]

#### Test:

In [508]:
tweet_data_copy.in_reply_to_status_id.notna().sum()

0

In [509]:
tweet_data_copy.shape

(2278, 16)

### Issue #4: 
Retweets should be dropped

#### Define:

Retweets are not original ratings and should not be part of the dataset. Keep only the rows where __retweeted_status_id__ is missing.

#### Code

In [510]:
tweet_data_copy.shape

(2278, 16)

In [511]:
tweet_data_copy = tweet_data_copy[tweet_data_copy.retweeted_status_id.isna()]

#### Test:

In [512]:
tweet_data_copy.retweeted_status_id.notna().sum()

0

In [513]:
tweet_data_copy.shape

(2097, 16)
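
Issues #3 and #4 apply the same kind of filter, so they can also be expressed as one boolean mask. A sketch on a made-up frame (the `toy` data and its IDs are assumptions; the column names are the ones used above):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        'in_reply_to_status_id': [np.nan, 6.7e17, np.nan, np.nan],
        'retweeted_status_id': [np.nan, np.nan, 7.8e17, np.nan],
    },
    index=pd.Index(['1', '2', '3', '4'], name='tweet_id'),
)

# An original tweet is neither a reply nor a retweet.
is_original = toy.in_reply_to_status_id.isna() & toy.retweeted_status_id.isna()
originals = toy[is_original]
```

Only tweets `'1'` and `'4'` remain in `originals`.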

### Issue #5:   
Ratings of dogs have values of zero.

#### Define:

Most dogs are rated above 10, so a numerator of zero is unlikely to be a genuine dog rating. Rows with a __rating_numerator__ of zero should be dropped.

#### Code

In [514]:
tweet_data_copy = tweet_data_copy[(tweet_data_copy.rating_numerator!=0)]

#### Test:

In [515]:
(tweet_data_copy.rating_numerator==0).value_counts()

False    2096
Name: rating_numerator, dtype: int64

### Issue #6:   
Ratings of dogs with denominator greater than 10

#### Define:

Dogs are rated out of 10. Most numerators exceed 10, but the denominator should always be 10, so keep only the rows where __rating_denominator__ equals 10.

#### Code

In [516]:
tweet_data_copy = tweet_data_copy[tweet_data_copy.rating_denominator==10] 

#### Test:

In [517]:
tweet_data_copy.rating_denominator.value_counts()

10    2079
Name: rating_denominator, dtype: int64

### Issue #7:   
Prediction confidence in the __Image Predictions__ dataset should be expressed as a percentage

#### Define:
The image predictions would be better understood and easier to read when expressed in percentage.

#### Code

In [518]:
image_prediction_copy.head(2)

Unnamed: 0_level_0,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
tweet_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True


In [519]:
image_prediction_copy.p1_conf = (image_prediction_copy.p1_conf*100).round(2)

In [520]:
image_prediction_copy.p2_conf = (image_prediction_copy.p2_conf*100).round(2)

In [521]:
image_prediction_copy.p3_conf = (image_prediction_copy.p3_conf*100).round(2)

#### Test:

In [522]:
image_prediction_copy.head(4)

Unnamed: 0_level_0,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
tweet_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,46.51,True,collie,15.67,True,Shetland_sheepdog,6.14,True
666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,50.68,True,miniature_pinscher,7.42,True,Rhodesian_ridgeback,7.2,True
666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,59.65,True,malinois,13.86,True,bloodhound,11.62,True
666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,40.81,True,redbone,36.07,True,miniature_pinscher,22.28,True
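
The three conversion cells above repeat one operation per column. As a sketch, the same result can be obtained in a single assignment over a list of columns. The `toy` frame reuses the first two rows of confidence values shown earlier, and `conf_cols` is a name introduced here for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    'p1_conf': [0.465074, 0.506826],
    'p2_conf': [0.156665, 0.074192],
    'p3_conf': [0.061428, 0.072010],
})

conf_cols = ['p1_conf', 'p2_conf', 'p3_conf']
# Scale all confidence columns to percentages and round to two decimals.
toy[conf_cols] = (toy[conf_cols] * 100).round(2)
```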


### Issue #8:   
Extraneous columns in the datasets.


#### Define:

Not all columns in the datasets are needed for analysis. Keep only the columns that are.

#### Code

> for the __tweet_data_copy__ dataset

In [523]:
tweet_data_copy = tweet_data_copy[['rating_numerator','rating_denominator','name','doggo','floofer','pupper','puppo']]

> for the __additional_tweet_data_copy__ dataset

In [524]:
additional_tweet_data_copy = additional_tweet_data_copy[['retweet_count','favorite_count']]

> for the __image_prediction_copy__ dataset

In [525]:
image_prediction_copy = image_prediction_copy.drop(columns = ['jpg_url','img_num'], axis=1) 

#### Test:

In [526]:
tweet_data_copy.head(2)

Unnamed: 0_level_0,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
tweet_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
892420643555336193,13,10,Phineas,,,,
892177421306343426,13,10,Tilly,,,,


In [527]:
additional_tweet_data_copy.head(2)

Unnamed: 0_level_0,retweet_count,favorite_count
id,Unnamed: 1_level_1,Unnamed: 2_level_1
892420643555336193,8853,39467
892177421306343426,6514,33819


In [528]:
image_prediction_copy.head(2)

Unnamed: 0_level_0,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
tweet_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
666020888022790149,Welsh_springer_spaniel,46.51,True,collie,15.67,True,Shetland_sheepdog,6.14,True
666029285002620928,redbone,50.68,True,miniature_pinscher,7.42,True,Rhodesian_ridgeback,7.2,True


### Issue #9:   
Rating denominator and numerator should be in one column

#### Define:

One of the rules of tidiness is that one column represents one variable. Here, a single variable (the rating) is split across 2 columns.

Join the 2 columns into one "ratings" column

#### Code

In [529]:
tweet_data_copy.head(2)

Unnamed: 0_level_0,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
tweet_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
892420643555336193,13,10,Phineas,,,,
892177421306343426,13,10,Tilly,,,,


> First convert the series to string so as to be able to combine them

In [530]:
tweet_data_copy['rating_numerator'] = tweet_data_copy['rating_numerator'].astype(str)

In [531]:
tweet_data_copy['rating_denominator'] = tweet_data_copy['rating_denominator'].astype(str)

> Merging the two series

In [532]:
tweet_data_copy['ratings'] = tweet_data_copy.rating_numerator + '/' + tweet_data_copy.rating_denominator 

> Dropping the original rating series

In [533]:
tweet_data_copy = tweet_data_copy.drop(columns=['rating_numerator','rating_denominator'], axis=1)

#### Test:

In [534]:
tweet_data_copy.head()

Unnamed: 0_level_0,name,doggo,floofer,pupper,puppo,ratings
tweet_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
892420643555336193,Phineas,,,,,13/10
892177421306343426,Tilly,,,,,13/10
891815181378084864,Archie,,,,,12/10
891689557279858688,Darla,,,,,13/10
891327558926688256,Franklin,,,,,12/10


### Issue #10:   
Dog stages are in different columns

#### Define:

One of the rules of tidiness is that one column represents one variable. Here, a single variable (the dog stage) is split across 4 columns.

Join the 4 columns into one "dog_stage" column.

> Some of the dogs have unknown dog stages, which are recorded as 'None'.

> * Create  a new column that classifies a dog as having a known or unknown dog type
> * Melt the dataset to merge the now 5 dog stage classifiers and then remove duplicates

#### Code:

> Defining a function to create the new column

In [535]:
def classify_dog_stage(row):
    """Return 'Unknown dog stage' when no stage column is set, else 'None'."""
    stage_columns = ['doggo', 'floofer', 'pupper', 'puppo']
    # A row has an unknown stage when every stage column holds the 'None' placeholder
    if all(row[column] == 'None' for column in stage_columns):
        return 'Unknown dog stage'
    # Otherwise return 'None' so that this helper row can be filtered out after the melt
    return 'None'

In [536]:

tweet_data_copy['class'] = tweet_data_copy.apply(classify_dog_stage, axis=1)



In [537]:
tweet_data_copy.head(0)

Unnamed: 0_level_0,name,doggo,floofer,pupper,puppo,ratings,class
tweet_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1


In [538]:
tweet_data_copy = tweet_data_copy.melt(id_vars=['name','ratings'], value_vars=['doggo','floofer','pupper','puppo','class'], value_name='dog_stage', ignore_index=False)



In [542]:
# Preview the data without the 'variable' column (the result is not assigned, so the column is kept)
tweet_data_copy.drop(columns='variable', axis=1)

Unnamed: 0_level_0,name,ratings,dog_stage
tweet_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
892420643555336193,Phineas,13/10,
892177421306343426,Tilly,13/10,
891815181378084864,Archie,12/10,
891689557279858688,Darla,13/10,
891327558926688256,Franklin,12/10,
...,...,...,...
666049248165822465,,5/10,Unknown dog stage
666044226329800704,a,6/10,Unknown dog stage
666033412701032449,a,9/10,Unknown dog stage
666029285002620928,a,7/10,Unknown dog stage
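
`DataFrame.drop` returns a new frame and leaves the original untouched unless the result is assigned back, which is consistent with the `variable` column still being present in the outputs that follow. A minimal sketch on a made-up one-row frame:

```python
import pandas as pd

toy = pd.DataFrame({'name': ['Tilly'], 'variable': ['doggo'], 'dog_stage': ['None']})

toy.drop(columns='variable')        # result is discarded; toy is unchanged
still_there = 'variable' in toy.columns

toy = toy.drop(columns='variable')  # assigning the result makes the change stick
gone = 'variable' not in toy.columns
```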


In [539]:
tweet_data_copy.head(2)

Unnamed: 0_level_0,name,ratings,variable,dog_stage
tweet_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
892420643555336193,Phineas,13/10,doggo,
892177421306343426,Tilly,13/10,doggo,


> Removing the placeholder 'None' rows left over from the melt

In [543]:
tweet_data_copy = tweet_data_copy[tweet_data_copy.dog_stage != 'None']

#### Test:

In [544]:
tweet_data_copy.dog_stage.value_counts()

Unknown dog stage    1743
pupper                230
doggo                  83
puppo                  24
floofer                10
Name: dog_stage, dtype: int64
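
The counts above add up to 2090, while 2079 tweets went into the melt, so some tweets carry more than one stage and now appear on more than one row. One possible alternative, sketched here on made-up data, builds the column row by row and keeps one row per tweet. The `combine_stages` helper and the `toy` frame are assumptions, not the method used in this notebook:

```python
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'doggo': ['None', 'doggo', 'doggo', 'None'],
    'floofer': ['None', 'None', 'None', 'None'],
    'pupper': ['pupper', 'None', 'pupper', 'None'],
    'puppo': ['None', 'None', 'None', 'None'],
})


def combine_stages(row):
    # Collect every stage that is set; 'None' is the placeholder for "not set".
    stages = [row[col] for col in stage_cols if row[col] != 'None']
    return ', '.join(stages) if stages else 'Unknown dog stage'


toy['dog_stage'] = toy.apply(combine_stages, axis=1)
toy = toy.drop(columns=stage_cols)
```

A dog tagged with two stages ends up with a value such as `'doggo, pupper'` instead of two separate rows.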

### Issue #11:   
The three datasets should be merged into one dataset

#### Define:

One of the rules of tidiness is that each type of observational unit forms a table.

Join the 3 datasets into one dataset

#### Code:

In [546]:
additional_tweet_data_copy.head(0)

Unnamed: 0_level_0,retweet_count,favorite_count
id,Unnamed: 1_level_1,Unnamed: 2_level_1


In [548]:
image_prediction_copy.head(0)

Unnamed: 0_level_0,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
tweet_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1


In [552]:
twitter_combined = tweet_data_copy.merge(additional_tweet_data_copy, left_index=True, right_index=True).merge(image_prediction_copy, left_index=True, right_index=True)

#### Test:

In [554]:
twitter_combined.head()

Unnamed: 0,name,ratings,variable,dog_stage,retweet_count,favorite_count,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
666020888022790149,,8/10,class,Unknown dog stage,532,2535,Welsh_springer_spaniel,46.51,True,collie,15.67,True,Shetland_sheepdog,6.14,True
666029285002620928,a,7/10,class,Unknown dog stage,48,132,redbone,50.68,True,miniature_pinscher,7.42,True,Rhodesian_ridgeback,7.2,True
666033412701032449,a,9/10,class,Unknown dog stage,47,128,German_shepherd,59.65,True,malinois,13.86,True,bloodhound,11.62,True
666044226329800704,a,6/10,class,Unknown dog stage,147,311,Rhodesian_ridgeback,40.81,True,redbone,36.07,True,miniature_pinscher,22.28,True
666049248165822465,,5/10,class,Unknown dog stage,41,111,miniature_pinscher,56.03,True,Rottweiler,24.37,True,Doberman,15.46,True
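
`merge` performs an inner join by default, so a tweet survives the chained merge only if its ID is present in all three tables. A sketch on three made-up frames (`archive`, `counts` and `preds` are illustrative names) makes the effect on the row count visible:

```python
import pandas as pd

archive = pd.DataFrame({'name': ['a', 'b', 'c']},
                       index=pd.Index(['1', '2', '3'], name='tweet_id'))
counts = pd.DataFrame({'retweet_count': [10, 20]},
                      index=pd.Index(['1', '2'], name='id'))
preds = pd.DataFrame({'p1': ['chow', 'pug']},
                     index=pd.Index(['2', '3'], name='tweet_id'))

# how='inner' is the default for merge; it is spelled out here for clarity.
combined = (
    archive
    .merge(counts, left_index=True, right_index=True, how='inner')
    .merge(preds, left_index=True, right_index=True, how='inner')
)
```

Only tweet `'2'` appears in all three frames, so `combined` has a single row. Comparing `shape` before and after the merge shows how many tweets were dropped this way.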


## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".

In [541]:
# Save the merged master dataset; the tweet ID index is written as the first column
twitter_combined.to_csv('twitter_archive_master.csv')


## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

### Insights:
1.

2.

3.

### Visualization