# Project: Wrangle and Analyze Data

## Data Gathering
In the cell below, gather **all** three pieces of data for this project and load them in the notebook. **Note:** the methods required to gather each piece of data are different.
1. Directly download the WeRateDogs Twitter archive data (twitter_archive_enhanced.csv)

In [159]:
# Importing needed libraries
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

# import tweepy
import requests as r

In [160]:
# already uploaded the Twitter_archive_enhanced.csv

tweet_data = pd.read_csv('twitter-archive-enhanced.csv')

2. Use the Requests library to download the tweet image prediction (image_predictions.tsv)

In [161]:
# Using the requests library to download the tweet image prediction programmatically

url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = r.get(url)

with open('image_predictions.tsv', mode='wb') as file:
    file.write(response.content)

In [162]:
image_prediction = pd.read_csv('image_predictions.tsv', sep='\t')

3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)

In [164]:
import tweepy  # needed for tweepy.API and tweepy.TweepError used below
from tweepy import OAuthHandler
import json
from timeit import default_timer as timer

# Query Twitter API for each tweet in the Twitter archive and save JSON in a text file
# These are hidden to comply with Twitter's API terms and conditions
consumer_key = 'HIDDEN'
consumer_secret = 'HIDDEN'
access_token = 'HIDDEN'
access_secret = 'HIDDEN'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True)

# NOTE TO STUDENT WITH MOBILE VERIFICATION ISSUES:
# tweet_data is the DataFrame loaded from the Twitter archive CSV above. You may have to
# change the tweet_ids assignment below to match the name of your own DataFrame
# NOTE TO REVIEWER: this student had mobile verification issues so the following
# Twitter API code was sent to this student from a Udacity instructor
# Tweet IDs for which to gather additional data via Twitter's API
tweet_ids = tweet_data.tweet_id.values
len(tweet_ids)

# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
count = 0
fails_dict = {}
start = timer()
# Save each tweet's returned JSON as a new line in a .txt file
with open('tweet_json.txt', 'w') as outfile:
    # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
    for tweet_id in tweet_ids:
        count += 1
        print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            print("Success")
            json.dump(tweet._json, outfile)
            outfile.write('\n')
        except tweepy.TweepError as e:
            print("Fail")
            fails_dict[tweet_id] = e
            pass
end = timer()
print(end - start)
print(fails_dict)


In [166]:
import json

tweet_list = []
# Each line of the file holds one tweet as a JSON object
with open('tweet-json.txt', 'r') as file:
    for line in file:
        tweet_list.append(json.loads(line))

In [167]:
tweet_list;

In [168]:
additional_tweet_data = pd.DataFrame(tweet_list)

In [169]:
additional_tweet_data = additional_tweet_data[['id','geo','retweet_count','favorite_count']]
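Since only a handful of fields are kept from each tweet, the line-by-line JSON parsing can also select those fields while reading. The following is a minimal sketch under stated assumptions: `read_tweet_fields` is a hypothetical helper, and the two sample tweets are made up to stand in for the contents of the JSON file.

```python
import io
import json

def read_tweet_fields(lines, fields=("id", "retweet_count", "favorite_count")):
    """Parse line-delimited JSON and keep only the requested keys (hypothetical helper)."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines instead of failing on them
            continue
        tweet = json.loads(line)
        records.append({key: tweet.get(key) for key in fields})
    return records

# Two made-up tweets standing in for the real file contents.
sample = io.StringIO(
    '{"id": 1, "retweet_count": 5, "favorite_count": 9, "lang": "en"}\n'
    '{"id": 2, "retweet_count": 0, "favorite_count": 3, "lang": "en"}\n'
)
records = read_tweet_fields(sample)
print(records)
```

The resulting list of dictionaries can be passed to `pd.DataFrame` in the same way as `tweet_list`.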

## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issues**. You must use **both** visual assessment and programmatic assessment to assess the data.

**Note:** pay attention to the following key points when you assess the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.



> #### Visual Assessment

In [170]:
tweet_data.sample(10)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1322,706291001778950144,,,2016-03-06 01:31:11 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you're just relaxin and having a swell ti...,,,,https://twitter.com/dog_rates/status/706291001...,11,10,,,,,
1866,675349384339542016,6.749998e+17,4196984000.0,2015-12-11 16:20:15 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Yea I lied. Here's more. All 13/10 https://t.c...,,,,https://twitter.com/dog_rates/status/675349384...,13,10,,,,,
797,773191612633579521,,,2016-09-06 16:10:20 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Grey. He's the dogtor in charge of you...,,,,https://twitter.com/dog_rates/status/773191612...,12,10,Grey,,,,
321,834209720923721728,,,2017-02-22 01:14:30 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Wilson. He's aware that he has somethi...,,,,https://twitter.com/dog_rates/status/834209720...,12,10,Wilson,,,,
1393,700029284593901568,,,2016-02-17 18:49:22 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Coops. His ship is taking on water. So...,,,,https://twitter.com/dog_rates/status/700029284...,10,10,Coops,,,,
610,797236660651966464,,,2016-11-12 00:36:46 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Pancake. She loves Batman and winks li...,,,,https://twitter.com/dog_rates/status/797236660...,12,10,Pancake,,,,
497,813142292504645637,,,2016-12-25 22:00:04 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Everybody stop what you're doing and look at t...,,,,https://twitter.com/dog_rates/status/813142292...,13,10,,,,,
1808,676897532954456065,,,2015-12-15 22:52:02 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Exotic handheld dog here. Appears unathletic. ...,,,,https://twitter.com/dog_rates/status/676897532...,5,10,,,,,
1023,746521445350707200,,,2016-06-25 01:52:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Shaggy. He knows exactl...,6.678667e+17,4196984000.0,2015-11-21 00:46:50 +0000,https://twitter.com/dog_rates/status/667866724...,10,10,Shaggy,,,,
2219,668496999348633600,,,2015-11-22 18:31:19 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Jo. Jo is a Swedish Queso. Tongue bigg...,,,,https://twitter.com/dog_rates/status/668496999...,8,10,Jo,,,,


In [171]:
additional_tweet_data.sample(10)

Unnamed: 0,id,geo,retweet_count,favorite_count
2123,670361874861563904,,71,344
834,767884188863397888,,1634,5309
580,800443802682937345,,5068,0
124,868622495443632128,,6275,28295
302,836397794269200385,,31314,0
1535,689877686181715968,,1344,3323
633,793614319594401792,,3661,0
2338,666099513787052032,,73,164
57,880935762899988482,,2886,17346
51,882045870035918850,,5203,29900


In [172]:
image_prediction.sample(10)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1473,780192070812196864,https://pbs.twimg.com/media/CtPMhwvXYAIt6NG.jpg,1,vizsla,0.144012,True,mongoose,0.091474,False,hatchet,0.073545,False
336,672222792075620352,https://pbs.twimg.com/media/CVQ3EDdWIAINyhM.jpg,1,beagle,0.958178,True,basset,0.009117,True,Italian_greyhound,0.007731,True
890,699323444782047232,https://pbs.twimg.com/media/CbR-9edXIAEHJKi.jpg,1,Labrador_retriever,0.309696,True,doormat,0.3037,False,sliding_door,0.077266,False
981,707377100785885184,https://pbs.twimg.com/media/CdEbt0NXIAQH3Aa.jpg,1,golden_retriever,0.637225,True,bloodhound,0.094542,True,cocker_spaniel,0.069797,True
1390,766693177336135680,https://pbs.twimg.com/media/CqPXYLLXEAAU2HC.jpg,1,Doberman,0.948355,True,vizsla,0.015032,True,Rhodesian_ridgeback,0.009631,True
1359,760893934457552897,https://pbs.twimg.com/media/Co88_ujWEAErCg7.jpg,1,Blenheim_spaniel,0.113992,True,cocker_spaniel,0.10578,True,borzoi,0.073935,True
473,675146535592706048,https://pbs.twimg.com/media/CV6aMToXIAA7kH4.jpg,1,dingo,0.288447,False,Cardigan,0.229944,True,Pembroke,0.190407,True
28,666407126856765440,https://pbs.twimg.com/media/CT-NvwmW4AAugGZ.jpg,1,black-and-tan_coonhound,0.529139,True,bloodhound,0.24422,True,flat-coated_retriever,0.17381,True
1092,719551379208073216,https://pbs.twimg.com/media/CfxcKU6W8AE-wEx.jpg,1,malamute,0.873233,True,Siberian_husky,0.076435,True,Eskimo_dog,0.035745,True
1469,779056095788752897,https://pbs.twimg.com/media/Cs_DYr1XEAA54Pu.jpg,1,Chihuahua,0.721188,True,toy_terrier,0.112943,True,kelpie,0.053364,True


> #### Programmatic Assessment

> Checking for missing data

In [173]:
image_prediction.isna().all()

tweet_id    False
jpg_url     False
img_num     False
p1          False
p1_conf     False
p1_dog      False
p2          False
p2_conf     False
p2_dog      False
p3          False
p3_conf     False
p3_dog      False
dtype: bool

In [174]:
tweet_data.isna().all()

tweet_id                      False
in_reply_to_status_id         False
in_reply_to_user_id           False
timestamp                     False
source                        False
text                          False
retweeted_status_id           False
retweeted_status_user_id      False
retweeted_status_timestamp    False
expanded_urls                 False
rating_numerator              False
rating_denominator            False
name                          False
doggo                         False
floofer                       False
pupper                        False
puppo                         False
dtype: bool

In [175]:
additional_tweet_data.isna().all()

id                False
geo                True
retweet_count     False
favorite_count    False
dtype: bool
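Note that `isna().all()` only reports `True` for a column in which every value is missing, so partially empty columns such as `in_reply_to_status_id` (78 non-null values according to `info()`) still show `False`. A minimal sketch on made-up data, comparing it with `isna().sum()`, which counts the missing values per column:

```python
import numpy as np
import pandas as pd

# Toy frame: one complete column, one partly missing, one entirely missing.
toy = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "in_reply_to_status_id": [np.nan, 10.0, np.nan],
    "geo": [np.nan, np.nan, np.nan],
})

all_missing = toy.isna().all()     # True only when the whole column is NaN
missing_counts = toy.isna().sum()  # number of NaN values per column

print(all_missing)
print(missing_counts)
```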

In [176]:
image_prediction.info();

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 119.6+ KB


In [177]:
tweet_data.info();

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [178]:
additional_tweet_data.info();

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   id              2354 non-null   int64 
 1   geo             0 non-null      object
 2   retweet_count   2354 non-null   int64 
 3   favorite_count  2354 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 64.4+ KB


In [179]:
tweet_data.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


> Checking number of dogs with zero rating

In [180]:
tweet_data[(tweet_data.rating_numerator<1) & (tweet_data.rating_denominator>1)].all().sum()

16
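The expression above chains `.all()` and `.sum()`, which reduces column by column first, so the result appears to be a count of columns whose values are all truthy rather than a count of rows. A minimal sketch on made-up data, assuming the goal is the number of matching rows:

```python
import pandas as pd

# Toy data: three of the five rows have a zero numerator.
toy = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4, 5],
    "rating_numerator": [0, 12, 0, 13, 0],
    "rating_denominator": [10, 10, 10, 10, 10],
})

mask = (toy.rating_numerator < 1) & (toy.rating_denominator > 1)

n_matching_rows = int(mask.sum())               # rows that satisfy the condition
n_truthy_columns = int(toy[mask].all().sum())   # columns whose values are all truthy

print(n_matching_rows, n_truthy_columns)
```

Summing the boolean mask counts rows directly, because `True` is treated as 1.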

> Ratings with a denominator __greater than 10__

In [181]:
tweet_data[tweet_data.rating_denominator>10].count()

tweet_id                      20
in_reply_to_status_id          4
in_reply_to_user_id            4
timestamp                     20
source                        20
text                          20
retweeted_status_id            1
retweeted_status_user_id       1
retweeted_status_timestamp     1
expanded_urls                 17
rating_numerator              20
rating_denominator            20
name                          20
doggo                         20
floofer                       20
pupper                        20
puppo                         20
dtype: int64

> Checking the shape (rows, columns) of the 3 tables

In [182]:
image_prediction.shape

(2075, 12)

In [183]:
tweet_data.shape

(2356, 17)

In [184]:
additional_tweet_data.shape

(2354, 4)

In [185]:
image_prediction.describe()

Unnamed: 0,tweet_id,img_num,p1_conf,p2_conf,p3_conf
count,2075.0,2075.0,2075.0,2075.0,2075.0
mean,7.384514e+17,1.203855,0.594548,0.1345886,0.06032417
std,6.785203e+16,0.561875,0.271174,0.1006657,0.05090593
min,6.660209e+17,1.0,0.044333,1.0113e-08,1.74017e-10
25%,6.764835e+17,1.0,0.364412,0.05388625,0.0162224
50%,7.119988e+17,1.0,0.58823,0.118181,0.0494438
75%,7.932034e+17,1.0,0.843855,0.1955655,0.09180755
max,8.924206e+17,4.0,1.0,0.488014,0.273419


In [186]:
additional_tweet_data.describe()

Unnamed: 0,id,retweet_count,favorite_count
count,2354.0,2354.0,2354.0
mean,7.426978e+17,3164.797366,8080.968564
std,6.852812e+16,5284.770364,11814.771334
min,6.660209e+17,0.0,0.0
25%,6.783975e+17,624.5,1415.0
50%,7.194596e+17,1473.5,3603.5
75%,7.993058e+17,3652.0,10122.25
max,8.924206e+17,79515.0,132810.0


> Checking for tweets with zero favorites but more than 1,000 retweets in the __additional_tweet_data__ dataset

In [187]:
additional_tweet_data[(additional_tweet_data.favorite_count<1) & (additional_tweet_data.retweet_count>1000)].sum()

id                -3672632073111277423
geo                                  0
retweet_count                  1445718
favorite_count                       0
dtype: object
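The cell above sums every column of the filtered rows, including `id`; the negative `id` total is consistent with an int64 overflow and carries no meaning. A minimal sketch on made-up data, assuming the aim is to count and list the tweets with no favorites but many retweets:

```python
import pandas as pd

# Toy data with made-up IDs; two rows have zero favorites and many retweets.
toy = pd.DataFrame({
    "id": [11, 12, 13, 14],
    "retweet_count": [5068, 71, 31314, 1634],
    "favorite_count": [0, 344, 0, 5309],
})

suspicious = toy[(toy.favorite_count == 0) & (toy.retweet_count > 1000)]
n_suspicious = len(suspicious)  # number of affected tweets

print(n_suspicious)
print(suspicious["id"].tolist())
```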

> Checking for rows where the __name__ of the dog in the __tweet_data__ is equal to None

In [188]:
len(tweet_data[tweet_data.name =='None'].index)

745

> Checking for duplicate data

In [189]:
print(f'Duplicated data for tweet_data: {tweet_data.tweet_id.duplicated().all()}\n Duplicated data for image_predictions: {image_prediction.tweet_id.duplicated().all()} \n Duplicated data for Additional Tweet Data: {additional_tweet_data.id.duplicated().all()}')

Duplicated data for tweet_data: False
 Duplicated data for image_predictions: False 
 Duplicated data for Additional Tweet Data: False
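`duplicated().all()` is only `True` when every row is flagged as a duplicate, so it prints `False` even if some IDs repeat. A minimal sketch on made-up IDs, using `any()` and `sum()` to detect and count duplicates instead:

```python
import pandas as pd

# Toy ID column in which 102 appears twice.
ids = pd.Series([101, 102, 102, 103], name="tweet_id")

all_dupes = bool(ids.duplicated().all())  # every row a duplicate?
any_dupes = bool(ids.duplicated().any())  # at least one duplicate?
n_dupes = int(ids.duplicated().sum())     # how many repeated rows

print(all_dupes, any_dupes, n_dupes)
```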


### Quality issues
1. Tweet IDs are stored as integers instead of strings.

2. The tweet ID should be set as the index in each dataset.

3. Replies to tweets should be dropped.

4. Retweets should be dropped.

5. Extraneous columns in the datasets.

6. Zero favorite count with a huge retweet count in the __additional tweet data__ sample.

7. Some dog ratings have a value of zero.

8. Some dog ratings have a denominator greater than 10.

9. Prediction confidence in the __Image Predictions__ dataset should be expressed as a percentage.

### Tidiness issues
1. Rating denominator and numerator should be in one column.

2. The dog stage names (doggo, floofer, pupper, puppo) should be merged into one column.

3. __Additional tweet data__ should be joined to the main __tweet data__ dataset.
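The four columns `doggo`, `floofer`, `pupper` and `puppo` each hold one value of the same variable. A minimal sketch of merging them into a single column, on made-up data; the `dog_stage` column name and the `combine_stages` helper are hypothetical, and the string `'None'` is assumed to be the placeholder for a missing stage (as it is for `name`):

```python
import pandas as pd

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Toy rows: no stage, one stage, one stage, two stages.
toy = pd.DataFrame(
    {
        "doggo": ["None", "doggo", "None", "doggo"],
        "floofer": ["None", "None", "None", "None"],
        "pupper": ["None", "None", "pupper", "pupper"],
        "puppo": ["None", "None", "None", "None"],
    },
    index=pd.Index(["1", "2", "3", "4"], name="tweet_id"),
)

def combine_stages(row):
    """Join every stage present in the row; return None when there is none."""
    stages = [value for value in row if value != "None"]
    return ", ".join(stages) if stages else None

toy["dog_stage"] = toy[stage_cols].apply(combine_stages, axis=1)
toy = toy.drop(columns=stage_cols)
print(toy)
```

Rows that carry two stages keep both, separated by a comma, so no information is lost in the merge.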

## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

In [302]:
# Make copies of original pieces of data
tweet_data_copy = tweet_data.copy()
additional_tweet_data_copy = additional_tweet_data.copy()
image_prediction_copy = image_prediction.copy()

### Issue #1:   
Tweet IDs are stored as integers instead of strings.


#### Define:  
Convert the __"tweet_id"__, __"id"__ and __"tweet_id"__ columns in the __tweet_data_copy__, __additional_tweet_data_copy__ and __image_prediction_copy__ datasets, respectively, from *__integer__* to *__string__*.

#### Code

In [303]:
tweet_data_copy['tweet_id'] = tweet_data_copy['tweet_id'].astype(str);

In [304]:
additional_tweet_data_copy['id'] = additional_tweet_data_copy['id'].astype(str);

In [305]:
image_prediction_copy['tweet_id'] = image_prediction_copy['tweet_id'].astype(str);

#### Test

In [306]:
tweet_data_copy['tweet_id'].dtype

dtype('O')

In [307]:
additional_tweet_data_copy['id'].dtype

dtype('O')

In [308]:
image_prediction_copy['tweet_id'].dtype

dtype('O')
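One common motivation for storing IDs as strings, which the notebook itself does not state, is that an 18-digit ID does not survive a conversion to floating point (the reply and retweet ID columns are already displayed as values like `6.749998e+17`). A minimal sketch using the first tweet ID from the archive:

```python
tweet_id = 892420643555336193

# A 64-bit float has a 53-bit mantissa, so an odd 18-digit integer
# cannot be represented exactly and the round trip changes the value.
round_trip = int(float(tweet_id))
print(round_trip == tweet_id)

# A string keeps every digit.
as_text = str(tweet_id)
print(int(as_text) == tweet_id)
```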

### Issue #2:

#### Define:
Set the ID column of each dataset as its index.

#### Code

In [309]:
tweet_data_copy = tweet_data_copy.set_index('tweet_id');

In [310]:
additional_tweet_data_copy = additional_tweet_data_copy.set_index('id');

In [311]:
image_prediction_copy = image_prediction_copy.set_index('tweet_id');

#### Test:

In [312]:
tweet_data_copy.index

Index(['892420643555336193', '892177421306343426', '891815181378084864',
       '891689557279858688', '891327558926688256', '891087950875897856',
       '890971913173991426', '890729181411237888', '890609185150312448',
       '890240255349198849',
       ...
       '666058600524156928', '666057090499244032', '666055525042405380',
       '666051853826850816', '666050758794694657', '666049248165822465',
       '666044226329800704', '666033412701032449', '666029285002620928',
       '666020888022790149'],
      dtype='object', name='tweet_id', length=2356)

In [313]:
additional_tweet_data_copy.index

Index(['892420643555336193', '892177421306343426', '891815181378084864',
       '891689557279858688', '891327558926688256', '891087950875897856',
       '890971913173991426', '890729181411237888', '890609185150312448',
       '890240255349198849',
       ...
       '666058600524156928', '666057090499244032', '666055525042405380',
       '666051853826850816', '666050758794694657', '666049248165822465',
       '666044226329800704', '666033412701032449', '666029285002620928',
       '666020888022790149'],
      dtype='object', name='id', length=2354)

In [314]:
image_prediction_copy.index

Index(['666020888022790149', '666029285002620928', '666033412701032449',
       '666044226329800704', '666049248165822465', '666050758794694657',
       '666051853826850816', '666055525042405380', '666057090499244032',
       '666058600524156928',
       ...
       '890240255349198849', '890609185150312448', '890729181411237888',
       '890971913173991426', '891087950875897856', '891327558926688256',
       '891689557279858688', '891815181378084864', '892177421306343426',
       '892420643555336193'],
      dtype='object', name='tweet_id', length=2075)

### Issue #3:

#### Define:
Replies to tweets should be dropped

#### Code

In [315]:
tweet_data_copy.shape

(2356, 16)

In [316]:
tweet_data_copy = tweet_data_copy[tweet_data_copy.in_reply_to_status_id.isna()]

#### Test:

In [317]:
tweet_data_copy.in_reply_to_status_id.notna().sum()

0

In [318]:
tweet_data_copy.shape

(2278, 16)

### Issue #4: 
 Retweets should be dropped

#### Define:

Tweets that are retweets of original tweets are not supposed to be part of our dataset

#### Code

In [319]:
tweet_data_copy.shape

(2278, 16)

In [320]:
tweet_data_copy = tweet_data_copy[tweet_data_copy.retweeted_status_id.isna()]

#### Test:

In [321]:
tweet_data_copy.retweeted_status_id.notna().sum()

0

In [322]:
tweet_data_copy.shape

(2097, 16)

### Issue #5:   
Some dog ratings have a value of zero.

#### Define:

Dogs are mostly rated above 10, so rows with a rating_numerator of zero should be dropped.

#### Code

In [323]:
tweet_data_copy = tweet_data_copy[(tweet_data_copy.rating_numerator!=0)]

#### Test:

In [324]:
(tweet_data_copy.rating_numerator==0).value_counts()

False    2096
Name: rating_numerator, dtype: int64

### Issue #6:   
Ratings of dogs with denominator greater than 10

#### Define:

Dogs are rated on a scale of 1-10. Most numerators exceed 10, but the denominator should be 10, so keep only the rows whose rating_denominator equals 10.

#### Code

In [325]:
tweet_data_copy = tweet_data_copy[tweet_data_copy.rating_denominator==10] 

#### Test:

In [326]:
tweet_data_copy.rating_denominator.value_counts()

10    2079
Name: rating_denominator, dtype: int64

### Issue #7:   
Prediction confidence in the __Image Predictions__ dataset should be expressed as a percentage

#### Define:
The image prediction confidences would be better understood and easier to read when expressed as percentages.

#### Code

In [327]:
image_prediction.head(2)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,4651.0,True,collie,1567.0,True,Shetland_sheepdog,614.0,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,5068.0,True,miniature_pinscher,742.0,True,Rhodesian_ridgeback,720.0,True


In [328]:
image_prediction.p1_conf = (image_prediction.p1_conf*100).round(2)

In [329]:
image_prediction.p2_conf = (image_prediction.p2_conf*100).round(2)

In [330]:
image_prediction.p3_conf = (image_prediction.p3_conf*100).round(2)

#### Test:

In [331]:
image_prediction.head(4)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,465100.0,True,collie,156700.0,True,Shetland_sheepdog,61400.0,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,506800.0,True,miniature_pinscher,74200.0,True,Rhodesian_ridgeback,72000.0,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,596500.0,True,malinois,138600.0,True,bloodhound,116200.0,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,408100.0,True,redbone,360700.0,True,miniature_pinscher,222800.0,True
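The values in the test output (for example `465100.0`) are far above 100, which looks consistent with the scaling cells having been applied more than once to the original `image_prediction` frame. A minimal sketch of a conversion that is safe to re-run, under stated assumptions: `add_percent_columns` and the `_pct` column names are hypothetical, and the proportions are taken from two rows of the sample output shown earlier.

```python
import pandas as pd

# Proportions copied from two rows of the earlier sample output.
raw = pd.DataFrame({
    "p1_conf": [0.144012, 0.958178],
    "p2_conf": [0.091474, 0.009117],
    "p3_conf": [0.073545, 0.007731],
})

CONF_COLS = ("p1_conf", "p2_conf", "p3_conf")

def add_percent_columns(df, cols=CONF_COLS):
    """Return a copy with <col>_pct columns; the 0-1 source columns stay untouched."""
    out = df.copy()
    for col in cols:
        out[col + "_pct"] = (out[col] * 100).round(2)
    return out

once = add_percent_columns(raw)
twice = add_percent_columns(once)  # re-running does not scale the values again
print(twice[["p1_conf_pct", "p2_conf_pct", "p3_conf_pct"]])
```

Because the percentages are always derived from the untouched proportion columns, running the cell twice gives the same result as running it once.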


### Issue #8:   
Extraneous columns in the datasets.


#### Define:

Not all columns in the datasets are needed for analysis, so keep only the relevant ones.

#### Code

> for the __tweet_data_copy__ dataset

In [332]:
tweet_data_copy = tweet_data_copy[['rating_numerator','rating_denominator','name','doggo','floofer','pupper','puppo']]

> for the __additional_tweet_data_copy__ dataset

In [333]:
additional_tweet_data_copy = additional_tweet_data_copy[['retweet_count','favorite_count']]

> for the __image_prediction_copy__ dataset

In [334]:
image_prediction_copy = image_prediction_copy.drop(columns=['jpg_url', 'img_num'])

#### Test:

In [335]:
tweet_data_copy.head(2)

tweet_id,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
892420643555336193,13,10,Phineas,,,,
892177421306343426,13,10,Tilly,,,,


In [336]:
additional_tweet_data_copy.head(2)

id,retweet_count,favorite_count
892420643555336193,8853,39467
892177421306343426,6514,33819


In [337]:
image_prediction_copy.head(2)

tweet_id,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
666020888022790149,Welsh_springer_spaniel,4651.0,True,collie,1567.0,True,Shetland_sheepdog,614.0,True
666029285002620928,redbone,5068.0,True,miniature_pinscher,742.0,True,Rhodesian_ridgeback,720.0,True


### Issue #9:   
Rating denominator and numerator should be in one column

#### Define:

One of the rules of tidiness is that one column represents one variable. Here, a single variable is split across 2 columns.

Join the 2 columns into one "ratings" column

#### Code

In [343]:
tweet_data_copy.head(2)

tweet_id,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo,ratings
892420643555336193,13,10,Phineas,,,,,tweet_id\n892420643555336193 13\n8921774213...
892177421306343426,13,10,Tilly,,,,,tweet_id\n892420643555336193 13\n8921774213...


In [341]:
# tweet_data_copy['ratings'] = f'{tweet_data_copy["rating_numerator"]}/{tweet_data_copy["rating_denominator"]}'

In [None]:
# tweet_data_copy = tweet_data_copy.drop(columns=['rating_numerator','rating_denominator'], axis=1)

#### Test:

In [342]:
tweet_data_copy.ratings.value_counts()

tweet_id\n892420643555336193    13\n892177421306343426    13\n891815181378084864    12\n891689557279858688    13\n891327558926688256    12\n                      ..\n666049248165822465     5\n666044226329800704     6\n666033412701032449     9\n666029285002620928     7\n666020888022790149     8\nName: rating_numerator, Length: 2079, dtype: int64/tweet_id\n892420643555336193    10\n892177421306343426    10\n891815181378084864    10\n891689557279858688    10\n891327558926688256    10\n                      ..\n666049248165822465    10\n666044226329800704    10\n666033412701032449    10\n666029285002620928    10\n666020888022790149    10\nName: rating_denominator, Length: 2079, dtype: int64    2079
Name: ratings, dtype: int64
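The output above shows a single value holding the printed form of both entire columns. That is what an f-string produces when it is given whole Series: the Series are formatted once and the same text is assigned to every row. A minimal sketch on made-up rows, building the `ratings` value element by element instead:

```python
import pandas as pd

# Toy rows standing in for tweet_data_copy.
toy = pd.DataFrame(
    {"rating_numerator": [13, 12, 5], "rating_denominator": [10, 10, 10]},
    index=pd.Index(["101", "102", "103"], name="tweet_id"),
)

# Convert each column to strings, then concatenate row by row.
toy["ratings"] = (
    toy["rating_numerator"].astype(str) + "/" + toy["rating_denominator"].astype(str)
)
print(toy["ratings"].tolist())
```

Once the combined column looks right, the two source columns can be dropped with `drop(columns=[...])`.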

## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".
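A minimal sketch of how the three cleaned tables could be joined on their shared tweet ID index and written out, on made-up data. The inner joins are an assumption (they keep only tweets present in all three tables), and the sketch writes to an in-memory buffer; in the notebook the file name `twitter_archive_master.csv` would be passed to `to_csv` instead.

```python
import io
import pandas as pd

# Toy stand-ins for the three cleaned tables, each indexed by tweet ID.
archive = pd.DataFrame(
    {"rating_numerator": [13, 12]},
    index=pd.Index(["1", "2"], name="tweet_id"),
)
counts = pd.DataFrame(
    {"retweet_count": [8853, 6514], "favorite_count": [39467, 33819]},
    index=pd.Index(["1", "2"], name="id"),
).rename_axis("tweet_id")  # align the index name with the other tables
predictions = pd.DataFrame(
    {"p1": ["beagle"]},
    index=pd.Index(["2"], name="tweet_id"),
)

# Inner joins keep only the tweets that appear in every table.
master = archive.join(counts, how="inner").join(predictions, how="inner")

buffer = io.StringIO()
master.to_csv(buffer)  # pass "twitter_archive_master.csv" here to write the file

# Reading back with dtype=str keeps the IDs as strings.
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()), dtype={"tweet_id": str})
print(round_trip)
```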

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

### Insights:
1.

2.

3.

### Visualization