# Project: Wrangling and Analyze Data

##### Import required libraries

In [554]:
import pandas as pd
import numpy as np
import requests
import tweepy
from tweepy import OAuthHandler
import json 
from timeit import default_timer as timer

# Set some viewing options
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

## Data Gathering
In the cell below, gather **all** three pieces of data for this project and load them in the notebook. **Note:** the methods required to gather each piece of data are different.
1. Directly download the WeRateDogs Twitter archive data (twitter_archive_enhanced.csv)

In [555]:
# Download twitter-archive-enhanced file directly from web and add to project.
# Import file contents into pandas dataframe called twitter_archive

twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')

2. Use the Requests library to download the tweet image prediction (image_predictions.tsv)

In [556]:
# Download image-predictions.tsv using the requests library

url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
r = requests.get(url, allow_redirects=True)
r.raise_for_status()  # Stop here if the download failed

# Save to a local tsv file; the with-block closes the file handle
with open('image-predictions.tsv', 'wb') as f:
    f.write(r.content)

# Load image-predictions into a dataframe
image_predtictions = pd.read_csv('image-predictions.tsv', sep='\t')

# https://www.tutorialspoint.com/downloading-files-from-web-using-python

3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)

In [557]:
# Note to evaluator - Not using the Twitter API since it does not work without having a paid developer account.
# Using the backup option of importing tweet_json.txt file directly from Udacity

# DO NOT RUN THIS CODE - adding failsafe so that this block will not run unless exec_api_code is set to True
exec_api_code = False

if exec_api_code:
    # Query Twitter API for each tweet in the Twitter archive and save JSON in a text file
    # These are hidden to comply with Twitter's API terms and conditions
    consumer_key = 'HIDDEN'
    consumer_secret = 'HIDDEN'
    access_token = 'HIDDEN'
    access_secret = 'HIDDEN'

    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)

    api = tweepy.API(auth, wait_on_rate_limit=True)

    # NOTE TO STUDENT WITH MOBILE VERIFICATION ISSUES:
    # df_1 is a DataFrame with the twitter_archive_enhanced.csv file. You may have to
    # change line 17 to match the name of your DataFrame with twitter_archive_enhanced.csv
    # NOTE TO REVIEWER: this student had mobile verification issues so the following
    # Twitter API code was sent to this student from a Udacity instructor
    # Tweet IDs for which to gather additional data via Twitter's API
    tweet_ids = twitter_archive.tweet_id.values
    len(tweet_ids)

    # Query Twitter's API for JSON data for each tweet ID in the Twitter archive
    count = 0
    fails_dict = {}
    start = timer()
    # Save each tweet's returned JSON as a new line in a .txt file
    with open('tweet_json.txt', 'w') as outfile:
        # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
        for tweet_id in tweet_ids:
            count += 1
            print(str(count) + ": " + str(tweet_id))
            try:
                tweet = api.get_status(tweet_id, tweet_mode='extended')
                print("Success")
                json.dump(tweet._json, outfile)
                outfile.write('\n')
            except tweepy.TweepError as e:  # Tweepy 3.x name; renamed TweepyException in Tweepy 4.x
                print("Fail")
                fails_dict[tweet_id] = e
    end = timer()
    print(end - start)
    print(fails_dict)

In [558]:
# Read the data from the tweet_json.txt file, line by line, into a pandas dataframe
with open('tweet_json.txt') as f:
    tweet_api = pd.DataFrame(json.loads(line) for line in f)

# Keep only the columns we need
tweet_details = tweet_api[['id', 'retweet_count', 'favorite_count']]

# https://stackoverflow.com/questions/20037430/reading-multiple-json-records-into-a-pandas-dataframe
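
The line-by-line read above can be tried without the real file. The following is a minimal sketch, assuming each line of tweet_json.txt holds one JSON object; the two sample records reuse values from `tweet_details.head()`, and the `lang` field only stands in for the many fields that are not needed. The name `tweet_sample` is hypothetical.

```python
import io
import json

import pandas as pd

# Two lines shaped like tweet_json.txt; the real file carries many more fields per tweet.
# 'lang' stands in for the unused fields, and the blank line simulates a stray empty line.
sample_file = io.StringIO(
    '{"id": 892420643555336193, "retweet_count": 8853, "favorite_count": 39467, "lang": "en"}\n'
    '\n'
    '{"id": 892177421306343426, "retweet_count": 6514, "favorite_count": 33819, "lang": "en"}\n'
)

# Parse one JSON object per line, skipping blank lines that would make json.loads fail
records = [json.loads(line) for line in sample_file if line.strip()]

# Keep only the needed columns; the ids stay exact because they are parsed as integers
tweet_sample = pd.DataFrame(records)[['id', 'retweet_count', 'favorite_count']]
print(tweet_sample.dtypes)
```

Keeping the ids as integers matters because 18-digit tweet ids cannot be represented exactly as floats.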

## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issues**. You must use **both** visual assessment and programmatic assessment to assess the data.

**Note:** pay attention to the following key points when you assess the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.



#### Twitter Archive

In [559]:
twitter_archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU,,,,https://twitter.com/dog_rates/status/892420643555336193/photo/1,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV",,,,https://twitter.com/dog_rates/status/892177421306343426/photo/1,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB,,,,https://twitter.com/dog_rates/status/891815181378084864/photo/1,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ,,,,https://twitter.com/dog_rates/status/891689557279858688/photo/1,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Franklin. He would like you to stop calling him ""cute."" He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f",,,,"https://twitter.com/dog_rates/status/891327558926688256/photo/1,https://twitter.com/dog_rates/status/891327558926688256/photo/1",12,10,Franklin,,,,


In [560]:
twitter_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [561]:
twitter_archive.name.value_counts()

None              745
a                  55
Charlie            12
Lucy               11
Oliver             11
Cooper             11
Lola               10
Penny              10
Tucker             10
Winston             9
Bo                  9
the                 8
Sadie               8
an                  7
Buddy               7
Bailey              7
Daisy               7
Toby                7
Scout               6
Jax                 6
Koda                6
Bella               6
Stanley             6
Milo                6
Dave                6
Jack                6
Rusty               6
Leo                 6
Oscar               6
Alfie               5
very                5
George              5
Bentley             5
Louis               5
Phil                5
Sunny               5
Oakley              5
Sammy               5
Gus                 5
Chester             5
Larry               5
Finn                5
Hank                4
Sampson             4
Chip                4
Jerry     
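
The counts above include lowercase words such as a, the, an, and very, which are unlikely to be dog names. A minimal sketch of how such entries could be flagged programmatically, using a toy sample of the name column rather than the full archive:

```python
import pandas as pd

# Toy sample of the name column; values taken from the value_counts() output above
names = pd.Series(['None', 'a', 'Charlie', 'the', 'Lucy', 'an', 'very', 'Bo'])

# Real names start with a capital letter, so all-lowercase entries are suspect
suspect = names[names.str.islower()]
print(sorted(suspect.unique()))  # ['a', 'an', 'the', 'very']
```

The same mask could be applied to `twitter_archive.name` to list every lowercase entry, not only the most frequent ones.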

In [562]:
# Some denominator values are not equal to 10; the ratings may have been extracted incorrectly.
twitter_archive.query('rating_denominator != 10')[['text', 'rating_numerator', 'rating_denominator']]

Unnamed: 0,text,rating_numerator,rating_denominator
313,"@jonnysun @Lin_Manuel ok jomny I know you're excited but 960/00 isn't a valid rating, 13/10 is tho",960,0
342,@docmisterio account started on 11/15/15,11,15
433,The floofs have been released I repeat the floofs have been released. 84/70 https://t.co/NIYC820tmd,84,70
516,Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. \nKeep Sam smiling by clicking and sharing this link:\nhttps://t.co/98tB8y7y7t https://t.co/LouL5vdvxx,24,7
784,"RT @dog_rates: After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https:/…",9,11
902,Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE,165,150
1068,"After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https://t.co/XAVDNDaVgQ",9,11
1120,Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv,204,170
1165,Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a,4,20
1202,This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq,50,50


In [563]:
twitter_archive.text.sample(5)

1209     Meet Toby. He's a Lithuanian High-Steppin Stickeroo. One of the more accomplished Stickeroos around. 10/10 so nifty https://t.co/cYPHuJYTjC
1854                                Seriously guys?! Only send in dogs. I only rate dogs. This is a baby black bear... 11/10 https://t.co/H7kpabTfLj
1441               This is Misty. She's in a predicament. Not sure what next move should be. 9/10 stay calm pupper I'm comin https://t.co/XhR7PAgcwF
2245                           Meet Stu. Stu has stacks on stacks and an eye made of pure gold. 10/10 pay for my tuition pls https://t.co/7rkYZQdKEd
1176    This doggo was initially thrilled when she saw the happy cartoon pup but quickly realized she'd been deceived. 10/10 https://t.co/mvnBGaWULV
Name: text, dtype: object

#### Twitter Details

In [564]:
tweet_details.head()

Unnamed: 0,id,retweet_count,favorite_count
0,892420643555336193,8853,39467
1,892177421306343426,6514,33819
2,891815181378084864,4328,25461
3,891689557279858688,8964,42908
4,891327558926688256,9774,41048


In [565]:
tweet_details.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   id              2354 non-null   int64
 1   retweet_count   2354 non-null   int64
 2   favorite_count  2354 non-null   int64
dtypes: int64(3)
memory usage: 55.3 KB


In [566]:
tweet_details.describe()

Unnamed: 0,id,retweet_count,favorite_count
count,2354.0,2354.0,2354.0
mean,7.426978e+17,3164.797366,8080.968564
std,6.852812e+16,5284.770364,11814.771334
min,6.660209e+17,0.0,0.0
25%,6.783975e+17,624.5,1415.0
50%,7.194596e+17,1473.5,3603.5
75%,7.993058e+17,3652.0,10122.25
max,8.924206e+17,79515.0,132810.0


#### Image Predictions

In [567]:
image_predtictions.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [568]:
image_predtictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [569]:
image_predtictions.p1.sample(25)


1596                    Siberian_husky
1907                        bloodhound
725                         Pomeranian
1423                Labrador_retriever
1527                    remote_control
1798                        Eskimo_dog
1749                             cairn
484                         box_turtle
452                             vizsla
97                              kuvasz
1294                              chow
1695                Labrador_retriever
870                          washbasin
501                 Labrador_retriever
1670                          Doberman
934                   golden_retriever
320                          dalmatian
2050                  Mexican_hairless
1017                          Pembroke
92                          toy_poodle
139     American_Staffordshire_terrier
905                        tennis_ball
588                             kelpie
321                         chimpanzee
523                        Boston_bull
Name: p1, dtype: object

In [570]:
image_predtictions.p2.sample(25)

872                 Cardigan
1991         Tibetan_mastiff
596                   cannon
1974                Cardigan
1115          Great_Pyrenees
2035                malinois
1859            Newfoundland
34                 chain_saw
6                 mud_turtle
1762        golden_retriever
1441        golden_retriever
1216         standard_poodle
1069                    chow
1509              toy_poodle
403                   ashcan
161     Old_English_sheepdog
2010                 whippet
1526        golden_retriever
1480          Sussex_spaniel
1079        English_springer
1259                Cardigan
1445              bath_towel
808                    teddy
181               Eskimo_dog
1957                Cardigan
Name: p2, dtype: object

In [571]:
image_predtictions.p3.sample(25)

238           Labrador_retriever
1622                        file
945                   Arctic_fox
1304              Siberian_husky
537              squirrel_monkey
1175                 Maltese_dog
829                   Eskimo_dog
1034                  Eskimo_dog
87      Chesapeake_Bay_retriever
186               cocker_spaniel
1583                    Pembroke
1305                      tripod
1658          Labrador_retriever
1442                    Pembroke
926                  Siamese_cat
392                 Walker_hound
614               English_setter
1673              Siberian_husky
1934                      kuvasz
679               cocker_spaniel
92                         teddy
1530          Norwegian_elkhound
1983                  Pomeranian
582             Lakeland_terrier
780                     sombrero
Name: p3, dtype: object

### Quality issues
#### twitter_archive
1. Timestamp column is not in timestamp format
2. Column floofer name should be floof not floofer
3. Retweet messages should not be included, only original tweets. Columns observed: retweeted_status_id and in_reply_to_status_id
4. Columns contain missing values and are not needed for analysis: expanded_urls, in_reply_to_status_id, in_reply_to_user_id, retweeted_status_user_id, retweeted_status_id, retweeted_status_timestamp
5. Columns have inconsistent case, invalid names. Columns observed: name, doggo, floof, pupper, puppo. Also fix case p1, p2, p3 in image_predictions.
6. Values above and below 10 observed in rating_denominator column. Possible outliers also in rating_numerator column.

#### tweet_details
7. Rename 'id' column to tweet_id

#### image_predictions
8. Non-dog names in columns p1, p2, p3

### Tidiness issues
1. Dog variables doggo, floofer (floof), pupper, and puppo are in individual columns and should be categorical
2. Columns from image_predictions and tweet_details should merge with twitter_archive

## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

In [572]:
# Make copies of original pieces of data
twitter_archive_clean = twitter_archive.copy()
tweet_details_clean = tweet_details.copy()
image_pred_clean = image_predtictions.copy()

## Quality Issues
### Issue #1: Timestamp column is not in timestamp format

#### Define
- In the twitter_archive_clean dataframe, change the dtype of the timestamp column to a pandas datetime. 

#### Code

In [573]:
twitter_archive_clean['timestamp'] = pd.to_datetime(twitter_archive_clean['timestamp'])

#### Test

In [574]:
twitter_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype              
---  ------                      --------------  -----              
 0   tweet_id                    2356 non-null   int64              
 1   in_reply_to_status_id       78 non-null     float64            
 2   in_reply_to_user_id         78 non-null     float64            
 3   timestamp                   2356 non-null   datetime64[ns, UTC]
 4   source                      2356 non-null   object             
 5   text                        2356 non-null   object             
 6   retweeted_status_id         181 non-null    float64            
 7   retweeted_status_user_id    181 non-null    float64            
 8   retweeted_status_timestamp  181 non-null    object             
 9   expanded_urls               2297 non-null   object             
 10  rating_numerator            2356 non-null   int64           

### Issue #2: Column floofer name should be floof not floofer

#### Define
- In twitter_archive_clean rename column floofer to floof

#### Code

In [575]:
twitter_archive_clean.rename(columns={'floofer' : 'floof'}, inplace=True)

#### Test

In [576]:
list(twitter_archive_clean)

['tweet_id',
 'in_reply_to_status_id',
 'in_reply_to_user_id',
 'timestamp',
 'source',
 'text',
 'retweeted_status_id',
 'retweeted_status_user_id',
 'retweeted_status_timestamp',
 'expanded_urls',
 'rating_numerator',
 'rating_denominator',
 'name',
 'doggo',
 'floof',
 'pupper',
 'puppo']

### Issue #3: Retweet messages should not be included, only original tweets. Columns observed: retweeted_status_id and in_reply_to_status_id

#### Define
- Remove all rows that are retweets or replies using retweeted_status_id and in_reply_to_status_id columns.

#### Code

In [577]:
# Find the index of the retweets
retweets = twitter_archive_clean[pd.notnull(twitter_archive_clean['retweeted_status_id'])].index

# Find the index of the replies
replies = twitter_archive_clean[pd.notnull(twitter_archive_clean['in_reply_to_status_id'])].index

# Drop the rows
twitter_archive_clean.drop(index=retweets, inplace=True)
twitter_archive_clean.drop(index=replies, inplace=True)

#### Test

In [578]:
# Confirm rows were deleted. Started with 2356 rows.
twitter_archive_clean.shape

(2097, 17)

### Issue #4: Columns contain missing values and are not needed for analysis: expanded_urls, in_reply_to_status_id, in_reply_to_user_id, retweeted_status_user_id, retweeted_status_id, retweeted_status_timestamp

#### Define
- Drop the expanded_urls, in_reply_to_status_id, in_reply_to_user_id, retweeted_status_user_id, retweeted_status_id, and retweeted_status_timestamp columns from twitter_archive_clean

#### Code

In [579]:
# Drop the columns
to_drop = ['expanded_urls', 'in_reply_to_status_id', 'in_reply_to_user_id',
           'retweeted_status_user_id', 'retweeted_status_id', 'retweeted_status_timestamp']
twitter_archive_clean.drop(to_drop, axis=1, inplace=True)

#### Test

In [580]:
list(twitter_archive_clean)

['tweet_id',
 'timestamp',
 'source',
 'text',
 'rating_numerator',
 'rating_denominator',
 'name',
 'doggo',
 'floof',
 'pupper',
 'puppo']

### Issue #5: Columns have inconsistent case, invalid names. Columns observed: name, doggo, floof, pupper, puppo. Also fix case p1, p2, p3 in image_predictions.

#### Define
- Keep the 'None' names, and make the 'a' names 'None' as well.
- Convert all 'None' values to NaN.
- Capitalize the values in the name and dog-stage columns, and in p1, p2, p3 of image_pred_clean.

#### Code

In [581]:
# Change the dog names that are 'a' to 'None'
twitter_archive_clean['name'] = twitter_archive_clean.name.replace('a', 'None')

# Change all columns that contain 'None' to NaN
cols = ['name','doggo','floof','pupper','puppo']
twitter_archive_clean[cols] = twitter_archive_clean[cols].replace('None', np.nan)

# Fix the case issues on cols by using str.capitalize().
# The .str accessor leaves NaN untouched; casting with astype(str) first would
# turn every NaN into the literal string 'Nan'.
twitter_archive_clean[cols] = twitter_archive_clean[cols].apply(lambda x: x.str.capitalize())

cols2 = ['p1', 'p2', 'p3']
image_pred_clean[cols2] = image_pred_clean[cols2].apply(lambda x: x.str.capitalize())


# https://www.datasciencelearner.com/convert-entire-dataframe-columns-lower-case-upper-case/

#### Test

In [582]:
twitter_archive_clean[cols].sample(15), image_pred_clean[cols2].sample(15)

(           name  doggo floof  pupper puppo
 145     Neptune    NaN   NaN     NaN   NaN
 317        Tobi    NaN   NaN     NaN   NaN
 1327      Adele    NaN   NaN  Pupper   NaN
 1177      Clyde    NaN   NaN  Pupper   NaN
 874   Bonaparte    NaN   NaN     NaN   NaN
 18      Ralphus    NaN   NaN     NaN   NaN
 2133    Winston    NaN   NaN     NaN   NaN
 962        Milo    NaN   NaN  Pupper   NaN
 2249        NaN    NaN   NaN     NaN   NaN
 1146        NaN    NaN   NaN     NaN   NaN
 1570      Ember    NaN   NaN     NaN   NaN
 1871        NaN    NaN   NaN     NaN   NaN
 919         NaN  Doggo   NaN     NaN   NaN
 449          Bo  Doggo   NaN     NaN   NaN
 106      Lassie    NaN   NaN     NaN   NaN,
                                p1                    p2  \
 1733                 Irish_setter                Vizsla   
 448                          Tick                  Nail   
 20                    Maltese_dog            Toy_poodle   
 513                        Bubble            Leafhoppe

### Issue #6: Values above and below 10 observed in rating_denominator column. Possible outliers also in rating_numerator column.

#### Define
- Identified multiple rows where the denominator had a value that was not 10. Adjust denominators where appropriate, and drop rows where no rating is possible.
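
Several of the wrong values arise because the tweet text contains more than one fraction, for example a date and a rating. The following sketch shows one way to list every candidate and prefer a rating out of 10; it uses the texts of rows 1068 and 1165 from the earlier query output, and the helper names `find_ratings` and `pick_rating` are hypothetical.

```python
import re

# Texts copied from the earlier query output for rows 1068 and 1165
texts = {
    1068: "After so many requests, this is Bretagne. She was the last surviving 9/11 "
          "search dog, and our second ever 14/10. RIP https://t.co/XAVDNDaVgQ",
    1165: "Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a",
}

def find_ratings(text):
    """Return every numerator/denominator pair found in the text as ints."""
    return [(int(n), int(d)) for n, d in re.findall(r'(\d+)/(\d+)', text)]

def pick_rating(text):
    """Prefer the last pair whose denominator is 10; None if there is none."""
    out_of_ten = [pair for pair in find_ratings(text) if pair[1] == 10]
    return out_of_ten[-1] if out_of_ten else None

for idx, text in texts.items():
    print(idx, find_ratings(text), pick_rating(text))
```

For these two rows the picked values agree with the manual corrections used in this notebook (14/10 and 13/10). Tweets with no fraction out of 10 return None and still need a manual decision.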

#### Code

In [583]:
# Query the clean archive to find remaining rows to be fixed
twitter_archive_clean.query('rating_denominator != 10')[['text', 'rating_numerator', 'rating_denominator']]

Unnamed: 0,text,rating_numerator,rating_denominator
433,The floofs have been released I repeat the floofs have been released. 84/70 https://t.co/NIYC820tmd,84,70
516,Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. \nKeep Sam smiling by clicking and sharing this link:\nhttps://t.co/98tB8y7y7t https://t.co/LouL5vdvxx,24,7
902,Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE,165,150
1068,"After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https://t.co/XAVDNDaVgQ",9,11
1120,Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv,204,170
1165,Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a,4,20
1202,This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq,50,50
1228,Happy Saturday here's 9 puppers on a bench. 99/90 good work everybody https://t.co/mpvaVxKmc1,99,90
1254,Here's a brigade of puppers. All look very prepared for whatever happens next. 80/80 https://t.co/0eb7R1Om12,80,80
1274,"From left to right:\nCletus, Jerome, Alejandro, Burp, &amp; Titson\nNone know where camera is. 45/50 would hug all at once https://t.co/sedre1ivTK",45,50


In [584]:
# Rows to fix
idx_fix_den = [1068, 1165, 1662, 2335]
idx_fix_num = [[1068, 14], [1165, 13], [1662, 10], [2335, 9]]

# Rows to delete
idx_del = [433, 516, 902, 1120, 1202, 1228, 1254, 1274, 1351, 1433, 1635, 1779, 1843]

# Set the denominators = 10
for i in idx_fix_den:
    twitter_archive_clean.at[i, 'rating_denominator'] = 10

# Set the numerators = new value
for i, v in idx_fix_num:
    twitter_archive_clean.at[i, 'rating_numerator'] = v

# Drop rows
twitter_archive_clean.drop(index=idx_del, inplace=True)

#### Test

In [585]:
# Expect to have zero rows
twitter_archive_clean.query('rating_denominator != 10')[['text', 'rating_numerator', 'rating_denominator']]

Unnamed: 0,text,rating_numerator,rating_denominator


### Issue #7: Rename column to tweet_id in tweet_details table

#### Define
- In the tweet_details_clean table, change the column name from id to tweet_id

#### Code

In [588]:
# Rename the column
tweet_details_clean.rename(columns={'id' : 'tweet_id'}, inplace=True)

#### Test

In [590]:
# Display first 5 rows and validate column name has changed
tweet_details_clean.head()

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,8853,39467
1,892177421306343426,6514,33819
2,891815181378084864,4328,25461
3,891689557279858688,8964,42908
4,891327558926688256,9774,41048


### Issue #8: Non-Dog names in columns p1, p2, p3

#### Define
- 
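
One possible way to handle this issue is to keep, for each image, the first prediction that the classifier flagged as a dog and to leave the value missing otherwise. This is a sketch on toy rows that follow the image_predictions column layout; the row values are illustrative and the column name `breed` is hypothetical.

```python
import pandas as pd

# Toy rows using the image_predictions column layout; the values are illustrative
preds = pd.DataFrame({
    'p1': ['golden_retriever', 'box_turtle', 'washbasin'],
    'p1_dog': [True, False, False],
    'p2': ['Labrador_retriever', 'mud_turtle', 'chow'],
    'p2_dog': [True, False, True],
    'p3': ['kuvasz', 'teddy', 'Pomeranian'],
    'p3_dog': [True, False, True],
})

# where() blanks out predictions that are not dogs; fillna() then falls back
# to the next prediction, so the first dog prediction wins
breed = preds['p1'].where(preds['p1_dog'])
breed = breed.fillna(preds['p2'].where(preds['p2_dog']))
breed = breed.fillna(preds['p3'].where(preds['p3_dog']))

# 'breed' is a hypothetical column name chosen for this sketch
preds['breed'] = breed
print(preds['breed'])
```

Rows where none of the three predictions is a dog keep a missing value and could be dropped or kept, depending on the analysis.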

#### Test

## Tidy Issues
### Issue #1: Dog variables doggo, floofer (floof), pupper, and puppo are in individual columns and should be categorical

#### Define
- 
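
One way to approach this issue is to collapse the four columns into a single categorical column. The sketch below works on toy rows and assumes that missing stages are stored as NaN and that the floofer column has already been renamed to floof; the column name `stage` and the helper `combine_stages` are hypothetical.

```python
import numpy as np
import pandas as pd

stage_cols = ['doggo', 'floof', 'pupper', 'puppo']

# Toy rows; NaN marks a stage that does not apply (an assumption of this sketch)
df = pd.DataFrame({
    'doggo':  [np.nan, 'Doggo', np.nan, 'Doggo'],
    'floof':  [np.nan, np.nan, np.nan, np.nan],
    'pupper': ['Pupper', np.nan, np.nan, 'Pupper'],
    'puppo':  [np.nan, np.nan, np.nan, np.nan],
})

def combine_stages(row):
    """Join the stages present in a row; return None when there are none."""
    present = [row[c] for c in stage_cols if pd.notna(row[c])]
    return ', '.join(present) if present else None

# 'stage' is a hypothetical column name chosen for this sketch
df['stage'] = df.apply(combine_stages, axis=1).astype('category')
df = df.drop(columns=stage_cols)
print(df)
```

Joining the values keeps tweets that mention two stages distinguishable instead of silently picking one of them.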

#### Test

### Issue #2: Columns from image_predictions and tweet_details should merge with twitter_archive

#### Define
- 
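
A possible shape for this step is two inner merges on tweet_id, so that only tweets present in all three tables survive; this also enforces the requirement that ratings have images. The sketch uses small toy frames with made-up ids, and the names `archive`, `details`, `images`, and `master` are hypothetical.

```python
import pandas as pd

# Toy frames standing in for the three cleaned tables; ids and values are made up
archive = pd.DataFrame({'tweet_id': [1, 2, 3], 'rating_numerator': [13, 12, 11]})
details = pd.DataFrame({'tweet_id': [1, 2, 3],
                        'retweet_count': [10, 20, 30],
                        'favorite_count': [100, 200, 300]})
images = pd.DataFrame({'tweet_id': [1, 3], 'jpg_url': ['a.jpg', 'c.jpg']})

# Inner joins keep only the tweets that appear in every table
master = (archive
          .merge(details, on='tweet_id', how='inner')
          .merge(images, on='tweet_id', how='inner'))
print(master)
```

Tweet 2 has no image prediction in the toy data, so it is absent from the result.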

#### Test

## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".
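
A minimal sketch of the save step, assuming the merged result is held in a DataFrame named `master` (a hypothetical name). It writes to a temporary directory so that it can be run without touching the project folder, then reads the file back to confirm the round trip.

```python
import os
import tempfile

import pandas as pd

# Toy master table with one row taken from tweet_details.head()
master = pd.DataFrame({'tweet_id': [892420643555336193],
                       'retweet_count': [8853],
                       'favorite_count': [39467]})

# Write to a temporary directory for this sketch; the project would use its own folder
path = os.path.join(tempfile.mkdtemp(), 'twitter_archive_master.csv')

# index=False avoids an extra 'Unnamed: 0' column when the file is read back
master.to_csv(path, index=False)

restored = pd.read_csv(path)
print(restored.equals(master))
```

Note that datetime and categorical columns come back from CSV as plain strings, so they need converting again after loading.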

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**
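
As a starting point for an insight, the relationship between retweets and favorites can be measured directly. The sketch below uses only the five rows shown by `tweet_details.head()`, which is far too small a sample for a conclusion; the derived column `fav_per_retweet` is a hypothetical name.

```python
import pandas as pd

# The five rows shown by tweet_details.head(); far too few for a real conclusion
sample = pd.DataFrame({
    'retweet_count': [8853, 6514, 4328, 8964, 9774],
    'favorite_count': [39467, 33819, 25461, 42908, 41048],
})

# Pearson correlation between the two engagement measures
corr = sample['retweet_count'].corr(sample['favorite_count'])

# Favorites received per retweet, a simple derived measure
sample['fav_per_retweet'] = sample['favorite_count'] / sample['retweet_count']

print(round(corr, 3))
print(sample['fav_per_retweet'].describe())
```

Running the same two computations on the full master table would give figures that can back an actual insight.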

### Insights:
1.

2.

3.

### Visualization

### References
- Downloading files using the requests library: https://www.tutorialspoint.com/downloading-files-from-web-using-python
- Reading a JSON file line by line into a dataframe: https://stackoverflow.com/questions/20037430/reading-multiple-json-records-into-a-pandas-dataframe
- Strings that do not contain a value: https://stackoverflow.com/questions/17097643/search-for-does-not-contain-on-a-dataframe-in-pandas
- Converting case on multiple columns at the same time: https://www.datasciencelearner.com/convert-entire-dataframe-columns-lower-case-upper-case/

