# Project: Wrangling and Analyzing Data

## Data Gathering
In the cells below, gather **all** three pieces of data for this project and load them in the notebook. **Note:** the method required to gather each dataset is different.
1. Directly download the WeRateDogs Twitter archive data (twitter_archive_enhanced.csv)

In [1]:
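import pandas as pd
import requests

def download_file(url, filename):
    """Download `url` and save the response body to `filename`."""
    r = requests.get(url)
    r.raise_for_status()  # stop on HTTP errors instead of saving an error page
    with open(filename, 'wb') as file:
        file.write(r.content)

# The archive was downloaded directly, so it is read straight from disk
twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')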

2. Use the Requests library to download the tweet image prediction (image_predictions.tsv)

In [2]:
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
filename = url.split('/')[-1]
download_file(url,filename)
image_pred = pd.read_csv(filename, sep='\t')

3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)

In [3]:
tweet_info = pd.read_json('tweet-json.txt',lines=True)
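
`pd.read_json(..., lines=True)` loads every field of every tweet object, most of which are not needed later. As an alternative, the sketch below reads the file line by line and keeps only a few fields. It assumes each line is one JSON tweet object with `id`, `retweet_count`, and `favorite_count` keys, as in the standard Twitter API payload. The helper name `load_tweet_counts` and the output column names are hypothetical, not part of the original notebook.

```python
import json
import pandas as pd

def load_tweet_counts(path):
    """Read a line-delimited JSON file and keep a few fields per tweet."""
    rows = []
    with open(path, encoding='utf-8') as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            tweet = json.loads(line)
            rows.append({
                'tweet_id': tweet['id'],  # parsed as an exact Python int
                'retweet_count': tweet['retweet_count'],
                'favorite_count': tweet['favorite_count'],
            })
    return pd.DataFrame(rows, columns=['tweet_id', 'retweet_count', 'favorite_count'])

# Possible usage, assuming the file from the cell above exists:
# tweet_counts = load_tweet_counts('tweet-json.txt')
```

Parsing each line with `json.loads` keeps the tweet IDs as exact integers, which may be worth checking against the `id` and `id_str` columns produced by `pd.read_json`.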

## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issues**. You must use **both** visual assessment and programmatic assessment to assess the data.

**Note:** pay attention to the following key points when you assess the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.
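
The first key point (original ratings only, with images) can be sketched as a pair of boolean masks. This is a minimal illustration on toy frames that stand in for `twitter_archive` and `image_pred`. The values are made up, and it assumes a retweet is any row with a non-null `retweeted_status_id`.

```python
import pandas as pd

# Toy stand-ins for twitter_archive and image_pred (values are made up)
archive = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'retweeted_status_id': [None, 99.0, None, None],
})
images = pd.DataFrame({'tweet_id': [1, 2, 3]})

# Assumption: a non-null retweeted_status_id marks a retweet
is_original = archive['retweeted_status_id'].isna()
# Keep only tweets that also appear in the image predictions
has_image = archive['tweet_id'].isin(images['tweet_id'])

originals = archive[is_original & has_image]
```

Here tweet 2 is dropped as a retweet and tweet 4 is dropped because it has no image prediction.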



In [4]:
twitter_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [5]:
twitter_archive.sample(10)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
705,785639753186217984,,,2016-10-11 00:34:48 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Pinot. He's a sophisticated doggo. You...,,,,https://twitter.com/dog_rates/status/785639753...,10,10,Pinot,doggo,,pupper,
1805,676942428000112642,,,2015-12-16 01:50:26 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Who leaves the last cupcake just sitting there...,,,,https://twitter.com/dog_rates/status/676942428...,9,10,,,,,
1118,732005617171337216,,,2016-05-16 00:31:53 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Larry. He has no self control. Tongue ...,,,,https://twitter.com/dog_rates/status/732005617...,11,10,Larry,,,,
143,864197398364647424,,,2017-05-15 19:14:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Paisley. She ate a flower just to prov...,,,,https://twitter.com/dog_rates/status/864197398...,13,10,Paisley,,,,
38,884925521741709313,,,2017-07-12 00:01:00 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Earl. He found a hat. Nervous about wh...,,,,https://twitter.com/dog_rates/status/884925521...,12,10,Earl,,,,
1103,735256018284875776,,,2016-05-24 23:47:49 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Kellogg. He accidentally opened the fr...,,,,https://twitter.com/dog_rates/status/735256018...,8,10,Kellogg,doggo,,,
2033,671744970634719232,,,2015-12-01 17:37:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Very fit horned dog here. Looks powerful. Not ...,,,,https://twitter.com/dog_rates/status/671744970...,6,10,,,,,
1214,715200624753819648,,,2016-03-30 15:34:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Michelangelope. He's half coffee cup. ...,,,,https://twitter.com/dog_rates/status/715200624...,12,10,Michelangelope,,,,
2113,670434127938719744,,,2015-11-28 02:48:46 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Hank and Sully. Hank is very proud of the...,,,,https://twitter.com/dog_rates/status/670434127...,11,10,Hank,,,,
336,832636094638288896,,,2017-02-17 17:01:29 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Orion. He just got back from the denti...,,,,https://twitter.com/dog_rates/status/832636094...,12,10,Orion,,,,


In [6]:
image_pred.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [7]:
image_pred.sample(10)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1276,750071704093859840,https://pbs.twimg.com/media/CmjKOzVWcAAQN6w.jpg,2,redbone,0.382113,True,malinois,0.249943,True,miniature_pinscher,0.070926,True
818,692901601640583168,https://pbs.twimg.com/media/CZ2uU37UcAANzmK.jpg,1,soft-coated_wheaten_terrier,0.403496,True,cocker_spaniel,0.135164,True,golden_retriever,0.088719,True
1541,791026214425268224,https://pbs.twimg.com/media/CpmyNumW8AAAJGj.jpg,1,malamute,0.375098,True,jean,0.069362,False,keeshond,0.050528,True
708,685198997565345792,https://pbs.twimg.com/media/CYJQxvJW8AAkkws.jpg,1,dishwasher,0.888829,False,stove,0.013411,False,Old_English_sheepdog,0.009671,True
1502,784517518371221505,https://pbs.twimg.com/media/CuMqhGrXYAQwRqU.jpg,2,malamute,0.757764,True,Eskimo_dog,0.151248,True,Siberian_husky,0.08484,True
755,688385280030670848,https://pbs.twimg.com/media/CY2iwGNWUAI5zWi.jpg,2,golden_retriever,0.900437,True,cocker_spaniel,0.022292,True,sombrero,0.014997,False
927,702671118226825216,https://pbs.twimg.com/media/CcBjp2nWoAA8w-2.jpg,1,bloodhound,0.381227,True,Sussex_spaniel,0.212017,True,clumber,0.128622,True
985,707610948723478529,https://pbs.twimg.com/media/CdHwZd0VIAA4792.jpg,1,golden_retriever,0.383223,True,cocker_spaniel,0.16593,True,Chesapeake_Bay_retriever,0.118199,True
1598,799297110730567681,https://pbs.twimg.com/media/CxeseRgUoAM_SQK.jpg,1,malamute,0.985028,True,Siberian_husky,0.005834,True,Eskimo_dog,0.005443,True
1373,762471784394268675,https://pbs.twimg.com/ext_tw_video_thumb/76247...,1,Samoyed,0.540276,True,standard_poodle,0.279802,True,toy_poodle,0.102058,True


In [8]:
tweet_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 31 columns):
 #   Column                         Non-Null Count  Dtype              
---  ------                         --------------  -----              
 0   created_at                     2354 non-null   datetime64[ns, UTC]
 1   id                             2354 non-null   int64              
 2   id_str                         2354 non-null   int64              
 3   full_text                      2354 non-null   object             
 4   truncated                      2354 non-null   bool               
 5   display_text_range             2354 non-null   object             
 6   entities                       2354 non-null   object             
 7   extended_entities              2073 non-null   object             
 8   source                         2354 non-null   object             
 9   in_reply_to_status_id          78 non-null     float64            
 10  in_reply_to_status_id_st

In [12]:
tweet_info.sample(10)

Unnamed: 0,created_at,id,id_str,full_text,truncated,display_text_range,entities,extended_entities,source,in_reply_to_status_id,...,favorite_count,favorited,retweeted,possibly_sensitive,possibly_sensitive_appealable,lang,retweeted_status,quoted_status_id,quoted_status_id_str,quoted_status
867,2016-08-06 02:06:59+00:00,761745352076779520,761745352076779520,Guys.. we only rate dogs. Pls don't send any m...,False,"[0, 116]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 761745343629422592, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,4707,False,False,0.0,0.0,en,,,,
1335,2016-03-02 18:48:16+00:00,705102439679201280,705102439679201280,This is Terrenth. He just stubbed his toe. 10/...,False,"[0, 94]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 705102425103839232, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,2342,False,False,0.0,0.0,en,,,,
1262,2016-03-16 01:46:45+00:00,709918798883774466,709918798883774464,Meet Watson. He's a Suzuki Tickleboop. Leader ...,False,"[0, 140]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 709918790847492096, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,3250,False,False,0.0,0.0,en,,,,
2198,2015-11-23 04:59:42+00:00,668655139528511488,668655139528511488,Say hello to Winifred. He is a Papyrus Hydrang...,False,"[0, 112]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 668655136865181697, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,562,False,False,0.0,0.0,en,,,,
164,2017-05-04 17:01:34+00:00,860177593139703809,860177593139703808,RT @dog_rates: Ohboyohboyohboyohboyohboyohboyo...,False,"[0, 140]","{'hashtags': [], 'symbols': [], 'user_mentions...",,"<a href=""http://twitter.com/download/iphone"" r...",,...,0,False,False,,,in,{'created_at': 'Fri Aug 05 21:19:27 +0000 2016...,,,
1900,2015-12-09 17:38:19+00:00,674644256330530816,674644256330530816,When you see sophomores in high school driving...,False,"[0, 77]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 674644247564394496, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,1111,False,False,0.0,0.0,en,,,,
246,2017-03-25 02:15:26+00:00,845459076796616705,845459076796616704,RT @dog_rates: Here's a heartwarming scene of ...,False,"[0, 140]","{'hashtags': [], 'symbols': [], 'user_mentions...",,"<a href=""http://twitter.com/download/iphone"" r...",,...,0,False,False,,,en,{'created_at': 'Fri Jul 22 00:43:32 +0000 2016...,,,
530,2016-12-12 00:29:28+00:00,808106460588765185,808106460588765184,Here we have Burke (pupper) and Dexter (doggo)...,False,"[0, 120]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 808106447573843970, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,9701,False,False,0.0,0.0,en,,,,
1416,2016-02-13 03:59:01+00:00,698355670425473025,698355670425473024,This is Jessiga. She's a Tasmanian McCringlebe...,False,"[0, 138]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 698355656588447745, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,2046,False,False,0.0,0.0,en,,,,
91,2017-06-12 16:06:11+00:00,874296783580663808,874296783580663808,This is Jed. He may be the fanciest pupper in ...,False,"[0, 114]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 874296776056078336, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,26651,False,False,0.0,0.0,en,,,,


### Quality issues
1. Incorrect data types in `twitter_archive`: `in_reply_to_status_id` and `in_reply_to_user_id` are floats, and `timestamp` is a string

2. `expanded_urls` in `twitter_archive` can contain more than one URL

3. Dog names in `twitter_archive` are incorrect in places; names recorded as `None` should be null

4. Some text and names in tweets are encoded incorrectly

5. Some ratings are not for dogs but for other things

6. Ratings with decimals were not extracted properly

7. In `tweet_info`, `id_str`, `in_reply_to_status_id_str`, and `in_reply_to_user_id_str` are integers

8. `lang` in `tweet_info` should be categorical

9. In `tweet_info`, `possibly_sensitive` and `possibly_sensitive_appealable` should be boolean

10. Retweets are present in `twitter_archive`; the columns referring to retweets in `twitter_archive` and `tweet_info` should be removed
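
Several of the issues above are data-type problems. The sketch below shows one possible set of conversions on a toy frame; it is not the notebook's actual cleaning code. The rows are illustrative, and treating tweet IDs as strings is an assumption (IDs are identifiers, not quantities).

```python
import pandas as pd

# Toy frame mimicking a few problem columns (values are illustrative only)
df = pd.DataFrame({
    'tweet_id': [705102439679201280, 674644256330530816],
    'timestamp': ['2016-03-02 18:48:16 +0000', '2015-12-09 17:38:19 +0000'],
    'lang': ['en', 'en'],
    'possibly_sensitive': [0.0, None],
})

# IDs are labels, so store them as strings (assumption)
df['tweet_id'] = df['tweet_id'].astype(str)
# Parse the timestamp strings into timezone-aware datetimes
df['timestamp'] = pd.to_datetime(df['timestamp'], utc=True)
# A small set of repeated language codes fits the category dtype
df['lang'] = df['lang'].astype('category')
# Nullable boolean keeps missing values instead of forcing True/False
df['possibly_sensitive'] = df['possibly_sensitive'].astype('boolean')
```

The nullable `boolean` dtype is used rather than plain `bool` because the column contains missing values.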

### Tidiness issues

1. The URL in `twitter_archive` should be split into text and shortened_url

2. The four dog stage columns should be melted into one
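
One way to collapse the four stage columns into a single column is sketched below. It uses a row-wise join rather than `pd.melt`, because some tweets carry more than one stage (for example, a row can have both `doggo` and `pupper` set). It assumes that an empty stage is stored as the string `'None'`. The column name `dog_stage` and the helper `combine_stages` are hypothetical.

```python
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Toy rows; 'None' as the empty marker is an assumption about the archive
df = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'doggo':   ['doggo', 'None', 'None'],
    'floofer': ['None', 'None', 'None'],
    'pupper':  ['pupper', 'None', 'None'],
    'puppo':   ['None', 'puppo', 'None'],
})

def combine_stages(row):
    """Join every stage whose cell equals its own column name."""
    found = [col for col in stage_cols if row[col] == col]
    return ','.join(found) if found else None

df['dog_stage'] = df[stage_cols].apply(combine_stages, axis=1)
df = df.drop(columns=stage_cols)
```

Rows with two stages end up with a comma-separated value, which keeps the information visible instead of silently picking one.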

## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).
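
The merge into a master DataFrame can be sketched as a chain of joins on the tweet ID. The frames below are toy stand-ins with made-up values. The choice of inner joins is an assumption: it keeps only tweets present in all three sources.

```python
import pandas as pd

# Toy stand-ins for the three cleaned tables (values are made up)
archive = pd.DataFrame({'tweet_id': [1, 2, 3], 'rating_numerator': [12, 10, 13]})
images = pd.DataFrame({'tweet_id': [1, 2], 'p1': ['malamute', 'redbone']})
counts = pd.DataFrame({'id': [1, 2, 3], 'favorite_count': [100, 50, 75]})

master = (
    archive
    .merge(images, on='tweet_id', how='inner')
    # The API table names its key `id`, so align it before joining
    .merge(counts.rename(columns={'id': 'tweet_id'}), on='tweet_id', how='inner')
)
```

Tweet 3 drops out because it has no image prediction, which matches the requirement to keep only ratings that have images.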

In [14]:
# Make copies of original pieces of data
tweet_info_clean = tweet_info.copy()
image_pred_clean = image_pred.copy()
twitter_archive_clean = twitter_archive.copy()

### Quality Issue #1: 
Names in `twitter_archive` are incorrect in places; names recorded as `None` should be null

#### Define: 
Replace names recorded as `None`, and words such as `a` that are not real names, with null values

#### Code

In [31]:
# Manual notes (row index, name):
#   649  Forrest
#   2287 Daryl
twitter_archive_clean.query("name == 'a'").text

56      Here is a pupper approaching maximum borkdrive...
649     Here is a perfect example of someone who has t...
801     Guys this is getting so out of hand. We only r...
1002    This is a mighty rare blue-tailed hammer sherk...
1004    Viewer discretion is advised. This is a terrib...
1017    This is a carrot. We only rate dogs. Please on...
1049    This is a very rare Great Alaskan Bush Pupper....
1193    People please. This is a Deadly Mediterranean ...
1207    This is a taco. We only rate dogs. Please only...
1340    Here is a heartbreaking scene of an incredible...
1351    Here is a whole flock of puppers.  60/50 I'll ...
1361    This is a Butternut Cumberfloof. It's not wind...
1368    This is a Wild Tuscan Poofwiggle. Careful not ...
1382    "Pupper is a present to world. Here is a bow f...
1499    This is a rare Arctic Wubberfloof. Unamused by...
1737    Guys this really needs to stop. We've been ove...
1785    This is a dog swinging. I really enjoyed it so...
1853    This i
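
The query above shows that the word `a` was captured as a name. A minimal sketch of nulling such values follows, assuming that every all-lowercase entry is an extraction error rather than a real name. The sample series is illustrative, and rows whose real name appears elsewhere in the text would still need manual correction.

```python
import numpy as np
import pandas as pd

# Illustrative names; 'an' and 'the' are assumed examples of the same error
names = pd.Series(['Pinot', 'None', 'a', 'Larry', 'an', 'the'])

# Placeholder 'None' strings and all-lowercase words are treated as missing
is_placeholder = (names == 'None') | names.str.islower()
cleaned = names.mask(is_placeholder, np.nan)
```

In the notebook this would be applied to `twitter_archive_clean['name']` instead of the toy series.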

#### Test

### Issue #2:

#### Define

#### Code

#### Test

## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".
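
A minimal sketch of the save step, with a read-back check. The `master` frame here is a toy stand-in, and the temporary directory is used only to keep the example self-contained; in the notebook the file would be written next to the other data files.

```python
import os
import tempfile
import pandas as pd

# Toy stand-in for the cleaned master DataFrame (values are made up)
master = pd.DataFrame({
    'tweet_id': ['705102439679201280', '674644256330530816'],
    'rating_numerator': [10, 11],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'twitter_archive_master.csv')
    # index=False avoids writing the row index as an extra unnamed column
    master.to_csv(path, index=False)
    # Read IDs back as strings so they are not reparsed as integers
    restored = pd.read_csv(path, dtype={'tweet_id': str})
```

Passing `dtype` on read matters because CSV stores no type information, so string IDs would otherwise come back as integers.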

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

### Insights:
1.

2.

3.

### Visualization