# Project: Wrangle and Analyze Data

## Data Gathering
In the cell below, gather **all** three pieces of data for this project and load them in the notebook. **Note:** the methods required to gather each piece of data are different.
1. Directly download the WeRateDogs Twitter archive data (twitter_archive_enhanced.csv)

In [133]:
#import required packages
import pandas as pd
import numpy as np
import requests
import tweepy
import json

#Read the manually downloaded twitter enhanced archive
twitter_enhanced = pd.read_csv('twitter-archive-enhanced.csv')

2. Use the Requests library to download the tweet image prediction (image_predictions.tsv)

In [8]:
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'

response = requests.get(url)
with open('image_predictions.tsv', mode='wb') as file:
    file.write(response.content)

image_predictions = pd.read_csv('image_predictions.tsv', sep='\t')

image_predictions.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)

In [9]:
import os

#create an instance of the tweepy API client
#Authentication required to run this!!! The bearer token is read from an environment variable
#(TWITTER_BEARER_TOKEN is an assumed name) instead of being hard-coded in the notebook.
auth = tweepy.OAuth2BearerHandler(os.environ['TWITTER_BEARER_TOKEN'])
api = tweepy.API(auth)

#Determine how many sets of at most 100 ids are needed to cover the tweet_id column of the enhanced archive
pieces = np.ceil(len(twitter_enhanced.tweet_id)/100)
print(pieces) # to check number of sets

#split the tweet_ids into the above chunks to use with the lookup_statuses api.
id_chunks = np.array_split(twitter_enhanced.tweet_id, int(pieces))
chunked_list = [list(array) for array in id_chunks]

#extract tweets for each of the chunks and write them line by line to tweet_json.txt
#note: the file is opened in append mode, so re-running this cell adds duplicate lines
for chunk in chunked_list:
    try:
        tweets = api.lookup_statuses(chunk, trim_user=True)
    except tweepy.TweepyException as e:
        print(e)
        continue #skip this chunk instead of re-writing the previous chunk's tweets
    tweets_data = [json.dumps(tweet._json) for tweet in tweets]
    with open('tweet_json.txt', 'a', encoding='utf-8') as f:
        for tweet in tweets_data:
            f.write(tweet)
            f.write('\n')




24.0
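The chunking step can also be sketched in plain Python. This is a minimal illustration, assuming the lookup endpoint accepts at most 100 IDs per call; `chunk_ids` and `example_ids` are hypothetical names, and the IDs are placeholders rather than real tweet IDs.

```python
from typing import List, Sequence


def chunk_ids(ids: Sequence[int], max_size: int = 100) -> List[List[int]]:
    """Split ids into consecutive lists of at most max_size items."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    return [list(ids[start:start + max_size]) for start in range(0, len(ids), max_size)]


# 2356 is the row count that twitter_enhanced.info() reports in this notebook;
# the ids themselves are placeholders.
example_ids = list(range(2356))
chunks = chunk_ids(example_ids)
print(len(chunks), max(len(c) for c in chunks))
```

Slicing by a fixed size guarantees that no chunk exceeds the limit, whereas splitting into a rounded number of pieces only does so when the rounding goes up.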


In [10]:
# Read the tweet_json.txt file line by line into a data variable we will use to create the dataframe
with open('tweet_json.txt', 'r', encoding='utf-8') as f:
    data = [json.loads(line) for line in f]

#create a dataframe containing only the columns I need for this project
tweets_df = pd.DataFrame(data, columns=['id', 'favorite_count', 'retweet_count'])

tweets_df.sample(5)

Unnamed: 0,id,favorite_count,retweet_count
1091,730573383004487680,4453,1906
843,770293558247038976,5834,1351
384,819015331746349057,0,17723
1057,735648611367784448,3720,1005
533,808134635716833280,0,5490
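The line-by-line read can be tried on a small in-memory sample before running it on the real file. The sketch below uses two made-up JSON lines (the values and the extra `lang` field are assumptions for illustration) and skips blank lines, which would otherwise make `json.loads` fail.

```python
import io
import json

import pandas as pd

# Two made-up lines standing in for tweet_json.txt; the empty line mimics a stray newline.
sample_file = io.StringIO(
    '{"id": 1, "favorite_count": 10, "retweet_count": 3, "lang": "en"}\n'
    '\n'
    '{"id": 2, "favorite_count": 0, "retweet_count": 7, "lang": "en"}\n'
)

# Parse every non-blank line as one JSON object.
records = [json.loads(line) for line in sample_file if line.strip()]

# Passing `columns` keeps only the listed keys and drops the rest (here, "lang").
sample_df = pd.DataFrame(records, columns=['id', 'favorite_count', 'retweet_count'])
print(sample_df.shape)
```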


## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issues**. You must use **both** visual assessment and programmatic assessment to assess the data.

**Note:** pay attention to the following key points when you assess the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.



In [4]:
#Visually assess the three data sets
twitter_enhanced.head(3) #visually check the data

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,


In [5]:
twitter_enhanced.info() # check for missing values
print('duplicates', twitter_enhanced.duplicated().sum()) #check for duplicates

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [6]:
twitter_enhanced.describe() #check summary for numerical values

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


In [None]:
print(image_predictions.info())
print(image_predictions.describe())
print(image_predictions.p1_dog.value_counts()) # count the False values according to the #1 prediction
print('duplicates', image_predictions.duplicated().sum())
image_predictions.head()



In [None]:
print(tweets_df.describe())
print(tweets_df.info())
print('duplicates', tweets_df.duplicated().sum())
tweets_df.head()
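The programmatic checks above repeat the same few calls for each data set. One way to collect them is a small helper like the sketch below; `quick_assess` is a hypothetical name and the toy frame holds made-up values, so treat this as an illustration rather than part of the original assessment.

```python
import pandas as pd


def quick_assess(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, missing-value count and unique count."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'missing': df.isna().sum(),
        'unique': df.nunique(),
    })


# Made-up rows that mimic a few columns of the archive.
toy = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'rating_denominator': [10, 10, 0],
    'name': ['Phineas', 'a', None],
})

report = quick_assess(toy)
print(report)
print('duplicates', toy.duplicated().sum())
```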

### Quality issues
1. The `enhanced_twitter_archive` contains some tweets that are not dog ratings but rather replies or quoted tweets.

2. There are some retweets in the `enhanced_twitter_archive` while we only require original ratings.

3. There are some missing values for the **dog stages** in the `enhanced_twitter_archive` data.

4. The `enhanced_twitter_archive` is missing the *favorite_count* and *retweet_count* columns.

5. Some of the *numerator* scores seem either too high or too low in the `enhanced_twitter_archive`.

6. Some of the *denominator* scores seem either too high or too low in the `enhanced_twitter_archive`.

7. Based on its most confident prediction, some of the images in the `image_predictions` data are not of dogs.

8. The `enhanced_twitter_archive` has no image data.

9. The `enhanced_twitter_archive` contains rows and columns that are not relevant to the analysis and should be removed.

### Tidiness issues
1. Dog stage is one variable but is spread over four columns instead of one in the `enhanced_twitter_archive`.

2. The `image_predictions` prediction values are spread over many columns and need to be reshaped for easier analysis.

## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

In [134]:
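# Keep untouched copies of the original pieces of data (the *_original names are new);
# the cleaning steps below keep working on twitter_enhanced, image_predictions and tweets_df
twitter_enhanced_original = twitter_enhanced.copy()
image_predictions_original = image_predictions.copy()
tweets_df_original = tweets_df.copy()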


### Issue #1
The `enhanced_twitter_archive` contains some tweets that are not dog ratings but rather replies or quoted tweets.

#### Define
- Remove all tweets that are replies from the `enhanced_twitter_archive` by subsetting the data.

#### Code

In [135]:
twitter_enhanced = twitter_enhanced[twitter_enhanced.in_reply_to_status_id.isna()]

#### Test

In [136]:
twitter_enhanced.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2278 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2278 non-null   int64  
 1   in_reply_to_status_id       0 non-null      float64
 2   in_reply_to_user_id         0 non-null      float64
 3   timestamp                   2278 non-null   object 
 4   source                      2278 non-null   object 
 5   text                        2278 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2274 non-null   object 
 10  rating_numerator            2278 non-null   int64  
 11  rating_denominator          2278 non-null   int64  
 12  name                        2278 non-null   object 
 13  doggo                       2278 

### Issue #2
There are some retweets in the `enhanced_twitter_archive` while we only require original ratings.

#### Define
- Remove all retweets from the `enhanced_twitter_archive` data.

#### Code

In [137]:
twitter_enhanced = twitter_enhanced[twitter_enhanced.retweeted_status_id.isna()] # remove retweets

#remove the reply and retweet columns, which are now empty (reassign instead of inplace to avoid modifying a slice)
twitter_enhanced = twitter_enhanced.drop(columns=['in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp'])


#### Test

In [138]:
twitter_enhanced.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2097 entries, 0 to 2355
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   tweet_id            2097 non-null   int64 
 1   timestamp           2097 non-null   object
 2   source              2097 non-null   object
 3   text                2097 non-null   object
 4   expanded_urls       2094 non-null   object
 5   rating_numerator    2097 non-null   int64 
 6   rating_denominator  2097 non-null   int64 
 7   name                2097 non-null   object
 8   doggo               2097 non-null   object
 9   floofer             2097 non-null   object
 10  pupper              2097 non-null   object
 11  puppo               2097 non-null   object
dtypes: int64(3), object(9)
memory usage: 213.0+ KB
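The two filters from Issues #1 and #2 can be combined into a single boolean mask. The sketch below runs on a made-up four-row frame; `toy`, `is_original` and `originals` are illustrative names only.

```python
import numpy as np
import pandas as pd

# Made-up rows: tweet 2 is a reply, tweet 3 is a retweet, tweets 1 and 4 are originals.
toy = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'in_reply_to_status_id': [np.nan, 11.0, np.nan, np.nan],
    'retweeted_status_id': [np.nan, np.nan, 22.0, np.nan],
})

# A tweet is an original rating only if both id columns are empty.
is_original = toy['in_reply_to_status_id'].isna() & toy['retweeted_status_id'].isna()

# Filter first, then drop the columns that are now entirely empty.
originals = toy.loc[is_original].drop(columns=['in_reply_to_status_id', 'retweeted_status_id'])
print(originals['tweet_id'].tolist())
```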


### Issue #3
There are some missing values for the **dog stages** in the `enhanced_twitter_archive` data.


#### Define
###### Way one
- Extract missing dog stage values from the tweet text and put the values in a column called dog_stage.
- Drop the dog stage columns *doggo, floofer, pupper, puppo* from the `enhanced_twitter_archive`.

###### Way two
- Melt the existing *dog stage* columns into a single column.
  

#### Code

In [139]:
#Way one (left disabled): extract the stage from the tweet text
#import re
#pattern = r'(doggo|puppo|pupper|floofer)'
#twitter_enhanced.doggo.value_counts() #do-83 , po-24, pr-230, fr-10 - total 347 - to see if there is any difference
#twitter_enhanced['dog_stage'] = twitter_enhanced['text'].str.extract(pattern)
#twitter_enhanced.drop(['doggo','puppo','pupper','floofer'], axis = 1, inplace = True)

#Way two: melt the four stage columns into a single dog_stage column
stages = ['doggo','puppo','pupper','floofer']
id_vars = [x for x in twitter_enhanced.columns if x not in stages]
melted_archive = pd.melt(twitter_enhanced, id_vars=id_vars, value_vars=stages, value_name='dog_stage')
#sort descending so the lowercase stage names come before the 'None' placeholder, then keep one row per tweet
#(a tweet with more than one stage keeps only the stage that sorts first)
melted_archive = melted_archive.sort_values('dog_stage', ascending=False, ignore_index=True)
melted_archive = melted_archive.drop_duplicates('tweet_id', ignore_index=True)
melted_archive = melted_archive.drop('variable', axis=1)

twitter_enhanced = melted_archive

#### Test

In [140]:

twitter_enhanced.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2097 entries, 0 to 2096
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   tweet_id            2097 non-null   int64 
 1   timestamp           2097 non-null   object
 2   source              2097 non-null   object
 3   text                2097 non-null   object
 4   expanded_urls       2094 non-null   object
 5   rating_numerator    2097 non-null   int64 
 6   rating_denominator  2097 non-null   int64 
 7   name                2097 non-null   object
 8   dog_stage           2097 non-null   object
dtypes: int64(3), object(6)
memory usage: 147.6+ KB
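Keeping one row per tweet after the melt means a tweet tagged with two stages ends up with only one of them. A possible alternative, sketched below on made-up rows, joins every stage found for a tweet. It assumes missing stages are stored as the string `'None'`, which the non-null counts above suggest; `combine_stages` is a hypothetical helper.

```python
import pandas as pd

stages = ['doggo', 'floofer', 'pupper', 'puppo']

# Made-up rows: tweet 1 has two stages, tweet 2 has one, tweet 3 has none.
toy = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'doggo': ['doggo', 'None', 'None'],
    'floofer': ['None', 'None', 'None'],
    'pupper': ['pupper', 'pupper', 'None'],
    'puppo': ['None', 'None', 'None'],
})


def combine_stages(row):
    """Join all stages present in a row, or return None when there are none."""
    found = [row[s] for s in stages if row[s] != 'None']
    return ', '.join(found) if found else None


toy['dog_stage'] = toy[stages].apply(combine_stages, axis=1)
toy = toy.drop(columns=stages)
print(toy)
```

With this approach a missing stage becomes a real missing value instead of the text `'None'`, so `info()` reports it as null.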


### Issue #4
The `enhanced_twitter_archive` is missing the *favorite_count* and *retweet_count* columns.

#### Define
- Add the *favorite_count* and *retweet_count* from the `tweets_df` by merging the two data frames


#### Code

In [143]:
#merge the two data frames to add the favorite count and retweet count to the enhanced archive
twitter_en = twitter_enhanced.merge(tweets_df, how='left', left_on='tweet_id', right_on='id')
#twitter_enhanced.drop('id', axis=1)



In [144]:
#Remove the id column that came in with the merge - I should have cleaned up this structural issue earlier to avoid this step.
twitter_en = twitter_en.drop('id', axis=1)

In [147]:
#Remove rows with any missing value: this drops tweets without favorite and retweet counts, and also tweets without expanded_urls
twitter_en = twitter_en.dropna(how='any')

In [149]:
twitter_enhanced = twitter_en

#### Test

In [150]:
twitter_enhanced.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2086 entries, 0 to 2096
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   tweet_id            2086 non-null   int64  
 1   timestamp           2086 non-null   object 
 2   source              2086 non-null   object 
 3   text                2086 non-null   object 
 4   expanded_urls       2086 non-null   object 
 5   rating_numerator    2086 non-null   int64  
 6   rating_denominator  2086 non-null   int64  
 7   name                2086 non-null   object 
 8   dog_stage           2086 non-null   object 
 9   favorite_count      2086 non-null   float64
 10  retweet_count       2086 non-null   float64
dtypes: float64(2), int64(3), object(6)
memory usage: 195.6+ KB
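The merge can be made a little more defensive. The sketch below, on made-up frames, renames `id` before merging so no extra column has to be dropped, uses `validate` to fail fast if either side has repeated IDs, and uses `indicator` to show which tweets found a match. The frame names are illustrative.

```python
import pandas as pd

# Made-up data: tweet 2 has no entry in the counts frame.
archive = pd.DataFrame({'tweet_id': [1, 2, 3], 'rating_numerator': [12, 13, 11]})
counts = pd.DataFrame({'id': [1, 3], 'favorite_count': [100, 50], 'retweet_count': [20, 5]})

merged = archive.merge(
    counts.rename(columns={'id': 'tweet_id'}),  # align the key name up front
    how='left',
    on='tweet_id',
    validate='one_to_one',  # raises MergeError if an id repeats on either side
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)
print(merged['_merge'].value_counts())

# Keep matched tweets only, then restore integer dtypes for the counts.
matched = merged[merged['_merge'] == 'both'].drop(columns='_merge')
matched = matched.astype({'favorite_count': 'int64', 'retweet_count': 'int64'})
print(matched)
```

Because the gathering cell appends to `tweet_json.txt`, a re-run can leave repeated IDs in the counts frame; `validate='one_to_one'` would surface that as an error rather than silently duplicating rows.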


### Issue #5
Some of the *numerator scores* seem either too high or too low in the `enhanced_twitter_archive`.


#### Define
- Replace any numerator above 30 with the median numerator value to make the data more consistent and remove probable inaccurate figures.

#### Code

In [151]:
#median of the current numerators, used as the replacement value
median_numerator = twitter_enhanced.rating_numerator.median()

twitter_enhanced.loc[twitter_enhanced['rating_numerator'] > 30, 'rating_numerator'] = median_numerator

#### Test

In [152]:
twitter_enhanced.describe()

Unnamed: 0,tweet_id,rating_numerator,rating_denominator,favorite_count,retweet_count
count,2086.0,2086.0,2086.0,2086.0,2086.0
mean,7.362885e+17,10.615053,10.451103,7754.196548,2295.035954
std,6.703816e+16,2.236592,6.662487,11292.625791,4021.465309
min,6.660209e+17,0.0,2.0,66.0,11.0
25%,6.766572e+17,10.0,10.0,1708.75,509.0
50%,7.094844e+17,11.0,10.0,3519.0,1109.0
75%,7.872326e+17,12.0,10.0,9669.0,2612.0
max,8.924206e+17,27.0,170.0,144242.0,70330.0
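Before overwriting values it can help to look at which rows exceed the threshold. The sketch below uses a made-up series (only 1776 comes from the `describe()` output earlier in this notebook) and `Series.mask` as one possible way to do the replacement; the threshold of 30 follows the Define step.

```python
import pandas as pd

# Made-up numerators; 1776 is the maximum reported by describe() above.
ratings = pd.Series([12, 13, 1776, 10, 420, 11], name='rating_numerator')
threshold = 30

# Inspect the candidates first.
outliers = ratings[ratings > threshold]
print(outliers.tolist())

# Replace values above the threshold with the median of the column.
median_numerator = ratings.median()
cleaned = ratings.mask(ratings > threshold, median_numerator)
print(cleaned.tolist())
```

Note that the median here is computed with the outliers still present, as in the cell above; computing it on the rows at or below the threshold is another option.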


### Issue #6
Some of the *denominator scores* seem either too high or too low in the `enhanced_twitter_archive`.

#### Define
- Ratings on the WeRateDogs Twitter account are meant to be on a single scale, and the distribution of denominator values shows that most are 10. Denominators other than 10 would not help the analysis, so we set every denominator to 10 for consistency.

#### Code

In [153]:
twitter_enhanced['rating_denominator'] = 10

#### Test

In [154]:
twitter_enhanced.describe()

Unnamed: 0,tweet_id,rating_numerator,rating_denominator,favorite_count,retweet_count
count,2086.0,2086.0,2086.0,2086.0,2086.0
mean,7.362885e+17,10.615053,10.0,7754.196548,2295.035954
std,6.703816e+16,2.236592,0.0,11292.625791,4021.465309
min,6.660209e+17,0.0,10.0,66.0,11.0
25%,6.766572e+17,10.0,10.0,1708.75,509.0
50%,7.094844e+17,11.0,10.0,3519.0,1109.0
75%,7.872326e+17,12.0,10.0,9669.0,2612.0
max,8.924206e+17,27.0,10.0,144242.0,70330.0
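Setting every denominator to 10 leaves the numerators on their original scale. If that matters for the analysis, one possible alternative is to rescale each rating instead, as sketched below on made-up values; `rating_out_of_10` is a hypothetical column name, and rows with a denominator of 0 would need separate handling.

```python
import pandas as pd

# Made-up ratings: the second row is rated out of 70.
toy = pd.DataFrame({
    'rating_numerator': [12, 84, 11],
    'rating_denominator': [10, 70, 10],
})

# Rows whose denominator is not the usual 10.
non_standard = toy[toy['rating_denominator'] != 10]
print(len(non_standard))

# Rescale every rating to a common "out of 10" scale.
toy['rating_out_of_10'] = toy['rating_numerator'] * 10 / toy['rating_denominator']
print(toy)
```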


### Issue #7
Based on its most confident prediction, some of the images in the `image_predictions` data are not of dogs.

#### Define
- Before removing them, we will quickly check whether the images marked False really are not dogs by visually inspecting photos via the links in the jpg_url column.


#### Code

In [155]:
image_predictions.query('p1_dog == False').sample(5)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
571,678399652199309312,https://pbs.twimg.com/ext_tw_video_thumb/67839...,1,swing,0.929196,False,Bedlington_terrier,0.015047,True,Great_Pyrenees,0.014039,True
1227,745314880350101504,https://pbs.twimg.com/media/Clfj6RYWMAAFAOW.jpg,2,ice_bear,0.807762,False,great_white_shark,0.02704,False,fountain,0.022052,False
660,682406705142087680,https://pbs.twimg.com/media/CXhlRmRUMAIYoFO.jpg,1,wombat,0.709344,False,koala,0.169758,False,beaver,0.079433,False
914,700890391244103680,https://pbs.twimg.com/media/CboQFolWIAE04qE.jpg,1,white_wolf,0.166563,False,schipperke,0.122356,True,West_Highland_white_terrier,0.119247,True
1284,750506206503038976,https://pbs.twimg.com/media/CmpVaOZWIAAp3z6.jpg,1,American_black_bear,0.219166,False,lesser_panda,0.214715,False,titi,0.091685,False


- Our quick check shows that the #1 prediction is not always the most accurate. In the sampled rows where the #1 prediction is False, the other two predictions in most cases differ from it and are better at identifying dogs. We will therefore treat an image as a dog image if any of its three predictions is True, not only the strongest one.
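The rule "take the first prediction that is a dog" can be written without reshaping, by chaining `Series.where`. This is a minimal sketch on made-up rows (the breed names are borrowed from the sample above, but their combination is invented), not the method used in the cleaning cell.

```python
import pandas as pd

# Made-up predictions: tweet 1 is a dog only per p2/p3, tweet 2 per all three, tweet 3 per none.
toy = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'p1': ['white_wolf', 'redbone', 'swing'],
    'p1_dog': [False, True, False],
    'p2': ['schipperke', 'collie', 'koala'],
    'p2_dog': [True, True, False],
    'p3': ['beagle', 'malinois', 'fountain'],
    'p3_dog': [True, True, False],
})

# where(cond, other) keeps the value when cond is True and falls back to `other` otherwise,
# so the chain tries p1, then p2, then p3, and leaves NaN when no prediction is a dog.
toy['breed'] = toy['p1'].where(
    toy['p1_dog'],
    toy['p2'].where(toy['p2_dog'], toy['p3'].where(toy['p3_dog'])),
)
print(toy[['tweet_id', 'breed']])
```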

In [156]:
# reshape the table so the predictions are in long format by unpivoting the dog predictions.
melted_predictions = pd.melt(image_predictions, id_vars=['tweet_id', 'jpg_url', 'img_num','p1','p2','p3'],value_vars=['p1_dog','p2_dog','p3_dog'],var_name='predictor', value_name='is_dog')

#remove "False" predictions using query and sort by the newly created predictor column
melted_predictions = melted_predictions.query('is_dog == True').sort_values(['predictor'], ignore_index = True)

#drop duplicates to keep, for each tweet, only its strongest "True" prediction (p1_dog first, p3_dog last)
melted_predictions = melted_predictions.drop_duplicates(['tweet_id','jpg_url','img_num','p1','p2','p3'], ignore_index=True)

#unpivot the breed predictions, carrying the predictor along so the breed can be matched to it
melted_predictions = pd.melt(melted_predictions, id_vars=['tweet_id','jpg_url','img_num','predictor'], value_vars=['p1','p2','p3'], var_name='p', value_name='breed')

#keep only the breed that belongs to the "True" predictor (p2 for p2_dog, and so on)
melted_predictions = melted_predictions[melted_predictions['predictor'] == melted_predictions['p'] + '_dog']
melted_predictions = melted_predictions.drop(columns='predictor').sort_values(['tweet_id'], ignore_index = True)

image_predictions_cleaned = melted_predictions # tip - use a variable to do the transformations so you don't have to keep going back to make a copy.


#### Test

In [160]:
#test with one id where the number 1 prediction is incorrect. Not a 100 percent guarantee, but better than using only p1_dog.
test = image_predictions_cleaned[image_predictions_cleaned['tweet_id'] == 700890391244103680]
image_predictions = image_predictions_cleaned
test


Unnamed: 0,tweet_id,jpg_url,img_num,p,breed
727,700890391244103680,https://pbs.twimg.com/media/CboQFolWIAE04qE.jpg,1,p2,schipperke


### Issue #8
The `enhanced_twitter_archive` has no image data.

#### Define
- Merge the cleaned `image_predictions` data into the `enhanced_twitter_archive`.

#### Code

In [163]:
enhanced_twitter_archive = twitter_enhanced.merge(image_predictions, how='left', on='tweet_id')

enhanced_twitter_archive.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2086 entries, 0 to 2085
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   tweet_id            2086 non-null   int64  
 1   timestamp           2086 non-null   object 
 2   source              2086 non-null   object 
 3   text                2086 non-null   object 
 4   expanded_urls       2086 non-null   object 
 5   rating_numerator    2086 non-null   int64  
 6   rating_denominator  2086 non-null   int64  
 7   name                2086 non-null   object 
 8   dog_stage           2086 non-null   object 
 9   favorite_count      2086 non-null   float64
 10  retweet_count       2086 non-null   float64
 11  jpg_url             1658 non-null   object 
 12  img_num             1658 non-null   float64
 13  p                   1658 non-null   object 
 14  breed               1658 non-null   object 
dtypes: float64(3), int64(3), object(9)
memory usage: 260.8+

#### Test

## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".
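A minimal sketch of the storing step follows. It writes a one-row stand-in frame to a temporary folder so it can run anywhere; in the notebook the frame would be the merged master DataFrame and the path would simply be the file name. `master` and `out_dir` are illustrative names.

```python
import os
import tempfile

import pandas as pd

# One-row stand-in for the cleaned master DataFrame (values taken from the first archive row).
master = pd.DataFrame({'tweet_id': [892420643555336193], 'rating_numerator': [13]})

# A temporary folder keeps the sketch self-contained.
out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, 'twitter_archive_master.csv')

# index=False avoids writing the row index as an extra unnamed column.
master.to_csv(path, index=False)

# Read the file back to confirm the round trip.
reloaded = pd.read_csv(path, dtype={'tweet_id': 'int64'})
print(reloaded)
```

Leaving out `index=False` is what produces an `Unnamed: 0` column when the file is read back.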

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

### Insights:
1.

2.

3.
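As a starting point for the insights, the sketch below shows two kinds of summary that fit the columns in the master data: a per-stage aggregate and a correlation between favorites and retweets. All values are made up, so the printed numbers say nothing about the real data set.

```python
import pandas as pd

# Made-up rows using the column names from the cleaned archive.
toy = pd.DataFrame({
    'dog_stage': ['pupper', 'pupper', 'doggo', 'doggo'],
    'rating_numerator': [10, 12, 13, 11],
    'favorite_count': [100, 300, 900, 700],
    'retweet_count': [10, 30, 90, 70],
})

# Number of tweets and median favorites for each dog stage.
by_stage = toy.groupby('dog_stage')['favorite_count'].agg(['count', 'median'])
print(by_stage)

# Pearson correlation between favorites and retweets.
corr = toy['favorite_count'].corr(toy['retweet_count'])
print(round(corr, 3))
```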

### Visualization