# Project: Data Wrangling WeRateDogs

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#gathering">Gathering The Data</a></li>
<li><a href="#access">Accessing The Data</a></li>
<li><a href="#cleaning">Cleaning The Data</a></li>
<li><a href="#analizing">Analyzing and Visualizing</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id="gathering"></a>
# Gathering The Data

I import the libraries needed within each task, so it is easier to see which libraries each task requires.
<ul>
<li><a href="#first">First Data: Get Data Twitter Archive</a></li>
<li><a href="#second">Second Data: Get Data Tweet Image Prediction</a></li>
<li>Third Data: Accessing The Data
    <ul>
        <li><a href="#configure">Configure Twitter Account</a></li>
        <li><a href="#getdatathird">Get Data Twitter with API & JSON</a></li>
    </ul>
</li>
<li><a href="#conclusiongathering">Conclusion</a></li>
</ul>

<a id="first"></a>
#### 1. Get Data Twitter Archive

Todo:
1. Import the libraries needed
2. Read <b>twitter-archive-enhanced.csv</b> from the <b>data_udacity</b> folder
3. Make sure the data has been read correctly
    - print head

In [1]:
import pandas as pd

In [2]:
twitter_archive_df = pd.read_csv('data_udacity/twitter-archive-enhanced.csv')
twitter_archive_df = twitter_archive_df.sort_values('timestamp')
twitter_archive_df.head(2)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
2355,666020888022790149,,,2015-11-15 22:32:08 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a Japanese Irish Setter. Lost eye...,,,,https://twitter.com/dog_rates/status/666020888...,8,10,,,,,
2354,666029285002620928,,,2015-11-15 23:05:30 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a western brown Mitsubishi terrier. Up...,,,,https://twitter.com/dog_rates/status/666029285...,7,10,a,,,,


<a id="second"></a>
#### 2. Get Data Tweet Image Prediction

Todo:
1. Import the libraries needed
2. Read <b>image-predictions.tsv</b> from Udacity's server, which can be accessed at <i> https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv </i>
3. Make sure the data has been read correctly
    - print head
    - describe domain knowledge about the data

In [3]:
import requests

In [4]:
url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"
response = requests.get(url)

with open('data_udacity/image-predictions.tsv', mode ='wb') as file:
    file.write(response.content)

In [5]:
#Read TSV file
image_prediction_df = pd.read_csv('data_udacity/image-predictions.tsv', sep='\t' )
image_prediction_df.head(2)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True


#### The description:
- tweet_id is the last part of the tweet URL after "status/" → https://twitter.com/dog_rates/status/889531135344209921
- p1 is the algorithm's #1 prediction for the image in the tweet → golden retriever
- p1_conf is how confident the algorithm is in its #1 prediction → 95%
- p1_dog is whether or not the #1 prediction is a breed of dog → TRUE
- p2 is the algorithm's second most likely prediction → Labrador retriever
- p2_conf is how confident the algorithm is in its #2 prediction → 1%
- p2_dog is whether or not the #2 prediction is a breed of dog → TRUE
- etc.
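
To make the column layout concrete, here is a minimal sketch of how the three predictions could be collapsed into a single best dog-breed guess. The helper name `first_dog_prediction` and the second sample row are illustrative assumptions and are not part of the original notebook.

```python
import pandas as pd

# First row copied from the head() output above; the second row is a made-up
# example in which only the second prediction is a dog breed.
sample = pd.DataFrame({
    'p1': ['Welsh_springer_spaniel', 'paper_towel'],
    'p1_conf': [0.465074, 0.50],
    'p1_dog': [True, False],
    'p2': ['collie', 'redbone'],
    'p2_conf': [0.156665, 0.30],
    'p2_dog': [True, True],
    'p3': ['Shetland_sheepdog', 'doormat'],
    'p3_conf': [0.061428, 0.10],
    'p3_dog': [True, False],
})

def first_dog_prediction(row):
    """Return the highest-ranked prediction that is a dog breed, or None."""
    for rank in (1, 2, 3):
        if row['p{}_dog'.format(rank)]:
            return row['p{}'.format(rank)]
    return None

# One best guess per tweet
sample['breed'] = sample.apply(first_dog_prediction, axis=1)
print(sample['breed'].tolist())
```

Because p1, p2, and p3 are already ordered by confidence, taking the first prediction flagged as a dog is enough; no comparison of the confidence values is needed.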

<a id="configure"></a>
#### 3. Configure Twitter Account

Todo:
1. Import the libraries needed
2. Declare the Twitter configuration with consumer_key, consumer_secret, access_token, and access_secret
3. Create the API configuration

In [6]:
import tweepy

In [7]:
# for security reasons, the credentials are kept in a separate CSV file
twitter_configuration = pd.read_csv("twitter_configuration.csv")

In [8]:
try:
    auth = tweepy.OAuthHandler(twitter_configuration.consumer_key[0], twitter_configuration.consumer_secret[0])
    auth.set_access_token(twitter_configuration.access_token[0], twitter_configuration.access_secret[0])
except tweepy.TweepError as t:
    # exceptions have no .message attribute in Python 3, so print the exception itself
    print(t)
    
api = tweepy.API(auth, wait_on_rate_limit= True, wait_on_rate_limit_notify= True)

<a id="getdatathird"></a>
#### 4. Get Data Twitter with API & JSON

Todo:
1. Import the libraries needed (if not imported before)
2. Get the Twitter data as JSON by id, using the ids from the file in point 1
    - add the JSON data to a list
    - record the ids that we can't find with the API
    - count the number of ids we want to look up
    - count the number of successful and failed lookups
    - save the tweet data in a txt file so we can access it many times
3. Read the tweet data into a dataframe so we can access it in our notebook
4. Make sure the data has been read correctly
    - print head

In [None]:
import json
from timeit import default_timer as timer

In [None]:
tweets = []
ids_not_found_tweet = []
ids_fail_get_tweet = []
num_tweet_id = len(twitter_archive_df.tweet_id)
num_succes_get_data = 0
num_fail_get_data = 0

start = timer()
for tweet_id in twitter_archive_df.tweet_id:
    try:
        temp = api.get_status(tweet_id)._json
        tweets.append({'tweet_id':temp['id'],
                       'favorite_count':temp['favorite_count'],
                       'favorited':temp['favorited'],
                       'retweet_count':temp['retweet_count'],
                       'retweeted':temp['retweeted']})
        num_succes_get_data += 1
        print('{} : done, {}/{}'.format(tweet_id, num_succes_get_data, num_tweet_id))
    except tweepy.TweepError as t:
        num_fail_get_data += 1
        if (t.args[0][0]['message'] == 'No status found with that ID.'):
            ids_not_found_tweet.append(tweet_id)
        else:
            ids_fail_get_tweet.append(tweet_id)
        print('{} : {}, total fail= {}'.format(tweet_id, t, num_fail_get_data))

end = timer()
print("The time we need to get JSON file: {} second".format(end - start))

666020888022790149 : done, 1/2356
666029285002620928 : done, 2/2356
666033412701032449 : done, 3/2356
666044226329800704 : done, 4/2356
666049248165822465 : done, 5/2356
666050758794694657 : done, 6/2356
666051853826850816 : done, 7/2356
666055525042405380 : done, 8/2356
666057090499244032 : done, 9/2356
666058600524156928 : done, 10/2356
666063827256086533 : done, 11/2356
666071193221509120 : done, 12/2356
666073100786774016 : done, 13/2356
666082916733198337 : done, 14/2356
666094000022159362 : done, 15/2356
666099513787052032 : done, 16/2356
666102155909144576 : done, 17/2356
666104133288665088 : done, 18/2356
666268910803644416 : done, 19/2356
666273097616637952 : done, 20/2356
666287406224695296 : done, 21/2356
666293911632134144 : done, 22/2356
666337882303524864 : done, 23/2356
666345417576210432 : done, 24/2356
666353288456101888 : done, 25/2356
666362758909284353 : done, 26/2356
666373753744588802 : done, 27/2356
666396247373291520 : done, 28/2356
...

670374371102445568 : done, 232/2356
670385711116361728 : done, 233/2356
670403879788544000 : done, 234/2356
670408998013820928 : done, 235/2356
670411370698022913 : done, 236/2356
670417414769758208 : done, 237/2356
670420569653809152 : done, 238/2356
670421925039075328 : done, 239/2356
670427002554466305 : done, 240/2356
670428280563085312 : done, 241/2356
670433248821026816 : done, 242/2356
670434127938719744 : done, 243/2356
670435821946826752 : done, 244/2356
670442337873600512 : done, 245/2356
670444955656130560 : done, 246/2356
670449342516494336 : done, 247/2356
670452855871037440 : done, 248/2356
670465786746662913 : done, 249/2356
670468609693655041 : done, 250/2356
670474236058800128 : done, 251/2356
670668383499735048 : done, 252/2356
670676092097810432 : done, 253/2356
670679630144274432 : done, 254/2356
670691627984359425 : done, 255/2356
670704688707301377 : done, 256/2356
670717338665226240 : done, 257/2356
670727704916926465 : done, 258/2356
...

674739953134403584 : done, 460/2356
674742531037511680 : done, 461/2356
674743008475090944 : done, 462/2356
674752233200820224 : done, 463/2356
674754018082705410 : done, 464/2356
674764817387900928 : done, 465/2356
674767892831932416 : done, 466/2356
674774481756377088 : done, 467/2356
674781762103414784 : done, 468/2356
674788554665512960 : done, 469/2356
674790488185167872 : done, 470/2356
674793399141146624 : done, 471/2356
674800520222154752 : done, 472/2356
674805413498527744 : done, 473/2356
674999807681908736 : done, 474/2356
675003128568291329 : done, 475/2356
675006312288268288 : done, 476/2356
675015141583413248 : done, 477/2356
675047298674663426 : done, 478/2356
675109292475830276 : done, 479/2356
675111688094527488 : done, 480/2356
675113801096802304 : done, 481/2356
675135153782571009 : done, 482/2356
675145476954566656 : done, 483/2356
675146535592706048 : done, 484/2356
675147105808306176 : done, 485/2356
675149409102012420 : done, 486/2356
...

682406705142087680 : done, 685/2356
682429480204398592 : done, 686/2356
682638830361513985 : done, 687/2356
682662431982772225 : done, 688/2356
682697186228989953 : done, 689/2356
682750546109968385 : done, 690/2356
682788441537560576 : done, 691/2356
682808988178739200 : done, 692/2356
682962037429899265 : done, 693/2356
683030066213818368 : done, 694/2356
683078886620553216 : done, 695/2356
683098815881154561 : done, 696/2356
683111407806746624 : done, 697/2356
683142553609318400 : done, 698/2356
683357973142474752 : done, 699/2356
683391852557561860 : done, 700/2356
683449695444799489 : done, 701/2356
683462770029932544 : done, 702/2356
683481228088049664 : done, 703/2356
683498322573824003 : done, 704/2356
683515932363329536 : done, 705/2356
683742671509258241 : done, 706/2356
683773439333797890 : done, 707/2356
683828599284170753 : done, 708/2356
683834909291606017 : done, 709/2356
683849932751646720 : done, 710/2356
683852578183077888 : done, 711/2356
...

Rate limit reached. Sleeping for: 602


In [None]:
print("Success to get {} data, and fail to get {} data (no_tweet: {}, just fail: {}), from total {} data."\
      .format(num_succes_get_data, num_fail_get_data,\
             len(ids_not_found_tweet), len(ids_fail_get_tweet),\
             num_tweet_id))

In [None]:
# use a context manager so the file is flushed and closed after writing
with open('data_generated/tweets.txt', 'w', encoding="utf8") as f:
    json.dump(tweets, f, ensure_ascii=False, indent=4)
print('Successfully saved the JSON file')

In [None]:
# read json file into dataframe
with open('data_generated/tweets.txt', 'r', encoding="utf8") as f:
    data = json.load(f)

scrapped_tweet_df = pd.DataFrame(data)
scrapped_tweet_df.head(2)

<a id="conclusiongathering"></a>
###### Conclusion:
- We obtained the third dataset.
- 21 tweet IDs could not be retrieved from the Twitter API because the ID was not found; those tweets have presumably been deleted.
- The first dataset comes from a file saved in the project folder, the second from Udacity's server, and the third from the Twitter API.
    - After looking at tweet_json.txt from Udacity, I decided to keep only some columns, because the others are already stored in the first dataset or are not needed yet (such as the user column).
- Because Twitter enforces a rate limit, we need extra time (spent sleeping) to get all the data. In this project it took 2010 seconds.

<a id="access"></a>
# Accessing The Data 

For now, we have 3 datasets: <b> twitter_archive_df, image_prediction_df, and scrapped_tweet_df </b>.
<br>
Todo in accessing data:
<ol>
<li><a href="#length">Check length of data</a></li>
<li><a href="#type">Check the type of data</a></li>
<li><a href="#value">Check the value of data</a></li>
<li><a href="#missing">Check missing value of data</a></li>
<li><a href="#describe">Check stat describe data</a></li>
<li><a href="#issue">Issues Found</a></li>
</ol>

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

%matplotlib inline

<a id="length"></a>
##### 1. Check length of data

In [None]:
def print_length(name, data_frame):
    print("The length of {} is {}".format(name, len(data_frame)))

In [None]:
print_length('twitter_archive_df', twitter_archive_df)
print_length('image_prediction_df', image_prediction_df)
print_length('scrapped_tweet_df', scrapped_tweet_df)

The output shows that twitter_archive_df and scrapped_tweet_df have different lengths, because we failed to get 22 tweets from Twitter. We can delete the unmatched rows so that every table has the same length.
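
To see which rows account for a length difference like this, the IDs of two tables can be compared directly. The following is a minimal sketch on made-up stand-in tables (the names `archive` and `scraped` and the IDs are hypothetical):

```python
import pandas as pd

# Tiny stand-ins for the archive and the API result (IDs are made up)
archive = pd.DataFrame({'tweet_id': [101, 102, 103, 104]})
scraped = pd.DataFrame({'tweet_id': [101, 103]})

# Rows of the archive whose tweet_id never came back from the API
missing = archive[~archive['tweet_id'].isin(scraped['tweet_id'])]
print(missing['tweet_id'].tolist())
```

The same `isin` mask, without the negation, selects the rows that are safe to keep in both tables.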

<a id="type"></a>
##### 2. Check the type of data

In [None]:
twitter_archive_df.dtypes

In [None]:
image_prediction_df.dtypes

In [None]:
scrapped_tweet_df.dtypes

The object dtype means string. There is no problem with these types except for timestamp, which should be a datetime.
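
As a small illustration of the timestamp issue, the sketch below parses two strings in the same format as the archive's timestamp column. The variable names are illustrative only.

```python
import pandas as pd

# Timestamp strings in the same format as the archive's timestamp column
raw = pd.Series(['2015-11-15 22:32:08 +0000', '2015-11-15 23:05:30 +0000'])
print(raw.dtype)

# After parsing, date parts become available through the .dt accessor
parsed = pd.to_datetime(raw)
print(parsed.dtype)
print(parsed.dt.year.tolist(), parsed.dt.hour.tolist())
```

Once the column is a datetime, sorting, resampling, and plotting over time work on real dates rather than on text.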

<a id="value"></a>
##### 3. Check the value of data

In [None]:
twitter_archive_df.name.value_counts().head()

These are the five most frequent dog names. "None" clearly stands for missing data, and I assume that "a" is also missing data, so we must find the missing-value markers in every column and make them uniform.

In [None]:
twitter_archive_df.retweeted_status_id.value_counts().head()

We only need original tweets (not retweets), so we must drop the rows where retweeted_status_id is not missing.

In [None]:
scrapped_tweet_df.retweeted.value_counts()

retweeted indicates whether this tweet has been retweeted by the authenticating user. Because all values are False, this column is not informative.

In [None]:
twitter_archive_df.duplicated(['tweet_id']).sum()

In [None]:
twitter_archive_df.duplicated(['expanded_urls']).sum()

In [None]:
twitter_archive_df[twitter_archive_df.duplicated(['expanded_urls'])]

In [None]:
twitter_archive_df[twitter_archive_df.duplicated(['expanded_urls'])].expanded_urls.value_counts()

In [None]:
twitter_archive_df.query("expanded_urls == 'https://twitter.com/dog_rates/status/767754930266464257/photo/1'")

Some images are duplicated. We must re-check whether those rows have the same values in every column (except the id, because there are no duplicate tweet ids).

In [None]:
scrapped_tweet_df.favorited.value_counts()

retweeted and favorited each have only one value, so they carry no information and must be dropped.

In [None]:
twitter_archive_df.source.value_counts()

To make the data clearer, we need to change the values in the source column.

<a id="missing"></a>
##### 4. Check missing value of data

list function name: <br>
<a id="get_missing_value_percentage">get_missing_value_percentage</a> <br>

In [None]:
def get_missing_value_percentage(data_frame):
    data_missing = data_frame.isna()
    num_data_missing = data_missing.sum()
    num_data = len(data_frame)
    return (num_data_missing * 100)/num_data

In [None]:
get_missing_value_percentage(twitter_archive_df)

In [None]:
get_missing_value_percentage(image_prediction_df)

In [None]:
get_missing_value_percentage(scrapped_tweet_df)

twitter_archive_df has missing values in in_reply_to_status_id (96.69%), in_reply_to_user_id (96.69%), retweeted_status_id (92.32%), retweeted_status_user_id (92.32%), retweeted_status_timestamp (92.32%), and expanded_urls (2.50%). Because of their large share of missing values (>90%), the first 5 of these columns must be deleted. expanded_urls must be checked again after joining with the other tables. Neither image_prediction_df nor scrapped_tweet_df has any missing values.
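
The percentages above can also be computed in one step, since the mean of a boolean mask is the share of True values. This is a minimal sketch on made-up data; the 70% threshold is chosen only to suit the toy table and is not the notebook's 90% rule.

```python
import numpy as np
import pandas as pd

# Toy table with one sparse column and one mostly complete column
toy = pd.DataFrame({
    'tweet_id': ['1', '2', '3', '4'],
    'in_reply_to_status_id': [np.nan, np.nan, np.nan, 7.0],
    'expanded_urls': ['u1', 'u2', np.nan, 'u4'],
})

# Share of missing cells per column, expressed as a percentage
missing_pct = toy.isna().mean() * 100
print(missing_pct)

# Columns that exceed the (toy) threshold and would be dropped
too_sparse = missing_pct[missing_pct > 70].index.tolist()
print(too_sparse)
```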

<a id="describe"></a>
##### 5. Check stat describe data

In [None]:
twitter_archive_df.describe()

In [None]:
image_prediction_df.describe()

In [None]:
scrapped_tweet_df.describe()

tweet_id should not be described as a numeric variable with summary statistics; it is more suitable as a string.
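
One way to avoid treating the ID as a number at all is to read it as text from the start. The sketch below assumes a CSV with a `tweet_id` column and uses an in-memory file in place of the real one. It also shows a practical risk of numeric IDs: if the column is ever converted to float (pandas does this to a plain integer column that picks up missing values), an 18-digit ID can no longer be stored exactly.

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file with an 18-digit tweet ID
csv_text = "tweet_id,rating_numerator\n666020888022790149,8\n"

# Read the ID as text so it is never treated as a quantity
df = pd.read_csv(io.StringIO(csv_text), dtype={'tweet_id': str})
print(df['tweet_id'].iloc[0])

# A 64-bit float cannot hold this ID exactly, so a round trip changes it
as_int = 666020888022790149
print(int(float(as_int)) == as_int)
```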

<a id="issue"></a>
### Issues Found:

quality issues:
1. Some tweets are not original (retweets exist)
2. tweet_id in the third table is not in the same format as in the first table, which could cause problems when we join the two tables
3. tweet_id is not in the same position in the third table as in the other tables, so the id is not easy to see
4. timestamp in the first table is not in datetime format
5. Missing values are not uniform: sometimes NaN, sometimes None
6. Some columns have >90% missing values
7. Columns retweeted and favorited have the same value in every row
8. Column source is in HTML format
9. Columns expanded_urls and jpg_url have duplicated values

tidiness issues:
1. Dog stage should be 1 column instead of 4 columns
2. All tables need to be joined to make analysis easier


<a id="cleaning"></a>
# Cleaning and Tidying The Data

In cleaning and tidying the data, we want to make sure that the issues found above no longer exist.
<br>
Todo in cleaning data:
<ol>
<li><a href="#c0">Cleaning: Delete not original tweet</a></li>
<li><a href="#c1">Cleaning: Change Tweet id format in each table</a></li>
<li><a href="#c2">Cleaning: Change tweet_id position into first col</a></li>
<li><a href="#c3">Cleaning: Change timestamp format</a></li>
<li><a href="#c4">Cleaning: Uniformly missing value</a></li>
<li><a href="#t1">Tidying: Make dog stages into 1 column</a></li>
<li><a href="#c8">Cleaning: Delete duplicated row from expanded and jpg urls</a></li>
<li><a href="#c5">Cleaning: Delete col with missing value >90% from total rows</a></li>
<li><a href="#c6">Cleaning: Delete cols with same value</a></li>
<li><a href="#c7">Cleaning: Get source col without HTML format</a></li>
<li><a href="#t2">Tidying: Join all table</a></li>
</ol>

<a id="c0"></a>
##### 1. Delete not original tweet

In [None]:
twitter_archive_df = twitter_archive_df[pd.isna(twitter_archive_df.retweeted_status_id)]
pd.notna(twitter_archive_df['retweeted_status_id']).sum()

In [None]:
pd.notna(twitter_archive_df.retweeted_status_timestamp).sum()

In [None]:
pd.notna(twitter_archive_df.retweeted_status_user_id).sum()

Now we only have the original tweets.

<a id="c1"></a>
##### 2. Change Tweet id format in each table

list function name: <br>
<a id="convert_to_str">convert_to_str</a> <br>

In [None]:
def convert_to_str(cols):
    return cols.astype(str).infer_objects()

In [None]:
twitter_archive_df.tweet_id = convert_to_str(twitter_archive_df.tweet_id)
twitter_archive_df.head(1)

In [None]:
image_prediction_df.tweet_id = convert_to_str(image_prediction_df.tweet_id)
image_prediction_df.head(1)

In [None]:
scrapped_tweet_df.tweet_id = convert_to_str(scrapped_tweet_df.tweet_id)
scrapped_tweet_df.head(1)

<a id="c2"></a>
##### 3. Change tweet_id position into first col

In [None]:
scrapped_tweet_df = scrapped_tweet_df.reindex(\
                        ['tweet_id','favorite_count','favorited','retweet_count','retweeted'], \
                        axis=1)
scrapped_tweet_df.head(1)

<a id="c3"></a>
##### 4. Change timestamp format

In [None]:
twitter_archive_df.timestamp = pd.to_datetime(twitter_archive_df.timestamp)
twitter_archive_df.timestamp.head(1)

In [None]:
twitter_archive_df.info()

<a id="c4"></a>
##### 5. Uniformly missing value

list function name: <br>
<a id="uniformly_missing_value">uniformly_missing_value</a> <br>

In [None]:
def uniformly_missing_value(data_frame):
    # placeholder strings that should be treated as missing values
    missing_value_names = ['NaN','None', 'N/A', 'NA', 'Unknown']
    # replace every placeholder in every column with np.nan in a single call
    # (the nested loops repeated the same replacement and relied on chained inplace updates)
    return data_frame.replace(to_replace=missing_value_names, value=np.nan)

In [None]:
twitter_archive_df = uniformly_missing_value(twitter_archive_df)
twitter_archive_df.info()

<b> From the information above, we find that the columns doggo, floofer, ..., puppo have a lot of missing values, but that data is untidy; it should be 1 column. </b>

In [None]:
image_prediction_df = uniformly_missing_value(image_prediction_df)
image_prediction_df.info()

In [None]:
scrapped_tweet_df = uniformly_missing_value(scrapped_tweet_df)
scrapped_tweet_df.info()

<a id="t1"></a>
##### 6. Make dog stages into 1 column

Todo:
<ol>
    <li><a href="#a1">Validate that each row has at most one stage value</a></li>
    <li><a href="#a2">Add a new column to save the dog stage</a></li>
    <li><a href="#a3">Fill in the dog stage values</a></li>
    <li><a href="#a4">Remove columns no longer needed</a></li>
</ol>

<a id="a1"></a>
1. Validate that each row has at most one stage value

list function name: <br>
<a id="is_not_nan">is_not_nan</a> <br>

In [None]:
def is_not_nan(data_frame, index:int, col:str):
    cell = data_frame.iloc[index,data_frame.columns.get_loc(col)]
    return pd.notna(cell)

In [None]:
twitter_archive_df['validation'] = 0
num_rows = len(twitter_archive_df)

for i in range(num_rows):
    validation_value = twitter_archive_df.iloc[i,twitter_archive_df.columns.get_loc('validation')]
    
    twitter_archive_df.iloc[i,twitter_archive_df.columns.get_loc('validation')] = \
        validation_value + \
        is_not_nan(twitter_archive_df, i, 'doggo') + \
        is_not_nan(twitter_archive_df, i, 'floofer') + \
        is_not_nan(twitter_archive_df, i, 'pupper') + \
        is_not_nan(twitter_archive_df, i, 'puppo')
    
twitter_archive_df['validation'].value_counts()

From the value_counts above we find 12 rows that are not valid because they have 2 dog stages. Let's look at the data.

In [None]:
twitter_archive_df.query("validation > 1 ").head(2)

I don't know which stage is correct, and there are only 14 such rows (0.5% of all rows), so I decided to delete the rows with an invalid stage.

In [None]:
twitter_archive_df = twitter_archive_df[twitter_archive_df.validation <= 1]
twitter_archive_df['validation'].value_counts()

<a id="a2"></a>
2. Add a new column to save the dog stage

In [None]:
twitter_archive_df['dog_stage'] = np.nan
twitter_archive_df.head(2)

<a id="a3"></a>
3. Change value column dog_stage

list used function: <br>
<a href='#is_not_nan'>is_not_nan</a> <br>

In [None]:
num_rows = len(twitter_archive_df)

for i in range(num_rows):
    result = twitter_archive_df.iloc[i,twitter_archive_df.columns.get_loc('dog_stage')]
    
    if(is_not_nan(twitter_archive_df, i, 'doggo')):
        result = twitter_archive_df.iloc[i,twitter_archive_df.columns.get_loc('doggo')]
    elif(is_not_nan(twitter_archive_df, i, 'floofer')):
        result = twitter_archive_df.iloc[i,twitter_archive_df.columns.get_loc('floofer')]
    elif(is_not_nan(twitter_archive_df, i, 'pupper')):
        result = twitter_archive_df.iloc[i,twitter_archive_df.columns.get_loc('pupper')]
    elif(is_not_nan(twitter_archive_df, i, 'puppo')):
        result = twitter_archive_df.iloc[i,twitter_archive_df.columns.get_loc('puppo')]
        
    twitter_archive_df.iloc[i,twitter_archive_df.columns.get_loc('dog_stage')] = result

twitter_archive_df.dog_stage.value_counts()

<a id="a4"></a>
4. Remove cols not needed

In [None]:
twitter_archive_df.columns

Because we added the column dog_stage, we no longer need the columns 'doggo', 'floofer', 'pupper', 'puppo', nor the column validation.

In [None]:
twitter_archive_df.drop(['doggo', 'floofer', 'pupper', 'puppo','validation'], axis=1, inplace=True)
twitter_archive_df.columns

In [None]:
get_missing_value_percentage(twitter_archive_df)

The share of missing values in dog_stage is quite high, but I think this variable holds useful information, so I decided not to delete it.
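
The four steps above (count the stages per row, discard conflicting rows, fill dog_stage, drop the old columns) can also be written without row-by-row loops. The following is a sketch on made-up rows, not the notebook's own method; it assumes the missing markers have already been converted to NaN.

```python
import numpy as np
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Toy rows: no stage, one stage, and two conflicting stages
toy = pd.DataFrame({
    'doggo':   [np.nan, np.nan, 'doggo'],
    'floofer': [np.nan, np.nan, np.nan],
    'pupper':  [np.nan, 'pupper', 'pupper'],
    'puppo':   [np.nan, np.nan, np.nan],
}, dtype=object)

# Number of stages filled in per row (the "validation" count)
n_stages = toy[stage_cols].notna().sum(axis=1)

# Keep only rows with at most one stage
valid = toy[n_stages <= 1].copy()

# Back-fill across the columns so the first column holds the first
# non-missing stage of each row (or NaN when there is none)
valid['dog_stage'] = valid[stage_cols].bfill(axis=1).iloc[:, 0]
valid = valid.drop(columns=stage_cols)
print(valid)
```

On a few thousand rows the loop version is fast enough, so this is mainly a matter of readability.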

<a id="c8"></a>
##### 7. Ensure unique tweets by expanded_urls and jpg_url

1. Expanded URLS

In [None]:
twitter_archive_df.duplicated(['expanded_urls']).value_counts()

In [None]:
twitter_archive_df[twitter_archive_df.duplicated(['expanded_urls'])].expanded_urls.value_counts()

In [None]:
twitter_archive_df.query('expanded_urls == "https://vine.co/v/ea0OwvPTx9l"')

I don't know the real value for that entry, and because it appears in only a few rows (2), I decided to delete it. I also delete the other duplicated rows, because their expanded_urls value is missing.

In [None]:
twitter_archive_df.dropna(subset=['expanded_urls'], how='all', inplace = True)
twitter_archive_df.duplicated(['expanded_urls']).value_counts()

In [None]:
twitter_archive_df.drop_duplicates(subset=['expanded_urls'], keep=False, inplace = True)
twitter_archive_df.duplicated(['expanded_urls']).value_counts()
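
The `keep=False` argument used above is what removes every copy of a duplicated URL instead of leaving one behind. A minimal sketch on made-up rows shows the difference from the default:

```python
import pandas as pd

# Toy table in which url_b appears twice
toy = pd.DataFrame({
    'tweet_id': ['1', '2', '3', '4'],
    'expanded_urls': ['url_a', 'url_b', 'url_b', 'url_c'],
})

# Default keep='first' retains one copy of each duplicated URL
kept_first = toy.drop_duplicates(subset=['expanded_urls'])

# keep=False removes every row whose URL occurs more than once
kept_none = toy.drop_duplicates(subset=['expanded_urls'], keep=False)

print(kept_first['tweet_id'].tolist())
print(kept_none['tweet_id'].tolist())
```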

2. JPG URLS

In [None]:
image_prediction_df.duplicated(['jpg_url']).value_counts()

In [None]:
image_prediction_df[image_prediction_df.duplicated(['jpg_url'])].jpg_url.head(5)

In [None]:
image_prediction_df.query("jpg_url == 'https://pbs.twimg.com/media/CU1zsMSUAAAS0qW.jpg'")

In [None]:
twitter_archive_df.query("tweet_id == '670319130621435904'").expanded_urls

In [None]:
twitter_archive_df.query("tweet_id == '759159934323924993'").expanded_urls

From the observation above, I find that one of the tweet ids with a duplicated jpg url doesn't exist in the first table. So I will eliminate the tweet ids that don't exist in the first table.

In [None]:
jpg_url_duplicated = image_prediction_df[image_prediction_df.duplicated(['jpg_url'])]['jpg_url']

In [None]:
need_to_drop = image_prediction_df[image_prediction_df.jpg_url.isin(jpg_url_duplicated)]
need_to_drop.head(2)

In [None]:
# count the rows we must keep (their id exists in the first table)
need_to_drop['tweet_id'].isin(twitter_archive_df.tweet_id).sum()

In [None]:
# keep in need_to_drop only the rows whose id does not exist in the first table
need_to_drop = need_to_drop[~need_to_drop['tweet_id'].isin(twitter_archive_df.tweet_id)]
need_to_drop['tweet_id'].isin(twitter_archive_df.tweet_id).sum()

In [None]:
# drop the rows whose id does not exist in the first table
image_prediction_df = image_prediction_df[~image_prediction_df.tweet_id.isin(need_to_drop.tweet_id)]
image_prediction_df.duplicated(['jpg_url']).value_counts()

<a id="c5"></a>
##### 8. Delete col with missing value >90% from total rows

list function name: <br>
<a id="drop_missing_value">drop_missing_value</a> <br>

list used function: <br>
<a href='#get_missing_value_percentage'>get_missing_value_percentage</a> <br>

In [None]:
def drop_missing_value(data_frame, threshold: float = 90):
    # percentage of missing values per column
    data = get_missing_value_percentage(data_frame)
    cols_will_drop = []
    
    for col,percentage_missing_value in data.items():
        # compare against the threshold argument (in percent) instead of a hard-coded 90
        if percentage_missing_value > threshold:
            cols_will_drop.append(col)
    
    return data_frame.drop(cols_will_drop, axis = 1)

In [None]:
twitter_archive_df = drop_missing_value(twitter_archive_df)
get_missing_value_percentage(twitter_archive_df)

In [None]:
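# drop_missing_value returns a new frame, so the result must be assigned
image_prediction_df = drop_missing_value(image_prediction_df)
get_missing_value_percentage(image_prediction_df)

In [None]:
scrapped_tweet_df = drop_missing_value(scrapped_tweet_df)
get_missing_value_percentage(scrapped_tweet_df)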

<a id="c6"></a>
##### 9. Delete cols with same value

list function name: <br>
<a id="drop_uniform_value">drop_uniform_value</a> <br>

In [None]:
def drop_uniform_value(data_frame):
    cols = data_frame.columns
    cols_will_drop = []
    
    for col in cols:
        num_value = len(data_frame[col].unique().tolist())
        if(num_value <= 1):
            cols_will_drop.append(col)
    return data_frame.drop(cols_will_drop, axis = 1)

In [None]:
# first data before
twitter_archive_df.nunique()

In [None]:
# second data before
image_prediction_df.nunique()

In [None]:
# third data before
scrapped_tweet_df.nunique()

Because the only table with single-valued columns is the third one, we change only the third table.

In [None]:
# third data after
scrapped_tweet_df = drop_uniform_value(scrapped_tweet_df)
scrapped_tweet_df.nunique()

<a id="c7"></a>
##### 10. Get source col without HTML format

In [None]:
import re

list function name: <br>
<a id="get_name_in_source">get_name_in_source</a> <br>

In [None]:
def get_name_in_source(col_source):
    return str(re.findall("<a.*?>(.+?)</a>", col_source)[0])

In [None]:
for index in range(len(twitter_archive_df)): 
    value = twitter_archive_df.iloc[index,twitter_archive_df.columns.get_loc('source')]
    twitter_archive_df.iloc[index,twitter_archive_df.columns.get_loc('source')] = get_name_in_source(value)
                            
twitter_archive_df.source.value_counts()
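
The same extraction can be done on the whole column at once with pandas string methods. This is a sketch only: the two source strings are assumed examples, because the archive output earlier shows the real values only in truncated form.

```python
import pandas as pd

# Assumed example values in the same anchor-tag shape as the source column
source = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://example.com/client" rel="nofollow">Example Client</a>',
])

# Capture the text between the opening and closing anchor tags,
# using the same pattern as get_name_in_source
names = source.str.extract(r'<a.*?>(.+?)</a>', expand=False)
print(names.tolist())
```

Unlike indexing `re.findall(...)[0]`, `str.extract` returns NaN for a value that does not match the pattern instead of raising an error.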

<a id="t2"></a>
##### 11. Join all table

In [None]:
# join the first table (archive) with the third table (scraped tweets)
twitter_df = pd.merge(twitter_archive_df, scrapped_tweet_df, how = 'left', on = ['tweet_id'])

# join the result with the second table (image predictions)
twitter_df = pd.merge(twitter_df, image_prediction_df, how = 'left', on = ['tweet_id'])

# check the result
twitter_df.info()

list used function: <br>
<a href='#get_missing_value_percentage'>get_missing_value_percentage</a> <br>

In [None]:
# check the percentage of missing values
get_missing_value_percentage(twitter_df)

Because the share of missing values is so small, I decided to delete the rows with missing values.

In [None]:
twitter_df.dropna(axis=0, how='any', inplace=True)
get_missing_value_percentage(twitter_df)
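
To illustrate what the left joins and the final `dropna` do together, here is a minimal sketch on made-up tables. The `indicator=True` option is an addition for inspection and is not used in the notebook.

```python
import pandas as pd

# Made-up stand-ins for the archive and the scraped counts
archive = pd.DataFrame({'tweet_id': ['1', '2', '3'], 'rating_numerator': [8, 7, 12]})
counts = pd.DataFrame({'tweet_id': ['1', '3'], 'favorite_count': [100, 250]})

# A left join keeps every archive row; unmatched rows get NaN in the new columns
merged = pd.merge(archive, counts, how='left', on='tweet_id', indicator=True)
print(merged)

# Rows with no match in the right-hand table
unmatched = merged.loc[merged['_merge'] == 'left_only', 'tweet_id'].tolist()
print(unmatched)

# Dropping incomplete rows leaves only tweets present in both tables
complete = merged.drop(columns='_merge').dropna(axis=0, how='any')
print(len(complete))
```

Note that `how='any'` drops a row if any column is missing, so columns with many legitimate gaps also remove rows at this step.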

In [None]:
twitter_df.head(2)

In [None]:
# save csv
twitter_df.to_csv("data_generated/twitter_archive_master.csv", index=False)

In [None]:
# save each of data
twitter_archive_df.to_csv("data_generated/first_data_twitter_archive.csv", index=False)
image_prediction_df.to_csv("data_generated/second_data_image_prediction.csv", index=False)
scrapped_tweet_df.to_csv("data_generated/third_data_scrapped_tweet.csv", index=False)

<a id="analizing"></a>
# Analyzing and Visualizing Data 

Questions:
1. Are there any outliers in the data?
2. What is the correlation between variables?
3. Do the retweet count and favorite count increase with time?
4. Does the rating increase with time?
5. Is the rating related to the favorite and retweet counts?
6. How often does each prediction classify the picture as a dog?
7. What are the most popular dog names?

- Summary of the data
- Correlation
- Are certain stages of dogs posted more often than others?
- Retweets and favorites by stages
- How does the rating affect the number of retweet counts?
- How did the retweet count and favourite count improve over time?
- How well does the model perform?
- What are the most popular dog names?
- What types of dogs are there?

In [None]:
# for the analysis, add a new column "rating", calculated as numerator/denominator
twitter_df['rating'] = pd.to_numeric((twitter_df.rating_numerator*1.0)/(twitter_df.rating_denominator*1.0))
twitter_df.info()

In [None]:
# statistic description
twitter_df.describe()

In [None]:
# visualization
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)
sns.boxplot(data=twitter_df[['rating_numerator','rating_denominator','rating','favorite_count',\
                                  'retweet_count','img_num','p1_conf','p2_conf','p3_conf']],\
                 orient="h", palette="Set2", ax=ax);

In [None]:
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)
sns.boxplot(data=twitter_df[['rating_numerator','rating_denominator','rating',\
                                  'img_num','p1_conf','p2_conf','p3_conf']],\
                 orient="h", palette="Set2", ax=ax);

In [None]:
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)
sns.boxplot(data=twitter_df[['img_num','p1_conf','p2_conf','p3_conf']],\
                 orient="h", palette="Set2", ax=ax);

In [None]:
fig, ax = plt.subplots()
fig.set_size_inches(14, 6)
sns.boxplot(data=twitter_df['img_num'],\
                 orient="h", palette="Set2", ax=ax);

<a id="q1"></a>
##### 1. Are there any outliers in the data?
<br>
<b> Answer:</b> 
<br>
- Among the numeric columns, all have outliers except p1_conf. As the information from Udacity notes, some numerators are larger than their denominators, so the rating can exceed 1 (since rating = numerator/denominator, the value would normally lie between 0 and 1).
<br>
- From the summary statistics, the range between min and max is large for rating_numerator, rating_denominator, favorite_count, retweet_count, and rating. For the rating variables, however, Q3 is not far from the other quartiles, so the maximum values of those variables are clearly outliers.
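
The reasoning above (quartiles close together, maximum far away) is the same rule a box plot applies by default when it draws points beyond the whiskers. A minimal sketch of that rule on made-up ratings:

```python
import pandas as pd

# Made-up ratings: most sit near 1.0, one is far larger
rating = pd.Series([0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 42.0])

q1 = rating.quantile(0.25)
q3 = rating.quantile(0.75)
iqr = q3 - q1

# Whisker rule: anything more than 1.5 * IQR beyond the quartiles is flagged
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = rating[(rating < lower) | (rating > upper)]
print(outliers.tolist())
```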

In [None]:
# correlation
fig, ax = plt.subplots()
fig.set_size_inches(11.7, 8.27)
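# restrict to numeric and boolean columns so corr() does not fail on text columns in recent pandas
sns.heatmap(twitter_df.select_dtypes(include=['number', 'bool']).corr(), annot=True, ax=ax)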

<a id="q2"></a>
##### 2. What is the correlation between variables?
<br>
<b> Answer:</b>
<br><i>Note: The correlation value ranges from -1 to 1. The sign only shows the direction of the correlation; the closer the value is to 0, the weaker the correlation. The plot uses Pearson correlation, so it only captures the linear relationship between each pair of variables.
<br></i>
<br>
- In this plot, please ignore the correlation between rating and rating_numerator or rating_denominator: it is expected to be strong, because rating is calculated from both of them. Surprisingly, though, the correlation between rating and rating_denominator is small. An explanation can be found in the summary statistics, which show that rating_numerator is more varied than rating_denominator (the std of rating_denominator is higher than that of rating_numerator, but their quantiles are similar).
<br>
- There is a high positive correlation between favorite_count and retweet_count. This means that the more a tweet is favorited, the more it is retweeted.
<br>
- The correlations among the confidence variables are also quite high. When p1_conf is high, the confidence in p2 and p3 decreases, but when the confidence in p3 increases, the confidence in p2 increases slightly.
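
For readers who want to see what the heatmap values mean, the sketch below computes a Pearson correlation by hand and compares it with pandas. The counts are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up counts in which retweets rise with favorites
favorite_count = pd.Series([100, 200, 300, 400, 500])
retweet_count = pd.Series([30, 55, 95, 120, 160])

# Pearson r: sum of products of deviations, scaled by both spreads
x = favorite_count - favorite_count.mean()
y = retweet_count - retweet_count.mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# pandas uses Pearson correlation by default
r_pandas = favorite_count.corr(retweet_count)
print(round(r_manual, 4), round(r_pandas, 4))
```

A value close to 1 here reflects an almost perfectly linear relationship; a strong but non-linear relationship could still produce a much smaller value.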