# Data Wrangling Project

## Table of Contents

* <a href="#introduction">1. Introduction</a>
* <a href="#data_wrangling">2. Data Wrangling</a>
    - <a href="#data_wrangling_gather">2.1 Gathering</a>
    - <a href="#data_wrangling_assess">2.2 Assessment</a>
    - <a href="#data_wrangling_clean">2.3 Cleaning</a>
* <a href="#store_clean_data">3. Storing Cleaned Data</a>
* <a href="#visualizing_data">4. Visualizing Data</a>

<a id="introduction"></a>
### 1.0 Introduction

The goal of this project is to wrangle the **@WeRateDogs** Twitter data so that it can support trustworthy analysis.

This project covers all the steps of the data wrangling process *(Gather, Assess, Clean)*. Initially, we are provided with Twitter archive data. This data needs to be assessed, and additional data gathered where needed. All of this data then needs to be cleaned so that meaningful insights can be derived from it.

<a id="data_wrangling"></a>
### 2.0 Data Wrangling

Data wrangling is one of the key steps in data analysis and is often said to take up 80% or more of a data analyst's time. Real-world data is often dirty and unstructured, which makes analysis harder. Fortunately, Python and libraries such as Pandas and NumPy make the data wrangling process faster and smoother.

At a high level, data wrangling consists of three steps:

* Gather
* Assess
* Clean

Let's dive deeper into each of these steps for the **@WeRateDogs** data to get meaningful insights.

<a id='data_wrangling_gather'></a>
### 2.1 Gathering

In this project, we are initially provided with the Twitter archive data (*twitter-archive-enhanced.csv*). This archive contains only tweets that have ratings.

In [2]:
# Import all libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as pyplt
import json
import tweepy
import os
import requests
from pandas.io.json import json_normalize

%matplotlib inline

#### Parse the given twitter archive enhanced data

In [8]:
''' Read the Twitter Archived Data '''
df_tweet_archive = pd.read_csv('./data/provided_data/twitter-archive-enhanced.csv')
df_tweet_archive.head(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


Twitter Archive Enhanced Field Details:

* `tweet_id`: ID of each tweet
* `in_reply_to_status_id`: If the represented Tweet is a reply, this field will contain the integer representation of the original Tweet’s ID
* `in_reply_to_user_id`: If the represented Tweet is a reply, this field will contain the integer representation of the original Tweet’s author ID. This will not necessarily always be the user directly mentioned in the Tweet.
* `timestamp`: Tweet creation time
* `source`: Source of the Tweet, e.g., iPhone, Vine
* `text`: Original text of the Tweet
* `retweeted_status_id`: ID of the retweeted (original) Tweet, if this Tweet is a retweet
* `retweeted_status_user_id`: ID of the retweeted Tweet's author, if this Tweet is a retweet
* `retweeted_status_timestamp`: Creation time of the retweeted Tweet, if this Tweet is a retweet
* `expanded_urls`: Tweet URL
* `rating_numerator`: Dog rating numerator
* `rating_denominator`: Dog rating denominator. It is expected to be 10.
* `name`: Name of the dog
* `doggo`: Stage of the dog
* `floofer`: Stage of the dog
* `pupper`: Stage of the dog
* `puppo`: Stage of the dog

In [10]:
df_tweet_archive.shape

(2356, 17)

#### Get additional details on the tweets via Twitter API Call

The data above is missing some key information, such as the retweet count and favorite count of each tweet. This additional data can be gathered using the Twitter API. In this project, the `tweepy` library is used to get the tweet details.

Although we can get each individual tweet by using the `get_status` API call, that would require 2356 API calls. Instead, we can get tweets in bulk using the `statuses_lookup` API call, which can return up to 100 tweets per call. We also have to make sure `tweet_mode` is set to `extended`, so that the tweet text is not truncated.
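As a rough check on the number of calls, splitting 2356 IDs into batches of at most 100 gives 24 requests instead of 2356. The batching step on its own can be sketched as follows, independent of `tweepy`. The helper name `split_into_batches` and the stand-in IDs are hypothetical and only serve the illustration:

```python
def split_into_batches(ids, batch_size=100):
    """Split a list of IDs into consecutive batches of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError('batch_size must be positive')
    return [ids[start:start + batch_size] for start in range(0, len(ids), batch_size)]


# Stand-in IDs; the real list would come from df_tweet_archive.tweet_id
example_ids = list(range(2356))
batches = split_into_batches(example_ids)

print(len(batches))      # number of API calls needed
print(len(batches[-1]))  # size of the final, partial batch
```

Stepping through the list with `range(0, len(ids), batch_size)` avoids computing a loop bound by hand and never produces an empty batch.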

In [23]:
def initialize_twitter_secrets():
    ''' Load the Twitter API secrets needed for further API calls from a JSON file '''
    with open('twitter_secrets.txt', 'r') as content_file:
        return json.load(content_file)

In [26]:
def get_twitter_api_handler(twitter_secrets):
    ''' Authenticate the application with Twitter and return an API object for further API calls '''
    auth = tweepy.OAuthHandler(twitter_secrets['consumer_api_key'], twitter_secrets['consumer_api_secret'])
    auth.set_access_token(twitter_secrets['access_token'], twitter_secrets['access_token_secret'])
    # Wait (and print a notice) when the rate limit is hit instead of failing
    api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
    return api

In [29]:
def get_tweet_details_for_given_list(tweet_list=None, tweet_api=None, split_size=100, file_name=None):
    ''' Get the details of every tweet in tweet_list and store them in a file, one JSON object per line '''
    try:
        # The lookup call accepts at most 100 tweet IDs per request
        if split_size > 100:
            print('Twitter API can handle at most 100 tweets per call. Switching split size to 100')
            split_size = 100

        # Check for incorrect split size
        if split_size <= 0:
            print('Incorrect split size')
            return -1

        # Check if tweet list is empty
        if tweet_list is None or len(tweet_list) == 0:
            print('tweet list is empty')
            return -1

        # Split the tweet list into chunks of split_size IDs and look up each chunk
        tweets_json_list = []
        for start_index in range(0, len(tweet_list), split_size):
            sub_array = tweet_list[start_index:start_index + split_size]

            # tweet_mode='extended' returns the full, untruncated tweet text
            tweets = tweet_api.statuses_lookup(id_=sub_array, tweet_mode='extended')

            # Store the raw JSON of every tweet returned
            for tweet in tweets:
                tweets_json_list.append(tweet._json)

        # Write one JSON document per line
        file_name = os.path.join("./data", "collected_data", '{0}.txt'.format(file_name))
        with open(file_name, 'wb') as tf:
            for tweet in tweets_json_list:
                jsonstr = (json.dumps(tweet, separators=(',', ': ')) + '\n').encode('UTF-8')
                tf.write(jsonstr)
        return 0
    except Exception as error:
        print('Error in getting tweets via API: {0}'.format(error))
        return -1

In [30]:
# Initialize secrets
twitter_secrets = initialize_twitter_secrets()
# Get Twitter API handler
twitter_api = get_twitter_api_handler(twitter_secrets)
# Get all tweets and store them in a file.
# Note: the function appends '.txt' to file_name, so the output file is all_tweet_details.txt.txt
get_tweet_details_for_given_list(tweet_list=df_tweet_archive.tweet_id.values.tolist(),
                                 file_name='all_tweet_details.txt', split_size=80, tweet_api=twitter_api)

In [34]:
# Check that the file stores the data as JSON, one tweet per line
with open('./data/collected_data/all_tweet_details.txt.txt', 'r') as tweet_file:
    first_line = tweet_file.readline()
first_line.encode('UTF-8')

b'{"created_at": "Sat Jul 15 23:25:31 +0000 2017","id": 886366144734445568,"id_str": "886366144734445568","full_text": "This is Roscoe. Another pupper fallen victim to spontaneous tongue ejections. Get the BlepiPen immediate. 12/10 deep breaths Roscoe https://t.co/RGE08MIJox","truncated": false,"display_text_range": [0,131],"entities": {"hashtags": [],"symbols": [],"user_mentions": [],"urls": [],"media": [{"id": 886366138128449536,"id_str": "886366138128449536","indices": [132,155],"media_url": "http://pbs.twimg.com/media/DE0BTnQUwAApKEH.jpg","media_url_https": "https://pbs.twimg.com/media/DE0BTnQUwAApKEH.jpg","url": "https://t.co/RGE08MIJox","display_url": "pic.twitter.com/RGE08MIJox","expanded_url": "https://twitter.com/dog_rates/status/886366144734445568/photo/1","type": "photo","sizes": {"thumb": {"w": 150,"h": 150,"resize": "crop"},"small": {"w": 510,"h": 680,"resize": "fit"},"medium": {"w": 901,"h": 1200,"resize": "fit"},"large": {"w": 1201,"h": 1600,"resize": "fit"}}}]},"extende

Twitter Object Field Details:

* Details of the full Tweet object can be found [here](https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object)


In [39]:
df_tweet_details =  pd.read_json(path_or_buf='./data/collected_data/all_tweet_details.txt.txt', \
                                 encoding='utf-8', orient='records', lines=True)
df_tweet_details.head(5)

Unnamed: 0,contributors,coordinates,created_at,display_text_range,entities,extended_entities,favorite_count,favorited,full_text,geo,...,quoted_status,quoted_status_id,quoted_status_id_str,quoted_status_permalink,retweet_count,retweeted,retweeted_status,source,truncated,user
0,,,2017-07-15 23:25:31,"[0, 131]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 886366138128449536, 'id_str'...",20898,False,This is Roscoe. Another pupper fallen victim t...,,...,,,,,3152,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
1,,,2017-06-21 19:36:23,"[0, 122]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 876850756556607488, 'id_str'...",0,False,RT @rachel2195: @dog_rates the boyfriend and h...,,...,,,,,80,False,{'created_at': 'Mon Jun 19 17:14:49 +0000 2017...,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
2,,,2017-07-20 16:49:33,"[0, 127]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 888078426338406400, 'id_str'...",21467,False,This is Gerald. He was just told he didn't get...,,...,,,,,3447,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
3,,,2017-07-30 15:58:51,"[0, 79]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891689552724799489, 'id_str'...",41575,False,This is Darla. She commenced a snooze mid meal...,,...,,,,,8511,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
4,,,2017-06-27 00:10:17,"[0, 90]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 879492035853660161, 'id_str'...",23125,False,This is Bailey. He thinks you should measure e...,,...,,,,,3145,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."


In [40]:
df_tweet_details.columns

Index(['contributors', 'coordinates', 'created_at', 'display_text_range',
       'entities', 'extended_entities', 'favorite_count', 'favorited',
       'full_text', 'geo', 'id', 'id_str', 'in_reply_to_screen_name',
       'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str', 'is_quote_status',
       'lang', 'place', 'possibly_sensitive', 'quoted_status',
       'quoted_status_id', 'quoted_status_id_str', 'quoted_status_permalink',
       'retweet_count', 'retweeted', 'retweeted_status', 'source', 'truncated',
       'user'],
      dtype='object')
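The collected file holds one JSON object per line, and only a few of its 31 fields are needed for the analysis. As a sketch of an alternative way to read such a file, the lines can be parsed with the `json` module and the fields selected explicitly, which also keeps `id_str` as a string. The helper `read_tweet_lines`, the chosen fields, and the two abridged sample lines are assumptions made for this illustration, not part of the notebook:

```python
import io
import json

import pandas as pd


def read_tweet_lines(handle, fields=('id_str', 'retweet_count', 'favorite_count')):
    """Parse one JSON object per line and keep only the requested fields."""
    rows = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        rows.append({field: tweet.get(field) for field in fields})
    return pd.DataFrame(rows, columns=list(fields))


# Two abridged sample lines in the same one-object-per-line layout
sample = io.StringIO(
    '{"id_str": "886366144734445568", "retweet_count": 3152, "favorite_count": 20898}\n'
    '{"id_str": "891689557279858688", "retweet_count": 8511, "favorite_count": 41575}\n'
)
df_counts = read_tweet_lines(sample)
print(df_counts)
```

With a real file, the `io.StringIO` object would be replaced by an open file handle.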

#### Get Image Predictions file via Requests Library

Additional data for dog breed prediction is provided and is available here: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv

We can use the `requests` library to download this tab-separated file, as shown below:

In [35]:
def download_file_from_url(file_url=None, file_name=None):
    ''' Download the file at file_url and save it as file_name. Returns 0 on success, -1 on failure '''
    try:
        req = requests.get(file_url)
        # Treat HTTP error codes (4xx/5xx) as failures instead of saving an error page
        req.raise_for_status()
        with open(file_name, 'wb') as fs:
            fs.write(req.content)
        return 0
    except Exception as error:
        print('Error downloading file. Error Message: {0}'.format(error))
        return -1

In [36]:
#initial variables.
df_image_prediction =  None
file_url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
image_prediction_file_name = './data/collected_data/image-predictions.tsv'

#Download the image prediction file.
download_file_from_url(file_url=file_url, file_name=image_prediction_file_name)

#Check if file Exists.
if os.path.isfile(image_prediction_file_name):
    df_image_prediction = pd.read_csv(image_prediction_file_name, sep='\t')
else:
    raise Exception('No file Exists')

In [37]:
df_image_prediction.head(5)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [38]:
df_image_prediction.columns

Index(['tweet_id', 'jpg_url', 'img_num', 'p1', 'p1_conf', 'p1_dog', 'p2',
       'p2_conf', 'p2_dog', 'p3', 'p3_conf', 'p3_dog'],
      dtype='object')

Image Prediction Data Fields:

* `tweet_id`: Tweet ID
* `jpg_url`: Image URL
* `img_num`: Image number. Since Twitter supports up to 4 images per tweet, this column contains the index of the image being predicted
* `p1`: Dog breed - Prediction 1
* `p1_conf`: Confidence score - Prediction 1
* `p1_dog`: Whether the prediction is a dog or some other animal/object - Prediction 1
* `p2`: Dog breed - Prediction 2
* `p2_conf`: Confidence score - Prediction 2
* `p2_dog`: Whether the prediction is a dog or some other animal/object - Prediction 2
* `p3`: Dog breed - Prediction 3
* `p3_conf`: Confidence score - Prediction 3
* `p3_dog`: Whether the prediction is a dog or some other animal/object - Prediction 3

<a id='data_wrangling_assess'></a>
### 2.2 Assessment

Now that we have gathered all the data for our analysis, let's focus on the next major step, **Assess**. Here we are looking for two kinds of issues:

1. Quality issues (dirty data)
2. Structural issues (messy, or untidy, data)

These issues can be detected through either visual or programmatic assessment. Let's identify the issues in all the data we have collected so far.

Visual assessment of all the data was done by opening the files in Visual Code Editor/Excel.

#### Analyzing the Twitter Archive Enhanced Data Frame

In [41]:
df_tweet_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

In [42]:
df_tweet_archive.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


In [44]:
df_tweet_archive.name.str.len().value_counts()

4     1070
5      437
6      336
7      222
3      100
8       75
1       56
9       25
2       23
10       6
12       3
11       2
14       1
Name: name, dtype: int64

In [49]:
df_tweet_archive[df_tweet_archive.name.str.len() < 3].name.value_counts()

a     55
Bo     9
an     7
Mo     1
JD     1
Ed     1
Jo     1
Al     1
O      1
my     1
by     1
Name: name, dtype: int64

In [46]:
df_tweet_archive[df_tweet_archive.rating_denominator > 10].shape

(20, 17)

In [47]:
df_tweet_archive[df_tweet_archive.rating_numerator > 20].shape

(24, 17)

Twitter Archive DataFrame Issues:

Dirty Data Issues:

* `rating_denominator` - 20 records have a rating denominator greater than 10, and `describe()` shows a minimum of 0. As per [Wiki](https://en.wikipedia.org/wiki/WeRateDogs), the rating scale is one to ten.
* `rating_numerator` - 24 records have a rating numerator greater than 20. This is unusual, and we need to check why it is happening.
* `name` - Some dog names are shorter than 3 characters, and some have come up as 'a', 'O', 'my', etc., which are not real names.
* `timestamp` - The tweet creation time is stored as a string, not as a datetime type.
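The checks behind these quality issues can be sketched programmatically. The rows below are made up for illustration, and the rule that an all-lowercase name is not a real name is an assumption suggested by the values 'a', 'an', 'my', and 'by' seen above, not something the data guarantees:

```python
import pandas as pd

# Made-up rows using the same column names as the archive
toy_archive = pd.DataFrame({
    'timestamp': ['2017-08-01 16:23:56 +0000',
                  '2017-07-29 16:00:24 +0000',
                  '2017-07-30 15:58:51 +0000'],
    'name': ['Phineas', 'a', 'Bo'],
    'rating_numerator': [13, 1776, 12],
    'rating_denominator': [10, 10, 170],
})

# Parse the timestamp strings into timezone-aware (UTC) datetimes
toy_archive['timestamp'] = pd.to_datetime(toy_archive['timestamp'], utc=True)

# Assumption: names written entirely in lowercase ('a', 'an', 'my', ...) are not real names
suspicious_name = toy_archive['name'].str.islower()

# Flag ratings that deviate from the usual pattern for manual review
unusual_rating = (toy_archive['rating_denominator'] != 10) | (toy_archive['rating_numerator'] > 20)

print(toy_archive[suspicious_name | unusual_rating])
```

Flagging rows rather than dropping them keeps the decision about each record open until the tweet text has been inspected.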

Messy Data Issues:

* `dog stage` - 'puppo', 'doggo', 'floofer', and 'pupper' are different dog stages. In other words, these column names are values, not variables. They need to be tracked under one variable, `dog_stage`.
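One possible way to collapse the four stage columns into a single `dog_stage` variable is sketched below. It assumes that a missing stage is stored as the string 'None' (consistent with the columns showing no null values in `info()`, but not confirmed here), and the helper `combine_stages` and the toy rows are hypothetical:

```python
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Made-up rows; 'None' (a string) marks an absent stage by assumption
toy = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'doggo': ['doggo', 'None', 'None'],
    'floofer': ['None', 'None', 'None'],
    'pupper': ['None', 'pupper', 'None'],
    'puppo': ['None', 'None', 'None'],
})


def combine_stages(row):
    """Join the stages present in a row; return None when no stage is recorded."""
    stages = [row[col] for col in stage_cols if row[col] != 'None']
    return ','.join(stages) if stages else None


toy['dog_stage'] = toy[stage_cols].apply(combine_stages, axis=1)
tidy = toy.drop(columns=stage_cols)
print(tidy)
```

Joining with a comma keeps tweets that mention more than one stage visible instead of silently picking one.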


#### Analyzing Twitter Data Collected via API

In [51]:
df_tweet_details.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2342 entries, 0 to 2341
Data columns (total 31 columns):
contributors                 0 non-null float64
coordinates                  0 non-null float64
created_at                   2342 non-null datetime64[ns]
display_text_range           2342 non-null object
entities                     2342 non-null object
extended_entities            2068 non-null object
favorite_count               2342 non-null int64
favorited                    2342 non-null bool
full_text                    2342 non-null object
geo                          0 non-null float64
id                           2342 non-null int64
id_str                       2342 non-null int64
in_reply_to_screen_name      78 non-null object
in_reply_to_status_id        78 non-null float64
in_reply_to_status_id_str    78 non-null float64
in_reply_to_user_id          78 non-null float64
in_reply_to_user_id_str      78 non-null float64
is_quote_status              2342 non-null bool
lang

In [53]:
df_tweet_details.describe()

Unnamed: 0,contributors,coordinates,favorite_count,geo,id,id_str,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,in_reply_to_user_id_str,possibly_sensitive,quoted_status_id,quoted_status_id_str,retweet_count
count,0.0,0.0,2342.0,0.0,2342.0,2342.0,78.0,78.0,78.0,78.0,2206.0,26.0,26.0,2342.0
mean,,,8006.33433,,7.422212e+17,7.422212e+17,7.455079e+17,7.455079e+17,2.014171e+16,2.014171e+16,0.0,8.113972e+17,8.113972e+17,2954.177199
std,,,12391.490229,,6.832408e+16,6.832408e+16,7.582492e+16,7.582492e+16,1.252797e+17,1.252797e+17,0.0,6.295843e+16,6.295843e+16,4968.040524
min,,,0.0,,6.660209e+17,6.660209e+17,6.658147e+17,6.658147e+17,11856340.0,11856340.0,0.0,6.721083e+17,6.721083e+17,0.0
25%,,,1384.0,,6.783509e+17,6.783509e+17,6.757419e+17,6.757419e+17,308637400.0,308637400.0,0.0,7.761338e+17,7.761338e+17,592.5
50%,,,3485.5,,7.186224e+17,7.186224e+17,7.038708e+17,7.038708e+17,4196984000.0,4196984000.0,0.0,8.281173e+17,8.281173e+17,1379.5
75%,,,9814.5,,7.986971e+17,7.986971e+17,8.257804e+17,8.257804e+17,4196984000.0,4196984000.0,0.0,8.637581e+17,8.637581e+17,3447.0
max,,,165028.0,,8.924206e+17,8.924206e+17,8.862664e+17,8.862664e+17,8.405479e+17,8.405479e+17,0.0,8.860534e+17,8.860534e+17,84207.0


In [52]:
df_tweet_details[df_tweet_details.retweeted]

Unnamed: 0,contributors,coordinates,created_at,display_text_range,entities,extended_entities,favorite_count,favorited,full_text,geo,...,quoted_status,quoted_status_id,quoted_status_id_str,quoted_status_permalink,retweet_count,retweeted,retweeted_status,source,truncated,user


In [54]:
df_tweet_details.columns

Index(['contributors', 'coordinates', 'created_at', 'display_text_range',
       'entities', 'extended_entities', 'favorite_count', 'favorited',
       'full_text', 'geo', 'id', 'id_str', 'in_reply_to_screen_name',
       'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str', 'is_quote_status',
       'lang', 'place', 'possibly_sensitive', 'quoted_status',
       'quoted_status_id', 'quoted_status_id_str', 'quoted_status_permalink',
       'retweet_count', 'retweeted', 'retweeted_status', 'source', 'truncated',
       'user'],
      dtype='object')

Twitter API Details DataFrame Issues:

Dirty Data Issues:

* `Missing Values` - We originally queried Twitter for 2356 tweets, but we have details for only 2342. We are missing 14 tweets.
* `contributors`, `coordinates`, `entities`, `geo`, `in_reply_to_screen_name`, `in_reply_to_status_id`, `in_reply_to_status_id_str`, `in_reply_to_user_id`, `in_reply_to_user_id_str`, `is_quote_status`, `possibly_sensitive`, `quoted_status`, `quoted_status_id`, `quoted_status_id_str`, `quoted_status_permalink`, `truncated`, `user`, `retweeted_status` - Remove these columns, as we are not planning to use them.
* `id`, `id_str` - These are duplicate columns, so one of them can be removed. Since `id` can be unreliable (some systems can't handle such large integers), we can keep the `id_str` column.
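To find out which tweets are missing, the requested IDs can be compared with the IDs that came back. The sketch below uses made-up IDs standing in for `df_tweet_archive.tweet_id` and `df_tweet_details.id`. The last line illustrates, for one ID from the archive, why large integer IDs are fragile in any tool that stores numbers as 64-bit floats:

```python
import pandas as pd

# Made-up IDs standing in for the archive and the API response
archive_ids = pd.Series([101, 102, 103, 104, 105], name='tweet_id')
api_ids = pd.Series([101, 103, 104], name='id')

# IDs that were requested but are absent from the response
missing_ids = archive_ids[~archive_ids.isin(api_ids)]
print(missing_ids.tolist())

# Two different tweet-sized integers collapse to the same 64-bit float
same_as_float = float(892420643555336193) == float(892420643555336192)
print(same_as_float)
```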

Tidy Data Issues:

* `extended_entities` - This column contains data in JSON format, which needs to be parsed. Also, a tweet may contain up to 4 images. Each image is an observation and needs to be a row in the dataset.
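A minimal sketch of turning the nested `extended_entities` column into one row per image follows. The keys `media`, `media_url_https`, and `type` appear in the sample tweet shown earlier. The toy rows, the example URLs, and the output column names are assumptions for illustration:

```python
import pandas as pd

# Made-up rows shaped like the `id_str` and `extended_entities` columns
toy_details = pd.DataFrame({
    'id_str': ['1', '2', '3'],
    'extended_entities': [
        {'media': [{'media_url_https': 'https://example.com/a.jpg', 'type': 'photo'},
                   {'media_url_https': 'https://example.com/b.jpg', 'type': 'photo'}]},
        {'media': [{'media_url_https': 'https://example.com/c.jpg', 'type': 'photo'}]},
        None,  # tweet without media
    ],
})

rows = []
for tweet_id, entities in zip(toy_details['id_str'], toy_details['extended_entities']):
    if not isinstance(entities, dict):
        continue  # skip tweets that have no extended_entities
    for img_num, media in enumerate(entities.get('media', []), start=1):
        rows.append({'id_str': tweet_id,
                     'img_num': img_num,
                     'media_url': media.get('media_url_https'),
                     'type': media.get('type')})

df_media = pd.DataFrame(rows, columns=['id_str', 'img_num', 'media_url', 'type'])
print(df_media)
```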


#### Analyzing Image Prediction Data Frame

In [59]:
df_image_prediction.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [60]:
df_image_prediction.describe()

Unnamed: 0,tweet_id,img_num,p1_conf,p2_conf,p3_conf
count,2075.0,2075.0,2075.0,2075.0,2075.0
mean,7.384514e+17,1.203855,0.594548,0.1345886,0.06032417
std,6.785203e+16,0.561875,0.271174,0.1006657,0.05090593
min,6.660209e+17,1.0,0.044333,1.0113e-08,1.74017e-10
25%,6.764835e+17,1.0,0.364412,0.05388625,0.0162224
50%,7.119988e+17,1.0,0.58823,0.118181,0.0494438
75%,7.932034e+17,1.0,0.843855,0.1955655,0.09180755
max,8.924206e+17,4.0,1.0,0.488014,0.273419


In [62]:
len(df_image_prediction.tweet_id.unique())

2075

In [63]:
len(df_image_prediction.jpg_url.unique())

2009

In [64]:
df_image_prediction.jpg_url.value_counts()

https://pbs.twimg.com/media/C2kzTGxWEAEOpPL.jpg    2
https://pbs.twimg.com/media/C2oRbOuWEAAbVSl.jpg    2
https://pbs.twimg.com/media/CdHwZd0VIAA4792.jpg    2
https://pbs.twimg.com/media/Ck2d7tJWUAEPTL3.jpg    2
https://pbs.twimg.com/media/CcG07BYW0AErrC9.jpg    2
https://pbs.twimg.com/media/CtVAvX-WIAAcGTf.jpg    2
https://pbs.twimg.com/media/CsGnz64WYAEIDHJ.jpg    2
https://pbs.twimg.com/media/CkNjahBXAAQ2kWo.jpg    2
https://pbs.twimg.com/media/CeRoBaxWEAABi0X.jpg    2
https://pbs.twimg.com/media/Cp6db4-XYAAMmqL.jpg    2
https://pbs.twimg.com/media/Cbs3DOAXIAAp3Bd.jpg    2
https://pbs.twimg.com/media/CiibOMzUYAA9Mxz.jpg    2
https://pbs.twimg.com/media/Ct72q9jWcAAhlnw.jpg    2
https://pbs.twimg.com/media/CU3mITUWIAAfyQS.jpg    2
https://pbs.twimg.com/media/CUN4Or5UAAAa5K4.jpg    2
https://pbs.twimg.com/media/DA7iHL5U0AA1OQo.jpg    2
https://pbs.twimg.com/media/CpmyNumW8AAAJGj.jpg    2
https://pbs.twimg.com/media/CiyHLocU4AI2pJu.jpg    2
https://pbs.twimg.com/media/CvT6IV6WEAQhhV5.jp

In [65]:
df_image_prediction[df_image_prediction.jpg_url == 'https://pbs.twimg.com/media/C2kzTGxWEAEOpPL.jpg']

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1738,822244816520155136,https://pbs.twimg.com/media/C2kzTGxWEAEOpPL.jpg,1,Samoyed,0.585441,True,Pomeranian,0.193654,True,Arctic_fox,0.071648,False
1746,823269594223824897,https://pbs.twimg.com/media/C2kzTGxWEAEOpPL.jpg,1,Samoyed,0.585441,True,Pomeranian,0.193654,True,Arctic_fox,0.071648,False


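Twitter Image Prediction Data Issues:

Dirty Data Issues:

* `Missing Data` - We have 2356 tweets in the Twitter archive data, but image predictions are available for only 2075 tweets.
* `Duplicate Data` - The 2075 tweets share only 2009 unique image URLs, so 66 rows reuse an image URL that already appears under another tweet ID (for example, tweets 822244816520155136 and 823269594223824897).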
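The duplicate check can be sketched with `DataFrame.duplicated`. The rows below are made up, and on the real data the same arithmetic gives 2075 - 2009 = 66 extra rows:

```python
import pandas as pd

# Made-up rows using the same column names as the image prediction file
toy_predictions = pd.DataFrame({
    'tweet_id': [11, 12, 13, 14],
    'jpg_url': ['https://example.com/x.jpg', 'https://example.com/y.jpg',
                'https://example.com/x.jpg', 'https://example.com/z.jpg'],
})

# keep=False marks every row of a duplicated group, not only the later repeats
shared_url = toy_predictions.duplicated(subset='jpg_url', keep=False)
print(toy_predictions[shared_url])

# Rows beyond the first occurrence of each URL
extra_rows = len(toy_predictions) - toy_predictions['jpg_url'].nunique()
print(extra_rows)
```

Listing all rows of each duplicated group makes it easier to decide which tweet is the original and which is the repeat.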

Tidy Data Issues:

* `p1`, `p2`, `p3`, `p1_conf`, `p2_conf`, `p3_conf`, `p1_dog`, `p2_dog`, `p3_dog` - The prediction number (1, 2, 3) is encoded in the column names. Ideally, this data would be tracked in 4 variables (Prediction Number, Breed Prediction, Prediction Confidence Score, Is Prediction a Dog?).
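One way to reshape the nine prediction columns into the four variables is sketched below, using the first row of the image prediction output shown earlier. The target column names (`prediction_number`, `breed`, `confidence`, `is_dog`) are assumptions chosen for this illustration:

```python
import pandas as pd

# First row of the image prediction data, in its original wide layout
toy_wide = pd.DataFrame({
    'tweet_id': [666020888022790149],
    'p1': ['Welsh_springer_spaniel'], 'p1_conf': [0.465074], 'p1_dog': [True],
    'p2': ['collie'], 'p2_conf': [0.156665], 'p2_dog': [True],
    'p3': ['Shetland_sheepdog'], 'p3_conf': [0.061428], 'p3_dog': [True],
})

parts = []
for number in (1, 2, 3):
    prefix = 'p{0}'.format(number)
    # Select the three columns of this prediction and give them generic names
    part = toy_wide[['tweet_id', prefix, prefix + '_conf', prefix + '_dog']].copy()
    part.columns = ['tweet_id', 'breed', 'confidence', 'is_dog']
    part.insert(1, 'prediction_number', number)
    parts.append(part)

# Stack the three per-prediction frames into one long frame
df_long = pd.concat(parts, ignore_index=True)
print(df_long)
```

Each tweet then contributes three rows, one per prediction, which makes filtering by confidence or by `is_dog` a plain row selection.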