#  WeRateDogs Project

# Introduction 

In this project we are going to do data wrangling process. Data Wrangling is one of the most important process in data analysis. Data Warngling flow works like *Gather,Access* and *Clean*,  once it is done we get down to solving problem by analysis, moreover data wrangling is an iterative process we can get back the wrangling process even after we start analysing our data's through plots and graphs.

Through out this project we will go in a step by step process of **Data Wrangling**,
- Gather
- Access 
- Clean 


# Gather 

In this Project we are going to use [WeRateDogs](https://en.wikipedia.org/wiki/WeRateDogs) for the data Wrangling Process.

- The WeRateDogs Twitter archive. We will download this file manually by clicking the following link: twitter_archive_enhanced.csv
- The tweet image predictions, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. This file (image_predictions.tsv) hosted on Udacity's servers and should be downloaded programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv
- Each tweet's retweet count and favorite (i.e. "like") count at minimum, and any additional data we will find interesting. Using the tweet IDs in the WeRateDogs Twitter archive, we will query the Twitter API for each tweet's JSON data using Python's Tweepy library and store each tweet's entire set of JSON data in a file called tweet_json.txt file. Each tweet's JSON data should be written to its own line. Then we will read this .txt file line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count.


In [1]:
# Importing the required Libraries 
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import tweepy
import json
import time
import requests
from nltk import pos_tag
import re

In [2]:
# Reading the twitter-archive csv file and importing it into a dataframe format
df = pd.read_csv("twitter-archive-enhanced.csv", encoding = 'utf-8')
# Checking the header for testing
df.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [3]:
# Checking for the total number of rows and column using the info function 
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

It has **2356 recordes** and **17 columns**

In [4]:
# Importing the image-prediction dataset in to a dataframe 
images= pd.read_csv("image-predictions.tsv", delimiter= '\t', encoding = 'utf-8')
# Testing the dataframe using the header
images.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [5]:
# Checking for the missing rows of each column using the info function 
images.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


Image dataframe has **2075 rows and 12 columns**

In [6]:
# Autontification to twetter API

# Generate your own at https://apps.twitter.com/app
# CONSUMER_KEY = 'Consumer Key (API key)'
# CONSUMER_SECRET = 'Consumer Secret (API Secret)'
# OAUTH_TOKEN = 'Access Token'
# OAUTH_TOKEN_SECRET = 'Access Token Secret'

consumer_key = 'Hd91N2OI32xliHjqxecVHKZou'
consumer_secret = 'x1W4DvSQ6lF8xrj2VYSKteMcMSBwa3SSbPNscJfj3tvj9bRxkx'
access_token = '1048995689617936385-eJczPjQh9OrjbUMD3JDVnimSQ3q6YE'
access_token_secret = 'kBWfb9YmbBtMlTjuMiyWDk9UtfTNSIu8xWEIeNaohb0qc'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

# Construct the API instance
api = tweepy.API(auth, 
                 parser = tweepy.parsers.JSONParser(), # Parse the result to Json Object
                 wait_on_rate_limit = True, # Automatically wait for rate limits to replenish
                 wait_on_rate_limit_notify = True) # Print a notification when Tweepy is waiting for rate limits to replenish


In [7]:
# Liste where we will store the dictionaries of our result
df_list = []
# Liste frame where we will store the tweet_id of the errors
error_list = []

# Calculate the time of excution
start = time.time()

# Get the tweet object for all the teweets in archive dataframe 
for tweet_id in df['tweet_id']:
    try:
        page = api.get_status(tweet_id, tweet_mode = 'extended')
        # Print one page to look at the structure of the returned file
        # and the names of attributes
        # print(json.dumps(page, indent = 4))
        #break
        
        favorites = page['favorite_count'] # How many favorites the tweet had
        retweets = page['retweet_count'] # Count of the retweet
        user_followers = page['user']['followers_count'] # How many followers the user had
        user_favourites = page['user']['favourites_count'] # How many favorites the user had
        date_time = page['created_at'] # The date and time of the creation
        
        df_list.append({'tweet_id': int(tweet_id),
                        'favorites': int(favorites),
                        'retweets': int(retweets),
                        'user_followers': int(user_followers),
                        'user_favourites': int(user_favourites),
                        'date_time': pd.to_datetime(date_time)})
    
    # Catch the exceptions of the TweepError
    except Exception as e:
        print(str(tweet_id)+ " _ " + str(e))
        error_list.append(tweet_id)

# Calculate the time of excution
end = time.time()
print(end - start)

888202515573088257 _ [{'code': 144, 'message': 'No status found with that ID.'}]
873697596434513921 _ [{'code': 144, 'message': 'No status found with that ID.'}]
872668790621863937 _ [{'code': 144, 'message': 'No status found with that ID.'}]
869988702071779329 _ [{'code': 144, 'message': 'No status found with that ID.'}]
866816280283807744 _ [{'code': 144, 'message': 'No status found with that ID.'}]
861769973181624320 _ [{'code': 144, 'message': 'No status found with that ID.'}]
845459076796616705 _ [{'code': 144, 'message': 'No status found with that ID.'}]
842892208864923648 _ [{'code': 144, 'message': 'No status found with that ID.'}]
837012587749474308 _ [{'code': 144, 'message': 'No status found with that ID.'}]
827228250799742977 _ [{'code': 144, 'message': 'No status found with that ID.'}]
802247111496568832 _ [{'code': 144, 'message': 'No status found with that ID.'}]
775096608509886464 _ [{'code': 144, 'message': 'No status found with that ID.'}]
771004394259247104 _ [{'code

In [8]:
# lengh of the result
print("The lengh of the result", len(df_list))
# The tweet_id of the errors
print("The lengh of the errors", len(error_list))


The lengh of the result 2341
The lengh of the errors 15


From the above results:

The total time was about 1980 seconds (~ 31 min)
We could get 2349 tweet_id correctly with 15 errors 

In [9]:
# now we are going to do the same operation for the tweet_ids that we coudln't get and append the result to df_list
ee_list = []
for e in error_list:
    try:
        favorites = page['favorite_count']
        retweets = page['retweet_count']
        user_followers = page['user']['followers_count']
        user_favourites = page['user']['favourites_count']
        date_time = page['created_at']
        
        df_list.append({'tweet_id': int(tweet_id),
                        'favorites': int(favorites),
                        'retweets': int(retweets),
                        'user_followers': int(user_followers),
                        'user_favourites': int(user_favourites),
                        'date_time': pd.to_datetime(date_time)})
        
    except Exception:
        print(str(tweet_id)+ " _ " + str(e))
        ee_list.append(e)

In [10]:
# lengh of the final result with all the error result in to the final result
print("The lengh of the result after Querying the errors separately", len(df_list))

The lengh of the result after Querying the errors separately 2356


In [11]:
#Creatinga  dataframe from a dictionaries 
final_tweets = pd.DataFrame(df_list, columns = ['tweet_id', 'favorites', 'retweets',
                                               'user_followers', 'user_favourites', 'date_time'])
# Read the header of the file 
final_tweets.head()

Unnamed: 0,tweet_id,favorites,retweets,user_followers,user_favourites,date_time
0,892420643555336193,38298,8405,7377937,138949,2017-08-01 16:23:56
1,892177421306343426,32826,6196,7377937,138949,2017-08-01 00:17:27
2,891815181378084864,24720,4100,7377937,138949,2017-07-31 00:18:03
3,891689557279858688,41631,8532,7377937,138949,2017-07-30 15:58:51
4,891327558926688256,39800,9251,7377937,138949,2017-07-29 16:00:24


In [12]:
# Export the dataframe to a file 
final_tweets.to_csv('fianl_tweets.txt', encoding = 'utf-8', index=False)

In [13]:
final_tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 6 columns):
tweet_id           2356 non-null int64
favorites          2356 non-null int64
retweets           2356 non-null int64
user_followers     2356 non-null int64
user_favourites    2356 non-null int64
date_time          2356 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(5)
memory usage: 110.5 KB


There is no variable with missing columns

# Gather: Summary

Gathering is the first step in the data wrangling process. We could finish the high-level gathering process:

- Obtaining data

- Getting data from an existing file (twitter-archive-enhanced.csv) Reading from csv file using pandas
- Downloading a file from the internet (image-predictions.tsv) Downloading file using requests
- Querying an API (tweet_json.txt) Get JSON object of all the tweet_ids using Tweepy
- Importing that data into our programming environment (Jupyter Notebook)

# Assess

After gathering each of the above pieces of data, assess them visually and programmatically for quality and tidiness issues will be our newt step. We will detect and document at quality issues and tidiness issues.

In [14]:
# Print all archive dataset to assess it visually
df

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
5,891087950875897856,,,2017-07-29 00:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a majestic great white breaching ...,,,,https://twitter.com/dog_rates/status/891087950...,13,10,,,,,
6,890971913173991426,,,2017-07-28 16:27:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Jax. He enjoys ice cream so much he gets ...,,,,"https://gofundme.com/ydvmve-surgery-for-jax,ht...",13,10,Jax,,,,
7,890729181411237888,,,2017-07-28 00:22:40 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you watch your owner call another dog a g...,,,,https://twitter.com/dog_rates/status/890729181...,13,10,,,,,
8,890609185150312448,,,2017-07-27 16:25:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Zoey. She doesn't want to be one of th...,,,,https://twitter.com/dog_rates/status/890609185...,13,10,Zoey,,,,
9,890240255349198849,,,2017-07-26 15:59:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Cassie. She is a college pup. Studying...,,,,https://twitter.com/dog_rates/status/890240255...,14,10,Cassie,doggo,,,


In [15]:
# Assessing the data programmaticaly
df.info()
df.describe()
df['rating_numerator'].value_counts()
df['rating_denominator'].value_counts()
df['name'].value_counts()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

None              745
a                  55
Charlie            12
Lucy               11
Oliver             11
Cooper             11
Penny              10
Lola               10
Tucker             10
Winston             9
Bo                  9
the                 8
Sadie               8
an                  7
Buddy               7
Bailey              7
Toby                7
Daisy               7
Koda                6
Leo                 6
Rusty               6
Jax                 6
Dave                6
Jack                6
Scout               6
Bella               6
Stanley             6
Oscar               6
Milo                6
Sunny               5
                 ... 
my                  1
Tilly               1
Diogi               1
Pavlov              1
Jeffrie             1
Spark               1
Wafer               1
Michelangelope      1
Aja                 1
Sage                1
Barry               1
Binky               1
Lillie              1
by                  1
Georgie   

In [16]:
# Print all image dataset to assess it visually
images

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.072010,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
5,666050758794694657,https://pbs.twimg.com/media/CT5Jof1WUAEuVxN.jpg,1,Bernese_mountain_dog,0.651137,True,English_springer,0.263788,True,Greater_Swiss_Mountain_dog,0.016199,True
6,666051853826850816,https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg,1,box_turtle,0.933012,False,mud_turtle,0.045885,False,terrapin,0.017885,False
7,666055525042405380,https://pbs.twimg.com/media/CT5N9tpXIAAifs1.jpg,1,chow,0.692517,True,Tibetan_mastiff,0.058279,True,fur_coat,0.054449,False
8,666057090499244032,https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg,1,shopping_cart,0.962465,False,shopping_basket,0.014594,False,golden_retriever,0.007959,True
9,666058600524156928,https://pbs.twimg.com/media/CT5Qw94XAAA_2dP.jpg,1,miniature_poodle,0.201493,True,komondor,0.192305,True,soft-coated_wheaten_terrier,0.082086,True


In [17]:
images.info()
images['jpg_url'].value_counts()
images[images['jpg_url'] == 'https://pbs.twimg.com/media/CZhn-QAWwAASQan.jpg']

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
800,691416866452082688,https://pbs.twimg.com/media/CZhn-QAWwAASQan.jpg,1,Lakeland_terrier,0.530104,True,Irish_terrier,0.197314,True,Airedale,0.082515,True
1624,803692223237865472,https://pbs.twimg.com/media/CZhn-QAWwAASQan.jpg,1,Lakeland_terrier,0.530104,True,Irish_terrier,0.197314,True,Airedale,0.082515,True


In [18]:
final_tweets.info()
final_tweets['tweet_id'].value_counts() #count tweet_ids
final_tweets['user_followers'].value_counts() #check if querying the use_followers had a meaning

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 6 columns):
tweet_id           2356 non-null int64
favorites          2356 non-null int64
retweets           2356 non-null int64
user_followers     2356 non-null int64
user_favourites    2356 non-null int64
date_time          2356 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(5)
memory usage: 110.5 KB


7377989    343
7377936    289
7378043    261
7378042    224
7377940    155
7377939    144
7377935    128
7377986    128
7377990     79
7377991     75
7377987     75
7377941     67
7377988     61
7378045     54
7377992     51
7377993     48
7377937     41
7377985     35
7377943     30
7378044     24
7377938     20
7378714      5
7378607      4
7378608      3
7377942      3
7378715      3
7378611      2
7378661      2
7378658      1
7378663      1
Name: user_followers, dtype: int64

# Quality
Completeness, Validity, Accuracy, Consistency => a.k.a content issues

### df dataset

- in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id should be intergs instead of float
- retweeted_status_timestamp, timestamp should be datetime instead of object (string)
- The numerator and denominator columns have invalid values
- In several columns null objects are non-null (None to NaN)
- Name column have invalid names i.e 'None', 'a', 'an'
- We only want original ratings (no retweets) that have images
- We may want to change this columns type (in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id and tweet_id) to string because We don't want any operations on them.

### images dataset

- Missing values from images dataset (2075 rows instead of 2356)
- Some tweet_ids have the same jpg_url
- Some tweets are have 2 different tweet_id one redirect to the other

### json_tweeets dataset

- This tweet_id (666020888022790149) duplicated 8 times

# Tidiness

Untidy data => a.k.a structural issues

- No need to all the informations in images dataset, (tweet_id and jpg_url what matters)
- Various stages of dogs in columns instead of rows archives dataset
- We may want to add a gender column from the text columns in archives dataset.
- All tables should be part of one dataset


# Clean

Cleaning our data is the third step in data wrangling. It is where we will fix the quality and tidiness issues that we identified in the assess step


In [72]:
# As we want a highly tidy and clean dataset , we are going to merge the archive, images and final_tweet dataset 
# in to a master dataset by using the merge function.
df_master = pd.merge(df, images, how = 'left', on = ['tweet_id'] )
df_master = pd.merge(df_master, final_tweets, how = 'left', on = ['tweet_id'])
df_master.to_csv('df_master.csv', encoding = 'utf-8')
df_master.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2371 entries, 0 to 2370
Data columns (total 33 columns):
tweet_id                      2371 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2371 non-null object
source                        2371 non-null object
text                          2371 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2312 non-null object
rating_numerator              2371 non-null int64
rating_denominator            2371 non-null int64
name                          2371 non-null object
doggo                         2371 non-null object
floofer                       2371 non-null object
pupper                        2371 non-null object
puppo                         2371 non-null object
jpg_url                       20

In [73]:
# Delete the retweets
df_master = df_master[pd.isnull(df_master.retweeted_status_id)]
# Delete duplicated tweet_id
df_master = df_master.drop_duplicates()
# Delete tweets with no pictures
df_master = df_master.dropna(subset = ['jpg_url'])

# test
len(df_master)

1994

In [74]:
# We will Delete columns related to retweet we don't need anymore
df_master = df_master.drop('retweeted_status_id', 1)
df_master = df_master.drop('retweeted_status_user_id', 1)
df_master = df_master.drop('retweeted_status_timestamp', 1)

# We will Delete column date_time we imported from the API, it has the same values as timestamp column
df_master = df_master.drop('date_time', 1)

# test
list(df_master)

['tweet_id',
 'in_reply_to_status_id',
 'in_reply_to_user_id',
 'timestamp',
 'source',
 'text',
 'expanded_urls',
 'rating_numerator',
 'rating_denominator',
 'name',
 'doggo',
 'floofer',
 'pupper',
 'puppo',
 'jpg_url',
 'img_num',
 'p1',
 'p1_conf',
 'p1_dog',
 'p2',
 'p2_conf',
 'p2_dog',
 'p3',
 'p3_conf',
 'p3_dog',
 'favorites',
 'retweets',
 'user_followers',
 'user_favourites']

In [75]:
# -----Melt the 'doggo', 'floofer', 'pupper' and 'puppo' columns into one column 'dog_stage'---------

# Select the columns to melt and to remain
columns_to_melt = ['doggo', 'floofer', 'pupper', 'puppo']
columns_to_stay = [x for x in df_master.columns.tolist() if x not in columns_to_melt]

# Mlet the the columns into values
df_master = pd.melt(df_master, id_vars = columns_to_stay, value_vars = columns_to_melt, 
                         var_name = 'stages', value_name = 'dog_stage')

# Delete column 'stages'
df_master = df_master.drop('stages', 1)

In [76]:
# Getting rid of the prediction of the image column 

# We will store the fisrt true algorithm with it's level of confidence
prediction_algorithm = []
confidence_level = []

# Get_prediction_confidence function:
# search the first true algorithm and append it to a list with it's level of confidence
# if flase prediction_algorthm will have a value of NaN
def get_prediction_confidence(dataframe):
    if dataframe['p1_dog'] == True:
        prediction_algorithm.append(dataframe['p1'])
        confidence_level.append(dataframe['p1_conf'])
    elif dataframe['p2_dog'] == True:
        prediction_algorithm.append(dataframe['p2'])
        confidence_level.append(dataframe['p2_conf'])
    elif dataframe['p3_dog'] == True:
        prediction_algorithm.append(dataframe['p3'])
        confidence_level.append(dataframe['p3_conf'])
    else:
        prediction_algorithm.append('NaN')
        confidence_level.append(0)

df_master.apply(get_prediction_confidence, axis=1)
df_master['prediction_algorithm'] = prediction_algorithm
df_master['confidence_level'] = confidence_level


# Delete the columns of image prediction information
df_master = df_master.drop(['img_num', 'p1', 'p1_conf', 'p1_dog', 'p2', 'p2_conf', 'p2_dog', 'p3', 'p3_conf', 'p3_dog'], 1)


list(df_master)

['tweet_id',
 'in_reply_to_status_id',
 'in_reply_to_user_id',
 'timestamp',
 'source',
 'text',
 'expanded_urls',
 'rating_numerator',
 'rating_denominator',
 'name',
 'jpg_url',
 'favorites',
 'retweets',
 'user_followers',
 'user_favourites',
 'dog_stage',
 'prediction_algorithm',
 'confidence_level']

In [77]:
# Checking for duplicate values in all the columns

# Print the count of the unique elements in all columns
df_master.apply(lambda x: len(x.unique()))

tweet_id                 1994
in_reply_to_status_id      23
in_reply_to_user_id         2
timestamp                1994
source                      3
text                     1994
expanded_urls            1994
rating_numerator           34
rating_denominator         15
name                      936
jpg_url                  1994
favorites                1858
retweets                 1585
user_followers             30
user_favourites             5
dog_stage                   5
prediction_algorithm      114
confidence_level         1684
dtype: int64

In [78]:
df_master.info()
df_master['in_reply_to_user_id'].value_counts()
df_master['source'].value_counts()
df_master['user_favourites'].value_counts()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7976 entries, 0 to 7975
Data columns (total 18 columns):
tweet_id                 7976 non-null int64
in_reply_to_status_id    92 non-null float64
in_reply_to_user_id      92 non-null float64
timestamp                7976 non-null object
source                   7976 non-null object
text                     7976 non-null object
expanded_urls            7976 non-null object
rating_numerator         7976 non-null int64
rating_denominator       7976 non-null int64
name                     7976 non-null object
jpg_url                  7976 non-null object
favorites                7972 non-null float64
retweets                 7972 non-null float64
user_followers           7972 non-null float64
user_favourites          7972 non-null float64
dog_stage                7976 non-null object
prediction_algorithm     7976 non-null object
confidence_level         7976 non-null float64
dtypes: float64(7), int64(3), object(8)
memory usage: 1.1+ MB


138950.0    3184
138949.0    2624
138951.0    2132
138952.0      32
Name: user_favourites, dtype: int64

In [79]:
# drop the following columns 'in_reply_to_status_id', 'in_reply_to_user_id', 'user_favourites'
df_master = df_master.drop(['in_reply_to_status_id', 'in_reply_to_user_id', 'user_favourites'], 1)

In [80]:
# Clean the content of source column
df_master['source'] = df_master['source'].apply(lambda x: re.findall(r'>(.*)<', x)[0])

# Test
df_master

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,jpg_url,favorites,retweets,user_followers,dog_stage,prediction_algorithm,confidence_level
0,892420643555336193,2017-08-01 16:23:56 +0000,Twitter for iPhone,This is Phineas. He's a mystical boy. Only eve...,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg,38298.0,8405.0,7377937.0,,,0.000000
1,892177421306343426,2017-08-01 00:17:27 +0000,Twitter for iPhone,This is Tilly. She's just checking pup on you....,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,32826.0,6196.0,7377937.0,,Chihuahua,0.323581
2,891815181378084864,2017-07-31 00:18:03 +0000,Twitter for iPhone,This is Archie. He is a rare Norwegian Pouncin...,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,24720.0,4100.0,7377937.0,,Chihuahua,0.716012
3,891689557279858688,2017-07-30 15:58:51 +0000,Twitter for iPhone,This is Darla. She commenced a snooze mid meal...,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,41631.0,8532.0,7377937.0,,Labrador_retriever,0.168086
4,891327558926688256,2017-07-29 16:00:24 +0000,Twitter for iPhone,This is Franklin. He would like you to stop ca...,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,39800.0,9251.0,7377937.0,,basset,0.555712
5,891087950875897856,2017-07-29 00:08:17 +0000,Twitter for iPhone,Here we have a majestic great white breaching ...,https://twitter.com/dog_rates/status/891087950...,13,10,,https://pbs.twimg.com/media/DF3HwyEWsAABqE6.jpg,19983.0,3077.0,7377937.0,,Chesapeake_Bay_retriever,0.425595
6,890971913173991426,2017-07-28 16:27:12 +0000,Twitter for iPhone,Meet Jax. He enjoys ice cream so much he gets ...,"https://gofundme.com/ydvmve-surgery-for-jax,ht...",13,10,Jax,https://pbs.twimg.com/media/DF1eOmZXUAALUcq.jpg,11700.0,2041.0,7377937.0,,Appenzeller,0.341703
7,890729181411237888,2017-07-28 00:22:40 +0000,Twitter for iPhone,When you watch your owner call another dog a g...,https://twitter.com/dog_rates/status/890729181...,13,10,,https://pbs.twimg.com/media/DFyBahAVwAAhUTd.jpg,64647.0,18637.0,7377937.0,,Pomeranian,0.566142
8,890609185150312448,2017-07-27 16:25:51 +0000,Twitter for iPhone,This is Zoey. She doesn't want to be one of th...,https://twitter.com/dog_rates/status/890609185...,13,10,Zoey,https://pbs.twimg.com/media/DFwUU__XcAEpyXI.jpg,27475.0,4215.0,7377936.0,,Irish_terrier,0.487574
9,890240255349198849,2017-07-26 15:59:51 +0000,Twitter for iPhone,This is Cassie. She is a college pup. Studying...,https://twitter.com/dog_rates/status/890240255...,14,10,Cassie,https://pbs.twimg.com/media/DFrEyVuW0AAO3t9.jpg,31521.0,7295.0,7377936.0,doggo,Pembroke,0.511319


In [81]:
# Print the values and check if there exist in the text
df_master.rating_numerator.value_counts()
df_master.rating_denominator.value_counts()
print(df_master[df_master.rating_denominator == 170]['text'][2842])
print(df_master[df_master.rating_numerator == 1776]['text'][2720])
print(df_master[df_master.tweet_id == 786709082849828864]['text'][2497])
print(df_master['text'][1918])
print(df_master['text'][1917])
print(df_master['text'][1916])
print(df_master['text'][1911])

Say hello to this unbelievably well behaved squad of doggos. 204/170 would try to pet all at once https://t.co/yGQI3He3xv
This is Atticus. He's quite simply America af. 1776/10 https://t.co/GRXwMxLBkh
This is Logan, the Chow who lived. He solemnly swears he's up to lots of good. H*ckin magical af 9.75/10 https://t.co/yBO5wuqaPS
This is Biden. Biden just tripped... 7/10 https://t.co/3Fm9PwLju1
Ermergerd 12/10 https://t.co/PQni2sjPsm
Never seen this breed before. Very pointy pup. Hurts when you cuddle. Still cute tho. 10/10 https://t.co/97HuBrVuOx
Two dogs in this one. Both are rare Jujitsu Pythagoreans. One slightly whiter than other. Long legs. 7/10 and 8/10 https://t.co/ITxxcc4v9y


In [82]:
# Get ratings and treat them depending to their situation
ratings = df_master['text'].apply(lambda x: re.findall(r'(\d+(\.\d+)|(\d+))\/(\d+0)', x))
len(ratings)

7976

In [83]:
# Add new columns to store the new ratings and the count of dogs in each tweet
rating_numerator = []
rating_denominator = []
dogs_count = []

for rate in ratings:
    # Tweets with no rating
    if len(rate) == 0:
        rating_numerator.append('NaN')
        rating_denominator.append('NaN')
        dogs_count.append(1) # It has a picture so it is a dog
    
    # Tweets with one rate
    elif len(rate) == 1:
        rating_numerator.append((float(rate[0][0]) / (float(rate[0][-1])/10)))
        rating_denominator.append(float(rate[0][-1]))
        dogs_count.append(float(rate[0][-1]) / 10) 
                                  
    elif len(rate) > 1 and rate[0][-1] == '10':
        rating_plus = 0
        rating_avg = 0
        for i in range(len(rate)):
            rating_plus = rating_plus + float(rate[i][0])
        result_avg = (rating_plus / len(rate))
        rating_numerator.append(result_avg)
        rating_denominator.append(10)
        dogs_count.append(len(rate))
    else:
        rating_numerator.append('Error')
        rating_denominator.append('Error')
        dogs_count.append('Error')

df_master['new_rating_numerator'] = rating_numerator
df_master['new_rating_denominator'] = rating_denominator
df_master['dogs_count'] = dogs_count
df_master['new_rating_numerator'].value_counts()

12.0                 1812
10.0                 1644
11.0                 1596
13.0                 1044
9.0                   608
8.0                   376
7.0                   208
14.0                  144
6.0                   128
5.0                   120
3.0                    76
4.0                    56
2.0                    36
8.5                    16
1.0                    16
9.5                    12
7.5                    12
0.0                     8
Error                   8
10.5                    8
9.75                    4
1776.0                  4
5.5                     4
6.5                     4
11.27                   4
13.5                    4
11.5                    4
420.0                   4
NaN                     4
9.666666666666666       4
11.26                   4
4.5                     4
Name: new_rating_numerator, dtype: int64

In [84]:
print(df_master[df_master.new_rating_numerator == 'Error']['text'][2919])
print(df_master[df_master.new_rating_numerator == 'Error']['text'][2885])
print(df_master[df_master.new_rating_numerator == 1776.0]['text'][2720])
print(df_master[df_master.new_rating_numerator == 420.0]['text'][3712])

This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq
Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a
This is Atticus. He's quite simply America af. 1776/10 https://t.co/GRXwMxLBkh
After so many requests... here you go.

Good dogg. 420/10 https://t.co/yfAAo1gdeY


In [85]:
#We notice that the two errors where for a special situation that has two fractions
#but only one of them is the valid one, 
#we will could assess this manually and we will clean it manually as well.

#We still see two big numbers one of them for Atticus and the other for snop Dog 
#we will exclude them in future visualisations


# Correct the errors
tweet_id_11 = df_master[df_master.new_rating_numerator == 'Error']['tweet_id'][2919]
tweet_id_13 = df_master[df_master.new_rating_numerator == 'Error']['tweet_id'][2885]

df_master.loc[df_master['tweet_id'] == tweet_id_11, 'new_rating_numerator'] = 11
df_master.loc[df_master['tweet_id'] == tweet_id_13, 'new_rating_numerator'] = 13

df_master.loc[df_master['dogs_count'] == 'Error', 'dogs_count'] = 1
df_master.loc[df_master['new_rating_denominator'] == 'Error', 'new_rating_denominator'] = 10

# Test
print(df_master.new_rating_numerator[df_master.tweet_id == tweet_id_11])
print(df_master.new_rating_numerator[df_master.tweet_id == tweet_id_13])

925     11
2919    11
4913    11
6907    11
Name: new_rating_numerator, dtype: object
891     13
2885    13
4879    13
6873    13
Name: new_rating_numerator, dtype: object


In [86]:
# Delete the old columns and update the names of the new ones
df_master = df_master.drop(['rating_numerator', 'rating_denominator'], 1)

# Rename columns
df_master.rename(columns = {'new_rating_numerator': 'rating_numerator', 
                            'new_rating_denominator': 'rating_denominator'}, inplace = True)

# Test
list(df_master)

['tweet_id',
 'timestamp',
 'source',
 'text',
 'expanded_urls',
 'name',
 'jpg_url',
 'favorites',
 'retweets',
 'user_followers',
 'dog_stage',
 'prediction_algorithm',
 'confidence_level',
 'rating_numerator',
 'rating_denominator',
 'dogs_count']

In [87]:
# Loop on all the texts and check if the comment has one of the above conditions
# and append the result in a list
dog_names = []

for text in df_master['text']:
    # Start with 'This is ' and the fisrt letter of the name is uppercase
    if text.startswith('This is ') and re.match(r'[A-Z].*', text.split()[2]):
        dog_names.append(text.split()[2].strip(',').strip('.'))
    # Start with 'Meet ' and the fisrt letter of the name is uppercase
    elif text.startswith('Meet ') and re.match(r'[A-Z].*', text.split()[1]):
        dog_names.append(text.split()[1].strip(',').strip('.'))
    # Start with 'Say hello to ' and the fisrt letter of the name is uppercase
    elif text.startswith('Say hello to ') and re.match(r'[A-Z].*', text.split()[3]):
        dog_names.append(text.split()[3].strip(',').strip('.'))
    # Start with 'Here we have ' and the fisrt letter of the name is uppercase
    elif text.startswith('Here we have ') and re.match(r'[A-Z].*', text.split()[3]):
        dog_names.append(text.split()[3].strip(',').strip('.'))
    # Contain 'named' and the fisrt letter of the name is uppercase
    elif 'named' in text and re.match(r'[A-Z].*', text.split()[text.split().index('named') + 1]):
        dog_names.append(text.split()[text.split().index('named') + 1].strip(',').strip('.'))
    # No name specified or other style
    else:
        dog_names.append('NaN')

# Test
len(dog_names)

# Save the result in a new column 'dog_name'
df_master['dog_name'] = dog_names

# Test
print("New column dog_name count \n", df_master.dog_name.value_counts())
print("Old column name count \n", df_master.name.value_counts())

New column dog_name count 
 NaN          2500
Charlie        44
Cooper         40
Oliver         40
Lucy           40
Penny          36
Tucker         36
Sadie          32
Winston        32
Daisy          28
Lola           28
Jax            24
Bo             24
Bella          24
Toby           24
Stanley        24
Koda           24
Buddy          20
Oscar          20
Leo            20
Louis          20
Chester        20
Scout          20
Rusty          20
Bailey         20
Milo           20
Jack           16
Cassie         16
Winnie         16
Duke           16
             ... 
Tebow           4
Sora            4
Alfredo         4
Andy            4
Winifred        4
Kayla           4
Iroh            4
Sonny           4
Sprout          4
Tobi            4
Harnold         4
Odin            4
Grizzwald       4
Goose           4
Chadrick        4
Brady           4
Sage            4
Butter          4
Emmy            4
Binky           4
Lillie          4
Wishes          4
Georgie         4


In [88]:
# We acn see here that the 'NaN' result for the tweets with two names or no name
df_master[df_master.dog_name == 'NaN']

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,name,jpg_url,favorites,retweets,user_followers,dog_stage,prediction_algorithm,confidence_level,rating_numerator,rating_denominator,dogs_count,dog_name
5,891087950875897856,2017-07-29 00:08:17 +0000,Twitter for iPhone,Here we have a majestic great white breaching ...,https://twitter.com/dog_rates/status/891087950...,,https://pbs.twimg.com/media/DF3HwyEWsAABqE6.jpg,19983.0,3077.0,7377937.0,,Chesapeake_Bay_retriever,0.425595,13,10,1,
7,890729181411237888,2017-07-28 00:22:40 +0000,Twitter for iPhone,When you watch your owner call another dog a g...,https://twitter.com/dog_rates/status/890729181...,,https://pbs.twimg.com/media/DFyBahAVwAAhUTd.jpg,64647.0,18637.0,7377937.0,,Pomeranian,0.566142,13,10,1,
12,889665388333682689,2017-07-25 01:55:32 +0000,Twitter for iPhone,Here's a puppo that seems to be on the fence a...,https://twitter.com/dog_rates/status/889665388...,,https://pbs.twimg.com/media/DFi579UWsAAatzw.jpg,47531.0,9936.0,7377936.0,,Pembroke,0.966327,13,10,1,
21,887517139158093824,2017-07-19 03:39:09 +0000,Twitter for iPhone,I've yet to rate a Venezuelan Hover Wiener. Th...,https://twitter.com/dog_rates/status/887517139...,such,https://pbs.twimg.com/ext_tw_video_thumb/88751...,45697.0,11521.0,7377936.0,,,0.000000,14,10,1,
23,887343217045368832,2017-07-18 16:08:03 +0000,Twitter for iPhone,You may not have known you needed to see this ...,https://twitter.com/dog_rates/status/887343217...,,https://pbs.twimg.com/ext_tw_video_thumb/88734...,33260.0,10279.0,7377936.0,,Mexican_hairless,0.330741,13,10,1,
24,887101392804085760,2017-07-18 00:07:08 +0000,Twitter for iPhone,This... is a Jubilant Antarctic House Bear. We...,https://twitter.com/dog_rates/status/887101392...,,https://pbs.twimg.com/media/DE-eAq6UwAA-jaE.jpg,30176.0,5889.0,7377936.0,,Samoyed,0.733942,12,10,1,
30,885984800019947520,2017-07-14 22:10:11 +0000,Twitter for iPhone,Viewer discretion advised. This is Jimbo. He w...,https://twitter.com/dog_rates/status/885984800...,Jimbo,https://pbs.twimg.com/media/DEumeWWV0AA-Z61.jpg,32291.0,6722.0,7377936.0,,Blenheim_spaniel,0.972494,12,10,1,
32,885167619883638784,2017-07-12 16:03:00 +0000,Twitter for iPhone,Here we have a corgi undercover as a malamute....,https://twitter.com/dog_rates/status/885167619...,,https://pbs.twimg.com/media/DEi_N9qXYAAgEEw.jpg,21671.0,4332.0,7377936.0,,malamute,0.812482,13,10,1,
36,884441805382717440,2017-07-10 15:58:53 +0000,Twitter for iPhone,"I present to you, Pup in Hat. Pup in Hat is gr...",https://twitter.com/dog_rates/status/884441805...,,https://pbs.twimg.com/media/DEYrIZwWsAA2Wo5.jpg,26675.0,5590.0,7377936.0,,Pembroke,0.993225,14,10,1,
41,883117836046086144,2017-07-07 00:17:54 +0000,Twitter for iPhone,Please only send dogs. We don't rate mechanics...,https://twitter.com/dog_rates/status/883117836...,,https://pbs.twimg.com/media/DEF2-_hXoAAs62q.jpg,36757.0,6602.0,7377936.0,,golden_retriever,0.949562,13,10,1,


In [89]:
# Let's delete the old name column now
df_master = df_master.drop(['name'], 1)

# Test
list(df_master)

['tweet_id',
 'timestamp',
 'source',
 'text',
 'expanded_urls',
 'jpg_url',
 'favorites',
 'retweets',
 'user_followers',
 'dog_stage',
 'prediction_algorithm',
 'confidence_level',
 'rating_numerator',
 'rating_denominator',
 'dogs_count',
 'dog_name']

### Get Dogs gender column from text column

In [90]:
# Loop on all the texts and check if it has one of pronouns of male or female
# and append the result in a list

male = ['He', 'he', 'him', 'his', "he's", 'himself']
female = ['She', 'she', 'her', 'hers', 'herself', "she's"]

dog_gender = []

for text in df_master['text']:
    # Male
    if any(map(lambda v:v in male, text.split())):
        dog_gender.append('male')
    # Female
    elif any(map(lambda v:v in female, text.split())):
        dog_gender.append('female')
    # If group or not specified
    else:
        dog_gender.append('NaN')

# Test
len(dog_gender)

# Save the result in a new column 'dog_name'
df_master['dog_gender'] = dog_gender

# Test
print("dog_gender count \n", df_master.dog_gender.value_counts())


dog_gender count 
 NaN       4528
male      2544
female     904
Name: dog_gender, dtype: int64


### Convert the null values to None type

In [91]:
df_master.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7976 entries, 0 to 7975
Data columns (total 17 columns):
tweet_id                7976 non-null int64
timestamp               7976 non-null object
source                  7976 non-null object
text                    7976 non-null object
expanded_urls           7976 non-null object
jpg_url                 7976 non-null object
favorites               7972 non-null float64
retweets                7972 non-null float64
user_followers          7972 non-null float64
dog_stage               7976 non-null object
prediction_algorithm    7976 non-null object
confidence_level        7976 non-null float64
rating_numerator        7976 non-null object
rating_denominator      7976 non-null object
dogs_count              7976 non-null object
dog_name                7976 non-null object
dog_gender              7976 non-null object
dtypes: float64(4), int64(1), object(12)
memory usage: 1.0+ MB


In [92]:
df_master.loc[df_master['prediction_algorithm'] == 'NaN', 'prediction_algorithm'] = None
df_master.loc[df_master['dog_name'] == 'NaN', 'dog_name'] = None
df_master.loc[df_master['dog_gender'] == 'NaN', 'dog_gender'] = None
df_master.loc[df_master['rating_numerator'] == 'NaN', 'rating_numerator'] = 0
df_master.loc[df_master['rating_denominator'] == 'NaN', 'rating_denominator'] = 0

In [93]:
# Test
df_master.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7976 entries, 0 to 7975
Data columns (total 17 columns):
tweet_id                7976 non-null int64
timestamp               7976 non-null object
source                  7976 non-null object
text                    7976 non-null object
expanded_urls           7976 non-null object
jpg_url                 7976 non-null object
favorites               7972 non-null float64
retweets                7972 non-null float64
user_followers          7972 non-null float64
dog_stage               7976 non-null object
prediction_algorithm    6744 non-null object
confidence_level        7976 non-null float64
rating_numerator        7976 non-null object
rating_denominator      7976 non-null object
dogs_count              7976 non-null object
dog_name                5476 non-null object
dog_gender              3448 non-null object
dtypes: float64(4), int64(1), object(12)
memory usage: 1.0+ MB


## Convert each column to its appropriate field

In [94]:
df_master.dtypes

tweet_id                  int64
timestamp                object
source                   object
text                     object
expanded_urls            object
jpg_url                  object
favorites               float64
retweets                float64
user_followers          float64
dog_stage                object
prediction_algorithm     object
confidence_level        float64
rating_numerator         object
rating_denominator       object
dogs_count               object
dog_name                 object
dog_gender               object
dtype: object

In [97]:
df_master['tweet_id'] = df_master['tweet_id'].astype(object)
df_master['timestamp'] = pd.to_datetime(df_master.timestamp)
df_master['source'] = df_master['source'].astype('category')
#df_master['favorites'] = df_master['favorites'].astype(int)
#df_master['retweets'] = df_master['retweets'].astype(int)
#df_master['user_followers'] = df_master['user_followers'].astype(int)
df_master['dog_stage'] = df_master['dog_stage'].astype('category')
df_master['rating_numerator'] = df_master['rating_numerator'].astype(float)
df_master['rating_denominator'] = df_master['rating_denominator'].astype(float)
df_master['dogs_count'] = df_master['dogs_count'].astype(int)
df_master['dog_gender'] = df_master['dog_gender'].astype('category')

# Test
df_master.dtypes

tweet_id                        object
timestamp               datetime64[ns]
source                        category
text                            object
expanded_urls                   object
jpg_url                         object
favorites                      float64
retweets                       float64
user_followers                 float64
dog_stage                     category
prediction_algorithm            object
confidence_level               float64
rating_numerator               float64
rating_denominator             float64
dogs_count                       int64
dog_name                        object
dog_gender                    category
dtype: object

## Rename columns to be more expressive and Clean if needed

In [98]:
df_master = df_master.rename(columns = {'timestamp': 'tweet_date', 'source': 'tweet_source', 'text': 'tweet_text', 
                                        'expanded_urls': 'tweet_url', 'jpg_url': 'tweet_picture_predicted', 
                                        'favorites': 'tweet_favorites', 'retweets': 'tweet_retweets',
                                        'prediction_algorithm' : 'dog_breed'})


In [99]:
# All rating_denominator has one value 10
# We will delete this column
print(df_master.rating_denominator.value_counts())
df_master.drop('rating_denominator', 1, inplace = True)

10.0     7924
50.0        8
80.0        8
70.0        4
170.0       4
110.0       4
150.0       4
90.0        4
130.0       4
120.0       4
40.0        4
0.0         4
Name: rating_denominator, dtype: int64


## Storing, Analyzing, and Visualizing Data

In [100]:
# Store the clean DataFrame in a CSV file
df_master.to_csv('twitter_archive_master.csv', index=False, encoding = 'utf-8')

In [101]:
df_master = pd.read_csv('twitter_archive_master.csv')
df_master.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7976 entries, 0 to 7975
Data columns (total 16 columns):
tweet_id                   7976 non-null int64
tweet_date                 7976 non-null object
tweet_source               7976 non-null object
tweet_text                 7976 non-null object
tweet_url                  7976 non-null object
tweet_picture_predicted    7976 non-null object
tweet_favorites            7972 non-null float64
tweet_retweets             7972 non-null float64
user_followers             7972 non-null float64
dog_stage                  7976 non-null object
dog_breed                  6744 non-null object
confidence_level           7976 non-null float64
rating_numerator           7976 non-null float64
dogs_count                 7976 non-null int64
dog_name                   5476 non-null object
dog_gender                 3448 non-null object
dtypes: float64(5), int64(2), object(9)
memory usage: 997.1+ KB
