# Wrangle and Analyze Data

## Table of Contents

<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#gathering">Data Gathering</a></li>
<li><a href="#assessing">Data Assessing</a></li>
<li><a href="#cleaning">Data Cleaning</a></li>
<li><a href="#storing">Data Storing</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
    
</ul>

## Introduction

The essence of this project is to learn how to gather data from different sources, assess it visually and programmatically, and clean it, and then to store, analyze and visualize the cleaned data.
The dataset for this project is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. The archive has over 5000 tweets, which have been filtered to create the enhanced archive that forms the basis of this analysis.

### Questions for Analysis
My analysis will answer the following questions:

What is the relationship between favorite count and retweet count?

What is the most common dog breed?

What is the most common dog stage?

Which tweets have the highest retweet_count?

What are the most common dog names?

What is the most common tweet source?

What are the top dog breeds by favorite count?

What are the top dog breeds by retweet count?


#### Importing Python libraries

In [1]:
#import libraries
import pandas as pd
import numpy as np
import requests
import os
import tweepy
import re
import json
import time
import datetime
import random
import matplotlib.pyplot as plt
from matplotlib import cm
%matplotlib inline
import seaborn as sns
from tweepy import OAuthHandler
from timeit import default_timer as timer

## Data Gathering

### The WeRateDogs Twitter archive

#### The file twitter-archive-enhanced.csv was downloaded manually from the Udacity server

In [2]:
#load data
df_twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')

In [3]:
#Check top rows of the data
df_twitter_archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


### The tweet image predictions

#### This file (image-predictions.tsv) contains the dog breed predicted for the image in each tweet by a neural network. It is hosted on Udacity's servers and is downloaded programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv


In [4]:
#Get data url
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'

In [5]:
#Name of the local file for the data (the last part of the URL)
file_name = url.split('/')[-1]

In [6]:
#Using Requests library to download the data
response = requests.get(url)
#Stop here if the download failed
response.raise_for_status()

In [7]:
#Store the downloaded data in the local file
with open(file_name, mode = 'wb') as file:
    file.write(response.content)

In [8]:
#Load the data
df_image_predictions = pd.read_csv(file_name, sep = '\t')

In [9]:
#Check top rows of the data
df_image_predictions.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


### Additional data from the Twitter API

Gather at least each tweet's retweet count and favorite ("like") count, plus any additional data you find interesting. Using the tweet IDs in the WeRateDogs Twitter archive, query the Twitter API for each tweet's JSON data using Python's Tweepy library and store each tweet's entire set of JSON data in a file called tweet_json.txt.

In [10]:
#Twitter API keys, Secrets, and Tokens
consumer_key = 'hidden'
consumer_secret = 'hidden'
access_token = 'hidden'
access_secret = 'hidden'


In [11]:
#Create Twitter API object and set rate limit 
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, 
                 wait_on_rate_limit = True)

In [12]:
#Check the number of tweet ids
tweet_ids = df_twitter_archive.tweet_id.values
print("Number of tweet_ids = " + str(len(tweet_ids)) + "\n")

Number of tweet_ids = 2356



In [None]:
#Save tweets from the twitter API using the following loop
tweet_data = []
#Tweet ids that are found are saved in the list below:
tweet_id_found = []
#Tweet ids that can't be found are saved in the list below:
tweet_id_missing = []
for tweet_id in df_twitter_archive['tweet_id']:
    try:
        #Rate-limit waiting is already set on the API object (In [11]),
        #so it is not passed to get_status again
        data = api.get_status(tweet_id, tweet_mode='extended')
        tweet_data.append(data)
        tweet_id_found.append(tweet_id)
    except Exception as e:
        tweet_id_missing.append(tweet_id)

In [None]:
#Use the tweet_ids from the WeRateDogs twitter archive and query the Twitter API for each tweet's JSON
count = 0
fails_dict = {}
start = timer()
# Save each tweet's returned JSON as a new line in a .txt file
with open('tweet_json.txt', 'w') as file:
    for tweet_id in tweet_ids:
        count += 1
        print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            print("Success")
            json.dump(tweet._json, file)
            file.write('\n')
        except Exception as e:
            print("Fail")
            fails_dict[tweet_id] = e
            pass
end = timer()
print(end - start)
print(fails_dict)

In [None]:
api_df = []
with open('tweet_json.txt', 'r') as Json_file:
    for line in Json_file:
        tweet = (json.loads(line))
        tweet_id = tweet['id']
        retweet_count = tweet['retweet_count']
        favorite_count = tweet['favorite_count']
        create_date = tweet['created_at']
        api_df.append({'retweet_count' : retweet_count,
                  'favorite_count' : favorite_count,
                  'create_date' : create_date,
                  'tweet_id' : tweet_id})

In [None]:
#load the data into a dataframe
df_twitter_extra = pd.DataFrame(api_df, columns = ['tweet_id', 'retweet_count', 'favorite_count', 'create_date'])

In [None]:
#Check top rows of the data
df_twitter_extra.head()
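
As an alternative to the line-by-line loop above, a file with one JSON object per line can also be read in a single call with `pd.read_json(..., lines=True)`. The sketch below is only an illustration on two made-up records (the ids and counts are hypothetical, not taken from the real tweet_json.txt), and it assumes every line carries the same keys used above.

```python
import io
import pandas as pd

# Two made-up records standing in for lines of tweet_json.txt (hypothetical values)
sample_lines = (
    '{"id": 1001, "retweet_count": 5, "favorite_count": 20, '
    '"created_at": "Tue Aug 01 16:23:56 +0000 2017"}\n'
    '{"id": 1002, "retweet_count": 7, "favorite_count": 31, '
    '"created_at": "Tue Aug 01 00:17:27 +0000 2017"}\n'
)

# lines=True tells pandas that each line is a separate JSON object;
# convert_dates=False keeps created_at as the raw string
df_sample = pd.read_json(io.StringIO(sample_lines), lines=True, convert_dates=False)

# Keep and rename the same fields as the loop above
df_sample = df_sample.rename(columns={'id': 'tweet_id', 'created_at': 'create_date'})
df_sample = df_sample[['tweet_id', 'retweet_count', 'favorite_count', 'create_date']]
print(df_sample)
```

With the real file, the `io.StringIO(...)` argument would be replaced by the file path.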

## Data Assessing

### I will assess the data visually and programmatically for quality and tidiness issues.

### Visual Assessment

In [None]:
#Load twitter archive data to visually assess it
df_twitter_archive

In [None]:
#Load image predictions data to visually assess it
df_image_predictions

In [None]:
#Load Twitter extra data to visually assess it
df_twitter_extra

### Programmatic Assessment

##### Programmatically assess the twitter_archive data

In [None]:
#Check top rows of the data
df_twitter_archive.head()

In [None]:
#Check number of rows and columns()
df_twitter_archive.shape

In [None]:
#Check data characteristics
df_twitter_archive.info()

In [None]:
#Check data description
df_twitter_archive.describe()

In [None]:
#Check datatypes
df_twitter_archive.dtypes

In [None]:
#Check for null values
df_twitter_archive.isnull().sum()

In [None]:
#Check for duplicates
df_twitter_archive.duplicated().sum()

In [None]:
#Programmatically assess the image predictions data
df_image_predictions.head()

In [None]:
#Check number of rows and columns()
df_image_predictions.shape

In [None]:
#Check data characteristics
df_image_predictions.info()

In [None]:
#Check data description
df_image_predictions.describe()

In [None]:
#Check datatypes
df_image_predictions.dtypes

In [None]:
#Check for null values
df_image_predictions.isnull().sum()

In [None]:
#Check for duplicates
df_image_predictions.duplicated().sum()

In [None]:
#Check top rows of the data
df_image_predictions.head()

In [None]:
#Programmatically assess the twitter extra data
df_twitter_extra.head()

In [None]:
#Check number of rows and columns()
df_twitter_extra.shape

In [None]:
#Check data characteristics
df_twitter_extra.info()

In [None]:
#Check data description
df_twitter_extra.describe()

In [None]:
#Check datatypes
df_twitter_extra.dtypes

In [None]:
#Check for null values
df_twitter_extra.isnull().sum()

In [None]:
#Check for duplicates
df_twitter_extra.duplicated().sum()

In [None]:
#check top rows of the data
df_twitter_extra.head()

## Quality Issues

### Twitter archive
-	Retweets and replies should be removed
-	Drop columns not needed.
-	Change Timestamp column to datetime format and extract year, month and day.
-	Null objects are represented as 'None' instead of NaN.
-	Incorrect names in the name column: names weren't successfully extracted from the text (e.g. a, an, the, very).
-	Invalid rating data in both rating_numerator and rating_denominator.

### Image predictions
-	The datatype of the tweet_id field should be string.

### Twitter extra
-	The datatypes of create_date and tweet_id are wrong.



## Tidiness Issues

### Twitter archive
-	The dog stage is spread across four columns (doggo, floofer, pupper, puppo) and should be a single dog_stage column

### Image predictions 
-	Column names are not informative and should be changed to more descriptive ones.

### Twitter extra 
-	Merge twitter archive, image predictions and twitter extra as twitter master on tweet_id


## Data Cleaning

In [None]:
#Make copies of the dataframes
df_twitter_archive_clean = df_twitter_archive.copy() 
df_image_predictions_clean = df_image_predictions.copy()
df_twitter_extra_clean = df_twitter_extra.copy()

## Twitter archive


### Quality issues


#### Define
- remove all retweets and replies

#### Code

In [None]:
#Retweets and replies should be removed
df_twitter_archive_clean = df_twitter_archive_clean[pd.isnull(df_twitter_archive_clean['retweeted_status_user_id'])]
df_twitter_archive_clean = df_twitter_archive_clean[pd.isnull(df_twitter_archive_clean['in_reply_to_user_id'])]

#### Test

In [None]:
#Check that retweets and replies are removed
print(sum(df_twitter_archive_clean.retweeted_status_user_id.value_counts()))
print(sum(df_twitter_archive_clean.in_reply_to_user_id.value_counts()))

#### Define
- drop all columns not needed

#### Code

In [None]:
#Drop columns not needed (axis is passed by keyword; newer pandas versions reject a positional axis)
df_twitter_archive_clean = df_twitter_archive_clean.drop(['source',
                                                    'in_reply_to_status_id',
                                                    'in_reply_to_user_id',
                                                    'retweeted_status_id',
                                                    'retweeted_status_user_id', 
                                                    'retweeted_status_timestamp', 
                                                    'expanded_urls'], axis=1)

#### Test

In [None]:
#Confirm columns are dropped
df_twitter_archive_clean.info()

#### Define
- change wrong datatypes on 'tweet_id' and 'timestamp' columns

#### Code

In [None]:
#Correct wrong datatypes.
df_twitter_archive_clean.timestamp = pd.to_datetime(df_twitter_archive_clean.timestamp)
df_twitter_archive_clean.tweet_id = df_twitter_archive_clean.tweet_id.astype(object)

#### Test

In [None]:
#Confirm datatype change 
df_twitter_archive_clean.dtypes

#### Define
- change 'None' to NaN in the name and dog stage columns

#### Code

In [None]:
#Null objects are represented as 'None' instead of NaN, so replace them.
none_columns = ['name', 'doggo', 'floofer', 'pupper', 'puppo']
df_twitter_archive_clean[none_columns] = df_twitter_archive_clean[none_columns].replace('None', np.nan)

#### Test

In [None]:
# confirm changes
print(df_twitter_archive_clean['name'].value_counts())
print(df_twitter_archive_clean['doggo'].value_counts())
print(df_twitter_archive_clean['floofer'].value_counts())
print(df_twitter_archive_clean['pupper'].value_counts())
print(df_twitter_archive_clean['puppo'].value_counts())

#### Define
- Names not successfully extracted should be changed to null values

#### Code

In [None]:
#Incorrect names in the name column: names weren't successfully extracted from the text (e.g. a, an, the).
invalid_names = ['such', 'a', 'an', 'the', 'None']
df_twitter_archive_clean['name'] = df_twitter_archive_clean['name'].replace(invalid_names, np.nan)

#### Test

In [None]:
#Test
df_twitter_archive_clean.sample(5)
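
The list of invalid names above is written out by hand. A more general sketch, under the assumption that real dog names are capitalized and wrongly extracted words are entirely lowercase, is to flag every lowercase entry. The sample values below are made up for illustration and the assumption should be checked against the real column before relying on it.

```python
import pandas as pd

# Small made-up sample of the 'name' column (hypothetical values)
names = pd.Series(['Phineas', 'a', 'Tilly', 'very', 'the', None, 'Archie'])

# Flag entries that are strings written entirely in lowercase;
# missing values are not strings, so they are never flagged
is_lowercase = names.apply(lambda n: isinstance(n, str) and n.islower())

# mask() replaces the flagged entries with a missing value
cleaned = names.mask(is_lowercase)
print(cleaned.tolist())
```

Compared with a fixed list, this also catches lowercase words that were not anticipated, at the cost of dropping any genuine name that happens to be stored in lowercase.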

#### Define
- drop all rows with a rating numerator of 15 or greater

#### Code

In [None]:
#Inspect the invalid rating data in both rating_numerator and rating_denominator
df_twitter_archive_clean.loc[:,['rating_numerator', 'rating_denominator']].describe()

In [None]:
#Check unique values in numerator
df_twitter_archive_clean.rating_numerator.unique()

In [None]:
df_twitter_archive_clean.rating_numerator.value_counts()

In [None]:
odd_numerator = df_twitter_archive_clean.rating_numerator >= 15
odd_numerator.sum()

In [None]:
#Drop odd numerators
df_twitter_archive_clean = df_twitter_archive_clean[df_twitter_archive_clean.rating_numerator < 15]

#### Test

In [None]:
#Test
df_twitter_archive_clean.rating_numerator.value_counts()

#### Define
- keep only the rows whose rating denominator is 10

#### Code

In [None]:
#Check unique values in denominator
df_twitter_archive_clean.rating_denominator.unique()

In [None]:
df_twitter_archive_clean.rating_denominator.value_counts()

In [None]:
odd_denominator = df_twitter_archive_clean.rating_denominator != 10
odd_denominator.sum()

In [None]:
#Keep only the rows with a denominator of 10
df_twitter_archive_clean = df_twitter_archive_clean[df_twitter_archive_clean.rating_denominator == 10]

#### Test

In [None]:
df_twitter_archive_clean.rating_denominator.unique()

### Tidiness issues

#### Define
- create column 'dog_stage'
- drop unnecessary columns

In [None]:
#Inspect the four dog stage columns
df_twitter_archive_clean[['doggo', 'floofer', 'pupper', 'puppo']].describe()

In [None]:
#Replace missing dog stage values with empty strings so the columns can be concatenated
stage_columns = ['pupper', 'puppo', 'floofer', 'doggo']
df_twitter_archive_clean[stage_columns] = df_twitter_archive_clean[stage_columns].fillna("")

In [None]:
df_twitter_archive_clean['dog_stage'] = df_twitter_archive_clean.pupper+df_twitter_archive_clean.puppo+df_twitter_archive_clean.floofer+df_twitter_archive_clean.doggo

In [None]:
#Drop unnecessary columns
df_twitter_archive_clean = df_twitter_archive_clean.drop(columns=['doggo', 'puppo', 'pupper', 'floofer'])

#### Test

In [None]:
df_twitter_archive_clean['dog_stage'].value_counts()

In [None]:
df_twitter_archive_clean.dog_stage.unique()
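
Plain string concatenation of the four stage columns runs labels together when a tweet mentions more than one stage (for example 'pupperdoggo'). The sketch below shows one possible way to keep such combinations readable by joining the labels with a comma. The rows are made up for illustration, and the separator is an assumption rather than part of the original cleaning.

```python
import pandas as pd

# Made-up rows; an empty string means the stage was not mentioned (hypothetical data)
stages = pd.DataFrame({
    'doggo':   ['doggo', '', ''],
    'floofer': ['', '', ''],
    'pupper':  ['pupper', 'pupper', ''],
    'puppo':   ['', '', ''],
})

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Join the non-empty stage labels of each row with a comma
stages['dog_stage'] = stages[stage_cols].apply(
    lambda row: ','.join(label for label in row if label), axis=1
)
print(stages['dog_stage'].tolist())
```

Rows without any stage still end up as an empty string, which can be replaced with NaN afterwards if missing values should be explicit.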

## Image Prediction

### Quality issues

#### Define
- Change datatype of 'tweet_id' to strings

#### Code

In [None]:
#Tweet_id fields should be strings.
df_image_predictions_clean.tweet_id = df_image_predictions_clean.tweet_id.astype(object)

#### Test

In [None]:
#Check changes
df_image_predictions_clean.dtypes

### Tidiness issues

#### Define
- create 'is_a_dog' column for predicted as a dog 
- create column 'breeds' for predicted breed
- create column for confidence level as confidence_level
- create 'dog_breed' column
- create 'confidence' column
- Replace 'none' with NaN, then drop those rows since they are not dogs


#### Code

In [None]:
#Conditions: whether each of the three predictions is a dog
is_a_dog = [(df_image_predictions_clean['p1_dog'] == True),
              (df_image_predictions_clean['p2_dog'] == True),
              (df_image_predictions_clean['p3_dog'] == True)]

#Predicted breed for each of the three predictions
breeds = [df_image_predictions_clean['p1'], 
          df_image_predictions_clean['p2'],
          df_image_predictions_clean['p3']]

#Confidence level for each of the three predictions
confidence_level = [df_image_predictions_clean['p1_conf'], 
                    df_image_predictions_clean['p2_conf'], 
                    df_image_predictions_clean['p3_conf']]

#create 'dog_breed' column from the first prediction that is a dog
df_image_predictions_clean['dog_breed'] = np.select(is_a_dog, breeds, 
                                       default = 'none')

#create 'confidence' column with the confidence of that prediction
df_image_predictions_clean['confidence'] = np.select(is_a_dog, confidence_level, 
                                            default = 0)

#Drop the original prediction columns (p1 through p3_dog)
df_image_predictions_clean.drop(columns = df_image_predictions_clean.columns[3:12], inplace = True)

In [None]:
#Replace 'none' with NaN so the rows that are not dogs can be dropped
df_image_predictions_clean['dog_breed'] = df_image_predictions_clean['dog_breed'].replace('none', np.nan)


In [None]:
#Drop the rows (not just the column values) where no dog breed was predicted
df_image_predictions_clean = df_image_predictions_clean.dropna(subset=['dog_breed'])

#### Test

In [None]:
#Confirm changes
df_image_predictions_clean.info()
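
The dog_breed column above relies on `np.select`, which checks the conditions in order for each row and takes the choice that belongs to the first condition that is True. The sketch below illustrates that behaviour on a few made-up predictions (the values are hypothetical and only two predictions per image are used to keep it short).

```python
import numpy as np
import pandas as pd

# Made-up predictions for three images (hypothetical values)
preds = pd.DataFrame({
    'p1': ['redbone', 'paper_towel', 'seat_belt'],
    'p1_dog': [True, False, False],
    'p2': ['collie', 'pug', 'web_site'],
    'p2_dog': [True, True, False],
})

conditions = [preds['p1_dog'], preds['p2_dog']]
choices = [preds['p1'], preds['p2']]

# For each row, take the choice belonging to the first condition that is True;
# rows where no condition holds get the default
preds['dog_breed'] = np.select(conditions, choices, default='none')
print(preds['dog_breed'].tolist())
```

The first image keeps its top prediction, the second falls back to the second prediction because the first is not a dog, and the third gets the default because neither prediction is a dog.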

## Twitter extra

### Quality issues


In [None]:
df_twitter_extra_clean.info()

#### Define
- Change datatype of 'create_date' to datetime and 'tweet_id' to strings.

#### Code

In [None]:
#Create_date is object instead of datetime and tweetid should be string.
df_twitter_extra_clean.tweet_id=df_twitter_extra_clean.tweet_id.astype('object')

In [None]:
df_twitter_extra_clean.create_date =pd.to_datetime(df_twitter_extra_clean.create_date)

#### Test

In [None]:
#Check data type change
df_twitter_extra_clean.dtypes

### Tidiness issues

#### Define
- Merge the 3 cleaned dataframes on tweet_id into one dataframe called 'twitter_archive_master'
- Create a rating column
- drop 'create_date' ,'rating_denominator' and 'rating_numerator'


#### Code

In [None]:
#Merge twitter archive, image predictions and twitter extra as twitter master on tweet_id
twitter_archive_master =pd.merge(pd.merge(df_twitter_archive_clean, 
                                         df_image_predictions_clean, how= 'inner', on ='tweet_id'), 
                                df_twitter_extra_clean, how='inner', on= 'tweet_id')

In [None]:
#Create rating column
twitter_archive_master['rating'] = twitter_archive_master['rating_numerator'].astype(str) + '/'+ twitter_archive_master['rating_denominator'].astype(str)

In [None]:
#Drop columns (axis is passed by keyword; newer pandas versions reject a positional axis)
twitter_archive_master = twitter_archive_master.drop(['create_date','rating_numerator','rating_denominator'], axis=1)

#### Test

In [None]:
twitter_archive_master.info()
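
To make the effect of the inner merges above easier to see, here is a minimal sketch on three tiny made-up tables (all values are hypothetical). It also passes `validate='one_to_one'`, an optional pandas check that is not used in the original code, to guard against duplicated tweet_id values.

```python
import pandas as pd

# Three tiny made-up tables sharing a tweet_id key (hypothetical values)
archive = pd.DataFrame({'tweet_id': [1, 2, 3], 'name': ['Tilly', 'Archie', 'Darla']})
predictions = pd.DataFrame({'tweet_id': [1, 2], 'dog_breed': ['pug', 'collie']})
extra = pd.DataFrame({'tweet_id': [2, 3], 'retweet_count': [10, 20]})

# Inner joins keep only tweet_ids present in every table;
# validate='one_to_one' raises an error if a key is duplicated on either side
master = (
    archive
    .merge(predictions, how='inner', on='tweet_id', validate='one_to_one')
    .merge(extra, how='inner', on='tweet_id', validate='one_to_one')
)
print(master)
```

Only tweet_id 2 appears in all three tables, so it is the only row left. This is why the master dataframe can be much smaller than the original archive.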

## Data Storing

Store the cleaned Twitter master as a csv file

In [None]:
twitter_archive_master.to_csv("twitter_archive_master.csv", index= False)

## Exploratory Data Analysis

In [None]:
#Explore the dataset
twitter_archive_master.hist(figsize=(20,16));

### Correlation analysis

In [None]:
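#Get overview of dataset (numeric_only=True restricts the correlation to numeric columns; requires pandas 1.5 or later)
f,ax = plt.subplots(figsize=(18, 18))
sns.heatmap(twitter_archive_master.corr(numeric_only=True), cmap = 'bone', annot=True, linewidths=.5, fmt= '.2f',ax=ax);
plt.title('Dataset Overview');
plt.show()
twitter_archive_master.corr(numeric_only=True)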

#### Create a function to invoke whenever I want to draw a scatter plot

In [None]:
def scatterplot(x_data, y_data, x_label="", y_label="", title="", color = "r"):
    _, ax = plt.subplots()
    ax.scatter(x_data, y_data, s = 20, color = color)
    ax.set_title(title)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)

#### Relationship between Favorite count and Retweet count

In [None]:
scatterplot(twitter_archive_master['favorite_count'],twitter_archive_master['retweet_count'],'Favorite count','Retweet count','Favorite count vs Retweet count', 'green')
twitter_archive_master[['favorite_count', 'retweet_count']].corr()

In [None]:
plt.figure(figsize=(14, 8))
plt.style.use('fivethirtyeight')

plt.hist(twitter_archive_master.favorite_count, alpha=.4, label='Favorites')
plt.hist(twitter_archive_master.retweet_count, alpha=.4, label='Retweets')

plt.title('Distribution of Favorites and Retweets Counts', color='darkblue', fontsize=15)
plt.xlabel('Number of Favorites - Retweets', fontsize=12)
plt.ylabel('Counts', fontsize=12)

plt.xlim(-1, 80000)

plt.legend(prop={'size': 15})



plt.show()

In [None]:
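#Correlation between the numeric columns only (numeric_only requires pandas 1.5 or later)
data_corr = twitter_archive_master.corr(numeric_only=True)

print("The correlation between favorite count and retweet count is", data_corr.loc['favorite_count','retweet_count'])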
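
By default, `corr()` computes Pearson's correlation coefficient. As a rough check of what that number means, the sketch below recomputes it by hand on made-up counts (the values are hypothetical, not from the dataset) and compares it with the pandas result.

```python
import numpy as np
import pandas as pd

# Made-up favorite and retweet counts (hypothetical values)
favorites = pd.Series([10, 20, 30, 40, 50])
retweets = pd.Series([3, 5, 9, 11, 16])

# Pearson's r as computed by pandas
r_pandas = favorites.corr(retweets)

# The same coefficient from its definition:
# sum of products of deviations, divided by the root of the product of squared deviations
x = favorites - favorites.mean()
y = retweets - retweets.mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

print(r_pandas, r_manual)
```

A value close to 1 indicates that tweets with more favorites also tend to have more retweets; it does not by itself show that one causes the other.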

#### What is the most common dog breed?

In [None]:
twitter_archive_master.dog_breed.value_counts().head(5)

Golden_retriever is the most common dog breed

#### What is the most common dog stage?

In [None]:
twitter_archive_master.dog_stage.value_counts()

#### The most common dog stage is Pupper

#### Which tweets have the highest retweet_count?

In [None]:
#Each tweet_id is unique, so take the five rows with the largest retweet_count directly
twitter_archive_master.nlargest(5, 'retweet_count')[['tweet_id', 'retweet_count']]

In [None]:
twitter_archive_master[twitter_archive_master['tweet_id'] == 744234799360020481]

#### What are the most common dog names?

In [None]:
twitter_archive_master.name.value_counts().nlargest(5)

#### The most common dog name is Charlie

#### Top Dog Breeds by Favorite count

In [None]:
#plotting a horizontal bar chart to show top dog breeds by favorite counts
top_breed = twitter_archive_master.groupby('dog_breed')['favorite_count'].sum().sort_values(ascending=True).nlargest(5)
plt.figure(figsize=(12,8))
plt.title("Top dog breeds by favorite counts", size=20)
top_breed.plot(kind='barh',fontsize=12,color='b')
plt.xlabel('favorite counts', fontsize=12)
plt.ylabel('Dog Breed', fontsize=12);
sns.set_style("whitegrid");

#### Top Dog Breeds by Retweet count

In [None]:
#plotting a horizontal bar chart to show top dog breeds by retweet counts
top_breed = twitter_archive_master.groupby('dog_breed')['retweet_count'].sum().sort_values(ascending=True).nlargest(5)
plt.figure(figsize=(12,8))
plt.title("Top dog breeds by retweet counts", size=20)
top_breed.plot(kind='barh',fontsize=12,color='pink')
plt.xlabel('retweet counts', fontsize=12)
plt.ylabel('Dog Breed', fontsize=12);
sns.set_style("whitegrid");

## Conclusions

1. Favorite count and retweet count are positively correlated

2. The most common dog breed is Golden Retriever

3. The most common dog stage is Pupper

4. The tweet with the highest retweet_count is tweet id 744234799360020481, with over 70000 retweets.

5. The most common dog name is Charlie

6. Top Dog Breeds by Favorite count are golden_retriever, Labrador_retriever, Pembroke, Chihuahua and French bulldog

7. Top Dog Breeds by Retweet count are golden_retriever, Labrador_retriever, Pembroke, Chihuahua and Samoyed

## Limitations

* There are many missing values in the dog_stage column, and this limited the kind of analysis that could be done with it.