## PROJECT: WRANGLE AND ANALYZE DATA

#### INTRODUCTION



This project is carried out in fulfillment of one of the requirements of the Udacity Data Analysis Nano-Degree. Data is gathered from different sources related to @dog_rates (WeRateDogs), a Twitter account that posts pictures of dogs and rates them in a humorous manner. Data wrangling techniques learnt during the nano-degree are put into practice to wrangle, analyze, and visualize @dog_rates Twitter activity.

In [None]:
import pandas as pd
import numpy as np
import tweepy as tw
import requests
import json
import os
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

DATA GATHERING
There are three data sources for this project. I will read the twitter archives csv file into a dataframe. The image prediction file will be programmatically downloaded from the provided url. The third data source is a json file provided by Udacity containing additional info about WeRateDogs' tweets not available in the twitter archives file.

In [None]:
#reading in the twitter archive file as a pandas dataframe

archive = pd.read_csv('twitter-archive-enhanced.csv', sep = ',')

In [None]:
# downloading image prediction file programmatically

url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)

In [None]:
# saving the downloaded content to disk, then reading it into a dataframe
file_name = url.split('/')[-1]  # 'image-predictions.tsv'
with open(file_name, mode='wb') as file:
    file.write(response.content)

img_predictions = pd.read_csv(file_name, sep='\t')

In [None]:
# creating a dataframe from the tweet_json.txt file

df = []
with open('tweet_json.txt') as f:
    for line in f:
        tweet = (json.loads(line))
        tweet_id = tweet['id']
        retweet_count = tweet['retweet_count']
        favorite_count = tweet['favorite_count']
        create_date = tweet['created_at']
        df.append({'retweet_count' : retweet_count,
                  'favorite_count' : favorite_count,
                  'create_date' : create_date,
                  'tweet_id' : tweet_id})
        
tweets = pd.DataFrame(df, columns = ['tweet_id', 'retweet_count', 'favorite_count', 'create_date'])

##### ASSESSING DATA

In this segment, I am going to visually and programmatically assess the data for its quality and tidiness. I will attempt to spot quality issues and inconsistencies in the datasets that will be cleaned in the next part of this project.

#### ASSESSMENT

In [None]:
#displaying the first rows of the archives dataframe
archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [None]:
# displaying the first 5 rows of the image predictions dataframe
img_predictions.head()


Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [None]:
# displaying the first 5 rows of the tweets dataframe
tweets.head()

Unnamed: 0,tweet_id,retweet_count,favorite_count,create_date
0,892420643555336193,8853,39467,Tue Aug 01 16:23:56 +0000 2017
1,892177421306343426,6514,33819,Tue Aug 01 00:17:27 +0000 2017
2,891815181378084864,4328,25461,Mon Jul 31 00:18:03 +0000 2017
3,891689557279858688,8964,42908,Sun Jul 30 15:58:51 +0000 2017
4,891327558926688256,9774,41048,Sat Jul 29 16:00:24 +0000 2017


In [None]:
# displaying info about the archives dataframe

archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [None]:
# Showing info about the img_predictions dataframe

img_predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [None]:
# displaying info about the tweets dataframe
tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   tweet_id        2354 non-null   int64 
 1   retweet_count   2354 non-null   int64 
 2   favorite_count  2354 non-null   int64 
 3   create_date     2354 non-null   object
dtypes: int64(3), object(1)
memory usage: 73.7+ KB


In [None]:
archive.shape

(2356, 17)

In [None]:
tweets.shape

(2354, 4)

In [None]:
img_predictions.shape

(2075, 12)

In [None]:
# checking for duplicates in the archive dataframe

archive[archive.tweet_id.duplicated()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


No duplicates in the archives dataset

In [None]:
# checking for duplicates in the img_predictions dataframe

img_predictions[img_predictions.tweet_id.duplicated()]

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog


There are no duplicates in the img_predictions dataset either

In [None]:
# checking for duplicates in the tweets dataset

tweets[tweets.tweet_id.duplicated()]

Unnamed: 0,tweet_id,retweet_count,favorite_count,create_date


There are no duplicated tweets in any of the 3 dataframes, as all tweet_ids are unique
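The three duplicate checks above can also be condensed into a single pass. The sketch below uses `Series.is_unique` on small made-up frames (the values are hypothetical, and one duplicate is planted on purpose to show a failing case); it is an illustration, not the check that was run above.

```python
import pandas as pd

# Toy stand-ins for the three dataframes (hypothetical values).
frames = {
    "archive": pd.DataFrame({"tweet_id": [1, 2, 3]}),
    "img_predictions": pd.DataFrame({"tweet_id": [1, 2]}),
    "tweets": pd.DataFrame({"tweet_id": [1, 2, 2]}),  # deliberate duplicate
}

# Series.is_unique is True only when no value in the column repeats.
uniqueness = {name: frame["tweet_id"].is_unique for name, frame in frames.items()}
print(uniqueness)
```

With the real dataframes, every entry would be expected to be `True`, matching the empty results above.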

In [None]:
# checking the value counts for the different tweet sources
archive.source.value_counts()

<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     2221
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                          91
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                       33
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      11
Name: source, dtype: int64

In [None]:
# checking the ratings numerator for each of the dogs

archive.rating_numerator.value_counts().sort_index()

0         2
1         9
2         9
3        19
4        17
5        37
6        32
7        55
8       102
9       158
10      461
11      464
12      558
13      351
14       54
15        2
17        1
20        1
24        1
26        1
27        1
44        1
45        1
50        1
60        1
75        2
80        1
84        1
88        1
99        1
121       1
143       1
144       1
165       1
182       1
204       1
420       2
666       1
960       1
1776      1
Name: rating_numerator, dtype: int64

Rating numerators below 10 and above 15 are a possible quality issue

In [None]:
archive.rating_denominator.value_counts().sort_index()

0         1
2         1
7         1
10     2333
11        3
15        1
16        1
20        2
40        1
50        3
70        1
80        2
90        1
110       1
120       1
130       1
150       1
170       1
Name: rating_denominator, dtype: int64

Denominators other than 10 are a possible quality issue
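One way to investigate the unusual numerators and denominators is to re-extract every rating-like fraction from the tweet text and compare it with the stored values. The sketch below is only an illustration: the sample texts and stored ratings are hypothetical, and the regular expression is an assumption about how ratings appear in the text.

```python
import pandas as pd

# Hypothetical tweets: a clean rating, a date mistaken for a rating,
# and a decimal rating that an integer-only parser would truncate.
sample = pd.DataFrame({
    "text": [
        "Good boy right here. 13/10",
        "Happy 4/20 from the squad! 13/10 for all",
        "Quite the floof. 13.5/10 would pet",
    ],
    "rating_numerator": [13, 4, 5],
    "rating_denominator": [10, 20, 10],
})

# Optional decimal part in the numerator, integer denominator.
pattern = r"(\d+(?:\.\d+)?)/(\d+)"

# extractall returns one row per match, indexed by (original row, match number).
matches = sample["text"].str.extractall(pattern)
matches.columns = ["numerator", "denominator"]
matches = matches.astype(float)

# Tweets with more than one candidate fraction need a manual look.
n_matches = matches.groupby(level=0).size()
ambiguous = n_matches[n_matches > 1].index.tolist()
print(matches)
print("rows to review:", ambiguous)
```

Rows flagged as ambiguous, or rows whose extracted values disagree with the stored columns, are candidates for manual correction rather than outright removal.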

In [None]:
# checking which tweets are replies to another tweet

archive[archive['in_reply_to_status_id'].notnull()].head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
30,886267009285017600,8.862664e+17,2281182000.0,2017-07-15 16:51:35 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@NonWhiteHat @MayhewMayhem omg hello tanner yo...,,,,,12,10,,,,,
55,881633300179243008,8.81607e+17,47384430.0,2017-07-02 21:58:53 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@roushfenway These are good dogs but 17/10 is ...,,,,,17,10,,,,,
64,879674319642796034,8.795538e+17,3105441000.0,2017-06-27 12:14:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@RealKentMurphy 14/10 confirmed,,,,,14,10,,,,,
113,870726314365509632,8.707262e+17,16487760.0,2017-06-02 19:38:25 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@ComplicitOwl @ShopWeRateDogs &gt;10/10 is res...,,,,,10,10,,,,,
148,863427515083354112,8.634256e+17,77596200.0,2017-05-13 16:15:35 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@Jack_Septic_Eye I'd need a few more pics to p...,,,,,12,10,,,,,


In [None]:
## checking for tweets that are retweets

archive[archive['retweeted_status_id'].notnull()].head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
19,888202515573088257,,,2017-07-21 01:02:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Canela. She attempted s...,8.87474e+17,4196984000.0,2017-07-19 00:47:34 +0000,https://twitter.com/dog_rates/status/887473957...,13,10,Canela,,,,
32,886054160059072513,,,2017-07-15 02:45:48 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @Athletics: 12/10 #BATP https://t.co/WxwJmv...,8.860537e+17,19607400.0,2017-07-15 02:44:07 +0000,https://twitter.com/dog_rates/status/886053434...,12,10,,,,,
36,885311592912609280,,,2017-07-13 01:35:06 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Lilly. She just paralle...,8.305833e+17,4196984000.0,2017-02-12 01:04:29 +0000,https://twitter.com/dog_rates/status/830583320...,13,10,Lilly,,,,
68,879130579576475649,,,2017-06-26 00:13:58 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Emmy. She was adopted t...,8.780576e+17,4196984000.0,2017-06-23 01:10:23 +0000,https://twitter.com/dog_rates/status/878057613...,14,10,Emmy,,,,
73,878404777348136964,,,2017-06-24 00:09:53 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: Meet Shadow. In an attempt to r...,8.782815e+17,4196984000.0,2017-06-23 16:00:04 +0000,"https://www.gofundme.com/3yd6y1c,https://twitt...",13,10,Shadow,,,,


Some tweets in the archive dataframe are retweets and replies to other tweets.

In [None]:
# Checking number of different dog stages that appeared in tweets

doggo = archive.doggo.value_counts()
floofer = archive.floofer.value_counts()
pupper = archive.pupper.value_counts()
puppo = archive.puppo.value_counts()
print(doggo)
print(floofer)
print(pupper)
print(puppo)

None     2259
doggo      97
Name: doggo, dtype: int64
None       2346
floofer      10
Name: floofer, dtype: int64
None      2099
pupper     257
Name: pupper, dtype: int64
None     2326
puppo      30
Name: puppo, dtype: int64


Issues

Quality
1. There are HTML tags in the source column that need to be removed --- fixed
2. Wrong datatypes in some columns (create_date and timestamp) --- fixed
3. Some tweets in the archive are replies and retweets, not original tweets by WeRateDogs --- fixed
4. The tweet_id column should be dtype object instead of int64 --- fixed
5. create_date column in tweets dataframe should be datetime format --- fixed
6. Null values for dog stages --- fixed
7. Dog stages separated into different columns --- fixed
8. Change column name floofer to floof --- fixed
9. Tweet link at the end of each tweet to be removed --- fixed

Tidiness
1. The three dataframes are better as a merged dataframe
2. Create date exists in two different dataframes
3. The dog stages are spread across multiple columns, making the dataframe bulky

#### CLEANING

In [1]:
# Creating copies of the dataframes

archive_clean = archive.copy()
img_predictions_clean = img_predictions.copy()
tweets_clean = tweets.copy()


Issue 1: Retweets and replies included in tweet archives
    
Define:
Some tweets in the tweet archives are retweets and replies to other tweets. These are not needed for the purpose of this analysis and should be dropped.

Code:

In [None]:
# Removing tweets that are replies to other tweets

archive_clean = archive_clean[archive_clean['in_reply_to_status_id'].isnull()]

In [None]:
# Excluding retweets from the dataframe
archive_clean = archive_clean[archive_clean['retweeted_status_id'].isnull()]

Test:

In [None]:
archive_clean.info()

In [None]:
# removing redundant columns


archive_clean = archive_clean.drop(['in_reply_to_status_id',
                              'in_reply_to_user_id',
                              'retweeted_status_id',
                              'retweeted_status_user_id',
                              'retweeted_status_timestamp'],
                            axis = 1)

Retweets and replies have been removed from the dataframe
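The two filters above can also be written as one boolean mask. The sketch below assumes the same two ID columns and uses a made-up three-row frame (values hypothetical); the `.copy()` is a design choice that makes the result an independent dataframe, which avoids chained-assignment warnings when columns are modified later.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: an original tweet, a reply, and a retweet.
sample = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "in_reply_to_status_id": [np.nan, 8.86e17, np.nan],
    "retweeted_status_id": [np.nan, np.nan, 8.87e17],
})

# An original tweet has neither a reply ID nor a retweet ID.
is_original = (
    sample["in_reply_to_status_id"].isna()
    & sample["retweeted_status_id"].isna()
)
originals = sample.loc[is_original].copy()
print(originals["tweet_id"].tolist())
```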

Issue: Nulls
    
Define: Some columns in the dataset still contain null values. Any column with at least one null value is to be dropped.
    
Code:

In [None]:
# Trimming the dataset by dropping every column that contains a null value

archive_clean.dropna(axis='columns',how='any', inplace=True)

Test:

In [None]:
archive_clean.info()

Columns containing nulls have been dropped

Issue: Incorrect datatype
    
Define: The timestamp column is stored as object (string) rather than the appropriate datetime datatype, and this is to be corrected.
    
Code:

In [None]:
# converting timestamp to datetime datatype

archive_clean.timestamp = pd.to_datetime(archive_clean.timestamp)

Test:

In [None]:
archive_clean.info()

The timestamp column has been converted to the datetime datatype
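The archive and the tweet JSON store the same moment in two different string formats, as the `head()` outputs earlier show. The sketch below parses one value of each style with an explicit format string and confirms they agree. The format strings are my assumption based on those printed samples, and `%a`/`%b` depend on an English locale.

```python
import pandas as pd

# One value in each style, copied from the head() outputs above.
archive_style = pd.Series(["2017-08-01 16:23:56 +0000"])
api_style = pd.Series(["Tue Aug 01 16:23:56 +0000 2017"])

# Explicit formats avoid per-row guessing and fail loudly on unexpected input.
archive_ts = pd.to_datetime(archive_style, format="%Y-%m-%d %H:%M:%S %z")
api_ts = pd.to_datetime(api_style, format="%a %b %d %H:%M:%S %z %Y")

same_moment = bool((archive_ts == api_ts).all())
print(same_moment)
```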

Issue: HTML strings in source column.
    
Define: There are some html strings in the source column that clog the column containing channels through which tweets are posted.

Code:

In [None]:
# Removing html strings in source column

archive_clean['source'] = archive_clean['source'].str.extract(r'>([^<]*)<', expand=False)

Test:

In [None]:
archive_clean.source.value_counts()

Html strings have been cleaned from the source column
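As a quick sanity check of the extraction idea, the sketch below applies a capture-group pattern to two of the source strings listed in the assessment. The pattern `>([^<]+)<` is one possible choice (an assumption on my part): it captures the text between the opening tag's `>` and the closing tag's `<`.

```python
import pandas as pd

# Two source values as they appear in the archive.
source = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
])

# expand=False returns a Series because the pattern has a single capture group.
labels = source.str.extract(r">([^<]+)<", expand=False)
print(labels.tolist())
```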

Issue: Rating denominators not equal to 10
    
Define: There are rows with rating denominators not equal to 10. This is a quality issue and such rows are to be dropped.

Code:

In [None]:
# dropping rating denominators not equal to 10

archive_clean = archive_clean[archive_clean['rating_denominator'] == 10]

Test:

In [None]:
#confirming unwanted rating denominators have been dropped

archive_clean[archive_clean['rating_denominator'] != 10]

Rating denominators not equal to 10 have been dropped

Issue: Rating numerators outside the range of 10-15
    
Define: Rating numerators not between 10 and 15 are a possible quality issue going by WeRateDogs rating patterns. These are to be dropped.


Code:

In [None]:
# keeping rating numerators between 10 and 15

archive_clean = archive_clean[(archive_clean['rating_numerator'] >= 10) & (archive_clean['rating_numerator']<= 15)]

Test:

In [None]:
# confirming rating numerators outside the range of 10-15 have been dropped

archive_clean.rating_numerator.value_counts()

Rating numerators outside the range 10-15 have been dropped

Issue: Tweetids are in int64 datatype

Define: Tweet IDs should be stored as str, not int64, since they are identifiers rather than quantities
    
Code:

In [None]:
# Converting tweet_id from int to str

img_predictions_clean['tweet_id'] = img_predictions_clean['tweet_id'].astype('str')

archive_clean['tweet_id'] = archive_clean['tweet_id'].astype('str')
tweets_clean['tweet_id'] = tweets_clean['tweet_id'].astype('str')

Test:

In [None]:
tweets_clean.info()

Datatype change has been effected on tweet_id
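A practical reason for storing IDs as text is precision. The reply and retweet ID columns in the archive are displayed as floats (for example 8.862664e+17), and a 64-bit float cannot represent every 18-digit integer exactly. The sketch below demonstrates this with the first tweet_id from the archive; the variable names are illustrative only.

```python
# First tweet_id shown in the archive's head() output.
tweet_id = 892420643555336193

# A float64 has a 53-bit mantissa, so an 18-digit integer is rounded on conversion.
as_float = float(tweet_id)
round_trip = int(as_float)
lost = round_trip != tweet_id

# A string keeps every digit.
as_text = str(tweet_id)
print(lost, as_text)
```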

Issue: Dog stage floof represented as floofer
    
Define: I will rename the dog stage column 'floofer' to the preferred 'floof'

Code:

In [None]:
# renaming dog_stage floofer to floof

archive_clean.rename(columns={"floofer": "floof"}, inplace = True)

Test:

In [None]:
archive_clean.head(1)

Dog stage rename effected

Issue: Links at the end of tweets

Define: There are links at the end of tweet texts that need to be removed.

Code:

In [None]:
# removing links at the end of tweets

archive_clean['text'] = archive_clean.text.str.replace(r"http\S+", "", regex=True)

Test:

In [None]:
archive_clean['text'].head().tolist()

Links at the end of tweets have been removed.
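The link removal can be tried on a small example first. In the sketch below the sample texts and the shortened URL are hypothetical. `regex=True` is passed explicitly because `Series.str.replace` treats the pattern as a literal string by default from pandas 2.0 onward, and `.str.strip()` removes the trailing space left behind by the deleted link.

```python
import pandas as pd

# Hypothetical tweet texts, one with a trailing link and one without.
text = pd.Series([
    "Very good boy. 13/10 https://t.co/abc123",
    "No link here. 12/10",
])

# Remove anything starting with 'http' up to the next whitespace, then trim.
cleaned = text.str.replace(r"http\S+", "", regex=True).str.strip()
print(cleaned.tolist())
```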

Issue: Dog stages spread across multiple columns
    
Define: The dog stages are spread across multiple columns. These columns need to be melted as one for ease of analysis. To merge all dog stages in one column, I will first create two dataframes for tweets and dogs from the archive dataframe.

Code:

In [None]:
# separating dog and tweets info in archive

archive_dogs = archive_clean[['tweet_id', 'name', 'doggo', 'floof', 'pupper', 'puppo', 'rating_numerator', 'rating_denominator']].copy()
archive_tweets = archive_clean.drop(['name', 'doggo', 'floof', 'pupper', 'puppo', 'rating_numerator', 'rating_denominator'], axis=1)

In [None]:
archive_dogs.head()

In [None]:
archive_tweets.head()

In [None]:
# I will create a new column called unknown_dog_stage to identify dogs without a dog stage with 'Yes'
# and dogs with a dog stage with 'No'
def u(row):
    if row['doggo'] == 'None' and row['floof'] == 'None' and row['pupper'] == 'None' and row['puppo'] == 'None':
        val = 'Yes'
    else:
        val = 'No'
    return val

archive_dogs['unknown_dog_stage'] = archive_dogs.apply(u, axis=1)

In [None]:
archive_dogs.sample()

The 'unknown_dog_stage' column has been created

In [None]:
# melting the different dog stages into a new column called 'dog_stage'
archive_dogs = pd.melt(archive_dogs, id_vars =['tweet_id', 'name', 'rating_numerator','rating_denominator'],
                     value_vars = ['doggo', 'floof', 'pupper', 'puppo', 'unknown_dog_stage'],
                     var_name = 'dog_stage', 
                    value_name = 'value')

In [None]:
archive_dogs.head(10)

In [None]:
# Removing the extra rows created by the melt (value 'None') and rows where unknown dog stage is 'No',
# as those dogs have already been accounted for in the known dog stages

archive_dogs = archive_dogs[archive_dogs['value']!= 'None']
archive_dogs = archive_dogs[archive_dogs['value']!= 'No']

In [None]:
archive_dogs.info()


In [None]:
# Dropping the value column

archive_dogs = archive_dogs.drop('value', axis=1)

Test:

In [None]:
archive_dogs.sample()

Dog stages have been melted into one column
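Note that the melt keeps one row per tweet and stage, so a tweet tagged with two stages appears twice afterwards. A possible alternative, sketched below on made-up data, joins the stage columns row by row and keeps exactly one row per tweet. This is a different technique from the melt used above, and the helper name `combine_stages` is hypothetical.

```python
import pandas as pd

# Hypothetical rows: one stage, no stage, and two stages.
sample = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "doggo": ["doggo", "None", "doggo"],
    "floof": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})
stage_cols = ["doggo", "floof", "pupper", "puppo"]

def combine_stages(row):
    """Join every stage present in the row; fall back to a placeholder label."""
    stages = [value for value in row if value != "None"]
    return ", ".join(stages) if stages else "unknown_dog_stage"

sample["dog_stage"] = sample[stage_cols].apply(combine_stages, axis=1)
print(sample[["tweet_id", "dog_stage"]])
```

The trade-off is that multi-stage tweets get a combined label such as 'doggo, pupper', which then has to be treated as its own category in the analysis.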

Merging Dataframes

In [None]:
# merging the tweets and dog datasets

archive_clean = archive_tweets.merge(archive_dogs, how = 'left', on = 'tweet_id')

In [None]:
archive_clean.head(2)

In [None]:
# Merging the tweets and archive dataframe

archive_clean = archive_clean.merge(tweets_clean, how = 'left', on = 'tweet_id')


In [None]:
# Merging the archives and img_predictions

twitter_archive_master = archive_clean.merge(img_predictions_clean, how = 'left', on = 'tweet_id')

In [None]:
twitter_archive_master.info()
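Left merges like the ones above can silently add rows when a key repeats, or leave columns empty when a key has no match. The sketch below uses two options of `DataFrame.merge` to surface both cases on made-up frames (all values hypothetical): `validate` raises an error if the key relationship is not the expected one, and `indicator=True` adds a `_merge` column that records where each row came from.

```python
import pandas as pd

# Hypothetical slices of the archive and image prediction data.
archive_part = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "rating_numerator": [13, 12, 11],
})
image_part = pd.DataFrame({
    "tweet_id": ["1", "3"],
    "p1": ["redbone", "collie"],
})

merged = archive_part.merge(
    image_part,
    how="left",
    on="tweet_id",
    validate="one_to_one",  # raises MergeError if tweet_id repeats on either side
    indicator=True,         # adds a '_merge' column: 'both' or 'left_only'
)

# Tweets that have no image prediction.
unmatched = merged.loc[merged["_merge"] == "left_only", "tweet_id"].tolist()
print(unmatched)
```

If the left frame legitimately holds several rows per tweet, `validate="many_to_one"` would be the relationship to assert instead.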

In [None]:
# having create date and time stamp is redundant

twitter_archive_master = twitter_archive_master.drop(['create_date'], axis = 1)

In [None]:
twitter_archive_master.info()

In [None]:
# saving the master dataframe to the working directory (index=False avoids writing an extra index column)

twitter_archive_master.to_csv('twitter_archive_master.csv', index=False, encoding='utf-8')

ANALYSIS

In [None]:
twitter_archive_master.info()

Generating a WordCloud for texts in the tweets

In [None]:
tweet_words = np.array(twitter_archive_master.text)
my_list = []
for tweet_text in tweet_words:  # loop variable renamed so it does not shadow the tweepy alias `tw`
    my_list.append(tweet_text.replace("\n",""))

In [None]:
def gen_wc(my_list):
    word_cloud = WordCloud(width = 500, height = 300, background_color='black').generate(' '.join(my_list))
    plt.figure(figsize=(10,8),facecolor = 'white', edgecolor='red')
    plt.imshow(word_cloud)
    plt.axis('off')
    plt.tight_layout(pad=0)
    plt.show()
    

In [None]:
gen_wc(my_list)
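A word cloud shows relative frequency only visually. To read off actual counts, the same idea can be sketched with the standard library. The sample texts, the stop word set, and the tokenizing pattern below are all hypothetical choices for illustration.

```python
import re
from collections import Counter

# Hypothetical tweet texts.
tweets_text = [
    "This is Phineas. He is a good boy.",
    "This is Tilly. She is a good girl.",
]
# A tiny, made-up stop word list; a real analysis would use a fuller one.
stopwords = {"this", "is", "a", "he", "she"}

# Lowercase everything and keep runs of letters and apostrophes as words.
words = re.findall(r"[a-z']+", " ".join(tweets_text).lower())
counts = Counter(word for word in words if word not in stopwords)
top = counts.most_common(1)
print(top)
```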

Checking the channel through which tweets are made by WeRateDogs

In [None]:
twitter_archive_master['source'].value_counts().plot(kind = 'bar', figsize = (8,4), title ='Tweet Sources')
plt.xlabel('Source')
plt.ylabel('Count');

Most tweets were made via Twitter for iPhone

Checking the most frequently used rating numerator

In [None]:
twitter_archive_master['rating_numerator'].value_counts().plot(kind ='bar')
plt.xlabel('Rating Numerator')
plt.ylabel('Frequency')
plt.title('Rating Numerator Count');


12 is the most frequently used rating numerator while 14 is the least frequently used rating numerator

Next, I am going to analyze the relationship between different dog stages and retweets/favorite count

In [None]:
dog_stages = twitter_archive_master.loc[twitter_archive_master['dog_stage'] != 'unknown_dog_stage']
dog_stages.plot.scatter(x = 'dog_stage', y = 'retweet_count').set(title = 'Dog Stage v Retweet Count');

Retweets per tweet for the different dog stages are between 0 and 10,000.
A doggo tweet received the highest number of retweets.

In [None]:
dog_stages.plot.scatter(x = 'dog_stage', y = 'favorite_count').set(title = 'Dog Stage v Favorite Count');

For favorite count, the majority of the dog stages had a favorite count per tweet above 10,000.
A doggo and a puppo received the highest number of likes on a single tweet.
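Scatter plots make the extremes visible but hide the typical value for each stage. A grouped summary is one way to complement them. The sketch below uses made-up counts (not results from this dataset) and reports the median and the number of tweets per stage.

```python
import pandas as pd

# Hypothetical engagement numbers, for illustration only.
sample = pd.DataFrame({
    "dog_stage": ["doggo", "doggo", "pupper", "pupper", "puppo"],
    "retweet_count": [100, 300, 50, 70, 400],
    "favorite_count": [1000, 3000, 500, 700, 4000],
})

# Median is less sensitive than the mean to a single viral tweet;
# count shows how many tweets each median is based on.
summary = (
    sample.groupby("dog_stage")[["retweet_count", "favorite_count"]]
    .agg(["median", "count"])
)
print(summary)
```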


Checking Distribution of dogs with known dog stages

In [None]:

dog_stages.dog_stage.hist()
plt.title('Dog Stages')
plt.xlabel('Dog Stage')
plt.ylabel('Number');

Most dogs are in the pupper dog stage

CONCLUSION

Tweets from the Twitter account WeRateDogs were analyzed for this project to generate some insights. After gathering, assessing and cleaning the different datasets used in this project, we merged them into one dataframe and analyzed it.

In the course of our analysis, we discovered some facts, which were displayed in visuals using the python libraries matplotlib and seaborn. A word cloud was also used to display the most frequently occurring words in the tweet texts in larger fonts, while the less frequently used words are displayed in smaller fonts.

I also obtained the most frequently used tweet channel as well as the most frequently used rating numerator. The most frequently occurring dog stage as well as the retweet and favorite counts of different dog stages were uncovered.
