# Project: Wrangling and Analyze Data

## Data Gathering

There are three pieces of data for this project, each in a different format:

    1) WeRateDogs Twitter archive data in CSV format
    2) Tweet image predictions in TSV format
    3) Additional tweet data from the WeRateDogs Twitter account, to be stored as tweet_json.txt

In [280]:
# load imports for data gathering
import json # loads json library
from timeit import default_timer as timer

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd # loads pandas library
import requests # loads requests library
import seaborn as sb
import tweepy
from tweepy import OAuthHandler

%matplotlib inline

**1.** Directly download the WeRateDogs Twitter archive data (**twitter_archive_enhanced.csv**)

According to Udacity, WeRateDogs downloaded their Twitter archive and sent it to Udacity via email exclusively for you to use in this project. The archive contains basic tweet data for over 5000 of their tweets as they stood on August 1, 2017.

I downloaded the CSV file from the provided link and saved it in my working directory.

In [281]:
# twitter_archive_enhanced.csv
path="C:/Users/Davie/Documents/GitHub/data_wrangling/data/"
twitter_archive=pd.read_csv(path + 'twitter_archive_enhanced.csv')

**'twitter-archive-enhanced.csv'**

**2.** Use the Requests library to download the tweet image prediction (**image_predictions.tsv**)

The WeRateDogs tweet image predictions are hosted on Udacity’s servers and are downloaded programmatically from the URL below using the requests library.

In [282]:
# url for image_predictions.tsv
file_url='https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
# store request response in tsv_response
tsv_response=requests.get(file_url)
tsv_response.raise_for_status() # stop early if the download failed
# write the raw response bytes to 'image_predictions.tsv'
with open('image_predictions.tsv', 'wb') as f:
    f.write(tsv_response.content)

**'image_predictions.tsv'**

**3.** Use the Tweepy library to query additional data via the Twitter API (**tweet_json.txt**)

There are two ways of getting this additional data: either through the Twitter API with the Python tweepy library, or by directly downloading the txt file provided by Udacity in the classroom.

Twitter API

In [283]:
# Authentication process to use Tweepy API
consumer_key = 'HIDDEN'
consumer_secret = 'HIDDEN'
access_token = 'HIDDEN'
access_secret = 'HIDDEN'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True)

In [None]:
# Creating list of tweet ids
tweet_id = twitter_archive['tweet_id']
list(tweet_id)

I successfully applied for Twitter API v2 Essential access, but its usage is limited and I was unable to use it to acquire the data. I have therefore requested Elevated access, which has not yet been approved.

- I downloaded the tweet_json.txt provided in the Udacity classroom 

In [285]:
# file url
file_url='https://video.udacity-data.com/topher/2018/November/5be5fb7d_tweet-json/tweet-json.txt'
# store request response in txt_response
txt_response=requests.get(file_url)
txt_response.raise_for_status() # stop early if the download failed
# write the raw response bytes to 'tweet_json.txt'
with open('tweet_json.txt', 'wb') as f:
    f.write(txt_response.content)

**'tweet_json.txt'**

I read the tweet_json.txt file by converting each json string into python dictionary and appending them to a twitter_list. Finally, I convert this list of dictionaries to a python pandas DataFrame, which is then stored as tweet_json.csv.

In [286]:
twitter_list = [] # empty list

with open('tweet_json.txt', 'r') as file: # open tweet_json.txt for reading
    # convert every line/JSON string into a dictionary
    for line in file:
        tweet = json.loads(line)
        tweet_id = tweet['id']
        retweet_count = tweet['retweet_count']
        fav_count = tweet['favorite_count']
        # append a dictionary with the required fields to the list
        twitter_list.append({'tweet_id':tweet_id, 'retweet_count': retweet_count, 'favorite_count': fav_count})
# convert the list of dictionaries into a pandas DataFrame with at least
# (tweet_id, retweet_count, and favorite_count as per the instruction)
twitter_df = pd.DataFrame(twitter_list, columns = ['tweet_id', 'retweet_count', 'favorite_count'])
twitter_df.to_csv('tweet_json.csv', index=False)

**'tweet_json.csv'**
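As an alternative to the explicit loop, pandas can parse newline-delimited JSON directly. The sketch below is only illustrative: it assumes each line is one JSON object with `id`, `retweet_count`, and `favorite_count` keys, and it uses a two-line in-memory sample with made-up values in place of tweet_json.txt. With real data it is worth checking that the long tweet IDs come through unchanged.

```python
import io

import pandas as pd

# Hypothetical two-line sample standing in for tweet_json.txt
sample = io.StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 25, "lang": "en"}\n'
    '{"id": 2, "retweet_count": 0, "favorite_count": 3, "lang": "en"}\n'
)

# lines=True tells pandas that every line is a separate JSON object
sample_df = pd.read_json(sample, lines=True)

# keep only the required fields and rename id to tweet_id
sample_df = sample_df[['id', 'retweet_count', 'favorite_count']].rename(columns={'id': 'tweet_id'})
print(sample_df)
```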

## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issues**.

**Quality issues**

- Issues related to the data content (dirty data). We check four quality dimensions: completeness, validity, accuracy and consistency.

**Tidiness issues**

- Issues related to the data structure (messy data). We check whether each variable forms a column, each observation forms a row, and each type of observational unit forms a table.

**1. visual assessment**
    - viewing the data without code

In [None]:
# twitter_archive_enhanced.csv
twitter_archive.head()

- Missing data (NaN, none)
- Non-descriptive columns (source, name, text)
- inconsistent rating denominator
- Extremely low and high rating numerator
- invalid names under name column (a, an, none)

In [None]:
#'image_predictions.tsv'
path="C:/Users/Davie/Documents/GitHub/data_wrangling/data/"
image_pred=pd.read_csv(path + 'image_predictions.tsv', sep='\t')
image_pred

- Non-descriptive column names for the rating algorithms (p1, p1_conf, p1_dog, and the same for p2 and p3)
- Tidiness issues in the p1, p2 and p3 columns

In [None]:
#tweet_json.csv
path="C:/Users/Davie/Documents/GitHub/data_wrangling/data/"
tweet_json=pd.read_csv(path + 'tweet_json.csv')
tweet_json

- low retweet_count for tweet_id 886267009285017600

**2. programmatic assessment**

    - checking data issues with Python code. We use .sample(), .shape, .describe(), .info(), .dtypes, .nunique()

In [None]:
twitter_archive.columns # list all columns 

In [None]:
twitter_archive.shape # assess the dimensions of the data

In [None]:
twitter_archive.nunique() # assess the number of unique values of each column

In [None]:
twitter_archive.info() # assess missing values and data types of each column

In [None]:
twitter_archive[twitter_archive.duplicated()] # assess duplicate rows

No duplicate rows. The timestamp column should have a datetime data type.

In [None]:
# assess the number of uniques for 'retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp'
twitter_archive[['retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp']].nunique()

Many retweets need to be removed according to the project motivation

In [None]:
list_names=[] # empty list
for names in twitter_archive['name']:
    if len(names)<=3: # check if the name has at most 3 characters
        list_names.append(names)
funny_names=pd.Series(list_names).value_counts() # convert to pandas series and check counts per name
funny_names

Some of these names ('a', 'by', 'not', 'his', 'an', 'all', 'life', 'the') do not appear to be valid names. The following cells contain the code used to examine a few of them.
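The length filter above only catches names of up to three characters, so words such as 'life' or 'just' slip through. A complementary check, sketched here on a made-up sample and assuming that genuine dog names are capitalised, is to flag every name that is entirely lowercase.

```python
import pandas as pd

# Hypothetical sample of the name column
names = pd.Series(['Charlie', 'a', 'just', 'None', 'life', 'Zoey', 'such'])

# Assumption: real names start with a capital letter,
# while extraction errors are ordinary lowercase words
suspect_names = names[names.str.islower()]
print(suspect_names.value_counts())
```

On the real data this would be `twitter_archive['name']` in place of the sample series.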

In [None]:
twitter_archive_none=twitter_archive[twitter_archive['name']=='None'] # load texts for dogs having 'None' as the names
twitter_archive_none[['text', 'name', 'doggo', 'floofer', 'pupper', 'puppo']]

Some of the texts may contain dog names and stages, while others contain None

In [None]:
twitter_archive_a=twitter_archive[twitter_archive['name']=='a'] # load texts for dogs having 'a' as the names
twitter_archive_a[['text', 'name', 'doggo', 'floofer', 'pupper', 'puppo']]

'a' is not the correct dog name. Some of the texts contain correct dog names and stages but wrongly extracted, while others contain None

In [None]:
twitter_archive_an=twitter_archive[twitter_archive['name']=='an'] # load texts for dogs having 'an' as the names
twitter_archive_an[['text', 'name', 'doggo', 'floofer', 'pupper', 'puppo']]

'an' is not the correct dog name. Though, some of the texts contain correct dog names but wrongly extracted, while others contain None or correct dog stages

In [None]:
twitter_archive_the=twitter_archive[twitter_archive['name']=='the'] # load texts for dogs having 'the' as the names
twitter_archive_the[['tweet_id','text', 'name', 'doggo', 'floofer', 'pupper', 'puppo']]

'the' is not the correct dog name. However, the text may or may not contain dog name

In [None]:
twitter_archive_one=twitter_archive[twitter_archive['name']=='one'] # load texts for dogs having 'one' as the names
twitter_archive_one[['tweet_id','text', 'name', 'doggo', 'floofer', 'pupper', 'puppo']]

'one' is not the correct dog name. Texts may not contain dog names. puppers wrongly extracted

In [None]:
twitter_archive_all=twitter_archive[twitter_archive['name']=='all'] # load texts for dogs having 'all' as the names
twitter_archive_all[['text', 'name', 'doggo', 'floofer', 'pupper', 'puppo']]

'all' is not the correct dog name

In [None]:
twitter_archive_not=twitter_archive[twitter_archive['name']=='not'] # load texts for dogs having 'not' as the names
twitter_archive_not[['text', 'name', 'doggo', 'floofer', 'pupper', 'puppo']]

'not' is not the correct dog name

In [None]:
twitter_archive_by=twitter_archive[twitter_archive['name']=='by'] # load texts for dogs having 'by' as the names
twitter_archive_by[['text', 'name', 'doggo', 'floofer', 'pupper', 'puppo']]

'by' is not the correct dog name

In [None]:
twitter_archive_my=twitter_archive[twitter_archive['name']=='my'] # load texts for dogs having 'my' as the names
twitter_archive_my[['text', 'name', 'doggo', 'floofer', 'pupper', 'puppo']]

name is Zoey, wrongly extracted as 'my'

In [None]:
twitter_archive_old=twitter_archive[twitter_archive['name']=='old'] # load texts for dogs having 'old' as the names
twitter_archive_old[['text', 'name', 'doggo', 'floofer', 'pupper', 'puppo']]

'old' is not the correct dog name

In [None]:
twitter_archive_his=twitter_archive[twitter_archive['name']=='his'] # load texts for dogs having 'his' as the names
twitter_archive_his[['text', 'name', 'doggo', 'floofer', 'pupper', 'puppo']]

The correct dog name is 'Quizno' and not 'his'

In [None]:
twitter_archive_just=twitter_archive[twitter_archive['name']=='just'] # load texts for dogs having 'just' as the names
twitter_archive_just[['text', 'name', 'doggo', 'floofer', 'pupper', 'puppo']]

'just' is not the correct dog name for this group

In [None]:
twitter_archive_life=twitter_archive[twitter_archive['name']=='life'] # load texts for dogs having 'life' as the names
twitter_archive_life[['text', 'name', 'doggo', 'floofer', 'pupper', 'puppo']]

'life' is not a dog name

**These texts do appear to contain the dog's name**

In [None]:
image_pred.columns # list all columns

Non-descriptive columns

In [None]:
image_pred.info() # assess the missing and data types for each column

tweet_id should be stored as a string rather than as a number

In [None]:
image_pred.nunique() # assess the unique values in each column

Out of 2075 tweets, only 2009 have a unique image url

In [None]:
image_pred[image_pred['jpg_url'].duplicated()].jpg_url.head(10) # assess duplicate image url

66 urls are repeated, pointing to the same image. I want to assess url with id 1315

In [None]:
image_pred[image_pred['jpg_url']=='https://pbs.twimg.com/media/CWza7kpWcAAdYLc.jpg'] # assess duplicate rows

The two image urls are the same and show the same image when clicked. The information from the rating algorithm is also the same; only the tweet_id differs.

We need a unique tweet-id with a unique image url
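By default `duplicated()` marks only the second and later copies of a repeated value. To see every row that shares an image url side by side, `keep=False` can be used. This is a sketch on toy data with placeholder urls, not the actual predictions table.

```python
import pandas as pd

# Toy stand-in for image_pred with placeholder urls
toy_pred = pd.DataFrame({
    'tweet_id': [101, 102, 103],
    'jpg_url': [
        'https://example.com/a.jpg',
        'https://example.com/b.jpg',
        'https://example.com/a.jpg',
    ],
})

# keep=False marks every copy of a repeated url, not only the later ones
repeated = toy_pred[toy_pred['jpg_url'].duplicated(keep=False)].sort_values('jpg_url')
print(repeated)
```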

In [None]:
image_pred[image_pred.duplicated()] # assess duplicates

No fully duplicated rows; per .nunique() above, every tweet_id is also unique

In [None]:
p1_sorted=image_pred.p1.sort_values() # assess the values under p1-golden retriever
p1_sorted.head(10)

In [None]:
p2_sorted=image_pred.p2.sort_values() # assess the values under p2-labrador retriever
p2_sorted.head(10)

The two columns contain similar items; they are related

In [None]:
tweet_json.shape # assess the dimensions of the data

In [None]:
tweet_json.columns # list the columns in the data

In [None]:
tweet_json.info() # assess the missing values and data types

In [None]:
tweet_json.describe() # assess the descriptive statistics for the data

Minimum value is zero in both retweet_count and favorite_count

In [None]:
tweet_json[tweet_json.duplicated()] # assess duplicate rows

No duplicate observation

In [None]:
tweet_json.retweet_count.value_counts().tail(10) # assess the counts for each value in column retweet_count

One observation with zero value, did not get a retweet

In [None]:
tweet_json[tweet_json['retweet_count']==0] # find the observation with value zero in the retweet_count

In [None]:
tweet_json.favorite_count.value_counts().head(10) # assess the counts for each value in column favorite_count

179 observations have a value of zero; these tweets had no favorites

In [None]:
tweet_json[tweet_json['favorite_count']==0].head() # find the observation with value zero in the favorite_count

### Quality issues

1. Missing values in twitter_archive data: the name column contains 'None'

2. Inaccurate values (dog names) in twitter_archive data for name column ('a', 'an', 'all', 'my', 'not', 'the', 'by', 'such', 'his', 'life', 'one', 'old', 'just')

3. 181 rows with retweets in twitter archive data, and extraneous columns, need to be removed as per the project motivation

4. Missing values in twitter_archive data: the dog stage columns contain 'None'

5. The tweet image prediction data has unique tweet_id values, but the image url is duplicated

6. Non-descriptive column names in twitter archive data ('name', 'text')

7. Some values in image predictions under columns p1, p2, p3 are in uppercase

8. Incorrect data types: 'timestamp' and 'retweeted_status_timestamp' are stored as strings

9. Incorrect data types for 'tweet_id', 'retweeted_status_id', 'retweeted_status_user_id', 'in_reply_to_status_id', 'in_reply_to_user_id'

10. Inconsistent rating denominator
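The inconsistent denominators noted in the list above can also be isolated programmatically. The sketch below uses toy values and assumes that a denominator of 10 is the expected standard, which is an assumption made here rather than something established in the assessment.

```python
import pandas as pd

# Toy ratings; all values are made up for illustration
toy_ratings = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'rating_numerator': [13, 84, 9, 1776],
    'rating_denominator': [10, 70, 11, 10],
})

# Assumption: 10 is the expected denominator, anything else is flagged
odd_denominator = toy_ratings[toy_ratings['rating_denominator'] != 10]
print(odd_denominator)

# Summary statistics help spot extremely low or high numerators
print(toy_ratings['rating_numerator'].describe())
```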

### Tidiness issues

1. The four columns doggo, floofer, pupper, and puppo are all dog stages, which is one variable

2. The tweet_id information in all three data sets (tweet_json.csv, twitter archive data and tweet image prediction) is related, so they form the same observational unit.

## Cleaning Data

This section aims to improve quality and tidiness by correcting inaccuracies, removing irrelevant columns, renaming columns, and replacing missing values or dropping the rows that contain them, based on the assessment already done.

The cleaning follows a programmatic process in which every issue identified in the assessment section is first defined, then coded, then tested.

In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

In [327]:
# Make copies of original pieces of data
twitter_archive_clean_a=twitter_archive.copy()
image_predictions_clean=image_pred.copy()
tweet_json_clean=tweet_json.copy()

### Issue #1: Missing values in twitter_archive data for name column 'None'

#### Define: Replace the "None" with correctly extracted names or with NaN using .str.extract() method

#### Code

In [328]:
# select observations where name is not "None"
twitter_archive_clean_b=twitter_archive_clean_a[twitter_archive_clean_a.name !='None']

In [329]:
# patterns to extract correct names and dog stages
name_pattern=r'(?:name(?:d)?)\s{1}(?:is\s)?([A-Za-z]+)|(?:This(?:d)?)\s{1}(?:is\s)([A-Za-z]+)|(?:Meet(?:d)?)\s{1}(?:\s)?([A-Za-z]+)|(?:hello(?:d)?)\s{1}(?:to\s)([A-Za-z]+)|(?:call(?:d)?)\s{1}(?:him\s)([A-Za-z]+)' # raw string so \s is passed to the regex engine unchanged
stage_pattern=r'(?i)(pupper|doggo|puppo|floofer)'

In [330]:
twitter_archive_clean_c=twitter_archive_clean_a[twitter_archive_clean_a.name =='None'].copy() # filter observations whose name is None and make a copy
extracted_names=twitter_archive_clean_c['text'].str.extract(name_pattern, expand=True) # one column per capture group in the pattern
twitter_archive_clean_c['name']=extracted_names.bfill(axis=1).iloc[:, 0] # keep the first group that matched in each row, NaN if none did

In [None]:
twitter_archive_clean_d=pd.concat([twitter_archive_clean_b, twitter_archive_clean_c], ignore_index=True) # join the data tables (DataFrame.append was removed in pandas 2.0)
twitter_archive_clean_d.head()
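A pattern with several alternatives, each with its own capture group, makes `.str.extract()` return one column per group, and only the group that matched is filled in. The sketch below shows one way to collapse those columns into a single name column. It uses a simplified two-alternative pattern and made-up tweet texts, not the full pattern above.

```python
import pandas as pd

# Made-up tweet texts for illustration
texts = pd.Series([
    'This is Bella. She loves snow.',
    'Meet Rufus, a very good boy.',
    'Here is a dog with no introduction.',
])

# Simplified illustrative pattern with two capture groups
toy_pattern = r'This is ([A-Z][a-z]+)|Meet ([A-Z][a-z]+)'

groups = texts.str.extract(toy_pattern, expand=True)  # one column per capture group
names = groups.bfill(axis=1).iloc[:, 0]  # first non-missing group in each row
print(names)
```

Rows where no alternative matches stay as NaN, which is the desired outcome for tweets without an extractable name.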

#### Test

In [None]:
twitter_archive_clean_d.name.value_counts()

### Issue #2: Inaccurate values (dog names) in twitter_archive data for name column ('a', 'an', 'all', 'my', 'not', 'the', 'by', 'such', 'his', 'life', 'one', 'old', 'just')

#### Define: Replace ('a', 'an', 'all', 'not', 'the', 'by', 'such', 'life', 'one', 'old', 'just') with NaN using np.nan, 'his' with 'Quizno' and 'my' with 'Zoey' using replace() method

#### Code

In [None]:
to_be_replaced=['a', 'an', 'all', 'not', 'the', 'by', 'such', 'life', 'one', 'old', 'just'] # group the values to be replaced
twitter_archive_clean_d['name']=twitter_archive_clean_d['name'].replace(to_be_replaced,np.nan) # replace the values with NaN (np.NaN was removed in NumPy 2.0)
twitter_archive_clean_d.head()

In [None]:
twitter_archive_clean_d['name']=twitter_archive_clean_d['name'].replace(['my','his'], ['Zoey', 'Quizno']) # correct my and his names with Zoey and Quizno respectively
twitter_archive_clean_d.head()

#### Test

In [None]:
twitter_archive_clean_d.name.value_counts() # check whether the names replaced still exist in the data

### Issue #3: The four columns for doggo, floofer, pupper, and puppo for twitter_archive data are dog stages, one variable

#### Define: In twitter_archive data, melt doggo, floofer, pupper, and puppo columns into one column called dog_stage using .melt() method

#### Code

In [336]:
unmelted_col=['tweet_id', 'timestamp', 'text', 'retweeted_status_id', 'retweeted_status_user_id', 
              'in_reply_to_status_id', 'in_reply_to_user_id','expanded_urls', 'rating_numerator', 
              'rating_denominator', 'name'] # create columns not to be melted
twitter_archive_clean_e=twitter_archive_clean_d.melt(id_vars=unmelted_col, value_vars=['doggo', 'floofer', 'pupper',
       'puppo'], var_name='to_be_removed', value_name='dog_stage') # melt 'doggo', 'floofer', 'pupper','puppo' into dog_stage

In [337]:
twitter_archive_clean_e.drop('to_be_removed', axis=1, inplace=True) # drop column to be removed

In [338]:
twitter_archive_clean_e.drop_duplicates(inplace=True) # drops duplicates from the data
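Melting the four stage columns produces one row per tweet and stage column, so after dropping duplicates a tweet with a stage can still appear twice, once with 'None' and once with the real stage. A possible alternative, sketched here on toy data and not the approach used in this notebook, collapses the four columns row-wise so each tweet keeps exactly one row.

```python
import pandas as pd

# Toy stand-in for the archive; 'None' marks a missing stage
toy = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'doggo': ['None', 'doggo', 'doggo', 'None'],
    'floofer': ['None', 'None', 'None', 'None'],
    'pupper': ['pupper', 'None', 'pupper', 'None'],
    'puppo': ['None', 'None', 'None', 'None'],
})
stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Turn the 'None' strings into real missing values
stages = toy[stage_cols].where(toy[stage_cols] != 'None')

# Join whatever stages remain in each row; tweets with two stages keep both
joined = stages.apply(lambda row: ','.join(row.dropna()), axis=1)
toy['dog_stage'] = joined.where(joined != '')  # empty string becomes NaN
toy = toy.drop(columns=stage_cols)
print(toy)
```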

#### Test

In [None]:
len(twitter_archive_clean_e.columns)==len(twitter_archive_clean_d.columns) # should return False, the melted table has fewer columns

In [None]:
twitter_archive_clean_e.dog_stage.value_counts() # assess the observations under dog stage column

### Issue #4: Missing values in twitter_archive data for the dog stages column 'None' 

#### Define: Replace None with the correct extracted dog_stage from the text or with NaN

#### Code

In [341]:
twitter_archive_clean=twitter_archive_clean_e.copy() # make copy
twitter_archive_clean['dog_stage']=twitter_archive_clean['text'].str.extract(stage_pattern, expand=False) # use regex pattern to extract the dog stage as a Series

In [None]:
twitter_archive_clean.dog_stage.value_counts() # assess the new observations under dog stage column

In [None]:
stage_caps=['Doggo','Floofer', 'PUPPER', 'Puppo', 'DOGGO', 'Pupper'] # create names in caps to be corrected
cor_stage=['doggo', 'floofer', 'pupper', 'puppo', 'doggo', 'pupper'] # correct names
twitter_archive_clean['dog_stage']=twitter_archive_clean['dog_stage'].replace(stage_caps,cor_stage) # replace the names in caps with the correct ones
twitter_archive_clean.head()
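The case-insensitive pattern returns the stage exactly as it was written in the tweet, which is why mixed-case variants appear. Instead of listing each variant, the extracted values can be lowercased in one step. A minimal sketch on made-up texts:

```python
import re

import pandas as pd

# Made-up tweet texts for illustration
texts = pd.Series(['Such a good PUPPER', 'This Doggo is brave', 'No stage here'])

# expand=False returns a Series because the pattern has a single capture group
stage = texts.str.extract(r'(pupper|doggo|puppo|floofer)', flags=re.IGNORECASE, expand=False)
stage = stage.str.lower()  # normalise every variant; missing values stay NaN
print(stage)
```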

#### Test

In [None]:
twitter_archive_clean.dog_stage.value_counts() # assess the observations under new dog stage column

In [None]:
twitter_archive_clean.head() # load new twitter_archive_clean data

### Issue #5: Some values in image predictions data under p1, p2, and p3 columns are in uppercase

#### Define: Make the values of columns p1, p2, p3 in predictions all lowercase.

#### Code

In [346]:
image_predictions_clean['p1'] = image_predictions_clean['p1'].str.lower() # converts values in p1 to lowercase
image_predictions_clean['p2'] = image_predictions_clean['p2'].str.lower() # converts values in p2 to lowercase
image_predictions_clean['p3'] = image_predictions_clean['p3'].str.lower() # converts values in p3 to lowercase

#### Test

In [None]:
image_predictions_clean.p1.unique()
image_predictions_clean.p2.unique()
image_predictions_clean.p3.unique()

### Issue #6: The tweet_id information in all three data sets (tweet_json.csv, twitter archive data and tweet image prediction) is related, so they form the same observational unit.

#### Define: Merge the three datasets (tweet_json.csv, twitter archive data and tweet image prediction) into one table called twitter_archive_master using merge() method on tweet_id

#### Code

In [None]:
# first merge twitter_archive_clean to image_predictions_clean
twitter_archive_master_a=image_predictions_clean.merge(twitter_archive_clean, on='tweet_id', how='inner')
twitter_archive_master_a.head()

In [349]:
# second merge tweet_json_clean to twitter_archive_master_a
twitter_archive_master_b=twitter_archive_master_a.merge(tweet_json_clean, on='tweet_id', how='inner')

In [350]:
# Increase number of visible columns in a pandas DataFrame to see all the columns in the newly created twitter_archive_master_b
pd.set_option("display.max_columns",25)

In [None]:
twitter_archive_master_b.head()

#### Test

In [None]:
twitter_archive_master_b.columns # should list more columns than any single input table
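Beyond counting columns, a merge can be checked by asking pandas which rows matched. The sketch below uses two toy tables, not the project data. `indicator=True` adds a `_merge` column showing where each row came from, and `validate='one_to_one'` raises an error if a tweet_id repeats on either side.

```python
import pandas as pd

# Toy tables standing in for two of the datasets
left = pd.DataFrame({'tweet_id': [1, 2, 3], 'p1': ['pug', 'chow', 'beagle']})
right = pd.DataFrame({'tweet_id': [2, 3, 4], 'retweet_count': [5, 7, 9]})

# Outer merge with an indicator shows which keys exist in only one table
check = left.merge(right, on='tweet_id', how='outer', indicator=True, validate='one_to_one')
print(check['_merge'].value_counts())

# The inner merge keeps only the keys present in both tables
inner = left.merge(right, on='tweet_id', how='inner')
print(inner)
```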

### Issue #7: 181 rows with retweets in twitter archive data, need to be removed as per the project motivation

#### Define: Remove 181 rows with retweets as well as extraneous columns 'in_reply_to_status_id', 'in_reply_to_user_id','source', 'retweeted_status_id', 'retweeted_status_user_id','retweeted_status_timestamp' by filtering them out

#### Code

In [353]:
# there are 181 retweets found in "retweeted_status_id", "retweeted_status_user_id" and "retweeted_status_timestamp". We keep the rows that are null and remove the retweets.
twitter_archive_master_c = twitter_archive_master_b[twitter_archive_master_b.retweeted_status_id.isnull()]

In [None]:
# Remove extraneous columns
cols=['tweet_id', 'jpg_url', 'img_num', 'p1', 'p1_conf', 'p1_dog', 'p2','p2_conf', 'p2_dog', 'p3', 
      'p3_conf', 'p3_dog', 'timestamp', 'text', 'expanded_urls','rating_numerator', 'rating_denominator', 
      'name', 'dog_stage','retweet_count', 'favorite_count'] # group the columns to be filtered
twitter_archive_master_d=twitter_archive_master_c.filter(cols, axis=1) # filter the required columns
twitter_archive_master_d.head()

#### Test

In [None]:
twitter_archive_master_d.columns # assess the filtered columns

### Issue #8: The Tweet image prediction, has unique tweet_id but not image url, which is duplicated

#### Define: Remove duplicate jpg_url in tweet image prediction data, to get a unique tweet-id with a unique jpg_url

#### Code

In [None]:
twitter_archive_master_e=twitter_archive_master_d.drop_duplicates(subset=['jpg_url']) # remove duplicates based on jpg_url column
twitter_archive_master_e.head()

#### Test

In [None]:
twitter_archive_master_e.shape==twitter_archive_master_d.shape # should return false

### Issue #9: Non-descriptive columns names  ('name', 'text')

#### Define: Rename the column 'name' as 'dog_name' and 'text' as 'tweet_text' in the twitter_archive_master using .rename() method

#### Code

In [358]:
# change name to dog_name and text to tweet_text
twitter_archive_master=twitter_archive_master_e.copy()
twitter_archive_master.rename(columns={'name': 'dog_name', 'text': 'tweet_text'}, inplace=True)

#### Test

In [None]:
if {'dog_name', 'tweet_text'}.issubset(twitter_archive_master.columns): # check that both renamed columns exist
    print('Yes') # should print Yes

### Issue #10: Non-descriptive column names in tweet image prediction data ('jpg_url', 'img_num')

#### Define: Rename the columns 'jpg_url' and 'img_num' in the twitter_archive_master using .rename() method

#### Code

In [360]:
rename_dict={'jpg_url':'image_link', 'img_num':'number_of_images'} # create dictionary for old and new names
twitter_archive_master.rename(columns=rename_dict, inplace=True) # rename the columns

#### Test

In [None]:
twitter_archive_master.columns # check the columns in the new data table

In [None]:
twitter_archive_master.head() # load the twitter_archive_master data table

### Issue #11: Incorrect data type for 'timestamp', which is stored as a string

#### Define: Convert 'timestamp' data type to datetime

#### Code

In [363]:
twitter_archive_master['timestamp']=pd.to_datetime(twitter_archive_master['timestamp']) # convert timestamp into datetime

#### Test

In [None]:
twitter_archive_master['timestamp'].dtypes # assess the data type for timestamp, should return a datetime64 dtype
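The exact dtype reported depends on whether the timestamp strings carry a UTC offset. The sketch below assumes timestamps of the form `2017-08-01 16:23:56 +0000`; both the format and the values are assumptions for illustration. In that case `pd.to_datetime` returns a timezone-aware column rather than a naive one.

```python
import pandas as pd

# Assumed timestamp format with a UTC offset (hypothetical values)
raw = pd.Series(['2017-08-01 16:23:56 +0000', '2017-07-31 00:18:03 +0000'])
ts = pd.to_datetime(raw)

print(ts.dtype)  # a timezone-aware datetime dtype
print(ts.dt.year.tolist())  # the .dt accessor works once the column is datetime
```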

### Issue #12: Incorrect data types for 'tweet_id', 'retweeted_status_id', 'retweeted_status_user_id', 'in_reply_to_status_id', 'in_reply_to_user_id'

#### Define: Convert 'tweet_id' to string type

#### Code

In [365]:
twitter_archive_master['tweet_id']=twitter_archive_master['tweet_id'].astype(str) # convert tweet_id to string

In [366]:
# create order of indexing of the columns
column_names = ['tweet_id', 'timestamp', 'dog_name', 'dog_stage','retweet_count', 'favorite_count', 
                'rating_numerator', 'rating_denominator', 'p1', 'p1_conf','p1_dog', 'p2','p2_conf', 
                'p2_dog', 'p3', 'p3_conf', 'p3_dog','image_link', 'number_of_images','tweet_text', 'expanded_urls']

twitter_archive_master = twitter_archive_master.reindex(columns=column_names) # reorder the columns

In [None]:
twitter_archive_master.head()

In [368]:
twitter_archive_master.drop_duplicates(inplace=True) # remove duplicates

#### Test

In [None]:
twitter_archive_master['tweet_id'].dtypes # should return O, object

In [None]:
twitter_archive_master.shape # assess the shape of twitter_archive_master

## Storing Data

Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".

In [371]:
twitter_archive_master.to_csv('twitter_archive_master.csv', index=False) # store twitter_archive_master dataframe to a csv file named twitter_archive_master.csv

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

In [None]:
twitter_archive_master=pd.read_csv('twitter_archive_master.csv') # load twitter_archive_master.csv into pandas dataframe
twitter_archive_master.head()
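A CSV file does not store data types, so a plain `read_csv` infers them again and the string tweet_id and datetime timestamp set during cleaning are not preserved. One way to restore them on load is sketched below, using a one-row in-memory CSV with made-up values in place of twitter_archive_master.csv.

```python
import io

import pandas as pd

# One-row stand-in for twitter_archive_master.csv (made-up values)
csv_text = (
    'tweet_id,timestamp,retweet_count\n'
    '123456789012345678,2017-08-01 16:23:56+00:00,42\n'
)

# dtype keeps tweet_id as text; parse_dates converts timestamp while reading
reloaded = pd.read_csv(
    io.StringIO(csv_text),
    dtype={'tweet_id': str},
    parse_dates=['timestamp'],
)
print(reloaded.dtypes)
```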

### Insights:
1. Pupper is the most popular dog stage among WeRateDogs’s tweets by both favorite and retweet counts. The second most popular dog stage by retweet and favorite counts is doggo.

2. There is a strong linear relationship between the favorite count and the retweet count, though most of the data points are clustered at the low end of both counts. This relationship holds for every dog stage. Also, the distributions for p1, p2 and p3 are heavily skewed.

3. Golden retriever is the most popular dog breed among WeRateDogs’s tweets in terms of the number of image predictions, with 139 dogs. The second most popular dog breed is Labrador retriever, with 95 dogs. Golden retriever, Labrador retriever, Pembroke, Chihuahua and pug make up the top 5 most popular dog breeds.

### Visualization

In [None]:
values=twitter_archive_master.groupby(['dog_stage']).retweet_count.sum().sort_values()
#twitter_archive_master.dog_stage.value_counts()
labels=values.index # stage names in the same sorted order as the values

plt.rcParams['font.size'] = '16'
fig, ax=plt.subplots(figsize=[10,14])
fig.patch.set_facecolor('white')  # Set figure background to white

#explode = (0, 0, 0.2, 0.1)
plt.pie(values, labels=labels, counterclock=False, autopct='%1.1f%%') #explode=explode,  shadow=True, 
plt.title('Dog Stage Popularity Chart-Retweet Counts', fontsize=20)
plt.legend(labels, loc=2)

# Save the plot as a JPG file
plt.savefig('dog_stage_popularity_chart.jpg', format='jpg', dpi=300, bbox_inches='tight')

plt.show()

The retweet counts pie chart shows that the most popular dog stage is pupper, with 42.5 percent of retweets. Doggo is second, followed by puppo and lastly floofer, with 2.8 percent.

In [None]:
values=twitter_archive_master.groupby(['dog_stage']).favorite_count.sum().sort_values()
#twitter_archive_master.dog_stage.value_counts()
labels=values.index # stage names in the same sorted order as the values

plt.rcParams['font.size'] = '12'
fig, ax=plt.subplots(figsize=[10,14])
fig.patch.set_facecolor('white')  # Set figure background to white

#explode = (0, 0.1, 0.2, 0.1)
plt.pie(values, labels=labels, counterclock=False, autopct='%1.1f%%') #explode=explode,
plt.title('Dog Stage Popularity Chart based on Favorite Counts', fontsize=20)
plt.legend(labels, loc=4)

# Save the plot as a JPG file
plt.savefig('dog_stage_popularity_chartFa.jpg', format='jpg', dpi=300, bbox_inches='tight')

plt.show()

In this case, the favorite counts pie chart still shows pupper as the most popular dog stage at 43.8 percent, 1.3 percentage points higher than in the previous pie chart. Doggo is still the second most popular dog stage, although its share drops from 39.3 percent of retweet counts to 36.1 percent of favorite counts.

In [None]:
twitter_archive_master.groupby(['dog_stage']).favorite_count.sum().sort_values()

From the above output, the pupper dog stage has the highest sum of favorite counts
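The percentages on the pie charts can be reproduced as a table by dividing each stage's total by the grand total. The sketch below uses made-up counts, so its numbers do not match the charts.

```python
import pandas as pd

# Made-up counts for illustration only
toy = pd.DataFrame({
    'dog_stage': ['pupper', 'pupper', 'doggo', 'puppo', 'floofer'],
    'retweet_count': [300, 100, 350, 200, 50],
})

totals = toy.groupby('dog_stage')['retweet_count'].sum().sort_values()
shares = (totals / totals.sum() * 100).round(1)  # percentage share per stage
print(shares)
```

Because a sum grows with the number of tweets in a stage, comparing `.mean()` per stage alongside the totals may give a fairer picture of popularity per tweet.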

In [None]:
# plot scatter plot for Retweet Counts vs Favorite Counts
x=twitter_archive_master.retweet_count
y=twitter_archive_master.favorite_count

fig, ax=plt.subplots(figsize=[10, 6])
plt.rcParams['font.size'] = '14' # Set general font size
fig.patch.set_facecolor('white')  # Set figure background to white

plt.scatter(x,y,color='blue')
plt.title('Scatter plot for Retweet Counts vs Favorite Counts')
plt.xlabel('Retweet Counts')
plt.ylabel('Favorite Counts')

# Save the plot as a JPG file
plt.savefig('scatter_R_and_F.jpg', format='jpg', dpi=300, bbox_inches='tight')

plt.show()

There is a strong linear correlation between retweet counts and favorite counts
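The visual impression from the scatter plot can be backed by a number. `Series.corr` computes the Pearson correlation coefficient by default, and values close to 1 indicate a strong positive linear relationship. The sketch uses toy counts, so the coefficient it prints says nothing about the real data.

```python
import pandas as pd

# Toy counts for illustration only
toy = pd.DataFrame({
    'retweet_count': [10, 20, 30, 40, 50],
    'favorite_count': [31, 59, 92, 118, 152],
})

r = toy['retweet_count'].corr(toy['favorite_count'])  # Pearson by default
print(round(r, 3))
```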

In [None]:
# plot scatter plots for retweet counts and favorite counts for each dog stage
x1=twitter_archive_master[twitter_archive_master.dog_stage=='pupper'].retweet_count
y1=twitter_archive_master[twitter_archive_master.dog_stage=='pupper'].favorite_count

x2=twitter_archive_master[twitter_archive_master.dog_stage=='doggo'].retweet_count
y2=twitter_archive_master[twitter_archive_master.dog_stage=='doggo'].favorite_count

x3=twitter_archive_master[twitter_archive_master.dog_stage=='puppo'].retweet_count
y3=twitter_archive_master[twitter_archive_master.dog_stage=='puppo'].favorite_count

x4=twitter_archive_master[twitter_archive_master.dog_stage=='floofer'].retweet_count
y4=twitter_archive_master[twitter_archive_master.dog_stage=='floofer'].favorite_count

plt.rcParams["figure.figsize"] = [10, 6]
plt.rcParams["figure.autolayout"] = False
plt.rcParams['font.size'] = '14' # Set general font size
plt.rcParams['figure.facecolor'] = 'white'  # Set background of the new figure to white

labels='pupper'
plt.scatter(x1,y1, color='green')
plt.title('Scatter plot for Pupper')
plt.xlabel('Retweet Counts')
plt.ylabel('Favorite Counts')
plt.legend([labels], loc=0)

plt.savefig('scatter_for_pupper.jpg', format='jpg', dpi=300, bbox_inches='tight')

plt.show()

In [None]:
plt.rcParams["figure.figsize"] = [10, 6]
plt.rcParams["figure.autolayout"] = False
plt.rcParams['font.size'] = '14' # Set general font size
plt.rcParams['figure.facecolor'] = 'white'  # Set background of the new figure to white

labels='doggo'
plt.scatter(x2,y2,color='blue')
plt.title('Scatter plot for Doggo')
plt.xlabel('Retweet Counts')
plt.ylabel('Favorite Counts')
plt.legend([labels], loc=0)

plt.savefig('scatter_for_doggo.jpg', format='jpg', dpi=300, bbox_inches='tight')

plt.show()

In [None]:
plt.rcParams["figure.figsize"] = [10, 6]
plt.rcParams["figure.autolayout"] = False
plt.rcParams['font.size'] = '14' # Set general font size
plt.rcParams['figure.facecolor'] = 'white'  # Set background of the new figure to white

labels='puppo'
plt.scatter(x3,y3,color='orange')
plt.title('Scatter plot for Puppo')
plt.xlabel('Retweet Counts')
plt.ylabel('Favorite Counts')
plt.legend([labels], loc=0)

plt.savefig('scatter_for_puppo.jpg', format='jpg', dpi=300, bbox_inches='tight')

plt.show()

In [None]:
plt.rcParams["figure.figsize"] = [10, 6]
plt.rcParams["figure.autolayout"] = False
plt.rcParams['font.size'] = '14' # Set general font size
plt.rcParams['figure.facecolor'] = 'white'  # Set figure background to white

labels='floofer'
plt.scatter(x4,y4,color='black')
plt.title('Scatter plot for Floofer')
plt.xlabel('Retweet Counts')
plt.ylabel('Favorite Counts')
plt.legend([labels], loc=0)

plt.savefig('scatter_for_floofer.jpg', format='jpg', dpi=300, bbox_inches='tight')

plt.show()

The scatter plots suggest a strong positive linear correlation between retweet counts and favorite counts for each dog stage.
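
A visual impression like this can be backed by a number. The sketch below computes the Pearson correlation between the two counts within each dog stage. It uses a small made-up frame (the name `toy` and all values are assumptions for illustration); in the notebook the same calls would be applied to `twitter_archive_master`.

```python
import pandas as pd

# Made-up numbers for illustration only; in the notebook, use twitter_archive_master.
toy = pd.DataFrame({
    'dog_stage': ['pupper', 'pupper', 'pupper', 'doggo', 'doggo', 'doggo'],
    'retweet_count': [100, 200, 300, 150, 450, 900],
    'favorite_count': [400, 800, 1200, 500, 1600, 3000],
})

# Pearson correlation between the two counts within each dog stage.
stage_corr = {
    stage: group['retweet_count'].corr(group['favorite_count'])
    for stage, group in toy.groupby('dog_stage')
}
print(stage_corr)
```

Values close to 1 would support the reading of the scatter plots; values near 0 would not.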

In [None]:
# plot scatter matrix
histogram=twitter_archive_master[['p1_conf','p2_conf','p3_conf']]

plt.rcParams["figure.figsize"] = [10, 6]
plt.rcParams['font.size'] = '14' # Set general font size
plt.rcParams['figure.facecolor'] = 'white'  # Set figure background to white

pd.plotting.scatter_matrix(histogram, color='black', alpha=0.7)
plt.xticks(rotation=0)

plt.savefig('scatter_matrix.jpg', format='jpg', dpi=300, bbox_inches='tight')
plt.show()

All three distributions are skewed: p1_conf is left-skewed, while p2_conf and p3_conf are right-skewed.
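
The direction of skew can also be checked numerically with `DataFrame.skew()`. The sketch below uses made-up confidence values (the frame `toy_conf` is an assumption for illustration); in the notebook it would be replaced by `twitter_archive_master[['p1_conf','p2_conf','p3_conf']]`.

```python
import pandas as pd

# Made-up confidences for illustration only; not the real prediction data.
toy_conf = pd.DataFrame({
    'p1_conf': [0.30, 0.80, 0.90, 0.95, 0.99],
    'p2_conf': [0.01, 0.03, 0.05, 0.10, 0.40],
})

# Sample skewness per column: a negative value means a longer left tail
# (left-skewed), a positive value means a longer right tail (right-skewed).
skewness = toy_conf.skew()
print(skewness)
```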

Next, I want to find the most common dog breed.
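
The cells that follow reshape the prediction columns with `melt`, keep one row per tweet, and count the labels flagged as dogs. The sketch below replays those steps on a made-up three-tweet frame (the reduced column set and the helper names `rank` and `flag` are assumptions for illustration). Because `melt` stacks the `p1` rows first and `drop_duplicates(subset='tweet_id')` keeps the first occurrence, the row that survives for each tweet appears to pair `p1` with `p1_dog`, so on this toy data the counts match a direct filter on the top prediction.

```python
import pandas as pd

# Made-up predictions for illustration only; not the real WeRateDogs data.
toy = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'p1': ['golden_retriever', 'paper_towel', 'pug'],
    'p1_dog': [True, False, True],
    'p2': ['labrador_retriever', 'chihuahua', 'bull_mastiff'],
    'p2_dog': [True, True, True],
    'p3': ['kuvasz', 'teddy', 'pekinese'],
    'p3_dog': [True, False, True],
})

# Step 1: stack p1, p2, p3 into a single breed_type column.
step1 = toy.melt(id_vars=['tweet_id', 'p1_dog', 'p2_dog', 'p3_dog'],
                 value_vars=['p1', 'p2', 'p3'],
                 var_name='rank', value_name='breed_type').drop(columns='rank')

# Step 2: stack the three dog flags into a single breed column.
step2 = step1.melt(id_vars=['tweet_id', 'breed_type'],
                   value_vars=['p1_dog', 'p2_dog', 'p3_dog'],
                   var_name='flag', value_name='breed').drop(columns='flag')

# Keep the first row per tweet, then count the labels flagged as dogs.
step2 = step2.drop_duplicates(subset='tweet_id')
melted_counts = step2[step2.breed == True].breed_type.value_counts()

# Direct alternative: count the top prediction where p1_dog is True.
direct_counts = toy.loc[toy.p1_dog, 'p1'].value_counts()

print(melted_counts.to_dict())
print(direct_counts.to_dict())
```

If the second and third predictions should also contribute to the breed counts, the per-tweet de-duplication step would need to be revisited.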

In [382]:
twitter_archive_breed_type=twitter_archive_master.copy() # make copy of twitter_archive_master

In [383]:
# put p1, p2 and p3 into one group known as breed type using melt method
twitter_archive_breed_type=twitter_archive_breed_type.melt(id_vars=['tweet_id', 'timestamp', 'dog_name', 'dog_stage','retweet_count', 'favorite_count', 
                 'rating_numerator', 'rating_denominator','p1_conf','p1_dog','p2_conf', 
                 'p2_dog', 'p3_conf', 'p3_dog','image_link', 'number_of_images',
                     'tweet_text', 'expanded_urls'], 
            value_vars=['p1', 'p2', 'p3'], 
            var_name='to_be_removed1', value_name='breed_type')
twitter_archive_breed_type.drop('to_be_removed1', axis=1, inplace=True)

In [384]:
twitter_archive_breed_type.drop_duplicates(inplace=True) # remove duplicates

In [385]:
twitter_archive_breed=twitter_archive_breed_type.copy() # make copy

In [386]:
# put 'p1_dog', 'p2_dog', 'p3_dog' into one column called breed using melt method
twitter_archive_breed=twitter_archive_breed_type.melt(id_vars=['tweet_id', 'timestamp', 'dog_name', 'dog_stage','retweet_count', 'favorite_count', 
                 'rating_numerator', 'rating_denominator','p1_conf','p2_conf', 'p3_conf', 
                                                          'image_link', 'number_of_images',
                     'tweet_text', 'expanded_urls', 'breed_type'], 
            value_vars=['p1_dog', 'p2_dog', 'p3_dog'], 
            var_name='to_be_removed2', value_name='breed')
twitter_archive_breed.drop('to_be_removed2', axis=1, inplace=True)

In [387]:
twitter_archive_breed.drop_duplicates(subset='tweet_id',inplace=True) # keep only the first row per tweet (p1 paired with p1_dog)

In [388]:
# select only those which are dog breed
breed=twitter_archive_breed[twitter_archive_breed.breed==True].breed_type.value_counts()

In [None]:
breed

In [None]:
color=['green', 'red', 'yellow', 'blue', 'black', 'orange', 'violet', 'brown', 'indigo', 'pink']
# preview the breeds that appear more than 27 times
breed[breed>27].reset_index()


In [None]:
plt.rcParams["figure.figsize"] = [10, 6]
plt.rcParams['figure.facecolor'] = 'white'  # Set figure background to white

# Filter and reset index
filtered_breed = breed[breed > 27].reset_index()
filtered_breed.columns = ['breed', 'count']  # breed name and how often it was predicted

# Create the barplot
catplot = sb.catplot(
    data=filtered_breed, 
    kind='bar', 
    y='breed', 
    x='count', 
    palette='viridis', 
    height=6,  # Adjust height of the plot
    aspect=1.5  # Adjust aspect ratio
)

# Add title and labels
catplot.ax.set_title('Top 10 Dog Breeds', fontsize=16, fontweight='bold')
catplot.ax.set_xlabel('Frequency', fontsize=12)
catplot.ax.set_ylabel('Dog Breed', fontsize=12)

plt.savefig('top10dogs.jpg', format='jpg', dpi=300, bbox_inches='tight')

# Display the plot
plt.show()


In terms of the number of image predictions, the most common dog breed is the golden retriever. The Labrador retriever is second, followed by the Pembroke and then the Chihuahua.
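
The chart selects breeds with `breed > 27`, a cut-off that is tied to this particular dataset. As a hedged alternative, the sketch below picks the top N directly: `value_counts()` sorts by frequency in descending order, so `head(n)` keeps the n most common labels. The series `labels` and its values are made up for illustration.

```python
import pandas as pd

# Made-up breed labels for illustration only; not the real prediction data.
labels = pd.Series(
    ['golden_retriever'] * 4 + ['pembroke'] * 2 + ['chihuahua'] * 3 + ['pug']
)

# value_counts() returns counts sorted from most to least frequent,
# so head(3) keeps the three most common breeds.
top_n = labels.value_counts().head(3)
print(top_n)
```

In the notebook, `breed.head(10)` would play the same role as the threshold filter without depending on a hard-coded count.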