# WeRateDogs Twitter 

---

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#question">Question</a></li>    
<li><a href="#gathering">Data Gathering</a></li>
<li><a href="#assessing">Data Assessment</a></li>
<li><a href="#cleaning">Data Cleaning</a></li>
<li><a href="#analyzevisualize">Data Analysis and Visualisations</a></li>
<li><a href="#report">Final Report</a></li>
</ul>

---

<a id='intro'></a>
## Introduction

The goal is to wrangle WeRateDogs Twitter data to create interesting and trustworthy analysis and visualizations. 

Wrangling activities include:
* Gathering data from 3 different sources (abbreviated as DF1, DF2, DF3).
* Assessing data to identify quality and tidiness issues.
* Cleaning data, which consists of three steps: define, code, test.
* Storing, analysing, and visualising interesting insights.
* Reporting on wrangling and data analysis efforts. 

---

<a id='question'></a>
## Question

I'm neither a Twitter user nor a fan, so I have no strong prior relationship with this dataset. Still, I came up with the following questions that I'd like to investigate:
* **What is the success rate for image prediction?**
* **What is the growth rate for the WeRateDogs user (by counting the followers)?**
* **Is there a correlation between tweeting and month (i.e. is tweeting affected by the season)?**
* **Which tweet characteristics predict high retweeting?**

---

<a id='gathering'></a>
## Data Gathering

In [None]:
# import all packages and set plots to be embedded inline
import pandas as pd
from pandas import DataFrame as df
import numpy as np
import os
import requests
import json
import tweepy
from tweepy import OAuthHandler
from timeit import default_timer as timer
import seaborn as sb
import matplotlib.pyplot as plt

### DF1 - Load Image prediction data (tsv format)

The tweet image predictions, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. This file is hosted on Udacity's servers and is downloaded programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv

Dictionary of the attributes:
* p1 is the algorithm's #1 prediction for the image in the tweet → golden retriever
* p1_conf is how confident the algorithm is in its #1 prediction → 95%
* p1_dog is whether or not the #1 prediction is a breed of dog → TRUE
* p2 is the algorithm's second most likely prediction → Labrador retriever
* p2_conf is how confident the algorithm is in its #2 prediction → 1%
* p2_dog is whether or not the #2 prediction is a breed of dog → TRUE
etc.

In [None]:
# Create the folder if it does not exist yet
folder_name = 'Dataset'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

In [None]:
# Download the file from the URL
url='https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
r = requests.get(url)
r.raise_for_status()  # stop early if the download failed

In [None]:
# Write the downloaded content to a file named after the last part of the URL
with open(os.path.join(folder_name, url.split("/")[-1]), mode = 'wb') as file:
    file.write(r.content)

In [None]:
# Check Dataset repository
os.listdir(folder_name)

In [None]:
# Load Image Prediction into Pandas DataFrame
df1=pd.read_csv('Dataset/image-predictions.tsv', sep='\t')
df1.sample(5)

### DF2 - Load Twitter Archive data (csv format)

The WeRateDogs Twitter archive file is provided, so I downloaded it manually and load it into a pandas DataFrame.

In [None]:
df2=pd.read_csv('Dataset/twitter-archive-enhanced.csv')
df2.sample()

### DF3 - Load Twitter API data (json format)

Using the tweet IDs in the WeRateDogs Twitter archive, I query the Twitter API for each tweet's JSON data using Python's Tweepy library and store each tweet's entire set of JSON data in a file called tweet_json.txt. Each tweet's JSON data is written to its own line. I then read this .txt file line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count.

Additional information: [pandas API reference](https://pandas.pydata.org/pandas-docs/stable/api.html) for detailed usage information.

#### Query via API and save response to txt
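
The query code itself is not shown in this notebook. The following is only a minimal sketch of the "one JSON document per line" file layout described above. The names `dump_tweets` and `fake_fetch` are hypothetical; with Tweepy, the `fetch_tweet` argument would presumably wrap a call such as `api.get_status(tweet_id, tweet_mode='extended')._json`, which is an assumption and not taken from this notebook.

```python
import json
import os
import tempfile


def dump_tweets(tweet_ids, fetch_tweet, path):
    """Write one JSON document per line; return the IDs that could not be fetched."""
    failed = []
    with open(path, "w") as outfile:
        for tweet_id in tweet_ids:
            try:
                tweet_json = fetch_tweet(tweet_id)  # expected to return a dict
            except Exception:
                # e.g. deleted or protected tweets; remember the ID and move on
                failed.append(tweet_id)
                continue
            outfile.write(json.dumps(tweet_json) + "\n")
    return failed


# Stand-in for the real API call, used only to demonstrate the file layout
def fake_fetch(tweet_id):
    if tweet_id == 2:
        raise LookupError("tweet not available")
    return {"id": tweet_id, "retweet_count": 10 * tweet_id, "favorite_count": 0}


demo_path = os.path.join(tempfile.mkdtemp(), "tweet_json.txt")
failed_ids = dump_tweets([1, 2, 3], fake_fetch, demo_path)

with open(demo_path) as infile:
    lines = infile.read().splitlines()
print(len(lines), failed_ids)
```

Keeping the list of failed IDs makes it easy to see later why the API data has fewer records than the archive.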

#### Read API data line by line

Here we read the txt file line by line and store values into a list object. 

In [None]:
# Read line by line and save into a list
count = 0
tweets_data = []
with open('Dataset/tweet_json.txt', 'r') as inputfile:
    for line in inputfile:
        try:
            tweets_data.append(json.loads(line))
        except json.JSONDecodeError:
            # Count lines that are not valid JSON and move on
            count += 1
print('Failed:{}'.format(count))

#### Check json structure
The pandas DataFrame constructor lets me load the list object and get a quick overview of the data structure and data types. 
However, loading the data this way does not comply with the tidy data requirements, because nested objects such as `entities` and `user` end up inside single columns. The requirements are:
* Each variable forms a column
* Each observation forms a row
* Each observational unit forms a table

In [None]:
review=pd.DataFrame(tweets_data)

In [None]:
# Have an overview of data structure and data types
review.head(2)

In [None]:
# An overview of a specific variable
review.entities[0]

#### Select the attributes and save into DF3

Here I specify the attributes of interest and load them into a pandas DataFrame.

In [None]:
# Add a dummy dictionary when the hashtag list is empty, so that the first hashtag can be read for every tweet.
for tweet in tweets_data:
    if tweet['entities']['hashtags'] == []:
        tweet['entities']['hashtags'] = [{'text': '', 'indices':[]}]

In [None]:
# Grab the specified attributes and load into pandas DataFrame
d = []
for tweet in tweets_data:
    d.append({
        'tweet_id': tweet['id'],
        'tweet_favourite_count' : tweet['favorite_count'],
        'tweet_hashtags' : tweet['entities']['hashtags'][0]['text'],
        'retweet_count' : tweet['retweet_count'],
        'retweeted' : tweet['retweeted'],
        'user_id': tweet['user']['id'],
        'user_followers_count': tweet['user']['followers_count'],
        'user_friends_count': tweet['user']['friends_count'],
        'user_listed_count': tweet['user']['listed_count'],
        'user_favourites_count': tweet['user']['favourites_count'],
        'user_statuses_count': tweet['user']['statuses_count'],
    })

df3=pd.DataFrame(d)
df3.head(2)
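
As an alternative to building the list of dictionaries by hand, the nested JSON can be flattened with `pd.json_normalize`. This is a sketch that assumes pandas 1.0 or later. The two sample tweets are made up and only mimic the nested layout used above.

```python
import pandas as pd

# Two made-up tweets that mimic the nested layout of the API response
sample_tweets = [
    {"id": 1, "favorite_count": 5, "retweet_count": 2,
     "entities": {"hashtags": [{"text": "dogs", "indices": [0, 5]}]},
     "user": {"id": 99, "followers_count": 1000}},
    {"id": 2, "favorite_count": 7, "retweet_count": 3,
     "entities": {"hashtags": []},
     "user": {"id": 99, "followers_count": 1001}},
]

# Nested dictionaries become dotted column names such as "user.followers_count"
flat = pd.json_normalize(sample_tweets)

# Lists are left untouched, so the first hashtag still has to be picked by hand
flat["tweet_hashtags"] = flat["entities.hashtags"].apply(
    lambda tags: tags[0]["text"] if tags else ""
)

demo_df3 = flat[["id", "favorite_count", "retweet_count", "tweet_hashtags",
                 "user.id", "user.followers_count"]].rename(columns={
    "id": "tweet_id",
    "user.id": "user_id",
    "user.followers_count": "user_followers_count",
})
print(demo_df3)
```

This variant does not modify `tweets_data` in place, so no dummy hashtag entry is needed.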

### Data Gathering Summary 
> * All 3 data sources are loaded
> * DataFrames are named accordingly: DF1, DF2, DF3. 

---

<a id='assessing'></a>
## Data Assessment

After gathering, each of the above pieces of data is assessed visually and programmatically. 
There are two types of unclean data:
* **Dirty data**, also known as **low quality data**. Low quality data has **content issues**.
* **Messy data**, also known as **untidy data**. Untidy data has **structural issues**.

Identified quality and tidiness issues are documented in the summary. 

### DF1 - Image prediction DataFrame

In [None]:
# Assessing data visually
df1.sample(3)

In [None]:
# Assessing data programmatically
df1.info()

In [None]:
# Counting unique values
df1.nunique()

In [None]:
# Checking quantitative information
df1.describe()

#### Summary for DF1 Assessing :
> **Overview:**
 * At least 75% of the records have only 1 image. 
 * P1 confidence level is very high. 

> **Quality issues:** 
* It could be interesting to analyse by image format, which is currently embedded in the image link.

>**Tidiness issues:** 
* The prediction columns repeat for each of the three predictions. 

### DF2 - Twitter Archive DataFrame

In [None]:
# Assessing data visually
df2.sample(3)

In [None]:
# Assessing data programmatically
df2.info()

In [None]:
# Counting unique values
df2.nunique()

In [None]:
# Listing source categories
df2.source.unique()

In [None]:
# Checking quantitative information
df2.describe()

#### Summary for DF2 Assessing : 
> **Overview:**
 * There are 2356 tweets and tweet_id is the unique identifier.
 * There are 4 source categories. 
 * rating_numerator and rating_denominator values have outliers.

> **Quality issues:** 
* Source information is presented as an HTML link. 
* It was mentioned that rating_numerator and rating_denominator values might have errors.
* It was mentioned that Dogtionary attributes might have errors. 
* Since I'd like to have at least one plot with a timeline, I need a user-friendly date format.

>**Tidiness issues:** 
* Dogtionary types (doggo, floofer, pupper, puppo) are spread across separate columns.

### DF3 - Twitter API DataFrame

In [None]:
# Assessing data visually
df3.sample(3)

In [None]:
# Assessing data programmatically
df3.info()

In [None]:
# Counting unique values
df3.nunique()

In [None]:
# Checking quantitative information
df3.describe()

#### Summary for DF3 Assessing :
> **Quality issues:** 
* The retweeted value is always False, which is wrong.  
* user_friends_count and user_statuses_count always have the same value, so I'm not interested in these columns.

>**Tidiness issues:** 
* User information should be separated from the Twitter data, since the user is always the same. 

### Check duplicated columns

In [None]:
# Check the column duplicates within all 3 DataFrames
all_columns = pd.Series(list(df1) + list(df2) + list(df3))
all_columns[all_columns.duplicated()]

---

<a id='cleaning'></a>
## Data Cleaning

Each data cleaning activity consists of three steps: 
* Define
* Code
* Test

### Create a working copy

In [None]:
df1_c = df1.copy() # For the Image prediction
df2_c = df2.copy() # For the Archived Twitter
df3_c = df3.copy() # For the Twitter API data

### Join Twitter Archive and Twitter API Data 

#### 1. Join Twitter DataFrames ( Tidiness )

Join both DataFrames on the attribute tweet_id, which is a unique record identifier. I'm using an inner join because I'm interested only in the records that are in both DataFrames. 

In [None]:
# Code, i.e. join the DataFrames (merge already returns a DataFrame)
t_merge = df2_c.merge(df3_c, on='tweet_id', how='inner')

In [None]:
# Test visually
t_merge.head(3)

In [None]:
# Test by counting the records
print('Count of records in Twitter Archive is : {}'.format(df2_c['tweet_id'].count()))
print('Count of records in Twitter API is : {}'.format(df3_c['tweet_id'].count()))
print('Count of records in the joined file is : {}'.format(t_merge['tweet_id'].count()))
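
The record counts above can be complemented by checking which side each unmatched record comes from. This is a small sketch on made-up data. The `validate` and `indicator` options are standard `DataFrame.merge` parameters, and the frame names `archive` and `api` are illustrative only.

```python
import pandas as pd

archive = pd.DataFrame({"tweet_id": [1, 2, 3, 4], "text": ["a", "b", "c", "d"]})
api = pd.DataFrame({"tweet_id": [2, 3, 4, 5], "retweet_count": [10, 20, 30, 40]})

# validate raises MergeError if tweet_id is duplicated on either side
inner = archive.merge(api, on="tweet_id", how="inner", validate="one_to_one")

# An outer join with indicator=True labels where each row came from
outer = archive.merge(api, on="tweet_id", how="outer", indicator=True)
origin_counts = outer["_merge"].value_counts()

print(len(inner))
print(origin_counts)
```

The `both` count should equal the size of the inner join, while `left_only` shows how many archive tweets have no API data.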

#### 2. Create a new attribute to indicate twitter Year and Month ( Quality )
I intend to use timeline information in the visualisations, so I need a more user-friendly date format. The lowest granularity I'm interested in is the month. 

By using a regular expression, I'm going to extract year and month information with the extract function. 

In [None]:
# Code. Extract year and month information
Timestamp = t_merge['timestamp'].str.extract(r'(\d+)-(\d+)').astype(str)

In [None]:
# Code. Concatenate year and month values with separator -
t_merge['year_month'] = Timestamp[0] + "-" + Timestamp[1]

In [None]:
# Test
t_merge['year_month'].unique()
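
As an alternative to the regular expression, the timestamp can be parsed as a date and formatted. This is a sketch that assumes the timestamps look like `2017-08-01 16:23:56 +0000`; the sample values are made up.

```python
import pandas as pd

timestamps = pd.Series([
    "2017-08-01 16:23:56 +0000",
    "2016-12-25 09:00:00 +0000",
])

# Parse the strings as timezone-aware datetimes, then format as "YYYY-MM"
parsed = pd.to_datetime(timestamps, utc=True)
year_month = parsed.dt.strftime("%Y-%m")
print(year_month.tolist())
```

Parsing to a real datetime also makes other granularities (day, week) available later without another regular expression.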

### Clean Image Prediction File

#### 1. Add image format ( Quality )

From the jpg_url I want to extract the image format and save it into a separate column. I split the string on "." and grab the last value. 

In [None]:
# A code to extract image format
df1_c['im_type'] = df1_c['jpg_url'].str.split(".").str[-1]

In [None]:
# Check unique image formats
df1_c['im_type'].unique()

#### 2. Transform the columns ( Tidiness )
The prediction columns p(x), p(x)_conf, and p(x)_dog should be merged, and a new column with the prediction number should be added. 


First, I create 3 separate DataFrames, each with a new column prd set to p1, p2, or p3. Afterwards, I concatenate them into a single DataFrame. 

In [None]:
df1_c.head(3)

In [None]:
# Create a dataframe for prediction 1
prd1 = pd.DataFrame(df1_c, columns=['tweet_id','img_num', 'im_type', 'p1', 'p1_conf', 'p1_dog'])
prd1['prd']='p1'
prd1 = prd1.rename(columns={'tweet_id':'tweet_id','img_num':'img_num', 'im_type':'im_type', 
                            'prd':'prd', 'p1':'guess', 'p1_conf':'conf', 'p1_dog':'dog'})

In [None]:
# Test visually
prd1.sample(3)

In [None]:
# Create a dataframe for prediction 2
prd2 = pd.DataFrame(df1_c, columns=['tweet_id','img_num', 'im_type', 'p2', 'p2_conf', 'p2_dog'])
prd2['prd']='p2'
prd2 = prd2.rename(columns={'tweet_id':'tweet_id','img_num':'img_num', 'im_type':'im_type', 
                            'prd':'prd', 'p2':'guess', 'p2_conf':'conf', 'p2_dog':'dog'})

In [None]:
# Test visually
prd2.sample(3)

In [None]:
# Create a dataframe for prediction 3
prd3 = pd.DataFrame(df1_c, columns=['tweet_id','img_num', 'im_type', 'p3', 'p3_conf', 'p3_dog'])
prd3['prd']='p3'
prd3 = prd3.rename(columns={'tweet_id':'tweet_id','img_num':'img_num', 'im_type':'im_type', 
                            'prd':'prd', 'p3':'guess', 'p3_conf':'conf', 'p3_dog':'dog'})

In [None]:
# Test visually
prd3.sample(3)

In [None]:
# Concatenate all 3 DataFrames with a new attribute "prd" indicating whether it's the prediction 1, 2, or 3
frames = [prd1, prd2, prd3]
prediction_clean = pd.concat(frames, ignore_index=True)  # avoid repeated index values

In [None]:
# Check the results
prediction_clean.sample(5)
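
The three near-identical cells above could also be written as a loop. This is a sketch on a made-up two-row table that keeps only `tweet_id` and the prediction columns for brevity; the names `wide` and `long` are illustrative.

```python
import pandas as pd

wide = pd.DataFrame({
    "tweet_id": [1, 2],
    "p1": ["golden_retriever", "pug"], "p1_conf": [0.95, 0.60], "p1_dog": [True, True],
    "p2": ["Labrador_retriever", "bath_towel"], "p2_conf": [0.01, 0.20], "p2_dog": [True, False],
    "p3": ["kuvasz", "toy_poodle"], "p3_conf": [0.005, 0.10], "p3_dog": [True, True],
})

frames = []
for prd in ["p1", "p2", "p3"]:
    # Select one prediction's columns and give them common names
    part = wide[["tweet_id", prd, prd + "_conf", prd + "_dog"]].rename(columns={
        prd: "guess", prd + "_conf": "conf", prd + "_dog": "dog",
    })
    part["prd"] = prd  # remember which prediction the row came from
    frames.append(part)

long = pd.concat(frames, ignore_index=True)
print(long.shape)
```

Each tweet now contributes three rows, one per prediction, which is the tidy layout described in this section.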

#### 3. Save the clean file

In [None]:
# Create the folder if it does not exist yet
folder_name = 'Clean_dataset'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

In [None]:
# Save into a csv format
prediction_clean.to_csv('Clean_dataset/prediction_clean.csv', sep=',')

### Create Twitter User DataFrame

#### 1. Create a new DataFrame ( Tidiness & Quality )
I'll grab the interesting user attributes and save them into a separate DataFrame. The user is the same for all the tweets, but I'm interested in the growth of followers. 

In [None]:
# Check the attributes
t_merge.head(3)

In [None]:
# Create a new User Dataframe with user information and twitter dates
user=pd.DataFrame(t_merge, columns=['timestamp', 'year_month', 'user_favourites_count', 'user_followers_count', 
                                    'user_id', 'user_listed_count'])

In [None]:
# Test visually
user.head(2)

In [None]:
# Count duplicated records (sum counts the True values; count would return the number of rows)
user.duplicated(['user_favourites_count', 'user_followers_count', 
                 'user_id', 'user_listed_count']).sum()

In [None]:
# Remove duplicates and leave only unique values
user.drop_duplicates(['user_favourites_count', 'user_followers_count', 
                      'user_id', 'user_listed_count', 'year_month'], inplace=True)

In [None]:
# Count unique user records
user['user_id'].count()

In [None]:
# Test visually
user.head(3)

In [None]:
# Check the insights
followers_min = user['user_followers_count'].min()
followers_max = user['user_followers_count'].max()
growth_pct = round((followers_max / followers_min - 1) * 100, 2)
print('The followers count started with {}'.format(followers_min))
print('The followers count ended with {}'.format(followers_max))
print('The follower count between {} and {} increased by {}%'.format(
    user['year_month'].min(), user['year_month'].max(), growth_pct))

#### 2. Save cleaned User.csv file

In [None]:
# Save into a csv format
user.to_csv('Clean_dataset/user_clean.csv', sep=',')

### Clean Twitter information

#### 1. Clean dataframe by removing unnecessary attributes ( Quality )
I want to reduce the list of attributes and keep only the ones I'm going to clean or use for visualisations. All the other attributes are dropped from this analysis. 

In [None]:
t_merge.shape

In [None]:
# Check the attributes and identify the ones to be dropped
t_merge.info()

In [None]:
# Drop the attributes that I'm not going to use for the analysis
t_merge.drop(columns=['user_favourites_count', 'user_followers_count', 'user_friends_count', 'user_id', 
                      'user_listed_count', 'user_statuses_count', 'in_reply_to_status_id',
                      'in_reply_to_user_id', 'expanded_urls', 'retweeted_status_id', 
                      'retweeted_status_user_id'], inplace = True)

In [None]:
# Test programmatically 
t_merge.shape

In [None]:
# Test visually
t_merge.head(3)

#### 2. Extract source name from the link ( Quality )
There are 4 types of sources, each presented as a link. I split the string on ">" and "<" to extract the source name from the link. 

In [None]:
# Check the source values
t_merge.source.unique()

In [None]:
# Code, i.e. split the string and grab the source name
t_merge['source'] = t_merge['source'].str.split(">").str[1].str.split("<").str[0]

In [None]:
# Test
t_merge['source'].unique()
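
The same extraction can be sketched with a regular expression instead of two splits. The two source strings below are made-up examples of the link format, not values copied from the dataset.

```python
import pandas as pd

source = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>',
])

# Capture everything between the first ">" and the following "<"
source_name = source.str.extract(r">([^<]+)<", expand=False)
print(source_name.tolist())
```

With `expand=False` and a single capture group, `str.extract` returns a Series rather than a one-column DataFrame.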

#### 3. Clean rating information ( Quality )

As stated in the project description, "The **ratings** probably aren't all correct." Rating information is part of the text in the format "digit/digit". Here are some examples of the text with the rating at the end; however, as you can see, the same pattern might appear more than once in the text. 

In [None]:
print('Text examples: \n * {} or \n * {} or \n * {}'.format(t_merge['text'][2332], t_merge['text'][2333], t_merge['text'][2334]))

Every record has a text, and I assume that every text contains rating information.  

In [None]:
# Check if all the tweets have a text
print('Count of records with empty text is {}'.format (t_merge['text'].isnull().sum()))
# Check if there are empty ratings
print('Count of records with empty rating is {}/{}'.format ((t_merge['rating_numerator'].isnull().sum()), (t_merge['rating_denominator'].isnull().sum())))

In order to grab the last "digit/digit" pattern in the string, I reverse the string; before that, I remove the link at the end of the string.
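
To illustrate why reversing works, here is a small sketch on two made-up tweet texts. In the reversed text, the first match is the last rating of the original text, and its two groups come out in swapped order with reversed digits.

```python
import pandas as pd

texts = pd.Series([
    "This is Bo. He ate 1/2 of my shoe. 13/10 https://t.co/abc",
    "Meet Sam. 12/10 would pet https://t.co/def",
])

# Cut off the link, then reverse so that the last rating comes first
reversed_text = texts.str.split("http").str[0].str[::-1]

# In the reversed text the first group is the reversed denominator
parts = reversed_text.str.extract(r"(\d+)/(\d+)")
denominator = parts[0].str[::-1].astype(int)
numerator = parts[1].str[::-1].astype(int)

print(numerator.tolist(), denominator.tolist())
```

For the first text the earlier "1/2" is ignored and the rating 13/10 is picked up.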

In [None]:
# Code. Remove the link and reverse the string. 
rating = t_merge['text'].str.split('http').str[0].str[::-1]
# Grab the first pattern (the last one in the original text)
rating = rating.str.extract(r'(\d+)/(\d+)')
# Rebuild the rating string and reverse it back
full_rating = (rating[0]+'/'+rating[1]).str[::-1]

In [None]:
# Test by comparing the ratings
print('Discrepancies with numerator values : {}'.format((t_merge[t_merge['rating_numerator'] != full_rating.str.split('/').str[0].astype(int)]['tweet_id'].count())))
print('Discrepancies with denominator values : {}'.format( (t_merge[t_merge['rating_denominator'] != full_rating.str.split('/').str[1].astype(int)]['tweet_id'].count())))

In [None]:
# Overwrite with the new values
t_merge['rating_numerator']=full_rating.str.split('/').str[0].astype(int)
t_merge['rating_denominator']=full_rating.str.split('/').str[1].astype(int)
t_merge.head(3)

#### 4. Clean Dogtionary information ( Quality )

As stated in the project description, "The ratings probably aren't all correct. Same goes for the dog names and **probably dog stages**".

I'm going to scan the text and check whether the categories are mentioned in it, store the result in new columns, and compare them with the old values. Eventually the old values are replaced by the new ones. 

In [None]:
# Code. Scan the text and fill in a boolean if it was mentioned in the text
t_merge['doggo1'] = t_merge['text'].str.contains('doggo')
t_merge['floofer1'] = t_merge['text'].str.contains('floofer')
t_merge['pupper1'] = t_merge['text'].str.contains('pupper')
t_merge['puppo1'] = t_merge['text'].str.contains('puppo')

In [None]:
# Code. Convert the boolean values to the standard ones so that I can do a comparison
t_merge['doggo1'] = t_merge['doggo1'].map({True:'doggo',False:'None'})
t_merge['floofer1'] = t_merge['floofer1'].map({True:'floofer',False:'None'})
t_merge['pupper1'] = t_merge['pupper1'].map({True:'pupper',False:'None'})
t_merge['puppo1'] = t_merge['puppo1'].map({True:'puppo',False:'None'})

In [None]:
# Test by comparing old and new values
print ('Mismatch for doggo : {}'.format(t_merge[t_merge['doggo1'] != t_merge['doggo']]['tweet_id'].count()))
print ('Mismatch for floofer : {}'.format(t_merge[t_merge['floofer1'] != t_merge['floofer']]['tweet_id'].count()))
print ('Mismatch for pupper : {}'.format(t_merge[t_merge['pupper1'] != t_merge['pupper']]['tweet_id'].count()))
print ('Mismatch for puppo : {}'.format(t_merge[t_merge['puppo1'] != t_merge['puppo']]['tweet_id'].count()))

In [None]:
# Old values are replaced by new ones
t_merge.drop(columns=['doggo','floofer','pupper','puppo'], inplace=True)
t_merge = t_merge.rename(columns={'doggo1':'doggo','floofer1':'floofer', 'pupper1':'pupper', 'puppo1':'puppo'})
t_merge.head(3)

#### 5. Concatenate Dogtionary information into one column ( Tidiness )

I'm going to merge the Dogtionary category columns into one column by concatenating the strings. 

In [None]:
# Code. Concatenate strings
t_merge.replace({'doggo':'None','floofer':'None','pupper':'None','puppo':'None'}, "", inplace=True)
t_merge['dogtionary'] = t_merge['doggo'] + " " + t_merge['floofer'] + " " + t_merge['pupper'] + " " + t_merge['puppo']

In [None]:
# check the result
t_merge['dogtionary'].unique()

In [None]:
# Remove additional spaces and assign np.nan in case of an empty string
t_merge['dogtionary'] = t_merge['dogtionary'].astype(str)
# Trim leading/trailing spaces and collapse runs of whitespace into one space
t_merge['dogtionary'] = t_merge['dogtionary'].str.strip()
t_merge['dogtionary'] = t_merge['dogtionary'].str.replace(r'\s+', ' ', regex=True)

t_merge.loc[t_merge['dogtionary'] == '', 'dogtionary'] = np.nan

In [None]:
# Check the results
t_merge['dogtionary'].unique()

In [None]:
# Drop the columns and test visually
t_merge.drop(columns = ['doggo', 'floofer', 'pupper', 'puppo'], inplace=True)
t_merge.head(3)
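
The concatenation and whitespace clean-up can also be sketched in one step by joining only the stages that are present. The data below is made up; it uses the string 'None' for a missing stage, as in the columns above.

```python
import pandas as pd

stages = ["doggo", "floofer", "pupper", "puppo"]
demo = pd.DataFrame({
    "doggo":   ["doggo", "None", "None"],
    "floofer": ["None", "None", "None"],
    "pupper":  ["None", "pupper", "None"],
    "puppo":   ["puppo", "None", "None"],
})

# Keep only the stages that are present and join them with a single space
demo["dogtionary"] = demo[stages].apply(
    lambda row: " ".join(value for value in row if value != "None"), axis=1
)
# Rows without any stage become missing values
demo["dogtionary"] = demo["dogtionary"].mask(demo["dogtionary"] == "")
print(demo["dogtionary"].tolist())
```

Because empty values are never joined, no extra spaces are created and no trimming is needed afterwards.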

#### 6. Correct Retweeted value ( Quality )
The retweeted value is always False, which is incorrect. I'm going to add the following logic: if the retweet timestamp is not null, then retweeted should be True. 

In [None]:
# Check the unique values
t_merge['retweeted'].unique()

In [None]:
# Code to replace by True when the timestamp is not na
t_merge['retweeted'] = pd.notna(t_merge['retweeted_status_timestamp'])

In [None]:
# Test programmatically
print('Count of retweeted_status_timestamp : {}'.format(t_merge['retweeted_status_timestamp'].count()))
print('Count of retweeted status : {}'.format(t_merge['retweeted'].sum()))

#### 7. Save into a csv format

In [None]:
t_merge.to_csv('Clean_dataset/twitter_archive_master.csv', sep=',')

### Summary of Data Cleaning
> All cleaned files can be found in 'Clean_dataset' repository

In [None]:
os.listdir('Clean_dataset')

---

<a id='analyzevisualize'></a>
## Data Analysis and Visualisations

Here I come back to the questions raised at the beginning of the work and visualise the answers:
* What is the success rate for image prediction?
* What is the growth rate for the WeRateDogs user (by counting the followers)?
* Is there a correlation between tweeting and month (i.e. is tweeting affected by the season)?
* Which tweet characteristics predict high retweeting?


### What is the success rate for image prediction?




In [None]:
# Compare the confidence distributions of the three predictions
prediction_clean.query('prd=="p1"')['conf'].hist(label='Prediction 1', color='g',alpha=0.5, bins=30);
prediction_clean.query('prd=="p2"')['conf'].hist(label="Prediction 2", color='r',alpha=0.5, bins=30);
prediction_clean.query('prd=="p3"')['conf'].hist(label="Prediction 3", color='b',alpha=0.5, bins=30);
plt.legend()
plt.title('Confidence of image predictions')
plt.xlabel('Confidence');

In [None]:
prediction_clean.head()
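
One possible reading of "success rate" is the share of predictions that are a dog breed, per prediction rank, alongside the mean confidence. This is a sketch on made-up values that uses the column names of `prediction_clean`. The frame `demo_predictions` and the result columns `dog_share` and `mean_conf` are illustrative.

```python
import pandas as pd

demo_predictions = pd.DataFrame({
    "prd":  ["p1", "p1", "p1", "p1", "p2", "p2", "p2", "p2"],
    "conf": [0.90, 0.60, 0.80, 0.30, 0.05, 0.20, 0.10, 0.25],
    "dog":  [True, True, False, True, True, False, False, True],
})

# Mean of a boolean column is the share of True values
summary = demo_predictions.groupby("prd").agg(
    dog_share=("dog", "mean"),
    mean_conf=("conf", "mean"),
)
print(summary)
```

Whether "is a dog breed" counts as success is an assumption; the data contains no ground-truth breed labels to compare against.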

### What is the growth rate for the WeRateDogs user (by counting the followers)?

### Is there a correlation between tweeting and month (i.e. is tweeting affected by the season)?
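
A first step towards this question could be counting tweets per calendar month. This is a sketch on made-up values in the `year_month` format ("YYYY-MM") created during cleaning.

```python
import pandas as pd

year_month = pd.Series(["2016-01", "2016-01", "2016-02", "2017-01", "2017-02", "2017-02"])

# The calendar month is the part after the dash
month = year_month.str.split("-").str[1]

# Number of tweets per calendar month, summed over all years
tweets_per_month = month.value_counts().sort_index()
print(tweets_per_month.to_dict())
```

Counts summed over years can be skewed when the first or last year is only partly covered, so a per-year breakdown may be worth plotting as well.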

### Which tweet characteristics predict high retweeting?

---

<a id='report'></a>
## Final Report
Create a 300-600 word written report called wrangle_report.pdf or wrangle_report.html that briefly describes your wrangling efforts. This is to be framed as an internal document.

Create a 250-word-minimum written report called act_report.pdf or act_report.html that communicates the insights and displays the visualization(s) produced from your wrangled data. This is to be framed as an external document, like a blog post or magazine article, for example.

Both of these documents can be created in separate Jupyter Notebooks using the Markdown functionality of Jupyter Notebooks, then downloading those notebooks as PDF files or HTML files. You might prefer to use a word processor like Google Docs or Microsoft Word, however.