## Introduction

<p>This project wrangles (gathers, assesses, and cleans) real-world data from a range of sources and in a variety of formats, then analyzes and visualizes it using Python and its libraries and/or SQL.</p>

<p>The dataset to be wrangled, analyzed, and visualized "is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because 'they're good dogs Brent.'" (Udacity Project Overview)</p>

## Table of Contents
<ul>
<li><a href="#gathering">Data Gathering</a></li>
<li><a href="#assessment">Data Assessment</a></li>
<li><a href="#cleaning">Data Cleaning</a></li>
<li><a href="#storage">Data Storage</a></li>
<li><a href="#analysis">Analyses and Visualization</a></li>
</ul>

In [1]:
#importing all necessary libraries to complete this project
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import json
import seaborn as sns
import os
import requests
import re
from functools import reduce
import tweepy
from tweepy import OAuthHandler
from timeit import default_timer as timer
%matplotlib inline

<a id = 'gathering'></a>
## Data Gathering

The first table (twitter-archive-enhanced.csv) is downloaded manually from the internet and loaded into a pandas DataFrame programmatically.

In [2]:
#load the 'twitter-archive-enhanced.csv' table into a pandas data frame
twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')

The second table is downloaded programmatically from Udacity's server with the requests library, written to a local folder (image-predictions), and then loaded into a pandas DataFrame.

In [6]:
#create a folder called 'image-predictions' if the folder does not exist already
folder_name = 'image-predictions'
if not os.path.exists(folder_name):
    os.mkdir(folder_name)

In [9]:
#get the image-predictions data through its url and using the python requests library
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
#write the response of the above request into image-predictions.tsv
with open(os.path.join(folder_name, url.split('/')[-1]), mode = 'wb') as file:
    file.write(response.content)
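
The cell above writes whatever the server returns, so an error page would also be saved as if it were the data. As a minimal sketch, the download could check the HTTP status first. The helper names `save_download` and `download` are hypothetical and not part of the original notebook; `raise_for_status()` and the `timeout` argument are standard parts of the requests library.

```python
import os


def save_download(content: bytes, url: str, folder: str) -> str:
    """Write `content` into `folder`, naming the file after the last URL segment."""
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    path = os.path.join(folder, url.split('/')[-1])
    with open(path, mode='wb') as file:
        file.write(content)
    return path


def download(url: str, folder: str) -> str:
    """Fetch `url` and save the body, failing loudly on an HTTP error status."""
    import requests  # imported here so save_download works without requests installed

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return save_download(response.content, url, folder)
```

Calling `download(url, folder_name)` with the variables defined above would then replace the two cells, under the assumption that a failed request should stop the notebook rather than write a bad file.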

In [10]:
#load the image-predictions.tsv file into a pandas data frame
image_predictions = pd.read_csv('image-predictions/image-predictions.tsv', sep='\t')

The third table is downloaded from the internet and saved locally as 'tweet-json.txt', read line by line into a Python list, and then loaded into a pandas DataFrame.

In [None]:
# read the tweet-json.txt file line by line and get the 'id_str', 'retweet_count', and 'favorite_count', then store in a python list called df_list
df_list = []
with open('tweet-json.txt') as file:
    for line in file:
        data = json.loads(line)
        id_str = data.get('id_str')
        retweet_count = data.get('retweet_count')
        favorite_count = data.get('favorite_count')
        df_list.append({
            'id_str': id_str, 
            'retweet_count': retweet_count, 
            'favorite_count': favorite_count 
        })


In [None]:
#load df_list into a pandas data frame
tweet_data = pd.DataFrame(df_list, columns=['id_str', 'retweet_count', 'favorite_count'])
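
The line-by-line read above can be tried on a small in-memory sample before running it on the full file. This is only a sketch: the sample records and the `parse_tweet_lines` helper are hypothetical, and the real file is assumed to hold one JSON object per line. Skipping blank lines is an added safeguard that the original cell does not have.

```python
import io
import json

import pandas as pd

FIELDS = ['id_str', 'retweet_count', 'favorite_count']


def parse_tweet_lines(lines):
    """Return one dict per non-blank JSON line, keeping only FIELDS."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # a blank line would make json.loads raise an error
        data = json.loads(line)
        records.append({field: data.get(field) for field in FIELDS})
    return records


# Hypothetical sample standing in for tweet-json.txt
sample = io.StringIO(
    '{"id_str": "1", "retweet_count": 5, "favorite_count": 9, "lang": "en"}\n'
    '\n'
    '{"id_str": "2", "retweet_count": 0, "favorite_count": 3}\n'
)
sample_df = pd.DataFrame(parse_tweet_lines(sample), columns=FIELDS)
```

Because any iterable of lines works, an open file handle can be passed to `parse_tweet_lines` in place of the `StringIO` object.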

<a id = 'assessment'></a>
## Data Assessment

#### Visual Assessment

In [None]:
#displays first 25 observations
twitter_archive.head(25)

In [None]:
#displays 25 random observations from the table
twitter_archive.sample(25)

In [None]:
#displays last 25 observations on the table
twitter_archive.tail(25)

In [None]:
#displays first 25 observations on the table
image_predictions.head(25)

In [None]:
#displays 25 random observations from the table
image_predictions.sample(25)

In [None]:
#displays the last 25 observations on the table
image_predictions.tail(25)

In [None]:
#displays first 25 observations on the table.
tweet_data.head(25)

In [None]:
#displays 25 random observations from the table.
tweet_data.sample(25)

In [None]:
#displays the last 25 observations on the table.
tweet_data.tail(25)

#### Programmatic Assessment

In [None]:
#displays a summary information about the table, including numbers of columns, rows, and non-empty values, and the data type of each variable
twitter_archive.info()

In [None]:
#displays all duplicated observations
twitter_archive[twitter_archive.duplicated()]

In [None]:
#returns the number of occurrences of each value in the `source` column
twitter_archive['source'].value_counts()

In [None]:
#returns the number of occurrences of each value in the `name` column
twitter_archive['name'].value_counts()

In [None]:
#returns 25 random values from the `name` column
twitter_archive['name'].sample(25)

In [None]:
#returns the number of occurrences of each value in the `rating_numerator` column
twitter_archive['rating_numerator'].value_counts()

In [None]:
#returns the number of occurrences of each value in the `rating_denominator` column
twitter_archive['rating_denominator'].value_counts()

In [None]:
#displays a summary information about the table, including numbers of columns, rows, and non-empty values, and the data type of each variable
image_predictions.info()

In [None]:
#return the values of the `img_num` column sorted in an ascending order
image_predictions['img_num'].sort_values()

In [None]:
#returns 10 random samples of values from the `jpg_url` column
image_predictions['jpg_url'].sample(10)

In [None]:
#displays a summary information about the table, including numbers of columns, rows, and non-empty values, and the data type of each variable
tweet_data.info()

In [None]:
#returns 10 random observations from the table.
tweet_data.sample(10)

In [None]:
#returns the values of the `retweet_count` column sorted in an ascending order
tweet_data['retweet_count'].sort_values()

In [None]:
#returns the values of the `favorite_count` column sorted in an ascending order
tweet_data['favorite_count'].sort_values()

### Summary of Assessment
#### Quality
##### `twitter_archive` table
* Some entries are retweets and replies.
* The `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id`, and `retweeted_status_timestamp` columns are unnecessary for the analysis of original tweets.
* The `source` variable contains HTML formatting.
* The trailing `+0000` in `timestamp` is redundant.
* Erroneous data types in the `tweet_id` and `timestamp` columns.
* The `floofer` variable should be named `floof`, and its values should be changed likewise.
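
Several of the issues listed above can be illustrated on a toy table. The rows below are hypothetical and only mimic the layout of `twitter_archive`; the regular expression assumes that `source` always holds a single HTML anchor tag. This is a sketch of one possible cleaning approach, not a record of the cleaning actually performed.

```python
import pandas as pd

# Hypothetical rows that mimic a few twitter_archive columns
toy_archive = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'in_reply_to_status_id': [None, 10.0, None],
    'retweeted_status_id': [None, None, 20.0],
    'timestamp': ['2017-08-01 16:23:56 +0000',
                  '2017-07-31 00:18:03 +0000',
                  '2017-07-30 15:58:51 +0000'],
    'source': ['<a href="http://twitter.com/download/iphone" '
               'rel="nofollow">Twitter for iPhone</a>'] * 3,
})

# Keep original tweets only: rows with neither a reply id nor a retweet id
is_original = (toy_archive['in_reply_to_status_id'].isna()
               & toy_archive['retweeted_status_id'].isna())
originals = toy_archive[is_original].copy()

# Keep only the visible link text of the HTML anchor in `source`
originals['source'] = originals['source'].str.extract(r'>([^<]+)<', expand=False)

# Drop the trailing ' +0000' and parse the remainder as a datetime
originals['timestamp'] = pd.to_datetime(
    originals['timestamp'].str.replace(' +0000', '', regex=False))

# Store the identifier as text, since no arithmetic is done on it
originals['tweet_id'] = originals['tweet_id'].astype(str)
```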




##### `image_predictions` table
* Column labels are unclear.
* Text in `p1`, `p2`, and `p3` is inconsistent: some values start with an uppercase letter and others with a lowercase letter, and underscores are sometimes used in place of spaces.
* `tweet_id` is stored as an integer but should be a string.
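
The points above could be handled along the following lines. The toy rows are hypothetical, and the replacement column names (`first_prediction`, `first_prediction_confidence`) are illustrative assumptions only, since the notes do not say which labels should be used.

```python
import pandas as pd

# Hypothetical rows that mimic a few image_predictions columns
toy_predictions = pd.DataFrame({
    'tweet_id': [101, 102],
    'p1': ['Golden_retriever', 'pug'],
    'p1_conf': [0.9, 0.6],
})

# Make the prediction text consistent: spaces instead of underscores, all lowercase
toy_predictions['p1'] = (toy_predictions['p1']
                         .str.replace('_', ' ', regex=False)
                         .str.lower())

# Illustrative clearer labels (assumed names, not taken from the notebook)
toy_predictions = toy_predictions.rename(columns={
    'p1': 'first_prediction',
    'p1_conf': 'first_prediction_confidence',
})

# Store the identifier as text rather than as a number
toy_predictions['tweet_id'] = toy_predictions['tweet_id'].astype(str)
```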

##### `tweet_data` table
* `id_str` variable should be named `tweet_id` instead, to be consistent with the other tables.

#### Tidiness
* The single variable 'dog stage' is spread across four columns (doggo, floofer, pupper, puppo) in the `twitter_archive` table.
* All three tables should be merged into one table.
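
Both tidiness points can be sketched on toy tables. The rows are hypothetical, the `combine_stages` helper and the `dog_stage` column name are assumptions, and the stage columns are assumed to hold either the stage name or the string `'None'`. The inner join is one possible choice; it keeps only tweets that appear in all three tables.

```python
from functools import reduce

import pandas as pd

STAGES = ['doggo', 'floofer', 'pupper', 'puppo']


def combine_stages(row):
    """Join every stage named in the row; return None when no stage is set."""
    found = [row[stage] for stage in STAGES if row[stage] != 'None']
    return ', '.join(found) if found else None


# Hypothetical rows; the third tweet carries two stages at once
toy_archive = pd.DataFrame({
    'tweet_id': ['1', '2', '3'],
    'doggo': ['doggo', 'None', 'doggo'],
    'floofer': ['None', 'None', 'None'],
    'pupper': ['None', 'None', 'pupper'],
    'puppo': ['None', 'None', 'None'],
})
toy_archive['dog_stage'] = toy_archive[STAGES].apply(combine_stages, axis=1)
toy_archive = toy_archive.drop(columns=STAGES)

toy_predictions = pd.DataFrame({'tweet_id': ['1', '2'], 'p1': ['pug', 'redbone']})
toy_tweet_data = pd.DataFrame({'id_str': ['1', '3'], 'retweet_count': [5, 7]})

# Give the key the same name in every table before merging
toy_tweet_data = toy_tweet_data.rename(columns={'id_str': 'tweet_id'})

# Merge the three tables pairwise on tweet_id
merged = reduce(
    lambda left, right: pd.merge(left, right, on='tweet_id', how='inner'),
    [toy_archive, toy_predictions, toy_tweet_data],
)
```

The key must have the same data type in every table for the merge to match rows, which is one reason to convert `tweet_id` to a string everywhere first.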