## Introduction

<p>This project wrangles (gathers, assesses, and cleans) real-world data from a range of sources and in a variety of formats, then analyzes and visualizes it using Python and its libraries and/or SQL.</p>

<p>The dataset to be wrangled, analyzed, and visualized "is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because 'they're good dogs Brent.'" (Udacity Project Overview)</p>

## Table of Contents
<ul>
<li><a href="#gathering">Data Gathering</a></li>
<li><a href="#assessment">Data Assessment</a></li>
<li><a href="#cleaning">Data Cleaning</a></li>
<li><a href="#storage">Data Storage</a></li>
<li><a href="#analysis">Analyses and Visualization</a></li>
</ul>

In [1]:
#importing all necessary libraries to complete this project
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import json
import seaborn as sns
import os
import requests
import re
from functools import reduce
import tweepy
from tweepy import OAuthHandler
from timeit import default_timer as timer
%matplotlib inline

<a id = 'gathering'></a>
## Data Gathering

The first table (twitter-archive-enhanced.csv) was downloaded manually and is loaded into a pandas DataFrame programmatically.

In [3]:
#load the 'twitter-archive-enhanced.csv' table into a pandas data frame
twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')

The second table is downloaded programmatically from Udacity's server with the requests library, saved to a local folder (image-predictions), and then loaded into a pandas DataFrame.

In [4]:
#create a folder called 'image-predictions' if the folder does not exist already
folder_name = 'image-predictions'
os.makedirs(folder_name, exist_ok=True)

In [5]:
#download the image-predictions data from its URL using the requests library
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
#stop here if the download failed, so that an error page is not saved as data
response.raise_for_status()
#write the response body to image-predictions/image-predictions.tsv
with open(os.path.join(folder_name, url.split('/')[-1]), mode='wb') as file:
    file.write(response.content)
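
The local file name above is taken from the last segment of the URL. As a minimal sketch, assuming the URL could one day carry a query string, the same idea can be written with the standard library's URL parser. The helper name `local_path_for` is hypothetical and not part of the original notebook.

```python
import os
from urllib.parse import urlparse


def local_path_for(url, folder):
    """Return folder/<last path segment of url>, ignoring any query string."""
    # urlparse separates the path from the query, so '?x=1' never ends up in the file name
    file_name = os.path.basename(urlparse(url).path)
    return os.path.join(folder, file_name)


url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
target = local_path_for(url, 'image-predictions')
print(target)
```

For the URL used in this project, the result matches what `url.split('/')[-1]` produces. The two approaches only differ when the URL has a query string or fragment.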

In [6]:
#load the image-predictions.tsv file into a pandas data frame
image_predictions = pd.read_csv('image-predictions/image-predictions.tsv', sep='\t')

The third table was downloaded from the internet and saved locally as 'tweet-json.txt'. Each line of the file holds one tweet as a JSON object. The file is read line by line into a Python list and then loaded into a pandas DataFrame.

In [7]:
#read tweet-json.txt line by line (one JSON object per line), keep the 'id_str',
#'retweet_count', and 'favorite_count' of each tweet, and store them in df_list
df_list = []
with open('tweet-json.txt', encoding='utf-8') as file:
    for line in file:
        data = json.loads(line)
        id_str = data.get('id_str')
        retweet_count = data.get('retweet_count')
        favorite_count = data.get('favorite_count')
        df_list.append({
            'id_str': id_str,
            'retweet_count': retweet_count,
            'favorite_count': favorite_count
        })


In [None]:
#load df_list into a pandas data frame
tweet_data = pd.DataFrame(df_list, columns=['id_str', 'retweet_count', 'favorite_count'])
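
The line-by-line parsing used for the third table can be tried on a small in-memory sample before it is run on the full file. The sketch below is illustrative only: the helper `parse_tweet_lines` and the two sample tweets are hypothetical, and skipping blank lines is an added assumption (the original loop would raise an error on an empty line).

```python
import io
import json


def parse_tweet_lines(lines, fields=("id_str", "retweet_count", "favorite_count")):
    """Collect the selected fields from an iterable of JSON lines."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            # tolerate blank lines, such as a stray empty line in the file
            continue
        tweet = json.loads(line)
        # .get() returns None when a field is missing instead of raising KeyError
        records.append({field: tweet.get(field) for field in fields})
    return records


# Hypothetical two-tweet sample standing in for tweet-json.txt
sample = io.StringIO(
    '{"id_str": "1", "retweet_count": 5, "favorite_count": 10, "lang": "en"}\n'
    '\n'
    '{"id_str": "2", "retweet_count": 0, "favorite_count": 3}\n'
)
records = parse_tweet_lines(sample)
print(records)
```

The returned list has the same shape as `df_list`, so it can be passed to `pd.DataFrame` in the same way.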