# Project: Wrangle and Analyze Data

In [11]:
# Import modules
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
import seaborn as sns
from matplotlib.pyplot import figure

# Plot defaults
plt.style.use("ggplot")
matplotlib.rcParams["figure.figsize"] = (12, 8)
%matplotlib inline

## Data Gathering
In the cell below, gather **all** three pieces of data for this project and load them in the notebook. **Note:** each piece of data requires a different gathering method.
1. Directly download the WeRateDogs Twitter archive data (twitter-archive-enhanced.csv)

In [13]:
twitter_enhanced = pd.read_csv('twitter-archive-enhanced.csv', sep=',')

2. Use the Requests library to download the tweet image predictions (image-predictions.tsv)

In [14]:
# Download the image-predictions.tsv file programmatically
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
response.raise_for_status()  # stop early if the download failed

with open('image-predictions.tsv', mode='wb') as file:
    file.write(response.content)

# Load the tab-separated image predictions into a DataFrame
image_prediction = pd.read_csv('image-predictions.tsv', sep='\t')

3. Use the Tweepy library to query additional data via the Twitter API (tweet-json.txt)

In [15]:
tweets = pd.read_json('tweet-json.txt', lines=True)
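
The JSON-lines file usually carries many more fields than the analysis needs. The following is a minimal sketch of reading such a file and keeping a few columns. It uses an in-memory toy file, and the field names `retweet_count`, `favorite_count`, and `lang` are assumptions about the API output, not something confirmed by this notebook.

```python
import io

import pandas as pd

# Toy stand-in for tweet-json.txt: one JSON object per line.
# The field names below are assumed for illustration.
json_lines = io.StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 100, "lang": "en"}\n'
    '{"id": 2, "retweet_count": 20, "favorite_count": 200, "lang": "en"}\n'
)

# lines=True tells pandas to parse each line as a separate record.
tweets_sample = pd.read_json(json_lines, lines=True)

# Keep only the columns of interest.
tweets_subset = tweets_sample[["id", "retweet_count", "favorite_count"]]
print(tweets_subset)
```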

## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issues**. You must use **both** visual assessment and programmatic assessment to assess the data.

**Note:** pay attention to the following key points when you assess the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.
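
A programmatic assessment along these lines could look like the sketch below. It runs on a small toy DataFrame rather than the real archive, and the `retweeted_status_id` column name is an assumption about the archive's schema.

```python
import pandas as pd

# Toy stand-in for the archive table (values are made up).
# retweeted_status_id is an assumed column: non-null means the row is a retweet.
archive_sample = pd.DataFrame({
    "tweet_id": [1, 2, 3, 3],
    "retweeted_status_id": [None, 99.0, None, None],
    "rating_numerator": [13, 12, 420, 420],
    "rating_denominator": [10, 10, 10, 10],
})

# Count rows that look like retweets.
n_retweets = archive_sample["retweeted_status_id"].notna().sum()

# Count repeated tweet IDs.
n_duplicate_ids = archive_sample["tweet_id"].duplicated().sum()

# Inspect how the denominators are distributed.
denominator_counts = archive_sample["rating_denominator"].value_counts()

print(n_retweets, n_duplicate_ids)
print(denominator_counts)
```

On the real tables, `DataFrame.info()` and `DataFrame.describe()` are common first steps for spotting wrong data types and implausible values.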



### Quality issues
1. Some rows are retweets rather than original tweets (only original tweets are needed for the analysis)

2. The rating_numerator and rating_denominator columns contain inconsistent values, possibly because unrelated numbers were recorded as ratings or the values were recorded incorrectly

3. The p1_dog column has False values, indicating that the top prediction for some rows is not a dog breed

4. The timestamp column has the extra characters +0000

5. The text column has links attached to the tweet text

6. Dog stages are spread across several columns instead of one

7. Only these columns are needed for the analysis: tweet_id, img_num, breed, confidence_level, p1_dog

8. The tweet_id column is named id in the tweets table
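
As a rough illustration of how issues 1, 4, and 5 might be addressed, here is a sketch on toy data. The `retweeted_status_id` column and the sample tweets are assumptions made for the example, not values taken from the real archive.

```python
import pandas as pd

# Toy rows; the second one is a retweet (assumed retweeted_status_id column).
sample = pd.DataFrame({
    "tweet_id": [1, 2],
    "retweeted_status_id": [float("nan"), 99.0],
    "timestamp": ["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000"],
    "text": [
        "This is Phineas. 13/10 https://t.co/abc123",
        "RT @dog_rates: This is Tilly. 13/10 https://t.co/def456",
    ],
})

# Issue 1: keep only original tweets.
originals = sample[sample["retweeted_status_id"].isna()].copy()

# Issue 4: drop the trailing " +0000" and parse the rest as a datetime.
originals["timestamp"] = pd.to_datetime(
    originals["timestamp"].str.replace(r"\s\+0000$", "", regex=True),
    format="%Y-%m-%d %H:%M:%S",
)

# Issue 5: remove links from the tweet text.
originals["text"] = (
    originals["text"].str.replace(r"\s*https?://\S+", "", regex=True).str.strip()
)

print(originals)
```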



### Tidiness issues
1. The doggo, puppo, pupper and floofer columns should be in a single column

2. The three tables should be in a single table

3. The timestamp column has the object data type instead of datetime
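
The first two tidiness issues could be handled roughly as in this sketch. It assumes the stage columns hold either the stage name or the string "None", and the `dog_stage`, `p1`, `retweet_count`, and `favorite_count` names are chosen for illustration only.

```python
import pandas as pd

# Toy archive table; stage columns are assumed to hold the stage name or "None".
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "pupper", "None"],
    "puppo": ["None", "None", "None"],
})
stage_cols = ["doggo", "floofer", "pupper", "puppo"]


def combine_stages(row):
    """Join all stages present in a row; return None when there are none."""
    found = [row[col] for col in stage_cols if row[col] != "None"]
    return ",".join(found) if found else None


# Tidiness issue 1: collapse the four stage columns into one.
archive["dog_stage"] = archive[stage_cols].apply(combine_stages, axis=1)
archive = archive.drop(columns=stage_cols)

# Toy versions of the other two tables (column names assumed).
predictions = pd.DataFrame({
    "tweet_id": [1, 2],
    "p1": ["golden_retriever", "pug"],
    "p1_dog": [True, True],
})
api_counts = pd.DataFrame({
    "id": [1, 2, 3],
    "retweet_count": [10, 20, 30],
    "favorite_count": [100, 200, 300],
})

# Tidiness issue 2: align the key name, then merge into a single table.
api_counts = api_counts.rename(columns={"id": "tweet_id"})
master = archive.merge(predictions, on="tweet_id", how="inner").merge(
    api_counts, on="tweet_id", how="inner"
)
print(master)
```

An inner join keeps only tweets present in all three tables, which matches the requirement that ratings have images; a left join would be the alternative if tweets without predictions should be kept.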


## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

In [16]:
# Make copies of original pieces of data
twitter_enhanced_clean = twitter_enhanced.copy()
image_prediction_clean = image_prediction.copy()
tweets_clean = tweets.copy()


### Issue #1:

#### Define

#### Code

#### Test

### Issue #2:

#### Define

#### Code

#### Test

## Storing Data
Save the gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".
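
A minimal sketch of the save step, using a toy `master` DataFrame as a stand-in for the cleaned data. It writes to a temporary directory so the example has no side effects; in the notebook the file would simply be written next to it.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the cleaned master DataFrame.
master = pd.DataFrame({"tweet_id": [1, 2], "dog_stage": ["doggo", "pupper"]})

# Temporary location for this sketch only.
out_path = os.path.join(tempfile.mkdtemp(), "twitter_archive_master.csv")

# index=False avoids writing the row index as an extra column.
master.to_csv(out_path, index=False)

# Read the file back to confirm it round-trips.
reloaded = pd.read_csv(out_path)
print(reloaded)
```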

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**
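
One possible shape for an insight and its visualization is sketched below. The data is made up, and the `dog_stage` and `favorite_count` column names are assumptions about the master table; the numbers do not describe the real dataset.

```python
import matplotlib

# Non-interactive backend so the sketch runs without a display;
# inside the notebook, %matplotlib inline makes this line unnecessary.
matplotlib.use("Agg")
import matplotlib.pyplot as plt
import pandas as pd

# Toy master table (values are made up).
master = pd.DataFrame({
    "dog_stage": ["doggo", "pupper", "pupper", "puppo"],
    "favorite_count": [300, 100, 200, 400],
})

# Insight: mean favorite count per dog stage, highest first.
mean_favorites = (
    master.groupby("dog_stage")["favorite_count"].mean().sort_values(ascending=False)
)

# Visualization: one bar per dog stage.
fig, ax = plt.subplots(figsize=(12, 8))
mean_favorites.plot(kind="bar", ax=ax)
ax.set_xlabel("Dog stage")
ax.set_ylabel("Mean favorite count")
ax.set_title("Mean favorite count by dog stage (toy data)")
fig.tight_layout()
```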

### Insights:
1.

2.

3.

### Visualization