# Project: Wrangle and Analyze Data

This project aims at ..

## Data Gathering
In this step, we'll gather the following data and load it into the notebook:
- WeRateDogs tweets
- Tweet Image predictions
- Counts of retweets and likes of the tweets


In [4]:
# Libraries required
import pandas as pd
import numpy as np
import requests
import tweepy
import csv

**1. Loading the WeRateDogs Twitter archive data (twitter-archive-enhanced.csv)**

In [3]:
weratedogs_df = pd.read_csv('twitter-archive-enhanced.csv')
weratedogs_df.head(2)

| | tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892420643555336193 | | | 2017-08-01 16:23:56 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Phineas. He's a mystical boy. Only eve... | | | | https://twitter.com/dog_rates/status/892420643... | 13 | 10 | Phineas | | | | |
| 1 | 892177421306343426 | | | 2017-08-01 00:17:27 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Tilly. She's just checking pup on you.... | | | | https://twitter.com/dog_rates/status/892177421... | 13 | 10 | Tilly | | | | |


**2. Tweet image predictions**

Using the requests library, we'll download this file from the URL provided, save it locally, and load it with pandas.

In [None]:
import csv

# Read a tab-delimited file row by row with the csv module
with open('people.csv', 'r', newline='') as file:
    reader = csv.reader(file, delimiter='\t')
    for row in reader:
        print(row)

In [22]:
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
pred_response = requests.get(url)
pred_response.raise_for_status()  # stop early if the download failed

# Save the raw bytes locally so the file only has to be downloaded once
with open('image_predictions.tsv', mode='wb') as file:
    file.write(pred_response.content)

# The file is tab-separated
image_predictions = pd.read_csv('image_predictions.tsv', sep='\t')


In [23]:
image_predictions.head(2)

| | tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 666020888022790149 | https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg | 1 | Welsh_springer_spaniel | 0.465074 | True | collie | 0.156665 | True | Shetland_sheepdog | 0.061428 | True |
| 1 | 666029285002620928 | https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg | 1 | redbone | 0.506826 | True | miniature_pinscher | 0.074192 | True | Rhodesian_ridgeback | 0.07201 | True |


**3. Additional Data about the tweets**

We'll use the tweepy library to query the Twitter API for the like and retweet counts of the tweet IDs in the archive data, and save the results to a txt file.

In [25]:
# Extracting tweet ids from weratedogs archive data
tweet_ids = weratedogs_df['tweet_id']
type(tweet_ids)

pandas.core.series.Series

In [27]:
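# Authenticating with the Twitter API
import os

# Keys are read from environment variables instead of being hard-coded in the notebook.
# The variable names below are an assumption; use whatever names you export.
consumer_key = os.environ["TWITTER_CONSUMER_KEY"]
consumer_secret = os.environ["TWITTER_CONSUMER_SECRET"]
access_token = os.environ["TWITTER_ACCESS_TOKEN"]
access_token_secret = os.environ["TWITTER_ACCESS_TOKEN_SECRET"]

# Authentication function
def auth():
    try:
        # Use a distinct name so the handler does not shadow this function
        handler = tweepy.OAuthHandler(consumer_key, consumer_secret)
        handler.set_access_token(access_token, access_token_secret)
        return tweepy.API(handler, wait_on_rate_limit=True)
    except Exception:
        print("An error occurred during the authentication, please retry")
        raise  # re-raise so the caller never receives an undefined api object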


In [28]:
# Fetching a single tweet to test API access
test_id = '666020888022790149'
tweet = auth().get_status(test_id)
print(tweet.text)

Forbidden: 403 Forbidden
453 - You currently have Essential access which includes access to Twitter API v2 endpoints only. If you need access to this endpoint, you’ll need to apply for Elevated access via the Developer Portal. You can learn more here: https://developer.twitter.com/en/docs/twitter-api/getting-started/about-twitter-api#v2-access-leve
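
Because the request above was rejected, the gathering loop cannot be run with the current access level. As a rough sketch of what the "save to a txt file, then read the counts back" step might look like, the example below takes the fetch step as a parameter so it can be tried without API access. The names `save_tweets_json`, `load_counts`, `fake_fetch` and the demo file name are hypothetical; the `retweet_count` and `favorite_count` keys assume the tweet JSON follows the v1.1 tweet object.

```python
import json
import os
import tempfile


def save_tweets_json(tweet_ids, fetch, path):
    """Write one JSON object per line; return (tweet_id, error) pairs that failed."""
    failed = []
    with open(path, "w", encoding="utf-8") as out:
        for tweet_id in tweet_ids:
            try:
                tweet = fetch(tweet_id)  # expected to return a dict
            except Exception as err:
                # Deleted or protected tweets should not stop the whole run
                failed.append((tweet_id, str(err)))
                continue
            out.write(json.dumps(tweet) + "\n")
    return failed


def load_counts(path):
    """Read the line-delimited JSON file back into a list of count records."""
    rows = []
    with open(path, encoding="utf-8") as src:
        for line in src:
            tweet = json.loads(line)
            rows.append({
                "tweet_id": tweet["id"],
                "retweet_count": tweet["retweet_count"],
                "favorite_count": tweet["favorite_count"],
            })
    return rows


def fake_fetch(tweet_id):
    """Hypothetical stand-in for the API call, returning made-up counts."""
    if tweet_id == 2:
        raise LookupError("tweet not found")
    return {"id": tweet_id, "retweet_count": 10 * tweet_id, "favorite_count": 100 * tweet_id}


demo_path = os.path.join(tempfile.mkdtemp(), "tweet_json_demo.txt")
failed = save_tweets_json([1, 2, 3], fake_fetch, demo_path)
counts = load_counts(demo_path)
```

With Elevated access, `fetch` could plausibly be something like `lambda i: api.get_status(i, tweet_mode='extended')._json`, and the resulting list can be passed to `pd.DataFrame(counts)`. This has not been verified here, since the endpoint is not available with Essential access.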

## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issues**. You must use **both** visual assessment and
programmatic assessment to assess the data.

**Note:** pay attention to the following key points when you assess the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.
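
As an illustration of the programmatic side of the assessment, the sketch below runs a few typical checks on small made-up frames: keeping only original tweets that have an image, counting duplicate IDs, and flagging suspicious denominators and names. The toy rows are invented for the example; only the column names come from the data loaded earlier.

```python
import pandas as pd

# Toy stand-ins for the archive and the image predictions (values are made up)
archive = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "retweeted_status_id": [None, 99.0, None, None],
    "rating_denominator": [10, 10, 0, 10],
    "name": ["Phineas", "a", "Tilly", "None"],
})
predictions = pd.DataFrame({"tweet_id": [1, 3, 5]})

# Original tweets have no retweeted_status_id; tweets with images appear in predictions
is_original = archive["retweeted_status_id"].isna()
has_image = archive["tweet_id"].isin(predictions["tweet_id"])
original_with_image = archive[is_original & has_image]

# A few quick quality checks
n_duplicate_ids = int(archive["tweet_id"].duplicated().sum())
odd_denominators = archive.loc[archive["rating_denominator"] != 10, "tweet_id"].tolist()
lowercase_names = archive.loc[archive["name"].str.islower(), "name"].tolist()
```

On the real data, `weratedogs_df.info()`, `weratedogs_df.describe()` and `value_counts()` on individual columns are natural starting points before writing targeted checks like these.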



### Quality issues
1.

2.

3.

4.

5.

6.

7.

8.

### Tidiness issues
1.

2.

## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).
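
A minimal sketch of the copy-then-merge step, using small made-up frames in place of the real ones. The `_clean` and `master` names are hypothetical, and casting `tweet_id` to string is one possible choice (IDs are identifiers, not quantities), not a requirement of the project.

```python
import pandas as pd

# Toy stand-ins for the gathered data (values are made up)
archive = pd.DataFrame({"tweet_id": [1, 2, 3], "rating_numerator": [13, 12, 11]})
predictions = pd.DataFrame({"tweet_id": [1, 3], "p1": ["redbone", "collie"]})

# Work on copies so the originals stay available for comparison
archive_clean = archive.copy()
predictions_clean = predictions.copy()

# Use the same key dtype in both frames before merging
archive_clean["tweet_id"] = archive_clean["tweet_id"].astype(str)
predictions_clean["tweet_id"] = predictions_clean["tweet_id"].astype(str)

# An inner join keeps only tweets that have image predictions
master = archive_clean.merge(predictions_clean, on="tweet_id", how="inner")
```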

In [None]:
# Make copies of original pieces of data


### Issue #1:

#### Define

#### Code

#### Test

### Issue #2:

#### Define

#### Code

#### Test

## Storing Data
Save the gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".
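
One way this could look, sketched with a one-row stand-in for the master DataFrame and a temporary directory so the example does not touch the working folder. The `master` name is hypothetical; `index=False` avoids writing the row index as an extra unnamed column, and reading `tweet_id` back as a string keeps the long IDs exact.

```python
import os
import tempfile

import pandas as pd

# One-row stand-in for the cleaned master DataFrame
master = pd.DataFrame({"tweet_id": ["892420643555336193"], "rating_numerator": [13]})

out_path = os.path.join(tempfile.mkdtemp(), "twitter_archive_master.csv")
master.to_csv(out_path, index=False)

# Reload to confirm the file round-trips as expected
reloaded = pd.read_csv(out_path, dtype={"tweet_id": str})
```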

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**
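
As a sketch of how an insight might be computed, the example below averages ratings and likes per predicted breed on made-up rows. It assumes a master DataFrame that already combines the `p1` prediction, `rating_numerator` and a `favorite_count` column; none of the numbers are real results.

```python
import pandas as pd

# Toy master data (values are made up for illustration)
master = pd.DataFrame({
    "p1": ["redbone", "collie", "redbone", "collie"],
    "rating_numerator": [13, 10, 11, 12],
    "favorite_count": [100, 40, 80, 60],
})

# Mean rating and mean likes per predicted breed, most-liked first
by_breed = (
    master.groupby("p1")[["rating_numerator", "favorite_count"]]
    .mean()
    .sort_values("favorite_count", ascending=False)
)
```

If matplotlib is installed, `by_breed["favorite_count"].plot(kind="bar")` would be one way to turn this table into the required visualization.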

### Insights:
1.

2.

3.

### Visualization