# Project: Data Wrangling with Twitter data

## Table of Contents
<ul>    
<li><a href="#intro">Introduction</a></li>
<li><a href="#gather">Gather</a></li>
<li><a href="#assess">Assess</a></li>
<li><a href="#clean">Clean</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
<li><a href="#ref">References</a></li>
</ul>

<a id='intro'></a>
## Introduction

In this project I wrangle and analyze the tweet archive of the Twitter account WeRateDogs®. The work consists of the following steps:
<br>
- Gathering data, including using Tweepy to query Twitter's API for additional data: retweet count and favorite count
- Assessing data
- Cleaning data
- Storing, analyzing, and visualizing the wrangled data
- Reporting on 1) the data wrangling efforts and 2) the data analyses and visualizations

<a id='gather'></a>
## Gather

In [None]:
#Import libraries
import pandas as pd
import requests 
import os
import tweepy
import json

#### Archive table

In [None]:
df_archive = pd.read_csv("twitter-archive-enhanced.csv")
df_archive.head()

In [None]:
df_archive.shape

#### Image predictions table

In [None]:
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)

In [None]:
# Save the downloaded file in the working directory under its original name
with open(os.path.join(os.getcwd(), url.split('/')[-1]), mode='wb') as file:
    file.write(response.content)

In [None]:
df_predictions = pd.read_csv('image-predictions.tsv', sep='\t')
df_predictions.head()

#### Tweepy
Create an API object to gather Twitter data.

In [None]:
# Credentials are replaced with placeholders; never commit real keys to a notebook
consumer_key = 'YOUR_CONSUMER_KEY'
consumer_secret = 'YOUR_CONSUMER_SECRET'
access_token = 'YOUR_ACCESS_TOKEN'
access_secret = 'YOUR_ACCESS_SECRET'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

In [None]:
# Get data from Twitter for every tweet id in the archive table
id_list = df_archive.tweet_id.astype(str)
tweets = []
# Ids of tweets that could not be retrieved (e.g. deleted tweets)
error_ids = []
for tweet_id in id_list:
    try:
        tweet = api.get_status(tweet_id, tweet_mode='extended')
        tweets.append(tweet._json)
    except tweepy.TweepError as e:
        print(e)
        error_ids.append(tweet_id)
error_ids

In [None]:
error_ids = ['888202515573088257','873697596434513921','872668790621863937','872261713294495745', '869988702071779329','866816280283807744','861769973181624320','856602993587888130','851953902622658560','845459076796616705','844704788403113984','842892208864923648','837366284874571778',
 '837012587749474308','829374341691346946','827228250799742977','812747805718642688','802247111496568832','779123168116150273','775096608509886464','771004394259247104', '770743923962707968','759566828574212096','754011816964026368','680055455951884288']

In [None]:
#Write json data to file
with open('tweet_json.txt', 'w') as file:
    json.dump(tweets, file)

In [None]:
#Read json data from file
ls_tweets = []
with open('tweet_json.txt') as file:
    data = json.load(file)
    for p in data:
        ls_tweets.append({'tweet_id': p['id'],
                        'retweet_count': p['retweet_count'],
                        'favorite_count': p['favorite_count']})
        

    

In [None]:
len(ls_tweets)

In [None]:
# create dataFrame from list 
df_tweets = pd.DataFrame(ls_tweets, columns = ['tweet_id', 'retweet_count', 'favorite_count'])
df_tweets.head()

In [None]:
df_tweets.tweet_id.count()

In [None]:
    #full_tweets = []
   # tweet_count = df.tweet_id.count()
  #  id_list = df.tweet_id.astype(str)
  #  try:
   #     for i in range(int(tweet_count / 100) + 1):
   #         end_loc = min((i + 1) * 100, tweet_count)
            #print(id_list[i * 100:end_loc])
  #          list100 = id_list.iloc[(i * 100):end_loc]
   #         full_tweets.extend(api.statuses_lookup(list100))
  #          print(str(i))
  #          if i>5: break
  #  except tweepy.TweepError as e:
  #      print('Error:', e.text())

<a id='assess'></a>
## Assess

#### Archive table

In [None]:
# Goal: detect and document at least eight (8) quality issues and two (2) tidiness issues

In [None]:
df_archive.info()

In [None]:
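# Count replies: comparing against the string "NaN" does not filter missing values, so use notna()
df_archive[df_archive.in_reply_to_status_id.notna() & df_archive.in_reply_to_user_id.notna()].shape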

In [None]:
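# Count retweets: comparing against the string "NaN" does not filter missing values, so use notna()
df_archive[df_archive.retweeted_status_id.notna() & df_archive.retweeted_status_user_id.notna()].shape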

#### Issues
##### df_archive table
Original tweets (not replies or retweets) have NaN in these columns:<br>
- in_reply_to_status_id<br>
- in_reply_to_user_id<br>
- retweeted_status_id<br>
- retweeted_status_user_id<br>
- retweeted_status_timestamp<br>
<br>

Columns to delete: *timestamp, source, expanded_urls* <br>

*rating_denominator* has some incorrect data: zeros and large numbers (decimal?)<br>
*rating_numerator* can be a decimal, like 13.5/10 in tweet_id 883482846933004288<br>
Some records have *rating_numerator* = 0 or > 20<br>
The *name* column has some errors, such as the names 'None' or 'a'. I'm not sure it will be used for analysis<br>

The *doggo, floofer, pupper, puppo* columns have values in only 380 records vs 430 in the *text* column<br>
*doggo, floofer, pupper, puppo* can be combined into one column<br>
*tweet_id* should be stored as object type, not int
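
Two of the issues listed for the archive table (the four separate dog stage columns and the decimal numerators) could be addressed along the following lines. This is only a sketch on made-up rows: the frame `sample`, the new column name `stage`, the helper `combine_stages`, and the regular expression are assumptions for illustration, not the cleaning code of this notebook.

```python
import pandas as pd

# Toy rows standing in for df_archive; the values are invented for illustration.
sample = pd.DataFrame({
    "text": ["This is Bella. 13.5/10 would pet",
             "Here is a pupper. 12/10",
             "A doggo and a pupper. 11/10"],
    "doggo": ["None", "None", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "pupper", "pupper"],
    "puppo": ["None", "None", "None"],
})

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

def combine_stages(row):
    """Join every stage present in the row; return None when there is none."""
    found = [row[col] for col in stage_cols if row[col] != "None"]
    return ", ".join(found) if found else None

# Tidiness: one 'stage' column (hypothetical name) instead of four
sample["stage"] = sample.apply(combine_stages, axis=1)
sample = sample.drop(columns=stage_cols)

# Quality: re-extract the rating from the text, allowing a decimal numerator
ratings = sample["text"].str.extract(r"(\d+(?:\.\d+)?)/(\d+)")
sample["rating_numerator"] = ratings[0].astype(float)
sample["rating_denominator"] = ratings[1].astype(int)
```

Note that the pattern takes the first `number/number` match in the text, so tweets that contain a date or a second fraction would still need a manual check.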

In [None]:
#Checking for duplicated data
df_archive[df_archive.duplicated()].shape

In [None]:
df_archive['rating_denominator'].describe()
# rating_numerator rating_denominator

In [None]:
# Show both parts of the rating for records whose denominator is not 10
df_archive[df_archive['rating_denominator'] != 10][['tweet_id', 'rating_numerator', 'rating_denominator', 'text']]

In [None]:
df_archive['rating_numerator'].describe()

In [None]:
df_archive.query('rating_numerator < 1 or rating_numerator > 20')[['tweet_id', 'rating_numerator', 'text']]

In [None]:
df_archive['name'].describe()

In [None]:
df_archive['name'].value_counts()

In [None]:
df_archive.query("doggo != 'None' or floofer != 'None' or pupper != 'None' or puppo != 'None'").shape


In [None]:
df_archive[df_archive['text'].str.contains("puppo")].shape

In [None]:
df_archive[df_archive['text'].str.contains("floof")].shape

In [None]:
df_archive[df_archive['text'].str.contains("pupper")].shape

#### Image prediction table

In [None]:
df_predictions.head()

In [None]:
df_predictions.info()

#### Issues
##### df_predictions table

The *jpg_url* column is not needed <br>
*p1, p2, p3*: some breeds start with a capital letter, some do not<br>
Missing data: there are no predictions for 281 records from the archive table (replies and retweets???)<br>
*tweet_id* should be stored as object type, not int
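
A possible way to handle the capitalization and `tweet_id` issues is sketched below on a tiny invented frame. The name `preds` and its values are placeholders, and lowercasing is only one of several reasonable conventions.

```python
import pandas as pd

# Toy stand-in for df_predictions; the breed names and ids are invented.
preds = pd.DataFrame({
    "tweet_id": [1, 2],
    "p1": ["Golden_retriever", "pug"],
    "p2": ["Labrador_retriever", "Chihuahua"],
    "p3": ["kuvasz", "Pekinese"],
})

# Make breed names consistent by lowercasing all three prediction columns
for col in ["p1", "p2", "p3"]:
    preds[col] = preds[col].str.lower()

# Treat tweet ids as identifiers (strings) rather than numbers
preds["tweet_id"] = preds["tweet_id"].astype(str)
```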


In [None]:
df_predictions.img_num.describe()

In [None]:
#df_predictions.p1_conf.describe()
#df_predictions.p2_conf.describe()
df_predictions.p3_conf.describe()

In [None]:
#df_predictions.p1.value_counts()
#df_predictions.p2.value_counts()
df_predictions.p3.value_counts()

In [None]:
#Tweets in archive table and not in prediction table
len(set(df_archive.tweet_id) - set(df_predictions.tweet_id))

In [None]:
# Tweets in prediction table and not in archive table
len(set(df_predictions.tweet_id) - set(df_archive.tweet_id))

#### Tweepy table

In [None]:
df_tweets.info()

In [None]:
df_tweets.retweet_count.describe()

In [None]:
#df_tweets.query('retweet_count < 5 or retweet_count > 70000')

In [None]:
df_tweets.favorite_count.describe()

In [None]:
#df_tweets[df_tweets.favorite_count == 0] df_archive

#### Issues
##### df_tweets table
- Merge the df_tweets and df_archive tables. df_tweets is just additional info about the same tweets <br>
- Some tweets were deleted, so df_tweets has no info about them; their ids are in the error_ids list
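
The merge described above could look roughly like this. The frames `archive` and `counts` are small invented stand-ins for df_archive and df_tweets; a left join with `indicator=True` is one option that keeps every archive row and makes the tweets without API data easy to spot.

```python
import pandas as pd

# Invented stand-ins for df_archive and df_tweets
archive = pd.DataFrame({"tweet_id": [1, 2, 3],
                        "rating_numerator": [12, 13, 10]})
counts = pd.DataFrame({"tweet_id": [1, 3],
                       "retweet_count": [5, 7],
                       "favorite_count": [20, 30]})

# Left join keeps all archive rows; '_merge' records where each row came from
merged = archive.merge(counts, on="tweet_id", how="left", indicator=True)

# Tweets with no API data (e.g. deleted tweets) appear as 'left_only'
missing_ids = merged.loc[merged["_merge"] == "left_only", "tweet_id"].tolist()
```

Using `how="inner"` instead would drop the deleted tweets in the same step, at the cost of losing the record of which ids were missing.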

<a id='clean'></a>
## Clean

In [None]:
# Cleaning includes merging individual pieces of data according to the rules of tidy data.

In [None]:
# Each cleaning step follows the pattern: Define, Code, Test

In [None]:
# Storing, analyzing, and visualizing data for this project:
# store the clean DataFrame(s) in a CSV file, with the main one named twitter_archive_master.csv.
# If additional files exist because multiple tables are required for tidiness, name these files appropriately.
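
Storing the cleaned table could be as simple as the sketch below. The frame `master` is an invented placeholder for the cleaned data, and a temporary directory is used here only so the example does not overwrite anything; in the notebook the file would go next to the other data files.

```python
import os
import tempfile

import pandas as pd

# Invented placeholder for the cleaned, merged table
master = pd.DataFrame({"tweet_id": ["1", "2"],
                       "retweet_count": [5, 7]})

# Write without the index so the CSV contains only the data columns
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "twitter_archive_master.csv")
master.to_csv(out_path, index=False)

# When reading back, keep tweet_id as a string so it is not parsed as a number
restored = pd.read_csv(out_path, dtype={"tweet_id": str})
```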

In [None]:
# At least three (3) insights and one (1) visualization must be produced.

<a id='ref'></a>
## References

https://stackoverflow.com/questions/28384588/twitter-api-get-tweets-with-specific-id
<br>
https://stackabuse.com/reading-and-writing-json-to-a-file-in-python/
<br>
https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object
