# Project: Data Wrangling with Twitter data

## Table of Contents
<ul>    
<li><a href="#intro">Introduction</a></li>
<li><a href="#gather">Gather</a></li>
<li><a href="#assess">Assess</a></li>
<li><a href="#clean">Clean</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
<li><a href="#ref">References</a></li>
</ul>

<a id='intro'></a>
## Introduction

In this project I wrangle and analyze the tweet archive of the Twitter account WeRateDogs®.
<br>
I use Tweepy to query Twitter's API for additional data: the retweet count and favorite count of each tweet.
<br>
The project covers:
- Gathering data
- Assessing data
- Cleaning data
- Storing, analyzing, and visualizing the wrangled data
- Reporting on 1) the data wrangling efforts and 2) the data analyses and visualizations

<a id='gather'></a>
## Gather

In [1]:
#Import libraries
import pandas as pd
import requests 
import os
import tweepy
import json

#### Archive table

In [2]:
df_archive = pd.read_csv("twitter-archive-enhanced.csv")
df_archive.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [3]:
df_archive.shape

(2356, 17)

#### Image predictions table

In [4]:
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)

In [5]:
# Save the downloaded file in the working directory under its original name
with open(os.path.join(os.getcwd(), url.split('/')[-1]), mode='wb') as file:
    file.write(response.content)
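
The download cells above derive the local file name from the last segment of the URL and write the raw response bytes to disk. A minimal sketch of that step follows, with the network call replaced by stand-in bytes so it runs offline. The helper names `filename_from_url` and `save_bytes` and the demo content are hypothetical and not part of the original notebook. With a real `requests` response, calling `response.raise_for_status()` before writing would stop a failed download from being saved as if it were data.

```python
import os
import tempfile


def filename_from_url(url):
    """Return the last path segment of a URL, used here as the local file name."""
    return url.rstrip('/').split('/')[-1]


def save_bytes(content, directory, filename):
    """Write raw bytes to directory/filename and return the full path."""
    path = os.path.join(directory, filename)
    with open(path, mode='wb') as file:
        file.write(content)
    return path


url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'

# Stand-in for response.content (made-up bytes, not the real file)
demo_content = b'tweet_id\tjpg_url\n1\thttps://example.com/a.jpg\n'

with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_bytes(demo_content, tmp_dir, filename_from_url(url))
    saved_name = os.path.basename(saved_path)
    with open(saved_path, 'rb') as file:
        round_trip = file.read()
```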

In [6]:
df_predictions = pd.read_csv('image-predictions.tsv', sep='\t')
df_predictions.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


#### Tweepy
Create an API object to gather Twitter data.

In [7]:
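# Read credentials from environment variables instead of hard-coding secrets in the notebook.
# The variable names below are assumptions; use whatever names your environment defines.
consumer_key = os.environ['TWITTER_CONSUMER_KEY']
consumer_secret = os.environ['TWITTER_CONSUMER_SECRET']
access_token = os.environ['TWITTER_ACCESS_TOKEN']
access_secret = os.environ['TWITTER_ACCESS_SECRET']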

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

In [None]:
# Get data from Twitter for every tweet ID in the archive
id_list = df_archive.tweet_id.astype(str)
tweets = []
error_ids = []
for tweet_id in id_list:
    try:
        tweet = api.get_status(tweet_id, tweet_mode='extended')
        tweets.append(tweet._json)
    except tweepy.TweepError as e:
        # Record IDs that could not be fetched (for example, deleted tweets)
        print(e)
        error_ids.append(tweet_id)
error_ids

In [8]:
error_ids = ['888202515573088257','873697596434513921','872668790621863937','872261713294495745', '869988702071779329','866816280283807744','861769973181624320','856602993587888130','851953902622658560','845459076796616705','844704788403113984','842892208864923648','837366284874571778',
 '837012587749474308','829374341691346946','827228250799742977','812747805718642688','802247111496568832','779123168116150273','775096608509886464','771004394259247104', '770743923962707968','759566828574212096','754011816964026368','680055455951884288']

In [None]:
#Write json data to file
with open('tweet_json.txt', 'w') as file:
    json.dump(tweets, file)

In [9]:
#Read json data from file
ls_tweets = []
with open('tweet_json.txt') as file:
    data = json.load(file)
    for p in data:
        ls_tweets.append({'tweet_id': p['id'],
                        'retweet_count': p['retweet_count'],
                        'favorite_count': p['favorite_count']})
        

    

In [10]:
len(ls_tweets)

2332
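
The write and read cells above store the full tweet objects as one JSON list and later keep only three fields per tweet. The following sketch shows the same round trip on made-up records. The helper `extract_counts`, the sample values, and the demo file name are assumptions for illustration only.

```python
import json
import os
import tempfile

# Made-up records shaped like the subset of the tweet object used above
sample_tweets = [
    {'id': 1, 'retweet_count': 10, 'favorite_count': 25, 'full_text': 'first'},
    {'id': 2, 'retweet_count': 3, 'favorite_count': 7, 'full_text': 'second'},
]


def extract_counts(tweet_objects):
    """Keep only the ID and the two engagement counts of each tweet."""
    return [
        {
            'tweet_id': tweet['id'],
            'retweet_count': tweet['retweet_count'],
            'favorite_count': tweet['favorite_count'],
        }
        for tweet in tweet_objects
    ]


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'tweet_json_demo.txt')
    # Write the whole list as a single JSON document
    with open(path, 'w') as file:
        json.dump(sample_tweets, file)
    # Read it back as a list of dictionaries
    with open(path) as file:
        loaded = json.load(file)

rows = extract_counts(loaded)
```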

In [11]:
# create dataFrame from list 
df_tweets = pd.DataFrame(ls_tweets, columns = ['tweet_id', 'retweet_count', 'favorite_count'])
df_tweets.head()

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,7675,36055
1,892177421306343426,5674,31097
2,891815181378084864,3763,23407
3,891689557279858688,7850,39325
4,891327558926688256,8445,37558


In [12]:
df_tweets.tweet_id.count()

2332

In [13]:
    #full_tweets = []
   # tweet_count = df.tweet_id.count()
  #  id_list = df.tweet_id.astype(str)
  #  try:
   #     for i in range(int(tweet_count / 100) + 1):
   #         end_loc = min((i + 1) * 100, tweet_count)
            #print(id_list[i * 100:end_loc])
  #          list100 = id_list.iloc[(i * 100):end_loc]
   #         full_tweets.extend(api.statuses_lookup(list100))
  #          print(str(i))
  #          if i>5: break
  #  except tweepy.TweepError as e:
  #      print('Error:', e.text())
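
The commented-out cell above appears to split the tweet IDs into groups of 100 before calling `api.statuses_lookup`. A minimal sketch of the batching step alone, without any API call, is shown below. The helper name `chunked` and the demo IDs are hypothetical. Stepping through the list with `range(0, len(items), size)` avoids the empty final batch that `int(tweet_count / 100) + 1` produces when the count is an exact multiple of 100.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Made-up IDs standing in for df_archive.tweet_id.astype(str)
demo_ids = [str(n) for n in range(250)]

batches = list(chunked(demo_ids, 100))
batch_sizes = [len(batch) for batch in batches]
```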

<a id='assess'></a>
## Assess Data

#### Archive table

Goal: detect and document at least eight (8) quality issues and two (2) tidiness issues.

In [14]:
df_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

In [25]:
df_archive.query('in_reply_to_status_id == in_reply_to_status_id and in_reply_to_user_id ==in_reply_to_user_id', engine='python').shape

(78, 17)

In [27]:
df_archive.query('retweeted_status_id ==retweeted_status_id and retweeted_status_user_id ==retweeted_status_user_id').shape

(181, 17)
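
The two queries above rely on the fact that NaN never compares equal to itself, so `column == column` is true only where a value is present. A small sketch on made-up data compares that idiom with the more explicit `notna()` and `isna()` methods. The demo frame and variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Made-up rows: tweets 2 and 4 are replies, tweet 3 is a retweet
demo = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'in_reply_to_status_id': [np.nan, 11.0, np.nan, 12.0],
    'retweeted_status_id': [np.nan, np.nan, 21.0, np.nan],
})

# NaN != NaN, so the comparison is False exactly where the value is missing
replies_via_query = demo.query('in_reply_to_status_id == in_reply_to_status_id')

# Same selection, stated explicitly
replies_via_notna = demo[demo['in_reply_to_status_id'].notna()]

# Original tweets: neither a reply nor a retweet
originals = demo[
    demo['in_reply_to_status_id'].isna() & demo['retweeted_status_id'].isna()
]
```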

#### Issues
##### df_archive table
Original tweets (neither replies nor retweets) have NaN in these columns:<br>
- in_reply_to_status_id<br>
- in_reply_to_user_id<br>
- retweeted_status_id<br>
- retweeted_status_user_id<br>
- retweeted_status_timestamp<br>
<br>

Columns to delete: *timestamp, source, expanded_urls* <br>

*rating_denominator* contains incorrect values: zeros and very large numbers (decimal?)<br>
*rating_numerator* can be a decimal, such as 13.5/10 in tweet_id 883482846933004288, but the column is stored as an integer<br>
Some records have *rating_numerator* = 0 or > 20<br>
The *name* column contains invalid values such as 'None' or 'a'. I'm not sure it will be used for analysis<br>

The *doggo, floofer, pupper, puppo* columns have values in only 380 records, versus 430 records that mention a stage in the *text* column<br>
*doggo, floofer, pupper, puppo* can be combined into one column<br>
*tweet_id* should be stored as object (string) type rather than int64
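
One way the four stage columns could be combined into a single column is sketched below on made-up rows. The new column name `stage` and the helper `combine_stages` are assumptions, as is the choice to join multiple stages with a comma when a tweet mentions more than one.

```python
import pandas as pd

# Made-up rows using the same 'None' placeholder string as the archive
demo = pd.DataFrame({
    'doggo':   ['doggo', 'None', 'None', 'doggo'],
    'floofer': ['None', 'None', 'None', 'None'],
    'pupper':  ['None', 'pupper', 'None', 'pupper'],
    'puppo':   ['None', 'None', 'None', 'None'],
})
stage_columns = ['doggo', 'floofer', 'pupper', 'puppo']


def combine_stages(row):
    """Join every stage present in the row; return None when there is none."""
    found = [row[column] for column in stage_columns if row[column] != 'None']
    return ', '.join(found) if found else None


demo['stage'] = demo[stage_columns].apply(combine_stages, axis=1)
```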

In [None]:
#Checking for duplicated data
df_archive[df_archive.duplicated()].shape

In [None]:
df_archive['rating_denominator'].describe()
# rating_numerator rating_denominator

In [None]:
df_archive[df_archive['rating_denominator'] != 10][['tweet_id', 'rating_numerator', 'rating_denominator', 'text']]

In [None]:
df_archive['rating_numerator'].describe()

In [None]:
df_archive.query('rating_numerator < 1 or rating_numerator > 20')[['tweet_id', 'rating_numerator', 'text']]
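
Since some ratings in the tweet text are decimals, a possible way to re-extract them is a regular expression that allows an optional fractional part in the numerator. This is only a sketch: the pattern, the helper `extract_rating`, and the example texts are assumptions, and the first `x/y` match in a tweet is not always the rating (it can also be a date or a phrase such as 24/7), so suspicious rows still need a manual check.

```python
import re

# Numerator with optional decimals, a slash, then an integer denominator
RATING_PATTERN = re.compile(r'(\d+(?:\.\d+)?)/(\d+)')


def extract_rating(text):
    """Return (numerator, denominator) from the first 'x/y' pattern, or None."""
    match = RATING_PATTERN.search(text)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


# Made-up example texts
examples = [
    'This is Bella. 13.5/10 would pet',
    'Good dog. 12/10',
    'No rating here',
]
ratings = [extract_rating(text) for text in examples]
```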

In [None]:
df_archive['name'].describe()

In [None]:
df_archive['name'].value_counts()

In [None]:
df_archive.query("doggo != 'None' or floofer != 'None' or pupper != 'None' or puppo != 'None'").shape


In [None]:
df_archive[df_archive['text'].str.contains("puppo")].shape

In [None]:
df_archive[df_archive['text'].str.contains("floof")].shape

In [None]:
df_archive[df_archive['text'].str.contains("pupper")].shape

#### Image prediction table

In [None]:
df_predictions.head()

In [None]:
df_predictions.info()

#### Issues
##### df_predictions table

The *jpg_url* column is not needed <br>
*p1, p2, p3*: some breed names start with a capital letter, some do not<br>
Missing data: there are no predictions for 281 records from the archive table (replies and retweets?)<br>
*tweet_id* should be stored as object (string) type, not int
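
The inconsistent capitalization in *p1, p2, p3* could be normalized with pandas string methods. The sketch below lowercases the names and replaces underscores with spaces; both choices are assumptions about the desired final format, and the demo frame reuses a few breed names from the table preview.

```python
import pandas as pd

demo = pd.DataFrame({
    'p1': ['Welsh_springer_spaniel', 'redbone'],
    'p2': ['collie', 'miniature_pinscher'],
    'p3': ['Shetland_sheepdog', 'Rhodesian_ridgeback'],
})

# Apply the same normalization to all three prediction columns
for column in ['p1', 'p2', 'p3']:
    demo[column] = demo[column].str.replace('_', ' ', regex=False).str.lower()

normalized_p1 = demo['p1'].tolist()
normalized_p3 = demo['p3'].tolist()
```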


In [28]:
df_predictions.describe()

Unnamed: 0,tweet_id,img_num,p1_conf,p2_conf,p3_conf
count,2075.0,2075.0,2075.0,2075.0,2075.0
mean,7.384514e+17,1.203855,0.594548,0.1345886,0.06032417
std,6.785203e+16,0.561875,0.271174,0.1006657,0.05090593
min,6.660209e+17,1.0,0.044333,1.0113e-08,1.74017e-10
25%,6.764835e+17,1.0,0.364412,0.05388625,0.0162224
50%,7.119988e+17,1.0,0.58823,0.118181,0.0494438
75%,7.932034e+17,1.0,0.843855,0.1955655,0.09180755
max,8.924206e+17,4.0,1.0,0.488014,0.273419


In [29]:
#df_predictions.p1.value_counts()
#df_predictions.p2.value_counts()
df_predictions.p3.value_counts()

Labrador_retriever                79
Chihuahua                         58
golden_retriever                  48
Eskimo_dog                        38
kelpie                            35
kuvasz                            34
chow                              32
Staffordshire_bullterrier         32
beagle                            31
cocker_spaniel                    31
toy_poodle                        29
Pomeranian                        29
Pekinese                          29
Chesapeake_Bay_retriever          27
Great_Pyrenees                    27
Pembroke                          27
malamute                          26
French_bulldog                    26
American_Staffordshire_terrier    24
pug                               23
Cardigan                          23
basenji                           21
toy_terrier                       20
bull_mastiff                      20
Siberian_husky                    19
Shetland_sheepdog                 17
Boston_bull                       17
b

In [None]:
# Tweets in the archive table that are not in the prediction table
len(list(set(df_archive.tweet_id) - set(df_predictions.tweet_id)))

In [None]:
# Tweets in the prediction table that are not in the archive table
len(list(set(df_predictions.tweet_id) - set(df_archive.tweet_id)))
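
The set differences above give counts only. If the unmatched IDs themselves are needed, `Series.isin` keeps them in their original order. A minimal sketch on made-up IDs follows; the variable names are assumptions.

```python
import pandas as pd

# Made-up IDs standing in for the two tweet_id columns
archive_ids = pd.Series([1, 2, 3, 4, 5], name='tweet_id')
prediction_ids = pd.Series([2, 4, 6], name='tweet_id')

# Archive tweets without a prediction
missing_predictions = archive_ids[~archive_ids.isin(prediction_ids)].tolist()

# Predictions that have no matching archive tweet
unmatched_predictions = prediction_ids[~prediction_ids.isin(archive_ids)].tolist()
```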

#### Tweepy table

In [None]:
df_tweets.info()

In [None]:
df_tweets.retweet_count.describe()

In [None]:
#df_tweets.query('retweet_count < 5 or retweet_count > 70000')

In [None]:
df_tweets.favorite_count.describe()

In [None]:
#df_tweets[df_tweets.favorite_count == 0] df_archive

#### Issues
##### df_tweets table
- Merge the df_tweets and df_archive tables: df_tweets only holds additional information about the same tweets <br>
- Some tweets were deleted, so df_tweets has no data for them; their IDs are in the error_ids list
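
A left merge on `tweet_id` is one way to attach the counts to the archive while keeping tweets that have no API data. The sketch below uses made-up counts, and the `indicator=True` column is used only to list the IDs that found no match. Whether to keep or drop those rows afterwards is a separate decision not made here.

```python
import pandas as pd

# Made-up frames standing in for df_archive and df_tweets
archive = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'name': ['Phineas', 'Tilly', 'Archie'],
})
counts = pd.DataFrame({
    'tweet_id': [1, 3],
    'retweet_count': [10, 30],
    'favorite_count': [100, 300],
})

# Keep every archive row; counts are NaN where the API returned nothing
merged = archive.merge(counts, on='tweet_id', how='left', indicator=True)

# IDs present in the archive but absent from the counts table
missing_ids = merged.loc[merged['_merge'] == 'left_only', 'tweet_id'].tolist()
```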

<a id='clean'></a>
## Clean

Issues to address, as documented in the assessment:

##### df_archive table
Original tweets (neither replies nor retweets) have NaN in these columns:<br>
- in_reply_to_status_id<br>
- in_reply_to_user_id<br>
- retweeted_status_id<br>
- retweeted_status_user_id<br>
- retweeted_status_timestamp<br>
<br>

*rating_denominator* contains incorrect values: zeros and very large numbers (decimal?)<br>
*rating_numerator* can be a decimal, such as 13.5/10 in tweet_id 883482846933004288, but the column is stored as an integer<br>
Some records have *rating_numerator* = 0 or > 20<br>
The *name* column contains invalid values such as 'None' or 'a'. I'm not sure it will be used for analysis<br>

The *doggo, floofer, pupper, puppo* columns have values in only 380 records, versus 430 records that mention a stage in the *text* column<br>
*doggo, floofer, pupper, puppo* can be combined into one column<br>
*tweet_id* should be stored as object (string) type rather than int64

##### df_predictions table
*p1, p2, p3*: some breed names start with a capital letter, some do not<br>
Missing data: there are no predictions for 281 records from the archive table (replies and retweets?)<br>
*tweet_id* should be stored as object (string) type, not int

##### df_tweets table
- Merge the df_tweets and df_archive tables: df_tweets only holds additional information about the same tweets <br>
- Some tweets were deleted, so df_tweets has no data for them; their IDs are in the error_ids list
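
Storing *tweet_id* as a string matters because the IDs are identifiers, not quantities, and an 18-digit integer does not survive a conversion to float (as happens in `describe()` output or after a merge that introduces NaN). A small sketch of the conversion follows, using two IDs from the archive preview; the variable names are assumptions.

```python
import pandas as pd

demo = pd.DataFrame({'tweet_id': [892420643555336193, 892177421306343426]})

# A float cannot represent every 18-digit integer exactly,
# so a round trip through float changes this ID
survives_float_round_trip = int(float(892420643555336193)) == 892420643555336193

# Convert the identifier column to strings
demo['tweet_id'] = demo['tweet_id'].astype(str)
id_strings = demo['tweet_id'].tolist()
```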

In [30]:
#Create copies of the tables
df_archive_copy = df_archive.copy()
df_predictions_copy = df_predictions.copy()
df_tweets_copy = df_tweets.copy()

#### Drop all not needed columns
##### df_archive_copy table
Columns to delete: *timestamp, source, expanded_urls* <br>
##### df_predictions_copy table
Columns to delete: *jpg_url*

In [31]:
df_archive_copy.drop(['timestamp', 'source', 'expanded_urls'], axis=1, inplace=True)
df_predictions_copy.drop(['jpg_url'], axis=1, inplace=True)

In [32]:
df_archive_copy.head(1)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,This is Phineas. He's a mystical boy. Only eve...,,,,13,10,Phineas,,,,


In [34]:
df_predictions_copy.head(1)

Unnamed: 0,tweet_id,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True


Remaining steps for this project:<br>
- Storing, analyzing, and visualizing data: store the clean DataFrame(s) in a CSV file, with the main one named twitter_archive_master.csv. If additional files exist because multiple tables are required for tidiness, name these files appropriately.<br>
- Produce at least three (3) insights and one (1) visualization.
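
The storing step could look like the sketch below, which writes a made-up one-row frame to `twitter_archive_master.csv` in a temporary directory and reads it back. Passing `index=False` keeps the row index out of the file, and reading with `dtype={'tweet_id': str}` is an assumption that the IDs should stay strings after reloading.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for the cleaned master table
demo_master = pd.DataFrame({
    'tweet_id': ['892420643555336193'],
    'rating_numerator': [13],
    'rating_denominator': [10],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'twitter_archive_master.csv')
    # Write without the index so the file holds only the data columns
    demo_master.to_csv(path, index=False)
    # Read the IDs back as strings rather than integers
    reloaded = pd.read_csv(path, dtype={'tweet_id': str})
```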

<a id='ref'></a>
## References

https://stackoverflow.com/questions/28384588/twitter-api-get-tweets-with-specific-id
<br>
https://stackabuse.com/reading-and-writing-json-to-a-file-in-python/
<br>
https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/tweet-object<br>
https://stackoverflow.com/questions/37863660/pandas-dataframe-query-fetch-not-null-rows-pandas-equivalent-to-sql-is-no<br>
