# Project 4: Wrangle, assess, clean and analyse Twitter data - WeRateDogs


                            Christine Shuttleworth, 1st of October 2020



### Table of Contents
- [Introduction](#intro)
- [Part I - Data wrangling](#wrangling)
    - [Twitter Archive - load csv file](#load_csv)
    - [Twitter API - access and load data via Twitter API access](#twitter_api)
    - [Download and ingest neural network predictor data using requests](#requests)
- [Part II - Assess data](#assess)
    - [Visual assessment: data overview](#visual)
    - [Programmatic assessment:](#programmatic)
        - [Data structure:](#structure)
        - [Data quality:](#structure)
    - [Summary list of data issues:](#summary_issues)
        - [Tidiness issues:](#tidyness)
        - [Cleanliness issues:](#cleanliness)
- [Part III - Clean data and create twitter_archive_master.csv file](#clean)
    - [Define issue: x](#def1)
    - [Code issue: x](#code1)
    - [Test issue: x](#test1)
    - [Define issue: x](#def2)
    - [Code issue: x](#code2)
    - [Test issue: x](#test2)
    - [Define issue: x](#def3)
    - [Code issue: x](#code3)
    - [Test issue: x](#test3)
    - [Define issue: x](#def4)
    - [Code issue: x](#code4)
    - [Test issue: x](#test4)
- [Part IV - Analyse data](#clean)
    - [Insight 1: x](#insight1) Which type of dog is rated the most often and the highest?
    - [Insight 2: x](#insight2)
    - [Insight 3: x](#insight3)





<a id='intro'></a>
### Introduction 

For this report, I wrangled WeRateDogs Twitter data to create interesting and trustworthy insights and visualisations of the dog-rating Twitter feed.

The Twitter data is enhanced with the likely breed of each rated dog, based on the images available in the tweets. This information originates from a neural network image prediction data set of dog types.

To achieve this, I created a solid and clean master dataset. Possible questions to ask:
- Which dog type is rated most often, and which is rated highest?

Based on the analysis I created two reports:

    wrangle_report.pdf - summary of my wrangling effort
    act_report.pdf - insights and visualisation of the findings as a magazine article or blog post

<a id='wrangling'></a>
### Part I - Data wrangling

Set up the Python environment

In [1]:
import pandas as pd
import numpy as np
import tweepy as tw
import requests
import config as cfg
import os 
from dotenv import load_dotenv

%matplotlib inline
%load_ext dotenv
%dotenv

pd.options.display.max_rows = 999

<a id='load_csv'></a>
#### Load twitter-archive-enhanced.csv

In [2]:
df_ta = pd.read_csv('twitter-archive-enhanced.csv')

In [3]:
df_ta.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

Columns:
1. tweet_id: Twitter reference (ID) of this particular tweet
2. in_reply_to_status_id: ID of the tweet that this tweet replies to. Tweets with NaN in this column are not replies.
3. in_reply_to_user_id: ID of the user whose tweet was replied to
4. timestamp: timestamp of the tweet
5. source: source of the tweet, stored as an HTML link - Twitter for iPhone, Vine - Make a Scene, Twitter Web Client, TweetDeck
6. text: text of the tweet, including hashtags and a shortened URL
7. retweeted_status_id: ID of the original tweet that this tweet retweets. Tweets with NaN in this column are not retweets.
8. retweeted_status_user_id: ID of the user who wrote the original tweet
9. retweeted_status_timestamp: timestamp of the original tweet
10. expanded_urls: full (expanded) URL linked in the tweet
11. rating_numerator: rating of dog ...
12. rating_denominator: ... out of this number
13. name: dog name
14. doggo: flag if this dog falls into the doggo category ("None" otherwise)
15. floofer: flag if this dog falls into the floofer category ("None" otherwise)
16. pupper: flag if this dog falls into the pupper category ("None" otherwise)
17. puppo: flag if this dog falls into the puppo category ("None" otherwise)
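Since replies and retweets are marked by non-null values in `in_reply_to_status_id` and `retweeted_status_id`, original tweets could be selected with a null check. The following is a minimal sketch on a made-up toy frame (the values are not taken from the archive); only the column names match the data above.

```python
import numpy as np
import pandas as pd

# Toy frame with the same column names as the archive (values are invented)
toy = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "in_reply_to_status_id": [np.nan, 10.0, np.nan],
    "retweeted_status_id": [np.nan, np.nan, 20.0],
})

# A tweet is treated as original if it is neither a reply nor a retweet
is_original = (
    toy["in_reply_to_status_id"].isna()
    & toy["retweeted_status_id"].isna()
)
originals = toy[is_original]
print(originals["tweet_id"].tolist())
```

The same mask should apply to `df_ta` unchanged, assuming the NaN convention described in the column list holds for every row.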

In [4]:
df_ta

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2351,666049248165822465,,,2015-11-16 00:24:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a 1949 1st generation vulpix. Enj...,,,,https://twitter.com/dog_rates/status/666049248...,5,10,,,,,
2352,666044226329800704,,,2015-11-16 00:04:52 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a purebred Piers Morgan. Loves to Netf...,,,,https://twitter.com/dog_rates/status/666044226...,6,10,a,,,,
2353,666033412701032449,,,2015-11-15 23:21:54 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here is a very happy pup. Big fan of well-main...,,,,https://twitter.com/dog_rates/status/666033412...,9,10,a,,,,
2354,666029285002620928,,,2015-11-15 23:05:30 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a western brown Mitsubishi terrier. Up...,,,,https://twitter.com/dog_rates/status/666029285...,7,10,a,,,,


In [5]:
df_ta.source.value_counts()

<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     2221
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                          91
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                       33
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      11
Name: source, dtype: int64
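The `source` values are full HTML anchor tags, which makes them hard to read and compare. One possible way to keep only the visible label is a regular expression on the text between `>` and `</a>`. This is a sketch that assumes every value follows the anchor pattern shown in the output above.

```python
import pandas as pd

# Two of the anchor strings shown in the value_counts output
source = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
])

# Capture the text between the opening tag and the closing </a>
labels = source.str.extract(r">([^<]+)</a>", expand=False)
print(labels.tolist())
```

Values that do not match the pattern would come back as NaN, which makes unexpected formats easy to spot.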

In [6]:
#comparing with the string "NaN" does not filter missing values, use notna() instead
#df_ta[df_ta.in_reply_to_status_id.notna()]
#df_ta[df_ta.retweeted_status_id.notna()]
df_ta.query('doggo != "None"')

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
9,890240255349198849,,,2017-07-26 15:59:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Cassie. She is a college pup. Studying...,,,,https://twitter.com/dog_rates/status/890240255...,14,10,Cassie,doggo,,,
43,884162670584377345,,,2017-07-09 21:29:42 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Yogi. He doesn't have any important dog m...,,,,https://twitter.com/dog_rates/status/884162670...,12,10,Yogi,doggo,,,
99,872967104147763200,,,2017-06-09 00:02:31 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here's a very large dog. He has a date later. ...,,,,https://twitter.com/dog_rates/status/872967104...,12,10,,doggo,,,
108,871515927908634625,,,2017-06-04 23:56:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Napolean. He's a Raggedy East Nicaragu...,,,,https://twitter.com/dog_rates/status/871515927...,12,10,Napolean,doggo,,,
110,871102520638267392,,,2017-06-03 20:33:19 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Never doubt a doggo 14/10 https://t.co/AbBLh2FZCH,,,,https://twitter.com/animalcog/status/871075758...,14,10,,doggo,,,
121,869596645499047938,,,2017-05-30 16:49:31 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Scout. He just graduated. Officially a...,,,,https://twitter.com/dog_rates/status/869596645...,12,10,Scout,doggo,,,
172,858843525470990336,,,2017-05-01 00:40:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",I have stumbled puppon a doggo painting party....,,,,https://twitter.com/dog_rates/status/858843525...,13,10,,doggo,,,
191,855851453814013952,,,2017-04-22 18:31:02 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here's a puppo participating in the #ScienceMa...,,,,https://twitter.com/dog_rates/status/855851453...,13,10,,doggo,,,puppo
200,854010172552949760,,,2017-04-17 16:34:26 +0000,"<a href=""http://twitter.com/download/iphone"" r...","At first I thought this was a shy doggo, but i...",,,,https://twitter.com/dog_rates/status/854010172...,11,10,,doggo,floofer,,
211,851953902622658560,,,2017-04-12 00:23:33 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: This is Astrid. She's a guide d...,8.293743e+17,4196984000.0,2017-02-08 17:00:26 +0000,https://twitter.com/dog_rates/status/829374341...,13,10,Astrid,doggo,,,
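The output above shows that one tweet can carry more than one stage flag (for example doggo and puppo in row 191). A possible way to collapse the four flag columns into a single column is sketched below on a toy frame. The column name `stage` and the helper `combine_stages` are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd

STAGES = ["doggo", "floofer", "pupper", "puppo"]

# Toy frame using the archive's convention of the string "None" for "not set"
toy = pd.DataFrame({
    "doggo":   ["doggo", "doggo", "None"],
    "floofer": ["None", "None", "None"],
    "pupper":  ["None", "None", "None"],
    "puppo":   ["None", "puppo", "None"],
})

def combine_stages(row):
    """Join all stage flags that are set; return None if no flag is set."""
    found = [row[col] for col in STAGES if row[col] != "None"]
    return ", ".join(found) if found else None

# Hypothetical new column holding the combined stage(s)
toy["stage"] = toy[STAGES].apply(combine_stages, axis=1)
print(toy["stage"].tolist())
```

Keeping multi-stage rows as a joined string is only one option; they could equally be reviewed by hand against the tweet text.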


In [7]:
df_ta.text[10]

'This is Koda. He is a South Australian deckshark. Deceptively deadly. Frighteningly majestic. 13/10 would risk a petting #BarkWeek https://t.co/dVPW0B0Mme'
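The rating appears in the tweet text as `numerator/denominator` (here `13/10`). As a cross-check of the `rating_numerator` and `rating_denominator` columns, the rating could be re-extracted from the text. This is a rough sketch: the pattern and the helper name `extract_rating` are assumptions, and the first match is not guaranteed to be the rating when a tweet contains dates or other fractions.

```python
import re

text = ("This is Koda. He is a South Australian deckshark. Deceptively deadly. "
        "Frighteningly majestic. 13/10 would risk a petting #BarkWeek "
        "https://t.co/dVPW0B0Mme")

# Numerator may be a decimal number, denominator is a whole number
RATING = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")

def extract_rating(tweet_text):
    """Return (numerator, denominator) of the first match, or None."""
    match = RATING.search(tweet_text)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))

rating = extract_rating(text)
print(rating)
```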

<a id='twitter_api'></a>
#### Request data from the Twitter API

Use the tweet ID to request the retweet count and favourite count of each tweet.

https://developer.twitter.com/en/docs/labs/tweets-and-users/quick-start/get-tweets
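One way this request could look is sketched below. It assumes an `api` object that behaves like a tweepy 3.x `API` instance, i.e. `get_status(id)` returns an object with `retweet_count` and `favorite_count` attributes. The helper `fetch_counts` and the offline stand-in `FakeAPI` are hypothetical and exist only so that the example runs without network access.

```python
from types import SimpleNamespace

def fetch_counts(api, tweet_ids):
    """Collect retweet and favourite counts for each tweet ID.

    Returns (rows, failed): a list of dicts and a list of IDs that
    could not be fetched (for example deleted tweets).
    """
    rows, failed = [], []
    for tweet_id in tweet_ids:
        try:
            status = api.get_status(tweet_id)
        except Exception:  # tweepy 3.x raises TweepError here; kept broad for the sketch
            failed.append(tweet_id)
            continue
        rows.append({
            "tweet_id": tweet_id,
            "retweet_count": status.retweet_count,
            "favorite_count": status.favorite_count,
        })
    return rows, failed

class FakeAPI:
    """Offline stand-in for the real API object, for illustration only."""
    def get_status(self, tweet_id):
        if tweet_id == 2:
            raise LookupError("tweet not found")
        return SimpleNamespace(retweet_count=5, favorite_count=9)

rows, failed = fetch_counts(FakeAPI(), [1, 2])
print(rows, failed)
```

With a real API object, the call would be along the lines of `fetch_counts(api, df_ta.tweet_id)`, and `pd.DataFrame(rows)` would turn the result into a table that can be merged on `tweet_id`.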

In [8]:
##using magic command to access variables in .env
#%env
##Get, set, or list environment variables.

##Usage:

#%env: lists all environment variables/values
#%env var: get value for var
#%env var val: set value for var
#%env var=val: set value for var
#%env var=$val: set value for var, using python expansion if possible

In [15]:
#keep the access tokens out of the notebook: load them from a .env file with python-dotenv
#(alternative: export them in the .bash_profile file)
#pip install -U python-dotenv

import os
from pathlib import Path  # Python 3.6+ only
env_path = Path('.') / '.env'
load_dotenv(dotenv_path=env_path)

consumer_key = os.getenv("TWAPIKEY")
consumer_secret = os.getenv("TWAPISECRETKEY")

#use tweepy to access twitter API with OAuth2

auth = tw.AppAuthHandler(consumer_key, consumer_secret)

api = tw.API(auth)
for tweet in tw.Cursor(api.search, q='tweepy').items(10):
    print(tweet.text)

RT @StackCodeReview: Can you answer this? Twitter contact scraper with Tweepy &amp; Django https://t.co/wAhnrRvghr #python
Can you answer this? Twitter contact scraper with Tweepy &amp; Django https://t.co/wAhnrRvghr #python
El último salto
#NationalGeographic #Fotodeldia #Python #Tweepy https://t.co/IR6KjPWzKg
https://t.co/ByoPQ7tzAT #Python #Tweepy #Unsplash #Photo https://t.co/U9aqjwPSSg
Sending my first tweet via Tweepy!
Hello Tweepy
RT @steely_dan_bot: tweepy==3.8.0
urllib3==1.25.7
tweepy==3.8.0
urllib3==1.25.7
RT @KatiMichel: For #100DaysofCode, I'm completing some Real Python projects. A few of my faves involve Requests, Tweepy, InstaPy, GeoDjang…
For #100DaysofCode, I'm completing some Real Python projects. A few of my faves involve Requests, Tweepy, InstaPy,… https://t.co/p4PW6JWyJh


In [None]:
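#Tweepy doc: http://docs.tweepy.org/en/latest/
#rate limit handler: wait 15 minutes when the rate limit is hit, then continue
import time

def limit_handled(cursor):
    while True:
        try:
            yield cursor.next()
        except tw.RateLimitError:
            time.sleep(15 * 60)
        except StopIteration:
            #cursor is exhausted: end the generator cleanly
            return

#unfinished: API method and target data frame are still to be filled in
#for tweets in limit_handled(tw.Cursor(api.xxx).items()):
#    df_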

In [10]:
#using a python config file to store access tokens, e.g. with the wikiart API
#response = requests.get(f"https://www.wikiart.org/en/Api/2/login?accessCode={cfg.twitter['api_key']}&secretCode={cfg.twitter['api_secret_key']}")


In [11]:
#secure storage of access details with yaml
#import yaml

#with open("config.yml", 'r') as ymlfile:
#    cfg = yaml.safe_load(ymlfile)

#print(cfg['api_creds']['access_code'])
#print(cfg['api_creds']['secret_code'])

### Appendix:

Storing authorisation keys securely outside of the notebook:

http://veekaybee.github.io/2020/02/25/secrets/

https://pypi.org/project/python-dotenv/

http://docs.tweepy.org/en/latest/getting_started.html#api
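
`os.getenv` returns `None` when a variable is missing, so a typo in the `.env` file only surfaces later as an authentication error. A small guard could fail earlier with a clearer message. This is a sketch; the helper name `require_env` and the variable `TWAPIKEY_DEMO` are made up for illustration.

```python
import os

def require_env(name):
    """Return the value of an environment variable or raise if it is unset/empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

# Demo with a dummy variable set in-process (real keys would come from .env)
os.environ["TWAPIKEY_DEMO"] = "dummy-value"
demo_key = require_env("TWAPIKEY_DEMO")
print(len(demo_key))
```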
