# Project 4: Wrangle, assess, clean and analyse twitter data - WeRateDogs


                            Christine Shuttleworth, 1st of October 2020



### Table of Contents
- [Introduction](#intro)
- [Part I - Data wrangling](#wrangling)
    - [Twitter Archive - load csv file](#load_csv)
    - [Twitter API - access and load data via Twitter API access](#twitter_api)
    - [Download and ingest neural network predictor data using requests](#requests)
- [Part II - Assess data](#assess)
    - [Visual assessment: data overview](#visual)
    - [Programmatic assessment:](#programmatic)
        - [Data structure:](#structure)
        - [Data quality:](#quality)
    - [Summary list of data issues:](#summary_issues)
        - [Tidiness issues:](#tidyness)
        - [Cleanliness issues:](#cleanliness)
- [Part III - Clean data and create twitter_archive_master.csv file](#clean)
    - [Define issue: x](#def1)
    - [Code issue: x](#code1)
    - [Test issue: x](#test1)
    - [Define issue: x](#def2)
    - [Code issue: x](#code2)
    - [Test issue: x](#test2)
    - [Define issue: x](#def3)
    - [Code issue: x](#code3)
    - [Test issue: x](#test3)
    - [Define issue: x](#def4)
    - [Code issue: x](#code4)
    - [Test issue: x](#test4)
- [Part IV - Analyse data](#clean)
    - [Insight 1: x](#insight1) Which type of dog is rated the most often and the highest?
    - [Insight 2: x](#insight2)
    - [Insight 3: x](#insight3)





<a id='intro'></a>
### Introduction 

For this report, I wrangled WeRateDogs Twitter data to create interesting and trustworthy insights and visualizations of the dog-rating Twitter feed. 

The Twitter data is enhanced with the likely breed of the dog being rated, based on the images available in the tweets. This data originates from a neural network image prediction dataset of dog types.

To achieve this, I created a solid and clean master dataset. Possible questions to ask:
- Which dog type is rated most often and the highest?

Based on the analysis I created two reports:

    wrangle_report.pdf - summary of my wrangling effort
    act_report.pdf - insights and visualisation of the findings as a magazine article or blog post

<a id='wrangling'></a>
### Part 1 - Data wrangling

Set up python environment

In [190]:
import pandas as pd
import numpy as np
import tweepy as tw
import requests
import config as cfg
import os
from pathlib import Path 
import json
from dotenv import load_dotenv

%matplotlib inline
#%load_ext dotenv
#%dotenv

pd.options.display.max_rows = 999

<a id='load_csv'></a>
#### Load twitter_archive_enhanced.csv and learn about the data

In [191]:
df_ta = pd.read_csv('twitter-archive-enhanced.csv')

In [192]:
df_ta.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

Columns:
1. tweet_id: Twitter's identifier for this particular tweet
2. in_reply_to_status_id: tweet_id of the tweet that this tweet replies to. Tweets with NaN in this column are not replies.
3. in_reply_to_user_id: user_id of the author of the tweet being replied to
4. timestamp: timestamp of the tweet
5. source: source of the tweet - Twitter for iPhone, Vine - Make a Scene, Twitter Web Client, TweetDeck
6. text: text of the tweet, including hashtags, the rating and a short URL link to the tweet
7. retweeted_status_id: tweet_id of the original tweet that this tweet retweets. Tweets with NaN in this column are not retweets.
8. retweeted_status_user_id: user_id of the author of the original (retweeted) tweet
9. retweeted_status_timestamp: timestamp of the original (retweeted) tweet
10. expanded_urls: full URL(s) of the tweet
11. rating_numerator: rating of dog ...
12. rating_denominator: ... out of this number
13. name: dog name
14. doggo: flag if this dog falls into the doggo category
15. floofer: flag if this dog falls into the floofer category
16. pupper: flag if this dog falls into the pupper category
17. puppo: flag if this dog falls into the puppo category

In [193]:
df_ta.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU,,,,https://twitter.com/dog_rates/status/892420643555336193/photo/1,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIV",,,,https://twitter.com/dog_rates/status/892177421306343426/photo/1,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJB,,,,https://twitter.com/dog_rates/status/891815181378084864/photo/1,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQ,,,,https://twitter.com/dog_rates/status/891689557279858688/photo/1,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Franklin. He would like you to stop calling him ""cute."" He is a very fierce shark and should be respected as such. 12/10 #BarkWeek https://t.co/AtUZn91f7f",,,,"https://twitter.com/dog_rates/status/891327558926688256/photo/1,https://twitter.com/dog_rates/status/891327558926688256/photo/1",12,10,Franklin,,,,


In [194]:
df_ta.source.value_counts()

<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     2221
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                          91
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                       33
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      11
Name: source, dtype: int64

In [195]:
#df_ta.query('in_reply_to_status_id != "NaN"')
#df_ta.query('retweeted_status_id != "NaN"')
df_ta.query('doggo != "None"').head(100)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
9,890240255349198849,,,2017-07-26 15:59:51 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Cassie. She is a college pup. Studying international doggo communication and stick theory. 14/10 so elegant much sophisticate https://t.co/t1bfwz5S2A,,,,https://twitter.com/dog_rates/status/890240255349198849/photo/1,14,10,Cassie,doggo,,,
43,884162670584377345,,,2017-07-09 21:29:42 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Meet Yogi. He doesn't have any important dog meetings today he just enjoys looking his best at all times. 12/10 for dangerously dapper doggo https://t.co/YSI00BzTBZ,,,,https://twitter.com/dog_rates/status/884162670584377345/photo/1,12,10,Yogi,doggo,,,
99,872967104147763200,,,2017-06-09 00:02:31 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here's a very large dog. He has a date later. Politely asked this water person to check if his breath is bad. 12/10 good to go doggo https://t.co/EMYIdoblMR,,,,"https://twitter.com/dog_rates/status/872967104147763200/photo/1,https://twitter.com/dog_rates/status/872967104147763200/photo/1",12,10,,doggo,,,
108,871515927908634625,,,2017-06-04 23:56:03 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Napolean. He's a Raggedy East Nicaraguan Zoom Zoom. Runs on one leg. Built for deception. No eyes. Good with kids. 12/10 great doggo https://t.co/PR7B7w1rUw,,,,"https://twitter.com/dog_rates/status/871515927908634625/photo/1,https://twitter.com/dog_rates/status/871515927908634625/photo/1",12,10,Napolean,doggo,,,
110,871102520638267392,,,2017-06-03 20:33:19 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Never doubt a doggo 14/10 https://t.co/AbBLh2FZCH,,,,https://twitter.com/animalcog/status/871075758080503809,14,10,,doggo,,,
121,869596645499047938,,,2017-05-30 16:49:31 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Scout. He just graduated. Officially a doggo now. Have fun with taxes and losing sight of your ambitions. 12/10 would throw cap for https://t.co/DsA2hwXAJo,,,,"https://twitter.com/dog_rates/status/869596645499047938/photo/1,https://twitter.com/dog_rates/status/869596645499047938/photo/1",12,10,Scout,doggo,,,
172,858843525470990336,,,2017-05-01 00:40:27 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",I have stumbled puppon a doggo painting party. They're looking to be the next Pupcasso or Puppollock. All 13/10 would put it on the fridge https://t.co/cUeDMlHJbq,,,,https://twitter.com/dog_rates/status/858843525470990336/photo/1,13,10,,doggo,,,
191,855851453814013952,,,2017-04-22 18:31:02 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here's a puppo participating in the #ScienceMarch. Cleverly disguising her own doggo agenda. 13/10 would keep the planet habitable for https://t.co/cMhq16isel,,,,https://twitter.com/dog_rates/status/855851453814013952/photo/1,13,10,,doggo,,,puppo
200,854010172552949760,,,2017-04-17 16:34:26 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","At first I thought this was a shy doggo, but it's actually a Rare Canadian Floofer Owl. Amateurs would confuse the two. 11/10 only send dogs https://t.co/TXdT3tmuYk",,,,"https://twitter.com/dog_rates/status/854010172552949760/photo/1,https://twitter.com/dog_rates/status/854010172552949760/photo/1",11,10,,doggo,floofer,,
211,851953902622658560,,,2017-04-12 00:23:33 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: This is Astrid. She's a guide doggo in training. 13/10 would follow anywhere https://t.co/xo7FZFIAao,8.293743e+17,4196984000.0,2017-02-08 17:00:26 +0000,"https://twitter.com/dog_rates/status/829374341691346946/photo/1,https://twitter.com/dog_rates/status/829374341691346946/photo/1,https://twitter.com/dog_rates/status/829374341691346946/photo/1,https://twitter.com/dog_rates/status/829374341691346946/photo/1",13,10,Astrid,doggo,,,


In [196]:
df_ta.text[9]

'This is Cassie. She is a college pup. Studying international doggo communication and stick theory. 14/10 so elegant much sophisticate https://t.co/t1bfwz5S2A'

<a id='twitter_api'></a>
#### Request data from the twitter API and load it into a dataframe

Use twitter ID to request retweet count and favourite count.

https://developer.twitter.com/en/docs/labs/tweets-and-users/quick-start/get-tweets

In [115]:
#using .env file and python-dotenv to keep access token safe
#pip install -U python-dotenv

#import os
#from pathlib import Path  # Python 3.6+ only
env_path = Path('.') / '.env'
load_dotenv(dotenv_path=env_path)

consumer_key = os.getenv("TWAPIKEY")
consumer_secret = os.getenv("TWAPISECRETKEY")

#use tweepy to access twitter API with OAuth2

auth = tw.AppAuthHandler(consumer_key, consumer_secret)

#Other option to store passkey safely:
#1. could use a python .config file and the config library to store access token e.g. with wikiart API
#response = requests.get(f"https://www.wikiart.org/en/Api/2/login?accessCode={cfg.twitter['api_key']}&secretCode={cfg.twitter['api_secret_key']}")

#2. secure storage of access details with yaml
#import yaml

#with open("config.yml", 'r') as ymlfile:
#    cfg = yaml.safe_load(ymlfile)

#print(cfg['api_creds']['access_code'])
#print(cfg['api_creds']['secret_code'])

#3.using magic command to access variables in .env
#%env
##Get, set, or list environment variables.

##Usage:

#%env: lists all environment variables/values 
#%env var: get value for var 
#%env var val: set value for var 
#%env var=val: set value for var 
#%env var=$val: set value for var, 
    
##using python expansion if possible
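
`os.getenv` returns `None` when a variable is missing, so a typo in the `.env` file only surfaces later as an authentication failure. The sketch below is one possible guard, not part of the original notebook: `require_env` is a hypothetical helper name, and the value set here is a dummy, not a real credential.

```python
import os

def require_env(name):
    """Return the value of environment variable `name`, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set; check the .env file.")
    return value

# Demonstration only: a dummy value stands in for the real key
os.environ["TWAPIKEY"] = "dummy-key"
consumer_key = require_env("TWAPIKEY")
```

Failing at load time keeps the error message close to its cause.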



In [57]:
#Access tweets by tweet_id using .get_status() to extract favourites_count, retweet_count and write to csv file
#api.get_status('749075273010798592')._json['retweet_count']

api = tw.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

with open('twitter_retweet_favorite_count.csv', 'a') as file:
    file.write('tweet_id, retweet_count, favorite_count \n')
        
    for i in df_ta['tweet_id']:
        try:
            json_resp = api.get_status(i)
            rt_count=json_resp._json['retweet_count']
            f_count=json_resp._json['favorite_count']
            file.write(f'{i}'+','+f'{rt_count}'+','+f'{f_count}'+'\n')
        except tw.TweepError:  #tweepy is imported as tw above
            file.write(f'{i}'+',,\n')


Rate limit reached. Sleeping for: 316
Rate limit reached. Sleeping for: 534


In [197]:
df_tapi = pd.read_csv('twitter_retweet_favorite_count.csv')

In [198]:
df_tapi.head()
df_tapi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   tweet_id          2356 non-null   int64  
 1    retweet_count    2331 non-null   float64
 2    favorite_count   2331 non-null   float64
dtypes: float64(2), int64(1)
memory usage: 55.3 KB
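
The column names above carry leading spaces (`' retweet_count'`) because the header line was written with a space after each comma. A minimal sketch of one way to tidy this after loading, using a toy CSV string in place of the real file:

```python
import io
import pandas as pd

# Toy CSV mimicking the header written above (note the spaces after the commas)
raw = "tweet_id, retweet_count, favorite_count \n1,10,20\n2,,\n"

df_demo = pd.read_csv(io.StringIO(raw))
# Strip stray whitespace so columns can be addressed as df_demo.retweet_count
df_demo.columns = df_demo.columns.str.strip()
```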


In [199]:
## I could also write the JSON as text, one line per tweet, into a text file and extract the data later. I will do this to extract more information than we already have
## and possibly missing information such as missing expanded URLs. 

api = tw.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

with open('tweet_json.txt', 'a') as file:
        
    for i in df_ta['tweet_id']:
        try:
            json_resp = api.get_status(i)
            json.dump(json_resp._json, file)
            file.write('\n')
           
        except tw.TweepError:
            pass


KeyboardInterrupt: 

In [200]:
#Read the JSON-lines file back in: one tweet object per line
rows = []
with open('tweet_json.txt', 'r') as file:
    for line in file:
        #json.loads parses a string; json.load expects a file object, which is what
        #caused "AttributeError: 'str' object has no attribute 'read'" in the first attempt
        tw_json = json.loads(line)
        rows.append({'twitter_id': tw_json['id']})
        #Ideas for further fields to extract:
        #text (everything before the rating), hashtags, rating (regex r'\d?\d\/10'),
        #jpg_url (media url) and expanded_url

df_json = pd.DataFrame(rows)

In [None]:
df_json.head()

<a id='requests'></a>
#### Request data from URL and load .tsv file into dataframe 

In [86]:
url='https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
response.raise_for_status()  #stop here if the download failed
#response.content

with open('image-predictions.tsv', 'wb') as file:
    file.write(response.content)

df_pre = pd.read_csv('image-predictions.tsv', delimiter='\t')

In [98]:
df_pre.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


<a id='assess'></a>
### Part 2 - Assess Data

<a id='visual'></a>
#### Visual assessment: Data overview 

I have three dataframes that all link together via the tweet_id. 
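
As a rough sketch of how the three dataframes could be joined on `tweet_id`, the toy frames below stand in for df_ta, df_tapi and df_pre. The left joins and the `validate` check are my assumptions about a sensible approach, not a step the notebook has run yet.

```python
import pandas as pd

# Toy stand-ins for df_ta, df_tapi and df_pre
ta = pd.DataFrame({'tweet_id': [1, 2, 3], 'text': ['a', 'b', 'c']})
tapi = pd.DataFrame({'tweet_id': [1, 2, 3], 'retweet_count': [5.0, None, 7.0]})
pre = pd.DataFrame({'tweet_id': [1, 3], 'p1': ['collie', 'redbone']})

# Left joins keep every archive tweet; validate guards against duplicated tweet_ids
master = (ta
          .merge(tapi, on='tweet_id', how='left', validate='one_to_one')
          .merge(pre, on='tweet_id', how='left', validate='one_to_one'))
```

Tweets without an image prediction end up with NaN in the prediction columns rather than being dropped.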

Some of the columns look redundant, such as expanded_urls. I need to check whether the expanded URL always follows the same pattern. If it does, it can be derived from the tweet_id and does not have to be stored in its own column, where it takes up unnecessary space.
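
One way to run this check is to rebuild the URL from the tweet_id and compare it with the stored value. The sketch below uses two rows shown earlier in the notebook; the `/photo/1` suffix is an assumed pattern taken from the first rows of `df_ta.head()`.

```python
import pandas as pd

# Two rows copied from the archive output above
demo = pd.DataFrame({
    'tweet_id': [892420643555336193, 871102520638267392],
    'expanded_urls': [
        'https://twitter.com/dog_rates/status/892420643555336193/photo/1',
        'https://twitter.com/animalcog/status/871075758080503809',
    ],
})

# Assumed pattern: dog_rates status URL built from the tweet_id, ending in /photo/1
expected = 'https://twitter.com/dog_rates/status/' + demo['tweet_id'].astype(str) + '/photo/1'
demo['url_matches_id'] = demo['expanded_urls'] == expected
```

The second row does not match, which suggests the column is not fully derivable from the tweet_id.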

Other columns that are not necessary are the probability columns for the second- and third-best predictions. I am only interested in the best prediction of the dog type that is actually a dog.
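
A possible way to pick the best prediction that is actually a dog is sketched below on toy data. It assumes p1, p2 and p3 are already ordered from most to least confident, as the sample rows suggest; `dog_breed` is a hypothetical column name.

```python
import pandas as pd

# Toy predictions: row 0 has a dog in p1, row 1 only from p2 on, row 2 has none
pre = pd.DataFrame({
    'p1': ['collie', 'banana', 'seat_belt'],
    'p1_dog': [True, False, False],
    'p2': ['redbone', 'pug', 'sofa'],
    'p2_dog': [True, True, False],
    'p3': ['malinois', 'beagle', 'lamp'],
    'p3_dog': [True, True, False],
})

# Take p1 where it is a dog, otherwise fall back to p2, then p3
best = pre['p1'].where(pre['p1_dog'])
best = best.fillna(pre['p2'].where(pre['p2_dog']))
best = best.fillna(pre['p3'].where(pre['p3_dog']))
pre['dog_breed'] = best
```

Rows where none of the three predictions is a dog keep a missing value.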

The twitter archive dataframe - df_ta - includes columns that indicate if the tweet was a reply or not (in_reply_to_status_id, in_reply_to_user_id). These columns can be used to filter out any tweets that are not original posts and then the columns can be deleted.
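
The filtering step described above could look roughly like this. The frame is a toy stand-in for df_ta with made-up ids.

```python
import numpy as np
import pandas as pd

# Toy archive: tweet 2 is a reply, tweets 1 and 3 are original posts
ta = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'in_reply_to_status_id': [np.nan, 7.3e17, np.nan],
    'in_reply_to_user_id': [np.nan, 4196984000.0, np.nan],
})

reply_cols = ['in_reply_to_status_id', 'in_reply_to_user_id']
# Keep rows that are not replies, then drop the now-empty reply columns
originals = ta[ta['in_reply_to_status_id'].isna()].drop(columns=reply_cols)
```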

<a id='programmatic'></a>
#### Programmatic assessment 

In [91]:
df_ta.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [99]:
df_tapi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 3 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   tweet_id          2356 non-null   int64  
 1    retweet_count    2331 non-null   float64
 2    favorite_count   2331 non-null   float64
dtypes: float64(2), int64(1)
memory usage: 55.3 KB


In [100]:
df_pre.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


<a id='structure'></a>
#### Data structure: Tidiness issues

**Issue 1:** in_reply_to_status_id, in_reply_to_user_id. Use these columns to drop rows that are replies rather than original tweets, then delete both columns.

**Issue 2:** Find the best prediction for each dog that is actually a dog and store it in a column. Merge this column and the jpg_url column into the df_ta dataframe. All other columns are redundant.

**Issue 3:** Merge the df_tapi columns into the df_ta dataframe on the tweet_id column. Delete the df_tapi dataframe.

**Issue 4:** Doggo, Floofer, Pupper, Puppo columns need to be melted into one column - dog_age_category: which contains the category which is correct for the dog.
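
For Issue 4, a minimal sketch of collapsing the four stage columns into one is shown below. It assumes missing stages are stored as the string "None", as the earlier `query` call suggests, and it joins multiple stages with a comma instead of silently picking one. `combine_stages` is a hypothetical helper.

```python
import pandas as pd

stages = ['doggo', 'floofer', 'pupper', 'puppo']

# Toy rows: one doggo, one without a stage, one flagged as both doggo and pupper
ta = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'doggo': ['doggo', 'None', 'doggo'],
    'floofer': ['None', 'None', 'None'],
    'pupper': ['None', 'None', 'pupper'],
    'puppo': ['None', 'None', 'None'],
})

def combine_stages(row):
    """Join every stage flagged in this row; return None when no stage is set."""
    found = [s for s in stages if row[s] == s]
    return ','.join(found) if found else None

ta['dog_age_category'] = ta[stages].apply(combine_stages, axis=1)
ta = ta.drop(columns=stages)
```

Keeping combined values such as `doggo,pupper` visible makes the ambiguous rows easy to review by hand.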

<a id="quality"></a>
#### Data quality: 

**Issue 1:** suspicious rating_numerators (< 8 and > 15)

In [107]:
df_ta.rating_numerator.value_counts()

12      558
11      464
10      461
13      351
9       158
8       102
7        55
14       54
5        37
6        32
3        19
4        17
1         9
2         9
420       2
0         2
15        2
75        2
80        1
20        1
24        1
26        1
44        1
50        1
60        1
165       1
84        1
88        1
144       1
182       1
143       1
666       1
960       1
1776      1
17        1
27        1
45        1
99        1
121       1
204       1
Name: rating_numerator, dtype: int64
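
To investigate the suspicious numerators, one option is to re-extract the rating from the tweet text and compare it with the stored columns. The sketch below is an assumption about how that could be done; the second text is a made-up example of a decimal rating, and only the first `x/y` pattern in each text is captured.

```python
import pandas as pd

texts = pd.Series([
    "This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU",
    "Hypothetical example with a decimal rating 9.75/10",
])

# Capture the first "numerator/denominator" pair, allowing a decimal numerator
rating = texts.str.extract(r'(\d+(?:\.\d+)?)/(\d+)')
rating.columns = ['numerator', 'denominator']
rating = rating.astype(float)
```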

**Issue 2:** suspicious denominator values, especially 110, 120, etc., which most likely have been read in with one zero too many.

In [108]:
df_ta.rating_denominator.value_counts()

10     2333
11        3
50        3
80        2
20        2
2         1
16        1
40        1
70        1
15        1
90        1
110       1
120       1
130       1
150       1
170       1
7         1
0         1
Name: rating_denominator, dtype: int64
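
A quick way to pull out the rows behind these counts, and to compare ratings on a common scale, is sketched below. The numerator and denominator pairs are toy values, and `rating_ratio` is a hypothetical column name.

```python
import pandas as pd

# Toy ratings: a standard one, one with an unusual denominator, one with zero
ratings = pd.DataFrame({
    'rating_numerator': [13, 84, 5],
    'rating_denominator': [10, 70, 0],
})

# Rows to inspect by hand
unusual = ratings[ratings['rating_denominator'] != 10]

# Avoid dividing by zero before computing a comparable ratio
valid = ratings[ratings['rating_denominator'] > 0].copy()
valid['rating_ratio'] = valid['rating_numerator'] / valid['rating_denominator']
```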

**Issue 3:** wrong datatypes:
- df_ta: retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp
- df_tapi: retweet_count, favorite_count
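
A minimal sketch of possible conversions for two of these columns, on toy data. The choice of a timezone-aware datetime and of the nullable `Int64` dtype for counts is my assumption, not something the notebook has settled on.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'retweeted_status_timestamp': ['2017-02-08 17:00:26 +0000', None],
    'retweet_count': [10.0, np.nan],
})

# Parse timestamps as timezone-aware datetimes; missing values become NaT
demo['retweeted_status_timestamp'] = pd.to_datetime(demo['retweeted_status_timestamp'], utc=True)
# Nullable integer dtype keeps missing counts without falling back to float
demo['retweet_count'] = demo['retweet_count'].astype('Int64')
```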

**Issue 4:** Source should be a categorical column as there are only 4 options: 
<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     2221
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                          91
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                       33
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      11

Should be stored as: 'Twitter for iPhone', 'Vine - Make a Scene', 'Twitter Web Client', 'TweetDeck'. The href for these categories can be stored elsewhere.
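
The label can be pulled out of the anchor tag with a regular expression and then stored as a category. This is a sketch on two of the four values shown above; `source_label` is a hypothetical name.

```python
import pandas as pd

source = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
])

# Capture the text between the opening tag and </a>, then convert to category
source_label = source.str.extract(r'>([^<]+)</a>', expand=False).astype('category')
```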

**Issue 5**: Image url is stored twice? End of text and in df_pre?

**Issue 6:** Missing values in the expanded_urls column. This data is stored in the JSON file and can be extracted and added. Maybe add other interesting data that is missing.

**Issue 7:** More than one variable is stored in the df_ta.text column. Remove hashtags, URLs and ratings from the text, as these can be stored, or already are stored, in other columns. Two different URLs are saved for each tweet: one in the df_ta.text column and one in df_pre.jpg_url. The ratings are stored in two separate columns: df_ta.rating_numerator and df_ta.rating_denominator. End result: the text column only contains the text.

In [150]:
df_ta[df_ta.text.str.contains('#')].text.count()

27

In [163]:
df_ta[df_ta.text.str.contains('https:')].text.count()

2284

In [162]:
#df_ta.text[12]
df_ta.query('tweet_id == 892420643555336193').text[0] 

"This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU"

In [179]:
pd.options.display.max_colwidth=500
df_ta.query('tweet_id == 871102520638267392').expanded_urls ##the /photo/1 does not work. This url returns the tweet itself.

110    https://twitter.com/animalcog/status/871075758080503809
Name: expanded_urls, dtype: object
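
For Issue 7, the URL and the rating could be stripped from the text roughly as follows. The regular expressions are my assumptions and would need checking against the full dataset. Hashtags are left out of this sketch.

```python
import pandas as pd

text = pd.Series([
    "This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU",
])

clean = (text
         .str.replace(r'https?://\S+', '', regex=True)       # drop URLs
         .str.replace(r'\d+(?:\.\d+)?/\d+', '', regex=True)  # drop ratings such as 13/10
         .str.replace(r'\s+', ' ', regex=True)               # collapse leftover whitespace
         .str.strip())
```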

**Issue 8:** Check the dog names with value_counts. Wrong dog names include: a, the, one, quite, mad, not, 0, life, space, this, by, officially, old, his, such, unacceptable, my, all, incredibly. See if I can restore these names from the text or read them in again from the JSON.

In [182]:
df_ta.name.value_counts()

None              745
a                  55
Charlie            12
Cooper             11
Lucy               11
Oliver             11
Penny              10
Tucker             10
Lola               10
Winston             9
Bo                  9
the                 8
Sadie               8
Bailey              7
Daisy               7
Toby                7
Buddy               7
an                  7
Koda                6
Leo                 6
Rusty               6
Bella               6
Jack                6
Scout               6
Stanley             6
Milo                6
Dave                6
Oscar               6
Jax                 6
Alfie               5
very                5
Sunny               5
Phil                5
Oakley              5
Sammy               5
Louis               5
Larry               5
Finn                5
Bentley             5
George              5
Gus                 5
Chester             5
Gerald              4
Clark               4
Shadow              4
Brody     
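
Most of the wrong names listed under Issue 8 are lowercase words, so a first pass could flag every lowercase entry together with the "None" placeholder. This is a heuristic sketch on toy values; entries such as "0" are not lowercase and would need a separate check.

```python
import pandas as pd

names = pd.Series(['Charlie', 'a', 'None', 'the', 'Bo'])

# Flag lowercase words and the "None" placeholder as invalid names
invalid = names.str.islower() | (names == 'None')
# Replace flagged entries with a missing value
names_clean = names.mask(invalid)
```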

**Issue 9:** df_ta.text sometimes mentions two dogs, but only one name is stored while both dog categories are set. In other cases the text refers to two categories of dogs, e.g. "Pupper butt 1, Doggo 0", and both categories are flagged. 

In [184]:
df_ta.query('doggo == "doggo" and pupper =="pupper"')

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
460,817777686764523521,,,2017-01-07 16:59:28 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","This is Dido. She's playing the lead role in ""Pupper Stops to Catch Snow Before Resuming Shadow Box with Dried Apple."" 13/10 (IG: didodoggo) https://t.co/m7isZrOBX7",,,,https://twitter.com/dog_rates/status/817777686764523521/video/1,13,10,Dido,doggo,,pupper,
531,808106460588765185,,,2016-12-12 00:29:28 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Here we have Burke (pupper) and Dexter (doggo). Pupper wants to be exactly like doggo. Both 12/10 would pet at same time https://t.co/ANBpEYHaho,,,,https://twitter.com/dog_rates/status/808106460588765185/photo/1,12,10,,doggo,,pupper,
565,802265048156610565,7.331095e+17,4196984000.0,2016-11-25 21:37:47 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","Like doggo, like pupper version 2. Both 11/10 https://t.co/9IxWAXFqze",,,,https://twitter.com/dog_rates/status/802265048156610565/photo/1,11,10,,doggo,,pupper,
575,801115127852503040,,,2016-11-22 17:28:25 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Bones. He's being haunted by another doggo of roughly the same size. 12/10 deep breaths pupper everything's fine https://t.co/55Dqe0SJNj,,,,"https://twitter.com/dog_rates/status/801115127852503040/photo/1,https://twitter.com/dog_rates/status/801115127852503040/photo/1",12,10,Bones,doggo,,pupper,
705,785639753186217984,,,2016-10-11 00:34:48 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",This is Pinot. He's a sophisticated doggo. You can tell by the hat. Also pointier than your average pupper. Still 10/10 would pet cautiously https://t.co/f2wmLZTPHd,,,,"https://twitter.com/dog_rates/status/785639753186217984/photo/1,https://twitter.com/dog_rates/status/785639753186217984/photo/1",10,10,Pinot,doggo,,pupper,
733,781308096455073793,,,2016-09-29 01:42:20 +0000,"<a href=""http://vine.co"" rel=""nofollow"">Vine - Make a Scene</a>","Pupper butt 1, Doggo 0. Both 12/10 https://t.co/WQvcPEpH2u",,,,https://vine.co/v/5rgu2Law2ut,12,10,,doggo,,pupper,
778,775898661951791106,,,2016-09-14 03:27:11 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","RT @dog_rates: Like father (doggo), like son (pupper). Both 12/10 https://t.co/pG2inLaOda",7.331095e+17,4196984000.0,2016-05-19 01:38:16 +0000,"https://twitter.com/dog_rates/status/733109485275860992/photo/1,https://twitter.com/dog_rates/status/733109485275860992/photo/1",12,10,,doggo,,pupper,
822,770093767776997377,,,2016-08-29 03:00:36 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",RT @dog_rates: This is just downright precious af. 12/10 for both pupper and doggo https://t.co/o5J479bZUC,7.410673e+17,4196984000.0,2016-06-10 00:39:48 +0000,"https://twitter.com/dog_rates/status/741067306818797568/photo/1,https://twitter.com/dog_rates/status/741067306818797568/photo/1",12,10,just,doggo,,pupper,
889,759793422261743616,,,2016-07-31 16:50:42 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>","Meet Maggie &amp; Lila. Maggie is the doggo, Lila is the pupper. They are sisters. Both 12/10 would pet at the same time https://t.co/MYwR4DQKll",,,,"https://twitter.com/dog_rates/status/759793422261743616/photo/1,https://twitter.com/dog_rates/status/759793422261743616/photo/1",12,10,Maggie,doggo,,pupper,
956,751583847268179968,,,2016-07-09 01:08:47 +0000,"<a href=""http://twitter.com/download/iphone"" rel=""nofollow"">Twitter for iPhone</a>",Please stop sending it pictures that don't even have a doggo or pupper in them. Churlish af. 5/10 neat couch tho https://t.co/u2c9c7qSg8,,,,https://twitter.com/dog_rates/status/751583847268179968/photo/1,5,10,,doggo,,pupper,


### Appendix:

Secure authorisation key outside of notebook:

http://veekaybee.github.io/2020/02/25/secrets/

https://pypi.org/project/python-dotenv/

http://docs.tweepy.org/en/latest/getting_started.html#api
