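## Assessing the project data
After gathering each piece of data described above, assess it visually and programmatically for quality and tidiness issues. Detect and document at least 8 quality issues and 2 tidiness issues in your wrangle_act.ipynb. Key points:

- We only want original ratings (no retweets) that have images.
- Fully assessing and cleaning the entire dataset would take a great deal of effort, so only a subset of its issues (at least 8 quality issues and 2 tidiness issues) needs to be assessed and cleaned.
- Cleaning includes merging the individual pieces of data, following the rules of tidy data.
- Rating numerators that exceed their denominators do not need to be cleaned. This unusual rating system is a major reason for WeRateDogs' popularity.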
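
As a rough illustration of the first key point, the sketch below keeps only original tweets that also have an image prediction. The two small frames are made-up stand-ins for `twitter-archive-enhanced.csv` and `image-predictions.tsv`; only the column names `tweet_id`, `retweeted_status_id`, and `jpg_url` come from the real files, and treating a missing `retweeted_status_id` as "not a retweet" is an assumption about how the archive encodes retweets.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the archive and the image predictions (values are invented).
archive = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'retweeted_status_id': [np.nan, 6.67e17, np.nan, np.nan],
    'text': ['This is Phineas. 13/10', 'RT @dog_rates: ...',
             'This is Tilly. 13/10', 'No picture here 10/10'],
})
images = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'jpg_url': ['a.jpg', 'b.jpg', 'c.jpg'],
})

# Assumption: a retweet is any row with a non-null retweeted_status_id.
originals = archive[archive['retweeted_status_id'].isnull()]

# Keep only the originals that also appear in the image predictions.
originals_with_images = originals[originals['tweet_id'].isin(images['tweet_id'])]
print(originals_with_images['tweet_id'].tolist())
```

Tweet 2 is dropped as a retweet and tweet 4 is dropped because it has no image row.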


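## Cleaning the project data
Clean every issue you documented while assessing. Do this cleaning in wrangle_act.ipynb. The result should be a high-quality, tidy master pandas DataFrame (or several DataFrames, if appropriate). Issues that bear on the project motivation must be assessed.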


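## Storing, analyzing, and visualizing the project data
Store the clean data in a CSV file named twitter_archive_master.csv. If tidiness requires several tables and therefore additional files, give those files sensible names. You may also store the cleaned data in a SQLite database (and submit it as well if you do).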
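
The storage step could look roughly like the sketch below. It writes to a temporary directory so it can be run anywhere; the tiny `master` frame, the database file name, and the table name `master` are placeholders chosen for illustration, not names required by the project.

```python
import os
import sqlite3
import tempfile

import pandas as pd

# Placeholder for the cleaned master DataFrame.
master = pd.DataFrame({
    'tweet_id': [892420643555336193, 892177421306343426],
    'rating_numerator': [13, 13],
    'rating_denominator': [10, 10],
})

out_dir = tempfile.mkdtemp()  # in the notebook this would be the project folder
csv_path = os.path.join(out_dir, 'twitter_archive_master.csv')
db_path = os.path.join(out_dir, 'twitter_archive_master.db')  # hypothetical name

# index=False keeps the pandas row index out of the stored file.
master.to_csv(csv_path, index=False)

# Optional: the same table in a SQLite database.
conn = sqlite3.connect(db_path)
master.to_sql('master', conn, index=False, if_exists='replace')
round_trip = pd.read_sql('SELECT * FROM master', conn)
conn.close()
```

Reading the table back, as `round_trip` does here, is a cheap check that the 18-digit tweet IDs survived storage unchanged.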

Analyze and visualize the cleaned data in wrangle_act.ipynb. At least 3 insights and 1 visualization must be produced.

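## Project reports
1. Create a written report of 300-600 words named wrangle_report.pdf that briefly describes your wrangling process. It is meant as an internal document.

2. Create a written report of at least 250 words named act_report.pdf that communicates your insights and shows the visualizations produced from the cleaned data. It is meant as an external document, such as a blog post or magazine article.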

In [1]:
import numpy as np
import pandas as pd

In [3]:
## download the image-predictions.tsv
url = 'https://raw.githubusercontent.com/udacity/new-dand-advanced-china/master/%E6%95%B0%E6%8D%AE%E6%B8%85%E6%B4%97/WeRateDogs%E9%A1%B9%E7%9B%AE/image-predictions.tsv'

In [4]:
import requests
import os

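file = url.split('/')[-1]
path = './WeRateDogs_data/'

if os.path.isfile(path + file):
    print("File {} exist!".format(file))
else:
    # Only hit the network when the file is missing.
    # http://docs.python-requests.org/en/master/user/quickstart/#make-a-request
    os.makedirs(path, exist_ok=True)
    r = requests.get(url, stream=True)
    r.raise_for_status()
    # 'wb' instead of 'ab', so a partial earlier download is never appended to
    with open(path + file, 'wb') as fd:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                fd.write(chunk)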

File image-predictions.tsv exist!


In [5]:
!ls ./WeRateDogs_data/

image-predictions.tsv        twitter-archive-enhanced.csv
tweet_json.txt


## file 1

In [6]:
df_tweet = pd.read_json('./WeRateDogs_data/tweet_json.txt', lines=True)
df_tweet.head()

Unnamed: 0,contributors,coordinates,created_at,display_text_range,entities,extended_entities,favorite_count,favorited,full_text,geo,...,possibly_sensitive_appealable,quoted_status,quoted_status_id,quoted_status_id_str,retweet_count,retweeted,retweeted_status,source,truncated,user
0,,,2017-08-01 16:23:56,"[0, 85]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892420639486877696, 'id_str'...",39492,False,This is Phineas. He's a mystical boy. Only eve...,,...,0.0,,,,8842,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
1,,,2017-08-01 00:17:27,"[0, 138]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892177413194625024, 'id_str'...",33786,False,This is Tilly. She's just checking pup on you....,,...,0.0,,,,6480,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
2,,,2017-07-31 00:18:03,"[0, 121]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891815175371796480, 'id_str'...",25445,False,This is Archie. He is a rare Norwegian Pouncin...,,...,0.0,,,,4301,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
3,,,2017-07-30 15:58:51,"[0, 79]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891689552724799489, 'id_str'...",42863,False,This is Darla. She commenced a snooze mid meal...,,...,0.0,,,,8925,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
4,,,2017-07-29 16:00:24,"[0, 138]","{'hashtags': [{'text': 'BarkWeek', 'indices': ...","{'media': [{'id': 891327551943041024, 'id_str'...",41016,False,This is Franklin. He would like you to stop ca...,,...,0.0,,,,9721,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."


In [7]:
df_tweet.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2352 entries, 0 to 2351
Data columns (total 31 columns):
contributors                     0 non-null float64
coordinates                      0 non-null float64
created_at                       2352 non-null datetime64[ns]
display_text_range               2352 non-null object
entities                         2352 non-null object
extended_entities                2073 non-null object
favorite_count                   2352 non-null int64
favorited                        2352 non-null bool
full_text                        2352 non-null object
geo                              0 non-null float64
id                               2352 non-null int64
id_str                           2352 non-null int64
in_reply_to_screen_name          78 non-null object
in_reply_to_status_id            78 non-null float64
in_reply_to_status_id_str        78 non-null float64
in_reply_to_user_id              78 non-null float64
in_reply_to_user_id_str          78 n

In [32]:
df_tweet['geo'].sample(3) # all NaN, not needed for this analysis

1448   NaN
485    NaN
1498   NaN
Name: geo, dtype: float64

In [14]:
df_tweet['user'].sample(3)

1125    {'id': 4196983835, 'id_str': '4196983835', 'na...
1297    {'id': 4196983835, 'id_str': '4196983835', 'na...
119     {'id': 4196983835, 'id_str': '4196983835', 'na...
Name: user, dtype: object

In [23]:
df_tweet['extended_entities'].head()

0    {'media': [{'id': 892420639486877696, 'id_str'...
1    {'media': [{'id': 892177413194625024, 'id_str'...
2    {'media': [{'id': 891815175371796480, 'id_str'...
3    {'media': [{'id': 891689552724799489, 'id_str'...
4    {'media': [{'id': 891327551943041024, 'id_str'...
Name: extended_entities, dtype: object

In [44]:
l = df_tweet['extended_entities'].sample()

l.values.tolist()

[{'media': [{'display_url': 'pic.twitter.com/r6LZN1o1Gx',
    'expanded_url': 'https://twitter.com/dog_rates/status/674255168825880576/photo/1',
    'id': 674255163272642560,
    'id_str': '674255163272642560',
    'indices': [54, 77],
    'media_url': 'http://pbs.twimg.com/media/CVtvf6bWwAAd1rT.jpg',
    'media_url_https': 'https://pbs.twimg.com/media/CVtvf6bWwAAd1rT.jpg',
    'sizes': {'large': {'h': 1024, 'resize': 'fit', 'w': 576},
     'medium': {'h': 1024, 'resize': 'fit', 'w': 576},
     'small': {'h': 680, 'resize': 'fit', 'w': 383},
     'thumb': {'h': 150, 'resize': 'crop', 'w': 150}},
    'type': 'photo',
    'url': 'https://t.co/r6LZN1o1Gx'}]}]
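
Since each `extended_entities` value is a nested dict (or NaN when the tweet has no media), a small helper can pull out the image URL. This is only a sketch: the helper name `first_media_url` is made up here, and the two-row Series stands in for `df_tweet['extended_entities']`, reusing the URL from the output above.

```python
import numpy as np
import pandas as pd


def first_media_url(entities):
    """Return the first media_url_https in an extended_entities dict, or None."""
    if not isinstance(entities, dict):
        return None  # NaN rows: tweets without media
    media = entities.get('media') or []
    return media[0].get('media_url_https') if media else None


# Stand-in for df_tweet['extended_entities'].
sample = pd.Series([
    {'media': [{'media_url_https': 'https://pbs.twimg.com/media/CVtvf6bWwAAd1rT.jpg',
                'type': 'photo'}]},
    np.nan,
])
urls = sample.apply(first_media_url)
```

A null result marks a tweet without an image, which is one way to find rows that fail the "ratings with images" requirement.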

In [45]:
l2 = df_tweet['full_text'].sample()

l2.values.tolist()

["This is Brudge. He's a Doberdog. Going to be h*ckin massive one day. 11/10 would pat on head approvingly https://t.co/cTlHjEUNK8"]

In [22]:
df_tweet['id'].head()

0    892420643555336193
1    892177421306343426
2    891815181378084864
3    891689557279858688
4    891327558926688256
Name: id, dtype: int64

In [10]:
df_tweet['retweet_count'].sample(3)

2206      45
2208    1170
2285      87
Name: retweet_count, dtype: int64

In [15]:
df_tweet['favorite_count'].sample(3)

322     14594
1786     1156
1755     4086
Name: favorite_count, dtype: int64

In [16]:
df_tweet['retweet_count'].sample(3)

523    3516
310      82
542    2126
Name: retweet_count, dtype: int64

In [18]:
df_tweet['full_text'].sample()

112    @ComplicitOwl @ShopWeRateDogs &gt;10/10 is res...
Name: full_text, dtype: object

In [47]:
var_col = ['id', 'retweet_count', 'favorite_count', 'extended_entities', 'full_text']
df_tweet_clean = df_tweet[var_col].copy()  # copy, so later edits neither touch df_tweet nor raise SettingWithCopyWarning
df_tweet_clean.head()

Unnamed: 0,id,retweet_count,favorite_count,extended_entities,full_text
0,892420643555336193,8842,39492,"{'media': [{'id': 892420639486877696, 'id_str'...",This is Phineas. He's a mystical boy. Only eve...
1,892177421306343426,6480,33786,"{'media': [{'id': 892177413194625024, 'id_str'...",This is Tilly. She's just checking pup on you....
2,891815181378084864,4301,25445,"{'media': [{'id': 891815175371796480, 'id_str'...",This is Archie. He is a rare Norwegian Pouncin...
3,891689557279858688,8925,42863,"{'media': [{'id': 891689552724799489, 'id_str'...",This is Darla. She commenced a snooze mid meal...
4,891327558926688256,9721,41016,"{'media': [{'id': 891327551943041024, 'id_str'...",This is Franklin. He would like you to stop ca...


## file 2

In [52]:
df_image = pd.read_csv('./WeRateDogs_data/image-predictions.tsv', delimiter='\t')
df_image.sample(5)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
511,676191832485810177,https://pbs.twimg.com/media/CWJQ4UmWoAIJ29t.jpg,2,Chihuahua,0.376741,True,Italian_greyhound,0.173114,True,muzzle,0.071485,False
1385,766008592277377025,https://pbs.twimg.com/media/CqFouXOXYAAYpzG.jpg,1,Welsh_springer_spaniel,0.728153,True,basset,0.103842,True,Brittany_spaniel,0.062414,True
168,668988183816871936,https://pbs.twimg.com/media/CUi5M7TXIAAY0gj.jpg,1,Arabian_camel,0.999614,False,bison,0.000228,False,llama,6.7e-05,False
1302,752917284578922496,https://pbs.twimg.com/media/CnLmRiYXEAAO_8f.jpg,1,German_shepherd,0.609283,True,malinois,0.35246,True,kelpie,0.016105,True
155,668815180734689280,https://pbs.twimg.com/media/CUgb21RXIAAlff7.jpg,1,redbone,0.461172,True,Italian_greyhound,0.270733,True,miniature_pinscher,0.109752,True


In [None]:
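# p1 is the algorithm's #1 prediction for the image in the tweet → basset
# p1_conf is how confident the algorithm is in its #1 prediction → 0.555712
# p1_dog is whether or not the #1 prediction is a breed of dog → True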
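
One way these columns might be used is to keep, for each tweet, the highest-ranked prediction that is actually a dog breed. The sketch below assumes p1, p2, p3 are ordered from most to least confident, as they are in the sampled rows. The helper `best_dog_breed` and the new `breed` column are illustrative names, and the two rows are copied from the sample output above.

```python
import pandas as pd

# Two rows taken from the df_image sample shown earlier.
preds = pd.DataFrame({
    'tweet_id': [676191832485810177, 668988183816871936],
    'p1': ['Chihuahua', 'Arabian_camel'],
    'p1_conf': [0.376741, 0.999614],
    'p1_dog': [True, False],
    'p2': ['Italian_greyhound', 'bison'],
    'p2_conf': [0.173114, 0.000228],
    'p2_dog': [True, False],
    'p3': ['muzzle', 'llama'],
    'p3_conf': [0.071485, 6.7e-05],
    'p3_dog': [False, False],
})


def best_dog_breed(row):
    """Return the first of p1..p3 flagged as a dog breed, or None if none is."""
    for n in (1, 2, 3):
        if row['p{}_dog'.format(n)]:
            return row['p{}'.format(n)]
    return None


preds['breed'] = preds.apply(best_dog_breed, axis=1)
```

The camel picture ends up with a missing `breed`, so such rows can be filtered out later if only dog tweets are wanted.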

In [53]:
df_image['img_num'].value_counts()

1    1780
2     198
3      66
4      31
Name: img_num, dtype: int64

## file 3

In [54]:
df = pd.read_csv('./WeRateDogs_data/twitter-archive-enhanced.csv')
df.sample(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1382,700864154249383937,,,2016-02-20 02:06:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...","""Pupper is a present to world. Here is a bow f...",,,,https://twitter.com/dog_rates/status/700864154...,12,10,a,,,pupper,
715,783839966405230592,,,2016-10-06 01:23:05 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Riley. His owner put a donut pillow ar...,,,,https://twitter.com/dog_rates/status/783839966...,13,10,Riley,,,,
1043,743835915802583040,,,2016-06-17 16:01:16 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: Extremely intelligent dog here....,6.671383e+17,4196984000.0,2015-11-19 00:32:12 +0000,https://twitter.com/dog_rates/status/667138269...,10,10,,,,,
1211,715360349751484417,,,2016-03-31 02:09:32 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Bertson. He just wants to say hi. 11/1...,,,,https://twitter.com/dog_rates/status/715360349...,11,10,Bertson,,,,
2247,667873844930215936,,,2015-11-21 01:15:07 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Neat dog. Lots of spikes. Always in push-up po...,,,,https://twitter.com/dog_rates/status/667873844...,10,10,,,,,


In [55]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

In [56]:
df.columns

Index(['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
       'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator',
       'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo'],
      dtype='object')

In [57]:
l3 = df['text'].sample()

l3.values.tolist()

['This is Lilli Bee &amp; Honey Bear. Unfortunately, they were both born with no eyes. So heckin sad. Both 11/10 https://t.co/4UrfOZhztW']
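
The `&amp;` in this tweet is an HTML entity left in the text, which can be logged as a quality issue. A minimal sketch of one possible fix uses the standard library's `html.unescape`; the one-row Series stands in for `df['text']`.

```python
import html

import pandas as pd

# Stand-in for df['text'], using the tweet printed above.
texts = pd.Series([
    'This is Lilli Bee &amp; Honey Bear. Unfortunately, they were both born '
    'with no eyes. So heckin sad. Both 11/10 https://t.co/4UrfOZhztW',
])

# Replace HTML entities such as &amp; with the characters they stand for.
clean_texts = texts.apply(html.unescape)
```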

In [None]:
df['tweet_id'].

In [11]:
# select (not remove) the rows with rating_numerator < 10 for inspection
df2 = df[df.rating_numerator < 10]
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 440 entries, 45 to 2355
Data columns (total 17 columns):
tweet_id                      440 non-null int64
in_reply_to_status_id         12 non-null float64
in_reply_to_user_id           12 non-null float64
timestamp                     440 non-null object
source                        440 non-null object
text                          440 non-null object
retweeted_status_id           9 non-null float64
retweeted_status_user_id      9 non-null float64
retweeted_status_timestamp    9 non-null object
expanded_urls                 430 non-null object
rating_numerator              440 non-null int64
rating_denominator            440 non-null int64
name                          440 non-null object
doggo                         440 non-null object
floofer                       440 non-null object
pupper                        440 non-null object
puppo                         440 non-null object
dtypes: float64(4), int64(3), object(10)
memory us

In [22]:
df3 = df[df.rating_numerator < df.rating_denominator] 
df3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 442 entries, 45 to 2355
Data columns (total 17 columns):
tweet_id                      442 non-null int64
in_reply_to_status_id         13 non-null float64
in_reply_to_user_id           13 non-null float64
timestamp                     442 non-null object
source                        442 non-null object
text                          442 non-null object
retweeted_status_id           9 non-null float64
retweeted_status_user_id      9 non-null float64
retweeted_status_timestamp    9 non-null object
expanded_urls                 431 non-null object
rating_numerator              442 non-null int64
rating_denominator            442 non-null int64
name                          442 non-null object
doggo                         442 non-null object
floofer                       442 non-null object
pupper                        442 non-null object
puppo                         442 non-null object
dtypes: float64(4), int64(3), object(10)
memory us

In [16]:
df4 = df3[df3.rating_denominator != 10] 
df4

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
342,832088576586297345,8.320875e+17,30582080.0,2017-02-16 04:45:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@docmisterio account started on 11/15/15,,,,,11,15,,,,,
784,775096608509886464,,,2016-09-11 22:20:06 +0000,"<a href=""http://twitter.com/download/iphone"" r...","RT @dog_rates: After so many requests, this is...",7.403732e+17,4196984000.0,2016-06-08 02:41:38 +0000,https://twitter.com/dog_rates/status/740373189...,9,11,,,,,
1068,740373189193256964,,,2016-06-08 02:41:38 +0000,"<a href=""http://twitter.com/download/iphone"" r...","After so many requests, this is Bretagne. She ...",,,,https://twitter.com/dog_rates/status/740373189...,9,11,,,,,
1165,722974582966214656,,,2016-04-21 02:25:47 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Happy 4/20 from the squad! 13/10 for all https...,,,,https://twitter.com/dog_rates/status/722974582...,4,20,,,,,
1274,709198395643068416,,,2016-03-14 02:04:08 +0000,"<a href=""http://twitter.com/download/iphone"" r...","From left to right:\r\nCletus, Jerome, Alejand...",,,,https://twitter.com/dog_rates/status/709198395...,45,50,,,,,
1598,686035780142297088,6.86034e+17,4196984000.0,2016-01-10 04:04:10 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Yes I do realize a rating of 4/20 would've bee...,,,,,4,20,,,,,
1662,682962037429899265,,,2016-01-01 16:30:13 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darrel. He just robbed a 7/11 and is i...,,,,https://twitter.com/dog_rates/status/682962037...,7,11,Darrel,,,,
2335,666287406224695296,,,2015-11-16 16:11:11 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is an Albanian 3 1/2 legged Episcopalian...,,,,https://twitter.com/dog_rates/status/666287406...,1,2,an,,,,


In [18]:
df4.rating_denominator.value_counts()

11    3
20    2
15    1
2     1
50    1
Name: rating_denominator, dtype: int64

In [19]:
df3.rating_numerator.value_counts()

9     158
8     102
7      55
5      37
6      32
3      19
4      17
2       9
1       9
0       2
45      1
11      1
Name: rating_numerator, dtype: int64

In [26]:
df3[df3['rating_numerator'] == 11].text

342    @docmisterio account started on 11/15/15
Name: text, dtype: object
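
The rows above suggest that some ratings were parsed from the wrong fraction: `11/15` comes from a date, and `4/20` appears in a tweet that also contains `13/10`. A possible re-extraction heuristic is sketched below. The function name and the rule "prefer a denominator of exactly 10, otherwise take the first fraction, and ignore date-like `a/b/c` patterns" are assumptions made for illustration, not the project's prescribed method.

```python
import re

# digits/digits that is not part of a longer slash-separated run such as 11/15/15
RATING = re.compile(r'(?<![\d/])(\d+)/(\d+)(?![\d/])')


def extract_rating(text):
    """Return (numerator, denominator) or None, preferring a denominator of 10."""
    matches = [(int(n), int(d)) for n, d in RATING.findall(text)]
    if not matches:
        return None
    for num, den in matches:
        if den == 10:
            return num, den
    # No x/10 found: fall back to the first fraction (e.g. group ratings like 45/50).
    return matches[0]


squad = extract_rating('Happy 4/20 from the squad! 13/10 for all')
date_only = extract_rating('@docmisterio account started on 11/15/15')
```

The heuristic is far from complete. A tweet such as the "7/11" one above would still need a look at its full text, so manual review of the few rows whose denominator is not 10 remains worthwhile.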

In [64]:
s = df3[df3['rating_numerator'] == 45]['text'].to_string()
s

'1274    From left to right:\\r\\nCletus, Jerome, Alejand...'

In [63]:
len(s.split())

7

In [66]:
df3

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
45,883482846933004288,,,2017-07-08 00:28:19 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Bella. She hopes her smile made you sm...,,,,https://twitter.com/dog_rates/status/883482846...,5,10,Bella,,,,
229,848212111729840128,,,2017-04-01 16:35:01 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Jerry. He's doing a distinguished tong...,,,,https://twitter.com/dog_rates/status/848212111...,6,10,Jerry,,,,
315,835152434251116546,,,2017-02-24 15:40:31 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you're so blinded by your systematic plag...,,,,https://twitter.com/dog_rates/status/835152434...,0,10,,,,,
342,832088576586297345,8.320875e+17,3.058208e+07,2017-02-16 04:45:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@docmisterio account started on 11/15/15,,,,,11,15,,,,,
387,826598799820865537,8.265984e+17,4.196984e+09,2017-02-01 01:11:25 +0000,"<a href=""http://twitter.com/download/iphone"" r...","I was going to do 007/10, but the joke wasn't ...",,,,,7,10,,,,,
462,817502432452313088,,,2017-01-06 22:45:43 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: Meet Herschel. He's slightly bi...,6.924173e+17,4.196984e+09,2016-01-27 18:42:06 +0000,https://twitter.com/dog_rates/status/692417313...,7,10,Herschel,,,pupper,
485,814578408554463233,,,2016-12-29 21:06:41 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: Meet Beau &amp; Wilbur. Wilbur ...,6.981954e+17,4.196984e+09,2016-02-12 17:22:12 +0000,https://twitter.com/dog_rates/status/698195409...,9,10,Beau,,,,
599,798682547630837760,,,2016-11-16 00:22:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: Here we see a rare pouched pupp...,6.769365e+17,4.196984e+09,2015-12-16 01:27:03 +0000,https://twitter.com/dog_rates/status/676936541...,8,10,,,,pupper,
605,798576900688019456,,,2016-11-15 17:22:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: Not familiar with this breed. N...,6.661041e+17,4.196984e+09,2015-11-16 04:02:55 +0000,https://twitter.com/dog_rates/status/666104133...,1,10,,,,,
730,781661882474196992,,,2016-09-30 01:08:10 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Who keeps sending in pictures without dogs in ...,,,,https://twitter.com/dog_rates/status/781661882...,5,10,,,,,


In [76]:
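df3.text[1]  # label-based lookup: df3 keeps df's original index (45, 229, ...), so label 1 does not exist; df3.text.iloc[1] would select by position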

KeyError: 1

In [74]:
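df3['type'] = None  # df3 is a filtered slice of df, so this assignment triggers SettingWithCopyWarning

dog_lists = ['pupper', 'puppo', 'doggo', 'floofer'] #['blep', 'snoot']

for i in range(len(df3)):
    text = df3.text[i]  # label-based lookup; df3's index starts at 45, so label 0 raises KeyError
    for dog_status in dog_lists:
        if dog_status in text:
            df3.type[i] = dog_status  # chained assignment; df3.loc[i, 'type'] = ... is the reliable form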

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


KeyError: 0
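
Both failures come from indexing: the filtered frame keeps its original row labels, and it is a slice of `df` rather than an independent frame. A minimal sketch of a version that avoids both problems follows, using a small invented frame in place of `df`. The fix is to take an explicit `.copy()`, iterate over the actual index labels, and assign through `.loc`.

```python
import pandas as pd

# Invented stand-in for df; row 0 is filtered out, so labels start at 1.
df = pd.DataFrame({
    'text': ['Here is a doggo 12/10', 'Meet Herschel. 7/10 pupper',
             'This is Bella. 5/10', 'Rare pouched puppo 8/10'],
    'rating_numerator': [12, 7, 5, 8],
    'rating_denominator': [10, 10, 10, 10],
})

# .copy() makes the filtered frame independent, so assignments do not warn.
low = df[df.rating_numerator < df.rating_denominator].copy()
low['type'] = None

dog_lists = ['pupper', 'puppo', 'doggo', 'floofer']

# Iterate over the real index labels instead of range(len(...)).
for idx in low.index:
    for dog_status in dog_lists:
        if dog_status in low.loc[idx, 'text']:
            low.loc[idx, 'type'] = dog_status
```

As in the original loop, a tweet mentioning several stages keeps the last match in `dog_lists` order.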

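## Assessing data is the second step of data wrangling
It looks for two kinds of problems:
- Data quality issues (problems with content)
- Lack of tidiness (problems with structure)

This project takes the text from the CSV, then extracts the rating, dog name, and dog stage from that text.
Next it pulls at least the retweet and favorite counts from tweet_json.txt.
Finally the extracted data is combined with the image predictions (inner join) for assessment and cleaning. In practice, each table is cleaned first and merged afterwards.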
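
The merge step in this plan might look like the sketch below. The three toy frames stand in for `df`, `df_tweet_clean`, and `df_image`, with invented values. The one real detail carried over is that the JSON table calls its key `id` while the other two use `tweet_id`, so the key is renamed before joining.

```python
import pandas as pd

# Toy stand-ins for the three cleaned tables.
archive = pd.DataFrame({'tweet_id': [1, 2, 3], 'rating_numerator': [13, 12, 10]})
tweets = pd.DataFrame({'id': [1, 2, 4],
                       'retweet_count': [8842, 6480, 45],
                       'favorite_count': [39492, 33786, 100]})
images = pd.DataFrame({'tweet_id': [1, 3], 'p1': ['Chihuahua', 'redbone']})

# Align the key name before merging.
tweets = tweets.rename(columns={'id': 'tweet_id'})

# Inner joins keep only tweets present in all three tables.
master = (archive
          .merge(tweets, on='tweet_id', how='inner')
          .merge(images, on='tweet_id', how='inner'))
```

With inner joins, a tweet missing from any one table silently disappears, so comparing row counts before and after the merge is a sensible check.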

In [8]:
df_clean = df.copy()

In [19]:
df_clean['type'] = None

#dog_lists = ['pupper', 'puppo', 'doggo', 'floofer'] # typer_count =  399
dog_lists = ['pupper', 'puppo', 'doggo', 'floofer', 'blep', 'snoot'] # typer_count =  399

for i in range(len(df_clean)):
    text = df_clean.loc[i, 'text']
    for dog_status in dog_lists:
        if dog_status in text:
            # .loc assigns on df_clean itself; df_clean.type[i] = ... is chained
            # assignment and raises SettingWithCopyWarning
            df_clean.loc[i, 'type'] = dog_status


In [12]:
df_clean

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo,type
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,,
5,891087950875897856,,,2017-07-29 00:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a majestic great white breaching ...,,,,https://twitter.com/dog_rates/status/891087950...,13,10,,,,,,
6,890971913173991426,,,2017-07-28 16:27:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Jax. He enjoys ice cream so much he gets ...,,,,"https://gofundme.com/ydvmve-surgery-for-jax,ht...",13,10,Jax,,,,,
7,890729181411237888,,,2017-07-28 00:22:40 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you watch your owner call another dog a g...,,,,https://twitter.com/dog_rates/status/890729181...,13,10,,,,,,
8,890609185150312448,,,2017-07-27 16:25:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Zoey. She doesn't want to be one of th...,,,,https://twitter.com/dog_rates/status/890609185...,13,10,Zoey,,,,,
9,890240255349198849,,,2017-07-26 15:59:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Cassie. She is a college pup. Studying...,,,,https://twitter.com/dog_rates/status/890240255...,14,10,Cassie,doggo,,,,doggo


In [20]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 18 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
type                          40

In [26]:
df_clean.columns

Index(['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
       'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator',
       'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo',
       'type'],
      dtype='object')