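## Assessing the project data
After gathering each piece of data described above, assess it visually and programmatically for quality and tidiness issues. Detect and document at least 8 quality issues and 2 tidiness issues in your wrangle_act.ipynb. Key points:

- We only want original ratings that have images (no retweets).
- Fully assessing and cleaning the entire dataset would take a great deal of effort, so only a subset of the issues (at least 8 quality issues and 2 tidiness issues) needs to be assessed and cleaned.
- Following the rules of tidy data, cleaning includes merging the individual pieces of data.
- Rating numerators that exceed the denominator do not need to be cleaned. This unusual rating system is a major reason for the popularity of WeRateDogs.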
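
The first key point can be sketched in pandas as follows. This is only a minimal illustration on made-up rows. It assumes that a non-null `retweeted_status_id` marks a retweet and that the image-prediction table lists the tweets that have images; the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the archive and image-prediction tables (values are made up).
archive = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'retweeted_status_id': [np.nan, 9.0, np.nan, np.nan],
})
images = pd.DataFrame({'tweet_id': [1, 2, 4]})

# Keep original tweets only (no retweet id) ...
originals = archive[archive['retweeted_status_id'].isnull()]
# ... that also appear in the image-prediction table.
originals_with_images = originals[originals['tweet_id'].isin(images['tweet_id'])]

kept_ids = originals_with_images['tweet_id'].tolist()
print(kept_ids)
```

Tweet 2 is dropped as a retweet and tweet 3 is dropped because it has no image.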


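## Cleaning the project data
Clean each of the issues you documented while assessing. Do this cleaning in wrangle_act.ipynb. The result should be a high-quality, tidy master pandas DataFrame (or several DataFrames, if appropriate). Issues that are relevant to the project motivation must be assessed.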


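## Storing, analyzing, and visualizing the project data
Store the clean data in a CSV file named twitter_archive_master.csv. If tidiness requires more than one table and additional files exist, give those files sensible names. You may also store the cleaned data in a SQLite database (which can be submitted as well if needed).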
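
A minimal sketch of both storage options, under stated assumptions: the `master` frame, the `.db` file name, and the `master` table name are hypothetical, and a temporary folder is used so that nothing is written next to the notebook.

```python
import os
import sqlite3
import tempfile

import pandas as pd

# Hypothetical cleaned table; the real one would come out of the cleaning step.
master = pd.DataFrame({
    'tweet_id': ['892420643555336193', '892177421306343426'],
    'rating_numerator': [13, 13],
    'rating_denominator': [10, 10],
})

out_dir = tempfile.mkdtemp()  # temporary folder so the sketch leaves the project untouched
csv_path = os.path.join(out_dir, 'twitter_archive_master.csv')
db_path = os.path.join(out_dir, 'twitter_archive_master.db')  # assumed file name

# CSV: index=False keeps the row index out of the file.
master.to_csv(csv_path, index=False)

# SQLite: DataFrame.to_sql accepts a sqlite3 connection directly.
conn = sqlite3.connect(db_path)
master.to_sql('master', conn, index=False, if_exists='replace')  # assumed table name
restored = pd.read_sql('SELECT * FROM master', conn)
conn.close()

# Read tweet_id back as text so the 18-digit ids are kept exactly.
from_csv = pd.read_csv(csv_path, dtype={'tweet_id': str})
```

Storing `tweet_id` as text is a design choice: the ids are labels, and reading them back as strings avoids any numeric conversion.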

Analyze and visualize the cleaned data in wrangle_act.ipynb. At least 3 insights and 1 visualization must be produced.

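## Reporting on the project
1. Create a written report of 300-600 words named wrangle_report.pdf that briefly describes your cleaning process. This can serve as an internal document.

2. Create a written report of at least 250 words named act_report.pdf that communicates your insights and shows the visualizations produced from the cleaned data. This can serve as an external document, such as a blog post or magazine article.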

In [1]:
import numpy as np
import pandas as pd

In [3]:
## download the image-predictions.tsv
url = 'https://raw.githubusercontent.com/udacity/new-dand-advanced-china/master/%E6%95%B0%E6%8D%AE%E6%B8%85%E6%B4%97/WeRateDogs%E9%A1%B9%E7%9B%AE/image-predictions.tsv'

In [4]:
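import requests
import os

file = url.split('/')[-1]
path = './WeRateDogs_data/'
os.makedirs(path, exist_ok=True)  # create the data folder if it is missing

if os.path.isfile(path + file):
    print("File {} exist!".format(file))
else:
    # Request only when the file is missing, then stream it to disk in chunks
    # http://docs.python-requests.org/en/master/user/quickstart/#make-a-request
    r = requests.get(url, stream=True)
    r.raise_for_status()  # stop on HTTP errors instead of saving an error page
    with open(path + file, 'wb') as fd:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                fd.write(chunk)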

File image-predictions.tsv exist!


In [2]:
!ls ./WeRateDogs_data/

image-predictions.tsv
tweet_json.txt
twitter-archive-enhanced.csv


## File 1: tweet_json.txt

In [3]:
df_tweet = pd.read_json('./WeRateDogs_data/tweet_json.txt', lines=True)
df_tweet.head()

Unnamed: 0,contributors,coordinates,created_at,display_text_range,entities,extended_entities,favorite_count,favorited,full_text,geo,...,possibly_sensitive_appealable,quoted_status,quoted_status_id,quoted_status_id_str,retweet_count,retweeted,retweeted_status,source,truncated,user
0,,,2017-08-01 16:23:56,"[0, 85]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892420639486877696, 'id_str'...",39492,False,This is Phineas. He's a mystical boy. Only eve...,,...,0.0,,,,8842,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
1,,,2017-08-01 00:17:27,"[0, 138]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892177413194625024, 'id_str'...",33786,False,This is Tilly. She's just checking pup on you....,,...,0.0,,,,6480,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
2,,,2017-07-31 00:18:03,"[0, 121]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891815175371796480, 'id_str'...",25445,False,This is Archie. He is a rare Norwegian Pouncin...,,...,0.0,,,,4301,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
3,,,2017-07-30 15:58:51,"[0, 79]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 891689552724799489, 'id_str'...",42863,False,This is Darla. She commenced a snooze mid meal...,,...,0.0,,,,8925,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."
4,,,2017-07-29 16:00:24,"[0, 138]","{'hashtags': [{'text': 'BarkWeek', 'indices': ...","{'media': [{'id': 891327551943041024, 'id_str'...",41016,False,This is Franklin. He would like you to stop ca...,,...,0.0,,,,9721,False,,"<a href=""http://twitter.com/download/iphone"" r...",False,"{'id': 4196983835, 'id_str': '4196983835', 'na..."


In [4]:
df_tweet.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2352 entries, 0 to 2351
Data columns (total 31 columns):
contributors                     0 non-null float64
coordinates                      0 non-null float64
created_at                       2352 non-null datetime64[ns]
display_text_range               2352 non-null object
entities                         2352 non-null object
extended_entities                2073 non-null object
favorite_count                   2352 non-null int64
favorited                        2352 non-null bool
full_text                        2352 non-null object
geo                              0 non-null float64
id                               2352 non-null int64
id_str                           2352 non-null int64
in_reply_to_screen_name          78 non-null object
in_reply_to_status_id            78 non-null float64
in_reply_to_status_id_str        78 non-null float64
in_reply_to_user_id              78 non-null float64
in_reply_to_user_id_str          78 n

In [32]:
df_tweet['geo'].sample(3) # all NaN (0 non-null), so not useful here

1448   NaN
485    NaN
1498   NaN
Name: geo, dtype: float64

In [5]:
df_tweet['user'].sample(3)

2094    {'id': 4196983835, 'id_str': '4196983835', 'na...
1567    {'id': 4196983835, 'id_str': '4196983835', 'na...
2296    {'id': 4196983835, 'id_str': '4196983835', 'na...
Name: user, dtype: object

In [6]:
df_tweet['extended_entities'].head()

0    {'media': [{'id': 892420639486877696, 'id_str'...
1    {'media': [{'id': 892177413194625024, 'id_str'...
2    {'media': [{'id': 891815175371796480, 'id_str'...
3    {'media': [{'id': 891689552724799489, 'id_str'...
4    {'media': [{'id': 891327551943041024, 'id_str'...
Name: extended_entities, dtype: object

In [7]:
l = df_tweet['extended_entities'].sample()

l.values.tolist()

[{'media': [{'display_url': 'pic.twitter.com/EfejX3iRGr',
    'expanded_url': 'https://twitter.com/dog_rates/status/738885046782832640/photo/1',
    'id': 738885039132401664,
    'id_str': '738885039132401664',
    'indices': [111, 134],
    'media_url': 'http://pbs.twimg.com/media/CkEMBz9WYAAGLaa.jpg',
    'media_url_https': 'https://pbs.twimg.com/media/CkEMBz9WYAAGLaa.jpg',
    'sizes': {'large': {'h': 338, 'resize': 'fit', 'w': 450},
     'medium': {'h': 338, 'resize': 'fit', 'w': 450},
     'small': {'h': 338, 'resize': 'fit', 'w': 450},
     'thumb': {'h': 150, 'resize': 'crop', 'w': 150}},
    'type': 'photo',
    'url': 'https://t.co/EfejX3iRGr'}]}]
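
The nested structure printed above keeps the image address under `media` → first item → `media_url`. A minimal sketch of pulling it out follows. The helper name `first_media_url` is hypothetical, and the non-dict branch assumes that tweets without media hold NaN, as the `extended_entities` non-null count (2073 of 2352) suggests.

```python
import numpy as np
import pandas as pd

def first_media_url(entities):
    """Return the first media_url in an extended_entities dict, or None."""
    if not isinstance(entities, dict):
        return None  # NaN for tweets without media
    media = entities.get('media') or []
    return media[0].get('media_url') if media else None

# One entry trimmed from the output above, plus a missing value.
sample = pd.Series([
    {'media': [{'media_url': 'http://pbs.twimg.com/media/CkEMBz9WYAAGLaa.jpg',
                'type': 'photo'}]},
    np.nan,
])
urls = sample.apply(first_media_url)
```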

In [8]:
l2 = df_tweet['full_text'].sample()

l2.values.tolist()

["This is Gus. He didn't win the Powerball. Quite perturbed about it. Still 10/10 would comfort in time of need https://t.co/3wc246LOtu"]

In [9]:
df_tweet['id'].head()

0    892420643555336193
1    892177421306343426
2    891815181378084864
3    891689557279858688
4    891327558926688256
Name: id, dtype: int64

In [10]:
df_tweet['retweet_count'].sample(3)

1848     767
407     4615
1810    1103
Name: retweet_count, dtype: int64

In [11]:
df_tweet['favorite_count'].sample(3)

1276     1848
687     12064
2220      262
Name: favorite_count, dtype: int64

In [12]:
df_tweet['retweet_count'].sample(3)

628      3874
1093     3099
550     18791
Name: retweet_count, dtype: int64

In [15]:
df_tweet['full_text'].sample().values.tolist()

['Meet Maggie. She can hear your cells divide. 12/10 can also probably fly https://t.co/ovE2hqXryV']

In [16]:
df_tweet['full_text'].sample().values.tolist()

["This is Kramer. He's a Picasso Tortellini. Tie couldn't be more accurate. Confident af. Runs his own business. 10/10 https://t.co/jIcVW0xxmH"]

In [17]:
var_col = ['id', 'retweet_count', 'favorite_count', 'extended_entities', 'full_text']
df_tweet_clean = df_tweet[var_col].copy()  # copy, so later edits do not touch df_tweet or raise SettingWithCopyWarning
df_tweet_clean.head()

Unnamed: 0,id,retweet_count,favorite_count,extended_entities,full_text
0,892420643555336193,8842,39492,"{'media': [{'id': 892420639486877696, 'id_str'...",This is Phineas. He's a mystical boy. Only eve...
1,892177421306343426,6480,33786,"{'media': [{'id': 892177413194625024, 'id_str'...",This is Tilly. She's just checking pup on you....
2,891815181378084864,4301,25445,"{'media': [{'id': 891815175371796480, 'id_str'...",This is Archie. He is a rare Norwegian Pouncin...
3,891689557279858688,8925,42863,"{'media': [{'id': 891689552724799489, 'id_str'...",This is Darla. She commenced a snooze mid meal...
4,891327558926688256,9721,41016,"{'media': [{'id': 891327551943041024, 'id_str'...",This is Franklin. He would like you to stop ca...


## File 2: image-predictions.tsv

In [18]:
df_image = pd.read_csv('./WeRateDogs_data/image-predictions.tsv', delimiter='\t')
df_image.sample(5)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
61,667152164079423490,https://pbs.twimg.com/media/CUIzWk_UwAAfUNq.jpg,1,toy_poodle,0.535411,True,Pomeranian,0.087544,True,miniature_poodle,0.06205,True
1330,757597904299253760,https://pbs.twimg.com/media/CoOGZjiWAAEMKGx.jpg,1,doormat,0.836106,False,wallet,0.056627,False,purse,0.051333,False
1232,746056683365994496,https://pbs.twimg.com/media/ClqGl7fXIAA8nDe.jpg,1,Shetland_sheepdog,0.43332,True,collie,0.335997,True,borzoi,0.177179,True
741,687317306314240000,https://pbs.twimg.com/media/CYnXcLEUkAAIQOM.jpg,1,Shih-Tzu,0.747208,True,Maltese_dog,0.091025,True,Lhasa,0.035788,True
684,683852578183077888,https://pbs.twimg.com/media/CX2ISqSWYAAEtCF.jpg,1,toy_poodle,0.551352,True,teddy,0.180678,False,miniature_poodle,0.164095,True


In [None]:
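# p1 is the algorithm's #1 prediction for the image in the tweet → basset
# p1_conf is how confident the algorithm is in its #1 prediction → 0.555712
# p1_dog is whether the #1 prediction is a breed of dog → True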
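
One possible use of these columns is to keep, for each tweet, the highest-ranked prediction that is flagged as a dog breed. The sketch below is an assumption about how that could be done, not a step taken in this notebook; `top_breed` is a hypothetical helper, and the rows are copied from the `df_image` sample above.

```python
import pandas as pd

# Three rows copied from the df_image sample above (prediction columns only).
preds = pd.DataFrame({
    'p1': ['toy_poodle', 'doormat', 'toy_poodle'],
    'p1_dog': [True, False, True],
    'p2': ['Pomeranian', 'wallet', 'teddy'],
    'p2_dog': [True, False, False],
    'p3': ['miniature_poodle', 'purse', 'miniature_poodle'],
    'p3_dog': [True, False, True],
})

def top_breed(row):
    """Return the first of p1..p3 that the classifier marks as a dog, else None."""
    for k in ('p1', 'p2', 'p3'):
        if row[k + '_dog']:
            return row[k]
    return None

# Walk the rows in order; p1 is tried first because it has the highest confidence.
breeds = [top_breed(row) for _, row in preds.iterrows()]
```

The second row has no dog prediction at all (doormat, wallet, purse), so it yields `None`.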

In [19]:
df_image['img_num'].value_counts()

1    1780
2     198
3      66
4      31
Name: img_num, dtype: int64

In [52]:
df_image.describe()

Unnamed: 0,tweet_id,img_num,p1_conf,p2_conf,p3_conf
count,2075.0,2075.0,2075.0,2075.0,2075.0
mean,7.384514e+17,1.203855,0.594548,0.1345886,0.06032417
std,6.785203e+16,0.561875,0.271174,0.1006657,0.05090593
min,6.660209e+17,1.0,0.044333,1.0113e-08,1.74017e-10
25%,6.764835e+17,1.0,0.364412,0.05388625,0.0162224
50%,7.119988e+17,1.0,0.58823,0.118181,0.0494438
75%,7.932034e+17,1.0,0.843855,0.1955655,0.09180755
max,8.924206e+17,4.0,1.0,0.488014,0.273419


In [53]:
df_image.columns

Index(['tweet_id', 'jpg_url', 'img_num', 'p1', 'p1_conf', 'p1_dog', 'p2',
       'p2_conf', 'p2_dog', 'p3', 'p3_conf', 'p3_dog'],
      dtype='object')

## File 3: twitter-archive-enhanced.csv

In [21]:
df = pd.read_csv('./WeRateDogs_data/twitter-archive-enhanced.csv')
df.sample(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
473,816336735214911488,,,2017-01-03 17:33:39 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Dudley. He found a flower and now he's...,,,,https://twitter.com/dog_rates/status/816336735...,11,10,Dudley,,,,
1457,695095422348574720,,,2016-02-04 04:03:57 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is just a beautiful pupper good shit evol...,,,,https://twitter.com/dog_rates/status/695095422...,12,10,just,,,pupper,
892,759447681597108224,,,2016-07-30 17:56:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Oakley. He has no idea what happened h...,,,,https://twitter.com/dog_rates/status/759447681...,11,10,Oakley,,,,
1987,672877615439593473,,,2015-12-04 20:38:19 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Oscar. He's getting bombarded with the...,,,,https://twitter.com/dog_rates/status/672877615...,8,10,Oscar,,,,
1259,710272297844797440,,,2016-03-17 01:11:26 +0000,"<a href=""http://twitter.com/download/iphone"" r...",We 👏🏻 only 👏🏻 rate 👏🏻 dogs. Pls stop sending i...,,,,https://twitter.com/dog_rates/status/710272297...,11,10,infuriating,,,,


In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob
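
The `info()` output above shows `timestamp` stored as `object` and `tweet_id` as `int64`. A minimal sketch of two type fixes follows, assuming ids should be treated as labels rather than numbers; the two rows reuse values from the sample above.

```python
import pandas as pd

# Two rows in the same shape as the archive (values taken from the sample above).
archive = pd.DataFrame({
    'tweet_id': [816336735214911488, 695095422348574720],
    'timestamp': ['2017-01-03 17:33:39 +0000', '2016-02-04 04:03:57 +0000'],
})

# Parse the text timestamps; utc=True gives timezone-aware UTC values.
archive['timestamp'] = pd.to_datetime(archive['timestamp'], utc=True)
# Treat ids as labels rather than numbers.
archive['tweet_id'] = archive['tweet_id'].astype(str)
```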

In [28]:
df.duplicated('tweet_id').value_counts() # no duplicated tweet_id values

False    2356
dtype: int64

In [51]:
# s1 is not defined in the cells shown; assumed to be a one-row sample of df, e.g. s1 = df.sample()
df_image[df_image['tweet_id'].isin(s1['tweet_id'])]['p1'] == s1['name'].iloc[0]

1751    False
Name: p1, dtype: bool

## Get dog rating, name and type from text

In [24]:
df['text'].sample().values.tolist()

['This is Rocky. He sleeps like a psychopath. 10/10 quality tongue slip https://t.co/MbgG95mUdu']
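
Ratings such as the `10/10` above can be pulled out of the text with a regular expression. The following is a minimal sketch; the pattern and the helper name `find_ratings` are assumptions. It returns every match, because some tweets in this archive contain more than one "n/m" pair (for example a `4/20` before the real rating, or a date like `11/15/15`), and those cases need a manual look.

```python
import re

# Assumed pattern: an integer or decimal numerator, a slash, an integer denominator.
RATING = re.compile(r'(\d+(?:\.\d+)?)/(\d+)')

def find_ratings(text):
    """Return every numerator/denominator pair found in a tweet, in order."""
    return [(float(n), int(d)) for n, d in RATING.findall(text)]

single = find_ratings('This is Rocky. He sleeps like a psychopath. 10/10 quality tongue slip')
double = find_ratings('Happy 4/20 from the squad! 13/10 for all')
```

A tweet for which `find_ratings` returns more than one pair is a candidate for review rather than automatic assignment.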

In [84]:
df_clean[df_clean['puppo'] == None] # no rows match: the stage columns hold the string 'None', not Python None; drop these columns

Unnamed: 0,tweet_id,text,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


In [109]:
df_clean = df[['tweet_id', 'text', 'rating_numerator', 'rating_denominator', 'name']].copy()  # copy to avoid SettingWithCopyWarning on later assignments

In [94]:
df_clean.head()

Unnamed: 0,tweet_id,text,rating_numerator,rating_denominator,name
0,892420643555336193,This is Phineas. He's a mystical boy. Only eve...,13,10,Phineas
1,892177421306343426,This is Tilly. She's just checking pup on you....,13,10,Tilly
2,891815181378084864,This is Archie. He is a rare Norwegian Pouncin...,12,10,Archie
3,891689557279858688,This is Darla. She commenced a snooze mid meal...,13,10,Darla
4,891327558926688256,This is Franklin. He would like you to stop ca...,12,10,Franklin


In [95]:
df_clean.text[100]

'Here are my favorite #dogsatpollingstations \r\nMost voted for a more consistent walking schedule and to increase daily pats tenfold. All 13/10 https://t.co/17FVMl4VZ5'

In [110]:
df_clean.loc[:, 'type'] = None

#dog_lists = ['pupper', 'puppo', 'doggo', 'floofer'] # type count = 399
dog_lists = ['pupper', 'puppo', 'doggo', 'floofer', 'blep', 'snoot'] # the reference doc provides two more types

for i in range(len(df_clean)):
    text = df_clean.loc[i, 'text']
    for dog_status in dog_lists:
        if dog_status in text:
            #df_clean.type[i] = dog_status
            df_clean.loc[i, 'type'] = dog_status

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[key] = _infer_fill_value(value)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
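
The warning above typically appears when a frame is created by selecting columns from another frame and is then modified. A minimal sketch of one way around it: take an explicit `.copy()`, and replace the row loop with a vectorized `str.extract`. The toy texts are shortened from samples in this notebook. Note one behavioral difference, stated as an assumption about intent: the loop keeps the last matching term in list order, while `str.extract` keeps the first match in the text.

```python
import pandas as pd

# Toy stand-in for the archive; the texts are shortened from samples in this notebook.
df = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'text': ['This is just a beautiful pupper',
             'Could not be less impressed. 12/10 superior puppo',
             'This is Rocky. He sleeps like a psychopath.'],
})

# .copy() makes df_clean independent of df, so adding a column raises no warning.
df_clean = df[['tweet_id', 'text']].copy()

dog_lists = ['pupper', 'puppo', 'doggo', 'floofer', 'blep', 'snoot']
pattern = '(' + '|'.join(dog_lists) + ')'

# str.extract returns the first match in each text, or NaN when nothing matches.
df_clean['type'] = df_clean['text'].str.extract(pattern, expand=False)
types = df_clean['type'].tolist()
```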


In [115]:
df_clean.sample(5)

Unnamed: 0,tweet_id,text,rating_numerator,rating_denominator,name,type
2321,666435652385423360,"""Can you behave? You're ruining my wedding day...",10,10,,
1783,677673981332312066,Endangered triangular pup here. Could be a wiz...,9,10,,
599,798682547630837760,RT @dog_rates: Here we see a rare pouched pupp...,8,10,,pupper
1968,673320132811366400,This is Frankie. He's wearing blush. 11/10 rea...,11,10,Frankie,
273,840728873075638272,RT @dog_rates: This is Pipsy. He is a fluffbal...,12,10,Pipsy,


In [114]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 6 columns):
tweet_id              2356 non-null int64
text                  2356 non-null object
rating_numerator      2356 non-null int64
rating_denominator    2356 non-null int64
name                  2356 non-null object
type                  400 non-null object
dtypes: int64(3), object(3)
memory usage: 110.5+ KB


In [132]:
df_clean['name'].value_counts()

None              745
a                  55
Charlie            12
Oliver             11
Cooper             11
Lucy               11
Tucker             10
Penny              10
Lola               10
Bo                  9
Winston             9
Sadie               8
the                 8
Toby                7
Bailey              7
Daisy               7
Buddy               7
an                  7
Rusty               6
Oscar               6
Milo                6
Jack                6
Koda                6
Scout               6
Stanley             6
Jax                 6
Dave                6
Leo                 6
Bella               6
very                5
                 ... 
Blipson             1
Murphy              1
Major               1
Michelangelope      1
Erik                1
Vinscent            1
Sailor              1
Tessa               1
Naphaniel           1
Farfle              1
Holly               1
Stark               1
Nugget              1
Mary                1
Acro      

In [122]:
df_clean.text.sample().values # name: Louis

array([ "This is Louis. He's crossing. It's a big deal. 13/10 h*ckin breathtaking https://t.co/D0wb1GlKAt"], dtype=object)

In [124]:
df_clean.text.sample().values # name: Shikha

array([ 'This is Shikha. She just watched you drop a skittle on the ground and still eat it. Could not be less impressed. 12/10 superior puppo https://t.co/XZlZKd73go'], dtype=object)

In [133]:
df_clean.text.sample().values # two names in one tweet: Brandi and Harley

array([ 'This is Brandi and Harley. They are practicing their caroling for later. Both 12/10 festive af https://t.co/AbBDuGZUpp'], dtype=object)

In [134]:
df_clean.text.sample().values # no real name in this tweet ("swell petting" is not a name)

array([ '"Thank you friend that was a swell petting" 11/10 (vid by @MatthewjamesMac) https://t.co/NY3cPAZAIM'], dtype=object)
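
The `value_counts()` output above lists entries such as `None`, `a`, `the`, `an`, and `very` in the `name` column, which are not dog names. A minimal sketch of marking them as missing follows. It rests on an assumption that should be checked against the data: real names start with an uppercase letter.

```python
import numpy as np
import pandas as pd

# A few values taken from the value_counts output above.
names = pd.Series(['Charlie', 'a', 'None', 'the', 'Lucy', 'very'])

# Assumption: real names start with an uppercase letter; 'None' is a placeholder string.
is_placeholder = names.eq('None') | names.str[0].str.islower()
cleaned = names.mask(is_placeholder, np.nan)
```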

### Assess df

In [64]:
# select rows with rating_numerator < 10
df2 = df[df.rating_numerator < 10]
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 440 entries, 45 to 2355
Data columns (total 17 columns):
tweet_id                      440 non-null int64
in_reply_to_status_id         12 non-null float64
in_reply_to_user_id           12 non-null float64
timestamp                     440 non-null object
source                        440 non-null object
text                          440 non-null object
retweeted_status_id           9 non-null float64
retweeted_status_user_id      9 non-null float64
retweeted_status_timestamp    9 non-null object
expanded_urls                 430 non-null object
rating_numerator              440 non-null int64
rating_denominator            440 non-null int64
name                          440 non-null object
doggo                         440 non-null object
floofer                       440 non-null object
pupper                        440 non-null object
puppo                         440 non-null object
dtypes: float64(4), int64(3), object(10)
memory us

In [65]:
df3 = df[df.rating_numerator < df.rating_denominator] 
df3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 442 entries, 45 to 2355
Data columns (total 17 columns):
tweet_id                      442 non-null int64
in_reply_to_status_id         13 non-null float64
in_reply_to_user_id           13 non-null float64
timestamp                     442 non-null object
source                        442 non-null object
text                          442 non-null object
retweeted_status_id           9 non-null float64
retweeted_status_user_id      9 non-null float64
retweeted_status_timestamp    9 non-null object
expanded_urls                 431 non-null object
rating_numerator              442 non-null int64
rating_denominator            442 non-null int64
name                          442 non-null object
doggo                         442 non-null object
floofer                       442 non-null object
pupper                        442 non-null object
puppo                         442 non-null object
dtypes: float64(4), int64(3), object(10)
memory us

In [66]:
df4 = df3[df3.rating_denominator != 10] 
df4

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
342,832088576586297345,8.320875e+17,30582080.0,2017-02-16 04:45:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@docmisterio account started on 11/15/15,,,,,11,15,,,,,
784,775096608509886464,,,2016-09-11 22:20:06 +0000,"<a href=""http://twitter.com/download/iphone"" r...","RT @dog_rates: After so many requests, this is...",7.403732e+17,4196984000.0,2016-06-08 02:41:38 +0000,https://twitter.com/dog_rates/status/740373189...,9,11,,,,,
1068,740373189193256964,,,2016-06-08 02:41:38 +0000,"<a href=""http://twitter.com/download/iphone"" r...","After so many requests, this is Bretagne. She ...",,,,https://twitter.com/dog_rates/status/740373189...,9,11,,,,,
1165,722974582966214656,,,2016-04-21 02:25:47 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Happy 4/20 from the squad! 13/10 for all https...,,,,https://twitter.com/dog_rates/status/722974582...,4,20,,,,,
1274,709198395643068416,,,2016-03-14 02:04:08 +0000,"<a href=""http://twitter.com/download/iphone"" r...","From left to right:\r\nCletus, Jerome, Alejand...",,,,https://twitter.com/dog_rates/status/709198395...,45,50,,,,,
1598,686035780142297088,6.86034e+17,4196984000.0,2016-01-10 04:04:10 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Yes I do realize a rating of 4/20 would've bee...,,,,,4,20,,,,,
1662,682962037429899265,,,2016-01-01 16:30:13 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darrel. He just robbed a 7/11 and is i...,,,,https://twitter.com/dog_rates/status/682962037...,7,11,Darrel,,,,
2335,666287406224695296,,,2015-11-16 16:11:11 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is an Albanian 3 1/2 legged Episcopalian...,,,,https://twitter.com/dog_rates/status/666287406...,1,2,an,,,,


In [18]:
df4.rating_denominator.value_counts()

11    3
20    2
15    1
2     1
50    1
Name: rating_denominator, dtype: int64

In [19]:
df3.rating_numerator.value_counts()

9     158
8     102
7      55
5      37
6      32
3      19
4      17
2       9
1       9
0       2
45      1
11      1
Name: rating_numerator, dtype: int64

In [26]:
df3[df3['rating_numerator'] == 11].text

342    @docmisterio account started on 11/15/15
Name: text, dtype: object

In [64]:
s = df3[df3['rating_numerator'] == 45]['text'].to_string()
s

'1274    From left to right:\\r\\nCletus, Jerome, Alejand...'

In [63]:
len(s.split())

7
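
As the output above shows, the string built with `to_string()` carries the index label and an abbreviated text, so the word count of 7 describes that abbreviated string and not the tweet. A minimal sketch of reading the full value instead follows; the frame and its long text are made-up placeholders, not the real tweet.

```python
import pandas as pd

# Toy frame; the long text is a made-up placeholder, not the real tweet.
df3 = pd.DataFrame(
    {'rating_numerator': [45, 9],
     'text': ['one two three four five six seven eight nine ten', 'short text']},
    index=[1274, 1068],
)

# Select the matching row, then take the value itself instead of a printed Series.
full_text = df3.loc[df3['rating_numerator'] == 45, 'text'].iloc[0]
word_count = len(full_text.split())
```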

In [66]:
df3

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
45,883482846933004288,,,2017-07-08 00:28:19 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Bella. She hopes her smile made you sm...,,,,https://twitter.com/dog_rates/status/883482846...,5,10,Bella,,,,
229,848212111729840128,,,2017-04-01 16:35:01 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Jerry. He's doing a distinguished tong...,,,,https://twitter.com/dog_rates/status/848212111...,6,10,Jerry,,,,
315,835152434251116546,,,2017-02-24 15:40:31 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you're so blinded by your systematic plag...,,,,https://twitter.com/dog_rates/status/835152434...,0,10,,,,,
342,832088576586297345,8.320875e+17,3.058208e+07,2017-02-16 04:45:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@docmisterio account started on 11/15/15,,,,,11,15,,,,,
387,826598799820865537,8.265984e+17,4.196984e+09,2017-02-01 01:11:25 +0000,"<a href=""http://twitter.com/download/iphone"" r...","I was going to do 007/10, but the joke wasn't ...",,,,,7,10,,,,,
462,817502432452313088,,,2017-01-06 22:45:43 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: Meet Herschel. He's slightly bi...,6.924173e+17,4.196984e+09,2016-01-27 18:42:06 +0000,https://twitter.com/dog_rates/status/692417313...,7,10,Herschel,,,pupper,
485,814578408554463233,,,2016-12-29 21:06:41 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: Meet Beau &amp; Wilbur. Wilbur ...,6.981954e+17,4.196984e+09,2016-02-12 17:22:12 +0000,https://twitter.com/dog_rates/status/698195409...,9,10,Beau,,,,
599,798682547630837760,,,2016-11-16 00:22:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: Here we see a rare pouched pupp...,6.769365e+17,4.196984e+09,2015-12-16 01:27:03 +0000,https://twitter.com/dog_rates/status/676936541...,8,10,,,,pupper,
605,798576900688019456,,,2016-11-15 17:22:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",RT @dog_rates: Not familiar with this breed. N...,6.661041e+17,4.196984e+09,2015-11-16 04:02:55 +0000,https://twitter.com/dog_rates/status/666104133...,1,10,,,,,
730,781661882474196992,,,2016-09-30 01:08:10 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Who keeps sending in pictures without dogs in ...,,,,https://twitter.com/dog_rates/status/781661882...,5,10,,,,,


In [74]:
df3 = df3.copy()  # work on a copy to avoid SettingWithCopyWarning
df3['type'] = None

dog_lists = ['pupper', 'puppo', 'doggo', 'floofer'] #['blep', 'snoot']

# df3 keeps df's row labels (45 ... 2355), so iterate over df3.index;
# range(len(df3)) starts at label 0, which does not exist and raised KeyError: 0
for i in df3.index:
    text = df3.loc[i, 'text']
    for dog_status in dog_lists:
        if dog_status in text:
            df3.loc[i, 'type'] = dog_status

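## Assessing data is the second step of data wrangling
- Data quality issues (i.e., content issues)
- Lack of tidiness (i.e., structural issues)

This project extracts the text from the CSV file, then extracts the rating, the dog name, and the dog stage from that text.
From tweet_json.txt it extracts at least the retweet and favorite counts.
Finally, the extracted data is merged with the image predictions (inner join) and then assessed and cleaned; in practice, clean first and merge afterwards.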
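
The plan above can be sketched as two inner merges on the tweet id. This is an illustration on made-up rows, not the final pipeline. The column names follow this notebook (`id` in the API table, `tweet_id` in the other two); the frame names are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the three cleaned tables (ids are made up).
archive = pd.DataFrame({'tweet_id': [1, 2, 3], 'rating_numerator': [13, 12, 11]})
tweets = pd.DataFrame({'id': [1, 2, 4], 'retweet_count': [8842, 6480, 4301],
                       'favorite_count': [39492, 33786, 25445]})
images = pd.DataFrame({'tweet_id': [1, 3], 'p1': ['toy_poodle', 'collie']})

# The API table calls the key "id"; rename it so all three share "tweet_id".
tweets = tweets.rename(columns={'id': 'tweet_id'})

# Inner joins keep only tweets present in every table.
master = (archive
          .merge(tweets, on='tweet_id', how='inner')
          .merge(images, on='tweet_id', how='inner'))
```

Only tweet 1 appears in all three toy tables, so it is the only row left after both joins.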