### Wrangle WeRateDogs Data
This document contains the code that was used to wrangle the data for the Wrangle and Analyze Data project. I've divided the document into three sections, each corresponding to a step in the data wrangling process as outlined in the lessons:

* Gather
* Assess
* Clean

#### Gather

Here I begin the process of wrangling by gathering the required data. The only data that I didn't have to obtain elsewhere are in the file "twitter-archive-enhanced.csv", which was provided to me.

In [1]:
import pandas as pd
import numpy as np
import os
import tweepy
import json
import requests

#load twitter-archive-enhanced.csv into twitter_archive_df
twitter_archive_df = pd.read_csv('twitter-archive-enhanced.csv')

In [2]:
consumer_key = 'HIDDEN'
consumer_secret = 'HIDDEN'
access_token = 'HIDDEN'
access_secret = 'HIDDEN'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

In [3]:
page_no_exist = []
retweet_count_and_favorite_count = []

with open('tweet_json.txt', mode="w") as file:
    for i in list(twitter_archive_df.tweet_id):
        try:
            tweet = api.get_status(str(i))
            # write one JSON object per line so the file can be parsed line by line later
            file.write(json.dumps(tweet._json) + '\n')
            retweet_count_and_favorite_count.append({
                "tweet_id" : str(i),
                "retweet_count" : tweet._json['retweet_count'],
                "favorite_count" : tweet._json['favorite_count']
            })
        except Exception:
            # the tweet could not be retrieved; record its id
            page_no_exist.append(i)

Rate limit reached. Sleeping for: 469
Rate limit reached. Sleeping for: 475


In [4]:
len(retweet_count_and_favorite_count), len(page_no_exist)

(2331, 25)

In [5]:
tweet_data_df = pd.DataFrame(retweet_count_and_favorite_count, columns=["tweet_id",'retweet_count', 'favorite_count'])
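The same dataframe could also be rebuilt later from the saved file instead of calling the API again. The following is only a minimal sketch: it assumes the file holds one JSON object per line, and it uses an in-memory stand-in with made-up values (`sample_file`, `toy_tweet_data_df` are hypothetical names) rather than the real `tweet_json.txt`.

```python
import io
import json

import pandas as pd

# Two toy records standing in for lines of tweet_json.txt (values are made up).
sample_file = io.StringIO(
    '{"id": 1, "retweet_count": 5, "favorite_count": 12}\n'
    '{"id": 2, "retweet_count": 0, "favorite_count": 3}\n'
)

rows = []
for line in sample_file:
    tweet = json.loads(line)  # parse one tweet per line
    rows.append({
        "tweet_id": str(tweet["id"]),
        "retweet_count": tweet["retweet_count"],
        "favorite_count": tweet["favorite_count"],
    })

toy_tweet_data_df = pd.DataFrame(
    rows, columns=["tweet_id", "retweet_count", "favorite_count"]
)
print(toy_tweet_data_df)
```

Reading from the file this way avoids waiting on the rate limit a second time.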

In [6]:
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)

with open('image-predictions.tsv', mode='wb') as file:
    file.write(response.content)

image_pred_df = pd.read_csv('image-predictions.tsv',sep="\t")

The data has now been loaded into the following dataframes:

twitter_archive_df: contains data about the archived WeRateDogs tweets.\
tweet_data_df: contains additional data about the WeRateDogs tweets gathered from Twitter.\
image_pred_df: contains the prediction results of a machine learning algorithm trained on a sample of the images from the tweets in the WeRateDogs archive.

#### Assess
With the data in hand, I can now assess it for potential quality and structural issues, starting with visual assessment.

I will assess the dataframes one by one and then clean them.

#### Dataframe contents for visual assessment

In [7]:
image_pred_df.sample(15)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
971,706593038911545345,https://pbs.twimg.com/media/Cc5Snc7XIAAMidF.jpg,1,four-poster,0.696423,False,quilt,0.189312,False,pillow,0.029409,False
719,685906723014619143,https://pbs.twimg.com/media/CYTUhn7WkAEXocW.jpg,1,Yorkshire_terrier,0.414963,True,briard,0.063505,True,Pekinese,0.053682,True
808,692142790915014657,https://pbs.twimg.com/media/CZr8LvyXEAABJ9k.jpg,3,toy_poodle,0.670068,True,teddy,0.190898,False,miniature_poodle,0.032178,True
1672,813112105746448384,https://pbs.twimg.com/media/C0jBJZVWQAA2_-X.jpg,1,dingo,0.287369,False,Pembroke,0.140682,True,basenji,0.090819,True
676,683462770029932544,https://pbs.twimg.com/media/CXwlw9MWsAAc-JB.jpg,1,Italian_greyhound,0.39956,True,whippet,0.267153,True,German_short-haired_pointer,0.081319,True
1782,828770345708580865,https://pbs.twimg.com/media/C4BiOXOXAAAf6IS.jpg,1,seat_belt,0.765979,False,Chesapeake_Bay_retriever,0.033899,True,polecat,0.027252,False
1582,796759840936919040,https://pbs.twimg.com/media/Cw6o1JQXcAAtP78.jpg,1,American_Staffordshire_terrier,0.463996,True,Staffordshire_bullterrier,0.155566,True,Weimaraner,0.137587,True
1944,861769973181624320,https://pbs.twimg.com/media/CzG425nWgAAnP7P.jpg,2,Arabian_camel,0.366248,False,house_finch,0.209852,False,cocker_spaniel,0.046403,True
290,671166507850801152,https://pbs.twimg.com/media/CVB2TnWUYAA2pAU.jpg,1,refrigerator,0.829772,False,toilet_seat,0.030083,False,shower_curtain,0.015461,False
868,697596423848730625,https://pbs.twimg.com/media/Ca5cPrJXIAImHtD.jpg,1,Shetland_sheepdog,0.621668,True,collie,0.366578,True,Pembroke,0.007698,True


In [8]:
twitter_archive_df.sample(15)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
1643,683857920510050305,,,2016-01-04 03:50:08 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Sadie. She fell asleep on the beach and h...,,,,https://twitter.com/dog_rates/status/683857920...,10,10,Sadie,,,,
395,825535076884762624,,,2017-01-29 02:44:34 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here's a very loving and accepting puppo. Appe...,,,,https://twitter.com/dog_rates/status/825535076...,14,10,,,,,puppo
2165,669367896104181761,,,2015-11-25 04:11:57 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Chip. Chip's pretending to be choked. ...,,,,https://twitter.com/dog_rates/status/669367896...,10,10,Chip,,,,
2053,671485057807351808,,,2015-12-01 00:24:48 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Penelope. She is a white Macadamias Duode...,,,,https://twitter.com/dog_rates/status/671485057...,11,10,Penelope,,,,
259,843235543001513987,,,2017-03-18 22:59:54 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tycho. She just had new wheels install...,,,,https://twitter.com/dog_rates/status/843235543...,13,10,Tycho,,,,
1836,676098748976615425,,,2015-12-13 17:57:57 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Extremely rare pup here. Very religious. Alway...,,,,https://twitter.com/dog_rates/status/676098748...,3,10,,,,,
2282,667211855547486208,,,2015-11-19 05:24:37 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Genevieve. She is a golden retriever c...,,,,https://twitter.com/dog_rates/status/667211855...,9,10,Genevieve,,,,
1587,686749460672679938,,,2016-01-12 03:20:05 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Strange pup here. Easily manipulated. Rather i...,,,,https://twitter.com/dog_rates/status/686749460...,8,10,,,,,
2217,668528771708952576,,,2015-11-22 20:37:34 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Gòrdón. He enjoys his razberrita by po...,,,,https://twitter.com/dog_rates/status/668528771...,12,10,Gòrdón,,,,
1075,739623569819336705,,,2016-06-06 01:02:55 +0000,"<a href=""http://vine.co"" rel=""nofollow"">Vine -...",Here's a doggo that don't need no human. 12/10...,,,,https://vine.co/v/iY9Fr1I31U6,12,10,,doggo,,,


In [9]:
tweet_data_df.sample(15)

Unnamed: 0,tweet_id,retweet_count,favorite_count
502,810284430598270976,11122,35002
732,779056095788752897,4452,14593
784,772117678702071809,723,3722
478,813187593374461952,4308,19714
1088,733460102733135873,1230,4047
1702,679877062409191424,625,1922
544,802624713319034886,2890,0
388,824025158776213504,588,4765
1562,686760001961103360,1337,3418
1339,702932127499816960,695,2513


#### Programmatic assessment

#### Dataframe structure

In [10]:
image_pred_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [11]:
image_pred_df.duplicated().sum()

0

In [12]:
twitter_archive_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [13]:
twitter_archive_df.duplicated().sum()

0

In [14]:
twitter_archive_df.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


In [15]:
twitter_archive_df.rating_numerator.value_counts()

12      558
11      464
10      461
13      351
9       158
8       102
7        55
14       54
5        37
6        32
3        19
4        17
1         9
2         9
420       2
0         2
15        2
75        2
80        1
20        1
24        1
26        1
44        1
50        1
60        1
165       1
84        1
88        1
144       1
182       1
143       1
666       1
960       1
1776      1
17        1
27        1
45        1
99        1
121       1
204       1
Name: rating_numerator, dtype: int64

In [16]:
twitter_archive_df.rating_denominator.value_counts()

10     2333
11        3
50        3
80        2
20        2
2         1
16        1
40        1
70        1
15        1
90        1
110       1
120       1
130       1
150       1
170       1
7         1
0         1
Name: rating_denominator, dtype: int64
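To understand why some denominators differ from 10, it can help to look at the tweet text next to the parsed rating. This is a small sketch on made-up rows (`toy_df` and its texts are hypothetical), not the actual archive data.

```python
import pandas as pd

# Made-up rows that mimic the archive's text and rating columns.
toy_df = pd.DataFrame({
    "text": [
        "Meet Sam. 12/10",
        "Happy 4/20 from the squad! 13/10",
        "Here are 5 pups. 60/50",
    ],
    "rating_numerator": [12, 4, 60],
    "rating_denominator": [10, 20, 50],
})

# Select only the rows whose denominator is not 10, keeping the text for context.
odd_denominators = toy_df.loc[
    toy_df["rating_denominator"] != 10,
    ["text", "rating_numerator", "rating_denominator"],
]
print(odd_denominators)
```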

#### Quality Issues
##### twitter_archive_df
- the name column encodes missing names as the string 'None'
- wrong data types for 'tweet_id' and 'timestamp'
- redundant retweet rows
- redundant reply rows (tweets posted in reply to another user's tweet)
- some rating_denominator values do not equal 10
- over 95% of the values in the source column are duplicates

##### image_pred_df
- false predictions: rows where none of the three predictions is a dog

#### Tidiness Issues
##### twitter_archive_df
- the dog stage is spread across four columns (doggo, floofer, pupper, puppo)
- the "retweet count" and "favorite count" columns are not in twitter_archive_df

#### Clean

In [17]:
# make copies
twitter_archive_df_clean = twitter_archive_df.copy()
image_pred_df_clean = image_pred_df.copy()
tweet_data_df_clean = tweet_data_df.copy()

#### Tackle the Tidiness Issues

1. Redundant columns for a single category: the dog stage is split across the ["doggo", "floofer", "pupper", "puppo"] columns, but only one "stage" column is needed.
##### Define: combine the ["doggo", "floofer", "pupper", "puppo"] columns into a single stage column (named 'dog breed' in the code below), then drop the four columns.
#### Code

In [18]:
twitter_archive_df_clean.columns

Index(['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
       'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator',
       'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo'],
      dtype='object')

In [19]:
# concatenate the four stage columns into one string per row
twitter_archive_df_clean['dog breed'] = twitter_archive_df_clean.doggo + twitter_archive_df_clean.floofer + twitter_archive_df_clean.pupper + twitter_archive_df_clean.puppo
twitter_archive_df_clean['dog breed'].value_counts()

NoneNoneNoneNone        1976
NoneNonepupperNone       245
doggoNoneNoneNone         83
NoneNoneNonepuppo         29
doggoNonepupperNone       12
NoneflooferNoneNone        9
doggoNoneNonepuppo         1
doggoflooferNoneNone       1
Name: dog breed, dtype: int64

In [20]:
twitter_archive_df_clean['dog breed'] = twitter_archive_df_clean['dog breed'].map(lambda x: x.replace("None",""))
twitter_archive_df_clean['dog breed'].value_counts()

                1976
pupper           245
doggo             83
puppo             29
doggopupper       12
floofer            9
doggopuppo         1
doggofloofer       1
Name: dog breed, dtype: int64

In [21]:
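# separate the values of rows that carry two stages with a comma
twitter_archive_df_clean.loc[twitter_archive_df_clean['dog breed'] == 'doggopupper', 'dog breed'] = 'doggo, pupper'
twitter_archive_df_clean.loc[twitter_archive_df_clean['dog breed'] == 'doggopuppo', 'dog breed'] = 'doggo, puppo'
twitter_archive_df_clean.loc[twitter_archive_df_clean['dog breed'] == 'doggofloofer', 'dog breed'] = 'doggo, floofer'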

twitter_archive_df_clean.drop(['doggo', 'floofer', 'pupper', 'puppo'], axis =1, inplace=True)
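The combination step could also be written so that any mix of stages is joined with a comma, without listing each combination by hand. This is a sketch on toy data, not the code used above; `toy_df`, `join_stages`, and the `stage` column name are hypothetical.

```python
import pandas as pd

# Toy rows using the same 'None' placeholder as the archive.
toy_df = pd.DataFrame({
    "doggo":   ["None", "doggo", "doggo", "None"],
    "floofer": ["None", "None", "None", "None"],
    "pupper":  ["None", "None", "pupper", "pupper"],
    "puppo":   ["None", "None", "None", "None"],
})
stage_cols = ["doggo", "floofer", "pupper", "puppo"]


def join_stages(row):
    """Join every stage present in the row with ', ' (empty string if none)."""
    present = [value for value in row if value != "None"]
    return ", ".join(present)


toy_df["stage"] = toy_df[stage_cols].apply(join_stages, axis=1)
toy_df = toy_df.drop(columns=stage_cols)
print(toy_df)
```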

#### Test

In [22]:
twitter_archive_df_clean.columns

Index(['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
       'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator',
       'rating_denominator', 'name', 'dog breed'],
      dtype='object')

2. The "retweet count" and "favorite count" columns are not in twitter_archive_df
##### Define: merge tweet_data_df_clean into twitter_archive_df_clean on tweet_id
#### Code

In [23]:
twitter_archive_df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 14 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  dog breed                   2356 

In [24]:
tweet_data_df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2331 entries, 0 to 2330
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   tweet_id        2331 non-null   object
 1   retweet_count   2331 non-null   int64 
 2   favorite_count  2331 non-null   int64 
dtypes: int64(2), object(1)
memory usage: 54.8+ KB


In [25]:
# Converting tweet_id in tweet_data_df_clean to int failed with
# "OverflowError: Python int too large to convert to C long":
# tweet_data_df_clean.tweet_id.astype('int')
# tweet_data_df_clean

# so I will convert twitter_archive_df_clean.tweet_id to str instead
twitter_archive_df_clean.tweet_id = twitter_archive_df_clean.tweet_id.astype('str')

# merge tweet_data_df_clean with twitter_archived_df_clean
twitter_archive_df_clean = pd.merge(twitter_archive_df_clean, tweet_data_df_clean, on=['tweet_id'], how='left')
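A left merge keeps every archive row, so tweets that could not be fetched end up with missing counts. The sketch below, on made-up frames (`archive` and `counts` are hypothetical names), uses the `indicator` parameter of `merge` to count those unmatched rows.

```python
import pandas as pd

# Toy stand-ins: the archive has integer ids, the API data has string ids.
archive = pd.DataFrame({"tweet_id": [101, 102, 103],
                        "name": ["Sadie", "Chip", "None"]})
counts = pd.DataFrame({"tweet_id": ["101", "103"],
                       "retweet_count": [5, 7],
                       "favorite_count": [20, 30]})

# Both key columns must share a dtype before merging.
archive["tweet_id"] = archive["tweet_id"].astype(str)

# indicator=True adds a '_merge' column that records where each row came from.
merged = archive.merge(counts, on="tweet_id", how="left", indicator=True)
unmatched = (merged["_merge"] == "left_only").sum()
print(unmatched)
```

Rows flagged `left_only` have no count data, which is also why the count columns become float after the merge.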


#### Test

In [26]:
twitter_archive_df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2356 entries, 0 to 2355
Data columns (total 16 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   object 
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  dog breed                   2356 

#### Tackle the Quality Issues
1. Redundant retweet rows
#### Define: find the index of the retweets, then remove the retweet rows and the ['retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp'] columns

#### Code

In [27]:
twitter_archive_df_clean[['retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp']].value_counts()

retweeted_status_id  retweeted_status_user_id  retweeted_status_timestamp
8.874740e+17         4.196984e+09              2017-07-19 00:47:34 +0000     1
7.594477e+17         4.196984e+09              2016-07-30 17:56:51 +0000     1
7.575971e+17         2.804798e+08              2016-07-25 15:23:28 +0000     1
7.562885e+17         4.196984e+09              2016-07-22 00:43:32 +0000     1
7.533757e+17         4.196984e+09              2016-07-13 23:48:51 +0000     1
                                                                            ..
8.008540e+17         7.992370e+07              2016-11-22 00:10:52 +0000     1
8.001414e+17         4.196984e+09              2016-11-20 00:59:15 +0000     1
8.000650e+17         2.488557e+07              2016-11-19 19:55:41 +0000     1
7.961497e+17         4.196984e+09              2016-11-09 00:37:46 +0000     1
6.661041e+17         4.196984e+09              2015-11-16 04:02:55 +0000     1
Length: 181, dtype: int64

In [28]:
twitter_archive_df_clean['retweeted_status_id'].isnull().value_counts()

True     2175
False     181
Name: retweeted_status_id, dtype: int64

In [29]:
# rows with a non-null retweeted_status_id are retweets
retweet_index = twitter_archive_df_clean[twitter_archive_df_clean.retweeted_status_id.notnull()].index
twitter_archive_df_clean.drop(index=retweet_index, inplace=True)

# the retweet columns no longer carry information, so drop them
twitter_archive_df_clean.drop(['retweeted_status_id',
                       'retweeted_status_user_id',
                       'retweeted_status_timestamp'],
                      axis=1,
                      inplace=True)

#### Test

In [30]:
for retweet in retweet_index:
    if retweet in list(twitter_archive_df_clean.index):
        print('Found a retweet')

2. Redundant reply rows (tweets posted in reply to another user's tweet)
#### Define: find the index of the reply tweets, then remove them and the ['in_reply_to_user_id', 'in_reply_to_status_id'] columns

#### Code

In [31]:
twitter_archive_df_clean['in_reply_to_status_id'].isnull().value_counts()

True     2097
False      78
Name: in_reply_to_status_id, dtype: int64

In [32]:
# rows with a non-null in_reply_to_status_id are replies
replies_index = twitter_archive_df_clean[twitter_archive_df_clean['in_reply_to_status_id'].notnull()].index
twitter_archive_df_clean.drop(index=replies_index, inplace=True)

twitter_archive_df_clean.drop(['in_reply_to_user_id', 'in_reply_to_status_id'],
                      axis=1,
                      inplace=True)

#### Test

In [33]:
for reply in replies_index:
    if reply in list(twitter_archive_df_clean.index):
        print('Found a reply')
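Retweets and replies could also be filtered in a single pass with a boolean mask. This is a sketch on toy data under the assumption that a non-null id marks a retweet or a reply; `toy_df`, `is_original`, and `originals` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy rows: row 1 is a retweet, row 2 is a reply, rows 0 and 3 are originals.
toy_df = pd.DataFrame({
    "tweet_id": ["1", "2", "3", "4"],
    "retweeted_status_id": [np.nan, 8.87e17, np.nan, np.nan],
    "in_reply_to_status_id": [np.nan, np.nan, 6.66e17, np.nan],
})

# A tweet is original when both id columns are missing.
is_original = (toy_df["retweeted_status_id"].isna()
               & toy_df["in_reply_to_status_id"].isna())

originals = toy_df.loc[is_original].drop(
    columns=["retweeted_status_id", "in_reply_to_status_id"]
)
print(originals)
```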

3. Wrong rating_numerator values in tweets such as 883482846933004288 and 778027034220126208: the rating in the text is a decimal, but only the digits after the decimal point were captured.

#### Define: extract the decimal ratings from the tweet text into a temporary 'true_rate' column, convert the rating_numerator column to float, write the extracted values into it, then drop the temporary column.

#### Code

In [34]:
twitter_archive_df_clean.rating_numerator[twitter_archive_df_clean.tweet_id == '883482846933004288']

45    5
Name: rating_numerator, dtype: int64

In [35]:
twitter_archive_df_clean.text[twitter_archive_df_clean.tweet_id == '883482846933004288']

45    This is Bella. She hopes her smile made you sm...
Name: text, dtype: object

In [36]:
twitter_archive_df_clean['true_rate'] = twitter_archive_df_clean.text.str.extract(r"([0-9]+[.][0-9]+/[0-9]+)")

In [37]:
twitter_archive_df_clean.rating_numerator = twitter_archive_df_clean.rating_numerator.astype(float)

In [38]:
wrong_rates = twitter_archive_df_clean[twitter_archive_df_clean.true_rate.notnull()].index

for i in wrong_rates:
    # keep the part before the "/" and assign with .loc to avoid chained assignment
    twitter_archive_df_clean.loc[i, 'rating_numerator'] = float(twitter_archive_df_clean.loc[i, 'true_rate'].split('/')[0])


In [39]:
twitter_archive_df_clean.drop("true_rate", axis=1, inplace= True)

#### Test

In [40]:
for i in wrong_rates:
    print(twitter_archive_df_clean.rating_numerator[i])

13.5
9.75
11.27
11.26


In [41]:
twitter_archive_df_clean.rating_numerator[twitter_archive_df_clean.tweet_id == '883482846933004288']

45    13.5
Name: rating_numerator, dtype: float64
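The extraction and correction can be tried on a small example first. The sketch below uses made-up tweet texts and a two-group variant of the pattern (`toy_df`, `pattern`, and `has_decimal` are hypothetical names), so it illustrates the idea rather than reproducing the cells above.

```python
import pandas as pd

# Made-up texts: the first has a decimal rating that was parsed as 5.
toy_df = pd.DataFrame({
    "text": ["This is Bella. 13.5/10 would smile back", "Solid pup. 12/10"],
    "rating_numerator": [5.0, 12.0],
})

# Group 0 captures the decimal numerator, group 1 the denominator.
pattern = r"(\d+\.\d+)/(\d+)"
extracted = toy_df["text"].str.extract(pattern)

# Overwrite the numerator only where a decimal rating was found.
has_decimal = extracted[0].notna()
toy_df.loc[has_decimal, "rating_numerator"] = (
    extracted.loc[has_decimal, 0].astype(float)
)
print(toy_df["rating_numerator"].tolist())
```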

4. The name column contains invalid names such as 'a' and 'an'

#### Define: find the rows with invalid names and set each of those names to 'None'.

#### Code

In [42]:
twitter_archive_df_clean.name.value_counts()

None       603
a           55
Charlie     11
Lucy        11
Cooper      10
          ... 
Ralph        1
Kawhi        1
Gerbald      1
Monty        1
Leonard      1
Name: name, Length: 955, dtype: int64

In my initial assessment I found only that the name column contains 'None' values, but it also contains invalid names such as 'a' and 'an'.
I hoped to extract the real names from the tweet text, but the text of these tweets contains no names either.

In [43]:
wrong_name = twitter_archive_df_clean.query('name == "a" or name == "an"').index

# assign with .loc to avoid chained assignment
twitter_archive_df_clean.loc[wrong_name, 'name'] = 'None'


#### Test

In [44]:
for i in wrong_name:
    if twitter_archive_df_clean.name[i] != "None":
        print("wrong name")
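Other invalid names may exist beyond 'a' and 'an'. One possible heuristic, which is an assumption on my part and not something verified against the archive, is that a name written entirely in lowercase is not a real name. A sketch on toy data (`toy_df` and `looks_invalid` are hypothetical names):

```python
import pandas as pd

toy_df = pd.DataFrame({"name": ["a", "an", "Charlie", "None"]})

# Assumption: real names are capitalized, so all-lowercase values are parse errors.
looks_invalid = toy_df["name"].str.islower()
toy_df.loc[looks_invalid, "name"] = "None"
print(toy_df["name"].tolist())
```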

5. Some rating_denominator values do not equal 10

#### Define: find the rows where rating_denominator != 10, then drop those rows

#### Code

In [45]:
wrong_denom_rate = twitter_archive_df_clean.query("rating_denominator != 10").index
twitter_archive_df_clean.drop(index=wrong_denom_rate, inplace=True)

#### Test

In [46]:
twitter_archive_df_clean.query("rating_denominator != 10")

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,dog breed,retweet_count,favorite_count


6. Wrong data types for "timestamp" and 'dog breed'
#### Define: convert timestamp to datetime and 'dog breed' to category

#### Code

In [47]:
twitter_archive_df_clean.timestamp = twitter_archive_df_clean.timestamp.astype('datetime64')
twitter_archive_df_clean['dog breed'] = twitter_archive_df_clean['dog breed'].astype('category')
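An alternative way to parse the timestamps is `pd.to_datetime`. The sketch below uses two timestamp strings in the archive's format (`toy_timestamps` and `parsed` are hypothetical names); note that with `utc=True` the result is timezone-aware, unlike the tz-naive column produced above.

```python
import pandas as pd

toy_timestamps = pd.Series([
    "2016-01-04 03:50:08 +0000",
    "2017-01-29 02:44:34 +0000",
])

# utc=True keeps the +0000 offset information as a UTC timezone.
parsed = pd.to_datetime(toy_timestamps, utc=True)
print(parsed.dt.year.tolist())
```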

#### Test

In [48]:
twitter_archive_df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2080 entries, 0 to 2355
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   tweet_id            2080 non-null   object        
 1   timestamp           2080 non-null   datetime64[ns]
 2   source              2080 non-null   object        
 3   text                2080 non-null   object        
 4   expanded_urls       2077 non-null   object        
 5   rating_numerator    2080 non-null   float64       
 6   rating_denominator  2080 non-null   int64         
 7   name                2080 non-null   object        
 8   dog breed           2080 non-null   category      
 9   retweet_count       2073 non-null   float64       
 10  favorite_count      2073 non-null   float64       
dtypes: category(1), datetime64[ns](1), float64(3), int64(1), object(5)
memory usage: 181.2+ KB


7. Over 95% of the values in the source column are duplicates

#### Define: delete this column

#### Code 

In [49]:
twitter_archive_df_clean.drop("source",axis=1,inplace=True)

#### Test

In [50]:
twitter_archive_df_clean.columns

Index(['tweet_id', 'timestamp', 'text', 'expanded_urls', 'rating_numerator',
       'rating_denominator', 'name', 'dog breed', 'retweet_count',
       'favorite_count'],
      dtype='object')

8. Many rows in image_pred_df have False for all three predictions, which makes those predictions useless.

#### Define: find the rows where all three predictions are False and drop them.

#### Code

In [51]:
# drop from the copy so that the original dataframe stays untouched
false_predc = list(image_pred_df_clean.query("p1_dog==False and p2_dog==False and p3_dog == False").index)
image_pred_df_clean.drop(index=false_predc,inplace=True)


#### Test

In [52]:
image_pred_df_clean.query("p1_dog==False and p2_dog==False and p3_dog == False")

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog


9. Missing data in name is encoded as the string 'None'

#### Define: replace all instances of the string 'None' with NaN.

#### Code

In [53]:
twitter_archive_df_clean.name = twitter_archive_df_clean.name.replace('None', np.nan)

#### Test


In [54]:
(twitter_archive_df_clean.name == 'None').sum()

0
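Besides confirming that no 'None' strings remain, it may be worth counting how many names are now missing. A sketch on a toy series (`toy_names` and `cleaned` are hypothetical names):

```python
import numpy as np
import pandas as pd

toy_names = pd.Series(["Sadie", "None", "Chip", "None"])

# Replace the placeholder string with a real missing value.
cleaned = toy_names.replace("None", np.nan)

remaining_placeholders = (cleaned == "None").sum()  # placeholder strings left
missing_names = cleaned.isna().sum()                # values now marked missing
print(remaining_placeholders, missing_names)
```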

### Final cleaned dataset

After the cleaning step I'm left with a single master dataframe: twitter_archive_df_clean. The last thing I'll do is store it in an appropriately named dataframe and export it to the current working directory.



In [55]:
twitter_archive_master = twitter_archive_df_clean.copy()

out_file = 'twitter-archive-master.csv'
twitter_archive_master.to_csv(out_file, index=False)
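One thing to keep in mind when loading the exported file again: by default `read_csv` infers an integer type for tweet_id, so the string type set during cleaning is lost unless `dtype` is passed. A sketch with an in-memory buffer instead of the real file (`toy_master` and `buffer` are hypothetical names):

```python
import io

import pandas as pd

toy_master = pd.DataFrame({
    "tweet_id": ["883482846933004288"],
    "rating_numerator": [13.5],
})

# Write to an in-memory buffer in place of 'twitter-archive-master.csv'.
buffer = io.StringIO()
toy_master.to_csv(buffer, index=False)

buffer.seek(0)
default_read = pd.read_csv(buffer)  # tweet_id is inferred as an integer

buffer.seek(0)
string_read = pd.read_csv(buffer, dtype={"tweet_id": str})  # tweet_id stays a string
print(default_read["tweet_id"].dtype, type(string_read.loc[0, "tweet_id"]))
```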