# Project: Wrangle and Analyze Data

## Data Gathering
In the cell below, gather **all** three pieces of data for this project and load them into the notebook. **Note:** the method required to gather each piece of data is different.
1. Directly download the WeRateDogs Twitter archive data (twitter-archive-enhanced.csv)

In [1]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import json
import requests

%matplotlib inline

In [2]:
# load the twitter-archive-enhanced.csv into a dataframe
twit_arch = pd.read_csv('twitter-archive-enhanced.csv')

2. Use the Requests library to download the tweet image predictions (image-predictions.tsv)

In [3]:
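# download the image prediction file programmatically
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
# stop here if the download failed, rather than saving an error page to disk
response.raise_for_status()

# save the response body under the file name taken from the end of the URL
with open(url.split('/')[-1], mode='wb') as file:
    file.write(response.content)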

In [4]:
# load image-predictions.tsv into a dataframe
image_pred = pd.read_table('image-predictions.tsv')

3. Use the Tweepy library to query additional data via the Twitter API (tweet-json.txt)

In [5]:
# I have no access to the Twitter API, so read the tweet-json.txt file provided by Udacity
data = []
with open('tweet-json.txt', 'r') as file:
    for line in file:
        # each line holds one tweet as a JSON object
        data.append(json.loads(line))

In [6]:
# extract the fields needed (tweet_id, retweet_count, favorite_count)
deets = []
for item in data:
    tweet_id = item['id']
    retweet_count = item['retweet_count']
    favorite_count = item['favorite_count']
    
    deets.append({'tweet_id': tweet_id,
                 'retweet_count': retweet_count,
                 'favorite_count': favorite_count})

In [7]:
# read the extracted content into a dataframe
twit_like = pd.DataFrame(deets, columns = ['tweet_id', 'retweet_count', 'favorite_count'])
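The line-by-line extraction above can be tried on a small in-memory sample before pointing it at the real file. The sketch below is illustrative only: the two JSON lines are made up, and the `wanted` mapping is a hypothetical helper name, not something from the project files.

```python
import io
import json

import pandas as pd

# Two made-up tweets in the same one-JSON-object-per-line layout.
sample = io.StringIO(
    '{"id": 1, "retweet_count": 5, "favorite_count": 12, "lang": "en"}\n'
    '{"id": 2, "retweet_count": 0, "favorite_count": 3, "lang": "en"}\n'
)

# Source key -> column name in the resulting dataframe (hypothetical helper).
wanted = {
    'id': 'tweet_id',
    'retweet_count': 'retweet_count',
    'favorite_count': 'favorite_count',
}

records = []
for line in sample:
    tweet = json.loads(line)
    # Keep only the three fields of interest and ignore everything else.
    records.append({new: tweet[old] for old, new in wanted.items()})

counts = pd.DataFrame(records)
```

Keeping the key-to-column mapping in one place makes it easy to add another field later without touching the loop.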

## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issues**. You must use **both** visual assessment and
programmatic assessment to assess the data.

**Note:** pay attention to the following key points when you assess the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.
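
The first key point (original ratings only, with images) can be sketched as a filter. This is a minimal illustration on made-up rows, assuming that a non-null `retweeted_status_id` marks a retweet and that a tweet "has an image" when its `tweet_id` appears in the image-prediction table.

```python
import pandas as pd

# Made-up archive rows: tweet 2 is a retweet.
archive = pd.DataFrame({
    'tweet_id': [1, 2, 3, 4],
    'retweeted_status_id': [None, 9.0, None, None],
})
# Made-up image-prediction rows: tweet 4 has no image.
images = pd.DataFrame({'tweet_id': [1, 2, 3]})

# Original tweets are the ones without a retweeted status id.
originals = archive[archive['retweeted_status_id'].isna()]

# Keep only originals that also appear in the image table.
originals_with_images = originals[originals['tweet_id'].isin(images['tweet_id'])]
```

Running a filter like this before dropping the `retweeted_status_*` columns matters, because once those columns are gone the retweets can no longer be identified this way.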



In [8]:
# First assess the twitter-archive-enhanced table visually. Google Sheets or a similar program also works; for this review I'll use the pandas .sample() method
twit_arch.sample(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
2332,666345417576210432,,,2015-11-16 20:01:42 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Look at this jokester thinking seat belt laws ...,,,,https://twitter.com/dog_rates/status/666345417...,10,10,,,,,
1077,739544079319588864,,,2016-06-05 19:47:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This... is a Tyrannosaurus rex. We only rate d...,,,,https://twitter.com/dog_rates/status/739544079...,10,10,,,,,
1960,673363615379013632,,,2015-12-06 04:49:31 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This little pupper can't wait for Christmas. H...,,,,https://twitter.com/dog_rates/status/673363615...,11,10,,,,pupper,
1696,681242418453299201,,,2015-12-27 22:37:04 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Champ. He's being sacrificed to the Az...,,,,https://twitter.com/dog_rates/status/681242418...,10,10,Champ,,,,
1635,684222868335505415,,,2016-01-05 04:00:18 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Someone help the girl is being mugged. Several...,,,,https://twitter.com/dog_rates/status/684222868...,121,110,,,,,


In [9]:
# Assess the image-prediction table visually
image_pred.sample(5)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1501,784431430411685888,https://pbs.twimg.com/media/CuLcNkCXgAEIwK2.jpg,1,miniature_poodle,0.744819,True,toy_poodle,0.243192,True,standard_poodle,0.01092,True
1483,781251288990355457,https://pbs.twimg.com/media/CteP5H5WcAEhdLO.jpg,2,Mexican_hairless,0.887771,True,Italian_greyhound,0.030666,True,seat_belt,0.02673,False
272,670826280409919488,https://pbs.twimg.com/media/CU9A8ZuWsAAt_S1.jpg,1,scorpion,0.927956,False,tarantula,0.021631,False,wolf_spider,0.014837,False
1416,771136648247640064,https://pbs.twimg.com/media/CrOgsIBWYAA8Dtb.jpg,1,bathtub,0.36866,False,golden_retriever,0.297402,True,tub,0.201711,False
687,684122891630342144,https://pbs.twimg.com/media/CX5-HslWQAIiXKB.jpg,1,cheetah,0.822193,False,Arabian_camel,0.046976,False,jaguar,0.025785,False


In [10]:
# Assess the retweet and favorite counts table visually
twit_like.sample(5)

Unnamed: 0,tweet_id,retweet_count,favorite_count
936,753294487569522689,1191,3758
2233,668171859951755264,208,525
2095,670733412878163972,558,1030
1759,678675843183484930,1680,3155
1538,689659372465688576,4412,11394


Assess each table programmatically with pandas methods (.head(), .describe(), .duplicated(), .info(), etc.)

In [11]:
# Assess each table programmatically (twitter-archive-enhanced)
twit_arch.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [12]:
# show statistical summary of table
twit_arch.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


In [13]:
# check number of duplicated rows
sum(twit_arch.duplicated())

0

In [14]:
# show number of missing values in each column
twit_arch.isna().sum()

tweet_id                         0
in_reply_to_status_id         2278
in_reply_to_user_id           2278
timestamp                        0
source                           0
text                             0
retweeted_status_id           2175
retweeted_status_user_id      2175
retweeted_status_timestamp    2175
expanded_urls                   59
rating_numerator                 0
rating_denominator               0
name                             0
doggo                            0
floofer                          0
pupper                           0
puppo                            0
dtype: int64

In [15]:
# Assess each table programmatically (image-prediction)
image_pred.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [16]:
# show statistical summary of table
image_pred.describe()

Unnamed: 0,tweet_id,img_num,p1_conf,p2_conf,p3_conf
count,2075.0,2075.0,2075.0,2075.0,2075.0
mean,7.384514e+17,1.203855,0.594548,0.1345886,0.06032417
std,6.785203e+16,0.561875,0.271174,0.1006657,0.05090593
min,6.660209e+17,1.0,0.044333,1.0113e-08,1.74017e-10
25%,6.764835e+17,1.0,0.364412,0.05388625,0.0162224
50%,7.119988e+17,1.0,0.58823,0.118181,0.0494438
75%,7.932034e+17,1.0,0.843855,0.1955655,0.09180755
max,8.924206e+17,4.0,1.0,0.488014,0.273419


In [17]:
# check number of duplicated rows
sum(image_pred.duplicated())

0

In [18]:
# Assess each table programmatically (tweet-json)
twit_like.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   tweet_id        2354 non-null   int64
 1   retweet_count   2354 non-null   int64
 2   favorite_count  2354 non-null   int64
dtypes: int64(3)
memory usage: 55.3 KB


In [19]:
# show statistical summary of table
twit_like.describe()

Unnamed: 0,tweet_id,retweet_count,favorite_count
count,2354.0,2354.0,2354.0
mean,7.426978e+17,3164.797366,8080.968564
std,6.852812e+16,5284.770364,11814.771334
min,6.660209e+17,0.0,0.0
25%,6.783975e+17,624.5,1415.0
50%,7.194596e+17,1473.5,3603.5
75%,7.993058e+17,3652.0,10122.25
max,8.924206e+17,79515.0,132810.0


In [20]:
# check number of duplicated rows
sum(twit_like.duplicated())

0

### Quality issues
1. twitter-archive-enhanced table: Source column is not in a proper format

2. twitter-archive-enhanced table: Columns that are not relevant to analysis are present

3. twitter-archive-enhanced table: incorrect datatype for timestamp column

4. image-prediction table: missing records leading to incomplete dataset (2075 out of 2356)

5. twitter-archive-enhanced table: Null values represented as None in name column

6. image-prediction table: some values in the p1, p2 and p3 columns begin with a lowercase letter

7. twitter-archive-enhanced table: maximum and minimum value for rating_denominator column are 170 and 0 instead of 10

8. image-prediction table: p1, p2 and p3 column values contain underscores (_)

### Tidiness issues
1. twitter-archive-enhanced table: Dog stages are divided into columns 

2. image-prediction table: names of columns are not explanatory

## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

In [21]:
# Make copies of original pieces of data
twit_arch_cp = twit_arch.copy()
image_pred_cp = image_pred.copy()
twit_like_cp = twit_like.copy()

In [22]:
twit_arch_cp.sample()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
2032,671763349865160704,,,2015-12-01 18:50:38 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Say hello to Mark. He's a good dog. Always rea...,,,,https://twitter.com/dog_rates/status/671763349...,9,10,Mark,,,,


### Issue #1: twitter-archive-enhanced table: Columns that are not relevant to analysis are present

#### Define: Drop columns that are not needed for analysis

#### Code

In [23]:
# define the columns to drop
cols = ['in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp']
# drop the columns
twit_arch_cp.drop(cols, axis = 1, inplace = True)

#### Test

In [24]:
# check if the columns are dropped
twit_arch_cp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   tweet_id            2356 non-null   int64 
 1   timestamp           2356 non-null   object
 2   source              2356 non-null   object
 3   text                2356 non-null   object
 4   expanded_urls       2297 non-null   object
 5   rating_numerator    2356 non-null   int64 
 6   rating_denominator  2356 non-null   int64 
 7   name                2356 non-null   object
 8   doggo               2356 non-null   object
 9   floofer             2356 non-null   object
 10  pupper              2356 non-null   object
 11  puppo               2356 non-null   object
dtypes: int64(3), object(9)
memory usage: 221.0+ KB


### Issue #2: twitter-archive-enhanced table: incorrect datatype for timestamp column

#### Define: Convert the timestamp column from object data type to datetime data type

#### Code

In [25]:
# use pd.to_datetime function
twit_arch_cp['timestamp'] = pd.to_datetime(twit_arch_cp['timestamp'])

#### Test

In [26]:
# confirm if the change has been made
twit_arch_cp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype              
---  ------              --------------  -----              
 0   tweet_id            2356 non-null   int64              
 1   timestamp           2356 non-null   datetime64[ns, UTC]
 2   source              2356 non-null   object             
 3   text                2356 non-null   object             
 4   expanded_urls       2297 non-null   object             
 5   rating_numerator    2356 non-null   int64              
 6   rating_denominator  2356 non-null   int64              
 7   name                2356 non-null   object             
 8   doggo               2356 non-null   object             
 9   floofer             2356 non-null   object             
 10  pupper              2356 non-null   object             
 11  puppo               2356 non-null   object             
dtypes: datetime64[ns, UTC](1), int64(3

### Issue #3: twitter-archive-enhanced table: Null values represented as None and 'a' in name column

#### Define: convert None and 'a' in name column to NaN values with np.nan

#### Code

In [27]:
# check the values in name column
twit_arch_cp.name.value_counts()

None          745
a              55
Charlie        12
Cooper         11
Lucy           11
             ... 
Dex             1
Ace             1
Tayzie          1
Grizzie         1
Christoper      1
Name: name, Length: 957, dtype: int64

In [28]:
# replace the 'None' and 'a' values with NaN
# (assign the result back instead of calling inplace on a column selection)
twit_arch_cp['name'] = twit_arch_cp['name'].replace(['None', 'a'], np.nan)



#### Test

In [29]:
# confirm if the change has been made
twit_arch_cp.name.value_counts()

Charlie       12
Cooper        11
Lucy          11
Oliver        11
Tucker        10
              ..
Aqua           1
Chase          1
Meatball       1
Rorie          1
Christoper     1
Name: name, Length: 955, dtype: int64
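
The output above only confirms that `None` and `a` are gone. A broader check, sketched below on made-up values, treats any all-lowercase entry as "not a name". Whether the real column contains other lowercase words is an assumption here; the sample values are invented for illustration.

```python
import pandas as pd

# Made-up name values; 'the' stands in for any other lowercase non-name.
names = pd.Series(['Charlie', 'a', 'None', 'the', 'Lucy'])

# Flag the 'None' placeholder and anything written entirely in lowercase.
not_a_name = (names == 'None') | names.str.islower()

# Series.mask replaces the flagged entries with NaN and keeps the rest.
cleaned = names.mask(not_a_name)
```

Inspecting `names[not_a_name].value_counts()` first is a cheap way to see what the rule would remove before applying it.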

### Issue #4: **image-prediction table**: some values in the p1, p2 and p3 columns begin with a lowercase letter

#### Define: Change the starting characters in the columns to uppercase

#### Code

In [30]:
# capitalize the first character of each word (str.title also capitalizes after delimiters such as '_' and '-')
image_pred_cp['p1'] = image_pred_cp['p1'].str.title()
image_pred_cp['p2'] = image_pred_cp['p2'].str.title()
image_pred_cp['p3'] = image_pred_cp['p3'].str.title()

#### Test

In [31]:
# check if the change has been made
image_pred_cp.tail()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
2070,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,Basset,0.555712,True,English_Springer,0.22577,True,German_Short-Haired_Pointer,0.175219,True
2071,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,Paper_Towel,0.170278,False,Labrador_Retriever,0.168086,True,Spatula,0.040836,False
2072,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,Chihuahua,0.716012,True,Malamute,0.078253,True,Kelpie,0.031379,True
2073,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,Chihuahua,0.323581,True,Pekinese,0.090647,True,Papillon,0.068957,True
2074,892420643555336193,https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg,1,Orange,0.097049,False,Bagel,0.085851,False,Banana,0.07611,False


### Issue #5: **image-prediction table**: p1, p2 and p3 column values contain underscores (_)

#### Define: Replace each underscore (_) with a space (' ') in the values of the columns

#### Code

In [32]:
# replace underscore with white space
image_pred_cp['p1'] = image_pred_cp['p1'].str.replace('_', ' ')
image_pred_cp['p2'] = image_pred_cp['p2'].str.replace('_', ' ')
image_pred_cp['p3'] = image_pred_cp['p3'].str.replace('_', ' ')

#### Test

In [33]:
# confirm if the change has been made
image_pred_cp.sample(5)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
466,675015141583413248,https://pbs.twimg.com/media/CV4iqh5WcAEV1E6.jpg,1,Street Sign,0.290091,False,Golden Retriever,0.258372,True,Sandbar,0.132173,False
449,674737130913071104,https://pbs.twimg.com/media/CV0l10AU8AAfg-a.jpg,1,Pomeranian,0.948537,True,Schipperke,0.01431,True,Chihuahua,0.00812,True
262,670789397210615808,https://pbs.twimg.com/media/CU8fZSQWoAEVp6O.jpg,1,Beagle,0.295966,True,Basset,0.143527,True,Bluetick,0.138992,True
878,698355670425473025,https://pbs.twimg.com/media/CbEOxQXW0AEIYBu.jpg,1,Pug,0.990191,True,Pekinese,0.002799,True,Sunglasses,0.00131,False
545,677314812125323265,https://pbs.twimg.com/media/CWZOOIUW4AAQrX_.jpg,2,Blenheim Spaniel,0.924127,True,Japanese Spaniel,0.05479,True,Chihuahua,0.008204,True
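
Issues #4 and #5 touch the same three columns, so the two steps can also be chained in a single loop. The sketch below uses made-up prediction labels in the original column layout and is only one way to write it.

```python
import pandas as pd

# Made-up labels in the raw style of the p1, p2 and p3 columns.
preds = pd.DataFrame({
    'p1': ['golden_retriever', 'paper_towel'],
    'p2': ['Labrador_retriever', 'seat_belt'],
    'p3': ['German_short-haired_pointer', 'spatula'],
})

for col in ['p1', 'p2', 'p3']:
    # Swap underscores for spaces, then capitalize each word.
    preds[col] = preds[col].str.replace('_', ' ', regex=False).str.title()
```

Passing `regex=False` makes it explicit that the underscore is a literal character rather than a pattern.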


### Issue #6: **twitter-archive-enhanced table**: maximum and minimum value for rating_denominator column are 170 and 0 instead of 10

#### Define: Replace every denominator that is not 10 (the values range from 0 to 170) with 10

#### Code

In [34]:
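# replace any denominator value less than 10 with 10
# NOTE: the boolean mask is applied without a column label, so every column of the matching rows is set to 10, not only rating_denominator
twit_arch_cp[twit_arch_cp['rating_denominator'] < 10] = 10
# replace any denominator value greater than 10 with 10 (same caveat as above)
twit_arch_cp[twit_arch_cp['rating_denominator'] > 10] = 10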

#### Test

In [35]:
# check for correction
twit_arch_cp['rating_denominator'].describe()

count    2356.0
mean       10.0
std         0.0
min        10.0
25%        10.0
50%        10.0
75%        10.0
max        10.0
Name: rating_denominator, dtype: float64
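
Assigning through a boolean mask without naming a column writes the value into every column of the selected rows. A minimal sketch of a column-targeted alternative, using `.loc` on made-up rows, is shown here; it is an illustration of the technique, not the code that produced the outputs in this notebook.

```python
import pandas as pd

# Made-up rows: two of the three denominators are off the usual scale of 10.
ratings = pd.DataFrame({
    'tweet_id': [101, 102, 103],
    'rating_denominator': [10, 0, 170],
    'name': ['Champ', 'Mark', 'Bailey'],
})

# Select the rows by mask and the single column by label.
off_scale = ratings['rating_denominator'] != 10
ratings.loc[off_scale, 'rating_denominator'] = 10
```

With the column label in place, `tweet_id` and `name` keep their original values, so no placeholder rows or artificial duplicates are created.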

### Issue #7: twitter-archive-enhanced table: Source column is not in a proper format

#### Define: Extract tweet source from long complicated string

#### Code

In [36]:
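# extract the tweet source
# keep the fifth '/'-separated piece of the anchor tag (rows with fewer pieces become NaN)
twit_arch_cp['source'] = twit_arch_cp['source'].str.split("/").str[4]
# keep the text that follows the closing '>' of the opening tag
twit_arch_cp['source'] = twit_arch_cp['source'].str.split('>').str[1]
# replace the leftover '<' of the closing tag with a space
twit_arch_cp['source'] = twit_arch_cp['source'].str.replace('<', " ")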

#### Test

In [37]:
# confirm if the change has been made
twit_arch_cp['source'].value_counts()

Twitter for iPhone     2198
TweetDeck                11
Name: source, dtype: int64
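
The two counts above add up to 2,209 of the 2,356 rows, so some rows end up without a source label. One possible cause is that splitting on `/` depends on how many path segments the link contains. A sketch that reads the anchor text directly with a regular expression is shown below; both sample strings are assumed shapes of the `source` value, written out for illustration.

```python
import pandas as pd

# Assumed shapes of the source column: anchor tags with different URL depths.
source = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
])

# Capture the text between the opening tag's '>' and the closing '</a>'.
label = source.str.extract(r'>([^<]+)</a>', expand=False)
```

Because the pattern looks only at the text inside the tag, the result does not depend on the URL, and no trailing space is left behind.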

### Issue #8: **twitter-archive-enhanced table**: Dog stages are divided into columns

#### Define: Convert the different dog stages columns into a single column containing the value of dog stage

#### Code

In [38]:
# replace 'None' strings with NaN
dog = ["doggo","floofer","pupper","puppo"]
twit_arch_cp[dog] = twit_arch_cp[dog].replace({'None':np.nan})

In [39]:
# create a new column to represent dog stages by merging all relevant columns
twit_arch_cp['dog_stage'] = twit_arch_cp[twit_arch_cp.columns[8:]].apply(lambda x: ','.join(x.dropna().astype(str)), axis=1)

In [40]:
# replace empty strings with NaN in the newly created column
twit_arch_cp['dog_stage'] = twit_arch_cp['dog_stage'].replace('',np.nan)
# drop the original stage columns after merging
twit_arch_cp.drop(["doggo","floofer","pupper","puppo"], axis=1, inplace=True)

In [41]:
# check the values of the new column
twit_arch_cp.dog_stage.value_counts()

pupper           245
doggo             83
puppo             29
10,10,10,10       23
doggo,pupper      12
floofer            9
doggo,puppo        1
doggo,floofer      1
Name: dog_stage, dtype: int64

The value 10,10,10,10 is present, which is not a valid dog stage. It comes from the rows that were overwritten with 10 in every column while fixing the denominators in Issue #6. Let's check for duplicates in the dataset.

In [42]:
sum(twit_arch_cp.duplicated())

22

In [43]:
# drop duplicated rows
twit_arch_cp.drop_duplicates(inplace = True)
# check whether the value 10,10,10,10 is still present in the new column
twit_arch_cp.dog_stage.value_counts()

pupper           245
doggo             83
puppo             29
doggo,pupper      12
floofer            9
doggo,puppo        1
doggo,floofer      1
10,10,10,10        1
Name: dog_stage, dtype: int64

In [44]:
# replace the value with NaN
twit_arch_cp['dog_stage'] = twit_arch_cp['dog_stage'].replace('10,10,10,10', np.nan)

In [45]:
twit_arch_cp.dog_stage.value_counts()

pupper           245
doggo             83
puppo             29
doggo,pupper      12
floofer            9
doggo,puppo        1
doggo,floofer      1
Name: dog_stage, dtype: int64

#### Test

In [46]:
twit_arch_cp.sample(7)

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,dog_stage
1084,738402415918125056,2016-06-02 16:10:29+00:00,Twitter for iPhone,"""Don't talk to me or my son ever again"" ...10/...",https://twitter.com/dog_rates/status/738402415...,10,10,,
1408,699072405256409088,2016-02-15 03:27:04+00:00,Twitter for iPhone,ERMAHGERD 12/10 please enjoy https://t.co/7WrA...,https://twitter.com/dog_rates/status/699072405...,12,10,,
1510,691444869282295808,2016-01-25 02:17:57+00:00,Twitter for iPhone,This is Bailey. She likes flowers. 12/10 https...,https://twitter.com/dog_rates/status/691444869...,12,10,Bailey,
1458,695074328191332352,2016-02-04 02:40:08+00:00,Twitter for iPhone,This is Lorenzo. He's educated af. Just gradua...,https://twitter.com/dog_rates/status/695074328...,11,10,Lorenzo,pupper
646,793150605191548928,2016-10-31 18:00:14+00:00,Twitter for iPhone,This is Nida. She's a free elf. Waited so long...,https://twitter.com/dog_rates/status/793150605...,11,10,Nida,
2089,670789397210615808,2015-11-29 02:20:29+00:00,Twitter for iPhone,Two obedient dogs here. Left one has extra leg...,https://twitter.com/dog_rates/status/670789397...,9,10,,
2189,668967877119254528,2015-11-24 01:42:25+00:00,Twitter for iPhone,12/10 good shit Bubka\n@wane15,,12,10,,
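
The stage-merging steps above can be condensed by selecting the four stage columns by name rather than by position. The following is a minimal sketch on made-up rows; `stage_cols` is a hypothetical variable name.

```python
import numpy as np
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Made-up rows: one dog with two stages, one with a single stage, one with none.
dogs = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'doggo': ['doggo', 'None', 'None'],
    'floofer': ['None', 'None', 'None'],
    'pupper': ['pupper', 'pupper', 'None'],
    'puppo': ['None', 'None', 'None'],
})

# Turn the 'None' placeholders into real missing values.
stages = dogs[stage_cols].replace('None', np.nan)

# Join whatever stages remain in each row; rows with no stage become NaN.
dogs['dog_stage'] = (
    stages.apply(lambda row: ','.join(row.dropna()), axis=1)
          .replace('', np.nan)
)

# The four wide columns are no longer needed.
dogs = dogs.drop(columns=stage_cols)
```

Selecting by name keeps the step working even if columns are added or reordered before it runs.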


### Issue #9: image-prediction table: names of columns are not explanatory

#### Define: Rename the columns in image-prediction dataset 

#### Code

In [47]:
image_pred_cp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [48]:
# rename the vague column names
image_pred_cp.rename(columns = {'jpg_url':'image_url', 'p1':'first_prediction',
                                'p2':'second_prediction', 'p3':'third_prediction',
                               'p1_dog':'first_is_dog', 'p2_dog':'second_is_dog',
                               'p3_dog':'third_is_dog', 'p1_conf':'first_pred_score',
                               'p2_conf':'second_pred_score', 'p3_conf':'third_pred_score'}, inplace = True)

#### Test

In [49]:
image_pred_cp.sample()

Unnamed: 0,tweet_id,image_url,img_num,first_prediction,first_pred_score,first_is_dog,second_prediction,second_pred_score,second_is_dog,third_prediction,third_pred_score,third_is_dog
310,671542985629241344,https://pbs.twimg.com/media/CVHMyHMWwAALYXs.jpg,1,Shetland Sheepdog,0.980339,True,Collie,0.006693,True,Papillon,0.006157,True


### Issue #10: **image-prediction table**: missing records leading to incomplete dataset (2075 out of 2356)

#### Define: Merge all the tables (image-prediction, tweet-json, twitter-archive-enhanced) with inner joins so that rows without a record in every table are dropped

#### Code

In [50]:
# merge twit_arch_cp with image_pred_cp
twit_image_merged = pd.merge(twit_arch_cp, image_pred_cp, on='tweet_id', how='inner')

In [52]:
# merge twit_image_merged with twit_like_cp
all_df_merged = pd.merge(twit_image_merged, twit_like_cp, on = 'tweet_id', how = 'inner')
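
Before relying on the inner joins, it can help to see which rows they will drop and to guard against duplicated keys. The sketch below uses made-up frames; `indicator=True` and `validate='one_to_one'` are standard `merge` parameters, while the frame and column values are invented for illustration.

```python
import pandas as pd

# Made-up stand-ins for the three cleaned tables.
archive = pd.DataFrame({'tweet_id': [1, 2, 3], 'rating_numerator': [12, 10, 13]})
images = pd.DataFrame({'tweet_id': [1, 3], 'image_url': ['u1', 'u3']})
counts = pd.DataFrame({'tweet_id': [1, 2, 3], 'retweet_count': [5, 0, 9]})

# A left join with an indicator column shows which archive rows have no image.
check = archive.merge(images, on='tweet_id', how='left', indicator=True)
missing_images = check.loc[check['_merge'] == 'left_only', 'tweet_id'].tolist()

# validate raises an error if tweet_id is duplicated on either side.
master = (
    archive
    .merge(images, on='tweet_id', how='inner', validate='one_to_one')
    .merge(counts, on='tweet_id', how='inner', validate='one_to_one')
)
```

If duplicated `tweet_id` values slip into one of the tables, the `validate` check fails loudly instead of silently multiplying rows.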

#### Test

In [None]:
twit_image_merged.sample(5)

In [None]:
all_df_merged.sample(5)

In [55]:
all_df_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2055 entries, 0 to 2054
Data columns (total 22 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   tweet_id            2055 non-null   int64  
 1   timestamp           2055 non-null   object 
 2   source              2025 non-null   object 
 3   text                2055 non-null   object 
 4   expanded_urls       2055 non-null   object 
 5   rating_numerator    2055 non-null   int64  
 6   rating_denominator  2055 non-null   int64  
 7   name                1436 non-null   object 
 8   dog_stage           320 non-null    object 
 9   image_url           2055 non-null   object 
 10  img_num             2055 non-null   int64  
 11  first_prediction    2055 non-null   object 
 12  first_pred_score    2055 non-null   float64
 13  first_is_dog        2055 non-null   bool   
 14  second_prediction   2055 non-null   object 
 15  second_pred_score   2055 non-null   float64
 16  second

## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".
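
A minimal sketch of the save step follows, assuming the master dataframe is written without its index. The one-row frame and the temporary directory are stand-ins for illustration; in the notebook the merged dataframe would be written to the working directory instead.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the merged master dataframe (one made-up rating).
master = pd.DataFrame({'tweet_id': [892420643555336193], 'rating_numerator': [13]})

# Write to a temporary directory so the sketch leaves the project folder untouched.
out_path = os.path.join(tempfile.mkdtemp(), 'twitter_archive_master.csv')
master.to_csv(out_path, index=False)

# Read the file back to confirm the 18-digit ids survive the round trip.
restored = pd.read_csv(out_path)
```

Passing `index=False` avoids an extra unnamed index column appearing when the file is read back.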

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

### Insights:
1.

2.

3.

### Visualization