## Introduction

<p>This project wrangles (gathers, assesses, and cleans) real-world data from a range of sources and in a variety of formats, then analyzes and visualizes it using Python, its libraries, and/or SQL.</p>

<p>The dataset to be wrangled (and analyzed and visualized) "is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs. WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10. 11/10, 12/10, 13/10, etc. Why? Because "they're good dogs Brent."" - Udacity Project Overview.</p>

## Table of Contents
<ul>
<li><a href="#gathering">Data Gathering</a></li>
<li><a href="#assessment">Data Assessment</a></li>
<li><a href="#cleaning">Data Cleaning</a></li>
<li><a href="#storage">Data Storage</a></li>
<li><a href="#analysis">Analyses and Visualization</a></li>
</ul>

In [16]:
#importing all necessary libraries to complete this project
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import json
import seaborn as sns
import os
import requests
import re
from functools import reduce
import tweepy
from tweepy import OAuthHandler
from timeit import default_timer as timer
%matplotlib inline

<a id = 'gathering'></a>
## Data Gathering

The first table (twitter-archive-enhanced.csv) was downloaded manually and is loaded into a pandas DataFrame programmatically.

In [17]:
#load the 'twitter-archive-enhanced.csv' table into a pandas data frame
twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')

The second table is downloaded programmatically from Udacity's server with the requests library, written to a local folder (image-predictions), and then loaded into a pandas DataFrame.

In [18]:
#create a folder called 'image-predictions' if the folder does not exist already
folder_name = 'image-predictions'
if not os.path.exists(folder_name):
    os.mkdir(folder_name)

In [19]:
#get the image-predictions data from its URL using the requests library
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
#stop here if the download failed, rather than writing an error page to disk
response.raise_for_status()
#write the content of the response to image-predictions/image-predictions.tsv
with open(os.path.join(folder_name, url.split('/')[-1]), mode='wb') as file:
    file.write(response.content)
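
The folder-creation and file-naming steps above could be folded into one small helper. This is only a sketch: `local_path_for` is a hypothetical name that does not appear in the notebook, and it assumes the file name is the last segment of the URL path, as `url.split('/')[-1]` does.

```python
import os
from urllib.parse import urlparse


def local_path_for(url, folder):
    """Return the path inside `folder` where the file behind `url` would be saved.

    Hypothetical helper: creates the folder if needed and names the file
    after the last segment of the URL path.
    """
    # exist_ok=True makes a separate os.path.exists() check unnecessary
    os.makedirs(folder, exist_ok=True)
    # urlparse(...).path drops any query string before the name is taken
    file_name = os.path.basename(urlparse(url).path)
    return os.path.join(folder, file_name)
```

The bytes of a successful response could then be written with `open(local_path_for(url, folder_name), mode='wb')`.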

In [20]:
#load the image-predictions.tsv file into a pandas data frame
image_predictions = pd.read_csv('image-predictions/image-predictions.tsv', sep='\t')

The third table was downloaded from the internet and saved locally as 'tweet-json.txt'. It is read line by line into a Python list and then loaded into a pandas DataFrame.

In [21]:
#read tweet-json.txt line by line, extract 'id_str', 'retweet_count', and 'favorite_count' from each tweet, and store them in a list called df_list
df_list = []
with open('tweet-json.txt', encoding='utf-8') as file:
    for line in file:
        data = json.loads(line)
        id_str = data.get('id_str')
        retweet_count = data.get('retweet_count')
        favorite_count = data.get('favorite_count')
        df_list.append({
            'id_str': id_str, 
            'retweet_count': retweet_count, 
            'favorite_count': favorite_count 
        })


In [22]:
#load df_list into a pandas data frame
tweet_data = pd.DataFrame(df_list, columns=['id_str', 'retweet_count', 'favorite_count'])
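
The line-by-line parsing can be tried without the real file. The sketch below uses two made-up records held in an in-memory buffer. The names `sample_file`, `wanted`, and `sample_tweets` are illustrative only, and the extra `lang` key is there just to show that unwanted keys are ignored.

```python
import io
import json

import pandas as pd

# Two made-up records standing in for lines of tweet-json.txt (illustrative data only)
sample_file = io.StringIO(
    '{"id_str": "1", "retweet_count": 5, "favorite_count": 9, "lang": "en"}\n'
    '{"id_str": "2", "retweet_count": 0, "favorite_count": 3}\n'
)

wanted = ['id_str', 'retweet_count', 'favorite_count']

# Parse each line as its own JSON document and keep only the wanted keys;
# dict.get returns None when a key is absent instead of raising KeyError
records = [
    {key: tweet.get(key) for key in wanted}
    for tweet in (json.loads(line) for line in sample_file)
]

sample_tweets = pd.DataFrame(records, columns=wanted)
```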

<a id = 'assessment'></a>
## Data Assessment

#### Visual Assessment

In [23]:
#displays first 25 observations
twitter_archive.head(25)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
5,891087950875897856,,,2017-07-29 00:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a majestic great white breaching ...,,,,https://twitter.com/dog_rates/status/891087950...,13,10,,,,,
6,890971913173991426,,,2017-07-28 16:27:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Jax. He enjoys ice cream so much he gets ...,,,,"https://gofundme.com/ydvmve-surgery-for-jax,ht...",13,10,Jax,,,,
7,890729181411237888,,,2017-07-28 00:22:40 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you watch your owner call another dog a g...,,,,https://twitter.com/dog_rates/status/890729181...,13,10,,,,,
8,890609185150312448,,,2017-07-27 16:25:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Zoey. She doesn't want to be one of th...,,,,https://twitter.com/dog_rates/status/890609185...,13,10,Zoey,,,,
9,890240255349198849,,,2017-07-26 15:59:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Cassie. She is a college pup. Studying...,,,,https://twitter.com/dog_rates/status/890240255...,14,10,Cassie,doggo,,,


In [24]:
#displays 25 random observations from the table
twitter_archive.sample(25)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
2331,666353288456101888,,,2015-11-16 20:32:58 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a mixed Asiago from the Galápagos...,,,,https://twitter.com/dog_rates/status/666353288...,8,10,,,,,
1870,675149409102012420,,,2015-12-11 03:05:37 +0000,"<a href=""http://twitter.com/download/iphone"" r...",holy shit 12/10 https://t.co/p6O8X93bTQ,,,,https://twitter.com/dog_rates/status/675149409...,12,10,,,,,
862,762699858130116608,,,2016-08-08 17:19:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Leela. She's a Fetty Woof. Lost eye wh...,,,,https://twitter.com/dog_rates/status/762699858...,11,10,Leela,,,,
891,759557299618865152,,,2016-07-31 01:12:26 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Emma. She can't believe her last guess...,,,,https://twitter.com/dog_rates/status/759557299...,10,10,Emma,,,,
761,778286810187399168,,,2016-09-20 17:36:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Stanley. He has too much skin. Isn't h...,,,,https://twitter.com/dog_rates/status/778286810...,11,10,Stanley,,,,
1529,690248561355657216,,,2016-01-21 19:04:15 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Maxwell. That's his moped. He rents it...,,,,https://twitter.com/dog_rates/status/690248561...,11,10,Maxwell,,,,
221,849776966551130114,,,2017-04-06 00:13:11 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Seriously guys? Again? We only rate dogs. Plea...,,,,https://twitter.com/dog_rates/status/849776966...,12,10,,,,,
1917,674291837063053312,,,2015-12-08 18:17:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Kenny. He just wants to be included in...,,,,https://twitter.com/dog_rates/status/674291837...,11,10,Kenny,,,,
1189,718246886998687744,,,2016-04-08 01:19:36 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Alexanderson. He's got a weird ass bir...,,,,https://twitter.com/dog_rates/status/718246886...,3,10,Alexanderson,,,,
344,832032802820481025,,,2017-02-16 01:04:13 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Miguel. He was the only remaining dogg...,,,,"https://www.petfinder.com/petdetail/34918210,h...",12,10,Miguel,doggo,,,


In [25]:
#displays last 25 observations on the table
twitter_archive.tail(25)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
2331,666353288456101888,,,2015-11-16 20:32:58 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a mixed Asiago from the Galápagos...,,,,https://twitter.com/dog_rates/status/666353288...,8,10,,,,,
2332,666345417576210432,,,2015-11-16 20:01:42 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Look at this jokester thinking seat belt laws ...,,,,https://twitter.com/dog_rates/status/666345417...,10,10,,,,,
2333,666337882303524864,,,2015-11-16 19:31:45 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is an extremely rare horned Parthenon. No...,,,,https://twitter.com/dog_rates/status/666337882...,9,10,an,,,,
2334,666293911632134144,,,2015-11-16 16:37:02 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a funny dog. Weird toes. Won't come do...,,,,https://twitter.com/dog_rates/status/666293911...,3,10,a,,,,
2335,666287406224695296,,,2015-11-16 16:11:11 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is an Albanian 3 1/2 legged Episcopalian...,,,,https://twitter.com/dog_rates/status/666287406...,1,2,an,,,,
2336,666273097616637952,,,2015-11-16 15:14:19 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Can take selfies 11/10 https://t.co/ws2AMaNwPW,,,,https://twitter.com/dog_rates/status/666273097...,11,10,,,,,
2337,666268910803644416,,,2015-11-16 14:57:41 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Very concerned about fellow dog trapped in com...,,,,https://twitter.com/dog_rates/status/666268910...,10,10,,,,,
2338,666104133288665088,,,2015-11-16 04:02:55 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Not familiar with this breed. No tail (weird)....,,,,https://twitter.com/dog_rates/status/666104133...,1,10,,,,,
2339,666102155909144576,,,2015-11-16 03:55:04 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Oh my. Here you are seeing an Adobe Setter giv...,,,,https://twitter.com/dog_rates/status/666102155...,11,10,,,,,
2340,666099513787052032,,,2015-11-16 03:44:34 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Can stand on stump for what seems like a while...,,,,https://twitter.com/dog_rates/status/666099513...,8,10,,,,,


In [26]:
#displays first 25 observations on the table
image_predictions.head(25)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True
5,666050758794694657,https://pbs.twimg.com/media/CT5Jof1WUAEuVxN.jpg,1,Bernese_mountain_dog,0.651137,True,English_springer,0.263788,True,Greater_Swiss_Mountain_dog,0.016199,True
6,666051853826850816,https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg,1,box_turtle,0.933012,False,mud_turtle,0.045885,False,terrapin,0.017885,False
7,666055525042405380,https://pbs.twimg.com/media/CT5N9tpXIAAifs1.jpg,1,chow,0.692517,True,Tibetan_mastiff,0.058279,True,fur_coat,0.054449,False
8,666057090499244032,https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg,1,shopping_cart,0.962465,False,shopping_basket,0.014594,False,golden_retriever,0.007959,True
9,666058600524156928,https://pbs.twimg.com/media/CT5Qw94XAAA_2dP.jpg,1,miniature_poodle,0.201493,True,komondor,0.192305,True,soft-coated_wheaten_terrier,0.082086,True


In [27]:
#displays 25 random observations from the table
image_predictions.sample(25)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
1308,753655901052166144,https://pbs.twimg.com/media/CnWGCpdWgAAWZTI.jpg,1,miniature_pinscher,0.456092,True,toy_terrier,0.153126,True,Italian_greyhound,0.144147,True
1765,826598365270007810,https://pbs.twimg.com/media/C3iq0EEXUAAdBYC.jpg,1,French_bulldog,0.628119,True,Siamese_cat,0.117397,False,cougar,0.082765,False
1689,815639385530101762,https://pbs.twimg.com/media/C1G7sXyWIAA10eH.jpg,1,German_shepherd,0.817953,True,Norwegian_elkhound,0.140007,True,malinois,0.024821,True
509,676101918813499392,https://pbs.twimg.com/media/CWH_FTgWIAAwOUy.jpg,1,Shih-Tzu,0.225848,True,Norfolk_terrier,0.186873,True,Irish_terrier,0.106987,True
1972,869596645499047938,https://pbs.twimg.com/media/DBFtiYqWAAAsjj1.jpg,1,Chihuahua,0.955156,True,toy_terrier,0.008054,True,muzzle,0.006296,False
1023,710283270106132480,https://pbs.twimg.com/media/Cdtu3WRUkAAsRVx.jpg,2,Shih-Tzu,0.932401,True,Lhasa,0.030806,True,Tibetan_terrier,0.008974,True
205,669923323644657664,https://pbs.twimg.com/media/CUwLtPeU8AAfAb2.jpg,1,car_mirror,0.343063,False,seat_belt,0.110289,False,wing,0.080148,False
931,703079050210877440,https://pbs.twimg.com/media/CcHWqQCW8AEb0ZH.jpg,2,Pembroke,0.778503,True,Shetland_sheepdog,0.093834,True,Cardigan,0.060296,True
139,668542336805281792,https://pbs.twimg.com/media/CUcjtL8WUAAAJoz.jpg,1,American_Staffordshire_terrier,0.267695,True,French_bulldog,0.25405,True,Staffordshire_bullterrier,0.212381,True
1659,811386762094317568,https://pbs.twimg.com/media/C0Kf9PtWQAEW4sE.jpg,1,Pembroke,0.804177,True,Cardigan,0.18989,True,beagle,0.001965,True


In [28]:
#displays the last 25 observations on the table
image_predictions.tail(25)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
2050,887343217045368832,https://pbs.twimg.com/ext_tw_video_thumb/88734...,1,Mexican_hairless,0.330741,True,sea_lion,0.275645,False,Weimaraner,0.134203,True
2051,887473957103951883,https://pbs.twimg.com/media/DFDw2tyUQAAAFke.jpg,2,Pembroke,0.809197,True,Rhodesian_ridgeback,0.05495,True,beagle,0.038915,True
2052,887517139158093824,https://pbs.twimg.com/ext_tw_video_thumb/88751...,1,limousine,0.130432,False,tow_truck,0.029175,False,shopping_cart,0.026321,False
2053,887705289381826560,https://pbs.twimg.com/media/DFHDQBbXgAEqY7t.jpg,1,basset,0.821664,True,redbone,0.087582,True,Weimaraner,0.026236,True
2054,888078434458587136,https://pbs.twimg.com/media/DFMWn56WsAAkA7B.jpg,1,French_bulldog,0.995026,True,pug,0.000932,True,bull_mastiff,0.000903,True
2055,888202515573088257,https://pbs.twimg.com/media/DFDw2tyUQAAAFke.jpg,2,Pembroke,0.809197,True,Rhodesian_ridgeback,0.05495,True,beagle,0.038915,True
2056,888554962724278272,https://pbs.twimg.com/media/DFTH_O-UQAACu20.jpg,3,Siberian_husky,0.700377,True,Eskimo_dog,0.166511,True,malamute,0.111411,True
2057,888804989199671297,https://pbs.twimg.com/media/DFWra-3VYAA2piG.jpg,1,golden_retriever,0.46976,True,Labrador_retriever,0.184172,True,English_setter,0.073482,True
2058,888917238123831296,https://pbs.twimg.com/media/DFYRgsOUQAARGhO.jpg,1,golden_retriever,0.714719,True,Tibetan_mastiff,0.120184,True,Labrador_retriever,0.105506,True
2059,889278841981685760,https://pbs.twimg.com/ext_tw_video_thumb/88927...,1,whippet,0.626152,True,borzoi,0.194742,True,Saluki,0.027351,True


In [29]:
#displays first 25 observations on the table.
tweet_data.head(25)

Unnamed: 0,id_str,retweet_count,favorite_count
0,892420643555336193,8853,39467
1,892177421306343426,6514,33819
2,891815181378084864,4328,25461
3,891689557279858688,8964,42908
4,891327558926688256,9774,41048
5,891087950875897856,3261,20562
6,890971913173991426,2158,12041
7,890729181411237888,16716,56848
8,890609185150312448,4429,28226
9,890240255349198849,7711,32467


In [30]:
#displays 25 random observations from the table.
tweet_data.sample(25)

Unnamed: 0,id_str,retweet_count,favorite_count
1458,695051054296211456,885,2918
1877,675015141583413248,1335,2918
412,822872901745569793,48265,132810
2063,671151324042559489,166,714
1296,707741517457260545,696,2718
2026,671866342182637568,548,1191
1023,746369468511756288,1854,6637
883,760153949710192640,38,0
1436,696900204696625153,1156,3492
2235,668142349051129856,306,592


In [31]:
#displays the last 25 observations on the table.
tweet_data.tail(25)

Unnamed: 0,id_str,retweet_count,favorite_count
2329,666353288456101888,77,229
2330,666345417576210432,146,307
2331,666337882303524864,96,204
2332,666293911632134144,368,522
2333,666287406224695296,71,152
2334,666273097616637952,82,184
2335,666268910803644416,37,108
2336,666104133288665088,6871,14765
2337,666102155909144576,16,81
2338,666099513787052032,73,164


#### Programmatic Assessment

In [32]:
#displays a summary information about the table, including numbers of columns, rows, and non-empty values, and the data type of each variable
twitter_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [33]:
#displays all duplicated observations
twitter_archive[twitter_archive.duplicated()]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo


In [34]:
#returns the number of occurrences of each value in the `source` column
twitter_archive['source'].value_counts()

<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     2221
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                          91
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                       33
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      11
Name: source, dtype: int64

In [35]:
#returns the number of occurrences of each value in the `name` column
twitter_archive['name'].value_counts()

None          745
a              55
Charlie        12
Cooper         11
Lucy           11
             ... 
Dex             1
Ace             1
Tayzie          1
Grizzie         1
Christoper      1
Name: name, Length: 957, dtype: int64

In [36]:
#returns 25 random values from the `name` column
twitter_archive['name'].sample(25)

1458     Lorenzo
2256      Calvin
1990    Leonidas
1791        None
1271       Billy
2111        Koda
995         None
1574        None
1177       Clyde
1610        None
440         None
1077        None
247         None
599         None
1808        None
2268        Dook
821      Jackson
2178        None
1236        Kane
1228        None
781         None
738         Koda
2160      Kollin
2181        None
1795       Tassy
Name: name, dtype: object

In [37]:
#returns the number of occurrences of each value in the `rating_numerator` column
twitter_archive['rating_numerator'].value_counts()

12      558
11      464
10      461
13      351
9       158
8       102
7        55
14       54
5        37
6        32
3        19
4        17
2         9
1         9
75        2
15        2
420       2
0         2
80        1
144       1
17        1
26        1
20        1
121       1
143       1
44        1
60        1
45        1
50        1
99        1
204       1
1776      1
165       1
666       1
27        1
182       1
24        1
960       1
84        1
88        1
Name: rating_numerator, dtype: int64

In [38]:
#returns the number of occurrences of each value in the `rating_denominator` column
twitter_archive['rating_denominator'].value_counts()

10     2333
11        3
50        3
20        2
80        2
70        1
7         1
15        1
150       1
170       1
0         1
90        1
40        1
130       1
110       1
16        1
120       1
2         1
Name: rating_denominator, dtype: int64

In [39]:
#displays a summary information about the table, including numbers of columns, rows, and non-empty values, and the data type of each variable
image_predictions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [40]:
#return the values of the `img_num` column sorted in an ascending order
image_predictions['img_num'].sort_values()

0       1
1295    1
1294    1
1293    1
1292    1
       ..
1978    4
1496    4
1768    4
1713    4
2040    4
Name: img_num, Length: 2075, dtype: int64

In [41]:
#returns 10 random samples of values from the `jpg_url` column
image_predictions['jpg_url'].sample(10)

637       https://pbs.twimg.com/media/CXRTw_5WMAAUDVp.jpg
588     https://pbs.twimg.com/ext_tw_video_thumb/67911...
1993      https://pbs.twimg.com/media/DCEeLxjXsAAvNSM.jpg
1391      https://pbs.twimg.com/media/CqQykxrWYAAlD8g.jpg
1283      https://pbs.twimg.com/media/CmoPdmHW8AAi8BI.jpg
1477      https://pbs.twimg.com/media/CtUMLzRXgAAbZK5.jpg
815     https://pbs.twimg.com/tweet_video_thumb/CZ0mhd...
1507      https://pbs.twimg.com/media/CucnLmeWAAALOSC.jpg
511       https://pbs.twimg.com/media/CWJQ4UmWoAIJ29t.jpg
1648      https://pbs.twimg.com/media/CzmSFlKUAAAQOjP.jpg
Name: jpg_url, dtype: object

In [42]:
#displays a summary information about the table, including numbers of columns, rows, and non-empty values, and the data type of each variable
tweet_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2354 entries, 0 to 2353
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   id_str          2354 non-null   object
 1   retweet_count   2354 non-null   int64 
 2   favorite_count  2354 non-null   int64 
dtypes: int64(2), object(1)
memory usage: 55.3+ KB


In [43]:
#returns 10 random observations from the table.
tweet_data.sample(10)

Unnamed: 0,id_str,retweet_count,favorite_count
121,869227993411051520,4023,21112
1616,684969860808454144,421,2374
1470,693942351086120961,413,1896
1333,705239209544720384,854,3290
2247,667861340749471744,86,253
520,809920764300447744,4521,17250
2277,667405339315146752,234,489
386,826598799820865537,292,5637
805,772102971039580160,1065,4448
2210,668587383441514497,1174,1760


In [44]:
#returns the values of the `retweet_count` column sorted in an ascending order
tweet_data['retweet_count'].sort_values()

290         0
1293        2
273         3
341         3
112         3
        ...  
816     52360
1077    52360
259     56625
533     56625
1037    79515
Name: retweet_count, Length: 2354, dtype: int64

In [45]:
#returns the values of the `favorite_count` column sorted in an ascending order
tweet_data['favorite_count'].sort_values()

484          0
585          0
164          0
588          0
909          0
         ...  
134     106827
533     107015
65      107956
1037    131075
412     132810
Name: favorite_count, Length: 2354, dtype: int64

### Summary of Assessment
#### Quality
##### `twitter_archive` table
* Some entries are retweets and replies rather than original tweets.
* The `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id`, and `retweeted_status_timestamp` columns are unnecessary for the analysis of original tweets.
* The `source` column contains HTML formatting.
* The `+0000` offset in `timestamp` is redundant information.
* The `tweet_id` and `timestamp` columns have erroneous data types.
* The `floofer` column should be named `floof`, and likewise its values.

##### `image_predictions` table
* Column labels are unclear.
* Values in `p1`, `p2`, and `p3` are inconsistently formatted: some start with an uppercase letter and others with a lowercase one, and underscores are used in place of spaces in most, but not all, multi-word values.
* `tweet_id` is stored as an integer (`int64`), not a string.
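
One way the inconsistent labels in `p1`, `p2`, and `p3` might be normalised is sketched below on made-up rows. The choice of lowercase with spaces is an assumption for illustration, since the notebook does not say which convention it settles on, and `sample_predictions` is a hypothetical stand-in for the real table.

```python
import pandas as pd

# Made-up rows mimicking the p1/p2/p3 columns of image_predictions
sample_predictions = pd.DataFrame({
    'p1': ['Welsh_springer_spaniel', 'redbone'],
    'p2': ['collie', 'miniature_pinscher'],
    'p3': ['Shetland_sheepdog', 'Rhodesian_ridgeback'],
})

for column in ['p1', 'p2', 'p3']:
    # Replace underscores with spaces, then lowercase the whole label
    sample_predictions[column] = (
        sample_predictions[column]
        .str.replace('_', ' ', regex=False)
        .str.lower()
    )
```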

##### `tweet_data` table
* `id_str` variable should be named `tweet_id` instead, to be consistent with the other tables.
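
A shared key name matters because the tables are eventually combined on it. The sketch below uses tiny made-up tables (`archive`, `predictions`, and `counts` are hypothetical stand-ins) to show a rename followed by a pairwise merge with `functools.reduce`. The inner join and the conversion of every key to `str` are assumptions made for the illustration.

```python
from functools import reduce

import pandas as pd

# Tiny made-up stand-ins for the three tables (illustrative values only)
archive = pd.DataFrame({'tweet_id': [1, 2, 3], 'rating_numerator': [13, 12, 11]})
predictions = pd.DataFrame({'tweet_id': [1, 2], 'p1': ['redbone', 'chow']})
counts = pd.DataFrame({'id_str': ['1', '2', '3'], 'retweet_count': [10, 20, 30]})

# Give the key the same name in every table
counts = counts.rename(columns={'id_str': 'tweet_id'})

# Give the key the same dtype in every table, since mismatched key dtypes
# typically make the merge fail
tables = [
    table.assign(tweet_id=table['tweet_id'].astype(str))
    for table in (archive, predictions, counts)
]

# Inner-join the tables pairwise on the shared key
combined = reduce(
    lambda left, right: pd.merge(left, right, on='tweet_id', how='inner'),
    tables,
)
```

With an inner join, only tweets present in all three tables survive, so `combined` here keeps the two tweets that have an image prediction.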

#### Tidiness
* One variable 'dog stage' in four columns (doggo, floofer, pupper, puppo) in `twitter_archive` table.
* All three tables should be merged into one table.
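
The first tidiness issue, one variable spread over four columns, could be addressed along the following lines. This is a sketch under stated assumptions: `collapse_stages` and the `dog_stage` column name are hypothetical, missing stages are assumed to be stored as the string `'None'`, and joining multiple stages with a hyphen is just one possible convention.

```python
import pandas as pd

STAGE_COLUMNS = ['doggo', 'floofer', 'pupper', 'puppo']


def collapse_stages(df, columns=STAGE_COLUMNS, missing='None'):
    """Return a copy of `df` with the stage columns replaced by one `dog_stage` column."""

    def join_row(row):
        # Keep only real stage labels; a tweet may mention more than one
        found = [value for value in row if pd.notna(value) and value != missing]
        return '-'.join(found) if found else None

    result = df.copy()
    result['dog_stage'] = result[columns].apply(join_row, axis=1)
    return result.drop(columns=columns)


# Made-up rows for illustration
sample = pd.DataFrame({
    'tweet_id': ['1', '2', '3'],
    'doggo': ['doggo', 'None', 'doggo'],
    'floofer': ['None', 'None', 'None'],
    'pupper': ['None', 'None', 'pupper'],
    'puppo': ['None', 'None', 'None'],
})
tidy = collapse_stages(sample)
```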

<a id = 'cleaning'></a>
## Data Cleaning 
Next, I clean the data for analysis by removing or correcting values that are erroneous, incomplete, irrelevant, redundant, or improperly formatted. Left in place, such data can slow down the analysis or produce inaccurate results.

#### Quality

In [46]:
#makes copies of the three dataframes
twit_archive_clean = twitter_archive.copy()
image_pred_clean = image_predictions.copy()
tweet_data_clean = tweet_data.copy()

##### 1. `twitter_archive` table: Some entries are retweets and replies.

__Define__

Collect the indices of retweets and replies, i.e. the rows with non-empty `retweeted_status_id` and `in_reply_to_status_id` values respectively, and drop those rows using the `drop()` function.

__Code__

In [47]:
#get lists of the indices of rows with retweets and replies and store in `retweet_index` and `reply_index` variables respectively
retweet_index = twit_archive_clean[twit_archive_clean['retweeted_status_id'].notnull()].index
reply_index = twit_archive_clean[twit_archive_clean['in_reply_to_status_id'].notnull()].index

In [48]:
#remove rows with retweets and replies using their indices
twit_archive_clean.drop(index= retweet_index, axis=0, inplace = True)
twit_archive_clean.drop(index= reply_index, axis=0, inplace = True)
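
As an alternative to dropping by index, the same filter can be written as a single boolean mask. The sketch below runs on made-up rows (`sample_archive` is hypothetical). It is not the notebook's method, but it would also avoid the `KeyError` that two successive `drop()` calls raise if a row were both a retweet and a reply.

```python
import numpy as np
import pandas as pd

# Made-up rows: one original tweet, one retweet, one reply
sample_archive = pd.DataFrame({
    'tweet_id': [1, 2, 3],
    'retweeted_status_id': [np.nan, 7.0, np.nan],
    'in_reply_to_status_id': [np.nan, np.nan, 9.0],
})

# A row is an original tweet when both ID columns are empty
is_original = (
    sample_archive['retweeted_status_id'].isna()
    & sample_archive['in_reply_to_status_id'].isna()
)
originals = sample_archive[is_original].copy()
```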

__Test__

In [49]:
#checks for any non-empty value in the retweeted_status_id column
twit_archive_clean['retweeted_status_id'].notnull().sum()

0

In [50]:
#checks for any non-empty value in the in_reply_to_status_id column
twit_archive_clean['in_reply_to_status_id'].notnull().sum()

0

##### 2. `twitter_archive` table: `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id`, and `retweeted_status_timestamp` columns are unnecessary for the analysis of original tweets

__Define__

Store the `in_reply_to_status_id`, `in_reply_to_user_id`, `retweeted_status_id`, `retweeted_status_user_id`, and `retweeted_status_timestamp` column names in a list and remove those columns from the table using the `drop()` function.

__Code__

In [51]:
#store the unnecessary columns in a list
drop_list = ['in_reply_to_status_id', 'in_reply_to_user_id', 'retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp' ]

In [52]:
#removes the unnecessary columns
twit_archive_clean.drop(drop_list, axis = 1, inplace = True)

__Test__

In [53]:
#displays first 5 observations with headers to confirm the previous action.
twit_archive_clean.head()

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


##### 3. `twitter_archive` table: The `source` column contains HTML formatting

__Define__

Use a regular expression with the `str.extract()` method to pull the `source` values out of the HTML formatting.

__Code__

In [54]:
#extracts the values of the source variable from the HTML formatting (raw string avoids invalid-escape warnings)
twit_archive_clean['source'] = twit_archive_clean['source'].str.extract(r'>([\w\W\s]*)<', expand=False)
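
To see what the extraction does, here is a small sketch on two of the `source` values shown in the assessment output. The pattern `>([^<]+)<` is a tighter variant chosen for illustration rather than the notebook's own pattern, and `sample_source` and `labels` are hypothetical names.

```python
import pandas as pd

sample_source = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
])

# Capture the text between the first '>' and the next '<';
# [^<]+ matches one or more characters that are not '<'
labels = sample_source.str.extract(r'>([^<]+)<', expand=False)
```

With a single capture group and `expand=False`, `str.extract` returns a Series, which can be assigned straight back to the column.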

__Test__

In [55]:
#displays the first 10 samples of the twit_archive_clean table
twit_archive_clean['source'].head(10)

0    Twitter for iPhone
1    Twitter for iPhone
2    Twitter for iPhone
3    Twitter for iPhone
4    Twitter for iPhone
5    Twitter for iPhone
6    Twitter for iPhone
7    Twitter for iPhone
8    Twitter for iPhone
9    Twitter for iPhone
Name: source, dtype: object

In [56]:
#displays counts for each value in source column
twit_archive_clean['source'].value_counts()

Twitter for iPhone     1964
Vine - Make a Scene      91
Twitter Web Client       31
TweetDeck                11
Name: source, dtype: int64

##### 4. `twitter_archive` table: +0000 is redundant information in `timestamp`

__Define__

Remove the last 6 characters (` +0000`) from each `timestamp` value by slicing with `.str[:-6]`, then trim any remaining whitespace with `str.strip()`.

__Code__

In [57]:
#removes the last 6 characters of each timestamp value
twit_archive_clean['timestamp'] = twit_archive_clean['timestamp'].str[:-6].str.strip()

__Test__

In [58]:
#displays random samples of the timestamp column
twit_archive_clean['timestamp'].sample(10)

1832    2015-12-14 00:07:50
1778    2015-12-18 16:56:01
925     2016-07-18 18:43:07
1815    2015-12-15 04:05:01
1065    2016-06-09 01:07:06
1444    2016-02-08 15:14:57
850     2016-08-17 01:20:27
2027    2015-12-02 02:13:48
1768    2015-12-20 01:38:42
38      2017-07-12 00:01:00
Name: timestamp, dtype: object

##### 5. `twitter_archive` table: Erroneous data types in `tweet_id` and `timestamp` columns

__Define__

Change the data types of the `tweet_id` and `timestamp` variables to `str` and `datetime` using the astype() and to_datetime() functions respectively.

__Code__

In [59]:
#changes the data type of the tweet_id column to string
twit_archive_clean['tweet_id'] = twit_archive_clean['tweet_id'].astype(str)
#changes the data type of the timestamp column to datetime
twit_archive_clean['timestamp'] = pd.to_datetime(twit_archive_clean['timestamp'])

__Test__

In [60]:
#displays a summarized information about the twit_archive_clean table
twit_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2097 entries, 0 to 2355
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   tweet_id            2097 non-null   object        
 1   timestamp           2097 non-null   datetime64[ns]
 2   source              2097 non-null   object        
 3   text                2097 non-null   object        
 4   expanded_urls       2094 non-null   object        
 5   rating_numerator    2097 non-null   int64         
 6   rating_denominator  2097 non-null   int64         
 7   name                2097 non-null   object        
 8   doggo               2097 non-null   object        
 9   floofer             2097 non-null   object        
 10  pupper              2097 non-null   object        
 11  puppo               2097 non-null   object        
dtypes: datetime64[ns](1), int64(2), object(9)
memory usage: 213.0+ KB
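
As an aside, `pd.to_datetime()` can also parse the `+0000` offset directly, so the trimming and the conversion could be combined. The following is a minimal sketch on made-up values, offered as an alternative and not as the approach used in this notebook.

```python
import pandas as pd

# Made-up values in the raw (untrimmed) timestamp format
raw = pd.Series(['2017-07-29 16:00:24 +0000', '2015-12-14 00:07:50 +0000'])

# utc=True yields timezone-aware datetimes in UTC
parsed = pd.to_datetime(raw, utc=True)

# Dropping the timezone gives naive datetimes like the ones in the table above
naive = parsed.dt.tz_localize(None)
print(naive)
```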


##### 6. `twitter_archive` table: Variable `floofer` should be `floof`, and likewise its values.

__Define__

Rename the `floofer` column to `floof` using the rename() function, and replace the value floofer with floof in that column.

__Code__

In [61]:
#renames the floofer column to floof
twit_archive_clean.rename(columns={'floofer': 'floof'}, inplace=True)

In [62]:
#replaces floofer values with floof in the renamed floof column
twit_archive_clean['floof'] = twit_archive_clean['floof'].str.replace('floofer', 'floof')

__Test__

In [63]:
#displays random observations of the twit_archive_clean table
twit_archive_clean.sample(10)

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,doggo,floof,pupper,puppo
1822,676575501977128964,2015-12-15 01:32:24,Twitter for iPhone,This pupper is very passionate about Christmas...,https://twitter.com/dog_rates/status/676575501...,8,10,,,,pupper,
775,776201521193218049,2016-09-14 23:30:38,Twitter for iPhone,This is O'Malley. That is how he sleeps. Doesn...,https://twitter.com/dog_rates/status/776201521...,10,10,O,,,,
1639,684177701129875456,2016-01-05 01:00:50,Twitter for iPhone,This is Kulet. She's very proud of the flower ...,https://twitter.com/dog_rates/status/684177701...,10,10,Kulet,,,,
969,750132105863102464,2016-07-05 01:00:05,Twitter for iPhone,This is Stewie. He will roundhouse kick anyone...,https://twitter.com/dog_rates/status/750132105...,11,10,Stewie,,,,
1936,673956914389192708,2015-12-07 20:07:04,Twitter for iPhone,This is one esteemed pupper. Just graduated co...,https://twitter.com/dog_rates/status/673956914...,10,10,one,,,pupper,
240,846514051647705089,2017-03-28 00:07:32,Twitter for iPhone,This is Barney. He's an elder doggo. Hitches a...,https://twitter.com/dog_rates/status/846514051...,13,10,Barney,doggo,,,
1397,699779630832685056,2016-02-17 02:17:19,Twitter for iPhone,Take all my money. 10/10 https://t.co/B28ebc5LzQ,https://twitter.com/dog_rates/status/699779630...,10,10,,,,,
328,833722901757046785,2017-02-20 17:00:04,Twitter for iPhone,This is Bronte. She's fairly h*ckin aerodynami...,https://twitter.com/dog_rates/status/833722901...,13,10,Bronte,,,,
840,767122157629476866,2016-08-20 22:12:29,Twitter for iPhone,This is Rupert. You betrayed him with bath tim...,https://twitter.com/dog_rates/status/767122157...,13,10,Rupert,,,,
1173,720340705894408192,2016-04-13 19:59:42,Twitter for iPhone,This is Derek. He just got balled on. Can't ev...,https://twitter.com/dog_rates/status/720340705...,10,10,Derek,,,pupper,


In [64]:
#displays the number of occurrences of each value in the floof column
twit_archive_clean['floof'].value_counts()

None     2087
floof      10
Name: floof, dtype: int64

##### 7. `image_predictions` table: Some column labels are unclear

__Define__

Rename the columns to be more descriptive by assigning a list of new labels to image_pred_clean.columns.

__Code__

In [65]:
#assigns a list of the new column labels
image_pred_clean.columns = [
    'tweet_id',
    'image_url',
    'image_num',
    'prediction_1',
    'prediction_1_confidence',
    'prediction_1_isdog',
    'prediction_2',
    'prediction_2_confidence',
    'prediction_2_isdog',
    'prediction_3',
    'prediction_3_confidence',
    'prediction_3_isdog',
]

__Test__

In [66]:
#displays a list of the image_pred_clean column labels.
image_pred_clean.columns

Index(['tweet_id', 'image_url', 'image_num', 'prediction_1',
       'prediction_1_confidence', 'prediction_1_isdog', 'prediction_2',
       'prediction_2_confidence', 'prediction_2_isdog', 'prediction_3',
       'prediction_3_confidence', 'prediction_3_isdog'],
      dtype='object')
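
Assigning a list to `.columns` relies on the list being in exactly the same order as the existing columns. A mapping passed to `rename()` avoids that dependency. The sketch below uses a one-row toy frame; the original labels `jpg_url`, `p1_conf` and `p1_dog` are assumptions about the raw file (only `p1`, `p2` and `p3` are named in this notebook).

```python
import pandas as pd

# Toy frame with assumed original labels
preds = pd.DataFrame({
    'tweet_id': ['1'],
    'jpg_url': ['https://example.com/a.jpg'],
    'p1': ['chow'],
    'p1_conf': [0.9],
    'p1_dog': [True],
})

# Old label -> new label; columns not listed are left untouched
new_names = {
    'jpg_url': 'image_url',
    'p1': 'prediction_1',
    'p1_conf': 'prediction_1_confidence',
    'p1_dog': 'prediction_1_isdog',
}
preds = preds.rename(columns=new_names)
print(list(preds.columns))
```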

##### 8. `image_predictions` table: Text in `p1`, `p2`, and `p3` is capitalized inconsistently (sometimes starting with an uppercase letter, sometimes lowercase), and underscores are sometimes used in place of spaces.

__Define__

Use str.replace() and str.title() functions to replace _ with " " and make the first letter of every word uppercase respectively.

__Code__

In [67]:
#replace underscore with space and make first letter of every word uppercase in each of prediction_1, prediction_2, prediction_3 columns
image_pred_clean['prediction_1'] =  image_pred_clean['prediction_1'].str.replace('_', ' ').str.title()
image_pred_clean['prediction_2'] =  image_pred_clean['prediction_2'].str.replace('_', ' ').str.title()
image_pred_clean['prediction_3'] =  image_pred_clean['prediction_3'].str.replace('_', ' ').str.title()

__Test__

In [68]:
#displays 20 random samples of image_pred_clean table's observations
image_pred_clean.sample(20)

Unnamed: 0,tweet_id,image_url,image_num,prediction_1,prediction_1_confidence,prediction_1_isdog,prediction_2,prediction_2_confidence,prediction_2_isdog,prediction_3,prediction_3_confidence,prediction_3_isdog
783,690015576308211712,https://pbs.twimg.com/media/CZNtgWhWkAAbq3W.jpg,2,Malamute,0.949609,True,Siberian Husky,0.033084,True,Eskimo Dog,0.016663,True
1486,781955203444699136,https://pbs.twimg.com/media/CtoQGu4XgAQgv5m.jpg,1,Pool Table,0.179568,False,Dining Table,0.154396,False,Microwave,0.03369,False
1640,807059379405148160,https://pbs.twimg.com/media/Ct2qO5PXEAE6eB0.jpg,1,Seat Belt,0.474292,False,Golden Retriever,0.171393,True,Labrador Retriever,0.110592,True
1702,817171292965273600,https://pbs.twimg.com/media/C1cs8uAWgAEwbXc.jpg,1,Golden Retriever,0.295483,True,Irish Setter,0.144431,True,Chesapeake Bay Retriever,0.077879,True
1591,798665375516884993,https://pbs.twimg.com/media/CVMOlMiWwAA4Yxl.jpg,1,Chow,0.243529,True,Hamster,0.22715,False,Pomeranian,0.056057,True
1825,835172783151792128,https://pbs.twimg.com/media/C5chM_jWAAQmov9.jpg,2,Border Collie,0.663138,True,Collie,0.152494,True,Cardigan,0.035471,True
1488,782305867769217024,https://pbs.twimg.com/media/CttPBt0WIAAcsDE.jpg,1,Briard,0.504427,True,Soft-Coated Wheaten Terrier,0.390678,True,Lhasa,0.034596,True
2004,877316821321428993,https://pbs.twimg.com/media/DCza_vtXkAQXGpC.jpg,1,Saluki,0.509967,True,Italian Greyhound,0.090497,True,Golden Retriever,0.079406,True
84,667502640335572993,https://pbs.twimg.com/media/CUNyHTMUYAAQVch.jpg,1,Labrador Retriever,0.996709,True,Golden Retriever,0.001688,True,Beagle,0.000712,True
1368,761750502866649088,https://pbs.twimg.com/media/CYLDikFWEAAIy1y.jpg,1,Golden Retriever,0.586937,True,Labrador Retriever,0.39826,True,Kuvasz,0.00541,True
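
The three near-identical assignments can also be written once and applied to a list of columns. This is a sketch on a toy frame with invented values, limited to two columns for brevity.

```python
import pandas as pd

# Toy frame with inconsistent casing and underscores
preds = pd.DataFrame({
    'prediction_1': ['golden_retriever', 'Chesapeake_Bay_retriever'],
    'prediction_2': ['seat_belt', 'Irish_setter'],
})

cols = ['prediction_1', 'prediction_2']

# Apply the same clean-up to every listed column
preds[cols] = preds[cols].apply(lambda s: s.str.replace('_', ' ').str.title())
print(preds)
```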


##### 9. `image_pred_clean` table: `tweet_id` should be a string, not an integer

__Define__

Change the data type of `tweet_id` to string using the astype() function.

__Code__

In [69]:
#changes the data type of tweet_id to string
image_pred_clean['tweet_id'] = image_pred_clean['tweet_id'].astype(str)

__Test__

In [70]:
#displays summarized information about the image_pred_clean table.
image_pred_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   tweet_id                 2075 non-null   object 
 1   image_url                2075 non-null   object 
 2   image_num                2075 non-null   int64  
 3   prediction_1             2075 non-null   object 
 4   prediction_1_confidence  2075 non-null   float64
 5   prediction_1_isdog       2075 non-null   bool   
 6   prediction_2             2075 non-null   object 
 7   prediction_2_confidence  2075 non-null   float64
 8   prediction_2_isdog       2075 non-null   bool   
 9   prediction_3             2075 non-null   object 
 10  prediction_3_confidence  2075 non-null   float64
 11  prediction_3_isdog       2075 non-null   bool   
dtypes: bool(3), float64(3), int64(1), object(5)
memory usage: 152.1+ KB
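
One practical reason to keep `tweet_id` as a string: the IDs are too large to survive a round trip through `float64`, which is what an integer column becomes when missing values are introduced (for example by an outer merge). The sketch below illustrates this with two IDs taken from the tables in this notebook.

```python
import pandas as pd

ids = pd.Series([892420643555336193, 892177421306343426])

# Round-trip through float64, as happens when NaN forces an int column to float
round_trip = ids.astype('float64').astype('int64')
lossy = round_trip != ids

# Strings keep every digit
as_text = ids.astype(str)

print(lossy.tolist())
print(as_text.tolist())
```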


##### 10. `tweet_data_clean` table: `id_str` variable should be named `tweet_id` instead, to be consistent with the other tables.

__Define__

Rename the `id_str` column to `tweet_id` using the rename() function.

__Code__

In [71]:
#rename the id_str column to tweet_id
tweet_data_clean.rename(columns={'id_str':'tweet_id'}, inplace = True)

__Test__

In [72]:
#displays the first five observations of the tweet_data_clean table with headers
tweet_data_clean.head()

Unnamed: 0,tweet_id,retweet_count,favorite_count
0,892420643555336193,8853,39467
1,892177421306343426,6514,33819
2,891815181378084864,4328,25461
3,891689557279858688,8964,42908
4,891327558926688256,9774,41048


#### Tidiness

##### 1. One variable 'dog stage' in four columns (doggo, floofer, pupper, puppo) in `twitter_archive` table.

__Define__

Create a list of the dog stages. Iterate through the rows of the table and check, case-insensitively, whether each stage appears in the tweet `text`. Store the stage found in a new `dog_stage` column in title case; if several stages match, the one that comes last in the list is kept. Drop the `doggo`, `floof`, `pupper` and `puppo` columns.

__Code__

In [73]:
#list of dog stages
stages = ['doggo','floof', 'pupper','puppo']
#loop that checks whether each item of the stages list appears in a row's text
#and stores the matching stage in the dog_stage column (later matches overwrite earlier ones)
for index, row in twit_archive_clean.iterrows():
    text = str(row['text']).lower()
    for stage in stages:
        if stage.lower() in text:
            twit_archive_clean.loc[index, 'dog_stage'] = stage.title()


In [74]:
#changes the data type of the dog_stage column to category
twit_archive_clean['dog_stage'] = twit_archive_clean['dog_stage'].astype('category')

In [75]:
#remove columns doggo, floof, pupper, and puppo
twit_archive_clean.drop(['doggo', 'floof', 'pupper', 'puppo'], axis = 1, inplace = True)

__Test__

In [76]:
#displays 25 random samples of twit_archive_clean table's observations
twit_archive_clean.sample(25)

Unnamed: 0,tweet_id,timestamp,source,text,expanded_urls,rating_numerator,rating_denominator,name,dog_stage
1880,675006312288268288,2015-12-10 17:37:00,Twitter for iPhone,Say hello to Mollie. This pic was taken after ...,https://twitter.com/dog_rates/status/675006312...,10,10,Mollie,Pupper
301,836677758902222849,2017-02-28 20:41:37,Twitter for iPhone,Say hello to Oliver. He's pretty exotic. Fairl...,https://twitter.com/dog_rates/status/836677758...,11,10,Oliver,
1884,674800520222154752,2015-12-10 03:59:15,Twitter for iPhone,This is Tedders. He broke his leg saving babie...,https://twitter.com/dog_rates/status/674800520...,11,10,Tedders,
473,816336735214911488,2017-01-03 17:33:39,Twitter for iPhone,This is Dudley. He found a flower and now he's...,https://twitter.com/dog_rates/status/816336735...,11,10,Dudley,
120,869702957897576449,2017-05-30 23:51:58,Twitter for iPhone,Meet Stanley. He likes road trips. Will shift ...,https://twitter.com/dog_rates/status/869702957...,13,10,Stanley,
2140,670003130994700288,2015-11-26 22:16:09,Twitter for iPhone,This is Raphael. He is a Baskerville Conquista...,https://twitter.com/dog_rates/status/670003130...,10,10,Raphael,
2032,671763349865160704,2015-12-01 18:50:38,Twitter for iPhone,Say hello to Mark. He's a good dog. Always rea...,https://twitter.com/dog_rates/status/671763349...,9,10,Mark,
1357,703407252292673536,2016-02-27 02:32:12,Twitter for iPhone,This pupper doesn't understand gates. 10/10 so...,https://twitter.com/dog_rates/status/703407252...,10,10,,Pupper
1801,676957860086095872,2015-12-16 02:51:45,Twitter for iPhone,10/10 I'd follow this dog into battle no quest...,https://twitter.com/dog_rates/status/676957860...,10,10,,
685,788150585577050112,2016-10-17 22:51:57,Twitter for iPhone,This is Leo. He's a golden chow. Rather h*ckin...,https://twitter.com/dog_rates/status/788150585...,13,10,Leo,


In [77]:
#displays the number of occurrences of each value in dog_stage column
twit_archive_clean['dog_stage'].value_counts()

Pupper    255
Doggo      78
Floof      38
Puppo      30
Name: dog_stage, dtype: int64
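
A vectorized variant of the same idea is sketched below on three invented tweets. It is an alternative, not the method used above, and it differs in one respect: when a text mentions several stages, `str.extract()` keeps the first one that occurs in the text, whereas the loop keeps the one listed last in `stages`.

```python
import re
import pandas as pd

# Invented sample texts
tweets = pd.DataFrame({'text': [
    'This pupper is very passionate about Christmas',
    "This is Barney. He's an elder doggo.",
    'Take all my money. 10/10',
]})

stages = ['doggo', 'floof', 'pupper', 'puppo']
pattern = '(' + '|'.join(stages) + ')'

# Case-insensitive search; rows without a match stay missing (NaN)
tweets['dog_stage'] = (
    tweets['text']
    .str.extract(pattern, flags=re.IGNORECASE, expand=False)
    .str.title()
    .astype('category')
)
print(tweets['dog_stage'].tolist())
```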

##### 2. All three tables should be merged into one table.

__Define__

Create a list containing the three tables and perform an outer merge on them on `tweet_id`, assigning the result to `tweet_archive_master`.

__Code__

In [78]:
#creates a list of the three tables
data_frames = [twit_archive_clean, image_pred_clean, tweet_data_clean]

In [79]:
#merges the three tables into one table called tweet_archive_master

tweet_archive_master = reduce(lambda left,right: pd.merge(left,right, on = ['tweet_id'], how = 'outer'), data_frames)

__Test__

In [80]:
#displays summarized information about the tweet_archive_master table
tweet_archive_master.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2356 entries, 0 to 2355
Data columns (total 22 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   tweet_id                 2356 non-null   object        
 1   timestamp                2097 non-null   datetime64[ns]
 2   source                   2097 non-null   object        
 3   text                     2097 non-null   object        
 4   expanded_urls            2094 non-null   object        
 5   rating_numerator         2097 non-null   float64       
 6   rating_denominator       2097 non-null   float64       
 7   name                     2097 non-null   object        
 8   dog_stage                401 non-null    category      
 9   image_url                2075 non-null   object        
 10  image_num                2075 non-null   float64       
 11  prediction_1             2075 non-null   object        
 12  prediction_1_confidence  2075 non-
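
The effect of the outer merge can be seen on three tiny invented frames. The sketch shows why the merged table has more rows than `twit_archive_clean` and why integer columns such as `rating_numerator` appear as `float64`: rows present in only some tables are kept and padded with NaN. The inner merge is included only for comparison.

```python
from functools import reduce
import pandas as pd

# Invented frames sharing a string tweet_id key
archive = pd.DataFrame({'tweet_id': ['1', '2'], 'rating_numerator': [12, 13]})
images = pd.DataFrame({'tweet_id': ['2', '3'], 'image_num': [1, 2]})
counts = pd.DataFrame({'tweet_id': ['1', '2', '3'], 'retweet_count': [10, 20, 30]})
frames = [archive, images, counts]

# Outer merge keeps every tweet_id found in any table
outer = reduce(lambda left, right: pd.merge(left, right, on='tweet_id', how='outer'), frames)

# Inner merge keeps only tweet_ids found in all tables
inner = reduce(lambda left, right: pd.merge(left, right, on='tweet_id', how='inner'), frames)

print(len(outer), outer['rating_numerator'].dtype)
print(len(inner), inner['rating_numerator'].dtype)
```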

<a id = 'storage'></a>
## Data Storage

Store the `tweet_archive_master` table to CSV

In [81]:
#saves the table to csv
tweet_archive_master.to_csv('tweet_archive_master.csv', encoding='utf-8')
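
When the CSV is read back, pandas will infer `tweet_id` as an integer again unless told otherwise, and the saved index appears as an extra unnamed column. The sketch below shows one way to handle both, using an in-memory buffer and a one-row toy frame in place of the real file; `index=False` is a suggestion, not what the cell above does.

```python
import io
import pandas as pd

# One-row toy stand-in for the master table
master = pd.DataFrame({'tweet_id': ['892420643555336193'], 'retweet_count': [8853]})

# Write without the index so no 'Unnamed: 0' column is created
buffer = io.StringIO()
master.to_csv(buffer, index=False)

# Default read: tweet_id is inferred as an integer
buffer.seek(0)
default = pd.read_csv(buffer)

# Explicit dtype keeps tweet_id as text
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype={'tweet_id': str})

print(default.dtypes)
print(as_text.dtypes)
```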