# Project: Wrangling and Analyze Data

## Data Gathering
In the cell below, gather **all** three pieces of data for this project and load them in the notebook. **Note:** the methods required to gather each data are different.
1. Directly download the WeRateDogs Twitter archive data (twitter_archive_enhanced.csv)

In [1]:
import pandas as pd 
import numpy as np
import requests
import tweepy
import json
import os 

from matplotlib import pyplot as plt
%matplotlib inline 

import seaborn as sns 

In [2]:
df_archive = pd.read_csv('twitter-archive-enhanced.csv')

df_archive

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
5,891087950875897856,,,2017-07-29 00:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a majestic great white breaching ...,,,,https://twitter.com/dog_rates/status/891087950...,13,10,,,,,
6,890971913173991426,,,2017-07-28 16:27:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Jax. He enjoys ice cream so much he gets ...,,,,"https://gofundme.com/ydvmve-surgery-for-jax,ht...",13,10,Jax,,,,
7,890729181411237888,,,2017-07-28 00:22:40 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you watch your owner call another dog a g...,,,,https://twitter.com/dog_rates/status/890729181...,13,10,,,,,
8,890609185150312448,,,2017-07-27 16:25:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Zoey. She doesn't want to be one of th...,,,,https://twitter.com/dog_rates/status/890609185...,13,10,Zoey,,,,
9,890240255349198849,,,2017-07-26 15:59:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Cassie. She is a college pup. Studying...,,,,https://twitter.com/dog_rates/status/890240255...,14,10,Cassie,doggo,,,


In [3]:
df_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

2. Use the Requests library to download the tweet image prediction (image_predictions.tsv)

In [4]:
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'

response = requests.get(url)

open('image-predictions.tsv', 'wb').write(response.content)

335079

In [5]:
df_images = pd.read_csv('image-predictions.tsv', sep='\t')

df_images.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [6]:
df_images.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)

In [7]:
tweet = []

with open ('tweet_json.txt', 'r') as file:
    for line in file:
        data = json.loads(line)
        tweet.append(data)

In [8]:
print(len(tweet))

2340


In [9]:
df_tweet = pd.DataFrame(tweet, columns = ['id', 'retweet_count', 'favorite_count'])

df_tweet.head()

Unnamed: 0,id,retweet_count,favorite_count
0,892420643555336193,8323,38047
1,892177421306343426,6149,32665
2,891815181378084864,4069,24579
3,891689557279858688,8469,41402
4,891327558926688256,9164,39579


In [10]:
df_tweet.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2340 entries, 0 to 2339
Data columns (total 3 columns):
id                2340 non-null int64
retweet_count     2340 non-null int64
favorite_count    2340 non-null int64
dtypes: int64(3)
memory usage: 54.9 KB


## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issue**. You must use **both** visual assessment
programmatic assessement to assess the data.

**Note:** pay attention to the following key points when you access the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.



In [11]:
df_archive

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,
5,891087950875897856,,,2017-07-29 00:08:17 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a majestic great white breaching ...,,,,https://twitter.com/dog_rates/status/891087950...,13,10,,,,,
6,890971913173991426,,,2017-07-28 16:27:12 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Jax. He enjoys ice cream so much he gets ...,,,,"https://gofundme.com/ydvmve-surgery-for-jax,ht...",13,10,Jax,,,,
7,890729181411237888,,,2017-07-28 00:22:40 +0000,"<a href=""http://twitter.com/download/iphone"" r...",When you watch your owner call another dog a g...,,,,https://twitter.com/dog_rates/status/890729181...,13,10,,,,,
8,890609185150312448,,,2017-07-27 16:25:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Zoey. She doesn't want to be one of th...,,,,https://twitter.com/dog_rates/status/890609185...,13,10,Zoey,,,,
9,890240255349198849,,,2017-07-26 15:59:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Cassie. She is a college pup. Studying...,,,,https://twitter.com/dog_rates/status/890240255...,14,10,Cassie,doggo,,,


In [12]:
df_archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), ob

>- Some of the Assessing outputs where deleted because of the time spent on scrolling

>- While cleaning the data, you'll see some of the assessing issues


 ### Quality issues

#### For df_archive 

1. Some dog names are invalid and duplicated in the `df_archive` table



2. From the assesing data objectives it was stated that retweets are not needed, hence, we'll have to drop the retweeted_statuses also.


3. A lot of missing data in the reply columns in the `df_archive` table are also not needed so we have to drop them 



4. The datatype for `timestamp` in the df_archive data is object instead of datetime



5. Removing the anchor link and retaining only the text for source 


6. Invalid tweet_id data type (integer instead of string)





#### for df_images 


1. Invalid tweet_id data type (integer instead of string)

 
2. Upper case and lower case name


3. Breed and predictions to individual columns







### Tidiness issues

1. id instead of tweet_id


2.  URLs in some of the text in the text column


3.  Merging all the data to one


## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

In [13]:
# Make copies of original pieces of data
df_archive_clean = df_archive.copy()
df_images_clean = df_images.copy()
df_tweet_clean = df_tweet.copy()

### Issue #1:

>- Some dog names are invalid and duplicated in the `df_archive` table


#### Define: 

>- Rename them to none

#### Code

In [14]:
df_archive_clean['name'].value_counts()

None          745
a              55
Charlie        12
Cooper         11
Lucy           11
Oliver         11
Lola           10
Penny          10
Tucker         10
Bo              9
Winston         9
Sadie           8
the             8
Toby            7
Bailey          7
Buddy           7
an              7
Daisy           7
Jack            6
Koda            6
Bella           6
Dave            6
Leo             6
Milo            6
Oscar           6
Jax             6
Scout           6
Rusty           6
Stanley         6
Phil            5
             ... 
Birf            1
Ester           1
Georgie         1
Sailer          1
Rascal          1
Lilah           1
Miley           1
Banjo           1
Tedders         1
Samsom          1
Fabio           1
Florence        1
Orion           1
Boots           1
Maya            1
Monkey          1
Pippin          1
Newt            1
Gilbert         1
Tanner          1
Walker          1
my              1
Jimbo           1
Jarod           1
Fynn      

In [15]:
mask = df_archive_clean.name.str.islower() 
column_name = 'name' 
df_archive_clean.loc[mask, column_name] = np.nan

#### Test

In [16]:
df_archive_clean['name'].value_counts()

None        745
Charlie      12
Lucy         11
Oliver       11
Cooper       11
Tucker       10
Lola         10
Penny        10
Bo            9
Winston       9
Sadie         8
Toby          7
Daisy         7
Bailey        7
Buddy         7
Bella         6
Jack          6
Stanley       6
Milo          6
Rusty         6
Jax           6
Koda          6
Dave          6
Oscar         6
Leo           6
Scout         6
Bentley       5
Chester       5
George        5
Alfie         5
           ... 
Lulu          1
Birf          1
Ester         1
Georgie       1
Sailer        1
Rascal        1
Miley         1
Tanner        1
Ember         1
Ralphus       1
Banjo         1
Samsom        1
Fabio         1
Sweet         1
Florence      1
Orion         1
Boots         1
Maya          1
Monkey        1
Pippin        1
Newt          1
Tedders       1
Gilbert       1
Walker        1
Jimbo         1
Jarod         1
Fynn          1
Kayla         1
Kenzie        1
Venti         1
Name: name, Length: 932,

### Issue #2: 

>- From the assesing data objectives it was stated that retweets are not needed

#### Define 
>- Hence, we'll have to drop the retweeted_statuses also.


#### Code

In [17]:
#create a dynamic function to drop rows and cloumns in the df_archive table

def drop_row_col (dframe, row_col, axis=1):
    dframe.drop(row_col, axis = axis, inplace = True)

In [18]:
retweeted_id = df_archive_clean[pd.notnull(df_archive_clean['retweeted_status_id'])].index

drop_row_col(df_archive_clean, retweeted_id, axis = 0)

In [19]:
df_archive_clean.drop(columns = ['retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp'], 
                      inplace = True)

#### Test

In [20]:
df_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2175 entries, 0 to 2355
Data columns (total 14 columns):
tweet_id                 2175 non-null int64
in_reply_to_status_id    78 non-null float64
in_reply_to_user_id      78 non-null float64
timestamp                2175 non-null object
source                   2175 non-null object
text                     2175 non-null object
expanded_urls            2117 non-null object
rating_numerator         2175 non-null int64
rating_denominator       2175 non-null int64
name                     2071 non-null object
doggo                    2175 non-null object
floofer                  2175 non-null object
pupper                   2175 non-null object
puppo                    2175 non-null object
dtypes: float64(2), int64(3), object(9)
memory usage: 254.9+ KB


### Issue #3:

>- A lot of missing data in the reply columns in the `df_archive` table

#### Define 
>- They're not needed so we have to drop them, but first we drop the row which they reference before we drop the column


#### Code

In [21]:
#create a dynamic function to drop rows and cloumns in the df_archive table

def drop_row_col (dframe, row_col, axis=1):
    dframe.drop(row_col, axis = axis, inplace = True)

In [22]:
reply_id = df_archive_clean[pd.notnull(df_archive_clean['in_reply_to_status_id'])].index

In [23]:
drop_row_col (df_archive_clean, reply_id, axis = 0)

In [24]:
df_archive_clean.drop(columns = ['in_reply_to_status_id', 'in_reply_to_user_id'], inplace = True)

#### Test

In [25]:
df_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2097 entries, 0 to 2355
Data columns (total 12 columns):
tweet_id              2097 non-null int64
timestamp             2097 non-null object
source                2097 non-null object
text                  2097 non-null object
expanded_urls         2094 non-null object
rating_numerator      2097 non-null int64
rating_denominator    2097 non-null int64
name                  1993 non-null object
doggo                 2097 non-null object
floofer               2097 non-null object
pupper                2097 non-null object
puppo                 2097 non-null object
dtypes: int64(3), object(9)
memory usage: 213.0+ KB


### Issue #4: 

>- The datatype for `timestamp` in the df_archive data is object instead of datetime


#### Define 
>- Convert to datetime

#### Code

In [26]:
df_archive_clean['timestamp'] = pd.to_datetime(df_archive_clean['timestamp'])

#### Test

In [27]:
type(df_archive_clean['timestamp'][0])

pandas._libs.tslibs.timestamps.Timestamp

In [28]:
df_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2097 entries, 0 to 2355
Data columns (total 12 columns):
tweet_id              2097 non-null int64
timestamp             2097 non-null datetime64[ns]
source                2097 non-null object
text                  2097 non-null object
expanded_urls         2094 non-null object
rating_numerator      2097 non-null int64
rating_denominator    2097 non-null int64
name                  1993 non-null object
doggo                 2097 non-null object
floofer               2097 non-null object
pupper                2097 non-null object
puppo                 2097 non-null object
dtypes: datetime64[ns](1), int64(3), object(8)
memory usage: 293.0+ KB


### Issue #5:

>- Anchor link in the text

#### Define:

>- Removing the anchor link and retaining only the text for `source`

#### Code

In [29]:
df_archive_clean['source'].value_counts()

<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>     1964
<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>                          91
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                       31
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      11
Name: source, dtype: int64

In [30]:
df_archive_clean['source'] = df_archive_clean['source'].str.extract('^<a.+>(.+)</a>$')

#### Test

In [31]:
df_archive_clean['source'].value_counts()

Twitter for iPhone     1964
Vine - Make a Scene      91
Twitter Web Client       31
TweetDeck                11
Name: source, dtype: int64

### Issue #6:

>- Invalid tweet_id data type (integer instead of string)


#### Define: 

>- Change int to str

#### Code:

In [32]:
df_archive_clean['tweet_id'] =  df_archive_clean['tweet_id'].astype(str)

#### Test:

In [33]:
type(df_archive_clean['tweet_id'][0])

str

In [34]:
df_archive_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2097 entries, 0 to 2355
Data columns (total 12 columns):
tweet_id              2097 non-null object
timestamp             2097 non-null datetime64[ns]
source                2097 non-null object
text                  2097 non-null object
expanded_urls         2094 non-null object
rating_numerator      2097 non-null int64
rating_denominator    2097 non-null int64
name                  1993 non-null object
doggo                 2097 non-null object
floofer               2097 non-null object
pupper                2097 non-null object
puppo                 2097 non-null object
dtypes: datetime64[ns](1), int64(2), object(9)
memory usage: 293.0+ KB


### Issue #7:

>- Breed and Predictions to individual columns

#### Define: 
>- combine the Breed and Predictions to individual columns

#### Code:

In [35]:
condition = [(df_images_clean.p1_dog == True), (df_images_clean.p1_dog == True), (df_images_clean.p1_dog == True)]

breed = [df_images_clean.p1, df_images_clean.p2, df_images_clean.p3]

prediction = [df_images_clean.p1_conf, df_images_clean.p2_conf, df_images_clean.p3_conf]

df_images_clean['breed'] = np.select(condition, breed, default ='None')

df_images_clean['prediction'] = np.select(condition, prediction, default =0)



>- we'll have to drop the p columns since it has been merged to one column

In [36]:
df_images_clean.drop(columns=['p1','p1_conf','p1_dog','p2','p2_conf','p2_dog','p3','p3_conf','p3_dog'], inplace=True)

#### Test:

In [37]:
df_images_clean

Unnamed: 0,tweet_id,jpg_url,img_num,breed,prediction
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311
5,666050758794694657,https://pbs.twimg.com/media/CT5Jof1WUAEuVxN.jpg,1,Bernese_mountain_dog,0.651137
6,666051853826850816,https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg,1,,0.000000
7,666055525042405380,https://pbs.twimg.com/media/CT5N9tpXIAAifs1.jpg,1,chow,0.692517
8,666057090499244032,https://pbs.twimg.com/media/CT5PY90WoAAQGLo.jpg,1,,0.000000
9,666058600524156928,https://pbs.twimg.com/media/CT5Qw94XAAA_2dP.jpg,1,miniature_poodle,0.201493


### Issue #1 (For df_images):

>-  Invalid tweet_id data type (integer instead of string)


#### Define: 

>- Change the datatype to string from int

#### Code:

In [38]:
df_images_clean['tweet_id'] = df_images_clean['tweet_id'].astype(str)

#### Test:

In [39]:
type(df_images_clean['tweet_id'][0])

str

In [40]:
df_images_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 5 columns):
tweet_id      2075 non-null object
jpg_url       2075 non-null object
img_num       2075 non-null int64
breed         2075 non-null object
prediction    2075 non-null float64
dtypes: float64(1), int64(1), object(3)
memory usage: 81.1+ KB


In [None]:
df_images

### Issue #2: 

>- Upper case and lower case names in the p1, p2, p3, columns


#### Define: 

>- convert all  the names to Upper cases

#### Code:

In [41]:
df_images_clean.p1 = df_images_clean.p1.str.title()
df_images_clean.p2 = df_images_clean.p2.str.title()
df_images_clean.p3 = df_images_clean.p3.str.title()

AttributeError: 'DataFrame' object has no attribute 'p1'

#### Test:

In [None]:
df_images_clean.head()

### Issue #1(df_tweet):

>- id instead of tweet_id and datatype


#### Define:

>- Rename id to tweet_id for joining purpose and change datatype


#### Code:

In [None]:
df_tweet_clean.rename(columns = {'id':'tweet_id'}, inplace = True)

In [None]:
df_tweet_clean['tweet_id'] = df_tweet_clean['tweet_id'].astype(str)

#### Test:

In [None]:
df_tweet_clean.head()

In [None]:
type(df_tweet_clean['tweet_id'][0])

## Tidiness Issues:

### Issue #1: 

>- Dog stages are in multiple columns

#### Define:
>- Put dog stages into one column

#### Code:

In [None]:
df_archive_clean.doggo.replace('None', '', inplace=True) 
df_archive_clean.floofer.replace('None', '', inplace=True)
df_archive_clean.pupper.replace('None', '', inplace=True)
df_archive_clean.puppo.replace('None', '', inplace=True)


df_archive_clean['stage'] = df_archive_clean.doggo + df_archive_clean.floofer + df_archive_clean.pupper + df_archive_clean.puppo


df_archive_clean.loc[df_archive_clean.stage == 'doggopupper', 'stage'] = 'doggo,pupper' 
df_archive_clean.loc[df_archive_clean.stage == 'doggopuppo', 'stage'] = 'doggo,puppo' 
df_archive_clean.loc[df_archive_clean.stage == 'doggofloofer', 'stage'] = 'doggo,floofer'

In [None]:
# Here i'll go ahead to drop the different dog stage cloumns so we can have the stage column alone

df_archive_clean.drop(columns=['doggo', 'floofer', 'pupper', 'puppo'], inplace=True)

#### Test:

In [None]:
df_archive_clean

### Issue #2:

>- URLs in some of the text in the text column


#### Define:

>- Remove the URLs in the text


#### Code:

In [None]:
df_archive_clean['text'] = df_archive_clean.text.str.replace(r"http\S+", "")
df_archive_clean['text'] = df_archive_clean.text.str.strip()

#### Test:

In [None]:
df_archive_clean['text']

### Issue #3:

>- Scattered data

#### Data:

>- Merging all the data to one

#### Code:

In [None]:
#merging the df_archive_clean and df_tweet_clean

merged_data = df_archive_clean.merge(df_tweet_clean, how='inner', on = 'tweet_id')

merged_data.head()

In [None]:
#merging the df_images_clean data with the merged data

merged_data = merged_data.merge(df_images_clean, how = 'inner', on = 'tweet_id')

merged_data.info()

#### Here i'll be adding a new column TWEET DAY generated from Timestamp which i'll later use for insights

In [None]:
#new column for insights
merged_data["tweet_day"] = merged_data["timestamp"].dt.strftime("%a")

merged_data["tweet_day"].value_counts()

## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".

In [None]:
merged_data.to_csv("twitter_archive_master.csv", index=False)

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

In [None]:
#we'll first make a copy of the merged data

df_merged_data = merged_data.copy()

### Insights:
1. The most talked about dog (Dog with the most like and retweet)

2. Most Popular name given to a dog

3. Top source people tweet from 

4. Use a scatterplot to view correlation between some of the columns in our data


#### The most talked about dog (Dog with the most retweet & like(favorite) )

In [None]:
df_merged_data[df_merged_data.favorite_count == df_merged_data.favorite_count.max()]

`Hence`
>- The people's dog is the Labrador_Retriever which has `83712` retweet_count & `164364` favorite_count	

#### Most Popular name given to a dog

In [None]:
named_dogs = df_merged_data.query('name == name')
named_dogs_grouped = named_dogs.groupby('name').count()[['tweet_id']]
named_dogs_grouped.rename(columns={'tweet_id':'name_count'}, inplace=True)
named_dogs_grouped.query('name_count >= 8').sort_values(by=['name_count']).plot.bar()
plt.ylim(top=20)
plt.title("Most Popular Dog Names",{'fontsize': 20})
plt.xlabel("Dogs Names")
plt.legend(["Dogs Names Frequencey"]);

#### From the bar chart we can see, majority of the dogs don't have a name

>- But the frequency between `Charlie`, `Cooper`, `Lucy`, `Oliver` are on the same line 

#### Top source people tweet from

In [None]:
tweet_sources = df_merged_data.groupby('source').count()[['tweet_id']]
tweet_sources.rename(columns={'tweet_id': 'source_count'}, inplace=True)
tweet_sources['source_percentage'] = tweet_sources.source_count / tweet_sources.source_count.sum() * 100
tweet_sources['source_percentage'].plot.pie(figsize=(10,10), autopct='%1.2f%%', explode=[0.1,0.2,0.1])
plt.title("Top tweets source ", {'fontsize': 20})
plt.legend(["Tweetdeck", "Twitter", "iPhone", "Vine"])
plt.ylabel("");

#### From the pie chart above we can tell that the most tweets came from Iphone

### Correlation in Our data

In [None]:
df_merged_data.corr()

>- If we study this table, we'll be able to see the correlation between the `favorite_count` and the `retweet_count` which has a positive relationship with value `0.929662`. The table shall be visualized below to make it make more sense 

In [None]:
#we use the tweet_day to make the tables colorful

sns.pairplot(df_merged_data, vars=["rating_numerator", "retweet_count", "favorite_count","img_num", "prediction",
                        ], hue ="tweet_day");