# Project: Wrangling and Analyzing Data

## Import Libraries 
In the cell below, we import **all** the libraries required for this project.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import requests
import seaborn as sns
import datetime

## Data Gathering
In the cell below, gather **all** three pieces of data for this project and load them in the notebook. **Note:** the methods required to gather each dataset are different.
1. Directly download the WeRateDogs Twitter archive data (twitter-archive-enhanced.csv)

In [2]:
# Load the directly downloaded WeRateDogs Twitter archive into a pandas DataFrame
twitter_archive_df = pd.read_csv('twitter-archive-enhanced.csv')

2. Use the Requests library to download the tweet image prediction (image_predictions.tsv)

In [3]:
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
image_predictions_response = requests.get(url, allow_redirects = True)

with open('image-predictions.tsv', mode='wb') as file:
  file.write(image_predictions_response.content)

image_predictions_df = pd.read_csv('image-predictions.tsv', sep='\t')

3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)

In [4]:
# Since my Twitter Developer account approval is pending, 
# I will be reading the data directly from the tweet-json.json file
tweet_data = pd.read_json('tweet-json.json', lines=True)
tweet_data = tweet_data.filter(['id', 'favorite_count', 'retweet_count'])
tweet_data = tweet_data.rename(columns={"id":"tweet_id"})
tweet_data.to_csv('tweet_data.csv')
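
As a cross-check on `pd.read_json(..., lines=True)`, the same kind of file can be read with the standard `json` module, one tweet per line. This is only a minimal sketch: it assumes each line of `tweet-json.json` is a single JSON object containing the `id`, `favorite_count` and `retweet_count` keys filtered above. The helper name `parse_tweet_lines` and the sample values are made up for illustration.

```python
import io
import json

def parse_tweet_lines(lines):
    """Collect id, favorite_count and retweet_count from line-delimited JSON."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        tweet = json.loads(line)
        records.append({
            'tweet_id': tweet['id'],
            'favorite_count': tweet['favorite_count'],
            'retweet_count': tweet['retweet_count'],
        })
    return records

# Made-up sample lines standing in for the real file
sample = io.StringIO(
    '{"id": 1, "favorite_count": 10, "retweet_count": 2, "lang": "en"}\n'
    '{"id": 2, "favorite_count": 5, "retweet_count": 1, "lang": "en"}\n'
)
records = parse_tweet_lines(sample)
print(records)
```

With the real file, passing an open file handle to `parse_tweet_lines` and wrapping the result in `pd.DataFrame(records)` should give the same three columns as the cell above.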

## Assessing Data
In this section, detect and document at least **eight (8) quality issues and two (2) tidiness issues**. You must use **both** visual assessment and
programmatic assessment to assess the data.

**Note:** pay attention to the following key points when you assess the data.

* You only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings and some are retweets.
* Assessing and cleaning the entire dataset completely would require a lot of time, and is not necessary to practice and demonstrate your skills in data wrangling. Therefore, the requirements of this project are only to assess and clean at least 8 quality issues and at least 2 tidiness issues in this dataset.
* The fact that the rating numerators are greater than the denominators does not need to be cleaned. This [unique rating system](http://knowyourmeme.com/memes/theyre-good-dogs-brent) is a big part of the popularity of WeRateDogs.
* You do not need to gather the tweets beyond August 1st, 2017. You can, but note that you won't be able to gather the image predictions for these tweets since you don't have access to the algorithm used.



### i. Assessing the Twitter archive data

In [5]:
twitter_archive_df.head()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [6]:
# Number of retweeted tweets
twitter_archive_df['retweeted_status_id'].notnull().sum()

In [7]:
twitter_archive_df.sample(10)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
2207,668627278264475648,,,2015-11-23 03:09:00 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Timofy. He's a pilot for Southwest. It...,,,,https://twitter.com/dog_rates/status/668627278...,9,10,Timofy,,,,
810,771380798096281600,,,2016-09-01 16:14:48 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Fizz. She thinks love is a social constru...,,,,https://twitter.com/dog_rates/status/771380798...,11,10,Fizz,,,,
2151,669682095984410625,,,2015-11-26 01:00:28 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Louie. He just pounded that bottle of win...,,,,https://twitter.com/dog_rates/status/669682095...,9,10,Louie,,,,
387,826598799820865537,8.265984e+17,4196984000.0,2017-02-01 01:11:25 +0000,"<a href=""http://twitter.com/download/iphone"" r...","I was going to do 007/10, but the joke wasn't ...",,,,,7,10,,,,,
2159,669571471778410496,,,2015-11-25 17:40:53 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Keith. He's had 13 DUIs. 7/10 that's t...,,,,https://twitter.com/dog_rates/status/669571471...,7,10,Keith,,,,
1994,672604026190569472,,,2015-12-04 02:31:10 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a baby Rand Paul. Curls for days. 11/1...,,,,https://twitter.com/dog_rates/status/672604026...,11,10,a,,,,
117,870063196459192321,,,2017-05-31 23:43:25 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Clifford. He's quite large. Also red. Goo...,,,,https://twitter.com/dog_rates/status/870063196...,14,10,Clifford,,,,
2109,670449342516494336,,,2015-11-28 03:49:14 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Vibrant dog here. Fabulous tail. Only 2 legs t...,,,,https://twitter.com/dog_rates/status/670449342...,5,10,,,,,
1928,674045139690631169,,,2015-12-08 01:57:39 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Herd of wild dogs here. Not sure what they're ...,,,,https://twitter.com/dog_rates/status/674045139...,3,10,,,,,
1736,679722016581222400,,,2015-12-23 17:55:32 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Mike. He is a Jordanian Frito Pilates....,,,,https://twitter.com/dog_rates/status/679722016...,8,10,Mike,,,,


In [8]:
twitter_archive_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

In [9]:
twitter_archive_df.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


In [10]:
twitter_archive_df['name'].value_counts()

None          745
a              55
Charlie        12
Cooper         11
Lucy           11
             ... 
Dex             1
Ace             1
Tayzie          1
Grizzie         1
Christoper      1
Name: name, Length: 957, dtype: int64

In [11]:
# Counting rows with numerators less than 10
(twitter_archive_df.rating_numerator < 10).sum()

440

In [12]:
# Showing rows with a denominator equal to 0
twitter_archive_df[twitter_archive_df.rating_denominator == 0]

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
313,835246439529840640,8.35246e+17,26259576.0,2017-02-24 21:54:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@jonnysun @Lin_Manuel ok jomny I know you're e...,,,,,960,0,,,,,


In [13]:
# Showing rows with a denominator other than 10
print(len(twitter_archive_df[twitter_archive_df.rating_denominator != 10]))
twitter_archive_df[twitter_archive_df.rating_denominator != 10]

23


Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
313,835246439529840640,8.35246e+17,26259580.0,2017-02-24 21:54:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@jonnysun @Lin_Manuel ok jomny I know you're e...,,,,,960,0,,,,,
342,832088576586297345,8.320875e+17,30582080.0,2017-02-16 04:45:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",@docmisterio account started on 11/15/15,,,,,11,15,,,,,
433,820690176645140481,,,2017-01-15 17:52:40 +0000,"<a href=""http://twitter.com/download/iphone"" r...",The floofs have been released I repeat the flo...,,,,https://twitter.com/dog_rates/status/820690176...,84,70,,,,,
516,810984652412424192,,,2016-12-19 23:06:23 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Meet Sam. She smiles 24/7 &amp; secretly aspir...,,,,"https://www.gofundme.com/sams-smile,https://tw...",24,7,Sam,,,,
784,775096608509886464,,,2016-09-11 22:20:06 +0000,"<a href=""http://twitter.com/download/iphone"" r...","RT @dog_rates: After so many requests, this is...",7.403732e+17,4196984000.0,2016-06-08 02:41:38 +0000,https://twitter.com/dog_rates/status/740373189...,9,11,,,,,
902,758467244762497024,,,2016-07-28 01:00:57 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Why does this never happen at my front door......,,,,https://twitter.com/dog_rates/status/758467244...,165,150,,,,,
1068,740373189193256964,,,2016-06-08 02:41:38 +0000,"<a href=""http://twitter.com/download/iphone"" r...","After so many requests, this is Bretagne. She ...",,,,https://twitter.com/dog_rates/status/740373189...,9,11,,,,,
1120,731156023742988288,,,2016-05-13 16:15:54 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Say hello to this unbelievably well behaved sq...,,,,https://twitter.com/dog_rates/status/731156023...,204,170,this,,,,
1165,722974582966214656,,,2016-04-21 02:25:47 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Happy 4/20 from the squad! 13/10 for all https...,,,,https://twitter.com/dog_rates/status/722974582...,4,20,,,,,
1202,716439118184652801,,,2016-04-03 01:36:11 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Bluebert. He just saw that both #Final...,,,,https://twitter.com/dog_rates/status/716439118...,50,50,Bluebert,,,,


### ii. Assessing the tweet image predictions

In [14]:
image_predictions_df.head()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True
1,666029285002620928,https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg,1,redbone,0.506826,True,miniature_pinscher,0.074192,True,Rhodesian_ridgeback,0.07201,True
2,666033412701032449,https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg,1,German_shepherd,0.596461,True,malinois,0.138584,True,bloodhound,0.116197,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
4,666049248165822465,https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg,1,miniature_pinscher,0.560311,True,Rottweiler,0.243682,True,Doberman,0.154629,True


In [15]:
image_predictions_df.tail()

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
2070,891327558926688256,https://pbs.twimg.com/media/DF6hr6BUMAAzZgT.jpg,2,basset,0.555712,True,English_springer,0.22577,True,German_short-haired_pointer,0.175219,True
2071,891689557279858688,https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg,1,paper_towel,0.170278,False,Labrador_retriever,0.168086,True,spatula,0.040836,False
2072,891815181378084864,https://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg,1,Chihuahua,0.716012,True,malamute,0.078253,True,kelpie,0.031379,True
2073,892177421306343426,https://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg,1,Chihuahua,0.323581,True,Pekinese,0.090647,True,papillon,0.068957,True
2074,892420643555336193,https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg,1,orange,0.097049,False,bagel,0.085851,False,banana,0.07611,False


In [16]:
image_predictions_df.sample(10)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
582,678798276842360832,https://pbs.twimg.com/media/CWuTbAKUsAAvZHh.jpg,1,Airedale,0.583122,True,silky_terrier,0.129567,True,Lakeland_terrier,0.094727,True
1685,814530161257443328,https://pbs.twimg.com/media/C03K2-VWIAAK1iV.jpg,1,miniature_poodle,0.626913,True,toy_poodle,0.265582,True,soft-coated_wheaten_terrier,0.041614,True
626,680801747103793152,https://pbs.twimg.com/media/CXKxkseW8AAjAMY.jpg,1,pug,0.99672,True,Labrador_retriever,0.001439,True,Staffordshire_bullterrier,0.000518,True
47,666817836334096384,https://pbs.twimg.com/media/CUEDSMEWEAAuXVZ.jpg,1,miniature_schnauzer,0.496953,True,standard_schnauzer,0.285276,True,giant_schnauzer,0.073764,True
971,706593038911545345,https://pbs.twimg.com/media/Cc5Snc7XIAAMidF.jpg,1,four-poster,0.696423,False,quilt,0.189312,False,pillow,0.029409,False
1753,824663926340194305,https://pbs.twimg.com/media/C3HLd0HXUAAUI2b.jpg,1,English_setter,0.526488,True,golden_retriever,0.402815,True,Irish_setter,0.034418,True
3,666044226329800704,https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg,1,Rhodesian_ridgeback,0.408143,True,redbone,0.360687,True,miniature_pinscher,0.222752,True
966,706291001778950144,https://pbs.twimg.com/media/Cc0_2tXXEAA2iTY.jpg,1,Border_terrier,0.587101,True,bull_mastiff,0.164087,True,Staffordshire_bullterrier,0.105011,True
629,680913438424612864,https://pbs.twimg.com/media/CXMXKKHUMAA1QN3.jpg,1,Pomeranian,0.615678,True,golden_retriever,0.126455,True,Chihuahua,0.087184,True
1402,768970937022709760,https://pbs.twimg.com/ext_tw_video_thumb/76896...,1,Pomeranian,0.182358,True,golden_retriever,0.110658,True,mousetrap,0.086399,False


In [17]:
image_predictions_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   tweet_id  2075 non-null   int64  
 1   jpg_url   2075 non-null   object 
 2   img_num   2075 non-null   int64  
 3   p1        2075 non-null   object 
 4   p1_conf   2075 non-null   float64
 5   p1_dog    2075 non-null   bool   
 6   p2        2075 non-null   object 
 7   p2_conf   2075 non-null   float64
 8   p2_dog    2075 non-null   bool   
 9   p3        2075 non-null   object 
 10  p3_conf   2075 non-null   float64
 11  p3_dog    2075 non-null   bool   
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [None]:
# I'm merely interested as to why this image was classified as a paper towel.
image_predictions_df.loc[2071, 'jpg_url']

In [None]:
from IPython.display import Image
Image(url="https://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg")

### iii. Assessing the tweet data from the Twitter API

In [None]:
tweet_data.head()

In [None]:
tweet_data.sample(10)

In [None]:
tweet_data.info()

### Quality issues
1. Invalid tweet id data type (integer instead of string).

2. Invalid timestamp data type (string instead of datetime).

3. The Twitter enhanced archive data contains 181 retweets.

4. The Twitter enhanced archive data contains 78 replies.

5. Incorrect dog names such as None, a, an, and the.

6. Image prediction data are incomplete (2075 rows instead of 2356).

7. Prediction names in the p1, p2 and p3 columns use underscores instead of spaces.

8. Inconsistent capitalization of prediction names in the p1, p2 and p3 columns.

### Tidiness issues
1. There are four columns of data for the dog stage

2. Despite being split into three different dataframes, all the data are related.

## Cleaning Data
In this section, clean **all** of the issues you documented while assessing. 

**Note:** Make a copy of the original data before cleaning. Cleaning includes merging individual pieces of data according to the rules of [tidy data](https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html). The result should be a high-quality and tidy master pandas DataFrame (or DataFrames, if appropriate).

In [None]:
# Make copies of original pieces of data
clean_twitter_archive = twitter_archive_df.copy()
clean_tweet_data = tweet_data.copy()
clean_image_predictions = image_predictions_df.copy()

In [None]:
clean_twitter_archive.head()

In [None]:
clean_tweet_data.head()

In [None]:
clean_image_predictions.head()

## Tidiness Issues

### **Issue #1:** There are four columns of data for the dog stage.

#### **Define:** The dog stage should be extracted from the text and merged into one column (dog_stage); the four original columns should then be dropped.

#### Code

In [None]:
# Create a new dog_stage column by extracting dog stage from the text column.
clean_twitter_archive['dog_stage'] = clean_twitter_archive['text'].str.extract('(doggo|floofer|pupper|puppo)')
clean_twitter_archive = clean_twitter_archive.drop(columns=['doggo', 'floofer', 'pupper', 'puppo'])

#### Test

In [None]:
clean_twitter_archive.sample(10)
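
One caveat with the extraction above: `str.extract` keeps only the first match, so a tweet that mentions two stages ends up labelled with just one. The sketch below uses the standard `re` module to collect every stage mentioned. The helper name `find_dog_stages` and the sample sentence are illustrative assumptions, and unlike the original pattern it lowercases the text before matching.

```python
import re

STAGE_PATTERN = re.compile(r'doggo|floofer|pupper|puppo')

def find_dog_stages(text):
    """Return all distinct dog stages mentioned in a tweet, in order of appearance."""
    stages = []
    for stage in STAGE_PATTERN.findall(text.lower()):
        if stage not in stages:
            stages.append(stage)
    # None mirrors the missing value that str.extract produces when nothing matches
    return ', '.join(stages) if stages else None

print(find_dog_stages('Here is a doggo and her pupper.'))  # doggo, pupper
print(find_dog_stages('Vibrant dog here. Fabulous tail.'))  # None
```

If multi-stage tweets matter for the analysis, this could be applied with `clean_twitter_archive['text'].apply(find_dog_stages)` in place of `str.extract`.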

### **Issue #2:** Despite being split into three different dataframes, all the data are related.

#### **Define:** All dataframes should be combined and joined together using the tweet id.

#### Code

In [None]:
# Combining the Enhanced Twitter Archive data that has been cleaned with the Tweet Data from the Twitter API
clean_twitter_archive = pd.merge(clean_twitter_archive, clean_tweet_data, on='tweet_id', how='left')

# Taking the resulting merged archive and merging it with the Tweet Image Predictions
clean_twitter_archive = pd.merge(clean_twitter_archive, clean_image_predictions, on='tweet_id', how='left')

#### Test

In [None]:
clean_twitter_archive.info()
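
Because both merges are left joins, archive rows without a match are kept and filled with NaN. A quick way to see which rows those are is the `indicator` option of `pd.merge`. The sketch below uses two toy frames with made-up ids rather than the real data; `validate='one_to_one'` is an extra check that assumes each `tweet_id` appears at most once per frame.

```python
import pandas as pd

# Toy frames with made-up ids; the real ones are clean_twitter_archive and clean_image_predictions
archive = pd.DataFrame({'tweet_id': [1, 2, 3], 'rating_numerator': [13, 12, 11]})
predictions = pd.DataFrame({'tweet_id': [1, 3], 'p1': ['pug', 'basset']})

merged = pd.merge(archive, predictions, on='tweet_id', how='left',
                  indicator=True, validate='one_to_one')

# '_merge' is 'both' when a prediction was found and 'left_only' when it was not
missing = merged.loc[merged['_merge'] == 'left_only', 'tweet_id'].tolist()
print(missing)
```

The `_merge` column would need to be dropped again before storing the master dataset.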

## Quality Issues

### **Issue #1:** Invalid tweet id data type (integer instead of string).

#### **Define:** Convert the tweet id column's data type from an integer to a string using astype.

#### Code

In [None]:
# Converting the tweet_id from integer into string
clean_twitter_archive.tweet_id = clean_twitter_archive.tweet_id.astype('str')

#### Test

In [None]:
clean_twitter_archive.info()
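
One practical reason to store tweet ids as strings: they have 18 digits, which is more than a 64-bit float can represent exactly. If an id column picks up a missing value, pandas typically falls back to `float64`, which is presumably why `in_reply_to_status_id` is displayed as `8.265984e+17` in the assessment above. A small check using the first tweet id from the archive:

```python
tweet_id = 892420643555336193  # first tweet_id shown in the archive above

as_float = float(tweet_id)     # what happens when an id column becomes float64
round_trip = int(as_float)

print(round_trip == tweet_id)          # False: the last digits were rounded
print(int(str(tweet_id)) == tweet_id)  # True: a string preserves every digit
```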

### **Issue #2:** Invalid timestamp data type (String not DateTime).

#### **Define:** Convert the timestamp column's data type from a string to a DateTime.

#### Code

In [None]:
# Converting the timestamp from string into datetime
clean_twitter_archive.timestamp = pd.to_datetime(clean_twitter_archive.timestamp)

#### Test

In [None]:
clean_twitter_archive.info()
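
The raw timestamps look like `2017-08-01 16:23:56 +0000`: a date, a time, and a UTC offset. `pd.to_datetime` infers this layout on its own. For reference, the explicit format string for that layout is `%Y-%m-%d %H:%M:%S %z`, shown here with the standard library only:

```python
from datetime import datetime

raw = '2017-08-01 16:23:56 +0000'  # first timestamp in the archive
parsed = datetime.strptime(raw, '%Y-%m-%d %H:%M:%S %z')

print(parsed.isoformat())  # 2017-08-01T16:23:56+00:00
```

The result is timezone-aware (UTC), which matches what `pd.to_datetime` should produce for these strings.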

### **Issue #3:** The Twitter enhanced archive data contains 181 retweets.

#### **Define:** Remove the rows that are retweets, then drop the retweet-related columns.

#### Code

In [None]:
# Only keep original tweets that do not have a retweet status id.
clean_twitter_archive = clean_twitter_archive[clean_twitter_archive.retweeted_status_id.isnull()]
clean_twitter_archive.info()

In [None]:
# Drop all retweet related columns
clean_twitter_archive = clean_twitter_archive.drop(columns=['retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp'])

#### Test

In [None]:
clean_twitter_archive.info()

### **Issue #4:** The Twitter enhanced archive data contains 78 replies.

#### **Define:** Remove the rows that are replies, then drop the reply-related columns.

#### Code

In [None]:
# Only keep tweets that are not replies (no in_reply_to_status_id).
clean_twitter_archive = clean_twitter_archive[clean_twitter_archive.in_reply_to_status_id.isnull()]
clean_twitter_archive.info()

In [None]:
# Drop all reply related columns
clean_twitter_archive = clean_twitter_archive.drop(columns=['in_reply_to_status_id', 'in_reply_to_user_id'])

#### Test

In [None]:
clean_twitter_archive.info()

### **Issue #5:** Incorrect dog names like None, a, an, & the.

#### **Define:** Change any invalid names (None or names beginning with a lowercase letter) to NaN, then try to recover the correct name from the text column by taking the word immediately following the word "named".

#### Code

In [None]:
clean_twitter_archive.name = clean_twitter_archive.name.replace(regex=['^[a-z]+', 'None'], value= np.nan)

In [None]:
# A function to extract a name from the text column, returning NaN if no word follows 'named'.
def name_extractor(text):
  list_of_text = text.split()
  for index, word in enumerate(list_of_text):
    # The name is the word right after 'named'; skip the case where 'named' is the last word
    if word.lower() == 'named' and index + 1 < len(list_of_text):
      return list_of_text[index + 1].strip('.,!?') # drop trailing punctuation
  return np.nan

In [None]:
#np.where(condition, what to do if condition is true, what to do if condition is false)
clean_twitter_archive.name = np.where(clean_twitter_archive.name.isnull(), clean_twitter_archive.text.apply(name_extractor), clean_twitter_archive.name)

#### Test

In [None]:
clean_twitter_archive.info()
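
The same idea can be expressed as a regular expression, which also avoids picking up lowercase words that happen to follow "named". This is a hedged alternative rather than the method used above; the helper name `extract_name` and the first sample sentence are made up for illustration.

```python
import re

# 'named' followed by a capitalized word; trailing punctuation is not captured
NAMED_PATTERN = re.compile(r'\bnamed\s+([A-Z][a-z]+)')

def extract_name(text):
    """Return the capitalized word following 'named', or None if there is none."""
    match = NAMED_PATTERN.search(text)
    return match.group(1) if match else None

print(extract_name('This is a lovely spaniel named Wylie. 10/10'))  # Wylie
print(extract_name('Vibrant dog here. Fabulous tail.'))             # None
```

Note that the pattern only accepts simple names (one capital letter followed by lowercase letters), so names with apostrophes or hyphens would need a wider character class.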

### **Issue #6:** Image prediction data are incomplete (2075 rows instead of 2356).

#### **Define:** Remove the rows that have no image prediction data.

#### Code

In [None]:
clean_twitter_archive = clean_twitter_archive[clean_twitter_archive.jpg_url.notnull()]

#### Test

In [None]:
clean_twitter_archive.info()

### **Issue #7:** Prediction names in the p1, p2 and p3 columns use underscores instead of spaces.

#### **Define:** Replace the underscores with spaces.

#### Code

In [None]:
clean_twitter_archive.p1 = clean_twitter_archive.p1.str.replace('_', ' ')
clean_twitter_archive.p2 = clean_twitter_archive.p2.str.replace('_', ' ')
clean_twitter_archive.p3 = clean_twitter_archive.p3.str.replace('_', ' ')

#### Test

In [None]:
clean_twitter_archive.p1.sample(10)

In [None]:
clean_twitter_archive.p2.sample(10)

In [None]:
clean_twitter_archive.p3.sample(10)

### **Issue #8:** Inconsistent capitalization of prediction names in the p1, p2 and p3 columns.

#### **Define:** Convert all prediction names to title case.

#### Code

In [None]:
clean_twitter_archive.p1 = clean_twitter_archive.p1.str.title()
clean_twitter_archive.p2 = clean_twitter_archive.p2.str.title()
clean_twitter_archive.p3 = clean_twitter_archive.p3.str.title()

#### Test

In [None]:
clean_twitter_archive.p1.sample(10)

In [None]:
clean_twitter_archive.p2.sample(10)

In [None]:
clean_twitter_archive.p3.sample(10)

In [None]:
(clean_twitter_archive.p1.str.istitle()).value_counts()

In [None]:
(clean_twitter_archive.p2.str.istitle()).value_counts()

In [None]:
(clean_twitter_archive.p3.str.istitle()).value_counts()
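
Issues #7 and #8 together amount to one small string transformation. The sketch below shows it on single labels taken from the prediction table above, using plain Python string methods, which behave the same way as the `.str` versions in pandas. The helper name `tidy_breed` is an assumption for illustration. Note that `title()` also capitalizes the letter after a hyphen.

```python
def tidy_breed(label):
    """Turn a raw prediction label into a spaced, title-cased name."""
    return label.replace('_', ' ').title()

print(tidy_breed('German_short-haired_pointer'))  # German Short-Haired Pointer
print(tidy_breed('Chihuahua'))                    # Chihuahua
```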

## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".

In [None]:
# index=False avoids writing the leftover row index as an extra unnamed column
clean_twitter_archive.to_csv('twitter_archive_master.csv', index=False)

## Analyzing and Visualizing Data
This section analyses the cleaned dataset and presents the visualizations that support the insights below.

### Insights:
1. The number of tweets over time, to spot any noticeable changes in tweet volume.

2. The share of each dog stage, as a percentage.

3. Relationship between the number of retweets and the number of favorites.

### Visualization
1. Number of tweets over time

In [None]:
# timestamp is already a datetime after cleaning; this call is only a safeguard
clean_twitter_archive.timestamp = pd.to_datetime(clean_twitter_archive['timestamp'])

# Count tweets per calendar month
monthly_tweets = clean_twitter_archive.groupby(pd.Grouper(key = 'timestamp', freq = "M")).count().reset_index()
monthly_tweets = monthly_tweets[['timestamp', 'tweet_id']]
print(monthly_tweets.head())
monthly_tweets.tweet_id.sum() # total number of tweets across all months

In [None]:
plt.figure(figsize=(10, 8));
plt.xlim([datetime.date(2015, 11, 30), datetime.date(2017, 7, 30)]);

plt.xlabel('Year and Month')
plt.ylabel('Tweets Count')

plt.plot(monthly_tweets.timestamp, monthly_tweets.tweet_id);
plt.title('We Rate Dogs Tweets over Time');

Tweet volume fell sharply over time: it peaked in January and March of 2016 and declined steadily thereafter.

2. Percentage of dog stages

In [None]:
dog_stage_df = clean_twitter_archive.dog_stage.value_counts()
dog_stage_df

In [None]:
# Creating a pie chart; labels come from the value_counts index so they always match the slices
plt.pie(dog_stage_df,
        labels = dog_stage_df.index.str.title(),
        autopct='%1.1f%%',
        shadow=True,
        explode=(0.1, 0.2, 0.2, 0.3)
        )

plt.title('Percentage of dog stages')
plt.axis('equal')

3. Relationship between the number of retweets and the number of favorites


In [None]:
plt.scatter(clean_twitter_archive.retweet_count, clean_twitter_archive.favorite_count)
plt.title('Relationship between the number of retweets and the number of favorites')
plt.xlabel('Number of Retweets')
plt.ylabel ('Number of Favorites')
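
To put a number on what the scatter plot shows, a Pearson correlation coefficient can be computed. The sketch below implements the formula directly so the calculation is visible; the function name `pearson` and the sample values are made up and are not taken from the dataset. On the real data, `clean_twitter_archive.retweet_count.corr(clean_twitter_archive.favorite_count)` should give the same kind of result in one call.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up illustrative values, not from the WeRateDogs data
retweets = [10, 20, 30, 40]
favorites = [25, 45, 70, 85]
r = pearson(retweets, favorites)
print(round(r, 3))
```

A value close to 1 would indicate that tweets with more retweets also tend to have more favorites; a value near 0 would indicate no linear relationship.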