## Introduction

Real-world data rarely comes clean. Using Python and its libraries, you will gather data from a variety of sources and in a variety of formats, assess its quality and tidiness, then clean it. This is called data wrangling. You will document your wrangling efforts in a Jupyter Notebook, plus showcase them through analyses and visualizations using Python.

The dataset that you will be wrangling (and analyzing and visualizing) is the tweet archive of Twitter user [@dog_rates](https://twitter.com/dog_rates), also known as [WeRateDogs](https://en.wikipedia.org/wiki/WeRateDogs). WeRateDogs is a Twitter account that rates people's dogs with a humorous comment about the dog.


## Project Motivation

### Context

Your goal: wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. The Twitter archive is great, but it only contains very basic tweet information. Additional gathering, then assessing and cleaning is required for "Wow!"-worthy analyses and visualizations.

## The Data

### Enhanced Twitter Archive

The WeRateDogs Twitter archive contains basic tweet data for all 5000+ of their tweets, but not everything. One column the archive does contain though: each tweet's text, which I used to extract rating, dog name, and dog "stage" (i.e. doggo, floofer, pupper, and puppo) to make this Twitter archive "enhanced." Of the 5000+ tweets, I have filtered for tweets with ratings only (there are 2356).


### Additional Data via the Twitter API

Back to the basic-ness of Twitter archives: retweet count and favorite count are two of the notable column omissions. Fortunately, this additional data can be gathered by anyone from Twitter's API. Well, "anyone" who has access to data for the 3000 most recent tweets, at least. But you, because you have the WeRateDogs Twitter archive and specifically the tweet IDs within it, can gather this data for all 5000+. And guess what? You're going to query Twitter's API to gather this valuable data.

### Image Predictions File

One more cool thing: I ran every image in the WeRateDogs Twitter archive through a neural network that can classify breeds of dogs*. The results: a table full of image predictions (the top three only) alongside each tweet ID, image URL, and the image number that corresponded to the most confident prediction (numbered 1 to 4 since tweets can have up to four images).

## Software and libraries
- You need to be able to work in a Jupyter Notebook.
- The following packages (libraries) need to be installed. You can install these packages via conda or pip.
  - [Numpy](https://numpy.org/)
  - [Pandas](http://pandas.pydata.org/)
  - [matplotlib](http://matplotlib.org/)
  - [Requests](https://requests.readthedocs.io/en/master/user/quickstart/)
  - [Tweepy](http://docs.tweepy.org/en/latest/)
  - [JSON](https://docs.python.org/3/library/json.html)

## IMPORTS 

In [7]:
import pandas as pd
import numpy as np
import requests
import tweepy
import json
import re
from tqdm import tqdm, tqdm_notebook, tnrange
import matplotlib.pyplot as plt
%matplotlib inline

## 1. Gathering Data for this Project

#### Gather each of the three pieces of data as described in a Jupyter Notebook

- The WeRateDogs Twitter archive. This file can be downloaded manually from the link on the Udacity course page.


- The tweet image predictions, i.e., what breed of dog (or other object, animal, etc.) is present in each tweet according to a neural network. This file (image_predictions.tsv) is hosted on Udacity's servers and should be downloaded programmatically using the Requests library and the following URL: https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv


- Each tweet's retweet count and favorite ("like") count at minimum, and any additional data you find interesting. Using the tweet IDs in the WeRateDogs Twitter archive, query the Twitter API for each tweet's JSON data using Python's Tweepy library and store each tweet's entire set of JSON data in a file called tweet_json.txt. Each tweet's JSON data should be written to its own line. Then read this .txt file line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count.


### a) Twitter archive Data

In [2]:
archive = pd.read_csv(r'E:\werate_dogs\twitter-archive-enhanced.csv')

### b) Tweet image prediction

In [3]:
url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"
request = requests.get(url)
with open (r'E:\werate_dogs\image_predictions.tsv','wb') as file:
    file.write(request.content)
image_preds = pd.read_csv(r'E:\werate_dogs\image_predictions.tsv', delimiter= '\t')

### c) Twitter complementary data using the API

I didn't have a Twitter account, so I used the tweet_json.txt file from the Udacity class, which is provided as the result of running the supplied .py file, and worked directly with tweet_json.txt.

In [4]:
# Personal API keys, secrets, and tokens have been replaced with placeholders
consumer_key = 'YOUR_CONSUMER_KEY'
consumer_secret = 'YOUR_CONSUMER_SECRET'
access_token = 'YOUR_ACCESS_TOKEN'
access_secret = 'YOUR_ACCESS_SECRET'

In [5]:
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)

In [9]:
# Write each available tweet's JSON to its own line of tweet_json1.txt

with open(r'E:\werate_dogs\tweet_json1.txt', 'w', encoding='utf8') as file:
    for tweet_id in tqdm(archive['tweet_id']):
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            json.dump(tweet._json, file)
            file.write('\n')
        except Exception:
            # Skip tweets that cannot be fetched (for example, deleted ones)
            continue

100%|██████████████████████████████████████| 2356/2356 [42:09<00:00,  1.03s/it]


In [10]:
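tweets = []

# Read the file line by line; each line holds one tweet's JSON
with open(r'E:\werate_dogs\tweet_json1.txt', encoding='utf8') as tweet_json:
    for line in tqdm(tweet_json):
        try:
            tweet = json.loads(line)
            tweets.append({'tweet_id': tweet['id'],
                           'retweet_count': tweet['retweet_count'],
                           'favorite_count': tweet['favorite_count']})
        except (json.JSONDecodeError, KeyError):
            # Skip malformed lines and records missing an expected field
            continue

tweets_data = pd.DataFrame(tweets)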

2323it [00:00, 11008.85it/s]



## DATA ASSESSING


#### After gathering each of the above pieces of data, I will assess them visually and programmatically for quality and tidiness issues.

### 1. Data assessing for archive dataframe

In [13]:
archive.head()

| | tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892420643555336193 | | | 2017-08-01 16:23:56 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Phineas. He's a mystical boy. Only eve... | | | | https://twitter.com/dog_rates/status/892420643... | 13 | 10 | Phineas | | | | |
| 1 | 892177421306343426 | | | 2017-08-01 00:17:27 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Tilly. She's just checking pup on you.... | | | | https://twitter.com/dog_rates/status/892177421... | 13 | 10 | Tilly | | | | |
| 2 | 891815181378084864 | | | 2017-07-31 00:18:03 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Archie. He is a rare Norwegian Pouncin... | | | | https://twitter.com/dog_rates/status/891815181... | 12 | 10 | Archie | | | | |
| 3 | 891689557279858688 | | | 2017-07-30 15:58:51 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Darla. She commenced a snooze mid meal... | | | | https://twitter.com/dog_rates/status/891689557... | 13 | 10 | Darla | | | | |
| 4 | 891327558926688256 | | | 2017-07-29 16:00:24 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Franklin. He would like you to stop ca... | | | | https://twitter.com/dog_rates/status/891327558... | 12 | 10 | Franklin | | | | |


In [14]:
archive.shape

(2356, 17)

In [15]:
archive.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), object(10)

In [20]:
archive.describe()

| | tweet_id | in_reply_to_status_id | in_reply_to_user_id | retweeted_status_id | retweeted_status_user_id | rating_numerator | rating_denominator |
|---|---|---|---|---|---|---|---|
| count | 2356.0 | 78.0 | 78.0 | 181.0 | 181.0 | 2356.0 | 2356.0 |
| mean | 7.427716e+17 | 7.455079e+17 | 2.014171e+16 | 7.7204e+17 | 1.241698e+16 | 13.126486 | 10.455433 |
| std | 6.856705e+16 | 7.582492e+16 | 1.252797e+17 | 6.236928e+16 | 9.599254e+16 | 45.876648 | 6.745237 |
| min | 6.660209e+17 | 6.658147e+17 | 11856340.0 | 6.661041e+17 | 783214.0 | 0.0 | 0.0 |
| 25% | 6.783989e+17 | 6.757419e+17 | 308637400.0 | 7.186315e+17 | 4196984000.0 | 10.0 | 10.0 |
| 50% | 7.196279e+17 | 7.038708e+17 | 4196984000.0 | 7.804657e+17 | 4196984000.0 | 11.0 | 10.0 |
| 75% | 7.993373e+17 | 8.257804e+17 | 4196984000.0 | 8.203146e+17 | 4196984000.0 | 12.0 | 10.0 |
| max | 8.924206e+17 | 8.862664e+17 | 8.405479e+17 | 8.87474e+17 | 7.874618e+17 | 1776.0 | 170.0 |


In [21]:
list(archive.columns)

['tweet_id',
 'in_reply_to_status_id',
 'in_reply_to_user_id',
 'timestamp',
 'source',
 'text',
 'retweeted_status_id',
 'retweeted_status_user_id',
 'retweeted_status_timestamp',
 'expanded_urls',
 'rating_numerator',
 'rating_denominator',
 'name',
 'doggo',
 'floofer',
 'pupper',
 'puppo']

### Check whether the name column contains non-English (non-ASCII) dog names

In [33]:
def isEnglish(s):
    # Proxy check: True only when every character is ASCII (requires Python 3.7+)
    return s.isascii()

archive['isEnglish'] = archive['name'].apply(isEnglish)

In [36]:
archive['isEnglish'].value_counts()

True     2347
False       9
Name: isEnglish, dtype: int64

In [46]:
non_english = archive.loc[~archive['isEnglish'], ['tweet_id', 'name']]


### 2. Data assessing for image_preds dataframe

In [22]:
image_preds.head()

| | tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 666020888022790149 | https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg | 1 | Welsh_springer_spaniel | 0.465074 | True | collie | 0.156665 | True | Shetland_sheepdog | 0.061428 | True |
| 1 | 666029285002620928 | https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg | 1 | redbone | 0.506826 | True | miniature_pinscher | 0.074192 | True | Rhodesian_ridgeback | 0.07201 | True |
| 2 | 666033412701032449 | https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg | 1 | German_shepherd | 0.596461 | True | malinois | 0.138584 | True | bloodhound | 0.116197 | True |
| 3 | 666044226329800704 | https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg | 1 | Rhodesian_ridgeback | 0.408143 | True | redbone | 0.360687 | True | miniature_pinscher | 0.222752 | True |
| 4 | 666049248165822465 | https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg | 1 | miniature_pinscher | 0.560311 | True | Rottweiler | 0.243682 | True | Doberman | 0.154629 | True |


In [23]:
image_preds.shape

(2075, 12)

In [24]:
image_preds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB


In [25]:
image_preds.describe()

| | tweet_id | img_num | p1_conf | p2_conf | p3_conf |
|---|---|---|---|---|---|
| count | 2075.0 | 2075.0 | 2075.0 | 2075.0 | 2075.0 |
| mean | 7.384514e+17 | 1.203855 | 0.594548 | 0.1345886 | 0.06032417 |
| std | 6.785203e+16 | 0.561875 | 0.271174 | 0.1006657 | 0.05090593 |
| min | 6.660209e+17 | 1.0 | 0.044333 | 1.0113e-08 | 1.74017e-10 |
| 25% | 6.764835e+17 | 1.0 | 0.364412 | 0.05388625 | 0.0162224 |
| 50% | 7.119988e+17 | 1.0 | 0.58823 | 0.118181 | 0.0494438 |
| 75% | 7.932034e+17 | 1.0 | 0.843855 | 0.1955655 | 0.09180755 |
| max | 8.924206e+17 | 4.0 | 1.0 | 0.488014 | 0.273419 |


In [26]:
list(image_preds.columns)

['tweet_id',
 'jpg_url',
 'img_num',
 'p1',
 'p1_conf',
 'p1_dog',
 'p2',
 'p2_conf',
 'p2_dog',
 'p3',
 'p3_conf',
 'p3_dog']

### 3. Data assessing for tweets_data dataframe

In [27]:
tweets_data.head()

| | favorite_count | retweet_count | tweet_id |
|---|---|---|---|
| 0 | 36184 | 7703 | 892420643555336193 |
| 1 | 31214 | 5697 | 892177421306343426 |
| 2 | 23500 | 3777 | 891815181378084864 |
| 3 | 39485 | 7869 | 891689557279858688 |
| 4 | 37696 | 8479 | 891327558926688256 |


In [28]:
tweets_data.shape

(2323, 3)

In [29]:
tweets_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2323 entries, 0 to 2322
Data columns (total 3 columns):
favorite_count    2323 non-null int64
retweet_count     2323 non-null int64
tweet_id          2323 non-null int64
dtypes: int64(3)
memory usage: 54.5 KB


In [30]:
tweets_data.describe()

| | favorite_count | retweet_count | tweet_id |
|---|---|---|---|
| count | 2323.0 | 2323.0 | 2323.0 |
| mean | 7559.167886 | 2703.382695 | 7.418515e+17 |
| std | 11745.759617 | 4575.749894 | 6.834235e+16 |
| min | 0.0 | 1.0 | 6.660209e+17 |
| 25% | 1310.5 | 546.0 | 6.780222e+17 |
| 50% | 3279.0 | 1266.0 | 7.175377e+17 |
| 75% | 9261.0 | 3142.0 | 7.986846e+17 |
| max | 156180.0 | 77773.0 | 8.924206e+17 |


In [48]:
tweets_data.isnull().sum()

favorite_count    0
retweet_count     0
tweet_id          0
dtype: int64

## After reviewing the datasets, the key points from the course web page, and the notes in the project motivation, I summarize the quality and tidiness issues below.

# Quality issues


## archive Dataframe

1. timestamp is an object but should be a datetime, as it represents the time at which the tweet was posted.

2. retweeted_status_timestamp is an object but should be a datetime, as it represents the time at which the tweet was retweeted.

3. The name column should be renamed to dog_name, since readers may not know whether it holds the dog's name or the owner's name.

4. rating_denominator has a minimum of 0, which causes problems when calculating the rating ratio because division by zero is undefined; these values should be changed.

5. There are outliers in the denominators: the maximum value is 170.

6. The doggo, floofer, pupper, and puppo columns contain the value 'None' where NaN should be used.

7. There are 9 dogs with non-English names, with the following tweet IDs: 757354760399941633, 720389942216527872, 717047459982213120, 694352839993344000, 688547210804498433, 686050296934563840, 669371483794317312, 668872652652679168, 668528771708952576. I will drop them, as this analysis is only interested in dogs with English names.

8. Missing values in 'name' show up as 'None'.

9. There are tweets without an image URL; I will drop them.

10. Many columns have missing values, namely in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp, and expanded_urls.

## image_preds Dataframe

1. Rename columns to something more informative, for example p1 to prediction_1 and p1_conf to prediction_1_confidence.


# Tidiness


1. The img_num column in image_preds appears to hold the number of the image within the tweet. This is not useful for the analysis, so I will drop this column.

2. The doggo, floofer, pupper, and puppo columns are redundant, as they all describe the same variable (the dog's stage).

3. The dataframes need to be joined to give a clear and complete picture of all the information needed for the next step.
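
The join in item 3 could be sketched as follows. This is a minimal illustration, not the notebook's actual merge: the three toy frames and the name `merged` are assumptions standing in for `archive`, `image_preds`, and `tweets_data`, and the toy IDs are made up. It assumes `tweet_id` is the shared key and that an inner join (keeping only tweets present in all three tables) is wanted.

```python
import pandas as pd

# Toy stand-ins for the three gathered tables (IDs and values are made up)
archive_demo = pd.DataFrame({'tweet_id': [1, 2, 3], 'rating_numerator': [13, 12, 11]})
image_preds_demo = pd.DataFrame({'tweet_id': [1, 2], 'p1': ['collie', 'redbone']})
tweets_data_demo = pd.DataFrame({'tweet_id': [1, 3], 'retweet_count': [7703, 5697]})

# Inner joins keep only the tweet IDs that appear in every table
merged = (archive_demo
          .merge(image_preds_demo, on='tweet_id', how='inner')
          .merge(tweets_data_demo, on='tweet_id', how='inner'))
print(merged)
```

Switching `how='inner'` to `how='left'` would instead keep every archive row and leave NaN where the other tables have no match.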

# Cleaning stage 


### archive Dataframe

**timestamp is an object but should be a datetime, as it represents the time at which the tweet was posted**

**retweeted_status_timestamp is an object but should be a datetime, as it represents the time at which the tweet was retweeted**
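
A minimal sketch of the two timestamp conversions, assuming `pd.to_datetime` is an acceptable way to do it. The toy frame `demo` is hypothetical; its `timestamp` values are in the archive's format, and the retweet timestamp is made up.

```python
import pandas as pd

# Toy rows in the archive's timestamp format (the retweet timestamp is made up)
demo = pd.DataFrame({
    'timestamp': ['2017-08-01 16:23:56 +0000', '2017-08-01 00:17:27 +0000'],
    'retweeted_status_timestamp': [None, '2017-07-15 02:45:48 +0000'],
})

for col in ['timestamp', 'retweeted_status_timestamp']:
    # utc=True yields timezone-aware datetimes; missing values become NaT
    demo[col] = pd.to_datetime(demo[col], utc=True)

print(demo.dtypes)
```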

**The 'name' column should be renamed to dog_name, since readers may not know whether it holds the dog's name or the owner's name**
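
The rename could look like this. It is a small sketch on a hypothetical two-row frame `demo`, using pandas' `DataFrame.rename`.

```python
import pandas as pd

# Two rows taken from the archive preview; only the columns needed here
demo = pd.DataFrame({'tweet_id': [892420643555336193, 892177421306343426],
                     'name': ['Phineas', 'Tilly']})

# rename returns a new frame, so assign the result back
demo = demo.rename(columns={'name': 'dog_name'})
print(demo.columns.tolist())
```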

**denominator issues**
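
One possible way to handle the denominator issues, offered as an assumption rather than the approach this notebook necessarily took: keep only the standard "out of 10" ratings, which removes both the zero denominators and the outliers. The frame `demo`, its values, and the `rating` column are hypothetical.

```python
import pandas as pd

# Toy ratings (values are made up): one zero denominator and one outlier
demo = pd.DataFrame({
    'rating_numerator': [13, 960, 204, 12],
    'rating_denominator': [10, 0, 170, 10],
})

# Keep only ratings out of 10, which drops zeros and outliers alike
clean = demo[demo['rating_denominator'] == 10].copy()

# The ratio is now safe to compute because no denominator is zero
clean['rating'] = clean['rating_numerator'] / clean['rating_denominator']
print(clean)
```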

**doggo, floofer, pupper, puppo columns contain 'None' value where NaN should be used.**
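
A minimal sketch of swapping the 'None' strings for real missing values, assuming the stage columns hold either the stage name or the literal string 'None'. The toy frame `demo` is hypothetical.

```python
import numpy as np
import pandas as pd

stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']

# Toy rows: the first dog is a pupper, the second a doggo
demo = pd.DataFrame({
    'doggo': ['None', 'doggo'],
    'floofer': ['None', 'None'],
    'pupper': ['pupper', 'None'],
    'puppo': ['None', 'None'],
})

# Replace the placeholder string with NaN so pandas treats it as missing
demo[stage_cols] = demo[stage_cols].replace('None', np.nan)
print(demo.isna().sum())
```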

**Removing non-English dog names**
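
A sketch of the removal step, reusing the ASCII check from the assessment stage. Note that `str.isascii` (Python 3.7+) only flags names with non-ASCII characters, so it is a proxy for "non-English" rather than a language test. The frame `demo` and its names are illustrative.

```python
import pandas as pd

# Toy names; the accented one is only an illustration
demo = pd.DataFrame({'tweet_id': [1, 2, 3],
                     'name': ['Phineas', 'Gòrdón', 'Tilly']})

# str.isascii is True only when every character is ASCII
ascii_mask = demo['name'].map(str.isascii)
ascii_only = demo[ascii_mask].reset_index(drop=True)
print(ascii_only)
```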

**Missing values in 'name' show up as 'None'**

**Drop tweets without an image URL**
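
A sketch of one way to do this, assuming "without an image URL" means the tweet has no row in `image_preds` (where `jpg_url` lives). If it instead means a missing `expanded_urls` value, `dropna(subset=['expanded_urls'])` would be the analogous call. The toy frames and URLs are made up.

```python
import pandas as pd

# Toy stand-ins for archive and image_preds (IDs and URLs are made up)
archive_demo = pd.DataFrame({'tweet_id': [1, 2, 3],
                             'name': ['Phineas', 'Tilly', 'Archie']})
image_preds_demo = pd.DataFrame({'tweet_id': [1, 3],
                                 'jpg_url': ['https://example.com/a.jpg',
                                             'https://example.com/b.jpg']})

# Keep only tweets that have a matching image prediction row
has_image = archive_demo['tweet_id'].isin(image_preds_demo['tweet_id'])
archive_with_images = archive_demo[has_image].reset_index(drop=True)
print(archive_with_images)
```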

**Drop irrelevant columns in archive**
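
A minimal sketch of the drop, assuming the irrelevant columns are the mostly empty reply and retweet columns noted during assessment; the exact list is a judgment call. The one-row frame `demo` is hypothetical.

```python
import pandas as pd

# Assumed list of columns to drop (the reply/retweet columns with many missing values)
drop_cols = ['in_reply_to_status_id', 'in_reply_to_user_id',
             'retweeted_status_id', 'retweeted_status_user_id',
             'retweeted_status_timestamp']

# One toy row: two columns to keep plus the empty columns to drop
demo = pd.DataFrame({'tweet_id': [892420643555336193], 'rating_numerator': [13]})
for col in drop_cols:
    demo[col] = None

demo = demo.drop(columns=drop_cols)
print(demo.columns.tolist())
```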

**Rename image_preds columns**
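
A sketch of the renaming, following the `prediction_1` / `prediction_1_confidence` pattern suggested in the assessment notes. The `_is_dog` suffix is an assumption, and `demo` is a one-row stand-in for `image_preds`.

```python
import pandas as pd

# One row from the image predictions preview
demo = pd.DataFrame({
    'tweet_id': [666020888022790149],
    'p1': ['Welsh_springer_spaniel'], 'p1_conf': [0.465074], 'p1_dog': [True],
    'p2': ['collie'], 'p2_conf': [0.156665], 'p2_dog': [True],
    'p3': ['Shetland_sheepdog'], 'p3_conf': [0.061428], 'p3_dog': [True],
})

# Build the old-name -> new-name mapping for the three predictions
rename_map = {}
for i in (1, 2, 3):
    rename_map[f'p{i}'] = f'prediction_{i}'
    rename_map[f'p{i}_conf'] = f'prediction_{i}_confidence'
    rename_map[f'p{i}_dog'] = f'prediction_{i}_is_dog'

demo = demo.rename(columns=rename_map)
print(demo.columns.tolist())
```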

In [None]:
# Combine the four stage columns into one column.
# Missing stages may be the string 'None' (raw archive) or NaN (after cleaning);
# blank both out first so concatenation does not yield values like 'NoneNonepupperNone'.
stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']
stages = tweet_data_clean[stage_cols].replace('None', '').fillna('')
tweet_data_clean['dog_stage'] = stages['doggo'] + stages['floofer'] + stages['pupper'] + stages['puppo']

# Label tweets that mention more than one stage as 'multiple'
multi_stage = ['doggopupper', 'doggopuppo', 'doggofloofer']
tweet_data_clean['dog_stage'] = tweet_data_clean['dog_stage'].replace(multi_stage, 'multiple')