# Project: Wrangling and Analyze Data

## Data Gathering
In the cell below, gather **all** three pieces of data for this project and load them into the notebook. **Note:** the method required to gather each piece of data is different.
1. Directly download the WeRateDogs Twitter archive data (twitter_archive_enhanced.csv)

In [None]:
# importing the necessary modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import requests
from IPython.display import Image
import tweepy
from tweepy import OAuthHandler
import json
from timeit import default_timer as timer

In [None]:
# reading the twitter_archive_enhanced.csv data with pandas
twitter_archive = pd.read_csv('twitter-archive-enhanced.csv')

2. Use the Requests library to download the tweet image prediction (image_predictions.tsv)

In [None]:
# using the requests module to download image-predictions.tsv programmatically
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
with open('image-predictions.tsv', mode='wb') as file:
    file.write(response.content)

3. Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt)

In [None]:
# Query Twitter API for each tweet in the Twitter archive and save JSON in a text file
# These are hidden to comply with Twitter's API terms and conditions
consumer_key = 'HIDDEN'
consumer_secret = 'HIDDEN'
access_token = 'HIDDEN'
access_secret = 'HIDDEN'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth, wait_on_rate_limit=True)

# NOTE TO STUDENT WITH MOBILE VERIFICATION ISSUES:
# df_1 is a DataFrame with the twitter_archive_enhanced.csv file. You may have to
# change line 17 to match the name of your DataFrame with twitter_archive_enhanced.csv
# NOTE TO REVIEWER: this student had mobile verification issues so the following
# Twitter API code was sent to this student from a Udacity instructor
# Tweet IDs for which to gather additional data via Twitter's API
tweet_ids = twitter_archive.tweet_id.values
len(tweet_ids)

# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
count = 0
fails_dict = {}
start = timer()
# Save each tweet's returned JSON as a new line in a .txt file
with open('tweet_json.txt', 'w') as outfile:
    # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
    for tweet_id in tweet_ids:
        count += 1
        print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            print("Success")
            json.dump(tweet._json, outfile)
            outfile.write('\n')
        except tweepy.TweepError as e:
            # TweepError is the Tweepy 3.x exception; Tweepy 4.x uses tweepy.TweepyException instead
            print("Fail")
            fails_dict[tweet_id] = e
end = timer()
print(end - start)
print(fails_dict)

In [None]:
# parsing each line of tweet_json.txt and collecting the fields needed for the dataframe
df_2_list = []
with open('tweet_json.txt', encoding='utf-8') as file:
    for line in file: 
        new_data = json.loads(line)
        tweet_id = new_data['id']
        retweet_count = new_data['retweet_count']
        favorite_count = new_data['favorite_count']
        df_2_list.append({
            'tweet_id': tweet_id,
            'retweet_count': retweet_count,
            'favorite_count': favorite_count
        })

In [None]:
# creating the dataframe from the df_2_list text
tweet_data_df = pd.DataFrame(df_2_list, columns=['tweet_id', 'retweet_count', 'favorite_count'])
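
The line-by-line parsing above can be tried on a small in-memory sample before running it on the full file. The sketch below is only an illustration: the two JSON records and the names `sample_lines`, `wanted`, and `sample_df` are hypothetical and are not taken from `tweet_json.txt`.

```python
import io
import json

import pandas as pd

# Hypothetical stand-in for tweet_json.txt: one JSON object per line.
sample_lines = io.StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 25, "full_text": "a"}\n'
    '{"id": 2, "retweet_count": 3, "favorite_count": 7, "full_text": "b"}\n'
)

# Keep only the fields of interest from each record.
wanted = ['id', 'retweet_count', 'favorite_count']
rows = []
for line in sample_lines:
    record = json.loads(line)
    rows.append({key: record[key] for key in wanted})

# Rename 'id' so the column matches the tweet_id column of the archive.
sample_df = pd.DataFrame(rows).rename(columns={'id': 'tweet_id'})
print(sample_df)
```

Checking the column names and the row count on a sample like this is a cheap way to confirm the parsing logic before the real file is read.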

In [None]:
tweet_data_df

In [None]:
# saving data to csv
tweet_data_df.to_csv('tweet_data_df.csv', index=False)

In [None]:
tweet_data_df.info()

In [None]:
tweet_data_df.describe()

## Assessing Data


In [None]:
twitter_archive.head(2)

In [None]:
# displaying the entries and data types of the fields in twitter_archive data
twitter_archive.info()

In [None]:
# displaying the general statistics of the rating_numerators column
twitter_archive.rating_numerator.describe()

In [None]:
# counting the rating numerators which are less than 10
twitter_archive[twitter_archive.rating_numerator < 10].shape[0]

In [None]:
# checking ids which have numerators less than 10
twitter_archive[twitter_archive.rating_numerator < 10].tweet_id

In [None]:
# checking rating numerator for tweet id 666020888022790149
twitter_archive.loc[(twitter_archive.tweet_id == 666020888022790149), 'rating_numerator']

In [None]:
# general statistics for the denominators
twitter_archive.rating_denominator.describe()

In [None]:
# tweet with the zero denominators
twitter_archive[twitter_archive.rating_denominator == 0]

In [None]:
twitter_archive.name.value_counts()

In [None]:
twitter_archive[twitter_archive.rating_denominator < 10].shape[0]

In [None]:
twitter_archive[twitter_archive.rating_denominator != 10].shape[0]
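
The row counts above can also be obtained by summing a boolean mask, which avoids building a filtered copy of the dataframe. The sketch below uses a hypothetical four-row frame named `toy`; its values are made up for illustration and are not the archive's.

```python
import pandas as pd

# Hypothetical ratings; the real values live in twitter_archive.
toy = pd.DataFrame({
    'tweet_id': [101, 102, 103, 104],
    'rating_numerator': [13, 9, 960, 24],
    'rating_denominator': [10, 10, 0, 7],
})

# True counts as 1, so summing a mask counts the matching rows.
low_numerator = int((toy.rating_numerator < 10).sum())
odd_denominator = int((toy.rating_denominator != 10).sum())

# IDs of the rows whose denominator is zero.
zero_denominator = toy.loc[toy.rating_denominator == 0, 'tweet_id'].tolist()

print(low_numerator, odd_denominator, zero_denominator)
```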

In [None]:
# reading the image-predictions.tsv data with pandas
df_image_preditions = pd.read_csv('image-predictions.tsv', sep='\t')
df_image_preditions.head()

In [None]:
df_image_preditions.sample(5)

In [None]:
# displaying some of these images to see if they are dog images
url = df_image_preditions.loc[1840, 'jpg_url']
Image(url=url)

In [None]:
df_image_preditions.info()

### Quality issues
1. Some dog names, such as "a", are not valid names

2. tweet_id should be a string instead of an integer

3. timestamp should be a datetime instead of a string

4. retweeted_status_id indicates 181 retweets

5. There are 440 rating numerators less than 10

6. tweet_id 835246439529840640 has a denominator of 0

7. Missing values for retweet_counts

8. Some tweets are missing photo URLs
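
For the timestamp issue in the list, one possible conversion is `pd.to_datetime`. The sketch below is an assumption-laden illustration: the frame `toy` is hypothetical, and the `+0000` offset format of its strings is assumed rather than confirmed from the archive.

```python
import pandas as pd

# Hypothetical timestamps; the '+0000' offset format is an assumption about the archive.
toy = pd.DataFrame({
    'timestamp': ['2017-08-01 16:23:56 +0000', '2015-11-15 22:32:08 +0000'],
})

# Parse the strings into timezone-aware datetimes.
toy['timestamp'] = pd.to_datetime(toy['timestamp'])

# Datetime columns expose components such as the year through the .dt accessor.
years = toy['timestamp'].dt.year.tolist()
print(toy.dtypes)
print(years)
```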

### Tidiness issues
1. The dog stage data is separated into four distinct columns

2. All the data is related but is divided into three separate dataframes

## Cleaning Data

In [None]:
# Make copies of original pieces of data
clean_tweet_data_df = tweet_data_df.copy()
clean_twitter_archive = twitter_archive.copy()
clean_df_image_preditions = df_image_preditions.copy()

### Issue #1: Some dog names, such as "a", are not valid names

#### Define:
- Convert invalid dog names ("None" or strings starting with lowercase letters) to NaN, and extract the correct name from the text column

#### Code

In [None]:
# replacing 'None' and names that start with a lowercase letter with NaN
# and counting the missing names to check that the values have been changed
clean_twitter_archive.name = clean_twitter_archive.name.replace(regex=['^[a-z]+', 'None'], value=np.nan)
sum(clean_twitter_archive.name.isnull())

In [None]:
# declare a function that returns the word following 'named' in the text
# and returns NaN if there is no such word
def name(text):
    list_text = text.split()
    # stop before the last word so that index + 1 always exists
    for index, word in enumerate(list_text[:-1]):
        if word.lower() == 'named':
            return list_text[index + 1]
    return np.nan
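
One way the extraction described in the Define step might be applied is to fill only the rows whose name is still missing. The sketch below is a self-contained illustration: the helper `name_after_named`, the frame `toy`, and its three tweets are hypothetical, and stripping trailing punctuation is an added assumption.

```python
import numpy as np
import pandas as pd


def name_after_named(text):
    """Return the word following 'named', or NaN when there is none."""
    words = text.split()
    # Stop before the last word so that position + 1 always exists.
    for position, word in enumerate(words[:-1]):
        if word.lower() == 'named':
            # Assumption: trailing punctuation is not part of the name.
            return words[position + 1].strip('.,!?')
    return np.nan


# Hypothetical tweets; the second already has a valid name.
toy = pd.DataFrame({
    'text': [
        'This is a dog named Rex. 12/10',
        'Here is Bella. 13/10',
        'Such a good boy 11/10',
    ],
    'name': [np.nan, 'Bella', np.nan],
})

# Only fill rows whose name is still missing.
missing = toy.name.isnull()
toy.loc[missing, 'name'] = toy.loc[missing, 'text'].apply(name_after_named)
print(toy)
```

Rows without the word "named" stay NaN, so the count of missing names should drop only for tweets that follow this pattern.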

#### Test

In [None]:
sum(clean_twitter_archive.name.isnull())

### Issue #2: tweet_id should be a string instead of an integer

#### Define
- Convert tweet_id to string

#### Code

In [None]:
# converting the tweet_id dtype to string
clean_twitter_archive.tweet_id = clean_twitter_archive.tweet_id.astype(str)

#### Test

In [None]:
clean_twitter_archive.info()
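
A short illustration of why storing the IDs as strings is a reasonable choice: tweet IDs are identifiers rather than quantities, and they are too large to survive a trip through floating point, which pandas falls back to when an integer column picks up missing values. The frame `toy` below is hypothetical; the two IDs are the ones mentioned earlier in this notebook.

```python
import pandas as pd

original = 666020888022790149

# An 18-digit ID does not survive a float round trip unchanged.
round_trip = int(float(original))

# Hypothetical frame holding the two IDs mentioned in the assessment.
toy = pd.DataFrame({'tweet_id': [original, 835246439529840640]})
toy['tweet_id'] = toy['tweet_id'].astype(str)

print(round_trip == original)
print(toy.tweet_id.tolist())
```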

### Issue #3: retweeted_status_id indicates 181 retweets

#### Define
- Delete retweet-related rows and columns

#### Code

In [None]:
clean_twitter_archive = clean_twitter_archive[clean_twitter_archive.retweeted_status_id.isnull()]
clean_twitter_archive.info()

In [None]:
clean_twitter_archive = clean_twitter_archive.drop(columns=['retweeted_status_id','retweeted_status_user_id','retweeted_status_timestamp'])

#### Test

In [None]:
clean_twitter_archive.info()

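### Issue #4: Missing values for retweet_counts

#### Define
- Already resolved while cleaning the previous issue, where the retweet rows and columns were deleted

#### Test

- No separate check is run: `retweeted_status_timestamp` was dropped under Issue #3, so `sum(clean_twitter_archive.retweeted_status_timestamp.isnull())` would raise an error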

### Issue #5: Some tweets are missing photo URLs

#### Define
- Delete rows with the missing values

#### Code

In [None]:
clean_twitter_archive = clean_twitter_archive[clean_twitter_archive.expanded_urls.notnull()]

#### Test

In [None]:
clean_twitter_archive.info()

### Issue #6: The dog stage data is separated into four distinct columns

#### Define
- Merge the four columns into one

#### Code

In [None]:
# extracting the first dog stage mentioned in the text column into the new dog_stage column
clean_twitter_archive['dog_stage'] = clean_twitter_archive['text'].str.extract(r'(pupper|floofer|puppo|doggo)', expand=False)
clean_twitter_archive.head()

In [None]:
# dropping the four original stage columns: pupper, floofer, puppo, doggo
clean_twitter_archive = clean_twitter_archive.drop(columns=['pupper', 'floofer', 'puppo', 'doggo'])
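
An alternative to extracting the stage from the text is to combine the four existing columns directly, which also keeps tweets that mention more than one stage. This is a sketch under an assumption not shown in this notebook: that each stage column holds either the stage name or the string `'None'`. The frame `toy` and the helper `combine_stages` are hypothetical.

```python
import numpy as np
import pandas as pd

stage_columns = ['doggo', 'floofer', 'pupper', 'puppo']

# Hypothetical rows; assumes each column holds the stage name or the string 'None'.
toy = pd.DataFrame({
    'doggo': ['doggo', 'None', 'None', 'doggo'],
    'floofer': ['None', 'None', 'None', 'None'],
    'pupper': ['None', 'pupper', 'None', 'pupper'],
    'puppo': ['None', 'None', 'None', 'None'],
})


def combine_stages(row):
    """Join every stage present in the row, or return NaN if there is none."""
    stages = [value for value in row if value != 'None']
    return ', '.join(stages) if stages else np.nan


toy['dog_stage'] = toy[stage_columns].apply(combine_stages, axis=1)
toy = toy.drop(columns=stage_columns)
print(toy)
```

The trade-off is that a combined value such as `doggo, pupper` becomes its own category, whereas the text-based extraction keeps only the first stage matched in the tweet.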

#### Test

In [None]:
clean_twitter_archive.dog_stage.value_counts()

### Issue #7: All the data is related but is divided into three separate dataframes

#### Define
- Merge all the dataframes into one

#### Code

In [None]:
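# merging the Twitter archive data with the data from the Tweepy API,
# then merging the result with the image prediction data
# tweet_id is cast to string in the other two dataframes so the key dtypes match
clean_tweet_data_df.tweet_id = clean_tweet_data_df.tweet_id.astype(str)
clean_df_image_preditions.tweet_id = clean_df_image_preditions.tweet_id.astype(str)
clean_twitter_archive = pd.merge(clean_twitter_archive, clean_tweet_data_df, on='tweet_id', how='left')
clean_twitter_archive = pd.merge(clean_twitter_archive, clean_df_image_preditions, on='tweet_id', how='left')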
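
A left merge keeps every archive row, so it can help to check how many rows found no partner. The sketch below shows one way to do that with the `indicator` option of `pd.merge`; the frames `archive` and `counts` and their values are hypothetical, and the key is cast to string first because pandas refuses to merge a string key with an integer key.

```python
import pandas as pd

# Hypothetical frames: string IDs on the left, integer IDs on the right.
archive = pd.DataFrame({'tweet_id': ['1', '2', '3'], 'rating_numerator': [12, 13, 11]})
counts = pd.DataFrame({'tweet_id': [1, 2], 'retweet_count': [10, 3]})

# Align the key dtype before merging.
counts['tweet_id'] = counts['tweet_id'].astype(str)

# indicator=True adds a _merge column saying where each row came from.
merged = pd.merge(archive, counts, on='tweet_id', how='left', indicator=True)
unmatched = int((merged['_merge'] == 'left_only').sum())

print(merged)
print(unmatched)
```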

#### Test

In [None]:
clean_twitter_archive.info()

## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".

In [None]:
clean_twitter_archive.to_csv('twitter_archive_master.csv', index=False)

## Analyzing and Visualizing Data
In this section, analyze and visualize your wrangled data. You must produce at least **three (3) insights and one (1) visualization.**

In [None]:
# percentage of dog stages visualization
dog_stages_df = clean_twitter_archive.dog_stage.value_counts()

In [None]:
dog_stages_df

In [None]:
# taking the labels from the index so they always match the order of the counts
plt.pie(dog_stages_df, labels=dog_stages_df.index, autopct='%1.1f%%')
plt.title('Percentage of dog stages')
plt.axis('equal')

### Insights:
1. pupper has the highest percentage

2. floofer has the lowest percentage

3. doggo has the second largest percentage
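
The ranking in these insights can be backed with actual percentages using `value_counts(normalize=True)`. The sketch below runs on a hypothetical series named `stages`; its values are invented for illustration and do not reproduce the real distribution.

```python
import pandas as pd

# Hypothetical stage labels; None stands for tweets without a stage.
stages = pd.Series(
    ['pupper', 'pupper', 'pupper', 'doggo', 'doggo', 'puppo', None],
    name='dog_stage',
)

# value_counts ignores missing values, so shares are relative to labelled tweets only.
shares = stages.value_counts(normalize=True).mul(100).round(1)
print(shares)
```

Because missing stages are excluded, the percentages describe only the tweets that mention a stage, which is also what the pie chart shows.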