# Project: Wrangling and Analyze Data
*This project performs data wrangling and exploratory data analysis on the archived data of the WeRateDogs® Twitter account.*

## Data Gathering
In the cells below, **all** three pieces of data for this project are gathered and loaded into the notebook. **Note:** each dataset is gathered with a different method.
1. Direct import of the WeRateDogs Twitter archive data (twitter-archive-enhanced-2.csv)

In [1]:
#Importing necessary libraries
import pandas as pd
import numpy as np
import requests
import json
from IPython.display import Image
import datetime
import re
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
dt1_archive = pd.read_csv('twitter-archive-enhanced-2.csv')  # loading the uploaded archive file
dt1_archive

2. Used the Requests library to download the tweet image predictions (image-predictions.tsv)

In [None]:
#Using the requests library to download the flat file image-predictions.tsv
url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"

r = requests.get(url)
with open('image-predictions.tsv', mode ='wb') as file:
    file.write(r.content)

In [None]:
dt2_image = pd.read_csv('image-predictions.tsv', sep = '\t')
dt2_image.head()

3. Used the provided tweet JSON file (tweet-json.txt)

In [None]:
#Reading the tweet-Json file

dt = pd.read_csv('tweet-json.txt', delimiter = "\t")

In [None]:
#Read the txt file line by line and parse each line as a JSON object into a list:
dt_list = []
with open('tweet-json.txt') as file:
    for line in file:
        dt_list.append(json.loads(line))
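
The line-by-line reading above relies on the file holding one JSON object per line. A minimal, dependency-free sketch of the same idea, using a made-up two-line sample in place of tweet-json.txt (the values and the extra `lang` key are hypothetical), keeps only the three fields needed later:

```python
import io
import json

# Hypothetical two-line sample standing in for tweet-json.txt (values are made up).
sample = io.StringIO(
    '{"id": 1, "favorite_count": 10, "retweet_count": 3, "lang": "en"}\n'
    '{"id": 2, "favorite_count": 25, "retweet_count": 7, "lang": "en"}\n'
)

wanted = ("id", "favorite_count", "retweet_count")
records = []
for line in sample:
    line = line.strip()
    if not line:  # skip blank lines instead of failing in json.loads
        continue
    tweet = json.loads(line)
    # Keep only the wanted keys; a missing key becomes None.
    records.append({key: tweet.get(key) for key in wanted})

print(records)
```

Selecting the keys while reading keeps the list small; passing the full dictionaries to `pd.DataFrame` with a `columns` argument, as the notebook does, gives the same three columns.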

In [None]:
dt_list

In [None]:

tweet_json = pd.DataFrame(dt_list, columns=['id', 'favorite_count', 'retweet_count'])

In [None]:
tweet_json.head()

In [None]:
#changing column 'id' to tweet_id using pandas rename
tweet_json = tweet_json.rename(columns = {'id':'tweet_id'})
tweet_json.head()

In [None]:
tweet_json.to_csv('tweet_json.csv', index=False) #storing the file in csv

data_tweet = pd.read_csv('tweet_json.csv')
data_tweet.head()

## Assessing Data
In this section, I detected and documented at least **eight (8) quality issues and two (2) tidiness issues**, using **both** visual assessment and programmatic assessment.

**(Visual assessment) Each of the three gathered datasets is displayed for visual assessment.**

In [None]:
dt1_archive #Visual assessment twitter archive

A quality issue identified visually in twitter-archive-enhanced-2.csv (dt1_archive): the name column contains invalid or non-standard names. This requires further analysis with the .value_counts() method to examine how often the error occurs.

In [None]:
dt2_image #Visual assessment image predictions

In [None]:
# This is an image for tweet_id 666049248165822465 Visual assessment
Image(url = 'https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg')

The tweet_id is a unique key for the tweet, not for the rated dog. It is better to filter for unique pictures of the dogs, which will also remove duplicates.

In [None]:
data_tweet #Visual assessment for tweet_json


**(Programmatic assessment) Using Pandas' functions to assess each gathered data.**

Assessing the Twitter archive enhanced file (dt1_archive) to identify data quality and tidiness issues

In [None]:
dt1_archive.info()

**Some quality issues:**
Rows where retweeted_status_id, retweeted_status_user_id, in_reply_to_status_id, or in_reply_to_user_id is populated are retweets or replies. If they are not cleaned they might give wrong results, because the same dog picture is repeated in each retweet or reply. The affected tweet_ids should be removed.

The dog "Sierra" appears twice due to a retweet.

Wrong data types, e.g. timestamp is stored as object instead of datetime.

In [None]:
# Subsetting the dt1_archive to find a retweet.
dt1_archive[dt1_archive.name == 'Sierra']

*The Tweet_ids with duplicate retweet status will be dropped.*

In [None]:
dt1_archive.name.value_counts()

**Quality issue:** Some names are invalid (None, a, the, an) and will be replaced with None.

In [None]:
dt1_archive.rating_numerator.describe() #checking the characteristics of the rating numerator

In [None]:
(dt1_archive.rating_numerator < 10).sum() #number of rows with a numerator below 10

In [None]:
dt1_archive[dt1_archive.rating_numerator<10].tweet_id

**Quality issue:** tweet_ids with numerators lower than 10 are treated as incorrect because of the rating convention peculiar to the WeRateDogs Twitter page; they will be dropped.

In [None]:
dt1_archive.rating_denominator.describe() #checking the characteristics of the rating denominator

In [None]:
(dt1_archive.rating_denominator < 10).sum() #number of rows with a denominator below 10

In [None]:
dt1_archive[dt1_archive.rating_denominator<10].tweet_id

In [None]:
dt1_archive[dt1_archive.tweet_id == 810984652412424192]

**Quality issue:** tweet_ids with denominators lower than 10 are incorrect, since the standard denominator is 10. The three tweet_ids in this category must be input errors and will be dropped.
Example:

In [None]:
dt1_archive[dt1_archive.tweet_id == 835246439529840640] #An example with zero denominator instead of ten

**Assessing the image-predictions(dt2_image) to generate data quality and tidiness issues**

In [None]:
# Overview of dt2_image.
dt2_image.info()

In [None]:
# Checking for duplicated images in jpg_url.
sum(dt2_image.jpg_url.duplicated())
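
It can help to be explicit about what this count means: `duplicated()` flags every repeat after the first occurrence, so the sum is the number of extra rows, not the number of rows that share a URL. A small sketch with made-up URLs (not taken from image-predictions.tsv) shows the difference:

```python
import pandas as pd

# Hypothetical URLs, not taken from image-predictions.tsv.
urls = pd.Series(["a.jpg", "b.jpg", "a.jpg", "c.jpg", "a.jpg"])

# Default keep='first': only repeats after the first occurrence are flagged.
repeats = int(urls.duplicated().sum())

# keep=False: every row whose URL appears more than once is flagged.
involved = int(urls.duplicated(keep=False).sum())

# Number of distinct URLs.
distinct = int(urls.nunique())

print(repeats, involved, distinct)
```

The first figure is the number of rows that `drop_duplicates` would remove with its default settings.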

**Some quality issues:** The tweet_id should be converted to string.
The jpg_url column has duplicated images.

**Assessing the tweet_json file(data_tweet) to generate data quality and tidiness issues**

In [None]:
data_tweet.info()

No issues were noticed in the tweet_json file.

*However a general tidiness issue is that the files are related and yet broken into 3. The primary key which is the tweet_id will be used to join them during cleaning.*

### Quality issues
**Assessing the twitter Archive Enhanced File**
1. There are 181 retweets in the file.

2. Presence of invalid dog names (None, a, The, an, etc.)

3. About 440 ratings have a numerator less than 10

4. Some denominators are zero

5. The timestamp column is stored as object instead of datetime

6. Tweet id data type is integer instead of string

**Assessing the Image Prediction File**

7. The jpg_urls are duplicated

8. Some tweet_ids have no images (2075 rows in total instead of 2356)

9. Some of the predicted dog names in the 'p' columns start with lowercase letters and others with capital letters

**Assessing the Tweet_json File**

10. Missing entries (Only 2354 entries, instead of 2356)

### Tidiness issues
1. The dog stage data is separated into four different columns

2. The data files are related but are split across different dataframes

## Cleaning Data
In this section, **all** of the issues documented during assessment are addressed.

Copies of the original data are made first.

In [None]:
# Making copies of original pieces of data

# Copying the dt1_archive.
clean_archive = dt1_archive.copy()

# Copying the dt2_image.
clean_image = dt2_image.copy()

# Copying the data_tweet. 
clean_tweet = data_tweet.copy()

In [None]:
clean_archive.head(1)

In [None]:
clean_image.head(1)

In [None]:
clean_tweet.head(1)

### Issue #1: Cleaning Tidiness Issues 

#### Define: The dog stage data is in four separate columns
**These will be combined into one 'dog_states' column.**

#### Code

In [None]:
# Extract the first dog stage mentioned in the tweet text into the new dog_states column
clean_archive['dog_states'] = clean_archive['text'].str.extract('(doggo|floofer|pupper|puppo)', expand=False)
clean_archive.head()

In [None]:
#dropping unnecessary columns (doggo|floofer|pupper|puppo)
clean_archive = clean_archive.drop(columns = ['doggo', 'floofer', 'pupper', 'puppo'])
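
One caveat of extracting with a single pattern is that only the first match is kept, so a tweet that mentions two stages ends up with one. A minimal sketch, using made-up tweet texts and a hypothetical helper `find_stages`, of listing every stage a text mentions:

```python
import re

# Same alternatives as the extraction pattern used in the notebook.
STAGE_PATTERN = re.compile(r"doggo|floofer|pupper|puppo")

def find_stages(text):
    """Return every distinct stage mentioned in text, in order of first appearance."""
    seen = []
    for stage in STAGE_PATTERN.findall(text):
        if stage not in seen:
            seen.append(stage)
    return seen

# Made-up example texts, not taken from the archive.
print(find_stages("This doggo met a pupper today. 12/10"))
print(find_stages("Just a floofer. 11/10"))
print(find_stages("No stage mentioned here. 10/10"))
```

Rows where the returned list has more than one entry are the ones worth inspecting by hand before settling on a single stage.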

#### Test

In [None]:
clean_archive.dog_states.value_counts()

#### Define: The data files are related but are split across different dataframes
**Merging all the files into one, using tweet_id as the primary key.**

#### Code

In [None]:
#Using the pandas merge function to join the files, into one dataframe.
df = pd.merge(clean_archive, clean_tweet, on='tweet_id', how='left') #df = clean_archive + clean_tweet
df = pd.merge(df, clean_image, on='tweet_id', how='left') #df = df + clean_image
df.head(2)
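
A left merge keeps every archive row, including those with no match in the other tables. A small sketch with made-up miniature tables (the ids, names and counts are hypothetical) shows how the `indicator` option of `pd.merge` can reveal which rows found no partner:

```python
import pandas as pd

# Hypothetical miniature tables; the ids, names and counts are made up.
archive = pd.DataFrame({"tweet_id": [1, 2, 3], "name": ["Rex", "Bo", "Max"]})
counts = pd.DataFrame({"tweet_id": [1, 3], "favorite_count": [10, 30]})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'.
merged = pd.merge(archive, counts, on="tweet_id", how="left", indicator=True)

# Rows present only in the left table have NaN in the right table's columns.
unmatched = merged.loc[merged["_merge"] == "left_only", "tweet_id"].tolist()
print(unmatched)
```

The unmatched ids are exactly the rows that show up with missing values after the merge.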

#### Test

In [None]:
df.info()

### Issue #2: Cleaning Some Quality Issues

#### Define: Q1 There are 181 retweets in the file.

**Only rows where retweeted_status_id is null will be kept.**

#### Code

In [None]:
# Select only rows where retweeted_status_id is null.
df = df[df.retweeted_status_id.isnull()]
df.info()

In [None]:
#dropping the unnecessary retweet- and reply-related columns
df = df.drop(columns=['retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp',
                      'in_reply_to_status_id', 'in_reply_to_user_id' ])

#### Test: Expect all rows and columns devoid of retweet duplicates

In [None]:
df.info()

#### Define: Q2 Presence of invalid dog names (None, a, The, an, etc.) 
**Converting invalid names to None**

#### Code

In [None]:
# Initialization of variable.
invalid_names = []

# Looping over the names to find ordinary words.
for name in df.name:
    # Names written entirely in lowercase (a, the, an, ...) are not real names.
    if name.islower():
        # If so, append to invalid_names.
        invalid_names.append(name)

# Keep only the unique values.
invalid_names = list(set(invalid_names))

# Printing the non-standard names.
invalid_names

In [None]:
# Replace every non-standard name (invalid_names) with "None" in one call.
# Assigning the result back avoids an in-place change on a column accessor.
df['name'] = df['name'].replace(invalid_names, "None")
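
The lowercase-name cleanup can also be expressed as a single vectorized step. The sketch below uses a made-up sample of names and assumes there are no missing values in the column:

```python
import pandas as pd

# Made-up sample of the name column.
names = pd.Series(["Sierra", "a", "the", "None", "Bo", "an"])

# str.islower() is True only for entries written entirely in lowercase;
# mask() replaces exactly those entries and leaves the rest untouched.
cleaned = names.mask(names.str.islower(), "None")
print(cleaned.tolist())
```

This avoids building an intermediate list of invalid names, at the cost of not being able to inspect that list first.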

#### Test

In [None]:
df.name.value_counts()

In [None]:
sum(df.name.isnull())

#### Define: Q3&4 Dealing with numerators less than 10 (about 440) & denominators equal to zero

#### Code

**Standardize the dog ratings:
convert them to float and regularize the ratings**

In [None]:
df['rating_numerator'] = df['rating_numerator'].astype(float)
df['rating_denominator'] = df['rating_denominator'].astype(float)
df.info()

In [None]:
# Loop to gather the text, indices, and ratings of tweets with a decimal numerator
decimal_rating_text = []
decimal_rating_index = []
ratings_in_decimals = []


for i, text in df['text'].items():  # Series.iteritems() was removed in pandas 2.0
    match = re.search(r'(\d+\.\d+)/\d+', text)
    if match:
        decimal_rating_text.append(text)
        decimal_rating_index.append(i)
        # Take the numerator from the rating itself, not from any earlier decimal in the text
        ratings_in_decimals.append(match.group(1))

# The ratings with decimals        
decimal_rating_text

In [None]:
#The indices of the ratings above (having decimal)
decimal_rating_index

In [None]:
#Correctly converting the above decimal ratings to float
for idx, rating in zip(decimal_rating_index, ratings_in_decimals):
    df.loc[idx, 'rating_numerator'] = float(rating)
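
The decimal-rating fix comes down to reading the numerator and denominator from the text with one pattern. A minimal sketch, with made-up tweet texts and a hypothetical helper `extract_rating`, that accepts both whole and decimal numerators:

```python
import re

# Numerator with an optional decimal part, a slash, then the denominator.
RATING_PATTERN = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")

def extract_rating(text):
    """Return (numerator, denominator) as floats from the first 'n/d' in text, or None."""
    match = RATING_PATTERN.search(text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

# Made-up example texts, not taken from the archive.
print(extract_rating("This is Bella. 13.5/10 would pet"))
print(extract_rating("Solid 9/10"))
print(extract_rating("No rating in this one"))
```

Only the first fraction is returned, so a text that contains another fraction before the rating would still need a manual check.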

#### Test

In [None]:
# Testing the indices 
df.loc[695]

In [None]:
Image(url = 'https://pbs.twimg.com/media/CurzvFTXgAA2_AP.jpg') #sample image of index

In [None]:
# A new column called rating is created, calculating the value with new and standardized ratings
df['rating'] = df['rating_numerator'] / df['rating_denominator']
df.head()

#### Define: Q5 & 6 Columns in the wrong data type: timestamp is object instead of datetime, tweet_id is integer instead of string

**The timestamp variable is an object datatype; it will be converted to datetime format using pandas' to_datetime.**

**The tweet_id data type is integer and will be converted to string.**

In [None]:
#Converting to datetime format; the format is inferred, since a hard-coded
#format without a UTC offset can fail on strings that carry one (e.g. "+0000")
df.timestamp = pd.to_datetime(df['timestamp'])
df.timestamp.head()
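
Whether an explicit format string works depends on the exact layout of the raw timestamps. The sketch below assumes, as an illustration only, that a raw value looks like `2017-08-01 16:23:56 +0000`, and uses the standard library to show which format matches it:

```python
from datetime import datetime

# Assumed raw layout (hypothetical example value): date, time, UTC offset.
raw = "2017-08-01 16:23:56 +0000"

# %z consumes the "+0000" offset, giving a timezone-aware datetime.
parsed = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S %z")

# A format that expects fractional seconds and no offset does not match.
try:
    datetime.strptime(raw, "%Y-%m-%d %H:%M:%S.%f")
    strict_format_ok = True
except ValueError:
    strict_format_ok = False

print(parsed.isoformat(), strict_format_ok)
```

If the real strings differ from this assumed layout, the format string has to be adjusted accordingly, or left out so that the parser infers it.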

In [None]:
df.tweet_id = df.tweet_id.astype(str) #from integer to string

#### Test

In [None]:
df.info()

#### Define: Q8. Some tweet_ids have no images (2075 rows in total instead of 2356)

**Delete rows with missing images**

#### Code

In [None]:
df = df[df.jpg_url.notnull()]

#### Test

In [None]:
df.info()

#### Define: Q9. Some of the predicted dog names in the 'p' columns start with lowercase letters and others with capital letters.

**Replace '_' with a space in the p1, p2 and p3 names**

In [None]:
df.p1 = df.p1.str.replace('_', ' ')
df.p2 = df.p2.str.replace('_', ' ')
df.p3 = df.p3.str.replace('_', ' ')

**Convert the names to title case**

In [None]:
df.p1 = df.p1.str.title()
df.p2 = df.p2.str.title()
df.p3 = df.p3.str.title()

#### Test

In [None]:
df.p1.head(10)

In [None]:
df.p2.head(10)

In [None]:
df.p3.head(10)

In [None]:
df.head()

**Quality issues 7 & 10 do not need separate cleaning, since the major rows and columns with duplicates have already been dropped.**
*Overall, the tidiness and quality of this data have been improved.*

## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".

In [None]:
df.to_csv('twitter_archive_master.csv', index=False) #index=False avoids writing the row index as an extra column

## Analyzing and Visualizing Data
In this section, the wrangled data is analyzed and visualized, producing at least **three (3) insights and one (1) visualization.**
The analysis uses the 'df' dataset.

In [None]:
monthly_tweets = df.groupby(pd.Grouper(key = 'timestamp', freq = "M")).count().reset_index()
monthly_tweets = monthly_tweets[['timestamp', 'tweet_id']]
monthly_tweets.head()
monthly_tweets.sum()
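
What the monthly grouping computes can be illustrated without pandas: each timestamp is assigned to its calendar month and the months are tallied. The dates below are made up purely for illustration:

```python
from collections import Counter
from datetime import datetime

# Made-up timestamps, not taken from the archive.
stamps = [
    datetime(2015, 11, 15),
    datetime(2015, 12, 1),
    datetime(2015, 12, 20),
    datetime(2015, 12, 31),
    datetime(2016, 1, 5),
    datetime(2016, 1, 9),
]

# Tally tweets per (year, month) bucket.
per_month = Counter((s.year, s.month) for s in stamps)

# The bucket with the largest count.
busiest = max(per_month, key=per_month.get)

print(sorted(per_month.items()), busiest)
```

Summing the per-month counts gives back the total number of rows, which is a quick sanity check on the grouping.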

In [None]:
# Plotting time vs. tweets

plt.figure(figsize=(10, 10));
plt.xlim([datetime.date(2015, 11, 30), datetime.date(2017, 7, 30)]);

plt.xlabel('Year and Month')
plt.ylabel('Tweets Count')

plt.plot(monthly_tweets.timestamp, monthly_tweets.tweet_id);
plt.title('We Rate Dogs Tweets over Time');

The WeRateDogs® Twitter account reached its highest monthly tweet count by January 2016. Since then the tweet count has declined, with varying spikes in mid-2016.

In [None]:
# Scatterplot of retweets vs favorite count

sns.lmplot(x="retweet_count", 
           y="favorite_count", 
           data=df,
           height = 5,  # 'size' was renamed to 'height' in seaborn 0.9
           aspect=1.3,
           scatter_kws={'alpha':1/5});

plt.title('Favorite Count vs. Retweet Count');
plt.xlabel('WeRateDogs™ Twitter Retweet Count');
plt.ylabel('WeRateDogs™ Twitter Favorite Count');

There is a positive, roughly linear relationship between the favorite count & retweet count of WeRateDogs™ tweets.
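
The strength of such a relationship can be put into a number with the Pearson correlation coefficient. The sketch below computes it by hand on made-up counts that lie exactly on a line, so the expected coefficient is 1; `pearson_r` is a hypothetical helper name:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up counts where favorites = 2 * retweets + 1 (a perfect line).
retweets = [1, 2, 3, 4, 5]
favorites = [3, 5, 7, 9, 11]

r = pearson_r(retweets, favorites)
print(round(r, 6))
```

On the real data the same figure is available directly through `df['retweet_count'].corr(df['favorite_count'])`; a value close to 1 would support the linear reading of the scatterplot.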

**Percentage of different dog stages**

In [None]:
stage_df = df.dog_states.value_counts()
stage_df

### Visualization

In [None]:
#Plotting a pie chart
plt.pie(stage_df,
       labels = stage_df.index.str.title(),  #Labels follow the value_counts order
       autopct = '%1.1f%%',  #Show percentages on the plot, formatted to one decimal place
       shadow = True,
       explode = (0.1, 0.2, 0.2, 0.3)  #Assumes all four stages are present
       )
plt.title('Percentage of Dog Stages')
plt.axis('equal') #Removing the default tilt from matplotlib pie

Pupper has the highest percentage.
Floofer has the lowest percentage.
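
The percentages on the pie chart are shares of the tweets that have a stage at all, since `value_counts()` leaves out missing values by default. A minimal sketch of the arithmetic behind the labels, using made-up counts:

```python
# Made-up stage counts, not the real value_counts output.
counts = {"pupper": 6, "doggo": 3, "puppo": 1}

# Only tweets with a stage enter the total, so the shares add up to 100.
total = sum(counts.values())
shares = {stage: round(100 * n / total, 1) for stage, n in counts.items()}

print(shares)
```

Dividing by the full number of tweets instead would give the share of all tweets that mention each stage, which is a different and much smaller figure.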

### Insights:
1. There is a positive, roughly linear relationship between retweet count & favorite count.

2. Pupper has the highest percentage.

3. Floofer has the lowest percentage.

4. The WeRateDogs® Twitter account's highest tweet count was in January 2016.

**N.B.: I was unable to get elevated access to the Twitter developer API, so I could not retrieve the data through the required API. I used the Udacity-provided tweet-json.txt instead.**
