# Data Wrangling, Analysis and Report on the @dog_rates Twitter Feed

## Content

#### 1. Gather data
   >> `Twitter Archives`
    
   >> `Image Predictions`
    
   >> `Tweets json`

#### 2.  Assess data
>>`Summary of Assessments:`
>>>      Quality
>>>      Tidiness
    

#### 3. Clean data

>> `Define`

>> `Code`

>> `Test`

#### 4. Analyse data
     
   >> `Feature Engineering`
     
  >>`Visualization`

#### 5.  Report
   >> `wrangle_report`
   
   >>`act_report`

### Gathering Data

> Import Required Libraries and Modules

In [2]:
import pandas as pd
import requests
import json
import os

> Dataset 1:  Twitter Archives

In [9]:
tweet_archive_enhanced = './datasets/twitter-archive-enhanced.csv'

In [10]:
df_twitter_archive = pd.read_csv(tweet_archive_enhanced)
df_twitter_archive.head(1)

FileNotFoundError: [Errno 2] File ./datasets/twitter-archive-enhanced.csv does not exist: './datasets/twitter-archive-enhanced.csv'

In [None]:
# number of unique tweets in the archive (tweets where a rating was included)
df_twitter_archive.tweet_id.nunique()

> Dataset 2: Image Predictions

In [None]:
img_predictions_link = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'

In [None]:
res = requests.get(img_predictions_link)

In [None]:
# list the contents of the datasets directory
!ls datasets

In [None]:
# make sure the datasets directory exists before saving the downloaded data
!mkdir -p datasets

In [None]:
# write downloaded data into file
with open('datasets/img_pred.tsv', mode='wb') as file:
    file.write(res.content)

In [None]:
df_img_pred = pd.read_csv('datasets/img_pred.tsv', sep='\t')
df_img_pred.head(2)

> Dataset 3: Json Tweets

In [None]:
tweet_json = 'datasets/tweet-json-2.txt'

In [None]:
# parse each line of the file as one JSON-encoded tweet
tweets = []
with open(tweet_json, mode='rb') as file:
    for line in file:
        tweet = json.loads(line)
        tweets.append(tweet)

In [None]:
df_tweets = pd.DataFrame(tweets)

> read this .txt file line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count.

In [None]:
df_tweets.head(1)
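
The requirement quoted above (at minimum tweet ID, retweet count and favorite count) can be sketched as follows. This is only an illustration: the helper name `load_tweet_counts` and the two-line sample are assumptions, and the field names `id`, `retweet_count` and `favorite_count` are taken from the columns this notebook selects from the JSON data.

```python
import io
import json

import pandas as pd


def load_tweet_counts(lines):
    """Build a DataFrame with tweet id, retweet count and favorite count.

    `lines` is any iterable of JSON strings, one tweet per line
    (hypothetical helper, not part of the original notebook).
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        tweet = json.loads(line)
        records.append({
            'tweet_id': tweet['id'],
            'retweet_count': tweet['retweet_count'],
            'favorite_count': tweet['favorite_count'],
        })
    return pd.DataFrame(
        records, columns=['tweet_id', 'retweet_count', 'favorite_count']
    )


# Hypothetical two-line sample standing in for the real tweet JSON file
sample = io.StringIO(
    '{"id": 1, "retweet_count": 5, "favorite_count": 10, "lang": "en"}\n'
    '{"id": 2, "retweet_count": 0, "favorite_count": 3, "lang": "en"}\n'
)
df_counts = load_tweet_counts(sample)
```

With the real data, an open file handle (for example one opened on `tweet_json`) could be passed in place of `sample`.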

### Assessing Data

##### After gathering each of the above pieces of data, assess them visually and programmatically for quality and tidiness issues. 
##### Detect and document at least eight `(8) quality issues and two (2) tidiness issues` in your wrangle_act.ipynb Jupyter Notebook. 
##### To meet specifications, the issues that satisfy the Project Motivation (see the Key Points header on the previous page) must be assessed.

> Assessment 1:  Twitter Archive

In [None]:
df_twitter_archive

In [None]:
#check data size
df_twitter_archive.tweet_id.nunique()

In [None]:
#gather general info about data
df_twitter_archive.info()

In [None]:
#take a look at the list of unique names
df_twitter_archive.name.unique()

In [None]:
# check the counts of names
unique_names = (df_twitter_archive.name.value_counts())
unique_names[:20]

In [None]:
len(unique_names)

In [None]:
# There are invalid names such as None and a

In [None]:
# inspect rows where name is None
df_twitter_archive.query('name=="None"')

> inspect text where name is None

In [None]:
df_twitter_archive.query('name=="None"')['text'].values.tolist()[:20]

> inspect text where name is not None

In [None]:
df_twitter_archive.query('name!="None"')['text'].values.tolist()[:20]

##### From the above queries, we can observe that in rows where the dog's name is not available, the tweet text does not contain a name either

#### Inspect ratings (numerator & denominator)

In [None]:
ratings = df_twitter_archive[['rating_numerator', 'rating_denominator']]

In [None]:
ratings.rating_numerator.value_counts()

In [None]:
ratings.rating_denominator.value_counts()

> Some rating values (numerator & denominator) appear to be incorrect
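
One possible way to surface the suspicious ratings programmatically is sketched below. The sample frame is made up and only reuses the archive's column names; the threshold of 20 for an implausibly large numerator is an arbitrary assumption, not a rule from the project.

```python
import pandas as pd

# Hypothetical sample with the same column names as the archive
sample = pd.DataFrame({
    'rating_numerator': [13, 420, 9, 24],
    'rating_denominator': [10, 10, 0, 7],
})

# Denominators other than 10 (including 0) deserve a manual look
bad_denominator = sample[sample.rating_denominator != 10]

# Numerators above an assumed threshold of 20 are flagged for review
high_numerator = sample[sample.rating_numerator > 20]
```

Flagged rows still need to be checked against the tweet text, since some unusual ratings may be intentional.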

> Check for null values

In [None]:
df_twitter_archive.isnull().any()

>

##### Most of the columns containing null values are not necessary; they are basically id columns and retweet columns.
##### The project instructions place emphasis on tweets and not retweets

> examine values of the dog type columns

In [None]:
df_twitter_archive[['doggo','floofer' ,'pupper', 'puppo']].values.tolist()[:3]

>  examine values in tweet source

In [None]:
df_twitter_archive.source.values

In [None]:
df_twitter_archive.source.unique()

In [None]:
# we only need the client type (e.g. iPhone, Android, etc.) of the tweet source, not the link

In [None]:
# view expanded urls column
df_twitter_archive.expanded_urls.values


> Assessment 2: Image Predictions

In [None]:
df_img_pred

> check unique size of dataset

In [None]:
df_img_pred.tweet_id.nunique()

In [None]:
df_img_pred.info()

> check for null values

In [None]:
df_img_pred.isnull().any().sum()

> describe values

In [None]:
df_img_pred.describe()

> examine the p1, p2 and p3 columns

In [None]:
df_img_pred[['p1', 'p2', 'p3']]

In [None]:
df_img_pred[['p1', 'p2', 'p3']].describe()


> Assessment 3: Json Tweets

In [None]:
df_tweets

In [None]:
# gather info about data
df_tweets.info()

In [None]:
# check data types
df_tweets.dtypes

In [None]:
# count tweets that are not flagged as retweeted
df_tweets.query('retweeted == False').shape[0]

###### _None of the tweets here are retweets_

In [None]:
# examine language column
#### we might have to create a dictionary to map the language abbreviations to their full names
df_tweets.lang.value_counts()
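
The dictionary idea mentioned in the comment above could look like the sketch below. The mapping is an assumption and is deliberately incomplete; it would need to be extended with whatever codes `value_counts()` actually reports.

```python
import pandas as pd

# Assumed mapping from language code to full name (incomplete on purpose)
lang_names = {
    'en': 'English',
    'nl': 'Dutch',
    'es': 'Spanish',
    'und': 'Undetermined',
}

# Hypothetical sample of the lang column; 'xx' stands for an unmapped code
lang = pd.Series(['en', 'en', 'nl', 'und', 'xx'])

# Codes missing from the dictionary keep their original value instead of NaN
lang_full = lang.map(lang_names).fillna(lang)
```

Falling back to the original code keeps unmapped values visible, so they can be added to the dictionary later.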

In [None]:
# examine entities columns
df_tweets[['entities','extended_entities']]

> at minimum the tweets dataframe should contain the following columns,

>  tweet ID, retweet count, and favorite count
> hence we would examine them

In [None]:
main_tweet_cols = df_tweets[['id','retweet_count','favorite_count']]
main_tweet_cols.info()

> since these columns look pretty good (they do not have null values),

> we might as well use only them and a few other columns that are also complete. Before we make that decision, let's report a general assessment of the quality and tidiness issues that have been found in this dataset and the others

#### Assessment Summary

#### Quality Issues

##### Twitter Archives
<ul>
    
<li>Some columns (_retweet and id columns_) contain null values and are not really needed</li>

<li>Columns on dog stages (_doggo, pupper etc._) contain mostly null values</li>

<li>source column values not in correct format</li>

<li>in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id ought to be of integer data type rather than float (they hold ids similar to tweet_id)</li>

<li>rating denominator has a value of 0</li>
<li>Incorrect denominator rating values: values greater than 10, some quite outrageous</li>

<li>Incorrect numerator rating values; we have outrageous values such as 420, 1776</li>
</ul>

    
    
##### Image Predictions

<ul>   
<li>The string values in the p1, p2, and p3 columns (_breed predictions_ by the neural network) are not in a uniform format</li>

<li>The unique count of ids is lower than in the archive: missing data, which indicates that certain tweet ids do not have images</li>

<li>columns such as tweet_id are not in their correct datatype</li>
</ul>

##### Json Tweets
<ul>
<li>The column datatype is int instead of string</li>
<li>language column values are abbreviations and are not meaningful on their own</li>
<li>data types for tweet_id, retweet count and favorite count are object instead of integer</li>
</ul>

#### Tidiness

##### Twitter Archives
<ul>
<li>dog stages are spread across different columns in twitter_archive; they belong to the same category and ought to be a single variable, which also reduces the number of NaN values</li>
</ul>

##### Image Predictions
<ul>
<li>the image predictions dataset needs to be joined with the twitter archive rather than kept as a separate dataset</li>
<li>generally we have 3 separate datasets rather than one master dataset;
    we have to find a way to join useful features from the 3 separate datasets into one master dataset</li>
</ul>

##### Json Tweets
<ul>
<li>multiple id columns in json_tweets data</li>
<li>columns like (_truncated, display_text_range, entities, extended_entities, source, in_reply_to_status_id_) are not required for analysis</li>
<li>the dataset needs to be joined with the twitter archive data</li>
</ul>
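
The tidiness issues above all point toward a single master dataset. A minimal sketch of such a join is shown below, assuming the JSON data's `id` column corresponds to `tweet_id` in the other two tables. The three small frames are invented stand-ins, and the choice of an inner join (keeping only tweets present in all three tables) is an assumption rather than a project requirement.

```python
import pandas as pd

# Hypothetical stand-ins for the three gathered datasets
archive = pd.DataFrame({'tweet_id': [1, 2, 3], 'rating_numerator': [13, 12, 11]})
images = pd.DataFrame({'tweet_id': [1, 2], 'p1': ['collie', 'pug']})
counts = pd.DataFrame({
    'id': [1, 2, 3],
    'retweet_count': [5, 0, 7],
    'favorite_count': [10, 3, 9],
})

# Align the key name before merging
counts = counts.rename(columns={'id': 'tweet_id'})

# Inner joins keep only tweets that appear in every table
master = (
    archive
    .merge(images, on='tweet_id', how='inner')
    .merge(counts, on='tweet_id', how='inner')
)
```

The key columns must share a data type before merging, so any type conversion of the id columns should happen first.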


### Cleaning Data

##### Clean each of the issues you documented while assessing. Perform this cleaning in wrangle_act.ipynb as well. 
##### The result should be a high quality and tidy master pandas DataFrame (or DataFrames, if appropriate). 
##### Again, the issues that satisfy the Project Motivation must be cleaned.

> Create copies of Dataframe before cleaning

In [None]:
twitter_archive_copy = df_twitter_archive.copy()

In [None]:
img_pred_copy = df_img_pred.copy()

In [1]:
tweets_copy = df_tweets.copy()

NameError: name 'df_tweets' is not defined

> Load datasets copy

In [383]:
twitter_archive_copy.head(1)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,



In [384]:
img_pred_copy.head(1)

Unnamed: 0,tweet_id,jpg_url,img_num,p1,p1_conf,p1_dog,p2,p2_conf,p2_dog,p3,p3_conf,p3_dog
0,666020888022790149,https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg,1,Welsh_springer_spaniel,0.465074,True,collie,0.156665,True,Shetland_sheepdog,0.061428,True



In [385]:
tweets_copy.head(1)

Unnamed: 0,created_at,id,id_str,full_text,truncated,display_text_range,entities,extended_entities,source,in_reply_to_status_id,...,favorite_count,favorited,retweeted,possibly_sensitive,possibly_sensitive_appealable,lang,retweeted_status,quoted_status_id,quoted_status_id_str,quoted_status
0,Tue Aug 01 16:23:56 +0000 2017,892420643555336193,892420643555336193,This is Phineas. He's a mystical boy. Only eve...,False,"[0, 85]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 892420639486877696, 'id_str'...","<a href=""http://twitter.com/download/iphone"" r...",,...,39467,False,False,False,False,en,,,,


##### Clean Twitter Archives
         Quality issues

  > 1. Some columns (_retweet & id columns_) contain null values and are not really needed
  > 2. Columns on dog stages (_doggo, pupper etc._) contain mostly null values
  > 3. source column values not in correct format
  > 4. expanded urls column not needed. The only useful information needed from it (tweet id) is already available
       in a separate column
  > 5. all retweet columns (_retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp_) not
      required
  > 6. Incorrect denominator rating values: values greater than 10, some quite outrageous
  > 7. Incorrect numerator rating values; we have outrageous values such as 420, 1776

columns with null values 
> Define

>>Drop columns with null values which are also not useful for analysis



   

> Code

In [386]:
twitter_archive_copy.isnull().any()

tweet_id                      False
in_reply_to_status_id          True
in_reply_to_user_id            True
timestamp                     False
source                        False
text                          False
retweeted_status_id            True
retweeted_status_user_id       True
retweeted_status_timestamp     True
expanded_urls                  True
rating_numerator              False
rating_denominator            False
name                          False
doggo                         False
floofer                       False
pupper                        False
puppo                         False
dtype: bool

In [387]:
twitter_archive_copy.columns

Index(['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
       'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator',
       'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo'],
      dtype='object')

In [388]:
drop_cols = ['in_reply_to_status_id', 'in_reply_to_user_id','retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls']
twitter_archive_copy.drop(columns=drop_cols, inplace=True)

> Test 

In [389]:
twitter_archive_copy.isnull().any()

tweet_id              False
timestamp             False
source                False
text                  False
rating_numerator      False
rating_denominator    False
name                  False
doggo                 False
floofer               False
pupper                False
puppo                 False
dtype: bool

In [390]:
twitter_archive_copy

Unnamed: 0,tweet_id,timestamp,source,text,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,13,10,Phineas,,,,
1,892177421306343426,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,13,10,Tilly,,,,
2,891815181378084864,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,12,10,Archie,,,,
3,891689557279858688,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,13,10,Darla,,,,
4,891327558926688256,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,12,10,Franklin,,,,
...,...,...,...,...,...,...,...,...,...,...,...
2351,666049248165822465,2015-11-16 00:24:50 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here we have a 1949 1st generation vulpix. Enj...,5,10,,,,,
2352,666044226329800704,2015-11-16 00:04:52 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a purebred Piers Morgan. Loves to Netf...,6,10,a,,,,
2353,666033412701032449,2015-11-15 23:21:54 +0000,"<a href=""http://twitter.com/download/iphone"" r...",Here is a very happy pup. Big fan of well-main...,9,10,a,,,,
2354,666029285002620928,2015-11-15 23:05:30 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is a western brown Mitsubishi terrier. Up...,7,10,a,,,,


Dog Stages Columns with None Values


> Define 

Inspect the text column to check whether the dog stage name was properly extracted

If extraction is feasible, use the extract function and regex to get the dog stage type out

In [391]:
twitter_archive_copy.doggo.apply(lambda x: "doggo " in twitter_archive_copy.text and  "doggo" or  "None").value_counts()

None    2356
Name: doggo, dtype: int64
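
The check above returns `None` for every row because `"doggo " in twitter_archive_copy.text` tests membership in the Series index labels, not in the tweet text. A vectorized alternative in the spirit of the Define step (extract plus regex) is sketched below. The sample tweets are invented, and taking only the first stage mentioned in a tweet is an assumption.

```python
import pandas as pd

# Hypothetical tweet texts standing in for twitter_archive_copy.text
text = pd.Series([
    'This is Bo. He is a good doggo',
    'Here we have a sleepy pupper',
    'No stage mentioned here',
    'Such a floofer',
])

# (?i) makes the match case-insensitive; \b keeps a stage name from
# matching inside a longer word. Only the first match per tweet is kept.
stage = text.str.extract(
    r'(?i)\b(doggo|floofer|pupper|puppo)\b', expand=False
).str.lower()
```

Rows without any stage word come back as NaN, which makes the missing values explicit instead of the string "None".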

Source column values not in correct format
> Define

extract the tweet source (iPhone, etc.) from the HTML anchor string using the pandas extract function

> Code

In [500]:
twitter_archive_copy.source.str.split(pat="for ", expand=True)[1]\
.str.split(pat="<", expand=True)[0].unique()

array(['iPhone', None], dtype=object)

In [436]:
twitter_archive_copy['rsource'] = twitter_archive_copy.source.str.split('for ')

In [502]:
twitter_archive_copy.source.unique()

array(['<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
       '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
       '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
       '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>'],
      dtype=object)

In [522]:
import re
sr = re.compile('>(iPhone|WebClient|TweetDeck)')

In [523]:
twitter_archive_copy.source.apply(lambda x: sr.match(x)).value_counts()

Series([], Name: source, dtype: int64)

In [438]:
twitter_archive_copy.rsource.values

array([list(['<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter ', 'iPhone</a>']),
       list(['<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter ', 'iPhone</a>']),
       list(['<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter ', 'iPhone</a>']),
       ...,
       list(['<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter ', 'iPhone</a>']),
       list(['<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter ', 'iPhone</a>']),
       list(['<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter ', 'iPhone</a>'])],
      dtype=object)
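
Neither attempt above isolates the client name for every source. Splitting on `"for "` only works for the iPhone string, and `re.match` only matches at the start of a string, while every source value starts with `<a href`. A sketch using the pandas extract function named in the Define step is given below; it captures the anchor text, and the four sample strings are the unique values printed earlier. Keeping the full anchor text (for example "Twitter for iPhone") rather than shortening it is a choice made here for illustration.

```python
import pandas as pd

# The four unique source values shown in the output above
source = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
    '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
    '<a href="http://vine.co" rel="nofollow">Vine - Make a Scene</a>',
    '<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>',
])

# Capture everything between the end of the opening tag and '</a>'
client = source.str.extract(r'>([^<]+)</a>', expand=False)
```

Applied to the notebook's data, the same call on `twitter_archive_copy.source` would produce one readable client label per tweet.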

###  Store, Analyze & Visualize Data

#### Store the clean DataFrame(s) in a CSV file with the main one named `twitter_archive_master.csv`. 
#### If additional files exist because multiple tables are required for tidiness, name these files appropriately. 
#### Additionally, you may store the cleaned data in a SQLite database (which is to be submitted as well if you do).
#### Analyze and visualize your wrangled data in your wrangle_act.ipynb Jupyter Notebook. 
#### `At least three (3) insights and one (1) visualization must be produced`.
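
The storage step described above could be sketched as follows. The `master` frame is a tiny invented stand-in for the cleaned data; the temporary directory, the database file name and the table name `master` are all assumptions chosen for illustration. Only the CSV name `twitter_archive_master.csv` comes from the instructions.

```python
import os
import sqlite3
import tempfile

import pandas as pd

# Hypothetical cleaned master frame; ids kept as strings
master = pd.DataFrame({'tweet_id': ['1', '2'], 'retweet_count': [5, 0]})

out_dir = tempfile.mkdtemp()  # stand-in for the project directory
csv_path = os.path.join(out_dir, 'twitter_archive_master.csv')
db_path = os.path.join(out_dir, 'twitter_archive_master.db')  # assumed name

# Store the master DataFrame as CSV without the index column
master.to_csv(csv_path, index=False)

# Optionally store the same data in a SQLite database
conn = sqlite3.connect(db_path)
try:
    master.to_sql('master', conn, index=False, if_exists='replace')
    restored = pd.read_sql('SELECT * FROM master', conn)
finally:
    conn.close()
```

Reading the table back right after writing is a cheap way to confirm that the stored data matches the DataFrame.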