# WE RATE DOGS DATA ANALYSIS

## TABLE OF CONTENTS
1. [Introduction](https://viewf6b31853.udacity-student-workspaces.com/notebooks/wrangle_act.ipynb#Introduction)

2. [Gathering Data](https://viewf6b31853.udacity-student-workspaces.com/notebooks/wrangle_act.ipynb#Data-Gathering)

3. [Assessing Data](https://viewf6b31853.udacity-student-workspaces.com/notebooks/wrangle_act.ipynb#Assessing-Data)

4. [Cleaning Data](https://viewf6b31853.udacity-student-workspaces.com/notebooks/wrangle_act.ipynb#Cleaning-Data)

5. [Storing Data](https://viewf6b31853.udacity-student-workspaces.com/notebooks/wrangle_act.ipynb#Storing-Data)

6. [Analyzing and Visualizing Data](https://viewf6b31853.udacity-student-workspaces.com/notebooks/wrangle_act.ipynb#Analyzing-and-Visualizing-Data)

### Introduction

#### Project Goal

My goal is to wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. The Twitter archive is great, but it only contains very basic tweet information. Additional gathering, then assessing and cleaning is required for "Wow!"-worthy analyses and visualizations.

#### The Data
In this project, I will work with the following three datasets.

#### Enhanced Twitter Archive

The WeRateDogs Twitter archive contains basic tweet data for all 5000+ of their tweets, but not everything. One column the archive does contain, though, is each tweet's text, which was used to extract the rating, dog name, and dog "stage" (i.e. doggo, floofer, pupper, and puppo) to make this Twitter archive "enhanced." Of the 5000+ tweets, only the tweets with ratings have been kept (there are 2356).

The data from each tweet's text was extracted programmatically, but not always correctly. The ratings probably aren't all correct, and the same goes for the dog names and probably the dog stages too. I'll need to assess and clean these columns before using them for analysis and visualization.

#### Additional Data via the Twitter API

Back to the basic-ness of Twitter archives: retweet count and favorite count are two of the notable column omissions. Fortunately, this additional data can be gathered from Twitter's API. Normally the API only gives access to data for the 3000 most recent tweets, but because the WeRateDogs Twitter archive contains the tweet IDs, the data can be gathered for every tweet in the archive. The additional data gathered via Twitter's API is stored in tweet-json.txt.

#### Image Predictions File

Every image in the WeRateDogs Twitter archive was run through a neural network that can classify breeds of dogs. The result is a table of image predictions (the top three only) alongside each tweet ID, image URL, and the image number that corresponded to the most confident prediction (numbered 1 to 4 since tweets can have up to four images).

Image predictions

tweet_id is the last part of the tweet URL after "status/" → https://twitter.com/dog_rates/status/889531135344209921

p1 is the algorithm's #1 prediction for the image in the tweet → golden retriever

p1_conf is how confident the algorithm is in its #1 prediction → 95%

p1_dog is whether or not the #1 prediction is a breed of dog → TRUE

p2 is the algorithm's second most likely prediction → Labrador retriever

p2_conf is how confident the algorithm is in its #2 prediction → 1%

p2_dog is whether or not the #2 prediction is a breed of dog → TRUE
etc.
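
As a rough illustration of how these columns fit together, the sketch below picks the highest-ranked prediction that is flagged as a dog breed. The rows are made up and the helper name best_dog_breed is hypothetical; it is not part of the project code.

```python
import pandas as pd

# Made-up rows that mimic the layout of the image predictions file.
predictions = pd.DataFrame({
    'tweet_id': ['1', '2', '3'],
    'p1': ['golden_retriever', 'bagel', 'seat_belt'],
    'p1_conf': [0.95, 0.60, 0.50],
    'p1_dog': [True, False, False],
    'p2': ['Labrador_retriever', 'pug', 'banana'],
    'p2_conf': [0.01, 0.30, 0.20],
    'p2_dog': [True, True, False],
    'p3': ['kuvasz', 'chow', 'orange'],
    'p3_conf': [0.005, 0.05, 0.10],
    'p3_dog': [True, True, False],
})


def best_dog_breed(row):
    """Return the highest-ranked prediction flagged as a dog breed, or None."""
    for rank in (1, 2, 3):
        if row[f'p{rank}_dog']:
            return row[f'p{rank}']
    return None


# One value per tweet: the first prediction that is a dog, if any.
predictions['best_breed'] = predictions.apply(best_dog_breed, axis=1)
```

Rows where all three flags are false end up without a breed, which is the same group of rows that is dropped later during cleaning.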


In [1]:
#importing all libraries needed for analysis
import pandas as pd 
import requests
import json
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
#upgrade seaborn and import the seaborn library
!pip install seaborn==0.11.0
import seaborn as sns
sns.__version__



'0.11.0'

## Data Gathering

In [3]:
#programmatically importing the twitter_archive_enhanced csv file
dogs_rating = pd.read_csv('twitter-archive-enhanced.csv')

In [4]:
#using the requests library to download the tweet image prediction tsv file
url='https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response=requests.get(url)
with open('image-predictions.tsv','wb') as file:
    file.write(response.content)

In [5]:
#read the downloaded tsv file into a dataframe
image_prediction=pd.read_csv('image-predictions.tsv',sep='\t')

In [6]:
# load the additional data gathered via the Twitter API (tweet-json.txt, one JSON object per line)
tweet=pd.read_json('tweet-json.txt',lines=True)
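
Since the file holds one JSON object per line, it could also be read line by line with the json module, keeping only the fields needed and storing the tweet ID as a string from the start. This is only a sketch: the two records below are made up and stand in for lines of tweet-json.txt, and the name tweet_counts is hypothetical.

```python
import io
import json

import pandas as pd

# Two made-up records standing in for lines of tweet-json.txt.
sample_lines = io.StringIO(
    '{"id": 892420643555336193, "retweet_count": 10, "favorite_count": 50}\n'
    '{"id": 892177421306343426, "retweet_count": 7, "favorite_count": 30}\n'
)

records = []
for line in sample_lines:
    tweet_data = json.loads(line)
    records.append({
        # Python integers are exact, so converting to str keeps every digit.
        'tweet_id': str(tweet_data['id']),
        'retweet_count': tweet_data['retweet_count'],
        'favorite_count': tweet_data['favorite_count'],
    })

tweet_counts = pd.DataFrame(records)
```

With the real file, the io.StringIO object would be replaced by an open file handle.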


## Assessing Data
I only want original ratings (no retweets) that have images. Though there are 5000+ tweets in the dataset, not all are dog ratings.

## Assessing dogs_rating data

In [7]:
#get the top five rows of the dataframe
dogs_rating.head(5)

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,timestamp,source,text,retweeted_status_id,retweeted_status_user_id,retweeted_status_timestamp,expanded_urls,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
0,892420643555336193,,,2017-08-01 16:23:56 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Phineas. He's a mystical boy. Only eve...,,,,https://twitter.com/dog_rates/status/892420643...,13,10,Phineas,,,,
1,892177421306343426,,,2017-08-01 00:17:27 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Tilly. She's just checking pup on you....,,,,https://twitter.com/dog_rates/status/892177421...,13,10,Tilly,,,,
2,891815181378084864,,,2017-07-31 00:18:03 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Archie. He is a rare Norwegian Pouncin...,,,,https://twitter.com/dog_rates/status/891815181...,12,10,Archie,,,,
3,891689557279858688,,,2017-07-30 15:58:51 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Darla. She commenced a snooze mid meal...,,,,https://twitter.com/dog_rates/status/891689557...,13,10,Darla,,,,
4,891327558926688256,,,2017-07-29 16:00:24 +0000,"<a href=""http://twitter.com/download/iphone"" r...",This is Franklin. He would like you to stop ca...,,,,https://twitter.com/dog_rates/status/891327558...,12,10,Franklin,,,,


In [8]:
#information about all columns in the dataframe
dogs_rating.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), object(10)

In [9]:
#total rows and columns 
dogs_rating.shape

(2356, 17)

In [10]:
#checking for duplicated rows
dogs_rating.duplicated().sum()

0

In [11]:
#each unique rating numerator and its value count
dogs_rating['rating_numerator'].value_counts()

12      558
11      464
10      461
13      351
9       158
8       102
7        55
14       54
5        37
6        32
3        19
4        17
1         9
2         9
420       2
0         2
15        2
75        2
80        1
20        1
24        1
26        1
44        1
50        1
60        1
165       1
84        1
88        1
144       1
182       1
143       1
666       1
960       1
1776      1
17        1
27        1
45        1
99        1
121       1
204       1
Name: rating_numerator, dtype: int64

In [12]:
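#show the full text of each column instead of truncating it with ellipses
#(None means no limit on pandas 1.0+; older versions used -1)
pd.set_option("display.max_colwidth", None)
# checking that the full text is now displayed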
dogs_rating[dogs_rating['tweet_id']==668625577880875008][['retweeted_status_user_id','in_reply_to_user_id','expanded_urls','text','rating_numerator','rating_denominator','name','doggo','floofer','pupper','puppo']]

Unnamed: 0,retweeted_status_user_id,in_reply_to_user_id,expanded_urls,text,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
2208,,,https://twitter.com/dog_rates/status/668625577880875008/photo/1,This is Maks. Maks just noticed something wasn't right. 10/10 https://t.co/0zBycaxyvs,10,10,Maks,,,,


In [13]:
#each unique rating denominator and its value count
dogs_rating['rating_denominator'].value_counts()

10     2333
11     3   
50     3   
80     2   
20     2   
2      1   
16     1   
40     1   
70     1   
15     1   
90     1   
110    1   
120    1   
130    1   
150    1   
170    1   
7      1   
0      1   
Name: rating_denominator, dtype: int64

In [14]:
#checking the row with rating_denominator of '0'
dogs_rating[dogs_rating['rating_denominator']==0][['tweet_id','retweeted_status_user_id','in_reply_to_user_id','expanded_urls','text','rating_numerator','rating_denominator','name','doggo','floofer','pupper','puppo']]

Unnamed: 0,tweet_id,retweeted_status_user_id,in_reply_to_user_id,expanded_urls,text,rating_numerator,rating_denominator,name,doggo,floofer,pupper,puppo
313,835246439529840640,,26259576.0,,"@jonnysun @Lin_Manuel ok jomny I know you're excited but 960/00 isn't a valid rating, 13/10 is tho",960,0,,,,,


In [15]:
#checking if all column headers are in lowercase and separated by underscores
dogs_rating.columns

Index(['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
       'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id',
       'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator',
       'rating_denominator', 'name', 'doggo', 'floofer', 'pupper', 'puppo'],
      dtype='object')

In [16]:
#datatype for each column
dogs_rating.dtypes

tweet_id                      int64  
in_reply_to_status_id         float64
in_reply_to_user_id           float64
timestamp                     object 
source                        object 
text                          object 
retweeted_status_id           float64
retweeted_status_user_id      float64
retweeted_status_timestamp    object 
expanded_urls                 object 
rating_numerator              int64  
rating_denominator            int64  
name                          object 
doggo                         object 
floofer                       object 
pupper                        object 
puppo                         object 
dtype: object

In [17]:
#statistical summary of all quantitative columns
dogs_rating.describe()

Unnamed: 0,tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,rating_numerator,rating_denominator
count,2356.0,78.0,78.0,181.0,181.0,2356.0,2356.0
mean,7.427716e+17,7.455079e+17,2.014171e+16,7.7204e+17,1.241698e+16,13.126486,10.455433
std,6.856705e+16,7.582492e+16,1.252797e+17,6.236928e+16,9.599254e+16,45.876648,6.745237
min,6.660209e+17,6.658147e+17,11856340.0,6.661041e+17,783214.0,0.0,0.0
25%,6.783989e+17,6.757419e+17,308637400.0,7.186315e+17,4196984000.0,10.0,10.0
50%,7.196279e+17,7.038708e+17,4196984000.0,7.804657e+17,4196984000.0,11.0,10.0
75%,7.993373e+17,8.257804e+17,4196984000.0,8.203146e+17,4196984000.0,12.0,10.0
max,8.924206e+17,8.862664e+17,8.405479e+17,8.87474e+17,7.874618e+17,1776.0,170.0


In [None]:
#checking expanded_urls for null values
dogs_rating[dogs_rating['expanded_urls'].isnull()]

In [None]:
#checking if all tweet id are unique
dogs_rating['tweet_id'].nunique()

## Assessing image_prediction data

In [None]:
#30 random rows of the dataframe
image_prediction.sample(30)

In [None]:
#unique values in img_num column
image_prediction['img_num'].unique()

In [None]:
#total rows and column 
image_prediction.shape

In [None]:
#datatype for all columns
image_prediction.dtypes

In [None]:
#checking for duplicated rows
image_prediction.duplicated().sum()

In [None]:
#checking if all values in tweet_id are unique,because there shouldn't be any duplicated tweet_id
image_prediction.tweet_id.nunique()

In [None]:
#information about all columns in the dataframe
image_prediction.info()

## Assessing tweet data

In [None]:
#5 random rows of the dataframe
tweet.sample(5)

In [None]:
# total rows and column
tweet.shape

In [None]:
#information about all columns in the dataframe
tweet.info()

In [None]:
#datatype for all columns
tweet.dtypes

In [None]:
tweet.favorited.value_counts()

In [None]:
#ensure no duplicated tweet_id
tweet.id.nunique()

Checking whether it is id or id_str in the tweet dataset that matches tweet_id in the other datasets

In [None]:
#10 random samples of the id column
tweet.id.sample(10)

In [None]:
#rows where tweet.id_str is not equal to tweet.id
tweet[tweet.id_str!=tweet.id][['id','id_str']].sample(10)

In [None]:
#searching for a particular tweet_id_str in dogs_rating which could not be found
dogs_rating.loc[dogs_rating['tweet_id']==693590843962331136]

In [None]:
#searching for a particular tweet_id in dogs_rating, which  could be found
dogs_rating.loc[dogs_rating['tweet_id']==693590843962331137]
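
One plausible explanation for two IDs that differ only in the last digit is floating-point rounding, although I have not confirmed that this is what happens to id_str when the file is read. The sketch below uses the ID from the cell above to show that a 64-bit float cannot hold it exactly, which is also a reason to store tweet IDs as strings.

```python
tweet_id = 693590843962331137

# float64 has a 53-bit significand, so an integer this large is rounded
# to the nearest representable value when it is converted to a float.
round_tripped = int(float(tweet_id))

# Storing the ID as text avoids the rounding entirely.
as_text = str(tweet_id)

print(round_tripped)
print(as_text)
```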

checking the is_quote_status column

In [None]:
tweet.is_quote_status.value_counts()

In [None]:
tweet.loc[tweet.is_quote_status==True][['id']].sample(5)

In [None]:
dogs_rating.loc[dogs_rating['tweet_id']==804475857670639616]

In [None]:
image_prediction.loc[image_prediction['tweet_id']==804475857670639616]

checking lang column

In [None]:
tweet.lang.value_counts()

checking place column

In [None]:
#place may hold nested values, so compare their string form
tweet.place.astype(str).unique()

checking retweeted column

In [None]:
tweet['retweeted'].value_counts()

In [None]:
#all columns in the dataframe
tweet.columns

## Quality issues

1. tweet_id, rating_numerator and rating_denominator have the wrong datatype

2. timestamp has the wrong datatype

3. name column values are not consistently lowercase

4. Ambiguous column headers in image_prediction (p1, p1_conf, p1_dog, etc.)

5. Erroneous column datatype (tweet_id in image_prediction and tweet)

6. Some of the image predictions are not dogs

7. Columns in the image_prediction dataset that will not be needed for further analysis

8. Inconsistent column headers across the datasets

9. Some of the ratings are not dog ratings

10. Some of the dog ratings don't have a likes count

11. Some of the dog ratings are retweets

12. Incorrect names

13. A plagiarism rating is included in the dataset

14. Tweet_id '810984652412424192' is not a dog rating


### Tidiness issues

1. Dog stages are spread across different columns

2. All three datasets should be merged together

## Cleaning Data
In this section, **all** issues documented while assessing are cleaned.

In [None]:
# Make copies of the three original dataframe
dogs_rating_clean=dogs_rating.copy()
image_prediction_clean=image_prediction.copy()
tweet_clean=tweet.copy()


# dogs_rating cleaning

### Issue #1: tweet_id, rating_numerator, rating_denominator wrong datatype

#### Define:
Since no calculation will be done with tweet_id, rating_numerator and rating_denominator, use the astype function to convert them to strings

#### Code

In [None]:
dogs_rating_clean[['rating_numerator','rating_denominator','tweet_id']] = dogs_rating_clean[['rating_numerator','rating_denominator','tweet_id']].astype(str)


#### Test

In [None]:
dogs_rating_clean[['rating_numerator','rating_denominator','tweet_id']].dtypes

### Issue #2: timestamp wrong datatype

#### Define 
Use the pd.to_datetime function to convert the timestamp column to the datetime datatype
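
A minimal sketch of the conversion, using two timestamp strings copied from the head of the archive shown earlier. The variable names are only for illustration.

```python
import pandas as pd

# Timestamp strings in the same format as the archive's timestamp column.
timestamps = pd.Series(['2017-08-01 16:23:56 +0000',
                        '2017-07-29 16:00:24 +0000'])

# pd.to_datetime parses the strings into datetime values.
parsed = pd.to_datetime(timestamps)

# Once parsed, date parts are available through the .dt accessor.
years = parsed.dt.year.tolist()
months = parsed.dt.month.tolist()
```

Having a datetime column is what later makes it possible to pull out the year and month for the analysis.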

#### Code

In [None]:
dogs_rating_clean['timestamp']= pd.to_datetime(dogs_rating_clean['timestamp'])

#### Test

In [None]:
dogs_rating_clean['timestamp'].sample(2)

### Issue #3: name column values to lowercase

#### Define
Change all names to lowercase for easier slicing, merging and manipulation using the lower function

#### Code

In [None]:
dogs_rating_clean['name']=dogs_rating_clean.name.str.lower()

#### Test

In [None]:
#counts the rows whose name is in lowercase; it should equal all 2356 rows
dogs_rating_clean['name'].str.islower().sum()

### Issue #4: dog stages in different columns

#### Define
Use the melt function, which changes the structure of the dataframe from a wide format to a long format
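
To make the wide-to-long idea concrete, here is a small sketch on a made-up three-row dataframe with the same four stage columns. The names toy, staged, unstaged and tidy are hypothetical and only used for this illustration.

```python
import pandas as pd

# Made-up rows: tweet 1 is a doggo, tweet 2 a pupper, tweet 3 has no stage.
toy = pd.DataFrame({
    'tweet_id': ['1', '2', '3'],
    'doggo': ['doggo', 'None', 'None'],
    'floofer': ['None', 'None', 'None'],
    'pupper': ['None', 'pupper', 'None'],
    'puppo': ['None', 'None', 'None'],
})

# Wide to long: one row per (tweet, stage column) pair.
long_form = pd.melt(toy, id_vars=['tweet_id'],
                    var_name='stage_column', value_name='dog_stage')

# Keep the rows that carry a real stage.
staged = long_form[long_form['dog_stage'] != 'None'].drop(columns='stage_column')

# Tweets without any stage get a single 'None' row.
unstaged = toy.loc[~toy['tweet_id'].isin(staged['tweet_id']), ['tweet_id']]
unstaged = unstaged.assign(dog_stage='None')

# One row per tweet; the first stage found wins if a tweet has two.
tidy = pd.concat([staged, unstaged], ignore_index=True)
tidy = tidy.drop_duplicates(subset='tweet_id', keep='first')
```

The result has a single dog_stage column with one row per tweet.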

#### Code

getting the unique value in each dog stage column

In [None]:
dogs_rating_clean.doggo.value_counts()

In [None]:
dogs_rating_clean.floofer.value_counts()

In [None]:
dogs_rating_clean.pupper.value_counts()

In [None]:
dogs_rating_clean.puppo.value_counts()

In [None]:
#get all rows that have 'None' in every dog stage column, then use the melt function to combine the dog stage columns, and drop the duplicated rows and the column not needed
no_dog_stage=dogs_rating_clean[(dogs_rating_clean.doggo=='None')
                               &(dogs_rating_clean.floofer=='None')
                               &(dogs_rating_clean.pupper=='None')
                               &(dogs_rating_clean.puppo=='None')]
no_dog_stage=pd.melt(no_dog_stage,id_vars=['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp','source', 'text', 'retweeted_status_id', 'retweeted_status_user_id','retweeted_status_timestamp', 'expanded_urls', 'rating_numerator','rating_denominator', 'name'],var_name='dog_stage_',value_name='dog_stage')
no_dog_stage=no_dog_stage.drop(['dog_stage_'],axis=1)
no_dog_stage=no_dog_stage.drop_duplicates()

In [None]:
#melt all dog stages of the original dataset, filter out rows that have 'None' as dog stage and drop the column not needed
dogs_rating_clean=pd.melt(dogs_rating_clean,id_vars=['tweet_id', 'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp','source', 'text', 'retweeted_status_id', 'retweeted_status_user_id','retweeted_status_timestamp', 'expanded_urls', 'rating_numerator','rating_denominator', 'name'],var_name='dog_stage_',value_name='dog_stage')
dogs_rating_clean=dogs_rating_clean[dogs_rating_clean['dog_stage']!='None']
dogs_rating_clean=dogs_rating_clean.drop(['dog_stage_'],axis=1)
#add no_dog_stage to the dataframe (pd.concat replaces the deprecated DataFrame.append) and, for tweets with two different dog stages, keep only the first row detected
dogs_rating_clean=pd.concat([dogs_rating_clean,no_dog_stage],ignore_index=True)
dogs_rating_clean=dogs_rating_clean.drop_duplicates(subset=['tweet_id'],keep='first')

#### Test

In [None]:
#unique values in dog stage column and their respective counts 
dogs_rating_clean.dog_stage.value_counts()

In [None]:
#duplicated tweet_id
dogs_rating_clean.tweet_id.duplicated().sum()

In [None]:
#checking if all columns not needed have been successfully dropped 
list(dogs_rating_clean)

In [None]:
#confirming total rows and columns
dogs_rating_clean.shape

# image_prediction cleaning 

### Issue #5: Ambiguous column headers

#### Define 
Change the column name to a more readable name using the rename function 

#### Code

In [None]:
image_prediction_clean.rename(columns={'p1':'predicion_1',
                                 'p1_conf':'prediction_1_conf',
                                 'p1_dog':'prediction_1_dog',
                                 'p2':'prediction_2',
                                 'p2_conf':'prediction_2_conf',
                                 'p2_dog':'prediction_2_dog',
                                 'p3':'prediction_3',
                                 'p3_conf':'prediction_3_conf',
                                 'p3_dog':'prediction_3_dog'},
                             inplace=True)

#### Test

In [None]:
list(image_prediction_clean)

### Issue #6: Erroneous column datatype

#### Define 
Since no calculation will be done with tweet_id, use the astype function to convert it to a string

#### Code

In [None]:
image_prediction_clean['tweet_id']=image_prediction_clean['tweet_id'].astype(str)

#### Test

In [None]:
image_prediction_clean.tweet_id.dtype

### Issue #7: Some of the image predictions are not dogs

#### Define
Drop every image prediction row in which none of the three predictions recognises a dog, i.e. keep only rows with at least one true dog recognition.
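
The same filter can also be sketched with a boolean mask, which is shorter than a merge with an indicator column. The dataframe below is made up, with column names that follow the renamed headers; this is an alternative illustration, not the code used in this notebook.

```python
import pandas as pd

# Made-up rows: tweet 2 has no dog in any of its three predictions.
toy = pd.DataFrame({
    'tweet_id': ['1', '2', '3'],
    'prediction_1_dog': [True, False, False],
    'prediction_2_dog': [True, False, True],
    'prediction_3_dog': [False, False, False],
})

dog_columns = ['prediction_1_dog', 'prediction_2_dog', 'prediction_3_dog']

# True for rows where at least one of the three predictions is a dog.
has_dog = toy[dog_columns].any(axis=1)

# Keep only those rows.
dogs_only = toy[has_dog]
```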

#### Code

In [None]:
#images that are recognised false in the three prediction 
Non_dogs=image_prediction_clean[(image_prediction_clean['prediction_1_dog']==False)
                &(image_prediction_clean['prediction_2_dog']==False)
                &(image_prediction_clean['prediction_3_dog']==False)]

In [None]:
#number of prediction with three  false prediction 
len(Non_dogs)

In [None]:
#checking the Non_dogs dataset to confirm that all three predictions are false
Non_dogs.sample(5)

In [None]:
# merging the non dog datset with the image prediction dataset 
image_prediction_clean=pd.merge(image_prediction_clean,Non_dogs,how='outer',indicator=True)

In [None]:
# checking the _merge column to ensure the datasets were properly merged
image_prediction_clean._merge.value_counts()

In [None]:
# dropping rows that has false value in the three prediction 
image_prediction_clean['_merge']=image_prediction_clean['_merge'].loc[image_prediction_clean['_merge']!='both']
image_prediction_clean=image_prediction_clean.dropna(subset=['_merge'],axis=0)

In [None]:
# dropping the _merge column 
image_prediction_clean.drop('_merge',axis=1,inplace=True)

#### Test

In [None]:
#checking if the _merge column still exists (should be False)
'_merge' in image_prediction_clean.columns

In [None]:
#checking if rows with its three prediction as false are no longer in the dataset
len(image_prediction_clean[(image_prediction_clean['prediction_1_dog']==False)
                &(image_prediction_clean['prediction_2_dog']==False)
                &(image_prediction_clean['prediction_3_dog']==False)])

### Issue #8: Columns in the image_prediction dataset that will not be needed for further analysis

#### Define:
Use the drop function to drop the following columns: 'img_num', 'jpg_url', 'predicion_1', 'prediction_1_conf', 'prediction_1_dog', 'prediction_2', 'prediction_2_conf', 'prediction_2_dog', 'prediction_3', 'prediction_3_conf', 'prediction_3_dog'

#### Code

In [None]:
image_prediction_clean=image_prediction_clean.drop(['img_num', 'predicion_1', 'prediction_1_conf','jpg_url','prediction_1_dog', 'prediction_2', 'prediction_2_conf','prediction_2_dog', 'prediction_3', 'prediction_3_conf','prediction_3_dog'],axis=1)

#### Test

In [None]:
image_prediction_clean.columns

In [None]:
image_prediction_clean.shape

# Tweet cleaning

### Issue #9: Columns not needed for further analysis

#### Define:
Drop the following columns by keeping only created_at, favorite_count, id and retweet_count:
contributors, coordinates and geo, as they contain only null values;
display_text_range and id_str, as id_str doesn't correspond with the other datasets;
in_reply_to_status_id, in_reply_to_status_id_str, in_reply_to_user_id, in_reply_to_user_id_str, quoted_status, quoted_status_id_str, possibly_sensitive, possibly_sensitive_appealable, user, source, truncated and place;
the lang column, as I don't understand the concept.

#### Code

In [None]:
tweet_clean=tweet_clean[['created_at', 'favorite_count', 'id','retweet_count']]

#### Test

In [None]:
tweet_clean.columns

In [None]:
tweet_clean.shape

### Issue #10: Inconsistent column headers across the datasets

#### Define 
Use the rename function to change the column headers 'created_at', 'id' and 'favorite_count' to 'timestamp', 'tweet_id' and 'likes_count', so that they match the other datasets

#### Code

In [None]:
tweet_clean=tweet_clean.rename(columns={'created_at':'timestamp',
                                       'id':'tweet_id',
                                        'favorite_count':'likes_count'})

#### Test

In [None]:
list(tweet_clean)

### Issue #11: Erroneous datatype

#### Define 
Since no calculation will be done with tweet_id, use the astype function to convert it to a string

#### Code

In [None]:
tweet_clean['tweet_id']=tweet_clean['tweet_id'].astype(str)

#### Test

In [None]:
tweet_clean.tweet_id.dtype

In [None]:
tweet_clean.shape

# Combined dataset

### Issue #12: Some of the ratings are not dog ratings

#### Define 
Some ratings in the dogs_rating dataset are not dog ratings. Use the image_prediction dataset, which helps detect whether a rating is for a dog, to keep only the ratings with at least 1 true recognition of a dog, via an inner merge on tweet_id.
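
How an inner merge acts as a filter can be seen on two made-up dataframes: only the tweet IDs present in both survive. The names ratings and dog_images are hypothetical.

```python
import pandas as pd

# Made-up ratings; tweet 2 has no matching image prediction.
ratings = pd.DataFrame({
    'tweet_id': ['1', '2', '3'],
    'rating_numerator': ['13', '12', '10'],
})

# Made-up image predictions that contain a dog; tweet 4 has no rating.
dog_images = pd.DataFrame({'tweet_id': ['1', '3', '4']})

# how='inner' keeps only the tweet IDs found in both dataframes.
merged = pd.merge(ratings, dog_images, on='tweet_id', how='inner')
```

Both tweet_id columns need the same datatype for the keys to match, which is why tweet_id was converted to a string in every dataset first.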

#### Code

In [None]:
dogs_rating_clean=pd.merge(dogs_rating_clean,image_prediction_clean,on='tweet_id',how='inner')

#### Test

In [None]:
dogs_rating_clean.shape

### Issue #13: Some of the dog ratings don't have a likes count

#### Define 
Merge with the tweet dataset using an inner merge on tweet_id and timestamp, so that only ratings that have likes and retweet counts are kept


#### Code

In [None]:
dogs_rating_clean=pd.merge(dogs_rating_clean,tweet_clean,on=['tweet_id','timestamp'],how='inner')

#### Test

In [None]:
dogs_rating_clean.shape

In [None]:
dogs_rating_clean.info()

In [None]:
#dogs_rating_clean.replace('None',np.nan, inplace=True)

### Issue #14: Some of the dog ratings are retweets

#### Define 
I only want original ratings (no retweets) that have images, so keep only the rows where retweeted_status_id is null


#### Code

In [None]:
dogs_rating_clean=dogs_rating_clean[dogs_rating_clean.retweeted_status_id.isnull()]

#### Test

In [None]:
dogs_rating_clean.retweeted_status_id.value_counts()

### Issue #15: Columns not needed for further analysis

#### Define:
Use the drop function to drop the following columns: 'in_reply_to_status_id', 'in_reply_to_user_id', 'source', 'retweeted_status_id', 'retweeted_status_user_id', 'retweeted_status_timestamp'

#### Code

In [None]:
dogs_rating_clean=dogs_rating_clean.drop(['in_reply_to_status_id','in_reply_to_user_id','source','retweeted_status_id','retweeted_status_user_id','retweeted_status_timestamp'],axis=1)

#### Test

In [None]:
list(dogs_rating_clean)

### Issue #16: Incorrect names

#### Define:
Manually detect all incorrect names and use the replace function to replace them

#### Code

In [None]:
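#assign the result back instead of using inplace on a column of a filtered dataframe
dogs_rating_clean['name']=dogs_rating_clean['name'].replace('none',np.nan)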

In [None]:
incorrect_name=['the','all','my','space','a','an','not','very','just']
dogs_rating_clean.name[dogs_rating_clean.name.isin(incorrect_name)].unique()

In [None]:
for name in incorrect_name:
    #'my' and 'space' belong to real names found in the tweet text; the rest are not names
    if name == 'my':
        replacement = 'zoey'
    elif name == 'space':
        replacement = 'space pup'
    else:
        replacement = np.nan
    dogs_rating_clean['name'] = dogs_rating_clean['name'].replace(name, replacement)


#### Test

In [None]:
dogs_rating_clean.name[dogs_rating_clean.name.isin(incorrect_name)].unique()

### Issue #17: Wrongly extracted decimal rating_numerator

#### Define
Extract all decimal ratings from the text column using a regex pattern
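
As a sketch of the idea, the pattern below captures a numerator with an optional decimal part and a denominator. The two texts are made up, and this pattern is slightly broader than one that matches decimal ratings only, so it is an illustration under those assumptions.

```python
import pandas as pd

# Made-up tweet texts: one decimal rating and one whole-number rating.
texts = pd.Series([
    'Made-up tweet: this good boy gets 13.5/10',
    'Made-up tweet: solid pupper, 12/10',
])

# Group 1: digits with an optional decimal part; group 2: the denominator.
pattern = r'(\d+(?:\.\d+)?)/(\d+)'

# With two capture groups, str.extract returns one column per group.
extracted = texts.str.extract(pattern)
extracted.columns = ['numerator', 'denominator']
```

A pattern that only accepts whole numbers before the slash would read 13.5/10 as 5/10, which is the kind of extraction error this issue is about.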

#### Code

In [None]:
#extract all decimal ratings from the text column using a regex pattern (raw string so that \. is not treated as a string escape)
dogs_rating_clean['extract']= dogs_rating_clean['text'].str.extract(pat=r'([0-9]+\.[0-9]+/[0-9]+)',expand=False)

In [None]:
#split the extracted rating at '/' and keep the numerator, then drop the extracted column
dogs_rating_clean['numerator_']=dogs_rating_clean['extract'].str.split('/',n=1).str[0]
dogs_rating_clean=dogs_rating_clean.drop(['extract'],axis=1)

In [None]:
#getting all decimal numerator
dogs_rating_clean['numerator_'].unique()

In [None]:
#replace all wrong rating numerator 
dogs_rating_clean.loc[dogs_rating_clean['tweet_id']=='778027034220126208','rating_numerator']='11.27'
dogs_rating_clean.loc[dogs_rating_clean['tweet_id']=='883482846933004288','rating_numerator']='13.5'
dogs_rating_clean.loc[dogs_rating_clean['tweet_id']=='786709082849828864','rating_numerator']='9.75'
dogs_rating_clean.loc[dogs_rating_clean['tweet_id']=='680494726643068929','rating_numerator']='11.26'

In [None]:
#drop the numerator_ column created 
dogs_rating_clean.drop(['numerator_'],axis=1,inplace=True)

#### Test

In [None]:
decimal_numerator=['11.27', '13.5', '9.75', '11.26']
dogs_rating_clean[dogs_rating_clean['rating_numerator'].isin(decimal_numerator)]  


### Issue #18: Tweet_id '810984652412424192' is not a dog rating

#### Define 
Tweet_id '810984652412424192' uses 24/7 as its rating. This is not a rating; it tells us the dog smiles all the time, so I will drop this row

#### Code

In [None]:
#dropping the 24/7 rating by its tweet_id rather than by a hard-coded index label
dogs_rating_clean=dogs_rating_clean[dogs_rating_clean['tweet_id']!='810984652412424192']

#### Test

In [None]:
dogs_rating_clean.loc[dogs_rating_clean['tweet_id']=='810984652412424192']

### Issue #19: Plagiarism rating included in the dataset

#### Define 
Tweet_id '835152434251116546' uses 0/10 as its rating. It isn't a dog rating, just a rating given for plagiarism; the real rating was given in the row with the timestamp '2016-07-11 15:07:30'. Drop this row

#### Code

In [None]:

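#dropping the plagiarism rating by its tweet_id rather than by a hard-coded index label
dogs_rating_clean=dogs_rating_clean[dogs_rating_clean['tweet_id']!='835152434251116546']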

#### Test

In [None]:
dogs_rating_clean.loc[dogs_rating_clean['tweet_id']=='835152434251116546']

## Storing Data
Save gathered, assessed, and cleaned master dataset to a CSV file named "twitter_archive_master.csv".

In [None]:
dogs_rating_clean.to_csv('twitter_archive_master.csv',index=False)

## Analyzing and Visualizing Data


#### Questions for analysis
1. What are the most popular dog names?
2. In which year were the most dog ratings tweeted?
3. Which tweet id had the highest likes and what is the rating for the dog?
4. Is there any relationship between likes and retweets?




#### Research Question 1: What are the most popular dog names?

In [None]:
#most popular dog name
popular_dog_names=dogs_rating_clean.name.value_counts().head(10)


In [None]:
sns.barplot(x=popular_dog_names.index,y=popular_dog_names.values,palette=['blue','blue','blue','grey','grey','grey','grey','grey','grey','grey'])
plt.title('most_popular_dog_name',fontsize=20)
plt.xlabel('dog_name',fontsize=18)
plt.ylabel('name_count',fontsize=18)

>>The most popular dog names are: Charlie, Cooper, Lucy

#### Research Question 2: In which year were the most dog ratings tweeted?

In [None]:
#extract the years and month from the timestamp column 
dogs_rating_clean['year']=pd.DatetimeIndex(dogs_rating_clean.timestamp).year
dogs_rating_clean['month']=pd.DatetimeIndex(dogs_rating_clean.timestamp).month

In [None]:
sns.countplot(data=dogs_rating_clean,x='year')
plt.title('Number of yearly tweet',fontsize=25)

>>The highest number of original dog ratings were tweeted in 2016

#### Research Question 3: Which tweet id had the highest likes and what is the rating for the dog?

In [None]:
filtered_data=dogs_rating_clean.sort_values(by=['likes_count','retweet_count'],ascending=False).head(10)
rating_numerator=filtered_data['rating_numerator'].sort_values()

In [None]:
sns.relplot(data=filtered_data,x='likes_count',y='retweet_count',size= rating_numerator,sizes=(25,300),style='rating_denominator',hue='tweet_id')

>>Tweet id '822872901745569793' with the rating of '13/10' has the highest likes, but it doesn't have the highest retweets.

#### Research Question 4: Is there any relationship between likes and retweet ?

In [None]:
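g=sns.relplot(data=dogs_rating_clean,x='likes_count',y='retweet_count',col='year',row='month')
#plt.title would only label the last subplot, so set a title for the whole figure instead
g.fig.suptitle('retweets and likes for each month in a year',y=1.02)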


>>Likes and retweets are positively related, i.e. the higher the likes, the higher the retweets, and the lower the likes, the lower the retweets
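
The strength of such a relationship could also be summarised with a correlation coefficient. The sketch below uses made-up counts, not the project's data, and the name likes_retweet_correlation is hypothetical.

```python
import pandas as pd

# Made-up likes and retweet counts, for illustration only.
toy = pd.DataFrame({
    'likes_count': [10, 20, 30, 40],
    'retweet_count': [1, 2, 3, 5],
})

# Series.corr computes the Pearson correlation by default;
# values close to 1 indicate a strong positive linear relationship.
likes_retweet_correlation = toy['likes_count'].corr(toy['retweet_count'])
```

Running the same call on dogs_rating_clean would put a number on the pattern seen in the scatter plots.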


### Insights:
1. The most popular dog names are: Charlie, Cooper, Lucy

2. The highest number of original dog ratings were tweeted in 2016

3. Tweet id '822872901745569793' with the rating of '13/10' has the highest likes, but it doesn't have the highest retweets.

4. Likes and retweets are positively related, i.e. the higher the likes, the higher the retweets, and the lower the likes, the lower the retweets



### Limitations:

1. A lot of dogs do not have a dog name

2. Some images that are dogs were not recognised as dogs in the three predictions of the image prediction file

3. Over 1000 tweets do not contain a dog stage

In [None]:
""" Source:
      - https://stackoverflow.com/questions/48122744/how-to-download-all-files-and-folder-hierarchy-from-jupyter-notebook
"""
import os
import tarfile

def recursive_files(dir_name='.', ignore=None):
    for dir_name,subdirs,files in os.walk(dir_name):
        if ignore and os.path.basename(dir_name) in ignore:
            continue

        for file_name in files:
            if ignore and file_name in ignore:
                continue

            yield os.path.join(dir_name, file_name)

def make_tar_file(dir_name='.', target_file_name='workspace_archive.tar', ignore=None):
    tar = tarfile.open(target_file_name, 'w')

    for file_name in recursive_files(dir_name, ignore):
        tar.add(file_name)

    tar.close()


dir_name = '.'
target_file_name = 'workspace_archive.tar'
# List of files/directories to ignore
ignore = {'.ipynb_checkpoints', '__pycache__', target_file_name}

make_tar_file(dir_name, target_file_name, ignore)