## Project - WeRateDogs

# Wrangling, Cleaning and Analyzing Data

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#gather">Gathering</a></li>
<li><a href="#assess">Assessing</a></li>
<li><a href="#obs">Observations</a></li>
<li><a href="#clean">Cleaning</a></li>
<li><a href="#store">Storing</a></li>
<li><a href="#viz">Visualization</a></li>   
<li><a href="#ref">References</a></li>
</ul>


<a id='intro'></a>

## INTRODUCTION

### Project Details

- Data wrangling is the process of cleaning and unifying messy, complex data sets so that they are easy to access and analyze. It consists of:
    - Gathering the data
        1. From the 'twitter-archive-enhanced.csv' file.
        2. From a [link](https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv) to the image predictions file.
        3. From the Twitter API.
    - Assessing the data
    - Cleaning the data
- Storing, analyzing, and visualizing the wrangled data
- Reporting on:
    1. Data wrangling efforts.
    2. Data analyses and visualizations.

<a id='gather'></a>
## GATHERING

In [31]:
# Importing all libraries
import json
import os
import warnings
from io import BytesIO

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
import seaborn as sns
import tweepy
from PIL import Image
# from twitter_api import get_twitter_data

%matplotlib inline
sns.set()
warnings.simplefilter(action="ignore", category=FutureWarning)

In [20]:
# Loading twitter archive file
twitter_arch = pd.read_csv("twitter-archive-enhanced.csv")

In [23]:
# Downloading image prediction file
folder_name = "data"
url = "https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv"
response = requests.get(url)
response.raise_for_status()  # stop here if the download failed
os.makedirs(folder_name, exist_ok=True)  # create the folder only if it is missing
# Save the file under its original name inside folder_name
with open(os.path.join(folder_name, url.split('/')[-1]), 'wb') as file:
    file.write(response.content)
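
The save step can also be wrapped in a small helper so that the file name is derived from the URL in one place. This is only a sketch: `save_download` is a hypothetical name, the URL is an `example.com` placeholder, and the bytes are passed in directly (instead of `response.content`) so the example runs offline.

```python
import os
import tempfile

def save_download(content, url, folder):
    """Write raw bytes into folder, naming the file after the last URL segment."""
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    path = os.path.join(folder, url.split("/")[-1])
    with open(path, "wb") as fh:
        fh.write(content)
    return path

# Offline demonstration with placeholder bytes and a placeholder URL
demo_folder = tempfile.mkdtemp()
demo_url = "https://example.com/files/image-predictions.tsv"
saved_path = save_download(b"col_a\tcol_b\n", demo_url, demo_folder)
```

In the notebook, the call would be `save_download(response.content, url, folder_name)`.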

In [27]:
# Loading the "image-predictions.tsv" file
image_pred = pd.read_csv("data/image-predictions.tsv", sep="\t")

In [29]:
# Loading the tweets (one JSON object per line)
with open("data/tweet-json.txt", encoding="utf-8") as file:
    tweet_json = pd.read_json(file, lines=True)
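
If only a few fields are needed, the same line-delimited file can be read with the standard `json` module instead. The sketch below assumes each line holds one tweet object with the `id`, `retweet_count` and `favorite_count` keys; the helper name `parse_tweet_lines` and the sample lines are made up for illustration.

```python
import json

def parse_tweet_lines(lines):
    """Return a list of dicts with the tweet ID and engagement counts."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        tweet = json.loads(line)
        records.append({
            "tweet_id": tweet["id"],
            "retweet_count": tweet.get("retweet_count"),
            "favorite_count": tweet.get("favorite_count"),
        })
    return records

# Hypothetical sample lines standing in for data/tweet-json.txt
sample_lines = [
    '{"id": 1, "retweet_count": 5, "favorite_count": 20}',
    '',
    '{"id": 2, "retweet_count": 0, "favorite_count": 3}',
]
tweet_records = parse_tweet_lines(sample_lines)
```

Because an open file iterates line by line, the helper can be given the file object directly, and the resulting list can be passed to `pd.DataFrame`.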

<a id='assess'></a>

## ASSESSING

### ASSESSMENT OF "twitter-archive-enhanced.csv" 

#### VISUAL ASSESSMENT

In [36]:
# FIRST 5 ROWS
twitter_arch.head()

| | tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892420643555336193 | | | 2017-08-01 16:23:56 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Phineas. He's a mystical boy. Only eve... | | | | https://twitter.com/dog_rates/status/892420643... | 13 | 10 | Phineas | | | | |
| 1 | 892177421306343426 | | | 2017-08-01 00:17:27 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Tilly. She's just checking pup on you.... | | | | https://twitter.com/dog_rates/status/892177421... | 13 | 10 | Tilly | | | | |
| 2 | 891815181378084864 | | | 2017-07-31 00:18:03 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Archie. He is a rare Norwegian Pouncin... | | | | https://twitter.com/dog_rates/status/891815181... | 12 | 10 | Archie | | | | |
| 3 | 891689557279858688 | | | 2017-07-30 15:58:51 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Darla. She commenced a snooze mid meal... | | | | https://twitter.com/dog_rates/status/891689557... | 13 | 10 | Darla | | | | |
| 4 | 891327558926688256 | | | 2017-07-29 16:00:24 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Franklin. He would like you to stop ca... | | | | https://twitter.com/dog_rates/status/891327558... | 12 | 10 | Franklin | | | | |


In [38]:
# CHECKING RANDOM ROWS/OBSERVATIONS
twitter_arch.sample(10)

| | tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1334 | 705428427625635840 | | | 2016-03-03 16:23:38 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Ambrose. He's an Alfalfa Ballyhoo. Dra... | | | | https://twitter.com/dog_rates/status/705428427... | 11 | 10 | Ambrose | | | pupper | |
| 1079 | 739238157791694849 | | | 2016-06-04 23:31:25 +0000 | `<a href="http://twitter.com/download/iphone" r...` | Here's a doggo blowing bubbles. It's downright... | | | | https://twitter.com/dog_rates/status/739238157... | 13 | 10 | | doggo | | | |
| 352 | 831315979191906304 | | | 2017-02-14 01:35:49 +0000 | `<a href="http://twitter.com" rel="nofollow">Tw...` | I couldn't make it to the #WKCDogShow BUT I ha... | | | | https://twitter.com/dog_rates/status/831315979... | 13 | 10 | | | | pupper | |
| 1818 | 676593408224403456 | | | 2015-12-15 02:43:33 +0000 | `<a href="http://vine.co" rel="nofollow">Vine -...` | This pupper loves leaves. 11/10 for committed ... | | | | https://vine.co/v/eEQQaPFbgOY | 11 | 10 | | | | pupper | |
| 874 | 761292947749015552 | | | 2016-08-04 20:09:17 +0000 | `<a href="http://twitter.com/download/iphone" r...` | Meet Bonaparte. He's pupset because it's cloud... | | | | https://twitter.com/dog_rates/status/761292947... | 11 | 10 | Bonaparte | | | | |
| 1433 | 697463031882764288 | | | 2016-02-10 16:51:59 +0000 | `<a href="http://twitter.com/download/iphone" r...` | Happy Wednesday here's a bucket of pups. 44/40... | | | | https://twitter.com/dog_rates/status/697463031... | 44 | 40 | | | | | |
| 1422 | 698178924120031232 | | | 2016-02-12 16:16:41 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Lily. She accidentally dropped all her... | | | | https://twitter.com/dog_rates/status/698178924... | 10 | 10 | Lily | | | | |
| 1028 | 745789745784041472 | | | 2016-06-23 01:25:06 +0000 | `<a href="http://twitter.com/download/iphone" r...` | This is Gus. He didn't win the Powerball. Quit... | | | | https://twitter.com/dog_rates/status/745789745... | 10 | 10 | Gus | | | | |
| 269 | 841320156043304961 | | | 2017-03-13 16:08:50 +0000 | `<a href="http://twitter.com/download/iphone" r...` | We don't rate penguins, but if we did, this on... | | | | https://twitter.com/abc/status/841311395547250688 | 12 | 10 | | | | | |
| 947 | 752519690950500352 | | | 2016-07-11 15:07:30 +0000 | `<a href="http://twitter.com/download/iphone" r...` | Hopefully this puppo on a swing will help get ... | | | | https://twitter.com/dog_rates/status/752519690... | 11 | 10 | | | | | puppo |


#### PROGRAMMATIC ASSESSMENT

In [39]:
# Details of all the columns
twitter_arch.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   tweet_id                    2356 non-null   int64  
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object 
 4   source                      2356 non-null   object 
 5   text                        2356 non-null   object 
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object 
 9   expanded_urls               2297 non-null   object 
 10  rating_numerator            2356 non-null   int64  
 11  rating_denominator          2356 non-null   int64  
 12  name                        2356 non-null   object 
 13  doggo                       2356 

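--- Several columns have missing values (`in_reply_to_status_id`, `in_reply_to_user_id`, the three `retweeted_status_*` columns and `expanded_urls`) and unsuitable data types: `timestamp` is stored as `object` instead of a datetime, and the reply and retweet ID columns are `float64`.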
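
Both kinds of issue can be checked programmatically. The sketch below uses a hypothetical three-row frame named `sample` in place of `twitter_arch`; it counts missing values per column and parses the timestamp strings into datetimes.

```python
import pandas as pd

# Hypothetical three-row stand-in for twitter_arch
sample = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "in_reply_to_status_id": [None, 6.7e17, None],
    "timestamp": [
        "2017-08-01 16:23:56 +0000",
        "2017-08-01 00:17:27 +0000",
        "2017-07-31 00:18:03 +0000",
    ],
})

# Count missing values per column
missing_counts = sample.isnull().sum()

# Parse the timestamp strings into timezone-aware (UTC) datetimes
sample["timestamp"] = pd.to_datetime(sample["timestamp"], utc=True)
```

After the conversion, accessors such as `sample["timestamp"].dt.year` become available for later analysis.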

In [42]:
# Summary statistics of the numerator ratings and denominator ratings
twitter_arch[["rating_numerator", "rating_denominator"]].describe()

| | rating_numerator | rating_denominator |
|---|---|---|
| count | 2356.0 | 2356.0 |
| mean | 13.126486 | 10.455433 |
| std | 45.876648 | 6.745237 |
| min | 0.0 | 0.0 |
| 25% | 10.0 | 10.0 |
| 50% | 11.0 | 10.0 |
| 75% | 12.0 | 10.0 |
| max | 1776.0 | 170.0 |


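--- The numerator reaches 1776 and the denominator ranges from 0 to 170, while the quartiles sit at 10 to 12 for the numerator and at 10 for the denominator. Numerators slightly above 10 are part of the WeRateDogs rating style, so the values to investigate are the extreme numerators and the denominators other than 10.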
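
Unusual ratings can be pulled out for manual review with two boolean masks. This is a sketch on a hypothetical four-row frame; the cutoff of 20 for a "high" numerator is an assumption chosen for illustration, not a value taken from the data.

```python
import pandas as pd

# Hypothetical stand-in for twitter_arch: one typical and three unusual ratings
ratings = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "rating_numerator": [13, 44, 1776, 9],
    "rating_denominator": [10, 40, 10, 0],
})

# Denominators other than 10 (including 0) need a manual look
odd_denominator = ratings["rating_denominator"] != 10

# Assumed cutoff: numerators of 20 or more on a 10-point scale count as outliers
high_numerator = (ratings["rating_denominator"] == 10) & (ratings["rating_numerator"] >= 20)

to_review = ratings[odd_denominator | high_numerator]
```

Reading the tweet text for the rows in `to_review` helps decide whether a rating was mis-extracted or covers several dogs at once, as in the 44/40 example.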

In [47]:
# Checking for unique dog names
twitter_arch["name"].unique()

array(['Phineas', 'Tilly', 'Archie', 'Darla', 'Franklin', 'None', 'Jax',
       'Zoey', 'Cassie', 'Koda', 'Bruno', 'Ted', 'Stuart', 'Oliver',
       'Jim', 'Zeke', 'Ralphus', 'Canela', 'Gerald', 'Jeffrey', 'such',
       'Maya', 'Mingus', 'Derek', 'Roscoe', 'Waffles', 'Jimbo', 'Maisey',
       'Lilly', 'Earl', 'Lola', 'Kevin', 'Yogi', 'Noah', 'Bella',
       'Grizzwald', 'Rusty', 'Gus', 'Stanley', 'Alfy', 'Koko', 'Rey',
       'Gary', 'a', 'Elliot', 'Louis', 'Jesse', 'Romeo', 'Bailey',
       'Duddles', 'Jack', 'Emmy', 'Steven', 'Beau', 'Snoopy', 'Shadow',
       'Terrance', 'Aja', 'Penny', 'Dante', 'Nelly', 'Ginger', 'Benedict',
       'Venti', 'Goose', 'Nugget', 'Cash', 'Coco', 'Jed', 'Sebastian',
       'Walter', 'Sierra', 'Monkey', 'Harry', 'Kody', 'Lassie', 'Rover',
       'Napolean', 'Dawn', 'Boomer', 'Cody', 'Rumble', 'Clifford',
       'quite', 'Dewey', 'Scout', 'Gizmo', 'Cooper', 'Harold', 'Shikha',
       'Jamesy', 'Lili', 'Sammy', 'Meatball', 'Paisley', 'Albus',
       'Nept
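
Values such as "a", "all", "his", "such" and "quite" cannot be dog names; they are ordinary lowercase words, most likely picked up from the tweet text by mistake. The string "None" marks tweets with no name at all.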

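One way to find the invalid names is to flag every all-lowercase entry. The sketch below assumes that genuine names are capitalized, which matches the visible output but is not guaranteed; the `names` series is a hypothetical sample, and treating "None" as missing is likewise an illustrative choice.

```python
import pandas as pd

# Hypothetical sample of the name column
names = pd.Series(["Phineas", "a", "None", "such", "Tilly", "quite"], name="name")

# Assumption: genuine names are capitalized, so all-lowercase entries are suspect
suspect_mask = names.str.islower()
suspect_names = names[suspect_mask].unique().tolist()

# Replace suspect words and the "None" placeholder with missing values
cleaned = names.mask(suspect_mask | (names == "None"))
```

Listing `suspect_names` before masking makes it possible to check by eye that no real name is discarded.
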
In [49]:
# Counting each unique name
twitter_arch["name"].value_counts()

None          745
a              55
Charlie        12
Cooper         11
Lucy           11
             ... 
Dex             1
Ace             1
Tayzie          1
Grizzie         1
Christoper      1
Name: name, Length: 957, dtype: int64

### ASSESSMENT OF "image_pred" 

#### VISUAL ASSESSMENT

In [None]:
#

#### PROGRAMMATIC ASSESSMENT

### ASSESSMENT OF "tweet_json" 

#### VISUAL ASSESSMENT

#### PROGRAMMATIC ASSESSMENT