In [None]:
![udacity-banner](https://miro.medium.com/max/1400/1*XV1XQlk4lCYcm-ft5-gMtw.jpeg)
### Udacity Connect Session - Week 7 - Project 2 Walkthrough & Data Wrangling
## Agenda

* Project 2 Checkups
* Project 2 Walkthrough - Blockers
* Dataset Brainstorming
* Q & A/Breaktime
* Google BigQuery - Gather, Assess, Analyze, Store, Visualize

In [1]:
# import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

https://medium.com/nerd-for-tech/data-cleaning-on-hotel-booking-demand-dataset-df19922adf0a

Usually, if more than 70% of values in a column are missing and there is no way to fill in the missing values, then the column can be dropped completely from the dataset.
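
As a rough sketch of that rule of thumb, the snippet below drops every column whose share of missing values exceeds a threshold. The helper name `drop_sparse_columns`, the 0.7 default, and the toy data are assumptions made for illustration, not part of the project files.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame, threshold=0.7):
    """Return a copy of frame without columns whose missing share exceeds threshold."""
    missing_share = frame.isnull().mean()  # fraction of missing values per column
    sparse_columns = missing_share[missing_share > threshold].index
    return frame.drop(columns=sparse_columns)

# Hypothetical toy data: 'Comments' is 80% missing, 'FuelEconomy' is 40% missing
toy = pd.DataFrame({
    'Distance': [51.29, 51.63, 51.27, 49.17, 51.15],
    'FuelEconomy': [np.nan, np.nan, 8.5, 7.97, 8.37],
    'Comments': [np.nan, np.nan, np.nan, np.nan, 'Start early'],
})

cleaned = drop_sparse_columns(toy)
print(cleaned.columns.tolist())
```

Only `Comments` crosses the 70% threshold here, so it is the only column removed. Whether 70% is the right cut-off depends on the dataset and on whether the missing values can be filled in some other way.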



## Project 2 Walkthrough - Gathering Data

**Context**

Your goal: wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations. The Twitter archive is great, but it only contains very basic tweet information. Additional gathering, then assessing and cleaning is required for "Wow!"-worthy analyses and visualizations.

I extracted this data programmatically, but I didn't do a very good job. The ratings probably aren't all correct. Same goes for the dog names and probably dog stages (see below for more information on these) too. You'll need to assess and clean these columns if you want to use them for analysis and visualization.
 
 
**Data wrangling efforts**

* The ratings probably aren't all correct.
* The same goes for the dog names and probably the dog stages. You'll need to assess and clean these columns if you want to use them for analysis and visualization.

**Enhanced Tweet Archive**
![tweet archive](https://video.udacity-data.com/topher/2017/October/59dd4791_screenshot-2017-10-10-18.19.36/screenshot-2017-10-10-18.19.36.png)

1. twitter-archive-enhanced.csv
2. image_predictions.tsv
3. Data gathered from the Twitter API (optionally, download tweet_json.txt from the classroom instead)

**Additional Data via the Twitter API**

Back to the basic-ness of Twitter archives: retweet count and favorite count are two of the notable column omissions. Fortunately, this additional data can be gathered by anyone from Twitter's API. Well, "anyone" who has access to data for the 3000 most recent tweets, at least. But you, because you have the WeRateDogs Twitter archive and specifically the tweet IDs within it, can gather this data for all 5000+. And guess what? You're going to query Twitter's API to gather this valuable data.


**Tweet image predictions**

The results: a table full of image predictions (the top three only) alongside each tweet ID, image URL, and the image number that corresponded to the most confident prediction (numbered 1 to 4 since tweets can have up to four images).

**Image Predictions**
![image predictions](https://video.udacity-data.com/topher/2017/October/59dd4d2c_screenshot-2017-10-10-18.43.41/screenshot-2017-10-10-18.43.41.png)

tweet_id is the last part of the tweet URL after "status/" → https://twitter.com/dog_rates/status/889531135344209921
    
p1 is the algorithm's #1 prediction for the image in the tweet → golden retriever

p1_conf is how confident the algorithm is in its #1 prediction → 95%

p1_dog is whether or not the #1 prediction is a breed of dog → TRUE

p2 is the algorithm's second most likely prediction → Labrador retriever

p2_conf is how confident the algorithm is in its #2 prediction → 1%

p2_dog is whether or not the #2 prediction is a breed of dog → TRUE
etc.
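
To make the column meanings concrete, here is a small sketch built on the example values above. The `tweet_url` column, the `best_dog_prediction` helper, and the one-row DataFrame are hypothetical; the real image_predictions.tsv may be laid out differently.

```python
import pandas as pd

# One toy row modelled on the example above (column 'tweet_url' is hypothetical)
predictions = pd.DataFrame({
    'tweet_url': ['https://twitter.com/dog_rates/status/889531135344209921'],
    'p1': ['golden retriever'], 'p1_conf': [0.95], 'p1_dog': [True],
    'p2': ['Labrador retriever'], 'p2_conf': [0.01], 'p2_dog': [True],
})

# tweet_id is whatever follows the last "/" in the tweet URL
predictions['tweet_id'] = predictions['tweet_url'].str.rsplit('/', n=1).str[-1]

def best_dog_prediction(row):
    """Return the highest-ranked prediction flagged as a dog breed, or None."""
    for rank in (1, 2):
        if row[f'p{rank}_dog']:
            return row[f'p{rank}']
    return None

predictions['breed'] = predictions.apply(best_dog_prediction, axis=1)
print(predictions[['tweet_id', 'breed']])
```

Keeping the first prediction that is actually a dog is only one possible cleaning choice; filtering on `p1_conf` would be another.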

**Additional Data from Twitter**

Each tweet's JSON data should be written to its own line. Then read this .txt file line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count. 
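
A minimal sketch of the "one JSON object per line" layout, assuming each tweet is already available as a Python dictionary. The dictionaries below are made-up stand-ins for API responses, and the file is written to a temporary folder so the example can run anywhere.

```python
import json
import os
import tempfile

# Made-up stand-ins for the JSON payloads returned by the API
tweets = [
    {'id': 889531135344209921, 'retweet_count': 10, 'favorite_count': 50},
    {'id': 1002, 'retweet_count': 3, 'favorite_count': 7},
]

out_path = os.path.join(tempfile.mkdtemp(), 'tweet_json.txt')

with open(out_path, 'w', encoding='utf-8') as outfile:
    for tweet in tweets:
        json.dump(tweet, outfile)  # serialize one tweet
        outfile.write('\n')        # newline so each tweet sits on its own line
```

Because every line is a complete JSON document, the file can later be read back one line at a time without loading a single large JSON array.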


**[twitter_api.py](https://video.udacity-data.com/topher/2018/November/5be5fb4c_twitter-api/twitter-api.py)**

This is the Twitter API code to gather some of the required data for the project. Read the code and comments, understand how the code works, then copy and paste it into your notebook.

In [10]:
# Install Tweepy
!pip3 install tweepy

Defaulting to user installation because normal site-packages is not writeable


In [11]:
# Import Tweepy
import tweepy

At the bottom of **[the page](https://learn.udacity.com/nanodegrees/nd002-alg-t2/parts/cd0015/lessons/ls2232/concepts/e7279c35-6d40-42e5-998d-9062ceac6d4c)** in the classroom you can find two files to download:

**[twitter_api.py](https://video.udacity-data.com/topher/2018/November/5be5fb4c_twitter-api/twitter-api.py)** - This is the Twitter API code, mentioned earlier, used to gather some of the required data for the project.

```
import tweepy
from tweepy import OAuthHandler
import json
from timeit import default_timer as timer
```

**Elevated access may be required**


```
consumer_key = 'YOUR CONSUMER KEY'
consumer_secret = 'YOUR CONSUMER SECRET'
access_token = 'YOUR ACCESS TOKEN'
access_secret = 'YOUR ACCESS SECRET'
```

How do I get the access_secret?

After creating the app, generate its access keys. The app page has a button that generates the access token and access secret.


### [Import with requests from a URL](https://stackoverflow.com/questions/4981977/how-to-handle-response-encoding-from-urllib-request-urlopen-to-avoid-typeerr)

```
import io
import requests
import pandas as pd

url = ''  # URL of the file to download

response = requests.get(url)  # optionally pass verify=False if you hit SSL errors
# decode the bytes as UTF-8; sep='\t' reads a tab-separated file
some_df = pd.read_csv(io.StringIO(response.content.decode('utf-8')), sep='\t')

response.headers['content-type']  # content type reported by the server
```


### **[If you get an SSL error while importing, see this Stack Overflow question](https://stackoverflow.com/questions/10667960/python-requests-throwing-sslerror)**

### **Accessing Project Data Without a Twitter Developer Account**

If you can't set up a Twitter developer account using the steps above, or you prefer not to create a Twitter account, that is not a problem: you can follow the directions below to access the data necessary for the project.

**Directions for accessing the Twitter data without actually creating a Twitter account:**

Download **[tweet_json.txt](https://video.udacity-data.com/topher/2018/November/5be5fb7d_tweet-json/tweet-json.txt)**


## How to read tweet_json.txt, whether gathered from Twitter or downloaded from Udacity

### Read from a JSON file

**[tweet_json.txt](https://video.udacity-data.com/topher/2018/November/5be5fb7d_tweet-json/tweet-json.txt)** - This is the resulting data from twitter_api.py. You can proceed with the following part of "Gathering Data for this Project" on the Project Details page: "Then read this tweet_json.txt file line by line into a pandas DataFrame with (at minimum) tweet ID, retweet count, and favorite count."

### [Slack response example resolution](https://alx-c2-en.slack.com/archives/C03N1EGHWES/p1661777873863019)


```
import json

# each line of the file holds one tweet as a JSON object
with open('tweet-json.txt', 'r') as json_file:
    data = [json.loads(line) for line in json_file]
```


### Create pandas df from the resulting data

```some_df = pd.DataFrame(data)```
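
Passing every parsed tweet to `pd.DataFrame` keeps all of the JSON fields. If only the three required columns are wanted, something along the following lines may be easier to work with. The key names `id`, `retweet_count`, and `favorite_count` are assumed to match the tweet JSON, and an in-memory `io.StringIO` with made-up values stands in for the real file.

```python
import io
import json
import pandas as pd

# Stand-in for open('tweet-json.txt'); the counts below are made up
json_file = io.StringIO(
    '{"id": 889531135344209921, "retweet_count": 10, "favorite_count": 50}\n'
    '{"id": 1002, "retweet_count": 3, "favorite_count": 7}\n'
)

records = []
for line in json_file:
    tweet = json.loads(line)  # one tweet per line
    records.append({
        'tweet_id': tweet['id'],
        'retweet_count': tweet['retweet_count'],
        'favorite_count': tweet['favorite_count'],
    })

tweet_df = pd.DataFrame(records, columns=['tweet_id', 'retweet_count', 'favorite_count'])
print(tweet_df)
```

Swapping the `io.StringIO` object for `open('tweet-json.txt', 'r')` inside a `with` block should give the same result on the real file, provided the keys are present in every line.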


4. **Wrangling/Cleaning**
* Detect and document at least 8 quality issues
* Detect and document at least 2 tidiness issues

* You must clearly document the piece of assessed and cleaned (if necessary) data used to make each analysis and visualization.

5. **Analyze and Visualize your data**
* You must produce at least three (3) insights and one (1) visualization.
* You must clearly document the piece of assessed and cleaned (if necessary) data used to make each analysis and visualization.

6. **Create your Reports**
* wrangle_report.pdf or html
* act_report.pdf or html

**Required files to be submitted**

1. wrangle_act.ipynb: code for gathering, assessing, cleaning, analyzing, and visualizing data
2. wrangle_report.pdf or wrangle_report.html: documentation for data wrangling steps: gather, assess, and clean
3. act_report.pdf or act_report.html: documentation of analysis and insights into final data
4. twitter_archive_enhanced.csv: file as given
5. image_predictions.tsv: file downloaded programmatically
6. tweet_json.txt: file constructed via API
7. twitter_archive_master.csv: combined and cleaned data
8. any additional files (e.g. files for additional pieces of gathered data or a database file for your stored clean data)
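
One possible way to produce the combined file in item 7 is to merge the three tables on `tweet_id` and save the result. The sketch below uses tiny made-up DataFrames in place of the real, cleaned tables; the column `rating_numerator`, the inner-join choice, and the temporary output folder are assumptions for illustration.

```python
import os
import tempfile
import pandas as pd

# Tiny made-up stand-ins for the three cleaned tables
archive = pd.DataFrame({'tweet_id': [1, 2, 3], 'rating_numerator': [13, 12, 11]})
predictions = pd.DataFrame({'tweet_id': [1, 2],
                            'p1': ['golden retriever', 'Labrador retriever']})
counts = pd.DataFrame({'tweet_id': [1, 2, 3],
                       'retweet_count': [10, 3, 5],
                       'favorite_count': [50, 7, 20]})

# Inner joins keep only tweets present in all three tables
master = (archive
          .merge(predictions, on='tweet_id', how='inner')
          .merge(counts, on='tweet_id', how='inner'))

master_path = os.path.join(tempfile.mkdtemp(), 'twitter_archive_master.csv')
master.to_csv(master_path, index=False)  # index=False avoids an extra unnamed column
```

An inner join drops tweets that have no image prediction; a left join on the archive would keep them with missing values instead, so the choice depends on the analysis planned.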

**[The Project workspace contains three starter files](https://learn.udacity.com/nanodegrees/nd002-alg-t2/parts/cd0015/lessons/ls2232/concepts/eee7bb49-fccb-4c6d-a30b-ace5f69f33db)**

# Dataset Brainstorming

### [Download the Travel Times CSV file](https://1drv.ms/u/s!Ag9pH02JWJmogZk3h3FVNUf1kLJVMA?e=VT5gJ5)

### Import the csv file

In [2]:
# import data to dataframe
df = pd.read_csv('travel-times.csv')

# display top 5 rows
df.head()

Unnamed: 0,Date,StartTime,DayOfWeek,GoingTo,Distance,MaxSpeed,AvgSpeed,AvgMovingSpeed,FuelEconomy,TotalTime,MovingTime,Take407All,Comments
0,1/6/2012,16:37,Friday,Home,51.29,127.4,78.3,84.8,,39.3,36.3,No,
1,1/6/2012,08:20,Friday,GSK,51.63,130.3,81.8,88.9,,37.9,34.9,No,
2,1/4/2012,16:17,Wednesday,Home,51.27,127.4,82.0,85.8,,37.5,35.9,No,
3,1/4/2012,07:53,Wednesday,GSK,49.17,132.3,74.2,82.9,,39.8,35.6,No,
4,1/3/2012,18:57,Tuesday,Home,51.15,136.2,83.4,88.1,,36.8,34.8,No,


### Visually inspect the data from various angles to get familiar with it

In [192]:
df.tail()

Unnamed: 0,Date,StartTime,DayOfWeek,GoingTo,Distance,MaxSpeed,AvgSpeed,AvgMovingSpeed,FuelEconomy,TotalTime,MovingTime,Take407All,Comments
200,7/18/2011,08:09,Monday,GSK,54.52,125.6,49.9,82.4,7.89,65.5,39.7,No,
201,7/14/2011,08:03,Thursday,GSK,50.9,123.7,76.2,95.1,7.89,40.1,32.1,Yes,
202,7/13/2011,17:08,Wednesday,Home,51.96,132.6,57.5,76.7,,54.2,40.6,Yes,
203,7/12/2011,17:51,Tuesday,Home,53.28,125.8,61.6,87.6,,51.9,36.5,Yes,
204,7/11/2011,16:56,Monday,Home,51.73,125.0,62.8,92.5,,49.5,33.6,Yes,


### Use sample to view a sizeable random chunk of the data, and run it several times to see different samples and get a better idea of what is going on

In [22]:
df.sample(60)

Unnamed: 0,Date,StartTime,DayOfWeek,GoingTo,Distance,MaxSpeed,AvgSpeed,AvgMovingSpeed,FuelEconomy,TotalTime,MovingTime,Take407All,Comments
129,9/8/2011,16:37,Thursday,Home,50.92,137.0,66.9,77.7,8.5,45.6,39.3,No,
135,9/2/2011,17:07,Friday,Home,51.17,129.7,77.7,87.9,8.5,39.5,34.9,No,
104,10/3/2011,07:41,Monday,GSK,50.65,127.4,91.1,95.2,7.97,33.4,31.9,Yes,
156,8/19/2011,07:05,Friday,GSK,49.18,123.0,72.0,81.4,8.37,41.0,36.3,No,Start early to run a batch
70,11/2/2011,16:45,Wednesday,Home,51.27,121.5,75.1,79.1,8.32,41.0,38.9,No,
83,10/20/2011,08:22,Thursday,GSK,51.74,124.5,68.9,79.2,8.75,45.0,39.2,No,
198,7/19/2011,17:17,Tuesday,Home,51.16,126.7,92.2,102.6,7.89,33.3,29.9,Yes,
134,9/6/2011,07:50,Tuesday,GSK,54.36,132.5,95.1,98.0,8.5,34.3,33.3,Yes,
88,10/17/2011,16:58,Monday,Home,51.3,127.3,78.6,82.9,8.75,39.1,37.1,No,
79,10/25/2011,08:13,Tuesday,GSK,51.75,127.4,72.1,82.0,8.97,43.1,37.8,No,


## Note down your initial observations, questions, and thoughts

### Q1 - List 10 data wrangling (assessment) observations that you made or that you would look for
Tip: Vigorously question the dataset, assess it visually and programmatically, and list your thoughts and observations
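
For the programmatic side of the assessment, a few pandas calls usually surface the first issues. The sketch below runs them on four rows copied from the outputs above (without the unnamed index column); on the full dataset the same calls would be run on `df`.

```python
import io
import pandas as pd

# Four rows copied from the travel-times outputs shown earlier
csv_text = """Date,StartTime,DayOfWeek,GoingTo,Distance,MaxSpeed,AvgSpeed,AvgMovingSpeed,FuelEconomy,TotalTime,MovingTime,Take407All,Comments
1/6/2012,16:37,Friday,Home,51.29,127.4,78.3,84.8,,39.3,36.3,No,
1/6/2012,08:20,Friday,GSK,51.63,130.3,81.8,88.9,,37.9,34.9,No,
7/18/2011,08:09,Monday,GSK,54.52,125.6,49.9,82.4,7.89,65.5,39.7,No,
8/19/2011,07:05,Friday,GSK,49.18,123.0,72.0,81.4,8.37,41.0,36.3,No,Start early to run a batch
"""
sample = pd.read_csv(io.StringIO(csv_text))

sample.info()                              # dtypes and non-null counts per column
missing_counts = sample.isnull().sum()     # missing values per column
duplicate_rows = sample.duplicated().sum() # fully duplicated rows
destinations = sample['GoingTo'].value_counts()  # check categories for typos

print(missing_counts[missing_counts > 0])
```

Even on this small slice, `FuelEconomy` and `Comments` show missing values, and `Date` and `StartTime` are read as text rather than as datetime values, which are both candidates for the observation list.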


### Q2 - List 10 data exploration ideas
Tip: Assess the data from various angles and propose questions and exploration ideas that the dataset could answer
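
As one example of turning an idea into a query, the sketch below asks whether the trip home takes longer than the trip to GSK, and at what hour each trip starts. It uses the first five rows from `df.head()` above; the derived column name `StartHour` is an assumption.

```python
import pandas as pd

# First five rows from df.head(), restricted to the columns used here
sample = pd.DataFrame({
    'Date': ['1/6/2012', '1/6/2012', '1/4/2012', '1/4/2012', '1/3/2012'],
    'StartTime': ['16:37', '08:20', '16:17', '07:53', '18:57'],
    'GoingTo': ['Home', 'GSK', 'Home', 'GSK', 'Home'],
    'TotalTime': [39.3, 37.9, 37.5, 39.8, 36.8],
})

# Average total travel time per destination
avg_time = sample.groupby('GoingTo')['TotalTime'].mean()

# Combine date and time into one timestamp, then pull out the start hour
start = pd.to_datetime(sample['Date'] + ' ' + sample['StartTime'],
                       format='%m/%d/%Y %H:%M')
sample['StartHour'] = start.dt.hour

print(avg_time)
print(sample[['GoingTo', 'StartHour']])
```

Five rows are far too few to conclude anything; the point is only the shape of the query, which can be rerun on the full `df`.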

# Google BigQuery

* Gathering Data - upload CSV file to a dataset using create table

* Assessing Data - view a data sample and perform visual inspection

* Analyzing Data - Analyze data with various queries

* Storing Data - create a new table from query results and save it locally

* [Visualizing & Exploring Data](https://datastudio.google.com/)