# [CPSC 325](https://github.com/GonzagaCPSC325) Data Science Project Lab
[Gonzaga University](https://www.gonzaga.edu/)

[Gina Sprint](http://cs.gonzaga.edu/faculty/sprint/)

# Tool Demos
What are our learning objectives for this lesson?
* Follow along with some cloud data science demos

Content used in this lesson is based upon information in the following sources:
* Google Cloud Platform training courses

## Warm up Task(s)
1. Go to https://www.allmysportsteamssuck.com/ncaa-division-i-football-and-basketball-twitter-hashtags-and-handles/ and inspect/view the page source
    * In Chrome, right click -> Inspect or View Page Source
1. Suppose we want to extract the Men's Basketball Team Twitter handles. Which tags do we need to find in the HTML?

## Today
* Announcements
    * Project pitch: IoT water flow with microcontroller, microphone, and pub/sub
    * Research proposal is due Thursday night, let's go over it
* Go over lambda function solutions
* Project brainstorming lab speed dating!!
* Webscraping
* TODO before next class: please finish the last page of the project brainstorming lab

## Webscraping
We can scrape data we are interested in from web pages. While there are several libraries to help you do this, today we will use [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/):
> Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

## Tasks
1. Make a new project called WebScrapingFun
1. Use Beautiful Soup to scrape the school names and handles from https://www.allmysportsteamssuck.com/ncaa-division-i-football-and-basketball-twitter-hashtags-and-handles/
1. Store the scraped school names and handles in a pandas DataFrame and write it to a CSV file
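
The tasks above can be sketched roughly as follows. This is a minimal sketch, not the in-class solution: `SAMPLE_HTML` is a made-up stand-in for the real page (whose table layout may differ), and the helper names `parse_handles` and `handles_to_dataframe` are hypothetical. It assumes `beautifulsoup4` and `pandas` are installed.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page's HTML; inspect the real page source
# to find the actual tags and adjust the parsing below.
SAMPLE_HTML = """
<table>
  <tr><th>School</th><th>Men's Basketball</th></tr>
  <tr><td>Gonzaga</td><td>@ZagMBB</td></tr>
  <tr><td>Example State</td><td>@ExampleStateMBB</td></tr>
</table>
"""


def parse_handles(html):
    """Return one dict per table row that has a school cell and a handle cell."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = tr.find_all("td")  # the header row uses <th>, so it has no <td> cells
        if len(cells) >= 2:
            rows.append({
                "school": cells[0].get_text(strip=True),
                "handle": cells[1].get_text(strip=True),
            })
    return rows


def handles_to_dataframe(rows):
    """Put the scraped rows in a DataFrame with fixed column order."""
    return pd.DataFrame(rows, columns=["school", "handle"])


handles_df = handles_to_dataframe(parse_handles(SAMPLE_HTML))
print(handles_df)
# To save the result: handles_df.to_csv("handles.csv", index=False)
```

For the real page, the HTML string would come from an HTTP request (for example `requests.get(url).text`, if the `requests` library is installed) instead of `SAMPLE_HTML`.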

## Warm up Task(s)
1. Open your project from last class (I called mine WebScrapingFun; it is pushed to GitHub)

## Today
* Announcements
    * Research proposal is due tonight, questions?
    * Let's go over the project log
* Project brainstorming lab idea sharing!!
* Finish web scraping demo
* Start working with Twitter API

## Twitter API
We will use the [Twitter API](https://developer.twitter.com/en/docs/twitter-api) to get account information and tweets from basketball teams. For ease of use, we will take advantage of the [Tweepy](https://docs.tweepy.org/en/stable/) Python library to help us out. 

## Tasks
1. Setup
    1. Go to https://developer.twitter.com/ and sign up for the API
        1. Make a project, an app, and generate a "Bearer Token" for authenticating with the app
        1. Copy this token; we will need it later
    1. Install Tweepy with `pip install tweepy`
    1. Add the Python .gitignore file to your project: https://github.com/github/gitignore
1. WebScrapingFun: pick a Twitter handle you want to work with, or just use @ZagMBB
    1. We will use this Tweepy function to get information about this account: https://docs.tweepy.org/en/stable/client.html#tweepy.Client.get_user
        * We are mostly interested in the ID, since it is unique to the account and doesn't change like a handle (AKA username) can
        * But there is so much info you can grab!
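
Here is a minimal sketch of the ID lookup described above. The helper name `get_user_id` is hypothetical; it only relies on `Client.get_user` accepting a `username` keyword and returning a response whose `data` holds the user. The client is passed in so the function can be tried without a token.

```python
def get_user_id(client, username):
    """Look up an account by handle (without the @) and return its numeric ID."""
    response = client.get_user(username=username)
    if response.data is None:  # nothing came back for this handle
        raise ValueError(f"No user found for handle {username!r}")
    return response.data.id


# Usage (needs network access and a real bearer token):
#   import tweepy
#   client = tweepy.Client(bearer_token="your token here")
#   print(get_user_id(client, "ZagMBB"))
```

The returned ID is what later calls such as `get_users_tweets` expect, so it is worth saving somewhere.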

## Warm up Task(s)
1. Go to Canvas -> Announcements and follow the instructions to redeem a Google Cloud Platform (GCP) coupon
1. Run WebScrapingFun/main.py and copy the user_id for the account you were grabbing info from (@ZagMBB's is 602989093)
1. Create a new project called CloudFunctionFun
    * In a main.py, paste the user_id into a new variable
    * Read the docs for: https://docs.tweepy.org/en/stable/client.html#tweepy.Client.get_users_tweets
    * We will need the Twitter API bearer token again
        * Either copy twitter_keys.json and read in the variable OR
        * In the terminal where you will run main.py, add an environment variable for your Twitter API bearer token
            * In the docker container: `export BEARER_TOKEN="your token here"`
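
Either option for getting the bearer token could be handled by one small helper. This is a sketch under assumptions: the function name `load_bearer_token` is hypothetical, and the JSON file is assumed to contain a top-level `"bearer_token"` key (adjust if your twitter_keys.json is laid out differently).

```python
import json
import os


def load_bearer_token(keys_path="twitter_keys.json"):
    """Prefer the BEARER_TOKEN environment variable, else fall back to the JSON file."""
    token = os.environ.get("BEARER_TOKEN")
    if token:
        return token
    # Assumption: the file looks like {"bearer_token": "..."}
    with open(keys_path) as infile:
        return json.load(infile)["bearer_token"]
```

Checking the environment variable first means the same main.py can run locally with the JSON file and in the cloud with only the variable set. Keep twitter_keys.json out of version control either way.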

## Today
* Announcements
    * Nice research proposals! More details would be helpful
    * Your homework for the next few weeks is to work on your research and keep track of your progress in your log :)
    * We are going to be working with cloud services these next few days. **Don't forget to shut services down!!**
* Requesting recent tweets from an account by ID
* Deploying this recent tweet code as a GCP Cloud Function
* Scheduling the Cloud Function to run periodically
* (if time) Inserting the recent tweets into a BigQuery table

## Google Cloud Functions
From https://cloud.google.com/functions:
>**Simplified developer experience and increased developer velocity**  
Cloud Functions has a simple and intuitive developer experience. Just write your code and let Google Cloud handle the operational infrastructure. Develop faster by writing and running small code snippets that respond to events. Streamline challenging orchestration problems by connecting Google Cloud products to one another or third party services using events.  
**Pay only for what you use**  
You are only billed for your function’s execution time, metered to the nearest 100 milliseconds. You pay nothing when your function is idle. Cloud Functions automatically spins up and backs down in response to events.  
**Avoid lock-in with open technology**  
Use open source FaaS (function as a service) framework to run functions across multiple environments and prevent lock-in. Supported environments include Cloud Functions, local development environment, on-premises, Cloud Run, and other Knative-based serverless environments.

## Tasks
1. Setup
    * Redeem cloud coupon and make a billing account with an associated project
    * Enable Cloud Functions
1. CloudFunctionFun: pick a Twitter user id to get recent tweets from
    * We will use this Tweepy function to get tweets using the user id: https://docs.tweepy.org/en/stable/client.html#tweepy.Client.get_users_tweets
    * We will deploy this tweet fetching code as a Cloud Function
    * Then trigger the Cloud Function with Cloud Scheduler
        * Note: this cron expression helper is helpful: https://crontab.cronhub.io/
    * We will insert these tweets as rows in a BigQuery table
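
The last step, turning fetched tweets into BigQuery rows, might look something like the sketch below. The helper names and the two-column layout (`id`, `text`) are assumptions for illustration; your table schema may differ. The BigQuery client is passed in, and the only client call used is `insert_rows_json`, which returns a list of per-row errors.

```python
def tweets_to_rows(tweets):
    """Convert tweet objects (with .id and .text) into JSON-style row dicts."""
    rows = []
    for tweet in tweets or []:  # response.data is None when no tweets match
        rows.append({"id": str(tweet.id), "text": tweet.text})
    return rows


def insert_tweets(bq_client, table_id, tweets):
    """Insert tweets into a BigQuery table; returns the list of insert errors."""
    rows = tweets_to_rows(tweets)
    if not rows:
        return []  # nothing to insert, so skip the API call
    # An empty list back from insert_rows_json means every row was accepted.
    return bq_client.insert_rows_json(table_id, rows)


# Usage inside the Cloud Function (hypothetical project/dataset/table names):
#   from google.cloud import bigquery
#   response = client.get_users_tweets(user_id)
#   errors = insert_tweets(bigquery.Client(),
#                          "my-project.my_dataset.tweets", response.data)
```

For the scheduling step, a cron expression such as `40 10 * * *` would trigger the function every day at 10:40.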

## Warm up Task(s)
1. Open CloudFunctionFun
1. [`get_users_tweets`](https://docs.tweepy.org/en/stable/client.html#tweepy.Client.get_users_tweets) accepts a keyword arg called `start_time`: 
>start_time (datetime.datetime | str | None) –
YYYY-MM-DDTHH:mm:ssZ (ISO 8601/RFC 3339). The oldest or earliest UTC timestamp from which the Tweets will be provided. Only the 3200 most recent Tweets are available. Timestamp is in second granularity and is inclusive (for example, 12:00:01 includes the first second of the minute). Minimum allowable time is 2010-11-06T00:00:00Z
1. Provide a `start_time` that is 24 hours before "now"
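
One possible way to build that `start_time`, using only the standard library. The helper names are hypothetical. Since the docs say `start_time` may be a `datetime.datetime` or a string, either return value below should be usable.

```python
from datetime import datetime, timedelta, timezone


def start_time_hours_ago(hours=24, now=None):
    """Return a timezone-aware UTC datetime `hours` before `now`."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - timedelta(hours=hours)


def to_rfc3339(moment):
    """Format a datetime as YYYY-MM-DDTHH:mm:ssZ in UTC."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# Usage:
#   response = client.get_users_tweets(user_id, start_time=start_time_hours_ago())
```

Using a timezone-aware UTC datetime avoids accidentally sending local time, since the API interprets the timestamp as UTC.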

## Today
* Announcements
    * Did your cloud function execute yesterday at 10:40am? :)
    * Work on your research, keep track of progress in your log
    * Note on citing AI generation tools: https://www.nature.com/articles/d41586-023-00191-1
    * Question: One more demo next week on Flask app w/ GCP Cloud Run? Or are you good?
* Finish CloudFunctionFun
    * Inserting 24 hours of recent tweets into a BigQuery table
* CloudPubSubFun
    * Publish tweet "messages" to a pub/sub topic
    * Subscribe to topic "messages" via a Cloud Function
        * Insert "messages" into BigQuery
    * (if time) Stream tweets via Compute Engine

## Pub/Sub
From https://cloud.google.com/pubsub:
>Ingest events for streaming into BigQuery, data lakes or operational databases.  
**Stream analytics and connectors**  
Native Dataflow integration enables reliable, expressive, exactly-once processing and integration of event streams in Java, Python, and SQL.  
**In-order delivery at scale**  
Optional per-key ordering simplifies stateful application logic without sacrificing horizontal scale—no partitions required.  
**Cost-optimized ingestion with Pub/Sub Lite**  
Complementing Pub/Sub, Pub/Sub Lite aims to be the lowest cost option for high-volume event ingestion. Pub/Sub Lite offers regional or zonal storage, putting you in control of capacity management.

## Tasks
1. Setup
    * Create a new project folder called CloudPubSubFun
    * In a main.py, read in your bearer_token from twitter_keys.json
1. CloudPubSubFun: pick 1 to 5 Twitter accounts to stream tweets from
    * We will subclass this Tweepy class to stream tweets using a filtered rule: https://docs.tweepy.org/en/stable/streamingclient.html
    * We will create a pub/sub topic and publish our streaming tweets as messages to it
    * We will create a Cloud Function that subscribes to these messages
        * Optionally we can have this function insert the message into our tweets table in BigQuery
    * We will move our streaming code over to the cloud via a [free tier e2-micro Compute Engine VM](https://cloud.google.com/free/docs/free-cloud-features#free-tier-usage-limits)
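
The publish/subscribe hand-off above can be sketched as follows. This is a sketch under assumptions: the JSON payload with `id` and `text` keys and all function names are hypothetical, and the subscriber uses the `(event, context)` signature of a Pub/Sub-triggered background Cloud Function, where the message body arrives base64-encoded in `event["data"]`.

```python
import base64
import json


def encode_tweet(tweet_id, text):
    """Serialize a tweet to bytes, the form Pub/Sub expects for message data."""
    return json.dumps({"id": str(tweet_id), "text": text}).encode("utf-8")


def decode_event(event):
    """Undo the base64 wrapping Pub/Sub applies and parse the JSON payload."""
    payload = base64.b64decode(event["data"]).decode("utf-8")
    return json.loads(payload)


def handle_tweet_message(event, context):
    """Entry point for the subscribing Cloud Function."""
    tweet = decode_event(event)
    print(f"Received tweet {tweet['id']}: {tweet['text']}")
    return tweet  # a BigQuery insert could go here instead


# Publishing side (hypothetical project and topic names):
#   from google.cloud import pubsub_v1
#   publisher = pubsub_v1.PublisherClient()
#   topic_path = publisher.topic_path("my-project", "tweets")
#   publisher.publish(topic_path, encode_tweet(tweet.id, tweet.text))
```

Keeping the encode and decode steps in small functions makes it possible to test the message format locally before deploying anything.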