![alt text](Piazza_logo.svg "The piazza logo")

# An Introduction to Data Science Through Piazza Forum Data
*A comprehensive guide to the data science pipeline by Elan Naideck and Erika Schlunk*

Welcome, aspiring data scientists! Have you ever wanted to do a "machine learning" or a "big data"? If so, this tutorial will show you how to get started with the data science pipeline and introduce you to industry-standard tools for data analytics and visualization. It assumes you already know a thing or two about Python, so if at any point you don't understand the code, it may be wise to brush up on your Python knowledge.

Now for the actual project. In this tutorial we will use data from our own data science class forum to analyze student and instructor participation. A professor might like to know how well their teaching assistants have been performing, or how much their students are participating in the discussions. On our journey of meta-analysis we'll walk you through all 5 steps of the data lifecycle shown below.

![alt text](Data_lifecycle.png "The data lifecycle")

As you can see from the many interconnecting arrows, this is not a strictly linear process. As you go through it yourself you will find that the universe is conspiring to make our lives miserable and no step is straightforward. At any point you may find yourself having to backtrack and wrangle more data, or collect more data, or wrangle more data. Or wrangle your data some more, and then wrangle it again for good measure. This tutorial can't show you exactly how to do it, because the process is unique to every dataset you may find yourself working with. It will give you the tools and outline the process, but you'll have to use your brain to figure out how best to prepare and process your own data. Copying this code won't help you much.

## Your Environment And You

Before we get going on the real data science, we have to talk about your programming environment. If you aren't familiar with the world of Jupyter notebooks, you're about to be. If you already know them and have Jupyter installed on your machine, you can skip this part and move on to the required packages. These "notebooks" are the industry standard for data science with Python thanks to their publishability, modular design, and inline data visualization support. You can get basic instructions on how to install Jupyter [here](https://jupyter.readthedocs.io/en/latest/install.html), but if you like spoilers and would like a much more comprehensive guide on how to use Jupyter you can go [here](https://www.datacamp.com/community/tutorials/tutorial-jupyter-notebook). Don't worry about mastering notebooks. Just get one up and running so you can experiment and write your own code.

## The Packages

Python wasn't designed as a data science language, but through the magic of packages Python can rival or even surpass statistical languages like R (fight me, stat majors). The beating heart of the Python science ecosystem is the pandas package. Pandas is a data analysis library that has become ubiquitous in the data science world because of its fast, well-optimized data structures and intuitive design. It's the first thing you want to have installed when you begin any data science project, so if you don't already have it in your Python environment, install it now using pip or conda, depending on your setup. More help for installing pandas can be found [here](https://pandas.pydata.org/pandas-docs/stable/install.html).

Once you have pandas up and running you can import it using the usual Python import statements. You'll be using it a lot, so I strongly recommend importing it with an alias. The industry standard is to import it as pd. The full list of standard packages you want to install when starting any project is [pandas](https://pandas.pydata.org/), [numpy](https://numpy.org/), [matplotlib](https://matplotlib.org/), [scikit-learn](https://scikit-learn.org/stable/), and [seaborn](https://seaborn.pydata.org/). If you want more information on what each package does and how to use it, just follow the links to their websites. For now we'll just import them and explain how to use them as we go.

In [75]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

The primary data structure around which all of your work will revolve is the pandas dataframe. Dataframes store data like a giant Excel spreadsheet, with named columns and indexed rows. If you already know a thing or two about pandas you can move on to data collection; otherwise [here](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html) is a guide to all of the basic dataframe operations that we will need as we proceed. You might want to keep it handy when writing your own code.
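As a quick illustration, here is a minimal sketch of a dataframe built by hand. The column names and values are made up for the example and are not part of the Piazza data.

```python
import pandas as pd

# A tiny, made-up table: each dictionary key becomes a named column
df = pd.DataFrame({
    "subject": ["Quiz 1 question", "Office hours moved", "Project 2 help"],
    "unique_views": [120, 85, 240],
})

# Rows are indexed 0, 1, 2 by default; select a column by its name
views = df["unique_views"]

# Filter rows with a boolean condition, much like filtering a spreadsheet
popular = df[df["unique_views"] > 100]

print(df.shape)  # (rows, columns)
print(popular["subject"].tolist())
```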

## Data Collection

Pictured below is the average state of real-world data. Reality is messy, and so is the data it produces. Data can be downloaded from government websites, retrieved by API calls, scraped off of websites manually, computer generated, or beamed into your brain by hyperintelligent beings from another world.

![alt text](dumpster-fire.gif "The average state of real world data")

For our particular situation we're going to pull our data with an unofficial API for the Piazza forums that somebody was kind enough to post online. API stands for application programming interface, and it's one of the nicest ways you can get data. An API is a package or set of instructions, released publicly by companies or other organizations, that you can use to query data from their web servers. They send you the data, usually in some documented format that you can then parse into your pandas dataframes. Pandas also has functions for importing CSV files, in case your data lives in a file you can download. If collecting your data is more complicated than either of those methods, may God and [beautifulsoup4](https://pypi.org/project/beautifulsoup4/) help you.

Thankfully we have an API, so the first step is to install it. The Piazza API, which can be found [here](https://pypi.org/project/piazza-api/) with full documentation [here](https://github.com/hfaran/piazza-api/), can be installed using the standard pip/conda methods, which saves us quite a bit of hassle. Once we have it installed in our Python environment we can import it into our Jupyter notebook.
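To make that concrete, here is a small sketch of turning an API-style response into a dataframe. The JSON string below is a hand-written stand-in for a server response, not real Piazza output, and its field names are assumptions chosen for illustration.

```python
import json
import pandas as pd

# Hypothetical response body: a JSON object holding a list of records
raw_response = '''
{"feed": [
    {"id": "a1", "type": "note", "unique_views": 10},
    {"id": "b2", "type": "question", "unique_views": 25}
]}
'''

# Parse the text into Python dictionaries and lists
payload = json.loads(raw_response)

# A list of flat dictionaries maps directly onto rows and columns
records = pd.DataFrame(payload["feed"])

print(records)
```

If the data comes as a downloadable CSV file instead, `pd.read_csv` plays the same role.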

In [None]:
from piazza_api.rpc import PiazzaRPC

As you can see, I didn't just import the whole API. Before I even wrote my import statement I read through the API documentation and figured out which part of the package I need for my particular application. For this project I need to be able to download class and post data. The PiazzaRPC class in the documentation seems to have the functions I need, so I imported just that. When it comes to using APIs, *DOCUMENTATION IS YOUR FRIEND*.

Now that we've got our API and we've read its documentation, we can start collecting our data. The first step is to log in to the API. Many APIs have access tied to online accounts, so you need to log in with your credentials. Obviously I removed my own credentials from the code below before posting it online, but if you have a Piazza account you can fill in your own credentials and class id and the code should work just fine. Once you've logged in, you should print out your data so you can visually inspect how it's structured and whether anything is wrong with it. From there it's time to play data detective.
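One way to keep credentials out of a notebook entirely is to read them at run time instead of typing them into a cell. The sketch below is one possible approach and is not part of the Piazza API: the environment variable names `PIAZZA_EMAIL` and `PIAZZA_PASSWORD` and the helper `load_credentials` are made up for this example.

```python
import os
from getpass import getpass


def load_credentials(env=os.environ):
    """Return (email, password), preferring environment variables.

    PIAZZA_EMAIL and PIAZZA_PASSWORD are hypothetical names chosen for
    this sketch; anything missing is requested interactively instead.
    """
    email = env.get("PIAZZA_EMAIL") or input("Email: ")
    # getpass hides the password while it is typed
    password = env.get("PIAZZA_PASSWORD") or getpass("Password: ")
    return email, password
```

The returned pair can then be handed to the login call, for example `p.user_login(email=email, password=password)`, so nothing secret is saved in the notebook.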

In [58]:
# Below is the class id. Needed for querying any data for the target class. Don't ask how I got it
# class nid: jzlv5mrrqnn4tz

# Create a Piazza API instance and log in to it
# Fill in your own credentials here, and REMOVE THEM BEFORE POSTING THIS ONLINE
p = PiazzaRPC()
p.user_login(email='YOUR_EMAIL', password='YOUR_PASSWORD')

# The returned data structure is a dictionary containing tons of data about the class
# It has to be sifted through to get the list of posts (one dictionary per post)
class_data = p.get_my_feed(limit=2000, offset=0, sort="recent", nid='jzlv5mrrqnn4tz')
posts_dict = class_data['feed']
posts_dict[5]

{'fol': 'project3|',
 'pin': 1,
 'm': 1574114977824,
 'rq': 0,
 'id': 'k34o5uf2wb63ee',
 'unique_views': 278,
 'score': 278.0,
 'is_new': False,
 'version': 8,
 'bucket_name': 'Pinned',
 'bucket_order': 0,
 'folders': ['project3'],
 'nr': 423,
 'main_version': 8,
 'request_instructor': 0,
 'log': [{'t': '2019-11-18T16:54:37Z', 'u': 'iv9hhjrk2iv2w7', 'n': 'create'},
  {'t': '2019-11-18T16:54:59Z', 'u': 'jl3b2jmi38z3wh', 'n': 'followup'},
  {'t': '2019-11-18T16:55:33Z', 'n': 'followup'},
  {'t': '2019-11-18T16:56:31Z', 'u': 'iv9hhjrk2iv2w7', 'n': 'update'},
  {'t': '2019-11-18T16:59:32Z', 'n': 'followup'},
  {'t': '2019-11-18T17:00:49Z', 'n': 'followup'},
  {'t': '2019-11-18T18:46:07Z', 'n': 'followup'},
  {'t': '2019-11-18T22:09:37Z', 'u': 'iv9hhjrk2iv2w7', 'n': 'feedback'}],
 'subject': 'Extension on Project 3',
 'no_answer_followup': 0,
 'num_favorites': 1,
 'type': 'note',
 'tags': ['instructor-note', 'pin', 'project3'],
 'content_snipet': 'I&#39;m extending P3&#39;s due date to next

In [59]:
print(class_data.keys())
p.get_users(user_ids=['iv9hhjrk2iv2w7'], nid='jzlv5mrrqnn4tz')
#class_data['users']

dict_keys(['more', 'last_networks', 'drafts', 'sort', 'avg_cnt', 'users', 'tags', 'feed', 'no_open_teammate_search', 'avg', 't', 'notification_subjects', 'token_data', 'draft', 'users_7', 'notifications', 'hof'])


[{'role': 'professor',
  'name': 'John Dickerson',
  'endorser': {},
  'admin': True,
  'photo': None,
  'id': 'iv9hhjrk2iv2w7',
  'photo_url': None,
  'us': False,
  'facebook_id': None}]

Let's look at what we have so far!

From the API, we requested the data using our login info and the class ID for this class, CMSC 320. This gives us a pretty big dictionary. We then grabbed the "feed" part of that to get a list of posts stored as dictionaries.
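A quick way to get a feel for a list of dictionaries like this is to tally a field across all of them. The sketch below uses a few hand-made stand-in posts with the same `type` and `folders` keys seen in the output above; with the real data you would pass `posts_dict` instead of `sample_posts`.

```python
from collections import Counter

# Stand-in posts that mimic the shape of the feed entries (values are made up)
sample_posts = [
    {"type": "note", "folders": ["project3"]},
    {"type": "question", "folders": ["quiz1"]},
    {"type": "question", "folders": ["project3", "logistics"]},
]

# How many posts of each type are there?
type_counts = Counter(post["type"] for post in sample_posts)

# Which keys does every post share? Useful for spotting optional fields
shared_keys = set.intersection(*(set(post) for post in sample_posts))

print(type_counts.most_common())
print(sorted(shared_keys))
```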

## Data Processing

Before we do anything else, let's make sure pandas is imported to help with storing and manipulating our data.

In [4]:
import pandas as pd

### Breaking it down

Now that we have our data, we need to decide what to save about each post. Our goal is to analyze student and instructor participation, so we want fields that capture when a post was made, where it was filed, and how much activity it drew.

Let's start with the following:
- posted : date and time
- folders : list of strings
- type : string
- subject : string
- total_replies : int
- instructor_replies : int
- unique_views : int
- num_favorites : int

### Posted

We want to get the date and time that the post was created.
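In the post shown earlier, the first entry of `log` has `'n': 'create'`, so its `t` field looks like the creation time. Assuming that holds for every post, a minimal sketch of parsing it follows. The helper name `parse_posted` is made up for this example, and the sample log is copied from the output above.

```python
from datetime import datetime

# Timestamps in the log look like 2019-11-18T16:54:37Z
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_posted(post):
    """Return the creation time of a post as a datetime.

    Assumes the first log entry is the 'create' event.
    """
    return datetime.strptime(post["log"][0]["t"], TIMESTAMP_FORMAT)


sample_post = {
    "log": [
        {"t": "2019-11-18T16:54:37Z", "u": "iv9hhjrk2iv2w7", "n": "create"},
        {"t": "2019-11-18T16:54:59Z", "u": "jl3b2jmi38z3wh", "n": "followup"},
    ]
}

posted = parse_posted(sample_post)
print(posted)
```

The result is a naive datetime with no timezone attached; the trailing `Z` in the raw string marks the value as UTC.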

### Folders

In [5]:
folders = class_data['tags']['instructor']
folders

['quiz1',
 'quiz2',
 'quiz3',
 'quiz4',
 'quiz5',
 'quiz6',
 'quiz7',
 'quiz8',
 'quiz9',
 'quiz10',
 'quiz11',
 'quiz12',
 'project1',
 'project2',
 'project3',
 'project4',
 'midterm_exam',
 'final_tutorial',
 'other',
 'project',
 'exam',
 'logistics']

### Type, Subject, Total Replies, Instructor Replies, Unique Views, Num Favorites

In [32]:
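# Build each Series from a list in one step. This gives a clean 0..n-1 index and
# avoids Series.append, which was removed in pandas 2.0.
types = pd.Series([post["type"] for post in posts_dict])
subjects = pd.Series([post["subject"] for post in posts_dict])
# The first log entry is the post's creation, so the rest are counted as replies
total_replies = pd.Series([len(post["log"]) - 1 for post in posts_dict])
unique_views = pd.Series([post["unique_views"] for post in posts_dict])
num_favorites = pd.Series([post["num_favorites"] for post in posts_dict])
# Instructor replies need the user roles, so they are left empty for now
instructor_replies = pd.Series(dtype="int64")
types

0          poll
1          note
2          note
3          note
4          note
         ...   
498    question
499        note
500    question
501    question
502        note
Length: 503, dtype: object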

In [61]:
from datetime import datetime

posts = pd.DataFrame(columns=['posted', 'total_replies', 'instructor_replies', 'unique_views', 'num_favs']
                     + folders + ['subject'])

# Profiles of every user seen in a post log, keyed by user id.
# Despite the name, this ends up holding students as well as instructors.
instructors = {}

for i, post in enumerate(posts_dict):
    log = post['log']
    # The first log entry records the creation of the post
    posts.loc[i, 'posted'] = datetime.strptime(log[0]['t'], '%Y-%m-%dT%H:%M:%SZ')
    # Every later log entry is counted as a reply (this includes updates and feedback)
    posts.loc[i, 'total_replies'] = len(log) - 1
    posts.loc[i, 'instructor_replies'] = 0  # PLACEHOLDER: not computed yet
    posts.loc[i, 'unique_views'] = post['unique_views']
    posts.loc[i, 'num_favs'] = post['num_favorites']
    # One boolean column per folder: start at False, then mark the post's own folders
    for folder in folders:
        posts.loc[i, folder] = False
    # A folder missing from `folders` (e.g. hw1) becomes a new column here
    for folder in post['folders']:
        posts.loc[i, folder] = True
    posts.loc[i, 'subject'] = post['subject']
    # Some log entries have no 'u' (user id) key, so check before looking the user up
    for entry in log:
        if 'u' in entry and entry['u'] not in instructors:
            instructors[entry['u']] = p.get_users(user_ids=[entry['u']], nid='jzlv5mrrqnn4tz')

posts.head()

|   | posted | total_replies | instructor_replies | unique_views | num_favs | quiz1 | quiz2 | quiz3 | quiz4 | quiz5 | ... | midterm_exam | final_tutorial | other | project | exam | logistics | subject | hw2 | polls | hw1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-12-09 18:08:02 | 7 | 0 | 121 | 0 | False | False | False | False | False | ... | False | True | False | False | False | False | OH: Wednesday, Dec. 11 1-3pm AVW 1120 |  |  |  |
| 1 | 2019-12-05 02:59:01 | 0 | 0 | 224 | 0 | False | False | False | False | False | ... | False | True | False | False | False | False | Final Request for the Final Tutorial |  |  |  |
| 2 | 2019-12-04 13:08:09 | 0 | 0 | 157 | 0 | False | False | False | False | False | ... | False | False | False | False | False | True | Office Hours Cancelled: Tomorrow, 8:30-10:30am |  |  |  |
| 3 | 2019-12-02 19:02:23 | 2 | 0 | 243 | 0 | False | False | False | False | False | ... | False | False | False | False | False | False | Project 4 Key |  |  |  |
| 4 | 2019-11-19 15:04:47 | 10 | 0 | 272 | 0 | False | False | False | False | False | ... | True | False | False | False | False | False | Midterm grades posted |  |  |  |
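The `instructor_replies` column is still a placeholder at this point. One way it could be filled in is sketched below, under a few assumptions: the roles `professor` and `ta` count as instructors, every log entry after the first counts as a reply, and entries without a `u` key cannot be attributed to anyone. The helper `count_instructor_replies` and the `roles` lookup are made up for this example; the sample log is taken from the post shown earlier, and the role given to the second user is a guess for illustration.

```python
# Roles treated as instructors in this sketch (an assumption)
INSTRUCTOR_ROLES = {"professor", "ta"}


def count_instructor_replies(log, roles):
    """Count log entries after the first that were made by an instructor.

    `roles` maps a user id to a role string such as 'professor' or 'student'.
    Entries without a 'u' key, or with an unknown user, are skipped.
    """
    count = 0
    for entry in log[1:]:
        user_id = entry.get("u")
        if roles.get(user_id) in INSTRUCTOR_ROLES:
            count += 1
    return count


sample_log = [
    {"t": "2019-11-18T16:54:37Z", "u": "iv9hhjrk2iv2w7", "n": "create"},
    {"t": "2019-11-18T16:54:59Z", "u": "jl3b2jmi38z3wh", "n": "followup"},
    {"t": "2019-11-18T16:55:33Z", "n": "followup"},
    {"t": "2019-11-18T16:56:31Z", "u": "iv9hhjrk2iv2w7", "n": "update"},
]
# Hypothetical role lookup; the 'student' role here is assumed, not known
roles = {"iv9hhjrk2iv2w7": "professor", "jl3b2jmi38z3wh": "student"}

instructor_count = count_instructor_replies(sample_log, roles)
print(instructor_count)
```

With the real data, `roles` could be built from the user lookups collected in the cell above, for instance `{uid: info[0]['role'] for uid, info in instructors.items()}`.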


In [62]:
instructors

{'jzm53injueh49j': [{'role': 'ta',
   'name': 'Aviva Prins',
   'endorser': {},
   'admin': True,
   'photo': None,
   'id': 'jzm53injueh49j',
   'photo_url': None,
   'us': False,
   'facebook_id': None}],
 'iv9hhjrk2iv2w7': [{'role': 'professor',
   'name': 'John Dickerson',
   'endorser': {},
   'admin': True,
   'photo': None,
   'id': 'iv9hhjrk2iv2w7',
   'photo_url': None,
   'us': False,
   'facebook_id': None}],
 'ikq4ua6j9dp38z': [{'role': 'student',
   'profile': {'academics': {'major': 'Computer Science',
     'minor': '',
     'grad_month': '6',
     'program': 'Undergraduate',
     'grad_year': '2020'},
    'all_classes': {'ikgz27xjcp04i4': {'id': 'ikgz27xjcp04i4',
      'category': '',
      'num': 'MATH 115',
      'term': 'Spring 2016',
      'subject': '',
      'name': 'PreCalculus',
      'job': '',
      'difficulty': '',
      'type': 'open',
      'instructors': [{'id': 'ikh3s2ret4166u', 'name': 'Raluca Rosca'}],
      'school_id': 'ghzif8rpvM6',
      'is_top': T

## Exploratory Data Analysis and Visualization

## Analysis, Hypothesis Testing and ML

## Insight and Policy Decision
