# Digging Into Discussion Data

A teacher might use Canvas discussion data to assign a participation grade, or to identify less engaged students and target interventions to them.

## Getting Started

Initialize access to the Canvas API and set up a helper function.

In [1]:
from canvasapi import Canvas, exceptions
import os
import pprint
from datetime import datetime
from dotenv import load_dotenv
load_dotenv()

API_URL = os.getenv("CANVAS_BASE_URL")
API_KEY = os.getenv("CANVAS_ACCESS_TOKEN")
OUTPUT_PATH = os.getenv("LOGIN_ACCESS_DATA_OUTPUT_PATH")

canvas = Canvas(API_URL, API_KEY)
accounts = canvas.get_accounts()

def flatten(pages):
    flat = []
    for p in pages:
        for p2 in p:
            flat.append(p2)
    return flat

Get all courses so that we can look through them for discussion topics.

In [2]:
courses = accounts[0].get_courses()
pprint.pprint([(c.name, c.id) for c in courses])

[('English I', 14), ('Algebra I', 16)]


## Discussion Topics

The response is a nested set of pages, one paginated list per course; it is easier to work with once it is "flattened" from a two-dimensional array into a one-dimensional array.

In [3]:
topics = [c.get_discussion_topics() for c in courses]

flat_topics = flatten(topics)
pprint.pprint(flat_topics[0])

DiscussionTopic(_requester=<canvasapi.requester.Requester object at 0x05482F10>, id=3, title=ಠ_ಠ, last_reply_at=2020-08-27T13:15:53Z, last_reply_at_date=2020-08-27 13:15:53+00:00, created_at=2020-08-27T13:15:53Z, created_at_date=2020-08-27 13:15:53+00:00, delayed_post_at=None, posted_at=2020-08-27T13:15:53Z, posted_at_date=2020-08-27 13:15:53+00:00, assignment_id=None, root_topic_id=None, position=None, podcast_has_student_posts=False, discussion_type=side_comment, lock_at=None, allow_rating=False, only_graders_can_rate=False, sort_by_rating=False, is_section_specific=False, user_name=Kyle Hughes, discussion_subentry_count=0, permissions={'attach': True, 'update': True, 'reply': True, 'delete': True}, require_initial_post=None, user_can_see_posts=True, podcast_url=None, read_state=unread, unread_count=0, subscribed=False, attachments=[], published=True, can_unpublish=True, locked=False, can_lock=True, comments_disabled=False, author={'id': 14, 'display_name': 'Kyle Hughes', 'avatar_ima
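
As an aside, the flattening step can be tried on plain lists, independent of Canvas. The sketch below uses `itertools.chain.from_iterable` from the standard library as an alternative to the nested loops in `flatten`. The name `flatten_chain` and the nested list `pages` are made up for illustration and stand in for the paginated results.

```python
from itertools import chain

# Hypothetical stand-in for per-course pages of topics (not real Canvas data).
pages = [["topic A", "topic B"], [], ["topic C"]]


def flatten_chain(pages):
    """Flatten one level of nesting, like the flatten() helper above."""
    return list(chain.from_iterable(pages))


flat = flatten_chain(pages)
print(flat)  # ['topic A', 'topic B', 'topic C']
```

Either version should behave the same for this notebook; `chain.from_iterable` simply avoids writing the two loops by hand.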

Restructure the above responses into a list of dictionaries that contains only the properties we care about.

In [4]:
topic_data = [{'title': t.title, 'date': t.created_at_date, 'author_name': t.user_name, 'author_id': t.author['id'], 'message': t.message} for t in flat_topics]
for t in flat_topics:
    pprint.pprint(t.title)

'ಠ_ಠ'
'My Second Discussion Topic'
'Hi everyone!'

## Pandas: Word Counts

This section introduces Pandas. Use it to count the spaces in each message as a rough estimate of the number of words. Then remove the message column from the new Pandas data frame so that the message is not included in further output.

In [5]:
import pandas as pd
df = pd.DataFrame(data=topic_data)

df['word_count'] = df['message'].str.count(' ') + 1

df = df.drop(['message'], axis=1)

pprint.pprint(df)

                        title                      date  author_name  \
0                         ಠ_ಠ 2020-08-27 13:15:53+00:00  Kyle Hughes   
1  My Second Discussion Topic 2020-08-27 13:12:28+00:00  Mary Archer   
2                Hi everyone! 2020-08-27 13:11:38+00:00  Mary Archer   

   author_id  word_count  
0         14           4  
1         13          99  
2         13         111  
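
Counting spaces is only a rough estimate: consecutive spaces are each counted, and an empty message still scores one word. A minimal sketch of one alternative, assuming whitespace-separated tokens are a good enough definition of "word", is to split each message and count the pieces. The sample messages below are invented for illustration.

```python
import pandas as pd

# Invented sample messages (not taken from the Canvas data above).
sample = pd.DataFrame({"message": ["one two three", "two  spaces", ""]})

# Approach used above: number of spaces plus one.
sample["space_count"] = sample["message"].str.count(" ") + 1

# Alternative: split on any run of whitespace and count the pieces.
sample["split_count"] = sample["message"].str.split().str.len()

print(sample[["space_count", "split_count"]])
```

The two columns agree on the first message and differ on the other two, which is where the space-counting estimate drifts.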


## Pandas: Write CSV File

Write the data frame to a CSV file.

In [6]:
df.to_csv(f"{OUTPUT_PATH}/discussions.csv")

Open and read the file to confirm that it worked.

In [7]:
# Read the file back as UTF-8 (the default encoding used by to_csv).
with open(f"{OUTPUT_PATH}/discussions.csv", 'r', encoding='utf-8') as f:
    print(f.read())

,title,date,author_name,author_id,word_count
0,ಠ_ಠ,2020-08-27 13:15:53+00:00,Kyle Hughes,14,4
1,My Second Discussion Topic,2020-08-27 13:12:28+00:00,Mary Archer,13,99
2,Hi everyone!,2020-08-27 13:11:38+00:00,Mary Archer,13,111
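
A minimal sketch of a CSV round trip, assuming UTF-8 is wanted on both ends. It passes `index=False` to drop the unnamed index column and names the encoding explicitly when writing and reading, so that non-ASCII titles such as `ಠ_ಠ` do not depend on the platform default. The temporary directory and the one-row frame `demo` are illustrative only.

```python
import os
import tempfile

import pandas as pd

# Illustrative one-row frame; the notebook itself would use df from above.
demo = pd.DataFrame({"title": ["ಠ_ಠ"], "author_id": [14], "word_count": [4]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "discussions.csv")

    # index=False omits the row-number column; the encoding is stated explicitly.
    demo.to_csv(csv_path, index=False, encoding="utf-8")

    # Read the file back with the same encoding to confirm the round trip.
    round_trip = pd.read_csv(csv_path, encoding="utf-8")

print(round_trip)
```

Reading with `pd.read_csv` rather than a bare `open()` also returns a data frame, which is usually the more convenient form for any follow-up analysis.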



## What's Next?

* The messages contain HTML, which artificially inflates the word count. Look for a module, or just write a regex, to strip out HTML before doing the word count.
* Are there any other elements in the message that should be "normalized" away before counting?
* Did the discussion topic elicit replies?
* And were those replies from someone other than the original author?
* For each topic, should we list the replies along with their word counts?
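
As a starting point for the first item, here is a minimal sketch that strips HTML with the standard library's `html.parser` instead of a regex, then counts whitespace-separated words. The names `strip_html`, `word_count`, and `BLOCK_TAGS` are hypothetical, the choice of block-level tags is an assumption, and the sample message is invented.

```python
from html.parser import HTMLParser

# Assumed set of tags that should act as word boundaries.
BLOCK_TAGS = {"p", "div", "br", "li", "ul", "ol", "h1", "h2", "h3", "blockquote"}


class _TextExtractor(HTMLParser):
    """Collect text nodes, inserting a space at block-level tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(message):
    """Return the text of an HTML message with the tags removed."""
    parser = _TextExtractor()
    parser.feed(message or "")
    parser.close()
    return "".join(parser.parts)


def word_count(message):
    """Count whitespace-separated words after removing HTML."""
    return len(strip_html(message).split())


sample = "<p>Hello <strong>everyone</strong>!</p><p>Welcome&nbsp;to the course.</p>"
print(word_count(sample))  # 6
```

In the notebook this could be applied with `df['message'].apply(word_count)` before the message column is dropped. Inline tags such as `<strong>` do not split words here, while paragraph boundaries do.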