# Courses Demo
This Jupyter notebook explores the data set courses20-21.json,
which consists of all Brandeis courses in the 20-21 academic year (Fall20, Spr21, Sum21)
that had at least 1 student enrolled.

First we need to read the JSON file into a list of Python dictionaries.

In [12]:
import json
import statistics

In [13]:
with open("courses20-21.json","r",encoding='utf-8') as jsonfile:
    courses = json.load(jsonfile)

## Structure of a course
Next we look at the fields of each course dictionary and their values.

In [14]:
print('there are',len(courses),'courses in the dataset')
print('here is the data for course 1246')
courses[1246]

there are 7813 courses in the dataset
here is the data for course 1246


{'limit': 28,
 'times': [{'start': 1080, 'end': 1170, 'days': ['w', 'm']}],
 'enrolled': 4,
 'details': 'Instruction for this course will be offered remotely. Meeting times for this course are listed in the schedule of classes (in ET).',
 'type': 'section',
 'status_text': 'Open',
 'section': '1',
 'waiting': 0,
 'instructor': ['An', 'Huang', 'anhuang@brandeis.edu'],
 'coinstructors': [],
 'code': ['MATH', '223A'],
 'subject': 'MATH',
 'coursenum': '223A',
 'name': 'Lie Algebras: Representation Theory',
 'independent_study': False,
 'term': '1203',
 'description': "Theorems of Engel and Lie. Semisimple Lie algebras, Cartan's criterion. Universal enveloping algebras, PBW theorem, Serre's construction. Representation theory. Other topics as time permits. Usually offered every second year.\nAn Huang"}
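
The start and end values in the times field look like minutes after midnight. That unit is an assumption, since the data set does not document it. Under that reading, the record above meets from 18:00 to 19:30. The helper name format_minutes below is hypothetical and only illustrates the conversion.

```python
def format_minutes(minutes):
    """Format minutes after midnight as HH:MM (assumed unit, not documented in the data)."""
    hours, mins = divmod(minutes, 60)
    return f"{hours:02d}:{mins:02d}"

# meeting time copied from the MATH 223A record shown above
meeting = {'start': 1080, 'end': 1170, 'days': ['w', 'm']}
print(format_minutes(meeting['start']), '-', format_minutes(meeting['end']), meeting['days'])
```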

## Cleaning the data
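Lists can be compared and sorted, but they cannot be placed in a set or used as dictionary keys. If we want to count distinct instructors or group courses by instructor or by code, we need to replace the lists with tuples (which are immutable, hashable sequences).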

In [15]:
for course in courses:
    course['instructor'] = tuple(course['instructor'])
    course['coinstructors'] = tuple(tuple(f) for f in course['coinstructors'])
    course['code'] = tuple(course['code'])
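
A minimal sketch of what the conversion buys us: a list cannot be an element of a set (or a dictionary key) because it is not hashable, while the equivalent tuple can. The variable names here are illustrative and not part of the notebook.

```python
instructor_as_list = ['An', 'Huang', 'anhuang@brandeis.edu']
instructor_as_tuple = tuple(instructor_as_list)

# putting a list into a set raises TypeError because lists are unhashable
try:
    {instructor_as_list}
    list_in_set_ok = True
except TypeError:
    list_in_set_ok = False

# tuples are hashable, so duplicates collapse to one set element
unique_instructors = {instructor_as_tuple, tuple(instructor_as_list)}
print(list_in_set_ok, len(unique_instructors))
```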

In [16]:
print('notice that the instructor and code are tuples now')
courses[1123]

notice that the instructor and code are tuples now


{'limit': None,
 'times': [],
 'enrolled': 0,
 'details': '',
 'type': 'section',
 'status_text': 'Open',
 'section': '7',
 'waiting': 0,
 'instructor': ('Marcelle', 'Soares-Santos', 'marcelle@brandeis.edu'),
 'coinstructors': (),
 'code': ('PHYS', '280A'),
 'subject': 'PHYS',
 'coursenum': '280A',
 'name': 'Advanced Readings and Research',
 'independent_study': True,
 'term': '1203',
 'description': 'Specific sections for individual faculty members as requested. Usually offered every year.\nStaff'}

# Exploring the data set
Now we will show how to use plain Python to explore the data set and answer some interesting questions. Next week we will start learning Pandas and NumPy, packages that make it easier to explore large data sets efficiently.

Here are some questions we can try to answer:
* what are all of the subjects of courses (e.g. COSI, MATH, JAPN, PHIL, ...)
* which terms are represented?
* how many instructors taught at Brandeis last year?
* what were the five largest course sections?
* what were the five largest courses (where we combine sections)?
* which are the five largest subjects measured by number of courses offered?
* which are the five largest courses measured by number of students taught?
* which course had the most sections taught in 20-21?
* who are the top five faculty in terms of number of students taught?
* etc.

In [17]:
#Karen
all_subjects = {c['subject'] for c in courses} 
all_terms = {c['term'] for c in courses} #1203 - fall2020, 1211 - spring2021, 1212 - summer2021
num_instructors = len({c['instructor'] for c in courses})
# sort by enrollment, largest first, and keep the first five (a list preserves the ranking)
five_largest_sec = [c['name'] for c in sorted(courses, key=lambda course: -course['enrolled'])[:5]]

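## what were the five largest courses (where we combine sections)?
# add up enrollment over all sections that share the same term and course code
enrolled_by_course = {}
for c in courses:
    key = (c['term'], c['code'])
    enrolled_by_course[key] = enrolled_by_course.get(key, 0) + c['enrolled']
five_largest_courses = sorted(enrolled_by_course.items(), key=lambda item: -item[1])[:5]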
    
# which are the five largest subjects measured by number of courses offered?
# creating a nested for loop to then create a new table that would keep the number of courses for each subject 

## which are the five largest courses measured by number of students taught?
# go through all the courses and then keep track of the total enrolled throughout all the years 
# sort it and then get the name for those courses 

## which course had the most sections taught in 20-21? 
# create a table that stores the len of the sections 
# sort it and get the top one 

## who are the top five faculty in terms of numbers of students taught?  
# create a new table which has total of students that the professor taught 
# sort that table 
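
The plans sketched in the comments above all follow the same pattern: build a table of totals keyed by subject, course code, or instructor, then sort it. One possible way to do that in plain Python is collections.Counter, shown here on a hypothetical three-record sample (sample_courses is made up for illustration and only mimics the fields of the real data).

```python
from collections import Counter

# hypothetical miniature data set with the same fields as `courses`
sample_courses = [
    {'subject': 'COSI', 'code': ('COSI', '10A'), 'section': '1', 'enrolled': 40,
     'instructor': ('A', 'One', 'a@example.edu')},
    {'subject': 'COSI', 'code': ('COSI', '10A'), 'section': '2', 'enrolled': 35,
     'instructor': ('B', 'Two', 'b@example.edu')},
    {'subject': 'MATH', 'code': ('MATH', '15A'), 'section': '1', 'enrolled': 50,
     'instructor': ('A', 'One', 'a@example.edu')},
]

sections_per_subject = Counter()     # subject -> number of sections offered
sections_per_code = Counter()        # course code -> number of sections
students_per_instructor = Counter()  # instructor -> total students taught
for c in sample_courses:
    sections_per_subject[c['subject']] += 1
    sections_per_code[c['code']] += 1
    students_per_instructor[c['instructor']] += c['enrolled']

# most_common(n) returns the n largest entries, already sorted
print(sections_per_subject.most_common(5))
print(sections_per_code.most_common(1))
print(students_per_instructor.most_common(5))
```

Running the same loop over courses instead of sample_courses should answer the corresponding questions for the full data set, provided the cleaning step has already turned code and instructor into tuples.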

# PA01 - Python Data Analysis I
Now we will show how to use plain Python to explore the data set and answer some interesting questions. Next week we will start learning Pandas and NumPy, packages that make it easier to explore large data sets efficiently.

Here are some questions we can try to answer:
* how many faculty taught COSI courses last year?
* what was the total number of students taking COSI courses last year?
* what was the median size of a COSI course last year (counting only those courses with at least 10 students)?
* create a list of tuples (E,S) where S is a subject and E is the number of students enrolled in courses in that subject, sort it and print the top 10. This shows the top 10 subjects in terms of number of students taught.
* do the same as in the previous question, but print the top 10 subjects in terms of number of courses offered
* do the same again, but print the top 10 subjects in terms of number of faculty teaching courses in that subject
* list the top 20 faculty in terms of number of students they taught
* list the top 20 courses in terms of number of students taking that course (where you combine different sections and semesters, i.e. just use the subject and course number)
* Create your own interesting question (each team member creates their own) and use Python to answer that question.

In [None]:
#Gillian
num_cosi_instructors = len({c['instructor'] for c in courses if (c['subject'] == 'COSI')})
print(num_cosi_instructors)

27


In [19]:
#Gillian
num_cosi_students = sum([e['enrolled'] for e in courses if e['subject'] == 'COSI'])
print(num_cosi_students)

2223


In [20]:
# Gillian
# note: '> 10' leaves out sections with exactly 10 students; '>= 10' would match "at least 10"
median = statistics.median(e['enrolled'] for e in courses if (e['subject'] == 'COSI') and (e['enrolled'] > 10))
print(median)

38.0


In [24]:
# Gillian
subject_list = list(all_subjects)
enrolled_by_subject = []
es_list = []
for s in subject_list:
    enrolled_by_subject.append(sum([e['enrolled'] for e in courses if e['subject'] == s]))
for i in range(len(subject_list)):
    es_list.append((enrolled_by_subject[i], subject_list[i]))
es_list.sort(reverse = True)
print(es_list[:10])

[(5318, 'HS'), (3085, 'BIOL'), (2766, 'BUS'), (2734, 'HWL'), (2322, 'CHEM'), (2315, 'ECON'), (2223, 'COSI'), (1785, 'MATH'), (1704, 'PSYC'), (1144, 'ANTH')]


In [26]:
# Gillian
codes_by_subject = []
cs_list = []
for s in subject_list:
    codes_by_subject.append(len([c['code'] for c in courses if c['subject'] == s]))
for i in range(len(subject_list)):
    cs_list.append((codes_by_subject[i], subject_list[i]))
cs_list.sort(reverse = True)
print(cs_list[:10])

[(613, 'BIOL'), (498, 'HIST'), (417, 'PSYC'), (403, 'NEUR'), (296, 'BCHM'), (288, 'PHYS'), (274, 'HS'), (272, 'COSI'), (266, 'MUS'), (265, 'ENG')]
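
The counts above are counts of sections, since every record in the data set is one section. If "courses offered" is read as distinct course codes instead, one option is to collect the codes for each subject in a set, and the same idea gives the number of distinct faculty per subject. This is a sketch on a hypothetical four-record sample (sample_courses is made up for illustration), not the assignment's official solution.

```python
from collections import defaultdict

# hypothetical miniature data set with the same fields as `courses`
sample_courses = [
    {'subject': 'COSI', 'code': ('COSI', '10A'), 'instructor': ('A', 'One', 'a@example.edu')},
    {'subject': 'COSI', 'code': ('COSI', '10A'), 'instructor': ('B', 'Two', 'b@example.edu')},
    {'subject': 'COSI', 'code': ('COSI', '21A'), 'instructor': ('A', 'One', 'a@example.edu')},
    {'subject': 'MATH', 'code': ('MATH', '15A'), 'instructor': ('C', 'Three', 'c@example.edu')},
]

codes_per_subject = defaultdict(set)    # subject -> set of distinct course codes
faculty_per_subject = defaultdict(set)  # subject -> set of distinct instructors
for c in sample_courses:
    codes_per_subject[c['subject']].add(c['code'])
    faculty_per_subject[c['subject']].add(c['instructor'])

# build (count, subject) tuples and sort with the largest count first
distinct_course_counts = sorted(((len(codes), s) for s, codes in codes_per_subject.items()), reverse=True)
faculty_counts = sorted(((len(people), s) for s, people in faculty_per_subject.items()), reverse=True)
print(distinct_course_counts[:10])
print(faculty_counts[:10])
```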


In [None]:
#Anjola