## Get Data from API

Create a text file, structured as JSON, for each semester of course data retrieved from the [API](https://github.com/ScottyLabs/course-api).

Also create a text file, `fileList.txt`, that lists the unique identifier of each created file.

In [None]:
import cmu_course_api, json

semesters = ["M1", "M2", "F", "S"]

# Load the identifiers saved by a previous run, if there are any.
try:
    with open("fileList.txt", "r") as files:
        fileList = files.read()
except FileNotFoundError:
    fileList = ''

for semester in semesters:
    data = cmu_course_api.get_course_data(semester)
    actualSemester = data["semester"].replace('/', " ")
    if fileList == "":
        fileList = actualSemester
    else:
        fileList = fileList + "," + actualSemester
    with open(f"course_data_{actualSemester}.txt", "w") as f:
        f.write(json.dumps(data))

with open("fileList.txt", "w") as f:
    f.write(fileList)
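Because the cell appends to whatever `fileList.txt` already holds, running it twice records each semester twice. A minimal sketch of one way to avoid that is below. The helper name `merge_identifiers` and the identifiers `"A"`, `"B"`, `"C"` are hypothetical placeholders, not part of the original notebook or the API.

```python
def merge_identifiers(existing, new_ids):
    """Return a comma-separated string of identifiers without duplicates.

    `existing` is the current content of fileList.txt (possibly empty);
    `new_ids` is a list of identifiers from the current run.
    Order of first appearance is preserved.
    """
    ids = [s for s in existing.split(",") if s]
    for ident in new_ids:
        if ident not in ids:
            ids.append(ident)
    return ",".join(ids)


# First run: nothing stored yet.
first = merge_identifiers("", ["A", "B"])
# Second run: "B" is already recorded, so only "C" is added.
second = merge_identifiers(first, ["B", "C"])
print(second)
```

The result can be written to `fileList.txt` in place of the string built inside the loop.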

## Parse Data from API

Create a pandas dataframe that counts how many times each word appears in the course descriptions of each department. Departments are grouped under a college. The dataframe is then stored in a CSV file.
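The cell below decides what counts as a word by stripping punctuation, digits and whitespace from both ends of each whitespace-separated chunk and keeping only chunks that are purely alphabetic. The sketch below isolates that rule so its effect is easier to see. The function name `tokenize` and the sample sentence are illustrative assumptions.

```python
import string

# Same character set the notebook strips from both ends of each chunk.
strip_chars = string.punctuation + string.digits + string.whitespace


def tokenize(desc):
    """Split a description into lowercase, purely alphabetic words."""
    words = []
    for chunk in desc.split():
        chunk = chunk.strip(strip_chars)
        # Chunks with punctuation or digits left in the middle are dropped.
        if chunk.isalpha():
            words.append(chunk.lower())
    return words


sample = "Intro to C++ (15-112): object-oriented design, e.g. classes."
print(tokenize(sample))
# → ['intro', 'to', 'c', 'design', 'classes']
```

Note that hyphenated terms such as "object-oriented" and abbreviations such as "e.g." are discarded entirely rather than split, because only the ends of each chunk are stripped.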

In [1]:
import pandas as pd
import string, json

with open("fileList.txt") as f:
    fileList = f.read().split(',')

stripChar = string.punctuation + string.digits + string.whitespace
colleges = ['Dietrich', 'MCS', 'SCS', 'CFA', 'CIT', 'Tepper', 'Heinz', 'Other']
Dietrich = ['36', '65', '66', '67', '73', '76', '79', '80', '82', '84',
            '85', '88',]
MCS = ['03', '09', '21', '33', '38']
SCS = ['02', '04', '05', '07', '08', '10', '11', '14', '15', '16', '17']
CFA = ['48', '51', '54', '57', '60', '62']
CIT = ['06', '12', '18', '19', '24', '27', '39', '42']
Tepper = ['45', '70']
Heinz = ['90', '91', '92', '93', '94', '95']

wordCount = dict()
deptSeen = set()
classSeen = set()

for file in fileList:
    with open(f"course_data_{file}.txt", "r") as f:
        data = json.load(f)
    classes = data["courses"]
    for aClass in classes:
        if aClass in classSeen:
            continue
        else:
            classSeen.add(aClass)
        
        desc = classes[aClass]["desc"]
        if isinstance(desc, str):
            dept = aClass[:2]
            if dept in Dietrich:
                colTuple = ("Dietrich", dept)
            elif dept in MCS:
                colTuple = ("MCS", dept)
            elif dept in CIT:
                colTuple = ("CIT", dept)
            elif dept in SCS:
                colTuple = ("SCS", dept)
            elif dept in CFA:
                colTuple = ("CFA", dept)
            elif dept in Tepper:
                colTuple = ("Tepper", dept)
            elif dept in Heinz:
                colTuple = ('Heinz', dept)
            else:
                colTuple = ('Other', dept)

            for charBunch in desc.split():
                charBunch = charBunch.strip(stripChar)
                if charBunch.isalpha():
                    charBunch = charBunch.lower()
                    if dept not in deptSeen:
                        wordCount[colTuple] = dict()
                        wordCount[colTuple][charBunch] = 1
                        deptSeen.add(dept)
                    elif charBunch in wordCount[colTuple]:
                        wordCount[colTuple][charBunch] += 1
                    else:
                        wordCount[colTuple][charBunch] = 1

df = pd.DataFrame.from_dict(wordCount, dtype=int)
df.columns = df.columns.rename(('College', 'Dept'))
df = df.reindex(sorted(df.columns), axis=1)

df.head()

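| College | CFA | CFA | CFA | CFA | CFA | CFA | CIT | CIT | CIT | CIT | ... | SCS | SCS | SCS | SCS | SCS | SCS | SCS | SCS | Tepper | Tepper |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dept | 48 | 51 | 54 | 57 | 60 | 62 | 06 | 12 | 18 | 19 | ... | 07 | 08 | 10 | 11 | 14 | 15 | 16 | 17 | 45 | 70 |
| no | 7.0 | 1.0 | 9.0 | 21.0 | 7.0 | 6.0 | 2.0 | 4.0 | 10.0 | 12.0 | ... | NaN | NaN | 1.0 | 9.0 | 5.0 | 10.0 | 3.0 | 9.0 | 7.0 | 4.0 |
| course | 197.0 | 172.0 | 350.0 | 272.0 | 97.0 | 168.0 | 42.0 | 81.0 | 322.0 | 157.0 | ... | 8.0 | 1.0 | 75.0 | 133.0 | 81.0 | 232.0 | 126.0 | 261.0 | 208.0 | 143.0 |
| description | 9.0 | NaN | 23.0 | 16.0 | NaN | 1.0 | 5.0 | NaN | 9.0 | NaN | ... | NaN | NaN | NaN | 5.0 | NaN | 10.0 | 4.0 | 6.0 | 14.0 | 4.0 |
| provided | 6.0 | 1.0 | 6.0 | 11.0 | 4.0 | 13.0 | 1.0 | 1.0 | 6.0 | 2.0 | ... | NaN | NaN | NaN | 6.0 | 2.0 | 5.0 | 3.0 | 6.0 | 6.0 | NaN |
| none | 8.0 | 3.0 | 17.0 | 29.0 | NaN | 5.0 | 4.0 | 3.0 | 4.0 | 2.0 | ... | NaN | NaN | NaN | 1.0 | 1.0 | 10.0 | NaN | 4.0 | 3.0 | NaN |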
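The cell above ends at `df.head()` and does not show the CSV step mentioned in the description. A minimal sketch of what that step could look like follows, using a tiny made-up dataframe and an in-memory buffer in place of a real file. The toy counts are assumptions for illustration only.

```python
import io

import pandas as pd

# Toy stand-in for the notebook's wordCount dictionary (made-up numbers).
counts = {
    ("SCS", "15"): {"course": 3, "programming": 2},
    ("CFA", "60"): {"course": 1, "studio": 4},
}
df = pd.DataFrame.from_dict(counts)
df.columns.names = ["College", "Dept"]
df = df.reindex(sorted(df.columns), axis=1)

# Write to CSV; a path such as "word_counts.csv" would work the same way.
buffer = io.StringIO()
df.to_csv(buffer)

# Read it back: the two column levels occupy the first two header rows,
# and the words are in the first column.
buffer.seek(0)
restored = pd.read_csv(buffer, header=[0, 1], index_col=0)
print(restored)
```

Passing `header=[0, 1]` matters on the way back in; without it the `Dept` row would be read as data.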


## Parse Data Further

For each college, remove the words that never appear in any of its departments' descriptions, sum the word counts across all of its departments, and sort the totals in descending order.

Then remove all words shorter than four characters and store the result in a dictionary keyed by college.

In [2]:
colleges = df.columns.levels[0]
wordCount = dict()
for college in colleges:
    collegeWordCount = df[college].dropna(how="all").sum(axis=1).sort_values(ascending=False)
    wordList = []
    for word in collegeWordCount.keys():
        if len(word) > 3:
            wordList.append(word)
    wordCount[college] = collegeWordCount[wordList]
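To make the chain of `dropna`, `sum` and `sort_values` concrete, here is the same sequence applied to a small made-up dataframe with the same two-level column layout. The words and counts are illustrative assumptions.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_tuples(
    [("CFA", "60"), ("SCS", "15"), ("SCS", "17")], names=["College", "Dept"]
)
df = pd.DataFrame(
    [
        [2.0, 5.0, 1.0],        # course
        [4.0, np.nan, np.nan],  # studio (absent from SCS)
        [np.nan, 3.0, 4.0],     # software
        [1.0, 2.0, np.nan],     # the
    ],
    index=["course", "studio", "software", "the"],
    columns=columns,
)

# Select one college, drop words it never uses, total across departments,
# and sort from most to least frequent.
totals = df["SCS"].dropna(how="all").sum(axis=1).sort_values(ascending=False)

# Keep only words of four or more characters.
long_words = [word for word in totals.index if len(word) > 3]
result = totals[long_words]
print(result)
```

Here "studio" is dropped because SCS has no count for it, and "the" is dropped by the length filter, leaving "software" (7.0) ahead of "course" (6.0).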

## Explore Data

All the college word counts are stored in the dictionary `wordCount`.
College names are: 'CFA', 'CIT', 'Dietrich', 'Heinz', 'MCS', 'Other', 'SCS', 'Tepper'
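Since each value in the dictionary is a pandas Series indexed by word, the colleges can also be lined up side by side to compare a single word across them. The sketch below assumes that structure and uses made-up counts rather than the real data.

```python
import pandas as pd

# Toy stand-in for the wordCount dictionary (made-up numbers).
wordCount = {
    "CFA": pd.Series({"design": 5.0, "music": 3.0}),
    "SCS": pd.Series({"design": 2.0, "software": 4.0}),
}

# One column per college; words missing from a college get a count of 0.
comparison = pd.DataFrame(wordCount).fillna(0)
print(comparison.loc["design"])
```

A word that a college filtered out or never used shows up as 0 in that college's column.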

In [3]:
for college in wordCount:
    print(f"{college}: ", end='\n-----------------\n')
    print(wordCount[college].head(50), end='\n\n')

CFA: 
-----------------
will            1354.0
course          1256.0
students        1089.0
this            1043.0
design           795.0
with             788.0
that             477.0
music            453.0
class            428.0
work             364.0
their            349.0
from             343.0
through          334.0
techniques       241.0
project          239.0
majors           234.0
projects         229.0
studio           225.0
performance      214.0
skills           214.0
student          211.0
required         201.0
semester         194.0
process          191.0
learn            187.0
research         178.0
development      177.0
permission       176.0
also             173.0
which            171.0
first            170.0
instructor       169.0
explore          168.0
practice         166.0
develop          156.0
each             154.0
these            153.0
production       151.0
have             149.0
basic            149.0
include          148.0
building         147.0
registrati