## Get Data from API

Create one text file per semester of course data retrieved from the [API](https://github.com/ScottyLabs/course-api).
Each file stores the data as JSON.

Also create a text file, `fileList.txt`, that lists the semester labels identifying the created files.

In [114]:
import json

import cmu_course_api

# Semester codes passed to the API.
semesters = ["M1", "M2", "F", "S"]
fileList = []

for semester in semesters:
    data = cmu_course_api.get_course_data(semester)
    # The semester label may contain '/', which cannot appear in a file name.
    actualSemester = data["semester"].replace('/', " ")
    fileList.append(actualSemester)
    with open(f"course_data_{actualSemester}.txt", "w") as f:
        json.dump(data, f)

# Record the labels, comma-separated, so later cells can locate the files.
with open("fileList.txt", "w") as f:
    f.write(",".join(fileList))


Requesting the HTML page from the network...
Done.
Fixing errors on page...
Done.
Finding table rows on page...
Done.
Parsing rows...
Done.
running on 4 threads
[377/377] Getting description for 47998...


FileNotFoundError: [Errno 2] No such file or directory: 'course_data/course_data_Summer One All 2019.txt'

## Parse Data from API

Create a pandas DataFrame that counts how many times each word appears in the course descriptions of each department. Columns are grouped by college, with departments as a sub-level under each college. The DataFrame is then stored in a CSV file.
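
The counting step for a single description can be sketched on its own. This is a minimal illustration using `collections.Counter` rather than the nested dictionaries in the cell below; the helper name `count_words` and the sample text are hypothetical.

```python
import string
from collections import Counter

# Characters stripped from both ends of each whitespace-separated token.
STRIP_CHARS = string.punctuation + string.digits + string.whitespace


def count_words(description):
    """Count lowercase, purely alphabetic words in one course description."""
    counts = Counter()
    for token in description.split():
        token = token.strip(STRIP_CHARS)
        # Tokens that are empty or still contain non-letters are skipped.
        if token.isalpha():
            counts[token.lower()] += 1
    return counts


# Hypothetical description text, for illustration only.
sample = "This course covers data structures. This course has 2 exams."
word_counts = count_words(sample)
print(word_counts.most_common(2))
```

Because only leading and trailing characters are stripped, a token with internal punctuation such as `hands-on` fails the `isalpha()` check and is not counted.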

In [115]:
import pandas as pd
import string, json

with open("fileList.txt") as f:
    fileList = f.read().split(',')

stripChar = string.punctuation + string.digits + string.whitespace
colleges = ['Dietrich', 'MCS', 'SCS', 'CFA', 'CIT', 'Tepper', 'Heinz', 'Other']
Dietrich = ['36', '65', '66', '67', '73', '76', '79', '80', '82', '84',
            '85', '88',]
MCS = ['03', '09', '21', '33', '38']
SCS = ['02', '04', '05', '07', '08', '10', '11', '14', '15', '16', '17']
CFA = ['48', '51', '54', '57', '60', '62']
CIT = ['06', '12', '18', '19', '24', '27', '39', '42']
Tepper = ['45', '70']
Heinz = ['90', '91', '92', '93', '94', '95']

wordCount = dict()
wordSeenInDept = dict()
deptSeen = set()
for file in fileList:
    with open(f"course_data_{file}.txt", "r") as f:
        data = json.load(f)
    classes = data["courses"]
    for aClass in classes:
        desc = classes[aClass]["desc"]
        if isinstance(desc, str):
            dept = aClass[:2]
            deptSeen.add(dept)
            if dept in Dietrich:
                colTuple = ("Dietrich", dept)
            elif dept in MCS:
                colTuple = ("MCS", dept)
            elif dept in CIT:
                colTuple = ("CIT", dept)
            elif dept in SCS:
                colTuple = ("SCS", dept)
            elif dept in CFA:
                colTuple = ("CFA", dept)
            elif dept in Tepper:
                colTuple = ("Tepper", dept)
            elif dept in Heinz:
                colTuple = ('Heinz', dept)
            else:
                colTuple = ('Other', dept)

            for charBunch in desc.split():
                charBunch = charBunch.strip(stripChar)
                if charBunch.isalpha():
                    charBunch = charBunch.lower()
                    if dept not in wordSeenInDept:
                        wordSeenInDept[dept] = {charBunch}
                        wordCount[colTuple] = dict()
                        wordCount[colTuple][charBunch] = 1
                    elif charBunch in wordSeenInDept[dept]:
                        wordCount[colTuple][charBunch] += 1
                    else:
                        wordSeenInDept[dept].add(charBunch)
                        wordCount[colTuple][charBunch] = 1

df = pd.DataFrame.from_dict(wordCount, dtype=int)
df.columns = df.columns.rename(('College', 'Dept'))
df = df.reindex(sorted(df.columns), axis=1)

df.head()

College,CFA,CFA,CFA,CFA,CFA,CFA,CIT,CIT,CIT,CIT,...,SCS,SCS,SCS,SCS,SCS,SCS,SCS,SCS,Tepper,Tepper
Dept,48,51,54,57,60,62,06,12,18,19,...,07,08,10,11,14,15,16,17,45,70
none,12.0,4.0,15.0,43.0,,5.0,12.0,3.0,8.0,3.0,...,,,,6.0,1.0,16.0,,4.0,6.0,1.0
this,145.0,179.0,313.0,319.0,121.0,128.0,25.0,58.0,254.0,144.0,...,3.0,,68.0,110.0,77.0,225.0,80.0,265.0,152.0,137.0
course,216.0,195.0,394.0,327.0,119.0,199.0,40.0,95.0,392.0,197.0,...,8.0,1.0,105.0,191.0,102.0,323.0,128.0,339.0,272.0,232.0
number,3.0,,2.0,3.0,3.0,3.0,8.0,3.0,11.0,7.0,...,,,5.0,22.0,1.0,5.0,,2.0,1.0,7.0
is,151.0,141.0,314.0,395.0,98.0,217.0,30.0,57.0,264.0,125.0,...,10.0,1.0,74.0,128.0,97.0,224.0,81.0,183.0,204.0,156.0
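
A minimal sketch of writing a DataFrame with two-level (college, department) columns to CSV and reading it back. The toy counts are hypothetical, and an in-memory buffer stands in for a real file, so nothing here reflects the notebook's actual output path.

```python
import io

import pandas as pd

# Hypothetical counts keyed by (college, department) tuples.
toy_counts = {
    ("SCS", "15"): {"course": 3, "algorithms": 2},
    ("CFA", "57"): {"course": 1, "music": 4},
}

# Tuple keys become a two-level column index.
toy_df = pd.DataFrame.from_dict(toy_counts)
toy_df.columns.names = ["College", "Dept"]

# Write to an in-memory buffer; passing a file path to to_csv works the same way.
buffer = io.StringIO()
toy_df.to_csv(buffer)
buffer.seek(0)

# header=[0, 1] tells pandas that the first two rows hold the column levels.
restored = pd.read_csv(buffer, header=[0, 1], index_col=0)
print(restored)
```

Reading with `header=[0, 1]` is what restores the college and department levels; with the default single header row, the department row would be parsed as data.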


## Parse Data Further

For each college, drop the words that never appear in any of its departments' descriptions, sum the word counts across its departments, and sort the totals in descending order.

Then remove all words shorter than four characters and store the result in a dictionary keyed by college.

In [116]:
colleges = df.columns.levels[0]
wordCount = dict()
for college in colleges:
    collegeWordCount = df[college].dropna(how="all").sum(axis=1).sort_values(ascending=False)
    wordList = []
    for word in collegeWordCount.keys():
        if len(word) > 3:
            wordList.append(word)
    wordCount[college] = collegeWordCount[wordList]
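
The length filter can also be written as a boolean mask on the index instead of an explicit loop. A minimal sketch on hypothetical toy data for a single college:

```python
import pandas as pd

# Hypothetical per-department counts for one college; None marks a word
# that never appears in that department.
toy_college = pd.DataFrame(
    {"15": [5.0, 2.0, None], "10": [1.0, None, 4.0]},
    index=["the", "data", "learning"],
)

# Same aggregation as the cell above: drop all-empty rows, sum, sort.
totals = toy_college.dropna(how="all").sum(axis=1).sort_values(ascending=False)

# Keep words longer than three characters via a mask on the index.
long_words = totals[totals.index.str.len() > 3]
print(long_words)
```

Both forms keep the descending order, since filtering does not reorder the Series.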

## Explore Data

All the college word counts are stored in the dictionary `wordCount`, keyed by college name.
College names are: 'CFA', 'CIT', 'Dietrich', 'Heinz', 'MCS', 'Other', 'SCS', 'Tepper'

In [117]:
for college in wordCount:
    print(f"{college}: ", end='\n-----------------\n')
    print(wordCount[college].head(50), end='\n\n')

CFA: 
-----------------
will            1531.0
course          1450.0
students        1332.0
this            1205.0
with             912.0
design           880.0
music            618.0
that             549.0
class            528.0
work             438.0
their            423.0
through          376.0
from             360.0
majors           343.0
project          326.0
required         291.0
student          287.0
performance      268.0
studio           266.0
techniques       265.0
projects         256.0
skills           252.0
semester         249.0
permission       229.0
instructor       224.0
learn            219.0
research         216.0
develop          211.0
process          210.0
development      204.0
faculty          204.0
which            199.0
first            195.0
practice         194.0
also             193.0
registration     190.0
have             188.0
topics           184.0
explore          182.0
production       182.0
basic            182.0
school           180.0
each      