## 020 Retrieving a Sample Learning Curriculum
This notebook extracts the learning objectives from a sample Data Science study program. The source document is the module handbook of the author's own study program.

In [1]:
import pdfplumber
import pandas as pd
from sentence_transformers import SentenceTransformer, util
import pickle
import numpy as np
from scipy import spatial

In [2]:
pdf = pdfplumber.open("Modulhandbuch_FS-MADS-60_Data-Science_SGo_19.09.2019.pdf")
pages = pdf.pages[0:]

In [4]:
long_string = ""
for page in pages:
    long_string += page.extract_text() or ""  #guard against pages without extractable text
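
The page-concatenation step can be sketched as a small reusable helper. This is a minimal sketch, assuming that `extract_text()` may return `None` for a page without extractable text (this depends on the installed pdfplumber version and is worth checking). The names `join_page_text` and `FakePage` are hypothetical and exist only for this illustration.

```python
def join_page_text(pages):
    """Concatenate the text of page-like objects, treating missing text as empty."""
    # Each page only needs an extract_text() method, as pdfplumber pages have.
    return "".join(page.extract_text() or "" for page in pages)


class FakePage:
    """Hypothetical stand-in for a pdfplumber page, used only in this demo."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


# Invented pages: the middle one has no extractable text.
demo_pages = [FakePage("Module Title:  A\n"), FakePage(None), FakePage("end")]
demo_text = join_page_text(demo_pages)
```

With a real handbook, the call would look like `long_string = join_page_text(pdf.pages)`.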

## Data Extraction from PDF
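The function below uses simple string splitting to build a DataFrame of learning objectives from the handbook PDF: the text is split on module titles, then on the sentence that introduces each list of objectives, and finally on the bullet character.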

In [5]:
def extract_objective_df(pdf):
    df = pd.DataFrame(columns = ['objective','module','study_program'])
    pages = pdf.pages[0:]
    long_string = ""
    for page in pages:
        long_string += page.extract_text() or ""  #guard against pages without extractable text
    for module in long_string.split('Module Title:  '): #getting full text of each module
        course_objectives = module.split('On successful completion of this course, students will be able to: ')
        if len(course_objectives)>1: #avoid empty strings

            #Module title
            title = module.split('\n')[0]

            #Module objectives
            objectives = course_objectives[1]
            objectives = objectives.split('Teaching Methods')[0]
            #drop what appears to be a trailing page-number footer of the form 'NN/78 '
            if objectives[-4:] == '/78 ':
                objectives = objectives[:-6]
            objectives = objectives.replace('\n',"")
            objectives = objectives.replace('. ',".")
            objectives = objectives.split('\uf0a7  ')[1:] #[0] is always empty because of the splitting
            for objective in objectives:
                row_list = [objective,title,"DS_60"]
                df.loc[len(df)] = row_list #because of 0 indexing we can write to new row at len(df)

    return df
df = extract_objective_df(pdf)
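
The splitting logic can be followed on a small example without a PDF. The sketch below applies the same three splits to an invented two-module text; `SAMPLE` is made up for illustration and is not taken from the handbook, and `split_objectives` is a hypothetical name. The footer handling and the `'. '` replacement are left out to keep it short.

```python
MODULE_MARKER = "Module Title:  "
OBJECTIVES_MARKER = "On successful completion of this course, students will be able to: "
BULLET = "\uf0a7  "

# Invented sample that mimics the layout the splitting relies on.
SAMPLE = (
    "Module Title:  Advanced Statistics\n"
    "On successful completion of this course, students will be able to: \n"
    "\uf0a7  utilize Bayesian statistics techniques.\n"
    "\uf0a7  summarize observed data.\n"
    "Teaching Methods\nLectures\n"
    "Module Title:  Seminar\nNo objectives listed here.\n"
)


def split_objectives(text):
    """Return (objective, module_title) pairs found in the text."""
    rows = []
    for module in text.split(MODULE_MARKER):
        parts = module.split(OBJECTIVES_MARKER)
        if len(parts) < 2:
            continue  # no objectives section in this chunk
        title = module.split("\n")[0]
        # Keep only the text before the next section and join the lines.
        block = parts[1].split("Teaching Methods")[0].replace("\n", "")
        # The first element is the text before the first bullet, so skip it.
        for objective in block.split(BULLET)[1:]:
            rows.append((objective, title))
    return rows


rows = split_objectives(SAMPLE)
```

The second module contributes no rows because it lacks the sentence that introduces the objectives.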

In [18]:
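def preprocess_for_bert(string):
    #trim newlines, periods, typographic apostrophes and colons from both ends
    string = string.strip("\n.’:")
    #remove literal backslash-n sequences; strip("\\n") would instead trim any leading or trailing "n"
    string = string.replace("\\n", "")
    string = string.replace("/"," ")
    return string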
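
When cleaning strings this way, it helps to keep in mind that `str.strip` treats its argument as a set of characters, not as a substring. The short illustration below uses invented strings to show the difference between stripping the characters `\` and `n` and removing the literal two-character sequence.

```python
text = "normalize the distribution"

# strip("\\n") removes any leading or trailing backslash or "n" characters,
# not the literal two-character sequence backslash + n.
stripped = text.strip("\\n")

# Removing the literal sequence as a substring leaves ordinary words intact.
literal = "apply regression\\n"
cleaned = literal.replace("\\n", "")
```

Here `stripped` loses the first and last letter of the sentence, while `cleaned` only loses the trailing backslash and `n`.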

## Inferring Clusters
The embedding and clustering pipeline described in the notebook ../model_selection/model_selection_silhouette.ipynb is loaded from disk and used to assign a cluster to each objective in the learning curriculum.

In [19]:
#load the fitted clustering model and close the file afterwards
with open('../Model_Selection/k_31_full', 'rb') as f:
    k31 = pickle.load(f)
model_name = 'all-distilroberta-v1'
model = SentenceTransformer(model_name)
objectives = [preprocess_for_bert(o) for o in df['objective'].to_list()]
embeddings = model.encode(objectives)
#round-trip through a list so the array passed to predict is float64
clusters = k31.predict(np.array(embeddings.tolist()))
df['cluster'] = clusters
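
If the pickled `k31` object behaves like a k-means model (an assumption; the notebook only states that it is a clustering pipeline), `predict` assigns each embedding to the cluster whose centroid is closest. A dependency-free sketch of that assignment rule, with invented 2-D vectors standing in for real sentence embeddings and `nearest_centroid` as a hypothetical helper:

```python
import math


def nearest_centroid(vector, centroids):
    """Return the index of the centroid with the smallest Euclidean distance."""
    distances = [math.dist(vector, centroid) for centroid in centroids]
    return distances.index(min(distances))


# Invented centroids and "embeddings"; real ones have hundreds of dimensions.
demo_centroids = [(0.0, 0.0), (1.0, 1.0), (0.0, 2.0)]
demo_embeddings = [(0.1, -0.1), (0.9, 1.2), (0.2, 1.9)]
demo_labels = [nearest_centroid(e, demo_centroids) for e in demo_embeddings]
```

Each invented vector lies close to a different centroid, so the three labels differ.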



In [21]:
df.head()

| | objective | module | study_program | cluster |
|---|---|---|---|---|
| 0 | understand the fundamental building blocks of ... | Advanced Statistics | DS_60 | 12 |
| 1 | analyze stochastic data in terms of the underl... | Advanced Statistics | DS_60 | 11 |
| 2 | utilize Bayesian statistics techniques. | Advanced Statistics | DS_60 | 12 |
| 3 | summarize the properties of observed data usin... | Advanced Statistics | DS_60 | 11 |
| 4 | apply data visualization techniques to design ... | Advanced Statistics | DS_60 | 14 |


In [20]:
df.to_csv('../datasets/df_handbook_DS_60.csv')