## Preprocessing data
This notebook runs several preprocessing steps on the course data.  
It flattens the nested structures and converts the data to a pandas DataFrame.  
For now it focuses on courses available in English, so courses not taught in English are filtered out.  
Eventually all courses should probably be included, but the English-only subset is easier for testing.   
**Note**: combining English and Finnish text in (content-based) models is going to be a challenge!

In [1]:
#import the needed packages
import json
import pandas as pd
from copy import deepcopy
#for nicely displaying dataframe
from IPython.display import display
#to identify language
import langid

In [2]:
#open the course info json file
with open('../Data/courses_all.json') as json_data:
    course_info = json.load(json_data)

In [3]:
print("number of courses in course info:",len(course_info))

number of courses in course info: 2659


In [4]:
#display example of info gotten from API
course_info[2]

{'code': '20E99904',
 'courseUnitId': '1125574316',
 'credits': '6',
 'endDate': '2018-11-28',
 'id': '1133537977',
 'languageOfInstructionCodes': ['en'],
 'name': {'en': 'Capstone: Business Development Project',
  'fi': 'Capstone: Business Development Project',
  'sv': 'Capstone: Business Development Project'},
 'organizationId': 'E701',
 'startDate': '2018-09-19',
 'summary': {'additionalInformation': {'en': '',
   'fi': 'Compulsory attendance in all class sessions and meetings. Most Master¿s Programme studies have to be completed before you can enroll on the Capstone course. The maximum number of students is 50, but only eligible candidates will be admitted even if the maximum number is not reached. Credit transfer and capstone course Students can, on legitimate grounds (Eg. exchange studies abroad, serious illness; however, working life and its restraints are not considered legitimate reasons not to complete the capstone course), apply for a credit transfer for a capstone course. H

In [5]:
def sel_lang(entry):
    #some values are a dictionary of the form {'en':'..','fi':'..','sv':'..'}
    #Here we pick one language, since this makes working with the data much easier
    #Preference is given to English; if it is empty, Finnish is tried next and then Swedish
    if entry['en']!='':
        result=entry['en']
    elif entry['fi']!='':
        result=entry['fi']
    elif entry['sv']!='':
        result=entry['sv']
    else:
        result=''
    return result

def avail_english(entry):
    #check whether English is one of the languages of instruction
    return 'en' in entry

def fix_struct(course_info):
    #flatten the nested structure so the records can be converted to a DataFrame
    #iterate over a copy because the keys in course_info change inside the loop
    d=deepcopy(course_info)
    for i in range(len(course_info)):
        for k in d[i].keys():
            if isinstance(d[i][k],dict):
                #a dict with exactly 3 keys is assumed to hold the 3 languages (en,fi,sv)
                if len(d[i][k].keys())==3:
                    course_info[i][k]=sel_lang(d[i][k])
                #otherwise it is a dict within a dict, e.g. the key summary, which holds additionalInformation and similar fields
                else: 
                    for j in d[i][k].keys():
                        course_info[i][j]=sel_lang(d[i][k][j])
                    del course_info[i][k]
        #add boolean for course available in English          
        course_info[i]['availableEnglish']=avail_english(d[i]['languageOfInstructionCodes'])
    return course_info
course_info=fix_struct(course_info)
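
`fix_struct` above treats any dict with exactly three keys as a language dict. A more defensive variant could look at the keys themselves instead. The following is a minimal sketch under that assumption, not the method used in this notebook: `LANG_KEYS`, `pick_language`, `is_language_dict` and `flatten_course` are hypothetical names, and the example record is a shortened toy modelled on the API output shown earlier.

```python
LANG_KEYS = {"en", "fi", "sv"}  # language codes seen in the API output above


def pick_language(entry, order=("en", "fi", "sv")):
    """Return the first non-empty translation, preferring English."""
    for lang in order:
        text = entry.get(lang, "")
        if text:
            return text
    return ""


def is_language_dict(value):
    """True for a non-empty dict whose keys are all language codes."""
    return isinstance(value, dict) and bool(value) and set(value) <= LANG_KEYS


def flatten_course(course):
    """Return a flat copy of one course record (the input is not modified)."""
    flat = {}
    for key, value in course.items():
        if is_language_dict(value):
            flat[key] = pick_language(value)
        elif isinstance(value, dict):
            # nested block such as 'summary': lift its fields to the top level
            flat.update(flatten_course(value))
        else:
            flat[key] = value
    return flat


# toy record (shortened, assumed shape) with a language dict that lacks some codes
example = {
    "code": "20E99904",
    "name": {"en": "Capstone: Business Development Project", "fi": "", "sv": ""},
    "summary": {
        "additionalInformation": {"en": "", "fi": "Compulsory attendance"},
        "content": {"en": "Applied project"},
    },
    "languageOfInstructionCodes": ["en"],
}
flat = flatten_course(example)
print(flat)
```

Because the sketch recurses, it also handles blocks nested more than one level deep. A nested field with the same name as a top-level field would overwrite it, so key collisions would need checking on the real data.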

In [6]:
#convert to dataframe
df=pd.DataFrame(course_info)

In [7]:
#print the first rows of the dataframe
#note: columns are truncated. Show all columns with pd.set_option('display.max_columns', None)
#and the full content of each cell with pd.set_option('display.max_colwidth', None) (-1 in pandas versions before 1.0)
df.head()

Unnamed: 0,additionalInformation,assesmentMethods,availableEnglish,cefrLevel,code,content,courseStatus,courseUnitId,credits,endDate,...,organizationId,prerequisities,registration,startDate,substitutes,teacherInCharge,teachers,teachingPeriod,type,workload
0,,1. Luennot ja ohjaustapaaminen2. Ajanhal...,False,,20A00511,"oppimis- ja opiskelutaidot, tiedonhakukoulutus...","KTK-tutkinto, Liiketoimintaosaamisen perusteet...",1113134889,1,2019-04-03,...,E700,,WebOodi-ilmoittautuminen. Katso ilmoittautumis...,2018-09-18,Kurssin tapaamiskerrasta voi saada vapautuksen...,Anni Rintala,"[Anni Rintala, Inka Pulkkinen]",I-V (2018-2020),course,Osallistuminen luennoille ja ohjaustapaamiseen...
1,Kurssin lähipäivillä on pakollinen läsnäolo. M...,Harjoitustyöt 100%.,False,,20C00201,Kurssin aikana opiskelija oppii käytännönlähei...,"KTK-tutkinto, vapaasti valittavat opinnot",1128567674,1,2018-12-18,...,E706,,WebOodi,2018-12-17,,KTT Christa Uusi-Rauva,[Christa Uusi-Rauva],Kurssi järjestetään 17.-18.12.2018,course,Lähiopetus: 15 tuntia; pakollinen läsnäolo Har...
2,Compulsory attendance in all class sessions an...,100 % assignments (group and individual),True,,20E99904,"The course consists of an applied, real-life p...",Mandatory course in the Master¿s programs of B...,1125574316,6,2018-11-28,...,E701,Most Master¿s Programme studies have to be com...,via WebOodi.,2018-09-19,Students can replace this capstone course by p...,Perttu KähäriNina GranqvistPaulina JunniGregor...,"[Perttu Kähäri, Laura Peni, Pekka Pälli, Iiris...","Periods I-II Töölö campus, periods IV-V Otanie...",course,Contact teaching :10-15 h (incl. closing semin...
3,Compulsory attendance in all class sessions an...,100 % assignments (group and individual),True,,20E99904,"The course consists of an applied, real-life p...",Mandatory course in the Master¿s programs of B...,1125574316,6,2019-05-15,...,E701,Most Master¿s Programme studies have to be com...,via WebOodi.,2019-02-27,Students can replace this capstone course by p...,Perttu KähäriNina GranqvistPaulina JunniGregor...,"[Paulina Junni, Pekka Pälli, Gregory O'Shea, I...","Periods I-II Töölö campus, periods IV-V Otanie...",course,Contact teaching :10-15 h (incl. closing semin...
4,,1. Luennot 36 h. Professorit Nina Granqvist ja...,False,,21A00110,Organisaatiot ovat toiminnan perusyksikkö ¿ sy...,"KTK-tutkinto, liiketoimintaosaamisen perusteet.",1013558549,6,2019-04-09,...,E706,Ei ennakkovaatimuksia,WebOodissa,2019-02-25,,Nina GranqvistOlli-Pekka Kauppila,[Nina Granqvist],IV periodi 2018-2019 Otaniemi kampusIV periodi...,course,Osallistuminen luennoille 36 hValmistautuminen...


Now that all the info is in the DataFrame, we can select the rows we need.  
For now we focus on English courses only.

In [8]:
#make copy of original dataframe
df_adj=df.copy()

In [9]:
#drop duplicate courses
#duplicates seem to be mainly due to a course having several starting dates
df_adj=df_adj.drop_duplicates(['courseUnitId'])
print('number of unique courses:',len(df_adj))
print('number of duplicate courses:',len(df)-len(df_adj))

number of unique courses: 2183
number of duplicate courses: 476


In [10]:
#only select those courses that have English as (one of the) teaching languages
df_adj=df_adj[df_adj['availableEnglish']]
print('Number of courses available in English:',len(df_adj))

Number of courses available in English: 1331


In [11]:
# some courses list English as (one of the) languages, but most of their information is only available in Finnish. 
# Drop these for now, but check why this is the case (see Questions)! 

num_entries=len(df_adj)
#detect whether the description is Finnish or English and drop the Finnish ones
langid.set_languages(['fi', 'en'])
#build a boolean mask instead of dropping rows one by one while iterating
is_finnish=df_adj['content'].apply(lambda text: langid.classify(text)[0]=='fi')
df_adj=df_adj[~is_finnish].reset_index(drop=True)

print("Number of courses Finnish description:",num_entries-len(df_adj))
print("Number of courses left:",len(df_adj))

Number of courses Finnish description: 119
Number of courses left: 1212
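
The cell above also classifies courses whose description is empty or very short, where the detected language is not meaningful. A small guard could skip those. The following is a minimal sketch, assuming the classifier returns a `(language, score)` tuple as `langid.classify` does above; `detect_language`, `min_chars` and `toy_classify` are hypothetical, and `toy_classify` is only a stand-in so the example runs without langid installed.

```python
def detect_language(text, classify, min_chars=20):
    """Return a language code, or None when the text is too short to judge."""
    text = (text or "").strip()
    if len(text) < min_chars:
        return None
    lang, _score = classify(text)
    return lang


def toy_classify(text):
    # Hypothetical stand-in classifier: counts the Finnish letters ä and ö.
    # In the notebook, langid.classify would be passed instead.
    finnish_hits = sum(text.lower().count(ch) for ch in "äö")
    return ("fi", 1.0) if finnish_hits >= 2 else ("en", 1.0)


descriptions = [
    "The course consists of an applied, real-life project.",
    "Kurssin aikana opiskelija oppii käytännönläheisiä taitoja.",
    "",
]
labels = [detect_language(d, toy_classify) for d in descriptions]
print(labels)
```

Rows labelled `None` could then be kept, dropped or inspected separately, rather than being counted as English or Finnish by default.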


In [12]:
df_adj.head()

Unnamed: 0,additionalInformation,assesmentMethods,availableEnglish,cefrLevel,code,content,courseStatus,courseUnitId,credits,endDate,...,organizationId,prerequisities,registration,startDate,substitutes,teacherInCharge,teachers,teachingPeriod,type,workload
0,Compulsory attendance in all class sessions an...,100 % assignments (group and individual),True,,20E99904,"The course consists of an applied, real-life p...",Mandatory course in the Master¿s programs of B...,1125574316,6,2018-11-28,...,E701,Most Master¿s Programme studies have to be com...,via WebOodi.,2018-09-19,Students can replace this capstone course by p...,Perttu KähäriNina GranqvistPaulina JunniGregor...,"[Perttu Kähäri, Laura Peni, Pekka Pälli, Iiris...","Periods I-II Töölö campus, periods IV-V Otanie...",course,Contact teaching :10-15 h (incl. closing semin...
1,The minimum number of participants is 20,Learning diaries 50%Take-home exam 50%,True,,21C00150,This introductory course gives a basic underst...,Degree Elective,1130843834,3,2019-03-29,...,E706,,Via WebOodi,2019-02-27,,"DSc Christa Uusi-Rauva, Professor Ingmar Björkman","[Alice Wickström, Ingmar Björkman]","2018-2019; IV, Otaniemi Campus 2019-2020: no t...",course,Lectures: 33 hoursLearning diaries: 24 hoursTa...
2,Max. 100 students. Priority for management stu...,Final exam: 40%Assignments: 30%Learning diary:...,True,,21C00350,"Throughout this course, we will be covering di...",Bachelor: Management HR specialization area Co...,1125857456,6,2018-12-13,...,E706,It is recommended that the students have basic...,WebOodi,2018-10-30,21C00300 Henkilöstöjohtaminen,Kathrin Sele,[Kathrin Sele],"Period II (2018-2019), Otaniemi campusPeriod I...",course,Lectures 30h presence (obligatory classroom pr...
3,,,True,,21C03000,The course is taught by a visiting lecturer an...,B.Sc. Management minor,1133021737,3-6,2019-02-15,...,E706,,via WebOodi,2019-01-09,,The course is taught by a visiting lecturer. 2...,[Mikko Martela],"2018-2019: III, Otaniemi campusNo teaching 201...",course,
4,,50% reflective learning diary50% final essay exam,True,,21C10000,"Must know: the concepts of ""concept and contex...",Aalto-course Management minor elective course,1121603277,6,2019-02-19,...,E706,No specific prerequisites for attending the co...,Via Weboodi,2019-01-08,,Esko Aho Kirsti Iivonen,"[Esko Aho, Kirsti Iivonen]",Period III (2018-2019)Period III (2019-2020),course,Attending lectures 24h (not compulsory but hig...


In [13]:
#reset the index again; without drop=True the old index is written to the csv as an 'index' column
df_adj=df_adj.reset_index()
#write processed dataframe to csv
df_adj.to_csv('../Data/filtered_courses.csv',index=False)

### Questions
#### Data retrieval
- Some courses seem to have all the info in both Finnish and English on Oodi, but the API only returns the Finnish version. See e.g. Sustainable Built Environment (idx 2644)
- Some courses seem to return less info than there is on Oodi. See e.g. Magnificent Life (courseUnitId 1113375184)
- The two problems above also occur on courses.aalto.fi (so it's not just me :p)

#### Data processing
- What causes duplicate courses, and are there any differences in content between duplicates?
- How to handle the two languages?
- Check out the courses that have English as a language of instruction but only a Finnish description
- What to do with the courses with no description? Do they have a negative influence on some methods, or does it not really matter?
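
One way to approach the first question is to group the raw records by `courseUnitId` and list the fields whose values differ within each group. The following is a minimal sketch on plain dicts, not a tested part of this notebook: `differing_fields` is a hypothetical helper, and the toy records only reuse the codes and dates visible in the `df.head()` output above.

```python
from collections import defaultdict


def differing_fields(records, key="courseUnitId"):
    """Map each duplicated key value to the sorted list of fields that differ."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)

    result = {}
    for value, group in groups.items():
        if len(group) < 2:
            continue  # not a duplicate
        fields = set().union(*(r.keys() for r in group))
        # compare via repr so unhashable values such as lists are handled
        result[value] = sorted(
            f for f in fields if len({repr(r.get(f)) for r in group}) > 1
        )
    return result


records = [
    {"courseUnitId": "1125574316", "code": "20E99904",
     "startDate": "2018-09-19", "endDate": "2018-11-28"},
    {"courseUnitId": "1125574316", "code": "20E99904",
     "startDate": "2019-02-27", "endDate": "2019-05-15"},
    {"courseUnitId": "1013558549", "code": "21A00110",
     "startDate": "2019-02-25", "endDate": "2019-04-09"},
]
diffs = differing_fields(records)
print(diffs)
```

Running this on the flattened `course_info` and tallying how often each field appears in the result would show whether duplicates differ only in dates and teachers or also in descriptive content.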

#### Some tests related to questions

In [14]:
#show the content of the Magnificent Life course
for course in course_info:
    if course['courseUnitId']=='1113375184':
        print(course)

{'id': '1134038359', 'startDate': '2018-11-05', 'languageOfInstructionCodes': ['en'], 'courseUnitId': '1113375184', 'name': 'Magnificent Life', 'credits': '0', 'endDate': '2018-12-02', 'organizationId': 'T307', 'code': 'TU-CV', 'type': 'course', 'teachers': ['Esa Saarinen'], 'workload': '', 'prerequisities': '', 'learningOutcomes': '', 'literature': '', 'languageOfInstruction': '', 'registration': '', 'homepage': 'https://mycourses.aalto.fi/course/search.php?search=TU-CV', 'content': '', 'cefrLevel': '', 'level': '', 'teacherInCharge': '', 'assesmentMethods': '', 'courseStatus': '', 'substitutes': '', 'additionalInformation': '', 'gradingScale': '', 'teachingPeriod': '', 'availableEnglish': True}


In [15]:
#test for sustainable built environment
df['name'][df['name'].str.contains('Sust')]

31                    CAPSTONE in Creative Sustainability
32                             Sustainability in Business
67                          Accounting for Sustainability
115                          Sustainable Entrepreneurship
128              Sustainability in International Business
238                             Sustainable Supply Chains
317                                Product Sustainability
373                    Introduction to Sustainable Design
422           Sustainability Tools for Building Designers
672                     Process Safety and Sustainability
991                   Sustainable Building Energy Systems
1177                            Sustainable Electronics P
2218               Sustainable Fashion and Textile Design
2243                  Knowledge-Making for Sustainability
2247               Sustainable Product and Service Design
2252      Intersections Between Sustainability and Design
2407    Systems Thinking for Sustainable Living Enviro...
2433          

In [16]:
#count the courses whose description is empty
k=(df_adj['content']=='').sum()
print("Number of courses with no description:",k)

Number of courses with no description: 36


I think some of the missing content is caused by this nested structure. How to fix that?
![Mag Life example](../Images/magLife_example.png)