## Building a Bag of Words (BoW) model
The cleaned dataset can be used to train a text classification model that predicts the assigned tag from the raw survey response. A trained model would make future survey data easier to process.

### Importing modules

In [1]:
import pandas as pd
from joblib import dump
from nlpEngine import findKeywords 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

### Reading required data

In [23]:
workbook = pd.read_excel('data.xlsx',sheet_name=None)
responses = []
for name, sheet in workbook.items():
    sheet.dropna(how='all',inplace=True)
    for i, student in sheet.iterrows():
        response = None
        try:
            pathway = student['Based on your career aspirations, what pathway and course do you want to work towards?']
        except KeyError:
            pathway = student['Based on your career aspirations, what pathway do you want to work towards?']
        if pathway == 'Junior College':
            response = student['What is your 1st choice JC?']
        elif pathway == 'International Baccalaureate':
            response = student['What is your 1st choice IB course?']
        elif pathway == 'A Levels/International Baccalaureate':
            response = student['What is your 1st choice JC/IB?']
        elif pathway == 'NAFA/LaSalle':
            response = student['What is your 1st choice NAFA/LaSalle course?']
        elif pathway == 'Polytechnic':
            response = student['What is your 1st choice polytechnic and course?']
        elif pathway == 'Polytechnic Foundation Programme (PFP)':
            response = student['What is your 1st choice PFP course and polytechnic?']
        elif pathway == 'Direct to Polytechnic Programme (DPP)':
            response = student['What is your 1st choice Higher NITEC DPP course?']
        industry = student['The industry I\'m most interested in is']
        responses.append([response,industry])

responses = pd.DataFrame(responses,columns=['School','Industry'])
responses['Industry'] = responses['Industry'].apply(findKeywords)
responses['Industry'] = responses['Industry'].apply(lambda x : ' '.join(x))
responses['School'] = responses['School'].apply(findKeywords)
responses['School'] = responses['School'].apply(lambda x : ' '.join(x))
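
The pathway-to-question lookup in the loop above could also be written as a dictionary, which keeps the column names in one place. This is only a sketch: `PATHWAY_COLUMNS`, `FIRST_CHOICE_COLUMN`, `first_choice` and the sample row are hypothetical names and data introduced for illustration, while the column titles are copied from the cell above. Unlike the original loop, it returns `None` rather than raising `KeyError` when neither pathway question is present.

```python
import pandas as pd

# The two wordings of the pathway question seen across sheets.
PATHWAY_COLUMNS = [
    'Based on your career aspirations, what pathway and course do you want to work towards?',
    'Based on your career aspirations, what pathway do you want to work towards?',
]

# Pathway answer -> column holding the matching first-choice question.
FIRST_CHOICE_COLUMN = {
    'Junior College': 'What is your 1st choice JC?',
    'International Baccalaureate': 'What is your 1st choice IB course?',
    'A Levels/International Baccalaureate': 'What is your 1st choice JC/IB?',
    'NAFA/LaSalle': 'What is your 1st choice NAFA/LaSalle course?',
    'Polytechnic': 'What is your 1st choice polytechnic and course?',
    'Polytechnic Foundation Programme (PFP)': 'What is your 1st choice PFP course and polytechnic?',
    'Direct to Polytechnic Programme (DPP)': 'What is your 1st choice Higher NITEC DPP course?',
}


def first_choice(student):
    """Return the first-choice answer for one row, or None for other pathways."""
    # Use whichever pathway question this sheet contains.
    pathway = next((student[c] for c in PATHWAY_COLUMNS if c in student.index), None)
    column = FIRST_CHOICE_COLUMN.get(pathway)
    return student[column] if column is not None else None


# Hypothetical row, for illustration only.
row = pd.Series({
    PATHWAY_COLUMNS[1]: 'Polytechnic',
    'What is your 1st choice polytechnic and course?': 'Ngee Ann Poly, Business',
})
print(first_choice(row))
```

Inside the loop, `response = first_choice(student)` would then stand in for the `if`/`elif` chain.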

In [24]:
schools = ['Anglo-Chinese JC','Anderson Serangonn JC','Catholic JC','Dunman High School','Eunoia JC','Hwa Chong Institution','Jurong Pioneer JC','Millenia Institude','Nanyang JC','National JC','Raffles Institution','River Valley High School','St Andrews JC','St Josephs Insitution','Tampines Meridian JC','Temasek JC','Victoria JC','Yishun Innova JC','Ngee Ann Polytechnic','Singapore Polytechnic','Temasek Polytechnic','Nanyang Polytechnic','Republic Polytechnic']
industries = ['Arts&Media','Business&Finance','Engineering','Technology','Sciences','Medical']
school_tags = []
industry_tags = []
workbook = pd.read_excel('processedData.xlsx',sheet_name=None)
for name, sheet in workbook.items():
    sheet.dropna(how='all',inplace=True)
    for i, student in sheet.iterrows():
        if student['Pathway'] != 'Progress to Sec 5':
            school = student['School']
            if school not in schools:
                school = 'Uncertain'
        else:
            school = 'Uncertain'
        school_tags.append(school)
        industry = student['Industry']
        if industry not in industries:
            industry = 'Uncertain'
        industry_tags.append(industry)

### Encoding text
The model cannot process raw text directly, so each response is vectorised into an array of TF-IDF weights, one per term in the vocabulary.

In [25]:
school_matrix = TfidfVectorizer(max_features=180)
school_choice_responses = school_matrix.fit_transform(responses['School']).toarray()
school_response_train, school_response_test, school_tag_train, school_tag_test = train_test_split(school_choice_responses, school_tags)
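
To make the encoding step concrete, here is a minimal sketch of what `TfidfVectorizer` produces on a tiny corpus. The three strings are made-up stand-ins for the cleaned `responses['School']` column, and `max_features=5` is chosen only to keep the output small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical cleaned responses, standing in for responses['School'].
toy_responses = [
    "ngee ann polytechnic business",
    "singapore polytechnic engineering",
    "victoria jc",
]

# max_features keeps only the most frequent terms across the corpus.
toy_vectorizer = TfidfVectorizer(max_features=5)
toy_matrix = toy_vectorizer.fit_transform(toy_responses).toarray()

# The learned vocabulary, and one row per response / one column per kept term.
print(sorted(toy_vectorizer.vocabulary_))
print(toy_matrix.shape)
```

Each row is the numeric representation of one response, which is the form the classifiers below are trained on.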

### Naive Bayes Model
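I will try different models to see which is the most accurate, starting with the Gaussian Naive Bayes algorithm. It scored about 0.72 in the run shown below; the exact figure changes between runs because `train_test_split` shuffles the data randomly.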

In [26]:
school_model = GaussianNB()
school_model.fit(school_response_train,school_tag_train)

prediction = school_model.predict(school_response_test)
print('Accuracy: ',accuracy_score(school_tag_test,prediction))

Accuracy:  0.7210144927536232


### Random Forest Classifier
The random forest classifier is another model that works well with text data. It scored about 0.99 in the run shown below, far higher than Naive Bayes, so this model will be used.

In [27]:
school_model = RandomForestClassifier()
school_model.fit(school_response_train,school_tag_train)

prediction = school_model.predict(school_response_test)
print('Accuracy: ',accuracy_score(school_tag_test,prediction))

Accuracy:  0.9855072463768116
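
Because both scores above come from a single random split, a comparison averaged over several folds may be more stable. The sketch below is one possible way to do that with `cross_val_score`; it is not part of the original notebook, and the `texts` and `tags` lists are hypothetical stand-ins for `school_choice_responses` and `school_tags`.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Hypothetical stand-in data so the example runs on its own.
texts = ["ngee ann polytechnic", "np business", "ngee ann poly",
         "victoria jc", "vjc science", "victoria junior college"] * 5
tags = (["Ngee Ann Polytechnic"] * 3 + ["Victoria JC"] * 3) * 5

# GaussianNB needs a dense array, hence .toarray().
features = TfidfVectorizer().fit_transform(texts).toarray()

candidates = {
    "GaussianNB": GaussianNB(),
    "RandomForest": RandomForestClassifier(random_state=0),
}

scores = {}
for name, model in candidates.items():
    # 5-fold cross-validation averages out the luck of any one split.
    scores[name] = cross_val_score(model, features, tags, cv=5).mean()
    print(name, round(scores[name], 3))
```

Passing a fixed `random_state` to `train_test_split` is a lighter alternative when a single reproducible split is enough.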


### Another model for predicting Industry
Another model can be built to predict which industries students wish to work in. The same approach will be used.

In [109]:
industry_matrix = TfidfVectorizer(max_features=350)
industry_responses = industry_matrix.fit_transform(responses['Industry']).toarray()
industry_response_train,industry_response_test,industry_tag_train,industry_tag_test = train_test_split(industry_responses,industry_tags)
industry_model = RandomForestClassifier()
industry_model.fit(industry_response_train,industry_tag_train)

prediction = industry_model.predict(industry_response_test)
print('Accuracy: ',accuracy_score(industry_tag_test,prediction))

Accuracy:  0.9528985507246377


### Storing the models

In [115]:
dump(school_matrix,'school_vectorizer.sav')
dump(school_model,'school_classifier.sav')
dump(industry_matrix,'industry_vectorizer.sav')
dump(industry_model,'industry_classifier.sav')

['industry_classifier.sav']
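
The stored files can later be read back with `joblib.load` to tag new responses. The following is a minimal sketch under stated assumptions: it trains a throwaway model on made-up data and writes to a temporary folder so that it runs on its own. In the real pipeline the vectorizer and classifier would be the `.sav` files saved above, and a new response would presumably be passed through `findKeywords` first, as in training.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training data, for illustration only.
texts = ["ngee ann polytechnic", "victoria jc",
         "ngee ann poly", "victoria junior college"]
tags = ["Ngee Ann Polytechnic", "Victoria JC",
        "Ngee Ann Polytechnic", "Victoria JC"]

vectorizer = TfidfVectorizer()
model = RandomForestClassifier(random_state=0)
model.fit(vectorizer.fit_transform(texts).toarray(), tags)

with tempfile.TemporaryDirectory() as folder:
    vec_path = os.path.join(folder, "school_vectorizer.sav")
    clf_path = os.path.join(folder, "school_classifier.sav")
    dump(vectorizer, vec_path)
    dump(model, clf_path)

    # Later, possibly in another script: restore both objects.
    loaded_vectorizer = load(vec_path)
    loaded_model = load(clf_path)

# transform (not fit_transform) reuses the vocabulary learned in training.
new_features = loaded_vectorizer.transform(["ngee ann polytechnic"]).toarray()
predicted = loaded_model.predict(new_features)
print(predicted)
```

The vectorizer has to be saved alongside the classifier because the classifier expects columns in exactly the order of the vocabulary it was trained on.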