# STAT 196K - Final Project - Clustering of CSUS Course Descriptions

## Project Overview and Goal

This project aims to cluster related CSUS courses by their catalog descriptions. It involves fetching and parsing the HTML of the CSUS catalog website to extract the course descriptions. The descriptions are then passed to BERT, which preprocesses each one and condenses it into a single fixed-length vector (a form of dimensionality reduction). The resulting 1024-dimensional embedding, which represents the pooled encoding of each description, is uploaded to TensorFlow's Embedding Projector, where t-SNE reduces the dimensions further so the projection can be viewed in a 3D plot. The project draws on complex datasets, natural language processing, and dimensionality reduction.

## Get Course Information

### Install and Import

In [1]:
!pip install beautifulsoup4



In [2]:
import requests
import json
from bs4 import BeautifulSoup
from IPython.display import clear_output

### Get all subject pages

In [3]:
catalog_url = 'https://catalog.csus.edu'
subject_pages_url = f'{catalog_url}/courses-a-z/'
subject_pages_html = requests.get(subject_pages_url).text
subject_pages_bs = BeautifulSoup(subject_pages_html, 'html.parser')

subject_page_urls = {}
subject_div_bs = subject_pages_bs.find('div', {'id': 'atozindex'})
for subject_a in subject_div_bs.find_all('a'):
    subject_page_ref = subject_a.get('href')
    if subject_page_ref is not None:
        subject_page_urls[subject_a.get_text()] = catalog_url + subject_page_ref

subject_htmls = {}
for subject_title, subject_page_url in subject_page_urls.items():
    clear_output(wait=True)
    print(f'Get html page for {subject_title}')
    subject_htmls[subject_title] = requests.get(subject_page_url).text

Get html page for World Languages & Literatures (WLL)


### Get descriptions for all courses

In [4]:
course_infos = {}
for subject_title, subject_html in subject_htmls.items():
    subject_page_bs = BeautifulSoup(subject_html, 'html.parser')
    for course_div in subject_page_bs.find_all('div', {'class': 'courseblock'}):
        title = course_div.find('span', {'class': 'title'})
        # Skip courses with no title
        if title is not None:
            title = title.get_text()
            title = title.strip().replace('\u00a0\u00a0\u00a0\u00a0\u00a0', ' ')
            description = course_div.find('p', {'class': 'courseblockdesc'})
            # Skip courses with no description
            if description is not None:
                description = description.get_text()
                course_infos[title] = {'subject': subject_title, 'description': description}
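
The title cleanup above replaces one specific run of five non-breaking spaces. A slightly more general sketch, assuming titles may contain other mixes of regular and non-breaking whitespace, collapses every whitespace run with a regular expression. The helper name `normalize_title` and the sample title are illustrative and not part of the original notebook.

```python
import re


def normalize_title(raw_title):
    """Collapse every run of whitespace into one space and trim the ends."""
    # For str patterns in Python 3, \s also matches Unicode whitespace such
    # as the non-breaking space, so no special case is needed for it.
    return re.sub(r'\s+', ' ', raw_title).strip()


# Hypothetical title written in the style of the catalog markup.
sample = 'ABC 101.\u00a0\u00a0\u00a0\u00a0\u00a0Example Course Title. '
print(normalize_title(sample))
```

Unlike the fixed five-character replacement, this also handles titles where the number of non-breaking spaces differs.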

#### Save course infos

In [5]:
with open('course_infos.json', 'w') as file:
    json.dump(course_infos, file, indent=2)

#### Load course infos

In [6]:
# use a context manager so the file handle is closed after reading
with open('course_infos.json', 'r') as file:
    course_infos = json.load(file)

## Preprocess and Reduce Data Dimensions with BERT

### Import and Setup

In [7]:
# assuming tensorflow modules are already installed
import re
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text  # not referenced directly; importing it registers the ops the BERT preprocessing model needs
from tensorflow import keras

# see if using GPU and if so enable memory growth
gpus = tf.config.list_physical_devices('GPU')
print(gpus)
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

INFO:tensorflow:Enabling eager execution
INFO:tensorflow:Enabling v2 tensorshape
INFO:tensorflow:Enabling resource variables
INFO:tensorflow:Enabling tensor equality
INFO:tensorflow:Enabling control flow v2
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


### Create BERT Model

In [13]:
def bert(trainable=False):
    preprocess = hub.KerasLayer('assets/bert_en_uncased_preprocess_3')
    bert_encoder = hub.KerasLayer('assets/bert_en_wwm_uncased_L-24_H-1024_A-16_3', trainable=trainable)
    def _bert(text_input):
        x = preprocess(text_input)
        x = bert_encoder(x)
        pooled_output = x['pooled_output']
        sequence_output = x['sequence_output']
        return pooled_output, sequence_output
    return _bert

text_input = keras.layers.Input(shape=(), dtype=tf.string)
embedding = bert(trainable=False)(text_input)[0]
bert_model = keras.Model(inputs=text_input, outputs=embedding)
bert_model.summary()

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_4 (InputLayer)            [(None,)]            0                                            
__________________________________________________________________________________________________
keras_layer_5 (KerasLayer)      {'input_mask': (None 0           input_4[0][0]                    
__________________________________________________________________________________________________
keras_layer_6 (KerasLayer)      {'default': (None, 1 335141889   keras_layer_5[0][0]              
                                                                 keras_layer_5[0][1]              
                                                                 keras_layer_5[0][2]              
Total params: 335,141,889
Trainable params: 0
Non-trainable params: 335,141,889
____________

### Embed course descriptions

In [9]:
course_titles = []
course_descriptions = []
for course_title, info in course_infos.items():
    course_titles.append(course_title)
    course_descriptions.append(info['description'])

embedded_course_descriptions = bert_model.predict(course_descriptions, verbose=1)



#### Save Embeddings as TSV files

In [14]:
with open('embedded_course_descriptions.tsv', 'w') as tsvfile:
    for embedding in embedded_course_descriptions:
        tsvfile.write('\t'.join([str(x) for x in embedding]) + '\n')
        
with open('course_metadata.tsv', 'w') as tsvfile:
    tsvfile.write('Course Title\tCourse Subject\tCourse Description\n')
    for course_title in course_titles:
        course_subject = re.sub(r'[^\x00-\x7F]+|\s', ' ', course_infos[course_title]['subject'])
        course_description = re.sub(r'[^\x00-\x7F]+|\s', ' ', course_infos[course_title]['description'])
        course_title = re.sub(r'[^\x00-\x7F]+|\s', ' ', course_title)
        tsvfile.write(f'{course_title}\t{course_subject}\t{course_description}\n')

## Reduce Embedding Dimensions and Plot with TensorFlow Projector

To cluster the embeddings, I use [TensorFlow's Embedding Projector](https://projector.tensorflow.org/). The site requires the previously saved TSV files. Once these files are uploaded, I use t-SNE to reduce the dimensions further so the descriptions can be viewed in a 3D plot. The data can be trained and viewed in the projector at this [link](https://projector.tensorflow.org/?config=https://raw.githubusercontent.com/Tiger767/STAT196K-Final-Project/main/projector_config.json).
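
The link above passes a JSON config file to the projector through its `config` query parameter. The contents of this project's `projector_config.json` are not shown in the notebook, so the following is only a sketch of the general layout the projector accepts. The helper name `build_projector_config`, the URLs, and the vector count are placeholders.

```python
import json


def build_projector_config(name, num_vectors, dim, tensor_url, metadata_url):
    """Return a dict in the layout the Embedding Projector reads via ?config=."""
    return {
        'embeddings': [
            {
                'tensorName': name,
                # [number of vectors, dimensions per vector]
                'tensorShape': [num_vectors, dim],
                # URL of the TSV file with one embedding per line
                'tensorPath': tensor_url,
                # URL of the TSV file with one metadata row per embedding
                'metadataPath': metadata_url,
            }
        ]
    }


# Placeholder values: the URLs and the vector count are assumptions, not the
# ones actually used in this project.
config = build_projector_config(
    name='Course descriptions',
    num_vectors=3,
    dim=1024,
    tensor_url='https://example.com/embedded_course_descriptions.tsv',
    metadata_url='https://example.com/course_metadata.tsv',
)
print(json.dumps(config, indent=2))
```

Both TSV files must be reachable over HTTP from the browser, which is why the project link points at raw files hosted on GitHub.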

After about 1000 iterations of t-SNE with a perplexity of 5, a learning rate of 10, and the supervise setting at 0, the plot below was obtained.

![embedding plot](https://github.com/Tiger767/STAT196K-Final-Project/raw/main/embedding_plot.png)

## Project Insights

Before beginning this project, I suspected that classes from the same subject or from similar subjects would cluster together. After looking at the results, most classes from the same subject were not grouped together. This is arguably the more useful outcome: the subject of each class was already known, so clustering purely by subject would have yielded no new insight. Later in the project I also hoped that classes on similar topics, such as those dealing with machine learning, would be grouped together; however, this was not the case for the majority of clusters.

In the plot above, there are about three main clusters. The first consists mostly of music classes that all share the same description: "Individual instruction. Music majors only. May be taken for credit four times..." The second contains "Special Problems" courses with descriptions like: "Individual projects or directed reading..." The last main cluster contains everything else, though it has smaller clusters within it. One of these is a handful of classes relating to master's theses. Another, which did match the expectation that classes with similar topics would cluster, is a group of physics, chemistry, and mechanical engineering classes that all deal with physics in some way.

One insight I gained from these clusters is that many descriptions are generic and very broad. This could have been observed by reading the descriptions themselves, but the process followed here did highlight it in a couple of cases. A related insight is that these generic descriptions reveal many courses, especially in music, that are built around individual instruction or self-directed work. Overall, the embedding plot revealed a few modest insights but nothing uniquely significant.
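
The observation that many courses share one generic description can also be checked directly, without embeddings. The sketch below counts repeated descriptions in a dictionary shaped like `course_infos`. The helper name `shared_descriptions` and the sample entries are made up for illustration, with the description wording shortened from the examples quoted above.

```python
from collections import Counter


def shared_descriptions(course_infos, min_count=2):
    """Return (description, count) pairs used by at least min_count courses."""
    counts = Counter(
        info['description'].strip() for info in course_infos.values()
    )
    return [
        (description, count)
        for description, count in counts.most_common()
        if count >= min_count
    ]


# Tiny stand-in for course_infos; the course titles are hypothetical.
sample_infos = {
    'MUSC 000A. Example Lesson I.': {
        'subject': 'Music (MUSC)',
        'description': 'Individual instruction. Music majors only.',
    },
    'MUSC 000B. Example Lesson II.': {
        'subject': 'Music (MUSC)',
        'description': 'Individual instruction. Music majors only.',
    },
    'ABC 199. Special Problems.': {
        'subject': 'Example Subject (ABC)',
        'description': 'Individual projects or directed reading.',
    },
}

for description, count in shared_descriptions(sample_infos):
    print(f'{count} courses share: {description}')
```

Running this on the full `course_infos` dictionary would list the most heavily reused descriptions, which are likely candidates for the tight clusters seen in the plot.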

## Project Conclusion

In conclusion, the project results were not as satisfactory as I had hoped, but some insights were obtained. I believe that if I had fine-tuned BERT on course catalogs, or used the sequence_output instead of the pooled_output, the embedding plot might have revealed insights that simpler methods could not.
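
The notebook does not say how the sequence_output would have been turned into one vector per description. One common option, shown here purely as an assumption, is to average the token vectors while skipping padding positions, using the `input_mask` that the preprocessing layer produces. The sketch uses plain lists so it runs without TensorFlow; in the notebook the same arithmetic would be done with tensor operations.

```python
def masked_mean_pool(sequence_output, input_mask):
    """Average the token vectors of one description, skipping padding.

    sequence_output: list of per-token vectors (each a list of floats).
    input_mask: list of 0/1 flags, 1 for real tokens and 0 for padding.
    """
    dim = len(sequence_output[0])
    totals = [0.0] * dim
    kept = 0
    for vector, keep in zip(sequence_output, input_mask):
        if keep:
            kept += 1
            for i, value in enumerate(vector):
                totals[i] += value
    # Guard against an all-padding input to avoid dividing by zero.
    return [total / max(kept, 1) for total in totals]


# Toy example: three token positions with 2-dimensional vectors. The last
# position is padding and must not affect the result.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(masked_mean_pool(tokens, mask))  # [2.0, 3.0]
```

Whether this would separate the courses better than the pooled_output is an open question that would need to be tested on the actual embeddings.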