# Introduction
Hi! My name is Fei Liang. Thank you for your interest in my coding skills. In this coding sample, I present a project that solves a natural language processing (NLP) problem using Python.

The project I have developed demonstrates my proficiency in Python and NLP techniques. The goal of the project is to analyze a collection of text data and extract meaningful information from it. Specifically, the project involves preprocessing the text data, performing feature extraction, and building models using machine learning algorithms.

I hope that this coding sample provides a glimpse into my skills and abilities as a data scientist.

Please note that some materials and code have been removed from this coding sample due to non-disclosure agreements (NDA) with previous clients. If you have any questions about the missing portions or would like more information about my experience, please don't hesitate to contact me. I would be happy to discuss my work in more detail with you.

# Data Preprocessing
In this part, we start by cleaning the text data: special characters, punctuation, and other non-alphanumeric characters are removed or replaced. We also remove stop words, which are common words that add little value to our analysis, such as "the", "and", and "a". After cleaning the text data, we perform lemmatization, which reduces each word to its dictionary form (lemma), to further reduce the number of unique words in the data.

In [12]:
# Import all the packages needed in the data preprocessing part of this project
import re
import nltk
import json
import pandas as pd
from nltk import word_tokenize
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /Users/liangfei/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/liangfei/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/liangfei/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/liangfei/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/liangfei/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [2]:
# Read the dataset
# This dataset contains the responses collected from the survey. It has the following columns:
# 'TA', 'Brand', 'Response', 'Month', 'Year', 'Quarter'
# This project mainly focuses on the 'Response' column, which contains free-text feedback
dataset = pd.read_csv('Dataset.csv')
dataset.columns = ['TA', 'Brand', 'Response', 'Month', 'Year', 'Quarter']
dataset = dataset.drop(columns=['Quarter'])
# dataset.head()

Replace non-alphanumeric characters and punctuation, and expand technical terms and abbreviations to their original form, using a previously created dictionary.

In [4]:
# Read the word dictionary from a json file
with open('word_dict.json', 'r') as file:
    word_dict = json.load(file)

def sub_word(text, dictionary):
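    '''
    Substitute every match of a dictionary key with the corresponding value.
    Each key is passed to re.sub, so it is treated as a regular expression pattern.
    :param text: The text string to be modified
    :param dictionary: The word dictionary (pattern -> replacement)
    :return: A modified text string
    '''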
    # Convert everything to lower case before we perform the transformation
    text = str(text).lower()

    for key, value in dictionary.items():
        text = re.sub(key, value, text)

    # strip the leading and trailing spaces
    text = text.strip()
    return text

In [5]:
# Apply the sub_word function to the Response column of the dataset
dataset['Response_Basic'] = dataset['Response'].apply(sub_word, args=(word_dict,))
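
Because `sub_word` passes each dictionary key to `re.sub`, the keys behave as regular expression patterns, and they are applied in the order in which they appear in the dictionary. The sketch below illustrates this with a small hypothetical dictionary. The real `word_dict.json` is not included in this sample, so the entries and the name `sub_word_sketch` are assumptions made only for illustration.

```python
import re

# Hypothetical entries; the real word_dict.json is not part of this sample.
example_dict = {
    r"\bdr\b\.?": "doctor",   # abbreviation -> full form
    r"[^a-z0-9\s]": " ",      # any other non-alphanumeric character -> space
    r"\s+": " ",              # collapse repeated whitespace
}

def sub_word_sketch(text, dictionary):
    """Same logic as sub_word: lower-case, apply each pattern in order, strip."""
    text = str(text).lower()
    for pattern, replacement in dictionary.items():
        text = re.sub(pattern, replacement, text)
    return text.strip()

result = sub_word_sketch("  The Dr. was GREAT!!  ", example_dict)
print(result)  # the doctor was great

# str() turns a missing value into the literal string 'nan'
missing = sub_word_sketch(float("nan"), example_dict)
print(missing)  # nan
```

One consequence worth keeping in mind: because of the `str(text)` call, a missing response does not stay missing at this stage but becomes the text `nan`.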

Next, remove all the stopwords from the `Response_Basic` column.

In [7]:
# Get the stopwords from the `nltk` package; a set gives constant-time membership checks
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    '''
    Remove all the stopwords from a text string
    :param text: a text string to be modified
    :return: a cleaned text string
    '''
    # tokenize the text into a list of words
    word_list = word_tokenize(text)
    # keep only the words that are not in stop_words
    cleaned_word_list = [word for word in word_list if word not in stop_words]
    # join the remaining words back into a string
    return ' '.join(cleaned_word_list)

In [8]:
# Apply the remove_stopwords function to the Response_Basic column of the dataset
dataset['Response_Cleaned'] = dataset['Response_Basic'].apply(remove_stopwords)
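
The filtering step can be illustrated without any NLTK data. The sketch below is a simplified stand-in, not the function used in this project: it swaps `word_tokenize` for `str.split()` and uses a tiny hypothetical stop-word set in place of `stopwords.words('english')`.

```python
# Tiny hypothetical stop-word set, standing in for stopwords.words('english')
stop_words_sketch = {"the", "and", "a", "is", "was"}

def remove_stopwords_sketch(text, stop_words):
    # str.split() stands in for word_tokenize to keep the sketch dependency-free
    tokens = text.split()
    kept = [tok for tok in tokens if tok not in stop_words]
    return " ".join(kept)

cleaned = remove_stopwords_sketch(
    "the doctor was kind and a good listener", stop_words_sketch
)
print(cleaned)  # doctor kind good listener

# A response made up only of stop words collapses to an empty string
empty = remove_stopwords_sketch("the and a", stop_words_sketch)
print(repr(empty))  # ''
```

The second call shows an edge case: a response that contains only stop words becomes an empty string, which pandas reads back as a missing value when the CSV file is reloaded with default settings.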

Then, lemmatize the words to reduce the number of distinct words in the data.

In [10]:
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize(text):
    '''
    Lemmatize all the words in a text string
    :param text: a text string to be modified
    :return: a text string with all words lemmatized
    '''
    # tokenize the text into a list of words
    word_list = word_tokenize(text)
    # lemmatize each word; without a POS tag, WordNetLemmatizer treats every word as a noun
    lemmatized_word_list = [lemmatizer.lemmatize(word) for word in word_list]
    # join the lemmatized words back into a string
    return ' '.join(lemmatized_word_list)

In [13]:
# Apply the lemmatize function to the Response_Cleaned column of the dataset
dataset['Response_Processed'] = dataset['Response_Cleaned'].apply(lemmatize)

Finally, drop the rows that contain missing values and write the dataset to a CSV file.

In [16]:
# Note: dropna() without `subset` drops a row if any column is missing, not only 'Response_Processed'
dataset.dropna(inplace=True)
dataset.to_csv('Cleaned_Dataset.csv', index=False)

# Machine Learning Analysis (Topic Modeling)
After preprocessing the text data, we move on to the machine learning analysis part of our NLP project. The goal of this part is to cluster the text responses into different topics based on their content.

To accomplish this, we first use a Sentence-BERT model to vectorize the text. We then apply Principal Component Analysis (PCA) to reduce the dimensionality of the embeddings. Finally, we run the HDBSCAN clustering algorithm on the reduced embeddings. HDBSCAN is a density-based clustering algorithm: it identifies regions of high density in the data and separates them into clusters, without requiring the number of clusters to be chosen in advance. Points in low-density regions are labeled as noise rather than forced into a cluster.

In [37]:
# Import all the packages needed for topic modeling
import pandas as pd
import hdbscan
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

In [20]:
# Read the cleaned dataset; the 'Response_Processed' column is used as the input data
dataset = pd.read_csv('Cleaned_Dataset.csv')
# dataset.head()

We use a Sentence-BERT encoder to vectorize the response text.

In [23]:
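# Encode the responses with a pre-trained Sentence-BERT model (no fitting takes place here)
bert_model = SentenceTransformer('bert-base-nli-mean-tokens')
# Pass a plain list of strings to the encoder
text_responses = dataset['Response_Processed'].tolist()
sentence_embeddings = bert_model.encode(text_responses)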

Because the embedding vectors are high-dimensional, we apply Principal Component Analysis to reduce them to a smaller number of components.

In [63]:
# Define a variable `dim` to store the desired dimension of the output vector
dim = 20
# Define the PCA model
model = PCA(n_components = dim)
reduced_embeddings = model.fit_transform(sentence_embeddings)
explained_variance = model.explained_variance_ratio_

In [64]:
# Calculate the fraction of the total variance retained by the reduced embeddings
sum(explained_variance)

0.717490941286087
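
The value above is the sum of the per-component explained variance ratios, that is, the share of the total variance that the kept components account for. As a rough illustration of where such ratios come from, the sketch below computes them by hand for a small set of hypothetical 2-D points (not the project's embeddings), using the closed-form eigenvalues of a 2x2 covariance matrix.

```python
import math

# Hypothetical 2-D points that lie close to a line (not the project's embeddings)
points = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.0)]

n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n

# Sample covariance matrix [[var_x, cov_xy], [cov_xy, var_y]]
var_x = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
var_y = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)

# Eigenvalues of a symmetric 2x2 matrix in closed form:
# each eigenvalue is the variance along one principal component
half_trace = (var_x + var_y) / 2
radius = math.sqrt(((var_x - var_y) / 2) ** 2 + cov_xy ** 2)
eigenvalues = [half_trace + radius, half_trace - radius]

# Explained variance ratio = eigenvalue / total variance
ratios = [ev / sum(eigenvalues) for ev in eigenvalues]
retained_one_component = ratios[0]
print(ratios)
```

Because these points are almost collinear, keeping only the first component retains nearly all of the variance. In the project, the same quantity summed over the 20 kept components is about 0.72.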

Finally, we apply the HDBSCAN clustering algorithm to the reduced embeddings to obtain the clusters.

In [65]:
# Use the L1 (Manhattan) distance; all other hyperparameters are left at their defaults
cluster_model = hdbscan.HDBSCAN(metric='l1')
cluster_model.fit(reduced_embeddings)
cluster_model.labels_.max()

10
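
For reference, the `'l1'` metric measures the distance between two vectors as the sum of absolute coordinate differences, in contrast to the Euclidean (L2) distance. A minimal sketch with hypothetical vectors and helper names chosen for this example:

```python
import math

def l1_distance(u, v):
    # Sum of absolute coordinate differences (Manhattan distance)
    return sum(abs(a - b) for a, b in zip(u, v))

def l2_distance(u, v):
    # Square root of the sum of squared differences (Euclidean distance)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

u = (0.0, 0.0, 0.0)
v = (3.0, 4.0, 0.0)

d1 = l1_distance(u, v)
d2 = l2_distance(u, v)
print(d1, d2)  # 7.0 5.0
```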

In [66]:
dataset['Cluster'] = cluster_model.labels_
dataset.to_csv('VS_Result.csv', index=False)

The largest cluster label is 10. Because HDBSCAN numbers clusters from 0, the responses fall into 11 clusters, and any response labeled -1 is treated as noise. We store the labels back into the dataset and use them for further analysis.
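
The size of each cluster, including the noise label, can be tallied directly from the labels. The sketch below uses a short list of hypothetical labels in the same format as `cluster_model.labels_`. The variable names are chosen for this example only.

```python
from collections import Counter

# Hypothetical labels in the format of cluster_model.labels_
# (-1 marks points that HDBSCAN treats as noise)
labels = [0, 0, 1, -1, 2, 1, -1, 0, 2, 2]

counts = Counter(labels)
noise_count = counts.get(-1, 0)
n_clusters = len(set(labels) - {-1})
max_label = max(labels)

print(n_clusters, max_label, noise_count)  # 3 2 2
```

Here the largest label is 2 while the number of clusters is 3, which is why the cluster count is derived from the set of labels rather than from the maximum label.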

# Conclusion
This concludes my coding sample. Please note that this coding sample does not show the full research process. However, it does provide a glimpse into the technical aspects of the project and showcases my proficiency in Python and NLP.

I hope that this coding sample has been informative and enjoyable for you. If you have any follow-up questions or would like more information about my work, please don't hesitate to contact me. I would be happy to discuss my experience and skills in more detail.