Building a Machine Learning Model with Python

Overview

We're going to build a machine learning model that predicts an individual's personality type, using the 16 distinct personality types of the Myers-Briggs Type Indicator (MBTI). Isabel Briggs Myers and her mother, Katharine Cook Briggs, developed a questionnaire that indicates differing psychological preferences in how people perceive the world and make decisions. The Myers-Briggs framework aims to help people identify and understand how they take in information (perception) and make decisions (judgment) in relation to their environment.

Chart by Jake Beech - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=30859659.

The framework consists of eight preferences organized into four pairs of opposites:

Extraversion (E) or Introversion (I)
Sensing (S) or Intuition (N)
Thinking (T) or Feeling (F)
Judging (J) or Perceiving (P)

Combining one preference from each pair (E or I, S or N, T or F, J or P) gives a four-letter MBTI type code, as seen in the image below. Each code represents a person's psychological or personality type. There are 16 possible combinations of letters, leading to 16 distinct MBTI personality types.

~Image source

"Whatever the circumstances of your life, the understanding of type can make your perceptions clearer, your judgments sounder, and your life closer to your heart's desire." —Isabel Briggs Myers

Dataset

We'll use two datasets sourced from Kaggle, as described in the sections below.

One

The first dataset (26 MB) contains 8,675 rows of uncleaned data (individual posts collected from a forum and labelled with the personality type). This will allow us to learn the fundamentals of data pre-processing and data visualization. You can download the file (mbpt_dataset.csv) from this repository or directly from the linked Kaggle page above.

The data is in a CSV file with the following columns:

Column	Description
`type`	Four-letter MBTI code.
`posts`	The last 50 things an individual posted, with each entry separated by three pipe characters (`|||`).
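
As a rough illustration of the pre-processing this column calls for, the sketch below splits one `posts` value on the three-pipe separator and cleans each post with the built-in `re` module. The sample string, the two patterns, and the helper names `split_posts` and `clean_post` are assumptions made for this example; they are not taken from the dataset or the lesson's notebook.

```python
import re

# Hypothetical raw value of the `posts` column for one row (not real dataset content).
raw_posts = (
    "I love hiking|||Check this out https://example.com/video"
    "|||Typical INTJ thing to say..."
)

URL_PATTERN = re.compile(r"https?://\S+")
NON_LETTER_PATTERN = re.compile(r"[^a-z\s]")


def split_posts(raw):
    """Split one `posts` cell into individual posts on the ||| separator."""
    return [post.strip() for post in raw.split("|||") if post.strip()]


def clean_post(post):
    """Lowercase a post, drop URLs and non-letter characters, collapse whitespace."""
    text = URL_PATTERN.sub(" ", post.lower())
    text = NON_LETTER_PATTERN.sub(" ", text)
    return " ".join(text.split())


posts = split_posts(raw_posts)
cleaned = [clean_post(post) for post in posts]
print(cleaned)
# ['i love hiking', 'check this out', 'typical intj thing to say']
```

How aggressively to clean (for example, whether to keep digits or the type names themselves) is a modelling decision, so treat these patterns as a starting point only.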

Two

The model trained on the first dataset will not be optimal because the data is imbalanced; hence, we will use the second dataset (346 MB), which contains 106,067 rows of cleaned and more balanced data. This dataset combines the first dataset with another one collected from Reddit and will give us much better accuracy. You can download the file (mbpt_dataset_2.csv) using the link provided or directly from the linked Kaggle page above. Since the file is large, I couldn't upload it to this repository (it exceeds GitHub's file size limit of 100 MB), so you can instead add it as a shortcut to your Google Drive. This way, you don't have to download the file and can easily mount it to your Colab session.

The data is in a CSV file with the following columns:

Column	Description
`type`	Four-letter MBTI code.
`posts`	The last 50 things an individual posted, cleaned so that every sample contains the same number of words.
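
The imbalance mentioned above can be checked by counting how often each label appears in the `type` column. The sketch below uses a short, made-up list of labels in place of the real column, so the counts it prints say nothing about either dataset.

```python
from collections import Counter

# Hypothetical labels standing in for the `type` column; the real counts differ.
labels = ["INFP", "INFP", "INFP", "INFJ", "INFJ", "INTP", "ESTJ", "INFP"]

counts = Counter(labels)
total = sum(counts.values())

# Share of each class, most common first.
for mbti_type, count in counts.most_common():
    print(f"{mbti_type}: {count} rows ({count / total:.1%})")

# Ratio between the largest and smallest class; 1.0 would mean perfectly balanced.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"Imbalance ratio: {imbalance_ratio:.1f}")
```

A large ratio suggests the model may favour the frequent classes, which is the kind of problem the second dataset is meant to reduce.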

Technologies

We'll build our model using the Scikit-Learn library. Scikit-Learn is a free software machine learning library for the Python programming language, including various classification, regression, and clustering algorithms.

Here's a list and summary of each Python package we will use:

Package	Description
`re`	Python built-in module for regular expression operations.
`tqdm`	A fast and extensible progress bar.
`pandas`	A fast, powerful, flexible and easy-to-use data analysis and manipulation tool.
`matplotlib.pyplot`	The state-based interface to matplotlib, the comprehensive library for creating static, animated, and interactive visualizations. It offers a MATLAB-like way of plotting.
`WordCloud`	A word cloud generator.
`train_test_split` from `sklearn.model_selection`	Scikit-Learn function for splitting arrays or matrices into random train and test subsets.
`LabelEncoder` from `sklearn.preprocessing`	Scikit-Learn class for encoding target labels with values between 0 and n_classes-1.
`TfidfVectorizer` from `sklearn.feature_extraction.text`	Scikit-Learn class for converting a collection of raw documents to a matrix of TF-IDF features.
`accuracy_score` from `sklearn.metrics`	Scikit-Learn function for calculating the accuracy score of a classifier's predictions.
`classification_report` from `sklearn.metrics`	Scikit-Learn function for building a text report showing the main classification metrics.
`confusion_matrix` from `sklearn.metrics`	Scikit-Learn function for computing a confusion matrix to evaluate the accuracy of a classification.
`LogisticRegression` from `sklearn.linear_model`	Scikit-Learn Logistic Regression classifier for classification models.
`SVC` from `sklearn.svm`	Scikit-Learn C-Support Vector classifier for classification models.
`LinearSVC` from `sklearn.svm`	Scikit-Learn Linear Support Vector classifier for classification models.
`MultinomialNB` from `sklearn.naive_bayes`	Scikit-Learn Naive Bayes classifier for multinomial classification models.
`RandomForestClassifier` from `sklearn.ensemble`	Scikit-Learn Random Forest classifier for classification models.
`pickle`	Python built-in module for object structure serialization and de-serialization.
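
To show how several of these packages fit together, here is a minimal sketch of one possible workflow: encode the labels, split the data, vectorize the text with TF-IDF, train a classifier, and score it. The eight posts, the two-class setup, and the parameter values (`test_size=0.25`, `random_state=42`, `max_iter=1000`) are assumptions chosen to keep the example small; the lesson's notebook may organize these steps differently.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data: two classes with four made-up posts each.
posts = [
    "I planned my whole week in a detailed spreadsheet",
    "My strategy is to analyze systems and optimize the plan",
    "I analyze every plan and build a long term strategy",
    "Spreadsheet plans and systems help me optimize my week",
    "Let's throw a spontaneous party with all our friends",
    "I love meeting new friends and sharing exciting ideas",
    "A spontaneous adventure with friends sounds exciting",
    "Sharing a party and new ideas with people I love",
]
types = ["INTJ"] * 4 + ["ENFP"] * 4

# Encode the string labels as integers between 0 and n_classes-1.
encoder = LabelEncoder()
y = encoder.fit_transform(types)

# Hold out a quarter of the data; stratify keeps both classes in each split.
X_train, X_test, y_train, y_test = train_test_split(
    posts, y, test_size=0.25, random_state=42, stratify=y
)

# Fit the vectorizer on the training text only, then reuse it on the test text.
vectorizer = TfidfVectorizer(stop_words="english")
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_tfidf, y_train)

predictions = model.predict(X_test_tfidf)
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)
print("Predicted types:", encoder.inverse_transform(predictions))
```

With so few samples the accuracy figure is meaningless; the point is only the order of the steps. Swapping `LogisticRegression` for `LinearSVC`, `MultinomialNB`, or `RandomForestClassifier` leaves the rest of the sketch unchanged.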

Notebook (source code)

Here are the source code files for the Jupyter notebooks and Python scripts used in this lesson. We will discuss the sequence of steps in the Notebook further during the workshop. You can also find all the files in the ./model directory.

model.ipynb: Jupyter notebook for dataset one.
model.py: Python script for dataset one.

model.ipynb: Jupyter notebook for dataset two.
model.py: Python script for dataset two.

Special thanks to Nada Gamal for the initial notebook on the first dataset.

Results Comparison

Model One	Model Two

Glossary

Here's a list of terms you'll come across in this lesson (we will discuss them further during the workshop).

Term	Description
MBTI	Myers-Briggs Type Indicator.
Kaggle	This is an online community of data scientists and machine learning engineers.
Mounting Google Drive	If you upload files (CSV datasets, image files, or anything else) during a Colab session, you will lose the files (both input and output) after the runtime session. Mounting your Google Drive allows you to access files directly from your drive and use them in your Colab session.
Splitting Data into Train and Test	This is an essential process used to estimate the performance of an algorithm (checking how it performs when used on a fresh set of data) by dividing a large dataset into two subsets.
Pie Chart	This is a circular statistical graphic which is divided into slices to illustrate numerical proportions.
Regular Expression	This is a sequence of characters that define a search pattern (it is also shortened as regex or regexp).
Word Cloud	This is a data visualization technique used for representing text data in which the size of each word indicates its frequency or importance.
Corpus	This is a collection of written or spoken material in machine-readable form, assembled to study linguistic structures, frequencies, etc.
Tokenization	This is the process of breaking down a text into words, phrases, symbols, or other meaningful elements called tokens.
Stop Words	These are words (like "a", "an", "are", "the", "in", etc.) which are filtered out before processing natural language data (text). Because ML algorithms check the frequency of words in data, they can give higher weights to stop words if they occur frequently in the document, which hurts the model's performance.
TF-IDF (Term Frequency-Inverse Document Frequency)	This is a numerical statistic that is intended to reflect how important a word is to a document in a corpus.
Stemming/Lemmatization	NLP text normalization techniques that reduce inflected or derived words (for example, "reactive", which derives from "react") to their root form (base word/meaningful word).
Label Encoding	This refers to converting the labels into numeric form (between 0 and n_classes-1) so as to convert them into machine-readable form.
Classification Report	This is a report of each class's main classification metrics (precision, recall, f1-score, and support).
Precision Score	This is the ratio of correctly predicted positive observations to the total predicted positive observations.
Recall Score	This is the ratio of correctly predicted positive observations to all observations in actual class.
F1 Score	This is the harmonic mean of the Precision and Recall scores.
Support Score	This is the number of actual class occurrences in the specified dataset.
Model Accuracy	This is the fraction of predictions our model got right.
Macro Average	This is the unweighted mean of a metric across all classes, so every class counts equally regardless of its size.
Weighted Average	This is the mean of a metric across all classes, with each class weighted by its support (the number of samples in that class).
Pickling	This is the process of converting a Python object into a byte stream to store it in a file/database, maintain program state across sessions, or transport data over the network.
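
Several of the terms above (tokenization, stop words, corpus, TF-IDF) can be seen working together in a few lines of plain Python. This is a sketch under stated assumptions: the three-sentence corpus and the five-word stop list are made up, and the formula used is the textbook one (relative term frequency times log(N / document frequency)). Scikit-Learn's `TfidfVectorizer` uses a smoothed, normalized variant, so its numbers will differ.

```python
import math
import re
from collections import Counter

# Hypothetical three-document corpus and a tiny stop-word list, for illustration only.
corpus = [
    "The cat sat on the mat",
    "The dog chased the cat",
    "A bird sang in the tree",
]
STOP_WORDS = {"a", "an", "the", "in", "on"}


def tokenize(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


documents = [tokenize(text) for text in corpus]

# Document frequency: in how many documents each term appears.
document_frequency = Counter(term for doc in documents for term in set(doc))


def tf_idf(term, doc):
    """Textbook TF-IDF: relative term frequency times log(N / document frequency)."""
    tf = doc.count(term) / len(doc)
    idf = math.log(len(documents) / document_frequency[term])
    return tf * idf


for term in ("cat", "dog"):
    print(term, round(tf_idf(term, documents[1]), 3))
# cat 0.135
# dog 0.366
```

In the second document, "dog" scores higher than "cat" because "cat" also appears in another document, which makes it less distinctive.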

Confusion Matrix

$x$ (columns) = Actual Condition
$y$ (rows) = Predicted Condition

	Actual Positive	Actual Negative
Predicted Positive	True Positive (TP)	False Positive (FP)
Predicted Negative	False Negative (FN)	True Negative (TN)

Precision = $TP / (TP + FP)$
Recall = $TP / (TP + FN)$
F1-Score = $2 \times (Precision \times Recall) / (Precision + Recall)$
Accuracy = $(TP + TN) / (TP + TN + FP + FN)$
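
The four formulas above can be worked through with concrete numbers. In the sketch below, the counts (8, 2, 4, 6) are hypothetical and chosen only to make the arithmetic easy to follow; `binary_metrics` is a helper name introduced for this example.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return precision, recall, F1, and accuracy from the four confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy


# Hypothetical counts for a two-class problem.
precision, recall, f1, accuracy = binary_metrics(tp=8, fp=2, fn=4, tn=6)
print(f"Precision: {precision:.3f}")  # 8 / 10 = 0.800
print(f"Recall:    {recall:.3f}")     # 8 / 12 = 0.667
print(f"F1-Score:  {f1:.3f}")         # 0.727
print(f"Accuracy:  {accuracy:.3f}")   # 14 / 20 = 0.700
```

The helper does not guard against division by zero, which occurs when a class is never predicted (TP + FP = 0) or never present (TP + FN = 0).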

For two (2) classes, you get a 2x2 matrix. But in the case of the model in this lesson (sixteen (16) classes), we get a 16x16 matrix.

Something like this:

 [  4   2   1   0   0   0   0   0  10  19   0   1   0   0   0   1]
 [  0  74   0   3   0   0   0   0  10  28  14   2   1   0   1   2]
 [  0   2   7   3   0   0   0   0   4  12  14   3   0   0   0   1]
 [  0   7   0  70   0   0   0   0  16  15  11  17   0   1   0   0]
 [  0   0   0   1   0   0   0   0   2   5   0   1   0   0   0   0]
 [  0   2   1   0   0   0   0   0   0   4   1   2   0   0   0   0]
 [  0   0   0   1   0   0   0   0   1   4   1   1   0   0   0   0]
 [  0   0   0   1   0   0   0   0   5   5   2   4   0   0   0   1]
 [  0   2   0   2   0   0   0   0 210  52   8  18   0   0   0   2]
 [  0   4   0   4   0   0   0   0  14 322   9  11   0   1   0   1]
 [  0   0   0   5   0   0   0   0  18  20 146  27   0   0   1   1]
 [  0   0   0   5   0   0   0   0  10  20  13 212   0   0   0   1]
 [  0   2   0   0   0   0   0   0   8  11   3   5   4   0   0   0]
 [  0   0   0   5   0   0   0   0   6  30   2   2   0   9   0   0]
 [  0   2   0   2   0   0   0   0   5  14  10   4   0   0   3   1]
 [  0   0   0   2   0   0   0   0   2  16   1  17   0   0   0  29]

Sample matrix for the sixteen (16) MBTI classes

	INFP	INFJ	INTP	INTJ	ENTP	ENFP	ISTP	ISFP	ENTJ	ISTJ	ENFJ	ISFJ	ESTP	ESFP	ESFJ	ESTJ
INFP	True	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False
INFJ	False	True	False	False	False	False	False	False	False	False	False	False	False	False	False	False
INTP	False	False	True	False	False	False	False	False	False	False	False	False	False	False	False	False
INTJ	False	False	False	True	False	False	False	False	False	False	False	False	False	False	False	False
ENTP	False	False	False	False	True	False	False	False	False	False	False	False	False	False	False	False
ENFP	False	False	False	False	False	True	False	False	False	False	False	False	False	False	False	False
ISTP	False	False	False	False	False	False	True	False	False	False	False	False	False	False	False	False
ISFP	False	False	False	False	False	False	False	True	False	False	False	False	False	False	False	False
ENTJ	False	False	False	False	False	False	False	False	True	False	False	False	False	False	False	False
ISTJ	False	False	False	False	False	False	False	False	False	True	False	False	False	False	False	False
ENFJ	False	False	False	False	False	False	False	False	False	False	True	False	False	False	False	False
ISFJ	False	False	False	False	False	False	False	False	False	False	False	True	False	False	False	False
ESTP	False	False	False	False	False	False	False	False	False	False	False	False	True	False	False	False
ESFP	False	False	False	False	False	False	False	False	False	False	False	False	False	True	False	False
ESFJ	False	False	False	False	False	False	False	False	False	False	False	False	False	False	True	False
ESTJ	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	True
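
For more than two classes, the same metrics are computed once per class from the diagonal and the row and column sums, and then averaged. The sketch below uses a hypothetical 3x3 matrix rather than the 16x16 one, and it assumes rows are actual classes and columns are predicted classes, which is the layout Scikit-Learn's `confusion_matrix` returns; transpose the matrix first if yours is laid out the other way.

```python
# Hypothetical 3x3 confusion matrix: rows are actual classes, columns are predicted.
labels = ["INFP", "INFJ", "INTP"]
matrix = [
    [5, 1, 0],
    [2, 3, 1],
    [0, 1, 7],
]

per_class = {}
for i, label in enumerate(labels):
    tp = matrix[i][i]                                # correct predictions (diagonal)
    predicted_total = sum(row[i] for row in matrix)  # column sum
    actual_total = sum(matrix[i])                    # row sum, i.e. the support
    precision = tp / predicted_total
    recall = tp / actual_total
    f1 = 2 * precision * recall / (precision + recall)
    per_class[label] = {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "support": actual_total,
    }

total = sum(m["support"] for m in per_class.values())
accuracy = sum(matrix[i][i] for i in range(len(labels))) / total

# Macro average: every class counts equally.
macro_f1 = sum(m["f1"] for m in per_class.values()) / len(per_class)
# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(m["f1"] * m["support"] for m in per_class.values()) / total

for label, m in per_class.items():
    print(f"{label}: precision={m['precision']:.2f} recall={m['recall']:.2f} "
          f"f1={m['f1']:.2f} support={m['support']}")
print(f"accuracy={accuracy:.2f} macro_f1={macro_f1:.2f} weighted_f1={weighted_f1:.2f}")
```

When classes are imbalanced, the macro and weighted averages can diverge noticeably, which is why a classification report shows both.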

Thank you for coming this far; you've done well 👏🏾. Please open a new GitHub discussion using the links below and let me know your thoughts about this lesson or any issues you're experiencing.

Share Feedback | Ask Question

<< previous lesson | next lesson >>
