In [1]:
# Basic libraries
import nltk
import pandas as pd

# NLTK utils
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Bag of words
from sklearn.feature_extraction.text import CountVectorizer

# Classification stuff
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder

In [2]:
# Function originally from: https://www.programcreek.com/python/?CodeExample=get%20wordnet%20pos
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

Now let's load and look at our data:

In [3]:
df = pd.read_csv('./data/myers_briggs_comments.tsv', sep='\t')
df

Unnamed: 0,comment_id,personality type,source url,comment,is_reply,parent_comment_id
0,77016,INFP-A,https://www.16personalities.com/infp-strengths...,"Hello friends infp, I identify a lot with all ...",False,
1,77025,ISTP-T,https://www.16personalities.com/infp-strengths...,"I can't believe how accurate this was, it's so...",False,
2,77700,INFP-T,https://www.16personalities.com/infp-strengths...,We matter. I am Grace too. It is so refreshin...,True,77073.0
3,77073,INFP-T,https://www.16personalities.com/infp-strengths...,Finally I know for sure that I am not a weirdo...,False,
4,77151,INFP-T,https://www.16personalities.com/infp-strengths...,I finally feel understood. I always give and g...,False,
...,...,...,...,...,...,...
41695,119246,ENTP-T,https://www.16personalities.com/entp-personality,I'm such a debater I had to debate before deci...,False,
41696,119900,ENTP-A,https://www.16personalities.com/entp-personality,Accurate,False,
41697,121460,ENTP-A,https://www.16personalities.com/entp-personality,Debatable,True,119995.0
41698,120098,ENTP-T,https://www.16personalities.com/entp-personality,relatable,True,119995.0


We can delete the columns `comment_id` and `parent_comment_id` (we aren't going to use the columns `source url` and `is_reply`, but they may come in handy in the bonus tasks later):

In [4]:
df = df.drop('comment_id', axis=1)
df = df.drop('parent_comment_id', axis=1)
df

Unnamed: 0,personality type,source url,comment,is_reply
0,INFP-A,https://www.16personalities.com/infp-strengths...,"Hello friends infp, I identify a lot with all ...",False
1,ISTP-T,https://www.16personalities.com/infp-strengths...,"I can't believe how accurate this was, it's so...",False
2,INFP-T,https://www.16personalities.com/infp-strengths...,We matter. I am Grace too. It is so refreshin...,True
3,INFP-T,https://www.16personalities.com/infp-strengths...,Finally I know for sure that I am not a weirdo...,False
4,INFP-T,https://www.16personalities.com/infp-strengths...,I finally feel understood. I always give and g...,False
...,...,...,...,...
41695,ENTP-T,https://www.16personalities.com/entp-personality,I'm such a debater I had to debate before deci...,False
41696,ENTP-A,https://www.16personalities.com/entp-personality,Accurate,False
41697,ENTP-A,https://www.16personalities.com/entp-personality,Debatable,True
41698,ENTP-T,https://www.16personalities.com/entp-personality,relatable,True


#### Lemmatizer

Now let's run our lemmatizer on the comments:

In [5]:
lemmatizer = WordNetLemmatizer()
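for index, row in df.iterrows():
    # Cast to str in case a comment is not a string (e.g. NaN)
    comment = str(row['comment'])
    # Lemmatize each whitespace-separated token using its part-of-speech tag
    lemmatized_comment = " ".join([lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in comment.split()])
    df.loc[index, 'comment'] = lemmatized_comment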
df

Unnamed: 0,personality type,source url,comment,is_reply
0,INFP-A,https://www.16personalities.com/infp-strengths...,"Hello friend infp, I identify a lot with all t...",False
1,ISTP-T,https://www.16personalities.com/infp-strengths...,"I can't believe how accurate this was, it's so...",False
2,INFP-T,https://www.16personalities.com/infp-strengths...,We matter. I be Grace too. It be so refresh kn...,True
3,INFP-T,https://www.16personalities.com/infp-strengths...,Finally I know for sure that I be not a weirdo...,False
4,INFP-T,https://www.16personalities.com/infp-strengths...,I finally feel understood. I always give and g...,False
...,...,...,...,...
41695,ENTP-T,https://www.16personalities.com/entp-personality,I'm such a debater I have to debate before dec...,False
41696,ENTP-A,https://www.16personalities.com/entp-personality,Accurate,False
41697,ENTP-A,https://www.16personalities.com/entp-personality,Debatable,True
41698,ENTP-T,https://www.16personalities.com/entp-personality,relatable,True


### Checking all 32 personality types

In [6]:
print(df["personality type"].value_counts())

personality type
INFP-T    7463
INFJ-T    4392
INTP-T    4367
INTJ-T    3129
INTJ-A    2675
ENFP-T    2436
INFJ-A    2113
INTP-A    1632
INFP-A    1132
ENFJ-A    1118
ENTP-T    1027
ENTP-A    1022
ISTP-T    1011
ENFP-A     946
ISFP-T     893
ENFJ-T     865
ENTJ-A     724
ISFJ-T     653
ISTP-A     610
ENTJ-T     421
ISTJ-T     360
ISTJ-A     341
ESTP-A     332
ISFJ-A     302
ESFP-T     289
ESFJ-A     258
ESFJ-T     252
ISFP-A     238
ESTP-T     236
ESFP-A     230
ESTJ-A     161
ESTJ-T      72
Name: count, dtype: int64


In [None]:
print(df["personality type"].unique())

['INFP-A' 'ISTP-T' 'INFP-T' 'INFJ-T' 'ENFP-T' 'ISFJ-T' 'ENTP-T' 'INTP-T'
 'INTJ-A' 'INTP-A' 'INFJ-A' 'ISFP-T' 'INTJ-T' 'ENFP-A' 'ISFP-A' 'ENFJ-T'
 'ENTP-A' 'ISTJ-A' 'ESFP-T' 'ISTP-A' 'ESFP-A' 'ESFJ-T' 'ESTP-A' 'ESTP-T'
 'ENTJ-A' 'ENTJ-T' 'ESFJ-A' 'ISTJ-T' 'ESTJ-A' 'ESTJ-T' 'ISFJ-A' 'ENFJ-A']


In [22]:
personality_types = df["personality type"].unique()

# Use `ptype` rather than `type` so the Python builtin is not shadowed
for i, ptype in enumerate(personality_types, start=1):
    print(f"{i}. {ptype}")

1. INFP-A
2. ISTP-T
3. INFP-T
4. INFJ-T
5. ENFP-T
6. ISFJ-T
7. ENTP-T
8. INTP-T
9. INTJ-A
10. INTP-A
11. INFJ-A
12. ISFP-T
13. INTJ-T
14. ENFP-A
15. ISFP-A
16. ENFJ-T
17. ENTP-A
18. ISTJ-A
19. ESFP-T
20. ISTP-A
21. ESFP-A
22. ESFJ-T
23. ESTP-A
24. ESTP-T
25. ENTJ-A
26. ENTJ-T
27. ESFJ-A
28. ISTJ-T
29. ESTJ-A
30. ESTJ-T
31. ISFJ-A
32. ENFJ-A


Just some notes here:

To classify all 32 personality types I have to keep each code exactly as it appears in the dataset, so the string must not be reduced to its first letter.

#### Extract classification categories


The simplest version of this task doesn't try to classify all 32 personality types; it just looks at the first category (Extraversion vs Introversion). As our personality type data is structured using these handy codes, all that takes is extracting the first character from the string (using `df['personality type'].str.strip().str[0]`). The cell below instead passes the full code to the encoder, so the classifier is trained on all 32 categories. We will come back to this code later to try other ways of dividing our dataset.
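As a rough sketch of how the codes can be sliced into different targets, the example below uses a small made-up DataFrame. The names `toy`, `ei`, `mbti16` and `full` are illustrative only and are not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the real dataframe (illustrative values only)
toy = pd.DataFrame({"personality type": ["INFP-A", "ISTP-T", " ENTP-T"]})

# Remove stray whitespace before slicing the codes
codes = toy["personality type"].str.strip()

# First letter: Extraversion (E) vs Introversion (I)
toy["ei"] = codes.str[0]

# First four letters: the 16 original Myers-Briggs types
toy["mbti16"] = codes.str[:4]

# Full code: all 32 categories, including the -A / -T suffix
toy["full"] = codes

print(toy[["ei", "mbti16", "full"]])
```

Whichever of these columns is passed to the label encoder decides how many classes the classifier has to separate.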

The class `LabelEncoder` is a handy tool to then convert whatever classes we have into integer numbers starting from zero (which we need for our classifier).

In [6]:
le = LabelEncoder()
df['class_label'] = le.fit_transform(df['personality type'])
df

Unnamed: 0,personality type,source url,comment,is_reply,class_label
0,INFP-A,https://www.16personalities.com/infp-strengths...,"Hello friend infp, I identify a lot with all t...",False,18
1,ISTP-T,https://www.16personalities.com/infp-strengths...,"I can't believe how accurate this was, it's so...",False,31
2,INFP-T,https://www.16personalities.com/infp-strengths...,We matter. I be Grace too. It be so refresh kn...,True,19
3,INFP-T,https://www.16personalities.com/infp-strengths...,Finally I know for sure that I be not a weirdo...,False,19
4,INFP-T,https://www.16personalities.com/infp-strengths...,I finally feel understood. I always give and g...,False,19
...,...,...,...,...,...
41695,ENTP-T,https://www.16personalities.com/entp-personality,I'm such a debater I have to debate before dec...,False,7
41696,ENTP-A,https://www.16personalities.com/entp-personality,Accurate,False,6
41697,ENTP-A,https://www.16personalities.com/entp-personality,Debatable,True,6
41698,ENTP-T,https://www.16personalities.com/entp-personality,relatable,True,7
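
To see how the encoded integers relate to the original codes, here is a minimal sketch on a made-up list of labels (`toy_labels` and `toy_le` are illustrative names, not variables from the notebook). The encoder stores the classes in sorted order and each integer is an index into that list.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up labels, just to show the mapping
toy_labels = ["INFP-T", "ENTP-A", "INFP-T", "ENFJ-A"]

toy_le = LabelEncoder()
encoded = toy_le.fit_transform(toy_labels)

print(list(toy_le.classes_))   # classes are stored in sorted order
print(encoded.tolist())        # integer codes index into classes_
print(toy_le.inverse_transform(encoded).tolist())  # back to the original codes
```

This is why `ENTP-A` shows up as `6` in the table above: it is the seventh code in alphabetical order, counting from zero.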


Now let's extract our comments, class labels and the associated names of our classes into Python lists:

In [7]:
comments = df["comment"].values.tolist()
class_labels = df["class_label"].values.tolist()
class_names = list(le.classes_)

#### Bag of words features

Let's fit our bag of words to our entire dataset first so that our bag of words feature vectors are the same length in both our test and train sets:

In [8]:
vectorizer = CountVectorizer(stop_words=stopwords.words('english'), ngram_range=(1,1))
bag_of_words = vectorizer.fit_transform(comments)
vocab = vectorizer.get_feature_names_out()
print(f'Our bag of words for the whole dataset is a matrix of the shape and size {bag_of_words.shape}')

Our bag of words for the whole dataset is a matrix of the shape and size (41700, 23173)


##### Split dataset

Now lets split our dataset into **train** and **test** sets. The training set will be used to optimise our classifier on the data. The test set is used to evaluate our classifier after training. 

Here `X_train` and `X_test` are our comments. `y_train` and `y_test` are our class labels corresponding to each comment. Our classifier will take the bag of words representations of our comments data as input and try to give the most accurate predictions of classes. 

It is very important that we **never evaluate a classifier on our training data**, and that **we never train on our test data**. When we do training we repeatedly optimise on that data. Therefore the accuracy in training won't give us an accurate idea of how well our classifier is performing. We can only get a realistic idea of accuracy on **unseen data**.

In [9]:
X_train, X_test, y_train, y_test = train_test_split(comments, class_labels, test_size=0.3, random_state=42)
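
The split above is purely random. With classes as imbalanced as ours, one option (not used in this notebook) is to pass `stratify=` so that each class keeps roughly the same proportion in both sets. A minimal sketch on made-up data, where `toy_comments` and `toy_labels` are hypothetical names:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up imbalanced data: 16 examples of class 0, 4 examples of class 1
toy_comments = [f"comment {i}" for i in range(20)]
toy_labels = [0] * 16 + [1] * 4

# stratify= keeps the 80/20 class ratio in both the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_comments, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

print(Counter(y_tr), Counter(y_te))
```

Whether this changes the scores on the real dataset is something to check by experiment rather than assume.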

Let's redo the bag of words on the test and train sets with `.transform()` instead of `.fit_transform()`, so that both reuse the vocabulary we already fitted on the whole dataset:

In [10]:
X_train_bow = vectorizer.transform(X_train)
X_train_bow

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 379981 stored elements and shape (29190, 23173)>

In [11]:
X_test_bow = vectorizer.transform(X_test)
X_test_bow

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 164211 stored elements and shape (12510, 23173)>

#### Train classifier

Now let's define what kind of classifier we are using and train it on our training data. We will give it our bag of words matrix for our entire training set, and our list of class labels that corresponds to each row in the matrix. The classifier implementation from scikit-learn will take care of the rest for us. 

In [13]:
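classifier = MultinomialNB()
# MultinomialNB accepts sparse matrices directly, so there is no need to
# build a dense array with .toarray() (which can use a lot of memory)
classifier.fit(X_train_bow, y_train)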

#### Test classifier

Now that we have trained our classifier we can test it. We will get the classifier to make predictions on our test dataset. We will then calculate our accuracy scores by comparing our predictions `y_pred` to our true class labels `y_test`. 

scikit-learn gives us a nice classification report, breaking it down into three scores: **precision**, **recall** and **f1-score**. **Precision** is **True Positives / (True Positives + False Positives)** (how many retrieved elements are relevant). **Recall** is **True Positives / (True Positives + False Negatives)** (how many relevant items are retrieved). The **F1-Score** is an average (the harmonic mean) of these two scores.


There is no perfect way to measure accuracy. In some cases, you will be happy with a high recall and low precision if you want to find all possible results and can use a human expert to check each result (e.g. if you were looking for possible cases of cancer). In other cases you may want high precision but are less bothered about having a high recall (e.g. if you were deciding which one of many possible stocks to buy to make a profit).

Another analogy would be if you were fishing, recall is **how big your net is** and precision is **how effective your net is at catching fish (and not other things in the sea)**.
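
To make the three scores concrete, here is a small worked sketch in plain Python. The helper `precision_recall_f1` and the counts are made up for illustration; they are not taken from the results below.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one class from raw counts."""
    # Report 0.0 when a ratio is undefined (no predictions or no true cases)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up counts: 8 correct hits, 2 false alarms, 24 misses
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=24)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Here the net is accurate (most of what it catches is fish) but small (it misses three quarters of the fish), and the harmonic mean pulls the F1 score towards the weaker of the two numbers.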

In [14]:
# The classifier can predict on the sparse matrix directly
y_pred = classifier.predict(X_test_bow)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred, target_names=class_names)

print(f"Accuracy: {accuracy}")
print("Classification Report:")
print(report)

Accuracy: 0.2147082334132694
Classification Report:
              precision    recall  f1-score   support

      ENFJ-A       0.00      0.00      0.00       336
      ENFJ-T       0.50      0.00      0.01       290
      ENFP-A       0.00      0.00      0.00       292
      ENFP-T       0.32      0.06      0.10       730
      ENTJ-A       0.20      0.00      0.01       216
      ENTJ-T       0.00      0.00      0.00       130
      ENTP-A       0.00      0.00      0.00       313
      ENTP-T       0.50      0.02      0.05       293
      ESFJ-A       0.00      0.00      0.00        73
      ESFJ-T       0.00      0.00      0.00        78
      ESFP-A       0.00      0.00      0.00        68
      ESFP-T       0.00      0.00      0.00       101
      ESTJ-A       0.00      0.00      0.00        36
      ESTJ-T       0.00      0.00      0.00        20
      ESTP-A       0.00      0.00      0.00        98
      ESTP-T       0.00      0.00      0.00        76
      INFJ-A       0.30      



### Task 2:

In [12]:
from sklearn.linear_model import LogisticRegression

reg_classifier = LogisticRegression(max_iter=1000)
reg_classifier.fit(X_train_bow, y_train) 


In [13]:
# Predict on the sparse matrix, matching how the classifier was fitted above
y_pred_reg = reg_classifier.predict(X_test_bow)
reg_accuracy = accuracy_score(y_test, y_pred_reg)
reg_report = classification_report(y_test, y_pred_reg, target_names=class_names)

print(f"Accuracy: {reg_accuracy}")
print("Classification Report:")
print(reg_report)

Accuracy: 0.2081534772182254
Classification Report:
              precision    recall  f1-score   support

      ENFJ-A       0.14      0.08      0.10       336
      ENFJ-T       0.09      0.02      0.04       290
      ENFP-A       0.14      0.03      0.05       292
      ENFP-T       0.20      0.13      0.16       730
      ENTJ-A       0.16      0.06      0.08       216
      ENTJ-T       0.16      0.04      0.06       130
      ENTP-A       0.13      0.05      0.07       313
      ENTP-T       0.26      0.10      0.15       293
      ESFJ-A       0.00      0.00      0.00        73
      ESFJ-T       0.00      0.00      0.00        78
      ESFP-A       0.27      0.04      0.08        68
      ESFP-T       0.25      0.02      0.04       101
      ESTJ-A       0.33      0.06      0.10        36
      ESTJ-T       0.00      0.00      0.00        20
      ESTP-A       0.12      0.02      0.03        98
      ESTP-T       0.33      0.01      0.03        76
      INFJ-A       0.20      



## Tasks


**Task 1** Try splitting this dataset using some of the other personality distinctions. You can do this by modifying [the cell where we extract the categories we are using for classification](#extract-classification-categories). Try some of the other individual binary distinctions, then see if you can train a classifier on all 16 of the original Myers-Briggs personality types. Can you then do all 32 different categories available to us in the dataset?

**Task 2** You may have noticed that we often get poor performance on one or more of the classes when there is a large imbalance between the number of examples for each class (listed as `support` in our classification report). Try [changing the type of classifier](#train-classifier) used to another one of the [many available classifiers](https://scikit-learn.org/stable/supervised_learning.html) in scikit-learn.

**Task 3** Can you change the code to use TF-IDF features instead of Bag of Words for classification?

**Task 4** Discuss with someone on your table:
- What are the potential uses of a text classifier trained on personality characteristics?
- What are the ethical concerns of using this dataset?
- What are the potential misuses of this dataset? 
- What are the biases present in this dataset?

**Task 5 (optional)** Run this notebook and the classification with LDA features. Which one works better? 


### Bonus tasks

**Task A** Can you filter the dataset in some way? For instance you could filter out comments that are replies (using `is_reply` in the dataset) or filter out comments that are below (or above) a certain length. The `source url` column may also be something that you use to filter out particular comments. 

**Task B** Does using a stemmer instead of a lemmatizer affect the classification scores? What happens if you don't do any pre-processing to the text?

**Task C** Can you add any stop words that are specific to this dataset? Does that improve classification results?

**Task D** Can you save the results from classification (and any other important meta-data) to a log file? This can just be an append-only text file that you log the results of each experiment to, to make comparisons with later. 

**Task E** If you are doing lots of experiments using the same preprocessing of the text (stemming / lemmatisation), can you perform this once and save that dataset to a separate `.tsv` file? You would then only have to pre-process once, and could load the result directly into your code each time you run a new experiment.

**Task F** Look for other classification datasets on [kaggle](). Can you adapt this notebook to work with a different classification dataset? You may want to make a copy of this notebook before adapting it to a new dataset. 