I. How to Install?
Clone the project with 'git clone' and open it in your coding environment (for example, VSCode or Google Colab). Before running the code, install the required libraries with either pip or conda, including scipy, wordcloud, nltk, seaborn, and scikit-learn.
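Before opening the notebooks, it can help to confirm that these libraries are importable. The following is a minimal sketch and not part of the project: the helper name missing_packages and the REQUIRED mapping are assumptions made for illustration, and the mapping covers only the libraries named above (note that scikit-learn is imported as sklearn).

```python
from importlib.util import find_spec

# Install name -> import name. Covers only the libraries named in this section;
# the mapping is an illustrative assumption, not a file shipped with the project.
REQUIRED = {
    "scipy": "scipy",
    "wordcloud": "wordcloud",
    "nltk": "nltk",
    "seaborn": "seaborn",
    "scikit-learn": "sklearn",
}


def missing_packages(required=REQUIRED):
    """Return the install names whose import module cannot be found."""
    return [pkg for pkg, module in required.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Install these first (pip or conda):", " ".join(missing))
    else:
        print("All listed libraries are available.")
```

Anything the script reports as missing can then be installed with pip or conda under its install name.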
II. How to Run?
- Download the datasets from the folder linked here: https://lsu.box.com/s/n58ia30eouwszswydrxkn6zejdd7co6y
mbti_cleaned.csv (used in 3730TorchClassifierBinary.ipynb and smaller_ds_ml.ipynb)
MBTI500.csv (used in Data500_prediction.ipynb)
- In the Jupyter Notebook, point each dataset path to the file's location on your local machine. The datasets are not bundled with the code folder, so check that every path in the notebook matches where you saved the file. The additional cleaned datasets are described in the Folder Breakdown section below.
- Run the prediction last. First run the cell that defines the function preprocessed_text, then assign the sentence you want to classify to the variable trial_sentence and run all the cells below it. The output is the set of predicted MBTI letters.
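The prediction step above can be pictured roughly as follows. This is a sketch under stated assumptions, not the notebook's code: clean_sentence stands in for the notebook's preprocessed_text (whose exact behavior is not shown here), the tiny stopword set replaces the nltk stopword list, and combine_letters assumes one predicted letter per MBTI dimension.

```python
import re

# Tiny stand-in for nltk's English stopword list (assumption for illustration).
STOPWORDS = {"a", "an", "and", "i", "is", "the", "to", "with"}

# The two letters allowed in each position of an MBTI type.
DIMENSIONS = ("EI", "NS", "FT", "JP")


def clean_sentence(text):
    """Lowercase, drop URLs and non-letter characters, then remove stopwords."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    text = re.sub(r"[^a-z\s]", " ", text)
    return " ".join(word for word in text.split() if word not in STOPWORDS)


def combine_letters(letters):
    """Join one predicted letter per dimension into a four-letter type."""
    if len(letters) != len(DIMENSIONS):
        raise ValueError("expected one letter per dimension")
    for letter, allowed in zip(letters, DIMENSIONS):
        if len(letter) != 1 or letter not in allowed:
            raise ValueError(f"{letter!r} is not one of {allowed}")
    return "".join(letters)


trial_sentence = "I love quiet evenings with a book! https://example.com"
cleaned = clean_sentence(trial_sentence)
print(cleaned)

# Hypothetical per-dimension predictions; in the notebooks these would come
# from the trained classifiers rather than being typed in by hand.
print(combine_letters(["I", "N", "F", "P"]))
```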
File Breakdown (images.docx):
wordcloud - creates a word cloud for all the words
wordcloud_removed - creates a word cloud with certain common words removed due to redundancy (removed words: think, like, one, people, know)
topTen_bar - top ten words in every MBTI type
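As a rough illustration of what these word-frequency images summarize, the sketch below counts the most common words in a few made-up posts while skipping the removed words listed above. The function name top_words and the sample posts are assumptions for illustration; the real plots rely on the visualization packages listed under Packages Used.

```python
from collections import Counter

# Words excluded from the second word cloud, as listed in this README.
REMOVED_WORDS = {"think", "like", "one", "people", "know"}


def top_words(posts, n=10, removed=REMOVED_WORDS):
    """Return the n most frequent words across posts, skipping removed words."""
    counts = Counter(
        word
        for post in posts
        for word in post.lower().split()
        if word not in removed
    )
    return counts.most_common(n)


# Made-up posts standing in for one MBTI type's text.
sample_posts = [
    "people like music",
    "music and art",
    "i think art music matter",
]
print(top_words(sample_posts, n=2))
```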
Folder Breakdown:
datasets_types - cleaned mbti_cleaned.csv split into the corresponding MBTI types
datasets_letters - cleaned mbti_cleaned.csv split into the corresponding MBTI dimensions (E, I, N, S, F, T, P, J)
big_datasets_types - cleaned MBTI500.csv split into the corresponding MBTI types
big_datasets_letters - MBTI500.csv split into the corresponding MBTI dimensions (E, I, N, S, F, T, P, J)
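The folder layout above could be reproduced along these lines. This is a hedged sketch rather than the project's own script: the column names "type" and "posts" and both helper names are assumptions, and it works on in-memory rows instead of the real CSV files.

```python
from collections import defaultdict


def split_by_type(rows, type_column="type"):
    """Group rows by their full four-letter MBTI type."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[type_column]].append(row)
    return dict(groups)


def split_by_letter(rows, type_column="type"):
    """Group rows by each letter of their type, so a row lands in four groups."""
    groups = defaultdict(list)
    for row in rows:
        for letter in row[type_column]:
            groups[letter].append(row)
    return dict(groups)


# Made-up rows; the real data would be read from the downloaded CSV files.
rows = [
    {"type": "INFP", "posts": "sample text one"},
    {"type": "ENTJ", "posts": "sample text two"},
    {"type": "INFP", "posts": "sample text three"},
]

by_type = split_by_type(rows)
by_letter = split_by_letter(rows)
print(sorted(by_type))
print({letter: len(group) for letter, group in sorted(by_letter.items())})
```

Each group could then be written to its own file, one per type or per letter.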
Packages Used:
- For Data Cleaning and Analyzing
pandas as pd
re
nltk: used for stopword removal and lemmatization
nltk.corpus import stopwords
nltk.stem import WordNetLemmatizer
numpy as np
- For feature extraction
sklearn.pipeline import Pipeline
sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
- For Machine Learning Algorithms
sklearn.model_selection import train_test_split, GridSearchCV
sklearn.naive_bayes import MultinomialNB
sklearn import metrics
sklearn.linear_model import LogisticRegression
sklearn.metrics import accuracy_score, classification_report, confusion_matrix, r2_score, mean_squared_error
torch
math
torchtext.data.utils import get_tokenizer
torchtext.vocab import build_vocab_from_iterator
torch.utils.data import DataLoader
torch import nn
time
torch.utils.data.dataset import random_split
torchtext.data.functional import to_map_style_dataset
- For Data Visualization
wordcloud
matplotlib.pyplot as plt
seaborn as sns
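
To show how the feature-extraction and classifier imports listed above typically fit together, here is a minimal sketch. It is not the notebooks' code: the toy sentences, the I/E labels, and the step names are invented for illustration, and the real notebooks may tune parameters with GridSearchCV or use LogisticRegression instead.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data for a single dimension (I vs. E); purely illustrative.
texts = [
    "quiet night reading alone at home",
    "reading books alone in my room",
    "big party with many friends tonight",
    "talking with friends at a loud party",
]
labels = ["I", "I", "E", "E"]

# Word counts -> TF-IDF weights -> Naive Bayes classifier.
model = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])
model.fit(texts, labels)

prediction = model.predict(["a quiet night reading alone"])
print(prediction[0])
```

Chaining the steps in a Pipeline keeps the vectorizer and classifier fitted together, so new sentences pass through the same transformations as the training text.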