# Natural Language Processing (NLP) Analysis of Student Essays

## Introduction

This project analyzes a dataset of student essays using basic Natural Language Processing (NLP) techniques. We look at word frequencies, word usage by personality type, and bi-grams. Each essay is paired with its author's Myers-Briggs Type Indicator (MBTI) personality type, stored as four separate letter columns (I/E, N/S, T/F, J/P).

Our main objectives are to:
1. Preprocess the text data
2. Analyze word frequencies and unique words
3. Explore the relationship between personality types and language use
4. Create and analyze bi-grams

Let's begin by importing the necessary libraries and loading our data.

## 1. Setup and Data Loading

The dataset is read from `Essay_data.csv`, and rows with missing values are dropped before any analysis.


In [5]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import bigrams
from collections import Counter
import string

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')

# Load the dataset
df = pd.read_csv('Essay_data.csv')

# Remove rows with missing values and reset index
df = df.dropna().reset_index(drop=True)

# Display basic information about the dataframe
print(df.info())
print("\nFirst few rows:")
print(df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 93 entries, 0 to 92
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   I/E     93 non-null     object
 1   N/S     93 non-null     object
 2   T/F     93 non-null     object
 3   J/P     93 non-null     object
 4   Essay   93 non-null     object
dtypes: object(5)
memory usage: 3.8+ KB
None

First few rows:
  I/E N/S T/F J/P                                              Essay
0   I   S   T   J  My first 4 months at the EDSA have been filled...
1   I   N   F   J  I joined the academy being at a crossroads of ...
2   E   N   F   J  so far my experience has been positive and i c...
3   I   N   F   J  I have been very fortunate to have the opportu...
4   I   N   T   J  Looking back to when one got to the academy an...


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Amoh\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Amoh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 2. Text Preprocessing Functions
Next, we define two helper functions: one strips punctuation and lowercases the text, the other tokenizes it and removes English stop words.


In [6]:
def preprocess_text(text):
    # Remove punctuation and convert to lower case
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = text.lower()
    return text

# Build the stop-word set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def tokenize_and_remove_stopwords(text):
    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove stop words
    tokens = [word for word in tokens if word not in STOP_WORDS]

    return tokens
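
To see what the two helpers do to a sentence, here is a minimal, self-contained sketch. It assumes a tiny hand-written stop-word set (`DEMO_STOP_WORDS`) and whitespace splitting in place of NLTK's stop-word list and `word_tokenize`, so its output can differ from the notebook's on real text. The `demo_*` names are illustrative and not part of the notebook.

```python
import string

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
DEMO_STOP_WORDS = {"my", "at", "the", "have", "been", "with", "a", "i"}

def demo_preprocess(text):
    """Strip ASCII punctuation and lowercase, mirroring preprocess_text."""
    return text.translate(str.maketrans("", "", string.punctuation)).lower()

def demo_tokenize(text, stop_words=DEMO_STOP_WORDS):
    """Split on whitespace (a simplification of word_tokenize) and drop stop words."""
    return [word for word in text.split() if word not in stop_words]

sample = "My first 4 months at the academy have been great; I don't regret it!"
cleaned = demo_preprocess(sample)
tokens = demo_tokenize(cleaned)

print(cleaned)
print(tokens)
```

Stripping punctuation first turns `don't` into `dont`, which a stop-word list containing `don't` no longer matches. Removing stop words before stripping punctuation is one way to avoid this, although it would change the counts reported in the later sections.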

## 3. Analyzing the First Essay
Let's start by analyzing the first essay in our dataset.

In [7]:
# Get the first essay
first_essay = df['Essay'][0]

# Preprocess and tokenize
preprocessed_essay = preprocess_text(first_essay)
tokens = tokenize_and_remove_stopwords(preprocessed_essay)

# Find the 10th character
tenth_character = preprocessed_essay[9]

print(f"The 10th character in the first essay is: {tenth_character}")
print(f"Number of tokens in the first essay: {len(tokens)}")

The 10th character in the first essay is: 4
Number of tokens in the first essay: 275


## 4. Analyzing Word Frequencies
Now, let's examine word frequencies across all essays.

In [8]:
# Process all essays
all_words = []

for essay in df['Essay']:
    preprocessed_essay = preprocess_text(essay)
    tokens = tokenize_and_remove_stopwords(preprocessed_essay)
    all_words.extend(tokens)

# Create a bag of words for all essays
bag_of_words = Counter(all_words)

# Count unique words
unique_word_count = len(set(all_words))

print(f"The number of unique words in all essays (after removing stopwords) is: {unique_word_count}")

# Calculate the share of all word occurrences (tokens) that belong to words appearing at least twice
words_at_least_twice = sum(count for word, count in bag_of_words.items() if count >= 2)
total_words = sum(bag_of_words.values())
percentage = (words_at_least_twice / total_words) * 100

print(f"Percentage of words that appear at least twice: {percentage:.2f}%")

The number of unique words in all essays (after removing stopwords) is: 3406
Percentage of words that appear at least twice: 89.88%
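
The 89.88% figure is a share of word occurrences (tokens), not of distinct words (types), because the numerator adds up the counts of every word seen at least twice. The toy example below uses a made-up token list to show how the two readings diverge. `token_share` follows the cell above, while `type_share` is an alternative added here for comparison.

```python
from collections import Counter

# Made-up tokens for illustration only; not taken from the essay dataset.
toy_tokens = ["team", "team", "team", "work", "work", "academy", "new", "great"]
bag = Counter(toy_tokens)

# Token-based share (what the cell above computes):
# occurrences belonging to words seen at least twice / all occurrences.
repeated_occurrences = sum(count for count in bag.values() if count >= 2)
token_share = repeated_occurrences / sum(bag.values()) * 100

# Type-based share: distinct words seen at least twice / all distinct words.
repeated_types = sum(1 for count in bag.values() if count >= 2)
type_share = repeated_types / len(bag) * 100

print(f"Token share: {token_share:.2f}%")  # 5 of 8 occurrences
print(f"Type share:  {type_share:.2f}%")   # 2 of 5 distinct words
```

Which of the two to report depends on the question: the token share describes how much of the running text is made of repeated vocabulary, while the type share describes how much of the vocabulary is reused at all.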


## 5. Personality Type Analysis
Let's analyze word usage for one specific personality type, ENFJ.

In [9]:
# Filter for ENFJ personalities
enfj_df = df[(df['I/E'] == 'E') & (df['N/S'] == 'N') & (df['T/F'] == 'F') & (df['J/P'] == 'J')]

# Process ENFJ essays
enfj_words = []

for essay in enfj_df['Essay']:
    preprocessed_essay = preprocess_text(essay)
    tokens = tokenize_and_remove_stopwords(preprocessed_essay)
    enfj_words.extend(tokens)

# Create a bag of words for ENFJ essays
enfj_bag_of_words = Counter(enfj_words)

# Find the most common word
most_common_word = enfj_bag_of_words.most_common(1)[0]

print(f"Number of ENFJ essays: {len(enfj_df)}")
print(f"The most commonly mentioned word by ENFJ personalities is '{most_common_word[0]}' with {most_common_word[1]} occurrences.")

print("\nTop 10 most common words for ENFJ:")
for word, count in enfj_bag_of_words.most_common(10):
    print(f"{word}: {count}")

Number of ENFJ essays: 6
The most commonly mentioned word by ENFJ personalities is 'team' with 33 occurrences.

Top 10 most common words for ENFJ:
team: 33
work: 23
working: 14
different: 13
people: 12
like: 12
making: 12
academy: 12
new: 12
great: 11
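
With only 6 ENFJ essays, raw counts largely reflect how much text the group wrote. One possible way to make groups comparable is to convert counts to a rate per 1,000 tokens and set it against the rate in the whole corpus. The sketch below uses invented counters (`group_bag`, `corpus_bag`) and a hypothetical helper `rate_per_thousand`; none of these are part of the original notebook.

```python
from collections import Counter

def rate_per_thousand(bag, word):
    """Occurrences of `word` per 1,000 tokens in a Counter of token counts."""
    total = sum(bag.values())
    return 1000 * bag[word] / total if total else 0.0

# Invented counts for illustration; real values would come from
# enfj_bag_of_words and bag_of_words in the notebook.
group_bag = Counter({"team": 30, "work": 20, "other": 950})
corpus_bag = Counter({"team": 100, "work": 200, "other": 9700})

for word in ("team", "work"):
    group_rate = rate_per_thousand(group_bag, word)
    corpus_rate = rate_per_thousand(corpus_bag, word)
    print(f"{word}: group {group_rate:.1f} vs corpus {corpus_rate:.1f} per 1,000 tokens")
```

A word whose group rate is well above its corpus rate is a stronger candidate for a type-specific term than one that is simply frequent everywhere.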


## 6. Bi-gram Analysis
Finally, let's create bi-grams for each essay and inspect those of the 70th essay.

In [10]:
def get_bigrams(tokens):
    return list(bigrams(tokens))

# Create a new column with bi-grams
df['bigrams'] = df['Essay'].apply(lambda x: get_bigrams(tokenize_and_remove_stopwords(preprocess_text(x))))

# Get the 70th essay (index 69 since Python uses 0-based indexing)
seventieth_essay_bigrams = df.loc[69, 'bigrams']

# Check if there are at least 109 bi-grams
if len(seventieth_essay_bigrams) >= 109:
    print(f"The 109th bi-gram in the 70th essay is: {seventieth_essay_bigrams[108]}")
else:
    print(f"The 70th essay doesn't have 109 bi-grams. It only has {len(seventieth_essay_bigrams)} bi-grams.")

print("\nFirst few bi-grams of the 70th essay:")
print(seventieth_essay_bigrams[:10])

The 109th bi-gram in the 70th essay is: ('work', 'quite')

First few bi-grams of the 70th essay:
[('academy', 'taught'), ('taught', 'lot'), ('lot', 'honestly'), ('honestly', 'starting'), ('starting', 'class'), ('class', 'hundred'), ('hundred', 'people'), ('people', 'experience'), ('experience', 'knowledge'), ('knowledge', 'wisdom')]
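
For reference, a bi-gram list is each token paired with its successor, so the pairs produced by `nltk.bigrams(tokens)` can be reproduced with `zip`. The sketch below applies that to a made-up token list and counts the pairs with `Counter`. The helper name `pair_up` is illustrative.

```python
from collections import Counter

def pair_up(tokens):
    """Return consecutive token pairs, the same pairs nltk.bigrams yields."""
    return list(zip(tokens, tokens[1:]))

# Made-up tokens (stop words already removed), for illustration only.
toy_tokens = ["academy", "taught", "lot", "academy", "taught", "teamwork"]
pairs = pair_up(toy_tokens)
pair_counts = Counter(pairs)

print(pairs)
print(pair_counts.most_common(1))
```

Because stop words are removed before pairing, some bi-grams join words that were not adjacent in the original sentence. For example, `('taught', 'lot')` most likely comes from a phrase such as "taught me a lot".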


## Conclusion
In this project, we ran an exploratory NLP analysis of student essays, covering:

- Basic text preprocessing and tokenization
- Word frequency analysis across all essays
- Unique word count and repetition patterns
- Personality type-specific word usage (focusing on ENFJ)
- Bi-gram creation and analysis

These analyses describe the language students use in their essays and give a first look at how it may vary with personality type. With only 6 ENFJ essays out of 93, the type-specific results are descriptive rather than conclusive. Future work could apply sentiment analysis, topic modeling, or machine learning models that predict personality type from essay content.

The project shows how a few standard NLP steps can extract structured information from unstructured text, which may be useful for further research in education, psychology, and linguistics.