# Extrovert-Introvert Profiler
## Exploratory Data Analysis (EDA)

**Project Goal:** Find out which behaviors and social habits help identify whether someone is an Extrovert or an Introvert.

---

**Contents:**
- Part A: Study the Data
- Part B: Data Cleaning
- Part C: Data Visualization

---
# PART A: STUDY THE DATA
---

## Step 1: Import Libraries

In [None]:
# pandas: Used for data manipulation and analysis (DataFrames)
import pandas as pd

# numpy: Used for numerical operations on arrays
import numpy as np

# matplotlib: Used for creating static visualizations
import matplotlib.pyplot as plt

# seaborn: Built on matplotlib, provides beautiful statistical plots
import seaborn as sns

# Suppress warning messages for cleaner output
import warnings
warnings.filterwarnings('ignore')

# Set default figure size for all plots (width=10, height=5 inches)
plt.rcParams['figure.figsize'] = (10, 5)

# Set seaborn style to whitegrid for better readability
sns.set_style('whitegrid')

print('Libraries loaded!')

## Step 2: Load the Data

In [None]:
# pd.read_csv(): Reads a CSV file and loads it into a DataFrame
# DataFrame is like an Excel spreadsheet - rows and columns of data
data = pd.read_csv('train.csv')

# len(data): Returns the number of rows in the DataFrame
# len(data.columns): Returns the number of columns
print('Dataset has', len(data), 'rows and', len(data.columns), 'columns')

## Step 3: View the Data

In [None]:
# data.head(): Shows the first 5 rows of the DataFrame
# Useful to quickly see what the data looks like
print('First 5 rows:')
data.head()

In [None]:
# data.tail(): Shows the last 5 rows of the DataFrame
# Helps verify data loaded completely
print('Last 5 rows:')
data.tail()

In [None]:
# data.columns.tolist(): Returns all column names as a list
print('Column names:')
print(data.columns.tolist())

# data.shape: Returns (rows, columns) as a tuple
# shape[0] = number of rows, shape[1] = number of columns
print('\nRows:', data.shape[0])
print('Columns:', data.shape[1])

## Step 4: Data Types

In [None]:
# data.dtypes: Shows the data type of each column
# int64 = integer, float64 = decimal, object = text/string
print('Data types:')
print(data.dtypes)

# data.info(): Shows detailed info - column names, non-null counts, data types, memory usage
# Non-null count helps identify missing values
print('\nData Info:')
data.info()
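
The data types shown above can also be used to split columns into numerical and non-numerical groups instead of typing the names by hand. The snippet below is a minimal sketch on a made-up frame (`demo` and its values are assumptions, only the column names echo the dataset). Note that a numeric `id` column lands in the numerical group and would need to be excluded separately.

```python
import pandas as pd

# Made-up rows for illustration; not taken from train.csv
demo = pd.DataFrame({
    'id': [0, 1, 2],
    'Time_spent_Alone': [4.0, 1.0, 9.0],
    'Stage_fear': ['Yes', 'No', 'Yes'],
    'Personality': ['Introvert', 'Extrovert', 'Introvert'],
})

# select_dtypes(include='number'): keeps int and float columns
numeric_cols = demo.select_dtypes(include='number').columns.tolist()

# select_dtypes(exclude='number'): keeps everything else (text columns here)
other_cols = demo.select_dtypes(exclude='number').columns.tolist()

print('Numeric:', numeric_cols)
print('Other:', other_cols)
```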

## Step 5: Basic Statistics

In [None]:
# data.describe(): Shows statistical summary for numerical columns
# Includes: count, mean, std, min, 25%, 50% (median), 75%, max
# Helps understand the distribution and range of values
print('Statistics:')
data.describe()

In [None]:
# describe(include='all'): Shows statistics for ALL columns including categorical
# For categorical: shows count, unique values, top (most frequent), freq (frequency of top)
print('All columns statistics:')
data.describe(include='all')

## Step 6: Missing Values

In [None]:
# data.isnull(): Returns True/False for each cell (True if missing)
# .sum(): Counts True values (missing values) for each column
print('Missing values:')
print(data.isnull().sum())

# Calculate missing value percentage for each column
# Formula: (missing count / total rows) * 100
print('\nMissing values percentage:')
for col in data.columns:
    missing = data[col].isnull().sum()
    pct = (missing / len(data)) * 100
    print(f'{col}: {missing} ({round(pct, 2)}%)')
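
The same percentages can be computed without a loop, because the mean of a True/False column is the share of True values. A small sketch on made-up data (the `demo` frame is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Made-up rows; the column names mirror the dataset, the values do not
demo = pd.DataFrame({
    'Time_spent_Alone': [4.0, np.nan, 2.0, 9.0],
    'Stage_fear': ['Yes', 'No', None, None],
    'Personality': ['Introvert', 'Extrovert', 'Extrovert', 'Introvert'],
})

# isnull() marks missing cells as True; mean() of booleans = fraction missing
missing_pct = (demo.isnull().mean() * 100).round(2)
print(missing_pct)
```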

In [None]:
# Visualize missing values as a bar chart
# Filter to only show columns with missing values (> 0)
missing = data.isnull().sum()
missing = missing[missing > 0]

# plt.bar(): Creates a bar chart
# missing.index = column names, missing.values = counts
plt.bar(missing.index, missing.values, color='coral')
plt.title('Missing Values')
plt.xlabel('Column')
plt.ylabel('Count')
plt.xticks(rotation=45)  # Rotate x-axis labels for readability
plt.tight_layout()  # Adjust spacing to prevent label cutoff
plt.show()

## Step 7: Duplicate Rows

In [None]:
# data.duplicated(): Returns True for rows that are duplicates of earlier rows
# .sum(): Counts the number of duplicate rows
# Duplicates can skew analysis and should be removed
dup_count = data.duplicated().sum()
print('Duplicate rows:', dup_count)

## Step 8: Check Categorical Values (NEW)

In [None]:
# Check for invalid/inconsistent values in categorical columns
# value_counts(): Counts frequency of each unique value
# dropna=False: Also shows count of NaN (missing) values
# This helps detect typos like 'yes' vs 'Yes' or unexpected values
print('Checking categorical values for inconsistencies:')
print('\nStage_fear:')
print(data['Stage_fear'].value_counts(dropna=False))
print('\nDrained_after_socializing:')
print(data['Drained_after_socializing'].value_counts(dropna=False))
print('\nPersonality:')
print(data['Personality'].value_counts(dropna=False))
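
If the counts above ever reveal variants such as 'yes' or ' No', one possible fix is to trim whitespace and normalize the capitalization. This is a sketch on made-up values; the notebook does not say that such inconsistencies actually occur in train.csv.

```python
import pandas as pd

# Hypothetical messy values, invented to show the technique
raw = pd.Series(['Yes', 'yes', ' No', 'NO ', None, 'Yes'], name='Stage_fear')

# str.strip(): removes leading/trailing spaces
# str.capitalize(): first letter upper case, the rest lower case
# Missing values stay missing
tidy = raw.str.strip().str.capitalize()

print(raw.value_counts(dropna=False))
print(tidy.value_counts(dropna=False))
```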

## Step 9: Class Imbalance Check (NEW)

In [None]:
# Check class distribution - important for classification problems
# Imbalanced data can bias a model towards the majority class
print('Class Distribution:')
print(data['Personality'].value_counts())

# normalize=True: Shows proportions instead of counts (multiply by 100 for %)
print('\nClass Distribution (%):')
print(data['Personality'].value_counts(normalize=True) * 100)

# Visualize class distribution with bar chart
plt.figure(figsize=(8, 5))
data['Personality'].value_counts().plot(kind='bar', color=['steelblue', 'coral'])
plt.title('Class Distribution: Extrovert vs Introvert')
plt.xlabel('Personality')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

# Calculate imbalance ratio: majority class / minority class
# The 1.5 cutoff is a rule of thumb used in this notebook, not a fixed standard
counts = data['Personality'].value_counts()
ratio = counts.max() / counts.min()
print(f'\nImbalance Ratio: {ratio:.2f}:1')
if ratio > 1.5:
    print('WARNING: Dataset is imbalanced!')
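
To make the ratio concrete, here is a worked example with invented counts (740 and 260 are assumptions, not the real class sizes). It also shows the accuracy a model would get by always predicting the majority class, which is a useful baseline when classes are imbalanced.

```python
import pandas as pd

# Invented label counts for illustration only
labels = pd.Series(['Extrovert'] * 740 + ['Introvert'] * 260, name='Personality')

counts = labels.value_counts()

# Imbalance ratio: majority class size / minority class size
ratio = counts.max() / counts.min()

# Accuracy of always guessing the majority class
baseline_accuracy = counts.max() / counts.sum()

print(f'Imbalance ratio: {ratio:.2f}:1')
print(f'Majority-class baseline accuracy: {baseline_accuracy:.2%}')
```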

## Step 10: Data Study Summary

In [None]:
# Print a summary of all findings from Part A
# '=' * 40 creates a line of 40 equal signs for visual separation
print('=' * 40)
print('DATA STUDY SUMMARY')
print('=' * 40)
print('Rows:', len(data))
print('Columns:', len(data.columns))
# .sum().sum(): First sum counts per column, second sum totals all columns
print('Missing values:', data.isnull().sum().sum())
print('Duplicates:', data.duplicated().sum())
# Filter rows where Personality equals specific value and count them
print('Extroverts:', len(data[data['Personality'] == 'Extrovert']))
print('Introverts:', len(data[data['Personality'] == 'Introvert']))
print('=' * 40)

---
# PART B: DATA CLEANING
---

## Step 11: Handle Missing Values

In [None]:
# data.copy(): Creates a copy so original data remains unchanged
# Always work on a copy when cleaning to preserve raw data
clean_data = data.copy()
print('Created copy for cleaning!')

# List of numerical columns that need missing value handling
num_cols = ['Time_spent_Alone', 'Social_event_attendance', 
            'Going_outside', 'Friends_circle_size', 'Post_frequency']

# Fill missing numerical values with MEDIAN
# Median is often preferred over mean because it is less sensitive to outliers
print('\nFilling numerical columns with MEDIAN:')
for col in num_cols:
    median = clean_data[col].median()  # Calculate median of non-null values
    before = clean_data[col].isnull().sum()  # Count missing before
    clean_data[col] = clean_data[col].fillna(median)  # Replace NaN with median
    print(f'{col}: {before} -> 0 (median={median})')
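
The difference between the two fill values is easiest to see on a column with one extreme entry. A minimal sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

# Made-up column with one extreme value (100) and one missing value
s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0, np.nan], name='Friends_circle_size')

# mean() and median() both skip NaN by default
mean_fill = s.fillna(s.mean())      # mean is pulled up by the extreme value
median_fill = s.fillna(s.median())  # median stays near the typical values

print('Filled with mean:  ', mean_fill.iloc[-1])
print('Filled with median:', median_fill.iloc[-1])
```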

In [None]:
# List of categorical columns that need missing value handling
cat_cols = ['Stage_fear', 'Drained_after_socializing']

# Fill missing categorical values with MODE (most frequent value)
# Mode is a simple, common choice for categorical data
print('Filling categorical columns with MODE:')
for col in cat_cols:
    mode = clean_data[col].mode()[0]  # mode() returns a Series, [0] gets first value
    before = clean_data[col].isnull().sum()
    clean_data[col] = clean_data[col].fillna(mode)  # Replace NaN with mode
    print(f'{col}: {before} -> 0 (mode={mode})')

# Verify all missing values are now filled
print('\nMissing values after cleaning:')
print(clean_data.isnull().sum())
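
One detail of `mode()[0]` worth knowing: when two values are tied for most frequent, `mode()` returns both in sorted order, so `[0]` picks the alphabetically first one. A small sketch with made-up values:

```python
import pandas as pd

# Made-up column where 'Yes' and 'No' are equally frequent
s = pd.Series(['Yes', 'No', 'Yes', 'No', None], name='Stage_fear')

# mode() ignores missing values and returns all tied values, sorted
modes = s.mode()
print('Modes:', modes.tolist())

# [0] therefore selects 'No' here, purely because of sort order
filled = s.fillna(modes[0])
print(filled.tolist())
```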

## Step 12: Remove Duplicates

In [None]:
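# Remove duplicate rows from the dataset
# Note: 'id' is still a column at this point, so two rows only count as
# duplicates if their id matches too; if every id is unique, none are found
before = clean_data.duplicated().sum()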

# drop_duplicates(): Removes rows that are exact copies of earlier rows
# reset_index(drop=True): Resets row numbers to 0,1,2,3... after removal
# drop=True prevents old index from being added as a new column
clean_data = clean_data.drop_duplicates().reset_index(drop=True)

after = clean_data.duplicated().sum()
print('Duplicates before:', before)
print('Duplicates after:', after)
print('Index reset: Yes')
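
Because the id column is still present when duplicates are checked, it can be informative to compare the count with and without it. The sketch below uses a made-up frame and assumes ids are unique. Whether rows that match on every feature should actually be dropped is a judgement call, since two different people can give identical answers.

```python
import pandas as pd

# Made-up rows: rows 0, 1 and 3 share the same feature values
demo = pd.DataFrame({
    'id': [0, 1, 2, 3],
    'Time_spent_Alone': [4.0, 4.0, 1.0, 4.0],
    'Personality': ['Introvert', 'Introvert', 'Extrovert', 'Introvert'],
})

# With id included, every row is unique
with_id = int(demo.duplicated().sum())

# subset=...: compare rows on the listed columns only
feature_cols = [c for c in demo.columns if c != 'id']
without_id = int(demo.duplicated(subset=feature_cols).sum())

print('Duplicates including id:', with_id)
print('Duplicates ignoring id: ', without_id)
```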

## Step 13: Drop ID Column (NEW)

In [None]:
# Drop the 'id' column - it's just a row identifier, not useful for analysis
# Keeping id as a feature could lead a model to learn meaningless patterns
print('Dropping id column...')

# drop('id', axis=1): Removes the 'id' column
# axis=1 means column (axis=0 would mean row)
clean_data = clean_data.drop('id', axis=1)
print('Columns:', clean_data.columns.tolist())

## Step 14: Verify Data Ranges (NEW)

In [None]:
# Verify data ranges are within expected/logical bounds
# This helps detect data entry errors (e.g., negative hours, impossible values)
print('Verifying data ranges:')
for col in num_cols:
    # .min() and .max() return the smallest and largest values in the column
    print(f'{col}: {clean_data[col].min()} to {clean_data[col].max()}')
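
Printing the ranges relies on reading them by eye. A possible next step is to count values outside explicit bounds. In the sketch below the helper `count_out_of_range` and the bounds themselves (for example 0 to 24 for hours alone) are assumptions made for illustration; the notebook does not state the valid ranges.

```python
import pandas as pd

# Made-up values, including two deliberately implausible entries
demo = pd.DataFrame({
    'Time_spent_Alone': [0.0, 5.0, 26.0, -1.0],
    'Friends_circle_size': [3.0, 10.0, 0.0, 7.0],
})

# Hypothetical bounds as (low, high); None means no limit on that side
assumed_bounds = {
    'Time_spent_Alone': (0, 24),
    'Friends_circle_size': (0, None),
}

def count_out_of_range(df, bounds):
    """Return the number of values outside the given bounds, per column."""
    result = {}
    for col, (low, high) in bounds.items():
        bad = pd.Series(False, index=df.index)
        if low is not None:
            bad |= df[col] < low
        if high is not None:
            bad |= df[col] > high
        result[col] = int(bad.sum())
    return result

violations = count_out_of_range(demo, assumed_bounds)
print(violations)
```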

## Step 15: Check Outliers

In [None]:
# Create box plots to visualize outliers in numerical columns
# Box plots show: median (line), quartiles (box), and outliers (dots)

# plt.subplots(2, 3): Creates a 2x3 grid of subplots (6 total)
fig, axes = plt.subplots(2, 3, figsize=(12, 8))

# axes.flatten(): Converts 2D array of axes to 1D for easier looping
axes = axes.flatten()

# Create a boxplot for each numerical column
# zip() pairs each subplot with one column name
for ax, col in zip(axes, num_cols):
    ax.boxplot(clean_data[col])
    ax.set_title(col)

# Turn off the 6th subplot (we only have 5 columns)
axes[5].axis('off')
plt.suptitle('Checking Outliers')  # Main title for all subplots
plt.tight_layout()
plt.show()
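
The dots in a box plot follow the usual 1.5 x IQR rule, so the same outliers can be counted numerically. A minimal sketch on made-up numbers (the helper name `iqr_outlier_count` is an assumption, not part of the notebook):

```python
import pandas as pd

def iqr_outlier_count(series, k=1.5):
    """Count values further than k * IQR below Q1 or above Q3."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# Made-up values with one obvious extreme
demo = pd.Series([1, 2, 3, 4, 5, 100])
n_outliers = iqr_outlier_count(demo)
print('Outliers:', n_outliers)
```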

## Step 16: Final Cleaned Data

In [None]:
# Preview the cleaned dataset to verify changes
print('Cleaned data:')
clean_data.head()

In [None]:
# Show info of cleaned data to confirm:
# - No missing values (all columns show same non-null count)
# - id column is removed
# - Data types are correct
clean_data.info()

In [None]:
# Print a summary comparing original vs cleaned data
# This documents what changes were made during cleaning
print('=' * 40)
print('CLEANING SUMMARY')
print('=' * 40)
print('BEFORE:')
print('  Rows:', len(data))  # Original row count
print('  Columns:', len(data.columns))  # Original column count
print('  Missing:', data.isnull().sum().sum())  # Original missing count
print('AFTER:')
print('  Rows:', len(clean_data))  # May be less if duplicates removed
print('  Columns:', len(clean_data.columns), '(dropped id)')  # One less column
print('  Missing:', clean_data.isnull().sum().sum())  # Should be 0
print('=' * 40)
print('DATA IS CLEAN!')

## Step 17: Save Cleaned Data

In [None]:
# Save cleaned data to a new CSV file for use in modeling
# to_csv(): Exports DataFrame to a CSV file
# index=False: Don't include row numbers in the output file
# Uncomment the lines below to save:
# clean_data.to_csv('cleaned_data.csv', index=False)
# print('Saved!')

---
## Summary

### Part A: Study Data
- Loaded train.csv
- Viewed data, checked types
- Found missing values, checked duplicates
- **NEW: Checked categorical values for inconsistencies**
- **NEW: Analyzed class imbalance**

### Part B: Data Cleaning
- Filled numerical with MEDIAN, categorical with MODE
- Removed duplicates
- **NEW: Reset index after duplicate removal**
- **NEW: Dropped id column**
- **NEW: Verified data ranges**
- Checked outliers

### Part C: Data Visualization

---
# PART C: DATA VISUALIZATION
---

## Step 18: Time Spent Alone

In [None]:
# Box plot: Compares distribution of Time_spent_Alone between personality types
# x-axis: Categories (Extrovert/Introvert)
# y-axis: Numerical values (hours spent alone)
plt.figure(figsize=(8, 4))
sns.boxplot(x='Personality', y='Time_spent_Alone', data=clean_data)
plt.title('Time Spent Alone by Personality')
plt.show()

# INSIGHT: Introverts spend MORE time alone (median ~5-6 hours)
# Extroverts spend LESS time alone (median ~1-2 hours)
# This is a STRONG indicator of personality type

## Step 19: Social Event Attendance

In [None]:
# Box plot: Compares social event attendance between personality types
# The box shows 25th-75th percentile, line inside is median
plt.figure(figsize=(8, 4))
sns.boxplot(x='Personality', y='Social_event_attendance', data=clean_data)
plt.title('Social Event Attendance by Personality')
plt.show()

# INSIGHT: Extroverts attend MORE social events (median ~7)
# Introverts attend FEWER social events (median ~2)
# Clear separation - good predictor feature

## Step 20: Going Outside Frequency

In [None]:
# Box plot: Compares how often each personality type goes outside
# Whiskers extend to the most extreme points within 1.5*IQR of the box; points beyond are drawn as outliers
plt.figure(figsize=(8, 4))
sns.boxplot(x='Personality', y='Going_outside', data=clean_data)
plt.title('Going Outside Frequency by Personality')
plt.show()

# INSIGHT: Extroverts go outside MORE often (median ~5)
# Introverts go outside LESS often (median ~1-2)
# Another strong differentiator

## Step 21: Friends Circle Size

In [None]:
# Box plot: Compares friend circle size between personality types
# Larger box = more variation in the data
plt.figure(figsize=(8, 4))
sns.boxplot(x='Personality', y='Friends_circle_size', data=clean_data)
plt.title('Friends Circle Size by Personality')
plt.show()

# INSIGHT: Extroverts have MORE friends (median ~10-11)
# Introverts have FEWER friends (median ~3-4)
# Makes sense - extroverts are more social

## Step 22: Social Media Post Frequency

In [None]:
# Box plot: Compares social media posting frequency
# Shows how active each personality type is online
plt.figure(figsize=(8, 4))
sns.boxplot(x='Personality', y='Post_frequency', data=clean_data)
plt.title('Social Media Post Frequency by Personality')
plt.show()

# INSIGHT: Extroverts post MORE on social media (median ~6-7)
# Introverts post LESS (median ~2-3)
# Extroverts like to share and be visible online
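
The medians quoted in the insights for Steps 18 to 22 are read off the box plots. They can also be computed directly with a grouped median. The sketch below runs on a made-up frame, so its numbers are not the dataset's; on the real data the same call would be applied to `clean_data` and `num_cols`.

```python
import pandas as pd

# Made-up rows for illustration
demo = pd.DataFrame({
    'Personality': ['Introvert', 'Introvert', 'Introvert',
                    'Extrovert', 'Extrovert', 'Extrovert'],
    'Time_spent_Alone': [6, 8, 7, 1, 2, 3],
    'Post_frequency': [1, 2, 3, 6, 7, 8],
})

# groupby('Personality'): one group per personality type
# .median(): middle value of each feature within each group
medians = demo.groupby('Personality')[['Time_spent_Alone', 'Post_frequency']].median()
print(medians)
```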

## Step 23: Stage Fear

In [None]:
# Bar chart: Shows relationship between Stage_fear and Personality
# pd.crosstab(): Creates a frequency table of two categorical variables
# Each bar shows count of Extroverts/Introverts for Yes/No stage fear
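# DataFrame.plot() creates its own figure, so the size is passed via figsize
# (a separate plt.figure() call here would only leave an empty extra figure)
pd.crosstab(clean_data['Stage_fear'], clean_data['Personality']).plot(kind='bar', figsize=(8, 4))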
plt.title('Stage Fear by Personality')
plt.xlabel('Has Stage Fear')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.legend(title='Personality')
plt.show()

# INSIGHT: Most Extroverts say NO to stage fear
# Most Introverts say YES to stage fear
# Stage fear is a key differentiator
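
Raw counts are affected by how many people are in each class, so it can help to look at shares within each personality type instead. A sketch on made-up data, using the `normalize` option of `pd.crosstab`:

```python
import pandas as pd

# Made-up data: 8 Extroverts (6 'No', 2 'Yes') and 4 Introverts (3 'Yes', 1 'No')
demo = pd.DataFrame({
    'Stage_fear': ['No'] * 6 + ['Yes'] * 2 + ['Yes'] * 3 + ['No'] * 1,
    'Personality': ['Extrovert'] * 8 + ['Introvert'] * 4,
})

# normalize='columns': each Personality column sums to 1
shares = pd.crosstab(demo['Stage_fear'], demo['Personality'], normalize='columns')
print(shares)
```

Read this way, each column answers "what fraction of this personality type reports stage fear?", regardless of class size.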

## Step 24: Drained After Socializing

In [None]:
# Bar chart: Shows relationship between feeling drained and Personality
# This is a classic psychological trait that differentiates the two types
# DataFrame.plot() creates its own figure, so the size is passed via figsize
pd.crosstab(clean_data['Drained_after_socializing'], clean_data['Personality']).plot(kind='bar', figsize=(8, 4))
plt.title('Drained After Socializing by Personality')
plt.xlabel('Feels Drained')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.legend(title='Personality')
plt.show()

# INSIGHT: Extroverts mostly say NO - they get energy from socializing
# Introverts mostly say YES - socializing drains their energy
# This is a classic introvert/extrovert trait

## Step 25: Correlation Heatmap

In [None]:
# Correlation requires numerical data, so encode categorical columns
# map(): Replaces values based on a dictionary {old: new}
data_encoded = clean_data.copy()
data_encoded['Stage_fear'] = data_encoded['Stage_fear'].map({'Yes': 1, 'No': 0})
data_encoded['Drained_after_socializing'] = data_encoded['Drained_after_socializing'].map({'Yes': 1, 'No': 0})
data_encoded['Personality'] = data_encoded['Personality'].map({'Extrovert': 1, 'Introvert': 0})

# Heatmap: Shows correlation between all pairs of variables
# corr(): Calculates Pearson correlation (-1 to +1)
# +1 = perfect positive correlation, -1 = perfect negative, 0 = no correlation
# annot=True: Shows correlation values in each cell
# cmap='coolwarm': Blue for negative, red for positive correlations
plt.figure(figsize=(10, 8))
sns.heatmap(data_encoded.corr(), annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Heatmap')
plt.tight_layout()
plt.show()

# INSIGHT: Features POSITIVELY correlated with Extrovert (value=1):
#   - Social_event_attendance, Going_outside, Friends_circle_size, Post_frequency
# Features NEGATIVELY correlated (means Introvert):
#   - Time_spent_Alone, Stage_fear, Drained_after_socializing
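
One thing to watch with `map()` in the encoding step: any value that is not a key in the dictionary becomes NaN without a warning. A quick check, sketched here on made-up values, can catch that before the correlation is computed:

```python
import pandas as pd

# Made-up column with one value ('yes') that the dictionary does not cover
s = pd.Series(['Yes', 'No', 'yes', 'No'], name='Stage_fear')

encoded = s.map({'Yes': 1, 'No': 0})

# Values that were present before encoding but are NaN afterwards
unmapped = s[encoded.isnull() & s.notnull()]
print('Unmapped values:', unmapped.tolist())
```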

## Step 26: All Features Comparison

In [None]:
# Grouped bar chart: Compare average values of all features side by side
# groupby('Personality'): Groups data by Extrovert/Introvert
# .mean(): Calculates average for each group
means = clean_data.groupby('Personality')[num_cols].mean()

# .T: Transpose (swap rows and columns) so features are on x-axis
# This makes it easier to compare Extrovert vs Introvert for each feature
means.T.plot(kind='bar', figsize=(10, 5), color=['coral', 'steelblue'])
plt.title('Average Values: Extrovert vs Introvert')
plt.xlabel('Feature')
plt.ylabel('Average Value')
plt.xticks(rotation=45)
plt.legend(title='Personality')
plt.tight_layout()
plt.show()

# INSIGHT: Clear pattern visible
# Introverts: High time alone, low everything else
# Extroverts: Low time alone, high everything else

---
## Visualization Insights Summary

**Extroverts tend to:**
- Spend less time alone
- Attend more social events
- Go outside more often
- Have larger friend circles
- Post more on social media
- NOT have stage fear
- NOT feel drained after socializing

**Introverts tend to:**
- Spend more time alone
- Attend fewer social events
- Go outside less often
- Have smaller friend circles
- Post less on social media
- Have stage fear
- Feel drained after socializing

**Best Predictor Features (judged visually from the plots above):**
1. Time_spent_Alone (strongest)
2. Social_event_attendance
3. Stage_fear
4. Drained_after_socializing
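
A ranking like the one above can be cross-checked by sorting each feature's correlation with the encoded target by absolute value. The sketch below uses a small made-up frame, so its ordering says nothing about the real dataset; on the notebook's data the same two lines would be applied to `data_encoded`. Correlation with a binary target is only a rough guide to predictive value.

```python
import pandas as pd

# Made-up encoded data: Personality 1 = Extrovert, 0 = Introvert
encoded = pd.DataFrame({
    'Time_spent_Alone': [9, 8, 9, 7, 1, 2, 1, 3],
    'Post_frequency':   [1, 5, 2, 6, 3, 7, 2, 6],
    'Personality':      [0, 0, 0, 0, 1, 1, 1, 1],
})

# Correlation of every feature with the target, target itself removed
target_corr = encoded.corr()['Personality'].drop('Personality')

# Rank by strength, ignoring the sign
ranking = target_corr.abs().sort_values(ascending=False)
print(ranking)
```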

### Next: Model Building