# Extrovert-Introvert Profiler
## Exploratory Data Analysis (EDA)

**Project Goal:** Find out which behaviors and social habits help identify whether someone is an Extrovert or an Introvert.

---

## Part A: Study the Data
## Part B: Data Cleaning
## Part C: Data Visualization

---
# GOOGLE COLAB SETUP
---

In [None]:
# Mount Google Drive to access files
from google.colab import drive
drive.mount('/content/drive')

# UPDATE THIS PATH to your Google Drive folder
DATA_PATH = '/content/drive/MyDrive/PY/'
print('Drive mounted! Data path:', DATA_PATH)

---
# PART A: STUDY THE DATA
---

## Step 1: Import Libraries

In [None]:
# pandas: Used for data manipulation and analysis (DataFrames)
import pandas as pd

# numpy: Used for numerical operations on arrays
import numpy as np

# matplotlib: Used for creating static visualizations
import matplotlib.pyplot as plt

# seaborn: Built on matplotlib, provides beautiful statistical plots
import seaborn as sns

# Suppress warning messages for cleaner output
import warnings
warnings.filterwarnings('ignore')

# Set default figure size for all plots
plt.rcParams['figure.figsize'] = (10, 5)
sns.set_style('whitegrid')

print('Libraries loaded!')

## Step 2: Load the Data

In [None]:
# Load data from Google Drive
# DATA_PATH is set in the Colab Setup cell above
data = pd.read_csv(DATA_PATH + 'train.csv')

print('Dataset has', len(data), 'rows and', len(data.columns), 'columns')

## Step 3: View the Data

In [None]:
# data.head(): Shows the first 5 rows
print('First 5 rows:')
data.head()

In [None]:
# data.tail(): Shows the last 5 rows
print('Last 5 rows:')
data.tail()

In [None]:
# Column names and shape
print('Column names:')
print(data.columns.tolist())
print('\nRows:', data.shape[0])
print('Columns:', data.shape[1])

## Step 4: Data Types

In [None]:
# data.dtypes: Shows data type of each column
print('Data types:')
print(data.dtypes)
print('\nData Info:')
data.info()

## Step 5: Basic Statistics

In [None]:
# data.describe(): Statistical summary for numerical columns
print('Statistics:')
data.describe()

In [None]:
# describe(include='all'): Statistics for ALL columns
print('All columns statistics:')
data.describe(include='all')

## Step 6: Missing Values

In [None]:
# Count missing values per column
print('Missing values:')
print(data.isnull().sum())

print('\nMissing values percentage:')
for col in data.columns:
    missing = data[col].isnull().sum()
    pct = (missing / len(data)) * 100
    print(f'{col}: {missing} ({round(pct, 2)}%)')

In [None]:
# Visualize missing values
missing = data.isnull().sum()
missing = missing[missing > 0]
plt.bar(missing.index, missing.values, color='coral')
plt.title('Missing Values')
plt.xlabel('Column')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## Step 7: Duplicate Rows

In [None]:
# Count duplicate rows
dup_count = data.duplicated().sum()
print('Duplicate rows:', dup_count)
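
The raw file still carries an `id` column at this point, so a full-row comparison only flags rows whose `id` matches as well. Below is a minimal sketch on a made-up toy frame (the values are hypothetical, not taken from `train.csv`) that counts duplicates on the feature columns only:

```python
import pandas as pd

# Toy frame (hypothetical values) with a unique id per row
toy = pd.DataFrame({
    'id': [0, 1, 2],
    'Time_spent_Alone': [4.0, 4.0, 9.0],
    'Stage_fear': ['No', 'No', 'Yes'],
})

# Full-row comparison: the unique id makes every row look distinct
full_row_dups = int(toy.duplicated().sum())

# Compare on every column except id
feature_cols = [c for c in toy.columns if c != 'id']
feature_dups = int(toy.duplicated(subset=feature_cols).sum())

print('Full-row duplicates:', full_row_dups)
print('Duplicates ignoring id:', feature_dups)
```

Whether feature-identical rows should actually be dropped is a separate judgment call: with only a handful of coarse features, two different people can plausibly give identical answers.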

## Step 8: Check Categorical Values

In [None]:
# Check for invalid/inconsistent values
print('Checking categorical values:')
print('\nStage_fear:')
print(data['Stage_fear'].value_counts(dropna=False))
print('\nDrained_after_socializing:')
print(data['Drained_after_socializing'].value_counts(dropna=False))
print('\nPersonality:')
print(data['Personality'].value_counts(dropna=False))
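
Reading `value_counts()` by eye works for a few columns. As a rough sketch, the same check can be automated by comparing against an allowed set. The helper name `unexpected_values` and the toy values below are illustrative assumptions, not part of the original notebook:

```python
import pandas as pd

def unexpected_values(series, allowed):
    """Return counts of non-missing values that are not in `allowed`."""
    present = series.dropna()
    return present[~present.isin(allowed)].value_counts()

# Toy column (hypothetical values) with a casing and a whitespace variant
toy = pd.Series(['Yes', 'No', 'yes', ' No', None, 'Yes'])
bad = unexpected_values(toy, allowed={'Yes', 'No'})
print(bad)

# One possible normalization: trim whitespace and standardize casing
normalized = toy.str.strip().str.capitalize()
print(unexpected_values(normalized, allowed={'Yes', 'No'}))
```

An empty result means every non-missing value is in the allowed set. Missing values are ignored here because they are handled separately in Part B.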

## Step 9: Class Imbalance Check

In [None]:
# Check class distribution
print('Class Distribution:')
print(data['Personality'].value_counts())
print('\nClass Distribution (%):')
print(data['Personality'].value_counts(normalize=True) * 100)

# Visualize
plt.figure(figsize=(8, 5))
data['Personality'].value_counts().plot(kind='bar', color=['steelblue', 'coral'])
plt.title('Class Distribution: Extrovert vs Introvert')
plt.xlabel('Personality')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

# Check imbalance ratio
counts = data['Personality'].value_counts()
ratio = counts.max() / counts.min()
print(f'\nImbalance Ratio: {ratio:.2f}:1')
if ratio > 1.5:
    print('WARNING: Dataset is imbalanced!')

## Step 10: Data Study Summary

In [None]:
print('=' * 40)
print('DATA STUDY SUMMARY')
print('=' * 40)
print('Rows:', len(data))
print('Columns:', len(data.columns))
print('Missing values:', data.isnull().sum().sum())
print('Duplicates:', data.duplicated().sum())
print('Extroverts:', len(data[data['Personality'] == 'Extrovert']))
print('Introverts:', len(data[data['Personality'] == 'Introvert']))
print('=' * 40)

---
# PART B: DATA CLEANING
---

## Step 11: Handle Missing Values

In [None]:
# Create copy for cleaning
clean_data = data.copy()
print('Created copy for cleaning!')

num_cols = ['Time_spent_Alone', 'Social_event_attendance', 
            'Going_outside', 'Friends_circle_size', 'Post_frequency']

# Fill numerical with MEDIAN
print('\nFilling numerical columns with MEDIAN:')
for col in num_cols:
    median = clean_data[col].median()
    before = clean_data[col].isnull().sum()
    clean_data[col] = clean_data[col].fillna(median)
    print(f'{col}: {before} -> 0 (median={median})')
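
The notebook does not say why the median is preferred over the mean. One common reason is that the median is less sensitive to extreme values. A toy illustration under that assumption (the numbers are made up):

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) with one extreme value and one gap
s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0, np.nan])

mean_fill = s.fillna(s.mean())      # mean of the non-missing values is 22.0
median_fill = s.fillna(s.median())  # median of the non-missing values is 3.0

print('Filled with mean:  ', mean_fill.iloc[-1])
print('Filled with median:', median_fill.iloc[-1])
```

The single extreme value pulls the mean-based fill far away from the typical values, while the median-based fill stays close to them.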

In [None]:
cat_cols = ['Stage_fear', 'Drained_after_socializing']

# Fill categorical with MODE
print('Filling categorical columns with MODE:')
for col in cat_cols:
    mode = clean_data[col].mode()[0]
    before = clean_data[col].isnull().sum()
    clean_data[col] = clean_data[col].fillna(mode)
    print(f'{col}: {before} -> 0 (mode={mode})')

print('\nMissing values after cleaning:')
print(clean_data.isnull().sum())

## Step 12: Remove Duplicates

In [None]:
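# Note: 'id' is still present at this point (it is dropped in Step 13),
# so only rows identical in every column, including id, count as duplicates
before = clean_data.duplicated().sum()
clean_data = clean_data.drop_duplicates().reset_index(drop=True)
after = clean_data.duplicated().sum()
print('Duplicates before:', before)
print('Duplicates after:', after)
print('Index reset: Yes')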

## Step 13: Drop ID Column

In [None]:
print('Dropping id column...')
clean_data = clean_data.drop('id', axis=1)
print('Columns:', clean_data.columns.tolist())

## Step 14: Verify Data Ranges

In [None]:
print('Verifying data ranges:')
for col in num_cols:
    print(f'{col}: {clean_data[col].min()} to {clean_data[col].max()}')

## Step 15: Check Outliers

In [None]:
# Box plots for outliers
fig, axes = plt.subplots(2, 3, figsize=(12, 8))
axes = axes.flatten()
for ax, col in zip(axes, num_cols):
    ax.boxplot(clean_data[col])
    ax.set_title(col)
axes[5].axis('off')  # 5 features on a 2x3 grid: hide the unused sixth panel
plt.suptitle('Checking Outliers')
plt.tight_layout()
plt.show()
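
By default, matplotlib's box plots draw their whiskers at 1.5 × IQR beyond the quartiles and mark anything further out as an outlier point. To get a number rather than a visual impression, the same rule can be applied directly. This is a sketch; `count_iqr_outliers` is a hypothetical helper and the toy values are made up:

```python
import pandas as pd

def count_iqr_outliers(series, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# Toy column (hypothetical values) with one clearly extreme entry
toy = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 30])
print('Outliers:', count_iqr_outliers(toy))
```

In this notebook the equivalent would be to call the helper on `clean_data[col]` for each `col` in `num_cols`.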

## Step 16: Final Cleaned Data

In [None]:
print('Cleaned data:')
clean_data.head()

In [None]:
clean_data.info()

In [None]:
print('=' * 40)
print('CLEANING SUMMARY')
print('=' * 40)
print('BEFORE:')
print('  Rows:', len(data))
print('  Columns:', len(data.columns))
print('  Missing:', data.isnull().sum().sum())
print('AFTER:')
print('  Rows:', len(clean_data))
print('  Columns:', len(clean_data.columns), '(dropped id)')
print('  Missing:', clean_data.isnull().sum().sum())
print('=' * 40)
print('DATA IS CLEAN!')

## Step 17: Save Cleaned Data

In [None]:
# Uncomment to save to Google Drive
# clean_data.to_csv(DATA_PATH + 'cleaned_data.csv', index=False)
# print('Saved to Google Drive!')

---
# PART C: DATA VISUALIZATION
---

## Step 18: Time Spent Alone

In [None]:
plt.figure(figsize=(8, 4))
sns.boxplot(x='Personality', y='Time_spent_Alone', data=clean_data)
plt.title('Time Spent Alone by Personality')
plt.show()

# INSIGHT: Introverts spend MORE time alone (median ~5-6 hours)
# Extroverts spend LESS time alone (median ~1-2 hours)
# This is a STRONG indicator of personality type

## Step 19: Social Event Attendance

In [None]:
plt.figure(figsize=(8, 4))
sns.boxplot(x='Personality', y='Social_event_attendance', data=clean_data)
plt.title('Social Event Attendance by Personality')
plt.show()

# INSIGHT: Extroverts attend MORE social events (median ~7)
# Introverts attend FEWER social events (median ~2)

## Step 20: Going Outside Frequency

In [None]:
plt.figure(figsize=(8, 4))
sns.boxplot(x='Personality', y='Going_outside', data=clean_data)
plt.title('Going Outside Frequency by Personality')
plt.show()

# INSIGHT: Extroverts go outside MORE often (median ~5)
# Introverts go outside LESS often (median ~1-2)

## Step 21: Friends Circle Size

In [None]:
plt.figure(figsize=(8, 4))
sns.boxplot(x='Personality', y='Friends_circle_size', data=clean_data)
plt.title('Friends Circle Size by Personality')
plt.show()

# INSIGHT: Extroverts have MORE friends (median ~10-11)
# Introverts have FEWER friends (median ~3-4)

## Step 22: Social Media Post Frequency

In [None]:
plt.figure(figsize=(8, 4))
sns.boxplot(x='Personality', y='Post_frequency', data=clean_data)
plt.title('Social Media Post Frequency by Personality')
plt.show()

# INSIGHT: Extroverts post MORE on social media (median ~6-7)
# Introverts post LESS (median ~2-3)
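
The medians quoted in the INSIGHT comments of Steps 18-22 are approximate readings from the box plots. A small sketch of how they could be checked numerically follows. The helper name `medians_by_group` and the toy values are assumptions for illustration only:

```python
import pandas as pd

def medians_by_group(df, group_col, value_cols):
    """Median of each value column within each group."""
    return df.groupby(group_col)[value_cols].median()

# Toy frame (hypothetical values, not the real dataset)
toy = pd.DataFrame({
    'Personality': ['Extrovert', 'Extrovert', 'Extrovert',
                    'Introvert', 'Introvert', 'Introvert'],
    'Time_spent_Alone': [1, 2, 3, 5, 6, 9],
    'Post_frequency': [6, 7, 8, 1, 2, 3],
})
table = medians_by_group(toy, 'Personality',
                         ['Time_spent_Alone', 'Post_frequency'])
print(table)
```

On the real data the call would be `medians_by_group(clean_data, 'Personality', num_cols)`.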

## Step 23: Stage Fear

In [None]:
# DataFrame.plot() opens its own figure, so set the size there instead of via plt.figure()
pd.crosstab(clean_data['Stage_fear'], clean_data['Personality']).plot(kind='bar', figsize=(8, 4))
plt.title('Stage Fear by Personality')
plt.xlabel('Has Stage Fear')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.legend(title='Personality')
plt.show()

# INSIGHT: Most Extroverts say NO to stage fear
# Most Introverts say YES to stage fear

## Step 24: Drained After Socializing

In [None]:
# DataFrame.plot() opens its own figure, so set the size there instead of via plt.figure()
pd.crosstab(clean_data['Drained_after_socializing'], clean_data['Personality']).plot(kind='bar', figsize=(8, 4))
plt.title('Drained After Socializing by Personality')
plt.xlabel('Feels Drained')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.legend(title='Personality')
plt.show()

# INSIGHT: Extroverts mostly say NO - they get energy from socializing
# Introverts mostly say YES - socializing drains their energy
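
Raw counts can be hard to compare when the two classes differ in size (see Step 9). Claims such as "most Extroverts say No" are easier to support with within-class shares. A minimal sketch using `pd.crosstab` with `normalize='columns'`, on made-up toy values:

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    'Personality': ['Extrovert'] * 4 + ['Introvert'] * 4,
    'Stage_fear':  ['No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'No'],
})

# normalize='columns' makes each Personality column sum to 1
shares = pd.crosstab(toy['Stage_fear'], toy['Personality'],
                     normalize='columns')
print((shares * 100).round(1))
```

Each column then reads as "of all people with this personality, what share answered Yes or No".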

## Step 25: Correlation Heatmap

In [None]:
# Encode categorical for correlation
data_encoded = clean_data.copy()
data_encoded['Stage_fear'] = data_encoded['Stage_fear'].map({'Yes': 1, 'No': 0})
data_encoded['Drained_after_socializing'] = data_encoded['Drained_after_socializing'].map({'Yes': 1, 'No': 0})
data_encoded['Personality'] = data_encoded['Personality'].map({'Extrovert': 1, 'Introvert': 0})

plt.figure(figsize=(10, 8))
sns.heatmap(data_encoded.corr(), annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Heatmap')
plt.tight_layout()
plt.show()

# INSIGHT: Features POSITIVELY correlated with Extrovert:
#   - Social_event_attendance, Going_outside, Friends_circle_size, Post_frequency
# Features NEGATIVELY correlated with Extrovert (i.e. associated with Introvert):
#   - Time_spent_Alone, Stage_fear, Drained_after_socializing
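
The heatmap shows every pairwise correlation. To rank features against the target, the `Personality` column of the correlation matrix can be sorted by absolute value. This is one simple way to order features, sketched on made-up data, and not necessarily how the ranking in the summary below was derived. `rank_by_target_correlation` is a hypothetical helper:

```python
import pandas as pd

def rank_by_target_correlation(df_numeric, target):
    """Sort features by absolute Pearson correlation with `target`."""
    corr = df_numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Toy encoded frame (hypothetical values; Personality: 1 = Extrovert)
toy = pd.DataFrame({
    'Time_spent_Alone':    [9, 8, 7, 2, 1, 0],
    'Friends_circle_size': [3, 5, 2, 9, 8, 12],
    'Post_frequency':      [4, 1, 6, 5, 3, 7],
    'Personality':         [0, 0, 0, 1, 1, 1],
})
ranking = rank_by_target_correlation(toy, 'Personality')
print(ranking)
```

On the real data the call would be `rank_by_target_correlation(data_encoded, 'Personality')`. The sign is kept in the output, so the direction of each relationship stays visible.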

## Step 26: All Features Comparison

In [None]:
means = clean_data.groupby('Personality')[num_cols].mean()
means.T.plot(kind='bar', figsize=(10, 5), color=['coral', 'steelblue'])
plt.title('Average Values: Extrovert vs Introvert')
plt.xlabel('Feature')
plt.ylabel('Average Value')
plt.xticks(rotation=45)
plt.legend(title='Personality')
plt.tight_layout()
plt.show()

# INSIGHT: Clear pattern visible
# Introverts: High time alone, low everything else
# Extroverts: Low time alone, high everything else

---
## Visualization Insights Summary

**Extroverts tend to:**
- Spend less time alone
- Attend more social events
- Go outside more often
- Have larger friend circles
- Post more on social media
- NOT have stage fear
- NOT feel drained after socializing

**Introverts tend to:**
- Spend more time alone
- Attend fewer social events
- Go outside less often
- Have smaller friend circles
- Post less on social media
- Have stage fear
- Feel drained after socializing

**Best Predictor Features:**
1. Time_spent_Alone (strongest)
2. Social_event_attendance
3. Stage_fear
4. Drained_after_socializing

### Next: Model Building