# 00 - Exploratory Data Analysis (EDA)

## Learning Objectives
- Understand the disaster tweets dataset structure
- Explore text characteristics and patterns
- Identify data quality issues and preprocessing needs
- Build intuition about the classification task

## Phase 1: PyTorch Fundamentals 🧠
*Build everything from scratch to understand the foundations*

## Phase 2: Transformers Enhancement 🚀
*Enhance with modern NLP tools after mastering fundamentals*

---

## Dataset Overview

**Competition**: [Natural Language Processing with Disaster Tweets](https://www.kaggle.com/competitions/nlp-getting-started)

**Task**: Binary classification - predict if a tweet is about a real disaster (1) or not (0)

**Files**:
- `train.csv` - Training data with labels
- `test.csv` - Test data for submission (no labels)

**Key Columns**:
- `id` - Unique identifier
- `text` - Tweet content
- `target` - Label (1=disaster, 0=not disaster)


## TODO 1: Load and Inspect Data

**Goal**: Load the dataset and perform basic inspection

**Steps**:
1. Import necessary libraries (pandas, numpy, matplotlib, seaborn)
2. Load `train.csv` and `test.csv` from `data/raw/`
3. Display basic dataset information:
   - Shape (number of rows and columns)
   - Column names and data types
   - First few rows
   - Memory usage

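**Hint**: Use `pd.read_csv()`, `df.shape`, `df.info()`, `df.head()`, and `df.memory_usage(deep=True)` (without `deep=True`, the memory used by the text column's strings is not fully counted)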

**Expected Output**: Understanding of dataset size and structure


```python
# TODO 1: Load and inspect data
# Your implementation here
```
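
One way to approach TODO 1 is sketched below. It assumes the CSVs sit under `data/raw/` as described in the steps. The helper names `summarize` and `load_split` are illustrative, not part of any library. Missing files are reported rather than raising an error, so the cell also runs before the data has been downloaded.

```python
from pathlib import Path
from typing import Optional

import pandas as pd

DATA_DIR = Path("data/raw")  # location named in the steps above


def summarize(df: pd.DataFrame) -> dict:
    """Collect the basic facts asked for in TODO 1."""
    return {
        "shape": df.shape,
        "columns": list(df.columns),
        "dtypes": df.dtypes.astype(str).to_dict(),
        # deep=True also counts the Python strings held in object columns
        "memory_bytes": int(df.memory_usage(deep=True).sum()),
    }


def load_split(name: str) -> Optional[pd.DataFrame]:
    """Read one CSV from DATA_DIR, or return None if it is not there yet."""
    path = DATA_DIR / name
    return pd.read_csv(path) if path.exists() else None


for name in ("train.csv", "test.csv"):
    df = load_split(name)
    if df is None:
        print(f"{name}: not found under {DATA_DIR}")
        continue
    print(name, summarize(df))
    print(df.head())
```

`df.info()` prints much of the same information in one call; the dictionary form is mainly convenient when the numbers are reused later in the notebook.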


## TODO 2: Target Distribution Analysis

**Goal**: Understand the class distribution and balance

**Steps**:
1. Analyze target distribution:
   - Count of each class (0 vs 1)
   - Percentage distribution
   - Visualize with bar chart
2. Check for any missing values in target column
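3. Consider what the class balance implies for modeling (for example, whether accuracy alone is an adequate metric and whether class weighting is needed)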

**Hint**: Use `df['target'].value_counts()`, `df['target'].value_counts(normalize=True)`, and `sns.countplot()`

**Expected Output**: Understanding of whether the dataset is balanced and potential impact on model training


```python
# TODO 2: Target distribution analysis
# Your implementation here
```
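
A possible shape for TODO 2 is sketched below. It assumes a DataFrame with a `target` column as listed under Key Columns. The function names are illustrative. The plotting helper imports matplotlib and seaborn lazily so the counting helpers work without them.

```python
import pandas as pd


def class_balance(df: pd.DataFrame, target_col: str = "target") -> pd.DataFrame:
    """Count and percentage of each class, ordered by label."""
    counts = df[target_col].value_counts().sort_index()
    percent = (counts / counts.sum() * 100).round(2)
    return pd.DataFrame({"count": counts, "percent": percent})


def missing_targets(df: pd.DataFrame, target_col: str = "target") -> int:
    """Number of rows whose label is missing."""
    return int(df[target_col].isna().sum())


def imbalance_ratio(df: pd.DataFrame, target_col: str = "target") -> float:
    """Majority-class count divided by minority-class count (1.0 = balanced)."""
    counts = df[target_col].value_counts()
    return float(counts.max() / counts.min())


def plot_class_balance(df: pd.DataFrame, target_col: str = "target"):
    """Bar chart of class counts; needs matplotlib and seaborn installed."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    ax = sns.countplot(data=df, x=target_col)
    ax.set_title("Tweets per class")
    plt.show()
    return ax
```

With the training data loaded as `train`, calling `class_balance(train)` and `imbalance_ratio(train)` gives the numbers needed to judge whether the dataset is balanced.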


## TODO 3: Text Length Analysis

**Goal**: Understand text characteristics and length patterns

**Steps**:
1. Calculate text length statistics:
   - Character count per tweet
   - Word count per tweet
   - Sentence count per tweet
2. Visualize length distributions:
   - Histograms for character and word counts
   - Box plots by target class
3. Identify outliers and potential preprocessing needs

**Hint**: Use `df['text'].str.len()`, `df['text'].str.split().str.len()`, and matplotlib/seaborn for visualization

**Expected Output**: Understanding of text length patterns and preprocessing requirements


```python
# TODO 3: Text length analysis
# Your implementation here
```
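
The length features in TODO 3 could be computed along these lines. This is a sketch under a few assumptions: the sentence counter is a rough heuristic that splits on `.`, `!`, and `?` (so URLs and abbreviations inflate it), and the 1.5 × IQR rule is only one common convention for flagging outliers. All helper names are illustrative.

```python
import re

import pandas as pd

SENTENCE_END = re.compile(r"[.!?]+")


def count_sentences(text: str) -> int:
    """Rough sentence count: non-empty pieces between ., ! and ? runs."""
    return sum(1 for part in SENTENCE_END.split(text) if part.strip())


def add_length_features(df: pd.DataFrame, text_col: str = "text") -> pd.DataFrame:
    """Return a copy of df with character, word, and sentence counts."""
    out = df.copy()
    text = out[text_col].fillna("")
    out["char_count"] = text.str.len()
    out["word_count"] = text.str.split().str.len()  # whitespace-separated tokens
    out["sentence_count"] = text.apply(count_sentences)
    return out


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


def plot_lengths(df: pd.DataFrame, target_col: str = "target"):
    """Histogram of word counts and box plot by class; needs seaborn."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    sns.histplot(data=df, x="word_count", ax=axes[0])
    sns.boxplot(data=df, x=target_col, y="word_count", ax=axes[1])
    plt.show()
    return fig
```

A typical flow would be `train = add_length_features(train)` followed by `train[iqr_outliers(train["char_count"])]` to inspect unusually short or long tweets by hand.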


## TODO 4: Word Frequency Analysis

**Goal**: Identify common words and patterns in disaster vs non-disaster tweets

**Steps**:
1. Create word frequency analysis:
   - Most common words overall
   - Most common words by class (disaster vs non-disaster)
   - Word clouds for visual representation
2. Analyze differences between classes:
   - Words unique to disaster tweets
   - Words unique to non-disaster tweets
   - Statistical significance of word differences

**Hint**: Use `collections.Counter` for counting and the `wordcloud` library for visualization

**Expected Output**: Understanding of vocabulary patterns that might help classification


```python
# TODO 4: Word frequency analysis
# Your implementation here
```
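
The per-class counting in TODO 4 can be sketched with the standard library alone. The tokenizer below is deliberately naive (lowercase, then runs of letters, digits, and apostrophes), so URL fragments such as `http` or `co` will show up as words until the text is cleaned. The smoothed log-odds score is one simple way to rank words by how strongly they lean toward a class; it measures effect size and is not a significance test. All function names are illustrative.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Naive tokenizer: lowercase, keep runs of letters, digits, apostrophes."""
    return TOKEN_RE.findall(text.lower())


def counts_by_class(texts, labels):
    """Map each label to a Counter of the words seen in that class."""
    counts = {}
    for text, label in zip(texts, labels):
        counts.setdefault(label, Counter()).update(tokenize(text))
    return counts


def unique_words(counts, label, other):
    """Words that occur in `label` tweets but never in `other` tweets."""
    return set(counts[label]) - set(counts[other])


def log_odds(counts, label, other, smoothing=1.0):
    """Smoothed log ratio of word probabilities; > 0 leans toward `label`."""
    vocab = set(counts[label]) | set(counts[other])
    total_a = sum(counts[label].values()) + smoothing * len(vocab)
    total_b = sum(counts[other].values()) + smoothing * len(vocab)
    return {
        word: math.log((counts[label][word] + smoothing) / total_a)
        - math.log((counts[other][word] + smoothing) / total_b)
        for word in vocab
    }
```

With the training data in a DataFrame `train`, `counts = counts_by_class(train["text"], train["target"])` followed by `counts[1].most_common(20)` lists the top disaster-class words. Each `Counter` can also be passed to `WordCloud().generate_from_frequencies(...)` from the `wordcloud` package. For an actual significance test, a chi-square test on per-word contingency tables (for example with `scipy.stats.chi2_contingency`) is one option.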


## TODO 5: Data Quality Assessment

**Goal**: Identify data quality issues and preprocessing requirements

**Steps**:
1. Check for missing values in all columns
2. Identify duplicate tweets
3. Analyze special characters and URLs:
   - Count of URLs per tweet
   - Count of hashtags, mentions, emojis
   - Special character patterns
4. Sample and manually review examples from each class

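**Hint**: Use `df.isnull().sum()`, `df.duplicated(subset='text').sum()` (a plain `df.duplicated()` compares whole rows, and the unique `id` prevents any match), and regex patterns for special characters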

**Expected Output**: List of data quality issues and preprocessing steps needed


```python
# TODO 5: Data quality assessment
# Your implementation here
```
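
The checks in TODO 5 might look like the sketch below. The regular expressions are simple approximations chosen for illustration: the emoji pattern covers only a few common Unicode blocks, and the mention pattern will also match the domain part of an email address. The `conflicting_labels` entry counts identical tweet texts that carry more than one label, which is worth checking whenever duplicates exist. Helper names are illustrative.

```python
import re

import pandas as pd

# Approximate patterns; refine them after looking at real examples.
PATTERNS = {
    "url_count": re.compile(r"https?://\S+"),
    "hashtag_count": re.compile(r"#\w+"),
    "mention_count": re.compile(r"@\w+"),
    "emoji_count": re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]"),
}


def add_pattern_counts(df: pd.DataFrame, text_col: str = "text") -> pd.DataFrame:
    """Return a copy of df with one count column per pattern."""
    out = df.copy()
    text = out[text_col].fillna("")
    for name, pattern in PATTERNS.items():
        out[name] = text.apply(lambda t, p=pattern: len(p.findall(t)))
    return out


def quality_report(
    df: pd.DataFrame, text_col: str = "text", target_col: str = "target"
) -> dict:
    """Missing values, duplicates, and duplicated texts with differing labels."""
    labels_per_text = df.groupby(text_col)[target_col].nunique()
    return {
        "missing": df.isnull().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "duplicate_texts": int(df.duplicated(subset=text_col).sum()),
        "conflicting_labels": int((labels_per_text > 1).sum()),
    }
```

For the manual review step, `train.groupby("target").sample(n=5, random_state=0)` draws a few examples from each class, assuming every class has at least five rows.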


## TODO 6: Key Insights and Next Steps

**Goal**: Synthesize findings and plan preprocessing strategy

**Steps**:
1. Summarize key findings from exploration:
   - Dataset characteristics
   - Class distribution insights
   - Text length patterns
   - Vocabulary differences
   - Data quality issues
2. Document preprocessing requirements:
   - Text cleaning steps needed
   - Tokenization strategy
   - Vocabulary size considerations
   - Sequence length decisions
3. Plan the train/validation split strategy (`test.csv` has no labels, so any validation data must be held out from `train.csv`)

**Expected Output**: Clear roadmap for preprocessing phase
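
Two of the planning decisions above can be prototyped with pandas alone. The sketch below is one possible approach, not a prescribed one: a stratified hold-out that samples the same fraction from every class, and a sequence-length suggestion taken from a quantile of the observed word counts. The 20% validation fraction, the seed, and the 0.95 quantile are placeholder values, and the function names are illustrative.

```python
import math

import pandas as pd


def stratified_split(
    df: pd.DataFrame, target_col: str = "target", val_frac: float = 0.2, seed: int = 42
):
    """Hold out val_frac of each class for validation.

    Assumes df has a unique index, since the held-out rows are dropped by label.
    """
    val = df.groupby(target_col).sample(frac=val_frac, random_state=seed)
    train = df.drop(val.index)
    return train, val


def suggest_max_len(word_counts: pd.Series, quantile: float = 0.95) -> int:
    """Smallest whole length that covers `quantile` of the tweets."""
    return int(math.ceil(word_counts.quantile(quantile)))
```

Comparing `train["target"].value_counts(normalize=True)` across the two returned frames is a quick check that the class proportions survived the split. Note that the length suggestion here is in whitespace-separated words; the number will differ once a real tokenizer is chosen.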

---

## Phase 2: Transformers Enhancement

*After completing Phase 1, consider these enhancements:*

- Use HuggingFace datasets for efficient data loading
- Leverage pre-trained tokenizers (BERT, RoBERTa)
- Compare custom preprocessing vs. transformer tokenization
- Analyze how transformer tokenization handles special characters differently
