# 🔬 Research Methodology: UNIT 3
## Data Collection, Analysis and Report Writing

**For BSc Zoology Students**

*From Raw Data to Published Research*

---

## 📚 Unit 3 Contents

1. **Observation and Collection of Data**
   - Types of data
   - Observation methods
   - Data recording

2. **Methods of Data Collection**
   - Sampling methods
   - Primary vs Secondary data
   - Data collection instruments

3. **Data Processing and Analysis**
   - Data cleaning and organization
   - Statistical analysis strategies
   - Choosing appropriate tests
   - Interpreting results

4. **Technical Reports and Thesis Writing**
   - Structure of scientific reports
   - Writing each section (IMRAD)
   - Academic writing style

5. **Tables and Bibliography**
   - Creating effective tables
   - Citation styles
   - Reference management

6. **Data Presentation Using Digital Technology**
   - Visualization principles
   - Creating graphs and charts
   - Using Python for data presentation

---

### 🎯 Learning Outcomes

By the end of this unit, you will:
- ✅ Choose appropriate data collection methods
- ✅ Design effective sampling strategies
- ✅ Clean and organize raw data
- ✅ Perform statistical analysis using Python
- ✅ Write scientific reports and a thesis
- ✅ Create publication-quality tables and figures
- ✅ Cite references correctly
- ✅ Present data using digital tools

---

**Created by:** Dr. Alok Patel  
**Institution:** Department of Zoology, Kuchinda College  
**Affiliation:** Sambalpur University

---

## 📋 How to Use This Notebook

1. **Run Setup Cell First** - Load libraries and sample data
2. **Work Through Examples** - Real biological data analysis
3. **Practice with Exercises** - Analyze provided datasets
4. **Apply to Your Data** - Use templates for your research
5. **Export Your Work** - Create publication-ready outputs

Let's master data handling and reporting! 📊

In [None]:
# SETUP: Run this cell first to load all required libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import normaltest, shapiro, levene, mannwhitneyu, kruskal
from ipywidgets import interact, widgets, Layout, VBox, HBox
from IPython.display import display, HTML, Markdown
import warnings
warnings.filterwarnings('ignore')

# Set visualization style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("Set2")
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 11

# Create sample biological datasets for practice
np.random.seed(42)

# Dataset 1: Fish morphometrics
fish_data = pd.DataFrame({
    'Species': ['Labeo rohita']*30 + ['Catla catla']*30,
    'Location': ['Pond A']*15 + ['Pond B']*15 + ['Pond A']*15 + ['Pond B']*15,
    'Length_cm': np.concatenate([
        np.random.normal(25, 3, 15),  # Rohu Pond A
        np.random.normal(28, 3, 15),  # Rohu Pond B
        np.random.normal(30, 4, 15),  # Catla Pond A
        np.random.normal(32, 4, 15)   # Catla Pond B
    ]),
    'Weight_g': np.concatenate([
        np.random.normal(450, 50, 15),
        np.random.normal(520, 50, 15),
        np.random.normal(600, 70, 15),
        np.random.normal(650, 70, 15)
    ])
})

# Dataset 2: Earthworm abundance
earthworm_data = pd.DataFrame({
    'Site': ['Agricultural']*20 + ['Mining']*20 + ['Forest']*20,
    'Season': ['Monsoon', 'Winter']*30,
    'Count': np.concatenate([
        np.random.poisson(15, 20),  # Agricultural
        np.random.poisson(25, 20),  # Mining (metal-tolerant species)
        np.random.poisson(30, 20)   # Forest
    ]),
    'pH': np.concatenate([
        np.random.normal(6.5, 0.5, 20),
        np.random.normal(5.0, 0.8, 20),
        np.random.normal(6.2, 0.4, 20)
    ])
})

# Dataset 3: Butterfly diversity
butterfly_data = pd.DataFrame({
    'Habitat': ['Forest']*25 + ['Agriculture']*25 + ['Urban']*25,
    'Species_Count': np.concatenate([
        np.random.poisson(35, 25),
        np.random.poisson(18, 25),
        np.random.poisson(8, 25)
    ]),
    'Temperature_C': np.random.uniform(25, 35, 75),
    'Canopy_Cover_%': np.concatenate([
        np.random.uniform(70, 90, 25),
        np.random.uniform(10, 30, 25),
        np.random.uniform(5, 15, 25)
    ])
})

print('✅ All libraries loaded successfully!')
print('📊 Sample datasets created:')
print('   • Fish morphometrics (60 samples)')
print('   • Earthworm abundance (60 samples)')
print('   • Butterfly diversity (75 samples)')
print('🔬 Ready for Unit 3: Data Collection, Analysis & Reporting!')
print('\n' + '='*70)
print('From Raw Data to Publication - The Complete Journey!')
print('='*70)

---

# 📝 Section 1: Observation and Collection of Data

---

## 1.1 Types of Data in Biological Research

### 📊 Classification by Nature

#### 1. **Quantitative Data** (Numbers)

Data that can be measured numerically.

**A. Discrete (Count) Data**
- Can only take specific values (usually whole numbers)
- Examples:
  - Number of fish in a pond: 150, 151, 152 (not 150.5)
  - Number of butterfly species: 5, 6, 7 (not 5.3)
  - Egg count: 10, 11, 12
  - Number of chromosomes: 46, 48

**B. Continuous Data**
- Can take any value within a range
- Examples:
  - Fish length: 25.3 cm, 25.35 cm, 25.352 cm
  - Water temperature: 28.5°C, 28.56°C
  - Body weight: 450.2 g
  - pH: 6.5, 6.53, 6.534

---

#### 2. **Qualitative Data** (Categories)

Data that describes qualities or characteristics.

**A. Nominal (Named) Data**
- Categories with no inherent order
- Examples:
  - Species: Labeo rohita, Catla catla, Cirrhinus mrigala
  - Color: Red, Blue, Green, Yellow
  - Sex: Male, Female
  - Habitat type: Forest, Grassland, Wetland

**B. Ordinal (Ordered) Data**
- Categories with meaningful order
- Examples:
  - Life stage: Egg < Larva < Pupa < Adult
  - Pollution level: Low < Medium < High
  - Abundance: Rare < Occasional < Common < Abundant
  - Health status: Poor < Fair < Good < Excellent
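
In pandas, the difference between nominal and ordinal data can be made explicit with an ordered categorical. The sketch below is only an illustration: the abundance scores are invented, and the variable names (`levels`, `scores`, `abundance`) are assumptions rather than part of the sample datasets.

```python
import pandas as pd

# Hypothetical field scores; the labels follow the abundance scale listed above
levels = ["Rare", "Occasional", "Common", "Abundant"]
scores = pd.Series(["Common", "Rare", "Abundant", "Common", "Occasional"])

# Declare the order explicitly so pandas treats the column as ordinal
abundance = pd.Series(pd.Categorical(scores, categories=levels, ordered=True))

# Order-aware operations now work: comparisons, min/max, sorting
at_least_common = abundance >= "Common"
print(abundance.min(), abundance.max())
print(at_least_common.tolist())
```

Without `ordered=True`, the same comparison raises an error, which is a useful safeguard against treating nominal categories as if they were ranked.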

---

### 🎯 Why Data Type Matters

**Different data types require different:**
- Statistical tests
- Visualization methods
- Summary statistics
- Analysis approaches

| Data Type | Summary Statistic | Visualization | Statistical Test |
|-----------|-------------------|---------------|------------------|
| **Continuous** | Mean, SD | Histogram, Box plot | t-test, ANOVA |
| **Discrete (Count)** | Median, Range | Bar chart | Chi-square, Poisson |
| **Nominal** | Mode, Frequency | Pie chart, Bar chart | Chi-square |
| **Ordinal** | Median, Mode | Ordered bar chart | Mann-Whitney U |

---

### 📚 Classification by Source

#### 1. **Primary Data**

**Definition:** Data collected firsthand by the researcher for the specific study.

**Advantages:**
- ✅ Specific to your research question
- ✅ You control quality
- ✅ You know how it was collected
- ✅ Fresh, current data

**Disadvantages:**
- ❌ Time-consuming
- ❌ Expensive
- ❌ Requires expertise
- ❌ May need permissions/ethics approval

**Examples:**
- Field surveys you conduct
- Experiments you perform
- Measurements you take
- Observations you record

---

#### 2. **Secondary Data**

**Definition:** Data originally collected by someone else, often for a different purpose, and reused in your study.

**Advantages:**
- ✅ Saves time and money
- ✅ Large datasets available
- ✅ Historical data accessible
- ✅ Can compare your data with existing records

**Disadvantages:**
- ❌ May not fit your needs exactly
- ❌ Unknown quality/reliability
- ❌ May be outdated
- ❌ Limited control over variables

**Sources:**
- Published papers and journals
- Government databases (Forest Survey of India)
- Museum collections
- Online databases (GenBank, GBIF)
- Weather stations data
- Agricultural department records

**Best Practice:** Combine both!
- Use secondary data for context and comparison
- Collect primary data for specific questions

In [None]:
# Interactive Demo: Identifying and Handling Different Data Types

def demonstrate_data_types():
    """
    Show how different data types are handled in analysis
    """
    print("="*80)
    print("🔍 EXAMINING OUR SAMPLE DATASETS")
    print("="*80)
    
    # Dataset 1: Fish Data
    print("\n📊 Dataset 1: Fish Morphometrics")
    print("─" * 40)
    print(fish_data.head())
    print("\nData Types:")
    print(fish_data.dtypes)
    print("\nClassification:")
    print("  • Species: NOMINAL (categorical, no order)")
    print("  • Location: NOMINAL (categorical, no order)")
    print("  • Length_cm: CONTINUOUS (can be any value)")
    print("  • Weight_g: CONTINUOUS (can be any value)")
    
    # Summary statistics
    print("\nSummary Statistics (for continuous variables):")
    print(fish_data[['Length_cm', 'Weight_g']].describe())
    
    # Frequency counts for categorical
    print("\nFrequency Counts (for categorical variables):")
    print("Species:")
    print(fish_data['Species'].value_counts())
    
    # Create visualizations
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    
    # 1. Continuous data: Histogram
    ax1 = axes[0, 0]
    ax1.hist(fish_data['Length_cm'], bins=15, edgecolor='black', alpha=0.7, color='steelblue')
    ax1.set_xlabel('Fish Length (cm)', fontweight='bold')
    ax1.set_ylabel('Frequency', fontweight='bold')
    ax1.set_title('CONTINUOUS DATA\nHistogram of Fish Length', fontweight='bold')
    ax1.grid(axis='y', alpha=0.3)
    
    # 2. Continuous data: Box plot by category
    ax2 = axes[0, 1]
    fish_data.boxplot(column='Weight_g', by='Species', ax=ax2, grid=False)
    ax2.set_xlabel('Species', fontweight='bold')
    ax2.set_ylabel('Weight (g)', fontweight='bold')
    ax2.set_title('CONTINUOUS by NOMINAL\nWeight by Species', fontweight='bold')
    plt.sca(ax2)
    plt.xticks(rotation=45, ha='right')
    
    # 3. Categorical data: Bar chart
    ax3 = axes[1, 0]
    species_counts = fish_data['Species'].value_counts()
    ax3.bar(range(len(species_counts)), species_counts.values, 
            color=['coral', 'lightgreen'], edgecolor='black')
    ax3.set_xticks(range(len(species_counts)))
    ax3.set_xticklabels(species_counts.index, rotation=45, ha='right')
    ax3.set_ylabel('Count', fontweight='bold')
    ax3.set_title('NOMINAL DATA\nSpecies Frequency', fontweight='bold')
    ax3.grid(axis='y', alpha=0.3)
    
    # 4. Earthworm ordinal data example
    ax4 = axes[1, 1]
    site_order = ['Forest', 'Agricultural', 'Mining']  # Assumed order: least to most disturbed
    site_counts = earthworm_data.groupby('Site')['Count'].mean().reindex(site_order)
    colors_gradient = plt.cm.YlOrRd(np.linspace(0.3, 0.9, len(site_order)))
    ax4.bar(range(len(site_counts)), site_counts.values, color=colors_gradient, edgecolor='black')
    ax4.set_xticks(range(len(site_counts)))
    ax4.set_xticklabels(site_order, rotation=45, ha='right')
    ax4.set_ylabel('Mean Earthworm Count', fontweight='bold')
    ax4.set_title('ORDINAL DATA\nSites Ordered by Disturbance', fontweight='bold')
    ax4.grid(axis='y', alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Statistical test recommendations
    print("\n" + "="*80)
    print("📈 RECOMMENDED STATISTICAL TESTS BY DATA TYPE")
    print("="*80)
    print("\n1. Comparing Fish Length between Two Species (Continuous):")
    print("   → Independent t-test (if normally distributed)")
    print("   → Mann-Whitney U test (if not normal)")
    
    print("\n2. Comparing Earthworm Counts across 3 Sites (Discrete/Count):")
    print("   → Kruskal-Wallis test (non-parametric)")
    print("   → ANOVA (if normally distributed)")
    
    print("\n3. Association between Species and Location (Both Nominal):")
    print("   → Chi-square test of independence")
    
    print("\n4. Relationship between Length and Weight (Both Continuous):")
    print("   → Pearson correlation (if linear relationship)")
    print("   → Linear regression")
    print("="*80)

# Run the demonstration
demonstrate_data_types()

---

# 🎯 Section 2: Methods of Data Collection

---

## 2.1 Sampling Methods

### 🎯 What is Sampling?

**Sampling** = Selecting a subset (sample) from a larger group (population) to make inferences about the whole.

**Why Sample?** (Why not study the entire population?)

| Reason | Example |
|--------|----------|
| **Impractical** | Cannot count ALL fish in Mahanadi river |
| **Expensive** | Testing water quality at every point costs too much |
| **Time-consuming** | Surveying every forest patch takes years |
| **Destructive** | Can't sacrifice all animals for study |
| **Infinite population** | The set of all possible measurements has no end, so it can never be measured in full |

---

### 📊 Types of Sampling Methods

#### **A. PROBABILITY SAMPLING**

Every member of the population has a known, non-zero chance of selection.

##### 1. **Simple Random Sampling**

**Method:** Every individual has an equal chance of selection.

**How to do it:**
- Assign number to each individual
- Use random number generator
- Select those numbers

**Example:**
```
Population: 1000 fish in a pond
Sample needed: 50 fish
Method: Net fish, number them 1-1000, use random number
        generator to select 50 numbers
```
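
The same selection can be sketched in Python. This is a minimal illustration that assumes the 1000 fish have already been numbered; the seed and the names `population_ids` and `sample_ids` are chosen only for this example.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed only so the draw is reproducible

population_ids = np.arange(1, 1001)   # fish numbered 1-1000
# Draw 50 distinct ID numbers; replace=False means no fish is picked twice
sample_ids = rng.choice(population_ids, size=50, replace=False)

print(np.sort(sample_ids))
```

Sampling without replacement matches the field situation, where the same fish should not be measured twice.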

**Advantages:**
- ✅ Unbiased
- ✅ Simple to understand
- ✅ Statistical theory works well

**Disadvantages:**
- ❌ Need complete list of population
- ❌ May miss rare groups
- ❌ Geographically scattered samples

---

##### 2. **Stratified Random Sampling**

**Method:** Divide population into groups (strata), then randomly sample from each.

**When to use:** Population has distinct subgroups that should all be represented.

**Example:**
```
Studying fish in a lake with 3 zones:
- Shallow zone (20% of lake)
- Medium depth (50% of lake)
- Deep zone (30% of lake)

Sample 100 fish total:
- 20 from shallow
- 50 from medium
- 30 from deep
(Proportional to zone size)
```
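
Proportional allocation like this can be computed rather than worked out by hand. The sketch below assumes a lake of 1000 fish split 20% / 50% / 30% across the zones; the stratum sizes and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Assumed stratum sizes for a lake of 1000 fish (20% / 50% / 30%)
strata_sizes = {"shallow": 200, "medium": 500, "deep": 300}
total_sample = 100
population_total = sum(strata_sizes.values())

stratified_sample = {}
for zone, size in strata_sizes.items():
    # Sample size in each stratum is proportional to the stratum's share
    n_zone = round(total_sample * size / population_total)
    # Fish within a stratum are identified by index 0..size-1
    stratified_sample[zone] = rng.choice(size, size=n_zone, replace=False)

allocation = {zone: len(ids) for zone, ids in stratified_sample.items()}
print(allocation)
```

With less convenient proportions, rounding can make the stratum sizes add up to slightly more or less than the intended total, so the allocation should be checked before fieldwork.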

**Advantages:**
- ✅ Ensures all subgroups represented
- ✅ More precise than simple random
- ✅ Can compare strata

**Disadvantages:**
- ❌ Need to know strata in advance
- ❌ More complex

---

##### 3. **Systematic Sampling**

**Method:** Select every k-th individual after a random start.

**How to do it:**
```
Population: N = 1000
Sample needed: n = 50
Sampling interval: k = N/n = 1000/50 = 20

Steps:
1. Randomly start between 1-20, say 7
2. Select: 7, 27, 47, 67, 87... (every 20th)
```

**Example:**
```
Sampling earthworms along a transect:
- Transect = 100 meters
- Need 10 samples
- Place quadrats every 10 meters
- Random start at 3m: sample at 3, 13, 23, 33... meters
```
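
The quadrat positions for such a transect can be generated with a few lines of Python. This is a sketch under the assumptions of the example above (100 m transect, 10 quadrats); the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

transect_length_m = 100
n_samples = 10
interval_m = transect_length_m // n_samples   # sampling interval k = 10 m

# Random start somewhere in the first interval, then a fixed step of k metres
start_m = int(rng.integers(0, interval_m))
positions_m = start_m + interval_m * np.arange(n_samples)

print(positions_m)
```

Only the starting point is random; every later position follows from it, which is why a periodic pattern along the transect can bias the result.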

**Advantages:**
- ✅ Easy to implement
- ✅ Good spatial coverage
- ✅ Less bias than convenience sampling

**Disadvantages:**
- ❌ If pattern in population matches interval, bias occurs
- ❌ Less precise than stratified

---

##### 4. **Cluster Sampling**

**Method:** Divide area into clusters, randomly select clusters, sample all within selected clusters.

**Example:**
```
Studying butterflies in Western Odisha:
- Divide region into 100 grid squares (clusters)
- Randomly select 10 squares
- Survey ALL butterflies in those 10 squares
```
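
A rough sketch of the two stages in Python follows. The butterfly counts per square are simulated purely so the example runs; they are not real survey data, and the names used are assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

n_clusters = 100   # grid squares covering the region
n_selected = 10    # squares that will actually be surveyed

# Stage 1: choose whole grid squares at random
selected_squares = np.sort(rng.choice(n_clusters, size=n_selected, replace=False))

# Hypothetical butterfly counts per square (simulated for illustration only)
counts_per_square = rng.poisson(lam=20, size=n_clusters)

# Stage 2: everything inside a selected square is counted
surveyed_counts = counts_per_square[selected_squares]

# Scale the mean count per surveyed square up to the whole region
estimated_total = surveyed_counts.mean() * n_clusters
print(selected_squares, estimated_total)
```

The estimate depends entirely on which squares happen to be drawn, which is the higher sampling error that cluster sampling trades for lower field cost.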

**Advantages:**
- ✅ Cost-effective for large areas
- ✅ Logistically easier
- ✅ Don't need complete population list

**Disadvantages:**
- ❌ Higher sampling error
- ❌ Clusters may not represent population well

---

#### **B. NON-PROBABILITY SAMPLING**

Individuals do not have a known chance of selection, so results **cannot be reliably generalized to the population!**

##### 1. **Convenience Sampling**

**Method:** Sample what's easily accessible.

**Example:** Collecting fish from fishermen's catch (only certain species, sizes caught).

**When acceptable:**
- Pilot studies
- Exploratory research
- When the population is difficult to access

**Warning:** ⚠️ **Highly biased!** Cannot make statistical inferences.

---

##### 2. **Purposive/Judgmental Sampling**

**Method:** Researcher deliberately selects specific individuals based on criteria.

**Example:** Selecting only large, healthy fish for breeding study.

**When acceptable:**
- Case studies
- Specific expertise needed
- Studying rare species

---

### 🎯 How to Choose a Sampling Method?

| If your situation is... | Use this method |
|-------------------------|------------------|
| Homogeneous population, simple study | Simple Random |
| Distinct subgroups to compare | Stratified |
| Linear/spatial arrangement | Systematic |
| Large area, budget limited | Cluster |
| Pilot study, exploratory | Convenience |
| Specific cases needed | Purposive |

---

### 📏 Sample Size Determination

**How many samples do I need?**

Depends on:
1. **Variability** - More variable = larger sample
2. **Desired precision** - More precise = larger sample
3. **Confidence level** - Higher confidence = larger sample
4. **Effect size** - Smaller effect = larger sample

**General Formula for Mean Estimation:**

$$n = \frac{Z^2 \times \sigma^2}{E^2}$$

Where:
- n = sample size
- Z = Z-score (1.96 for 95% confidence)
- σ = population standard deviation (estimate from pilot)
- E = margin of error (acceptable error)

**Rule of Thumb for Biological Studies:**
- Minimum: n = 30 (for basic statistics)
- Good: n = 50-100 per group
- Better: Use power analysis software
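
As a worked illustration of the formula, the sketch below computes n for an assumed pilot estimate. The values σ = 3 cm and E = 1 cm are examples only, and `sample_size_for_mean` is a hypothetical helper, not a library function.

```python
import math

def sample_size_for_mean(sigma, margin_of_error, z=1.96):
    """n = Z^2 * sigma^2 / E^2, rounded up to the next whole individual."""
    return math.ceil((z ** 2) * (sigma ** 2) / (margin_of_error ** 2))

# Assumed pilot estimate: SD of fish length = 3 cm; acceptable error = 1 cm
n_required = sample_size_for_mean(sigma=3.0, margin_of_error=1.0)
print(n_required)  # 35
```

Halving the margin of error to 0.5 cm raises the requirement to 139 fish, because n grows with 1/E².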

In [None]:
# Interactive Demo: Different Sampling Methods

def demonstrate_sampling_methods():
    """
    Visualize different sampling methods and their effects
    """
    # Create a "population" of 1000 fish
    np.random.seed(42)
    population_size = 1000
    
    # Population has 3 zones with different mean lengths
    zone1_fish = np.random.normal(25, 3, 300)  # Shallow: smaller fish
    zone2_fish = np.random.normal(28, 3, 500)  # Medium: medium fish
    zone3_fish = np.random.normal(32, 4, 200)  # Deep: larger fish
    
    population = np.concatenate([zone1_fish, zone2_fish, zone3_fish])
    true_mean = population.mean()
    true_std = population.std()
    
    # Apply different sampling methods
    sample_size = 100
    
    # 1. Simple Random Sampling
    random_sample = np.random.choice(population, sample_size, replace=False)
    
    # 2. Stratified Sampling (proportional)
    n1 = int(sample_size * 0.3)  # 30% from zone 1
    n2 = int(sample_size * 0.5)  # 50% from zone 2
    n3 = sample_size - n1 - n2   # 20% from zone 3
    
    stratified_sample = np.concatenate([
        np.random.choice(zone1_fish, n1, replace=False),
        np.random.choice(zone2_fish, n2, replace=False),
        np.random.choice(zone3_fish, n3, replace=False)
    ])
    
    # 3. Systematic Sampling
    k = population_size // sample_size
    start = np.random.randint(0, k)
    systematic_sample = population[start::k][:sample_size]
    
    # 4. Convenience Sampling (biased - only from zone 1, easiest to access)
    convenience_sample = np.random.choice(zone1_fish, sample_size, replace=False)
    
    # Visualization
    fig, axes = plt.subplots(2, 3, figsize=(18, 10))
    
    # Population distribution
    ax0 = axes[0, 0]
    ax0.hist(population, bins=30, alpha=0.7, color='gray', edgecolor='black')
    ax0.axvline(true_mean, color='red', linewidth=3, linestyle='--', label=f'True Mean: {true_mean:.2f}')
    ax0.set_xlabel('Fish Length (cm)', fontweight='bold')
    ax0.set_ylabel('Frequency', fontweight='bold')
    ax0.set_title('POPULATION (N=1000)\nAll fish in the lake', fontweight='bold', fontsize=12)
    ax0.legend()
    ax0.grid(axis='y', alpha=0.3)
    
    # Simple Random
    ax1 = axes[0, 1]
    ax1.hist(random_sample, bins=15, alpha=0.7, color='steelblue', edgecolor='black')
    ax1.axvline(random_sample.mean(), color='blue', linewidth=3, linestyle='--', 
                label=f'Sample Mean: {random_sample.mean():.2f}')
    ax1.axvline(true_mean, color='red', linewidth=2, linestyle=':', label=f'True Mean: {true_mean:.2f}')
    ax1.set_xlabel('Fish Length (cm)', fontweight='bold')
    ax1.set_title('SIMPLE RANDOM\n(n=100)', fontweight='bold', fontsize=12)
    ax1.legend(fontsize=9)
    ax1.grid(axis='y', alpha=0.3)
    
    # Stratified
    ax2 = axes[0, 2]
    ax2.hist(stratified_sample, bins=15, alpha=0.7, color='green', edgecolor='black')
    ax2.axvline(stratified_sample.mean(), color='darkgreen', linewidth=3, linestyle='--',
                label=f'Sample Mean: {stratified_sample.mean():.2f}')
    ax2.axvline(true_mean, color='red', linewidth=2, linestyle=':', label=f'True Mean: {true_mean:.2f}')
    ax2.set_xlabel('Fish Length (cm)', fontweight='bold')
    ax2.set_title('STRATIFIED RANDOM\n(n=100, proportional)', fontweight='bold', fontsize=12)
    ax2.legend(fontsize=9)
    ax2.grid(axis='y', alpha=0.3)
    
    # Systematic
    ax3 = axes[1, 0]
    ax3.hist(systematic_sample, bins=15, alpha=0.7, color='orange', edgecolor='black')
    ax3.axvline(systematic_sample.mean(), color='darkorange', linewidth=3, linestyle='--',
                label=f'Sample Mean: {systematic_sample.mean():.2f}')
    ax3.axvline(true_mean, color='red', linewidth=2, linestyle=':', label=f'True Mean: {true_mean:.2f}')
    ax3.set_xlabel('Fish Length (cm)', fontweight='bold')
    ax3.set_ylabel('Frequency', fontweight='bold')
    ax3.set_title('SYSTEMATIC\n(every 10th fish)', fontweight='bold', fontsize=12)
    ax3.legend(fontsize=9)
    ax3.grid(axis='y', alpha=0.3)
    
    # Convenience (Biased!)
    ax4 = axes[1, 1]
    ax4.hist(convenience_sample, bins=15, alpha=0.7, color='red', edgecolor='black')
    ax4.axvline(convenience_sample.mean(), color='darkred', linewidth=3, linestyle='--',
                label=f'Sample Mean: {convenience_sample.mean():.2f}')
    ax4.axvline(true_mean, color='blue', linewidth=2, linestyle=':', label=f'True Mean: {true_mean:.2f}')
    ax4.set_xlabel('Fish Length (cm)', fontweight='bold')
    ax4.set_title('CONVENIENCE\n(only shallow zone - BIASED!)', 
                  fontweight='bold', fontsize=12, color='red')
    ax4.legend(fontsize=9)
    ax4.grid(axis='y', alpha=0.3)
    
    # Comparison of accuracies
    ax5 = axes[1, 2]
    methods = ['Random', 'Stratified', 'Systematic', 'Convenience']
    means = [random_sample.mean(), stratified_sample.mean(), 
             systematic_sample.mean(), convenience_sample.mean()]
    errors = [abs(m - true_mean) for m in means]
    
    colors_bar = ['steelblue', 'green', 'orange', 'red']
    bars = ax5.bar(methods, errors, color=colors_bar, edgecolor='black', linewidth=2)
    ax5.set_ylabel('Absolute Error from True Mean (cm)', fontweight='bold')
    ax5.set_title('ACCURACY COMPARISON\n(Lower is better)', fontweight='bold', fontsize=12)
    ax5.grid(axis='y', alpha=0.3)
    
    # Add value labels
    for bar, error in zip(bars, errors):
        height = bar.get_height()
        ax5.text(bar.get_x() + bar.get_width()/2., height,
                f'{error:.3f}',
                ha='center', va='bottom', fontweight='bold')
    
    plt.tight_layout()
    plt.show()
    
    # Print summary
    print("\n" + "="*80)
    print("📊 SAMPLING METHOD COMPARISON")
    print("="*80)
    print(f"\n🎯 TRUE POPULATION MEAN: {true_mean:.2f} cm")
    print(f"📏 POPULATION STD: {true_std:.2f} cm")
    print("\n" + "─"*80)
    
    for method, mean, error in zip(methods, means, errors):
        accuracy = "✅ Accurate" if error < 0.5 else "⚠️ Moderate" if error < 1.0 else "❌ Biased"
        print(f"\n{method:15} Mean: {mean:.2f} cm | Error: {error:.3f} cm | {accuracy}")
    
    print("\n" + "="*80)
    print("💡 KEY LESSONS")
    print("="*80)
    print("\n1. STRATIFIED sampling often most accurate (accounts for groups)")
    print("2. SIMPLE RANDOM and SYSTEMATIC both good, similar accuracy")
    print("3. CONVENIENCE sampling is BIASED - underestimates true mean!")
    print("   (Only sampled small fish from shallow zone)")
    print("4. Use PROBABILITY sampling for valid statistical inference!")
    print("="*80)

# Run demonstration
demonstrate_sampling_methods()

---

# 📈 Section 3: Data Processing and Analysis

---

## 3.1 Data Processing Workflow

### 📋 Step-by-Step Data Processing

#### **Step 1: Data Entry and Organization**

**Best Practices:**
- Use spreadsheets (Excel, Google Sheets) or databases
- One row per observation
- One column per variable
- Use clear, consistent variable names
- Include units in column names
- Date format: YYYY-MM-DD

**Example:**
```
Good:
Sample_ID | Date       | Species      | Length_cm | Weight_g
F001      | 2024-01-15 | Labeo rohita | 25.3      | 450.2

Bad:
1 | 15/1/24 | Rohu | 25.3cm | 450.2 grams
```

---

#### **Step 2: Data Cleaning**

**Check for:**

1. **Missing Values**
   - Mark as NA or blank (not 0!)
   - Decide: delete, impute, or analyze separately

2. **Outliers**
   - Extremely high/low values
   - Could be errors OR real biological variation
   - Verify before removing!

3. **Inconsistencies**
   - Spelling variations: "Labeo rohita" vs "L. rohita"
   - Unit mismatches: mixing cm and mm
   - Date format variations

4. **Impossible Values**
   - Negative lengths
   - Weights > 1000x normal
   - Dates in the future
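
The four checks can be sketched with pandas as follows. The records are invented to show typical problems, and the 1.5 × IQR rule is one common convention for flagging outliers, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical raw records containing typical problems
raw = pd.DataFrame({
    "Species": ["Labeo rohita", "L. rohita", "Catla catla", "Labeo rohita"],
    "Length_cm": [25.3, -4.0, 30.1, np.nan],
    "Weight_g": [450.2, 470.0, 610.5, 455.0],
})

clean = raw.copy()

# Inconsistencies: map spelling variants onto one name
clean["Species"] = clean["Species"].replace({"L. rohita": "Labeo rohita"})

# Impossible values: a non-positive length becomes missing, not zero
clean.loc[clean["Length_cm"] <= 0, "Length_cm"] = np.nan

# Missing values: count them before deciding how to handle them
n_missing = int(clean["Length_cm"].isna().sum())

# Outliers: flag (do not delete) weights beyond 1.5 x IQR from the quartiles
q1, q3 = clean["Weight_g"].quantile([0.25, 0.75])
iqr = q3 - q1
clean["Weight_outlier"] = ~clean["Weight_g"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print(clean)
```

Here the flagged weight belongs to the only *Catla catla* record, so it is more likely real biological variation than an error, which is why flagged values should be verified rather than removed automatically.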

---

#### **Step 3: Data Transformation**

Data sometimes need to be transformed before analysis:

| Transformation | When to Use | Example |
|----------------|-------------|----------|
| **Log** | Right-skewed data | Body size, abundance |
| **Square root** | Count data | Number of individuals |
| **Standardization** | Different units/scales | Combine length and weight |
| **Normalization** | 0-1 scale | Machine learning |
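
Each transformation in the table is a one-line operation in NumPy. The counts below are invented, and `np.log1p` (log of x + 1) is used here as an assumption so that zero counts stay defined.

```python
import numpy as np

# Hypothetical right-skewed counts (e.g. individuals per quadrat)
counts = np.array([0, 1, 2, 3, 5, 8, 40], dtype=float)

log_counts = np.log1p(counts)    # log(x + 1), safe when zeros are present
sqrt_counts = np.sqrt(counts)    # square root, often used for count data

# Standardization (z-scores): mean 0 and standard deviation 1
z_scores = (counts - counts.mean()) / counts.std(ddof=1)

# Min-max normalization: rescale to the 0-1 range
normalized = (counts - counts.min()) / (counts.max() - counts.min())

print(log_counts.round(2))
print(z_scores.round(2))
print(normalized.round(2))
```

Whichever transformation is applied, it should be reported in the methods, and summary values are usually back-transformed to the original units for presentation.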

---

#### **Step 4: Exploratory Data Analysis (EDA)**

**Before statistical testing, EXPLORE your data!**

1. **Summary Statistics**
   - Mean, median, mode
   - Standard deviation, range
   - Quartiles

2. **Visualizations**
   - Histograms (distribution)
   - Box plots (compare groups)
   - Scatter plots (relationships)

3. **Check Assumptions**
   - Normality (Shapiro-Wilk test, Q-Q plots)
   - Homogeneity of variance (Levene's test)
   - Independence

---

### 🎯 Choosing Statistical Tests

**Decision Tree:**

```
What is your research question?
│
├─ Compare means of 2 groups?
│  │
│  ├─ Normal distribution? → t-test
│  └─ Not normal? → Mann-Whitney U test
│
├─ Compare means of 3+ groups?
│  │
│  ├─ Normal distribution? → ANOVA
│  └─ Not normal? → Kruskal-Wallis test
│
├─ Test relationship between 2 variables?
│  │
│  ├─ Both continuous? → Correlation/Regression
│  └─ Both categorical? → Chi-square test
│
└─ Predict one variable from others? → Regression
```
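
The group-comparison branches of the tree can be expressed as a small helper. This is a simplified sketch: `compare_groups` is a hypothetical function, it checks only normality (via Shapiro-Wilk) and ignores the equal-variance assumption, and the three groups are simulated for illustration.

```python
import numpy as np
from scipy import stats

def compare_groups(*groups, alpha=0.05):
    """Pick a test for 2 or more independent groups (simplified decision rule)."""
    # Treat the data as normal only if no group shows evidence against normality
    all_normal = all(stats.shapiro(g).pvalue > alpha for g in groups)
    if len(groups) == 2:
        if all_normal:
            name, result = "t-test", stats.ttest_ind(*groups)
        else:
            name, result = "Mann-Whitney U", stats.mannwhitneyu(*groups)
    else:
        if all_normal:
            name, result = "ANOVA", stats.f_oneway(*groups)
        else:
            name, result = "Kruskal-Wallis", stats.kruskal(*groups)
    return name, result.pvalue

# Simulated species counts for three habitats (illustrative values only)
rng = np.random.default_rng(0)
forest = rng.normal(30, 4, 25)
farm = rng.normal(20, 4, 25)
urban = rng.normal(10, 4, 25)

test_name, p_value = compare_groups(forest, farm, urban)
print(test_name, p_value)
```

A significant ANOVA or Kruskal-Wallis result says only that at least one group differs; a post-hoc comparison is still needed to find out which.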

In [None]:
# Complete Data Analysis Workflow Example

def complete_analysis_workflow():
    """
    Demonstrate complete data analysis from raw data to results
    """
    print("="*80)
    print("📊 COMPLETE DATA ANALYSIS WORKFLOW")
    print("Research Question: Does fish weight differ between two ponds?")
    print("="*80)
    
    # STEP 1: Load and examine data
    print("\n" + "─"*80)
    print("STEP 1: DATA EXAMINATION")
    print("─"*80)
    
    # Filter for Labeo rohita only
    rohu_data = fish_data[fish_data['Species'] == 'Labeo rohita'].copy()
    
    print("\nFirst few rows:")
    print(rohu_data.head())
    
    print("\nData shape:", rohu_data.shape)
    print("Missing values:", rohu_data.isnull().sum().sum())
    
    # STEP 2: Summary statistics
    print("\n" + "─"*80)
    print("STEP 2: SUMMARY STATISTICS")
    print("─"*80)
    
    summary = rohu_data.groupby('Location')['Weight_g'].describe()
    print("\n", summary)
    
    # STEP 3: Visualize data
    print("\n" + "─"*80)
    print("STEP 3: DATA VISUALIZATION")
    print("─"*80)
    
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    
    # 3a. Box plot
    ax1 = axes[0, 0]
    rohu_data.boxplot(column='Weight_g', by='Location', ax=ax1, grid=False)
    ax1.set_title('Weight Distribution by Pond', fontweight='bold')
    ax1.set_xlabel('Pond', fontweight='bold')
    ax1.set_ylabel('Weight (g)', fontweight='bold')
    plt.sca(ax1)
    plt.xticks([1, 2], ['Pond A', 'Pond B'])
    
    # 3b. Histogram
    ax2 = axes[0, 1]
    pond_a = rohu_data[rohu_data['Location'] == 'Pond A']['Weight_g']
    pond_b = rohu_data[rohu_data['Location'] == 'Pond B']['Weight_g']
    ax2.hist([pond_a, pond_b], bins=10, label=['Pond A', 'Pond B'], 
             alpha=0.7, edgecolor='black')
    ax2.set_xlabel('Weight (g)', fontweight='bold')
    ax2.set_ylabel('Frequency', fontweight='bold')
    ax2.set_title('Weight Distribution', fontweight='bold')
    ax2.legend()
    ax2.grid(axis='y', alpha=0.3)
    
    # 3c. Q-Q plot for normality (Pond A)
    ax3 = axes[1, 0]
    stats.probplot(pond_a, dist="norm", plot=ax3)
    ax3.set_title('Q-Q Plot: Pond A\n(Check Normality)', fontweight='bold')
    ax3.grid(alpha=0.3)
    
    # 3d. Q-Q plot for normality (Pond B)
    ax4 = axes[1, 1]
    stats.probplot(pond_b, dist="norm", plot=ax4)
    ax4.set_title('Q-Q Plot: Pond B\n(Check Normality)', fontweight='bold')
    ax4.grid(alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # STEP 4: Test assumptions
    print("\n" + "─"*80)
    print("STEP 4: TEST ASSUMPTIONS")
    print("─"*80)
    
    # Normality tests
    print("\nNormality Tests (Shapiro-Wilk):")
    stat_a, p_a = shapiro(pond_a)
    stat_b, p_b = shapiro(pond_b)
    print(f"  Pond A: statistic={stat_a:.4f}, p-value={p_a:.4f}")
    print(f"  Pond B: statistic={stat_b:.4f}, p-value={p_b:.4f}")
    
    normal_a = p_a > 0.05
    normal_b = p_b > 0.05
    
    if normal_a and normal_b:
        print("  ✅ No evidence against normality in either group (p > 0.05)")
    else:
        print("  ⚠️  At least one group departs from normality (p ≤ 0.05)")
    
    # Homogeneity of variance
    print("\nHomogeneity of Variance (Levene's Test):")
    stat_lev, p_lev = levene(pond_a, pond_b)
    print(f"  statistic={stat_lev:.4f}, p-value={p_lev:.4f}")
    
    equal_var = p_lev > 0.05
    if equal_var:
        print("  ✅ No evidence of unequal variances (p > 0.05)")
    else:
        print("  ⚠️  Variances differ between groups (p ≤ 0.05)")
    
    # STEP 5: Choose and perform statistical test
    print("\n" + "─"*80)
    print("STEP 5: STATISTICAL TEST")
    print("─"*80)
    
    if normal_a and normal_b:
        # With equal_var=False, SciPy runs Welch's t-test, so name it accordingly
        test_name = "Student's t-test" if equal_var else "Welch's t-test"
        print(f"\nUsing: {test_name} (no evidence against normality)")
        stat, p_value = stats.ttest_ind(pond_a, pond_b, equal_var=equal_var)
    else:
        print("\nUsing: Mann-Whitney U test (data not normally distributed)")
        stat, p_value = mannwhitneyu(pond_a, pond_b, alternative='two-sided')
        test_name = "Mann-Whitney U"
    
    print(f"\nTest statistic: {stat:.4f}")
    print(f"p-value: {p_value:.4f}")
    
    # STEP 6: Interpret results
    print("\n" + "─"*80)
    print("STEP 6: INTERPRETATION")
    print("─"*80)
    
    alpha = 0.05
    mean_a = pond_a.mean()
    mean_b = pond_b.mean()
    diff = mean_b - mean_a
    percent_diff = (diff / mean_a) * 100
    
    print(f"\nPond A mean weight: {mean_a:.1f} g")
    print(f"Pond B mean weight: {mean_b:.1f} g")
    print(f"Difference: {diff:.1f} g ({percent_diff:.1f}%)")
    
    if p_value < alpha:
        # Report the direction from the data instead of assuming Pond B is heavier
        direction = "heavier" if diff > 0 else "lighter"
        print(f"\n✅ SIGNIFICANT DIFFERENCE (p = {p_value:.4f} < 0.05)")
        print(f"\nConclusion: Fish in Pond B are significantly {direction} than")
        print(f"            those in Pond A (mean difference = {diff:.1f} g, {test_name},")
        print(f"            p = {p_value:.4f}).")
    else:
        print(f"\n❌ NO SIGNIFICANT DIFFERENCE (p = {p_value:.4f} ≥ 0.05)")
        print(f"\nConclusion: No significant difference in fish weight between")
        print(f"            Pond A and Pond B ({test_name}, p = {p_value:.4f}).")
    
    # STEP 7: Effect size
    print("\n" + "─"*80)
    print("STEP 7: EFFECT SIZE (Cohen's d)")
    print("─"*80)
    
    # Calculate Cohen's d
    pooled_std = np.sqrt(((len(pond_a)-1)*pond_a.std()**2 + (len(pond_b)-1)*pond_b.std()**2) / (len(pond_a)+len(pond_b)-2))
    cohens_d = (mean_b - mean_a) / pooled_std
    
    print(f"\nCohen's d = {cohens_d:.3f}")
    
    if abs(cohens_d) < 0.2:
        effect = "negligible"
    elif abs(cohens_d) < 0.5:
        effect = "small"
    elif abs(cohens_d) < 0.8:
        effect = "medium"
    else:
        effect = "large"
    
    print(f"Effect size: {effect}")
    
    print("\n" + "="*80)
    print("📝 FINAL REPORT-READY STATEMENT")
    print("="*80)
    
    if p_value < alpha:
        # Word the statement according to the observed direction of the difference
        direction = "heavier" if mean_b > mean_a else "lighter"
        report = f"""\nFish from Pond B (mean = {mean_b:.1f} ± {pond_b.std():.1f} g, n = {len(pond_b)})
were significantly {direction} than fish from Pond A (mean = {mean_a:.1f} ± 
{pond_a.std():.1f} g, n = {len(pond_a)}; {test_name}, p = {p_value:.4f}, 
Cohen's d = {cohens_d:.2f}), representing a {effect} effect size."""
    else:
        report = f"""\nNo significant difference was found in fish weight between Pond B 
(mean = {mean_b:.1f} ± {pond_b.std():.1f} g, n = {len(pond_b)}) and Pond A 
(mean = {mean_a:.1f} ± {pond_a.std():.1f} g, n = {len(pond_a)}; {test_name}, 
p = {p_value:.4f})."""
    
    print(report)
    print("\n" + "="*80)

# Run complete analysis
complete_analysis_workflow()

---

*Unit 3 continues with sections on:*
- *Technical Report Writing (IMRAD structure)*
- *Creating Professional Tables*
- *Bibliography and Citation Management*
- *Advanced Data Visualization*

*This is Part 1 of Unit 3. The complete notebook includes all sections with report templates and citation examples.*

---