# Module 00: Setup & Introduction to Research

Welcome to **Data Science Research Skills**! This course teaches the core skills you need to conduct research as a data science student: finding and reading papers, designing studies, collecting data ethically, and making your work reproducible.

## What You'll Learn in This Module

- Verify your development environment is set up correctly
- Understand what research means in data science
- Learn the complete research process from start to finish
- Discover what makes good research
- Get an overview of the entire learning path

## Prerequisites

- Basic Python knowledge (variables, functions, loops)
- Jupyter Notebook installed and running (you're already using it!)
- Curiosity and enthusiasm for learning

## Time Required

**20 minutes** - Take your time, don't rush!

---

## Part 1: Environment Setup Verification

Let's make sure everything is working correctly before we dive into research concepts!

In [None]:
# ========================================
# Import Essential Libraries
# ========================================

import sys
import platform

# Check Python version
print("Python Version Check")
print("=" * 50)
print(f"Python Version: {sys.version}")
print(f"Platform: {platform.system()} {platform.release()}")
print()

# Python 3.8+ is recommended for this course
if sys.version_info >= (3, 8):
    print("✅ Great! Your Python version is compatible.")
else:
    print("⚠️ Warning: Python 3.8+ is recommended for this course.")

In [None]:
# ========================================
# Test Core Scientific Libraries
# ========================================

# Check that every library required by this course can be imported

import importlib

libraries_to_test = ["numpy", "pandas", "matplotlib", "scipy", "requests"]

print("Library Installation Check")
print("=" * 50)

missing_libraries = []

for library in libraries_to_test:
    try:
        # Try to import the library by name
        importlib.import_module(library)
        print(f"✅ {library:15s} - Installed")
    except ImportError:
        # If the import fails, the library is not installed
        print(f"❌ {library:15s} - NOT FOUND")
        missing_libraries.append(library)

print()

if not missing_libraries:
    print("✅ Excellent! All required libraries are installed.")
else:
    print(f"⚠️ Missing libraries: {', '.join(missing_libraries)}")
    print("\nTo install missing libraries, run:")
    print(f"pip install {' '.join(missing_libraries)}")

In [None]:
# ========================================
# Create Output Directory for This Notebook
# ========================================

import os

# Create a directory to store outputs from this notebook
output_dir = "outputs/notebook_00"

# Create the directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)

print(f"✅ Output directory created: {output_dir}")
print("\nAll files generated in this notebook will be saved there.")

---

## Part 2: What is Research in Data Science?

Let's start with the fundamentals: **What does "research" mean in the context of data science?**

### Research Definition

**Research** is the systematic process of investigating questions, gathering evidence, and drawing conclusions to expand knowledge or solve problems.

### Data Science Research vs. Academic Research

| Aspect | Academic Research | Data Science Research |
|--------|------------------|----------------------|
| **Goal** | Publish papers, advance theory | Solve business problems, build products |
| **Timeline** | Months to years | Days to months |
| **Audience** | Academic community | Business stakeholders, users |
| **Output** | Papers, presentations | Models, dashboards, reports |
| **Rigor** | Very high (peer review) | High (but faster iteration) |

**Good news:** The skills you learn here apply to BOTH types!

### Why Research Skills Matter for Data Scientists

1. **Problem Solving** - Break down complex questions into testable hypotheses
2. **Credibility** - Make data-driven decisions backed by evidence
3. **Learning** - Stay current by reading and understanding papers
4. **Communication** - Explain your work clearly to stakeholders
5. **Reproducibility** - Others can verify and build on your work

---

## Part 3: The Research Process

Research isn't random - it follows a systematic process:

### The Complete Research Cycle

```
1. IDENTIFY PROBLEM/QUESTION
   ↓
2. REVIEW EXISTING LITERATURE
   ↓
3. FORMULATE HYPOTHESIS
   ↓
4. DESIGN METHODOLOGY
   ↓
5. COLLECT DATA
   ↓
6. ANALYZE DATA
   ↓
7. DRAW CONCLUSIONS
   ↓
8. COMMUNICATE RESULTS
   ↓
9. ITERATE (based on feedback)
```

### Let's Break Down Each Step

#### 1. Identify Problem/Question
- **What**: Define what you want to investigate
- **Example**: "Why are customers churning at 20% per month?"
- **Skills Needed**: Domain knowledge, curiosity

#### 2. Review Existing Literature
- **What**: See what others have discovered about similar problems
- **Example**: Search papers on customer churn prediction
- **Skills Needed**: Paper searching, critical reading (Modules 01-02)

#### 3. Formulate Hypothesis
- **What**: Create a testable prediction
- **Example**: "Customers who use <3 features in the first week are 5x more likely to churn"
- **Skills Needed**: Research methodology (Module 03)
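
A hypothesis is "testable" when you can compute the quantity it predicts. The sketch below checks the example hypothesis against a handful of invented user records. The records, the 3-feature threshold split, and the helper name `churn_rate` are illustrative assumptions, not course data, and twelve rows are far too few for a real conclusion.

```python
# Hypothetical records: (features used in the first week, churned?)
# These values are made up purely for illustration.
users = [
    (1, True), (2, True), (2, True), (1, False),
    (4, False), (5, False), (3, False), (6, False),
    (2, True), (5, True), (4, False), (7, False),
]


def churn_rate(records):
    """Return the fraction of users in `records` who churned."""
    return sum(churned for _, churned in records) / len(records)


# Split users at the threshold named in the hypothesis (<3 features)
low_usage = [user for user in users if user[0] < 3]
high_usage = [user for user in users if user[0] >= 3]

low_rate = churn_rate(low_usage)
high_rate = churn_rate(high_usage)

# Relative risk: how many times more likely low-usage users are to churn
relative_risk = low_rate / high_rate

print(f"Churn rate (<3 features):  {low_rate:.0%}")
print(f"Churn rate (>=3 features): {high_rate:.0%}")
print(f"Relative risk: {relative_risk:.1f}x")
print("Consistent with the '5x' hypothesis:", relative_risk >= 5)
```

Because the hypothesis names a threshold and a ratio, the data can either support it or contradict it, which is what separates a hypothesis from a vague hunch.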

#### 4. Design Methodology
- **What**: Plan how you'll test your hypothesis
- **Example**: Design A/B test for onboarding improvements
- **Skills Needed**: Experimental design (Module 04)
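
Part of planning an A/B test is deciding how many users each group needs. Here is a minimal sketch using the standard normal-approximation formula for comparing two proportions. The churn rates (20% versus a hoped-for 15%), the 5% significance level, the 80% power, and the function name `required_sample_size` are all assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def required_sample_size(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate users needed per group for a two-sided, two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_control - p_treatment
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


# Assumed scenario: the new onboarding lowers churn from 20% to 15%
n_per_group = required_sample_size(0.20, 0.15)
print(f"Users needed per group: {n_per_group}")

# A larger expected effect needs far fewer users
n_large_effect = required_sample_size(0.20, 0.10)
print(f"Users needed per group for 20% -> 10%: {n_large_effect}")
```

Running this before the experiment tells you whether the test is feasible with the traffic you have. Module 04 covers experimental design in more depth.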

#### 5. Collect Data
- **What**: Gather the evidence you need
- **Example**: Collect user behavior data, surveys
- **Skills Needed**: Data collection, ethics (Modules 05-06)

#### 6. Analyze Data
- **What**: Process and examine your data
- **Example**: Build churn prediction model, analyze results
- **Skills Needed**: Statistics, ML (separate courses)

#### 7. Draw Conclusions
- **What**: Interpret what the data tells you
- **Example**: "Feature engagement in week 1 predicts churn with 85% accuracy"
- **Skills Needed**: Critical thinking, statistics
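
An accuracy figure only means something next to a baseline. The sketch below assumes, for illustration only, an evaluation set of 1,000 customers in which 20% churned (borrowing the rate from the step 1 example) and compares the quoted 85% against a "model" that always predicts that the customer stays.

```python
# Assumed evaluation set: 1,000 customers, 20% of whom churned
n_customers = 1000
labels = [1] * 200 + [0] * 800  # 1 = churned, 0 = stayed

# Baseline "model": always predict that the customer stays
baseline_predictions = [0] * n_customers
correct = sum(pred == label for pred, label in zip(baseline_predictions, labels))
baseline_accuracy = correct / n_customers

model_accuracy = 0.85  # the figure quoted in the example above
improvement = model_accuracy - baseline_accuracy

print(f"Baseline accuracy: {baseline_accuracy:.0%}")
print(f"Model accuracy:    {model_accuracy:.0%}")
print(f"Improvement over baseline: {improvement * 100:.0f} percentage points")
```

Under these assumptions the model beats a trivial baseline by only a few percentage points, so a careful conclusion reports both numbers.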

#### 8. Communicate Results
- **What**: Share findings with your audience
- **Example**: Present to stakeholders with recommendations
- **Skills Needed**: Documentation, visualization

#### 9. Iterate
- **What**: Refine based on feedback and new questions
- **Example**: Test new hypotheses that emerged
- **Skills Needed**: All of the above!

In [None]:
# ========================================
# Visualize the Research Process
# ========================================

import matplotlib.pyplot as plt
import numpy as np

# Research process steps
steps = [
    "1. Problem",
    "2. Literature",
    "3. Hypothesis",
    "4. Methodology",
    "5. Data Collection",
    "6. Analysis",
    "7. Conclusions",
    "8. Communication",
    "9. Iterate",
]

# Time spent on each step (rough estimate in hours)
time_spent = [5, 10, 3, 8, 15, 20, 8, 10, 5]

# Create a horizontal bar chart
fig, ax = plt.subplots(figsize=(10, 6))

# Create bars
bars = ax.barh(steps, time_spent, color="steelblue", alpha=0.7)

# Add value labels to the right of each bar
for i, hours in enumerate(time_spent):
    ax.text(hours + 0.5, i, f"{hours}h", va="center")

# Formatting
ax.set_xlabel("Typical Time Spent (hours)", fontsize=11)
ax.set_title(
    "Research Process: Time Distribution\n(Example for a two-week project)",
    fontsize=13,
    fontweight="bold",
)
ax.grid(axis="x", alpha=0.3, linestyle="--")

plt.tight_layout()

# Save the figure
plt.savefig(f"{output_dir}/research_process_timeline.png", dpi=150, bbox_inches="tight")
print(f"✅ Chart saved to: {output_dir}/research_process_timeline.png")

plt.show()

print("\n💡 Key Insight:")
print("Data collection and analysis take the most time, but preparation")
print("(literature review, methodology) is crucial for success!")

---

## Part 4: What Makes Good Research?

Not all research is created equal. Here are the hallmarks of **good research**:

### The 7 Principles of Good Research

#### 1. Clear Research Question
- **Bad**: "Is social media good?"
- **Good**: "Does daily social media use > 2 hours correlate with decreased sleep quality in teenagers?"

#### 2. Based on Existing Knowledge
- Don't reinvent the wheel
- Build on what others have discovered
- Cite your sources

#### 3. Systematic and Rigorous
- Follow a clear methodology
- Control for confounding variables
- Use appropriate statistical methods
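
To see why controlling for confounders matters, here is a small simulation sketch. It assumes a hidden variable (`confounder`) that drives two otherwise unrelated measurements, `x` and `y`. The data are synthetic, and the helper names `pearson` and `residuals` are illustrative, not part of any library.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def residuals(values, control):
    """Remove the linear effect of `control` from `values` (simple regression)."""
    n = len(values)
    mean_v, mean_c = sum(values) / n, sum(control) / n
    slope = sum((c - mean_c) * (v - mean_v) for c, v in zip(control, values)) / sum(
        (c - mean_c) ** 2 for c in control
    )
    return [(v - mean_v) - slope * (c - mean_c) for v, c in zip(values, control)]


rng = random.Random(0)
n_samples = 5000

# Synthetic data: x and y each depend on the confounder, never on each other
confounder = [rng.gauss(0, 1) for _ in range(n_samples)]
x = [c + rng.gauss(0, 1) for c in confounder]
y = [c + rng.gauss(0, 1) for c in confounder]

raw_corr = pearson(x, y)
adjusted_corr = pearson(residuals(x, confounder), residuals(y, confounder))

print(f"Correlation before controlling for the confounder: {raw_corr:.2f}")
print(f"Correlation after controlling for the confounder:  {adjusted_corr:.2f}")
```

The raw correlation looks substantial even though neither variable influences the other. Once the confounder's effect is removed, the relationship largely disappears.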

#### 4. Ethical
- Protect participant privacy
- Obtain informed consent
- Consider potential harms

#### 5. Reproducible
- Others can replicate your work
- Code and data are available (when possible)
- Clear documentation
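
Two habits go a long way toward reproducibility: fixing random seeds and recording the environment a result came from. The sketch below shows one possible way to do both. The seed value, the `run_experiment` function, and the fields in `run_record` are illustrative assumptions, not a required format.

```python
import json
import platform
import random

SEED = 42  # arbitrary; what matters is that it is fixed and recorded


def run_experiment(seed):
    """Stand-in for an analysis step that involves randomness."""
    rng = random.Random(seed)
    sample = [rng.gauss(0, 1) for _ in range(100)]
    return sum(sample) / len(sample)


first_run = run_experiment(SEED)
second_run = run_experiment(SEED)

# Record enough context for someone else to repeat the run
run_record = {
    "seed": SEED,
    "python_version": platform.python_version(),
    "platform": platform.system(),
    "result": first_run,
}

print(json.dumps(run_record, indent=2))
print("Identical results across runs:", first_run == second_run)
```

Saving a record like this next to your outputs lets others (and future you) rerun the analysis and compare results. Modules 07-08 cover these practices in detail.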

#### 6. Transparent
- Acknowledge limitations
- Report negative results
- Disclose conflicts of interest

#### 7. Generalizable (when appropriate)
- Results apply beyond your specific sample
- Or clearly state limitations to generalizability

### Common Research Pitfalls to Avoid

| Pitfall | Description | How to Avoid |
|---------|-------------|-------------|
| **Confirmation Bias** | Only looking for evidence that supports your belief | Actively seek contradictory evidence |
| **P-Hacking** | Testing many hypotheses until one is significant | Pre-register your hypothesis |
| **Cherry Picking** | Only reporting favorable results | Report all analyses performed |
| **Correlation ≠ Causation** | Assuming correlation implies cause | Use controlled experiments or causal inference |
| **Small Sample Size** | Drawing conclusions from too little data | Calculate required sample size beforehand |
| **Ignoring Confounds** | Not controlling for alternative explanations | Identify and control for confounding variables |
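
To make the p-hacking pitfall concrete, the sketch below simulates a researcher who runs 20 comparisons on pure noise and reports success if any of them reaches p < 0.05. The number of tests, the group size, and the simple z-test (which assumes a known standard deviation of 1) are simplifying assumptions for illustration.

```python
import math
import random


def two_sided_p_value(group_a, group_b):
    """Two-sided z-test p-value for a difference in means (sigma assumed to be 1)."""
    mean_diff = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
    standard_error = math.sqrt(1 / len(group_a) + 1 / len(group_b))
    z = mean_diff / standard_error
    return math.erfc(abs(z) / math.sqrt(2))


def finds_false_positive(n_tests, n_per_group, alpha, rng):
    """Run n_tests comparisons on pure noise; report whether any looks significant."""
    for _ in range(n_tests):
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if two_sided_p_value(group_a, group_b) < alpha:
            return True
    return False


rng = random.Random(42)
alpha, n_tests, n_studies = 0.05, 20, 1000

# Each simulated "study" tests 20 hypotheses where no real effect exists
false_alarms = sum(
    finds_false_positive(n_tests, 30, alpha, rng) for _ in range(n_studies)
)
simulated_rate = false_alarms / n_studies
theoretical_rate = 1 - (1 - alpha) ** n_tests

print(f"Studies with at least one 'significant' result: {simulated_rate:.0%}")
print(f"Theoretical rate, 1 - (1 - alpha)^n: {theoretical_rate:.0%}")
```

Even with no real effect anywhere, most simulated studies find something "significant". Committing to one hypothesis in advance, or correcting for multiple comparisons, keeps the false-positive rate near the intended 5%.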

---

## Part 5: Your Learning Path Overview

Here's what you'll master in this course:

### Module Breakdown

#### **Track 1: Literature & Knowledge (Modules 01-02)**
Learn to find, read, and analyze research papers
- Module 01: Literature Review Basics
- Module 02: Finding and Reading Papers

#### **Track 2: Research Design (Modules 03-04)**
Master research methodology and experimental design
- Module 03: Research Methodology
- Module 04: Experimental Design

#### **Track 3: Data & Ethics (Modules 05-06)**
Learn ethical data collection and research practices
- Module 05: Data Collection Methods
- Module 06: Research Ethics

#### **Track 4: Reproducibility (Modules 07-08)**
Make your research reproducible and well-documented
- Module 07: Reproducible Research
- Module 08: Documentation & Version Control

#### **Track 5: Integration (Module 09)**
Put it all together in a complete research project
- Module 09: Putting It All Together

### Total Learning Time
- **Intensive**: 2 days (3-4 hours/day)
- **Moderate**: 1 week (1-2 modules/day)
- **Relaxed**: 2-3 weeks (4-5 modules/week)

In [None]:
# ========================================
# Visualize Your Learning Path
# ========================================

import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

# Module information
modules = [
    {"num": 0, "name": "Setup & Intro", "track": "Foundation", "time": 20},
    {"num": 1, "name": "Literature Review", "track": "Knowledge", "time": 30},
    {"num": 2, "name": "Finding Papers", "track": "Knowledge", "time": 40},
    {"num": 3, "name": "Methodology", "track": "Design", "time": 35},
    {"num": 4, "name": "Experiments", "track": "Design", "time": 40},
    {"num": 5, "name": "Data Collection", "track": "Data & Ethics", "time": 35},
    {"num": 6, "name": "Ethics", "track": "Data & Ethics", "time": 30},
    {"num": 7, "name": "Reproducibility", "track": "Reproducibility", "time": 40},
    {"num": 8, "name": "Documentation", "track": "Reproducibility", "time": 35},
    {"num": 9, "name": "Final Project", "track": "Integration", "time": 60},
]

# Track colors
track_colors = {
    "Foundation": "#1f77b4",
    "Knowledge": "#ff7f0e",
    "Design": "#2ca02c",
    "Data & Ethics": "#d62728",
    "Reproducibility": "#9467bd",
    "Integration": "#8c564b",
}

# Create figure
fig, ax = plt.subplots(figsize=(12, 6))

# Plot modules
for module in modules:
    color = track_colors[module["track"]]
    ax.barh(module["num"], module["time"], color=color, alpha=0.7, edgecolor="black")

    # Add module name
    ax.text(
        -2,
        module["num"],
        f"Module {module['num']}",
        ha="right",
        va="center",
        fontweight="bold",
        fontsize=9,
    )
    ax.text(
        module["time"] / 2,
        module["num"],
        module["name"],
        ha="center",
        va="center",
        fontweight="bold",
        fontsize=9,
        color="white",
    )

# Formatting
ax.set_xlabel("Time (minutes)", fontsize=12, fontweight="bold")
ax.set_ylabel("Module Number", fontsize=12, fontweight="bold")
ax.set_title(
    "Your Complete Learning Path\nData Science Research Skills",
    fontsize=14,
    fontweight="bold",
    pad=20,
)

# Create legend
legend_patches = [
    mpatches.Patch(color=color, label=track, alpha=0.7) for track, color in track_colors.items()
]
ax.legend(handles=legend_patches, loc="lower right", title="Learning Tracks")

# Grid
ax.grid(axis="x", alpha=0.3, linestyle="--")
ax.set_yticks(range(10))

plt.tight_layout()

# Save
plt.savefig(f"{output_dir}/learning_path.png", dpi=150, bbox_inches="tight")
print(f"✅ Learning path chart saved to: {output_dir}/learning_path.png")

plt.show()

# Calculate total time
total_minutes = sum(m["time"] for m in modules)
total_hours = total_minutes / 60

print("\n📊 Course Statistics:")
print(f"   Total Modules: {len(modules)}")
print(f"   Total Time: {total_minutes} minutes ({total_hours:.1f} hours)")
print(f"   Average Module Time: {total_minutes/len(modules):.0f} minutes")

---

## Part 6: Quick Self-Assessment

Before moving forward, let's check your understanding of this module's concepts.

### Reflection Questions

Think about these questions (no need to write answers, just reflect):

1. **What is research?**
   - Can you explain it in your own words?

2. **Why do data scientists need research skills?**
   - Think of 3 reasons from your own experience or goals

3. **What are the steps of the research process?**
   - Try to recall them without looking back

4. **What makes good research?**
   - Which principle resonates most with you?

5. **What are you most excited to learn?**
   - Which module are you looking forward to?

### Knowledge Check

**True or False:**

1. Research always requires publishing academic papers. **[False]**
2. Good research should be reproducible. **[True]**
3. You should only report results that support your hypothesis. **[False]**
4. Literature review happens before data collection. **[True]**
5. Correlation implies causation. **[False]**

---

## Summary

Congratulations on completing Module 00! Here's what you learned:

### Key Takeaways

✅ **Research is systematic investigation** to expand knowledge or solve problems

✅ **Research skills are essential** for data scientists in both academic and industry settings

✅ **The research process has 9 steps** from problem identification to iteration

✅ **Good research is** clear, grounded in existing knowledge, rigorous, ethical, reproducible, transparent, and generalizable where appropriate

✅ **This course covers 5 main tracks**: Literature & Knowledge, Research Design, Data & Ethics, Reproducibility, and Integration

### What You Can Do Now

- Explain what research means in data science
- Describe the complete research process
- Identify characteristics of good research
- Navigate the rest of this course

### Up Next

In **Module 01: Literature Review Basics**, you'll learn:
- What a literature review is and why it matters
- Types of research papers
- How to organize your reading
- Building a knowledge foundation

---

## Additional Resources

### Recommended Reading
- "The Craft of Research" by Booth, Colomb, and Williams
- "Thinking, Fast and Slow" by Daniel Kahneman (cognitive biases)

### Tools to Explore
- **Jupyter Notebook** - What you're using now!
- **Zotero** - Free reference manager
- **Google Scholar** - Search engine for academic papers

### Practice Exercise

**Exercise**: Think of a data science problem you're interested in.
1. Write it down as a clear research question
2. Sketch out the 9 research process steps for investigating it
3. Identify potential ethical considerations

This will help solidify your understanding!

---

**Ready to continue?** Move on to `01_literature_review_basics.ipynb`!

**Have questions?** Review this notebook or check the main project README.