# Module 10: Git Best Practices - Professional Workflows

**Difficulty**: ⭐⭐ (Intermediate)

**Estimated Time**: 90-120 minutes

**Prerequisites**: 
- Module 01: Git Fundamentals
- Module 02: GitHub Essentials
- Module 03: Branching and Merging
- Module 04: Collaboration Workflows

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. Write clear, professional commit messages following conventions
2. Create and maintain effective .gitignore files
3. Apply branch naming conventions and workflows
4. Organize repositories for team collaboration
5. Use Git hooks to automate quality checks
6. Follow code review best practices
7. Implement semantic versioning and tagging
8. Maintain clean Git history

---

## 1. The Art of Commit Messages

### Why Commit Messages Matter

Good commit messages:
- **Document history**: Future you (and teammates) will thank you
- **Enable automation**: Tools can generate changelogs
- **Facilitate reviews**: Reviewers understand intent
- **Aid debugging**: Find when bugs were introduced
- **Show professionalism**: Employers look at commit history

### Bad vs Good Commit Messages

```bash
# ❌ BAD - Too vague
"fixed bug"
"updates"
"changes"
"stuff"
"asdfasdf"

# ❌ BAD - Not descriptive enough
"updated file.py"
"made changes to model"

# ✅ GOOD - Clear and specific
"Fix: Correct null pointer exception in data loader"
"Add: Feature extraction for time series data"
"Refactor: Simplify preprocessing pipeline"
"Docs: Update API documentation for v2.0"
```

---

## 2. Conventional Commits Standard

### Format Structure

```
<type>(<scope>): <subject>

<body>

<footer>
```

### Commit Types

| Type | Description | Example |
|------|-------------|----------|
| `feat` | New feature | `feat(api): Add user authentication endpoint` |
| `fix` | Bug fix | `fix(model): Correct bias in prediction algorithm` |
| `docs` | Documentation | `docs(readme): Add installation instructions` |
| `style` | Formatting, no code change | `style: Format code with Black` |
| `refactor` | Code restructuring | `refactor(data): Simplify ETL pipeline` |
| `test` | Adding tests | `test(utils): Add unit tests for helpers` |
| `chore` | Maintenance | `chore: Update dependencies` |
| `perf` | Performance improvement | `perf(query): Optimize database query` |

### Examples

```bash
# Simple commit
feat(data): Add CSV export functionality

# With body
fix(model): Correct data leakage in validation split

The validation set was using data that overlapped with training.
Changed to use time-based split instead of random split.

# With footer (breaking change)
feat(api): Change authentication to OAuth2

BREAKING CHANGE: API now requires OAuth2 tokens instead of API keys.
Users must update their authentication method.
```
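
The header format above is regular enough to check mechanically. The following is a minimal sketch, assuming the allowed types are exactly those in the table; `ALLOWED_TYPES`, `HEADER_PATTERN`, and `parse_header` are illustrative names, not part of any Git tooling.

```python
import re

# Types taken from the table above; real projects may allow a different set.
ALLOWED_TYPES = {"feat", "fix", "docs", "style", "refactor", "test", "chore", "perf"}

# <type>(<scope>): <subject>, where the scope is optional
HEADER_PATTERN = re.compile(
    r"^(?P<type>[a-z]+)(?:\((?P<scope>[^()]+)\))?: (?P<subject>.+)$"
)


def parse_header(header):
    """Return the type, scope, and subject of a header, or None if it is malformed."""
    match = HEADER_PATTERN.match(header)
    if match is None or match.group("type") not in ALLOWED_TYPES:
        return None
    return match.groupdict()


print(parse_header("feat(data): Add CSV export functionality"))
print(parse_header("chore: Update dependencies"))
print(parse_header("fixed bug"))
```

A header without a scope parses with `scope` set to `None`, and a free-form message such as `fixed bug` is rejected.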

---

## 3. Practical: Writing Good Commit Messages

In [None]:
import os
from pathlib import Path
import subprocess

# Create practice environment
practice_dir = Path("commit_practice")
practice_dir.mkdir(exist_ok=True)
os.chdir(practice_dir)

# Initialize repository
!git init
!git config user.name "Best Practices Learner"
!git config user.email "learner@example.com"

print("✓ Practice environment ready")

In [None]:
# Create a data processing script
# The outer string uses ''' so the docstrings inside can keep their """ quotes
script_content = '''import pandas as pd

def load_data(filepath):
    """Load data from CSV file."""
    return pd.read_csv(filepath)

def clean_data(df):
    """Remove missing values and duplicates."""
    df = df.dropna()
    df = df.drop_duplicates()
    return df
'''

with open("data_processor.py", "w") as f:
    f.write(script_content)

# Make a good commit
!git add data_processor.py
!git commit -m "feat(data): Add basic data loading and cleaning functions"

print("✓ Made first commit with good message")

In [None]:
# View the commit
!git log --oneline -1

print("\nDetailed view:")
!git log -1

### The 7 Rules of Good Commit Messages

1. **Separate subject from body with a blank line**
2. **Limit subject line to 50 characters**
3. **Capitalize the subject line**
4. **Do not end the subject line with a period**
5. **Use imperative mood** ("Add feature" not "Added feature")
6. **Wrap body at 72 characters**
7. **Use body to explain what and why, not how**
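
Some of these rules can be checked automatically. The sketch below covers rules 1, 2, 4, and 6 only; capitalization and imperative mood need human judgment (and capitalization interacts with the lowercase type prefix used by Conventional Commits), so they are left out. The function name `check_commit_message` and its parameters are hypothetical.

```python
def check_commit_message(message, subject_limit=50, body_width=72):
    """Return a list of rule violations found in a commit message."""
    lines = message.splitlines()
    if not lines or not lines[0].strip():
        return ["subject line is empty"]

    problems = []
    subject = lines[0]
    if len(subject) > subject_limit:
        problems.append(f"subject is longer than {subject_limit} characters")
    if subject.endswith("."):
        problems.append("subject ends with a period")
    if len(lines) > 1 and lines[1].strip():
        problems.append("no blank line between subject and body")
    # Everything after the subject counts as body for the width check.
    for number, line in enumerate(lines[1:], start=2):
        if len(line) > body_width:
            problems.append(f"line {number} is longer than {body_width} characters")
    return problems


print(check_commit_message("Add CSV export\n\nExplain why the export is needed."))
print(check_commit_message("Fix bug."))
```

An empty list means none of the checked rules were violated.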

---

## 4. Mastering .gitignore

### What is .gitignore?

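A `.gitignore` file lists patterns for untracked files that Git should **ignore**. Files that are already tracked stay tracked until you remove them from the index with `git rm --cached`.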

### Why Use .gitignore?

- **Keep repo clean**: No temporary or generated files
- **Reduce size**: Don't track large binary files
- **Protect secrets**: Never commit API keys or passwords
- **Avoid conflicts**: Don't version IDE settings
- **Focus on source**: Only track source code, not outputs

---

## 5. Creating a Data Science .gitignore

In [None]:
# Create a comprehensive .gitignore for data science
gitignore_content = """# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# Virtual Environments
venv/
env/
ENV/
.venv/

# Jupyter Notebook
.ipynb_checkpoints/
*-checkpoint.ipynb
*.ipynb_checkpoints

# Data files (keep small samples only)
*.csv
*.xlsx
*.xls
*.parquet
*.h5
*.hdf5
# Exception: keep small sample data
!data/sample/*.csv
!data/sample/*.xlsx

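# Machine Learning Models (*.h5 is already covered under data files)
*.pkl
*.pickle
*.joblib
*.pb
*.pt
*.pth
*.onnx
# Ignore model artifacts but keep the directory itself via .gitkeep
models/*
!models/.gitkeep
checkpoints/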

# Large files
*.zip
*.tar
*.tar.gz
*.rar

# Images and Media
*.jpg
*.jpeg
*.png
*.gif
*.mp4
*.avi
# Exception: keep documentation images
!docs/images/*
!README_images/*

# Database files
*.db
*.sqlite
*.sqlite3

# Environment variables and secrets
.env
.env.local
.env.*.local
secrets.yaml
credentials.json
*.pem
*.key

# IDE and Editor
.vscode/
.idea/
*.swp
*.swo
*~

# Logs and temporary files
*.log
logs/
tmp/
temp/

# Testing
.pytest_cache/
.coverage
htmlcov/
.tox/

# Documentation builds
docs/_build/
site/

# OS files
.DS_Store
.DS_Store?
._*
.Spotlight-V100
.Trashes
ehthumbs.db
Thumbs.db
"""

with open(".gitignore", "w") as f:
    f.write(gitignore_content)

print("✓ Created comprehensive .gitignore file")
print("\nFirst 20 lines:")
print("\n".join(gitignore_content.split("\n")[:20]))

### .gitignore Patterns

```bash
# Exact filename
secret.txt

# All files with extension
*.log

# Directory and all contents
temp/

# Files in specific directory
data/*.csv

# Files in all subdirectories
**/*.pkl

# Exception (don't ignore)
!important.log

# Ignore directory but keep structure
logs/*
!logs/.gitkeep
```
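
One detail behind the exception pattern is that, for a given path, the last matching pattern decides. The sketch below models only that rule plus the basename-versus-path distinction. It is a deliberate simplification: unlike Git, `fnmatchcase` lets `*` cross `/`, and directory patterns such as `temp/` and `**` are not handled. `is_ignored` is a hypothetical helper; use `git check-ignore` for real answers.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def is_ignored(path, patterns):
    """Simplified model of .gitignore: the last matching pattern decides."""
    ignored = False
    name = PurePosixPath(path).name
    for pattern in patterns:
        negated = pattern.startswith("!")
        if negated:
            pattern = pattern[1:]
        # Patterns without a slash match the file name at any depth;
        # patterns with a slash match the path relative to the repository root.
        target = path if "/" in pattern else name
        if fnmatchcase(target, pattern):
            ignored = not negated
    return ignored


patterns = ["*.log", "!important.log", "data/*.csv"]
print(is_ignored("debug.log", patterns))       # True
print(is_ignored("important.log", patterns))   # False
print(is_ignored("data/train.csv", patterns))  # True
print(is_ignored("src/train.csv", patterns))   # False
```

Because order matters, placing `!important.log` before `*.log` would leave `important.log` ignored.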

### Testing .gitignore

```bash
# Check if a file would be ignored
git check-ignore -v filename.txt

# Find all ignored files
git status --ignored
```

---

## 6. Branch Naming Conventions

### Standard Branch Name Format

```
<type>/<ticket-number>-<short-description>
```

### Common Branch Types

| Type | Purpose | Example |
|------|---------|----------|
| `feature/` | New features | `feature/123-user-auth` |
| `bugfix/` or `fix/` | Bug fixes | `bugfix/456-null-pointer` |
| `hotfix/` | Urgent production fix | `hotfix/789-security-patch` |
| `refactor/` | Code improvements | `refactor/cleanup-data-pipeline` |
| `docs/` | Documentation | `docs/update-api-guide` |
| `test/` | Testing | `test/add-unit-tests` |
| `experiment/` | Experimental work | `experiment/try-new-model` |

### Best Practices

```bash
# ✅ GOOD
feature/user-authentication
bugfix/data-loading-error
refactor/simplify-preprocessing

# ❌ BAD
my-branch
test123
branch1
johns-work
```
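
A naming convention is easier to keep when it can be checked. Here is a minimal sketch of a validator for the `<type>/<ticket-number>-<short-description>` format, assuming the prefixes from the table above, an optional ticket number, and lowercase words joined by hyphens. `BRANCH_PATTERN` and `is_valid_branch_name` are illustrative names.

```python
import re

# Prefixes from the table above; the ticket number is optional.
BRANCH_PATTERN = re.compile(
    r"(feature|bugfix|fix|hotfix|refactor|docs|test|experiment)"
    r"/(\d+-)?[a-z0-9]+(-[a-z0-9]+)*"
)


def is_valid_branch_name(name):
    """Return True if the whole name follows the <type>/<description> convention."""
    return BRANCH_PATTERN.fullmatch(name) is not None


for name in ["feature/123-user-auth", "bugfix/data-loading-error", "my-branch", "test123"]:
    status = "ok" if is_valid_branch_name(name) else "rejected"
    print(f"{name}: {status}")
```

A check like this could run in a `pre-push` hook or a CI job; where to enforce it is a team decision.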

### Creating Well-Named Branches

In [None]:
# Example: Create properly named branches
branches = [
    "feature/add-data-validation",
    "bugfix/fix-memory-leak",
    "docs/update-contributing-guide"
]

for branch in branches:
    !git branch {branch}
    print(f"✓ Created branch: {branch}")

print("\nAll branches:")
!git branch

---

## 7. Git Workflow Strategies

### 1. GitHub Flow (Simple)

**Best for**: Small teams, continuous deployment

```
main (always deployable)
  |
  |-- feature/new-feature
  |     |
  |     |-- commit
  |     |-- commit
  |     |
  |<----+ (PR merged)
  |
```

**Process**:
1. Create branch from `main`
2. Make changes
3. Open pull request
4. Review and merge to `main`
5. Deploy `main`

---

### 2. Git Flow (Structured)

**Best for**: Scheduled releases, larger teams

```
main (production)
  |
develop (integration)
  |
  |-- feature/a
  |-- feature/b
  |
  |<-- (features merged to develop)
  |
release/v1.0
  |
  |-- bugfix (if needed)
  |
  |-> main (release merged)
```

**Branches**:
- `main`: Production-ready code
- `develop`: Integration branch
- `feature/*`: New features
- `release/*`: Release preparation
- `hotfix/*`: Emergency fixes

---

### 3. Trunk-Based Development

**Best for**: Experienced teams, CI/CD

```
main (trunk)
  |
  |-- short-lived branch (< 1 day)
  |     |
  |<----+ (merged quickly)
  |
  |-- another short branch
  |     |
  |<----+
```

**Key principles**:
- Everyone commits to main/trunk frequently
- Very short-lived feature branches
- Feature flags for incomplete features
- Strong automated testing

---

## 8. Repository Organization

### Standard Data Science Project Structure

In [None]:
# Create a well-organized repository structure
import os

directories = [
    "data/raw",
    "data/processed",
    "data/sample",
    "notebooks/exploratory",
    "notebooks/reports",
    "src/data",
    "src/models",
    "src/visualization",
    "tests",
    "docs",
    "models",
    "reports/figures",
    "configs",
]

for directory in directories:
    os.makedirs(directory, exist_ok=True)
    # Create .gitkeep to track empty directories
    gitkeep_path = os.path.join(directory, ".gitkeep")
    open(gitkeep_path, 'a').close()

print("✓ Created standard project structure")
print("\nDirectory tree:")
for directory in sorted(directories):
    print(f"  {directory}/")

In [None]:
# Create essential files
readme_content = """# Project Name

Brief description of the project.

## Installation

```bash
pip install -r requirements.txt
```

## Usage

How to use this project.

## Project Structure

```
├── data/               # Data files
├── notebooks/          # Jupyter notebooks
├── src/                # Source code
├── tests/              # Test files
├── docs/               # Documentation
└── reports/            # Generated reports
```

## Contributing

See CONTRIBUTING.md

## License

MIT License
"""

with open("README.md", "w") as f:
    f.write(readme_content)

# Create requirements.txt
requirements_content = """pandas>=2.0.0
numpy>=1.24.0
scikit-learn>=1.3.0
matplotlib>=3.7.0
jupyter>=1.0.0
"""

with open("requirements.txt", "w") as f:
    f.write(requirements_content)

print("✓ Created README.md and requirements.txt")

---

## 9. Git Hooks for Automation

### What are Git Hooks?

Git hooks are **scripts that run automatically** at specific points in the Git workflow.

### Common Hooks

| Hook | When It Runs | Use Case |
|------|--------------|----------|
| `pre-commit` | Before commit | Check code style, run tests |
| `commit-msg` | After the message is entered, before the commit is created | Validate message format |
| `pre-push` | Before push | Run full test suite |
| `post-commit` | After commit | Send notifications |
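
A hook is just an executable file in `.git/hooks/`. Git calls the `commit-msg` hook with one argument, the path to the file holding the proposed message, and a non-zero exit status aborts the commit. The following is a sketch of such a hook in Python, assuming the commit types from Section 2; `validate_message_file` is a hypothetical helper name.

```python
#!/usr/bin/env python3
"""Sketch of a commit-msg hook: save as .git/hooks/commit-msg and make it executable."""
import re
import sys

# Assumed convention: the types listed in the Conventional Commits table.
HEADER_PATTERN = re.compile(
    r"^(feat|fix|docs|style|refactor|test|chore|perf)(\([^()]+\))?: .+"
)


def validate_message_file(path):
    """Return 0 if the first line of the message file is a valid header, else 1."""
    with open(path, encoding="utf-8") as f:
        first_line = f.readline().rstrip("\n")
    if HEADER_PATTERN.match(first_line):
        return 0
    print(f"Invalid commit header: {first_line!r}", file=sys.stderr)
    print("Expected: <type>(<scope>): <subject>", file=sys.stderr)
    return 1


if __name__ == "__main__" and len(sys.argv) > 1:
    # Git passes the path of the file holding the proposed message as argv[1].
    sys.exit(validate_message_file(sys.argv[1]))
```

Hooks in `.git/hooks/` are local to each clone and are not versioned with the repository, which is one reason teams often share them through a framework instead.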

### Using pre-commit Framework

In [None]:
# Create .pre-commit-config.yaml
precommit_config = """# Pre-commit hooks configuration
repos:
  # Code formatting
  - repo: https://github.com/psf/black
    rev: 23.12.0
    hooks:
      - id: black
        language_version: python3.10

  # Import sorting
  - repo: https://github.com/pycqa/isort
    rev: 5.13.0
    hooks:
      - id: isort

  # Linting
  - repo: https://github.com/pycqa/flake8
    rev: 6.1.0
    hooks:
      - id: flake8
        args: ['--max-line-length=100']

  # Remove trailing whitespace
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-added-large-files
        args: ['--maxkb=1000']

  # Jupyter notebook cleaning
  - repo: https://github.com/kynan/nbstripout
    rev: 0.6.1
    hooks:
      - id: nbstripout
"""

with open(".pre-commit-config.yaml", "w") as f:
    f.write(precommit_config)

print("✓ Created pre-commit configuration")
print("\nTo install:")
print("  pip install pre-commit")
print("  pre-commit install")

---

## 10. Semantic Versioning and Tags

### Semantic Versioning (SemVer)

Format: `MAJOR.MINOR.PATCH` (e.g., `2.3.1`)

- **MAJOR**: Breaking changes (incompatible API changes)
- **MINOR**: New features (backward compatible)
- **PATCH**: Bug fixes (backward compatible)

### Examples

```
1.0.0  →  Initial release
1.0.1  →  Bug fix
1.1.0  →  New feature added
1.1.1  →  Another bug fix
2.0.0  →  Breaking change
```
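
The progression above follows one rule: incrementing a component resets every component to its right to zero. Here is a minimal sketch, assuming plain `MAJOR.MINOR.PATCH` strings without pre-release or build suffixes; `bump_version` is a hypothetical helper.

```python
def bump_version(version, part):
    """Increment one component of a MAJOR.MINOR.PATCH version string."""
    major, minor, patch = (int(number) for number in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")


print(bump_version("1.0.0", "patch"))  # 1.0.1
print(bump_version("1.0.1", "minor"))  # 1.1.0
print(bump_version("1.1.1", "major"))  # 2.0.0
```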

### Creating Tags

In [None]:
# Create an annotated tag (recommended)
!git tag -a v1.0.0 -m "Release version 1.0.0: Initial stable release"

print("✓ Created version tag v1.0.0")
print("\nAll tags:")
!git tag -l

In [None]:
# View tag details
!git show v1.0.0

### Tag Best Practices

```bash
# Create annotated tag (preferred)
git tag -a v1.2.3 -m "Release 1.2.3: Add feature X"

# Push tags to remote
git push origin v1.2.3
# Or push all tags
git push --tags

# List tags
git tag -l

# Checkout specific version
git checkout v1.2.3

# Delete tag
git tag -d v1.2.3
git push origin --delete v1.2.3
```
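
When scripts need the latest release, sorting tag names as plain strings gives the wrong order once a component reaches two digits. The sketch below compares tags numerically, assuming every tag has the form `vMAJOR.MINOR.PATCH`; `version_key` is an illustrative name and pre-release suffixes are not handled.

```python
def version_key(tag):
    """Turn a tag such as 'v1.10.0' into the tuple (1, 10, 0) for numeric sorting."""
    version = tag[1:] if tag.startswith("v") else tag
    return tuple(int(part) for part in version.split("."))


tags = ["v1.2.3", "v1.10.0", "v1.9.1"]

print(sorted(tags))                   # ['v1.10.0', 'v1.2.3', 'v1.9.1']
print(sorted(tags, key=version_key))  # ['v1.2.3', 'v1.9.1', 'v1.10.0']
print(max(tags, key=version_key))     # v1.10.0
```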

---

## 11. Code Review Best Practices

### For Authors (Creating PRs)

**Before opening PR**:
- ✅ Self-review your code
- ✅ Write clear PR description
- ✅ Include context and motivation
- ✅ Reference related issues
- ✅ Add screenshots for UI changes
- ✅ Ensure tests pass
- ✅ Update documentation

**PR Description Template**:

```markdown
## Description
Brief summary of changes

## Type of Change
- [ ] Bug fix
- [ ] New feature
- [ ] Breaking change
- [ ] Documentation update

## Related Issues
Fixes #123
Related to #456

## Testing
- [ ] Unit tests added/updated
- [ ] Integration tests pass
- [ ] Manual testing completed

## Checklist
- [ ] Code follows style guidelines
- [ ] Self-review completed
- [ ] Comments added for complex logic
- [ ] Documentation updated
- [ ] No new warnings generated
```

---

### For Reviewers

**What to Look For**:

1. **Correctness**: Does it work as intended?
2. **Design**: Is the approach sound?
3. **Readability**: Is the code clear?
4. **Testing**: Are there adequate tests?
5. **Documentation**: Are changes documented?
6. **Performance**: Any performance concerns?
7. **Security**: Any security vulnerabilities?

**Review Comments**:

```markdown
# ❌ BAD (vague, confrontational)
"This is wrong."
"Why did you do it this way?"
"This makes no sense."

# ✅ GOOD (specific, constructive)
"Consider using a list comprehension here for better readability:
[x**2 for x in numbers]"

"This function might have performance issues with large datasets.
Have you considered using vectorized operations?"

"Nice solution! One suggestion: we could extract this logic into 
a separate function for reusability."
```

**Review Etiquette**:
- 👍 Be kind and constructive
- 👍 Ask questions, don't make demands
- 👍 Praise good solutions
- 👍 Explain the "why" behind suggestions
- 👎 Avoid nitpicking style (use linters)
- 👎 Don't review while angry or rushed

---

## 12. Keeping History Clean

### Interactive Rebase

Clean up commits before merging:

```bash
# Rebase last 3 commits interactively
git rebase -i HEAD~3

# Options:
# pick   = keep commit
# reword = change commit message
# squash = combine with previous commit
# fixup  = like squash but discard message
# drop   = remove commit
```

### Squashing Commits

Before:
```
feat: Add user authentication
fix: typo
fix: another typo
fix: formatting
```

After squashing:
```
feat: Add user authentication
```

### When to Rebase vs Merge

**Use Rebase**:
- Updating feature branch with main
- Cleaning up commits before PR
- Maintaining linear history

**Use Merge**:
- Integrating completed features
- Public/shared branches
- When preserving full history matters

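**⚠️ Never rebase public/shared branches!** Rebasing rewrites commit hashes, so anyone who has already pulled the branch ends up with a diverged history.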

---

## 13. Exercise 1: Write Good Commit Messages

**Task**: Practice writing conventional commit messages.

For each scenario, write an appropriate commit message:

1. You added a new function to calculate moving averages
2. You fixed a bug where null values caused crashes
3. You updated the README with installation instructions
4. You improved performance of data loading by 50%
5. You changed the API in a way that breaks backward compatibility

### Your Answers Here

1. `TODO: Write commit message`
2. `TODO: Write commit message`
3. `TODO: Write commit message`
4. `TODO: Write commit message`
5. `TODO: Write commit message (include body)`

---

## 14. Exercise 2: Create a .gitignore

**Task**: Create a .gitignore file for a machine learning project.

**Requirements**:
- Ignore all model files (`.pkl`, `.h5`, `.pt`)
- Ignore large datasets but keep sample data
- Ignore virtual environments
- Ignore Jupyter checkpoints
- Keep small documentation images
- Ignore environment variables

In [None]:
# Exercise 2: Your solution here

ml_gitignore = """
# TODO: Write your .gitignore content
"""

with open("ml_project.gitignore", "w") as f:
    f.write(ml_gitignore)

print("✓ Created .gitignore for ML project")

---

## 15. Exercise 3: Design a Workflow

**Task**: Design a Git workflow for a data science team.

**Team Context**:
- 5 data scientists
- Working on various ML experiments
- Monthly production releases
- Need to track experiments
- Occasional hotfixes required

**Questions to Address**:
1. Which workflow strategy (GitHub Flow, Git Flow, Trunk-Based)?
2. What branch naming conventions?
3. How to handle experiments that might not make it to production?
4. What's your PR and merge strategy?
5. How do you handle versioning?

### Your Workflow Design Here

TODO: Describe your workflow strategy

Include:
- Branch structure
- Development process
- Release process
- Experiment tracking
- Diagram (optional)

---

## 16. Summary

### Key Concepts Learned

1. **Commit Messages**: Use conventional commits format
2. **.gitignore**: Protect secrets, reduce repo size, focus on source
3. **Branch Naming**: Follow conventions for clarity
4. **Workflows**: Choose appropriate strategy for team size and release cadence
5. **Repository Organization**: Standard structure improves collaboration
6. **Git Hooks**: Automate quality checks
7. **Versioning**: Use semantic versioning and tags
8. **Code Review**: Be constructive and thorough
9. **Clean History**: Use rebase for local cleanup

### Essential Practices Checklist

For every project:
- [ ] Write clear, conventional commit messages
- [ ] Maintain comprehensive .gitignore
- [ ] Use descriptive branch names
- [ ] Create detailed PR descriptions
- [ ] Review code constructively
- [ ] Tag releases with semantic versions
- [ ] Keep repository organized
- [ ] Document processes in README
- [ ] Set up pre-commit hooks
- [ ] Protect sensitive information

### Impact on Your Career

Following these best practices:
- 📈 Makes you a better collaborator
- 📈 Improves code quality
- 📈 Reduces bugs and conflicts
- 📈 Demonstrates professionalism
- 📈 Makes repositories more maintainable
- 📈 Signals good Git hygiene to employers

---

## 17. What's Next?

You've mastered professional Git practices! Continue your journey with:

**Module 11: Git for Data Science**
- Handling large datasets
- Git LFS (Large File Storage)
- Versioning notebooks
- Data versioning with DVC
- MLflow integration

### Additional Resources

- [Conventional Commits](https://www.conventionalcommits.org/)
- [GitHub Flow Guide](https://guides.github.com/introduction/flow/)
- [Git Flow](https://nvie.com/posts/a-successful-git-branching-model/)
- [Semantic Versioning](https://semver.org/)
- [Pre-commit Framework](https://pre-commit.com/)
- [gitignore.io](https://www.toptal.com/developers/gitignore)

### Keep Practicing

Good practices become habits through repetition:
1. Apply these patterns to all your projects
2. Review others' code on GitHub
3. Contribute to open source
4. Teach these practices to others

---

**Excellent work!** You're now equipped with professional Git workflows that will serve you throughout your career. 🚀