# Practical Version Control for Data Scientists  
## Detailed Notes and Code Examples  

## Module 1: Version Control Fundamentals  

### 1.1: Why Version Control Matters for Data Scientists  

#### Key Notes:  

- Version control is a system that records changes to files over time.  
- **Data science challenges without version control:**  
  - Notebook experiments become unmanageable (`final_model_v5_FINAL_REALLY_FINAL.ipynb`).  
  - Difficult to reproduce past results.  
  - Collaboration becomes error-prone.  
  - Lost work due to accidental overwrites.  

#### Real-life example:  
A data scientist spends weeks building a complex model. After making "improvements" that actually decrease performance, they can't go back to the previous version because they've overwritten their files. With Git, they could simply revert to a previous commit.  
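
As a rough sketch of that recovery, the commands below build a throwaway repository, commit a working version of a file, commit a regression on top of it, and then undo the regression with `git revert`. The file name `model.py`, its contents, and the identity values are placeholders invented for illustration.

```bash
# Work in a throwaway directory so nothing real is touched
cd "$(mktemp -d)"
git init -q
git config user.name "Example User"        # placeholder identity, local to this demo repo
git config user.email "user@example.com"

# Commit 1: the version that works
echo "max_depth = 10" > model.py
git add model.py
git commit -q -m "Add baseline model settings"

# Commit 2: an "improvement" that turns out to hurt performance
echo "max_depth = 50" > model.py
git commit -q -am "Increase tree depth"

# Undo the last commit by creating a new commit that reverses it
git revert --no-edit HEAD

cat model.py            # shows the baseline settings again
git log --oneline       # the history keeps all three commits
```

`git revert` adds a new commit instead of rewriting history, which makes it a safe choice on branches that other people may already have pulled.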


### 1.2: Setting Up Your Environment

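#### Installation Commands

##### For macOS:

Assuming Homebrew is already installed, install Git by running `brew install git`.

##### For Ubuntu/Debian:

Install Git with APT by running `sudo apt-get update` followed by `sudo apt-get install git`.

##### For Windows:

1. Download Git from https://git-scm.com/download/win
2. Run the installer with the default options.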
### 1.3 Initial Configuration:

```bash
# Set your identity
git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"

# Set default editor (optional)
git config --global core.editor "code --wait"  # For VS Code

# Check your settings
git config --list
```


### 1.4: Core Git Concepts

#### Key Commands:

```bash
# Initialize a new repository
git init

# Check status of files
git status

# Add files to staging area
git add <filename>  # Specific file
git add .           # All changes

# Commit changes
git commit -m "Descriptive message about changes"

# View commit history
git log
git log --oneline  # Compact view
```

### 1.5 Illustrated Example:

```bash
# Create a project directory
mkdir ds_project
cd ds_project

# Initialize Git repository
git init

# Create a README file
echo "# My Data Science Project" > README.md

# Check status (should show README.md as untracked)
git status

# Add README to staging
git add README.md

# Check status again (should show README.md as staged)
git status

# Commit the file
git commit -m "Initial commit with README"

# View commit history
git log
```

## Module 2: Local Git Workflows  

### 2.1: Creating Your First Data Science Repository  

```text
data-science-project/
│
├── data/              # Data files
│   ├── raw/           # Original, immutable data
│   ├── processed/     # Cleaned, transformed data
│   └── external/      # Data from third-party sources
│
├── notebooks/         # Jupyter notebooks
│   ├── exploratory/   # Initial exploration
│   └── final/         # Polished analysis
│
├── src/               # Source code (functions, classes)
│   ├── data/          # Data processing scripts
│   ├── features/      # Feature engineering code
│   ├── models/        # Model training and prediction
│   └── visualization/ # Visualization code
│
├── models/            # Saved model files
│
├── reports/           # Analysis reports, figures
│
├── requirements.txt   # Package dependencies
│
└── README.md          # Project documentation
```

### Initial Setup Commands (use Git Bash on Windows)

#### 1. Create Project Directory Structure

The `mkdir -p` command creates directories recursively, including parent directories if they don’t exist. This command helps organize the project into structured folders:

- **data/**: Stores datasets, divided into:
  - **raw/**: Original unprocessed datasets.
  - **processed/**: Cleaned and transformed datasets.
  - **external/**: External datasets (e.g., third-party data sources).

- **notebooks/**: Contains Jupyter notebooks, divided into:
  - **exploratory/**: Notebooks for initial data exploration and analysis.
  - **final/**: Polished notebooks for reports or sharing.

- **src/**: Stores reusable Python scripts for different functionalities:
  - **data/**: Scripts for loading and processing data.
  - **features/**: Scripts for feature engineering.
  - **models/**: Machine learning model training and evaluation scripts.
  - **visualization/**: Scripts for data visualization.

- **models/**: Stores trained models (e.g., `.pkl`, `.h5` files).

- **reports/**: Holds reports and generated output visualizations.
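
The layout described above corresponds to commands along the following lines. The directory names follow the tree shown earlier; starting from a fresh `data-science-project` directory is an assumption about where you run them.

```bash
# Create the project root and move into it
mkdir -p data-science-project
cd data-science-project

# -p creates missing parent directories and does not fail if a directory already exists
mkdir -p data/raw data/processed data/external
mkdir -p notebooks/exploratory notebooks/final
mkdir -p src/data src/features src/models src/visualization
mkdir -p models reports

# List the directories that were created
find . -type d | sort
```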


#### 2. Create Initial Files

- **`touch requirements.txt`**: Creates a blank `requirements.txt` file for listing required Python libraries.

- **`printf '# Data Science Project Title\n\nDescription of your project.\n' > README.md`**: Writes a basic project title and description into `README.md`. `printf` is used rather than `echo` because bash's built-in `echo` prints `\n` literally unless it is called with `-e`.
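
A minimal sketch of this step, run from the project root. It uses `printf` for the README because plain `echo` in bash does not expand `\n` unless called with `-e`; the title and description are the placeholder text from the bullet above.

```bash
# Create an empty requirements file (or just update its timestamp if it already exists)
touch requirements.txt

# printf turns each \n into a real newline
printf '# Data Science Project Title\n\nDescription of your project.\n' > README.md

# Show the result
cat README.md
```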


#### 3. Create .gitignore File


```bash
# Quote the delimiter ('EOF') so the shell does not expand $py in the patterns below
cat > .gitignore << 'EOF'
# Data files
data/raw/*
data/processed/*
data/external/*
!data/raw/.gitkeep
!data/processed/.gitkeep
!data/external/.gitkeep

*.csv
*.xlsx
*.json
*.sqlite
*.parquet
*.feather

# Jupyter notebook checkpoints
.ipynb_checkpoints
*/.ipynb_checkpoints/*

# Python cache files
__pycache__/
*.py[cod]
*$py.class

# Distribution / packaging
dist/
build/
*.egg-info/

# Virtual environments
venv/
env/
.env/
.venv
ENV/
.conda/

# Credentials and secrets
.env
*.pem
*_key.json
credentials/

# Model files
*.pkl
*.h5
*.joblib
*.onnx
*.pt
*.bin
*.keras
models/*
!models/.gitkeep

# IDE specific files
.idea/
.vscode/
*.swp

# Figures and plots
*.png
*.jpg
*.pdf
# Except those in reports
!reports/*.png
!reports/*.pdf

# Logs
logs/
*.log
runs/
EOF
```


##### .gitignore File Explanation

This creates a `.gitignore` file to prevent unnecessary files from being tracked by Git:

- **Data files**: Ignores all datasets except empty placeholder files (`.gitkeep`).

- **Jupyter Notebook checkpoints**: Excludes auto-generated backup files.

- **Python cache files**: Ignores compiled Python files (`__pycache__/`, `.pyc`, `.pyo`).

- **Distribution/packaging**: Ignores build-related files (e.g., `dist/`, `*.egg-info`).

- **Virtual environments**: Ignores virtual environment folders (`venv/`, `env/`).

- **Model files**: Prevents tracking of large ML models (`*.pkl`, `*.h5`, `*.joblib`).

- **IDE-specific files**: Ignores editor configuration files (`.idea/`, `.vscode/`).
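
One way to check that patterns behave as described is `git check-ignore`, which reports whether a path would be ignored. The sketch below uses a cut-down `.gitignore` in a temporary repository; the file names `sales.csv` and `churn.pkl` are made up for the demonstration.

```bash
cd "$(mktemp -d)"
git init -q

# A reduced version of the ignore rules, enough to test the behaviour
cat > .gitignore << 'EOF'
data/raw/*
!data/raw/.gitkeep
*.pkl
EOF

# -v prints the file, line number, and pattern responsible for ignoring each path
git check-ignore -v data/raw/sales.csv models/churn.pkl

# -q prints nothing and only sets the exit status: 0 = ignored, 1 = not ignored
if git check-ignore -q data/raw/.gitkeep; then
    echo "data/raw/.gitkeep is ignored"
else
    echo "data/raw/.gitkeep can be tracked"
fi
```

The paths do not need to exist for `git check-ignore` to evaluate them, so the rules can be tested before any data is downloaded.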


#### 4. Add Placeholder Files to Keep Empty Directories

```bash
touch data/raw/.gitkeep data/processed/.gitkeep data/external/.gitkeep models/.gitkeep
```


#### 5. Initialize Git Repository
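
A minimal sketch of this step. To keep it runnable on its own it creates a throwaway project with a single file; in practice you would run the `git` commands from your real project root. The identity values and commit message are placeholders, and `git branch -M main` is included only because later examples assume the branch is called `main`.

```bash
# Demo setup: a throwaway project with one file (stand-in for your project root)
cd "$(mktemp -d)"
printf '# Data Science Project Title\n' > README.md

# Turn the directory into a Git repository
git init -q
git config user.name "Example User"        # placeholder identity, local to this repo
git config user.email "user@example.com"

# Stage everything that .gitignore does not exclude, then review it
git add .
git status --short

# Record the first commit and name the branch main
git commit -q -m "Initial project structure"
git branch -M main
git log --oneline
```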

### 2.2 Git Commit Message Best Practices

**Bad:**

```bash
git commit -m "changes"
```

**Good:**

```bash
git commit -m "Add random forest model for churn prediction"
```

**Better:**

```bash
git commit -m "Add random forest model for churn prediction

Implemented random forest with 100 trees and max depth of 10.
Model achieves 87% accuracy on validation set, improving over
previous logistic regression (82% accuracy).
Hyperparameters chosen via 5-fold cross-validation."
```
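
Typing a multi-line message inside one pair of quotes works, but it is easy to get wrong. As an alternative sketch, Git accepts several `-m` flags and stores each one as a separate paragraph, so the subject and body can be passed separately. The repository, the file `rf_config.txt`, and the identity below are placeholders for the demonstration.

```bash
cd "$(mktemp -d)"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
echo "n_estimators = 100" > rf_config.txt
git add rf_config.txt

# The first -m is the subject line; each further -m becomes its own paragraph in the body
git commit -q \
    -m "Add random forest model for churn prediction" \
    -m "Implemented random forest with 100 trees and max depth of 10." \
    -m "Hyperparameters chosen via 5-fold cross-validation."

git log -1 --format=%s   # subject only
git log -1 --format=%b   # body only
```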



### 2.3 Working with Notebooks Example:

```bash
# Create a new notebook
jupyter notebook notebooks/exploratory/data_exploration.ipynb

# After making changes and saving
git add notebooks/exploratory/data_exploration.ipynb
git commit -m "Explore customer demographics with visualization

- Added distribution plots for age and income
- Identified significant correlation between age and purchase frequency
- Added initial hypothesis about customer segments"
```

### 2.4 Handling Large Files:

#### Git LFS (Large File Storage) for Data Science Projects

**What is Git LFS?**

Git Large File Storage (LFS) is an extension of Git that efficiently handles large files such as datasets, machine learning models, and binary files. Instead of storing large files directly in the Git repository, Git LFS stores pointers to these files, keeping the repository size manageable.



```bash
# Initialize Git LFS
# Registers the Git LFS filters and hooks so that large files can be handled.
git lfs install

# Track specific file types with Git LFS:
#   *.csv        -> large dataset files
#   *.h5         -> deep learning model files (e.g., TensorFlow, Keras)
#   models/*.pkl -> pickled ML models
# Each command adds a pattern to .gitattributes, which maps these files to Git LFS.
git lfs track "*.csv"
git lfs track "*.h5"
git lfs track "models/*.pkl"

# Add the .gitattributes file, which defines the files handled by Git LFS
git add .gitattributes

# Commit the change
git commit -m "Set up Git LFS for large data and model files"

# Now you can add large files normally.
# Git stores only a small pointer in the repository; the file itself is kept in
# separate LFS storage and uploaded on push.
# If .gitignore excludes the file (as the data/external/* rule above does),
# remove that rule or use `git add -f`.
git add data/external/large_reference_data.csv
git commit -m "Add reference dataset"
```
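
For reference, `git lfs track "*.csv"` records the pattern in `.gitattributes` as a line of the form `*.csv filter=lfs diff=lfs merge=lfs -text`. The sketch below writes such lines by hand, so that it runs even where Git LFS is not installed, and uses `git check-attr` to show which paths would be routed through the LFS filter. The queried paths `models/churn.pkl` and `src/train.py` are hypothetical.

```bash
cd "$(mktemp -d)"
git init -q

# The lines that the three `git lfs track` commands would add
cat > .gitattributes << 'EOF'
*.csv filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
models/*.pkl filter=lfs diff=lfs merge=lfs -text
EOF

# Ask Git which filter applies to each path (the files need not exist)
git check-attr filter -- data/external/large_reference_data.csv models/churn.pkl src/train.py
```

Paths reported with `filter: lfs` would be stored as pointers; a path reported as `unspecified` is stored in Git as usual.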

#### DVC: Data Version Control

DVC (Data Version Control) is a tool designed specifically for data science projects.
Unlike Git LFS, which keeps file contents on an LFS server, DVC keeps them in remote storage that you configure (S3, Google Drive, etc.) and commits only small `.dvc` metadata files to Git.


##### How DVC Works
- Tracks datasets separately from Git.
- Stores data in remote storage (S3, Google Drive, etc.).
- Provides version control for datasets and models.

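```bash
# Install DVC (for the S3 remote used below, install the extra: pip install "dvc[s3]")
pip install dvc

# Initialize DVC
dvc init
git add .dvc
git commit -m "Initialize DVC"

# Track a data file or directory
dvc add data/raw/large_dataset.csv

# Commit the .dvc file
# (use `git add -f` if your .gitignore excludes data/raw/*, as in the example above)
git add data/raw/large_dataset.csv.dvc
git commit -m "Add raw dataset"

# Set up remote storage (the bucket name is a placeholder)
dvc remote add -d myremote s3://mybucket/dvcstore

# Push data to remote
dvc push
```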

#### Git LFS and DVC Integration

Git LFS and DVC can work together, but they serve different purposes and should be used for different types of files in a data science project.

##### How They Complement Each Other

- **Git LFS** is best for binary files that need to be versioned inside Git (e.g., trained models).
- **DVC** is best for large datasets that don’t belong inside Git and should be stored externally.

💡 **Best Practice:**

- ✅ Use Git LFS for ML models (`*.pkl`, `*.h5`, `*.onnx`) since they are versioned alongside your code.
- ✅ Use DVC for datasets (`*.csv`, `*.parquet`) because datasets are typically large and should be managed separately from the codebase.


### 2.5 Working with Branches
#### Branch Management Commands:

```bash
# Create a new branch
git branch feature-new-visualization

# Switch to the branch
git checkout feature-new-visualization

# Create and switch in one command
git checkout -b feature-neural-network

# List all branches
git branch

# Merge a branch into current branch
git merge feature-new-visualization

# Delete a branch after merging
git branch -d feature-new-visualization
```
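
Since Git 2.23, `git switch` covers the branch-switching part of `git checkout`. The sketch below repeats the create, switch, merge, and delete cycle with it in a temporary repository; the file `notes.txt` and the identity are placeholders.

```bash
cd "$(mktemp -d)"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
echo "baseline" > notes.txt
git add notes.txt
git commit -q -m "Add baseline notes"
git branch -M main

# Create the branch and switch to it in one step (like git checkout -b)
git switch -c feature-new-visualization
echo "histogram of age" >> notes.txt
git commit -q -am "Describe age histogram"

# Return to main, merge the feature, then delete the merged branch
git switch main
git merge -q feature-new-visualization
git branch -d feature-new-visualization
```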

#### Real-world Scenario: Adding a New Model Feature

```bash
# Start from main branch
git checkout main

# Create a new branch for the feature
git checkout -b feature-gradient-boosting

# Make changes, run experiments, etc.
# ... (work on files)

# Add and commit changes
git add src/models/gradient_boosting.py
git add notebooks/exploratory/gradient_boosting_exploration.ipynb
git commit -m "Implement gradient boosting model

- Added GBM implementation with custom loss function
- Model achieves 89.2% accuracy, best so far
- Added hyperparameter tuning notebook"

# Switch back to main branch
git checkout main

# Merge your feature
git merge feature-gradient-boosting

# Delete the feature branch
git branch -d feature-gradient-boosting
```

### 2.6 Handling Merge Conflicts:

When a merge conflict occurs, Git tells you which files are conflicted. Open those files and look for markers like:

```python
<<<<<<< HEAD
print("This is the current branch version")
=======
print("This is the feature branch version")
>>>>>>> feature-branch
```

Edit the file to resolve the conflict by choosing one version or combining them, and delete the marker lines:

```python
print("This is the resolved version")
```

Then mark the file as resolved and complete the merge:

```bash
# Mark the file as resolved
git add <filename>

# Complete the merge
git commit -m "Merge feature-branch, resolve conflicts in model parameters"
```
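
To practise the whole cycle safely, the following sketch provokes a conflict on purpose in a temporary repository and then resolves it. The file `train.py`, the branch contents, and the identity are invented for the exercise.

```bash
cd "$(mktemp -d)"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
echo 'print("original")' > train.py
git add train.py
git commit -q -m "Add training script"
git branch -M main

# Change the same line on a feature branch...
git checkout -q -b feature-branch
echo 'print("This is the feature branch version")' > train.py
git commit -q -am "Update message on feature branch"

# ...and differently on main
git checkout -q main
echo 'print("This is the current branch version")' > train.py
git commit -q -am "Update message on main"

# The merge stops with a conflict; || true keeps a script going past the non-zero exit
git merge feature-branch || true
cat train.py                      # shows the <<<<<<< / ======= / >>>>>>> markers

# Resolve by writing the final content, then stage and commit
echo 'print("This is the resolved version")' > train.py
git add train.py
git commit -q -m "Merge feature-branch, resolve conflict in train.py"
git log --oneline --graph
```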

## Module 3: Remote Repositories with GitHub
### 3.1: GitHub Basics
#### Setting Up Remote Repository:

```bash
# After creating repo on GitHub, connect local to remote
git remote add origin git@github.com:username/repository-name.git

# Verify remote
git remote -v

# Push local repository to GitHub
git push -u origin main

# For subsequent pushes
git push
```
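
To try these commands without a GitHub account, a bare repository on disk can stand in for the remote. This is only a local stand-in: the path `remote.git` replaces the `git@github.com:...` URL, and the file and identity are placeholders.

```bash
workdir="$(mktemp -d)"

# A bare repository plays the role of the GitHub remote
git init -q --bare "$workdir/remote.git"

# A local repository with one commit
mkdir "$workdir/local"
cd "$workdir/local"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
echo "# My Data Science Project" > README.md
git add README.md
git commit -q -m "Initial commit with README"
git branch -M main

# Connect, verify, and push; -u makes origin/main the upstream for later plain `git push`
git remote add origin "$workdir/remote.git"
git remote -v
git push -q -u origin main

# Confirm that the branch now exists on the remote
git ls-remote --heads origin
```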

#### Example Workflow:

```bash
# Clone an existing repository
git clone git@github.com:username/data-science-project.git
cd data-science-project

# Make some changes
touch new_analysis.py

# Add and commit
git add new_analysis.py
git commit -m "Add script for time series analysis"

# Push to GitHub
git push
```

### 3.2: Collaborative Workflows
#### Fetching and Pulling:

```bash
# Fetch updates from remote without merging
git fetch origin

# Pull updates (fetch + merge)
git pull origin main

# Pull updates for a specific branch
git pull origin feature-branch
```
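
The difference between fetching and pulling is easier to see with two clones of the same remote. In this sketch a local bare repository stands in for GitHub, and "Alice" and "Bob" are hypothetical collaborators; all paths and file contents are made up.

```bash
workdir="$(mktemp -d)"
git init -q --bare "$workdir/remote.git"

# Alice creates the project and pushes it
git clone -q "$workdir/remote.git" "$workdir/alice"
cd "$workdir/alice"
git config user.name "Alice"
git config user.email "alice@example.com"
echo "v1" > analysis.txt
git add analysis.txt
git commit -q -m "Add analysis"
git branch -M main
git push -q -u origin main

# Bob clones, then Alice pushes a second commit
git clone -q -b main "$workdir/remote.git" "$workdir/bob"
echo "v2" > analysis.txt
git commit -q -am "Update analysis"
git push -q

# Bob fetches: origin/main moves, but his own branch and files do not
cd "$workdir/bob"
git fetch -q origin
cat analysis.txt                          # still the old content
git log --oneline main..origin/main       # the commit waiting to be merged

# Merging the fetched branch is the second half of what `git pull` does
git merge -q --ff-only origin/main
cat analysis.txt                          # now the new content
```

Fetching first and inspecting `main..origin/main` gives you a chance to review incoming commits before they touch your working files.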

#### Creating a Pull Request (via command line and GitHub):

1. Push your branch to GitHub:

```bash
git checkout -b feature-data-cleaning
# Make changes
git add .
git commit -m "Implement data cleaning pipeline"
git push -u origin feature-data-cleaning
```

2. On GitHub, open a pull request from `feature-data-cleaning` into `main`.

3. Wait for reviewers to comment on the pull request.

4. Make the requested changes and push them to the same branch:

```bash
git add .
git commit -m "Address review comments: fix normalization"
git push
```

5. When approved, merge via the GitHub interface or from the command line:

```bash
git checkout main
git pull
git merge feature-data-cleaning
git push
```

### 3.3 Jupyter Notebook Version Control

`nbstripout` is a Git filter that removes execution outputs and metadata from notebooks before they are committed.

```bash
# Install nbstripout, which cleans notebooks automatically before commits
pip install nbstripout

# Set up the filter for this repository.
# The filter itself is registered in .git/config, which is not versioned;
# --attributes writes the matching rule to .gitattributes so it can be committed.
nbstripout --install --attributes .gitattributes

# Commit the shared rule (each collaborator still runs `nbstripout --install` once)
git add .gitattributes
git commit -m "Set up nbstripout for clean notebook diffs"
```

#### Why Use `nbstripout`?

- ✔ **Keeps version control clean**: Avoids large, unnecessary diffs due to output changes.
- ✔ **Improves collaboration**: Prevents conflicts from execution order differences.
- ✔ **Reduces repository size**: No need to track large output files.


### 3.4 Scenario - Managing a Data Science Portfolio
#### Portfolio README Template:

# Project Name

![Status](https://img.shields.io/badge/status-active-success.svg)
![Python](https://img.shields.io/badge/python-3.9-blue.svg)
![License](https://img.shields.io/badge/license-MIT-green.svg)

## Overview
Brief description of the project, its purpose, and the problem it solves.

## Key Findings
- Finding 1 with supporting data
- Finding 2 with supporting data
- Finding 3 with supporting data

## Data Sources
Description of data sources used, with links when possible.

## Methods
Overview of the approach, techniques, and algorithms used.

### Features
Description of key features and feature engineering approaches.

### Models
Summary of models built and their performance metrics.

## Results
Key results, visualizations, and insights gained.

## Installation and Usage
```bash
# Clone repository
git clone https://github.com/username/project-name.git
cd project-name

# Install dependencies
pip install -r requirements.txt

# Run example
python src/main.py