> **DSA 6101: PRINCIPLES OF DATA SCIENCE**

> **CAT 2**

> **Bwahyi Chris Muthee: 25/03553**

## Role of Programming Environments

Programming environments like **Jupyter Notebook** and **RStudio** are designed to support collaborative data science workflows and structured code organization.

### Jupyter Notebook:

This is an interactive coding environment that supports languages such as Python, R, and Julia. It offers the following benefits to a data scientist:

> **Collaboration:**

> - Jupyter’s literate programming approach integrates code, markdown, and visualizations, enabling data scientists to create narrative-driven notebooks that communicate findings to team members and stakeholders
> - The ability to share *.ipynb files via platforms like GitHub facilitates team reviews and feedback

> **Code Organization:**

> - The notebook’s cell-based structure allows modular code development, where data cleaning, analysis, and visualization are segmented into distinct cells. Markdown cells provide documentation, ensuring clarity and reproducibility.

### RStudio:

RStudio is the IDE most commonly used when coding in R. It is better understood as a suite of tools with which data scientists and analysts manage, visualize, and model data and deploy machine learning models.

> **Collaboration:**
> - RStudio supports collaborative workflows through tools like RMarkdown for report generation and Shiny apps for interactive sharing of results, making it ideal for statistical teams. Integration with version control systems enhances team coordination.

> **Code Organization:**

> - RStudio’s IDE layout (Console, Script, Environment, Plots) consolidates code, data, and outputs in a unified interface, improving workflow efficiency

Through their interactive coding interfaces and Markdown support (ideal for storytelling), these environments bridge technical and non-technical collaboration: they combine executable code with human-readable documentation, so that team members can understand and replicate analyses.


# Comparing Anaconda and pip/virtualenv for Managing Python Environments and Packages

Anaconda is a Python distribution tailored for data science and machine learning. It includes the conda package manager and comes with a suite of pre-installed packages such as NumPy, pandas, and Jupyter.

pip is the default Python package manager, and virtualenv is used to create isolated environments.

While pip/virtualenv is more lightweight and flexible, Anaconda offers a more user-friendly setup for scientific computing.

**Environment Management**

In terms of environment management, both conda and virtualenv enable the creation of isolated environments to avoid dependency conflicts.

Conda environments also manage non-Python dependencies such as compiled libraries, for example BLAS (Basic Linear Algebra Subprograms), a set of routines that provide standard building blocks for basic vector and matrix operations.

Creating a conda environment (conda being Anaconda's package manager) takes a single command, ***conda create -n myenv python=3.10***, which can also specify the Python version and the versions of any packages to install.

Virtualenv/venv provides basic environment isolation and integrates well with pip for managing packages from the Python Package Index (PyPI).
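As a minimal sketch of the isolation described above, the standard-library `venv` module can create an environment from within Python, which is roughly what `python -m venv` does on the command line. The directory name `demo-env` and the helpers `create_env` and `in_virtual_env` are illustrative assumptions, not part of the original activity.

```python
import sys
import tempfile
import venv
from pathlib import Path


def create_env(parent_dir, name="demo-env"):
    """Create an isolated environment under parent_dir and return its path."""
    env_dir = Path(parent_dir) / name
    # with_pip=False skips bootstrapping pip to keep the sketch fast;
    # pass with_pip=True for an environment you intend to install into.
    venv.create(env_dir, with_pip=False)
    return env_dir


def in_virtual_env():
    """Return True when the running interpreter belongs to a virtual environment."""
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        env_path = create_env(tmp)
        # pyvenv.cfg is the file that marks a directory as a virtual environment.
        print("pyvenv.cfg present:", (env_path / "pyvenv.cfg").exists())
    print("Running inside a virtual environment:", in_virtual_env())
```

The `sys.prefix != sys.base_prefix` comparison is a common way to check that an environment has actually been activated before installing packages into it.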

**Package Management**

Pip installs packages from the Python Package Index (PyPI), which hosts a vast array of Python libraries. Almost any Python library can be installed using pip.

On the other hand, conda installs packages from the Anaconda repository and other channels. While fewer packages are available through conda than through PyPI, conda can install packages for multiple languages, not just Python.

Both pip and conda are powerful tools for managing Python packages. The choice between them depends on the user's specific needs. By understanding the strengths and weaknesses of each tool, one can make an informed decision and manage Python projects more effectively.
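Whichever tool installs the packages, it helps to record exactly which versions ended up in the environment. The sketch below uses the standard-library `importlib.metadata` to list installed distributions as `name==version` lines, comparable in spirit to `pip freeze` but not a replacement for it. The function names are hypothetical.

```python
from importlib import metadata
from typing import List, Optional


def pinned_requirements() -> List[str]:
    """Return 'name==version' lines for the distributions visible to this interpreter."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata lacks a name
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for line in pinned_requirements():
        print(line)
    print("pandas:", installed_version("pandas"))
```

Saving the printed lines to a requirements file lets a teammate recreate a matching environment with either tool.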







# Version Control in Collaborative Projects
In collaborative data science projects, version control is critical not only for managing code, but also for handling datasets, tracking experiments, and ensuring reproducibility across different stages of the workflow. Unlike traditional software engineering, data science workflows involve continuous experimentation, dynamic datasets, and multi-disciplinary teams working across evolving codebases and data versions.

To that effect, the complexities of data science make robust version control indispensable, for the reasons outlined below.

**Reproducibility:** Reproducibility is crucial for sharing findings and validating results. By keeping track of the precise code, parameters, datasets, and environment used, version control lets you try out different versions of an analysis and still revert to a previous one.

**Parallel Development:** Teams collaborate on various project components at the same time, with different members handling different parts of the project, e.g., feature engineering, modeling, and visualization. With a version control system, each team can work independently on its own branch without overwriting the others' work, which is essential in a fast-paced or agile work environment.

**Monitoring and Assessing Experiments:** Developing machine learning models entails many experiments with different setups and model changes. Version control helps teams compare performance metrics, identify areas for improvement, and avoid duplicated effort.

**Risk of Data Loss or Corruption:** When teams track versions of code, data, and models, they can easily restore the last working dataset or model if a change introduces bugs. This protects the integrity of the data.
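One lightweight way to tie a result to the exact data and parameters behind it is to commit a small manifest next to the code. The sketch below is an illustration under stated assumptions: the file name `example.csv`, the parameter names, and the helpers `file_sha256` and `build_manifest` are all hypothetical, and dedicated data-versioning tools offer much more than this.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_path, params):
    """Describe one experiment run: which data file and which parameters were used."""
    data_path = Path(data_path)
    return {
        "dataset": data_path.name,
        "dataset_sha256": file_sha256(data_path),
        "params": params,
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "example.csv"  # hypothetical dataset
        data.write_bytes(b"id,value\n1,10\n2,20\n")
        manifest = build_manifest(data, {"model": "baseline", "seed": 42})
        # The manifest is small enough to commit alongside the code.
        print(json.dumps(manifest, indent=2))
```

If the dataset changes by even one byte, the digest changes, so a mismatch between the committed manifest and the file on disk signals that the results may not be reproducible.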

# Differences between Jupyter Notebook and JupyterLab
Jupyter Notebook and JupyterLab are both open-source web applications that allow you to create and share documents containing live code, equations, visualizations, and narrative text. The primary purpose of these tools is to provide an interactive environment for data exploration, analysis, and visualization. Despite having a similar primary purpose these tools cater to different data science needs:

**Jupyter Notebook:** A single-document interface (SDI) optimized for linear storytelling, ideal for academic reports and teaching. It uses a Tornado web server and JSON-based *.ipynb files.

**JupyterLab:** A multi-document interface (MDI) with a flexible, modular, IDE-like environment, in which you can open several notebooks or other files (e.g. HTML, text, Markdown) as tabs in the same window.

It supports real-time collaboration, plugin extensibility, and complex project management.

Jupyter Notebook suits presentation-focused tasks, while JupyterLab excels in exploratory analysis and collaborative development.
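Because both tools store notebooks as JSON, a notebook can be inspected with nothing more than the standard `json` module. The sketch below builds a minimal notebook in what I understand to be the nbformat 4 layout and summarizes its cells. The cell ids and contents are illustrative, and `summarize_cells` is a hypothetical helper.

```python
import json

# A minimal notebook in the nbformat 4 layout: one Markdown cell, one code cell.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "id": "intro",
            "metadata": {},
            "source": ["# Salary analysis\n", "Narrative text goes here."],
        },
        {
            "cell_type": "code",
            "id": "load",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["import pandas as pd"],
        },
    ],
}


def summarize_cells(nb):
    """Return (cell_type, first line of source) for every cell in the notebook."""
    summary = []
    for cell in nb["cells"]:
        source = "".join(cell["source"])
        first_line = source.splitlines()[0] if source else ""
        summary.append((cell["cell_type"], first_line))
    return summary


# Round-trip through JSON text, as happens when a notebook is saved and reopened.
text = json.dumps(notebook, indent=1)
reloaded = json.loads(text)
print(summarize_cells(reloaded))
```

This plain-text structure is also why notebooks can be placed under version control, although stored cell outputs tend to make the diffs noisy.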



# Git and GitHub Integration
Git is a version control system that intelligently tracks changes in files. Git is particularly useful when you and a group of people are all making changes to the same files at the same time.

GitHub, on the other hand, is a cloud-based platform where one can store, share, and work together with others on files or code. GitHub uses Git underneath, which helps one manage repositories or folders easily.

**How do Git and GitHub work together?**

Files uploaded to GitHub are stored in a ***“Git Repository”***. When you run Git commands such as ***git commit*** on your local copy, Git records and manages your changes, and ***git push*** uploads them to the repository on GitHub. Hence GitHub is a cloud hosting platform, i.e. a hub, whereas Git is the system used to track changes and update what is stored on GitHub.

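**What is a pull request and when is it used?**

A pull request is a GitHub feature (not a Git command) through which a contributor proposes merging the changes on one branch into another. It is used once work on a feature branch or fork is ready, so that teammates can review and discuss the changes before they are merged into the main branch.

A pull request should not be confused with ***git pull***, which is a combination of two other Git commands:
1. ***git fetch*** – used to retrieve the latest changes from a remote repository
2. ***git merge*** – used to merge these changes into the local branch

In essence, ***git pull*** synchronizes your local repository with the remote repository by downloading and integrating the changes.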
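The two-step nature of `git pull` can be written out explicitly. The sketch below only builds and prints the commands; it does not run Git. It assumes the default merge behaviour (no rebase configured), and the remote name `origin`, the branch name `main`, and the helper `pull_as_two_steps` are assumptions for illustration.

```python
def pull_as_two_steps(remote="origin", branch="main"):
    """Return the fetch and merge commands that together approximate `git pull`."""
    return [
        ["git", "fetch", remote],               # download new commits from the remote
        ["git", "merge", f"{remote}/{branch}"],  # integrate them into the current branch
    ]


for command in pull_as_two_steps():
    print(" ".join(command))
```

Running the two commands separately is useful when you want to inspect what was fetched before merging it.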

**When to use git pull**
1. Starting work – this ensures you have the latest version of the files before making changes.
2. Collaborating with others – this loads the latest changes made by teammates.
3. Resolving divergence – this reconciles a local branch that has diverged from the remote.
4. Updating feature branches – this incorporates changes from the main branch.


# Setting up a virtual environment
For this activity I used PyCharm, so the environment was set up through Python.

![Setting up a virtual environment, activating the environment and installing pandas and matplotlib](virtual_env_setup.png)

# Initializing a Git repository in a local project folder and pushing it to a GitHub repository
![ Initialize a Git repository in a local project folder and push it to a GitHub repository](git_setup_initial.png)
![ Initialize a Git repository in a local project folder and push it to a GitHub repository](git_setup_2.png)
![ Initialize a Git repository in a local project folder and push it to a GitHub repository](git_setup_3.png)
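The screenshots above record the actual session, and their exact commands are not reproduced here. As a hedged sketch, a typical sequence for publishing a new local repository looks like the one below. It is assumed rather than taken from the screenshots; the URL is a placeholder, and the helpers `init_and_push_commands` and `run_all` are hypothetical. By default the script only prints the commands.

```python
import shlex
import subprocess


def init_and_push_commands(remote_url, branch="main", message="Initial commit"):
    """Return a typical command sequence for publishing a new local repository."""
    return [
        ["git", "init"],
        ["git", "add", "."],
        ["git", "commit", "-m", message],
        ["git", "branch", "-M", branch],
        ["git", "remote", "add", "origin", remote_url],
        ["git", "push", "-u", "origin", branch],
    ]


def run_all(commands, cwd=".", dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print("$", shlex.join(command))
        if not dry_run:
            subprocess.run(command, cwd=cwd, check=True)


if __name__ == "__main__":
    # Placeholder URL: replace it with the address of your own GitHub repository.
    url = "https://github.com/<user>/<repo>.git"
    run_all(init_and_push_commands(url))  # dry run: prints, does not execute
```

Passing `dry_run=False` would execute the commands in `cwd`, which requires Git to be installed and the remote repository to exist already.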


 ## 8. Simple Python script that imports pandas, reads a CSV file, and displays the first five rows

```python
import pandas as pd

# Read data from a CSV file
sal_df = pd.read_csv("D:/SCHOOL_WORK/DSA-6101-Practicals_Projects/Salaries.csv")

# Display the first five rows
sal_df.head()
```

| Unnamed: 0 | Id | EmployeeName | JobTitle | BasePay | OvertimePay | OtherPay | Benefits | TotalPay | TotalPayBenefits | Year | Notes | Agency | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | NATHANIEL FORD | GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY | 167411.18 | 0.0 | 400184.25 | | 567595.43 | 567595.43 | 2011 | | San Francisco | |
| 1 | 2 | GARY JIMENEZ | CAPTAIN III (POLICE DEPARTMENT) | 155966.02 | 245131.88 | 137811.38 | | 538909.28 | 538909.28 | 2011 | | San Francisco | |
| 2 | 3 | ALBERT PARDINI | CAPTAIN III (POLICE DEPARTMENT) | 212739.13 | 106088.18 | 16452.6 | | 335279.91 | 335279.91 | 2011 | | San Francisco | |
| 3 | 4 | CHRISTOPHER CHONG | WIRE ROPE CABLE MAINTENANCE MECHANIC | 77916.0 | 56120.71 | 198306.9 | | 332343.61 | 332343.61 | 2011 | | San Francisco | |
| 4 | 5 | PATRICK GARDNER | DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT) | 134401.6 | 9737.0 | 182234.59 | | 326373.19 | 326373.19 | 2011 | | San Francisco | |


```python
# Draw a random sample of 10 rows from the data
sal_df.sample(10)
```

| Unnamed: 0 | Id | EmployeeName | JobTitle | BasePay | OvertimePay | OtherPay | Benefits | TotalPay | TotalPayBenefits | Year | Notes | Agency | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 35836 | 35837 | LESLIE BAILEY | SPECIAL NURSE | 337.82 | 0.0 | 33.78 | | 371.6 | 371.6 | 2011 | | San Francisco | |
| 6728 | 6729 | BRANDON MCKELLEY | POLICE OFFICER I | 94535.7 | 7652.41 | 12464.45 | | 114652.56 | 114652.56 | 2011 | | San Francisco | |
| 76324 | 76325 | Brian C Cotter | Police Officer 2 | 120819.7 | 24156.14 | 11640.45 | 36528.85 | 156616.29 | 193145.14 | 2013 | | San Francisco | |
| 37781 | 37782 | Saed Toloui | Eng/Arch/Landscape Arch Sr | 160476.84 | 0.0 | 259.78 | 52171.43 | 160736.62 | 212908.05 | 2012 | | San Francisco | |
| 66697 | 66698 | Marciano Mora Jr | Custodian | 13350.61 | 35.66 | 69.73 | 7803.56 | 13456.0 | 21259.56 | 2012 | | San Francisco | |
| 148073 | 148074 | Oriana R Asoau | Junior Clerk | 456.0 | 299.25 | 0.0 | 7.55 | 755.25 | 762.8 | 2014 | | San Francisco | |
| 117900 | 117901 | Arnel F Bautista | Senior Stationary Engineer | 90070.05 | 4927.83 | 27621.52 | 34717.15 | 122619.4 | 157336.55 | 2014 | | San Francisco | |
| 118427 | 118428 | Helen M Hale | Mayoral Staff XV | 106409.98 | 0.0 | 0.0 | 46929.72 | 106409.98 | 153339.7 | 2014 | | San Francisco | |
| 31840 | 31841 | LAKESHIA CLARK | PUBLIC SERVICE AIDE-SPECIAL PROGRAMS | 9391.62 | 0.0 | 0.0 | | 9391.62 | 9391.62 | 2011 | | San Francisco | |
| 11010 | 11011 | EDDY WONG | DEPUTY SHERIFF | 86506.08 | 250.5 | 5965.26 | | 92721.84 | 92721.84 | 2011 | | San Francisco | |


# Reflection
## 9. After using two different environments (e.g., local Jupyter and Google Colab), describe your experience. What are the pros and cons of each?

Having worked with local Jupyter through the PyCharm environment and with Google Colab, I came to appreciate a number of advantages and disadvantages of each, as outlined below.

**Pros of Google Colab**

- Being cloud-based, Google Colab eliminates the need for complex installations or local resources. You can start coding right away without worrying about hardware limitations, which is particularly convenient for those who want to focus on coding without the setup.

- Google Colab offers free access to GPUs and TPUs, which are essential for running computationally intensive tasks like training machine learning models. This makes it an excellent choice for AI projects that require heavy computational power.

- Google Colab is tightly integrated with Google Drive, making it incredibly easy to share notebooks and collaborate in real time. Whether you’re working on a team project or sharing your work with others, the process is seamless and hassle-free.

**Cons of Google Colab**
- While Google Colab is easy to use, it offers fewer customization options than Jupyter Notebook. If you need to adjust configurations, install specific software, or use custom setups, Google Colab may not be as flexible as Jupyter.

- Since Google Colab is cloud-based, it is dependent on an internet connection. Additionally, sessions can time out after a period of inactivity, meaning you may lose your progress if you’re not careful. This can be frustrating for long-running tasks or when working in areas with unreliable internet.

**Pros of Using Local Jupyter Notebook in PyCharm**
- Running Jupyter Notebook locally in an IDE like PyCharm gives one full control over the development environment, allowing complete customization of the setup: one can configure the Python environment of choice (conda, venv) and install specific packages without restriction.
- Seamless integration with IDE features: when run in PyCharm, Jupyter Notebook is enhanced with powerful IDE capabilities such as smart code completion, debugging support, and version control integration.
- Offline accessibility and security: development does not depend on an internet connection, and sensitive data stays on the local machine.

**Cons of Local Jupyter Notebook**
- Unlike cloud-based solutions such as Google Colab, a local environment requires initial setup and configuration involving many manual steps; one also needs to set up a virtual environment to avoid dependency conflicts.
- Unlike Google Colab, a local Jupyter Notebook does not come with free GPU acceleration, so large projects such as deep learning may require a manual GPU setup or a paid cloud service.
- Different projects may need separate Python environments because of their dependencies, and if these are not well managed, library version conflicts result.
