# 📜 IBM Data Science Professional Certificate  
*Curiosity to Capability — One Notebook at a Time*

---

**Compiled and Authored by:**  
**Partho Sarothi Das**  
Dhaka, Bangladesh  
🎓 Bachelor's & Master's in Statistics  
💼 Investment Banking Professional → Aspiring Data Scientist  

>**Disclaimer:** This notebook is based on content from the [IBM Data Science Professional Certificate](https://www.coursera.org/professional-certificates/ibm-data-science) offered on Coursera. It is intended for personal learning and review purposes.

---
---

# Introduction to R and RStudio

### What is R?

* **R** is a **statistical programming language** widely used for:

  * Data processing & manipulation
    
  * Statistical inference
 
  * Data analysis
 
  * Machine learning
 
* Popular among **academics**, **healthcare professionals**, and **governments** (based on 2017 analysis)


### R’s Capabilities

* Import data from:

  * Flat files (CSV, TXT)
  * Databases
  * Web APIs
  * Other software (e.g., SPSS, STATA)

* Easy-to-use functions and strong support for **visualizations**

* Many built-in tools—often no need to install extra packages for common tasks
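
As a minimal sketch of the flat-file case, base R's `write.csv()` and `read.csv()` can round-trip a small data frame without any extra packages. The `scores` data and the temporary file are made up purely for this illustration.

```r
# Hypothetical example data, invented for this sketch
scores <- data.frame(
  name  = c("Ana", "Ben", "Chen"),
  score = c(82, 91, 77)
)

# Write the data to a temporary CSV file, then import it again
csv_path <- tempfile(fileext = ".csv")
write.csv(scores, csv_path, row.names = FALSE)
imported <- read.csv(csv_path)

# Inspect the structure and compute a simple summary
str(imported)
mean_score <- mean(imported$score)
print(mean_score)
```

Reading from a database, a web API, or an SPSS/STATA file follows the same pattern but typically relies on an add-on package for the import step.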

### RStudio: The IDE for R

* **RStudio** is a popular **integrated development environment (IDE)** for writing and running R code

* Increases **productivity** and **ease of use**

#### RStudio Interface Tabs:

* **Editor**: Write and execute R code
* **Console**: Run R commands directly
* **Workspace & History** (labeled **Environment** and **History** in current RStudio versions): Track created R objects and past commands
* **Files**: Access local file directory
* **Plots**: View and export plots as PDF/image
* **Packages**: Manage R packages
* **Help**: Access documentation and support


### Popular R Libraries for Data Science

* `dplyr` – Data manipulation
* `stringr` – String operations
* `ggplot2` – Data visualization
* `caret` – Machine learning
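
A short sketch of how two of these libraries might be combined, assuming `dplyr` and `stringr` are already installed. It uses the built-in `mtcars` dataset; the `cars` and `merc_summary` names are chosen only for this example.

```r
# Assumes dplyr and stringr are already installed
library(dplyr)
library(stringr)

# Move the car names from row names into a regular column
cars <- mtcars
cars$model <- rownames(mtcars)

# stringr picks out models whose name starts with "Merc";
# dplyr then averages fuel economy by number of cylinders
merc_summary <- cars %>%
  filter(str_detect(model, "^Merc")) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n_cars = n())

print(merc_summary)
```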

### Practice with Virtual Labs

* Access a hosted **RStudio environment** through **Skills Network Labs**
* No need to install or configure R/RStudio locally
* Hands-on practice directly from your browser

### Key Takeaways

* **R** is powerful for data analysis, visualization, and ML
* **RStudio** provides a clean, efficient interface for working with R
* Several well-established libraries enhance R’s capability in data science
* Cloud-based virtual environments make it easy to get started without setup hassles

---

# Plotting in RStudio

### Popular R Visualization Packages

You can install any package from CRAN with `install.packages("package_name")`, then load it in each session with `library(package_name)`.

1. **`ggplot2`** – Layered grammar of graphics for plots like histograms, bar charts, scatterplots, etc.
2. **`plotly`** – Web-based, interactive charts (can be saved as HTML)
3. **`lattice`** – High-level visualization for multivariate data (less customization)
4. **`leaflet`** – Create interactive maps


### Using R's Base Plot Function

* Syntax: `plot(data)`
* For a numeric vector, draws a **scatter plot** of the values against their index
* You can add:

  * **Lines**: `type = "l"`
  * **Title**: `title("Your Title")`
  * Example:

    ```r
    plot(c(1,2,3,4,5), type = "l")
    title("Simple Line Plot")
    ```
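
Building on the example above, the following sketch plots explicit x and y vectors, labels the axes, and writes the result to a PDF file with base R only. The `hours` and `pages` values are invented for illustration, and the output goes to a temporary file.

```r
# Hypothetical measurements, invented for this sketch
hours <- c(1, 2, 3, 4, 5)
pages <- c(12, 20, 31, 38, 52)

# Open a PDF graphics device so the plot is written to a file
pdf_path <- tempfile(fileext = ".pdf")
pdf(pdf_path)

# type = "b" draws both points and connecting lines
plot(hours, pages, type = "b",
     xlab = "Hours studied", ylab = "Pages read")
title("Reading Progress")

# Close the device to finish writing the file
dev.off()
```

In RStudio the same plot can instead be drawn without `pdf()` and `dev.off()` and then exported from the **Plots** tab.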

### Using `ggplot2` for Advanced Plots

* First, load the library: `library(ggplot2)`
* Basic structure:

  ```r
  ggplot(data, aes(x = var1, y = var2)) + geom_point()
  ```
* Example with `mtcars`:

  ```r
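  library(ggplot2)  # load the plotting package

  # Scatter plot of fuel economy against weight from the built-in mtcars data
  ggplot(mtcars, aes(x = mpg, y = wt)) +
    geom_point() +
    ggtitle("Mileage vs Weight") +
    labs(x = "Miles per Gallon", y = "Weight (1000 lbs)")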
  ```
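
To illustrate the "layered" idea, here is a sketch that stacks two geometry layers on one plot, assuming `ggplot2` is installed. The choice of a linear trend line and of coloring by cylinder count is just one possible design, not something prescribed by the course.

```r
library(ggplot2)  # assumes ggplot2 is installed

# Start from the data and the shared x/y mapping
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  # Layer 1: points, colored by number of cylinders
  geom_point(aes(color = factor(cyl))) +
  # Layer 2: a single linear trend line across all cars
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Weight (1000 lbs)", y = "Miles per Gallon",
       color = "Cylinders") +
  ggtitle("Fuel Economy vs Weight")

print(p)
```

Because the color mapping is set inside `geom_point()` rather than in the top-level `aes()`, the trend line is fitted once to all cars instead of once per cylinder group.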

### Enhancing ggplot with `GGally`

* `GGally` extends `ggplot2` with additional features for visualizing relationships in complex datasets
* Useful for:

  * Pair plots
  * Correlation matrix visualizations
  * Parallel coordinate plots
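
A minimal pair-plot sketch, assuming the `GGally` package is installed. The three `mtcars` columns are an arbitrary choice for illustration.

```r
library(GGally)  # assumes GGally (and ggplot2) are installed

# Pair plot of three numeric mtcars columns: by default, scatter plots
# below the diagonal, density curves on it, and correlations above it
pair_plot <- ggpairs(mtcars, columns = c("mpg", "wt", "hp"),
                     title = "Pairwise Relationships in mtcars")

print(pair_plot)
```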

### Key Takeaways

* R offers several libraries for creating both static and interactive visualizations.
* The **base `plot()` function** is useful for simple visualizations.
* **`ggplot2`** provides a powerful and flexible grammar for layered visualizations.
* Use **`ggtitle()`** and **`labs()`** to add context with titles and axis labels.
* **`GGally`** enhances `ggplot2` with additional plot types for deeper analysis.

---

# Overview of Git and GitHub

### What is Version Control?

* Version control systems help **track changes** to documents or code.
  
* Useful for **recovering older versions**, and simplifies **collaboration**.

* Example: Sharing a shopping list with roommates—version control helps avoid confusion.

### What is Git?

* Git is **free, open-source**, and distributed under the **GNU General Public License**.
  
* It’s a **distributed version control system**: everyone has a copy of the project locally.

* Widely used in software development, data science, and project collaboration.


### What is GitHub?

* GitHub is a **web-based platform** for hosting Git repositories.

* It allows easier **sharing**, **collaboration**, and **project management**.

* Alternatives include GitLab, Bitbucket, and Beanstalk.

### Key Terms

* **Repository (repo):** Project folder under version control.
  
* **Fork:** A personal copy of someone else's repository.

* **Pull Request:** Request to merge your changes into another branch or repo.

* **Working Directory:** Your local copy of the repo’s files.

* **SSH:** Secure Shell, a protocol for connecting securely to a remote Git server.


### Essential Git Commands

| Command        | Description                                  |
| -------------- | -------------------------------------------- |
| `git init`     | Initialize a new Git repository              |
| `git add`      | Stage changes for commit                     |
| `git status`   | Check current state of the working directory and staging area |
| `git commit`   | Save staged changes with a message           |
| `git reset`    | Unstage changes or move the branch back to an earlier commit (`--hard` also discards working-directory changes) |
| `git log`      | View history of commits                      |
| `git branch`   | Create/manage branches                       |
| `git checkout` | Switch between branches                      |
| `git merge`    | Merge branches together                      |
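
These commands are normally typed in a terminal. Since this notebook's examples are in R, the sketch below drives the `git` command-line tool from R with `system2()` instead. It assumes `git` is installed and on the PATH. The `git()` helper, the `notes.txt` file, and the throwaway identity settings are hypothetical, introduced only for this illustration.

```r
# Create an empty folder for the demo repository
repo <- tempfile("demo-repo-")
dir.create(repo)

# Hypothetical helper: run `git <args>` inside the demo folder.
# The -c options set a throwaway identity so `git commit` works
# even when no global user.name / user.email is configured.
git <- function(...) {
  out <- system2("git", c("-C", shQuote(repo),
                          "-c", "user.name=example",
                          "-c", "user.email=example@example.com",
                          ...),
                 stdout = TRUE)
  invisible(out)
}

git("init", "-q")                          # git init: start a new repository
writeLines("first note", file.path(repo, "notes.txt"))
git("add", "notes.txt")                    # git add: stage the new file
print(git("status", "--short"))            # git status: shows the staged file
git("commit", "-q", "-m", shQuote("Add notes file"))  # git commit: save it
history <- git("log", "--oneline")         # git log: one line per commit
print(history)
```

In a terminal, the equivalent sequence is simply `git init`, `git add notes.txt`, `git status`, `git commit -m "Add notes file"`, and `git log --oneline`, run from inside the project folder.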


### Getting Started Resources

* Use GitHub tutorials: [try.github.io](https://try.github.io)
* Download Git cheat sheets
* Learn hands-on by practicing with projects and guided labs


### Key Takeaways

* Git helps manage code history and collaboration efficiently.

* GitHub provides an online platform for sharing and collaborating on Git repositories.

* Learning basic Git commands is essential for working with others in data science and software projects.

---

# Introduction to GitHub

### Background and History

* In the early 2000s, **Linux development** used BitKeeper, a proprietary version control system that was free of charge for open-source projects.
* In 2005, that free-of-charge license was withdrawn, leading **Linus Torvalds** to create a new system: **Git**.
* Git was designed to support:

  * 🔄 **Non-linear development** (handling rapid patch rates)
  * 🌐 **Distributed development**
  * 🔌 Compatibility with existing systems
  * ⚡ Efficient handling of large projects
  * 🔐 **Cryptographic authentication**
  * 🧩 **Customizable merge strategies**


### What Makes Git Special?

* **Distributed Version Control System (DVCS)**:

  * Every developer has a **full local copy** of the development history.
  * Changes can be shared between repositories without a central server.
  * Promotes **collaboration**, **flexibility**, and **parallel development**.

* **Main Branch Strategy**:

  * Teams work on separate branches and **merge** when features are ready.
  * Encourages **continuous integration** and **agile workflows**.

* **Centralized Management**:

  * Supports **access controls** and **role-based administration**.


### GitHub and Related Tools

* **GitHub**:

  * A **web-based hosting platform** for Git repositories.
  * Owned by Microsoft.
  * Offers **free**, **pro**, and **enterprise** accounts.
  * Hosts over **100 million** repositories (as of 2019).
  * Provides a **browser interface** and a **GitHub Desktop client**.

* **Repositories (Repos)**:

  * Data structures for storing code and documents.
  * Allow **version control**, **collaboration**, and **change tracking**.

* **IBM Cloud**:

  * Integrates Git repos and open-source tools for cloud development.

* **GitLab**:

  * A **DevOps platform** with integrated Git, CI/CD, code review, and collaboration features.
  * Allows developers to:

    * Collaborate on code
    * Branch and merge
    * Streamline testing and deployment


### Key Takeaways

* **GitHub** is a powerful platform for hosting and managing **Git repositories**.
* **Git** enables flexible, distributed, and efficient software development.
* Tools like GitHub and GitLab enhance collaboration and support the full development lifecycle.

---

# GitHub Repositories

### Creating a GitHub Account

* Visit **[https://github.com](https://github.com)**
* Provide:

  * A username
  * Your email address
  * A password
* Complete a simple CAPTCHA puzzle and click Verify
* Choose the free personal account option (default)
* Optionally answer some questions about your experience and interests
* Confirm your account by clicking a verification link sent to your email

### Creating a Repository

* After signing in, GitHub provides options:

  * Create a repository
  * Create an organization
  * Take the Intro to GitHub course

* A repository (repo) is a data structure that stores:

  * Source code
  * Project files like README, licenses, and documentation

* You can make your repo **public** or **private**


### Main Tabs in a Repository

* **Code** – Contains source code and any files (e.g. README, license)
* **Issues** – Track tasks, bugs, and project enhancements
* **Pull Requests** – Used for proposing, reviewing, and merging changes
* **Projects** – Manage and plan features using boards and task lists
* **Wiki** – Add documentation pages for users and contributors
* **Security** – Tools to manage vulnerability alerts and permissions
* **Insights** – Analytics like contributors, commit frequency, etc.
* **Settings** – Manage repo name, visibility, collaborators, webhooks, etc.


### Key Takeaways

* **GitHub** makes it easy to sign up and get started with source control.
* A **repository** is the heart of your project—it tracks versions and facilitates collaboration.
* GitHub repositories come with powerful built-in tools for project management and communication.

---

# Creating and Editing Files in GitHub

### Creating a New Repository

1. Sign in to your GitHub account
2. Click the **+** icon (top right) → New repository
3. Fill in:

   * Repository name
   * (Optional) Description
   * Choose visibility (public or private)
   * Check “Initialize this repository with a README”


4. Click Create repository


### Editing the README File

* On the repository page, click the **pencil icon** next to `README.md`
* Edit the content in the online editor
* Scroll down to the **“Commit changes”** section

  * Add a **commit message**
  * Optionally add a **description**
  * Click **Commit changes**
* Your changes will be saved and visible immediately


### Creating a New File

1. On the repository home screen, click **Add file** → **Create new file**
2. Enter a file name, e.g., `firstpython.py`
3. Add a **description comment** and your **code**
4. Scroll down and **commit the changes**
5. The file will now appear in your repository


### Uploading a File from Local System

1. Click Add file → Upload files
2. Click Choose your files and select from your computer
3. Wait for the upload to complete
4. Click Commit changes
5. Uploaded files are now part of the repository

### Key Takeaways

* GitHub’s web interface lets you:

  * Create repositories
  * Edit and commit files directly in the browser
  * Create and upload files easily
* Committing is essential to save and track your changes
* You can always return and edit files using the built-in editor

---
---

# GitHub: Working with Branches

### 🌿 **What is a Branch?**

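* A **branch** is an independent line of development that starts from a snapshot of your repository, so you can make changes without affecting the **master** branch (named `main` by default in repositories created on GitHub since late 2020).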
* The **master branch** holds the stable, deployable codebase.
* A **child branch** is created from the master branch to:

  * Make edits
  * Test code
  * Experiment freely
* Once validated, changes from a child branch can be **merged back** into the master.

---

### 🧪 **Why Use Branches?**

* Prevent breaking the main workflow
* Allow multiple team members to work independently
* Encourage proper testing and approval before deployment

---

### 🛠️ **Steps to Work with Branches on GitHub**

1. **Create a Branch**

   * On the repository page, click the branch dropdown → type new branch name (e.g., `child-branch`) → press Enter
   * Now you have two branches: `master` and `child-branch`

2. **Make Changes in the Child Branch**

   * Switch to the child branch
   * Click **Add file** → **Create new file**
   * Name it (e.g., `test_child.py`), add code (e.g., `print("inside child branch")`)
   * Add a **commit message** like `Create test_child.py`, then click **Commit new file**

3. **Verify the Change**

   * Switch back to the `master` branch → confirm the new file is **not** present
   * This proves the changes are isolated to the child branch

---

### 🔀 **Merging with a Pull Request (PR)**

* Once your code is ready:

  * Click **Compare & pull request**
  * Review the differences between branches
  * Add a **title** and **comment**, then click **Create pull request**
  * Click **Merge pull request** → **Confirm merge**

---

### **After the Merge**

* The child branch changes are now in the `master`
* You can **delete the child branch** if it’s no longer needed
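
The steps above use GitHub's web interface and a pull request. As a rough local counterpart, the sketch below runs the same create, change, verify, and merge cycle with the `git` command-line tool, driven from R via `system2()`. It assumes `git` is installed and on the PATH; the `git()` helper and the throwaway identity are hypothetical. It performs a plain local merge, not a reviewed pull request.

```r
repo <- tempfile("branch-demo-")
dir.create(repo)

# Hypothetical helper standing in for typing git commands in a terminal
git <- function(...) {
  out <- system2("git", c("-C", shQuote(repo),
                          "-c", "user.name=example",
                          "-c", "user.email=example@example.com",
                          ...),
                 stdout = TRUE)
  invisible(out)
}

# Set up a repository with one commit on the default branch
git("init", "-q")
writeLines("# Demo", file.path(repo, "README.md"))
git("add", "README.md")
git("commit", "-q", "-m", shQuote("Initial commit"))
main_branch <- git("rev-parse", "--abbrev-ref", "HEAD")  # "master" or "main"

# 1. Create the child branch and switch to it
git("checkout", "-q", "-b", "child-branch")

# 2. Add a file on the child branch only
writeLines('print("inside child branch")', file.path(repo, "test_child.py"))
git("add", "test_child.py")
git("commit", "-q", "-m", shQuote("Create test_child.py"))

# 3. Back on the default branch the file is absent: the change is isolated
git("checkout", "-q", main_branch)
present_before_merge <- file.exists(file.path(repo, "test_child.py"))

# 4. Merge the child branch, then delete it
git("merge", "-q", "child-branch")
present_after_merge <- file.exists(file.path(repo, "test_child.py"))
git("branch", "-q", "-d", "child-branch")

print(c(before = present_before_merge, after = present_after_merge))
```

The file is missing before the merge and present afterwards, which mirrors what the branch dropdown shows on GitHub.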

---

### **Key Takeaways**

* Branching allows **safe, parallel development**
* Always test in branches before merging into `master`
* Pull Requests enable **collaboration and code review**
* GitHub provides an intuitive web interface for all these steps

---

# Module 5 Summary

Congratulations! You have completed this module. At this point in the course, you know:

- The capabilities of R and its uses in Data Science.

- The RStudio interface for running R code.

- Popular R packages for Data Science.

- Popular data visualization packages in R.

- Plotting with the built-in R `plot()` function.

- Plotting with ggplot.

- Adding titles and changing the axis names using the `ggtitle()` and `labs()` functions.

- A Distributed Version Control System (DVCS) keeps track of changes to code, regardless of where it is stored. 

- Version control allows multiple users to work on the same codebase or repository, each mirroring it on their own computer if needed, while the distributed version control software keeps the various mirrors synchronized.

- Repositories are storage structures that:

    - Store the code

    - Track issues and changes

    - Enable you to collaborate with others

- Git is one of the most popular distributed version control systems. 

- GitHub, GitLab and Bitbucket are examples of hosted version control systems.

- Branches are used to isolate changes to code. When the changes are complete, they can be merged back into the main branch.

- Repositories can be cloned to make it possible to work locally, then sync changes back to the original.