# 📜 IBM Data Science Professional Certificate  
*Curiosity to Capability — One Notebook at a Time*

---

**Compiled and Authored by:**  
**Partho Sarothi Das**  
Dhaka, Bangladesh  
🎓 Bachelor's & Master's in Statistics  
💼 Investment Banking Professional → Aspiring Data Scientist  

>**Disclaimer:** This notebook is based on content from the [IBM Data Science Professional Certificate](https://www.coursera.org/professional-certificates/ibm-data-science) offered on Coursera. It is intended for personal learning and review purposes.

---

# Languages of Data Science

* **Choosing a Programming Language** depends on:

  * Your goals and problem types
  * The organization and role you work in
  * The age and nature of existing applications

* **Recommended Core Languages** for data science:

  * Python, R, SQL

* **Other Useful Languages** with specific use cases:

  * Scala, Java, C++, Julia
  * JavaScript, PHP, Go, Ruby, Visual Basic

* **Data Science-Related Roles** include:

  * Business Analyst
  * Data Analyst
  * Data Engineer
  * Data Scientist
  * Database Engineer
  * Research Scientist
  * Software Engineer
  * Statistician
  * Product & Project Managers

### Final Note

The best language to learn in data science depends on your **specific tasks, goals, and environment**. While Python, R, and SQL are foundational, many other languages can be valuable based on your role and use case.

# Introduction to Python

### Who Uses Python?

* **Widely used** in data science, AI, machine learning, web development, and IoT
* Popular among both **beginners** and **experienced developers**
* Used by major organizations like **IBM, NASA, Google, Facebook, Amazon, Spotify**, and more
* Over **75% of data science jobs** in 2019 required Python
* Over **80% of data professionals** use it (Kaggle 2019 survey)


### Benefits of Python

* Clear, readable syntax
* Less code needed than in many other languages
* Extensive standard library for databases, automation, web scraping, etc.
* Powerful data science and AI libraries:

  * **Pandas, NumPy, SciPy, Matplotlib** (data science)
  * **TensorFlow, PyTorch, Keras, Scikit-learn** (AI/ML)
  * **NLTK** (Natural Language Processing)
* Large **global community** and strong documentation
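
As a small illustration of the readable syntax and standard library mentioned above, the sketch below counts word frequencies using only `collections.Counter`. The sample sentence and the function name `top_words` are made up for this example and do not come from the course.

```python
from collections import Counter


def top_words(text, n=3):
    """Return the n most common lowercase words in text with their counts."""
    words = text.lower().split()
    return Counter(words).most_common(n)


# Hypothetical sample text, used only for illustration.
sample = "data beats opinion and good data beats bad data"
print(top_words(sample, 2))
```

The whole task fits in a few lines without any third-party package, which is the kind of brevity the list above refers to.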


### Diversity & Inclusion in the Python Community

* Led by the **Python Software Foundation** with a strong **code of conduct**
* Inclusive initiatives like **PyLadies**, an international group supporting women in Python
* Focus on safe, welcoming environments—online and in person

### Final Takeaway

Python is the most popular language in data science thanks to its simplicity, versatility, strong community, and support for diverse use cases—from machine learning to web development. It also stands out for its **inclusive, community-driven culture**.

# Introduction to R Language

### Open Source vs Free Software

* **Similarities**:

  * Both are free to use and support collaboration
  * Often governed by licenses like **GNU General Public License (GPL)**

* **Differences**:

  * **Open Source**: Business-oriented, promoted by the **Open Source Initiative (OSI)**
  * **Free Software**: Value-driven, promoted by the **Free Software Foundation (FSF)**
  * **Python** is open source; **R** is free software


### Who Uses R?

* Popular among:

  * **Statisticians**, **mathematicians**, **data miners**
  * **Academia** and researchers
* Used by companies like:

  * **IBM, Google, Facebook, Microsoft, Bank of America, Ford, Uber, Trulia**


### Benefits of Using R

* **Free for private, commercial, and public use**
* **Array-oriented syntax**: Easy for those with minimal programming experience
* Over **15,000 packages** (as of 2018) for data analysis and visualization
* Known as the **world’s largest repository of statistical knowledge**
* Strong **object-oriented programming** and **math capabilities** (e.g., matrix operations)
* Integrates well with **C++, Java, Python, .NET**, and more
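
To make the "array-oriented syntax" point concrete, here is a rough comparison written in Python. The R expression in the comment is shown only for contrast. The function name `z_scores` and the sample values are assumptions for this sketch. Standardizing a vector is one expression in R, while plain Python lists need an explicit element-wise step (libraries such as NumPy bring a similar style to Python).

```python
from statistics import mean, stdev

# In R, the whole vector is standardized in one expression:
#   z <- (x - mean(x)) / sd(x)
# Plain Python lists do not support element-wise arithmetic, so the
# same formula is applied to each element explicitly.


def z_scores(values):
    """Standardize values using the sample standard deviation (as R's sd does)."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]


x = [2.0, 4.0, 6.0]  # hypothetical sample data
print(z_scores(x))
```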


### Global R Communities

* **useR**, **WhyR**, **SatRdays**, **R-Ladies**
* Events and updates available on the **R Project website**


### Final Takeaway

R is a powerful free software language, especially suited for statistical analysis and academic use. Its rich ecosystem, ease of use for math-based programming, and strong global community make it a valuable tool in a data scientist’s toolkit.


# Introduction to SQL


### What is SQL?

* **SQL** stands for **Structured Query Language** (pronounced *S-Q-L* or *Sequel*)
* It is a **non-procedural language** designed specifically for:

  * **Querying** and **managing structured data**
* **Not a general-purpose language**, but essential for data-related tasks
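
One way to see what "non-procedural" means is to compute the same result twice: once by describing the result in SQL, and once by spelling out the steps in a general-purpose language. The sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `employees` table and its rows are hypothetical.

```python
import sqlite3

# In-memory database with a hypothetical "employees" table (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary REAL)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [("Ana", "Risk", 90.0), ("Ben", "Risk", 70.0), ("Cy", "Sales", 50.0)],
)

# Declarative: describe the result wanted; the database decides how to compute it.
declarative = conn.execute(
    "SELECT dept, AVG(salary) FROM employees GROUP BY dept ORDER BY dept"
).fetchall()

# Procedural equivalent: spell out every step of the computation by hand.
totals = {}
for name, dept, salary in conn.execute("SELECT name, dept, salary FROM employees"):
    count, total = totals.get(dept, (0, 0.0))
    totals[dept] = (count + 1, total + salary)
procedural = sorted((dept, total / count) for dept, (count, total) in totals.items())

print(declarative)
print(procedural)
conn.close()
```

Both approaches produce the same averages per department. The SQL version states only what is wanted and leaves the looping and grouping to the database.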

### History & Use

* Developed in **1974 at IBM** (older than Python and R)
* Originally built for **relational databases**, now used in **NoSQL** and **big data** environments as well
* Relational databases consist of **tables** (like spreadsheets) with fixed columns and flexible rows
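
The "fixed columns, flexible rows" idea can be sketched with SQLite through Python's `sqlite3` module. The `trades` table, its columns, and the sample rows are invented for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Columns are fixed when the table is defined.
conn.execute("CREATE TABLE trades (id INTEGER PRIMARY KEY, ticker TEXT, qty INTEGER)")

# Rows are flexible: add as many as needed.
conn.executemany(
    "INSERT INTO trades (ticker, qty) VALUES (?, ?)",
    [("IBM", 10), ("AAPL", 5), ("IBM", 7)],
)
row_count = conn.execute("SELECT COUNT(*) FROM trades").fetchone()[0]

# A row that refers to a column the table does not have is rejected.
try:
    conn.execute("INSERT INTO trades (ticker, qty, price) VALUES ('IBM', 1, 9.5)")
    extra_column_accepted = True
except sqlite3.OperationalError:
    extra_column_accepted = False

print(row_count, extra_column_accepted)
conn.close()
```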

### SQL Language Elements

* Clauses
* Expressions
* Predicates
* Queries
* Statements
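
The elements listed above can be pointed out in a single annotated statement. This is a minimal sketch using Python's `sqlite3` module. The `orders` table and its data are hypothetical, and the labels in the SQL comments are one reasonable reading of the terms rather than an official breakdown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 100.0, "paid"), (2, 40.0, "open"), (3, 60.0, "paid")],
)

# The full text below is one statement; because it retrieves rows, it is a query.
statement = """
SELECT id, amount * 2 AS doubled        -- SELECT clause; "amount * 2" is an expression
FROM orders                             -- FROM clause
WHERE status = 'paid' AND amount > 50   -- WHERE clause; its condition is a predicate
ORDER BY id                             -- ORDER BY clause
"""

rows = conn.execute(statement).fetchall()
print(rows)
conn.close()
```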

### Benefits of SQL

* Widely used in **data science, analytics, and engineering**
* Allows **direct access** to data — no need to copy it
* Improves **workflow speed and efficiency**
* Acts as an **interpreter** between you and the database
* **ANSI standard** — SQL skills are transferable across different databases

### Popular SQL Databases

* MySQL, PostgreSQL, SQLite, Oracle, MariaDB
* IBM Db2, Microsoft SQL Server, Apache OpenOffice Base

> Note: SQL syntax may vary slightly between systems — focus on one and join its community


### Final Takeaway

SQL is a powerful, non-procedural language designed for structured data and relational databases. It’s essential for data professionals and easily transferable across platforms, making it a must-learn tool in any data science career path.


# Other Languages for Data Science

### Key Languages & Their Use in Data Science

#### Java

* General-purpose, object-oriented, enterprise-grade language
* Compiled to bytecode and runs on the **JVM**
* Used for scalable, high-performance applications
* **Data Science Tools**:

  * **Weka** (data mining)
  * **Java-ML** (machine learning)
  * **Apache Spark MLlib** (ML library for big data)
  * **Deeplearning4j** (deep learning)
  * **Hadoop** (big data processing & storage)

#### Scala

* General-purpose language supporting **functional programming**
* Built to improve Java and runs on the JVM
* Highly scalable; name stands for “**scalable language**”
* **Main Use**: **Apache Spark**, which supports:

  * **Shark** (query engine, since superseded by Spark SQL)
  * **MLlib** (machine learning)
  * **GraphX** (graph processing)
  * **Spark Streaming** (real-time data)

#### C++

* High-performance, low-level programming language
* Used for **system programming** and **real-time data applications**
* **Data Science Tools**:

  * **TensorFlow core** (deep learning)
  * **MongoDB** (NoSQL database)
  * **Caffe** (deep learning framework with Python/Matlab bindings)

#### JavaScript

* Originally browser-based; extended to server-side with **Node.js**
* Not related to Java
* **Data Science Tools**:

  * **TensorFlow.js** (ML in browser and Node.js)
  * **Brain.js**, **machinelearn.js** (ML libraries)
  * **R-js** (R’s linear algebra rewritten in TypeScript for browser use)

#### Julia

* Designed at MIT (2012) for **high-performance numerical computing**
* Combines speed of C with ease of Python/R
* Compiled (just-in-time) language with refined parallelism and interoperability with libraries from other languages
* **Key Tool**: **JuliaDB** (managing large datasets)

### Final Takeaway

Besides Python, R, and SQL, several other programming languages have specialized roles in data science:

* **Java, Scala, and C++** power large-scale, high-performance systems
* **JavaScript** enables ML in the browser
* **Julia** offers a promising future for fast, scientific computing

Each language brings unique strengths depending on the **task**, **platform**, and **performance needs** in the data science pipeline.

# Module 2 Summary

Congratulations! You have completed this module. At this point in the course, you know:

- You should select a language to learn depending on your needs, the problems you are trying to solve, and whom you are solving them for.

- The popular languages are Python, R, SQL, Scala, Java, C++, and Julia.

- For data science, you can use Python's scientific computing libraries like Pandas, NumPy, SciPy, and Matplotlib. 

- Python can also be used for Natural Language Processing (NLP) using the Natural Language Toolkit (NLTK). 

- Python is open source, and R is free software. 

- R language’s array-oriented syntax makes it easier to translate from math to code for learners with no or minimal programming background.

- SQL is different from other software development languages because it is a non-procedural language.

- SQL was designed for managing data in relational databases. 

- If you learn SQL and use it with one database, you can apply your SQL knowledge with many other databases easily.

- Data science tools built with Java include Weka, Java-ML, Apache Spark MLlib, and Deeplearning4j.

- For data science, a popular program built with Scala is Apache Spark, which includes Shark, MLlib, GraphX, and Spark Streaming.

- Programs built for Data Science with JavaScript include TensorFlow.js and R-js.

- One great application of Julia for Data Science is JuliaDB.