# What is Data Science?


Data science is a multidisciplinary field that combines expertise from several domains, including **statistics, mathematics, and computer science**, to analyze and interpret complex data. The goal of data science is to uncover **patterns and valuable information** that support decision-making and help solve data-driven problems.

# Key Steps of a Data Science Project

1. **Data Collection:** Gathering relevant data from various sources, which may include databases, files, APIs, sensors, and more.

2. **Data Cleaning and Preprocessing:** Ensuring data quality by handling missing values, outliers, and other inconsistencies.

3. **Exploratory Data Analysis (EDA):** Examining and visualizing the data to understand its characteristics, identify patterns, and generate hypotheses.

4. **Statistical Analysis:** Applying statistical methods to draw inferences from the data, test hypotheses, and quantify uncertainty.

5. **Machine Learning:** Building predictive models and making sense of complex relationships within the data. Machine learning algorithms are used for tasks like regression, classification, clustering, and recommendation.

6. **Feature Engineering:** Transforming or creating features (variables) to improve the performance of machine learning models.

7. **Model Evaluation and Validation:** Assessing the performance of models using appropriate metrics and ensuring that they generalize well to new, unseen data.

8. **Deployment:** Integrating models and insights into business processes, applications, or decision-making systems.
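
The steps above can be sketched end to end on a toy problem. The following is a minimal illustration, not a prescribed workflow: the data is synthetic (an assumption made so the example is self-contained), the "model" is an ordinary least-squares line fitted with NumPy, and names such as `x_clean` and `rmse` are chosen only for this example. Deployment is omitted.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# 1. Data collection: simulate 200 observations of one feature and a target.
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=200)

# 2. Cleaning: mark a few values as missing, then drop those rows.
x[[5, 50, 120]] = np.nan
mask = ~np.isnan(x)
x_clean, y_clean = x[mask], y[mask]

# 3./4. EDA and basic statistics: summary values and correlation.
print("rows kept:", x_clean.size)
print("mean of x:", x_clean.mean())
print("correlation:", np.corrcoef(x_clean, y_clean)[0, 1])

# 5./6. Feature engineering and modelling: add an intercept column,
# then fit a line by least squares on the first 80% of the rows.
X = np.column_stack([x_clean, np.ones_like(x_clean)])
split = int(0.8 * x_clean.size)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y_clean[:split], y_clean[split:]
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# 7. Evaluation: root-mean-square error on the held-out 20%.
pred = X_test @ coef
rmse = np.sqrt(np.mean((pred - y_test) ** 2))
print("slope and intercept:", coef)
print("test RMSE:", rmse)
```

In a real project each step would typically use dedicated tooling (for example pandas for cleaning and scikit-learn for modelling); the point here is only the order of the steps and the use of held-out data for evaluation.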

# Tools and Technologies Used in Data Science

Data science involves a variety of tools and technologies to collect, process, analyze, and visualize data. The choice of tools depends on the specific tasks, the nature of the data, and the preferences of the data scientists. Here are some commonly used tools in data science:

1. **Programming Languages:**
   - **Python:** Widely used for its versatility, extensive libraries (e.g., NumPy, pandas, scikit-learn), and strong support in the data science community.
   - **R:** Particularly popular for statistical analysis and data visualization.

2. **Integrated Development Environments (IDEs):**
   - **Jupyter Notebooks:** Interactive and widely used for data exploration, visualization, and analysis. Supports multiple programming languages, including Python and R.
   - **RStudio:** An IDE specifically designed for R, providing a user-friendly environment for data analysis and visualization.

3. **Big Data Tools:**
   - **Apache Spark:** A fast, in-memory data processing engine for big data processing and analysis.
   - **Hadoop:** A distributed storage and processing framework commonly used for big data analytics.

4. **Database Systems:**
   - **SQL:** The standard language for querying and managing relational databases such as PostgreSQL and MySQL.
   - **MongoDB:** A NoSQL database often used for handling unstructured data.


5. **Cloud Platforms:**
   - **AWS, Azure, Google Cloud:** Cloud platforms provide scalable infrastructure and services for data storage, processing, and analysis.

6. **Collaboration and Documentation:**
    - **GitHub, GitLab:** Platforms for hosting and collaborating on code repositories.
    - **Confluence, Jira:** Tools for documentation and project management.

# Introduction to Google Colab
Google Colab, short for Google Colaboratory, is a free, cloud-based platform provided by Google that allows users to write and execute Python code collaboratively. It provides access to Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs), which help with computation-heavy tasks such as machine learning and deep learning.

Getting started with Google Colab is straightforward. The basic steps are:

1. **Access Google Colab:**
   Open your web browser and go to [Google Colab](https://colab.research.google.com/).

2. **Sign In with Google:**
   If you are not already signed in to your Google account, you will be prompted to sign in. If you don't have a Google account, you'll need to create one.

3. **Create a New Notebook:**
   Once you're signed in, you can create a new notebook by clicking on the "New notebook" button.

4. **Writing and Executing Code:**
   Google Colab notebooks work similarly to Jupyter notebooks. You can write and execute code in cells. To run a cell, either click the **"Play"** button next to the cell or press **Shift + Enter**. Colab supports both code and text cells.

5. **Saving and Sharing Notebooks:**
   You can save your Colab notebook to Google Drive by clicking on "File" -> "Save a copy in Drive." This allows you to keep your work and share it with others.

6. **Adding Code and Text Cells:**
   You can add new cells by clicking on the "+" button above the notebook. You can choose whether the cell should be a code cell or a text cell.


7. **GPU Support:**
   Colab offers free access to GPU resources, subject to availability. You can enable GPU support by clicking on "Runtime" -> "Change runtime type" and selecting a GPU option under the "Hardware accelerator" section.
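
After changing the runtime type, it can be useful to confirm from inside the notebook that a GPU is actually attached. The sketch below is one possible check using only the standard library. It assumes an NVIDIA GPU runtime that exposes the `nvidia-smi` command-line tool; `gpu_visible` is a hypothetical helper name introduced for this example, and the check does not detect TPUs.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if `nvidia-smi` is on PATH and lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # Tool not installed: most likely a CPU-only runtime.
        return False
    try:
        # `nvidia-smi -L` prints one line per detected GPU.
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU visible:", gpu_visible())
```

If this prints `False` after you selected a GPU, the runtime may not have been restarted yet or no GPU may currently be available.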

# Introduction to NumPy

NumPy is a Python library that is widely used in data science, machine learning, and scientific computing. It provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions that operate on these arrays.

Key features of NumPy include:

1. **Arrays:** At the core of the NumPy package is the **ndarray** object, which encapsulates n-dimensional arrays of homogeneous data types. These arrays are more **efficient** than Python lists for numerical operations because their elements are stored in contiguous, typed memory and operations on them run in compiled code.

2. **Indexing and Slicing:** NumPy provides powerful indexing and slicing capabilities for accessing and manipulating data within arrays. This makes it easy to extract subsets of data or modify specific elements.

3. **Broadcasting:** NumPy allows operations between arrays of different but compatible shapes through a mechanism called broadcasting. This makes it easy to perform element-wise operations, such as adding a row vector to every row of a matrix, without explicit looping or manual reshaping.

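4. **Vectorization and Parallelization:** NumPy operations run as vectorized loops in compiled code rather than in the Python interpreter. Some routines, notably linear algebra, are delegated to optimized low-level libraries such as BLAS and LAPACK, which may use multiple CPU cores depending on how NumPy was built.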

NumPy is a foundational library in the Python data science ecosystem and is often used in conjunction with other libraries like pandas, Matplotlib, and scikit-learn for tasks such as data manipulation, analysis, and visualization.
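
The indexing, slicing, and broadcasting features listed above can be illustrated with a small array. This is a minimal sketch; the array contents and variable names (`grid`, `col_offsets`, `row_scale`) are made up for the example.

```python
import numpy as np

grid = np.arange(12).reshape(3, 4)   # 3 rows, 4 columns, values 0..11

# Indexing and slicing
element = grid[1, 2]           # single element: row 1, column 2
first_two_rows = grid[:2]      # rows 0 and 1, all columns
last_column = grid[:, -1]      # every row, last column only
evens = grid[grid % 2 == 0]    # boolean mask selects the even values

# Broadcasting: shapes (3, 4) and (4,) are compatible, so the
# 1D array is added to every row without an explicit loop.
col_offsets = np.array([10, 20, 30, 40])
shifted = grid + col_offsets

# Shapes (3, 4) and (3, 1): the single column is stretched across
# all 4 columns, so each row is multiplied by its own factor.
row_scale = np.array([[1], [2], [3]])
scaled = grid * row_scale

print(element)
print(last_column)
print(shifted)
print(scaled)
```

Broadcasting only works when the shapes are compatible: comparing dimensions from the right, each pair must be equal or one of them must be 1. Adding an array of shape `(3,)` to `grid` would raise a `ValueError`.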

### ndarrays

In [None]:
import numpy as np
arr_1d = np.array([1, 2, 3])
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])
arr_3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

In [17]:
import numpy as np
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])
print(arr_2d)

[[1 2 3]
 [4 5 6]]


Some key points about ndarray and its attributes:

1. **Axis:**
  - In NumPy, arrays can have one or more dimensions, and each dimension is referred to as an "axis."

  - Many NumPy functions allow operations to be performed along a specified axis. Common operations include `sum`, `mean`, `min`, and `max`. The `axis` parameter specifies the dimension that the operation collapses. For example, summing a 2D array along axis 0 adds the values down each column, producing one total per column; summing along axis 1 adds the values across each row, producing one total per row.

2. **Shape:**
   - The "shape" of an ndarray refers to a tuple representing the dimensions of the array. For example, a 1-dimensional array might have a shape like ```(5,)```, indicating it has 5 elements along a single axis. A 2-dimensional array might have a shape like ```(3, 4)```, indicating it has 3 rows and 4 columns.

In [22]:
import numpy as np

arr_1d = np.array([1, 2, 3])
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])
arr_3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

print("shape of arr_1d:",arr_1d.shape)
print("shape of arr_2d:",arr_2d.shape)
print("shape of arr_3d:",arr_3d.shape)
print(arr_3d.size)

shape of arr_1d: (3,)
shape of arr_2d: (2, 3)
shape of arr_3d: (2, 2, 3)
12


In [25]:
import numpy as np

arr_2d = np.array([[1, 2, 3], [4, 5, 6]])
print(arr_2d)
print("\n\n")
# Sum along axis 1: add the values within each row, giving one total per row
row_sum = np.sum(arr_2d, axis=1)
print(row_sum)  # Output: [ 6 15]

[[1 2 3]
 [4 5 6]]



[ 6 15]
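
To make the meaning of `axis` concrete, the following sketch compares both axes on the same 2D array and then checks the result shapes for a 3D array. The variable names and the 3D array are illustrative choices, not part of the original notebook.

```python
import numpy as np

arr_2d = np.array([[1, 2, 3], [4, 5, 6]])   # shape (2, 3)

# axis=0 collapses the rows: one total per column, shape (3,)
col_totals = arr_2d.sum(axis=0)

# axis=1 collapses the columns: one total per row, shape (2,)
row_totals = arr_2d.sum(axis=1)

# keepdims=True keeps the collapsed axis with length 1, shape (2, 1),
# which is convenient when the result is broadcast against arr_2d.
row_totals_2d = arr_2d.sum(axis=1, keepdims=True)

print(col_totals, row_totals, row_totals_2d.shape)

# For a 3D array, the axis that is summed disappears from the shape.
arr_3d = np.arange(24).reshape(2, 3, 4)
shape_axis0 = arr_3d.sum(axis=0).shape   # (3, 4)
shape_axis1 = arr_3d.sum(axis=1).shape   # (2, 4)
shape_axis2 = arr_3d.sum(axis=2).shape   # (2, 3)
print(shape_axis0, shape_axis1, shape_axis2)
```

A useful rule of thumb: the axis you pass is the one that is removed from the result's shape.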


### Data Types of ndarrays

Understanding data types in NumPy is crucial because the data type controls how data is stored in memory and how operations are performed on it. NumPy provides a rich set of fixed-size data types that are stored more compactly than Python's built-in numeric objects.

Here are some key data types in NumPy:

1. **int8, int16, int32, int64**: Signed integers stored in 8, 16, 32, or 64 bits, respectively.

2. **uint8, uint16, uint32, uint64**: Unsigned integers stored in 8, 16, 32, or 64 bits, respectively.

3. **float16, float32, float64**: Floating-point numbers with 16, 32, or 64 bits of precision, respectively.

4. **complex64, complex128**: Complex numbers with 64 or 128 bits of precision, where the real and imaginary parts are represented by 32 or 64-bit floating-point numbers.

5. **bool**: Boolean type storing True or False values.

6. **object**: A generic object data type. It is often used when dealing with heterogeneous data or when the elements of the array need to be arbitrary Python objects.

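7. **bytes_**: Fixed-width byte-string data type (named `string_` before NumPy 2.0).

8. **str_**: Fixed-width Unicode string data type (named `unicode_` before NumPy 2.0).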

**You can specify the data type when creating a NumPy array using the `dtype` parameter.** For example:

```python
import numpy as np
# define the data type during array creation
arr1 = np.array([1, 2, 3], dtype=np.int32)
arr2 = np.array([1, 2, 3], dtype='int32')
```
You can change the data type of an array after it has been created using the `astype` method, which returns a converted copy and leaves the original array unchanged.
```python
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
float_arr = arr.astype(np.float32)
print(arr.dtype,float_arr.dtype)
```

In [26]:
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
float_arr = arr.astype(np.float32)
print(arr.dtype,float_arr.dtype)

int64 float32
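
The choice of data type has practical consequences for memory use and for the values an array can hold. The sketch below illustrates a few of them; the arrays and variable names are invented for this example.

```python
import numpy as np

values = np.arange(1000)

# The dtype determines bytes per element and total memory.
as_int64 = values.astype(np.int64)
as_int16 = values.astype(np.int16)
print(as_int64.itemsize, as_int64.nbytes)   # 8 bytes each, 8000 in total
print(as_int16.itemsize, as_int16.nbytes)   # 2 bytes each, 2000 in total

# Converting floats to integers drops the fractional part
# (truncation toward zero), it does not round.
prices = np.array([1.9, -1.9, 2.5])
truncated = prices.astype(np.int32)
print(truncated)                            # values 1, -1, 2

# iinfo and finfo report the limits of integer and float types.
int8_info = np.iinfo(np.int8)
print(int8_info.min, int8_info.max)         # -128 127
print(np.finfo(np.float32).eps)             # smallest step above 1.0
```

Smaller types save memory but hold a narrower range of values, so it is worth checking `np.iinfo` or `np.finfo` before downcasting.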
