# <font color='#3cb371'>**Numpy and Pandas**</font> 

To start with data preprocessing and exploration, we will first review some of the objects we have covered in the previous course:

## <font color='#6a5acd'>**Review**</font> 

### Tuples

Tuples are immutable sequences, typically used to store collections of heterogeneous data.

In [1]:
my_tuple = (1, 'hello', True, 3.14)
print(my_tuple)


(1, 'hello', True, 3.14)
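To make "immutable" concrete, the short sketch below tries to change one item of the tuple above. The variable `message` is introduced here only for illustration.

```python
my_tuple = (1, 'hello', True, 3.14)

try:
    my_tuple[0] = 99          # tuples do not allow item assignment
except TypeError as err:
    message = str(err)        # keep the error text for inspection
    print('TypeError:', message)
```

The same assignment on a list would succeed, which is the practical difference between the two types.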


### Lists

Lists are mutable sequences, used to store collections of items.

In [2]:
my_list = [1, 2, 3, 4, 5]
print(my_list)


[1, 2, 3, 4, 5]


### Sets

Sets are unordered collections of unique elements.

In [3]:
my_set = {1, 2, 3, 2, 1}
print(my_set)  


{1, 2, 3}


### Dictionaries

Dictionaries are collections of key-value pairs.

In [4]:
my_dict = {'name': 'Alice', 'age': 25, 'city': 'New York'}
print(my_dict)


{'name': 'Alice', 'age': 25, 'city': 'New York'}


### Strings

Strings are sequences of characters.

In [5]:
my_string = 'Hello, world!'
print(my_string)


Hello, world!


### Methods 

Methods in Python are functions that belong to objects and can operate on the object itself.  
Python has different methods for different data types, including lists, sets, dictionaries, and strings.
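As a brief illustration, the sketch below calls one or two common methods on each of the types reviewed above. The variable names are chosen only for this example.

```python
# Methods are called with dot notation: object.method(arguments)
numbers = [3, 1, 2]
numbers.append(4)             # list method: adds an item in place
numbers.sort()                # list method: sorts the list in place

unique = {1, 2}
unique.add(3)                 # set method: adds an element

person = {'name': 'Alice'}
keys = list(person.keys())    # dict method: returns a view of the keys

greeting = 'hello'
shout = greeting.upper()      # str method: returns a new, upper-case string

print(numbers, unique, keys, shout)
```

Note that the list and set methods change the object itself, while the string method returns a new string because strings are immutable.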

<div style='text-align: center'>
    <img src='https://i.pinimg.com/736x/1e/28/08/1e2808ef46528cd4f2a3164345989505.jpg' alt='ml' title='methods' width='800' height='1000'/>
</div>

## <font color='#6a5acd'>**Introduction to NumPy and Arrays**</font> 

> `NumPy` (Numerical Python) is an essential library in the Python ecosystem, widely used for scientific computing. 
> It provides support for large, multi-dimensional `arrays` and `matrices`, along with a collection of mathematical functions to operate on these data structures.

> <font color='#FF6347'>**Note:**</font> NumPy provides the essential structure and operations for working with `arrays` in Python, offering a comprehensive and efficient framework for numerical computations and data manipulation.

### Array

An array is a data structure that stores a collection of items. Typically, these items are of the same data type.  
Arrays are fundamental in programming, providing a means to store and manipulate collections of variables efficiently.

### Why NumPy?

- `Performance`: NumPy arrays are stored more compactly than Python lists, which makes reading and writing items faster.
- `Functionality`: NumPy provides a wide range of mathematical functions to perform operations on arrays efficiently and with less code.
- `Convenience`: With NumPy, you can perform complex mathematical operations on whole arrays without writing explicit `for` loops, which makes the code cleaner and faster.

### Arrays vs. Matrices vs. Vectors

- **`Array`**: In the context of NumPy, an array can be one-dimensional (1D), two-dimensional (2D), or multi-dimensional. It's a grid of values, all of the same type.
- **`Vector`**: A vector is a one-dimensional array. It does not have the concept of rows and columns, just a single dimension of elements.
- **`Matrix`**: A matrix is a two-dimensional array. It has rows and columns. Matrices are particularly important in linear algebra, where they are used to represent systems of linear equations.

### How to install NumPy?

To install NumPy, you typically use the package manager for Python called `pip`:

```
pip install numpy
```

<font color='#FF6347'>**Note:**</font> In a Jupyter Notebook, the command is written slightly differently:

**pip**: This is typically run in a terminal to install Python packages.  
**!pip**: This is used within notebook cells; the leading `!` runs the pip command as if it were typed in a terminal.

In [6]:
!pip install numpy


Note: you may need to restart the kernel to use updated packages.


In [2]:
# Importing NumPy library
import numpy as np


### Vectors, Matrices, 3D Arrays

Vectors, matrices, and 3D arrays are fundamental structures in numerical computing and can be used in various fields and applications:

<div style='text-align: center'>
    <img src='https://www.w3resource.com/w3r_images/numpy-1d2d3d-array.png' alt='ml' title='array' width='600' height='400'/>
</div>

In [8]:
# Creating a one-dimensional NumPy array (Vector)
vector = np.array([1, 2, 3, 4, 5])

print('Vector:\n', vector)



Vector:
 [1 2 3 4 5]


In [9]:
# Creating a two-dimensional NumPy array (Matrix)
matrix = np.array([[1, 2, 3], 
                   [4, 5, 6], 
                   [7, 8, 9]])

print('\nMatrix:\n', matrix)


Matrix:
 [[1 2 3]
 [4 5 6]
 [7 8 9]]


In [10]:
# Creating a three-dimensional NumPy array
three_d_array = np.array([[[1, 2], [3, 4]], 
                          [[5, 6], [7, 8]]])

print('\nThree-dimensional array:\n', three_d_array)


Three-dimensional array:
 [[[1 2]
  [3 4]]

 [[5 6]
  [7 8]]]
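One way to check how many dimensions an array has is to inspect its `ndim` and `shape` attributes. The sketch below recreates the three arrays from the cells above so that it can run on its own.

```python
import numpy as np

vector = np.array([1, 2, 3, 4, 5])
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
three_d_array = np.array([[[1, 2], [3, 4]],
                          [[5, 6], [7, 8]]])

# ndim is the number of dimensions; shape is the size along each dimension
for name, arr in [('vector', vector), ('matrix', matrix), ('3D array', three_d_array)]:
    print(name, '-> ndim:', arr.ndim, 'shape:', arr.shape)
```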


### Exercise:

You are given the heights and weights of a group of students. Your task is to calculate the Body Mass Index (BMI).

Heights of students (in meters): 1.73, 1.68, 1.71, 1.89, 1.79  
Weights of students (in kilograms): 65.4, 59.2, 63.6, 88.4, 68.7


$$
BMI = \frac{weight\ (kg)}{(height\ (m))^2}
$$
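
As a quick check of the formula, the first student (height 1.73 m, weight 65.4 kg) works out to:

$$
BMI = \frac{65.4}{(1.73)^2} = \frac{65.4}{2.9929} \approx 21.85
$$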



In [11]:
height = [1.73, 1.68, 1.71, 1.89, 1.79]

In [12]:
weight = [65.4, 59.2, 63.6, 88.4, 68.7]

In [23]:
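# Solution (first attempt): plain Python lists do not support
# element-wise arithmetic, so this line raises a TypeError

bmi = weight / height**2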

TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'

In [13]:
height_1 = np.array(height)
height_1

array([1.73, 1.68, 1.71, 1.89, 1.79])

In [14]:
weight_1 = np.array(weight)
weight_1

array([65.4, 59.2, 63.6, 88.4, 68.7])

In [16]:
bmi = weight_1 / height_1**2
bmi

array([21.85171573, 20.97505669, 21.75028214, 24.7473475 , 21.44127836])
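
For comparison, here is a sketch of the same calculation with plain Python lists. Without NumPy, the formula has to be applied item by item, for example with `zip` and a list comprehension. The name `bmi_list` is introduced only for this example.

```python
height = [1.73, 1.68, 1.71, 1.89, 1.79]
weight = [65.4, 59.2, 63.6, 88.4, 68.7]

# Pair each weight with its height and apply the formula one student at a time
bmi_list = [w / h**2 for w, h in zip(weight, height)]

print([round(b, 2) for b in bmi_list])
```

The NumPy version expresses the same idea in a single arithmetic expression on whole arrays.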

### Exercise 

Find the answer to the following: 

In [24]:
inventory_list = [20,30,40]

inventory_array = np.array([20,30,40])

In [None]:
inventory_list + inventory_list

In [None]:
inventory_array + inventory_array

### Array Slicing

Slicing in arrays allows you to select and extract specific sections of an array.

In [25]:
my_arr = np.array([10, 20, 30, 40, 50])

In [28]:
# Element at index 2 (the third element, since indexing starts at 0)
my_arr[2]

30

In [26]:
# Slicing from index 1 to 3 (the stop index, 4, is excluded)
print(my_arr[1:4])  

[20 30 40]


In [29]:
# Slicing from the beginning to index 3
print(my_arr[:4])  

[10 20 30 40]


In [30]:
# Slicing from index 2 to the end
print(my_arr[2:])  

[30 40 50]


In [33]:
# Slicing with a step
my_arr[0:5:2]

array([10, 30, 50])

### 2D Array Slicing

For a 2D array (matrix), the slicing becomes a bit more involved as you need to consider rows and columns. 

The syntax is `[row_start:row_stop:row_step, col_start:col_stop:col_step]`.

In [34]:
# Creating a 2D array
my_arr_2d = np.array([[1, 2, 3], 
                      [4, 5, 6], 
                      [7, 8, 9]])

In [35]:
# Slicing to get the first two rows and columns
print(my_arr_2d[:2, :2])


[[1 2]
 [4 5]]


In [36]:
# Slicing to get the last row
print(my_arr_2d[-1, :])

[7 8 9]


In [40]:
# Slicing to get the first row
print(my_arr_2d[0, :])

[1 2 3]


In [37]:
# Slicing to get the middle column
print(my_arr_2d[:, 1])

[2 5 8]


In [42]:
# Boolean indexing
print(my_arr_2d[my_arr_2d > 5])

[6 7 8 9]


In [43]:
# Selects elements (0,1) and (2,2)
print(my_arr_2d[[0, 2], [1, 2]])

[2 9]
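
The examples above leave out the `step` part of the slicing syntax. The sketch below shows how a step might be used on rows and columns. The array `grid` is a stand-in with the same values as `my_arr_2d`.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6],
                 [7, 8, 9]])

# Every other row, all columns (row_step = 2)
every_other_row = grid[::2, :]
print(every_other_row)

# All rows, columns from index 1 to the end
right_columns = grid[:, 1:]
print(right_columns)

# Rows and columns in reverse order (step = -1 on both axes)
reversed_grid = grid[::-1, ::-1]
print(reversed_grid)
```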


<div style='text-align: center'>
    <img src='https://scaler.com/topics/images/use-of-2d-array-in-python.webp' alt='ml' title='array' width='600' height='400'/>
</div>

<div style='text-align: center'>
    <img src='https://scaler.com/topics/images/a-2d-array.webp' alt='ml' title='array' width='600' height='400'/>
</div>

<div style='text-align: center'>
    <img src='https://numpy.org/devdocs/_images/np_matrix_indexing.png' alt='ml' title='array' width='700' height='300'/>
</div>

## <font color='#6a5acd'>**Introduction to Pandas**</font> 

> `Pandas` is a powerful Python library used for data manipulation and analysis, providing data structures and operations to manipulate numerical tables and time series.

> Pandas is built on top of the NumPy library and offers a more intuitive interface for data manipulation than NumPy, especially for tabular data.

### Key Features of Pandas:

- Fast and efficient DataFrame object for data manipulation with integrated indexing.
- Tools for reading and writing data between in-memory data structures and different file formats.
- Data alignment and integrated handling of missing data.
- Reshaping and pivoting of datasets.
- Label-based slicing, indexing, and subsetting of large datasets.
- Data structure column insertion and deletion.
- Group by functionality to perform split-apply-combine operations on datasets.
- Merging and joining of datasets.
- Time Series functionality.

<font color='#FF6347'>**Question:**</font> How do you install Pandas? 

In [3]:
import pandas as pd


## Basic Usage of Pandas

### Pandas Series

> A Pandas Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).  
> It has both a data column and a label column, which is called the index.

```
s = pd.Series(data, index=index)
```

Here, `data` can be many different things: a Python dict, an ndarray, a scalar value, etc. The passed `index` is a list of axis labels.
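
As a small sketch of two of these cases, the block below builds one Series from a dict and one from a scalar value. The names `from_dict` and `from_scalar` are chosen only for this example.

```python
import pandas as pd

# From a dict: the keys become the index labels
from_dict = pd.Series({'a': 10, 'b': 20, 'c': 30})
print(from_dict)

# From a scalar: the value is repeated for every label in the index
from_scalar = pd.Series(5, index=['x', 'y', 'z'])
print(from_scalar)
```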

#### Key Features of Pandas Series:

- **`Homogeneous Data`**: A Series has a single data type (`dtype`) for all of its elements, which makes data manipulation efficient and robust. Mixed values are stored under the generic `object` dtype.
- **`Size Immutable`**: While the size of a Series cannot be changed, its values are mutable.
- **`Data Alignment`**: An essential feature of Series is automatic data alignment based on the index labels. It is especially useful in arithmetic operations that combine different Series objects, because values are matched by label rather than by position.
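
To illustrate data alignment, the sketch below adds two Series whose index labels only partly overlap. The data and the names `q1`, `q2`, and `total` are hypothetical.

```python
import pandas as pd

q1 = pd.Series([100, 200, 300], index=['apples', 'bananas', 'cherries'])
q2 = pd.Series([10, 20, 30], index=['bananas', 'cherries', 'dates'])

# Values are matched by label, not by position.
# Labels present in only one Series produce NaN (missing) in the result.
total = q1 + q2
print(total)
```

Here `bananas` and `cherries` are summed because both Series contain them, while `apples` and `dates` come out as missing values.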

In [15]:
# Creating a Series from a list
series_from_list = pd.Series([1, 2, 3, 4, 5])
print(series_from_list)


0    1
1    2
2    3
3    4
4    5
dtype: int64


In [16]:
# Creating a Series from a numpy array
array = np.array([1, 2, 3, 4, 5])
series_from_array = pd.Series(array)
print(series_from_array)

0    1
1    2
2    3
3    4
4    5
dtype: int64


In [17]:
# Accessing elements using index
print(series_from_list[0])  # Element with index label 0

1


In [18]:
# Slicing
print(series_from_list[:3])

0    1
1    2
2    3
dtype: int64


In [19]:
series1 = pd.Series([1, 2, 3])
series2 = pd.Series([4, 5, 6])

# Arithmetic operations
print(series1 + series2)
print(series1 * 2)

0    5
1    7
2    9
dtype: int64
0    2
1    4
2    6
dtype: int64


In [20]:
data = [10, 20, 30, 40, 50]
index = ['a', 'b', 'c', 'd', 'e']

s = pd.Series(data, index=index)
print(s)


a    10
b    20
c    30
d    40
e    50
dtype: int64
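
With a custom index, elements can be accessed either by label or by position. The sketch below uses `.loc` for labels and `.iloc` for integer positions, which keeps the two kinds of access explicit.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

by_label = s.loc['b']            # label-based access
label_slice = s.loc['b':'d']     # label slices include both endpoints
by_position = s.iloc[0]          # position-based access
position_slice = s.iloc[1:3]     # position slices exclude the stop position

print(by_label, by_position)
print(label_slice)
print(position_slice)
```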


### Creating DataFrames

DataFrames are `two-dimensional`, `size-mutable`, potentially `heterogeneous` `tabular data` structures with `labeled axes` (rows and columns).  
Here's how you can create a DataFrame:

In [45]:
# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [24, 27, 22],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)

In [46]:
df

      Name  Age         City
0    Alice   24     New York
1      Bob   27  Los Angeles
2  Charlie   22      Chicago


### Reading Data

Pandas supports various file formats like CSV, Excel, JSON, HTML, and more.

In [4]:
heart_df = pd.read_csv('heart_data.csv')
heart_df

Unnamed: 0,Age (age in year),sex,chest pain,blood pressure,cholestoral,blood sugar,electrocardiographic,heart rate,exercise induced,depression,slope,ca,thal,c
0,63,1,1,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
1,37,1,3,130.0,250.0,0.0,0.0,187.0,0.0,3.5,3.0,0.0,3.0,0
2,41,0,2,130.0,204.0,0.0,2.0,172.0,0.0,1.4,1.0,0.0,3.0,0
3,56,1,2,120.0,236.0,0.0,0.0,178.0,0.0,0.8,1.0,0.0,3.0,0
4,57,0,4,120.0,354.0,0.0,0.0,163.0,1.0,0.6,1.0,0.0,3.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
592,52,1,4,140.0,266.0,0.0,0.0,134.0,1.0,2.0,2.0,,,1
593,43,1,4,140.0,288.0,0.0,0.0,135.0,1.0,2.0,2.0,,,1
594,41,1,4,120.0,336.0,0.0,0.0,118.0,1.0,3.0,2.0,,,1
595,44,1,4,135.0,491.0,0.0,0.0,135.0,0.0,0.0,,,,1
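
The cell above reads a file from disk. As a self-contained sketch, the block below passes `pd.read_csv` an in-memory text buffer instead, so it runs without any file. The CSV content and the name `people` are hypothetical. Other formats have analogous readers, such as `pd.read_excel` and `pd.read_json`.

```python
import io
import pandas as pd

# Hypothetical CSV content, standing in for a file on disk
csv_text = """name,age,city
Alice,24,New York
Bob,27,Los Angeles
"""

# read_csv accepts a file path or any file-like object
people = pd.read_csv(io.StringIO(csv_text))

print(people.shape)              # (number of rows, number of columns)
print(people.columns.tolist())   # column names taken from the header line
```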


### Data Exploration

Pandas provides several functions to perform basic data exploration:

In [50]:
df.head()  # Displays the first 5 rows by default (here, all 3 rows)


      Name  Age         City
0    Alice   24     New York
1      Bob   27  Los Angeles
2  Charlie   22      Chicago


In [51]:
df.tail()  # Displays the last 5 rows by default (here, all 3 rows)


      Name  Age         City
0    Alice   24     New York
1      Bob   27  Los Angeles
2  Charlie   22      Chicago


In [52]:
df.describe()  # Shows a summary of statistics for numerical columns


             Age
count   3.000000
mean   24.333333
std     2.516611
min    22.000000
25%    23.000000
50%    24.000000
75%    25.500000
max    27.000000


In [53]:
df.info()  # Provides a concise summary of the DataFrame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    3 non-null      object
 1   Age     3 non-null      int64 
 2   City    3 non-null      object
dtypes: int64(1), object(2)
memory usage: 200.0+ bytes
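
A few related attributes and arguments are often useful alongside these functions. The sketch below rebuilds the small DataFrame from above and shows `shape`, `columns`, `dtypes`, and the optional row count accepted by `head` and `tail`. The names `first_two` and `last_one` are introduced only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [24, 27, 22],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

print(df.shape)              # (number of rows, number of columns)
print(df.columns.tolist())   # column labels
print(df.dtypes)             # data type of each column

first_two = df.head(2)       # head and tail accept the number of rows to show
last_one = df.tail(1)
print(first_two)
print(last_one)
```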


### Selecting a Column:

In [55]:
ages = df['Age']
ages


0    24
1    27
2    22
Name: Age, dtype: int64

In [56]:
type(ages)

pandas.core.series.Series

In [58]:
subset = df[['Name', 'Age']]
subset


      Name  Age
0    Alice   24
1      Bob   27
2  Charlie   22


In [59]:
type(subset)

pandas.core.frame.DataFrame

In [61]:
young_people = df[df['Age'] < 25]
young_people

      Name  Age      City
0    Alice   24  New York
2  Charlie   22   Chicago
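
Row filters like the one above can also combine several conditions. The sketch below assumes the same small DataFrame. The names `young_not_ny` and `names_of_young` are chosen only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [24, 27, 22],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
young_not_ny = df[(df['Age'] < 25) & (df['City'] != 'New York')]
print(young_not_ny)

# .loc filters rows by a condition and selects a column by label in one step
names_of_young = df.loc[df['Age'] < 25, 'Name']
print(names_of_young)
```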


In [62]:
df

      Name  Age         City
0    Alice   24     New York
1      Bob   27  Los Angeles
2  Charlie   22      Chicago


In [64]:
df['Age'] = df['Age'] * 1.1
df

      Name   Age         City
0    Alice  26.4     New York
1      Bob  29.7  Los Angeles
2  Charlie  24.2      Chicago


### Applications of Pandas: 

- **`Data Cleaning`**: Handling missing data, dropping or filling missing values, and cleaning messy data.
- **`Data Analysis`**: Analyzing large datasets, performing statistical analysis, and generating insights from data.
- **`Data Visualization`**: Although Pandas is not primarily built for visualization, it provides basic plotting capabilities and can be integrated with libraries like Matplotlib and Seaborn for more advanced visualizations.
- **`Machine Learning`**: Preparing and manipulating data for use in machine learning models.
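
As a minimal sketch of the data cleaning item above, the block below counts, drops, and fills missing values. The DataFrame `readings` and its values are hypothetical and are not taken from `heart_data.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value (NaN)
readings = pd.DataFrame({
    'patient': ['p1', 'p2', 'p3'],
    'heart_rate': [72.0, np.nan, 88.0],
})

# Count the missing values in each column
missing_counts = readings.isna().sum()
print(missing_counts)

# Option 1: drop every row that contains a missing value
dropped = readings.dropna()

# Option 2: fill the missing values, here with the column mean
filled = readings.fillna({'heart_rate': readings['heart_rate'].mean()})

print(dropped)
print(filled)
```

Whether to drop or fill depends on the dataset. Filling with the mean is only one of several possible choices.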