## NumPy: 

### Setting up NumPy

`import numpy`: This tells Python to load the NumPy library, making its functions and classes available in the script.

`as np`: This renames the namespace to np. It's a mechanism that allows you to refer to numpy with the shorter np prefix, reducing the amount of typing required for namespace specification and improving code readability.

The sole reason that numpy is imported as np is convention. You are free to use another alias but it's not recommended as this is what you will find everywhere and it's better to stick to standards.

In [1]:
import numpy as np

In [2]:
np.__version__

'2.1.1'

### Moving from lists to arrays

* Think of an array as a row of mailboxes at an apartment building. Each mailbox is the same size and holds mail (data) for each apartment (element in the array). You can easily find who the mail belongs to because the mailboxes are numbered in order. But, every mailbox has to hold the same kind of thing, like only letters or only packages, not both.

* A list in Python is like a big bag where you can put anything you want – letters, packages, a basketball – and you can keep adding more things or take some out. It's super flexible, but because the bag can hold all sorts of stuff, Python has to check what each item is before working with it, which slows things down when you process lots of numbers.

* So, while an array is like a neatly organized row of same-sized mailboxes, a list is more like a big, mixed bag of goodies. They both hold stuff, but an array is more about being tidy and quick to find things when they're all the same type, and a list is about being able to hold anything you want, anytime.

In [3]:
%timeit pythonList = [i for i in range(10000)]
%timeit npList = np.arange(10000)

149 μs ± 154 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
1.78 μs ± 108 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)


Python lists are versatile, allowing for a mix of different data types and can adjust in size. However, this flexibility can slow them down for number-heavy tasks. NumPy arrays, on the other hand, require all elements to be the same type, which makes them more efficient for storage and faster for calculations.
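
To make the trade-off concrete, here is a minimal sketch (the variable names are illustrative and not part of the original notebook). It doubles every value in a list and in an array, then shows how NumPy coerces mixed input to a single type.

```python
import numpy as np

numbers = [1, 2, 3, 4]

# A list needs an explicit loop (here a comprehension) for element-wise math
doubled_list = [x * 2 for x in numbers]

# An array applies the operation to every element at once
doubled_array = np.array(numbers) * 2

print(doubled_list)    # [2, 4, 6, 8]
print(doubled_array)   # [2 4 6 8]

# Mixed inputs are coerced to one common type: the integers become floats
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)     # float64
```

Note that `numbers * 2` on a plain list would repeat the list rather than double its values, which is one reason the two types are not interchangeable for numeric work.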

### nd-array


* The primary reason that NumPy is fast is the ndarray type that it uses to store and manipulate data.

* An ndarray is a generic multidimensional container for homogeneous data. It provides arithmetic operations and broadcasting capabilities.

* Two attributes you will use constantly are shape and dtype. shape is a tuple giving the size of the array along each dimension, and dtype gives the data type of its elements.

* The dtype can also be specified explicitly when creating the array, giving you fine-grained control over how the data is stored.

Let's create a NumPy array from a Python list. This is possible by passing the list as input to the np.array function.

In [4]:
# Creating an integer array.
nparray = np.array([1, 2, 3])

In [5]:
nparray

array([1, 2, 3])

In [6]:
# Datatype of the array.
nparray.dtype

dtype('int64')

In [7]:
# Creating an integer array with an explicit dtype (optional here, since NumPy infers an integer type)
int_array = np.array([1, 2, 3], dtype=np.int64)

In [8]:
int_array

array([1, 2, 3])

In [9]:
# Create a 2D array
original_array = np.array([[1, 2, 3],
                           [4, 5, 6]])

In [10]:
original_array

array([[1, 2, 3],
       [4, 5, 6]])
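
The point about controlling the dtype can be sketched as follows. This is only an illustration: the choice of float32 and the sample values are assumptions, not something the original notebook uses.

```python
import numpy as np

# Request a smaller float type explicitly
small = np.array([1, 2, 3], dtype=np.float32)
print(small.dtype, small.itemsize)             # float32 4  (4 bytes per element)

# astype returns a converted copy; the original array is left untouched
as_float64 = small.astype(np.float64)
print(as_float64.dtype, as_float64.itemsize)   # float64 8

# Converting floats to integers truncates toward zero
truncated = np.array([1.9, -1.9]).astype(np.int64)
print(truncated)                               # [ 1 -1]
```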

### Array Dimension and Shapes

* A one-dimensional array is like a list/vector, a two-dimensional array is akin to a matrix, and so on. Get the number of dimensions using array_name.ndim.

* Array shape specifies the number of elements along each dimension. It is represented as a tuple of integers. Array size is the total number of elements, that is, the product of the entries in the shape (rows × columns for a 2D array). You can get them by using array_name.shape and array_name.size.

In [11]:
# Creating a 1D array (Vector)
arr_1d = np.array([1, 2, 3])
# Dimension: 1, Shape: (3,), Size: 3
print(f"Dimension (1D): {arr_1d.ndim}, Shape: {arr_1d.shape}, size: {arr_1d.size}")

# Creating a 2D array (Matrix)
# Dimension: 2, Shape: (2, 3), Size: 6
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])
print(f"Dimension (2D): {arr_2d.ndim}, Shape: {arr_2d.shape}, size: {arr_2d.size}")

Dimension (1D): 1, Shape: (3,), size: 3
Dimension (2D): 2, Shape: (2, 3), size: 6


### Creating NumPy Arrays


#### arange

* arange generates an array of evenly spaced numbers. With a single argument n, it returns the integers from 0 up to n - 1.

* With more arguments, it creates an array of regularly spaced values from start up to, but not including, stop, using the specified step size:
`np.arange(start, stop, step, dtype=None)`

In [12]:
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [13]:
# Create an array of values from 0 to 9 with a step size of 2
np.arange(0, 10, 2)

array([0, 2, 4, 6, 8])

#### linspace

* linspace returns a set of linearly spaced values within the range passed as input.
* The start value, the end value, and the number of values required are passed as input.
* In other words, it returns an array with the requested number of values over the specified interval. Unlike arange, the end value is included by default.

In [14]:
# Create an array of 10 equally spaced values from 0 to 1
np.linspace(0, 1, 10)

array([0.        , 0.11111111, 0.22222222, 0.33333333, 0.44444444,
       0.55555556, 0.66666667, 0.77777778, 0.88888889, 1.        ])

In [15]:
np.linspace(0, 10, 5)

array([ 0. ,  2.5,  5. ,  7.5, 10. ])
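
The difference between the two functions is easiest to see side by side. The following sketch uses arbitrary example ranges: with arange you pick the step, with linspace you pick the number of points.

```python
import numpy as np

# arange: you choose the step; the stop value itself is excluded
stepped = np.arange(0, 1, 0.25)
print(stepped)                      # [0.   0.25 0.5  0.75]

# linspace: you choose the number of points; the stop value is included
spaced = np.linspace(0, 1, 5)
print(spaced)                       # [0.   0.25 0.5  0.75 1.  ]

# endpoint=False drops the stop value; retstep=True also returns the spacing
values, step = np.linspace(0, 1, 4, endpoint=False, retstep=True)
print(values, step)                 # [0.   0.25 0.5  0.75] 0.25
```

When the step is a float, linspace is usually the safer choice, because rounding can make the length of an arange result hard to predict.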

#### zeros and ones

* zeros creates an array filled with zeros. The parameter passed as input is the shape of the required array: an integer for a 1D array, or a tuple for more dimensions. The elements are floats by default.
* ones creates an array filled with ones. It takes the same shape parameter.

In [16]:
np.zeros(5)

array([0., 0., 0., 0., 0.])

In [17]:
# Create a 3x3 array filled with zeros
np.zeros((3, 3))

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

In [18]:
np.ones(5)

array([1., 1., 1., 1., 1.])

In [19]:
# Create a 2x4 array filled with ones
np.ones((2, 4))

array([[1., 1., 1., 1.],
       [1., 1., 1., 1.]])

#### zeros_like

* np.zeros_like creates a new array filled with zeros but with the same shape and data type as an existing array.

In [20]:
np.zeros_like(np.arange(5))

array([0, 0, 0, 0, 0])

In [21]:
# Create a new array filled with zeros, 
# matching the shape and data type of the original array
zeros_array = np.zeros_like(original_array)
zeros_array

array([[0, 0, 0],
       [0, 0, 0]])

#### ones_like

* np.ones_like is similar to np.zeros_like, but it creates a new array filled with ones instead of zeros.

In [22]:
np.ones_like(np.arange(5))

array([1, 1, 1, 1, 1])

In [23]:
# Create a new array filled with ones, 
# matching the shape and data type of the original array
ones_array = np.ones_like(original_array)
ones_array

array([[1, 1, 1],
       [1, 1, 1]])
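
The outputs above show floats for zeros and ones but integers for the `_like` variants. A short sketch of where each dtype comes from, using a made-up `template` array:

```python
import numpy as np

# zeros and ones default to float64 unless a dtype is given
default_zeros = np.zeros(3)
int_zeros = np.zeros(3, dtype=np.int64)
print(default_zeros.dtype, int_zeros.dtype)   # float64 int64

# The _like functions inherit both shape and dtype from the template
template = np.array([[1.5, 2.5], [3.5, 4.5]])
like = np.ones_like(template)
print(like.shape, like.dtype)                 # (2, 2) float64

# The inherited dtype can still be overridden when needed
flags = np.zeros_like(template, dtype=bool)
print(flags.dtype)                            # bool
```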

### Array Indexing and Slicing

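* Array Indexing: Refers to accessing individual elements within a NumPy array. Indices start at 0, and negative indices count from the end.

* Array Slicing: This allows you to extract specific portions of an array with `start:stop:step` notation, where `stop` is excluded. A basic slice is a view onto the original data rather than an independent copy, so writing to the slice also changes the original array.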

In [24]:
# Creating a NumPy array
arr = np.array([10, 20, 30, 40, 50])
print("Original array:", arr)

# Accessing individual elements
first_element = arr[0]  # Access the first element
print("First element:", first_element)

# Accessing elements using negative indices
last_element = arr[-1]  # Access the last element
print("Last element:", last_element)

# Creating a 2D NumPy array
arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("\nOriginal 2D array:\n", arr_2d)

# Accessing an individual element in 2D array
element_row_0_col_1 = arr_2d[0, 1]  # Element at row 0, column 1
print("Element at row 0, column 1:", element_row_0_col_1)


Original array: [10 20 30 40 50]
First element: 10
Last element: 50

Original 2D array:
 [[1 2 3]
 [4 5 6]
 [7 8 9]]
Element at row 0, column 1: 2


In [25]:
# Creating a 1D NumPy array
arr = np.array([10, 20, 30, 40, 50])
print("Original array:", arr)

# Slicing the array to create a new array
sliced_array = arr[1:4]  # Slice from index 1 up to, but not including, index 4
print("Sliced array from index 1 to 3:", sliced_array)

# Slicing with a step of 2
sliced_array = arr[0::2]  # Start at index 0, step by 2
print("Array with elements at every 2nd position:", sliced_array)

# Slicing with a negative start index
last_two = arr[-2:]  # Access the last two elements
print("Last two elements of the array:", last_two)

# Boolean indexing: select elements greater than 30
sliced_array = arr[arr > 30]  # Result: [40, 50]
print("Elements greater than 30:", sliced_array)

# Creating a 2D NumPy array
arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("\nOriginal 2D array:\n", arr_2d)

# Slicing along rows and columns
sliced_array = arr_2d[1:3, 0:2] # Slice a 2x2 subarray
print("2x2 subarray from the 2D array:\n", sliced_array)

Original array: [10 20 30 40 50]
Sliced array from index 1 to 3: [20 30 40]
Array with elements at every 2nd position: [10 30 50]
Last two elements of the array: [40 50]
Elements greater than 30: [40 50]

Original 2D array:
 [[1 2 3]
 [4 5 6]
 [7 8 9]]
2x2 subarray from the 2D array:
 [[4 5]
 [7 8]]


### Array Operations

#### Element-wise Operations

Element-wise operations apply a given operation to each element in the array independently. You can add, subtract, multiply, and divide arrays of the same shape, and each result element is computed from the elements at the same position.

#### Broadcasting

NumPy also allows operations between arrays of different shapes, which is called broadcasting. Conceptually, the smaller array is stretched to match the shape of the larger one, without actually copying its data, so that the operation can proceed element by element. This works when, comparing shapes from the last dimension backwards, each pair of dimensions is either equal or one of them is 1.

In [26]:
# Creating 1D NumPy arrays
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
scalar = 2

# Addition
result_add = arr1 + arr2
print("Addition of arr1 and arr2:", result_add)  # Output: [5, 7, 9]

# Multiplication (element-wise)
result_mul = arr1 * arr2
print("Element-wise multiplication of arr1 and arr2:", result_mul)  # Output: [4, 10, 18]

# Subtraction
result_sub = arr1 - arr2
print("Subtraction of arr2 from arr1:", result_sub)  # Output: [-3, -3, -3]

# Division
result_div = arr1 / arr2
print("Division of arr1 by arr2:", result_div)  # Output: [0.25, 0.4, 0.5]

# Creating 2D NumPy arrays
matrix1 = np.array([[1, 2], [3, 4]])
matrix2 = np.array([[5, 6], [7, 8]])

# Element-wise multiplication (NOT matrix multiplication)
result_mul_2d = matrix1 * matrix2
print("Element-wise multiplication of matrix1 and matrix2:\n", result_mul_2d)

# Actual Matrix Multiplication using np.dot
matrix_multiplication = np.dot(matrix1, matrix2)
print("Matrix multiplication of matrix1 and matrix2 using np.dot:\n", matrix_multiplication)

# Broadcasting: Multiply array by a scalar
result_broadcast = arr1 * scalar
print("Multiplying arr1 by a scalar (2):", result_broadcast)  # Output: [2, 4, 6]

Addition of arr1 and arr2: [5 7 9]
Element-wise multiplication of arr1 and arr2: [ 4 10 18]
Subtraction of arr2 from arr1: [-3 -3 -3]
Division of arr1 by arr2: [0.25 0.4  0.5 ]
Element-wise multiplication of matrix1 and matrix2:
 [[ 5 12]
 [21 32]]
Matrix multiplication of matrix1 and matrix2 using np.dot:
 [[19 22]
 [43 50]]
Multiplying arr1 by a scalar (2): [2 4 6]
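
The cell above only broadcasts a scalar. As a further sketch (the arrays are made up for illustration), here is broadcasting between a 2D array and a row, a column, and an incompatible shape, followed by the `@` operator as an alternative spelling of matrix multiplication.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])          # shape (2, 3)
row = np.array([10, 20, 30])            # shape (3,)
column = np.array([[100], [200]])       # shape (2, 1)

# The row is repeated (virtually) for each of the 2 rows
plus_row = matrix + row
print(plus_row)
# [[11 22 33]
#  [14 25 36]]

# The column is repeated (virtually) across the 3 columns
plus_column = matrix + column
print(plus_column)
# [[101 102 103]
#  [204 205 206]]

# Shapes (2, 3) and (2,) are incompatible: 3 != 2 and neither is 1
try:
    matrix + np.array([1, 2])
except ValueError as err:
    print("Broadcast failed:", err)

# The @ operator performs matrix multiplication: (2, 3) @ (3,) -> (2,)
product = matrix @ row
print(product)                          # [140 320]
```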


#### Append and Delete

* To append to an array in NumPy, you can use the np.append() function. It adds values to the end of an array; if no axis is given, both inputs are flattened first.

* Keep in mind that np.append() returns a new array with the appended elements; it does not modify the original array. The same is true of np.concatenate(). NumPy arrays have a fixed size, so to "grow" an array you assign the returned array back to a variable. Assignment to an index or slice (for example arr[0] = 10) is what changes existing elements in place.

* We can use np.delete to remove items from an array. It also returns a new array and leaves the original unchanged.

In [27]:
# Create an array
original_array = np.array([1, 2, 3])
print("Original array:", original_array)

# np.append returns a new array, so assign the result back to the name
original_array = np.append(original_array, [4, 5, 6])
print("Array after appending [4, 5, 6]:", original_array)

# Create a NumPy array
arr = np.array([1, 2, 3, 4, 5])
print("\nNew array for deletion operations:", arr)

# Remove the item at index 2 (value 3)
new_arr = np.delete(arr, 2)
print("Array after removing element at index 2:", new_arr)

# Create a 2D NumPy array
arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("\nOriginal 2D array:\n", arr_2d)

# Remove the second row (index 1)
new_arr_2d = np.delete(arr_2d, 1, axis=0)
print("2D array after removing second row:\n", new_arr_2d)

Original array: [1 2 3]
Array after appending [4, 5, 6]: [1 2 3 4 5 6]

New array for deletion operations: [1 2 3 4 5]
Array after removing element at index 2: [1 2 4 5]

Original 2D array:
 [[1 2 3]
 [4 5 6]
 [7 8 9]]
2D array after removing second row:
 [[1 2 3]
 [7 8 9]]
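
To confirm that neither function touches its input, and to see what the axis argument changes, consider this small sketch (array names are illustrative):

```python
import numpy as np

base = np.array([1, 2, 3])
extended = np.append(base, [4, 5])
print(base)        # [1 2 3]  (unchanged)
print(extended)    # [1 2 3 4 5]

grid = np.array([[1, 2], [3, 4]])

# Without axis, both inputs are flattened first
flat = np.append(grid, [[5, 6]])
print(flat)                                 # [1 2 3 4 5 6]

# With axis=0, the new row must match the existing column count
with_row = np.append(grid, [[5, 6]], axis=0)
print(with_row)
# [[1 2]
#  [3 4]
#  [5 6]]

# np.delete also returns a new array; axis=1 removes a column
without_first_column = np.delete(grid, 0, axis=1)
print(without_first_column)
# [[2]
#  [4]]
```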


#### Aggregation Functions and ufuncs

NumPy provides built-in functions for common aggregation operations on arrays, including mean, sum, minimum, maximum, etc. NumPy also provides universal functions (ufuncs) that operate element-wise on arrays, including mathematical, trigonometric, and exponential functions.

In [28]:
import numpy as np

# Creating a NumPy array
arr = np.array([1, 2, 3, 4, 5])
print("Original array:", arr)

# Aggregation functions
mean_value = np.mean(arr)
print("Mean value:", mean_value)

median_value = np.median(arr)
print("Median value:", median_value)

variance = np.var(arr)
print("Variance:", variance)

standard_deviation = np.std(arr)
print("Standard deviation:", standard_deviation)

sum_value = np.sum(arr)
print("Sum of all elements:", sum_value)

min_value = np.min(arr)
print("Minimum value:", min_value)

max_value = np.max(arr)
print("Maximum value:", max_value)

# Universal functions
sqrt_arr = np.sqrt(arr)
print("\nSquare root of each element:", sqrt_arr)

exp_arr = np.exp(arr)
print("Exponential of each element:", exp_arr)

Original array: [1 2 3 4 5]
Mean value: 3.0
Median value: 3.0
Variance: 2.0
Standard deviation: 1.4142135623730951
Sum of all elements: 15
Minimum value: 1
Maximum value: 5

Square root of each element: [1.         1.41421356 1.73205081 2.         2.23606798]
Exponential of each element: [  2.71828183   7.3890561   20.08553692  54.59815003 148.4131591 ]
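
On 2D data, aggregations usually need an axis. The sketch below uses a made-up `scores` array to show the axis argument, and also notes that np.var and np.std use the population formula unless ddof is set.

```python
import numpy as np

scores = np.array([[1, 2, 3],
                   [4, 5, 6]])

total = np.sum(scores)
per_column = np.sum(scores, axis=0)
per_row = np.sum(scores, axis=1)
print(total)                     # 21        (all elements)
print(per_column)                # [5 7 9]   (one value per column)
print(per_row)                   # [ 6 15]   (one value per row)
print(scores.mean(axis=1))       # [2. 5.]

# np.var and np.std default to the population formula (ddof=0);
# ddof=1 gives the sample estimate, dividing by n - 1 instead of n
arr = np.array([1, 2, 3, 4, 5])
population_var = np.var(arr)
sample_var = np.var(arr, ddof=1)
print(population_var, sample_var)   # 2.0 2.5
```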


#### Reshaping Arrays

* You can change the shape of an array without changing its data using the reshape method, say to flatten an array (convert it to a 1D array) or to change it to a higher-dimensional array (e.g., from 1D to 2D or 2D to 3D).

* The reshape method can be particularly useful when you need to prepare data for various operations, such as matrix multiplication, convolution, or displaying images.

* Pass the target shape to the reshape method. The total number of elements in the original array must match the total number of elements in the new shape. In other words, the product of the dimensions in the new shape should be equal to the total number of elements in the original array. NumPy will raise an error if this condition is not met.

In [29]:
import numpy as np

# Creating a 2D array
arr_2d = np.array([[1, 2, 3], [4, 5, 6]])
print("Original 2D array:\n", arr_2d)

# Reshaping the 2D array to a 1D array
arr_1d = arr_2d.reshape(6)
print("\nConverted to 1D array:", arr_1d)

# Creating a 1D array
arr_1d = np.array([1, 2, 3, 4, 5, 6])
print("\nOriginal 1D array:", arr_1d)

# Reshaping the 1D array back to a 2D array
arr_2d = arr_1d.reshape(2, 3)
print("\nReshaped back to 2D array:\n", arr_2d)

Original 2D array:
 [[1 2 3]
 [4 5 6]]

Converted to 1D array: [1 2 3 4 5 6]

Original 1D array: [1 2 3 4 5 6]

Reshaped back to 2D array:
 [[1 2 3]
 [4 5 6]]
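
Two details of reshape are worth a quick sketch: the special value -1, which lets NumPy infer one dimension, and the error raised when the element counts do not match. The 12-element array here is just an example.

```python
import numpy as np

arr = np.arange(12)

# -1 asks NumPy to infer that dimension: 12 elements / 3 rows = 4 columns
inferred = arr.reshape(3, -1)
print(inferred.shape)                # (3, 4)
print(arr.reshape(2, 2, 3).shape)    # (2, 2, 3)

# reshape(-1) flattens an array of any shape
flattened = inferred.reshape(-1)
print(flattened.shape)               # (12,)

# 12 elements cannot fill a 5-row layout, so this raises ValueError
reshape_failed = False
try:
    arr.reshape(5, -1)
except ValueError as err:
    reshape_failed = True
    print("Reshape failed:", err)
```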


#### Combining Arrays

NumPy provides the np.concatenate() function to join arrays along a specified axis. Stacking arrays can be done using functions like np.vstack() (vertical stacking) and np.hstack() (horizontal stacking).

* Concatenation directly joins two arrays end-to-end.
* Vertical stacking places one array on top of the other, creating a 2D array if starting with 1D arrays.
* Horizontal stacking places arrays side by side, similar to concatenation for 1D arrays but is also applicable for 2D arrays to combine them column-wise.

In [30]:
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])

# Concatenate 1D arrays end-to-end (axis 0 is their only axis)
combined = np.concatenate((arr1, arr2))
print("Concatenated array:", combined)  # Result: [1, 2, 3, 4, 5, 6]

# Vertical stacking
vertical_stack = np.vstack((arr1, arr2))
print("Vertically stacked:\n", vertical_stack)  # Result: [[1, 2, 3], [4, 5, 6]]

# Horizontal stacking
horizontal_stack = np.hstack((arr1, arr2))
print("Horizontally stacked:", horizontal_stack)  # Result: [1, 2, 3, 4, 5, 6]


Concatenated array: [1 2 3 4 5 6]
Vertically stacked:
 [[1 2 3]
 [4 5 6]]
Horizontally stacked: [1 2 3 4 5 6]
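
With 1D inputs the axis argument has little to show. A sketch with two small 2D arrays (made up for illustration) makes the relationship between concatenate, vstack, and hstack clearer.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

# axis=0 stacks rows (same result as np.vstack)
rows_joined = np.concatenate((a, b), axis=0)
print(rows_joined.shape)                      # (4, 2)

# axis=1 joins columns (same result as np.hstack)
columns_joined = np.concatenate((a, b), axis=1)
print(columns_joined)
# [[1 2 5 6]
#  [3 4 7 8]]

print(np.array_equal(np.vstack((a, b)), rows_joined))      # True
print(np.array_equal(np.hstack((a, b)), columns_joined))   # True
```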


#### Splitting Arrays

* Splitting arrays is the opposite of combining them. It’s the process of breaking a single array into multiple smaller arrays. NumPy provides np.split(), np.hsplit(), and np.vsplit() functions for this purpose.

* In the np.split() function, the first argument is the array to be split, and the second argument is either the number of equal parts or a list of indices at which to split the array. The result is a list of arrays, each representing a part of the original array.

* np.hsplit() splits an array horizontally (column-wise); for a 1D array it simply splits along the only axis. np.vsplit() splits an array vertically (row-wise) and requires the array to be at least two-dimensional.

In [31]:
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])

# Split into three equal parts
split_arr = np.split(arr, 3) 
print("Array split into three equal parts:", split_arr)

Array split into three equal parts: [array([1, 2]), array([3, 4]), array([5, 6])]


In [32]:
# Creating a 2D array for demonstration
arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Horizontal split - splits into 3 arrays along columns
hsplit_arr = np.hsplit(arr_2d, 3)
print("Horizontally split array:")
for i, arr in enumerate(hsplit_arr):
    print(f"Part {i+1}:\n", arr)

# Reusing the same 2D array for the vertical split
# Note: np.vsplit() requires the array to be at least two-dimensional

# Vertical split - splits into 3 arrays along rows
vsplit_arr = np.vsplit(arr_2d, 3)
print("\nVertically split array:")
for i, arr in enumerate(vsplit_arr):
    print(f"Part {i+1}:\n", arr)

Horizontally split array:
Part 1:
 [[1]
 [4]
 [7]]
Part 2:
 [[2]
 [5]
 [8]]
Part 3:
 [[3]
 [6]
 [9]]

Vertically split array:
Part 1:
 [[1 2 3]]
Part 2:
 [[4 5 6]]
Part 3:
 [[7 8 9]]
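
The examples above split into equal parts only. As a sketch of the other cases, here is a split at explicit indices, plus np.array_split, which tolerates parts of unequal length (the 7-element array is illustrative):

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7])

# A list of indices splits before positions 2 and 5
parts = np.split(arr, [2, 5])
print(parts)   # three arrays: [1 2], [3 4 5], [6 7]

# np.split requires an even division; 7 elements into 3 parts fails
split_failed = False
try:
    np.split(arr, 3)
except ValueError as err:
    split_failed = True
    print("Split failed:", err)

# np.array_split allows unequal parts instead
uneven = np.array_split(arr, 3)
print([len(p) for p in uneven])   # [3, 2, 2]
```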


#### Alias vs. View vs. Copy of Arrays

* Alias: An alias is another variable name for the same NumPy array object. Both names share the same data in memory, so a change made through the alias also shows up through the original name.

* View: The .view() method creates a new array object that looks at the same data as the original array but does not share the same identity. It provides a way to view the data differently or with different data types, but it still operates on the same underlying data.

* Copy: A copy is a completely independent duplicate of a NumPy array. It has its own data in memory, and changes made to the copy will not affect the original array, and vice versa.

In [33]:
original_arr = np.array([1, 2, 3])
print("Original array:", original_arr)

# Alias of original array
alias_arr = original_arr
# No change yet, so no print statement needed here for aliasing demonstration

# Changes to view_arr will affect the original array
view_arr = original_arr.view()
# Making a change to view_arr to show its effect
view_arr[0] = 10
print("After modifying view_arr, original array changes to:", original_arr)

# Reset original array for clarity in demonstration
original_arr[0] = 1
print("Original array reset to:", original_arr)

# Changes to copy_arr won't affect the original array
copy_arr = original_arr.copy()
# Making a change to copy_arr to show it doesn't affect original_arr
copy_arr[0] = 10
print("After modifying copy_arr, original array remains unchanged:", original_arr)
print("Modified copy array:", copy_arr)

# Now, demonstrating aliasing by modifying alias_arr
alias_arr[1] = 20
print("After modifying alias_arr, original array reflects the change:", original_arr)

Original array: [1 2 3]
After modifying view_arr, original array changes to: [10  2  3]
Original array reset to: [1 2 3]
After modifying copy_arr, original array remains unchanged: [1 2 3]
Modified copy array: [10  2  3]
After modifying alias_arr, original array reflects the change: [ 1 20  3]
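
The three cases can also be told apart without modifying anything. This sketch (names are illustrative) checks object identity with `is`, memory sharing with np.shares_memory, and the `.base` attribute.

```python
import numpy as np

original = np.array([1, 2, 3])
alias = original
view = original.view()
copy = original.copy()

# Only the alias is the very same Python object
print(alias is original, view is original, copy is original)   # True False False

# The alias and the view use the original's memory; the copy does not
print(np.shares_memory(original, alias))   # True
print(np.shares_memory(original, view))    # True
print(np.shares_memory(original, copy))    # False

# A view records the array that owns its data in .base; a copy owns its own
print(view.base is original, copy.base is None)   # True True
```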


#### Sorting Numpy Arrays

np.sort(array) returns a sorted copy of the array in ascending order. It has no option for descending order, so the usual trick is to reverse the sorted result with the slice [::-1]. np.argsort returns the indices that would sort the array.

In [34]:
data = np.array([3, 1, 5, 2, 4])
print("Original data:", data)

# Sorting the data in ascending order
sorted_data = np.sort(data)  # Ascending order
print("Data sorted in ascending order:", sorted_data)

# Sorting the data in descending order
reverse_sorted_data = np.sort(data)[::-1]  # Descending order
print("Data sorted in descending order:", reverse_sorted_data)

# Returning indices that would sort the array
sorted_indices = np.argsort(data)
print("Indices that would sort the array:", sorted_indices)

# Demonstrating how to use sorted_indices to sort the array
sorted_data_with_indices = data[sorted_indices]
print("Data sorted using indices:", sorted_data_with_indices)

Original data: [3 1 5 2 4]
Data sorted in ascending order: [1 2 3 4 5]
Data sorted in descending order: [5 4 3 2 1]
Indices that would sort the array: [1 3 0 4 2]
Data sorted using indices: [1 2 3 4 5]
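
A few related variations, sketched under the assumption that the data is numeric (the negation trick below does not apply to strings): sorting in place, getting descending order through argsort, and sorting a 2D array along an axis.

```python
import numpy as np

data = np.array([3, 1, 5, 2, 4])

# np.sort returns a sorted copy; the method form sorts in place
in_place = data.copy()
in_place.sort()
print(in_place)                   # [1 2 3 4 5]

# Descending order via argsort on the negated values
descending = data[np.argsort(-data)]
print(descending)                 # [5 4 3 2 1]

# For 2D input, axis selects what gets sorted (default: the last axis)
grid = np.array([[3, 1, 2],
                 [9, 7, 8]])
rows_sorted = np.sort(grid, axis=1)
print(rows_sorted)
# [[1 2 3]
#  [7 8 9]]
```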


### NumPy for Data Cleaning

#### 1. Identifying Missing Values

NumPy provides functions to check for missing values in a numeric array, represented as NaN (Not a Number).

In [35]:
# Create a NumPy array with missing values
data = np.array([1, 2, np.nan, 4, np.nan, 6])
print("Data with missing values:", data)

# Check for missing values
has_missing = np.isnan(data)
print("Missing values in the array:", has_missing)

Data with missing values: [ 1.  2. nan  4. nan  6.]
Missing values in the array: [False False  True False  True False]


#### 2. Removing Rows or Columns with Missing Values

np.isnan returns a boolean matrix with True wherever a value is missing. Passing it to np.any with axis=1 collapses each row to a single value, giving a 1D array that is True for every row containing at least one missing value. Finally, we negate that array with ~ (not) and use it to index the original matrix, which keeps only the rows without missing values.

In [36]:
# Create a 2D array with missing values
data = np.array([[1, 2, 3], [4, np.nan, 6], [7, 8, 9]])
print("Original 2D array with missing values:\n", data)

# Remove rows with any missing values
cleaned_data = data[~np.any(np.isnan(data), axis=1)]
print("Cleaned data after removing rows with missing values:\n", cleaned_data)

Original 2D array with missing values:
 [[ 1.  2.  3.]
 [ 4. nan  6.]
 [ 7.  8.  9.]]
Cleaned data after removing rows with missing values:
 [[1. 2. 3.]
 [7. 8. 9.]]
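
The same pattern works for columns by switching the axis. The sketch below also shows one alternative to dropping data, filling each NaN with its column mean; whether imputation is appropriate depends on the dataset, so treat it as an illustration only.

```python
import numpy as np

data = np.array([[1, 2, 3],
                 [4, np.nan, 6],
                 [7, 8, 9]])

# axis=0 checks each column; keep the columns with no NaN
column_has_nan = np.isnan(data).any(axis=0)
complete_columns = data[:, ~column_has_nan]
print(complete_columns)
# [[1. 3.]
#  [4. 6.]
#  [7. 9.]]

# Alternative to dropping: fill each NaN with its column mean.
# np.nanmean ignores NaN values when averaging.
column_means = np.nanmean(data, axis=0)
filled = np.where(np.isnan(data), column_means, data)
print(filled[1])    # [4. 5. 6.]
```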


### NumPy for Statistical Analysis

### Data Transformation

NumPy has no dedicated data transformation functions, but the common transformations are easy to build from its arithmetic and aggregation features.

#### 1. Data Centering
Centering data involves subtracting the mean from each data point. This is often done to remove the effect of a constant term or to facilitate model convergence.

#### 2. Standardization
Standardization transforms numerical data so that it has a mean of 0 and a standard deviation of 1. This makes it easier to compare and analyze data measured on different scales.

#### 3. Log Transformation
Logarithmic transformation is used to make data more symmetric or to stabilize variance in cases of exponential growth.

In [37]:
# Data Centering
data = np.array([10, 20, 30, 40, 50])
mean = np.mean(data)
centered_data = data - mean
print("Original data:", data)
print("Centered data (subtracting the mean):", centered_data)

# Standardization
std_dev = np.std(data)
standardized_data = (data - mean) / std_dev
print("\nStandard deviation of the data:", std_dev)
print("Standardized data (subtracting the mean and dividing by the standard deviation):", standardized_data)

# Log Transformation
log_transformed_data = np.log(data)
print("\nLog-transformed data:", log_transformed_data)


Original data: [10 20 30 40 50]
Centered data (subtracting the mean): [-20. -10.   0.  10.  20.]

Standard deviation of the data: 14.142135623730951
Standardized data (subtracting the mean and dividing by the standard deviation): [-1.41421356 -0.70710678  0.          0.70710678  1.41421356]

Log-transformed data: [2.30258509 2.99573227 3.40119738 3.68887945 3.91202301]
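
Real datasets usually have several columns on different scales. As a hedged sketch with a made-up `features` matrix, standardization can be applied per column by combining axis=0 with broadcasting; the last lines show np.log1p, which is often used when the data contains zeros.

```python
import numpy as np

# Rows are samples, columns are features on very different scales
features = np.array([[1.0, 100.0],
                     [2.0, 200.0],
                     [3.0, 300.0]])

# axis=0 computes one mean and one standard deviation per column;
# broadcasting then applies them to every row
standardized = (features - features.mean(axis=0)) / features.std(axis=0)

print(standardized.mean(axis=0))   # approximately [0. 0.]
print(standardized.std(axis=0))    # approximately [1. 1.]

# np.log(0) is -inf, so np.log1p (log(1 + x)) is a common choice for counts
counts = np.array([0, 9, 99])
log_counts = np.log1p(counts)
print(log_counts)
```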


### Random Sampling

Random sampling involves selecting a subset of data points from a larger dataset. NumPy also provides tools for generating random numbers from various probability distributions.

#### 1. Simple Random Sampling: 
Select a random sample of a specified size from a dataset. When sampling without replacement, each item selected is not returned to the population.

#### 2. Bootstrap Sampling: 
Bootstrap sampling involves sampling with replacement to create multiple datasets. This is often used to estimate the variability of a statistic.

In [38]:
# Simple Random Sampling Without replacement
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
random_samples = np.random.choice(data, size=5, replace=False)
print("Original data:", data)
print("Random samples without replacement:", random_samples)

# Bootstrap Sampling
num_samples = 1000
bootstrap_samples = np.random.choice(data, size=(num_samples, len(data)), replace=True)
print("\nNumber of bootstrap samples:", num_samples)
print("Example of one bootstrap sample:", bootstrap_samples[0])
print("Each bootstrap sample contains the same number of elements as the original data but with replacement.")


Original data: [ 1  2  3  4  5  6  7  8  9 10]
Random samples without replacement: [9 2 5 4 3]

Number of bootstrap samples: 1000
Example of one bootstrap sample: [3 7 9 6 4 9 8 6 9 8]
Each bootstrap sample contains the same number of elements as the original data but with replacement.
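
The samples printed above change on every run. NumPy's newer Generator interface, created with np.random.default_rng, accepts a seed so results can be reproduced. The sketch below assumes an arbitrary seed of 42 and uses the bootstrap resamples to estimate the standard error of the mean.

```python
import numpy as np

# A seeded Generator makes the random draws repeatable (seed 42 is arbitrary)
rng = np.random.default_rng(42)
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

sample = rng.choice(data, size=5, replace=False)
print("Sample without replacement:", sample)

# Bootstrap: 1000 resamples, each the same length as the data
resamples = rng.choice(data, size=(1000, len(data)), replace=True)
bootstrap_means = resamples.mean(axis=1)

print("Mean of the data:", data.mean())                      # 5.5
print("Bootstrap estimate of the standard error:", bootstrap_means.std(ddof=1))
```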


### Structured Arrays

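Structured arrays allow you to work with data similar to a table with named columns. Each field (column) of a structured array can have its own data type. Define the layout with np.dtype by passing a list of (field name, data type) tuples, then pass the result as the dtype when creating the array. In the example below, 'S20' means a byte string of up to 20 characters, which is why the name is decoded before printing.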


In [39]:
# Define data types for fields
dt = np.dtype([('name', 'S20'), ('age', int), ('salary', float)])

# Create a structured array
employees = np.array([('Alice', 30, 50000.0), ('Bob', 25, 60000.0)], dtype=dt)
print("Structured array of employees:", employees)

# Access the 'name' field of the first employee
print("\nName of the first employee:", employees['name'][0].decode('utf-8'))

# Access the 'age' field of all employees
print("Ages of all employees:", employees['age'])

Structured array of employees: [(b'Alice', 30, 50000.) (b'Bob', 25, 60000.)]

Name of the first employee: Alice
Ages of all employees: [30 25]
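
As a variation on the example above, the sketch below uses the Unicode string type 'U20' instead of 'S20', so names come back as ordinary strings. The third employee is invented here to make the filtering and sorting more visible.

```python
import numpy as np

# 'U20' stores up to 20 Unicode characters, so no .decode() is needed
dt = np.dtype([('name', 'U20'), ('age', np.int64), ('salary', np.float64)])
employees = np.array([('Alice', 30, 50000.0),
                      ('Bob', 25, 60000.0),
                      ('Cara', 41, 55000.0)], dtype=dt)   # 'Cara' is made up

# Boolean masks work on a field and select whole records
over_28 = employees[employees['age'] > 28]
print(over_28['name'])                          # ['Alice' 'Cara']

# Sort records by a named field
by_salary = np.sort(employees, order='salary')
print(by_salary['name'])                        # ['Alice' 'Cara' 'Bob']

# Aggregate a single field like any other array
print(employees['salary'].mean())               # 55000.0
```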


## Pandas: 

### Setting up Pandas

In [40]:
import pandas as pd

### Data Structures:

Pandas provides two fundamental data structures: Series and DataFrame, which are the building blocks of data manipulation and analysis in Python. Understanding these data structures is essential for effective data handling with Pandas.

#### Series

A Series is a one-dimensional labeled array that can hold various data types, such as integers, floats, strings, or even custom objects. It’s similar to a column in an Excel spreadsheet or a single column in a SQL table. Key features of the Series include:

* Labeling: Each element in a Series has a label or an index, which allows for easy access and manipulation of data.
* Homogeneous Data: Unlike a Python list, a Series typically stores data of a single data type, ensuring consistency.
* Vectorized Operations: You can perform vectorized operations on Series, making it efficient for element-wise calculations. This feature allows you to efficiently perform operations on entire columns or Series without the need for explicit loops. You can add, subtract, and multiply the series (columns of a dataframe) with a series or scalar.

#### Creating a Series

Let's explore different ways to create a Series in Pandas:

* From a List: The most straightforward method, akin to jotting down a list of items on a piece of paper.
* From a Dictionary: Where each key-value pair becomes an index-data pair in the Series, like mapping names to phone numbers in an address book.
* From a NumPy Array: It is like transforming a NumPy array into a Series with additional capabilities like having custom index labels.

In [41]:
# Creating a Series from a list
series_from_list = pd.Series([1, 2, 3, 4, 5])
print("Series from a list:\n", series_from_list)

# Creating a Series from a dictionary
series_from_dict = pd.Series({'a': 1, 'b': 2, 'c': 3})
print("\nSeries from a dictionary:\n", series_from_dict)

# Creating a Series from a NumPy array
series_from_array = pd.Series(np.array([10, 20, 30, 40, 50]))
print("\nSeries from a NumPy array:\n", series_from_array)

Series from a list:
 0    1
1    2
2    3
3    4
4    5
dtype: int64

Series from a dictionary:
 a    1
b    2
c    3
dtype: int64

Series from a NumPy array:
 0    10
1    20
2    30
3    40
4    50
dtype: int64
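
The labeling and vectorized-operation features listed earlier can be sketched with a custom index. The `prices` and `discounts` Series are hypothetical; the point is that arithmetic between two Series matches values by label rather than by position.

```python
import pandas as pd

prices = pd.Series([10.0, 20.0, 30.0], index=['a', 'b', 'c'])

# Access by label, and vectorized arithmetic with a scalar
print(prices['b'])                          # 20.0
with_tax = (prices * 1.1).round(2)
print(with_tax.tolist())                    # [11.0, 22.0, 33.0]

# Arithmetic between Series aligns on labels, not positions;
# labels present in only one Series produce NaN
discounts = pd.Series([1.0, 2.0], index=['b', 'c'])
net = prices - discounts
print(net)
# a     NaN
# b    19.0
# c    28.0
# dtype: float64
```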


#### DataFrame
A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). Imagine a DataFrame as a whole spreadsheet or a SQL table. It's like a collection of Series objects that share the same index, perfect for storing real-world data like sales reports, sports statistics, etc.

Key features of DataFrames include:

* Columns: Each column in a DataFrame is a Series, so different columns can hold different data types.
* Indexing: DataFrames have both row and column indexes, allowing for flexible data selection.
* Data Alignment: Like Series, DataFrames can align data based on labels, making operations easy and intuitive.
* Data Integration: You can merge, join, and concatenate DataFrames to combine and analyze data from various sources.

#### Creating a DataFrame

Creating a DataFrame can be done in several ways, reflecting the versatility of Pandas:

* From a Dictionary of Lists: Where keys become column names and lists become column data.
* From a List of Dictionaries: Each dictionary in the list becomes a row in the DataFrame.
* From a List of Lists: Combined with a separate list of column names to label the data.
* From a Series: You can build a DataFrame from one or more Series objects. If you use multiple Series, Pandas aligns them by their indexes.
* From a NumPy Array: Similar to Series, but now you can have a multi-dimensional array forming a table-like structure. You can specify column names separately.

In [42]:
# Creating a DataFrame from a dictionary of lists
df_from_dict = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
print("DataFrame from a dictionary of lists:\n", df_from_dict)

# Creating a DataFrame from a list of dictionaries
df_from_list_of_dicts = pd.DataFrame([{'A': 1, 'B': 2, 'C': 3}, {'A': 4, 'B': 5, 'C': 6}])
print("\nDataFrame from a list of dictionaries:\n", df_from_list_of_dicts)

# Creating a DataFrame from a list of lists
df_from_list_of_lists = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=['A', 'B', 'C'])
print("\nDataFrame from a list of lists:\n", df_from_list_of_lists)

# Creating a DataFrame from a Series
series1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
series2 = pd.Series([4, 5, 6], index=['a', 'b', 'c'])
df_from_series = pd.DataFrame({'Column1': series1, 'Column2': series2})
print("DataFrame from Series:\n", df_from_series)

# Creating a DataFrame from a NumPy array
np_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df_from_np_array = pd.DataFrame(np_array, columns=['ColumnA', 'ColumnB', 'ColumnC'])
print("\nDataFrame from a NumPy array:\n", df_from_np_array)

DataFrame from a dictionary of lists:
    A  B  C
0  1  4  7
1  2  5  8
2  3  6  9

DataFrame from a list of dictionaries:
    A  B  C
0  1  2  3
1  4  5  6

DataFrame from a list of lists:
    A  B  C
0  1  2  3
1  4  5  6
2  7  8  9
DataFrame from Series:
    Column1  Column2
a        1        4
b        2        5
c        3        6

DataFrame from a NumPy array:
    ColumnA  ColumnB  ColumnC
0        1        2        3
1        4        5        6
2        7        8        9
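
The Series example above uses two Series with identical indexes, so the alignment is not visible. As a minimal sketch with made-up values, the following shows what happens when the indexes only partly overlap: Pandas builds the union of the labels and fills the gaps with NaN.

```python
import pandas as pd

# Two hypothetical Series whose indexes only partly overlap
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Rows are matched by index label, not by position
aligned_df = pd.DataFrame({'Column1': s1, 'Column2': s2})
print(aligned_df)
```

Labels that exist in only one Series ('a' and 'd' here) get NaN in the other column, which also turns both columns into floats.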


Understanding these two core data structures sets the foundation for efficient data manipulation and analysis using Pandas. They enable you to load, clean, explore, and transform data in various ways, making Pandas a powerful tool in the data scientist’s toolkit.

### Data Loading and Data Inspection

Pandas provides a wide range of functions and methods for efficiently loading data from various sources and formats into DataFrames.

#### Reading Data from Different Sources (CSV, Excel, JSON)

In [44]:
# Load data from csv file with the name - data.csv
df_csv = pd.read_csv('data.csv')

# Load data from excel file with the name - data.xlsx
df_excel = pd.read_excel('data.xlsx')

# You can specify a specific sheet using the sheet_name parameter
df_sheet1 = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# Load data from a JSON file
df_json = pd.read_json('data.json')

FileNotFoundError: [Errno 2] No such file or directory: 'data.csv'
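
The error above only means that data.csv does not exist in the working directory. As a self-contained sketch, the following writes a small made-up DataFrame to temporary CSV and JSON files and reads them back. The file contents and the temporary directory are assumptions for illustration; the Excel case is left out because it needs an extra engine such as openpyxl.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data, used only to create files that can be read back
sample_df = pd.DataFrame({'product': ['Laptop', 'Mouse'], 'price': [1000, 20]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'data.csv')
    json_path = os.path.join(tmp_dir, 'data.json')

    # index=False keeps the row index out of the CSV file
    sample_df.to_csv(csv_path, index=False)
    sample_df.to_json(json_path, orient='records')

    # Read both files back into DataFrames
    df_csv = pd.read_csv(csv_path)
    df_json = pd.read_json(json_path)

print(df_csv)
print(df_json)
```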

### Case Study: Product Sales Analysis for TechGear

This case study will involve a fictional online retail company, "TechGear," which sells technology gadgets and accessories.

Objective: Analyze TechGear's sales data to understand sales trends, customer preferences, product performance, and inventory management for the fiscal year 2024.
  

In [45]:
# Sample data creation
data = {
    'TransactionID': range(1, 21),
    'Date': pd.date_range(start='2024-01-01', periods=20, freq='D'),
    'CustomerID': [101, 102, 103, 104, 101, 102, 103, 104, 101, 102, 103, 104, 101, 102, 103, 104, 101, 102, 103, 104],
    'ProductName': ['Laptop', 'Mouse', 'Keyboard', 'Headphones', 'Laptop', 'Mouse', 'Keyboard', 'Headphones', 'Laptop', 'Mouse', 'Keyboard', 'Headphones', 'Laptop', 'Mouse', 'Keyboard', 'Headphones', 'Laptop', 'Mouse', 'Keyboard', 'Headphones'],
    'Category': ['Electronics', 'Accessories', 'Accessories', 'Accessories', 'Electronics', 'Accessories', 'Accessories', 'Accessories', 'Electronics', 'Accessories', 'Accessories', 'Accessories', 'Electronics', 'Accessories', 'Accessories', 'Accessories', 'Electronics', 'Accessories', 'Accessories', 'Accessories'],
    'Quantity': [1, 2, 3, 1, 1, 2, 3, 1, 1, 2, 3, 1, 1, 2, 3, 1, 1, 2, 3, 1],
    'Price': [1000, 20, 30, 50, 1000, 20, 30, 50, 1000, 20, 30, 50, 1000, 20, 30, 50, 1000, 20, 30, 50]
}

# Creating DataFrame
sales_df = pd.DataFrame(data)

### Displaying DataFrames

It’s nice that we loaded the data, but how do we see it? Displaying a DataFrame is the first step in understanding its contents. You can type the DataFrame name and execute the cell to display it; for a large DataFrame, Pandas by default shows only the first 5 and last 5 rows. Pandas also offers several other methods to display different portions of your DataFrame:

* head(n): This method displays the first n rows of the DataFrame (5 if n is omitted). It's useful for getting a quick overview of the data's structure without overwhelming yourself with too much information. If you just want to see the column names, you can use .columns.
* tail(n): Similar to .head(), this method shows the last n rows of the DataFrame. It's handy for checking the end of the dataset.
* sample(n): If you want to see n random rows from the DataFrame, use this method. This is useful for exploring diverse parts of the dataset.

In [46]:
# Display the first 10 rows
print("\nFirst 10 rows:")
sales_df.head(10)


First 10 rows:


Unnamed: 0,TransactionID,Date,CustomerID,ProductName,Category,Quantity,Price
0,1,2024-01-01,101,Laptop,Electronics,1,1000
1,2,2024-01-02,102,Mouse,Accessories,2,20
2,3,2024-01-03,103,Keyboard,Accessories,3,30
3,4,2024-01-04,104,Headphones,Accessories,1,50
4,5,2024-01-05,101,Laptop,Electronics,1,1000
5,6,2024-01-06,102,Mouse,Accessories,2,20
6,7,2024-01-07,103,Keyboard,Accessories,3,30
7,8,2024-01-08,104,Headphones,Accessories,1,50
8,9,2024-01-09,101,Laptop,Electronics,1,1000
9,10,2024-01-10,102,Mouse,Accessories,2,20


In [47]:
# Display the last 10 rows
print("\nLast 10 rows:")
sales_df.tail(10)


Last 10 rows:


Unnamed: 0,TransactionID,Date,CustomerID,ProductName,Category,Quantity,Price
10,11,2024-01-11,103,Keyboard,Accessories,3,30
11,12,2024-01-12,104,Headphones,Accessories,1,50
12,13,2024-01-13,101,Laptop,Electronics,1,1000
13,14,2024-01-14,102,Mouse,Accessories,2,20
14,15,2024-01-15,103,Keyboard,Accessories,3,30
15,16,2024-01-16,104,Headphones,Accessories,1,50
16,17,2024-01-17,101,Laptop,Electronics,1,1000
17,18,2024-01-18,102,Mouse,Accessories,2,20
18,19,2024-01-19,103,Keyboard,Accessories,3,30
19,20,2024-01-20,104,Headphones,Accessories,1,50


In [48]:
# Display a random sample of 10 rows
print("\nRandom sample of 10 rows:")
sales_df.sample(10)


Random sample of 10 rows:


Unnamed: 0,TransactionID,Date,CustomerID,ProductName,Category,Quantity,Price
8,9,2024-01-09,101,Laptop,Electronics,1,1000
7,8,2024-01-08,104,Headphones,Accessories,1,50
18,19,2024-01-19,103,Keyboard,Accessories,3,30
5,6,2024-01-06,102,Mouse,Accessories,2,20
3,4,2024-01-04,104,Headphones,Accessories,1,50
1,2,2024-01-02,102,Mouse,Accessories,2,20
11,12,2024-01-12,104,Headphones,Accessories,1,50
17,18,2024-01-18,102,Mouse,Accessories,2,20
4,5,2024-01-05,101,Laptop,Electronics,1,1000
6,7,2024-01-07,103,Keyboard,Accessories,3,30
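
Two details mentioned above are not shown in the cells: .columns, and the fact that sample() returns different rows on every run. A small sketch on a made-up DataFrame (demo_df is hypothetical) shows both; passing random_state makes the sample reproducible.

```python
import pandas as pd

# Hypothetical stand-in for a larger DataFrame
demo_df = pd.DataFrame({
    'ProductName': ['Laptop', 'Mouse', 'Keyboard', 'Headphones'],
    'Price': [1000, 20, 30, 50],
})

# Column names only
print(demo_df.columns.tolist())

# The same random_state always selects the same rows
sample_a = demo_df.sample(2, random_state=42)
sample_b = demo_df.sample(2, random_state=42)
print(sample_a)
```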


### Data Exploration

Pandas provides methods for obtaining fundamental insights into your data. These are the first things you need to check while exploring your data.

* .shape: This attribute returns a tuple where the first element is the number of rows (samples) in the data and the second element is the number of columns.
* .info(): This method provides a concise summary of the DataFrame, including the data types, non-null counts, and memory usage. It's an excellent starting point for understanding the data's structure. If you just want to see the data types, you can use .dtypes.
* .describe(): This method generates basic statistics for each numeric (and datetime) column in the DataFrame, such as count, mean, standard deviation, minimum, and maximum values.

In [49]:
# Display the shape of the DataFrame
print("\nDataFrame shape:", sales_df.shape)

# Display concise summary
print("\nDataFrame info:")
sales_df.info()

# Display summary statistics
print("\nSummary statistics:")
sales_df.describe()


DataFrame shape: (20, 7)

DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   TransactionID  20 non-null     int64         
 1   Date           20 non-null     datetime64[ns]
 2   CustomerID     20 non-null     int64         
 3   ProductName    20 non-null     object        
 4   Category       20 non-null     object        
 5   Quantity       20 non-null     int64         
 6   Price          20 non-null     int64         
dtypes: datetime64[ns](1), int64(4), object(2)
memory usage: 1.2+ KB

Summary statistics:


Unnamed: 0,TransactionID,Date,CustomerID,Quantity,Price
count,20.0,20,20.0,20.0,20.0
mean,10.5,2024-01-10 12:00:00,102.5,1.75,275.0
min,1.0,2024-01-01 00:00:00,101.0,1.0,20.0
25%,5.75,2024-01-05 18:00:00,101.75,1.0,27.5
50%,10.5,2024-01-10 12:00:00,102.5,1.5,40.0
75%,15.25,2024-01-15 06:00:00,103.25,2.25,287.5
max,20.0,2024-01-20 00:00:00,104.0,3.0,1000.0
std,5.91608,,1.147079,0.850696,429.595893


For categorical or discrete data, you can explore unique values and their frequencies:

* .nunique(): This method calculates the number of unique values in each column. It's handy for understanding the diversity of data in categorical columns. To list the unique values themselves, call .unique() on a single column.
* .column_name or ['column_name']: Both forms access a specific column in the DataFrame. Only the bracket form works when the column name contains spaces or clashes with an existing DataFrame attribute or method.
* .value_counts(): Use this method on a specific column to count the occurrences of each unique value. It's particularly useful for categorical columns.
* Basic Statistics: You can calculate additional statistics for specific columns, such as the sum, max, min, mean, median, or mode, using Pandas’ mathematical methods.

In [50]:
# Unique product names
print("\nUnique Product Names:")
print(sales_df['ProductName'].unique())

# Value counts for 'Category'
print("\nCategory Value Counts:")
print(sales_df['Category'].value_counts())

# Basic statistics for 'Price'
print("\nPrice Statistics:")
print(sales_df['Price'].describe())

# Mean quantity sold
print("\nAverage Quantity Sold:")
print(sales_df['Quantity'].mean())


Unique Product Names:
['Laptop' 'Mouse' 'Keyboard' 'Headphones']

Category Value Counts:
Category
Accessories    15
Electronics     5
Name: count, dtype: int64

Price Statistics:
count      20.000000
mean      275.000000
std       429.595893
min        20.000000
25%        27.500000
50%        40.000000
75%       287.500000
max      1000.000000
Name: Price, dtype: float64

Average Quantity Sold:
1.75
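
The cell above uses .unique() but not .nunique(), attribute access, or the other statistics from the list. The following sketch covers those on a small made-up DataFrame; demo_df and the 'Unit Price' column (named with a space on purpose) are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data; 'Unit Price' contains a space on purpose
demo_df = pd.DataFrame({
    'Category': ['Electronics', 'Accessories', 'Accessories', 'Accessories'],
    'Unit Price': [1000, 20, 30, 50],
    'Quantity': [1, 2, 3, 1],
})

# Number of unique values per column
unique_counts = demo_df.nunique()
print(unique_counts)

# Attribute access works for simple names ...
category = demo_df.Category
# ... but a name with a space needs the bracket form
unit_price = demo_df['Unit Price']

print("Sum:", unit_price.sum())
print("Median:", unit_price.median())
# .mode() returns a Series because several values can tie
print("Mode of Quantity:", demo_df['Quantity'].mode().tolist())
```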


Based on the analysis, we can derive insights such as the most popular product categories, average spending per transaction, and customer buying patterns. These insights can help TechGear in decision-making processes related to inventory management, marketing strategies, and customer engagement initiatives.
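
As one possible way to compute two of the insights named above, the sketch below rebuilds the relevant columns of the sample sales data and derives the average spending per transaction and the most frequent category. Treating Quantity × Price as the amount spent per transaction is an assumption; the original data has no explicit total column.

```python
import pandas as pd

# Rebuild the relevant columns of the sample sales data (same values as above)
sales_df = pd.DataFrame({
    'ProductName': ['Laptop', 'Mouse', 'Keyboard', 'Headphones'] * 5,
    'Category': ['Electronics', 'Accessories', 'Accessories', 'Accessories'] * 5,
    'Quantity': [1, 2, 3, 1] * 5,
    'Price': [1000, 20, 30, 50] * 5,
})

# Assumption: amount spent per transaction = Quantity * Price
revenue = sales_df['Quantity'] * sales_df['Price']
average_spend = revenue.mean()

# Category that appears in the most transactions
top_category = sales_df['Category'].value_counts().idxmax()

print("Average spending per transaction:", average_spend)
print("Most frequent category:", top_category)
```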

### Data Selection and Indexing

Data selection and indexing are fundamental operations in Pandas, allowing you to extract specific subsets of data from a DataFrame.

#### Selecting Columns and Rows

You can select specific columns and rows from a DataFrame using square brackets [], .loc[], and .iloc[] indexing methods:

1. Using Square Brackets []: To select one or more columns by their names, you can use square brackets with the column names as a list.

In [51]:
# Sample data
data = {
    'CustomerID': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Hannah', 'Ian', 'Jack'],
    'Age': [25, 30, 35, 40, 28, 22, 55, 65, 20, 45],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 'Philadelphia', 'San Antonio', 'San Diego', 'Dallas', 'San Jose'],
    'Salary': [50000, 60000, 55000, 65000, 70000, 52000, 58000, 62000, 60000, 64000]
}

# Creating DataFrame
customers_df = pd.DataFrame(data)

# Display the DataFrame
customers_df

Unnamed: 0,CustomerID,Name,Age,City,Salary
0,101,Alice,25,New York,50000
1,102,Bob,30,Los Angeles,60000
2,103,Charlie,35,Chicago,55000
3,104,David,40,Houston,65000
4,105,Eva,28,Phoenix,70000
5,106,Frank,22,Philadelphia,52000
6,107,Grace,55,San Antonio,58000
7,108,Hannah,65,San Diego,62000
8,109,Ian,20,Dallas,60000
9,110,Jack,45,San Jose,64000


In [52]:
selected_columns = customers_df[['Name', 'City']]
selected_columns

Unnamed: 0,Name,City
0,Alice,New York
1,Bob,Los Angeles
2,Charlie,Chicago
3,David,Houston
4,Eva,Phoenix
5,Frank,Philadelphia
6,Grace,San Antonio
7,Hannah,San Diego
8,Ian,Dallas
9,Jack,San Jose


2. Using .loc[] For Label-Based Selection: The .loc[] indexer allows you to select rows and columns by label. You can specify both row and column labels. When you slice by label, both the start and the end labels are inclusive. Hence, with 2:5, rows 2, 3, 4, and 5 are all included.

In [53]:
selected_data = customers_df.loc[2:5, ['Name', 'Age']]
selected_data

Unnamed: 0,Name,Age
2,Charlie,35
3,David,40
4,Eva,28
5,Frank,22


3. Using .iloc[] For Integer-Based Selection: The .iloc[] indexer lets you select rows and columns by integer position, which is useful when you want to select by position rather than by label.
When you slice by position, the start is inclusive and the end is exclusive. Hence, with 1:4 and 0:2, only rows 1, 2, and 3 and columns 0 and 1 are shown.

In [54]:
selected_data = customers_df.iloc[1:4, 0:2]
selected_data

Unnamed: 0,CustomerID,Name
1,102,Bob
2,103,Charlie
3,104,David
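
With the default 0, 1, 2, ... index, labels and positions look the same, which hides the difference between .loc[] and .iloc[]. The sketch below uses a made-up DataFrame with string labels (people_df is hypothetical) so the inclusive and exclusive slice ends are easy to see.

```python
import pandas as pd

# Hypothetical DataFrame with string row labels instead of 0, 1, 2, ...
people_df = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie', 'David'], 'Age': [25, 30, 35, 40]},
    index=['w', 'x', 'y', 'z'],
)

# Label slice: both 'x' and 'z' are included
by_label = people_df.loc['x':'z', ['Name']]

# Position slice: position 1 is included, position 3 is excluded
by_position = people_df.iloc[1:3, 0:1]

print(by_label)
print(by_position)
```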


#### Filtering / Conditional Selection

Conditional selection enables you to filter rows based on specific criteria. You can use boolean indexing to achieve this. When you pass a sequence of booleans (with length equal to the number of rows) to a DataFrame, it returns only the rows whose corresponding boolean is True.

1. Boolean Indexing: Create a boolean mask by applying a condition to a column, and then use this mask to filter rows for the True Condition.

In [55]:
boolean_mask = customers_df['Salary'] > 60000
filtered_data = customers_df[boolean_mask]
filtered_data

Unnamed: 0,CustomerID,Name,Age,City,Salary
3,104,David,40,Houston,65000
4,105,Eva,28,Phoenix,70000
7,108,Hannah,65,San Diego,62000
9,110,Jack,45,San Jose,64000


2. Multiple Conditions: Combine multiple conditions using logical operators (& for AND, | for OR), and wrap each condition in parentheses, because & and | bind more tightly than comparison operators such as > and <. You can also use the .isin() method when you want to check whether each value appears in a list of allowed values.

In [56]:
# To filter customers older than 25 and with a salary greater than 55,000:
boolean_mask = (customers_df['Age'] > 25) & (customers_df['Salary'] > 55000)
filtered_data = customers_df[boolean_mask]
filtered_data

Unnamed: 0,CustomerID,Name,Age,City,Salary
1,102,Bob,30,Los Angeles,60000
3,104,David,40,Houston,65000
4,105,Eva,28,Phoenix,70000
6,107,Grace,55,San Antonio,58000
7,108,Hannah,65,San Diego,62000
9,110,Jack,45,San Jose,64000


In [57]:
# To filter customers either older than 60 or younger than 22:
boolean_mask = (customers_df['Age'] > 60) | (customers_df['Age'] < 22)
filtered_data = customers_df[boolean_mask]
filtered_data


Unnamed: 0,CustomerID,Name,Age,City,Salary
7,108,Hannah,65,San Diego,62000
8,109,Ian,20,Dallas,60000
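
The .isin() method mentioned above is not used in the cells, so here is a minimal sketch on a reduced, made-up customer table. It also uses ~ to negate a mask, which the text above does not cover.

```python
import pandas as pd

# Reduced, hypothetical version of the customer data
customers = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Ian'],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Dallas'],
})

target_cities = ['Chicago', 'Dallas']

# True where the city is one of the target cities
in_cities = customers[customers['City'].isin(target_cities)]

# ~ inverts the boolean mask: everyone outside the target cities
not_in_cities = customers[~customers['City'].isin(target_cities)]

print(in_cities)
print(not_in_cities)
```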


### Data Cleaning

Data cleaning is a critical step in the data preparation process. It involves identifying and addressing issues in your dataset to ensure its quality and reliability.

The dataset we will be using for this activity is from Kaggle. 

Kaggle is an online platform that hosts data science competitions, datasets, and other resources. It allows users to find and publish datasets, explore and build models in a web-based data-science environment, work with other data scientists and machine learning engineers, and enter competitions to solve data science challenges.

https://www.kaggle.com/datasets/tanishqdublish/grocery-dataset?resource=download
Scraped Grocery Data from Costco's online marketplace.

#### 1. Handling Missing Data

Missing data is a common issue in real-world datasets. Pandas offers methods to handle missing values effectively.

* .isna() and .notna(): These methods identify missing (NaN) and non-missing values, respectively, in your DataFrame. Applied to a column, .isna() returns a boolean Series that is True wherever the value is missing. Passing that Series to the DataFrame returns the rows where that column is null.
* .fillna(): You can replace missing values with a specified value or a calculated value (such as the column mean) using the .fillna() method. By default it returns a new object, so assign the result back to keep the change. The examples below start by replacing missing values with zero.
* .dropna(): Use this method to remove rows or columns containing missing values. By default, it removes the rows (axis=0) where any column value is missing (how='any').

Note:
- With how='any', it drops the rows where any of the column values are missing.
- With how='all', it drops the rows only where all of the column values (or all of the columns listed in subset) are missing.

In [58]:
# Load the dataset
file_path = 'CostcoDataset.csv'
grocery_df = pd.read_csv(file_path)

# Display the first few rows of the DataFrame to understand its structure
grocery_df.head()

Unnamed: 0,Sub Category,Price,Discount,Rating,Title,Currency,Feature,Product Description
0,Bakery & Desserts,$56.99,No Discount,Rated 4.3 out of 5 stars based on 265 reviews.,"David’s Cookies Mile High Peanut Butter Cake, ...",$,"""10"""" Peanut Butter Cake\nCertified Kosher OU-...",A cake the dessert epicure will die for!Our To...
1,Bakery & Desserts,$159.99,No Discount,Rated 5 out of 5 stars based on 1 reviews.,"The Cake Bake Shop 8"" Round Carrot Cake (16-22...",$,Spiced Carrot Cake with Cream Cheese Frosting ...,"Due to the perishable nature of this item, ord..."
2,Bakery & Desserts,$44.99,No Discount,Rated 4.1 out of 5 stars based on 441 reviews.,"St Michel Madeleine, Classic French Sponge Cak...",$,100 count\nIndividually wrapped\nMade in and I...,Moist and buttery sponge cakes with the tradit...
3,Bakery & Desserts,$39.99,No Discount,Rated 4.7 out of 5 stars based on 9459 reviews.,"David's Cookies Butter Pecan Meltaways 32 oz, ...",$,Butter Pecan Meltaways\n32 oz 2-Pack\nNo Prese...,These delectable butter pecan meltaways are th...
4,Bakery & Desserts,$59.99,No Discount,Rated 4.5 out of 5 stars based on 758 reviews.,"David’s Cookies Premier Chocolate Cake, 7.2 lb...",$,"""10"" Four Layer Chocolate Cake\nCertified Kosh...",A cake the dessert epicure will die for!To the...


In [59]:
# Set option to display the entire text of a column
pd.set_option('display.max_colwidth', None)

In [60]:
grocery_df.head()

Unnamed: 0,Sub Category,Price,Discount,Rating,Title,Currency,Feature,Product Description
0,Bakery & Desserts,$56.99,No Discount,Rated 4.3 out of 5 stars based on 265 reviews.,"David’s Cookies Mile High Peanut Butter Cake, 6.8 lbs (14 Servings)",$,"""10"""" Peanut Butter Cake\nCertified Kosher OU-D\n14 Servings","A cake the dessert epicure will die for!Our Top Selling Cake! Fudge brownie base, layered in velvety smooth peanut butter mousse, rich chocolate cake, topped with brownie chunks, handful of peanut butter chips, drizzled in fudge. This cake is the thoughtful gift idea that’s perfect for family, friends, coworkers, or to anyone you care about in your life. -\tGenerously sized precut slices, a cake lover’s dreams come true! Includes:Measures 10” diameterWeighs in at 6.8 lbs.14 servings OU-D certified, the most trusted kosher certification in the U.S.All natural with no added preservativesSome of our products may contain nuts. Our facility is NOT a nut-free facility, and as a result it is possible that any product may have come in contact with nut or nut oils"
1,Bakery & Desserts,$159.99,No Discount,Rated 5 out of 5 stars based on 1 reviews.,"The Cake Bake Shop 8"" Round Carrot Cake (16-22 Servings)",$,"Spiced Carrot Cake with Cream Cheese Frosting Silk Cherry Blossom Flowers (Not Edible) No Nuts or Raisins Dimensions: 9” Diameter, 7” High 16-22 Servings","Due to the perishable nature of this item, orders do NOT ship over the weekend. Orders can only be delivered on Wednesday, Thursday and Friday. Minimum delivery time is 5 business days. Plate not included. Gwendolyn Rogers' The Cake Bake Shop is famous for handcrafting magnificent and delicious cakes and desserts for her award winning restaurants. Each cake arrives beautifully packaged in her bakery’s signature pink and gold cake box with a pink satin ribbon and is topped with the bakery’s dusting of edible Pixie Glitter®, adding a sparkling finish to every dessert. Gwendolyn’s moist and delicious carrot cake is made with hand peeled and freshly grated carrots. Perfectly spiced with just the right amount of cinnamon, this cake has no nuts and no raisins. The three layers of spiced carrot cake are then filled and frosted with Gwendolyn's signature homemade cream cheese frosting. Topped with decorative pink silk cherry blossom flowers. Features: Flavor: Spiced Carrot Cake\nCake Filling: Cream Cheese Frosting\nCake Frosting: Cream Cheese Frosting\nTopped with Pink Silk Cherry Blossom Flowers (cherry blossom flowers are not edible, please do not consume, remove before eating)\nDimensions: 9” diameter, 7” high\nServes 16-22\nEach Cake Arrives With It’s Own Cake Care Card\nAllergens: Contains Wheat, Milk, Soy, Egg\nShips Frozen"
2,Bakery & Desserts,$44.99,No Discount,Rated 4.1 out of 5 stars based on 441 reviews.,"St Michel Madeleine, Classic French Sponge Cake 100 - count",$,100 count\nIndividually wrapped\nMade in and Imported from France\nFree-range eggs\nNon-GMO ingredients,"Moist and buttery sponge cakes with the traditional European madeleine flavor of almond. The Classic Madeleine is baked in the shape of seashell with ridges on one side and a “belly” on the other. Each madeleine is individually-wrapped for portion control and convenience.The Origin of the Madeleine: 18th century King Stanislas 1st, Duke of Lorraine During a festive dinner party in Commercy, France, the king’s chef abruptly left the kitchen. Seeking a solution to feed his guests dessert, a servant girl in the kitchen offered to make her family’s traditional pastry. The king enjoyed the little cake so much that he named it after the servant: Madeleine. Baked with non-GMO ingredients and free-range eggs. No preservatives, palm oil, hydrogenated oil or colorings. Baked with love in France.We all have our Madeleine moment:Enjoy everyday for breakfast, snack or dessert (Just as the French do!)Pack in lunches or backpacks for schoolServe during business, book club or PTA meetingsCut in half and fill with jelly or chocolate hazelnut spreadDecorate cakes or cupcakes with classic seashell shapeIncludes:100 countIndividually wrappedFree-range eggsNon-GMO ingredients"
3,Bakery & Desserts,$39.99,No Discount,Rated 4.7 out of 5 stars based on 9459 reviews.,"David's Cookies Butter Pecan Meltaways 32 oz, 2-pack",$,Butter Pecan Meltaways\n32 oz 2-Pack\nNo Preservatives\nCertified Kosher OU-D\nContains Nuts,"These delectable butter pecan meltaways are the perfect snack or dessert for the whole family. The treats are made with pure creamy butter and large pecan chunks and have just the right amount of powdered sugar to satisfy your sweet tooth.Includes:Includes: 2 Tins (32 oz. each)Contains nutsNo preservativesEnjoy with your morning coffee or teaCookies can be stored at room temperature for up to 60 daysEach tin contains approximately 64 cookiesKosher OU-DSome of our products may contain nuts. Our facility is NOT a nut-free facility, and as a result it is possible that any product may have come in contact with nut or nut oils"
4,Bakery & Desserts,$59.99,No Discount,Rated 4.5 out of 5 stars based on 758 reviews.,"David’s Cookies Premier Chocolate Cake, 7.2 lbs (Serves 14)",$,"""10"" Four Layer Chocolate Cake\nCertified Kosher OU-D\nServes 14","A cake the dessert epicure will die for!To the ultimate chocolate lover - We've baked your dream cake! Four split layers of our rich chocolate cake, filled with a smooth milk chocolate mousse, finished in chocolate ganache & covered in dark chocolate bark pieces. This cake is the thoughtful gift idea that’s perfect for family, friends, coworkers, or to anyone you care about in your life. Generously sized precut slices, A cake lover’s dreams come true! Includes:1 - 10” Premier Chocolate Overload CakeWeighs in at 7.2 lbs.14 Servings OU-D certified, the most trusted kosher certification in the U.S.All natural with no added preservativesSome of our products may contain nuts. Our facility is NOT a nut-free facility, and as a result it is possible that any product may have come in contact with nut or nut oils"


In [61]:
# Identifying missing values in each column
missing_values_count = grocery_df.isna().sum()
print("Missing values in each column:\n", missing_values_count)

Missing values in each column:
 Sub Category              0
Price                     3
Discount                  0
Rating                 1075
Title                     0
Currency                  5
Feature                  18
Product Description      42
dtype: int64


#### In the example below, we replace the missing values with zero.
df['Column'] = df['Column'].fillna(0)

#### If you want to fill missing data with the mean of a numeric column
df['Column'] = df['Column'].fillna(df['Column'].mean())

#### If you want to fill missing data with the median of a numeric column
df['Column'] = df['Column'].fillna(df['Column'].median())

#### If you want to fill missing data with the mode of a categorical column (.mode() returns a Series, so take its first value)
df['Column'] = df['Column'].fillna(df['Column'].mode()[0])
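
To make these patterns concrete, here is a runnable sketch on a small made-up DataFrame (toy_df and its columns are hypothetical). It assigns the result of fillna() to new variables instead of modifying the data in place, and takes the first value of .mode() because .mode() returns a Series.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column
toy_df = pd.DataFrame({
    'Score': [1.0, np.nan, 3.0, 8.0],
    'Grade': ['A', 'B', None, 'A'],
})

# Mean of the known scores is (1 + 3 + 8) / 3 = 4.0
filled_mean = toy_df['Score'].fillna(toy_df['Score'].mean())

# Median of the known scores is 3.0
filled_median = toy_df['Score'].fillna(toy_df['Score'].median())

# .mode() returns a Series, so [0] picks the most frequent value
filled_mode = toy_df['Grade'].fillna(toy_df['Grade'].mode()[0])

print(filled_mean.tolist())
print(filled_median.tolist())
print(filled_mode.tolist())
```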

In [62]:
# Example of filling missing data in the 'Rating' column with 'No Rating'.
# Assigning the result back avoids the chained inplace call,
# which pandas warns will stop working in pandas 3.0.
grocery_df['Rating'] = grocery_df['Rating'].fillna('No Rating')

# Example of filling missing data in the 'Feature' column with 'No Features listed'
grocery_df['Feature'] = grocery_df['Feature'].fillna('No Features listed')


In [63]:
missing_values_count = grocery_df.isna().sum()
print("Missing values in each column:\n", missing_values_count)

Missing values in each column:
 Sub Category            0
Price                   3
Discount                0
Rating                  0
Title                   0
Currency                5
Feature                 0
Product Description    42
dtype: int64


In [64]:
# Series.dropna() returns a new Series without the missing prices.
# The result is not assigned here, so grocery_df itself is unchanged.
grocery_df['Price'].dropna()

# DataFrame.dropna() defaults to axis=0 and how='any',
# which drops every row that has a missing value in any column.
# If you want only certain columns to be checked, use subset.
# Drops the rows where 'Currency' or 'Product Description' is missing.
grocery_df.dropna(subset=['Currency', 'Product Description'], how='any', inplace=True)

# Drops every column that has at least one missing value; not recommended.
# grocery_df.dropna(axis=1)

In [65]:
missing_values_count = grocery_df.isna().sum()
print("Missing values in each column:\n", missing_values_count)

Missing values in each column:
 Sub Category           0
Price                  0
Discount               0
Rating                 0
Title                  0
Currency               0
Feature                0
Product Description    0
dtype: int64
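
The difference between how='any', how='all', and subset is easier to see on a tiny example. The following sketch uses a made-up DataFrame (toy_df is hypothetical) and keeps each result in its own variable rather than using inplace=True.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is missing everything, row 2 is missing only Price
toy_df = pd.DataFrame({
    'Price': [10.0, np.nan, np.nan, 5.0],
    'Currency': ['$', np.nan, '$', '$'],
})

# Drops rows 1 and 2: each has at least one missing value
drop_any = toy_df.dropna(how='any')

# Drops only row 1: it is the only row where every value is missing
drop_all = toy_df.dropna(how='all')

# Checks only 'Currency', so row 2 survives despite its missing Price
drop_subset = toy_df.dropna(subset=['Currency'])

print(drop_any.index.tolist())
print(drop_all.index.tolist())
print(drop_subset.index.tolist())
```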


#### 2. Removing Duplicates
Duplicate rows can skew your analysis results. Pandas offers a simple way to find and remove duplicates:

* .duplicated(): This method identifies duplicate rows in a DataFrame.

1. By default, duplicated() uses keep='first', which leaves the first occurrence unmarked and marks each later occurrence as True, meaning it is a duplicate.
2. If you want to leave the last occurrence unmarked instead, pass keep='last'.
3. If you want to mark all occurrences of a duplicated row, pass keep=False (the boolean, not the string 'False').
4. If you want to check duplicates based on specific columns, pass those column names through the subset parameter.

* .drop_duplicates(): Use this method to remove duplicate rows from the DataFrame.

In [66]:
# Returns the duplicated rows:
# rows for which an earlier row matches exactly in all columns.
duplicates = grocery_df[grocery_df.duplicated()]
duplicates

Unnamed: 0,Sub Category,Price,Discount,Rating,Title,Currency,Feature,Product Description
1344,Paper & Plastic Products,$16.19,After $3.80 OFF,No Rating,"Ziploc Seal Top Freezer Bag, Gallon, 38-count, 4-pack",$,Seal Top Bags 1-Gallon Freezer Bags 38 Bags per Box 4 Boxes 152 Total Bags,"4 - 38 Count Boxes\n152 Gallon Freezer Bags Total\nFeaturing Easy Open Tabs\nDesigned to protect food against freezer burn\nSmart Zip Plus seal lets you feel, hear and see the bag close from edge-to-edge\nHelps to preserve original flavor\nMicrowave safe (use as directed). When defrosting and reheating, open zipper one inch to vent\nCaution: When using in microwave, place bag on a microwave-safe dish. Handle with care. Bag and contents may be hot. Do not overheat contents as bag may melt. Warning:\nTo avoid danger of suffocation, keep bags away from babies and young children."
1345,Paper & Plastic Products,$16.19,After $3.80 OFF,No Rating,"Ziploc Seal Top Freezer Bag, Gallon, 38-count, 4-pack",$,Seal Top Bags 1-Gallon Freezer Bags 38 Bags per Box 4 Boxes 152 Total Bags,"4 - 38 Count Boxes\n152 Gallon Freezer Bags Total\nFeaturing Easy Open Tabs\nDesigned to protect food against freezer burn\nSmart Zip Plus seal lets you feel, hear and see the bag close from edge-to-edge\nHelps to preserve original flavor\nMicrowave safe (use as directed). When defrosting and reheating, open zipper one inch to vent\nCaution: When using in microwave, place bag on a microwave-safe dish. Handle with care. Bag and contents may be hot. Do not overheat contents as bag may melt. Warning:\nTo avoid danger of suffocation, keep bags away from babies and young children."
1346,Paper & Plastic Products,$16.19,After $3.80 OFF,No Rating,"Ziploc Seal Top Freezer Bag, Gallon, 38-count, 4-pack",$,Seal Top Bags 1-Gallon Freezer Bags 38 Bags per Box 4 Boxes 152 Total Bags,"4 - 38 Count Boxes\n152 Gallon Freezer Bags Total\nFeaturing Easy Open Tabs\nDesigned to protect food against freezer burn\nSmart Zip Plus seal lets you feel, hear and see the bag close from edge-to-edge\nHelps to preserve original flavor\nMicrowave safe (use as directed). When defrosting and reheating, open zipper one inch to vent\nCaution: When using in microwave, place bag on a microwave-safe dish. Handle with care. Bag and contents may be hot. Do not overheat contents as bag may melt. Warning:\nTo avoid danger of suffocation, keep bags away from babies and young children."


In [67]:
# Identifying and removing duplicate rows based on all columns
print("\nBefore removing duplicates, shape:", grocery_df.shape)
grocery_df.drop_duplicates(inplace=True)
print("After removing duplicates, shape:", grocery_df.shape)


Before removing duplicates, shape: (1712, 8)
After removing duplicates, shape: (1709, 8)
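
The keep and subset options described above can be compared side by side on a small made-up DataFrame (toy_df is hypothetical). Rows 0 and 1 are exact copies, while row 2 shares only the title.

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 are identical, row 2 differs only in Price
toy_df = pd.DataFrame({
    'Title': ['bags', 'bags', 'bags', 'cake'],
    'Price': [16.19, 16.19, 18.00, 56.99],
})

first_mask = toy_df.duplicated()                    # marks later copies
last_mask = toy_df.duplicated(keep='last')          # marks earlier copies
all_mask = toy_df.duplicated(keep=False)            # marks every copy
title_mask = toy_df.duplicated(subset=['Title'])    # compares 'Title' only

# Keep one row per title
unique_titles = toy_df.drop_duplicates(subset=['Title'])

print(first_mask.tolist())
print(last_mask.tolist())
print(all_mask.tolist())
print(title_mask.tolist())
print(unique_titles)
```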


#### 3. String Operations

When working with text data, Pandas offers string operations through the .str accessor, which applies a string method to every value of a text (object dtype) column.

* .str.lower() and .str.upper(): These methods convert strings to lowercase or uppercase for the entire column values.
* .str.replace(): Use this method to replace substrings within strings.
* .str.contains() : This method allows you to check if a specific substring or pattern exists within a string. It returns a boolean Series indicating whether each element contains the specified pattern.
* .str.slice(): You can extract a substring from each string in a Series using the .str.slice() method. Specify the start and end positions to define the slice.

In [68]:
# Converting 'Title' to lowercase for uniformity
grocery_df['Title'] = grocery_df['Title'].str.lower()

# Using .str.contains() to filter items with 'cake' in the title
cake_items = grocery_df[grocery_df['Title'].str.contains('cake')]

# Displaying a summary of 'cake' items
cake_items[['Title', 'Price']].head(10)

Unnamed: 0,Title,Price
0,"david’s cookies mile high peanut butter cake, 6.8 lbs (14 servings)",$56.99
1,"the cake bake shop 8"" round carrot cake (16-22 servings)",$159.99
2,"st michel madeleine, classic french sponge cake 100 - count",$44.99
4,"david’s cookies premier chocolate cake, 7.2 lbs (serves 14)",$59.99
5,david's cookies mango & strawberry cheesecake 2-count (28 slices total),$59.99
7,"david's cookies no sugar added cheesecake & marble truffle cake, 2-pack (28 slices total)",$59.99
9,"the cake bake shop 8"" round chocolate cake (16-22 servings)",$159.99
10,"david's cookies 10"" rainbow cake (12 servings)",$62.99
11,the cake bake shop 2 tier special occasion cake (16-22 servings),$299.99
13,"david's cookies chocolate fudge birthday cake, 3.75 lbs. includes party pack (16 servings)",$54.99


### Data Manipulation

Data manipulation is a core task in data analysis and involves transforming and modifying your data to derive insights or prepare it for further analysis. Pandas provides a rich set of methods for data manipulation that empower you to shape your data to meet your specific needs.

#### 1. Applying Functions to DataFrames
One way to process a DataFrame element by element is to loop over its rows with .iterrows(). However, Pandas is optimized for vectorized operations, and iterating through a DataFrame row by row is generally not the most efficient way to work with data in Pandas. Use vectorized operations whenever possible.

When you need custom logic, you can apply the functions below to the columns or rows of a DataFrame to perform element-wise operations more efficiently:

* .apply(): Use this method to apply a custom function to a Series or to an entire DataFrame.
* -- When you use it on a Series, each element of the column is passed to the function.
* -- When you use it on an entire DataFrame, the whole row (axis=1) or the whole column (axis=0) is passed to the function.
* .map(): This method applies a function to each element of a Series. It also accepts a dictionary or a Series that maps existing values to new ones.
* .applymap(): When you want to apply a function to each element in the entire DataFrame, you can use .applymap(). Since pandas 2.1, .applymap() is deprecated in favor of DataFrame.map(), which behaves the same way.
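
The difference between these calls can be sketched on a tiny DataFrame. The `df_demo` frame and its column names are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample data for illustration
df_demo = pd.DataFrame({'price': [10.0, 20.0, 30.0],
                        'discount': [1.0, 0.0, 5.0]})

# Series.apply(): the function receives one element at a time
doubled = df_demo['price'].apply(lambda p: p * 2)

# DataFrame.apply(axis=1): the function receives one row (a Series) at a time
final_price = df_demo.apply(lambda row: row['price'] - row['discount'], axis=1)

# DataFrame.apply(axis=0): the function receives one column (a Series) at a time
column_max = df_demo.apply(lambda col: col.max(), axis=0)

# Series.map(): a dict can be used to translate values
labels = df_demo['discount'].map({0.0: 'none', 1.0: 'small', 5.0: 'large'})

print(final_price.tolist())  # [9.0, 20.0, 25.0]
print(labels.tolist())       # ['small', 'none', 'large']
```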

In [69]:
grocery_df['Discount'].value_counts()

Discount
No Discount                     1588
After $6 OFF                      13
After $5 OFF                      12
After $4 OFF                      12
After $3 OFF                       9
After $3.30 OFF                    7
After $3.60 OFF                    6
After $20 OFF                      4
After $50 OFF                      4
After $30 OFF                      4
After $3.80 OFF                    4
After $3.50 OFF                    4
After $3.10 OFF                    3
After $60 OFF                      3
After $10 OFF                      3
After $2.20 OFF                    3
After $12 OFF                      2
After $5.60 OFF                    2
After $2.50 OFF                    2
After $70 OFF                      2
After $2.30 OFF                    2
After $40 OFF                      2
After $2.60 OFF                    1
After $7 OFF                       1
After $4.10 OFF                    1
After $2 OFF                       1
After $6.50 OFF              

In [70]:
# Define a function to extract discount value
def extract_discount_value(discount):
    if 'OFF' in discount:
        # Assuming the discount format is 'After $X OFF'
        return float(discount.split('$')[1].split()[0])
    else:
        return 0.0

# Use .apply() to create the 'Discount Value' column from 'Discount'
grocery_df['Discount Value'] = grocery_df['Discount'].apply(extract_discount_value)

In [71]:
grocery_df['Discount Value'].value_counts()

Discount Value
0.0     1591
6.0       13
5.0       12
4.0       12
3.0        9
3.3        7
3.6        6
3.8        4
40.0       4
50.0       4
30.0       4
3.5        4
20.0       4
3.1        3
10.0       3
60.0       3
2.2        3
12.0       2
5.6        2
2.5        2
70.0       2
2.3        2
7.0        1
6.5        1
4.1        1
2.6        1
2.0        1
2.7        1
80.0       1
9.3        1
1.5        1
8.0        1
2.8        1
4.5        1
2.4        1
Name: count, dtype: int64

In [72]:
# Define a mapping function for discount categories
def discount_category(discount):
    if discount >= 10.0:
        return 'High'
    else:
        return 'Low'

# Use .map() to apply the categorization
grocery_df['Discount Category'] = grocery_df['Discount Value'].map(discount_category)

In [73]:
grocery_df['Discount Category'].value_counts()

Discount Category
Low     1682
High      27
Name: count, dtype: int64

In Pandas, the lambda function is often used in conjunction with .apply(), .map(), .applymap(), and other similar methods for inline, anonymous function definitions. This approach is particularly useful for applying quick, one-off functions to DataFrame or Series objects without the need to formally define a separate function.

* lambda Function: A lambda function is a small anonymous function in Python. It can take any number of arguments, but can only have one expression. The syntax is lambda arguments: expression. The expression is executed and the result is returned.

* Usage with .apply(): When you use a lambda function with .apply(), it allows you to apply a simple function to each element (when used on a Series) or to each row/column (when used on a DataFrame) without defining a traditional function using def.
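
To make the comparison concrete, here is a small sketch that writes the same categorization once with def and once with lambda. The `discount_values` Series is made up for the example.

```python
import pandas as pd

# Hypothetical discount values for illustration
discount_values = pd.Series([0.0, 3.5, 10.0, 50.0])

# Named function version
def discount_category(discount):
    return 'High' if discount >= 10.0 else 'Low'

with_def = discount_values.map(discount_category)

# Equivalent inline lambda version: arguments before the colon, expression after
with_lambda = discount_values.map(lambda d: 'High' if d >= 10.0 else 'Low')

print(with_lambda.tolist())  # ['Low', 'Low', 'High', 'High']
```

A lambda keeps short, one-off logic next to the call that uses it. A named function is usually easier to read and test once the logic grows beyond a single expression.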



In [74]:
# Selecting only the columns we want to modify
text_columns = grocery_df[['Title', 'Feature']]

# Use .applymap() to convert all text to uppercase
grocery_df[['Title', 'Feature']] = text_columns.applymap(lambda x: x.upper())



In [75]:
grocery_df

Unnamed: 0,Sub Category,Price,Discount,Rating,Title,Currency,Feature,Product Description,Discount Value,Discount Category
0,Bakery & Desserts,$56.99,No Discount,Rated 4.3 out of 5 stars based on 265 reviews.,"DAVID’S COOKIES MILE HIGH PEANUT BUTTER CAKE, 6.8 LBS (14 SERVINGS)",$,"""10"""" PEANUT BUTTER CAKE\nCERTIFIED KOSHER OU-D\n14 SERVINGS","A cake the dessert epicure will die for!Our Top Selling Cake! Fudge brownie base, layered in velvety smooth peanut butter mousse, rich chocolate cake, topped with brownie chunks, handful of peanut butter chips, drizzled in fudge. This cake is the thoughtful gift idea that’s perfect for family, friends, coworkers, or to anyone you care about in your life. -\tGenerously sized precut slices, a cake lover’s dreams come true! Includes:Measures 10” diameterWeighs in at 6.8 lbs.14 servings OU-D certified, the most trusted kosher certification in the U.S.All natural with no added preservativesSome of our products may contain nuts. Our facility is NOT a nut-free facility, and as a result it is possible that any product may have come in contact with nut or nut oils",0.0,Low
1,Bakery & Desserts,$159.99,No Discount,Rated 5 out of 5 stars based on 1 reviews.,"THE CAKE BAKE SHOP 8"" ROUND CARROT CAKE (16-22 SERVINGS)",$,"SPICED CARROT CAKE WITH CREAM CHEESE FROSTING SILK CHERRY BLOSSOM FLOWERS (NOT EDIBLE) NO NUTS OR RAISINS DIMENSIONS: 9” DIAMETER, 7” HIGH 16-22 SERVINGS","Due to the perishable nature of this item, orders do NOT ship over the weekend. Orders can only be delivered on Wednesday, Thursday and Friday. Minimum delivery time is 5 business days. Plate not included. Gwendolyn Rogers' The Cake Bake Shop is famous for handcrafting magnificent and delicious cakes and desserts for her award winning restaurants. Each cake arrives beautifully packaged in her bakery’s signature pink and gold cake box with a pink satin ribbon and is topped with the bakery’s dusting of edible Pixie Glitter®, adding a sparkling finish to every dessert. Gwendolyn’s moist and delicious carrot cake is made with hand peeled and freshly grated carrots. Perfectly spiced with just the right amount of cinnamon, this cake has no nuts and no raisins. The three layers of spiced carrot cake are then filled and frosted with Gwendolyn's signature homemade cream cheese frosting. Topped with decorative pink silk cherry blossom flowers. Features: Flavor: Spiced Carrot Cake\nCake Filling: Cream Cheese Frosting\nCake Frosting: Cream Cheese Frosting\nTopped with Pink Silk Cherry Blossom Flowers (cherry blossom flowers are not edible, please do not consume, remove before eating)\nDimensions: 9” diameter, 7” high\nServes 16-22\nEach Cake Arrives With It’s Own Cake Care Card\nAllergens: Contains Wheat, Milk, Soy, Egg\nShips Frozen",0.0,Low
2,Bakery & Desserts,$44.99,No Discount,Rated 4.1 out of 5 stars based on 441 reviews.,"ST MICHEL MADELEINE, CLASSIC FRENCH SPONGE CAKE 100 - COUNT",$,100 COUNT\nINDIVIDUALLY WRAPPED\nMADE IN AND IMPORTED FROM FRANCE\nFREE-RANGE EGGS\nNON-GMO INGREDIENTS,"Moist and buttery sponge cakes with the traditional European madeleine flavor of almond. The Classic Madeleine is baked in the shape of seashell with ridges on one side and a “belly” on the other. Each madeleine is individually-wrapped for portion control and convenience.The Origin of the Madeleine: 18th century King Stanislas 1st, Duke of Lorraine During a festive dinner party in Commercy, France, the king’s chef abruptly left the kitchen. Seeking a solution to feed his guests dessert, a servant girl in the kitchen offered to make her family’s traditional pastry. The king enjoyed the little cake so much that he named it after the servant: Madeleine. Baked with non-GMO ingredients and free-range eggs. No preservatives, palm oil, hydrogenated oil or colorings. Baked with love in France.We all have our Madeleine moment:Enjoy everyday for breakfast, snack or dessert (Just as the French do!)Pack in lunches or backpacks for schoolServe during business, book club or PTA meetingsCut in half and fill with jelly or chocolate hazelnut spreadDecorate cakes or cupcakes with classic seashell shapeIncludes:100 countIndividually wrappedFree-range eggsNon-GMO ingredients",0.0,Low
3,Bakery & Desserts,$39.99,No Discount,Rated 4.7 out of 5 stars based on 9459 reviews.,"DAVID'S COOKIES BUTTER PECAN MELTAWAYS 32 OZ, 2-PACK",$,BUTTER PECAN MELTAWAYS\n32 OZ 2-PACK\nNO PRESERVATIVES\nCERTIFIED KOSHER OU-D\nCONTAINS NUTS,"These delectable butter pecan meltaways are the perfect snack or dessert for the whole family. The treats are made with pure creamy butter and large pecan chunks and have just the right amount of powdered sugar to satisfy your sweet tooth.Includes:Includes: 2 Tins (32 oz. each)Contains nutsNo preservativesEnjoy with your morning coffee or teaCookies can be stored at room temperature for up to 60 daysEach tin contains approximately 64 cookiesKosher OU-DSome of our products may contain nuts. Our facility is NOT a nut-free facility, and as a result it is possible that any product may have come in contact with nut or nut oils",0.0,Low
4,Bakery & Desserts,$59.99,No Discount,Rated 4.5 out of 5 stars based on 758 reviews.,"DAVID’S COOKIES PREMIER CHOCOLATE CAKE, 7.2 LBS (SERVES 14)",$,"""10"" FOUR LAYER CHOCOLATE CAKE\nCERTIFIED KOSHER OU-D\nSERVES 14","A cake the dessert epicure will die for!To the ultimate chocolate lover - We've baked your dream cake! Four split layers of our rich chocolate cake, filled with a smooth milk chocolate mousse, finished in chocolate ganache & covered in dark chocolate bark pieces. This cake is the thoughtful gift idea that’s perfect for family, friends, coworkers, or to anyone you care about in your life. Generously sized precut slices, A cake lover’s dreams come true! Includes:1 - 10” Premier Chocolate Overload CakeWeighs in at 7.2 lbs.14 Servings OU-D certified, the most trusted kosher certification in the U.S.All natural with no added preservativesSome of our products may contain nuts. Our facility is NOT a nut-free facility, and as a result it is possible that any product may have come in contact with nut or nut oils",0.0,Low
...,...,...,...,...,...,...,...,...,...,...
1752,Snacks,$23.99,No Discount,No Rating,"OBERTO THIN STYLE SMOKED SAUSAGE STICK, COCKTAIL PEPPERONI, 3 OZ, 8-COUNT",$,COCKTAIL PEPPERONI SMOKED SAUSAGE STICKS 3 OZ BAG 8-COUNT NET WEIGHT: 24 OZ,Cocktail PepperoniSmoked Sausage Sticks3 oz bag8-count,0.0,Low
1753,Snacks,$49.99,No Discount,No Rating,"CHEETOS CRUNCHY, ORIGINAL, 2.1 OZ, 64-COUNT",$,MADE WITH REAL CHEESE,64-count2.1 oz Bags,0.0,Low
1754,Snacks,$22.99,No Discount,No Rating,"SABRITAS CHILE & LIMON MIX, VARIETY PACK, 30-COUNT",$,CHILE & LIMÓN MIX VARIETY PACK 30 CT NET WEIGHT 48 OZ,8-Doritos Dinamita Chile Limón Flavored Rolled Tortilla Chips (1.75 oz bag)\n8-Lay's Limón Flavored Potato Chips (1.5 oz bag)\n6-Sabritones Chile & Lime Flavored Puffed Wheat Snacks (1.0 oz bag)\n8-Sabritas Turbo Flamas Flavored Corn Snacks (2.0 oz bag),0.0,Low
1755,Snacks,$17.49,No Discount,No Rating,"FRUIT ROLL-UPS, VARIETY PACK, 72-COUNT",$,VARIETY PACK 1 BOX WITH 72 ROLLS FLAVORED WITH OTHER NATURAL FLAVORS GELATIN AND GLUTEN FREE,"Fruit Flavored Snacks\nVariety Includes: Strawberry Blast, Tropical Tie Dye\nIndividually Wrapped\nTotal Net Weight 2.3 lbs.",0.0,Low


In [76]:
total_discount = 0
for index, row in grocery_df.iterrows():
    if row['Discount Value'] > 0:
        total_discount += row['Discount Value']

print(f"Total Discount Given: ${total_discount}")

Total Discount Given: $1390.999999999999
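
The same total can be computed without a Python loop. The sketch below uses a small hypothetical `demo_df` in place of grocery_df and compares the row-by-row loop with a vectorized filter-and-sum.

```python
import pandas as pd

# Hypothetical stand-in for grocery_df['Discount Value']
demo_df = pd.DataFrame({'Discount Value': [0.0, 3.5, 0.0, 6.0, 20.0]})

# Row-by-row loop with .iterrows()
loop_total = 0
for index, row in demo_df.iterrows():
    if row['Discount Value'] > 0:
        loop_total += row['Discount Value']

# Vectorized equivalent: build a boolean mask, then sum the selected values
mask = demo_df['Discount Value'] > 0
vectorized_total = demo_df.loc[mask, 'Discount Value'].sum()

print(loop_total, vectorized_total)  # 29.5 29.5
```

On large DataFrames the vectorized form is typically much faster because the work happens inside NumPy rather than in the Python interpreter.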


#### 2. Adding and Removing Columns

You can add and remove columns to tailor your DataFrame for analysis:

* Adding Columns: To add a new column or replace an existing one, simply assign values to it.
* Removing Columns: Use the .drop() method to remove columns. The axis parameter controls whether labels are dropped from the index (0 or 'index') or from the columns (1 or 'columns'): use axis=1 or axis='columns' to drop a column, and axis=0 or axis='index' to drop a row.

In [77]:
# Adding a new column 'Is Bakery Item'
grocery_df['Is Bakery Item'] = grocery_df['Sub Category'].apply(lambda x: x == 'Bakery & Desserts')

In [78]:
# Removing 'Feature' and 'Product Description' columns
grocery_df.drop(['Feature', 'Product Description'], axis=1, inplace=True)

In [79]:
grocery_df

Unnamed: 0,Sub Category,Price,Discount,Rating,Title,Currency,Discount Value,Discount Category,Is Bakery Item
0,Bakery & Desserts,$56.99,No Discount,Rated 4.3 out of 5 stars based on 265 reviews.,"DAVID’S COOKIES MILE HIGH PEANUT BUTTER CAKE, 6.8 LBS (14 SERVINGS)",$,0.0,Low,True
1,Bakery & Desserts,$159.99,No Discount,Rated 5 out of 5 stars based on 1 reviews.,"THE CAKE BAKE SHOP 8"" ROUND CARROT CAKE (16-22 SERVINGS)",$,0.0,Low,True
2,Bakery & Desserts,$44.99,No Discount,Rated 4.1 out of 5 stars based on 441 reviews.,"ST MICHEL MADELEINE, CLASSIC FRENCH SPONGE CAKE 100 - COUNT",$,0.0,Low,True
3,Bakery & Desserts,$39.99,No Discount,Rated 4.7 out of 5 stars based on 9459 reviews.,"DAVID'S COOKIES BUTTER PECAN MELTAWAYS 32 OZ, 2-PACK",$,0.0,Low,True
4,Bakery & Desserts,$59.99,No Discount,Rated 4.5 out of 5 stars based on 758 reviews.,"DAVID’S COOKIES PREMIER CHOCOLATE CAKE, 7.2 LBS (SERVES 14)",$,0.0,Low,True
...,...,...,...,...,...,...,...,...,...
1752,Snacks,$23.99,No Discount,No Rating,"OBERTO THIN STYLE SMOKED SAUSAGE STICK, COCKTAIL PEPPERONI, 3 OZ, 8-COUNT",$,0.0,Low,False
1753,Snacks,$49.99,No Discount,No Rating,"CHEETOS CRUNCHY, ORIGINAL, 2.1 OZ, 64-COUNT",$,0.0,Low,False
1754,Snacks,$22.99,No Discount,No Rating,"SABRITAS CHILE & LIMON MIX, VARIETY PACK, 30-COUNT",$,0.0,Low,False
1755,Snacks,$17.49,No Discount,No Rating,"FRUIT ROLL-UPS, VARIETY PACK, 72-COUNT",$,0.0,Low,False


#### 3. Combining DataFrames (Concatenation, Joining, Merging)

Pandas offers powerful methods to combine DataFrames:

* Concatenation: You can concatenate DataFrames vertically or horizontally using pd.concat(). axis=0 stacks them along the rows, and axis=1 places them side by side along the columns. With axis=0, columns that share a name are aligned and stacked, and a column that exists in only one DataFrame is filled with NaN for the rows coming from the other.

In [80]:
# Sample DataFrames with partially overlapping column names (A and B are shared)
data1 = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C':[1,2,3]}
data2 = {'A': [7, 8, 9], 'B': [10, 11, 12], 'D':[1,2,3]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# Concatenate df1 and df2 vertically (along rows); shared columns are aligned
result = pd.concat([df1, df2], axis=0)

# Display the concatenated DataFrame
result

# Tip: To make the index proper, you can use -> result.reset_index(drop=True)

Unnamed: 0,A,B,C,D
0,1,4,1.0,
1,2,5,2.0,
2,3,6,3.0,
0,7,10,,1.0
1,8,11,,2.0
2,9,12,,3.0
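
For comparison, the sketch below reuses the same two sample frames to show a vertical concatenation with a fresh index (ignore_index=True) and a horizontal concatenation (axis=1).

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [1, 2, 3]})
df2 = pd.DataFrame({'A': [7, 8, 9], 'B': [10, 11, 12], 'D': [1, 2, 3]})

# Vertical concatenation with a fresh 0..n-1 index
stacked = pd.concat([df1, df2], axis=0, ignore_index=True)

# Horizontal concatenation: rows are aligned on the index and the
# columns of both frames are placed side by side
side_by_side = pd.concat([df1, df2], axis=1)

print(stacked.shape)                # (6, 4)
print(list(side_by_side.columns))   # ['A', 'B', 'C', 'A', 'B', 'D']
```

Note that axis=1 keeps duplicate column names (A and B appear twice), so renaming columns beforehand is often worthwhile.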


* Joining: You can perform SQL-like joins on DataFrames using the pd.merge() function or the .merge() method.
* Merging: Pandas merges DataFrames on their common columns. With the how parameter you can perform inner, outer, left, and right joins.

An inner join keeps only the rows whose values in the common columns match in both DataFrames.

An outer join keeps every row from both DataFrames. A left join keeps all the rows from the left table, and a right join keeps all the rows from the right table.

To match a column of one table against the index of the other, pass left_index=True or right_index=True for the table whose index should be used, and pass left_on=column_name or right_on=column_name for the table whose column should be used.

In [81]:
# Sample DataFrames
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'Age': [25, 30, 22]})

# Inner join on 'ID'
result = pd.merge(df1, df2, on='ID', how='inner')

# Display the merged DataFrame
print("Inner Join")
print(result)

# Left join on 'ID'
result = pd.merge(df1, df2, on='ID', how='left')

# Display the merged DataFrame
print("Left Join")
print(result)

Inner Join
   ID     Name  Age
0   2      Bob   25
1   3  Charlie   30
Left Join
   ID     Name   Age
0   1    Alice   NaN
1   2      Bob  25.0
2   3  Charlie  30.0


In [82]:
# If the key columns have different names in the two DataFrames,
# you can specify them explicitly
# Considering, left table has ID1 and right table has ID2
# pd.merge(df1,df2, left_on="ID1", right_on="ID2")

# Here it will try to match the left dataframe index with right ID column
# pd.merge(df1,df2, left_index=True,right_on="ID")
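
A runnable version of the two commented calls might look like the sketch below. The `left` and `right` frames and their ID1/ID2 columns are hypothetical, made up to match the comments.

```python
import pandas as pd

# Hypothetical frames whose key columns have different names
left = pd.DataFrame({'ID1': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
right = pd.DataFrame({'ID2': [2, 3, 4], 'Age': [25, 30, 22]})

# Match left.ID1 against right.ID2 (inner join is the default)
by_columns = pd.merge(left, right, left_on='ID1', right_on='ID2')

# Move the key into the index of the left frame,
# then match that index against the right frame's ID2 column
left_indexed = left.set_index('ID1')
by_index = pd.merge(left_indexed, right, left_index=True, right_on='ID2')

print(by_columns['Name'].tolist())  # ['Bob', 'Charlie']
print(by_index['Age'].tolist())     # [25, 30]
```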

### Data Aggregations

Meaningful insights rarely emerge without proper aggregation of data. That is why pivot tables are used so often in Excel. Aggregation is a critical step in data analysis and often involves applying functions like sum, mean, and count to groups of data.

#### 1. Grouping Data

* groupby with a single column: This method allows you to group data based on one or more columns. You can think of it as a powerful version of the SQL GROUP BY statement.

First, pass the column name on which you want to group the data. Then select the column you want to summarize from the grouped data, and finally choose the aggregate function (mean, sum, max, min, etc.).

When you apply an aggregation function to grouped data without specifying a column, it is applied to every non-grouping column in the DataFrame. In pandas 2.0 and later, functions such as mean() raise an error on non-numeric columns unless you pass numeric_only=True.
Eg: Let’s say you have sales data and you want to group it by the “Category” column and calculate the total sales within each category.

In [83]:
# Sample DataFrame
data = {'Category': ['Electronics', 'Clothing', 'Electronics', 'Clothing'],
        'Sales': [1000, 500, 800, 500]}
df = pd.DataFrame(data)

# Grouping by 'Category'
grouped_data = df.groupby('Category')

# Choosing sales column to compare with grouped data and using sum function
# This gives the total sales for each category.
total_sales = grouped_data['Sales'].sum()

total_sales

Category
Clothing       1000
Electronics    1800
Name: Sales, dtype: int64

* groupby with multiple columns: You can even group with multiple columns by passing a list of columns you want to group by.

Eg: Suppose you have a dataset with student information including their grades in different subjects, and you want to group the data by both the “Class” and “Gender” columns, then calculate the average Math score for each group.

In [84]:
# Sample DataFrame
data = {'Class': ['A', 'B', 'A', 'B', 'A', 'B'],
        'Gender': ['Male', 'Male', 'Female', 'Female', 'Male', 'Female'],
        'Math_Score': [85, 92, 78, 89, 90, 86],
        'English_Score': [88, 94, 80, 92, 92, 88]}
df = pd.DataFrame(data)

# Grouping by 'Class' and 'Gender' and calculating statistics
grouped_data = df.groupby(['Class', 'Gender'])

# Calculate the mean for Math_score
agg_results = grouped_data['Math_Score'].mean()

agg_results

Class  Gender
A      Female    78.0
       Male      87.5
B      Female    87.5
       Male      92.0
Name: Math_Score, dtype: float64

* Apply an aggregate function to grouped data without specifying a column:
In such cases, it is applied to every non-grouping column in the grouped DataFrame, all of which are numeric in this example.

In [85]:
# Sample DataFrame
data = {'Class': ['A', 'B', 'A', 'B', 'A', 'B'],
        'Gender': ['Male', 'Male', 'Female', 'Female', 'Male', 'Female'],
        'Math_Score': [85, 92, 78, 89, 90, 86],
        'English_Score': [88, 94, 80, 92, 92, 88]}
df = pd.DataFrame(data)

# Grouping by 'Class' and 'Gender'
grouped_data = df.groupby(['Class', 'Gender'])

# Applying the mean aggregation function to all numeric columns
aggregated_data = grouped_data.mean()

aggregated_data

Unnamed: 0_level_0,Unnamed: 1_level_0,Math_Score,English_Score
Class,Gender,Unnamed: 2_level_1,Unnamed: 3_level_1
A,Female,78.0,80.0
A,Male,87.5,90.0
B,Female,87.5,90.0
B,Male,92.0,94.0
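
When a DataFrame also contains text columns, recent pandas versions need to be told to skip them. The sketch below assumes a hypothetical 'Name' column added to a smaller version of the sample data.

```python
import pandas as pd

# Hypothetical data with a non-numeric 'Name' column
data = {'Class': ['A', 'B', 'A', 'B'],
        'Name': ['Ann', 'Ben', 'Cat', 'Dan'],
        'Math_Score': [85, 92, 78, 89]}
df_named = pd.DataFrame(data)

# numeric_only=True skips 'Name' instead of trying to average text
class_means = df_named.groupby('Class').mean(numeric_only=True)

print(class_means['Math_Score'].tolist())  # [81.5, 90.5]
```

Selecting the columns explicitly, for example df_named.groupby('Class')['Math_Score'].mean(), is an alternative that avoids the issue altogether.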


#### 2. Aggregation Functions

Aggregation functions are essential for summarizing data within groups. The common aggregation functions are sum(), max(), min(), mean(), median(), and count(). In addition, agg() lets you apply several aggregation functions at once, including custom ones.

Eg: Say you want to apply multiple aggregate functions (mean, min, and max) at once to the Math score. You may also want to apply them to only a few columns, such as two subjects.

In [86]:
# Sample DataFrame
data = {'Class': ['A', 'B', 'A', 'B', 'A', 'B'],
        'Gender': ['Male', 'Male', 'Female', 'Female', 'Male', 'Female'],
        'Math_Score': [85, 92, 78, 89, 90, 86],
        'English_Score': [88, 94, 80, 92, 92, 88],
        'Physics_Score': [78, 90, 85, 92, 88, 84]}
df = pd.DataFrame(data)

# Grouping by 'Class' and 'Gender' and calculating statistics
grouped_data = df.groupby(['Class', 'Gender'])

# Calculate the mean, min, and max scores for Math_score
agg_results = grouped_data.Math_Score.agg(['mean', 'min', 'max'])

agg_results

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,min,max
Class,Gender,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
A,Female,78.0,78,78
A,Male,87.5,85,90
B,Female,87.5,86,89
B,Male,92.0,92,92


In [87]:
# Applying aggregation functions to 'Math_Score' and 'Physics_Score'
aggregated_data = grouped_data.agg({
    'Math_Score': ['mean', 'min', 'max'],
    'Physics_Score': ['mean', 'min', 'max']
})

aggregated_data

Unnamed: 0_level_0,Unnamed: 1_level_0,Math_Score,Math_Score,Math_Score,Physics_Score,Physics_Score,Physics_Score
Unnamed: 0_level_1,Unnamed: 1_level_1,mean,min,max,mean,min,max
Class,Gender,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
A,Female,78.0,78,78,85.0,85,85
A,Male,87.5,85,90,83.0,78,88
B,Female,87.5,86,89,88.0,84,92
B,Male,92.0,92,92,90.0,90,90
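
As a sketch of a custom aggregation, agg() also accepts keyword arguments of the form new_name=(column, function), which gives flat, readable column names. The output names `avg_math` and `math_range` are made up for this example.

```python
import pandas as pd

data = {'Class': ['A', 'B', 'A', 'B', 'A', 'B'],
        'Math_Score': [85, 92, 78, 89, 90, 86]}
df_scores = pd.DataFrame(data)

summary = df_scores.groupby('Class').agg(
    # Built-in aggregation referenced by name
    avg_math=('Math_Score', 'mean'),
    # Custom aggregation: spread between the highest and lowest score
    math_range=('Math_Score', lambda s: s.max() - s.min()),
)

print(summary['math_range'].tolist())  # [12, 6]
```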


### Case Study: Health Metrics Analysis

#### Dataset Overview:
* Patient ID: Unique identifier for each patient.
* Age: Age of the patient.
* Weight: Weight of the patient in kilograms.
* Cholesterol Level: Cholesterol level in mg/dL.
* Blood Pressure: Blood pressure reading (systolic/diastolic) in mmHg.
* Date: Date of the health metric recording.

In [90]:
# Sample data
np.random.seed(42) # For reproducibility
patient_ids = np.arange(1, 11)
ages = np.random.randint(20, 70, size=10)
weights = np.random.uniform(55, 100, size=10)
cholesterol_levels = np.random.randint(150, 250, size=10)
blood_pressures_systolic = np.random.randint(110, 140, size=10)
blood_pressures_diastolic = np.random.randint(70, 90, size=10)
dates = pd.date_range('2024-01-01', periods=10, freq='ME')

# Create a DataFrame
df = pd.DataFrame({
    'Patient ID': patient_ids,
    'Age': ages,
    'Weight': weights,
    'Cholesterol Level': cholesterol_levels,
    'Blood Pressure Systolic': blood_pressures_systolic,
    'Blood Pressure Diastolic': blood_pressures_diastolic,
    'Date': dates
})

df.head(5)

Unnamed: 0,Patient ID,Age,Weight,Cholesterol Level,Blood Pressure Systolic,Blood Pressure Diastolic,Date
0,1,58,75.6662,207,124,72,2024-01-31
1,2,48,70.016888,171,139,74,2024-02-29
2,3,34,61.429007,238,139,88,2024-03-31
3,4,62,84.289981,198,124,76,2024-04-30
4,5,27,57.538521,240,139,78,2024-05-31


### 💡📝 Exercise 1: Calculate the average weight of the patient population.
Use NumPy to compute the mean of the Weight column, providing insights into the general health status of the population.

In [91]:
# Calculating average weight
average_weight = np.mean(weights)
average_weight

np.float64(77.11366460070388)

### 💡📝 Exercise 2: Determine the median cholesterol level.
Calculate the median of the Cholesterol Level column to find the middle value of cholesterol levels, helping to understand the distribution of cholesterol levels among patients.


In [92]:
# Determining median cholesterol level
median_cholesterol = np.median(cholesterol_levels)
median_cholesterol

np.float64(208.5)

### 💡📝 Exercise 3: Identify the range of blood pressure readings in the dataset.
Utilize NumPy to find the maximum and minimum values in the Blood Pressure readings, giving an idea of the variation in blood pressure across the patient population.

In [93]:
# Extracting systolic and diastolic blood pressures for range calculation
systolic_range = blood_pressures_systolic.max() - blood_pressures_systolic.min()
diastolic_range = blood_pressures_diastolic.max() - blood_pressures_diastolic.min()
systolic_range, diastolic_range

(np.int64(18), np.int64(16))

### 💡📝 Exercise 4: Evaluate the standard deviation of ages.
Calculate the standard deviation of the Age column using NumPy to assess the age diversity of the patient population.

In [94]:
# Evaluating standard deviation of ages
std_dev_ages = np.std(ages)
std_dev_ages

np.float64(11.713667231059622)
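
One detail worth knowing here: np.std() computes the population standard deviation by default (ddof=0), while the pandas .std() method defaults to the sample standard deviation (ddof=1). A minimal sketch with made-up ages:

```python
import numpy as np
import pandas as pd

ages_demo = np.array([20, 30, 40, 50])  # hypothetical ages

population_std = np.std(ages_demo)        # ddof=0: divide by N
sample_std = np.std(ages_demo, ddof=1)    # ddof=1: divide by N - 1
pandas_std = pd.Series(ages_demo).std()   # pandas defaults to ddof=1

print(round(population_std, 2), round(sample_std, 2))  # 11.18 12.91
```

With only ten patients the two results differ noticeably, so it helps to state which one is being reported.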

### 💡📝 Exercise 5: How Many Patients Fall into Each Category of Blood Pressure Status?

Classify patients into categories based on their systolic and diastolic blood pressure readings to assess the prevalence of normal blood pressure, elevated blood pressure, and different stages of hypertension within the patient population.

In [95]:
def categorize_blood_pressure(row):
    systolic = row['Blood Pressure Systolic']
    diastolic = row['Blood Pressure Diastolic']
    # Check the most severe categories first so that a high value on either
    # reading is never masked by a milder category.
    if systolic > 180 or diastolic > 120:
        return 'Hypertensive Crisis'
    elif systolic >= 140 or diastolic >= 90:
        return 'Hypertension Stage 2'
    elif systolic >= 130 or diastolic >= 80:
        return 'Hypertension Stage 1'
    elif systolic >= 120:
        return 'Elevated'
    else:
        return 'Normal'

# Apply the function to categorize blood pressure for each patient
df['Blood Pressure Category'] = df.apply(categorize_blood_pressure, axis=1)

# Count the number of patients in each blood pressure category
blood_pressure_counts = df['Blood Pressure Category'].value_counts()
blood_pressure_counts

Blood Pressure Category
Hypertension Stage 1    7
Elevated                3
Name: count, dtype: int64

### 💡📝 Exercise 6: What is the Average Cholesterol Level for Each Decade of Age?

Calculate the average cholesterol level, grouping patients by their age in decades (20s, 30s, etc.).

In [96]:
# Create a new column for age decade
df['Age Decade'] = df['Age'] // 10 * 10

# Calculate the average cholesterol level for each age decade
average_cholesterol_by_decade = df.groupby('Age Decade')['Cholesterol Level'].mean()
average_cholesterol_by_decade

Age Decade
20    240.0
30    236.0
40    196.0
50    199.0
60    198.0
Name: Cholesterol Level, dtype: float64

### 💡📝 Exercise 7:  How Many Patients Have Cholesterol Levels Above the Median?

Determine the number of patients whose cholesterol levels are above the median value, providing a simple measure of how cholesterol levels are distributed across the patient population.

In [97]:
# Count how many patients have cholesterol levels above this median
patients_above_median_cholesterol = df[df['Cholesterol Level'] > median_cholesterol].shape[0]
patients_above_median_cholesterol

5

### 💡📝 TO-DO Exercise 8:  Update the Dataset to Include BMI (Body Mass Index)
Objective: Enhance the dataset with a new column for BMI, calculated from each patient's weight (in kg) and an assumed average height (since height is not provided in the dataset, you could assume a fixed value for simplicity, such as 1.75 meters for all patients).

Approach:

* Calculate BMI using the formula below.
* Add the BMI calculations as a new column in the DataFrame.
* Analyze the distribution of BMI across the dataset to identify how many patients might be classified as underweight, normal weight, overweight, or obese based on standard BMI categories.

In [98]:
height_m = 1.75  # Assuming an average height of 1.75 meters for all patients
weight = 60
bmi = weight / height_m**2
bmi

19.591836734693878
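
One possible way to approach this exercise is sketched below on a small hypothetical `demo` DataFrame rather than on the case-study data. The bin edges follow the standard BMI categories (below 18.5, 18.5 up to 25, 25 up to 30, 30 and above). pd.cut() is just one option for the classification step.

```python
import numpy as np
import pandas as pd

# Hypothetical weights in kg; df['Weight'] from the case study could be used instead
demo = pd.DataFrame({'Weight': [52.0, 60.0, 80.0, 95.0]})
height_m = 1.75  # assumed height for every patient

# Vectorized BMI: weight (kg) divided by height (m) squared
demo['BMI'] = demo['Weight'] / height_m**2

# Bin BMI into categories; right=False makes each bin [low, high)
bins = [0, 18.5, 25, 30, np.inf]
labels = ['Underweight', 'Normal weight', 'Overweight', 'Obese']
demo['BMI Category'] = pd.cut(demo['BMI'], bins=bins, labels=labels, right=False)

print(demo['BMI Category'].value_counts())
```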

### 💡📝 TO-DO Exercise 9:  Standardize the Weight Column Using NumPy

Objective: Standardize the values in the Weight column to have a mean of 0 and a standard deviation of 1. This process will allow for the comparison of weights in a way that is independent of the original unit scale, useful in many statistical analyses and machine learning algorithms.

Approach:
* Extract the Weight column into a NumPy array.
* Subtract the mean of the array from each element and then divide by the standard deviation of the array.
* Optionally, update the DataFrame with the standardized weights.
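
The steps above can be sketched as follows, using a made-up `weights_demo` array in place of the Weight column:

```python
import numpy as np

# Hypothetical weights in kg; df['Weight'].to_numpy() could be used instead
weights_demo = np.array([55.0, 65.0, 75.0, 85.0, 95.0])

# Subtract the mean, then divide by the (population) standard deviation
standardized = (weights_demo - weights_demo.mean()) / weights_demo.std()

print(standardized.mean().round(6), standardized.std().round(6))
```

The result can be stored with an assignment such as df['Weight Standardized'] = standardized, where the column name is only a suggestion.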
