# 4t DigiNext Data & ML Bootcamp

## Week 1 - EDA

### Essential Tools for Data Exploration

This notebook serves as a practical guide to essential tools for data exploration.



---



## Jupyter Notebooks & Google Colab

### Task 1

This notebook is read-only, so you can't modify it directly.

Open the `File` menu and click `Save a copy in Drive`.

We suggest that you create a folder in your Drive for your notebooks.

Then, proceed in your own notebook!

### Task 2

Fill in the `student_name` variable in the code block below with your name, then run the block.

In [None]:
## Task 2

student_name = "<ENTER YOUR NAME HERE>"
print(f"This notebook belongs to {student_name}")

### Task 3

Open the `File` menu, click `Download`, and then select `Download .ipynb`.

## NumPy

### Easier to work with

#### Task 1

Run this code block.

In [None]:
# NumPy is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices,
# along with a large collection of high-level mathematical functions to operate on these arrays.

import numpy as np

# Create a NumPy array
numpy_array = np.array([1, 2, 3, 4, 5])
print("NumPy Array:", numpy_array)

# Create a Python list
python_list = [1, 2, 3, 4, 5]
print("Python List:", python_list)

# Perform an element-wise operation (e.g., add 10 to each element)
# Using NumPy (vectorized operation)
numpy_result = numpy_array + 10
print("NumPy Result (add 10):", numpy_result)

# Using Python list (requires a loop or list comprehension)
python_result = [x + 10 for x in python_list]
print("Python List Result (add 10):", python_result)


# Calculate the sum of all elements in a NumPy array vs Python list
sum_of_elements_numpy = np.sum(numpy_array)
print("Sum of elements in NumPy Array:", sum_of_elements_numpy)

sum_of_elements_python = sum(python_list)
print("Sum of elements in Python List:", sum_of_elements_python)


# Calculate the mean of elements in a NumPy array vs Python list
mean_of_elements_numpy = np.mean(numpy_array)
print("Mean of elements in NumPy Array:", mean_of_elements_numpy)

mean_of_elements_python = sum(python_list) / len(python_list)
print("Mean of elements in Python List:", mean_of_elements_python)

### No for loops!

#### Task 2
Run this code block.

In [None]:
# Matrix operations using NumPy
matrix_a_numpy = np.array([[1, 2], [3, 4]])
matrix_b_numpy = np.array([[5, 6], [7, 8]])

# Matrix multiplication using NumPy
matrix_c_numpy = np.dot(matrix_a_numpy, matrix_b_numpy)
print("NumPy Matrix Multiplication Result:\n", matrix_c_numpy)

# Matrix operations using Python lists (requires nested loops)
matrix_a_python = [[1, 2], [3, 4]]
matrix_b_python = [[5, 6], [7, 8]]

# Initialize result matrix with zeros
matrix_c_python = [[0, 0], [0, 0]]

# Matrix multiplication using Python lists
for i in range(len(matrix_a_python)):
   for j in range(len(matrix_b_python[0])):
       for k in range(len(matrix_b_python)):
           matrix_c_python[i][j] += matrix_a_python[i][k] * matrix_b_python[k][j]

print("Python List Matrix Multiplication Result:\n", matrix_c_python)
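
As a small companion to the cell above, the sketch below compares `np.dot` with the `@` operator and with `*` on the same two matrices. The variable names (`matrix_a`, `product_dot`, and so on) are chosen for this illustration only. It is meant as a starting point for Task 3, not as a full answer.

```python
import numpy as np

matrix_a = np.array([[1, 2], [3, 4]])
matrix_b = np.array([[5, 6], [7, 8]])

# For 2-D arrays, np.dot and the @ operator both perform matrix multiplication
product_dot = np.dot(matrix_a, matrix_b)
product_at = matrix_a @ matrix_b

# The * operator is different: it multiplies matching elements one by one
elementwise = matrix_a * matrix_b

print("np.dot result:\n", product_dot)
print("@ result:\n", product_at)
print("Element-wise (*) result:\n", elementwise)
```

The first two results are identical, while the third one is not a matrix product at all. Mixing up `*` and `np.dot` is a common source of silent errors.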

#### Task 3

What is a matrix? What does the `np.dot` function do?
Learn more and explain briefly.

YOUR RESPONSE:


### Faster!

#### Task 4

Run this code block and try different values for `NUM_TRIALS`.

In [None]:
# Time comparison for simple operations
import numpy as np
import time
import random

# Create large NumPy array and Python list
size = 1_000_000
NUM_TRIALS = 100 # try 1, 10, 100, and 1000

numpy_array_large = np.arange(size)
python_list_large = list(range(size))

# Time element-wise addition.
# The random offset is drawn once per trial in both loops so the comparison is fair,
# and time.perf_counter() is used because it is designed for measuring short durations.
start_time = time.perf_counter()
for _ in range(NUM_TRIALS):
  offset = random.randint(1, 10)
  new_numpy_result_large = numpy_array_large + offset
end_time = time.perf_counter()
print(f"NumPy element-wise addition took: {end_time - start_time:.6f} seconds")

start_time = time.perf_counter()
for _ in range(NUM_TRIALS):
  offset = random.randint(1, 10)
  new_python_result_large = [x + offset for x in python_list_large]
end_time = time.perf_counter()
print(f"Python list element-wise addition took: {end_time - start_time:.6f} seconds")
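
Manual start/stop timing like the cell above can be noisy. As an alternative sketch, the standard-library `timeit` module repeats the measurement for you. The array size, the fixed offset of 5, and the variable names here are assumptions made for this example, chosen so that it runs quickly.

```python
import timeit

import numpy as np

size = 100_000  # smaller than in the cell above so this finishes quickly
numpy_array_small = np.arange(size)
python_list_small = list(range(size))

# timeit.repeat runs each callable `number` times per measurement
# and returns one total time for each of the `repeat` measurements
numpy_times = timeit.repeat(lambda: numpy_array_small + 5, number=10, repeat=5)
python_times = timeit.repeat(
    lambda: [x + 5 for x in python_list_small], number=10, repeat=5
)

# The minimum is usually the measurement least disturbed by other processes
best_numpy = min(numpy_times)
best_python = min(python_times)
print(f"Best NumPy time for 10 runs: {best_numpy:.6f} seconds")
print(f"Best Python list time for 10 runs: {best_python:.6f} seconds")
```

The exact numbers depend on your machine and on the Colab runtime, so compare the two results against each other rather than against someone else's output.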


#### Task 5

Learn more about how and why NumPy is faster than plain Python lists, then explain briefly.

## Pandas

### DataFrame and Series

#### Task 1

Run this code block.

In [None]:
import pandas as pd

# Create a simple DataFrame
data = {'col1': [1, 2, 3], 'col2': ['A', 'B', 'C']}
df = pd.DataFrame(data)
print("Type of df:", type(df))
print("Type of a column:", type(df["col1"]))
print("Simple DataFrame:")
df
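
The cell above shows that a single column of a DataFrame is a Series. The sketch below, which rebuilds the same toy data under the hypothetical name `toy_df`, illustrates how the selection syntax decides which of the two types you get back.

```python
import pandas as pd

toy_df = pd.DataFrame({"col1": [1, 2, 3], "col2": ["A", "B", "C"]})

# A single column name returns a Series (one-dimensional)
single_column = toy_df["col1"]

# A list of column names returns a DataFrame (two-dimensional), even with one name
sub_frame = toy_df[["col1"]]

print("Single brackets give:", type(single_column).__name__)
print("Double brackets give:", type(sub_frame).__name__)

# Like NumPy arrays, a Series supports vectorized arithmetic
col_plus_ten = single_column + 10
print(col_plus_ten)
```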

### Loading a DataFrame

#### Task 2

Upload the `titanic.csv` file to your session and run this code block.

In [None]:
# Read the dataset from csv file

df = pd.read_csv("titanic.csv")
print("Shape of df:", df.shape)

In [None]:
# Let's take a look at the dataframe
df
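
If you want to experiment with `pd.read_csv` before uploading the file, the sketch below reads CSV text from memory instead. The three rows are hypothetical sample values, not real records from `titanic.csv`. Only the column names are taken from the cells in this notebook.

```python
import io

import pandas as pd

# Hypothetical sample rows for illustration; the third row has no Age value
csv_text = """Survived,Sex,Age,Embarked
0,male,22,S
1,female,38,C
1,female,,S
"""

# read_csv accepts a file path or any file-like object such as StringIO
sample_df = pd.read_csv(io.StringIO(csv_text))

print("Shape:", sample_df.shape)
print(sample_df.head(2))  # first two rows only

# Empty fields are read as NaN; count them per column
print(sample_df.isna().sum())
```

Checking for missing values right after loading is a useful habit, because they affect many of the later steps.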

### Exploring a DataFrame

#### Task 3

Run all the code blocks. Make sure you understand what each one does!

In [None]:
# Gather some info about each column
df.info()

In [None]:
# Describe numerical columns
df.describe()

In [None]:
# Count values in a column
df["Embarked"].value_counts()

In [None]:
# Calculate the mean value of a column
df["Survived"].mean()

In [None]:
# Group by a column and calculate the mean value of another column
df.groupby("Sex")["Survived"].mean()
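
A group mean is easier to interpret next to the group size. As a sketch on a tiny hypothetical DataFrame (the four rows below are made up for illustration), `agg` can compute several statistics per group in one call.

```python
import pandas as pd

# Made-up rows that reuse the column names from the Titanic data
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Survived": [0, 1, 1, 1],
})

# Compute the mean and the number of rows for each group
summary = toy.groupby("Sex")["Survived"].agg(["mean", "count"])
print(summary)
```

Because `Survived` holds 0 and 1, its mean is the share of survivors in each group.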

In [None]:
# Access a row by its index label (use df.iloc for position-based access)
df.loc[10]
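
`df.loc` looks rows up by index label, while `df.iloc` looks them up by position. The two only agree when the index happens to be 0, 1, 2, and so on. The sketch below uses a hypothetical DataFrame with a custom index to make the difference visible.

```python
import pandas as pd

# Hypothetical data with index labels that differ from row positions
toy = pd.DataFrame({"value": ["a", "b", "c"]}, index=[10, 20, 30])

by_label = toy.loc[10]      # the row whose index label is 10
by_position = toy.iloc[0]   # the first row, whatever its label is

print("loc[10] value:", by_label["value"])
print("iloc[0] value:", by_position["value"])

# After filtering, the original labels are kept, so positions and labels drift apart
filtered = toy[toy["value"] != "a"]
print("First remaining label:", filtered.index[0])
print("First remaining value:", filtered.iloc[0]["value"])
```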

### Transforming a DataFrame



In [None]:
# Add a boolean column that is True for male passengers
df["is_male"] = df["Sex"] == "male"

In [None]:
# Filter based on column values
survived_females_df = df[(df["Survived"] == 1) & (df["Sex"] == "female")]
survived_females_df

In [None]:
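def get_age_bucket(age):
  # Missing ages (NaN) fail every comparison and would otherwise be labeled "old"
  if pd.isna(age):
    return "unknown"
  if age < 18:
    return "child"
  elif age < 35:
    return "young"
  else:
    return "old"

df["age_bucket"] = df["Age"].apply(get_age_bucket)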
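
As an alternative sketch, `pd.cut` can assign the same buckets without a Python function. The `ages` Series below is hypothetical sample data, and the bin edges mirror the thresholds of `get_age_bucket` (18 and 35). With `right=False`, each bin includes its left edge and excludes its right edge.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including the boundary values and one missing value
ages = pd.Series([5, 17, 18, 34, 35, 80, np.nan])

# Bins are [0, 18), [18, 35), and [35, inf)
buckets = pd.cut(
    ages,
    bins=[0, 18, 35, np.inf],
    right=False,
    labels=["child", "young", "old"],
)
print(buckets)
```

Note that `pd.cut` leaves a missing age as a missing bucket instead of assigning it a label, and it returns a categorical column rather than plain strings.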

In [None]:
df