Important points covered in the note: 

**A. Machine Learning:** 
1) Using supervised learning techniques to build predictive models
2)  For both regression and classification problems
3)  Underfitting and overfitting
4)  How to split data
5)  Cross-validation
6)  Data preprocessing techniques
7)  Model selection
8)  Hyperparameter tuning
9)  Model performance evaluation
10)  Using pipelines

**B. Data Analytics**
1) 

**What is machine learning?**

+ Machine learning is the process whereby:
  * Computers are given the ability to learn to make decisions from data
  * without being explicitly programmed
  * E.g:
    a) learning to predict whether an email is spam or not spam given its content and sender
    b) learning to cluster books into different categories based on the words they contain, then assigning any new book to one of the existing clusters
    
+ Unsupervised learning:
  * Uncovering hidden patterns from unlabeled data
  * E.g.: Grouping customers into distinct categories based on their purchasing behavior, without knowing in advance what these categories are.
  => This is clustering, one branch of unsupervised learning. 
    ![image.png](attachment:d8e831a0-47e1-48fa-aa8d-2ecc9235841a.png)

+ Supervised learning:
  * The values to be predicted are already known (the data is labeled)
  * Aim: Predict the target values of unseen data, given the features
  * Uses features to predict the value of a target variable, such as predicting a basketball player's position based on their points per game

+ Types of supervised learning:
  * Classification: the target variable consists of categories (e.g., predicting whether a bank transaction is fraudulent or not, which is binary classification)
  * Regression: the target variable is continuous (e.g., using features such as the number of bedrooms and the size of a property to predict the target variable, the price of the property)

+ Naming conventions:
  * Feature = predictor variable = independent variable
  * Target variable = dependent variable = response variable
![image.png](attachment:d91a2f0d-e00f-48d8-b148-5d53ce0f9f3a.png)

+ Before using supervised learning:
  * Requirements:
    a) No missing values
    b) Data in numeric format
    c) Data stored in a pandas DataFrame or NumPy array
    d) Perform Exploratory Data Analysis (EDA) first
    => ensure data is in the correct format 
    => various pandas methods for descriptive statistics, along with appropriate data visualizations, are useful in this step

[**Exploratory Data Analysis (EDA)** is the first step in processing data before applying machine learning methods. The main goal of EDA is to analyze, summarize, and visualize the data in order to understand its characteristics, structure, distributions, and the relationships between variables. This helps detect missing data, outliers, and data-entry errors, and identify the data types and the important features that will affect the model later.

Common steps in EDA:
a) Descriptive statistics: compute measures such as mean, median, mode, min, max, std, ...
b) Plotting: histograms, boxplots, scatterplots, ... to get an overview of distributions and relationships in the data.
c) Handle missing data and check for outliers.
d) Check the format and type of the data (numerical, categorical, ...).]
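
A minimal sketch of EDA steps a), c), and d) with pandas. The DataFrame, its column names, and its values are made up for illustration, and the 1.5 * IQR rule is just one common convention for flagging outliers, not something this note prescribes. Plotting (step b) is omitted here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the columns and values are made up for illustration
df = pd.DataFrame({
    "age": [25, 32, 47, np.nan, 51],
    "income": [40_000, 52_000, 88_000, 61_000, 1_000_000],  # last value is an outlier
    "segment": ["a", "b", "a", "b", "a"],
})

# a) Descriptive statistics for the numeric columns
print(df.describe())

# c) Count missing values per column
missing_counts = df.isna().sum()
print(missing_counts)

# c) Flag outliers with the 1.5 * IQR rule (an assumed convention)
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]
print(outliers)

# d) Check data types (numeric vs. categorical)
print(df.dtypes)
```

On real data, the same calls would typically be followed by histograms or boxplots to confirm what the summary numbers suggest.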

+ scikit-learn syntax
a) scikit-learn follows the same syntax for all supervised learning models, which makes the workflow repeatable
b) code:

In [None]:
from sklearn.module import Model 



#Instantiate a machine learning model in Python, typically using the scikit-learn library
model = Model()

#A model is fit to the data, where it learns patterns about the features and the target variable 
#We fit the model to X, an array of features, and y, an array of our target variable values
model.fit(X, y) 

#X_new is the new data you want predictions for; it must have the same format as the training data (the same features)
predictions = model.predict(X_new) 
print(predictions)


**Classifying labels of unseen data**
1. Build a model
2. Model learns from the labeled data we pass to it
3. Pass unlabeled data to the model as input
4. Model predicts the labels of the unseen data

* Labeled data = training data

* k-Nearest Neighbors - popular for classification problems:
+ Predict the label of a data point by
  a) Looking at the **k** closest labeled data points
  b) Taking a majority vote 
![image.png](attachment:ed2bf8d4-5363-4c64-b2f1-976c41df7c81.png)

+ KNN Intuition:
* The KNN intuition is a simple way to understand how the K-Nearest Neighbors (KNN) algorithm works in machine learning:
* When you have a new data point and want to predict its label, KNN finds the K training points closest to it (based on a distance, usually Euclidean distance).
* For classification, KNN "votes": it picks the most common label among those K neighbors and assigns it to the new point.
* For regression, KNN averages the values of the K nearest points to predict the new value.
* The simplest intuition:
"Similar things tend to be close to each other": a new data point is classified (or its value predicted) based on the "neighbors" nearest to it in feature space.
* Example: If you move to a new neighborhood and want to guess the most common occupation there, you just ask a few of the neighbors around you; if most of them are teachers, you guess "teacher".
* KNN creates a decision boundary to predict if customers will churn 
![image.png](attachment:3aaa2eee-dbfb-40e7-bb80-c54a7950b0aa.png)
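
To make the majority-vote idea concrete, here is a small from-scratch sketch of KNN classification in NumPy. It is not how scikit-learn implements `KNeighborsClassifier`; the function name `knn_predict` and the toy data are assumptions made for illustration.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Toy data (made up): two well-separated groups
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                    [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_low = knn_predict(X_train, y_train, np.array([1.2, 1.4]), k=3)
pred_high = knn_predict(X_train, y_train, np.array([8.7, 8.6]), k=3)
print(pred_low, pred_high)
```

Each query point takes the label of the group it sits in, because all three of its nearest neighbors belong to that group.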

In [None]:
#Using scikit-learn to fit a classifier
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
#.values converts the pandas data to NumPy arrays for use with machine learning models
#scikit-learn also accepts pandas objects, but fitting on arrays avoids feature-name warnings when predicting on plain arrays such as X_new below
X = churn_df[["total_day_charge","total_eve_charge"]].values
y = churn_df["churn"].values
print(X.shape, y.shape) #(3333, 2), (3333,)

#Create a KNN classifier with k = 15
#The model will "consult" the 15 nearest neighbors to decide the label of each new data point
knn = KNeighborsClassifier(n_neighbors = 15) 
knn.fit(X, y)

#Predicting on unlabeled data 
X_new = np.array([[56.8, 17.5],
                  [24.4, 24.1],
                  [50.1, 10.9]])

print(X_new.shape) #(3,2)
predictions = knn.predict(X_new)
print('Predictions: {}'.format(predictions)) #Predictions: [1 0 0]

In [None]:
# Import KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier 

y = churn_df["churn"].values
X = churn_df[["account_length", "customer_service_calls"]].values

# Create a KNN classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors = 6)

# Fit the classifier to the data
knn.fit(X, y)

# Predict the labels for the X_new
y_pred = knn.predict(X_new)

# Print the predictions
print("Predictions: {}".format(y_pred)) 

**Measuring model performance** 
+ How do we measure accuracy? (accuracy = number of correct predictions / total number of predictions)
+ We could compute accuracy on the data used to fit the classifier.
  → However, accuracy on the training data is NOT indicative of the model's ability to generalize to new, unseen data.
+ Instead, compute accuracy on data held out from training (a test set): this is indicative of how well the model can generalize to unseen data, which is what we are interested in!
  → The real goal is a model that performs well on new real-world data, not one that merely memorizes the old data.

**Computing accuracy**
![image.png](attachment:21ea7c1b-68b7-4666-a3f3-b99e826b8274.png)
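
As a quick check of the definition (correct predictions divided by total predictions), the sketch below computes accuracy by hand and with scikit-learn's `accuracy_score`. The label arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true labels and predictions for illustration
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 0, 0])

# accuracy = correct predictions / total predictions
manual_accuracy = np.mean(y_true == y_pred)
sklearn_accuracy = accuracy_score(y_true, y_pred)
print(manual_accuracy, sklearn_accuracy)
```

Six of the eight predictions match, so both values are 0.75. For classifiers, `model.score(X, y)` returns this same quantity.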

In [None]:
#Train/Test Split

from sklearn.model_selection import train_test_split
#Pass in our features and target
#It is common to use 20-30% of the data as the test set

#random_state: makes the split identical on every run, so results are reproducible and easy to compare
#The value 21 has no special meaning; it is just a seed that fixes the random number generator
#Changing 21 to another number starts the random number generator from a different point
#=> each random_state value produces a different random split

#stratify = y makes the train/test split preserve the proportion of each label (churn = 0, churn = 1) in both sets
#Stratifying avoids skewed label distributions, especially when the original data is imbalanced
#It is used in classification to make the model evaluation more reliable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 21, stratify = y)  

knn = KNeighborsClassifier(n_neighbors = 6)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test)) 
#0.8800599700149925 
#The accuracy of the model is 88%, which is low given that our labels have a 9-to-1 ratio: always predicting the majority class would already score about 90%.

**Model complexity**
+ Larger k = less complex model = can cause underfitting
+ Smaller k = more complex model = can lead to overfitting

![image.png](attachment:34de9b69-d59d-403a-9c1d-7032286d0c30.png)


In [None]:
#Model complexity and over/underfitting 
import numpy as np
import matplotlib.pyplot as plt

train_accuracies = {}
test_accuracies = {}
neighbors = np.arange(1,26)
#This creates a NumPy array of the integers from 1 to 25
#To check KNN performance for many values of k, we need a list of k values to loop over
#This lets us see which k gives the best results (avoiding overfitting/underfitting)

for neighbor in neighbors: 
    knn = KNeighborsClassifier(n_neighbors = neighbor)
    knn.fit(X_train, y_train)
    train_accuracies[neighbor] = knn.score(X_train, y_train)
    test_accuracies[neighbor] = knn.score(X_test, y_test)

#Plotting our results: 
#list(...) turns the dict views into sequences that matplotlib can plot reliably
plt.figure(figsize=(8,6))
plt.title("KNN: Varying Number of Neighbors")
plt.plot(neighbors, list(train_accuracies.values()), label = "Training Accuracy")
plt.plot(neighbors, list(test_accuracies.values()), label = "Testing Accuracy")
plt.legend()
plt.xlabel("Number of Neighbors")
plt.ylabel("Accuracy")
plt.show()

**Model Complexity Curve**

+ When the two curves do not have the same shape, this helps you spot the risk of overfitting or underfitting and adjust k accordingly.
+ The best model is where the two curves are close together and test accuracy is high.
+ If the gap is very large → k may not be optimal yet, or the data may need more processing (e.g., balancing the labels, cleaning, ...).
+ It is completely normal in practice for the training and test curves to differ.
+ This reflects that the model is being scored on two different datasets (one it has seen, one it has not).
+ If the two curves are identical: suspect a problem, such as test data that duplicates the training data (which breaks the rules of model evaluation).
+ If the two curves differ: the evaluation is realistic, and the boundary between "memorizing" (overfitting) and generalization becomes clearly visible.
![image.png](attachment:e2ace3f5-30e9-4628-9bc2-90b37eb62bb2.png)

In [None]:
# Import the module
from sklearn.model_selection import train_test_split

X = churn_df.drop("churn", axis=1).values
y = churn_df["churn"].values

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state= 42, stratify= y)
knn = KNeighborsClassifier(n_neighbors=5)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Print the accuracy
print(knn.score(X_test, y_test))

In [None]:
# Create neighbors
neighbors = np.arange(1, 13)
train_accuracies = {}
test_accuracies = {}

for neighbor in neighbors:
  
	# Set up a KNN Classifier
	knn = KNeighborsClassifier(n_neighbors=neighbor)
  
	# Fit the model
	knn.fit(X_train, y_train)
  
	# Compute accuracy
	train_accuracies[neighbor] = knn.score(X_train, y_train)
	test_accuracies[neighbor] = knn.score(X_test, y_test)
print(neighbors, '\n', train_accuracies, '\n', test_accuracies)

In [None]:
# Add a title
plt.title("KNN: Varying Number of Neighbors")

# Plot training accuracies (list() turns the dict view into a plottable sequence)
plt.plot(neighbors, list(train_accuracies.values()), label="Training Accuracy")

# Plot test accuracies
plt.plot(neighbors, list(test_accuracies.values()), label="Testing Accuracy")

plt.legend()
plt.xlabel("Number of Neighbors")
plt.ylabel("Accuracy")

# Display the plot
plt.show()

![image.png](attachment:e51531a0-f929-495d-a434-2496b4e2af38.png)

In [None]:
#Introduction to regression 
#Predicting blood glucose levels

import pandas as pd 
diabetes_df = pd.read_csv("diabetes.csv")
print(diabetes_df.head())

![image.png](attachment:fad31771-7b54-4751-a0c9-63083f5247a6.png)

In [None]:
#Creating feature & target arrays
#X: the input features used to make predictions
#y: the true values to be predicted (the target)
X = diabetes_df.drop("glucose",axis=1).values
y = diabetes_df["glucose"].values
print(type(X), type(y)) #<class 'numpy.ndarray'><class 'numpy.ndarray'>

#Making predictions from a single feature 
X_bmi = X[:,3]
print(y.shape, X_bmi.shape) 
#(752,) (752,) #they are both one-dimensional arrays
#This is fine for y
#But our features must be formatted as a 2-dimensional array to be accepted by scikit-learn

X_bmi = X_bmi.reshape(-1,1) 
#X_bmi.reshape(-1, 1) turns the 1-dimensional array into the 2-dimensional array scikit-learn requires
#-1 lets NumPy infer the number of rows (752 here), and 1 is the number of columns
print(X_bmi.shape) #(752,1)

#Plotting glucose vs. body mass index 
import matplotlib.pyplot as plt 
plt.scatter(X_bmi, y)
plt.ylabel("Blood Glucose (mg/dl)")
plt.xlabel("Body Mass Index")
plt.show()

![image.png](attachment:0bff945d-f418-442c-8146-d31ad065c334.png)

In [None]:
#Fitting a regression model 
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X_bmi, y)
predictions = reg.predict(X_bmi)
plt.scatter(X_bmi, y)
plt.plot(X_bmi, predictions)
plt.ylabel("Blood Glucose (mg/dl)")
plt.xlabel("Body Mass Index")
plt.show()

![image.png](attachment:dafadc33-58f4-40dc-b9c5-8e7b588a33cf.png)

In [None]:
import numpy as np

# Create X from the radio column's values
X = sales_df["radio"].values

# Create y from the sales column's values
y = sales_df["sales"].values

# Reshape X
X = X.reshape(-1,1)

# Check the shape of the features and targets
print(X.shape, y.shape) #Output (4546,1) (4546,)

In [None]:
# Import LinearRegression
from sklearn.linear_model import LinearRegression

# Create the model
reg = LinearRegression()

# Fit the model to the data
reg.fit(X, y)

# Make predictions
predictions = reg.predict(X)

print(predictions[:5])

# Import matplotlib.pyplot
import matplotlib.pyplot as plt

# Create scatter plot
plt.scatter(X, y, color="blue")

# Create line plot
plt.plot(X, predictions, color="red")
plt.xlabel("Radio Expenditure ($)")
plt.ylabel("Sales ($)")

# Display the plot
plt.show()

![image.png](attachment:89dcb844-9ebc-4c9b-8d79-c06f41cdcb7a.png)

**Regression mechanics:**

y = ax + b 

* Simple linear regression uses one feature
* y = target
* x = single feature
* a, b = parameters/coefficients of the model: slope and intercept

**How do we choose a and b?** 
* Define an error function of any given line
* Choose the line that minimizes the error function

Error function = loss function = cost function
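
The sketch below illustrates choosing a and b by minimizing an error function. It assumes the error function is the residual sum of squares (ordinary least squares), which is what scikit-learn's `LinearRegression` minimizes; the data points and the helper name `rss` are made up for illustration.

```python
import numpy as np

# Made-up data lying roughly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])


def rss(a, b):
    """Residual sum of squares of the line y = a*x + b."""
    return np.sum((y - (a * x + b)) ** 2)


# Closed-form least-squares solution for a single feature
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()
print(a, b, rss(a, b))

# Nudging the slope or the intercept away from the solution increases the error
print(rss(a + 0.5, b), rss(a, b - 0.5))
```

The fitted slope and intercept come out close to 2 and 1, and moving either parameter away from the fitted value gives a larger error.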

**Linear regression in higher dimensions**
y = a1x1 + a2x2 + b
* To fit a linear regression model here:
  + Need to specify 3 variables: a1, a2, b
* In higher dimensions:
  + Known as multiple regression
  + Must specify coefficients for each feature and the variable b:
    y = a1x1 + a2x2 + a3x3 + ... + anxn + b
* scikit-learn works exactly the same way:
  + Pass 2 arrays: features and target
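
A small sketch of multiple regression with two features. The data is synthetic and noise-free (generated as y = 2*x1 - 3*x2 + 5, values chosen for illustration), so the fitted model should recover those coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic features and a noise-free target: y = 2*x1 - 3*x2 + 5
X = rng.normal(size=(100, 2))
y = 2 * X[:, 0] - 3 * X[:, 1] + 5

reg = LinearRegression()
reg.fit(X, y)

# coef_ holds a1 and a2; intercept_ holds b
print(reg.coef_, reg.intercept_)
```

The `fit` call is the same as for a single feature; only the number of columns in `X` changes.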

In [None]:
#Linear regression using all features

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
reg_all = LinearRegression()
reg_all.fit(X_train, y_train)
y_pred = reg_all.predict(X_test)

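**R-squared**
* R^2: quantifies the proportion of variance in the target values explained by the features
* Values typically range from 0 to 1 (R^2 can be negative on held-out data if the model does worse than always predicting the mean)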
* High R^2:
  ![image.png](attachment:58ac8447-4d00-4251-86f7-d05954412063.png)
  
  ![image.png](attachment:5e95e4cd-fe0a-47c2-a440-68adfadf56c2.png)
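
For reference, both metrics can be computed by hand. The sketch below uses the standard definitions R^2 = 1 - SS_res / SS_tot and RMSE = sqrt(mean squared error) on made-up numbers, and checks R^2 against scikit-learn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # variation left unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation in the target
r_squared = 1 - ss_res / ss_tot

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # in the same units as the target
print(r_squared, rmse)
print(r2_score(y_true, y_pred))
```

Here SS_res = 1.5 and SS_tot = 20, so R^2 = 0.925. RMSE is often easier to interpret because it is expressed in the units of the target variable.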

In [None]:
#R-squared in scikit-learn
reg_all.score(X_test, y_test) #0.356302876407827

#RMSE in scikit-learn
from sklearn.metrics import root_mean_squared_error
root_mean_squared_error(y_test, y_pred) #24.0281094269

In [None]:
# Create X and y arrays
X = sales_df.drop("sales", axis=1).values
y = sales_df["sales"].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Instantiate the model
reg = LinearRegression()

# Fit the model to the data
reg.fit(X_train, y_train)

# Make predictions
y_pred = reg.predict(X_test)
print("Predictions: {}, Actual Values: {}".format(y_pred[:2], y_test[:2]))

# Import root_mean_squared_error
from sklearn.metrics import root_mean_squared_error

# Compute R-squared
r_squared = reg.score(X_test, y_test)

# Compute RMSE
rmse = root_mean_squared_error(y_test, y_pred)

# Print the metrics
print("R^2: {}".format(r_squared))
print("RMSE: {}".format(rmse))

**Cross-validation motivation**
+ The main motivation for cross-validation is to evaluate a machine learning model's performance reliably and stably, and above all to check the model's ability to generalize to new data it has never seen.
+ The core problem (why not just do a single train/test split?)
Imagine you are training a model.

The worst approach: You train and test the model on the same dataset. The model can simply memorize (overfit) that data. It can score close to 100% when tested, yet fail badly on real-world (new) data.

A better approach: You split the data into two parts: a large part for training (e.g., 80% of the data) and a small part for testing (e.g., 20% of the data). This is called a train-test split.

What is the problem with the second approach? The 80% or 90% accuracy you get may just be down to good (or bad) luck.

What if the 20% test set happened to contain only "easy" cases? The model scores 95% accuracy, leading you to overestimate it (too optimistic).

What if that 20% contained only "hard" cases? The model scores just 70% accuracy, leading you to underestimate it (too pessimistic).

The result of a single train/test split is unstable and depends heavily on the randomness of the split.

+ Model performance is dependent on the way we split up the data
+ Not representative of the model's ability to generalize to unseen data

![image.png](attachment:751d9b27-4acf-448a-8aad-57267c926c11.png)

**Cross-Validation**
Cross-validation addresses this luck-of-the-draw problem by training and testing multiple times in a systematic way. The most common method is K-Fold Cross-Validation.

+ How it works (example with K=5):
Split: Divide your entire training dataset into 5 equal parts (called "folds").
Iterate:
+ Round 1: Train on 4 folds (Folds 1-4) and test on the remaining one (Fold 5). Record the result (e.g., 90% accuracy).
+ Round 2: Train on 4 folds (Folds 1, 2, 3, 5) and test on Fold 4. Record the result (e.g., 92% accuracy).
+ Round 3: Train on 4 folds (Folds 1, 2, 4, 5) and test on Fold 3. Record the result (e.g., 89% accuracy).
+ ... (Continue until every fold has been used as the test set exactly once.)
+ Average: You end up with 5 performance results. Take their mean (e.g., (90 + 92 + 89 + 91 + 90) / 5 = 90.4%).
=> This 90.4% figure is much more reliable than the result of a single train/test split.

+ 5 folds = 5-fold CV
+ 10 folds = 10-fold CV
+ k folds = k-fold CV
+ More folds = More computationally expensive
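
The loop that `cross_val_score` runs can be written out by hand. The sketch below does 5-fold cross-validation manually with `KFold.split`, on synthetic regression data whose coefficients and noise level are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Synthetic data (made up): y depends linearly on 3 features plus a little noise
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for train_idx, test_idx in kf.split(X):
    reg = LinearRegression()
    # Train on 4 folds, evaluate (R-squared) on the held-out fold
    reg.fit(X[train_idx], y[train_idx])
    fold_scores.append(reg.score(X[test_idx], y[test_idx]))

print(fold_scores)
print(np.mean(fold_scores))
```

Every sample lands in the held-out fold exactly once, which is why the averaged score depends less on one lucky or unlucky split.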


In [None]:
#Cross-validation using scikit-learn

from sklearn.model_selection import cross_val_score, KFold
kf = KFold(n_splits = 6, shuffle = True, random_state = 42)
reg = LinearRegression()
cv_results = cross_val_score(reg, X, y, cv = kf)

#Evaluating cross-validation performance 
print(cv_results)


![image.png](attachment:2b9a0cd1-f695-4b5c-8ea5-f8ec173e102b.png)

In [None]:
print(np.mean(cv_results), np.std(cv_results))

![image.png](attachment:8464de28-df50-44a1-b5b0-c3ea685aae10.png)

In [None]:
print(np.quantile(cv_results, [0.025, 0.975]))

![image.png](attachment:9311b6ca-abbb-40a2-a369-5871086f3970.png)

In [None]:
# Import the necessary modules
from sklearn.model_selection import cross_val_score, KFold

# Create a KFold object
kf = KFold(n_splits=6, shuffle=True, random_state=5)

reg = LinearRegression()

# Compute 6-fold cross-validation scores
cv_scores = cross_val_score(reg, X, y, cv=kf)

# Print scores
print(cv_scores)

#[0.74451678 0.77241887 0.76842114 0.7410406  0.75170022 0.74406484]

# Print the mean
print(np.mean(cv_scores))

# Print the standard deviation
print(np.std(cv_scores))

# Print the 95% confidence interval
print(np.quantile(cv_scores, [0.025, 0.975]))

**Regularized regression**

Why regularize? 
* Recall: Linear regression minimizes a loss function
* It chooses a coefficient, a, for each feature variable, plus b
* Large coefficients can lead to overfitting
* Regularization: Penalize large coefficients 
* Regularized regression is a family of machine learning techniques used to address overfitting in regression models (such as linear regression).

The core idea is to add a "penalty" to the model's loss function. The penalty is computed from the magnitude of the model's coefficients.

The goal is a simpler, more stable model that generalizes better to new, unseen data.

![image.png](attachment:ad169abf-afb6-4447-9022-ea44bb4a2502.png)

Ridge regression is a variant of linear regression designed to address overfitting and to cope well when the input features are multicollinear.

It belongs to the family of regularized regression models.

**Summary:** 
* What is Ridge regression? Linear regression + an L2 penalty.
* Why use it? To combat overfitting and handle multicollinearity.
* What does it do? It shrinks the model's coefficients toward 0, making the model less complex and more stable.
* How does it differ from Lasso? Ridge shrinks all coefficients, whereas Lasso can eliminate some coefficients (set them to 0).
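
A small sketch of the shrinking effect: as alpha grows, the size (L2 norm) of the Ridge coefficient vector gets smaller. The data is synthetic, and the true coefficients and alpha values are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Synthetic data (made up) with known coefficients 4, -3, 2
X = rng.normal(size=(50, 3))
y = X @ np.array([4.0, -3.0, 2.0]) + rng.normal(scale=0.5, size=50)

coef_norms = []
for alpha in [0.1, 10.0, 1000.0]:
    ridge = Ridge(alpha=alpha).fit(X, y)
    # L2 norm of the coefficient vector: smaller means more shrinkage
    coef_norms.append(np.linalg.norm(ridge.coef_))
    print(alpha, ridge.coef_)

print(coef_norms)
```

The coefficients shrink toward 0 as alpha increases, but they stay nonzero, which matches the description of the L2 penalty.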

In [None]:
#Ridge regression in scikit-learn: 
#Import Ridge from scikit-learn
from sklearn.linear_model import Ridge 
scores = []
#Train 5 different Ridge models, each with a different alpha value
#alpha is the hyperparameter (lambda) that controls the strength of the L2 penalty
for alpha in [0.1, 1.0, 10.0, 100.0, 1000.0]:
    ridge = Ridge(alpha=alpha)
    #Train the newly created model
    ridge.fit(X_train, y_train)
    #Use the trained model to predict on the test set (X_test); the predictions are stored in y_pred
    y_pred = ridge.predict(X_test)  
    scores.append(ridge.score(X_test, y_test))
print(scores)


![image.png](attachment:7f989453-aa81-4d53-97a2-fd8b50e75973.png)

**By looking at the list of 5 scores, you can tell which alpha value gives the best performance on the test data**
+ **alpha too small** (e.g., 0.1): The model behaves almost like ordinary linear regression and may still **overfit**.
+ **alpha too large** (e.g., 1000.0): The model is penalized too heavily and the coefficients are pushed too far toward 0, leading to **underfitting (a model that is too simple).**
+ **Optimal alpha** (e.g., 1.0 in the example results): The **value that balances fitting the data against simplicity (regularization)**, giving the best generalization on the test set.

In [None]:
#Lasso regression in scikit-learn
from sklearn.linear_model import Lasso 
scores = []
for alpha in [0.01, 1.0, 10.0, 20.0, 50.0]:
    lasso = Lasso(alpha = alpha)
    lasso.fit(X_train, y_train)
    lasso_pred = lasso.predict(X_test)
    scores.append(lasso.score(X_test, y_test))
print(scores)

![image.png](attachment:d698877a-7049-45ef-b0ef-d6b232b55ac5.png)

**The core difference between Lasso and Ridge regression lies in how they treat the model's coefficients to combat overfitting.**
+ **Lasso (L1):** Can perform feature selection by forcing some coefficients to exactly 0.
+ **Ridge (L2):** Only shrinks coefficients toward 0; in general it does not eliminate them entirely.
+ ![image.png](attachment:32c4ecca-4959-4757-a0ea-0c4a64b496b4.png)
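
The difference can be seen on synthetic data where only some features matter. In the sketch below, only the first 2 of 6 features influence the target; the data, coefficients, and alpha value are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)

# Synthetic data (made up): only the first 2 of 6 features affect the target
X = rng.normal(size=(200, 6))
y = 5.0 * X[:, 0] - 4.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

# Count the coefficients that are exactly zero in each model
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print(lasso.coef_, n_zero_lasso)
print(ridge.coef_, n_zero_ridge)
```

Lasso sets the coefficients of the irrelevant features to exactly 0, while Ridge leaves them small but nonzero.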

In [None]:
#Lasso for feature selection in scikit-learn
#import Lasso from sklearn libraries
from sklearn.linear_model import Lasso
X = diabetes_df.drop("glucose", axis = 1).values 
y = diabetes_df["glucose"].values
names = diabetes_df.drop("glucose", axis=1).columns
#Instantiate a Lasso model
lasso = Lasso(alpha=0.1) #alpha is Lasso's most important parameter; it controls the strength of the L1 penalty
#The larger alpha is, the stronger the penalty, and the more coefficients the model forces to 0
#alpha = 0.1 is a relatively small value, so the penalty is mild
lasso_coef = lasso.fit(X, y).coef_
plt.bar(names, lasso_coef)
plt.xticks(rotation = 45)
plt.show()

![image.png](attachment:195886ca-77b5-431e-898c-157be4343d68.png)

In [None]:
# Import Ridge
from sklearn.linear_model import Ridge
alphas = [0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]
ridge_scores = []
for alpha in alphas:
  
  # Create a Ridge regression model
  ridge = Ridge(alpha = alpha)
  
  # Fit the data
  ridge.fit(X_train, y_train)
  
  # Obtain R-squared
  score = ridge.score(X_test, y_test)
  ridge_scores.append(score)
print(ridge_scores)

In [None]:
# Import Lasso
from sklearn.linear_model import Lasso

# Instantiate a lasso regression model
lasso = Lasso(alpha=0.1)

# Fit the model to the data
lasso.fit(X, y)

# Compute and print the coefficients
lasso_coef = lasso.coef_
print(lasso_coef) # [ 3.56256962 -0.00397037  0.00496392]
plt.bar(sales_columns, lasso_coef)
plt.xticks(rotation=45)
plt.show()

#Lasso (L1 regularization) penalizes coefficients with large values.
#The aim is to make the model simpler and avoid overfitting.
#The most important consequence is that Lasso can push some coefficients to exactly 0.

![image.png](attachment:6b0470f6-c61b-414b-8d31-44c4e06ff490.png)

In [None]:
# Import the function for generating classification datasets
from sklearn.datasets import make_classification

# Generate 5000 samples with 4 features, 1 cluster per class, 3 classes, and class separation of 2
x, y = make_classification(n_samples=5000, n_classes = 3, n_features = 4, n_clusters_per_class=1,class_sep=2)

# Inspect the generated data shape
print(x.shape)
print(y.shape)
print(x)

# Inspect the resulting data points in a 2 dimensional scatter plot
plot_data_points(x, y)

In [None]:
# Import the function from the datasets module for generating clustering datasets
from sklearn.datasets import make_blobs

# Generate a dataset with 15000 rows, 2 features, 2 centers, and a cluster std of 3
x, labels = make_blobs(n_samples=15000, n_features=2, centers=2, cluster_std=3)

# Print the shape of the resulting generated data
print(x.shape)

# See the resulting data points in a 2 dimensional scatter plot
plot_data_points(x, labels)

In [None]:
# Generate a name according to the gender, that will be unique in the dataset
ratings['name'] = [fake_data.unique.name_female() if x == "Female" 
                   else fake_data.unique.name_male() 
                   for x in ratings['gender']]   

# Generate random company domain emails with username as their name
ratings['email'] = [x.replace(" ", "") +"@"+ fake_data.domain_name() 
                    for x in ratings['name']]                                                                     

# Generate dates between current date and 2 years ago
ratings['date'] = [fake_data.date_between(start_date="-2y", end_date="now")
                    for x in range(len(ratings))]

# Inspect the DataFrame
print(ratings.head())

In [None]:
# Obtain or specify the probabilities
p = (0.46, 0.26, 0.16, 0.1, 0.02)

# Generate 5 random cities 
cities = [fake_data.city() for x in range(5)]

# Sample 300 rows from the generated cities following the distribution
df['City'] = np.random.choice(cities, size=300, p=p)

# See the resulting dataset
print(df.head())

**Why do we need to use stratify=y?**
If you split the data purely at random, the class proportions in the training and test sets can differ from those in the full dataset. In an extreme case, one of the sets might contain few or no examples of a class. This is particularly dangerous when your dataset is imbalanced.

Real-world example: Suppose you have 100 emails, consisting of 10 Spam emails (10%) and 90 Ham emails (90%).

WITHOUT stratify: The train_test_split function might accidentally put all 10 Spam emails into the test set. As a result, your training set would have no Spam emails to learn from.

WITH stratify=y: The function ensures that the training set and the test set each keep the original class proportions, here 10% Spam in both.
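
The email example can be reproduced with a short sketch. The labels below follow the 10 Spam / 90 Ham setup from the text; the placeholder feature and the seeds are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up emails: 10 spam (label 1) and 90 ham (label 0)
y = np.array([1] * 10 + [0] * 90)
X = np.arange(100).reshape(-1, 1)  # placeholder feature

# Stratified split: class proportions are preserved in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(y_train.mean(), y_test.mean())

# Purely random splits: the spam share of the test set varies with the seed
random_shares = []
for seed in range(5):
    _, _, _, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    random_shares.append(y_te.mean())
print(random_shares)
```

With stratification, 8 of the 80 training emails and 2 of the 20 test emails are spam, so both sets keep the 10% share.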

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=123)
print("Original proportion: ", np.sum(y)/len(y)) #0.37
print("Training proportion: ", np.sum(y_train)/len(y_train)) #0.375

A model shouldn't be fit on both the training and test data. 

This refers to the fundamental principle of preventing Data Leakage.

The Core Concept: Evaluation vs. Learning
+ The Training Set is for learning. The model "fits" (adjusts its parameters) to this data to find patterns.
+ The Test Set is for evaluation. It acts as a "final exam" to see how the model performs on new, unseen data.

If you fit the model on the test data, you are essentially giving the model the answers to the exam before it takes it.

![image.png](attachment:59ef6bc2-614c-4793-aabe-83fa9af653ae.png)
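
A sketch of why scoring on data the model has already seen is misleading. The labels here are pure random noise (a made-up setup), so there is nothing real to learn, yet a 1-nearest-neighbor classifier still looks perfect on its own training data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Made-up data: the labels are random, so there is no real pattern to learn
X = rng.normal(size=(200, 2))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# With k=1, each training point is its own nearest neighbor
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

train_accuracy = knn.score(X_train, y_train)  # the model has already seen these answers
test_accuracy = knn.score(X_test, y_test)     # honest estimate on unseen data
print(train_accuracy, test_accuracy)
```

Only the test score reflects the model's true ability here, which is no better than guessing.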

Increasing the value of k increases the risk of underfitting. 

In [None]:
#Create and fit a k-Nearest Neighbors classification model with 3 neighbors to the array of 
#feature values, X, and array of target values, y

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X,y)

![image.png](attachment:c3a04e27-d0ef-41b5-9159-91988236427d.png)

In [None]:
#Calculate the test set accuracy of the fitted knn model. Training and test features, X_train and X_test, 
#and training and test labels, y_train and y_test, are available

knn = KNeighborsClassifier(n_neighbors = 7)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test)) #0.7597402597402597

![image.png](attachment:b780f658-5db4-4a52-949d-f6e238a27e3a.png)

![image.png](attachment:e111eb7d-6c67-4938-87f4-b0367f18dfff.png)

![image.png](attachment:039799d7-0aae-4782-8864-938c51c8a222.png)

In [None]:
#Create and fit a linear regression model to the training data, X_train, and y_train, and predict on the test features, X_test

linreg = LinearRegression()
linreg.fit(X_train, y_train)
y_pred = linreg.predict(X_test)
print(y_pred[:3]) #[75.55656365    61.47270116    56.47836737]

In [None]:
reg = LinearRegression()
kf = KFold(n_splits=5, shuffle=True, random_state=123)

print(cross_val_score(reg, X, y, cv=kf)) #[0.77790279  0.81729963  0.82794155  0.8235076  0.83357379]

In [None]:
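#Calculate the test set Root-Mean Squared Error (RMSE) between the model predictions, y_pred, and test labels, y_test 
from sklearn.metrics import root_mean_squared_error

print(root_mean_squared_error(y_test, y_pred)) #4.211915229490384

#root_mean_squared_error returns the square root of the MSE, i.e. the RMSE
#Older scikit-learn versions used mean_squared_error(y_test, y_pred, squared = False), which was deprecated in 1.4 and removed in 1.6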

In [None]:
#The scores from 3-fold cross-validation are stored in the cv_results array. Calculate the mean, standard deviation, 
#and 95% confidence interval of cv_results

print("Mean: ", np.mean(cv_results)) #Mean: 0.81861304333334
print("Std. Dev.: ", np.std(cv_results)) #Std. Dev: 0.030340492006681246
print("95% CI: ", np.quantile(cv_results, [0.025, 0.975]))  #[0.78306584   0.85366204]

In [None]:
#Create and fit a lasso regression model to the training data, X_train and y_train, and calculate the R-squared on 
#the test arrays, X_test and y_test 

reg = Lasso(alpha=0.1) #the larger alpha is, the more strongly the model is regularized
reg.fit(X_train, y_train) 
print(reg.score(X_test, y_test)) #0.7780935514511569


### What is Lasso Regression?

**Lasso Regression** (Least Absolute Shrinkage and Selection Operator) is a type of linear regression that uses **regularization** to improve model accuracy and interpretability. It is particularly powerful when dealing with datasets that have many features, some of which may be redundant or irrelevant.

---

### 1. How it works: The Penalty Term

In standard Linear Regression, the model minimizes the Mean Squared Error (MSE). Lasso adds a "penalty" equal to the sum of the **absolute values** of the coefficients, scaled by $\alpha$.

The Loss Function formula:

$$\text{Loss} = \text{MSE} + \alpha \sum_{j=1}^{p} |\beta_j|$$

* **$\alpha$ (Alpha):** This is the tuning parameter that decides how much you want to penalize the model.
* If $\alpha = 0$, it behaves like OLS (Ordinary Least Squares).
* As $\alpha$ increases, more coefficients are pushed toward zero.
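
The penalized loss can be evaluated by hand. The following is a minimal sketch assuming the loss is MSE plus alpha times the sum of absolute coefficient values; `lasso_loss` and the toy data are hypothetical. scikit-learn's `Lasso` scales the squared-error term slightly differently, so the numbers here will not match its internal objective.

```python
def lasso_loss(X, y, coefs, intercept, alpha):
    """MSE plus alpha * sum(|coef|); the intercept is not penalized."""
    preds = [intercept + sum(c * x for c, x in zip(coefs, row)) for row in X]
    mse = sum((t - p) ** 2 for t, p in zip(y, preds)) / len(y)
    penalty = alpha * sum(abs(c) for c in coefs)
    return mse + penalty

# Hypothetical data and coefficients
X_demo = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y_demo = [2.0, 1.0, 3.0]
coefs_demo = [2.0, -1.0]

loss_ols = lasso_loss(X_demo, y_demo, coefs_demo, 0.0, alpha=0.0)     # plain MSE
loss_lasso = lasso_loss(X_demo, y_demo, coefs_demo, 0.0, alpha=0.1)   # MSE + 0.1 * (2 + 1)
print(loss_ols, loss_lasso)
```

With alpha set to 0 the penalty vanishes and only the MSE remains, which is why the model then behaves like ordinary least squares.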

---

### 2. Key Feature: Automatic Feature Selection

This is the most significant advantage of Lasso. Unlike Ridge Regression (which shrinks coefficients toward zero but rarely to exactly zero), Lasso can shrink coefficients **exactly to zero**.

* **Result:** The model effectively "ignores" unimportant features.
* **Benefit:** It creates a **sparse model**, making it much easier to explain which variables actually drive the predictions.
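
One way to see why Lasso produces exact zeros is a simplified special case: if the features are uncorrelated and standardized, each coefficient can be treated on its own. Under that assumption (and with the per-coefficient objectives written in the docstrings below), the Lasso solution subtracts alpha from the magnitude of the unpenalized coefficient and clips at zero, while Ridge only rescales it. The function names and coefficient values are hypothetical.

```python
def lasso_shrink(ols_coef, alpha):
    """Soft-thresholding: minimizes 0.5*(ols_coef - b)**2 + alpha*abs(b) over b."""
    if abs(ols_coef) <= alpha:
        return 0.0                      # small coefficients are set exactly to zero
    return ols_coef - alpha if ols_coef > 0 else ols_coef + alpha

def ridge_shrink(ols_coef, alpha):
    """Proportional shrinkage: minimizes 0.5*(ols_coef - b)**2 + 0.5*alpha*b**2 over b."""
    return ols_coef / (1.0 + alpha)

ols_coefs = [3.0, -2.0, 0.3, -0.1]      # hypothetical unpenalized coefficients
alpha = 0.5

lasso_coefs = [lasso_shrink(c, alpha) for c in ols_coefs]
ridge_coefs = [ridge_shrink(c, alpha) for c in ols_coefs]
print("lasso:", lasso_coefs)
print("ridge:", ridge_coefs)
```

The two weak coefficients become exactly zero under the Lasso rule, whereas the Ridge rule leaves all four non-zero. Real datasets have correlated features, so this is an intuition aid rather than a description of how the solver works.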

---

### 3. Why use Lasso?

1. **Prevents Overfitting:** By penalizing large coefficients, it reduces the model's complexity and its tendency to react to noise in the training data.
2. **High-Dimensional Data:** It is excellent when you have more features than observations, or when you suspect only a few features are truly important.
3. **Model Simplicity:** It identifies the most important predictors, helping you understand the underlying data better.

---

### 4. Code Implementation 

```python
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Lasso is sensitive to the scale of features, 
# so always scale your data first.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialize the Lasso model
# Higher alpha means more features will be set to zero
lasso = Lasso(alpha=0.1)

# Fit the model to the scaled training data
lasso.fit(X_train_scaled, y_train)

# Check the coefficients
# Features with 0.0 coefficients were 'selected out' by the model
print(lasso.coef_)

# Make predictions
predictions = lasso.predict(X_test_scaled)

```

---

### 5. Important Note: Feature Scaling

Because Lasso penalizes the absolute size of coefficients, the penalty depends on the units of each feature: a feature measured on a large scale (e.g., "Income" in dollars) needs only a small coefficient, while a feature on a small scale (e.g., "Age" in years) needs a larger coefficient and is penalized more heavily. **Always use `StandardScaler`** before fitting a Lasso model to ensure all features are treated fairly.


![image.png](attachment:26bbbf5e-5050-460b-902f-6a434049f834.png)

In [None]:
X = gapminder.drop("life", axis = 1).values
y = gapminder["life"].values
print(X.shape, y.shape)

In [None]:
reg = Ridge(alpha=10.0)
reg.fit(X_train, y_train)
print(reg.score(X_test, y_test)) #0.7747947227678673

In [None]:
#Calculate the test set R-squared value of the fitted linear regression model, linreg

print(linreg.score(X_test, y_test))

The correct definition of the R-squared metric: the proportion of variance in the target variable that is explained by the features.
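
R-squared can be computed by hand as one minus the ratio of unexplained variation to total variation. The sketch below uses made-up values and a hypothetical `r_squared` helper to illustrate the definition.

```python
def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot: the share of target variance explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)              # total variation
    return 1 - ss_res / ss_tot

# Hypothetical targets and predictions
y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_pred_demo = [2.0, 5.0, 8.0, 11.0]

r2_demo = r_squared(y_true_demo, y_pred_demo)       # SS_res = 6, SS_tot = 20
r2_baseline = r_squared(y_true_demo, [6.0] * 4)     # always predicting the mean
print(r2_demo, r2_baseline)
```

A model that always predicts the mean of the target scores 0, and a model worse than that baseline gets a negative R-squared.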

**Fine-Tuning Your Model**

**How good is your model?**

**Classification metrics**
+ Measuring model performance with accuracy:
  * Fraction of correctly classified samples
  * Not always a useful metric

**Class imbalance**
+ Classification for predicting fraudulent bank transactions
  * 99% of transactions are legitimate, 1% are fraudulent
+ Could build a classifier that predicts NONE of the transactions are fraudulent
  * 99% accurate!
  * But terrible at actually predicting fraudulent transactions!
  * Fails at its original purpose.
+ Class imbalance: Uneven frequency of classes.
+ Need a different way to assess performance

![image.png](attachment:23143dc3-27e5-49c5-b712-a8f225f795cd.png)

![image.png](attachment:aa982fb6-f759-4ca1-9c36-2a156dd65fca.png)

**When is Recall most important?**
+ Recall matters most when a miss (False Negative) has more serious consequences than a false alarm (False Positive).
+ Medical diagnosis (cancer): Missing a patient who has the disease (FN) is far more dangerous than wrongly flagging a healthy person (FP) for further testing.
+ Bank fraud detection: It is better to check a legitimate transaction by mistake than to let a fraudulent one slip through.

**The relationship between Recall and Precision**
+ There is usually a trade-off between Recall and Precision:
+ If you want to increase Recall (miss no one), you usually have to accept lower Precision (more false alarms).

The F1-score is a metric that combines Precision and Recall by taking their harmonic mean.

It provides a single number that represents the balance between "predicting correctly" and "not missing anything".

**Why use the F1-score instead of Accuracy?**

On imbalanced datasets, Accuracy is often misleading.

Example: You have 100 people, 99 healthy and 1 sick. If the model always predicts "healthy", Accuracy is 99%. Yet the model is useless because it never finds the sick person.

In this case the F1-score is extremely low, which reveals that the model has a problem.
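
The 100-person example can be checked numerically. This is a minimal sketch with a hypothetical `classification_metrics` helper; returning 0 when a denominator is zero is a convention chosen here, and the second model is invented purely for contrast.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # 0 when nothing is predicted positive
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)            # harmonic mean of precision and recall
    return accuracy, precision, recall, f1

# 99 healthy, 1 sick; the model always predicts "healthy"
always_healthy = classification_metrics(tp=0, fp=0, fn=1, tn=99)

# Hypothetical alternative: finds the sick person but raises 4 false alarms
flags_some = classification_metrics(tp=1, fp=4, fn=0, tn=95)

print(always_healthy)
print(flags_some)
```

The always-healthy model reaches 99% accuracy with an F1-score of 0, while the second model has lower accuracy but a non-zero F1-score because it actually finds the positive case.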

![image.png](attachment:c0a3de97-3654-4385-99be-ad756a2e0629.png)

![image.png](attachment:f9b5f8d7-27a9-4011-ad33-0f8ce31ddc6a.png)

**Deciding on a Primary Metric: Precision vs. Recall**

In machine learning, choosing the right metric depends on the cost of making a mistake.

**Analysis of the Scenarios**

**Scenario 1: Predicting Cancer (Positive Class)**
+ The Mistake: A False Negative (missing a patient who actually has cancer) is catastrophic because they won't receive life-saving treatment.
+ Primary Metric: Recall. We want to find as many positive cases as possible, even if it means some healthy people are wrongly flagged for further testing.

**Scenario 2: Predicting Malware in a Program**
+ The Mistake: A False Negative (missing a virus) leads to a compromised system.
+ Primary Metric: Recall. The priority is to catch every single threat to ensure security.

**Scenario 3: Predicting High-Value Sales Leads (Limited Capacity)**
+ The Context: The sales team has limited time and resources. They can only call a few people per day.
+ The Mistake: A False Positive (predicting someone is a high-value lead when they are not) is very costly because it wastes the sales team's limited time.
+ Primary Metric: Precision. When the model says "This is a high-value lead," we want to be absolutely sure so we don't waste resources.

**Logistic Regression for Binary Classification**

+ Logistic regression is used for classification problems
+ Logistic regression outputs probabilities:
  
+ If the probability, p > 0.5:
  * The data is labeled 1.
    
+ If the probability, p < 0.5:
  * the data is labeled 0.
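
As an illustration of how a probability turns into a label, the sketch below passes a linear score through the logistic (sigmoid) function and applies the 0.5 threshold. The coefficients, intercept, and feature values are hypothetical, and treating a probability of exactly 0.5 as class 0 is an arbitrary choice made here.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, coefs, intercept, threshold=0.5):
    """Return (probability of class 1, predicted label)."""
    z = intercept + sum(c * x for c, x in zip(coefs, features))
    p = sigmoid(z)
    return p, int(p > threshold)

# Hypothetical fitted parameters
coefs_demo, intercept_demo = [1.5, -2.0], 0.25

p_a, label_a = predict_label([2.0, 0.5], coefs_demo, intercept_demo)   # score z = 2.25
p_b, label_b = predict_label([0.0, 1.0], coefs_demo, intercept_demo)   # score z = -1.75
print(p_a, label_a)
print(p_b, label_b)
```

A positive score gives a probability above 0.5 and the label 1; a negative score gives a probability below 0.5 and the label 0.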

![image.png](attachment:ece18f47-4d35-492a-972a-a4ecc52360a5.png)



In [None]:
#Logistic regression in scikit-learn

from sklearn.linear_model import LogisticRegression 
logreg = LogisticRegression()
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=42)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

#Predicting probabilities 
y_pred_probs= logreg.predict_proba(X_test)[:,1]
print(y_pred_probs[0]) #[0.08961376]



**Probability thresholds**
+ By default, logistic regression threshold = 0.5
+ Not specific to logistic regression:
  * KNN classifiers also have thresholds
+ What happens if we vary the threshold?
  * ![image.png](attachment:add5340f-0b72-4e08-a375-66a7bab17bff.png)

In [None]:
#Plotting the ROC curve 

from sklearn.metrics import roc_curve 
fpr, tpr, thresholds = roc_curve(y_test, y_pred_probs)
plt.plot([0,1], [0,1], 'k--')
plt.plot(fpr,tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Logistic Regression ROC Curve')
plt.show()

![image.png](attachment:4c269982-099e-4e4a-a940-100d488d9165.png)

The ROC curve (Receiver Operating Characteristic curve) is a plot used to evaluate the performance of a classification model at different thresholds.

**1. What does the ROC curve represent?**
+ The ROC curve shows the relationship between two important quantities:
+ Vertical axis (Y-axis): True Positive Rate (TPR), also called Recall or Sensitivity. It tells you what percentage of the actual positive cases the model finds.
+ Horizontal axis (X-axis): False Positive Rate (FPR), also called 1 - Specificity. It tells you how often the model wrongly predicts a negative case as positive.

**2. What the shape of the curve means**
+ Diagonal line: Represents a model that predicts at random (like flipping a coin).
+ The more the curve bows toward the top-left corner, the better the model. This means you get very high Recall (TPR) with a very low false-alarm rate (FPR).

**3. AUC (Area Under the Curve)**
+ The ROC curve usually comes with the AUC metric (Area Under the Curve).
+ AUC = 1.0: Perfect model.
+ AUC = 0.5: Random predictions.
+ AUC < 0.5: Worse than random (the labels may have been flipped).
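
An equivalent way to read AUC is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by comparing every positive/negative pair; the scores and the `pairwise_auc` helper are hypothetical, and this brute-force approach is for intuition only.

```python
def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical predicted probabilities
pos_scores = [0.9, 0.8, 0.4]        # examples whose true label is 1
neg_scores = [0.7, 0.3, 0.2, 0.1]   # examples whose true label is 0

auc_demo = pairwise_auc(pos_scores, neg_scores)          # 11 of 12 pairs ranked correctly
auc_perfect = pairwise_auc([0.9, 0.8], [0.2, 0.1])
auc_flipped = pairwise_auc([0.2, 0.1], [0.9, 0.8])
print(auc_demo, auc_perfect, auc_flipped)
```

Perfect separation gives 1.0, and a model whose scores are systematically reversed gives 0.0, which matches the flipped-labels interpretation of AUC below 0.5.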

In [None]:
#ROC AUC in scikit-learn 

from sklearn.metrics import roc_auc_score
print(roc_auc_score(y_test, y_pred_probs)) #0.6700964152663693

**1. TPR (True Positive Rate)**

TPR is also called Recall or Sensitivity.
+ Meaning: It answers the question: "Of the people who are actually sick, how many did the model find?"
+ Formula: $$TPR = \frac{TP}{TP + FN}$$
+ Goal: The higher (closer to 1), the better.

**2. FPR (False Positive Rate)**

FPR is also called Fall-out.
+ Meaning: It answers the question: "Of the people who are actually healthy, how many did the model wrongly predict as sick?" (false alarms).
+ Formula: $$FPR = \frac{FP}{FP + TN}$$
+ Goal: The lower (closer to 0), the better.
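
Each point on a ROC curve is one (FPR, TPR) pair computed at a single threshold. The sketch below calculates a few such points by hand for made-up labels and scores; `roc_point` is a hypothetical helper, and predicting positive when the score is at least the threshold is the convention assumed here.

```python
def roc_point(y_true, scores, threshold):
    """Return (FPR, TPR) when predicting positive for scores >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    return fp / (fp + tn), tp / (tp + fn)

# Hypothetical true labels and predicted probabilities
y_true_demo = [1, 1, 1, 0, 0, 0, 0]
scores_demo = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]

points = {t: roc_point(y_true_demo, scores_demo, t)
          for t in (0.0, 0.35, 0.5, 0.75, 1.01)}
for threshold, (fpr, tpr) in points.items():
    print(threshold, fpr, tpr)
```

Lowering the threshold moves the point toward (1, 1), where everything is predicted positive; raising it moves the point toward (0, 0), where nothing is.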

In [None]:
#Confusion matrix in scikit-learn

from sklearn.metrics import classification_report, confusion_matrix
knn = KNeighborsClassifier(n_neighbors=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(confusion_matrix(y_test, y_pred))
#[[1106   11]
# [ 183   34]]

#Classification report in scikit-learn 
print(classification_report(y_test, y_pred))
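
The headline numbers in a classification report can be reproduced by hand from the confusion matrix printed above. scikit-learn lays the binary matrix out as `[[TN, FP], [FN, TP]]`, so the counts below are taken directly from that output; the variable names are chosen for this sketch.

```python
# Counts from the confusion matrix above: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 1106, 11, 183, 34

accuracy = (tp + tn) / (tn + fp + fn + tp)    # share of all predictions that are correct
precision = tp / (tp + fp)                    # of the predicted positives, how many are real
recall = tp / (tp + fn)                       # of the real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print("accuracy:", round(accuracy, 3))
print("precision:", round(precision, 3))
print("recall:", round(recall, 3))
print("f1:", round(f1, 3))
```

Accuracy is above 0.85, yet recall for the positive class is below 0.16: the model misses most of the positive cases, which is exactly the situation accuracy alone hides.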

In [None]:
# Import roc_auc_score
from sklearn.metrics import roc_auc_score

# Calculate roc_auc_score
print(roc_auc_score(y_test, y_pred_probs))

# Calculate the confusion matrix
print(confusion_matrix(y_test, y_pred))

# Calculate the classification report
print(classification_report(y_test, y_pred))

**Hyperparameter tuning**

+ Ridge/Lasso Regression: Choosing alpha
+ KNN: Choosing n_neighbors
+ Hyperparameters: Parameters we specify before fitting the model
  * Like alpha and n_neighbors
+ Choosing the correct hyperparameters:
  1) Try lots of different hyperparameter values
  2) Fit all of them separately
  3) See how well they perform
  4) Choose the best-performing values

=> This is called hyperparameter tuning 

+ It is essential to use cross-validation to avoid overfitting to the test set.
+ We can still split the data and perform cross-validation on the training set.
+ We withhold the test set for final evaluation. 

**Grid search cross-validation**

+ We perform k-fold cross-validation for each combination of hyperparameters
+ The mean scores for each combination are shown here.
+ We then choose hyperparameters that performed best, as shown here. 

![image.png](attachment:404bb928-97b5-4f48-94fd-5ae511657a9c.png)

In [None]:
#GridSearchCV in scikit-learn 

from sklearn.model_selection import GridSearchCV

# Define a K-Fold cross-validation strategy with 5 splits and a fixed seed for reproducibility
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Define the grid of hyperparameters to test: alpha values and different solvers
param_grid = {
    "alpha": np.linspace(0.0001, 1, 10), # Range of regularization strengths
    "solver": ["sag", "lsqr"]           # Algorithms to use in the optimization
}

# Initialize the Ridge regression model
ridge = Ridge()

# Setup GridSearchCV to find the best parameters using the defined grid and cross-validation
ridge_cv = GridSearchCV(ridge, param_grid, cv=kf)

# Train the model by searching through all possible combinations in the grid
ridge_cv.fit(X_train, y_train)

# Output the best parameter combination and the highest average cross-validation score
print(ridge_cv.best_params_, ridge_cv.best_score_)

In [None]:
# Create X and y
X = music_dummies.drop("popularity", axis=1).values
y = music_dummies["popularity"].values

# Instantiate a ridge model
ridge = Ridge(alpha=0.2)

# Perform cross-validation
scores = cross_val_score(ridge, X, y, cv=kf, scoring="neg_mean_squared_error")

# Calculate RMSE
rmse = np.sqrt(-scores)
print("Average RMSE: {}".format(np.mean(rmse)))
print("Standard Deviation of the target array: {}".format(np.std(y)))

**Missing data**

+ No value for a feature in a particular row
+ This can occur because:
  * There may have been no observation
  * The data might be corrupt
+ We need to deal with missing data. 


In [None]:
#Music dataset

print(music_df.isna().sum().sort_values())

#Dropping missing data 
music_df = music_df.dropna(subset=["genre", "popularity", "loudness", "liveness", "tempo"])
print(music_df.isna().sum().sort_values())

**Imputing values** (Xử lý dữ liệu thiếu)

+ Imputation - use subject-matter expertise to replace missing data with educated guesses
+ Common to use the mean
+ Can also use the median, or another value
+ For categorical values, we typically use the most frequent value - the mode
+ Must split our data first, to avoid data leakage. 
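
The reason for splitting first is that the fill value should be learned from the training rows only and then reused on the test rows. The sketch below shows that fit-then-transform pattern in plain Python; the column values and the helpers `fit_imputer` and `transform` are hypothetical stand-ins, not scikit-learn's `SimpleImputer`.

```python
from statistics import mean, mode

def fit_imputer(column, strategy):
    """Learn the fill value from the observed (non-missing) entries of a training column."""
    observed = [v for v in column if v is not None]
    return mean(observed) if strategy == "mean" else mode(observed)

def transform(column, fill_value):
    """Replace missing entries with a fill value learned elsewhere."""
    return [fill_value if v is None else v for v in column]

# Hypothetical columns, already split into train and test
train_tempo, test_tempo = [120.0, None, 100.0, 140.0], [None, 90.0]
train_genre, test_genre = ["Rock", "Jazz", None, "Rock"], [None, "Jazz"]

tempo_fill = fit_imputer(train_tempo, "mean")            # mean of the training rows only
genre_fill = fit_imputer(train_genre, "most_frequent")   # mode of the training rows only

test_tempo_filled = transform(test_tempo, tempo_fill)
test_genre_filled = transform(test_genre, genre_fill)
print(tempo_fill, genre_fill)
print(test_tempo_filled, test_genre_filled)
```

If the mean were computed on the full dataset before splitting, information from the test rows would leak into the training data.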

In [None]:
#Imputation with scikit-learn 

from sklearn.impute import SimpleImputer 
X_cat = music_df["genre"].values.reshape(-1,1)
X_num = music_df.drop(["genre", "popularity"], axis=1).values
y = music_df["popularity"].values
X_train_cat, X_test_cat, y_train, y_test = train_test_split(X_cat, y, test_size=0.2, random_state=12)
X_train_num, X_test_num, y_train, y_test = train_test_split(X_num, y, test_size=0.2, random_state=12)
imp_cat = SimpleImputer(strategy="most_frequent")
X_train_cat = imp_cat.fit_transform(X_train_cat)
X_test_cat = imp_cat.transform(X_test_cat)

imp_num = SimpleImputer()
X_train_num = imp_num.fit_transform(X_train_num)
X_test_num = imp_num.transform(X_test_num)
X_train = np.append(X_train_num, X_train_cat, axis=1)
X_test = np.append(X_test_num, X_test_cat, axis=1)

#Imputers are known as transformers 

In [None]:
#Imputing within a pipeline 

from sklearn.pipeline import Pipeline
music_df = music_df.dropna(subset=["genre", "popularity", "loudness", "liveness", "tempo"])
music_df["genre"] = np.where(music_df["genre"] == "Rock", 1, 0)
X = music_df.drop("genre", axis=1).values
y = music_df["genre"].values
steps = [("imputation", SimpleImputer()), ("logistic_regression", LogisticRegression())]
pipeline = Pipeline(steps)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
pipeline.fit(X_train, y_train)
pipeline.score(X_test, y_test)

#0.7593582887700535

**Limitations and an alternative approach**

+ 3-fold cross-validation, 1 hyperparameter, 10 total values = 30 fits
+ 10-fold cross-validation, 3 hyperparameters, 30 total values = 900 fits

**The Efficiency Problem (Grid Search)**

Grid Search uses an exhaustive search strategy. The total number of model "fits" is calculated as:

$$\text{Total Fits} = (\text{Combinations}) \times (\text{Folds})$$

Why it is limited:
+ Exponential Growth: Adding just one more hyperparameter or increasing the number of folds can cause the training time to explode.
+ Computationally Expensive: As seen in the example above, moving from 30 fits to 900 fits happens very easily. This can crash your system or take hours to complete on large datasets.

**The Alternative: RandomizedSearchCV**

Instead of checking every point on the grid, Randomized Search picks a fixed number of random combinations.
+ Faster: You control the budget (number of iterations).
+ Effective: It often finds an optimal (or near-optimal) solution in a fraction of the time because it doesn't waste resources on unimportant parameters.

Grid Search = Accurate but very slow (High cost). 

Random Search = Fast and usually "good enough" (High efficiency).
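
To see how the number of fits adds up, the sketch below enumerates the combinations for a grid shaped like the Ridge example in these notes (10 alpha values, 2 solvers, 5 folds) and compares it with a randomized search budget of 2 iterations. The alpha values are rounded stand-ins, and the counts exclude the final refit on the best combination that the search objects perform by default.

```python
from itertools import product

# Grid shaped like the Ridge example: 10 alpha values and 2 solvers
param_grid = {
    "alpha": [round(0.0001 + i * (1 - 0.0001) / 9, 4) for i in range(10)],
    "solver": ["sag", "lsqr"],
}
n_folds = 5

# Grid search tries every combination of values
combinations = [dict(zip(param_grid, values))
                for values in product(*param_grid.values())]
grid_fits = len(combinations) * n_folds

# Randomized search tries only n_iter sampled combinations
n_iter = 2
random_fits = n_iter * n_folds

print(len(combinations), grid_fits, random_fits)
```

Every extra hyperparameter multiplies the number of combinations, while the randomized search budget stays fixed at `n_iter` times the number of folds.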

In [None]:
#RandomizedSearchCV

from sklearn.model_selection import RandomizedSearchCV 
kf = KFold(n_splits=5, shuffle=True, random_state=42)
param_grid = {'alpha': np.linspace(0.0001, 1, 10),
               "solver":['sag', 'lsqr']}
ridge = Ridge()
ridge_cv = RandomizedSearchCV(ridge, param_grid, cv=kf, n_iter=2)
ridge_cv.fit(X_train, y_train)
print(ridge_cv.best_params_, ridge_cv.best_score_) #{'solver':'sag','alpha':0.0001} #0.7529912278705785

Ridge Regression (also known as L2 regularization) is a regression technique designed to address overfitting and multicollinearity by adding a penalty to the loss function.

In [None]:
#Evaluating on the test set 

test_score = ridge_cv.score(X_test, y_test)
print(test_score) #0.7564731534089224

In [None]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Set up the parameter grid
param_grid = {"alpha": np.linspace(0.00001, 1, 20)}

# Instantiate lasso_cv
lasso_cv = GridSearchCV(lasso, param_grid, cv=kf)

# Fit to the training data
lasso_cv.fit(X_train, y_train)
print("Tuned lasso parameters: {}".format(lasso_cv.best_params_))
print("Tuned lasso score: {}".format(lasso_cv.best_score_))

In [None]:
# Create the parameter space
params = {"penalty": ["l1", "l2"],
         "tol": np.linspace(0.0001, 1.0, 50),
         "C": np.linspace(0.1, 1.0, 50),
         "class_weight": ["balanced", {0:0.8, 1:0.2}]}

# Instantiate the RandomizedSearchCV object
logreg_cv = RandomizedSearchCV(logreg, params, cv=kf)

# Fit the data to the model
logreg_cv.fit(X_train, y_train)

# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
print("Tuned Logistic Regression Best Accuracy Score: {}".format(logreg_cv.best_score_))

**Preprocessing data**

**scikit-learn requirements**
+ Numeric data
+ No missing values

**With real-world data**
+ This is rarely the case
+ We will often need to preprocess our data first

**Dealing with categorical features**
+ scikit-learn will not accept categorical features by default
+ Need to convert categorical features into numeric values
+ Convert to binary features called dummy variables
   * 0: Observation was NOT that category
   * 1: Observation was that category

![image.png](attachment:a639844a-badc-4678-b54e-54385e6a31d3.png)

If a song is not any of the first nine genres, then implicitly, it is a rock song. That means we only need 9 features, so we can delete the Rock column. If we do not do this, we are duplicating information, which might be an issue for some models. 

To create dummy variables, we can use scikit-learn's OneHotEncoder, or pandas' get_dummies

**Dealing with categorical features in Python**
+ scikit-learn: OneHotEncoder()
+ pandas: get_dummies()
  
**Music dataset**
+ popularity: Target variable
+ genre: Categorical feature

**Understanding Dummy Variables**
Dummy Variables are a way to convert categorical data (text-based labels like "Genre", "City", or "Color") into a numerical format that machine learning algorithms can process.

**Why are they necessary?**
**Mathematical Requirement:** Models like Linear Regression, Lasso, and Ridge perform mathematical operations (addition, multiplication). They cannot "calculate" using words like Anime or Rock.

**No Natural Order:** Unlike numbers, categories usually don't have a mathematical order. Dummy variables allow the model to treat each category independently without assuming that Rock is "greater than" Jazz.
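
As a rough sketch of what dummy encoding produces, the helper below builds 0/1 columns for each category and drops the first one, since it is implied when all the other columns are 0. `one_hot_drop_first` and the sample genres are hypothetical; in practice `pd.get_dummies(..., drop_first=True)` or `OneHotEncoder` does this work.

```python
def one_hot_drop_first(values):
    """Return (column names, rows of 0/1) with the first sorted category dropped."""
    categories = sorted(set(values))
    kept = categories[1:]            # the dropped category is implied by all zeros
    rows = [[int(v == c) for c in kept] for v in values]
    return kept, rows

# Hypothetical genre column
genres = ["Rock", "Jazz", "Anime", "Rock"]

columns, encoded = one_hot_drop_first(genres)
print(columns)
for genre, row in zip(genres, encoded):
    print(genre, row)
```

Here "Anime" is the dropped category, so an Anime song is encoded as all zeros. Which category is dropped does not matter for the model, as long as exactly one is.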

In [None]:
#Encoding dummy variables 

import pandas as pd
music_df = pd.read_csv('music.csv')
music_dummies = pd.get_dummies(music_df["genre"], drop_first=True)
print(music_dummies.head())
music_dummies = pd.concat([music_df, music_dummies], axis=1)
music_dummies = music_dummies.drop("genre", axis=1)
music_dummies = pd.get_dummies(music_df, drop_first=True)
print(music_dummies.columns)

In [None]:
#Linear Regression with dummy variables

import numpy as np
from sklearn.model_selection import cross_val_score, KFold, train_test_split
from sklearn.linear_model import LinearRegression
X = music_dummies.drop("popularity", axis=1).values 
y = music_dummies["popularity"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
linreg = LinearRegression()
linreg_cv = cross_val_score(linreg, X_train, y_train, cv=kf, scoring="neg_mean_squared_error")
print(np.sqrt(-linreg_cv)) #[8.15792932, 8.63117538,  7.52275279,  8.6205778,  7.91329988]

In [None]:
# Print missing values for each column
print(music_df.isna().sum().sort_values())

# Remove values where less than 5% are missing
music_df = music_df.dropna(subset=["genre", "popularity", "loudness", "liveness", "tempo"])

# Convert genre to a binary feature
music_df["genre"] = np.where(music_df["genre"] == "Rock", 1, 0)

print(music_df.isna().sum().sort_values())
print("Shape of the `music_df`: {}".format(music_df.shape))

#The dataset has gone from 1000 observations down to 892, but it is now in the correct format for binary classification 
#and the remaining missing values can be imputed as part of a pipeline.

In [None]:
# Import modules
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Instantiate an imputer
imputer = SimpleImputer()

# Instantiate a knn model
knn = KNeighborsClassifier(n_neighbors = 3)

# Build steps for the pipeline
steps = [("imputer", imputer), 
         ("knn", knn)]

In [None]:
steps = [("imputer", imputer),
        ("knn", knn)]

# Create the pipeline
pipeline = Pipeline(steps)

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)

# Make predictions on the test set
y_pred = pipeline.predict(X_test)

# Print the confusion matrix
print(confusion_matrix(y_test, y_pred))

**Why scale our data?**

+ Many models use some form of distance to inform them
+ Features on larger scales can disproportionately influence the model
+ Example: KNN uses distance explicitly when making predictions
+ We want features to be on a similar scale
+ Normalizing or standardizing (scaling & centering)

**How to scale our data?**
+ Subtract the mean and divide by the standard deviation
  * All features are centered around zero and have a variance of one
  * This is called standardization
+ Can also subtract the minimum and divide by the range
  * Min zero and max one
+ Can also normalize so the data ranges from -1 to +1
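
The two most common scaling formulas can be written out directly. The sketch below uses made-up values and hypothetical helper names; it uses the population standard deviation, which is also what `StandardScaler` uses.

```python
from statistics import mean, pstdev

def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max_scale(values):
    """Subtract the minimum and divide by the range, giving values in [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical feature column
durations = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

z_scores = standardize(durations)
scaled_01 = min_max_scale(durations)
print(mean(z_scores), pstdev(z_scores))
print(min(scaled_01), max(scaled_01))
```

After standardization the column has mean 0 and standard deviation 1 (and therefore variance 1); after min-max scaling its smallest value is 0 and its largest is 1.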


In [None]:
#Scaling in scikit-learn 

from sklearn.preprocessing import StandardScaler 
X = music_df.drop("genre", axis=1).values
y = music_df["genre"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(np.mean(X), np.std(X))
print(np.mean(X_train_scaled), np.std(X_train_scaled))

In [None]:
#Scaling in a pipeline 

steps = [('scaler', StandardScaler()), 
         ('knn', KNeighborsClassifier(n_neighbors=6))]
pipeline = Pipeline(steps)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
knn_scaled = pipeline.fit(X_train, y_train)
y_pred = knn_scaled.predict(X_test)
print(knn_scaled.score(X_test, y_test)) #0.81

#Comparing performance using unscaled data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
knn_unscaled = KNeighborsClassifier(n_neighbors=6).fit(X_train, y_train)
print(knn_unscaled.score(X_test, y_test)) #0.53

In [None]:
#CV and scaling in a pipeline 
from sklearn.model_selection import GridSearchCV
steps = [('scaler', StandardScaler()),
         ('knn', KNeighborsClassifier())]
pipeline = Pipeline(steps) 
parameters = {"knn__n_neighbors": np.arange(1, 50)}
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
cv = GridSearchCV(pipeline, param_grid=parameters)
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)

#Checking model parameters 
print(cv.best_score_) #0.81999999

print(cv.best_params_) #{'knn__n_neighbors': 12}

In [None]:
# Import StandardScaler
from sklearn.preprocessing import StandardScaler

# Create pipeline steps
steps = [("scaler", StandardScaler()),
         ("lasso", Lasso(alpha=0.5))]

# Instantiate the pipeline
pipeline = Pipeline(steps)
pipeline.fit(X_train, y_train)

# Calculate and print R-squared
print(pipeline.score(X_test, y_test))

In [None]:
# Build the steps
steps = [("scaler", StandardScaler()),
         ("logreg", LogisticRegression())]
pipeline = Pipeline(steps)

# Create the parameter space
parameters = {"logreg__C": np.linspace(0.001, 1.0, 20)}
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                    random_state=21)

# Instantiate the grid search object
cv = GridSearchCV(pipeline, param_grid=parameters)

# Fit to the training data
cv.fit(X_train, y_train)
print(cv.best_score_, "\n", cv.best_params_)

**Evaluating multiple models**

**Different models for different problems**
**Some guiding principles**

+ **Size of the dataset:**
  * Fewer features = simpler model, faster training time
  * Some models require large amounts of data to perform well
    
+ **Interpretability:**
  * Some models are easier to explain, which can be important for stakeholders
  * Linear regression has high interpretability, as we can understand the coefficients.
    
+ **Flexibility:**
  * May improve accuracy, by making fewer assumptions about data
  * KNN is a more flexible model that doesn't assume any linear relationships

**It's all in the metrics**
+ Regression model performance:
  * RMSE
  * R-squared
+ Classification model performance:
  * Accuracy
  * Confusion matrix
  * Precision, recall, F1-score
  * ROC AUC
+ Train several models and evaluate performance out of the box
+ Models affected by scaling:
  * KNN
  * Linear Regression (plus Ridge, Lasso)
  * Logistic Regression
  * Artificial Neural Network
+ Best to scale our data before evaluating models 

In [None]:
#evaluating classification models

import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, KFold, train_test_split 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier 
X = music.drop("genre", axis=1).values
y= music["genre"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

#Evaluating classification models
models = {"Logistic Regression": LogisticRegression(), "KNN": KNeighborsClassifier(), 
          "Decision Tree": DecisionTreeClassifier()}
results = []
for model in models.values():
    kf = KFold(n_splits=6, random_state=42, shuffle=True)
    cv_results= cross_val_score(model, X_train_scaled, y_train, cv=kf)
    results.append(cv_results)
plt.boxplot(results, labels=models.keys())
plt.show()

![image.png](attachment:2f081fc2-6ab9-4c54-a11f-251b7adb14ac.png)

**Model Performance Comparison**
**Logistic Regression (Best Performer):** It has the highest median score (~0.84) and generally higher accuracy ranges, making it the top choice for this dataset.

**KNN (Most Consistent):** The box is very short, indicating low variance and stable performance, though its overall accuracy is lower than Logistic Regression.

**Decision Tree (High Variance):** It shows the widest range (from <0.72 to >0.85), meaning its performance is highly inconsistent and sensitive to specific data splits.

In [None]:
#Test set performance
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    test_score = model.score(X_test_scaled, y_test)
    print("{} Test Set Accuracy: {}".format(name, test_score))

#Logistic Regression Test Set Accuracy: 0.844
#KNN Test Set Accuracy: 0.82
#Decision Tree Test Set Accuracy: 0.832

In [None]:
models = {"Linear Regression": LinearRegression(), "Ridge": Ridge(alpha=0.1), "Lasso": Lasso(alpha=0.1)}
results = []

# Loop through the models' values
for model in models.values():
  kf = KFold(n_splits=6, random_state=42, shuffle=True)
  
  # Perform cross-validation
  cv_scores = cross_val_score(model, X_train, y_train, cv=kf)
  
  # Append the results
  results.append(cv_scores)

# Create a box plot of the results
plt.boxplot(results, labels=models.keys())
plt.show()

![image.png](attachment:99a8606a-3fa3-4afb-a669-1efda330826e.png)

In [None]:
# Create models dictionary
models = {"Logistic Regression": LogisticRegression(), "KNN": KNeighborsClassifier(), "Decision Tree Classifier": DecisionTreeClassifier()}
results = []

# Loop through the models' values
for model in models.values():
  
  # Instantiate a KFold object
  kf = KFold(n_splits=6, random_state=12, shuffle=True)
  
  # Perform cross-validation
  cv_results = cross_val_score(model, X_train_scaled, y_train, cv=kf)
  results.append(cv_results)
plt.boxplot(results, labels=models.keys())
plt.show()

![image.png](attachment:7411aae0-8e5c-45a3-a49b-9284ee8dc348.png)

In [None]:
# Create steps
steps = [("imp_mean", SimpleImputer()), 
         ("scaler", StandardScaler()), 
         ("logreg", LogisticRegression())]

# Set up pipeline
pipeline = Pipeline(steps)
params = {"logreg__solver": ["newton-cg", "saga", "lbfgs"],
         "logreg__C": np.linspace(0.001, 1.0, 10)}

# Create the GridSearchCV object
tuning = GridSearchCV(pipeline, param_grid=params)
tuning.fit(X_train, y_train)
y_pred = tuning.predict(X_test)

# Compute and print performance
print("Tuned Logistic Regression Parameters: {}, Accuracy: {}".format(tuning.best_params_, tuning.score(X_test, y_test)))

**Data Analytics**

**Discovering Data Analytics**
Data analytics is the process of examining raw data to find trends, patterns, and insights. It involves using various techniques and tools to extract meaningful information from data and use it to make informed decisions. 

**Why is Data Analytics so important?**
In today's world, we generate an enormous amount of data every single day. From social media posts and online purchases to sensor readings and financial transactions, data is everywhere. 

But raw data alone isn't very useful. It's like having a room full of ingredients but no recipe. Data analytics provides the "recipe" to turn that raw data into something valuable. 

**By using data analytics, businesses can:**
+ Identify customer preferences: Understand what their customers want and tailor their products and services accordingly.
+ Improve marketing campaigns: Target the right audience with the right message at the right time.
+ Optimize operations: Streamline processes, reduce costs, and increase efficiency.
+ Gain a competitive advantage: Make data-driven decisions that give them an edge in the market.

Data analytics is not just for businesses, though. It's used in various fields, including healthcare, education, government, and even sports. For example, data analytics can help doctors diagnose diseases earlier, teachers personalize learning for students, and city planners improve traffic flow. 

As the world becomes more data-driven, storytelling through data analysis is becoming a vital part of businesses large and small. 

Data-driven businesses make decisions based on the story that their data tells. Yet in today's data-driven world, data is often not used to its full potential, a challenge that most businesses face. 

Data analysis is, and should be, a critical aspect of all organizations to help determine the impact to their business, including evaluating customer sentiment, performing market and product research, and identifying trends or other data insights. 

While the process of data analysis focuses on the tasks of cleaning, modeling, and visualizing data, the concept of data analysis and its importance to business should not be understated. To analyze data, the core components of analytics are divided into 5 categories: 

**1) Descriptive Analytics**
+ Descriptive analytics help answer questions about **what has happened** based on historical data. Descriptive analytics techniques **summarize large semantic models to describe outcomes to stakeholders.**
  
+ By developing key performance indicators (KPIs), these strategies can help **track the success or failure of key objectives**. Metrics such as return on investment (ROI) are used in many industries, and **specialized metrics** are developed to **track performance in specific industries.**

+ An example of descriptive analytics is **generating reports to provide a view of an organization's sales and financial data.**  
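
As a small illustration of a descriptive KPI, the sketch below computes ROI per marketing campaign from historical figures. The campaign names and amounts are made-up sample data, and ROI is assumed here to mean (revenue - cost) / cost, which is one common definition rather than the only one.

```python
# Hypothetical historical figures per campaign (illustrative values only).
campaigns = [
    {"name": "email", "cost": 1000.0, "revenue": 1500.0},
    {"name": "social", "cost": 2000.0, "revenue": 1800.0},
]


def roi(cost: float, revenue: float) -> float:
    """Return ROI as a fraction, assuming ROI = (revenue - cost) / cost."""
    return (revenue - cost) / cost


# Summarize what has already happened: one KPI value per campaign.
roi_by_campaign = {c["name"]: roi(c["cost"], c["revenue"]) for c in campaigns}

for name, value in roi_by_campaign.items():
    print(f"{name}: ROI = {value:.0%}")
```

A positive value means the campaign earned more than it cost; a negative value flags a campaign worth investigating further.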

**2) Diagnostic Analytics**
+ Diagnostic analytics help answer questions about **why events happened**. Diagnostic analytics techniques supplement basic descriptive analytics, and they **use the findings from descriptive analytics to discover the cause of these events.**
+ Then, performance indicators are further investigated to discover **why these events improved or became worse**. Generally, this process occurs in 3 steps:
  a) Identify anomalies in the data. These anomalies might be unexpected changes in a metric or a particular market.
  b) Collect data that's related to these anomalies
  c) Use statistical techniques to discover relationships and trends that explain these anomalies. 
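
The three steps above can be sketched in a few lines of Python. This is only a minimal illustration: the weekly sales and temperature figures are invented, the 2-standard-deviation threshold is an arbitrary assumption, and a strong correlation suggests a possible explanation but does not prove a cause.

```python
from statistics import mean, pstdev

# Hypothetical weekly sales and average temperature (made-up sample data).
weekly_sales = [100, 102, 98, 101, 99, 60, 100, 103]
avg_temp_c = [-2, -3, -1, -2, -2, 9, -1, -3]

# Step a) Identify anomalies: weeks whose sales lie more than
# 2 standard deviations from the mean (threshold is an assumption).
mu = mean(weekly_sales)
sigma = pstdev(weekly_sales)
z_scores = [(x - mu) / sigma for x in weekly_sales]
anomalies = [i for i, z in enumerate(z_scores) if abs(z) > 2]


# Step b) is the related data collected above (temperature per week).
# Step c) Use a simple statistic to look for a relationship.
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


r = pearson(avg_temp_c, weekly_sales)
print("Anomalous weeks (0-based index):", anomalies)
print(f"Correlation between temperature and sales: {r:.2f}")
```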

**3) Predictive Analytics**
+ Predictive analytics help answer questions about **what will happen in the future.** Predictive analytics techniques use historical data to identify trends and determine if they're likely to recur.
+ Predictive analytical tools provide valuable insight into what might happen in the future. Techniques include a variety of **statistical and machine learning techniques** such as **neural networks, decision trees, and regression.** 
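
As a minimal sketch of the regression idea, the example below fits a straight-line trend to past monthly sales with `np.polyfit` and extrapolates one month ahead. The sales figures are invented, and a linear trend is an assumption made for simplicity; real forecasts usually need to account for seasonality and other factors.

```python
import numpy as np

# Hypothetical unit sales for the last six months (made-up data).
months = np.arange(1, 7)
sales = np.array([100.0, 110.0, 120.0, 130.0, 140.0, 150.0])

# Fit a degree-1 polynomial: sales ≈ slope * month + intercept.
# np.polyfit returns coefficients from the highest degree down.
slope, intercept = np.polyfit(months, sales, deg=1)

# Extrapolate the historical trend one month into the future.
next_month = 7
forecast = slope * next_month + intercept
print(f"Forecast for month {next_month}: {forecast:.1f}")
```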

**4) Prescriptive Analytics**
+ Prescriptive analytics help answer questions about **which actions should be taken to achieve a goal or target.**
+ By using insights from prescriptive analytics, organizations can **make data-driven decisions.** This technique allows businesses to make informed decisions in the face of uncertainty.
+ Prescriptive analytics techniques rely on machine learning as one of the strategies **to find patterns in large semantic models.**
+ By analyzing past decisions and events, organizations can **estimate the likelihood of different outcomes.** 

**5) Cognitive Analytics**
+ Cognitive analytics attempt to draw inferences from existing data and patterns, derive conclusions based on existing knowledge bases, and then add these findings back into the knowledge base for future inferences, a self-learning feedback loop.
+ Cognitive analytics help you learn what might happen if circumstances change and determine how you might handle these situations.
+ Inferences aren't structured queries on a rules database; rather, they're unstructured hypotheses that are gathered from several sources and expressed with varying degrees of confidence.
+ Effective cognitive analytics depend on machine learning algorithms, and will use several natural language processing concepts to make sense of previously untapped data sources, such as call center conversation logs and product reviews. 

## What is Cognitive Analytics?

Put as simply as possible: if traditional data analytics is like a pocket calculator (it only follows instructions), then **cognitive analytics is like an artificial brain**. It does not just process numbers; it also tries to "reason" and "learn" the way a person does.

### 1. Self-learning Feedback Loop

Instead of producing a result and stopping there, the system works in a cycle:

* **Observe:** Read the existing data.
* **Infer:** Draw conclusions based on what is already known.
* **Update:** Store those conclusions in the knowledge base.
* **Reuse:** The next time a similar problem comes up, it is smarter thanks to what it just learned.

### 2. Simulating "What-if" Scenarios

A strength of cognitive analytics is its ability to forecast in a changing environment.

* It helps businesses prepare for unforeseen changes (for example: "If raw material prices rise 20% and a new epidemic breaks out, what happens to the supply chain?").
* It not only forecasts but also suggests the best way to handle each specific situation.

### 3. Inference Based on "Hypotheses" Instead of "Queries"

This is the core difference:

* **Traditional query:** You ask "What was the revenue in October?" -> The machine returns one exact number (rigid structure).
* **Cognitive analytics:** It produces **unstructured hypotheses**. For example: "Based on customer reviews and economic news, there is an 85% chance that customers are gradually shifting toward greener products."
* The result is not an absolute "true/false"; it comes with a **confidence level** (for example, 90% confident).

### 4. Combining AI, Machine Learning (ML), and Natural Language Processing (NLP)

To "understand" the world the way people do, it needs specialized tools:

* **NLP:** To read and understand "difficult" data such as recorded call center conversations, social media posts, or customer reviews.
* **Machine Learning:** To find hidden patterns that the human eye or conventional algorithms cannot see.

---

## A Practical, Easy-to-Picture Example

Imagine an airline's customer service system:

1. **Input data:** It reads thousands of complaint emails, listens to recorded calls, and looks at data on delayed flights.
2. **Processing (NLP):** It recognizes that customers sound very angry when they mention "waiting at the departure gate".
3. **Inference:** It forms a hypothesis: "The lack of information at the departure gate causes more frustration than the delay itself".
4. **Action:** It recommends that the airline send automatic status updates every 15 minutes. After this is done, it observes that satisfaction rises and records in its knowledge base: "This tactic works; use it again in the future".



Imagine a company called Taival Ski & Co., which sells outdoor clothing and equipment. They have a website with an online store and a few physical shops across the country. They've been collecting data on their sales, marketing campaigns, and customer interactions for a few years now, but they haven't really done much with it.

Here's where data analytics comes in. By using tools like Power BI, Taival Ski & Co. can start to make sense of all that data and use it to improve their business.

For example, they could use **descriptive analytics** to:

+ **Analyze past sales data:** See which products were most popular in different seasons, identify sales trends over time, and understand how different factors (like weather or marketing campaigns) influenced sales. This could help them predict demand for upcoming seasons and ensure they have the right stock in the right places.

+ **Understand customer behavior:** Analyze website traffic, customer demographics, and purchase history to identify their target audience and tailor their marketing efforts more effectively. They might discover, for example, that their online customers are primarily interested in hiking gear, while their in-store customers are more focused on camping equipment.

+ **Evaluate marketing campaign performance:** Track the effectiveness of different marketing campaigns (like social media ads or email newsletters) by analyzing click-through rates, conversion rates, and customer acquisition costs. This can help them optimize their marketing spend and get the most out of their campaigns.

+ Taival Ski & Co. needs to trust their data. They need to ensure they're collecting accurate and reliable data from sources like their website analytics, point-of-sale systems, and customer relationship management (CRM) software.

+ **With clean, reliable data and the right tools, Taival Ski & Co. can transform raw data into actionable insights.** They can **make informed decisions** about inventory management, marketing strategies, and product development, ultimately leading to increased sales, improved customer satisfaction, and a stronger competitive edge in the market.

+ … And that's where data analysts come in! A skilled data analyst can help Taival Ski & Co. navigate their growing sea of data, identify key trends, and translate those trends into meaningful recommendations. They can help Taival Ski & Co. **unlock the true potential of their data and drive business growth.**

+ Data analysts do a lot more than crunch numbers, though. Let’s take a moment to explore some of their tasks and duties.

**Roles in Data**

**Business Analyst**
+ While some similarities exist between a data analyst and a business analyst, the key differentiator between the two roles is what they do with data. A business analyst is closer to the business and is a specialist in interpreting the data that comes from the visualization. Often, the roles of data analyst and business analyst could be the responsibility of a single person. 

**Data Analyst**
+ A data analyst enables businesses to maximize the value of their data assets through visualization and reporting tools such as Microsoft Power BI. Data analysts are responsible for profiling, cleaning, and transforming data.
+ Their responsibilities also include designing and building scalable and effective semantic models, and enabling and implementing the advanced analytics capabilities into reports for analysis.
+ A data analyst works with the pertinent stakeholders to identify appropriate and necessary data and reporting requirements, and then they are tasked with turning raw data into relevant and meaningful insights.
+ A data analyst is also responsible for the management of Power BI assets, including reports, dashboards, workspaces, and the underlying semantic models that are used in the reports.
+ They are tasked with implementing and configuring proper security procedures, in conjunction with stakeholder requirements, to ensure the safekeeping of all Power BI assets and their data.
+ Data analysts work with data engineers to determine and locate appropriate data sources that meet stakeholder requirements.
+ Additionally, data analysts work with the data engineer and database administrator to ensure that the analyst has proper access to the needed data sources.
+ The data analyst also works with the data engineer to identify new processes or improve existing processes for collecting data for analysis.

**Data Engineer**
+ Data engineers provision and set up data platform technologies that are on-premises and in the cloud. They manage and secure the flow of structured and unstructured data from multiple sources.
+ The data platforms that they use can include relational databases, nonrelational databases, data streams, and file stores. Data engineers also ensure that data services securely and seamlessly integrate across data platforms.
+ Primary responsibilities of data engineers include the use of on-premises and cloud data services and tools to ingest, egress, and transform data from multiple sources.
+ Data engineers collaborate with business stakeholders to identify and meet data requirements. They design and implement solutions.
+ While some alignment might exist in the tasks and responsibilities of a data engineer and a database administrator, a data engineer's scope of work goes well beyond looking after a database and the server where it's hosted and likely doesn't include the overall operational data management.
+ Both database administrators and business intelligence professionals can transition to a data engineer role; they need to learn the tools and technology that are used to process large amounts of data.

**Data Scientist**
+ Data scientists perform advanced analytics to extract value from data.
+ Their work can vary from descriptive analytics to predictive analytics.
+ Descriptive analytics evaluate data through a process known as exploratory data analysis (EDA). Predictive analytics are used in machine learning to apply modeling techniques that can detect anomalies or patterns. These analytics are important parts of forecast models.
+ Descriptive and predictive analytics are only partial aspects of data scientists' work. Some data scientists might work in the realm of deep learning, performing iterative experiments to solve a complex data problem by using customized algorithms.
+ Anecdotal evidence suggests that most of the work in a data science project is spent on data wrangling and feature engineering. Data scientists can speed up the experimentation process when data engineers use their skills to successfully wrangle data.
+ A data scientist looks at data to determine the questions that need answers and will often devise a hypothesis or an experiment and then turn to the data analyst to assist with the data visualization and reporting.

**Database Administrator**
+ A database administrator implements and manages the operational aspects of cloud-native and hybrid data platform solutions that are built on Microsoft Azure data services and Microsoft SQL Server.
+ A database administrator is responsible for the overall availability and consistent performance and optimizations of the database solutions. They work with stakeholders to identify and implement the policies, tools, and processes for data backup and recovery plans.
+ The role of a database administrator is different from the role of a data engineer.
+ A database administrator monitors and manages the overall health of a database and the hardware that it resides on, whereas a data engineer is involved in the process of data wrangling, in other words, ingesting, transforming, validating, and cleaning data to meet business needs and requirements.

![image.png](attachment:c4b75c73-d283-44d0-9cb6-9998e5ecb411.png)

**Understanding Business Processes**

+ Understanding business processes and terms is crucial for data analysts and data engineers. When you know the business processes, you can better understand the context of the data you are working with. This helps in identifying relevant data and ensuring data quality.
+ Identifying gaps and inconsistencies in data can highlight the presence of silos. Data silos occur when data and processes are isolated within different departments or systems, commonly due to lack of communication or data ownership. => Bad
+ Breaking down data silos and promoting collaboration across the organization ensures that data is accessible, accurate, and complete for decision-making. => Good
+ Knowing the business terms (ROI, CAC, Churn Rate, Revenue vs. Profit, Conversion Rate, B2B/B2C, Bottom Line, Pain Points, Stakeholders, Fiscal Year, Active Customer, Single Source of Truth) allows you to communicate effectively with stakeholders, ensuring that your analyses and reports are aligned with business needs and objectives.
+ Moreover, familiarity with business processes and terms enables you to design more efficient data models and pipelines. It helps in anticipating potential issues and addressing them proactively. This knowledge also aids in creating meaningful visualizations and dashboards that provide actionable insights to the stakeholders.
+ Ultimately, it ensures that your work as a data analyst or data engineer adds real value to the organization.

**Tasks of a Data Analyst**
![image.png](attachment:f32f093f-4886-4e4b-9441-8b90c8ef62c8.png)


**1. Preparing (The Foundation)**
+ Goal: Turn raw data into trusted, readable information.
+ Actions: Profiling, cleaning, and transforming data.
+ Key Focus: Ensure data integrity, fix inaccuracies, and handle privacy/security (anonymizing PII).
+ Impact: Poor preparation leads to invalid reports and loss of business trust.
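
One possible way to handle PII during preparation is to replace identifying values with stable tokens before the data reaches a report. The sketch below uses a salted SHA-256 hash from Python's standard library; the salt value, the column names, and the 16-character truncation are all hypothetical choices, and strictly speaking this is pseudonymization rather than full anonymization.

```python
import hashlib

# Hypothetical secret salt; in practice it would be kept outside the dataset.
SALT = "example-salt"


def pseudonymize(value: str, salt: str = SALT) -> str:
    """Replace a PII value with a stable token that is hard to reverse."""
    normalized = value.strip().lower()  # cleaning step: same person, same token
    digest = hashlib.sha256((salt + normalized).encode("utf-8")).hexdigest()
    return digest[:16]  # shortened only for readability


# Made-up raw rows containing an email address (PII).
rows = [
    {"email": "Ana@example.com", "amount": 120.0},
    {"email": "ana@example.com ", "amount": 80.0},
    {"email": "bo@example.com", "amount": 50.0},
]

# Clean and pseudonymize in one pass: the raw email never reaches the report.
prepared = [
    {"customer_id": pseudonymize(r["email"]), "amount": r["amount"]} for r in rows
]
print(prepared)
```

Because the email is normalized before hashing, the first two rows map to the same `customer_id`, so totals per customer still work without exposing the address.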

**2. Modeling (The Structure)**
+ Goal: Define how tables relate to each other.
+ Actions: Creating relationships, defining metrics, and adding custom calculations.
+ Key Focus: Building an efficient semantic model to improve report performance and accuracy.
+ Note: This is an iterative process linked closely with data preparation.

**3. Visualizing (The Storytelling)**
+ Goal: Solve business problems by "bringing data to life."
+ Actions: Designing reports that tell a compelling narrative using appropriate charts.
+ Key Focus: Accessibility (fonts/colors) and conciseness (avoiding data overload).
+ Tooling: Use Power BI’s built-in AI (Quick Insights, AI visuals) to find answers without code.

**4. Analyzing (The Insight)**
+ Goal: Understand and interpret the story told by the data.
+ Actions: Identifying patterns, trends, and predicting future outcomes.
+ Key Focus: Using advanced analytics to create actionable insights that drive better decisions.
+ Evolution: Modern tools (AI integrations) make complex analysis accessible to everyone, not just data scientists.

**5. Managing (The Governance)**
+ Goal: Oversee the lifecycle and security of assets (reports, dashboards, workspaces).
+ Actions: Sharing/distributing content (via Apps) and ensuring data security.
+ Key Focus: Reducing Data Silos by reusing shared semantic models and endorsing "Certified" data to ensure a single source of truth.

**Exercise:**

**1. Taival Ski & Co. notices a significant drop in sales of their winter jackets during the typically busy month of December. How could they use diagnostic analytics to understand this unexpected trend?**
   
Taival Ski & Co. could investigate by analyzing data related to the sales drop, such as website traffic, marketing performance, customer reviews, and competitor activity. This analysis could reveal if factors like a competitor's promotion or unusually warm weather contributed to the decline.

**2. How can Taival Ski & Co. leverage predictive analytics to anticipate customer demand for hiking boots in the upcoming spring season?**

By analyzing past hiking boot sales data, considering seasonality and marketing campaigns, Taival Ski & Co. can identify trends and use predictive models to forecast demand. They should also incorporate external factors like economic conditions and competitor activity.

**3. Taival Ski & Co. wants to increase customer engagement with their online store. How can they use descriptive analytics to gain insights into customer behavior on their website?**

Taival Ski & Co. can track website traffic (page views, bounce rate), analyze customer demographics, and examine product popularity to understand how customers interact with their online store and what products they prefer.

**4. Explain how Taival Ski & Co. could use prescriptive analytics to optimize their pricing strategy for a new line of camping gear.**

Taival Ski & Co. can analyze past sales data, conduct market research, and use machine learning models to simulate different pricing scenarios. This allows them to determine the optimal price point to maximize revenue.

**5. Describe how cognitive analytics could help Taival Ski & Co. analyze customer feedback from online reviews and social media comments to improve their products and services.**

Cognitive analytics can help Taival Ski & Co. analyze customer feedback from various sources, using natural language processing to identify key themes and sentiment. This allows them to uncover insights and make data-driven decisions to improve products and services.

**6. Which data role enables advanced analytics capabilities specifically through reports and visualizations?**

Data analyst. A data analyst uses appropriate visuals to help business decision makers gain deep and meaningful insights from data.

**7. Which data analyst task has a critical performance impact on reporting and data analysis?**

Modeling. An optimized and tuned semantic model performs better and provides a better data analysis experience.

**1. Tailwind Traders stores employee data in both a SQL Server database (for sales transactions) and Excel files (for HR information). What are the potential challenges of working with data stored in these different formats?**

Challenges include data inconsistencies, difficulties integrating data from different sources, and maintaining data accuracy across multiple locations.

**2. Why might it be beneficial for Tailwind Traders to consolidate their employee data into a single database, rather than using both SQL Server and Excel?**

Consolidating data improves consistency and accuracy, simplifies data management, and enables more comprehensive analysis.

**3. Tailwind Traders' warehousing application uses Cosmos DB, a NoSQL database, to store shipping data in JSON format. How does the structure of JSON data differ from data stored in a relational database like SQL Server?**

JSON data is schema-less, stored as documents, and often nested, unlike the structured tables of a relational database.

**4. When importing JSON data into Power BI, why is it often necessary to 'extract and normalize' the data?**

Extracting and normalizing flattens the nested JSON structure, making it compatible with Power BI's tabular model and easier to analyze.
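
To make "extract and normalize" concrete, here is a rough Python analogue using `pandas.json_normalize`. It is not what Power BI runs internally; it only shows the same idea of flattening nested documents into rows and columns. The shipment records are invented sample data.

```python
import pandas as pd

# Hypothetical shipping documents, nested the way a document database might store them.
shipments = [
    {
        "shipment_id": "S1",
        "destination": {"city": "Helsinki", "country": "FI"},
        "items": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}],
    },
    {
        "shipment_id": "S2",
        "destination": {"city": "Oslo", "country": "NO"},
        "items": [{"sku": "A", "qty": 5}],
    },
]

# One output row per nested item; parent fields are repeated on each row.
flat = pd.json_normalize(
    shipments,
    record_path="items",
    meta=["shipment_id", ["destination", "city"]],
)
print(flat)
```

The nested `items` list becomes three flat rows, and the nested `destination.city` field becomes an ordinary column, which is the tabular shape a report needs.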

**5. Tailwind Traders uses SharePoint to store sales goals for their sales team. What are the advantages of connecting Power BI directly to their SharePoint data, rather than manually importing the data?**

Directly connecting ensures real-time updates, reduces manual effort, and improves data accuracy.

**6. What are the potential benefits of using Power Query to clean and transform data before loading it into Power BI?**

Power Query improves data quality, optimizes data for analysis, and can enhance report performance.

**1. What are the advantages of cleaning data in Power BI?**

Clean data leads to more accurate results, organized tables, easier data navigation, simplified columns, and human-readable values.

**2. How does Power Query Editor in Power BI Desktop help in shaping imported data?**

Power Query Editor allows for various transformations like renaming columns, changing data types, removing rows, and setting headers to make the data suitable for analysis.

**3. Why is it important to identify and promote column headers correctly when shaping data in Power Query Editor?**

Correctly identifying and promoting headers ensures that the data is organized properly and that columns have meaningful names for analysis and reporting.

**4. What are some implications of having incorrect data types in a Power BI dataset?**

Incorrect data types can prevent certain calculations, hierarchy creation, and proper relationships between tables, leading to errors and inaccurate analysis.

**5. What is the risk of having null values in a numeric column?**

The result of the AVERAGE function will be misleading. AVERAGE takes the total and divides it by the number of non-null values. If NULL is meant to be synonymous with zero in the data, the computed average will differ from the true average.
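
The effect is easy to reproduce with pandas, used here only as a stand-in: `Series.mean()` also skips nulls by default. The three values are made-up sample data.

```python
import pandas as pd

# Hypothetical daily sales; the missing day is stored as null (NaN).
sales = pd.Series([10.0, None, 20.0])

# mean() skips nulls by default, so it divides by 2 (the non-null count).
avg_ignoring_nulls = sales.mean()

# If a null really means "no sales", fill with 0 first so it divides by 3.
avg_nulls_as_zero = sales.fillna(0).mean()

print(avg_ignoring_nulls, avg_nulls_as_zero)
```

Which of the two numbers is "correct" depends on what a null means in the source data, which is why nulls should be resolved during data cleaning.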

**6. If you have two queries that have different data but the same column headers, and you want to combine both tables into one query with all the combined rows, which operation should you perform?**

Append. Append takes two tables and combines them into one query. The combined query has more rows while keeping the same number of columns.
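
A rough pandas analogue of an append is `pd.concat` on two tables with the same columns. The two monthly tables below are invented for illustration.

```python
import pandas as pd

# Two hypothetical queries with the same column headers but different rows.
jan = pd.DataFrame({"order_id": [1, 2], "amount": [50.0, 75.0]})
feb = pd.DataFrame({"order_id": [3], "amount": [20.0]})

# Stack the rows; ignore_index renumbers the combined rows from 0.
combined = pd.concat([jan, feb], ignore_index=True)
print(combined)
```

The result has three rows (2 + 1) and still two columns, matching the description above.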

**7. Why is it recommended not to abbreviate column names when applying naming conventions in Power BI?**

Abbreviations lead to confusion because they're often overused or not universally agreed on.

**Data modeling**

+ Data Modeling is about structuring and organizing your data to create a clear and accurate representation of the information you want to analyze.
+ Think of it as building the foundation for a house - a strong and well-designed foundation ensures stability and supports everything built on top of it.
+ When developing the model, you must complete the following tasks:
    * Connect to data
    * Transform and prepare data
    * Define business logic by adding Data Analysis Expressions (DAX) calculations
    * Publish the model to Power BI
    * Understanding the structure of semantic models can help you design the right model to support your reports and dashboards. A semantic model can be developed in many ways, yet one or several of those ways are more optimal. Optimal models are important for delivering good query performance and for minimizing data refresh times and the use of service resources, including memory and CPU. The fewer resources that are used, the more models that can be hosted and at lower cost.
+ The primary purpose of a star schema in data modeling: To optimize query performance and facilitate data analysis.
+ The type of table in a star schema that typically contains a large number of rows representing individual events or observations: Fact table.
![image.png](attachment:9c27dc53-0ccb-4fdc-bcc8-84c54c4730b1.png)

Notice that the model is comprised of seven tables, one of which is named Sales and is the fact table. The remaining tables are dimension tables, and they have the following names:
1. Customer
2. Date
3. Product
4. Reseller
5. Sales Order
6. Sales Territory

+ The role of a dimension table in a star schema: To provide descriptive context for the data in fact tables
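
The fact/dimension split can be imitated in pandas to see why it works. This is a minimal sketch with invented tables and column names, not the Power BI model shown in the image: the fact table holds keys and numbers, the dimension table holds descriptions, and a many-to-one relationship connects them.

```python
import pandas as pd

# Dimension table: one row per product, holding descriptive attributes.
product = pd.DataFrame({"product_id": [1, 2], "category": ["Boots", "Jackets"]})

# Fact table: one row per sales event, holding keys and numeric values.
sales = pd.DataFrame(
    {
        "product_id": [1, 1, 2],
        "quantity": [2, 1, 4],
        "amount": [200.0, 100.0, 600.0],
    }
)

# The relationship: each fact row looks up its context in the dimension.
# validate="many_to_one" raises if product_id is not unique in the dimension.
sales_with_context = sales.merge(
    product, on="product_id", how="left", validate="many_to_one"
)

# The descriptive column can now be used to slice the numeric facts.
amount_by_category = sales_with_context.groupby("category")["amount"].sum()
print(amount_by_category)
```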

**Analytic Queries**
+ An analytic query is a query that produces a result from a semantic model.
+ An analytic query has three phases that are implemented in the following order:
  1) Filter
  2) Group
  3) Summarize
![image.png](attachment:22eb523d-64e7-45ab-8040-06baab04a24f.png)

+ The purpose of the "Summarize" phase in an analytic query: To aggregate data and produce a single value result.
+ The primary function of a measure in Power BI: To calculate and aggregate values from the data model.
+ In a Power BI Desktop model design, the type of object you create to connect multiple tables: Relationship.
+ Fact tables store accumulations of business events
+ The order in which an analytic query implements its phases: Filter, Group, Summarize.
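
The three phases map naturally onto a pandas expression, shown here as an analogy rather than as what Power BI executes. The sales table is made-up sample data.

```python
import pandas as pd

# Hypothetical flat sales table.
sales = pd.DataFrame(
    {
        "year": [2023, 2023, 2024, 2024, 2024],
        "region": ["North", "South", "North", "North", "South"],
        "amount": [100.0, 80.0, 120.0, 30.0, 90.0],
    }
)

# 1) Filter: keep only the rows of interest.
filtered = sales[sales["year"] == 2024]

# 2) Group: split the filtered rows by region.
grouped = filtered.groupby("region")

# 3) Summarize: reduce each group to a single value.
result = grouped["amount"].sum()
print(result)
```

Each region ends up with one summarized value, which is what a visual such as a bar chart would display.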

![image.png](attachment:1cca191d-077e-489c-85ca-304d31310c9a.png)

**Semantic Models**

1. What is the primary purpose of a semantic model in Power BI?
+ To define relationships between data elements and enable analysis.

2. The benefit of simplifying table structures in a Power BI semantic model: It improves model navigation and readability for users.

3. The purpose of a date table in a Power BI semantic model: To provide a standardized way to analyze data across different time periods.

4. The process of "flattening" a hierarchy in Power BI: Creating separate columns to represent each level of the hierarchy.

5. The purpose of marking a table as a date table in Power BI: To validate the data and ensure it meets the requirements of a date table.
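
To illustrate what a date table looks like, the sketch below builds one in pandas for a single year. The year and column names are assumptions chosen for the example; the key properties are one row per day, no gaps, and no duplicates. The year, quarter, and month columns also show a hierarchy "flattened" into separate columns.

```python
import pandas as pd

# One row per calendar day for a hypothetical reporting year.
dates = pd.date_range(start="2024-01-01", end="2024-12-31", freq="D")

date_table = pd.DataFrame(
    {
        "date": dates,
        "year": dates.year,
        "quarter": dates.quarter,
        "month": dates.month,
        "month_name": dates.month_name(),
    }
)

# Basic checks a date table is expected to pass: unique and contiguous dates.
is_unique = date_table["date"].is_unique
is_contiguous = (date_table["date"].diff().dropna() == pd.Timedelta(days=1)).all()
print(len(date_table), is_unique, is_contiguous)
```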

**Optimizing Model Performance**
1. Smaller models consume fewer resources, leading to faster data refresh, calculations, and rendering of visuals.
   
2. Proper relationships ensure data integrity and optimize query performance by guiding Power BI on how to efficiently retrieve and filter data.

3. Variables can improve performance, readability, and debugging of DAX formulas, leading to more efficient and maintainable code.

**Dashboard**
The primary goal of a dashboard is to let the audience interpret the story as quickly as possible. User interactions are limited, and the insights are highly curated for the audience. Report visuals are focused, self-explanatory, and clearly labeled. A dashboard directly communicates the meaning behind the data to minimize misinterpretation or confusion.

**Analytical Report**
1. An analytical report is the most common type of report that can serve various report consumer use cases while providing a structured space for analysis.

2. The primary goal of an analytical report is to help report consumers discover answers to a broad array of questions by interacting with the report and its visuals. Analytical reports often have many slicers to filter report data, and they often contain complex visuals that expose in-depth detail of the data.

3. Report pages are often expressly designed for interactivity with a focus on UX features. Multiple pathways are often provided for the report consumer to follow, which allows them to explore a topic of interest, share their findings, or return to where they started. Report consumers can remove layers and add context and detail by incorporating interactive features. Common interactive features include drill down, drill through, and tooltips.

4. A good example of an analytical report is one that extends beyond the "How are we doing?" type of question to answer the "Why did that happen?" or "What might happen next?" type of questions.

5. An example of an analytical report at the Contoso Skateboard Store would be a sales analysis report that allows drilling into sales revenue from year, down to quarter, month, and day.

**Operational Report**
Operational reports are designed to give the report consumer the ability to monitor current or real-time data, make decisions, and act on those decisions. Operational reports can include buttons that allow the report consumer to navigate within the report and also beyond the report to perform actions in external systems. Frequently, operational reports serve as a hub for action that is used by report consumers as part of their daily activity and workload.

This type of report should minimize the number of analytical features to ensure that focus remains on the operation that it's designed to serve. A streamlined user experience is the primary aim for this report type because excessive clicking or illogical flow can lead to high dissatisfaction.

A good example of an operational report is one that allows monitoring of a manufacturing production line. When an unexpected event arises, such as equipment malfunction, a button could allow workers to start a maintenance request.

An example of an operational report at the Contoso Skateboard Store would be an inventory report that informs the report consumer of current stock levels and highlights low stock levels or back orders. It also includes a Submit Order button that allows users to create a purchase order.

![image.png](attachment:5f69fec9-29c2-467a-a252-d8c9897fdefb.png)

**Educational Reports**
An educational report assumes that the report consumer is unfamiliar with the data or context, so it must provide clear narrative detail and guidance to help with understanding. This type of report is often used in journalism and by governments to disseminate information to large audiences that have varying levels of understanding of the subject.

A good example of an educational report is one that describes the rollout of COVID-19 vaccination progress and that can be filtered by the home geographic region of the report consumer.

![image.png](attachment:88cb9899-53f1-4e4a-918e-704a17ce7c15.png)

**Designing Data Reports**
**1) Placement:**
Good placement of report objects contributes to an ordered report design. Generally, you should place the most important information in the upper-left corner of the page and arrange the report elements from left to right and top to bottom.

Important: This placement applies to audiences who predominantly read left to right (LTR). When your audience reads right to left (RTL), as is the case with some written languages such as Arabic and Hebrew, place the most important information in the upper-right corner and arrange the report elements from right to left.

Arrange report objects so that the vertical and horizontal edges align because it looks ordered and is pleasing to the eye. Position related report objects in logical groups. An ordered report layout creates a connection between visual elements and avoids clutter that can result from a seemingly random placement of report objects.

Additionally, aligning report objects in a visually pleasing layout can convey more energy and interest than simply centering or randomly placing report objects. Consider applying the rule of thirds, which is a visual arts rule that can be applied to report object placements in an analytical report. The rule proposes that a page layout should be divided into an invisible grid of nine equal parts. The grid is formed by two equally spaced horizontal lines and two equally spaced vertical lines. Then, report objects can be placed within the cells of the grid.

At the Contoso Skateboard Store, a proposed report design for analyzing sales presents three equally sized vertical regions. The first region shows sales broken down by product, the second region shows sales broken down by customer store, and the third region shows items that have been sold.
![image.png](attachment:2e9ee05a-be96-4a50-a591-4ec81a08eb17.png)

**2) Balance:**
Another important consideration when you are laying out report objects is balance. Balance is concerned with stability and structure in design. In the context of a report layout, balance refers to the weight that is distributed across the report page by the placement of objects of the same or different sizes.

Balance can be symmetrical or asymmetrical. Symmetrical balance is achieved by distributing the weight evenly on both halves of the page. Asymmetrical balance is achieved through contrast.

Consider using the golden ratio as a guide to produce asymmetrical balance. Two quantities are in the golden ratio (approximately 1.618, and closely related to the Fibonacci sequence) if their ratio is the same as the ratio of their sum to the larger of the two quantities. For centuries, the golden ratio has influenced art and architecture to produce works that are harmonious and balanced. If applied to report design, the golden ratio suggests a page with one large visual to draw initial attention, which is then supported by smaller visuals that provide context.
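To make the golden ratio guidance concrete, the sketch below splits a page width into a larger region (for the main visual) and a smaller region (for supporting visuals). The 1280-pixel page width and the function name `golden_split` are illustrative assumptions, not values from any reporting tool.

```python
import math

# Golden ratio: (a + b) / a == a / b, which gives phi = (1 + sqrt(5)) / 2.
PHI = (1 + math.sqrt(5)) / 2


def golden_split(total_width: float) -> tuple[float, float]:
    """Split a width into a larger and a smaller part in the golden ratio."""
    larger = total_width / PHI
    smaller = total_width - larger
    return larger, smaller


# Hypothetical 1280-pixel-wide report page.
main_visual, supporting_visuals = golden_split(1280)
print(round(main_visual), round(supporting_visuals))  # roughly 791 and 489
```

The split is only a starting point for the layout; the final sizes would still be adjusted to fit the visuals themselves.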

In the following animated image, notice how the report layout initially draws your eye to the larger charts. After you've comprehended the larger charts, observe that your eye is likely drawn to the bar chart and then to the values in the cards.

**3) Proximity:**
In a report layout, proximity is concerned with the nearness of report objects. When a report page consists of multiple groups of related objects, you should use space to visually separate them.

In the following report design, notice the top-left section labeled key metrics. Related visuals are placed near one another. They are also purposefully and consistently aligned forming a clear section.
![image.png](attachment:7651fc10-291a-4623-af06-577e6d54aa48.png)

**4) Contrast:**
Contrast can be used to combine two opposing objects. The use of contrasting colors, fonts, font properties, or lines can emphasize important objects of the report design. Use this principle to direct report consumers to where they should look or which data visual they should interact with first.

![image.png](attachment:8dacaf59-d411-497f-9c3a-63f67623efec.png)

**5) Repetition:**
Repetition in a report design creates association and consistency. Good use of repetition can help strengthen a report layout by tying related report objects together.

In the following report design, notice the top-left section labeled Key metrics. Many key metrics are presented in single-value cards. This repetitious design allows report consumers to quickly understand and interpret the metrics.

![image.png](attachment:db4cb612-8d6d-4f55-b66e-b5164b821c89.png)



The primary goal of data visualization is to communicate information clearly and effectively to report consumers. That's why selecting the most effective visual type to meet requirements is critical. Selecting the wrong visual type could make it difficult for report consumers to understand the data, or worse, it could result in the misrepresentation of the data.

Visual selection can be challenging because so many visuals are available to choose from. To help you select an appropriate visual, the following sections provide tips and guidance to help you meet specific visualization requirements.

**1. Categorical Visuals**
Often, bar or column charts are good choices when you need to show data across multiple categories. Selecting which type depends on the number of categories and the kind of information that you want to visualize. For example, if many category values are available, you should avoid selecting a visual where color is used to split the data, such as a stacked bar chart with a category legend. Instead, use the category dimension on the axis of a bar chart.

Additionally, you should avoid a line chart with a categorical X-axis because the line implies a relationship between elements that might not exist. In the following example, notice that the line chart visual implies a relationship between the product categories on the X-axis.

![image.png](attachment:04770ece-4c53-45c1-b7cb-e8a12ca058d6.png)

![image.png](attachment:cbc4192f-2317-44cf-8d8c-87914632d0de.png)

**2. Time Series Visuals**
Always use a line or column chart to show values over time. The X-axis should present time, sorted from earliest to latest periods (left to right).

In the following example, a line chart shows historical sales. The line chart shows the natural flow of a timeline from left to right, eliminating the time needed to interpret the X-axis.

![image.png](attachment:d92cbab9-3493-42e2-977c-ba962c2d9c11.png)

You can bring the line chart to the next level by adding an analytics option. In this case, it applies a forecast to extend historical sales with projected sales.
![image.png](attachment:185ad2ba-2841-415d-801b-b9838c8e67c2.png)

**3. Proportional Visuals**
Proportional visuals show data as part of a whole. They effectively communicate how a value is distributed across a dimension. Column and bar chart visuals work well for visualizing proportions across multiple dimensions.

In the following example, a 100% Stacked Bar chart visual shows proportional sales across four stores. It allows you to compare each store across the six product categories. Notice that the actual sales value isn't shown. Instead, the proportion of sales is shown, allowing report consumers to determine which one is higher. (If necessary, you can reveal the actual values in a tooltip.)

![image.png](attachment:61fe0b43-f1bb-4a40-b4d6-e0140195d5e8.png)

In the next example, notice that the same information can be expressed vertically as a 100% Stacked Column chart. It yields an equivalent result.

![image.png](attachment:9d383382-74ae-40f1-b36d-5744057802f5.png)

**4. Numeric Visuals**
Often presented by card visuals, numeric values show high-level callouts that demand immediate attention. They can be powerful in dashboard and analytical reports because they communicate important data quickly.

![image.png](attachment:19bc1ef5-4623-4b8b-b9bb-2e3584a669ac.png)

**5. Grid Visuals**
Often overlooked, tables and matrices can effectively convey a lot of detailed information. Tables have a fixed number of columns, and each column can express grouped or summarized data. Matrices can have groups on columns and rows. Adding conditional formatting options, such as background colors, font colors, or icons, can enhance values with visual indicators. This extra context supports simple report consumption and can bring balance to a report page.

Additionally, matrices provide one of the best experiences for hierarchical navigation. They allow users to drill down, on the columns or rows, to discover detailed data points of interest.

![image.png](attachment:a697a327-cfe1-49f5-9483-3251daf4a586.png)

In the above example, a table visual shows sales and units sold by product. Showing these metrics together in a single visual can be a challenge because the scale of values for sales and units is so different.

In the next example, a matrix visual displays inventory by product and by store. It uses conditional formatting to show indicators, which provide visual cues to understanding the data.

![image.png](attachment:b1a4331e-3740-48b3-b1e6-d3b1ccbc28e1.png)

**Performing Analytics**

1) What visual should be used to display outliers? The scatter chart is best to display outliers.
2) What is an outlier in data analysis? A data point that significantly differs from other data points
3) What is the primary purpose of analytics in Power BI? To uncover trends, make predictions, and gain insights. 
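As a rough illustration of what "significantly differs" can mean in practice, the sketch below flags outliers with the interquartile range (IQR) rule. The IQR rule and the 1.5 multiplier are one common convention chosen here as an assumption; this is not how Power BI itself identifies outliers, and the sales figures are made up.

```python
import statistics


def find_outliers_iqr(values, k=1.5):
    """Return the values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical daily sales figures; 980 sits far away from the rest.
daily_sales = [120, 135, 128, 140, 132, 125, 980, 138, 130, 127]
print(find_outliers_iqr(daily_sales))  # → [980]
```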

**Introduction to Data Ingestion**

Imagine you're a chef preparing a gourmet meal. You wouldn't just toss all the ingredients into a pot without washing, chopping, and measuring them first, would you? Similarly, in machine learning, data ingestion is the crucial process of gathering, preparing, and "serving" your raw data to your model in a way it can digest.

Think of it as the supply chain for your AI. Just as a restaurant needs fresh ingredients delivered on time to create delicious dishes, your machine learning model needs a reliable flow of high-quality data to learn effectively and generate accurate insights.

Here's why data ingestion is so important:

**Garbage in, garbage out:** If you feed your model inaccurate, incomplete, or inconsistent data, you'll get unreliable results. Data ingestion ensures your data is clean, transformed into a usable format, and ready for analysis.

**Efficiency and performance:** A well-designed ingestion process streamlines data collection, cleaning, and transformation, saving you time and resources. It also ensures your model receives data in an optimal format for efficient processing.

**Real-time insights:** For many applications, like fraud detection or real-time recommendations, you need up-to-the-minute data. Data ingestion enables you to establish pipelines that deliver fresh data continuously, empowering your model to react to changes quickly.

**Scalability:** As your data grows, your ingestion process needs to scale accordingly. A robust strategy ensures you can handle increasing data volumes and variety without compromising performance or accuracy.

-------------
In essence, data ingestion is the foundation of any successful machine learning project. By mastering this process, you ensure your models are fueled with the high-quality data they need to thrive and generate valuable insights.

After specifying the project goal, say, predicting customer churn (the loss of customers for any reason) for a telecommunications company, the next step is to figure out how to gather the necessary data to fuel our machine learning model.

We’ll need to identify data sources and decide how to serve the models with data before designing the actual ingestion solution.

-------------
**Identifying Data Sources and Formats**

Data is the most important input for your machine learning models. You’ll need access to data when training machine learning models, and the trained model needs data as input to generate predictions.

Imagine you're a data scientist and have been asked to train a machine learning model.

You aim to go through the following six steps to plan, train, deploy, and monitor the model:

**1) Define the problem:** Decide on what the model should predict and when it's successful.
**2) Get the data:** Find data sources and get access.
**3) Prepare the data:** Explore the data. Clean and transform the data based on the model's requirements.
**4) Train the model:** Choose an algorithm and hyperparameter values based on trial and error.
**5) Integrate the model:** Deploy the model to an endpoint to generate predictions.
**6) Monitor the model:** Track the model's performance.

To get and prepare the data you'll use to train the machine learning model, you'll need to extract data from a source and make it available to the Azure service you want to use to train models or make predictions.

In general, it’s a best practice to extract data from its source before analyzing it. Whether you’re using the data for data engineering, data analysis, or data science, you’ll want to extract the data from its source, transform it, and load it into a serving layer. Such a process is also referred to as Extract, Transform, and Load (ETL) or Extract, Load, and Transform (ELT). The serving layer makes your data available for the service you’ll use for further data processing like training machine learning models.

Before being able to design the ETL or ELT process, you’ll need to identify your data source and data format.

**Identifying the Data Source**
When you start with a new machine learning project, first identify where the data you want to use is stored.

The necessary data for your machine learning model may already be stored in a database or be generated by an application. For example, the data may be stored in a Customer Relationship Management (CRM) system, in a transactional database like an SQL database, or be generated by an Internet of Things (IoT) device.

In other words, your organization may already have business processes in place, which generate and store the data. If you don’t have access to the data you need, there are alternative methods. You can collect new data by implementing a new process, acquire new data by using publicly available datasets, or buy curated datasets.

**Identifying the Format**
Based on the source of your data, your data may be stored in a specific format. You need to understand the current format of the data and determine the format required for your machine learning workloads.

Commonly, we refer to three different formats:

**Tabular or structured data:** All data has the same fields or properties, which are defined in a schema. Tabular data is often represented in one or more tables where columns represent features and rows represent data points. For example, an Excel or CSV file can be interpreted as tabular data:

![image.png](attachment:71840412-22a6-4900-9aa2-5d26b089a870.png)

**Semi-structured data:** Not all data has the same fields or properties. Instead, each data point is represented by a collection of key-value pairs. The keys represent the features, and the values represent the properties for the individual data point. For example, real-time applications like Internet of Things (IoT) devices generate a JSON object: 

{ "deviceId": 29482, "location": "Office1", "time":"2021-07-14T12:47:39Z", "temperature": 23 }

**Unstructured data:** Files that don't adhere to any rules when it comes to structure. For example, documents, images, audio, and video files are considered unstructured data. Storing them as unstructured files ensures you don’t have to define any schema or structure, but also means you can't query the data in the database. You'll need to specify how to read such a file when consuming the data.

**Identifying the Desired Format**
When extracting the data from a source, you may want to transform the data to change the data format and make it more suitable for model training.

For example, you may want to train a forecasting model to perform predictive maintenance on a machine. You want to use features such as the machine's temperature to predict a problem with the machine. If you get an alert that a problem is arising, before the machine breaks down, you can save costs by fixing the problem early on.

Imagine the machine has a sensor that measures the temperature every minute. Each minute, every measurement or entry can be stored as a JSON object or file.

To train the forecasting model, you may prefer one table in which all temperature measurements of each minute are combined. You may want to create aggregates of the data and have a table of the average temperature per hour. To create the table, you'll want to transform the semi-structured data ingested from the IoT device to tabular data.

To create a dataset you can use to train the forecasting model, you may:
    1) Extract data measurements as JSON objects from the IoT devices.
    2) Convert the JSON objects to a table.
    3) Transform the data to get the temperature per machine per minute.
    
![image.png](attachment:3452a3a0-15b4-4c61-ad65-c08dbeee1fa5.png)
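The steps above can be sketched with the Python standard library. This is a minimal illustration under stated assumptions: the messages are hypothetical and shaped like the JSON example earlier in these notes, the function names are made up, and the aggregation shown is the hourly average mentioned in the text rather than a per-minute table.

```python
import json
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Step 1 (assumed already done): raw JSON messages extracted from IoT devices.
raw_messages = [
    '{"deviceId": 29482, "location": "Office1", "time": "2021-07-14T12:47:39Z", "temperature": 23}',
    '{"deviceId": 29482, "location": "Office1", "time": "2021-07-14T12:48:39Z", "temperature": 25}',
    '{"deviceId": 29482, "location": "Office1", "time": "2021-07-14T13:02:10Z", "temperature": 27}',
    '{"deviceId": 31007, "location": "Office2", "time": "2021-07-14T12:50:00Z", "temperature": 21}',
]


def to_rows(messages):
    """Step 2: convert JSON objects to table rows with a fixed set of columns."""
    rows = []
    for message in messages:
        record = json.loads(message)
        rows.append({
            "deviceId": record["deviceId"],
            "time": datetime.strptime(record["time"], "%Y-%m-%dT%H:%M:%SZ"),
            "temperature": record["temperature"],
        })
    return rows


def hourly_average(rows):
    """Step 3: aggregate to the average temperature per device per hour."""
    grouped = defaultdict(list)
    for row in rows:
        hour = row["time"].replace(minute=0, second=0)
        grouped[(row["deviceId"], hour)].append(row["temperature"])
    return {key: mean(temps) for key, temps in sorted(grouped.items())}


for (device_id, hour), avg in hourly_average(to_rows(raw_messages)).items():
    print(device_id, hour.isoformat(), avg)
```

In a real project the same transformation would more likely be done with a dataframe library or a managed ingestion pipeline; the point here is only the change of shape from key-value records to one aggregated table.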

Once you’ve identified the data source, the original data format, and the desired data format, you can think about how you want to serve the data. Then, you can design a data ingestion pipeline to automatically extract and transform the data you need.

Every minute, a JSON object is extracted from an Internet of Things (IoT) device. What is the type of data that is extracted? - Semi-structured

**Identifying Machine Learning Tasks**

**1. Define the problem:** Decide on what the model should predict and when it's successful. 

**2. Get the data:** Find data sources and get access 

**3. Prepare the data:** Explore the data. Clean and transform the data based on the model's requirements. 

**4. Train the model:** Choose an algorithm and hyperparameter values based on trial and error. 

**5. Integrate the model:** Deploy the model to an endpoint to generate predictions 

**6. Monitor the model:** Track the model's performance 

Starting the first step, you want to define the problem the model will solve by understanding: 
    + What should the model's output be?
    + What type of machine learning task will you use?
    + What criteria make a model successful?

Depending on the data you have and the expected output of the model, you can identify the machine learning task. The task will determine which types of algorithms you can use to train the model. 

**Some common machine learning tasks are:**
**1) Classification:** Predict a categorical value 
**2) Regression:** Predict a numerical value 
**3) Time-series forecasting:** Predict future numerical values based on time-series data 
**4) Computer vision:** Classify images or detect objects in images 
**5) Natural language processing (NLP):** Extract insights from text

**To train a model**, you have a set of algorithms that you can use, depending on the task you want to perform. 

**To evaluate the model**, you can calculate performance metrics such as accuracy or precision. 

The metrics available will also depend on the task your model needs to perform and will help you to decide whether a model is successful in its task. 
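For a classification task, accuracy and precision can be computed directly from the true and predicted labels. The sketch below uses plain Python so the definitions stay visible; the labels are hypothetical, and in practice a library such as scikit-learn would normally supply these metrics.

```python
def accuracy(y_true, y_pred):
    """Share of all predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision(y_true, y_pred, positive=1):
    """Share of predicted positives that are truly positive."""
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    predicted_pos = sum(p == positive for p in y_pred)
    return true_pos / predicted_pos if predicted_pos else 0.0


# Hypothetical churn labels: 1 = churned, 0 = stayed.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

print(accuracy(y_true, y_pred))   # → 0.625 (5 of 8 predictions correct)
print(precision(y_true, y_pred))  # → 0.6   (3 of 5 predicted positives correct)
```

The two numbers differ because accuracy counts every prediction, while precision only looks at the cases the model labeled as positive.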

When you know what problem you're trying to solve and how you'll assess the success of your model, you can choose the service to train and manage your model. 

**Choosing a Service to Train the Model**
There are many services available to train machine learning models. Which service you use depends on factors like: 
+ What type of model you need to train
+ Whether you need full control over model training
+ How much time you want to invest in model training
+ Which services are already within your organization
+ Which programming language you're comfortable with

**Choosing the Right Option**

+ When you want to train your own model, the most valuable resource you'll consume is compute. Especially during model training, it's important to choose the most suitable compute. Additionally, you should monitor compute utilization to know when to scale up or down to save on time and costs.

**CPU or GPU**
One important decision to make when configuring compute is whether you want to use a central processing unit (CPU) or a graphics processing unit (GPU). For smaller tabular datasets, CPU will be sufficient and cheaper to use. Whenever working with unstructured data like images or text, GPUs will be more powerful and effective.

For larger amounts of tabular data, it may also be beneficial to use GPUs. When processing your data and training your model takes a long time, even with the largest CPU compute available, you may want to consider using GPU compute instead. There are libraries such as RAPIDS (developed by NVIDIA) which allow you to efficiently perform data preparation and model training with larger tabular datasets. As GPUs come at a higher cost than CPUs, it may require some experimentation to explore whether using GPUs will be beneficial for your situation.

**General Purpose or Optimized Memory**
When you create compute resources for machine learning workloads, there are 2 common types of virtual machine sizes you can choose from: 

**+ General purpose:** Have a balanced CPU-to-memory ratio. Ideal for testing and development with smaller datasets. 
**+ Memory optimized:** Have a high memory-to-CPU ratio. Great for in-memory analytics, which is ideal when you have larger datasets or when you're working in notebooks. 

**Spark**
Spark compute or clusters use the same sizing as virtual machines in Azure but distribute the workloads. 

A Spark cluster consists of a driver node and worker nodes. Your code will initially communicate with the driver node. The work is then distributed across the worker nodes. When you use a service that distributes the work, parts of the workload can be executed in parallel, reducing the processing time. Finally, the work is summarized and the driver node communicates the result back to you. 

To make optimal use of a Spark cluster, your code needs to be written in a Spark-friendly language like **Scala, SQL, SparkR, or PySpark** in order to distribute the workload. If you write in plain Python, you'll only use the driver node and leave the worker nodes unused. 

When you create a Spark cluster, you'll have to choose whether you want to use CPU, or GPU compute. You'll also have to choose the virtual machine size for the driver and worker nodes. 

**Monitoring the Compute Utilization**
Configuring your compute resources for training a machine learning model is an iterative process. When you know how much data you have and how you want to train your model, you'll have an idea of which compute options may best suit training your model. 

Every time you train a model, you should monitor **how long it takes to train the model** and **how much compute is used to execute your code**. By monitoring the compute utilization, you'll know **whether to scale your compute up or down**. 
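A very small local sketch of this kind of monitoring is shown below: it wraps a training function and records the elapsed time and peak Python memory. The helper `run_and_measure` and the `dummy_training` stand-in are hypothetical; `tracemalloc` only tracks memory allocated by Python itself, so on a managed compute target you would rely on the platform's own utilization metrics instead.

```python
import time
import tracemalloc


def run_and_measure(train_fn, *args, **kwargs):
    """Run train_fn and return (result, elapsed seconds, peak Python memory in MiB)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = train_fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, elapsed, peak_bytes / 2**20


def dummy_training(n_samples):
    """Hypothetical stand-in for a real training routine."""
    data = [i * 0.5 for i in range(n_samples)]
    return sum(data) / len(data)


model, seconds, peak_mib = run_and_measure(dummy_training, 200_000)
print(f"trained in {seconds:.3f}s, peak memory {peak_mib:.1f} MiB")
```

Logging these two numbers for every training run gives a simple baseline for deciding whether the compute is oversized or undersized.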

If training your model takes too long, even with the largest compute size, you may want to use GPUs instead of CPUs. 

Alternatively, you can choose to distribute model training by using Spark compute, which may require you to rewrite your training scripts. 

1. What are the key steps involved in defining the problem a machine learning model should solve?

Defining the problem involves understanding the desired output of the model, identifying the specific machine learning task (classification, regression, etc.), and establishing clear criteria for evaluating the model's success.

2. Explain the difference between using CPUs and GPUs for model training in Azure Machine Learning.

CPUs are generally suitable for smaller datasets and less computationally intensive tasks, while GPUs excel at handling large datasets and complex models, especially those involving image or text data. The choice depends on the specific needs and budget of the project.

3. What are the advantages of using Spark compute for machine learning tasks in Azure?

Spark compute allows for distributed processing, which can significantly reduce training time for large datasets by dividing the workload across multiple nodes. This improves efficiency and scalability for computationally intensive machine learning tasks.

4. A data scientist wants to train a machine learning model to predict the sales of supermarket items to adjust the supply to the projected demand. What type of machine learning task will the model perform and why?

Time-series forecasting is the best option in this case since it is designed for working with demand projections.

5. The data scientist received data to train a model to predict the sales of supermarket items. The data scientist wants to quickly iterate over several featurization and algorithm options by only providing the data and editing some configurations. Which tool would best be used in this situation and why?

Automated Machine Learning (AutoML) in Azure Machine Learning. This way, you'll only have to provide the data, and Automated Machine Learning will iterate over different featurization approaches and algorithms.



**1. What is the primary purpose of deploying a machine learning model to an endpoint?**

To make the model's predictions accessible to applications and users. 

**2. What is a common use case for real-time predictions from a deployed machine learning model?**

Providing instant recommendations on an e-commerce website. 

**3. What is an advantage of using container technologies like Azure Container Instances (ACI) or Azure Kubernetes Service (AKS) for deploying machine learning models?**

They provide a lightweight and scalable infrastructure for real-time predictions. 

**4. What are some benefits of using a central repository like Git for managing machine learning code?**

It enables collaboration, version control, and easier retraining of models. 

**5. When should we retrain the model?**

When the model's metrics are below the benchmark. 

**Introduction to Compute Targets**

In Azure Machine Learning, you can use various types of managed cloud computes. By using any of the compute options in the Azure Machine Learning workspace, you can save time on managing compute. 

**Choosing the Target**
In Azure Machine Learning, compute targets are physical or virtual computers on which jobs are run. 

There are multiple types of compute for **experimentation, training and deployment:**

**1) Compute instance:** Behaves similarly to a virtual machine and is primarily used to run notebooks. It's ideal for experimentation.

**2) Compute clusters:** Multi-node clusters of virtual machines that automatically scale up or down to meet demand. A cost-effective way to run scripts that need to process large volumes of data. Clusters also allow you to use parallel processing to distribute the workload and reduce the time it takes to run a script.

**3) Kubernetes clusters:** Cluster based on Kubernetes technology, giving you more control over how the compute is configured and managed. You can attach your self-managed Azure Kubernetes Service (AKS) cluster for cloud compute, or an Arc Kubernetes cluster for on-premises workloads.

**4) Attached compute:** Allows you to attach existing compute like Azure virtual machines or Azure Databricks clusters to your workspace.

**5) Serverless compute:** A fully managed, on-demand compute you can use for training jobs.

**1) What is the primary benefit of using AutoML for finding the best classification model?**

AutoML automates the process of experimenting with different algorithms and hyperparameters, saving time and effort in finding the best performing model for a given dataset and task.

**2) What is the purpose of featurization in an AutoML experiment?**

Featurization prepares the data for model training by applying transformations such as scaling, normalization, missing value imputation, and categorical encoding. This ensures the data is in a suitable format for the machine learning algorithms.

**3) What is the purpose of the primary_metric setting in an AutoML experiment?**

To define the target performance metric for evaluating models. 

**4) What is the purpose of the set_limits() function in an AutoML experiment?**

To control the duration and resource consumption of the experiment.

**5) What can you do if AutoML detects a class imbalance in your training data?**

Review the data and consider techniques to address the class imbalance. 

**6) What are data guardrails in AutoML, and how do they help improve model training?**

Data guardrails are automated checks and corrections applied by AutoML to address potential issues in the training data, such as class imbalance, missing values, or high cardinality features. They help improve data quality and model performance.

**7) You want to train a diabetes classification model with the scikit-learn library. You want to focus on experimenting with the model, and minimize the effort needed to log the model's results. Which logging method should you use?**

You should use autologging with scikit-learn. Enabling autologging minimizes the effort needed to log the model's results.

**1) A data scientist has trained a model in a notebook. The model should be retrained every week on new data. What should the data scientist do to make the code production-ready?**

Convert the code to multiple functions in a script that read the data and train the model. 

**2) A data scientist wants to run a script as a command job to train a PyTorch model, setting the batch size and learning rate hyperparameters to specified values each time the job runs. What should be done by the data scientist?**

Add arguments for batch size and learning rate to the script, and set them in the command property of the command job. 
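A minimal sketch of such a script is shown below. It splits the work into functions and reads the two hyperparameters with `argparse`. The argument names, default values, and the placeholder `train` function are assumptions for illustration; the real training loop would replace the placeholder.

```python
import argparse


def parse_args(argv=None):
    """Read hyperparameters from the command line (or from an explicit list)."""
    parser = argparse.ArgumentParser(description="Train a model (sketch).")
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--learning_rate", type=float, default=0.001)
    return parser.parse_args(argv)


def train(batch_size, learning_rate):
    """Placeholder for the real PyTorch training loop."""
    return {"batch_size": batch_size, "learning_rate": learning_rate}


def main(argv=None):
    args = parse_args(argv)
    return train(args.batch_size, args.learning_rate)


if __name__ == "__main__":
    # Explicit values for the demo; call main() with no arguments to read sys.argv instead.
    print(main(["--batch_size", "64", "--learning_rate", "0.01"]))
```

With a script like this, the command property of the command job could pass the values on the command line, for example `python train.py --batch_size 64 --learning_rate 0.01`, where the file name `train.py` is hypothetical.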

**3) What is the main advantage of using scripts instead of notebooks for production machine learning workloads?**

Scripts are easier to automate, test, and maintain. 

**1) What are the 2 main types of model signatures in MLflow?**

+ Column-based and tensor-based

**2) What is the purpose of a model signature in MLflow?**

+ To define the schema of the model's inputs and outputs.

**3) What is a "flavor" in MLflow?** 

+ The machine learning library or framework used to create the model.

**4) What is the purpose of the MLmodel file in MLflow?**

+ To provide metadata and instructions for loading and using the model.

**5) What is the advantage of using MLflow for model registration and deployment?**

+ It standardizes model packaging and simplifies deployment across different environments.

**6) What is MLflow, and how does it relate to Azure Machine Learning?**

+ MLflow is an open-source platform for managing the machine learning lifecycle, integrated with Azure Machine Learning.

**7) What is the primary purpose of registering a machine learning model in Azure Machine Learning?**

+ To store and manage the model for deployment and reuse.

------------------------------------
In Azure Machine Learning, MLflow can be used to log metrics, artifacts, and model versions, ensuring reproducibility and facilitating comparison between different experiments. By studying its features, data professionals can effectively manage their machine learning workflows and gain valuable insights into model performance and behavior.

When it comes to developing artificial intelligence models, there’s one crucial aspect that shouldn’t be overlooked: responsibility.

**Exploring Responsible AI**

**1. What is the purpose of the Responsible AI dashboard in Azure Machine Learning?**

The Responsible AI dashboard provides insights and tools to help data scientists assess and mitigate potential issues related to fairness, explainability, and error analysis in their machine learning models.

**2. What are some of the Responsible AI principles that data scientists should consider when developing and deploying machine learning models?**

Key principles include fairness, reliability and safety, privacy and security, transparency, and accountability.

**3. How can you create a Responsible AI dashboard in Azure Machine Learning?**

You can create a pipeline using built-in components, including the RAI Insights dashboard constructor, one or more RAI tool components (like explanation or error analysis), and the Gather RAI Insights dashboard component.

**4. What are some of the insights that can be generated by the Responsible AI dashboard?**

The dashboard can provide insights on error analysis, model explanations, counterfactuals, and causal analysis.

**5. A data scientist wants to investigate for which subgroups the model has relatively more false predictions. Which Responsible AI component should be added to the pipeline to create the Responsible AI dashboard, and what does it provide?**

Error analysis. It provides an overview of the number of false predictions for specific subgroups in your dataset.
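
The idea behind error analysis can be sketched in plain Python. This is only an illustration of counting false predictions per subgroup, not the dashboard's own implementation; the `age_band` column, the sample records, and the helper name `error_rate_by_subgroup` are all hypothetical.

```python
from collections import defaultdict

def error_rate_by_subgroup(records, group_key):
    """Return {subgroup: (errors, total, error_rate)} for labeled predictions."""
    counts = defaultdict(lambda: [0, 0])  # subgroup -> [errors, total]
    for row in records:
        stats = counts[row[group_key]]
        stats[1] += 1
        if row["predicted"] != row["actual"]:
            stats[0] += 1
    return {g: (e, n, e / n) for g, (e, n) in counts.items()}

# Hypothetical predictions with one candidate subgroup column.
records = [
    {"age_band": "18-30", "actual": 1, "predicted": 1},
    {"age_band": "18-30", "actual": 0, "predicted": 0},
    {"age_band": "18-30", "actual": 1, "predicted": 0},
    {"age_band": "31-50", "actual": 0, "predicted": 0},
    {"age_band": "31-50", "actual": 1, "predicted": 1},
    {"age_band": "51+", "actual": 1, "predicted": 0},
    {"age_band": "51+", "actual": 0, "predicted": 1},
]

report = error_rate_by_subgroup(records, "age_band")

# List subgroups from the highest error rate to the lowest.
for group, (errors, total, rate) in sorted(report.items(), key=lambda kv: -kv[1][2]):
    print(f"{group}: {errors}/{total} wrong ({rate:.0%})")
```

A subgroup with a much higher error rate than the rest is a candidate for closer inspection, for example more training data or additional features for that group.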

**6. What should be the first component in a pipeline to create a Responsible AI Dashboard?**

The RAI Insights dashboard constructor.

**7. A data scientist has trained a model, and wants to quantify the influence of each feature on a specific prediction. What kind of feature importance should the data scientist examine?**

Individual feature importance. It shows how each feature influences an individual prediction.
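
For intuition, individual (local) feature importance is easiest to see in a linear model, where each feature's contribution to one prediction is its weight times how far the feature value is from a baseline. The sketch below assumes a hypothetical house-price model with made-up weights and a made-up baseline; the Responsible AI dashboard uses its own explainers, which this does not reproduce.

```python
# Hypothetical linear price model: price = bias + sum(weight * feature).
weights = {"bedrooms": 10_000.0, "size_sqm": 1_500.0, "age_years": -500.0}
bias = 50_000.0

# Baseline input, e.g. the average house in the training data (assumed values).
baseline = {"bedrooms": 3, "size_sqm": 100, "age_years": 20}
house = {"bedrooms": 4, "size_sqm": 120, "age_years": 10}

def predict(x):
    """Prediction of the linear model for one input."""
    return bias + sum(weights[name] * x[name] for name in weights)

def local_contributions(instance):
    """Contribution of each feature to this prediction, relative to the baseline."""
    return {name: weights[name] * (instance[name] - baseline[name]) for name in weights}

contribs = local_contributions(house)
for name, value in contribs.items():
    print(f"{name}: {value:+.0f}")

# The contributions add up to the gap between this prediction and the baseline prediction.
print(predict(house) - predict(baseline), sum(contribs.values()))
```

Aggregate feature importance, by contrast, summarizes influence over the whole dataset rather than for one row.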



**Data Analytics with Fabric**

1) The primary advantage of using Apache Spark for data processing: It enables distributed computing, allowing for efficient processing of large datasets.
2) Spark pool: A cluster of compute nodes used to execute Spark applications.
3) The benefit of enabling high concurrency mode for Spark in Fabric: It allows multiple users or processes to share Spark sessions, improving efficiency.
4) Interacting with and executing Spark code in Fabric: Use notebooks for interactive analysis, or define Spark jobs for automated execution.
5) To use Apache Spark to explore data interactively in Fabric: Create a notebook.
6) The method used to split the data across folders when saving a dataframe: partitionBy
7) The simplest way to use Spark to analyze data in a CSV file: load the file into a dataframe.
8) Data Wrangler is used to accelerate data exploration and cleansing in Fabric.
9) Forecasting is used to predict the expenses for the coming month.
10) An experiment in the context of MLflow tracking: A collection of runs that represent different variations of a machine learning task
11) The purpose of Data Wrangler in Fabric: To simplify data exploration and cleaning with a user-friendly interface.
12) The advantage of using a lakehouse in Fabric for data science projects: It provides a centralized and scalable platform for storing and accessing data.
13) The purpose of using MLflow in a data science project is to track experiments, manage models, and ensure reproducibility.
14) Some common types of machine learning models used in data science: Classification, regression, clustering, and forecasting.
15) The primary goal of data science: To extract knowledge and insights from data to solve problems and make predictions.
16) A data warehouse provides a centralized and structured repository for analytical data, enabling efficient querying, reporting, and analysis to support business decision-making.
17) Fabric's data warehouse is built on a lakehouse architecture, allowing it to integrate with diverse data formats and support collaboration between data engineers, analysts, and data scientists within a unified environment.
18) Fact tables store numerical data related to events or transactions, while dimension tables contain descriptive attributes that provide context to the facts.
19) The type of table that stores attributes used to group numeric measures: A dimension table.
20) A semantic model is a business-oriented data model that provides a consistent and reusable representation of data across the organization. It provides a way to organize and structure data in a way that is meaningful to business users, enabling them to easily access and analyze data.
21) The purpose of item permissions: To grant access to individual warehouses for downstream consumption. Granting access to a single data warehouse through item permissions lets downstream users consume its data.
22) The purpose of a surrogate key in a data warehouse: To provide a unique, system-generated identifier for each row in a dimension table.
23) The difference between a full load and an incremental load in a data warehouse: A full load loads all data, while an incremental load only loads changes since the last update.
24) The purpose of a staging area in a data warehouse: To provide a temporary storage and processing area for data before loading it into the final tables
25) The primary purpose of loading data into a data warehouse: To prepare data for analysis, reporting, and business intelligence.
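
The difference between a full load and an incremental load (item 23) can be sketched with a "watermark", the timestamp of the last change already loaded. This is a minimal sketch under assumptions: the `modified_at` column, the in-memory dictionary standing in for the warehouse table, and both function names are hypothetical, and deleted source rows are not handled.

```python
from datetime import datetime

def full_load(source_rows, target):
    """Replace everything in the target with the current source contents."""
    target.clear()
    for row in source_rows:
        target[row["id"]] = row
    return len(source_rows)

def incremental_load(source_rows, target, last_loaded_at):
    """Load only rows changed after the previous load; return (count, new watermark)."""
    changed = [r for r in source_rows if r["modified_at"] > last_loaded_at]
    for row in changed:
        target[row["id"]] = row  # inserts new ids, overwrites changed ones
    new_watermark = max((r["modified_at"] for r in changed), default=last_loaded_at)
    return len(changed), new_watermark

source = [
    {"id": 1, "amount": 10, "modified_at": datetime(2024, 1, 1)},
    {"id": 2, "amount": 20, "modified_at": datetime(2024, 1, 2)},
]
warehouse = {}
full_load(source, warehouse)
watermark = max(r["modified_at"] for r in source)

# Later: row 2 changes and row 3 arrives in the source system.
source[1] = {"id": 2, "amount": 25, "modified_at": datetime(2024, 1, 5)}
source.append({"id": 3, "amount": 30, "modified_at": datetime(2024, 1, 6)})

loaded, watermark = incremental_load(source, warehouse, watermark)
print(loaded, watermark)
```

Only the two changed rows are processed on the second run, which is why incremental loads are usually preferred once tables grow large.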

**Querying a Data Warehouse**
1) SQL (Structured Query Language) is the primary language used for querying a data warehouse.
2) Indexing is a data structure technique to efficiently retrieve records from the database files based on some attributes on which the indexing has been done.
3) The purpose of a fact table in a data warehouse: To store numeric measures of events or transactions (often the results of calculations, such as quantities and totals), along with foreign keys to dimension tables.
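
A typical warehouse query joins a fact table to a dimension table and aggregates the measures by a dimension attribute. The sketch below uses Python's built-in `sqlite3` module as a stand-in for a real warehouse; the table names, columns, and sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    product_code TEXT NOT NULL,                      -- business (natural) key
    category     TEXT NOT NULL                       -- descriptive attribute
);
CREATE TABLE fact_sales (
    product_key INTEGER NOT NULL REFERENCES dim_product(product_key),
    quantity    INTEGER NOT NULL,                    -- numeric measure
    amount      REAL NOT NULL                        -- numeric measure
);
""")

conn.executemany(
    "INSERT INTO dim_product (product_code, category) VALUES (?, ?)",
    [("P-100", "Bikes"), ("P-200", "Bikes"), ("P-300", "Helmets")],
)
conn.executemany(
    "INSERT INTO fact_sales (product_key, quantity, amount) VALUES (?, ?, ?)",
    [(1, 2, 1000.0), (2, 1, 750.0), (3, 4, 120.0), (1, 1, 500.0)],
)

# Group the numeric measures in the fact table by a dimension attribute.
rows = conn.execute("""
    SELECT d.category, SUM(f.quantity) AS units, SUM(f.amount) AS revenue
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_key = f.product_key
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()

for category, units, revenue in rows:
    print(category, units, revenue)
```

The fact table holds only keys and measures; everything used for grouping or filtering (here `category`) lives in the dimension table.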

**Securing a data warehouse**
1) The purpose of a security predicate function in Row-Level Security (RLS): It determines whether a row is accessible to a user based on certain conditions.
2) The primary advantage of Dynamic Data Masking (DDM): It limits data exposure by obscuring sensitive information in real time.
3) The four fundamental permissions that govern data manipulation language (DML) operations in a relational database: SELECT, INSERT, UPDATE, DELETE
4) The advantage of using column-level security compared to views for restricting access to sensitive columns: Column-level security is generally more efficient and transparent.
5) A security predicate in the context of row-level security: A function that defines the conditions for filtering rows based on user roles.
6) The purpose of dynamic data masking (DDM): To hide sensitive data from unauthorized users without altering the original data.
7) The primary goal of securing a data warehouse: To protect sensitive data and prevent unauthorized access or modification.
8) Data warehouses are like the central libraries of the data world, storing vast amounts of information for analysis and reporting. They play a crucial role in helping businesses make sense of their data and extract valuable insights to drive informed decision-making.
9) Data warehouses are also essential for working with semantic models, as they provide the underlying data and structure for these models to operate on. A well-designed data warehouse ensures that the semantic model can accurately represent and analyze the data.
10) By prioritizing security and implementing robust safeguards, data professionals can protect their organization's valuable data assets and ensure that data remains a source of trust and informed decision-making.
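
To make row-level security and dynamic data masking more concrete, here is a conceptual sketch in plain Python. In a real warehouse both are defined in SQL on the server; the function names, the `role` field, and the masking format below are assumptions chosen for illustration only.

```python
def security_predicate(row, user):
    """Row-level security: return True when this user may see this row."""
    return user["role"] == "manager" or row["sales_rep"] == user["name"]

def mask_email(email):
    """Dynamic masking: keep the first character and the top-level domain."""
    local, _, domain = email.partition("@")
    return f"{local[:1]}XXX@XXXX.{domain.rsplit('.', 1)[-1]}"

def query(rows, user):
    """Apply the row filter first, then mask sensitive columns for non-managers."""
    visible = [r for r in rows if security_predicate(r, user)]
    if user["role"] == "manager":
        return visible
    # Build masked copies so the stored data itself is never altered.
    return [{**r, "email": mask_email(r["email"])} for r in visible]

orders = [
    {"order_id": 1, "sales_rep": "ana", "email": "alice@contoso.com"},
    {"order_id": 2, "sales_rep": "ben", "email": "bob@example.org"},
]

rep_view = query(orders, {"name": "ana", "role": "rep"})
manager_view = query(orders, {"name": "kim", "role": "manager"})
print(rep_view)
print(manager_view)
```

The sales rep sees only their own row with the email masked, while the manager sees every row unmasked, and the underlying `orders` data is unchanged in both cases.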

**Exploring Data with Notebooks**
1) Notebooks are interactive documents that combine code, text, and visualizations, providing a flexible environment for data exploration, analysis, and collaboration.
2) Notebooks allow for iterative experimentation, easy sharing of results, and the integration of code with explanatory text and visualizations, promoting reproducibility and collaboration.
3) You can use the spark.read command to load data from various sources, such as files in your lakehouse or data warehouse, specifying the file format and path.
4) Common visualizations include line charts, bar charts, scatter plots, histograms, and heatmaps, which can be created using libraries like Matplotlib or Seaborn.
5) Microsoft Fabric notebooks currently support four Apache Spark languages, which are PySpark (Python), Spark (Scala), Spark SQL, and SparkR.
6) The range of a uniform distribution is the interval [a, b], where 'a' is the minimum value and 'b' is the maximum value.
7) When data is Missing at Random (MAR), the missingness of data is related to some other variables' values but not the missing data itself. For example, if women are more likely to disclose their number of daily steps than men, then the daily steps data is MAR.
8) The correlation coefficient ranges from -1 to 1. A coefficient of 1 indicates a perfect positive linear correlation, -1 indicates a perfect negative linear correlation, and 0 indicates no linear correlation.
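
The three reference values of the correlation coefficient can be checked with a small hand-written Pearson function. This is a sketch using only the standard library; the sample data is made up. Note the last case: a coefficient of 0 rules out a linear relationship, not every relationship.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]
rising = [2 * v + 1 for v in x]       # perfect positive linear relationship
falling = [10 - 3 * v for v in x]     # perfect negative linear relationship
u_shaped = [(v - 3) ** 2 for v in x]  # clear pattern, but not a linear one

r_rising = pearson_r(x, rising)
r_falling = pearson_r(x, falling)
r_u_shaped = pearson_r(x, u_shaped)
print(r_rising, r_falling, r_u_shaped)
```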

**Preprocessing Data**

+ **Data Exploration:** Notebooks allow data scientists to visually explore data, identify patterns, and understand its characteristics. This exploration helps inform the preprocessing steps needed to prepare the data for machine learning algorithms.

+ **Data Cleaning:** Notebooks facilitate data cleaning tasks, such as handling missing values, removing duplicates, and correcting inconsistencies. These cleaning operations are essential for ensuring data quality before training machine learning models.

+ **Data Transformation:** Notebooks enable data scientists to apply various transformations to the data, such as scaling, normalization, and feature engineering. These transformations can significantly impact the performance and accuracy of machine learning models.

+ **Code-Based Manipulation:** Notebooks allow for code-based data manipulation using languages like Python or R, providing flexibility and control over the preprocessing steps.

+ **Visualization:** Notebooks enable the visualization of preprocessed data, allowing data scientists to assess the effectiveness of their transformations and identify further areas for improvement.
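
Two of the transformations mentioned above, min-max normalization and standardization (z-scores), can be sketched with the standard library. In a notebook you would more likely use a library such as scikit-learn (`MinMaxScaler`, `StandardScaler`); the helper names and the income values here are illustrative, and both helpers assume the column is not constant.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift and scale values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

incomes = [20_000, 35_000, 50_000, 80_000]

scaled = min_max_scale(incomes)
z_scores = standardize(incomes)

print(scaled)
print([round(z, 3) for z in z_scores])
```

Scaling matters most for algorithms that compare distances or use gradient descent, where a feature measured in tens of thousands would otherwise dominate one measured in single digits.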

---------------------------------------
Imagine you're a teacher analyzing student test scores. You have a spreadsheet with student names, grades, and attendance records. Using Data Wrangler, you can quickly:

+ **Explore the data:** Visualize the distribution of grades, identify top and bottom performers, and see if there are any correlations between grades and attendance.

+ **Handle missing data:** If some students have missing grades or attendance records, you can use Data Wrangler to fill in those gaps with estimated values or exclude those students from the analysis.

+ **Transform data:** You can create new columns, such as calculating the average grade for each student or categorizing students based on their performance levels.
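
The same steps can be written out by hand to show what is happening to the data. The sketch below is plain Python rather than anything Data Wrangler produces; the student records, the mean-fill strategy, and the grade thresholds are assumptions for illustration.

```python
from statistics import mean

students = [
    {"name": "Ana", "grade": 92, "attendance": 0.98},
    {"name": "Ben", "grade": None, "attendance": 0.80},  # missing grade
    {"name": "Cy", "grade": 58, "attendance": 0.65},
    {"name": "Dee", "grade": 75, "attendance": 0.90},
]

# Handle missing data: fill gaps with the mean of the observed grades.
observed = [s["grade"] for s in students if s["grade"] is not None]
fill_value = mean(observed)

def performance_level(grade):
    """Categorize a grade (thresholds are arbitrary, for illustration)."""
    if grade >= 85:
        return "high"
    if grade >= 60:
        return "medium"
    return "low"

# Transform data: build cleaned copies with a new derived column.
cleaned = []
for s in students:
    grade = fill_value if s["grade"] is None else s["grade"]
    cleaned.append({**s, "grade": grade, "level": performance_level(grade)})

for s in cleaned:
    print(s["name"], s["grade"], s["level"])
```

Filling with the mean keeps every student in the analysis but hides the fact that a value was estimated; excluding those rows is the other option mentioned above.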



**Working with models in MLflow**

+ The feature in Fabric that allows you to write and run code to train models: Notebooks
+ You have a prepared dataset loaded into a notebook and want to train a classification model: you should SPLIT THE DATA before training the model.
+ The purpose of one-hot encoding in machine learning: To convert categorical data into a format that can be provided to machine learning algorithms to improve prediction accuracy.
+ The types of data Data Wrangler currently supports: Both Pandas and Spark DataFrames.
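
Splitting the data and one-hot encoding can both be sketched without any libraries. In practice you would normally reach for scikit-learn or pandas; `split_rows` and `one_hot` below are hypothetical helper names, and the 25% test fraction and the seed are arbitrary choices.

```python
import random

def one_hot(values):
    """Return (categories, rows), with one 0/1 column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

def split_rows(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot(colors)
print(categories)
print(encoded)

rows = list(range(8))
train, test = split_rows(rows)
print(len(train), len(test))
```

The held-out test rows are never shown to the model during training, so the score on them estimates how the model behaves on unseen data.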

**Batch Predictions and Machine Learning**

+ Data Preprocessing: The quality of your batch predictions depends heavily on the quality of your data. The preprocessing techniques we learned earlier, such as cleaning, transforming, and feature engineering, are crucial for ensuring your model receives the right input for accurate predictions.  

+ Model Training: The accuracy and performance of your batch predictions are directly tied to the effectiveness of your trained model. The model training techniques we explored, including algorithm selection, hyperparameter tuning, and model evaluation, are essential for building a model capable of generating reliable predictions.  

+ MLflow Tracking: MLflow helps us track and manage different versions of our models, ensuring that we use the best-performing model for batch predictions and enabling us to reproduce and compare results.

+ By incorporating batch predictions into your data science toolkit, you can unlock the power of machine learning to solve real-world problems and drive innovation, whether you're a seasoned data scientist or simply a curious individual navigating the modern data landscape.

**Apache Spark and Databricks**

+ Apache Spark and Databricks are closely intertwined in the world of big data and analytics. Think of Apache Spark as a powerful engine, and Databricks as the sleek, high-performance car built around it.

+ Apache Spark is an open-source, distributed computing system that allows you to process massive datasets with incredible speed and efficiency. It's like having a team of processors working together in perfect harmony to tackle complex tasks.

+ Databricks takes Spark's power to the next level by providing a unified platform for data engineering, data science, and machine learning. It's a collaborative environment where data professionals can write code, build and deploy models, and visualize data, all within a single, integrated workspace.

+ Together, they form a powerful combination for tackling big data challenges and building innovative data-driven solutions. In this section, we’ll explore this tandem a little further.

+ Spark is a flexible platform that supports many different programming languages and APIs. By setting up a Databricks workspace and deploying Spark clusters, users can easily ingest data from various sources like Azure Data Lake or Cosmos DB into Spark DataFrames. Within the interactive Databricks notebooks, users can perform complex data transformations using Spark’s DataFrame API, which includes operations like filtering, grouping, and aggregation. Most data processing and analytics tasks can be accomplished using the DataFrame API, which is the focus of this section.

+ **Getting to Know Spark:** To gain a better understanding of how to process and analyze data with Apache Spark in Azure Databricks, it's important to understand the underlying architecture.

+ From a high level, the Azure Databricks service launches and manages Apache Spark clusters within your Azure subscription. Apache Spark clusters are groups of computers that are treated as a single computer and handle the execution of commands issued from notebooks. Clusters enable processing of data to be parallelized across many computers to improve scale and performance. They consist of a Spark driver and worker nodes. The driver node sends work to the worker nodes and instructs them to pull data from a specified data source.

+ In Databricks, the notebook interface is typically the driver program. This driver program contains the main loop for the program and creates distributed datasets on the cluster, then applies operations to those datasets. Driver programs access Apache Spark through a SparkSession object regardless of deployment location.

+ Microsoft Azure manages the cluster, and auto-scales it as needed based on your usage and the setting used when configuring the cluster. Auto-termination can also be enabled, which allows Azure to terminate the cluster after a specified number of minutes of inactivity.

+ **Spark Jobs in Detail:** Work submitted to the cluster is split into as many independent jobs as needed. This is how work is distributed across the cluster's nodes. Jobs are further subdivided into tasks. The input to a job is partitioned into one or more partitions. These partitions are the unit of work for each slot. In between tasks, partitions may need to be reorganized and shared over the network.

+ The secret to Spark's high performance is parallelism. Scaling vertically (by adding resources to a single computer) is limited to a finite amount of RAM, threads, and CPU speed, whereas clusters scale horizontally by adding new nodes as needed.

+ Spark parallelizes jobs at two levels:
  * The first level of parallelization is the executor: a Java virtual machine (JVM) running on a worker node, typically one instance per node.
  * The second level of parallelization is the slot, the number of which is determined by the number of cores and CPUs of each node.
  * Each executor has multiple slots to which parallelized tasks can be assigned.

+ The JVM is naturally multi-threaded, but a single JVM, such as the one coordinating the work on the driver, has a finite upper limit. By splitting the work into tasks, the driver can assign units of work to slots in the executors on worker nodes for parallel execution. Additionally, the driver determines how to partition the data so that it can be distributed for parallel processing. So, the driver assigns a partition of data to each task so that each task knows which piece of data it is to process. Once started, each task will fetch the partition of data assigned to it.
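
The relationship between partitions, tasks, and slots can be imitated on a single machine. The sketch below is an analogy, not Spark code: a thread pool with two workers stands in for two slots, and the helper names `partition` and `task` are made up. Real Spark runs tasks in executor JVMs on separate worker nodes.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(data, num_partitions):
    """Split data into contiguous partitions of nearly equal size."""
    size, extra = divmod(len(data), num_partitions)
    parts, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        parts.append(data[start:end])
        start = end
    return parts

def task(part):
    """One task processes exactly one partition (here: a sum of squares)."""
    return sum(x * x for x in part)

data = list(range(1, 11))

# The "driver" decides how to partition the data ...
partitions = partition(data, num_partitions=4)

# ... and hands one task per partition to the available "slots".
with ThreadPoolExecutor(max_workers=2) as slots:
    partial_results = list(slots.map(task, partitions))

total = sum(partial_results)
print(partitions)
print(partial_results, total)
```

With four partitions and two slots, two tasks run at a time and the remaining two wait for a free slot, which is why the number of partitions and the number of slots together determine how well a job parallelizes.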

+ **Jobs and Stages:** Depending on the work being performed, multiple parallelized jobs may be required. Each job is broken down into stages. A useful analogy is to imagine that the job is to build a house:

+ The first stage would be to lay the foundation.
+ The second stage would be to erect the walls.
+ The third stage would be to add the roof.
+ Attempting to do any of these steps out of order just doesn't make sense, and may in fact be impossible. Similarly, Spark breaks each job into stages to ensure everything is done in the right order.
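
A word count shows why stages must run in order. The sketch below imitates the idea in plain Python, with made-up input lines and function names: the first stage works on each partition independently, the data is then regrouped by key (the step where partitions are reorganized and shared over the network), and only then can the second stage compute the totals.

```python
from collections import defaultdict

# Two partitions of input lines (hypothetical data).
partitions = [
    ["spark makes", "big data"],
    ["data moves", "spark runs"],
]

def map_stage(lines):
    """Stage 1: each partition is processed independently of the others."""
    return [(word, 1) for line in lines for word in line.split()]

def shuffle(mapped_partitions):
    """Between stages: regroup the pairs so equal keys end up together."""
    grouped = defaultdict(list)
    for part in mapped_partitions:
        for word, count in part:
            grouped[word].append(count)
    return grouped

def reduce_stage(grouped):
    """Stage 2: needs the complete output of stage 1 for every key."""
    return {word: sum(counts) for word, counts in grouped.items()}

mapped = [map_stage(p) for p in partitions]
counts = reduce_stage(shuffle(mapped))
print(counts)
```

The second stage cannot start early: the count for "spark" is only correct once every partition has finished the first stage.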

+ **Modularity:** Spark includes libraries for tasks ranging from SQL to streaming and machine learning, making it a versatile tool for a wide range of data processing tasks. Some of the Spark libraries include:

+ Spark SQL: For working with structured data.
+ MLlib (Spark ML): For machine learning.
+ GraphX: For graph processing.
+ Spark Streaming: For real-time data processing.

+ Spark can run on a variety of distributed systems, including Hadoop YARN, Apache Mesos, Kubernetes, or Spark's own cluster manager. It also reads from and writes to diverse data sources like HDFS, Cassandra, HBase, and Amazon S3.

**Managing Data with Delta Lake**

+ When using Delta Lake, ACID properties are crucial. In essence, we're talking about ensuring the reliability and consistency of your data, even with concurrent transactions. It's akin to having a strict set of rules that prevent conflicts and ensure that all transactions are completed safely and accurately.

ACID stands for:

+ **Atomicity:** All changes within a transaction are treated as a single unit. Either all changes succeed, or none do. This prevents partial updates and ensures data consistency.
  
+ **Consistency:** Transactions maintain the integrity of the data, ensuring that it adheres to defined rules and constraints. This prevents invalid data from being introduced into the system.

+ **Isolation:** Concurrent transactions are isolated from each other, preventing them from interfering with each other's results. This ensures that each transaction sees a consistent view of the data.

+ **Durability:** Once a transaction is committed, the changes are permanent and survive even system failures. This guarantees data persistence and reliability.
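
Atomicity and consistency are easiest to see in a failing transaction. The sketch below uses Python's built-in `sqlite3` module as a stand-in, since it also provides ACID transactions; it does not use Delta Lake, and the `accounts` table and `transfer` helper are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"  # consistency rule
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False

first = transfer(conn, "alice", "bob", 30)    # valid transfer
second = transfer(conn, "alice", "bob", 500)  # would make alice's balance negative

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(first, second, balances)
```

In the second transfer the credit to bob succeeds before the debit from alice violates the constraint, yet the rollback undoes the credit as well, so no partial update is ever visible.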

**Data Pipelines**

Imagine scrolling through your social media feed and seeing posts from friends, family, and your favorite creators. Data pipelines are constantly working to gather and process those posts, ensuring they appear in your feed in real-time.

A data pipeline is like a set of instructions for your data. It tells the data where to go, what changes to undergo, and where to end up. This automation ensures that data flows smoothly and efficiently through your systems, enabling you to perform tasks like:

+ **Data ingestion:** Extract data from various sources.

+ **Data transformation:** Clean, aggregate, and enrich data.

+ **Data loading:** Deliver data to its final destination, such as a data warehouse or data lake.
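
The three steps above can be sketched as three small functions chained together. This is a toy example using only the standard library: the CSV content, the rule of dropping rows with a missing amount, and the dictionary standing in for the destination are all assumptions.

```python
import csv
import io

# Hypothetical raw source data; the second order has no amount.
RAW_CSV = """order_id,customer,amount
1,ana,10.50
2,ben,
3,ana,4.50
"""

def ingest(text):
    """Data ingestion: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Data transformation: drop incomplete rows, then total amounts per customer."""
    totals = {}
    for row in rows:
        if not row["amount"]:
            continue  # cleaning step: skip rows with a missing amount
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + float(row["amount"])
    return totals

def load(totals, destination):
    """Data loading: write the results to the destination."""
    destination.update(totals)
    return destination

warehouse = {}
load(transform(ingest(RAW_CSV)), warehouse)
print(warehouse)
```

Keeping each step as its own function makes the pipeline easy to test and to rerun when a single step changes.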

Data pipelines are essential for managing and processing data in today's data-driven world. They automate data workflows, improve efficiency, and ensure data quality and consistency.

In addition to social media feeds, data pipelines are instrumental in providing services like online shopping recommendations, fraud detection in banking, traffic updates and navigation, as well as personalized music and video streaming.

Without data pipelines, online services we now take for granted, like social media and streaming applications, would be much slower and less reliable.

**Delta Live Tables (DLT)** is a tool in Databricks that simplifies the creation and management of data pipelines for Delta Lake tables. It allows you to define your data transformations using SQL or Python, and DLT automatically generates and executes the underlying pipelines, ensuring data quality and consistency.


**Deploying Workloads with Databricks**

Workloads represent the different types of tasks we can perform in Databricks, such as data engineering, data science, and machine learning.

**Data Engineering:** Building and managing data pipelines, performing ETL processes, and preparing data for analysis.

**Data Science:** Exploring and analyzing data, building machine learning models, and conducting experiments.

**Machine Learning:** Training, deploying, and monitoring machine learning models for various applications.

Deploying workloads allows us to execute our code and processes on the Databricks platform, leveraging its powerful capabilities for distributed computing and data management. This enables us to handle large datasets, utilize various tools and languages, and automate workflows, to name a few.

**Administering Fabric Environment**

Administering and governing your data solutions is like maintaining law and order in a city of data. It's about establishing rules, enforcing policies, and ensuring that everything runs smoothly and securely. 

Without proper administration and governance, the city would descend into chaos, with data breaches, performance issues, and a lack of trust in the information.

Whether you're working with Databricks, Azure Synapse Analytics, or any other data platform, these principles are crucial for:

+ Protecting sensitive data and ensuring compliance with regulations.
+ Optimizing performance and resource utilization.
+ Promoting collaboration and data sharing.
+ Maintaining data quality and integrity.