| Concept                 | Focus                       | Example                      |
| ----------------------- | --------------------------- | ---------------------------- |
| **Data Mining**         | Discovering patterns        | Find trends in sales data    |
| **Data Wrangling**      | Cleaning & structuring      | Handle missing values        |
| **Data Preparation**    | Getting ready for analysis  | Normalize and encode data    |
| **Data Transformation** | Changing data format/values | Convert text to lowercase    |
| **Data Harmonization**  | Making datasets consistent  | Standardize country names    |
| **Data Refinement**     | Enhancing data quality      | Add calculated features      |
| **Data Shaping**        | Restructuring layout        | Pivot or melt a table        |
| **Data Manipulation**   | Modifying/organizing data   | Sort, group, or filter       |
| **Data Manicuring**     | Final polish                | Fix formatting and names     |
| **Data Validation**     | Checking correctness        | Verify data types and ranges |
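
As one concrete illustration of a row in the table, the "Data Shaping" entry (pivot or melt a table) can be sketched as below. The `wide` frame and its `store`, `jan`, and `feb` columns are made-up illustration data, not part of this notebook.

```python
import pandas as pd

# Hypothetical wide table: one row per store, one column per month
wide = pd.DataFrame({"store": ["A", "B"], "jan": [100, 80], "feb": [120, 90]})

# Melt: wide -> long (one row per store/month pair)
long = wide.melt(id_vars="store", var_name="month", value_name="sales")
print(long)

# Pivot: long -> wide again (month values become columns)
back = long.pivot(index="store", columns="month", values="sales")
print(back)
```

Melting is usually the more convenient layout for grouping and plotting, while the pivoted layout is easier to read as a report.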


In [1]:
import pandas as pd
from sklearn.datasets import load_iris

### DATA INGESTION


In [None]:

iris_flowers = load_iris()

df = pd.DataFrame(data=iris_flowers.data, columns=iris_flowers.feature_names)
target = iris_flowers.target
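
The `target` array holds integer class codes rather than species names. A minimal sketch of decoding them, assuming the `target_names` attribute of the object returned by `load_iris()`; the `species` variable is introduced here only for illustration:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_flowers = load_iris()

# target_names is an array of species labels; indexing it with the
# integer codes in target yields one label per row
species = pd.Series(iris_flowers.target_names[iris_flowers.target], name="species")

# Count how many samples belong to each species
print(species.value_counts())
```

Keeping the labels in a separate Series leaves `df` purely numeric, which matters for the imputation and z-score steps later on.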

### PRELIMINARY DATA ANALYSIS

In [None]:

df.head()

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [4]:
print(target)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 4 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
dtypes: float64(4)
memory usage: 4.8 KB


In [7]:
df.columns

Index(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)'],
      dtype='object')

In [8]:
df.describe()

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
count,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333
std,0.828066,0.435866,1.765298,0.762238
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


### DATA CLEANING AND PREPARATION

#### HANDLING MISSING VALUES

In [9]:
df.isna().sum()

sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
dtype: int64

In [None]:
# Introduce missing values for demonstration
# (assigning None to a float column stores NaN)
df.loc[10, "sepal length (cm)"] = None
df.loc[50:54, "sepal width (cm)"] = None
df.loc[100:102, "petal length (cm)"] = None

In [11]:
df.isna().sum()

sepal length (cm)    1
sepal width (cm)     5
petal length (cm)    3
petal width (cm)     0
dtype: int64
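
The counts above (1, 5, 3) follow from `.loc` slicing by label with both endpoints included, unlike positional Python slices. A small sketch on a hypothetical `toy` frame shows this, along with one way to list the affected rows:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a default RangeIndex 0..9
toy = pd.DataFrame({"a": np.arange(10, dtype=float)})

# Label-based slice: rows 2, 3 and 4 are all set (end label included)
toy.loc[2:4, "a"] = np.nan
print(toy["a"].isna().sum())

# Select only the rows that contain at least one missing value
rows_with_na = toy[toy.isna().any(axis=1)]
print(rows_with_na)
```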

In [12]:
from sklearn.impute import SimpleImputer
impute_mean = SimpleImputer(strategy='mean')
impute_median = SimpleImputer(strategy='median')

In [14]:
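# Fill sepal columns with the column mean and petal length with the column median.
# fit_transform expects 2D input, hence the double brackets.
df[["sepal length (cm)"]] = impute_mean.fit_transform(df[["sepal length (cm)"]])
df[["sepal width (cm)"]] = impute_mean.fit_transform(df[["sepal width (cm)"]])
df[["petal length (cm)"]] = impute_median.fit_transform(df[["petal length (cm)"]])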

In [15]:
df.isna().sum()

sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
dtype: int64
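
To make the imputation step less of a black box, the sketch below compares `SimpleImputer(strategy="mean")` with a plain pandas `fillna` on a hypothetical one-column frame (`toy` and `x` are illustration names). Both should fill the gap with the mean of the observed values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical column with one missing value
toy = pd.DataFrame({"x": [1.0, 2.0, np.nan, 5.0]})

# SimpleImputer learns the mean of the non-missing values, then fills NaN with it
imputed = SimpleImputer(strategy="mean").fit_transform(toy[["x"]])

# pandas equivalent: Series.mean() skips NaN by default
filled = toy["x"].fillna(toy["x"].mean())

print(imputed.ravel())
print(filled.to_numpy())
```

The imputer becomes the better choice once a train/test split is involved, because the statistic learned with `fit` on the training data can be reused with `transform` on new data.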

In [17]:
from scipy import stats
# Standardize each column: z = (x - column mean) / column std (scipy's default ddof=0)
z_scores = stats.zscore(df)
print(z_scores)

[[-9.05163302e-01  1.02053680e+00 -1.34014407e+00 -1.31544430e+00]
 [-1.14773404e+00 -1.44643012e-01 -1.34014407e+00 -1.31544430e+00]
 [-1.39030478e+00  3.21428915e-01 -1.39764453e+00 -1.31544430e+00]
 [-1.51159016e+00  8.83929516e-02 -1.28264361e+00 -1.31544430e+00]
 [-1.02644867e+00  1.25357277e+00 -1.34014407e+00 -1.31544430e+00]
 [-5.41307191e-01  1.95268066e+00 -1.16764268e+00 -1.05217993e+00]
 [-1.51159016e+00  7.87500841e-01 -1.34014407e+00 -1.18381211e+00]
 [-1.02644867e+00  7.87500841e-01 -1.28264361e+00 -1.31544430e+00]
 [-1.75416090e+00 -3.77678975e-01 -1.34014407e+00 -1.31544430e+00]
 [-1.14773404e+00  8.83929516e-02 -1.28264361e+00 -1.44707648e+00]
 [ 0.00000000e+00  1.48660873e+00 -1.28264361e+00 -1.31544430e+00]
 [-1.26901941e+00  7.87500841e-01 -1.22514315e+00 -1.31544430e+00]
 [-1.26901941e+00 -1.44643012e-01 -1.34014407e+00 -1.44707648e+00]
 [-1.87544627e+00 -1.44643012e-01 -1.51264545e+00 -1.44707648e+00]
 [-5.61657085e-02  2.18571662e+00 -1.45514499e+00 -1.31544430e