In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# [Kaggle Male & Female Height vs Weight](https://www.kaggle.com/datasets/saranpannasuriyaporn/male-female-height-and-weight)

### 🐼 pandas as pd — Most Common Methods 
#### [Documentation](https://pandas.pydata.org/docs/)

- `pd.read_csv("file.csv")` — load a CSV
- `df.head(n)` — show first `n` rows
- `df.tail(n)` — show last `n` rows
- `df.shape` — (rows, columns)
- `df.columns` — list of column names
- `df.info()` — column types + nulls
- `df.describe()` — basic stats (numeric)
- `df.describe(include='all')` — stats incl. non-numeric
- `df.isnull().sum()` — nulls per column
- `df["col"].value_counts()` — count of unique values
- `df["col"].unique()` — array of unique values
- `df["col"].nunique()` — number of unique values
- `df["col"].dtype` — data type of column
- `df.dtypes` — all column types
- `df.sort_values("col")` — sort by column
- `df.dropna()` — drop rows that contain missing values
- `df.fillna(value)` — fill missing values
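
As a rough illustration of the methods above, here is a minimal sketch on a small hand-made DataFrame. The column names (`Gender`, `Height`, `Weight`) and all values are assumptions standing in for the Kaggle CSV, which you would normally load with `pd.read_csv`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Kaggle file; column names and values are made up.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
    "Height": [178.0, 165.0, np.nan, 182.0, 160.0],
    "Weight": [80.0, 58.0, 62.0, np.nan, 55.0],
})

print(df.shape)                        # (rows, columns)
print(df.dtypes)                       # one dtype per column
missing = df.isnull().sum()            # nulls per column
counts = df["Gender"].value_counts()   # rows per category

# Two common ways to handle gaps: drop incomplete rows, or fill with a column mean.
dropped = df.dropna()
filled = df.fillna({"Height": df["Height"].mean(), "Weight": df["Weight"].mean()})

tallest_first = filled.sort_values("Height", ascending=False)
print(tallest_first.head(3))
```

Dropping rows is the simplest option but discards data; filling keeps every row at the cost of inventing values, so which one fits depends on the dataset.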

---

### 🧮 numpy as np — Most Common Methods
#### [Documentation](https://numpy.org/doc/stable/)

- `np.array([..])` — create array
- `np.mean(arr)` — average
- `np.median(arr)` — median
- `np.std(arr)` — standard deviation (population by default; pass `ddof=1` for the sample version)
- `np.min(arr)` / `np.max(arr)` — min/max
- `np.sum(arr)` — sum of elements
- `np.unique(arr)` — unique values
- `np.random.seed(n)` — reproducible randomness
- `np.random.rand(n)` — random floats [0, 1)
- `np.linspace(start, stop, num)` — evenly spaced numbers
- `arr.reshape(shape)` — reshape array
- `arr.T` — transpose
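
The functions above can be tried on a tiny array. This is only a sketch with made-up numbers; the variable names are illustrative.

```python
import numpy as np

heights = np.array([160.0, 165.0, 170.0, 175.0, 180.0])  # made-up values

mean = np.mean(heights)
median = np.median(heights)
pop_std = np.std(heights)              # population standard deviation (ddof=0)
sample_std = np.std(heights, ddof=1)   # sample standard deviation

grid = np.linspace(0.0, 1.0, 5)        # 5 evenly spaced points, endpoints included

table = np.array([0, 1, 2, 3, 4, 5]).reshape(2, 3)  # 2 rows, 3 columns
flipped = table.T                                    # 3 rows, 2 columns

# Re-seeding with the same value reproduces the same random numbers.
np.random.seed(0)
first = np.random.rand(3)
np.random.seed(0)
second = np.random.rand(3)
```

Note that pandas' `Series.std()` uses `ddof=1` by default, so the two libraries can report slightly different standard deviations for the same data.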

---

### 📈 matplotlib.pyplot as plt — Basic Plotting
#### [Documentation](https://matplotlib.org)

- `plt.plot(x, y)` — line plot
- `plt.scatter(x, y)` — scatter plot
- `plt.bar(x, height)` — bar chart
- `plt.hist(data)` — histogram
- `plt.xlabel("label")` / `plt.ylabel("label")`
- `plt.title("title")`
- `plt.legend()` — add legend
- `plt.grid(True)` — show grid
- `plt.show()` — render plot
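
A minimal sketch that combines several of these calls into one figure. The data is randomly generated for illustration, and the labels are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up data: a straight line with some uniform noise added.
np.random.seed(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + np.random.rand(50)

plt.scatter(x, y, label="observations")
plt.plot(x, 2 * x + 1.5, color="red", label="y = 2x + 1.5")
plt.xlabel("x")
plt.ylabel("y")
plt.title("Scatter with a reference line")
plt.legend()          # uses the label= arguments given above
plt.grid(True)

ax = plt.gca()        # handle to the current axes, handy for later inspection
plt.show()
```

`plt.legend()` only has something to display when the plotting calls were given a `label`.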

---

### 🎨 seaborn as sns — Stats-Friendly Visualization
#### [Documentation](https://seaborn.pydata.org)

- `sns.histplot(data=df, x="col")` — histogram
- `sns.boxplot(data=df, x="col")` — box plot
- `sns.countplot(data=df, x="col")` — bar for categories
- `sns.scatterplot(data=df, x="col1", y="col2")`
- `sns.lineplot(data=df, x="col1", y="col2")`
- `sns.heatmap(df.corr(numeric_only=True), annot=True)` — correlation heatmap (numeric columns only)
- `sns.pairplot(df)` — scatterplot matrix
- `sns.set(style="whitegrid")` — clean theme

### 🤖 scikit-learn — Basic Structure and Common Methods

#### 💡 General Pattern

- `.fit(X, y)` — learn from training data
- `.transform(X)` — apply transformation (e.g., scaling)
- `.fit_transform(X)` — do both at once
- `.predict(X)` — make predictions from input data
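
The pattern above might look like this in practice. It is a minimal sketch on made-up numbers where `y = 2x + 1`, so the expected prediction is easy to check by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Made-up training data following y = 2x + 1.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([3.0, 5.0, 7.0, 9.0])
X_new = np.array([[5.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std, then scale
X_new_scaled = scaler.transform(X_new)          # reuse the training mean/std

model = LinearRegression()
model.fit(X_train_scaled, y_train)
prediction = model.predict(X_new_scaled)
print(prediction)  # close to 11.0
```

The key habit is calling `fit_transform` on training data only and plain `transform` on anything new, so no information from unseen data leaks into the scaler.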

---

### 🔄 Preprocessing

- `StandardScaler()` — standardize features (mean=0, std=1)
- `MinMaxScaler()` — scale features to [0, 1]
- `OneHotEncoder()` — convert categorical vars to binary matrix
- `OrdinalEncoder()` — encode categorical vars as integers
- `SimpleImputer()` — fill missing values
- `LabelEncoder()` — encode target labels as integers from 0 to n_classes-1 (intended for `y`, not for feature columns)
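
A minimal sketch of a few of these transformers on a tiny synthetic table. The column names and values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, LabelEncoder

# Made-up data with one missing height.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Height": [180.0, 160.0, np.nan, 170.0],
})

# Fill the missing height with the column mean, then rescale to [0, 1].
imputed = SimpleImputer(strategy="mean").fit_transform(df[["Height"]])
scaled = MinMaxScaler().fit_transform(imputed)

# One column per category; .toarray() turns the sparse result into a dense array.
encoder = OneHotEncoder()
one_hot = encoder.fit_transform(df[["Gender"]]).toarray()

# LabelEncoder works on a 1-D array of labels, typically the target.
labels = LabelEncoder().fit_transform(df["Gender"])

print(encoder.categories_)  # categories are sorted, so "Female" comes first
```

Feature transformers expect 2-D input, which is why the columns are selected with double brackets (`df[["Height"]]`).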

---

### 📊 Data Splitting

- `train_test_split(X, y, test_size=0.2)` — split into train/test sets
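
A small sketch of the split on made-up arrays. The `random_state` and `stratify` arguments are optional extras added here: the first makes the split repeatable, the second keeps class proportions similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each and a balanced binary target.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # 8 training rows, 2 test rows
```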

---

### 📈 Models (Classifiers / Regressors)

- `LogisticRegression()` — classification (binary or multiclass)
- `LinearRegression()` — regression
- `RandomForestClassifier()` — ensemble classification
- `RandomForestRegressor()` — ensemble regression
- `KNeighborsClassifier()` — k-NN classification
- `DecisionTreeClassifier()` — tree-based model
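
Because every estimator shares the same `fit` / `predict` interface, swapping models is mostly a one-line change. The sketch below uses made-up, clearly separable data; the hyperparameter values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Made-up data: [height_cm, weight_kg] mapped to class 0 or 1.
X = np.array(
    [[150, 50], [155, 52], [160, 55], [175, 78], [180, 82], [185, 90]],
    dtype=float,
)
y = np.array([0, 0, 0, 1, 1, 1])
X_new = np.array([[152.0, 51.0], [182.0, 85.0]])

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=3),
    "forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

predictions = {}
for name, model in models.items():
    model.fit(X, y)                       # same call for every model
    predictions[name] = model.predict(X_new)
    print(name, predictions[name])
```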

---

### 🧪 Evaluation Metrics

- `accuracy_score(y_true, y_pred)` — for classification
- `confusion_matrix(y_true, y_pred)` — for classification
- `classification_report(y_true, y_pred)` — precision, recall, F1
- `mean_squared_error(y_true, y_pred)` — for regression
- `r2_score(y_true, y_pred)` — R² metric for regression
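
A minimal sketch of these metrics on hand-written labels, so each number can be verified by counting. All values are made up.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    mean_squared_error,
    r2_score,
)

# Classification: 4 of the 6 predictions match.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)   # rows = true class, columns = predicted class
print(classification_report(y_true, y_pred))

# Regression: two predictions are off by 0.5, two are exact.
y_true_reg = [3.0, 5.0, 7.0, 9.0]
y_pred_reg = [2.5, 5.0, 7.5, 9.0]

mse = mean_squared_error(y_true_reg, y_pred_reg)
r2 = r2_score(y_true_reg, y_pred_reg)
```

For a binary problem with labels 0 and 1, the confusion matrix reads `[[TN, FP], [FN, TP]]`.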

---

### 📦 Pipelines

- `Pipeline([("step1", transformer), ("step2", model)])` — chain steps
- `.fit(X, y)` — fit all steps
- `.predict(X)` — predict using final estimator
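- `.transform(X)` — apply every step's transform (only available when the final step is itself a transformer; use `pipe[:-1].transform(X)` to run just the preprocessing steps)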
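
A minimal sketch of a two-step pipeline on made-up data. The step names (`"scale"`, `"clf"`) are arbitrary labels chosen for this example.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Made-up data: [height_cm, weight_kg] mapped to class 0 or 1.
X = np.array(
    [[150, 50], [155, 52], [160, 55], [175, 78], [180, 82], [185, 90]],
    dtype=float,
)
y = np.array([0, 0, 0, 1, 1, 1])

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

pipe.fit(X, y)           # fits the scaler, then the classifier on scaled data
preds = pipe.predict(X)  # scales X with the fitted scaler, then predicts

# Slicing off the last step leaves only the (already fitted) preprocessing part.
scaled = pipe[:-1].transform(X)
```

Bundling the scaler with the model means the same preprocessing is applied automatically at prediction time, which helps avoid train/test mismatches.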

---

### 🔍 Model Selection / Validation

- `cross_val_score(model, X, y, cv=5)` — cross-validation scores
- `GridSearchCV(estimator, param_grid, cv=5)` — hyperparameter tuning
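
A short sketch of both tools on a synthetic dataset generated with `make_classification`. The choice of k-NN and the candidate values for `n_neighbors` are arbitrary and only serve as an example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic classification problem: 200 samples, 5 features.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# One accuracy score per fold.
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5)
print(scores.mean())

# Try each candidate value with 5-fold cross-validation and keep the best.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

After fitting, `search.best_estimator_` is a model refit on all of `X` with the winning parameters and can be used directly for `predict`.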

# Now try it yourself!
## Recommended dataset: [Heart Failure Prediction](https://www.kaggle.com/datasets/andrewmvd/heart-failure-clinical-data)
