<a href="https://colab.research.google.com/github/rhodes-byu/cs180-winter25/blob/main/notebooks/13-encoding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [53]:
import sklearn.datasets as datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import fetch_openml
import seaborn as sns
import pandas as pd
import numpy as np

## Transforming Continuous (Numeric) Features

#### Standardization
Standardization is the process of scaling features to have a mean of 0 and a standard deviation of 1. The formula for standardization is:

$$ z = \frac{x - \mu}{\sigma} $$

where:
- $z$ is the standardized value  
- $x$ is the original value  
- $\mu$ is the mean of the feature  
- $\sigma$ is the standard deviation of the feature  
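
As a quick worked example of the formula, the sketch below standardizes a small array by hand with NumPy. The values in `x` are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical feature values, chosen only for illustration
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()     # mean of the feature
sigma = x.std()   # population standard deviation (NumPy default, ddof=0)

z = (x - mu) / sigma  # apply z = (x - mu) / sigma element-wise

print(mu, sigma)          # 5.0 2.0
print(z)                  # [-1.5 -0.5 -0.5 -0.5  0.   0.   1.   2. ]
print(z.mean(), z.std())  # approximately 0 and 1
```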

#### Normalization
Normalization is the process of scaling features to a range of [0, 1]. The formula for normalization is:

$$ x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} $$

where:
- $x'$ is the normalized value  
- $x$ is the original value  
- $x_{\min}$ is the minimum value of the feature  
- $x_{\max}$ is the maximum value of the feature  
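
The same kind of hand calculation works for min-max normalization. This is a minimal sketch on a hypothetical array; note that it assumes the feature is not constant, since $x_{\max} = x_{\min}$ would cause a division by zero.

```python
import numpy as np

# Hypothetical feature values, chosen only for illustration
x = np.array([10.0, 15.0, 20.0, 30.0])

x_min, x_max = x.min(), x.max()

# Apply x' = (x - x_min) / (x_max - x_min) element-wise
x_norm = (x - x_min) / (x_max - x_min)

print(x_norm)  # [0.   0.25 0.5  1.  ]
```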


## Normalization vs. Standardization

### **Use Normalization (Scaling to [0, 1] or [-1, 1]) When:**
- **Bounded Data**: Features have a fixed range (e.g., pixel values [0, 255]).
- **Deep Learning**: Neural networks perform better with small, scaled inputs.
- **Distance-Based Models**: k-NN, K-Means, and other distance-based methods rely on consistent feature scales.
- **Non-Gaussian Data**: Works even when data isn't normally distributed.
- **Interpretability**: Easier to understand in real-world terms.

### **Use Standardization (Zero Mean, Unit Variance) When:**
- **Gaussian-Like Data**: Ideal for normally distributed features.
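- **Linear Models & PCA**: Regularized regression, SVMs, and PCA are sensitive to feature scale and generally work best on standardized inputs.
- **Outlier Robustness**: Somewhat less sensitive to extreme values than min-max normalization, although large outliers still shift the mean and standard deviation.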
- **Different Units**: Useful when features have varying scales (e.g., income vs. age).
- **Optimization Stability**: Gradient-based models (SGD, Adam) converge better.

### **Key Takeaways:**
- **Normalization**: Best for bounded data, deep learning, and distance-based models.
- **Standardization**: Best for Gaussian-like data, linear models, and handling different units.
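
To illustrate why distance-based models care about feature scale, the sketch below compares Euclidean distances before and after standardization. The three "people" and their income and age values are hypothetical. On the raw data, income (measured in dollars) dominates the distance; after scaling, the large age gap matters as well and the nearest neighbor changes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical people: columns are [income in dollars, age in years]
X = np.array([
    [50_000.0, 25.0],   # person 0
    [50_500.0, 65.0],   # person 1: similar income, very different age
    [50_800.0, 26.0],   # person 2: larger income gap, nearly the same age
])

def euclidean(a, b):
    """Euclidean distance between two 1D feature vectors."""
    return np.linalg.norm(a - b)

# Distances from person 0 on the raw features
raw_d01 = euclidean(X[0], X[1])
raw_d02 = euclidean(X[0], X[2])

# Distances from person 0 after standardizing each column
X_scaled = StandardScaler().fit_transform(X)
scaled_d01 = euclidean(X_scaled[0], X_scaled[1])
scaled_d02 = euclidean(X_scaled[0], X_scaled[2])

print("raw:   ", raw_d01, raw_d02)        # person 1 is closer
print("scaled:", scaled_d01, scaled_d02)  # person 2 is closer
```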

### Sklearn Scaling / Normalizing

#### Scaling

In [48]:
X = np.random.normal(loc = 10, scale = 3, size = 1000)

In [3]:
np.mean(X), np.std(X)

(np.float64(10.003283873167465), np.float64(3.016315304706204))

In [4]:
scaler = StandardScaler()

# Note: scikit-learn expects a 2D array of shape (n_samples, n_features);
# reshape(-1, 1) turns the 1D array into a single column
X_scaled = scaler.fit_transform(X.reshape(-1, 1))

In [5]:
np.mean(X_scaled), np.std(X_scaled)

(np.float64(1.2967404927621828e-16), np.float64(0.9999999999999999))

#### Normalizing

In [6]:
normalizer = MinMaxScaler()

X_normalized = normalizer.fit_transform(X.reshape(-1, 1))

In [7]:
np.min(X_normalized), np.max(X_normalized)

(np.float64(0.0), np.float64(0.9999999999999999))

### Pandas Scaling

In [8]:
df = sns.load_dataset('iris')

In [9]:
# StandardScaler returns a NumPy array, so the DataFrame's column names are lost
scaler = StandardScaler()
scaler.fit_transform(df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']])

array([[-9.00681170e-01,  1.01900435e+00, -1.34022653e+00,
        -1.31544430e+00],
       [-1.14301691e+00, -1.31979479e-01, -1.34022653e+00,
        -1.31544430e+00],
       [-1.38535265e+00,  3.28414053e-01, -1.39706395e+00,
        -1.31544430e+00],
       [-1.50652052e+00,  9.82172869e-02, -1.28338910e+00,
        -1.31544430e+00],
       [-1.02184904e+00,  1.24920112e+00, -1.34022653e+00,
        -1.31544430e+00],
       [-5.37177559e-01,  1.93979142e+00, -1.16971425e+00,
        -1.05217993e+00],
       [-1.50652052e+00,  7.88807586e-01, -1.34022653e+00,
        -1.18381211e+00],
       [-1.02184904e+00,  7.88807586e-01, -1.28338910e+00,
        -1.31544430e+00],
       [-1.74885626e+00, -3.62176246e-01, -1.34022653e+00,
        -1.31544430e+00],
       [-1.14301691e+00,  9.82172869e-02, -1.28338910e+00,
        -1.44707648e+00],
       [-5.37177559e-01,  1.47939788e+00, -1.28338910e+00,
        -1.31544430e+00],
       [-1.26418478e+00,  7.88807586e-01, -1.22655167e+00,
        ...

In [10]:
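# Use pandas apply to keep the result as a DataFrame; only float columns are standardized
# (pandas .std() defaults to ddof=1, unlike StandardScaler, which uses ddof=0)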
df_standardized = df.apply(lambda x: (x - x.mean()) / x.std() if x.dtype == 'float64' else x)
df_standardized.head()

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|--------------|-------------|--------------|-------------|---------|
| 0 | -0.897674 | 1.015602 | -1.335752 | -1.311052 | setosa |
| 1 | -1.1392 | -0.131539 | -1.335752 | -1.311052 | setosa |
| 2 | -1.380727 | 0.327318 | -1.392399 | -1.311052 | setosa |
| 3 | -1.50149 | 0.097889 | -1.279104 | -1.311052 | setosa |
| 4 | -1.018437 | 1.24503 | -1.335752 | -1.311052 | setosa |
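
The pandas values differ slightly from the `StandardScaler` output earlier (for example, -0.897674 versus -0.900681 for the first sepal length). The likely cause is the divisor: pandas `.std()` uses the sample standard deviation (`ddof=1`) by default, while `StandardScaler` uses the population standard deviation (`ddof=0`). The sketch below checks this on a small hypothetical column.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small hypothetical column, for illustration only
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name="feature")

z_sample = (s - s.mean()) / s.std()            # pandas default: ddof=1
z_population = (s - s.mean()) / s.std(ddof=0)  # population standard deviation

# StandardScaler needs 2D input, so convert the Series to a one-column DataFrame
z_sklearn = StandardScaler().fit_transform(s.to_frame()).ravel()

print(np.allclose(z_population, z_sklearn))  # True
print(np.allclose(z_sample, z_sklearn))      # False
```

Passing `ddof=0` to `.std()` is one way to make the pandas result match scikit-learn; for most modeling purposes the difference is negligible.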


In [11]:
# Pandas normalization
df_normalized = df.apply(lambda x: (x - x.min()) / (x.max() - x.min()) if x.dtype == 'float64' else x)
df_normalized.head()

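|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|--------------|-------------|--------------|-------------|---------|
| 0 | 0.222222 | 0.625 | 0.067797 | 0.041667 | setosa |
| 1 | 0.166667 | 0.416667 | 0.067797 | 0.041667 | setosa |
| 2 | 0.111111 | 0.5 | 0.050847 | 0.041667 | setosa |
| 3 | 0.083333 | 0.458333 | 0.084746 | 0.041667 | setosa |
| 4 | 0.194444 | 0.666667 | 0.067797 | 0.041667 | setosa |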


## Processing Categorical Features

### Label Encoding

Label encoding is typically used to encode the labels (targets) when they are categories.  

`LabelEncoder` from `sklearn.preprocessing` maps each category (string) to an integer, assigning the integers in sorted order of the categories.
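
A minimal sketch of the mapping, using a hypothetical list of labels. The fitted `classes_` attribute shows which category each integer stands for, and `inverse_transform` recovers the original strings.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, for illustration only
labels = ["dog", "cat", "fish", "cat"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

print(encoder.classes_)  # ['cat' 'dog' 'fish'] (index = integer code)
print(codes)             # [1 0 2 0]

# Map the integer codes back to the original strings
decoded = encoder.inverse_transform(codes)
print(decoded)           # ['dog' 'cat' 'fish' 'cat']
```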


### One-Hot Encoding

One-hot encoding splits a single categorical feature (e.g., one with the values `['cat', 'dog', 'fish']`) into one binary column per category. Each observation gets a 1 in the column for its category and a 0 in all the others.

For example, the animal column with values `['cat', 'dog', 'fish', 'cat']` would map to:

| cat | dog | fish |
|-----|-----|------|
|  1  |  0  |  0   |
|  0  |  1  |  0   |
|  0  |  0  |  1   |
|  1  |  0  |  0   |
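
The table can be reproduced with `pd.get_dummies`. This is a small sketch; the DataFrame and its `animal` column are made up to match the example, and `dtype=int` is used so the result prints as 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical column matching the example in the text
animals = pd.DataFrame({"animal": ["cat", "dog", "fish", "cat"]})

# One binary column per category, in sorted category order
one_hot = pd.get_dummies(animals["animal"], dtype=int)

print(one_hot)
```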




### Sklearn Encoding

#### Label Encoding

In [12]:
y_str = ['zebra', 'dog', 'cat', 'fish', 'dog', 'cat', 'fish']

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y_str)

In [13]:
print(y_encoded)

[3 1 0 2 1 0 2]


#### One-Hot Encoding

In [14]:
one_hot_encoder = OneHotEncoder(sparse_output = False)
y_one_hot = one_hot_encoder.fit_transform(y_encoded.reshape(-1, 1))

In [17]:
print(y_one_hot)

[[0. 0. 0. 1.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 1. 0.]
 [0. 1. 0. 0.]
 [1. 0. 0. 0.]
 [0. 0. 1. 0.]]


### Pandas Encoding

#### Label Encoding

In [18]:
# Load in the titanic dataset
data = fetch_openml(data_id=40945, parser = 'auto')
titanic = data.frame
titanic.drop(['body', 'boat', 'name', 'ticket', 'home.dest', 'cabin'], axis = 1, inplace = True)
titanic.dropna(inplace = True)

In [19]:
titanic.head()

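|   | pclass | survived | sex | age | sibsp | parch | fare | embarked |
|---|--------|----------|-----|-----|-------|-------|------|----------|
| 0 | 1 | 1 | female | 29.0 | 0 | 0 | 211.3375 | S |
| 1 | 1 | 1 | male | 0.9167 | 1 | 2 | 151.55 | S |
| 2 | 1 | 0 | female | 2.0 | 1 | 2 | 151.55 | S |
| 3 | 1 | 0 | male | 30.0 | 1 | 2 | 151.55 | S |
| 4 | 1 | 0 | female | 25.0 | 1 | 2 | 151.55 | S |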


In [None]:
titanic.info()

In [None]:
# Replace each categorical column with its integer category codes; leave other columns unchanged
titanic_encoded = titanic.apply(lambda x: pd.Categorical(x).codes if x.dtype == 'category' else x)

In [None]:
titanic_encoded.head()

#### One-Hot Encoding

In [None]:
# One-hot encode the categorical columns; numeric columns are passed through unchanged
titanic_one_hot = pd.get_dummies(titanic)

In [None]:
titanic_one_hot.head()

# **In-Class Activity: Predicting Obesity Levels from Eating Habits and Physical Condition**

In this activity, you will work with a dataset designed to predict obesity levels based on various eating habits and physical conditions. Your goal is to preprocess the data, experiment with different encoding strategies, and compare classification models.

---

## **Review the Dataset**
Before beginning, take some time to familiarize yourself with the dataset and its features. Feature descriptions can be found [here](https://archive.ics.uci.edu/dataset/544/estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition).

Consider the following as you review the dataset:
- What types of features are present? *(Numerical, ordinal, categorical?)*  
- How should these features be encoded for use in machine learning models?
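
For ordinal features, one option is `OrdinalEncoder` with an explicit category order, so that the integer codes respect the ranking. The sketch below is only an illustration: the sample values and the assumed ordering `no < Sometimes < Frequently < Always` for a `CAEC`-style column are assumptions and should be checked against the dataset's feature descriptions.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample of an ordered frequency column (values are assumed, not taken from the data)
sample = pd.DataFrame({"CAEC": ["Sometimes", "no", "Frequently", "Always", "Sometimes"]})

# Assumed ordering from least to most frequent; one inner list per encoded column
order = [["no", "Sometimes", "Frequently", "Always"]]

encoder = OrdinalEncoder(categories=order)
sample["CAEC_ordinal"] = encoder.fit_transform(sample[["CAEC"]]).ravel()

print(sample)
```

Unlike `LabelEncoder`, which assigns codes in sorted (alphabetical) order, this keeps `no` at 0 and `Always` at 3.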

---

## **Data Preprocessing**
- **Encoding:** Decide how to encode categorical and ordinal variables appropriately.
- **Splitting:** Divide the dataset into **80% training** and **20% testing** using `train_test_split` from `sklearn.model_selection` with `test_size=0.2`.
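
An 80/20 split with `train_test_split` might look like the sketch below. It runs on synthetic stand-in data rather than the obesity dataset; the names `X_demo`, `y_demo`, `X_tr`, and `X_te` are hypothetical, and `stratify` is an optional extra that keeps class proportions the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, 2 balanced classes
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.repeat([0, 1], 50)

# test_size=0.2 holds out 20% of the rows; random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)  # (80, 3) (20, 3)
```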


## **Model Training & Cross-Validation**
- Apply **cross-validation** on the training set to fine-tune hyperparameters and evaluate model performance.
- Compare the results of **$k$-Nearest Neighbors (k-NN) and Logistic Regression** using cross-validation scores.

### **Evaluation:**
1. Compare the models based on accuracy.
2. Consider hyperparameter tuning for both models:
   - For **k-NN**, experiment with different values of k, metrics, and weighting.
   - For **Logistic Regression**, consider trying different penalties. (View the documentation [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).)
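
One possible way to organize this comparison is a `Pipeline` combined with `GridSearchCV`, sketched below on synthetic data from `make_classification` (a stand-in, not the obesity dataset). Putting the scaler inside the pipeline means it is re-fit on each training fold, which avoids leaking information from the validation fold. The grids are small illustrative choices; for Logistic Regression the sketch varies the regularization strength `C` rather than the penalty type.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: replace with the encoded obesity features)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Each candidate is a (pipeline, hyperparameter grid) pair
candidates = {
    "knn": (
        Pipeline([("scale", StandardScaler()), ("model", KNeighborsClassifier())]),
        {"model__n_neighbors": [3, 5, 11], "model__weights": ["uniform", "distance"]},
    ),
    "log_reg": (
        Pipeline([("scale", StandardScaler()), ("model", LogisticRegression(max_iter=1000))]),
        {"model__C": [0.1, 1.0, 10.0]},
    ),
}

cv_results = {}
for name, (pipeline, grid) in candidates.items():
    search = GridSearchCV(pipeline, grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)  # only the training split is used here
    cv_results[name] = (search.best_score_, search.best_params_)
    print(name, round(search.best_score_, 3), search.best_params_)
```

The test split is created but never touched in this sketch, so the cross-validation scores alone decide which model to carry forward.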

## **Wait Before Testing!**
🚨 **Do NOT evaluate your model on the test set until instructed to do so!** 🚨  

- The test set should remain **unseen** throughout training and validation.
- We will use it **only once** to assess the final model’s performance.
- Keep track of your cross-validation results to decide which model to use for final testing.

### **Why is this important?**
Evaluating on the test set too early lets its results influence your modeling choices. This is a form of **data leakage** that leads to **overfitting** to the test set and gives misleading performance estimates. The test set should serve as a final, unbiased evaluation of the model.




In [34]:
# Here is the data:
df = pd.read_csv('https://raw.githubusercontent.com/rhodes-byu/cs180-winter25/refs/heads/main/data/obesity.csv')
df.head()

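|   | NObeyesdad | Gender | Age | Height | family_history_with_overweight | FAVC | FCVC | NCP | CAEC | SMOKE | CH2O | SCC | FAF | TUE | CALC | MTRANS |
|---|------------|--------|-----|--------|--------------------------------|------|------|-----|------|-------|------|-----|-----|-----|------|--------|
| 0 | Normal_Weight | Female | 21.0 | 1.62 | yes | no | 2.0 | 3.0 | Sometimes | no | 2.0 | no | 0.0 | 1.0 | no | Public_Transportation |
| 1 | Normal_Weight | Female | 21.0 | 1.52 | yes | no | 3.0 | 3.0 | Sometimes | yes | 3.0 | yes | 3.0 | 0.0 | Sometimes | Public_Transportation |
| 2 | Normal_Weight | Male | 23.0 | 1.8 | yes | no | 2.0 | 3.0 | Sometimes | no | 2.0 | no | 2.0 | 1.0 | Frequently | Public_Transportation |
| 3 | Overweight_Level_I | Male | 27.0 | 1.8 | no | no | 3.0 | 3.0 | Sometimes | no | 2.0 | no | 2.0 | 0.0 | Frequently | Walking |
| 4 | Overweight_Level_II | Male | 22.0 | 1.78 | no | no | 2.0 | 1.0 | Sometimes | no | 2.0 | no | 0.0 | 0.0 | Sometimes | Public_Transportation |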


## **Class Practice**

In [58]:
# Separate the features from the target column
x = df.drop(columns="NObeyesdad")
y = df["NObeyesdad"]

# One-hot encode the categorical features
X_encoded = pd.get_dummies(x)

# Encode the target classes as integers
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X_encoded, y_encoded, test_size=0.2, random_state=42)
print("Training data shape:", X_train.shape)
print("Testing data shape:", X_test.shape)

# k-Nearest Neighbors model
# Initialize
knn = KNeighborsClassifier(n_neighbors = 5)

# Train the model
knn.fit(X_train, y_train)

# Make predictions
y_pred_knn = knn.predict(X_test)

# Evaluate the model
print("KNN Accuracy:", accuracy_score(y_test, y_pred_knn))


# Logistic Regression model
# Initialize
log_reg = LogisticRegression(max_iter = 20000)

# Train the model
log_reg.fit(X_train, y_train)

# Make predictions
y_pred_log_reg = log_reg.predict(X_test)

# Evaluate the model
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_log_reg))


Training data shape: (1688, 30)
Testing data shape: (423, 30)
KNN Accuracy: 0.7706855791962175
Logistic Regression Accuracy: 0.6288416075650118
