## **Question 1: Discretization**
Given the following continuous dataset of students' scores out of 100: [45, 67, 82, 90, 54, 71, 88, 62], discretize the scores into three categories: "Low" (0-60), "Medium" (61-80), and "High" (81-100).

In [1]:
import pandas as pd

# Given continuous dataset of scores
scores = [45, 67, 82, 90, 54, 71, 88, 62]

# Define the bins and labels
bins = [0, 60, 80, 100]  # right-inclusive intervals: (0, 60], (60, 80], (80, 100]; a score of exactly 0 needs include_lowest=True
labels = ["Low", "Medium", "High"]

# Discretize the scores
# Hint: pd.cut(): This function is used to segment and sort data values into discrete bins or intervals.
# https://pandas.pydata.org/docs/reference/api/pandas.cut.html

discretized_scores = pd.cut(scores, bins=bins, labels=labels)  # use pd.cut()

print(discretized_scores)


['Low', 'Medium', 'High', 'High', 'Low', 'Medium', 'High', 'Medium']
Categories (3, object): ['Low' < 'Medium' < 'High']
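
The bin edges deserve a closer look, because `pd.cut` treats intervals as right-inclusive by default. The sketch below uses a hypothetical list of boundary scores (`edge_scores`, not part of the exercise) to show where values on the edges land, and what `include_lowest=True` changes.

```python
import pandas as pd

bins = [0, 60, 80, 100]
labels = ["Low", "Medium", "High"]

# Hypothetical boundary scores, chosen only to probe the bin edges
edge_scores = [0, 60, 61, 80, 81, 100]

# Default behaviour (right=True): intervals are (0, 60], (60, 80], (80, 100],
# so a score of exactly 0 falls outside every bin and becomes NaN
default_cut = pd.cut(edge_scores, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left as well: [0, 60]
inclusive_cut = pd.cut(edge_scores, bins=bins, labels=labels, include_lowest=True)

print(list(default_cut))
print(list(inclusive_cut))
```

If a score of 0 is possible in the real data, `include_lowest=True` is probably the safer choice for these bins.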


## **Question 2: Numeric Coding of Nominal Attributes**
Convert the following list of car brands into numeric codes: ["Toyota", "Ford", "Honda", "Toyota", "BMW", "Ford", "Honda"].

In [None]:
# Import the LabelEncoder class from the sklearn.preprocessing module in the scikit-learn library.
from sklearn.preprocessing import LabelEncoder
# https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

# Encoding Categorical Data: LabelEncoder is used to convert categorical (nominal) data into numeric labels.
# It assigns a unique integer to each category, transforming strings or other types of categories into numerical values.

# Given list of car brands
car_brands = ["Toyota", "Ford", "Honda", "Toyota", "BMW", "Ford", "Honda"]

# Initialize LabelEncoder
label_encoder = LabelEncoder() #This line initializes an instance of the LabelEncoder class.
# Purpose: The LabelEncoder is ready to be used for converting categorical data into numeric labels.
# It prepares the encoder but doesn’t yet process any data.


# Convert car brands to numeric codes
numeric_codes = label_encoder.fit_transform(car_brands) #use .fit_transform()
# fit_transform() is a combined method that both "fits" the encoder to the data and "transforms" the data in one step.
# Fitting: The fit part of fit_transform scans through the data (car_brands) and identifies all unique categories (e.g., "Toyota", "Ford", "Honda", "BMW").
# Transforming: The transform part assigns a unique integer label to each category.
# LabelEncoder numbers the categories in sorted order, so here "BMW" becomes 0, "Ford" 1, "Honda" 2, and "Toyota" 3.

print(numeric_codes)


[3 1 2 3 0 1 2]


Here are some functions and methods similar to LabelEncoder that can be used for numeric coding of categorical data:

**1. OneHotEncoder (from sklearn.preprocessing)**

**Purpose:** Converts categorical data into a one-hot numeric array, where each category is represented by a binary vector. Each column represents a category with 1 indicating presence and 0 absence.

**2. OrdinalEncoder (from sklearn.preprocessing)**

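**Purpose:** Similar to LabelEncoder (which scikit-learn intends for target labels rather than input features), but designed for feature data and able to encode multiple columns at once. Each category becomes an integer following the order passed via the `categories` parameter or, by default, the sorted order of the unique values in each column.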


**3. pd.factorize (from pandas)**

**Purpose:** Encodes categorical data into numeric codes, similar to LabelEncoder, but is available directly in pandas. Codes follow the order of first appearance (unless `sort=True`), and missing values receive the sentinel code -1 by default.

**4. get_dummies (from pandas)**

**Purpose:** Similar to OneHotEncoder, get_dummies creates one-hot encoded columns from categorical data, making it easy to incorporate categorical data into models.

**5. ColumnTransformer (from sklearn.compose)**

**Purpose:** Allows for more complex preprocessing pipelines where different encoders can be applied to different columns within the same dataset.
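
To make the differences concrete, here is a minimal sketch that applies four of these alternatives to the same `car_brands` list. The column name `"brand"` is an assumption introduced for illustration, and `ColumnTransformer` is left out to keep the example short.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

car_brands = ["Toyota", "Ford", "Honda", "Toyota", "BMW", "Ford", "Honda"]
brand_series = pd.Series(car_brands, name="brand")  # "brand" is a hypothetical column name

# pd.factorize numbers categories in order of first appearance
codes, uniques = pd.factorize(brand_series)
print(codes)
print(list(uniques))

# pd.get_dummies builds one indicator column per category (columns sorted by name)
dummies = pd.get_dummies(brand_series)
print(dummies.columns.tolist())

# scikit-learn encoders expect 2-D input: one row per sample, one column per feature
X = brand_series.to_frame()

# OrdinalEncoder uses sorted category order by default and returns floats
ordinal = OrdinalEncoder().fit_transform(X)
print(ordinal.ravel())

# OneHotEncoder returns a sparse matrix by default; toarray() makes it dense
one_hot = OneHotEncoder().fit_transform(X).toarray()
print(one_hot.shape)
```

Note that `pd.factorize` and `OrdinalEncoder` give different codes for the same data, because one follows order of appearance and the other sorted order.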

## **Question 3: Data Preprocessing and Cleansing**
You have a dataset with missing values in the "Age" column. The values are [25, 30, NaN, 22, 28, NaN, 35]. Describe how you would handle these missing values and justify your approach.

In [None]:
import numpy as np
import pandas as pd

# Given dataset with missing values
# pd.Series(): creates a one-dimensional labeled array that can hold data of any type (integers, floats, strings, etc.).
# np.nan ("Not a Number") represents the missing values in the dataset.
ages = pd.Series([25, 30, np.nan, 22, 28, np.nan, 35])

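# Option 1: Fill missing values with the mean
# Reasonable when the observed ages are roughly symmetric and free of outliers.
ages_mean_filled = ages.fillna(ages.mean())

# Option 2: Fill missing values with the median
# More robust when the data is skewed or contains outliers.
ages_median_filled = ages.fillna(ages.median())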



# .tolist(): This method converts the pandas Series into a Python list.
print("Mean-filled:", ages_mean_filled.tolist())
print("Median-filled:", ages_median_filled.tolist())


Mean-filled: [25.0, 30.0, 28.0, 22.0, 28.0, 28.0, 35.0]
Median-filled: [25.0, 30.0, 28.0, 22.0, 28.0, 28.0, 35.0]
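
Here the mean and median coincide at 28, so both options give the same result. The following sketch adds one hypothetical outlier (an age of 95, not part of the original data) to illustrate why the two strategies can diverge.

```python
import numpy as np
import pandas as pd

# Original ages plus one hypothetical outlier (95), added only for illustration
ages_with_outlier = pd.Series([25, 30, np.nan, 22, 28, np.nan, 35, 95])

mean_value = ages_with_outlier.mean()      # pulled upward by the outlier
median_value = ages_with_outlier.median()  # stays close to the bulk of the data

print("Mean used for filling:", round(mean_value, 2))
print("Median used for filling:", median_value)

# Fill the gaps with the more robust statistic
median_filled = ages_with_outlier.fillna(median_value)
print("Median-filled:", median_filled.tolist())
```

With the outlier present, the mean rises to about 39 while the median moves only to 29, which is one argument for preferring the median when extreme values are plausible.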


## **Question 4: Feature Selection**
Using the California housing dataset, you want to identify the top 3 most relevant features for predicting the median house value. You decide to use the SelectKBest feature selection method with the f_regression scoring function. Write the code to perform this feature selection and identify the selected feature names.

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.feature_selection import SelectKBest, f_regression
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html

# Load the California housing dataset
housing = fetch_california_housing()
X = housing.data  # Features (input data)
y = housing.target  # Target variable (what we want to predict)

# Apply feature selection using SelectKBest
selector = SelectKBest(score_func=f_regression, k=3)
# Initializes the SelectKBest feature selector:
# - `score_func=f_regression`: Uses a univariate linear regression F-test, which scores each feature by the strength of its linear relationship with the target.
# - `k=3`: Specifies that we want to select the top 3 features based on the scoring function.

X_selected = selector.fit_transform(X, y) #use .fit_transform()
# Fits the selector to the data (X, y) and transforms X to contain only the selected features.
# - The fit process evaluates each feature using the score function.
# - The transform process reduces X to the k best features based on the scores.

# Get selected feature indices
selected_features = selector.get_support(indices=True) #use .get_support(indices=True)
# `get_support(indices=True)` retrieves the indices of the selected features.
# - It returns an array of indices corresponding to the top k features.

# Map indices to feature names
selected_feature_names = [housing.feature_names[i] for i in selected_features]
# Uses the indices of selected features to get their names from `housing.feature_names`.
# - `housing.feature_names` contains the names of all features in the dataset.
# - This list comprehension builds a list of names for the selected features.

# Print the names of the selected features
print("Selected features:", selected_feature_names)
# Outputs the names of the top 3 features identified as most relevant for predicting the target variable.
# (Note: For a given dataset and scoring function the result is deterministic; a different scoring function may select different features.)

Selected features: ['MedInc', 'AveRooms', 'Latitude']
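
It is often helpful to look at the scores behind the selection, not just the selected names. The sketch below does this on small synthetic data so that it runs without downloading anything; the feature names `a`, `b`, `c` and the coefficients are hypothetical. The same `scores_` and `pvalues_` attributes are available on the selector fitted to the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Hypothetical synthetic data: y depends strongly on "a", moderately on "b", not on "c"
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 3)), columns=["a", "b", "c"])
y = 3.0 * X["a"] + 1.5 * X["b"] + rng.normal(scale=0.5, size=500)

selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)

# scores_ holds the F-statistic of each feature, pvalues_ the matching p-value
ranking = pd.DataFrame(
    {"feature": X.columns, "F": selector.scores_, "p_value": selector.pvalues_}
).sort_values("F", ascending=False)
print(ranking)

# get_support() without arguments returns a boolean mask over the columns
selected = X.columns[selector.get_support()].tolist()
print("Selected features:", selected)
```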


There are several other methods similar to SelectKBest for feature selection in machine learning. Here are some common alternatives:

**1. Recursive Feature Elimination (RFE):**

**Description:** RFE works by recursively removing the least important features and building the model repeatedly until the specified number of features is reached.

**Use Case:** It’s useful when you want a model-based approach that ranks each feature in the context of the others, rather than scoring features one at a time.


**2. Feature Importance from Tree-Based Models:**

**Description:** Tree-based models like Random Forests and Gradient Boosting provide feature importance scores that can be used to select the most important features.

**Use Case:** Suitable when using tree-based models for prediction, as they naturally rank feature importance.

**3. L1-Based Feature Selection (Lasso Regularization):**

**Description:** Uses L1 regularization to shrink some coefficients to zero, effectively selecting features that contribute most to the model.

**Use Case:** Good for high-dimensional datasets where many features are irrelevant.


**4. Mutual Information Feature Selection:**

**Description:** Selects features based on their mutual information score with the target variable, which measures the dependency between variables.

**Use Case:** Useful when you want to capture non-linear relationships between features and the target.

**5. Sequential Feature Selection:**

**Description:** Adds (forward selection) or removes (backward selection) features sequentially based on model performance.

**Use Case:** Useful when model performance is the key criterion for feature selection.
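
The five alternatives above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not a recommended configuration: the data is synthetic (generated with `make_regression`), and the estimators and hyperparameters such as `alpha=1.0` and `n_estimators=50` are arbitrary choices.

```python
from functools import partial

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import (
    RFE,
    SelectFromModel,
    SelectKBest,
    SequentialFeatureSelector,
    mutual_info_regression,
)
from sklearn.linear_model import Lasso, LinearRegression

# Hypothetical synthetic data: 8 features, only 3 of which drive the target
X, y = make_regression(
    n_samples=300, n_features=8, n_informative=3, noise=5.0, random_state=0
)

# 1. Recursive Feature Elimination around a linear model
rfe = RFE(LinearRegression(), n_features_to_select=3).fit(X, y)

# 2. Tree-based importances: keep the 3 features with the highest importance
forest = SelectFromModel(
    RandomForestRegressor(n_estimators=50, random_state=0),
    threshold=-np.inf,
    max_features=3,
).fit(X, y)

# 3. L1-based selection: keep features whose Lasso coefficient is not (near) zero
lasso = SelectFromModel(Lasso(alpha=1.0)).fit(X, y)

# 4. Mutual information as the univariate score inside SelectKBest
mi_score = partial(mutual_info_regression, random_state=0)
mutual_info = SelectKBest(score_func=mi_score, k=3).fit(X, y)

# 5. Forward sequential selection, scored by cross-validated model performance
sequential = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="forward"
).fit(X, y)

selectors = {
    "RFE": rfe,
    "Random forest": forest,
    "Lasso": lasso,
    "Mutual information": mutual_info,
    "Sequential (forward)": sequential,
}
for name, fitted in selectors.items():
    print(f"{name}: {fitted.get_support(indices=True)}")
```

All of these selectors expose `get_support()`, so the mapping from indices to feature names works the same way as with SelectKBest.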


## **Question 5: Data Transformation**
Consider a dataset where the "Income" attribute is heavily skewed. What transformation technique would you apply to normalize this data, and why?

In [None]:
import numpy as np
import pandas as pd

# Given dataset with skewed income attribute
income = pd.Series([30000, 45000, 500000, 70000, 120000, 25000, 300000])

# Apply logarithmic transformation
log_income = np.log(income) #use np.log()
# https://numpy.org/doc/stable/reference/generated/numpy.log.html

print(log_income)


0    10.308953
1    10.714418
2    13.122363
3    11.156251
4    11.695247
5    10.126631
6    12.611538
dtype: float64
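
One way to justify the choice is to measure skewness before and after the transformation. The sketch below uses the pandas `skew()` method for this. It also shows `np.log1p`, which may be preferable when the data can contain zeros; the `income_with_zero` series is a hypothetical example, not part of the exercise.

```python
import numpy as np
import pandas as pd

income = pd.Series([30000, 45000, 500000, 70000, 120000, 25000, 300000])
log_income = np.log(income)

# Sample skewness: values near 0 indicate a roughly symmetric distribution
raw_skew = income.skew()
log_skew = log_income.skew()
print("Skewness before:", round(raw_skew, 3))
print("Skewness after log:", round(log_skew, 3))

# np.log is undefined at 0; np.log1p computes log(1 + x) and maps 0 to 0
income_with_zero = pd.Series([0, 30000, 500000])  # hypothetical data with a zero
log1p_income = np.log1p(income_with_zero)
print(log1p_income)

# np.expm1 inverts np.log1p, recovering the original scale
recovered = np.expm1(log1p_income)
```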


Besides the logarithmic transformation, there are several other transformations you can apply to skewed data to make it more normally distributed or manageable for analysis. Here are some common transformations:


**1. Square Root Transformation:**


This transformation can help reduce the skewness of moderately skewed data.


It is less aggressive than a log transformation and works with any non-negative values, including zero.


**2. Cube Root Transformation:**


The cube root transformation is another option for reducing skewness.


It can handle both positive and negative values, unlike the logarithmic transformation.


**3. Box-Cox Transformation:**


The Box-Cox transformation is a more flexible transformation that can be tuned by a parameter (lambda) to reduce skewness.


It requires all values to be positive and is often used to transform data closer to normality.


**4. Reciprocal Transformation:**


The reciprocal (or inverse) transformation is used when data is very positively skewed.


This transformation should be used carefully, as it can overly compress larger values and it reverses the ordering of the data (the largest value becomes the smallest).


**5. Exponential (Power) Transformation:**


A power transformation (sometimes loosely called an exponential transformation), which raises the values to a power less than one (e.g., 0.5 or 0.3), can also be used to reduce positive skew.


For heavily skewed data, this approach can moderate large values.
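
The transformations listed above can be applied to the same `income` data roughly as follows. This is a minimal sketch: the exponent 0.3 is an arbitrary choice, and Box-Cox is taken from `scipy.stats`, which is an assumption since the notebook itself only uses NumPy and pandas.

```python
import numpy as np
import pandas as pd
from scipy import stats

income = pd.Series(
    [30000, 45000, 500000, 70000, 120000, 25000, 300000], dtype=float
)

sqrt_income = np.sqrt(income)     # 1. square root
cbrt_income = np.cbrt(income)     # 2. cube root (also defined for negatives)
reciprocal_income = 1.0 / income  # 4. reciprocal (reverses the ordering)
power_income = income ** 0.3      # 5. power with an exponent below one

# 3. Box-Cox: when no lambda is given, scipy estimates it by maximum likelihood
boxcox_values, fitted_lambda = stats.boxcox(income.to_numpy())
boxcox_income = pd.Series(boxcox_values, index=income.index)

transformed = pd.DataFrame(
    {
        "sqrt": sqrt_income,
        "cbrt": cbrt_income,
        "boxcox": boxcox_income,
        "reciprocal": reciprocal_income,
        "power_0.3": power_income,
    }
)
print(transformed)
print("Fitted Box-Cox lambda:", fitted_lambda)
```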


**Differences Between Transformation Methods and Choosing the Best One**


Data transformations adjust the distribution of skewed data. Here’s an overview of common transformations and guidance on selecting the best one for your data:

**1. Logarithmic Transformation**

**Effect:** Reduces right skewness, compressing large values more.

**Use When:** Data is positively skewed with all positive values.

**Best For:** Data with a wide range of values, clustered at the lower end.


**2. Square Root Transformation**


**Effect:** Moderately reduces skewness by compressing larger values.

**Use When:** Data has moderate positive skewness.

**Best For:** Non-negative data needing a less aggressive transformation.


**3. Cube Root Transformation**


**Effect:** Handles both positive and negative values; gentler adjustment.

**Use When:** Data includes negatives or requires mild transformation.

**Best For:** Data with both positive and negative values or needing a softer adjustment.


**4. Box-Cox Transformation**

**Effect:** Adjusts data toward normality with a tunable parameter (lambda).

**Use When:** Data is strictly positive, aiming for normal distribution.

**Best For:** Cases requiring normality for parametric tests.

**5. Reciprocal Transformation**

**Effect:** Strongly compresses large values; can overly shrink data.


**Use When:** Data is highly skewed and all positive.


**Best For:** Extreme skewness requiring strong transformation.


**6. Exponential Transformation (Power Transformation)**


**Effect:** Uses fractional powers for milder adjustments than logs.


**Use When:** Data has moderate skewness and is non-negative.


**Best For:** Flexible, mild transformations without drastic changes.


**How to Choose the Best Transformation:**


**Visual Inspection:**

Use histograms, boxplots, or QQ plots to assess skewness.


**Statistical Tests:**

Use tests like Shapiro-Wilk or Anderson-Darling for normality.


**Try Multiple Transformations:**

Apply and visually inspect multiple transformations; use histograms or QQ plots to compare.

**Evaluate with Your Model:**

Test transformed data in your model and compare metrics like R-squared, MSE, or cross-validation scores.

**Check Distribution Characteristics:**

Evaluate mean, variance, skewness, and kurtosis of transformed data.

**Context and Practicality:**

Consider how interpretable and practical the transformed data is for your specific needs.
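
Several of these checks can be combined into one comparison table. The sketch below computes skewness, kurtosis, and the Shapiro-Wilk p-value for a few candidate transformations of the `income` data. It assumes `scipy` is available, and with only seven observations the test has little power, so the numbers are illustrative rather than conclusive.

```python
import numpy as np
import pandas as pd
from scipy import stats

income = pd.Series(
    [30000, 45000, 500000, 70000, 120000, 25000, 300000], dtype=float
)

# Candidate transformations to compare against the raw data
candidates = {
    "raw": income,
    "log": np.log(income),
    "sqrt": np.sqrt(income),
    "cbrt": np.cbrt(income),
}

rows = []
for name, values in candidates.items():
    # Shapiro-Wilk: a small p-value suggests a departure from normality
    statistic, p_value = stats.shapiro(values)
    rows.append(
        {
            "transform": name,
            "skewness": values.skew(),
            "kurtosis": values.kurt(),
            "shapiro_p": p_value,
        }
    )

summary = pd.DataFrame(rows).set_index("transform")
print(summary)
```

A transformation whose skewness is closest to zero and whose Shapiro-Wilk p-value is largest is a reasonable first candidate, which can then be checked visually and against model performance.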
