## Course Assignment Instructions
You should have Python (version 3.8 or later) and Jupyter Notebook installed to complete this assignment. You will write code in the empty cell/cells below the problem. While most of this will be a programming assignment, some questions will ask you to "write a few sentences" in markdown cells. 

Submission Instructions:

1. Create a `labs` directory in your personal class repository (e.g., located in your home directory).
2. Clone the class repository.
3. Copy this Jupyter notebook file (.ipynb) into your repo/labs directory.
4. Make your edits, commit your changes, and push to your repository.
All submissions must be pushed before the due date to avoid late penalties. 

Labs are graded out of 100 points. Each day late costs 5 points, up to a maximum penalty of 50 points after 10 days. After that, you may submit the lab any time before the semester ends for a maximum score of 50.

Lab 2 is due on 2/18/25

## Basic Modeling
In the 342 class an example was given that considered a variable `x_3` which measured "criminality". In this example there are L = 4 levels "none", "infraction", "misdemeanor" and "felony". Create a variable `x_3` here with 100 random elements (equally probable). Create it as a nominal (i.e. unordered) factor. Hint: use random.choice from NumPy and Categorical from Pandas.

In [15]:
import numpy as np
import pandas as pd

# Define the categories
L = ["none", "infraction", "misdemeanor", "felony"]

# Generate 100 random elements with equal probability
x_3 = np.random.choice(L, replace=True, size=100)

# Convert to a categorical (nominal) variable in pandas
x_3_cat = pd.Categorical(x_3, categories=L, ordered=False)
print(x_3_cat)

['none', 'infraction', 'misdemeanor', 'infraction', 'infraction', ..., 'infraction', 'none', 'infraction', 'felony', 'felony']
Length: 100
Categories (4, object): ['none', 'infraction', 'misdemeanor', 'felony']


Use x_3 to create x_3_bin, a binary feature where 0 is no crime and 1 is any crime.

In [16]:
# Build a boolean mask (True for any crime, False for "none") and cast it to 0/1 integers
x_3_bin = (x_3 != "none").astype(int)
print(x_3_bin)

[0 1 1 1 1 1 0 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 1 1 0 1
 1 1 0 0 0 1 0 1 1 1 0 0 1 0 0 1 1 1 1 1 1 0 0 1 1 0 1 1 1 0 0 1 1 1 1 0 1
 1 0 1 1 1 1 1 1 0 1 0 0 1 0 1 1 1 1 1 1 1 1 0 1 1 1]


Use `x_3` to create `x_3_ord`, an ordered factor variable. Ensure the proper ordinal ordering.

In [17]:
x_3_ord = pd.Categorical(x_3, categories=L, ordered=True)  # do not overwrite x_3 itself
print(x_3_ord)

['none', 'infraction', 'misdemeanor', 'infraction', 'infraction', ..., 'infraction', 'none', 'infraction', 'felony', 'felony']
Length: 100
Categories (4, object): ['none' < 'infraction' < 'misdemeanor' < 'felony']
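
To see what the ordering buys you, here is a minimal sketch on a three-element toy vector (the name `demo` and its values are made up for illustration). An ordered categorical supports comparisons against a level, and its integer codes follow the declared order.

```python
import pandas as pd

levels = ["none", "infraction", "misdemeanor", "felony"]

# Toy ordered factor (illustrative values only)
demo = pd.Categorical(["felony", "none", "misdemeanor"], categories=levels, ordered=True)

# Comparisons respect the declared order of the levels
at_least_misdemeanor = demo >= "misdemeanor"
print(at_least_misdemeanor)

# Integer codes are positions in `levels` (0 = "none", ..., 3 = "felony")
codes = demo.codes
print(codes)
```

The same comparison on an unordered categorical raises a `TypeError`, which is a quick way to check which kind of factor you have.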


Convert this variable into three binary variables without any information loss and put them into a data matrix. Hint: use column_stack from Numpy.

In [18]:
x_3_matrix = np.column_stack([
    (x_3 == "infraction").astype(int),
    (x_3 == "misdemeanor").astype(int),
    (x_3 == "felony").astype(int)])

x_3_matrix = pd.DataFrame(x_3_matrix, columns=["infraction", "misdemeanor", "felony"])
print(x_3_matrix)

    infraction  misdemeanor  felony
0            0            0       0
1            1            0       0
2            0            1       0
3            1            0       0
4            1            0       0
..         ...          ...     ...
95           1            0       0
96           0            0       0
97           1            0       0
98           0            0       1
99           0            0       1

[100 rows x 3 columns]
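
As a cross-check on the manual `column_stack` approach, the sketch below builds the same three indicator columns with `pd.get_dummies` and `drop_first=True`, which drops the first level ("none") as the reference category. The seed and the variable names are assumptions for illustration, not part of the assignment.

```python
import numpy as np
import pandas as pd

levels = ["none", "infraction", "misdemeanor", "felony"]
rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible

x = pd.Categorical(rng.choice(levels, size=100), categories=levels, ordered=False)

# Library route: one indicator column per level, minus the first level
dummies = pd.get_dummies(pd.Series(x), drop_first=True, dtype=int)

# Manual route: one 0/1 column per non-reference level
manual = np.column_stack([(x == lvl).astype(int) for lvl in levels[1:]])

print((dummies.to_numpy() == manual).all())
```

Three columns are enough because a row of all zeros can only mean "none", so no information is lost.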


What should the sum of each row be (in English)? Write your answer in the markdown cell below

Answer: Each row sums to 1 if the person committed a crime (exactly one of the three indicators is on) and to 0 if the person's level is "none", which is the reference category.


Verify that in the code cell below

In [19]:
row_sum = x_3_matrix.sum(axis=1)
print(row_sum.value_counts())

1    73
0    27
Name: count, dtype: int64


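How should the column sum look (in English)? Write your answer in the markdown cell below

Answer: Each column sum is the number of people at that crime level. The three sums add up to the number of people who committed any crime; the remaining people (100 minus that total) are at the "none" level.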

Verify that in the code cell below

In [20]:
col_sum = np.sum(x_3_matrix, axis=0)
print(col_sum)

infraction     25
misdemeanor    21
felony         27
dtype: int64
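
The two verbal answers above can also be checked with assertions rather than by eye. This is a sketch on freshly simulated data (the seed and names such as `dummy` are illustrative assumptions), not a rerun of the cells above.

```python
import numpy as np
import pandas as pd

levels = ["none", "infraction", "misdemeanor", "felony"]
rng = np.random.default_rng(1)  # arbitrary seed for reproducibility
x = rng.choice(levels, size=100)

# Three indicator columns, "none" is the reference level
dummy = pd.DataFrame({lvl: (x == lvl).astype(int) for lvl in levels[1:]})

# Row sums: 1 for any crime, 0 for "none"
row_sums = dummy.sum(axis=1)
assert row_sums.isin([0, 1]).all()
assert (row_sums.to_numpy() == (x != "none").astype(int)).all()

# Column sums: the count of each crime level
counts = pd.Series(x).value_counts()
for lvl in levels[1:]:
    assert dummy[lvl].sum() == counts.get(lvl, 0)

# Whatever is left over must be the "none" group
n_none = len(x) - int(dummy.to_numpy().sum())
print(n_none)
```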


Generate a matrix with 100 rows where the first column holds realizations from a normal with mean 17 and variance 38, the second column is uniform between -10 and 10, the third column is Poisson with mean 6, the fourth column is exponential with lambda of 9, the fifth column is binomial with n = 20 and p = 0.12 and the sixth column is a binary variable with exactly 24% 1's dispersed randomly. Name the rows after the entries of the `fake_first_names` vector. You will need to use NumPy.

In [21]:
# Number of rows
num_rows=100

# Assign row names (index) from fake_first_names
fake_first_names = [
    "Sophia", "Emma", "Olivia", "Ava", "Mia", "Isabella", "Riley", 
    "Aria", "Zoe", "Charlotte", "Lily", "Layla", "Amelia", "Emily", 
    "Madelyn", "Aubrey", "Adalyn", "Madison", "Chloe", "Harper", 
    "Abigail", "Aaliyah", "Avery", "Evelyn", "Kaylee", "Ella", "Ellie", 
    "Scarlett", "Arianna", "Hailey", "Nora", "Addison", "Brooklyn", 
    "Hannah", "Mila", "Leah", "Elizabeth", "Sarah", "Eliana", "Mackenzie", 
    "Peyton", "Maria", "Grace", "Adeline", "Elena", "Anna", "Victoria", 
    "Camilla", "Lillian", "Natalie", "Jackson", "Aiden", "Lucas", 
    "Liam", "Noah", "Ethan", "Mason", "Caden", "Oliver", "Elijah", 
    "Grayson", "Jacob", "Michael", "Benjamin", "Carter", "James", 
    "Jayden", "Logan", "Alexander", "Caleb", "Ryan", "Luke", "Daniel", 
    "Jack", "William", "Owen", "Gabriel", "Matthew", "Connor", "Jayce", 
    "Isaac", "Sebastian", "Henry", "Muhammad", "Cameron", "Wyatt", 
    "Dylan", "Nathan", "Nicholas", "Julian", "Eli", "Levi", "Isaiah", 
    "Landon", "David", "Christian", "Andrew", "Brayden", "John", 
    "Lincoln"
]

# Create a DataFrame with the specified distributions
X = pd.DataFrame({
    "Normal": np.random.normal(loc=17, scale=np.sqrt(38), size=num_rows),           # Normal(17, variance 38)
    "Uniform": np.random.uniform(low=-10, high=10, size=num_rows),                  # Uniform(-10, 10)
    "Poisson": np.random.poisson(6, size=num_rows),                                 # Poisson(6)
    "Exponential": np.random.exponential((1/9), size=num_rows),                     # Exponential(λ=9)
    "Binomial": np.random.binomial(n=20, p=0.12, size=num_rows),                    # Binomial(n=20, p=0.12)
    "Binary": np.random.permutation([1]*int(num_rows*0.24) + [0]*int(num_rows*0.76))# 24% 1s, shuffled
})

X.index = fake_first_names[:num_rows]
X


Unnamed: 0,Normal,Uniform,Poisson,Exponential,Binomial,Binary
Sophia,13.615262,-3.635272,6,0.035979,2,0
Emma,17.118373,-0.316063,9,0.013859,3,0
Olivia,12.132996,5.497896,3,0.106619,0,0
Ava,24.350343,3.212553,1,0.040261,1,0
Mia,21.750860,-1.158207,8,0.015045,4,0
...,...,...,...,...,...,...
Christian,19.708186,-4.080289,6,0.093488,3,1
Andrew,15.664092,-8.588573,6,0.066969,1,0
Brayden,10.674084,-6.151815,3,0.025805,1,0
John,13.146662,8.196078,7,0.105762,3,0
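
Two parameterization details in this cell are easy to get wrong: NumPy's normal takes a standard deviation (not a variance) and its exponential takes a scale of 1/lambda (not a rate). The sketch below checks both on large samples and builds the exact-24% binary column with `round` instead of `int`, which avoids float truncation surprises. It uses the `default_rng` generator API and an arbitrary seed, both of which are choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed
n = 100

# Exactly 24% ones, then shuffled
n_ones = round(0.24 * n)
binary = rng.permutation(np.r_[np.ones(n_ones, dtype=int), np.zeros(n - n_ones, dtype=int)])

# Large samples so the empirical moments sit close to the theoretical ones
lam = 9
expo = rng.exponential(scale=1 / lam, size=100_000)           # mean should be near 1/9
normal = rng.normal(loc=17, scale=np.sqrt(38), size=100_000)  # variance should be near 38

print(binary.sum(), expo.mean(), normal.var())
```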


Create a data frame of the same data as above except make the binary variable a factor "DOMESTIC" vs "FOREIGN" for 0 and 1 respectively. In Rstudio you used the `View` function to ensure this worked as desired. In python use .head() on the DataFrame. I recommend creating a copy of the DataFrame and then using the .replace in conjunction with .astype("category") to make the binary variable a factor. 

In [22]:
# Convert matrix DataFrame to categorical for the binary variable
# Make a copy to keep X unchanged
x_copy = X.copy()

# Convert binary column (6th column) to categorical labels
x_copy["Binary"] = x_copy["Binary"].replace({0:"Domestic", 1:"Foreign"}).astype("category")

# Display first few rows
x_copy.head()

Unnamed: 0,Normal,Uniform,Poisson,Exponential,Binomial,Binary
Sophia,13.615262,-3.635272,6,0.035979,2,Domestic
Emma,17.118373,-0.316063,9,0.013859,3,Domestic
Olivia,12.132996,5.497896,3,0.106619,0,Domestic
Ava,24.350343,3.212553,1,0.040261,1,Domestic
Mia,21.75086,-1.158207,8,0.015045,4,Domestic


Print out a table of the binary variable. Then print out the proportions of "DOMESTIC" vs "FOREIGN". Pandas Series have a `.value_counts()` method; pass `normalize=True` to get proportions instead of counts.

In [23]:
print(x_copy["Binary"].value_counts(normalize=True))

Binary
Domestic    0.76
Foreign     0.24
Name: proportion, dtype: float64
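
The prompt asks for both the table of counts and the proportions. One way to show them side by side is sketched below on a stand-in column with the same 76/24 split; the names `labels` and `summary` are illustrative.

```python
import pandas as pd

# Stand-in for the binary column: 76 zeros and 24 ones
binary = pd.Series([0] * 76 + [1] * 24)
labels = binary.map({0: "DOMESTIC", 1: "FOREIGN"}).astype("category")

# Counts and proportions aligned on the factor levels
summary = pd.DataFrame({
    "count": labels.value_counts(),
    "proportion": labels.value_counts(normalize=True),
})
print(summary)
```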


Print out a summary of the whole dataframe.

In [24]:
print(x_copy.describe())
print(x_copy["Binary"].value_counts())

           Normal     Uniform    Poisson  Exponential    Binomial
count  100.000000  100.000000  100.00000   100.000000  100.000000
mean    18.285119    0.000856    6.28000     0.093466    2.530000
std      5.662072    6.130316    2.34878     0.095559    1.566409
min      6.306362   -9.891443    1.00000     0.000812    0.000000
25%     13.596485   -5.488196    5.00000     0.024459    1.000000
50%     18.441749   -0.365310    6.00000     0.070078    2.500000
75%     22.218288    5.838185    8.00000     0.112332    3.250000
max     34.163007    9.909550   11.00000     0.473739    6.000000
Binary
Domestic    76
Foreign     24
Name: count, dtype: int64
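
By default `describe()` skips non-numeric columns, which is why the factor needed a separate `value_counts()` call. A sketch of the one-call alternative, using `include="all"` on a tiny made-up frame:

```python
import pandas as pd

# Tiny illustrative frame with one numeric and one categorical column
demo = pd.DataFrame({
    "Poisson": [6, 9, 3, 1],
    "Binary": pd.Categorical(["Domestic", "Domestic", "Foreign", "Domestic"]),
})

# include="all" adds unique/top/freq rows for the categorical column
summary = demo.describe(include="all")
print(summary)
```

Cells that do not apply to a column (for example the mean of a factor) are filled with `NaN`.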


## Dataframe creation
Imagine you are running an experiment with many manipulations. You have 14 levels in the variable "treatment" with levels a, b, c, etc. For each of those manipulations you have 3 submanipulations in a variable named "variation" with levels A, B, C. Then you have "gender" with levels M / F. Then you have "generation" with levels Boomer, GenX, Millenial. Then you will have 6 runs per each of these groups. In each set of 6 you will need to select a name without duplication from the appropriate set of names (from the last question). Create a data frame with columns treatment, variation, gender, generation, name and y that will store all the unique unit information in this experiment. Leave y empty because it will be measured as the experiment is executed. In Rstudio you used `rep` function using the `times` argument. For python use np.tile, and np.repeat.

In [25]:
# Define categories
treatments = list("abcdefghijklmn")  # 14 levels
variations = list("ABC")             # 3 levels
genders = ["M", "F"]                 # 2 levels
generations = ["Boomer", "GenX", "Millenial"]  # 3 levels

# Define name sets
name_sets = {
    "M": {
        "Boomer": ["Theodore", "Bernard", "Gene", "Herbert", "Ray", "Tom", "Lee", "Alfred", "Leroy", "Eddie"],
        "GenX": ["Marc", "Jamie", "Greg", "Darryl", "Tim", "Dean", "Jon", "Chris", "Troy", "Jeff"],
        "Millenial": ["Zachary", "Dylan", "Christian", "Wesley", "Seth", "Austin", "Gabriel", "Evan", "Casey", "Luis"]
    },
    "F": {
        "Boomer": ["Gloria", "Joan", "Dorothy", "Shirley", "Betty", "Dianne", "Kay", "Marjorie", "Lorraine", "Mildred"],
        "GenX": ["Tracy", "Dawn", "Tina", "Tammy", "Melinda", "Tamara", "Tracey", "Colleen", "Sherri", "Heidi"],
        "Millenial": ["Samantha", "Alexis", "Brittany", "Lauren", "Taylor", "Bethany", "Latoya", "Candice", "Brittney", "Cheyenne"]
    }
}


# Create experiment DataFrame
df = pd.DataFrame({
    "treatment": np.repeat(treatments, len(variations) * len(genders) * len(generations) * 6),
    "variation": np.tile(np.repeat(variations, len(genders) * len(generations) * 6), len(treatments)),
    "gender": np.tile(np.repeat(genders, len(generations) * 6), len(treatments) * len(variations)),
    "generation": np.tile(np.repeat(generations, 6), len(treatments) * len(variations) * len(genders)),
}) 

# Add a unique identifier to preserve the original order
df = df.reset_index().rename(columns={'index': 'orig_index'})

# Define a function that assigns 6 unique names per group and returns a DataFrame with the original index.
def assign_names_with_index(group):
    gender_val = group["gender"].iloc[0]       # Extract the group's gender
    generation_val = group["generation"].iloc[0]  # Extract the group's generation
    # Sample 6 unique names from the appropriate set (without replacement)
    names = np.random.choice(name_sets[gender_val][generation_val], 6, replace=False)
    # Return a DataFrame with the original indices and the assigned names
    return pd.DataFrame({
        "orig_index": group["orig_index"],
        "name": names
    })

# Group by the categorical variables and apply the function.
names_df = df.groupby(["treatment", "variation", "gender", "generation"], group_keys=False).apply(assign_names_with_index).reset_index(drop=True)

# Merge the assigned names back into the original DataFrame using the unique identifier.
df = df.merge(names_df, on="orig_index", how="left")

# Restore the original order and remove the temporary identifier.
df = df.sort_values("orig_index").reset_index(drop=True).drop(columns=["orig_index"])

# Add empty column y
df["y"] = np.nan

# Display DataFrame
df



Unnamed: 0,treatment,variation,gender,generation,name,y
0,a,A,M,Boomer,Lee,
1,a,A,M,Boomer,Leroy,
2,a,A,M,Boomer,Tom,
3,a,A,M,Boomer,Gene,
4,a,A,M,Boomer,Herbert,
...,...,...,...,...,...,...
1507,n,C,F,Millenial,Bethany,
1508,n,C,F,Millenial,Taylor,
1509,n,C,F,Millenial,Samantha,
1510,n,C,F,Millenial,Brittney,


Now that you've done it with np.tile and np.repeat, try doing this by importing product from the itertools module. This is analogous to using the `expand.grid` function in R.

| **R Function** | **Python Equivalent** | **Use Case** |
|--------------|-----------------|-----------|
| `rep(x, times=n)` | `np.tile(x, n)` | Repeat the full sequence **`n` times** |
| `rep(x, each=n)` | `np.repeat(x, n)` | Repeat each element **`n` times** in order |
| `rep(x, length.out=n)` | `np.resize(x, n)` | Repeat `x` but **truncate** or **expand** to length `n` |
| `expand.grid()` | `itertools.product()` | Generate **all combinations** of the inputs |

**`expand.grid()` → `itertools.product()`** for generating **all combinations**  
**`rep(..., each=n)` → `np.repeat()`** for **repeating values in order**  
**`rep(..., times=n)` → `np.tile()`** for **cycling through values**  
**A combination of `np.repeat()` and `np.tile()`** replaces **nested `rep()`** in R
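
A quick way to keep the three NumPy functions straight is to run them on a three-letter vector. In this sketch, `np.repeat` plays the role of R's `rep(x, each=n)`, `np.tile` of `rep(x, times=n)`, and `np.resize` of `rep(x, length.out=n)`.

```python
import numpy as np

x = np.array(["a", "b", "c"])

each = np.repeat(x, 2)        # like rep(x, each=2):       a a b b c c
times = np.tile(x, 2)         # like rep(x, times=2):      a b c a b c
length_out = np.resize(x, 5)  # like rep(x, length.out=5): a b c a b

print(each, times, length_out)
```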

In [26]:
from itertools import product

# Define categories
treatments = list("abcdefghijklmn")  # 14 treatment levels
variations = list("ABC")             # 3 variation levels
genders = ["M", "F"]                 # 2 gender levels
generations = ["Boomer", "GenX", "Millenial"]  # 3 generation levels
runs_per_group = 6                   # Each group has 6 runs

# Generate all unique combinations (equivalent to expand.grid in R)
df = pd.DataFrame(
    product(treatments, variations, genders, generations, range(1, runs_per_group + 1)),
    columns=["treatment", "variation", "gender", "generation", "run"]
)

# Define name sets
name_sets = {
    "M": {
        "Boomer": ["Theodore", "Bernard", "Gene", "Herbert", "Ray", "Tom", "Lee", "Alfred", "Leroy", "Eddie"],
        "GenX": ["Marc", "Jamie", "Greg", "Darryl", "Tim", "Dean", "Jon", "Chris", "Troy", "Jeff"],
        "Millenial": ["Zachary", "Dylan", "Christian", "Wesley", "Seth", "Austin", "Gabriel", "Evan", "Casey", "Luis"]
    },
    "F": {
        "Boomer": ["Gloria", "Joan", "Dorothy", "Shirley", "Betty", "Dianne", "Kay", "Marjorie", "Lorraine", "Mildred"],
        "GenX": ["Tracy", "Dawn", "Tina", "Tammy", "Melinda", "Tamara", "Tracey", "Colleen", "Sherri", "Heidi"],
        "Millenial": ["Samantha", "Alexis", "Brittany", "Lauren", "Taylor", "Bethany", "Latoya", "Candice", "Brittney", "Cheyenne"]
    }
}

# Function to assign unique names per group (each group has 6 rows)
def assign_names(group):
    gender = group["gender"].iloc[0]
    generation = group["generation"].iloc[0]
    # Sample 6 unique names (without replacement) from the appropriate name set
    return np.random.choice(name_sets[gender][generation], size=len(group), replace=False)

# Group by all four factors and apply the function.
# Using sort=False preserves the order generated by product.
df["name"] = (df.groupby(["treatment", "variation", "gender", "generation"], sort=False, group_keys=False).apply(assign_names).explode().reset_index(drop=True))

# Add an empty column for y (to be measured later)
df["y"] = np.nan

# Display first few rows and verify the total number of rows
print(df.head())
print(f"Total rows: {len(df)} (Expected: {14 * 3 * 2 * 3 * 6})")
df

  treatment variation gender generation  run      name   y
0         a         A      M     Boomer    1  Theodore NaN
1         a         A      M     Boomer    2      Gene NaN
2         a         A      M     Boomer    3   Herbert NaN
3         a         A      M     Boomer    4     Eddie NaN
4         a         A      M     Boomer    5       Lee NaN
Total rows: 1512 (Expected: 1512)




Unnamed: 0,treatment,variation,gender,generation,run,name,y
0,a,A,M,Boomer,1,Theodore,
1,a,A,M,Boomer,2,Gene,
2,a,A,M,Boomer,3,Herbert,
3,a,A,M,Boomer,4,Eddie,
4,a,A,M,Boomer,5,Lee,
...,...,...,...,...,...,...,...
1507,n,C,F,Millenial,2,Taylor,
1508,n,C,F,Millenial,3,Latoya,
1509,n,C,F,Millenial,4,Cheyenne,
1510,n,C,F,Millenial,5,Alexis,
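
To confirm that the two constructions give the same design, the sketch below builds a small grid both ways and compares them. The factor names and sizes are made up for brevity. Note that `itertools.product` varies the last factor fastest, whereas R's `expand.grid` varies the first factor fastest, so row order can differ from the R output even when the set of rows is the same.

```python
from itertools import product

import numpy as np
import pandas as pd

a = list("ab")   # 2 levels (illustrative)
b = list("XYZ")  # 3 levels (illustrative)
runs = 2

# Route 1: Cartesian product, last factor varies fastest
grid = pd.DataFrame(product(a, b, range(runs)), columns=["a", "b", "run"])

# Route 2: explicit repeat/tile bookkeeping
manual = pd.DataFrame({
    "a": np.repeat(a, len(b) * runs),
    "b": np.tile(np.repeat(b, runs), len(a)),
    "run": np.tile(np.arange(runs), len(a) * len(b)),
})

same_factors = (grid[["a", "b"]].to_numpy() == manual[["a", "b"]].to_numpy()).all()
same_runs = (grid["run"].to_numpy() == manual["run"].to_numpy()).all()
print(same_factors, same_runs)
```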


## Basic Binary Classification Modeling

Load the famous `iris` data frame into the namespace. In Rstudio you used the `skim` function from the package `skimr` to provide a summary of the columns. In python we will use df.describe() and the ProfileReport from the ydata-profiling package. The `iris` data set is not available in base python, but we can get this data from the sklearn package. Write a few descriptive sentences about the distributions using the code below in English.

### **Comparing the `iris` Dataset in R vs Python**
| Feature  | **R (`datasets::iris`)**  | **Python (`sklearn.datasets.load_iris()`)**  |
|----------|-------------------------|--------------------------------|
| **Total Rows**  | 150 | 150 |
| **Columns (Features)** | 5 (`Sepal.Length`, `Sepal.Width`, `Petal.Length`, `Petal.Width`, `Species`) | 5 (`sepal length (cm)`, `sepal width (cm)`, `petal length (cm)`, `petal width (cm)`, `species`) |
| **Species Encoding**  | `"setosa"`, `"versicolor"`, `"virginica"` (Categorical Factor) | `0` (setosa), `1` (versicolor), `2` (virginica) (Numerical Encoding) |
| **Data Type for Species** | Factor (Categorical) | Integer (0,1,2) |
| **Data Loading Method** | `data(iris)` (built-in dataset) | `datasets.load_iris()` (from `sklearn`) |

### **Key Differences**
- **Species Encoding:**  
  - **R uses categorical factor labels (`setosa`, `versicolor`, `virginica`).**  
  - **Python (`sklearn`) encodes species numerically as `0`, `1`, and `2`.**
- **Column Names:**  
  - **R:** `Sepal.Length`, `Sepal.Width`, etc.  
  - **Python:** `sepal length (cm)`, `sepal width (cm)`, etc.  

In [27]:
from sklearn import datasets
import ydata_profiling  

# Load the famous Iris dataset
iris = datasets.load_iris()
df_iris = pd.DataFrame(iris.data, columns = iris.feature_names)

df_iris["species"] = iris.target
df_iris

profile = ydata_profiling.ProfileReport(df_iris, title = "iris_summary", explorative= True)

#Generate the profiling report (Uncomment to generate HTML file)
#profile.to_file("iris_report.html")

TO-DO: describe this data

Answer: This data contains 3 species of iris, encoded as 0 for setosa, 1 for versicolor, and 2 for virginica. Each observation contains 4 measurements to the nearest 0.1 cm: sepal length, sepal width, petal length, and petal width. This dataset was made famous by R. A. Fisher when he published a paper outlining a method to predict the species based on the 4 measurements.

The outcome / label / response is `Species`. This is what we will be trying to predict. However, we only care about binary classification between "setosa" and "versicolor" for the purposes of this exercise. Thus the first order of business is to drop one class. Let's drop the data for the level "virginica" from the data frame.

In [28]:
# Filter out "virginica" from the dataset
df_iris_binary = df_iris[df_iris["species"] != 2].copy()

#print(df_iris_binary.tail())
print(df_iris_binary["species"].unique())

[0 1]
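
Filtering on the bare integer 2 works, but it relies on remembering which code belongs to which species. A sketch of a more readable route is to turn the codes into a labeled factor first. The `target_names` list below is typed out by hand in the order sklearn's `load_iris` uses, and the `codes` vector is a short made-up sample.

```python
import pandas as pd

# Order matches the integer codes 0, 1, 2
target_names = ["setosa", "versicolor", "virginica"]
codes = [0, 0, 1, 2, 1, 2]  # illustrative sample of species codes

species = pd.Series(pd.Categorical.from_codes(codes, categories=target_names))

# Drop one class by name, then drop its now-unused level
binary = species[species != "virginica"].cat.remove_unused_categories()
print(binary.cat.categories)
```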


Now create a vector `y` that is length the number of remaining rows in the data frame whose entries are 0 if "setosa" and 1 if "versicolor".

In [29]:
# Create binary target vector `y` (0 for setosa, 1 for versicolor)
y = (df_iris_binary["species"] == 1).astype(int)
y

0     0
1     0
2     0
3     0
4     0
     ..
95    1
96    1
97    1
98    1
99    1
Name: species, Length: 100, dtype: int32

Write a function `mode` returning the sample mode of a vector of numeric values. Use np.random.choice from NumPy and import Counter from the collections module.

In [30]:
from collections import Counter

# Define mode function
def mode(v):
    return Counter(v).most_common(1)[0][0]

# Test with a random sample (equivalent to `sample(letters, 1000, replace=TRUE)`)
sample_data = np.random.choice(list("abcdefghijklmnopqrstuvwxyz"), 1000, replace=True)
print("Mode of sample letters:", mode(sample_data))

# Test with binary target vector `y`
print("Mode of y:", mode(y))

Mode of sample letters: c
Mode of y: 0
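
Because `y` contains 50 zeros and 50 ones, its mode is a tie, and `Counter.most_common(1)` simply reports the value it encountered first. The sketch below uses a hypothetical helper, `all_modes`, that returns every tied value so ties are visible.

```python
from collections import Counter

def all_modes(v):
    """Return every value that attains the highest count, in sorted order."""
    counts = Counter(v)
    top = max(counts.values())
    return sorted(value for value, count in counts.items() if count == top)

print(all_modes([0, 1, 1, 0, 2]))  # 0 and 1 both appear twice
print(all_modes(["a", "b", "b"]))  # a single, unambiguous mode
```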


Fit a threshold model to `y` using the feature `Sepal.Length`. Write your own code to do this. What is the estimated value of the threshold parameter? Save the threshold value as `threshold`. Hint: use np.zeros and np.sum from Numpy. You will need to use a for loop using the range() function.  

In [31]:
# Extract relevant data
sepal_length = df_iris_binary["sepal length (cm)"].values  # Feature
y_values = y.values  # Target labels (0 or 1)
n = len(sepal_length)  # Number of samples

# Initialize matrix to store threshold values and corresponding error counts
num_errors_by_parameter = np.zeros((n, 2))

# Loop over all possible threshold values
for i in range(n):
    threshold = sepal_length[i]  # Set current threshold
    num_errors = np.sum((sepal_length > threshold) != y_values)  # Count classification errors
    num_errors_by_parameter[i] = [threshold, num_errors]  # Store values

# Sort by number of errors
num_errors_by_parameter = num_errors_by_parameter[num_errors_by_parameter[:, 1].argsort()]

# Get the threshold with the fewest errors and save it as `threshold`
best_threshold = num_errors_by_parameter[0, 0]
threshold = best_threshold

# Print results
print(f"Optimal threshold for classification: {best_threshold}")

Optimal threshold for classification: 5.4
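
The same search can be done without an explicit loop by broadcasting every candidate threshold against every observation. The sketch below runs on a small made-up sample (8 points, not the iris data) so the result is easy to check by hand; the rule is the same as above, predict 1 when `x > threshold`.

```python
import numpy as np

# Made-up feature values and labels (illustrative only)
x = np.array([4.6, 5.0, 5.1, 5.5, 5.3, 5.4, 5.9, 6.1])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Each distinct observed value is a candidate threshold
candidates = np.unique(x)

# Rows are candidates, columns are observations
predictions = x[None, :] > candidates[:, None]
errors = (predictions != y[None, :]).sum(axis=1)

best_candidate = candidates[errors.argmin()]
print(best_candidate, errors.min())
```

Using `np.unique` also avoids re-evaluating duplicate feature values, which the loop over all `n` rows does.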


What is the total number of errors this model makes? This requires a couple of minor modifications to the previous code.

In [32]:
# Extract relevant data
sepal_length = df_iris_binary["sepal length (cm)"].values  # Feature
y_values = y.values  # Target labels (0 or 1)
n = len(sepal_length)  # Number of samples

# Store each candidate threshold with its number of misclassifications
num_errors_by_parameter = np.zeros((n, 2))

# Loop over all candidate threshold values
for i in range(n):
    candidate = sepal_length[i]  # Current candidate threshold
    num_errors = np.sum((sepal_length > candidate) != y_values)  # Count classification errors
    num_errors_by_parameter[i] = [candidate, num_errors]

# Sort by number of errors; the first row is the fitted model
num_errors_by_parameter = num_errors_by_parameter[num_errors_by_parameter[:, 1].argsort()]
best_threshold = num_errors_by_parameter[0, 0]       # Threshold with the fewest errors
total_errors = int(num_errors_by_parameter[0, 1])    # Errors made at that threshold only

# Print results
print(f"Optimal threshold for classification: {best_threshold}")
print(f"Total number of errors made by the model: {total_errors}")

Optimal threshold for classification: 5.4
Total number of errors made by the model: 11


Does the threshold model's performance make sense given the following summaries:

In [33]:
# Print the best threshold found earlier
print(f"Optimal threshold for classification: {best_threshold}")

# Summary statistics for setosa and versicolor Sepal.Length
setosa_summary = df_iris_binary[df_iris_binary["species"] == 0]["sepal length (cm)"].describe()
versicolor_summary = df_iris_binary[df_iris_binary["species"] == 1]["sepal length (cm)"].describe()

# Print summaries
print("\nSummary statistics for Setosa Sepal Length:")
print(setosa_summary)

print("\nSummary statistics for Versicolor Sepal Length:")
print(versicolor_summary)

Optimal threshold for classification: 5.4

Summary statistics for Setosa Sepal Length:
count    50.00000
mean      5.00600
std       0.35249
min       4.30000
25%       4.80000
50%       5.00000
75%       5.20000
max       5.80000
Name: sepal length (cm), dtype: float64

Summary statistics for Versicolor Sepal Length:
count    50.000000
mean      5.936000
std       0.516171
min       4.900000
25%       5.600000
50%       5.900000
75%       6.300000
max       7.000000
Name: sepal length (cm), dtype: float64


TO-DO: Write your answer here in English

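Answer: Yes, this makes sense. I would expect the best threshold to fall somewhere between the two class means (5.006 for setosa and 5.936 for versicolor), closer to the mean of the class with the lower variance. The fitted value of 5.4 sits between them, nearer to setosa, whose standard deviation (0.352) is smaller than versicolor's (0.516).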

Create the function `g` explicitly that can predict `y` from `x` being a new `Sepal.Length`. Hint: use np.where from Numpy ... this can also be done using a lambda function.

In [34]:
# Define function `g` for threshold-based prediction
# Use the same strict rule as the fit: predict 1 when x > threshold
def g(x):
    return np.where(x > best_threshold, 1, 0)

print(g(5.6))
print(g(1.8))

1
0


In [35]:
g2 = lambda x: np.where(x > best_threshold, 1, 0)  # same strict rule as the fit

print(g2(5.6))
print(g2(1.8))

1
0
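
Since `np.where` is vectorized, `g` can score a whole array at once, which makes it easy to tabulate where the errors fall. The sketch below uses a made-up sample and a hypothetical threshold of 5.1 (not the iris fit), with the strict `>` rule used when fitting.

```python
import numpy as np
import pandas as pd

threshold = 5.1  # hypothetical value for this toy sample

def g(x):
    """Predict 1 when x exceeds the threshold, else 0."""
    return np.where(np.asarray(x) > threshold, 1, 0)

# Made-up feature values and labels (illustrative only)
x = np.array([4.6, 5.0, 5.1, 5.5, 5.3, 5.4, 5.9, 6.1])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

y_hat = g(x)
n_errors = int((y_hat != y).sum())

# Rows are actual classes, columns are predicted classes
confusion = pd.crosstab(pd.Series(y, name="actual"), pd.Series(y_hat, name="predicted"))
print(n_errors)
print(confusion)
```

The off-diagonal cells of the table show which class the model confuses for the other.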
