# Student Performance Prediction (Classification)

**📊 Dataset:** `student_performance.csv`  
**📚 Source:** [Kaggle – student_performance Dataset](https://www.kaggle.com/)  




## 🎯 Goal
The goal of this project is to predict student academic performance using **supervised machine learning (classification) with Apache Spark**.  
Using students' demographic, behavioral, and academic features, the model classifies each student into an expected performance level. This helps identify students who may need early academic support and informs educational decision-making.




## 📈 Description
The dataset includes features such as:  
- `StudentID`, `Gender`, `AttendanceRate`, `StudyHoursPerWeek`  
- `PreviousGrade`, `ExtracurricularActivities`, `ParentalSupport`, `Online Classes Taken`  
- `FinalGrade` (target variable; binned into `PerformanceLevel` for classification)  

This project predicts students' academic performance levels to support **data-driven educational decisions**.




In [None]:
# import libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when
from pyspark.sql.types import IntegerType, DoubleType
from pyspark.ml.feature import StringIndexer, OneHotEncoder
import pyspark.sql.functions as F

In [None]:
# Create Spark Session
spark = SparkSession.builder \
    .appName("Student Performance - Data Cleaning") \
    .config("spark.driver.bindAddress", "127.0.0.1") \
    .config("spark.driver.host", "127.0.0.1") \
    .getOrCreate()

print("✅ Spark Session Created Successfully")

✅ Spark Session Created Successfully


# Phase 1: Data Overview & Understanding

In [None]:
df = spark.read.csv(
    "/content/student_performance_updated_1000.csv",
    header=True,
    inferSchema=True
)

print("✅ Dataset Loaded Successfully")

✅ Dataset Loaded Successfully


In [None]:
# Dataset Shape
rows = df.count()
cols = len(df.columns)

print(f"📊 Dataset Shape: {rows} rows, {cols} columns")

📊 Dataset Shape: 1000 rows, 12 columns


In [None]:
# Preview Dataset
print("🔹 First 5 rows:")
df.show(5)

🔹 First 5 rows:
+---------+-------+------+--------------+-----------------+-------------+-------------------------+---------------+----------+-----------+--------------+--------------------+
|StudentID|   Name|Gender|AttendanceRate|StudyHoursPerWeek|PreviousGrade|ExtracurricularActivities|ParentalSupport|FinalGrade|Study Hours|Attendance (%)|Online Classes Taken|
+---------+-------+------+--------------+-----------------+-------------+-------------------------+---------------+----------+-----------+--------------+--------------------+
|      1.0|   John|  Male|          85.0|             15.0|         78.0|                      1.0|           High|      80.0|        4.8|          59.0|               false|
|      2.0|  Sarah|Female|          90.0|             20.0|         85.0|                      2.0|         Medium|      87.0|        2.2|          70.0|                true|
|      3.0|   Alex|  Male|          78.0|             10.0|         65.0|                      0.0|       

In [None]:
# Preview Dataset
print("🔹 Random sample:")
df.sample(fraction=0.01).show(5)

🔹 Random sample:
+---------+-----------------+------+--------------+-----------------+-------------+-------------------------+---------------+----------+-----------+--------------+--------------------+
|StudentID|             Name|Gender|AttendanceRate|StudyHoursPerWeek|PreviousGrade|ExtracurricularActivities|ParentalSupport|FinalGrade|Study Hours|Attendance (%)|Online Classes Taken|
+---------+-----------------+------+--------------+-----------------+-------------+-------------------------+---------------+----------+-----------+--------------+--------------------+
|      9.0|            James|  Male|          82.0|             12.0|         70.0|                      2.0|            Low|      72.0|        3.6|          50.0|               false|
|   5089.0|   Trevor Freeman|  Male|          85.0|             NULL|         77.0|                      2.0|         Medium|      72.0|       NULL|          99.0|               false|
|   2730.0|Brandon Dickerson|  Male|          88.0|    

In [None]:
# Dataset Schema & Info
print("🔹 Dataset Schema:")
df.printSchema()

🔹 Dataset Schema:
root
 |-- StudentID: double (nullable = true)
 |-- Name: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- AttendanceRate: double (nullable = true)
 |-- StudyHoursPerWeek: double (nullable = true)
 |-- PreviousGrade: double (nullable = true)
 |-- ExtracurricularActivities: double (nullable = true)
 |-- ParentalSupport: string (nullable = true)
 |-- FinalGrade: double (nullable = true)
 |-- Study Hours: double (nullable = true)
 |-- Attendance (%): double (nullable = true)
 |-- Online Classes Taken: boolean (nullable = true)



#### Identify Columns Types

In [None]:
# Identify Numerical Columns
numerical_cols = [
    field.name for field in df.schema.fields
    if isinstance(field.dataType, (IntegerType, DoubleType))
]

print("📌 Numerical Columns:", numerical_cols)

📌 Numerical Columns: ['StudentID', 'AttendanceRate', 'StudyHoursPerWeek', 'PreviousGrade', 'ExtracurricularActivities', 'FinalGrade', 'Study Hours', 'Attendance (%)']


In [None]:
# Identify Categorical Columns
categorical_cols = [
    field.name for field in df.schema.fields
    if field.name not in numerical_cols
]
print("📌 Categorical Columns:", categorical_cols)

📌 Categorical Columns: ['Name', 'Gender', 'ParentalSupport', 'Online Classes Taken']


In [None]:
# Unique Values per Column
print("🔹 Unique values per column:")
for col_name in df.columns:
    print(f"{col_name}: {df.select(col_name).distinct().count()}")

🔹 Unique values per column:
StudentID: 917
Name: 963
Gender: 3
AttendanceRate: 10
StudyHoursPerWeek: 11
PreviousGrade: 11
ExtracurricularActivities: 5
ParentalSupport: 4
FinalGrade: 11
Study Hours: 53
Attendance (%): 53
Online Classes Taken: 3


# Data Cleaning

In [None]:
# Check Missing Values
print("🔹 Missing values per column:")
df.select([
    F.count(F.when(col(c).isNull(), c)).alias(c)
    for c in df.columns
]).show()

🔹 Missing values per column:
+---------+----+------+--------------+-----------------+-------------+-------------------------+---------------+----------+-----------+--------------+--------------------+
|StudentID|Name|Gender|AttendanceRate|StudyHoursPerWeek|PreviousGrade|ExtracurricularActivities|ParentalSupport|FinalGrade|Study Hours|Attendance (%)|Online Classes Taken|
+---------+----+------+--------------+-----------------+-------------+-------------------------+---------------+----------+-----------+--------------+--------------------+
|       40|  34|    48|            40|               50|           33|                       43|             22|        40|         24|            41|                  25|
+---------+----+------+--------------+-----------------+-------------+-------------------------+---------------+----------+-----------+--------------+--------------------+



### Handle Missing Values
Strategy:
- Numerical → Median (less sensitive to skew and extreme values than the mean)
- Categorical → Mode (most frequent value)
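
To make the reasoning behind this strategy concrete, here is a minimal plain-Python sketch (not the notebook's Spark code). The helper `impute` and the sample values are hypothetical and exist only for illustration.

```python
from statistics import mean, median, mode

def impute(values, strategy):
    """Fill None entries using the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

# Hypothetical study-hours column with one gap and one extreme value
hours = [10.0, 12.0, 15.0, None, 90.0]

mean_filled = impute(hours, "mean")      # fill value is pulled up by the 90.0
median_filled = impute(hours, "median")  # fill value stays near the typical range

# Hypothetical categorical column: the mode is the most frequent observed label
support = ["High", "Low", None, "High", "Medium"]
mode_filled = impute(support, "mode")
```

With these sample values the mean-based fill is 31.75 while the median-based fill is 13.5, which is why the median is the safer default for skewed numerical columns.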

In [None]:
# Handle Numerical Missing Values (Median)
for col_name in numerical_cols:
    median_value = df.approxQuantile(col_name, [0.5], 0.01)[0]
    df = df.fillna({col_name: median_value})


In [None]:
# Handle Categorical Missing Values (Mode)

# Select categorical (string) columns; boolean columns such as
# 'Online Classes Taken' are not covered by this step
categorical_cols = [name for name, dtype in df.dtypes if dtype == 'string']

for col_name in categorical_cols:
    # Get the most frequent value (mode) for the column.
    # groupBy keeps NULL as its own group, so NULL can come out as the "mode"
    mode_row = df.groupBy(col_name).count().orderBy(F.desc("count")).first()

    # Fill only when the most frequent value is not NULL
    if mode_row is not None and mode_row[0] is not None:
        mode_value = str(mode_row[0])  # Ensure it's a string
        df = df.na.fill({col_name: mode_value})
    else:
        # Also reached when NULL is the most frequent value (the case for Name)
        print(f"⚠️ Column {col_name} is empty or all null, skipping fillna.")

print("✅ Missing values for categorical columns handled successfully")

⚠️ Column Name is empty or all null, skipping fillna.
✅ Missing values for categorical columns handled successfully
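
`Name` is skipped because NULL is itself the most frequent group in that column (34 rows, versus at most 2 for any actual name). A hedged sketch of a null-aware mode follows, written in plain Python rather than Spark; `mode_ignoring_nulls` and the sample list are hypothetical.

```python
from collections import Counter

def mode_ignoring_nulls(values):
    """Return the most frequent non-None value, or None if every value is None."""
    counts = Counter(v for v in values if v is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical column where the null group outnumbers every real value
names = [None, None, None, "Andrea Frey", "Andrea Frey", "Amy Evans"]

naive_mode = Counter(names).most_common(1)[0][0]  # None: the null group wins
null_aware_mode = mode_ignoring_nulls(names)      # most frequent real value
```

In Spark, the equivalent idea would be to filter with `col(c).isNotNull()` before the `groupBy`. Whether imputing a mode is meaningful at all for an identifier-like column such as `Name` is a separate question; leaving it null, as the notebook does, is arguably the better outcome.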


In [None]:
# Validate Missing Values Removal
print("🔹 Missing values after cleaning:")
df.select([
    F.count(F.when(col(c).isNull(), c)).alias(c)
    for c in df.columns
]).show()

🔹 Missing values after cleaning:
+---------+----+------+--------------+-----------------+-------------+-------------------------+---------------+----------+-----------+--------------+--------------------+
|StudentID|Name|Gender|AttendanceRate|StudyHoursPerWeek|PreviousGrade|ExtracurricularActivities|ParentalSupport|FinalGrade|Study Hours|Attendance (%)|Online Classes Taken|
+---------+----+------+--------------+-----------------+-------------+-------------------------+---------------+----------+-----------+--------------+--------------------+
|        0|  34|     0|             0|                0|            0|                        0|              0|         0|          0|             0|                  25|
+---------+----+------+--------------+-----------------+-------------+-------------------------+---------------+----------+-----------+--------------+--------------------+



In [None]:
# Check Duplicates
duplicates_count = df.count() - df.dropDuplicates().count()
print(f"🔹 Number of duplicate rows: {duplicates_count}")

# Remove Duplicates
df = df.dropDuplicates()
print("✅ Duplicate rows removed")

print("📊 New Dataset Shape:", df.count(), "rows")

🔹 Number of duplicate rows: 0
✅ Duplicate rows removed
📊 New Dataset Shape: 1000 rows


In [None]:
# Detect Outliers using IQR (Numerical Columns)
for col_name in numerical_cols:
    Q1 = df.approxQuantile(col_name, [0.25], 0.01)[0]
    Q3 = df.approxQuantile(col_name, [0.75], 0.01)[0]
    IQR = Q3 - Q1

    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR

    df = df.withColumn(
        col_name,
        when(col(col_name) < lower, lower)
        .when(col(col_name) > upper, upper)
        .otherwise(col(col_name))
    )

print("✅ Outliers handled using IQR capping")

✅ Outliers handled using IQR capping
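
The capping rule used above can be sketched in plain Python as follows. This is an illustration only: `iqr_cap` and the sample grades are hypothetical, and the quartiles here are exact, whereas the notebook's `approxQuantile` call allows a relative error of 0.01.

```python
from statistics import quantiles

def iqr_cap(values, k=1.5):
    """Clamp values to [Q1 - k*IQR, Q3 + k*IQR]; rows are kept, not dropped."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    capped = [min(max(v, lower), upper) for v in values]
    return capped, (lower, upper)

# Hypothetical grade column with one implausible entry
grades = [70.0, 72.0, 75.0, 78.0, 80.0, 82.0, 85.0, 200.0]
capped, (lower, upper) = iqr_cap(grades)

# Only the extreme value changes; it is replaced by the upper bound
changed = [(before, after) for before, after in zip(grades, capped) if before != after]
```

Because capping keeps every row, the dataset stays at 1000 rows. Note that the loop in the notebook also caps `StudentID`, which is an identifier rather than a measurement.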


In [None]:
# Create target variable for classification
df = df.withColumn(
    "PerformanceLevel",
    when(col("FinalGrade") >= 85, "High")
    .when(col("FinalGrade") >= 70, "Medium")
    .otherwise("Low")
)

print("✅ Target variable (PerformanceLevel) created")

# NOTE:
# PerformanceLevel will be used as the target variable
# in the ML modeling phase (model.py)


✅ Target variable (PerformanceLevel) created
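
For reference, the binning rule can be written as an ordinary function. This is a plain-Python mirror of the `when`/`otherwise` chain, assuming non-null grades (which holds here because `FinalGrade` was median-imputed earlier); the function name and sample grades are hypothetical.

```python
def performance_level(final_grade):
    """Map a numeric grade to a level: >= 85 High, >= 70 Medium, otherwise Low."""
    if final_grade >= 85:
        return "High"
    if final_grade >= 70:
        return "Medium"
    return "Low"

# Boundary values fall into the higher bucket because both tests use >=
sample_grades = [92.0, 85.0, 84.9, 70.0, 55.0]
levels = [performance_level(g) for g in sample_grades]
```

The order of the conditions matters: the first matching branch wins, so a grade of 90 is labeled High even though it also satisfies the `>= 70` test.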


In [None]:
# Final Cleaned Dataset Validation
print("📊 Final Dataset Shape:")
print("Rows:", df.count())
print("Columns:", len(df.columns))

print("🔹 Final Schema:")
df.printSchema()

df.describe().show()

📊 Final Dataset Shape:
Rows: 1000
Columns: 13
🔹 Final Schema:
root
 |-- StudentID: double (nullable = false)
 |-- Name: string (nullable = true)
 |-- Gender: string (nullable = false)
 |-- AttendanceRate: double (nullable = false)
 |-- StudyHoursPerWeek: double (nullable = false)
 |-- PreviousGrade: double (nullable = false)
 |-- ExtracurricularActivities: double (nullable = false)
 |-- ParentalSupport: string (nullable = false)
 |-- FinalGrade: double (nullable = false)
 |-- Study Hours: double (nullable = false)
 |-- Attendance (%): double (nullable = false)
 |-- Online Classes Taken: boolean (nullable = true)
 |-- PerformanceLevel: string (nullable = false)

+-------+-----------------+--------------+------+----------------+-----------------+-----------------+-------------------------+---------------+----------------+------------------+------------------+----------------+
|summary|        StudentID|          Name|Gender|  AttendanceRate|StudyHoursPerWeek|    PreviousGrade|Extra

### Save cleaned dataset


In [None]:
df.toPandas().to_csv(
    "/content/student_performance_cleaned.csv",
    index=False
)

print("✅ Cleaned dataset saved successfully")

✅ Cleaned dataset saved successfully


## Statistical Analysis & Data Summaries



In [None]:
# Select numerical columns
numeric_cols = [c for c, t in df.dtypes if t in ('int', 'double')]

# Summary statistics
df.select(numeric_cols).describe().show()


+-------+-----------------+----------------+-----------------+-----------------+-------------------------+----------------+------------------+------------------+
|summary|        StudentID|  AttendanceRate|StudyHoursPerWeek|    PreviousGrade|ExtracurricularActivities|      FinalGrade|       Study Hours|    Attendance (%)|
+-------+-----------------+----------------+-----------------+-----------------+-------------------------+----------------+------------------+------------------+
|  count|             1000|            1000|             1000|             1000|                     1000|            1000|              1000|              1000|
|   mean|         5412.859|           85.61|           17.599|           77.612|                    1.498|          80.029| 2.433699999999999|            76.437|
| stddev|2600.123645923994|7.20039899906566|6.114702657414227|9.840238121421965|       1.0291040059464618|9.30164929889747|1.5028199419018162|15.086007378316458|
|    min|              1.0| 

### Numerical Feature Summary

In [None]:
# Descriptive statistics for numerical features
df.select(numeric_cols).describe().show()


+-------+-----------------+----------------+-----------------+-----------------+-------------------------+----------------+------------------+------------------+
|summary|        StudentID|  AttendanceRate|StudyHoursPerWeek|    PreviousGrade|ExtracurricularActivities|      FinalGrade|       Study Hours|    Attendance (%)|
+-------+-----------------+----------------+-----------------+-----------------+-------------------------+----------------+------------------+------------------+
|  count|             1000|            1000|             1000|             1000|                     1000|            1000|              1000|              1000|
|   mean|         5412.859|           85.61|           17.599|           77.612|                    1.498|          80.029| 2.433699999999999|            76.437|
| stddev|2600.123645923994|7.20039899906566|6.114702657414227|9.840238121421965|       1.0291040059464618|9.30164929889747|1.5028199419018162|15.086007378316458|
|    min|              1.0| 

### Categorical Feature Distributions

In [None]:
# Frequency counts for categorical features
for col_name in categorical_cols:
    print(f"\nDistribution of {col_name}:")
    df.groupBy(col_name).count().orderBy("count", ascending=False).show()



Distribution of Name:
+------------------+-----+
|              Name|count|
+------------------+-----+
|              NULL|   34|
|       Andrea Frey|    2|
|     Anthony Smith|    2|
|      Erica Miller|    2|
| Kimberly Harrison|    2|
|   Lonnie Williams|    1|
|Jessica Richardson|    1|
|    Alyssa Schmidt|    1|
|  Samantha Mendoza|    1|
|  Elizabeth Chavez|    1|
|       Sarah Young|    1|
| Christina Johnson|    1|
|       Gina Palmer|    1|
|     Michael Smith|    1|
|       Joseph Holt|    1|
|    Mrs. Jill Long|    1|
|      Heather Hill|    1|
|     Candace Kelly|    1|
|     Alan Williams|    1|
|         Amy Evans|    1|
+------------------+-----+
only showing top 20 rows

Distribution of Gender:
+------+-----+
|Gender|count|
+------+-----+
|  Male|  549|
|Female|  451|
+------+-----+


Distribution of ParentalSupport:
+---------------+-----+
|ParentalSupport|count|
+---------------+-----+
|           High|  367|
|         Medium|  328|
|            Low|  305|
+---------

### Correlation Analysis

In [None]:
# Convert numerical features to Pandas for correlation analysis
corr_matrix = df.select(numeric_cols).toPandas().corr()

corr_matrix.round(2)


| | StudentID | AttendanceRate | StudyHoursPerWeek | PreviousGrade | ExtracurricularActivities | FinalGrade | Study Hours | Attendance (%) |
|---|---|---|---|---|---|---|---|---|
| StudentID | 1.0 | 0.05 | -0.01 | -0.03 | -0.04 | 0.08 | 0.04 | -0.01 |
| AttendanceRate | 0.05 | 1.0 | 0.01 | 0.03 | -0.02 | -0.01 | -0.02 | 0.02 |
| StudyHoursPerWeek | -0.01 | 0.01 | 1.0 | -0.01 | 0.03 | 0.03 | -0.02 | 0.06 |
| PreviousGrade | -0.03 | 0.03 | -0.01 | 1.0 | 0.06 | 0.0 | -0.04 | 0.05 |
| ExtracurricularActivities | -0.04 | -0.02 | 0.03 | 0.06 | 1.0 | -0.03 | -0.04 | -0.01 |
| FinalGrade | 0.08 | -0.01 | 0.03 | 0.0 | -0.03 | 1.0 | 0.04 | 0.04 |
| Study Hours | 0.04 | -0.02 | -0.02 | -0.04 | -0.04 | 0.04 | 1.0 | -0.11 |
| Attendance (%) | -0.01 | 0.02 | 0.06 | 0.05 | -0.01 | 0.04 | -0.11 | 1.0 |


### Multicollinearity Assessment

In [None]:
high_corr_pairs = []

for i in range(len(numeric_cols)):
    for j in range(i + 1, len(numeric_cols)):
        corr_value = corr_matrix.iloc[i, j]
        if abs(corr_value) > 0.8:
            high_corr_pairs.append(
                (numeric_cols[i], numeric_cols[j], round(corr_value, 2))
            )

high_corr_pairs


[]
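
The empty list is expected: the largest off-diagonal correlation in the matrix is -0.11 (`Study Hours` vs `Attendance (%)`), far below the 0.8 threshold. For readers who want to see the check without pandas, here is a small self-contained sketch; `pearson`, `high_corr_pairs`, and the toy columns are hypothetical names and data.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def high_corr_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    pairs = []
    for (name_a, xa), (name_b, xb) in combinations(columns.items(), 2):
        r = pearson(xa, xb)
        if abs(r) > threshold:
            pairs.append((name_a, name_b, round(r, 2)))
    return pairs

toy = {
    "hours": [1.0, 2.0, 3.0, 4.0, 5.0],
    "hours_x2": [2.0, 4.0, 6.0, 8.0, 10.0],  # exact multiple of "hours"
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0],
}
flagged = high_corr_pairs(toy)  # only the redundant pair is reported
```

The same matrix also shows that no feature correlates with `FinalGrade` above 0.08 in absolute value, which is worth keeping in mind when reading the model scores later on.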

### Summary Tables


In [None]:
# Numeric summary table
numeric_summary = df.select(numeric_cols).describe()
numeric_summary.show()


+-------+-----------------+----------------+-----------------+-----------------+-------------------------+----------------+------------------+------------------+
|summary|        StudentID|  AttendanceRate|StudyHoursPerWeek|    PreviousGrade|ExtracurricularActivities|      FinalGrade|       Study Hours|    Attendance (%)|
+-------+-----------------+----------------+-----------------+-----------------+-------------------------+----------------+------------------+------------------+
|  count|             1000|            1000|             1000|             1000|                     1000|            1000|              1000|              1000|
|   mean|         5412.859|           85.61|           17.599|           77.612|                    1.498|          80.029| 2.433699999999999|            76.437|
| stddev|2600.123645923994|7.20039899906566|6.114702657414227|9.840238121421965|       1.0291040059464618|9.30164929889747|1.5028199419018162|15.086007378316458|
|    min|              1.0| 

In [None]:
from pyspark.sql.functions import countDistinct

# Number of unique categories per categorical feature
categorical_summary = df.select([
    countDistinct(c).alias(c) for c in categorical_cols
])

categorical_summary.show()


+----+------+---------------+
|Name|Gender|ParentalSupport|
+----+------+---------------+
| 962|     2|              3|
+----+------+---------------+
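
These counts differ from the earlier "Unique values per column" output (`Name` 963, `Gender` 3). One reason is that `distinct().count()` keeps NULL as a value while `countDistinct` ignores it; for `Gender`, the earlier figure was also taken before the nulls were imputed. A plain-Python sketch of the distinction, using a hypothetical sample column:

```python
def distinct_including_null(values):
    """Count unique values, treating None as a value (like distinct().count())."""
    return len(set(values))

def distinct_excluding_null(values):
    """Count unique non-None values (like countDistinct)."""
    return len({v for v in values if v is not None})

# Hypothetical column with two real categories and some missing entries
gender = ["Male", "Female", None, "Male", None]

with_null = distinct_including_null(gender)
without_null = distinct_excluding_null(gender)
```

For this sample the two counts are 3 and 2, the same pattern seen for `Gender` in the notebook.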



## Preprocessing and ML Modeling

In [None]:
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import col, count, when
# F is a module alias, not a name that can be imported from pyspark.sql.functions
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

data = spark.read.csv(
    '/content/student_performance_cleaned.csv',
    header=True,
    inferSchema=True
)

# Cast 'Online Classes Taken' to IntegerType (0 or 1)
data = data.withColumn("Online Classes Taken", col("Online Classes Taken").cast(IntegerType()))

# Identify numerical columns for robust null handling
# Exclude 'StudentID' if it's not a feature
numeric_feature_cols = [
    'AttendanceRate', 'StudyHoursPerWeek', 'PreviousGrade',
    'ExtracurricularActivities', 'FinalGrade', 'Study Hours', 'Attendance (%)'
]

# Handle missing values in 'Online Classes Taken'
mode_online_classes_row = data.groupBy("Online Classes Taken").count().orderBy(col("count").desc()).first()
fill_value_online_classes = 0 # Default if mode is null or not found
if mode_online_classes_row and mode_online_classes_row[0] is not None:
    fill_value_online_classes = mode_online_classes_row[0]
data = data.fillna({'Online Classes Taken': fill_value_online_classes})

# Handle missing values for other numerical feature columns (using median or 0 as fallback)
for col_name in numeric_feature_cols:
    median_value_row = data.approxQuantile(col_name, [0.5], 0.01)
    fill_value_numeric = median_value_row[0] if median_value_row else 0.0 # Use 0.0 if median not found
    data = data.fillna({col_name: fill_value_numeric})

# Handle missing values for categorical columns if any (using mode)
categorical_feature_cols_for_data = [
    col_name for col_name, dtype in data.dtypes
    if dtype == 'string' and col_name not in ['Name', 'PerformanceLevel']
]
for col_name in categorical_feature_cols_for_data:
    mode_row = data.groupBy(col_name).count().orderBy(F.desc("count")).first()
    fill_value_cat = mode_row[0] if mode_row and mode_row[0] is not None else "Unknown"
    data = data.fillna({col_name: fill_value_cat})

# Add a comprehensive check for nulls after all fillna operations
print("Checking for nulls in all relevant columns after fillna:")
data.select([count(when(col(c).isNull(), c)).alias(c) for c in data.columns if c not in ['StudentID', 'Name']]).show()

data.printSchema()

Checking for nulls in 'Online Classes Taken' after fillna:
+--------------------------+
|Online Classes Taken_nulls|
+--------------------------+
|                         0|
+--------------------------+

root
 |-- StudentID: double (nullable = true)
 |-- Name: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- AttendanceRate: double (nullable = true)
 |-- StudyHoursPerWeek: double (nullable = true)
 |-- PreviousGrade: double (nullable = true)
 |-- ExtracurricularActivities: double (nullable = true)
 |-- ParentalSupport: string (nullable = true)
 |-- FinalGrade: double (nullable = true)
 |-- Study Hours: double (nullable = true)
 |-- Attendance (%): double (nullable = true)
 |-- Online Classes Taken: integer (nullable = false)
 |-- PerformanceLevel: string (nullable = true)



In [None]:
train_df, test_df = data.randomSplit([0.8, 0.2], seed=42)

print("Train size:", train_df.count())
print("Test size:", test_df.count())

Train size: 838
Test size: 162
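
The split is 838/162 rather than exactly 800/200 because `randomSplit` assigns each row to a split independently at random, so the weights are expected proportions, not exact counts. The following plain-Python sketch shows the idea; `random_split` is a hypothetical helper and will not reproduce Spark's exact row assignment.

```python
import random

def random_split(rows, train_weight=0.8, seed=42):
    """Send each row to train with probability train_weight, otherwise to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(1000))
train_rows, test_rows = random_split(rows)

# Sizes add up to the total, but the train share only approximates 80%
train_share = len(train_rows) / len(rows)
```

Fixing the seed makes the split repeatable, which is what `seed=42` does in the notebook as well.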


In [None]:
label_indexer = StringIndexer(inputCol='PerformanceLevel', outputCol='label').fit(train_df)

train_df = label_indexer.transform(train_df)
test_df = label_indexer.transform(test_df)

In [None]:
categorical_cols = ['Gender', 'ParentalSupport']

indexers = [
    StringIndexer(inputCol=c, outputCol=f"{c}_idx", handleInvalid='keep')
    for c in categorical_cols
]

encoders = [
    OneHotEncoder(inputCol=f"{c}_idx", outputCol=f"{c}_ohe")
    for c in categorical_cols
]
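
As a rough mental model of what the indexer and encoder stages do, the sketch below imitates them in plain Python. It assumes the default settings as I understand them (labels indexed by descending frequency, the last one-hot slot dropped) and should be read as an approximation, not as Spark's implementation. All function names and sample labels are hypothetical.

```python
from collections import Counter

def fit_string_indexer(values):
    """Map labels to indices, most frequent first (ties broken by label)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: idx for idx, label in enumerate(ordered)}

def index_label(mapping, label):
    """Unseen labels get one extra index, similar to handleInvalid='keep'."""
    return mapping.get(label, len(mapping))

def one_hot_drop_last(index, size):
    """One-hot vector of length size - 1; the last index becomes all zeros."""
    vec = [0.0] * (size - 1)
    if index < size - 1:
        vec[index] = 1.0
    return vec

train_support = ["High", "High", "High", "Medium", "Medium", "Low"]
mapping = fit_string_indexer(train_support)
size = len(mapping) + 1  # one extra slot reserved for unseen labels

encoded_medium = one_hot_drop_last(index_label(mapping, "Medium"), size)
encoded_unseen = one_hot_drop_last(index_label(mapping, "Not reported"), size)
```

The practical effect of `handleInvalid='keep'` is that a category appearing only in the test split does not raise an error at transform time.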

In [None]:
# FinalGrade is left out on purpose: PerformanceLevel is derived from it
numeric_cols = [
    'AttendanceRate', 'StudyHoursPerWeek', 'PreviousGrade',
    'ExtracurricularActivities', 'Study Hours', 'Attendance (%)',
    'Online Classes Taken'
]

numeric_assembler = VectorAssembler(inputCols=numeric_cols, outputCol='numeric_features')

scaler = StandardScaler(inputCol='numeric_features', outputCol='scaled_numeric_features')
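
Since `StandardScaler` is created here without extra arguments, it runs with its defaults, which scale each feature to unit standard deviation without subtracting the mean (`withStd=True`, `withMean=False`). A small plain-Python sketch of that behavior, with a hypothetical helper and sample values:

```python
from statistics import stdev

def scale_to_unit_std(values):
    """Divide by the sample standard deviation; the mean is left untouched."""
    s = stdev(values)
    return [v / s for v in values]

# Hypothetical attendance values with a sample standard deviation of 10
attendance = [70.0, 80.0, 90.0]
scaled = scale_to_unit_std(attendance)

# The spread is now 1, but the values are not centered on zero
scaled_spread = stdev(scaled)
scaled_mean = sum(scaled) / len(scaled)
```

If centered features are wanted, `withMean=True` can be passed to the scaler; that produces dense vectors, which is not a concern for seven numeric columns.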

In [None]:
feature_cols = ['scaled_numeric_features'] + [f"{c}_ohe" for c in categorical_cols]

final_assembler = VectorAssembler(
    inputCols=feature_cols,
    outputCol='features'
)

In [None]:
log_reg = LogisticRegression(
    featuresCol='features',
    labelCol='label'
)

rand_forest = RandomForestClassifier(
    featuresCol='features',
    labelCol='label',
    seed=42
)

In [None]:
lr_param_grid = (
    ParamGridBuilder()
    .addGrid(log_reg.regParam, [0.01, 0.1])
    .addGrid(log_reg.elasticNetParam, [0.0, 0.5])
    .build()
)

rf_param_grid = (
    ParamGridBuilder()
    .addGrid(rand_forest.numTrees, [50, 100])
    .addGrid(rand_forest.maxDepth, [5, 10])
    .build()
)

In [None]:
common_stages = (
    indexers +
    encoders +
    [numeric_assembler, scaler, final_assembler]
)

lr_pipeline = Pipeline(stages=common_stages + [log_reg])
rf_pipeline = Pipeline(stages=common_stages + [rand_forest])

In [None]:
f1_evaluator = MulticlassClassificationEvaluator(
    labelCol='label',
    predictionCol='prediction',
    metricName='f1'
)

lr_cv = CrossValidator(
    estimator=lr_pipeline,
    estimatorParamMaps=lr_param_grid,
    evaluator=f1_evaluator,
    numFolds=5
)

rf_cv = CrossValidator(
    estimator=rf_pipeline,
    estimatorParamMaps=rf_param_grid,
    evaluator=f1_evaluator,
    numFolds=5
)

lr_model = lr_cv.fit(train_df)
rf_model = rf_cv.fit(train_df)
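
To get a feel for the cost of these two `fit` calls, the sketch below counts how many pipeline fits a grid search with k-fold cross-validation implies: one per parameter combination and fold, plus one final refit of the best combination on the full training set. `count_cv_fits` is a hypothetical helper; the grids are copied from the cells above.

```python
from itertools import product

def count_cv_fits(param_grid, num_folds):
    """Return (number of combinations, total fits including the final refit)."""
    combos = list(product(*param_grid.values()))
    return len(combos), len(combos) * num_folds + 1

lr_grid = {"regParam": [0.01, 0.1], "elasticNetParam": [0.0, 0.5]}
rf_grid = {"numTrees": [50, 100], "maxDepth": [5, 10]}

lr_combos, lr_fits = count_cv_fits(lr_grid, num_folds=5)
rf_combos, rf_fits = count_cv_fits(rf_grid, num_folds=5)
```

Because the estimator passed to `CrossValidator` is the whole pipeline, the indexers, encoders, and scaler are refitted inside each fold, so the validation fold does not influence the preprocessing statistics.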

In [None]:
accuracy_eval = MulticlassClassificationEvaluator(
    labelCol='label',
    predictionCol='prediction',
    metricName='accuracy'
)

f1_evaluator = MulticlassClassificationEvaluator(
    labelCol='label',
    predictionCol='prediction',
    metricName='f1'
)

lr_preds = lr_model.transform(test_df)
rf_preds = rf_model.transform(test_df)

print("Logistic Regression →",
      "Accuracy:", accuracy_eval.evaluate(lr_preds),
      "F1:", f1_evaluator.evaluate(lr_preds))

print("Random Forest →",
      "Accuracy:", accuracy_eval.evaluate(rf_preds),
      "F1:", f1_evaluator.evaluate(rf_preds))

Logistic Regression → Accuracy: 0.41358024691358025 F1: 0.3036709115140488
Random Forest → Accuracy: 0.43209876543209874 F1: 0.3767924288111021