# Title - Predictive Analytics for Student Retention: Leveraging PySpark for Higher Education Insights

## Team Champion.

1. Satyasrirama Siva Krishna Sanam.
2. Manoj Kumar Katakam.
3. Jayanth Uppara.
4. Bojanapally Santhoshini.

### Business Context.

Educational institutions increasingly use data analytics to understand student behavior and outcomes. This project applies PySpark machine learning to a dataset from a higher education institution. The dataset was assembled from several separate databases and covers undergraduate degrees in agronomy, design, education, nursing, journalism, management, social service, and technologies.

The dataset records information known at enrollment (academic path, demographics, and socio-economic factors) together with each student's academic performance at the end of the first and second semesters.

The primary objective of this project is to build classification models that predict whether a student will drop out, remain enrolled, or graduate. This is a three-class task (Dropout, Enrolled, Graduate) with a significant imbalance in class distribution. The imbalance complicates both training and evaluation: overall accuracy can look acceptable while a minority class is predicted poorly, so the models need techniques and metrics suited to skewed class distributions.

PySpark's distributed computing and machine learning libraries are used to train and evaluate these models. The goal is to give educational institutions a predictive tool that flags students at risk of dropping out early enough for staff to intervene and support them. Over time, such a tool has the potential to improve student retention and educational outcomes.

### Purpose of this Classification.


- The primary purpose of this classification is to predict students' academic outcomes. Using machine learning algorithms, the project builds models that identify students at risk of dropping out and students likely to succeed academically, so that institutions can intervene early and provide tailored support to students in need.
- Identifying at-risk students early also lets institutions target their resources where they have the most effect. Improving retention benefits individual students and supports the institution's overall success and reputation. The models can also help educational stakeholders understand which factors influence student outcomes and make informed decisions to improve the quality of education provided.

### 0. Import packages.

As always, the first step is to import the required packages.

In [151]:
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import col
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType
from pyspark.ml.feature import StandardScaler, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier, DecisionTreeClassifier, LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
import findspark
import warnings
warnings.filterwarnings("ignore")

np.random.seed(42)

### 1. Import Data.

Next, we start a Spark session and load the data so that we can use PySpark to analyze student dropout and graduation outcomes.

In [94]:
findspark.init()

spark = SparkSession.builder.master("local[4]").appName("ISM6562 Spark App01").enableHiveSupport().getOrCreate()

# Get the SparkContext object. It is the entry point to the lower-level Spark API
# and is created along with the SparkSession.
sc = spark.sparkContext

# Note: if multiple Spark sessions are running (for example from a previously run notebook),
# this session's web UI will be on a different port than the default (4040). One way to
# identify this port is with the following line. If only one Spark session is running,
# this will be 4040. If it is higher, other Spark sessions are still running.
spark_session_port = spark.sparkContext.uiWebUrl.split(":")[-1]
print("Spark Session WebUI Port: " + spark_session_port)

Spark Session WebUI Port: 4045
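
Splitting on `":"` works for URLs of the form shown above. As a slightly more defensive alternative, the port can be parsed with the standard library. This is a minimal sketch; the helper name `ui_port` and the example URL are hypothetical, not part of the notebook.

```python
from urllib.parse import urlsplit

def ui_port(ui_web_url: str) -> int:
    """Return the port number from a Spark UI URL such as 'http://host:4045'."""
    port = urlsplit(ui_web_url).port
    if port is None:
        raise ValueError(f"no explicit port in {ui_web_url!r}")
    return port

# Hypothetical URL of the kind spark.sparkContext.uiWebUrl returns
example_url = "http://localhost:4045"
print("Spark Session WebUI Port:", ui_port(example_url))
```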


In [95]:
sc.setLogLevel("ERROR")

In [96]:
# fetch dataset 
df = pd.read_csv('std_perf_data.csv')
df.head()

Unnamed: 0,Marital Status,Application mode,Application order,Course,Daytime/evening attendance,Previous qualification,Previous qualification (grade),Nacionality,Mother's qualification,Father's qualification,...,Curricular units 2nd sem (credited),Curricular units 2nd sem (enrolled),Curricular units 2nd sem (evaluations),Curricular units 2nd sem (approved),Curricular units 2nd sem (grade),Curricular units 2nd sem (without evaluations),Unemployment rate,Inflation rate,GDP,Target
0,1,17,5,171,1,1,122.0,1,19,12,...,0,0,0,0,0.0,0,10.8,1.4,1.74,Dropout
1,1,15,1,9254,1,1,160.0,1,1,3,...,0,6,6,6,13.666667,0,13.9,-0.3,0.79,Graduate
2,1,1,5,9070,1,1,122.0,1,37,37,...,0,6,0,0,0.0,0,10.8,1.4,1.74,Dropout
3,1,17,2,9773,1,1,122.0,1,38,37,...,0,6,10,5,12.4,0,9.4,-0.8,-3.12,Graduate
4,2,39,1,8014,0,1,100.0,1,37,38,...,0,6,6,6,13.0,0,13.9,-0.3,0.79,Graduate


### 2. Data Preprocessing.

In [97]:
df.dtypes

Marital Status                                      int64
Application mode                                    int64
Application order                                   int64
Course                                              int64
Daytime/evening attendance                          int64
Previous qualification                              int64
Previous qualification (grade)                    float64
Nacionality                                         int64
Mother's qualification                              int64
Father's qualification                              int64
Mother's occupation                                 int64
Father's occupation                                 int64
Admission grade                                   float64
Displaced                                           int64
Educational special needs                           int64
Debtor                                              int64
Tuition fees up to date                             int64
Gender        

In [98]:
df.isna().sum()

Marital Status                                    0
Application mode                                  0
Application order                                 0
Course                                            0
Daytime/evening attendance                        0
Previous qualification                            0
Previous qualification (grade)                    0
Nacionality                                       0
Mother's qualification                            0
Father's qualification                            0
Mother's occupation                               0
Father's occupation                               0
Admission grade                                   0
Displaced                                         0
Educational special needs                         0
Debtor                                            0
Tuition fees up to date                           0
Gender                                            0
Scholarship holder                                0
Age at enrol

In [99]:
unique_values_per_column = {}
for column in df.columns:
    unique_values_per_column[column] = df[column].unique()

# Print unique values for each column
for column, values in unique_values_per_column.items():
    print(f"Unique values in column '{column}':")
    print(values)
    print()

Unique values in column 'Marital Status':
[1 2 4 3 5 6]

Unique values in column 'Application mode':
[17 15  1 39 18 53 44 51 43  7 42 16  5  2 10 57 26 27]

Unique values in column 'Application order':
[5 1 2 4 3 6 9 0]

Unique values in column 'Course':
[ 171 9254 9070 9773 8014 9991 9500 9238 9670 9853 9085 9130 9556 9147
 9003   33 9119]

Unique values in column 'Daytime/evening attendance':
[1 0]

Unique values in column 'Previous qualification':
[ 1 19 42 39 10  3 40  2  4 12 43 15  6  9 38  5 14]

Unique values in column 'Previous qualification (grade)':
[122.  160.  100.  133.1 142.  119.  137.  138.  139.  136.  133.  110.
 149.  127.  135.  140.  125.  126.  151.  115.  150.  143.  130.  120.
 103.  154.  132.  167.  129.  141.  116.  148.  118.  106.  121.  114.
 124.  123.  113.  111.  131.  158.  146.  117.  153.  178.   99.  134.
 128.  170.  155.  145.  152.  112.  107.  156.  188.   96.  161.  166.
 147.  144.  102.  101.  180.  172.  105.  108.  165.  190.  162.  164.


In [100]:
df.rename(columns=lambda x: x.strip().lower().replace(' ', '_'), inplace=True)

In [101]:
# Renaming a column
df.rename(columns={'nacionality': 'nationality'}, inplace=True)

In [102]:
df.columns

Index(['marital_status', 'application_mode', 'application_order', 'course',
       'daytime/evening_attendance', 'previous_qualification',
       'previous_qualification_(grade)', 'nationality',
       'mother's_qualification', 'father's_qualification',
       'mother's_occupation', 'father's_occupation', 'admission_grade',
       'displaced', 'educational_special_needs', 'debtor',
       'tuition_fees_up_to_date', 'gender', 'scholarship_holder',
       'age_at_enrollment', 'international',
       'curricular_units_1st_sem_(credited)',
       'curricular_units_1st_sem_(enrolled)',
       'curricular_units_1st_sem_(evaluations)',
       'curricular_units_1st_sem_(approved)',
       'curricular_units_1st_sem_(grade)',
       'curricular_units_1st_sem_(without_evaluations)',
       'curricular_units_2nd_sem_(credited)',
       'curricular_units_2nd_sem_(enrolled)',
       'curricular_units_2nd_sem_(evaluations)',
       'curricular_units_2nd_sem_(approved)',
       'curricular_units_2nd_s
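
The renaming above replaces spaces only, so characters such as `/`, `'`, `(` and `)` remain in the column names and have to be escaped or quoted wherever the names are written out. A stricter normalizer is one option. The sketch below is an illustration only: `normalize_column` is a hypothetical helper, and adopting it would change the column names that the later cells refer to.

```python
import re

def normalize_column(name: str) -> str:
    """Lowercase, trim, and replace every run of non-alphanumeric characters with '_'."""
    cleaned = re.sub(r"[^0-9a-z]+", "_", name.strip().lower())
    return cleaned.strip("_")

# Three of the original column names from the dataset
original = [
    "Daytime/evening attendance",
    "Mother's qualification",
    "Curricular units 1st sem (grade)",
]
for name in original:
    print(f"{name!r} -> {normalize_column(name)!r}")
```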

In [103]:
# df is our DataFrame; after the renaming above, the label column is 'target'
target_mapping = {'Dropout': 0, 'Enrolled': 1, 'Graduate': 2}
df['target'] = df['target'].map(target_mapping)

# Print the DataFrame to verify the mapping
print(df)


      marital_status  application_mode  application_order  course  \
0                  1                17                  5     171   
1                  1                15                  1    9254   
2                  1                 1                  5    9070   
3                  1                17                  2    9773   
4                  2                39                  1    8014   
...              ...               ...                ...     ...   
4419               1                 1                  6    9773   
4420               1                 1                  2    9773   
4421               1                 1                  1    9500   
4422               1                 1                  1    9147   
4423               1                10                  1    9773   

      daytime/evening_attendance  previous_qualification  \
0                              1                       1   
1                              1                   
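
Two quick checks are worth making after encoding the label: that every raw label was covered by the mapping (pandas `Series.map` yields `NaN` for labels missing from the dictionary), and how the rows are distributed across the three classes. The sketch below shows both checks in plain Python on hypothetical labels; on the real DataFrame the same information is available from `df['target'].isna().sum()` and `df['target'].value_counts(normalize=True)`.

```python
from collections import Counter

target_mapping = {"Dropout": 0, "Enrolled": 1, "Graduate": 2}

# Hypothetical raw labels; the last one is deliberately misspelled
raw_labels = ["Dropout", "Graduate", "Graduate", "Enrolled", "Graduate", "graduate"]

# dict.get returns None for unknown labels, mirroring the NaN that Series.map produces
encoded = [target_mapping.get(label) for label in raw_labels]
unmapped = sorted({label for label, code in zip(raw_labels, encoded) if code is None})
print("Unmapped labels:", unmapped)

# Class distribution over the rows that were mapped successfully
counts = Counter(code for code in encoded if code is not None)
total = sum(counts.values())
for code in sorted(counts):
    print(f"class {code}: {counts[code]} rows ({counts[code] / total:.0%})")
```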

In [104]:
# Convert pandas DataFrame to Spark DataFrame
df_spark = spark.createDataFrame(df)

# Show the Spark DataFrame
df_spark.show()

+--------------+----------------+-----------------+------+--------------------------+----------------------+------------------------------+-----------+----------------------+----------------------+-------------------+-------------------+---------------+---------+-------------------------+------+-----------------------+------+------------------+-----------------+-------------+-----------------------------------+-----------------------------------+--------------------------------------+-----------------------------------+--------------------------------+----------------------------------------------+-----------------------------------+-----------------------------------+--------------------------------------+-----------------------------------+--------------------------------+----------------------------------------------+-----------------+--------------+-----+------+
|marital_status|application_mode|application_order|course|daytime/evening_attendance|previous_qualification|previous_qual

In [105]:
# Print the schema of the DataFrame
df_spark.printSchema()

root
 |-- marital_status: long (nullable = true)
 |-- application_mode: long (nullable = true)
 |-- application_order: long (nullable = true)
 |-- course: long (nullable = true)
 |-- daytime/evening_attendance: long (nullable = true)
 |-- previous_qualification: long (nullable = true)
 |-- previous_qualification_(grade): double (nullable = true)
 |-- nationality: long (nullable = true)
 |-- mother's_qualification: long (nullable = true)
 |-- father's_qualification: long (nullable = true)
 |-- mother's_occupation: long (nullable = true)
 |-- father's_occupation: long (nullable = true)
 |-- admission_grade: double (nullable = true)
 |-- displaced: long (nullable = true)
 |-- educational_special_needs: long (nullable = true)
 |-- debtor: long (nullable = true)
 |-- tuition_fees_up_to_date: long (nullable = true)
 |-- gender: long (nullable = true)
 |-- scholarship_holder: long (nullable = true)
 |-- age_at_enrollment: long (nullable = true)
 |-- international: long (nullable = true)
 |-- 

### 3. Make Schema changes.

1. Application mode/order:
- Lack of Relevance to Academic Performance: The method or order of application does not inherently reflect a student's academic abilities or potential for success in coursework.
- Administrative or Procedural Information: These details are primarily administrative and procedural, reflecting preferences or processes in the application procedure rather than academic capabilities.
- Limited Predictive Power: Application mode/order is unlikely to provide meaningful insights into a student's likelihood of academic success compared to other more directly relevant factors.
- Simplicity and Model Interpretability: Removing this variable simplifies the model and enhances its interpretability by focusing on factors more directly linked to academic performance.
- Resource Optimization: Excluding this column reduces computational resources needed for analysis and model training, allocating them more efficiently towards more impactful predictors.
2. Displaced:
- Lack of Relevance to Academic Performance: Being displaced, whether temporarily or permanently, does not necessarily correlate with academic abilities or potential for success in coursework.
- Administrative or Procedural Information: Displacement typically reflects changes in a student's living situation rather than factors directly related to academic performance.
- Limited Predictive Power: Displacement may not provide meaningful insights into a student's likelihood of academic success compared to other more directly relevant factors.
- Simplicity and Model Interpretability: Removing this variable simplifies the model and enhances its interpretability by focusing on factors more directly linked to academic performance.
- Resource Optimization: Excluding this column reduces computational resources needed for analysis and model training, allocating them more efficiently towards more impactful predictors.
3. Curricular units: We drop all curricular-unit columns except the enrolled and grade columns for the 1st and 2nd semesters. This is intended to improve the interpretability of the analysis without giving up predictive power: the number of curricular units enrolled in and the average grade achieved in each semester directly capture a student's academic engagement and performance.
- This streamlined approach not only simplifies the model but also reduces noise and redundancy within the dataset, ensuring that we prioritize the most relevant factors for predicting academic performance. Additionally, emphasizing these specific metrics aligns with common educational practices and facilitates easier integration of our model's insights into existing systems and processes. 
- By optimizing our resources and honing in on the most actionable predictors, we can develop a more effective and efficient framework for understanding and predicting student outcomes.
4. Dropping the "international" column is warranted as international status may not directly influence academic performance. Simplifying the model by removing this variable enhances interpretability and ensures focus on factors more closely tied to academic success, while also optimizing computational resources for more impactful predictors.
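
The arguments above are qualitative. Before dropping a column, it can be checked empirically whether the outcome rate differs across its values. The sketch below computes the dropout share per group in plain Python; the rows are hypothetical and not taken from the dataset, and `dropout_rate_by_group` is an illustrative helper rather than part of the notebook.

```python
from collections import defaultdict

def dropout_rate_by_group(rows, column):
    """Return {group value: share of rows with target == 0 (Dropout)}."""
    totals = defaultdict(int)
    dropouts = defaultdict(int)
    for row in rows:
        totals[row[column]] += 1
        if row["target"] == 0:
            dropouts[row[column]] += 1
    return {value: dropouts[value] / totals[value] for value in totals}

# Hypothetical rows, not taken from the real dataset
rows = [
    {"displaced": 1, "target": 0},
    {"displaced": 1, "target": 2},
    {"displaced": 1, "target": 2},
    {"displaced": 1, "target": 1},
    {"displaced": 0, "target": 0},
    {"displaced": 0, "target": 0},
    {"displaced": 0, "target": 2},
    {"displaced": 0, "target": 1},
]
rates = dropout_rate_by_group(rows, "displaced")
print(rates)
```

A large gap between groups would suggest the column carries signal and is worth keeping; similar rates would support dropping it.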

In [106]:
# Define the list of columns to convert to integer
integer_columns = [
    'marital_status', 'course', 'daytime/evening_attendance', 'previous_qualification',
    'nationality', 'mother\'s_qualification', 'father\'s_qualification',
    'mother\'s_occupation', 'father\'s_occupation', 'educational_special_needs',
    'debtor', 'tuition_fees_up_to_date', 'gender', 'scholarship_holder',
    'age_at_enrollment', 'target'
]

# Convert columns to integer
for column in integer_columns:
    df_spark = df_spark.withColumn(column, col(column).cast('integer'))

# Print the schema to verify the changes
df_spark.printSchema()

root
 |-- marital_status: integer (nullable = true)
 |-- application_mode: long (nullable = true)
 |-- application_order: long (nullable = true)
 |-- course: integer (nullable = true)
 |-- daytime/evening_attendance: integer (nullable = true)
 |-- previous_qualification: integer (nullable = true)
 |-- previous_qualification_(grade): double (nullable = true)
 |-- nationality: integer (nullable = true)
 |-- mother's_qualification: integer (nullable = true)
 |-- father's_qualification: integer (nullable = true)
 |-- mother's_occupation: integer (nullable = true)
 |-- father's_occupation: integer (nullable = true)
 |-- admission_grade: double (nullable = true)
 |-- displaced: long (nullable = true)
 |-- educational_special_needs: integer (nullable = true)
 |-- debtor: integer (nullable = true)
 |-- tuition_fees_up_to_date: integer (nullable = true)
 |-- gender: integer (nullable = true)
 |-- scholarship_holder: integer (nullable = true)
 |-- age_at_enrollment: integer (nullable = true)
 |-

In [107]:
# Drop irrelevant columns
columns_to_drop = ['application_mode','application_order','displaced','international','curricular_units_1st_sem_(credited)','curricular_units_1st_sem_(evaluations)','curricular_units_1st_sem_(approved)','curricular_units_1st_sem_(without_evaluations)','curricular_units_2nd_sem_(credited)','curricular_units_2nd_sem_(evaluations)','curricular_units_2nd_sem_(approved)','curricular_units_2nd_sem_(without_evaluations)']
df_spark = df_spark.drop(*columns_to_drop)

In [108]:
df_spark.show()

+--------------+------+--------------------------+----------------------+------------------------------+-----------+----------------------+----------------------+-------------------+-------------------+---------------+-------------------------+------+-----------------------+------+------------------+-----------------+-----------------------------------+--------------------------------+-----------------------------------+--------------------------------+-----------------+--------------+-----+------+
|marital_status|course|daytime/evening_attendance|previous_qualification|previous_qualification_(grade)|nationality|mother's_qualification|father's_qualification|mother's_occupation|father's_occupation|admission_grade|educational_special_needs|debtor|tuition_fees_up_to_date|gender|scholarship_holder|age_at_enrollment|curricular_units_1st_sem_(enrolled)|curricular_units_1st_sem_(grade)|curricular_units_2nd_sem_(enrolled)|curricular_units_2nd_sem_(grade)|unemployment_rate|inflation_rate|  gdp|

In [110]:
df_spark.printSchema()

root
 |-- marital_status: integer (nullable = true)
 |-- course: integer (nullable = true)
 |-- daytime/evening_attendance: integer (nullable = true)
 |-- previous_qualification: integer (nullable = true)
 |-- previous_qualification_(grade): double (nullable = true)
 |-- nationality: integer (nullable = true)
 |-- mother's_qualification: integer (nullable = true)
 |-- father's_qualification: integer (nullable = true)
 |-- mother's_occupation: integer (nullable = true)
 |-- father's_occupation: integer (nullable = true)
 |-- admission_grade: double (nullable = true)
 |-- educational_special_needs: integer (nullable = true)
 |-- debtor: integer (nullable = true)
 |-- tuition_fees_up_to_date: integer (nullable = true)
 |-- gender: integer (nullable = true)
 |-- scholarship_holder: integer (nullable = true)
 |-- age_at_enrollment: integer (nullable = true)
 |-- curricular_units_1st_sem_(enrolled): long (nullable = true)
 |-- curricular_units_1st_sem_(grade): double (nullable = true)
 |-- c

### 4. Train and Test Split.

In [111]:
# Split the DataFrame into training and testing sets
train_df, test_df = df_spark.randomSplit([0.8, 0.2], seed=42)

# Define the features
feature_columns = [
    'marital_status', 'course', 'daytime/evening_attendance', 'previous_qualification',
    'previous_qualification_(grade)', 'nationality', 'mother\'s_qualification',
    'father\'s_qualification', 'mother\'s_occupation', 'father\'s_occupation',
    'admission_grade', 'educational_special_needs', 'debtor', 'tuition_fees_up_to_date',
    'gender', 'scholarship_holder', 'age_at_enrollment', 'curricular_units_1st_sem_(enrolled)',
    'curricular_units_1st_sem_(grade)', 'curricular_units_2nd_sem_(enrolled)',
    'curricular_units_2nd_sem_(grade)', 'unemployment_rate', 'inflation_rate', 'gdp'
]

# Assemble the features into a single vector
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")

# Scale the features
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
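
The Business Context notes that the classes are imbalanced. One common way to account for this is to weight each row inversely to its class frequency. The sketch below computes such weights in plain Python on hypothetical label counts; it is not something this notebook does. Spark ML classifiers such as `LogisticRegression` (and, in recent Spark releases, the tree-based classifiers) accept a `weightCol` parameter, so attaching these weights as a column to `train_df` would be one way to use them.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by N / (K * n_c), so rarer classes get larger weights."""
    counts = Counter(labels)
    n_rows, n_classes = len(labels), len(counts)
    return {cls: n_rows / (n_classes * n) for cls, n in counts.items()}

# Hypothetical label counts: 3 Dropout (0), 2 Enrolled (1), 5 Graduate (2)
labels = [0] * 3 + [1] * 2 + [2] * 5
weights = balanced_class_weights(labels)
for cls in sorted(weights):
    print(f"class {cls}: weight {weights[cls]:.3f}")
```

With this formula the weights sum to the number of rows, so the overall scale of the loss is unchanged.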

### 5. Model Selection.

This project uses three classification models:

1. Random Forest.
2. Logistic Regression.
3. Decision Tree.

The rationale for choosing these models is as follows:

- Diverse Model Performance: Each of these models offers distinct advantages in terms of performance. Random forest excels in handling complex datasets with high dimensionality and can capture non-linear relationships effectively. Logistic regression, on the other hand, is well-suited for classification tasks and provides interpretable results, making it useful for understanding the impact of individual features. Decision trees are intuitive to understand and visualize, making them valuable for uncovering decision rules and identifying important predictors.
- Ensemble Learning with Random Forest: Random forest leverages ensemble learning, which combines multiple decision trees to improve predictive accuracy and reduce overfitting. This approach is particularly advantageous for handling noisy or heterogeneous data, which is often encountered in educational datasets sourced from disparate databases.
- Interpretability of Logistic Regression: Logistic regression provides straightforward interpretations, enabling stakeholders to understand the relative importance of each predictor in predicting student outcomes. This transparency is crucial in educational settings, where policymakers and educators require actionable insights to implement targeted interventions.
- Simplicity and Intuition of Decision Trees: Decision trees offer a simple yet powerful framework for modeling decision rules based on input features. These models are easy to interpret and explain, making them valuable for communicating insights to stakeholders who may not have expertise in machine learning.
- Complementary Nature of Models: By employing a combination of random forest, logistic regression, and decision tree models, this project can leverage the strengths of each approach while mitigating their respective weaknesses. Ensemble learning with random forest provides robustness and predictive accuracy, logistic regression offers interpretability, and decision trees provide intuitive decision rules.

In summary, the selection of random forest, logistic regression, and decision tree models is grounded in their complementary strengths, which collectively contribute to the project's objective of accurately predicting student outcomes and providing actionable insights for educational stakeholders.
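
To make the interpretability point for logistic regression concrete: with three classes, a multinomial logistic regression computes one linear score per class and turns the scores into probabilities with a softmax. The sketch below illustrates this in plain Python. All coefficients, intercepts, and feature values are hypothetical and chosen only for illustration; they are not fitted values from this project.

```python
import math

def softmax(scores):
    """Convert per-class linear scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical coefficients (one row per class) and intercepts for two features
coefficients = [[0.8, -1.2], [0.1, 0.0], [-0.9, 1.1]]
intercepts = [0.2, -0.1, 0.4]
features = [0.5, 1.5]

# One linear score per class: w . x + b
scores = [sum(w * x for w, x in zip(row, features)) + b
          for row, b in zip(coefficients, intercepts)]
probabilities = softmax(scores)
predicted = max(range(len(probabilities)), key=probabilities.__getitem__)
print("P(Dropout, Enrolled, Graduate) =", [round(p, 3) for p in probabilities])
print("predicted class:", predicted)
```

A positive coefficient raises the score of its class as the feature grows, which is what makes the fitted coefficients readable.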

### 5.1. Random Forest Classifier.

In [113]:
# Define the classifier
rf = RandomForestClassifier(labelCol="target", featuresCol="scaled_features")

# Create a pipeline
rf_pipeline = Pipeline(stages=[assembler, scaler, rf])

# Train the model
rf_model = rf_pipeline.fit(train_df)

# Make predictions
rf_predictions = rf_model.transform(test_df)

# Evaluate the model
evaluator = MulticlassClassificationEvaluator(labelCol="target", predictionCol="prediction", metricName="accuracy")
rf_accuracy = evaluator.evaluate(rf_predictions)
print("Accuracy:", rf_accuracy)

Accuracy: 0.7185273159144893


In [130]:
# Evaluate the model
evaluator = MulticlassClassificationEvaluator(labelCol="target", predictionCol="prediction", metricName="weightedPrecision")
rf_precision = evaluator.evaluate(rf_predictions)
print("Precision:", rf_precision)

Precision: 0.7688811095706006


In [132]:
# Evaluate the model
evaluator = MulticlassClassificationEvaluator(labelCol="target", predictionCol="prediction", metricName="weightedRecall")
rf_recall = evaluator.evaluate(rf_predictions)
print("Recall:", rf_recall)

Recall: 0.7185273159144894


In [133]:
# Evaluate the model
evaluator = MulticlassClassificationEvaluator(labelCol="target", predictionCol="prediction", metricName="f1")
rf_f1 = evaluator.evaluate(rf_predictions)
print("F1 Score:", rf_f1)

F1 Score: 0.6537000271832522
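
The four evaluator metrics used above are related. `weightedRecall` averages per-class recall using each class's share of the true labels as weights, which algebraically reduces to accuracy. That is why the Recall and Accuracy outputs above match. The sketch below reproduces these definitions in plain Python on hypothetical labels (not this notebook's predictions); `weighted_metrics` is an illustrative helper name.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Return accuracy plus precision, recall, and F1 averaged with true-class frequencies as weights."""
    n = len(y_true)
    support = Counter(y_true)      # rows per true class
    predicted = Counter(y_pred)    # rows per predicted class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    accuracy = sum(correct.values()) / n
    precision = recall = f1 = 0.0
    for cls, n_cls in support.items():
        p = correct[cls] / predicted[cls] if predicted[cls] else 0.0
        r = correct[cls] / n_cls
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = n_cls / n
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return accuracy, precision, recall, f1

# Hypothetical labels and predictions for three classes
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 0, 2, 2, 1, 2, 2, 2, 2, 0]
accuracy, precision, recall, f1 = weighted_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Because weighted recall carries no information beyond accuracy, weighted precision and F1 are the more informative companions to report.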


### 5.2. Hyperparameter tuning for Random Forest Classifier.

Hyperparameter tuning for the random forest model is essential for several reasons:

- Optimizing Model Performance: Tuning hyperparameters allows us to find the optimal configuration that maximizes the performance of the random forest model. By systematically adjusting parameters such as the number of trees, tree depth, and feature subset size, we can fine-tune the model to achieve better accuracy, precision, recall, and F1 score.
- Preventing Overfitting: Random forest models are susceptible to overfitting, especially when trained on complex or noisy datasets. Hyperparameter tuning enables us to control the complexity of the model and prevent it from memorizing the training data, thereby improving its generalization ability on unseen data.
- Balancing Bias and Variance: The choice of hyperparameters affects the bias-variance tradeoff of the random forest model. Through tuning, we can strike a balance between bias (underfitting) and variance (overfitting), ensuring that the model generalizes well to new data while capturing important patterns and relationships present in the training set.
- Enhancing Model Interpretability: Certain hyperparameters, such as the maximum depth of trees or the number of features considered at each split, influence the interpretability of the random forest model. By tuning these parameters, we can create models that are not only accurate but also easier to interpret and understand by stakeholders.
- Adapting to Dataset Characteristics: Different datasets may require different configurations of hyperparameters to achieve optimal performance. By tuning hyperparameters, we can adapt the random forest model to the specific characteristics of our dataset, such as the number of features, the class distribution, and the presence of outliers or missing values.

Overall, hyperparameter tuning for the random forest model is crucial for maximizing its predictive power, improving generalization performance, and ensuring that the model is well-suited to the characteristics of the dataset and the objectives of the project.

In [115]:
# Define the Random Forest classifier
rf = RandomForestClassifier(labelCol="target", featuresCol="scaled_features")

# Create a pipeline with Random Forest classifier
pipeline_rf_ht = Pipeline(stages=[assembler, scaler, rf])

# Define the parameter grid
paramGrid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [10, 20, 30]) \
    .addGrid(rf.maxDepth, [5, 10, 15]) \
    .build()

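# Use an explicit accuracy evaluator so that model selection does not depend on
# whichever metric `evaluator` was last set to in an earlier cell
evaluator = MulticlassClassificationEvaluator(labelCol="target", predictionCol="prediction", metricName="accuracy")

# Create a cross-validator
crossval = CrossValidator(estimator=pipeline_rf_ht,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=3)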

# Fit the cross-validator to the training data
cvModel = crossval.fit(train_df)

# Make predictions using the best model found by cross-validation
predictions_cv = cvModel.transform(test_df)

# Evaluate the best model
accuracy_rf_ht = evaluator.evaluate(predictions_cv)
print("Random Forest with Hyperparameter Tuning Accuracy:", accuracy_rf_ht)

Random Forest with Hyperparameter Tuning Accuracy: 0.7399049881235155
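
The cost of this search is easy to underestimate. Each combination in the grid is fitted once per fold, and the best combination is then refitted on the full training set. The sketch below counts the fits for the grid and fold count used in the tuning cell, using plain Python only.

```python
from itertools import product

# Values copied from the parameter grid in the tuning cell
num_trees_values = [10, 20, 30]
max_depth_values = [5, 10, 15]
num_folds = 3

grid = list(product(num_trees_values, max_depth_values))
cv_fits = len(grid) * num_folds   # one fit per combination per fold
total_fits = cv_fits + 1          # plus the final refit of the best combination on all training data
print(f"{len(grid)} combinations x {num_folds} folds = {cv_fits} fits, "
      f"{total_fits} including the final refit")
```

Adding a third hyperparameter with three values would triple this count, so it is worth widening the grid selectively.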


In [135]:
# Define the evaluator for the weighted F1-score
evaluator_multi = MulticlassClassificationEvaluator(labelCol="target", predictionCol="prediction",
                                                     metricName="f1")

# Evaluate the best model
f1_score_rf_ht = evaluator_multi.evaluate(predictions_cv)

print("F1-score:", f1_score_rf_ht)

F1-score: 0.7134184846876307


In [136]:
# Define the evaluator for weighted precision
evaluator_multi = MulticlassClassificationEvaluator(labelCol="target", predictionCol="prediction",
                                                     metricName="weightedPrecision")

# Reuse predictions_cv from the tuning cell; refitting the cross-validator here
# is unnecessary and only repeats the full grid search

# Evaluate the best model
precision_rf_ht = evaluator_multi.evaluate(predictions_cv)

print("Precision:", precision_rf_ht)

Precision: 0.7253352500564898


In [137]:
# Define the evaluator for weighted recall
evaluator_multi = MulticlassClassificationEvaluator(labelCol="target", predictionCol="prediction",
                                                     metricName="weightedRecall")

# Evaluate the best model
recall_rf_ht = evaluator_multi.evaluate(predictions_cv)

print("Recall:", recall_rf_ht)

Recall: 0.7399049881235155
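
To see the effect of tuning at a glance, the baseline and tuned random forest metrics reported above can be put side by side. The sketch below uses the printed outputs rounded to four decimals. Both sets come from a single test split, so small differences should be read with caution.

```python
# Metrics copied from the outputs above, rounded to four decimals
baseline = {"accuracy": 0.7185, "precision": 0.7689, "recall": 0.7185, "f1": 0.6537}
tuned = {"accuracy": 0.7399, "precision": 0.7253, "recall": 0.7399, "f1": 0.7134}

# Positive delta means the tuned model scored higher
delta = {name: tuned[name] - baseline[name] for name in baseline}
for name, change in delta.items():
    print(f"{name:<10} baseline={baseline[name]:.4f} tuned={tuned[name]:.4f} change={change:+.4f}")
```

Accuracy, recall, and F1 improve after tuning, while weighted precision is lower than for the baseline model.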


### 5.3 Logistic Regression.

In [120]:
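# Define the Logistic Regression model
# Note: featuresCol='features' means the scaler's 'scaled_features' output is not used here;
# LogisticRegression standardizes features internally by default (standardization=True)
lr = LogisticRegression(featuresCol='features', labelCol='target', maxIter=10)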

# Create a Pipeline
lr_pipeline = Pipeline(stages=[assembler,scaler, lr])

# Train the model
lr_model = lr_pipeline.fit(train_df)

# Make predictions on the test data
lr_predictions = lr_model.transform(test_df)

# Evaluate the model
evaluator = MulticlassClassificationEvaluator(labelCol="target", predictionCol="prediction", metricName="accuracy")
lr_accuracy = evaluator.evaluate(lr_predictions)

print("Test Accuracy = %g" % (lr_accuracy))

Test Accuracy = 0.716152


In [122]:
# Evaluate the model
evaluator = MulticlassClassificationEvaluator(labelCol="target", predictionCol="prediction", metricName="weightedPrecision")
lr_precision = evaluator.evaluate(lr_predictions)

print("Test Precision = %g" % (lr_precision))

Test Precision = 0.684317


In [145]:
# Evaluate the model
evaluator = MulticlassClassificationEvaluator(labelCol="target", predictionCol="prediction", metricName="weightedRecall")
lr_recall = evaluator.evaluate(lr_predictions)

print("Test Recall = %g" % (lr_recall))

Test Recall = 0.716152


In [146]:
# Evaluate the model
evaluator = MulticlassClassificationEvaluator(labelCol="target", predictionCol="prediction", metricName="f1")
lr_f1 = evaluator.evaluate(lr_predictions)

print("Test F1 Score = %g" % (lr_f1))

Test F1 Score = 0.671857


### 5.4. Decision Tree Classifier.

In [125]:
# Define the Decision Tree classifier
dt = DecisionTreeClassifier(featuresCol='features', labelCol='target', maxDepth=5)

# Create a Pipeline
dt_pipeline = Pipeline(stages=[assembler, scaler, dt])

# Train the model
dt_model = dt_pipeline.fit(train_df)

# Make predictions on the test data
dt_predictions = dt_model.transform(test_df)

# Evaluate the model
evaluator = MulticlassClassificationEvaluator(labelCol="target", predictionCol="prediction", metricName="accuracy")
dt_accuracy = evaluator.evaluate(dt_predictions)

print("Test Accuracy = %g" % (dt_accuracy))

Test Accuracy = 0.714964


In [126]:
# Evaluate the model
evaluator = MulticlassClassificationEvaluator(labelCol="target", predictionCol="prediction", metricName="weightedRecall")
dt_recall = evaluator.evaluate(dt_predictions)

print("Test Recall = %g" % (dt_recall))

Test Recall = 0.714964


In [129]:
# Evaluate the model
evaluator = MulticlassClassificationEvaluator(labelCol="target", predictionCol="prediction", metricName="weightedPrecision")
dt_precision = evaluator.evaluate(dt_predictions)

print("Test Precision = %g" % (dt_precision))

Test Precision = 0.706797


In [143]:
# Evaluate the model
evaluator = MulticlassClassificationEvaluator(labelCol="target", predictionCol="prediction", metricName="f1")
dt_f1_score = evaluator.evaluate(dt_predictions)

print("Test F1 Score = %g" % (dt_f1_score))

Test F1 Score = 0.686995


In [149]:
# Collect metrics for Logistic Regression model
lr_metrics = {
    "model": "Logistic Regression",
    "accuracy": lr_accuracy,
    "precision": lr_precision,
    "recall": lr_recall,
    "F1 Score": lr_f1
}

# Collect metrics for Decision Tree model
dt_metrics = {
    "model": "Decision Tree",
    "accuracy": dt_accuracy,
    "precision": dt_precision,
    "recall": dt_recall,
    "F1 Score": dt_f1_score
}

# Collect metrics for Random Forest model
rf_metrics = {
    "model": "Random Forest",
    "accuracy": rf_accuracy,
    "precision": rf_precision,
    "recall": rf_recall,
    "F1 Score": rf_f1
}

# Collect metrics for Random Forest after hyperparameter tuning
rf_ht_metrics = {
    "model": "Random Forest Ht",
    "accuracy": accuracy_rf_ht,
    "precision": precision_rf_ht,
    "recall": recall_rf_ht,
    "F1 Score": f1_score_rf_ht
}

# Create Row objects from the collected metrics
rows = [Row(**lr_metrics), Row(**dt_metrics), Row(**rf_metrics), Row(**rf_ht_metrics)]

# Create a DataFrame from the Row objects
metrics_df = spark.createDataFrame(rows)

# Show the metrics DataFrame
metrics_df.show()


+-------------------+------------------+------------------+------------------+------------------+
|              model|          accuracy|         precision|            recall|          F1 Score|
+-------------------+------------------+------------------+------------------+------------------+
|Logistic Regression|0.7161520190023754|  0.68431679221353|0.7161520190023754|0.6718571279130134|
|      Decision Tree|0.7149643705463183|0.7067967022032972|0.7149643705463183|0.6869952544205699|
|      Random Forest|0.7185273159144893|0.7688811095706006|0.7185273159144894|0.6537000271832522|
|   Random Forest Ht|0.7399049881235155|0.7253352500564898|0.7399049881235155|0.7134184846876307|
+-------------------+------------------+------------------+------------------+------------------+
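
To make the comparison explicit, the following plain-Python sketch picks the best model for each metric. The values are copied from the table above; the helper name `best_model_per_metric` is hypothetical and not part of the notebook.

```python
# Metric values copied from the metrics table above
metrics = {
    "Logistic Regression": {
        "accuracy": 0.7161520190023754, "precision": 0.68431679221353,
        "recall": 0.7161520190023754, "F1 Score": 0.6718571279130134,
    },
    "Decision Tree": {
        "accuracy": 0.7149643705463183, "precision": 0.7067967022032972,
        "recall": 0.7149643705463183, "F1 Score": 0.6869952544205699,
    },
    "Random Forest": {
        "accuracy": 0.7185273159144893, "precision": 0.7688811095706006,
        "recall": 0.7185273159144894, "F1 Score": 0.6537000271832522,
    },
    "Random Forest Ht": {
        "accuracy": 0.7399049881235155, "precision": 0.7253352500564898,
        "recall": 0.7399049881235155, "F1 Score": 0.7134184846876307,
    },
}


def best_model_per_metric(metrics):
    """Return {metric name: model with the highest value for that metric}."""
    metric_names = next(iter(metrics.values())).keys()
    return {
        name: max(metrics, key=lambda model: metrics[model][name])
        for name in metric_names
    }


best = best_model_per_metric(metrics)
for metric_name, model in best.items():
    print(f"{metric_name}: {model} ({metrics[model][metric_name]:.4f})")
```

On these values, the tuned Random Forest leads on accuracy, recall, and F1 score, while the untuned Random Forest leads on precision.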



### 6. Interpretation.

The performance of the four models can be summarized from the metrics above (precision, recall, and F1 Score are the weighted variants):

- Logistic Regression: Achieved an accuracy of approximately 71.62%, with precision, recall, and F1 Score of around 68.43%, 71.62%, and 67.19% respectively.
- Decision Tree: Achieved an accuracy of approximately 71.50%, with precision, recall, and F1 Score of around 70.68%, 71.50%, and 68.70% respectively.
- Random Forest: Achieved an accuracy of approximately 71.85%, with precision, recall, and F1 Score of around 76.89%, 71.85%, and 65.37% respectively.
- Random Forest with hyperparameter tuning: Achieved an accuracy of approximately 73.99%, with precision, recall, and F1 Score of around 72.53%, 73.99%, and 71.34% respectively.

Comparing the models, the untuned Random Forest has the highest precision (76.89%, against 70.68% for the Decision Tree and 68.43% for Logistic Regression) but also the lowest F1 Score (65.37%). Accuracy differs by less than half a percentage point across the three untuned models, so Logistic Regression and Decision Tree are competitive on that metric. Hyperparameter tuning raises Random Forest's accuracy to 73.99% and its F1 Score to 71.34%, the best of the four configurations, while its precision drops to 72.53%. The model should therefore be chosen based on the specific requirements of the application, considering interpretability, computational complexity, and the relative importance of the different metrics. Further optimization and tuning may lead to additional improvements.

Based on the evaluation of the classification models, several insights can be drawn to inform decision-making on student retention and academic success in higher education institutions. The analysis shows that Logistic Regression, Decision Tree, and Random Forest can predict students' academic outcomes with accuracies of roughly 71% to 74%.

- Firstly, Random Forest emerges as the most promising model. Untuned, it has the highest precision (76.89%) of the evaluated models, and after hyperparameter tuning it has the highest accuracy (73.99%), recall, and F1 Score. These metrics are averaged over all three classes, so they describe overall predictive quality rather than detection of the dropout class specifically. Even so, the model is a reasonable basis for institutions seeking to implement proactive measures to prevent student dropout.
- Secondly, Logistic Regression and Decision Tree are competitive: their precision (68.43% and 70.68%) lags behind that of the untuned Random Forest, and their recall is only marginally lower (71.62% and 71.50%, against 71.85%). They remain viable alternatives for institutions with resource constraints or those prioritizing interpretability over complexity, and they provide insight into the factors influencing student outcomes that can guide institutional interventions. In our scenario, Random Forest stands as the best model, with the highest precision of 76.89%.

In summary, the comprehensive evaluation of classification models provides higher education institutions with valuable insights into designing effective student support systems. By leveraging machine learning techniques, institutions can proactively identify and support at-risk students, ultimately fostering higher retention rates and improving overall academic success. The adoption of these models represents a significant step towards data-driven decision-making in higher education, enabling institutions to optimize resources and tailor interventions to meet the diverse needs of their student population.

### 7. Conclusion.

In examining the factors contributing to student dropout rates based on the employed ML models and their metrics, several insights can be gleaned from the feature importance analysis and model evaluation.

1. Random Forest Model Insights:
   - The Random Forest model, known for its robustness and ability to handle complex data, has highlighted certain features as significant contributors to predicting dropout rates.
   - Feature importance analysis reveals that variables such as previous academic performance, socio-economic background, and demographics play pivotal roles in predicting student outcomes.
   - Specifically, factors such as admission grade, educational background (previous qualification and its grade), and socio-economic indicators (tuition fees, parental occupation, and educational special needs) are among the most influential in predicting dropout likelihood.
   - The recall of the Random Forest model (71.85%, rising to 73.99% after tuning) is the highest among the evaluated models, which supports the use of these key predictors in early intervention strategies.
     
2. Logistic Regression and Decision Tree Insights:
   - While Logistic Regression and Decision Tree models offer lower precision than Random Forest, they still provide valuable insights into the factors associated with dropout rates.
   - These models highlight similar features as significant predictors of dropout, corroborating the importance of factors such as previous academic performance, socio-economic status, and demographic characteristics.
   - Additionally, Decision Trees provide a clear hierarchy of predictors, aiding interpretability and guiding targeted interventions based on easily identifiable risk factors.

3. Common Themes Across Models:
   - Across all employed models, there is a consensus on the critical role of academic performance, as evidenced by features such as admission grade and previous qualification.
   - Socio-economic factors, including parental occupation, educational background, and financial status (tuition fees), consistently emerge as important predictors of dropout likelihood.
   - Demographic variables, such as gender and age at enrollment, also contribute to the predictive power of the models, albeit to varying degrees.
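
The feature ranking discussed in the points above can be sketched as follows. This is a minimal plain-Python illustration: the helper `rank_features`, the feature names, and the importance values are all hypothetical and are not results from this notebook. In PySpark, the importances would typically come from the fitted tree model's `featureImportances` vector and the names from `assembler.getInputCols()`.

```python
def rank_features(names, importances, top_n=5):
    """Pair each feature name with its importance and return the top_n, highest first."""
    if len(names) != len(importances):
        raise ValueError("names and importances must have the same length")
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]


# Hypothetical feature names and importance values, for illustration only
feature_names = [
    "admission_grade",
    "previous_qualification_grade",
    "tuition_fees_up_to_date",
    "age_at_enrollment",
    "gender",
]
importances = [0.30, 0.20, 0.35, 0.10, 0.05]

top_features = rank_features(feature_names, importances, top_n=3)
for name, score in top_features:
    print(f"{name}: {score:.2f}")
```

Keeping the names and importances in the same order as the assembler's input columns is what makes the pairing valid, so the length check guards against a mismatched list.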

In conclusion, the employed ML models collectively highlight the multifaceted nature of student dropout, emphasizing the interplay between academic, socio-economic, and demographic factors. Academic performance, socio-economic background, and demographic characteristics consistently emerge as significant predictors across Random Forest, Logistic Regression, and Decision Tree. Random Forest achieves the highest recall of the evaluated models, while Logistic Regression and Decision Tree offer clearer interpretability of key predictors.

In our scenario, Random Forest stands as the best model, with the highest precision of 76.89%. We treat precision as the key metric because of the impact of false positives, that is, the model classifying a student as a Graduate when they are on the verge of dropping out. By leveraging these insights, higher education institutions can develop targeted interventions and support systems for at-risk students. These findings underscore the importance of holistic support systems that address academic, socio-economic, and demographic challenges to enhance student retention and academic success.
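
The false-positive case described above concerns individual classes, whereas the reported precision is a weighted average over all classes. As a hedged illustration, the plain-Python sketch below computes the precision of the Graduate class and the recall of the Dropout class from a confusion matrix. The class names, their order, and the counts are assumptions made for illustration, not values from this notebook. In Spark 3.0 and later, `MulticlassClassificationEvaluator` also accepts `metricName="precisionByLabel"` or `"recallByLabel"` together with `metricLabel`, which should give the same per-class quantities directly.

```python
# Assumed class names and order; the counts are hypothetical, for illustration only.
labels = ["Dropout", "Enrolled", "Graduate"]
confusion = [          # rows = true class, columns = predicted class
    [50, 5, 10],       # true Dropout
    [8, 12, 15],       # true Enrolled
    [4, 6, 90],        # true Graduate
]


def precision_for(confusion, idx):
    """Of all rows predicted as class idx, the fraction that truly belong to it."""
    predicted = sum(row[idx] for row in confusion)
    return confusion[idx][idx] / predicted if predicted else 0.0


def recall_for(confusion, idx):
    """Of all rows that truly belong to class idx, the fraction predicted as it."""
    actual = sum(confusion[idx])
    return confusion[idx][idx] / actual if actual else 0.0


dropout = labels.index("Dropout")
graduate = labels.index("Graduate")

graduate_precision = precision_for(confusion, graduate)
dropout_recall = recall_for(confusion, dropout)
# Dropouts wrongly predicted as Graduate: the costly false positives for the Graduate class
dropouts_predicted_graduate = confusion[dropout][graduate]

print("Graduate precision:", round(graduate_precision, 4))
print("Dropout recall:", round(dropout_recall, 4))
print("Dropouts predicted as Graduate:", dropouts_predicted_graduate)
```

Reporting these per-class values alongside the weighted metrics would show directly how often an at-risk student is mistaken for a graduate.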