#**INTRODUCTION**
The objective of this study is to predict students' placement outcomes from their academic performance and other attributes. The dataset contains numerical features such as SSC Percentage (ssc_p), HSC Percentage (hsc_p), E-test Percentage (etest_p), Degree Percentage (degree_p), MBA Percentage (mba_p), and Salary (salary), along with categorical variables such as Gender (gender), Work Experience (workex), and Specialisation (specialisation). The target variable is the placement status (status): Placed or Not Placed. PySpark is used for data preprocessing and for building a linear regression model on the binary-encoded target.







#**DATA PREPARATION**

The dataset underwent the following preprocessing steps.

* Data Loading: The dataset is loaded into a PySpark DataFrame, with the schema inferred automatically to determine each column's data type.

* Numerical and Categorical Columns: The columns are separated into numerical and categorical groups so that each group can be preprocessed appropriately.

* Schema Conversion: Numerical columns are cast to float for compatibility with PySpark's ML library.

* Handling Missing Values:

  1. Numerical Features: Missing values in numerical features are imputed using PySpark's Imputer with the 'mean' strategy (its default).
  2. Categorical Features: Missing values in categorical features are replaced with the string "Unknown".

* Target Variable Encoding: The status column is encoded as a binary column, placement_status, where Placed is represented by 1 and Not Placed by 0.

* Feature Engineering: The selected features ssc_p, hsc_p, etest_p, degree_p, and mba_p are combined into a single feature vector using PySpark's VectorAssembler.
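
To make the imputation and encoding steps concrete, here is a minimal plain-Python sketch of what they do to individual rows. It is an illustration only: the three records are made-up values, not rows from the dataset, and the actual pipeline performs these steps with PySpark's Imputer, fillna, and when/otherwise.

```python
# Toy records standing in for dataset rows (values are made up for illustration).
rows = [
    {"ssc_p": 67.0, "workex": "No", "status": "Placed"},
    {"ssc_p": None, "workex": None, "status": "Not Placed"},
    {"ssc_p": 73.0, "workex": "Yes", "status": "Placed"},
]

# Mean imputation: average the observed values, then fill the gaps with that mean.
observed = [r["ssc_p"] for r in rows if r["ssc_p"] is not None]
ssc_mean = sum(observed) / len(observed)

cleaned = []
for r in rows:
    cleaned.append({
        "ssc_p": r["ssc_p"] if r["ssc_p"] is not None else ssc_mean,
        # Missing categories become the literal string "Unknown".
        "workex": r["workex"] if r["workex"] is not None else "Unknown",
        # Binary target: 1 for "Placed", 0 for anything else.
        "placement_status": 1 if r["status"] == "Placed" else 0,
    })

for r in cleaned:
    print(r)
```

Note that the encoding rule maps every value other than "Placed" to 0, so a row whose status was missing would also be treated as Not Placed.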



#**MODEL BUILDING**

The linear regression model was configured as follows.

* Train-Test Split: The cleaned dataset was split 70%/30% into training and testing sets, so that the model is evaluated on data it was not trained on.

* Model Training: The model was trained using the features column as the predictors and placement_status as the target variable.

* Coefficients and Intercept: The coefficients and intercept were extracted to understand how much each feature contributes to the prediction.

* Training Evaluation: The model's performance on the training data was assessed using RMSE and R-squared.
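
Because the target is binary but linear regression returns a continuous score, it may help to see how a fitted model turns features into a prediction. The sketch below is a plain-Python illustration under stated assumptions: the coefficients, intercept, and student values are hypothetical (not the fitted values from this study), and the 0.5 cut-off used to convert a score into a Placed/Not Placed label is an assumed convention that the study itself does not apply.

```python
# Hypothetical coefficients and intercept, NOT the values fitted in this study.
# Order matches the feature vector: ssc_p, hsc_p, etest_p, degree_p, mba_p.
coefficients = [0.02, 0.01, 0.0, 0.01, -0.01]
intercept = -1.5

def predict_score(features):
    """Linear regression output: intercept plus the weighted sum of the features."""
    return intercept + sum(w * x for w, x in zip(coefficients, features))

def to_label(score, threshold=0.5):
    """Map the continuous score to 1 (Placed) or 0 (Not Placed); threshold is assumed."""
    return 1 if score >= threshold else 0

# A made-up student record used only to exercise the two functions.
student = [80.0, 75.0, 60.0, 70.0, 65.0]
score = predict_score(student)
print(score, to_label(score))
```

Each coefficient is the change in the predicted score per one-point change in that percentage, holding the other features fixed, which is how the coefficients indicate feature contribution.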

#**MODEL EVALUATION**

The trained model was evaluated on the test dataset using regression metrics:

* Root Mean Squared Error (RMSE): The square root of the MSE; expresses the typical prediction error in the same units as the target.
* Mean Squared Error (MSE): The average of the squared differences between predicted and actual values.
* Mean Absolute Error (MAE): The average of the absolute differences between predicted and actual values.
* R-squared (R²): The proportion of the variance in the target variable that the model explains.
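
The four metrics can be computed by hand, which may clarify how they relate to one another. The following is a minimal plain-Python sketch using the standard definitions; the function name regression_metrics and the toy labels and predictions are hypothetical and are not results from this study.

```python
import math

def regression_metrics(actual, predicted):
    """Compute RMSE, MSE, MAE, and R-squared from paired actual/predicted values."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                    # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"rmse": math.sqrt(mse), "mse": mse, "mae": mae, "r2": r2}

# Toy binary labels and continuous predictions (made up for illustration).
actual = [1, 0, 1, 1]
predicted = [0.5, 0.5, 1.0, 1.0]
metrics_by_hand = regression_metrics(actual, predicted)
for name, value in metrics_by_hand.items():
    print(f"{name.upper()}: {value:.4f}")
```

RMSE is always the square root of MSE, so the two rank models identically; MAE is less sensitive to a few large errors than either of them.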










In [None]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("alitaimoor572/placement-dataset-campus-recruitment")

print("Path to dataset files:", path)

Path to dataset files: /root/.cache/kagglehub/datasets/alitaimoor572/placement-dataset-campus-recruitment/versions/1


In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, count
from pyspark.ml.feature import VectorAssembler, Imputer
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Initialize Spark session
spark = SparkSession.builder.appName("PlacementPrediction").getOrCreate()

# Load dataset from the directory returned by kagglehub in the previous cell
placement_data = spark.read.csv(f"{path}/Placement_Data_Full_Class.csv", header=True, inferSchema=True)
placement_data.show(5)
placement_data.printSchema()

# Separate numerical and categorical columns
numerical_columns = ['ssc_p', 'hsc_p', 'etest_p', 'degree_p', 'mba_p', 'salary']  # Adjust as needed
categorical_columns = ['gender', 'ssc_b', 'hsc_b', 'hsc_s', 'degree_t', 'workex', 'specialisation', 'status']

# Cast numerical columns to float
placement_data_float = placement_data.select(
    *(col(c).cast("float").alias(c) for c in numerical_columns),
    *categorical_columns
)

# Check for missing values
placement_data_float.select(
    [count(when(col(c).isNull(), c)).alias(c) for c in placement_data_float.columns]
).show()

# Impute missing values in numerical columns with the column mean
imputer = Imputer(inputCols=numerical_columns, outputCols=numerical_columns, strategy="mean")
placement_data_imputed = imputer.fit(placement_data_float).transform(placement_data_float)

# Replace missing values in categorical columns with "Unknown"
placement_data_imputed = placement_data_imputed.fillna("Unknown", subset=categorical_columns)

# Encode the target variable ('status') as 1 for "Placed" and 0 for "Not Placed"
placement_data_imputed = placement_data_imputed.withColumn(
    "placement_status",
    when(col("status") == "Placed", 1).otherwise(0)
)

# Feature selection
selected_features = ['ssc_p', 'hsc_p', 'etest_p', 'degree_p', 'mba_p']  # Adjust as needed
feature_assembler = VectorAssembler(inputCols=selected_features, outputCol="features")
placement_data_features = feature_assembler.transform(placement_data_imputed)

# Select the features and target variable
placement_model_data = placement_data_features.select("features", "placement_status")

# Split data into training and test sets
training_data, testing_data = placement_model_data.randomSplit([0.7, 0.3], seed=42)

# Train a Linear Regression model
lr_model = LinearRegression(featuresCol='features', labelCol='placement_status')
placement_lr_model = lr_model.fit(training_data)

# Model coefficients and intercept
print("Coefficients: " + str(placement_lr_model.coefficients))
print("Intercept: " + str(placement_lr_model.intercept))

# Evaluate model on training data
training_summary = placement_lr_model.summary
print(f"RMSE (Training): {training_summary.rootMeanSquaredError}")
print(f"R-squared (Training): {training_summary.r2}")

# Make predictions on test data
test_predictions = placement_lr_model.transform(testing_data)
test_predictions.select("prediction", "placement_status", "features").show(5)

# Evaluate the model on test data
metrics = ["rmse", "mse", "mae", "r2"]
for metric in metrics:
    evaluator = RegressionEvaluator(labelCol="placement_status", predictionCol="prediction", metricName=metric)
    metric_score = evaluator.evaluate(test_predictions)
    print(f"{metric.upper()}: {metric_score}")

# Stop the Spark session
spark.stop()


+-----+------+-----+-------+-----+-------+--------+--------+---------+------+-------+--------------+-----+----------+------+
|sl_no|gender|ssc_p|  ssc_b|hsc_p|  hsc_b|   hsc_s|degree_p| degree_t|workex|etest_p|specialisation|mba_p|    status|salary|
+-----+------+-----+-------+-----+-------+--------+--------+---------+------+-------+--------------+-----+----------+------+
|    1|     M| 67.0| Others| 91.0| Others|Commerce|    58.0| Sci&Tech|    No|   55.0|        Mkt&HR| 58.8|    Placed|270000|
|    2|     M|79.33|Central|78.33| Others| Science|   77.48| Sci&Tech|   Yes|   86.5|       Mkt&Fin|66.28|    Placed|200000|
|    3|     M| 65.0|Central| 68.0|Central|    Arts|    64.0|Comm&Mgmt|    No|   75.0|       Mkt&Fin| 57.8|    Placed|250000|
|    4|     M| 56.0|Central| 52.0|Central| Science|    52.0| Sci&Tech|    No|   66.0|        Mkt&HR|59.43|Not Placed|  NULL|
|    5|     M| 85.8|Central| 73.6|Central|Commerce|    73.3|Comm&Mgmt|    No|   96.8|       Mkt&Fin| 55.5|    Placed|425000|
