This Loan Prediction Model uses PySpark and machine learning techniques to predict whether a loan application is likely to be approved. The dataset is preprocessed and transformed, and several classification algorithms are trained on it to compare their performance and improve prediction accuracy. The project covers the end-to-end steps: data loading, cleaning, encoding categorical features, assembling feature vectors, training ML models, and evaluating them with MLlib evaluators.
- Data Cleaning & Preprocessing: Handles missing values, converts string attributes into numerical format, and prepares the dataset for training.
- Feature Engineering: Uses StringIndexer, OneHotEncoder, and VectorAssembler to create model-ready feature vectors.
- Multiple Classification Models: Implements the following classifiers:
  - Logistic Regression
  - Naive Bayes
  - Decision Tree Classifier
- Model Evaluation: Evaluates model accuracy and performance using:
  - MulticlassMetrics
  - MulticlassClassificationEvaluator
- Visualizations: Uses Matplotlib and Seaborn to generate plots for insights into feature distribution and performance metrics.
- End-to-End ML Pipeline: Uses the PySpark Pipeline API to streamline preprocessing and training into a single workflow.
- Python
- Apache Spark
- PySpark
  - SparkSession
  - DataFrame operations (pyspark.sql)
  - Feature transformation (StringIndexer, OneHotEncoder, VectorAssembler)
  - Model building (LogisticRegression, NaiveBayes, DecisionTreeClassifier)
  - ML Pipeline (Pipeline)
  - Evaluation (MulticlassMetrics, MulticlassClassificationEvaluator)
- matplotlib: for visualization
- seaborn: for statistical plots