<a href="https://colab.research.google.com/github/Datangels/Machine_Learning_with_PySpark/blob/master/pyspark_random_forest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Google Colab Configuration & Creation of the SparkSession Object**

In [0]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark

In [0]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"

In [0]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

## **Read the Dataset**

In [0]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

In [0]:
dataset_not_clean = spark.read.csv('/content/drive/My Drive/pycharm_colab_training/dataset/marriage_stats.csv',inferSchema=True, header=True)

## **Exploratory Data Analysis**


In [0]:
# dataset_not_clean.printSchema()
dataset_not_clean.describe().show()
# print((dataset_not_clean.count(), len(dataset_not_clean.columns)))

## **Feature Engineering**

In [0]:
from pyspark.ml.feature import VectorAssembler

df_assembler = VectorAssembler(inputCols=['rate_marriage', 'age', 'yrs_married', 'children', 'religious'], outputCol="features")
dataset_clean = df_assembler.transform(dataset_not_clean)
model_df = dataset_clean.select(['features','affairs'])
model_df.show(2)

## **Splitting the Dataset**

In [0]:
train_df, test_df = model_df.randomSplit([0.75,0.25])
print("whole dataset: " + str(model_df.count()))
print("train_df dataset: " + str(train_df.count()))
print("test_df dataset: " + str(test_df.count()))

## **Build and Train Random Forest Model**


In [0]:
from pyspark.ml.classification import RandomForestClassifier
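# fit() returns a fitted RandomForestClassificationModel, so rf_classifier holds the trained model
rf_classifier = RandomForestClassifier(labelCol='affairs', numTrees=50).fit(train_df)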

## **Evaluation on Test Data**

In [0]:
rf_predictions = rf_classifier.transform(test_df)
rf_predictions.show(10)

## **Accuracy**

In [0]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.evaluation import BinaryClassificationEvaluator
rf_accuracy = MulticlassClassificationEvaluator(labelCol='affairs',metricName='accuracy').evaluate(rf_predictions)
print('The accuracy of RF on test data is {0:.0%}'.format(rf_accuracy))

## **Precision**

In [0]:
rf_precision = MulticlassClassificationEvaluator(labelCol='affairs',metricName='weightedPrecision').evaluate(rf_predictions)
print('The precision rate on test data is {0:.0%}'.format(rf_precision))
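
To make the two metrics above concrete, here is a minimal pure-Python sketch of how accuracy and a label-frequency-weighted precision can be computed from (label, prediction) pairs. The toy pairs and the helper names `accuracy` and `weighted_precision` are illustrative assumptions, not part of this notebook. The sketch assumes that `weightedPrecision` averages per-class precision using each class's share of the true labels as the weight.

```python
from collections import Counter


def accuracy(pairs):
    """Fraction of (label, prediction) pairs where the two agree."""
    return sum(1 for label, pred in pairs if label == pred) / len(pairs)


def weighted_precision(pairs):
    """Per-class precision, averaged with weights equal to each class's share of true labels."""
    label_counts = Counter(label for label, _ in pairs)
    predicted_counts = Counter(pred for _, pred in pairs)
    correct_counts = Counter(label for label, pred in pairs if label == pred)
    total = len(pairs)
    score = 0.0
    for cls, n_true in label_counts.items():
        n_pred = predicted_counts.get(cls, 0)
        # Precision for this class: correct predictions / all predictions of this class
        precision = correct_counts.get(cls, 0) / n_pred if n_pred else 0.0
        score += precision * n_true / total
    return score


# Hypothetical toy data: (true label, predicted label)
toy_pairs = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 0), (0, 0)]
toy_accuracy = accuracy(toy_pairs)
toy_weighted_precision = weighted_precision(toy_pairs)
print(toy_accuracy, toy_weighted_precision)
```

On imbalanced labels the weighted figure is dominated by the majority class, so it can look healthy even when the minority class is predicted poorly.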

## **AUC (Area under the curve)**

In [0]:
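# BinaryClassificationEvaluator defaults to metricName='areaUnderROC' and reads the 'rawPrediction' column
rf_auc = BinaryClassificationEvaluator(labelCol='affairs').evaluate(rf_predictions)
print(rf_auc)
# Relative importance of each input feature, in the order passed to VectorAssembler
print(rf_classifier.featureImportances)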
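
As a rough intuition for the AUC number, the area under the ROC curve can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes that pairwise quantity in plain Python. The function name `roc_auc` and the toy labels and scores are assumptions for illustration. Spark derives its value from the ROC curve itself, so results on real data need not match this pairwise count exactly.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: true labels and model scores for the positive class
toy_labels = [1, 0, 1, 0, 1]
toy_scores = [0.9, 0.4, 0.6, 0.7, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
print(toy_auc)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.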

In [0]:
# rate_marriage is the most important feature for prediction; this metadata maps vector indices to column names
dataset_clean.schema["features"].metadata["ml_attr"]["attrs"]
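
Reading the importances next to the attribute metadata by eye is error-prone, so the following sketch pairs each feature name with its importance and sorts the result. It assumes the metadata has the shape `{"numeric": [{"idx": ..., "name": ...}, ...]}`; the helper `rank_features` and the importance values are hypothetical and are not outputs of this notebook.

```python
def rank_features(attrs_metadata, importances):
    """Pair each feature name with its importance, sorted from most to least important."""
    # Flatten every attribute group (e.g. "numeric", "binary") into an index -> name map
    index_to_name = {}
    for group in attrs_metadata.values():
        for attr in group:
            index_to_name[attr["idx"]] = attr["name"]
    ranked = [(index_to_name[i], float(importances[i])) for i in sorted(index_to_name)]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Assumed metadata shape, using the column names passed to VectorAssembler above
example_attrs = {
    "numeric": [
        {"idx": 0, "name": "rate_marriage"},
        {"idx": 1, "name": "age"},
        {"idx": 2, "name": "yrs_married"},
        {"idx": 3, "name": "children"},
        {"idx": 4, "name": "religious"},
    ]
}
# Hypothetical importances, for illustration only
example_importances = [0.55, 0.12, 0.20, 0.08, 0.05]

ranked = rank_features(example_attrs, example_importances)
for name, value in ranked:
    print(name, value)
```

In the notebook itself, `rf_classifier.featureImportances.toArray()` could be passed as `importances`, with the `attrs` dictionary from the cell above as `attrs_metadata`.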