# Land Cover Mapping in North Algeria Using Sentinel-2 Data

## Members

-  Atef Amriche
-  Sharathchandra Bangalore Munibaire Gowda
-  Narasimha Rejala Shenoy
-  Ying Zhou

# Introduction

Land Cover and Land Use (LCLU) products are used for a wide range of research and development applications, including environmental monitoring, natural resources management, urban planning, and socio-economic studies. Scientists and practitioners use various types of remote sensing data and adopt different approaches, and consequently obtain LCLU products with different mapping accuracies. Current research benefits from technological advances and the free availability of remote sensing data, and focuses on the development of advanced classification approaches and machine learning algorithms. This project therefore focuses on fine-tuning and optimizing the land cover mapping process by selecting the most accurate and generalizable method. The final classification scheme should adapt to, and obtain high accuracies in, different types of study areas (tropical, humid, arid, desert, etc.). This matters most when processing very large datasets (terabytes of satellite imagery) to generate land cover maps at national, continental, or global scales, where environmental changes are more pronounced.


## Objectives

- Train multiple classifiers for comparison.
- Test multiple configurations of each classifier to select the best-performing parameters.
- Run the best classifier and parameters over 100 iterations to obtain a more reliable estimate of generalization performance.
- Use the best classifier to make predictions for larger datasets (>150 M).

## Requirements

-  NumPy
-  pandas
-  Matplotlib
-  PySpark

## Code

### Importing packages

In [40]:
# Numerical arrays and tabular data
import numpy as np
import pandas as pd
# Plotting (pyplot is the recommended interface; pylab is discouraged)
import matplotlib.pyplot as plt
# Spark session and functions for manipulating dataframes
from pyspark.sql import SparkSession
from pyspark.sql import functions as fn
# Pipelines (sequences of transformations) and feature utilities
from pyspark.ml import Pipeline, feature
from pyspark.ml.feature import VectorAssembler, MaxAbsScaler, StringIndexer
# Classifiers and evaluation
from pyspark.ml.classification import RandomForestClassifier, LogisticRegression, LinearSVC, OneVsRest, GBTClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

### Loading the dataframe

In [41]:
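# Request executor memory while building the session: SparkContext.setSystemProperty
# returns None and must be called before the context is created.
spark = SparkSession.builder.config('spark.executor.memory', '512g').getOrCreate()
sc = spark.sparkContext
# Read the samples with pandas, then convert them to a Spark DataFrame
df = pd.read_csv("Data1.csv")
data = spark.createDataFrame(df)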

### Data visualization

#### Box plots for 9 features

In [4]:
plt.style.use('classic')
fig, ax = plt.subplots(figsize=(5,  5))
df.boxplot(['S2_B1','S2_B4','S2_B2'])
fig.savefig('boxplot_features_1.png', dpi=400, bbox_inches='tight')

In [5]:
plt.style.use('classic')
fig, ax = plt.subplots(figsize=(5,  5))
df.boxplot(['ARVI','GEMI','NDVI'])
fig.savefig('boxplot_features_2.png', dpi=400, bbox_inches='tight')

In [6]:
plt.style.use('classic')
fig, ax = plt.subplots(figsize=(5,  5))
df.boxplot(['3_45_9','5_90_21','15_135_97'])
fig.savefig('boxplot_features_3.png', dpi=400, bbox_inches='tight')

### Testing Three Classifiers

#### Random Forest Classifier

In [42]:
va = VectorAssembler(inputCols= ["S2_B1", "S2_B2", "S2_B4", "S2_B11", "S2_B12", "ARVI", "CI", "DVI", "GEMI", "GNDVI", "IPVI", "IRECI", "MCARI", "MSAVI", "MSAVI2", "NDVI", "NDWI2", "PSSRA", "RI", "RVI", "SAVI", "3_45_9", "3_90_12", "3_90_63", "5_90_21", "5_90_62", "7_90_15", "7_45_19", "9_180_12", "9_45_97", "9_90_11", "9_90_59", "11_135_1", "13_90_59", "15_135_97", "15_45_1", "15_45_19", "15_45_59", "15_45_81", "15_90_13", "15_90_31", "15_90_59", "15_90_64", "15_180_59"]
, outputCol="features")

In [43]:
training_df, validation_df, testing_df = data.randomSplit([0.8, 0.1, 0.1], seed = 100 )
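
The 80/10/10 split above is random per row, so the three subsets only approximately match the requested proportions. The following is a minimal pure-Python sketch of that idea, not Spark's implementation: the function name `random_split` and the toy row list are hypothetical and used only for illustration.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row independently to one split, with probability given by the normalized weights."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative upper bounds, e.g. roughly [0.8, 0.9, 1.0] for weights [0.8, 0.1, 0.1]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last split catches any rounding error in the cumulative bounds
            if u < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

train, val, test = random_split(list(range(1000)), [0.8, 0.1, 0.1], seed=100)
print(len(train), len(val), len(test))
```

Fixing the seed makes the split reproducible, which is what allows the three classifiers to be compared on the same validation and testing rows.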

In [44]:
# With no metricName, MulticlassClassificationEvaluator reports the F1 score (its default), not AUC
classification_evaluator = MulticlassClassificationEvaluator(labelCol='Label')
classification_evaluator_Acc = MulticlassClassificationEvaluator(labelCol='Label',predictionCol="prediction", metricName="accuracy")

In [45]:
rf_estimator1 = RandomForestClassifier(featuresCol='features',labelCol='Label',numTrees = 10, maxDepth = 5)
rf_estimator11 = RandomForestClassifier(featuresCol='features',labelCol='Label')

pipeline_rf1 = Pipeline(stages=[va, rf_estimator1])
pipeline_rf11 = Pipeline(stages=[va, rf_estimator11])

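pipeline_rf_model1 = pipeline_rf1.fit(training_df)
# Fit the default-parameter pipeline. The recorded output below shows two identical
# scores because this line originally refit pipeline_rf1.
pipeline_rf_model11 = pipeline_rf11.fit(training_df)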

In [46]:
# The evaluator's default metric is F1, so these are validation F1 scores
f1_model1 = classification_evaluator.evaluate(pipeline_rf_model1.transform(validation_df))
f1_model11 = classification_evaluator.evaluate(pipeline_rf_model11.transform(validation_df))

print(f1_model1, f1_model11)

(0.6781381093101251, 0.6781381093101251)


In [47]:
rf_model1 = pipeline_rf_model1.stages[-1]
# Pair importances with the assembler's input columns, which define the order of the feature vector
feature_importance1 = pd.DataFrame(list(zip(va.getInputCols(), rf_model1.featureImportances.toArray())),
            columns = ['feature', 'importance']).sort_values('importance', ascending=False)
feature_importance1

| index | feature | importance |
|---|---|---|
| 10 | IPVI | 0.149597 |
| 37 | 15_45_59 | 0.052799 |
| 34 | 15_135_97 | 0.051865 |
| 39 | 15_90_13 | 0.048163 |
| 6 | CI | 0.044674 |
| 16 | NDWI2 | 0.041185 |
| 21 | 3_45_9 | 0.039492 |
| 20 | SAVI | 0.038606 |
| 26 | 7_90_15 | 0.036409 |
| 43 | 15_180_59 | 0.036323 |


In [48]:
predictions1 = pipeline_rf_model1.transform(testing_df)

In [49]:
predictions1.select("prediction", "Label", "features").show(5)

+----------+-----+--------------------+
|prediction|Label|            features|
+----------+-----+--------------------+
|       1.0|    1|          (44,[],[])|
|       1.0|    3|          (44,[],[])|
|       1.0|    5|          (44,[],[])|
|       7.0|    7|[0.0194,0.023,0.0...|
|       2.0|    3|[0.0226,0.0367,0....|
+----------+-----+--------------------+
only showing top 5 rows



In [50]:
accuracy1 = classification_evaluator_Acc.evaluate(predictions1)
print("Accuracy of RF Model 1 = %g " % accuracy1)
# classification_evaluator uses its default metric, which is F1 rather than area under the curve
f1_rf1 = classification_evaluator.evaluate(predictions1)
print("F1 score of RF Model 1 = %g " % f1_rf1)

Accuracy of RF Model 1 = 0.715789 
F1 score of RF Model 1 = 0.704933 


In [51]:
rf_estimator2 = RandomForestClassifier(featuresCol='features',labelCol='Label',numTrees=100, maxDepth=15)
pipeline_rf2 = Pipeline(stages=[va, rf_estimator2])
pipeline_rf_model2 = pipeline_rf2.fit(training_df)

In [52]:
classification_evaluator.evaluate(pipeline_rf_model2.transform(validation_df))

0.7863614134452444

In [53]:
rf_model2 = pipeline_rf_model2.stages[-1]
# Pair importances with the assembler's input columns, which define the order of the feature vector
feature_importance2 = pd.DataFrame(list(zip(va.getInputCols(), rf_model2.featureImportances.toArray())),
            columns = ['feature', 'importance']).sort_values('importance', ascending=False)
feature_importance2

| index | feature | importance |
|---|---|---|
| 10 | IPVI | 0.050511 |
| 5 | ARVI | 0.037892 |
| 4 | S2_B12 | 0.032641 |
| 15 | NDVI | 0.032148 |
| 2 | S2_B4 | 0.031326 |
| 0 | S2_B1 | 0.030988 |
| 16 | NDWI2 | 0.030913 |
| 21 | 3_45_9 | 0.029462 |
| 6 | CI | 0.02879 |
| 39 | 15_90_13 | 0.02781 |


In [54]:
predictions2 = pipeline_rf_model2.transform(testing_df)
predictions2.select("prediction", "Label", "features").show(5)

+----------+-----+--------------------+
|prediction|Label|            features|
+----------+-----+--------------------+
|       5.0|    1|          (44,[],[])|
|       5.0|    3|          (44,[],[])|
|       5.0|    5|          (44,[],[])|
|       7.0|    7|[0.0194,0.023,0.0...|
|       2.0|    3|[0.0226,0.0367,0....|
+----------+-----+--------------------+
only showing top 5 rows



In [55]:
accuracy2 = classification_evaluator_Acc.evaluate(predictions2)
print("Accuracy of RF Model 2 = %g " % accuracy2)
# classification_evaluator uses its default metric, which is F1 rather than area under the curve
f1_rf2 = classification_evaluator.evaluate(predictions2)
print("F1 score of RF Model 2 = %g " % f1_rf2)

Accuracy of RF Model 2 = 0.729825 
F1 score of RF Model 2 = 0.727725 


In [56]:
rf_estimator3 = RandomForestClassifier(featuresCol='features',labelCol='Label',numTrees=200,maxDepth=5)
pipeline_rf3 = Pipeline(stages=[va, rf_estimator3])
pipeline_rf_model3 = pipeline_rf3.fit(training_df)

In [57]:
classification_evaluator.evaluate(pipeline_rf_model3.transform(validation_df))

0.7077972578644908

In [58]:
rf_model3 = pipeline_rf_model3.stages[-1]
# Pair importances with the assembler's input columns, which define the order of the feature vector
feature_importance3 = pd.DataFrame(list(zip(va.getInputCols(), rf_model3.featureImportances.toArray())),
            columns = ['feature', 'importance']).sort_values('importance', ascending=False)
feature_importance3

| index | feature | importance |
|---|---|---|
| 10 | IPVI | 0.061827 |
| 15 | NDVI | 0.054314 |
| 5 | ARVI | 0.051736 |
| 16 | NDWI2 | 0.04201 |
| 19 | RVI | 0.040672 |
| 6 | CI | 0.037711 |
| 3 | S2_B11 | 0.035039 |
| 21 | 3_45_9 | 0.033897 |
| 17 | PSSRA | 0.033142 |
| 20 | SAVI | 0.031787 |


In [59]:
predictions3 = pipeline_rf_model3.transform(testing_df)
predictions3.select("prediction", "Label", "features").show(5)

+----------+-----+--------------------+
|prediction|Label|            features|
+----------+-----+--------------------+
|       1.0|    1|          (44,[],[])|
|       1.0|    3|          (44,[],[])|
|       1.0|    5|          (44,[],[])|
|       7.0|    7|[0.0194,0.023,0.0...|
|       2.0|    3|[0.0226,0.0367,0....|
+----------+-----+--------------------+
only showing top 5 rows



In [60]:
accuracy3 = classification_evaluator_Acc.evaluate(predictions3)
print("Accuracy of RF Model 3 = %g " % accuracy3)
# classification_evaluator uses its default metric, which is F1 rather than area under the curve
f1_rf3 = classification_evaluator.evaluate(predictions3)
print("F1 score of RF Model 3 = %g " % f1_rf3)

Accuracy of RF Model 3 = 0.719298 
F1 score of RF Model 3 = 0.707276 


In [61]:
best_rf_model = pipeline_rf_model2

In [62]:
classification_evaluator_Precision = MulticlassClassificationEvaluator(labelCol='Label',predictionCol="prediction", metricName="weightedPrecision")
classification_evaluator_Recall = MulticlassClassificationEvaluator(labelCol='Label',predictionCol="prediction", metricName="weightedRecall")
classification_evaluator_F1 = MulticlassClassificationEvaluator(labelCol='Label',predictionCol="prediction", metricName="f1")

In [63]:
accuracy_rf_best = classification_evaluator_Acc.evaluate(predictions2)
print("Accuracy of Best RF Model = %g " % accuracy_rf_best)
# Report weighted precision itself; the original printed (1.0 - precision)
precision_rf_best = classification_evaluator_Precision.evaluate(predictions2)
print("Precision of Best RF Model = %g " % precision_rf_best)
recall_rf_best = classification_evaluator_Recall.evaluate(predictions2)
print("Recall of Best RF Model = %g " % recall_rf_best)
# The default-metric evaluator also returns F1, so it is not reported separately
f1_score_best = classification_evaluator_F1.evaluate(predictions2)
print("F1 score of Best RF Model = %g " % f1_score_best)

Accuracy of Best RF Model = 0.729825 
Precision of Best RF Model = 0.743142 
Recall of Best RF Model = 0.729825 
F1 score of Best RF Model = 0.727725 
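
The weighted metrics reported above can be reproduced by hand. The sketch below assumes the usual definition, in which each class's precision, recall, and F1 are averaged with weights equal to that class's share of the true labels; the function name `weighted_metrics` and the toy label lists are hypothetical. It also illustrates why weighted recall always equals accuracy, as in the output above (0.729825 for both).

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Return (accuracy, weighted precision, weighted recall, weighted F1)."""
    n = len(y_true)
    support = Counter(y_true)  # number of true samples per class
    precision = recall = f1 = 0.0
    for c in sorted(support):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        predicted = sum(1 for p in y_pred if p == c)
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / support[c]
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        w = support[c] / n  # class weight = share of true labels
        precision += w * p_c
        recall += w * r_c
        f1 += w * f_c
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    return accuracy, precision, recall, f1

# Toy labels (hypothetical), using class codes similar to those in the notebook
y_true = [1, 1, 1, 3, 3, 5, 7, 7]
y_pred = [1, 1, 3, 3, 2, 5, 7, 7]
accuracy, precision, recall, f1 = weighted_metrics(y_true, y_pred)
print(accuracy, precision, recall, f1)
```

Because the weight of each class cancels its support in the recall term, the weighted recall reduces to the total number of correct predictions divided by the number of samples.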


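Cohen's kappa coefficient is $\kappa = \frac{P_o - P_e}{1 - P_e}$, where $P_o$ is the observed agreement (equivalent to accuracy) and $P_e$ is the agreement expected by chance, that is, the probability that a random assignment matches the reference land cover class.

Therefore, with $P_o = 0.73$ and $P_e = 0.3$, $\kappa = \frac{0.73 - 0.3}{1 - 0.3} \approx 0.61$.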
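
The kappa calculation can also be done directly from label lists. This is a sketch under one assumption: the chance agreement is estimated from the marginal class frequencies of the true and predicted labels, which is the standard choice but is not stated in the notebook. The function name `cohen_kappa` and the toy labels are hypothetical.

```python
from collections import Counter

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    # Observed agreement: fraction of samples where prediction and reference match
    p_o = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: sum over classes of P(true = c) * P(pred = c)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)

print(cohen_kappa([1, 1, 2, 2], [1, 1, 2, 2]))  # perfect agreement
print(cohen_kappa([1, 1, 2, 2], [1, 2, 1, 2]))  # agreement no better than chance
```

A value near 1 indicates agreement well beyond chance, while a value near 0 means the classifier does no better than assigning classes at random with the same class frequencies.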

#### Logistic Regression

In [64]:
lr =LogisticRegression().\
    setLabelCol('Label').\
    setFeaturesCol('features').\
    setRegParam(0.1).\
    setMaxIter(100).\
    setElasticNetParam(0.01)

In [65]:
pipeline_lr1 = Pipeline(stages=[va,lr])
pipeline_lr1_model = pipeline_lr1.fit(training_df)

In [66]:
classification_evaluator.evaluate(pipeline_lr1_model.transform(validation_df))

0.6961103812342535

In [67]:
# The evaluator's default metric is F1, so this is the test F1 score
f1_lr1 = classification_evaluator.evaluate(pipeline_lr1_model.transform(testing_df))
print(f1_lr1)

0.675921155456


In [68]:
predictions1 = pipeline_lr1_model.transform(testing_df)
predictions1.select("prediction", "Label", "features").show(5)
accuracy1 = classification_evaluator_Acc.evaluate(predictions1)
print("Test Error = %g " % (1.0 - accuracy1))

+----------+-----+--------------------+
|prediction|Label|            features|
+----------+-----+--------------------+
|       1.0|    1|          (44,[],[])|
|       1.0|    3|          (44,[],[])|
|       1.0|    5|          (44,[],[])|
|       7.0|    7|[0.0194,0.023,0.0...|
|       7.0|    3|[0.0226,0.0367,0....|
+----------+-----+--------------------+
only showing top 5 rows

Test Error = 0.305263 


In [69]:
va2 = VectorAssembler(inputCols= ["ARVI", "CI", "DVI", "GEMI", "GNDVI", "IPVI", "IRECI", "MCARI", "MSAVI", "MSAVI2", "NDVI", "NDWI2", "PSSRA", "RI", "RVI", "SAVI"],
        outputCol="features") 

In [70]:
pipeline_lr2 = Pipeline(stages=[va2,lr])
pipeline_lr2_model = pipeline_lr2.fit(training_df)

In [71]:
classification_evaluator.evaluate(pipeline_lr2_model.transform(validation_df))

0.5397001133397824

In [72]:
# The evaluator's default metric is F1, so this is the test F1 score
f1_lr2 = classification_evaluator.evaluate(pipeline_lr2_model.transform(testing_df))
print(f1_lr2)

0.505956824556


In [73]:
predictions2 = pipeline_lr2_model.transform(testing_df)
predictions2.select("prediction", "Label", "features").show(5)
accuracy2 = classification_evaluator_Acc.evaluate(predictions2)
print("Test Error = %g " % (1.0 - accuracy2))

+----------+-----+--------------------+
|prediction|Label|            features|
+----------+-----+--------------------+
|       3.0|    1|          (16,[],[])|
|       3.0|    3|          (16,[],[])|
|       3.0|    5|          (16,[],[])|
|       7.0|    7|[0.76646,-0.06646...|
|       2.0|    3|[0.066,0.15145,0....|
+----------+-----+--------------------+
only showing top 5 rows

Test Error = 0.403509 


In [74]:
va3 = VectorAssembler(inputCols= ["3_45_9", "3_90_12", "3_90_63", "5_90_21", "5_90_62", "7_90_15", "7_45_19", "9_180_12", "9_45_97", "9_90_11", "9_90_59", "11_135_1", "13_90_59", "15_135_97", "15_45_1", "15_45_19", "15_45_59", "15_45_81", "15_90_13", "15_90_31", "15_90_59", "15_90_64", "15_180_59"]
, outputCol="features")
pipeline_lr3 = Pipeline(stages=[va3,lr])
pipeline_lr3_model = pipeline_lr3.fit(training_df)

In [75]:
classification_evaluator.evaluate(pipeline_lr3_model.transform(validation_df))

0.571601366358707

In [76]:
# The evaluator's default metric is F1, so this is the test F1 score
f1_lr3 = classification_evaluator.evaluate(pipeline_lr3_model.transform(testing_df))
print(f1_lr3)

0.498516274102


In [77]:
predictions3 = pipeline_lr3_model.transform(testing_df)
predictions3.select("prediction", "Label", "features").show(5)
accuracy3 = classification_evaluator_Acc.evaluate(predictions3)
print("Test Error = %g " % (1.0 - accuracy3))

+----------+-----+--------------------+
|prediction|Label|            features|
+----------+-----+--------------------+
|       7.0|    1|          (23,[],[])|
|       7.0|    3|          (23,[],[])|
|       7.0|    5|          (23,[],[])|
|       7.0|    7|[1.88889000000000...|
|       7.0|    3|[2.66667,1.44444,...|
+----------+-----+--------------------+
only showing top 5 rows

Test Error = 0.449123 


The first LR model outperforms the other two: its test error is 0.305, against 0.404 for the 16-feature subset (`va2`) and 0.449 for the 23-feature subset (`va3`). Accuracy is therefore higher when all 44 features are used rather than a specific subset.

Next, the regularization parameters of the first LR model (all features) are varied, and each configuration is evaluated on the validation and testing datasets.

In [78]:
# Elastic-net mixing (alpha) and regularization strength (lambda) for the three configurations
alpha1 = 0
alpha2 = 0.2
alpha3 = 0.4
lambda1 = 0
lambda2 = 0.02
lambda3 = 0.1
en_lr1 = LogisticRegression().\
        setLabelCol('Label').\
        setFeaturesCol('features').\
        setRegParam(lambda1).\
        setMaxIter(100).\
        setElasticNetParam(alpha1)
        
en_lr2 = LogisticRegression().\
        setLabelCol('Label').\
        setFeaturesCol('features').\
        setRegParam(lambda2).\
        setMaxIter(100).\
        setElasticNetParam(alpha2)
        
en_lr3 = LogisticRegression().\
        setLabelCol('Label').\
        setFeaturesCol('features').\
        setRegParam(lambda3).\
        setMaxIter(100).\
        setElasticNetParam(alpha3)
lr_pipeline1_test = Pipeline(stages=[va, en_lr1])
lr_pipeline2_test = Pipeline(stages=[va, en_lr2])
lr_pipeline3_test = Pipeline(stages=[va, en_lr3])
lr_pipeline1 = lr_pipeline1_test.fit(training_df)
lr_pipeline2 = lr_pipeline2_test.fit(training_df)
lr_pipeline3 = lr_pipeline3_test.fit(training_df)
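
To make the role of the two parameters concrete, the sketch below evaluates the elastic-net penalty $\lambda\left(\alpha\lVert w\rVert_1 + \frac{1-\alpha}{2}\lVert w\rVert_2^2\right)$, where `regParam` plays the role of $\lambda$ and `elasticNetParam` the role of $\alpha$. The coefficient vector is hypothetical and the function name `elastic_net_penalty` is introduced only for illustration; the (alpha, lambda) pairs are the ones defined in the cell above.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty added to the loss: lambda * (alpha * L1 + (1 - alpha) / 2 * L2 squared)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (elastic_net_param * l1 + (1.0 - elastic_net_param) / 2.0 * l2_squared)

weights = [0.5, -1.0, 2.0]  # hypothetical model coefficients
settings = [(0, 0), (0.2, 0.02), (0.4, 0.1)]  # (alpha, lambda) pairs from the notebook
for alpha, lam in settings:
    print(alpha, lam, elastic_net_penalty(weights, lam, alpha))
```

With `lambda1 = 0` the penalty vanishes, so the first configuration is an unregularized logistic regression; the other two add progressively stronger shrinkage with a larger L1 share.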

In [79]:
classification_evaluator.evaluate(lr_pipeline1.transform(validation_df))

0.7608876375771567

In [80]:
accuracy1 = classification_evaluator_Acc.evaluate(lr_pipeline1.transform(testing_df))
accuracy1

0.7473684210526316

In [81]:
# Default evaluator metric: F1 on the test split
f1_1 = classification_evaluator.evaluate(lr_pipeline1.transform(testing_df))
f1_1

0.7436382417340024

In [82]:
classification_evaluator.evaluate(lr_pipeline2.transform(validation_df))

0.7152866518940785

In [83]:
accuracy2 = classification_evaluator_Acc.evaluate(lr_pipeline2.transform(testing_df))
accuracy2

0.7017543859649122

In [84]:
# Default evaluator metric: F1 on the test split
f1_2 = classification_evaluator.evaluate(lr_pipeline2.transform(testing_df))
f1_2

0.6889362383375792

In [85]:
classification_evaluator.evaluate(lr_pipeline3.transform(validation_df))

0.5621986632852622

In [86]:
accuracy3 = classification_evaluator_Acc.evaluate(lr_pipeline3.transform(testing_df))
accuracy3

0.6105263157894737

In [87]:
# Default evaluator metric: F1 on the test split
f1_3 = classification_evaluator.evaluate(lr_pipeline3.transform(testing_df))
f1_3

0.5202640726719836

In [92]:
predictions_best = lr_pipeline1.transform(testing_df)
predictions_best.select("prediction", "Label", "features").show(5)

+----------+-----+--------------------+
|prediction|Label|            features|
+----------+-----+--------------------+
|       1.0|    1|          (44,[],[])|
|       1.0|    3|          (44,[],[])|
|       1.0|    5|          (44,[],[])|
|       7.0|    7|[0.0194,0.023,0.0...|
|       2.0|    3|[0.0226,0.0367,0....|
+----------+-----+--------------------+
only showing top 5 rows



In [93]:
accuracy_lr_best = classification_evaluator_Acc.evaluate(predictions_best)
print("Accuracy of Best LR Model = %g " % accuracy_lr_best)
# Report weighted precision itself; the original printed (1.0 - precision)
precision_lr_best = classification_evaluator_Precision.evaluate(predictions_best)
print("Precision of Best LR Model = %g " % precision_lr_best)
recall_lr_best = classification_evaluator_Recall.evaluate(predictions_best)
print("Recall of Best LR Model = %g " % recall_lr_best)
# The default-metric evaluator also returns F1, so it is not reported separately
f1_score_best = classification_evaluator_F1.evaluate(predictions_best)
print("F1 score of Best LR Model = %g " % f1_score_best)

Accuracy of Best LR Model = 0.747368 
Precision of Best LR Model = 0.754024 
Recall of Best LR Model = 0.747368 
F1 score of Best LR Model = 0.743638 


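Cohen's kappa coefficient is $\kappa = \frac{P_o - P_e}{1 - P_e}$, where $P_o$ is the observed agreement (equivalent to accuracy) and $P_e$ is the agreement expected by chance, that is, the probability that a random assignment matches the reference land cover class.

Therefore, with $P_o = 0.747$ (the accuracy of the best LR model) and $P_e = 0.28$, $\kappa = \frac{0.747 - 0.28}{1 - 0.28} \approx 0.65$.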

#### SVM

In [102]:
# Reload the data and rename the target column so that StringIndexer can write its output to 'label'
spark = SparkSession.builder.getOrCreate()
df = pd.read_csv("Data1.csv")
df = df.rename(columns={"Label": "Class-label"})
data = spark.createDataFrame(df)

feature_transformer = VectorAssembler(inputCols= ["S2_B1", "S2_B2", "S2_B4", "S2_B11", "S2_B12", "ARVI", "CI", "DVI", "GEMI", "GNDVI", "IPVI", "IRECI", "MCARI", "MSAVI", "MSAVI2", "NDVI", "NDWI2", "PSSRA", "RI", "RVI", "SAVI", "3_45_9", "3_90_12", "3_90_63", "5_90_21", "5_90_62", "7_90_15", "7_45_19", "9_180_12", "9_45_97", "9_90_11", "9_90_59", "11_135_1", "13_90_59", "15_135_97", "15_45_1", "15_45_19", "15_45_59", "15_45_81", "15_90_13", "15_90_31", "15_90_59", "15_90_64", "15_180_59"], outputCol="features")
labelIndexer = StringIndexer(inputCol="Class-label", outputCol="label").setHandleInvalid("keep")
feature_df=feature_transformer.transform(data)
data_indexed =labelIndexer.fit(feature_df).transform(feature_df)

In [103]:
training_df, validation_df, testing_df = data_indexed.randomSplit([0.8, 0.1, 0.1], seed = 100 )

In [104]:
svm_estimator = LinearSVC(featuresCol='features',labelCol='label',maxIter=5, regParam=0.01)
ovr=OneVsRest(classifier=svm_estimator,labelCol='label').fit(training_df)
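
`LinearSVC` is a binary classifier, which is why it is wrapped in `OneVsRest` for the multi-class land cover labels. The sketch below illustrates the one-vs-rest decision rule in plain Python: one linear scorer per class, with the prediction taken from the most confident scorer. The weights, biases, and function names are hypothetical and are not taken from the fitted model.

```python
def linear_score(weights, bias, x):
    """Raw decision value of one binary (class vs. rest) linear classifier."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def one_vs_rest_predict(scores_by_class):
    """Return the class whose binary classifier gives the highest raw score."""
    return max(scores_by_class, key=scores_by_class.get)

# Hypothetical per-class (weights, bias) pairs for a two-feature problem
models = {
    1: ([1.0, 0.0], -0.5),
    3: ([0.0, 1.0], -0.5),
    7: ([-1.0, -1.0], 0.5),
}
x = [0.9, 0.2]
scores = {c: linear_score(w, b, x) for c, (w, b) in models.items()}
predicted = one_vs_rest_predict(scores)
print(scores, predicted)
```

Each binary classifier only separates its own class from all others, so the quality of the combined prediction depends on every per-class model converging, which a very small `maxIter` may not allow.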

In [105]:
# Evaluators for the indexed 'label' column; the first uses the default metric (F1)
classification_evaluator = MulticlassClassificationEvaluator(labelCol='label')
classification_evaluator_Acc = MulticlassClassificationEvaluator(labelCol='label',predictionCol="prediction", metricName="accuracy")

In [106]:
classification_evaluator.evaluate(ovr.transform(validation_df))

0.4151717341099401

In [107]:
# Default evaluator metric: F1 on the test split
f1_svm1 = classification_evaluator.evaluate(ovr.transform(testing_df))
f1_svm1

0.39864125826634433

In [108]:
Accuracy1 = classification_evaluator_Acc.evaluate(ovr.transform(testing_df))
Accuracy1

0.5087719298245614

In [109]:
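svm_estimator1 = LinearSVC(featuresCol='features',labelCol='label',maxIter=50, regParam=0.5)
# Wrap svm_estimator1. The cell originally reused svm_estimator, which is why the
# recorded scores below repeat those of the first SVM model.
ovr1=OneVsRest(classifier=svm_estimator1,labelCol='label').fit(training_df)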

In [110]:
classification_evaluator.evaluate(ovr1.transform(validation_df))

0.4151717341099401

In [111]:
# Default evaluator metric: F1 on the test split
f1_svm2 = classification_evaluator.evaluate(ovr1.transform(testing_df))
f1_svm2

0.39864125826634433

In [112]:
Accuracy2 = classification_evaluator_Acc.evaluate(ovr1.transform(testing_df))
Accuracy2

0.5087719298245614

In [113]:
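svm_estimator2 = LinearSVC(featuresCol='features',labelCol='label',maxIter=100, regParam=1)
# Wrap svm_estimator2. The cell originally reused svm_estimator, which is why the
# recorded scores below repeat those of the first SVM model.
ovr2=OneVsRest(classifier=svm_estimator2,labelCol='label').fit(training_df)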

In [114]:
classification_evaluator.evaluate(ovr2.transform(validation_df))

0.4151717341099401

In [115]:
# Default evaluator metric: F1 on the test split
f1_svm3 = classification_evaluator.evaluate(ovr2.transform(testing_df))
f1_svm3

0.39864125826634433

In [116]:
Accuracy3 = classification_evaluator_Acc.evaluate(ovr2.transform(testing_df))
Accuracy3

0.5087719298245614

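### The best classifier is random forest

Random forest model 2 reaches the highest validation score (0.786, against 0.761 for the best logistic regression model and 0.415 for the SVM), so it is selected. On the test split, the unregularized logistic regression model is slightly more accurate (0.747 against 0.730).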

In [117]:
best_rf_model = pipeline_rf_model2

### Running 100 iterations to report generalization performance

In [None]:
### Best RF model refitted with 90% of data and tested on 10% of data
### ONLY RUN ONCE TO TEST AND VISUALIZE - this process takes some time.
### Requires `data` and `classification_evaluator` to use the 'Label' column, as in the RF section.
### The evaluator's default metric is F1, so the scores collected here are F1 scores.
#f1_scores = []
#for i in range(0,100):
#    training_df, testing_df = data.randomSplit([0.9, 0.1], seed = i )
#    best_rf_modelX = pipeline_rf2.fit(training_df)
#    f1_bestX = classification_evaluator.evaluate(best_rf_modelX.transform(testing_df))
#    f1_scores.append(f1_bestX)

In [None]:
#### Generalization performance results from previous step
#generalization_f1 = sum(f1_scores)/len(f1_scores)

#print('maximum F1 = ', max(f1_scores))
#print('minimum F1 = ', min(f1_scores))
#print('generalization F1 (mean of F1 scores) = ', generalization_f1)
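
The repeated-split procedure in the commented cells can be tried quickly without Spark. The sketch below is an illustration only: `majority_class_score` is a hypothetical stand-in for fitting the random forest pipeline and evaluating it, and the toy rows are invented. It shows the structure of the loop (one seed per iteration, a 90/10 split, and a summary of the collected scores).

```python
import random
from collections import Counter
from statistics import mean

def majority_class_score(train, test):
    """Hypothetical stand-in for 'fit the pipeline, then score it': predicts the most common training label."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return sum(1 for _, label in test if label == majority) / len(test)

def repeated_holdout(rows, fit_and_score, n_iter=100, test_fraction=0.1):
    """Refit and score on n_iter different random splits, one seed per iteration."""
    scores = []
    for seed in range(n_iter):
        shuffled = rows[:]
        random.Random(seed).shuffle(shuffled)
        n_test = max(1, int(len(shuffled) * test_fraction))
        test, train = shuffled[:n_test], shuffled[n_test:]
        scores.append(fit_and_score(train, test))
    return scores

# Toy (feature, label) rows; two thirds belong to class 1
rows = [(i, 1 if i % 3 else 7) for i in range(300)]
scores = repeated_holdout(rows, majority_class_score)
print('minimum =', min(scores), 'maximum =', max(scores), 'mean =', mean(scores))
```

Reporting the minimum and maximum alongside the mean shows how much a single train/test split can over- or understate performance.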

### Use the best model to make predictions for larger data sets

In [118]:
# Score the large file in 100 chunks, writing predictions as each chunk is processed
df = pd.read_csv('sub1.csv')
dfsplit = np.array_split(df, 100)

for i, chunk in enumerate(dfsplit):
    data = spark.createDataFrame(chunk)
    prediction = best_rf_model.transform(data)
    out = prediction.select('prediction', 'Label').toPandas()
    # Write the header with the first chunk, then append the remaining chunks
    out.to_csv('output.csv', mode='w' if i == 0 else 'a', header=(i == 0))

print("Predictions written to output.csv")

Predictions written to output.csv
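
The write-then-append pattern used for the predictions can be checked in isolation with the standard library. This is a minimal sketch with invented rows and a temporary file; the function name `write_in_chunks` is hypothetical. The first chunk creates the file and writes the header, and later chunks are appended without it.

```python
import csv
import os
import tempfile

def write_in_chunks(rows, header, path, n_chunks):
    """Write rows to a CSV file chunk by chunk, emitting the header only once."""
    size = -(-len(rows) // n_chunks)  # ceiling division: rows per chunk
    for i in range(n_chunks):
        chunk = rows[i * size:(i + 1) * size]
        mode = 'w' if i == 0 else 'a'  # overwrite on the first chunk, append afterwards
        with open(path, mode, newline='') as f:
            writer = csv.writer(f)
            if i == 0:
                writer.writerow(header)
            writer.writerows(chunk)

# Toy (prediction, Label) rows written to a temporary file
rows = [[float(i % 7), i % 7] for i in range(10)]
path = os.path.join(tempfile.mkdtemp(), 'output.csv')
write_in_chunks(rows, ['prediction', 'Label'], path, n_chunks=3)
with open(path, newline='') as f:
    lines = list(csv.reader(f))
print(len(lines))
```

Opening the file in write mode for the first chunk also guarantees that a rerun starts from an empty file rather than appending to old results.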
