# **Forest Cover Type Prediction**

#### **Introduction**

This notebook documents a project focused on predicting forest cover types using the [Kaggle dataset: Forest Cover Type Prediction Dataset](https://www.kaggle.com/competitions/forest-cover-type-prediction/data)
. The project is part of the **Big Data** module of ENIT's 3rd year MIndS and is undertaken by **Group 4**: Chaima Balti, Roukaya Lakhzouri, and Salsabil Rouahi. We are working under the supervision of our professor, **Moez Ben Haj Hmida**.

The primary goal of this project is to explore and apply various machine learning techniques to accurately classify forest cover types based on specific features related to soil, climate, and topography. 

#### **Libraries** 

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [25]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, corr
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LinearSVC
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [17]:
# Step 1: Start a Spark Session
spark = SparkSession.builder.appName("SVM_with_PySpark").getOrCreate()

#### **Data exploration**

In [18]:
# Load the dataset
data_path = "train.csv"  # Replace with the actual path to your dataset
df = spark.read.csv(data_path, header=True, inferSchema=True)

In [19]:
# Step 3: Inspect the dataset
df.printSchema()
df.show(5)

root
 |-- Id: integer (nullable = true)
 |-- Elevation: integer (nullable = true)
 |-- Aspect: integer (nullable = true)
 |-- Slope: integer (nullable = true)
 |-- Horizontal_Distance_To_Hydrology: integer (nullable = true)
 |-- Vertical_Distance_To_Hydrology: integer (nullable = true)
 |-- Horizontal_Distance_To_Roadways: integer (nullable = true)
 |-- Hillshade_9am: integer (nullable = true)
 |-- Hillshade_Noon: integer (nullable = true)
 |-- Hillshade_3pm: integer (nullable = true)
 |-- Horizontal_Distance_To_Fire_Points: integer (nullable = true)
 |-- Wilderness_Area1: integer (nullable = true)
 |-- Wilderness_Area2: integer (nullable = true)
 |-- Wilderness_Area3: integer (nullable = true)
 |-- Wilderness_Area4: integer (nullable = true)
 |-- Soil_Type1: integer (nullable = true)
 |-- Soil_Type2: integer (nullable = true)
 |-- Soil_Type3: integer (nullable = true)
 |-- Soil_Type4: integer (nullable = true)
 |-- Soil_Type5: integer (nullable = true)
 |-- Soil_Type6: integer (nullab

### Model ( SVM FOR Multiclass classification )

In [49]:
# Filter correlations with the target
target_column = "Cover_Type"  # Replace with your actual target column
data = df

In [50]:
# Step 6: Prepare features and label
feature_columns = [col for col in data.columns]
assembler = VectorAssembler(inputCols=data.columns[:-1], outputCol="features")
data = assembler.transform(data)

In [51]:
# Step 8: Split data into training and test sets
train_data, test_data = data.randomSplit([0.8, 0.2], seed=42)

In [52]:
train_data.printSchema()
test_data.printSchema()


root
 |-- Id: integer (nullable = true)
 |-- Elevation: integer (nullable = true)
 |-- Aspect: integer (nullable = true)
 |-- Slope: integer (nullable = true)
 |-- Horizontal_Distance_To_Hydrology: integer (nullable = true)
 |-- Vertical_Distance_To_Hydrology: integer (nullable = true)
 |-- Horizontal_Distance_To_Roadways: integer (nullable = true)
 |-- Hillshade_9am: integer (nullable = true)
 |-- Hillshade_Noon: integer (nullable = true)
 |-- Hillshade_3pm: integer (nullable = true)
 |-- Horizontal_Distance_To_Fire_Points: integer (nullable = true)
 |-- Wilderness_Area1: integer (nullable = true)
 |-- Wilderness_Area2: integer (nullable = true)
 |-- Wilderness_Area3: integer (nullable = true)
 |-- Wilderness_Area4: integer (nullable = true)
 |-- Soil_Type1: integer (nullable = true)
 |-- Soil_Type2: integer (nullable = true)
 |-- Soil_Type3: integer (nullable = true)
 |-- Soil_Type4: integer (nullable = true)
 |-- Soil_Type5: integer (nullable = true)
 |-- Soil_Type6: integer (nullab

In [53]:
from pyspark.sql.functions import col

train_data = train_data.withColumn("Cover_Type", col("Cover_Type") - 1)
test_data = test_data.withColumn("Cover_Type", col("Cover_Type") - 1)


In [54]:
from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol="Cover_Type", outputCol="indexed_label")
train_data = indexer.fit(train_data).transform(train_data)
test_data = indexer.fit(test_data).transform(test_data)


In [60]:
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(featuresCol="features", labelCol="indexed_label", maxIter=100, regParam=0.01, elasticNetParam=0.0, family="multinomial")

# Fit the model
lr_model = lr.fit(train_data)


In [62]:
predictions = lr_model.transform(test_data)
predictions.select("features", "indexed_label", "prediction", "probability").show()

+--------------------+-------------+----------+--------------------+
|            features|indexed_label|prediction|         probability|
+--------------------+-------------+----------+--------------------+
|(55,[0,1,2,3,4,5,...|          6.0|       0.0|[0.92467288761026...|
|(55,[0,1,2,3,4,5,...|          2.0|       0.0|[0.61751573253408...|
|(55,[0,1,2,3,4,5,...|          2.0|       0.0|[0.56320251046590...|
|(55,[0,1,2,3,4,5,...|          2.0|       4.0|[0.29333378452627...|
|(55,[0,1,2,3,4,5,...|          2.0|       4.0|[0.26466540800532...|
|(55,[0,1,2,3,4,6,...|          2.0|       4.0|[0.25344922108903...|
|(55,[0,1,2,3,4,5,...|          2.0|       4.0|[0.37989877640613...|
|(55,[0,1,2,3,4,5,...|          6.0|       0.0|[0.57939111502075...|
|(55,[0,1,2,3,4,5,...|          2.0|       4.0|[0.28547501132889...|
|(55,[0,1,2,3,4,5,...|          2.0|       4.0|[0.27375308523954...|
|(55,[0,1,2,3,4,6,...|          2.0|       4.0|[0.28734724053058...|
|(55,[0,1,2,3,4,5,...|          2.

### Evaluation

In [64]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(labelCol="indexed_label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print(f"Test Accuracy: {accuracy}")


Test Accuracy: 0.08900876601483479


In [67]:
f1_evaluator = MulticlassClassificationEvaluator(labelCol="indexed_label", predictionCol="prediction", metricName="f1")
f1_score = f1_evaluator.evaluate(predictions)
print(f"F1 Score: {f1_score}")

F1 Score: 0.09389319525185283
