# **Forest Cover Type Prediction**

#### **Introduction**

This notebook documents a project on predicting forest cover types using the Kaggle [Forest Cover Type Prediction dataset](https://www.kaggle.com/competitions/forest-cover-type-prediction/data). The project is part of the **Big Data** module of ENIT's 3rd-year MIndS and is carried out by **Group 4**: Chaima Balti, Roukaya Lakhzouri, and Salsabil Rouahi, under the supervision of our professor, **Moez Ben Haj Hmida**.

The primary goal of this project is to explore and apply various machine learning techniques to accurately classify forest cover types based on specific features related to soil, climate, and topography. 

#### **Libraries** 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, corr
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LinearSVC
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [3]:
# Step 1: Start a Spark Session
spark = SparkSession.builder.appName("LR_with_PySpark").getOrCreate()

#### **Data exploration**

In [4]:
# Load the dataset
data_path = "train.csv"  # Replace with the actual path to your dataset
df = spark.read.csv(data_path, header=True, inferSchema=True)

#### **Logistic regression model (multinomial)**

In [5]:
# Name of the target column
target_column = "Cover_Type"  # Replace with your actual target column
data = df  # Alias for the loaded DataFrame (not used in the cells below)

In [6]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Assuming df is your DataFrame with feature columns and 'Cover_Type' as the target

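# Step 1: Combine feature columns into a single vector
# The loop variable is named c so it does not shadow pyspark.sql.functions.col imported earlier.
# Every non-target column is used as a feature, including any identifier column the CSV may contain.
feature_columns = [c for c in df.columns if c != "Cover_Type"]
vector_assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
df = vector_assembler.transform(df)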

# Step 2: Split the data into training and testing sets
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)

# Step 3: Initialize the Logistic Regression model
lr = LogisticRegression(featuresCol="features", labelCol="Cover_Type", family="multinomial")

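# Step 4: Build the parameter grid
# Each parameter has a single candidate value here, so the grid holds one combination;
# add more values to each list to make cross-validation compare configurations.
param_grid = (ParamGridBuilder()
              .addGrid(lr.regParam, [0.01])  # Regularization parameter
              .addGrid(lr.elasticNetParam, [0.5])  # ElasticNet mixing parameter
              .addGrid(lr.maxIter, [300])  # Maximum number of iterations
              .build())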

# Step 5: Define the evaluator
evaluator = MulticlassClassificationEvaluator(labelCol="Cover_Type", predictionCol="prediction", metricName="accuracy")

# Step 6: Set up CrossValidator
cross_validator = CrossValidator(estimator=lr,
                                  estimatorParamMaps=param_grid,
                                  evaluator=evaluator,
                                  numFolds=5)  # 5-fold cross-validation

# Step 7: Fit the model with CrossValidator
cv_model = cross_validator.fit(train_df)

# Step 8: Get the best model
best_model = cv_model.bestModel

# Step 9: Evaluate the best model on the test set
predictions = best_model.transform(test_df)
test_accuracy = evaluator.evaluate(predictions)

print(f"Best Test Accuracy: {test_accuracy * 100:.2f}%")

# Step 10: Print the best hyperparameters
# Uses the public getters on the fitted model (available in PySpark 3.0+)
# instead of reaching into the private _java_obj attribute.
print("Best Hyperparameters:")
print(f" - Regularization Parameter (regParam): {best_model.getRegParam()}")
print(f" - ElasticNet Parameter (elasticNetParam): {best_model.getElasticNetParam()}")
print(f" - Max Iterations (maxIter): {best_model.getMaxIter()}")


Best Test Accuracy: 66.08%
Best Hyperparameters:
 - Regularization Parameter (regParam): 0.01
 - ElasticNet Parameter (elasticNetParam): 0.5
 - Max Iterations (maxIter): 300

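To make the two regularization hyperparameters printed above more concrete, here is a minimal sketch of the elastic-net penalty as described in the Spark documentation: `regParam * (elasticNetParam * ||w||_1 + (1 - elasticNetParam) / 2 * ||w||_2^2)`. The function name `elastic_net_penalty` and the weight vector are illustrative assumptions; they are not taken from the fitted model.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Elastic-net penalty: a mix of L1 (lasso) and squared L2 (ridge) terms."""
    l1_norm = sum(abs(w) for w in weights)
    l2_norm_squared = sum(w * w for w in weights)
    return reg_param * (
        elastic_net_param * l1_norm
        + (1.0 - elastic_net_param) / 2.0 * l2_norm_squared
    )


# Made-up coefficients, used only to illustrate the formula.
example_weights = [0.5, -1.0, 0.0, 2.0]

# Values used in this notebook: regParam=0.01, elasticNetParam=0.5.
penalty = elastic_net_penalty(example_weights, reg_param=0.01, elastic_net_param=0.5)

# The two extremes of the mixing parameter for comparison.
pure_ridge = elastic_net_penalty(example_weights, reg_param=0.01, elastic_net_param=0.0)
pure_lasso = elastic_net_penalty(example_weights, reg_param=0.01, elastic_net_param=1.0)

print(penalty, pure_ridge, pure_lasso)
```

With `elasticNetParam=0.5`, the penalty sits halfway between a pure ridge penalty (`0.0`) and a pure lasso penalty (`1.0`).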

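Because each `addGrid` call above supplies a single value, the grid contains only one combination, so the reported "best" hyperparameters are simply the ones that were supplied. The sketch below shows, in plain Python, how a grid expands as a Cartesian product and how many fits 5-fold cross-validation would need. The candidate values are hypothetical; only 0.01, 0.5, and 300 appear in this notebook.

```python
from itertools import product

# Hypothetical candidate values for a wider search (assumption, not from the notebook).
grid = {
    "regParam": [0.001, 0.01, 0.1],
    "elasticNetParam": [0.0, 0.5, 1.0],
    "maxIter": [300],
}

# A parameter grid is the Cartesian product of the per-parameter value lists.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

num_folds = 5
# One fit per (combination, fold), plus one final refit of the best
# combination on the full training set.
total_fits = len(combinations) * num_folds + 1

print(f"{len(combinations)} combinations, {total_fits} model fits")
```

The cost grows multiplicatively with every value added, so it may be worth widening one parameter at a time.
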
In [9]:
import joblib

In [10]:
joblib.dump(best_model, 'Spark_model.joblib')


TypeError: cannot pickle '_thread.RLock' object
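
The `TypeError` above arises because `joblib` relies on pickling, and the fitted Spark model is a Python wrapper around a JVM object. The wrapper most likely holds references (for example to the Spark gateway) that contain thread locks, which cannot be pickled. The sketch below reproduces the same class of error with a stand-in object and shows a helper that uses Spark ML's own writer instead. `HoldsLock`, `try_pickle`, and `save_spark_model` are hypothetical names introduced for illustration.

```python
import pickle
import threading


class HoldsLock:
    """Stand-in for an object that references a thread lock (assumption about the cause)."""

    def __init__(self):
        self.lock = threading.RLock()


def try_pickle(obj):
    """Return None if obj can be pickled, otherwise the error message."""
    try:
        pickle.dumps(obj)
        return None
    except TypeError as exc:
        return str(exc)


# Pickling fails for the same reason joblib.dump failed on the Spark model.
error = try_pickle(HoldsLock())
print(error)


def save_spark_model(model, path):
    """Persist a fitted Spark ML model with its built-in writer rather than joblib."""
    model.write().overwrite().save(path)
```

In this notebook the call would be `save_spark_model(best_model, "Spark_model")`, where `"Spark_model"` is an assumed output directory. The model can later be restored with `LogisticRegressionModel.load("Spark_model")` from `pyspark.ml.classification`.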