# Task
Analyze the forest cover type dataset from "/content/train.csv", build a classification model to predict the forest cover type, and evaluate its performance.

## Data loading

### Subtask:
Load the training data from `/content/train.csv` into a pandas DataFrame.


**Reasoning**:
Import pandas and load the training data into a DataFrame, then display the first few rows.



In [1]:
import pandas as pd

df = pd.read_csv('/content/train.csv')
display(df.head())

,Id,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,...,Soil_Type32,Soil_Type33,Soil_Type34,Soil_Type35,Soil_Type36,Soil_Type37,Soil_Type38,Soil_Type39,Soil_Type40,Cover_Type
0,1,2596,51,3,258,0,510,221,232,148,...,0,0,0,0,0,0,0,0,0,5
1,2,2590,56,2,212,-6,390,220,235,151,...,0,0,0,0,0,0,0,0,0,5
2,3,2804,139,9,268,65,3180,234,238,135,...,0,0,0,0,0,0,0,0,0,2
3,4,2785,155,18,242,118,3090,238,238,122,...,0,0,0,0,0,0,0,0,0,2
4,5,2595,45,2,153,-1,391,220,234,150,...,0,0,0,0,0,0,0,0,0,5


## Exploring the data

### Subtask:
Perform basic exploratory data analysis (EDA) to understand the structure and characteristics of the dataset, including checking for missing values and examining the distribution of features and target variable.


**Reasoning**:
Display the shape and data types of the DataFrame to understand its structure.



In [2]:
display("Shape of the DataFrame:", df.shape)
display("Data types of the columns:", df.dtypes)

'Shape of the DataFrame:'

(15120, 56)

'Data types of the columns:'

Column,dtype
Id,int64
Elevation,int64
Aspect,int64
Slope,int64
Horizontal_Distance_To_Hydrology,int64
Vertical_Distance_To_Hydrology,int64
Horizontal_Distance_To_Roadways,int64
Hillshade_9am,int64
Hillshade_Noon,int64
Hillshade_3pm,int64


**Reasoning**:
Check for missing values and display descriptive statistics for the numerical columns.



In [3]:
display("Missing values per column:", df.isnull().sum())
display("Descriptive statistics:", df.describe())

'Missing values per column:'

Column,missing_count
Id,0
Elevation,0
Aspect,0
Slope,0
Horizontal_Distance_To_Hydrology,0
Vertical_Distance_To_Hydrology,0
Horizontal_Distance_To_Roadways,0
Hillshade_9am,0
Hillshade_Noon,0
Hillshade_3pm,0


'Descriptive statistics:'

,Id,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,...,Soil_Type32,Soil_Type33,Soil_Type34,Soil_Type35,Soil_Type36,Soil_Type37,Soil_Type38,Soil_Type39,Soil_Type40,Cover_Type
count,15120.0,15120.0,15120.0,15120.0,15120.0,15120.0,15120.0,15120.0,15120.0,15120.0,...,15120.0,15120.0,15120.0,15120.0,15120.0,15120.0,15120.0,15120.0,15120.0,15120.0
mean,7560.5,2749.322553,156.676653,16.501587,227.195701,51.076521,1714.023214,212.704299,218.965608,135.091997,...,0.045635,0.040741,0.001455,0.006746,0.000661,0.002249,0.048148,0.043452,0.030357,4.0
std,4364.91237,417.678187,110.085801,8.453927,210.075296,61.239406,1325.066358,30.561287,22.801966,45.895189,...,0.208699,0.197696,0.038118,0.081859,0.02571,0.047368,0.214086,0.20388,0.171574,2.000066
min,1.0,1863.0,0.0,0.0,0.0,-146.0,0.0,0.0,99.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,3780.75,2376.0,65.0,10.0,67.0,5.0,764.0,196.0,207.0,106.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
50%,7560.5,2752.0,126.0,15.0,180.0,32.0,1316.0,220.0,223.0,138.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
75%,11340.25,3104.0,261.0,22.0,330.0,79.0,2270.0,235.0,235.0,167.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.0
max,15120.0,3849.0,360.0,52.0,1343.0,554.0,6890.0,254.0,254.0,248.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,7.0


**Reasoning**:
Display the value counts for the target variable 'Cover_Type' to examine its distribution.



In [4]:
display("Value counts for Cover_Type:", df['Cover_Type'].value_counts())

'Value counts for Cover_Type:'

Cover_Type,count
5,2160
2,2160
1,2160
7,2160
3,2160
6,2160
4,2160
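
With seven equally sized classes, a useful reference point for later accuracy figures is the score of a classifier that always predicts a single class. The following is a minimal sketch that hard-codes the counts from the output above; the names `class_counts` and `majority_baseline` are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
class_counts = {1: 2160, 2: 2160, 3: 2160, 4: 2160, 5: 2160, 6: 2160, 7: 2160}

total = sum(class_counts.values())

# Accuracy obtained by always predicting the most frequent class.
majority_baseline = max(class_counts.values()) / total

print(f"Total rows: {total}")
print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")
```

Any model worth keeping should score clearly above this baseline.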


## Data preprocessing

### Subtask:
Prepare the data for modeling by handling categorical features (Wilderness Area and Soil Type) which are already one-hot encoded, and separating the features and target variable.


**Reasoning**:
Build the feature matrix `X` by dropping `Id` (a row identifier with no predictive meaning) and `Cover_Type` (the target), take `Cover_Type` as the target `y`, then verify both shapes. The one-hot encoded Wilderness Area and Soil Type columns are kept as they are.



In [5]:
X = df.drop(['Id', 'Cover_Type'], axis=1)
y = df['Cover_Type']

print("Shape of features (X):", X.shape)
print("Shape of target variable (y):", y.shape)

Shape of features (X): (15120, 54)
Shape of target variable (y): (15120,)
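
The subtask assumes the Wilderness Area and Soil Type columns are already one-hot encoded. One way to check that assumption is sketched below on a small toy frame. The helper `one_hot_group_is_valid` is hypothetical, and the column prefixes to pass for the real data (`"Soil_Type"`, and presumably `"Wilderness_Area"`, which is elided in the displayed output) are assumptions.

```python
import pandas as pd


def one_hot_group_is_valid(frame: pd.DataFrame, prefix: str) -> bool:
    """Return True if each row has exactly one 1 among columns starting with prefix."""
    cols = [c for c in frame.columns if c.startswith(prefix)]
    if not cols:
        return False
    block = frame[cols]
    only_binary = block.isin([0, 1]).all().all()    # every cell is 0 or 1
    one_per_row = (block.sum(axis=1) == 1).all()    # exactly one active column per row
    return bool(only_binary and one_per_row)


# Toy data standing in for the real DataFrame.
toy = pd.DataFrame({
    "Elevation": [2596, 2590, 2804],
    "Soil_Type1": [1, 0, 0],
    "Soil_Type2": [0, 1, 0],
    "Soil_Type3": [0, 0, 1],
})

print(one_hot_group_is_valid(toy, "Soil_Type"))
```

On the notebook's data the same call would be `one_hot_group_is_valid(df, "Soil_Type")`.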


## Model selection

### Subtask:
Choose a suitable classification model for this task. A simple yet effective model like Logistic Regression or a Decision Tree will be a good starting point for a beginner.


**Reasoning**:
Choose and import a suitable classification model from scikit-learn. A Logistic Regression model is a good starting point for multi-class classification.



In [6]:
from sklearn.linear_model import LogisticRegression

## Model training

### Subtask:
Train the selected model on the prepared training data.


**Reasoning**:
Instantiate and train the Logistic Regression model using the prepared features and target variable.



In [7]:
model = LogisticRegression()
model.fit(X, y)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


## Model evaluation

### Subtask:
Evaluate the performance of the trained model using appropriate metrics such as accuracy, precision, recall, and F1-score.


**Reasoning**:
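Import the necessary evaluation metrics and calculate the accuracy, precision, recall, and F1-score of the trained model on the training data. Because the model is scored on the same rows it was fitted on, these are in-sample figures and do not estimate performance on unseen data.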



In [8]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_pred = model.predict(X)

accuracy = accuracy_score(y, y_pred)
precision = precision_score(y, y_pred, average='weighted')
recall = recall_score(y, y_pred, average='weighted')
f1 = f1_score(y, y_pred, average='weighted')

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")

Accuracy: 0.4790
Precision: 0.4624
Recall: 0.4790
F1-score: 0.4674
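
Since the metrics above are computed on the training rows, one common way to obtain an out-of-sample estimate without a separate test file is k-fold cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in for the notebook's `X` and `y`; the names `X_demo`, `y_demo`, and `cv_scores` are illustrative, and the scores it prints say nothing about the forest cover data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic 7-class data; replace with the notebook's X and y.
X_demo, y_demo = make_classification(
    n_samples=700,
    n_features=10,
    n_informative=6,
    n_classes=7,
    n_clusters_per_class=1,
    random_state=42,
)

cv_model = LogisticRegression(max_iter=1000)

# With an integer cv and a classifier, scikit-learn uses stratified folds.
cv_scores = cross_val_score(cv_model, X_demo, y_demo, cv=5, scoring="accuracy")

print("Fold accuracies:", cv_scores.round(4))
print(f"Mean accuracy: {cv_scores.mean():.4f}")
```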


## Prediction

### Subtask:
Use the trained model to predict forest cover types on new, unseen data (if available, or simulate with a split of the training data).


**Reasoning**:
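Split the data into training (80%) and testing (20%) sets, train a Logistic Regression model on the training portion with `max_iter` raised to 1000 in response to the earlier convergence message, predict on the test portion, and evaluate the model's performance with accuracy and weighted precision, recall, and F1-score.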



In [9]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model_test = LogisticRegression(max_iter=1000)
model_test.fit(X_train, y_train)

y_pred_test = model_test.predict(X_test)

accuracy_test = accuracy_score(y_test, y_pred_test)
precision_test = precision_score(y_test, y_pred_test, average='weighted')
recall_test = recall_score(y_test, y_pred_test, average='weighted')
f1_test = f1_score(y_test, y_pred_test, average='weighted')

print(f"Test Accuracy: {accuracy_test:.4f}")
print(f"Test Precision: {precision_test:.4f}")
print(f"Test Recall: {recall_test:.4f}")
print(f"Test F1-score: {f1_test:.4f}")

Test Accuracy: 0.6098
Test Precision: 0.6007
Test Recall: 0.6098
Test F1-score: 0.6015


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
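
In both evaluation cells the reported recall equals the accuracy (0.4790 and 0.6098). This follows from `average='weighted'`: per-class recall weighted by class support reduces to the overall fraction of correct predictions. A small pure-Python sketch on made-up labels illustrates the identity; the label lists are invented for the example.

```python
from collections import Counter

# Made-up labels, used only to illustrate the identity.
y_true_demo = [1, 1, 1, 2, 2, 3, 3, 3, 3, 3]
y_pred_demo = [1, 2, 1, 2, 3, 3, 3, 1, 3, 3]

n = len(y_true_demo)
support = Counter(y_true_demo)  # number of true samples per class
correct = Counter(t for t, p in zip(y_true_demo, y_pred_demo) if t == p)

accuracy_demo = sum(correct.values()) / n

# Per-class recall (correct / support), weighted by the class share (support / n).
weighted_recall_demo = sum(
    (support[c] / n) * (correct[c] / support[c]) for c in support
)

print(f"Accuracy:        {accuracy_demo:.4f}")
print(f"Weighted recall: {weighted_recall_demo:.4f}")
```

Because of this, weighted recall adds no information beyond accuracy; per-class recall (for example `average=None`) would show which cover types are being confused.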


## Summary:

### Data Analysis Key Findings

*   The dataset contains 15120 rows and 56 columns with no missing values.
*   All features and the target variable are of integer data type.
*   The target variable 'Cover\_Type' is perfectly balanced, with each of the 7 classes having 2160 instances.
*   A Logistic Regression model was chosen for classification.
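*   Scored on its own training data with the default `max_iter`, the first model reached an accuracy of 47.90% (weighted precision 46.24%, recall 47.90%, F1-score 46.74%).
*   With `max_iter=1000` and an 80/20 train/test split, the model reached a test accuracy of 60.98%, with weighted precision 60.07%, weighted recall 60.98%, and weighted F1-score 60.15% on the held-out data.
*   The solver reported that it reached its iteration limit in both fits, including the one with `max_iter=1000`, so neither model fully converged.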

### Insights or Next Steps

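*   A test accuracy of about 61% is well above the roughly 14.3% chance level for seven balanced classes, but it leaves substantial room for improvement. More flexible models, such as the Decision Tree considered during model selection, are worth trying.
*   Raising `max_iter` to 1000 did not remove the convergence message, and the features span very different ranges (for example, `Elevation` up to 3849 next to binary soil-type indicators). Scaling the features before fitting, and measuring the effect on both convergence and test performance, is the most direct next step.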
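
As a starting point for the scaling experiment, the scaler and the classifier can be combined in a scikit-learn pipeline so that the scaling statistics are learned from the training split only. The sketch below runs on synthetic data with one deliberately inflated column; `X_demo`, `y_demo`, and `scaled_model` are illustrative names, and the printed accuracy is not a result for the forest cover data. Note that `StandardScaler` here also rescales binary indicator columns, which is a simplification.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 7-class data; replace with the notebook's X and y.
X_demo, y_demo = make_classification(
    n_samples=700,
    n_features=10,
    n_informative=6,
    n_classes=7,
    n_clusters_per_class=1,
    random_state=42,
)
X_demo[:, 0] *= 1000.0  # mimic a feature on a much larger scale

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler is fitted on X_tr only, then applied to X_te at predict time.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)

test_accuracy_demo = scaled_model.score(X_te, y_te)
iterations_used = int(scaled_model[-1].n_iter_.max())

print(f"Test accuracy (synthetic data): {test_accuracy_demo:.4f}")
print(f"Solver iterations used: {iterations_used}")
```

Comparing `iterations_used` against `max_iter` with and without the scaler shows directly whether scaling resolves the convergence problem.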
