<h1 style="font-family: 'poppins'; font-weight: bold; color: Blue;">👨‍💻Author: Muhammad Faheem Iqbal</h1>

[![GitHub](https://img.shields.io/badge/GitHub-Profile-blue?style=for-the-badge&logo=github)](https://github.com/FaheemAI1024)
[![Kaggle](https://img.shields.io/badge/Kaggle-Profile-blue?style=for-the-badge&logo=kaggle)](https://www.kaggle.com/muhammadfaheemiqbal) 
[![LinkedIn](https://img.shields.io/badge/LinkedIn-Profile-blue?style=for-the-badge&logo=linkedin)](https://www.linkedin.com/in/muhammad-faheem-iqbal-ai-solutions-architect-b630932ab/)  

[![Facebook](https://img.shields.io/badge/Facebook-Profile-blue?style=for-the-badge&logo=facebook)](https://www.facebook.com/aammar.tufail) 
[![TikTok](https://img.shields.io/badge/TikTok-Profile-blue?style=for-the-badge&logo=tiktok)](https://www.tiktok.com/@data_scientist04?_t=8kW2bLg8CFl&_r=1)
[![HuggingFace](https://img.shields.io/badge/huggingface-Profile-yellow?style=for-the-badge&logo=huggingface)](https://huggingface.co/FaheemAi1024)

[![Twitter/X](https://img.shields.io/badge/Twitter-Profile-blue?style=for-the-badge&logo=twitter)](https://x.com/MFaheem113141?t=__88BWMyKGZcC08sw3SJtA&s=09) 
[![Instagram](https://img.shields.io/badge/Instagram-Profile-blue?style=for-the-badge&logo=instagram)](https://www.instagram.com/i_am_faheeeem?igsh=MXhlcG0zdTZ6Mnl5Yw==) 
[![Email](https://img.shields.io/badge/Email-Contact%20Me-red?style=for-the-badge&logo=email)](mailto:faheemiqbalbwn2002@gmail.com)


This notebook implements the project step by step: **EDA, preprocessing, and applying Naïve Bayes, Decision Tree, and Neural Network models** to the raw apartment-rental dataset.

The dataset ships as a `.7z` archive (`apartments_for_rent_classified_10K.7z`). Extract it with a tool such as [7-Zip](https://www.7-zip.org/), WinRAR, or any other archive manager, then load the extracted `.csv` file.

---

### 🧠 Project Phases

The work goes through these phases:

---

### 🔧 Step 1: **Dataset Exploration**
- Load and preview data
- Summarize features
- Check missing values, data types, class balance

### 🧼 Step 2: **Preprocessing**
- Handle nulls, encoding, and normalization
- Convert rent prices into categories (for classification)
- Correlation analysis

### 🧪 Step 3: **Model Implementation**
- Apply:
  - Naïve Bayes
  - Decision Tree
  - Neural Network (simple MLP)
- Train, tune, and test each

### 📊 Step 4: **Evaluation**
- Accuracy, Confusion Matrix, Precision, Recall
- Compare all 3 models

### 📄 Step 5: **Prepare Research Paper Sections**
- Abstract
- Introduction
- Literature review
- Detailed methodology with screenshots and plots
- Evaluation & result discussion
- Conclusion and references (IEEE format)

---


In [None]:
print("Hello world")

In [6]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np 

In [None]:
# import pandas as pd

# # Path to the CSV file
# csv_file_path = '/kaggle/input/appartment-classify-10k/apartments_for_rent_classified_10K.csv'

# try:
#     # Load the dataset using 'cp1252' encoding and skip problematic lines
#     df = pd.read_csv(csv_file_path, encoding='cp1252', on_bad_lines='skip')
    
#     print(df.shape)
    
# except Exception as e:
#     print("Error loading CSV file:", e)


In [None]:
import pandas as pd

# Load CSV with correct delimiter
df = pd.read_csv('/kaggle/input/appartment-classify-10k/apartments_for_rent_classified_10K.csv', sep=';', encoding='cp1252')

# Display column names to confirm proper parsing
print("Parsed Column Names:")
print(df.columns)



In [None]:
df.head(9)

In [None]:
df.tail()

In [None]:
df.info()

In [None]:
df.dtypes

In [None]:
df.columns

In [None]:
df.index

In [None]:
df.describe()

In [None]:

class_counts = df['category'].value_counts()

class_distribution = df['category'].value_counts(normalize=True) * 100


print("Class Counts:\n", class_counts)
print("\nClass Distribution (%):\n", class_distribution)


In [None]:
# Display a sample of the 'price' column
print("\nSample 'price' column values:")
print(df['price'].head(10))

## Step 2: Preprocessing

In [None]:
print(df.isnull().sum())

In [None]:
# Convert key columns to numeric if they are not already
df['price_numeric'] = pd.to_numeric(df['price'], errors='coerce')
df['bathrooms']    = pd.to_numeric(df['bathrooms'], errors='coerce')
df['bedrooms']     = pd.to_numeric(df['bedrooms'], errors='coerce')
df['square_feet']  = pd.to_numeric(df['square_feet'], errors='coerce')
df['latitude']     = pd.to_numeric(df['latitude'], errors='coerce')
df['longitude']    = pd.to_numeric(df['longitude'], errors='coerce')

# Identify numeric columns for normalization and correlation analysis
numeric_cols = ['price_numeric', 'bathrooms', 'bedrooms', 'square_feet', 'latitude', 'longitude']


In [None]:
print(numeric_cols)

In [None]:
# Fill missing numeric values with the median of each column
for col in numeric_cols:
    median_val = df[col].median()
    df[col] = df[col].fillna(median_val)  # assign back instead of chained inplace fill

# For categorical columns, fill missing values with a placeholder ("Unknown")
# (List selected categorical columns as per your dataset)
categorical_cols = ['category', 'title', 'body', 'amenities', 'currency', 'fee', 'has_photo', 
                    'pets_allowed', 'price_display', 'price_type', 'address', 'cityname', 
                    'state', 'source', 'time']
for col in categorical_cols:
    if col in df.columns:
        df[col] = df[col].fillna("Unknown")  # assign back instead of chained inplace fill

print("\nMissing values after handling:")
print(df.isnull().sum())


# 2. Encoding and Normalization

In [None]:
from sklearn.preprocessing import MinMaxScaler  # Import the MinMaxScaler

# Normalize numerical features using MinMaxScaler for better scale consistency
scaler = MinMaxScaler()
df_scaled = df.copy()  # Create a copy to preserve original values
df_scaled[numeric_cols] = scaler.fit_transform(df[numeric_cols])
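
The cell above fits the scaler on the full DataFrame. When the scaled columns later feed a model, a common alternative is to fit the scaler on the training split only and reuse it on the test split, so no test statistics reach training. A minimal sketch on a toy frame (the `toy` frame and its values are made up for illustration, not taken from the dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the numeric columns; values are illustrative only
toy = pd.DataFrame({
    "square_feet": [400, 650, 800, 950, 1200, 1500, 2000, 2600],
    "bedrooms":    [0, 1, 1, 2, 2, 3, 3, 4],
})

train, test = train_test_split(toy, test_size=0.25, random_state=0)

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learn min/max from the training rows only
test_scaled = scaler.transform(test)        # reuse the training min/max on the test rows

print(train_scaled.min(axis=0), train_scaled.max(axis=0))
```

Test values scaled this way can fall slightly outside [0, 1], which is expected when a test row lies beyond the training range.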

In [None]:
# 3. Convert Rent Prices into Categories

# Define bins for the rent prices using quantiles of the numeric prices
bins = df['price_numeric'].quantile([0, 0.33, 0.66, 1]).values
labels = ['Low', 'Medium', 'High']
df['rent_category'] = pd.cut(df['price_numeric'], bins=bins, labels=labels, include_lowest=True)

# Display the distribution (class balance) of rent categories
print("\nRent Category Distribution:")
print(df['rent_category'].value_counts())
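
The cell above builds the bins by hand from the 0.33 and 0.66 quantiles and passes them to `pd.cut`. As a sketch of an equivalent shortcut, `pd.qcut` computes equal-frequency bins in one call. The `prices` series below is invented for illustration:

```python
import pandas as pd

# Nine illustrative rent values (not from the dataset)
prices = pd.Series([500, 700, 900, 1100, 1300, 1500, 1800, 2200, 3000])

# q=3 asks for terciles, so each label receives roughly a third of the rows
rent_category = pd.qcut(prices, q=3, labels=["Low", "Medium", "High"])

print(rent_category.value_counts().sort_index())
```

With real data, heavy ties at a bin edge can make the groups less even than in this toy case.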



In [None]:
import matplotlib.pyplot as plt  # Correct import for pyplot
import seaborn as sns

# Re-select numeric columns automatically. This overrides the earlier numeric_cols list
# and also picks up integer columns such as 'id' and 'time'.
numeric_cols = df.select_dtypes(include=['number']).columns

# 4. Correlation Analysis

# Calculate the correlation matrix for the numerical columns
corr_matrix = df[numeric_cols].corr()
print("\nCorrelation Matrix:")
print(corr_matrix)

# Plot a heatmap of the correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap of Numerical Features")
plt.show()


### 1. Scatter Plots

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='square_feet', y='price')
plt.title('Scatter Plot of Square Feet vs. Price')
plt.xlabel('Square Feet')
plt.ylabel('Price')
plt.show()


### 2. Bubble Charts

In [None]:
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='square_feet', y='price', size='bedrooms', sizes=(20, 200), alpha=0.5)
plt.title('Bubble Chart of Square Feet vs. Price with Bedrooms as Size')
plt.xlabel('Square Feet')
plt.ylabel('Price')
plt.legend(title='Bedrooms', bbox_to_anchor=(1, 1))
plt.show()

### 3. Heatmaps

In [None]:
# import numpy as np
# import matplotlib.pyplot as plt
# import seaborn as sns

# # Select only numeric columns from the DataFrame
# numeric_df = df.select_dtypes(include=[np.number])

# # Calculate the correlation matrix for numeric columns
# correlation_matrix = numeric_df.corr()

# # Plot the heatmap of the correlation matrix
# plt.figure(figsize=(12, 8))
# sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
# plt.title('Heatmap of Variable Correlations')
# plt.show()


### 4. Bar Charts for Pairwise Correlations

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Select only numeric columns from the DataFrame
numeric_df = df.select_dtypes(include=[np.number])

# Calculate the absolute correlation matrix for numeric columns
correlation_matrix = numeric_df.corr().abs()

# Extract the upper triangle of the correlation matrix (excluding the diagonal)
upper_triangle = correlation_matrix.where(np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool))

# Unstack the matrix, sort the correlation pairs, and drop null values
sorted_pairs = upper_triangle.unstack().sort_values(ascending=False)
strongest_pairs = sorted_pairs[sorted_pairs.notnull()]

# Plot the pairwise correlation strengths as a bar plot
plt.figure(figsize=(10, 8))
strongest_pairs.plot(kind='bar')
plt.title('Pairwise Correlation Strengths')
plt.ylabel('Correlation Coefficient')
plt.show()


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.boxplot(x=df['price_numeric'])
plt.title('Price Distribution (Check Outliers)')
plt.show()

In [None]:
df = df[df['price_numeric'] < 10000]  # Example threshold

In [None]:
df_encoded = pd.get_dummies(df, columns=['currency', 'fee', 'has_photo', 'pets_allowed', 'price_type', 'source', 'state'], drop_first=True)

In [None]:
# from sklearn.model_selection import train_test_split

# X = df_encoded[numeric_cols + [col for col in df_encoded.columns if col not in numeric_cols + ['price', 'price_numeric']]]
# y = df_encoded['price_numeric']

# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


## Model Building 

##  Pre-Step: Prepare Data for Classification

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import classification_report, accuracy_score

# Keep the modeling columns and drop rows with missing values.
# .copy() makes df_clean an independent frame so the column assignments below are safe.
df_clean = df[['price', 'bathrooms', 'bedrooms', 'square_feet', 'cityname', 'state']].dropna().copy()

# Bin price into categories
df_clean['price_category'] = pd.qcut(df_clean['price'], q=3, labels=['Low', 'Medium', 'High'])

# Encode categorical variables
le_city = LabelEncoder()
le_state = LabelEncoder()
df_clean['city_encoded'] = le_city.fit_transform(df_clean['cityname'])
df_clean['state_encoded'] = le_state.fit_transform(df_clean['state'])

# Feature set
X = df_clean[['bathrooms', 'bedrooms', 'square_feet', 'city_encoded', 'state_encoded']]
y = df_clean['price_category']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


## Naïve Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB

nb_model = GaussianNB()
nb_model.fit(X_train, y_train)
y_pred_nb = nb_model.predict(X_test)

print("Naïve Bayes Accuracy:", accuracy_score(y_test, y_pred_nb))
print(classification_report(y_test, y_pred_nb))


## Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

dt_model = DecisionTreeClassifier(max_depth=5, random_state=42)
dt_model.fit(X_train, y_train)
y_pred_dt = dt_model.predict(X_test)

print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_dt))
print(classification_report(y_test, y_pred_dt))


## Neural Network (MLP)

In [None]:
from sklearn.neural_network import MLPClassifier

# Normalize input data for NN
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

mlp_model = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300, random_state=42)
mlp_model.fit(X_train_scaled, y_train)
y_pred_mlp = mlp_model.predict(X_test_scaled)

print("Neural Network Accuracy:", accuracy_score(y_test, y_pred_mlp))
print(classification_report(y_test, y_pred_mlp))
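
Step 4 of the plan lists a confusion matrix alongside accuracy, precision, and recall, while the cells above print only accuracy and the classification report. A minimal sketch of the missing piece, using hand-written labels (the `y_true` and `y_pred` lists are illustrative and are not model output from this notebook):

```python
from sklearn.metrics import confusion_matrix

# Illustrative labels only; in the notebook these would be y_test and a model's predictions
y_true = ["Low", "Low", "Medium", "Medium", "High", "High", "High"]
y_pred = ["Low", "Medium", "Medium", "Medium", "High", "Medium", "High"]

# Fix the label order so rows and columns read Low, Medium, High
labels = ["Low", "Medium", "High"]
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Rows are true classes, columns are predicted classes
print(cm)
```

Passing `labels` explicitly keeps the row and column order stable across the three models, which makes their matrices directly comparable.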

---

# NEW FILE


In [10]:
import pandas as pd

# Load the dataset (update the file path if needed)
df = pd.read_csv('/kaggle/input/appartment-classify-10k/apartments_for_rent_classified_10K.csv', sep=';', encoding='cp1252')
# 1. Basic structure of the dataset
print("Dataset Shape (Rows, Columns):", df.shape)
print("\nColumns:")
print(df.columns.tolist())

# 2. Data Types for each column
print("\nData Types:")
print(df.dtypes)

# 3. Display the first few rows to get a feel for the data
print("\nFirst 5 Rows:")
print(df.head())

# 4. Summary statistics for numerical columns and overall info
print("\nSummary Statistics for Numerical Columns:")
print(df.describe())

# For categorical columns, include object types in the description
print("\nSummary Statistics for All Columns (including categorical):")
print(df.describe(include='all'))

# 5. Check for missing values
print("\nMissing Values per Column:")
print(df.isnull().sum())

# 6. Unique value counts for categorical columns (if any)
print("\nUnique Value Counts for Categorical Columns:")
categorical_columns = df.select_dtypes(include=['object']).columns
for col in categorical_columns:
    print(f"\nColumn: {col}")
    print(df[col].value_counts().head(10))  # Displaying top 10 unique values


# Optionally, save a CSV with this analysis summary
analysis_summary = {
    "shape": df.shape,
    "columns": df.columns.tolist(),
    "dtypes": df.dtypes.to_dict(),
    "missing_values": df.isnull().sum().to_dict(),

}

# Save summary to a CSV file if needed:
summary_df = pd.DataFrame({
    "Column": list(analysis_summary["dtypes"].keys()),
    "DataType": list(analysis_summary["dtypes"].values()),
    "MissingValues": [analysis_summary["missing_values"].get(col, 0) for col in analysis_summary["dtypes"].keys()]
})
summary_df.to_csv("dataset_summary.csv", index=False)
print("\nA summary of the analysis has been saved to 'dataset_summary.csv'.")


Dataset Shape (Rows, Columns): (10000, 22)

Columns:
['id', 'category', 'title', 'body', 'amenities', 'bathrooms', 'bedrooms', 'currency', 'fee', 'has_photo', 'pets_allowed', 'price', 'price_display', 'price_type', 'square_feet', 'address', 'cityname', 'state', 'latitude', 'longitude', 'source', 'time']

Data Types:
id                 int64
category          object
title             object
body              object
amenities         object
bathrooms        float64
bedrooms         float64
currency          object
fee               object
has_photo         object
pets_allowed      object
price              int64
price_display     object
price_type        object
square_feet        int64
address           object
cityname          object
state             object
latitude         float64
longitude        float64
source            object
time               int64
dtype: object

First 5 Rows:
           id                category  \
0  5668626895  housing/rent/apartment   
1  5664597177  housin



In [11]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# 2. Drop truly useless / redundant columns
df = df.drop([
    'id',             # identifier
    'price_display',  # duplicate of price
    'currency', 'fee',# constant columns
    'address',        # too high-cardinality free text
    'body', 'title',  # drop text for tabular baseline
    'category'        # ~99.9% same value
], axis=1)

# 3. Feature engineering
# 3a. Price per square foot
df['price_per_sqft'] = df['price'] / df['square_feet']

# 3b. Time features
df['datetime'] = pd.to_datetime(df['time'], unit='s')
df['month']   = df['datetime'].dt.month
df['weekday'] = df['datetime'].dt.weekday
df['hour']    = df['datetime'].dt.hour

# 4. Missing‐value imputation
# 4a. Numerical
for col in ['bathrooms','bedrooms','latitude','longitude']:
    df[col] = df[col].fillna(df[col].median())

# 4b. Categorical
df['cityname']    = df['cityname'].fillna('Unknown')
df['state']       = df['state'].fillna('Unknown')
df['price_type']  = df['price_type'].fillna('Monthly')
df['has_photo']   = df['has_photo'].fillna('No')
df['source']      = df['source'].fillna('Other')

# 5. Encode “amenities” → binary flags for top 10
amen = (df['amenities']
        .fillna('')  # missing→empty
        .str.get_dummies(sep=','))
# keep only the top 10 most frequent amenities
top10 = amen.sum().sort_values(ascending=False).iloc[:10].index
df = pd.concat([df, amen[top10]], axis=1)
df = df.drop('amenities', axis=1)

# 6. Encode “pets_allowed” → two flags
pets = df['pets_allowed'].fillna('')
df['allows_cats'] = pets.str.contains('Cats').astype(int)
df['allows_dogs'] = pets.str.contains('Dogs').astype(int)
df = df.drop('pets_allowed', axis=1)

# 7. Frequency‐encode high‐cardinality categoricals
for col in ['cityname','state','source']:
    freq = df[col].value_counts(normalize=True)
    df[col + '_freq'] = df[col].map(freq)
df = df.drop(['cityname','state','source'], axis=1)

# 8. One‐hot encode low‐cardinality categoricals
df = pd.get_dummies(df,
                    columns=['price_type','has_photo'],
                    drop_first=True)

# 9. Outlier handling & transformations
# 9a. Clip extreme sqft and price_per_sqft to the 1–99 percentile
for col in ['square_feet','price_per_sqft']:
    lower, upper = df[col].quantile([0.01,0.99])
    df[col] = df[col].clip(lower, upper)

# 9b. Log‐transform skewed targets/features
df['log_price']     = np.log1p(df['price'])
df['log_sqft']      = np.log1p(df['square_feet'])
df['log_price_psf']= np.log1p(df['price_per_sqft'])

# 10. Final cleanup
df = df.drop(['time','datetime','price','square_feet','price_per_sqft'], axis=1)

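# 11. Split into X / y
# NOTE: log_price_psf was computed from price, so keeping it in X gives the
# models information derived from the target.
X = df.drop('log_price', axis=1)
y = df['log_price']  # model log‐rent to stabilize variance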

# 12. Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print("Prepared data shapes:", X_train.shape, X_test.shape, y_train.shape, y_test.shape)


Prepared data shapes: (8000, 28) (2000, 28) (8000,) (2000,)
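
One point worth checking in the feature set above: `log_price_psf` is built from `price`, the same quantity the target comes from. Together with `log_sqft` it determines the price almost exactly (the percentile clipping in step 9a is the only loss), which may account for part of the accuracy the tree-based models reach. A minimal sketch on synthetic numbers, where the `price` and `square_feet` arrays are random stand-ins and not the dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two raw columns
price = rng.uniform(500, 4000, size=1000)
square_feet = rng.uniform(300, 2500, size=1000)

# The two engineered features, built the same way as in the cell above
log_sqft = np.log1p(square_feet)
log_price_psf = np.log1p(price / square_feet)

# Rebuild the target from the two "features" alone
price_rebuilt = np.expm1(log_price_psf) * np.expm1(log_sqft)

print(np.allclose(price_rebuilt, price))
```

If the goal is to predict rent from listing attributes, one option is to drop `log_price_psf` from `X` and compare the scores.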


# Model

In [12]:
import numpy as np
import pandas as pd

from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# --- 1) Reconstruct raw price from your log target ---
# y_train, y_test hold log1p(price).  We invert that to get price.
y_train_price = np.expm1(y_train)
y_test_price  = np.expm1(y_test)

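# --- 2) Bin into 4 classes (0 = lowest rents, 3 = highest) ---
# NOTE: train and test are binned independently here, so the quartile edges
# can differ slightly between the two splits.
y_train_cl = pd.qcut(y_train_price, q=4, labels=False)
y_test_cl  = pd.qcut(y_test_price,  q=4, labels=False)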

print("Class distribution (train):\n", np.bincount(y_train_cl))
print("Class distribution (test):\n",  np.bincount(y_test_cl))

# --- 3) Naïve Bayes (no hyper‐params) ---
nb = GaussianNB()
nb.fit(X_train, y_train_cl)
pred_nb = nb.predict(X_test)

print("\n=== Naïve Bayes ===")
print("Accuracy:", accuracy_score(y_test_cl, pred_nb))
print(classification_report(y_test_cl, pred_nb))

# --- 4) Decision Tree with GridSearchCV ---
dt_param_grid = {
    'max_depth':       [5, 10, 20, None],
    'min_samples_split':[2, 5, 10],
}
dt_gs = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    dt_param_grid,
    cv=5,
    n_jobs=-1,
    verbose=1
)
dt_gs.fit(X_train, y_train_cl)
best_dt = dt_gs.best_estimator_
pred_dt = best_dt.predict(X_test)

print("\n=== Decision Tree ===")
print("Best params:", dt_gs.best_params_)
print("Accuracy:", accuracy_score(y_test_cl, pred_dt))
print(classification_report(y_test_cl, pred_dt))

# --- 5) Simple MLPClassifier with GridSearchCV ---
mlp_param_grid = {
    'hidden_layer_sizes': [(50,), (100,), (50,50)],
    'alpha':              [1e-4, 1e-3, 1e-2],
    'learning_rate':      ['constant', 'adaptive']
}
mlp_gs = GridSearchCV(
    MLPClassifier(max_iter=500, random_state=42),
    mlp_param_grid,
    cv=5,
    n_jobs=-1,
    verbose=1
)
mlp_gs.fit(X_train, y_train_cl)
best_mlp = mlp_gs.best_estimator_
pred_mlp = best_mlp.predict(X_test)

print("\n=== MLPClassifier ===")
print("Best params:", mlp_gs.best_params_)
print("Accuracy:", accuracy_score(y_test_cl, pred_mlp))
print(classification_report(y_test_cl, pred_mlp))


Class distribution (train):
 [2049 1962 1992 1997]
Class distribution (test):
 [501 499 501 499]

=== Naïve Bayes ===
Accuracy: 0.409
              precision    recall  f1-score   support

           0       0.73      0.13      0.22       501
           1       0.35      0.70      0.47       499
           2       1.00      0.00      0.00       501
           3       0.44      0.81      0.57       499

    accuracy                           0.41      2000
   macro avg       0.63      0.41      0.32      2000
weighted avg       0.63      0.41      0.32      2000

Fitting 5 folds for each of 12 candidates, totalling 60 fits

=== Decision Tree ===
Best params: {'max_depth': 20, 'min_samples_split': 2}
Accuracy: 0.918
              precision    recall  f1-score   support

           0       0.92      0.98      0.95       501
           1       0.87      0.91      0.89       499
           2       0.91      0.85      0.88       501
           3       0.97      0.93      0.95       499
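
The class labels above come from calling `pd.qcut` separately on the train and test prices, so the same rent could land in different classes depending on the split it belongs to. A hedged alternative is to learn the quartile edges on the training prices and apply those same edges to the test prices. The sketch below uses random prices as a stand-in (`train_price` and `test_price` are hypothetical names, not variables from this notebook):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic stand-ins for the reconstructed train/test prices
train_price = pd.Series(rng.uniform(500, 4000, size=800))
test_price = pd.Series(rng.uniform(500, 4000, size=200))

# Learn the quartile edges on the training prices only
_, edges = pd.qcut(train_price, q=4, labels=False, retbins=True)

# Open the outer edges so test prices outside the training range still get a class
edges[0], edges[-1] = -np.inf, np.inf

# Apply the same edges to both splits
train_cl = pd.cut(train_price, bins=edges, labels=False)
test_cl = pd.cut(test_price, bins=edges, labels=False)

print(np.bincount(train_cl), np.bincount(test_cl))
```

With shared edges the test classes are no longer forced to be perfectly balanced, which reflects how the classifier would be used on new listings.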

    

In [17]:
from sklearn.preprocessing import MinMaxScaler

# Normalize features to [0,1]
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

nb = GaussianNB()
nb.fit(X_train_scaled, y_train_cl)
pred_nb = nb.predict(X_test_scaled)

print("\n=== Enhanced Naïve Bayes ===")
print("Accuracy:", accuracy_score(y_test_cl, pred_nb))
print(classification_report(y_test_cl, pred_nb))



=== Enhanced Naïve Bayes ===
Accuracy: 0.3855
              precision    recall  f1-score   support

           0       0.00      0.00      0.00       501
           1       0.34      0.74      0.46       499
           2       0.00      0.00      0.00       501
           3       0.44      0.81      0.57       499

    accuracy                           0.39      2000
   macro avg       0.20      0.39      0.26      2000
weighted avg       0.19      0.39      0.26      2000
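
Min-max scaling is not expected to help Gaussian Naïve Bayes much. The model fits one Gaussian per feature and class, and rescaling a feature rescales its mean and variance together. The main way scaling can change predictions is through `var_smoothing`, which is tied to the largest feature variance. That may be why the scores move between the two runs here. A minimal sketch on synthetic data with two features on similar scales (the arrays are random and unrelated to the dataset):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Two overlapping synthetic classes, two features on similar scales
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(1.5, 1.0, size=(200, 2))])
y = np.array([0] * 200 + [1] * 200)

X_scaled = MinMaxScaler().fit_transform(X)

# Fit the same model on raw and scaled features and compare class probabilities
proba_raw = GaussianNB().fit(X, y).predict_proba(X)
proba_scaled = GaussianNB().fit(X_scaled, y).predict_proba(X_scaled)

print(np.allclose(proba_raw, proba_scaled, atol=1e-6))
```

When features differ in scale by many orders of magnitude, the smoothing term can dominate the small-variance features, so tuning `var_smoothing` may matter more than the choice of scaler.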





In [18]:
from sklearn.ensemble import RandomForestClassifier

rf_params = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_leaf': [1, 2, 4]
}
rf_gs = GridSearchCV(
    RandomForestClassifier(random_state=42),
    rf_params,
    cv=5,
    n_jobs=-1,
    verbose=1
)
rf_gs.fit(X_train, y_train_cl)
pred_rf = rf_gs.predict(X_test)

print("\n=== Random Forest ===")
print("Best params:", rf_gs.best_params_)
print("Accuracy:", accuracy_score(y_test_cl, pred_rf))
print(classification_report(y_test_cl, pred_rf))


Fitting 5 folds for each of 18 candidates, totalling 90 fits

=== Random Forest ===
Best params: {'max_depth': None, 'min_samples_leaf': 2, 'n_estimators': 200}
Accuracy: 0.849
              precision    recall  f1-score   support

           0       0.90      0.94      0.92       501
           1       0.79      0.85      0.82       499
           2       0.79      0.76      0.78       501
           3       0.92      0.85      0.88       499

    accuracy                           0.85      2000
   macro avg       0.85      0.85      0.85      2000
weighted avg       0.85      0.85      0.85      2000



In [19]:
mlp_param_grid = {
    'hidden_layer_sizes': [(100,), (100, 50), (128, 64)],
    'alpha': [1e-4, 1e-3],
    'learning_rate': ['constant', 'adaptive'],
    'activation': ['relu', 'tanh'],
}
mlp_gs = GridSearchCV(
    MLPClassifier(max_iter=1000, early_stopping=True, random_state=42),
    mlp_param_grid,
    cv=5,
    n_jobs=-1,
    verbose=1
)
mlp_gs.fit(X_train, y_train_cl)
pred_mlp = mlp_gs.predict(X_test)

print("\n=== Enhanced MLPClassifier ===")
print("Best params:", mlp_gs.best_params_)
print("Accuracy:", accuracy_score(y_test_cl, pred_mlp))
print(classification_report(y_test_cl, pred_mlp))


Fitting 5 folds for each of 24 candidates, totalling 120 fits

=== Enhanced MLPClassifier ===
Best params: {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (100, 50), 'learning_rate': 'constant'}
Accuracy: 0.863
              precision    recall  f1-score   support

           0       0.91      0.92      0.92       501
           1       0.82      0.84      0.83       499
           2       0.82      0.79      0.81       501
           3       0.91      0.90      0.90       499

    accuracy                           0.86      2000
   macro avg       0.86      0.86      0.86      2000
weighted avg       0.86      0.86      0.86      2000

