<center>

### COSC2753 - Machine Learning

# **Testing Data Prediction**

<center>────────────────────────────</center>
&nbsp;

# I. Global Configuration

In [61]:
import sys
import importlib
import tabulate
import pandas as pd
import numpy as np
import sklearn
import statsmodels
import imblearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Reload modules
sys.path.append("../")  # Root directory
modules_to_reload = [
    "scripts.styler",
    "scripts.neko",
    "scripts.utils",
]

# Reload modules if they have been modified
missing_modules = []

for module_name in modules_to_reload:
    if module_name in sys.modules:
        importlib.reload(sys.modules[module_name])
    else:
        missing_modules.append(module_name)

# Recache missing modules
if missing_modules:
    print(f"Modules {missing_modules} not found. \nRecaching...")

# Import user-defined scripts
from scripts.styler import Styler
from scripts.neko import Neko
from scripts.utils import Utils

# Initialize styler
styler = Styler()  # Text Styler

# Check package versions
styler.draw_box("Validating Package Versions...")

try:
    with open("../requirements.txt", "r") as file:
        requirements = file.readlines()
except FileNotFoundError:
    # Keep the name defined so the version checks below do not raise NameError
    requirements = []
    print("File '../requirements.txt' not found. Please check your directory!")

packages_to_check = [np, pd, tabulate, sklearn, statsmodels, imblearn]

for package in packages_to_check:
    Utils.version_check(package, requirements=requirements)

styled_text = styler.style("\nDone validating packages\n", bold=True, italic=True)
print(styled_text)

# Initialize objects
styler.draw_box("Initializing Project...")
neko = Neko()  # pandas extension
bullet = ">>>"  # Bullet point

# Configuration
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
pd.set_option("display.precision", 3)

styled_text = styler.style("Done initializing project...", bold=True, italic=True)
print(styled_text)

┌──────────────────────────────────┐
│  Validating Package Versions...  │
└──────────────────────────────────┘
>>> numpy is up to date: 1.26.4
>>> pandas is up to date: 2.2.1
>>> tabulate is up to date: 0.9.0
>>> sklearn is up to date: 1.4.1.post1
>>> statsmodels is up to date: 0.14.1
>>> imblearn is up to date: 0.12.2
[1m[3m
Done validating packages
[0m
┌───────────────────────────┐
│  Initializing Project...  │
└───────────────────────────┘

    /\_____/\
   /  x   o  \
  ( ==  ^  == )       Neko has arrived!
   )         (        An data visualizing extension for analyzing DataFrames.
  (           )       Art: https://www.asciiart.eu/animals/cats.
 ( (  )   (  ) )
(__(__)___(__)__)

[1m[3mDone initializing project...[0m


# II. Data Loading

In [62]:
try:
    # Load data
    df_test = pd.read_csv("../data/test/data_test.csv")
    df_result = pd.read_csv("../results/COSC2753_A1_Predictions_S3927776.csv")
    df_train = pd.read_csv("../data/processed/data_train_processed.csv")

    styler.draw_box("Data Loaded Successfully")

except FileNotFoundError:
    print("Error: File not found. Please check the file path.")
except Exception as e:
    print("An error occurred:", e)

┌────────────────────────────┐
│  Data Loaded Successfully  │
└────────────────────────────┘


In [63]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50736 entries, 0 to 50735
Data columns (total 25 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Id                    50736 non-null  int64  
 1   HighBP                50736 non-null  int64  
 2   HighChol              50736 non-null  int64  
 3   CholCheck             50736 non-null  int64  
 4   BMI                   50736 non-null  int64  
 5   Smoker                50736 non-null  int64  
 6   Stroke                50736 non-null  int64  
 7   HeartDiseaseorAttack  50736 non-null  int64  
 8   PhysActivity          50736 non-null  int64  
 9   Fruits                50736 non-null  int64  
 10  Veggies               50736 non-null  int64  
 11  HvyAlcoholConsump     50736 non-null  int64  
 12  AnyHealthcare         50736 non-null  int64  
 13  NoDocbcCost           50736 non-null  int64  
 14  GenHlth               50736 non-null  int64  
 15  MentHlth           

In [64]:
df_result.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50736 entries, 0 to 50735
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Id      50736 non-null  int64  
 1   Status  0 non-null      float64
dtypes: float64(1), int64(1)
memory usage: 792.9 KB
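
The `info()` output shows that the results file currently holds every `Id` but no `Status` values, so it appears to act as a submission template. A quick sanity check, sketched here on toy frames rather than the real files, is to confirm that the template lists the same IDs in the same order as the test set. The frames `test_ids` and `template` are illustrative stand-ins for `df_test` and `df_result`.

```python
import pandas as pd

# Toy stand-ins for df_test and df_result (values are illustrative only)
test_ids = pd.DataFrame({"Id": [202944, 202945, 202946]})
template = pd.DataFrame(
    {"Id": [202944, 202945, 202946], "Status": [None, None, None]}
)

same_length = len(test_ids) == len(template)
same_ids = test_ids["Id"].equals(template["Id"])   # same values, same order
status_empty = bool(template["Status"].isna().all())  # nothing filled in yet

print(same_length, same_ids, status_empty)
```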


# III. Data Prediction

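Building on the previous notebook, this section fits a **Random Forest model** on the processed training data and uses it to predict `Status` for the test set. As previously discussed, the test data needs no additional preprocessing; only the `Id` and `Status` columns are dropped before prediction.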

In [65]:
# Keep the patient IDs, then drop the non-feature columns from df_test
patient_id = df_test["Id"].values
df_test.drop(columns=["Id", "Status"], inplace=True)

# Separate features and target in the training data
X_train = df_train.drop(columns=["Status"])
y_train = df_train["Status"]
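
Because the model is fitted on `X_train` and then applied to `df_test`, both frames need the same feature columns. The sketch below, on toy data, shows one way to verify this and to reorder the test columns to match the training order before predicting. The frames `X_train_toy` and `X_test_toy` are illustrative and reuse only a few of the real column names.

```python
import pandas as pd

# Toy frames: same features, but in a different column order
X_train_toy = pd.DataFrame({"HighBP": [1, 0], "BMI": [30, 24], "Smoker": [0, 1]})
X_test_toy = pd.DataFrame({"BMI": [27], "Smoker": [1], "HighBP": [0]})

# Report any column present in only one of the two frames
missing = set(X_train_toy.columns) - set(X_test_toy.columns)
extra = set(X_test_toy.columns) - set(X_train_toy.columns)
if missing or extra:
    raise ValueError(f"Column mismatch. Missing: {missing}, unexpected: {extra}")

# Reorder the test columns to follow the training order
X_test_aligned = X_test_toy[X_train_toy.columns]
print(list(X_test_aligned.columns))
```

Recent scikit-learn versions also validate feature names at prediction time when the model was fitted on a DataFrame, so an explicit check like this mainly gives an earlier and clearer error message.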

In [66]:
styler.draw_box("Performing Random Forest Classification...")

rf = RandomForestClassifier(
    n_estimators=225,
    min_samples_split=17,
    min_samples_leaf=4,
    max_features=None,
    max_depth=40,
    criterion="entropy",
    ccp_alpha=0.0,
)

rf.fit(X_train, y_train)

# Make predictions on df_test
pred_y = rf.predict(df_test)

# Pair each patient ID with its predicted status
# (the preview and the CSV export happen in the following cells)
df_pred = pd.DataFrame({"Id": patient_id, "Status": pred_y})

styler.draw_box("Random Forest Classification Complete")

┌──────────────────────────────────────────────┐
│  Performing Random Forest Classification...  │
└──────────────────────────────────────────────┘
┌─────────────────────────────────────────┐
│  Random Forest Classification Complete  │
└─────────────────────────────────────────┘
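
The classifier above is constructed without a `random_state`, so the bootstrap samples differ between runs and the predictions may not be exactly reproducible. The sketch below illustrates, on synthetic data and with a much smaller forest, how fixing the seed makes two independent fits agree. The seed value `42`, the helper `fit_predict`, and the dataset are assumptions for illustration and are not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data (not the project's dataset)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)


def fit_predict(seed):
    """Fit a small forest with a fixed seed and predict the held-out rows."""
    model = RandomForestClassifier(
        n_estimators=25, criterion="entropy", random_state=seed
    )
    model.fit(X[:150], y[:150])
    return model.predict(X[150:])


pred_a = fit_predict(42)
pred_b = fit_predict(42)

# With the same seed and the same data, both runs give the same predictions
identical = bool((pred_a == pred_b).all())
print(identical)
```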


In [68]:
df_pred.head()

|   | Id     | Status |
|---|--------|--------|
| 0 | 202944 | 0      |
| 1 | 202945 | 1      |
| 2 | 202946 | 1      |
| 3 | 202947 | 0      |
| 4 | 202948 | 1      |
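
Beyond previewing the first rows, it can help to look at the share of each predicted class, since a heavily skewed split may hint at a problem with the model or the inputs. The sketch below computes that share for the five rows shown above. These rows are only a preview and are not representative of the full set of predictions.

```python
import pandas as pd

# The five preview rows from df_pred.head()
df_pred_toy = pd.DataFrame(
    {
        "Id": [202944, 202945, 202946, 202947, 202948],
        "Status": [0, 1, 1, 0, 1],
    }
)

# Fraction of rows per predicted class, ordered by class label
share = df_pred_toy["Status"].value_counts(normalize=True).sort_index()
print(share)
```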


In [67]:
try:
    df_pred.to_csv("../results/COSC2753_A1_Predictions_S3927776.csv", index=False)
    print("Predictions saved to 'COSC2753_A1_Predictions_S3927776.csv'")

except OSError as e:
    # to_csv raises OSError (not only FileNotFoundError) when the target directory is missing
    print("Error: Could not save predictions. Please check the file path:", e)

Predictions saved to 'COSC2753_A1_Predictions_S3927776.csv'
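
After saving, a short read-back check can confirm that the file has the expected shape. The sketch below writes a toy prediction frame to a temporary directory, reloads it, and runs a few checks. The file name `predictions.csv` is hypothetical, and the check that labels are limited to 0 and 1 is an assumption based on the preview rows shown earlier.

```python
import os
import tempfile

import pandas as pd

df_pred_toy = pd.DataFrame(
    {"Id": [202944, 202945, 202946], "Status": [0, 1, 1]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "predictions.csv")  # hypothetical file name
    df_pred_toy.to_csv(path, index=False)
    df_check = pd.read_csv(path)

checks = {
    "columns": list(df_check.columns) == ["Id", "Status"],
    "row_count": len(df_check) == len(df_pred_toy),
    "no_missing": not df_check["Status"].isna().any(),
    "unique_ids": bool(df_check["Id"].is_unique),
    # Assumes a binary target encoded as 0/1
    "binary_labels": set(df_check["Status"].unique()) <= {0, 1},
}
print(checks)
```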
