# **OPEN-ARC**
---

### Project 1: Liver Cirrhosis Stage Classification Model:
**Challenge:** Create an AI model capable of classifying a patient's stage of liver cirrhosis from the feature values in the patient's data. This project is part of OPEN-ARC, a collaborative research project that aims to improve AI solutions for everyone.


### Terms and Use:
Learn more about the project's [LICENSE](https://github.com/Infinitode/OPEN-ARC/blob/main/LICENSE) and read our [CODE_OF_CONDUCT](https://github.com/Infinitode/OPEN-ARC/blob/main/CODE_OF_CONDUCT) before contributing to the project. You can contribute to this project from here: [https://github.com/Infinitode/OPEN-ARC/](https://github.com/Infinitode/OPEN-ARC/).

---

Please fill out this performance sheet to help others quickly see your model's performance **(optional)**:

### Performance Sheet:
| Contributor | Architecture Type | Platform | Base Model | Dataset | Accuracy | Link |
|-------------|-------------------|----------|------------|---------|----------|------|
| Infinitode  | RandomForestClassifier  | Kaggle   | ✗  | Liver Cirrhosis Stage Classification 🩺 | 95.2%    | [Notebook](https://github.com/Infinitode/OPEN-ARC/Project-1-LCSC/project-1-lcsc.ipynb) |
| Username  | Unknown  | Kaggle   | ✗/✔  | Liver Cirrhosis Stage Classification 🩺 | Score    | [Notebook](https://github.com) |

---

### Model: Decision Tree Classifier:
This model uses **Grid Search** to tune the Decision Tree's hyperparameters. Grid Search trains and cross-validates the model on every combination of the candidate parameter values, then keeps the combination with the best mean cross-validation accuracy.
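To make the size of the search concrete, here is a minimal sketch that enumerates the same parameter grid used in the cell below with the standard library only. It is an illustration of what a grid search iterates over, not scikit-learn's actual implementation; the variable names `combinations`, `n_candidates`, and `n_cv_fits` are introduced here for the example.

```python
from itertools import product

# Same candidate values as the param_grid used in this notebook
param_grid = {
    'max_depth': [3, 5, 7, None],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy'],
}

# Build every combination that takes one value per parameter
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

cv_folds = 5  # matches cv=5 in the notebook
n_candidates = len(combinations)
n_cv_fits = n_candidates * cv_folds

print(f"{n_candidates} candidate settings, {n_cv_fits} cross-validation fits")
print(combinations[0])
```

With 4 x 3 x 2 = 24 candidate settings and 5 folds, the search trains 120 trees during cross-validation, so adding more values to the grid increases the training time quickly.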

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the dataset
data = pd.read_csv('/kaggle/input/liver-cirrhosis-stage-classification/liver_cirrhosis.csv')

# Preprocess the data using a LabelEncoder to transform text data into numerical data
label_encoder = LabelEncoder()
data['Sex'] = label_encoder.fit_transform(data['Sex'])
data['Edema'] = label_encoder.fit_transform(data['Edema'])
data['Status'] = label_encoder.fit_transform(data['Status'])
data['Drug'] = label_encoder.fit_transform(data['Drug'])
data['Ascites'] = label_encoder.fit_transform(data['Ascites'])
data['Hepatomegaly'] = label_encoder.fit_transform(data['Hepatomegaly'])
data['Spiders'] = label_encoder.fit_transform(data['Spiders'])

X = data.drop('Stage', axis=1)
y = data['Stage']

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training data only, so no test-set statistics leak into training
numerical_cols = X.select_dtypes(include=['float64', 'int64']).columns
scaler = MinMaxScaler()
X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])

# Tune the decision tree model using GridSearchCV
param_grid = {
    'max_depth': [3, 5, 7, None],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']
}

dt_grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5, scoring='accuracy')
dt_grid_search.fit(X_train, y_train)

best_params = dt_grid_search.best_params_
best_dt_model = dt_grid_search.best_estimator_

# Evaluate the best decision tree model on the test set
y_pred = best_dt_model.predict(X_test)
test_acc = accuracy_score(y_test, y_pred)
print(f"Decision Tree Test Accuracy: {test_acc}")

Decision Tree Test Accuracy: 0.9226


If you run the code cell above, you should get a testing accuracy of around 91% - 92%, which is not bad but could be improved upon, and that is what we'll do next.
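For reference, the accuracy score reported above is simply the fraction of test samples whose predicted stage matches the true stage. The sketch below shows the arithmetic on a small set of hypothetical labels; `y_true` and `y_hat` are made-up values for illustration and are not taken from the dataset.

```python
# Hypothetical true and predicted stages, for illustration only
y_true = [1, 2, 3, 3, 2, 1, 3, 2]
y_hat = [1, 2, 3, 2, 2, 1, 3, 3]

# Count the positions where the prediction matches the true label
correct = sum(t == p for t, p in zip(y_true, y_hat))
accuracy = correct / len(y_true)

print(f"{correct} of {len(y_true)} correct, accuracy = {accuracy}")
```

Here 6 of the 8 predictions match, giving an accuracy of 0.75.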

### Model: Random Forest Classifier:
This model is a **Random Forest Classifier**. So far we've used a single Decision **Tree**; now we have a **Forest** of them, which in many cases improves performance considerably. This implementation also includes **Feature Selection** to improve the model's accuracy score.
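As a rough illustration of the feature selection step: when `SelectFromModel` wraps a random forest and no threshold is given, it keeps the features whose importance is at least the mean importance. The sketch below reproduces that rule in plain Python. The feature names and importance values are hypothetical and are not results from this notebook.

```python
# Hypothetical feature importances, for illustration only
importances = {
    'feature_a': 0.40,
    'feature_b': 0.30,
    'feature_c': 0.20,
    'feature_d': 0.10,
}

# Mean importance acts as the cut-off
threshold = sum(importances.values()) / len(importances)

# Keep only the features at or above the cut-off
selected = [name for name, score in importances.items() if score >= threshold]

print(f"threshold = {threshold:.2f}, selected = {selected}")
```

In this example the cut-off is 0.25, so only `feature_a` and `feature_b` are kept and the forest is then retrained on those two columns.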

In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.metrics import accuracy_score

# Load the dataset
data = pd.read_csv('/kaggle/input/liver-cirrhosis-stage-classification/liver_cirrhosis.csv')

# Preprocess the data, using the LabelEncoder for text values
label_encoder = LabelEncoder()
data['Sex'] = label_encoder.fit_transform(data['Sex'])
data['Edema'] = label_encoder.fit_transform(data['Edema'])
data['Status'] = label_encoder.fit_transform(data['Status'])
data['Drug'] = label_encoder.fit_transform(data['Drug'])
data['Ascites'] = label_encoder.fit_transform(data['Ascites'])
data['Hepatomegaly'] = label_encoder.fit_transform(data['Hepatomegaly'])
data['Spiders'] = label_encoder.fit_transform(data['Spiders'])

X = data.drop('Stage', axis=1)
y = data['Stage']

# Identify numerical columns for scaling
numerical_cols = X.select_dtypes(include=['float64', 'int64']).columns

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and fit the scaler on the training data
scaler = MinMaxScaler()
X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])

# Perform feature selection to exclude some features
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
selector = SelectFromModel(estimator=rf_model)
selector.fit(X_train, y_train)
selected_features = X_train.columns[selector.get_support()]

# Create a new dataset with only the selected features
X_train_selected = X_train[selected_features]
X_test_selected = X_test[selected_features]

# Train the Random Forest model on the selected features
rf_model.fit(X_train_selected, y_train)

# Evaluate the model on the test set
y_pred = rf_model.predict(X_test_selected)
test_acc = accuracy_score(y_test, y_pred)
print(f"Random Forest Test Accuracy (with feature selection): {test_acc}")

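# Function to take user input and predict the stage of cirrhosis
def predict_cirrhosis_stage():
    input_data = {}
    for feature in selected_features:
        value = input(f"Enter the value for '{feature}': ")
        input_data[feature] = [float(value)]

    # The scaler was fitted on every numerical column, so build a full row first.
    # Unselected columns are filled with 0; MinMaxScaler scales each column
    # independently, so the filler values do not affect the selected features.
    input_df = pd.DataFrame(input_data).reindex(columns=X_train.columns, fill_value=0.0)

    # Scale the input data
    input_df[numerical_cols] = scaler.transform(input_df[numerical_cols])

    # Make prediction using only the selected features
    prediction = rf_model.predict(input_df[selected_features])
    print(f"Predicted stage of liver cirrhosis: {prediction[0]}")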

Random Forest Test Accuracy (with feature selection): 0.956


As you can see, the Random Forest Classifier performs better, giving us a final testing accuracy of about 95.6%. You can also run the cell below to test the model yourself on your own data or on data from other datasets or sources.

In [None]:
# Call the prediction function to test the model
predict_cirrhosis_stage()

### The End:
This is the end of this project notebook. Make sure to experiment and contribute to help improve the model and implementation. You can browse more of the free, open-source projects on our GitHub repository: [https://github.com/Infinitode/OPEN-ARC](https://github.com/Infinitode/OPEN-ARC). If you like this project, make sure to star the repo and contribute your implementation, or help others in the community.

~ Infinitode