## **Import Required Libraries**
--- 
Import the libraries needed for data manipulation and modeling: **`pandas`** for data analysis, **`LabelEncoder`** for encoding categorical values as numbers, and **`train_test_split`** for splitting the data into training and testing sets.

In [2]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

## **Load Data**
---
Read the data from the **`Bankruptcy.csv`** file into a pandas DataFrame and display the first 5 rows to understand its structure.

In [3]:
data1=pd.read_csv('Bankruptcy.csv')
data1.head()

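| | industrial_risk | management_risk | financial_flexibility | credibility | competitiveness | operating_risk | class | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 | ... | Unnamed: 16 | Unnamed: 17 | Unnamed: 18 | Unnamed: 19 | Unnamed: 20 | Unnamed: 21 | Unnamed: 22 | Unnamed: 23 | Unnamed: 24 | Unnamed: 25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.5 | 1.0 | 0.0 | 0.0 | 0.0 | 0.5 | bankruptcy | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | bankruptcy | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | bankruptcy | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 0.5 | 0.0 | 0.0 | 0.5 | 0.0 | 1.0 | bankruptcy | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | bankruptcy | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |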


## **Data Preprocessing**

---
Keep only the columns up to and including **`class`**, which drops all of the empty `Unnamed` columns, then check whether any null values remain.

In [4]:
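# Take an explicit copy so that adding columns later does not trigger SettingWithCopyWarning
data = data1.loc[:, :'class'].copy()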
data.isnull().sum()

industrial_risk          0
management_risk          0
financial_flexibility    0
credibility              0
competitiveness          0
operating_risk           0
class                    0
dtype: int64
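
The notebook does not say where the `Unnamed` columns come from. One plausible cause, assumed here, is trailing delimiters in the CSV. The sketch below reproduces that situation with a small in-memory file (hypothetical values, not the real data) and shows an alternative to label slicing: dropping every column that is entirely empty.

```python
import io

import pandas as pd

# Hypothetical CSV text with two trailing commas per line, which pandas
# reads as two extra, empty columns.
csv_text = (
    "industrial_risk,class,,\n"
    "0.5,bankruptcy,,\n"
    "0.0,bankruptcy,,\n"
)

raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))

# Drop every column whose values are all missing.
cleaned = raw.dropna(axis=1, how="all")
print(list(cleaned.columns))
print(cleaned.isnull().sum())
```

Slicing with `.loc[:, :'class']` relies on the column order, while `dropna(axis=1, how="all")` relies on the extra columns being completely empty; which one is safer depends on the file.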

---
As a feature-engineering step, use **`LabelEncoder`** to encode the categorical **`class`** column into a numerical format and store the result in a new **`class_encoded`** column. Then display the first 5 rows of the modified DataFrame.

In [5]:
le = LabelEncoder()
data['class_encoded'] = le.fit_transform(data['class'])
data.head()

| | industrial_risk | management_risk | financial_flexibility | credibility | competitiveness | operating_risk | class | class_encoded |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.5 | 1.0 | 0.0 | 0.0 | 0.0 | 0.5 | bankruptcy | 0 |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | bankruptcy | 0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | bankruptcy | 0 |
| 3 | 0.5 | 0.0 | 0.0 | 0.5 | 0.0 | 1.0 | bankruptcy | 0 |
| 4 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | bankruptcy | 0 |
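
To see which integer each label received, it can help to inspect the fitted encoder. The sketch below uses a toy label list; `non-bankruptcy` is an assumed name for the second class, since only `bankruptcy` appears in the output above. `LabelEncoder` assigns codes in sorted label order and can map codes back with `inverse_transform`.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels; "non-bankruptcy" is a hypothetical second class name.
labels = ["bankruptcy", "non-bankruptcy", "bankruptcy"]

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

# classes_ holds the labels in sorted order; the position is the code.
mapping = {label: code for code, label in enumerate(le_demo.classes_)}
print(mapping)

# Convert the integer codes back to the original labels.
decoded = le_demo.inverse_transform(encoded)
print(list(decoded))
```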


## **Feature Dependency**
---

Calculate the correlation between each feature and the target variable `class_encoded`. First, create a new DataFrame `df` that excludes the original `class` column. Then calculate the correlation of each column with `class_encoded`, sort the values in descending order, and print them.

This indicates how strongly each feature is linearly related to the output.


In [6]:
df=data.loc[:, data.columns !='class']
correlation = df.corr()['class_encoded'].sort_values(ascending=False)
print(correlation)

class_encoded            1.000000
competitiveness          0.899452
credibility              0.755909
financial_flexibility    0.751020
industrial_risk         -0.227823
operating_risk          -0.279786
management_risk         -0.370838
Name: class_encoded, dtype: float64


From the **correlation** values, the features with the strongest relationship to the target are **`competitiveness`**, **`credibility`**, and **`financial_flexibility`**.

The remaining features have weaker, negative correlations, but they are still kept, so all six features are used.
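
Sorting by signed value places negative correlations last even when they are moderately strong. A minimal sketch of ranking by absolute correlation instead is shown below; the frame and its column names (`feature_a`, `feature_b`, `feature_c`, `target`) are made up for illustration and are not the notebook's data.

```python
import pandas as pd

# Hypothetical toy data: one positively related feature, one negatively
# related feature, and one unrelated feature.
toy = pd.DataFrame({
    "feature_a": [0.0, 0.0, 0.5, 1.0, 1.0, 1.0],
    "feature_b": [1.0, 1.0, 0.5, 0.5, 0.0, 0.0],
    "feature_c": [0.5, 1.0, 0.0, 1.0, 0.0, 0.5],
    "target":    [0, 0, 0, 1, 1, 1],
})

# Correlation of every feature with the target, excluding the target itself.
corr = toy.corr()["target"].drop("target")

# Order by strength (absolute value) while keeping the original sign.
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Keeping the sign in the ranked output preserves the direction of each relationship, which matters when interpreting what a high feature value implies for the encoded class.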

---
Split the data into features (`X`) and target (`y`). `X` contains all the columns from the beginning up to `operating_risk`, while `y` contains the `class_encoded` column. Then display the first 5 rows of both `X` and `y`.


In [7]:
X,y=data.loc[:, :'operating_risk'], data['class_encoded']
X.head(),y.head()

(   industrial_risk  management_risk  financial_flexibility  credibility  \
 0              0.5              1.0                    0.0          0.0   
 1              0.0              1.0                    0.0          0.0   
 2              1.0              0.0                    0.0          0.0   
 3              0.5              0.0                    0.0          0.5   
 4              1.0              1.0                    0.0          0.0   
 
    competitiveness  operating_risk  
 0              0.0             0.5  
 1              0.0             1.0  
 2              0.0             1.0  
 3              0.0             1.0  
 4              0.0             1.0  ,
 0    0
 1    0
 2    0
 3    0
 4    0
 Name: class_encoded, dtype: int64)

## **Training**
---
Split the data into training and testing sets: 80% of the data is used for training and 20% for testing. `random_state` is fixed so that the split is reproducible. Finally, display the shapes of the training and testing sets.

In [8]:
trainx, testx, trainy, testy = train_test_split(X, y, test_size=0.2, random_state=73)
trainx.shape, testx.shape, trainy.shape, testy.shape

((200, 6), (50, 6), (200,), (50,))
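
The split above is purely random. If the classes are imbalanced, a stratified split keeps the class proportions the same in both sets. The sketch below is an optional variation, not what the notebook does, and uses made-up arrays (`X_demo`, `y_demo`) with a 15:5 class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 rows, 2 features, 15 samples of class 0 and 5 of class 1.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0] * 15 + [1] * 5)

# stratify=y_demo preserves the 3:1 class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=73, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print("class 1 in train:", int(y_tr.sum()), "| class 1 in test:", int(y_te.sum()))
```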

## **Import Models**
---

Import the machine learning models that will be used for classification: `LogisticRegression`, `RandomForestClassifier`, and `GaussianNB`.

In [9]:

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB


---

Initialize, train, and make predictions with the three imported models. Each model is first instantiated, then trained on the training data (`trainx`, `trainy`), and finally used to make predictions on the testing data (`testx`).

In [10]:
model1=LogisticRegression()
model2=RandomForestClassifier()
model3=GaussianNB()
model1.fit(trainx,trainy)
model2.fit(trainx,trainy)
model3.fit(trainx,trainy)
pred1=model1.predict(testx)
pred2=model2.predict(testx)
pred3=model3.predict(testx)
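
The same fit-then-predict pattern can also be written as a loop over a dictionary of models, which scales more easily if further classifiers are added. The sketch below assumes synthetic data from `make_classification` in place of the bankruptcy features; the dictionary keys and the `random_state` values are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the real features and target (6 features, 2 classes).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=73
)

models = {
    "logistic_regression": LogisticRegression(),
    "random_forest": RandomForestClassifier(random_state=0),
    "gaussian_nb": GaussianNB(),
}

# Fit each model on the training set and store its test-set predictions.
predictions = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    predictions[name] = model.predict(X_te)

for name, pred in predictions.items():
    print(name, pred[:5])
```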

## **Evaluation**
---

Evaluate the performance of the three models using the accuracy score. Import the `accuracy_score` function from `sklearn.metrics`, calculate the accuracy for each model by comparing the predicted values with the actual values, and then print the accuracy of each model.

In [11]:
from sklearn.metrics import accuracy_score
accuracy1=accuracy_score(testy,pred1)
print("Accuracy of Logistic Regression:", accuracy1)
accuracy2=accuracy_score(testy,pred2)
print("Accuracy of Random Forest Classifier:", accuracy2)
accuracy3=accuracy_score(testy,pred3)
print("Accuracy of Gaussian Naive Bayes:", accuracy3)

Accuracy of Logistic Regression: 1.0
Accuracy of Random Forest Classifier: 1.0
Accuracy of Gaussian Naive Bayes: 0.98
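
These scores come from a single 50-row test set, so they may vary with the split. One way to get a more stable estimate is k-fold cross-validation. The sketch below is an optional addition rather than part of the original workflow; it uses synthetic data as a stand-in, and `cv=5` is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real features and target.
X_demo, y_demo = make_classification(n_samples=250, n_features=6, random_state=0)

# Train and score the model on 5 different train/validation splits.
scores = cross_val_score(
    LogisticRegression(), X_demo, y_demo, cv=5, scoring="accuracy"
)

print("fold accuracies:", scores)
print("mean:", scores.mean(), "std:", scores.std())
```

Reporting the mean together with the spread across folds gives a better sense of how much the accuracy depends on which rows end up in the test set.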


## **Save Models**
---

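Save the three trained models and the fitted `LabelEncoder` to the `models/` directory with `joblib`, so they can be reloaded later without retraining.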

In [12]:
import os
import joblib

# Make sure the output directory exists before writing to it
os.makedirs('models', exist_ok=True)
joblib.dump(model1, 'models/logistic_regression_model.pkl')
joblib.dump(model2, 'models/random_forest_model.pkl')
joblib.dump(model3, 'models/gaussian_nb_model.pkl')
joblib.dump(le, 'models/label_encoder.pkl')


['models/label_encoder.pkl']
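
A saved model is typically reloaded together with its encoder so that integer predictions can be turned back into class names. The sketch below shows that round trip under stated assumptions: a tiny two-feature toy dataset, a hypothetical `non-bankruptcy` label, and a temporary directory instead of `models/`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Toy data: class names and feature values are hypothetical.
labels = np.array(["bankruptcy", "non-bankruptcy"] * 10)
X_demo = np.array([[0.0, 0.0], [1.0, 1.0]] * 10)

le_demo = LabelEncoder()
y_demo = le_demo.fit_transform(labels)
model = LogisticRegression().fit(X_demo, y_demo)

# Save both objects, then load them back as a separate script would.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")
    encoder_path = os.path.join(tmp_dir, "label_encoder.pkl")
    joblib.dump(model, model_path)
    joblib.dump(le_demo, encoder_path)
    loaded_model = joblib.load(model_path)
    loaded_encoder = joblib.load(encoder_path)

# Predict on new rows and decode the integer classes into label names.
new_rows = np.array([[0.0, 0.0], [1.0, 1.0]])
predicted_codes = loaded_model.predict(new_rows)
predicted_labels = loaded_encoder.inverse_transform(predicted_codes)
print(list(predicted_labels))
```

Files written by `joblib` are pickles, so they should only be loaded from trusted sources and ideally with the same scikit-learn version that produced them.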