# Name: Margaret Nguyen 

# Data 300: Final – Design choice and Time space complexity

For this final, we will fit two classification models, a Support Vector Machine (SVM) and a Random Forest classifier, on one dataset. We will tweak some parameters and compare how well the models perform on the data across those parameter settings. This is your chance to make some design choices on your models and how you work with data. Have fun with it!

You can find the data [here](https://github.com/KennedyOdongo/DATA-300-Statistical-Machine-Learning-Fall-2023-/blob/main/Data/company_bankruptcy_data.csv) on the course website. A description of the dataset is available on the University of California Irvine (UCI) machine learning repository [here](https://archive.ics.uci.edu/dataset/572/taiwanese+bankruptcy+prediction).

## I. Exploratory Data Analysis (EDA)

## A. Report findings

In [1]:
# Import modules
import pandas as pd # v 1.4.4
import numpy as np # v 1.21.5
from sklearn.utils import resample # v 1.0.2

# Classifier
from sklearn.ensemble import RandomForestClassifier # v 1.0.2
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix # v 1.0.2
from sklearn.model_selection import train_test_split # v 1.0.2
from sklearn.svm import SVC
from sklearn.tree import export_text # v 1.0.2

# Plotting libraries
import matplotlib.pyplot as plt # v 3.5.2
import seaborn as sns # v 0.11.2

# Display any generated plots or visualizations directly in the notebook interface
%matplotlib inline 

In [2]:
# Load dataset
df_company_bankruptcy = pd.read_csv('./data/company_bankruptcy_data.csv')

In [3]:
# Display dataframe
df_company_bankruptcy.head()

Unnamed: 0,Bankrupt,ROA(C) before interest and depreciation before interest,ROA(A) before interest and % after tax,ROA(B) before interest and depreciation after tax,Operating Gross Margin,Realized Sales Gross Margin,Operating Profit Rate,Pre-tax net Interest Rate,After-tax net Interest Rate,Non-industry income and expenditure/revenue,...,Net Income to Total Assets,Total assets to GNP price,No-credit Interval,Gross Profit to Sales,Net Income to Stockholder's Equity,Liability to Equity,Degree of Financial Leverage (DFL),Interest Coverage Ratio (Interest expense to EBIT),Net Income Flag,Equity to Liability
0,1,0.370594,0.424389,0.40575,0.601457,0.601457,0.998969,0.796887,0.808809,0.302646,...,0.716845,0.009219,0.622879,0.601453,0.82789,0.290202,0.026601,0.56405,1,0.016469
1,1,0.464291,0.538214,0.51673,0.610235,0.610235,0.998946,0.79738,0.809301,0.303556,...,0.795297,0.008323,0.623652,0.610237,0.839969,0.283846,0.264577,0.570175,1,0.020794
2,1,0.426071,0.499019,0.472295,0.60145,0.601364,0.998857,0.796403,0.808388,0.302035,...,0.77467,0.040003,0.623841,0.601449,0.836774,0.290189,0.026555,0.563706,1,0.016474
3,1,0.399844,0.451265,0.457733,0.583541,0.583541,0.9987,0.796967,0.808966,0.30335,...,0.739555,0.003252,0.622929,0.583538,0.834697,0.281721,0.026697,0.564663,1,0.023982
4,1,0.465022,0.538432,0.522298,0.598783,0.598783,0.998973,0.797366,0.809304,0.303475,...,0.795016,0.003878,0.623521,0.598782,0.839973,0.278514,0.024752,0.575617,1,0.03549


In [4]:
# Summary Statistics
df_company_bankruptcy.describe()

Unnamed: 0,Bankrupt,ROA(C) before interest and depreciation before interest,ROA(A) before interest and % after tax,ROA(B) before interest and depreciation after tax,Operating Gross Margin,Realized Sales Gross Margin,Operating Profit Rate,Pre-tax net Interest Rate,After-tax net Interest Rate,Non-industry income and expenditure/revenue,...,Net Income to Total Assets,Total assets to GNP price,No-credit Interval,Gross Profit to Sales,Net Income to Stockholder's Equity,Liability to Equity,Degree of Financial Leverage (DFL),Interest Coverage Ratio (Interest expense to EBIT),Net Income Flag,Equity to Liability
count,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,...,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0
mean,0.032263,0.50518,0.558625,0.553589,0.607948,0.607929,0.998755,0.79719,0.809084,0.303623,...,0.80776,18629420.0,0.623915,0.607946,0.840402,0.280365,0.027541,0.565358,1.0,0.047578
std,0.17671,0.060686,0.06562,0.061595,0.016934,0.016916,0.01301,0.012869,0.013601,0.011163,...,0.040332,376450100.0,0.01229,0.016934,0.014523,0.014463,0.015668,0.013214,0.0,0.050014
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
25%,0.0,0.476527,0.535543,0.527277,0.600445,0.600434,0.998969,0.797386,0.809312,0.303466,...,0.79675,0.0009036205,0.623636,0.600443,0.840115,0.276944,0.026791,0.565158,1.0,0.024477
50%,0.0,0.502706,0.559802,0.552278,0.605997,0.605976,0.999022,0.797464,0.809375,0.303525,...,0.810619,0.002085213,0.623879,0.605998,0.841179,0.278778,0.026808,0.565252,1.0,0.033798
75%,0.0,0.535563,0.589157,0.584105,0.613914,0.613842,0.999095,0.797579,0.809469,0.303585,...,0.826455,0.005269777,0.624168,0.613913,0.842357,0.281449,0.026913,0.565725,1.0,0.052838
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,9820000000.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [5]:
# Find columns with maximum value greater than 1
columns_with_max_gt_1 = df_company_bankruptcy.columns[df_company_bankruptcy.max() > 1]

# Display the result
print("Columns with maximum value greater than 1:")
print(columns_with_max_gt_1)

Columns with maximum value greater than 1:
Index([' Operating Expense Rate', ' Research and development expense rate',
       ' Interest-bearing debt interest rate', ' Revenue Per Share (Yuan ¥)',
       ' Total Asset Growth Rate', ' Net Value Growth Rate', ' Current Ratio',
       ' Quick Ratio', ' Total debt/Total net worth',
       ' Accounts Receivable Turnover', ' Average Collection Days',
       ' Inventory Turnover Rate (times)', ' Fixed Assets Turnover Frequency',
       ' Revenue per person', ' Allocation rate per person',
       ' Quick Assets/Current Liability', ' Cash/Current Liability',
       ' Inventory/Current Liability',
       ' Long-term Liability to Current Assets',
       ' Current Asset Turnover Rate', ' Quick Asset Turnover Rate',
       ' Cash Turnover Rate', ' Fixed Assets to Assets',
       ' Total assets to GNP price'],
      dtype='object')


In [6]:
df_company_bankruptcy.shape, df_company_bankruptcy.columns, df_company_bankruptcy.dtypes

((6819, 96),
 Index(['Bankrupt', ' ROA(C) before interest and depreciation before interest',
        ' ROA(A) before interest and % after tax',
        ' ROA(B) before interest and depreciation after tax',
        ' Operating Gross Margin', ' Realized Sales Gross Margin',
        ' Operating Profit Rate', ' Pre-tax net Interest Rate',
        ' After-tax net Interest Rate',
        ' Non-industry income and expenditure/revenue',
        ' Continuous interest rate (after tax)', ' Operating Expense Rate',
        ' Research and development expense rate', ' Cash flow rate',
        ' Interest-bearing debt interest rate', ' Tax rate (A)',
        ' Net Value Per Share (B)', ' Net Value Per Share (A)',
        ' Net Value Per Share (C)', ' Persistent EPS in the Last Four Seasons',
        ' Cash Flow Per Share', ' Revenue Per Share (Yuan ¥)',
        ' Operating Profit Per Share (Yuan ¥)',
        ' Per Share Net profit before tax (Yuan ¥)',
        ' Realized Sales Gross Profit Growth Rate

In [7]:
# Check for missing values
missing_values = df_company_bankruptcy.isna().sum()
print("Missing Values:")
print(missing_values[missing_values > 0])

Missing Values:
Series([], dtype: int64)


In [8]:
# Check for the classes
df_company_bankruptcy['Bankrupt'].value_counts()

0    6599
1     220
Name: Bankrupt, dtype: int64

**The dataset has no missing values. The classes in 'Bankrupt' are imbalanced: only 220 of the 6,819 companies (about 3.2%) have 'Bankrupt' == 1, so I decided to undersample the majority class. Some variables (accounting ratios) have maximum values greater than 1. For example, a Total debt/Total net worth ratio greater than 1 indicates that the company has more debt than net worth.**

## B. Addressing Class Imbalance (Undersampling)

In [9]:
# Split dataset into major and minor sets
def_major = df_company_bankruptcy[df_company_bankruptcy['Bankrupt']==0]
def_minor = df_company_bankruptcy[df_company_bankruptcy['Bankrupt']==1]

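# Undersample the majority class (Bankrupt == 0) down to the size of the minority class
# replace=False keeps every sampled row unique; random_state makes the draw repeatable
def_major_undersampled = resample(def_major, n_samples=len(def_minor), replace=False, random_state=42)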
df_company_bankruptcy_undersampled = pd.concat([def_major_undersampled, def_minor])

# Check the values
df_company_bankruptcy_undersampled['Bankrupt'].value_counts()

0    220
1    220
Name: Bankrupt, dtype: int64
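
Undersampling is usually done without replacement so that no majority-class row appears twice in the balanced set. The sketch below illustrates this on a small toy frame; the frame `toy`, its `feature` column, and the seed are made up for illustration and are not part of the bankruptcy data.

```python
import pandas as pd
from sklearn.utils import resample

# Toy frame standing in for the real data (hypothetical values): 20 majority rows, 4 minority rows
toy = pd.DataFrame({
    "Bankrupt": [0] * 20 + [1] * 4,
    "feature": range(24),
})

major = toy[toy["Bankrupt"] == 0]
minor = toy[toy["Bankrupt"] == 1]

# Draw as many majority rows as there are minority rows, each at most once
major_down = resample(major, n_samples=len(minor), replace=False, random_state=42)
balanced = pd.concat([major_down, minor])

counts = balanced["Bankrupt"].value_counts().to_dict()
print(counts)
print("All rows unique:", balanced.index.is_unique)
```

With `replace=True`, the same majority row could be drawn more than once, and a duplicated row could then land in both the training and the test split.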

## C. Splitting Data into Training and Testing Sets:
### 1. Original dataset

In [10]:
# Separate dataset into Y (dependent) and X (independent) variables
y = df_company_bankruptcy['Bankrupt']
X = df_company_bankruptcy.drop('Bankrupt', axis=1)

In [11]:
# Use the train_test_split function to split the original data into training and testing sets (80% vs 20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
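
With a minority class of only a few percent, a plain random split can leave the training and test sets with different class proportions. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below uses made-up labels (`y_demo`, 3% positives) to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 970 negatives and 30 positives (3% minority class)
y_demo = np.array([0] * 970 + [1] * 30)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print("Positive share in train:", y_tr.mean())
print("Positive share in test:", y_te.mean())
```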

### 2. Undersampled dataset

In [12]:
# Separate dataset into Y (dependent) and X (independent) variables
y_undersampled = df_company_bankruptcy_undersampled['Bankrupt']
X_undersampled = df_company_bankruptcy_undersampled.drop('Bankrupt', axis=1)

In [13]:
# Use the train_test_split function to split the undersampled data into training and testing sets (80% vs 20%)
X_train_undersampled, X_test_undersampled, y_train_undersampled, y_test_undersampled = train_test_split(X_undersampled, y_undersampled, test_size=0.2, random_state=5)

## II. Part 1 - Using the original data

## A. Random Forest

1. Fit the model on the training set using the default [parameters](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) and report your findings. (5 points)

In [14]:
# Instantiate the model
clf_rf = RandomForestClassifier()

**Fit the model**

In [15]:
%%time
clf_rf.fit(X_train, y_train)

CPU times: user 3.73 s, sys: 5.5 ms, total: 3.74 s
Wall time: 3.77 s


In [16]:
# Make predictions on test set
y_pred = clf_rf.predict(X_test)

In [17]:
# Calculate metrics
# I decided not to use accuracy because the original dataset is imbalanced.
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f_score = f1_score(y_test, y_pred)

# Interpret the results 
print("\nDefault Random Forest Classifier Metrics:")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F-score: {f_score:.4f}")


Default Random Forest Classifier Metrics:
Precision: 0.8000
Recall: 0.1569
F-score: 0.2623


- **A precision of 0.8000** means that when the classifier predicts a company as bankrupt, it is correct 80% of the time.
- **A recall of 0.1569** means that the classifier captures only about 15.69% of the actual bankrupt companies.
- The model has high precision but low recall: its bankruptcy predictions are usually right, but it misses most of the actual bankrupt cases.
- Precision and recall often trade off against each other, so the F-score, which is the harmonic mean of the two, is a better single summary. **An F-score of 0.2623** is low because the harmonic mean is pulled toward the weaker recall. Therefore, the model is not achieving a satisfactory balance between precision and recall.
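
As a quick check, the F-score can be recomputed by hand from the rounded precision and recall reported above, using the harmonic-mean formula F1 = 2PR / (P + R). This is only a sanity check on the printed values, not a replacement for `f1_score`.

```python
# Rounded values reported above for the default Random Forest
precision = 0.8000
recall = 0.1569

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 recomputed from rounded inputs: {f1:.4f}")
```

Because the harmonic mean is dominated by the smaller of the two values, a recall near 0.16 keeps the F-score low even with a precision of 0.80.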

In [18]:
# Display the decision path
clf_rf.decision_path(X_train)

(<5455x21000 sparse matrix of type '<class 'numpy.int64'>'
 	with 5362040 stored elements in Compressed Sparse Row format>,
 array([    0,   209,   430,   643,   840,  1049,  1260,  1461,  1698,
         1899,  2084,  2293,  2482,  2693,  2872,  3081,  3292,  3541,
         3750,  3979,  4196,  4425,  4662,  4869,  5082,  5293,  5484,
         5689,  5902,  6129,  6358,  6567,  6786,  6985,  7212,  7423,
         7620,  7827,  8046,  8263,  8458,  8679,  8888,  9101,  9304,
         9509,  9724,  9907, 10106, 10317, 10518, 10707, 10930, 11119,
        11326, 11547, 11752, 11945, 12168, 12381, 12624, 12829, 13048,
        13247, 13480, 13697, 13900, 14113, 14346, 14553, 14764, 14991,
        15202, 15407, 15608, 15813, 16030, 16241, 16444, 16667, 16872,
        17061, 17246, 17455, 17660, 17863, 18084, 18283, 18500, 18719,
        18944, 19151, 19372, 19579, 19766, 19983, 20198, 20361, 20576,
        20775, 21000]))

In [19]:
# Pairing feature names with their importances
feature_importance_pairs = list(zip(X_train.columns, clf_rf.feature_importances_))

# Displaying the feature importances alongside their names
for feature, importance in feature_importance_pairs:
    print(f"{feature}: {importance}")

 ROA(C) before interest and depreciation before interest: 0.009883697940786324
 ROA(A) before interest and % after tax: 0.011431022372326374
 ROA(B) before interest and depreciation after tax: 0.011959177760513293
 Operating Gross Margin: 0.007290010216563158
 Realized Sales Gross Margin: 0.005526842632454003
 Operating Profit Rate: 0.008456387328799128
 Pre-tax net Interest Rate: 0.010506404795996337
 After-tax net Interest Rate: 0.00943327002275572
 Non-industry income and expenditure/revenue: 0.014855573308061944
 Continuous interest rate (after tax): 0.010928016994400361
 Operating Expense Rate: 0.010516471251839252
 Research and development expense rate: 0.00787186019098273
 Cash flow rate: 0.0065346259328217604
 Interest-bearing debt interest rate: 0.017238900650970102
 Tax rate (A): 0.0036028582072538955
 Net Value Per Share (B): 0.016800176591944445
 Net Value Per Share (A): 0.015295684485381548
 Net Value Per Share (C): 0.01380489836491618
 Persistent EPS in the Last Four Seas

2. Repeat 1 but change the parameters to different non-default parameters. Evaluate this model on your choice of metrics. Which model do you prefer? (5 points)

In [20]:
# Modified Random Forest Classifier (Non-default Parameters)
clf_rf_modified = RandomForestClassifier(criterion='entropy')

**Fit the model**

In [21]:
%%time
clf_rf_modified.fit(X_train, y_train)

CPU times: user 2.73 s, sys: 3.91 ms, total: 2.74 s
Wall time: 2.76 s


In [22]:
# Make predictions on test set
y_pred_modified = clf_rf_modified.predict(X_test)

In [23]:
# Calculate metrics
precision_modified = precision_score(y_test, y_pred_modified)
recall_modified = recall_score(y_test, y_pred_modified)
f_score_modified = f1_score(y_test, y_pred_modified)

# Interpret the results 
print("\nModified Random Forest Classifier Metrics:")
print(f"Precision: {precision_modified:.4f}")
print(f"Recall: {recall_modified:.4f}")
print(f"F-score: {f_score_modified:.4f}")


Modified Random Forest Classifier Metrics:
Precision: 0.7143
Recall: 0.0980
F-score: 0.1724


- **A precision of 0.7143** means that when the classifier predicts a company as bankrupt, it is correct about 71.43% of the time.
- **A recall of 0.0980** means that the classifier captures only about 9.8% of the actual bankrupt companies.
- The model has decent precision but very low recall: most of its bankruptcy predictions are right, but it misses the large majority of actual bankrupt cases.
- Precision and recall often trade off against each other, so the F-score (their harmonic mean) summarizes both. **An F-score of 0.1724** is low because it is pulled toward the weak recall. Therefore, the model is not achieving a satisfactory balance between precision and recall.

**I prefer the Default Random Forest Classifier because it scores higher on precision, recall, and F-score than the Modified Random Forest Classifier.**

3. Which model takes longer to fit? 1. Or 2.? Report the CPU time. (2 points)

**The Default Random Forest Classifier (Model 1) takes longer to fit based on the provided CPU time measurements.**

1. **Default Random Forest Classifier (Model 1): CPU times: user 3.73 s, sys: 5.5 ms, total: 3.74 s**

2. **Modified Random Forest Classifier (Model 2): CPU times: user 2.73 s, sys: 3.91 ms, total: 2.74 s**
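
A single `%%time` run can be noisy, so one way to make the comparison more robust is to average the CPU time over several fits. The sketch below does this on synthetic data from `make_classification`; the helper `mean_fit_cpu_time`, the data sizes, and the number of repeats are assumptions for illustration, and the timings it prints say nothing about the bankruptcy data.

```python
import time
from statistics import mean

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the bankruptcy dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)

def mean_fit_cpu_time(criterion, repeats=3):
    """Average CPU seconds over several fits with the given split criterion."""
    times = []
    for seed in range(repeats):
        model = RandomForestClassifier(n_estimators=50, criterion=criterion, random_state=seed)
        start = time.process_time()
        model.fit(X_demo, y_demo)
        times.append(time.process_time() - start)
    return mean(times)

timings = {criterion: mean_fit_cpu_time(criterion) for criterion in ("gini", "entropy")}
for criterion, seconds in timings.items():
    print(f"{criterion}: {seconds:.3f} s CPU (mean of 3 fits)")
```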

4. Based on your findings above which parameter combinations give you the best results for classification? Would you prefer “Gini” or “entropy” as your splitting metric? (5 points)

**To identify which parameter combination gives the best classification results, I believe it is essential to tune hyperparameters such as n_estimators through cross-validation. Based on my findings above, I prefer "Gini" (the default) as my splitting criterion because the Default Random Forest Classifier scores higher on precision, recall, and F-score than the Modified Random Forest Classifier, which uses entropy.**
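
The cross-validated tuning described above could look roughly like the following sketch. It runs on synthetic imbalanced data rather than the bankruptcy data, and the grid values, the F1 scoring choice, and the three folds are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data (about 10% positives) standing in for the real dataset
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Small illustrative grid over the split criterion and the number of trees
param_grid = {"criterion": ["gini", "entropy"], "n_estimators": [25, 50]}

# Stratified folds keep some positives in every validation fold
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, scoring="f1", cv=cv
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best mean cross-validated F1: {search.best_score_:.4f}")
```

Scoring on F1 rather than accuracy keeps the search focused on the minority class, which matches the metric choice made earlier in this notebook.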

5. Choose one of the models from number 1. Or number 2, display, and discuss the decision rules. Do the rules make sense for classification? (5 points)

In [24]:
# Display the decision rules of every tree in the forest as text
for tree in clf_rf.estimators_:
    text_representation = export_text(tree, feature_names=X_train.columns.tolist())
    print(f"Decision Rules:\n{text_representation}")

Decision Rules:
|---  Net Value Per Share (B) <= 0.16
|   |---  Total debt/Total net worth <= 0.02
|   |   |---  Debt ratio % <= 0.15
|   |   |   |---  Net profit before tax/Paid-in capital <= 0.13
|   |   |   |   |--- class: 1.0
|   |   |   |---  Net profit before tax/Paid-in capital >  0.13
|   |   |   |   |---  Cash Flow Per Share <= 0.33
|   |   |   |   |   |---  Cash Flow to Sales <= 0.67
|   |   |   |   |   |   |---  Equity to Long-term Liability <= 0.11
|   |   |   |   |   |   |   |--- class: 0.0
|   |   |   |   |   |   |---  Equity to Long-term Liability >  0.11
|   |   |   |   |   |   |   |--- class: 1.0
|   |   |   |   |   |---  Cash Flow to Sales >  0.67
|   |   |   |   |   |   |---  Cash/Total Assets <= 0.00
|   |   |   |   |   |   |   |---  Net Worth Turnover Rate (times) <= 0.02
|   |   |   |   |   |   |   |   |--- class: 1.0
|   |   |   |   |   |   |   |---  Net Worth Turnover Rate (times) >  0.02
|   |   |   |   |   |   |   |   |--- class: 0.0
|   |   |   |   |   |   |-

Decision Rules:
|---  Liability to Equity <= 0.29
|   |---  Liability to Equity <= 0.27
|   |   |---  Quick Asset Turnover Rate <= 0.00
|   |   |   |--- class: 1.0
|   |   |---  Quick Asset Turnover Rate >  0.00
|   |   |   |--- class: 0.0
|   |---  Liability to Equity >  0.27
|   |   |---  Net Value Growth Rate <= 0.00
|   |   |   |---  Realized Sales Gross Profit Growth Rate <= 0.02
|   |   |   |   |---  Cash Flow to Equity <= 0.31
|   |   |   |   |   |---  Gross Profit to Sales <= 0.62
|   |   |   |   |   |   |--- class: 1.0
|   |   |   |   |   |---  Gross Profit to Sales >  0.62
|   |   |   |   |   |   |--- class: 0.0
|   |   |   |   |---  Cash Flow to Equity >  0.31
|   |   |   |   |   |--- class: 0.0
|   |   |   |---  Realized Sales Gross Profit Growth Rate >  0.02
|   |   |   |   |--- class: 1.0
|   |   |---  Net Value Growth Rate >  0.00
|   |   |   |---  Working Capital to Total Assets <= 0.70
|   |   |   |   |---  Cash/Total Assets <= 0.00
|   |   |   |   |   |--- class: 1.0


Decision Rules:
|---  Net profit before tax/Paid-in capital <= 0.14
|   |---  Net Income to Total Assets <= 0.62
|   |   |---  Interest-bearing debt interest rate <= 0.00
|   |   |   |---  Quick Assets/Current Liability <= 0.01
|   |   |   |   |--- class: 1.0
|   |   |   |---  Quick Assets/Current Liability >  0.01
|   |   |   |   |--- class: 0.0
|   |   |---  Interest-bearing debt interest rate >  0.00
|   |   |   |---  Inventory/Current Liability <= 0.01
|   |   |   |   |--- class: 0.0
|   |   |   |---  Inventory/Current Liability >  0.01
|   |   |   |   |--- class: 1.0
|   |---  Net Income to Total Assets >  0.62
|   |   |---  Degree of Financial Leverage (DFL) <= 0.03
|   |   |   |---  Cash Flow to Equity <= 0.31
|   |   |   |   |---  Quick Asset Turnover Rate <= 9115000320.00
|   |   |   |   |   |---  Operating profit per person <= 0.38
|   |   |   |   |   |   |--- class: 1.0
|   |   |   |   |   |---  Operating profit per person >  0.38
|   |   |   |   |   |   |---  Operating Expe


Decision Rules:
|---  Persistent EPS in the Last Four Seasons <= 0.20
|   |---  Net Income to Total Assets <= 0.63
|   |   |---  Net Worth Turnover Rate (times) <= 0.01
|   |   |   |--- class: 0.0
|   |   |---  Net Worth Turnover Rate (times) >  0.01
|   |   |   |---  Interest Coverage Ratio (Interest expense to EBIT) <= 0.57
|   |   |   |   |---  After-tax net Interest Rate <= 0.81
|   |   |   |   |   |--- class: 1.0
|   |   |   |   |---  After-tax net Interest Rate >  0.81
|   |   |   |   |   |--- class: 0.0
|   |   |   |---  Interest Coverage Ratio (Interest expense to EBIT) >  0.57
|   |   |   |   |---  Working Capital to Total Assets <= 0.91
|   |   |   |   |   |--- class: 1.0
|   |   |   |   |---  Working Capital to Total Assets >  0.91
|   |   |   |   |   |--- class: 0.0
|   |---  Net Income to Total Assets >  0.63
|   |   |---  Degree of Financial Leverage (DFL) <= 0.03
|   |   |   |---  Operating Expense Rate <= 8215000064.00
|   |   |   |   |---  Pre-tax net Interest Rate <=

- **The decision rules display a hierarchical structure for classification: each path from the root to a leaf tests a sequence of financial ratios, such as Total debt/Total net worth, Cash/Total Assets, Net Income to Total Assets, Operating Gross Margin, and others, and ends in either class 0.0 or class 1.0. The structure makes logical sense for classification, as it systematically navigates through different financial metrics to arrive at a final class.**
- **For example, the first tree shown above starts by checking whether Net Value Per Share (B) is less than or equal to 0.16. If so, it checks whether Total debt/Total net worth is less than or equal to 0.02, then whether Debt ratio % is less than or equal to 0.15, and finally whether Net profit before tax/Paid-in capital is less than or equal to 0.13. If all these conditions hold, the instance is classified as Class 1 (bankrupt). Therefore, the rules make sense for classification.**
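
Printing every tree in the forest produces a very long output, so it can be easier to inspect one tree at a limited depth. The sketch below does this on synthetic data; the feature names `ratio_0` to `ratio_4` and the forest size are placeholders, not columns or settings from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

# Synthetic stand-in data with placeholder feature names
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
feature_names = [f"ratio_{i}" for i in range(X_demo.shape[1])]

forest = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Inspect only the first tree, and only its top levels
first_tree = forest.estimators_[0]
rules = export_text(first_tree, feature_names=feature_names, max_depth=2)
print(rules)
```

Branches deeper than `max_depth` are summarized as truncated, which keeps the top-level splits readable.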

6. Which features are the most important for classification? (2 points)

In [25]:
# Get and pair feature importances
feature_importances = dict(zip(X.columns, clf_rf.feature_importances_))

# Print the top 5 most important features
top_features = sorted(feature_importances.items(), key=lambda x: x[1], reverse=True)[:5]
print("Top 5 most important features for classification:")
for feature, importance in top_features:
    print(f"{feature}: {importance}")

Top 5 most important features for classification:
 Net Income to Stockholder's Equity: 0.033952393922019264
 Net Value Growth Rate: 0.030370700931561585
 Net profit before tax/Paid-in capital: 0.021876954405090018
 Borrowing dependency: 0.020676276943996243
 Per Share Net profit before tax (Yuan ¥): 0.020508784366081556


7. On average, what do you think is the tradeoff between model fitting time and model performance? (3 points)

- **Based on my findings above, I believe the tradeoff between model fitting time and model performance is that better-performing models tend to take longer to fit.**
- **Of the two models above, the Default Random Forest Classifier outperforms the Modified Random Forest Classifier on every metric I used, and it also has a longer total CPU time (3.74 s versus 2.74 s). However, each model was timed only once, and the relationship varies case by case, so it will not hold true all the time.**

## B. Support Vector Machines

1. Fit the model on the training set using the default [parameters](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) and report your findings. (5 points)

In [14]:
# Instantiate the model
clf = SVC()

**Fit the model**

In [15]:
%%time
clf.fit(X_train, y_train)

CPU times: user 732 ms, sys: 16.2 ms, total: 749 ms
Wall time: 749 ms


In [16]:
# Make predictions on test set
y_pred_clf = clf.predict(X_test)

In [17]:
# Calculate metrics
accuracy_clf = accuracy_score(y_test, y_pred_clf)
precision_clf_zero_division = precision_score(y_test, y_pred_clf, zero_division=1)
precision_clf = precision_score(y_test, y_pred_clf)

# Print the results
print("\nDefault Support Vector Machines:")
print(f"Accuracy: {accuracy_clf:.4f}")
print(f"Precision: {precision_clf:.4f}")
print(f"Precision with zero_division: {precision_clf_zero_division:.4f}")


Default Support Vector Machines:
Accuracy: 0.9626
Precision: 0.0000
Precision with zero_division: 1.0000




In [18]:
# Display confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_clf).ravel()
tn, fp, fn, tp

(1313, 0, 51, 0)

- **True Negatives (TN)** amount to 1313, indicating the correct prediction of instances belonging to Class 0 (negative). 
- **False Positives (FP)** are 0, implying no instances were mistakenly classified as Class 1 (positive) when they actually pertain to Class 0. 
- **False Negatives (FN)** account for 51, representing instances incorrectly predicted as Class 0 when they should have been assigned to Class 1. 
- **True Positives (TP)** are 0, signifying the model's inability to successfully identify any instances of Class 1. 
- While the large number of **True Negatives** yields a high accuracy, the model never predicts Class 1. Recall for Class 1 is therefore 0, and precision is undefined because there are no positive predictions (scikit-learn reports it as 0 with a warning). This issue could stem from a strong imbalance in the distribution of the classes.
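
The metrics can be derived directly from the four confusion-matrix counts reported above, which makes it clear why accuracy is high while precision has no defined value.

```python
# Confusion-matrix counts reported above for the default SVM
tn, fp, fn, tp = 1313, 0, 51, 0

accuracy = (tp + tn) / (tn + fp + fn + tp)
recall = tp / (tp + fn)

# Precision divides by the number of positive predictions, which is zero here
precision = tp / (tp + fp) if (tp + fp) > 0 else None

print(f"Accuracy: {accuracy:.4f}")
print(f"Recall: {recall:.4f}")
print(f"Precision: {precision}")
```

Here `None` marks the 0/0 case explicitly; `precision_score` instead returns whatever its `zero_division` argument specifies.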

2. Change the kernel from “rbf” to “linear” and repeat the model fitting procedure in 1 above (5 points)

In [19]:
# Instantiate the model
clf_modified = SVC(kernel='linear')

**Fit the model**

In [20]:
#clf_modified.fit(X_train, y_train)

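**The linear SVM fit did not finish within 180 minutes, so the call above is commented out. A likely contributor is that the features are unscaled: several columns reach values in the billions, which can make a linear-kernel SVC very slow to converge.**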
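
One common remedy for slow SVM fits on features with very different scales is to standardize the features first. The sketch below shows the pattern on synthetic data with one column inflated to a scale of about 1e9; it was not run on the bankruptcy data, so whether it would let the linear fit finish there is untested.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; inflate one column to mimic a feature on a ~1e9 scale
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_demo[:, 0] *= 1e9

# Standardize every feature to zero mean and unit variance before the linear SVM
scaled_linear_svm = make_pipeline(StandardScaler(), SVC(kernel="linear"))
scaled_linear_svm.fit(X_demo, y_demo)

train_accuracy = scaled_linear_svm.score(X_demo, y_demo)
print(f"Training accuracy: {train_accuracy:.4f}")
```

Putting the scaler inside the pipeline means it is fit on the training data only, so the same object can be applied to a held-out test set without leaking its statistics.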

3. Evaluate both models using any metrics of choice. (4 points)

In [21]:
# Calculate metrics
accuracy_clf = accuracy_score(y_test, y_pred_clf)
precision_clf_zero_division = precision_score(y_test, y_pred_clf, zero_division=1)
precision_clf = precision_score(y_test, y_pred_clf)

# Print the results
print("\nDefault Support Vector Machines:")
print(f"Accuracy: {accuracy_clf:.4f}")
print(f"Precision: {precision_clf:.4f}")
print(f"Precision with zero_division: {precision_clf_zero_division:.4f}")


Default Support Vector Machines:
Accuracy: 0.9626
Precision: 0.0000
Precision with zero_division: 1.0000




- **An accuracy of 0.9626 indicates that the model is correct about 96.26% of the time. However, because this dataset is significantly imbalanced, accuracy is not a reliable metric for evaluation.**
- **While the large number of True Negatives yields high accuracy, the model never predicts Class 1. Recall for Class 1 is therefore 0, and precision is undefined because there are no positive predictions. This issue could stem from a strong imbalance in the distribution of the classes.**
- **The precision reported with zero_division=1 is 1.0000 only because that argument sets the value returned when precision is undefined. It confirms that the model made no positive predictions, not that its positive predictions were correct.**

4. How long does it take to fit the SVM model in 1. above? How about 2. above? Report only the CPU times (1 point)

- **The Default Support Vector Machines (Model 1): CPU times: user 732 ms, sys: 16.2 ms, total: 749 ms.**
- **The linear SVM (Model 2) did not finish fitting within 180 minutes, so no CPU time is available for it.**

## III. Part 2 - Using the sampled data.

1. Fit the Random Forest and the SVM on the undersampled data and compare the fitting times for both models? Which model takes longer to fit? (5 points)

## A. Random Forest

In [22]:
# Instantiate the model
clf_rf_undersampled = RandomForestClassifier()

**Fit the model**

In [23]:
%%time
clf_rf_undersampled.fit(X_train_undersampled, y_train_undersampled) 

CPU times: user 252 ms, sys: 8.44 ms, total: 261 ms
Wall time: 263 ms


## B. SVM

In [24]:
# Instantiate the model
clf_undersampled = SVC()

**Fit the model**

In [25]:
%%time
clf_undersampled.fit(X_train_undersampled, y_train_undersampled) 

CPU times: user 29.2 ms, sys: 3.61 ms, total: 32.8 ms
Wall time: 31 ms


**The Default Random Forest Classifier takes longer to fit based on the provided CPU time measurements.**

1. **Default Random Forest Classifier: CPU times: user 252 ms, sys: 8.44 ms, total: 261 ms**

2. **Default Support Vector Machines: CPU times: user 29.2 ms, sys: 3.61 ms, total: 32.8 ms**

2. Compare the performance of both models on any metrics of choice. (5 points)

In [26]:
# Random Forest
y_pred_rf_undersampled = clf_rf_undersampled.predict(X_test_undersampled)

# SVM
y_pred_clf_undersampled = clf_undersampled.predict(X_test_undersampled)

# Evaluate models based on accuracy
accuracy_rf_undersampled = accuracy_score(y_test_undersampled, y_pred_rf_undersampled)
accuracy_svm_undersampled = accuracy_score(y_test_undersampled, y_pred_clf_undersampled)

In [27]:
# Print the results 
print("For using the undersampled dataset:")
print(f"Accuracy of Random Forest: {accuracy_rf_undersampled:.4f}")
print(f"Accuracy of SVM: {accuracy_svm_undersampled:.4f}")

# Compare models
if accuracy_rf_undersampled > accuracy_svm_undersampled:
    print("Random Forest is the better-performing model.")
elif accuracy_svm_undersampled > accuracy_rf_undersampled:
    print("SVM is the better-performing model.")
else:
    print("Both models have the same accuracy.")

For using the undersampled dataset:
Accuracy of Random Forest: 0.8636
Accuracy of SVM: 0.6818
Random Forest is the better-performing model.
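
Accuracy is a reasonable metric on the balanced undersampled data, but it does not show which class the errors fall on. A possible extension is to report precision, recall, and F1 alongside it, as in the sketch below. It uses synthetic balanced data of the same size (440 rows), so the numbers it prints are not results for the bankruptcy data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic balanced stand-in data (not the undersampled bankruptcy data)
X_demo, y_demo = make_classification(n_samples=440, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

models = {"Random Forest": RandomForestClassifier(random_state=0), "SVM": SVC()}
scores = {}
for name, model in models.items():
    y_hat = model.fit(X_tr, y_tr).predict(X_te)
    # zero_division=0 avoids a warning if a model predicts no positives
    scores[name] = {
        "accuracy": accuracy_score(y_te, y_hat),
        "precision": precision_score(y_te, y_hat, zero_division=0),
        "recall": recall_score(y_te, y_hat, zero_division=0),
        "f1": f1_score(y_te, y_hat, zero_division=0),
    }
    print(name, {metric: round(value, 4) for metric, value in scores[name].items()})
```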
