# Q3: Predicting T2 Risk Level and Handling Missing Values

In [1]:
import pandas as pd

# Create the training dataset
data = {
    'ID': [1, 2, 3, 4, 5, 6, 7, 8],
    'Age': [35, 28, 45, 31, 52, 29, 42, 33],
    'CreditScore': [720, 650, 750, 600, 780, 630, 710, 640],
    'Education': [16, 14, None, 12, 18, 14, 16, 12],
    'RiskLevel': ['Low', 'High', 'Low', 'High', 'Low', 'High', 'Low', 'High']
}

df = pd.DataFrame(data)
# The None in Education becomes NaN; make the float dtype explicit
df['Education'] = df['Education'].astype('float')

# Display the training dataset
print("Training Data:\n", df)

Training Data:
    ID  Age  CreditScore  Education RiskLevel
0   1   35          720       16.0       Low
1   2   28          650       14.0      High
2   3   45          750        NaN       Low
3   4   31          600       12.0      High
4   5   52          780       18.0       Low
5   6   29          630       14.0      High
6   7   42          710       16.0       Low
7   8   33          640       12.0      High


### Define the Test Case T2

T2 has the following attributes:

- **Age:** 30
- **CreditScore:** 645
- **Education:** missing

Because T2's Education value is missing, the prediction uses only the two attributes T2 does have: Age and CreditScore.
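As a quick check on that choice, the sketch below lists which training columns contain missing values and which features T2 actually provides. The names `t2_full` and `usable_features` are illustrative and not part of the original notebook, and recording T2's missing Education as `None` is an assumption.

```python
import pandas as pd

# Rebuild the training data from above so the sketch runs on its own
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6, 7, 8],
    'Age': [35, 28, 45, 31, 52, 29, 42, 33],
    'CreditScore': [720, 650, 750, 600, 780, 630, 710, 640],
    'Education': [16, 14, None, 12, 18, 14, 16, 12],
    'RiskLevel': ['Low', 'High', 'Low', 'High', 'Low', 'High', 'Low', 'High'],
})

# Count missing values per training column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Hypothetical full view of T2, with the missing Education recorded as None
t2_full = {'Age': 30, 'CreditScore': 645, 'Education': None}

# Keep only the features for which T2 has a value
usable_features = [name for name, value in t2_full.items() if value is not None]
print("Usable features for T2:", usable_features)
```

In the training data only Education has a gap (one row), so Age and CreditScore are complete on both sides of the comparison.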

In [2]:
# Define the test record T2 as a dictionary; Education is missing, so it is left out
T2 = {'Age': 30, 'CreditScore': 645}

print("Test Record T2:", T2)

Test Record T2: {'Age': 30, 'CreditScore': 645}


### Identify Similar Training Records

We treat a training record as similar to T2 when both of the following hold:
- Absolute difference in Age ≤ 5 years
- Absolute difference in CreditScore ≤ 25 points

These thresholds are a judgment call rather than a fixed rule: widening them admits more training records, and tightening them admits fewer.

In [3]:
# Define thresholds for similarity
age_threshold = 5
credit_threshold = 25

# Filter the training data for records within both thresholds
similar_records = df[((df['Age'] - T2['Age']).abs() <= age_threshold) &
                     ((df['CreditScore'] - T2['CreditScore']).abs() <= credit_threshold)]

print("Similar Training Records:\n", similar_records)

Similar Training Records:
    ID  Age  CreditScore  Education RiskLevel
1   2   28          650       14.0      High
5   6   29          630       14.0      High
7   8   33          640       12.0      High
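To see why only these three rows pass, it can help to print each record's distance from T2 on both attributes. The sketch below is one way to do that; the column names `AgeDiff`, `CreditDiff`, and `Similar` are illustrative additions, not part of the original notebook.

```python
import pandas as pd

# Training data (Education omitted because it is not used in the comparison)
df = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6, 7, 8],
    'Age': [35, 28, 45, 31, 52, 29, 42, 33],
    'CreditScore': [720, 650, 750, 600, 780, 630, 710, 640],
    'RiskLevel': ['Low', 'High', 'Low', 'High', 'Low', 'High', 'Low', 'High'],
})

T2 = {'Age': 30, 'CreditScore': 645}
age_threshold = 5
credit_threshold = 25

# Absolute distance of every training record from T2 on each attribute
diffs = df.copy()
diffs['AgeDiff'] = (diffs['Age'] - T2['Age']).abs()
diffs['CreditDiff'] = (diffs['CreditScore'] - T2['CreditScore']).abs()

# A record is similar only if it is within both thresholds
diffs['Similar'] = (diffs['AgeDiff'] <= age_threshold) & (diffs['CreditDiff'] <= credit_threshold)
print(diffs)
```

For example, ID 4 is within 1 year of T2's age but 45 points away on CreditScore, so it is excluded. ID 1 sits exactly at the age limit (5 years) but is 75 points away on CreditScore.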


### Calculate the Probability of T2 Being High Risk

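We estimate the probability that T2 is High risk as the proportion of similar records labeled High. All three similar records (IDs 2, 6, and 8) are High, so the estimate is 3/3 = 1.0.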

In [4]:
# Count the similar records and how many of them are labeled High
total_similar = len(similar_records)
high_risk_count = len(similar_records[similar_records['RiskLevel'] == 'High'])

if total_similar > 0:
    probability_high = high_risk_count / total_similar
else:
    # No similar records: the proportion is undefined
    probability_high = None

print(f"Probability of T2 being High Risk: {probability_high}")

Probability of T2 being High Risk: 1.0
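
The steps above can be collected into one helper, which makes it easier to try other thresholds or test records. This is a minimal sketch, not part of the original notebook: the function name `estimate_high_risk` and its return shape (the estimate plus the number of similar records) are assumptions made for illustration.

```python
import pandas as pd

def estimate_high_risk(train, record, age_threshold=5, credit_threshold=25):
    """Return (fraction of similar records labeled High, number of similar records)."""
    # Similar means within both thresholds on Age and CreditScore
    mask = (
        ((train['Age'] - record['Age']).abs() <= age_threshold)
        & ((train['CreditScore'] - record['CreditScore']).abs() <= credit_threshold)
    )
    similar = train[mask]
    if len(similar) == 0:
        # No similar records: the estimate is undefined
        return None, 0
    high = int((similar['RiskLevel'] == 'High').sum())
    return high / len(similar), len(similar)

train = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6, 7, 8],
    'Age': [35, 28, 45, 31, 52, 29, 42, 33],
    'CreditScore': [720, 650, 750, 600, 780, 630, 710, 640],
    'RiskLevel': ['Low', 'High', 'Low', 'High', 'Low', 'High', 'Low', 'High'],
})

probability_high, n_similar = estimate_high_risk(train, {'Age': 30, 'CreditScore': 645})
print(f"P(High) = {probability_high} from {n_similar} similar records")
```

Returning the count alongside the estimate is a reminder of how little data supports it: here the value 1.0 rests on only three records.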
