## Question 1

**`Inductive Reasoning`**: Inductive reasoning draws general conclusions from specific observations or cases. It starts with particular observations and moves toward a general theory. Because the conclusion goes beyond the evidence, an inductive argument can lead to a false conclusion even when every premise is true.

**`Example`**: On multiple cloudy days, you have noticed your friend wearing a raincoat and carrying an umbrella. From this, you infer that people usually wear raincoats and carry umbrellas when they expect rain.

**`Deductive Reasoning`**: Deductive reasoning, on the other hand, derives specific predictions or conclusions from general rules or principles. It begins with a general statement or hypothesis and works through its consequences to reach a specific logical conclusion. If the premises are true and the argument is valid, the conclusion must be true.

**`Example`**: If all mammals breathe air (general rule) and a whale is a mammal (specific case), then you deduce that a whale breathes air.

In [23]:
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
import xgboost as xgb

In [24]:
credit_data = pd.read_csv('./Credit_card.csv')
credit_labels = pd.read_csv('./Credit_card_label.csv')

(credit_data.head(), credit_labels.head())

(    Ind_ID GENDER Car_Owner Propert_Owner  CHILDREN  Annual_income  \
 0  5008827      M         Y             Y         0       180000.0   
 1  5009744      F         Y             N         0       315000.0   
 2  5009746      F         Y             N         0       315000.0   
 3  5009749      F         Y             N         0            NaN   
 4  5009752      F         Y             N         0       315000.0   
 
             Type_Income         EDUCATION Marital_status       Housing_type  \
 0             Pensioner  Higher education        Married  House / apartment   
 1  Commercial associate  Higher education        Married  House / apartment   
 2  Commercial associate  Higher education        Married  House / apartment   
 3  Commercial associate  Higher education        Married  House / apartment   
 4  Commercial associate  Higher education        Married  House / apartment   
 
    Birthday_count  Employed_days  Mobile_phone  Work_Phone  Phone  EMAIL_ID  \
 0        

In [25]:
missing_values_percentage = credit_data.isnull().mean() * 100

missing_values_percentage

Ind_ID              0.000000
GENDER              0.452196
Car_Owner           0.000000
Propert_Owner       0.000000
CHILDREN            0.000000
Annual_income       1.485788
Type_Income         0.000000
EDUCATION           0.000000
Marital_status      0.000000
Housing_type        0.000000
Birthday_count      1.421189
Employed_days       0.000000
Mobile_phone        0.000000
Work_Phone          0.000000
Phone               0.000000
EMAIL_ID            0.000000
Type_Occupation    31.524548
Family_Members      0.000000
dtype: float64

In [26]:
# Fill the categorical gap with the most frequent value. Assign the result back to
# the column: calling fillna(..., inplace=True) on a column selection raises a
# FutureWarning and will stop working in pandas 3.0.
credit_data['GENDER'] = credit_data['GENDER'].fillna(credit_data['GENDER'].mode()[0])

# Fill the numeric gaps with the column median.
credit_data['Annual_income'] = credit_data['Annual_income'].fillna(credit_data['Annual_income'].median())
credit_data['Birthday_count'] = credit_data['Birthday_count'].fillna(credit_data['Birthday_count'].median())

# Keep missing occupations as their own "Unknown" category.
credit_data['Type_Occupation'] = credit_data['Type_Occupation'].fillna("Unknown")

# Confirm that no missing values remain.
missing_values_after_filling = credit_data.isnull().sum()

missing_values_after_filling

Ind_ID             0
GENDER             0
Car_Owner          0
Propert_Owner      0
CHILDREN           0
Annual_income      0
Type_Income        0
EDUCATION          0
Marital_status     0
Housing_type       0
Birthday_count     0
Employed_days      0
Mobile_phone       0
Work_Phone         0
Phone              0
EMAIL_ID           0
Type_Occupation    0
Family_Members     0
dtype: int64

In [27]:
categorical_cols = ['GENDER', 'Car_Owner', 'Propert_Owner', 'Type_Income', 'EDUCATION', 'Marital_status', 'Housing_type', 'Type_Occupation']

credit_data_encoded = pd.get_dummies(credit_data, columns=categorical_cols)

print(credit_data_encoded.head())

    Ind_ID  CHILDREN  Annual_income  Birthday_count  Employed_days  \
0  5008827         0       180000.0        -18772.0         365243   
1  5009744         0       315000.0        -13557.0           -586   
2  5009746         0       315000.0        -15661.5           -586   
3  5009749         0       166500.0        -13557.0           -586   
4  5009752         0       315000.0        -13557.0           -586   

   Mobile_phone  Work_Phone  Phone  EMAIL_ID  Family_Members  ...  \
0             1           0      0         0               2  ...   
1             1           1      1         0               2  ...   
2             1           1      1         0               2  ...   
3             1           1      1         0               2  ...   
4             1           1      1         0               2  ...   

   Type_Occupation_Low-skill Laborers  Type_Occupation_Managers  \
0                               False                     False   
1                             

In [28]:
credit_merged = pd.merge(credit_data_encoded, credit_labels, on='Ind_ID')

print(credit_merged.head())

    Ind_ID  CHILDREN  Annual_income  Birthday_count  Employed_days  \
0  5008827         0       180000.0        -18772.0         365243   
1  5009744         0       315000.0        -13557.0           -586   
2  5009746         0       315000.0        -15661.5           -586   
3  5009749         0       166500.0        -13557.0           -586   
4  5009752         0       315000.0        -13557.0           -586   

   Mobile_phone  Work_Phone  Phone  EMAIL_ID  Family_Members  ...  \
0             1           0      0         0               2  ...   
1             1           1      1         0               2  ...   
2             1           1      1         0               2  ...   
3             1           1      1         0               2  ...   
4             1           1      1         0               2  ...   

   Type_Occupation_Managers  Type_Occupation_Medicine staff  \
0                     False                           False   
1                     False           

In [29]:
X = credit_merged.drop('label', axis=1)
y = credit_merged['label']

X_train, X_validation, y_train, y_validation = train_test_split(X, y, test_size=0.2, random_state=42)

In [30]:
dt_regressor = DecisionTreeRegressor(random_state=42)
dt_params = {'max_depth': [None, 10, 20], 'min_samples_split': [2, 10, 20], 'min_samples_leaf': [1, 5, 10]}

dt_grid_search = GridSearchCV(dt_regressor, dt_params, cv=5, scoring='neg_mean_squared_error')
dt_grid_search.fit(X_train, y_train)

best_dt = dt_grid_search.best_estimator_
dt_predictions_train = best_dt.predict(X_train)
dt_predictions_val = best_dt.predict(X_validation)
dt_rmse_train = np.sqrt(mean_squared_error(y_train, dt_predictions_train))
dt_rmse_val = np.sqrt(mean_squared_error(y_validation, dt_predictions_val))

print(f"Decision Tree RMSE on Training Data: {dt_rmse_train}")
print(f"Decision Tree RMSE on Validation Data: {dt_rmse_val}")

Decision Tree RMSE on Training Data: 0.22991837010953878
Decision Tree RMSE on Validation Data: 0.3148498891964941


In [31]:
xgb_regressor = xgb.XGBRegressor(objective='reg:squarederror', random_state=42)
xgb_params = {'max_depth': [6, 10, 15], 'min_child_weight': [1, 5, 10], 'learning_rate': [0.01, 0.1, 0.2], 'subsample': [0.8, 1], 'colsample_bytree': [0.8, 1]}

xgb_grid_search = GridSearchCV(xgb_regressor, xgb_params, cv=5, scoring='neg_mean_squared_error')
xgb_grid_search.fit(X_train, y_train)

best_xgb = xgb_grid_search.best_estimator_
xgb_predictions_train = best_xgb.predict(X_train)
xgb_predictions_val = best_xgb.predict(X_validation)
xgb_rmse_train = np.sqrt(mean_squared_error(y_train, xgb_predictions_train))
xgb_rmse_val = np.sqrt(mean_squared_error(y_validation, xgb_predictions_val))

print(f"XGBoost RMSE on Training Data: {xgb_rmse_train}")
print(f"XGBoost RMSE on Validation Data: {xgb_rmse_val}")

XGBoost RMSE on Training Data: 0.05383573515265429
XGBoost RMSE on Validation Data: 0.23575752591449772


## Model Tuning Techniques

## 3. Decision Tree Model Tuning

I used `GridSearchCV`, an exhaustive search that evaluates every combination of the specified parameter values for the estimator using cross-validation. These are the parameters I focused on, and why:

- **`max_depth`**: Controls the maximum depth of the tree. Limiting depth helps control overfitting. I tried `None`, `10`, and `20`: `None` lets the tree grow until the other stopping criteria are met, while `10` and `20` impose explicit caps.

- **`min_samples_split`**: The minimum number of samples required to split an internal node. By testing `2`, `10`, and `20`, I aimed to find a balance between underfitting and overfitting, so that the model does not split too eagerly on very small subsets of data.

- **`min_samples_leaf`**: The minimum number of samples required at a leaf node. In regression especially, this helps smooth the model. I chose `1`, `5`, and `10` to see how tighter or looser constraints on leaf size affect performance.

The goal was to find a combination of these parameters that limits overfitting while still letting the tree capture enough detail in the data.
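As a rough illustration of how the search described above can be inspected, here is a minimal sketch. It assumes synthetic stand-in data from `make_regression` (not the credit card dataset), and the names `X_demo`, `y_demo`, and `search` are hypothetical. The parameter grid and scoring match the Decision Tree cell.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: this is not the credit card dataset).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, noise=10.0, random_state=42
)

# Same grid as the Decision Tree cell: 3 x 3 x 3 = 27 candidates.
dt_params = {
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 10, 20],
    "min_samples_leaf": [1, 5, 10],
}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    dt_params,
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_demo, y_demo)

n_candidates = len(search.cv_results_["params"])
# best_score_ is the negated mean cross-validated MSE, so flip the sign
# before taking the square root.
best_cv_rmse = np.sqrt(-search.best_score_)

print(f"Candidates evaluated: {n_candidates}")
print(f"Best parameters: {search.best_params_}")
print(f"Root of best cross-validated MSE: {best_cv_rmse:.3f}")
```

Looking at `best_params_` alongside the cross-validated score shows which constraint the search actually favoured, which the training and validation RMSE alone do not reveal.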

## 4. Random Forest Model Tuning

I used a similar strategy to tune the Random Forest, again using `GridSearchCV` to explore hyperparameters:

- **`n_estimators`**: The number of trees in the forest. More trees make the ensemble more robust but more expensive to compute, so I tried `100` and `200` to see whether the larger forest gives a significant improvement.

- **`max_depth`**: Controls the depth of each tree in the forest, as in the Decision Tree model. I reused the same values to assess their impact in an ensemble context.

- **`min_samples_split`** and **`min_samples_leaf`**: The rationale is the same as for the Decision Tree model: find settings that balance model complexity against generalization ability.

These parameters were combined to improve the model’s ability to generalize, using the ensemble nature of random forests to avoid the overfitting common in single decision trees.
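The notebook cells above do not include the Random Forest search, so the following is only a sketch of what such a search could look like. It assumes synthetic data from `make_regression`, a reduced grid, and `cv=3` so that it runs quickly; the names `rf_params`, `rf_search`, and `best_rf` are hypothetical and not taken from the original code.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: this is not the credit card dataset).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=8, noise=10.0, random_state=42
)

# Reduced grid (assumption) to keep the sketch fast: 2 x 2 x 2 = 8 candidates.
rf_params = {
    "n_estimators": [100, 200],
    "max_depth": [None, 10],
    "min_samples_leaf": [1, 5],
}

rf_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    rf_params,
    cv=3,
    scoring="neg_mean_squared_error",
)
rf_search.fit(X_demo, y_demo)

# The refitted best model holds one fitted tree per estimator.
best_rf = rf_search.best_estimator_
print(f"Best parameters: {rf_search.best_params_}")
print(f"Trees in the best forest: {len(best_rf.estimators_)}")
```

Adding `min_samples_split` and the full set of depth values back into the grid multiplies the number of fits, which is the computational cost mentioned for larger forests.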

## 5. XGBoost Model Tuning

XGBoost is known for its performance and speed. Here’s how it was tuned:

- **`max_depth`**: Determines how deep each tree can grow. I varied it to find the depth the trees need to capture the complexity of the data without overfitting.

- **`min_child_weight`**: The minimum sum of instance weight (hessian) needed in a child node. It is used to control overfitting: higher values prevent the model from learning relations that are highly specific to particular samples.

- **`learning_rate`** (alias `eta`): Shrinks the weights at each boosting step, which makes the process more robust. I tried several values to find a rate that is neither so slow that it requires many trees nor so fast that it leads to overfitting.

- **`subsample`**: The fraction of observations randomly sampled for each tree. This adds randomness to the fitting of individual trees, which makes the model more robust against overfitting.
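Because the grid search is exhaustive, the size of this grid determines its cost. The sketch below counts the candidates with scikit-learn's `ParameterGrid`, using the same `xgb_params` dictionary as the XGBoost cell; the names `cv_folds`, `n_candidates`, and `n_fits` are introduced here only for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the XGBoost cell.
xgb_params = {
    "max_depth": [6, 10, 15],
    "min_child_weight": [1, 5, 10],
    "learning_rate": [0.01, 0.1, 0.2],
    "subsample": [0.8, 1],
    "colsample_bytree": [0.8, 1],
}
cv_folds = 5  # matches cv=5 in the GridSearchCV call

# Every combination is one candidate; each candidate is fitted once per fold.
n_candidates = len(ParameterGrid(xgb_params))
n_fits = n_candidates * cv_folds

print(f"Candidates: {n_candidates}")        # 3 * 3 * 3 * 2 * 2 = 108
print(f"Cross-validation fits: {n_fits}")   # 108 * 5 = 540
```

With the default `refit=True`, `GridSearchCV` adds one more fit of the best candidate on the full training set, so each extra value in any parameter list increases the total noticeably.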

