**<font size="5">AMM Random Forest</font>** 


In [21]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from imblearn.over_sampling import SMOTE

In [22]:
# Load the numerical features
numerical_data = pd.read_csv('numerical.csv')
print("Numerical Data Head:")
print(numerical_data.head())

# Load the categorical features
categorical_data = pd.read_csv('categorical.csv')
print("\nCategorical Data Head:")
print(categorical_data.head())

# Load the target columns
target_data = pd.read_csv('target.csv')
print("\nTarget Data Head:")
print(target_data.head())

Numerical Data Head:
   TCODE        AGE  INCOME  WEALTH1  HIT  MALEMILI  MALEVET  VIETVETS   
0      0  60.000000       5        9    0         0       39        34  \
1      1  46.000000       6        9   16         0       15        55   
2      1  61.611649       3        1    2         0       20        29   
3      0  70.000000       1        4    2         0       23        14   
4      0  78.000000       3        2   60         1       28         9   

   WWIIVETS  LOCALGOV  ...  CARDGIFT  MINRAMNT  MAXRAMNT  LASTGIFT  TIMELAG   
0        18        10  ...        14       5.0      12.0      10.0        4  \
1        11         6  ...         1      10.0      25.0      25.0       18   
2        33         6  ...        14       2.0      16.0       5.0       12   
3        31         3  ...         7       2.0      11.0      10.0        9   
4        53        26  ...         8       3.0      15.0      15.0       14   

     AVGGIFT  CONTROLN  HPHONE_D  RFA_2F  CLUSTER2  
0   7.

In [23]:
# One-hot encode the text-valued columns; columns already stored as numbers pass through unchanged
categorical_data_encoded = pd.get_dummies(categorical_data)
print("\nEncoded Categorical Data Head:")
print(categorical_data_encoded.head())


Encoded Categorical Data Head:
   CLUSTER  DATASRCE  DOMAIN_B  ODATEW_YR  ODATEW_MM  DOB_YR  DOB_MM   
0       36         3         2         89          1      37      12  \
1       14         3         1         94          1      52       2   
2       43         3         2         90          1       0       2   
3       44         3         2         87          1      28       1   
4       16         3         2         86          1      20       1   

   MINRDATE_YR  MINRDATE_MM  MAXRDATE_YR  ...  RFA_2A_G  GEOCODE2_A   
0           92            8           94  ...     False       False  \
1           93           10           95  ...      True        True   
2           91           11           92  ...     False       False   
3           87           11           94  ...     False       False   
4           93           10           96  ...     False        True   

   GEOCODE2_B  GEOCODE2_C  GEOCODE2_D  DOMAIN_A_C  DOMAIN_A_R  DOMAIN_A_S   
0       False        True      

In [24]:
# Join the numerical and encoded categorical features side by side (rows are matched by index)
combined_data = pd.concat([numerical_data, categorical_data_encoded], axis=1)
print("\nCombined Data Head:")
print(combined_data.head())


Combined Data Head:
   TCODE        AGE  INCOME  WEALTH1  HIT  MALEMILI  MALEVET  VIETVETS   
0      0  60.000000       5        9    0         0       39        34  \
1      1  46.000000       6        9   16         0       15        55   
2      1  61.611649       3        1    2         0       20        29   
3      0  70.000000       1        4    2         0       23        14   
4      0  78.000000       3        2   60         1       28         9   

   WWIIVETS  LOCALGOV  ...  RFA_2A_G  GEOCODE2_A  GEOCODE2_B  GEOCODE2_C   
0        18        10  ...     False       False       False        True  \
1        11         6  ...      True        True       False       False   
2        33         6  ...     False       False       False        True   
3        31         3  ...     False       False       False        True   
4        53        26  ...     False        True       False       False   

   GEOCODE2_D  DOMAIN_A_C  DOMAIN_A_R  DOMAIN_A_S  DOMAIN_A_T  DOMAIN_A_U  
0

In [25]:
# Display column names of numerical_data
print("\nColumns of numerical_data:")
print(numerical_data.columns)

# Display column names of categorical_data
print("\nColumns of categorical_data:")
print(categorical_data.columns)

# Display column names of categorical_data_encoded
print("\nColumns of categorical_data_encoded:")
print(categorical_data_encoded.columns)

# Display column names of combined_data
print("\nColumns of combined_data:")
print(combined_data.columns)


Columns of numerical_data:
Index(['TCODE', 'AGE', 'INCOME', 'WEALTH1', 'HIT', 'MALEMILI', 'MALEVET',
       'VIETVETS', 'WWIIVETS', 'LOCALGOV',
       ...
       'CARDGIFT', 'MINRAMNT', 'MAXRAMNT', 'LASTGIFT', 'TIMELAG', 'AVGGIFT',
       'CONTROLN', 'HPHONE_D', 'RFA_2F', 'CLUSTER2'],
      dtype='object', length=315)

Columns of categorical_data:
Index(['STATE', 'CLUSTER', 'HOMEOWNR', 'GENDER', 'DATASRCE', 'RFA_2R',
       'RFA_2A', 'GEOCODE2', 'DOMAIN_A', 'DOMAIN_B', 'ODATEW_YR', 'ODATEW_MM',
       'DOB_YR', 'DOB_MM', 'MINRDATE_YR', 'MINRDATE_MM', 'MAXRDATE_YR',
       'MAXRDATE_MM', 'LASTDATE_YR', 'LASTDATE_MM', 'FIRSTDATE_YR',
       'FIRSTDATE_MM'],
      dtype='object')

Columns of categorical_data_encoded:
Index(['CLUSTER', 'DATASRCE', 'DOMAIN_B', 'ODATEW_YR', 'ODATEW_MM', 'DOB_YR',
       'DOB_MM', 'MINRDATE_YR', 'MINRDATE_MM', 'MAXRDATE_YR', 'MAXRDATE_MM',
       'LASTDATE_YR', 'LASTDATE_MM', 'FIRSTDATE_YR', 'FIRSTDATE_MM',
       'STATE_CA', 'STATE_FL', 'STATE_GA', 'STATE_I

In [26]:
print("\nColumns of target_data:")
print(target_data.columns)

# Identify the target variable (must match the column name exactly)
target_variable = 'TARGET_B'
print("\nThe target variable is:", target_variable)


Columns of target_data:
Index(['TARGET_B', 'TARGET_D'], dtype='object')

The target variable is: TARGET_B


**TARGET_B:** This column records whether the event actually happened for each row (yes or no, like whether a purchase was made). It is the observed label that the model learns to predict, not a prediction itself.

**TARGET_D:** This column records how much of that something happened (like the amount of money spent).

The target variable we're focusing on is `TARGET_B`, a simple yes/no outcome stored as 1 or 0. The model tries to guess whether an event will occur or not. It's like predicting whether someone will make a purchase (yes) or not (no).
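Before modeling a yes/no target, it can help to check how often each outcome occurs. The sketch below uses a small made-up list of labels (hypothetical, not the notebook's data) and only the standard library. With pandas, `target_data['TARGET_B'].value_counts(normalize=True)` should give the equivalent breakdown.

```python
from collections import Counter

# Hypothetical labels standing in for target_data['TARGET_B'] (not real data).
labels = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]

counts = Counter(labels)
total = len(labels)

# Share of each class; a heavily skewed split signals class imbalance.
shares = {cls: n / total for cls, n in sorted(counts.items())}
for cls, share in shares.items():
    print(f"class {cls}: {counts[cls]} rows ({share:.0%})")
```

A strongly lopsided split is a warning that plain accuracy will be hard to interpret later on.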

In [27]:
X = combined_data
y = target_data['TARGET_B']

# Print the shapes of X and y
print("\nShape of X:", X.shape)
print("Shape of y:", y.shape)


Shape of X: (95412, 361)
Shape of y: (95412,)


**Shape of X:** This tells us how many rows (examples) and columns (features) are in our dataset. In this case, we have 95,412 rows and 361 columns.

**Shape of y:** This shows the number of rows in our target variable. It matches the number of rows in our dataset, so we have 95,412 values in the target variable.

In simpler terms, it's like knowing we have a big table with lots of rows and columns, and we're also keeping track of something (our target) for each row in that table.

In [28]:
# Hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply SMOTE to upsample the minority class
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Print the shapes of X_train_resampled and y_train_resampled
print("\nShape of X_train_resampled:", X_train_resampled.shape)
print("Shape of y_train_resampled:", y_train_resampled.shape)


Shape of X_train_resampled: (144928, 361)
Shape of y_train_resampled: (144928,)


**Shape of X_train_resampled:** This tells us the size of our training dataset after applying SMOTE (Synthetic Minority Over-sampling Technique). The training split started with 76,329 rows (80% of 95,412); we now have 144,928 rows and 361 columns.

**Shape of y_train_resampled:** This shows the number of rows in our target variable, which matches the size of our training data. We have 144,928 values in the target variable.

In simpler terms, SMOTE made our training data bigger by creating synthetic examples of the rarer class until both classes appear equally often. Only the training data is resampled; the test set keeps its original class balance. It's like adding more practice questions on the topic you see least often.
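The core idea behind SMOTE can be sketched in a few lines: a synthetic row is placed at a random position on the line between a minority-class row and one of its nearest minority-class neighbors. The sketch below is a simplified illustration with made-up rows and a hypothetical helper name, not imblearn's implementation. The last two lines assume SMOTE's default settings, which bring the classes to equal size.

```python
import random

def smote_like_sample(point, neighbor, rng):
    """Return a synthetic point on the segment between point and neighbor."""
    lam = rng.random()  # interpolation weight in [0, 1)
    return [p + lam * (q - p) for p, q in zip(point, neighbor)]

rng = random.Random(42)

# Two hypothetical minority-class rows with two features each (not real data).
minority_a = [1.0, 10.0]
minority_b = [3.0, 20.0]

synthetic = smote_like_sample(minority_a, minority_b, rng)
print(synthetic)

# Rows per class implied by the resampled shape printed above,
# assuming the two classes end up the same size.
rows_after_smote = 144928
print(rows_after_smote // 2)  # 72464
```

Because the synthetic rows are interpolated rather than observed, they add no new information about real minority cases, which is one reason resampling does not guarantee better test results.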

In [29]:
# Train a random forest with default hyperparameters on the SMOTE-balanced training set
rf_classifier = RandomForestClassifier(random_state=42)
rf_classifier.fit(X_train_resampled, y_train_resampled)
print("\nRandom Forest Classifier Parameters:")
print(rf_classifier.get_params())


Random Forest Classifier Parameters:
{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': 42, 'verbose': 0, 'warm_start': False}


The "Random Forest Classifier Parameters" are the settings the model follows when it learns and makes predictions. Since we only set `random_state`, everything else is the library default. The most important settings are:

- `n_estimators: 100`: it uses 100 "trees" (imagine 100 separate decision-makers) and combines their outputs.
- `bootstrap: True` and `max_features: 'sqrt'`: each tree is trained on a random sample of the rows, and each split considers only a random subset of the features (about 19 of the 361 columns, since the square root of 361 is 19).
- `class_weight: None`: it doesn't pay extra attention to any particular class.
- `criterion: 'gini'`: it picks each split by how well it separates the classes, measured by Gini impurity.
- `max_depth: None`: it doesn't set a maximum depth for each decision-maker (tree).

These settings help the computer make predictions. For example, if you're predicting whether someone will make a purchase based on various factors like age, income, and location, these settings guide how the computer analyzes those factors and makes a final guess.

The "Random State" value is set to 42, which fixes the random choices the model makes (such as which rows and features each tree sees). This is important for reproducibility, so you can get the same results when you run the model again later on the same data.

So, in simple terms, these settings and the random state value are like the rules of a game the computer plays to guess whether someone will make a purchase, and the fixed random state makes sure the game plays out the same way every time.
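The effect of a fixed random state can be illustrated with a seeded bootstrap sample, which is the kind of random row selection each tree relies on. This is a minimal sketch using Python's `random` module and a hypothetical helper name; scikit-learn uses its own random number generator internally, so the actual indices will differ.

```python
import math
import random

def bootstrap_indices(n_rows, seed):
    """Draw n_rows row indices with replacement using a seeded generator."""
    rng = random.Random(seed)
    return [rng.randrange(n_rows) for _ in range(n_rows)]

# Same seed, same sample: this is what makes a run repeatable.
first = bootstrap_indices(10, seed=42)
second = bootstrap_indices(10, seed=42)
print(first == second)  # True

# A different seed will generally give a different sample.
other = bootstrap_indices(10, seed=7)
print(other)

# With max_features='sqrt' and 361 columns, each split looks at about this many features.
print(int(math.sqrt(361)))  # 19
```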

In [30]:
# Predict on the test set, which was not resampled
y_pred = rf_classifier.predict(X_test)
print("\nPredictions (y_pred):")
print(y_pred)


Predictions (y_pred):
[0 0 0 ... 0 0 0]


These "Predictions (y_pred)" are the model's guesses for each row of the test set: 0 means no purchase and 1 means a purchase.
The printed array is abbreviated, so only the first three and last three predictions are shown, and all six are 0. This hints that the model often predicts "no purchase", but on its own it doesn't tell us how many 1s the model predicts overall.

In [31]:
# Fraction of test rows where the prediction matches the true label
accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy: {accuracy:.2f}")


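Accuracy: 0.95


"Accuracy: 0.95" means that the model's prediction matches the true label for about 95% of the test rows. That sounds good, but accuracy can be misleading when one class is much more common than the other. In this test set, 18,105 of the 19,083 rows are class 0 (see the support column of the classification report), so a model that always predicts 0 would reach about the same score. Accuracy should therefore be read alongside the per-class metrics.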

In [32]:
from sklearn.metrics import classification_report
classification_report_result = classification_report(y_test, y_pred)
print("\nClassification Report:")
print(classification_report_result)


Classification Report:
              precision    recall  f1-score   support

           0       0.95      1.00      0.97     18105
           1       0.12      0.00      0.00       978

    accuracy                           0.95     19083
   macro avg       0.53      0.50      0.49     19083
weighted avg       0.91      0.95      0.92     19083
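
The precision, recall, and F1 figures in a report like this come from four counts: true negatives, false positives, false negatives, and true positives. The sketch below shows the arithmetic. The counts are hypothetical, chosen only so that the rounded results match the report above; they are not taken from the notebook, and the real confusion matrix may differ.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical counts for the test set (illustrative only, not taken from the notebook).
tn, fp, fn, tp = 18090, 15, 976, 2

p1, r1, f1_1 = precision_recall_f1(tp, fp, fn)  # class 1 treated as the positive class
p0, r0, f1_0 = precision_recall_f1(tn, fn, fp)  # class 0 treated as the positive class
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"class 0: precision={p0:.2f} recall={r0:.2f} f1={f1_0:.2f}")
print(f"class 1: precision={p1:.2f} recall={r1:.2f} f1={f1_1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

With scikit-learn, `confusion_matrix(y_test, y_pred)` from `sklearn.metrics` would return the actual counts for this model.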



**Precision (0):** When the model predicts "0" (no purchase), it's correct about 95% of the time. Since about 95% of the test cases are "0" anyway, this is what we'd expect from a model that predicts "0" almost every time.

**Recall (0):** Out of all 18,105 actual "0" cases (no purchase), the model identifies almost all of them (recall rounds to 1.00).

**F1-Score (0):** This is the harmonic mean of precision and recall, so it is high only when both are high. At 0.97, it tells us that the model is good at identifying "0" cases.

**Precision (1):** When the model predicts "1" (purchase), it's correct only about 12% of the time. So, when it says people will make a purchase, it's usually wrong.

**Recall (1):** Out of all 978 actual "1" cases (purchase), the model identifies very few of them (recall rounds to 0.00).

**F1-Score (1):** This also rounds to 0.00, indicating that the model struggles to identify "1" cases.

**Accuracy:** Overall, the model is correct about 95% of the time, which sounds good, but it's mainly because "0" (no purchase) cases make up about 95% of the test set and the model predicts "0" almost every time.
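
To see how much of the accuracy comes from class imbalance alone, it helps to compare against a trivial baseline that always predicts "0". The sketch below uses only the class counts from the support column of the classification report above.

```python
# Class counts in the test set, taken from the "support" column of the classification report.
n_negative = 18105  # actual class 0
n_positive = 978    # actual class 1
n_total = n_negative + n_positive

# A "model" that always predicts 0 is right on every class-0 row and wrong on every class-1 row.
baseline_accuracy = n_negative / n_total
print(f"Always-0 baseline accuracy: {baseline_accuracy:.2f}")
```

The baseline also rounds to 0.95, so at this level of rounding the trained model's accuracy is indistinguishable from predicting "0" every time.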

**Conclusion**

In conclusion, the model predicts "no purchase" for almost every case. Because about 95% of the test cases really are "0", this gives high overall accuracy, but the model finds almost none of the actual "1" cases (purchases), even though SMOTE balanced the training data. The high accuracy mainly reflects the abundance of "0" cases, and the model needs improvement in recognizing "1" cases.
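
One common way to trade some precision for better recall on the rare class is to lower the probability cutoff used to call a case "1". The sketch below illustrates the idea with made-up probabilities and labels (hypothetical, not model output) and a hypothetical helper name. With scikit-learn, `rf_classifier.predict_proba(X_test)[:, 1]` should give the class-1 probabilities to apply a cutoff to. Whether this would help on this dataset is untested here.

```python
def predict_with_threshold(probabilities, threshold):
    """Label a case 1 when its predicted probability of class 1 reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

# Hypothetical predicted probabilities of class 1 and true labels (illustrative only).
probs = [0.05, 0.10, 0.30, 0.45, 0.60, 0.20]
truth = [0, 0, 1, 1, 1, 0]

for threshold in (0.5, 0.25):
    preds = predict_with_threshold(probs, threshold)
    found = sum(1 for p, t in zip(preds, truth) if p == 1 and t == 1)
    print(f"threshold={threshold}: predictions={preds}, class-1 cases found={found} of {sum(truth)}")
```

A lower cutoff usually finds more true "1" cases at the cost of more false alarms, so the right value depends on how costly each kind of error is.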