Data preprocessing and feature engineering are vital in data mining because they improve both the quality of the data and the performance of the models built on it. Data preprocessing ensures that raw data is clean, consistent, and ready for analysis, while feature engineering improves the model’s ability to capture patterns and make accurate predictions.



To carry out this activity, follow the instructions below:

1. Review the company **business case** and the **problem statement**, then download the required datasets [here](https://sites.google.com/mmdc.mcl.edu.ph/finmarkcorp/data-mining).
2. Load the datasets into a Jupyter Notebook and **perform the necessary data preprocessing using Python.**
3. Once done, **engineer new features to help the model arrive at better solutions to the outlined problem statement.**
4. Save your output and prepare for a discussion with your team.

In [9]:
# Import the necessary libraries
import pandas as pd
from google.colab import drive

# Mount Google Drive so the CSV files are reachable from Colab
drive.mount('/content/drive')

# Load the three datasets from the shared project folder
DATA_DIR = "/content/drive/MyDrive/Colab Notebooks/finmark_data_mining"
customer_feedback_df = pd.read_csv(f"{DATA_DIR}/Customer_Feedback_Data.csv")
product_offering_df = pd.read_csv(f"{DATA_DIR}/Product_Offering_Data.csv")
transaction_df = pd.read_csv(f"{DATA_DIR}/Transaction_Data.csv")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Preview each dataset

In [14]:
print(customer_feedback_df.head())

   Customer_ID  Satisfaction_Score  Feedback_Comments  Likelihood_to_Recommend
0            1                10.0     Very satisfied                        9
1            2                 3.0     Very satisfied                        3
2            3                10.0     Very satisfied                        1
3            4                 7.0  Needs improvement                        4
4            5                 8.0     Unsatisfactory                        7


In [15]:
print(product_offering_df.head())

   Product_ID                   Product_Name     Product_Type Risk_Level  \
0           1           Platinum Credit Card      Credit Card     Medium   
1           2           Gold Savings Account  Savings Account        Low   
2           3  High-Yield Investment Account       Investment       High   
3           4                  Mortgage Loan             Loan     Medium   
4           5                      Auto Loan             Loan     Medium   

   Target_Age_Group Target_Income_Group  
0               NaN              Medium  
1               NaN                 Low  
2               NaN                High  
3               NaN                High  
4               NaN              Medium  


In [16]:
print(transaction_df.head())

   Transaction_ID  Customer_ID     Transaction_Date  Transaction_Amount  \
0               1          393  2023-01-01 00:00:00              3472.0   
1               2          826  2023-01-01 01:00:00                 NaN   
2               3          916  2023-01-01 02:00:00                10.0   
3               4          109  2023-01-01 03:00:00                72.0   
4               5          889  2023-01-01 04:00:00              1793.0   

  Transaction_Type  
0         Purchase  
1     Bill Payment  
2         Purchase  
3       Investment  
4       Investment  
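The columns in the transaction preview suggest a few candidate features for step 3 of the activity. The block below is only a sketch under stated assumptions: it uses a tiny hand-made stand-in for `transaction_df` with the same column names, and the feature names (`Transaction_Hour`, `txn_count`, `total_amount`, `avg_amount`) are hypothetical choices, not ones prescribed by the problem statement. In practice, missing values and duplicates would normally be handled before computing aggregates like these.

```python
import pandas as pd

# Tiny stand-in for transaction_df (same column names as the preview; values are illustrative)
transactions = pd.DataFrame({
    "Transaction_ID": [1, 2, 3, 4],
    "Customer_ID": [393, 393, 916, 109],
    "Transaction_Date": [
        "2023-01-01 00:00:00", "2023-01-01 01:00:00",
        "2023-01-01 02:00:00", "2023-01-01 03:00:00",
    ],
    "Transaction_Amount": [3472.0, None, 10.0, 72.0],
    "Transaction_Type": ["Purchase", "Bill Payment", "Purchase", "Investment"],
})

# Parse the timestamp strings so that date parts can be extracted
transactions["Transaction_Date"] = pd.to_datetime(transactions["Transaction_Date"])
transactions["Transaction_Hour"] = transactions["Transaction_Date"].dt.hour

# Per-customer aggregates; sum and mean skip missing amounts by default
customer_features = (
    transactions.groupby("Customer_ID")
    .agg(
        txn_count=("Transaction_ID", "count"),
        total_amount=("Transaction_Amount", "sum"),
        avg_amount=("Transaction_Amount", "mean"),
    )
    .reset_index()
)

print(customer_features)
```

Because `customer_features` has one row per `Customer_ID`, it could later be joined to the customer feedback table on that key.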


Detailed Overview of Datasets

In [6]:
# Get dataset information
customer_feedback_df.info()
product_offering_df.info()
transaction_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5050 entries, 0 to 5049
Data columns (total 4 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Customer_ID              5050 non-null   int64  
 1   Satisfaction_Score       4949 non-null   float64
 2   Feedback_Comments        5050 non-null   object 
 3   Likelihood_to_Recommend  5050 non-null   int64  
dtypes: float64(1), int64(2), object(1)
memory usage: 157.9+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Product_ID           15 non-null     int64  
 1   Product_Name         15 non-null     object 
 2   Product_Type         15 non-null     object 
 3   Risk_Level           15 non-null     object 
 4   Target_Age_Group     0 non-null      float64
 5   Target_Income_Group  15 non-null     object 
dtyp

Missing Values



In [8]:
# Identify missing values in each dataset
print(customer_feedback_df.isnull().sum())
print(product_offering_df.isnull().sum())
print(transaction_df.isnull().sum())



Customer_ID                  0
Satisfaction_Score         101
Feedback_Comments            0
Likelihood_to_Recommend      0
dtype: int64
Product_ID              0
Product_Name            0
Product_Type            0
Risk_Level              0
Target_Age_Group       15
Target_Income_Group     0
dtype: int64
Transaction_ID          0
Customer_ID             0
Transaction_Date        0
Transaction_Amount    100
Transaction_Type        0
dtype: int64
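The counts above show three distinct situations: a numeric column with some gaps (`Satisfaction_Score`), a column that is empty in every row (`Target_Age_Group`), and a numeric amount with some gaps (`Transaction_Amount`). One possible treatment, sketched here on small stand-in frames rather than the real data, is to impute a partially missing numeric column with its median and to drop a column that carries no values at all. Median imputation is an assumption made for illustration; dropping the affected rows or using another statistic may suit the problem statement better.

```python
import pandas as pd

# Stand-in frames with the same column names as the real datasets (values are illustrative)
feedback = pd.DataFrame({
    "Customer_ID": [1, 2, 3, 4],
    "Satisfaction_Score": [10.0, None, 7.0, 8.0],
})
products = pd.DataFrame({
    "Product_ID": [1, 2],
    "Target_Age_Group": [None, None],
    "Target_Income_Group": ["Medium", "Low"],
})

# Impute the partially missing numeric column with its median (assumed strategy)
median_score = feedback["Satisfaction_Score"].median()
feedback["Satisfaction_Score"] = feedback["Satisfaction_Score"].fillna(median_score)

# Drop any column in which every value is missing
products = products.dropna(axis="columns", how="all")

print(feedback["Satisfaction_Score"].isnull().sum())
print(list(products.columns))
```

The same `fillna` pattern could be applied to `Transaction_Amount`, although imputing monetary amounts changes totals and should be a deliberate choice.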


Duplicate Data


In [18]:
# Checking for Duplicate Data
print(customer_feedback_df.duplicated().sum())
print(product_offering_df.duplicated().sum())
print(transaction_df.duplicated().sum())

81
5
50
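Each count above is the number of rows that exactly repeat an earlier row in the same dataset. A minimal sketch of removing them, shown on a small stand-in frame and assuming that exact duplicates are data-entry repeats rather than legitimate records, could look like this:

```python
import pandas as pd

# Stand-in for customer_feedback_df containing one exact duplicate row (values are illustrative)
feedback = pd.DataFrame({
    "Customer_ID": [1, 2, 2, 3],
    "Satisfaction_Score": [10.0, 3.0, 3.0, 7.0],
    "Feedback_Comments": [
        "Very satisfied", "Very satisfied", "Very satisfied", "Needs improvement",
    ],
    "Likelihood_to_Recommend": [9, 3, 3, 4],
})

n_before = len(feedback)

# Keep the first occurrence of each fully identical row, then rebuild the index
deduped = feedback.drop_duplicates().reset_index(drop=True)

n_removed = n_before - len(deduped)
print(f"Removed {n_removed} duplicate row(s); {len(deduped)} remain")
```

`drop_duplicates()` compares all columns by default. Passing `subset=[...]` would instead treat rows as duplicates when only the listed key columns match, which is a stricter cleaning rule and worth checking against the business case first.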
