**Let's consider a scenario where we want to develop a machine learning model to predict customer satisfaction based on various features of a product or service. In this case, we might integrate a statistical test like the t-test into our model building process.**

Here's how we could do it:

- Data Collection: We gather data on customer satisfaction scores and various features of the product or service, such as price, features, quality, etc.

- Exploratory Data Analysis (EDA): We conduct exploratory data analysis to understand the relationships between different features and customer satisfaction. During this phase, we might use visualizations and summary statistics to identify potentially important features.

- Feature Selection: We use techniques like correlation analysis or domain knowledge to select a subset of features that are likely to be most predictive of customer satisfaction.

- Initial Model Building: We build a machine learning model (e.g., regression, decision trees, etc.) using all selected features without considering statistical significance.

- Integration of t-test: Before finalizing the model, we integrate a t-test to assess the statistical significance of each feature in predicting customer satisfaction. The t-test helps us determine if there is a significant difference in the means of a feature between satisfied and dissatisfied customers.

- Feature Adjustment: We adjust the model by either removing features that are not statistically significant or by giving more weight to the statistically significant features.

- Model Evaluation: We evaluate the performance of the adjusted model using cross-validation or a holdout dataset to ensure that it generalizes well to unseen data.

- Deployment and Monitoring: Once we are satisfied with the model's performance, we deploy it into production and continuously monitor its performance to ensure it remains effective over time.
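The notebook below skips straight from loading the data to feature selection, so here is a minimal sketch of what the EDA step might look like. The small `toy` frame is hypothetical (its values are made up for illustration); only the column names `satisfied`, `age`, and `salary` mirror the dataset used later.

```python
import pandas as pd

# Hypothetical toy data; the values are illustrative, not taken from the real dataset.
toy = pd.DataFrame({
    "satisfied": [1, 0, 0, 1, 1, 0],
    "age": [28, 50, 43, 44, 33, 40],
    "salary": [86750, 42419, 65715, 29805, 29805, 42419],
})

# Per-group mean and standard deviation of each numeric feature.
summary = toy.groupby("satisfied")[["age", "salary"]].agg(["mean", "std"])
print(summary)

# Pearson correlation of every column with the target.
print(toy.corr()["satisfied"])
```

Comparing group means this way gives a first visual hint of which features the later t-test might flag.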

This code can be found on GitHub.

In [1]:
import pandas as pd
import numpy as np
from scipy.stats import t, ttest_ind
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


# Step 1: Data Collection
# Assuming you have a dataset with features and a target variable
# (here an employee satisfaction dataset stands in for customer satisfaction)
data = pd.read_csv('Employee Satisfaction Index.csv')


In [None]:
# Preview the first rows of the dataframe and its column names
data.head()

Unnamed: 0,emp_id,age,Dept,location,education,recruitment_type,job_level,rating,onsite,awards,certifications,salary,satisfied
0,HR8270,28,HR,Suburb,PG,Referral,5,2,0,1,0,86750,1
1,TECH1860,50,Technology,Suburb,PG,Walk-in,3,5,1,2,1,42419,0
2,TECH6390,43,Technology,Suburb,UG,Referral,4,1,0,2,0,65715,0
3,SAL6191,44,Sales,City,PG,On-Campus,2,3,1,0,0,29805,1
4,HR6734,33,HR,City,UG,Recruitment Agency,2,1,0,5,0,29805,1


In [5]:
len(data)
data.dtypes

emp_id              object
age                  int64
Dept                object
location            object
education           object
recruitment_type    object
job_level            int64
rating               int64
onsite               int64
awards               int64
certifications       int64
salary               int64
satisfied            int64
dtype: object

In [6]:
print(data["Dept"].value_counts())
print(data["location"].value_counts())
print(data["education"].value_counts())
print(data["recruitment_type"].value_counts())

Dept
Purchasing    109
HR            106
Technology     98
Marketing      95
Sales          92
Name: count, dtype: int64
location
City      259
Suburb    241
Name: count, dtype: int64
education
PG    254
UG    246
Name: count, dtype: int64
recruitment_type
On-Campus             133
Referral              131
Walk-in               128
Recruitment Agency    108
Name: count, dtype: int64


In [7]:
# Convert the categorical columns to numerical values
cleanup_nums = {"Dept":     {"Purchasing": 1, "HR": 2, "Technology":3, "Marketing":4, "Sales":5},
                "location": {"City": 1, "Suburb": 2},
                "education": {"PG":1, "UG":2},
                "recruitment_type": {"On-Campus":1, "Referral":2, "Walk-in":3, "Recruitment Agency":4}}

In [8]:
data = data.replace(cleanup_nums)
data.head(10)

Unnamed: 0,emp_id,age,Dept,location,education,recruitment_type,job_level,rating,onsite,awards,certifications,salary,satisfied
0,HR8270,28,2,2,1,2,5,2,0,1,0,86750,1
1,TECH1860,50,3,2,1,3,3,5,1,2,1,42419,0
2,TECH6390,43,3,2,2,2,4,1,0,2,0,65715,0
3,SAL6191,44,5,1,1,1,2,3,1,0,0,29805,1
4,HR6734,33,2,1,2,4,2,1,0,5,0,29805,1
5,PUR7265,40,1,2,2,2,3,3,0,7,1,42419,1
6,PUR1466,26,1,2,2,2,5,5,0,2,0,86750,0
7,TECH5426,25,3,1,2,4,1,1,0,4,0,24076,0
8,HR6578,35,2,1,1,2,3,4,0,0,0,42419,1
9,TECH9322,45,3,1,1,2,3,3,0,9,0,42419,0
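The integer mapping above imposes an order on categories such as `Dept` that have no natural order. One common alternative is one-hot encoding; the following is a minimal sketch using `pd.get_dummies` on a hypothetical toy frame (the rows are made up, and using `drop_first=True` is an assumption rather than something the notebook does).

```python
import pandas as pd

# Hypothetical toy frame reusing two of the notebook's nominal column names.
toy = pd.DataFrame({
    "Dept": ["HR", "Technology", "Sales", "HR"],
    "location": ["Suburb", "City", "City", "Suburb"],
    "salary": [86750, 42419, 29805, 29805],
})

# One indicator column per category; drop_first removes one redundant
# column per feature so the indicators are not perfectly collinear.
encoded = pd.get_dummies(toy, columns=["Dept", "location"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

With indicator columns, each category gets its own coefficient in a linear model instead of being forced onto a single numeric scale.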


In [9]:
# Step 3: Feature Selection
# Let's say these six columns of the dataset are the selected features
selected_features = ['age', 'Dept', 'location','education', 'job_level','salary']

In [10]:
# Step 4: Initial Model Building
X = data[selected_features]
y = data['satisfied']

In [11]:
print(X)

     age  Dept  location  education  job_level  salary
0     28     2         2          1          5   86750
1     50     3         2          1          3   42419
2     43     3         2          2          4   65715
3     44     5         1          1          2   29805
4     33     2         1          2          2   29805
..   ...   ...       ...        ...        ...     ...
495   49     2         2          1          2   29805
496   24     3         2          2          2   29805
497   34     4         1          1          1   24076
498   26     3         1          2          2   29805
499   26     3         1          2          3   42419

[500 rows x 6 columns]


In [12]:
print(y)

0      1
1      0
2      0
3      1
4      1
      ..
495    1
496    0
497    1
498    0
499    0
Name: satisfied, Length: 500, dtype: int64


In [None]:
# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Step 5: Integration of t-test
# Let's perform t-test for each feature
for feature in selected_features:
    satisfied = data[data['satisfied'] == 1][feature]
    dissatisfied = data[data['satisfied'] == 0][feature]
    
    # Calculate t-test statistic and p-value
    t_stat, p_value = ttest_ind(satisfied, dissatisfied)
    
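    # Calculate a critical t-value from the t-distribution
    n1 = len(satisfied)
    n2 = len(dissatisfied)
    dof = n1 + n2 - 2  # Degrees of freedom for independent two-sample t-test
    # Note: t.ppf(0.05, dof) is the lower-tail critical value of a one-sided test
    # at the 0.05 level. ttest_ind is two-sided by default, so the matching
    # critical value would be t.ppf(1 - 0.05 / 2, dof), compared against abs(t_stat).
    critical_t = t.ppf(0.05, dof)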
    
    print(f"T-test results for '{feature}': t-statistic={t_stat}, p-value={p_value}, critical t-value={critical_t}")

T-test results for 'age': t-statistic=0.10019981699415642, p-value=0.9202260155271248, critical t-value=-1.6479191388550005
T-test results for 'Dept': t-statistic=-0.4631684089025832, p-value=0.6434459984176686, critical t-value=-1.6479191388550005
T-test results for 'location': t-statistic=-0.673988560983048, p-value=0.5006312971488471, critical t-value=-1.6479191388550005
T-test results for 'education': t-statistic=-0.6074026884700869, p-value=0.5438605564674008, critical t-value=-1.6479191388550005
T-test results for 'job_level': t-statistic=0.2252461525786165, p-value=0.8218801870567769, critical t-value=-1.6479191388550005
T-test results for 'salary': t-statistic=0.5171465003342448, p-value=0.6052834998330064, critical t-value=-1.6479191388550005


**In this output:**

- The t-statistic is the calculated t-value for each feature's two-sample t-test.
- The p-value is the probability of observing a t-statistic at least as extreme as the one computed, assuming the null hypothesis (no difference in the feature's mean between the satisfied and dissatisfied groups) is true.
- The interpretation of the results depends on the chosen significance level (commonly 0.05).
- If the p-value is less than the chosen significance level, we reject the null hypothesis and conclude that the feature's mean differs significantly between the two groups.
- Otherwise, if the p-value is greater than the significance level, we fail to reject the null hypothesis. Here every p-value is well above 0.05 (the smallest is about 0.50, for 'location'), so none of the six features shows a significant difference.
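The decision rule in the list above can be sketched in code. This example uses synthetic data (two normal samples whose means are deliberately shifted apart, not the notebook's dataset) and shows that, for a two-sided test, comparing the p-value with the significance level is equivalent to comparing `abs(t_stat)` with the two-sided critical value. The variable names are illustrative.

```python
import numpy as np
from scipy.stats import t, ttest_ind

rng = np.random.default_rng(0)
# Synthetic groups: the second one has its mean shifted by 1.
group_a = rng.normal(loc=0.0, scale=1.0, size=200)
group_b = rng.normal(loc=1.0, scale=1.0, size=200)

alpha = 0.05
# ttest_ind is two-sided and assumes equal variances by default.
t_stat, p_value = ttest_ind(group_a, group_b)

dof = len(group_a) + len(group_b) - 2
critical_t = t.ppf(1 - alpha / 2, dof)  # upper critical value, two-sided test

reject_by_p = p_value < alpha
reject_by_t = abs(t_stat) > critical_t
print(f"t={t_stat:.3f}, p={p_value:.3g}, critical=+/-{critical_t:.3f}")
print("reject H0:", reject_by_p, reject_by_t)
```

Both checks always agree for the pooled two-sample test, so reporting the p-value alone is usually enough.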

In [None]:
# Step 6: Feature Adjustment (if necessary)
# Let's say we decide to keep all features for simplicity

# Step 7: Model Building and Evaluation
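# Let's build a linear regression model
# (note: 'satisfied' is binary, so a classifier such as logistic regression
# would be the more usual choice; linear regression is kept here for simplicity)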
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions on the test set
y_pred = model.predict(X_test)
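Step 6 above keeps every feature. If one did want to drop features based on the t-test, a minimal sketch might look like the following. The helper `significant_features` is hypothetical, the data is synthetic, and the use of Welch's variant (`equal_var=False`) is an assumption that differs from the pooled test used earlier. In practice the filter would probably be applied to the training split only, so the test set does not influence feature choice.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

def significant_features(df, target, features, alpha=0.05):
    """Hypothetical helper: keep features whose group means differ at level alpha."""
    kept = []
    for feature in features:
        pos = df.loc[df[target] == 1, feature]
        neg = df.loc[df[target] == 0, feature]
        # Welch's t-test: does not assume equal variances in the two groups.
        _, p_value = ttest_ind(pos, neg, equal_var=False)
        if p_value < alpha:
            kept.append(feature)
    return kept

rng = np.random.default_rng(1)
n = 400
target = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "satisfied": target,
    "signal": target * 2.0 + rng.normal(size=n),  # depends on the target
    "noise": rng.normal(size=n),                  # independent of the target
})

kept = significant_features(toy, "satisfied", ["signal", "noise"])
print(kept)
```

Note that with many candidate features, some will pass a 0.05 threshold by chance alone, so a filter like this is best treated as a screening aid rather than a final verdict.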

In [None]:
# Model evaluation
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")

# Step 8: Deployment and Monitoring
# Deployment steps would depend on your production environment
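The evaluation step described at the top mentions cross-validation, while the notebook uses a single holdout split. A minimal sketch of k-fold cross-validation with scikit-learn's `cross_val_score` follows; the data is synthetic and the choice of five folds is an assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
# Synthetic regression problem: three features, small Gaussian noise.
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# scikit-learn reports negated MSE so that larger scores are always better.
scores = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error"
)
mse_per_fold = -scores
print(f"Mean MSE over folds: {mse_per_fold.mean():.4f}")
```

Averaging over folds gives a less split-dependent estimate than a single train/test split, at the cost of fitting the model several times.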

### Tasks to be conducted

**In this code:**

- Replace 'Employee Satisfaction Index.csv' with the path to your own dataset.

- 'satisfied' is assumed to be the column name for the target variable.

- Adjust the feature selection, model building, and evaluation steps as needed based on your specific dataset and requirements.

**In a real-world scenario, you may want to handle missing values, encode categorical variables, and perform other preprocessing steps before building the model.**