# Deep Dive into Coefficients
This section focuses on understanding whether a particular feature, or its coefficient, affects our linear models in a significant way, i.e. whether it is useful to include that feature. To do this, we can use t-tests and Wald tests.

## Data Preprocessing

This time, as an example, let's see whether Gender has a significant effect on our logistic regression model for the Social_Network_Ads dataset.

In [25]:
import pandas as pd
from ml_code.utils import load_data
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the dataset
data = load_data("Social_Network_Ads.csv")

# Preprocess the data with the "Gender" feature
transformer_with_gender = ColumnTransformer(
    transformers=[
        ("scaler", StandardScaler(), ["Age", "EstimatedSalary"]),
        ("encoder", OneHotEncoder(drop="first"), ["Gender"]),
    ],
    remainder="drop",
)

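# Note: the scaler is fitted on the full dataset here, before the train/test split,
# so test-set statistics influence the scaling of the training data.
features_with_gender = transformer_with_gender.fit_transform(data)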
target = data["Purchased"]

# Split the data into training and testing sets
X_train_with_gender, X_test_with_gender, y_train, y_test = train_test_split(
    features_with_gender, target, test_size=0.25, random_state=0
)

X_train_with_gender.shape, X_test_with_gender.shape, y_train.shape, y_test.shape

((300, 3), (100, 3), (300,), (100,))

## Training the model

In [26]:
feature_names = ["Age", "EstimatedSalary", "Gender_Male"]
# Add a constant term to the features
X_train_with_gender_sm = sm.add_constant(X_train_with_gender)

# Create a DataFrame with the feature names
X_train_with_gender_df = pd.DataFrame(
    X_train_with_gender_sm, columns=["const"] + feature_names, index=y_train.index
)

# Train a logistic regression model with the "Gender" feature using statsmodels
model_with_gender_sm = sm.Logit(y_train, X_train_with_gender_df)
results_with_gender_sm = model_with_gender_sm.fit()

Optimization terminated successfully.
         Current function value: 0.371191
         Iterations 7


## Evaluating the coefficients

In [27]:
print(results_with_gender_sm.summary())

                           Logit Regression Results                           
Dep. Variable:              Purchased   No. Observations:                  300
Model:                          Logit   Df Residuals:                      296
Method:                           MLE   Df Model:                            3
Date:                Sun, 19 May 2024   Pseudo R-squ.:                  0.4367
Time:                        13:42:22   Log-Likelihood:                -111.36
converged:                       True   LL-Null:                       -197.69
Covariance Type:            nonrobust   LLR p-value:                 3.393e-37
                      coef    std err          z      P>|z|      [0.025      0.975]
-----------------------------------------------------------------------------------
const              -1.2313      0.268     -4.594      0.000      -1.757      -0.706
Age                 2.3791      0.305      7.791      0.000       1.781       2.978
EstimatedSalary     1.2091      

Coefficients (coef):

- const: The intercept term is -1.2313, the baseline log-odds of purchasing when all predictors are zero. Because Age and EstimatedSalary were standardized and Gender was encoded with `drop="first"`, this corresponds to a female customer of average Age and average EstimatedSalary.
- Age: The coefficient for Age is 2.3791. Since Age was standardized with `StandardScaler`, a one-unit increase means one standard deviation of Age, which is associated with an increase of 2.3791 in the log-odds of purchasing, holding other predictors constant.
- EstimatedSalary: The coefficient for EstimatedSalary is 1.2091, so an increase of one standard deviation in EstimatedSalary is associated with an increase of 1.2091 in the log-odds of purchasing, holding other predictors constant.
- Gender_Male: The coefficient for Gender_Male is 0.2324, suggesting that being male is associated with an increase of 0.2324 in the log-odds of purchasing compared to being female, holding other predictors constant.
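
Log-odds are often easier to read once they are exponentiated into odds ratios. The block below is a minimal sketch of that conversion, assuming only the coefficient values quoted in this section; the `coefs` dictionary is typed in by hand for illustration and is not read from the fitted model.

```python
import math

# Coefficients copied by hand from the summary (log-odds scale).
# This dictionary is an illustrative assumption, not model output.
coefs = {
    "const": -1.2313,
    "Age": 2.3791,
    "EstimatedSalary": 1.2091,
    "Gender_Male": 0.2324,
}

# exp(coef) is the multiplicative change in the odds of purchasing
# for a one-unit increase in the (scaled) predictor.
odds_ratios = {
    name: math.exp(value) for name, value in coefs.items() if name != "const"
}

# Sigmoid of the intercept: purchase probability when every predictor is zero.
baseline_probability = 1.0 / (1.0 + math.exp(-coefs["const"]))

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f}")
print(f"Baseline probability: {baseline_probability:.3f}")
```

An odds ratio close to 1, as for Gender_Male here, means the predictor barely changes the odds. With the fitted model at hand, the same conversion can be applied to `results_with_gender_sm.params` instead of hand-copied values.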


Standard Errors (std err):

The standard errors quantify the uncertainty in the coefficient estimates. Smaller standard errors indicate more precise estimates.
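
The standard errors also determine the confidence intervals in the last two columns of the summary. As a rough sketch, assuming the usual normal approximation with a critical value of 1.96, the 95% intervals can be rebuilt from the two rows that are fully visible in the summary. The values are copied by hand, so the results may differ from the printed ones in the last digit because of rounding.

```python
# (coefficient, standard error) pairs copied by hand from the summary rows
# that are fully visible; an illustrative assumption, not model output.
estimates = {
    "const": (-1.2313, 0.268),
    "Age": (2.3791, 0.305),
}

Z_CRITICAL = 1.96  # two-sided 95% critical value of the standard normal

intervals = {}
for name, (coef, std_err) in estimates.items():
    margin = Z_CRITICAL * std_err  # half-width of the interval
    intervals[name] = (coef - margin, coef + margin)

for name, (lower, upper) in intervals.items():
    print(f"{name}: [{lower:.3f}, {upper:.3f}]")
```

An interval that excludes zero corresponds to a coefficient that is significant at the 0.05 level.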


Z-values (z):

The z-values are the Wald test statistics, calculated as the coefficient divided by its standard error.
- For Age, the z-value is 7.791, far beyond the 1.96 threshold for the 0.05 level, which is strong evidence of a positive effect on the probability of purchasing.
- For EstimatedSalary, the z-value is 5.844, which is also strong evidence of a positive effect on the probability of purchasing.
- For Gender_Male, the z-value is 0.687, well below 1.96, so the estimated effect cannot be distinguished from zero at the 0.05 level.
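
The Wald statistic can be checked by hand. The sketch below assumes only the coefficient and standard error values of the two rows that are fully visible in the summary, copied manually; because those printed values are rounded, the recomputed z-values agree with the summary only to about two decimal places.

```python
# (coefficient, standard error) pairs copied by hand from the summary;
# an illustrative assumption, not values read from the fitted model.
estimates = {
    "const": (-1.2313, 0.268),
    "Age": (2.3791, 0.305),
}

Z_CRITICAL = 1.96  # two-sided threshold for the 0.05 significance level

z_values = {}
for name, (coef, std_err) in estimates.items():
    # Wald z statistic: estimate divided by its standard error.
    z_values[name] = coef / std_err

for name, z in z_values.items():
    verdict = "significant" if abs(z) > Z_CRITICAL else "not significant"
    print(f"{name}: z = {z:.3f} ({verdict} at the 0.05 level)")
```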


P-values (P>|z|):

The p-values indicate the statistical significance of each predictor.
- For Age and EstimatedSalary, the p-values are reported as 0.000 (that is, below 0.0005 after rounding), which is strong evidence against the null hypothesis that the coefficients are zero.
- For Gender_Male, the p-value is 0.492, indicating that the effect of gender is not statistically significant at the conventional 0.05 level.
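
Each p-value follows directly from its z-value: it is the two-sided tail probability of a standard normal distribution. A minimal sketch using only the Python standard library is shown below; the z-values are the ones quoted in this section, and `two_sided_p_value` is a helper name introduced here for illustration.

```python
import math


def two_sided_p_value(z: float) -> float:
    """Return P(|Z| > |z|) for a standard normal Z."""
    # For a standard normal, 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# z-values quoted in the text above (copied by hand).
z_values = {"Age": 7.791, "EstimatedSalary": 5.844, "Gender_Male": 0.687}

p_values = {name: two_sided_p_value(z) for name, z in z_values.items()}

for name, p in p_values.items():
    print(f"{name}: p = {p:.3g}")
```

The Gender_Male result reproduces the 0.492 shown in the summary, while the other two p-values are so small that the summary rounds them to 0.000.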

In [28]:
X_test_with_gender_df = pd.DataFrame(X_test_with_gender, columns=feature_names)

# Add a constant term to the test features
X_test_with_gender_sm = sm.add_constant(X_test_with_gender_df)

# Get the predicted probabilities and predicted classes
y_pred_prob = results_with_gender_sm.predict(X_test_with_gender_sm)
y_pred_class = (y_pred_prob > 0.5).astype(int)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred_class)
print("\nModel Performance:")
print("Accuracy:", accuracy)

# Calculate precision
precision = precision_score(y_test, y_pred_class)
print("Precision:", precision)

# Calculate recall
recall = recall_score(y_test, y_pred_class)
print("Recall:", recall)

# Calculate F1-score
f1 = f1_score(y_test, y_pred_class)
print("F1-score:", f1)


Model Performance:
Accuracy: 0.91
Precision: 0.896551724137931
Recall: 0.8125
F1-score: 0.8524590163934426
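
To see how these four metrics relate to one another, they can be recomputed from confusion-matrix counts. The notebook does not print a confusion matrix, so the counts below are an assumption: they are the counts for 100 test rows that are consistent with the printed accuracy, precision, and recall.

```python
# Confusion-matrix counts inferred from the printed metrics (100 test rows).
# They are an assumption for illustration; the notebook does not print them.
tp, fp, fn, tn = 26, 3, 6, 65

accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of correct predictions
precision = tp / (tp + fp)  # predicted buyers who actually bought
recall = tp / (tp + fn)  # actual buyers the model found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)
```

Under these counts the model misses 6 of the 32 actual buyers, which is why recall is lower than precision.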
