**<font size="5"> Lab Cross Validation </font>**


In [9]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import TomekLinks

In [13]:
churnData = pd.read_csv('Customer-Churn.csv')
churnData['TotalCharges'] = pd.to_numeric(churnData['TotalCharges'], errors='coerce')
churnData.dropna(inplace=True)
features = ['tenure', 'SeniorCitizen', 'MonthlyCharges', 'TotalCharges']
X = churnData[features]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
y = churnData['Churn']
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)

X_train shape: (5625, 4)
X_test shape: (1407, 4)
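The cell above fits `StandardScaler` on all rows before splitting, so the test rows influence the scaling statistics. A common alternative is to split first and fit the scaler on the training rows only. The sketch below illustrates that order on synthetic data; the names `X_demo`, `y_demo`, and `demo_scaler` and the generated values are assumptions, not part of the lab's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four numeric churn features (an assumption).
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=20.0, size=(1000, 4))
y_demo = rng.choice(["No", "Yes"], size=1000, p=[0.73, 0.27])

# Split first; stratify keeps the class ratio similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fit the scaler on the training rows only, then reuse it on the test rows.
demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print("train shape:", X_tr_scaled.shape)
print("test shape:", X_te_scaled.shape)
```

With this order the test rows are transformed using the training mean and standard deviation, which is what would happen with genuinely new customers.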


**Apply SMOTE for upsampling**

In [14]:
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X_train, y_train)

print("X_smote shape:", X_smote.shape)

X_smote shape: (8260, 4)


The output "X_smote shape: (8260, 4)" shows that after SMOTE oversampling the training set has 8,260 rows and 4 feature columns. With its default settings, SMOTE adds synthetic minority-class rows until both classes are the same size, so the 5,625 original training rows are joined by 2,635 synthetic "Yes" rows, giving 4,130 rows per class. The original rows each represent a real customer; the synthetic rows are interpolated from existing churners and do not correspond to actual customers.
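To make the idea of a synthetic row concrete, here is a minimal sketch of the interpolation step behind SMOTE: a new point is placed on the line segment between a minority-class sample and one of its nearest minority-class neighbours. It is a simplified illustration, not the `imblearn` implementation; the data and the helper `make_synthetic` are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# Hypothetical minority-class points in two dimensions.
minority = rng.normal(size=(20, 2))

# k nearest minority neighbours of each point (column 0 is the point itself).
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(minority)
_, neighbour_idx = nn.kneighbors(minority)


def make_synthetic(n_new):
    """Create n_new points by interpolating between a sample and a random neighbour."""
    base = rng.integers(0, len(minority), size=n_new)
    picked = neighbour_idx[base, rng.integers(1, k + 1, size=n_new)]
    gap = rng.random((n_new, 1))  # position along the segment, in [0, 1)
    return minority[base] + gap * (minority[picked] - minority[base])


synthetic = make_synthetic(10)
print(synthetic.shape)
```

Because every synthetic point lies between two existing minority points, the new rows stay inside the region already occupied by that class.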

In [15]:
logistic_model = LogisticRegression()
decision_tree_model = DecisionTreeClassifier()

logistic_scores = cross_val_score(logistic_model, X_smote, y_smote, cv=5, scoring='accuracy')
decision_tree_scores = cross_val_score(decision_tree_model, X_smote, y_smote, cv=5, scoring='accuracy')


logistic_mean_accuracy = logistic_scores.mean()
decision_tree_mean_accuracy = decision_tree_scores.mean()

print("Logistic Regression Mean Accuracy (SMOTE):", logistic_mean_accuracy)
print("Decision Tree Mean Accuracy (SMOTE):", decision_tree_mean_accuracy)


Logistic Regression Mean Accuracy (SMOTE): 0.7317191283292978
Decision Tree Mean Accuracy (SMOTE): 0.7616222760290556


These results represent the accuracy of two different models in making predictions about whether customers will leave a service or not. Here's a simple explanation:

**Logistic Regression Mean Accuracy (SMOTE):** 73.2% - This means that, on average, the logistic regression model correctly predicts whether a customer will leave or stay in the service about 73.2% of the time when we used a technique called SMOTE to balance the data.

**Decision Tree Mean Accuracy (SMOTE):** 76.2% - Similarly, the decision tree model correctly predicts about 76.2% of the time using the same balanced data.

Both models score in the mid-70s, with the decision tree about 3 percentage points ahead of logistic regression. These figures should be read with care: SMOTE was applied before cross-validation, so every validation fold contains synthetic rows interpolated from rows in the training folds, which tends to make the scores optimistic. Accuracy on a resampled dataset also says little about how well churners specifically are detected, so metrics such as recall and F1 are worth checking before choosing a model.
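One way to keep resampled rows out of the validation data is to resample inside each fold, so that only the training part is balanced. The sketch below shows this with a manual `StratifiedKFold` loop on synthetic data. To stay dependent on scikit-learn only, it uses plain random oversampling as a stand-in for SMOTE; the helper `oversample_minority` and the generated dataset are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# Synthetic imbalanced data standing in for the churn training set (an assumption).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.73, 0.27], random_state=42
)


def oversample_minority(X, y, seed):
    """Random oversampling of class 1 up to the size of class 0 (stand-in for SMOTE)."""
    X_min, X_maj = X[y == 1], X[y == 0]
    X_extra = resample(X_min, replace=True, n_samples=len(X_maj), random_state=seed)
    X_bal = np.vstack([X_maj, X_extra])
    y_bal = np.concatenate(
        [np.zeros(len(X_maj), dtype=int), np.ones(len(X_extra), dtype=int)]
    )
    return X_bal, y_bal


fold_scores = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(cv.split(X_demo, y_demo)):
    # Balance only the training part of the fold.
    X_bal, y_bal = oversample_minority(X_demo[train_idx], y_demo[train_idx], seed=fold)
    model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
    # Score on validation rows that keep the original class ratio.
    fold_scores.append(model.score(X_demo[val_idx], y_demo[val_idx]))

print("Mean accuracy on untouched validation folds:", np.mean(fold_scores))
```

The `imblearn.pipeline.Pipeline` class offers the same behaviour more compactly, since it applies samplers only while fitting, and a pipeline built with it can be passed directly to `cross_val_score`.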

In [16]:
tomek_links = TomekLinks()
X_tomek, y_tomek = tomek_links.fit_resample(X_train, y_train)

In [None]:
print("X_tomek shape:", X_tomek.shape)

X_tomek shape: (5224, 4)


The output **"X_tomek shape: (5224, 4)"** shows that after TomekLinks undersampling the training set has 5,224 rows and 4 feature columns, 401 fewer rows than the original 5,625. TomekLinks looks for pairs of nearest neighbours that belong to opposite classes and removes the majority-class member of each pair. Every remaining row is still a real customer. This cleans up the boundary between the classes rather than equalising them, so the dataset stays imbalanced.
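The pair-finding step can be sketched in a few lines: a Tomek link is a pair of points that are each other's nearest neighbour and carry different labels. The example below uses a tiny hand-made one-dimensional dataset (hypothetical values) and drops the majority-class member of each link. It is an illustration of the idea, not the `imblearn` implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Tiny hand-made example on a line; "No" is the majority class (hypothetical data).
X_demo = np.array([[-1.0], [0.0], [0.8], [2.0], [2.4], [5.0], [6.0]])
y_demo = np.array(["No", "No", "No", "No", "Yes", "Yes", "Yes"])

# Nearest neighbour of every point, skipping the point itself (column 0).
nn = NearestNeighbors(n_neighbors=2).fit(X_demo)
_, idx = nn.kneighbors(X_demo)
nearest = idx[:, 1]

# A Tomek link is a pair of mutual nearest neighbours with different labels.
links = [
    (i, int(j)) for i, j in enumerate(nearest)
    if nearest[j] == i and y_demo[i] != y_demo[j] and i < j
]
print("Tomek links:", links)

# Drop the majority-class ("No") member of each link.
drop = {i if y_demo[i] == "No" else j for i, j in links}
keep = np.array([k not in drop for k in range(len(X_demo))])
print("Rows kept:", X_demo[keep].ravel())
```

Here the only link is the pair at 2.0 ("No") and 2.4 ("Yes"), so a single "No" row is removed and all "Yes" rows stay.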

In [18]:
logistic_model_tomek = LogisticRegression()
decision_tree_model_tomek = DecisionTreeClassifier()

logistic_scores_tomek = cross_val_score(logistic_model_tomek, X_tomek, y_tomek, cv=5, scoring='accuracy')
decision_tree_scores_tomek = cross_val_score(decision_tree_model_tomek, X_tomek, y_tomek, cv=5, scoring='accuracy')


In [19]:
logistic_mean_accuracy_tomek = logistic_scores_tomek.mean()
decision_tree_mean_accuracy_tomek = decision_tree_scores_tomek.mean()

In [20]:
print("Logistic Regression Mean Accuracy (TomekLinks):", logistic_mean_accuracy_tomek)
print("Decision Tree Mean Accuracy (TomekLinks):", decision_tree_mean_accuracy_tomek)

Logistic Regression Mean Accuracy (TomekLinks): 0.7934554253973493
Decision Tree Mean Accuracy (TomekLinks): 0.7530655007424516


These results are the mean 5-fold cross-validation accuracies of the two models after TomekLinks undersampling:

**Logistic Regression Mean Accuracy (TomekLinks):** 79.3% - On average, the logistic regression model correctly predicts whether a customer will leave or stay about 79.3% of the time on the TomekLinks-cleaned data.

**Decision Tree Mean Accuracy (TomekLinks):** 75.3% - Similarly, the decision tree model is correct about 75.3% of the time on the same data.

Here logistic regression is about 4 percentage points ahead of the decision tree. Because roughly 71% of the rows in this dataset are "No", a model that always predicted "No" would already reach about 71% accuracy, so both models improve on that baseline only modestly. Higher accuracy is generally better, but on imbalanced data other metrics, such as recall for the "Yes" class, also matter when choosing a model.
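The "other factors" can be measured in the same cross-validation run. As a sketch, `cross_validate` accepts several scoring names at once. The example uses synthetic data with 0/1 labels, which is an assumption; with the lab's string labels, the recall and F1 scorers would need the positive class named explicitly, for example through `make_scorer(recall_score, pos_label='Yes')`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic imbalanced data; class 1 plays the role of "Yes" (an assumption).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.73, 0.27], random_state=0
)

metrics = ["accuracy", "balanced_accuracy", "recall", "f1"]
results = cross_validate(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5, scoring=metrics
)

# Each metric is stored under the key "test_<name>", one value per fold.
for name in metrics:
    print(f"{name}: {results['test_' + name].mean():.3f}")
```

A model can have high accuracy and still low recall on the minority class, which is the case that matters most when the goal is to catch customers who are about to leave.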

In [21]:
print("Class Distribution after TomekLinks:\n", y_tomek.value_counts())

Class Distribution after TomekLinks:
 Churn
No     3729
Yes    1495
Name: count, dtype: int64


**Class Distribution after TomekLinks:**

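After applying TomekLinks, the class distribution is only slightly more balanced than before:

**No**: 3,729 customers who have not left the service (down from 4,130 before undersampling).

**Yes**: 1,495 customers who have left the service (unchanged).

The "No" class still makes up about 71% of the rows, compared with about 73% before.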
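The counts printed in this notebook are enough to check how much TomekLinks changed the class ratio. The short calculation below uses only numbers from the outputs above and relies on one assumption, labeled in the comments: SMOTE with default settings produces two classes of equal size, so half of its 8,260 rows equals the original majority count.

```python
# Numbers taken from the notebook outputs.
rows_train = 5625                # X_train rows
rows_smote = 8260                # X_smote rows
no_after, yes_after = 3729, 1495 # y_tomek.value_counts()

# Assumption: default SMOTE equalises the classes, so each has half the rows.
no_before = rows_smote // 2
yes_before = rows_train - no_before

removed = no_before - no_after
share_before = no_before / rows_train
share_after = no_after / (no_after + yes_after)

print("Yes rows before:", yes_before)
print("No rows removed by TomekLinks:", removed)
print(f"Majority share before: {share_before:.3f}")
print(f"Majority share after:  {share_after:.3f}")
```

The majority share is also the accuracy of a model that always predicts "No", which gives a baseline for reading the accuracy figures.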

**Conclusion:**

TomekLinks removed 401 majority-class rows, which made the training set slightly less imbalanced. On that data, 5-fold cross-validation gave an accuracy of **79.3%** for logistic regression and **75.3%** for the decision tree. With SMOTE the order was reversed: the decision tree reached 76.2% and logistic regression 73.2%. These numbers are not directly comparable, because each pair was measured on a differently resampled dataset with a different class ratio, so they do not by themselves show that resampling improved the models. A fairer comparison would score every variant on the untouched test set (`X_test`, `y_test`), which this lab creates but does not use. Handling class imbalance remains an important step, but its effect should be judged on data that has not been resampled.
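As a sketch of what a held-out evaluation could look like, the example below trains both model types on an undersampled training set and scores them on a test split that was never resampled. It runs on synthetic data and uses plain random undersampling as a simple stand-in for TomekLinks; the dataset, the variable names, and the `random_state` values are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data; class 1 plays the role of "Yes" (an assumption).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.73, 0.27], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Random undersampling of class 0 in the training rows only
# (a simple stand-in for TomekLinks).
rng = np.random.default_rng(42)
maj_idx = np.flatnonzero(y_tr == 0)
min_idx = np.flatnonzero(y_tr == 1)
kept_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
train_idx = np.concatenate([kept_maj, min_idx])

models = [
    ("logistic", LogisticRegression(max_iter=1000)),
    ("tree", DecisionTreeClassifier(random_state=42)),
]
test_accuracy = {}
for name, model in models:
    model.fit(X_tr[train_idx], y_tr[train_idx])
    pred = model.predict(X_te)  # test rows keep the original class ratio
    test_accuracy[name] = accuracy_score(y_te, pred)
    print(name, "test accuracy:", round(test_accuracy[name], 3))
    print(confusion_matrix(y_te, pred))
```

Because every variant is scored on the same untouched rows, the resulting numbers can be compared across resampling strategies, and the confusion matrix shows how many churners each model actually catches.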