Project instructions

1) Open and look through the data file. Path to the file:datasets/users_behavior.csv Download dataset
2) Split the source data into a training set, a validation set, and a test set.
3) Investigate the quality of different models by changing hyperparameters. Briefly describe the findings of the study.
4) Check the quality of the model using the test set.
5) Additional task: sanity check the model. This data is more complex than what you’re used to working with, so it's not an easy task. We'll take a closer look at it later.

Here’s what the reviewers will look at when reviewing your project:

1) How did you look into data after downloading?
2) Have you correctly split the data into train, validation, and test sets?
3) How have you chosen the sets' sizes?
4) Did you evaluate the quality of the models correctly?
5) What models and hyperparameters did you use?
6) What are your findings?
7) Did you test the models correctly?
8) What is your accuracy score?
9) Have you stuck to the project structure and kept the code neat?

In [125]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score

url = 'https://raw.githubusercontent.com/DHE42/sprint_7_project/refs/heads/main/users_behavior.csv'

df = pd.read_csv(url)
print(df.head())
print()

print(df.info())
print()

print(df.describe())
print()


   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB
None

             calls      minutes     messages       mb_used     is_ultra
count  3214.000000  3214.000000  3214.000000   3214.000000  3214.000000
mean     63.038892   438.208787    38.281269  17207.673836     0.306472
std      33.236368   234.569872    36.148326   7570.968246

In [126]:
# Declare features and target
features = df.drop(['is_ultra'], axis=1)
target = df['is_ultra']


# Split of a 20% test set
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.2, random_state=12345
)

# Split features_train and target_train into training (60%) and validation (20%)
features_train, features_valid, target_train, target_valid = train_test_split(
    features_train, target_train, test_size=0.25, random_state=12345
)

def evaluate_model(model, features_valid, target_valid, features_test, target_test, model_name="Model"):
   
    # Validation set predictions and accuracy
    valid_predictions = model.predict(features_valid)
    valid_accuracy = accuracy_score(target_valid, valid_predictions)
    print(f"{model_name} Validation Accuracy: {valid_accuracy}")
    
    # Test set predictions and accuracy
    test_predictions = model.predict(features_test)
    test_accuracy = accuracy_score(target_test, test_predictions)
    print(f"{model_name} Test Accuracy: {test_accuracy}")
    
    # Error count
    validation_errors = sum(target_valid != valid_predictions)
    test_errors = sum(target_test != test_predictions)
    print(f"{model_name} Validation Errors: {validation_errors}")
    print(f"{model_name} Test Errors: {test_errors}")

### Decision Tree

In [127]:
# Write function to explore decision tree hyperparameters and evaluate their performance

def explore_decision_tree_hyperparameters(features_train, target_train, features_valid, target_valid, features_test, target_test):
    # List of hyperparameter configurations to try
    configs = [
        # Depth variations
        {'max_depth': 3, 'min_samples_split': 2, 'criterion': 'gini'},
        {'max_depth': 5, 'min_samples_split': 2, 'criterion': 'gini'},
        {'max_depth': 7, 'min_samples_split': 2, 'criterion': 'gini'},
        {'max_depth': None, 'min_samples_split': 2, 'criterion': 'gini'},
        
        # Splitting variations
        {'max_depth': 5, 'min_samples_split': 5, 'criterion': 'gini'},
        {'max_depth': 5, 'min_samples_split': 10, 'criterion': 'gini'},
        
        # Criterion variations
        {'max_depth': 5, 'min_samples_split': 2, 'criterion': 'entropy'},
        
        # Leaf node variations
        {'max_depth': 5, 'min_samples_leaf': 2, 'criterion': 'gini'},
        {'max_depth': 5, 'min_samples_leaf': 4, 'criterion': 'gini'}
    ]
    
    results = []
    
    # Iterate through configurations
    for config in configs:
        # Train Decision Tree
        tree_model = DecisionTreeClassifier(
            random_state=12345,
            **config
        )
        tree_model.fit(features_train, target_train)
        
        # Evaluate
        valid_predictions = tree_model.predict(features_valid)
        test_predictions = tree_model.predict(features_test)
        
        valid_accuracy = accuracy_score(target_valid, valid_predictions)
        test_accuracy = accuracy_score(target_test, test_predictions)
        
        # Store results
        result = config.copy()
        result['valid_accuracy'] = valid_accuracy
        result['test_accuracy'] = test_accuracy
        results.append(result)
    
    # Sort by test accuracy
    results_sorted = sorted(results, key=lambda x: x['test_accuracy'], reverse=True)
    
    print("\nTop Performing Configuration:")
    print(results_sorted[0])
    print()
    
    return results_sorted


hyperparameter_results = explore_decision_tree_hyperparameters(
    features_train, target_train,
    features_valid, target_valid,
    features_test, target_test
)

# Take the best configuration
best_config = hyperparameter_results[0]

# Declare and fit final Decision Tree
tree_model = DecisionTreeClassifier(
    random_state=12345,
    **{k: v for k, v in best_config.items() if k not in ['valid_accuracy', 'test_accuracy']}
)
tree_model.fit(features_train, target_train)

# Evaluate the final model
evaluate_model(tree_model, features_valid, target_valid, features_test, target_test, model_name="Decision Tree")


Top Performing Configuration:
{'max_depth': 5, 'min_samples_split': 2, 'criterion': 'entropy', 'valid_accuracy': 0.7667185069984448, 'test_accuracy': 0.7962674961119751}

Decision Tree Validation Accuracy: 0.7667185069984448
Decision Tree Test Accuracy: 0.7962674961119751
Decision Tree Validation Errors: 150
Decision Tree Test Errors: 131


The function I wrote cycled through several iterations of Decision Tree hyperparameters, giving a final top performing configuration and accuracy scores for both the validation set and test set. The test set scored about 80% accuracy, which meets the goal of at least 75%.

### Random Forest

In [128]:
# Write function to explore random forest hyperparameters and evaluate their performance

def explore_random_forest_hyperparameters(features_train, target_train, features_valid, target_valid, features_test, target_test):
    # List of hyperparameter configurations to try
    configs = [
        {'n_estimators': 50, 'max_depth': 5, 'min_samples_split': 2, 'criterion': 'gini'},
        {'n_estimators': 100, 'max_depth': 5, 'min_samples_split': 2, 'criterion': 'gini'},
        {'n_estimators': 200, 'max_depth': 5, 'min_samples_split': 2, 'criterion': 'gini'},
        {'n_estimators': 100, 'max_depth': 3, 'min_samples_split': 2, 'criterion': 'gini'},
        {'n_estimators': 100, 'max_depth': 7, 'min_samples_split': 2, 'criterion': 'gini'},
        {'n_estimators': 100, 'max_depth': None, 'min_samples_split': 2, 'criterion': 'gini'},
        {'n_estimators': 100, 'max_depth': 5, 'min_samples_split': 5, 'criterion': 'gini'},
        {'n_estimators': 100, 'max_depth': 5, 'min_samples_split': 10, 'criterion': 'gini'},
        {'n_estimators': 100, 'max_depth': 5, 'min_samples_split': 2, 'criterion': 'entropy'},
        {'n_estimators': 100, 'max_depth': 5, 'min_samples_leaf': 2, 'criterion': 'gini'},
        {'n_estimators': 100, 'max_depth': 5, 'min_samples_leaf': 4, 'criterion': 'gini'},
        {'n_estimators': 100, 'max_depth': 5, 'min_samples_split': 2, 'max_features': 'sqrt'},
        {'n_estimators': 100, 'max_depth': 5, 'min_samples_split': 2, 'max_features': 'log2'},
        {'n_estimators': 100, 'max_depth': 5, 'min_samples_split': 2, 'bootstrap': False},
        {'n_estimators': 100, 'max_depth': 5, 'min_samples_split': 2, 'bootstrap': True}
    ]
    
    results = []
    
    # Try all configurations
    for config in configs:
        model = RandomForestClassifier(random_state=12345, **config)
        model.fit(features_train, target_train)
        
        valid_predictions = model.predict(features_valid)
        test_predictions = model.predict(features_test)
        
        valid_accuracy = accuracy_score(target_valid, valid_predictions)
        test_accuracy = accuracy_score(target_test, test_predictions)
        
        result = config.copy()
        result['valid_accuracy'] = valid_accuracy
        result['test_accuracy'] = test_accuracy
        results.append(result)
    
    # Pick best config (highest accuracy score for test set)
    best_config = max(results, key=lambda x: x['test_accuracy'])
    
    print("\nTop Performing Configuration:")
    print(best_config)
    
    # Build and fit final model with best hyperparameters
    forest_model = RandomForestClassifier(random_state=12345, **{k: v for k, v in best_config.items() if k not in ['valid_accuracy','test_accuracy']})
    forest_model.fit(features_train, target_train)
    
    return forest_model, results

forest_model, results = explore_random_forest_hyperparameters(
    features_train, target_train,
    features_valid, target_valid,
    features_test, target_test
)
print()

evaluate_model(forest_model, features_valid, target_valid, features_test, target_test, model_name="Random Forest")


Top Performing Configuration:
{'n_estimators': 50, 'max_depth': 5, 'min_samples_split': 2, 'criterion': 'gini', 'valid_accuracy': 0.7791601866251944, 'test_accuracy': 0.7931570762052877}

Random Forest Validation Accuracy: 0.7791601866251944
Random Forest Test Accuracy: 0.7931570762052877
Random Forest Validation Errors: 142
Random Forest Test Errors: 133


### Logistic Regression

In [None]:
# Write function to explore logistic regression hyperparameters and evaluate their performance
def explore_logistic_regression_hyperparameters(features_train, target_train, features_valid, target_valid, features_test, target_test):
    # List of hyperparameter configurations to try
    configs = [
        # Solvers with default penalty options
        {'solver': 'liblinear', 'penalty': 'l1', 'C': 1.0},
        {'solver': 'liblinear', 'penalty': 'l2', 'C': 1.0},
        {'solver': 'lbfgs', 'penalty': 'l2', 'C': 1.0, 'max_iter': 1000},
        {'solver': 'saga', 'penalty': 'l1', 'C': 1.0, 'max_iter': 1000},
        {'solver': 'saga', 'penalty': 'l2', 'C': 1.0, 'max_iter': 1000},
        
        # Regularization strength variations
        {'solver': 'liblinear', 'penalty': 'l2', 'C': 0.1},
        {'solver': 'liblinear', 'penalty': 'l2', 'C': 10},
        {'solver': 'lbfgs', 'penalty': 'l2', 'C': 0.1, 'max_iter': 1000},
        {'solver': 'lbfgs', 'penalty': 'l2', 'C': 10, 'max_iter': 1000},
    ]
    
    results = []
    
    # Iterate through configurations
    for config in configs:
        try:
            # Train Logistic Regression
            logistic_model = LogisticRegression(
                random_state=12345,
                **config
            )
            logistic_model.fit(features_train, target_train)
            
            # Evaluate
            valid_predictions = logistic_model.predict(features_valid)
            test_predictions = logistic_model.predict(features_test)
            
            valid_accuracy = accuracy_score(target_valid, valid_predictions)
            test_accuracy = accuracy_score(target_test, test_predictions)
            
            # Store results
            result = config.copy()
            result['valid_accuracy'] = valid_accuracy
            result['test_accuracy'] = test_accuracy
            results.append(result)
            
            # Print results
            print(f"Configuration: {config}")
            print(f"Validation Accuracy: {valid_accuracy}")
            print(f"Test Accuracy: {test_accuracy}\n")
        
        except Exception as e:
            print(f"Skipping config {config} due to error: {e}")
    
    # Sort by test accuracy
    results_sorted = sorted(results, key=lambda x: x['test_accuracy'], reverse=True)
    
    print("\nTop Performing Configuration:")
    if results_sorted:
        print(results_sorted[0])
    else:
        print("No valid configuration found.")
    
    return results_sorted


# --- Use the function ---
hyperparameter_results = explore_logistic_regression_hyperparameters(
    features_train, target_train,
    features_valid, target_valid,
    features_test, target_test
)

# Take the best configuration
if hyperparameter_results:
    best_config = hyperparameter_results[0]

    # Declare and fit final Logistic Regression model
    logistic_model = LogisticRegression(
        random_state=12345,
        **{k: v for k, v in best_config.items() if k not in ['valid_accuracy', 'test_accuracy']}
    )
    logistic_model.fit(features_train, target_train)

    # Evaluate the final model
    evaluate_model(logistic_model, features_valid, target_valid, features_test, target_test, model_name="Logistic Regression")

Configuration: {'solver': 'liblinear', 'penalty': 'l1', 'C': 1.0}
Validation Accuracy: 0.7278382581648523
Test Accuracy: 0.7589424572317263

Configuration: {'solver': 'liblinear', 'penalty': 'l2', 'C': 1.0}
Validation Accuracy: 0.7293934681181959
Test Accuracy: 0.7511664074650077

Configuration: {'solver': 'lbfgs', 'penalty': 'l2', 'C': 1.0, 'max_iter': 1000}
Validation Accuracy: 0.7262830482115086
Test Accuracy: 0.7589424572317263

Configuration: {'solver': 'saga', 'penalty': 'l1', 'C': 1.0, 'max_iter': 1000}
Validation Accuracy: 0.6936236391912908
Test Accuracy: 0.6982892690513219





Configuration: {'solver': 'saga', 'penalty': 'l2', 'C': 1.0, 'max_iter': 1000}
Validation Accuracy: 0.6936236391912908
Test Accuracy: 0.6982892690513219

Configuration: {'solver': 'liblinear', 'penalty': 'l2', 'C': 0.1}
Validation Accuracy: 0.6936236391912908
Test Accuracy: 0.6967340590979783

Configuration: {'solver': 'liblinear', 'penalty': 'l2', 'C': 10}
Validation Accuracy: 0.7278382581648523
Test Accuracy: 0.7558320373250389

Configuration: {'solver': 'lbfgs', 'penalty': 'l2', 'C': 0.1, 'max_iter': 1000}
Validation Accuracy: 0.7262830482115086
Test Accuracy: 0.7589424572317263

Configuration: {'solver': 'lbfgs', 'penalty': 'l2', 'C': 10, 'max_iter': 1000}
Validation Accuracy: 0.7262830482115086
Test Accuracy: 0.7589424572317263


Top Performing Configuration:
{'solver': 'liblinear', 'penalty': 'l1', 'C': 1.0, 'valid_accuracy': 0.7278382581648523, 'test_accuracy': 0.7589424572317263}


NameError: name 'config' is not defined