## Megaline Legacy Plan Conversion

I'll begin by importing the libraries and modules for handling data, building and evaluating regression and classification models, splitting datasets, and preprocessing features. Then, I'll load the DataFrame I'll be using from the Github repository I have set up for this project

In [119]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

url = 'https://raw.githubusercontent.com/DHE42/sprint_7_project/refs/heads/main/users_behavior.csv'

df = pd.read_csv(url)
print(df.head())
print()

print(df.info())
print()

print(df.describe())
print()


   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB
None

             calls      minutes     messages       mb_used     is_ultra
count  3214.000000  3214.000000  3214.000000   3214.000000  3214.000000
mean     63.038892   438.208787    38.281269  17207.673836     0.306472
std      33.236368   234.569872    36.148326   7570.968246

It is evident from preliminary explorations that there are five columns, and that is_ultra is the target. Since we do not have separate training, validation, and testing datasets, I will split the data into 60% and 40%, and then divide that 40% in half to receive a training data set of 60%, a validation data set of 20%, and a testing dataset of 20%. I will also declare all columns but is_ultra the features, and is_ultra the target. Finally, I will write a function that evaluates the models I will need to train in order to gauge their accuracy.

In [120]:
# Declare features and target
features = df.drop(['is_ultra'], axis=1)
target = df['is_ultra']


# Split of a 20% test set
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.2, random_state=12345
)

# Split features_train and target_train into training (60%) and validation (20%)
features_train, features_valid, target_train, target_valid = train_test_split(
    features_train, target_train, test_size=0.25, random_state=12345
)

def evaluate_model(model, features_valid, target_valid, features_test, target_test, model_name="Model"):
   
    # Validation set predictions and accuracy
    valid_predictions = model.predict(features_valid)
    valid_accuracy = accuracy_score(target_valid, valid_predictions)
    print(f"{model_name} Validation Accuracy: {valid_accuracy}")
    
    # Test set predictions and accuracy
    test_predictions = model.predict(features_test)
    test_accuracy = accuracy_score(target_test, test_predictions)
    print(f"{model_name} Test Accuracy: {test_accuracy}")
    
    # Error count
    validation_errors = sum(target_valid != valid_predictions)
    test_errors = sum(target_test != test_predictions)
    print(f"{model_name} Validation Errors: {validation_errors}")
    print(f"{model_name} Test Errors: {test_errors}")

### Decision Tree

I will begin my model exploration by writing a function to explore several different decision tree hyperparameters and give me the hyperparameters that will result in the highest accuracy scores. Decision trees tend to be very fast with the drawback of underfittedness if the depth of the tree is under four, with the converse problem if it is over four. Let's play around with some other hyperparameters: min_samples_split, criterion, and gini.

In [121]:
# Write function to explore decision tree hyperparameters and evaluate their performance

def explore_decision_tree_hyperparameters(features_train, target_train, features_valid, target_valid, features_test, target_test):
    # List of hyperparameter configurations to try
    configs = [
        # Depth variations
        {'max_depth': 3, 'min_samples_split': 2, 'criterion': 'gini'},
        {'max_depth': 5, 'min_samples_split': 2, 'criterion': 'gini'},
        {'max_depth': 7, 'min_samples_split': 2, 'criterion': 'gini'},
        {'max_depth': None, 'min_samples_split': 2, 'criterion': 'gini'},
        
        # Splitting variations
        {'max_depth': 5, 'min_samples_split': 5, 'criterion': 'gini'},
        {'max_depth': 5, 'min_samples_split': 10, 'criterion': 'gini'},
        
        # Criterion variations
        {'max_depth': 5, 'min_samples_split': 2, 'criterion': 'entropy'},
        
        # Leaf node variations
        {'max_depth': 5, 'min_samples_leaf': 2, 'criterion': 'gini'},
        {'max_depth': 5, 'min_samples_leaf': 4, 'criterion': 'gini'}
    ]
    
    results = []
    
    # Iterate through configurations
    for config in configs:
        # Train Decision Tree
        tree_model = DecisionTreeClassifier(
            random_state=12345,
            **config
        )
        tree_model.fit(features_train, target_train)
        
        # Evaluate
        valid_predictions = tree_model.predict(features_valid)
        test_predictions = tree_model.predict(features_test)
        
        valid_accuracy = accuracy_score(target_valid, valid_predictions)
        test_accuracy = accuracy_score(target_test, test_predictions)
        
        # Store results
        result = config.copy()
        result['valid_accuracy'] = valid_accuracy
        result['test_accuracy'] = test_accuracy
        results.append(result)
    
    # Sort by test accuracy
    results_sorted = sorted(results, key=lambda x: x['test_accuracy'], reverse=True)
    
    print("\nTop Performing Configuration:")
    print(results_sorted[0])
    print()
    
    return results_sorted


hyperparameter_results = explore_decision_tree_hyperparameters(
    features_train, target_train,
    features_valid, target_valid,
    features_test, target_test
)

# Take the best configuration
best_config = hyperparameter_results[0]

# Declare and fit final Decision Tree
tree_model = DecisionTreeClassifier(
    random_state=12345,
    **{k: v for k, v in best_config.items() if k not in ['valid_accuracy', 'test_accuracy']}
)
tree_model.fit(features_train, target_train)

# Evaluate the final model
evaluate_model(tree_model, features_valid, target_valid, features_test, target_test, model_name="Decision Tree")


Top Performing Configuration:
{'max_depth': 5, 'min_samples_split': 2, 'criterion': 'entropy', 'valid_accuracy': 0.7667185069984448, 'test_accuracy': 0.7962674961119751}

Decision Tree Validation Accuracy: 0.7667185069984448
Decision Tree Test Accuracy: 0.7962674961119751
Decision Tree Validation Errors: 150
Decision Tree Test Errors: 131


The function I wrote cycled through several iterations of Decision Tree hyperparameters, giving a final top performing configuration and accuracy scores for both the validation set and test set. The test set scored about 80% accuracy, which meets the goal of at least 75%.

### Random Forest

Let's do something similar with a random forest. These tend to have the highest accuracy since it uses a tree ensemble, but the slowest speed due to the large amount of decisions it cycles through. We'll try different configurations for n_estimators, max_depth, min_samples_split, criterion, and gini.

In [122]:
# Write function to explore random forest hyperparameters and evaluate their performance
def explore_random_forest_hyperparameters(features_train, target_train, features_valid, target_valid, features_test, target_test):
    # List of hyperparameter configurations to try
    configs = [
        {'n_estimators': 50, 'max_depth': 5, 'min_samples_split': 2, 'criterion': 'gini'},
        {'n_estimators': 100, 'max_depth': 5, 'min_samples_split': 2, 'criterion': 'gini'},
        {'n_estimators': 200, 'max_depth': 5, 'min_samples_split': 2, 'criterion': 'gini'},
        {'n_estimators': 100, 'max_depth': 3, 'min_samples_split': 2, 'criterion': 'gini'},
        {'n_estimators': 100, 'max_depth': 7, 'min_samples_split': 2, 'criterion': 'gini'},
        {'n_estimators': 100, 'max_depth': None, 'min_samples_split': 2, 'criterion': 'gini'},
        {'n_estimators': 100, 'max_depth': 5, 'min_samples_split': 5, 'criterion': 'gini'},
        {'n_estimators': 100, 'max_depth': 5, 'min_samples_split': 10, 'criterion': 'gini'},
        {'n_estimators': 100, 'max_depth': 5, 'min_samples_split': 2, 'criterion': 'entropy'},
        {'n_estimators': 100, 'max_depth': 5, 'min_samples_leaf': 2, 'criterion': 'gini'},
        {'n_estimators': 100, 'max_depth': 5, 'min_samples_leaf': 4, 'criterion': 'gini'},
        {'n_estimators': 100, 'max_depth': 5, 'min_samples_split': 2, 'max_features': 'sqrt'},
        {'n_estimators': 100, 'max_depth': 5, 'min_samples_split': 2, 'max_features': 'log2'},
        {'n_estimators': 100, 'max_depth': 5, 'min_samples_split': 2, 'bootstrap': False},
        {'n_estimators': 100, 'max_depth': 5, 'min_samples_split': 2, 'bootstrap': True}
    ]
    
    results = []
    
    # Try all configurations
    for config in configs:
        model = RandomForestClassifier(random_state=12345, **config)
        model.fit(features_train, target_train)
        
        valid_predictions = model.predict(features_valid)
        test_predictions = model.predict(features_test)
        
        valid_accuracy = accuracy_score(target_valid, valid_predictions)
        test_accuracy = accuracy_score(target_test, test_predictions)
        
        result = config.copy()
        result['valid_accuracy'] = valid_accuracy
        result['test_accuracy'] = test_accuracy
        results.append(result)
    
    # Pick best config (highest accuracy score for test set)
    best_config = max(results, key=lambda x: x['test_accuracy'])
    
    print("\nTop Performing Configuration:")
    print(best_config)
    print()
    
    # Build and fit final model with best hyperparameters
    forest_model = RandomForestClassifier(
        random_state=12345,
        **{k: v for k, v in best_config.items() if k not in ['valid_accuracy','test_accuracy']}
    )
    forest_model.fit(features_train, target_train)
    
    # Evaluate final model
    evaluate_model(forest_model, features_valid, target_valid, features_test, target_test, model_name="Random Forest")
    
    return forest_model, results


# Call the function
forest_model, results = explore_random_forest_hyperparameters(
    features_train, target_train,
    features_valid, target_valid,
    features_test, target_test
)


Top Performing Configuration:
{'n_estimators': 50, 'max_depth': 5, 'min_samples_split': 2, 'criterion': 'gini', 'valid_accuracy': 0.7791601866251944, 'test_accuracy': 0.7931570762052877}

Random Forest Validation Accuracy: 0.7791601866251944
Random Forest Test Accuracy: 0.7931570762052877
Random Forest Validation Errors: 142
Random Forest Test Errors: 133


Interesting! It looks like the random forest model had lower accuracy than the decision tree model.

### Logistic Regression

Logistic regression models typically have medium accuracy, but high speed like decision tree models. It has the lowest number of hyperparameters, funnily enough. We'll play around with the the solver hyperparameter mostly, with the max_iter hyperparameter set at 5000 to avoid errors when computing.

In [None]:
def explore_logistic_regression_hyperparameters(features_train, target_train,
                                                 features_valid, target_valid,
                                                 features_test, target_test):
    
    # Scaled features to prevent logistic regression failure
    scaler = StandardScaler()
    features_train_scaled = scaler.fit_transform(features_train)
    features_valid_scaled = scaler.transform(features_valid)
    features_test_scaled = scaler.transform(features_test)
    
    # List of hyperparameter configurations to try
    configs = [
        # Solvers with default penalty options
        {'solver': 'liblinear', 'penalty': 'l1', 'C': 1.0, 'max_iter': 5000},
        {'solver': 'liblinear', 'penalty': 'l2', 'C': 1.0, 'max_iter': 5000},
        {'solver': 'lbfgs', 'penalty': 'l2', 'C': 1.0, 'max_iter': 5000},
        {'solver': 'saga', 'penalty': 'l1', 'C': 1.0, 'max_iter': 5000},
        {'solver': 'saga', 'penalty': 'l2', 'C': 1.0, 'max_iter': 5000},
        # Regularization strength variations
        {'solver': 'liblinear', 'penalty': 'l2', 'C': 0.1, 'max_iter': 5000},
        {'solver': 'liblinear', 'penalty': 'l2', 'C': 10, 'max_iter': 5000},
        {'solver': 'lbfgs', 'penalty': 'l2', 'C': 0.1, 'max_iter': 5000},
        {'solver': 'lbfgs', 'penalty': 'l2', 'C': 10, 'max_iter': 5000},
    ]
    
    results = []
    
    # Iterate through configurations
    for config in configs:
        logistic_model = LogisticRegression(random_state=12345, **config)
        logistic_model.fit(features_train_scaled, target_train)
            
        # Evaluate
        valid_predictions = logistic_model.predict(features_valid_scaled)
        test_predictions = logistic_model.predict(features_test_scaled)
            
        valid_accuracy = accuracy_score(target_valid, valid_predictions)
        test_accuracy = accuracy_score(target_test, test_predictions)
            
        # Store results
        result = config.copy()
        result['valid_accuracy'] = valid_accuracy
        result['test_accuracy'] = test_accuracy
        results.append(result)

    
    # Sort results by test accuracy
    results_sorted = sorted(results, key=lambda x: x['test_accuracy'], reverse=True)
    
    # Pick best config
    best_config = results_sorted[0]
    print("Top Performing Configuration:")
    print(best_config)
    print()
    
    # Build and fit final model with best hyperparameters
    final_logistic_model = LogisticRegression(
        random_state=12345,
        **{k: v for k, v in best_config.items() if k not in ['valid_accuracy', 'test_accuracy']}
    )
    final_logistic_model.fit(features_train_scaled, target_train)
    
    # Evaluate final model
    evaluate_model(final_logistic_model, features_valid_scaled, target_valid,
                   features_test_scaled, target_test, model_name="Logistic Regression")
    
    return final_logistic_model, results

logistic_model, results = explore_logistic_regression_hyperparameters(
    features_train, target_train,
    features_valid, target_valid,
    features_test, target_test
)



Top Performing Configuration:
{'solver': 'liblinear', 'penalty': 'l1', 'C': 1.0, 'max_iter': 5000, 'valid_accuracy': 0.7278382581648523, 'test_accuracy': 0.7589424572317263}

Logistic Regression Validation Accuracy: 0.7278382581648523
Logistic Regression Test Accuracy: 0.7589424572317263
Logistic Regression Validation Errors: 175
Logistic Regression Test Errors: 155


In practice, it appears that this model has the lowest accuracy score. You learn something new every day!

### Conclusion

In practice, it appears that thigns bore out rather differently than in the classroom. It looks like the decistion tree had highest accuracy, followed by random forest, followed by logistic regression.