Project instructions

1) Open and look through the data file. Path to the file: datasets/users_behavior.csv
2) Split the source data into a training set, a validation set, and a test set.
3) Investigate the quality of different models by changing hyperparameters. Briefly describe the findings of the study.
4) Check the quality of the model using the test set.
5) Additional task: sanity check the model. This data is more complex than what you’re used to working with, so it's not an easy task. We'll take a closer look at it later.
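
One common way to approach the sanity check in step 5 is to compare a model against a constant baseline that always predicts the most frequent class. The following is a minimal sketch, assuming the target is a plain list of 0/1 labels; the helper name `majority_baseline_accuracy` and the toy labels are hypothetical and not part of the project.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a constant model that always predicts the most common label."""
    if not labels:
        raise ValueError("labels must not be empty")
    # most_common(1) returns [(label, count)] for the most frequent label
    _, count = Counter(labels).most_common(1)[0]
    return count / len(labels)


# Toy labels for illustration only: 7 zeros and 3 ones
toy_labels = [0] * 7 + [1] * 3
baseline = majority_baseline_accuracy(toy_labels)
print(baseline)
```

A trained classifier passes this kind of check only if its accuracy on held-out data is clearly above the baseline value.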

Here’s what the reviewers will look at when reviewing your project:

1) How did you examine the data after downloading it?
2) Have you correctly split the data into train, validation, and test sets?
3) How have you chosen the sets' sizes?
4) Did you evaluate the quality of the models correctly?
5) What models and hyperparameters did you use?
6) What are your findings?
7) Did you test the models correctly?
8) What is your accuracy score?
9) Have you stuck to the project structure and kept the code neat?

In [81]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

url = 'https://raw.githubusercontent.com/DHE42/sprint_7_project/refs/heads/main/users_behavior.csv'

df = pd.read_csv(url)
print(df.head())
print()

print(df.info())
print()

print(df.describe())
print()


   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB
None

             calls      minutes     messages       mb_used     is_ultra
count  3214.000000  3214.000000  3214.000000   3214.000000  3214.000000
mean     63.038892   438.208787    38.281269  17207.673836     0.306472
std      33.236368   234.569872    36.148326   7570.968246

In [82]:
# Declare features and target
features = df.drop(['is_ultra'], axis=1)
target = df['is_ultra']


# Split off a 20% test set
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.2, random_state=12345
)

# Split the remaining 80% into training (60% of all rows) and validation (20% of all rows): 0.25 * 0.8 = 0.2
features_train, features_valid, target_train, target_valid = train_test_split(
    features_train, target_train, test_size=0.25, random_state=12345
)

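def evaluate_model(model, features_valid, target_valid, features_test, target_test, model_name="Model"):
    """Print accuracy and misclassification counts for a fitted model on the validation and test sets."""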
    # Validation set predictions and accuracy
    valid_predictions = model.predict(features_valid)
    valid_accuracy = accuracy_score(target_valid, valid_predictions)
    print(f"{model_name} Validation Accuracy: {valid_accuracy}")
    
    # Test set predictions and accuracy
    test_predictions = model.predict(features_test)
    test_accuracy = accuracy_score(target_test, test_predictions)
    print(f"{model_name} Test Accuracy: {test_accuracy}")
    
    # Error count
    validation_errors = sum(target_valid != valid_predictions)
    test_errors = sum(target_test != test_predictions)
    print(f"{model_name} Validation Errors: {validation_errors}")
    print(f"{model_name} Test Errors: {test_errors}")
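
The two-stage split above gives a 60/20/20 division because 25% of the remaining 80% equals 20% of the full data. The sketch below works through that arithmetic. It assumes the held-out part is rounded up and the remainder stays in training, which is how `train_test_split` is understood to treat a fractional `test_size`, but this is stated here as an assumption. `split_sizes` is a hypothetical helper, not part of the notebook.

```python
import math


def split_sizes(n_rows, test_fraction=0.2, valid_fraction_of_rest=0.25):
    """Return (train, valid, test) row counts for a two-stage hold-out split."""
    # Stage 1: hold out the test set, rounding the held-out part up
    n_test = math.ceil(n_rows * test_fraction)
    n_rest = n_rows - n_test
    # Stage 2: hold out the validation set from what remains
    n_valid = math.ceil(n_rest * valid_fraction_of_rest)
    n_train = n_rest - n_valid
    return n_train, n_valid, n_test


# 3214 is the row count reported by df.info()
print(split_sizes(3214))
```

Under these assumptions the validation and test sets end up the same size, so accuracies measured on them are directly comparable.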

### Decision Tree

In [None]:
tree_model = DecisionTreeClassifier(random_state=12345, max_depth=5)
tree_model.fit(features_train, target_train)

def explore_decision_tree_hyperparameters(features_train, target_train, features_valid, target_valid, features_test, target_test):
    # List of hyperparameter configurations to try
    configs = [
        # Depth variations
        {'max_depth': 3, 'min_samples_split': 2, 'criterion': 'gini'},
        {'max_depth': 5, 'min_samples_split': 2, 'criterion': 'gini'},
        {'max_depth': 7, 'min_samples_split': 2, 'criterion': 'gini'},
        {'max_depth': None, 'min_samples_split': 2, 'criterion': 'gini'},
        
        # Splitting variations
        {'max_depth': 5, 'min_samples_split': 5, 'criterion': 'gini'},
        {'max_depth': 5, 'min_samples_split': 10, 'criterion': 'gini'},
        
        # Criterion variations
        {'max_depth': 5, 'min_samples_split': 2, 'criterion': 'entropy'},
        
        # Leaf node variations
        {'max_depth': 5, 'min_samples_leaf': 2, 'criterion': 'gini'},
        {'max_depth': 5, 'min_samples_leaf': 4, 'criterion': 'gini'}
    ]
    
    # Store results
    results = []
    
    # Iterate through configurations
    for config in configs:
        print("\nConfiguration:", config)
        
        # Create and train the model
        tree_model = DecisionTreeClassifier(
            random_state=12345,
            **config  # Unpack the configuration
        )
        tree_model.fit(features_train, target_train)
        
        # Evaluate the model
        valid_predictions = tree_model.predict(features_valid)
        test_predictions = tree_model.predict(features_test)
        
        valid_accuracy = accuracy_score(target_valid, valid_predictions)
        test_accuracy = accuracy_score(target_test, test_predictions)
        
        # Store results
        result = config.copy()
        result['valid_accuracy'] = valid_accuracy
        result['test_accuracy'] = test_accuracy
        results.append(result)
        
        # Print results
        print(f"Validation Accuracy: {valid_accuracy:.4f}")
        print(f"Test Accuracy: {test_accuracy:.4f}")
    
    # Sort results by validation accuracy; ranking on test accuracy would leak the test set into model selection
    results_sorted = sorted(results, key=lambda x: x['valid_accuracy'], reverse=True)
    
    print("\nTop Performing Configurations:")
    for result in results_sorted[:3]:
        print(result)
    
    return results_sorted

# Use the function
hyperparameter_results = explore_decision_tree_hyperparameters(
    features_train, target_train, features_valid, target_valid, features_test, target_test
)


evaluate_model(tree_model, features_valid, target_valid, features_test, target_test, model_name="Decision Tree")


Decision Tree Validation Accuracy: 0.7589424572317263
Decision Tree Test Accuracy: 0.7884914463452566
Decision Tree Validation Errors: 155
Decision Tree Test Errors: 136
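
The accuracy and error counts printed above are two views of the same quantity: accuracy equals one minus the error count divided by the set size. The short check below assumes 643 rows in each of the validation and test sets. That figure is derived from applying a 20% hold-out to 3214 rows and is not printed by the notebook. `accuracy_from_errors` is a hypothetical helper.

```python
import math


def accuracy_from_errors(n_errors, n_rows):
    """Convert a misclassification count into an accuracy score."""
    return 1 - n_errors / n_rows


# Error counts copied from the Decision Tree output; 643 rows per set is assumed
valid_accuracy = accuracy_from_errors(155, 643)
test_accuracy = accuracy_from_errors(136, 643)

# Compare against the accuracies printed by evaluate_model
print(math.isclose(valid_accuracy, 0.7589424572317263))
print(math.isclose(test_accuracy, 0.7884914463452566))
```

If either comparison were false, it would suggest that the assumed set size is wrong or that the accuracy and error count came from different runs.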


### Random Forest

In [84]:
forest_model = RandomForestClassifier(random_state=12345, n_estimators=100, max_depth=10)
forest_model.fit(features_train, target_train)

evaluate_model(forest_model, features_valid, target_valid, features_test, target_test, model_name="Random Forest")

Random Forest Validation Accuracy: 0.7962674961119751
Random Forest Test Accuracy: 0.8009331259720062
Random Forest Validation Errors: 131
Random Forest Test Errors: 128


### Logistic Regression

In [85]:
regression_model = LogisticRegression(random_state=12345, max_iter=1000)
regression_model.fit(features_train, target_train)


evaluate_model(regression_model, features_valid, target_valid, features_test, target_test, model_name="Logistic Regression")

Logistic Regression Validation Accuracy: 0.7262830482115086
Logistic Regression Test Accuracy: 0.7589424572317263
Logistic Regression Validation Errors: 176
Logistic Regression Test Errors: 155
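
To choose among the three models, one reasonable convention is to rank them by validation accuracy and consult the test accuracy only for the winner. The sketch below uses the scores printed above, rounded to four decimals. The `scores` dictionary and the `pick_by_validation` helper are hypothetical names introduced for illustration.

```python
# Validation and test accuracies copied from the outputs above (rounded)
scores = {
    "Decision Tree": {"valid": 0.7589, "test": 0.7885},
    "Random Forest": {"valid": 0.7963, "test": 0.8009},
    "Logistic Regression": {"valid": 0.7263, "test": 0.7589},
}


def pick_by_validation(scores):
    """Return the model name with the highest validation accuracy."""
    return max(scores, key=lambda name: scores[name]["valid"])


best = pick_by_validation(scores)
# Report the test accuracy only for the model chosen on validation data
print(best, scores[best]["test"])
```

With these figures the random forest ranks first on validation accuracy, and it also has the highest test accuracy of the three.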
