Project instructions

1) Open and look through the data file. Path to the file: datasets/users_behavior.csv
2) Split the source data into a training set, a validation set, and a test set.
3) Investigate the quality of different models by changing hyperparameters. Briefly describe the findings of the study.
4) Check the quality of the model using the test set.
5) Additional task: sanity check the model. This data is more complex than what you’re used to working with, so it's not an easy task. We'll take a closer look at it later.

Here’s what the reviewers will look at when reviewing your project:

1) How did you examine the data after downloading it?
2) Have you correctly split the data into train, validation, and test sets?
3) How have you chosen the sets' sizes?
4) Did you evaluate the quality of the models correctly?
5) What models and hyperparameters did you use?
6) What are your findings?
7) Did you test the models correctly?
8) What is your accuracy score?
9) Have you stuck to the project structure and kept the code neat?

In [19]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

url = 'https://raw.githubusercontent.com/DHE42/sprint_7_project/refs/heads/main/users_behavior.csv?token=GHSAT0AAAAAADLACSSAT5W5KGLTMFPYB5N62GDOAAQ'
df = pd.read_csv(url)
print(df.head())
print()

# df.info() prints its summary itself and returns None, so it is not wrapped in print()
df.info()
print()

print(df.describe())
print()


   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB

             calls      minutes     messages       mb_used     is_ultra
count  3214.000000  3214.000000  3214.000000   3214.000000  3214.000000
mean     63.038892   438.208787    38.281269  17207.673836     0.306472
std      33.236368   234.569872    36.148326   7570.968246

In [None]:
# Declare features and target
features = df.drop(['is_ultra'], axis=1)
target = df['is_ultra']


# Split off a 20% test set
features_train_val, features_test, target_train_val, target_test = train_test_split(
    features, target, test_size=0.2, random_state=12345
)

# Split train_val into training (60%) and validation (20%)
features_train, features_valid, target_train, target_valid = train_test_split(
    features_train_val, target_train_val, test_size=0.25, random_state=12345
)
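
The two-step split above takes 20% for the test set and then 25% of the remaining 80%, which is another 20% of the whole. A quick way to confirm the resulting 60/20/20 proportions is to print the size of each part. The sketch below is only an illustration: `demo_df` is a hypothetical stand-in frame with the same row count and column names as the dataset but random placeholder values, so that the block runs on its own. In the notebook, the same `len()` checks can be applied to `features_train`, `features_valid`, and `features_test`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in: same shape and column names as users_behavior.csv,
# but the values are random placeholders, not the real data.
rng = np.random.default_rng(12345)
demo_df = pd.DataFrame(
    rng.random((3214, 4)),
    columns=['calls', 'minutes', 'messages', 'mb_used'],
)
demo_df['is_ultra'] = rng.integers(0, 2, size=3214)

demo_features = demo_df.drop(['is_ultra'], axis=1)
demo_target = demo_df['is_ultra']

# Step 1: hold out 20% of all rows as the test set.
demo_train_val, demo_test, demo_y_train_val, demo_y_test = train_test_split(
    demo_features, demo_target, test_size=0.2, random_state=12345
)

# Step 2: 25% of the remaining 80% is 20% of the whole, leaving 60% for training.
demo_train, demo_valid, demo_y_train, demo_y_valid = train_test_split(
    demo_train_val, demo_y_train_val, test_size=0.25, random_state=12345
)

split_sizes = {
    'train': len(demo_train),
    'valid': len(demo_valid),
    'test': len(demo_test),
}
for name, size in split_sizes.items():
    print(f"{name}: {size} rows ({size / len(demo_df):.1%})")
```

With 3214 rows this should give 1928 training rows and 643 rows each for validation and test.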

### Decision Tree

In [None]:
dt_model = DecisionTreeClassifier(random_state=12345, max_depth=5)
dt_model.fit(features_train, target_train)
dt_preds = dt_model.predict(features_valid)
print("Decision Tree Accuracy:", accuracy_score(target_valid, dt_preds))


Decision Tree Accuracy: 0.7589424572317263
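
The cell above scores a single tree with `max_depth=5`. Since the project asks for a comparison across hyperparameters, one possible approach is to loop over several depths and record the validation accuracy for each. The helper `sweep_tree_depth` and the depth range 1 to 10 are assumptions made for this sketch, and the demo data is synthetic so that the block runs on its own. In the notebook, the function would be called with `features_train`, `target_train`, `features_valid`, and `target_valid`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def sweep_tree_depth(features_train, target_train, features_valid, target_valid,
                     depths=range(1, 11)):
    """Return {max_depth: validation accuracy} for every depth tried."""
    scores = {}
    for depth in depths:
        model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
        model.fit(features_train, target_train)
        scores[depth] = accuracy_score(target_valid, model.predict(features_valid))
    return scores


# Synthetic stand-in data (not the real dataset), used only to make the sketch runnable.
demo_X, demo_y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=0, random_state=12345
)
X_tr, X_va, y_tr, y_va = train_test_split(
    demo_X, demo_y, test_size=0.25, random_state=12345
)

depth_scores = sweep_tree_depth(X_tr, y_tr, X_va, y_va)
best_depth = max(depth_scores, key=depth_scores.get)
for depth, acc in depth_scores.items():
    print(f"max_depth={depth}: validation accuracy {acc:.4f}")
print("Best depth on validation:", best_depth)
```

Choosing the depth on the validation set, rather than the test set, keeps the test set untouched for the final check.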


### Random Forest

In [None]:
rf_model = RandomForestClassifier(random_state=12345, n_estimators=100, max_depth=10)
rf_model.fit(features_train, target_train)
rf_preds = rf_model.predict(features_valid)
print("Random Forest Accuracy:", accuracy_score(target_valid, rf_preds))

Random Forest Accuracy: 0.7962674961119751
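
The forest above uses one fixed combination, `n_estimators=100` and `max_depth=10`. A small grid over both hyperparameters may show whether a different combination does better on the validation set. The helper `sweep_forest` and the grid values are assumptions chosen for illustration, and the demo data is synthetic. In the notebook, the function would receive the real training and validation splits.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def sweep_forest(features_train, target_train, features_valid, target_valid,
                 n_estimators_grid=(10, 50, 100), depth_grid=(5, 10, None)):
    """Return (accuracy, n_estimators, max_depth) tuples, best accuracy first."""
    results = []
    for n_est, depth in product(n_estimators_grid, depth_grid):
        model = RandomForestClassifier(
            random_state=12345, n_estimators=n_est, max_depth=depth
        )
        model.fit(features_train, target_train)
        acc = accuracy_score(target_valid, model.predict(features_valid))
        results.append((acc, n_est, depth))
    # Sort on accuracy only, because max_depth may be None and cannot be compared.
    return sorted(results, key=lambda row: row[0], reverse=True)


# Synthetic stand-in data (not the real dataset), used only to make the sketch runnable.
demo_X, demo_y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=0, random_state=12345
)
X_tr, X_va, y_tr, y_va = train_test_split(
    demo_X, demo_y, test_size=0.25, random_state=12345
)

forest_results = sweep_forest(X_tr, y_tr, X_va, y_va)
for acc, n_est, depth in forest_results:
    print(f"n_estimators={n_est}, max_depth={depth}: validation accuracy {acc:.4f}")
```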


### Logistic Regression

In [None]:
log_model = LogisticRegression(random_state=12345, max_iter=1000)
log_model.fit(features_train, target_train)
log_preds = log_model.predict(features_valid)
print("Logistic Regression Accuracy:", accuracy_score(target_valid, log_preds))

Logistic Regression Accuracy: 0.7262830482115086
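
All three scores above are validation accuracies. The project also asks for a final check on the test set and a sanity check. One common sanity check, sketched below, compares the chosen model with a baseline that always predicts the most frequent class. The `describe()` output reports a mean of about 0.306 for `is_ultra`, so always predicting 0 would already be right for roughly 69% of the full dataset, and a useful model should beat that. The helper `final_check` is a hypothetical name, and the demo data is synthetic with a similar class balance. In the notebook, it would be called with the selected model and the real `features_train`, `target_train`, `features_test`, and `target_test`.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def final_check(model, features_fit, target_fit, features_test, target_test):
    """Fit the model and a majority-class baseline, then score both on the test set."""
    model.fit(features_fit, target_fit)
    baseline = DummyClassifier(strategy='most_frequent')
    baseline.fit(features_fit, target_fit)
    return {
        'model_test_accuracy': accuracy_score(target_test, model.predict(features_test)),
        'baseline_test_accuracy': accuracy_score(target_test, baseline.predict(features_test)),
    }


# Synthetic stand-in data with roughly 69% / 31% classes (not the real dataset).
demo_X, demo_y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=0,
    weights=[0.69, 0.31], random_state=12345,
)
X_fit, X_te, y_fit, y_te = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=12345
)

check = final_check(
    RandomForestClassifier(random_state=12345, n_estimators=100, max_depth=10),
    X_fit, y_fit, X_te, y_te,
)
print("Model test accuracy:   ", round(check['model_test_accuracy'], 4))
print("Baseline test accuracy:", round(check['baseline_test_accuracy'], 4))
```

If the model's test accuracy is no higher than the baseline's, the model has not learned anything beyond the class balance.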
