# Sprint 7: Intro to Machine Learning

Megaline, a mobile carrier, aims to transition subscribers from legacy plans to newer plans—Smart or Ultra—by leveraging a predictive model. Subscriber behavior data from those who have already switched to these plans is available, providing the foundation for this classification task.

With data preprocessing already completed, the focus is on building a classifier that recommends the more suitable plan. The model must reach an accuracy of at least 0.75 on a held-out test set; beyond that threshold, higher accuracy is better.

## The Data

In [13]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

In [14]:
df = pd.read_csv('/datasets/users_behavior.csv')

***Data Info***
- calls: number of calls
- minutes: total call duration in minutes
- messages: number of text messages
- mb_used: Internet traffic used, in MB
- is_ultra: plan for the current month (Ultra = 1, Smart = 0)

In [15]:
# df.info() prints its summary and returns None, so it is called without display()
df.info()
display(df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB

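|   | calls | minutes | messages | mb_used | is_ultra |
|---|-------|---------|----------|---------|----------|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |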


In [16]:
display(df.duplicated().sum())
display(df.duplicated(['calls','minutes','messages','mb_used']).sum())

0

0

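***Data Takeaways:***

- no null values
- data types are workable for every column (calls and messages are stored as floats even though they are counts)
- no fully duplicated rows
- no two rows share identical feature values, so no usage profile appears under both plans
- 3214 rows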

## Developing the Model

- Analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.
- Build a model with the highest possible accuracy.
- The accuracy threshold is 0.75.
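
A threshold such as 0.75 is easier to interpret next to a trivial baseline that always predicts the most frequent class. The sketch below illustrates that check with scikit-learn's `DummyClassifier`. It runs on synthetic data, and the 30% share of positive labels is a made-up assumption, not a property of the Megaline dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption: about 30% positive labels)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.3).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345
)

# The baseline ignores the features and always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_accuracy = baseline.score(X_test, y_test)

print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

Run against the real `features_train` and `target_train`, the same few lines would show how much of the 0.75 target a model gets "for free" from class imbalance.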

1. Split the source data into a training set, a validation set, and a test set.
2. Investigate the quality of different models by changing hyperparameters. Briefly describe the findings of the study.
3. Check the quality of the model using the test set.
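
Step 1 describes an explicit three-way split, while the notebook uses an 80/20 train/test split and relies on cross-validation inside `GridSearchCV` for validation. For reference, a 60/20/20 split could be sketched with two calls to `train_test_split`. The data here is synthetic, and `stratify` is an optional addition that keeps the class ratio equal across the three parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 4 features, 30% positive labels (assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.array([1] * 300 + [0] * 700)

# First cut: hold out 20% of all rows as the test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345, stratify=y
)

# Second cut: 25% of the remaining 80% equals 20% of the original data
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=12345, stratify=y_rest
)

print(len(X_train), len(X_valid), len(X_test))
```

With cross-validation the separate validation set is not needed, which leaves more rows for training on a dataset of this size.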

### Splitting Source Data

In [17]:
# Define features and target
features = df.drop('is_ultra', axis=1)
target = df['is_ultra']

# Split the data into training (80%) and test (20%) sets.
# Validation is handled later by 5-fold cross-validation inside GridSearchCV.
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.2, random_state=12345
)

### Investigating the Different Models

***DecisionTreeClassifier:***

In [18]:
# Set up the grid of hyperparameters
param_grid_dt = {
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize the model
model_dt = DecisionTreeClassifier(random_state=54321)

# Set up GridSearchCV
grid_search_dt = GridSearchCV(estimator=model_dt, param_grid=param_grid_dt, cv=5, scoring='accuracy', n_jobs=-1)

# Fit the model on the training set
grid_search_dt.fit(features_train, target_train)

# Print the best parameters and accuracy
print(f"Best parameters (DecisionTree): {grid_search_dt.best_params_}")
print(f"Best accuracy (DecisionTree): {grid_search_dt.best_score_}")

Best parameters (DecisionTree): {'max_depth': 7, 'min_samples_leaf': 1, 'min_samples_split': 10}
Best accuracy (DecisionTree): 0.7989180612746024
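
`best_params_` and `best_score_` report only the winner. To describe how the hyperparameters affect quality, the full results in `cv_results_` can be loaded into a DataFrame and ranked. The sketch below does this on synthetic data with a smaller grid than the one above; the variable names are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

param_grid = {"max_depth": [3, 5, 7], "min_samples_leaf": [1, 4]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=54321),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

# One row per hyperparameter combination, best-ranked first
columns = [
    "param_max_depth",
    "param_min_samples_leaf",
    "mean_test_score",
    "std_test_score",
    "rank_test_score",
]
results = (
    pd.DataFrame(search.cv_results_)[columns]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(results)
```

The `std_test_score` column shows how much the accuracy varies between folds, which helps judge whether the gap between two configurations is meaningful.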


***RandomForestClassifier:***

In [19]:
# Set up the grid of hyperparameters
param_grid_rf = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2']
}

# Initialize the model
model_rf = RandomForestClassifier(random_state=54321)

# Set up GridSearchCV
grid_search_rf = GridSearchCV(estimator=model_rf, param_grid=param_grid_rf, cv=5, scoring='accuracy', n_jobs=-1)

# Fit the model on the training set
grid_search_rf.fit(features_train, target_train)

# Print the best parameters and accuracy
print(f"Best parameters (RandomForest): {grid_search_rf.best_params_}")
print(f"Best accuracy (RandomForest): {grid_search_rf.best_score_}")

Best parameters (RandomForest): {'max_depth': 15, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 200}
Best accuracy (RandomForest): 0.8210826942692002
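
The random forest grid above contains 3 × 3 × 3 × 3 × 2 = 162 combinations, so 5-fold cross-validation fits 810 forests. If that becomes too slow, one alternative (not used in this notebook) is `RandomizedSearchCV`, which samples a fixed number of combinations from the same search space. A minimal sketch on synthetic data, with smaller forests to keep it fast:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training data
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

param_distributions = {
    "n_estimators": [10, 20, 40],  # smaller than in the notebook, for speed
    "max_depth": [5, 10, 15],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
}

# Evaluate only n_iter randomly sampled combinations instead of all 162
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=54321),
    param_distributions=param_distributions,
    n_iter=8,
    cv=3,
    scoring="accuracy",
    random_state=54321,
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

The trade-off is that a random sample can miss the best combination, so the exhaustive grid remains the safer choice when the runtime is acceptable.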


***LogisticRegression:***

In [20]:
# Set up the grid of hyperparameters
param_grid_lr = {
    'logisticregression__C': [0.01, 0.1, 1, 10],
    'logisticregression__penalty': ['l2'],
    'logisticregression__solver': ['liblinear']
}

# Set up a pipeline with scaling and logistic regression
pipeline_lr = Pipeline([
    ('scaler', StandardScaler()),  # Scaling step
    ('logisticregression', LogisticRegression(random_state=54321, max_iter=1000))
])

# Set up GridSearchCV
grid_search_lr = GridSearchCV(estimator=pipeline_lr, param_grid=param_grid_lr, cv=5, scoring='accuracy', n_jobs=-1)

# Fit the model on the training set
grid_search_lr.fit(features_train, target_train)

# Print the best parameters and accuracy
print(f"Best parameters (LogisticRegression): {grid_search_lr.best_params_}")
print(f"Best accuracy (LogisticRegression): {grid_search_lr.best_score_}")

Best parameters (LogisticRegression): {'logisticregression__C': 0.01, 'logisticregression__penalty': 'l2', 'logisticregression__solver': 'liblinear'}
Best accuracy (LogisticRegression): 0.7483480034755015
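
The `logisticregression__C` keys in the grid follow scikit-learn's convention of step name, double underscore, parameter name. Placing the scaler inside the pipeline also means it is refit on the training folds of each cross-validation split, so the validation fold does not leak into the scaling statistics. The sketch below shows both points on synthetic data whose feature scales are made up to mimic the gap between calls and mb_used.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with very different feature scales (assumption)
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X = X * np.array([1.0, 10.0, 100.0, 10000.0])

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("logisticregression", LogisticRegression(solver="liblinear", C=0.01)),
])

# Tunable parameters of a step are addressed as "<step name>__<parameter>"
tunable = [
    name for name in pipeline.get_params()
    if name.startswith("logisticregression__")
]
print(tunable)

# The scaler is fit on the training folds only in each of the 5 splits
scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```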


***Conclusion of the Classification methods:***

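- The best method was the RandomForestClassifier, which achieved a mean cross-validated accuracy of 82.11% on the training set, ahead of the DecisionTreeClassifier (79.89%) and LogisticRegression (74.83%).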

### Using Test Data Set

In [21]:
# Evaluate the best RandomForestClassifier model on the test set
test_accuracy_rf = grid_search_rf.best_estimator_.score(features_test, target_test)
print(f"Test accuracy of the best RandomForestClassifier model: {test_accuracy_rf}")

Test accuracy of the best RandomForestClassifier model: 0.7853810264385692


***Test Data Conclusion***

- Achieved an accuracy of 78.54% on the test set using the best RandomForestClassifier model, which is above the 0.75 threshold.

## Conclusion:


-  The goal of this project was to find the best classification model for predicting whether a user is on the Ultra plan from four usage features. After tuning three models with GridSearchCV, the RandomForestClassifier emerged as the top performer, with a mean cross-validated accuracy of 82.11% on the training data.

-  To check that the model generalizes, it was then scored on the held-out test set, where it achieved an accuracy of 78.54%, above the 0.75 target. A drop of about 3.6 percentage points is partly expected: the cross-validated score belongs to the configuration that was selected because it scored best, so it is optimistically biased, and the test set may also differ slightly from the training data. There may still be room for improvement.
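
The test accuracy comes from a finite sample, so it carries some uncertainty of its own. A rough way to gauge it is a 95% confidence interval from the normal approximation to a binomial proportion. This is an assumption chosen for simplicity and not part of the original analysis. The test set size of 643 follows from `train_test_split` rounding 0.2 × 3214 up.

```python
import math

n_test = 643                    # ceil(0.2 * 3214) rows held out for testing
accuracy = 0.7853810264385692   # test accuracy reported above

# Number of correctly classified test rows implied by the accuracy
correct = round(accuracy * n_test)

# Normal-approximation (Wald) interval for a proportion
std_error = math.sqrt(accuracy * (1 - accuracy) / n_test)
margin = 1.96 * std_error
low, high = accuracy - margin, accuracy + margin

print(f"Correct predictions: {correct} of {n_test}")
print(f"95% interval: {low:.4f} to {high:.4f}")
```

The margin is roughly three percentage points in each direction, so the gap between the cross-validated and test accuracies is of a similar size to the sampling noise in the test estimate.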