## Intro to Machine Learning: Megaline Project

### 1. Project Description

#### Objective
The goal of this project is to build a machine learning model that helps the mobile carrier **Megaline** recommend one of its two plans — **Smart** or **Ultra** — based on user behavior.

#### Task
Develop a classification model that predicts the most suitable plan for a customer using monthly usage data.  
The model’s accuracy on the test set must be **at least 0.75**.

#### Data Description
Each row in the dataset represents one user's monthly behavior and includes the following features:
- `calls` — number of calls made  
- `minutes` — total call duration  
- `messages` — number of text messages  
- `mb_used` — internet traffic used (in MB)  
- `is_ultra` — current plan (1 = Ultra, 0 = Smart)

#### Approach
1. Explore and analyze the dataset.  
2. Split data into training, validation, and test sets (with stratification).  
3. Train and compare several models: **Logistic Regression**, **Decision Tree**, and **Random Forest**.  
4. Tune hyperparameters and select the best model.  
5. Evaluate the final model on the test set and draw conclusions.

### 2. Data Overview

In [1]:
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [2]:
def dataset_overview(df, name="df", head_rows=5, target_col=None):
    """Quick sanity check:
    - shape, columns
    - dtypes (via df.info), memory
    - missing (count/%)
    - full-row duplicates
    - optional: class balance for target_col
    - head preview
    """
    print(f"=== Dataset overview: {name} ===")

    # Size
    print(f"Shape: {df.shape[0]} rows x {df.shape[1]} columns")

    # Columns
    print("\nColumns:")
    print(list(df.columns))

    # Dtypes & memory
    print("\nDtypes & memory usage (df.info):")
    df.info()

    # Missing values
    print("\nMissing values per column:")
    missing = df.isna().sum()
    if missing.sum() == 0:
        print("No missing values.")
    else:
        missing_pct = (missing / len(df) * 100).round(2)
        summary = pd.DataFrame({"missing": missing, "missing_%": missing_pct})
        summary = summary[summary["missing"] > 0].sort_values("missing_%", ascending=False)
        display(summary)

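    # Full-row duplicates
    print("\nDuplicate rows (full row duplicate):", df.duplicated().sum())

    # Optional: class balance for the target column (only when target_col is given)
    if target_col is not None:
        print(f"\nClass balance for '{target_col}':")
        print(df[target_col].value_counts(normalize=True).round(3).to_dict())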

    # Preview
    print(f"\nHead ({head_rows} rows):")
    display(df.head(head_rows))

    print("\n=== End ===\n")

In [3]:
# Load dataset
df = pd.read_csv('users_behavior.csv')

# Quick sanity overview 
dataset_overview(df, "Megaline users behavior")

# Check class balance (target)
df['is_ultra'].value_counts(normalize=True).round(3).to_dict()

=== Dataset overview: Megaline users behavior ===
Shape: 3214 rows x 5 columns

Columns:
['calls', 'minutes', 'messages', 'mb_used', 'is_ultra']

Dtypes & memory usage (df.info):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB

Missing values per column:
No missing values.

Duplicate rows (full row duplicate): 0

Head (5 rows):


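| | calls | minutes | messages | mb_used | is_ultra |
|:--|:------|:--------|:---------|:--------|:---------|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |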



=== End ===



{0: 0.694, 1: 0.306}

#### Data Overview — Summary

The dataset contains 3,214 rows and 5 columns.
Each row represents one user’s monthly activity, including the number of calls, total call duration (in minutes), number of text messages, and internet usage (in megabytes).
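The target variable `is_ultra` indicates the user’s current plan: 1 = Ultra, 0 = Smart.

All columns are numeric. The count features `calls` and `messages` are stored as `float64` rather than integers, which does not affect the models used here.
There are no missing values or duplicate rows, and the previewed feature values look plausible.
The dataset needs no cleaning before the next steps.

**Class balance:**
The target variable `is_ultra` is moderately imbalanced: about 69% Smart (0) and 31% Ultra (1). A model that always predicts Smart would therefore reach about 0.694 accuracy, which is the baseline any useful model has to beat.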
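
Given the 69/31 class split, a model that always predicts the majority class is a useful reference point for the accuracy figures that follow. The sketch below is illustrative only: it uses a synthetic label array (`target_demo`, an assumed name) with the same proportions rather than the real `users_behavior.csv`, and scikit-learn's `DummyClassifier` as the baseline.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the target: 694 Smart (0) and 306 Ultra (1) out of 1,000,
# chosen only to mirror the 0.694 / 0.306 proportions reported above.
target_demo = np.array([0] * 694 + [1] * 306)
features_demo = np.zeros((len(target_demo), 1))  # the dummy model ignores features

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(features_demo, target_demo)

baseline_accuracy = accuracy_score(target_demo, baseline.predict(features_demo))
print("Majority-class baseline accuracy:", round(baseline_accuracy, 3))
```

On the real data the same idea would apply to `features_train` / `target_train`, with the baseline scored on the validation set.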

### 3. Data Splitting

In [4]:
# Separate features and target
features = df.drop(columns=['is_ultra'])
target = df['is_ultra']

# Hold out final test set (20%) with stratification to preserve class ratios
features_temp, features_test, target_temp, target_test = train_test_split(
    features, target, test_size=0.20, stratify=target, random_state=12345
)

# Split the remaining 80% into train (60%) and validation (20% overall):
# test_size=0.25 here means 25% of the temp set -> 0.25 * 0.80 = 0.20 of the full data
features_train, features_valid, target_train, target_valid = train_test_split(
    features_temp, target_temp, test_size=0.25, stratify=target_temp, random_state=12345
)

print("Shapes:", features_train.shape, features_valid.shape, features_test.shape)

# Class balance checks (should be close across all splits)
print("Overall: ", target.value_counts(normalize=True).round(3).to_dict())
print("Train:", target_train.value_counts(normalize=True).round(3).to_dict())
print("Valid:", target_valid.value_counts(normalize=True).round(3).to_dict())
print("Test: ", target_test.value_counts(normalize=True).round(3).to_dict())

Shapes: (1928, 4) (643, 4) (643, 4)
Overall:  {0: 0.694, 1: 0.306}
Train: {0: 0.693, 1: 0.307}
Valid: {0: 0.694, 1: 0.306}
Test:  {0: 0.694, 1: 0.306}


#### Data Splitting — Summary

We split the dataset into three parts using stratified sampling to preserve the class ratio:
- Train: 60% (1,928 rows)
- Validation: 20% (643 rows)
- Test: 20% (643 rows)

Class proportions are consistent across all splits (~69% Smart, ~31% Ultra).
The split is reproducible (random_state=12345) and non-overlapping between subsets.
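
The non-overlap claim can be checked directly from the row indices. The following is a minimal sketch under stated assumptions: `three_way_split` is a hypothetical helper that mirrors the two `train_test_split` calls above, and the data is synthetic (3,214 rows with a roughly 69/31 label ratio), not the Megaline file.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def three_way_split(features, target, seed=12345):
    """Stratified 60/20/20 split (hypothetical helper mirroring the cell above)."""
    f_temp, f_test, t_temp, t_test = train_test_split(
        features, target, test_size=0.20, stratify=target, random_state=seed
    )
    # 25% of the remaining 80% equals 20% of the full data
    f_train, f_valid, t_train, t_valid = train_test_split(
        f_temp, t_temp, test_size=0.25, stratify=t_temp, random_state=seed
    )
    return (f_train, t_train), (f_valid, t_valid), (f_test, t_test)


# Synthetic stand-in data: 3,214 rows, 4 random features, ~69/31 labels
rng = np.random.default_rng(0)
demo_target = pd.Series([0] * 2229 + [1] * 985, name="is_ultra")
demo_features = pd.DataFrame(
    rng.normal(size=(len(demo_target), 4)),
    columns=["calls", "minutes", "messages", "mb_used"],
)

(train_f, train_t), (valid_f, valid_t), (test_f, test_t) = three_way_split(
    demo_features, demo_target
)

# Index sets must be pairwise disjoint and together cover every row
train_idx, valid_idx, test_idx = set(train_f.index), set(valid_f.index), set(test_f.index)
no_overlap = not (train_idx & valid_idx or train_idx & test_idx or valid_idx & test_idx)
covers_all = len(train_idx | valid_idx | test_idx) == len(demo_features)
print("Sizes:", len(train_f), len(valid_f), len(test_f))
print("No overlap:", no_overlap, "| covers all rows:", covers_all)
```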

### 4. Model Training and Evaluation

#### 4.1. Logistic Regression Model

In [5]:
# Initialize the model (fix random_state for reproducibility)
lr_model = LogisticRegression(max_iter=1000, random_state=12345)

# Fit the model on the training data
lr_model.fit(features_train, target_train)

# Generate predictions on the validation set
predictions_valid = lr_model.predict(features_valid)

# Compute the evaluation metric (accuracy) on validation
result = accuracy_score(target_valid, predictions_valid)

print("Validation accuracy (Logistic Regression):", round(result, 4))

Validation accuracy (Logistic Regression): 0.7558


#### Logistic Regression — Validation Results
The model achieved a validation accuracy of 0.7558, just above the 0.75 target (which formally applies to the test set).
With about 69% of users on Smart, this is only about 6 percentage points better than always predicting Smart, so logistic regression separates Smart and Ultra users only modestly on these features.
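
The features differ widely in magnitude (`mb_used` is in the thousands, `messages` in the tens), and logistic regression solvers often converge more easily on standardized inputs. The notebook does not scale the features, so the sketch below is an optional variation rather than the method used here. It runs on synthetic data from `make_classification`, and all `*_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the Megaline dataset): 4 features, ~69/31 classes
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.69, 0.31], random_state=12345,
)
# Stretch the columns to very different magnitudes, as with calls vs. mb_used
X_demo = X_demo * np.array([30.0, 200.0, 40.0, 8000.0])

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=12345
)

# The scaler is fit on the training split only, then applied to validation data
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_tr, y_tr)

scaled_accuracy = accuracy_score(y_va, scaled_lr.predict(X_va))
print("Validation accuracy (scaled LR, synthetic data):", round(scaled_accuracy, 4))
```

Wrapping the scaler in a pipeline keeps validation statistics out of the fit, which would not be guaranteed if the whole dataset were scaled before splitting.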

#### 4.2. Decision Tree Model

In [6]:
# Initialize placeholders for the best model, its score, and depth
best_model = None
best_result = -1.0
best_depth = None

# Try a range of tree depths (1..15) and keep the one with highest validation accuracy
for depth in range(1, 16):
    dt_model = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    dt_model.fit(features_train, target_train)
    predictions_valid = dt_model.predict(features_valid)
    result = accuracy_score(target_valid, predictions_valid)

    if result > best_result:
        best_model = dt_model
        best_result = result
        best_depth = depth

print(f"Best Decision Tree: accuracy={best_result:.3f} at max_depth={best_depth}")

Best Decision Tree: accuracy=0.816 at max_depth=5


#### Decision Tree — Validation Summary

- We trained Decision Tree classifiers with max_depth from 1 to 15.
- Best result: **accuracy = 0.8165 at max_depth = 5** (validation set).
- This outperforms Logistic Regression on validation (0.7558) by about 6 percentage points.
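
The depth search above keeps only the best validation score. Recording the training accuracy alongside it shows where deeper trees start to memorize the training set. The sketch below illustrates this on synthetic data (not the Megaline file); `depth_scores` and the `*_demo` names are assumptions for this example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (flip_y) so deep trees can overfit
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.69, 0.31], flip_y=0.1, random_state=12345,
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=12345
)

depth_scores = []  # one (depth, train accuracy, validation accuracy) tuple per depth
for depth in range(1, 16):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    tree.fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, tree.predict(X_tr))
    valid_acc = accuracy_score(y_va, tree.predict(X_va))
    depth_scores.append((depth, train_acc, valid_acc))
    print(f"depth={depth:2d}  train={train_acc:.3f}  valid={valid_acc:.3f}")
```

A widening gap between the two columns at larger depths is the usual sign that extra depth is fitting noise rather than signal.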

#### 4.3. Random Forest Model

In [None]:
best_model = None
best_result = -1.0
best_est = None
best_depth = None

# Coarse-to-fine search:
# - try a few tree counts (100, 200, 300)
# - try depths 10..20 and also None (no depth limit)
for est in [100, 200, 300]:
    for depth in list(range(10, 21)) + [None]:
        rf_model = RandomForestClassifier(
            n_estimators=est,
            max_depth=depth,
            random_state=12345
        )
        rf_model.fit(features_train, target_train)
        predictions_valid = rf_model.predict(features_valid)
        result = accuracy_score(target_valid, predictions_valid)

        print(f"RF n={est}, depth={depth}: acc={result:.4f}")

        if result > best_result:
            best_model = rf_model
            best_result = result
            best_est = est
            best_depth = depth

print(f"\nBest RF on validation: accuracy={best_result:.4f}, n_estimators={best_est}, max_depth={best_depth}")

RF n=100, depth=10: acc=0.8212
RF n=100, depth=11: acc=0.8118
RF n=100, depth=12: acc=0.8243
RF n=100, depth=13: acc=0.8227
RF n=100, depth=14: acc=0.8212
RF n=100, depth=15: acc=0.8243
RF n=100, depth=16: acc=0.8243
RF n=100, depth=17: acc=0.8227
RF n=100, depth=18: acc=0.8165
RF n=100, depth=19: acc=0.8196
RF n=100, depth=20: acc=0.8087
RF n=100, depth=None: acc=0.8134
RF n=200, depth=10: acc=0.8243
RF n=200, depth=11: acc=0.8196
RF n=200, depth=12: acc=0.8243
RF n=200, depth=13: acc=0.8227
RF n=200, depth=14: acc=0.8196
RF n=200, depth=15: acc=0.8227
RF n=200, depth=16: acc=0.8212
RF n=200, depth=17: acc=0.8196
RF n=200, depth=18: acc=0.8258
RF n=200, depth=19: acc=0.8243
RF n=200, depth=20: acc=0.8180
RF n=200, depth=None: acc=0.8165
RF n=300, depth=10: acc=0.8243


#### Random Forest - Validation Summary
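- Tested `n_estimators` in {100, 200, 300} and `max_depth` in {10…20, None}; the captured log is cut off after `n_estimators=300`, `max_depth=10`.
- The configuration carried forward, `n_estimators=100`, `max_depth=12`, scored 0.8243 on validation. The log also shows 0.8258 for `n_estimators=200`, `max_depth=18`, which corresponds to one more correct prediction out of 643, so the two are practically tied.
- Random Forest clearly outperformed Logistic Regression (0.7558) and was slightly ahead of the Decision Tree (0.8165).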
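
Because the validation set has only 643 rows, small accuracy differences correspond to very few samples. The arithmetic below converts the accuracies reported in this section back into counts of correct predictions; the dictionary keys are labels chosen for this example.

```python
n_valid = 643  # validation set size from the split above

# Validation accuracies as reported in the outputs of this section
reported_accuracy = {
    "Logistic Regression": 0.7558,
    "Decision Tree (depth=5)": 0.8165,
    "Random Forest (n=100, depth=12)": 0.8243,
    "Random Forest (n=200, depth=18)": 0.8258,
}

# accuracy = correct / n_valid, so correct = round(accuracy * n_valid)
correct_counts = {name: round(acc * n_valid) for name, acc in reported_accuracy.items()}

for name, correct in correct_counts.items():
    print(f"{name}: {correct} of {n_valid} correct")
```

The two Random Forest rows differ by a single validation sample, while the gap between Random Forest and the Decision Tree is about five samples.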

#### Model Training and Evaluation — Summary

We tested three classification models:  

| Model | Validation accuracy | Comments |
|:------|:--------------------|:----------|
| Logistic Regression | **0.7558** | Simple linear model; just above the 0.75 threshold. |
| Decision Tree | **0.8165** | Non-linear model; about 6 percentage points above Logistic Regression. |
| Random Forest | **0.8243** | Ensemble of trees; score of the selected configuration `n_estimators=100`, `max_depth=12` (the log shows 0.8258 for `n_estimators=200`, `max_depth=18`). |

**Conclusion:**  
Random Forest gave the best validation accuracy of the three model families (0.8243 for the selected configuration). Validation accuracy alone cannot show whether the model overfits; the held-out test set provides that check.  
This model will be retrained on the combined training and validation sets and evaluated on the test set to estimate its final quality.

### 5. Final Model Testing

In [None]:
# Merge training and validation sets to train the final model on more data
features_train_valid = pd.concat([features_train, features_valid])
target_train_valid = pd.concat([target_train, target_valid])

# Define the final Random Forest with the best hyperparameters from validation
final_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=12,
    random_state=12345
)

# Fit the final model on the combined train+validation data
final_model.fit(features_train_valid, target_train_valid)

# Predict on the held-out test set (never seen during training/tuning)
predictions_test = final_model.predict(features_test)

# Compute the final test accuracy
test_accuracy = accuracy_score(target_test, predictions_test)
print("Test accuracy (Final Random Forest):", round(test_accuracy, 4))

#### Final Model Testing — Summary

The final model — **Random Forest** (`n_estimators=100`, `max_depth=12`) —  
was trained on the combined training and validation sets and evaluated on the test set.

| Dataset | Accuracy |
|:--------|:----------|
| Validation | **0.8243** |
| Test | **0.8165** |

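The test accuracy (0.8165) is within one percentage point of the validation accuracy (0.8243).  
The two figures are not strictly comparable, because the validation score came from a model trained on the training set only, while the final model was refit on training plus validation data; still, the small gap gives no indication of overfitting to the validation set.  
The final accuracy exceeds the project threshold of **0.75** and is about 12 percentage points above the 0.694 accuracy of always predicting Smart.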
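
To judge how much weight the gap between 0.8243 and 0.8165 deserves, it helps to estimate the sampling noise of an accuracy measured on 643 rows. The sketch below uses the normal approximation to the binomial, which is an assumption and only a rough guide at this sample size.

```python
import math

n_test = 643            # test set size from the split
test_accuracy = 0.8165  # test accuracy reported in the table above

# Standard error of a proportion: sqrt(p * (1 - p) / n)
standard_error = math.sqrt(test_accuracy * (1 - test_accuracy) / n_test)

# Approximate 95% interval half-width (normal approximation)
margin = 1.96 * standard_error

print(f"Standard error: {standard_error:.4f}")
print(f"Approx. 95% interval: {test_accuracy - margin:.3f} to {test_accuracy + margin:.3f}")
```

The standard error is about 0.015, so the 0.0078 difference between validation and test accuracy is well within sampling noise, and the lower end of the interval still sits above 0.75.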

### 6. Project Conclusions

#### Goal
The goal of this project was to develop a machine learning model for **Megaline**, a mobile carrier company.  
The model should analyze subscriber behavior and recommend one of the two available plans — **Smart (0)** or **Ultra (1)** — with a target **accuracy ≥ 0.75**.

---

#### Models tested

| Model | Validation Accuracy | Notes |
|:------|:--------------------|:------|
| Logistic Regression | **0.7558** | Simple linear model; just reached the required threshold. |
| Decision Tree | **0.8165** | Non-linear model; about 6 percentage points above Logistic Regression. |
| Random Forest | **0.8243** | Selected configuration: `n_estimators=100`, `max_depth=12`. |

---

#### Final evaluation
The best-performing model — **Random Forest (n_estimators=100, max_depth=12)** —  
was retrained on the combined training and validation sets and tested on the unseen test data.

| Dataset | Accuracy |
|:--------|:----------|
| Validation | **0.8243** |
| Test | **0.8165** |

---

#### Conclusions
- The Random Forest model achieved the highest validation accuracy among the three model families tested.  
- The test accuracy (**0.8165**) is slightly lower than the validation accuracy (0.8243) but well above the required 0.75, which suggests that the model **generalizes well** to unseen data.  
- The model predicts a user's plan (Smart or Ultra) from monthly behavior correctly in about 82% of test cases, compared with about 69% for always predicting Smart.  
- Logistic Regression just reached the minimum threshold (0.7558) but was outperformed by the tree-based models.  
- The Decision Tree performed well (0.8165 on validation); Random Forest was slightly ahead (0.8243), and averaging over many trees generally makes an ensemble less sensitive to the particular training sample.

**Final Result:**  
The project goal is achieved — the final model meets quality requirements and can be recommended for deployment.
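
If the model were to be handed over for deployment, one common first step is to persist the fitted estimator and confirm that the restored copy predicts identically. The notebook does not cover this, so the following is only a sketch: it trains on synthetic data, and the file name `megaline_rf.joblib` is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (NOT the Megaline dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=12345
)

# Same hyperparameters as the final model in this notebook
demo_model = RandomForestClassifier(n_estimators=100, max_depth=12, random_state=12345)
demo_model.fit(X_demo, y_demo)

# Save to a temporary directory and load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "megaline_rf.joblib")  # hypothetical file name
    joblib.dump(demo_model, model_path)
    restored_model = joblib.load(model_path)

# The restored model should reproduce the original predictions exactly
predictions_match = np.array_equal(
    demo_model.predict(X_demo), restored_model.predict(X_demo)
)
print("Restored model reproduces predictions:", predictions_match)
```

A saved model should be loaded with the same scikit-learn version that produced it, since the serialized format is not guaranteed to be compatible across versions.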

#### Reflection

This project helped me better understand how machine learning models work in practice.  
I learned how to split data correctly into training, validation, and test sets and why stratification is important.  

I also practiced training several models, tuning their parameters, and comparing their performance using accuracy as the main metric.  

The most interesting part for me was working with the **Decision Tree** and **Random Forest** models —  
I could really see how changing hyperparameters affects the quality of predictions.  

At first, it was difficult to keep all steps structured and not get lost between datasets and models,  
but as I went through each stage, it became clearer how the whole machine learning process fits together.  
Overall, I’m satisfied with the result — my model reached the target accuracy and showed stable performance on the test data.