# Project Outline: Predicting Mobile Plans

## Objective
The objective of this project is to develop a machine learning model that analyzes user behavior and recommends one of two mobile plans offered by Megaline:
1. **Smart Plan**
2. **Ultra Plan**

The model must reach an accuracy of at least **0.75** on unseen data.

## Data Overview
The dataset contains monthly behavior information for Megaline subscribers, including:
- **calls**: Number of calls made.
- **minutes**: Total call duration in minutes.
- **messages**: Number of text messages sent.
- **mb_used**: Internet traffic used in megabytes.
- **is_ultra**: Target variable indicating the current plan (1 = Ultra, 0 = Smart).

## Steps Taken
1. **Data Inspection and Cleaning**:
   - Explored the dataset for missing values, duplicates, and outliers.
   - Cleaned the data to ensure it is ready for modeling.
2. **Data Splitting**:
   - Divided the dataset into training, validation, and test sets (60/20/20 split) with stratification to maintain class balance.
3. **Model Training and Evaluation**:
   - Trained and evaluated three models:
     - Logistic Regression
     - Decision Tree
     - Random Forest
   - Performed hyperparameter tuning on the best-performing model (Random Forest) using GridSearchCV.
4. **Addressing Class Imbalance**:
   - Applied class weighting to improve the model's ability to predict the minority class (Ultra Plan).
5. **Model Testing**:
   - Evaluated the final model on the test set to ensure robust performance.

## Deliverables
1. A well-documented machine learning model capable of accurately recommending mobile plans.
2. Performance metrics, including accuracy, precision, recall, F1-score, and confusion matrix for both Smart and Ultra plans.
3. Recommendations for potential improvements and future work.

## Tools and Libraries Used
- **Python**: For data processing, model training, and evaluation.
- **Libraries**:
  - `pandas` and `numpy` for data manipulation.
  - `scikit-learn` for machine learning models and evaluation metrics.

## Project Goal
By successfully completing this project, Megaline will have a reliable recommendation system that ensures users are assigned the most suitable plan, improving customer satisfaction and operational efficiency.

In [2]:
# Import necessary libraries
import pandas as pd

# Load the dataset
file_path = '/Users/mattbaglietto/megaline_mobile/users_behavior.csv'  # Updated path
data = pd.read_csv(file_path)

# Display basic information about the dataset
print("Dataset Info:")
print(data.info())

# Display the first few rows of the dataset
print("\nFirst 5 Rows of the Dataset:")
print(data.head())

# Check for missing values
print("\nMissing Values:")
print(data.isnull().sum())

# Summary statistics
print("\nSummary Statistics:")
print(data.describe())

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB
None

First 5 Rows of the Dataset:
   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0

Missing Values:
calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

Summary Statistics:
             calls      minutes     messages       mb_used     is_ultra
count  3214.000000  3214.000000  3214.0

In [4]:
# Import necessary libraries
import pandas as pd

# Load the dataset
file_path = '/Users/mattbaglietto/megaline_mobile/users_behavior.csv'
data = pd.read_csv(file_path)

# Streamlined Data Cleaning
def clean_data(df):
    # Remove rows with missing values
    df = df.dropna()
    
    # Remove duplicate rows
    df = df.drop_duplicates()
    
    # Remove outliers with the 1.5 * IQR rule, one column at a time.
    # Bounds are recomputed on the rows that survived the previous column's filter.
    def iqr_mask(series):
        q1, q3 = series.quantile([0.25, 0.75])
        iqr = q3 - q1
        # between() is inclusive on both ends by default
        return series.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

    for col in ['calls', 'minutes', 'messages', 'mb_used']:
        df = df[iqr_mask(df[col])]
    
    return df

# Apply cleaning function
data_cleaned = clean_data(data)

# Check class distribution
print("\nClass Distribution (is_ultra):")
print(data_cleaned['is_ultra'].value_counts())

# Save the cleaned dataset
cleaned_file_path = '/Users/mattbaglietto/megaline_mobile/cleaned_users_behavior.csv'
data_cleaned.to_csv(cleaned_file_path, index=False)

print("\nData Cleaning Complete. Cleaned data saved to:", cleaned_file_path)


Class Distribution (is_ultra):
is_ultra
0    2211
1     774
Name: count, dtype: int64

Data Cleaning Complete. Cleaned data saved to: /Users/mattbaglietto/megaline_mobile/cleaned_users_behavior.csv
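
The IQR rule used in the cleaning step can be illustrated on a small list. This is a minimal sketch using only the standard library. The `calls` values and the `iqr_bounds` helper are hypothetical, not taken from the dataset. The final lines use the row counts reported in the outputs above to show how many rows the whole cleaning step removed (outlier filtering plus any duplicate removal).

```python
from statistics import quantiles

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    # method="inclusive" interpolates linearly, like pandas' default quantile
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical monthly call counts with one extreme value
calls = [1, 2, 3, 4, 5, 6, 7, 8, 100]
lower, upper = iqr_bounds(calls)
kept = [v for v in calls if lower <= v <= upper]

print(lower, upper)  # -3.0 13.0
print(kept)          # [1, 2, 3, 4, 5, 6, 7, 8]

# Row counts reported in the notebook outputs
rows_before = 3214
rows_after = 2211 + 774
print(rows_before - rows_after)  # 229 rows removed by cleaning
```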


In [6]:
# Import necessary libraries
from sklearn.model_selection import train_test_split

# Load the cleaned dataset
cleaned_file_path = '/Users/mattbaglietto/megaline_mobile/cleaned_users_behavior.csv'
data = pd.read_csv(cleaned_file_path)

# Split features and target
X = data.drop(columns=['is_ultra'])  # Features
y = data['is_ultra']  # Target

# Split the data into train, validation, and test sets (60%, 20%, 20%)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)

# Output the sizes of each set
print(f"Training Set: {X_train.shape}")
print(f"Validation Set: {X_val.shape}")
print(f"Test Set: {X_test.shape}")

Training Set: (1791, 4)
Validation Set: (597, 4)
Test Set: (597, 4)


# Data Splitting Summary

## Overview
The cleaned dataset has been split into three subsets to facilitate model training, validation, and testing. This ensures a systematic approach to building and evaluating the machine learning model.

## Splitting Results
- **Training Set**: 1791 samples with 4 features each.
- **Validation Set**: 597 samples with 4 features each.
- **Test Set**: 597 samples with 4 features each.

## Splitting Ratios
- **Training Set (60%)**: Used to train the model.
- **Validation Set (20%)**: Used to compare candidate models before final testing. Hyperparameter tuning itself relies on cross-validation within the training set.
- **Test Set (20%)**: Used for evaluating the final performance of the model.

## Stratification
Stratification was used during splitting to ensure that the class distribution of the target variable (`is_ultra`) is consistent across all subsets.

## Notes
- The cleaned dataset (`cleaned_users_behavior.csv`) was used as the source file.
- The random seed ensures reproducibility of the splits.
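
To make the idea of stratification concrete, here is a simplified sketch in plain Python. It is not how `train_test_split` is implemented. The `stratified_split` helper is hypothetical and simply cuts each class at 60% and 80% after shuffling. Because it rounds per class, its subset sizes can differ by a sample from the 1791/597/597 reported above, but the Ultra share stays close to the overall 774/2985 in every subset.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=42):
    """Split indices into train/val/test, keeping each class's share roughly constant."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut1 = int(fractions[0] * len(indices))
        cut2 = int((fractions[0] + fractions[1]) * len(indices))
        train += indices[:cut1]
        val += indices[cut1:cut2]
        test += indices[cut2:]
    return train, val, test

# Labels with the class counts reported after cleaning: 2211 Smart (0), 774 Ultra (1)
labels = [0] * 2211 + [1] * 774
train, val, test = stratified_split(labels)

for name, subset in [("train", train), ("val", val), ("test", test)]:
    ultra_share = sum(labels[i] for i in subset) / len(subset)
    print(f"{name}: {len(subset)} samples, Ultra share {ultra_share:.3f}")
```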

In [9]:
# Import necessary libraries
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Initialize models
models = {
    "Logistic Regression": LogisticRegression(random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42)
}

# Train and evaluate each model
results = {}
for name, model in models.items():
    # Train the model
    model.fit(X_train, y_train)
    
    # Predict on validation set
    y_val_pred = model.predict(X_val)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_val, y_val_pred)
    results[name] = accuracy
    print(f"{name} Validation Accuracy: {accuracy:.4f}")

# Display results
print("\nModel Performance:")
for model, acc in results.items():
    print(f"{model}: {acc:.4f}")

Logistic Regression Validation Accuracy: 0.7420
Decision Tree Validation Accuracy: 0.7069
Random Forest Validation Accuracy: 0.8057

Model Performance:
Logistic Regression: 0.7420
Decision Tree: 0.7069
Random Forest: 0.8057


# Model Building and Evaluation

## Objective
To identify the best-performing classification model for predicting the correct mobile plan (Smart or Ultra). The target is to achieve a minimum accuracy of **0.75** on the validation dataset.

## Approach
1. **Selected Models**:
   - Logistic Regression
   - Decision Tree Classifier
   - Random Forest Classifier
2. **Datasets Used**:
   - **Training Set**: Used to train the models.
   - **Validation Set**: Used to evaluate and compare the models' performance.
3. **Evaluation Metric**:
   - **Accuracy Score**: The proportion of correct predictions on the validation dataset.

## Analysis
- **Best Model**: Random Forest with a validation accuracy of **0.8057**.
- Random Forest exceeded the target accuracy of **0.75**, making it the most promising candidate for further tuning and testing.

In [12]:
# Import necessary libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize the Random Forest model
rf = RandomForestClassifier(random_state=42)

# Perform Grid Search with Cross-Validation
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=3,
    scoring='accuracy',
    n_jobs=-1,
    verbose=2
)

# Fit the grid search to the training data
grid_search.fit(X_train, y_train)

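# Output the best parameters and the best mean cross-validated accuracy.
# Note: best_score_ is averaged over the 3 CV folds of the training data; X_val is not used here.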
print("\nBest Parameters:", grid_search.best_params_)
print("Best Validation Accuracy:", grid_search.best_score_)

Fitting 3 folds for each of 108 candidates, totalling 324 fits

Best Parameters: {'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 10, 'n_estimators': 200}
Best Validation Accuracy: 0.8056951423785593


# Hyperparameter Tuning for Random Forest

## Objective
To optimize the Random Forest model by fine-tuning its hyperparameters, maximizing validation accuracy, and ensuring the best possible performance.

## Approach
1. **Hyperparameters Tuned**:
   - `n_estimators`: Number of trees in the forest.
   - `max_depth`: Maximum depth of each tree.
   - `min_samples_split`: Minimum number of samples required to split a node.
   - `min_samples_leaf`: Minimum number of samples required at a leaf node.

2. **Methodology**:
   - **Grid Search**: Systematically evaluates combinations of hyperparameters using a predefined grid.
   - **Cross-Validation**: Uses 3-fold cross-validation to ensure robust evaluation of each parameter combination.

3. **Evaluation Metric**:
   - Accuracy Score: The grid search selects the combination with the highest mean cross-validated accuracy.

## Parameter Grid
| Hyperparameter       | Values Tested                          |
|----------------------|-----------------------------------------|
| `n_estimators`       | [50, 100, 200]                         |
| `max_depth`          | [None, 10, 20, 30]                     |
| `min_samples_split`  | [2, 5, 10]                             |
| `min_samples_leaf`   | [1, 2, 4]                              |

## Analysis
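The grid search selected `max_depth=10`, `min_samples_leaf=1`, `min_samples_split=10`, and `n_estimators=200`, with a best mean 3-fold cross-validated accuracy of **0.8057** on the training data. This figure matches the baseline Random Forest's validation accuracy (0.8057) to four decimal places, although the two are measured on different data, so tuning produced no measurable accuracy gain over the default settings.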
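
As a quick sanity check on the search size, the number of candidates is the product of the value counts in the parameter grid. The sketch below uses only the standard library to enumerate the combinations and reproduces the "108 candidates, totalling 324 fits" line in the grid search log.

```python
from itertools import product

# Same grid as in the tuning cell
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Every combination of one value per hyperparameter
candidates = list(product(*param_grid.values()))
cv_folds = 3
total_fits = len(candidates) * cv_folds

print(len(candidates))  # 108
print(total_fits)       # 324
```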

In [15]:
# Import necessary libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Train the final model using the best parameters
final_rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=10,
    min_samples_split=10,
    min_samples_leaf=1,
    random_state=42
)

# Train the model on the training set
final_rf.fit(X_train, y_train)

# Predict on the test set
y_test_pred = final_rf.predict(X_test)

# Evaluate the model
test_accuracy = accuracy_score(y_test, y_test_pred)
print(f"Test Set Accuracy: {test_accuracy:.4f}")

# Additional evaluation metrics
print("\nClassification Report:")
print(classification_report(y_test, y_test_pred))

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_test_pred))

Test Set Accuracy: 0.7705

Classification Report:
              precision    recall  f1-score   support

           0       0.81      0.91      0.85       442
           1       0.59      0.38      0.46       155

    accuracy                           0.77       597
   macro avg       0.70      0.64      0.66       597
weighted avg       0.75      0.77      0.75       597


Confusion Matrix:
[[401  41]
 [ 96  59]]


# Final Model Evaluation

## Objective
Evaluate the final Random Forest model on the **Test Set** using the best hyperparameters identified during tuning. This step assesses the model's ability to generalize to unseen data.

## Analysis
- The model predicts **Smart Plan (Class 0)** well, with precision 0.81 and recall 0.91.
- **Ultra Plan (Class 1)** predictions are weaker: recall is only 38%, so most actual Ultra users are classified as Smart.
- The overall accuracy of 77.05% exceeds the 0.75 target, but it is only about 3 percentage points above the 74.0% that always predicting Smart would score on this test set (442 of 597 samples), so accuracy alone overstates how well the model separates the two plans.

## Key Insights
- The class imbalance (about 26% of users are on the Ultra Plan) likely contributes to the weaker Class 1 predictions.
- Precision for **Ultra Plan (Class 1)** is moderate (0.59), but recall is much lower (0.38): the model misses 96 of the 155 actual Ultra users in the test set.
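
The reported metrics can be re-derived from the confusion matrix alone. The sketch below assumes scikit-learn's layout for a binary problem (rows are actual classes, columns are predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]` with Ultra as the positive class). The `binary_metrics` helper is hypothetical and written only for illustration.

```python
def binary_metrics(cm):
    """Compute accuracy and positive-class metrics from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tn + tp) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        # Accuracy of always predicting the larger class
        "majority_baseline": max(tn + fp, fn + tp) / total,
    }

# Confusion matrix of the tuned Random Forest on the test set
cm = [[401, 41], [96, 59]]
metrics = binary_metrics(cm)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The output agrees with the classification report (accuracy 0.7705, Ultra precision 0.59, recall 0.38, F1 0.46) and shows the majority-class baseline of about 0.74 for comparison.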

In [18]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Train the Random Forest model with class weights
rf_balanced = RandomForestClassifier(
    n_estimators=200,
    max_depth=10,
    min_samples_split=10,
    min_samples_leaf=1,
    class_weight="balanced",  # Automatically adjusts weights inversely proportional to class frequencies
    random_state=42
)

# Train the model
rf_balanced.fit(X_train, y_train)

# Predict on the test set
y_test_pred_balanced = rf_balanced.predict(X_test)

# Evaluate the model
test_accuracy_balanced = accuracy_score(y_test, y_test_pred_balanced)
print(f"Test Set Accuracy (Balanced): {test_accuracy_balanced:.4f}")

# Additional evaluation metrics
print("\nClassification Report (Balanced):")
print(classification_report(y_test, y_test_pred_balanced))

print("\nConfusion Matrix (Balanced):")
print(confusion_matrix(y_test, y_test_pred_balanced))

Test Set Accuracy (Balanced): 0.7588

Classification Report (Balanced):
              precision    recall  f1-score   support

           0       0.83      0.85      0.84       442
           1       0.54      0.50      0.52       155

    accuracy                           0.76       597
   macro avg       0.68      0.67      0.68       597
weighted avg       0.75      0.76      0.76       597


Confusion Matrix (Balanced):
[[376  66]
 [ 78  77]]


# Addressing Class Imbalance

## Objective
To improve the model's performance, particularly for the minority class (**Ultra Plan - Class 1**), by addressing class imbalance using **class weights**.

## Methodology
The `RandomForestClassifier` was configured with the parameter `class_weight="balanced"`, which automatically adjusts class weights inversely proportional to their frequencies in the dataset. This ensures that the minority class receives more importance during training.
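
For intuition, scikit-learn documents the `"balanced"` heuristic as `n_samples / (n_classes * count_of_class)`. The sketch below applies that formula in plain Python. As an assumption, it uses the class counts of the full cleaned dataset (2211 Smart, 774 Ultra), whereas the model derives its weights from `y_train`, whose exact counts are not shown. With a stratified split, the resulting weights should be nearly identical.

```python
def balanced_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }

# Assumption: counts from the cleaned dataset, not from y_train
counts = {0: 2211, 1: 774}
weights = balanced_class_weights(counts)

print({label: round(w, 3) for label, w in weights.items()})  # {0: 0.675, 1: 1.928}
```

Under these counts, each Ultra sample carries roughly 2.9 times the weight of a Smart sample during training.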

## Analysis
- **Class 0 (Smart)**: Precision stays high at 0.83, while recall drops from 0.91 to 0.85 because more Smart users are now predicted as Ultra.
- **Class 1 (Ultra)**: Recall improved from **0.38** to **0.50**, so the model now identifies 77 of the 155 Ultra users instead of 59. Precision fell from 0.59 to 0.54.
- **Tradeoff**: Accuracy decreased from 0.7705 to 0.7588, but the Ultra F1-score rose from 0.46 to 0.52, giving more balanced predictions across the two classes.
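
The tradeoff can be expressed in user counts by comparing the two test-set confusion matrices directly. This is a small illustrative sketch. The `summarize` helper and its key names are hypothetical.

```python
# Test-set confusion matrices, laid out as [[TN, FP], [FN, TP]]
unweighted = [[401, 41], [96, 59]]
weighted = [[376, 66], [78, 77]]

def summarize(cm):
    """Translate a binary confusion matrix into counts of users."""
    (tn, fp), (fn, tp) = cm
    return {
        "correct": tn + tp,
        "ultra_found": tp,
        "ultra_missed": fn,
        "smart_predicted_ultra": fp,
    }

before, after = summarize(unweighted), summarize(weighted)
delta = {key: after[key] - before[key] for key in before}

print(delta)
# {'correct': -7, 'ultra_found': 18, 'ultra_missed': -18, 'smart_predicted_ultra': 25}
```

Class weighting recovers 18 additional Ultra users but misclassifies 25 more Smart users, for a net loss of 7 correct predictions out of 597.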

# Project Conclusion: Predicting Mobile Plans

## Objective
The goal of this project was to develop a machine learning model to recommend one of two mobile plans (Smart or Ultra) based on user behavior, achieving a minimum accuracy of 0.75 on unseen data.

## Approach
1. **Data Cleaning**:
   - Addressed missing values, duplicates, and outliers to prepare the dataset for modeling.
2. **Data Splitting**:
   - Divided the data into training, validation, and test sets using a 60/20/20 split with stratification.
3. **Model Development**:
   - Evaluated Logistic Regression, Decision Tree, and Random Forest models.
   - Performed hyperparameter tuning on the Random Forest model to maximize validation accuracy.
4. **Addressing Class Imbalance**:
   - Applied class weighting to improve performance for the minority class (Ultra Plan).

## Results
- **Best Model**: Random Forest
- **Test Set Performance (Balanced Model)**:
  - **Accuracy**: 0.7588
  - **Class 0 (Smart Plan)**:
    - Precision: 0.83, Recall: 0.85, F1-Score: 0.84
  - **Class 1 (Ultra Plan)**:
    - Precision: 0.54, Recall: 0.50, F1-Score: 0.52
  - **Macro Avg F1-Score**: 0.68
  - **Weighted Avg F1-Score**: 0.76
- **Confusion Matrix**:
  |               | Predicted: Smart (0) | Predicted: Ultra (1) |
  |---------------|-----------------------|-----------------------|
  | **Actual: Smart (0)** | 376                   | 66                    |
  | **Actual: Ultra (1)** | 78                    | 77                    |

## Analysis
The final model correctly identifies about half of the Ultra users in the test set (77 of 155) while keeping precision at 0.83 and recall at 0.85 for Smart users. Adjusting for class imbalance raised Ultra recall from 0.38 to 0.50 at the cost of about one percentage point of accuracy (0.7705 to 0.7588).

## Conclusion
The project met the target accuracy of 0.75: the class-weighted Random Forest scores 0.7588 on the test set.

The model can serve as a first version of Megaline's plan recommendation system. Its main limitation is the Ultra Plan, where recall is 0.50 and precision is 0.54, so improving minority-class performance is the natural next step.