# Intro to Machine Learning
## Inspiration
This is my first experience using machine learning, and I’m thrilled to be starting this journey. After working through statistical data analysis, I’m now excited to apply my data preprocessing skills to build my first classification model. I look forward to diving deeper into how machine learning algorithms work and eventually creating my own models from scratch. This project is the first step toward that goal.

## Project Goal
Build a machine learning classification model that recommends the right mobile plan, Smart or Ultra, for Megaline subscribers based on their monthly usage behavior. The goal is a model with an accuracy of at least 0.75. Since the data is already preprocessed, the focus will be on model selection, tuning, and evaluation.

## Project Plan
### Analyze the Data

* Open the file: users_behavior.csv

* Explore the structure, features, and basic statistics to get a sense of the data.

### Split the Data

* Divide the dataset into:

    * Training set

    * Validation set

    * Test set

* Justify the proportions used for each split.

### Model Selection and Tuning

* Train multiple classification models (Decision Tree, Random Forest, Logistic Regression).

* Tune hyperparameters to improve performance.

* Compare models based on validation accuracy.

### Model Evaluation

* Test the best performing model using the test dataset.

* Check if the final accuracy meets or exceeds the 0.75 threshold.

### Sanity Check

* Analyze edge cases and misclassifications.

* Evaluate if the model’s predictions make logical sense.

* Identify any patterns or anomalies that require further review.

### Conclusions

* Clearly describe:

    * What models were used

    * How they performed

    * What tuning steps were effective

    * Final test accuracy

# Importing Libraries and Dataset

In [109]:
import pandas as pd

In [110]:
from sklearn.tree import DecisionTreeClassifier

In [111]:
from sklearn.ensemble import RandomForestClassifier

In [112]:
from sklearn.linear_model import LogisticRegression

In [113]:
from sklearn.model_selection import train_test_split

In [114]:
from sklearn.metrics import accuracy_score

In [115]:
from sklearn.exceptions import ConvergenceWarning

In [116]:
import warnings

In [117]:
user_behavior = pd.read_csv('users_behavior.csv')

Helper function for a first look at the dataset.

In [118]:
def analyze(df):
    display(df.head())
    print(' ')
    print(df.info())
    print('---------------------------')
    print(' ')
    print('Potential Duplicates')
    print(' ')
    print(df.duplicated().sum())
    print('---------------------------')
    print(' ')
    print('Potential Missing Values')
    print(' ')
    print(df.isna().sum())

# Analyzing Data

In [119]:
analyze(user_behavior)

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |


 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB
None
---------------------------
 
Potential Duplicates
 
0
---------------------------
 
Potential Missing Values
 
calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64


This dataset has 5 columns and 3,214 entries. It is clean, with no duplicates or missing values. The calls and messages columns look like they should be an int data type.

* Corrections:
    * Check whether calls and messages contain only whole numbers; if so, convert them to int.

In [120]:
(user_behavior[['calls', 'messages']] % 1 != 0).sum() # Count values with a fractional part; 0 means the column can be stored as int.

calls       0
messages    0
dtype: int64

In [121]:
user_behavior[['calls','messages']] = user_behavior[['calls', 'messages']].astype(int) # Changing columns to int.

In [122]:
user_behavior.info() # Confirming changes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   int64  
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   int64  
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 125.7 KB


Checking how many users are on Ultra to see whether the classes are imbalanced. I want both classes to be represented fairly in the training and validation sets.

In [123]:
user_behavior['is_ultra'].isin([1]).sum()

985

With 985 users on Ultra (about 31%) and 2,229 users on Smart (about 69%), the plans are imbalanced, but not extremely so. I will use the stratify argument in train_test_split so that both splits keep the same class proportions.
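The class shares can be checked directly, and they also give a useful reference point: the accuracy of a model that always predicts the majority plan. The sketch below rebuilds a target column from the two counts reported above (985 Ultra out of 3,214 rows) rather than reading the CSV, so it runs on its own; the variable names are illustrative.

```python
import pandas as pd

# Counts taken from the notebook output above: 985 Ultra rows out of 3,214.
n_total = 3214
n_ultra = 985
demo_target = pd.Series([1] * n_ultra + [0] * (n_total - n_ultra), name="is_ultra")

# Share of each class (0 = Smart, 1 = Ultra).
class_shares = demo_target.value_counts(normalize=True)
print(class_shares.round(3))

# Accuracy of a baseline that always predicts the majority class (Smart).
majority_baseline = class_shares.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any trained model should be compared against this baseline as well as against the 0.75 project threshold.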

# Split the Data

For the split I will use a two-way split, 60% train / 40% validation, for these reasons:
* Several models and many hyperparameter combinations will be compared on the validation set.
* A larger validation set gives a more reliable performance signal when comparing models.
* The dataset isn't massive (3,214 rows), so there is still enough training data (1,928 rows) alongside a sizable validation set (1,286 rows).

Splitting into features and target.

In [124]:
features = user_behavior.drop('is_ultra', axis = 1)

In [125]:
target = user_behavior['is_ultra']

Performing the split.

In [126]:
features_train, features_valid, target_train, target_valid = train_test_split(features, target, test_size = 0.4, 
random_state = 12345, stratify = target)
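The project plan lists three sets (training, validation, test), while the cell above produces two. One common way to get a separate test set is to call train_test_split twice. The following is a standalone sketch on synthetic data, not the notebook's actual split: the 60/20/20 proportions and the made-up values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for users_behavior.csv (values are made up for illustration).
rng = np.random.default_rng(12345)
n_rows = 1000
demo = pd.DataFrame({
    "calls": rng.integers(0, 200, n_rows),
    "minutes": rng.uniform(0, 1500, n_rows),
    "messages": rng.integers(0, 150, n_rows),
    "mb_used": rng.uniform(0, 40000, n_rows),
    "is_ultra": rng.choice([0, 1], size=n_rows, p=[0.7, 0.3]),
})
demo_features = demo.drop("is_ultra", axis=1)
demo_target = demo["is_ultra"]

# Step 1: hold out 20% as a test set that is never used for tuning.
feat_rest, feat_test, targ_rest, targ_test = train_test_split(
    demo_features, demo_target, test_size=0.2, random_state=12345, stratify=demo_target
)

# Step 2: split the remaining 80% into train and validation.
# 0.25 of the remaining 80% equals 20% of the full dataset.
feat_train, feat_valid, targ_train, targ_valid = train_test_split(
    feat_rest, targ_rest, test_size=0.25, random_state=12345, stratify=targ_rest
)

print(len(feat_train), len(feat_valid), len(feat_test))  # 600 200 200
```

With this layout, hyperparameters are compared on the validation set and the test set is scored once, at the very end.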

# Model Selection and Tuning

## Decision Tree

In [127]:
results = [] # empty list where results will go.

Tuning Hyperparameters. 

In [128]:
for depth in range(2, 7): # max_depth values 2 through 6
    for min_leaf in [1, 5, 10]: # minimum number of samples required in a leaf node
        for min_split in [2, 5, 10]: # minimum number of samples required to split a node
            model = DecisionTreeClassifier(
                max_depth = depth,
                min_samples_leaf = min_leaf,
                min_samples_split = min_split,
                random_state = 12345
            )
            model.fit(features_train, target_train) # training model variations
            
            # getting predictions
            train_preds = model.predict(features_train)
            valid_preds = model.predict(features_valid)
            
            # calculating accuracy
            train_acc = accuracy_score(target_train, train_preds)
            valid_acc = accuracy_score(target_valid, valid_preds)
            
            # saving results
            results.append({
                'max_depth': depth,
                'min_samples_leaf': min_leaf,
                'min_samples_split': min_split,
                'train_accuracy': train_acc,
                'valid_accuracy': valid_acc,
                'gap': train_acc - valid_acc # spotting potential overfitting
            })
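The same grid can also be expressed with scikit-learn's GridSearchCV. Note that this swaps in a different technique: GridSearchCV scores each combination with k-fold cross-validation on the data it is given, rather than on a single fixed validation set as the loop above does. The sketch below is standalone and uses synthetic data from make_classification in place of the Megaline training split.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the training split (not the Megaline data).
demo_features, demo_target = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1, random_state=12345
)

# Same hyperparameter values as the nested loops above.
param_grid = {
    "max_depth": list(range(2, 7)),
    "min_samples_leaf": [1, 5, 10],
    "min_samples_split": [2, 5, 10],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=12345),
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,                     # 5-fold cross-validation instead of one validation set
    return_train_score=True,  # keeps train scores so a train/validation gap can be computed
)
search.fit(demo_features, demo_target)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The per-combination scores are available in `search.cv_results_`, which can be passed to `pd.DataFrame` to get a table similar to the one built by hand here.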

Turning results into dataframe to see results better.

In [129]:
results_dtc = pd.DataFrame(results)

In [130]:
results_dtc = results_dtc.sort_values(by = 'valid_accuracy', ascending = False)

In [131]:
results_dtc

| | max_depth | min_samples_leaf | min_samples_split | train_accuracy | valid_accuracy | gap |
|---|---|---|---|---|---|---|
| 44 | 6 | 10 | 10 | 0.821577 | 0.800156 | 0.021421 |
| 43 | 6 | 10 | 5 | 0.821577 | 0.800156 | 0.021421 |
| 42 | 6 | 10 | 2 | 0.821577 | 0.800156 | 0.021421 |
| 27 | 5 | 1 | 2 | 0.813797 | 0.7986 | 0.015196 |
| 28 | 5 | 1 | 5 | 0.812759 | 0.797823 | 0.014937 |
| 29 | 5 | 1 | 10 | 0.811722 | 0.79549 | 0.016232 |
| 33 | 5 | 10 | 2 | 0.811203 | 0.793935 | 0.017269 |
| 34 | 5 | 10 | 5 | 0.811203 | 0.793935 | 0.017269 |
| 35 | 5 | 10 | 10 | 0.811203 | 0.793935 | 0.017269 |
| 32 | 5 | 5 | 10 | 0.811722 | 0.791602 | 0.02012 |


Finding the best decision tree model: the highest validation accuracy among configurations with a gap of at most 0.005.

In [132]:
top_models = results_dtc[(results_dtc['valid_accuracy'] >= 0.75)
& (results_dtc['gap'] <= 0.005)].sort_values(by = 'valid_accuracy', ascending = False)

In [133]:
top_models

| | max_depth | min_samples_leaf | min_samples_split | train_accuracy | valid_accuracy | gap |
|---|---|---|---|---|---|---|
| 16 | 3 | 10 | 5 | 0.790975 | 0.788491 | 0.002484 |
| 17 | 3 | 10 | 10 | 0.790975 | 0.788491 | 0.002484 |
| 15 | 3 | 10 | 2 | 0.790975 | 0.788491 | 0.002484 |
| 13 | 3 | 5 | 5 | 0.790975 | 0.788491 | 0.002484 |
| 12 | 3 | 5 | 2 | 0.790975 | 0.788491 | 0.002484 |
| 11 | 3 | 1 | 10 | 0.790975 | 0.788491 | 0.002484 |
| 10 | 3 | 1 | 5 | 0.790975 | 0.788491 | 0.002484 |
| 9 | 3 | 1 | 2 | 0.790975 | 0.788491 | 0.002484 |
| 14 | 3 | 5 | 10 | 0.790975 | 0.788491 | 0.002484 |


### Decision Tree Results
All nine configurations with max_depth = 3 produce identical training and validation accuracy, so the choice among them rests on the leaf and split settings. Row 13 was chosen for these reasons:

* min_samples_leaf = 5
    This value strikes a balance:

    * Not too low (1 = overfitting risk)

    * Not too high (10 = underfitting risk)

    It keeps leaf nodes from being too specific or too general.

* min_samples_split = 5
    This allows the tree to split on moderate-sized groups.
    It helps capture meaningful patterns while still filtering out noise.

* Balanced complexity
    These values sit in a sensible middle ground. They give the same accuracy as the other depth-3 configurations, with less risk of overfitting or underfitting if the data changes.

* Interpretability
    With a controlled max_depth and moderate split/leaf constraints, this model remains simple to interpret and explain. If deeper trees are allowed later, the leaf and split minimums will still prevent splits on very small groups, which keeps the structure understandable.

## Random Forests

In [134]:
results_rf = [] # empty list where results will go

Tuning hyperparameters for random forest.

In [135]:
for n in range(10, 51, 10):
    for depth in [3, 5, 7]:
        for leaf in [1, 5]:
            for split in [2, 5]:
                model_rf = RandomForestClassifier(
                    n_estimators = n,
                    max_depth = depth,
                    min_samples_leaf = leaf,
                    min_samples_split = split,
                    random_state = 12345
                )
                model_rf.fit(features_train, target_train)
                
                # model predictions
                train_preds = model_rf.predict(features_train) 
                valid_preds = model_rf.predict(features_valid) 
                
                # model accuracy scores
                train_acc = accuracy_score(target_train, train_preds)
                valid_acc = accuracy_score(target_valid, valid_preds)
                
                results_rf.append({
                    'n_estimators': n,
                    'max_depth': depth,
                    'min_samples_leaf': leaf,
                    'min_samples_split': split,
                    'train_accuracy': train_acc,
                    'valid_accuracy': valid_acc,
                    'gap': train_acc - valid_acc
                })

Turning results into dataframe to see results better.

In [136]:
results_rf_df = pd.DataFrame(results_rf)

In [137]:
results_rf_df = results_rf_df.sort_values(by = 'valid_accuracy', ascending = False)

In [138]:
results_rf_df

| | n_estimators | max_depth | min_samples_leaf | min_samples_split | train_accuracy | valid_accuracy | gap |
|---|---|---|---|---|---|---|---|
| 57 | 50 | 7 | 1 | 5 | 0.848548 | 0.814152 | 0.034395 |
| 45 | 40 | 7 | 1 | 5 | 0.850104 | 0.813375 | 0.036729 |
| 44 | 40 | 7 | 1 | 2 | 0.85529 | 0.812597 | 0.042693 |
| 9 | 10 | 7 | 1 | 5 | 0.84751 | 0.812597 | 0.034913 |
| 56 | 50 | 7 | 1 | 2 | 0.856328 | 0.81182 | 0.044508 |
| 21 | 20 | 7 | 1 | 5 | 0.852178 | 0.811042 | 0.041136 |
| 33 | 30 | 7 | 1 | 5 | 0.852697 | 0.811042 | 0.041655 |
| 46 | 40 | 7 | 5 | 2 | 0.837656 | 0.810264 | 0.027391 |
| 22 | 20 | 7 | 5 | 2 | 0.837137 | 0.810264 | 0.026873 |
| 32 | 30 | 7 | 1 | 2 | 0.857884 | 0.810264 | 0.047619 |


Finding top models.

In [139]:
top_models_rf = results_rf_df[(results_rf_df['valid_accuracy'] >= 0.75) 
& (results_rf_df['gap'] <= 0.005)].sort_values(by = 'valid_accuracy', ascending = False)

In [140]:
top_models_rf

| | n_estimators | max_depth | min_samples_leaf | min_samples_split | train_accuracy | valid_accuracy | gap |
|---|---|---|---|---|---|---|---|
| 2 | 10 | 3 | 5 | 2 | 0.79668 | 0.802488 | -0.005808 |
| 3 | 10 | 3 | 5 | 5 | 0.79668 | 0.802488 | -0.005808 |
| 0 | 10 | 3 | 1 | 2 | 0.797718 | 0.801711 | -0.003993 |
| 1 | 10 | 3 | 1 | 5 | 0.797718 | 0.801711 | -0.003993 |
| 12 | 20 | 3 | 1 | 2 | 0.798237 | 0.800933 | -0.002697 |
| 14 | 20 | 3 | 5 | 2 | 0.798237 | 0.800933 | -0.002697 |
| 13 | 20 | 3 | 1 | 5 | 0.798237 | 0.800933 | -0.002697 |
| 15 | 20 | 3 | 5 | 5 | 0.798237 | 0.800933 | -0.002697 |
| 51 | 50 | 3 | 5 | 5 | 0.798237 | 0.800156 | -0.001919 |
| 25 | 30 | 3 | 1 | 5 | 0.797199 | 0.800156 | -0.002956 |
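The filter `gap <= 0.005` is one-sided: it also keeps rows whose gap is more negative than -0.005, such as rows 2 and 3 above. If the intent is to keep only models whose training and validation accuracy are within 0.005 of each other in either direction, the absolute gap can be used instead. The sketch below illustrates the difference on three rows copied from the tables above (rows 2, 15, and 57); whether the absolute version is the better criterion is a judgment call, not something the notebook settles.

```python
import pandas as pd

# Three rows copied from the random forest results above (rows 2, 15, 57).
demo_results = pd.DataFrame({
    "n_estimators": [10, 20, 50],
    "max_depth": [3, 3, 7],
    "train_accuracy": [0.79668, 0.798237, 0.848548],
    "valid_accuracy": [0.802488, 0.800933, 0.814152],
})
demo_results["gap"] = demo_results["train_accuracy"] - demo_results["valid_accuracy"]

# One-sided filter, as used in the notebook: any negative gap passes.
signed_filter = demo_results[demo_results["gap"] <= 0.005]

# Two-sided filter: training and validation accuracy within 0.005 of each other.
absolute_filter = demo_results[demo_results["gap"].abs() <= 0.005]

print(signed_filter["n_estimators"].tolist())    # [10, 20]
print(absolute_filter["n_estimators"].tolist())  # [20]
```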


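### Random Forest Results
For random forest, row #15 (n_estimators = 20, max_depth = 3, min_samples_leaf = 5, min_samples_split = 5) was chosen for these reasons:
* It matches or beats the validation accuracy of the larger depth-3 forests (0.800933 versus 0.800156 for 30 and 50 trees)

* It has higher training accuracy than the 10-tree forests (0.798237 versus 0.79668 and 0.797718)

* It keeps the gap small (-0.002697)

* Compared with larger forests, it:

    * Trains faster

    * Uses fewer resources

    * Performs just as well

The 10-tree forests in rows 2 and 3 score slightly higher on validation (0.802488), but with a larger gap (-0.005808).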

## Logistic Regression

No hyperparameter tuning was done for this model; it is trained once with default settings and the liblinear solver.

In [141]:
results_lr = [] # empty list where results will go

In [142]:
model_lr = LogisticRegression(
    random_state = 12345,
    solver = 'liblinear' # a good choice for small datasets
)

model_lr.fit(features_train, target_train)

# model predictions
train_preds = model_lr.predict(features_train)
valid_preds = model_lr.predict(features_valid)

# accuracy scores
train_acc = accuracy_score(target_train, train_preds)
valid_acc = accuracy_score(target_valid, valid_preds)

results_lr.append({
    'train_accuracy': train_acc,
    'valid_accuracy': valid_acc,
    'gap': train_acc - valid_acc
})

Turning result into dataframe.

In [143]:
results_lr_df = pd.DataFrame(results_lr)
results_lr_df = results_lr_df.sort_values(by = 'valid_accuracy', ascending = False)

In [144]:
results_lr_df

| | train_accuracy | valid_accuracy | gap |
|---|---|---|---|
| 0 | 0.710581 | 0.714619 | -0.004038 |


### Logistic Regression Results
Logistic Regression will not be selected for these reasons:

* Both training and validation accuracy are below the 0.75 threshold set as the project goal.

* Logistic Regression may be too simple for this dataset. It assumes a linear relationship between the features and the log-odds of the target, which may not be enough here.
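Before ruling the model out entirely, two inexpensive things could be tried: scaling the features (mb_used is several orders of magnitude larger than calls or messages) and varying the regularization strength C. The sketch below shows the mechanics on synthetic data with one deliberately inflated feature. It was not run on the Megaline data, so whether it would lift accuracy above 0.75 there is unknown; the C values are arbitrary illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (not the Megaline data).
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1, random_state=12345
)
# Inflate one feature's scale to mimic mb_used being far larger than the others.
X[:, 3] *= 1000

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.4, random_state=12345, stratify=y
)

lr_scores = {}
for c in [0.01, 0.1, 1, 10]:  # illustrative regularization strengths
    # The scaler is fit on the training data only, then applied to validation data.
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(C=c, solver="liblinear", random_state=12345),
    )
    model.fit(X_train, y_train)
    lr_scores[c] = accuracy_score(y_valid, model.predict(X_valid))

print(lr_scores)
```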

### Model Selection 
Random Forest was selected as the best model because, among the three shortlisted candidates, it achieved:

* The highest training and validation accuracy

* A minimal gap, suggesting low risk of overfitting

Logistic Regression was not selected because its accuracy was below the required 0.75 threshold. Its gap was very small, but its overall performance was not strong enough for this task.

# Model Evaluation

Conducting final test.

In [145]:
best_model = RandomForestClassifier(n_estimators = 20, max_depth = 3, min_samples_leaf = 5, min_samples_split = 5,
random_state = 12345) # top_models_rf row 15

best_model.fit(features_train, target_train)

RandomForestClassifier(max_depth=3, min_samples_leaf=5, min_samples_split=5,
                       n_estimators=20, random_state=12345)

In [146]:
final_preds = best_model.predict(features_valid)
final_acc = accuracy_score(target_valid, final_preds)

print('Final test accuracy:', final_acc)

Final test accuracy: 0.8009331259720062


### Results for model evaluation:
* Used the exact hyperparameters from the best performing Random Forest model (row 15).

* Trained it on the training set (features_train, target_train).

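* Scored it on features_valid. Because the data was split only two ways, this is the same validation set used for tuning, not a previously untouched test set.

* Got an accuracy of 0.800933, which:

    * Exceeds the 0.75 threshold

    * Reproduces the validation accuracy recorded for row 15, so it is likely a somewhat optimistic estimate of performance on new data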

# Sanity Check

Creating a DataFrame that puts the predictions next to the true labels to find misclassifications.

In [147]:
sanity_df = features_valid.copy()

In [148]:
sanity_df['true_label'] = target_valid.values

In [149]:
sanity_df['predicted'] = final_preds

Filtering misclassified rows.

In [150]:
misclassified = sanity_df[sanity_df['true_label'] != sanity_df['predicted']]

In [151]:
misclassified.head(10)

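| | calls | minutes | messages | mb_used | true_label | predicted |
|---|---|---|---|---|---|---|
| 1763 | 51 | 356.79 | 67 | 11568.16 | 1 | 0 |
| 2348 | 141 | 1102.88 | 50 | 16951.74 | 0 | 1 |
| 1880 | 0 | 0.0 | 44 | 15644.73 | 1 | 0 |
| 1269 | 19 | 130.88 | 94 | 11510.83 | 1 | 0 |
| 591 | 60 | 419.02 | 21 | 11407.52 | 1 | 0 |
| 3135 | 48 | 293.06 | 0 | 8139.8 | 1 | 0 |
| 2363 | 40 | 227.89 | 38 | 0.0 | 1 | 0 |
| 1065 | 2 | 2.0 | 0 | 959.51 | 1 | 0 |
| 913 | 51 | 406.86 | 57 | 5336.82 | 1 | 0 |
| 1421 | 49 | 359.24 | 0 | 6088.3 | 1 | 0 |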


Taking the average of each feature for the misclassified users and subtracting the corresponding features_valid average to see:
* Small differences (near 0): misclassified users are close to average and hard to classify

* Large differences: the model may struggle with users who:

    * Use more/less data than average

    * Send way more/less messages

    * Rarely call, or call a lot

Function to remove outliers.

In [152]:
def remove_outliers(df):
    filtered = df.copy()
    for col in df.select_dtypes(include = 'number'): # loop over numeric columns only
        Q1 = df[col].quantile(0.25) # 25% of column values are below this
        Q3 = df[col].quantile(0.75) # 75% of column values are below this
        IQR = Q3 - Q1 # range of the middle 50% of values, less sensitive to outliers
        lower = Q1 - 1.5 * IQR # values below this count as outliers
        upper = Q3 + 1.5 * IQR # values above this count as outliers
        filtered = filtered[(filtered[col] >= lower) & (filtered[col] <= upper)] # keep only rows inside the bounds for this column
    return filtered    

Overall averages.

In [153]:
overall_means = remove_outliers(features_valid)

In [154]:
overall_means = overall_means.mean()

In [155]:
overall_means

calls          60.377186
minutes       418.198893
messages       33.766861
mb_used     16398.750433
dtype: float64

Misclassified averages.

In [156]:
misclassified_means = remove_outliers(misclassified)

In [157]:
misclassified_means = misclassified_means.mean()

In [158]:
misclassified_means

calls            56.639810
minutes         393.639194
messages         31.990521
mb_used       14158.315213
true_label        1.000000
predicted         0.000000
dtype: float64
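The true_label mean of exactly 1.0 and predicted mean of exactly 0.0 are a side effect of running the IQR filter on the label columns too. When more than three quarters of a column share one value, Q1 and Q3 are equal, the IQR is 0, and every row with the other value is treated as an outlier. A possible variant, sketched below on toy data, restricts the filter to a chosen list of feature columns; `remove_outliers_in` and the toy values are hypothetical and not part of the notebook.

```python
import pandas as pd

def remove_outliers_in(df, columns):
    """Drop rows outside the 1.5 * IQR fences, checking only the given columns."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        # between() is inclusive on both ends, matching the >= / <= checks above.
        mask &= df[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return df[mask]

# Toy data: one extreme mb_used value and one minority label.
toy = pd.DataFrame({
    "mb_used": [100, 110, 120, 130, 140, 150, 160, 170, 180, 5000],
    "true_label": [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],
})

# Filtering on every column also removes the single row with true_label == 0.
all_columns = remove_outliers_in(toy, toy.columns)
# Filtering on the feature column only removes just the extreme mb_used row.
features_only = remove_outliers_in(toy, ["mb_used"])

print(all_columns["true_label"].tolist())    # [1, 1, 1, 1, 1, 1, 1, 1]
print(features_only["true_label"].tolist())  # [1, 1, 1, 1, 1, 1, 1, 1, 0]
```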

In [159]:
comparison = pd.DataFrame({
    'Overall_avg': overall_means,
    'misclassified_avg': misclassified_means,
    'Difference': misclassified_means - overall_means
})

In [160]:
comparison

| | Overall_avg | misclassified_avg | Difference |
|---|---|---|---|
| calls | 60.377186 | 56.63981 | -3.737375 |
| mb_used | 16398.750433 | 14158.315213 | -2240.43522 |
| messages | 33.766861 | 31.990521 | -1.77634 |
| minutes | 418.198893 | 393.639194 | -24.559698 |
| predicted | NaN | 0.0 | NaN |
| true_label | NaN | 1.0 | NaN |


In [161]:
misclassified.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 256 entries, 1763 to 2845
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   calls       256 non-null    int64  
 1   minutes     256 non-null    float64
 2   messages    256 non-null    int64  
 3   mb_used     256 non-null    float64
 4   true_label  256 non-null    int64  
 5   predicted   256 non-null    int64  
dtypes: float64(2), int64(4)
memory usage: 14.0 KB


In [162]:
(misclassified['true_label'] == 1).sum()

211
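The count above gives one direction of error (Ultra users predicted as Smart); the remaining 45 of the 256 errors go the other way. A confusion matrix reports both directions, plus the correct predictions, in one step. The sketch below uses a small hand-made label array rather than the notebook's predictions, so the numbers are purely illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels: 0 = Smart, 1 = Ultra (illustrative values, not the notebook's data).
true_labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
predictions = np.array([0, 0, 0, 0, 0, 1, 0, 0, 1, 1])

# With labels=[0, 1], rows are true classes and columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(true_labels, predictions, labels=[0, 1]).ravel()

print("Smart predicted as Smart:", tn)  # 5
print("Smart predicted as Ultra:", fp)  # 1
print("Ultra predicted as Smart:", fn)  # 2
print("Ultra predicted as Ultra:", tp)  # 2

# Share of true Ultra users the model identifies (recall for class 1).
ultra_recall = tp / (tp + fn)
print("Ultra recall:", ultra_recall)    # 0.5
```

In the notebook, the same call with `target_valid` and `final_preds` would show how many Ultra subscribers the model recovers overall.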

## Sanity Check Results
I analyzed 256 misclassified users, and 211 of them (82%) were labeled as Ultra but predicted as Smart. This suggests that these Ultra users exhibit behavioral patterns more typical of Smart users. Feature analysis showed lower than average usage across all metrics, especially data usage (about 2.2 GB less than average). One caveat: the misclassified averages have true_label = 1 and predicted = 0 exactly, because the outlier filter was also applied to the label columns and dropped the 45 Smart users predicted as Ultra, so these averages describe only the Ultra users predicted as Smart. The results indicate that the model associates higher usage with the Ultra plan and tends to misclassify users who are subscribed to Ultra but do not take full advantage of it. This behavior aligns with real world expectations and does not point to illogical errors, although it does mean the model misses many low-usage Ultra subscribers.

# Conclusions 
#### What models were used?
* Three classification models were trained and evaluated:

    * Decision Tree Classifier

    * Random Forest Classifier

    * Logistic Regression

#### How they performed...

* Decision Tree (Row 13)

    * Training accuracy: 0.790975

    * Validation accuracy: 0.788491

    * Gap: 0.002484

* Random Forest (Row 15)

    * Training accuracy: 0.798237

    * Validation accuracy: 0.800933

    * Gap: -0.002697 (validation accuracy slightly higher than training accuracy)

* Logistic Regression

    * Training accuracy: 0.710581

    * Validation accuracy: 0.714619

    * Gap: -0.004038

#### What tuning steps were effective?

* Decision Tree:

    * Tuned max_depth (2-6), min_samples_leaf (1, 5, 10), min_samples_split (2, 5, 10)

    * Row 13 was chosen for its balance in leaf/split values (5 each) and low overfitting risk with no accuracy loss.

* Random Forest:

    * Tuned n_estimators (10 to 50, step 10), max_depth (3, 5, 7), min_samples_leaf (1, 5), min_samples_split (2, 5)

    * Row 15 was selected for strong accuracy and a small gap. Deeper forests (max_depth = 7) scored higher on validation, up to 0.814152, but with gaps of roughly 0.03 to 0.05.

* Logistic Regression:

    * No tuning was performed; default parameters were used with the liblinear solver.

    * Results were below the 0.75 threshold, so the model was not selected.

#### Final test accuracy

The final test was performed using the best Random Forest model (Row 15):

RandomForestClassifier(
    n_estimators = 20,
    max_depth = 3,
    min_samples_leaf = 5,
    min_samples_split = 5,
    random_state = 12345
)

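Final accuracy: 0.800933

This result exceeds the project goal of 0.75. Because it was measured on the validation set that also guided model selection, it matches the validation accuracy reported above; a separate held-out test set would be needed to confirm how well the model generalizes to unseen data.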

## Summary

* The Random Forest model proved to be the most effective classifier due to its:

    * High accuracy

    * Generalization ability (low gap)

    * Moderate complexity

The Decision Tree also performed well, but Random Forest offered stronger performance overall. Logistic Regression did not meet the required accuracy threshold, possibly because the relationship between usage and plan is not linear.

The final model results and the sanity check suggest that the model behaves logically: most of its errors are Ultra subscribers whose usage looks like that of Smart subscribers.