
# Project Description
Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.

You have access to behavior data about subscribers who have already switched to the new plans (from the project for the Statistical Data Analysis course). For this classification task, you need to develop a model that will pick the right plan. Since you’ve already performed the data preprocessing step, you can move straight to creating the model.  

Develop a model with the highest possible accuracy. In this project, the threshold for accuracy is 0.75. Check the accuracy using the test dataset.  

## Data Description
Every observation in the dataset contains monthly behavior information about one user. The information given is as follows:
- calls — number of calls,
- minutes — total call duration in minutes,
- messages — number of text messages,
- mb_used — Internet traffic used in MB,
- is_ultra — plan for the current month (Ultra - 1, Smart - 0).

## Getting Started
I will begin by setting up the environment.
### Importing Libraries
I will import the necessary libraries.

In [3]:
# Importing necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [1]:
# Importing the .csv file from local computer
from google.colab import files
uploaded = files.upload()

Saving users_behavior.csv to users_behavior.csv


In [5]:
# Creating my dataframe
df = pd.read_csv('users_behavior.csv')

# Analyzing my data
print(df.head())
print(df.info())

   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB
None


My understanding up until this point:
- After reading the project description, I have identified this as a classification problem rather than a regression one. It's a classification problem because Megaline wants a model that analyzes subscribers' behavior and then recommends either the Smart or the Ultra plan. The target is a category, not a numerical value.
- Now that I understand that, I need to set up the features and target needed to train my model.

After analyzing the data:
- Looking at the DataFrame's information, I notice that there aren't any missing values, which is good.
- I also notice that all of the columns in this DataFrame are relevant to my model, so I don't need to drop any additional columns when specifying my model's features.
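
The two checks above (no missing values, and a look at the target) can be sketched as follows. This is a minimal illustration that reuses only the five rows printed by `df.head()`; the names `sample`, `missing_per_column` and `class_share` are introduced here for illustration and are not part of the notebook. In the notebook itself, the same calls would be run on the full `df`.

```python
import pandas as pd

# The five rows shown by df.head() above; the full df would be used in practice.
sample = pd.DataFrame({
    "calls": [40.0, 85.0, 77.0, 106.0, 66.0],
    "minutes": [311.90, 516.75, 467.66, 745.53, 418.74],
    "messages": [83.0, 56.0, 86.0, 81.0, 1.0],
    "mb_used": [19915.42, 22696.96, 21060.45, 8437.39, 14502.75],
    "is_ultra": [0, 0, 0, 1, 0],
})

# Count missing values per column; all zeros means nothing needs imputing.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Share of each plan in the target column. A skewed share matters because
# it sets the accuracy a trivial "always predict the majority" model reaches.
class_share = sample["is_ultra"].value_counts(normalize=True)
print(class_share)
```

Five rows say nothing about the real class balance; the point is only which calls to run.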

### Specifying *features* and *target*
The features will be all of the columns except for **'is_ultra'**.
The target will be just the **'is_ultra'** column.

### Splitting the Data
`train_test_split` divides the data into two sets based on a given proportion. To get three sets, we therefore need two separate splits.

I will be splitting the data into three sets using a common split ratio:
- Training set 60%
- Validation set 20%
- Test set 20%

I will be splitting the data as follows:
1. The training set will get 60% and a temp set will get 40%.
2. In the second split, I will divide the temp set in half so that the validation set and the test set each get an equal share (20% of the original data).

In [7]:
# features & target
features = df.drop(['is_ultra'], axis=1)
target = df['is_ultra']

In [8]:
# First split: Training set gets 60% and temp set gets 40%
features_train, features_temp, target_train, target_temp = train_test_split(features, target, test_size=0.4, random_state=12345)

# Second split: Divides the 40% into two
features_valid, features_test, target_valid, target_test = train_test_split(features_temp, target_temp, test_size=0.5, random_state=12345)
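
To confirm that the two splits really produce a 60/20/20 partition, the set sizes can be printed. The sketch below uses random placeholder data with the same number of rows as `df` (3214), since only the shapes matter; the `sizes` dictionary and the random values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Random placeholder data with the notebook's row count (3214) and column names.
# The values are meaningless; only the resulting set sizes are of interest.
rng = np.random.default_rng(0)
features = pd.DataFrame(
    rng.random((3214, 4)), columns=["calls", "minutes", "messages", "mb_used"]
)
target = pd.Series(rng.integers(0, 2, size=3214), name="is_ultra")

# Same two-stage split as in the notebook.
features_train, features_temp, target_train, target_temp = train_test_split(
    features, target, test_size=0.4, random_state=12345
)
features_valid, features_test, target_valid, target_test = train_test_split(
    features_temp, target_temp, test_size=0.5, random_state=12345
)

# Report how many rows landed in each set and what share of the data that is.
sizes = {
    "train": len(features_train),
    "valid": len(features_valid),
    "test": len(features_test),
}
for name, n in sizes.items():
    print(f"{name}: {n} rows ({n / len(features):.1%})")
```

If the classes turn out to be imbalanced, passing `stratify=target` (and `stratify=target_temp` in the second call) to `train_test_split` would keep the class ratio the same in all three sets.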

### Testing how accurate a Decision Tree model would be

In [10]:
# Testing out the DecisionTree
best_score_DTC = 0
best_depth_DTC = 0
for depth in range(1, 6):
    model_DTC = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model_DTC.fit(features_train, target_train)
    score_DTC = model_DTC.score(features_valid, target_valid)
    if score_DTC > best_score_DTC:
        best_score_DTC = score_DTC
        best_depth_DTC = depth

print("Accuracy of the best model on the validation set:")
print(f"Accuracy: {best_score_DTC} and depth: {best_depth_DTC}")

Accuracy of the best model on the validation set:
Accuracy: 0.7853810264385692 and depth: 3
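
The loop above keeps only the best validation score. Recording the training accuracy next to the validation accuracy for each depth makes it easier to see where a deeper tree starts to overfit. The following is a rough sketch on synthetic data from `make_classification` (an assumption standing in for the Megaline features), so the numbers it prints are not the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 4 numeric features and a binary target (assumption).
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    random_state=12345,
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

# For each depth, store (depth, training accuracy, validation accuracy).
rows = []
for depth in range(1, 11):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(X_train, y_train)
    rows.append((depth, model.score(X_train, y_train), model.score(X_valid, y_valid)))

# A widening gap between the two columns is the usual sign of overfitting.
for depth, train_acc, valid_acc in rows:
    print(f"depth={depth:2d}  train={train_acc:.3f}  valid={valid_acc:.3f}")
```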


### Testing how accurate a Random Forest Classifier model would be

In [11]:
# Testing out the RandomForestClassifier
best_score_RFC = 0
best_est_RFC = 0
best_depth_RFC = 0
for est in range(1, 50, 10):
    for depth in range(1, 11):
        model_RFC = RandomForestClassifier(random_state=12345, n_estimators=est, max_depth=depth)
        model_RFC.fit(features_train, target_train)
        score_RFC = model_RFC.score(features_valid, target_valid)
        # Keep track of the best score and the hyperparameters that produced it
        if score_RFC > best_score_RFC:
            best_score_RFC = score_RFC
            best_est_RFC = est
            best_depth_RFC = depth

print("Accuracy of the best model on the validation set:")
print(f"Accuracy: {best_score_RFC} , n_estimators: {best_est_RFC} and depth: {best_depth_RFC}")

Accuracy of the best model on the validation set:
Accuracy: 0.807153965785381 , n_estimators: 41 and depth: 8
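
The search loop above records the best score and hyperparameters, but the variable `model_RFC` always ends up holding the last forest that was fitted, not the best one. One way to keep the best fitted model is to store it whenever the score improves. The sketch below shows this pattern with `ParameterGrid` on synthetic data; the data, the smaller grid, and the names `best_model` and `best_params` are assumptions for illustration, not the notebook's code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, train_test_split

# Synthetic stand-in for the Megaline data (assumption).
X, y = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1,
    random_state=12345,
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

# ParameterGrid yields one dict per hyperparameter combination,
# which replaces the nested for-loops.
grid = ParameterGrid({"n_estimators": [1, 11, 21], "max_depth": [2, 4, 6]})

best_score, best_params, best_model = 0.0, None, None
for params in grid:
    model = RandomForestClassifier(random_state=12345, **params)
    model.fit(X_train, y_train)
    score = model.score(X_valid, y_valid)
    if score > best_score:
        # Store the fitted model itself, not just its score and settings.
        best_score, best_params, best_model = score, params, model

print("Best validation accuracy:", best_score)
print("Best hyperparameters:", best_params)
```

With this pattern, `best_model` can be passed straight to the test-set evaluation without refitting.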


### My Findings:
- After comparing the two models, I found that RandomForestClassifier achieved a higher validation accuracy (0.8072) than DecisionTreeClassifier (0.7854). I originally searched wider hyperparameter ranges for RandomForestClassifier (n_estimators: 1-100, max_depth: 1-15) and found that n_estimators = 41 with max_depth = 8 performed best. I then narrowed the code to n_estimators 1-50 in steps of 10 and max_depth 1-10 so that it doesn't run for a long time. Both models exceeded the required 0.75 accuracy threshold, but RandomForestClassifier's higher accuracy makes it the better choice for recommending mobile plans to customers.
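
An accuracy figure is easier to judge against a trivial baseline. A common sanity check is `DummyClassifier(strategy="most_frequent")`, which always predicts the majority class of the training data. The sketch below runs it on synthetic, deliberately imbalanced data; the 70/30 class weights are an assumption for illustration and are not taken from the Megaline dataset.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumed 70/30 class weights).
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.7, 0.3], random_state=12345,
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

# Baseline: always predict the most frequent class seen during training.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_acc = baseline.score(X_valid, y_valid)

# A real model, for comparison against the baseline.
forest = RandomForestClassifier(random_state=12345, n_estimators=41, max_depth=8)
forest.fit(X_train, y_train)
forest_acc = forest.score(X_valid, y_valid)

print(f"Baseline accuracy: {baseline_acc:.3f}")
print(f"Random forest accuracy: {forest_acc:.3f}")
```

Running the same comparison on the real `features_train` and `features_valid` would show how much of the model's accuracy comes from the features rather than from class imbalance alone.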

### Comparing the validation set to the test set

In [18]:
# Comparing the validation set to the test set
# Note: model_RFC is the last forest fitted in the loop above
# (n_estimators=41, max_depth=10), not the best-scoring one (max_depth=8)
final_model = model_RFC

# The best validation accuracy is known from the previous cell
print("Accuracy of the final model on the validation set:", best_score_RFC)

# Test set predictions
test_predictions = final_model.predict(features_test)
# Checking the accuracy of the model
accuracy_test = accuracy_score(target_test, test_predictions)
print("Accuracy of the final model on the test set:", accuracy_test)

Accuracy of the final model on the validation set: 0.807153965785381
Accuracy of the final model on the test set: 0.7978227060653188


After comparing the two sets, the validation accuracy (0.8072) is higher than the test accuracy (0.7978) by about 0.01. A gap this small suggests that the model generalizes well and is not significantly overfit to the validation set. The test accuracy also stays above the required 0.75 threshold, so the model is stable and consistent across different subsets of the data.
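
To put the 0.01 gap in perspective, it helps to estimate how much the test accuracy would vary from sampling alone. Treating each test prediction as an independent correct/incorrect trial, the standard error of an accuracy $p$ measured on $n$ rows is roughly $\sqrt{p(1-p)/n}$. The sketch below applies this to the numbers printed above; the test set size of 643 is an assumption derived from 3214 rows and the 60/20/20 split, and the normal approximation is itself a simplification.

```python
import math

# Accuracies copied from the notebook output above.
valid_accuracy = 0.807153965785381
test_accuracy = 0.7978227060653188

# Assumption: the test set holds 643 rows (20% of 3214 after the two splits).
n_test = 643

# Standard error of a proportion under the binomial (normal) approximation.
standard_error = math.sqrt(test_accuracy * (1 - test_accuracy) / n_test)

# Approximate 95% interval for the test accuracy.
ci_low = test_accuracy - 1.96 * standard_error
ci_high = test_accuracy + 1.96 * standard_error

gap = valid_accuracy - test_accuracy
print(f"Validation-test gap: {gap:.4f}")
print(f"Standard error of test accuracy: {standard_error:.4f}")
print(f"Approximate 95% interval: ({ci_low:.4f}, {ci_high:.4f})")
```

Under these assumptions the gap is smaller than one standard error, and the lower end of the interval is still above 0.75, which is consistent with the conclusion above.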