# Project 8: Megaline Machine Learning Algorithms

The purpose of this project is to demonstrate the skills needed to develop a model that analyzes the behavior of the clients of Megaline (a telecommunications company) and recommends a data plan (Smart or Ultra) for each user.

I'll work with different models to find the best solution with the following structure:

- Import the libraries.
- Load the information.
- Verify the integrity of the data.
- Clean the data.
- Analyze the data.
- Create a model(s).
- Train the model(s).
- Find the best result.

## Importing libraries
Importing necessary libraries

In [14]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

Importing datasets

In [2]:
df = pd.read_csv('D:/Tripleten/datasets/users_behavior.csv')

In [3]:
df.info()
df['calls'] = df['calls'].astype('Int64')
df['messages'] = df['messages'].astype('Int64')
df.sample(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


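|      | calls | minutes | messages | mb_used  | is_ultra |
|------|-------|---------|----------|----------|----------|
| 1066 | 90    | 653.62  | 0        | 15697.77 | 0        |
| 68   | 138   | 1009.11 | 64       | 27807.13 | 1        |
| 1016 | 24    | 178.32  | 14       | 4674.47  | 0        |
| 2534 | 65    | 463.79  | 23       | 24868.65 | 0        |
| 1136 | 89    | 602.81  | 56       | 19293.91 | 0        |
| 3129 | 51    | 325.48  | 52       | 17477.23 | 0        |
| 2504 | 111   | 758.66  | 47       | 16045.16 | 0        |
| 117  | 80    | 588.57  | 47       | 9863.23  | 0        |
| 1495 | 63    | 408.68  | 63       | 24970.26 | 0        |
| 368  | 31    | 185.63  | 101      | 14344.72 | 0        |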


In [4]:
df.describe()

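|       | calls     | minutes    | messages  | mb_used      | is_ultra |
|-------|-----------|------------|-----------|--------------|----------|
| count | 3214.0    | 3214.0     | 3214.0    | 3214.0       | 3214.0   |
| mean  | 63.038892 | 438.208787 | 38.281269 | 17207.673836 | 0.306472 |
| std   | 33.236368 | 234.569872 | 36.148326 | 7570.968246  | 0.4611   |
| min   | 0.0       | 0.0        | 0.0       | 0.0          | 0.0      |
| 25%   | 40.0      | 274.575    | 9.0       | 12491.9025   | 0.0      |
| 50%   | 62.0      | 430.6      | 30.0      | 16943.235    | 0.0      |
| 75%   | 82.0      | 571.9275   | 57.0      | 21424.7      | 1.0      |
| max   | 244.0     | 1632.06    | 224.0     | 49745.73     | 1.0      |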


`df.info()` shows:
- Column names are OK.
- The `calls` and `messages` columns are loaded as float64; they were converted to Int64 right after the `df.info()` call.
- The data has no null values.

The output of `df.sample(10)` suggests the data is consistent.

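`df.describe()` shows:
- `calls` ranges from 0 to 244, with a mean of about 63 per user.
- The mean of `is_ultra` is about 0.31, so roughly 31% of the users are on the Ultra plan.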
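The integrity checks described above can be sketched as follows. This is a minimal illustration on a hypothetical three-row frame (the `demo` values are made up for the example, not taken from the dataset); it counts missing values and duplicated rows, and confirms that the count columns hold whole numbers before casting them to the nullable `Int64` type.

```python
import pandas as pd

# Hypothetical miniature of the users_behavior table, for illustration only
demo = pd.DataFrame({
    "calls": [40.0, 85.0, 77.0],
    "minutes": [311.90, 516.75, 467.66],
    "messages": [83.0, 56.0, 86.0],
    "mb_used": [19915.42, 22696.96, 21060.45],
    "is_ultra": [0, 0, 1],
})

# Count missing values and fully duplicated rows
n_missing = int(demo.isna().sum().sum())
n_duplicates = int(demo.duplicated().sum())

# Confirm the count columns hold whole numbers before casting them
count_cols = ["calls", "messages"]
all_whole = bool((demo[count_cols] % 1 == 0).all().all())

if all_whole:
    demo = demo.astype({col: "Int64" for col in count_cols})

print(f"missing: {n_missing}, duplicates: {n_duplicates}, whole counts: {all_whole}")
print(demo.dtypes)
```

Checking for whole numbers first avoids a cast error if a count column ever contains a fractional value.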



## Preparing the training dataset 

The data will be split into training, validation, and test sets, since no separate independent dataset is available.

In [5]:
features = df.drop('is_ultra', axis=1)
target = df['is_ultra']

# Splitting the data into training (60%) and temporary data (40%)
X_train, X_temp, y_train, y_temp = train_test_split(features, target, train_size=0.6, random_state=54321)

# Further splitting the temporary data into validation (20%) and test (20%)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, train_size=0.5, random_state=54321)
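To check that a two-stage split really yields 60/20/20, here is a small sketch on synthetic data (`X_demo` and `y_demo` are hypothetical stand-ins for `features` and `target`). It also passes `stratify`, which the cell above does not use; that is an optional variation that keeps the share of the positive class similar in every subset, which may matter because the classes are imbalanced.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, about 30% positive labels (assumption)
rng = np.random.default_rng(54321)
n_rows = 1000
X_demo = rng.normal(size=(n_rows, 4))
y_demo = (rng.random(n_rows) < 0.3).astype(int)

# Stage 1: 60% training, 40% temporary
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, train_size=0.6, stratify=y_demo, random_state=54321
)
# Stage 2: split the temporary 40% in half -> 20% validation, 20% test
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, train_size=0.5, stratify=y_tmp, random_state=54321
)

for name, part in [("train", y_tr), ("validation", y_va), ("test", y_te)]:
    share = len(part) / n_rows
    print(f"{name}: {share:.0%} of rows, positive share {part.mean():.3f}")
```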


## Implementing machine learning classification algorithms

In this project we will compare the accuracy (the share of correct predictions, computed with `accuracy_score`) of three different classification algorithms:

- Decision Tree
- Random Forest
- Logistic Regression
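Before comparing the three models, it helps to know what a trivial model would score. Since `df.describe()` reports a mean of about 0.306 for `is_ultra`, always predicting the majority class would be right about 69% of the time on the full dataset. The sketch below reproduces that idea with scikit-learn's `DummyClassifier` on synthetic labels (`y_demo` is a hypothetical stand-in with a similar class share, not the real target).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels with roughly the positive share seen in the data (assumption)
rng = np.random.default_rng(54321)
y_demo = (rng.random(2000) < 0.306).astype(int)
X_demo = np.zeros((len(y_demo), 1))  # the dummy model ignores the features

# Always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
baseline_score = accuracy_score(y_demo, baseline.predict(X_demo))

print(f"Majority-class baseline accuracy: {baseline_score:.3f}")
```

A model whose accuracy does not clearly exceed this baseline has learned little beyond the class proportions.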

## Decision Tree Classifier

First we will iterate over the `max_depth` of the Decision Tree Classifier to find the depth with the best validation score, and then use that depth on the test dataset. This requires the training dataset and the validation dataset.

In [6]:
best_score = 0
best_depth = 0 

for depth in range(1,200): 
    model = DecisionTreeClassifier(max_depth=depth, random_state=54321)
    model.fit(X_train,y_train)
    predictions = model.predict(X_val)
    val_score = accuracy_score(y_val, predictions)
    # print(val_score)
    if val_score > best_score:
        best_score = val_score
        best_depth = depth

print(f'Best score {best_score} and best depth {best_depth}')

Best score 0.7822706065318819 and best depth 10
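A variation on the loop above records both the training and the validation accuracy for every depth, which makes overfitting visible: training accuracy keeps rising with depth while validation accuracy levels off or drops. This is a sketch on synthetic data from `make_classification` (the `*_demo` names and the dataset are assumptions for illustration), not a rerun of the cell above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data with some label noise (assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=0,
    flip_y=0.1, random_state=54321,
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, train_size=0.75, random_state=54321
)

# Record (depth, training accuracy, validation accuracy) for each depth
history = []
for depth in range(1, 16):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=54321)
    tree.fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, tree.predict(X_tr))
    val_acc = accuracy_score(y_va, tree.predict(X_va))
    history.append((depth, train_acc, val_acc))

# max() returns the first maximum, so ties go to the shallowest tree
best_depth_demo, _, best_val_demo = max(history, key=lambda row: row[2])
print(f"Best validation accuracy {best_val_demo:.3f} at depth {best_depth_demo}")
```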


Now it's time to evaluate the model on our test dataset.

In [7]:
model = DecisionTreeClassifier(max_depth=10, random_state=54321)
model.fit( X_train, y_train)
predictions = model.predict(X_test)
score = accuracy_score(y_test, predictions)
print(f'The score obtained in this model is {score}')

The score obtained in this model is 0.8055987558320373


## Random Forest Classifier

In [11]:

best_depth= 0
best_score = 0
best_est = 0

for est in range(1, 100, 10):
    for depth in range (1, 20):
        rfc = RandomForestClassifier(n_estimators = est, max_depth= depth, random_state=54321)
        rfc.fit(X_train, y_train)
        predictions = rfc.predict(X_val)
        val_score = accuracy_score(y_val, predictions)
        # print(val_score)

        if val_score > best_score:
            best_depth= depth
            best_score = val_score
            best_est = est

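# Report best_score; val_score only holds the score of the last combination tried
print(f'The best score is {best_score}, with a n_estimators of {best_est} and best_deep of {best_depth}')


The best score is 0.7822706065318819, with a n_estimators of 11 and best_deep of 8

Note: the score recorded above was printed from `val_score`, so it is the validation accuracy of the last combination tried (n_estimators=91, max_depth=19), not the best one. The reported n_estimators and depth do belong to the best combination.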
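The nested loops above can also be written as a single loop over scikit-learn's `ParameterGrid`, keeping every result so the best one is picked at the end rather than tracked in separate variables. This is only a sketch on synthetic data with a much smaller grid than the one above; the `*_demo` names and grid values are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ParameterGrid, train_test_split

# Synthetic binary classification data (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=0,
    random_state=54321,
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, train_size=0.75, random_state=54321
)

# Every combination of the two hyperparameters (3 x 3 = 9 candidates)
grid = ParameterGrid({"n_estimators": [1, 11, 21], "max_depth": [2, 4, 6]})

results = []
for params in grid:
    forest = RandomForestClassifier(random_state=54321, **params)
    forest.fit(X_tr, y_tr)
    score = accuracy_score(y_va, forest.predict(X_va))
    results.append({**params, "val_accuracy": score})

# Pick the combination with the highest validation accuracy
best = max(results, key=lambda row: row["val_accuracy"])
print(best)
```

Storing all results also makes it easy to inspect how close the runner-up combinations are.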


In [13]:
rfc = RandomForestClassifier(n_estimators = best_est, max_depth=best_depth, random_state=54321)
rfc.fit(X_train,y_train)
predictions = rfc.predict(X_test)
score = accuracy_score(y_test, predictions)
print(f'The best score for this model is {score}, with a n_estimators of {best_est} and best_deep of {best_depth}')

The best score for this model is 0.8367029548989113, with a n_estimators of 11 and best_deep of 8


## Logistic Regression

In [16]:
lr = LogisticRegression(random_state=54321, solver='liblinear')
lr.fit(X_train,y_train)

# Evaluating Validation Dataset
validation_predictions = lr.predict(X_val)
validation_score = accuracy_score(y_val, validation_predictions)
print(f'The score for the validation data is: {validation_score}')

# Evaluating Test Dataset
test_predictions = lr.predict(X_test)
test_score = accuracy_score(y_test, test_predictions)
print(f'The score for the test data is: {test_score}')


The score for the validation data is: 0.6780715396578538
The score for the test data is: 0.7402799377916018
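The features in this dataset are on very different scales (`mb_used` is in the tens of thousands, `calls` in the tens), and regularized logistic regression can be sensitive to that. One option worth trying is to standardize the features inside a pipeline. The sketch below shows the mechanics on synthetic data with two hypothetical features on different scales; it makes no claim about the effect on the Megaline scores.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Two synthetic features on very different scales (assumption)
rng = np.random.default_rng(54321)
n_rows = 1000
small = rng.normal(60, 30, n_rows)
large = rng.normal(17000, 7500, n_rows)
X_demo = np.column_stack([small, large])
# Label depends on both features plus noise
y_demo = (small / 30 + large / 7500 + rng.normal(0, 1, n_rows) > 4.5).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=54321
)

# The scaler is fitted on the training data only, then applied to new data
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=54321, solver="liblinear"),
)
scaled_lr.fit(X_tr, y_tr)
scaled_score = accuracy_score(y_va, scaled_lr.predict(X_va))

print(f"Validation accuracy with scaled features: {scaled_score:.3f}")
```

Putting the scaler in the pipeline keeps validation and test data out of the scaling statistics.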


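## Conclusion

On the test dataset, the Random Forest Classifier (n_estimators=11, max_depth=8) reached the highest accuracy, about 0.837, followed by the Decision Tree Classifier (max_depth=10) at about 0.806 and Logistic Regression at about 0.740. Based on these results, the random forest is the best of the three models for recommending a plan.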