# **Introduction to Regression Project: Mobile carrier Megaline**

# 1. Defining the Question

## **a) Specifying the Data Analysis Question**

The task is to develop a classification model that recommends the right plan (Smart or Ultra) for each subscriber, with an accuracy of at least 0.75.

## **b) Defining the Metric for Success**

We will have met our objective if the model reaches an accuracy of at least 0.75 on the test dataset. Beyond that threshold, we aim for the highest accuracy we can achieve.

## **c) Understanding the Context**

Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra. You have access to behavior data about subscribers who have already switched to the new plans (from the project for the Statistical Data Analysis course).

For this classification task, you need to develop a model that will pick the right plan. Since you’ve already performed the data preprocessing step, you can move straight to creating the model. Develop a model with the highest possible accuracy. In this project, the threshold for accuracy is 0.75. Check the accuracy using the test dataset.
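Accuracy here is the fraction of subscribers whose plan is predicted correctly. The sketch below shows how the 0.75 threshold could be checked. The helper `meets_threshold` and the toy labels are hypothetical illustrations, not part of the project data.

```python
from sklearn.metrics import accuracy_score

ACCURACY_THRESHOLD = 0.75  # threshold stated in the project brief


def meets_threshold(y_true, y_pred, threshold=ACCURACY_THRESHOLD):
    """Return the accuracy and whether it reaches the threshold (hypothetical helper)."""
    acc = accuracy_score(y_true, y_pred)
    return acc, bool(acc >= threshold)


# Toy labels (made up): 1 = Ultra, 0 = Smart. Six of the eight predictions match.
toy_true = [0, 0, 1, 1, 0, 1, 0, 0]
toy_pred = [0, 0, 1, 0, 0, 1, 0, 1]

acc, ok = meets_threshold(toy_true, toy_pred)
print(f"accuracy = {acc:.2f}, meets threshold: {ok}")
```

In the notebook itself, the same check would be applied to the test-set predictions of the final model.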

## **d) Recording the Experimental Design**


● Data Importation

● Data Exploration

● Data Preparation

● Model Evaluation

● Hyperparameter Tuning

● Findings and Recommendations

## **e) Data Relevance**

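The provided dataset is relevant to the research question: it records each subscriber's calls, minutes, messages, and mobile data used, along with the plan they chose (`is_ultra`).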

# Load needed Libraries

In [3]:
# Loading libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# 2. Data Importation, Exploration, Cleaning & Analysis


In [4]:
# Loading the dataset
df=pd.read_csv('https://bit.ly/UsersBehaviourTelco')

In [5]:
# Previewing the first 5 records
df.head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


In [6]:
# Checking the last 5 rows of data

df.tail()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
3209,122.0,910.98,20.0,35124.9,1
3210,25.0,190.36,0.0,3275.61,0
3211,97.0,634.44,70.0,13974.06,0
3212,64.0,462.32,90.0,31239.78,0
3213,80.0,566.09,6.0,29480.52,1


In [7]:
# Getting our dataset shape

df.shape

#dataframe has 3214 rows and 5 variables

(3214, 5)

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [9]:
# Checking for duplicates
df.duplicated().sum()

#NO duplicate values found

0

In [10]:
#checking data types
df.dtypes

calls       float64
minutes     float64
messages    float64
mb_used     float64
is_ultra      int64
dtype: object

In [11]:
# Checking for missing values
df.isnull().sum()

#there are no missing values

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

# Data Preparation

In [13]:
# Standardizing the dataset, i.e. changing all column headings to lower case, and checking the first 5 records

df.columns = df.columns.str.lower()
df.head()



Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


In [23]:
#preparing data
x = df.drop(['is_ultra'], axis = 1)
y = df['is_ultra']

#Split the source data into a training set, a validation set, and a test set.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42, stratify =y)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, test_size = 0.25, random_state = 42)

#confirm size of datasets
print(df.shape)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
print(X_valid.shape)
print(y_valid.shape)

(3214, 5)
(1928, 4)
(643, 4)
(1928,)
(643,)
(643, 4)
(643,)
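The two-step split above yields roughly 60% train, 20% validation, and 20% test. Only the first step is stratified. The sketch below repeats the split on a synthetic stand-in frame (made-up values with the same column names and row count, not the Megaline data). It also stratifies the second step, so that the share of Ultra subscribers stays close across all three sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's dataframe (values are made up).
rng = np.random.default_rng(0)
n = 3214
demo = pd.DataFrame({
    "calls": rng.integers(0, 245, n).astype(float),
    "minutes": rng.uniform(0, 1633, n),
    "messages": rng.integers(0, 225, n).astype(float),
    "mb_used": rng.uniform(0, 49746, n),
    "is_ultra": (rng.random(n) < 0.3).astype(int),
})
features = demo.drop(columns=["is_ultra"])
target = demo["is_ultra"]

# Step 1: hold out 20% of all rows as the test set.
X_rest, X_te, y_rest, y_te = train_test_split(
    features, target, test_size=0.2, random_state=42, stratify=target
)
# Step 2: 25% of the remaining 80% is 20% of all rows, used for validation.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest
)

split_sizes = {"train": len(X_tr), "valid": len(X_va), "test": len(X_te)}
split_share = {name: round(size / n, 3) for name, size in split_sizes.items()}
class_share = {
    name: round(float(part.mean()), 3)
    for name, part in [("train", y_tr), ("valid", y_va), ("test", y_te)]
}

print(split_sizes)   # rows per set
print(split_share)   # fraction of all rows per set
print(class_share)   # fraction of Ultra subscribers per set
```

Stratifying both steps is a design choice rather than a requirement. It matters most when one class is much rarer than the other.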


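In [24]:
# Compare actual values with predicted values (logistic regression).
# The predictions must exist before this cell runs, so fit the model on the training set first.
from sklearn.linear_model import LogisticRegression
logistic_y_prediction = LogisticRegression(random_state=12345, solver='liblinear').fit(X_train, y_train).predict(X_test)
df_logistic = pd.DataFrame({'Actual': y_test, 'Predicted': logistic_y_prediction})
df_logistic.head()

In [None]:
# Compare actual values with predicted values (decision tree).
decision_y_prediction = DecisionTreeClassifier(random_state=12345, max_depth=3).fit(X_train, y_train).predict(X_test)
df_decision = pd.DataFrame({'Actual': y_test, 'Predicted': decision_y_prediction})
df_decision.head()

In [None]:
# Compare actual values with predicted values (random forest).
random_y_prediction = RandomForestClassifier(random_state=12345, n_estimators=3).fit(X_train, y_train).predict(X_test)
df_random = pd.DataFrame({'Actual': y_test, 'Predicted': random_y_prediction})
df_random.head()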

In [19]:
# Checking if any of the rows are all null
# ---
sum(df.isnull().all(axis = 1))


0

In [18]:
#creating a copy of our dataframe
#
# ---
#
data_clean = df.copy()
data_clean.head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


In [20]:
data_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [21]:
data_clean.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
calls,3214.0,63.038892,33.236368,0.0,40.0,62.0,82.0,244.0
minutes,3214.0,438.208787,234.569872,0.0,274.575,430.6,571.9275,1632.06
messages,3214.0,38.281269,36.148326,0.0,9.0,30.0,57.0,224.0
mb_used,3214.0,17207.673836,7570.968246,0.0,12491.9025,16943.235,21424.7,49745.73
is_ultra,3214.0,0.306472,0.4611,0.0,0.0,0.0,1.0,1.0
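
The `is_ultra` mean of 0.306472 in the summary above gives the share of Ultra subscribers, so the classes are imbalanced. A model that always predicts Smart would already be right about 69% of the time, and the 0.75 threshold is best read against that baseline. A short worked calculation, using only the figures reported above:

```python
# Majority-class baseline, computed from the summary statistics above.
ultra_share = 0.306472   # mean of is_ultra reported by describe()
n_rows = 3214            # row count reported by df.shape
threshold = 0.75         # accuracy threshold from the project brief

# Always predicting the majority class (Smart, is_ultra == 0) scores this accuracy.
baseline_accuracy = max(ultra_share, 1 - ultra_share)
# Number of Ultra subscribers implied by the mean.
n_ultra = round(ultra_share * n_rows)

print(f"baseline accuracy: {baseline_accuracy:.4f}")
print(f"implied Ultra rows: {n_ultra}")
print(f"margin needed to reach the threshold: {threshold - baseline_accuracy:.4f}")
```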


In [22]:
data_clean.columns

Index(['calls', 'minutes', 'messages', 'mb_used', 'is_ultra'], dtype='object')

In [25]:
## Splitting the dataframe
# Split the data 80:10:10 into train:valid:test datasets.

X = data_clean.drop(columns=['is_ultra']).copy()
y = data_clean['is_ultra']

# First, split the data into the training set and the remaining data.
# random_state is fixed so that the split is reproducible.
X_train, X_rem, y_train, y_rem = train_test_split(X, y, train_size=0.8, random_state=12345)

# We want the validation and test sets to be equal in size (10% each of the overall data),
# so we split the remaining data in half (test_size=0.5).
X_valid, X_test, y_valid, y_test = train_test_split(X_rem, y_rem, test_size=0.5, random_state=12345)

print(X_train.shape)
print(y_train.shape)
print(X_valid.shape)
print(y_valid.shape)
print(X_test.shape)
print(y_test.shape)

(2571, 4)
(2571,)
(321, 4)
(321,)
(322, 4)
(322,)

## **Decision Tree**

In [None]:
from sklearn.metrics import accuracy_score

model = DecisionTreeClassifier(random_state=12345, max_depth=3, class_weight=None)

# Fit on the training set only. Refitting on the validation or test set
# would leak their labels into the model and inflate the scores.
model.fit(X_train, y_train)
train_predictions = model.predict(X_train)
valid_predictions = model.predict(X_valid)
test_predictions = model.predict(X_test)

print('accuracy_score')
print('Training set:', accuracy_score(y_train, train_predictions))
print('Valid set:', accuracy_score(y_valid, valid_predictions))
print('Test set:', accuracy_score(y_test, test_predictions))

## **Random Forest**

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

model = RandomForestClassifier(random_state=12345, n_estimators=3)

# Fit once on the training set, then predict on each set with the same fitted model.
model.fit(X_train, y_train)
train_predictions = model.predict(X_train)
valid_predictions = model.predict(X_valid)
test_predictions = model.predict(X_test)

print('Accuracy')
print('Training set:', accuracy_score(y_train, train_predictions))
print('Valid set:', accuracy_score(y_valid, valid_predictions))
print('Test set:', accuracy_score(y_test, test_predictions))

## **Logistic Regression**

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = LogisticRegression(random_state=12345, solver='liblinear')

# Fit once on the training set, then evaluate the same fitted model on each set.
model.fit(X_train, y_train)
train_predictions = model.predict(X_train)
test_predictions = model.predict(X_test)
z = model.score(X_valid, y_valid)  # score() returns mean accuracy for classifiers

print('Accuracy')
print('Training set:', accuracy_score(y_train, train_predictions))
print('Test set:', accuracy_score(y_test, test_predictions))
print('Valid set:', z)
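
The experimental design lists hyperparameter tuning, which the cells above do not yet perform. The sketch below shows one way it could look: choose `max_depth` for a decision tree by validation accuracy, then score the chosen model once on the test set. The data is a synthetic stand-in generated with `make_classification`, not the Megaline data, and the depth range of 1 to 10 is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 4 features, roughly 30% positive class (made-up data).
X_demo, y_demo = make_classification(
    n_samples=3214, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.7, 0.3], random_state=12345,
)

# Same 80:10:10 scheme as the notebook, here with stratification.
X_tr, X_rem, y_tr, y_rem = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=12345, stratify=y_demo
)
X_va, X_te, y_va, y_te = train_test_split(
    X_rem, y_rem, test_size=0.5, random_state=12345, stratify=y_rem
)

# Try each depth, fitting on the training set and scoring on the validation set.
best_depth, best_valid_acc = None, -1.0
for depth in range(1, 11):
    candidate = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    candidate.fit(X_tr, y_tr)
    valid_acc = accuracy_score(y_va, candidate.predict(X_va))
    if valid_acc > best_valid_acc:
        best_depth, best_valid_acc = depth, valid_acc

# Refit with the chosen depth and touch the test set only once, at the end.
final_model = DecisionTreeClassifier(random_state=12345, max_depth=best_depth)
final_model.fit(X_tr, y_tr)
test_acc = accuracy_score(y_te, final_model.predict(X_te))

print("best max_depth:", best_depth)
print("validation accuracy:", round(best_valid_acc, 3))
print("test accuracy:", round(test_acc, 3))
print("meets 0.75 threshold:", test_acc >= 0.75)
```

Keeping the test set out of the tuning loop means the final accuracy is an honest estimate. The same loop could be run over `n_estimators` for the random forest.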
