# Telecom: Identifying Ineffective Operators 

The virtual telephony service CallMeMaybe is developing a new function that will give supervisors information on the least effective operators. An operator is considered ineffective if they have a large number of missed incoming calls (internal and external) and a long waiting time for incoming calls. Moreover, if an operator is supposed to make outgoing calls, a small number of them is also a sign of ineffectiveness.

- Carry out exploratory data analysis
- Identify ineffective operators
- Test statistical hypotheses

The datasets contain data on the use of the virtual telephony service CallMeMaybe. Its clients are organizations that need to distribute large numbers of incoming calls among various operators or make outgoing calls through their operators. Operators can also make internal calls to communicate with one another. These calls go through CallMeMaybe's network.


<b>Presentation link: </b>
    
https://drive.google.com/drive/folders/1SlLDMcYWjeNd8MXo4k6u0XJ8O2XmzIwN?usp=sharing

# Table of Contents



### **[First look at the data](#0)**


### **[Decomposition](#1)**


### **[Data Preprocessing](#2)**


### **[Exploratory Data Analysis](#3)**


### **[Set Threshold Values for Efficiency Classification](#4)**


### **[Determine effective/ineffective](#5)**


### **[Efficiency Distribution Summary](#6)**


### **[Test Hypothesis](#7)**


### **[Clients Plan & Operators Efficiency Analysis](#8)**


### **[ML model to identify efficient/inefficient operators](#9)**


### **[Overall Conclusions](#10)**

<a class="anchor" id="0"></a>
# First look at the data

In [213]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math as math
import datetime as dt
from functools import reduce
from scipy import stats as st
import plotly.express as px
import scipy.stats as stats
import plotly.graph_objects as go

from plotly.graph_objects import Layout
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

import sys
import warnings
if not sys.warnoptions:
       warnings.simplefilter("ignore")
        

In [214]:

try:
    clients = pd.read_csv('/datasets/telecom_clients_us.csv')
    dataset = pd.read_csv('/datasets/telecom_dataset_us.csv')
    
except FileNotFoundError:  # fall back to local copies of the files
    clients = pd.read_csv('telecom_clients_us.csv')
    dataset = pd.read_csv('telecom_dataset_us.csv')    

In [215]:
display(dataset.sample(10))
display(dataset.info())
display(dataset.describe(include = 'all'))

Unnamed: 0,user_id,date,direction,internal,operator_id,is_missed_call,calls_count,call_duration,total_call_duration
47264,168187,2019-10-20 00:00:00+03:00,in,False,937962.0,False,1,20,51
9936,166658,2019-11-27 00:00:00+03:00,in,False,,True,3,0,0
48992,168187,2019-11-19 00:00:00+03:00,out,False,937870.0,True,1,0,12
47217,168187,2019-10-18 00:00:00+03:00,out,False,937854.0,False,10,660,835
35327,167532,2019-09-27 00:00:00+03:00,out,False,917846.0,False,1,26,33
50224,168252,2019-11-01 00:00:00+03:00,in,False,940630.0,True,1,0,30
36468,167626,2019-09-27 00:00:00+03:00,out,False,919378.0,True,74,0,1823
47930,168187,2019-10-30 00:00:00+03:00,in,False,937760.0,False,7,814,884
36926,167626,2019-10-04 00:00:00+03:00,out,False,919456.0,True,156,0,3127
36585,167626,2019-09-30 00:00:00+03:00,out,False,919490.0,True,31,0,753


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53902 entries, 0 to 53901
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   user_id              53902 non-null  int64  
 1   date                 53902 non-null  object 
 2   direction            53902 non-null  object 
 3   internal             53785 non-null  object 
 4   operator_id          45730 non-null  float64
 5   is_missed_call       53902 non-null  bool   
 6   calls_count          53902 non-null  int64  
 7   call_duration        53902 non-null  int64  
 8   total_call_duration  53902 non-null  int64  
dtypes: bool(1), float64(1), int64(4), object(3)
memory usage: 3.3+ MB


None

Unnamed: 0,user_id,date,direction,internal,operator_id,is_missed_call,calls_count,call_duration,total_call_duration
count,53902.0,53902,53902,53785,45730.0,53902,53902.0,53902.0,53902.0
unique,,119,2,2,,2,,,
top,,2019-11-25 00:00:00+03:00,out,False,,False,,,
freq,,1220,31917,47621,,30334,,,
mean,167295.344477,,,,916535.993002,,16.451245,866.684427,1157.133297
std,598.883775,,,,21254.123136,,62.91717,3731.791202,4403.468763
min,166377.0,,,,879896.0,,1.0,0.0,0.0
25%,166782.0,,,,900788.0,,1.0,0.0,47.0
50%,167162.0,,,,913938.0,,4.0,38.0,210.0
75%,167819.0,,,,937708.0,,12.0,572.0,902.0


In [216]:
dataset.sort_values('calls_count', ascending = False)

Unnamed: 0,user_id,date,direction,internal,operator_id,is_missed_call,calls_count,call_duration,total_call_duration
40733,167827,2019-11-11 00:00:00+03:00,out,False,929428.0,True,4817,0,5529
40671,167827,2019-10-31 00:00:00+03:00,out,False,929428.0,True,2614,0,45312
37070,167626,2019-10-07 00:00:00+03:00,in,False,,True,2168,0,2361
37102,167626,2019-10-08 00:00:00+03:00,in,False,,True,1917,0,2044
37553,167626,2019-10-15 00:00:00+03:00,in,False,,True,1914,0,2063
...,...,...,...,...,...,...,...,...,...
22791,167071,2019-10-21 00:00:00+03:00,in,False,913942.0,True,1,0,16
42000,167927,2019-11-09 00:00:00+03:00,in,False,929626.0,False,1,27,36
42001,167927,2019-11-11 00:00:00+03:00,in,False,,True,1,0,26
42002,167927,2019-11-12 00:00:00+03:00,out,False,929626.0,False,1,23,26


In [217]:
dataset.operator_id.nunique()

1092

In [218]:
display(clients.sample(5))
display(clients.info())

Unnamed: 0,user_id,tariff_plan,date_start
409,166391,C,2019-08-01
493,166785,B,2019-08-19
176,168092,C,2019-10-14
322,167150,C,2019-09-04
190,168030,C,2019-10-09


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 732 entries, 0 to 731
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   user_id      732 non-null    int64 
 1   tariff_plan  732 non-null    object
 2   date_start   732 non-null    object
dtypes: int64(1), object(2)
memory usage: 17.3+ KB


None

<b> Preliminary observations </b>
- ~54K call records in the dataset.
- 1092 operators, not including missing values.
- Significant missing values only in the operator_id column, for approx. 8K rows.
- Major outliers in calls_count, call_duration, and total_call_duration, clearly seen in the std values.



====================================================================================================

<a class="anchor" id="1"></a>
# Decomposition

## Research Objective

The virtual telephony service CallMeMaybe is developing a new function that will give supervisors information on the least effective operators.

<b> - Objective: Identify ineffective operators </b>



====================================================================================================

## Ineffective Operators Definition

Ineffective Operators are characterized by the following categories:
- Large number of missed incoming calls (internal and external)
- Long waiting time for incoming calls 
- Small number of outgoing calls (for operators making outgoing calls)

====================================================================================================

## Ineffective Operators  - Quantitative Definition

<b> Quantitative parameters for ineffective classification:  </b>
- Each operator will be assigned a binary value for each of the three categories by comparison to the total data. Binary values of 1 will be treated as "penalty" points, which will be used to determine the effectiveness of the operator.


<b>Determine ineffective: </b>
- <b>Sum of binary values </b> for each operator. If the result is equal to or greater than 2, the operator will be considered ineffective.


<b> Pros and cons:</b>
- The binary approach makes for a simple, straightforward, easy-to-understand method. It allows us to switch the threshold values with ease for different purposes.
- Weak categories are identified quickly and can be addressed with the operators.
- Because the binary classification ignores how far a value lies from the mean, we might use Q1/Q3 as thresholds, since they set a stricter limit and flag only the worst values as ineffective.


====================================================================================================

## Hypothesis Testing

- H0: There is no difference in call duration between effective operators (m0) and ineffective operators (m1): m0 = m1.
- H1: There is a difference in call duration between effective operators (m0) and ineffective operators (m1): m0 != m1.
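
One possible way to test this pair of hypotheses is a two-sample t-test on call durations. The sketch below is only an illustration: it assumes Welch's variant (`equal_var=False`), a significance level of 0.05, and synthetic data in place of the real operator groups.

```python
import numpy as np
from scipy import stats

# Synthetic call durations in seconds (illustration only, not the real data)
rng = np.random.default_rng(0)
effective = rng.normal(loc=120, scale=30, size=200)
ineffective = rng.normal(loc=100, scale=45, size=80)

alpha = 0.05  # assumed significance level

# Welch's t-test does not assume equal variances in the two groups
t_stat, p_value = stats.ttest_ind(effective, ineffective, equal_var=False)

reject_h0 = p_value < alpha
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, reject H0: {reject_h0}")
```

If the duration distributions turn out to be strongly skewed, a non-parametric alternative such as `stats.mannwhitneyu` may be worth comparing against.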

====================================================================================================

## Optional - Build an ML model for user clustering 
  - The model will use the original dataset, with additional columns for the tariff plan and day of the week. 
  - First task in mind: a scatter plot where the x axis is day of the week, the y axis is waiting time, and "hue" is set to the user's tariff plan.
  
  - Assumption: the tariff plan and day of the week may affect the waiting times that clients experience.
  - Use clustering to see whether the clients cluster by tariff plan. 

  
  

====================================================================================================

## Code decomposition (road map)




### <b> Data Preprocessing </b>
   - clean/replace missing values
   - clean/replace outlier values
   - columns conversion if necessary

### <b> Exploratory Data Analysis </b>
  
  - Calculate the average waiting time for each row ( (total call duration - call duration) / calls_count )
  - Add a total_rows count for later percentage calculations of the tested categories.
  
  - Build a dataframe for each operator and its characteristics using groupby:
  
  
  <b>df1:</b>     
        
        operator_id | total_rows | avg_waiting_time | missing_calls_cnt | outgoing_calls_cnt 
  
Based on df1, build a new dataframe and calculate the percentage of each category:
  
   <b>df2:</b>
         
        operator_id | %_missed_calls | %_outgoing_calls | avg_waiting_time
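
The two tables above could be built along these lines. This is a minimal sketch on made-up rows: the column names of the input follow the dataset, while `calls`, `is_outgoing`, and the share column names are placeholders chosen for illustration.

```python
import pandas as pd

# Toy call records; values are made up for illustration
calls = pd.DataFrame({
    "operator_id": [1, 1, 1, 2, 2],
    "direction": ["in", "out", "out", "in", "in"],
    "is_missed_call": [True, False, False, False, True],
    "calls_count": [2, 4, 1, 3, 1],
    "call_duration": [0, 200, 30, 90, 0],
    "total_call_duration": [20, 240, 40, 120, 15],
})

# Average waiting time per call within each row
calls["avg_wait_time"] = (
    calls["total_call_duration"] - calls["call_duration"]
) / calls["calls_count"]
calls["is_outgoing"] = calls["direction"] == "out"

# df1: one row per operator with raw counts
df1 = calls.groupby("operator_id").agg(
    total_rows=("direction", "size"),
    avg_waiting_time=("avg_wait_time", "mean"),
    missed_calls_cnt=("is_missed_call", "sum"),
    outgoing_calls_cnt=("is_outgoing", "sum"),
).reset_index()

# df2: counts expressed as a share of each operator's rows
df2 = df1[["operator_id", "avg_waiting_time"]].copy()
df2["missed_calls_share"] = df1["missed_calls_cnt"] / df1["total_rows"]
df2["outgoing_calls_share"] = df1["outgoing_calls_cnt"] / df1["total_rows"]
print(df2)
```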
            
     
### <b> Set threshold values </b>   

  from df2:
     
  - <b>missed calls:</b> analyze the missed calls distribution, use mean/median/Q3 as a threshold for inefficiency. 

  - <b>outgoing calls:</b> analyze the outgoing calls distribution, use mean/median/Q1 as a threshold for inefficiency (a low share of outgoing calls is the sign of ineffectiveness). 
  
  - <b>waiting time:</b> analyze the waiting time distribution, use mean/median/Q3 as a threshold for inefficiency (a long waiting time is the sign of ineffectiveness). 
  
  For each operator we will assign a binary value of 1 in a category if its value falls on the ineffective side of the threshold (above it for missed calls and waiting time, below it for outgoing calls), i.e. the operator is considered inefficient in that category. 
 
 
    
### <b> Apply a row function to determine effective/ineffective classification: </b>
   - The function will assign a binary value for each row based on the operator characteristics. 
   - Give each operator a final classification based on the sum of its binary values. If the sum is equal to or greater than 2, the operator is classified as ineffective; if the sum is 0, the operator is classified as effective; if the sum is 1, the operator is considered neither effective nor ineffective.
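
The scoring rule described above can be sketched as follows. The data is made up, the quartile choices (Q3 for missed calls and waiting time, Q1 for outgoing calls) are assumptions taken from the direction in which each metric signals ineffectiveness, and the label names, including `"neutral"`, are hypothetical.

```python
import pandas as pd

# Toy per-operator metrics; values are made up for illustration
ops = pd.DataFrame({
    "operator_id": [1, 2, 3, 4, 5],
    "missed_share": [0.05, 0.10, 0.20, 0.60, 0.70],
    "outgoing_share": [0.90, 0.80, 0.70, 0.10, 0.60],
    "avg_wait_time": [10.0, 12.0, 15.0, 40.0, 50.0],
})

# A high missed share and long waits are penalised above Q3;
# a low outgoing share is penalised below Q1 (assumed thresholds)
missed_q3 = ops["missed_share"].quantile(0.75)
wait_q3 = ops["avg_wait_time"].quantile(0.75)
outgoing_q1 = ops["outgoing_share"].quantile(0.25)

# One penalty point per category on the ineffective side of its threshold
ops["penalty"] = (
    (ops["missed_share"] > missed_q3).astype(int)
    + (ops["avg_wait_time"] > wait_q3).astype(int)
    + (ops["outgoing_share"] < outgoing_q1).astype(int)
)


def classify(penalty: int) -> str:
    """Map a penalty sum to a label (label names are hypothetical)."""
    if penalty >= 2:
        return "ineffective"
    if penalty == 0:
        return "effective"
    return "neutral"


ops["label"] = ops["penalty"].apply(classify)
print(ops[["operator_id", "penalty", "label"]])
```

Keeping the thresholds in named variables makes it easy to swap Q1/Q3 for the mean or median when comparing how strict each option is.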
 

### <b> Test Hypothesis </b>

### <b> Optional - Build an ML model that predicts a call as effective / ineffective. </b> 

====================================================================================================

## Useful reading


https://towardsdatascience.com/the-statistical-analysis-t-test-explained-for-beginners-and-experts-fd0e358bbb62
This article will help me run the t-test for the hypothesis testing. It should also clarify when a t-test is appropriate and whether a different test would be a better fit.


https://towardsdatascience.com/apply-function-to-pandas-dataframe-rows-76df74165ee4 This article will help me choose the most efficient way to apply a row function for assigning the effectiveness score. 


https://towardsdatascience.com/how-to-visualize-distributions-2cf2243c7b8e This article can help with choosing the right graphs for the EDA. Also, when comparing different tariff plans, it might be a good idea to add a graph that takes the different group sizes into account. 


https://www.explorium.ai/blog/clustering-when-you-should-use-it-and-avoid-it/ This article will help me decide whether clustering is the right option for an ML model. 


https://towardsdatascience.com/introduction-to-logistic-regression-66248243c148
This article addresses the ML part: it will help me decide whether it is worth trying to predict if an operator is effective or ineffective, using all the original features plus the tariff plan added to each row of the original dataset. 


## Red flags

- Not 100% sure about the ML test. If it does not work out, I should find a different additional analysis (maybe building KPI methods).

- We have 1092 operators but only 732 clients. This means we should link operators to their clients for a simpler approach to the data analysis.

================================================================================

<a class="anchor" id="2"></a>
# Data Preprocessing 

- clean/replace missing values
- clean/replace outlier values
- columns conversion if necessary

## Dataset dataframe

### Outliers in calls_count, call_duration and total_call_duration

In [219]:
# check call_duration distribution

# plot entire dataset
fig = px.histogram(dataset, x="call_duration",  title = 'call duration distribution')
fig.show()

# plot with limit
fig = px.histogram(dataset, x="call_duration", nbins= 2000, title = 'call duration distribution')
fig.update_xaxes(range=[0, 1000])
fig.show()

- The first graph shows there are major outliers.
- The second graph shows that most call records are short. However, this does not mean that call durations of 300-1000 are outliers.

In [220]:
# major outliers in calls_count, call_duration, and total_call_duration, clearly seen in the std values
display(dataset.describe(include = 'all'))


# inspect the 1st, 5th, 95th, and 99th percentiles of each column
print(np.percentile(dataset['calls_count'], [1,5, 95, 99])) 
print(np.percentile(dataset['call_duration'], [1,5, 95, 99])) 
print(np.percentile(dataset['total_call_duration'], [1,5, 95, 99])) 
# values below the 10th or above the 90th percentile are treated as outliers and clipped to those bounds

# clip calls_count and total_call_duration to their 10th-90th percentile range
dataset.calls_count = dataset.calls_count.transform(lambda x : np.clip(x,x.quantile(0.10),x.quantile(0.90)))

dataset.total_call_duration = dataset.total_call_duration.transform(lambda x : np.clip(x,x.quantile(0.10),x.quantile(0.90)))

# check values after replacing outliers
display(dataset.describe(include = 'all'))

Unnamed: 0,user_id,date,direction,internal,operator_id,is_missed_call,calls_count,call_duration,total_call_duration
count,53902.0,53902,53902,53785,45730.0,53902,53902.0,53902.0,53902.0
unique,,119,2,2,,2,,,
top,,2019-11-25 00:00:00+03:00,out,False,,False,,,
freq,,1220,31917,47621,,30334,,,
mean,167295.344477,,,,916535.993002,,16.451245,866.684427,1157.133297
std,598.883775,,,,21254.123136,,62.91717,3731.791202,4403.468763
min,166377.0,,,,879896.0,,1.0,0.0,0.0
25%,166782.0,,,,900788.0,,1.0,0.0,47.0
50%,167162.0,,,,913938.0,,4.0,38.0,210.0
75%,167819.0,,,,937708.0,,12.0,572.0,902.0


[  1.   1.  62. 166.]
[    0.       0.    3739.95 10333.9 ]
[0.000000e+00 2.000000e+00 4.540000e+03 1.295565e+04]


Unnamed: 0,user_id,date,direction,internal,operator_id,is_missed_call,calls_count,call_duration,total_call_duration
count,53902.0,53902,53902,53785,45730.0,53902,53902.0,53902.0,53902.0
unique,,119,2,2,,2,,,
top,,2019-11-25 00:00:00+03:00,out,False,,False,,,
freq,,1220,31917,47621,,30334,,,
mean,167295.344477,,,,916535.993002,,9.245687,866.684427,647.077956
std,598.883775,,,,21254.123136,,11.131112,3731.791202,861.012455
min,166377.0,,,,879896.0,,1.0,0.0,10.0
25%,166782.0,,,,900788.0,,1.0,0.0,47.0
50%,167162.0,,,,913938.0,,4.0,38.0,210.0
75%,167819.0,,,,937708.0,,12.0,572.0,902.0


In [221]:
# we can see that the std values are still high, but not as close to the extreme values we saw before.

<b> Preprocess call_duration column </b> 

Outliers in this column cannot be treated in a similar way, as the many 0-duration rows for unanswered calls affect the reliability of the quantile method.
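
The effect of zero-duration rows on quantiles can be illustrated with made-up numbers. In the sketch below half of the rows are zeros, which is an assumption for the sake of the example and not the real share of missed calls.

```python
import pandas as pd

# Toy call durations: 50 missed-call rows with 0 seconds, 50 answered rows
durations = pd.Series([0] * 50 + list(range(10, 510, 10)))

# Quantiles over all rows versus over answered rows only
p10_all = durations.quantile(0.10)
p90_all = durations.quantile(0.90)
p90_answered = durations[durations > 0].quantile(0.90)

print(p10_all, p90_all, p90_answered)
```

With this toy data the lower bound computed over all rows is 0, so clipping from below changes nothing, and the upper bound is pulled down by the zeros compared with the bound computed on answered calls only.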

In [222]:
# display call_duration 90th percentile
display(np.percentile(dataset['call_duration'], 90))

2105.0

In [223]:
# check long duration calls
display(dataset[dataset['call_duration'] > 1500].sample(10))
display(dataset[dataset['total_call_duration'] > 1500].sample(10))

Unnamed: 0,user_id,date,direction,internal,operator_id,is_missed_call,calls_count,call_duration,total_call_duration
37277,167626,2019-10-10 00:00:00+03:00,out,False,919192.0,False,35,8048,2626
34109,167497,2019-11-08 00:00:00+03:00,out,True,924930.0,False,6,3108,2626
37105,167626,2019-10-08 00:00:00+03:00,out,False,919318.0,False,35,4145,2626
29937,167264,2019-11-25 00:00:00+03:00,in,False,919552.0,False,26,2895,2626
8209,166658,2019-09-03 00:00:00+03:00,out,False,890416.0,False,20,5645,2626
40719,167827,2019-11-09 00:00:00+03:00,out,False,929428.0,False,35,10383,2626
15470,166899,2019-10-03 00:00:00+03:00,out,False,894656.0,False,25,1924,2064
7946,166658,2019-08-21 00:00:00+03:00,out,False,890412.0,False,11,4254,2626
52629,168361,2019-11-14 00:00:00+03:00,out,False,945320.0,False,35,12375,2626
36542,167626,2019-09-29 00:00:00+03:00,in,False,919206.0,False,35,1838,2268


Unnamed: 0,user_id,date,direction,internal,operator_id,is_missed_call,calls_count,call_duration,total_call_duration
11062,166680,2019-11-27 00:00:00+03:00,out,False,972410.0,False,14,1494,1681
34954,167521,2019-10-27 00:00:00+03:00,in,False,919792.0,False,25,1177,2070
37129,167626,2019-10-08 00:00:00+03:00,out,False,919476.0,True,35,0,1697
36841,167626,2019-10-03 00:00:00+03:00,out,False,921592.0,False,35,6003,2626
8373,166658,2019-09-11 00:00:00+03:00,out,False,890422.0,False,18,2059,2284
49047,168187,2019-11-20 00:00:00+03:00,out,False,937708.0,False,20,1543,1969
50939,168252,2019-11-28 00:00:00+03:00,out,False,950972.0,False,32,2443,2626
1725,166405,2019-11-27 00:00:00+03:00,in,False,882686.0,False,16,1782,1931
7034,166582,2019-10-05 00:00:00+03:00,out,False,925922.0,False,35,13157,2626
38394,167650,2019-11-14 00:00:00+03:00,in,False,921318.0,False,35,2901,2626


In [224]:
dataset[dataset['call_duration'] == 0].sample(10)

Unnamed: 0,user_id,date,direction,internal,operator_id,is_missed_call,calls_count,call_duration,total_call_duration
53341,168412,2019-11-20 00:00:00+03:00,out,True,953464.0,True,6,0,10
34514,167497,2019-11-20 00:00:00+03:00,out,True,924946.0,True,2,0,31
37040,167626,2019-10-07 00:00:00+03:00,out,False,919476.0,True,35,0,2212
15288,166896,2019-10-19 00:00:00+03:00,in,False,,True,3,0,94
21653,167035,2019-11-08 00:00:00+03:00,out,False,923526.0,True,2,0,113
20929,167011,2019-09-16 00:00:00+03:00,out,False,899968.0,True,5,0,170
44200,168024,2019-11-12 00:00:00+03:00,in,False,,True,1,0,10
27653,167176,2019-09-25 00:00:00+03:00,in,False,,True,3,0,42
36356,167626,2019-09-26 00:00:00+03:00,out,False,919390.0,True,34,0,564
26824,167158,2019-09-26 00:00:00+03:00,in,False,907502.0,True,1,0,26



- Long call durations mostly come with high call counts, therefore many of them are not outliers. 
- Call durations of 0 belong to missed calls, so these will stay as is. 

<b> - Conclusion: We will clean call duration based on avg call duration. </b>

### Conversion 

In [225]:
# most long call durations are accompanied by a high calls_count, but some outliers may still exist. 
# For this, we will use avg_call_duration (only for observations with call duration above 0 seconds)


In [226]:
dataset['avg_call_duration'] = dataset['call_duration'] / dataset['calls_count']

In [227]:
# check hist
fig = px.histogram(dataset, x="avg_call_duration", title = 'avg_call_duration distribution')
fig.show()

# check distribution without 0 sec avg_call_duration
fig = px.histogram( dataset[dataset['avg_call_duration'] != 0], x="avg_call_duration", title = 'avg_call_duration distribution')
fig.show()


In [228]:
dataset[dataset['avg_call_duration'] == 0].sample(10)

Unnamed: 0,user_id,date,direction,internal,operator_id,is_missed_call,calls_count,call_duration,total_call_duration,avg_call_duration
42495,167955,2019-11-05 00:00:00+03:00,in,False,,True,4,0,30,0.0
20,166377,2019-08-08 00:00:00+03:00,out,False,880022.0,True,4,0,28,0.0
37352,167626,2019-10-12 00:00:00+03:00,out,False,934426.0,True,35,0,1622,0.0
4806,166511,2019-10-15 00:00:00+03:00,out,True,891414.0,True,1,0,10,0.0
12003,166717,2019-08-28 00:00:00+03:00,in,False,,True,2,0,10,0.0
23713,167109,2019-09-16 00:00:00+03:00,out,True,909910.0,True,2,0,10,0.0
589,166391,2019-11-13 00:00:00+03:00,in,False,,True,1,0,10,0.0
40833,167828,2019-10-02 00:00:00+03:00,in,False,,True,1,0,10,0.0
40718,167827,2019-11-08 00:00:00+03:00,out,False,929428.0,True,35,0,2626,0.0
20857,167011,2019-09-09 00:00:00+03:00,out,True,899788.0,True,1,0,10,0.0


In [229]:
# check random sample
dataset[(dataset['avg_call_duration'] > 1000)].sample(10)

Unnamed: 0,user_id,date,direction,internal,operator_id,is_missed_call,calls_count,call_duration,total_call_duration,avg_call_duration
9567,166658,2019-11-08 00:00:00+03:00,out,False,890412.0,False,5,5365,2626,1073.0
6816,166582,2019-09-10 00:00:00+03:00,out,False,885876.0,False,35,80782,2626,2308.057143
8438,166658,2019-09-13 00:00:00+03:00,out,False,891154.0,False,10,10925,2626,1092.5
42729,167976,2019-11-27 00:00:00+03:00,in,False,934428.0,False,1,1207,1232,1207.0
9789,166658,2019-11-20 00:00:00+03:00,out,False,891154.0,False,2,3865,2626,1932.5
44568,168062,2019-11-05 00:00:00+03:00,out,False,951492.0,False,2,2290,2315,1145.0
6832,166582,2019-09-12 00:00:00+03:00,out,False,885890.0,False,35,72094,2626,2059.828571
7055,166582,2019-10-07 00:00:00+03:00,out,False,885876.0,False,35,57176,2626,1633.6
7136,166582,2019-10-14 00:00:00+03:00,out,False,885890.0,False,35,47545,2626,1358.428571
7017,166582,2019-10-04 00:00:00+03:00,out,False,925922.0,False,35,60890,2626,1739.714286


- The major spike at 0 avg_call_duration is caused by missed calls, therefore we will not drop this part of the dataset.
- We see outliers caused by very long call durations, therefore we will clean the data based on the 99th percentile of avg_call_duration.

In [230]:
# check percentile 
np.percentile(dataset['avg_call_duration'], 99)

676.494999999999

In [231]:
# drop all rows with avg_call_duration at or above the 99th percentile
dataset = dataset[dataset['avg_call_duration'] < np.percentile(dataset['avg_call_duration'], 99)]

<b> - date column conversion </b>

In [232]:
dataset['date'] = pd.to_datetime(dataset.date, format = "%Y-%m-%d").dt.date
dataset['date'] = pd.to_datetime(dataset.date, format = "%Y-%m-%d")

In [233]:
dataset.head()

Unnamed: 0,user_id,date,direction,internal,operator_id,is_missed_call,calls_count,call_duration,total_call_duration,avg_call_duration
0,166377,2019-08-04,in,False,,True,2,0,10,0.0
1,166377,2019-08-05,out,True,880022.0,True,3,0,10,0.0
2,166377,2019-08-05,out,True,880020.0,True,1,0,10,0.0
3,166377,2019-08-05,out,True,880020.0,False,1,10,18,10.0
4,166377,2019-08-05,out,False,880022.0,True,3,0,25,0.0


### Missing Values

In [234]:
dataset.isnull().sum()

user_id                   0
date                      0
direction                 0
internal                115
operator_id            8154
is_missed_call            0
calls_count               0
call_duration             0
total_call_duration       0
avg_call_duration         0
dtype: int64

- 115 rows with missing values for internal: a minor percentage of the data.

- operator_id missing values: 

In [235]:
# test value counts for operators 
fig = px.histogram(dataset.operator_id.value_counts(), x="operator_id", title = 'operator_id count distribution')
fig.show()

- We see that we have 1092 different operators, most of which appear only once according to the histogram.
- The operator_id is of float type, even though these are ID numbers, and having them as str might be better.
- If we fill in all missing values as 'unknown', we will group a huge amount of data, more than 8000 rows, into a single ID, while even in the most extreme case no operator has more than ~350 rows.
- This means that this kind of fill does not help us, as it creates a very big anomaly.

Conclusion: 
- It is decided to drop these rows. 8K rows out of 54K is about 15%, which is a lot, but dropping them is crucial for the analysis, which is carried out per operator.
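
The share of rows without an operator can be measured directly before dropping them. The sketch below uses a small made-up frame; `df` and `df_known` are placeholder names, and casting the IDs to integers after the drop is an optional step rather than something the notebook does.

```python
import numpy as np
import pandas as pd

# Toy frame; operator_id is float because NaN marks a missing operator
df = pd.DataFrame({
    "operator_id": [880022.0, np.nan, 880020.0, np.nan, 880022.0],
    "calls_count": [3, 2, 1, 4, 5],
})

# Share of rows with no operator attached
missing_share = df["operator_id"].isna().mean()

# Drop those rows; with no NaN left the IDs can be stored as integers
df_known = df.dropna(subset=["operator_id"]).copy()
df_known["operator_id"] = df_known["operator_id"].astype("int64")

print(f"missing share: {missing_share:.0%}, rows kept: {len(df_known)}")
```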


In [236]:
dataset = dataset.dropna(subset = ['operator_id'])

### Duplicates 

In [237]:
dataset.duplicated().sum()

4128

In [238]:
# we have 4128 duplicated rows

In [239]:
dataset = dataset.drop_duplicates(keep = 'first')

In [240]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 41080 entries, 1 to 53898
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   user_id              41080 non-null  int64         
 1   date                 41080 non-null  datetime64[ns]
 2   direction            41080 non-null  object        
 3   internal             41027 non-null  object        
 4   operator_id          41080 non-null  float64       
 5   is_missed_call       41080 non-null  bool          
 6   calls_count          41080 non-null  int64         
 7   call_duration        41080 non-null  int64         
 8   total_call_duration  41080 non-null  int64         
 9   avg_call_duration    41080 non-null  float64       
dtypes: bool(1), datetime64[ns](1), float64(2), int64(4), object(2)
memory usage: 3.2+ MB


## Clients dataframe

In [241]:
clients.duplicated().sum()

0

In [242]:
clients.isnull().sum()

user_id        0
tariff_plan    0
date_start     0
dtype: int64

In [243]:
clients.date_start = pd.to_datetime(clients.date_start, format = "%Y-%m-%d")

In [244]:
clients.head()

Unnamed: 0,user_id,tariff_plan,date_start
0,166713,A,2019-08-15
1,166901,A,2019-08-23
2,168527,A,2019-10-29
3,167097,A,2019-09-01
4,168193,A,2019-10-16


In [245]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 41080 entries, 1 to 53898
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   user_id              41080 non-null  int64         
 1   date                 41080 non-null  datetime64[ns]
 2   direction            41080 non-null  object        
 3   internal             41027 non-null  object        
 4   operator_id          41080 non-null  float64       
 5   is_missed_call       41080 non-null  bool          
 6   calls_count          41080 non-null  int64         
 7   call_duration        41080 non-null  int64         
 8   total_call_duration  41080 non-null  int64         
 9   avg_call_duration    41080 non-null  float64       
dtypes: bool(1), datetime64[ns](1), float64(2), int64(4), object(2)
memory usage: 3.2+ MB


### Data Preprocessing Summary

- Date types converted in both dataframes.
- Extensive preprocessing was required for the dataset dataframe, which suggests that the way this data is collected should be improved.
- Dataset observations dropped from 53902 to 41080; about 24% of the data was removed.

<a class="anchor" id="3"></a>
# Exploratory Data Analysis 

- Calculate the average waiting time for each row ( (total call duration - call duration) / calls_count )
- Add a total_rows count for later percentage calculations of the tested categories.

In [246]:
# create a copy for possible future analysis 
dataset_raw = dataset.copy()

In [247]:
# calculate avg waiting time
dataset['avg_wait_time'] = (dataset['total_call_duration'] - dataset['call_duration']) / dataset['calls_count']

# total records counter
dataset['total_records'] = 1

# count true missing calls 
dataset['missing_calls_cnt'] =  np.where(dataset['is_missed_call'] == True, 1, 0)

# count outgoing calls 
dataset['out_calls_cnt'] =  np.where(dataset['direction'] == 'out', 1, 0)

In [248]:
dataset.sample(5)

Unnamed: 0,user_id,date,direction,internal,operator_id,is_missed_call,calls_count,call_duration,total_call_duration,avg_call_duration,avg_wait_time,total_records,missing_calls_cnt,out_calls_cnt
24936,167125,2019-09-11,out,False,902748.0,True,35,0,1213,0.0,34.657143,1,1,1
44577,168062,2019-11-05,out,False,947614.0,True,35,0,998,0.0,28.514286,1,1,1
4547,166511,2019-09-18,in,False,891416.0,False,1,42,87,42.0,45.0,1,0,0
4928,166511,2019-10-29,in,False,891410.0,False,11,2051,2180,186.454545,11.727273,1,0,0
28594,167185,2019-09-27,in,False,912296.0,False,3,346,370,115.333333,8.0,1,0,0


* Build a dataframe for each operator and its characteristics using groupby:

        operator_id | total_records | avg_waiting_time | missing_calls_cnt | outgoing_calls_cnt 


In [249]:
dataset_operators = dataset.groupby('operator_id').agg({'total_records':'count', 'avg_wait_time':'mean', 'is_missed_call':'sum', 'out_calls_cnt':'sum'}).reset_index()

In [250]:
dataset_operators.sample(5)

Unnamed: 0,operator_id,total_records,avg_wait_time,is_missed_call,out_calls_cnt
1008,958454.0,14,8.255952,3,13
581,929426.0,62,-62.707735,31,62
1025,960648.0,25,19.004736,9,17
503,921814.0,22,11.499729,9,17
109,894224.0,17,10.858824,9,8


Based on df1, build a new dataframe and calculate the percentage of each category:
  
   <b>df2:</b>
         
        operator_id | %_missed_calls | %_outgoing_calls | avg_waiting_time

In [251]:
# calculate missed_calls percent 
dataset_operators['missed_call_percent'] = dataset_operators['is_missed_call'] / dataset_operators['total_records']

In [252]:
# calculate out_calls percent
dataset_operators['out_calls_percent'] = dataset_operators['out_calls_cnt'] / dataset_operators['total_records']

In [253]:
dataset_operators.sample(5)

Unnamed: 0,operator_id,total_records,avg_wait_time,is_missed_call,out_calls_cnt,missed_call_percent,out_calls_percent
979,954810.0,11,8.737879,0,0,0.0,0.0
405,917446.0,9,23.296296,3,9,0.333333,1.0
740,938896.0,138,15.292876,54,91,0.391304,0.65942
939,951650.0,4,22.8,2,3,0.5,0.75
797,940952.0,32,12.21822,12,19,0.375,0.59375


In [254]:
# check value counts for out_calls percent
dataset_operators.out_calls_percent.value_counts()

1.000000    339
0.000000    211
0.500000     23
0.750000     20
0.333333     18
           ... 
0.683333      1
0.580000      1
0.730769      1
0.670213      1
0.791667      1
Name: out_calls_percent, Length: 335, dtype: int64

In [255]:
# we see that some operators have 100% out calls, some have 0%, and the rest are in between

<a class="anchor" id="4"></a>
# Set Threshold Values for Efficiency Classification 


  from df2:
     
  - <b>missed calls:</b> analyze the missed calls distribution, use mean/median/Q3 as a threshold for inefficiency. 

  - <b>outgoing calls:</b> analyze the outgoing calls distribution, use mean/median/Q1 as a threshold for inefficiency (a low share of outgoing calls is the sign of ineffectiveness). 
  
  - <b>waiting time:</b> analyze the waiting time distribution, use mean/median/Q3 as a threshold for inefficiency (a long waiting time is the sign of ineffectiveness). 
  
  For each operator we will assign a binary value of 1 in a category if its value falls on the ineffective side of the threshold (above it for missed calls and waiting time, below it for outgoing calls), i.e. the operator is considered inefficient in that category. 

<b> missed calls </b>

In [256]:
# missed calls Q3 threshold value
missed_calls_q3 = np.percentile(dataset_operators.missed_call_percent, 75)

# assign binary value based on threshold
dataset_operators['missed_calls_binary'] = np.where(dataset_operators['missed_call_percent'] > missed_calls_q3, 1, 0)


In [257]:
dataset_operators.sample(5)

Unnamed: 0,operator_id,total_records,avg_wait_time,is_missed_call,out_calls_cnt,missed_call_percent,out_calls_percent,missed_calls_binary
981,955068.0,1,5.0,1,1,1.0,1.0,1
948,952462.0,7,21.0,1,1,0.142857,0.142857,0
357,912010.0,52,19.749191,2,0,0.038462,0.0,0
321,908984.0,2,26.133333,1,2,0.5,1.0,1
824,944246.0,6,32.222222,0,1,0.0,0.166667,0


<b> out calls </b>

In [258]:
# calculate q1 of outgoing calls percentage, only for operators who make outgoing calls:

# operators who make outcalling: exclude 0 
operators_outcalling = dataset_operators[(dataset_operators['out_calls_percent'] > 0)]

# calculate q1 threshold 
outcalling_q1 = np.percentile(operators_outcalling.out_calls_percent, 25)


# assign the binary value only for operators who make outgoing calls; everyone else gets 0
dataset_operators['out_calls_binary'] = np.where(dataset_operators['operator_id'].isin(operators_outcalling['operator_id']), np.where(dataset_operators['out_calls_percent'] < outcalling_q1, 1, 0), 0)


In [259]:
dataset_operators.sample(5)

Unnamed: 0,operator_id,total_records,avg_wait_time,is_missed_call,out_calls_cnt,missed_call_percent,out_calls_percent,missed_calls_binary,out_calls_binary
307,908078.0,46,15.698188,5,17,0.108696,0.369565,0,1
80,891646.0,7,20.357143,0,0,0.0,0.0,0,0
410,917852.0,147,15.355103,50,100,0.340136,0.680272,0,0
1075,970244.0,3,18.833333,1,2,0.333333,0.666667,0,0
1007,958452.0,51,10.780883,17,39,0.333333,0.764706,0,0


<b> waiting time </b>

In [260]:
# waiting time q3 threshold  
avg_wait_q3 = np.percentile(dataset_operators.avg_wait_time, 75)

# assign binary value only based on threshold value
dataset_operators['avg_wait_binary'] = np.where(dataset_operators['avg_wait_time'] > avg_wait_q3, 1, 0)


In [261]:
# extract a summarized table for binary values only (copied, so that new columns can be added without a SettingWithCopyWarning)
dataset_operators_filter = dataset_operators[['operator_id','missed_calls_binary', 'out_calls_binary', 'avg_wait_binary' ]].copy()

In [262]:
# calculate the sum of binary values
dataset_operators_filter['binary_sum'] = dataset_operators_filter['missed_calls_binary'] + dataset_operators_filter['out_calls_binary'] + dataset_operators_filter['avg_wait_binary']

<a class="anchor" id="5"></a>
# Determine effective/ineffective

- Give each operator a final classification based on the sum of its binary flags. If the sum is 2 or more, the operator is classified as inefficient; if the sum is 0, as efficient; if the sum is 1, as normal efficiency (neither clearly efficient nor inefficient).
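
The rule above can also be written without a row-wise `apply`. The following is a minimal sketch, assuming a frame with a `binary_sum` column like the one built earlier; the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

# toy frame standing in for dataset_operators_filter (values are illustrative only)
toy = pd.DataFrame({'operator_id': [1, 2, 3, 4], 'binary_sum': [0, 1, 2, 3]})

# conditions are checked in order; rows matching none of them get the default label
conditions = [toy['binary_sum'] == 0, toy['binary_sum'] >= 2]
labels = ['efficient', 'inefficient']
toy['efficiency'] = np.select(conditions, labels, default='normal_efficiency')

print(toy)
```

On a table of about a thousand operators the speed difference is negligible, so this is mainly a readability choice.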

In [263]:
def efficiency(row):
    binary_sum = row['binary_sum']
    
    if binary_sum == 0:
        answer = 'efficient'
    elif binary_sum > 1:
        answer = 'inefficient'
    else:
        answer = 'normal_efficiency'
        
    return answer
    

dataset_operators_filter['efficiency'] = dataset_operators_filter.apply(efficiency, axis = 1)
#dataset_operators_filter['efficieny'] = np.where(dataset_operators['binary_sum'] > 1, 'inefficient', 0)

In [264]:
# test the result
dataset_operators_filter.sample(10)

Unnamed: 0,operator_id,missed_calls_binary,out_calls_binary,avg_wait_binary,binary_sum,efficiency
16,883898.0,0,0,0,0,efficient
187,900790.0,0,0,0,0,efficient
1032,961064.0,0,0,0,0,efficient
118,895170.0,0,0,0,0,efficient
37,887416.0,0,0,0,0,efficient
326,909502.0,1,0,0,1,normal_efficiency
707,937854.0,0,0,0,0,efficient
577,929228.0,0,0,0,0,efficient
210,902532.0,0,0,0,0,efficient
334,910038.0,0,0,0,0,efficient


In [265]:
# display value counts by efficiency 
display(dataset_operators_filter.efficiency.value_counts())

# share of inefficient operators
display(len(dataset_operators_filter[dataset_operators_filter['efficiency'] == 'inefficient']) / len(dataset_operators_filter))

# share of efficient operators
display(len(dataset_operators_filter[dataset_operators_filter['efficiency'] == 'efficient']) / len(dataset_operators_filter))

normal_efficiency    544
efficient            438
inefficient          109
Name: efficiency, dtype: int64

0.0999083409715857

0.40146654445462876

<a class="anchor" id="6"></a>
# Efficiency Distribution Summary

- The method identifies inefficient operators by setting threshold values derived from the current data.
- The method works with binary flags, so an operator must be inefficient in at least two categories to be classified as inefficient.
- Threshold values can also be changed manually to match the company's needs and standards. 
- With the current data, about 10% of operators are classified as inefficient and about 40% as efficient.
- In my opinion, these shares are plausible, and the method isolates the inefficient portion of the operators fairly well.
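
To illustrate how the thresholds could be adjusted, here is a minimal sketch that wraps the flagging steps in a function. The function name `flag_operators` and its quantile parameters are hypothetical (they are not part of the notebook); the defaults mirror the Q3 / Q1 / Q3 choices used above, and the toy data is made up.

```python
import pandas as pd

def flag_operators(df, missed_q=0.75, out_q=0.25, wait_q=0.75):
    """Return a copy of df with one 0/1 flag per category and their sum."""
    out = df.copy()
    missed_thr = out['missed_call_percent'].quantile(missed_q)
    wait_thr = out['avg_wait_time'].quantile(wait_q)
    # the outgoing-call threshold is computed only over operators who make outgoing calls
    makes_out = out['out_calls_percent'] > 0
    out_thr = out.loc[makes_out, 'out_calls_percent'].quantile(out_q)

    out['missed_calls_binary'] = (out['missed_call_percent'] > missed_thr).astype(int)
    out['out_calls_binary'] = (makes_out & (out['out_calls_percent'] < out_thr)).astype(int)
    out['avg_wait_binary'] = (out['avg_wait_time'] > wait_thr).astype(int)
    flags = ['missed_calls_binary', 'out_calls_binary', 'avg_wait_binary']
    out['binary_sum'] = out[flags].sum(axis=1)
    return out

# toy data, made up for illustration
toy = pd.DataFrame({
    'operator_id': [1, 2, 3, 4, 5, 6],
    'missed_call_percent': [0.0, 0.1, 0.2, 0.3, 0.5, 0.9],
    'out_calls_percent': [0.0, 0.2, 0.6, 0.8, 1.0, 0.0],
    'avg_wait_time': [5.0, 10.0, 15.0, 20.0, 30.0, 60.0],
})

default_flags = flag_operators(toy)
# lower quantile levels for missed calls and waiting time flag more operators
stricter_flags = flag_operators(toy, missed_q=0.5, wait_q=0.5)
print(default_flags[['operator_id', 'binary_sum']])
print(stricter_flags[['operator_id', 'binary_sum']])
```

Keeping the quantile levels as parameters makes it easy to compare how many operators end up in each class under different company standards.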


<a class="anchor" id="7"></a>
# Test Hypothesis 

- H0: There is no difference in call duration between effective operators (m0) and ineffective operators (m1): m0 = m1.
- H1: There is a difference in call duration between effective operators (m0) and ineffective operators (m1): m0 != m1.
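
Stated in symbols, with $\bar{x}_i$, $s_i^2$ and $n_i$ denoting the sample mean, sample variance and size of group $i$ (notation introduced here for illustration only). The statistic shown is Welch's $t$, which is what `ttest_ind` computes when `equal_var = False`, as in the test cell later in this section; $\nu$ is the approximate number of degrees of freedom.

```latex
H_0:\ m_0 = m_1 \qquad H_1:\ m_0 \neq m_1

t = \frac{\bar{x}_0 - \bar{x}_1}{\sqrt{\dfrac{s_0^2}{n_0} + \dfrac{s_1^2}{n_1}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_0^2}{n_0} + \dfrac{s_1^2}{n_1}\right)^2}
{\dfrac{\left(s_0^2/n_0\right)^2}{n_0 - 1} + \dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1}}
```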

### Prepare the data

In [266]:
# recall the data
dataset.sample(5)

Unnamed: 0,user_id,date,direction,internal,operator_id,is_missed_call,calls_count,call_duration,total_call_duration,avg_call_duration,avg_wait_time,total_records,missing_calls_cnt,out_calls_cnt
50572,168252,2019-11-17,in,False,940658.0,False,4,229,263,57.25,8.5,1,0,0
6103,166541,2019-09-10,out,False,908834.0,True,1,0,58,0.0,58.0,1,1,1
45865,168073,2019-10-16,out,True,937422.0,True,3,0,10,0.0,3.333333,1,1,1
32463,167466,2019-10-31,out,False,921818.0,True,10,0,236,0.0,23.6,1,1,1
13699,166800,2019-10-25,in,False,892530.0,False,1,13,54,13.0,41.0,1,0,0


In [267]:
# build a table with each record's call_duration, avg_call_duration and the operator's efficiency class
operators_call_duration = pd.merge(dataset[['operator_id', 'call_duration', 'avg_call_duration']], dataset_operators_filter[['operator_id','efficiency']], on = 'operator_id', how = 'left')

In [268]:
operators_call_duration.sample(5)

Unnamed: 0,operator_id,call_duration,avg_call_duration,efficiency
34823,958460.0,56,56.0,normal_efficiency
4354,892028.0,0,0.0,inefficient
11834,906394.0,25,25.0,efficient
23256,912722.0,54,18.0,efficient
2525,887282.0,58,58.0,efficient


In [269]:
# drop records with 0 call duration: they represent missed calls and would distort the distribution
operators_call_duration = operators_call_duration[operators_call_duration['call_duration'] != 0]
operators_call_duration = operators_call_duration[operators_call_duration['avg_call_duration'] != 0]

In [270]:
display(operators_call_duration.avg_call_duration.mean())
display(operators_call_duration.avg_call_duration.median())

112.79565400613481

83.25

In [271]:
# check avg_call_duration distribution
fig = px.histogram(operators_call_duration, x="avg_call_duration", title = 'avg_call_duration records distribution')
fig.show()

# stacked histogram, colored by efficiency class
fig = px.histogram(operators_call_duration, x="avg_call_duration", 
                   color="efficiency",  
                   color_discrete_map={'efficient': '#636EFA','inefficient': '#EF553B', 'normal_efficiency': '#00CC96'},
                   title = 'Stacked avg_call_duration records by operators distribution')
fig.show()

- We see a long-tailed distribution for avg_call_duration.
- Split by operator efficiency, the share of each efficiency group stays roughly the same across the range (for example, the share of efficient operators does not clearly increase or decrease towards longer average call durations).
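
The visual impression can be checked numerically by computing each group's share inside duration bins. This is a minimal sketch on made-up data; the bin edges are a hypothetical choice, and the toy frame only borrows the column names of `operators_call_duration`.

```python
import pandas as pd

# toy records standing in for operators_call_duration (values are made up)
toy = pd.DataFrame({
    'avg_call_duration': [10, 40, 90, 120, 200, 260, 30, 150],
    'efficiency': ['efficient', 'inefficient', 'efficient', 'normal_efficiency',
                   'efficient', 'normal_efficiency', 'normal_efficiency', 'inefficient'],
})

# bin the durations, then compute each efficiency group's share within every bin
bins = [0, 50, 100, 200, 400]  # hypothetical bin edges
toy['duration_bin'] = pd.cut(toy['avg_call_duration'], bins=bins)
shares = pd.crosstab(toy['duration_bin'], toy['efficiency'], normalize='index')

print(shares.round(2))
```

If the shares in each row stay close to the overall class shares, the stacked histogram reading holds.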

In [272]:

# groupby operator_id, calculate avg_call_duration mean
operators_avg_call_dur = dataset.groupby('operator_id').agg({'avg_call_duration':'mean'}).reset_index()

# append efficiency class
operators_dur_eff = pd.merge(operators_avg_call_dur[['operator_id','avg_call_duration']], dataset_operators_filter[['operator_id','efficiency']], on = 'operator_id', how = 'left')



In [273]:
operators_dur_eff.sample(10)

Unnamed: 0,operator_id,avg_call_duration,efficiency
948,952462.0,73.857143,inefficient
1031,960950.0,68.375,efficient
880,947480.0,54.75,normal_efficiency
552,926214.0,92.333333,efficient
165,899892.0,63.666667,normal_efficiency
734,938080.0,113.278571,normal_efficiency
603,930692.0,9.666667,efficient
205,901894.0,90.092593,normal_efficiency
383,914426.0,94.466667,inefficient
40,888406.0,52.564103,inefficient


- Keep in mind: factors other than call duration dictate the efficiency ranking 

In [274]:
# mean avg_call_duration per efficiency class (efficient, normal, inefficient)
display(operators_dur_eff[operators_dur_eff['efficiency'] == 'efficient']['avg_call_duration'].mean())
display(operators_dur_eff[operators_dur_eff['efficiency'] == 'normal_efficiency']['avg_call_duration'].mean())
display(operators_dur_eff[operators_dur_eff['efficiency'] == 'inefficient']['avg_call_duration'].mean())

75.84977858583618

67.6855188773297

53.31021870625949

In [275]:
# the group means differ, so we will probably reject the null hypothesis

In [276]:
operators_dur_eff.efficiency.value_counts()

normal_efficiency    544
efficient            438
inefficient          109
Name: efficiency, dtype: int64

In [277]:
# avg_call_duration by operator, take this table and turn into pie chart
operators_dur_eff

Unnamed: 0,operator_id,avg_call_duration,efficiency
0,879896.0,64.854002,efficient
1,879898.0,50.253050,efficient
2,880020.0,47.225000,efficient
3,880022.0,77.980676,efficient
4,880026.0,47.157755,efficient
...,...,...,...
1086,972410.0,48.630495,inefficient
1087,972412.0,64.931884,normal_efficiency
1088,972460.0,16.690476,efficient
1089,973120.0,2.500000,normal_efficiency


In [278]:
operators_dur_eff_grouped = operators_dur_eff.groupby('efficiency').agg({'operator_id':'count'}).reset_index()

In [279]:
operators_dur_eff_grouped['percentage'] = round(100 * operators_dur_eff_grouped['operator_id']/ sum(operators_dur_eff_grouped['operator_id']),2)

operators_dur_eff_grouped

Unnamed: 0,efficiency,operator_id,percentage
0,efficient,438,40.15
1,inefficient,109,9.99
2,normal_efficiency,544,49.86


In [280]:
# plot pie chart

colors = ['#636EFA', '#EF553B', '#00CC96']  # hex colors need the leading '#'

fig = go.Figure(data = [go.Pie(labels = operators_dur_eff_grouped['efficiency'], values = operators_dur_eff_grouped['operator_id'])])


fig.update_traces(hoverinfo='label+percent+value', textfont_size=15, textinfo='label+percent', 
                  
                  pull=[0,0.1,0],
                  marker=dict(colors=colors, line=dict(color='#0b0c10', width = 2)))

# add title
fig.update_layout(
    title={
        'text': "Operators Efficiency Distribution",
        'y':0.9,
        'x':0.2,
        'xanchor': 'center',
        'yanchor': 'top'})

fig.show()

In [281]:
# QA - check avg_call_duration histogram following the groupby operator_id 
fig = px.histogram(operators_dur_eff, x="avg_call_duration", title = 'avg_call_duration distribution')
fig.show()

<b> Check normality </b>

In [282]:
# Shapiro-Wilk normality test on avg_call_duration for each group
print (st.shapiro(operators_dur_eff[operators_dur_eff['efficiency'] == 'efficient'].avg_call_duration))
print (st.shapiro(operators_dur_eff[operators_dur_eff['efficiency'] == 'inefficient'].avg_call_duration))

# helper: run the Shapiro-Wilk test on one column and report the decision
def shapiro(df, parameter):
    alpha = 0.05
    print('For {} parameter'.format(parameter))
    result = st.shapiro(df[parameter])
    if result[1] > alpha:
        print('We fail to reject the null, data may be normally distributed')
    else:
        print('We reject the null, data is not normally distributed')


(0.6782189011573792, 5.245750241577275e-28)
(0.9585162997245789, 0.001837790827266872)


In [283]:
shapiro(operators_dur_eff[operators_dur_eff['efficiency'] == 'efficient'], 'avg_call_duration')
shapiro(operators_dur_eff[operators_dur_eff['efficiency'] == 'inefficient'], 'avg_call_duration')

For avg_call_duration parameter
We reject the null, data is not normally distributed
For avg_call_duration parameter
We reject the null, data is not normally distributed


In [284]:
# try keeping only operators with avg_call_duration <= 250 to get closer to a normal distribution
operators_dur_eff_normal = operators_dur_eff[operators_dur_eff['avg_call_duration'] <= 250]

In [285]:
shapiro( operators_dur_eff_normal[operators_dur_eff_normal['efficiency'] == 'efficient'], 'avg_call_duration')
shapiro( operators_dur_eff_normal[operators_dur_eff_normal['efficiency'] == 'inefficient'], 'avg_call_duration')

For avg_call_duration parameter
We reject the null, data is not normally distributed
For avg_call_duration parameter
We reject the null, data is not normally distributed


In [286]:
# the data is still not normally distributed after trimming; with samples of this size the t-test is fairly robust to that, so we proceed with it

In [287]:
# compare mean and median avg_call_duration for efficient and inefficient operators

display(operators_dur_eff_normal[operators_dur_eff_normal['efficiency'] == 'efficient']['avg_call_duration'].mean())
display(operators_dur_eff_normal[operators_dur_eff_normal['efficiency'] == 'inefficient']['avg_call_duration'].mean())

display(operators_dur_eff_normal[operators_dur_eff_normal['efficiency'] == 'efficient']['avg_call_duration'].median())
display(operators_dur_eff_normal[operators_dur_eff_normal['efficiency'] == 'inefficient']['avg_call_duration'].median())



66.42437626240897

53.31021870625949

61.68222561751974

52.30261904761905

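- the mean avg_call_duration of inefficient operators is about 20% lower than that of efficient operators (53.3 vs 66.4); the median is about 15% lower (52.3 vs 61.7)
- therefore we expect the test to reject the null hypothesis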

### Run the test 

In [288]:
# set variables for each of the samples
operators_effective = operators_dur_eff_normal[operators_dur_eff_normal['efficiency'] == 'efficient']
operators_ineffective = operators_dur_eff_normal[operators_dur_eff_normal['efficiency'] == 'inefficient']

# set alpha value
alpha = .05

# After review - check for equality of variances using Levene's test
op_var = st.levene(operators_effective.avg_call_duration, operators_ineffective.avg_call_duration, center='mean')
display(op_var)
 

# run ttest

# Levene's p-value (0.59) is larger than alpha, so equal variances cannot be rejected.
# We still pass equal_var = False (Welch's t-test), which does not rely on the equal-variance assumption.
results = st.ttest_ind(
        operators_effective.avg_call_duration, 
        operators_ineffective.avg_call_duration, equal_var = False)

print('p-value: ', results.pvalue)

if (results.pvalue < alpha):
        print("We reject the null hypothesis")
else:
        print("We can't reject the null hypothesis") 
        


LeveneResult(statistic=0.29000456287375487, pvalue=0.5904428523889926)

p-value:  0.0012576700839946144
We reject the null hypothesis


#### <b> Conclusion: </b>
    
We reject the null hypothesis (p-value ≈ 0.0013 < 0.05): avg_call_duration differs between effective and ineffective operators. Note that the test was run on operators with an avg_call_duration of at most 250.
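
Because the Shapiro-Wilk checks above rejected normality for both groups, a rank-based test is a reasonable cross-check on the t-test. The following is a minimal sketch using `scipy.stats.mannwhitneyu` on synthetic right-skewed samples; the group sizes and distribution parameters are made up and are not taken from the dataset.

```python
import numpy as np
from scipy import stats as st

rng = np.random.default_rng(0)

# synthetic right-skewed samples standing in for the two operator groups
# (sizes and scales are made up, not taken from the dataset)
effective = rng.lognormal(mean=4.2, sigma=0.6, size=400)
ineffective = rng.lognormal(mean=3.9, sigma=0.6, size=100)

# two-sided Mann-Whitney U test: compares the two samples without assuming normality
result = st.mannwhitneyu(effective, ineffective, alternative='two-sided')

print('U statistic:', result.statistic)
print('p-value:', result.pvalue)
```

On the real data, the two `avg_call_duration` series passed to `ttest_ind` could be passed to `mannwhitneyu` in the same way; agreement between the two tests would strengthen the conclusion.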

<a class="anchor" id="8"></a>
# Clients Plan & Operators Efficiency Analysis

### Approach


- We decided to run an additional analysis to look for a relation between the clients' tariff_plan and the efficiency records.


In [289]:
# pair each record's client (user_id) with its operator
clients_operators = dataset[['user_id','operator_id']]

# add the generalized efficiency rank of each operator
clients_operators = pd.merge(clients_operators[['user_id','operator_id']], dataset_operators_filter[['operator_id','efficiency']], on ='operator_id', how = 'left' )

# add tariff plan and date_start

clients_operators = pd.merge(clients_operators[:], clients[:], on = 'user_id', how = 'left')

In [290]:
clients_operators.sample(5)

Unnamed: 0,user_id,operator_id,efficiency,tariff_plan,date_start
32288,167977,944218.0,inefficient,B,2019-10-08
40701,168412,952468.0,normal_efficiency,C,2019-10-24
8096,166678,900892.0,efficient,B,2019-08-14
11151,166884,895776.0,efficient,B,2019-08-22
12591,166916,906404.0,efficient,A,2019-08-23


In [291]:
clients_operators_stacked = clients_operators[['efficiency','tariff_plan']].copy()

clients_operators_stacked['cnt'] = 1
clients_operators_stacked

clients_operators_stacked_grouped = clients_operators_stacked.groupby(['efficiency','tariff_plan']).agg({'cnt':'count'}).reset_index().sort_values('cnt', ascending = False)
clients_operators_stacked_grouped

Unnamed: 0,efficiency,tariff_plan,cnt
2,efficient,C,8603
1,efficient,B,8033
6,normal_efficiency,A,6278
8,normal_efficiency,C,5501
7,normal_efficiency,B,4615
0,efficient,A,4491
4,inefficient,B,1638
3,inefficient,A,1279
5,inefficient,C,642


In [292]:
# plot stacked barplot
fig = px.bar(clients_operators_stacked_grouped, x="tariff_plan", y='cnt', color = 'efficiency',
            hover_data=['efficiency'], barmode = 'stack', color_discrete_map={'efficient': '#636EFA','inefficient': '#EF553B', 'normal_efficiency': '#00CC96'} ,title = 'Stacked Operators Efficiency Distribution by Tariff Plan')

# edit layout
fig.update_layout(
    autosize=False,
    width=500,
    height=600)

fig.show()

In [293]:
# calculate the share of records handled by inefficient operators for clients on tariff plan C
clients_operators_C = clients_operators[clients_operators['tariff_plan'] == 'C']
display(len(clients_operators_C[clients_operators_C['efficiency'] == 'inefficient']) / len(clients_operators_C))

# calculate the share of records handled by inefficient operators for clients on tariff plans A and B combined
clients_operators_ab = clients_operators.query('tariff_plan == "A" or tariff_plan == "B"')
display(len(clients_operators_ab[clients_operators_ab['efficiency'] == 'inefficient']) / len(clients_operators_ab))

0.043537230435372304

0.11076934761145288

<b> Conclusion </b>

The results suggest that records of clients on tariff plan C involve more efficient operators: only about 4.4% of plan C records belong to inefficient operators, compared with about 11.1% for plans A and B combined.
It is worth checking whether this tariff plan is more expensive, and therefore more resources are channeled towards its service.
However, this is only an estimate, which needs to be statistically tested.
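
One way such a test could look is a chi-square test of independence on the record counts from the grouped table above. This is a sketch under a strong assumption: it treats records as independent, although many records share the same operator, so the resulting p-value would overstate the evidence. A table built per operator or per client would be more defensible.

```python
import pandas as pd
from scipy import stats as st

# record counts copied from the grouped table above
# (rows: efficiency class, columns: tariff plan)
counts = pd.DataFrame(
    {'A': [4491, 1279, 6278], 'B': [8033, 1638, 4615], 'C': [8603, 642, 5501]},
    index=['efficient', 'inefficient', 'normal_efficiency'],
)

# chi-square test of independence between efficiency class and tariff plan
chi2, p_value, dof, expected = st.chi2_contingency(counts)

print('chi2:', chi2)
print('p-value:', p_value)
print('degrees of freedom:', dof)
```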

<a class="anchor" id="9"></a>
# ML model to identify efficient/inefficient operators

- The original plan was to build an ML model that clusters records, and to see whether we can predict if they are classified as efficient or not.


- The problem with this approach: the efficient/inefficient/normal efficiency class is assigned per operator_id, not per record of the dataset. Therefore, any efficiency label we attach to the dataset reflects the operator as a whole rather than the stats of each particular observation. 



- Approach: instead of clustering, we will try to predict whether an operator is efficient/inefficient:
- <b> Method: binary classification: logistic regression. </b>

- Data Preprocessing:
        - change the classification to binary: efficient or normal efficiency = efficient, inefficient = inefficient 
        - add to the dataset the tariff plan of each user_id.
        - the features are: is_missed_call, internal, avg_call_duration, avg_wait_time and tariff_plan

* Preprocess the data:

In [294]:
# Filter the original dataset for essential columns only 
dataset_filter = dataset[['user_id','operator_id','is_missed_call','internal','avg_call_duration','avg_wait_time']]

In [295]:
# merge the filtered data with the operators' efficiency class
dataset_filter = pd.merge(dataset_filter[:], dataset_operators_filter[['operator_id','efficiency']], on = 'operator_id', how = 'left')

In [296]:
# merge the filtered data with tariff plans
dataset_filter = pd.merge(dataset_filter[:], clients[['user_id','tariff_plan']], on = 'user_id', how = 'left')

In [297]:
# filter out the user_id and operator_id, so they won't be part of the considered features for the ML predictions 
dataset_filter = dataset_filter[['is_missed_call', 'internal', 'avg_call_duration', 'avg_wait_time', 'tariff_plan', 'efficiency']]

In [298]:
# narrow the target down to binary (efficient/inefficient) by treating normal efficiency as efficient
dataset_filter['efficiency'] = dataset_filter['efficiency'].replace({"normal_efficiency":"efficient"})

In [299]:
# turn tariff_plan into a numeric code so the model can weigh it 
dataset_filter['tariff_plan'] = dataset_filter['tariff_plan'].replace({"A": "1", "B": "2", "C": "3"})
dataset_filter['tariff_plan'] = dataset_filter['tariff_plan'].astype('int')


In [300]:
# turn the efficiency target into a numeric code
dataset_filter['efficiency'] = dataset_filter['efficiency'].replace({"efficient": "0", "inefficient": "1"})
dataset_filter['efficiency'] = dataset_filter['efficiency'].astype('int')

In [301]:
# drop any missing values 
dataset_filter = dataset_filter.dropna()

In [302]:
# filtered dataset:
dataset_filter.sample(10)

Unnamed: 0,is_missed_call,internal,avg_call_duration,avg_wait_time,tariff_plan,efficiency
40242,True,False,0.0,20.133333,1,0
10163,False,False,101.0,27.0,2,1
20668,False,True,33.0,2.666667,2,0
28115,True,False,0.0,10.0,1,0
25975,False,False,35.285714,21.0,1,0
252,False,False,66.5,3.5,2,0
37079,False,False,262.5,20.5,1,0
33791,True,False,0.0,36.771429,1,0
31452,False,False,333.4,12.6,3,0
20799,False,True,35.0,3.333333,2,0


* Run the logistic regression model 

In [303]:
# divide the data into features (the X matrix) and a target variable (y)
X = dataset_filter.drop(['efficiency'], axis = 1)
y = dataset_filter['efficiency']

# divide the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)

In [304]:
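# define the model's algorithm 
from sklearn.linear_model import LogisticRegression  # imported here in case it is not imported earlier in the notebook
model = LogisticRegression()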

# train your model
model.fit(X_train, y_train)

# use the trained model to make forecasts
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)[:,1]

In [305]:
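from sklearn.metrics import accuracy_score  # imported here in case it is not imported earlier in the notebook

# function that prints the metrics results (y_proba is accepted but not used yet)
def print_all_metrics(y_true, y_pred, y_proba, title = 'Classification metrics'):
    print(title)
    print('\tAccuracy: {:.2f}'.format(accuracy_score(y_true, y_pred)))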

In [306]:
# apply the function on logistic regression model
print_all_metrics(y_test, predictions, probabilities , title='Metrics for logistic regression:')

Metrics for logistic regression:
	Accuracy: 0.91


<b> ML conclusions: </b>
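- The model predicted the efficient/inefficient label of records with 91% accuracy.
- An ML model may be usable to predict efficient/inefficient events, but accuracy alone does not show it: the classes are imbalanced (in the tariff plan table above only about 9% of the records belong to inefficient operators), so always predicting "efficient" would reach a similar score. Recall and precision for the inefficient class should be checked before drawing a conclusion.
- Pitfall in the current model: categorical features (such as tariff_plan and efficiency) had to be turned into integers in order to run successfully.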
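
A minimal sketch of how the accuracy figure could be put in context: compare the model with a baseline that always predicts the most frequent class, and look at recall and the confusion matrix alongside accuracy. The data below is synthetic (about 10% positives with a small, made-up signal) and is not the notebook's dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# synthetic, imbalanced data (about 10% positives); not the notebook's dataset
n = 5000
X = rng.normal(size=(n, 3))
y = (rng.random(n) < 0.10).astype(int)
X[y == 1] += 0.5  # give the minority class a small, made-up signal

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
model = LogisticRegression().fit(X_train, y_train)

for name, clf in [('baseline', baseline), ('logistic regression', model)]:
    pred = clf.predict(X_test)
    print(name)
    print('\taccuracy: {:.2f}'.format(accuracy_score(y_test, pred)))
    print('\trecall (positive class): {:.2f}'.format(recall_score(y_test, pred)))
    print(confusion_matrix(y_test, pred))
```

If the model's accuracy is close to the baseline's and its recall for the inefficient class is near zero, the model has learned little beyond the class frequencies.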


<a class="anchor" id="10"></a>
# Overall Conclusions

<b> Workflow for identifying inefficient operators: </b>
- The project presents an approach for classifying operator efficiency.
- Efficiency classification is done by setting threshold values based on the raw data and flagging operators as inefficient in each category where they fall on the wrong side of the threshold.
- A binary flag for each of the requested categories makes it quick to determine each operator's strong and weak points.

<b> Analysis result: </b>
- The analysis yields about 10% inefficient operators and 40% efficient operators; the remaining 50% fall in between as normal-efficiency operators.
- Hypothesis testing: the t-test rejected the null hypothesis (p-value ≈ 0.0013), so average call duration differs between efficient and inefficient operators. 
- Tariff plan analysis: there may be a relation between clients on tariff plan C and records of efficient operators.
    
<b> ML Logistic Regression model to identify efficient/inefficient records: </b>
- Using logistic regression, the model predicted the efficient/inefficient label of records with 91% accuracy.
- However, the model requires numerical values in all features, so categorical features such as tariff_plan and the efficient/inefficient label had to be transformed into numbers.
- This means that logistic regression might not be the best approach for ML prediction.
- In addition, the project classified operator efficiency based on the observations rather than classifying each observation separately. To classify each observation, a different approach to the data analysis would be needed.
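
On the categorical features: coding tariff plans as 1, 2, 3 makes the model treat them as ordered and equally spaced. A common alternative is one-hot encoding, sketched below with `pd.get_dummies` on a made-up frame that only borrows the column names of `dataset_filter`.

```python
import pandas as pd

# toy frame with the same column names as dataset_filter (values are made up)
toy = pd.DataFrame({
    'avg_wait_time': [12.5, 3.0, 27.0, 8.0],
    'tariff_plan': ['A', 'B', 'C', 'B'],
})

# one indicator column per plan; drop_first removes a redundant (collinear) column
encoded = pd.get_dummies(toy, columns=['tariff_plan'], drop_first=True, dtype=int)

print(encoded)
```

With this encoding, plan A becomes the reference level and the model fits a separate coefficient for plans B and C.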