### 🧭 __Telecommunications: Identifying ineffective operators__

The virtual phone service CallMeMaybe is developing a new feature that will provide supervisors with insight into the least effective operators. An operator is considered ineffective if they have a high number of missed incoming calls (internal and external) and a long wait time for incoming calls. Furthermore, if an operator is supposed to make outgoing calls, a low number of them will also be a sign of ineffectiveness.

- Conduct exploratory data analysis
- Identify ineffective operators
- Test statistical hypotheses

#### 🧾 __Data Dictionary__

The datasets contain information about the use of the CallMeMaybe virtual phone service. Its customers are organizations that need to distribute large numbers of incoming calls among multiple carriers or make outgoing calls through their carriers. Carriers can also make internal calls to communicate with each other. These calls are made through the CallMeMaybe network.

The compressed dataset `telecom_dataset_us.csv` contains the following columns:

- `user_id`: Customer account ID
- `date`: Date statistics were retrieved
- `direction`: Call direction (`out` for outgoing, `in` for incoming)
- `internal`: Whether the call was internal (between a customer's operators)
- `operator_id`: Operator ID
- `is_missed_call`: Whether the call was missed
- `calls_count`: Number of calls
- `call_duration`: Call duration (excluding hold time)
- `total_call_duration`: Call duration (including hold time)

The `telecom_clients_us.csv` dataset has the following columns:

- `user_id`: User ID
- `tariff_plan`: Customer's current rate
- `date_start`: customer registration date

### 💻 __1. Libraries__

In [102]:
from IPython.display import display, HTML
import numpy as np
import pandas as pd
from pandas import BooleanDtype
import plotly.express as px
import plotly.graph_objects as go
import dash
from dash import dcc, html
from dash.dependencies import Input, Output
import re
from scipy import stats as st
import scipy.stats as stats
from scipy.stats import f_oneway
from scipy.stats import ttest_ind
from statsmodels.stats.proportion import proportions_ztest
from tqdm import tqdm
import unicodedata
import warnings

# Ignore all FutureWarning
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

### 🛠️ __2. Functions__

In [2]:
# Function to normalize string formatting in object-type columns
def normalize_string_format(df, include=None, exclude=None):
    """
    Standardizes text formatting for object-type (string) columns in a DataFrame.

    Operations performed:
    - Converts text to lowercase
    - Strips leading/trailing whitespace
    - Replaces punctuation with spaces
    - Collapses spaces into underscores
    - Removes redundant underscores
    - Adds unicode normalization to remove accents and special characters.

    Parameters:
    df (DataFrame): The input DataFrame.
    include (list, optional): Specific columns to apply formatting to. If None, applies to all except those in 'exclude'.
    exclude (list, optional): Columns to skip.

    Returns:
    DataFrame: Updated DataFrame with normalized string formats.
    """

    if exclude is None:
        exclude = []

    if include is None:
        available_columns = [col for col in df.columns if col not in exclude]
    else:
        available_columns = [col for col in include if col not in exclude]

    def clean_text(text):
        if isinstance(text, str):
            text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8')
            text = (text
                    # .lower()
                    .strip())
            # text = re.sub(r'[^\w\s]', ' ', text)
            # text = re.sub(r'\s+', '_', text)
            # text = re.sub(r'__+', '_', text)
            # text = re.sub(r'_(?=\s|$)', '', text)
            # text = re.sub(r'__+', '_', text)
        return text

    for column in available_columns:
        if df[column].dtype in ['object', 'string']:
            df[column] = df[column].apply(clean_text)

    return df

# Function to identify non-standard missing values in object-type columns
def check_existing_missing_values(df):
    """
    Checks object-type columns in a DataFrame for non-standard missing values.

    Parameters:
    df (DataFrame): The dataset to inspect.

    Output:
    Displays the number of non-standard missing entries per column and the matched values.
    """

    # Common non-standard representations of missing values
    missing_values = ['', ' ', 'N/A', 'none', 'None','null', 'NULL', 'NaN', 'nan', 'NAN', 'nat', 'NaT']

    display(HTML(f"<h4>Scanning for Non-Standard Missing Values</h4>"))

    for column in df.columns:

        matches = df[df[column].isin(missing_values)][column].unique()

        if df[column].isin(missing_values).any() and matches.size > 0:
            count = df[column].isin(missing_values).sum()
            display(
                HTML(f"> Missing values in column <i>'{column}'</i>: <b>{count}</b>"))
            display(
                HTML(f"&emsp;Matched non-standard values: {list(matches)}"))
        else:
            display(
                HTML(f"> Missing values in column <i>'{column}'</i>: None"))

    print()

    return None

# Function to standardize non-standard missing values to pd.NA
def replace_missing_values(df, include=None, exclude=None):
    """
    Replaces common non-standard missing value entries in object-type columns with pd.NA.

    Parameters:
    df (DataFrame): The input dataset.
    include (list, optional): List of columns to include. If None, all columns except those in 'exclude' are considered.
    exclude (list, optional): List of columns to exclude from replacement.

    Returns:
    DataFrame: Updated DataFrame with non-standard missing values replaced by pd.NA.
    """

    missing_values = ['', ' ', 'N/A', 'none', 'None', 'null', 'NULL', 'NaN', 'nan', 'NAN', 'nat', 'NaT']

    if exclude is None:
        exclude = []

    if include is None:
        available_columns = [col for col in df.columns if col not in exclude]
    else:
        available_columns = [col for col in include if col not in exclude]

    for column in available_columns:
        if df[column].dtype in ['object', 'string'] and df[column].isin(missing_values).any():
            df.loc[:, column] = df[column].replace(missing_values, pd.NA)

    return df

# function for displaying the percentage of mising values in a Dataset
def missing_values_rate(df, include=None, exclude=None):
    
    """
    Displays the percentage of missing values for specified columns in a DataFrame.

    Parameters:
    ----------
    df : pandas.DataFrame
        The DataFrame to analyze.

    include : list, optional
        List of column names to include in the analysis. If None, all columns not in `exclude` are considered.

    exclude : list, optional
        List of column names to exclude from the analysis. Default is an empty list.

    Returns:
    -------
    None
        Displays HTML output in a Jupyter Notebook environment.
    """
    
    if exclude is None:
        exclude = []

    if include is None:
        available_columns = [col for col in df.columns if col not in exclude]
    else:
        available_columns = [col for col in include if col not in exclude]

    for column in available_columns:
        total_values = len(df[column])
        if total_values == 0:
            percentage = 0
        else:
            missing_values = df[column].isna().sum()
            percentage = (missing_values / total_values) * 100

        display(HTML(f"> Percentage of missing values for column <i>'{column}'</i>: <b>{percentage:.2f}</b> %<br>"))
        display(HTML(f">    Total values: {df[column].shape[0]}<br>   > Missing values: {df[column].isna().sum()}<br><br>"))

# Function to convert string-based date/time columns to timezone-aware datetime or time objects
def normalize_datetime(df, include=None, exclude=None, frmt=None, time_zone='UTC'):
    """
    Converts string-based columns in a DataFrame to datetime or time objects,
    with optional format and timezone adjustments.

    Parameters:
    - df (DataFrame): The input DataFrame.
    - include (list, optional): Specific columns to include. If None, all non-excluded columns are processed.
    - exclude (list, optional): Columns to exclude from conversion.
    - frmt (str, optional): Optional datetime format (e.g., '%Y-%m-%d', '%H:%M:%S').
    - time_zone (str): Timezone to localize or convert to (default: 'UTC').

    Returns:
    DataFrame: DataFrame with parsed datetime or time columns.
    """

    if exclude is None:
        exclude = []

    if include is None:
        target_columns = [col for col in df.columns if col not in exclude]
    else:
        target_columns = [col for col in include if col not in exclude]
        
    df = df.copy()

    for column in target_columns:
        if pd.api.types.is_object_dtype(df[column]) or pd.api.types.is_string_dtype(df[column]):
            df.loc[:, column] = pd.to_datetime(df[column], format=frmt, errors='coerce')

        if pd.api.types.is_datetime64_any_dtype(df[column]):
            if frmt in ["%H:%M:%S", "%H:%M"]:
                df.loc[:, column] = df[column].dt.time
            else:
                if df[column].dt.tz is None:
                    df.loc[:, column] = df[column].dt.tz_localize(time_zone)
                else:
                    df.loc[:, column] = df[column].dt.tz_convert(time_zone)

    return df

# Function to evaluate the central tendency of a numerical feature
def evaluate_central_trend(df, column):
    """
    Evaluates the central tendency of a given column using the coefficient of variation (CV)
    and skewness to determine the most reliable measure (mean or median).
    
    Parameters:
    df (DataFrame): The input DataFrame.
    column (str): Name of the numerical column to evaluate.
    
    Output:
    Displays CV, skewness, and recommends the most reliable central measure.
    """
    
    data = df[column].dropna()
    
    if data.empty:
        display(HTML(f"<b>Column '{column}' is empty or contains only NaNs.</b>"))
        return
    
    mean = data.mean()
    std = data.std()
    skew = data.skew()
    
    if mean == 0:
        display(HTML(f"> Mean of column '{column}' is <b>zero</b>.\n Coefficient of Variation is <b>undefined</b>."))
        return
    
    cv = (std / mean) * 100
    
    # CV-based interpretation
    if cv <= 10:
        cv_msg = "Very low variability: highly reliable mean."
    elif cv <= 20:
        cv_msg = "Moderate variability: reasonably reliable mean."
    elif cv <= 30:
        cv_msg = "Considerable variability: mean may be biased."
    else:
        cv_msg = "High variability: mean may be misleading."
    
    # Skewness-based adjustment
    abs_skew = abs(skew)
    if abs_skew <= 0.3:
        skew_msg = "Low skewness: distribution is nearly symmetric."
        skew_level = "low"
    elif abs_skew <= 0.6:
        skew_msg = "Moderate skewness: some asymmetry present."
        skew_level = "moderate"
    elif abs_skew <= 1.0:
        skew_msg = "High skewness: strong asymmetry detected."
        skew_level = "high"
    else:
        skew_msg = "Very high skewness: distribution is heavily distorted."
        skew_level = "very_high"
    
    # Central trend evaluation
    if cv > 30 or skew_level in ["high", "very_high"]:
        central = "median"
        reason = "due to high variability or strong skewness"
    elif cv > 20 or skew_level == "moderate":
        central = "median (with caution)"
        reason = "due to moderate variability or skewness"
    else:
        central = "mean"
        reason = "distribution is stable and symmetric"

    display(HTML(f"> Coefficient of variation for column <i>'{column}'</i>: <b>{cv:.2f} %</b>"))
    display(HTML(f"> Skewness of column <i>'{column}'</i>: <b>{skew:.2f}</b>"))
    display(HTML(f"> {cv_msg}"))
    display(HTML(f"> {skew_msg}"))
    display(HTML(f"> Recommended central measure: <b>{central}</b> ({reason})"))
    
    # Validation of values for transformation
    min_val = data.min()
    has_negatives = (min_val < 0)
    has_zeros = (data == 0).any()
    all_positive = (min_val > 0)

    # Robust Transformation Suggestion
    if skew_level in ["high", "very_high"]:
        if skew > 0:
            transform_suggestion = "To reduce right skew:"
            if all_positive:
                transform_suggestion += " [log(x), sqrt(x), reciprocal(x), Box-Cox]."
            elif not has_negatives:
                transform_suggestion += " [sqrt(x), reciprocal(x), Yeo-Johnson (handles zeros)]."
            else:
                transform_suggestion += " [Yeo-Johnson, quantile or rank-based transforms (handles negatives)]."
        else:
            transform_suggestion = "To reduce left skew:"
            if not has_negatives:
                transform_suggestion += " [square(x), exp(x), reflect+log(x), Yeo-Johnson]."
            else:
                transform_suggestion += " [Yeo-Johnson or rank-based transforms (handles negatives)]."

        if abs_skew > 1.5 or data.max() > 10 * data.median():
            transform_suggestion += " For extreme skew or heavy-tailed distributions, consider quantile or normal score transforms instead of classical ones."

        display(HTML(f"> Suggested transformation: <i>{transform_suggestion}</i>"))
    
    print()
    
    return 

# Function to detect outlier boundaries with optional clamping of lower bound to zero
def outlier_limit_bounds(df, column, bound='both', clamp_zero=False):
    """
    Detects outlier thresholds based on the IQR method and returns rows beyond those limits.

    Parameters:
    - df (DataFrame): The input DataFrame.
    - column (str): The name of the numerical column to analyze.
    - bound (str): One of 'both', 'lower', or 'upper' to indicate which bounds to evaluate.
    - clamp_zero (bool): If True, clamps the lower bound to zero (useful for non-negative metrics).

    Returns:
    DataFrame(s): Rows identified as outliers, depending on the bound selected.
    """

    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1

    lower_bound = max(q1 - 1.5 * iqr, 0) if clamp_zero else q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr

    display(HTML(f"> Outlier thresholds for <i>'{column}'</i>: \n"
                 f"> Lower = <b>{lower_bound:.3f}</b>, > Upper = <b>{upper_bound:.3f}</b>"))

    if bound not in ['both', 'lower', 'upper']:
        display(HTML(f"> Invalid 'bound' parameter. Use <b>'both'</b>, <b>'upper'</b>, or <b>'lower'</b>."))
        return

    outliers = pd.DataFrame()
    
    if bound in ['both', 'lower']:
        lower_outliers = df[df[column] < lower_bound]
        if lower_outliers.empty:
            display(HTML(f"> <b>No</b> lower outliers found in column <i>'{column}'</i>."))
        outliers = pd.concat([outliers, lower_outliers])

    if bound in ['both', 'upper']:
        upper_outliers = df[df[column] > upper_bound]
        if upper_outliers.empty:
            display(HTML(f"> <b>No</b> upper outliers found in column <i>'{column}'</i>."))
        outliers = pd.concat([outliers, upper_outliers])

    display(HTML(f"- - -"))
    display(HTML(f"> Outliers:"))

    return outliers if not outliers.empty else None

# Function to visualize missing values within a DataFrame using a heatmap
def missing_values_heatmap_px(df):
    """
    Displays an interactive heatmap of missing (NaN) values in the given DataFrame using Plotly Express.
    
    Parameters:
    - df (DataFrame): The input DataFrame to analyze.
    
    Output:
    An interactive heatmap visualization showing the presence of missing values per column and row.
    """
    # Convert missing values to integers (1 = missing, 0 = present)
    missing_matrix = df.isna().astype(int)

    # Create the interactive heatmap (no sorting, just direct axes)
    fig = px.imshow(missing_matrix, color_continuous_scale='viridis', aspect='auto', labels=dict(x="Columns", y="Rows", color="Missing"),
                    title="Heatmap of Missing Values (Interactive)")

    # Adjust layout and colorbar
    fig.update_layout(width=1000, height=600, coloraxis_colorbar=dict(tickvals=[0, 1], ticktext=["Present", "Missing"]))

    # Improve axis readability
    fig.update_xaxes(title="Columns")
    fig.update_yaxes(title="Rows")

    fig.show()
    
# Plot Histogram for descrete values with Plotly Express
# plot_hist_frequency_px(data, bins=10, color='lightgrey', title='Histogram with Step 2 Ticks', xlabel='Value', ylabel='Frequency',  xticks_step=2)
def plot_hist_frequency_px(series, bins=10, color='grey', title='', xlabel='', ylabel='Frequency', xticks_range=None, xticks_step=None,
                           rotation=0):
    """
    Generates a histogram for discrete values using Plotly Express, 
    including mean and median lines and optional custom x-axis ticks.

    Parameters:
    - series (pd.Series or list/array): Data to plot
    - bins (int): Number of bins
    - color (str): Fill color of the bars
    - title (str): Plot title
    - xlabel (str): X-axis label
    - ylabel (str): Y-axis label
    - xticks_range (list/tuple, optional): Range of ticks on X-axis [min, max]
    - xticks_step (int, optional): Step between ticks
    - rotation (int): Rotation angle for X-axis tick labels
    """
    
    # Convert to series if needed
    if not isinstance(series, pd.Series):
        series = pd.Series(series)
    
    # Compute statistics
    mean_val = series.mean()
    median_val = series.median()
    
    # Prepare histogram
    fig = px.histogram(series, nbins=bins, color_discrete_sequence=[color]    )
    
    # Add mean and median lines
    fig.add_vline(x=mean_val, line_dash='dash', line_color='red', annotation_text=f'Mean: {mean_val:.2f}', annotation_position="top right",
                  annotation_y=1.0, annotation_font=dict(color='red'))
    fig.add_vline(x=median_val, line_dash='dash', line_color='blue', annotation_text=f'Median: {median_val:.2f}', annotation_position="top right", 
                  annotation_y=0.9, annotation_font=dict(color='blue'))
    
    # Update layout
    fig.update_layout(title=title, xaxis_title=xlabel, yaxis_title=ylabel, xaxis=dict(tickangle=rotation), bargap=0.1)
    
    # Handle x-axis range and ticks
    if xticks_range is not None:
        x_min, x_max = xticks_range
    else:
        x_min, x_max = series.min(), series.max()
    
    if xticks_step is not None:
        tickvals = list(range(int(x_min), int(x_max)+1, xticks_step))
        fig.update_xaxes(range=[x_min, x_max], tickvals=tickvals)
    else:
        fig.update_xaxes(range=[x_min, x_max])
    
    fig.show()

# Plot horizontal boxplot
# plot_horizontal_boxplotpx(df, 'column_name')
def plot_horizontal_boxplot_plotlypx(data, column, title=None):
    """
    Horizontal boxplot with aligned markers:
    - Points, outliers and mean marker are aligned with whiskers (y=1).
    - Red diamond shows mean.
    - Annotates mean, median, and IQR-based outlier thresholds.

    Parameters:
    - data: pd.DataFrame - DataFrame containing the column to plot.
    - column: str - Column name to visualize.
    - title: str - Title to visualize
    """
    values = data[column].dropna().values
    Q1, Q3 = np.percentile(values, [25, 75])
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    mean_val = np.mean(values)
    median_val = np.median(values)

    outliers = values[(values < lower_bound) | (values > upper_bound)]
    non_outliers = values[(values >= lower_bound) & (values <= upper_bound)]

    fig = go.Figure()

    # 1. Main boxplot (data only without outliers)
    fig.add_trace(go.Box(x=non_outliers, y=['Data'] * len(non_outliers), name='Boxplot', orientation='h', boxpoints=False, fillcolor='lightgrey',
                         line=dict(color='black'), showlegend=False))

    # 2. Normal points
    fig.add_trace(go.Scatter(x=non_outliers, y=['Data'] * len(non_outliers), mode='markers', name='Data Points', marker=dict(color='black', symbol='circle', size=5),
                             hoverinfo='x'))

    # 3. Outliers
    fig.add_trace(go.Scatter(x=outliers, y=['Data'] * len(outliers), mode='markers', name='IQR Outliers', marker=dict(color='red', symbol='x', size=8),
                             hoverinfo='x'))

    # 4. Media (red diamond)
    fig.add_trace(go.Scatter(x=[mean_val], y=['Data'], mode='markers+text', name='Mean', marker=dict(color='red', symbol='diamond', size=10),
                             text=[f"Mean = {mean_val:.2f}"], textposition='top center', textfont=dict(color='red')))

    # 5. Median (blue diamond)
    fig.add_trace(go.Scatter(x=[median_val], y=['Data'], mode='markers+text', name='Median', marker=dict(color='blue', symbol='diamond', size=10),
                             text=[f"Median = {median_val:.2f}"], textposition='bottom center', textfont=dict(color='blue')))

    # 6. Vertical lines for IQR limits
    fig.add_shape(type="line", x0=lower_bound, y0=0.9, x1=lower_bound, y1=1.1, line=dict(color="red", dash="dot")) 
    fig.add_shape(type="line", x0=upper_bound, y0=0.9, x1=upper_bound, y1=1.1, line=dict(color="red", dash="dot"))

    # Layout
    fig.update_layout(title=title or f'Boxplot with Stats: {column}', xaxis_title=column, yaxis=dict(showticklabels=False), height=450,
                      template='simple_white', legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1))
    
    fig.show()

# Function to plot Histogram for Cualitative/categorical data
# plot_bar_frequencies(serie=data, color="orange", title="Music genre distribution", xlabel="Genre", ylabel="Frequency", figsize=(700, 500),
#                      rotation_x=45, rotation_y=0) 
def plot_cualitative_histogram_plotlypx(series, color="grey", title="", xlabel="", ylabel="", figsize=(1200, 600), xtick=None, ytick=None, 
                                        rotation_x=0, rotation_y=0, top_n=None):
    """
    Generates a frequency bar chart with Plotly Express.

    Parameters:
    -----------
    - series: pd.Series or list
    - Categorical data.
    - color: str
    - Bar color.
    - title: str
    - Chart title.
    - xlabel: str
    - X-axis label.
    - ylabel: str
    - Y-axis label.
    - figsize: tuple
    - Figure size in pixels (width, height).
    - xtick: list or None
    - Custom X labels.
    - ytick: list or None
    - Custom Y labels.
    - x_rotation: int
    - X-axis label rotation.
    - y_rotation: int
    - Y-axis label rotation.
    - top_n: Number of top categories to display (by default all).
    """

    # Convert to pandas.Series if list
    if not isinstance(series, pd.Series):
        series = pd.Series(series)

    # Get frequencies
    freq = series.value_counts().reset_index()
    freq.columns = ["Category", "Frequency"]
    
    # Sort descending
    freq = freq.sort_values(by="Frequency", ascending=False)
    
    # Filter top_n if specified
    if top_n is not None:
        freq = freq.head(top_n)

    # Graph
    fig = px.bar(freq, x="Category", y="Frequency", text="Frequency", color_discrete_sequence=[color], width=figsize[0], height=figsize[1])

    # Custom
    fig.update_traces(textposition="outside")
    fig.update_layout(title=title, xaxis_title=xlabel, yaxis_title=ylabel)
    
    # Rotation
    fig.update_xaxes(tickangle=rotation_x)
    fig.update_yaxes(tickangle=rotation_y)
    
    # Customized Ticks
    if xtick is not None:
        fig.update_xaxes(tickvals=list(range(len(xtick))), ticktext=xtick)
    if ytick is not None:
        fig.update_yaxes(tickvals=ytick)

    fig.show()

# Plots Quantile to Quantile graph
# plot_qq_normality_tests_plotlypx(df, 'column_name')
def plot_qq_normality_tests_plotlypx(data, column, dist='norm', dist_params=None,
                                    title=None, color='grey', outlier_color='crimson',
                                    outlier_marker='x', width=1200, height=600):
    """
    Interactive QQ plot using Plotly, including normality tests and outlier detection.
    
    Generate a QQ plot comparing the quantiles of a sample against a theoretical distribution.
    Outliers are detected using the IQR method and highlighted with custom color and marker.
    Also performs normality tests and displays the results.

    Parameters:
    - data: pd.DataFrame - The dataset containing the column to analyze.
    - column: str - The column name to analyze.
    - dist: str or scipy.stats distribution - The distribution to compare against (default: 'norm').
    - dist_params: tuple - Parameters required for the theoretical distribution (shape, loc, scale).
    - title: str - Plot title.
    - color: str -  Color of the main data points.
    - outlier_color: str - Color of the outlier points.
    - outlier_marker: str - Marker style for outliers.
    """
    values = data[column].dropna().values
    n = len(values)

    # Get theoretical distribution
    dist_obj = getattr(stats, dist) if isinstance(dist, str) else dist

    # QQ data
    if dist_params:
        (osm, osr), (slope, intercept, r) = stats.probplot(values, dist=dist_obj, sparams=dist_params)
    else:
        (osm, osr), (slope, intercept, r) = stats.probplot(values, dist=dist_obj)

    # Outlier detection via IQR on sample quantiles
    Q1, Q3 = np.percentile(osr, [25, 75])
    IQR = Q3 - Q1
    lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
    is_outlier = (osr < lower) | (osr > upper)

    # Plotly figure
    fig = go.Figure()

    # Non-outliers
    fig.add_trace(go.Scatter(
        x=osm[~is_outlier], y=osr[~is_outlier],
        mode='markers',
        name='Data',
        marker=dict(color=color, size=6),
        hovertemplate='Theoretical: %{x:.2f}<br>Sample: %{y:.2f}<extra></extra>'
    ))

    # Outliers
    fig.add_trace(go.Scatter(
        x=osm[is_outlier], y=osr[is_outlier],
        mode='markers',
        name='IQR Outliers',
        marker=dict(color=outlier_color, size=8, symbol=outlier_marker),
        hovertemplate='Outlier<br>Theoretical: %{x:.2f}<br>Sample: %{y:.2f}<extra></extra>'
    ))

    # Reference line
    line_x = np.array([osm.min(), osm.max()])
    line_y = slope * line_x + intercept
    fig.add_trace(go.Scatter(
        x=line_x, y=line_y,
        mode='lines',
        name=f'{dist_obj.name.title()} Fit',
        line=dict(color='red', width=2)
    ))

    fig.update_layout(
        title=title or f'QQ Plot ({column}) vs. {dist_obj.name.title()}',
        xaxis_title='Theoretical Quantiles',
        yaxis_title='Sample Quantiles',
        width=width,
        height=height,
        template='simple_white'
    )

    fig.show()

    # Normality tests
    shapiro_stat, shapiro_p = stats.shapiro(values)
    dagostino_stat, dagostino_p = stats.normaltest(values)
    ad_result = stats.anderson(values, dist='norm')
    ad_stat = ad_result.statistic
    ad_crit = ad_result.critical_values[2]

    # Summary table
    html = f"""
    <h4>Normality Tests for <code>{column}</code> (n={n})</h4>
    <table border="1" style="border-collapse:collapse; text-align:center;">
    <tr><th>Test</th><th>Statistic</th><th>p-value / Critical</th><th>Conclusion</th><th>Recommended for</th><th>Sensitive to</th></tr>
    <tr><td>Shapiro-Wilk</td><td>{shapiro_stat:.4f}</td><td>{shapiro_p:.4f}</td><td>{'Reject H₀ (Not Normal)' if shapiro_p < 0.05 else 'Possibly Normal'}</td><td>n ≤ 5000</td><td>General deviations</td></tr>
    <tr><td>D’Agostino-Pearson</td><td>{dagostino_stat:.4f}</td><td>{dagostino_p:.4f}</td><td>{'Reject H₀ (Not Normal)' if dagostino_p < 0.05 else 'Possibly Normal'}</td><td>n > 500</td><td>Skewness & Kurtosis</td></tr>
    <tr><td>Anderson-Darling</td><td>{ad_stat:.4f}</td><td>Crit: {ad_crit:.4f}</td><td>{'Reject H₀ (Not Normal)' if ad_stat > ad_crit else 'Possibly Normal'}</td><td>All sizes</td><td>Tail behavior</td></tr>
    </table>
    """
    display(HTML(html))

# Plot bar graph with plotly express
# plot_bar_series(df, x='operator', y='missed_calls', title="Llamadas Perdidas", color='red', x_label="Operador", y_label="Cantidad")
def plot_bar_plotlypx(data, x=None, y=None, title="", color='grey', x_label="", y_label="", rotation=0):
    """
    Generates a bar chart using Plotly Express from a DataFrame or Series.

    Parameters:
    ------------
    data : pd.DataFrame or pd.Series or list/np.array
    Data to plot
    x : str or None
    Column name or index to use as the X-axis. If None, use numeric indices.
    y : str or None
    Column name or series to use as the Y-axis. If None, use the values ​​from the series/data.
    title : str, optional
    Chart title
    color : str, optional
    Bar color (default: 'grey')
    x_label : str, optional
    Label for the X axis
    y_label : str, optional
    Label for the Y axis
    rotation : int, optional
    Rotation of the X axis labels (default: 0)
    """

    # Convert to pd.Series if it's a list or array
    if isinstance(data, (list, tuple, pd.Series, pd.Index)):
        data = pd.Series(data, name=y if y else "Y")

    # If it's a Series and y isn't passed, use values ​​from the series
    if isinstance(data, pd.Series):
        df = pd.DataFrame({'X': data.index if x is None else x, 'Y': data.values})
        x = 'X'
        y = 'Y'
    else:
        df = data.copy()
        if x is None:
            df['X'] = df.index
            x = 'X'
    if y is None:
        raise ValueError("You must specify the parameter 'y' if data is DataFrame.") 

    # Create figure 
    fig = px.bar(df, x=x, y=y, text=y, color_discrete_sequence=[color], title=title) 

    # Adjust layout 
    fig.update_traces(texttemplate='%{text:.2f}', textposition='outside') 
    fig.update_layout(xaxis_title=x_label, yaxis_title=y_label, xaxis_tickangle=rotation, plot_bgcolor='white', bargap=0.2, title_x=0.5) 

    # Improve axes visually 
    fig.update_xaxes(showgrid=False, linecolor='black') 
    fig.update_yaxes(showgrid=True, gridcolor='lightgrey', linecolor='black')

    fig.show()
    
# Plot stacked bars from a grouped dataframe
# colors_map = {'A': 'steelblue', 'B': 'salmon'}
# plot_stacked_bar_plotlypx(df=df_sales, x='Month', y='Sales', hue='Product', title='Ventas por Producto', xlabel='Mes', ylabel='Ventas', rotation=0, colors=colors_map)
def plot_stacked_bar_plotlypx(df, x, y, hue, title='', xlabel='', ylabel='', rotation=0, colors=None):
    """
    Generates a stacked bar chart with Plotly Express.

    Parameters:
    df : pd.DataFrame -> DataFrame ready to be plotted
    x : str -> X-axis column
    y : str -> Y-axis column
    hue : str -> column defining the color/stacking
    title : str -> chart title
    xlabel : str -> X-axis label
    ylabel : str -> Y-axis label
    rotation: int -> rotation of the X-axis labels
    colors : list or dict -> list of colors or mapping dictionary for hue values
    """
    # Convert hue to string and remove spaces
    df[hue] = df[hue].astype(str).str.strip()
    
    fig = px.bar(df, x=x, y=y, color=hue, 
                 text=y,           # opcional: mostrar valores encima de las barras
                 color_discrete_map=colors, title=title)
    
    fig.update_layout(xaxis_title=xlabel, yaxis_title=ylabel, xaxis_tickangle=rotation, barmode='stack', template='plotly_white'
    )
    
    # Mostrar figura
    fig.show()


### 🔁 __3. Data Loading__

In [3]:
df_telecom_data = pd.read_csv('../data/raw/telecom_dataset_new.csv', sep=',', header='infer', keep_default_na=False)
df_telecom_data

Unnamed: 0,user_id,date,direction,internal,operator_id,is_missed_call,calls_count,call_duration,total_call_duration
0,166377,2019-08-04 00:00:00+03:00,in,False,,True,2,0,4
1,166377,2019-08-05 00:00:00+03:00,out,True,880022,True,3,0,5
2,166377,2019-08-05 00:00:00+03:00,out,True,880020,True,1,0,1
3,166377,2019-08-05 00:00:00+03:00,out,True,880020,False,1,10,18
4,166377,2019-08-05 00:00:00+03:00,out,False,880022,True,3,0,25
...,...,...,...,...,...,...,...,...,...
53897,168606,2019-11-10 00:00:00+03:00,out,True,957922,True,1,0,38
53898,168606,2019-11-11 00:00:00+03:00,out,True,957922,False,2,479,501
53899,168606,2019-11-15 00:00:00+03:00,out,True,957922,False,4,3130,3190
53900,168606,2019-11-15 00:00:00+03:00,out,True,957922,False,4,3130,3190


In [4]:
df_telecom_clients = pd.read_csv('../data/raw/telecom_clients.csv', sep=',', header='infer', keep_default_na=False)
df_telecom_clients

Unnamed: 0,user_id,tariff_plan,date_start
0,166713,A,2019-08-15
1,166901,A,2019-08-23
2,168527,A,2019-10-29
3,167097,A,2019-09-01
4,168193,A,2019-10-16
...,...,...,...
727,166554,B,2019-08-08
728,166911,B,2019-08-23
729,167012,B,2019-08-28
730,166867,B,2019-08-22


##### `LSPL`

**_Note_:**

`"keep_default_na=False"` is used so that missing values ​​are later converted to "pd.NA". This is convenient because "pd.NA" provides:

- Consistency between data types
- Preservation of type integrity
- Cleaner logical operations
- Better control over missing data.

Since high performance and heavy computation are not required, it is appropriate to use "pd.NA".

### 🧹 __4. Data Cleanup__

In [5]:
df_telecom_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53902 entries, 0 to 53901
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   user_id              53902 non-null  int64 
 1   date                 53902 non-null  object
 2   direction            53902 non-null  object
 3   internal             53902 non-null  object
 4   operator_id          53902 non-null  object
 5   is_missed_call       53902 non-null  bool  
 6   calls_count          53902 non-null  int64 
 7   call_duration        53902 non-null  int64 
 8   total_call_duration  53902 non-null  int64 
dtypes: bool(1), int64(4), object(4)
memory usage: 3.3+ MB


In [6]:
df_telecom_clients.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 732 entries, 0 to 731
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   user_id      732 non-null    int64 
 1   tariff_plan  732 non-null    object
 2   date_start   732 non-null    object
dtypes: int64(1), object(2)
memory usage: 17.3+ KB


##### **4.1** Normalize String data

In [7]:
df_telecom_data = normalize_string_format(df_telecom_data, include=['date', 'direction', 'internal', 'operator_id'])

##### **4.2** Explicit Duplicate Removal

In [8]:
# Checking for explicit duplicate values ​​in a DataFrame
display(HTML(f"> Number of rows with <i>explicit duplicates</i> in [df_telecom_data]: <b>{df_telecom_data.duplicated().sum()}</b>"))

In [9]:
df_telecom_data = df_telecom_data.drop_duplicates()
df_telecom_data

Unnamed: 0,user_id,date,direction,internal,operator_id,is_missed_call,calls_count,call_duration,total_call_duration
0,166377,2019-08-04 00:00:00+03:00,in,False,,True,2,0,4
1,166377,2019-08-05 00:00:00+03:00,out,True,880022,True,3,0,5
2,166377,2019-08-05 00:00:00+03:00,out,True,880020,True,1,0,1
3,166377,2019-08-05 00:00:00+03:00,out,True,880020,False,1,10,18
4,166377,2019-08-05 00:00:00+03:00,out,False,880022,True,3,0,25
...,...,...,...,...,...,...,...,...,...
53896,168606,2019-11-10 00:00:00+03:00,out,True,957922,False,1,0,25
53897,168606,2019-11-10 00:00:00+03:00,out,True,957922,True,1,0,38
53898,168606,2019-11-11 00:00:00+03:00,out,True,957922,False,2,479,501
53899,168606,2019-11-15 00:00:00+03:00,out,True,957922,False,4,3130,3190


In [10]:
# Checking for explicit duplicate values ​​in a DataFrame
display(HTML(f"> Number of rows with <i>explicit duplicates</i> in [df_telecom_clients]: <b>{df_telecom_clients.duplicated().sum()}</b>"))

##### **4.3** Missing values processing

In [11]:
# Check missing values for df_telecom_data
check_existing_missing_values(df_telecom_data)




In [12]:
# Set missing values to pd.NA for df_telecom_data
df_telecom_data = replace_missing_values(df_telecom_data, include=['internal', 'operator_id'])
df_telecom_data

Unnamed: 0,user_id,date,direction,internal,operator_id,is_missed_call,calls_count,call_duration,total_call_duration
0,166377,2019-08-04 00:00:00+03:00,in,False,,True,2,0,4
1,166377,2019-08-05 00:00:00+03:00,out,True,880022,True,3,0,5
2,166377,2019-08-05 00:00:00+03:00,out,True,880020,True,1,0,1
3,166377,2019-08-05 00:00:00+03:00,out,True,880020,False,1,10,18
4,166377,2019-08-05 00:00:00+03:00,out,False,880022,True,3,0,25
...,...,...,...,...,...,...,...,...,...
53896,168606,2019-11-10 00:00:00+03:00,out,True,957922,False,1,0,25
53897,168606,2019-11-10 00:00:00+03:00,out,True,957922,True,1,0,38
53898,168606,2019-11-11 00:00:00+03:00,out,True,957922,False,2,479,501
53899,168606,2019-11-15 00:00:00+03:00,out,True,957922,False,4,3130,3190


In [13]:
# Show missing values rate for df_telecom_data
missing_values_rate(df_telecom_data, include=['internal', 'operator_id']) 

In [14]:
# Show missing values heatmap for df_telecom_data
missing_values_heatmap_px(df_telecom_data)

##### `LSPL`

**_Note_:**   
- Missing values for 'operator_id'   

    Since the goal is to compare real-world performance between operators, assigning null values ​​could:   
    - Artificially inflate a real-world operator's metrics.
    - Hide the real-world performance of existing operators.

    Delete rows where operator_id is null.   
    - This maintains data integrity for performance analysis.
    - Losing 15.22% of rows isn't ideal, but it's better than introducing bias.

- Missing values for 'internal'   

    Keep null values ​​as is in the "internal" column, since:   
    - their percentage is minimal,   
    - their analytical influence is marginal,   
    - and any imputation would be more harmful than beneficial.


In [15]:
df_telecom_data = df_telecom_data.dropna(subset=['operator_id']).reset_index(drop=True)
df_telecom_data

Unnamed: 0,user_id,date,direction,internal,operator_id,is_missed_call,calls_count,call_duration,total_call_duration
0,166377,2019-08-05 00:00:00+03:00,out,True,880022,True,3,0,5
1,166377,2019-08-05 00:00:00+03:00,out,True,880020,True,1,0,1
2,166377,2019-08-05 00:00:00+03:00,out,True,880020,False,1,10,18
3,166377,2019-08-05 00:00:00+03:00,out,False,880022,True,3,0,25
4,166377,2019-08-05 00:00:00+03:00,out,False,880020,False,2,3,29
...,...,...,...,...,...,...,...,...,...
41541,168606,2019-11-09 00:00:00+03:00,out,False,957922,False,4,551,593
41542,168606,2019-11-10 00:00:00+03:00,out,True,957922,False,1,0,25
41543,168606,2019-11-10 00:00:00+03:00,out,True,957922,True,1,0,38
41544,168606,2019-11-11 00:00:00+03:00,out,True,957922,False,2,479,501


In [16]:
# Check missing values for df_telecom_data
check_existing_missing_values(df_telecom_clients)




### 📦 __5. Casting Data types__

In [17]:
df_telecom_data

Unnamed: 0,user_id,date,direction,internal,operator_id,is_missed_call,calls_count,call_duration,total_call_duration
0,166377,2019-08-05 00:00:00+03:00,out,True,880022,True,3,0,5
1,166377,2019-08-05 00:00:00+03:00,out,True,880020,True,1,0,1
2,166377,2019-08-05 00:00:00+03:00,out,True,880020,False,1,10,18
3,166377,2019-08-05 00:00:00+03:00,out,False,880022,True,3,0,25
4,166377,2019-08-05 00:00:00+03:00,out,False,880020,False,2,3,29
...,...,...,...,...,...,...,...,...,...
41541,168606,2019-11-09 00:00:00+03:00,out,False,957922,False,4,551,593
41542,168606,2019-11-10 00:00:00+03:00,out,True,957922,False,1,0,25
41543,168606,2019-11-10 00:00:00+03:00,out,True,957922,True,1,0,38
41544,168606,2019-11-11 00:00:00+03:00,out,True,957922,False,2,479,501


In [18]:
# Cast into datetime
df_telecom_data = normalize_datetime(df_telecom_data, include=['date'], time_zone='Europe/Moscow')

In [19]:
# Cast into category
df_telecom_data['direction'] = df_telecom_data['direction'].astype('category')


In [20]:
# Cast into boolean
df_telecom_data['internal'] = (df_telecom_data['internal'].str.strip().map({'True': True, 'False': False}).astype(BooleanDtype()))


In [21]:
df_telecom_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41546 entries, 0 to 41545
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype   
---  ------               --------------  -----   
 0   user_id              41546 non-null  int64   
 1   date                 41546 non-null  object  
 2   direction            41546 non-null  category
 3   internal             41491 non-null  boolean 
 4   operator_id          41546 non-null  object  
 5   is_missed_call       41546 non-null  bool    
 6   calls_count          41546 non-null  int64   
 7   call_duration        41546 non-null  int64   
 8   total_call_duration  41546 non-null  int64   
dtypes: bool(1), boolean(1), category(1), int64(4), object(2)
memory usage: 2.1+ MB


In [22]:
df_telecom_clients

Unnamed: 0,user_id,tariff_plan,date_start
0,166713,A,2019-08-15
1,166901,A,2019-08-23
2,168527,A,2019-10-29
3,167097,A,2019-09-01
4,168193,A,2019-10-16
...,...,...,...
727,166554,B,2019-08-08
728,166911,B,2019-08-23
729,167012,B,2019-08-28
730,166867,B,2019-08-22


In [23]:
# Cast into category
df_telecom_clients['tariff_plan'] = df_telecom_clients['tariff_plan'].astype('category')

In [24]:
df_telecom_clients['date_start'] = pd.to_datetime(df_telecom_clients['date_start']).dt.date

In [25]:
df_telecom_clients.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 732 entries, 0 to 731
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   user_id      732 non-null    int64   
 1   tariff_plan  732 non-null    category
 2   date_start   732 non-null    object  
dtypes: category(1), int64(1), object(1)
memory usage: 12.4+ KB


### 📚 __6. EDA Descriptive Statistics__

##### **6.1** Descriptive statistics for quantitative data

In [26]:
df_telecom_data.drop(columns='user_id').describe()

Unnamed: 0,calls_count,call_duration,total_call_duration
count,41546.0,41546.0,41546.0
mean,16.900424,1009.769172,1321.592813
std,59.749373,4064.106117,4785.978633
min,1.0,0.0,0.0
25%,1.0,0.0,67.0
50%,4.0,106.0,288.0
75%,13.0,770.0,1104.0
max,4817.0,144395.0,166155.0


In [27]:
columns = ['calls_count', 'call_duration', 'total_call_duration']

for column in columns:
    evaluate_central_trend(df_telecom_data, column)










In [28]:
plot_hist_frequency_px(df_telecom_data['calls_count'], bins=1000, color='grey', title='Calls Count Distribution', 
                       xlabel='Calls amount', ylabel='Frequency', xticks_range=[0, 4820], xticks_step=100, rotation=45)

In [29]:
plot_horizontal_boxplot_plotlypx(df_telecom_data, 'calls_count', title='Calls Count Distribution and Outliers')

In [30]:
df_telecom_data['calls_count'].value_counts()

calls_count
1      10490
2       5158
3       3323
4       2561
5       1936
       ...  
237        1
288        1
360        1
685        1
279        1
Name: count, Length: 457, dtype: int64

In [31]:
plot_hist_frequency_px(df_telecom_data['call_duration'], bins=10000, color='grey', title='Call Duration Distribution', 
                       xlabel='Call duration (sec)', ylabel='Frequency', xticks_range=[0, 144500], xticks_step=5000, rotation=45)

In [32]:
plot_horizontal_boxplot_plotlypx(df_telecom_data, 'call_duration', title='Calls Duration Distribution and Otliers')

In [33]:
df_telecom_data['call_duration'].value_counts()

call_duration
0        13831
1          184
7          123
15         114
8          113
         ...  
12989        1
7986         1
20285        1
5157         1
5803         1
Name: count, Length: 5339, dtype: int64

In [34]:
plot_hist_frequency_px(df_telecom_data['total_call_duration'], bins=10000, color='grey', title='Total Call Duration Distribution', 
                       xlabel='Total Call duration (sec)', ylabel='Frequency', xticks_range=[0, 167000], xticks_step=5000, rotation=45)

In [35]:
plot_horizontal_boxplot_plotlypx(df_telecom_data, 'total_call_duration', title='Total Calls Duration Distribution and Otliers')

In [36]:
df_telecom_data['total_call_duration'].value_counts()

total_call_duration
0        863
1        247
2        226
60       209
18       196
        ... 
13497      1
6705       1
10547      1
10867      1
4435       1
Name: count, Length: 5986, dtype: int64

##### **6.2** Descriptive statistics for qualitative data

In [37]:
df_telecom_data.describe(include=['object', 'boolean', 'category'])

Unnamed: 0,date,direction,internal,operator_id,is_missed_call
count,41546,41546,41491,41546,41546
unique,118,2,2,1092,2
top,2019-11-25 00:00:00+03:00,out,False,901884,False
freq,988,28813,36161,323,27436


In [38]:
plot_cualitative_histogram_plotlypx(df_telecom_data['date'], color="grey", title="Call Activity Date Distribution", xlabel="Date", ylabel="Frequency", figsize=(1200, 600), xtick=None, ytick=None, 
                                    rotation_x=45, rotation_y=0, top_n=None)

In [39]:
plot_cualitative_histogram_plotlypx(df_telecom_data['direction'], color="grey", title="Call Activity Direction Distribution", xlabel="Direction", ylabel="Frequency", figsize=(1200, 600), xtick=None, ytick=None, 
                                    rotation_x=0, rotation_y=0, top_n=None)

In [40]:
plot_cualitative_histogram_plotlypx(df_telecom_data['internal'], color="grey", title="Call Activity Internal Distribution", xlabel="Internal", ylabel="Frequency", figsize=(1200, 600), xtick=None, ytick=None, 
                                    rotation_x=0, rotation_y=0, top_n=None)

In [41]:
plot_cualitative_histogram_plotlypx(df_telecom_data['operator_id'], color="grey", title="Call Activity Operator Distribution", xlabel="Operator ID", ylabel="Frequency", figsize=(1200, 600), xtick=None, ytick=None, 
                                    rotation_x=90, rotation_y=0, top_n=None)

In [42]:
plot_cualitative_histogram_plotlypx(df_telecom_data['is_missed_call'], color="grey", title="Call Activity Missed Calls Distribution", xlabel="Missed Calls", ylabel="Frequency", figsize=(1200, 600), xtick=None, ytick=None, 
                                    rotation_x=0, rotation_y=0, top_n=None)

In [43]:
df_telecom_clients.drop(columns='user_id').describe(include=['object', 'category'])

Unnamed: 0,tariff_plan,date_start
count,732,732
unique,3,73
top,C,2019-09-24
freq,395,24


In [44]:
plot_cualitative_histogram_plotlypx(df_telecom_clients['tariff_plan'], color="grey", title="Client Tariff Plan Distribution", xlabel="Tariff Plan", ylabel="Frequency", figsize=(1200, 600), xtick=None, ytick=None, 
                                    rotation_x=0, rotation_y=0, top_n=None)

In [45]:
plot_cualitative_histogram_plotlypx(df_telecom_clients['date_start'], color="grey", title="Client Start Date Plan Distribution", xlabel="Start Date", ylabel="Frequency", figsize=(1200, 600), xtick=None, ytick=None, 
                                    rotation_x=90, rotation_y=0, top_n=None)

### 🛠️ __7. Feature engineering__

In [46]:
# Get the Call wait time
df_telecom_data['call_wait_time'] = df_telecom_data['total_call_duration'] - df_telecom_data['call_duration'] 
df_telecom_data

Unnamed: 0,user_id,date,direction,internal,operator_id,is_missed_call,calls_count,call_duration,total_call_duration,call_wait_time
0,166377,2019-08-05 00:00:00+03:00,out,True,880022,True,3,0,5,5
1,166377,2019-08-05 00:00:00+03:00,out,True,880020,True,1,0,1,1
2,166377,2019-08-05 00:00:00+03:00,out,True,880020,False,1,10,18,8
3,166377,2019-08-05 00:00:00+03:00,out,False,880022,True,3,0,25,25
4,166377,2019-08-05 00:00:00+03:00,out,False,880020,False,2,3,29,26
...,...,...,...,...,...,...,...,...,...,...
41541,168606,2019-11-09 00:00:00+03:00,out,False,957922,False,4,551,593,42
41542,168606,2019-11-10 00:00:00+03:00,out,True,957922,False,1,0,25,25
41543,168606,2019-11-10 00:00:00+03:00,out,True,957922,True,1,0,38,38
41544,168606,2019-11-11 00:00:00+03:00,out,True,957922,False,2,479,501,22


In [47]:
df_telecom_data['date'] = pd.to_datetime(df_telecom_data['date']).dt.tz_localize(None)
df_telecom_data['call_month'] = df_telecom_data['date'].dt.month
df_telecom_data['call_day'] = df_telecom_data['date'].dt.day
df_telecom_data['date'] = df_telecom_data['date'].dt.normalize().dt.date
df_telecom_data

Unnamed: 0,user_id,date,direction,internal,operator_id,is_missed_call,calls_count,call_duration,total_call_duration,call_wait_time,call_month,call_day
0,166377,2019-08-05,out,True,880022,True,3,0,5,5,8,5
1,166377,2019-08-05,out,True,880020,True,1,0,1,1,8,5
2,166377,2019-08-05,out,True,880020,False,1,10,18,8,8,5
3,166377,2019-08-05,out,False,880022,True,3,0,25,25,8,5
4,166377,2019-08-05,out,False,880020,False,2,3,29,26,8,5
...,...,...,...,...,...,...,...,...,...,...,...,...
41541,168606,2019-11-09,out,False,957922,False,4,551,593,42,11,9
41542,168606,2019-11-10,out,True,957922,False,1,0,25,25,11,10
41543,168606,2019-11-10,out,True,957922,True,1,0,38,38,11,10
41544,168606,2019-11-11,out,True,957922,False,2,479,501,22,11,11


In [48]:
df_telecom_clients['date_start'] = pd.to_datetime(df_telecom_clients['date_start'])
df_telecom_clients['date_start_month'] = df_telecom_clients['date_start'].dt.month
df_telecom_clients['date_start_day'] = df_telecom_clients['date_start'].dt.day
df_telecom_clients['date_start'] = df_telecom_clients['date_start'].dt.normalize().dt.date
df_telecom_clients

Unnamed: 0,user_id,tariff_plan,date_start,date_start_month,date_start_day
0,166713,A,2019-08-15,8,15
1,166901,A,2019-08-23,8,23
2,168527,A,2019-10-29,10,29
3,167097,A,2019-09-01,9,1
4,168193,A,2019-10-16,10,16
...,...,...,...,...,...
727,166554,B,2019-08-08,8,8
728,166911,B,2019-08-23,8,23
729,167012,B,2019-08-28,8,28
730,166867,B,2019-08-22,8,22


In [49]:
df_telecom = df_telecom_data.merge(df_telecom_clients, on='user_id', how='left')
df_telecom

Unnamed: 0,user_id,date,direction,internal,operator_id,is_missed_call,calls_count,call_duration,total_call_duration,call_wait_time,call_month,call_day,tariff_plan,date_start,date_start_month,date_start_day
0,166377,2019-08-05,out,True,880022,True,3,0,5,5,8,5,B,2019-08-01,8,1
1,166377,2019-08-05,out,True,880020,True,1,0,1,1,8,5,B,2019-08-01,8,1
2,166377,2019-08-05,out,True,880020,False,1,10,18,8,8,5,B,2019-08-01,8,1
3,166377,2019-08-05,out,False,880022,True,3,0,25,25,8,5,B,2019-08-01,8,1
4,166377,2019-08-05,out,False,880020,False,2,3,29,26,8,5,B,2019-08-01,8,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41541,168606,2019-11-09,out,False,957922,False,4,551,593,42,11,9,C,2019-10-31,10,31
41542,168606,2019-11-10,out,True,957922,False,1,0,25,25,11,10,C,2019-10-31,10,31
41543,168606,2019-11-10,out,True,957922,True,1,0,38,38,11,10,C,2019-10-31,10,31
41544,168606,2019-11-11,out,True,957922,False,2,479,501,22,11,11,C,2019-10-31,10,31


### 🛠️ __8. Outliers__

In [50]:
# show Outliers for 'calls_count'
outlier_limit_bounds(df_telecom, 'calls_count', bound='both', clamp_zero=True)

Unnamed: 0,user_id,date,direction,internal,operator_id,is_missed_call,calls_count,call_duration,total_call_duration,call_wait_time,call_month,call_day,tariff_plan,date_start,date_start_month,date_start_day
50,166377,2019-08-20,out,False,880028,False,32,2975,3243,268,8,20,B,2019-08-01,8,1
68,166377,2019-08-23,out,False,880026,False,43,3435,3654,219,8,23,B,2019-08-01,8,1
144,166377,2019-09-10,out,False,880026,False,32,2451,2620,169,9,10,B,2019-08-01,8,1
179,166377,2019-09-17,out,False,880028,False,35,2507,2787,280,9,17,B,2019-08-01,8,1
282,166377,2019-10-11,out,False,880028,False,32,2835,3100,265,10,11,B,2019-08-01,8,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41398,168466,2019-11-14,out,False,952114,False,33,2079,2355,276,11,14,C,2019-10-28,10,28
41495,168601,2019-11-11,out,False,952914,True,42,0,689,689,11,11,C,2019-10-31,10,31
41497,168601,2019-11-11,out,False,952914,False,33,1806,2245,439,11,11,C,2019-10-31,10,31
41505,168601,2019-11-13,out,False,952914,False,50,3296,3893,597,11,13,C,2019-10-31,10,31


In [51]:
# show Outliers for 'call_duration'
outlier_limit_bounds(df_telecom, 'call_duration', bound='both', clamp_zero=True)

Unnamed: 0,user_id,date,direction,internal,operator_id,is_missed_call,calls_count,call_duration,total_call_duration,call_wait_time,call_month,call_day,tariff_plan,date_start,date_start_month,date_start_day
19,166377,2019-08-12,out,False,880028,False,20,2074,2191,117,8,12,B,2019-08-01,8,1
32,166377,2019-08-14,out,False,880028,False,18,2686,2782,96,8,14,B,2019-08-01,8,1
37,166377,2019-08-15,out,False,880028,False,19,2653,2779,126,8,15,B,2019-08-01,8,1
50,166377,2019-08-20,out,False,880028,False,32,2975,3243,268,8,20,B,2019-08-01,8,1
51,166377,2019-08-20,out,False,880026,False,19,2568,2758,190,8,20,B,2019-08-01,8,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41484,168601,2019-11-05,out,False,952914,False,20,2074,2361,287,11,5,C,2019-10-31,10,31
41486,168601,2019-11-06,out,False,952914,False,21,2999,3249,250,11,6,C,2019-10-31,10,31
41505,168601,2019-11-13,out,False,952914,False,50,3296,3893,597,11,13,C,2019-10-31,10,31
41510,168601,2019-11-14,out,False,952914,False,46,2614,3221,607,11,14,C,2019-10-31,10,31


In [52]:
# show Outliers for 'total_call_duration'
outlier_limit_bounds(df_telecom, 'total_call_duration', bound='both', clamp_zero=True)

Unnamed: 0,user_id,date,direction,internal,operator_id,is_missed_call,calls_count,call_duration,total_call_duration,call_wait_time,call_month,call_day,tariff_plan,date_start,date_start_month,date_start_day
32,166377,2019-08-14,out,False,880028,False,18,2686,2782,96,8,14,B,2019-08-01,8,1
37,166377,2019-08-15,out,False,880028,False,19,2653,2779,126,8,15,B,2019-08-01,8,1
50,166377,2019-08-20,out,False,880028,False,32,2975,3243,268,8,20,B,2019-08-01,8,1
51,166377,2019-08-20,out,False,880026,False,19,2568,2758,190,8,20,B,2019-08-01,8,1
57,166377,2019-08-21,out,False,880028,False,19,3496,3613,117,8,21,B,2019-08-01,8,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41395,168466,2019-11-13,out,False,952114,False,69,3756,4502,746,11,13,C,2019-10-28,10,28
41486,168601,2019-11-06,out,False,952914,False,21,2999,3249,250,11,6,C,2019-10-31,10,31
41505,168601,2019-11-13,out,False,952914,False,50,3296,3893,597,11,13,C,2019-10-31,10,31
41510,168601,2019-11-14,out,False,952914,False,46,2614,3221,607,11,14,C,2019-10-31,10,31


In [53]:
# show Outliers for 'call_wait_time'
outlier_limit_bounds(df_telecom, 'call_wait_time', bound='both', clamp_zero=True)

Unnamed: 0,user_id,date,direction,internal,operator_id,is_missed_call,calls_count,call_duration,total_call_duration,call_wait_time,call_month,call_day,tariff_plan,date_start,date_start_month,date_start_day
52,166377,2019-08-20,out,False,880028,True,13,0,572,572,8,20,B,2019-08-01,8,1
56,166377,2019-08-21,out,False,880028,True,22,0,717,717,8,21,B,2019-08-01,8,1
61,166377,2019-08-22,out,False,880028,True,18,0,710,710,8,22,B,2019-08-01,8,1
69,166377,2019-08-23,out,False,880028,True,24,0,597,597,8,23,B,2019-08-01,8,1
79,166377,2019-08-27,out,False,880028,True,27,0,983,983,8,27,B,2019-08-01,8,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41495,168601,2019-11-11,out,False,952914,True,42,0,689,689,11,11,C,2019-10-31,10,31
41505,168601,2019-11-13,out,False,952914,False,50,3296,3893,597,11,13,C,2019-10-31,10,31
41509,168601,2019-11-14,out,False,952914,True,31,0,545,545,11,14,C,2019-10-31,10,31
41510,168601,2019-11-14,out,False,952914,False,46,2614,3221,607,11,14,C,2019-10-31,10,31


`LSPL`

__Note:__   

For CallMeMaybe dataset, it will be defined two layers of cleanup before measuring efficiency:

- Business rule → eliminate invalid calls (0–19 seconds).
- Reasonableness rule → eliminate physically impossible values ​​(>2–3 hours).


In [54]:
# Get rid off invalid data and fit outliers
mask = (df_telecom['call_duration'] > 20) & (df_telecom['call_duration'] < 7200)
df_telecom = df_telecom.loc[mask, :]
df_telecom

Unnamed: 0,user_id,date,direction,internal,operator_id,is_missed_call,calls_count,call_duration,total_call_duration,call_wait_time,call_month,call_day,tariff_plan,date_start,date_start_month,date_start_day
9,166377,2019-08-06,out,False,880020,False,5,800,819,19,8,6,B,2019-08-01,8,1
11,166377,2019-08-07,out,False,880026,False,1,21,28,7,8,7,B,2019-08-01,8,1
12,166377,2019-08-07,out,False,880020,False,2,232,240,8,8,7,B,2019-08-01,8,1
14,166377,2019-08-08,out,False,880022,False,2,558,568,10,8,8,B,2019-08-01,8,1
16,166377,2019-08-09,out,False,880028,False,17,1603,1725,122,8,9,B,2019-08-01,8,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41539,168606,2019-11-08,out,False,957922,False,2,255,328,73,11,8,C,2019-10-31,10,31
41540,168606,2019-11-08,in,False,957922,False,2,686,705,19,11,8,C,2019-10-31,10,31
41541,168606,2019-11-09,out,False,957922,False,4,551,593,42,11,9,C,2019-10-31,10,31
41544,168606,2019-11-11,out,True,957922,False,2,479,501,22,11,11,C,2019-10-31,10,31


### 🛠️ __9. EDA - Processed Dataset__

In [55]:
plot_hist_frequency_px(df_telecom['calls_count'], bins=270, color='grey', title='Calls Count Distribution', 
                       xlabel='Calls count', ylabel='Frequency', xticks_range=[0, 280], xticks_step=20, rotation=45)

In [56]:
plot_qq_normality_tests_plotlypx(df_telecom, 'calls_count')

Test,Statistic,p-value / Critical,Conclusion,Recommended for,Sensitive to
Shapiro-Wilk,0.6048,0.0000,Reject H₀ (Not Normal),n ≤ 5000,General deviations
D’Agostino-Pearson,19343.4042,0.0000,Reject H₀ (Not Normal),n > 500,Skewness & Kurtosis
Anderson-Darling,2932.0176,Crit: 0.7870,Reject H₀ (Not Normal),All sizes,Tail behavior


In [57]:
plot_hist_frequency_px(df_telecom['call_duration'], bins=7200, color='grey', title='Call Duration Distribution', 
                       xlabel='Call duration (sec)', ylabel='Frequency', xticks_range=[0, 7200], xticks_step=100, rotation=90)

In [58]:
plot_qq_normality_tests_plotlypx(df_telecom, 'call_duration')

Test,Statistic,p-value / Critical,Conclusion,Recommended for,Sensitive to
Shapiro-Wilk,0.7305,0.0000,Reject H₀ (Not Normal),n ≤ 5000,General deviations
D’Agostino-Pearson,9992.3845,0.0000,Reject H₀ (Not Normal),n > 500,Skewness & Kurtosis
Anderson-Darling,2195.7753,Crit: 0.7870,Reject H₀ (Not Normal),All sizes,Tail behavior


In [59]:
plot_hist_frequency_px(df_telecom['total_call_duration'], bins=10200, color='grey', title='Total Call Duration Distribution', 
                       xlabel='Total call duration (sec)', ylabel='Frequency', xticks_range=[0, 10200], xticks_step=200, rotation=45)

In [60]:
plot_qq_normality_tests_plotlypx(df_telecom, 'total_call_duration')

Test,Statistic,p-value / Critical,Conclusion,Recommended for,Sensitive to
Shapiro-Wilk,0.7293,0.0000,Reject H₀ (Not Normal),n ≤ 5000,General deviations
D’Agostino-Pearson,10235.6478,0.0000,Reject H₀ (Not Normal),n > 500,Skewness & Kurtosis
Anderson-Darling,2205.5822,Crit: 0.7870,Reject H₀ (Not Normal),All sizes,Tail behavior


In [61]:
plot_hist_frequency_px(df_telecom['call_wait_time'], bins=4200, color='grey', title='Call Wait Time Distribution', 
                       xlabel='Call wait time (sec)', ylabel='Frequency', xticks_range=[0, 4200], xticks_step=50, rotation=90)

In [62]:
plot_qq_normality_tests_plotlypx(df_telecom, 'call_wait_time')

Test,Statistic,p-value / Critical,Conclusion,Recommended for,Sensitive to
Shapiro-Wilk,0.5504,0.0000,Reject H₀ (Not Normal),n ≤ 5000,General deviations
D’Agostino-Pearson,21694.9759,0.0000,Reject H₀ (Not Normal),n > 500,Skewness & Kurtosis
Anderson-Darling,3369.9053,Crit: 0.7870,Reject H₀ (Not Normal),All sizes,Tail behavior


### 📊 __10. EDA - Data Visualization__

##### **10.1** Operators Efficiency

In [63]:
# Operator efficiency
df_operator_efficiency = (df_telecom.groupby(['operator_id', 'direction']).agg(missed_calls=('is_missed_call', 'sum'), 
                                                                           calls=('calls_count', 'sum'), 
                                                                           wait_time=('call_wait_time', 'sum'))
                                                                          .reset_index())
df_operator_efficiency

Unnamed: 0,operator_id,direction,missed_calls,calls,wait_time
0,879896,in,0,55,466
1,879896,out,0,484,4686
2,879898,in,0,98,1622
3,879898,out,0,4566,50504
4,880020,in,0,6,43
...,...,...,...,...,...
2031,972410,out,0,40,456
2032,972412,in,0,1,25
2033,972412,out,0,35,443
2034,972460,in,0,0,0


In [64]:
df_pivot_op_ef = df_operator_efficiency.pivot_table(index='operator_id', columns='direction')
df_pivot_op_ef.columns = ['_'.join(col).strip() for col in df_pivot_op_ef.columns.values]
df_pivot_op_ef = df_pivot_op_ef.reset_index()
df_pivot_op_ef

Unnamed: 0,operator_id,calls_in,calls_out,missed_calls_in,missed_calls_out,wait_time_in,wait_time_out
0,879896,55.0,484.0,0.0,0.0,466.0,4686.0
1,879898,98.0,4566.0,0.0,0.0,1622.0,50504.0
2,880020,6.0,13.0,0.0,0.0,43.0,56.0
3,880022,6.0,84.0,0.0,0.0,81.0,556.0
4,880026,23.0,1560.0,0.0,0.0,137.0,9784.0
...,...,...,...,...,...,...,...
1013,971354,6.0,0.0,0.0,0.0,84.0,0.0
1014,972408,0.0,4.0,0.0,0.0,0.0,15.0
1015,972410,0.0,40.0,0.0,0.0,0.0,456.0
1016,972412,1.0,35.0,0.0,0.0,25.0,443.0


`LSPL`

__Note:__ Efficiency calculation

Formula proposal (main idea)

We want an efficiency score 𝐸 between 0 and 100, where 100 = very efficient.   
Based on three normalized components:
- Missed Rate (𝑀) = proportion of incoming calls missed per operator. (0..1) — the higher the rate, the worse.
- Avg Wait Time (𝑊) = average wait time for incoming calls (seconds). We normalize it to (0..1) using a reasonable threshold. — the higher the rate, the worse.
- Outgoing Activity (𝑂) = relative outgoing activity (normalized outgoing call counts). Here, more is better.

Construction (a simple, interpretive form): $S = w_M \cdot M_{norm} + w_W \cdot W_{norm} + w_O \cdot (1 - O_{norm})$

where $w_M + w_W + w_O = 1$

Finally, the efficiency: $E = (1 - S) \times 100$

Thus: if the operator has high M_norm or W_norm or low O_norm, S rises and E falls.

Weights: Defaults (0.45, 0.35, 0.20) give greater weight to missed calls, then to waiting calls, and less weight to outgoing calls. Adjust according to business.

wait_cap: Value to normalize avg_wait. If you set it to 300 s (5 min), an operator with a 300-s wait time will have W_norm=1. Change according to your context.

In [65]:
# Missed rate (M) calculation
df_pivot_op_ef['missed_rate'] = np.where(df_pivot_op_ef['calls_in'] > 0, df_pivot_op_ef['missed_calls_in'] / df_pivot_op_ef['calls_in'], 0.0)

In [66]:
# Average Wait Time (W) calculation
df_pivot_op_ef['avg_wait'] = np.where(df_pivot_op_ef['calls_in'] > 0, df_pivot_op_ef['wait_time_in'] / df_pivot_op_ef['calls_in'], 0.0)

In [67]:
wait_cap = df_pivot_op_ef['avg_wait'].quantile(0.90)  # p90
wait_cap

np.float64(28.38256243894347)

In [68]:
# Normalizations

outgoing_cap_quantile = 0.95

# M_norm en 0..1 (missed_rate scale 0..1)
df_pivot_op_ef['M_norm'] = df_pivot_op_ef['missed_rate'].clip(0,1)

# W_norm: avg_wait / wait_cap (cap)
df_pivot_op_ef['W_norm'] = (df_pivot_op_ef['avg_wait'].clip(0, wait_cap) / float(wait_cap))

# O_norm: robust normalization por quantile to avoid extreme outliers
if df_pivot_op_ef['calls_out'].max() == 0:
    df_pivot_op_ef['O_norm'] = 0.0
else:
    cap = max(df_pivot_op_ef['calls_out'].quantile(outgoing_cap_quantile), 1.0)
    df_pivot_op_ef['O_norm'] = (df_pivot_op_ef['calls_out'].clip(0, cap) / cap)

In [69]:
# Weights
w_M=0.66
w_W=0.22
w_O=0.12

# S: higher = worse
df_pivot_op_ef['S'] = (w_M * df_pivot_op_ef['M_norm']) + (w_W * df_pivot_op_ef['W_norm']) + (w_O * (1 - df_pivot_op_ef['O_norm']))

In [70]:
df_pivot_op_ef['efficiency_score'] = round(((1 - df_pivot_op_ef['S']) * 100).clip(0, 100), 3)
df_pivot_op_ef

Unnamed: 0,operator_id,calls_in,calls_out,missed_calls_in,missed_calls_out,wait_time_in,wait_time_out,missed_rate,avg_wait,M_norm,W_norm,O_norm,S,efficiency_score
0,879896,55.0,484.0,0.0,0.0,466.0,4686.0,0.0,8.472727,0.0,0.298519,0.539486,0.120936,87.906
1,879898,98.0,4566.0,0.0,0.0,1622.0,50504.0,0.0,16.551020,0.0,0.583140,1.000000,0.128291,87.171
2,880020,6.0,13.0,0.0,0.0,43.0,56.0,0.0,7.166667,0.0,0.252502,0.014490,0.173812,82.619
3,880022,6.0,84.0,0.0,0.0,81.0,556.0,0.0,13.500000,0.0,0.475644,0.093630,0.213406,78.659
4,880026,23.0,1560.0,0.0,0.0,137.0,9784.0,0.0,5.956522,0.0,0.209866,1.000000,0.046170,95.383
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1013,971354,6.0,0.0,0.0,0.0,84.0,0.0,0.0,14.000000,0.0,0.493261,0.000000,0.228517,77.148
1014,972408,0.0,4.0,0.0,0.0,0.0,15.0,0.0,0.000000,0.0,0.000000,0.004459,0.119465,88.054
1015,972410,0.0,40.0,0.0,0.0,0.0,456.0,0.0,0.000000,0.0,0.000000,0.044586,0.114650,88.535
1016,972412,1.0,35.0,0.0,0.0,25.0,443.0,0.0,25.000000,0.0,0.880823,0.039012,0.309099,69.090


In [71]:
plot_bar_plotlypx(df_pivot_op_ef.sort_values(by='efficiency_score', ascending=True).head(50), x='operator_id', y='efficiency_score', 
                  title="Top 50 Worst Operators", color='grey', x_label="Operators", y_label="Efficiency", rotation=90)

##### **10.2** Operators Efficiency (operator vs missed calls - direction)

In [72]:
df_operator_direction_mcalls = df_telecom.groupby(['operator_id', 'direction'])['is_missed_call'].sum().reset_index()
df_operator_direction_mcalls = df_operator_direction_mcalls.rename(columns={'is_missed_call': 'missed_calls'})
df_operator_direction_mcalls

Unnamed: 0,operator_id,direction,missed_calls
0,879896,in,0
1,879896,out,0
2,879898,in,0
3,879898,out,0
4,880020,in,0
...,...,...,...
2031,972410,out,0
2032,972412,in,0
2033,972412,out,0
2034,972460,in,0


In [73]:
colors_map = {'in': 'grey', 'out': 'darkgrey'}
plot_stacked_bar_plotlypx(df_operator_direction_mcalls.sort_values(by='missed_calls', ascending=False).head(50), x='operator_id', y='missed_calls', hue='direction', 
                          title='Top 50 Missed Calls by Operator and direction', xlabel='Operator', ylabel='Missed Calls', rotation=90, colors=colors_map)


##### **10.3** Operators Efficiency (operator vs missed calls - internal)

In [74]:
df_operator_internal_mcalls = df_telecom.groupby(['operator_id', 'internal'])['is_missed_call'].sum().reset_index()
df_operator_internal_mcalls = df_operator_internal_mcalls.rename(columns={'is_missed_call': 'missed_calls'})
df_operator_internal_mcalls

Unnamed: 0,operator_id,internal,missed_calls
0,879896,False,0
1,879896,True,0
2,879898,False,0
3,879898,True,0
4,880020,False,0
...,...,...,...
1286,971354,False,0
1287,972408,False,0
1288,972410,False,0
1289,972412,False,0


In [75]:
colors_map = {'False': 'darkgrey', 'True': 'grey'}
plot_stacked_bar_plotlypx(df_operator_internal_mcalls.sort_values(by='missed_calls', ascending=False).head(50), x='operator_id', y='missed_calls', hue='internal', 
                          title='Top 50 Missed Calls by Operator and internal', xlabel='Operator', ylabel='Missed Calls', rotation=90, colors=colors_map)

##### **10.4** Operators Efficiency (operator vs wait time - internal)

In [76]:
df_operator_internal_wtime = df_telecom.loc[(df_telecom['is_missed_call'] == False), :]
df_operator_internal_wtime = df_operator_internal_wtime.groupby(['operator_id', 'internal'])['call_wait_time'].sum().reset_index()
df_operator_internal_wtime

Unnamed: 0,operator_id,internal,call_wait_time
0,879896,False,4987
1,879896,True,165
2,879898,False,52103
3,879898,True,23
4,880020,False,99
...,...,...,...
1284,971354,False,84
1285,972408,False,15
1286,972410,False,456
1287,972412,False,468


In [77]:
colors_map = {'False': 'darkgrey', 'True': 'grey'}
plot_stacked_bar_plotlypx(df_operator_internal_wtime.sort_values(by='call_wait_time', ascending=False).head(50), x='operator_id', y='call_wait_time', hue='internal', 
                          title='Top 50 Wait Time by Operator and internal', xlabel='Operator', ylabel='Wait time (sec)', rotation=90, colors=colors_map)

##### **10.5** Operators Efficiency (operator vs calls count - direction)

In [78]:
df_operator_ccount_direction= df_telecom.loc[(df_telecom['is_missed_call'] == False), :]
df_operator_ccount_direction = df_operator_ccount_direction.groupby(['operator_id', 'direction'])['calls_count'].sum().reset_index()
df_operator_ccount_direction

Unnamed: 0,operator_id,direction,calls_count
0,879896,in,55
1,879896,out,484
2,879898,in,98
3,879898,out,4566
4,880020,in,6
...,...,...,...
2027,972410,out,40
2028,972412,in,1
2029,972412,out,35
2030,972460,in,0


In [79]:
colors_map = {'in': 'grey', 'out': 'darkgrey'}
plot_stacked_bar_plotlypx(df_operator_ccount_direction.sort_values(by='calls_count', ascending=False).head(50), x='operator_id', y='calls_count', hue='direction', 
                          title='Top 50 Calls Count by Operator and direction', xlabel='Operator', ylabel='Calls', rotation=90, colors=colors_map)

### 🧪 __11. Inferential Statistics__

##### **11.1** Hypothesis: The average wait time (wait_time) is equal between incoming and outgoing calls.

t-test will be used to compare means of two independent groups and usually the population deviation is not known — Welch will be tested if the variances differ.

In [80]:
# Get and evaluate series normality
incoming_calls_wt = df_telecom.loc[(df_telecom['direction'] == 'in'), ['call_wait_time']]
plot_qq_normality_tests_plotlypx(incoming_calls_wt, 'call_wait_time', dist='norm', dist_params=None, title='Call Wait Time - Direction: in', color='grey', outlier_color='crimson', outlier_marker='x', width=1200, height=600)


Test,Statistic,p-value / Critical,Conclusion,Recommended for,Sensitive to
Shapiro-Wilk,0.5212,0.0000,Reject H₀ (Not Normal),n ≤ 5000,General deviations
D’Agostino-Pearson,11408.5898,0.0000,Reject H₀ (Not Normal),n > 500,Skewness & Kurtosis
Anderson-Darling,1510.3051,Crit: 0.7870,Reject H₀ (Not Normal),All sizes,Tail behavior


In [81]:
# Get and evaluate series normality
outgoing_calls_wt = df_telecom.loc[(df_telecom['direction'] == 'out'), ['call_wait_time']]
plot_qq_normality_tests_plotlypx(outgoing_calls_wt, 'call_wait_time', dist='norm', dist_params=None, title='Call Wait Time - Direction: out', color='grey', outlier_color='crimson', outlier_marker='x', width=1200, height=600)


Test,Statistic,p-value / Critical,Conclusion,Recommended for,Sensitive to
Shapiro-Wilk,0.608,0.0000,Reject H₀ (Not Normal),n ≤ 5000,General deviations
D’Agostino-Pearson,10308.2114,0.0000,Reject H₀ (Not Normal),n > 500,Skewness & Kurtosis
Anderson-Darling,1608.8561,Crit: 0.7870,Reject H₀ (Not Normal),All sizes,Tail behavior


In [82]:
# 1. Hypotheses H₀, H₁
# H₀: The average wait time (wait_time) is the same between incoming and outgoing calls.
# H₁: The average wait time differs between incoming and outgoing calls.

# 2. Specify Significance or Confidence
# alpha = 5%
# confidence = 95%

alpha = 0.05

# Levene's test, to ensure that the variances of different samples are equal.
# Preventing Tests Like ANOVA and t-Tests from Being Incorrect
# And that directly defines the equal_var parameter in scipy.stats.ttest_ind()

levene_stat, levene_p = stats.levene(incoming_calls_wt, outgoing_calls_wt)
levene_stat, levene_p = levene_stat.item(), levene_p.item()

display(HTML(f"> <b>Levene's Test</b> - Statistic: {levene_stat:.4f}, P-value: {levene_p:.4f}"))

# Determining Equality of Variances
if levene_p < 0.05:
    equal_var = False
    display(HTML("> <i>Null Hypothesis H₀ is rejected: the variances are different → use equal_var=False</i>"))
else:
    equal_var = True
    display(HTML("> <i>Null Hypothesis H₀ is not rejected: the variances are equal → use equal_var=True</i>"))

In [83]:
# 3. Calculate critical and test values, define acceptance and rejection zones

t_stat, p_val = ttest_ind(incoming_calls_wt, outgoing_calls_wt, equal_var=False)
t_stat, p_val = t_stat.item(), p_val.item()

display(HTML(f"> T-statistic: <b>{t_stat:.15f}</b>"))
display(HTML(f"> P-value: <b>{p_val:.15f}</b>"))

# 4. Decision and Conclusion

if p_val < alpha:
    display(HTML("> The <i>'null hypothesis' is rejected</i>, <b>not rejecting 'alternative hypothesis'</b>, because there is sufficient statistical evidence to affirm that <b>The average wait time between incoming and outgoing calls differ significantly.</b>"))
else:
    display(HTML("> The <i>'null hypothesis' is not rejected</i>, <b>not rejecting 'null hypothesis'</b>, indicating insufficient evidence to conclude that <b>The average wait time between incoming and outgoing calls differ significantly</b>."))

In [84]:
# Histogram for incoming calls wait time
plot_hist_frequency_px(incoming_calls_wt['call_wait_time'], bins=4000, color='grey', title='Direction: in Calls Wait Time Histogram', xlabel='Wait Time (sec)', ylabel='Frequency', 
                       xticks_range=[0, 4000], xticks_step=40, rotation=90)

In [85]:
# Histogram for outgoing wait time
plot_hist_frequency_px(outgoing_calls_wt['call_wait_time'], bins=4200, color='grey', title='Direction: out  Calls Wait Time Histogram', xlabel='Waittime (sec)', ylabel='Frequency', 
                       xticks_range=(0, 4200), xticks_step=40, rotation=90)

##### **11.2** Hypothesis: The proportion of missed calls is equal between tariff A and tariff C.

z-test of proportions: two proportions are compared and normally the sizes are large (n p and n (1−p) ≥ 5), which allows normal approximation.

In [86]:
# Get data for proportions
count = np.array([
    ((df_telecom['tariff_plan'] == 'A') & (df_telecom['is_missed_call'] == True)).sum(),
    ((df_telecom['tariff_plan'] == 'C') & (df_telecom['is_missed_call'] == True)).sum()
])
count

array([101,  18])

In [87]:
nobs = np.array([
    (df_telecom['tariff_plan'] == 'A').sum(),
    (df_telecom['tariff_plan'] == 'C').sum()
])
nobs

array([6761, 9165])

In [None]:
# 1. Hypotheses H₀, H₁
# H₀: The proportion of dropped calls is the same between tariff A and tariff C.
# H₁: The proportion of dropped calls differ between tariff A and tariff C.

# 2. Specify Significance or Confidence
# alpha = 5%
# confidence = 95%

alpha = 0.05

# 3. Calculate critical and test values, define acceptance and rejection zones
stats, p_value = proportions_ztest(count, nobs, alternative="two-sided")
display(HTML(f"> Z-statistic: {stats}"))
display(HTML(f"> p-value: {p_value}"))



In [90]:
prop_A = count[0]/nobs[0]
prop_C   = count[1]/nobs[1]
display(HTML(f"> Missed Calls Proportion for Tariff <i>A</i>: <b>{prop_A}</b>, Missed Calls Proportion for Tariff <i>C</i>: <b>{prop_C}</b>"))

In [None]:
# 4. Decision and Conclusion
if p_value <= 0.05:
    display(HTML(f"> Null Hypothesis (<i>H₀</i>) is <b>rejected</b>, meaning there is enough statistical evidence that proportion of dropped calls is between tariff A and tariff C <b>differ</b>."))
else:
    display(HTML(f"> Null Hypothesis (<i>H₀</i>) is <b>not rejected</b>, meaning there is <b>not enough statistical</b> evidence that proportion of dropped calls is between tariff A and tariff C <b>differ</b>."))

In [93]:
plans = ['A', 'C']
pie_data = {}

for plan in plans:
    df_plan = df_telecom[df_telecom['tariff_plan'] == plan]
    counts = df_plan['is_missed_call'].value_counts()  # True = perdida, False = contestada
    pie_data[plan] = pd.DataFrame({'Call Status': ['Missed', 'Answered'], 'Count': [counts.get(True, 0), counts.get(False, 0)]})

In [95]:
for plan in plans:
    fig = px.pie(
        pie_data[plan],
        values='Count',
        names='Call Status',
        title=f'Proportion of Missed vs Answered Calls - Plan {plan}',
        color='Call Status',
        color_discrete_map={'Missed':'lightgrey','Answered':'grey'}
    )
    fig.show()

##### **11.3** Hypothesis: The average number of missed calls is the same on all days of the week.

ANOVA test: compares means between 3 or more independent groups.

In [103]:
df_answered = df_telecom.loc[(df_telecom['is_missed_call'] == False), :]
groups = [df_answered[df_answered['call_day'] == day]['call_duration'] for day in df_answered['call_day'].unique()]
groups

[9         800
 126      1243
 128      1304
 130        48
 132      1045
          ... 
 41314     421
 41315    1058
 41455      52
 41485     645
 41486    2999
 Name: call_duration, Length: 654, dtype: int64,
 11         21
 12        232
 261      2024
 266       494
 382       139
          ... 
 41417      68
 41440     290
 41442     256
 41488    1686
 41490     648
 Name: call_duration, Length: 784, dtype: int64,
 14        558
 269      2470
 271      2003
 392        66
 394      1925
          ... 
 41443      81
 41492     374
 41493     670
 41539     255
 41540     686
 Name: call_duration, Length: 790, dtype: int64,
 16       1603
 135       234
 136      1479
 138        54
 139        32
          ... 
 40502    2191
 40504    1990
 41323     155
 41444     556
 41541     551
 Name: call_duration, Length: 606, dtype: int64,
 19       2074
 21        407
 155        69
 157      1030
 158        45
          ... 
 41468      78
 41498    1682
 41499     161
 41502   

In [105]:
# 1. Hypotheses H₀, H₁
# H₀: The average number of call duration is the same on all days of the week
# H₁: At least one day has a different average

f_stat, p_val = f_oneway(*groups)
display(HTML(f"> F-statistic: <b>{f_stat:.4f}</b>, p-value: <b>{p_val:.4e}</b>"))

In [106]:
if p_value <= 0.05:
    display(HTML(f"> Null Hypothesis (<i>H₀</i>) is <b>rejected</b>, meaning there is enough statistical evidence that the average number of call duration on at least one days of the week <b>differ</b>."))
else:
    display(HTML(f"> Null Hypothesis (<i>H₀</i>) is <b>not rejected</b>, meaning there is <b>not enough statistical</b> evidence that the average number of call duration is the same on all days of the week <b>differ</b>."))

In [101]:
mean_duration_per_day = df_telecom.groupby('call_day')['call_duration'].mean().sort_values(ascending=False).reset_index()
plot_bar_plotlypx(mean_duration_per_day, x='call_day', y='call_duration', title="Mean Call duration by day", color='grey', x_label="Day", y_label="Call duration (sec)", rotation=0)

### 📊 __12. Dashboard__

In [None]:
# Initialize the app
external_stylesheets = ['https://stackpath.bootstrapcdn.com/bootstrap/4.5.2/css/bootstrap.min.css']
app = dash.Dash(__name__, external_stylesheets=external_stylesheets, compress=False)

# Dropdown options
unique_directions = pd.Series(df_telecom['direction'].unique()).astype(str).tolist()
options_direction = [{"label": str(val), "value": str(val)} for val in unique_directions]

unique_internal = pd.Series(df_telecom['internal'].unique()).astype(str).tolist()
options_internal = [{"label": str(val), "value": str(val)} for val in unique_internal]

# Layout
app.layout = html.Div([
    html.H1("CallMeMaybe Dashboard"),
    
    # Row: header on the left, dropdown on the right
    html.Div([
        
        html.Div(
            html.H5("Call Duration", style={'margin': 0}), 
            style={'flex': '1', 'border': '2px solid black', 'padding': '3px', 'margin': '12px'}
        ),
        
        html.Div(
            dcc.Dropdown(
                id='filter_direction',                         
                options=options_direction,
                value=options_direction[0]['value'] if options_direction else None,  # default value
                clearable=False,
                style={'width': '1024px'}
            )            
        ),
    ], style={'display': 'flex', 'alignItems': 'center', 'justifyContent': 'space-center', 'padding': '3px'}),

    html.Div([
        dcc.Graph(
            id='hist_call_duration', style={'flex': '2', 'margin-right': '10px'}  # 2 parts from total space
            ),
        dcc.Graph(
            id='pie_internal_external', style={'flex': '1'}  # 1 part from total space
            )
    ], style={'display': 'flex', 'justify-content': 'space-around', 'border': '2px solid black', 'padding': '3px', 'margin': '12px'}),
    
    html.Div([
        
        html.Div(
            html.H5("Number of calls per day", 
            style={'margin': 0}), style={'flex': '1', 'border': '2px solid black', 'padding': '3px', 'margin': '12px'}
        ),
        
        html.Div(
            dcc.Checklist(
                id="filter_internal",
                options=options_internal,
                value=[opt['value'] for opt in options_internal],  # lists all
                inline=True,
                style={'width': '1024px'}
            ),
            style={'border': '1px solid black', 'padding': '3px', 'margin': '12px'}
        ),  
    ], style={'display': 'flex', 'alignItems': 'center', 'justifyContent': 'space-center', 'marginBottom': '12px'}),
     
    html.Div([
        dcc.Graph(id='hist_calls_by_day', style={'margin-bottom': '12px'}), 
        dcc.Graph(id='pie_internal_external_day')
    ], style={
               'display': 'flex',
               'flexDirection': 'column',  # place the elements in a column
               'justify-content': 'flex-start',  # vertical alignment
               'align-items': 'stretch',  # complete width for each graph
               'border': '2px solid black', 'padding': '3px', 'margin': '12px'
             })
], style={
        'border': '2px solid black',   # thickness, type and color of the border
        'padding': '3px',              # internal space so that it does not touch the edge
        'margin': '12px'        # separation from other containers
    })

# Callbacks
@app.callback( 
    Output('hist_call_duration', 'figure'), 
    Output('pie_internal_external', 'figure'), 
    Input('filter_direction', 'value')
)

def update_hist_pie_duration(selected_directions): 
    # selected_directions es string (e.g. "in" o "out") o None
    if selected_directions is None:
        filtered = df_telecom.copy()
    else:
        filtered = df_telecom[df_telecom['direction'].astype(str) == str(selected_directions)]

    if filtered.empty:
        return px.histogram(title="No data"), px.pie(title="No data")

    # Call duration histogram 
    fig_hist = px.histogram( 
        filtered, 
        x='call_duration', 
        nbins=720, 
        color_discrete_sequence=['grey'], 
        title=f'Call Duration — direction = {selected_directions}'
    ) 
    
    pie_df = filtered.copy()
    pie_df['internal_label'] = pie_df['internal'].map(lambda x: str(x) if pd.notna(x) else 'Missing')

    # Pie chart internal/external calls 
    fig_pie = px.pie(
        pie_df, 
        names='internal_label', 
        title='Internal vs External Participation',
        color='internal_label',
        color_discrete_map={'True': 'lightgrey', 'False': 'grey', 'Missing': 'black'}
    )

    return fig_hist, fig_pie

@app.callback( 
    Output('hist_calls_by_day', 'figure'), 
    Output('pie_internal_external_day', 'figure'), 
    Input('filter_internal', 'value')
)

def update_hist_pie_calls(selected_internal): 
    if not selected_internal: # if there is no selection
        filtered = df_telecom.copy()
    else:
        # Convert to string to match the checklist values
        filtered = df_telecom[df_telecom['internal'].astype(str).isin(selected_internal)]

    if filtered.empty:
        return px.bar(title="No data"), px.pie(title="No data")

    # Histogram calls per day 
    calls_per_day = filtered.groupby('date', as_index=False)['calls_count'].sum()
    fig_hist_day = px.bar(calls_per_day.sort_values('date'), 
                          x='date', 
                          y='calls_count',
                          color_discrete_sequence=['grey'],
                          title=f'Number of calls per day — internal={selected_internal}')
    # Pie chart calls
    pie_df = filtered.copy()
    pie_df['internal_label'] = pie_df['internal'].map(lambda x: str(x) if pd.notna(x) else 'Missing')
    fig_pie_day = px.pie(pie_df, 
                         names='internal_label', 
                         title='Internal vs External Participation (subset)',
                         color='internal_label',
                         color_discrete_map={'True': 'lightgrey', 'False': 'grey', 'Missing': 'black'}
    )

    return fig_hist_day, fig_pie_day

if __name__ == '__main__':
    # app.run_server(host='0.0.0.0', port=3000)
    app.run(port=3000, debug=True, jupyter_mode='inline')      # embedded in cell