 **To**: Megaline Management

 **From**: Junior Data Scientist

 **Date**: August 8, 2025
 
 **Subject**: Preliminary Analysis of Surf and Ultimate Prepaid Plans

# Introduction
This report presents a preliminary analysis of Megaline's 'Surf' and 'Ultimate' prepaid plans. The primary goal is to determine which of these two plans generates more revenue. By analyzing the behavior of a sample of 500 customers from 2018, we can gain insights that will help the commercial department make informed decisions about the allocation of the advertising budget.

# 1. Data Loading and Initial Exploration
First, let's load all the necessary libraries and the datasets to get a first look at the data we're working with.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats as st
import os

try:
    calls = pd.read_csv('data/megaline_calls.csv')
    internet = pd.read_csv('data/megaline_internet.csv')
    messages = pd.read_csv('data/megaline_messages.csv')
    plans = pd.read_csv('data/megaline_plans.csv')
    users = pd.read_csv('data/megaline_users.csv')
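except FileNotFoundError as e:
    # The files are read from the 'data/' subdirectory, so point the reader there.
    print(f"Error: {e}. Make sure all CSV files are in the 'data/' directory.")
    raise  # Stop here; the DataFrames used below would otherwise be undefined.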

print("Initial Data Info:")
users.info()
calls.info()
messages.info()
internet.info()
plans.info()

Initial Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   user_id     500 non-null    int64 
 1   first_name  500 non-null    object
 2   last_name   500 non-null    object
 3   age         500 non-null    int64 
 4   city        500 non-null    object
 5   reg_date    500 non-null    object
 6   plan        500 non-null    object
 7   churn_date  34 non-null     object
dtypes: int64(2), object(6)
memory usage: 31.4+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 137735 entries, 0 to 137734
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   id         137735 non-null  object 
 1   user_id    137735 non-null  int64  
 2   call_date  137735 non-null  object 
 3   duration   137735 non-null  float64
dtypes: float64(1), int64(1), object(2)
memory usage: 4.2+ MB
[remaining output truncated]

### **Initial Observations**
- Data Types: Several date columns (reg_date, churn_date, call_date, message_date, session_date) are currently stored as object data types. These will need to be converted to a proper datetime format for time-based analysis.

- Missing Values: The churn_date column in the users table is populated for only 34 of the 500 users; the other 466 values are missing. The project description states that if the value is missing, the plan was still in use at the time of data extraction. This is expected and doesn't represent an error.

- Call Duration: The duration in the calls table is a float. The plan details specify that call durations are rounded up to the next whole minute for billing. I also note the presence of zero-duration calls, which could represent missed or dropped calls. These still consume resources to connect, so they should be investigated before deciding whether to keep them in the analysis.

- Data Volume: Internet usage (mb_used) is given in megabytes. For billing, the total monthly data usage is rounded up to the next gigabyte. This will need to be calculated.
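
The observations above translate into a few preprocessing steps. The following is a minimal sketch of how they might look, not the final implementation. The helper names (`parse_dates`, `add_billed_minutes`, `monthly_gb_billed`) and the small demo tables are hypothetical, and the conversion factor of 1024 MB per GB is my assumption, since the text above does not state it.

```python
import numpy as np
import pandas as pd

MB_PER_GB = 1024  # assumption: 1 GB = 1024 MB (not stated in the report text)


def parse_dates(df, columns):
    """Return a copy of df with the listed columns converted to datetime."""
    out = df.copy()
    for col in columns:
        # Missing values (e.g. an empty churn_date) become NaT.
        out[col] = pd.to_datetime(out[col])
    return out


def add_billed_minutes(calls):
    """Round each call's duration up to the next whole minute."""
    out = calls.copy()
    out["billed_minutes"] = np.ceil(out["duration"]).astype(int)
    return out


def monthly_gb_billed(internet):
    """Sum mb_used per user and month, then round the total up to whole GB."""
    out = internet.copy()
    out["month"] = out["session_date"].dt.to_period("M")
    total_mb = out.groupby(["user_id", "month"])["mb_used"].sum()
    gb = np.ceil(total_mb / MB_PER_GB).astype(int)
    return gb.rename("gb_billed").reset_index()


# Hypothetical demo rows that only mimic the column layout described above.
users_demo = parse_dates(
    pd.DataFrame({"user_id": [1000, 1001], "churn_date": ["2018-12-18", None]}),
    ["churn_date"],
)
# A missing churn_date means the plan was still in use at extraction time.
users_demo["is_active"] = users_demo["churn_date"].isna()

calls_demo = parse_dates(
    pd.DataFrame({
        "user_id": [1000, 1000, 1001],
        "call_date": ["2018-12-27", "2018-12-28", "2018-11-03"],
        "duration": [8.52, 0.0, 1.01],
    }),
    ["call_date"],
)
calls_demo = add_billed_minutes(calls_demo)
# Share of zero-duration calls, to judge how much they matter.
zero_call_share = (calls_demo["duration"] == 0).mean()

internet_demo = parse_dates(
    pd.DataFrame({
        "user_id": [1000, 1000, 1001],
        "session_date": ["2018-12-01", "2018-12-15", "2018-11-20"],
        "mb_used": [600.0, 425.0, 1024.0],
    }),
    ["session_date"],
)
gb_demo = monthly_gb_billed(internet_demo)

print(users_demo)
print(calls_demo)
print(gb_demo)
```

Note the difference in rounding granularity: calls are rounded per call, while data is summed over the month first and only the monthly total is rounded. Rounding each internet session separately would overstate billed usage.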