📜 Task details:

- Load the dataset (here) into the working environment (Jupyter Notebook).

- Perform an initial review of the data structure: check the dimensions (shape), column types (dtypes), and sample records (head, etc.).

- Check for missing values: 1) Use Pandas and/or Seaborn to identify missing values in the columns. 2) Build a visualization (e.g., a heatmap or barplot) to show the amount of missing data.

- Analyze the data distribution: 1) Build histograms for the numeric variables and assess whether the distributions are normal. 2) Use a boxplot to look for outliers.

- Identify correlations: 1) Compute correlation coefficients between the numeric variables with Pandas. 2) Build a correlation heatmap with Seaborn.

- Document the analysis results: collect all key findings in the Analysis Result (a markdown cell at the end of your Notebook): main distributions, missing values, outliers, correlations.

- Upload the work to GitHub: create your own branch on GitHub (if you have not already) from master/main of the project's main repository here, upload your notebook, and open a Pull Request.

📎📎 Note: You can use Module 5 and our homework as a basis.

# General Data Overview

For this project we use a customer churn prediction dataset, *internet_service_churn.csv*. Customer churn is the percentage of customers who stopped using a company's product or service during a specified time period. This is a vital metric because retaining existing customers is more cost-effective than acquiring new ones. Churn can occur for various reasons, including unsatisfactory service, competing products, changing customer needs, and a lack of engagement.

The main goal of this project is to help companies investigate and analyse their existing data (through EDA) and build effective predictive model(s) that support targeted strategies to retain customers, enhance customer satisfaction, and maintain sustainable growth.

The provided dataset belongs to an unnamed internet service provider and contains information on over 70,000 customers, including their internet usage, subscription age, number of service failures, and additional services used, along with a flag showing whether each customer churned.

Meaning of the column names:

- `id`: Unique identifier for each customer.

- `is_tv_subscriber`: Indicates if the customer has a TV subscription (1 = Yes, 0 = No).

- `is_movie_package_subscriber`: Indicates if the customer has a movie package subscription (1 = Yes, 0 = No).

- `subscription_age`: The age of the subscription in years.

- `bill_avg`: Average monthly bill amount.

- `reamining_contract`: Remaining contract duration in years (the column name is spelled this way in the CSV file).

- `service_failure_count`: Number of service failures reported by the customer.

- `download_avg`: Average download speed.

- `upload_avg`: Average upload speed.

- `download_over_limit`: Indicates if the customer has exceeded their download limit (1 = Yes, 0 = No).

- `churn`: Indicates if the customer has churned (1 = Yes, 0 = No).

### Imports

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Dataset load and DataFrame build

In [4]:
data = pd.read_csv("internet_service_churn.csv")
df = data.copy()

### Initial structure overview

- **Step 1.** Checking the first 10 rows of the dataset:

In [6]:
df.head(10)

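|   | id | is_tv_subscriber | is_movie_package_subscriber | subscription_age | bill_avg | reamining_contract | service_failure_count | download_avg | upload_avg | download_over_limit | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15 | 1 | 0 | 11.95 | 25 | 0.14 | 0 | 8.4 | 2.3 | 0 | 0 |
| 1 | 18 | 0 | 0 | 8.22 | 0 | NaN | 0 | 0.0 | 0.0 | 0 | 1 |
| 2 | 23 | 1 | 0 | 8.91 | 16 | 0.0 | 0 | 13.7 | 0.9 | 0 | 1 |
| 3 | 27 | 0 | 0 | 6.87 | 21 | NaN | 1 | 0.0 | 0.0 | 0 | 1 |
| 4 | 34 | 0 | 0 | 6.39 | 0 | NaN | 0 | 0.0 | 0.0 | 0 | 1 |
| 5 | 56 | 1 | 1 | 11.94 | 32 | 1.38 | 0 | 69.4 | 4.0 | 0 | 0 |
| 6 | 71 | 0 | 0 | 8.96 | 18 | 0.0 | 0 | 21.3 | 2.0 | 0 | 1 |
| 7 | 84 | 0 | 0 | 5.48 | 14 | NaN | 1 | 0.0 | 0.0 | 0 | 1 |
| 8 | 94 | 0 | 0 | 8.54 | 0 | NaN | 0 | 0.0 | 0.0 | 0 | 1 |
| 9 | 112 | 0 | 0 | 8.33 | 0 | NaN | 0 | 0.0 | 0.0 | 0 | 1 |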


- **Step 2.** Checking dataset information:

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72274 entries, 0 to 72273
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   id                           72274 non-null  int64  
 1   is_tv_subscriber             72274 non-null  int64  
 2   is_movie_package_subscriber  72274 non-null  int64  
 3   subscription_age             72274 non-null  float64
 4   bill_avg                     72274 non-null  int64  
 5   reamining_contract           50702 non-null  float64
 6   service_failure_count        72274 non-null  int64  
 7   download_avg                 71893 non-null  float64
 8   upload_avg                   71893 non-null  float64
 9   download_over_limit          72274 non-null  int64  
 10  churn                        72274 non-null  int64  
dtypes: float64(4), int64(7)
memory usage: 6.1 MB
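
The non-null counts above imply missing values in `reamining_contract`, `download_avg`, and `upload_avg`. Below is a minimal sketch of how per-column counts and shares of missing values could be tabulated with pandas. The helper name `missing_summary` and the small `toy` frame are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(frame) * 100).round(2),
    })
    # Keep only the columns that actually have gaps, largest first.
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_count", ascending=False)


# Toy frame standing in for the real data (hypothetical values).
toy = pd.DataFrame({
    "reamining_contract": [0.14, np.nan, 0.0, np.nan],
    "download_avg": [8.4, 0.0, np.nan, 0.0],
    "churn": [0, 1, 1, 1],
})
toy_summary = missing_summary(toy)
print(toy_summary)
```

Applied to the notebook's `df`, the same helper should report 21,572 missing values in `reamining_contract` (72274 - 50702) and 381 in each of `download_avg` and `upload_avg` (72274 - 71893), matching the `df.info()` output. The `missing_pct` column can be passed to a bar plot to visualize the gaps.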


In [12]:
df.describe()

|   | id | is_tv_subscriber | is_movie_package_subscriber | subscription_age | bill_avg | reamining_contract | service_failure_count | download_avg | upload_avg | download_over_limit | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 72274.0 | 72274.0 | 72274.0 | 72274.0 | 72274.0 | 50702.0 | 72274.0 | 71893.0 | 71893.0 | 72274.0 | 72274.0 |
| mean | 846318.2 | 0.815259 | 0.334629 | 2.450051 | 18.942483 | 0.716039 | 0.274234 | 43.689911 | 4.192076 | 0.207613 | 0.554141 |
| std | 489102.2 | 0.38809 | 0.471864 | 2.03499 | 13.215386 | 0.697102 | 0.816621 | 63.405963 | 9.818896 | 0.997123 | 0.497064 |
| min | 15.0 | 0.0 | 0.0 | -0.02 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 422216.5 | 1.0 | 0.0 | 0.93 | 13.0 | 0.0 | 0.0 | 6.7 | 0.5 | 0.0 | 0.0 |
| 50% | 847784.0 | 1.0 | 0.0 | 1.98 | 19.0 | 0.57 | 0.0 | 27.8 | 2.1 | 0.0 | 1.0 |
| 75% | 1269562.0 | 1.0 | 1.0 | 3.3 | 22.0 | 1.31 | 0.0 | 60.5 | 4.8 | 0.0 | 1.0 |
| max | 1689744.0 | 1.0 | 1.0 | 12.8 | 406.0 | 2.92 | 19.0 | 4415.2 | 453.3 | 7.0 | 1.0 |
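
The summary above hints at heavy right tails: the maximum of `download_avg` (4415.2) is far above its 75th percentile (60.5), and `subscription_age` has a negative minimum (-0.02). One common way to flag such values is the 1.5 × IQR rule that boxplots use for their whiskers. The sketch below illustrates it on a handful of made-up numbers. The helper name `iqr_outlier_mask` and the multiplier of 1.5 are assumptions, not something the notebook prescribes.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Missing values compare as False, so they are never flagged.
    return (series < lower) | (series > upper)


# Hypothetical sample; only the last value is extreme.
toy_download = pd.Series(
    [6.7, 27.8, 60.5, 13.7, 21.3, 4415.2], name="download_avg"
)
mask = iqr_outlier_mask(toy_download)
print(toy_download[mask])
```

Whether flagged rows should be removed, capped, or kept is a separate decision that depends on whether the values are data errors or genuine heavy users.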


- **Step 3.** Checking the shape of the dataset and the types of values:

In [10]:
df.shape

(72274, 11)

In [15]:
df.dtypes

id                               int64
is_tv_subscriber                 int64
is_movie_package_subscriber      int64
subscription_age               float64
bill_avg                         int64
reamining_contract             float64
service_failure_count            int64
download_avg                   float64
upload_avg                     float64
download_over_limit              int64
churn                            int64
dtype: object
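
The column list shows the contract column spelled `reamining_contract`. If a corrected name is preferred for later steps, a rename along these lines could be used. This is an optional sketch on a toy frame, and the target name `remaining_contract` is an assumption; the notebook itself keeps the original spelling.

```python
import pandas as pd

# Toy frame with the misspelled column name, as it appears in the CSV.
toy = pd.DataFrame({"id": [15, 18], "reamining_contract": [0.14, None]})

# rename returns a new frame; the original is left unchanged.
renamed = toy.rename(columns={"reamining_contract": "remaining_contract"})
print(renamed.columns.tolist())
```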

### Data Visualization Methods

# Exploratory Data Analysis (EDA)
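
One of the planned EDA steps is to compute correlation coefficients between the numeric variables with pandas. Below is a minimal sketch, assuming Pearson correlation and a small synthetic frame in place of the real data (the values are made up for illustration).

```python
import pandas as pd

# Hypothetical sample with two features and the target.
toy = pd.DataFrame({
    "download_avg": [8.4, 13.7, 69.4, 21.3, 0.0],
    "upload_avg": [2.3, 0.9, 4.0, 2.0, 0.0],
    "churn": [0, 1, 0, 1, 1],
})

# Pairwise Pearson correlation (the pandas default) of all columns.
corr = toy.corr()

# Correlation of each feature with the target, strongest first.
churn_corr = corr["churn"].drop("churn")
order = churn_corr.abs().sort_values(ascending=False).index
churn_corr = churn_corr.reindex(order)

print(corr.round(2))
print(churn_corr.round(2))
```

The resulting matrix can be passed to `seaborn.heatmap(corr, annot=True)` to draw the correlation heatmap. Pearson correlation only captures linear relationships, so for the skewed columns in this dataset `toy.corr(method="spearman")` may be a useful cross-check.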

# Results

You can use a logistic regression model to solve the problem but remember to keep the business context in mind. For instance, retention surveys have shown that while price and product are important, most customers churn because of service failures and dissatisfaction with the customer care team.
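
As a rough illustration of the logistic regression idea, the sketch below fits a model with scikit-learn on synthetic data. Everything about the data is an assumption made for the example: the two features chosen, the sample size, and the generating coefficients are invented, so the printed numbers say nothing about the real dataset. The real `df` also contains missing values, which would need to be imputed or dropped first because `LogisticRegression` does not accept NaN.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Synthetic data (assumption): more failures raise the churn odds,
# a longer remaining contract lowers them.
toy = pd.DataFrame({
    "service_failure_count": rng.poisson(0.3, n),
    "reamining_contract": rng.uniform(0.0, 2.0, n),
})
logit = 0.8 * toy["service_failure_count"] - 2.0 * toy["reamining_contract"] + 1.0
toy["churn"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = toy[["service_failure_count", "reamining_contract"]]
y = toy["churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale the features, then fit the classifier.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
# Coefficients of the final pipeline step, keyed by feature name.
coefs = dict(zip(X.columns, model[-1].coef_[0]))
print(f"accuracy: {accuracy:.2f}")
print(coefs)
```

The sign and size of each coefficient give a first indication of which factors push customers toward churn, which is where the business context mentioned above comes in.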