dataset: https://www.kaggle.com/datasets/tejashvi14/tour-travels-customer-churn-prediction/data

## Dataset

In [1]:
import pandas as pd
import numpy as np
import os

In [2]:
from sklearn.model_selection import train_test_split

In [3]:
real_dir = os.path.join("..", "..", "dataset")
real_path = os.path.join(real_dir, "customer_travel.csv")


In [4]:
dataset = pd.read_csv(real_path, sep=";")
dataset.head()

,Age,FrequentFlyer,AnnualIncomeClass,ServicesOpted,AccountSyncedToSocialMedia,BookedHotelOrNot,Target
0,34,No,Middle Income,6,No,Yes,0
1,34,Yes,Low Income,5,Yes,No,1
2,37,No,Middle Income,3,Yes,No,0
3,30,No,Middle Income,2,No,No,0
4,30,No,Low Income,1,No,No,0


In [5]:
dataset.describe()

,Age,ServicesOpted,Target
count,954.0,954.0,954.0
mean,32.109015,2.437107,0.234801
std,3.337388,1.606233,0.424097
min,27.0,1.0,0.0
25%,30.0,1.0,0.0
50%,31.0,2.0,0.0
75%,35.0,4.0,0.0
max,38.0,6.0,1.0


In [6]:
dataset.shape

(954, 7)

In [7]:
dataset.isna().sum()

Age                           0
FrequentFlyer                 0
AnnualIncomeClass             0
ServicesOpted                 0
AccountSyncedToSocialMedia    0
BookedHotelOrNot              0
Target                        0
dtype: int64

In [8]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 954 entries, 0 to 953
Data columns (total 7 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   Age                         954 non-null    int64 
 1   FrequentFlyer               954 non-null    object
 2   AnnualIncomeClass           954 non-null    object
 3   ServicesOpted               954 non-null    int64 
 4   AccountSyncedToSocialMedia  954 non-null    object
 5   BookedHotelOrNot            954 non-null    object
 6   Target                      954 non-null    int64 
dtypes: int64(3), object(4)
memory usage: 52.3+ KB


In [9]:
for column in dataset.columns:
    print(f"* {column}: {dataset[column].unique()} \n")

* Age: [34 37 30 27 36 28 35 31 38 33 29] 

* FrequentFlyer: ['No' 'Yes' 'No Record'] 

* AnnualIncomeClass: ['Middle Income' 'Low Income' 'High Income'] 

* ServicesOpted: [6 5 3 2 1 4] 

* AccountSyncedToSocialMedia: ['No' 'Yes'] 

* BookedHotelOrNot: ['Yes' 'No'] 

* Target: [0 1] 



In [10]:
dataset.duplicated().sum()

507

In [11]:
dataset[dataset.duplicated()]

,Age,FrequentFlyer,AnnualIncomeClass,ServicesOpted,AccountSyncedToSocialMedia,BookedHotelOrNot,Target
26,37,No,Middle Income,3,Yes,No,0
48,30,No,Middle Income,4,No,Yes,0
54,31,No,Middle Income,1,No,Yes,0
59,36,No,Middle Income,2,No,No,0
60,34,No,Middle Income,6,No,Yes,0
...,...,...,...,...,...,...,...
948,31,No,Middle Income,1,No,Yes,0
949,31,Yes,Low Income,1,No,No,0
951,37,No,Middle Income,4,No,No,0
952,30,No,Low Income,1,Yes,Yes,0


In [12]:
dataset.query("Age==34 & FrequentFlyer == 'No' & AnnualIncomeClass == 'Middle Income' & ServicesOpted == 6")

,Age,FrequentFlyer,AnnualIncomeClass,ServicesOpted,AccountSyncedToSocialMedia,BookedHotelOrNot,Target
0,34,No,Middle Income,6,No,Yes,0
60,34,No,Middle Income,6,No,Yes,0
450,34,No,Middle Income,6,No,Yes,0
495,34,No,Middle Income,6,No,Yes,0
510,34,No,Middle Income,6,No,Yes,0
780,34,No,Middle Income,6,No,Yes,0


In [13]:
sum(dataset["FrequentFlyer"] == "No Record")

60

## Data Cleaning

A short descriptive analysis shows that:
- there are no missing values, provided we treat "No Record" (60 rows) as a separate category of the FrequentFlyer column.
- there are 507 duplicated rows. We do not drop them, because different customers can legitimately share the same values in every column.
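
To make the argument about duplicates more concrete, here is a rough sketch that uses only numbers already printed in this notebook (the unique values from cell 9, the shape, and the duplicate count). It assumes `duplicated()` was called with its default `keep="first"`, so every repeat after the first occurrence is counted. The variable names are ours and purely illustrative.

```python
import math

# Number of distinct values per column, read off the output of cell 9.
n_unique = {
    "Age": 11,
    "FrequentFlyer": 3,
    "AnnualIncomeClass": 3,
    "ServicesOpted": 6,
    "AccountSyncedToSocialMedia": 2,
    "BookedHotelOrNot": 2,
    "Target": 2,
}

# Upper bound on how many different rows the dataset could ever contain.
max_distinct_rows = math.prod(n_unique.values())

n_rows = 954        # from dataset.shape
n_duplicated = 507  # from dataset.duplicated().sum(), assuming keep="first"

# Each group of identical rows contributes (size - 1) to the duplicate count,
# so the number of distinct rows is the total minus the duplicates.
n_distinct_observed = n_rows - n_duplicated

print(max_distinct_rows)    # 4752
print(n_distinct_observed)  # 447
```

All columns are low-cardinality, so at most 4752 different rows are possible and repeated rows among 954 customers are plausible. This does not prove that the duplicates are distinct customers, but it is consistent with the decision to keep them.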

## Train Test Split

In [14]:
X = dataset.drop('Target', axis=1)
Y = dataset['Target']

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=32)

In [16]:
train_dataset = pd.concat([X_train, y_train], axis=1)
test_dataset = pd.concat([X_test, y_test], axis=1)

In [17]:
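os.makedirs("original_train_dataset", exist_ok=True)  # to_csv does not create missing folders
train_dataset.to_csv("original_train_dataset/Customer_travel_original_train.csv", index=False)
os.makedirs("original_test_dataset", exist_ok=True)
test_dataset.to_csv("original_test_dataset/Customer_travel_original_test.csv", index=False)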

In [18]:
len(train_dataset)

639

In [19]:
len(test_dataset)

315
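
The two sizes above can be sanity-checked by hand. To our understanding, when `test_size` is a fraction, `train_test_split` rounds the test share up and gives the remaining rows to the training set. The sketch below reproduces the arithmetic under that assumption, using only numbers shown in this notebook.

```python
import math

n_samples = 954   # rows in the full dataset (dataset.shape)
test_size = 0.33  # fraction passed to train_test_split

# Assumption: the test set gets ceil(test_size * n_samples) rows
# and the training set gets whatever is left.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 639 315
```

Both values match `len(train_dataset)` and `len(test_dataset)`, so no rows were lost or duplicated by the split.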