**TOUR AND** **TRAVELS**  **CHURN** **PREDICTION**

The dataset assists a travel company in
predicting customer churn. It includes indicators such as age, frequent flyer status,
annual income class, services opted frequency, social media account
synchronization, and hotel bookings.The goal is to build predictive models to save company resources. we can perform
exploratory data analyses to reveal insights for effective churn prediction. The
binary target variable distinguishes customers who churn (1) from those who don't
(0), guiding the modeling process.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')


In [2]:
data=pd.read_csv('/content/Customertravel.csv')
data.head(5)

Unnamed: 0,Age,FrequentFlyer,AnnualIncomeClass,ServicesOpted,AccountSyncedToSocialMedia,BookedHotelOrNot,Target
0,34,No,Middle Income,6,No,Yes,0
1,34,Yes,Low Income,5,Yes,No,1
2,37,No,Middle Income,3,Yes,No,0
3,30,No,Middle Income,2,No,No,0
4,30,No,Low Income,1,No,No,0


**EDA**

In [3]:
data.isnull().sum()

Age                           0
FrequentFlyer                 0
AnnualIncomeClass             0
ServicesOpted                 0
AccountSyncedToSocialMedia    0
BookedHotelOrNot              0
Target                        0
dtype: int64

In [4]:
data.duplicated().sum()

507

In [5]:
data.shape

(954, 7)

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 954 entries, 0 to 953
Data columns (total 7 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   Age                         954 non-null    int64 
 1   FrequentFlyer               954 non-null    object
 2   AnnualIncomeClass           954 non-null    object
 3   ServicesOpted               954 non-null    int64 
 4   AccountSyncedToSocialMedia  954 non-null    object
 5   BookedHotelOrNot            954 non-null    object
 6   Target                      954 non-null    int64 
dtypes: int64(3), object(4)
memory usage: 52.3+ KB


In [7]:
data.describe()

Unnamed: 0,Age,ServicesOpted,Target
count,954.0,954.0,954.0
mean,32.109015,2.437107,0.234801
std,3.337388,1.606233,0.424097
min,27.0,1.0,0.0
25%,30.0,1.0,0.0
50%,31.0,2.0,0.0
75%,35.0,4.0,0.0
max,38.0,6.0,1.0


In [8]:
data.columns

Index(['Age', 'FrequentFlyer', 'AnnualIncomeClass', 'ServicesOpted',
       'AccountSyncedToSocialMedia', 'BookedHotelOrNot', 'Target'],
      dtype='object')

In [9]:
import matplotlib.pyplot as plt
import plotly.express as px

In [10]:
fig = px.histogram(data,x='AnnualIncomeClass',color='Target')
fig.show()

Middle Income people travel more than High Income, there will be may reason behind it like they don't have enough time to travel they are busy in their work to earn more money or anything else also happens

Note : Company need to focus on how they can attract High Income family or people to their travelling plan and also attract Low Income people by provind jaw dropping deals or offers

In [11]:
fig = px.histogram(data,x='AnnualIncomeClass',color='FrequentFlyer')
fig.show()

As per the data Middle Income class are not a FrequentFlyer

In [12]:
data['ServicesOpted'].value_counts()

ServicesOpted
1    404
2    176
3    124
4    117
5     69
6     64
Name: count, dtype: int64

In [13]:
fig = px.histogram(data,x='AnnualIncomeClass',color='ServicesOpted')
fig.show()

In [14]:
fig = px.histogram(data,x='AnnualIncomeClass',color='BookedHotelOrNot')
fig.show()

Mainly Low Income Class people not booking hotel while travelling

And very low amount of High Income Clss booked hotels

**PREPROCESSING**

ENCODING CATEGORICAL COLUMN

In [16]:
from sklearn.preprocessing import LabelEncoder

In [17]:
le = LabelEncoder()

In [18]:
cat_col = data.select_dtypes(include=['object']).columns
for i in cat_col:
    data[i] = le.fit_transform(data[i])

In [19]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 954 entries, 0 to 953
Data columns (total 7 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   Age                         954 non-null    int64
 1   FrequentFlyer               954 non-null    int64
 2   AnnualIncomeClass           954 non-null    int64
 3   ServicesOpted               954 non-null    int64
 4   AccountSyncedToSocialMedia  954 non-null    int64
 5   BookedHotelOrNot            954 non-null    int64
 6   Target                      954 non-null    int64
dtypes: int64(7)
memory usage: 52.3 KB


STANDARDISING THE VALUES

In [20]:
#Before standardizing first spilt the data

In [21]:
from sklearn.model_selection import train_test_split

In [22]:
X = data.drop(columns=['Target'])
y = data['Target']

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=1)

In [23]:
X_train

Unnamed: 0,Age,FrequentFlyer,AnnualIncomeClass,ServicesOpted,AccountSyncedToSocialMedia,BookedHotelOrNot
623,30,0,2,2,0,0
874,30,0,1,4,1,0
722,36,2,1,3,1,0
223,33,0,1,4,0,0
651,36,0,2,5,0,0
...,...,...,...,...,...,...
767,30,0,2,2,1,0
72,30,0,2,1,1,1
908,37,0,2,3,0,0
235,33,0,1,5,1,1


In [24]:
from sklearn.preprocessing import StandardScaler

In [25]:
scaler = StandardScaler()

In [26]:
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.fit_transform(X_test)

In [27]:
X_train_scaled

array([[-0.58830739, -0.73331065,  1.03128874, -0.26303681, -0.78301202,
        -0.79833518],
       [-0.58830739, -0.73331065, -0.3443639 ,  0.99525213,  1.2771196 ,
        -0.79833518],
       [ 1.2036229 ,  1.4651806 , -0.3443639 ,  0.36610766,  1.2771196 ,
        -0.79833518],
       ...,
       [ 1.50227795, -0.73331065,  1.03128874,  0.36610766, -0.78301202,
        -0.79833518],
       [ 0.30765776, -0.73331065, -0.3443639 ,  1.6243966 ,  1.2771196 ,
         1.25260671],
       [ 1.50227795,  1.4651806 , -0.3443639 , -0.89218128,  1.2771196 ,
        -0.79833518]])

Training the model

In [28]:
from sklearn.linear_model import LogisticRegression

In [29]:
lr = LogisticRegression()

In [30]:
lr.fit(X_train_scaled,y_train)

In [31]:
y_pred = lr.predict(X_test_scaled)

In [32]:
from sklearn.metrics import accuracy_score

In [33]:
accuracy_score(y_test,y_pred)

0.8167539267015707