**Problem Statement**

A tour and travel company wants to predict whether a customer will churn, based on a few customer characteristics: age, frequent flyer status, annual income class, services opted, whether the account is synced to social media, and whether a hotel was booked. The Target column is the churn label.

The analysis looks at how churn relates to annual income class, hotel bookings, and the other customer characteristics, to help the business build predictive models, save money, and carry out exploratory data analysis (EDA).

**Import Libraries**

In [None]:
import pandas as pd
import numpy as np
import plotly.express as px

**Loading the Dataset**

In [None]:
df = pd.read_csv('/content/Customertravel.csv')



**Data Cleaning**

In [None]:
df.head()

Unnamed: 0,Age,FrequentFlyer,AnnualIncomeClass,ServicesOpted,AccountSyncedToSocialMedia,BookedHotelOrNot,Target
0,34,No,Middle Income,6,No,Yes,0
1,34,Yes,Low Income,5,Yes,No,1
2,37,No,Middle Income,3,Yes,No,0
3,30,No,Middle Income,2,No,No,0
4,30,No,Low Income,1,No,No,0


In [None]:
df.tail()

Unnamed: 0,Age,FrequentFlyer,AnnualIncomeClass,ServicesOpted,AccountSyncedToSocialMedia,BookedHotelOrNot,Target
949,31,Yes,Low Income,1,No,No,0
950,30,No,Middle Income,5,No,Yes,0
951,37,No,Middle Income,4,No,No,0
952,30,No,Low Income,1,Yes,Yes,0
953,31,Yes,High Income,1,No,No,0


In [None]:
print('The null values in the dataset\n')

df.isna().sum()

The null values in the dataset



Unnamed: 0,0
Age,0
FrequentFlyer,0
AnnualIncomeClass,0
ServicesOpted,0
AccountSyncedToSocialMedia,0
BookedHotelOrNot,0
Target,0


In [None]:
print('Shape of the dataset\n')

df.shape

Shape of the dataset



(954, 7)

In [None]:
print('Columns of the dataset\n')

df.columns

Columns of the dataset



Index(['Age', 'FrequentFlyer', 'AnnualIncomeClass', 'ServicesOpted',
       'AccountSyncedToSocialMedia', 'BookedHotelOrNot', 'Target'],
      dtype='object')

In [None]:
print('Information of the dataset\n')

df.info()

Information of the dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 954 entries, 0 to 953
Data columns (total 7 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   Age                         954 non-null    int64 
 1   FrequentFlyer               954 non-null    object
 2   AnnualIncomeClass           954 non-null    object
 3   ServicesOpted               954 non-null    int64 
 4   AccountSyncedToSocialMedia  954 non-null    object
 5   BookedHotelOrNot            954 non-null    object
 6   Target                      954 non-null    int64 
dtypes: int64(3), object(4)
memory usage: 52.3+ KB


In [None]:
print('The duplicate values in the dataset\n')

df.duplicated().sum()

The duplicate values in the dataset



507

**There are 507 duplicated rows in the dataset. Because there is no customer ID column, these may be different customers with identical attributes rather than repeated records.**
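
Before deciding whether to drop anything, it can help to look at the duplicated rows themselves. The sketch below uses a small made-up frame (the `toy` rows are hypothetical, not taken from Customertravel.csv) to show how the count from `duplicated()` relates to the rows that `drop_duplicates()` would keep:

```python
import pandas as pd

# Hypothetical toy frame; the rows are made up for illustration only
toy = pd.DataFrame({
    "Age": [30, 30, 34, 30],
    "FrequentFlyer": ["No", "No", "Yes", "No"],
    "Target": [0, 0, 1, 0],
})

# duplicated() flags every repeat of an earlier row (the first occurrence is not flagged)
n_duplicates = toy.duplicated().sum()

# keep=False flags all members of each duplicated group, which is handy for inspection
duplicate_groups = toy[toy.duplicated(keep=False)]

# drop_duplicates() keeps the first occurrence of each distinct row
deduplicated = toy.drop_duplicates()

print(n_duplicates, len(duplicate_groups), len(deduplicated))
```

Whether dropping is appropriate here is a judgment call: with only a handful of low-cardinality columns, identical rows are expected even among distinct customers.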

In [None]:
print('The Statistical distribution\n')

df.describe()

The Statistical distribution



Unnamed: 0,Age,ServicesOpted,Target
count,954.0,954.0,954.0
mean,32.109015,2.437107,0.234801
std,3.337388,1.606233,0.424097
min,27.0,1.0,0.0
25%,30.0,1.0,0.0
50%,31.0,2.0,0.0
75%,35.0,4.0,0.0
max,38.0,6.0,1.0


In [None]:
df.describe(include = 'object')

Unnamed: 0,FrequentFlyer,AnnualIncomeClass,AccountSyncedToSocialMedia,BookedHotelOrNot
count,954,954,954,954
unique,3,3,2,2
top,No,Middle Income,No,No
freq,608,409,594,576


In [None]:
df.select_dtypes('number').corr()

Unnamed: 0,Age,ServicesOpted,Target
Age,1.0,-0.012422,-0.131534
ServicesOpted,-0.012422,1.0,0.038646
Target,-0.131534,0.038646,1.0



In [None]:
for col in df.select_dtypes(include='object').columns:
    print(col)
    print(df[col].unique())
    print('-'*50)

FrequentFlyer
['No' 'Yes' 'No Record']
--------------------------------------------------
AnnualIncomeClass
['Middle Income' 'Low Income' 'High Income']
--------------------------------------------------
AccountSyncedToSocialMedia
['No' 'Yes']
--------------------------------------------------
BookedHotelOrNot
['Yes' 'No']
--------------------------------------------------


In [None]:
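from sklearn.preprocessing import LabelEncoder

# LabelEncoder assigns integer codes in sorted (alphabetical) order of the labels,
# e.g. FrequentFlyer: 'No' -> 0, 'No Record' -> 1, 'Yes' -> 2
le = LabelEncoder()
for col in df.select_dtypes(include='object').columns:
    df[col] = le.fit_transform(df[col])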

In [None]:
df.head(2)

Unnamed: 0,Age,FrequentFlyer,AnnualIncomeClass,ServicesOpted,AccountSyncedToSocialMedia,BookedHotelOrNot,Target
0,34,0,2,6,0,1,0
1,34,2,1,5,1,0,1
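
Once the columns are encoded, the integer codes are easy to misread, so it is worth keeping the code-to-label mapping around. The sketch below is a pandas-only alternative to LabelEncoder (a swapped-in technique, not the notebook's method) that produces the same alphabetical codes for string labels and exposes the mapping. The `frequent_flyer` series is a hypothetical sample built from the category values printed earlier:

```python
import pandas as pd

# Hypothetical sample using the FrequentFlyer values printed by the unique() loop
frequent_flyer = pd.Series(["No", "Yes", "No Record", "No"])

# Categories are sorted alphabetically, the same order LabelEncoder uses for strings
as_category = frequent_flyer.astype("category")
codes = as_category.cat.codes

# Mapping from integer code back to the original label
mapping = dict(enumerate(as_category.cat.categories))

print(mapping)
print(codes.tolist())
```

By the same alphabetical rule, AnnualIncomeClass becomes 0 = High Income, 1 = Low Income, 2 = Middle Income, which matches the encoded rows shown above.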


In [None]:
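# FrequentFlyer is label-encoded (0 = 'No', 1 = 'No Record', 2 = 'Yes'), so plot the share of 'Yes' per age rather than the mean of the codes
px.bar(df['FrequentFlyer'].eq(2).groupby(df['Age']).mean())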

**Hotel Booking Visualization**

In [None]:
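booking_share = df['BookedHotelOrNot'].value_counts(normalize=True)
print(booking_share)

# After label encoding, 0 = 'No' (not booked) and 1 = 'Yes' (booked); map labels by code, not by position
labels = [{0: "Not Booked", 1: "Booked"}[code] for code in booking_share.index]
px.bar(x=labels, y=booking_share.values, title="Share of Customers by Hotel Booking Status")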

BookedHotelOrNot
0    0.603774
1    0.396226
Name: proportion, dtype: float64


**Customers who booked a hotel make up about 40% of the dataset (39.6%), while about 60% (60.4%) did not. This shows how bookings are distributed across customers, not how they relate to churn.**

In [None]:
income = df.groupby(['AnnualIncomeClass'])

income = income.size()

income


AnnualIncomeClass,0
0,159
1,386
2,409


In [None]:
px.pie(values = income, names = ("High Income", "Low Income", "Middle Income"), title = 'Customer Distribution by Annual Income Class')

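**High-income customers form the smallest group (about 17% of customers), while low-income (about 40%) and middle-income (about 43%) customers make up the rest. The pie chart shows each income class's share of all customers, not its churn rate, so it does not by itself say which class churns most.**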
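
To measure churn by income class directly, one option is to average Target within each class, since the mean of a 0/1 column is the fraction of ones. A minimal sketch on a small made-up frame (the `toy` rows are hypothetical, not taken from Customertravel.csv), shown next to the group shares that a pie chart of group sizes displays:

```python
import pandas as pd

# Hypothetical rows for illustration; not taken from Customertravel.csv
toy = pd.DataFrame({
    "AnnualIncomeClass": ["High Income", "High Income", "Low Income", "Low Income",
                          "Middle Income", "Middle Income", "Middle Income", "Middle Income"],
    "Target": [1, 0, 0, 0, 1, 0, 0, 0],
})

# Target is 0/1, so its mean within a group is that group's churn rate
churn_rate = toy.groupby("AnnualIncomeClass")["Target"].mean()

# Share of all customers in each class, which is what a pie chart of group sizes shows
customer_share = toy["AnnualIncomeClass"].value_counts(normalize=True)

print(churn_rate)
print(customer_share)
```

In this toy data the largest group (Middle Income) is not the one with the highest churn rate, which is why the two quantities should be read separately.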

In [None]:
AnnualIncomeClass = df.groupby('AnnualIncomeClass')
Age = df.groupby('Age')

px.bar(df, y = 'AnnualIncomeClass', x = 'Age')

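**Age 30 has the tallest bar. The bar height is the sum of the label-encoded income codes (0 = High, 1 = Low, 2 = Middle) at each age, so it reflects how many customers fall at that age and which codes they carry, not a higher income.**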

In [None]:
df.head(1)

Unnamed: 0,Age,FrequentFlyer,AnnualIncomeClass,ServicesOpted,AccountSyncedToSocialMedia,BookedHotelOrNot,Target
0,34,0,2,6,0,1,0


In [None]:
FrequentFlyer = df.groupby('FrequentFlyer')
FrequentFlyer = FrequentFlyer.size()
FrequentFlyer



FrequentFlyer,0
0,608
1,60
2,286


In [None]:
# Label-encoded FrequentFlyer codes follow alphabetical order: 0 = 'No', 1 = 'No Record', 2 = 'Yes'
px.pie(values = FrequentFlyer, names = ("No", "No Record", "Yes"), title = 'Customer Distribution by Frequent Flyer Status')

**About 64% of customers are not frequent flyers, about 30% are, and 6.29% have no record. The pie chart shows how customers are distributed across frequent flyer status, not how likely each group is to churn.**
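
The percentages on the pie chart can be reproduced from the group sizes printed above. The sketch below assumes the codes 0, 1, 2 correspond to 'No', 'No Record', 'Yes' (LabelEncoder's alphabetical order) and computes each group's share of the 954 customers:

```python
import pandas as pd

# Group sizes from the FrequentFlyer output; labels assume alphabetical code order
counts = pd.Series({"No": 608, "No Record": 60, "Yes": 286})

# Percentage of all customers in each category
share_pct = (counts / counts.sum() * 100).round(2)

print(share_pct)
```

These are shares of the customer base; a churn rate per group would instead come from averaging Target within each FrequentFlyer value.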

In [None]:
acc = df.groupby('AccountSyncedToSocialMedia')
acc = acc.size()
acc

AccountSyncedToSocialMedia,0
0,594
1,360
