# Data in Motion Tourism Company

## Business Problem:

The main business problem is to increase the conversion rate of product pitches into actual sales. To do this, the company needs to:

- Understand the characteristics and preferences of their customers.
- Understand the factors that influence a customer's decision to purchase a product.
- Tailor their products and marketing strategies to better meet the needs and preferences of their customers.
- Improve the effectiveness of their product pitches and follow-ups.
- Enhance customer satisfaction with the product pitches.

By addressing these issues, the company can increase the likelihood that a customer will purchase a product after a pitch, thereby increasing sales and revenue.

## Data Description:

- CustomerID: Unique customer ID
- ProdTaken: Whether the customer has purchased a package or not (0: No, 1: Yes)
- Age: Age of customer
- TypeofContact: How the customer was contacted (Company Invited or Self Enquiry)
- CityTier: City tier depends on the development of a city, population, facilities, and living standards. The categories are ordered i.e. Tier 1 > Tier 2 > Tier 3
- DurationOfPitch: Duration of the pitch by a salesperson to the customer
- Occupation: Occupation of customer
- Gender: Gender of customer
- NumberOfPersonVisiting: Total number of persons planning to take the trip with the customer
- NumberOfFollowups: Total number of follow-ups made by the salesperson after the sales pitch
- ProductPitched: Product pitched by the salesperson
- PreferredPropertyStar: Preferred hotel property rating by customer
- MaritalStatus: Marital status of customer
- NumberOfTrips: Average number of trips in a year by customer
- Passport: Whether the customer has a passport or not (0: No, 1: Yes)
- PitchSatisfactionScore: Sales pitch satisfaction score
- OwnCar: Whether the customer owns a car or not (0: No, 1: Yes)
- NumberOfChildrenVisiting: Total number of children under age 5 planning to take the trip with the customer
- Designation: Designation of the customer in the current organization
- MonthlyIncome: Gross monthly income of the customer

# Import Libraries

In [4]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

%matplotlib inline
import seaborn as sns

import statsmodels.api as sm
import scipy.stats as stats

# Import Data

In [5]:
# Load the CSV file
data = "Tourism.csv"
df = pd.read_csv(data)
tour_df = df.copy()

print(f"There are {tour_df.shape[0]} rows and {tour_df.shape[1]} columns")

There are 4888 rows and 20 columns


In [8]:
tour_df.head()

Unnamed: 0,CustomerID,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
0,200000,1,41.0,Self Enquiry,3,6.0,Salaried,Female,3,3.0,Deluxe,3.0,Single,1.0,1,2,1,0.0,Manager,20993.0
1,200001,0,49.0,Company Invited,1,14.0,Salaried,Male,3,4.0,Deluxe,4.0,Divorced,2.0,0,3,1,2.0,Manager,20130.0
2,200002,1,37.0,Self Enquiry,1,8.0,Free Lancer,Male,3,4.0,Basic,3.0,Single,7.0,1,3,0,0.0,Executive,17090.0
3,200003,0,33.0,Company Invited,1,9.0,Salaried,Female,2,3.0,Basic,3.0,Divorced,2.0,1,5,1,1.0,Executive,17909.0
4,200004,0,,Self Enquiry,1,8.0,Small Business,Male,2,3.0,Basic,4.0,Divorced,1.0,0,5,1,0.0,Executive,18468.0


In [10]:
tour_df.tail()

Unnamed: 0,CustomerID,ProdTaken,Age,TypeofContact,CityTier,DurationOfPitch,Occupation,Gender,NumberOfPersonVisiting,NumberOfFollowups,ProductPitched,PreferredPropertyStar,MaritalStatus,NumberOfTrips,Passport,PitchSatisfactionScore,OwnCar,NumberOfChildrenVisiting,Designation,MonthlyIncome
4883,204883,1,49.0,Self Enquiry,3,9.0,Small Business,Male,3,5.0,Deluxe,4.0,Unmarried,2.0,1,1,1,1.0,Manager,26576.0
4884,204884,1,28.0,Company Invited,1,31.0,Salaried,Male,4,5.0,Basic,3.0,Single,3.0,1,3,1,2.0,Executive,21212.0
4885,204885,1,52.0,Self Enquiry,3,17.0,Salaried,Female,4,4.0,Standard,4.0,Married,7.0,0,1,1,3.0,Senior Manager,31820.0
4886,204886,1,19.0,Self Enquiry,3,16.0,Small Business,Male,3,4.0,Basic,3.0,Single,3.0,0,5,0,2.0,Executive,20289.0
4887,204887,1,36.0,Self Enquiry,1,14.0,Salaried,Male,4,4.0,Basic,4.0,Unmarried,3.0,1,3,1,2.0,Executive,24041.0


## Preliminary Checks


### 1. Is there any missing data?

In [None]:
# Examine the number of missing values in each column

In [6]:
df.isna().sum() 

CustomerID                    0
ProdTaken                     0
Age                         226
TypeofContact                25
CityTier                      0
DurationOfPitch             251
Occupation                    0
Gender                        0
NumberOfPersonVisiting        0
NumberOfFollowups            45
ProductPitched                0
PreferredPropertyStar        26
MaritalStatus                 0
NumberOfTrips               140
Passport                      0
PitchSatisfactionScore        0
OwnCar                        0
NumberOfChildrenVisiting     66
Designation                   0
MonthlyIncome               233
dtype: int64

In [11]:
tour_df.isna().sum().sum()

1012
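
Raw counts are easier to judge as a share of the rows. For example, DurationOfPitch is missing in 251 of 4888 rows, roughly 5.1%. The sketch below shows one way to build that view; `demo` is a tiny made-up frame standing in for `tour_df`, and the column names `n_missing` and `pct_missing` are illustrative choices.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for tour_df (values are illustrative only)
demo = pd.DataFrame({
    "Age": [41.0, 49.0, np.nan, 33.0],
    "TypeofContact": ["Self Enquiry", None, "Self Enquiry", "Company Invited"],
    "ProdTaken": [1, 0, 1, 0],
})

# Count and percentage of missing values per column
missing = pd.DataFrame({
    "n_missing": demo.isna().sum(),
    "pct_missing": (demo.isna().mean() * 100).round(2),
})

# Keep only the columns that actually have gaps, worst first
missing = missing[missing["n_missing"] > 0].sort_values("pct_missing", ascending=False)
print(missing)
```

Running the same two expressions on `tour_df` should give the per-column percentages for the real data.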

### 2. Provide a basic statistical summary of the dataset. What observations can you point out?

In [13]:
tour_df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
CustomerID,4888.0,202443.5,1411.188388,200000.0,201221.75,202443.5,203665.25,204887.0
ProdTaken,4888.0,0.188216,0.390925,0.0,0.0,0.0,0.0,1.0
Age,4662.0,37.622265,9.316387,18.0,31.0,36.0,44.0,61.0
CityTier,4888.0,1.654255,0.916583,1.0,1.0,1.0,3.0,3.0
DurationOfPitch,4637.0,15.490835,8.519643,5.0,9.0,13.0,20.0,127.0
NumberOfPersonVisiting,4888.0,2.905074,0.724891,1.0,2.0,3.0,3.0,5.0
NumberOfFollowups,4843.0,3.708445,1.002509,1.0,3.0,4.0,4.0,6.0
PreferredPropertyStar,4862.0,3.581037,0.798009,3.0,3.0,3.0,4.0,5.0
NumberOfTrips,4748.0,3.236521,1.849019,1.0,2.0,3.0,4.0,22.0
Passport,4888.0,0.290917,0.454232,0.0,0.0,0.0,1.0,1.0


### 3. Fill in missing data based on information from the statistical summary

In [None]:
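# Median age per Designation fills the gaps (assumes 'Destination' was a typo for the Designation column)
df['Age'] = df['Age'].fillna(df.groupby('Designation')['Age'].transform('median'))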
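
The statistical summary hints at which fill value suits each column. DurationOfPitch has a maximum of 127 against a 75th percentile of 20, and NumberOfTrips a maximum of 22 against 4, so the mean is pulled upward and the median is probably the safer choice for numeric columns. A minimal sketch, assuming median fills for numeric columns and the most frequent value for categorical ones; `demo` is a made-up frame, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the tourism data (values are illustrative only)
demo = pd.DataFrame({
    "DurationOfPitch": [6.0, 14.0, np.nan, 9.0, 127.0],
    "TypeofContact": ["Self Enquiry", "Company Invited", "Self Enquiry", None, "Self Enquiry"],
})

numeric_cols = demo.select_dtypes(include="number").columns
categorical_cols = demo.select_dtypes(exclude="number").columns

# Numeric columns: the median is robust to extreme values such as 127
demo[numeric_cols] = demo[numeric_cols].fillna(demo[numeric_cols].median())

# Categorical columns: fill with the most frequent value (the mode)
for col in categorical_cols:
    demo[col] = demo[col].fillna(demo[col].mode().iloc[0])

print(demo)
```

In this toy frame the mean of DurationOfPitch would be 39, while the median is 11.5, which illustrates why a single extreme pitch length can distort a mean-based fill.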

### 4. Check the unique values in each column to assess cardinality and ensure there are no errors. What observations can you point out?
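
One possible way to run this check is to count distinct values per column and then list the labels of every non-numeric column, which is where spelling variants such as `Fe Male` tend to surface. The frame `demo` below is a small made-up sample, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for the tourism data (values are illustrative only)
demo = pd.DataFrame({
    "Gender": ["Female", "Male", "Fe Male", "Male"],
    "MaritalStatus": ["Single", "Divorced", "Unmarried", "Married"],
    "CityTier": [3, 1, 1, 3],
})

# Number of distinct values per column, highest cardinality first
cardinality = demo.nunique().sort_values(ascending=False)
print(cardinality)

# List the labels of each non-numeric column to spot typos and near-duplicates
for col in demo.select_dtypes(exclude="number").columns:
    print(f"{col}: {sorted(demo[col].dropna().unique())}")
```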

### 5. Fix the Gender column

In [14]:
df['Gender'] = df['Gender'].str.replace('Fe Male', 'Female')

In [17]:
df['Gender'].value_counts()

Male      2916
Female    1972
Name: Gender, dtype: int64

### 6. Check the data types.
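
A short sketch of what this check might look like, assuming the goal is to spot columns whose stored type does not match their meaning. CityTier, for instance, is stored as an integer but is described above as ordered with Tier 1 > Tier 2 > Tier 3. The frame `demo` is a made-up sample.

```python
import pandas as pd

# Hypothetical stand-in for the tourism data (values are illustrative only)
demo = pd.DataFrame({
    "CityTier": [3, 1, 1, 2],
    "Occupation": ["Salaried", "Salaried", "Free Lancer", "Small Business"],
    "MonthlyIncome": [20993.0, 20130.0, 17090.0, 17909.0],
})
print(demo.dtypes)

# Ordered category: categories are listed from lowest to highest, so Tier 1 ranks highest
demo["CityTier"] = pd.Categorical(demo["CityTier"], categories=[3, 2, 1], ordered=True)

# Plain (unordered) category for a nominal text column
demo["Occupation"] = demo["Occupation"].astype("category")
print(demo.dtypes)
```

Whether to convert at all is a judgment call: categorical dtypes save memory and make the ordering explicit, but some numeric summaries no longer apply to the converted column.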

## Feature Engineering


### 1. Create 5 bins for the age column. Display the distribution.

In [18]:
df['Age'].quantile([0.25, 0.5, 0.75])

0.25    31.0
0.50    36.0
0.75    44.0
Name: Age, dtype: float64
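
With ages running from 18 to 61 in the summary, five bins could be cut with `pd.cut`. The edges and labels below are an assumption chosen for readability, not the notebook's own choice, and `ages` is a small made-up sample.

```python
import pandas as pd

# Hypothetical sample of ages within the observed 18-61 range
ages = pd.Series([18, 25, 31, 36, 44, 52, 61], name="Age")

# Assumed bin edges; intervals are right-closed, e.g. (17, 25] holds ages 18 to 25
edges = [17, 25, 35, 45, 55, 65]
labels = ["18-25", "26-35", "36-45", "46-55", "56-65"]

age_bin = pd.cut(ages, bins=edges, labels=labels)

# Distribution of customers across the five bins, in bin order
print(age_bin.value_counts().sort_index())
```

Passing `bins=5` instead of explicit edges would give five equal-width bins between the minimum and maximum, at the cost of less readable boundaries.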

### 2. Create 9 bins for the income column. Display the distribution.
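
One option for income is quantile binning with `pd.qcut`, which puts roughly the same number of customers in each bin and so copes better with a skewed distribution than equal-width bins. This is a sketch under that assumption; the incomes and the `Q1` to `Q9` labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# 90 hypothetical, evenly spaced incomes (not taken from the dataset)
income = pd.Series(np.arange(10_000, 55_000, 500), name="MonthlyIncome")

# Nine quantile-based bins with illustrative labels
income_bin = pd.qcut(income, q=9, labels=[f"Q{i}" for i in range(1, 10)])

# Each bin should hold about the same number of customers
counts = income_bin.value_counts().sort_index()
print(counts)
```

If equal-width bins are preferred, `pd.cut(income, bins=9)` is the direct counterpart.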

## Create Custom Plot Functions

### Combo plot with boxplot and histogram

In [12]:
# Plot a combined boxplot and histogram for univariate analysis of a continuous variable
def dist_box(data):
    """Draw a boxplot above a histogram of `data`, marking the mean, median, and mode."""
    name = data.name.upper()
    sns.set_theme(style="white")  # white background; set before the figure is created
    fig, (ax_box, ax_dis) = plt.subplots(
        nrows=2,
        sharex=True,
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=(8, 5),
    )
    mean = data.mean()
    median = data.median()
    mode = data.mode().tolist()[0]  # first mode if there are ties
    fig.suptitle("SPREAD OF DATA FOR " + name, fontsize=18, fontweight="bold")
    sns.boxplot(x=data, showmeans=True, orient="h", color="blue", ax=ax_box)
    ax_box.set(xlabel="")
    sns.despine(top=True, right=True, left=True)  # remove the top, right, and left spines
    # histplot replaces distplot, which is deprecated in recent seaborn releases
    sns.histplot(data, kde=False, color="black", ax=ax_dis)
    ax_dis.axvline(mean, color="r", linestyle="--", linewidth=2, label="Mean")
    ax_dis.axvline(median, color="g", linestyle="-", linewidth=2, label="Median")
    ax_dis.axvline(mode, color="orange", linestyle="-", linewidth=2, label="Mode")
    ax_dis.legend()

### Barplot with percentage labeled

In [13]:
def bar_plot_with_percentage(data, column, title):
    '''
    This function takes a dataframe and a column name and plots a bar chart with percentage.
    '''
    total = len(data[column])
    plt.figure(figsize=(10,5))
    ax = sns.countplot(x=column, data=data, palette='viridis')
    plt.title(title)
    for p in ax.patches:
        height = p.get_height()
        ax.text(p.get_x()+p.get_width()/2.,
                height + 3,
                '{:.1%}'.format(height/total),  # share of all rows, shown as a percentage
                ha='center')
    plt.show()

### Stacked Bar Chart

In [14]:
def stacked_bar_chart(data, column1, column2, title):
    '''
    This function takes a dataframe and two column names and plots a stacked bar chart.
    '''
    cross_tab = pd.crosstab(data[column1], data[column2], normalize='index')
    cross_tab.plot(kind='bar', stacked=True, figsize=(10,5))
    plt.title(title)
    plt.show()

## Univariate Analysis

### 1. What is the distribution of age among the customers?

### 2. What is the distribution of the duration of the pitch?

### 3. What is the distribution of monthly income among the customers?

### 4. What is the distribution of the number of trips among the customers?

### 5. Check the proportion of outliers in each numerical column.
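
A common convention, assumed here rather than prescribed by the notebook, flags a value as an outlier when it lies more than 1.5 times the interquartile range (IQR) beyond the quartiles. The helper name `outlier_ratio` and the sample frame `demo` are illustrative.

```python
import pandas as pd

def outlier_ratio(series: pd.Series, k: float = 1.5) -> float:
    """Share of non-missing values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    s = series.dropna()
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    is_outlier = (s < q1 - k * iqr) | (s > q3 + k * iqr)
    return float(is_outlier.mean())

# Hypothetical sample; the 127 and 22 echo the maxima seen in the summary table
demo = pd.DataFrame({
    "DurationOfPitch": [6, 8, 9, 9, 13, 14, 16, 17, 20, 127],
    "NumberOfTrips": [1, 2, 2, 3, 3, 3, 4, 4, 7, 22],
})

# One ratio per numeric column
ratios = demo.apply(outlier_ratio)
print(ratios)
```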

### 6. Perform outlier treatment.
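
One possible treatment is to cap extreme values at the boxplot whiskers instead of dropping rows, which keeps every customer in the analysis. This is a sketch of that approach, not necessarily the treatment the notebook intends; `clip_to_whiskers` and the sample values are hypothetical.

```python
import pandas as pd

def clip_to_whiskers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values at Q1 - k*IQR and Q3 + k*IQR; missing values stay missing."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Hypothetical pitch durations with one extreme value
pitch = pd.Series(
    [6.0, 8.0, 9.0, 9.0, 13.0, 14.0, 16.0, 17.0, 20.0, 127.0],
    name="DurationOfPitch",
)

treated = clip_to_whiskers(pitch)
print(treated.describe())
```

Capping changes the tail of the distribution, so it is worth comparing the summary statistics before and after treatment.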

### 7. Explore the number of people visiting. What observations can you point out?

### 8. Check the distribution of occupation.
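
For a categorical column such as Occupation, a frequency table with counts and shares is a quick numeric companion to a bar plot. The sample below is made up and only reuses occupation labels that appear in the preview rows.

```python
import pandas as pd

# Hypothetical sample using labels seen in the data preview
occupation = pd.Series(
    ["Salaried", "Salaried", "Free Lancer", "Salaried",
     "Small Business", "Small Business", "Salaried"],
    name="Occupation",
)

# Counts and percentage share per category, most frequent first
summary = pd.DataFrame({
    "count": occupation.value_counts(),
    "share_pct": (occupation.value_counts(normalize=True) * 100).round(1),
})
print(summary)
```

The same pattern applies to the other categorical questions in this section by swapping the column.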

### 9. What is the distribution of ‘CityTier’? What observations can you point out?

### 10. What is the distribution of gender? What observations can you point out?

### 11. What is the distribution of follow ups? What observations can you point out?

### 12. What is the distribution of the products pitched? What observations can you point out?

### 13. What is the distribution of the property ratings? What observations can you point out?

### 14. What is the distribution of type of contact? What observations can you point out?


### 15. What is the distribution of the marital status? What observations can you point out?

### 16. What is the distribution of the passport status? What observations can you point out?

### 17. What is the distribution of the pitch satisfaction score? What observations can you point out?

### 18. What is the distribution of customer designations? What observations can you point out?

### 19. How many customers bought the product vs did not buy the product? What observations can you point out?

### 20. What is the distribution of the number of children? What observations can you point out?

## Bivariate & Multivariate Analysis

### 1. Explore the conversion rate based on the number of people visiting.

### 2. Explore the conversion rate based on the number of follow-ups.

### 3. Explore the conversion rate based on marital status.

### 4. Explore the conversion rate based on passport status.

### 5. Explore the conversion rate based on the product pitched.
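
The conversion-rate questions above share one computation: because ProdTaken is coded 0/1, its mean within a group is that group's conversion rate. The overall mean of 0.188 in the summary table (about 18.8%) is a useful baseline to compare against. A minimal sketch, with a hypothetical helper `conversion_rate` and made-up rows that are not findings from the dataset:

```python
import pandas as pd

def conversion_rate(df: pd.DataFrame, by: str, target: str = "ProdTaken") -> pd.DataFrame:
    """Customers, buyers, and conversion rate for each level of `by`."""
    out = df.groupby(by)[target].agg(customers="count", buyers="sum", rate="mean")
    return out.sort_values("rate", ascending=False)

# Hypothetical rows, for illustration only
demo = pd.DataFrame({
    "Passport": [1, 1, 1, 0, 0, 0, 0, 0],
    "ProdTaken": [1, 1, 0, 0, 0, 1, 0, 0],
})

rates = conversion_rate(demo, "Passport")
print(rates)
```

Passing `"MaritalStatus"`, `"ProductPitched"`, `"NumberOfFollowups"`, or `"NumberOfPersonVisiting"` as `by` would cover the remaining questions.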

### 6. Explore the relationship between the duration of the pitch and if a product was bought.

### 7. Explore the relationship between a customer's monthly income and if a product was bought.


### 8. Explore the relationship between a customer's monthly income, if a product was bought, and their designation.
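
A pivot table is one way to look at three variables at once: designation on the rows, purchase outcome on the columns, and median income in the cells. The sketch below reuses the five rows shown by `tour_df.head()`, so it only demonstrates the mechanics; five rows say nothing about the full dataset.

```python
import pandas as pd

# The five preview rows from tour_df.head(), reduced to the three columns of interest
demo = pd.DataFrame({
    "Designation": ["Manager", "Manager", "Executive", "Executive", "Executive"],
    "ProdTaken": [1, 0, 1, 0, 0],
    "MonthlyIncome": [20993.0, 20130.0, 17090.0, 17909.0, 18468.0],
})

# Median income per designation, split by whether the product was bought
income_by_group = demo.pivot_table(
    index="Designation",
    columns="ProdTaken",
    values="MonthlyIncome",
    aggfunc="median",
)
print(income_by_group)
```

The median is used here as an assumption, on the grounds that income distributions are often skewed; `aggfunc="mean"` is the obvious alternative.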

### 9. Explore the relationship between a customer's age and if a product was bought.


### 10. Create a heatmap for all the numerical columns. What observations can you point out?
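
A heatmap of this kind is usually built from the pairwise correlation matrix of the numeric columns. The sketch below computes that matrix on the first four preview rows and wraps the plotting in a hypothetical helper, `plot_corr_heatmap`, which is defined but not called so the block runs without opening a figure. The color map and upper-triangle mask are styling assumptions.

```python
import numpy as np
import pandas as pd

# First four preview rows from tour_df.head(), reduced to a few columns
demo = pd.DataFrame({
    "Age": [41.0, 49.0, 37.0, 33.0],
    "DurationOfPitch": [6.0, 14.0, 8.0, 9.0],
    "MonthlyIncome": [20993.0, 20130.0, 17090.0, 17909.0],
    "Occupation": ["Salaried", "Salaried", "Free Lancer", "Salaried"],
})

# Pearson correlation between numeric columns only
corr = demo.select_dtypes(include="number").corr()
print(corr.round(2))

def plot_corr_heatmap(corr: pd.DataFrame):
    """Draw an annotated heatmap, hiding the redundant upper triangle."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    mask = np.triu(np.ones_like(corr, dtype=bool), k=1)
    fig, ax = plt.subplots(figsize=(8, 6))
    sns.heatmap(corr, mask=mask, annot=True, fmt=".2f",
                cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
    return ax
```

Calling `plot_corr_heatmap(corr)` followed by `plt.show()` renders the figure. Binary columns such as ProdTaken and Passport can be included as well, since they are stored as 0/1.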

## Conclusions

## Recommendations