<div style="text-align:center; border-radius:15px; padding:15px; color:white; margin:0; font-family: 'Orbitron', sans-serif; background: #2E0249; background: #11001C; box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.3); overflow:hidden; margin-bottom: 1em;">
    <div style="font-size:150%; color:#FEE100"><b>Hajj Pilgrimage Data Analysis and Predictor Notebook</b></div>
    <div>This notebook was created with the help of <a href="https://devra.ai/ref/kaggle" style="color:#6666FF">Devra AI</a></div>
</div>

## Table of Contents

- [Introduction](#Introduction)
- [Data Loading and Inspection](#Data-Loading-and-Inspection)
- [Data Cleaning and Preprocessing](#Data-Cleaning-and-Preprocessing)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
- [Prediction Model](#Prediction-Model)
- [Conclusions and Future Work](#Conclusions-and-Future-Work)


## Introduction

In this notebook we dive into the synthetic Hajj pilgrimage dataset. The Hajj is one of the largest annual gatherings in the world, and this dataset provides insight into various characteristics of the pilgrims such as their country, gender, age group, accommodation type, transport type, estimated spending in Saudi Riyal, and group sizes. The aim here is to explore the data, perform necessary cleaning and preprocessing, visualize distributions and relationships, and build a regression predictor for estimating spending. If you find this analysis useful, upvote it.

Let's get started with our journey into the data.

In [None]:
# Import necessary libraries
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt  # the backend is set by %matplotlib inline below

import seaborn as sns

# For predictive modeling
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Ensure inline plotting when appropriate
%matplotlib inline

# Set visualization style
sns.set(style="whitegrid")

## Data Loading and Inspection

In this section we load the synthetic Hajj dataset and inspect the first few rows and the column data types. The file is read as comma-delimited ASCII text.
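
Strict ASCII decoding raises an error as soon as the file contains a single non-ASCII byte. The following is a minimal sketch of a fallback, assuming UTF-8 is an acceptable second choice. The helper name `read_csv_with_fallback` and the demo file are hypothetical and not part of the original notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_csv_with_fallback(path, encodings=("ascii", "utf-8")):
    """Try each encoding in turn and return (DataFrame, encoding_used)."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, delimiter=",", encoding=enc), enc
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next encoding
    raise last_error


# Demo on a tiny hypothetical file that contains a non-ASCII character.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_bytes("Country,Group_Size\nTürkiye,4\n".encode("utf-8"))
    demo_df, used = read_csv_with_fallback(demo_path)

print("Encoding used:", used)
print(demo_df)
```

For the real file, the same call would be `read_csv_with_fallback('synthetic_hajj_dataset.csv')`.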

In [None]:
# Load the dataset
df = pd.read_csv('synthetic_hajj_dataset.csv', delimiter=',', encoding='ascii')

# Display the first few rows of the dataframe
print('First five rows of the dataset:')
print(df.head())

# Display the dataframe info to inspect column types
# (df.info() prints directly and returns None, so it is not wrapped in print)
print('\nDataframe Info:')
df.info()

## Data Cleaning and Preprocessing

We now perform data cleaning and preprocessing. The dataset consists mostly of categorical columns holding string values, plus a few numeric columns. There is no date column in this dataset; if there were, we would parse it to a datetime type.
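
Since the dataset has no date column, the parsing step is not exercised in this notebook. As a sketch of what it could look like, the example below uses a hypothetical `Arrival_Date` column with made-up values and an assumed `YYYY-MM-DD` format.

```python
import pandas as pd

# Hypothetical data: the real dataset has no date column.
demo = pd.DataFrame({"Arrival_Date": ["2024-06-10", "2024-06-12", "not a date"]})

# errors="coerce" turns unparseable strings into NaT instead of raising.
demo["Arrival_Date"] = pd.to_datetime(
    demo["Arrival_Date"], format="%Y-%m-%d", errors="coerce"
)

print(demo.dtypes)
print("Unparseable dates:", demo["Arrival_Date"].isna().sum())
```

Coercing to `NaT` keeps bad values visible to the missing-value check rather than stopping the load.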

We also check for missing values and duplicate rows.

In [None]:
# Check for missing values
print('Missing values in each column:')
print(df.isnull().sum())

# Check for duplicate rows
print('\nNumber of duplicate rows:', df.duplicated().sum())

# Drop duplicates if necessary
df = df.drop_duplicates()

# Convert columns with categorical information to category dtype if appropriate
categorical_columns = ['Country', 'Gender', 'Age_Group', 'Accommodation_Type', 'Transport_Type', 'Stay_Duration']
for col in categorical_columns:
    df[col] = df[col].astype('category')

# Check data types after conversion
print('\nData types after conversion:')
print(df.dtypes)

## Exploratory Data Analysis

Here we explore the distributions of, and relationships between, the columns in the dataset. We use histograms and a pair plot for the numeric columns, count plots (bar charts) for the categorical columns, and box and violin plots to compare a numeric column across categories.

Note that a correlation heatmap should be computed on the numeric columns only. Here there are just three numeric columns (Pilgrim_ID, Estimated_Spending_SAR, Group_Size), and Pilgrim_ID is only an identifier, so we skip the correlation heatmap: it would not be very insightful.
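
For a dataset with more numeric columns, the numeric filter and the heatmap could be sketched as follows. The data here is randomly generated stand-in data: the column names mirror the dataset, but the values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical stand-in data; values are random, not taken from the dataset.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Gender": rng.choice(["Female", "Male"], size=200),
    "Estimated_Spending_SAR": rng.normal(9000, 1500, size=200),
    "Group_Size": rng.integers(1, 10, size=200),
})

# Keep numeric columns only, then compute pairwise Pearson correlations.
corr = demo.select_dtypes(include=[np.number]).corr()

fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation heatmap (demo data)")
fig.tight_layout()
plt.show()
```

Fixing `vmin` and `vmax` at -1 and 1 keeps the color scale comparable between runs.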

In [None]:
# Select numeric columns
numeric_df = df.select_dtypes(include=[np.number])

# Plot histograms for numeric columns
for col in numeric_df.columns:
    plt.figure(figsize=(8, 4))
    sns.histplot(numeric_df[col], kde=True)
    plt.title(f'Histogram of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    plt.tight_layout()
    plt.show()

# Generate a pairplot for numeric features (even though we have only three columns)
sns.pairplot(numeric_df)
plt.show()

In [None]:
# Plot count plots for categorical data
for col in categorical_columns:
    plt.figure(figsize=(10, 4))
    sns.countplot(data=df, x=col, order=df[col].value_counts().index)
    plt.title(f'Count Plot for {col}')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

# Box plot for numeric columns vs. categorical group: Estimated_Spending_SAR by Gender
plt.figure(figsize=(8, 6))
sns.boxplot(data=df, x='Gender', y='Estimated_Spending_SAR')
plt.title('Estimated Spending by Gender')
plt.tight_layout()
plt.show()

# Violin plot for numeric columns vs. categorical group: Group_Size by Accommodation_Type
plt.figure(figsize=(10, 6))
sns.violinplot(data=df, x='Accommodation_Type', y='Group_Size')
plt.title('Group Size by Accommodation Type')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## Prediction Model

It might be useful to predict the Estimated Spending of a pilgrim based on other features. Here we build a regression predictor to estimate the 'Estimated_Spending_SAR'. We use a RandomForestRegressor after encoding the categorical features. 

Note: For simplicity, the categorical variables are encoded as integer codes (label encoding). This imposes an arbitrary order on the categories; in a production scenario, one might consider alternatives such as one-hot encoding.
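
As an illustration of the one-hot alternative, here is a small sketch using `pd.get_dummies`. The rows are hypothetical and only the column names are borrowed from the dataset.

```python
import pandas as pd

# Hypothetical rows; column names mirror the dataset.
demo = pd.DataFrame({
    "Gender": ["Female", "Male", "Male"],
    "Transport_Type": ["Air", "Bus", "Air"],
    "Group_Size": [2, 5, 3],
})

# One indicator column per category; numeric columns pass through unchanged.
encoded = pd.get_dummies(demo, columns=["Gender", "Transport_Type"], dtype=int)
print(encoded)
```

One-hot encoding widens the feature matrix by one column per category, which can matter for a high-cardinality column such as Country.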

In [None]:
# Create a copy of the dataframe to work on for prediction
df_model = df.copy()

# We need to encode categorical variables. For simplicity, we will use pandas' factorize
for col in categorical_columns:
    df_model[col] = pd.factorize(df_model[col])[0]

# Define the target and features
target = 'Estimated_Spending_SAR'
features = [col for col in df_model.columns if col != target and col != 'Pilgrim_ID']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df_model[features], df_model[target], test_size=0.2, random_state=42)

# Initialize and train the RandomForestRegressor
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42)
rf_regressor.fit(X_train, y_train)

# Predict on the test set
y_pred = rf_regressor.predict(X_test)

# Evaluate the model using R^2 score
r2 = r2_score(y_test, y_pred)
print(f"R^2 Score of the Regression Model: {r2:.3f}")

# Visualize the forest's built-in (impurity-based) feature importances
importances = rf_regressor.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(10, 6))
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [features[i] for i in indices])
plt.xlabel('Feature Importance')
plt.title('Impurity-Based Feature Importance')
plt.tight_layout()
plt.show()
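
The `feature_importances_` attribute of a random forest holds impurity-based importances computed on the training data. A permutation-based estimate on held-out data can be sketched with scikit-learn's `permutation_importance`. The example below runs on synthetic data from `make_regression`, which is an assumption made to keep it self-contained; it does not use the Hajj dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 5 features, of which only 2 carry signal.
X, y = make_regression(
    n_samples=300, n_features=5, n_informative=2, noise=5.0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Shuffle each feature on the held-out set and record the drop in R^2.
result = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=42
)

# Report features from most to least important.
for i in np.argsort(result.importances_mean)[::-1]:
    mean, std = result.importances_mean[i], result.importances_std[i]
    print(f"feature_{i}: {mean:.3f} +/- {std:.3f}")
```

With the notebook's own objects, the equivalent call would be `permutation_importance(rf_regressor, X_test, y_test, n_repeats=10, random_state=42)`.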

## Conclusions and Future Work

This notebook provided an exploratory analysis of the synthetic Hajj dataset, followed by building a regression predictor that estimates the spending of pilgrims. The approach covered data inspection, preprocessing, a spectrum of visualizations, and a machine learning model. 

Merits of this approach include:

- A thorough examination of both categorical and numeric features using varied plot types.
- Checking for common data-quality issues (missing values, duplicate rows, and how a date column would be handled) even if they are not present in this dataset.
- A simple baseline regression model, evaluated with the R^2 score on a held-out test set.

For future analysis, one might: 

- Explore advanced feature engineering (e.g., interaction terms between categorical features).
- Compare different regression models or even try a classification approach based on spending brackets.
- Incorporate external data sources to enhance the richness of the analysis.
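
As a starting point for the classification idea, spending could be binned into brackets with `pd.qcut`. This is a sketch under assumptions: the spending values are randomly generated, and the choice of three brackets and their labels is illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical spending values in SAR, generated at random.
rng = np.random.default_rng(0)
spending = pd.Series(
    rng.uniform(3000, 20000, size=300), name="Estimated_Spending_SAR"
)

# Three equal-frequency brackets; the labels are illustrative.
brackets = pd.qcut(spending, q=3, labels=["low", "medium", "high"])
print(brackets.value_counts())
```

The resulting categorical column could then serve as the target for a classifier such as `RandomForestClassifier`.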
