# Hotel Booking Cancellation Prediction
Dataset Source: The "Hotel Booking Demand" dataset is available on Kaggle.

## Project Steps:

### Data Acquisition:
- Download the dataset from Kaggle.

### Data Exploration:
- Load the dataset using pandas.
- Inspect the first few rows to understand its structure.
- Check for missing values and data types.

### Data Cleaning:
- Handle missing values appropriately (e.g., imputation or removal).
- Convert data types if necessary.

### Feature Engineering:
- Create new features such as total stay duration.
- Encode categorical variables using techniques like one-hot encoding.

### Exploratory Data Analysis (EDA):
- Visualize distributions of key features.
- Analyze correlations between features and the target variable (is_canceled).

### Model Building:
- Split the data into training and testing sets.
- Train classification models (e.g., Logistic Regression, Random Forest).
- Evaluate model performance using metrics like accuracy and AUC-ROC.

### Model Interpretation:
- Identify important features influencing cancellations.
- Visualize feature importances.

### Conclusion:
- Summarize findings and potential actions for hotel management.


In [None]:
# Data Acquisition
import pandas as pd

# Load the dataset
df = pd.read_csv('path_to_dataset/hotel_bookings.csv')

# Data Exploration
df.head()


In [None]:
# Check for missing values and data types
df.info()
df.isnull().sum()


In [None]:
# Data Cleaning
# Handle missing values
# fillna(method='ffill') is deprecated in recent pandas; bfill covers NaNs in the leading rows
df = df.ffill().bfill()

# Convert data types if necessary
# Example: df['column_name'] = df['column_name'].astype('int')
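
Forward-filling copies whatever the previous row happened to contain, so the result depends on row order. A minimal alternative sketch, assuming the dataset has `children`, `country`, and `agent` columns with gaps (the toy frame and the choice of fill values are illustrative, not part of the original notebook):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for hotel_bookings.csv; column names are assumed
toy = pd.DataFrame({
    "children": [0.0, np.nan, 2.0, 1.0],
    "country": ["PRT", None, "GBR", "PRT"],
    "agent": [9.0, np.nan, 240.0, np.nan],
})

cleaned = toy.copy()
# Numeric count: fill with the column median
cleaned["children"] = cleaned["children"].fillna(cleaned["children"].median())
# Categorical: fill with the most frequent value
cleaned["country"] = cleaned["country"].fillna(cleaned["country"].mode().iloc[0])
# ID-like column: use 0 as a sentinel for "no agent" (an assumption about its meaning)
cleaned["agent"] = cleaned["agent"].fillna(0)

print(cleaned)
```

Which strategy fits each column is a judgment call; the point is only that the fill value is chosen per column rather than inherited from a neighboring row.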


In [None]:
# Feature Engineering
# Create new features
df['total_stay_duration'] = df['stays_in_weekend_nights'] + df['stays_in_week_nights']

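# Drop columns recorded after the booking outcome is known (target leakage)
df = df.drop(columns=['reservation_status', 'reservation_status_date'])

# Encode categorical variables (arrival_date_month is a string column as well)
df = pd.get_dummies(df, columns=['hotel', 'arrival_date_month', 'meal', 'country', 'market_segment', 'distribution_channel', 'reserved_room_type', 'assigned_room_type', 'deposit_type', 'customer_type'])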
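
One-hot encoding a high-cardinality column such as `country` can add a very large number of sparse columns. A rough sketch of one way to limit that, collapsing infrequent categories into a single bucket before encoding (the helper `collapse_rare`, the `min_count` threshold, and the toy data are hypothetical):

```python
import pandas as pd

# Toy data; real country codes in the dataset are far more varied
toy = pd.DataFrame({"country": ["PRT", "PRT", "GBR", "GBR", "FRA", "ESP", "PRT"]})

def collapse_rare(series: pd.Series, min_count: int, other_label: str = "Other") -> pd.Series:
    """Replace categories seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    keep = counts[counts >= min_count].index
    return series.where(series.isin(keep), other_label)

toy["country_grouped"] = collapse_rare(toy["country"], min_count=2)
encoded = pd.get_dummies(toy["country_grouped"], prefix="country")

print(encoded.columns.tolist())
```

The threshold trades detail for width: a higher `min_count` gives fewer columns but hides differences between small markets.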


In [None]:
# Exploratory Data Analysis (EDA)
import matplotlib.pyplot as plt
import seaborn as sns

# Visualize distributions of key features
sns.histplot(df['total_stay_duration'])
plt.show()

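# Analyze correlations between features and the target variable (is_canceled)
# numeric_only=True skips any non-numeric columns, which would otherwise raise an error
target_corr = df.corr(numeric_only=True)['is_canceled'].drop('is_canceled')
# Keep the 15 strongest correlations; an annotated heatmap of every column is unreadable
top_corr = target_corr.loc[target_corr.abs().sort_values(ascending=False).index[:15]]
sns.heatmap(top_corr.to_frame(), annot=True, cmap='coolwarm', center=0)
plt.show()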
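
Linear correlation says little about categorical features. For those, a per-category cancellation rate is often easier to read. A small sketch on made-up rows, assuming `deposit_type` and `is_canceled` columns as in the dataset:

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "deposit_type": ["No Deposit", "No Deposit", "Non Refund", "Non Refund", "No Deposit"],
    "is_canceled": [0, 1, 1, 1, 0],
})

# The mean of a 0/1 target within each group is that group's cancellation rate
cancel_rate = (
    toy.groupby("deposit_type")["is_canceled"]
    .mean()
    .sort_values(ascending=False)
)

print(cancel_rate)
```

This has to run on the frame before one-hot encoding, while the categorical column still exists as a single column.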


In [None]:
# Model Building
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Split the data into training and testing sets
X = df.drop('is_canceled', axis=1)
y = df['is_canceled']
# stratify=y keeps the cancellation rate the same in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Train classification models
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
# Fix random_state so the forest is reproducible between runs
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

# Evaluate model performance
log_reg_pred = log_reg.predict(X_test)
rf_clf_pred = rf_clf.predict(X_test)
# AUC-ROC needs scores, not hard labels: use the predicted probability of cancellation
log_reg_proba = log_reg.predict_proba(X_test)[:, 1]
rf_clf_proba = rf_clf.predict_proba(X_test)[:, 1]
print('Logistic Regression Accuracy:', accuracy_score(y_test, log_reg_pred))
print('Random Forest Accuracy:', accuracy_score(y_test, rf_clf_pred))
print('Logistic Regression AUC-ROC:', roc_auc_score(y_test, log_reg_proba))
print('Random Forest AUC-ROC:', roc_auc_score(y_test, rf_clf_proba))
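
A single train/test split gives one estimate that can shift with the random seed. As a hedged sketch, stratified k-fold cross-validation yields a mean and spread for AUC-ROC instead. The example runs on synthetic data from `make_classification`, not the hotel dataset, and the scaling step is an added assumption to help logistic regression converge:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for X and y
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)

# The scaler sits inside the pipeline so it is refit on each training fold
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# scoring="roc_auc" ranks by predicted scores rather than hard labels
auc_scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="roc_auc")

print("AUC-ROC per fold:", auc_scores)
print("Mean:", auc_scores.mean(), "Std:", auc_scores.std())
```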


In [None]:
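# Model Interpretation
# Impurity-based importances from the fitted random forest
importances = rf_clf.feature_importances_
features = X.columns
feature_importances = pd.Series(importances, index=features).sort_values(ascending=False)
# Plot only the top 20; the one-hot encoded frame has too many columns to show at once
top_features = feature_importances.head(20)
sns.barplot(x=top_features.values, y=top_features.index)
plt.show()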
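
Impurity-based importances are computed on the training data and tend to favor features with many distinct values. Permutation importance on held-out data is a common cross-check. A minimal sketch on synthetic data (the feature names and all parameter values are illustrative):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the encoded booking features
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
names = [f"feature_{i}" for i in range(6)]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on the test set and measure the drop in AUC-ROC
result = permutation_importance(
    forest, X_te, y_te, scoring="roc_auc", n_repeats=5, random_state=0
)
perm_importances = pd.Series(result.importances_mean, index=names).sort_values(
    ascending=False
)

print(perm_importances)
```

With one-hot encoded inputs, each dummy column is permuted separately, so the importance of a categorical feature is spread across its dummies.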


## Conclusion
- Summarize findings and potential actions for hotel management.
