# Airline Passenger Satisfaction Analysis

## Table of Contents
1. [Project Overview](#project-overview)
2. [Data Preprocessing](#data-preprocessing)
3. [Exploratory Data Analysis](#exploratory-data-analysis)
4. [Feature Engineering](#feature-engineering)
5. [Model Training](#model-training)
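6. [Model Evaluation](#model-evaluation)
7. [Conclusion](#conclusion)
8. [Key Findings](#key-findings)
9. [Recommendations](#recommendations)
10. [Technologies Used](#technologies-used)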

## Project Overview
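This project predicts airline passenger satisfaction from flight and service-related features such as travel class, type of travel, seat comfort, and departure and arrival delays. It covers preprocessing, exploratory analysis, feature engineering, and a comparison of four tree-based classifiers.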

## Data Preprocessing

### Data Loading and Initial Exploration
```python
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('train.csv')

# Initial data exploration
print(df.head())
print(df.dtypes)
print(df.shape)
df.info()  # info() prints its summary directly and returns None
```

### Handling Missing Values
```python
# Check for missing values
missing_values = df.isnull().sum()
print("Missing Values:\n", missing_values)
```
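
The check above only reports missing values; it does not fill them. One common option is median imputation of numeric columns, sketched below on a made-up toy frame. Which strategy this project used is not stated, so treat the approach and the toy values as assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df; the values are made up for illustration
toy = pd.DataFrame({
    "Departure Delay in Minutes": [0, 12, 5, 240],
    "Arrival Delay in Minutes": [0.0, np.nan, 3.0, 250.0],
})

# Fill gaps in each numeric column with that column's median
numeric_cols = toy.select_dtypes(include="number").columns
toy_filled = toy.copy()
toy_filled[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())

print(toy_filled.isnull().sum())
```

The median is less sensitive than the mean to the long right tail that delay columns tend to have. To avoid leakage, compute it on the training split only and reuse it on the test split.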

### Feature Encoding
```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Categorical columns encoding
cat_cols = ['Type of Travel', 'Customer Type', 'Class', 'satisfaction']
oh_col = ['Gender']

encoding_dict = {}
le = LabelEncoder()
oh = OneHotEncoder(sparse_output=False)  # sparse_output requires scikit-learn >= 1.2

# Label Encoding for categorical columns
for col in cat_cols:
    df[col] = le.fit_transform(df[col])
    encoding_dict[col] = dict(zip(le.classes_, range(len(le.classes_))))

# One-Hot Encoding for Gender
encoded_data = oh.fit_transform(df[oh_col])
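# Reuse df's index so the concat below aligns rows even if df was filtered or reordered
encoded_df = pd.DataFrame(encoded_data, columns=oh.get_feature_names_out(oh_col), index=df.index)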
final_df = pd.concat([df.drop(columns=oh_col), encoded_df], axis=1)
```
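
Because `encoding_dict` stores the label-to-code mapping for each column, encoded values can be translated back into readable labels for plots and reports. A minimal sketch on a toy column follows; the class labels here are assumed for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column; the label values are assumed for illustration
toy = pd.DataFrame({"Class": ["Eco", "Business", "Eco Plus", "Eco"]})

le = LabelEncoder()
toy["Class_encoded"] = le.fit_transform(toy["Class"])

# Same construction as encoding_dict[col] above: label -> integer code
mapping = dict(zip(le.classes_, range(len(le.classes_))))

# Invert the mapping to decode integer codes back to labels
inverse = {code: label for label, code in mapping.items()}
decoded = toy["Class_encoded"].map(inverse)

print(mapping)
print(decoded.tolist())
```

`LabelEncoder` assigns codes in sorted label order, so the integers do not reflect any service-level ranking. If an ordinal meaning matters, an explicit mapping may be preferable.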

## Exploratory Data Analysis

### Visualization Techniques
```python
import seaborn as sns
import matplotlib.pyplot as plt

# Boxen plot (letter-value plot) of arrival delay by satisfaction level
plt.figure(figsize=(15, 8))
sns.boxenplot(data=df, x='satisfaction', y='Arrival Delay in Minutes', hue='satisfaction')
plt.title('Arrival Delay by Satisfaction Level')
plt.show()

# Scatter plot of arrival vs. departure delay, colored by satisfaction
plt.figure(figsize=(15, 10))
sns.scatterplot(data=df, x='Arrival Delay in Minutes', y='Departure Delay in Minutes', hue='satisfaction')
plt.title('Arrival vs. Departure Delay')
plt.show()
```

### Distribution Analysis
```python
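# Histograms of the numeric features (target excluded); grid sized to the column count
num_cols = df.select_dtypes(include='number').columns.drop('satisfaction', errors='ignore')
n_cols = 4
n_rows = int(np.ceil(len(num_cols) / n_cols))
plt.figure(figsize=(5 * n_cols, 4 * n_rows))
for i, column in enumerate(num_cols, 1):
    plt.subplot(n_rows, n_cols, i)
    sns.histplot(df[column], kde=True, bins=30)
    plt.title(f'Distribution of {column}')
plt.tight_layout()
plt.show()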
```

## Feature Engineering

### Interaction Features
```python
from itertools import combinations

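def create_interactions(df, feature_list):
    """Return a copy of df with a product column for every pair in feature_list."""
    df = df.copy()  # avoid mutating the caller's DataFrame
    for feat1, feat2 in combinations(feature_list, 2):
        df[f'{feat1}_x_{feat2}'] = df[feat1] * df[feat2]
    return df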

# Create interaction features
features = create_interactions(df, ['Class', 'Type of Travel', 'Seat comfort'])
```

### Outlier Handling
```python
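def remove_outliers(df, method="zscore", threshold=3):
    """Drop rows where any numeric column lies `threshold` or more standard deviations from its mean."""
    df_clean = df.copy()

    if method == "zscore":
        # Z-scores are only defined for numeric columns
        numeric = df_clean.select_dtypes(include="number")
        z_scores = np.abs((numeric - numeric.mean()) / numeric.std())
        # NaN z-scores (constant columns or missing values) are treated as inliers
        df_clean = df_clean[(z_scores.fillna(0) < threshold).all(axis=1)]
    else:
        raise ValueError(f"Unsupported method: {method}")

    return df_clean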
```
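
The `method` parameter suggests that other rules could be plugged in. One common alternative to z-scores is the interquartile-range (IQR) rule, sketched here as a separate, hypothetical helper `remove_outliers_iqr`. It is not part of the original project, and the toy data is made up.

```python
import pandas as pd

def remove_outliers_iqr(df, factor=1.5):
    """Keep rows whose numeric values all lie within [Q1 - factor*IQR, Q3 + factor*IQR]."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the DataFrame's columns
    within = (numeric >= q1 - factor * iqr) & (numeric <= q3 + factor * iqr)
    return df[within.all(axis=1)]

# Toy data: one extreme delay among otherwise small values
toy = pd.DataFrame({"delay": [0, 2, 3, 4, 5, 6, 500]})
cleaned = remove_outliers_iqr(toy)
print(cleaned)
```

Unlike the z-score rule, the IQR bounds are not inflated by the outliers themselves, which can help on heavy-tailed columns such as delays. Note that rows containing missing numeric values fail the comparison and are dropped by this sketch.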

### Skewness Transformation
```python
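def transform_skewed(df, skewed_cols, method="log"):
    """Return a copy of df with a log1p transform applied to the given skewed columns."""
    df_transformed = df.copy()

    for col in skewed_cols:
        if method == "log":
            # log1p computes log(1 + x), so zeros are safe; values must be greater than -1
            df_transformed[col] = np.log1p(df_transformed[col])

    return df_transformed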
```
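
The list of skewed columns has to come from somewhere. One way to build it is to threshold the sample skewness that pandas reports, as sketched below on synthetic data. The cutoff of 1.0 is a rule of thumb assumed here, not a value taken from the project.

```python
import numpy as np
import pandas as pd

# Synthetic data: one roughly symmetric column and one right-skewed column
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric": rng.normal(size=1000),
    "right_skewed": rng.exponential(scale=10.0, size=1000),
})

# Sample skewness per column; flag columns beyond the assumed cutoff
skewness = toy.skew()
skewed_cols = skewness[skewness.abs() > 1.0].index.tolist()
print(skewed_cols)

# Compare skewness before and after the log1p transform
before = toy["right_skewed"].skew()
after = np.log1p(toy["right_skewed"]).skew()
print(f"skew before: {before:.2f}, after: {after:.2f}")
```

The resulting `skewed_cols` list can be used to select the columns that receive the log transform.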

## Model Training

### Model Selection
```python
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

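# Separate the target from the features; keep numeric columns only,
# since raw string columns are not passed to the models below
X = features.drop(columns=['satisfaction']).select_dtypes(include='number')
y = features['satisfaction']

# Split the data (stratified to preserve the class balance in both splits)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)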

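# Define models. The *_params dicts hold each model's hyperparameters and
# must be defined beforehand (use {} to fall back on library defaults).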
models = {
    'XGBoost': xgb.XGBClassifier(**xgb_params, random_state=42),
    'Random Forest': RandomForestClassifier(**rf_params, random_state=42),
    'LightGBM': lgb.LGBMClassifier(**lgbm_params, random_state=42, verbose=-1),
    'CatBoost': CatBoostClassifier(**cb_params, random_seed=42, verbose=False)
}
```

## Model Evaluation

### Performance Metrics
```python
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report

# Evaluate models
for name, model in models.items():
    model.fit(X_train, y_train)
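    predictions = model.predict(X_test)
    # ROC AUC needs ranking scores, not hard labels: use the positive-class probability
    probabilities = model.predict_proba(X_test)[:, 1]
    
    print(f"{name} Results:")
    print(f"Accuracy: {accuracy_score(y_test, predictions):.4f}")
    print(f"ROC AUC: {roc_auc_score(y_test, probabilities):.4f}")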
    print("Classification Report:")
    print(classification_report(y_test, predictions))
```

### Cross-Validation
```python
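from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Perform cross-validation for one model (XGBoost here; any entry of `models` works)
model = models['XGBoost']
accuracy_list, roc_auc_list = [], []
for train_idx, val_idx in cv.split(X, y):
    # Fold-local names so the hold-out split from train_test_split is not overwritten
    X_tr, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_tr, y_val = y.iloc[train_idx], y.iloc[val_idx]
    
    fold_model = clone(model)  # fresh, unfitted copy for each fold
    fold_model.fit(X_tr, y_tr)
    predictions = fold_model.predict(X_val)
    probabilities = fold_model.predict_proba(X_val)[:, 1]
    
    accuracy_list.append(accuracy_score(y_val, predictions))
    roc_auc_list.append(roc_auc_score(y_val, probabilities))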

# Print cross-validation results
print("Cross-Validation Results:")
print(f"Accuracy: {np.mean(accuracy_list):.4f} ± {np.std(accuracy_list):.4f}")
print(f"ROC AUC: {np.mean(roc_auc_list):.4f} ± {np.std(roc_auc_list):.4f}")
```
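
The same fold-by-fold evaluation can be written more compactly with scikit-learn's `cross_validate`, which accepts several scorers at once. The sketch below uses synthetic data from `make_classification` and a small random forest as stand-ins for the project's `X`, `y`, and models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the project's feature matrix and target
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# The "roc_auc" scorer uses predicted probabilities, not hard labels
scores = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_toy, y_toy, cv=cv, scoring=["accuracy", "roc_auc"],
)

print(f"Accuracy: {scores['test_accuracy'].mean():.4f} ± {scores['test_accuracy'].std():.4f}")
print(f"ROC AUC: {scores['test_roc_auc'].mean():.4f} ± {scores['test_roc_auc'].std():.4f}")
```

`cross_validate` clones the estimator for every fold, so no state carries over between folds.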

## Conclusion
The analysis compares XGBoost, Random Forest, LightGBM, and CatBoost classifiers for predicting passenger satisfaction and examines which flight and service features are associated with it.

## Key Findings
- Feature importance analysis
- Model performance comparison
- Insights into key drivers of passenger satisfaction
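
A feature importance ranking like the one listed above could be produced roughly as follows. This is a sketch on synthetic data with a random forest; the feature names are placeholders, and the impurity-based `feature_importances_` attribute is only one of several possible importance measures.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the project's features; column names are placeholders
X_arr, y_toy = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=42
)
X_toy = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(5)])

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_toy, y_toy)

# Impurity-based importances sum to 1; sort them for a ranked view
importances = pd.Series(model.feature_importances_, index=X_toy.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor high-cardinality features, so permutation importance on held-out data is a useful cross-check.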

## Recommendations
1. Focus on reducing delays
2. Improve in-flight services
3. Enhance customer experience based on key features

## Technologies Used
- Python
- Pandas
- NumPy
- Scikit-learn
- XGBoost
- LightGBM
- CatBoost
- Seaborn
- Matplotlib