# Rehal Travel Recommendation System

---
## **1. Team Members**
- **Refal Abutheeb** (Student ID: 443200792)
- **Raghad Alqahtani** (Student ID: 443200549)
- **Alanoud Almuzayrie** (Student ID: 443201087)
- **Reema Alsurayea** (Student ID: 443203093)
- **Amal Alharbi** (Student ID: 442202306)

---
## **2. Introduction**
Planning a trip can be overwhelming, especially when trying to balance budget, timing, and preferences. "Rehal" is a smart travel recommendation system designed to simplify decision-making by providing personalized travel suggestions based on flight, hotel, and user data.

---
## **3. Motivation**
- Offer users a one-stop system for data-driven travel recommendations.
- Help travelers optimize their budget and time.
- Provide useful insights using real-world travel data.

The dataset was chosen because it simulates real corporate travel systems with detailed flight, hotel, and user data, enabling meaningful analysis and recommendations.

---
## **4. Problem Statement**
Travelers often face difficulties in:
1. Identifying cost-effective travel options.
2. Planning trips that align with their preferences and budget.
3. Finding consolidated travel information.

---
## **5. Goal (Purpose of Collecting the Dataset)**
The dataset was collected to support the development of a smart travel recommendation system that can:
- Provide users with personalized travel suggestions.
- Offer insights into optimal travel timings, destinations, and accommodations.
- Identify travel trends and patterns, improving trip planning and budgeting.

---
## **6. Scope**
The scope of the Rehal Travel Recommendation System includes:
- Analyzing flight, hotel, and user data to provide personalized travel recommendations.
- Focusing on budget optimization, seasonal travel trends, and user preferences.
- Processing structured (tabular) travel data only; Arabic text and images are out of scope.
- Ensuring data quality through exploration and analysis.

The system does **not** include:
- Real-time travel booking or reservation features.
- Integration with third-party travel services.

---
## **7. Dataset Overview**
The dataset, sourced from **Kaggle**, simulates real-world corporate travel systems and contains information on flights, hotels, and users.
- **Dataset Source:** [Kaggle - Argo Datathon 2019](https://www.kaggle.com/datasets/leomauro/argodatathon2019/data)

---
## **8. Dataset Details**
**1. flights.csv**
- **Number of rows:** 271,888
- **Number of columns:** 10

**Features:**
- `travelCode`: Travel identifier (integer)
- `userCode`: User identifier (integer)
- `from`: Departure location (string)
- `to`: Destination location (string)
- `flightType`: Type of flight (string, e.g., first class)
- `price`: Price of the flight (float)
- `time`: Duration of the flight (float, hours)
- `distance`: Distance between locations (float)
- `agency`: Travel agency name (string)
- `date`: Travel date (string)

**Sample Data:**
| travelCode | userCode | from   | to     | flightType | price  | time | distance | agency   | date       |
|------------|----------|--------|--------|------------|--------|------|----------|----------|------------|
| 1001       | 2002     | Riyadh | Dubai  | Business   | 1234.5 | 2.5  | 1200.0   | AgencyX  | 2023-05-15 |

**2. hotels.csv**
- **Number of rows:** 40,552
- **Number of columns:** 8

**Features:**
- `travelCode`: Travel identifier (integer)
- `userCode`: User identifier (integer)
- `name`: Hotel name (string)
- `place`: Hotel location (string)
- `days`: Number of days booked (integer)
- `price`: Price per day (float)
- `total`: Total booking price (float)
- `date`: Booking date (string)

**Sample Data:**
| travelCode | userCode | name        | place | days | price | total  | date       |
|------------|----------|-------------|-------|------|-------|--------|------------|
| 1001       | 2002     | Grand Hotel | Dubai | 5    | 200.0 | 1000.0 | 2023-05-10 |
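
The sample row suggests that `total` equals `days * price` (5 × 200.0 = 1000.0). If that relationship is expected to hold for every booking, which is an assumption rather than something the dataset documentation confirms here, a quick consistency check could look like the sketch below. The helper name `find_total_mismatches` and the demo rows are hypothetical.

```python
import pandas as pd

def find_total_mismatches(hotels: pd.DataFrame, tolerance: float = 0.01) -> pd.DataFrame:
    """Return rows where `total` differs from `days * price` by more than `tolerance`."""
    expected = hotels["days"] * hotels["price"]
    mismatch = (hotels["total"] - expected).abs() > tolerance
    return hotels[mismatch]

# Illustrative rows only; the second one is deliberately inconsistent.
sample_hotels = pd.DataFrame({
    "days": [5, 2],
    "price": [200.0, 150.0],
    "total": [1000.0, 350.0],
})
mismatches = find_total_mismatches(sample_hotels)
print(mismatches)
```

An empty result on the real `hotels.csv` would support treating `total` as a derived column.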

**3. users.csv**
- **Number of rows:** 1,340
- **Number of columns:** 5

**Features:**
- `code`: Unique user identifier (integer)
- `company`: Company the user belongs to (string)
- `name`: User's full name (string)
- `gender`: User's gender (string)
- `age`: User's age (integer)

**Sample Data:**
| code | company     | name          | gender | age |
|------|-------------|---------------|--------|-----|
| 2002 | TechCorp    | John Doe      | Male   | 35  |

---
## **9. Preprocessing Techniques**

**Missing Values Check:**
- **Method:** We used the `isnull()` method in Pandas.
- **Code:**
```python
print(flights_data.isnull().sum())
print(hotels_data.isnull().sum())
print(users_data.isnull().sum())
```
- **Result:** No missing values found in any dataset.

**Duplicate Rows Check:**
- **Method:** We used the `duplicated()` method in Pandas.
- **Code:**
```python
print(flights_data.duplicated().sum())
print(hotels_data.duplicated().sum())
print(users_data.duplicated().sum())
```
- **Result:** No duplicate rows found in any dataset.
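
The two checks above can be combined into one small summary table. This is a minimal sketch, not the project's actual code: the helper name `quality_report` and the demo frame are hypothetical.

```python
import pandas as pd

def quality_report(datasets: dict) -> pd.DataFrame:
    """Summarize row count, missing cells, and duplicate rows per dataset."""
    rows = []
    for name, df in datasets.items():
        rows.append({
            "dataset": name,
            "rows": len(df),
            "missing_cells": int(df.isnull().sum().sum()),
            "duplicate_rows": int(df.duplicated().sum()),
        })
    return pd.DataFrame(rows).set_index("dataset")

# Demo frame with one missing value and one duplicated row.
demo = pd.DataFrame({"a": [1, 1, None], "b": ["x", "x", "y"]})
report = quality_report({"demo": demo})
print(report)
```

With the project data, the call would presumably be `quality_report({"flights": flights_data, "hotels": hotels_data, "users": users_data})`.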

**Date Column Conversion:**
- **Method:** We used the `pd.to_datetime()` function from Pandas to convert date columns to datetime format.
- **Code:**
```python
flights_data['date'] = pd.to_datetime(flights_data['date'], format='%m/%d/%Y')
hotels_data['date'] = pd.to_datetime(hotels_data['date'], format='%m/%d/%Y')
```
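- **Justification:** Proper date formatting is essential for time-series analysis and data filtering. The format string `%m/%d/%Y` assumes month/day/year strings in the raw files; the sample rows above show ISO-style dates (for example `2023-05-15`), which would need `format='%Y-%m-%d'` instead.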
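
A slightly more defensive variant of the conversion is sketched below. It assumes month/day/year strings, uses `errors="coerce"` so that unparseable values become `NaT` instead of raising, and derives a `month` column that could feed the seasonal analysis mentioned in the scope. The sample strings and the `month` column name are illustrative assumptions.

```python
import pandas as pd

# Made-up values; the last one is intentionally invalid.
raw = pd.DataFrame({"date": ["09/26/2019", "10/03/2019", "not a date"]})

# Unparseable strings become NaT rather than raising an exception.
parsed = pd.to_datetime(raw["date"], format="%m/%d/%Y", errors="coerce")
n_unparsed = int(parsed.isna().sum())

raw["date"] = parsed
raw["month"] = raw["date"].dt.month  # NaN where the date could not be parsed
print(n_unparsed)
```

Counting `n_unparsed` makes silent parsing failures visible before they affect any time-based grouping.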

**Normalization of Continuous Variables:**
- **Method:** We used the `MinMaxScaler()` from the `sklearn.preprocessing` module to normalize the `price`, `time`, and `distance` columns.
- **Code:**
```python
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
flights_data[['price', 'time', 'distance']] = scaler.fit_transform(flights_data[['price', 'time', 'distance']])
```
- **Justification:** Normalization ensures that features with large scales do not dominate those with smaller scales during analysis.
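
Note that the snippet above overwrites `price`, `time`, and `distance` with values in the range 0 to 1, so the original units are lost. One option, sketched below, is to keep the raw columns and store the scaled values in new ones. The sketch applies the min-max formula `(x - min) / (max - min)` directly in Pandas, which is the same formula `MinMaxScaler` uses with its default range; the `_scaled` suffix and the sample values are assumptions.

```python
import pandas as pd

# Made-up values in original units.
flights = pd.DataFrame({
    "price": [500.0, 1000.0, 1500.0],
    "time": [1.0, 1.5, 2.0],
    "distance": [300.0, 600.0, 900.0],
})

cols = ["price", "time", "distance"]
col_min = flights[cols].min()
col_range = flights[cols].max() - col_min

# Min-max scaling; a constant column (range 0) would yield NaN here.
scaled = (flights[cols] - col_min) / col_range
scaled.columns = [f"{c}_scaled" for c in cols]

# Raw columns stay untouched; scaled copies are appended.
flights = pd.concat([flights, scaled], axis=1)
print(flights)
```

Keeping both versions lets statistics be reported in dollars, hours, and kilometers while models use the scaled columns.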

**Cleaning Text Data:**
- **Method:** We used the `str.strip()` and `str.title()` methods in Pandas to clean and standardize text fields.
- **Code:**
```python
flights_data['from'] = flights_data['from'].str.strip().str.title()
flights_data['to'] = flights_data['to'].str.strip().str.title()
hotels_data['place'] = hotels_data['place'].str.strip().str.title()
hotels_data['name'] = hotels_data['name'].str.strip().str.title()
```
- **Justification:** Consistent text formatting improves data integrity and analysis results by reducing variations caused by inconsistent capitalization or whitespace.
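
One caveat with `str.title()` is that it lowercases every letter after the first in each word, so uppercase abbreviations are altered. The short example below shows the effect on made-up values; whether the dataset actually contains such abbreviations would need to be checked.

```python
import pandas as pd

# Hypothetical location strings with stray whitespace and mixed case.
places = pd.Series(["  recife (PE)", "DUBAI ", "riyadh"])

cleaned = places.str.strip().str.title()
print(cleaned.tolist())  # → ['Recife (Pe)', 'Dubai', 'Riyadh']
```

If abbreviations like `(PE)` need to be preserved, applying only `str.strip()` to those columns may be the safer choice.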

---
## **10. Descriptive Statistics and Analysis**

**1. Users Dataset (users.csv)**

**Age Statistics**
- Mean: 42.74
- Median: 42.0
- Variance: 165.63
- Standard Deviation: 12.87

**Interpretation:**
- The mean age of users is approximately 42.74 years.
- The median age is 42, close to the mean, suggesting a fairly symmetrical distribution.
- The standard deviation of 12.87 years (variance 165.63) shows that user ages are spread widely around the mean.
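
Statistics like these can be computed with a single Pandas aggregation. The sketch below uses a hypothetical helper `summarize` on made-up ages. Note that Pandas computes the sample variance and standard deviation (`ddof=1`) by default; whether the reported figures use the sample or the population formula is not stated.

```python
import pandas as pd

def summarize(series: pd.Series) -> pd.Series:
    """Return mean, median, sample variance, and sample standard deviation."""
    return series.agg(["mean", "median", "var", "std"])

# Made-up ages for illustration.
ages = pd.Series([30, 40, 50])
stats = summarize(ages)
print(stats)
```

On the project data, the equivalent call would presumably be `summarize(users_data["age"])`.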

**2. Flights Dataset (flights.csv)**

**Flight Price Statistics**
- Mean: $957.38
- Median: $904.0
- Variance: 131,269.91
- Standard Deviation: 362.31

**Flight Time Statistics**
- Mean: 1.42 hours
- Median: 1.46 hours
- Variance: 0.29
- Standard Deviation: 0.54

**Flight Distance Statistics**
- Mean: 546.96 km
- Median: 562.14 km
- Variance: 43,618.86
- Standard Deviation: 208.85

**Interpretation:**
- The average flight price is $957.38, above the median of $904, which suggests a right-skewed distribution in which some expensive flights pull the mean up.
- Flight durations are short, with a mean of 1.42 hours and a median of 1.46 hours.
- Flight distances follow a similar pattern, with a mean of 546.96 km and a median of 562.14 km.

**3. Hotels Dataset (hotels.csv)**

**Hotel Price per Day Statistics**
- Mean: $214.44
- Median: $242.88
- Variance: 5,889.38
- Standard Deviation: 76.74

**Hotel Stay Duration Statistics**
- Mean: 2.50 days
- Median: 2.0 days
- Variance: 1.25
- Standard Deviation: 1.12

**Interpretation:**
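- The average hotel price per day is $214.44, below the median of $242.88, which suggests a left-skewed distribution in which some cheaper hotels pull the mean down.
- The standard deviation of $76.74 is about 36% of the mean, indicating moderate variation in hotel pricing.
- Hotel stays are short, with a mean of 2.5 days and a median of 2 days.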
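
Because the variables use different units, their standard deviations are not directly comparable. One common way to compare relative spread is the coefficient of variation (standard deviation divided by mean). The sketch below applies it to the figures reported in this section; the dictionary keys are just labels chosen for illustration.

```python
# (mean, standard deviation) pairs copied from the statistics reported above.
reported = {
    "user age": (42.74, 12.87),
    "flight price": (957.38, 362.31),
    "flight time": (1.42, 0.54),
    "flight distance": (546.96, 208.85),
    "hotel price per day": (214.44, 76.74),
    "hotel stay duration": (2.50, 1.12),
}

# Coefficient of variation: unitless measure of relative dispersion.
cv = {name: std / mean for name, (mean, std) in reported.items()}

# Print from least to most variable relative to the mean.
for name, value in sorted(cv.items(), key=lambda item: item[1]):
    print(f"{name}: {value:.2f}")
```

By this measure, hotel stay duration shows the largest relative spread and user age the smallest.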