<h2>Pretty Table</h2>

In [0]:
# Summarize the tuned models and their NDCG scores in a PrettyTable
# https://ptable.readthedocs.io/en/latest/tutorial.html
from prettytable import PrettyTable

print("")
print("\t\t\t MACHINE LEARNING MODELS")
T = PrettyTable()
T.field_names = ["Classifiers","Best-Hyper-Parameters", "NDCG_Score"]
T.add_row(["Logistic Regression","C: 0.001, penalty: l2", "0.8295"])
T.add_row(["------","-----------","------"])
T.add_row(["Linear Kernel SVM","C: 0.01, gamma: 0.125", "0.82763"])
T.add_row(["------","-----------","------"])
T.add_row(["DecisionTree ","max_depth: 5, min_samples_split: 50", "0.8288"])
T.add_row(["------","------------","------"])
T.add_row(["Random Forest ","max_depth: 20, n_estimators: 1200", "0.83175"])
T.add_row(["------","------------","------"])
T.add_row(["XG-Boost[ES] ","max_depth: 6, n_estimators: 1624", "0.8385"])
print(T)


			 MACHINE LEARNING MODELS
+---------------------+-------------------------------------+------------+
|     Classifiers     |        Best-Hyper-Parameters        | NDCG_Score |
+---------------------+-------------------------------------+------------+
| Logistic Regression |        C: 0.001, penalty: l2        |   0.8295   |
|        ------       |             -----------             |   ------   |
|  Linear Kernel SVM  |        C: 0.01, gamma: 0.125        |  0.82763   |
|        ------       |             -----------             |   ------   |
|    DecisionTree     | max_depth: 5, min_samples_split: 50 |   0.8288   |
|        ------       |             ------------            |   ------   |
|    Random Forest    |  max_depth: 20, n_estimators: 1200  |  0.83175   |
|        ------       |             ------------            |   ------   |
|    XG-Boost[ES]     |   max_depth: 6, n_estimators: 1624  |   0.8385   |
+---------------------+-------------------------------------+------------+

<h2>Steps Followed to solve this problem</h2>

 **1. Problem Definition:** This includes clearly understanding the problem being solved.
  *  Airbnb is an **online marketplace** and hospitality service that enables people to lease or rent short-term lodging, including vacation rentals, apartment rentals, homestays, hostel beds, and hotel rooms.
  * New users on Airbnb can book a place to stay in 34,000+ cities across 190+ countries. By accurately predicting where a new user will book their first travel experience, Airbnb can share more personalized content with their community, decrease the **average time** to first booking, and better **forecast demand**. We need to **predict** the **first travel destination** of a new user based on their profile and activity data.

 **2. Gathering Data:** The quality and quantity of data that you gather will determine how good your predictive model can be.
  * The dataset contains a list of users along with their demographics, web session records, and some summary statistics to predict which country a new user's first booking destination will be. All the users in this dataset are from the USA.
  * There are 12 possible outcomes of the destination country: 'US', 'FR', 'CA', 'GB', 'ES', 'IT', 'PT', 'NL','DE', 'AU', 'NDF' (no destination found), and 'other'.
  * Dataset from **Kaggle:** https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings/data
      * Train_users.csv
      * Sessions.csv
      * Countries.csv
      * Age_gender_bkts.csv

**3. Data Analysis and Preprocessing:** This includes gaining insights from the data and cleaning out unwanted data.
   * The training data collected from Kaggle contains **213451 entries** with 16 features from **Train_users.csv** and **1.05 million entries** with 6 features from **Sessions.csv**.
   * This step contains **univariate, bivariate, and multivariate analysis** of the features in the train and session data.
   * **Preprocessing includes:**
      * Checking for **NaN/null** values.
      * Checking for **duplicates** in the data.
      * Removing **unwanted** data from features.
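
The three preprocessing checks above can be sketched with pandas as follows. This is a minimal illustration on a toy table, not the notebook's actual code: the column names, the sample values, and the 15 to 100 age range used to flag "unwanted" values are all assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the user table; column names and values are illustrative only.
users = pd.DataFrame({
    "id": ["u1", "u2", "u2", "u3", "u4"],
    "age": [34.0, 41.0, 41.0, 2014.0, np.nan],
    "gender": ["FEMALE", "MALE", "MALE", "MALE", "-unknown-"],
})

# 1. Count missing values per column.
nan_counts = users.isna().sum()

# 2. Count fully duplicated rows, then drop them.
n_duplicates = int(users.duplicated().sum())
users = users.drop_duplicates().reset_index(drop=True)

# 3. Treat implausible ages as missing (the 15-100 range is an assumed rule).
users["age"] = users["age"].where(users["age"].between(15, 100))

print(nan_counts.to_dict())
print("duplicates removed:", n_duplicates)
print(users)
```

Replacing out-of-range values with NaN rather than dropping the rows keeps every user in the table, so a later imputation or binning step can decide how to handle them.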

 **4. Data preparation:** Converting the data into a format that a machine learning model can understand, and constructing features:
   * Grouping the data by **user_id** and computing the **unique value counts** of categorical features, along with the **mean** and **standard deviation** of their occurrences.
   * **Feature extraction** of the **hour, day, week, month, and year** from the **date_account_created** and **first_active** features for each data point.
   * Extracting a **season feature** and the **difference in seconds** between the account_created and first_active features.
   * Converting all **categorical variables** to **binary-encoded** features.
   * **Feature binning** of the **seconds_elasped** and **Age** features, turning each into a fixed-length vector.
   * Performing **standardization** on the data for modeling.
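
Some of the date-based steps above can be sketched with pandas as follows. This is a rough illustration under stated assumptions: only `date_account_created` is named in the text, so the `timestamp_first_active` column, its raw string format, the output column names, the Northern Hemisphere season mapping, and the age bin edges are all hypothetical.

```python
import pandas as pd

# Hypothetical sample rows; formats and column names are assumptions.
users = pd.DataFrame({
    "date_account_created": ["2014-01-01", "2013-07-15"],
    "timestamp_first_active": ["20140101001558", "20130714093000"],
    "age": [25.0, 62.0],
})

created = pd.to_datetime(users["date_account_created"])
first_active = pd.to_datetime(
    users["timestamp_first_active"], format="%Y%m%d%H%M%S"
)

# Calendar parts of the two timestamps.
features = pd.DataFrame({
    "created_year": created.dt.year,
    "created_month": created.dt.month,
    "created_day": created.dt.day,
    "created_weekday": created.dt.dayofweek,  # Monday = 0
    "active_hour": first_active.dt.hour,
})

# Season from the month: 0 = winter (Dec-Feb), 1 = spring, 2 = summer, 3 = autumn.
features["created_season"] = (created.dt.month % 12) // 3

# Absolute gap between first activity and account creation, in seconds.
features["seconds_gap"] = (created - first_active).dt.total_seconds().abs()

# Bin age into integer bucket indices (bin edges are assumed).
age_bins = [0, 20, 30, 40, 50, 60, 120]
features["age_bin"] = pd.cut(users["age"], bins=age_bins, labels=False)

print(features)
```

The integer bucket index could then be encoded like any other categorical variable; the encoding and standardization steps are not shown here.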

 **5. Feature selection:** The process of selecting the subset of relevant features that contribute most to prediction, for use in model construction.
   * Feature selection keeps the top **70%** of the **original features**, ranked by **feature importance** from a trained **XGBoost** model.
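
One way to read the 70% rule above is "rank the features by importance and keep the top 70% of them". The sketch below assumes the importance scores are already available as a plain array (for example, from a fitted model's `feature_importances_` attribute); the helper name `select_top_features` and the scores are made up for illustration.

```python
import numpy as np

def select_top_features(importances, feature_names, keep_fraction=0.7):
    """Return the names of the most important features, keeping `keep_fraction` of them."""
    importances = np.asarray(importances, dtype=float)
    n_keep = max(1, int(round(keep_fraction * len(importances))))
    # argsort is ascending, so reverse it to rank from most to least important.
    ranked = np.argsort(importances)[::-1]
    # Sort the kept indices so the selected columns stay in their original order.
    kept = sorted(ranked[:n_keep])
    return [feature_names[i] for i in kept]

# Hypothetical importance scores for ten features.
names = [f"f{i}" for i in range(10)]
scores = [0.02, 0.30, 0.01, 0.15, 0.05, 0.20, 0.00, 0.10, 0.04, 0.13]

selected = select_top_features(scores, names)
print(selected)
```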

 **6. Choosing a Model and Methods:** Researchers have created many models over the years; in this case study we experiment with several algorithms.
 * Used the **machine learning classifiers** below:
      *  Logistic Regression 
      *  Linear Kernel SVM      
      *  DecisionTree    
      *  Random Forest    
      *  XG-Boost

 **6. Training:** This is considered the bulk of machine learning: building the machine learning model.
 * Used a **70-30 split** of the data for training and testing the models.
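
A 70-30 split like the one described above can be sketched with scikit-learn's `train_test_split`. The data here is synthetic, and stratifying on the label is an assumption (a common choice for imbalanced classes), not something the original text states.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 4 features, imbalanced class labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.repeat(["NDF", "US", "other", "FR"], [50, 30, 12, 8])

# 70% train / 30% test; stratify keeps class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```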

 **7. Evaluation:** This is where the model is evaluated and its performance is measured.
  * Used **NDCG (Normalized Discounted Cumulative Gain)** as the main metric to measure model performance.
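
When each user has exactly one correct destination, NDCG reduces to a simple formula: the score is 1 / log2(position + 1) if the true country appears at that position in the ranked predictions, and 0 if it does not appear. The sketch below assumes a cutoff of k = 5 ranked predictions per user; the cutoff and the function names are assumptions, not taken from the text above.

```python
import math

def ndcg_at_k(ranked_predictions, true_label, k=5):
    """NDCG for one user with a single relevant label (ideal DCG is 1)."""
    for position, label in enumerate(ranked_predictions[:k], start=1):
        if label == true_label:
            return 1.0 / math.log2(position + 1)
    return 0.0  # true label not among the top-k predictions

def mean_ndcg(all_predictions, true_labels, k=5):
    """Average the per-user NDCG over all users."""
    scores = [ndcg_at_k(p, t, k) for p, t in zip(all_predictions, true_labels)]
    return sum(scores) / len(scores)

# Hypothetical ranked predictions for three users.
predictions = [
    ["NDF", "US", "other", "FR", "IT"],
    ["US", "NDF", "other", "FR", "IT"],
    ["NDF", "US", "other", "FR", "IT"],
]
truth = ["NDF", "NDF", "GB"]

print(mean_ndcg(predictions, truth))
```

A correct first guess scores 1.0, a correct second guess scores about 0.63, and a miss scores 0, so the metric rewards placing the true country as high in the ranking as possible.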

 **8. Improve Results (Parameter Tuning):** This is where hyperparameter tuning is done to obtain the best possible model.
  * Used **GridSearchCV** and **RandomizedSearchCV** for hyperparameter tuning on all models.
  * Tried a range of **hyperparameter values** to increase the NDCG score as much as possible.
  * **Early stopping** was used to avoid overfitting, which also improved performance.
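
A grid search of the kind described above can be sketched with scikit-learn's `GridSearchCV`. This is an illustration only: the data is synthetic, the grid of `C` values is an assumption, and the search scores by accuracy for simplicity rather than by NDCG. Early stopping is not shown.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data standing in for the real features.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Candidate regularization strengths (assumed values for illustration).
param_grid = {"C": [0.001, 0.01, 0.1, 1.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),  # L2 penalty is the default
    param_grid=param_grid,
    scoring="accuracy",  # simplification; the case study reports NDCG
    cv=3,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 4))
```

`RandomizedSearchCV` follows the same pattern but samples a fixed number of parameter combinations instead of trying all of them, which helps when the grid is large.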

 **9. Present results:** The model performance results are presented in the **pretty table** above.

**10. Productionalization and Deployment:** This is where the model is put into operation and made available to predict outputs on real-world data.

**Deployed link:** https://dashboard.heroku.com/apps

**CONCLUSION:**
* In real-world problems, domain knowledge, EDA, and feature engineering matter most.
* More well-structured data can yield better performance from machine learning models.
* Feature extraction and parameter tuning help to improve performance.
* An imbalanced dataset can limit how much performance can be improved.<br>
Finally, our best machine learning model is the XGBoost classifier, which outperforms every other model tested.