# Activity: Build a decision tree<a href="#Activity:-Build-a-decision-tree" class="anchor-link">¶</a>

## Introduction<a href="#Introduction" class="anchor-link">¶</a>

A decision tree model can make predictions for a target based on
multiple features. Because decision trees are used across a wide array
of industries, becoming proficient in the process of building one will
help you expand your skill set in a widely applicable way.

For this activity, you work as a consultant for an airline. The airline
is interested in predicting whether a future customer would be satisfied
with their services given previous customer feedback about their flight
experience. The airline would like you to construct and evaluate a model
that can accomplish this goal. Specifically, they are interested in
knowing which features are most important to customer satisfaction.

The data for this activity includes survey responses from 129,880
customers. It includes data points such as class, flight distance, and
in-flight entertainment, among others. In a previous activity, you
utilized a binomial logistic regression model to help the airline better
understand this data. In this activity, your goal will be to utilize a
decision tree model to predict whether or not a customer will be
satisfied with their flight experience.

Because this activity uses a dataset from the industry, you will need to
conduct basic EDA, data cleaning, and other manipulations to prepare the
data for modeling.

In this activity, you’ll practice the following skills:

-   Importing packages and loading data
-   Exploring the data and completing the cleaning process
-   Building a decision tree model
-   Tuning hyperparameters using `GridSearchCV`
-   Evaluating a decision tree model using a confusion matrix and
    various other plots

## Step 1: Imports<a href="#Step-1:-Imports" class="anchor-link">¶</a>

Import relevant Python packages. Use
`DecisionTreeClassifier`, `plot_tree`, and various imports from
`sklearn.metrics` to build, visualize, and evaluate the model.

### Import packages<a href="#Import-packages" class="anchor-link">¶</a>

In \[1\]:

    # Standard operational package imports
    import pandas as pd
    import numpy as np

    # Important imports for modeling and evaluation
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.tree import plot_tree
    import sklearn.metrics as metrics
    from sklearn.preprocessing import StandardScaler

    # Visualization package imports
    import matplotlib.pyplot as plt
    import seaborn as sns

### Load the dataset<a href="#Load-the-dataset" class="anchor-link">¶</a>

`Pandas` is used to load the **Invistico_Airline.csv** dataset. The
resulting pandas DataFrame is saved in a variable named `df_original`.
As shown in this cell, the dataset has been automatically loaded in for
you. You do not need to download the .csv file, or provide more code, in
order to access the dataset and proceed with this lab. Please continue
with this activity by completing the following instructions.

In \[2\]:

    # RUN THIS CELL TO IMPORT YOUR DATA.
    df_original = pd.read_csv("Invistico_Airline.csv")

### Output the first 10 rows of data<a href="#Output-the-first-10-rows-of-data" class="anchor-link">¶</a>

In \[3\]:

    df_original.head(10)

Out\[3\]:

|     | satisfaction | Customer Type  | Age | Type of Travel  | Class    | Flight Distance | Seat comfort | Departure/Arrival time convenient | Food and drink | Gate location | ... | Online support | Ease of Online booking | On-board service | Leg room service | Baggage handling | Checkin service | Cleanliness | Online boarding | Departure Delay in Minutes | Arrival Delay in Minutes |
|-----|--------------|----------------|-----|-----------------|----------|-----------------|--------------|-----------------------------------|----------------|---------------|-----|----------------|------------------------|------------------|------------------|------------------|-----------------|-------------|-----------------|----------------------------|--------------------------|
| 0   | satisfied    | Loyal Customer | 65  | Personal Travel | Eco      | 265             | 0            | 0                                 | 0              | 2             | ... | 2              | 3                      | 3                | 0                | 3                | 5               | 3           | 2               | 0                          | 0.0                      |
| 1   | satisfied    | Loyal Customer | 47  | Personal Travel | Business | 2464            | 0            | 0                                 | 0              | 3             | ... | 2              | 3                      | 4                | 4                | 4                | 2               | 3           | 2               | 310                        | 305.0                    |
| 2   | satisfied    | Loyal Customer | 15  | Personal Travel | Eco      | 2138            | 0            | 0                                 | 0              | 3             | ... | 2              | 2                      | 3                | 3                | 4                | 4               | 4           | 2               | 0                          | 0.0                      |
| 3   | satisfied    | Loyal Customer | 60  | Personal Travel | Eco      | 623             | 0            | 0                                 | 0              | 3             | ... | 3              | 1                      | 1                | 0                | 1                | 4               | 1           | 3               | 0                          | 0.0                      |
| 4   | satisfied    | Loyal Customer | 70  | Personal Travel | Eco      | 354             | 0            | 0                                 | 0              | 3             | ... | 4              | 2                      | 2                | 0                | 2                | 4               | 2           | 5               | 0                          | 0.0                      |
| 5   | satisfied    | Loyal Customer | 30  | Personal Travel | Eco      | 1894            | 0            | 0                                 | 0              | 3             | ... | 2              | 2                      | 5                | 4                | 5                | 5               | 4           | 2               | 0                          | 0.0                      |
| 6   | satisfied    | Loyal Customer | 66  | Personal Travel | Eco      | 227             | 0            | 0                                 | 0              | 3             | ... | 5              | 5                      | 5                | 0                | 5                | 5               | 5           | 3               | 17                         | 15.0                     |
| 7   | satisfied    | Loyal Customer | 10  | Personal Travel | Eco      | 1812            | 0            | 0                                 | 0              | 3             | ... | 2              | 2                      | 3                | 3                | 4                | 5               | 4           | 2               | 0                          | 0.0                      |
| 8   | satisfied    | Loyal Customer | 56  | Personal Travel | Business | 73              | 0            | 0                                 | 0              | 3             | ... | 5              | 4                      | 4                | 0                | 1                | 5               | 4           | 4               | 0                          | 0.0                      |
| 9   | satisfied    | Loyal Customer | 22  | Personal Travel | Eco      | 1556            | 0            | 0                                 | 0              | 3             | ... | 2              | 2                      | 2                | 4                | 5                | 3               | 4           | 2               | 30                         | 26.0                     |

10 rows × 22 columns

## Step 2: Data exploration, data cleaning, and model preparation<a
href="#Step-2:-Data-exploration,-data-cleaning,-and-model-preparation"
class="anchor-link">¶</a>

### Prepare the data<a href="#Prepare-the-data" class="anchor-link">¶</a>

After loading the dataset, prepare the data to be suitable for decision
tree classifiers. This includes:

-   Exploring the data
-   Checking for missing values
-   Encoding the data
-   Renaming a column
-   Creating the training and testing data

### Explore the data<a href="#Explore-the-data" class="anchor-link">¶</a>

Check the data type of each column. Note that decision trees expect
numeric data.

In \[4\]:

    df_original.dtypes

Out\[4\]:

    satisfaction                          object
    Customer Type                         object
    Age                                    int64
    Type of Travel                        object
    Class                                 object
    Flight Distance                        int64
    Seat comfort                           int64
    Departure/Arrival time convenient      int64
    Food and drink                         int64
    Gate location                          int64
    Inflight wifi service                  int64
    Inflight entertainment                 int64
    Online support                         int64
    Ease of Online booking                 int64
    On-board service                       int64
    Leg room service                       int64
    Baggage handling                       int64
    Checkin service                        int64
    Cleanliness                            int64
    Online boarding                        int64
    Departure Delay in Minutes             int64
    Arrival Delay in Minutes             float64
    dtype: object

### Output unique values<a href="#Output-unique-values" class="anchor-link">¶</a>

The `Class` column is ordinal (meaning there is an inherent order that
is significant). For example, airlines typically charge more for
'Business' than 'Eco Plus' and 'Eco'. Output the unique values in the
`Class` column.

In \[5\]:

    df_original['Class'].unique()

Out\[5\]:

    array(['Eco', 'Business', 'Eco Plus'], dtype=object)

### Check the counts of the predicted labels<a href="#Check-the-counts-of-the-predicted-labels"
class="anchor-link">¶</a>

In order to predict customer satisfaction, verify if the dataset is
imbalanced. To do this, check the counts of each of the predicted
labels.

In \[6\]:

    df_original['satisfaction'].value_counts(dropna = False)

Out\[6\]:

    satisfied       71087
    dissatisfied    58793
    Name: satisfaction, dtype: int64

**Question:** How many satisfied and dissatisfied customers were there?

-   There are 71087 satisfied customers and 58793 dissatisfied
    customers.

**Question:** What percentage of customers were satisfied?

54.7% (71087 / 129880) of customers were satisfied. This value can be
used as a baseline to compare against the decision tree model's accuracy.
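
The comparison with a baseline can be made concrete. The sketch below uses only the counts reported above to compute the accuracy of a hypothetical model that always predicts the majority class; the variable names are illustrative and not part of the lab.

```python
# Counts taken from the value_counts() output above
n_satisfied = 71087
n_dissatisfied = 58793
n_total = n_satisfied + n_dissatisfied

# A model that always predicts "satisfied" is right for every satisfied customer
baseline_accuracy = n_satisfied / n_total
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

A fitted model is only useful if its accuracy is clearly above this baseline.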

### Check for missing values<a href="#Check-for-missing-values" class="anchor-link">¶</a>

Depending on the installed version, the sklearn decision tree
implementation may not support missing values (native support was added
in scikit-learn 1.3). Check for missing values in each column of the
data.

In \[7\]:

    df_original.isnull().sum()

Out\[7\]:

    satisfaction                           0
    Customer Type                          0
    Age                                    0
    Type of Travel                         0
    Class                                  0
    Flight Distance                        0
    Seat comfort                           0
    Departure/Arrival time convenient      0
    Food and drink                         0
    Gate location                          0
    Inflight wifi service                  0
    Inflight entertainment                 0
    Online support                         0
    Ease of Online booking                 0
    On-board service                       0
    Leg room service                       0
    Baggage handling                       0
    Checkin service                        0
    Cleanliness                            0
    Online boarding                        0
    Departure Delay in Minutes             0
    Arrival Delay in Minutes             393
    dtype: int64

**Question:** Why is it important to check how many rows and columns
there are in the dataset?

This is important to check because if there are only a small number of
missing values in the dataset, they can more safely be removed.
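
To judge whether dropping rows is safe, it can help to express the missing values as a share of all rows. The sketch below does this on a small made-up DataFrame (the `toy` values are assumptions for illustration) and then repeats the arithmetic with the counts reported in this activity.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the survey data (values are made up)
toy = pd.DataFrame({
    "Departure Delay in Minutes": [0, 310, 0, 17],
    "Arrival Delay in Minutes": [0.0, 305.0, np.nan, 15.0],
})

# Share of rows that contain at least one missing value
share_missing_rows = toy.isna().any(axis=1).mean()
print(f"Toy data, rows with missing values: {share_missing_rows:.1%}")

# The same calculation with the counts reported in this activity
reported_share = 393 / 129880
print(f"Reported dataset: {reported_share:.2%}")
```

At well under one percent of rows, removing the incomplete records is unlikely to change the overall picture much.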

### Check the number of rows and columns in the dataset<a href="#Check-the-number-of-rows-and-columns-in-the-dataset"
class="anchor-link">¶</a>

In \[8\]:

    df_original.shape

Out\[8\]:

    (129880, 22)

### Drop the rows with missing values<a href="#Drop-the-rows-with-missing-values" class="anchor-link">¶</a>

Drop the rows with missing values and save the resulting pandas
DataFrame in a variable named `df_subset`.

In \[9\]:

    df_subset = df_original.dropna(axis=0).reset_index(drop = True)

### Check for missing values<a href="#Check-for-missing-values" class="anchor-link">¶</a>

Check that `df_subset` does not contain any missing values.

In \[10\]:

    df_subset.isna().sum()

Out\[10\]:

    satisfaction                         0
    Customer Type                        0
    Age                                  0
    Type of Travel                       0
    Class                                0
    Flight Distance                      0
    Seat comfort                         0
    Departure/Arrival time convenient    0
    Food and drink                       0
    Gate location                        0
    Inflight wifi service                0
    Inflight entertainment               0
    Online support                       0
    Ease of Online booking               0
    On-board service                     0
    Leg room service                     0
    Baggage handling                     0
    Checkin service                      0
    Cleanliness                          0
    Online boarding                      0
    Departure Delay in Minutes           0
    Arrival Delay in Minutes             0
    dtype: int64

### Check the number of rows and columns in the dataset again<a href="#Check-the-number-of-rows-and-columns-in-the-dataset-again"
class="anchor-link">¶</a>

Check how many rows and columns are remaining in the dataset. You should
now have 393 fewer rows of data.

In \[11\]:

    df_subset.shape

Out\[11\]:

    (129487, 22)

### Encode the data<a href="#Encode-the-data" class="anchor-link">¶</a>

Four columns (`satisfaction`, `Customer Type`, `Type of Travel`,
`Class`) are the pandas dtype object. Decision trees need numeric
columns. Start by converting the ordinal `Class` column into numeric.

In \[12\]:

    # Keys must match the labels returned by unique() exactly, including capitalization
    df_subset['Class'] = df_subset['Class'].map({'Business': 3, 'Eco Plus': 2, 'Eco': 1})
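
`Series.map` with a dictionary returns `NaN` for any value that has no matching key, so it is worth confirming that the mapping covered every category. This is a minimal sketch on toy data; `toy_class` and `class_order` are hypothetical names used only for illustration.

```python
import pandas as pd

toy_class = pd.Series(["Eco", "Business", "Eco Plus", "Eco"])

# Keys must match the category labels exactly, including capitalization
class_order = {"Eco": 1, "Eco Plus": 2, "Business": 3}
encoded = toy_class.map(class_order)

# Any label missing from the dictionary would show up here as NaN
unmapped = encoded.isna().sum()
print(encoded.tolist(), "unmapped:", unmapped)
```

Running the same `isna().sum()` check on the real column right after encoding would reveal a misspelled key before it reaches the model.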

### Represent the data in the target variable numerically<a href="#Represent-the-data-in-the-target-variable-numerically"
class="anchor-link">¶</a>

To represent the data in the target variable numerically, assign
`"satisfied"` to the label `1` and `"dissatisfied"` to the label `0` in
the `satisfaction` column.

In \[13\]:

    df_subset['satisfaction'] = df_subset['satisfaction'].map({"satisfied": 1, "dissatisfied": 0})

### Convert categorical columns into numeric<a href="#Convert-categorical-columns-into-numeric"
class="anchor-link">¶</a>

There are other columns in the dataset that are still categorical. Be
sure to convert categorical columns in the dataset into numeric.

In \[14\]:

    df_subset = pd.get_dummies(df_subset, drop_first = True)
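
To see what `drop_first = True` does, consider a small made-up DataFrame (the values below are assumptions, not rows from the dataset). For a category with two levels, one indicator column is enough, because the dropped level is implied whenever the indicator is 0.

```python
import pandas as pd

toy = pd.DataFrame({
    "Age": [65, 47, 15],
    "Customer Type": ["Loyal Customer", "disloyal Customer", "Loyal Customer"],
})

# drop_first=True keeps one indicator column for a two-level category
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

Numeric columns such as `Age` pass through unchanged; only object columns are expanded.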

### Check column data types<a href="#Check-column-data-types" class="anchor-link">¶</a>

Now that you have converted categorical columns into numeric, check your
column data types.

In \[15\]:

    df_subset.dtypes

Out\[15\]:

    satisfaction                           int64
    Age                                    int64
    Class                                float64
    Flight Distance                        int64
    Seat comfort                           int64
    Departure/Arrival time convenient      int64
    Food and drink                         int64
    Gate location                          int64
    Inflight wifi service                  int64
    Inflight entertainment                 int64
    Online support                         int64
    Ease of Online booking                 int64
    On-board service                       int64
    Leg room service                       int64
    Baggage handling                       int64
    Checkin service                        int64
    Cleanliness                            int64
    Online boarding                        int64
    Departure Delay in Minutes             int64
    Arrival Delay in Minutes             float64
    Customer Type_disloyal Customer        uint8
    Type of Travel_Personal Travel         uint8
    dtype: object

### Create the training and testing data<a href="#Create-the-training-and-testing-data"
class="anchor-link">¶</a>

Put 75% of the data into a training set and the remaining 25% into a
testing set.

In \[16\]:

    y = df_subset["satisfaction"]

    X = df_subset.copy()
    X = X.drop("satisfaction", axis = 1)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
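
The effect of `test_size=0.25` can be checked on a small synthetic array. The sketch below uses made-up data (`X_toy` and `y_toy` are illustrative names) and prints the shapes of the resulting splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(200).reshape(100, 2)   # 100 rows, 2 features
y_toy = np.array([0, 1] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)
print(X_tr.shape, X_te.shape)
```

Fixing `random_state` makes the split reproducible, which matters when results are compared across runs.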

## Step 3: Model building<a href="#Step-3:-Model-building" class="anchor-link">¶</a>

### Fit a decision tree classifier model to the data<a href="#Fit-a-decision-tree-classifier-model-to-the-data"
class="anchor-link">¶</a>

Make a decision tree instance called `decision_tree` and pass in `0` to
the `random_state` parameter. This is only so that if other data
professionals run this code, they get the same results. Fit the model on
the training set, use the `predict()` function on the testing set, and
assign those predictions to the variable `dt_pred`.

In \[17\]:

    # Check the features for missing and infinite values before modeling
    for name, features in [("X_train", X_train), ("X_test", X_test)]:
        print(name, "missing:", features.isnull().sum().sum(),
              "infinite:", np.isinf(features.to_numpy(dtype=float)).sum())

    # Fill any gaps in `Class` with the most common value in the training set
    class_mode = X_train["Class"].mode()[0]
    X_train = X_train.fillna({"Class": class_mode})
    X_test = X_test.fillna({"Class": class_mode})

    # Fit the tree first, then predict on the testing set
    decision_tree = DecisionTreeClassifier(random_state=0)
    decision_tree.fit(X_train, y_train)

    dt_pred = decision_tree.predict(X_test)
    print(dt_pred[:10])  # Sample predictions

**Question:** What are some advantages of using decision trees versus
other models you have learned about?

Decision trees require no assumptions regarding the distribution of
underlying data and don't require scaling of features. This lab uses
decision trees because there is no need for additional data processing,
unlike some other models.
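
The claim about scaling can be checked empirically. The sketch below is an illustration on synthetic data from `make_classification`, not on the airline dataset: it trains one tree on raw features and one on standardized features, then compares their test predictions. Because a tree only compares feature values with thresholds, the two are expected to agree, apart from possible floating-point effects.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X_syn, y_syn = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

# Tree trained on the raw features
raw_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Tree trained on standardized copies of the same features
scaler = StandardScaler().fit(X_tr)
scaled_tree = DecisionTreeClassifier(random_state=0).fit(scaler.transform(X_tr), y_tr)

# Compare the two sets of test predictions element by element
agreement = np.mean(
    raw_tree.predict(X_te) == scaled_tree.predict(scaler.transform(X_te))
)
print(f"Share of identical predictions: {agreement:.2f}")
```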

## Step 4: Results and evaluation<a href="#Step-4:-Results-and-evaluation" class="anchor-link">¶</a>

Print out the decision tree model's accuracy, precision, recall, and F1
score.

In \[ \]:

    print("Decision Tree")
    print("Accuracy:", "%.6f" % metrics.accuracy_score(y_test, dt_pred))
    print("Precision:", "%.6f" % metrics.precision_score(y_test, dt_pred))
    print("Recall:", "%.6f" % metrics.recall_score(y_test, dt_pred))
    print("F1 Score:", "%.6f" % metrics.f1_score(y_test, dt_pred))

**Question:** Are there any additional steps you could take to improve
the performance or function of your decision tree?

Decision trees can be particularly susceptible to overfitting. Combining
hyperparameter tuning and grid search can help ensure this doesn't
happen. For instance, setting an appropriate value for `max_depth` could
help reduce a decision tree's overfitting problem by limiting how deep
the tree can grow.
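
The link between tree depth and overfitting can be illustrated with synthetic data. The sketch below is an assumption-laden example, using noisy data from `make_classification` rather than the survey data: it compares an unrestricted tree with one limited to `max_depth=3`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% of the labels flipped at random
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

deep_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

for name, tree in [("unrestricted", deep_tree), ("max_depth=3", shallow_tree)]:
    print(name,
          "| depth:", tree.get_depth(),
          "| train accuracy:", round(tree.score(X_tr, y_tr), 3),
          "| test accuracy:", round(tree.score(X_te, y_te), 3))
```

A large gap between training and test accuracy for the unrestricted tree would indicate that it has memorized noise in the training set.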

### Produce a confusion matrix<a href="#Produce-a-confusion-matrix" class="anchor-link">¶</a>

Data professionals often like to know the types of errors made by an
algorithm. To obtain this information, produce a confusion matrix.

In \[ \]:

    cm = metrics.confusion_matrix(y_test, dt_pred, labels = decision_tree.classes_)
    disp = metrics.ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=decision_tree.classes_)
    disp.plot();

**Question:** What patterns can you identify between true positives and
true negatives, as well as false positives and false negatives?

-   In the confusion matrix, there is a high proportion of true
    positives and true negatives.
-   The matrix also has a relatively low number of false positives and
    false negatives.
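
The four cells of a binary confusion matrix can also be read out as numbers and tied back to precision and recall. The sketch below uses a handful of made-up labels (`y_true` and `y_pred` are illustrative, not model output).

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

# With labels=[0, 1], ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```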

### Plot the decision tree<a href="#Plot-the-decision-tree" class="anchor-link">¶</a>

Examine the decision tree. Use the `plot_tree` function to produce a
visual representation of the tree to pinpoint where the splits in the
data are occurring.

In \[ \]:

    plt.figure(figsize=(20,12))
    plot_tree(decision_tree, max_depth=2, fontsize=14, feature_names=X.columns);

In \[ \]:

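    # Impurity-based importance of each feature in the fitted tree
    importances = decision_tree.feature_importances_

    # Label the importances with their column names and sort them
    feature_importances = pd.Series(importances, index=X.columns).sort_values(ascending=False)

    # Plot the sorted importances as a bar chart
    fig, ax = plt.subplots()
    feature_importances.plot.bar(ax=ax);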

### Hyperparameter tuning<a href="#Hyperparameter-tuning" class="anchor-link">¶</a>

Knowing how and when to adjust or tune a model can help a data
professional significantly increase performance. In this section, you
will find the best values for the hyperparameters `max_depth` and
`min_samples_leaf` using grid search and cross validation. Below are
some values for the hyperparameters `max_depth` and `min_samples_leaf`.

In \[ \]:

    tree_para = {'max_depth':[5,10,15,20,25,30,35,40,45,50],
                 'min_samples_leaf': [2,3,4,5,6,7,8,9, 10, 15, 20, 50]}

    scoring = {'accuracy': 'accuracy', 'precision': 'precision', 
               'recall': 'recall', 'f1': 'f1'}
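
The size of the search explains why tuning can be slow. The sketch below counts the hyperparameter combinations in the grid with `ParameterGrid`; the fold count `n_folds = 5` is an assumption about how cross-validation will be configured.

```python
from sklearn.model_selection import ParameterGrid

tree_para = {'max_depth': [5, 10, 15, 20, 25, 30, 35, 40, 45, 50],
             'min_samples_leaf': [2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 50]}

# Every pairing of a max_depth value with a min_samples_leaf value
n_combinations = len(ParameterGrid(tree_para))

n_folds = 5  # assumed number of cross-validation folds
print(n_combinations, "combinations,", n_combinations * n_folds, "model fits")
```

Each combination is fitted once per fold, so adding values to either list multiplies the total number of fits.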

### Check combinations of values<a href="#Check-combinations-of-values" class="anchor-link">¶</a>

Check every combination of values to examine which pair has the best
evaluation metrics. Make a decision tree instance called
`tuned_decision_tree` with `random_state=0`, make a `GridSearchCV`
instance called `clf`, make sure to refit the estimator using `"f1"`,
and fit the model on the training set.

**Note:** This cell may take up to 15 minutes to run.

In \[ \]:

    tuned_decision_tree = DecisionTreeClassifier(random_state=0)

    clf = GridSearchCV(
        estimator=tuned_decision_tree, 
        param_grid=tree_para, 
        scoring=scoring, 
        cv=5,  # 5-fold cross-validation
        refit="f1",  # Optimize for F1-score
        n_jobs=-1,  # Use all available CPU cores for faster processing
        verbose=1  # Show progress updates
    )

    # Fit the model
    clf.fit(X_train, y_train)

    # Display best parameters and best score
    print("Best Parameters:", clf.best_params_)
    print("Best f1 Score:", clf.best_score_)

**Question:** How can you determine the best combination of values for
the hyperparameters?

Use the `best_estimator_` attribute of the fitted `GridSearchCV` object
to uncover the best pair of values.
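
As a quick illustration of the attributes involved, the sketch below runs a much smaller search on synthetic data. The grid values, `cv=3`, and the names `small_grid` and `search` are assumptions chosen so that the example runs in seconds.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)

small_grid = {"max_depth": [2, 4, 8], "min_samples_leaf": [2, 10]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid=small_grid,
    scoring="f1",
    cv=3,
)
search.fit(X_syn, y_syn)

print(search.best_params_)            # winning hyperparameter pair
print(round(search.best_score_, 3))   # its mean cross-validated F1 score
print(search.best_estimator_)         # tree refit on all of the data passed to fit()
```

`best_params_` gives the winning values as a dictionary, while `best_estimator_` is the refitted model itself and can be used directly for prediction.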

### Compute the best combination of values for the hyperparameters<a
href="#Compute-the-best-combination-of-values-for-the-hyperparameters"
class="anchor-link">¶</a>

In \[ \]:

    clf.best_estimator_

**Question:** What is the best combination of values for the
hyperparameters?

According to the best estimator returned by the grid search, the best
combination of values is a maximum depth of 18 and a minimum of two
samples per leaf.

**Question:** What was the best average validation score?

In \[ \]:

    print("Best Avg. Validation Score:", "%.4f" % clf.best_score_)

The best average validation score (the mean F1 score across the
cross-validation folds) is about 0.94.

### Determine the "best" decision tree model's accuracy, precision, recall, and F1 score<a
href="#Determine-the-%22best%22-decision-tree-model&#39;s-accuracy,-precision,-recall,-and-F1-score"
class="anchor-link">¶</a>

Print out the decision tree model's accuracy, precision, recall, and F1
score. This task can be done in a number of ways.

In \[ \]:

    def make_results(model_name, model_object):
        """Return a one-row table of the best mean CV scores from a fitted GridSearchCV."""

        # Get all the results from the CV and put them in a df
        cv_results = pd.DataFrame(model_object.cv_results_)

        # Isolate the row of the df with the max mean F1 score
        best_estimator_results = cv_results.iloc[cv_results['mean_test_f1'].idxmax(), :]

        # Extract accuracy, precision, recall, and f1 score from that row
        f1 = best_estimator_results.mean_test_f1
        recall = best_estimator_results.mean_test_recall
        precision = best_estimator_results.mean_test_precision
        accuracy = best_estimator_results.mean_test_accuracy

        # Create table of results
        table = pd.DataFrame({'Model': [model_name],
                              'F1': [f1],
                              'Recall': [recall],
                              'Precision': [precision],
                              'Accuracy': [accuracy]
                              })
        return table

    result_table = make_results("Tuned Decision Tree", clf)

    result_table

**Question:** Was the additional performance improvement from
hyperparameter tuning worth the computational cost? Why or why not?

The F1 score for the decision tree that was not hyperparameter tuned is
0.940940 and the F1 score for the hyperparameter-tuned decision tree is
0.945422. While ensuring that overfitting doesn't occur is necessary for
some models, it didn't make a meaningful difference in improving this
model.
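
The size of the improvement can be put in numbers using the two F1 scores quoted above. This is plain arithmetic on the reported values; the variable names are illustrative.

```python
f1_untuned = 0.940940   # reported for the default tree
f1_tuned = 0.945422     # reported for the tuned tree

absolute_gain = f1_tuned - f1_untuned
relative_gain = absolute_gain / f1_untuned
print(f"Absolute gain: {absolute_gain:.6f}")
print(f"Relative gain: {relative_gain:.2%}")
```

A gain of less than half a percentage point has to be weighed against the many additional model fits that the grid search requires.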

### Plot the "best" decision tree<a href="#Plot-the-%22best%22-decision-tree" class="anchor-link">¶</a>

Use the `plot_tree` function to produce a representation of the tree to
pinpoint where the splits in the data are occurring. This will allow you
to review the "best" decision tree.

In \[ \]:

    plt.figure(figsize=(20,12))
    plot_tree(clf.best_estimator_, max_depth=2, fontsize=14, feature_names=X.columns);

**Question:** Which features did the model use first to sort the
samples?

The plot makes it seem like **Inflight entertainment**, **Seat
comfort**, and **Ease of Online booking** are among the most important
features. A bar chart of the model's `feature_importances_`, like the
one plotted earlier for the untuned tree, can be used to check this.
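
To show how feature importances behave, the sketch below fits a tree on toy data in which one column fully determines the label and the other is random noise. The column names `signal` and `noise` and the data itself are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "signal": rng.normal(size=200),   # fully determines the label
    "noise": rng.normal(size=200),    # unrelated to the label
})
toy_y = (toy_X["signal"] > 0).astype(int)

toy_tree = DecisionTreeClassifier(random_state=0).fit(toy_X, toy_y)

# Pair each importance with its column name and sort from most to least important
importances = pd.Series(toy_tree.feature_importances_, index=toy_X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

The importances sum to 1, so each value can be read as that feature's share of the total impurity reduction. Applying the same pattern to `clf.best_estimator_` with `X.columns` as the index would give the ranking for the tuned model.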

## Conclusion<a href="#Conclusion" class="anchor-link">¶</a>

**What are some key takeaways that you learned from this lab?**

-   Machine learning workflows include cleaning and encoding the data
    before a model is built.
-   While hyperparameter tuning can lead to an increase in performance,
    it doesn't always.
-   The visualization of the decision tree and the feature importance
    graph both suggest that **Inflight entertainment**, **Seat
    comfort**, and **Ease of online booking** are the most important
    features in the model.

**What findings would you share with others?**

-   Decision trees accurately predicted satisfaction over 94% of the
    time.
-   The confusion matrix is useful as it shows a similar number of true
    positives and true negatives.
-   The visualization of the decision tree and the feature importance
    graph both suggest that **Inflight entertainment**, **Seat
    comfort**, and **Ease of online booking** are the most important
    features in the model.

**What would you recommend to stakeholders?**

-   Customer satisfaction is highly tied to **Inflight entertainment**,
    **Seat comfort**, and **Ease of online booking**. Improving these
    experiences should lead to better customer satisfaction.
-   The success of the model suggests that the airline should invest
    more effort into model building and model understanding since this
    model seemed to be very good at predicting customer satisfaction.
