# Introduction to Artificial Intelligence and Machine Learning (AI/ML)

## Module Objectives
- Define AI/ML and the difference between them
- Compare ML vs. expert based systems
- Explore ways in which AI/ML gets used
- Do the "Hello World" of neural networks as an introduction to AI/ML

## AI and ML: Definitions and Differences (1 point)

Read from both of these links on the difference (or lack thereof) between AI and ML:

- <a href=https://web.archive.org/web/20231129013703/https://www.ibm.com/topics/artificial-intelligence>What is Artificial Intelligence? (IBM)</a>
- <a href=https://web.archive.org/web/20231127132627/https://www.ibm.com/topics/machine-learning>What is Machine Learning? (IBM)</a>

**In a few sentences, describe the difference (or lack thereof) between machine learning and artificial intelligence in the cell below.** Citations are not required.

## Expert Based Systems (1 point)

Read from <a href=https://web.archive.org/web/20140505045226/http://stpk.cs.rtu.lv/sites/all/files/stpk/materiali/MI/Artificial%20Intelligence%20A%20Modern%20Approach.pdf>this textbook by Russell & Norvig (1995)</a> on what an **expert system** is. The relevant section is titled *Knowledge-based systems: The key to power? (1969-1979)* and starts at page 22 (you should only need to read this one section). Then describe the difference between an expert system and machine learning.

**In a few sentences, describe the difference between machine learning and expert systems in the cell below**. Citations are not required.

## Use Cases For AI/ML (1 point)

Go back to the link <a href=https://web.archive.org/web/20231127132627/https://www.ibm.com/topics/machine-learning>What is Machine Learning? (IBM)</a> and scroll down to the "Real-world machine learning use cases" section.

**In 1-3 sentences, explain which one of the given use cases interests you the most and why**. Citations are not required.

## A Machine Learning Algorithm: $k$-nearest neighbors (knn)

Enough talk- let's create a machine learning algorithm!

Below we implement the $k$-nearest neighbors algorithm using Scikit-Learn, a machine learning package. 

There might be a lot of terms here that you are unfamiliar with, like conditional distribution, preprocessing, EDA, etc- that's ok! For this introductory homework, we don't expect you to know all these terms. **The entire algorithm is mostly implemented for you; all you need to do is edit a few lines of code to finish it. There will be clear instructions at each point where you need to edit the code to get it to work.**

If you are interested in learning more about $k$-nearest neighbors, check out chapter 2.2.3 of ISL: https://www.statlearning.com/ or visit the <a href=https://the-examples-book.com/starter-guides/data-science/data-analysis/k-nearest-neighbors>Starter Guide page</a>.

### Package Imports

In [1]:
import pandas as pd
import openpyxl
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
import math
from sklearn.model_selection import train_test_split

warnings.filterwarnings('ignore') #ignore warnings that occur

### A Brief Technical Overview

While knn can be used for either regression or classification, we are going to demonstrate its uses for classification. 

One of the primary challenges of classification is that we don't know the conditional distribution of $Y$ given $X$ for our dataset. To deal with this, we attempt to *estimate* the conditional distribution of $Y$ given $X$, then classify any given individual observation using that estimated probability of what it *should* be. **knn can be used to estimate what the conditional distribution should be.**

### Deep Dive Into The Algorithm

Given some positive integer $k$ and our data observation $x_0$, knn identifies the $k$ training points nearest to $x_0$. We call this set of points $n_0$.

Then, we estimate the conditional probability of class $j$ as the fraction of points in $n_0$ whose response values equal $j$:

$$ Pr(Y=j|X=x_0)=\frac{1}{k}\sum_{i\in n_0}I(y_i=j) $$

Finally, knn classifies the test observation we started with ($x_0$) to the class ($j$) with the largest probability from the above equation.
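
The three steps above can be sketched from scratch with NumPy. This is only an illustrative sketch, not how Scikit-Learn implements knn: the function name `knn_class_probabilities`, the toy data, and the choice of Euclidean distance are all assumptions made for the example.

```python
import numpy as np

def knn_class_probabilities(train_x, train_y, x0, k):
    """Estimate Pr(Y = j | X = x0) as the share of the k nearest labels equal to j."""
    # Euclidean distance from x0 to every training point (assumed metric)
    distances = np.linalg.norm(train_x - x0, axis=1)
    # indices of the k closest training points (the set n_0)
    nearest = np.argsort(distances, kind="stable")[:k]
    classes = np.unique(train_y)
    # (1/k) * sum over i in n_0 of the indicator I(y_i = j)
    return {j.item(): float(np.mean(train_y[nearest] == j)) for j in classes}

# toy data (hypothetical): two features, labels 0 and 1
train_x = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])
train_y = np.array([0, 0, 1, 1, 1])
x0 = np.array([0.2, 0.2])

probs = knn_class_probabilities(train_x, train_y, x0, k=3)
# classify x0 to the class with the largest estimated probability
prediction = max(probs, key=probs.get)
print(probs, prediction)
```

Here the three nearest points carry the labels 0, 0, and 1, so class 0 gets an estimated probability of 2/3 and becomes the prediction.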

### A Visual Example

In the example below, we simulate 3 different $k$ values: 3, 5, and 11. We have 11 different data points plus our test point, and 2 labels (magenta and yellow). 

When $k$=3, the black dot (our $x_0$) would be classified as yellow, because the 3 nearest points are 2 yellow and 1 magenta; so the chance of it being yellow is 2/3, and magenta is 1/3. When $k$=5, our black dot would be classified as magenta, because magenta is higher than yellow with 3/5, vs 2/5. Finally, when we include all the points with $k$=11, our $x_0$ would be classified as yellow, because there is a 7/11 chance of yellow and 4/11 chance of magenta.
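
The vote counts in this example can be reproduced with a few lines of Python. The ordering of the labels within each group below is an assumption (the text only gives the counts among the nearest 3, 5, and 11 points), and the helper name `vote` is hypothetical.

```python
from collections import Counter
from fractions import Fraction

# Neighbor labels sorted from nearest to farthest. The exact ordering is an
# assumption; only the counts within the first 3, 5, and 11 match the text.
labels_by_distance = [
    "yellow", "yellow", "magenta",                                 # first 3: 2 yellow, 1 magenta
    "magenta", "magenta",                                          # first 5: 2 yellow, 3 magenta
    "yellow", "yellow", "yellow", "yellow", "yellow", "magenta",   # all 11: 7 yellow, 4 magenta
]

def vote(labels, k):
    # count the labels among the k nearest neighbors
    counts = Counter(labels[:k])
    # turn counts into exact fractions of k
    shares = {label: Fraction(n, k) for label, n in counts.items()}
    # the predicted class is the one with the largest share
    winner = max(shares, key=shares.get)
    return winner, shares

for k in (3, 5, 11):
    winner, shares = vote(labels_by_distance, k)
    print(k, winner, {label: str(share) for label, share in shares.items()})
```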

We could theoretically use any integer value of $k$ from 1 to 11. In practice, when $k$ equals the total number of points, every prediction is simply the majority class of the whole dataset, and you wouldn't need k-nearest neighbors to compute that. *Often, finding the right value of $k$ can be a challenge*.

<img src="images/knn.gif" alt="knn gif" />

### Our Data

We are working with an airline customer satisfaction dataset. The data comes from Kaggle: https://www.kaggle.com/datasets/johndddddd/customer-satisfaction

*Variables*:

- Satisfaction: Airline satisfaction level (Satisfaction, neutral or dissatisfaction)
- Age: The actual age of the passengers
- Gender: Gender of the passengers (Female, Male)
- Type of Travel: Purpose of the flight of the passengers (Personal Travel, Business Travel)
- Class: Travel class in the plane of the passengers (Business, Eco, Eco Plus)
- Customer Type: The customer type (Loyal customer, disloyal customer)
- Flight distance: The flight distance of this journey
- Inflight wifi service: Satisfaction level of the inflight wifi service (0: Not Applicable; 1-5)
- Ease of Online booking: Satisfaction level of online booking
- Inflight service: Satisfaction level of inflight service
- Online boarding: Satisfaction level of online boarding
- Inflight entertainment: Satisfaction level of inflight entertainment
- Food and drink: Satisfaction level of Food and drink
- Seat comfort: Satisfaction level of Seat comfort
- On-board service: Satisfaction level of On-board service
- Leg room service: Satisfaction level of Leg room service
- Departure/Arrival time convenient: Satisfaction level of Departure/Arrival time convenient
- Baggage handling: Satisfaction level of baggage handling
- Gate location: Satisfaction level of Gate location
- Cleanliness: Satisfaction level of Cleanliness
- Check-in service: Satisfaction level of Check-in service
- Departure Delay in Minutes: Minutes delayed when departure
- Arrival Delay in Minutes: Minutes delayed when Arrival
- Flight cancelled: Whether the Flight cancelled or not (Yes, No)
- Flight time in minutes: Minutes of Flight takes

### Our Goal

Classify observations by their level of satisfaction using knn. We want to know what variables really matter to customer satisfaction. For instance, is flight delay time the biggest indicator of overall customer satisfaction?

### Code

### EDA

Below we take a quick look at the data. We drop rows with NA once we load it in.

In [2]:
df = pd.read_excel("data.xlsx")
df = df.dropna()

In [3]:
df.head()

Unnamed: 0,id,satisfaction_v2,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Seat comfort,Departure/Arrival time convenient,...,Online support,Ease of Online booking,On-board service,Leg room service,Baggage handling,Checkin service,Cleanliness,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes
0,11112,satisfied,Female,Loyal Customer,65,Personal Travel,Eco,265,0,0,...,2,3,3,0,3,5,3,2,0,0.0
1,110278,satisfied,Male,Loyal Customer,47,Personal Travel,Business,2464,0,0,...,2,3,4,4,4,2,3,2,310,305.0
2,103199,satisfied,Female,Loyal Customer,15,Personal Travel,Eco,2138,0,0,...,2,2,3,3,4,4,4,2,0,0.0
3,47462,satisfied,Female,Loyal Customer,60,Personal Travel,Eco,623,0,0,...,3,1,1,0,1,4,1,3,0,0.0
4,120011,satisfied,Female,Loyal Customer,70,Personal Travel,Eco,354,0,0,...,4,2,2,0,2,4,2,5,0,0.0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 129487 entries, 0 to 129879
Data columns (total 24 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   id                                 129487 non-null  int64  
 1   satisfaction_v2                    129487 non-null  object 
 2   Gender                             129487 non-null  object 
 3   Customer Type                      129487 non-null  object 
 4   Age                                129487 non-null  int64  
 5   Type of Travel                     129487 non-null  object 
 6   Class                              129487 non-null  object 
 7   Flight Distance                    129487 non-null  int64  
 8   Seat comfort                       129487 non-null  int64  
 9   Departure/Arrival time convenient  129487 non-null  int64  
 10  Food and drink                     129487 non-null  int64  
 11  Gate location                      129487 no

In [5]:
df.describe()

Unnamed: 0,id,Age,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,Inflight wifi service,Inflight entertainment,Online support,Ease of Online booking,On-board service,Leg room service,Baggage handling,Checkin service,Cleanliness,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes
count,129487.0,129487.0,129487.0,129487.0,129487.0,129487.0,129487.0,129487.0,129487.0,129487.0,129487.0,129487.0,129487.0,129487.0,129487.0,129487.0,129487.0,129487.0,129487.0
mean,64958.335169,39.428761,1981.008974,2.838586,2.990277,2.852024,2.990377,3.24916,3.383745,3.519967,3.472171,3.465143,3.486118,3.69546,3.340729,3.705886,3.352545,14.643385,15.091129
std,37489.781165,15.117597,1026.884131,1.392873,1.527183,1.443587,1.305917,1.318765,1.345959,1.306326,1.305573,1.270755,1.292079,1.156487,1.260561,1.151683,1.298624,37.932867,38.46565
min,1.0,7.0,50.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
25%,32494.5,27.0,1359.0,2.0,2.0,2.0,2.0,2.0,2.0,3.0,2.0,3.0,2.0,3.0,3.0,3.0,2.0,0.0,0.0
50%,64972.0,40.0,1924.0,3.0,3.0,3.0,3.0,3.0,4.0,4.0,4.0,4.0,4.0,4.0,3.0,4.0,4.0,0.0,0.0
75%,97415.5,51.0,2543.0,4.0,4.0,4.0,4.0,4.0,4.0,5.0,5.0,4.0,5.0,5.0,4.0,5.0,4.0,12.0,13.0
max,129880.0,85.0,6951.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,1592.0,1584.0


`id` probably won't be relevant for us, and will get dropped. 

In [6]:
df['satisfaction_v2'].unique()

array(['satisfied', 'neutral or dissatisfied'], dtype=object)

In [7]:
df['satisfaction_v2'].value_counts()

satisfaction_v2
satisfied                  70882
neutral or dissatisfied    58605
Name: count, dtype: int64

`satisfaction_v2` is the customer's reported satisfaction, which is essentially a binary (True or False) label. Neutral and dissatisfied responses are grouped into a single "not satisfied" class. It would have been nicer to have finer-grained categories of satisfaction, but this is all the data we have to work with.

### Preprocessing

Here we convert the non-numeric columns to integer category codes, drop the `id` column along with the original text columns, and then normalize the columns that have large ranges. Finally, we make train/test splits.

In [8]:
columns_to_convert = ['satisfaction_v2','Gender','Customer Type','Type of Travel','Class']

for column in columns_to_convert:
    df[column] = df[column].astype('category')
    df[column+"_coded"] = df[column].cat.codes

old_df = df

df = df.drop(columns=['id'])
df = df.drop(columns=columns_to_convert)
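
To see what the category conversion above produces, here is a small sketch on a toy frame. The three rows are hypothetical; only the column name and its two labels come from the dataset.

```python
import pandas as pd

# toy frame (hypothetical rows) with one text column from the dataset
toy = pd.DataFrame({"satisfaction_v2": ["satisfied", "neutral or dissatisfied", "satisfied"]})

# same two steps as the loop above, applied to a single column
toy["satisfaction_v2"] = toy["satisfaction_v2"].astype("category")
toy["satisfaction_v2_coded"] = toy["satisfaction_v2"].cat.codes

# categories are sorted, so each code is the label's position in that sorted list
mapping = dict(enumerate(toy["satisfaction_v2"].cat.categories))
print(mapping)
print(toy)
```

Because the categories are sorted alphabetically, `neutral or dissatisfied` is coded as 0 and `satisfied` as 1.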

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 129487 entries, 0 to 129879
Data columns (total 23 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   Age                                129487 non-null  int64  
 1   Flight Distance                    129487 non-null  int64  
 2   Seat comfort                       129487 non-null  int64  
 3   Departure/Arrival time convenient  129487 non-null  int64  
 4   Food and drink                     129487 non-null  int64  
 5   Gate location                      129487 non-null  int64  
 6   Inflight wifi service              129487 non-null  int64  
 7   Inflight entertainment             129487 non-null  int64  
 8   Online support                     129487 non-null  int64  
 9   Ease of Online booking             129487 non-null  int64  
 10  On-board service                   129487 non-null  int64  
 11  Leg room service                   129487 no

In [10]:
columns_to_norm = ['Age','Flight Distance','Departure Delay in Minutes','Arrival Delay in Minutes']

for column in columns_to_norm:
    df[column] = df[column]/np.max(df[column])
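
A quick sketch of why this scaling matters for a distance-based method like knn. The toy values and the helper `distance` are hypothetical, and Euclidean distance is assumed. Without scaling, a column with large values such as `Flight Distance` dominates the distance, and the nearest neighbor can change once every column is on a comparable scale.

```python
import numpy as np
import pandas as pd

# toy rows (hypothetical values) with two columns on very different scales
toy = pd.DataFrame({"Age": [20, 60, 20], "Flight Distance": [1000, 1010, 1100]})

def distance(frame, i, j):
    # Euclidean distance between rows i and j
    a = frame.iloc[i].to_numpy(dtype=float)
    b = frame.iloc[j].to_numpy(dtype=float)
    return float(np.linalg.norm(a - b))

# raw distances from row 0: row 1 looks closer despite the 40-year age gap
raw_01, raw_02 = distance(toy, 0, 1), distance(toy, 0, 2)

# same max scaling as the cell above: non-negative columns end up in [0, 1]
scaled = toy / toy.max()
scaled_01, scaled_02 = distance(scaled, 0, 1), distance(scaled, 0, 2)

print("raw:   ", raw_01, raw_02)
print("scaled:", scaled_01, scaled_02)
```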

#### Train Test Splits (1 point)

Below we create the train/test splits from our data. You can learn about train/test splits here: https://the-examples-book.com/starter-guides/data-science/data-modeling/resampling-methods/cross-validation/train-valid-test

**Set test_size to be some value between 0.05 and 0.30 to create a test split that is 5-30% of our total dataset**. `test_size` gets used in the scikit-learn function `train_test_split` to automatically shuffle our data and create train test splits.

In [None]:
test_size = ??

In [None]:
if test_size < 0.05 or test_size > 0.30:
    raise AssertionError("The test_size is too big or too small.")

In [None]:
labels = df['satisfaction_v2_coded'] #create the labels 
data = df.drop(columns=['satisfaction_v2_coded']) #recreate the data
train_x, test_x, train_y, test_y = train_test_split(data, labels, test_size=test_size, random_state=42)
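
As a rough illustration of what `train_test_split` returns, here is a sketch on toy data. The arrays are hypothetical, and the `test_size` of 0.25 is only an illustrative value, not a recommended setting.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy data (hypothetical): 20 rows, 2 features, alternating labels
data = np.arange(40).reshape(20, 2)
labels = np.array([0, 1] * 10)

# 0.25 is only an illustrative value; rows are shuffled before splitting
train_x, test_x, train_y, test_y = train_test_split(
    data, labels, test_size=0.25, random_state=42
)

# 25% of 20 rows go to the test split, the rest to the training split
print(train_x.shape, test_x.shape)
```

Every row ends up in exactly one of the two splits, so the test rows are never seen while fitting the model.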

#### Setting the Max $k$ Value (1 point)

Below we create a for loop to try out multiple different $k$ values. Here we set the maximum value of $k$. Keep your `max_k` value at 20 or below; larger values can take a while to run, and you will likely see that this data (like many datasets) doesn't benefit from a $k$ value higher than 10. **Set `max_k` to an integer of your choice from 2 to 20 (inclusive).**

In [None]:
max_k = ??

In [None]:
if max_k > 20 or max_k < 2:
    raise AssertionError("The max k value is larger than 20 or smaller than 2.")

### Model Building

We create a for loop to test out many different values of $k$. For each $k$, it fits a $k$-nearest neighbors classifier on the training data, makes predictions, and appends the training and test accuracy to the corresponding lists. Finally, it saves the results to `results_df` so we can plot them.

In [13]:
k_values = []
train_acc = []
test_acc = []

# for each k from 2 up to and including max_k
for k in range(2, max_k+1):
    # train the model and predict
    print("Now testing value of k:",k)
    neigh = KNeighborsClassifier(n_neighbors = k).fit(train_x,train_y)
    yhat = neigh.predict(test_x)
    k_values.append(k)
    train_acc.append(metrics.accuracy_score(train_y, neigh.predict(train_x)))
    test_acc.append(metrics.accuracy_score(test_y, yhat))

#convert results to df
results_data = {'k':k_values, 'Training Accuracy':train_acc, 'Test Accuracy':test_acc}
results_df = pd.DataFrame(data=results_data)


In [None]:
print("The k value with the highest accuracy between 2 and", max_k, "is", k_values[np.argmax(test_acc)])
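
The loop and the best-$k$ lookup can also be tried on a small synthetic dataset, which runs in well under a second. This is a sketch only: the data from `make_classification` is a stand-in for the airline data, and the `test_size` and `max_k` values are illustrative assumptions rather than suggested answers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# synthetic stand-in for the airline data (an assumption, not the real dataset)
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.2, random_state=42)

max_k = 10  # illustrative value only
k_values, train_acc, test_acc = [], [], []
for k in range(2, max_k + 1):
    # fit on the training split, then score both splits
    model = KNeighborsClassifier(n_neighbors=k).fit(train_x, train_y)
    k_values.append(k)
    train_acc.append(metrics.accuracy_score(train_y, model.predict(train_x)))
    test_acc.append(metrics.accuracy_score(test_y, model.predict(test_x)))

# look the best k up in k_values instead of relying on a fixed offset
best_k = k_values[int(np.argmax(test_acc))]
print("best k:", best_k, "test accuracy:", max(test_acc))
```

Comparing `train_acc` with `test_acc` for each $k$ also shows how far the training accuracy sits from the accuracy on unseen data.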

In [None]:
# setting the dimensions
fig, ax = plt.subplots(figsize=(30, 18))
 
# drawing the plot
sns.lineplot(data=results_df, x='k', y='Test Accuracy', ax=ax).set_title("Test Accuracy For Each k Value")
plt.show()

### Discussion

For many datasets, the best $k$ value often falls somewhere between 5 and 10. The accuracy curve frequently forms an approximate arc like the one you see above, and the best $k$ sits at the peak of that arc, after which the accuracy (or whatever other metric is used) gets progressively worse.

### Summary

k-nearest neighbors is a very simple algorithm to understand and implement. It can be used for both classification and regression, and its ease of use can make it a great starting point for many datasets.