<a href="https://colab.research.google.com/github/Hashiraee/tutorials/blob/main/tutorial_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Some basics** (I): variables
In Python, variables are **containers** that **hold values**, such as numbers, strings, or other types of data. Think of them as boxes in which you can store information that you may need to use later in your program.


In [6]:
# We can store the greeting in a variable:
greeting = "Hello, World!"
print(greeting)

Hello, World!


In [3]:
# Or, in our seconds in a day example:
hours = 24
minutes = 60
seconds = 60

seconds_in_a_day = hours * minutes * seconds
seconds_in_a_day

86400

# **Some basics** (II): if-statements
In Python, **if statements** are used to control the flow of your program based on **certain conditions**. They allow you to execute different blocks of code depending on whether a particular condition is **true** or **false**.

In [4]:
# Let's consider an age example:
age = 17
if age < 18:
    print("You are too young, no access")
else:
    print("Welcome!")

You are too young, no access
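
When more than two outcomes are possible, `elif` adds further conditions that are checked in order, and only the first matching branch runs. The sketch below extends the age example; the threshold of 65 and the discount message are made up for illustration.

```python
age = 70

# Conditions are evaluated from top to bottom; the first true one wins
if age < 18:
    message = "You are too young, no access"
elif age < 65:
    message = "Welcome!"
else:
    # Hypothetical third branch, not part of the original example
    message = "Welcome! A senior discount applies."

print(message)
```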


# **Some basics** (III): for-loops
In Python, a **for loop** iterates over a sequence of values and performs some action for each value. It is commonly used in two ways: looping over a range of numbers with `range()`, and looping directly over the items of a collection, often called a for-each (or for-in) loop.

In [7]:
# Fortunately, loops can do that for us:
for number in range(6):
    print(number)

# If you want to calculate the sum (named `total` to avoid shadowing the built-in sum()):
total = 0
for number in range(6):
    total = total + number
print(f"The sum is: {total}")

0
1
2
3
4
5
The sum is: 15


A **for-each loop** iterates directly over the items of an iterable object, such as a list.

In [8]:
# This is a list in Python
animals = ["bear", "cat", "dog", "frog", "bird"]

# Printing the animal from animals
for animal in animals:
    print(animal)

bear
cat
dog
frog
bird
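
If the position of each item is needed as well, the built-in `enumerate()` yields index and value together, which avoids keeping a manual counter. A minimal sketch using the same list; the names `position` and `numbered` are chosen here for illustration.

```python
animals = ["bear", "cat", "dog", "frog", "bird"]

# enumerate() yields (index, value) pairs, starting at index 0
numbered = []
for position, animal in enumerate(animals):
    line = f"{position}: {animal}"
    numbered.append(line)
    print(line)
```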


# **Some basics** (IV): functions
In Python, a function is a **block of code** that performs a **specific task** or set of tasks. Functions are used to break down complex problems into smaller, more manageable parts and to make your code more modular and reusable.

In [9]:
# Greeting a person; `person` is a parameter, and the value passed in (e.g. "Anon") is the argument
def greet(person):
    print("Hello " + person + "!")

greet("Anon")
greet("Someone")

Hello Anon!
Hello Someone!


In [10]:
# Multiplying two numbers:
def multiply(a, b):
    return a * b

multiply(4, 20)

80
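
The two examples above differ in one respect: `greet` prints its result, while `multiply` returns it so that the caller can keep using the value. The sketch below is a hypothetical variant, `make_greeting`, that returns the text and also has a parameter with a default value.

```python
def make_greeting(person, greeting="Hello"):
    """Return the greeting text instead of printing it."""
    return f"{greeting} {person}!"


# The default value is used when the second argument is omitted
message = make_greeting("Anon")

# A keyword argument overrides the default
formal = make_greeting("Anon", greeting="Good morning")

print(message)
print(formal)
```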

# **Some basics** (V): dictionaries
In Python, a dictionary is a **data structure** that allows you to store and retrieve **key-value pairs**. 

It is similar to a **real-world dictionary**, where you look up **a word (the key)** to find its **definition (the value)**.

In [11]:
# Look at the structure of defining a dictionary, we first have the "key": "value"
person = {
    #"key": "value",
    "name": "Anon",
    "age": 42,
    "city": "New York",
}

print(person["name"])
print(person["age"])

Anon
42


### We can also add values to the "person" dictionary

In [12]:
person["email"] = "anon@example.com"
print(person)

{'name': 'Anon', 'age': 42, 'city': 'New York', 'email': 'anon@example.com'}
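
Looking up a key that does not exist with square brackets raises a `KeyError`. The sketch below shows three common ways of working with keys safely: `.get()` with a default value, the `in` operator, and iterating over `.items()`. The key `"phone"` is a made-up example of a missing key.

```python
person = {"name": "Anon", "age": 42, "city": "New York"}

# person["phone"] would raise a KeyError; .get() returns a default instead
phone = person.get("phone", "unknown")

# The `in` operator checks whether a key exists
has_city = "city" in person

# .items() yields (key, value) pairs in insertion order
for key, value in person.items():
    print(f"{key}: {value}")

print(phone, has_city)
```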


# **Some basics** (VI): classes, methods and attributes
Briefly, **classes define** the **structure and behavior** of objects in object-oriented programming. 

**Attributes** are the **characteristics or properties** of an object.

**Methods** define the **actions** that **objects can perform**. 

Together, classes, attributes, and methods provide a powerful way to organize and encapsulate complex code.

In [13]:
class Car:
    def __init__(self, color):
        self.color = color
        self.engine_started = False
        
    def start_engine(self):
        if not self.engine_started:
            self.engine_started = True
            print("Engine started.")
        else:
            print("Engine is already running.")
            
    def change_color(self, new_color):
        print(f"Changing color from {self.color} to {new_color}.")
        self.color = new_color

**Here**, we create an `instance` of the Car `class`, or equivalently: my_car is an `object` of the Car `class`.

In [14]:
# my_car object, which is an instance of the Car class with its own unique color attribute.
my_car = Car("red")

When you create an `instance` or `object` of a class, you are creating a **new object from the class blueprint** with its **own set of attribute values**; the methods are defined once in the class and are available to every instance.

The `instance` or `object` has access to the `attributes` and `methods` defined in the class, but it can also have its **own unique attributes and methods** that are **specific to that instance**.

In [15]:
# Here, we access an attribute
my_car.color

'red'

In [16]:
# This is a method
my_car.start_engine()

Engine started.


In [17]:
# Here, we use a METHOD 'change_color' to modify the ATTRIBUTE 'color'
my_car.change_color("blue")
my_car.color

Changing color from red to blue.


'blue'
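
To make the point about instances concrete, the sketch below creates two objects from one class and shows that changing one leaves the other untouched. `SimpleCar` is a trimmed-down, hypothetical stand-in for the `Car` class, renamed so that it does not overwrite the original definition.

```python
class SimpleCar:
    """Trimmed-down stand-in for the Car class, for illustration only."""

    def __init__(self, color):
        self.color = color
        self.engine_started = False

    def start_engine(self):
        self.engine_started = True


first_car = SimpleCar("red")
second_car = SimpleCar("green")

# Calling a method on one instance changes only that instance's attributes
first_car.start_engine()
print(first_car.engine_started, second_car.engine_started)

# An attribute can also be attached to a single instance
first_car.owner = "Anon"
print(hasattr(first_car, "owner"), hasattr(second_car, "owner"))
```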

# Example from `sklearn`

https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing

In [18]:
from sklearn import preprocessing
import numpy as np

## **Suppose** we have the following data, 3 rows and 3 columns.

In [19]:
X_train = np.array([[ 1., -1.,  2.],
                     [ 2.,  0.,  0.],
                     [ 0.,  1., -1.]])

### **Here, we use `sklearn`'s scaler, to rescale the data:**

$rescaled\_data = \frac{X-E(X)}{scale} = \frac{X - \mu}{scale} = \frac{X - \mu}{\sigma}$

**Where:**

$scale = \sigma = \sqrt{Var(X)}$ and $\mu = E(X)$

The StandardScaler class is a transformer that **standardizes** features by removing the **mean** and scaling to **unit variance**:

In [20]:
scaler = preprocessing.StandardScaler().fit(X_train)

**Here**, we access the **mean_** `attribute` of the StandardScaler `object` which is an `instance` of the StandardScaler `class`.

In [21]:
scaler.mean_

array([1.        , 0.        , 0.33333333])

**Here**, we access the **scale_** `attribute` of the StandardScaler `object` which is an `instance` of the StandardScaler `class`.

In [22]:
scaler.scale_

array([0.81649658, 0.81649658, 1.24721913])

**Next**, we rescale the original data by using the .transform() `method` of the StandardScaler `object` (which is named `scaler`).

In [23]:
X_scaled = scaler.transform(X_train)
X_scaled

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

**Here**, we calculate the mean over each column using `.mean(axis=0)`.

If you want to calculate the mean over the rows, you use `.mean(axis=1)`.

In [24]:
X_scaled.mean(axis=0)

array([0., 0., 0.])

**Here**, we calculate the **standard deviation (std)** over each column using `.std(axis=0)`.

If you want to calculate the standard deviation over the rows, you use `.std(axis=1)`.

In [25]:
X_scaled.std(axis=0)

array([1., 1., 1.])
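
The rescaling formula can also be checked by hand. The sketch below applies $(x - \mu) / \sigma$ column by column using only Python's standard library, on the same 3x3 data. It assumes the population standard deviation (dividing by $n$), which is consistent with the `scale_` values shown above; the names `means`, `scales` and `X_scaled_manual` are chosen here for illustration.

```python
import statistics

X_train = [
    [1.0, -1.0, 2.0],
    [2.0, 0.0, 0.0],
    [0.0, 1.0, -1.0],
]

# zip(*rows) transposes the rows into columns
columns = list(zip(*X_train))

# Per-column mean and population standard deviation (divides by n, not n - 1)
means = [statistics.fmean(column) for column in columns]
scales = [statistics.pstdev(column) for column in columns]

# Apply (x - mean) / scale to every entry, column by column
X_scaled_manual = [
    [(x - m) / s for x, m, s in zip(row, means, scales)]
    for row in X_train
]

for row in X_scaled_manual:
    print([round(value, 8) for value in row])
```

The printed rows should agree with the output of `scaler.transform(X_train)` up to rounding.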

<br>

# **Installing modules**:

**To extend the functionality** (using more than the built-ins): we need to install and use packages.
**Installing packages** is simple:

`!pip install {packageName}`

In [26]:
!pip install numpy
!pip install pandas

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


# Google Colab already has a bunch of convenient packages installed!

### Next, we do need to import them to use them.

You can **import a package** by using:
`import {package} as {short_name}`

In [31]:
import os
import math
import warnings
import functools
import tqdm

# The packages we installed
import numpy as np
import pandas as pd

# Sklearn packages for ML
import sklearn.metrics
import sklearn.model_selection
import sklearn.neighbors
import sklearn.preprocessing

# Importing packages related to plotting
import matplotlib.pyplot as plt
import matplotlib.colors
import seaborn as sns

# Some settings for the plotting packages.
sns.set(style="ticks")
sns.set_style("whitegrid")

plt.rcParams.update(
    {
        "figure.figsize": (10, 8),
        "axes.titlesize": 20,
        "axes.labelsize": 15,
        "legend.fontsize": 15,
        "axes.grid": True,
        "axes.axisbelow": True,
        "pcolor.shading": "auto",
    }
)

my_colors = [sns.color_palette()[0], sns.color_palette()[1], sns.color_palette()[2]]

# **Part 1: Exploratory Data Analysis**

## Background: .parquet files

* The columnar storage layout of the Parquet format makes it easy to select a subset of columns for all rows.
* It is a binary format that supports encoded data types.
* It carries metadata, with schema information at three levels: file, chunk, and page header.

## **Loading Data**

The **pandas** library has a DataFrame class.

In [32]:
# Reading the data, by specifying the path to the data
data_path = "drive/MyDrive/Colab Notebooks/data.parquet"

# Here, the data object is an instance of the DataFrame class.
data = pd.read_parquet(data_path)

After **loading the data**, let us see what it looks like:

In [33]:
# We can see the columns and the data type (Dtype) with .info()
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 450 entries, 0 to 149
Data columns (total 7 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   shopper_id                450 non-null    object 
 1   average_revenue           450 non-null    float64
 2   average_basket_size       450 non-null    float64
 3   fraction_canned_food      450 non-null    float64
 4   fraction_national_brands  450 non-null    float64
 5   segment_name              450 non-null    object 
 6   is_cherry_picker          450 non-null    int64  
dtypes: float64(4), int64(1), object(2)
memory usage: 28.1+ KB


### **Data content:** what are we working with?
The dataset consists of 450 observations (450 shoppers):

* `shopper_id`(object): The loyalty program ID.
* `average_revenue`(float64): Average monthly spending in USD (i.e., revenue).
* `average_basket_size`(float64): Average basket size per shopping trip (i.e., number of SKUs purchased on a trip).
* `fraction_canned_food`(float64): Fraction of purchases that are canned food products.
* `fraction_national_brands`(float64): Fraction of purchases that are products manufactured by national brands.
* `segment_name`(object): The segment each customer was assigned.
* `is_cherry_picker`(int64): Binary indicator (0 or 1) of whether the shopper is flagged as a cherry picker.

### **What do the observations (rows) of the data look like?**

We often use the `.head()` or `.tail()` methods to see what a few of the observations look like.

`.head(n)` shows the first n observations of the dataset (default n = 5); `.tail(n)` shows the last n.

In [34]:
data.head()

,shopper_id,average_revenue,average_basket_size,fraction_canned_food,fraction_national_brands,segment_name,is_cherry_picker
0,shopper_000,141.817975,3.541991,0.165028,0.077541,The Obnoxious Teens,0
1,shopper_001,144.396016,3.860022,0.163544,0.06922,The Obnoxious Teens,1
2,shopper_002,130.580378,3.761713,0.149754,0.067993,The Obnoxious Teens,0
3,shopper_003,167.258689,3.555836,0.14592,0.06784,The Obnoxious Teens,0
4,shopper_004,144.925715,4.021508,0.170981,0.078963,The Obnoxious Teens,0


In [35]:
data.tail()

,shopper_id,average_revenue,average_basket_size,fraction_canned_food,fraction_national_brands,segment_name,is_cherry_picker
145,shopper_445,587.467485,44.225649,0.230167,0.064235,The Prepper,1
146,shopper_446,498.108805,32.448675,0.210713,0.055606,The Prepper,0
147,shopper_447,502.550763,38.304781,0.219645,0.067221,The Prepper,0
148,shopper_448,562.634494,51.018432,0.198177,0.07358,The Prepper,0
149,shopper_449,564.658163,31.917702,0.186912,0.063536,The Prepper,0


### Next, we look at some **summary statistics:**

In [36]:
data.describe()

,average_revenue,average_basket_size,fraction_canned_food,fraction_national_brands,is_cherry_picker
count,450.0,450.0,450.0,450.0,450.0
mean,378.464356,23.884672,0.194487,0.067881,0.46
std,182.141639,15.409354,0.02828,0.009784,0.498952
min,90.757119,1.769286,0.138783,0.043475,0.0
25%,160.378473,6.059692,0.171622,0.061535,0.0
50%,425.914019,26.463476,0.193793,0.067127,0.0
75%,520.202382,36.533924,0.216222,0.073704,1.0
max,759.837484,56.032508,0.266659,0.099542,1.0


### Let's now look at the segments and meet the types of shoppers:

### The obnoxious teens:

<img src="https://github.com/sbstn-gbl/learning-from-big-data/blob/main/source/_static/img/type-teens.gif?raw=true" width="600">


### The prepper:
<img src="https://github.com/sbstn-gbl/learning-from-big-data/blob/main/source/_static/img/type-prepper.gif?raw=true" width="600">


### The shopper-for-hire:
<img src="https://github.com/sbstn-gbl/learning-from-big-data/blob/main/source/_static/img/type-shopper-hire.gif?raw=true" width="600">



In [37]:
# Here, we group the data by "segment_name" and count the non-null "shopper_id" values in each group (use .nunique() to count distinct values).
data.groupby("segment_name")["shopper_id"].count()

segment_name
The Obnoxious Teens     150
The Prepper             150
The Shopper-for-Hire    150
Name: shopper_id, dtype: int64

**Suppose** that we want to know the average spending per segment. How would we determine that?

In [38]:
data.groupby("segment_name")["average_revenue"].mean()

segment_name
The Obnoxious Teens     146.388560
The Prepper             560.418395
The Shopper-for-Hire    428.586112
Name: average_revenue, dtype: float64

**Next**, what is the maximum average spending per segment?

In [39]:
data.groupby("segment_name")["average_revenue"].max()

segment_name
The Obnoxious Teens     209.605322
The Prepper             759.837484
The Shopper-for-Hire    563.249425
Name: average_revenue, dtype: float64

## **Data preparation**

Earlier, we looked at **dictionaries**. Let's create one here to map each segment name to a number.

In [40]:
map_segment_name_to_id = {
    name: ix for ix, name in enumerate(data["segment_name"].unique())
}

map_segment_name_to_id

{'The Obnoxious Teens': 0, 'The Shopper-for-Hire': 1, 'The Prepper': 2}

### This may look a bit complex, so let's deconstruct it step by step:

In [41]:
# Here, we extract the unique segment (names) from the data.
unique_segments = data["segment_name"].unique()
unique_segments

array(['The Obnoxious Teens', 'The Shopper-for-Hire', 'The Prepper'],
      dtype=object)

**Next**, we create an _empty_ dictionary which we will populate later.

In [42]:
# Next, we initialize an empty dictionary, we will then use "key": "value"
map_segment_name_to_id = {}

# It is currently empty
map_segment_name_to_id

{}

**Here** we use another built-in **function** in Python: `len()`, which returns the number of items in an object.

In [43]:
# How many unique segments are there?
number_of_unique_segments = len(unique_segments)
number_of_unique_segments

3

In [44]:
# Next, we populate the empty dictionary
for segment_id in range(number_of_unique_segments):
    name = unique_segments[segment_id]
    map_segment_name_to_id[name] = segment_id

map_segment_name_to_id

{'The Obnoxious Teens': 0, 'The Shopper-for-Hire': 1, 'The Prepper': 2}

### For data-analysis purposes, we would like to have "number":"segment_name" instead:

In [45]:
# Inverting the mapping with a dictionary comprehension: each segment ID now points to its name
map_segment_id_to_name = {
    map_segment_name_to_id[name]: name for name in map_segment_name_to_id
}
map_segment_id_to_name

{0: 'The Obnoxious Teens', 1: 'The Shopper-for-Hire', 2: 'The Prepper'}

In [46]:
# Or, if you prefer, the step-by-step way:
map_segment_id_to_name = {}

In [47]:
map_segment_name_to_id.items()

dict_items([('The Obnoxious Teens', 0), ('The Shopper-for-Hire', 1), ('The Prepper', 2)])

In [48]:
# Swapping keys and values, now that we have seen the contents
# (the name `segment_id` avoids shadowing the built-in id())
for name, segment_id in map_segment_name_to_id.items():
    map_segment_id_to_name[segment_id] = name
map_segment_id_to_name

# Again, the same result.

{0: 'The Obnoxious Teens', 1: 'The Shopper-for-Hire', 2: 'The Prepper'}

## **Now let us apply our mapping of segments to numbers to our original data (adding a new column):**

**Now**, we create a new feature/column that holds the segment IDs.

In [49]:
# Here, we create a new column using our previously constructed dictionary
data["segment_id"] = data["segment_name"].map(map_segment_name_to_id)

In [50]:
# Let's look at a few rows again:
data.head()

,shopper_id,average_revenue,average_basket_size,fraction_canned_food,fraction_national_brands,segment_name,is_cherry_picker,segment_id
0,shopper_000,141.817975,3.541991,0.165028,0.077541,The Obnoxious Teens,0,0
1,shopper_001,144.396016,3.860022,0.163544,0.06922,The Obnoxious Teens,1,0
2,shopper_002,130.580378,3.761713,0.149754,0.067993,The Obnoxious Teens,0,0
3,shopper_003,167.258689,3.555836,0.14592,0.06784,The Obnoxious Teens,0,0
4,shopper_004,144.925715,4.021508,0.170981,0.078963,The Obnoxious Teens,0,0


# **Defining the feature variables**

We only consider the variables we will use for further analysis:
* `fraction_canned_food`: Fraction of bought items that are canned food.
* `fraction_national_brands`: Fraction of bought items that are manufactured by national brands such as Coca-Cola.
* `average_revenue`: Average monthly revenue.
* `average_basket_size`: Average basket size per shopping trip.


In [51]:
# Let's collect the names of all feature variables in one list:
feature_variables = [
    "fraction_canned_food",
    "fraction_national_brands",
    "average_revenue",
    "average_basket_size",
]

# Let's start with some **EDA (Exploratory Data Analysis)**

## **Median:**

The median is a value separating the higher half from the lower half of a data sample, a population or a probability distribution. For a data set, it may be thought of as "the middle" value.

In [52]:
values = [1, 2, 3, 4, 4, 4, 4, 4, 4, 10, 20, 100, 1000]
print(f"The median is: {np.median(values)}")

The median is: 4.0


## **Mean:** 

The arithmetic mean (or simply mean) of a sample, $x_1, x_2, ..., x_n$, usually denoted by $\bar{x}$, is the sum of the sampled values divided by the number of items in the sample:

$\bar{x} = \frac{1}{n}\sum_{i = 1}^{n}x_i = \frac{x_1 + x_2 + ... + x_n}{n}$

In [53]:
values = [1, 2, 3, 4, 4, 4, 4, 4, 4, 10, 20, 100, 1000]
print(f"The mean is: {np.mean(values)}")

The mean is: 89.23076923076923
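
The two results differ a lot because the mean is pulled towards extreme values, while the median is not. The sketch below recomputes both statistics with and without the largest value, using Python's built-in `statistics` module; the variable names are chosen here for illustration.

```python
import statistics

values = [1, 2, 3, 4, 4, 4, 4, 4, 4, 10, 20, 100, 1000]
without_outlier = values[:-1]  # drop the largest value, 1000

mean_all = statistics.mean(values)
mean_trimmed = statistics.mean(without_outlier)
median_all = statistics.median(values)
median_trimmed = statistics.median(without_outlier)

# The mean changes substantially, the median stays where it was
print(f"mean:   {mean_all:.2f} -> {mean_trimmed:.2f}")
print(f"median: {median_all} -> {median_trimmed}")
```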


**Now**, for our shopper data example:

In [54]:
data.groupby("segment_name")[feature_variables].median()

segment_name,fraction_canned_food,fraction_national_brands,average_revenue,average_basket_size
The Obnoxious Teens,0.165538,0.074686,144.311948,4.125141
The Prepper,0.218525,0.065292,554.789159,40.089668
The Shopper-for-Hire,0.197005,0.062725,426.851005,26.463476


In [55]:
data.groupby("segment_name")[feature_variables].mean()

segment_name,fraction_canned_food,fraction_national_brands,average_revenue,average_basket_size
The Obnoxious Teens,0.166316,0.076007,146.38856,4.809775
The Prepper,0.219397,0.065914,560.418395,40.477877
The Shopper-for-Hire,0.197746,0.061723,428.586112,26.366364


## **Variance:**
The variance of a random variable $X$ with mean $E(X) = \mu$ is the expected value of the squared deviation from that mean. For a continuous random variable with probability density function $f(x)$:

$\sigma^{2} = Var(X) = E[(X - E(X))^{2}] = E[(X - \mu)^{2}] = \int_{-\infty}^{\infty}(x - \mu)^{2} \cdot f(x)dx$

If the random variable $X$ is discrete with probability mass function $x_1 \mapsto p_1, x_2 \mapsto p_2, ..., x_n \mapsto p_n$, then:

$\sigma^{2} = Var(X) = \sum_{i = 1}^{n}p_i \cdot (x_i - \mu)^{2}$

In [56]:
# The variance:
data.groupby("segment_name")[feature_variables].var()

segment_name,fraction_canned_food,fraction_national_brands,average_revenue,average_basket_size
The Obnoxious Teens,0.000156,7.1e-05,437.857719,4.154882
The Prepper,0.000468,5.8e-05,5857.393119,41.576064
The Shopper-for-Hire,0.000352,5.1e-05,3597.623169,20.124759


In [57]:
# Equivalently, the standard deviation squared:
data.groupby("segment_name")[feature_variables].std() ** 2

segment_name,fraction_canned_food,fraction_national_brands,average_revenue,average_basket_size
The Obnoxious Teens,0.000156,7.1e-05,437.857719,4.154882
The Prepper,0.000468,5.8e-05,5857.393119,41.576064
The Shopper-for-Hire,0.000352,5.1e-05,3597.623169,20.124759
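
As a small worked example of the discrete formula, the sketch below computes the variance of a fair six-sided die from its probability mass function. It also contrasts the two usual estimators for observed data: the population variance (dividing by $n$) and the sample variance (dividing by $n - 1$). Note that pandas' `.var()` and `.std()` use the sample version by default (`ddof=1`). The die and the `sample` list are made-up illustration data.

```python
import statistics

# Discrete case: a fair six-sided die, each outcome with probability 1/6
outcomes = [1, 2, 3, 4, 5, 6]
probabilities = [1 / 6] * 6

mu = sum(p * x for x, p in zip(outcomes, probabilities))
variance = sum(p * (x - mu) ** 2 for x, p in zip(outcomes, probabilities))
print(f"mu = {mu:.4f}, variance = {variance:.4f}")

# Observed data: two common estimators of the variance
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
population_variance = statistics.pvariance(sample)  # divides by n
sample_variance = statistics.variance(sample)       # divides by n - 1
print(population_variance, sample_variance)
```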


## **Range**:
Difference between min and max values

$\Delta(x) = max(x) - min(x)$ 

In [58]:
# The range of the variables:
data.groupby("segment_name")[feature_variables].max() - data.groupby("segment_name")[feature_variables].min()

segment_name,fraction_canned_food,fraction_national_brands,average_revenue,average_basket_size
The Obnoxious Teens,0.058268,0.049066,118.848203,10.495797
The Prepper,0.103862,0.040205,372.414308,29.22988
The Shopper-for-Hire,0.088172,0.033241,300.826247,20.496614


# **Part 2: Binary Classification with k-NN**
## **Help Dr. S to automate shopper segmentation**
Dr. S is a marketing professor at a US business school and the proud owner of a small neighborhood grocery store. As a marketing professor, Dr. S understands the value of data, so one of the first things he did after opening the grocery store was to launch a loyalty card app. Customers can show the app at the checkout to collect loyalty points. After collecting enough loyalty points, customers can exchange them for product vouchers that can be used to get grocery products for free. The customers love this concept, and Dr. S wants to make the loyalty program even better by offering targeted coupons. He wants to use the loyalty program data to analyze the behavior (and preferences) of his customers.

With the help of the consulting firm Accidenture, Dr. S created customer segments so he can recommend suitable products to customers. Dr. S told us that working with Accidenture is pretty expensive: once every quarter he hires two consultants for 3 days (USD 3,500 per consultant per day plus travel expenses) to segment the customers.

After working with Accidenture for a few months, Dr. S has collected a lot of labeled data. Instead of paying consultants a lot of money, Dr. S wants to use Machine Learning to automate the customer segmentation so he can save some money to take his family on a (very!) nice vacation. Let's look at the data; perhaps we can produce a segmentation model that helps Dr. S save some money!

## **Types of Machine Learning (ML):**
*  **Supervised** learning uses **labeled** data
*  **Unsupervised** learning uses **unlabeled** data

### **Objective** of supervised ML:
Automate time-consuming or expensive manual tasks.

**Requires _labeled_ data:**
*  Historical data with labels.
*  Experiments to get labeled data.
*  Crowdsourcing labeled data.

**Practical Applications:**
* Classification: should we target a (certain) consumer?
* Regression: how much revenue can we expect from a consumer?

**Usage:**
*  Firms largely use (are _biased_ towards) classification models.
*  The reason behind this bias towards `classification` models is that most analytical problems involve making a decision that requires a simple **Yes/No** answer.

**Examples:**
*  Will a customer churn ('leave') or not?
*  Will a customer respond to an ad campaign or not?
*  Will a firm default or not?

## **How k-NN (k-NearestNeighbors) works**

### **Simple Idea:** Predict the _label_ of a point (observation) `i` by:
*  **Finding** the `k` closest ('nearest') points to `i`.
*  **Looking-up** the labels of the _nearest neighbors_.
*  Taking a **majority vote**.

![example image](https://lh3.googleusercontent.com/90egr1yszJf-UKxxPX1HF7icN1ioj0tLii0mCacBy-FQ-9W91TLD7Na6P0tth16RwJdWWbGrfXe3m8wq97EdEKJ5seJ0P7e5YVP-vzMIc62z63eooQ8t_mmhGtiLjxro8j_RSu8O)

*  `k` is a predefined `hyperparameter`:
    *  A `hyperparameter` is a value set before training a machine learning model; it controls the behavior of the algorithm and affects the model's performance.
*  The algorithm also depends on your choice of distance metric between the points in your sample.
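
The three steps (find the `k` nearest points, look up their labels, take a majority vote) can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used by `sklearn`: it assumes Euclidean distance as the metric, and the function name `knn_predict` and the toy data are made up for this example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest points."""
    # Step 1: Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 2: look up the labels of the k closest points
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    nearest_labels = [label for _, label in nearest]
    # Step 3: majority vote (on a tie, the label seen first among the nearest wins)
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data, made up for illustration: two features per point, two labels
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 5.5), (6.5, 7.0)]
train_labels = ["A", "A", "A", "B", "B", "B"]

prediction_near_a = knn_predict(train_points, train_labels, query=(2.0, 2.0), k=3)
prediction_near_b = knn_predict(train_points, train_labels, query=(6.0, 5.0), k=3)
print(prediction_near_a, prediction_near_b)
```

Because the vote depends on distances, features measured on very different scales (such as revenue versus a fraction) can dominate the result, which is one reason rescaling the data matters for k-NN.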

### **Creating** `X` (Features) **and** `y` (Target Variable)