# S06 T01: Sampling methods

## Exercise 1 Pick a sports-themed dataset you like. Sample the data by generating a simple random sample and a systematic sample.

In [123]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random

pd.set_option('display.float_format', lambda x: '%.0f' % x)

In [124]:
df = pd.read_csv(r'highest_earning_players.csv', sep=',')

### Exploring data

In [125]:
df

Unnamed: 0,PlayerId,NameFirst,NameLast,CurrentHandle,CountryCode,TotalUSDPrize,Game,Genre
0,3883,Peter,Rasmussen,dupreeh,dk,1822989,Counter-Strike: Global Offensive,First-Person Shooter
1,3679,Andreas,Højsleth,Xyp9x,dk,1799289,Counter-Strike: Global Offensive,First-Person Shooter
2,3885,Nicolai,Reedtz,dev1ce,dk,1787490,Counter-Strike: Global Offensive,First-Person Shooter
3,3672,Lukas,Rossander,gla1ve,dk,1652351,Counter-Strike: Global Offensive,First-Person Shooter
4,17800,Emil,Reif,Magisk,dk,1416449,Counter-Strike: Global Offensive,First-Person Shooter
...,...,...,...,...,...,...,...,...
995,7400,Janne,Mikkonen,Savjz,fi,50734,Hearthstone,Collectible Card Game
996,3255,Drew,Biessener,Tidesoftime,us,50450,Hearthstone,Collectible Card Game
997,49164,Simone,Liguori,Leta,it,49300,Hearthstone,Collectible Card Game
998,43043,Mike,Eichner,Ike,us,48550,Hearthstone,Collectible Card Game


The dataset covers esports (gaming) prize earnings.

In [126]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   PlayerId       1000 non-null   int64  
 1   NameFirst      1000 non-null   object 
 2   NameLast       1000 non-null   object 
 3   CurrentHandle  1000 non-null   object 
 4   CountryCode    1000 non-null   object 
 5   TotalUSDPrize  1000 non-null   float64
 6   Game           1000 non-null   object 
 7   Genre          1000 non-null   object 
dtypes: float64(1), int64(1), object(6)
memory usage: 62.6+ KB


There are no NaN values.

In [127]:
df[['TotalUSDPrize']].describe()

Unnamed: 0,TotalUSDPrize
count,1000
mean,397793
std,690849
min,24172
25%,83790
50%,168328
75%,393735
max,6952597


In [128]:
df.duplicated().sum()

0

There are no duplicate rows.

In [129]:

df.groupby('Genre')['TotalUSDPrize'].describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
Genre,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Battle Royale,200,279257,377236,47078,83670,159406,283264,3141395
Collectible Card Game,100,133356,91524,47974,69156,97099,162850,491419
First-Person Shooter,200,344449,340189,53294,89933,258548,440271,1822989
Multiplayer Online Battle Arena,400,585842,993810,24172,72046,184601,609656,6952597
Strategy,100,253798,199770,81796,109662,162767,313492,893337


The dataset is well structured: each genre contains 100, 200 or 400 players.

In [130]:
df.groupby('Game')['TotalUSDPrize'].describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
Game,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Arena of Valor,100,83340,84282,24172,38212,58160,108895,465686
Counter-Strike: Global Offensive,100,565419,357793,232167,304962,443061,740391,1822989
Dota 2,100,1791788,1388015,591177,819764,1343057,2025515,6952597
Fortnite,100,434094,472427,134800,174660,236438,469716,3141395
Hearthstone,100,133356,91524,47974,69156,97099,162850,491419
Heroes of the Storm,100,117804,84070,50546,65855,89372,135673,464561
League of Legends,100,350435,196222,159402,217207,281803,461466,1257616
Overwatch,100,123478,77250,53294,69685,89436,159796,331109
PUBG,100,124420,120127,47078,62519,83191,131239,703954
Starcraft II,100,253798,199770,81796,109662,162767,313492,893337


Every game has the same number of players (100), which helps to obtain more representative samples.

In [131]:
df.groupby('CountryCode')['TotalUSDPrize'].describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
CountryCode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
ar,3,391386,545293.0,70750,76580,82409,551705,1021000
at,3,914074,878361.0,326177,409224,492272,1208023,1923774
au,5,1599935,2573504.0,53760,57228,79050,1809225,6000412
ba,1,762803,,762803,762803,762803,762803,762803
be,5,246035,71796.0,172385,179596,249663,287280,341250
bg,6,956899,1778198.0,93149,145094,293449,363301,4579118
br,10,577755,419891.0,61460,273063,384880,1050430,1063858
by,2,389815,284903.0,188358,289086,389815,490543,591271
ca,37,364423,525256.0,51637,77433,151117,341250,2257053
ch,3,292188,118963.0,176598,231153,285707,349983,414258


There are more players from China and Korea.

###  Simple random sample

In [132]:
sample_df = df.sample(100)
sample_df

Unnamed: 0,PlayerId,NameFirst,NameLast,CurrentHandle,CountryCode,TotalUSDPrize,Game,Genre
812,37261,Sorawichaya,Mahavanakul,Isilindilz,th,131426,Arena of Valor,Multiplayer Online Battle Arena
534,11418,Mikołaj,Ogonowski,Elazer,pl,264095,Starcraft II,Strategy
396,62624,Moussa,Faour,Chapix,se,136820,Fortnite,Battle Royale
957,45830,Wataru,Ishibashi,posesi,jp,84196,Hearthstone,Collectible Card Game
175,30452,Peng,Du,Monet,cn,813170,Dota 2,Multiplayer Online Battle Arena
...,...,...,...,...,...,...,...,...
972,15703,Facu,Pruzzo,Nalguidan,ar,70750,Hearthstone,Collectible Card Game
852,60456,Sorachat,Janechaijitravanit,Getsrch,th,57379,Arena of Valor,Multiplayer Online Battle Arena
931,28238,Lin,Zheng,OmegaZero,cn,139426,Hearthstone,Collectible Card Game
854,36696,Kawee,Wachiraphas,MeMarkz,th,55686,Arena of Valor,Multiplayer Online Battle Arena


In [133]:
sample_df[['TotalUSDPrize']].describe()

Unnamed: 0,TotalUSDPrize
count,100
mean,472153
std,914720
min,26696
25%,77441
50%,153191
75%,342375
max,6470000


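The median prize in dollars is reasonably close: 168,328 dollars in the whole dataset and 153,191 dollars in the simple random sample. Because df.sample is called without random_state, the sample and its median change on every run.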
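Since the sample changes on every run, the figures quoted in the text can drift away from the displayed output. A minimal sketch of a repeatable draw, assuming a toy DataFrame with made-up prize values (`toy`, `sample_a` and `sample_b` are illustrative names, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the prize column (hypothetical values, not the real data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"TotalUSDPrize": rng.lognormal(mean=12, sigma=1, size=1000)})

# Fixing random_state makes the draw repeatable across runs
sample_a = toy.sample(n=100, random_state=42)
sample_b = toy.sample(n=100, random_state=42)

# Both draws select exactly the same rows
print(sample_a.index.equals(sample_b.index))
print(toy["TotalUSDPrize"].median(), sample_a["TotalUSDPrize"].median())
```

With a fixed seed, the median reported in the text stays valid when the notebook is rerun.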

###  Systematic random sample

In [134]:
# Choose a random starting offset in [0, step) so that every row can be selected
random_number = random.randint(0, 9)

In [135]:
# Define systematic sampling function

def systematic_sampling(df, step, start):
    # Take every `step`-th row, beginning at position `start`
    indexes = np.arange(start, len(df), step=step)
    return df.iloc[indexes]


systematic_sample = systematic_sampling(df, 10, random_number)

display(systematic_sample)

Unnamed: 0,PlayerId,NameFirst,NameLast,CurrentHandle,CountryCode,TotalUSDPrize,Game,Genre
4,17800,Emil,Reif,Magisk,dk,1416449,Counter-Strike: Global Offensive,First-Person Shooter
14,3875,Jesper,Wecksell,JW,se,897761,Counter-Strike: Global Offensive,First-Person Shooter
24,5783,Nikola,Kovač,NiKo,ba,762803,Counter-Strike: Global Offensive,First-Person Shooter
34,3951,Chris,de Jong,chrisJ,nl,596749,Counter-Strike: Global Offensive,First-Person Shooter
44,3293,Cédric,Guipouy,RpK,fr,498888,Counter-Strike: Global Offensive,First-Person Shooter
...,...,...,...,...,...,...,...,...
954,37063,Tsu Lin,Tsao,SamuelTsao,tw,89250,Hearthstone,Collectible Card Game
964,39850,Tyler,Hoang Nguyen,Tylerootd,nl,78375,Hearthstone,Collectible Card Game
974,17263,Chen,Yuxiang,Breath,cn,69488,Hearthstone,Collectible Card Game
984,12793,Jeffrey,Liu,Tarei,us,62454,Hearthstone,Collectible Card Game


In [136]:
systematic_sample[['TotalUSDPrize']].describe()

Unnamed: 0,TotalUSDPrize
count,100
mean,400329
std,693380
min,26667
25%,85339
50%,170430
75%,365042
max,5470903


The median prize in the systematic sample, 170,430 dollars, is higher than in the whole dataset (168,328 dollars) and the simple random sample (153,191 dollars).
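As a self-contained sketch of the same idea, the function below draws the starting offset inside the function, from the range [0, step), so the sample size stays at len/step whenever the length is a multiple of the step. The name `systematic_sample_rows`, the `seed` parameter and the toy data are assumptions made for illustration.

```python
import random

import numpy as np
import pandas as pd


def systematic_sample_rows(frame, step, seed=None):
    """Return every `step`-th row, starting at a random offset in [0, step)."""
    start = random.Random(seed).randrange(step)
    positions = np.arange(start, len(frame), step)
    return frame.iloc[positions]


toy = pd.DataFrame({"value": range(1000)})  # hypothetical stand-in data
picked = systematic_sample_rows(toy, step=10, seed=1)
print(len(picked))
```

Passing a seed makes the offset repeatable; leaving it as None gives a fresh offset on each call.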

## Exercise 2 Stratified sample and sample using SMOTE (Synthetic Minority Oversampling Technique).



### Stratified sample

In [137]:
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif


In [138]:
from sklearn.model_selection import train_test_split

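I try two ways of splitting the data with train_test_split, which splits arrays or matrices into random train and test subsets. Note that the split is only stratified when the `stratify` argument is passed; without it, as in the calls here, the subsets are drawn purely at random.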

In [139]:
# Call the function directly and let scikit-learn do the rest
strat1=train_test_split(df, df[['CountryCode','TotalUSDPrize','Game','Genre']])
strat1


[     PlayerId   NameFirst  NameLast CurrentHandle CountryCode  TotalUSDPrize  \
 201      6466    Ho Seong       Lee          Duke          kr         954621   
 13       8635        Nick  Cannella         nitr0          us         920152   
 166      3522       Jacky       Mao   EternaLEnVy          ca         987073   
 729     45412  Young Hoon      Shim         Simsn          kr         125722   
 150      2598     Fa Ming     Liang           DDC          mo        1337308   
 ..        ...         ...       ...           ...         ...            ...   
 115      2578     Clement    Ivanov        Puppey          ee        2864004   
 901     11132     Wei Lin      Chen      tom60229          tw         442878   
 142      3161      Junhao       Xie         Super          cn        1598492   
 794     48874     Michael      Wake         mykLe          gb          49288   
 44       3293      Cédric   Guipouy           RpK          fr         498888   
 
                          

In [140]:
# Introduce more parameters
X = df
y=df[['CurrentHandle','CountryCode','TotalUSDPrize','Game','Genre']]

In [141]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, train_size=0.8,test_size=0.20, random_state=100)
print ("X_train: ", X_train)
print ("y_train: ", y_train)
print("X_test: ", X_test)
print ("y_test: ", y_test)

X_train:       PlayerId    NameFirst     NameLast CurrentHandle CountryCode  \
675     22244       Markus        Hanke        Blumbi          de   
358     51909       Patrik  Zaharchenko        Pate1k          no   
159      3164   Zhengzheng          Yao           Yao          cn   
533      1030      Ji Sung         Choi        Bomber          kr   
678     21719      Michael        Udall  MichaelUdall          us   
..        ...          ...          ...           ...         ...   
855     18456      Pakinai    Srivijarn          kSsA          th   
871     53974            -            -       Sirenia          tw   
835     70308            -            -            轻雨          cn   
792     48935       Magnus     Hartmann         udyRR          de   
520      1175  Juan Carlos        Lopez       SpeCial          mx   

     TotalUSDPrize                 Game                            Genre  
675          65526  Heroes of the Storm  Multiplayer Online Battle Arena  
358        

In [142]:
X_test[['TotalUSDPrize']].describe()

Unnamed: 0,TotalUSDPrize
count,200
mean,491010
std,825077
min,25941
25%,90364
50%,182583
75%,519378
max,6470000


The median prize in the test split is 182,583 dollars, higher than in the systematic sample (170,430 dollars), the whole dataset (168,328 dollars) or the simple random sample (153,191 dollars).
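For comparison, a split that is actually stratified by genre could look like the sketch below. It assumes a toy DataFrame that reproduces the genre counts of the dataset with made-up prize values; `toy`, `rest` and `strat_sample` are illustrative names.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with the same genre proportions as the dataset (values are made up)
genres = (["Multiplayer Online Battle Arena"] * 400 + ["Battle Royale"] * 200
          + ["First-Person Shooter"] * 200 + ["Collectible Card Game"] * 100
          + ["Strategy"] * 100)
toy = pd.DataFrame({"Genre": genres, "TotalUSDPrize": range(1000)})

# stratify= keeps each genre's share the same in both subsets
rest, strat_sample = train_test_split(
    toy, test_size=0.2, stratify=toy["Genre"], random_state=100
)
print(strat_sample["Genre"].value_counts())
```

Each genre then contributes 20% of its rows to `strat_sample`, so the sample mirrors the genre mix of the full data.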

### SMOTE (Synthetic Minority Oversampling Technique).

SMOTE (Synthetic Minority Oversampling Technique) is an oversampling technique that generates synthetic samples for the minority classes. It addresses a common problem of imbalanced classification: there are too few examples of the minority class for a model to learn the decision boundary effectively. To create a synthetic sample, SMOTE takes the difference between a minority-class sample and one of its k nearest neighbours, multiplies it by a random value in the range (0, 1), and adds the result to the original sample.
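The interpolation step can be sketched in plain NumPy. This is only an illustration of the idea under simple assumptions (Euclidean distance, a handful of made-up points), not the imblearn implementation; `synthetic_point` and `minority` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(100)

# Hypothetical minority-class points with two numeric features
minority = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [8.0, 8.0]])


def synthetic_point(points, i, k, rng):
    """Interpolate between points[i] and one of its k nearest neighbours."""
    distances = np.linalg.norm(points - points[i], axis=1)
    neighbours = np.argsort(distances)[1:k + 1]  # skip the point itself
    j = rng.choice(neighbours)
    gap = rng.random()  # random value in [0, 1)
    return points[i] + gap * (points[j] - points[i])


new_point = synthetic_point(minority, i=0, k=2, rng=rng)
print(new_point)
```

The new point always lies on the segment between the chosen sample and its neighbour, which is why SMOTE only makes sense for features where such in-between values are meaningful.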

In [143]:
from imblearn.over_sampling import SMOTE
from sklearn import preprocessing

In [144]:
df2 = df.loc[:, ['CountryCode','TotalUSDPrize','Game','Genre']].copy()  # copy so new columns can be added safely
df2

Unnamed: 0,CountryCode,TotalUSDPrize,Game,Genre
0,dk,1822989,Counter-Strike: Global Offensive,First-Person Shooter
1,dk,1799289,Counter-Strike: Global Offensive,First-Person Shooter
2,dk,1787490,Counter-Strike: Global Offensive,First-Person Shooter
3,dk,1652351,Counter-Strike: Global Offensive,First-Person Shooter
4,dk,1416449,Counter-Strike: Global Offensive,First-Person Shooter
...,...,...,...,...
995,fi,50734,Hearthstone,Collectible Card Game
996,us,50450,Hearthstone,Collectible Card Game
997,it,49300,Hearthstone,Collectible Card Game
998,us,48550,Hearthstone,Collectible Card Game


In [145]:
# Creating instance of labelencoder
labelencoder = preprocessing.LabelEncoder()
# Assigning numerical values and storing in another column

df2['CountryCode_num'] = labelencoder.fit_transform(df2['CountryCode'])
df2['Game_num']= labelencoder.fit_transform(df2['Game'])

In [146]:
df2

Unnamed: 0,CountryCode,TotalUSDPrize,Game,Genre,CountryCode_num,Game_num
0,dk,1822989,Counter-Strike: Global Offensive,First-Person Shooter,14,1
1,dk,1799289,Counter-Strike: Global Offensive,First-Person Shooter,14,1
2,dk,1787490,Counter-Strike: Global Offensive,First-Person Shooter,14,1
3,dk,1652351,Counter-Strike: Global Offensive,First-Person Shooter,14,1
4,dk,1416449,Counter-Strike: Global Offensive,First-Person Shooter,14,1
...,...,...,...,...,...,...
995,fi,50734,Hearthstone,Collectible Card Game,17,4
996,us,50450,Hearthstone,Collectible Card Game,53,4
997,it,49300,Hearthstone,Collectible Card Game,26,4
998,us,48550,Hearthstone,Collectible Card Game,53,4


In [147]:
X=df2[['TotalUSDPrize','CountryCode_num', 'Game_num']]
X

Unnamed: 0,TotalUSDPrize,CountryCode_num,Game_num
0,1822989,14,1
1,1799289,14,1
2,1787490,14,1
3,1652351,14,1
4,1416449,14,1
...,...,...,...
995,50734,17,4
996,50450,53,4
997,49300,26,4
998,48550,53,4


In [148]:
y=df2['Genre']

In [149]:
seed=100
k=1
sm= SMOTE(sampling_strategy='auto', k_neighbors=k, random_state=seed)
X_res, y_res = sm.fit_resample(X, y)


In [150]:
X_res

Unnamed: 0,TotalUSDPrize,CountryCode_num,Game_num
0,1822989,14,1
1,1799289,14,1
2,1787490,14,1
3,1652351,14,1
4,1416449,14,1
...,...,...,...
1995,518525,43,9
1996,132990,16,9
1997,884038,29,9
1998,93877,29,9


In [151]:
y_res

0       First-Person Shooter
1       First-Person Shooter
2       First-Person Shooter
3       First-Person Shooter
4       First-Person Shooter
                ...         
1995                Strategy
1996                Strategy
1997                Strategy
1998                Strategy
1999                Strategy
Name: Genre, Length: 2000, dtype: object

In [152]:
df3 = pd.concat([X_res, y_res], axis=1)  # combine resampled features and labels
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   TotalUSDPrize    2000 non-null   float64
 1   CountryCode_num  2000 non-null   int64  
 2   Game_num         2000 non-null   int64  
 3   Genre            2000 non-null   object 
dtypes: float64(1), int64(2), object(1)
memory usage: 62.6+ KB


In [153]:
df3.Game_num.value_counts()

4    408
9    400
7    205
1    196
3    185
8    183
5    114
6    108
2    101
0    100
Name: Game_num, dtype: int64

In [154]:
df3.Genre.value_counts()

Strategy                           400
Multiplayer Online Battle Arena    400
First-Person Shooter               400
Collectible Card Game              400
Battle Royale                      400
Name: Genre, dtype: int64

In [155]:
df.Genre.value_counts()

Multiplayer Online Battle Arena    400
Battle Royale                      200
First-Person Shooter               200
Collectible Card Game              100
Strategy                           100
Name: Genre, dtype: int64

In [156]:
X_res[['TotalUSDPrize']].describe()

Unnamed: 0,TotalUSDPrize
count,2000
mean,321146
std,535898
min,24172
25%,85541
50%,152769
75%,327898
max,6952597


Although our dataset is balanced by game, we applied SMOTE sampling as an exercise, using "Genre" as y. The genres are not balanced (100 to 400 players each), so SMOTE raised every genre to 400 rows. With the SMOTE sample the median is lower, 152,769 dollars. In any case, this dataset did not really need oversampling or undersampling.

## Exercise 3 Reservoir sampling.

Reference: Pavlos S. Efraimidis and Paul G. Spirakis (2006), "Weighted random sampling with a reservoir". Reservoir sampling applies when we have a stream of items of large and unknown length that we can only iterate over once. The algorithm randomly chooses k items from this stream such that each item is equally likely to be selected. I use the same dataset to show an example.
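A short derivation shows why every item ends up equally likely in the unweighted scheme used in this exercise. Assume a stream of length n with items numbered from 1. An item i > k enters the reservoir with probability k/i. At each later step j it is overwritten only if item j is accepted (probability k/j) and lands on the same slot (probability 1/k), so:

$$
P(\text{item } i \text{ in final reservoir}) = \frac{k}{i}\prod_{j=i+1}^{n}\left(1-\frac{k}{j}\cdot\frac{1}{k}\right) = \frac{k}{i}\prod_{j=i+1}^{n}\frac{j-1}{j} = \frac{k}{i}\cdot\frac{i}{n} = \frac{k}{n}
$$

For the first k items the entry probability is 1 and the product starts at j = k + 1, which again gives k/n.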

In [157]:
import random



In [158]:
# Reservoir sampling from the stream
k = 100
reservoir = []
for i, element in enumerate(df["TotalUSDPrize"]):
    if i < k:
        # Fill the reservoir with the first k items
        reservoir.append(element)
    else:
        # Keep item i+1 with probability k/(i+1)
        probability = k / (i + 1)
        if random.random() < probability:
            # Overwrite one of the k items already selected, chosen uniformly
            reservoir[random.randrange(k)] = element
print(reservoir)

[243027.72, 1799288.57, 83373.91, 28083.35, 1416448.64, 1755205.74, 1063858.27, 89346.12, 232418.28, 88443.76, 170911.34, 127688.41, 65922.73, 166127.4, 2002159.51, 133715.0, 188358.03, 867823.34, 327424.24, 180858.29, 805444.66, 250580.01, 47602.61, 87834.52, 66306.23, 1349767.32, 64304.93, 689568.7, 82409.24, 81023.44, 119560.82, 69918.09, 53293.7, 162017.51, 81795.94, 136120.04, 762437.0, 2551657.34, 2275625.58, 1257615.87, 289675.08, 326176.94, 106225.0, 31489.38, 2864004.17, 191647.23, 112482.56, 28816.0, 483747.81, 227008.15, 437479.98, 152619.67, 297730.97, 414305.65, 68984.11, 89669.63, 132306.1, 104671.25, 70298.68, 78178.8, 126204.51, 58160.23, 361678.02, 174361.72, 135189.54, 114751.8, 315616.67, 111710.36, 1213576.66, 102114.43, 88040.91, 341250.0, 156959.11, 133558.02, 89725.4, 164363.81, 265429.15, 41144.36, 1598492.49, 810683.0, 148558.05, 325910.24, 594850.0, 227550.0, 545447.56, 165348.34, 1873138.8, 152734.63, 271901.75, 133114.01, 171767.15, 1186002.61, 75756.09, 510

In [159]:
df4=df[df.TotalUSDPrize.isin(reservoir)]
df4

Unnamed: 0,PlayerId,NameFirst,NameLast,CurrentHandle,CountryCode,TotalUSDPrize,Game,Genre
1,3679,Andreas,Højsleth,Xyp9x,dk,1799289,Counter-Strike: Global Offensive,First-Person Shooter
4,17800,Emil,Reif,Magisk,dk,1416449,Counter-Strike: Global Offensive,First-Person Shooter
6,12183,Epitácio,de Melo,TACO,br,1063858,Counter-Strike: Global Offensive,First-Person Shooter
17,5000,Freddy,Johansson,KRiMZ,se,867823,Counter-Strike: Global Offensive,First-Person Shooter
20,3290,Nathan,Schmitt,NBK,fr,805445,Counter-Strike: Global Offensive,First-Person Shooter
...,...,...,...,...,...,...,...,...
934,33382,Cheon Su,Kim,Che0nsu,kr,133715,Hearthstone,Collectible Card Game
953,43794,Juwei,Wu,XiaoT,cn,89346,Hearthstone,Collectible Card Game
958,47728,Zakarya,Hail,xBlyzes,fr,83374,Hearthstone,Collectible Card Game
959,30167,Francisco,Leimontas,PNC,ar,82409,Hearthstone,Collectible Card Game


In [160]:
df4[['TotalUSDPrize']].describe()

Unnamed: 0,TotalUSDPrize
count,119
mean,369019
std,556024
min,28083
25%,74197
50%,148558
75%,326044
max,2864004


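With this sampling method the median is 148,558 dollars. Note that df4 has 119 rows rather than 100: because isin matches on the prize value, every player whose prize equals a sampled value is included.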
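One way to end up with exactly k rows is to keep row positions in the reservoir instead of prize values. The sketch below assumes a toy DataFrame with many repeated prizes; `reservoir_positions` and its `seed` parameter are illustrative, and in a true stream the loop would run over the incoming items rather than a known range.

```python
import random

import pandas as pd


def reservoir_positions(n_rows, k, seed=None):
    """Reservoir-sample k row positions from a stream of n_rows positions."""
    rng = random.Random(seed)
    reservoir = []
    for i in range(n_rows):
        if i < k:
            reservoir.append(i)
        elif rng.random() < k / (i + 1):
            # Replace a uniformly chosen slot with the new position
            reservoir[rng.randrange(k)] = i
    return reservoir


# Hypothetical frame with many repeated prize values
toy = pd.DataFrame({"TotalUSDPrize": [1000.0, 2000.0] * 500})
sampled = toy.iloc[reservoir_positions(len(toy), k=100, seed=7)]
print(len(sampled))
```

Because each position enters the reservoir at most once, the selected rows are distinct even when many prize values coincide.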