# Jan 9 2023:

## Preliminary considerations:

https://www.analyticsvidhya.com/blog/2020/11/handling-imbalanced-data-machine-learning-computer-vision-and-nlp/

SMOTE (Synthetic Minority Oversampling Technique), upsampling:
It is based on the k-nearest-neighbours algorithm: it generates synthetic data points close to the existing records of the minority class by interpolating between a minority record and one of its nearest minority neighbours. The input records must not contain any null values when applying this approach.

	-> https://datascience.stackexchange.com/questions/108342/how-does-smote-work-for-dataset-with-only-categorical-variables
	SMOTE only works on numerical data
	for purely categorical data, use SMOTEN:
		https://imbalanced-learn.org/dev/references/generated/imblearn.over_sampling.SMOTEN.html#imblearn.over_sampling.SMOTEN
	for mixed (categorical and numerical) data, use SMOTENC:
		https://imbalanced-learn.org/dev/references/generated/imblearn.over_sampling.SMOTENC.html#imblearn.over_sampling.SMOTENC
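
To make the interpolation idea behind plain SMOTE concrete, here is a minimal sketch for purely numerical data. It is not the imbalanced-learn implementation: the function name `smote_sketch`, its parameters and the toy points are made up for illustration, and the neighbour search is brute force.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority points.

    Illustrative only: numerical features, brute-force neighbour search.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position on the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (nb - b) for b, nb in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy minority class with two numerical features
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_sketch(minority, n_new=6, k=2)
```

Every synthetic point lies on a line segment between two existing minority points, which is why this only makes sense for numerical features. Categorical values cannot be interpolated, hence SMOTEN and SMOTENC.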

Data duplication, upsampling: In this approach, existing data points belonging to the minority labels are randomly selected and duplicated.
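
A minimal sketch of upsampling by duplication, assuming records and labels are given as two parallel lists. The function name `duplicate_upsample` and the toy data are hypothetical.

```python
import random
from collections import Counter


def duplicate_upsample(records, labels, seed=0):
    """Randomly duplicate minority-class records until all classes are equally large."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class
    out_records, out_labels = list(records), list(labels)
    for label, count in counts.items():
        pool = [r for r, l in zip(records, labels) if l == label]
        extra = rng.choices(pool, k=target - count)  # sampling with replacement
        out_records.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_records, out_labels


records = ["a", "b", "c", "d", "e"]
labels = ["maj", "maj", "maj", "maj", "min"]
up_records, up_labels = duplicate_upsample(records, labels)
```

Note that the added records are exact copies of existing ones, so no new information enters the dataset.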

## Implementation:

SMOTENC upsampling implemented: it can generate synthetic records.
Problem:
	So far the ID fields (globalID, localID, sourceID) are ignored.
	How should the ID fields be handled?
	In other words, when should two synthetic records count as a match and when not?
	The alternative would be to start with data duplication.
	Correction: data duplication is not a good idea, because the additional duplicates mess up the matching. It is usable for ML problems, but hardly for RL.


# Jan 11 2023

## Thoughts after meeting with Florens:
TODO:
* The frequency distribution in the resampled dataset should stay the same
* Downsample the dataset -> does the overlap stay the same?
* Possibly downsample first and, if needed, later draw from a large dataset (millions of records), so that the desired dataset properties are obtained purely through subsets/downsampling.

## Steps to check if upsampling preserved ds properties:
* Sample up
* let n=1000, for example
* Draw n random samples from original
* Draw n random samples from up-sampled
* if the property in question is skewness, all random samples must have the same size (because size affects skewness, as noted in programmiertagebuch.md on Jan 11)
* for each sample, calculate the property in question and store it in a list, so that you end up with two lists (original and up-sampled), each with n values.
* run a significance test on the two lists to check whether the two distributions differ significantly
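
The steps above can be sketched as follows. This is a dependency-free illustration under several assumptions. The data is a hypothetical Zipf-like name column, not the NCVR dataset. The "up-sampled" set is produced by simple duplication as a stand-in for SMOTENC. Skewness is computed by hand as the adjusted Fisher-Pearson sample skewness. The significance test is Welch's t statistic with a normal approximation for the p-value, swapped in for a library t-test.

```python
import math
import random
from collections import Counter
from statistics import NormalDist, mean, variance


def skewness(values):
    """Adjusted Fisher-Pearson sample skewness (needs at least 3 values)."""
    n = len(values)
    m = mean(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0
    return math.sqrt(n * (n - 1)) / (n - 2) * m3 / m2 ** 1.5


def property_samples(data, n, size, rng):
    """Draw n random subsets of equal size and compute the property for each."""
    result = []
    for _ in range(n):
        subset = rng.sample(data, size)
        counts = list(Counter(subset).values())  # frequency of each distinct value
        result.append(skewness(counts))
    return result


def welch_test(a, b):
    """Welch's t statistic; p-value via normal approximation (reasonable for large n)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    t = (mean(a) - mean(b)) / se
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


rng = random.Random(42)
# Hypothetical stand-ins for one column of the original and the up-sampled dataset
names = [f"name{i}" for i in range(200)]
weights = [1 / (i + 1) for i in range(200)]
original = rng.choices(names, weights=weights, k=5_000)
upsampled = original + rng.choices(original, k=5_000)

skews_original = property_samples(original, n=200, size=500, rng=rng)
skews_upsampled = property_samples(upsampled, n=200, size=500, rng=rng)
t_stat, p_value = welch_test(skews_original, skews_upsampled)
print(f"t = {t_stat:.3f}, p = {p_value:.3g}")
```

Both lists are built from subsets of the same size (500 here), so that subset size cannot distort the comparison.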

## Trying to run SMOTENC on the whole dataset to resample it to larger size
I just found out that the complexity of SMOTE lies between O(n²) and O(n³), which makes it infeasible for large datasets.
(See https://datascience.stackexchange.com/questions/48709/svm-smote-fit-resample-function-runs-forever-with-no-result)
I ran it on the whole dataset (200k rows) and it had not finished after 1 hour.
For 1000 rows it takes a few minutes.

Solution ideas:
* use downsampling instead, as noted above (under TODO)
* try to run on cluster

## Get the data

In [7]:
import pandas as pd

In [8]:
filepath = "data/2021_NCVR_Panse_001/dataset_ncvr_dirty.csv"
col_names = "sourceID,globalID,localID,FIRSTNAME,MIDDLENAME,LASTNAME,YEAROFBIRTH,PLACEOFBIRTH,COUNTRY,CITY,PLZ,STREET,GENDER,ETHNIC,RACE".split(",")
df = pd.read_csv(filepath, names=col_names, dtype={"PLZ": str, "YEAROFBIRTH": int},
                              keep_default_na=False)
df.shape[0]

200000

# Jan 11, 2023

## Measure Skewness of First Names, Male vs Female

In [9]:
df[df.GENDER == "F"].FIRSTNAME.value_counts().skew()

16.190111903664548

In [10]:
df[df.GENDER == "F"].sample(50_000).FIRSTNAME.value_counts().skew()

12.48531405193897

In [11]:
df[df.GENDER == "M"].FIRSTNAME.value_counts().skew()

21.14717233878872

In [12]:
df[df.GENDER == "M"].sample(50_000).FIRSTNAME.value_counts().skew()

17.565501388622522

Learnings:
* positive skew = tail to the right
* Skewness is amplified by subset size (negative skew becomes more negative, positive skew becomes larger)
* the skew of male first names is larger than that of female first names -> female first names are more diverse
	-> this would explain the better precision and lower recall on the female subset

# Jan 12 2023:

## Influence of SMOTENC upsampling on skewness

In [13]:
from time import perf_counter
from resampling import smotenc

# sample up
original_size = 5000
original = df.drop(df.columns[[0, 1, 2]], axis=1).sample(original_size, random_state=42)  # drop ID columns
categorical_features = [0, 1, 2, 4, 5, 6, 7, 8, 9, 10, 11]  # all but 3 (year of birth)
start = perf_counter()
res = smotenc(original, original.shape[0] * 2, categorical_features, random_state=42)
stop = perf_counter()
print(f"Elapsed time: {stop-start}")

Elapsed time: 12.694753911999214


In [14]:
original

Unnamed: 0,FIRSTNAME,MIDDLENAME,LASTNAME,YEAROFBIRTH,PLACEOFBIRTH,COUNTRY,CITY,PLZ,STREET,GENDER,ETHNIC,RACE
119737,MATTHEW,ERSKINE,SMITH,1984,NC,IREDELL,MOORESVILLE,28115,CEDAR,M,UN,W
72272,HELEN,B,ROSEMAN,1976,NC,MECKLENBURG,CORNELIUS,28031,SHEARWATER POINT,F,NL,W
158154,MAURICE,DONYELL,WIGGINS,1981,DC,ONSLOW,JACKSONVILLE,28546,WINDSOR,M,NL,B
65426,CAMILLE,LAJOY,PRYSOCK,1977,,CATAWBA,HICKORY,28602,9TH,F,NL,B
30074,DENISE,MARIE,BEAMER,1975,IN,PITT,GRIFTON,28530,LEHMAN,F,UN,W
...,...,...,...,...,...,...,...,...,...,...,...,...
170108,CONNIE,D,WILSON,1947,WV,CRAVEN,NEW BERN,28562,BADEN,F,NL,W
29900,DONALD,R,GORE,1952,NC,WAKE,WENDELL,27591,RIDGE HAVEN,M,NL,W
20386,VERTIE,,FOSTER,1900,NC,YADKIN,JONESVILLE,28642,BETHEL,F,NL,B
83444,TAUJAUNA,TUGMAN,HOWARD,1963,NC,UNION,MATTHEWS,28104,CALITERRA,F,UN,B


In [15]:
res

Unnamed: 0,FIRSTNAME,MIDDLENAME,LASTNAME,YEAROFBIRTH,PLACEOFBIRTH,COUNTRY,CITY,PLZ,STREET,GENDER,ETHNIC,RACE
0,MATTHEW,ERSKINE,SMITH,1984.000000,NC,IREDELL,MOORESVILLE,28115,CEDAR,M,UN,W
1,HELEN,B,ROSEMAN,1976.000000,NC,MECKLENBURG,CORNELIUS,28031,SHEARWATER POINT,F,NL,W
2,MAURICE,DONYELL,WIGGINS,1981.000000,DC,ONSLOW,JACKSONVILLE,28546,WINDSOR,M,NL,B
3,CAMILLE,LAJOY,PRYSOCK,1977.000000,,CATAWBA,HICKORY,28602,9TH,F,NL,B
4,DENISE,MARIE,BEAMER,1975.000000,IN,PITT,GRIFTON,28530,LEHMAN,F,UN,W
...,...,...,...,...,...,...,...,...,...,...,...,...
19995,CARLA,JOHANNA,BARNETT,1959.211358,,DAVIDSON,LEXINGTON,27295,HUNTERS RIDGE,F,NL,W
19996,STEVEN,ANTHONY,IMES,1967.193497,NC,IREDELL,STATESVILLE,28677,JANE SOWERS,M,NL,W
19997,NICHOLAS,ERNEST,SMITH,1997.000000,,UNION,MONROE,28704,THOROUGHBRED,U,UN,U
19998,JEFFREY,,ROBINSON,1980.906702,,JACKSON,SYLVA,28779,JAMES,M,UN,W


In [16]:
original[original.GENDER == "F"].sample(1000).FIRSTNAME.value_counts().skew()

3.799310979689645

In [17]:
res[res.GENDER == "F"].sample(1000).FIRSTNAME.value_counts().skew()

3.9036208186501877

In [18]:
n = 1000
size = 1000
skews_orignal = []
skews_res = []
for _ in range(n):
    skews_orignal.append(original[original.GENDER == "M"].sample(size).FIRSTNAME.value_counts().skew())
    skews_res.append(res[res.GENDER == "M"].sample(size).FIRSTNAME.value_counts().skew())

In [19]:
from scipy.stats import ttest_ind, describe

In [20]:
describe(skews_orignal)

DescribeResult(nobs=1000, minmax=(4.635980729539374, 6.163776110632538), mean=5.387363329317303, variance=0.05991665893034635, skewness=0.11716625705634509, kurtosis=-0.18989819817337183)

In [21]:
describe(skews_res)

DescribeResult(nobs=1000, minmax=(5.031590466621371, 6.68770491186547), mean=5.829257860581405, variance=0.07486386013727452, skewness=0.09797169866288152, kurtosis=-0.05962616892646677)

In [22]:
ttest_ind(skews_orignal, skews_res)

Ttest_indResult(statistic=-38.063179639350565, pvalue=7.161469072005977e-239)

### => upsampling significantly increases the skewness of male first names

In [27]:
n = 1000
size = 1000
skews_orignal = []
skews_res = []
for _ in range(n):
    skews_orignal.append(original[original.GENDER == "F"].sample(size).FIRSTNAME.value_counts().skew())
    skews_res.append(res[res.GENDER == "F"].sample(size).FIRSTNAME.value_counts().skew())

In [28]:
describe(skews_orignal)

DescribeResult(nobs=1000, minmax=(2.7011546165296294, 6.782843211177342), mean=3.8841015070814473, variance=0.2602796622676175, skewness=0.8372279482619696, kurtosis=1.5809580596857993)

In [29]:
describe(skews_res)

DescribeResult(nobs=1000, minmax=(2.5472132932324496, 7.2468158387529416), mean=4.022322389623218, variance=0.44654805153016575, skewness=0.8142021840431486, kurtosis=1.0382674956004418)

In [30]:
ttest_ind(skews_orignal, skews_res)

Ttest_indResult(statistic=-5.198964799426186, pvalue=2.2086305999541598e-07)

### => upsampling significantly increases the skewness of both male and female first names

### Another problem with SMOTE: its runtime can be as bad as O(n³). Runs with more than 6000 records did not terminate.