
# Feature_generation

This notebook collects two approaches to generating new samples for the high-quality class (class 2).

The following approaches are used:
- generate new samples from class 1 (medium quality) by replacing the feature values that are specific to class 2
- generate new samples by taking the mean of two samples from class 2

## Loading required libraries 

In [2]:
# matrices
import numpy as np
import pandas as pd

#sklearn
from sklearn.model_selection import train_test_split

#scipy
from scipy import stats

### Loading prepared dataset

In [3]:
X = pd.read_csv('../../data/intermid/X_wo_missed_values.csv',index_col=0)
y = pd.read_csv('../../data/intermid/y_wo_missed_values.csv',index_col=0)

In [4]:
df = pd.concat([X,y],axis=1)

## Making high quality wine features from medium quality wine

### Making feature statistic analysis

In [4]:
features_for_high_quality = [
    'chlorides',
    'density',
    'pH',
    'alcohol',
    'fixed acidity',
    'volatile acidity',
    'sulphates'
]

In [5]:
# check whether each feature differs significantly between high- and medium-quality wines
# (one-sample t-test of the high-quality values against the medium-quality mean)
features_for_high_quality_statistic = {'feature':[],
                                      'criteria_student_1samp':[],
                                      'p_value':[]}

for i in features_for_high_quality:
    high_array = np.array(df.loc[df.quality==2, i]) # 2 = quality 'high'
    medium_array = np.array(df.loc[df.quality==1, i]) # 1 = quality 'medium'
    
    criteria_student_1samp = stats.ttest_1samp(a=high_array, popmean=np.mean(medium_array))
    
    features_for_high_quality_statistic['feature'].append(i)
    features_for_high_quality_statistic['criteria_student_1samp'].append(round(criteria_student_1samp[0],3))
    features_for_high_quality_statistic['p_value'].append(round(criteria_student_1samp[1],5))

pd.DataFrame(features_for_high_quality_statistic)

| | feature | criteria_student_1samp | p_value |
|---|---|---|---|
| 0 | chlorides | -8.546 | 0.00103 |
| 1 | density | -2.316 | 0.08152 |
| 2 | pH | 2.437 | 0.07146 |
| 3 | alcohol | 3.701 | 0.02081 |
| 4 | fixed acidity | 0.476 | 0.65901 |
| 5 | volatile acidity | -1.425 | 0.22731 |
| 6 | sulphates | -1.601 | 0.18473 |


The features __chlorides__ and __alcohol__ are statistically significant (p < 0.05).
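To make the test statistic in the table concrete, here is a minimal sketch that recomputes a one-sample t statistic by hand and compares it with `stats.ttest_1samp`. The numbers are toy values invented for illustration, not taken from the wine dataset, and the variable names are assumptions.

```python
import numpy as np
from scipy import stats

# Toy values standing in for one feature of the high-quality wines (not real data).
high_values = np.array([0.030, 0.035, 0.028, 0.033, 0.031])
# Toy stand-in for the medium-quality mean of the same feature.
medium_mean = 0.045

n = len(high_values)
# t = (sample mean - reference mean) / (sample std / sqrt(n)), sample std with ddof=1
t_manual = (high_values.mean() - medium_mean) / (high_values.std(ddof=1) / np.sqrt(n))
# two-sided p-value from Student's t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

t_scipy, p_scipy = stats.ttest_1samp(a=high_values, popmean=medium_mean)
print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))
```

A negative statistic, as for chlorides in the table, means the high-quality sample mean lies below the reference mean.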

### Preparing function

Make new samples from medium-quality wines by replacing their chlorides and alcohol values so that they correspond to high-quality wines.

In [6]:
def make_new_sample_from_medium(n,dataset):
    '''
        MAKE NEW SAMPLES FROM MEDIUM QUALITY WINES, REPLACING chlorides AND alcohol
        WITH VALUES DRAWN FROM NORMAL DISTRIBUTIONS FITTED TO HIGH QUALITY WINES
        n - number of new samples
        Returns only the new samples; dataset itself is not modified.
    '''

    high_chlorides = np.array(dataset.loc[dataset.quality==2,'chlorides'])
    high_alcohol = np.array(dataset.loc[dataset.quality==2,'alcohol'])

    # draw n values per feature from a normal distribution with the high-quality mean and std
    new_item_chlorides = stats.norm.rvs(loc=np.mean(high_chlorides), scale=np.std(high_chlorides),size=n)
    new_item_alcohol = stats.norm.rvs(loc=np.mean(high_alcohol), scale=np.std(high_alcohol),size=n)

    medium_quality_wines = dataset[dataset.quality==1]

    # choose n medium-quality rows at random (with replacement)
    chosen_index = np.random.choice(medium_quality_wines.index,size=n)

    # copy so that the assignments below do not touch the original frame
    new_item = medium_quality_wines.loc[chosen_index,:].copy()

    new_item['chlorides'] = new_item_chlorides
    new_item['alcohol'] = new_item_alcohol
    new_item['quality'] = 2

    return new_item
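
The idea behind the function can be tried on a small table. The sketch below is only an illustration under stated assumptions: the toy values are invented, `toy_generate` is a hypothetical helper that is not part of this notebook, and it draws from a seeded `np.random.default_rng` with `rng.normal` in place of `stats.norm.rvs` so that the result is reproducible.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the wine table (values are made up for illustration).
toy = pd.DataFrame({
    'chlorides': [0.050, 0.048, 0.052, 0.030, 0.032, 0.028],
    'alcohol':   [10.0, 10.5, 9.8, 12.5, 12.8, 12.2],
    'pH':        [3.20, 3.25, 3.18, 3.30, 3.35, 3.28],
    'quality':   [1, 1, 1, 2, 2, 2],
})

def toy_generate(n, dataset, seed):
    """Hypothetical helper: n medium rows relabelled as class 2 with resampled features."""
    rng = np.random.default_rng(seed)
    high = dataset[dataset.quality == 2]
    medium = dataset[dataset.quality == 1]
    # pick n medium-quality rows with replacement and work on a copy
    new_item = medium.loc[rng.choice(medium.index, size=n)].copy()
    for col in ['chlorides', 'alcohol']:
        # normal distribution fitted to the high-quality values (ddof=0, like np.std)
        new_item[col] = rng.normal(loc=high[col].mean(), scale=high[col].std(ddof=0), size=n)
    new_item['quality'] = 2
    return new_item

generated = toy_generate(4, toy, seed=0)
print(generated.shape)
```

All other features, `pH` in this toy table, keep their medium-quality values, so a generated row differs from its source row only in the two replaced columns and the label.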

### Generating new samples

In [7]:
new_high_quality_samples = make_new_sample_from_medium(3000,df)

In [8]:
index_high_quality = df[df.quality==2].index

### Splitting into Train, Test and Validation and adding new samples ONLY to the Train dataset

In [9]:
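# split classes 0 and 1 only; the real high-quality wines are kept out of the train set
df_wo_high = df.drop(index=index_high_quality)
X_train, X_test, y_train, y_test = train_test_split(df_wo_high.iloc[:,:-1],
                                                    df_wo_high.iloc[:,-1],test_size=.4,random_state=255)

#adding high quality wines
# index_high_quality holds index labels, so the rows are selected with .loc (iloc expects positions);
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
X_test = pd.concat([X_test, df.loc[index_high_quality].iloc[:,:-1]])
y_test = pd.concat([y_test, df.loc[index_high_quality].iloc[:,-1]])

X_test, X_val, y_test, y_val = train_test_split(X_test,y_test,test_size=.4,random_state=255)

np.bincount(y_train), np.bincount(y_test),np.bincount(y_val)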

(array([ 137, 3758], dtype=int64),
 array([  64, 1494,    3], dtype=int64),
 array([ 45, 994,   2], dtype=int64))
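
The train counts above show only two bins because `np.bincount` stops at the largest label present. A small sketch with toy labels shows how the `minlength` argument keeps the three class columns aligned across the splits:

```python
import numpy as np

# Toy labels with no class 2, like y_train before the generated samples are added.
y_no_high = np.array([0, 1, 1, 1])

print(np.bincount(y_no_high))               # two bins: classes 0 and 1
print(np.bincount(y_no_high, minlength=3))  # three bins, class 2 reported as 0
```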

In [10]:
#adding generated examples to train dataset (pd.concat replaces the removed .append)
X_train = pd.concat([X_train, new_high_quality_samples.iloc[:,:-1]])
y_train = pd.concat([y_train, new_high_quality_samples.iloc[:,-1]])


In [11]:
np.bincount(y_train), np.bincount(y_test),np.bincount(y_val)

(array([ 137, 3758, 3000], dtype=int64),
 array([  64, 1494,    3], dtype=int64),
 array([ 45, 994,   2], dtype=int64))

### Making 50 duplicates of class 2 in Test and Validation datasets

In [12]:
#adding 47 duplicates of high_quality wine to X_test and y_test (3 originals + 47 = 50):
X_y_test = pd.concat([X_test,y_test],axis=1)

a = X_y_test[X_y_test.quality==2]

a = a.sample(n=47,random_state=255,replace=True)

X_y_test = pd.concat([a,X_y_test],axis=0)


X_test = X_y_test.iloc[:,:-1]
y_test = X_y_test.iloc[:,-1]

np.bincount(y_test)

array([  64, 1494,   50], dtype=int64)

In [13]:
#adding 48 duplicates of high_quality wine to X_val and y_val (2 originals + 48 = 50):
X_y_val = pd.concat([X_val,y_val],axis=1)

a = X_y_val[X_y_val.quality==2]

a = a.sample(n=48,random_state=255,replace=True)

X_y_val = pd.concat([a,X_y_val],axis=0)


X_val = X_y_val.iloc[:,:-1]
y_val = X_y_val.iloc[:,-1]

np.bincount(y_val)

array([ 45, 994,  50], dtype=int64)
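
The two cells above repeat the same steps with hand-computed duplicate counts (47 and 48). As a hedged sketch, the steps could be wrapped in one helper that computes the number of missing rows itself; `upsample_class_to` and the toy frame are hypothetical and not part of this notebook.

```python
import pandas as pd

def upsample_class_to(frame, label, target, label_col='quality', random_state=255):
    """Hypothetical helper: add duplicates of `label` rows until there are `target` of them."""
    rows = frame[frame[label_col] == label]
    missing = target - len(rows)
    if missing <= 0 or rows.empty:
        # nothing to add, or no rows of this class to duplicate
        return frame
    extra = rows.sample(n=missing, random_state=random_state, replace=True)
    return pd.concat([extra, frame], axis=0)

toy = pd.DataFrame({'pH': [3.1, 3.2, 3.3, 3.4], 'quality': [0, 1, 2, 2]})
balanced = upsample_class_to(toy, label=2, target=5)
print(balanced.quality.value_counts().sort_index().tolist())
```

Duplicates add no new information: they only increase the weight of the few real class 2 rows when metrics are computed on the test and validation sets.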

In [14]:
print('class:')
print('0\t1\t2')
print('-'*35)
print(np.bincount(y_train), ' <- train')
print(np.bincount(y_test), ' <- test')
print(np.bincount(y_val), ' <- validation')

class:
0	1	2
-----------------------------------
[ 137 3758 3000]  <- train
[  64 1494   50]  <- test
[ 45 994  50]  <- validation


In [15]:
np.bincount(y_train), np.bincount(y_test),np.bincount(y_val)

(array([ 137, 3758, 3000], dtype=int64),
 array([  64, 1494,   50], dtype=int64),
 array([ 45, 994,  50], dtype=int64))

### Saving

In [16]:
#saving to file:
X_train.to_csv('X_train.csv')
y_train.to_csv('y_train.csv')

X_test.to_csv('X_test.csv')
y_test.to_csv('y_test.csv')

X_val.to_csv('X_val.csv')
y_val.to_csv('y_val.csv')

## Making samples by MEAN OF TWO RANDOMLY CHOSEN SAMPLES

In [29]:
index_high_quality = df[df.quality==2].index

### Preparing Function

In [9]:
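def make_new_sample(n,dataset):
    '''
        MAKE NEW SAMPLES, EACH THE MEAN OF TWO RANDOMLY CHOSEN SAMPLES
        n - number of new samples
        Returns the original rows followed by the n new rows.
        Generated rows join the pool, so later samples may average earlier generated ones.
    '''
    for i in range(n):
        # high is exclusive in np.random.randint, so len(dataset) lets every row be chosen
        index_1, index_2 = np.random.randint(low=0,high=len(dataset),size=2)
        b = np.array(dataset)
        new_item = (b[index_1]+b[index_2])/2
        new_item = pd.DataFrame(new_item.reshape(1,-1), columns=dataset.columns)
        # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
        dataset = pd.concat([dataset, new_item])
    dataset = dataset.reset_index(drop=True)
    return dataset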
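
For comparison, here is a vectorized sketch of the same idea. It is a variant, not the function used in this notebook: `mean_of_pairs` is a hypothetical name, it averages only original rows (generated rows never re-enter the pool), it forces the two rows of each pair to be different, and it assumes at least two input rows.

```python
import numpy as np
import pandas as pd

def mean_of_pairs(n, dataset, seed=None):
    """Hypothetical variant: n new rows, each the mean of two different original rows."""
    rng = np.random.default_rng(seed)
    values = dataset.to_numpy(dtype=float)
    k = len(values)  # must be >= 2
    first = rng.integers(0, k, size=n)  # high is exclusive
    # an offset in 1..k-1 (mod k) guarantees second != first
    second = (first + rng.integers(1, k, size=n)) % k
    new_rows = (values[first] + values[second]) / 2
    return pd.DataFrame(new_rows, columns=dataset.columns)

toy_high = pd.DataFrame({'pH': [3.0, 3.2, 3.4],
                         'alcohol': [12.0, 12.5, 13.0],
                         'quality': [2, 2, 2]})
new_rows = mean_of_pairs(10, toy_high, seed=0)
print(new_rows.shape)
```

Even with distinct pairs, an average can coincide with an original value in one column: in this toy table the mean of pH 3.0 and 3.4 is 3.2, which is also an original pH.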


### Generating new samples

In [30]:
high_quality_df = df[df.quality==2]

In [31]:
high_quality_df_pH = list(high_quality_df.pH)

In [32]:
a = make_new_sample(3000,high_quality_df)

In [34]:
a['check'] = a.pH.apply(lambda x: 1 if x in high_quality_df_pH else 0)
np.bincount(a.check)

array([2994,   11], dtype=int64)

In [42]:
new_high_quality_samples = a.loc[a.check == 0,:].iloc[:,:-1]
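
The check above flags a row when its pH value also occurs among the original high-quality wines, and only unflagged rows are kept. A minimal sketch of the same filter written with `Series.isin`, on toy values invented for illustration:

```python
import pandas as pd

original_pH = [3.10, 3.20]  # toy stand-in for high_quality_df_pH
a_toy = pd.DataFrame({'pH': [3.10, 3.20, 3.15, 3.20]})

# True where the pH value also occurs among the originals
is_original_pH = a_toy.pH.isin(original_pH)
kept = a_toy.loc[~is_original_pH]
print(kept.pH.tolist())
```

Because the filter compares a single column by exact equality, it drops the original rows and also any generated row whose pH happens to equal an original value, even if its other features differ.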

### Splitting into Train, Test and Validation and adding new samples ONLY to the Train dataset

In [47]:
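# split classes 0 and 1 only; the real high-quality wines are kept out of the train set
df_wo_high = df.drop(index=index_high_quality)
X_train, X_test, y_train, y_test = train_test_split(df_wo_high.iloc[:,:-1],
                                                    df_wo_high.iloc[:,-1],test_size=.4,random_state=255)

#adding high quality wines
# index_high_quality holds index labels, so the rows are selected with .loc (iloc expects positions);
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
X_test = pd.concat([X_test, df.loc[index_high_quality].iloc[:,:-1]])
y_test = pd.concat([y_test, df.loc[index_high_quality].iloc[:,-1]])

X_test, X_val, y_test, y_val = train_test_split(X_test,y_test,test_size=.4,random_state=255)

np.bincount(y_train), np.bincount(y_test),np.bincount(y_val)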

(array([ 137, 3758], dtype=int64),
 array([  64, 1494,    3], dtype=int64),
 array([ 45, 994,   2], dtype=int64))

In [48]:
#adding generated examples to train dataset (pd.concat replaces the removed .append)
X_train = pd.concat([X_train, new_high_quality_samples.iloc[:,:-1]])
y_train = pd.concat([y_train, new_high_quality_samples.iloc[:,-1]])

np.bincount(y_train), np.bincount(y_test),np.bincount(y_val)

(array([ 137, 3758, 2994], dtype=int64),
 array([  64, 1494,    3], dtype=int64),
 array([ 45, 994,   2], dtype=int64))

### Making 50 duplicates of class 2 in Test and Validation datasets

In [49]:
#adding 47 duplicates of high_quality wine to X_test and y_test (3 originals + 47 = 50):
X_y_test = pd.concat([X_test,y_test],axis=1)

a = X_y_test[X_y_test.quality==2]

a = a.sample(n=47,random_state=255,replace=True)

X_y_test = pd.concat([a,X_y_test],axis=0)


X_test = X_y_test.iloc[:,:-1]
y_test = X_y_test.iloc[:,-1]

np.bincount(y_test)

array([  64, 1494,   50], dtype=int64)

In [50]:
#adding 48 duplicates of high_quality wine to X_val and y_val (2 originals + 48 = 50):
X_y_val = pd.concat([X_val,y_val],axis=1)

a = X_y_val[X_y_val.quality==2]

a = a.sample(n=48,random_state=255,replace=True)

X_y_val = pd.concat([a,X_y_val],axis=0)


X_val = X_y_val.iloc[:,:-1]
y_val = X_y_val.iloc[:,-1]

np.bincount(y_val)

array([ 45, 994,  50], dtype=int64)

In [51]:
print('class:')
print('0\t1\t2')
print('-'*35)
print(np.bincount(y_train), ' <- train')
print(np.bincount(y_test), ' <- test')
print(np.bincount(y_val), ' <- validation')

class:
0	1	2
-----------------------------------
[ 137 3758 2994]  <- train
[  64 1494   50]  <- test
[ 45 994  50]  <- validation


In [52]:
np.bincount(y_train), np.bincount(y_test),np.bincount(y_val)

(array([ 137, 3758, 2994], dtype=int64),
 array([  64, 1494,   50], dtype=int64),
 array([ 45, 994,  50], dtype=int64))

### Saving

In [53]:
#saving to file:
X_train.to_csv('X_train.csv')
y_train.to_csv('y_train.csv')

X_test.to_csv('X_test.csv')
y_test.to_csv('y_test.csv')

X_val.to_csv('X_val.csv')
y_val.to_csv('y_val.csv')
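
Both approaches write to the same six file names, so running this section overwrites the files saved by the first approach. One way to keep both results, sketched here under assumptions, is to give each approach its own folder. The helper `save_split` and the folder names `from_medium` and `mean_of_two` are hypothetical, and the example writes toy data into a temporary directory.

```python
import tempfile
from pathlib import Path

import pandas as pd

def save_split(out_dir, **frames):
    """Hypothetical helper: write each named frame to <out_dir>/<name>.csv."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, frame in frames.items():
        frame.to_csv(out_dir / f'{name}.csv')

toy_X = pd.DataFrame({'pH': [3.1, 3.2]})
toy_y = pd.Series([1, 2], name='quality')

with tempfile.TemporaryDirectory() as tmp:
    # one folder per generation approach, same file names inside
    save_split(Path(tmp) / 'from_medium', X_train=toy_X, y_train=toy_y)
    save_split(Path(tmp) / 'mean_of_two', X_train=toy_X, y_train=toy_y)
    written = sorted(p.relative_to(tmp).as_posix() for p in Path(tmp).rglob('*.csv'))

print(written)
```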