# Problem Set 4 - Dealing with noisy data

_Data Preparation Course at UCU, 2019_

### NB

__1) Which programming languages to use?__

We recommend to use Python for this task, but if you find working library alternatives for the algorithms we
use in this assignment in R, you are free to work with that as well.

__2) What libraries/packages to use?__

You are free to choose any appropriate libraries (good choice would be __pandas__, __numpy__,
__scicit-learn__).

__3) How to summarize my homework?__

The best way is to create an Jupyter/R notebook with code and explanations for each strategy. In case you
are not familiar with these tools, you can create a Python/R scripts and write explanations as comments.
However, we strongly recommend you to use Jupyter/R notebooks, as those are #1 tools in applied data
analysis nowadays.

__4) Useful links__

1. [Deaing with Noisy Data in Data Science.](https://medium.com/analytics-vidhya/dealing-with-noisy-data-in-data-science-e177a4e32621)
2. [Decision trees in Scikit-learn.](https://scikit-learn.org/stable/modules/tree.html)

## Tasks

In this homework you will investigate the impact of different types of noise on the accuracy of classification
model based on __<font color="black">[(Census Income dataset)](https://archive.ics.uci.edu/ml/datasets/Census+Income)</font>__. Noise is an unavoidable problem which affects all stages of Data Mining process, so it is extremely important to learn how to deal with the noise in the most appropriate way. 

### __1) Logistic regression.__

Similar to previous assignment, you’ll have to train multiple logistic regression models. We encourage you to use provided jupyter notebook with working template of logistic regression for Census dataset. Please remember that LR is not the main topic of this assignment, so do not bother yourself tuning your models. The purpose of this assignment is to investigate the negative impact of noise in your dataset on the accuracy of classification and learn basic methods of dealing with problems of this type.

Regarding missing values in the dataset - you need to impute them using __global most common substitution strategy__ from the previous assignment.

__Treat dataset you obtain after missing values imputation as an original one. All further
modification in this homework perform on this dataset, not on the one you have before missing
values imputation.__

__1.1.__ Train original logistic regression model provided in jupyter notebook. Save values of train and test
classification accuracy scores for future comparison.

In [19]:
## YOU CAN PLACE YOUR SOLUTION IN THE CELLS LIKE THIS ##
#######################################################
import pandas as pd
from sklearn.utils import shuffle
import random
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
import numpy as np
from sklearn.preprocessing import LabelEncoder


data = pd.read_csv('adult.data',names=['age', 'workclass', 'fnlwgt', 'education', 'education_num',
                                      'marital-status', 'occupation', 'relationship', 'race', 'sex', 
                                      'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'y'])





def get_missing_info(data):
    data_missing = data.isna()
    data_num_missing = data_missing.sum()
    print(data_num_missing / len(data) * 100) 
    


    
data = data.replace(" ?", np.nan).drop(['education'], axis=1)


get_missing_info(data)

data = data.dropna()
lst_of_columns =['workclass','marital-status','occupation','relationship','race','native-country', "sex", "y"]
le = LabelEncoder()
encoded_series = data[lst_of_columns].apply(le.fit_transform)
for col in lst_of_columns:
    data[col] = encoded_series[col]
data = data.dropna()
get_missing_info(data)
#######################################################
data

age               0.000000
workclass         5.638647
fnlwgt            0.000000
education_num     0.000000
marital-status    0.000000
occupation        5.660146
relationship      0.000000
race              0.000000
sex               0.000000
capital-gain      0.000000
capital-loss      0.000000
hours-per-week    0.000000
native-country    1.790486
y                 0.000000
dtype: float64
age               0.0
workclass         0.0
fnlwgt            0.0
education_num     0.0
marital-status    0.0
occupation        0.0
relationship      0.0
race              0.0
sex               0.0
capital-gain      0.0
capital-loss      0.0
hours-per-week    0.0
native-country    0.0
y                 0.0
dtype: float64


Unnamed: 0,age,workclass,fnlwgt,education_num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,y
0,39,5,77516,13,4,0,1,4,1,2174,0,40,38,0
1,50,4,83311,13,2,3,0,4,1,0,0,13,38,0
2,38,2,215646,9,0,5,1,4,1,0,0,40,38,0
3,53,2,234721,7,2,5,0,2,1,0,0,40,38,0
4,28,2,338409,13,2,9,5,2,0,0,0,40,4,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,2,257302,12,2,12,5,4,0,0,0,38,38,0
32557,40,2,154374,9,2,6,0,4,1,0,0,40,38,1
32558,58,2,151910,9,6,0,4,4,0,0,0,40,38,0
32559,22,2,201490,9,4,0,3,4,1,0,0,20,38,0


### __2) [1pt] Misclassification noise.__

__2.1.__ Introduce misclassification in your dataset. Randomly flip $n\%$ of the target variable (‘y’) values. Try $n = (1, 5, 10, 20)$. Perform this process __only in train dataset__. Leave test dataset unchanged.

In [20]:
def randomly_flip_𝑛(target, data, procent):
    lst = list(data[target].values)
    number_of_flips = int(len(lst) / 100 * procent)
    for i in range(number_of_flips):
        if not data[target].iloc[i]:
            data[target].iloc[i] = 1.0
        else:
            data[target].iloc[i] = 0.0
    return shuffle(data)

def randomly_flip_test(y_test, procent):
    number_of_flips = int(len(y_test) / 100 * procent)
    for i in range(number_of_flips):
        if not y_test.iloc[i]:
            y_test.iloc[i] = 1.0
        else:
            y_test.iloc[i] = 0.0
    return shuffle(y_test)        


X_train, X_test, y_train, y_test = train_test_split(data.drop('y', 1), data['y'], test_size = .2, random_state=10)

y_test1 = randomly_flip_test(y_test, 1)
y_test5 = randomly_flip_test(y_test, 5)
y_test10 = randomly_flip_test(y_test, 10)
y_test20 = randomly_flip_test(y_test, 20)

__2.2.__ For each $n$ train separate model. Record train and test accuracy for each of these models.

In [21]:
def model_fit(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    y_predict_test = model.predict(X_test) 
    
    y_predict_train = model.predict(X_train) 
    
    print("#####################################################################################")
    print("Accuracy score on test : ",accuracy_score(y_test, y_predict_test))
    print("f1_score on test : ",f1_score(y_test, y_predict_test))
    print("Accuracy score on train : ",accuracy_score(y_train, y_predict_train))
    print("f1_score on train : ",f1_score(y_train, y_predict_train))
    print("#####################################################################################")
    return model, accuracy_score(y_test, y_predict_test) 
model = LogisticRegression(random_state=0, solver='lbfgs')
model, test_score = model_fit(model, X_train, X_test, y_train, y_test)
model, _ = model_fit(model, X_train, X_test, y_train, y_test1)
model, _ = model_fit(model, X_train, X_test, y_train, y_test5)
model, _ = model_fit(model, X_train, X_test, y_train, y_test10)
model, _ = model_fit(model, X_train, X_test, y_train, y_test20)

#####################################################################################
Accuracy score on test :  0.7044588098789989
f1_score on test :  0.3269158172895432
Accuracy score on train :  0.7816320610054291
f1_score on train :  0.3883923389437029
#####################################################################################
#####################################################################################
Accuracy score on test :  0.6819161279628708
f1_score on test :  0.1512605042016807
Accuracy score on train :  0.7816320610054291
f1_score on train :  0.3883923389437029
#####################################################################################
#####################################################################################
Accuracy score on test :  0.6781037626388198
f1_score on test :  0.165807560137457
Accuracy score on train :  0.7816320610054291
f1_score on train :  0.3883923389437029
#############################################################

__2.3.__ What is the highest safe fraction (approximately) of misclassified examples? (by ‘safe’, we mean fraction
of misclassified examples with which difference of accuracies between original and misclassified model does
not exceed 0.01)

##### __3) [1pt] Attribute noise.__

__3.0.__ For $n = (1,5,10,20)$ create datasets with different levels of attribute noise.

In [22]:
data1 = randomly_flip_𝑛("y", data, 1)
data5 = randomly_flip_𝑛("y", data, 5)
data10 = randomly_flip_𝑛("y", data, 10)
data20 = randomly_flip_𝑛("y", data, 10)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)


__3.1.__ Introduce attribute noise to the __age__ column. Randomly negate $n\%$ of the values of this attribute.

In [23]:
import random
data1 = randomly_flip_𝑛("age", data1, random.choice([i for i in range(1,101)]))
data5 = randomly_flip_𝑛("age", data5, random.choice([i for i in range(1,101)]))
data10 = randomly_flip_𝑛("age", data10, random.choice([i for i in range(1,101)]))
data20 = randomly_flip_𝑛("age", data20, random.choice([i for i in range(1,101)]))

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


__3.2.__ Introduce attribute noise to the __education_num__ column. Randomly replace $n\%$ of the values of this attribute with random large numbers in range $[20,100]$.

In [24]:
data1 = randomly_flip_𝑛("education_num", data1, random.choice([i for i in range(20,101)]))
data5 = randomly_flip_𝑛("education_num", data5, random.choice([i for i in range(20,101)]))
data10 = randomly_flip_𝑛("education_num", data10, random.choice([i for i in range(20,101)]))
data20 = randomly_flip_𝑛("education_num", data20, random.choice([i for i in range(20,101)]))

__3.3.__ Introduce attirute noise to the __race__ column. Randomly replace $n\%$ of the values of this attribute with
any other random race from the set of existing races.

In [25]:
def race_noise(data, procent, target='race'):
    values = set(data[target].values)
    for i in range(int(len(data[target]) / 100 * procent)):  
        data[target].iloc[i] = random.choice(list(values.difference(set([data[target].iloc[i]]))))
    return shuffle(data) 

data1 = race_noise(data1, 1)
data5 = race_noise(data5, 5)
data10 = race_noise(data10, 10)
data20 = race_noise(data20, 20)

__3.4.__ For each $n$ train separate model. Record train and test accuracy for each of these models.

In [26]:
X_train1, X_test1, y_train1, y_test1 = train_test_split(data1.drop('y', 1), data1['y'], test_size = .2, random_state=10)
X_train5, X_test5, y_train5, y_test5 = train_test_split(data5.drop('y', 1), data5['y'], test_size = .2, random_state=10)
X_train10, X_test10, y_train10, y_test10 = train_test_split(data10.drop('y', 1), data10['y'], test_size = .2, random_state=10)
X_train20, X_test20, y_train20, y_test20 = train_test_split(data20.drop('y', 1), data20['y'], test_size = .2, random_state=10)

__3.5.__ Quantify the degradation of the model after introducing each new level of noise to its attributes.

In [27]:
model, test_score1 = model_fit(model, X_train1, X_test1, y_train1, y_test1)
model, test_score5 = model_fit(model, X_train5, X_test5, y_train5, y_test5)
model, test_score10 = model_fit(model, X_train10, X_test10, y_train10, y_test10)
model, test_score20 = model_fit(model, X_train20, X_test20, y_train20, y_test20)



#####################################################################################
Accuracy score on test :  0.7808718713741091
f1_score on test :  0.4045045045045045
Accuracy score on train :  0.779642753533093
f1_score on train :  0.39914114589219124
#####################################################################################
#####################################################################################
Accuracy score on test :  0.7633018398806564
f1_score on test :  0.35791366906474825
Accuracy score on train :  0.7650959426416345
f1_score on train :  0.36556973360197
#####################################################################################




#####################################################################################
Accuracy score on test :  0.7616442897397646
f1_score on test :  0.3580357142857143
Accuracy score on train :  0.7531186539019438
f1_score on train :  0.3545346191353343
#####################################################################################
#####################################################################################
Accuracy score on test :  0.7831924415713575
f1_score on test :  0.4129263913824058
Accuracy score on train :  0.7753325873430312
f1_score on train :  0.36618730270080674
#####################################################################################


### __4) [1pt] Impact comparison.__

__4.1.__ Build a table to compare accuracy of the model on the original dataset with models based on datasets
with different types and levels of noise introduced.

In [28]:
print("Class noise comprison:")
print("Calss noise n=1% train degradation:",test_score - test_score1)
print("Calss noise n=5% train degradation:",test_score - test_score5)
print("Calss noise n=10% train degradation:",test_score - test_score10)
print("Calss noise n=20% train degradation:",test_score - test_score20)

Class noise comprison:
Calss noise n=1% train degradation: -0.07641306149511018
Calss noise n=5% train degradation: -0.05884303000165747
Calss noise n=10% train degradation: -0.05718547986076572
Calss noise n=20% train degradation: -0.07873363169235859


__4.2.__ What has greater impact on the accuracy of the model: class or attribute noise? How would you explain
it? (4-5 sentences).

__4.3.__ What kind of noise would you address first? Why? (2-3 sentences)

### __5) [2pt] Misclassification noise elimination.__

__5.1.__ Use training dataset with 10% of misclassified instances which you obtained in __Task 2__.

In [41]:
from sklearn import tree
#data10


def get_parts(data, parts_number):
    part_div = len(data) // parts_number
    return [data.iloc[part_div*i:part_div*(i+1)] for i in range(parts_number)]

def fit_for_cros_clasification(lst_models, lst_data):
    if len(lst_models) == len(lst_data):
        for i,model in enumerate(lst_models):
            model.fit(pd.concat([lst_data[index].drop('y',axis=1) for index in range(len(lst_data)) if index != i], axis=0),
                      pd.concat([lst_data[index]['y']  for index in range(len(lst_data)) if index != i],axis= 0))
        return lst_models
    else:
        raise ValueError("nubers of models and datasets shuld be == ")
        

data_five1, data_five2, data_five3, data_five4, data_five5  = get_parts(data10, 5)
clf1 = tree.DecisionTreeClassifier()
clf2 = tree.DecisionTreeClassifier()
clf3 = tree.DecisionTreeClassifier()
clf4 = tree.DecisionTreeClassifier()
clf5 = tree.DecisionTreeClassifier()
clf1, clf2, clf3, clf4, clf5 =  fit_for_cros_clasification([clf1, clf2, clf3, clf4, clf5], 
                                                           [data_five1, data_five2, data_five3, data_five4, data_five5])



def outputGenerator(data,yTrue,  clf1, clf2, clf3, clf4):
    x1 = clf1.predict(data)
    x2 = clf2.predict(data)
    x3 = clf3.predict(data)
    x4 = clf4.predict(data)
    trueYList = yTrue.tolist()
    outputList = []
    for i in range(len(x1)):
        numOfAccurence0 = 0
        numOfAccurence1 = 0
        if x1[i] == x2[i] and x1[i] == 1:
                numOfAccurence1 += 1
        else:
            numOfAccurence0 += 1
        if x1[i] == x3[i] and x1[i] == 1:
            numOfAccurence1 += 1
        else:
            numOfAccurence0 += 1
        if x1[i] == x4[i] and x1[i] == 1:
            numOfAccurence1 += 1
        else:
            numOfAccurence0 += 1
        if x2[i] == x3[i] and x2[i] == 1:
            numOfAccurence1 += 1
        else:
            numOfAccurence0 += 1
        if x2[i] == x4[i] and x2[i] == 1:
            numOfAccurence1 += 1
        else:
            numOfAccurence0 += 1
        if x3[i] == x4[i] and x3[i] == 1:
            numOfAccurence1 += 1
        else:
            numOfAccurence0 += 1
        if numOfAccurence1 > 4:
            outputList.append(1)
        else:
            outputList.append(0) 
    return outputList

x1 = outputGenerator(data_five1.drop('y',axis=1), data_five1['y'],clf5, clf2, clf3, clf4)
x2 = outputGenerator(data_five2.drop('y',axis=1), data_five2['y'],clf5, clf2, clf3, clf4)
x3 = outputGenerator(data_five3.drop('y',axis=1), data_five3['y'],clf5, clf2, clf3, clf4)
x4 = outputGenerator(data_five4.drop('y',axis=1), data_five4['y'],clf5, clf2, clf3, clf4)
x5 = outputGenerator(data_five5.drop('y',axis=1), data_five5['y'],clf5, clf2, clf3, clf4)

__5.2.__ Apply Cross-Validated Committees Filter algorithm to identify and fix mislabled instances in this dataset.

• You can read the full description of this algorithm in “Data Preprocessing In Data Mining” by S. Garcia,
J. Luengo and F. Herrera [page 117, Section 5.3.2].

• Use scikit-learn utilities to create and train Decision Tree classifiers for this algorithm. You can read
more about them in [__(Decision Trees)__](https://scikit-learn.org/stable/modules/tree.html).

• Use $\Gamma = 5$.

In [47]:
def mislabels(x,lstDfy):
    mislabels = 0
    for i in range(len(x)):
        if(x[i] != lstDfy[i]):
            mislabels += 1
    return mislabels
print(data_five1)
x1Misslables = mislabels(x1, data_five1['y'].tolist())
x2Misslables = mislabels(x2, data_five2['y'].tolist())
x3Misslables = mislabels(x3, data_five3['y'].tolist())
x4Misslables = mislabels(x4, data_five4['y'].tolist())
x5Misslables = mislabels(x5, data_five5['y'].tolist())


def mislabels(x,lstDfy):
    mislabels = 0
    for i in range(len(x)):
        if(x[i] != lstDfy[i]):
            mislabels += 1
    return mislabels

totalMislables = x1Misslables + x2Misslables + x3Misslables + x4Misslables + x5Misslables
print(totalMislables)

percentOfMislabels = (totalMislables * 100)/len(data10)
print("Mislabled dat %",percentOfMislabels)


def newDf(x, df):
    indexes = df.index.tolist()
    for i in range(len(x)):
        if(df.at[indexes[i],'y'] != x[i]):
            df.at[indexes[i],'y'] = x[i]
    return df

newDf1 = newDf(x1,data_five1)
newDf2 = newDf(x2,data_five2)
newDf3 = newDf(x3,data_five3)
newDf4 = newDf(x4,data_five4)
newDf5 = newDf(x5,data_five5)

new_data = pd.concat([newDf1,newDf2,newDf3,newDf4,newDf5],axis=0)


new_data

        age  workclass  fnlwgt  education_num  marital-status  occupation  \
19723   0.0          2  180262            0.0               4           3   
11369   0.0          2  314177            0.0               4           5   
32074   0.0          2  318749            0.0               2          12   
31948   0.0          2  130513            0.0               4          11   
2250   39.0          2   34028            0.0               0           3   
...     ...        ...     ...            ...             ...         ...   
11886   0.0          0  161463            0.0               2           7   
7734    0.0          2   70261            0.0               4           7   
29616  33.0          5  173806            0.0               4           9   
14413   0.0          2  238342            0.0               2          13   
2085    0.0          2   49115            0.0               2           3   

       relationship  race  sex  capital-gain  capital-loss  hours-per-week 

Unnamed: 0,age,workclass,fnlwgt,education_num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,y
19723,0.0,2,180262,0.0,4,3,1,4,0,0,0,40,38,0.0
11369,0.0,2,314177,0.0,4,5,2,2,1,0,0,40,38,0.0
32074,0.0,2,318749,0.0,2,12,5,4,0,0,0,35,10,0.0
31948,0.0,2,130513,0.0,4,11,3,4,0,0,0,40,28,0.0
2250,39.0,2,34028,0.0,0,3,4,4,0,0,0,48,38,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10803,0.0,2,176711,0.0,4,0,3,4,1,0,0,40,38,0.0
3657,0.0,2,274222,0.0,2,7,0,4,1,7688,0,38,38,1.0
14515,0.0,2,210095,0.0,3,5,1,4,0,0,0,40,25,0.0
29744,0.0,5,162945,0.0,2,3,0,2,1,20051,0,55,38,1.0


__5.3.__ What percent of mislabled records you fixed using this method? Is it possible to do better?

__5.4.__ Compare the accuracy of the classifier after elimination of mislabeled instances with its accuracy before
this procedure was performed.