# Introduction

- Real-world databases often contain missing values.
- Most data mining algorithms cannot be applied directly to incomplete data.
- The simplest way to deal with missing data is data reduction, which deletes the instances that have missing values. However, this can discard a large amount of information.

- Example: the Titanic dataset

In [None]:
from seaborn import load_dataset
df = load_dataset('titanic')
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


- NaN values appear frequently in the `deck` column, as the data frame above shows.

# Missing Value Searching Example Code

In [None]:
nan_deck = df['deck'].value_counts(dropna = False)
print(nan_deck)
print('\n')
print(df.head().isnull())
print('\n')
print(df.head().notnull())
print('\n')
print(df.isnull().sum(axis=0))

NaN    688
C       59
B       47
D       33
E       32
A       15
F       13
G        4
Name: deck, dtype: int64


   survived  pclass    sex    age  sibsp  parch   fare  embarked  class  \
0     False   False  False  False  False  False  False     False  False   
1     False   False  False  False  False  False  False     False  False   
2     False   False  False  False  False  False  False     False  False   
3     False   False  False  False  False  False  False     False  False   
4     False   False  False  False  False  False  False     False  False   

     who  adult_male   deck  embark_town  alive  alone  
0  False       False   True        False  False  False  
1  False       False  False        False  False  False  
2  False       False   True        False  False  False  
3  False       False  False        False  False  False  
4  False       False   True        False  False  False  


   survived  pclass   sex   age  sibsp  parch  fare  embarked  class   who  \
0      True 

# Missing Value Deletion

pandas.DataFrame.[dropna(axis, how, thresh)](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)

* Remove missing values.
  * axis: {0 or ‘index’, 1 or ‘columns’}, default 0  
    Determines whether rows or columns that contain missing values are removed.
    * 0, or ‘index’: Drop rows which contain missing values.
    * 1, or ‘columns’: Drop columns which contain missing values.
  * how: {‘any’, ‘all’}, default ‘any’  
    Determines whether a row or column is removed when it has at least one NA, or only when all of its values are NA.
    * ‘any’: If any NA values are present, drop that row or column.
    * ‘all’: If all values are NA, drop that row or column.
  * thresh: int, optional  
    Keep only the rows or columns that have at least this many non-NA values.
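
The three parameters are easier to compare on a tiny frame. The following is a minimal sketch on made-up values (the frame `toy` and its columns are hypothetical, not part of the Titanic data); `how` and `thresh` are used in separate calls.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): row 0 is complete, row 3 is entirely missing.
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, np.nan],
    "b": [4.0, np.nan, 6.0, np.nan],
    "c": [7.0, np.nan, np.nan, np.nan],
})

rows_any = toy.dropna(axis=0, how="any")    # keep only rows without any NaN
rows_all = toy.dropna(axis=0, how="all")    # drop only rows where every value is NaN
cols_thresh = toy.dropna(axis=1, thresh=2)  # keep columns with at least 2 non-NaN values

print(len(rows_any), len(rows_all), list(cols_thresh.columns))
```

Here `how="any"` keeps a single row, `how="all"` removes only the fully empty row, and `thresh=2` drops column `c` because it has just one observed value.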

In [None]:
from seaborn import load_dataset
df = load_dataset('titanic')

missing_df = df.isnull()

# Columns whose number of non-missing values is below thresh are dropped; columns at or above thresh are kept.
# The deck column has 203 valid values and 688 missing values.
# thresh = 203 -> the deck column is kept
# thresh = 204 -> the deck column is dropped
df_thresh = df.dropna(axis = 1, thresh = 500) 
print(df_thresh.columns)

# Drop every row that has a NaN value in the age column
df_age = df.dropna(subset=['age'], how = 'any', axis = 0) 
print(len(df_age))

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'embark_town', 'alive',
       'alone'],
      dtype='object')
714


In [None]:
df.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [None]:
df_thresh = df.dropna(axis = 1, thresh = 203) 

df_thresh.isnull().sum()

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

# Missing Value Imputation

## Mean Substitution

pandas.DataFrame.[fillna(value, inplace)](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html)

* Fill NA/NaN values using the specified method.
  * value: scalar, dict, Series, or DataFrame  
    Value to use to fill holes (e.g. 0).
  * inplace: bool, default False  
    If True, fill in-place.
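
As a small illustration of the `value` parameter, the sketch below passes a dict so that each column gets its own fill value. The frame `toy` and the `"Unknown"` label are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing entry per column.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0],
    "town": ["Southampton", None, "Cherbourg"],
})

# A dict maps each column name to the value used to fill that column.
filled = toy.fillna({"age": toy["age"].mean(), "town": "Unknown"})

# fillna returns a new object by default, so the original still has its NaNs.
print(filled)
print(toy.isnull().sum().sum())
```

Because `inplace` defaults to `False`, the result has to be assigned to a name (here `filled`) to be kept.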

In [None]:
df = load_dataset('titanic')

print(df['age'].head(10))
print('\n')

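mean_age = df['age'].mean()
# Assign the result back rather than calling fillna(inplace=True) on the selected column;
# recent pandas versions warn about that pattern and it may not modify df.
df['age'] = df['age'].fillna(mean_age)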

print(df['age'].head(10))

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
5     NaN
6    54.0
7     2.0
8    27.0
9    14.0
Name: age, dtype: float64


0    22.000000
1    38.000000
2    26.000000
3    35.000000
4    35.000000
5    29.699118
6    54.000000
7     2.000000
8    27.000000
9    14.000000
Name: age, dtype: float64


## Hot Deck Imputation

In [None]:
from seaborn import load_dataset
df = load_dataset('titanic')

print(df['embark_town'][825:830])
print('\n')

most_freq = df['embark_town'].value_counts(dropna=True).idxmax()
print(most_freq)
print('\n')

# Assign the result back rather than using inplace=True on the selected column
df['embark_town'] = df['embark_town'].fillna(most_freq)

print(df['embark_town'][825:830])

825     Queenstown
826    Southampton
827      Cherbourg
828     Queenstown
829            NaN
Name: embark_town, dtype: object


Southampton


825     Queenstown
826    Southampton
827      Cherbourg
828     Queenstown
829    Southampton
Name: embark_town, dtype: object
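
The cell above fills every gap with the single most frequent value. Hot-deck imputation is often described more strictly as copying the value of an observed "donor" record. The following is a rough sketch of the simplest such variant, a random hot deck, on made-up data; the series `town` and the seed are assumptions for illustration and this is not the method used in the cell above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy column (hypothetical values) with two missing entries.
town = pd.Series(["Southampton", "Cherbourg", None, "Southampton", None, "Queenstown"])

missing = town.isna()
donors = town[~missing].to_numpy()

# Random hot deck: each missing entry copies the value of a randomly drawn observed record.
imputed = town.copy()
imputed[missing] = rng.choice(donors, size=int(missing.sum()))

print(imputed.tolist())
```

Unlike filling with the mode, drawing donors at random roughly preserves the observed category proportions, at the cost of adding randomness to the result.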


## Regression Imputation

In [None]:
from seaborn import load_dataset
import pandas as pd


# Load the titanic dataset
df = load_dataset('titanic')

# Print the age column to check for missing values
print(df['age'].head(10))
print('\n')

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
5     NaN
6    54.0
7     2.0
8    27.0
9    14.0
Name: age, dtype: float64




### Import Linear Regression Model & Numpy

In [None]:
# Import LinearRegression and numpy for regression imputation
from sklearn.linear_model import LinearRegression
import numpy as np

# Declare the model
lr = LinearRegression()

# Inspect the dataset
df.head(10)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
5,0,3,male,,0,0,8.4583,Q,Third,man,True,,Queenstown,no,True
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
7,0,3,male,2.0,3,1,21.075,S,Third,child,False,,Southampton,no,False
8,1,3,female,27.0,0,2,11.1333,S,Third,woman,False,,Southampton,yes,False
9,1,2,female,14.0,1,0,30.0708,C,Second,child,False,,Cherbourg,yes,False


### Model Training

In [None]:
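# Rows with a missing age are the ones to impute; the remaining rows are used for training
testdf = df[df['age'].isnull()].copy()
traindf = df[df['age'].notnull()].copy()
y = traindf['age']

# Fit a simple linear regression that predicts age from fare alone
traindf.drop("age", axis=1, inplace=True)
lr.fit(traindf['fare'].to_numpy().reshape(-1, 1), y)

testdf.drop("age", axis=1, inplace=True)

data = testdf['fare'].to_numpy().reshape(-1, 1)

# Predicted ages for the rows where age was missing
pred = lr.predict(data)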

### Print Imputation Results

In [None]:
pd.set_option('mode.chained_assignment',  None) 

testdf['age'] = pred

# Write the predictions back in one step with .loc instead of element-wise chained assignment
df.loc[testdf.index, 'age'] = pred

df.head(10)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
5,0,3,male,29.007249,0,0,8.4583,Q,Third,man,True,,Queenstown,no,True
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
7,0,3,male,2.0,3,1,21.075,S,Third,child,False,,Southampton,no,False
8,1,3,female,27.0,0,2,11.1333,S,Third,woman,False,,Southampton,yes,False
9,1,2,female,14.0,1,0,30.0708,C,Second,child,False,,Cherbourg,yes,False


## Multiple Imputation Using Multivariate Imputation by Chained Equations (MICE)
![img](https://miro.medium.com/max/1400/1*cmZFWypJUrFL2QL3KyzXEQ.png)
This type of imputation fills in the missing data multiple times. Multiple imputation (MI) is generally preferable to single imputation because it better reflects the uncertainty about the missing values. The chained-equations approach is also very flexible: it can handle variables of different data types (e.g., continuous or binary) as well as complexities such as bounds or survey skip patterns.

### sklearn.impute.[IterativeImputer(estimator=None, *, missing_values=nan, initial_strategy='mean')](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html)

Multivariate imputer that estimates each feature from all the others.

A strategy for imputing missing values by modeling each feature with missing values as a function of other features in a round-robin fashion.

* estimator: estimator object, default=BayesianRidge()  
The estimator to use at each step of the round-robin imputation.
* missing_values: int or np.nan, default=np.nan  
The placeholder for the missing values.
* initial_strategy: {'mean', 'median', 'most_frequent', 'constant'}, default='mean'  
Which strategy to use to initialize the missing values. 
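
A single `IterativeImputer` call returns one completed dataset. To get the "multiple" part of multiple imputation, one common pattern is to run it several times with `sample_posterior=True` and different seeds, then look at the spread between runs. The sketch below does this on synthetic data; the data, the number of runs (5) and the simple mean pooling are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): two correlated columns, about 20% of the second one missing.
x = rng.normal(size=200)
X = np.column_stack([x, 2.0 * x + rng.normal(scale=0.5, size=200)])
mask = rng.random(200) < 0.2
X_missing = X.copy()
X_missing[mask, 1] = np.nan

# sample_posterior=True draws imputations from the predictive distribution
# instead of returning the point prediction, so each seed gives a different fill.
completed = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X_missing)
    for seed in range(5)
]

stacked = np.stack(completed)   # shape: (5, 200, 2)
pooled = stacked.mean(axis=0)   # pooled point estimate per cell
spread = stacked.std(axis=0)    # between-imputation variability per cell

print(pooled[mask, 1][:3])
print(spread[mask, 1][:3])
```

Observed cells are identical in every run, so their spread is zero; only the imputed cells vary, and that variation is a rough indication of how uncertain each imputed value is.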

In [1]:
import numpy as np
import pandas as pd

# IterativeImputer is still experimental in scikit-learn, so it must be enabled explicitly
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

imputer = IterativeImputer(random_state=0)

# Create consecutive data
data_train = pd.DataFrame(np.arange(25).reshape((5, 5)).astype(np.float64))
# Note: this is an alias, not a copy, so data_train and data_test are the same object
data_test = data_train

# Insert missing values (they appear under both names because of the alias above)
data_test.loc[2, 0] = np.nan
data_test.loc[3, 2] = np.nan

# Print each dataset
print("\ntrain:")
print(data_train)
print("\ntest:")
print(data_test)

# Apply MICE
imputer.fit(data_train)
result_df = pd.DataFrame(imputer.transform(data_test))

# Print the result
print("\nresult:")
print(result_df)


train:
      0     1     2     3     4
0   0.0   1.0   2.0   3.0   4.0
1   5.0   6.0   7.0   8.0   9.0
2   NaN  11.0  12.0  13.0  14.0
3  15.0  16.0   NaN  18.0  19.0
4  20.0  21.0  22.0  23.0  24.0

test:
      0     1     2     3     4
0   0.0   1.0   2.0   3.0   4.0
1   5.0   6.0   7.0   8.0   9.0
2   NaN  11.0  12.0  13.0  14.0
3  15.0  16.0   NaN  18.0  19.0
4  20.0  21.0  22.0  23.0  24.0

result:
      0     1     2     3     4
0   0.0   1.0   2.0   3.0   4.0
1   5.0   6.0   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0
4  20.0  21.0  22.0  23.0  24.0


### MICE in titanic dataset

In [None]:
# Load the Titanic dataset: an example that makes the process easy to follow

from seaborn import load_dataset
df = load_dataset('titanic')

df

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


### Apply MICE

In [None]:
new_df =  df[['age', 'fare','pclass']]

new_df.tail(10)

Unnamed: 0,age,fare,pclass
881,33.0,7.8958,3
882,22.0,10.5167,3
883,28.0,10.5,2
884,25.0,7.05,3
885,39.0,29.125,3
886,27.0,13.0,2
887,19.0,30.0,1
888,,23.45,3
889,26.0,30.0,1
890,32.0,7.75,3


In [None]:
pd.set_option('mode.chained_assignment',  None)

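# astype returns a copy; casting to float lets NaN be stored in the integer pclass column
train_df = new_df.astype('float64')

train_df.loc[881, 'age'] = np.nan
train_df.loc[883, 'pclass'] = np.nan
train_df.loc[886, 'fare'] = np.nan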

train_df.tail(10)

Unnamed: 0,age,fare,pclass
881,,7.8958,3.0
882,22.0,10.5167,3.0
883,28.0,10.5,
884,25.0,7.05,3.0
885,39.0,29.125,3.0
886,27.0,,2.0
887,19.0,30.0,1.0
888,,23.45,3.0
889,26.0,30.0,1.0
890,32.0,7.75,3.0


In [None]:
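# astype returns a copy; casting to float lets NaN be stored in the integer pclass column
test_df = new_df.astype('float64')

test_df.loc[881, 'pclass'] = np.nan
test_df.loc[887, 'pclass'] = np.nan
test_df.loc[884, 'fare'] = np.nan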

test_df.tail(10)

Unnamed: 0,age,fare,pclass
881,33.0,7.8958,
882,22.0,10.5167,3.0
883,28.0,10.5,2.0
884,25.0,,3.0
885,39.0,29.125,3.0
886,27.0,13.0,2.0
887,19.0,30.0,
888,,23.45,3.0
889,26.0,30.0,1.0
890,32.0,7.75,3.0


In [None]:
imputer = IterativeImputer(random_state=42, verbose=1)

imputer.fit(train_df)


result_df = pd.DataFrame(imputer.transform(test_df), columns = ['age','fare','pclass'])
print("\nresult:")
result_df.tail(10)

[IterativeImputer] Completing matrix with shape (891, 3)
[IterativeImputer] Change: 11.722450129532419, scaled tolerance: 0.5123292 
[IterativeImputer] Change: 0.48596102624105697, scaled tolerance: 0.5123292 
[IterativeImputer] Early stopping criterion reached.
[IterativeImputer] Completing matrix with shape (891, 3)

result:


Unnamed: 0,age,fare,pclass
881,33.0,7.8958,2.435431
882,22.0,10.5167,3.0
883,28.0,10.5,2.0
884,25.0,9.631626,3.0
885,39.0,29.125,3.0
886,27.0,13.0,2.0
887,19.0,30.0,2.5536
888,24.218355,23.45,3.0
889,26.0,30.0,1.0
890,32.0,7.75,3.0


# Data Transformation

## Concepts
- Machine learning models make a lot of assumptions about the data
- In reality, these assumptions are often violated
- We build pipelines that transform the data before feeding it to the learners
  - Scaling (or other numeric transformations)
  - Encoding (convert categorical features into numerical ones)
  - Automatic feature selection
  - Feature engineering (e.g. binning, polynomial features,...)
  - Handling missing data
  - Handling imbalanced data
  - Dimensionality reduction (e.g. PCA)
  - Learned embeddings (e.g. for text)
- Seek the best combinations of transformations and learning methods
  - Often done empirically, using cross-validation
  - Make sure that there is no data leakage during this process!

## Scaling

### Standard Scaling
- Generally most useful, assumes data is more or less normally distributed
- Per feature, subtract the mean value μ, scale by standard deviation σ
- New feature has μ = 0 and σ = 1, values can still be arbitrarily large
- $Z = \frac{X - \mu}{\sigma}$
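
The formula can be checked by hand with NumPy. This is a minimal sketch on the second column of the toy data used in this section; note that it uses the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses.

```python
import numpy as np

# Second column of the toy data used in this section
x = np.array([2.0, 6.0, 10.0, 18.0])

mu = x.mean()
sigma = x.std()          # population standard deviation (ddof=0)
z = (x - mu) / sigma     # Z = (X - mu) / sigma

print(mu, sigma)
print(z)
print(z.mean(), z.std())
```

After the transformation the mean is 0 and the standard deviation is 1 (up to floating-point error), while individual values are not bounded to any fixed range.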

In [None]:
from sklearn.preprocessing import StandardScaler

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = StandardScaler()
print(scaler.fit(data))
print("scale of data = ", scaler.scale_)
print("mean of data = ", scaler.mean_)
print("data variance = ", scaler.var_)
print("<data scaling> \n", scaler.transform(data))
print('[2, 2] standard scaling =', scaler.transform([[2, 2]]))

### min-max Scaling
- Subtracting the minimum from each value and dividing by the difference between the maximum and the minimum rescales the data to the range 0 to 1.
  - scikit-learn class: MinMaxScaler
  - If you don't want the range 0 to 1, you can specify a different one with the `feature_range` parameter
  - $Z = \frac{X-X_{min}}{X_{max}-X_{min}}$
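
The sketch below applies the formula by hand and compares it with `MinMaxScaler`, then rescales to a different target range through the `feature_range` parameter. The target range `(-1, 1)` is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]], dtype=float)

# Manual min-max formula, applied column by column
manual = (data - data.min(axis=0)) / (data.max(axis=0) - data.min(axis=0))

default_scaled = MinMaxScaler().fit_transform(data)                      # range [0, 1]
custom_scaled = MinMaxScaler(feature_range=(-1, 1)).fit_transform(data)  # range [-1, 1]

print(np.allclose(manual, default_scaled))
print(custom_scaled.min(axis=0), custom_scaled.max(axis=0))
```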

In [None]:
from sklearn.preprocessing import MinMaxScaler

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = MinMaxScaler()
print(scaler.fit(data))
print("max of data = ", scaler.data_max_)
print("min of data = ", scaler.data_min_)
print("data range = ", scaler.data_range_)
print("<data scaling> \n", scaler.transform(data))
print('[2, 2] minmax scaling =', scaler.transform([[2, 2]]))

### Robust Scaling
- Subtracts the median and divides by the interquartile range $q_{75} - q_{25}$
- New feature has median $0$ and an interquartile range of $1$
- Similar to the standard scaler, but much less sensitive to outliers
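
To see the effect of an outlier, the following sketch scales a small made-up column by hand (median and interquartile range, which is what `RobustScaler` does with its default settings) and compares it with standard scaling.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Toy column (hypothetical values): 100 is an outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

median = np.median(x)
q25, q75 = np.percentile(x, [25, 75])
manual = (x - median) / (q75 - q25)   # subtract the median, divide by the IQR

robust = RobustScaler().fit_transform(x)
standard = StandardScaler().fit_transform(x)

print(manual.ravel())
print(standard.ravel())
```

With robust scaling the four ordinary points stay evenly spread around 0, whereas standard scaling squeezes them together because the outlier inflates both the mean and the standard deviation.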

In [None]:
from sklearn.preprocessing import RobustScaler

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = RobustScaler()
print(scaler.fit(data))
print("center of data = ", scaler.center_)
print("scale of data = ", scaler.scale_)
print("<data scaling> \n", scaler.transform(data))
print('[2, 2] robust scaling =', scaler.transform([[2, 2]]))

### Normalization
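- Makes sure that the absolute feature values of each point (each row) sum up to 1 (L1 norm)
  - Useful for count data (e.g. word counts in documents)
- Can also be used with L2 norm (sum of squares is 1), which is the default of scikit-learn's `Normalizer`
  - Useful when computing distances in high dimensions
  - For L2-normalized vectors, squared Euclidean distance equals $2(1 - \cos\theta)$, so ranking by Euclidean distance is equivalent to ranking by cosine similarity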
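
The link between Euclidean distance and cosine similarity can be verified numerically. This is a small sketch with two arbitrary vectors: after L2 normalization, the squared Euclidean distance equals `2 * (1 - cosine similarity)`.

```python
import numpy as np

# Two arbitrary example vectors
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 1.0])

a_l1 = a / np.abs(a).sum()       # L1: absolute values sum to 1
a_l2 = a / np.linalg.norm(a)     # L2: squares sum to 1
b_l2 = b / np.linalg.norm(b)

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
sq_dist = np.sum((a_l2 - b_l2) ** 2)

print(np.abs(a_l1).sum(), np.sum(a_l2 ** 2))
print(sq_dist, 2 * (1 - cosine))
```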

In [None]:
from sklearn.preprocessing import Normalizer

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = Normalizer()
print(scaler.fit(data))
print("<data normalizing> \n", scaler.transform(data))
print('[2, 2] normalize =', scaler.transform([[2, 2]]))

### Maximum Absolute Scaler
- For sparse data (many features, but few are non-zero)
  - Maintain sparseness (efficient storage)
- Scales all values so that maximum absolute value is 1
- Similar to Min-Max scaling without changing 0 values
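
The sparsity claim can be illustrated with a SciPy sparse matrix. In this sketch (made-up values), `MaxAbsScaler` only divides each column by its largest absolute value, so zeros stay zero and the number of stored entries does not change.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# Sparse toy matrix (hypothetical values): most entries are zero
X = sparse.csr_matrix(np.array([
    [0.0, 4.0, 0.0],
    [-2.0, 0.0, 0.0],
    [0.0, -8.0, 5.0],
]))

scaled = MaxAbsScaler().fit_transform(X)

print(sparse.issparse(scaled), scaled.nnz == X.nnz)
print(scaled.toarray())
```

A scaler that subtracts a mean or minimum would turn the zeros into non-zero values and make the matrix dense.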

In [None]:
from sklearn.preprocessing import MaxAbsScaler

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
scaler = MaxAbsScaler()
print(scaler.fit(data))
print("scale of data = ", scaler.scale_)
print("maximum absolute value of data = ", scaler.max_abs_)
print("<data scaling> \n", scaler.transform(data))
print('[2, 2] max-abs scaling =', scaler.transform([[2, 2]]))

### Power Transformations
- Some features follow certain distributions
  - E.g. the number of Twitter followers is approximately log-normally distributed
- Box-Cox transformations map these to approximately normal distributions (λ is fitted)
  - Only works for strictly positive values; use Yeo-Johnson otherwise
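
As a rough check of this idea, the sketch below draws a log-normal sample (synthetic, with an arbitrary seed) and applies `PowerTransformer(method="box-cox")`, comparing the skewness before and after. `scipy.stats.skew` is used here only as a convenient summary.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)

# Synthetic log-normal sample: strictly positive and strongly right-skewed
x = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

boxcox = PowerTransformer(method="box-cox")   # requires strictly positive input
x_bc = boxcox.fit_transform(x)

print("fitted lambda:", boxcox.lambdas_)
print("skew before:", skew(x.ravel()))
print("skew after:", skew(x_bc.ravel()))
```

For log-normal data the fitted λ should come out close to 0, which corresponds to a plain log transform.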

In [None]:
from sklearn.preprocessing import PowerTransformer

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
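# The default method is 'yeo-johnson', which also accepts the zero and negative values in this data
scaler = PowerTransformer()
print(scaler.fit(data))
print("<data scaling> \n", scaler.transform(data))
print('[2, 2] power transform =', scaler.transform([[2, 2]]))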

## Categorical Feature Encoding
- Many algorithms can only handle numeric features, so we need to encode the categorical ones

### Ordinal Encoding
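- Simply assigns an integer value to each category (scikit-learn's `OrdinalEncoder` uses the sorted order of the categories by default, not the order in which they are encountered)
- Only really useful if there exists a natural order in the categories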
  - Model will consider one category to be 'higher' or 'closer' to another
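
When the natural order differs from the alphabetical one, it can be passed explicitly. The following sketch uses a made-up `sizes` column: by default `OrdinalEncoder` sorts the categories, while the `categories` parameter fixes the intended order.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy column (hypothetical values) with a natural order: small < medium < large
sizes = np.array([["small"], ["large"], ["medium"], ["small"]])

# Default: categories are sorted alphabetically (large=0, medium=1, small=2)
default_codes = OrdinalEncoder().fit_transform(sizes)

# Explicit order: small=0, medium=1, large=2
ordered = OrdinalEncoder(categories=[["small", "medium", "large"]])
ordered_codes = ordered.fit_transform(sizes)

print(default_codes.ravel())
print(ordered_codes.ravel())
```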

In [None]:
from seaborn import load_dataset
df = load_dataset('titanic')
df.head(10)

In [None]:
class_cat = df[['class']]

In [None]:
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()
class_cat_encoded = ordinal_encoder.fit_transform(class_cat)
class_cat_encoded[:10]

### One-hot encoding (dummy encoding)
- Simply adds a new 0/1 feature for every category, having 1 (hot) if the sample has that category
- Can explode if a feature has lots of values, causing issues with high dimensionality
- What if test set contains a new category not seen in training data?
  - Either ignore it (just use all 0's in row), or handle manually (e.g. resample)
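
The "ignore" option maps to `handle_unknown="ignore"` in scikit-learn's `OneHotEncoder`. Here is a minimal sketch in which the test data contains a category (`"Crew"`, made up for the example) that was never seen during fitting.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["First"], ["Second"], ["Third"]])
test = np.array([["Second"], ["Crew"]])   # "Crew" never appears in the training data

# With the default handle_unknown="error", transform would raise on "Crew"
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

encoded = encoder.transform(test).toarray()
print(encoded)
```

The unseen category is encoded as a row of all zeros, so downstream models see it as "none of the known categories".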

In [None]:
from sklearn.preprocessing import OneHotEncoder
onehot_encoder = OneHotEncoder()
class_cat_encoded = onehot_encoder.fit_transform(class_cat)
class_cat_encoded.toarray()

In [None]:
onehot_encoder.categories_

### Target encoding
- Value close to 1 if category correlates with class 1, close to 0 if correlates with class 0
- Preferred when you have lots of category values. It only creates one new feature per class

In [None]:
!pip install category_encoders

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting category_encoders
  Downloading category_encoders-2.5.0-py2.py3-none-any.whl (69 kB)
Installing collected packages: category-encoders
Successfully installed category-encoders-2.5.0


In [None]:
import pandas as pd
from category_encoders.target_encoder import TargetEncoder

### category_encoders.target_encoder.[TargetEncoder(cols=None)](https://contrib.scikit-learn.org/category_encoders/targetencoder.html)

Target encoding for categorical features.

Supported targets: binomial and continuous. For polynomial target support, see PolynomialWrapper.

For the case of categorical target: features are replaced with a blend of posterior probability of the target given particular categorical value and the prior probability of the target over all the training data.

For the case of continuous target: features are replaced with a blend of the expected value of the target given particular categorical value and the expected value of the target over all the training data.

* cols: list  
a list of columns to encode, if None, all string columns will be encoded.
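
The "blend" of the per-category value and the overall prior can be sketched by hand with pandas. The version below uses simple additive smoothing with a hypothetical weight `m`; this is an illustrative variant and not the exact formula that `category_encoders` applies, so its numbers will differ from the library's output.

```python
import pandas as pd

# Small toy frame: a categorical feature and a binary target
toy = pd.DataFrame({
    "boro": ["Manhattan", "Queens", "Manhattan", "Brooklyn", "Brooklyn", "Bronx"],
    "vegan": [0, 0, 0, 1, 1, 0],
})

m = 1.0                        # smoothing weight (hypothetical choice)
prior = toy["vegan"].mean()    # overall rate of the positive class
stats = toy.groupby("boro")["vegan"].agg(["mean", "count"])

# Blend the per-category mean with the prior; small categories lean toward the prior
smoothed = (stats["count"] * stats["mean"] + m * prior) / (stats["count"] + m)
toy["boro_encoded"] = toy["boro"].map(smoothed)

print(toy)
```

Categories with few samples are pulled toward the prior, which limits overfitting to a handful of target values.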

In [None]:
data2 = pd.DataFrame({'boro' : ['Manhattan', 'Queens', 'Manhattan', 'Brooklyn', 'Brooklyn', 'Bronx'],
                     'salary' : [103, 89, 142, 54, 63, 219],
                     'vegan': [0,0,0,1,1,0]})

data2

Unnamed: 0,boro,salary,vegan
0,Manhattan,103,0
1,Queens,89,0
2,Manhattan,142,0
3,Brooklyn,54,1
4,Brooklyn,63,1
5,Bronx,219,0


In [None]:
encoder = TargetEncoder(cols=['boro'])
encoder.fit(data2['boro'], data2['vegan']);
data2['boro_encoded'] = encoder.transform(data2['boro'])

data2

Unnamed: 0,boro,salary,vegan,boro_encoded
0,Manhattan,103,0,0.089647
1,Queens,89,0,0.333333
2,Manhattan,142,0,0.089647
3,Brooklyn,54,1,0.820706
4,Brooklyn,63,1,0.820706
5,Bronx,219,0,0.333333


# Exercise

URL of the adult.csv file used in the exercises  

```python
url = 'https://raw.githubusercontent.com/EugeneYoo/practice_file/main/adult.csv'

df = pd.read_csv(url)

```

## Problem 1

After reading the adult.csv file, identify the columns that contain null values.

## Problem 1 Answer



## Problem 2

Handle the missing values in the `'workclass'` column of the adult dataframe by filling them with the most frequent value (Hot Deck Imputation).

## Problem 2 Answer

## Problem 3

Handle the missing values in the `'nativeCountry'` column of the adult dataframe using Missing Value Deletion (by row).

## Problem 3 Answer

## Problem 4

Check the data type of each column of the adult dataframe, then convert the income column from object to bool (False if income is 50K or less, True if it is more than 50K).

## Problem 4 Answer

## Problem 5

Encode the workclass column of the adult dataframe using an ordinal encoder.<br>
  **Reference code**
  ```python
  pandas.Series.to_numpy().reshape(-1, 1)
  
  df['column_name'].to_numpy().reshape(-1, 1)
  ```

## Problem 5 Answer

## Problem 6

Encode the workclass column of adult_df using a one-hot encoder.<br>
  **Reference code**
   ```python
  pandas.Series.to_numpy().reshape(-1, 1)
  
  df['column_name'].to_numpy().reshape(-1, 1)
  ```

## Problem 6 Answer