# **LAB 1: Cleaning data using dropna() and fillna()**

The `hmeq_small` dataset contains information on 5960 home equity loans, including 7 features on the characteristics of the loan.

- Load the data set `hmeq_small.csv` as a data frame.
- Create a new data frame with all the rows with missing data deleted.
- Create a second data frame with all missing data filled in with the median value of the column.
- Find the medians of the columns for both new data frames.
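
The two cleaning strategies in the list above can be sketched on a small hand-made frame. The values below are hypothetical and are not taken from `hmeq_small.csv`; the names `toy`, `toy_delete`, and `toy_replace` exist only for this illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from hmeq_small.csv)
toy = pd.DataFrame({
    "LOAN": [1000.0, 2000.0, np.nan, 4000.0],
    "YOJ": [2.0, np.nan, 6.0, 10.0],
})

# Strategy 1: drop every row that has at least one missing value
toy_delete = toy.dropna()

# Strategy 2: fill each missing value with the median of its own column
toy_replace = toy.fillna(toy.median())

print(toy_delete)
print(toy_replace)
```

Only the rows with no missing entries survive `dropna()`, while `fillna()` keeps every row and substitutes the column median where a value was missing.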

Ex: Using only the first hundred rows, found in `hmeq_sample.csv`, the output is:
```
Median for hmeqDelete are  LOAN        3250.000000
MORTDUE    64793.500000
VALUE      75525.000000
YOJ            9.000000
CLAGE        119.999883
CLNO          13.500000
DEBTINC       31.824143
dtype: float64
Median for hmeqReplace are  LOAN        3000.000000
MORTDUE    47000.000000
VALUE      61000.000000
YOJ            6.000000
CLAGE        122.950000
CLNO          14.000000
DEBTINC       31.588503
dtype: float64
```

In [2]:
import pandas as pd

# Read in hmeq_small.csv
hmeq = pd.read_csv("hmeq_small.csv")

# Create a new data frame with the rows with missing values dropped
hmeqDelete = hmeq.dropna()

# Create a new data frame with the missing values filled in by the median of the column
hmeqReplace = hmeq.fillna(hmeq.median())

# Print the median of the columns for each new data frame
print("Median for hmeqDelete are ", hmeqDelete.median())

print("Median for hmeqReplace are ", hmeqReplace.median())

Median for hmeqDelete are  LOAN       17000.000000
MORTDUE    66893.000000
VALUE      94364.500000
YOJ            7.000000
CLAGE        175.563507
CLNO          21.000000
DEBTINC       35.202650
dtype: float64
Median for hmeqReplace are  LOAN       16300.000000
MORTDUE    65019.000000
VALUE      89235.500000
YOJ            7.000000
CLAGE        173.466667
CLNO          20.000000
DEBTINC       34.818262
dtype: float64
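
The two sets of medians differ because `dropna()` discards whole rows, including the observed values in them, while median imputation keeps every row. A minimal sketch for checking how much data each approach keeps, using a toy frame with made-up values as a stand-in for `hmeq` (all names below are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy stand-in for hmeq (hypothetical values)
toy = pd.DataFrame({
    "LOAN": [1000.0, 2000.0, 3000.0, 4000.0, 5000.0],
    "DEBTINC": [30.0, np.nan, np.nan, 40.0, 35.0],
})

# Count the missing values in each column
missing_per_column = toy.isna().sum()

# Compare the number of rows each strategy keeps
rows_after_drop = len(toy.dropna())
rows_after_fill = len(toy.fillna(toy.median()))

print(missing_per_column)
print(rows_after_drop, rows_after_fill)

# LOAN has no missing values, yet its median changes after dropna()
# because rows 1 and 2 are removed along with their LOAN values
print(toy["LOAN"].median(), toy.dropna()["LOAN"].median())
```

Running the same checks on `hmeq` would show how many loans `hmeqDelete` loses relative to `hmeqReplace`.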


# **LAB 2: Structuring data using scale() and MinMaxScaler()**

The `hmeq_small` dataset contains information on 5960 home equity loans, including 7 features on the characteristics of the loan.

- Load the `hmeq_small.csv` data set as a data frame.
- Standardize the data set as a new data frame.
- Normalize the data set as a new data frame.
- Print the means and variances of both the standardized and normalized data.
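
As a rough sketch of what the two transformations in the list above compute, the formulas can be written by hand in pandas and compared with scikit-learn. The toy values and the names `stand_by_hand` and `norm_by_hand` are assumptions made for this illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, scale

# Toy data (hypothetical values, not taken from hmeq_small.csv)
toy = pd.DataFrame({
    "LOAN": [1000.0, 2000.0, 3000.0, 6000.0],
    "YOJ": [1.0, 3.0, 5.0, 7.0],
})

# Standardization: subtract the column mean and divide by the
# population standard deviation (ddof=0)
stand_by_hand = (toy - toy.mean()) / toy.std(ddof=0)

# Normalization: rescale each column to [0, 1] using its min and max
norm_by_hand = (toy - toy.min()) / (toy.max() - toy.min())

# Compare with scikit-learn, allowing for floating-point error
print(np.allclose(stand_by_hand, StandardScaler().fit_transform(toy)))
print(np.allclose(stand_by_hand, scale(toy)))
print(np.allclose(norm_by_hand, MinMaxScaler().fit_transform(toy)))
```

On this toy frame, `scale()` and `StandardScaler` give the same standardized values, so either can be used for the standardization step.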

Ex: Using the first 100 rows, found in `hmeq_sample.csv`, the output is:
```
The means of hmeqStand are  LOAN      -4.984675e-17
MORTDUE    1.914178e-17
VALUE     -1.790682e-18
YOJ       -7.235161e-17
CLAGE     -4.194176e-17
CLNO      -6.033821e-17
DEBTINC    6.125368e-17
dtype: float64
The variance of hmeqStand are  LOAN       1.010309
MORTDUE    1.011628
VALUE      1.010870
YOJ        1.011364
CLAGE      1.011236
CLNO       1.010989
DEBTINC    1.035714
dtype: float64
The means of hmeqNorm are  LOAN       0.671006
MORTDUE    0.358735
VALUE      0.299044
YOJ        0.292135
CLAGE      0.448986
CLNO       0.346377
DEBTINC    0.624927
dtype: float64
The variance of hmeqNorm are  LOAN       0.072647
MORTDUE    0.061099
VALUE      0.035189
YOJ        0.056618
CLAGE      0.051232
CLNO       0.035601
DEBTINC    0.049705
dtype: float64
```

In [1]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Read in hmeq_small.csv
hmeq = pd.read_csv("hmeq_small.csv")

# Select only the numeric columns; both scalers require numeric input
numeric_columns = hmeq.select_dtypes(include=["float64", "int64"]).columns

# Standardize the data (zero mean, unit variance per column)
scaler = StandardScaler()
standardized = scaler.fit_transform(hmeq[numeric_columns])

# Output the standardized data as a data frame
hmeqStand = pd.DataFrame(standardized, columns=numeric_columns)

# Normalize the data (rescale each column to the range [0, 1])
scaler2 = MinMaxScaler()
normalized = scaler2.fit_transform(hmeq[numeric_columns])

# Output the normalized data as a data frame
hmeqNorm = pd.DataFrame(normalized, columns=numeric_columns)

# Print the means and variance of hmeqStand and hmeqNorm
print("The means of hmeqStand are ", hmeqStand.mean())
print("The variance of hmeqStand are ", hmeqStand.var())
print("The means of hmeqNorm are ",hmeqNorm.mean())
print("The variance of hmeqNorm are ", hmeqNorm.var())

The means of hmeqStand are  LOAN      -1.525998e-16
MORTDUE   -2.089064e-16
VALUE      0.000000e+00
YOJ        7.307694e-17
CLAGE     -2.413733e-16
CLNO      -8.915838e-17
DEBTINC   -2.361915e-16
dtype: float64
The variance of hmeqStand are  LOAN       1.000168
MORTDUE    1.000184
VALUE      1.000171
YOJ        1.000184
CLAGE      1.000177
CLNO       1.000174
DEBTINC    1.000213
dtype: float64
The means of hmeqNorm are  LOAN       0.197162
MORTDUE    0.180378
VALUE      0.110597
YOJ        0.217616
CLAGE      0.153879
CLNO       0.299945
DEBTINC    0.163991
dtype: float64
The variance of hmeqNorm are  LOAN       0.015929
MORTDUE    0.012510
VALUE      0.004580
YOJ        0.034126
CLAGE      0.005395
CLNO       0.020392
DEBTINC    0.001799
dtype: float64
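
The variances of `hmeqStand` come out slightly above 1 rather than exactly 1. This is consistent with `StandardScaler` dividing by the population standard deviation (`ddof=0`) while `DataFrame.var()` defaults to the sample variance (`ddof=1`), which inflates the result by a factor of n / (n - 1). For `LOAN`, 5960 / 5959 ≈ 1.000168, matching the printed value. A minimal sketch on toy data (hypothetical values and names):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values, not taken from hmeq_small.csv)
toy = pd.DataFrame({"LOAN": [1000.0, 2000.0, 3000.0, 6000.0]})
n = len(toy)

stand = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

# Sample variance (pandas default, ddof=1) is inflated by n / (n - 1)
sample_var = stand["LOAN"].var()

# Population variance (ddof=0) is the quantity StandardScaler sets to 1
population_var = stand["LOAN"].var(ddof=0)

print(sample_var, n / (n - 1))
print(population_var)
```

Passing `ddof=0` to `hmeqStand.var()` should therefore report variances equal to 1 up to floating-point error.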
