# Data Cleaning
When collecting data from online or real-time sources, the dataset is often a bit dirty: it may contain missing values, malformed entries, or simply incorrect inputs.

Our columns are relatively clean, so we won't show a very expansive set of cleaning tools, but feel free to check out more of our workshops to experiment with other types of data.

In [None]:
import os
import pandas as pd
import psycopg2

import warnings

warnings.filterwarnings("ignore")

# Get the current working directory
cwd = os.getcwd()

# Print the current working directory
print("Current working directory: {0}".format(cwd))

In [None]:
# check if the directory exists; if not, create it
outdir = "./scratch"

if not os.path.exists(outdir):
    os.mkdir(outdir)

In [None]:
# creates a connection to a database
conn = psycopg2.connect(
    database="predict-db", user="predict-db", password="failureislame", host="localhost"
)

GET_ALL_ROWS = "SELECT * FROM waterpump ORDER BY timestamp"

try:
    # "with conn" wraps the query in a transaction; it does not close the connection
    with conn:
        # Pull our dataset into a pandas dataframe
        df = pd.read_sql_query(GET_ALL_ROWS, conn)
        df.set_index("timestamp", inplace=True)
except Exception as err:
    # Exception already covers psycopg2.DatabaseError
    print(err)
finally:
    conn.close()
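
As a side note, `pd.read_sql_query` can set the index and parse dates in the same call through its `index_col` and `parse_dates` parameters. The sketch below is only an illustration: it uses an in-memory SQLite database as a stand-in for the Postgres instance, and the two rows and the `sensor_00` values are made up.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stand-in for the real database (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE waterpump (timestamp TEXT, sensor_00 REAL)")
conn.executemany(
    "INSERT INTO waterpump VALUES (?, ?)",
    [("2018-04-01 00:01:00", 2.0), ("2018-04-01 00:00:00", 1.0)],
)
conn.commit()

try:
    # index_col sets the index; parse_dates converts the column to datetimes.
    toy = pd.read_sql_query(
        "SELECT * FROM waterpump ORDER BY timestamp",
        conn,
        index_col="timestamp",
        parse_dates=["timestamp"],
    )
finally:
    conn.close()

print(toy)
```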

### Let's make a copy of the dataset so that, if we make a mistake or just want a clean version, we don't need to rerun the cell above.

In [None]:
df_original = df.copy()

### As we said before, we have some nulls in the data. Let's see if any columns are unusable.

In [None]:
nulls_series = df.isnull().sum()
print(nulls_series.sort_values())
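
To see what this call chain returns without touching the database, here is a small sketch on a toy dataframe. The column names and values are invented for illustration. It also computes the fraction of nulls per column, which can be easier to judge than raw counts.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the sensor table; names and values are made up.
toy = pd.DataFrame(
    {
        "sensor_a": [1.0, 2.0, np.nan, 4.0],
        "sensor_b": [np.nan, np.nan, np.nan, np.nan],
        "sensor_c": [0.5, 0.6, 0.7, 0.8],
    }
)

# Number of nulls per column, smallest first (same call chain as above).
null_counts = toy.isnull().sum().sort_values()

# Fraction of rows that are null in each column.
null_fraction = toy.isnull().mean().sort_values()

print(null_counts)
print(null_fraction)
```

A column whose null fraction is 1.0, like `sensor_b` here, carries no information at all.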

### Something looks wrong with sensor_15 data...

In [None]:
df["sensor_15"].unique()

In [None]:
# drop it like it's hot
df.drop("sensor_15", axis=1, errors="ignore", inplace=True)

In [None]:
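# choose how many of the columns with the most null values to remove
number_removed = 3
# note: nulls_series was computed before sensor_15 was dropped, so sensor_15 may
# still appear in this list; errors="ignore" below skips columns that are already gone
empty_cols = nulls_series.sort_values().tail(number_removed)
display(empty_cols)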

# get the names of the columns in a list
bad_col_list = list(empty_cols.keys())

# drop the bad columns
df.drop(bad_col_list, axis=1, errors="ignore", inplace=True)
print(df.columns)
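
The cell above removes a fixed number of columns. One possible alternative is to drop every column whose share of nulls exceeds a cutoff. The sketch below shows the idea on a toy dataframe; the 0.5 cutoff and the column names are assumptions chosen for illustration, not values taken from this dataset.

```python
import numpy as np
import pandas as pd

# Toy frame; names and values are illustrative only.
toy = pd.DataFrame(
    {
        "sensor_a": [1.0, 2.0, np.nan, 4.0],
        "sensor_b": [np.nan, np.nan, np.nan, 1.0],
        "sensor_c": [0.5, 0.6, 0.7, 0.8],
    }
)

# Hypothetical cutoff: drop columns where more than half of the rows are null.
max_null_fraction = 0.5

null_fraction = toy.isnull().mean()
bad_cols = null_fraction[null_fraction > max_null_fraction].index.tolist()

cleaned = toy.drop(columns=bad_cols)
print(bad_cols)
print(cleaned.columns.tolist())
```

A threshold adapts to the data, whereas a fixed count has to be revisited whenever the set of columns changes.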

### When we ultimately train a model, we'll need all of our columns to be numeric.
### If a non-numerical feature takes only a few discrete values, we can map each value to a number. A common alternative, one-hot encoding, instead creates a 0 (False) or 1 (True) column for each value.

In [None]:
# we have an in-between stage, 'recovering', so we'll label it 0.5

# a dictionary can be used to one-to-one map values in a series
status_map = {"NORMAL": 0, "BROKEN": 1, "RECOVERING": 0.5}

df["machine_status"] = df["machine_status"].map(status_map)
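
The dictionary mapping above produces a single numeric column. For comparison, here is a small sketch of one-hot encoding with `pd.get_dummies` on a toy series. The four sample values are made up; the `status_` prefix is just a naming choice for this example.

```python
import pandas as pd

# Toy status column; the values are illustrative.
status = pd.Series(
    ["NORMAL", "BROKEN", "RECOVERING", "NORMAL"], name="machine_status"
)

# Dictionary mapping, as in the cell above: one numeric column.
status_map = {"NORMAL": 0, "BROKEN": 1, "RECOVERING": 0.5}
mapped = status.map(status_map)

# One-hot encoding: one 0/1 column per distinct value.
one_hot = pd.get_dummies(status, prefix="status", dtype=int)

print(mapped.tolist())
print(one_hot)
```

The mapping keeps an ordering between the states (0 < 0.5 < 1), while one-hot encoding treats them as unrelated categories. Which one fits better depends on the model.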

### The index of our dataframe, the time, contains strings. Let's give them a smarter type that understands time.

In [None]:
df.index = pd.to_datetime(df.index)
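
To illustrate what a datetime index makes possible, here is a minimal sketch on a toy dataframe with made-up timestamps and values. Once the index is a `DatetimeIndex`, we can select a whole day with a partial date string and resample to daily averages.

```python
import pandas as pd

# Toy frame with timestamps stored as strings; values are made up.
toy = pd.DataFrame(
    {"sensor_a": [1.0, 2.0, 3.0, 4.0]},
    index=[
        "2018-04-01 00:00:00",
        "2018-04-01 00:01:00",
        "2018-04-02 00:00:00",
        "2018-04-02 00:01:00",
    ],
)

toy.index = pd.to_datetime(toy.index)

# Partial-string indexing: every row from one calendar day.
first_day = toy.loc["2018-04-01"]

# Resampling: average per calendar day.
daily_mean = toy.resample("D").mean()

print(first_day)
print(daily_mean)
```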

### Now that all of our columns are numerical, we can run some math operations ourselves for testing purposes.

In [None]:
df.describe().iloc[:, :15]

### Let's check the mean of each sensor. While we're at it, let's fill any null values with the column mean, which leaves each column's average unchanged.

In [None]:
col_averages = df.mean()
print(col_averages)
df.fillna(value=col_averages, inplace=True)
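
Here is a quick check of that claim on a toy dataframe with invented values: filling nulls with the column mean keeps each column's mean the same, although it does shrink the spread of the data.

```python
import numpy as np
import pandas as pd

# Toy frame; values are made up.
toy = pd.DataFrame(
    {"sensor_a": [1.0, np.nan, 3.0], "sensor_b": [10.0, 20.0, np.nan]}
)

# mean() skips NaN values by default.
col_averages = toy.mean()

# Passing a Series to fillna fills each column with its own mean.
filled = toy.fillna(value=col_averages)

print(col_averages)
print(filled)
print(filled.mean())
```

Because the filled values sit exactly at the mean, the standard deviation of each column goes down, which is worth keeping in mind when a column has many nulls.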

### We should be ready for further analysis. Let's save a CSV file so our next notebook can access the cleaned data.

In [None]:
df.to_csv(outdir + "/clean-df.csv")
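
A CSV file stores the index as plain text, so the next notebook needs to turn it back into datetimes on load. The sketch below shows one way to do that on a toy dataframe written to a temporary directory; the data is made up, and it assumes the index is named `timestamp` as in this notebook.

```python
import os
import tempfile

import pandas as pd

# Toy frame with a datetime index named "timestamp"; values are made up.
toy = pd.DataFrame(
    {"sensor_a": [1.0, 2.0]},
    index=pd.to_datetime(["2018-04-01 00:00:00", "2018-04-01 00:01:00"]),
)
toy.index.name = "timestamp"

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "clean-df.csv")
    toy.to_csv(path)
    # index_col and parse_dates restore the datetime index on load.
    restored = pd.read_csv(path, index_col="timestamp", parse_dates=True)

print(restored)
```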