# CI Portfolio Project 5 - Filter Maintenance Predictor 2022
## **Data Cleaning Notebook**

## Objectives

*   Confirm / Evaluate missing data
*   Clean data in preparation for analysis

### Inputs

1. Test Dataset : `outputs/datasets/collection/PredictiveMaintenanceTest.csv`

2. Train Dataset : `outputs/datasets/collection/PredictiveMaintenanceTrain.csv`

### Outputs

* Generate cleaned Train and Test sets, both saved under `outputs/datasets/cleaned`

### Conclusions

  * Data Cleaning Pipeline
  * Drop Variables as Required

---

# Change working directory

In [None]:
import os
current_dir = os.getcwd()
current_dir

In [None]:
os.chdir(os.path.dirname(current_dir))
print("Current directory set to new location")

In [None]:
current_dir = os.getcwd()
current_dir

---

# Load Collection Data

In [None]:
import pandas as pd

# Plain string literals: no f-string needed because nothing is interpolated
df_train = pd.read_csv('outputs/datasets/collection/PredictiveMaintenanceTrain.csv')
df_test = pd.read_csv('outputs/datasets/collection/PredictiveMaintenanceTest.csv')

In [None]:
df_train.info()

In [None]:
df_test.info()

---

# Data Exploration

### Check for Missing Data

Confirm whether any variables have missing data and, if they do, examine their distribution and shape.
* Note: we are aware that the **df_train** dataset does not have values for `RUL`, so both sets are checked separately.

If we combined the sets before checking, `RUL` would be reported as having missing values, like so:

In [None]:
df_total = pd.concat([df_train, df_test])
vars_with_missing_data = df_total.columns[df_total.isna().sum() > 0].to_list()
vars_with_missing_data

#### To check both datasets for missing data at the same time

Define a helper function that prints the name of the dataframe being checked.

In [None]:
def name_dataframe(data):
    """Print the name of the global variable that refers to `data`."""
    # Identity check (`is`): the first global bound to this exact object wins
    name = [n for n in globals() if globals()[n] is data][0]
    print(f'Dataframe name: {name}')

In [None]:
from pandas_profiling import ProfileReport

for df in (df_train, df_test):
    vars_with_missing_data = df.columns[df.isna().sum() > 0].to_list()
    if vars_with_missing_data:
        profile = ProfileReport(df=df[vars_with_missing_data], minimal=True)
        profile.to_notebook_iframe()
    else:
        name_dataframe(df)
        print('There are no variables with missing data')
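
When a full profile report is more than is needed, a lighter check is a per-column table of missing counts and percentages. The sketch below is only an illustration: `missing_summary`, the toy frames, and the `A2` dust label are hypothetical stand-ins, not part of the project data.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })
    return summary[summary["missing_count"] > 0]


# Hypothetical toy frames standing in for df_train and df_test
toy_frames = {
    "toy_train": pd.DataFrame({
        "Differential_pressure": [1.0, 2.0, None, 4.0],
        "Dust": ["A2"] * 4,
    }),
    "toy_test": pd.DataFrame({
        "Differential_pressure": [1.0, 2.0],
        "RUL": [10.0, 9.0],
    }),
}

for name, frame in toy_frames.items():
    summary = missing_summary(frame)
    print(f"Dataframe name: {name}")
    print(summary if not summary.empty else "There are no variables with missing data")
```

Keeping the names in a dictionary also avoids looking them up through `globals()`.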

---

## Evenly distribute dataset by `Dust` type

The supplied train and test sets both have data distributed unevenly between 50 test bins. To account for this, we assess the measures of central tendency for each `Dust` class, with the aim of reducing the data to a size that is more evenly proportioned between classes.

Consider calculating the percentage of `censored` observations across both datasets.
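
One possible way to compute a censoring percentage is sketched below. It assumes that a bin counts as censored when its peak `Differential_pressure` never reaches the 600 Pa failure point; that definition, the `censored_pct` helper, and the toy values are assumptions made for illustration only.

```python
import pandas as pd

FAILURE_PA = 600  # failure threshold quoted in this notebook


def censored_pct(df: pd.DataFrame) -> float:
    """Percentage of Data_No bins whose peak pressure stays below FAILURE_PA."""
    peak = df.groupby("Data_No")["Differential_pressure"].max()
    return float((peak < FAILURE_PA).mean() * 100)


# Hypothetical toy data: bins 1 and 4 reach failure, bins 2 and 3 do not
toy = pd.DataFrame({
    "Data_No": [1, 1, 2, 2, 3, 3, 4, 4],
    "Differential_pressure": [10.0, 600.0, 5.0, 320.0, 8.0, 450.0, 12.0, 610.0],
})
print(f"{censored_pct(toy):.1f}% of bins are censored")
```

The same function could be applied to `df_train` and `df_test` in turn, provided both carry the `Data_No` and `Differential_pressure` columns.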

#### **Train** Dataset

**Considerations**

* The proportion of data that **has reached filter failure**. These observations may be worth keeping and will form part of our heuristic decision process.
* The **mean** is the most frequently used measure of central tendency because it uses all values in the data set to give you an average.
* For data from skewed distributions (like `Differential_pressure`), the **median** is better than the mean because it isn't influenced by extremely large values.
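
The mean-versus-median point can be illustrated with a small sketch. The values and the `A2` / `A3` class labels below are made up for the example; only the column names follow this notebook.

```python
import pandas as pd

# Hypothetical toy data: class A2 contains one extreme reading, class A3 does not
toy = pd.DataFrame({
    "Dust": ["A2", "A2", "A2", "A2", "A3", "A3", "A3", "A3"],
    "Differential_pressure": [1.0, 2.0, 3.0, 594.0, 10.0, 20.0, 30.0, 40.0],
})

# Mean and median per Dust class, plus the gap between them
central = toy.groupby("Dust")["Differential_pressure"].agg(["mean", "median"])
central["mean_minus_median"] = central["mean"] - central["median"]
print(central)
```

A large gap between mean and median for a class is a quick hint that its distribution is skewed and that the median is the safer summary.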

Note the `Data_No` bins whose final `Differential_pressure` observation reached **600 Pa** (the point of filter failure); sorting the last row of each bin by pressure brings them to the top.

In [None]:
# Last row of each Data_No bin: the following row belongs to a different bin
last_row_train = df_train[df_train.Data_No != df_train.Data_No.shift(-1)]
# Highest final pressures first
last_row_descending = last_row_train.sort_values(by='Differential_pressure', ascending=False)
last_row_descending.head(n=10)

The `Dust` variable in this dataset shows a disproportionate mix between classes:

In [None]:
%matplotlib inline

category_totals = df_train.groupby('Dust')['Differential_pressure'].count().sort_values()
category_totals.plot(kind="barh", title='Proportion of Dust Classes in df_train\n', xlabel='\nObservations', ylabel='Dust Class')
category_totals
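
One way to even out the class mix is to keep the same number of `Data_No` bins for every `Dust` class. This is a minimal sketch, not the method adopted in this project: `balance_bins_by_dust` is a hypothetical helper, the toy data is made up, and it assumes each bin contains a single `Dust` class.

```python
import pandas as pd


def balance_bins_by_dust(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Keep an equal number of Data_No bins for every Dust class."""
    # One row per (bin, class) pair; assumes each bin holds a single Dust class
    bins = df[["Data_No", "Dust"]].drop_duplicates()
    # The rarest class sets how many bins every class keeps
    n_keep = int(bins["Dust"].value_counts().min())
    kept = bins.groupby("Dust").sample(n=n_keep, random_state=seed)
    # Return every row of the kept bins, in the original order
    return df[df["Data_No"].isin(kept["Data_No"])]


# Hypothetical toy data: three A2 bins and one A3 bin
toy = pd.DataFrame({
    "Data_No": [1, 1, 2, 2, 3, 3, 4, 4],
    "Dust": ["A2", "A2", "A2", "A2", "A2", "A2", "A3", "A3"],
    "Differential_pressure": [1.0, 600.0, 2.0, 300.0, 3.0, 450.0, 4.0, 600.0],
})
balanced = balance_bins_by_dust(toy)
print(balanced.groupby("Dust")["Data_No"].nunique())
```

Sampling whole bins keeps each time series intact. It balances the number of bins rather than the number of rows, so classes with longer bins can still contribute more observations.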

In [None]:
# Report the name and shape (rows, columns) of each dataframe
for df in (df_train, df_test):
    name_dataframe(df)
    print(df.shape)

Review the last values of each data bin

In [None]:
df_train[df_train.Data_No != df_train.Data_No.shift(-1)].head()
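
The `shift(-1)` comparison relies on the rows of each bin being stored contiguously. As a hedged alternative, `groupby(...).tail(1)` picks the final row of each bin wherever its rows sit. The toy frame below is hypothetical and only mirrors the column names used here.

```python
import pandas as pd

# Hypothetical toy data with three contiguous bins
toy = pd.DataFrame({
    "Data_No": [1, 1, 1, 2, 2, 3],
    "Differential_pressure": [0.5, 3.0, 600.2, 0.4, 310.0, 1.1],
})

# Approach used in this notebook: a row is last in its bin when the next row's Data_No differs
last_by_shift = toy[toy.Data_No != toy.Data_No.shift(-1)]

# Alternative: take the final row of each group directly
last_by_group = toy.groupby("Data_No").tail(1)

# On contiguous data both approaches select the same rows
print(last_by_shift.equals(last_by_group))
```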

In [None]:
df_train.describe().round(decimals=2)

Extract each class and compare distributions
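
A minimal sketch of this step, assuming the `Dust` and `Differential_pressure` columns used above; the toy values and the `A2` / `A3` labels are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with two Dust classes
toy = pd.DataFrame({
    "Dust": ["A2"] * 4 + ["A3"] * 4,
    "Differential_pressure": [1.0, 2.0, 3.0, 594.0, 10.0, 20.0, 30.0, 40.0],
})

# One sub-frame per Dust class
class_frames = {dust: frame for dust, frame in toy.groupby("Dust")}

# Side-by-side summary (count, mean, std, quartiles) for every class
per_class = toy.groupby("Dust")["Differential_pressure"].describe()
print(per_class)
```

Each entry of `class_frames` can then be plotted or summarised on its own, while `per_class` gives a single table for comparing the classes.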

#### **Test** dataset

In [None]:
df_test[df_test.Data_No != df_test.Data_No.shift(-1)].head()

---

# Correlation and Predictive Power Score Analysis

---

## Save Datasets

Save the files to the `outputs/datasets/cleaned` folder.

In [None]:
import os

# exist_ok=True avoids an error when the folder already exists
os.makedirs(name='outputs/datasets/cleaned', exist_ok=True)

df_train.to_csv('outputs/datasets/cleaned/dfCleanTrain.csv', index=False)
df_test.to_csv('outputs/datasets/cleaned/dfCleanTest.csv', index=False)

---

# Conclusions and Next steps

#### Conclusions: 
* 

#### Next Steps:
* Correlation Study
* Feature Engineering

---