# Case Study 1: Predictive Maintenance

## Objectives

After this tutorial you will be able to:

*   Link all the steps of a typical data science project, from problem definition to communication
*   Predict when a device will need maintenance based on historical data

<h2>Table of Contents</h2>

<ol>
    <li>
        <a href="#problem">Problem Definition</a>
    </li>
    <br>
    <li>
        <a href="#import">Data Collection</a>
    </li>
    <br>
    <li>
        <a href="#clean">Data Cleaning and Preparation</a>
    </li>
    <br>
    <li>
        <a href="#eda">Exploratory Data Analysis</a>
    </li>
    <br>
    <li>
        <a href="#model">Model Development and Evaluation</a>
    </li>
    <br>
    <li>
        <a href="#deploy">Deployment and Communication</a>
    </li>
    <br>
</ol>


<hr id="problem">

<h2>1. Problem Definition</h2>

<h3>Background:</h3>

The operating data were provided by a refinery. They describe the heat exchangers in a cold train serving the refinery's crude distillation unit.  
The network consists of 7 countercurrent heat exchangers, with raw crude on the tube side of each exchanger.  

<div style="text-align: center;">
    <img src="hex_network.png" height="500px">
</div>

<h3>Data:</h3>

The data consist of monthly files containing daily readings for each heat exchanger. This lab uses only exchanger `E-003`.  
Each file contains the following fields:
- Date
- Flow rate (raw crude), bph
- Inlet temp (raw crude), F
- Outlet temp (raw crude), F
- Flow rate (VTB), bph
- Inlet temp (VTB), F
- Outlet temp (VTB), F

Process parameters:
- U (service), Btu/h.ft^2.F: 27.7
- Exchanger Area, ft^2: 3,561
- Cp (raw crude), Btu/lb.F: 2.6
- Cp (VTB), Btu/lb.F: 4
- Density (raw crude), lb/ft^3: 56.85
- Density (VTB), lb/ft^3: 65


<h3>Goal:</h3>

Recommend a cleaning schedule for the heat exchanger, given that cleaning is necessary when `U (actual)` drops below 70% of `U (service)`.
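
As a quick check of this criterion, the threshold can be worked out directly from the service value in the process parameters. This is a minimal sketch; the variable names are illustrative and not part of the provided data.

```python
# Cleaning threshold implied by the goal: 70% of U (service)
u_service = 27.7                 # Btu/h.ft^2.F, from the process parameters
u_threshold = 0.7 * u_service    # cleaning is needed once U (actual) drops below this

print(f"Cleaning threshold: {u_threshold:.2f} Btu/h.ft^2.F")
```

The threshold works out to about 19.4 Btu/h.ft^2.F.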


<hr id="import">

<h2>2. Data Collection</h2>

Import the required libraries

In [21]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Read all the CSV files in the `E003_data` folder into a single `Pandas DataFrame`

In [None]:
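# load all csv files in the E003_data folder and concatenate them into one dataframe
folder = 'E003_data'
# sort the file names so the load order is deterministic, and skip any non-csv files
files = sorted(f for f in os.listdir(folder) if f.endswith('.csv'))
# read every file first, then concatenate once (cheaper than growing a dataframe in a loop)
df = pd.concat([pd.read_csv(os.path.join(folder, f)) for f in files], axis=0, ignore_index=True)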

df.head()


<hr id="clean">

<h2>3. Data Cleaning and Preparation</h2>

<h5 id="clean-missing">Handle missing values</h5>

Identify missing values

In [None]:
# get more info about the data
df.info()

In [None]:
# find the number of missing values in each column
# TODO

Drop rows with "NaN" from certain columns

In [None]:
# drop the rows with missing values
# TODO
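
One possible way to approach the two steps above is sketched here on a small made-up DataFrame. The values are invented for illustration, and which columns to check for `NaN` in the real data is an assumption you should adapt.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the exchanger data (values are made up)
toy = pd.DataFrame({
    "Date": ["2020-01-01", "2020-01-02", "2020-01-03"],
    "Shell Temp In, F": [500.0, np.nan, 510.0],
    "Shell Temp Out, F": [420.0, 425.0, np.nan],
})

# Count the missing values in each column
missing_counts = toy.isna().sum()
print(missing_counts)

# Drop only the rows that have NaN in the selected columns
toy_clean = toy.dropna(subset=["Shell Temp In, F", "Shell Temp Out, F"])
print(toy_clean)
```

Passing `subset` restricts the check to the listed columns; without it, `dropna` removes any row with a missing value in any column.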

<h5 id="clean-duplicates">Remove duplicates</h5>

In [None]:
# find the number of duplicate rows
# TODO

In [None]:
# drop the duplicate rows
# TODO
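
As a rough illustration of the duplicate check, here is a sketch on made-up data. It treats a row as a duplicate only when every column matches an earlier row, which is an assumption about what counts as a duplicate in this lab.

```python
import pandas as pd

# Toy frame with one exact duplicate row (values are made up)
toy = pd.DataFrame({
    "Date": ["2020-01-01", "2020-01-01", "2020-01-02"],
    "Shell Temp In, F": [500.0, 500.0, 505.0],
})

# Number of rows that repeat an earlier row exactly
n_duplicates = toy.duplicated().sum()
print(n_duplicates)

# Keep the first occurrence of each row and rebuild a clean index
toy_unique = toy.drop_duplicates().reset_index(drop=True)
print(toy_unique)
```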

<h5 id="clean-standardize">Standardize data</h5>

In [28]:
# convert the "Date" column to datetime
# TODO

# convert the rest of the columns to float
# TODO
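
The conversion could look something like the sketch below. The toy values are made up, and it assumes every column other than `Date` holds numeric readings stored as text.

```python
import pandas as pd

# Toy frame where everything was read in as text (values are made up)
toy = pd.DataFrame({
    "Date": ["2020-01-01", "2020-01-02"],
    "Shell Temp In, F": ["500", "505.5"],
})

# Parse the date strings into datetime values
toy["Date"] = pd.to_datetime(toy["Date"])

# Convert every remaining column to float
value_cols = toy.columns.drop("Date")
toy = toy.astype({col: float for col in value_cols})

print(toy.dtypes)
```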


<h5>Validate cleaned data</h5>

In [None]:
# check data types
# TODO

In [None]:
# check for duplicates
# TODO

<h5>Create extra required features</h5>

-  `year-month`
-  `LMTD`
-  `U (actual)`

In [None]:
# system parameters
V = 5.615 # ft^3/bbl (converts the flow rate from bph to ft^3/h)
rho = 56.85 # lb/ft^3
Cp = 2.6 # Btu/lb.F
A = 3561 # ft^2

# create a new column for "year-month" (yyyy-mm)
# TODO

# create a new column for "LMTD" (countercurrent flow)
# s = shell, t = tube, i = in, o = out
# LMTD = ((Tsi - Tto) - (Tso - Tti)) / ln((Tsi - Tto) / (Tso - Tti))
# TODO

# create a new column for "U (actual)" = Q / (A * LMTD)
# Q = flow (bph) * V * rho * Cp * dT, using the raw crude (tube side) flow rate and temperature rise
# TODO

df.head()
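
To make the three features concrete, here is a sketch for a single made-up daily reading. The tube-side column names (`Tube Temp In, F`, `Tube Temp Out, F`, `Tube Flow, bph`) are hypothetical, and the sketch assumes 5.615 is the barrel-to-cubic-foot conversion, so that the duty comes out in Btu/h.

```python
import numpy as np
import pandas as pd

BBL_TO_FT3 = 5.615   # ft^3 per barrel (assumed meaning of the 5.615 constant)
RHO = 56.85          # lb/ft^3, raw crude
CP = 2.6             # Btu/lb.F, raw crude
AREA = 3561          # ft^2

# One made-up reading; the tube-side column names are hypothetical
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-15"]),
    "Shell Temp In, F": [500.0],
    "Shell Temp Out, F": [420.0],
    "Tube Temp In, F": [300.0],
    "Tube Temp Out, F": [350.0],
    "Tube Flow, bph": [300.0],
})

# year-month in yyyy-mm format
toy["year-month"] = toy["Date"].dt.strftime("%Y-%m")

# Countercurrent terminal temperature differences
dt_hot_end = toy["Shell Temp In, F"] - toy["Tube Temp Out, F"]
dt_cold_end = toy["Shell Temp Out, F"] - toy["Tube Temp In, F"]
toy["LMTD"] = (dt_hot_end - dt_cold_end) / np.log(dt_hot_end / dt_cold_end)

# Heat picked up by the crude (Btu/h), then U = Q / (A * LMTD)
duty = (toy["Tube Flow, bph"] * BBL_TO_FT3 * RHO * CP
        * (toy["Tube Temp Out, F"] - toy["Tube Temp In, F"]))
toy["U (actual)"] = duty / (AREA * toy["LMTD"])

print(toy[["year-month", "LMTD", "U (actual)"]])
```

Note that the LMTD expression is undefined when the two terminal temperature differences are equal, so such rows would need separate handling.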

<hr id="eda">

<h2>4. Exploratory Data Analysis</h2>

<h5>Descriptive Analysis</h5>

In [None]:
df_desc = df.describe()
df_desc

<h5>Visualize Parameters and identify outliers</h5>

In [33]:
# helper function to identify iqr bounds
def iqr_bounds(df, column):
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    return lower_bound, upper_bound

### Shell side

Box plot

In [None]:
# box plot of shell temperature in and out
# TODO


Remove outliers

In [35]:
# remove outliers for "Shell Temp In, F" column
# TODO

# remove outliers for "Shell Temp Out, F" column
# TODO
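
One way the helper could be applied is sketched below on made-up temperatures. The `iqr_bounds` function is repeated only so the snippet runs on its own, and keeping values inside the bounds with `between` is one choice among several.

```python
import pandas as pd

def iqr_bounds(df, column):
    """Return the lower and upper 1.5 * IQR bounds for a column."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Made-up readings with one obvious outlier
toy = pd.DataFrame({"Shell Temp In, F": [498.0, 500.0, 502.0, 504.0, 506.0, 900.0]})

lower, upper = iqr_bounds(toy, "Shell Temp In, F")

# Keep only the rows whose value lies within the bounds (inclusive)
toy_filtered = toy[toy["Shell Temp In, F"].between(lower, upper)]
print(toy_filtered)
```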

Remove impractical entries

In [None]:
# find entries with "Shell Temp In, F" < "Shell Temp Out, F"
# TODO


In [None]:
# remove the found entries
# TODO
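
The check above can be sketched with a boolean mask. The values are made up; the idea is that the shell-side stream gives up heat, so its outlet should not be hotter than its inlet.

```python
import pandas as pd

# Made-up readings; the second row is physically impractical
toy = pd.DataFrame({
    "Shell Temp In, F": [500.0, 410.0, 505.0],
    "Shell Temp Out, F": [420.0, 430.0, 425.0],
})

# Flag the rows where the shell inlet is colder than the shell outlet
impractical = toy["Shell Temp In, F"] < toy["Shell Temp Out, F"]
n_impractical = impractical.sum()
print(toy[impractical])

# Keep only the rows that are not flagged
toy_valid = toy[~impractical]
```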

### Tube side


In [None]:
# box plot of tube temperature in and out
# TODO


### Flow rates

In [None]:
# box plot of shell flow rate and tube flow rate
# TODO

### U (actual)

In [None]:
# box plot of U (actual)
# TODO


<h5>Visualize trend of U over time</h5>

In [None]:
# plot U (actual) over time (Date)
# TODO


In [42]:
# remove outliers for "U (actual)" column
# TODO

To reduce the noise, let's examine the monthly averages.

In [None]:
# group by "year-month" and calculate the average of "U (actual)"
df_monthly = df.groupby('year-month', as_index=False)['U (actual)'].mean()
df_monthly


In [None]:
# plot monthly average U (actual) over time (year-month)
# TODO


<hr id="model">

<h2>5. Model Development and Evaluation</h2>

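Let's first try a linear regression of the monthly average `U (actual)` against time, using the row position in `df_monthly` as the month index (this is the value the prediction code later passes to `lr.predict`).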

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# create a linear regression model (call it "lr")
# TODO

# split the data into training and testing sets
# TODO

# fit the model to the training data
# TODO

# make predictions using the testing set
# TODO

# print the coefficients
# TODO


Evaluate the linear model using the following metrics:
- Mean Squared Error (MSE)
- Coefficient of Determination (R2)

In [None]:
# evaluate the model
# TODO
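
To show how fitting and evaluation fit together, here is a sketch on synthetic data. The declining series, the noise level, the split ratio and the name `lr_demo` are all assumptions made for illustration, not values from the refinery data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic monthly averages that decline roughly linearly (made up)
rng = np.random.default_rng(0)
X = np.arange(24).reshape(-1, 1)                  # month index 0..23 as a 2-D array
y = 27.0 - 0.25 * X.ravel() + rng.normal(0, 0.2, size=24)

# Hold out 25% of the months for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit on the training months and predict the held-out months
lr_demo = LinearRegression().fit(X_train, y_train)
y_pred = lr_demo.predict(X_test)

# Slope (change in U per month) and intercept of the fitted line
print("slope:", lr_demo.coef_[0], "intercept:", lr_demo.intercept_)

# Evaluation metrics on the held-out months
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("MSE:", mse, "R2:", r2)
```

Because the observations are ordered in time, passing `shuffle=False` to `train_test_split` to keep the most recent months for testing is a common alternative to a random split.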

Calculate the month in which `U (actual)` is predicted to fall below 70% of `U (service)`

In [None]:
# predict the month when U (actual) reaches U (dirty), i.e. 70% of U (service)
u_design = 27.7
u_dirty = 0.7 * u_design

# calculate the month (call it "month")
# TODO
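
For a straight-line fit, the crossing month comes from solving `u_dirty = intercept + slope * month` for `month`. The sketch below uses a made-up slope and intercept; with a fitted scikit-learn model these would come from `lr.coef_[0]` and `lr.intercept_`.

```python
u_design = 27.7
u_dirty = 0.7 * u_design

# Hypothetical fitted line: U = intercept + slope * month_index
intercept = 27.0    # made-up value
slope = -0.25       # made-up value, change in U per month

# Month index at which the fitted line reaches the cleaning threshold
month_demo = (u_dirty - intercept) / slope
print(f"Threshold reached at month index {month_demo:.2f}")
```

The result is a fractional month index counted from the first row of the monthly data, so it still needs to be mapped back to a calendar month.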


In [None]:
# add a new column for "U (predicted)"
# TODO

# add new rows for future months
for i in range(len(df_monthly), int(month) + 4):
    # take the previous row's year and month, then step forward one month
    y, m = map(int, df_monthly.loc[i - 1, 'year-month'].split('-'))
    if m == 12:
        # roll over from December to January of the next year
        y, m = y + 1, 1
    else:
        m += 1
    year_month = str(y) + '-' + str(m).zfill(2)

    df_monthly.loc[i] = {
        'year-month': year_month, 
        'U (actual)': np.nan, 
        'U (predicted)': lr.predict([[i]])[0]
    }

df_monthly.tail()

Generate a scatter plot with a trend line showing when the heat exchanger will require maintenance

In [None]:
# plot the actual and predicted U values
df_monthly.plot(kind='scatter', x='year-month', y='U (actual)', figsize=(12, 6), title='U (actual) over time')
plt.plot(df_monthly['year-month'], df_monthly['U (predicted)'], color='orange')
plt.plot(df_monthly['year-month'], [u_dirty] * len(df_monthly), color='red', linestyle='dashed')
plt.scatter(month, u_dirty, color='red', marker='o', s=400, facecolors='none')
plt.xlabel('Month')
plt.xticks(rotation=90, fontsize=8)
plt.yticks(fontsize=8)
plt.ylabel('U')
plt.legend(['U (actual)', 'U (predicted)', 'U (dirty)'])
plt.grid()
plt.show()


<hr id="deploy">

<h2>6. Deployment and Communication</h2>

The findings of the project can be deployed in a dashboard.  

They can also be presented as a written report or a presentation.  
The report should contain the following sections:
1. **Title Page**  
   *the title, name, and date*
2. **Outline (table of contents)**  
   *the different sections of the report (with page numbers for a printed report)*
3. **Executive Summary**  
   *a summary/overview of the problem, methodology, findings, and conclusions*
4. **Introduction**  
   *problem statement and background*
5. **Methodology**  
   *description of the different data science project steps (data collection, cleaning, exploration, different models tested, etc.)*
6. **Results**  
   *the findings with visualization charts, etc.*
7. **Discussion**  
   *analysis of the findings*
8. **Conclusion**  
   *drawn conclusions based on the findings*
9. **Appendix**  
   *any supporting data, charts, etc. that were not used in the report but could be useful to review (if any)*


<hr style="margin-top: 4rem;">
<h2>Author</h2>

<a href="https://github.com/SamerHany">Samer Hany</a>
