### TASK

Develop and deploy machine learning models in one of the following areas only, and analyse the results.

##### Public Transport

Project questions could include the following (this is a small suggested sample; other questions may be more
appropriate to your project):

– How to measure similarity or dissimilarity between different clusters?
– Which clustering solution do you prefer, and why?
– How to analyse and investigate an inflation rate for a specific product?

You will present your findings and defend your results in the report (MS Word document). Your report should capture the
following aspects that are relevant to your project investigations.

i) A precise introduction, motivation, description of problem domain, project objectives and the
rationale for the chosen dataset in the above-mentioned areas.


ii) Which clustering algorithms would you consider for segmentation, and why? Explain the differences
between silhouette score and Davies-Bouldin index in the context of clustering. Compare the results
obtained from any two clustering algorithms from the chosen dataset.


iii) What insights can you derive from the initial exploration of the time series data based on the
provided topics? Describe any trends, seasonality, or anomalies observed. How did you determine
the appropriate parameters (p, d, q) for the ARIMA model? Evaluate the performance of the ARIMA
model in forecasting future values, highlighting any strengths and limitations based on your chosen
dataset.


iv) Interpret and justify the results based on the problem specification or project objectives by using
suitable visualizations. Comments on and descriptions of the Python code, together with the project
conclusions, should appear in both the report and the Jupyter notebook. Citations and references
should follow the Harvard style.


#### Note: You can choose a separate dataset for each of tasks (ii) and (iii), or one dataset for both tasks.

### Part I: Data Loading and Cleaning

In [1]:
# Step 1: Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Disable pandas' chained-assignment warning (SettingWithCopyWarning)
pd.options.mode.chained_assignment = None

# Step 2: Loading the Dataset
df = pd.read_csv('MetroPT3(AirCompressor).csv')

# Display the first few rows of the dataframe
df.head()

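| | Unnamed: 0 | timestamp | TP2 | TP3 | H1 | DV_pressure | Reservoirs | Oil_temperature | Motor_current | COMP | DV_eletric | Towers | MPG | LPS | Pressure_switch | Oil_level | Caudal_impulses |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 01/02/2020 00:00 | -0.012 | 9.358 | 9.34 | -0.024 | 9.358 | 53.6 | 0.04 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
| 1 | 10 | 01/02/2020 00:00 | -0.014 | 9.348 | 9.332 | -0.022 | 9.348 | 53.675 | 0.04 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
| 2 | 20 | 01/02/2020 00:00 | -0.012 | 9.338 | 9.322 | -0.022 | 9.338 | 53.6 | 0.0425 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
| 3 | 30 | 01/02/2020 00:00 | -0.012 | 9.328 | 9.312 | -0.022 | 9.328 | 53.425 | 0.04 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
| 4 | 40 | 01/02/2020 00:00 | -0.012 | 9.318 | 9.302 | -0.022 | 9.318 | 53.475 | 0.04 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |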
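
The `Unnamed: 0` column in the output above looks like a running counter left over from an earlier export rather than a sensor reading. The sketch below shows one possible way to drop such columns after loading. It uses a small in-memory sample (three rows and a shortened set of columns copied from the output above) instead of the real file, and the name `df_sample` and the rule "drop every column whose name starts with `Unnamed`" are assumptions made for illustration, not part of the original notebook.

```python
import io
import pandas as pd

# In-memory stand-in for the CSV file: three rows from the output above,
# restricted to a few columns. The empty first header field is what
# pandas turns into a column called "Unnamed: 0".
sample_csv = io.StringIO(
    ",timestamp,TP2,TP3\n"
    "0,01/02/2020 00:00,-0.012,9.358\n"
    "10,01/02/2020 00:00,-0.014,9.348\n"
    "20,01/02/2020 00:00,-0.012,9.338\n"
)

df_sample = pd.read_csv(sample_csv)

# Assumption: columns named "Unnamed: ..." are leftover counters or
# indexes and carry no measurement information.
leftover = [c for c in df_sample.columns if c.startswith("Unnamed")]
df_sample = df_sample.drop(columns=leftover)

print(df_sample.columns.tolist())
```

Whether to drop the counter is a judgment call: if it turns out to encode elapsed time between samples, it may be worth keeping under a clearer name.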


In [2]:
# Checking the total number of rows in the dataset
total_rows = len(df)
print(f"Total number of rows in the dataset: {total_rows}")


Total number of rows in the dataset: 1048575


In [3]:
# Check for duplicates
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

Number of duplicate rows: 0
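
A count of zero here may partly reflect the `Unnamed: 0` counter, which differs on every row and so makes each row unique by construction. The following sketch, on a toy frame with hypothetical values, shows how the count can change when the check is restricted to the measurement columns. The names `toy`, `dups_all` and `dups_values` are illustrative only.

```python
import pandas as pd

# Toy frame (hypothetical values): rows 0 and 1 carry identical readings
# and differ only in the running counter.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 10, 20],
    "timestamp": ["01/02/2020 00:00"] * 3,
    "TP2": [-0.012, -0.012, -0.014],
    "TP3": [9.358, 9.358, 9.348],
})

# Counting over all columns: the counter makes every row unique.
dups_all = int(toy.duplicated().sum())

# Counting over the non-counter columns only.
value_cols = [c for c in toy.columns if not c.startswith("Unnamed")]
dups_values = int(toy.duplicated(subset=value_cols).sum())

print(f"All columns: {dups_all}, measurement columns only: {dups_values}")
```

Repeated readings in a sensor stream are not necessarily errors, so a non-zero count on the measurement columns is something to inspect rather than a reason to delete rows automatically.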


In [4]:
# Step 3: Data Cleaning
# Checking for missing values in the dataset
print(df.isnull().sum())


Unnamed: 0         0
timestamp          0
TP2                0
TP3                0
H1                 0
DV_pressure        0
Reservoirs         0
Oil_temperature    0
Motor_current      0
COMP               0
DV_eletric         0
Towers             0
MPG                0
LPS                0
Pressure_switch    0
Oil_level          0
Caudal_impulses    0
dtype: int64


In [5]:
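# No missing values were reported above, so this median fill is a no-op here;
# it is kept as a safeguard in case the input file changes.
df.fillna(df.median(numeric_only=True), inplace=True)

# Convert 'timestamp' to datetime. The strings look like 'dd/mm/yyyy HH:MM'
# (assumption: 01/02/2020 means 1 February 2020), so parse them day-first;
# the pandas default would read them month-first, as 2 January 2020.
df['timestamp'] = pd.to_datetime(df['timestamp'], dayfirst=True)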

print(df.dtypes)  # To check the data types of each column

Unnamed: 0                  int64
timestamp          datetime64[ns]
TP2                       float64
TP3                       float64
H1                        float64
DV_pressure               float64
Reservoirs                float64
Oil_temperature           float64
Motor_current             float64
COMP                        int64
DV_eletric                  int64
Towers                      int64
MPG                         int64
LPS                         int64
Pressure_switch             int64
Oil_level                   int64
Caudal_impulses             int64
dtype: object
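
The timestamp strings in this file, such as `01/02/2020 00:00`, can be read either day-first or month-first, and a `datetime64[ns]` dtype does not reveal which reading was used. The sketch below parses the same two strings both ways so the difference is visible. It assumes the file stores dates as day/month/year, which the notebook itself does not confirm. The names `raw`, `month_first` and `day_first` are illustrative.

```python
import pandas as pd

# Two timestamp strings in the same shape as the dataset's column.
raw = pd.Series(["01/02/2020 00:00", "01/02/2020 00:00"])

# Default parsing treats the first field as the month.
month_first = pd.to_datetime(raw)

# An explicit format removes the ambiguity.
# Assumption: the file stores dates as day/month/year.
day_first = pd.to_datetime(raw, format="%d/%m/%Y %H:%M")

print("default parse :", month_first.iloc[0])
print("explicit parse:", day_first.iloc[0])
```

A quick sanity check on the real data is to print `df['timestamp'].min()` and `df['timestamp'].max()` and compare them with the recording period documented for the dataset.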
