# Business Monitoring Tickets Analysis
This notebook briefly analyzes the dataset found under `./dataset`.

## Data Gathering
Load the local CSV file into a pandas `DataFrame`.

In [15]:
import pandas as pd

df_tickets = pd.read_csv("dataset/DATASET.csv")
print(
    f"The tickets data set has {len(df_tickets)} rows with {df_tickets.shape[1]} variables."
)
df_tickets.head()

The tickets data set has 88753 rows with 5 variables.


Unnamed: 0,TICKET_ID,VALUE_STATUS,VALUE_PREVIOUS_VALUE,Updater_id,CREATED_AT
0,7074438,open,solved,393185000000.0,2024-04-19T13:52:06Z
1,7073481,solved,hold,385161000000.0,2024-04-19T13:52:08Z
2,7074447,open,solved,13623800000000.0,2024-04-19T13:52:09Z
3,7074582,solved,open,390858000000.0,2024-04-19T13:52:10Z
4,7074630,new,,12895600000000.0,2024-04-19T13:52:20Z


In [16]:
df_tickets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88753 entries, 0 to 88752
Data columns (total 5 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   TICKET_ID             88753 non-null  int64  
 1   VALUE_STATUS          88753 non-null  object 
 2   VALUE_PREVIOUS_VALUE  72343 non-null  object 
 3   Updater_id            88753 non-null  float64
 4   CREATED_AT            88753 non-null  object 
dtypes: float64(1), int64(1), object(3)
memory usage: 3.4+ MB


## Data Assessment and Cleaning
In this part, we assess the data set and clean it as issues come up.

### `NaN`
Let's first check whether the data set contains any null (`NaN`) values.

In [17]:
print(f"Number of Null in data set = {df_tickets.isnull().sum().sum()}")

Number of Null in data set = 16410


All 16410 missing values are in the `VALUE_PREVIOUS_VALUE` variable: it is the only column whose non-null count (72343) is below the row count (88753).
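The per-column breakdown behind this observation can be sketched as follows. This is a minimal illustration on a toy frame, not the notebook's actual data: `toy` and its values are hypothetical, and only the column names are taken from the data set.

```python
import pandas as pd

# Toy stand-in for df_tickets (hypothetical values, real column names).
toy = pd.DataFrame(
    {
        "TICKET_ID": [1, 2, 3],
        "VALUE_STATUS": ["open", "new", "new"],
        "VALUE_PREVIOUS_VALUE": ["solved", None, None],
    }
)

# Count missing values per column instead of over the whole frame.
null_counts = toy.isnull().sum()

# Keep only the columns that actually contain missing values.
print(null_counts[null_counts > 0])
```

Running the same two lines on `df_tickets` should list `VALUE_PREVIOUS_VALUE` as the only column with missing values, in line with the `info()` output above.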

In [18]:
# Show every row that has at least one missing value
df_tickets[df_tickets.isnull().any(axis=1)]

Unnamed: 0,TICKET_ID,VALUE_STATUS,VALUE_PREVIOUS_VALUE,Updater_id,CREATED_AT
4,7074630,new,,1.289560e+13,2024-04-19T13:52:20Z
10,7074633,new,,1.362490e+13,2024-04-19T13:53:07Z
13,7074636,new,,1.298090e+13,2024-04-19T13:53:24Z
15,7074639,new,,1.362490e+13,2024-04-19T13:53:45Z
28,7074642,new,,8.234020e+12,2024-04-19T13:55:19Z
...,...,...,...,...,...
88743,7113750,new,,1.378470e+13,2024-04-30T20:56:42Z
88744,7113759,new,,1.378460e+13,2024-04-30T20:59:04Z
88746,7113762,new,,1.006990e+13,2024-04-30T20:59:19Z
88747,7113765,new,,1.378470e+13,2024-04-30T20:59:36Z


These `NaN` values appear to be legitimate: `VALUE_PREVIOUS_VALUE` is not defined for tickets with `VALUE_STATUS == 'new'`, since a newly created ticket has no previous status.
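The displayed rows only suggest this, so it may be worth checking over the whole frame. The sketch below shows one way to do that on a toy frame; `toy` and its values are hypothetical, and the same masks can be built from `df_tickets`.

```python
import pandas as pd

# Toy stand-in for df_tickets (hypothetical values, real column names).
toy = pd.DataFrame(
    {
        "VALUE_STATUS": ["open", "new", "solved", "new"],
        "VALUE_PREVIOUS_VALUE": ["solved", None, "open", None],
    }
)

missing_previous = toy["VALUE_PREVIOUS_VALUE"].isna()
is_new = toy["VALUE_STATUS"] == "new"

# Every row with a missing previous value has status 'new'.
nan_only_on_new = bool(is_new[missing_previous].all())

# Conversely, every 'new' row lacks a previous value.
every_new_is_nan = bool(missing_previous[is_new].all())

print(nan_only_on_new, every_new_is_nan)
```

If the first check fails on the real data, some `NaN` values have another cause and would need a closer look before being treated as expected.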

### Data types
The `Updater_id` should be cast to `int` instead of `float`.

The `CREATED_AT` should be cast to `datetime`.

In [19]:
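# Use an explicit 64-bit type: the IDs exceed the 32-bit range, and plain `int` can map to int32 on some platforms.
df_tickets["Updater_id"] = df_tickets["Updater_id"].astype("int64")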
df_tickets["CREATED_AT"] = pd.to_datetime(df_tickets["CREATED_AT"])
df_tickets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88753 entries, 0 to 88752
Data columns (total 5 columns):
 #   Column                Non-Null Count  Dtype              
---  ------                --------------  -----              
 0   TICKET_ID             88753 non-null  int64              
 1   VALUE_STATUS          88753 non-null  object             
 2   VALUE_PREVIOUS_VALUE  72343 non-null  object             
 3   Updater_id            88753 non-null  int64              
 4   CREATED_AT            88753 non-null  datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](1), int64(2), object(2)
memory usage: 3.4+ MB


In [20]:
df_tickets.head()

Unnamed: 0,TICKET_ID,VALUE_STATUS,VALUE_PREVIOUS_VALUE,Updater_id,CREATED_AT
0,7074438,open,solved,393185000000,2024-04-19 13:52:06+00:00
1,7073481,solved,hold,385161000000,2024-04-19 13:52:08+00:00
2,7074447,open,solved,13623800000000,2024-04-19 13:52:09+00:00
3,7074582,solved,open,390858000000,2024-04-19 13:52:10+00:00
4,7074630,new,,12895600000000,2024-04-19 13:52:20+00:00
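A float-to-integer cast silently truncates any fractional part, so a guarded cast can be useful for ID columns. The helper below is a sketch under that assumption; `float_ids_to_int64` is a hypothetical name, not part of pandas or of this notebook.

```python
import numpy as np
import pandas as pd


def float_ids_to_int64(ids: pd.Series) -> pd.Series:
    """Cast a float ID column to int64, refusing lossy conversions."""
    if ids.isna().any():
        raise ValueError("column contains NaN; cannot cast to int64")
    if not (ids == np.floor(ids)).all():
        raise ValueError("column contains non-integer values")
    return ids.astype("int64")


# Two IDs taken from the head() output above.
toy_ids = pd.Series([393185000000.0, 13623800000000.0])
converted = float_ids_to_int64(toy_ids)
print(converted.dtype)
```

On the notebook's data the call would be `float_ids_to_int64(df_tickets["Updater_id"])`, which raises instead of quietly altering an ID.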


## Export to SQL DB

In [21]:
from sqlalchemy import create_engine
from sqlalchemy_utils import database_exists, create_database

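# Connect to a local PostgreSQL server.
engine = create_engine(
    "postgresql://localhost:5432/business-monitoring-tickets"
)
# Create the target database if it does not exist yet.
if not database_exists(engine.url):
    create_database(engine.url)

# Write the cleaned frame to the "tickets" table, replacing any existing table.
df_tickets.to_sql(name="tickets", index=False, con=engine, if_exists="replace")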

753
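After an export like this, a quick read-back can confirm that every row reached the table. The sketch below uses an in-memory SQLite database as a stand-in so that it runs without a PostgreSQL server; that substitution, the toy frame, and its values are assumptions for illustration only.

```python
import sqlite3

import pandas as pd

# Toy stand-in for df_tickets (hypothetical subset of columns).
toy = pd.DataFrame(
    {
        "TICKET_ID": [7074438, 7073481, 7074447],
        "VALUE_STATUS": ["open", "solved", "open"],
    }
)

# In-memory SQLite replaces the PostgreSQL engine for this sketch.
con = sqlite3.connect(":memory:")
toy.to_sql(name="tickets", con=con, index=False, if_exists="replace")

# Count the rows in the table and compare with the frame.
row_count = pd.read_sql("SELECT COUNT(*) AS n FROM tickets", con)["n"].iloc[0]
con.close()

print(row_count == len(toy))
```

Against the real database, the same query can be run with `con=engine` and compared with `len(df_tickets)`.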