# INTRODUCTION

Name  : Ma'ruf Habibie Siregar  

This program automates loading, transforming, and pushing data from PostgreSQL to Elasticsearch.  
The dataset used contains NBA player statistics through the 2021/2022 season.

# GREAT EXPECTATIONS

## 1. Data Loading

In [38]:
# Import libraries
import pandas as pd
import great_expectations as ge
from great_expectations.data_context import FileDataContext
from great_expectations.core.batch import RuntimeBatchRequest
from great_expectations.checkpoint import SimpleCheckpoint

In [39]:
# Read the cleaned CSV file
df = pd.read_csv("NBAPlayer_Preference_data_clean.csv")

# Convert the pandas DataFrame to a GE dataset
ge_df = ge.from_pandas(df)

In [40]:
# Preview the table contents
pd.set_option('display.max_columns', None)
df.head()

Unnamed: 0,rk,player,pos,age,tm,g,gs,mp,fg,fga,fg%,3p,3pa,3p%,2p,2pa,2p%,efg%,ft,fta,ft%,orb,drb,trb,ast,stl,blk,tov,pf,pts,year,year_start,year_end,main_pose,ID
0,1,Mahmoud Abdul-Rauf,PG,28,SAC,31,0,17.1,3.3,8.8,0.377,0.2,1.0,0.161,3.2,7.8,0.405,0.386,0.5,0.5,1.0,0.2,1.0,1.2,1.9,0.5,0.0,0.6,1.0,7.3,1997-1998,1997,1998,PG,Mahmoud Abdul-Rauf_SAC_1997-1998
1,2,Tariq Abdul-Wahad,SG,23,SAC,59,16,16.3,2.4,6.1,0.403,0.1,0.3,0.211,2.4,5.7,0.414,0.409,1.4,2.1,0.672,0.7,1.2,2.0,0.9,0.6,0.2,1.1,1.4,6.4,1997-1998,1997,1998,SG,Tariq Abdul-Wahad_SAC_1997-1998
2,3,Shareef Abdur-Rahim,SF,21,VAN,82,82,36.0,8.0,16.4,0.485,0.3,0.6,0.412,7.7,15.8,0.488,0.493,6.1,7.8,0.784,2.8,4.3,7.1,2.6,1.1,0.9,3.1,2.5,22.3,1997-1998,1997,1998,SF,Shareef Abdur-Rahim_VAN_1997-1998
3,4,Cory Alexander,PG,24,TOT,60,22,21.6,2.9,6.7,0.428,1.1,2.9,0.375,1.8,3.7,0.469,0.51,1.3,1.7,0.784,0.3,2.2,2.4,3.5,1.2,0.2,1.9,1.6,8.1,1997-1998,1997,1998,PG,Cory Alexander_TOT_1997-1998
4,4,Cory Alexander,PG,24,SAS,37,3,13.5,1.6,3.9,0.414,0.5,1.7,0.313,1.1,2.2,0.494,0.483,0.7,1.0,0.676,0.2,1.1,1.3,1.9,0.7,0.1,1.3,1.4,4.5,1997-1998,1997,1998,PG,Cory Alexander_SAS_1997-1998


## 2. Instantiate Data Context

In [41]:
# Create a data context
context = FileDataContext.create(project_root_dir='./')

## 3. Connect to DataSource

In [42]:
# Give a name to a Datasource. This name must be unique between Datasources.
datasource_name = 'nba_player_csv'
datasource = context.sources.add_pandas(datasource_name)

# Give a name to a data asset
asset_name = 'nbaplayer'
path_to_data = 'NBAPlayer_Preference_data_clean.csv'
asset = datasource.add_csv_asset(asset_name, filepath_or_buffer=path_to_data)

# Build batch request
batch_request = asset.build_batch_request()

## 4. Create an Expectation Suite

In [43]:
# Create an expectation suite
expectation_suite_name = 'expectation_nba_player_pref'
context.add_or_update_expectation_suite(expectation_suite_name)

# Create a validator using above expectation suite
validator = context.get_validator(
    batch_request = batch_request,
    expectation_suite_name = expectation_suite_name
)
# Check the validator
validator.head()

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

Unnamed: 0,rk,player,pos,age,tm,g,gs,mp,fg,fga,fg%,3p,3pa,3p%,2p,2pa,2p%,efg%,ft,fta,ft%,orb,drb,trb,ast,stl,blk,tov,pf,pts,year,year_start,year_end,main_pose,ID
0,1,Mahmoud Abdul-Rauf,PG,28,SAC,31,0,17.1,3.3,8.8,0.377,0.2,1.0,0.161,3.2,7.8,0.405,0.386,0.5,0.5,1.0,0.2,1.0,1.2,1.9,0.5,0.0,0.6,1.0,7.3,1997-1998,1997,1998,PG,Mahmoud Abdul-Rauf_SAC_1997-1998
1,2,Tariq Abdul-Wahad,SG,23,SAC,59,16,16.3,2.4,6.1,0.403,0.1,0.3,0.211,2.4,5.7,0.414,0.409,1.4,2.1,0.672,0.7,1.2,2.0,0.9,0.6,0.2,1.1,1.4,6.4,1997-1998,1997,1998,SG,Tariq Abdul-Wahad_SAC_1997-1998
2,3,Shareef Abdur-Rahim,SF,21,VAN,82,82,36.0,8.0,16.4,0.485,0.3,0.6,0.412,7.7,15.8,0.488,0.493,6.1,7.8,0.784,2.8,4.3,7.1,2.6,1.1,0.9,3.1,2.5,22.3,1997-1998,1997,1998,SF,Shareef Abdur-Rahim_VAN_1997-1998
3,4,Cory Alexander,PG,24,TOT,60,22,21.6,2.9,6.7,0.428,1.1,2.9,0.375,1.8,3.7,0.469,0.51,1.3,1.7,0.784,0.3,2.2,2.4,3.5,1.2,0.2,1.9,1.6,8.1,1997-1998,1997,1998,PG,Cory Alexander_TOT_1997-1998
4,4,Cory Alexander,PG,24,SAS,37,3,13.5,1.6,3.9,0.414,0.5,1.7,0.313,1.1,2.2,0.494,0.483,0.7,1.0,0.676,0.2,1.1,1.3,1.9,0.7,0.1,1.3,1.4,4.5,1997-1998,1997,1998,PG,Cory Alexander_SAS_1997-1998


### Great Expectation 1
#### The combination of player, team, and season must be unique (to be unique)

The NBA player dataset has no unique ID, so I build one myself. The ID combines the player's name, the team he played for, and the season.
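
As a rough illustration of how such a composite key could be assembled and checked, the plain-Python sketch below joins player, team, and season with underscores, matching the format visible in the `ID` column of the preview (for example `Cory Alexander_TOT_1997-1998`). The helper names `make_id` and `find_duplicates` are hypothetical, and the rows are copied from the preview; the actual cleaning step may well have been done differently in pandas.

```python
from collections import Counter

# Three rows copied from the preview above: (player, tm, year).
rows = [
    ("Mahmoud Abdul-Rauf", "SAC", "1997-1998"),
    ("Cory Alexander", "TOT", "1997-1998"),
    ("Cory Alexander", "SAS", "1997-1998"),
]

def make_id(player, team, season):
    """Join the three fields with underscores (hypothetical helper)."""
    return f"{player}_{team}_{season}"

def find_duplicates(ids):
    """Return the IDs that occur more than once."""
    return [value for value, count in Counter(ids).items() if count > 1]

ids = [make_id(*row) for row in rows]
print(find_duplicates(ids))  # []

# Without the team, a player traded mid-season collides with himself.
ids_without_team = [f"{player}_{season}" for player, _, season in rows]
print(find_duplicates(ids_without_team))  # ['Cory Alexander_1997-1998']
```

The second check suggests why the team is part of the key: a player who changes teams within one season appears in several rows for that season.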

In [44]:
# GX 1: the ID column must be unique
validator.expect_column_values_to_be_unique("ID")

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 14573,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

    The result above shows that the generated ID meets the expectation **(SUCCESS = TRUE)**

### Great Expectation 2
#### Player age must be between 18 and 45 (to be between min_value and max_value)

The NBA has no upper age limit for players, but I still restrict age to the range 18 to 45 [source](https://www.nba.com/news/oldest-players-to-play-in-an-nba-game)
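
The check itself is a row-level range test with inclusive bounds. A minimal plain-Python sketch of that logic, reporting fields named after those in the GX result, might look like the following. `check_between` is a hypothetical helper and not the GX implementation; the first four ages come from the preview, while 17 and 52 are invented here to show a failing case.

```python
def check_between(values, min_value, max_value):
    """Count values outside [min_value, max_value], bounds inclusive."""
    unexpected = [v for v in values if not (min_value <= v <= max_value)]
    percent = 100.0 * len(unexpected) / len(values) if values else 0.0
    return {
        "success": not unexpected,
        "element_count": len(values),
        "unexpected_count": len(unexpected),
        "unexpected_percent": percent,
        "partial_unexpected_list": unexpected[:20],
    }

# 28, 23, 21, 24 are from the preview; 17 and 52 are made up for illustration.
ages = [28, 23, 21, 24, 17, 52]
result = check_between(ages, min_value=18, max_value=45)
print(result["success"], result["unexpected_count"])  # False 2
```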

In [45]:
# Age must be between 18 and 45
validator.expect_column_values_to_be_between("age", min_value=18, max_value=45)

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 14573,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

    The result above shows that the data meets the expectation **(SUCCESS = TRUE)**

### Great Expectation 3
#### Player position must be a valid basketball position (to be in set)

Basketball has five playing positions: [source](https://en.wikipedia.org/wiki/Basketball_positions)
1. PG (Point Guard)
2. PF (Power Forward)
3. C (Center)
4. SF (Small Forward)
5. SG (Shooting Guard)

In [46]:
# List of valid player positions
main_pose = ["PG", "SG", "SF", "PF", "C"]
validator.expect_column_values_to_be_in_set("main_pose", main_pose)

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 14573,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

    The result above shows that the data meets the expectation **(SUCCESS = TRUE)**

### Great Expectation 4
#### Column 'tm' (team) must not have missing values (to not be null)

Because the original objective is to find new roster candidates, columns such as team name and player name must not have missing values.

In [47]:
# The tm (team) column must not contain missing values
validator.expect_column_values_to_not_be_null("tm")

Calculating Metrics:   0%|          | 0/6 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 14573,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

    The result above shows that the data meets the expectation **(SUCCESS = TRUE)**

### Great Expectation 5
#### Column 'g' (games played) must have an integer data type (to be of type)

Games played must be an integer, because a player cannot play half a game, even as a substitute. Any appearance counts as one game played for that match.

In [48]:
# The games played column must be of integer type
validator.expect_column_values_to_be_of_type("g", "int64")

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "observed_value": "int64"
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

    The result above shows that the data meets the expectation **(SUCCESS = TRUE)**

### Great Expectation 6
#### Validate the distribution of a statistic, here minutes played 'mp' (median_to_be_between)

This ensures that the player performance data in mp (minutes played) falls within a plausible range, so that trend analysis and player trading decisions are not skewed by anomalous data. Starters usually fall in the range of 10 to 40 minutes [source](https://www.nba.com/stats/players/traditional?dir=D&sort=MIN)
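
One reason to check the median rather than the mean is that a few extreme values barely move it. The sketch below uses the five `mp` values from the preview plus one deliberately corrupted entry (480.0, invented here) to show the difference. It relies only on the standard `statistics` module and is meant as an illustration, not as the way GX computes its metric.

```python
from statistics import mean, median

mp_preview = [17.1, 16.3, 36.0, 21.6, 13.5]  # values from the preview above
mp_corrupted = mp_preview + [480.0]          # hypothetical bad record

median_clean = median(mp_preview)
median_bad = median(mp_corrupted)
mean_bad = mean(mp_corrupted)

# The median stays inside the 10 to 40 range despite the bad record.
print(10 <= median_bad <= 40)  # True
# The mean is pulled far outside the range by the single outlier.
print(10 <= mean_bad <= 40)    # False
```

A median check therefore tolerates a handful of odd rows; catching those individually would need a row-level range check on `mp` as well.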

In [49]:
# Validate the distribution of the 'mp' column via its median
# The median of minutes played must be between 10 and 40

validator.expect_column_median_to_be_between(
    column="mp",
    min_value=10,
    max_value=40
)

Calculating Metrics:   0%|          | 0/4 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "observed_value": 18.9
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

    The result above shows that the data meets the expectation **(SUCCESS = TRUE)**

### Great Expectation 7
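#### Ensure the year/season order is correct, so that per-season player trends, which decide whether a player becomes a trading target, can be tracked (to_be_increasing)

I need to make sure the seasons appear in chronological order with none skipped. Note that this expectation only verifies that `year_start` never decreases from one row to the next; it does not by itself detect a missing season.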
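
Checking order and checking for gaps are two separate questions. The following sketch, with hypothetical helper names and a small invented list of start years, shows both: `is_non_decreasing` mirrors what an increasing-order check looks at, while `missing_years` looks for years absent from the covered range.

```python
def is_non_decreasing(values):
    """True if every value is >= the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))

def missing_years(values):
    """Return the years between min and max that never appear."""
    present = set(values)
    return [y for y in range(min(present), max(present) + 1) if y not in present]

# Invented sample: ordered correctly, but the 1999 season is absent.
year_start = [1997, 1997, 1998, 2000, 2000, 2001]
print(is_non_decreasing(year_start))  # True
print(missing_years(year_start))      # [1999]
```

The sample passes the ordering test even though a season is missing, which is why a separate gap check can be worth adding.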

In [50]:
# Validate the chronological order of year/season
validator.expect_column_values_to_be_increasing('year_start')

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 14573,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

    The result above shows that the data meets the expectation **(SUCCESS = TRUE)**

## 5. Checkpoint

In [51]:
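# Save the expectations added through the validator, so the checkpoint
# evaluates them instead of the empty suite stored earlier
validator.save_expectation_suite(discard_failed_expectations=False)

# Define the GX checkpoint
checkpoint = SimpleCheckpoint(
    name="checkpoint_nba_player_pref",
    data_context=context,
    validations=[
        {
            "batch_request": batch_request,
            "expectation_suite_name": "expectation_nba_player_pref"
        }
    ],
)
# Run the checkpoint
checkpoint_result = checkpoint.run()
print(f"Success? {checkpoint_result['success']}")

Calculating Metrics: 0it [00:00, ?it/s]

Success? True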
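
To get more than a single boolean out of a run, the individual expectation results can be tallied. The sketch below works on plain dictionaries shaped like the JSON outputs printed in the sections above (a `success` flag plus a `result` object). `summarize` is a hypothetical helper, the failing entry in the sample is invented to show a non-trivial tally, and how such dictionaries are extracted from `checkpoint_result` is deliberately left aside.

```python
def summarize(results):
    """Tally passed and failed expectation results."""
    failed = [r for r in results if not r["success"]]
    return {
        "evaluated": len(results),
        "passed": len(results) - len(failed),
        "failed": len(failed),
        # Aggregate results (e.g. a median) carry no unexpected_count.
        "unexpected_rows": sum(
            r["result"].get("unexpected_count", 0) for r in failed
        ),
    }

sample = [
    {"success": True, "result": {"element_count": 14573, "unexpected_count": 0}},
    # Invented failing result, for illustration only.
    {"success": False, "result": {"element_count": 14573, "unexpected_count": 12}},
    {"success": True, "result": {"observed_value": 18.9}},
]
summary = summarize(sample)
print(summary)
```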
