# **Great Expectations** 

---

1. Expectation 1: expect column values to be unique
2. Expectation 2: expect column values to be between
3. Expectation 3: expect column values to be in set
4. Expectation 4: expect column values to be in type list
5. Expectation 5: expect column distinct values to be in set
6. Expectation 6: expect column median to be between
7. Expectation 7: expect column unique value count to be between

---

# **A. Import Libraries**

In [2]:
# Install great expectations
!pip install -q great-expectations

In [3]:
import pandas as pd
from great_expectations.data_context import FileDataContext

# **B. Data Loading**

In [4]:
# Load dataset CSV
df = pd.read_csv('genre_therapy_impact_clean.csv')

# Create an ID column
df['id'] = range(1, len(df) + 1)

# save dataframe
df.to_csv('genre_therapy_impact_clean.csv', index=False)

In [5]:
df.head(3)

Unnamed: 0,timestamp,age,primary_streaming_service,hours_per_day,while_working,instrumentalist,composer,fav_genre,exploratory,foreign_languages,...,frequency_rap,frequency_rock,frequency_video_game_music,anxiety,depression,insomnia,ocd,music_effects,permissions,id
0,2022-08-27 21:28:18,18,Spotify,4.0,No,No,No,Video game music,No,Yes,...,Rarely,Rarely,Very frequently,7.0,7.0,10.0,2.0,No effect,I understand.,1
1,2022-08-27 21:40:40,61,YouTube Music,2.5,Yes,No,Yes,Jazz,Yes,Yes,...,Never,Never,Never,9.0,7.0,3.0,3.0,Improve,I understand.,2
2,2022-08-27 21:54:47,18,Spotify,4.0,Yes,No,No,R&B,Yes,No,...,Very frequently,Never,Rarely,7.0,2.0,5.0,9.0,Improve,I understand.,3


The `id` column was added so that each respondent has a unique identifier, on the assumption that every row in this dataset is a different respondent.

# **C. Instantiate Data Context**

In [6]:
# Create a data context
context = FileDataContext.create(project_root_dir='./')

The `project_root_dir='./'` argument creates the Data Context in the current working directory. To store the Data Context elsewhere, pass that location as `project_root_dir`.

# **D. Connect to A Datasource**

In [7]:
# Give a name to a Datasource. This name must be unique among Datasources.
datasource_name = 'csv-data-music-and-mental-health'
datasource = context.sources.add_pandas(datasource_name)

# Give a name to a data asset
asset_name = 'music-and-mental-health'
path_to_data = 'C:/Users/Muhammad Hafidz Adit.DESKTOP-6IPGJGG/Documents/Hacktiv8/P2/Milestone3/p2-ftds012-hck-m3-Muhammad-Hafidz-Adityaswara/P2M3_muhammad_hafidz_data_clean.csv'
asset = datasource.add_csv_asset(asset_name, filepath_or_buffer=path_to_data)

# Build batch request
batch_request = asset.build_batch_request()

# **E. Create an Expectation Suite**

In [8]:
# Create an expectation suite
expectation_suite_name = 'expectation-music-and-mental-health'
context.add_or_update_expectation_suite(expectation_suite_name)

# Create a validator using above expectation suite
validator = context.get_validator(
    batch_request = batch_request,
    expectation_suite_name = expectation_suite_name
)

# Check the validator
validator.head()

Unnamed: 0,timestamp,age,primary_streaming_service,hours_per_day,while_working,instrumentalist,composer,fav_genre,exploratory,foreign_languages,...,frequency_rap,frequency_rock,frequency_video_game_music,anxiety,depression,insomnia,ocd,music_effects,permissions,id
0,2022-08-27 21:28:18,18,Spotify,4.0,No,No,No,Video game music,No,Yes,...,Rarely,Rarely,Very frequently,7.0,7.0,10.0,2.0,No effect,I understand.,1
1,2022-08-27 21:40:40,61,YouTube Music,2.5,Yes,No,Yes,Jazz,Yes,Yes,...,Never,Never,Never,9.0,7.0,3.0,3.0,Improve,I understand.,2
2,2022-08-27 21:54:47,18,Spotify,4.0,Yes,No,No,R&B,Yes,No,...,Very frequently,Never,Rarely,7.0,2.0,5.0,9.0,Improve,I understand.,3
3,2022-08-27 21:56:50,18,Spotify,5.0,Yes,Yes,Yes,Jazz,Yes,Yes,...,Very frequently,Very frequently,Never,8.0,8.0,7.0,7.0,Improve,I understand.,4
4,2022-08-27 22:00:29,18,YouTube Music,3.0,Yes,Yes,No,Video game music,Yes,Yes,...,Never,Never,Sometimes,4.0,8.0,6.0,0.0,Improve,I understand.,5


# **Expectation 1: expect column values to be unique**

In [9]:
validator.expect_column_values_to_be_unique('id')

{
  "success": true,
  "result": {
    "element_count": 616,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

- The `id` column is used because each value identifies one respondent.
- The result of this expectation is `"success": true`: `unexpected_count` is 0 across 616 rows, so every value in the `id` column is unique.
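
To make the uniqueness check concrete, here is a minimal plain-Python sketch of the same idea. It is not how Great Expectations implements the expectation; `find_duplicates` and both sample lists are hypothetical names and data introduced only for illustration.

```python
from collections import Counter


def find_duplicates(values):
    """Return the values that occur more than once, in first-seen order."""
    counts = Counter(values)
    return [value for value, count in counts.items() if count > 1]


# Hypothetical ID columns: the first is built like df['id'], the second has repeats
sequential_ids = list(range(1, 6))
ids_with_repeats = [1, 2, 2, 3, 3]

print(find_duplicates(sequential_ids))    # empty list: every ID is unique
print(find_duplicates(ids_with_repeats))  # lists the repeated IDs
```

Because `id` is generated with `range(1, len(df) + 1)`, it is unique by construction, so this expectation mainly guards against the column being altered later.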

# **Expectation 2: expect column values to be between**

In [10]:
validator.expect_column_values_to_be_between(column='hours_per_day', min_value=0, max_value=24)

{
  "success": true,
  "result": {
    "element_count": 616,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

- The `hours_per_day` column is checked with `min_value=0` and `max_value=24`, since a day has at most 24 hours.
- The result of this expectation is `"success": true`. This means that every value in the `hours_per_day` column lies between 0 and 24 (both bounds inclusive by default). It does not mean that the column's minimum is 0 or that its maximum is 24.
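
The range check can be sketched in plain Python as follows. This is an illustration under assumptions, not the library's code: `values_outside_range` is a hypothetical helper, the sample values are made up, and missing values (`None`) are skipped on the assumption that they are reported separately, as the `missing_count` field in the result suggests.

```python
def values_outside_range(values, min_value, max_value):
    """Return non-missing values that fall outside [min_value, max_value]."""
    return [
        v for v in values
        if v is not None and not (min_value <= v <= max_value)
    ]


# Hypothetical listening hours; the boundary values 0.0 and 24.0 are allowed
valid_hours = [4.0, 2.5, 0.0, 24.0]
suspect_hours = [4.0, 25.0, -1.0, None]

print(values_outside_range(valid_hours, 0, 24))    # nothing out of range
print(values_outside_range(suspect_hours, 0, 24))  # the two impossible values
```

An empty result corresponds to `unexpected_count: 0`; it says nothing about which values inside the range actually occur.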

# **Expectation 3: expect column values to be in set**

In [11]:
validator.expect_column_values_to_be_in_set(column='music_effects', value_set=[ 'No effect', 'Improve', 'Worsen'])

{
  "success": true,
  "result": {
    "element_count": 616,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

- The `music_effects` column is checked on the assumption that its only valid values are `No effect`, `Improve`, and `Worsen`.
- The result of this expectation is `"success": true`. This means that every value in the `music_effects` column is one of these three.
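
As a rough illustration of this row-level membership check, the sketch below flags every value that is not in the allowed set and computes the share of such rows. The helper names and the sample answers are hypothetical, and the code is a plain-Python approximation rather than the library's implementation.

```python
ALLOWED_EFFECTS = {'No effect', 'Improve', 'Worsen'}


def unexpected_values(values, allowed):
    """Return every value that is not a member of the allowed set."""
    return [v for v in values if v not in allowed]


def unexpected_percent(values, allowed):
    """Return the percentage of values that are not in the allowed set."""
    if not values:
        return 0.0
    return 100.0 * len(unexpected_values(values, allowed)) / len(values)


# Hypothetical survey answers; 'Unknown' is not an allowed category
clean_answers = ['Improve', 'No effect', 'Worsen', 'Improve']
dirty_answers = ['Improve', 'Unknown', 'Worsen', 'Improve']

print(unexpected_values(clean_answers, ALLOWED_EFFECTS))
print(unexpected_values(dirty_answers, ALLOWED_EFFECTS))
print(unexpected_percent(dirty_answers, ALLOWED_EFFECTS))
```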

# **Expectation 4: expect column values to be in type list**

In [12]:
validator.expect_column_values_to_be_in_type_list('anxiety', ['integer', 'float'])

{
  "success": true,
  "result": {
    "observed_value": "float64"
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

- The `anxiety` column is checked on the assumption that its data type is either integer or float.
- The result of this expectation is `"success": true`. The observed type is `float64`, so the column is a float column, not a mix of integer and float values.

# **Expectation 5: expect column distinct values to be in set**

In [13]:
validator.expect_column_distinct_values_to_be_in_set("music_effects", ['No effect', 'Improve', 'Worsen'])

{
  "success": true,
  "result": {
    "observed_value": [
      "Improve",
      "No effect",
      "Worsen"
    ],
    "details": {
      "value_counts": [
        {
          "value": "Improve",
          "count": 465
        },
        {
          "value": "No effect",
          "count": 136
        },
        {
          "value": "Worsen",
          "count": 15
        }
      ]
    }
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

- The `music_effects` column is used again, this time to check its set of distinct values; the result also reports how often each value occurs.
- The result of this expectation is `"success": true`. This means that every distinct value in the `music_effects` column is in the expected set. The value counts are 465 for `Improve`, 136 for `No effect`, and 15 for `Worsen`.
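
The numbers in this result can be cross-checked with a few lines of plain Python. The counts below are copied from the output above; the variable names are introduced here for illustration, and the subset test is only a sketch of what the expectation verifies, not the library's implementation.

```python
# Value counts copied from the expectation result above
value_counts = {'Improve': 465, 'No effect': 136, 'Worsen': 15}
expected_set = {'No effect', 'Improve', 'Worsen'}

# The expectation passes when the observed distinct values are a subset of the expected set
observed_values = set(value_counts)
is_subset = observed_values <= expected_set

# The counts should add up to the number of rows reported by the other expectations
total_rows = sum(value_counts.values())

# Share of respondents per answer, in percent, rounded to one decimal place
shares = {
    value: round(100 * count / total_rows, 1)
    for value, count in value_counts.items()
}

print(is_subset, total_rows)
print(shares)
```

The total matches the `element_count` of 616 reported by the row-level expectations, which confirms that no row has a missing `music_effects` value.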

# **Expectation 6: expect column median to be between**

In [14]:
validator.expect_column_median_to_be_between('age', 10, 89)

{
  "success": true,
  "result": {
    "observed_value": 21.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

- The `age` column is checked with a minimum of 10 and a maximum of 89, on the assumption that the median respondent age falls within that range.
- The result of this expectation is `"success": true`: the observed median age is 21.0, which lies between 10 and 89.
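
A median check constrains only the middle of the distribution, not individual rows. The sketch below illustrates this with the standard library; `median_is_between` is a hypothetical helper, the first list holds the five ages visible in the preview above, and the age of 150 is an invented outlier.

```python
import statistics


def median_is_between(values, min_value, max_value):
    """Return (passed, observed_median) for an inclusive range check."""
    observed = statistics.median(values)
    return min_value <= observed <= max_value, observed


# Ages of the five respondents shown in the preview above
head_ages = [18, 61, 18, 18, 18]

# Same ages plus one invented, implausible value
ages_with_outlier = head_ages + [150]

print(median_is_between(head_ages, 10, 89))
print(median_is_between(ages_with_outlier, 10, 89))  # still passes despite the outlier
```

To validate every individual age, a row-level check such as `expect_column_values_to_be_between` (used in Expectation 2) would be the better fit.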

# **Expectation 7: expect column unique value count to be between**

In [15]:
validator.expect_column_unique_value_count_to_be_between('primary_streaming_service', 0, 7)

{
  "success": true,
  "result": {
    "observed_value": 6
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

- The `primary_streaming_service` column is checked on the assumption that it contains between 0 and 7 distinct values.
- The result of this expectation is `"success": true`: the column contains 6 distinct values, which is within that range.
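
What this expectation counts is the number of distinct values, not the values themselves. A minimal plain-Python sketch, assuming missing values are left out of the count (an assumption of this example, not a statement about the library): `unique_value_count` is a hypothetical helper, and the sample list holds the five services visible in the preview above.

```python
def unique_value_count(values):
    """Count distinct non-missing values."""
    return len({v for v in values if v is not None})


# Services of the five respondents shown in the preview above
head_services = ['Spotify', 'YouTube Music', 'Spotify', 'Spotify', 'YouTube Music']

count = unique_value_count(head_services)
passes = 0 <= count <= 7

print(count, passes)
```

On the full dataset the observed count is 6, so the upper bound of 7 leaves room for one more service before the expectation fails.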