# 1 Data Exploration
    a. Explore the dataset by displaying the first few rows, summary statistics, and data types of each column.
    b. Identify missing values, outliers, and unique values in categorical columns.

## 1.1 Smoking dataset
### 1.1.a

In [9]:
import pandas as pd
df = pd.read_csv('data/assigment_1/smoking/smoking_driking_dataset_Ver01.csv')

print("First 5 rows of the dataset:")
print(df.head())

print("\nSummary statistics for numerical columns:")
print(df.describe())

print("\nData types of each column:")
print(df.dtypes)

First 5 rows of the dataset:
    sex  age  height  weight  waistline  sight_left  sight_right  hear_left  \
0  Male   35     170      75       90.0         1.0          1.0        1.0   
1  Male   30     180      80       89.0         0.9          1.2        1.0   
2  Male   40     165      75       91.0         1.2          1.5        1.0   
3  Male   50     175      80       91.0         1.5          1.2        1.0   
4  Male   50     165      60       80.0         1.0          1.2        1.0   

   hear_right    SBP  ...  LDL_chole  triglyceride  hemoglobin  urine_protein  \
0         1.0  120.0  ...      126.0          92.0        17.1            1.0   
1         1.0  130.0  ...      148.0         121.0        15.8            1.0   
2         1.0  120.0  ...       74.0         104.0        15.8            1.0   
3         1.0  145.0  ...      104.0         106.0        17.6            1.0   
4         1.0  138.0  ...      117.0         104.0        13.8            1.0   

   serum_

### 1.1.b

In [16]:
print("\nMissing values per column:")
print(df.isnull().sum())
# df.info()  # alternative: prints the non-null count and dtype of each column

print("\nDetect outliers(not using IQR or sth else):")
range_rules = {
    "age": (0, 120),
    "height": (100, 250),
    "weight": (30, 250),
    "sight_left": (0, 2),
    "sight_right": (0, 2),
    "SBP": (70, 250),
    "DBP": (40, 150),
    "BLDS": (40, 400),
    "tot_chole": (70, 400),
    "HDL_chole": (10, 150),
    "LDL_chole": (30, 300),
    "triglyceride": (20, 1000),
    "hemoglobin": (5, 20),
    "serum_creatinine": (0.2, 20),
    "SGOT_AST": (0, 500),
    "SGOT_ALT": (0, 500),
    "gamma_GTP": (0, 1000)
}

valid_values = {
    "sex": {"Male", "Female"},
    "hear_left": {1, 2},
    "hear_right": {1, 2},
    "urine_protein": {1, 2, 3, 4, 5, 6},
    "SMK_stat_type_cd": {1, 2, 3},
    "DRK_YN": {"Y", "N"}
}

print("Logical outliers per column:")

# valid value ranges
for col, (low, high) in range_rules.items():
    if col in df.columns:
        mask = (df[col] < low) | (df[col] > high)
        print(f"{col}: {mask.sum()} outliers")

# discrete valid values
for col, valid in valid_values.items():
    if col in df.columns:
        mask = ~df[col].isin(valid)
        print(f"{col}: {mask.sum()} invalid values")

print("\nUnique values in categorical columns:")
categorical_cols = df.select_dtypes(include=['object', 'category']).columns
for col in categorical_cols:
    print(f"{col}: {df[col].nunique()} unique values")
    print(df[col].unique())



Missing values per column:
sex                 0
age                 0
height              0
weight              0
waistline           0
sight_left          0
sight_right         0
hear_left           0
hear_right          0
SBP                 0
DBP                 0
BLDS                0
tot_chole           0
HDL_chole           0
LDL_chole           0
triglyceride        0
hemoglobin          0
urine_protein       0
serum_creatinine    0
SGOT_AST            0
SGOT_ALT            0
gamma_GTP           0
SMK_stat_type_cd    0
DRK_YN              0
dtype: int64

Detect outliers(not using IQR or sth else):
Logical outliers per column:
age: 0 outliers
height: 0 outliers
weight: 9 outliers
sight_left: 3130 outliers
sight_right: 3132 outliers
SBP: 5 outliers
DBP: 26 outliers
BLDS: 237 outliers
tot_chole: 269 outliers
HDL_chole: 166 outliers
LDL_chole: 3080 outliers
triglyceride: 1426 outliers
hemoglobin: 89 outliers
serum_creatinine: 458 outliers
SGOT_AST: 136 outliers
SGOT_ALT: 156 outliers

# 2 Data Cleaning
    a. Handling Missing Values
    b. Choose appropriate methods to handle missing values (e.g., mean/median imputation for numerical data, mode imputation for categorical data, or deletion of rows/columns).
    c. Justify your choices for handling missing data.
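
The smoking dataset above reports no missing values, so the following is only a minimal sketch of the imputation options listed in this task. It assumes a small made-up frame (`toy`, whose values are illustrative and not taken from the dataset) and a hypothetical helper `impute_simple` that applies median imputation to numeric columns and mode imputation to everything else.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration; only the column names come from the dataset.
toy = pd.DataFrame({
    "age": [35, np.nan, 40, 50],
    "weight": [75.0, 80.0, np.nan, 60.0],
    "sex": ["Male", None, "Male", "Female"],
})

def impute_simple(frame: pd.DataFrame) -> pd.DataFrame:
    """Median-impute numeric columns and mode-impute all other columns."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # The median is less sensitive to extreme values than the mean.
            out[col] = out[col].fillna(out[col].median())
        else:
            # mode() can return several values; take the first one.
            # Assumes the column has at least one non-missing value.
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

imputed = impute_simple(toy)
print(imputed)
```

Median imputation is a reasonable default for skewed clinical measurements, while dropping rows is usually only attractive when very few rows are affected.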

# 3 Handling Outliers
    a. Detect outliers using methods such as the IQR method or Z-score.
    b. Decide whether to remove, cap, or transform the outliers. Justify your decisions.
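
As a rough sketch of the IQR method and of capping, the example below flags values outside `[Q1 - 1.5*IQR, Q3 + 1.5*IQR]` and then clips them to those bounds. The series `s` holds made-up values, and the helper name `iqr_bounds` and the multiplier `k = 1.5` are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up weights; 300 is a planted extreme value.
s = pd.Series([60, 65, 70, 75, 80, 85, 300], name="weight")

def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences of the IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

low, high = iqr_bounds(s)

# Detection: boolean mask of values outside the fences.
is_outlier = (s < low) | (s > high)

# Capping (winsorizing): pull outliers back to the nearest fence
# instead of dropping the rows.
capped = s.clip(lower=low, upper=high)

print(f"fences: [{low}, {high}], outliers: {int(is_outlier.sum())}")
print(capped.tolist())
```

Capping keeps every row, which can matter when outliers are real but rare measurements; removal is more defensible when the values are clearly recording errors.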

# 4 Data Transformation
    a. Encoding Categorical Data
        i. Apply label encoding or one-hot encoding to transform categorical data into numerical form.
        ii. Justify your choice of encoding method.
    b. Feature Scaling
        i. Apply feature scaling techniques such as normalization (Min-Max scaling) or standardization (Z-score normalization) to the dataset.
        ii. Explain why feature scaling is necessary and how it impacts the model.
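
The encoding and scaling steps above might look roughly like the sketch below. It uses plain pandas on a made-up frame (`toy`, illustrative values only); the choice of an explicit 0/1 mapping for `DRK_YN` and one-hot encoding for `sex` is an assumption for the example, not a prescribed answer.

```python
import pandas as pd

# Made-up rows; only the column names come from the smoking dataset.
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female"],
    "DRK_YN": ["Y", "N", "N", "Y"],
    "age": [30, 40, 50, 60],
    "height": [160, 170, 180, 170],
})

# Label encoding for a binary column via an explicit mapping.
toy["DRK_YN"] = toy["DRK_YN"].map({"N": 0, "Y": 1})

# One-hot encoding for a nominal column (no implied order between categories).
encoded = pd.get_dummies(toy, columns=["sex"], dtype=int)

numeric_cols = ["age", "height"]
num = encoded[numeric_cols]

# Min-Max scaling: rescales each column to the [0, 1] range.
minmax = (num - num.min()) / (num.max() - num.min())

# Standardization: zero mean and unit variance (population std, ddof=0).
standardized = (num - num.mean()) / num.std(ddof=0)

print(encoded.columns.tolist())
print(minmax)
print(standardized)
```

In a real pipeline the scaling statistics (min, max, mean, std) would normally be computed on the training split only and then reused on the test split, so that no information leaks from the test data.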


# 5 Data Splitting
    a. Split the preprocessed dataset into training and testing sets. Typically, an 80-20 or 70-30 split is used.
    b. Explain the importance of splitting the data and how evaluating on a held-out test set helps detect overfitting.
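
A minimal sketch of an 80-20 split using only pandas is shown below. The frame `toy` and the helper `split_frame` are hypothetical; scikit-learn's `train_test_split` is the more common route and also supports stratification.

```python
import pandas as pd

# Made-up data: ten rows with a feature x and a binary label y.
toy = pd.DataFrame({"x": range(10), "y": [0, 1] * 5})

def split_frame(frame: pd.DataFrame, test_size: float = 0.2, seed: int = 42):
    """Randomly split a frame into (train, test); assumes a unique index."""
    # Sample the test rows first, then keep everything else for training.
    test = frame.sample(frac=test_size, random_state=seed)
    train = frame.drop(test.index)
    return train, test

train, test = split_frame(toy)
print(len(train), len(test))
```

Fixing `seed` makes the split reproducible, which helps when comparing preprocessing choices across runs.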

# 6 Bonus
Apply dimensionality reduction techniques such as Principal Component Analysis (PCA) and discuss how they affect the dataset.
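
To illustrate what PCA does, here is a small sketch that implements it with NumPy's SVD on synthetic data. The data, the helper name `pca`, and the choice of two components are all assumptions for the example. Because PCA is sensitive to feature scale, real features would usually be standardized first.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: three features, where the third is almost exactly
# the sum of the first two, so two components should capture nearly everything.
base = rng.normal(size=(200, 2))
third = base[:, 0] + base[:, 1] + 0.01 * rng.normal(size=200)
X = np.column_stack([base[:, 0], base[:, 1], third])

def pca(X: np.ndarray, n_components: int):
    """Project X onto its top principal components via SVD."""
    Xc = X - X.mean(axis=0)                  # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)          # explained variance ratio
    Z = Xc @ Vt[:n_components].T             # scores in the reduced space
    return Z, explained[:n_components]

Z, ratio = pca(X, n_components=2)
print(Z.shape)
print(ratio, ratio.sum())
```

The explained variance ratio indicates how much information the retained components preserve; the trade-off is that the new axes are linear combinations of the original columns and are harder to interpret.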