# STINTSY MCO
The Major Course Output (MCO) for STINTSY (Advanced Intelligent Systems) consists of the following 11 sections:
- **Section 1** : Introduction to the problem/task and dataset
- **Section 2** : Description of the dataset
- **Section 3** : List of requirements
- **Section 4** : Data preprocessing and cleaning
- **Section 5** : Exploratory data analysis
- **Section 6** : Initial model training
- **Section 7** : Error analysis
- **Section 8** : Improving model performance
- **Section 9** : Model performance summary
- **Section 10** : Insights and conclusions
- **Section 11** : References

## Section 1 : Introduction 

Each group should select one real-world dataset from the list of datasets provided for the project. Each dataset is accompanied by a description file, which also contains a detailed description of each feature.

The target task (i.e., classification or regression) should be properly stated as well.

## Section 2 : Description of Dataset
In this section of the notebook, you must fulfill the following:
- State a brief description of the dataset.
- Provide a description of the collection process executed to build the dataset. Discuss the implications of the data collection method on the generated conclusions and insights. Note that you may need to look at relevant sources related to the dataset to acquire necessary information for this part of the project.
- Describe the structure of the dataset file.
    - What does each row and column represent?
    - How many instances are there in the dataset?
    - How many features are there in the dataset?
    - If the dataset is composed of different files that you will combine in the succeeding steps, describe the structure and the contents of each file.
- Discuss the features in each dataset file. What does each feature represent? All features, even those which are not used for the study, should be described to the reader. The purpose of each feature in the dataset should be clear to the reader of the notebook without having to go through an external link.
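The structural questions above (number of instances, number of features, type of each column) can be answered directly from a loaded DataFrame. The following is a minimal sketch on a small hypothetical stand-in table. The values are illustrative only, and in the notebook the same calls would be applied to the DataFrame loaded from the dataset file.

```python
import pandas as pd

# Hypothetical three-row stand-in for the real dataset (values are illustrative only).
sample = pd.DataFrame({
    "W_REGN": [14, 13, 1],
    "FSIZE": [3.0, 4.5, 2.0],
    "TOINC": [325251, 198000, 87500],
})

# Rows are instances, columns are features.
n_instances, n_features = sample.shape
print(f"instances: {n_instances}, features: {n_features}")

# One row per column: its dtype and how many non-null values it holds.
structure = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "non_null": sample.notna().sum(),
})
print(structure)
```

A table like `structure` can sit next to the written feature descriptions so the reader sees the type of each feature without opening the data file.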

## Section 3 : List of Requirements
List all the Python libraries and modules that you used.

In [6]:
import numpy as np
import pandas as pd
import csv

## Section 4 : Data Preprocessing and Cleaning

Perform necessary steps before using the data. In this section of the notebook, please take note of the following:

- If needed, perform preprocessing techniques to transform the data to the appropriate representation. This may include binning, log transformations, conversion to one-hot encoding, normalization, standardization, interpolation, truncation, and feature engineering, among others. There should be a correct and proper justification for the use of each preprocessing technique used in the project.
- Make sure that the data is clean, especially features that are used in the project. This may include checking for misrepresentations, checking the data type, dealing with missing data, dealing with duplicate data, and dealing with outliers, among others. There should be a correct and proper justification for the application (or non-application) of each data cleaning method used in the project. Clean only the variables utilized in the study.
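As a rough illustration of the cleaning checks listed above, the sketch below runs them on a small made-up table: duplicate rows, missing values, and outliers flagged with the 1.5 × IQR rule. The data, the column names, the IQR rule, and the log transform are all assumptions chosen for the example, not requirements of the project.

```python
import numpy as np
import pandas as pd

# Made-up data: one duplicated row, one missing value, one extreme income.
toy = pd.DataFrame({
    "household_id": [1, 2, 2, 3, 4, 5],
    "income": [120.0, 135.0, 135.0, np.nan, 110.0, 9000.0],
})

# Count fully identical rows, then remove them.
n_duplicates = int(toy.duplicated().sum())
toy = toy.drop_duplicates().reset_index(drop=True)

# Count missing values in the feature of interest.
n_missing = int(toy["income"].isna().sum())

# Flag outliers with the 1.5 * IQR rule (quantile() skips NaN values).
q1, q3 = toy["income"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy["income"] < q1 - 1.5 * iqr) | (toy["income"] > q3 + 1.5 * iqr)
outlier_ids = toy.loc[is_outlier, "household_id"].tolist()

# A log transform is one option for compressing a right-skewed feature.
toy["log_income"] = np.log1p(toy["income"])

print(n_duplicates, n_missing, outlier_ids)
```

Whichever checks are used, each one still needs its own written justification, including the decision to keep or drop the flagged rows.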

### Section 4.1 : Pandas Library for Identifying Null Values
This section identifies the null values in the dataset.

In [68]:
# Define the filename
filename = "dataset.csv"

# Define the column names based on the provided features
columns = [
    "W_REGN", "W_OID", "W_SHSN", "W_HCN", "URB", "RSTR", "PSU", "BWEIGHT", "RFACT", "FSIZE",
    "AGRI_SAL", "NONAGRI_SAL", "WAGES", "NETSHARE", "CASH_ABROAD", "CASH_DOMESTIC", "RENTALS_REC", "INTEREST",
    "PENSION", "DIVIDENDS", "OTHER_SOURCE", "NET_RECEIPT", "REGFT", "NET_CFG", "NET_LPR", "NET_FISH", "NET_FOR",
    "NET_RET", "NET_MFG", "NET_COM", "NET_TRANS", "NET_MIN", "NET_CONS", "NET_NEC", "EAINC", "TOINC", "LOSSES",
    "T_BREAD", "T_MEAT", "T_FISH", "T_MILK", "T_OIL", "T_FRUIT", "T_VEG", "T_SUGAR", "T_FOOD_NEC", "T_COFFEE",
    "T_MINERAL", "T_ALCOHOL", "T_TOBACCO", "T_OTHER_VEG", "T_FOOD_HOME", "T_FOOD_OUTSIDE", "T_FOOD", "T_CLOTH",
    "T_FURNISHING", "T_HEALTH", "T_HOUSING_WATER", "T_ACTRENT", "T_RENTVAL", "T_IMPUTED_RENT", "T_BIMPUTED_RENT",
    "T_TRANSPORT", "T_COMMUNICATION", "T_RECREATION", "T_EDUCATION", "T_MISCELLANEOUS", "T_OTHER_EXPENDITURE",
    "T_OTHER_DISBURSEMENT", "T_NFOOD", "T_TOTEX", "T_TOTDIS", "T_OTHREC", "T_TOREC", "FOOD_ACCOM_SRVC", "SEX",
    "AGE", "MS", "HGC", "JOB", "OCCUP", "KB", "CW", "HHTYPE", "MEMBERS", "AGELESS5", "AGE5_17", "EMPLOYED_PAY",
    "EMPLOYED_PROF", "SPOUSE_EMP", "BLDG_TYPE", "ROOF", "WALLS", "TENURE", "HSE_ALTERTN", "TOILET", "ELECTRIC",
    "WATER", "DISTANCE", "RADIO_QTY", "TV_QTY", "CD_QTY", "STEREO_QTY", "REF_QTY", "WASH_QTY", "AIRCON_QTY",
    "CAR_QTY", "LANDLINE_QTY", "CELLPHONE_QTY", "PC_QTY", "OVEN_QTY", "MOTOR_BANCA_QTY", "MOTORCYCLE_QTY",
    "POP_ADJ", "PCINC", "NATPC", "NATDC", "REGDC", "REGPC"
]

# Define float and integer columns (every column not listed as float is treated as an integer)
float_cols = ["BWEIGHT", "RFACT", "FSIZE", "POP_ADJ", "PCINC"]
int_cols = [col for col in columns if col not in float_cols]

# Load the CSV file
df = pd.read_csv(filename, usecols=columns, dtype=str)  # Read everything as strings first

# Strip surrounding whitespace from every string cell (column-wise map instead of the deprecated applymap)
df = df.apply(lambda col: col.map(lambda x: x.strip() if isinstance(x, str) else x))

# Convert data types safely; values that cannot be parsed become missing
for col in int_cols:
    df[col] = pd.to_numeric(df[col], errors='coerce').astype('Int64')  # Nullable integer dtype keeps missing values as <NA>
for col in float_cols:
    df[col] = pd.to_numeric(df[col], errors='coerce').astype(float)  # Missing values become NaN

In [69]:
pd.set_option('display.max_rows', None) 
pd.set_option('display.max_columns', None) 
pd.set_option('display.width', None)  

print(df.isnull().sum())

W_REGN                      0
W_OID                       0
W_SHSN                      0
W_HCN                       0
URB                         0
RSTR                        0
PSU                         0
BWEIGHT                     0
RFACT                       0
FSIZE                       0
AGRI_SAL                    0
NONAGRI_SAL                 0
WAGES                       0
NETSHARE                    0
CASH_ABROAD                 0
CASH_DOMESTIC               0
RENTALS_REC                 0
INTEREST                    0
PENSION                     0
DIVIDENDS                   0
OTHER_SOURCE                0
NET_RECEIPT                 0
REGFT                       0
NET_CFG                     0
NET_LPR                     0
NET_FISH                    0
NET_FOR                     0
NET_RET                     0
NET_MFG                     0
NET_COM                     0
NET_TRANS                   0
NET_MIN                     0
NET_CONS                    0
NET_NEC   
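Because the full per-column listing is long, it can help to display only the columns that actually contain nulls. A minimal sketch on a hypothetical small frame follows. The values are invented, and in the notebook the same expression would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `df`; only some columns contain missing values.
demo = pd.DataFrame({
    "W_REGN": [14, 13, 1, 2],
    "OCCUP": [np.nan, 6111, np.nan, 1314],
    "TV_QTY": [1, np.nan, 2, 1],
})

null_counts = demo.isnull().sum()

# Keep only columns with at least one null, largest count first.
columns_with_nulls = null_counts[null_counts > 0].sort_values(ascending=False)
print(columns_with_nulls)
```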

### Convert null values to 0
The following features contain null values, which are replaced with 0:
- OCCUP
- AGELESS5
- AGE5_17
- EMPLOYED_PAY
- EMPLOYED_PROF
- DISTANCE
- RADIO_QTY
- TV_QTY
- CD_QTY
- STEREO_QTY
- REF_QTY
- WASH_QTY
- AIRCON_QTY
- CAR_QTY
- LANDLINE_QTY
- CELLPHONE_QTY
- PC_QTY
- OVEN_QTY
- MOTOR_BANCA_QTY
- MOTORCYCLE_QTY

In [101]:
columns_to_fill = [
    "DISTANCE", "RADIO_QTY", "TV_QTY", "CD_QTY", "STEREO_QTY",
    "REF_QTY", "WASH_QTY", "AIRCON_QTY", "CAR_QTY", "LANDLINE_QTY",
    "CELLPHONE_QTY", "PC_QTY", "OVEN_QTY", "MOTOR_BANCA_QTY", "MOTORCYCLE_QTY", 
    "OCCUP", "EMPLOYED_PAY", "EMPLOYED_PROF", "AGELESS5", "AGE5_17"
]

# replace NaN with 0 in the selected columns
df[columns_to_fill] = df[columns_to_fill].fillna(0)

# verify changes
print(df.isnull().sum())  

W_REGN                     0
W_OID                      0
W_SHSN                     0
W_HCN                      0
URB                        0
RSTR                       0
PSU                        0
BWEIGHT                    0
RFACT                      0
FSIZE                      0
AGRI_SAL                   0
NONAGRI_SAL                0
WAGES                      0
NETSHARE                   0
CASH_ABROAD                0
CASH_DOMESTIC              0
RENTALS_REC                0
INTEREST                   0
PENSION                    0
DIVIDENDS                  0
OTHER_SOURCE               0
NET_RECEIPT                0
REGFT                      0
NET_CFG                    0
NET_LPR                    0
NET_FISH                   0
NET_FOR                    0
NET_RET                    0
NET_MFG                    0
NET_COM                    0
NET_TRANS                  0
NET_MIN                    0
NET_CONS                   0
NET_NEC                    0
EAINC         

### Dropping Features from the Data Frame
This section drops two columns, **KB** and **CW**, because these features are not needed for prediction.
- **KB** : Kind of business / industry of the head of the family during the past six months
- **CW** : Class of worker of the head of the family during the past six months

In [109]:
df = df.drop(columns=['KB', 'CW'])

In [111]:
print(df.isnull().sum())

W_REGN                  0
W_OID                   0
W_SHSN                  0
W_HCN                   0
URB                     0
RSTR                    0
PSU                     0
BWEIGHT                 0
RFACT                   0
FSIZE                   0
AGRI_SAL                0
NONAGRI_SAL             0
WAGES                   0
NETSHARE                0
CASH_ABROAD             0
CASH_DOMESTIC           0
RENTALS_REC             0
INTEREST                0
PENSION                 0
DIVIDENDS               0
OTHER_SOURCE            0
NET_RECEIPT             0
REGFT                   0
NET_CFG                 0
NET_LPR                 0
NET_FISH                0
NET_FOR                 0
NET_RET                 0
NET_MFG                 0
NET_COM                 0
NET_TRANS               0
NET_MIN                 0
NET_CONS                0
NET_NEC                 0
EAINC                   0
TOINC                   0
LOSSES                  0
T_BREAD                 0
T_MEAT      

### Section 4.2 : NumPy Array
This section converts the pandas DataFrame to a NumPy array.

In [114]:
raw_data = df.to_numpy()

In [116]:
# print the first row of the array to verify (check dataset.csv)
print(raw_data[:1])

[[14 101001000 2 25 2 21100 415052 138.25 200.6576 3.0 0 0 0 0 176000
  16000 0 0 33000 0 0 4385 76666 0 0 0 0 0 0 0 0 0 0 0 0 325251 0 30263
  29374 5204 3533 2136 2129 6517 1149 2472 1890 6356 0 0 0 91023 23330
  114353 11191 3598 586 55128 0 19200 19200 0 17280 1470 49567 41200
  18636 260 0 198916 313269 313269 0 325251 0 2 75 3 280 2 0 2 3 0 1 0 0
  3 1 1 1 1 2 1 1 1 0 1 1 1 1 1 1 0 0 0 2 1 1 0 0 0.94617231 108417.0 9 8
  8 9]]
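The printed row mixes integers and floats, which is what an `object`-dtype array looks like, and many numerical routines handle such arrays poorly. One possible alternative, sketched below on a hypothetical two-column frame, is to request a float array explicitly. Whether a single float dtype suits the later models is an assumption to check against the notebook's needs.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mirroring the notebook's mix of nullable-integer and float columns.
mixed = pd.DataFrame({
    "W_REGN": pd.array([14, 13, None], dtype="Int64"),
    "BWEIGHT": [138.25, 90.5, 77.0],
})

# Ask for float64 explicitly and map missing entries to NaN.
numeric = mixed.to_numpy(dtype="float64", na_value=np.nan)
print(numeric.dtype, numeric.shape)
```

With a homogeneous float array, operations such as distance computations or matrix products work without per-element type checks.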


### Dictionaries
This section of the notebook defines the dictionaries needed for the classification algorithm/s.
- W_REGN = W_REGN_dict
- W_OID = W_OID_dict
- URB_VS1 = URB_VS1_dict
- NATPC_VS1 & NATDC_VS1 & REGDC_VS1 & REGPC_VS1 = RID_dict

In [119]:
W_REGN_dict = {
    13: "Region XIII - NCR",
    14: "Region XIV - CAR",
    1: "Region I - Ilocos Region",
    2: "Region II - Cagayan Valley",
    3: "Region III - Central Luzon",
    41: "Region IVa - Calabarzon",
    42: "Region IVb - Mimaropa",
    5: "Region V - Bicol Region",
    6: "Region VI - Western Visayas",
    7: "Region VII - Central Visayas",
    8: "Region VIII - Eastern Visayas",
    9: "Region IX - Western Mindanao",
    10: "Region X - Northern Mindanao",
    11: "Region XI - Southern Mindanao",
    12: "Region XII - Central Mindanao",
    15: "Region XV - ARMM",
    16: "Region XVI - CARAGA"
}

W_OID_dict = {
    39: "Manila",
    74: "NCR-2nd Dist.",
    75: "NCR-3rd Dist.",
    76: "NCR-4th Dist.",
    1: "Abra",
    27: "Benguet",
    32: "Ifugao",
    44: "Kalinga",
    81: "Mountain Province",
    28: "Apayao",
    29: "Ilocos Norte",
    33: "Ilocos Sur",
    55: "La Union",
    9: "Pangasinan",
    15: "Batanes",
    31: "Cagayan",
    50: "Isabela",
    57: "Nueva Vizcaya",
    8: "Quirino",
    49: "Bataan",
    54: "Bulacan",
    69: "Nueva Ecija",
    77: "Pampanga",
    10: "Tarlac",
    21: "Zambales",
    34: "Aurora",
    56: "Batangas",
    58: "Cavite",
    40: "Laguna",
    51: "Quezon",
    52: "Rizal",
    53: "Marinduque",
    59: "Occidental Mindoro",
    5: "Oriental Mindoro",
    16: "Palawan",
    17: "Romblon",
    20: "Albay",
    41: "Camarines Norte",
    62: "Camarines Sur",
    4: "Catanduanes",
    6: "Masbate",
    19: "Sorsogon",
    30: "Aklan",
    45: "Antique",
    12: "Capiz",
    79: "Iloilo",
    22: "Negros Occidental",
    46: "Guimaras",
    61: "Bohol",
    26: "Negros Oriental",
    37: "Siquijor",
    48: "Eastern Samar",
    60: "Leyte",
    64: "Northern Samar",
    78: "Samar (Western)",
    72: "Southern Leyte",
    73: "Biliran",
    83: "Zamboanga del Norte",
    97: "Zamboanga del Sur",
    13: "Zamboanga Sibugay",
    18: "Isabela City",
    35: "Bukidnon",
    42: "Camiguin",
    43: "Lanao del Norte",
    23: "Misamis Occidental",
    24: "Misamis Oriental",
    25: "Davao",
    82: "Davao del Sur",
    47: "Davao Oriental",
    63: "Compostela Valley",
    65: "Cotabato",
    80: "South Cotabato",
    98: "Sultan Kudarat",
    7: "Sarangani",
    36: "Cotabato City",
    38: "Basilan",
    66: "Lanao del Sur",
    70: "Maguindanao",
    2: "Sulu",
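    # NOTE: keys 67 and 68 each appear twice below. Python keeps only the last value
    # for a repeated key, so "Tawi-tawi" and "Agusan del Norte" are silently overwritten.
    # Verify these codes against the dataset's description file.
    67: "Tawi-tawi",
    68: "Agusan del Norte",
    3: "Agusan del Sur",
    67: "Surigao del Norte",
    68: "Surigao del Sur"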
}

URB_VS1_dict = {
    1 : "Urban",
    2 : "Rural"
}

RID_dict = {
    1: "First Decile",
    2: "Second Decile",
    3: "Third Decile",
    4: "Fourth Decile",
    5: "Fifth Decile",
    6: "Sixth Decile",
    7: "Seventh Decile",
    8: "Eighth Decile",
    9: "Ninth Decile",
    10: "Tenth Decile"
}
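A short sketch of how such lookup dictionaries might be applied: `Series.map` translates each code to its label and yields `NaN` for any code absent from the dictionary, which makes gaps in a dictionary easy to spot. The two-entry dictionary and the sample codes below are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical code-to-label dictionary.
urb_labels = {1: "Urban", 2: "Rural"}

# Sample codes; 3 is deliberately not in the dictionary.
codes = pd.Series([2, 1, 1, 3], name="URB")

# Codes without an entry in the dictionary become NaN.
labels = codes.map(urb_labels)
unmapped = sorted(codes[labels.isna()].unique().tolist())

print(labels.tolist())
print("codes without a label:", unmapped)
```

Running the same check on each coded column is a cheap way to confirm that every code in the data has a label before the dictionaries are used in plots or reports.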

## Section 5 : Exploratory Data Analysis

Perform exploratory data analysis comprehensively to gain a good understanding of your dataset. In this section of the notebook, you must present relevant numerical summaries and visualizations. Make sure that each code is accompanied by a brief explanation. The whole process should be supported with verbose textual descriptions of your procedures and findings.
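For example, numerical summaries could start from `describe()` for overall statistics and a grouped aggregate for comparisons across categories. The sketch below uses invented values and borrows only the column names `URB` and `TOINC` from the dataset for illustration.

```python
import pandas as pd

# Invented values; only the column names are borrowed from the dataset.
eda = pd.DataFrame({
    "URB": [1, 1, 2, 2, 2],
    "TOINC": [300000, 250000, 120000, 90000, 150000],
})

# Overall summary statistics for one numeric feature.
overall = eda["TOINC"].describe()

# Per-group mean and median, labelled with readable category names.
by_area = (
    eda.assign(area=eda["URB"].map({1: "Urban", 2: "Rural"}))
       .groupby("area")["TOINC"]
       .agg(["mean", "median"])
)

print(overall)
print(by_area)
```

Each such table or plot should be followed by a sentence or two stating what it shows and why it matters for the chosen task.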

## Section 6 : Initial Model Training
Use machine learning models to accomplish your chosen task (i.e., classification or regression) for the dataset. In this section of the notebook, please take note of the following:
- The project should train and evaluate <u>at least 3 different kinds</u> of machine learning models. The models should not be multiple variations of the same model, e.g., three neural network models with different numbers of neurons.
- Each model should be appropriate in accomplishing the chosen task for the dataset. There should be a clear and correct justification on the use of each machine learning model.
- Make sure that the values of the hyperparameters of each model are mentioned. At the minimum, the optimizer, the learning rate, and the learning rate schedule should be discussed per model.
- The report should show that the models are neither overfitting nor underfitting.
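One common way to check for overfitting or underfitting is to compare each model's score on the training split with its score on a held-out validation split. The sketch below assumes scikit-learn, which is not among the libraries listed in this notebook, and uses synthetic data. It trains only two model kinds to keep the example short, so it illustrates the comparison rather than satisfying the three-model requirement.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data standing in for the real dataset.
X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Both models are scale-sensitive, so each is wrapped with a scaler.
models = {
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    scores[name] = (train_acc, val_acc)
    print(f"{name}: train={train_acc:.3f} val={val_acc:.3f} gap={train_acc - val_acc:.3f}")
```

A large gap between training and validation accuracy hints at overfitting, while low scores on both splits hint at underfitting.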

### Section 6.1 : K-Nearest Neighbor

### Section 6.2 : Linear Regression

### Section 6.3 : Logistic Regression

## Section 7 : Error Analysis
Perform error analysis on the output of all models used in the project. In this section of the notebook, you should:
- Report and properly interpret the initial performance of all models using appropriate evaluation metrics.
- Identify difficult classes and/or instances. For classification tasks, these are classes and/or instances that are difficult to classify. Hint: You may use a confusion matrix for this. For regression tasks, these are instances that produce high error.
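For a classification task, a confusion matrix and per-class recall give a quick view of which classes are hard. The sketch below builds both with `pd.crosstab` from hypothetical labels and predictions. It assumes every class appears in both the actual and the predicted labels, so the matrix is square and its diagonal holds the correct predictions.

```python
import numpy as np
import pandas as pd

# Hypothetical true and predicted class labels from some classifier.
y_true = pd.Series(["low", "low", "mid", "mid", "mid", "high", "high", "high"], name="actual")
y_pred = pd.Series(["low", "mid", "mid", "mid", "high", "high", "high", "mid"], name="predicted")

# Rows are actual classes, columns are predicted classes.
confusion = pd.crosstab(y_true, y_pred)

# Per-class recall: correct predictions divided by the number of actual instances.
correct = pd.Series(np.diag(confusion.to_numpy()), index=confusion.index)
recall = correct / confusion.sum(axis=1)
hardest_class = recall.idxmin()

print(confusion)
print(recall.round(2))
print("hardest class:", hardest_class)
```

For a regression task, the analogous step is to sort instances by absolute error and inspect the largest ones.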

### Section 7.1 : Error Analysis for K-Nearest Neighbor

### Section 7.2 : Error Analysis for Linear Regression

### Section 7.3 : Error Analysis for Logistic Regression

## Section 8 : Improving Model Performance
Perform grid search or random search to tune the hyperparameters of each model. You should also tune each model to reduce the error in difficult classes and/or instances. In this section of the notebook, please take note of the following:
- Make sure to elaborately explain the method of hyperparameter tuning.
- Explicitly mention the different hyperparameters and their range of values. Show the corresponding performance of each configuration.
- Report the performance of all models using appropriate evaluation metrics and visualizations.
- Properly interpret the result based on relevant evaluation metrics.
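The points above can be sketched with an exhaustive grid search that records the score of every configuration. This example assumes scikit-learn's `GridSearchCV` and synthetic data. The k-nearest-neighbor grid shown (`n_neighbors`, `weights`) is an illustrative choice, not a recommended range.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the real training set.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=0)

# Scaling lives inside the pipeline so it is refit on each cross-validation fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Hyperparameters and the values tried for each (3 x 2 = 6 configurations).
param_grid = {
    "knn__n_neighbors": [3, 5, 11],
    "knn__weights": ["uniform", "distance"],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

# One row per configuration, best mean cross-validated accuracy first.
results = pd.DataFrame(search.cv_results_)[
    ["param_knn__n_neighbors", "param_knn__weights", "mean_test_score", "std_test_score"]
].sort_values("mean_test_score", ascending=False)

print(results)
print("best:", search.best_params_)
```

The `results` table covers the "performance of each configuration" requirement, and `search.best_params_` identifies the configuration to carry into the summary.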

### Section 8.1 : Improving K-Nearest Neighbor

### Section 8.2 : Improving Linear Regression

### Section 8.3 : Improving Logistic Regression

## Section 9 : Model Performance Summary
Present a summary of all model configurations. In this section of the notebook, do the following:
- Discuss each algorithm and the best set of values for its hyperparameters. Identify the best model configuration and discuss its advantage over other configurations.
- Discuss how tuning each model helped in reducing its error in difficult classes and/or instances.

## Section 10 : Insights and Conclusions
Clearly state your insights and conclusions from training a model on the data. Why did some models produce better results? Summarize your conclusions to explain the performance of the models. Discuss recommendations to improve the performance of the model.

## Section 11 : References
Cite relevant references that you used in your project. All references must be cited, including:
- Scholarly Articles – Cite in APA format and put a description of how you used it for your work.
- Online references, blogs, articles that helped you come up with your project – Put the website, blog, or article title, link, and how you incorporated it into your work.
- Artificial Intelligence (AI) Tools – Put the model used (e.g., ChatGPT, Gemini), the complete transcript of your conversations with the model (including your prompts and its responses), and a description of how you used it for your work.
