#### B. Purpose: DATA QUALITY ASSESSMENT AND DATA PREPROCESSING

In [1]:
%run 00_project_setup.ipynb
%run 01_data_import.ipynb 



In this stage, we perform an initial assessment of the dataset’s quality to ensure its integrity and suitability for subsequent analysis. This involves systematically checking for missing values across all features, identifying outliers that may distort statistical analyses, detecting inconsistent or erroneous entries, and uncovering duplicate records that could artificially inflate the dataset. The goal is to gain a comprehensive understanding of the dataset’s condition, highlight potential issues, and evaluate its reliability. Importantly, to prevent any risk of data leakage or inadvertent bias, no transformations, imputations, or feature engineering will be applied at this stage. Instead, all extensive data cleaning and preprocessing will be deferred to the model development phase, where they can be executed in a controlled manner to preserve the validity and robustness of predictive modeling outcomes.
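The read-only nature of this assessment can be sketched as follows. This is a minimal illustration, not the project's code: the helper name `quality_report` and the toy DataFrame are assumptions made for the example. It only counts missing values and duplicates and records data types; it never modifies the input.

```python
import numpy as np
import pandas as pd


def quality_report(df: pd.DataFrame) -> dict:
    """Summarise data quality without modifying df (hypothetical helper)."""
    return {
        "n_rows": len(df),
        "missing_per_column": df.isnull().sum().to_dict(),
        "n_duplicate_rows": int(df.duplicated().sum()),
        "dtypes": df.dtypes.astype(str).to_dict(),
    }


# Toy frame, for illustration only; not the project's data.
toy = pd.DataFrame(
    {
        "Admission grade": [120.5, np.nan, 133.1, 120.5],
        "Debtor": [0, 1, 0, 0],
    }
)
report = quality_report(toy)
print(report)
```

Because nothing is imputed or dropped here, the same frame can later be split and cleaned inside the modelling workflow.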

In [2]:
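# Imports used by this class (harmless if 00_project_setup.ipynb already provides them).
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder


class Preprocessing: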
    """
    Data quality checks and preprocessing helpers. Methods:
    - checkforduplicatesrows: counts duplicated observations.
    - checkformissingvalues: counts missing values per column.
    - dealingwithduplicatedrows: removes duplicated observations.
    - replace_zeros_with_nan: replaces 0 with NaN in the given columns.
    - one_hot_encode: drops the target variable and one-hot encodes the given nominal variables in x.
    - label_encode: label-encodes the given ordinal/binary variables in x.
    """
    
    def __init__(self, extract_data):
        self.extract_data = extract_data
    
    def checkforduplicatesrows(self):
        return self.extract_data.duplicated().sum()
    
    def checkformissingvalues(self):
        return self.extract_data.isnull().sum()
    
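    def dealingwithduplicatedrows(self):
        # drop_duplicates(inplace=True) returns None, so drop in place and return the frame explicitly.
        self.extract_data.drop_duplicates(inplace=True)
        return self.extract_data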
    
    def replace_zeros_with_nan(self, columns):
        
        for col in columns:
            if col in self.extract_data.columns:
                self.extract_data[col] = self.extract_data[col].replace(0, np.nan)
            else:
                print(f"Warning: Column '{col}' not found in the DataFrame.")
        
        return self.extract_data

    def one_hot_encode(self, columns):

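        # Drop the target column if it is present; errors='ignore' avoids a KeyError when it is absent.
        x = self.extract_data.drop(columns='HeartDisease', errors='ignore')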
        
        encoder = OneHotEncoder(sparse_output=False, drop='first')  # Drop first to avoid multicollinearity
        
        encoded = encoder.fit_transform(x[columns])
        encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(columns))
        x = x.drop(columns, axis=1).reset_index(drop=True)
        
        return pd.concat([x, encoded_df], axis=1)

    def label_encode(self, x_one_hot_encode, columns):

        encoder = LabelEncoder()
        for col in columns:
            x_one_hot_encode[col] = encoder.fit_transform(x_one_hot_encode[col])
            
        return x_one_hot_encode


###### Object of class Preprocessing:

In [3]:
## Work on a copy so the original data is not modified.
X_copy = x.copy()

In [4]:
preprocess_object = Preprocessing(X_copy)

###### B.1 Check for missing values :

In [5]:
preprocess_object.checkformissingvalues()

Marital Status                                    0
Application mode                                  0
Application order                                 0
Course                                            0
Daytime/evening attendance                        0
Previous qualification                            0
Previous qualification (grade)                    0
Nacionality                                       0
Mother's qualification                            0
Father's qualification                            0
Mother's occupation                               0
Father's occupation                               0
Admission grade                                   0
Displaced                                         0
Educational special needs                         0
Debtor                                            0
Tuition fees up to date                           0
Gender                                            0
Scholarship holder                                0
Age at enrol

Just as seen in the exploratory data analysis, there are no missing values in our data set.

###### B.2 Check for duplicated rows :

In [6]:
preprocess_object.checkforduplicatesrows()

0

As observed, the dataset contains no duplicate records, so no rows need to be removed and duplication will not artificially inflate the data used for model training.

###### B.3 Investigating incorrect data in each column

In [7]:
x.columns

Index(['Marital Status', 'Application mode', 'Application order', 'Course',
       'Daytime/evening attendance', 'Previous qualification',
       'Previous qualification (grade)', 'Nacionality',
       "Mother's qualification", "Father's qualification",
       "Mother's occupation", "Father's occupation", 'Admission grade',
       'Displaced', 'Educational special needs', 'Debtor',
       'Tuition fees up to date', 'Gender', 'Scholarship holder',
       'Age at enrollment', 'International',
       'Curricular units 1st sem (credited)',
       'Curricular units 1st sem (enrolled)',
       'Curricular units 1st sem (evaluations)',
       'Curricular units 1st sem (approved)',
       'Curricular units 1st sem (grade)',
       'Curricular units 1st sem (without evaluations)',
       'Curricular units 2nd sem (credited)',
       'Curricular units 2nd sem (enrolled)',
       'Curricular units 2nd sem (evaluations)',
       'Curricular units 2nd sem (approved)',
       'Curricular units 2nd s

In [8]:
check_inconsistent_values(X_copy)

{}

Using the check_inconsistent_values function defined in the 01_data_import.ipynb notebook, I found no inconsistencies: the function returned an empty dictionary, so every feature has the expected data type and its values are clean. No correction of incorrect data is therefore needed in this data set.
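The real check_inconsistent_values lives in 01_data_import.ipynb and is not shown in this section. Purely as an illustration of what a check returning a dictionary of problem columns might look like, here is a hypothetical stand-in. The name `check_inconsistent_values_sketch`, the two rules it applies (mixed Python types, negative numbers), and the toy frames are all assumptions, not the project's implementation.

```python
import pandas as pd


def check_inconsistent_values_sketch(df: pd.DataFrame) -> dict:
    """Return {column: description} for columns that look inconsistent.

    Hypothetical stand-in; the project's real function is not shown here.
    """
    issues = {}
    for col in df.columns:
        series = df[col]
        if series.dtype == object:
            # Mixed Python types in one column usually signal bad entries.
            kinds = {type(v).__name__ for v in series.dropna()}
            if len(kinds) > 1:
                issues[col] = f"mixed types: {sorted(kinds)}"
        elif pd.api.types.is_numeric_dtype(series) and (series < 0).any():
            # Assumption: negative values are invalid for these features.
            issues[col] = "negative values"
    return issues


# Toy frames, for illustration only.
clean = pd.DataFrame({"Age at enrollment": [19, 20], "Debtor": [0, 1]})
dirty = pd.DataFrame({"Age at enrollment": [19, -3], "Course": [171, "n/a"]})
clean_issues = check_inconsistent_values_sketch(clean)
dirty_issues = check_inconsistent_values_sketch(dirty)
print(clean_issues)
print(dirty_issues)
```

Under these assumed rules, an empty dictionary means that no column triggered any check, which matches how the empty result above is read.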

###### B.4 Identify outliers for each feature

In [9]:
outliers_dict = {}
for column in X_copy.columns:
    outliers_dict[column] = find_outliers_iqr(X_copy[column])

In [10]:
features_with_outliers = [feature for feature, has_outliers in outliers_dict.items() if has_outliers.any()]
print("Features with Outliers:", features_with_outliers)

Features with Outliers: ['Marital Status', 'Application order', 'Course', 'Daytime/evening attendance', 'Previous qualification', 'Previous qualification (grade)', 'Nacionality', "Mother's occupation", "Father's occupation", 'Admission grade', 'Educational special needs', 'Debtor', 'Tuition fees up to date', 'Scholarship holder', 'Age at enrollment', 'International', 'Curricular units 1st sem (credited)', 'Curricular units 1st sem (enrolled)', 'Curricular units 1st sem (evaluations)', 'Curricular units 1st sem (approved)', 'Curricular units 1st sem (grade)', 'Curricular units 1st sem (without evaluations)', 'Curricular units 2nd sem (credited)', 'Curricular units 2nd sem (enrolled)', 'Curricular units 2nd sem (evaluations)', 'Curricular units 2nd sem (approved)', 'Curricular units 2nd sem (grade)', 'Curricular units 2nd sem (without evaluations)']


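Just as seen in the EDA, some of the features have outliers. Note that several of the flagged features (for example Marital Status, Course and Debtor) appear to be integer-coded categorical or binary variables, so an IQR flag there likely reflects a rare category rather than an anomalous measurement.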
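find_outliers_iqr is defined outside this section, so its exact behaviour is not shown here. A common form of the IQR rule, which is an assumption about what it does, flags values outside [Q1 - 1.5·IQR, Q3 + 1.5·IQR]. The sketch below uses a hypothetical name, `find_outliers_iqr_sketch`, and toy series; it returns a boolean mask so that `.any()` can be called on the result, as in the cell above.

```python
import pandas as pd


def find_outliers_iqr_sketch(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask: True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Toy series, for illustration only.
grades = pd.Series([10, 11, 12, 12, 13, 14, 40])
binary_flag = pd.Series([0] * 9 + [1])  # imbalanced 0/1 feature

grade_mask = find_outliers_iqr_sketch(grades)
flag_mask = find_outliers_iqr_sketch(binary_flag)
print(grade_mask.sum(), flag_mask.sum())
```

In the imbalanced binary series both quartiles are 0, so the IQR is 0 and the single 1 is flagged. This is one reason an IQR flag on a binary or coded feature should be interpreted with care.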

### Conclusion on Preprocessing

During the preprocessing phase, the following was observed:
- No missing values were observed.
- No duplicated observations were found in the data set.
- Outliers were observed in some features.
- All features were consistent and in the right format.
- Outliers and missing values were not treated at this stage, to avoid data leakage and contamination; any treatment will be done within cross-validation when building the model.
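One way to defer cleaning to cross-validation is to wrap the preprocessing steps and the model in a single scikit-learn Pipeline, so that each step is fitted only on the training folds. The sketch below is an assumption about how this could look, not the project's modelling code: the synthetic data, the median imputer, the scaler and the logistic regression are all placeholders.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan  # inject some missing values

# Every step is re-fitted on the training part of each fold only,
# so statistics such as the median never see the held-out fold.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipeline, X_demo, y_demo, cv=5)
print(scores.mean())
```

Fitting the imputer or scaler on the full data before splitting would let information from the validation folds influence the training data, which is the leakage this design avoids.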