# Project file: Data Cleaner App

## Overview: 

Data Cleaner App is a Python application that cleans datasets by removing duplicates, handling missing values, and producing cleaned output. It was tested on several different datasets and ran smoothly and accurately on each of them.

The application can handle datasets with thousands of rows and clean them without errors. It keeps a backup of duplicate rows, replaces missing numeric values with the column's mean, and drops rows with missing non-numeric values. This makes it a useful tool for preparing and processing data.

### Main Goals:

1. Load and clean datasets in various formats (CSV and Excel).

2. Identify and remove duplicate records, while keeping a backup of these duplicates.

3. Handle missing values:
    - For numeric columns: replace missing values with the column's mean.
    - For non-numeric columns: remove rows containing missing values.
    
4. Save the cleaned dataset and provide access to both the cleaned data and duplicate records.
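
The goals above can be sketched on a small in-memory DataFrame. This is a minimal illustration, not the app's actual function: the name `clean_frame` and the sample data are hypothetical, and file loading, saving, and the delay messages are left out.

```python
import pandas as pd


def clean_frame(data):
    """Return (cleaned rows, duplicate rows) for an in-memory DataFrame."""
    # Rows that repeat an earlier row are set aside as the backup.
    duplicate_records = data[data.duplicated()]
    df = data.drop_duplicates().copy()

    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            # Numeric column: fill gaps with the column's mean.
            df[col] = df[col].fillna(df[col].mean())
        else:
            # Non-numeric column: drop rows where the value is missing.
            df = df.dropna(subset=[col])
    return df, duplicate_records


# Hypothetical sample: one repeated row, one missing city, one missing score.
sample = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Rome", None, "Lima"],
    "score": [1.0, 1.0, None, 4.0, 7.0],
})
cleaned, dupes = clean_frame(sample)
print(cleaned)
print(dupes)
```

Because columns are processed in order, the row with the missing city is dropped before the mean of `score` is computed, so the order of the columns can influence the filled value.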

### Tools I Used:

Mainly **Python**:

- **Pandas**: loading and saving data; handles 'csv' and 'xlsx' files.

- **Openpyxl**: the default engine that pandas uses in the background to read Excel files; imported explicitly just in case, to avoid errors.

- **Time**: pausing execution at every processing step so the user can see that the app is working.

- **Random**: generating random delay lengths.

- **OS**: checking whether the file path exists.

## Coding Process Explaining:

#### Importing Libraries:

```python
# importing libraries
import pandas as pd
import time
import openpyxl
import os
import random
```

#### Creating a function that takes two arguments from the user: the path to the file and the name used for the generated cleaned file:

![image.png](attachment:image.png)

#### Creating the 'sec' variable, which holds a random delay length from the '(1, 4)' range generated with the 'randint' function.

#### The sleep function pauses execution for the number of seconds stored in 'sec'.

```python
    sec = random.randint(1, 4)
    # print delay message
    print(f"Please wait for {sec} seconds! Checking file path")
    # counting down length of sec
    time.sleep(sec)
```
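
The same three lines recur before every processing step, so they could be wrapped in a small helper. The sketch below is one possible way to do that; the name `pause` and its `sleep` parameter are hypothetical and not part of the app.

```python
import random
import time


def pause(message, low=1, high=4, sleep=time.sleep):
    """Print a wait message, sleep for a random number of seconds, and return it."""
    sec = random.randint(low, high)  # randint includes both end points
    print(f"Please wait for {sec} seconds! {message}")
    sleep(sec)
    return sec


# Passing a no-op sleep keeps this demonstration instant.
waited = pause("Checking file path", sleep=lambda seconds: None)
```

Making the sleep function a parameter is a design choice for testing only: in normal use the default `time.sleep` applies and the helper behaves like the inline snippet.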

#### Writing a condition that checks whether the file exists. If it does not, the program reports an incorrect path and stops. If the path exists, a second condition checks the file type. This part of the code handles all file verification.

```python
    # checking if the path exists
    if not os.path.exists(data_path):
        print("Incorrect path! Try again with correct path")
        return
        
    else:
        # checking the file type
        if data_path.endswith('.csv'):
            print('Dataset is csv!')
            data = pd.read_csv(data_path, encoding_errors='ignore')
                
        elif data_path.endswith('.xlsx'):
            print('Dataset is excel file!')
            data = pd.read_excel(data_path)

        else:
            print("Unknown file type")
            return
```
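
One thing to note is that `endswith('.csv')` is case-sensitive, so a file named `DATA.CSV` would be reported as an unknown type. The sketch below shows one possible variant that lower-cases the extension first; the helper name `load_dataset` and the temporary sample file are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd


def load_dataset(data_path):
    """Return a DataFrame for a .csv or .xlsx path, or None if it cannot be loaded."""
    if not os.path.exists(data_path):
        print("Incorrect path! Try again with correct path")
        return None

    # splitext returns (root, extension); lower() lets ".CSV" match as well.
    extension = os.path.splitext(data_path)[1].lower()
    if extension == ".csv":
        return pd.read_csv(data_path, encoding_errors="ignore")
    if extension == ".xlsx":
        return pd.read_excel(data_path)

    print("Unknown file type")
    return None


# Demonstration with a throwaway CSV file in a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    csv_path = os.path.join(folder, "SAMPLE.CSV")
    pd.DataFrame({"a": [1, 2], "b": ["x", "y"]}).to_csv(csv_path, index=False)
    loaded = load_dataset(csv_path)
    missing = load_dataset(os.path.join(folder, "nope.csv"))
```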

#### Checking shape of data:

```python
    # print delay message
    sec = random.randint(1, 4)
    print(f"Please wait for {sec} seconds! Checking total columns and rows")
    time.sleep(sec)
            
    # display number of records
    print(f"Dataset contains total rows: {data.shape[0]}\nTotal columns: {data.shape[1]}")

```

#### Checking duplicates:

```python
    # start cleaning

    # print delay message
    sec = random.randint(1, 4)
    print(f"Please wait for {sec} seconds! Checking total duplicates")
    time.sleep(sec)


    # checking duplicates
    duplicates = data.duplicated()
    total_duplicates = duplicates.sum()

    print(f"Dataset has total duplicate records: {total_duplicates}")
```
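
To make the result of `duplicated()` concrete, here is a small example on made-up data. With the default `keep="first"`, the first occurrence of a row is not flagged; only its later repeats are.

```python
import pandas as pd

# Hypothetical sample: the row ("Ann", 30) appears three times.
data = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Ann"],
    "age": [30, 25, 30, 30],
})

duplicates = data.duplicated()       # boolean Series, one flag per row
total_duplicates = duplicates.sum()  # each True counts as 1

print(duplicates.tolist())  # [False, False, True, True]
print(total_duplicates)     # 2
```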

#### Checking whether any duplicates were found and saving them to a separate file:

```python
    # print delay message
    sec = random.randint(1, 4)
    print(f"Please wait for {sec} seconds! Saving total duplicates rows")
    time.sleep(sec)


    # saving the duplicates
    if total_duplicates > 0:
        duplicate_records = data[duplicates]
        duplicate_records.to_csv(f'{data_name}_duplicates.csv', index=False)
        
```

#### Deleting duplicates from user dataset: 

```python
    # deleting duplicates
    df = data.drop_duplicates()
```
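
A quick way to check that no row is lost between the backup and the cleaned frame is to compare row counts. The example below uses made-up data and is only a sanity-check sketch.

```python
import pandas as pd

# Hypothetical sample with one repeated row.
data = pd.DataFrame({"name": ["Ann", "Bob", "Ann"], "age": [30, 25, 30]})

duplicate_records = data[data.duplicated()]  # rows that would go to the backup file
df = data.drop_duplicates()                  # first occurrence of each row is kept

# Every original row is either kept or backed up.
rows_accounted_for = len(df) + len(duplicate_records)
print(rows_accounted_for == len(data))
```

`drop_duplicates()` keeps the original index labels, so the cleaned frame here still has labels 0 and 1 rather than a renumbered index.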

#### Checking missing values:

```python
    # print delay message
    sec = random.randint(1, 10)
    print(f"Please wait for {sec} seconds! Checking for missing values")
    time.sleep(sec)

    # find missing values
    missing_value_by_column = df.isnull().sum()
    total_missing_value = missing_value_by_column.sum()

    print(f"Dataset has total missing values: {total_missing_value}")
    print(f"Missing values by column:\n{missing_value_by_column}")
```
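
The two counts relate as follows: `isnull().sum()` gives one count per column, and summing that result gives the overall total. A small example on made-up data:

```python
import pandas as pd

# Hypothetical sample: one missing city and two missing scores.
df = pd.DataFrame({
    "city": ["Oslo", None, "Lima"],
    "score": [1.0, None, None],
})

missing_value_by_column = df.isnull().sum()          # Series indexed by column name
total_missing_value = missing_value_by_column.sum()  # single overall count

print(missing_value_by_column)
print(total_missing_value)
```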

#### Creating a loop that checks each column's data type: if the column is numeric (float or int), missing values are replaced with the column's mean (average); if the column is not numeric, rows with missing values in that column are dropped.

```python
    # print delay message
    sec = random.randint(1, 6)
    print(f"Please wait for {sec} seconds! Cleaning datasets")
    time.sleep(sec)

    columns = df.columns

    for col in columns:
        # numeric columns: fill missing values with the column's mean
        # (is_numeric_dtype covers every integer and float width)
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].mean())

        else:
            # non-numeric columns: drop every row with a missing value
            df.dropna(subset=[col], inplace=True)
            
```
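
How a column is recognised as numeric matters for narrower types. Comparing a dtype against the built-in `float` and `int` does not match a `float32` column, while `pd.api.types.is_numeric_dtype` does. The sketch below illustrates the difference on a made-up column.

```python
import pandas as pd

# Hypothetical float32 column with one missing value.
df = pd.DataFrame({"score": pd.Series([1.0, None, 3.0], dtype="float32")})

matches_builtin_types = df["score"].dtype in (float, int)
is_numeric = pd.api.types.is_numeric_dtype(df["score"])

print(matches_builtin_types)  # False: float32 is not the built-in float
print(is_numeric)             # True

if is_numeric:
    # The mean of the non-missing values (1.0 and 3.0) fills the gap.
    df["score"] = df["score"].fillna(df["score"].mean())

print(df["score"].tolist())
```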

#### Creating a conditional block so that the app starts only when the script is executed directly, not when it is imported as a module.

```python
if __name__ == "__main__":
    
    print("Welcome to Data Cleaner App!")

    # asking about path and file name
    data_path = input("Please enter dataset path: ")
    data_name = input("Please enter dataset name: ")
    
    # calling the function
    data_cleaning_master(data_path, data_name)
```

### Result Example:

![image-2.png](attachment:image-2.png)

## Steps Process:



### 1. Input and File Verification

- The application begins by asking the user for the dataset path and dataset name.
- It verifies if the path is valid and checks whether the file is in a supported format (CSV or Excel).

### 2. Duplicate Detection and Removal

- The application checks for duplicate records in the dataset.
- Duplicate records are saved as a separate file named **'{dataset_name}_duplicates.csv'**.
- Duplicate rows are then removed from the main dataset.

### 3. Handling Missing Values

- The application checks for missing values in the dataset.
- For numeric columns (integer or float), missing values are replaced by the column’s mean.
- For non-numeric columns, rows containing missing values are dropped.

### 4. Exporting Clean Data

- Once cleaned, the dataset is saved as **'{dataset_name}_Clean_data.csv'**, and a message confirming successful cleaning is displayed.
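
The code for this export step is not shown in this write-up, so the following is only a sketch of what it might look like. It follows the `{dataset_name}_Clean_data.csv` pattern named above; the sample DataFrame, the temporary folder, and the wording of the confirmation message are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned data and dataset name.
df = pd.DataFrame({"city": ["Oslo", "Lima"], "score": [1.0, 7.0]})
data_name = "world_happiness_report"

with tempfile.TemporaryDirectory() as folder:
    clean_path = os.path.join(folder, f"{data_name}_Clean_data.csv")

    # index=False leaves the row index out of the saved file.
    df.to_csv(clean_path, index=False)
    print(f"Dataset is cleaned and saved as {os.path.basename(clean_path)}")

    # Reading the file back confirms that the rows and columns were written.
    reloaded = pd.read_csv(clean_path)
```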

### 5. Multiple Testing & Performance Tuning

- The application has been tested with more than 5 different datasets. It consistently cleaned datasets in a matter of seconds, without errors.
- The program was also tested in Jupyter Notebook, where it ran without errors, allowing easy integration with data analytics workflows.

## App's specification:


- Fast & Efficient: Cleans large datasets (10k+ rows) in seconds.

- Duplicate Backup: Keeps a backup of all duplicate records before removing them.

- Missing Values Handling: Automatically fills missing numeric values and drops rows with missing non-numeric values.

- User-friendly Prompts: Guides the user step by step with appropriate messages and delay prompts.

- Multiple Formats Supported: Handles CSV and Excel files.

## Instructions for the user:

1. Run the application using a Python environment.

2. Input the dataset path and name when prompted.

3. The application will automatically clean the dataset and save the results.


```bash
python data_cleaner.py
```

#### Example of execution:

```text
Welcome to Data Cleaner App!
Please enter dataset path: /usr/desktop/world_happiness_report_2024.csv
Please enter dataset name: world_happiness_report
```

#### Expected output:

- Duplicate records saved as: `world_happiness_report_duplicates.csv`

- Cleaned data saved as: `world_happiness_report_Clean_data.csv`