
#  Tom Clinic Data Cleaning Project

##  Objective
Clean and prepare the provided clinic dataset to make it ready for analysis by:
- Handling missing values
- Fixing inconsistent text entries
- Removing duplicates
- Ensuring correct data types

##  Step 1: Import Necessary Libraries

In [None]:
# Data Handling
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns


## Step 2: Load the Dataset
We load the dataset and take an initial look at its structure.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
df = pd.read_csv('/content/drive/MyDrive/tasks_to_dataset/01JT7BFHK057AQS04QAAWHCWNX (1).csv')
df.head()

In [None]:
df.sample(5)

## Step 3: Explore the Data

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.nunique().sort_values(ascending = False)

###  Observation Summary:
The dataset contains a high number of unique invoices and dates, indicating many transactions over time. Most categorical columns like Product, Brand, Branch, and Payment_Method have limited unique values, reflecting a controlled and structured retail environment.


In [None]:
df.info()

##  Data Summary and Observations

###  General Info:
- The dataset contains **2600 rows** and **10 columns**.
- This is sales data, including invoice info, products, prices, customers, and payment methods.

- `Invoice_ID`: OK – All values are present and unique.
- `Date`: Stored as text; needs to be converted to datetime.
- `Customer_Name`: 333 missing – can fill with "Unknown".
- `Product`: OK – No missing values.
- `Brand`: OK – No missing values.
- `Quantity`: OK – Numeric and complete.
- `Unit_Price`: OK – Numeric and complete.
- `Branch`: OK – No missing values.
- `Payment_Method`: 608 missing – fill with a placeholder such as "Not Recorded" or with the most common value.
- `Total_Price`: OK – Numeric and complete.

###  Next Steps:
- Clean missing values.
- Convert `Date` to datetime.
- Validate that `Total_Price = Quantity × Unit_Price`.
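
The last check in the list can be sketched as follows. This is a minimal example on a small hypothetical frame (the `sample` values are made up for illustration, not taken from the dataset). It compares with `np.isclose` rather than `==` so that floating-point rounding does not produce false mismatches; the one-cent tolerance is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; in the notebook this would be `df` itself.
sample = pd.DataFrame({
    "Quantity": [2, 3, 1],
    "Unit_Price": [10.0, 4.5, 99.99],
    "Total_Price": [20.0, 13.5, 120.0],  # last row is deliberately inconsistent
})

expected_total = sample["Quantity"] * sample["Unit_Price"]

# Absolute tolerance of 0.01 (one cent) is an assumption; adjust as needed.
is_consistent = np.isclose(sample["Total_Price"], expected_total, rtol=0, atol=0.01)

mismatches = sample[~is_consistent]
print(f"Rows where Total_Price differs from Quantity * Unit_Price: {len(mismatches)}")
```

Any rows that end up in `mismatches` would need a decision: recompute `Total_Price`, or investigate whether discounts or taxes explain the difference.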

In [None]:
missing = df.isna().sum().sort_values(ascending = False)
print(f'Total Number of missing values in the Dataset {missing.sum()}\n')
missing = missing[missing>0]
missing

##  Missing Values

- **Total Missing Values**: 941
- `Payment_Method`: 608 missing
  - Observation: Many transactions have no recorded payment method.
- `Customer_Name`: 333 missing
  - Observation: Some invoices are missing customer names.
- Action: Check whether the missing values in these two columns follow a pattern.

All other columns have **zero missing values**, so the data is mostly clean.


## Analyze the pattern of missing data in `Customer_Name` and `Payment_Method`

In [None]:

def Check_Pattern(missing_df, null_col):
    """
    Analyze the pattern of missing data in a specific column.

    Parameters:
    ----------
    missing_df : pandas.DataFrame
        Rows where the specified column is missing.
    null_col : str
        The column with missing values.

    Returns:
    -------
    pandas.DataFrame
        Columns that tend to take only 1 or 2 unique values
        when `null_col` is missing — this may indicate a non-random pattern.
    """
    # Remove the target column (we don't need to analyze it here)
    missing_df = missing_df.drop(null_col, axis=1)

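    # Collect unique values for columns that take only 1 or 2 distinct values
    unique_vals_by_col = {}

    for col in missing_df.columns:
        nunique = missing_df[col].nunique()

        if nunique in [1, 2]:  # If only 1 or 2 unique (non-null) values exist
            unique_vals = missing_df[col].dropna().unique()
            unique_vals_by_col[col + '_unique_vals'] = pd.Series(unique_vals)

    # Build the frame in one step so columns of different lengths are padded
    # with NaN instead of being truncated to the first column's index
    unique_df = pd.DataFrame(unique_vals_by_col)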

    # Interpretation
    if unique_df.empty:
        print(f'\t- Missing values in `{null_col}` appear to be randomly distributed.')
        print(f'\t- Likely missing mechanism: MCAR (Missing Completely At Random)')
    else:
        print(f'\t- Missing values in `{null_col}` are associated with specific values in other columns:')
        print(f'\t  Columns with 1 or 2 unique values when `{null_col}` is missing: {list(unique_df.columns)}')
        print(f'\t- Likely missing mechanism: MAR or MNAR (Not Missing Completely At Random)')

    return unique_df


def Missing_Pattern(df, col):
    """
    Show stats and pattern analysis for a column with missing values.

    Parameters:
    ----------
    df : pandas.DataFrame
    col : str
        Column name to analyze
    """
    print(f"\nFeature: {col}")
    print('-'*40)
    print(f"\t- Number of missing values: {df[col].isna().sum()}")
    print(f"\t- Percentage of missing values: {(df[col].isna().mean())*100:.2f}%")
    print(f"\t- Data type: {df[col].dtype}")
    print(f"\t- Number of unique values: {df[col].nunique(dropna=True)}")
    print(f"\t- Most common value: {df[col].mode(dropna=True).iloc[0] if df[col].notna().any() else 'N/A'}")

    print(f"\nAnalyzing missing value pattern...")
    print('-'*40)
    missing = df[df[col].isna()]
    unique_df = Check_Pattern(missing, col)
    return missing, unique_df


In [None]:
missing1, pattern1 = Missing_Pattern(df, 'Customer_Name')
missing2, pattern2 = Missing_Pattern(df, 'Payment_Method')

### Missing Values Analysis Summary

- **Customer_Name**
  - ~12.8% missing values
  - No pattern found by the check above, so treated as Missing Completely At Random (MCAR)
  - Action: Fill missing values with `'Unknown'` to keep the data and track anonymous customers.

- **Payment_Method**
  - ~23.4% missing values
  - No pattern found by the check above, so treated as Missing Completely At Random (MCAR)
  - Action: Fill missing values with the most common value (`mode`), e.g., `'Mobile Wallet'`.
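
The "1 or 2 unique values" heuristic is fairly coarse, so a complementary check is to compare the missing rate across the levels of another column. The sketch below uses a hypothetical helper, `missing_rate_by_group`, and made-up sample rows. Roughly equal rates across groups are consistent with MCAR but do not prove it.

```python
import numpy as np
import pandas as pd

def missing_rate_by_group(frame, null_col, group_col):
    """Return the share of missing `null_col` values within each level of `group_col`."""
    return frame[null_col].isna().groupby(frame[group_col]).mean()

# Hypothetical sample; branch names and values are made up for illustration.
sample = pd.DataFrame({
    "Branch": ["A", "A", "B", "B", "B", "B"],
    "Payment_Method": ["Cash", np.nan, "Card", np.nan, "Cash", np.nan],
})

rates = missing_rate_by_group(sample, "Payment_Method", "Branch")
print(rates)
```

On the real data, a call such as `missing_rate_by_group(df, 'Payment_Method', 'Branch')` would have to run before imputation. A branch with a much higher rate than the others would suggest the values are not missing completely at random.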

###  Final Decision:
- No rows will be dropped.
- Missing values will be imputed to retain useful sales data for analysis.

In [None]:
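# Assign back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to df
df['Customer_Name'] = df['Customer_Name'].fillna('Unknown')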

In [None]:
df['Customer_Name'].isna().sum()

In [None]:
# Fill with the most common payment method; assign back rather than inplace=True
df['Payment_Method'] = df['Payment_Method'].fillna(df['Payment_Method'].mode()[0])

In [None]:
df['Payment_Method'].isna().sum()

## Handle duplicates

In [None]:
df.duplicated().sum()
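
If the count above is non-zero, the duplicates still need to be removed. The sketch below shows one way to do that on a small hypothetical frame (the invoice IDs and products are made up). It also checks for repeated `Invoice_ID` values, which can reveal duplicates that differ in some other column.

```python
import pandas as pd

# Hypothetical sample; rows 1 and 2 are exact duplicates.
sample = pd.DataFrame({
    "Invoice_ID": ["INV-1", "INV-2", "INV-2", "INV-3"],
    "Product": ["X", "Y", "Y", "Z"],
    "Quantity": [1, 2, 2, 5],
})

n_exact = sample.duplicated().sum()                        # fully identical rows
n_same_id = sample.duplicated(subset="Invoice_ID").sum()   # repeated invoice IDs

# Keep the first occurrence of each duplicated row and rebuild the index.
deduped = sample.drop_duplicates().reset_index(drop=True)
print(n_exact, n_same_id, len(deduped))
```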

## Fix column data types if needed

In [None]:
df.dtypes

### Observation: Convert the `Date` column


In [None]:
df['Date'].unique()

In [None]:

from datetime import datetime

def clean_date(date_str):
    """Parse a date string in one of several known formats and return it as 'YYYY-MM-DD'."""
    if pd.isnull(date_str):
        return None

    # Replace / with - so each format only needs one separator style
    date_str = str(date_str).strip().replace('/', '-')

    # Possible date formats (more can be added if needed).
    # Order matters: ambiguous values like 03-04-2024 are read as day-first.
    formats = ['%Y-%m-%d', '%d-%m-%Y', '%m-%d-%Y', '%d-%b-%Y']

    for fmt in formats:
        try:
            date_obj = datetime.strptime(date_str, fmt)
            return date_obj.strftime('%Y-%m-%d')  # Standardize format
        except ValueError:
            continue

    return None  # If all parsing fails

# Apply the cleaner to the 'Date' column
df['Date'] = df['Date'].apply(clean_date)


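When I looked at the `Date` column, I noticed that not all dates were in the same format. This inconsistency is a significant issue because mixed formats cannot be converted to datetime reliably.

The `clean_date` function above standardizes the known formats to `YYYY-MM-DD`. Ambiguous day/month orderings remain a limitation that we aim to resolve in the future.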
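
To see which formats are actually present, one option is to reduce each raw date string to a "shape" and count the shapes. The sketch below is an illustration, not part of the original notebook: `date_format_signatures` is a hypothetical helper and `raw_dates` is made-up sample data. On the real data it would need to run on the raw `Date` column, before `clean_date` is applied.

```python
import pandas as pd

def date_format_signatures(dates):
    """Count date 'shapes' by replacing every digit with 9 and every letter with A."""
    shapes = (
        dates.dropna()
        .astype(str)
        .str.strip()
        .str.replace(r"\d", "9", regex=True)
        .str.replace(r"[A-Za-z]", "A", regex=True)
    )
    return shapes.value_counts()

# Hypothetical raw values covering three different layouts plus a missing entry.
raw_dates = pd.Series(["2024-01-15", "15/01/2024", "15-Jan-2024", "2024-02-01", None])

signatures = date_format_signatures(raw_dates)
print(signatures)
```

Each distinct shape corresponds to a candidate entry for the `formats` list. Shapes such as `99/99/9999` do not reveal whether the day or the month comes first, so those still need a manual look at the values.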

In [None]:
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')


In [None]:
df['Date'].isna().sum()

In [None]:
missing = df.isna().sum().sort_values(ascending = False)
print(f'Total Number of missing values in the Dataset {missing.sum()}\n')
missing = missing[missing>0]
missing

In [None]:
df.to_csv("cleaned_data.csv", index=False)

In [None]:
df