**SKRUB**: A Python library for cleaning, structuring, and visualizing tabular data

**INTRODUCTION**

Messy data slows analysis down: missing values, inconsistent formats, and unstructured information all need attention before any modeling can start. This is where skrub comes in.
skrub is a Python library designed to clean, structure, and prepare tabular data efficiently. In this guide, we’ll explore its key features.

skrub makes cleaning, organizing, and visualizing messy tables easier and faster:

 **Data Cleaning** – Removes inconsistencies, trims spaces, and fixes column names.

 **Handling Missing Data** – Easily fills missing values.

 **Dataset Merging** – Intelligently links datasets, even with slight variations.

 **Quick Insights** – Generates structured data for visualization and analysis.

**WHY SKRUB?**

**Assembling Tables with Precision**: Skrub joins tables on keys of different types, including string, numerical, and datetime, and it can handle imprecise correspondences between those keys.

**Fuzzy Joining for Seamless Integration**: Skrub selects the type of fuzzy matching based on column types and produces a similarity score, which makes less-than-perfect matches easy to identify.

**Advanced Analysis Made Simple**: Skrub takes table joining further with features like Joiner, AggJoiner, and AggTarget.

**Efficient Column Selection in Pipelines**: Apart from joins, skrub also supports column selection within a pipeline, so data scientists can choose and discard columns dynamically.
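
To make the idea of a fuzzy join with a similarity score concrete, here is a minimal sketch built on Python's standard-library difflib and pandas. It is a stand-in for illustration only and is not skrub's matching algorithm; the helper `best_match` and both sample tables are hypothetical.

```python
from difflib import SequenceMatcher

import pandas as pd

# Hypothetical sample tables whose keys do not match exactly
left = pd.DataFrame({"city": ["New York", "Paris", "Londn"]})
right = pd.DataFrame({
    "city_name": ["new york city", "paris", "london"],
    "country": ["USA", "France", "UK"],
})


def best_match(value, candidates):
    """Return (closest candidate, similarity in [0, 1]) for one key."""
    scored = [
        (cand, SequenceMatcher(None, value.lower(), cand.lower()).ratio())
        for cand in candidates
    ]
    return max(scored, key=lambda pair: pair[1])


# Attach the closest right-hand key and its score to every left-hand row
matches = [best_match(value, right["city_name"]) for value in left["city"]]
left["city_name"] = [match[0] for match in matches]
left["similarity"] = [match[1] for match in matches]

# An exact merge on the matched key now completes the fuzzy join
joined = left.merge(right, on="city_name", how="left")
print(joined)
```

Rows with a low similarity score can then be reviewed or filtered out. In skrub itself, `fuzzy_join` is the entry point for this kind of matching.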




**INSTALLATION PROCESS**

In [1]:
pip install skrub

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install --upgrade skrub




**USING SKRUB**

**HANDLING MISSING DATA:**

Missing or NaN values are one of the most common problems in data cleaning. You can fill or drop these values with pandas and then encode the table with skrub in just a few lines of code.

For example, consider this code:

In [3]:
# Sample data with missing values
import pandas as pd


data = {
    'Name': ['Alice', 'Bob', None, 'Eve'],
    'Age': [25, None, 22, 29],
    'City': ['New York', 'Paris', None, None]
}

df = pd.DataFrame(data)
df


    Name   Age      City
0  Alice  25.0  New York
1    Bob   NaN     Paris
2   None  22.0      None
3    Eve  29.0      None


In the following code, the fill step itself is done with pandas: missing Age values are filled with the column mean, and missing Name and City values are filled with the placeholder 'Unknown'. skrub’s TableVectorizer then converts the filled table into a numerical representation:

In [4]:
import pandas as pd
from skrub import TableVectorizer

# Sample data with missing values
data = {
    'Name': ['Alice', 'Bob', None, 'Eve'],
    'Age': [25, None, 22, 29],
    'City': ['New York', 'Paris', None, None]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Print the original data
print("Original Data:")
print(df)

# Handle missing values before using TableVectorizer
# Fill missing names and cities with 'Unknown' and missing ages with the column mean
# Assigning the result back avoids the chained inplace=True pattern, which pandas is deprecating
df['Name'] = df['Name'].fillna('Unknown')
df['Age'] = df['Age'].fillna(df['Age'].mean())  # Fill missing Age with the mean
df['City'] = df['City'].fillna('Unknown')

# Print the data after filling missing values
print("\nData After Filling Missing Values:")
print(df)

# Initialize TableVectorizer to clean and transform the data
vectorizer = TableVectorizer()

# Clean the data (it will handle categorical data by encoding it into numerical form)
cleaned_data = vectorizer.fit_transform(df)

# Convert cleaned data to a DataFrame
df_cleaned = pd.DataFrame(cleaned_data, columns=vectorizer.get_feature_names_out())

# Print the cleaned data
print("\nCleaned Data (Transformed into Numerical Representation):")
print(df_cleaned)


Original Data:
    Name   Age      City
0  Alice  25.0  New York
1    Bob   NaN     Paris
2   None  22.0      None
3    Eve  29.0      None

Data After Filling Missing Values:
      Name        Age      City
0    Alice  25.000000  New York
1      Bob  25.333333     Paris
2  Unknown  22.000000   Unknown
3      Eve  29.000000   Unknown

Cleaned Data (Transformed into Numerical Representation):
   Name_Alice  Name_Bob  Name_Eve  Name_Unknown        Age  City_New York  \
0         1.0       0.0       0.0           0.0  25.000000            1.0   
1         0.0       1.0       0.0           0.0  25.333334            0.0   
2         0.0       0.0       0.0           1.0  22.000000            0.0   
3         0.0       0.0       1.0           0.0  29.000000            0.0   

   City_Paris  City_Unknown  
0         0.0           0.0  
1         1.0           0.0  
2         0.0           1.0  
3         0.0           1.0  


Note: calling 'df[col].fillna(value, inplace=True)' on a selected column triggers a pandas warning for each such line. The behavior will change in pandas 3.0, where the intermediate object always behaves as a copy, so the in-place call will never modify the original DataFrame. Use 'df[col] = df[col].fillna(value)' or 'df.fillna({col: value}, inplace=True)' instead.

Missing values in the Age column are replaced with the column mean (25.33), and the missing Name and City values are replaced with the string 'Unknown'.
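
If the most frequent value (the mode) is preferred over a placeholder string for text columns, the same idea can be written generically. The following is a minimal pandas sketch under that assumption; the sample data is hypothetical and chosen so that the City column has a clear mode.

```python
import pandas as pd

# Hypothetical sample data: 'Paris' is the most frequent city
df = pd.DataFrame({
    'Age': [25, None, 22, 29],
    'City': ['Paris', 'Paris', None, 'New York'],
})

filled = df.copy()
for col in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[col]):
        # Numerical column: fill with the column mean
        filled[col] = filled[col].fillna(filled[col].mean())
    else:
        # Text column: fill with the most frequent non-missing value, if any
        modes = filled[col].mode(dropna=True)
        if not modes.empty:
            filled[col] = filled[col].fillna(modes.iloc[0])

print(filled)
```

When several values tie for the mode, pandas returns them sorted, so this sketch picks the first one; whether that is acceptable depends on the dataset.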

TableVectorizer then encodes the categorical columns (Name and City) as binary features (one-hot encoding).
As a result, the cleaned data is entirely numerical, and the output shows how the missing values were handled.
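
To see what one-hot encoding produces independently of skrub, the filled City column can be encoded with pandas `get_dummies`. This is an illustrative sketch only; it does not claim to reproduce what TableVectorizer runs internally.

```python
import pandas as pd

# City column after the missing values were replaced with 'Unknown'
city = pd.Series(['New York', 'Paris', 'Unknown', 'Unknown'], name='City')

# One binary column per distinct value; dtype=float gives 0.0/1.0 entries
one_hot = pd.get_dummies(city, prefix='City', dtype=float)
print(one_hot)
```

Each row has exactly one 1.0, in the column that corresponds to its original value.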

**STANDARDIZING DATA:**

Numerical columns often need to be brought onto a consistent scale across the dataset.
For example, if we have a column of Age values with a large range, we can rescale it to the range 0 to 1 (normalization) or to zero mean and unit variance (standardization).

The code below attempts to do this by importing a Normalizer from a skrub.preprocessing module. The import fails, because the installed skrub package has no such module:

In [5]:
from skrub.preprocessing import Normalizer
import pandas as pd

# Sample data
df = pd.DataFrame({'Age': [25, 30, 35, 40, 45]})

# Normalize data to a range [0, 1]
normalizer = Normalizer()
df_normalized = normalizer.fit_transform(df)

print(df_normalized)



ModuleNotFoundError: No module named 'skrub.preprocessing'
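
Because the import fails, the intended scaling to the range [0, 1] has to come from elsewhere. A minimal sketch using plain pandas arithmetic (a swapped-in technique, not a skrub API) is:

```python
import pandas as pd

# Same sample data as in the failing cell
df = pd.DataFrame({'Age': [25, 30, 35, 40, 45]})

# Min-max normalization: (x - min) / (max - min) maps each column to [0, 1]
col_min = df.min()
col_max = df.max()
df_normalized = (df - col_min) / (col_max - col_min)

print(df_normalized)
```

scikit-learn's `MinMaxScaler` performs the same transformation and is the usual choice when the scaling step has to live inside a pipeline.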

In [6]:
import skrub
help(skrub)


Help on package skrub:

NAME
    skrub - skrub: Prepping tables for machine learning.

PACKAGE CONTENTS
    _agg_joiner
    _check_dependencies
    _check_input
    _clean_categories
    _clean_null_strings
    _column_associations
    _dataframe (package)
    _datetime_encoder
    _deduplicate
    _dispatch
    _drop_if_too_many_nulls
    _fast_hash
    _fuzzy_join
    _gap_encoder
    _interpolation_joiner
    _join_utils
    _joiner
    _matching
    _minhash_encoder
    _multi_agg_joiner
    _on_each_column
    _on_subframe
    _reporting (package)
    _select_cols
    _selectors (package)
    _similarity_encoder
    _sklearn_compat
    _string_distances
    _string_encoder
    _table_vectorizer
    _tabular_learner
    _text_encoder
    _to_categorical
    _to_datetime
    _to_float32
    _to_str
    _utils
    _wrap_transformer
    conftest
    datasets (package)
    tests (package)

CLASSES
    builtins.object
        skrub._reporting._table_report.TableReport
    sklearn.base.B