# VisTool Example Usage

This notebook demonstrates how to use `VisTool` to download, wrangle, combine, and visualize data.

## Prerequisites

Make sure `VisTool` is installed in your environment before proceeding. See the README for installation instructions.

## Download Module

1. `download_file`: Downloads a file from a given URL and saves it locally.

2. `download_csv`: Downloads a CSV file from a URL and loads it into a Pandas DataFrame.

3. `load_csv`: Loads a CSV file into a Pandas DataFrame from a local path.

4. `load_excel`: Loads an Excel file into a Pandas DataFrame from a local path.

5. `summarize_data`: Summarizes key aspects of the dataset and provides an overview of its structure.

## Combine Module 

1. `merge_datasets`: Merges two datasets on a specified column.

2. `concat_datasets`: Concatenates multiple datasets along rows or columns.

## Visualize Module

1. `plot_histogram`: Plots a histogram of a column.

2. `plot_scatter`: Creates a scatter plot of two columns.

3. `plot_correlation_matrix`: Plots a heatmap of correlations between numeric columns.

4. `plot_line`: Plots a line chart for time-series data.

5. `plot_overlay`: Overlays multiple columns with different plot types.

## Wrangle Module

1. `clean_data`: Cleans the dataset by dropping NaN values or filling with mean.

2. `filter_data`: Filters rows based on a condition.

3. `rename_columns`: Renames columns in the dataset.

4. `label_encode`: Performs label encoding on a categorical column using Pandas and NumPy.

## Checking the Package Information and Installation

In [1]:
# If VisTool is installed correctly, this import succeeds and prints the package metadata
import VisTool
print(VisTool.__version__)
print(VisTool.__name__)
print(VisTool.__author__)

0.1.0
VisTool
Guled Abdullahi and Kayleigh Haydock 


## Download Module Example Usage 

1. `download_file(url: str, save_path: str) -> None`

2. `download_csv(url: str) -> pd.DataFrame`

3. `load_csv(file_path: str) -> pd.DataFrame`

4. `load_excel(file_path: str, sheet_name: str = None) -> pd.DataFrame`

5. `summarize_data(df)`

Summarizes key aspects of the dataset and provides an overview of its structure, including:

	• Shape (rows, columns)

	• Numeric and non-numeric columns

	• Missing values

	• Duplicate rows

	• Categorical columns

	• Correlation matrix for numeric columns
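
The items in this list map onto a handful of standard pandas calls. The sketch below is not `summarize_data` itself, whose implementation is not shown in this notebook; it only illustrates, on made-up sample data, how each quantity can be computed directly with pandas.

```python
import pandas as pd

# Hypothetical sample data; not one of the notebook's datasets.
df = pd.DataFrame({
    "region": ["North", "South", "North", "North"],
    "visits": [120, 95, 120, None],
    "waits": [30.5, 22.0, 30.5, 41.0],
})

numeric_cols = df.select_dtypes(include="number").columns.tolist()
non_numeric_cols = df.select_dtypes(exclude="number").columns.tolist()

summary = {
    "shape": df.shape,                             # (rows, columns)
    "numeric_columns": numeric_cols,
    "non_numeric_columns": non_numeric_cols,
    "missing_values": df.isna().sum().to_dict(),   # NaN count per column
    "duplicate_rows": int(df.duplicated().sum()),  # rows identical to an earlier row
}
correlation = df[numeric_cols].corr()              # pairwise Pearson correlation

for key, value in summary.items():
    print(f"{key}: {value}")
print(correlation)
```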


In [None]:
# Import the download functions
from VisTool.download import download_file, download_csv, load_csv, load_excel, summarize_data
import os

# Download a file and save it locally
url = "https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv"
save_path = "data/airtravel.csv"

# Create the directory if it doesn't exist
os.makedirs(os.path.dirname(save_path), exist_ok=True)

# Download the file
download_file(url, save_path)

# Verify the file is saved
if os.path.exists(save_path):
    print(f"File downloaded successfully and saved to: {save_path}")
else:
    print("Failed to download the file.")
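
For readers curious what a download-and-save step involves, here is a minimal standard-library sketch. The helper name `fetch_to_path` is hypothetical and is not part of VisTool; how `download_file` is actually implemented is not shown in this notebook.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_to_path(url: str, save_path: str) -> Path:
    """Stream the resource at `url` into `save_path` (hypothetical helper)."""
    target = Path(save_path)
    # Create the parent directory first, the same job os.makedirs does above.
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, target.open("wb") as out:
        shutil.copyfileobj(response, out)  # copy in chunks, not all at once
    return target
```

It would be called the same way as the cell above, for example `fetch_to_path(url, "data/airtravel.csv")`.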
  

In [None]:
# Download a CSV file and load it into a Pandas DataFrame
url = "https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv"
df = download_csv(url)

# Display the first few rows of the DataFrame
print("Downloaded CSV Data:")
print(df.head())


In [None]:
# Load a CSV file from a local path

df = load_csv('data/Monthly_AE_Attendances_Nov_2024.csv')
df.head(5)

In [None]:
# Load an Excel file; with no sheet_name given, the result here is a dict of DataFrames keyed by sheet name
df_dict = load_excel('data/titanic3.xls')
print(type(df_dict))  # Output: <class 'dict'>
print(df_dict.keys())  # Output: the sheet names, including 'titanic3'

In [None]:
df = df_dict['titanic3']  # Access the DataFrame for the 'titanic3' sheet
print(df.head(5))  # Display the first 5 rows of the DataFrame

In [None]:
# Summarize the data
df = load_csv('data/Monthly_AE_Attendances_Nov_2024.csv')
summarize_data(df)

## Combine Module Example Usage

The `combine.py` module merges and concatenates datasets, making it easier to integrate and manage data. Each function is described below, followed by examples.

1. `merge_datasets(left_df, right_df, on, how)`

This function merges two datasets on a specified column using various join methods (inner, outer, left, right).

Arguments:

• `left_df (pd.DataFrame)`: The first dataset.

• `right_df (pd.DataFrame)`: The second dataset.

• `on (str)`: The column to merge on.

• `how (str, optional)`: Type of join ("inner", "outer", "left", "right"). Default: "inner".

2. `concat_datasets(dataframes, axis)`

This function concatenates multiple datasets either row-wise or column-wise.

Arguments:

• `dataframes (list[pd.DataFrame])`: List of datasets to concatenate.

• `axis (int, optional)`: Axis to concatenate on (0 for rows, 1 for columns). Default: 0.    

In [None]:


# Example 1: Inner join

from VisTool.combine import merge_datasets, concat_datasets
import pandas as pd

# Sample data
data1 = {'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']}
data2 = {'id': [2, 3, 4], 'age': [25, 30, 35]}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

# Merge datasets with an inner join
merged_data = merge_datasets(df1, df2, on="id", how="inner")
print(merged_data)



In [None]:

# Merge datasets with an outer join

merged_data = merge_datasets(df1, df2, on="id", how="outer")
print(merged_data)
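
An outer join keeps ids that appear in only one input and fills the missing side with NaN. The sketch below uses plain `pd.merge`, on the assumption (not confirmed by this notebook) that `merge_datasets` behaves like it, and adds pandas' `indicator=True` option to show which side each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]})
df2 = pd.DataFrame({"id": [2, 3, 4], "age": [25, 30, 35]})

# indicator=True adds a "_merge" column: left_only, right_only, or both.
outer = pd.merge(df1, df2, on="id", how="outer", indicator=True)
print(outer)

# Rows that exist in only one of the two inputs.
unmatched = outer[outer["_merge"] != "both"]
print(sorted(unmatched["id"].tolist()))  # [1, 4]
```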

In [None]:
# Concatenating row-wise

# Sample data
df1 = pd.DataFrame({'id': [1, 2], 'name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'id': [3, 4], 'name': ['Charlie', 'David']})

# Concatenate row-wise
concatenated_data = concat_datasets([df1, df2], axis=0)
print(concatenated_data)


In [None]:

# Concatenate column-wise
concatenated_data = concat_datasets([df1, df2], axis=1)
print(concatenated_data)
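
Because both frames have `id` and `name` columns, placing them side by side produces duplicate column labels. The sketch below shows this with plain `pd.concat`, assuming `concat_datasets` forwards to it (the notebook does not confirm this), along with one way to keep the two sources distinguishable.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2], "name": ["Alice", "Bob"]})
df2 = pd.DataFrame({"id": [3, 4], "name": ["Charlie", "David"]})

# Plain column-wise concatenation keeps both sets of labels,
# so "id" and "name" each appear twice.
side_by_side = pd.concat([df1, df2], axis=1)
print(side_by_side.columns.tolist())  # ['id', 'name', 'id', 'name']

# keys= adds an outer column level that tells the two sources apart.
labelled = pd.concat([df1, df2], axis=1, keys=["first", "second"])
print(labelled["second"])
```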

## Wrangle Module Example Usage

The wrangle module simplifies the process of preparing and managing datasets by offering both manual and interactive options for cleaning, filtering, renaming, and encoding data. The interactive functions allow users to manipulate their data directly in Jupyter Notebooks without writing additional code.

1. `clean_data(data, remove_columns=None, fill_with=None, apply_to='columns')`

This function cleans the dataset by dropping or filling missing values.

Arguments:

• data (pd.DataFrame): The input dataset.

• remove_columns (list, optional): Columns to check for missing values; rows with missing values in these columns are dropped.

• fill_with (str, optional): Method to fill NaN values ('mean' or 'average').

• apply_to (str, optional): Apply the operation to 'columns' or 'rows'. Default: 'columns'.


2. `filter_data(data, condition)`

Filters the dataset based on a specified condition.

Arguments:

• data (pd.DataFrame): The input dataset.

• condition (str): A valid pandas query string to filter the data.


3. `rename_columns(data, columns_mapping)`

Renames columns using a dictionary mapping.

Arguments:

• data (pd.DataFrame): The input dataset.

• columns_mapping (dict): A mapping of old column names to new names.

4. `label_encode(data, column)`

Applies label encoding to a categorical column.

Arguments:

• data (pd.DataFrame): The dataset containing the categorical column.

• column (str): The name of the column to encode.



Interactive variants:

1. `clean_data_interactive(data)`: Interactive version of `clean_data`.

2. `filter_data_interactive(data)`: Interactive version of `filter_data`.

3. `rename_columns_interactive(data)`: Interactive column renaming.

4. `label_encode_interactive(data)`: Interactive label encoding.



In [13]:
# Import wrangle module functions
from VisTool.wrangle import clean_data, filter_data, rename_columns

# Example 1: Clean Data
data = pd.DataFrame({"A": [1, None, 3], "B": [4, 5, None]})
cleaned_data = clean_data(data)
print("Cleaned Data:")
print(cleaned_data)

# Example 2: Filter Data
filtered_data = filter_data(data, "A > 1")
print("Filtered Data:")
print(filtered_data)

# Example 3: Rename Columns
renamed_data = rename_columns(data, {"A": "ColumnA", "B": "ColumnB"})
print("Renamed Data:")
print(renamed_data)

Rows with any NaN values were dropped.
Cleaned Data:
     A    B
0  1.0  4.0
Data filtered successfully.
Filtered Data:
     A   B
2  3.0 NaN
Columns renamed successfully.
Renamed Data:
   ColumnA  ColumnB
0      1.0      4.0
1      NaN      5.0
2      3.0      NaN
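
The `condition` argument is described above as a pandas query string. As a rough guide to what such strings can express, the sketch below calls `DataFrame.query` directly on made-up sample data; whether `filter_data` supports every form shown here is an assumption.

```python
import pandas as pd

data = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 35, None],
    "Gender": ["Female", "Male", "Male", "Female"],
})

# Comparison on a numeric column; rows where Age is NaN never match.
print(data.query("Age > 28"))

# Combine conditions with `and` / `or`; string values need quotes.
males_30_plus = data.query("Age >= 30 and Gender == 'Male'")
print(males_30_plus)

# Reference a Python variable with the @ prefix.
min_age = 26
older = data.query("Age > @min_age")
print(older["Name"].tolist())  # ['Bob', 'Charlie']
```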


In [14]:
from VisTool.wrangle import label_encode
import pandas as pd

# Example usage
# Sample data
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Alice"],
    "Age": [25, 30, 35, 28],
    "City": ["New York", "London", "New York", "Paris"]
    })

print("Original Data:")
print(df)

# Apply label encoding to the 'City' column
df_encoded = label_encode(df.copy(), column="City")
print("\nData with Label Encoding:")
print(df_encoded)



Original Data:
      Name  Age      City
0    Alice   25  New York
1      Bob   30    London
2  Charlie   35  New York
3    Alice   28     Paris
Label encoding applied to column: City

Data with Label Encoding:
      Name  Age  City
0    Alice   25     1
1      Bob   30     0
2  Charlie   35     1
3    Alice   28     2
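
The codes above follow the alphabetical order of the labels (London, New York, Paris). One way to get the same codes with Pandas and NumPy is `np.unique` with `return_inverse=True`; this is a sketch of the technique, not necessarily how `label_encode` is written.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Alice"],
    "City": ["New York", "London", "New York", "Paris"],
})

# np.unique returns the sorted distinct labels and, with return_inverse=True,
# the position of each original value within that sorted array.
classes, codes = np.unique(df["City"].to_numpy(dtype=str), return_inverse=True)

encoded = df.copy()
encoded["City"] = codes
print(encoded)

# Keep the mapping so the codes can be translated back later.
mapping = {str(label): code for code, label in enumerate(classes)}
print(mapping)  # {'London': 0, 'New York': 1, 'Paris': 2}
```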


In [16]:
# Column names are case-sensitive: the column is 'Age', so passing 'age' raises a KeyError
d = clean_data(df, remove_columns=['age'])

KeyError: ['age']
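
The error arises because the DataFrame has a column named `Age`, not `age`. One way to guard against this is to resolve the requested names against the real column labels before calling `clean_data`. The helper `resolve_columns` below is hypothetical and not part of VisTool.

```python
import pandas as pd


def resolve_columns(df: pd.DataFrame, requested: list[str]) -> list[str]:
    """Map requested names onto real column labels, ignoring case.

    Hypothetical helper; raises KeyError listing any name with no match.
    """
    lookup = {str(col).lower(): col for col in df.columns}
    missing = [name for name in requested if name.lower() not in lookup]
    if missing:
        raise KeyError(f"No such column(s): {missing}; available: {list(df.columns)}")
    return [lookup[name.lower()] for name in requested]


df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})
print(resolve_columns(df, ["age"]))  # ['Age']
```

The result can then be passed on, for example `clean_data(df, remove_columns=resolve_columns(df, ['age']))`.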

### Interactive Variants

In [9]:
import pandas as pd
from VisTool.wrangle import (
    clean_data_interactive,
    filter_data_interactive,
    rename_columns_interactive,
    label_encode_interactive
)

data = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, None],
    'Gender': ['Female', 'Male', 'Male', 'Female'],
    'Score': [88, 92, 85, None]
})



### 1. Clean the dataset interactively

`clean_data_interactive(data)`

How to use:

• Select columns to remove using the Remove Columns dropdown.

• Choose “mean” or “average” to fill NaN values in numeric columns.

• Specify whether to apply changes to columns or rows using the Apply To dropdown.

• Click Apply Cleaning to see the updated dataset.


In [11]:

clean_data_interactive(data)

SelectMultiple(description='Remove Columns:', options=('Name', 'Age', 'Gender', 'Score'), value=())

Dropdown(description='Fill With:', options=('None', 'mean', 'average'), value='None')

Dropdown(description='Apply To:', options=('columns', 'rows'), value='columns')

Button(description='Apply Cleaning', style=ButtonStyle())


### 2. Filter the dataset interactively

`filter_data_interactive(data)`

How to use:

• Enter a condition to filter rows in the Condition text box (e.g., Age > 30 or Gender == 'Female').

• Click Apply Filter to view the filtered dataset.


In [7]:
filter_data_interactive(data)

Text(value='', description='Condition:', placeholder='Enter condition (e.g., Age > 30)')

Button(description='Apply Filter', style=ButtonStyle())


### 3. Rename columns interactively

`rename_columns_interactive(data)`

How to use:

• Enter mappings for column renaming in the format old_name:new_name (e.g., Name:Full_Name,Score:Test_Score).

• Click Apply Rename to see the dataset with renamed columns.
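
The `old_name:new_name` format corresponds to the dictionary that `rename_columns` accepts. The sketch below shows one way such a string could be parsed; `parse_mappings` is a hypothetical helper, and the widget's own parsing logic is not shown in this notebook.

```python
def parse_mappings(text: str) -> dict[str, str]:
    """Turn 'old:new,old2:new2' into {'old': 'new', 'old2': 'new2'}."""
    mapping = {}
    for pair in text.split(","):
        if not pair.strip():
            continue  # tolerate a trailing comma or empty input
        old, sep, new = pair.partition(":")
        if not sep or not old.strip() or not new.strip():
            raise ValueError(f"Expected old_name:new_name, got {pair!r}")
        mapping[old.strip()] = new.strip()
    return mapping


print(parse_mappings("Name:Full_Name, Score:Test_Score"))
# {'Name': 'Full_Name', 'Score': 'Test_Score'}
```

The resulting dictionary has the same shape as the `columns_mapping` argument of `rename_columns`.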


In [8]:
rename_columns_interactive(data)

Text(value='', description='Mappings:', placeholder='Enter mappings (e.g., old:new,age:years)')

Button(description='Apply Rename', style=ButtonStyle())


### 4. Apply label encoding interactively

`label_encode_interactive(data)`

How to use:

• Select a categorical column (e.g., Gender) from the Column dropdown.

• Click Apply Encoding to view the dataset with the selected column label-encoded.


In [4]:
label_encode_interactive(data)

Dropdown(description='Column:', options=('Name', 'Age', 'Gender', 'Score'), value='Name')

Button(description='Apply Encoding', style=ButtonStyle())

## Visualize Module Example Usage

In [None]:
# Import visualize module functions
from VisTool.visualize import plot_histogram, plot_scatter, plot_correlation_matrix

# Example 1: Plot Histogram
data = pd.DataFrame({"A": [1, 2, 2, 3, 3, 3]})
plot_histogram(data, "A")

# Example 2: Plot Scatter
scatter_data = pd.DataFrame({"X": [1, 2, 3], "Y": [3, 2, 1]})
plot_scatter(scatter_data, "X", "Y")

# Example 3: Plot Correlation Matrix
correlation_data = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})
plot_correlation_matrix(correlation_data)
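
To see the numbers a correlation heatmap is typically drawn from, the matrix can be computed directly with pandas. This sketch assumes `plot_correlation_matrix` is based on pairwise Pearson correlation, as `DataFrame.corr` computes by default; the notebook does not confirm this.

```python
import pandas as pd

correlation_data = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

# Pairwise Pearson correlation of the numeric columns.
corr = correlation_data.corr(numeric_only=True)
print(corr.round(2))
```

With this sample, B equals A + 3 and C equals A + 6, so every coefficient is 1.0 and a heatmap of it would be a single uniform color. Data with less perfectly related columns makes for a more informative plot.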