# Interactive Tools for Preparing Data for Analysis

This project creates interactive tools for cleaning and converting data for use in analysis tools.

In an automated system, this work would be done by an ETL pipeline that makes many assumptions about the data and checks it as it is transformed from the input format to the required analysis format(s).

This project uses Streamlit and iTables, which both use the "Arrow" format for serializing data. 
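Pandas uses its own internal (but standardized) datatype system, so a DataFrame has to be converted to Arrow before it is displayed.

The sections below explain where that conversion can fail and how to avoid it.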

The error described here occurs when a DataFrame is serialized to an Arrow table. Arrow has strict requirements for data types, and certain data types in a DataFrame are not compatible with Arrow's serialization process.

### Explanation of the Error

1. **Arrow Serialization**:
   - Arrow requires a single, well-defined data type per column. If a column contains mixed or unsupported data types, Arrow will fail to serialize the DataFrame.
   - The error message indicates that the `Closed` column contains `datetime.datetime` objects where Arrow expected a different type. This typically happens when a column with `object` dtype mixes datetimes with other values such as strings.
   - For the `Column_Datatype` column, Arrow did not recognize the Python value type when inferring an Arrow data type. That message usually means the cells hold Python objects (for example NumPy dtype objects such as `dtype('int64')`) rather than plain `int64` values.

2. **Pandas**:
   - Pandas is flexible with data types and can handle mixed data types within a column. However, this flexibility can cause issues when interfacing with other libraries like Arrow that have stricter requirements.

3. **Streamlit**:
   - Streamlit uses Arrow for efficient data transfer between the server and the client. Therefore, DataFrames passed to Streamlit components must be Arrow-compatible.

4. **iTables**:
   - iTables is a library for interactive tables in Jupyter notebooks. It also relies on Arrow for efficient data handling and visualization.

5. **pyarrow**:
   - Pyarrow is the Python library for Apache Arrow. It provides tools for converting between Pandas DataFrames and Arrow tables, but it requires that the data types be compatible with Arrow.

### Best Practices for Using These Packages Together

1. **Ensure Data Type Compatibility**:
   - Before passing a DataFrame to Streamlit or iTables, ensure that all columns have compatible data types.
   - Convert `datetime` columns to Pandas `datetime64` format.
   - Ensure that all columns have consistent data types (e.g., no mixed types within a column).

2. **Use Explicit Data Type Conversions**:
   - Explicitly convert columns to the appropriate data types using Pandas' `astype` method, or a dedicated converter where one exists.
   - For example, convert date columns using `pd.to_datetime` and numeric text using `pd.to_numeric`.

3. **Handle Mixed Data Types**:
   - If a column contains mixed data types, convert it to a single data type that is compatible with Arrow.
   - For example, convert a column with mixed integers and strings to strings using `astype(str)`.

4. **Check Data Types Before Serialization**:
   - Use Pandas' `dtypes` attribute to check the data types of all columns before passing the DataFrame to Streamlit or iTables.
   - Ensure that all data types are compatible with Arrow.
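The practices above can be sketched with pandas alone. `mixed_type_columns` is a hypothetical helper and the sample data is invented; the sketch inspects `dtypes`, lists `object` columns that hold more than one Python type, and then applies explicit conversions.

```python
import pandas as pd


def mixed_type_columns(df):
    """Map each object column to the sorted Python type names it contains,
    keeping only columns with more than one type (hypothetical helper)."""
    report = {}
    for col in df.select_dtypes(include="object").columns:
        type_names = {type(v).__name__ for v in df[col].dropna()}
        if len(type_names) > 1:
            report[col] = sorted(type_names)
    return report


df = pd.DataFrame(
    {
        "Trade": [101, "102-B", 103],  # integers and strings mixed
        "Opened": ["2024-01-05", "2024-01-06", "2024-01-07"],
        "Profit_Loss": [12.5, -3.0, 4.25],
    }
)

# Check the dtypes and look for mixed columns before display.
print(df.dtypes)
print(mixed_type_columns(df))

# Explicit conversions: one type per column.
df["Trade"] = df["Trade"].astype(str)
df["Opened"] = pd.to_datetime(df["Opened"])
print(df.dtypes)
```

Converting a mixed column to strings is the bluntest fix. When the column is meant to be numeric or a date, `pd.to_numeric` or `pd.to_datetime` keeps it usable for calculations.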


### Updated Code with Data Type Compatibility Checks



```python
import pandas as pd
import plotly.express as px
import streamlit as st


def clean_column_names(df):
    """Replace spaces and slashes with underscores and drop parentheses."""
    return df.rename(
        columns=lambda x: str(x).replace(' ', '_').replace('/', '_').replace('(', '').replace(')', '')
    )


def analyze_trades(df, trade_col, opened_col, max_margin_col, profit_loss_col):
    """Add a running profit/loss total in chronological order.

    trade_col and max_margin_col are accepted but not used yet.
    """
    df = df.sort_values(opened_col).reset_index(drop=True)
    df['Cumulative_Profit_Loss'] = df[profit_loss_col].cumsum()
    return df


def create_plot(df, opened_col):
    """Line chart of cumulative profit/loss against the opening date."""
    fig = px.line(df, x=opened_col, y='Cumulative_Profit_Loss', title='Cumulative Profit/Loss Over Time')
    fig.update_xaxes(title='Date')
    fig.update_yaxes(title='Cumulative Profit/Loss')
    return fig


def ensure_arrow_compatibility(df):
    """Return a copy in which every column has one consistent dtype."""
    df = df.copy()
    for col in df.columns:
        if pd.api.types.is_datetime64_any_dtype(df[col]):
            df[col] = pd.to_datetime(df[col])
        elif pd.api.types.is_integer_dtype(df[col]):
            df[col] = df[col].astype('int64')
        elif pd.api.types.is_float_dtype(df[col]):
            df[col] = df[col].astype('float64')
        elif pd.api.types.is_object_dtype(df[col]):
            # Object columns become strings; missing values may turn into text as well.
            df[col] = df[col].astype(str)
    return df


def main():
    st.title("Trade Transaction Performance Analyzer")

    # File upload
    file = st.file_uploader("Upload your XLSX or CSV file", type=["xlsx", "csv"])
    if not file:
        st.stop()

    # Read the file
    if file.name.lower().endswith('.xlsx'):
        df = pd.read_excel(file)
    else:
        df = pd.read_csv(file)

    # Clean column names
    df = clean_column_names(df)

    # Ensure Arrow compatibility
    df = ensure_arrow_compatibility(df)

    # Display column names
    st.subheader("Available columns:")
    st.write(df.columns.tolist())

    # Get user input for column selection
    st.subheader("Select columns for analysis:")
    trade_col = st.selectbox("Trade column", df.columns)
    opened_col = st.selectbox("Opened column", df.columns)
    max_margin_col = st.selectbox("Maximum Margin column", df.columns)
    profit_loss_col = st.selectbox("Profit/Loss column", df.columns)

    # The select boxes all default to the first column, so wait for distinct choices
    if opened_col == profit_loss_col:
        st.warning("Select different columns for Opened and Profit/Loss.")
        st.stop()

    # Parse the selected columns; values that cannot be parsed become NaT/NaN
    df[opened_col] = pd.to_datetime(df[opened_col], errors='coerce')
    df[profit_loss_col] = pd.to_numeric(df[profit_loss_col], errors='coerce')
    df = df.dropna(subset=[opened_col])
    if df.empty:
        st.error("The selected Opened column contains no valid dates.")
        st.stop()

    # Analyze trades
    df = analyze_trades(df, trade_col, opened_col, max_margin_col, profit_loss_col)

    # Display data, newest first (by 'Closed' when that column exists)
    sort_col = 'Closed' if 'Closed' in df.columns else opened_col
    st.subheader("Trade data:")
    st.dataframe(df.sort_values(sort_col, ascending=False))

    # Create and display plot (distinct keys keep the two charts apart)
    st.subheader("Cumulative Profit/Loss Chart:")
    fig = create_plot(df, opened_col)
    st.plotly_chart(fig, use_container_width=True, key="full_chart")

    # Allow user to select date range for chart
    st.subheader("Select date range for chart:")
    start_date = st.date_input("Start date", value=df[opened_col].min().date())
    end_date = st.date_input("End date", value=df[opened_col].max().date())

    # st.date_input returns datetime.date; convert before comparing with datetime64
    start = pd.Timestamp(start_date)
    end = pd.Timestamp(end_date) + pd.Timedelta(days=1)

    # Filter data and create new plot
    filtered_df = df[(df[opened_col] >= start) & (df[opened_col] < end)]
    fig = create_plot(filtered_df, opened_col)
    st.subheader("Filtered Cumulative Profit/Loss Chart:")
    st.plotly_chart(fig, use_container_width=True, key="filtered_chart")


if __name__ == "__main__":
    main()
```
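
The date-range filter in the app compares a `datetime64` column with the plain `datetime.date` values that `st.date_input` returns. Depending on the pandas version, that direct comparison can raise a `TypeError`. The sketch below shows one way to do the comparison without Streamlit, using fixed dates as stand-ins for the widget values and invented sample data.

```python
import datetime as dt

import pandas as pd

# Stand-ins for the values st.date_input would return (plain date objects).
start_date = dt.date(2024, 1, 6)
end_date = dt.date(2024, 1, 7)

df = pd.DataFrame(
    {
        "Opened": pd.to_datetime(
            ["2024-01-05 09:30", "2024-01-06 10:00", "2024-01-07 15:45", "2024-01-08 11:15"]
        ),
        "Profit_Loss": [10.0, -4.0, 6.5, 2.0],
    }
)

# Convert the dates to Timestamps. The end bound is the following midnight
# (exclusive) so that trades opened during end_date are kept.
start = pd.Timestamp(start_date)
end = pd.Timestamp(end_date) + pd.Timedelta(days=1)

filtered_df = df[(df["Opened"] >= start) & (df["Opened"] < end)]
print(filtered_df)
```

Using an exclusive upper bound avoids dropping trades that were opened later in the day than midnight on the selected end date.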



### Explanation

1. **Ensure Arrow Compatibility**:
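   - The `ensure_arrow_compatibility` function converts columns to data types that are compatible with Arrow.
   - Columns that already have a `datetime64` dtype are passed through `pd.to_datetime`, which leaves them as `datetime64`.
   - Integer and float columns are converted to `int64` and `float64`, respectively.
   - Object columns are converted to strings. This includes object columns that hold Python `datetime.datetime` values, so a date column stored as objects needs `pd.to_datetime` before it can be used as a date.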

2. **Use Explicit Data Type Conversions**:
   - The `ensure_arrow_compatibility` function explicitly converts columns to the appropriate data types using Pandas' `astype` method.

3. **Check Data Types Before Serialization**:
   - The `ensure_arrow_compatibility` function checks the data types of all columns before passing the DataFrame to Streamlit.
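
One gap in this approach: an `object` column that holds only Python `date` or `datetime` objects is not matched by `is_datetime64_any_dtype`, so `ensure_arrow_compatibility` turns it into strings. The sketch below shows one possible pre-processing step, assuming such columns should become `datetime64` instead. `promote_datetime_objects` is a hypothetical helper and the sample data is invented.

```python
import datetime as dt

import pandas as pd


def promote_datetime_objects(df):
    """Return a copy in which object columns made up solely of date/datetime
    objects (plus missing values) are converted to datetime64.

    Hypothetical helper for illustration.
    """
    out = df.copy()
    for col in out.select_dtypes(include="object").columns:
        values = out[col].dropna()
        if len(values) and all(isinstance(v, dt.date) for v in values):
            out[col] = pd.to_datetime(out[col])
    return out


df = pd.DataFrame(
    {
        "Closed": pd.Series(
            [dt.datetime(2024, 1, 5, 16, 0), None, dt.datetime(2024, 1, 7, 16, 0)],
            dtype=object,
        ),
        "Trade": ["A1", "B2", "C3"],
    }
)

print(df.dtypes)
print(promote_datetime_objects(df).dtypes)
```

Calling a helper like this before `ensure_arrow_compatibility` would keep such columns sortable and filterable as dates. Columns that mix datetimes with strings are deliberately left alone here, since parsing arbitrary strings needs a decision about formats.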

### Running the Code

Save the updated code as a script and start it with `streamlit run`, passing the script's path. The data type conversions should let the DataFrame serialize without Arrow errors. If serialization still fails, the error message usually names the offending column, so check that column's dtype and contents first.