<h1 style="color: red; font-size: 40px; text-align: center;">
    Federated Learning Model Training 🌐
</h1>

<center>
    <img src="https://media4.giphy.com/media/ZVik7pBtu9dNS/giphy.gif"
         alt="federated learning animation" height="250" width="500">
</center>


# Secure Collaboration Without Data Sharing 🤖🩺



# 👋 Introduction
<div class="alert alert-block alert-success" style="font-family: verdana; font-size: 20px; line-height: 1.7em; border-radius: 1.3em;">
    “Connecting Intelligence, Protecting Privacy.”

In today’s data-driven world, organizations and devices collect massive amounts of information — but privacy, security, and ownership concerns often prevent sharing that data for collaborative learning.

Federated Learning solves this challenge by enabling multiple participants (such as hospitals, banks, or mobile devices) to train a shared machine learning model without ever exchanging raw data.

Instead of sending data to a central server, each participant trains the model locally and shares only the learned parameters. These updates are then aggregated to form a global model — ensuring data confidentiality, reduced communication costs, and collective intelligence across distributed networks.

This approach empowers industries to innovate collaboratively while maintaining compliance with strict data privacy regulations.

⚙️ Federated Learning = Local Training + Secure Aggregation + Global Intelligence.
</div>

# 📚 Problem Statement

<div class="alert alert-block alert-info" style="font-family: verdana; font-size: 20px; line-height: 1.7em; border-radius: 1.3em;">
    In traditional machine learning, training a model requires centralizing all data in one location. However, with growing concerns over data privacy, security, and legal restrictions, organizations and devices cannot always share their sensitive information.

This limitation prevents industries such as healthcare, finance, and mobile technology from fully utilizing the power of collective data to improve intelligent systems.

Federated Learning addresses this challenge by introducing a privacy-preserving decentralized approach where multiple clients collaboratively train a global model without sharing raw data. Each client performs local training and only transmits the learned parameters (weights or gradients) to a central server.
</div>

# 🔭 Feature Description

<div style="font-family:verdana; font-size: 20px; line-height: 1.7em;">
<p>The Federated Learning System offers a set of features designed to ensure data privacy, secure collaboration, and efficient distributed model training.
Each feature contributes to a robust, privacy-preserving machine learning framework.</p>
<ol>
<li><strong>🔐 Privacy-Preserving Learning</strong>
<ul>
<li>Data remains stored locally on each participating client or device.</li>
<li>Only model parameters (weights or gradients) are shared with the server.</li>
<li>Limits exposure of sensitive or personal data, since raw records never leave the client.</li>
<li>Supports compliance with data protection laws such as GDPR and HIPAA.</li>
</ul>
</li>
<li><strong>🌐 Decentralized Model Training</strong>
<ul>
<li>Enables multiple clients (such as hospitals, banks, or IoT devices) to collaboratively train a shared model.</li>
<li>No central data repository is needed; the system relies on distributed training.</li>
<li>Each client contributes to the learning process using its own local dataset.</li>
</ul>
</li>
<li><strong>⚙️ Global Model Aggregation (FedAvg Algorithm)</strong>
<ul>
<li>The central server aggregates model updates from all clients to form a global model.</li>
<li>Uses Federated Averaging (FedAvg), which averages client weights in proportion to each client's local data size.</li>
<li>Improves the global model while each client's data stays in place.</li>
</ul>
</li>
<li><strong>📶 Communication Efficiency</strong>
<ul>
<li>Reduces bandwidth usage by sharing only model updates instead of raw data.</li>
<li>Optimizes the synchronization process to minimize communication delay.</li>
<li>Implements periodic update strategies to maintain performance on real-time networks.</li>
</ul>
</li>
<li><strong>🧠 Scalability and Adaptability</strong>
<ul>
<li>Scales to a large number of clients or devices.</li>
<li>Supports both cross-device (mobile/IoT) and cross-silo (institutional) learning.</li>
<li>Adapts to various network conditions and heterogeneous hardware setups.</li>
</ul>
</li>
<li><strong>🔄 Fault Tolerance and Robustness</strong>
<ul>
<li>The system can handle client dropouts, network interruptions, or partial participation without halting training.</li>
<li>The global model remains stable and continues to improve even with incomplete client updates.</li>
</ul>
</li>
<li><strong>📊 Performance Monitoring and Evaluation</strong>
<ul>
<li>Provides tools to evaluate model performance after each training round.</li>
<li>Tracks metrics such as accuracy, loss, and convergence speed.</li>
<li>Enables comparison between federated and centralized learning approaches.</li>
</ul>
</li>
<li><strong>🏥 Real-World Applicability</strong>
<ul>
<li>Designed to work in sensitive and distributed domains such as:</li>
<li>Healthcare: hospitals can train diagnostic models without sharing patient records.</li>
<li>Finance: banks can detect fraud collaboratively while maintaining customer privacy.</li>
<li>Mobile Systems: smartphones can improve predictive text models collectively.</li>
</ul>
</li>
<li><strong>🔒 Security and Encryption (Optional Extension)</strong>
<ul>
<li>Can integrate privacy-enhancing techniques such as differential privacy and homomorphic encryption.</li>
<li>Adds an extra layer of protection during communication and model aggregation.</li>
</ul>
</li>
<li><strong>🧩 Customizable Architecture</strong>
<ul>
<li>Modular design allows integration with various ML frameworks (TensorFlow, PyTorch, etc.).</li>
<li>Supports flexible configurations for client selection, update frequency, and aggregation strategies.</li>
</ul>
</li>
</ol>
</div>
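
The aggregation step described under Global Model Aggregation can be sketched in a few lines of NumPy. This is a minimal illustration of the weighted average that FedAvg uses, not this project's server code: `fedavg` and the toy client weights are hypothetical names and values, and each client's model is assumed to be a list of arrays (one per layer).

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Weighted average of per-client model weights (FedAvg aggregation step).

    client_weights: one entry per client; each entry is a list of NumPy
                    arrays (one array per layer).
    client_sizes:   number of local training examples held by each client.
    """
    total = float(sum(client_sizes))
    num_layers = len(client_weights[0])
    averaged = []
    for layer in range(num_layers):
        # Scale each client's layer by that client's share of the total data
        layer_sum = sum(
            (size / total) * weights[layer]
            for weights, size in zip(client_weights, client_sizes)
        )
        averaged.append(layer_sum)
    return averaged

# Toy example: two clients, a model with one weight matrix and one bias vector
client_a = [np.array([[1.0, 2.0]]), np.array([0.0])]
client_b = [np.array([[3.0, 6.0]]), np.array([4.0])]

global_weights = fedavg([client_a, client_b], client_sizes=[100, 300])
print(global_weights[0])  # aggregated weight matrix
print(global_weights[1])  # aggregated bias
```

Because the second client holds three times as much data, the result sits three quarters of the way toward its weights: `[[2.5, 5.0]]` for the matrix and `[3.0]` for the bias.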

# 🎯 Project Goals

<div class="alert alert-block alert-warning" style="font-family: verdana; font-size: 20px; line-height: 1.7em; border-radius: 1.3em;">
    <ul>
        <li>Preserve Data Privacy – Train models collaboratively without sharing raw data.</li>
        <li>Enable Decentralized Learning – Allow multiple clients to train on their own data and contribute to a global model.</li>
        <li>Improve Model Accuracy – Achieve performance close to centralized models through federated aggregation.</li>
        <li>Ensure Data Security – Protect communication and updates between clients and the server.</li>
        <li>Promote Ethical AI – Build intelligent systems that respect privacy and data ownership.</li>
    </ul>
</div>


# 🤖 Deep Learning Disease Prediction Model

<div class="alert alert-block alert-success" style="font-family: verdana; font-size: 20px; line-height: 1.7em; border-radius: 1.3em;">🔍 Data Preprocessing: Uses Label Encoding, One-Hot Encoding, and Standard Scaling for clean and normalized data

🏗️ Model Architecture: Builds a Sequential deep learning model with Dense and Dropout layers for improved accuracy and reduced overfitting

📊 Model Training & Evaluation: Splits data into training and testing sets to evaluate model performance effectively</div>
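
The notebook describes the architecture but does not include the model definition, so the following is only a sketch of what a Sequential network with Dense and Dropout layers could look like. The layer widths, the dropout rate, and the `num_features` and `num_classes` values are illustrative assumptions, not values taken from the dataset.

```python
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Input

def build_model(num_features, num_classes):
    """Small feed-forward classifier; all sizes here are illustrative assumptions."""
    model = Sequential([
        Input(shape=(num_features,)),
        Dense(64, activation="relu"),
        Dropout(0.3),  # randomly zeroes 30% of activations during training
        Dense(32, activation="relu"),
        Dropout(0.3),
        Dense(num_classes, activation="softmax"),  # one probability per class
    ])
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",  # expects one-hot encoded targets
        metrics=["accuracy"],
    )
    return model

model = build_model(num_features=17, num_classes=4)  # hypothetical sizes
model.summary()
```

The softmax output paired with `categorical_crossentropy` matches a one-hot encoded target; with integer class labels, `sparse_categorical_crossentropy` would be the corresponding loss.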

In [None]:
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"  # Hide TensorFlow logs
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"  # Disable GPU completely

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder
import pandas as pd
import numpy as np


# Loading Dataset for Disease Prediction

📥 Data Import: Reads CSV data file using pandas for easy data manipulation

🗂️ Dataset Structure: Ensures data is ready for preprocessing and analysis

🔧 Flexible Path: Update the file path to suit your environment (e.g., Kaggle or local machine)

In [None]:
import pandas as pd
# Load dataset
df = pd.read_csv("/kaggle/input/federate/data.csv")  # Update with your file path



In [None]:
df.head()

In [None]:
print(df.columns.tolist())


# Essential Libraries for Data Analysis and Visualization

🐼 Pandas: Powerful data manipulation and analysis tool

📈 Matplotlib: Core plotting library for creating static, animated, and interactive visuals

🌟 Seaborn: Built on Matplotlib, provides enhanced statistical data visualization with beautiful default styles

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Identify categorical and numerical columns
categorical_cols = ["Gender", "Disease_Type", "Severity", "Physical_Activity_Level",
                    "Dietary_Restrictions", "Allergies", "Preferred_Cuisine"]
numerical_cols = ["Age", "Weight_kg", "Height_cm", "BMI", "Daily_Caloric_Intake",
                  "Cholesterol_mg/dL", "Blood_Pressure_mmHg", "Glucose_mg/dL",
                  "Weekly_Exercise_Hours", "Adherence_to_Diet_Plan", "Dietary_Nutrient_Imbalance_Score"]

In [None]:
df.columns


In [None]:
print(df.isnull().sum())


# Visualizing Categorical Data Distributions

<div class="alert alert-block alert-success" style="font-family: verdana; font-size: 20px; line-height: 1.7em; border-radius: 1.3em;">🎯 Targeted Plotting: Automatically checks if each categorical column exists before plotting

🌈 Colorful Visualization: Uses the “viridis” palette for aesthetically pleasing charts

🔄 Clear Labeling: Rotates x-axis labels for better readability and adds informative titles</div>

In [None]:
for col in categorical_cols:
    if col in df.columns:
        plt.figure(figsize=(6, 4))
        sns.countplot(data=df, x=col, palette="viridis")
        plt.xticks(rotation=45)
        plt.title(f"Distribution of {col}")
        plt.show()
    else:
        print(f"Warning: {col} not found in dataset.")


# Data Cleaning & Visualization for Categorical and Numerical Features

🧩 Smart Column Selection: Filters dataset to include only specified categorical and numerical columns

🛠️ Data Cleaning: Converts columns to numeric and fills missing values with median for robustness

🎨 Insightful Visualization: Displays count plots for categorical variables with clear labels and appealing colors

In [None]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=pd.errors.SettingWithCopyWarning)

# Keep only the relevant columns that are actually present in the dataset
categorical_cols = [col for col in categorical_cols if col in df.columns]
numerical_cols = [col for col in numerical_cols if col in df.columns]
df = df[categorical_cols + numerical_cols].copy()

# Convert numerical columns; values that cannot be parsed become NaN
df[numerical_cols] = df[numerical_cols].apply(pd.to_numeric, errors='coerce')

# Fill missing values with each column's median
df[numerical_cols] = df[numerical_cols].fillna(df[numerical_cols].median())

# Set Seaborn style
sns.set_style("whitegrid")

# 1. Categorical Columns: Count Plots
plt.figure(figsize=(12, 6))
for i, col in enumerate(categorical_cols):
    if col in df.columns:
        plt.subplot(2, 4, i + 1)
        sns.countplot(data=df, x=col, palette="viridis")
        plt.xticks(rotation=45)
        plt.title(f"Distribution of {col}")
plt.tight_layout()
plt.show()
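
To make the cleaning step concrete, here is a small sketch on made-up values (the `raw` series is hypothetical and not taken from the dataset). It shows how `errors="coerce"` turns unparseable entries into NaN, which the median fill then replaces.

```python
import pandas as pd

# Hypothetical raw column mixing numeric strings, a placeholder, and a missing value
raw = pd.Series(["70", "n/a", "82", None, "76"])

numeric = pd.to_numeric(raw, errors="coerce")  # "n/a" and None become NaN
filled = numeric.fillna(numeric.median())      # median skips NaN by default

print(numeric.tolist())
print(filled.tolist())
```

If a column stores values in a compound format such as `"120/80"`, every entry is coerced to NaN and the median fill has nothing to work with, so it is worth checking `numeric.isna().sum()` after conversion.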


# Visualizing Numerical Data Distributions

📊 Detailed Histograms: Plots individual histograms for all numerical columns

🎨 Custom Styling: Uses consistent color and edge styling for better readability

🔎 Quick Insights: Helps identify data skewness, outliers, and distribution patterns

In [None]:
# 2. Numerical Columns: Histograms
df[numerical_cols].hist(figsize=(12, 8), bins=20, color='skyblue', edgecolor='black')
plt.suptitle("Histograms of Numerical Features", fontsize=14)
plt.show()



# 🛠️ Feature Engineering & Preprocessing

<div class="alert alert-block alert-info" style="font-family: verdana; font-size: 20px; line-height: 1.7em; border-radius: 1.3em;">
    <ul>
        <li><strong>Encoding Categorical Variables:</strong> Convert categorical features into numerical format for ML models. 
        Options include:
            <ul>
                <li>One-hot encoding for features with no ordinal relationship (e.g., <code>Gender</code>, <code>Preferred_Cuisine</code>).</li>
                <li>Ordinal encoding if categories have an inherent order (possibly <code>Severity</code> or <code>Physical_Activity_Level</code>, depending on their values); this is optional.</li>
            </ul>
        </li>
        <li><strong>Scaling Numerical Features:</strong> Standardize or normalize features like <code>Age</code>, <code>BMI</code> to help gradient-based models converge faster.</li>
        <li><strong>Feature Transformation:</strong> Apply transformations to reduce skewness or highlight patterns:
            <ul>
                <li>Log transformation for skewed, non-negative features (if needed)</li>
                <li>Polynomial features or interaction terms (e.g., <code>BMI × Weekly_Exercise_Hours</code>) to capture non-linear relationships</li>
            </ul>
        </li>
        <li><strong>Temporal Features:</strong> The columns used here include no time feature. If a cyclic one is added (e.g., hour of day), consider encoding it with sine and cosine transformations to preserve circularity.</li>
        <li><strong>Target Transformation (Optional):</strong> A log transformation can stabilize highly skewed continuous targets. It does not apply to the categorical target used here, <code>Disease_Type</code>, which is one-hot encoded instead.</li>
    </ul>
</div>
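
As a rough illustration of two of the transformations listed above, the sketch below applies a log transform to a skewed column and a sine/cosine encoding to a cyclic one. Both input arrays are made up for the example; `hour` stands in for any cyclic feature with a known period.

```python
import numpy as np

# Hypothetical skewed, non-negative values
skewed = np.array([0.0, 9.0, 99.0, 999.0])
log_scaled = np.log1p(skewed)  # log(1 + x) stays defined at x = 0

# Hypothetical cyclic feature: hour of day, period 24
hour = np.array([0, 6, 12, 18, 23])
angle = 2 * np.pi * hour / 24
hour_sin, hour_cos = np.sin(angle), np.cos(angle)

def circular_distance(i, j):
    """Euclidean distance between two hours after the sine/cosine encoding."""
    return np.hypot(hour_sin[i] - hour_sin[j], hour_cos[i] - hour_cos[j])

print(np.round(log_scaled, 3))
# After encoding, 23:00 is much closer to 00:00 than 12:00 is
print(circular_distance(0, 4), circular_distance(0, 2))
```

With the raw integer hours, 0 and 23 would look maximally far apart; the two-column encoding places them next to each other on the unit circle.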

# Outlier Detection with Boxplots

<div class="alert alert-block alert-success" style="font-family: verdana; font-size: 20px; line-height: 1.7em; border-radius: 1.3em;">📦 Boxplots for Each Feature: Displays the distribution, median, quartiles, and outliers

🌈 Color-Coded Visualization: Uses a coolwarm palette for easy differentiation

🚨 Outlier Identification: Quickly spots potential anomalies that may impact analysis</div>

In [None]:
# 3. Boxplots for Outlier Detection
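rows = -(-len(numerical_cols) // 3)  # ceiling division: just enough rows for 3 plots per row
plt.figure(figsize=(12, 3 * rows))   # scale the height with the number of rows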
for i, col in enumerate(numerical_cols):
    plt.subplot(rows, 3, i + 1)
    sns.boxplot(data=df, y=col, palette="coolwarm")
    plt.title(f"Boxplot of {col}")
plt.tight_layout()
plt.show()
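
By default, boxplots draw points lying more than 1.5 × IQR beyond the quartiles as outliers. A numeric version of the same rule can complement the plots; the sketch below uses made-up readings, and `iqr_outliers` is a hypothetical helper name.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.nanpercentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)

# Hypothetical readings with one extreme value
readings = np.array([88, 92, 95, 99, 101, 104, 110, 240])
mask = iqr_outliers(readings)
print(readings[mask])
```

Applied per column, `mask.sum()` gives an outlier count that is easier to compare across features than a grid of plots.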


# Exploring Feature Relationships & Correlations


🔄 Pairplot Visualization: Shows scatterplots and KDE diagonals to explore feature distributions and relationships

🧮 Sample Optimization: Limits data size for faster plotting without losing insight

🌡️ Correlation Heatmap: Highlights positive and negative correlations for better feature understanding

In [None]:
import pandas as pd

# Replace the path below with your actual dataset file
df = pd.read_csv("/kaggle/input/federate/data.csv")

# Show first few rows to confirm it loaded
df.head()


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings("ignore", category=FutureWarning)

# Example: Load dataset (change file path or source as needed)
# df = pd.read_csv("your_dataset.csv")

# Automatically detect numerical columns
numerical_cols = df.select_dtypes(include=[np.number]).columns

# Clean Data: Replace inf/-inf with NaN
df[numerical_cols] = df[numerical_cols].replace([np.inf, -np.inf], np.nan)

# 4. Pairplot for Relationships (Use a sample if dataset is large)
if len(df) > 500:
    sample_df = df.sample(500, random_state=42)
else:
    sample_df = df

sns.pairplot(sample_df[numerical_cols], diag_kind='kde', corner=True)
plt.suptitle("Pairplot of Numerical Features", y=1.02)
plt.show()

# 5. Correlation Heatmap
plt.figure(figsize=(10, 6))
corr_matrix = df[numerical_cols].corr()
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap", fontsize=14, fontweight='bold')
plt.show()



# Encoding & Scaling Features

<div class="alert alert-block alert-success" style="font-family: verdana; font-size: 20px; line-height: 1.7em; border-radius: 1.3em;">🔤 Label Encoding: Converts categorical variables into numerical labels for algorithm compatibility

⚙️ Storing Encoders: Keeps encoders handy for reversing transformations later

📈 Feature Scaling: Standardizes numerical features to have zero mean and unit variance for balanced model input </div>

In [None]:
# 🧠 Step 1: Fall back to a sample dataset only when no data is loaded (no internet required)
import pandas as pd
import numpy as np

if "df" not in globals():
    # Synthetic stand-in for demonstration; it has no 'Disease_Type' column,
    # so the target-encoding cells further down need the real dataset
    rng = np.random.default_rng(42)  # fixed seed for reproducibility
    df = pd.DataFrame({
        "age": rng.integers(18, 70, 200),
        "fare": rng.uniform(10, 250, 200),
        "sex": rng.choice(["male", "female"], 200),
        "class": rng.choice(["First", "Second", "Third"], 200),
        "embarked": rng.choice(["C", "Q", "S"], 200)
    })


In [None]:
# 🧩 Step 2: Identify column types
categorical_cols = df.select_dtypes(include=['object', 'category']).columns
numerical_cols = df.select_dtypes(include=[np.number]).columns

# 🧹 Step 3: Handle infinite values
df[numerical_cols] = df[numerical_cols].replace([np.inf, -np.inf], np.nan)

# 🏷️ Step 4: Encode categorical features
from sklearn.preprocessing import LabelEncoder, StandardScaler
label_encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col].astype(str))
    label_encoders[col] = le

# ⚖️ Step 5: Scale numerical features
scaler = StandardScaler()
df[numerical_cols] = scaler.fit_transform(df[numerical_cols])

print("\n✅ Data encoding & scaling completed successfully!\n")

# ==========================
# 📊 Visualization
# ==========================
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

# Pairplot
if len(df) > 500:
    sample_df = df.sample(500, random_state=42)
else:
    sample_df = df

sns.pairplot(sample_df[numerical_cols], diag_kind='kde', corner=True)
plt.suptitle("Pairplot of Numerical Features", y=1.02)
plt.show()

# Correlation Heatmap
plt.figure(figsize=(10, 6))
corr_matrix = df[numerical_cols].corr()
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap", fontsize=14, fontweight='bold')
plt.show()
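
Keeping the fitted encoders matters mostly for decoding. The following sketch, on made-up columns, shows the round trip with scikit-learn's `LabelEncoder.inverse_transform` and checks what standardization does to a numeric column; the data and variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical categorical column
severity = np.array(["Mild", "Severe", "Moderate", "Mild"])

encoder = LabelEncoder()
codes = encoder.fit_transform(severity)     # classes are sorted alphabetically
decoded = encoder.inverse_transform(codes)  # recover the original labels

print(encoder.classes_)
print(codes)
print(decoded)

# Hypothetical numeric column; StandardScaler expects a 2-D array
ages = np.array([[20.0], [30.0], [40.0], [50.0]])
scaler = StandardScaler()
scaled = scaler.fit_transform(ages)
restored = scaler.inverse_transform(scaled)  # back to the original units

print(scaled.mean(), scaled.std())  # approximately 0 and 1
```

Because label encoding assigns integers in alphabetical order, the codes carry no meaningful ranking. For features fed to a neural network, one-hot encoding avoids implying an order that is not there.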


# One-Hot Encoding Target Variable

🎯 Checking for 'Disease_Type' column: Ensures the target column exists before encoding

🔢 One-Hot Encoding: Converts categorical target into binary columns for multi-class classification

✨ Error Handling: Prints helpful messages if the target column is missing, aiding debugging

In [None]:
# One-hot encode target variable
# Check if 'Disease_Type' is in the DataFrame columns
if 'Disease_Type' in df.columns:
    ohe = OneHotEncoder(sparse_output=False)  # sparse_output replaces the older 'sparse' argument (scikit-learn >= 1.2)
    target_encoded = ohe.fit_transform(df[['Disease_Type']])
    target_labels = ohe.categories_[0]
else:
    print("Error: 'Disease_Type' column not found in DataFrame.")
    # Add debugging steps to find where the column was lost or renamed
    print("Current DataFrame columns:", df.columns)
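
Once the target is one-hot encoded, model outputs are probability vectors that need to be mapped back to class names. A minimal NumPy sketch of both directions follows; the class names and probabilities are invented for the example.

```python
import numpy as np

# Hypothetical class names, sorted the way an encoder would store them
classes = np.array(["Diabetes", "Hypertension", "Obesity"])
labels = np.array(["Obesity", "Diabetes", "Obesity", "Hypertension"])

# Encode: look up each label's index, then pick the matching row of the identity matrix
indices = np.searchsorted(classes, labels)  # valid because classes is sorted
one_hot = np.eye(len(classes))[indices]

# Decode: the arg-max column of a one-hot (or probability) row gives the class index
predicted_probs = np.array([[0.1, 0.2, 0.7],
                            [0.8, 0.1, 0.1]])
predicted_labels = classes[predicted_probs.argmax(axis=1)]

print(one_hot)
print(predicted_labels)
```

With the fitted encoder from the cell above, `ohe.categories_[0]` plays the role of `classes` here.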

# Merging Encoded Target with Features

🗂️ Converts the one-hot encoded target array into a DataFrame with proper column names

➕ Concatenates the encoded target DataFrame with the existing features DataFrame

🧹 Placeholder for dropping unnecessary columns (commented out for now, needs actual column name)

In [None]:
# Convert target variable to DataFrame (reuse df's index so the concat below aligns row by row)
target_df = pd.DataFrame(target_encoded, columns=target_labels, index=df.index)

# Merge encoded target variable with features
# df = df.drop(columns=[""])  # Remove this line or replace with the actual column name to drop
df = pd.concat([df, target_df], axis=1)

# Federated Data Preparation for Multiple Clients

<div class="alert alert-block alert-success" style="font-family: verdana; font-size: 20px; line-height: 1.7em; border-radius: 1.3em;">🧩 Splits the dataset randomly into num_clients subsets to simulate different clients in federated learning.

🔄 Shuffles data before splitting to ensure randomness.

🔧 Converts each client’s subset into a batched <code>tf.data.Dataset</code>, the per-client input format that TensorFlow Federated (TFF) simulations accept.

📦 Prepares a list of client datasets ready to be used for federated training.</div>

In [None]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"  # Disable GPU usage for TensorFlow
import tensorflow as tf


In [None]:
import numpy as np
import tensorflow as tf
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# 🧠 Ensure you have df defined already (your dataset)

# --------------------------
# Define target column(s)
# --------------------------
# Reuse the one-hot target columns from the encoding step when they exist;
# otherwise fall back to the last column of df
if "target_df" in globals():
    target_labels = list(target_df.columns)
else:
    target_labels = [df.columns[-1]]
print("Using target column(s):", target_labels)

# --------------------------
# Split dataset into clients
# --------------------------
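def split_clients(data, num_clients=5):
    # Shuffle once, then split row positions so each client gets a DataFrame slice
    shuffled = data.sample(frac=1, random_state=42)
    chunks = np.array_split(np.arange(len(shuffled)), num_clients)
    return [shuffled.iloc[idx] for idx in chunks]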

client_data = split_clients(df, num_clients=5)

# --------------------------
# Convert client datasets to TFF format
# --------------------------
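def create_tff_dataset(client_dataset):
    # Drop the targets and the original 'Disease_Type' column so labels do not leak into the features
    feature_df = client_dataset.drop(columns=list(target_labels) + ["Disease_Type"], errors="ignore")
    features = feature_df.values.astype(np.float32)
    labels = client_dataset[target_labels].values.astype(np.float32)
    return tf.data.Dataset.from_tensor_slices((features, labels)).batch(16)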

federated_train_data = [create_tff_dataset(client) for client in client_data]

# --------------------------
# Verify
# --------------------------
print(f"✅ Number of clients: {len(federated_train_data)}")
print(f"✅ First client batch element spec:\n{federated_train_data[0].element_spec}")
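
The client datasets prepared above are the input to federated training, which the notebook does not go on to run. As a sketch of what a training loop over such clients involves, the NumPy example below runs a few FedAvg-style rounds on synthetic data with a linear model. Everything in it (the data, `local_update`, the learning rate, the number of rounds) is an assumption made for illustration; a real run would use the Keras model and the client datasets instead.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])

# Synthetic stand-in for per-client data: three clients of different sizes
clients = []
for n in (40, 60, 100):
    X = rng.normal(size=(n, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=n)
    clients.append((X, y))

def local_update(w, X, y, lr=0.1, epochs=5):
    """Run a few gradient-descent steps on one client's data, starting from w."""
    w = w.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of mean squared error
        w -= lr * grad
    return w

global_w = np.zeros(2)
sizes = np.array([len(y) for _, y in clients], dtype=float)

for round_num in range(20):
    # Each client trains locally; only the resulting weights reach the server
    local_ws = np.stack([local_update(global_w, X, y) for X, y in clients])
    # Server step: average the client weights, weighted by local dataset size
    global_w = (sizes[:, None] * local_ws).sum(axis=0) / sizes.sum()

print(np.round(global_w, 2))
```

The server never sees `X` or `y`, only each client's updated weights, and the averaged model should still land close to `true_w`.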


# 🧾 Conclusion

This notebook outlines how distributed machine learning can be set up without centralizing user data: it loads and explores the disease dataset, encodes and scales its features, and partitions it into per-client datasets ready for federated training.
By allowing multiple clients (such as devices or organizations) to collaboratively train a shared global model while keeping their data localized, this approach removes the need for central data storage, a major privacy and compliance concern in modern AI systems.

The project highlights several key benefits:

🔒 Enhanced Data Privacy: Sensitive information remains on local devices, ensuring confidentiality.

⚙️ Collaborative Intelligence: Aggregation of decentralized models leads to improved performance without direct data sharing.

🌍 Scalability and Flexibility: The framework can be adapted to numerous domains, such as healthcare, finance, IoT, and edge computing.

⚡ Reduced Communication Overhead: Model updates, not raw data, are exchanged between clients and server.

In conclusion, this system provides a practical foundation for privacy-preserving, decentralized AI, showing that data collaboration is possible without pooling raw data.
With further extensions such as differential privacy, secure aggregation, and adaptive learning rates, Federated Learning can redefine how organizations build AI solutions responsibly in the era of data protection and global collaboration.

<center>
    <img src="https://media3.giphy.com/media/v1.Y2lkPTc5MGI3NjExMXo3ZjUzbG1taXE1eGdkcWNubHkxdTlsNjEzZ2JwY2p2b2hqbTV5aSZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9cw/Gz6nYcm8oXE4dFTC8j/giphy.gif" height="100" width="200">
</center>