
 Import Required Libraries

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import os

**Define File Paths**

Source File: Path of your input CSV file

Target Folder: Path where the transformed CSV file should be saved

In [6]:
source_file = "/content/drive/MyDrive/Colab Notebooks/Source Data/Big data.csv"  # Update with your actual file path
target_folder = "/content/drive/MyDrive/Colab Notebooks/Target Data"  # This folder may not exist initially
target_file = os.path.join(target_folder, "final_transformed_data.csv")  # Full path of the output CSV file


**Ensure Target Folder Exists**

In [7]:
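# Create the folder that will contain the output file, not the file path itself;
# os.makedirs(target_file) would create a directory named final_transformed_data.csv
target_dir = os.path.dirname(target_file)
if not os.path.exists(target_dir):
    os.makedirs(target_dir)
    print(f"Target folder created: {target_dir}")
else:
    print(f"Target folder exists: {target_dir}")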


Target folder created: /content/drive/MyDrive/Colab Notebooks/Target Data/final_transformed_data.csv
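
The folder check above can also be written against the parent directory of the output file, so that the CSV path itself is never turned into a directory. A minimal sketch, assuming a temporary directory stands in for the Drive folder; the `ensure_parent_dir` helper is hypothetical and not part of the notebook:

```python
import os
import tempfile


def ensure_parent_dir(file_path: str) -> str:
    """Create the directory that will contain file_path and return it."""
    parent = os.path.dirname(file_path)
    # exist_ok=True makes the call safe to repeat when the folder already exists
    os.makedirs(parent, exist_ok=True)
    return parent


# A temporary directory stands in for the Google Drive target folder (assumption)
base = tempfile.mkdtemp()
demo_file = os.path.join(base, "Target Data", "final_transformed_data.csv")
created = ensure_parent_dir(demo_file)

# The folder exists afterwards, while the CSV file itself has not been created
print(os.path.isdir(created), os.path.exists(demo_file))
```

Calling the helper a second time is harmless, which suits notebook cells that are often re-run.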


## Load Data from Google Drive

In [8]:
df = pd.read_csv(source_file)
print("Data Loaded Successfully!")
print(df.head())  # Display first few rows


Data Loaded Successfully!
    Order ID Order Date   Ship Date Qtr  Aging            c    Ship Mode  \
0  AU-2015-1  11/9/2015  11/17/2015  Q4      8  first class  First Class   
1  AU-2015-2  6/30/2015    7/2/2015  Q2      2  first class  First Class   
2  AU-2015-3  12/5/2015  12/13/2015  Q4      8  first class  First Class   
3  AU-2015-4   5/9/2015   5/16/2015  Q2      7  first class  First Class   
4  AU-2015-5   7/9/2015   7/18/2015  Q3      9  first class  First Class   

     Product Category  Quantity            Product  ...  Shipping Cost  \
0  Auto & Accessories         1  Car Media Players  ...          $4.6    
1  Auto & Accessories         1       Car Speakers  ...         $11.2    
2  Auto & Accessories         5    Car Body Covers  ...          $3.1    
3  Auto & Accessories         4    Car & Bike Care  ...          $2.6    
4  Auto & Accessories         5               Tyre  ...         $16.0    

   Order Priority Customer ID    Customer Name      Segment        City 

# Transform Data as per Business Logic

In [10]:
# Fill missing values only in numerical columns
numeric_cols = df.select_dtypes(include=['number']).columns  # Select only numeric columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())

print("Missing values in numeric columns filled with mean!")


Missing values in numeric columns filled with mean!
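
To make the effect of the mean imputation concrete, here is a small sketch on a hand-made frame. The column names come from the dataset, but the values are invented for illustration:

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; the values are made up, only the column names match the dataset
demo = pd.DataFrame({
    "Aging": [8.0, np.nan, 2.0],
    "Quantity": [1.0, 5.0, np.nan],
    "Ship Mode": ["First Class", None, "First Class"],
})

# Only numeric columns are selected, so the missing "Ship Mode" entry is left untouched
numeric_cols = demo.select_dtypes(include=["number"]).columns
demo[numeric_cols] = demo[numeric_cols].fillna(demo[numeric_cols].mean())

print(demo)
```

The missing `Aging` value becomes 5.0 (the mean of 8 and 2) and the missing `Quantity` value becomes 3.0, while the gap in the text column remains and would need its own strategy.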


In [11]:
# Encode categorical variables (if any)
df = pd.get_dummies(df, drop_first=True)

# Normalize numerical columns (boolean dummy columns are left as they are)
scaler = StandardScaler()
numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

print("Data Transformation Completed!")
print(df.head())

Data Transformation Completed!
      Aging  Quantity  Discount  Order ID_AU-2015-10  Order ID_AU-2015-100  \
0  0.927390 -1.416429  1.419082                False                 False   
1 -1.099722 -1.416429  0.002044                False                 False   
2  0.927390  1.408443 -1.414993                False                 False   
3  0.589538  0.702225  1.419082                False                 False   
4  1.265242  1.408443  0.710563                False                 False   

   Order ID_AU-2015-1000  Order ID_AU-2015-1001  Order ID_AU-2015-1002  \
0                  False                  False                  False   
1                  False                  False                  False   
2                  False                  False                  False   
3                  False                  False                  False   
4                  False                  False                  False   

   Order ID_AU-2015-1003  Order ID_AU-2015-1004  ...  M
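
The output above shows one dummy column per `Order ID`, because an identifier is unique for every row. One possible way to avoid this, sketched here on invented data, is to drop identifier-like columns before encoding. Treating `Order ID` as the only identifier is an assumption for this example:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative rows only; the values are invented
demo = pd.DataFrame({
    "Order ID": ["AU-2015-1", "AU-2015-2", "AU-2015-3", "AU-2015-4"],
    "Aging": [8, 2, 8, 7],
    "Quantity": [1, 1, 5, 4],
    "Ship Mode": ["First Class", "Second Class", "First Class", "Second Class"],
})

# Identifier-like columns are unique per row, so encoding them adds one column per order
id_cols = ["Order ID"]
features = demo.drop(columns=id_cols)

# Scale the numeric columns, then one-hot encode the remaining categorical columns
numeric_cols = features.select_dtypes(include=["number"]).columns
features[numeric_cols] = StandardScaler().fit_transform(features[numeric_cols])
encoded = pd.get_dummies(features, drop_first=True)

print(encoded.columns.tolist())
```

With the identifier removed, the encoded frame keeps a handful of columns instead of growing with the number of orders.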

**Save Transformed Data as a CSV File in the Target Folder**

In [15]:

# Define target folder and file path
target_folder = "/content/drive/MyDrive/Colab Notebooks/Target Data/"
target_file = os.path.join(target_folder, "final_transformed_data.csv")  # Full path of the output CSV (nothing is written yet)


In [16]:

# Ensure target folder exists
if not os.path.exists(target_folder):
    os.makedirs(target_folder)
    print(f"📂 Target folder created: {target_folder}")
else:
    print(f"📂 Target folder already exists: {target_folder}")


📂 Target folder already exists: /content/drive/MyDrive/Colab Notebooks/Target Data/


In [18]:
from google.colab import drive
drive.flush_and_unmount()  # Forces sync
drive.mount('/content/drive')


Mounted at /content/drive


In [19]:
!ls "/content/drive/MyDrive/Colab Notebooks/Target Data/"


 final_transformed_data.csv				    _SUCCESS
 part-00000-389ff18a-8326-4199-b466-644fe273c39c-c000.csv  'Transformed financial_data.csv'
 part-00001-389ff18a-8326-4199-b466-644fe273c39c-c000.csv   Updated_Transformed_financial_data.csv


**Combine the Data Saved in Multiple CSV Part Files into a Single CSV File**

In [28]:
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("SaveCSV").getOrCreate()

# Load data (note: this reads the raw source file, not the pandas-transformed DataFrame)
df = spark.read.csv(source_file, header=True, inferSchema=True)

# Define target folder
target_folder = "/content/drive/MyDrive/Colab Notebooks/Target Data/"

# Write a single part-*.csv file into the folder; "overwrite" mode replaces the folder's existing contents
df.coalesce(1).write.mode("overwrite").option("header", "true").csv(target_folder)

print(f"✅ Transformed data saved successfully at: {target_folder}")


✅ Transformed data saved successfully at: /content/drive/MyDrive/Colab Notebooks/Target Data/


In [29]:
import os
import shutil

# Define the target folder
target_folder = "/content/drive/MyDrive/Colab Notebooks/Target Data/"

# Find the newly created part file
for file in os.listdir(target_folder):
    if file.startswith("part-") and file.endswith(".csv"):  # Find part files
        part_file = os.path.join(target_folder, file)
        final_file = os.path.join(target_folder, "final_transformed_data.csv")

        # Rename the part file
        shutil.move(part_file, final_file)
        print(f"✅ File renamed to: {final_file}")
        break  # Stop after renaming the first part file

# Remove _SUCCESS file (optional)
success_file = os.path.join(target_folder, "_SUCCESS")
if os.path.exists(success_file):
    os.remove(success_file)
    print("🗑️ Removed _SUCCESS file.")


✅ File renamed to: /content/drive/MyDrive/Colab Notebooks/Target Data/final_transformed_data.csv
🗑️ Removed _SUCCESS file.
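
The rename step can be made a little stricter by refusing to continue when the folder holds anything other than exactly one part file. The sketch below simulates the folder layout in a temporary directory instead of running Spark; the part-file name is hypothetical:

```python
import glob
import os
import shutil
import tempfile

# Simulate the layout left behind by coalesce(1).write.csv(...) (no Spark needed here)
folder = tempfile.mkdtemp()
part_name = "part-00000-demo-c000.csv"  # hypothetical part-file name
with open(os.path.join(folder, part_name), "w") as fh:
    fh.write("Aging,Quantity\n8,1\n")
open(os.path.join(folder, "_SUCCESS"), "w").close()

part_files = glob.glob(os.path.join(folder, "part-*.csv"))
# Refuse to guess when zero or several part files are present
if len(part_files) != 1:
    raise RuntimeError(f"Expected exactly one part file, found {len(part_files)}")

final_file = os.path.join(folder, "final_transformed_data.csv")
shutil.move(part_files[0], final_file)
os.remove(os.path.join(folder, "_SUCCESS"))

print(sorted(os.listdir(folder)))
```

Failing loudly here avoids silently renaming only the first of several part files, which would drop the rows held in the others.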


**Basic ML Pipeline For My Output Data**

In [4]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

In [5]:
# Load the transformed data
file_path = "/content/drive/MyDrive/Colab Notebooks/Target Data/final_transformed_data.csv"
df = pd.read_csv(file_path)

**ML Operations**

In [6]:
# Select relevant numerical features
selected_features = ["Aging", "Quantity", "Discount"]  # Add more relevant columns

# Extract input data (X)
X = df[selected_features]


In [7]:
print(df.columns)


Index(['Order ID', 'Order Date', 'Ship Date', 'Qtr', 'Aging', 'c', 'Ship Mode',
       'Product Category', 'Quantity', 'Product', 'Sales', 'Discount',
       'Profit', 'Shipping Cost', 'Order Priority', 'Customer ID',
       'Customer Name', 'Segment', 'City', 'State', 'Country', 'Region',
       'Months'],
      dtype='object')


 Supervised Learning (Regression & Classification)

In [8]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load your dataset
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Target Data/final_transformed_data.csv")  # Change this to your actual file path

# Drop high-cardinality columns (Order IDs)
df = df.drop(columns=[col for col in df.columns if "Order ID" in col])

# Select numerical features (copy so later changes do not touch a view of the original frame)
numerical_features = ["Aging", "Quantity", "Discount"]  # Add more if needed
df = df[numerical_features].copy()

# Handle missing values
df = df.fillna(df.mean())

# Scale the data
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)


**2.1 Regression: Predicting Sales**

In [9]:
print([col for col in df.columns if "Sales" in col])


[]


**Load Transformed Data**

In [10]:
import pandas as pd

# Load your transformed CSV file
file_path = "/content/drive/MyDrive/Colab Notebooks/Target Data/final_transformed_data.csv"
df = pd.read_csv(file_path)

# Display first few rows
df.head()


Unnamed: 0,Order ID,Order Date,Ship Date,Qtr,Aging,c,Ship Mode,Product Category,Quantity,Product,...,Shipping Cost,Order Priority,Customer ID,Customer Name,Segment,City,State,Country,Region,Months
0,AU-2015-1,11/9/2015,11/17/2015,Q4,8,first class,First Class,Auto & Accessories,1,Car Media Players,...,$4.6,Medium,LS-001,Lane Daniels,Consumer,Brisbane,Queensland,Australia,Oceania,Nov
1,AU-2015-2,6/30/2015,7/2/2015,Q2,2,first class,First Class,Auto & Accessories,1,Car Speakers,...,$11.2,Medium,IZ-002,Alvarado Kriz,Home Office,Berlin,Berlin,Germany,Central,Jun
2,AU-2015-3,12/5/2015,12/13/2015,Q4,8,first class,First Class,Auto & Accessories,5,Car Body Covers,...,$3.1,Critical,EN-003,Moon Weien,Consumer,Porirua,Wellington,New Zealand,Oceania,Dec
3,AU-2015-4,5/9/2015,5/16/2015,Q2,7,first class,First Class,Auto & Accessories,4,Car & Bike Care,...,$2.6,High,AN-004,Sanchez Bergman,Corporate,Kabul,Kabul,Afghanistan,Central Asia,May
4,AU-2015-5,7/9/2015,7/18/2015,Q3,9,first class,First Class,Auto & Accessories,5,Tyre,...,$16.0,Critical,ON-005,Rowe Jackson,Corporate,Townsville,Queensland,Australia,Oceania,Jul


**Define Features (X) and Target Variables (Y)**

In [13]:
# Define target variables (columns we want to predict)
target_columns = ["Quantity", "Discount", "Aging"]

# Check if target columns exist in the dataset
missing_cols = [col for col in target_columns if col not in df.columns]
if missing_cols:
    print(f"❌ Missing target columns: {missing_cols}")
else:
    print(f"✅ Target columns found: {target_columns}")

# Define features (X) and target (Y)
X = df.drop(columns=target_columns)  # All columns except target
Y = df[target_columns]  # Target variables


✅ Target columns found: ['Quantity', 'Discount', 'Aging']


**Train-Test Split**

In [14]:
from sklearn.model_selection import train_test_split

# Split data into 80% training and 20% testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

print(f"✅ Training data: {X_train.shape}, {Y_train.shape}")
print(f"✅ Testing data: {X_test.shape}, {Y_test.shape}")


✅ Training data: (41032, 20), (41032, 3)
✅ Testing data: (10258, 20), (10258, 3)


**Train a Multi-Output Regression Model**

In [16]:
# Identify non-numeric columns
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()

if categorical_cols:
    print(f"🔹 Categorical columns found: {categorical_cols}")
else:
    print("✅ No categorical columns found, ready for ML!")


🔹 Categorical columns found: ['Order ID', 'Order Date', 'Ship Date', 'Qtr', 'c', 'Ship Mode', 'Product Category', 'Product', 'Sales', 'Profit', 'Shipping Cost', 'Order Priority', 'Customer ID', 'Customer Name', 'Segment', 'City', 'State', 'Country', 'Region', 'Months']


In [17]:
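# One-hot encode with pandas (the scikit-learn OneHotEncoder is not needed for this step).
# Note: X_train and X_test were split before this cell, so they are not encoded by it.
X = pd.get_dummies(X, columns=categorical_cols)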

print(f"✅ Transformed X shape: {X.shape}")


✅ Transformed X shape: (51290, 109666)


In [19]:
# Identify categorical columns
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()

print(f"🔹 Categorical columns found: {categorical_cols}")


🔹 Categorical columns found: []


In [20]:
# Find columns with non-numeric values
non_numeric_cols = [col for col in X.columns if not pd.api.types.is_numeric_dtype(X[col])]

print(f"🔹 Non-numeric columns: {non_numeric_cols}")


🔹 Non-numeric columns: []


In [21]:
print(X.dtypes)


Order ID_AU-2015-1       bool
Order ID_AU-2015-10      bool
Order ID_AU-2015-100     bool
Order ID_AU-2015-1000    bool
Order ID_AU-2015-1001    bool
                         ... 
Months_Mar               bool
Months_May               bool
Months_Nov               bool
Months_Oct               bool
Months_Sep               bool
Length: 109666, dtype: object
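
The cells above prepare the features but stop before fitting anything. As a rough illustration of the multi-output regression step named in the heading, the sketch below trains a `RandomForestRegressor` on synthetic data. The data, the feature count, and the choice of model are all assumptions made for the example, not results from the notebook's dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 4))  # synthetic features

# Three synthetic targets standing in for Quantity, Discount and Aging
Y_demo = np.column_stack([
    X_demo[:, 0] + X_demo[:, 1],
    X_demo[:, 1] - X_demo[:, 2],
    2.0 * X_demo[:, 3],
]) + rng.normal(scale=0.1, size=(200, 3))

X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.2, random_state=42
)

# RandomForestRegressor accepts a 2-D target, so all three outputs are fitted together
reg = RandomForestRegressor(n_estimators=50, random_state=42)
reg.fit(X_tr, Y_tr)
Y_pred = reg.predict(X_te)

print(Y_pred.shape)            # one row per test sample, one column per target
print(r2_score(Y_te, Y_pred))  # R^2 averaged over the three targets
```

On the real data, a feature matrix with about 110,000 one-hot columns would likely need to be reduced first, for example by dropping identifier columns and converting currency strings to numbers.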


In [11]:
# Drop non-relevant columns and define target
X = df.drop(columns=['Order ID', 'Customer ID', 'Customer Name'], errors='ignore')

# Define target column (placeholder: df.get returns None until 'target_column' is replaced with a real column name)
y = df.get('target_column')  # Replace 'target_column' with your actual target


In [12]:
# Convert 'Order Date' and 'Ship Date' to datetime format
X['Order Date'] = pd.to_datetime(X['Order Date'], errors='coerce')
X['Ship Date'] = pd.to_datetime(X['Ship Date'], errors='coerce')

# Create new numeric features
X['Order_Year'] = X['Order Date'].dt.year
X['Order_Month'] = X['Order Date'].dt.month
X['Shipping_Days'] = (X['Ship Date'] - X['Order Date']).dt.days

# Drop original date columns
X.drop(columns=['Order Date', 'Ship Date'], inplace=True)


In [13]:
# Identify categorical columns
categorical_cols = X.select_dtypes(include=['object']).columns
print(f"🔹 Categorical Columns: {list(categorical_cols)}")

# One-hot encoding for categorical features
X = pd.get_dummies(X, columns=categorical_cols, drop_first=True)


🔹 Categorical Columns: ['Qtr', 'c', 'Ship Mode', 'Product Category', 'Product', 'Sales', 'Profit', 'Shipping Cost', 'Order Priority', 'Segment', 'City', 'State', 'Country', 'Region', 'Months']


In [15]:
print(df.columns.tolist())


['Order ID', 'Order Date', 'Ship Date', 'Qtr', 'Aging', 'c', 'Ship Mode', 'Product Category', 'Quantity', 'Product', 'Sales', 'Discount', 'Profit', 'Shipping Cost', 'Order Priority', 'Customer ID', 'Customer Name', 'Segment', 'City', 'State', 'Country', 'Region', 'Months']


In [16]:
df.rename(columns=lambda x: x.strip(), inplace=True)  # Strip leading/trailing whitespace from column names


In [17]:
# Raw string avoids the invalid "\$" escape; inside a character class "$" needs no escaping
df['Shipping Cost'] = df['Shipping Cost'].astype(str).str.replace(r'[$,]', '', regex=True).astype(float)


In [18]:
if 'Shipping Cost' in df.columns:
    print("✅ 'Shipping Cost' column found!")
else:
    print("❌ 'Shipping Cost' column NOT found! Check for typos.")


✅ 'Shipping Cost' column found!


In [19]:
print(df.columns.tolist())  # Shows column names exactly as stored


['Order ID', 'Order Date', 'Ship Date', 'Qtr', 'Aging', 'c', 'Ship Mode', 'Product Category', 'Quantity', 'Product', 'Sales', 'Discount', 'Profit', 'Shipping Cost', 'Order Priority', 'Customer ID', 'Customer Name', 'Segment', 'City', 'State', 'Country', 'Region', 'Months']


In [20]:
print("Shipping Cost" in df.columns)  # Should return True


True
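
`Sales` and `Profit` also appear in the list of categorical columns above, which suggests they are stored as text like `Shipping Cost`. Assuming they use the same `$` and comma formatting (not confirmed by the output shown), the same cleaning could be applied to all three columns in one loop. The sample values below are invented:

```python
import pandas as pd

# Invented sample values; only the column names come from the dataset
demo = pd.DataFrame({
    "Sales": ["$140", "$211", "$1,117"],
    "Profit": ["$46.0", "$112.0", "$31.2"],
    "Shipping Cost": ["$4.6", "$11.2", "$3.1"],
})

currency_cols = ["Sales", "Profit", "Shipping Cost"]
for col in currency_cols:
    # Remove the currency symbol and thousands separators
    cleaned = demo[col].astype(str).str.replace(r"[$,]", "", regex=True)
    # errors="coerce" turns anything unparseable into NaN instead of raising
    demo[col] = pd.to_numeric(cleaned, errors="coerce")

print(demo.dtypes)
```

Converting these columns before calling `pd.get_dummies` keeps them as numeric features instead of producing one dummy column per distinct amount.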


In [21]:
%pip install fastapi uvicorn joblib pandas scikit-learn


Collecting fastapi
  Downloading fastapi-0.115.12-py3-none-any.whl.metadata (27 kB)
Collecting uvicorn
  Downloading uvicorn-0.34.0-py3-none-any.whl.metadata (6.5 kB)
Collecting starlette<0.47.0,>=0.40.0 (from fastapi)
  Downloading starlette-0.46.1-py3-none-any.whl.metadata (6.2 kB)
Downloading fastapi-0.115.12-py3-none-any.whl (95 kB)
Downloading uvicorn-0.34.0-py3-none-any.whl (62 kB)
Downloading starlette-0.46.1-py3-none-any.whl (71 kB)
Installing collected packages: uvicorn, starlette, fastapi
Successfully installed fastapi-0.115.12 starlette-0.46.1 uvicorn-0.34.0


In [23]:
import os

print(os.path.exists("model.pkl"))  # Should return True if the model file exists


False


**Ensure Model Is Trained & Saved**

In [25]:
import joblib
from sklearn.ensemble import RandomForestClassifier  # Example model

# Create a sample model (replace with your actual model training)
model = RandomForestClassifier()
# model.fit(X_train, y_train)  # Uncomment to train; while this stays commented out the model is unfitted

# Save the model; an unfitted model can be saved, but predict() will raise NotFittedError after loading
joblib.dump(model, "model.pkl")
print("✅ Model saved successfully!")


✅ Model saved successfully!
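
Saving succeeds even when the estimator has never been fitted, so it can help to check the fitted state before and after persisting. The sketch below uses synthetic data and a temporary file path, both assumptions for illustration; `check_is_fitted` is scikit-learn's own validation helper:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.exceptions import NotFittedError
from sklearn.utils.validation import check_is_fitted

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))       # synthetic features
y_demo = (X_demo[:, 0] > 0).astype(int)  # synthetic binary label

clf = RandomForestClassifier(n_estimators=20, random_state=0)
try:
    check_is_fitted(clf)
except NotFittedError:
    print("Model is not fitted yet")

clf.fit(X_demo, y_demo)

# Save to a temporary location and load it back
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
joblib.dump(clf, model_path)
restored = joblib.load(model_path)

check_is_fitted(restored)  # raises NotFittedError if the saved model was never trained
print(restored.predict(X_demo[:2]).shape)
```

Running a check like this right after loading surfaces an untrained model at startup instead of at the first prediction request.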


**Load the Model in FastAPI**

In [26]:
from fastapi import FastAPI
import joblib

# Initialize FastAPI app
app = FastAPI()

# Load the trained model
try:
    model = joblib.load("model.pkl")
    print("✅ Model loaded successfully!")
except FileNotFoundError:
    print("❌ Error: 'model.pkl' file not found. Train and save the model first.")
except Exception as e:
    print(f"❌ Error loading model: {e}")

# Define a prediction route
@app.post("/predict/")
async def predict(features: dict):
    try:
        # Convert input features to a list (modify as per your model's requirement)
        input_data = [features[key] for key in features]
        prediction = model.predict([input_data])
        return {"prediction": prediction.tolist()}
    except Exception as e:
        return {"error": str(e)}


✅ Model loaded successfully!


**Debugging (If Issue Persists)**

In [28]:
import os
print(os.path.exists("model.pkl"))  # Should return True


True


**Checking Model Type**

In [29]:
model = joblib.load("model.pkl")
print(type(model))  # Should print sklearn model type


<class 'sklearn.ensemble._forest.RandomForestClassifier'>


# Final FastAPI Code for Prediction

In [30]:
from fastapi import FastAPI
import joblib
from pydantic import BaseModel
import numpy as np

# Initialize FastAPI app
app = FastAPI()

# Load the trained model; keep None on failure so the route can report it
model = None
try:
    model = joblib.load("model.pkl")
    print("✅ Model loaded successfully!")
except FileNotFoundError:
    print("❌ Error: 'model.pkl' file not found. Train and save the model first.")
except Exception as e:
    print(f"❌ Error loading model: {e}")

# Define the input data schema
class InputData(BaseModel):
    features: list[float]  # A flat list of numerical feature values

# Define a prediction route
@app.post("/predict/")
async def predict(data: InputData):
    if model is None:
        return {"error": "Model is not loaded."}
    try:
        # Convert input to a NumPy array of shape (1, n_features)
        input_data = np.array(data.features).reshape(1, -1)
        prediction = model.predict(input_data)
        return {"prediction": prediction.tolist()}
    except Exception as e:
        return {"error": str(e)}



✅ Model loaded successfully!
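
The route reshapes whatever list it receives, so a request with the wrong number of values only fails inside `model.predict`. One option is to compare the input length with `n_features_in_`, which scikit-learn sets when `fit()` is called. The `to_model_input` helper and the three-feature demo model below are hypothetical and trained on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def to_model_input(features, model):
    """Convert a flat list of numbers into a (1, n_features) array for model.predict."""
    arr = np.asarray(features, dtype=float).reshape(1, -1)
    expected = model.n_features_in_  # set by scikit-learn when fit() is called
    if arr.shape[1] != expected:
        raise ValueError(f"Expected {expected} features, got {arr.shape[1]}")
    return arr


# Demo model trained on synthetic data with three features (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)
demo_model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

print(to_model_input([8, 1, 0.3], demo_model).shape)

try:
    to_model_input([8, 1], demo_model)
except ValueError as exc:
    print(exc)
```

Inside the endpoint, a helper like this could replace the bare `reshape` call so that clients receive a clear message about the expected feature count.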
