<a href="https://colab.research.google.com/github/Merrinplenil/PythonPrograms/blob/main/data_preprocessing_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Preprocessing in Google Colab
This notebook demonstrates:
1. Handling missing data
2. Data normalization
3. Feature selection

In [1]:

from google.colab import files
uploaded = files.upload()


Saving 100 Sales Records.csv to 100 Sales Records.csv


In [2]:

!pip install scikit-learn pandas




In [3]:

import pandas as pd

# Load dataset (update filename if needed)
df = pd.read_csv("100 Sales Records.csv")
print("Original Dataset:")
print(df.head())


Original Dataset:
                              Region                Country        Item Type  \
0              Australia and Oceania                 Tuvalu        Baby Food   
1  Central America and the Caribbean                Grenada           Cereal   
2                             Europe                 Russia  Office Supplies   
3                 Sub-Saharan Africa  Sao Tome and Principe           Fruits   
4                 Sub-Saharan Africa                 Rwanda  Office Supplies   

  Sales Channel Order Priority Order Date   Order ID  Ship Date  Units Sold  \
0       Offline              H  5/28/2010  669165933  6/27/2010        9925   
1        Online              C  8/22/2012  963881480  9/15/2012        2804   
2       Offline              L   5/2/2014  341417157   5/8/2014        1779   
3        Online              C  6/20/2014  514321792   7/5/2014        8102   
4       Offline              L   2/1/2013  115456712   2/6/2013        5062   

   Unit Price  Unit Cost  

In [4]:

# Handling Missing Data
# Numeric columns are filled with the column mean; all other columns with the mode.
# The result is assigned back to df[col] instead of calling fillna(..., inplace=True)
# on a column selection: pandas flags that pattern with a FutureWarning and it stops
# modifying the original DataFrame in pandas 3.0.
for col in df.columns:
    if df[col].dtype in ['float64', 'int64']:
        df[col] = df[col].fillna(df[col].mean())
    else:
        df[col] = df[col].fillna(df[col].mode()[0])
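
The imputation rule used in this cell can be checked in isolation on a small frame. The sketch below is only an illustration: the `toy` data and the `impute` helper are hypothetical names, not part of the notebook or the sales file. It fills numeric gaps with the column mean and non-numeric gaps with the most frequent value, working on a copy so the input stays untouched.

```python
import pandas as pd

# Hypothetical toy data, not taken from "100 Sales Records.csv".
toy = pd.DataFrame({
    "units": [10.0, None, 30.0],
    "channel": ["Online", None, "Online"],
})


def impute(frame):
    """Return a copy with numeric NaNs set to the column mean and other NaNs to the mode."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            # mode() returns a Series; its first entry is the most frequent value.
            out[col] = out[col].fillna(out[col].mode()[0])
    return out


filled = impute(toy)
print(filled)
```

Because each filled column is assigned back rather than modified in place, the same loop should behave the same way across pandas versions.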


In [6]:

# Data Normalization (Min-Max Scaling)
from sklearn.preprocessing import MinMaxScaler

numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns
scaler = MinMaxScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
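
Min-max scaling maps each column to the range [0, 1] with (x - min) / (max - min). As a rough sketch of that arithmetic, the hypothetical helper `min_max_scale` below applies the formula to a single pandas Series; the sample values are made up for illustration, and the handling of a constant column is a choice made for this sketch.

```python
import pandas as pd


def min_max_scale(series):
    """Rescale a numeric Series to [0, 1] via (x - min) / (max - min)."""
    lo, hi = series.min(), series.max()
    if hi == lo:
        # Constant column: return zeros rather than dividing by zero.
        return series * 0.0
    return (series - lo) / (hi - lo)


# Hypothetical values, not taken from the sales file.
values = pd.Series([2, 4, 6, 10], name="example")
scaled = min_max_scale(values)
print(scaled.tolist())  # [0.0, 0.25, 0.5, 1.0]
```

The smallest value always maps to 0 and the largest to 1, which is why the row with the largest `Units Sold` shows 1.000000 in the processed output.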


In [7]:

# Feature Selection
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectKBest, f_classif

target = 'Region'  # Change target if needed
le = LabelEncoder()
y = le.fit_transform(df[target])

X = df[numeric_cols].drop(target, axis=1, errors='ignore')

selector = SelectKBest(score_func=f_classif, k=5)
X_new = selector.fit_transform(X, y)

selected_features = X.columns[selector.get_support()]
print("Selected Features:", selected_features.tolist())


Selected Features: ['Order ID', 'Units Sold', 'Total Revenue', 'Total Cost', 'Total Profit']
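
`f_classif` scores each feature with a one-way ANOVA F-statistic: the variance between class means divided by the variance within classes. `SelectKBest` then keeps the `k` features with the highest scores. The sketch below computes that statistic by hand for one feature so the ranking is easier to interpret; `anova_f` and the sample arrays are hypothetical and are not taken from the notebook's data.

```python
import numpy as np


def anova_f(x, labels):
    """One-way ANOVA F-statistic of feature x against class labels."""
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    grand_mean = x.mean()
    ss_between = 0.0
    ss_within = 0.0
    for c in classes:
        group = x[labels == c]
        # Spread of the class mean around the overall mean, weighted by class size.
        ss_between += len(group) * (group.mean() - grand_mean) ** 2
        # Spread of the samples around their own class mean.
        ss_within += ((group - group.mean()) ** 2).sum()
    df_between = len(classes) - 1
    df_within = len(x) - len(classes)
    return (ss_between / df_between) / (ss_within / df_within)


labels = [0, 0, 0, 1, 1, 1]
separated = [1, 2, 3, 7, 8, 9]   # class means far apart
mixed = [1, 8, 3, 2, 9, 7]       # class means close together

print(anova_f(separated, labels))  # 54.0
print(anova_f(mixed, labels))      # about 0.46
```

A feature whose values separate the classes cleanly gets a large score, while one whose class means overlap gets a score near or below 1.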


In [8]:

# Final Processed Dataset
print("Processed Dataset:")
print(df.head())


Processed Dataset:
                              Region                Country        Item Type  \
0              Australia and Oceania                 Tuvalu        Baby Food   
1  Central America and the Caribbean                Grenada           Cereal   
2                             Europe                 Russia  Office Supplies   
3                 Sub-Saharan Africa  Sao Tome and Principe           Fruits   
4                 Sub-Saharan Africa                 Rwanda  Office Supplies   

  Sales Channel Order Priority Order Date  Order ID  Ship Date  Units Sold  \
0       Offline              H  5/28/2010  0.630600  6/27/2010    1.000000   
1        Online              C  8/22/2012  0.965726  9/15/2012    0.273441   
2       Offline              L   5/2/2014  0.257911   5/8/2014    0.168860   
3        Online              C  6/20/2014  0.454524   7/5/2014    0.813999   
4       Offline              L   2/1/2013  0.000967   2/6/2013    0.503826   

   Unit Price  Unit Cost  Total

In [9]:
df.to_csv('processed.csv', index=False)
from google.colab import files
files.download('processed.csv')

