Data Preprocessing

Aim of the Experiment
The aim of this experiment is to preprocess the given dataset, which is provided in the file sample.csv.
Sample Dataset

The objectives of this experiment are:
1. Explore Label Encoder
2. Explore scikit-learn preprocessing routines such as scaling
3. Explore scikit-learn preprocessing routines such as Binarizer


Reference to the Textbook and Explanation
All the fundamentals are given in Chapter 2 and Appendix 2.
The gender column in the dataset holds the values Female and Male, which can be converted to 0 and 1 using LabelEncoder. It is done as given below:
from sklearn.preprocessing import LabelEncoder
df_gender_encode = LabelEncoder()
df['gender'] = df_gender_encode.fit_transform(df['gender'])
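The mapping that LabelEncoder chooses can be inspected after fitting. The following is a minimal sketch on a hypothetical four-row frame (the name `toy` and its values are assumptions for illustration, not rows from sample.csv). It shows that the labels are numbered in sorted order and that the codes can be converted back to the original strings.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data; the real values come from sample.csv.
toy = pd.DataFrame({"gender": ["Female", "Male", "Female", "Male"]})

encoder = LabelEncoder()
toy["gender_encoded"] = encoder.fit_transform(toy["gender"])

# classes_ holds the labels in sorted order; each label's position is its code.
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)  # {'Female': 0, 'Male': 1}

# inverse_transform recovers the original strings from the integer codes.
restored = encoder.inverse_transform(toy["gender_encoded"].to_numpy())
print(list(restored))
```

Because the codes follow sorted label order, Female maps to 0 and Male to 1 regardless of which value appears first in the column.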
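Scaling can be done as follows:
from sklearn import preprocessing
scaled_marks = preprocessing.scale(df['Marks'])  # standardized copy as a NumPy array
df['Marks'] = scaled_marks  # overwrite the column only if the raw marks are no longer needed
Scaling removes the mean and divides by the standard deviation, so the scaled values have zero mean and unit variance.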
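What scaling does can be checked by hand. The sketch below uses a hypothetical array of four marks (assumed values, not taken from sample.csv) and compares `preprocessing.scale` with the explicit formula (x - mean) / std, where std is the population standard deviation (NumPy's default, `ddof=0`).

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical marks, used only for illustration.
marks = np.array([85.0, 70.0, 92.0, 65.0])

# Library routine: subtract the mean, divide by the standard deviation.
scaled = preprocessing.scale(marks)

# The same computation written out; np.std defaults to ddof=0.
manual = (marks - marks.mean()) / marks.std()

print(np.allclose(scaled, manual))
print(scaled.mean(), scaled.std())
```

The scaled values have a mean of approximately 0 and a population standard deviation of approximately 1, up to floating-point rounding.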



In [2]:
# Data Preprocessing
from sklearn.preprocessing import LabelEncoder, StandardScaler, Binarizer
import pandas as pd

# Reload the sample dataset to start fresh for preprocessing steps
try:
    df = pd.read_csv('dataset/sample.csv')
    print("\nSample dataset loaded for preprocessing:")
    print(df)
except FileNotFoundError:
    print("\nsample.csv not found. Please make sure the file is created.")
    # Create a dummy dataframe if the file is not found for demonstration
    data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
            'gender': ['Female', 'Male', 'Female', 'Male'],
            'Marks': [85, 70, 92, 65]}
    df = pd.DataFrame(data)
    print("\nCreated dummy DataFrame for preprocessing:")
    print(df)


# 1. Explore Label Encoder
print("\n--- Exploring Label Encoder ---")

# Check if 'gender' column exists before encoding
if 'gender' in df.columns:
    print("Original 'gender' column:")
    print(df['gender'])

    # Initialize the LabelEncoder
    df_gender_encode = LabelEncoder()

    # Fit and transform the 'gender' column
    df['gender_encoded'] = df_gender_encode.fit_transform(df['gender'])

    print("\nDataFrame after Label Encoding 'gender':")
    print(df[['gender', 'gender_encoded']])
else:
    print("'gender' column not found in the DataFrame. Cannot apply Label Encoding.")


# 2. Explore Scikit Preprocessing routines like Scaling
print("\n--- Exploring Scaling (StandardScaler) ---")

# Check if 'Marks' column exists and is numeric before scaling
if 'Marks' in df.columns and pd.api.types.is_numeric_dtype(df['Marks']):
    print("Original 'Marks' column:")
    print(df['Marks'])

    # Alternative: the function preprocessing.scale standardizes in a single call.
    # It is not deprecated, but StandardScaler is usually preferred because it
    # stores the fitted mean and scale, so the same transform can be reused on new data.
    # from sklearn import preprocessing
    # scaled_marks_func = preprocessing.scale(df['Marks'])
    # print("\nScaled 'Marks' using preprocessing.scale:")
    # print(scaled_marks_func)

    # Using StandardScaler (recommended)
    scaler = StandardScaler()

    # Reshape the 'Marks' column to be 2D as required by StandardScaler
    marks_reshaped = df['Marks'].values.reshape(-1, 1)

    # Fit and transform the 'Marks' column, then flatten the (n, 1) result to 1D
    df['Marks_scaled'] = scaler.fit_transform(marks_reshaped).ravel()

    print("\nDataFrame after Scaling 'Marks' using StandardScaler:")
    print(df[['Marks', 'Marks_scaled']])

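    # StandardScaler divides by the population std (ddof=0), while describe() reports
    # the sample std (ddof=1), so the printed std is slightly above 1 for small datasets.
    print("\nDescriptive stats of scaled 'Marks' (mean close to 0, std close to 1):")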
    print(df['Marks_scaled'].describe())
else:
    print("Cannot scale 'Marks' column. Column not found or not numeric.")


# 3. Explore Scikit Preprocessing routines like Binarizer
print("\n--- Exploring Binarizer ---")

# Check if 'Marks' column exists and is numeric before binarizing
if 'Marks' in df.columns and pd.api.types.is_numeric_dtype(df['Marks']):
    print("Original 'Marks' column:")
    print(df['Marks'])

    # Initialize the Binarizer with a threshold of 75.
    # Marks > 75 become 1; Marks <= 75 (including exactly 75) become 0.
    binarizer = Binarizer(threshold=75)

    # Reshape the 'Marks' column to be 2D as required by Binarizer
    marks_reshaped = df['Marks'].values.reshape(-1, 1)

    # Transform the 'Marks' column (Binarizer is stateless, so no fit is needed),
    # then flatten the (n, 1) result to 1D
    df['Marks_binarized'] = binarizer.transform(marks_reshaped).ravel()

    print("\nDataFrame after Binarizing 'Marks' (threshold=75):")
    print(df[['Marks', 'Marks_binarized']])
else:
    print("Cannot binarize 'Marks' column. Column not found or not numeric.")

print("\nFinal DataFrame after preprocessing steps:")
print(df)
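The comparison that Binarizer applies at the threshold can be checked in isolation. The sketch below uses three hypothetical marks chosen to straddle the threshold of 75 (assumed values, not from sample.csv). Binarizer maps values strictly greater than the threshold to 1 and all other values, including the threshold itself, to 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical marks just below, exactly at, and just above the threshold.
marks = np.array([[74.0], [75.0], [76.0]])

# Binarizer expects a 2D array; ravel() flattens the (n, 1) result to 1D.
flags = Binarizer(threshold=75).fit_transform(marks).ravel()
print(flags.tolist())  # [0.0, 0.0, 1.0]
```

If a mark of exactly 75 should count as 1, one option is to pick a threshold slightly below 75, for example 74.5 when marks are whole numbers.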



Sample dataset loaded for preprocessing:
   id        first       last  gender  Marks  selected
0   1        Leone    Debrick  Female     50      True
1   2       Romola  Phinnessy  Female     60     False
2   3         Geri      Prium    Male     65     False
3   4        Sandy   Doveston  Female     95     False
4   5      Jacenta     Jansik  Female     31      True
5   6  Diane-marie   Medhurst  Female     45      True
6   7       Austen       Pool    Male     45      True
7   8        Vanya    Teffrey    Male     70     False
8   9     Giordano      Elloy    Male     36     False
9  10       Rozele    Fawcett  Female     50     False

--- Exploring Label Encoder ---
Original 'gender' column:
0    Female
1    Female
2      Male
3    Female
4    Female
5    Female
6      Male
7      Male
8      Male
9    Female
Name: gender, dtype: object

DataFrame after Label Encoding 'gender':
   gender  gender_encoded
0  Female               0
1  Female               0
2    Male               1
