
# 2: DATA PREPARATION

**1. How to Add an Index Field Using R/Python**

In [None]:
# Adding an index field to a DataFrame

import pandas as pd

# Sample data without an index
data = {'Name': ['John', 'Jane', 'Bob'],
        'Age': [28, 34, 29]}

df = pd.DataFrame(data)

# Add a new index field
df['Index'] = df.index

print("DataFrame with Index column:\n", df)


DataFrame with Index column:
    Name  Age  Index
0  John   28      0
1  Jane   34      1
2   Bob   29      2
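The cell above copies the default 0-based row labels into a column. As a minimal sketch of a variation, assuming a 1-based index placed as the first column is wanted (both the starting value and the column position are illustrative choices, not requirements of the original example), `DataFrame.insert` can be used:

```python
import pandas as pd

# Same sample data as above
df = pd.DataFrame({'Name': ['John', 'Jane', 'Bob'],
                   'Age': [28, 34, 29]})

# Insert a 1-based index field as the leftmost column (position 0).
# Starting at 1 is an assumption made for this sketch.
df.insert(0, 'Index', list(range(1, len(df) + 1)))

print("DataFrame with 1-based Index column:\n", df)
```

Unlike `df['Index'] = df.index`, this does not depend on the current row labels, so it still gives consecutive numbers after rows have been filtered or sorted.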


**2. How to Change Misleading Field Values Using R/Python**

In [None]:
# Changing misleading field values in a DataFrame

import pandas as pd

# Sample data
data = {'Name': ['John', 'Jane', 'Bob'],
        'Age': [28, 34, 29],
        'Gender': ['M', 'F', 'M']}

df = pd.DataFrame(data)

# Change 'M' and 'F' to 'Male' and 'Female'
df['Gender'] = df['Gender'].replace({'M': 'Male', 'F': 'Female'})

print("Updated DataFrame with full gender labels:\n", df)


Updated DataFrame with full gender labels:
    Name  Age  Gender
0  John   28    Male
1  Jane   34  Female
2   Bob   29    Male
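Another common kind of misleading value is a numeric placeholder that stands for "unknown". The sketch below assumes a hypothetical code of `999` in the `Age` field (this value is not in the original sample data) and replaces it with `NaN` so it does not distort summary statistics:

```python
import numpy as np
import pandas as pd

# Sample data where 999 is a hypothetical "unknown age" code
df = pd.DataFrame({'Name': ['John', 'Jane', 'Bob'],
                   'Age': [28, 999, 29]})

# Replace the placeholder with a proper missing value
df['Age'] = df['Age'].replace(999, np.nan)

print("DataFrame with placeholder replaced:\n", df)
print("Mean age ignoring missing values:", df['Age'].mean())
```

After the replacement the column becomes floating point, because `NaN` cannot be stored in an integer column.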


**3. How to Re-express Categorical Field Values Using R/Python**

In [None]:
# Re-express categorical field values to numerical values

import pandas as pd

# Sample data
data = {'Name': ['John', 'Jane', 'Bob'],
        'Gender': ['Male', 'Female', 'Male']}

df = pd.DataFrame(data)

# Convert categorical 'Gender' to numeric (0 for Male, 1 for Female)
df['Gender_Num'] = df['Gender'].map({'Male': 0, 'Female': 1})

print("DataFrame with numerical gender values:\n", df)


DataFrame with numerical gender values:
    Name  Gender  Gender_Num
0  John    Male           0
1  Jane  Female           1
2   Bob    Male           0
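One thing to watch with `map` is that any label missing from the dictionary becomes `NaN` without raising an error. The following sketch, which assumes a hypothetical lowercase typo `'male'` in the data, shows one way to list such labels after mapping:

```python
import pandas as pd

# Sample data with a hypothetical inconsistent label ('male')
df = pd.DataFrame({'Name': ['John', 'Jane', 'Bob'],
                   'Gender': ['Male', 'Female', 'male']})

mapping = {'Male': 0, 'Female': 1}
df['Gender_Num'] = df['Gender'].map(mapping)

# Labels that were not in the mapping end up as NaN; collect them for review
unmapped = df.loc[df['Gender_Num'].isna(), 'Gender'].unique()

print("Labels with no numeric code:", list(unmapped))
```

Checking for unmapped labels before modelling helps catch spelling or capitalization variants that would otherwise turn into missing values.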


**4. How to Standardize Numeric Fields Using R/Python**

In [1]:
# Standardizing numeric fields (e.g., z-score normalization)

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Sample data
data = {'Name': ['John', 'Jane', 'Bob'],
        'Age': [28, 34, 29]}

df = pd.DataFrame(data)

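# Standardize the 'Age' column (StandardScaler divides by the population
# standard deviation, i.e. ddof=0)
scaler = StandardScaler()
# fit_transform returns a 2-D array of shape (n, 1); flatten it to 1-D
df['Age_Standardized'] = scaler.fit_transform(df[['Age']]).ravel()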

print("DataFrame with standardized Age:\n", df)

DataFrame with standardized Age:
    Name  Age  Age_Standardized
0  John   28         -0.889001
1  Jane   34          1.397001
2   Bob   29         -0.508001
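The z-score is simply (value - mean) / standard deviation, so it can also be computed with pandas alone. This is a sketch rather than a replacement for `StandardScaler`; the column names `Age_z_pop` and `Age_z_sample` are made up for illustration. It also shows that the result depends on which standard deviation is used:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Jane', 'Bob'],
                   'Age': [28, 34, 29]})

age = df['Age']

# Population standard deviation (ddof=0): the convention StandardScaler uses
df['Age_z_pop'] = (age - age.mean()) / age.std(ddof=0)

# Sample standard deviation (ddof=1): the pandas default for .std()
df['Age_z_sample'] = (age - age.mean()) / age.std()

print(df.round(3))
```

On a sample this small the two versions differ noticeably; with many rows the difference becomes negligible.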


**5. How to Identify Outliers Using R/Python**

In [2]:
# Identifying outliers using the IQR method

import pandas as pd

# Sample data
data = {'Name': ['John', 'Jane', 'Bob', 'Alice', 'Tom'],
        'Age': [28, 34, 29, 50, 16]}

df = pd.DataFrame(data)

# Calculate IQR (Interquartile Range)
Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1

# Identify outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['Age'] < lower_bound) | (df['Age'] > upper_bound)]

print("Outliers in the data:\n", outliers)


Outliers in the data:
     Name  Age
3  Alice   50
4    Tom   16
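A different rule of thumb flags values whose z-score exceeds some cutoff in absolute value. The sketch below applies that rule to the same data for comparison; the cutoff of 3 is an assumed, commonly quoted value, and `Age_z` is an illustrative column name:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Jane', 'Bob', 'Alice', 'Tom'],
                   'Age': [28, 34, 29, 50, 16]})

threshold = 3  # assumed cutoff for |z|

# z-score using the sample standard deviation (pandas default, ddof=1)
df['Age_z'] = (df['Age'] - df['Age'].mean()) / df['Age'].std()

# Rows whose absolute z-score exceeds the cutoff
outliers = df[df['Age_z'].abs() > threshold]

print("Largest |z| in the data:", round(df['Age_z'].abs().max(), 3))
print("Number of z-score outliers:", len(outliers))
```

With only five rows no value can reach a z-score of 3, so this rule flags nothing even though the IQR method flags two rows. The choice of rule matters, especially for small samples.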
