**Part 1**

**Code to create a DataFrame from a dictionary of sample employee data, then save it as a CSV file titled 'sample_data.csv' using Pandas.**

In [None]:
import pandas as pd

# Sample data in a dictionary format
data = {
    "Name": ["Alice", "", "Charlie", "David", "Eve", "Alice", "Frank", "Grace", "Heidi", "Ivan", "Judy"],
    "Age": [28, None, 35, 29, 28, 28, 40, 29, 30, 35, 31],
    "Salary": [50000, 60000, None, None, 60000, 50000, 90000, 45000, 55000, 62000, 59000],
    "Country": ["USA", "USA", "UK", "", "USA", "USA", "UK", "UK", "USA", "USA", "UK"],
    "Joining_Date": [
        "2021-01-15", "2021-03-22", "2020-12-30", "2020-10-15",
        "2021-02-20", "2021-01-15", "2019-06-25", "2019-08-10",
        "2021-05-17", "2020-11-30", "2019-03-15"
    ]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Save to CSV
df.to_csv('sample_data.csv', index=False)


**Code to load the previously saved 'sample_data.csv' file into a Pandas DataFrame and display the original data.**

In [None]:
import pandas as pd

# Load the data
df = pd.read_csv('sample_data.csv')

# Display the original data
print("Original Data:")
print(df)

Original Data:
       Name   Age   Salary Country Joining_Date
0     Alice  28.0  50000.0     USA   2021-01-15
1       NaN   NaN  60000.0     USA   2021-03-22
2   Charlie  35.0      NaN      UK   2020-12-30
3     David  29.0      NaN     NaN   2020-10-15
4       Eve  28.0  60000.0     USA   2021-02-20
5     Alice  28.0  50000.0     USA   2021-01-15
6     Frank  40.0  90000.0      UK   2019-06-25
7     Grace  29.0  45000.0      UK   2019-08-10
8     Heidi  30.0  55000.0     USA   2021-05-17
9      Ivan  35.0  62000.0     USA   2020-11-30
10     Judy  31.0  59000.0      UK   2019-03-15
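
The float columns and `NaN` entries in the output above come from the CSV round trip: `to_csv` writes `None` and empty strings as empty fields, and `read_csv` loads empty fields as `NaN` by default. A minimal sketch of that behavior, assuming a hypothetical three-row frame and an in-memory buffer instead of a file:

```python
import io
import pandas as pd

# Hypothetical three-row sample with an empty string and None values
sample = pd.DataFrame({
    "Name": ["Alice", "", "Charlie"],
    "Age": [28, None, 35],
    "Country": ["USA", "USA", ""],
})

# Write to an in-memory CSV and read it back
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Count missing values per column after the round trip
missing_counts = reloaded.isna().sum()
print(missing_counts)
print(reloaded.dtypes)
```

Checking `isna().sum()` right after loading shows which columns need filling before any cleaning step is chosen.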


**Code to handle missing values in the dataset by filling missing 'Age' values with the median, missing 'Salary' values with the mean, and replacing empty 'Country' values with the placeholder 'Unknown'.**

In [None]:
# 1. Handle Missing Values
# Fill missing values in 'Age' with the median
# (assign the result back instead of calling inplace=True on a column selection)
df['Age'] = df['Age'].fillna(df['Age'].median())

# Fill missing values in 'Salary' with the mean
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())

# Replace empty-string 'Country' values with a placeholder.
# Note: read_csv has already loaded the empty string as NaN, so this
# leaves that entry as NaN; df['Country'].fillna('Unknown') would fill it.
df['Country'] = df['Country'].replace('', 'Unknown')
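
Because the empty 'Country' entry is already `NaN` once the CSV has been read back, a `fillna`-based version is one way to make the placeholder take effect. The sketch below assumes a hypothetical four-row frame (`df_demo`) that mirrors the post-load state, and uses a single `fillna` call with a per-column mapping:

```python
import numpy as np
import pandas as pd

# Hypothetical frame mirroring the state after read_csv: gaps are NaN
df_demo = pd.DataFrame({
    "Age": [28.0, np.nan, 35.0, 29.0],
    "Salary": [50000.0, 60000.0, np.nan, np.nan],
    "Country": ["USA", "USA", "UK", np.nan],
})

# One fillna call with a per-column mapping; the result is assigned back
df_demo = df_demo.fillna({
    "Age": df_demo["Age"].median(),
    "Salary": df_demo["Salary"].mean(),
    "Country": "Unknown",
})
print(df_demo)
```

Assigning the result back avoids calling `inplace=True` on a column selection, which is the pattern pandas warns about.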

**Code to remove duplicate entries from the DataFrame, ensuring that each record is unique.**

In [None]:
# 2. Remove Duplicates
df.drop_duplicates(inplace=True)
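
Before dropping rows it can help to see which ones count as duplicates. The sketch below assumes a hypothetical three-row frame (`people`); it also resets the index, since `drop_duplicates` keeps the original index labels and therefore leaves gaps where rows were removed.

```python
import pandas as pd

# Hypothetical frame with one exact duplicate row (index 2 repeats index 0)
people = pd.DataFrame({
    "Name": ["Alice", "Eve", "Alice"],
    "Age": [28, 28, 28],
})

# Boolean mask: True for rows that repeat an earlier row
dup_mask = people.duplicated()
print(people[dup_mask])

# Drop the repeats and renumber the index from 0
deduped = people.drop_duplicates().reset_index(drop=True)
print(deduped)
```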

**Code to normalize the 'Salary' column using min-max normalization, scaling the values to a range between 0 and 1 for improved comparability.**

In [None]:
# 3. Normalize Data
# Normalize the 'Salary' column (min-max normalization)
df['Salary'] = (df['Salary'] - df['Salary'].min()) / (df['Salary'].max() - df['Salary'].min())
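
The formula above divides by `max - min`, which is zero when every value in the column is the same. A small helper can guard against that case. This is a sketch only: the function name `min_max_normalize` and the choice to return zeros for a constant column are assumptions, not part of the original notebook.

```python
import pandas as pd

def min_max_normalize(series: pd.Series) -> pd.Series:
    """Scale a numeric Series to [0, 1]; return all zeros if every value is equal."""
    value_range = series.max() - series.min()
    if value_range == 0:
        # Avoid division by zero when the column is constant
        return pd.Series(0.0, index=series.index)
    return (series - series.min()) / value_range

# Minimum, filled mean, and maximum salary from the sample data
salaries = pd.Series([45000.0, 59000.0, 90000.0])
normalized = min_max_normalize(salaries)
print(normalized.round(6).tolist())
```

For the middle value this works out to (59000 - 45000) / (90000 - 45000) ≈ 0.311111, which matches the value shown for the mean-filled salaries in the cleaned output.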

**Code to format the 'Joining_Date' column by converting it to a datetime object for better date handling and analysis.**

In [None]:
# 4. Format Dates
df['Joining_Date'] = pd.to_datetime(df['Joining_Date'])
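
If a date column may contain malformed entries, `pd.to_datetime` can be told to coerce them rather than raise. The sketch below assumes a hypothetical `raw_dates` Series with one bad value and the `YYYY-MM-DD` layout used in the sample data:

```python
import pandas as pd

# Hypothetical column with one malformed entry
raw_dates = pd.Series(["2021-01-15", "2020-12-30", "not a date"])

# An explicit format avoids per-value inference; errors="coerce" turns
# unparseable entries into NaT instead of raising
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

# Datetime accessors become available after conversion
years = parsed.dt.year
print(parsed)
print(years)
```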

**Code to display the cleaned DataFrame, showcasing the modifications made, including handling missing values, removing duplicates, normalizing the 'Salary' column, and formatting the 'Joining_Date' to datetime.**

In [None]:
# Display the cleaned data
print("\nCleaned Data:")
print(df)


Cleaned Data:
       Name   Age    Salary Country Joining_Date
0     Alice  28.0  0.111111     USA   2021-01-15
1       NaN  29.5  0.333333     USA   2021-03-22
2   Charlie  35.0  0.311111      UK   2020-12-30
3     David  29.0  0.311111     NaN   2020-10-15
4       Eve  28.0  0.333333     USA   2021-02-20
6     Frank  40.0  1.000000      UK   2019-06-25
7     Grace  29.0  0.000000      UK   2019-08-10
8     Heidi  30.0  0.222222     USA   2021-05-17
9      Ivan  35.0  0.377778     USA   2020-11-30
10     Judy  31.0  0.311111      UK   2019-03-15


**Part 2**

**Code to create a sample DataFrame containing sales data for different products, including columns for Product, Sales, Date, Quantity, Region, and Discount. The DataFrame is displayed to show the initial structure of the data.**

In [None]:
import pandas as pd
import numpy as np

# Sample DataFrame
data = {
    'Product': ['Widget A', 'Widget B', 'Widget A', 'Widget C', 'Widget B', 'Widget C', 'Widget A', 'Widget B', 'Widget C'],
    'Sales': [100.50, np.nan, 150.75, 200.00, 250.50, np.nan, 300.00, 400.00, 500.00],
    'Date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02', '2024-01-03', '2024-01-03', '2024-01-04', '2024-01-04', '2024-01-05'],
    'Quantity': [10, 20, 5, 8, np.nan, 12, 10, 6, 5],
    'Region': ['North', 'East', 'South', 'West', 'North', 'East', 'South', 'West', 'North'],
    'Discount': [0.10, 0.15, -0.05, 0.20, 0.25, 0.30, 0.05, 0.20, 0.10]
}

df = pd.DataFrame(data)
print(df)


    Product   Sales        Date  Quantity Region  Discount
0  Widget A  100.50  2024-01-01      10.0  North      0.10
1  Widget B     NaN  2024-01-01      20.0   East      0.15
2  Widget A  150.75  2024-01-02       5.0  South     -0.05
3  Widget C  200.00  2024-01-02       8.0   West      0.20
4  Widget B  250.50  2024-01-03       NaN  North      0.25
5  Widget C     NaN  2024-01-03      12.0   East      0.30
6  Widget A  300.00  2024-01-04      10.0  South      0.05
7  Widget B  400.00  2024-01-04       6.0   West      0.20
8  Widget C  500.00  2024-01-05       5.0  North      0.10


**Code to clean the DataFrame by filling missing values, correcting negative discounts, and converting the 'Date' column to datetime format, followed by displaying the cleaned DataFrame.**

In [None]:
# Display the original DataFrame
print("Original DataFrame:")
print(df)

# Step 1: Handling Missing Values
# Fill missing Sales with the mean of the Sales column
# (assign the result back instead of calling inplace=True on a column selection)
df['Sales'] = df['Sales'].fillna(df['Sales'].mean())

# Fill missing Quantity with the median of the Quantity column
df['Quantity'] = df['Quantity'].fillna(df['Quantity'].median())

# Step 2: Fixing Incorrect Data
# Replace negative discount values with 0 (assuming discounts cannot be negative)
df['Discount'] = df['Discount'].clip(lower=0)

# Step 3: Convert Date column to datetime format
df['Date'] = pd.to_datetime(df['Date'])

print("\nCleaned DataFrame:")
print(df)


Original DataFrame:
    Product   Sales        Date  Quantity Region  Discount
0  Widget A  100.50  2024-01-01      10.0  North      0.10
1  Widget B     NaN  2024-01-01      20.0   East      0.15
2  Widget A  150.75  2024-01-02       5.0  South     -0.05
3  Widget C  200.00  2024-01-02       8.0   West      0.20
4  Widget B  250.50  2024-01-03       NaN  North      0.25
5  Widget C     NaN  2024-01-03      12.0   East      0.30
6  Widget A  300.00  2024-01-04      10.0  South      0.05
7  Widget B  400.00  2024-01-04       6.0   West      0.20
8  Widget C  500.00  2024-01-05       5.0  North      0.10

Cleaned DataFrame:
    Product       Sales       Date  Quantity Region  Discount
0  Widget A  100.500000 2024-01-01      10.0  North      0.10
1  Widget B  271.678571 2024-01-01      20.0   East      0.15
2  Widget A  150.750000 2024-01-02       5.0  South      0.00
3  Widget C  200.000000 2024-01-02       8.0   West      0.20
4  Widget B  250.500000 2024-01-03       9.0  North      0.25
5  Widget C  271.678571 2024-01-03      12.0   East      0.30
6  Widget A  300.000000 2024-01-04      10.0  South      0.05
7  Widget B  400.000000 2024-01-04       6.0   West      0.20
8  Widget C  500.000000 2024-01-05       5.0  North      0.10
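
Clipping silently overwrites the out-of-range values, so it can be useful to record which rows were affected first. A minimal sketch, assuming a hypothetical `discounts` Series taken from the first four rows of the sample data:

```python
import pandas as pd

discounts = pd.Series([0.10, 0.15, -0.05, 0.20])

# Record which rows are out of range before overwriting them
was_negative = discounts < 0

# clip(lower=0) raises values below 0 to 0 and leaves the rest unchanged
clipped = discounts.clip(lower=0)

print(was_negative.tolist())
print(clipped.tolist())
```

Whether a negative discount should become 0 or be treated as missing depends on what the value means in the source system; the notebook assumes discounts cannot be negative.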


**Code to create a new column for Total Sales by multiplying 'Sales' and 'Quantity', followed by one-hot encoding the 'Region' column, and displaying the transformed DataFrame.**

In [None]:
# Step 1: Create a new column for Total Sales (Sales * Quantity)
df['Total_Sales'] = df['Sales'] * df['Quantity']

# Step 2: One-hot encoding for the 'Region' column
df = pd.get_dummies(df, columns=['Region'], drop_first=True)

print("\nTransformed DataFrame:")
print(df)



Transformed DataFrame:
    Product       Sales       Date  Quantity  Discount  Total_Sales  \
0  Widget A  100.500000 2024-01-01      10.0      0.10  1005.000000   
1  Widget B  271.678571 2024-01-01      20.0      0.15  5433.571429   
2  Widget A  150.750000 2024-01-02       5.0      0.00   753.750000   
3  Widget C  200.000000 2024-01-02       8.0      0.20  1600.000000   
4  Widget B  250.500000 2024-01-03       9.0      0.25  2254.500000   
5  Widget C  271.678571 2024-01-03      12.0      0.30  3260.142857   
6  Widget A  300.000000 2024-01-04      10.0      0.05  3000.000000   
7  Widget B  400.000000 2024-01-04       6.0      0.20  2400.000000   
8  Widget C  500.000000 2024-01-05       5.0      0.10  2500.000000   

   Region_North  Region_South  Region_West  
0          True         False        False  
1         False         False        False  
2         False          True        False  
3         False         False         True  
4          True         False        False  
5         False         False        False  
6         False          True        False  
7         False         False         True  
8          True         False        False  
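
The output has no `Region_East` column because `drop_first=True` removes the first category in sorted order. The sketch below, using a hypothetical one-column frame (`regions`), compares the full and reduced encodings:

```python
import pandas as pd

regions = pd.DataFrame({"Region": ["North", "East", "South", "West"]})

# Full encoding: one indicator column per category
full = pd.get_dummies(regions, columns=["Region"])

# drop_first=True removes the first category in sorted order ("East" here);
# an East row is then the one where every remaining indicator is False
reduced = pd.get_dummies(regions, columns=["Region"], drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

Dropping one category avoids a redundant column, since its value is implied by the others.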

**Code to normalize the 'Sales' and 'Total_Sales' columns using Min-Max scaling, followed by displaying the normalized DataFrame.**

In [None]:
# Step 1: Normalize Sales and Total_Sales
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df[['Sales', 'Total_Sales']] = scaler.fit_transform(df[['Sales', 'Total_Sales']])

print("\nNormalized DataFrame:")
print(df)



Normalized DataFrame:
    Product     Sales       Date  Quantity  Discount  Total_Sales  \
0  Widget A  0.000000 2024-01-01      10.0      0.10     0.053688   
1  Widget B  0.428482 2024-01-01      20.0      0.15     1.000000   
2  Widget A  0.125782 2024-01-02       5.0      0.00     0.000000   
3  Widget C  0.249061 2024-01-02       8.0      0.20     0.180830   
4  Widget B  0.375469 2024-01-03       9.0      0.25     0.320685   
5  Widget C  0.428482 2024-01-03      12.0      0.30     0.535574   
6  Widget A  0.499374 2024-01-04      10.0      0.05     0.479986   
7  Widget B  0.749687 2024-01-04       6.0      0.20     0.351776   
8  Widget C  1.000000 2024-01-05       5.0      0.10     0.373145   

   Region_North  Region_South  Region_West  
0          True         False        False  
1         False         False        False  
2         False          True        False  
3         False         False         True  
4          True         False        False  
5         False         False        False  
6         False          True        False  
7         False         False         True  
8          True         False        False  
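
With its default `feature_range` of (0, 1), `MinMaxScaler` applies the same `(x - min) / (max - min)` formula to each column, so the scaling can be reproduced and undone by hand as long as the original minimum and maximum are kept. A minimal sketch in plain pandas, assuming a hypothetical `sales` Series holding the smallest, mean-filled, and largest 'Sales' values from the data above:

```python
import pandas as pd

sales = pd.Series([100.50, 271.678571, 500.00])

# Keep the original bounds so the scaling can be undone later
low, high = sales.min(), sales.max()

scaled = (sales - low) / (high - low)

# Invert: multiply by the range and add the minimum back
restored = scaled * (high - low) + low

print(scaled.round(6).tolist())
print(restored.round(6).tolist())
```

When the fitted `scaler` object is still available, its `inverse_transform` method serves the same purpose.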