
# **Sales Data Cleaning**
Objective:
Clean sales data by handling missing values, computing missing totals, and
listing transactions with a Total above 1000.

**Creating/Importing the data**

In [14]:
import pandas as pd
# Sample Sales Data
sales_data= pd.DataFrame({
    "Invoice_No": [101, 102, 103, 104, 105, 106, 107, 108],
    "Customer_Name": ["Alice", "Bob", "Charlie", "David", "Eva", "Frank", "Grace", "Helen"],
    "Product": ["Laptop", "Mouse", None, "Keyboard", "Monitor", "Laptop", "Headset", "Mouse"],
    "Quantity": [2, 5, 3, None, 1, 2, 4, 6],
    "Price": [50000, 800, 1500, 1200, None, 52000, 2000, None],
    "Total": [100000, None, None, None, None, None, 8000, 4800]
})

print("Sales Data:")
print(sales_data)

Sales Data:
   Invoice_No Customer_Name   Product  Quantity    Price     Total
0         101         Alice    Laptop       2.0  50000.0  100000.0
1         102           Bob     Mouse       5.0    800.0       NaN
2         103       Charlie      None       3.0   1500.0       NaN
3         104         David  Keyboard       NaN   1200.0       NaN
4         105           Eva   Monitor       1.0      NaN       NaN
5         106         Frank    Laptop       2.0  52000.0       NaN
6         107         Grace   Headset       4.0   2000.0    8000.0
7         108         Helen     Mouse       6.0      NaN    4800.0
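
Before choosing how to handle the gaps, it can help to count the missing entries in each column. The snippet below is a minimal sketch that rebuilds the same sample data so it runs on its own; `missing_counts` is an illustrative name, not part of the original lab.

```python
import pandas as pd

# Same sample data as above, rebuilt so the snippet is self-contained
sales_data = pd.DataFrame({
    "Invoice_No": [101, 102, 103, 104, 105, 106, 107, 108],
    "Customer_Name": ["Alice", "Bob", "Charlie", "David", "Eva", "Frank", "Grace", "Helen"],
    "Product": ["Laptop", "Mouse", None, "Keyboard", "Monitor", "Laptop", "Headset", "Mouse"],
    "Quantity": [2, 5, 3, None, 1, 2, 4, 6],
    "Price": [50000, 800, 1500, 1200, None, 52000, 2000, None],
    "Total": [100000, None, None, None, None, None, 8000, 4800]
})

# isna() marks missing cells as True; sum() counts them per column
missing_counts = sales_data.isna().sum()
print(missing_counts)
```

For this data the counts are 1 for Product, 1 for Quantity, 2 for Price, and 5 for Total, which is why Total is recomputed rather than dropped.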


**Cleaning the data**

In [13]:


# 1. Remove rows with missing Product or Quantity
#    (.copy() avoids SettingWithCopyWarning on the assignments below in older pandas versions)
sales_data = sales_data.dropna(subset=["Product", "Quantity"]).copy()

# 2. Fill missing Price with the mean of the known prices in the remaining rows
sales_data["Price"] = sales_data["Price"].fillna(sales_data["Price"].mean())

# 3. Fill missing Total with Quantity × Price, keeping any Total already present
sales_data["Total"] = sales_data["Total"].fillna(sales_data["Quantity"] * sales_data["Price"])

# 4. Display all records with Total > 1000
large_transactions = sales_data[sales_data["Total"] > 1000]

print("Cleaned Sales Data:")
print(sales_data.head())  # head() prints only the first five of the six remaining rows

print("\nTransactions with Total > 1000:")
print(large_transactions)


Cleaned Sales Data:
   Invoice_No Customer_Name  Product  Quantity    Price     Total
0         101         Alice   Laptop       2.0  50000.0  100000.0
1         102           Bob    Mouse       5.0    800.0    4000.0
4         105           Eva  Monitor       1.0  26200.0   26200.0
5         106         Frank   Laptop       2.0  52000.0  104000.0
6         107         Grace  Headset       4.0   2000.0    8000.0

Transactions with Total > 1000:
   Invoice_No Customer_Name  Product  Quantity    Price     Total
0         101         Alice   Laptop       2.0  50000.0  100000.0
1         102           Bob    Mouse       5.0    800.0    4000.0
4         105           Eva  Monitor       1.0  26200.0   26200.0
5         106         Frank   Laptop       2.0  52000.0  104000.0
6         107         Grace  Headset       4.0   2000.0    8000.0
7         108         Helen    Mouse       6.0  26200.0    4800.0
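
In the output above, invoice 108 ends up with a Price of 26200.0 next to a Total of 4800.0 for 6 units, because the mean fill does not use the Total that was already recorded. One possible alternative, offered here as an assumption rather than as the lab's required method, is to derive a missing Price from Total / Quantity when both are known and to fall back to the mean only for the gaps that remain. The names `sales` and `mismatch` are illustrative.

```python
import pandas as pd

# Same sample data as above, under a separate name so nothing is overwritten
sales = pd.DataFrame({
    "Invoice_No": [101, 102, 103, 104, 105, 106, 107, 108],
    "Customer_Name": ["Alice", "Bob", "Charlie", "David", "Eva", "Frank", "Grace", "Helen"],
    "Product": ["Laptop", "Mouse", None, "Keyboard", "Monitor", "Laptop", "Headset", "Mouse"],
    "Quantity": [2, 5, 3, None, 1, 2, 4, 6],
    "Price": [50000, 800, 1500, 1200, None, 52000, 2000, None],
    "Total": [100000, None, None, None, None, None, 8000, 4800]
})

# Step 1 as in the original: drop rows lacking Product or Quantity
sales = sales.dropna(subset=["Product", "Quantity"]).copy()

# Assumed alternative: recover Price from Total / Quantity where Total is known
sales["Price"] = sales["Price"].fillna(sales["Total"] / sales["Quantity"])

# Any Price still missing falls back to the mean of the known prices
sales["Price"] = sales["Price"].fillna(sales["Price"].mean())

# Fill missing Total as Quantity x Price
sales["Total"] = sales["Total"].fillna(sales["Quantity"] * sales["Price"])

# Rows where the stored Total disagrees with Quantity x Price
mismatch = sales[(sales["Quantity"] * sales["Price"] - sales["Total"]).abs() > 1e-9]

print(sales)
print("Inconsistent rows:", len(mismatch))
```

With the sample data, invoice 108 gets a Price of 800.0, the only remaining gap (invoice 105) is filled with the mean of the five known prices, 21120.0, and no row is left with a Total that contradicts its Quantity and Price.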


# **Job Applicants Filtering**
Objective:
Clean applicant data and identify candidates with a Python score above 80 and
more than two years of experience.

**Creating/Importing the data**

In [11]:
# Sample Applicants Data
applicants_data = pd.DataFrame({
    "Applicant_ID": [1, 2, 3, 4, 5, 6, 7, 8],
    "Name": ["John", "Priya", "Ahmed", "Sophia", "Li Wei", "Carlos", "Nina", "Raj"],
    "Degree": ["B.Tech", "MCA", "B.Sc", "M.Tech", "B.Tech", "MBA", "M.Sc", "B.Tech"],
    "Experience_Years": [3, None, 1, 5, None, 7, 0, 2],
    "Python_Score": [85, 92, None, 78, 95, 88, 60, None],
    "Status": ["Applied", "Interview", "Applied", "Rejected", "Applied", "Hired", "Applied", "Interview"]
})

print("\nApplicants Data:")
print(applicants_data)



Applicants Data:
   Applicant_ID    Name  Degree  Experience_Years  Python_Score     Status
0             1    John  B.Tech               3.0          85.0    Applied
1             2   Priya     MCA               NaN          92.0  Interview
2             3   Ahmed    B.Sc               1.0           NaN    Applied
3             4  Sophia  M.Tech               5.0          78.0   Rejected
4             5  Li Wei  B.Tech               NaN          95.0    Applied
5             6  Carlos     MBA               7.0          88.0      Hired
6             7    Nina    M.Sc               0.0          60.0    Applied
7             8     Raj  B.Tech               2.0           NaN  Interview


**Cleaning the data**

In [12]:
# 1. Drop rows missing Python_Score
#    (.copy() avoids SettingWithCopyWarning on the assignment below in older pandas versions)
applicants_data = applicants_data.dropna(subset=["Python_Score"]).copy()

# 2. Replace missing Experience_Years with 0
applicants_data["Experience_Years"] = applicants_data["Experience_Years"].fillna(0)

# 3. Filter applicants with Python_Score > 80 and Experience_Years > 2
filtered_applicants = applicants_data[
    (applicants_data["Python_Score"] > 80) & (applicants_data["Experience_Years"] > 2)
]

print("Cleaned Applicant Data:")
print(applicants_data.head())  # head() prints only the first five of the six remaining rows

print("\nQualified Applicants (Python_Score > 80 & Experience_Years > 2):")
print(filtered_applicants)


Cleaned Applicant Data:
   Applicant_ID    Name  Degree  Experience_Years  Python_Score     Status
0             1    John  B.Tech               3.0          85.0    Applied
1             2   Priya     MCA               0.0          92.0  Interview
3             4  Sophia  M.Tech               5.0          78.0   Rejected
4             5  Li Wei  B.Tech               0.0          95.0    Applied
5             6  Carlos     MBA               7.0          88.0      Hired

Qualified Applicants (Python_Score > 80 & Experience_Years > 2):
   Applicant_ID    Name  Degree  Experience_Years  Python_Score   Status
0             1    John  B.Tech               3.0          85.0  Applied
5             6  Carlos     MBA               7.0          88.0    Hired
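
Replacing a missing Experience_Years with 0 treats "unknown" the same as "no experience", so Priya (92) and Li Wei (95) cannot pass the experience filter even though their scores are above 80. One way to keep that visible, shown here as a sketch, is to record which values were imputed before filling them. The column `Experience_Imputed` and the variables `qualified` and `needs_review` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Same sample data as above, under a separate name so nothing is overwritten
applicants = pd.DataFrame({
    "Applicant_ID": [1, 2, 3, 4, 5, 6, 7, 8],
    "Name": ["John", "Priya", "Ahmed", "Sophia", "Li Wei", "Carlos", "Nina", "Raj"],
    "Degree": ["B.Tech", "MCA", "B.Sc", "M.Tech", "B.Tech", "MBA", "M.Sc", "B.Tech"],
    "Experience_Years": [3, None, 1, 5, None, 7, 0, 2],
    "Python_Score": [85, 92, None, 78, 95, 88, 60, None],
    "Status": ["Applied", "Interview", "Applied", "Rejected", "Applied", "Hired", "Applied", "Interview"]
})

applicants = applicants.dropna(subset=["Python_Score"]).copy()

# Remember which rows had no experience value before filling (hypothetical column)
applicants["Experience_Imputed"] = applicants["Experience_Years"].isna()
applicants["Experience_Years"] = applicants["Experience_Years"].fillna(0)

# Same filter as in the cell above
qualified = applicants[
    (applicants["Python_Score"] > 80) & (applicants["Experience_Years"] > 2)
]

# High scorers whose experience is unknown rather than actually zero
needs_review = applicants[
    (applicants["Python_Score"] > 80) & applicants["Experience_Imputed"]
]

print("Qualified:", qualified["Name"].tolist())
print("Needs review:", needs_review["Name"].tolist())
```

The qualified list is unchanged (John and Carlos), while Priya and Li Wei are surfaced separately for follow-up instead of being silently excluded.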
