# Data Processing #
 
The dataset was processed and cleaned to prepare it for analysis.
Key observations are as follows:

- Missing values were present in the Gender, Project Phase 2, and Final Exam columns.
- Gender values were standardized by replacing the abbreviations (M, F) with the full terms (Male, Female).
- Rows containing missing values were dropped, reducing the dataset from 40 rows to 33.
- The dataset was checked for duplicate rows, and none were found.

In [None]:
import pandas as pd

# Step 1: Load the CSV file from its URL
url_grades = "https://raw.githubusercontent.com/akmand/datasets/refs/heads/main/sample_grades.csv"
try:
    data = pd.read_csv(url_grades)
    print("Data loaded successfully!")
    print(data.head())  # Display the first few rows of the dataset
except Exception as e:
    # Network and parsing failures both end up here
    print(f"An error occurred while loading the data: {e}")


Data loaded successfully!
   Student ID  Gender  Project Phase 1  Project Phase 2  Mid-Semester Test  \
0         101    Male            18.25             15.5                 94   
1         102  Female            17.75             30.0                 79   
2         103    Male             0.00              0.0                 78   
3         104    Male            20.00             25.0                 69   
4         105    Male            18.75             30.0                 96   

   Final Exam Grade  
0        61.0    PA  
1        62.0    PA  
2        15.0    NN  
3        65.0    PA  
4        51.0    PA  


In [None]:
data.head()

Unnamed: 0,Student ID,Gender,Project Phase 1,Project Phase 2,Mid-Semester Test,Final Exam,Grade
0,101,Male,18.25,15.5,94,61.0,PA
1,102,Female,17.75,30.0,79,62.0,PA
2,103,Male,0.0,0.0,78,15.0,NN
3,104,Male,20.0,25.0,69,65.0,PA
4,105,Male,18.75,30.0,96,51.0,PA


In [None]:
print(f"There are {data.shape[0]:,} rows and {data.shape[1]} columns in our data")

There are 40 rows and 7 columns in our data


In [None]:
print("Data types of each column:")
print(data.dtypes)

Data types of each column:
Student ID             int64
Gender                object
Project Phase 1      float64
Project Phase 2      float64
Mid-Semester Test      int64
Final Exam           float64
Grade                 object
dtype: object


In [None]:
# Replace the abbreviations "M" and "F" with "Male" and "Female" in the 'Gender' column
data['Gender'] = data['Gender'].replace({'M': 'Male', 'F': 'Female'})
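
One way to confirm that the replacement left no stray labels is to compare the distinct values afterwards. The sketch below runs on a small made-up Series, not the grades dataset, and assumes that `M` and `F` are the only abbreviations in use; the names `gender` and `gender_std` are illustrative.

```python
import pandas as pd

# Hypothetical sample, not taken from the grades dataset
gender = pd.Series(["Male", "M", "F", "Female", None, "M"])

# A single dict-based replace maps both abbreviations at once
gender_std = gender.replace({"M": "Male", "F": "Female"})

# dropna=False keeps missing entries visible in the tally
counts = gender_std.value_counts(dropna=False)
print(counts)
```

If any label other than `Male`, `Female`, or a missing value shows up in the tally, the mapping needs another entry.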

In [None]:
data

Unnamed: 0,Student ID,Gender,Project Phase 1,Project Phase 2,Mid-Semester Test,Final Exam,Grade
0,101,Male,18.25,15.5,94,61.0,PA
1,102,Female,17.75,30.0,79,62.0,PA
2,103,Male,0.0,0.0,78,15.0,NN
3,104,Male,20.0,25.0,69,65.0,PA
4,105,Male,18.75,30.0,96,51.0,PA
5,106,Male,17.0,23.5,80,59.0,PA
6,107,,19.75,19.5,82,76.0,PA
7,108,Male,20.0,28.0,95,44.0,PA
8,109,Male,18.0,23.0,50,33.0,NN
9,110,Female,20.0,30.0,92,63.0,PA


## Managing Null Values ##

In [None]:
# Number of null records per column in the original dataset
data.isnull().sum()

Student ID           0
Gender               3
Project Phase 1      0
Project Phase 2      3
Mid-Semester Test    0
Final Exam           4
Grade                0
dtype: int64
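
The per-column counts add up to 10 missing cells, yet dropping null rows takes the dataset from 40 rows to only 33, so some rows must be missing more than one value. The sketch below shows the difference between counting missing cells and counting affected rows. It uses a tiny made-up frame (`toy` is hypothetical, not the grades data).

```python
import pandas as pd

# Hypothetical frame for illustration only
toy = pd.DataFrame({
    "Gender": ["Male", None, "Female", None],
    "Final Exam": [61.0, None, None, 70.0],
})

# Total number of missing cells across all columns
null_cells = int(toy.isnull().sum().sum())

# Number of rows with at least one missing value (what dropna() removes)
null_rows = int(toy.isnull().any(axis=1).sum())

print(f"{null_cells} missing cells in {null_rows} rows")
```

Running the same two expressions on `data` would show how many rows `dropna()` is about to remove before it is called.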

In [None]:
# Drop rows with null values
data_cleaned = data.dropna()

# Verify if all null values are removed
print("\nCheck for remaining null values:")
print(data_cleaned.isnull().sum())

# Save the cleaned dataset to a new file
cleaned_file_path = "cleaned_sample_grades.csv"
data_cleaned.to_csv(cleaned_file_path, index=False)
print(f"\nCleaned data saved to {cleaned_file_path}")


Check for remaining null values:
Student ID           0
Gender               0
Project Phase 1      0
Project Phase 2      0
Mid-Semester Test    0
Final Exam           0
Grade                0
dtype: int64

Cleaned data saved to cleaned_sample_grades.csv


In [None]:
data.duplicated().sum()

np.int64(0)
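
Note that `duplicated()` with no arguments flags only rows that match in every column. As a hedged aside, the sketch below shows how a repeated identifier with differing scores would pass that check but be caught with the `subset` parameter. The frame `toy` is made up for illustration, and treating `Student ID` as a unique key is an assumption about the dataset.

```python
import pandas as pd

# Hypothetical frame: student 102 appears twice with different scores
toy = pd.DataFrame({
    "Student ID": [101, 102, 102],
    "Final Exam": [61.0, 62.0, 58.0],
})

# Full-row comparison: no two rows are identical
full_row_dupes = int(toy.duplicated().sum())

# Key-only comparison: the second 102 is flagged
id_dupes = int(toy.duplicated(subset="Student ID").sum())

print(full_row_dupes, id_dupes)
```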

In [None]:
print(f"There are {data_cleaned.shape[0]:,} rows and {data_cleaned.shape[1]} columns in our data after cleaning")

There are 33 rows and 7 columns in our data after cleaning


In [None]:
data_cleaned

Unnamed: 0,Student ID,Gender,Project Phase 1,Project Phase 2,Mid-Semester Test,Final Exam,Grade
0,101,Male,18.25,15.5,94,61.0,PA
1,102,Female,17.75,30.0,79,62.0,PA
2,103,Male,0.0,0.0,78,15.0,NN
3,104,Male,20.0,25.0,69,65.0,PA
4,105,Male,18.75,30.0,96,51.0,PA
5,106,Male,17.0,23.5,80,59.0,PA
7,108,Male,20.0,28.0,95,44.0,PA
8,109,Male,18.0,23.0,50,33.0,NN
9,110,Female,20.0,30.0,92,63.0,PA
10,111,Female,19.5,13.0,95,52.0,PA


