In [1]:
!pip install pandas



In [4]:
# Load the dataset and inspect the first rows and column types
import pandas as pd

df = pd.read_csv("studentsgrade.csv")
print(df.head())
df.info()  # info() prints its summary directly and returns None

  Student_ID First_Name Last_Name                    Email  Gender  Age  \
0      S1000       Omar  Williams  student0@university.com  Female   22   
1      S1001      Maria     Brown  student1@university.com    Male   18   
2      S1002      Ahmed     Jones  student2@university.com    Male   24   
3      S1003       Omar  Williams  student3@university.com  Female   24   
4      S1004       John     Smith  student4@university.com  Female   23   

    Department  Attendance (%)  Midterm_Score  Final_Score  ...  \
0  Engineering           52.29          55.03        57.82  ...   
1  Engineering           97.27          97.23        45.80  ...   
2     Business           57.19          67.05        93.68  ...   
3  Mathematics           95.15          47.79        80.63  ...   
4           CS           54.18          46.59        78.89  ...   

   Projects_Score  Total_Score  Grade  Study_Hours_per_Week  \
0           85.90        56.09      F                   6.2   
1           55.65   

# Some interesting relations
- A good Relationship_with_Manager might correlate with higher Work_Environment_Satisfaction.
- You can explore if employees in Manager or Executive roles stay longer (high Years_at_Company) than employees in Assistant roles.
- You can analyze if employees with lower Job Satisfaction are more likely to leave (i.e., Attrition = "Yes").

# Features that I would like to predict:
- Predict the Job Satisfaction level (possibly categorized as Low, Medium, High) based on other employee features.
- Predict whether an employee works Overtime based on other features like Job Role, Years at Company, or Absenteeism.

In [9]:
# Check for null and duplicate values before cleaning the dataset
print(f"Missing values :\n {df.isnull().sum()}")
print(f"Duplicated values : {df.duplicated().sum()}")

Missing values :
 Student_ID                       0
First_Name                       0
Last_Name                        0
Email                            0
Gender                           0
Age                              0
Department                       0
Attendance (%)                 516
Midterm_Score                    0
Final_Score                      0
Assignments_Avg                517
Quizzes_Avg                      0
Participation_Score              0
Projects_Score                   0
Total_Score                      0
Grade                            0
Study_Hours_per_Week             0
Extracurricular_Activities       0
Internet_Access_at_Home          0
Parent_Education_Level        1794
Family_Income_Level              0
Stress_Level (1-10)              0
Sleep_Hours_per_Night            0
dtype: int64
Duplicated values : 0


## Missing and Duplicate values
Checking for missing and duplicate values shows that three columns have missing entries: **Attendance (%)** (516), **Assignments_Avg** (517), and **Parent_Education_Level** (1794). There are no duplicated rows.

Missing values can be handled either by filling them or by dropping the affected rows.
We will fill **Attendance (%)** and **Assignments_Avg** with their column means and drop the rows where **Parent_Education_Level** is missing.
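
One way to inform the fill-or-drop choice is to look at the share of missing values per column and at how each strategy changes the row count. The sketch below uses a small hypothetical frame (`toy`, with made-up values, not rows from `studentsgrade.csv`) and only standard pandas calls.

```python
import pandas as pd

# Hypothetical toy data that mimics three columns of the dataset (values are made up).
toy = pd.DataFrame({
    "Attendance (%)": [52.29, None, 57.19, 95.15],
    "Assignments_Avg": [84.0, 70.0, None, 90.0],
    "Parent_Education_Level": ["PhD", None, None, "Master's"],
})

# Percentage of missing entries per column.
missing_share = toy.isnull().mean() * 100
print(missing_share.sort_values(ascending=False))

# Filling with the mean keeps every row.
filled = toy["Attendance (%)"].fillna(toy["Attendance (%)"].mean())

# Dropping rows with a missing category shrinks the frame.
dropped = toy.dropna(subset=["Parent_Education_Level"])
print(f"Rows before: {len(toy)}, after dropna: {len(dropped)}")
```

Mean filling preserves the sample size but pulls the filled entries toward the center of the distribution, while dropping rows keeps only observed values at the cost of fewer samples.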

In [10]:
# Data cleaning
# Fill missing values in numerical columns with the column mean.

df['Attendance (%)'] = df['Attendance (%)'].fillna(df['Attendance (%)'].mean())
df['Assignments_Avg'] = df['Assignments_Avg'].fillna(df['Assignments_Avg'].mean())

print(f"Missing values :\n {df.isnull().sum()}")

Missing values :
 Student_ID                       0
First_Name                       0
Last_Name                        0
Email                            0
Gender                           0
Age                              0
Department                       0
Attendance (%)                   0
Midterm_Score                    0
Final_Score                      0
Assignments_Avg                  0
Quizzes_Avg                      0
Participation_Score              0
Projects_Score                   0
Total_Score                      0
Grade                            0
Study_Hours_per_Week             0
Extracurricular_Activities       0
Internet_Access_at_Home          0
Parent_Education_Level        1794
Family_Income_Level              0
Stress_Level (1-10)              0
Sleep_Hours_per_Night            0
dtype: int64


### The missing values in **Attendance (%)** and **Assignments_Avg** are now filled; **Parent_Education_Level** still has 1794 missing entries.

In [11]:
# Drop the rows where Parent_Education_Level is missing (the column itself is kept)
df.dropna(subset=['Parent_Education_Level'], inplace=True)

print(f"Missing values :\n {df.isnull().sum()}")

Missing values :
 Student_ID                    0
First_Name                    0
Last_Name                     0
Email                         0
Gender                        0
Age                           0
Department                    0
Attendance (%)                0
Midterm_Score                 0
Final_Score                   0
Assignments_Avg               0
Quizzes_Avg                   0
Participation_Score           0
Projects_Score                0
Total_Score                   0
Grade                         0
Study_Hours_per_Week          0
Extracurricular_Activities    0
Internet_Access_at_Home       0
Parent_Education_Level        0
Family_Income_Level           0
Stress_Level (1-10)           0
Sleep_Hours_per_Night         0
dtype: int64


### Rows with a missing **Parent_Education_Level** have been dropped, so the dataset no longer contains missing values.
### Next, we create the training and testing sets.
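
As a hedged aside, a split can also be stratified so that each grade appears in the same proportion in both sets. The sketch below assumes a hypothetical toy frame (`toy`, not the real dataset) and uses the `stratify` parameter of scikit-learn's `train_test_split`; the notebook's own split does not stratify.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 students, 5 per grade (not taken from studentsgrade.csv).
toy = pd.DataFrame({
    "Total_Score": list(range(20)),
    "Grade": ["A", "B", "C", "D"] * 5,
})

# stratify keeps the grade proportions equal in the train and test sets.
train_set, test_set = train_test_split(
    toy, test_size=0.2, random_state=123, stratify=toy["Grade"]
)

print(train_set["Grade"].value_counts().sort_index())
print(test_set["Grade"].value_counts().sort_index())
```

Stratifying mainly matters when some classes are rare, since a purely random 20% sample could otherwise leave a grade under-represented in the test set.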

In [12]:
!pip install scikit-learn



In [19]:
# Create training and testing sets with scikit-learn.
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(df, test_size=0.2, random_state=123)
print(f"Train Set :\n{train_set.head()}")
print(f"Test Set :\n{test_set.head()}")

Train Set :
     Student_ID First_Name Last_Name                       Email  Gender  Age  \
514       S1514       Sara  Williams   student514@university.com  Female   20   
1921      S2921      Ahmed  Williams  student1921@university.com  Female   19   
3999      S4999       Emma   Johnson  student3999@university.com  Female   20   
2121      S3121      Maria     Davis  student2121@university.com  Female   18   
4742      S5742      Maria  Williams  student4742@university.com  Female   24   

       Department  Attendance (%)  Midterm_Score  Final_Score  ...  \
514      Business           91.65          53.79        62.82  ...   
1921     Business           68.68          74.46        47.76  ...   
3999  Mathematics           90.86          88.60        44.19  ...   
2121  Engineering           92.72          49.60        63.92  ...   
4742  Engineering           85.80          43.95        53.98  ...   

      Projects_Score  Total_Score  Grade  Study_Hours_per_Week  \
514           