# IT326: Data mining project
# Phase#1:

## Goal of Collecting the Dataset: 

The dataset was collected to evaluate and predict students' adaptability levels in online education. It enables data scientists and researchers to explore factors influencing student performance and adaptability in virtual learning environments.

## Source of the Dataset: 
Kaggle platform. Dataset link: https://www.kaggle.com/datasets/mdmahmudulhasansuzan/students-adaptability-level-in-online-education

**Read dataset** 

In [35]:
import pandas as pd

df = pd.read_csv('Dataset/students_adaptability_level_online_education.csv')

## General information about the dataset:

- Number of Attributes: 14

- Number of Objects: 1205 (students)

- Class Name: Adaptivity Level

In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1205 entries, 0 to 1204
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Gender               1205 non-null   object
 1   Age                  1205 non-null   object
 2   Education Level      1205 non-null   object
 3   Institution Type     1205 non-null   object
 4   IT Student           1205 non-null   object
 5   Location             1205 non-null   object
 6   Load-shedding        1205 non-null   object
 7   Financial Condition  1205 non-null   object
 8   Internet Type        1205 non-null   object
 9   Network Type         1205 non-null   object
 10  Class Duration       1205 non-null   object
 11  Self Lms             1205 non-null   object
 12  Device               1205 non-null   object
 13  Adaptivity Level     1205 non-null   object
dtypes: object(14)
memory usage: 131.9+ KB


## Type of Attributes:

| Attributes Name | Data type | Description | Possible Values |
|----------|----------|----------|----------|
| Gender | Binary | Student's gender type | Girl, Boy |
| Age | Ordinal | Student's age range | 1-5, 6-10, 11-15, 16-20, 21-25, 26-30 |
| Education Level | Nominal | Student's education institution level | School, College, University |
| Institution Type | Binary | Student's education institution type | Government, Non Government |
| IT Student | Binary | Whether the student is studying IT or not | Yes, No |
| Location | Binary | Whether the student is studying in their hometown | Yes, No |
| Load-shedding | Binary | Level of load shedding | High, Low |
| Financial Condition | Ordinal | Student's family's financial condition | Rich, Mid, Poor |
| Internet Type | Binary | Student's most used internet type | Wifi, Mobile Data |
| Network Type | Ordinal | Network connectivity type | 2G, 3G, 4G |
| Class Duration | Ordinal | Student's daily class duration in hours | 0, 1-3, 3-6 |
| Self Lms | Binary | Whether the student's institution has its own LMS | Yes, No |
| Device | Nominal | Student's most used device in class | Computer, Tab, Mobile |
| Adaptivity Level | Ordinal | Student's adaptability level to online education | High, Moderate, Low |
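
The ordinal attributes in the table above have a natural order that pandas does not infer from `object` columns. A minimal sketch of how that order could be declared, assuming the level orderings implied by the table (lowest to highest) and using the first five rows displayed in Phase 2 as stand-in data; `ORDERED_LEVELS` and `apply_ordinal_types` are hypothetical helper names:

```python
import pandas as pd

# Assumed ordering of each ordinal attribute, from lowest to highest level.
ORDERED_LEVELS = {
    "Age": ["1-5", "6-10", "11-15", "16-20", "21-25", "26-30"],
    "Financial Condition": ["Poor", "Mid", "Rich"],
    "Network Type": ["2G", "3G", "4G"],
    "Class Duration": ["0", "1-3", "3-6"],
    "Adaptivity Level": ["Low", "Moderate", "High"],
}


def apply_ordinal_types(frame):
    """Return a copy whose ordinal columns are ordered categoricals."""
    typed = frame.copy()
    for column, levels in ORDERED_LEVELS.items():
        if column in typed.columns:
            typed[column] = pd.Categorical(
                typed[column], categories=levels, ordered=True
            )
    return typed


# Stand-in data: three columns from the first five rows shown in Phase 2.
demo = pd.DataFrame({
    "Financial Condition": ["Mid", "Mid", "Mid", "Mid", "Poor"],
    "Network Type": ["4G", "4G", "4G", "4G", "3G"],
    "Adaptivity Level": ["Moderate", "Moderate", "Moderate", "Moderate", "Low"],
})
typed = apply_ordinal_types(demo)

# Ordered categoricals support comparisons and min/max.
below_mid = typed["Financial Condition"] < "Mid"
lowest_network = typed["Network Type"].min()
level_codes = typed["Adaptivity Level"].cat.codes.tolist()
```

Once the order is declared, comparisons such as `typed["Financial Condition"] < "Mid"` and sorting follow the attribute's meaning rather than alphabetical order.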

# Phase#2:

### Loading the dataset and importing libraries:

In [37]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

df = pd.read_csv('Dataset/students_adaptability_level_online_education.csv')
df

Unnamed: 0,Gender,Age,Education Level,Institution Type,IT Student,Location,Load-shedding,Financial Condition,Internet Type,Network Type,Class Duration,Self Lms,Device,Adaptivity Level
0,Boy,21-25,University,Non Government,No,Yes,Low,Mid,Wifi,4G,3-6,No,Tab,Moderate
1,Girl,21-25,University,Non Government,No,Yes,High,Mid,Mobile Data,4G,1-3,Yes,Mobile,Moderate
2,Girl,16-20,College,Government,No,Yes,Low,Mid,Wifi,4G,1-3,No,Mobile,Moderate
3,Girl,11-15,School,Non Government,No,Yes,Low,Mid,Mobile Data,4G,1-3,No,Mobile,Moderate
4,Girl,16-20,School,Non Government,No,Yes,Low,Poor,Mobile Data,3G,0,No,Mobile,Low
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1200,Girl,16-20,College,Non Government,No,Yes,Low,Mid,Wifi,4G,1-3,No,Mobile,Low
1201,Girl,16-20,College,Non Government,No,No,High,Mid,Wifi,4G,3-6,No,Mobile,Moderate
1202,Boy,11-15,School,Non Government,No,Yes,Low,Mid,Mobile Data,3G,1-3,No,Mobile,Moderate
1203,Girl,16-20,College,Non Government,No,No,Low,Mid,Wifi,4G,1-3,No,Mobile,Low


1- Sample of the dataset:

In [38]:
sample_data = df.sample(n=20)
sample_data

Unnamed: 0,Gender,Age,Education Level,Institution Type,IT Student,Location,Load-shedding,Financial Condition,Internet Type,Network Type,Class Duration,Self Lms,Device,Adaptivity Level
182,Girl,1-5,School,Non Government,No,Yes,Low,Mid,Wifi,4G,1-3,No,Mobile,Moderate
285,Boy,21-25,College,Government,Yes,No,Low,Poor,Mobile Data,4G,0,No,Mobile,Low
38,Girl,26-30,University,Government,No,No,Low,Poor,Mobile Data,4G,1-3,No,Mobile,Low
163,Boy,6-10,School,Government,No,No,Low,Mid,Mobile Data,3G,0,No,Mobile,Low
422,Girl,11-15,School,Non Government,No,Yes,Low,Mid,Wifi,4G,1-3,No,Computer,Moderate
369,Girl,16-20,School,Non Government,No,Yes,Low,Mid,Wifi,4G,1-3,No,Mobile,Low
392,Girl,16-20,College,Non Government,No,No,Low,Mid,Wifi,4G,1-3,No,Mobile,Low
524,Boy,21-25,University,Non Government,No,Yes,Low,Mid,Mobile Data,4G,1-3,No,Mobile,Low
74,Boy,11-15,School,Non Government,Yes,Yes,Low,Mid,Mobile Data,4G,1-3,No,Mobile,Moderate
487,Boy,21-25,University,Non Government,Yes,Yes,High,Mid,Mobile Data,3G,1-3,Yes,Computer,Moderate


2- Show missing values:

In [39]:
missing_values = df.isna().sum()
print("\nTotal number of missing values in the dataset:", missing_values.sum())

# Creates a table that counts the number of missing values for each variable in the dataset
print("\nMissing Values:")
missing_table = pd.DataFrame({'Variable': missing_values.index, 'Missing Values': missing_values.values})
display(missing_table)


Total number of missing values in the dataset: 0

Missing Values:


Unnamed: 0,Variable,Missing Values
0,Gender,0
1,Age,0
2,Education Level,0
3,Institution Type,0
4,IT Student,0
5,Location,0
6,Load-shedding,0
7,Financial Condition,0
8,Internet Type,0
9,Network Type,0


3- Statistical summary:

- Convert the age ranges to numerical midpoints instead of ordinal labels:

In [40]:
def interval_to_midpoint(interval):
    # Split the interval by the dash (e.g., "1-5" becomes ["1", "5"])
    lower, upper = map(int, interval.split('-'))
    # Calculate the midpoint
    return (lower + upper) // 2

# Apply the conversion function to the 'Age' column
df['Age'] = df['Age'].apply(interval_to_midpoint)

# Now, the 'Age' column contains numerical midpoints
print("Age values after conversion:\n")
print(df['Age'])

Age values after conversion:

0       23
1       23
2       18
3       13
4       18
        ..
1200    18
1201    18
1202    13
1203    18
1204    13
Name: Age, Length: 1205, dtype: int64


In [41]:
df.describe()

Unnamed: 0,Age
count,1205.0
mean,17.219917
std,6.285479
min,3.0
25%,13.0
50%,18.0
75%,23.0
max,28.0


In [42]:
# Calculate the mean of the 'Age' column
mean_value = df["Age"].mean()
print("Mean of Age:", mean_value)

# Calculate the variance of the 'Age' column
age_variance = df['Age'].var()
print("Variance of Age:", age_variance)

# Calculate the median of the 'Age' column
median_value = df["Age"].median()
print("Median of Age:", median_value)

# Calculate the mode of the 'Age' column
mode_result = df["Age"].mode()
print("Mode of Age:", mode_result)

Mean of Age: 17.219917012448132
Variance of Age: 39.50724417915386
Median of Age: 18.0
Mode of Age: 0    23
Name: Age, dtype: int64


### Graphs:
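
One possible starting point for this section is the class distribution and its breakdown by education level. The sketch below only builds the count tables behind such bar charts. It uses the ten sample rows printed above as stand-in data, and the variable names are illustrative:

```python
import pandas as pd

# Stand-in data: two columns from the ten sample rows printed earlier.
demo = pd.DataFrame({
    "Education Level": [
        "School", "College", "University", "School", "School",
        "School", "College", "University", "School", "University",
    ],
    "Adaptivity Level": [
        "Moderate", "Low", "Low", "Low", "Moderate",
        "Low", "Low", "Low", "Moderate", "Moderate",
    ],
})

# Counts per class label (the data behind a simple bar chart).
class_counts = demo["Adaptivity Level"].value_counts()

# Class counts within each education level (the data behind a stacked bar chart).
by_education = pd.crosstab(demo["Education Level"], demo["Adaptivity Level"])

print(class_counts)
print(by_education)
```

With matplotlib installed, `class_counts.plot(kind="bar")` and `by_education.plot(kind="bar", stacked=True)` would draw these tables as charts. Replacing `demo` with the full `df` gives the counts for the whole dataset.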



### Data Cleaning:
1- Removing duplicates:

In [44]:
# Check for duplicate rows
num_duplicates = df.duplicated().sum()
print("Number of duplicate rows:", num_duplicates , "\n")

df = df.drop_duplicates()
print("DataFrame after dropping all duplicate rows:\n")
print(df)

Number of duplicate rows: 949 

DataFrame after dropping all duplicate rows:

     Gender  Age Education Level Institution Type IT Student Location  \
0       Boy   23      University   Non Government         No      Yes   
1      Girl   23      University   Non Government         No      Yes   
2      Girl   18         College       Government         No      Yes   
3      Girl   13          School   Non Government         No      Yes   
4      Girl   18          School   Non Government         No      Yes   
...     ...  ...             ...              ...        ...      ...   
1124    Boy   23      University   Non Government        Yes       No   
1132    Boy   18         College       Government         No      Yes   
1149   Girl   18         College   Non Government         No       No   
1160    Boy   23      University   Non Government        Yes       No   
1197    Boy   23      University   Non Government        Yes      Yes   

     Load-shedding Financial Condition Intern
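
Because every attribute is categorical, two different students can legitimately give identical answers, so the 949 flagged rows are not necessarily data-entry errors. A hedged sketch of how repeated answer patterns could be inspected before dropping them, using rows 392, 1200 and 1203 as they appear in the outputs above (rows 392 and 1203 are identical); `pattern_counts` is an illustrative name:

```python
import pandas as pd

columns = [
    "Gender", "Age", "Education Level", "Institution Type", "IT Student",
    "Location", "Load-shedding", "Financial Condition", "Internet Type",
    "Network Type", "Class Duration", "Self Lms", "Device", "Adaptivity Level",
]

# Rows copied from the outputs shown earlier, keyed by their original index.
rows = {
    392: ["Girl", "16-20", "College", "Non Government", "No", "No", "Low",
          "Mid", "Wifi", "4G", "1-3", "No", "Mobile", "Low"],
    1200: ["Girl", "16-20", "College", "Non Government", "No", "Yes", "Low",
           "Mid", "Wifi", "4G", "1-3", "No", "Mobile", "Low"],
    1203: ["Girl", "16-20", "College", "Non Government", "No", "No", "Low",
           "Mid", "Wifi", "4G", "1-3", "No", "Mobile", "Low"],
}
demo = pd.DataFrame.from_dict(rows, orient="index", columns=columns)

# keep=False marks every member of a duplicated group, not only the repeats.
dup_mask = demo.duplicated(keep=False)

# Count how often each distinct answer pattern occurs instead of discarding it.
pattern_counts = (
    demo.groupby(columns, sort=False).size().reset_index(name="Count")
)

print(demo[dup_mask].index.tolist())
print(pattern_counts[["Location", "Adaptivity Level", "Count"]])
```

Keeping a `Count` column preserves how common each pattern is, which matters if later steps estimate class frequencies from the deduplicated data.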

2- Handling Missing Values:
- There are no missing values in the dataset.

In [45]:
missing_values = df.isna().sum()
print("\nTotal number of missing values in the dataset:", missing_values.sum())


Total number of missing values in the dataset: 0


3- Detecting outliers using the mean method and handling them:

In [46]:
age_column = df['Age']

# Calculate the mean age
mean_age = age_column.mean()

# Calculate the absolute differences of each age from the mean
differences_from_mean = abs(age_column - mean_age)

# Find the index of the row with the largest difference from the mean
max_difference_index = differences_from_mean.idxmax()

# Remove the row with the largest difference from the mean
df = df.drop(max_difference_index)

print("\nDataFrame after removing the row with the largest difference from the mean:")
df


DataFrame after removing the row with the largest difference from the mean:


Unnamed: 0,Gender,Age,Education Level,Institution Type,IT Student,Location,Load-shedding,Financial Condition,Internet Type,Network Type,Class Duration,Self Lms,Device,Adaptivity Level
0,Boy,23,University,Non Government,No,Yes,Low,Mid,Wifi,4G,3-6,No,Tab,Moderate
1,Girl,23,University,Non Government,No,Yes,High,Mid,Mobile Data,4G,1-3,Yes,Mobile,Moderate
2,Girl,18,College,Government,No,Yes,Low,Mid,Wifi,4G,1-3,No,Mobile,Moderate
3,Girl,13,School,Non Government,No,Yes,Low,Mid,Mobile Data,4G,1-3,No,Mobile,Moderate
4,Girl,18,School,Non Government,No,Yes,Low,Poor,Mobile Data,3G,0,No,Mobile,Low
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1124,Boy,23,University,Non Government,Yes,No,High,Mid,Mobile Data,3G,3-6,No,Computer,Low
1132,Boy,18,College,Government,No,Yes,Low,Mid,Mobile Data,3G,1-3,No,Mobile,Moderate
1149,Girl,18,College,Non Government,No,No,Low,Mid,Mobile Data,3G,1-3,Yes,Mobile,Low
1160,Boy,23,University,Non Government,Yes,No,High,Mid,Mobile Data,3G,1-3,Yes,Mobile,Moderate
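
Dropping the single row farthest from the mean always removes one record, whether or not that record is unusual. A common alternative, sketched here as an assumption rather than as the project's method, is the 1.5 × IQR rule. `iqr_outlier_mask` is a hypothetical helper, and the value 90 is an invented age used only to show a flagged case:

```python
import pandas as pd


def iqr_outlier_mask(values, k=1.5):
    """Return a boolean Series that is True where a value lies outside
    [Q1 - k * IQR, Q3 + k * IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return (values < lower) | (values > upper)


# The six age midpoints produced by interval_to_midpoint.
ages = pd.Series([3, 8, 13, 18, 23, 28], name="Age")
mask = iqr_outlier_mask(ages)

# Same values plus one invented extreme age (hypothetical, for illustration).
ages_with_extreme = pd.Series([3, 8, 13, 18, 23, 28, 90], name="Age")
mask_with_extreme = iqr_outlier_mask(ages_with_extreme)

print(ages[mask].tolist())
print(ages_with_extreme[mask_with_extreme].tolist())
```

With the quartiles reported by `df.describe()` before deduplication (13 and 23), the fences would be -2 and 38, so none of the age midpoints (3 to 28) would be flagged under this rule.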


### Data Transformation:
1- Encoding categorical data:

2- Normalization:

3- Discretization:

4- Aggregation:
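
For the encoding and normalization steps listed above, a minimal sketch might look like the following. It assumes label encoding for binary and ordinal attributes, one-hot encoding for nominal ones, and min-max scaling for `Age`. The five rows are the first rows of the cleaned DataFrame shown earlier, and the mapping names are illustrative:

```python
import pandas as pd

# Stand-in data: four columns from the first five rows of the cleaned DataFrame.
demo = pd.DataFrame({
    "Age": [23, 23, 18, 13, 18],
    "Gender": ["Boy", "Girl", "Girl", "Girl", "Girl"],
    "Network Type": ["4G", "4G", "4G", "4G", "3G"],
    "Device": ["Tab", "Mobile", "Mobile", "Mobile", "Mobile"],
})

# Assumed integer codes; the ordinal mapping preserves 2G < 3G < 4G.
GENDER_CODES = {"Girl": 0, "Boy": 1}
NETWORK_CODES = {"2G": 0, "3G": 1, "4G": 2}

encoded = demo.copy()
encoded["Gender"] = encoded["Gender"].map(GENDER_CODES)
encoded["Network Type"] = encoded["Network Type"].map(NETWORK_CODES)

# Nominal attribute: one indicator column per observed category.
encoded = pd.get_dummies(encoded, columns=["Device"], dtype=int)

# Min-max normalization rescales Age to the range [0, 1].
age = encoded["Age"]
encoded["Age"] = (age - age.min()) / (age.max() - age.min())

print(encoded)
```

One-hot encoding avoids imposing an artificial order on `Device`, while the integer codes keep the order that the ordinal attributes already have.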

### Feature selection:

1- Correlation Coefficient:

2- Chi-square:

3- Feature selection:
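
Since every attribute except the converted `Age` is categorical, the chi-square statistic is a natural way to score each attribute against `Adaptivity Level`. The sketch below computes it directly from a contingency table. `chi_square_statistic` is a hypothetical helper, and the five stand-in rows (the first rows shown earlier) are far too few for a valid test, so they only illustrate the arithmetic:

```python
import pandas as pd


def chi_square_statistic(feature, target):
    """Return (chi-square statistic, degrees of freedom) for two
    categorical Series, without any continuity correction."""
    observed = pd.crosstab(feature, target)
    counts = observed.to_numpy()
    row_totals = counts.sum(axis=1)
    col_totals = counts.sum(axis=0)
    total = counts.sum()
    # Expected count under independence: row total * column total / total.
    expected = row_totals[:, None] * col_totals[None, :] / total
    statistic = ((counts - expected) ** 2 / expected).sum()
    dof = (counts.shape[0] - 1) * (counts.shape[1] - 1)
    return float(statistic), dof


# Stand-in data: three columns from the first five rows shown earlier.
demo = pd.DataFrame({
    "Gender": ["Boy", "Girl", "Girl", "Girl", "Girl"],
    "Network Type": ["4G", "4G", "4G", "4G", "3G"],
    "Adaptivity Level": ["Moderate", "Moderate", "Moderate", "Moderate", "Low"],
})

stat_network, dof_network = chi_square_statistic(
    demo["Network Type"], demo["Adaptivity Level"]
)
stat_gender, dof_gender = chi_square_statistic(
    demo["Gender"], demo["Adaptivity Level"]
)

print("Network Type:", stat_network, dof_network)
print("Gender:", stat_gender, dof_gender)
```

A larger statistic for the same degrees of freedom suggests a stronger association with the class label. If SciPy is available, `scipy.stats.chi2_contingency` also returns a p-value; note that it applies a continuity correction to 2×2 tables by default, so its statistic can differ from this one.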