#**Data Understanding**

**Attribute Information**

| Attribute | Data Type | Description |
| --- | --- | --- |
| GENDER | object | Gender type of student |
| AGE | object | Age range of the student |
| EDUCATION LEVEL | object | Education institution level |
| INSTITUTION TYPE | object | Education institution type |
| IT STUDENT | object | Studying as IT student or not |
| LOCATION | object | Is student location in town |
| LOAD SHEDDING | object | Level of load shedding |
| FINANCIAL CONDITION | object | Financial condition of family |
| INTERNET TYPE | object | Internet type used mostly in device |
| NETWORK TYPE | object | Network connectivity type |
| CLASS DURATION | object | Daily class duration |
| SELF LMS | object | Institution's own LMS availability |
| DEVICE | object | Device used mostly in class |
| ADAPTIVITY LEVEL | object | Adaptability level of the student |

**Dataset Context**

Reading the dataset shows that it was compiled to assess how well school and university students adapted to online learning during the COVID-19 pandemic. The dataset comes from a scientific study; details are available at the following link: https://www.researchgate.net/publication/355891881_Students'_Adaptability_Level_Prediction_in_Online_Education_using_Machine_Learning_Approaches.

**Problem Statement**

How well did school and university students in Bangladesh adapt to online learning during the COVID-19 pandemic?

**Goals**
- Find insights in the patterns and interrelated factors among school and university students whose adaptability to online learning ranges from low to high.
- Build a predictive model that determines whether a school or university student has low, moderate, or high adaptability to online learning during the COVID-19 pandemic.

**Analytical Approach**

The approaches used are:
  1. Exploratory Data Analysis.
  2. Predictive Analytics (Multi-class Classification).

**Evaluation Metrics**

The Confusion Matrix:

![Confusion matrix](https://2.bp.blogspot.com/-EvSXDotTOwc/XMfeOGZ-CVI/AAAAAAAAEiE/oePFfvhfOQM11dgRn9FkPxlegCXbgOF4QCLcBGAs/s1600/confusionMatrxiUpdated.jpg)

- The evaluation metrics used for the machine learning (ML) models are **F1-Score**, **Recall**, **ROC-AUC (Area Under the Receiver Operating Characteristic Curve)**, and **AUPRC (Area Under the Precision-Recall Curve, also called PR-AUC)**.
- The reasons for choosing F1-Score, Recall, ROC-AUC, and AUPRC as evaluation metrics are explained on the following pages: [[medium.com]](https://medium.com/cuenex/advanced-evaluation-metrics-for-imbalanced-classification-models-ee6f248c90ca), [[machinelearningmastery.com]](https://machinelearningmastery.com/tour-of-evaluation-metrics-for-imbalanced-classification/)
- To select the best ML model, we use AUPRC as the primary metric and pick the model with the highest score when predicting the labels.
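
As a minimal sketch of how two of these metrics behave on a three-class problem, the snippet below computes per-class precision, recall, and F1 from raw label counts and then macro-averages them. The toy labels are hypothetical, the helper names `per_class_scores` and `macro_average` are invented for illustration, and macro-averaging is an assumption (the notebook does not say which averaging it uses). ROC-AUC and AUPRC are not shown because they need predicted class probabilities rather than hard labels.

```python
from collections import Counter


def per_class_scores(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} computed from raw counts."""
    pairs = Counter(zip(y_true, y_pred))
    scores = {}
    for label in labels:
        tp = pairs[(label, label)]
        # predicted as `label` although the true class differs
        fp = sum(n for (t, p), n in pairs.items() if p == label and t != label)
        # true class is `label` but something else was predicted
        fn = sum(n for (t, p), n in pairs.items() if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_average(scores, position):
    """Unweighted mean over classes (0 = precision, 1 = recall, 2 = F1)."""
    return sum(s[position] for s in scores.values()) / len(scores)


# Hypothetical toy labels, not taken from the dataset.
labels = ["Low", "Moderate", "High"]
y_true = ["Low", "Low", "Moderate", "Moderate", "Moderate", "High"]
y_pred = ["Low", "Moderate", "Moderate", "Moderate", "Low", "High"]

scores = per_class_scores(y_true, y_pred, labels)
macro_recall = macro_average(scores, 1)
macro_f1 = macro_average(scores, 2)
print(f"macro recall = {macro_recall:.3f}, macro F1 = {macro_f1:.3f}")
```

Macro-averaging gives every class the same weight, so a rare class influences the score as much as a frequent one.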


##**1. Import Library**

In [1]:
# Common library used
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import calendar
import warnings
warnings.filterwarnings('ignore')

In [None]:
from google.colab import drive
drive.mount('/content/drive')

##**2. Load Dataset**

In [15]:
# load dataset from google drive storage
df = pd.read_csv('/content/drive/MyDrive/4.Data Science Course/Personal Project/Student Adaptivity in Online Learning/students_adaptability_level_online_education.csv')
df

Unnamed: 0,Gender,Age,Education Level,Institution Type,IT Student,Location,Load-shedding,Financial Condition,Internet Type,Network Type,Class Duration,Self Lms,Device,Adaptivity Level
0,Boy,21-25,University,Non Government,No,Yes,Low,Mid,Wifi,4G,3-6,No,Tab,Moderate
1,Girl,21-25,University,Non Government,No,Yes,High,Mid,Mobile Data,4G,1-3,Yes,Mobile,Moderate
2,Girl,16-20,College,Government,No,Yes,Low,Mid,Wifi,4G,1-3,No,Mobile,Moderate
3,Girl,11-15,School,Non Government,No,Yes,Low,Mid,Mobile Data,4G,1-3,No,Mobile,Moderate
4,Girl,16-20,School,Non Government,No,Yes,Low,Poor,Mobile Data,3G,0,No,Mobile,Low
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1200,Girl,16-20,College,Non Government,No,Yes,Low,Mid,Wifi,4G,1-3,No,Mobile,Low
1201,Girl,16-20,College,Non Government,No,No,High,Mid,Wifi,4G,3-6,No,Mobile,Moderate
1202,Boy,11-15,School,Non Government,No,Yes,Low,Mid,Mobile Data,3G,1-3,No,Mobile,Moderate
1203,Girl,16-20,College,Non Government,No,No,Low,Mid,Wifi,4G,1-3,No,Mobile,Low


In [16]:
# show the first 5 rows of the dataset
df.head()

Unnamed: 0,Gender,Age,Education Level,Institution Type,IT Student,Location,Load-shedding,Financial Condition,Internet Type,Network Type,Class Duration,Self Lms,Device,Adaptivity Level
0,Boy,21-25,University,Non Government,No,Yes,Low,Mid,Wifi,4G,3-6,No,Tab,Moderate
1,Girl,21-25,University,Non Government,No,Yes,High,Mid,Mobile Data,4G,1-3,Yes,Mobile,Moderate
2,Girl,16-20,College,Government,No,Yes,Low,Mid,Wifi,4G,1-3,No,Mobile,Moderate
3,Girl,11-15,School,Non Government,No,Yes,Low,Mid,Mobile Data,4G,1-3,No,Mobile,Moderate
4,Girl,16-20,School,Non Government,No,Yes,Low,Poor,Mobile Data,3G,0,No,Mobile,Low


In [4]:
# show the last 5 rows of the dataset
df.tail()

Unnamed: 0,Gender,Age,Education Level,Institution Type,IT Student,Location,Load-shedding,Financial Condition,Internet Type,Network Type,Class Duration,Self Lms,Device,Adaptivity Level
1200,Girl,16-20,College,Non Government,No,Yes,Low,Mid,Wifi,4G,1-3,No,Mobile,Low
1201,Girl,16-20,College,Non Government,No,No,High,Mid,Wifi,4G,3-6,No,Mobile,Moderate
1202,Boy,11-15,School,Non Government,No,Yes,Low,Mid,Mobile Data,3G,1-3,No,Mobile,Moderate
1203,Girl,16-20,College,Non Government,No,No,Low,Mid,Wifi,4G,1-3,No,Mobile,Low
1204,Girl,11-15,School,Non Government,No,Yes,Low,Poor,Mobile Data,3G,1-3,No,Mobile,Moderate


##**3. Data Cleansing**

###**- Check Information Dataset**

In [5]:
# check information of dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1205 entries, 0 to 1204
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Gender               1205 non-null   object
 1   Age                  1205 non-null   object
 2   Education Level      1205 non-null   object
 3   Institution Type     1205 non-null   object
 4   IT Student           1205 non-null   object
 5   Location             1205 non-null   object
 6   Load-shedding        1205 non-null   object
 7   Financial Condition  1205 non-null   object
 8   Internet Type        1205 non-null   object
 9   Network Type         1205 non-null   object
 10  Class Duration       1205 non-null   object
 11  Self Lms             1205 non-null   object
 12  Device               1205 non-null   object
 13  Adaptivity Level     1205 non-null   object
dtypes: object(14)
memory usage: 131.9+ KB


In [6]:
# view the value counts of every categorical column
cat_cols = df.select_dtypes(include='object').columns.tolist()
(df[cat_cols]
 .melt(var_name='column', value_name='value')
 .value_counts()
 .rename('counts')  # name the count column explicitly instead of relying on the default name
 .to_frame()
 .sort_values(by=['column', 'counts']))

column,value,counts
Adaptivity Level,High,100
Adaptivity Level,Low,480
Adaptivity Level,Moderate,625
Age,6-10,51
Age,26-30,68
Age,1-5,81
Age,16-20,278
Age,11-15,353
Age,21-25,374
Class Duration,0,154
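
The `Adaptivity Level` counts above already suggest why imbalance-aware metrics were chosen. A small sketch, using only the three counts copied from that output (taken before duplicates are dropped), turns them into class shares; the variable names are illustrative.

```python
# Counts copied from the "Adaptivity Level" rows of the output above
# (raw dataset, before any duplicate rows are removed).
counts = {"High": 100, "Low": 480, "Moderate": 625}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Print the classes from rarest to most common.
for label, share in sorted(shares.items(), key=lambda kv: kv[1]):
    print(f"{label:<9}{share:6.1%}")

# Ratio between the most and the least frequent class.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"imbalance ratio = {imbalance_ratio:.2f}")
```

The `High` class makes up roughly 8% of the raw rows, so plain accuracy would be dominated by the two larger classes.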


###**- Check Missing Values**

In [7]:
# count the missing values in each column
df.isna().sum()

Gender                 0
Age                    0
Education Level        0
Institution Type       0
IT Student             0
Location               0
Load-shedding          0
Financial Condition    0
Internet Type          0
Network Type           0
Class Duration         0
Self Lms               0
Device                 0
Adaptivity Level       0
dtype: int64

> - The dataset contains no missing values

###**- Check Duplicated Values**

In [19]:
# count the duplicated rows
df.duplicated().sum()

964

In [20]:
df[df.duplicated()]

Unnamed: 0,gender,age,education_level,institution_type,it_student,location,load_shedding,financial_condition,internet_type,class_duration,self_lms,device,adaptivity_level
23,Female,11-15,School,Non Government,No,Yes,Low,Middle Class,Mobile Data,1-3,No,Mobile,Moderate
25,Male,11-15,School,Non Government,No,Yes,Low,Middle Class,Mobile Data,1-3,No,Mobile,Moderate
28,Female,1-5,School,Non Government,No,Yes,Low,Middle Class,Mobile Data,1-3,No,Mobile,Moderate
29,Female,16-20,College,Non Government,No,Yes,High,Middle Class,Wifi,3-6,No,Mobile,Moderate
34,Male,11-15,School,Non Government,No,Yes,Low,Lower Class,Mobile Data,1-3,No,Mobile,Low
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1200,Female,16-20,College,Non Government,No,Yes,Low,Middle Class,Wifi,1-3,No,Mobile,Low
1201,Female,16-20,College,Non Government,No,No,High,Middle Class,Wifi,3-6,No,Mobile,Moderate
1202,Male,11-15,School,Non Government,No,Yes,Low,Middle Class,Mobile Data,1-3,No,Mobile,Moderate
1203,Female,16-20,College,Non Government,No,No,Low,Middle Class,Wifi,1-3,No,Mobile,Low


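> - The dataset contains 949 duplicated rows (1205 - 949 = 256 unique rows, which matches the `df.info()` output in section 4). The count of 964 shown above appears to come from re-running the cell after the formatting step in section 4, when `network_type` had already been dropped: the renamed columns in the output and the execution counters (In [18] before In [19]) both point to this.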
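
To make clear what `duplicated()` counts, here is a small sketch on a hypothetical toy frame (the values are illustrative, not from the dataset). By default pandas flags only the repeats after the first occurrence, so the number of rows left by `drop_duplicates()` equals the row count minus `duplicated().sum()`.

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative only.
toy = pd.DataFrame({
    "gender": ["Boy", "Girl", "Boy", "Boy", "Girl"],
    "device": ["Mobile", "Mobile", "Mobile", "Mobile", "Tab"],
})

# Default keep='first': only the later repeats are flagged.
n_flagged = int(toy.duplicated().sum())

# keep=False flags every row that has at least one identical twin.
n_involved = int(toy.duplicated(keep=False).sum())

# drop_duplicates() keeps the first occurrence and its original index label.
deduped = toy.drop_duplicates()

print(n_flagged, n_involved, len(deduped))
print(deduped.index.tolist())
```

The surviving rows keep their original index labels, which is why the cleaned frame later reports a non-contiguous index.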


In [None]:
# drop duplicated data
df = df.drop_duplicates()

In [None]:
# check again if any data duplicated
df.duplicated().sum()

0

> - No duplicated rows remain after dropping the duplicates

##**4. Data Formatting**

In [18]:
# replace spaces and hyphens in the column names with underscores
df = df.rename(columns=lambda x: x.replace(' ', '_').replace('-', '_'))

# change the column names to lowercase
df = df.rename(columns=lambda x: x.lower())

# relabel the values in the gender and financial_condition columns
df['gender'] = df['gender'].replace(['Boy', 'Girl'], ['Male', 'Female'])
df['financial_condition'] = df['financial_condition'].replace(
    ['Mid', 'Poor', 'Rich'], ['Middle Class', 'Lower Class', 'Upper Class'])

# drop the network_type column, which is treated as redundant here
# (columns= already selects the axis, so axis=1 is not needed)
df = df.drop(columns='network_type')

In [11]:
df

Unnamed: 0,gender,age,education_level,institution_type,it_student,location,load_shedding,financial_condition,internet_type,class_duration,self_lms,device,adaptivity_level
0,Male,21-25,University,Non Government,No,Yes,Low,Middle Class,Wifi,3-6,No,Tab,Moderate
1,Female,21-25,University,Non Government,No,Yes,High,Middle Class,Mobile Data,1-3,Yes,Mobile,Moderate
2,Female,16-20,College,Government,No,Yes,Low,Middle Class,Wifi,1-3,No,Mobile,Moderate
3,Female,11-15,School,Non Government,No,Yes,Low,Middle Class,Mobile Data,1-3,No,Mobile,Moderate
4,Female,16-20,School,Non Government,No,Yes,Low,Lower Class,Mobile Data,0,No,Mobile,Low
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1200,Female,16-20,College,Non Government,No,Yes,Low,Middle Class,Wifi,1-3,No,Mobile,Low
1201,Female,16-20,College,Non Government,No,No,High,Middle Class,Wifi,3-6,No,Mobile,Moderate
1202,Male,11-15,School,Non Government,No,Yes,Low,Middle Class,Mobile Data,1-3,No,Mobile,Moderate
1203,Female,16-20,College,Non Government,No,No,Low,Middle Class,Wifi,1-3,No,Mobile,Low


In [None]:
# check the dataset information again after cleansing and formatting
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 256 entries, 0 to 1197
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   gender               256 non-null    object
 1   age                  256 non-null    object
 2   education_level      256 non-null    object
 3   institution_type     256 non-null    object
 4   it_student           256 non-null    object
 5   location             256 non-null    object
 6   load_shedding        256 non-null    object
 7   financial_condition  256 non-null    object
 8   internet_type        256 non-null    object
 9   class_duration       256 non-null    object
 10  self_lms             256 non-null    object
 11  device               256 non-null    object
 12  adaptivity_level     256 non-null    object
dtypes: object(13)
memory usage: 28.0+ KB


##**5. Save The Processed Dataset For EDA**

In [None]:
# save the processed dataset back to Google Drive
df.to_csv('/content/drive/MyDrive/4.Data Science Course/Personal Project/Student Adaptivity in Online Learning/clean_dataset.csv', index=False)
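
A quick way to sanity-check the saved file is to read it back and compare it with the frame in memory. The sketch below does this on a hypothetical toy frame written to a temporary directory, so it does not touch Google Drive; the file name is reused only for illustration. Passing `dtype=str` to `pd.read_csv` is an assumption about how the file would be reloaded for EDA, chosen so that no column is re-inferred as numeric.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame standing in for the cleaned dataset.
clean = pd.DataFrame(
    {"age": ["21-25", "16-20", "11-15"], "class_duration": ["3-6", "1-3", "0"]},
    index=[0, 2, 7],  # gaps in the index, as left by drop_duplicates()
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "clean_dataset.csv")
    # index=False drops the old row labels from the file
    clean.to_csv(path, index=False)
    # dtype=str reads every column as text
    reloaded = pd.read_csv(path, dtype=str)

# The values survive the round trip; the index restarts at 0.
same_values = reloaded.values.tolist() == clean.values.tolist()
print(same_values, reloaded.index.tolist())
```

Because the file is written with `index=False`, the gaps left in the index by `drop_duplicates()` disappear on reload.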