### OCI Data Science - Useful Tips
<details>
<summary><font size="2">Check for Public Internet Access</font></summary>

```python
import requests
response = requests.get("https://oracle.com")
assert response.status_code==200, "Internet connection failed"
```
</details>
<details>
<summary><font size="2">Helpful Documentation </font></summary>
<ul><li><a href="https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm">Data Science Service Documentation</a></li>
<li><a href="https://docs.cloud.oracle.com/iaas/tools/ads-sdk/latest/index.html">ADS documentation</a></li>
</ul>
</details>
<details>
<summary><font size="2">Typical Cell Imports and Settings for ADS</font></summary>

```python
%load_ext autoreload
%autoreload 2
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

import logging
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.ERROR)

import ads
from ads.dataset.factory import DatasetFactory
from ads.automl.provider import OracleAutoMLProvider
from ads.automl.driver import AutoML
from ads.evaluations.evaluator import ADSEvaluator
from ads.common.data import ADSData
from ads.explanations.explainer import ADSExplainer
from ads.explanations.mlx_global_explainer import MLXGlobalExplainer
from ads.explanations.mlx_local_explainer import MLXLocalExplainer
from ads.catalog.model import ModelCatalog
from ads.common.model_artifact import ModelArtifact
```
</details>
<details>
<summary><font size="2">Useful Environment Variables</font></summary>

```python
import os
print(os.environ["NB_SESSION_COMPARTMENT_OCID"])
print(os.environ["PROJECT_OCID"])
print(os.environ["USER_OCID"])
print(os.environ["TENANCY_OCID"])
print(os.environ["NB_REGION"])
```
</details>

In [201]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import joblib

In [189]:
# Let's load the data
df = pd.read_csv("../../data/student_performance/xAPI-Edu-Data.csv")

In [190]:
# Let's preview the data from top
df.head()

| | gender | NationalITy | PlaceofBirth | StageID | GradeID | SectionID | Topic | Semester | Relation | raisedhands | VisITedResources | AnnouncementsView | Discussion | ParentAnsweringSurvey | ParentschoolSatisfaction | StudentAbsenceDays | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | M | KW | KuwaIT | lowerlevel | G-04 | A | IT | F | Father | 15 | 16 | 2 | 20 | Yes | Good | Under-7 | M |
| 1 | M | KW | KuwaIT | lowerlevel | G-04 | A | IT | F | Father | 20 | 20 | 3 | 25 | Yes | Good | Under-7 | M |
| 2 | M | KW | KuwaIT | lowerlevel | G-04 | A | IT | F | Father | 10 | 7 | 0 | 30 | No | Bad | Above-7 | L |
| 3 | M | KW | KuwaIT | lowerlevel | G-04 | A | IT | F | Father | 30 | 25 | 5 | 35 | No | Bad | Above-7 | L |
| 4 | M | KW | KuwaIT | lowerlevel | G-04 | A | IT | F | Father | 40 | 50 | 12 | 50 | No | Bad | Above-7 | M |


In [191]:
# Let's check the shape of the data set
df.shape

(480, 17)

In [192]:
# Let's check information of the data set
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 480 entries, 0 to 479
Data columns (total 17 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   gender                    480 non-null    object
 1   NationalITy               480 non-null    object
 2   PlaceofBirth              480 non-null    object
 3   StageID                   480 non-null    object
 4   GradeID                   480 non-null    object
 5   SectionID                 480 non-null    object
 6   Topic                     480 non-null    object
 7   Semester                  480 non-null    object
 8   Relation                  480 non-null    object
 9   raisedhands               480 non-null    int64 
 10  VisITedResources          480 non-null    int64 
 11  AnnouncementsView         480 non-null    int64 
 12  Discussion                480 non-null    int64 
 13  ParentAnsweringSurvey     480 non-null    object
 14  ParentschoolSatisfaction  

**Note:**

- Four features are integers (`raisedhands`, `VisITedResources`, `AnnouncementsView`, `Discussion`); the other 13 are objects (categorical)

In [193]:
# Let's describe integer features
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| raisedhands | 480.0 | 46.775 | 30.779223 | 0.0 | 15.75 | 50.0 | 75.0 | 100.0 |
| VisITedResources | 480.0 | 54.797917 | 33.080007 | 0.0 | 20.0 | 65.0 | 84.0 | 99.0 |
| AnnouncementsView | 480.0 | 37.91875 | 26.611244 | 0.0 | 14.0 | 33.0 | 58.0 | 98.0 |
| Discussion | 480.0 | 43.283333 | 27.637735 | 1.0 | 20.0 | 39.0 | 70.0 | 99.0 |


**Note:**

- The mean of `raisedhands` is 46.775
- The mean of `VisITedResources` is about 54.8
- The mean of `AnnouncementsView` is about 37.9 (median 33), which suggests that most students do not tend to view announcements
- The mean of `Discussion` is about 43.3 (median 39), so at least half of the students fall below the mean for discussion activity

In [194]:
# Let's check for null values

df.isnull().sum().any()

False

**Note:**

- The data set has no missing values

In [195]:
# Let's see the distribution of performances

df.Class.value_counts()

M    211
H    142
L    127
Name: Class, dtype: int64

**Note:**

- The largest class is medium performance (`M`, 211 of 480 students)
- The three classes are not equally distributed, but the imbalance is moderate
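
To put a number on the imbalance noted above, the sketch below recomputes the class shares from the counts printed by `value_counts()`. The counts are copied by hand into a small Series so the snippet runs on its own; `proportions` and `imbalance_ratio` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({"M": 211, "H": 142, "L": 127}, name="Class")

# Share of each class in the data set.
proportions = counts / counts.sum()
print(proportions.round(3))

# Ratio between the largest and the smallest class, a rough imbalance indicator.
imbalance_ratio = counts.max() / counts.min()
print(round(imbalance_ratio, 2))
```

On the real DataFrame, `df.Class.value_counts(normalize=True)` should give the same shares directly.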

In [196]:
# Let's check dataset columns
df.columns

Index(['gender', 'NationalITy', 'PlaceofBirth', 'StageID', 'GradeID',
       'SectionID', 'Topic', 'Semester', 'Relation', 'raisedhands',
       'VisITedResources', 'AnnouncementsView', 'Discussion',
       'ParentAnsweringSurvey', 'ParentschoolSatisfaction',
       'StudentAbsenceDays', 'Class'],
      dtype='object')

In [197]:
df.gender.unique()

array(['M', 'F'], dtype=object)

In [165]:
df.NationalITy.unique()

array(['KW', 'lebanon', 'Egypt', 'SaudiArabia', 'USA', 'Jordan',
       'venzuela', 'Iran', 'Tunis', 'Morocco', 'Syria', 'Palestine',
       'Iraq', 'Lybia'], dtype=object)

In [166]:
df.PlaceofBirth.unique()

array(['KuwaIT', 'lebanon', 'Egypt', 'SaudiArabia', 'USA', 'Jordan',
       'venzuela', 'Iran', 'Tunis', 'Morocco', 'Syria', 'Iraq',
       'Palestine', 'Lybia'], dtype=object)

In [167]:
df.StageID.unique()

array(['lowerlevel', 'MiddleSchool', 'HighSchool'], dtype=object)

In [168]:
df.GradeID.unique()

array(['G-04', 'G-07', 'G-08', 'G-06', 'G-05', 'G-09', 'G-12', 'G-11',
       'G-10', 'G-02'], dtype=object)

In [169]:
df.SectionID.unique()

array(['A', 'B', 'C'], dtype=object)

In [170]:
df.Topic.unique()

array(['IT', 'Math', 'Arabic', 'Science', 'English', 'Quran', 'Spanish',
       'French', 'History', 'Biology', 'Chemistry', 'Geology'],
      dtype=object)

In [171]:
df.Semester.unique()

array(['F', 'S'], dtype=object)

In [172]:
df.Relation.unique()

array(['Father', 'Mum'], dtype=object)

In [173]:
df.ParentAnsweringSurvey.unique()

array(['Yes', 'No'], dtype=object)

In [174]:
df.ParentschoolSatisfaction.unique()

array(['Good', 'Bad'], dtype=object)

In [198]:
df.StudentAbsenceDays.unique()

array(['Under-7', 'Above-7'], dtype=object)

In [199]:
# Let's preprocess categorical features

# initialize label encoder
le = LabelEncoder()


df['gender'] = le.fit_transform(df['gender'])
df['NationalITy'] = le.fit_transform(df['NationalITy'])
df['PlaceofBirth'] = le.fit_transform(df['PlaceofBirth'])
df['StageID'] = le.fit_transform(df['StageID'])
df['GradeID'] = le.fit_transform(df['GradeID'])

df['SectionID'] = le.fit_transform(df['SectionID'])
df['Topic'] = le.fit_transform(df['Topic'])
df['Semester'] = le.fit_transform(df['Semester'])
df['Relation'] = le.fit_transform(df['Relation'])
df['ParentAnsweringSurvey'] = le.fit_transform(df['ParentAnsweringSurvey'])
df['ParentschoolSatisfaction'] = le.fit_transform(df['ParentschoolSatisfaction'])
df['StudentAbsenceDays'] = le.fit_transform(df['StudentAbsenceDays'])
df['Class'] = le.fit_transform(df['Class'])

In [200]:
df.Class.value_counts()

2    211
0    142
1    127
Name: Class, dtype: int64
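
The encoded counts line up with the original ones because `LabelEncoder` assigns integer codes to labels in sorted order, so `H`, `L`, `M` become 0, 1, 2. The sketch below illustrates this on a tiny hand-made list rather than on `df`; `le_class` is a hypothetical separate encoder kept only for the target column.

```python
from sklearn.preprocessing import LabelEncoder

# A dedicated encoder for the target, fitted on a small illustrative sample.
le_class = LabelEncoder()
le_class.fit(["M", "M", "L", "H"])

# classes_ holds the labels in sorted order; the position is the integer code.
mapping = {label: code for code, label in enumerate(le_class.classes_)}
print(mapping)

# inverse_transform turns integer predictions back into the original labels.
restored = le_class.inverse_transform([2, 0, 1])
print(list(restored))
```

Because the notebook reuses one `le` object for every column, only the mapping of the last fitted column (`Class`) remains available afterwards. Keeping one encoder per column, as assumed here, would preserve the mappings for the other columns as well.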

In [177]:
# Let's view data after conversion
df.head()

Unnamed: 0,gender,NationalITy,PlaceofBirth,StageID,GradeID,SectionID,Topic,Semester,Relation,raisedhands,VisITedResources,AnnouncementsView,Discussion,ParentAnsweringSurvey,ParentschoolSatisfaction,StudentAbsenceDays,Class
0,1,4,4,2,1,0,7,0,0,15,16,2,20,1,1,1,2
1,1,4,4,2,1,0,7,0,0,20,20,3,25,1,1,1,2
2,1,4,4,2,1,0,7,0,0,10,7,0,30,0,0,0,1
3,1,4,4,2,1,0,7,0,0,30,25,5,35,0,0,0,1
4,1,4,4,2,1,0,7,0,0,40,50,12,50,0,0,0,2


In [178]:
# Let's split the features and target

feat_cols = df.drop(['Class'], axis=1)

target = df['Class']

cols = feat_cols.columns

In [179]:
# Let's define a seed for reproducibility
seed = 360

In [180]:
# Let's scale our data set

scaler = MinMaxScaler(feature_range=(0, 1))
scaled_df = scaler.fit_transform(df[cols])


In [181]:
# show scaled data set 
scaled_df[:1]

array([[1.        , 0.30769231, 0.30769231, 1.        , 0.11111111,
        0.        , 0.63636364, 0.        , 0.        , 0.15      ,
        0.16161616, 0.02040816, 0.19387755, 1.        , 1.        ,
        1.        ]])

In [182]:
scaled_df.shape

(480, 16)
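
As a check on what `MinMaxScaler` does, the sketch below applies the min-max formula `(x - min) / (max - min)` by hand to two columns and compares it with the scaler. The three rows are built from the `describe()` minimum and maximum of `raisedhands` and `VisITedResources` plus the first data row (15 and 16); the array `X` is an illustrative stand-in, not the real data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Columns: raisedhands, VisITedResources.
# Rows: column minimums, column maximums, first row of the data set.
X = np.array([[0.0, 0.0], [100.0, 99.0], [15.0, 16.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(X)

# The same transformation written out explicitly.
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(scaled[2])
print(np.allclose(scaled, manual))
```

The last row should reproduce the 0.15 and 0.16161616 values seen in the scaled output above. Note that the notebook fits the scaler on the full data set before splitting; a common alternative is to fit it on the training split only and then call `transform` on the test split.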

In [183]:
# Let's split the data for model training

X_train, X_test, y_train, y_test = train_test_split(scaled_df, target, stratify = target, test_size = 0.2, random_state = seed)
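
The `stratify = target` argument keeps the class shares roughly equal in the training and test splits. The sketch below shows this with synthetic labels that only reproduce the class counts (142, 127, 211); the zero-filled feature matrix is a placeholder, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the encoded Class column.
y = np.repeat([0, 1, 2], [142, 127, 211])
X = np.zeros((len(y), 1))  # placeholder features, only the labels matter here

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=360
)

# Class shares in each split should be close to the shares in the full data.
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
print(train_share.round(3))
print(test_share.round(3))
```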

In [184]:
from sklearn.metrics import accuracy_score, confusion_matrix

# initialize the model

model = RandomForestClassifier(n_estimators=100, random_state=42)

# training model 

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

# Evaluate

print("Accuracy of the model: ", accuracy_score(y_test, y_pred))

Accuracy of the model:  0.8541666666666666


In [185]:
# Evaluate with confusion matrix

print(confusion_matrix(y_test, y_pred))

[[23  0  6]
 [ 0 23  2]
 [ 3  3 36]]
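
The confusion matrix can be broken down into per-class recall and precision. The sketch below copies the matrix printed above into a NumPy array and relies on the scikit-learn convention that rows are true labels and columns are predicted labels, with the encoded classes 0, 1, 2 corresponding to `H`, `L`, `M`.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted).
cm = np.array([[23, 0, 6],
               [0, 23, 2],
               [3, 3, 36]])

# Overall accuracy: correctly classified samples over all samples.
accuracy = np.trace(cm) / cm.sum()

# Recall per class: correct predictions divided by the true count of that class.
recall = np.diag(cm) / cm.sum(axis=1)

# Precision per class: correct predictions divided by the predicted count of that class.
precision = np.diag(cm) / cm.sum(axis=0)

print(round(accuracy, 4))
print(recall.round(3))
print(precision.round(3))
```

The accuracy computed this way equals 82/96, which matches the `accuracy_score` result reported earlier. `sklearn.metrics.classification_report(y_test, y_pred)` would give a similar breakdown directly from the predictions.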


In [186]:
# Let's store our model

# Save the model to disk
filename = '../../models/student_performance/model_sp.sav'
joblib.dump(model, filename)

['../../models/student_performance/model_sp.sav']
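
A saved model can be read back with `joblib.load`. The sketch below is a self-contained round trip on a small random data set and a temporary directory, so it does not touch the notebook's `model_sp.sav`; `demo_model`, `X_demo`, `y_demo`, and the file name `model_demo.sav` are all illustrative.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Small random stand-in data: 30 samples, 4 features, 3 classes.
rng = np.random.default_rng(0)
X_demo = rng.random((30, 4))
y_demo = np.repeat([0, 1, 2], 10)

demo_model = RandomForestClassifier(n_estimators=10, random_state=42)
demo_model.fit(X_demo, y_demo)

# Dump the model to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_demo.sav")
    joblib.dump(demo_model, path)
    loaded_model = joblib.load(path)

# The reloaded model should give the same predictions as the original.
same_predictions = np.array_equal(
    demo_model.predict(X_demo), loaded_model.predict(X_demo)
)
print(same_predictions)
```

For inference on new raw data, the fitted scaler and label encoders would likely need to be saved alongside the model, since the classifier here was trained on encoded and scaled features.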