# **STUDENT AI** - LOAD DATA

## Objectives

Import the dataset provided by the customer

## Inputs

Takes a comma-separated values (.csv) dataset from the customer as input.
In this case, a fictional dataset is sourced from Kaggle. The dataset comes in 2 versions:
1.  Fewer features and more rows, with no data issues.
2.  More features and fewer rows, with some missing data.

This project uses the second version to demonstrate techniques that might be needed for real-world datasets.

## Outputs

Saves the dataset from Kaggle as .csv in the inputs/dataset folder

## Additional Comments

A private Kaggle account is necessary, with its API key stored locally in a kaggle.json file.
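
Before any download is attempted, it can help to confirm that the token file is in place. The following is a minimal sketch, assuming the standard token layout with a `username` and a `key` field; the helper name `check_kaggle_credentials` is hypothetical and not part of the kaggle package.

```python
import json
from pathlib import Path


def check_kaggle_credentials(config_dir):
    """Return True if config_dir/kaggle.json exists and holds both credential fields."""
    token_path = Path(config_dir) / "kaggle.json"
    if not token_path.is_file():
        return False
    try:
        token = json.loads(token_path.read_text())
    except json.JSONDecodeError:
        # The file exists but is not valid JSON.
        return False
    # Assumption: the token is a JSON object with non-empty "username" and "key".
    return isinstance(token, dict) and all(token.get(field) for field in ("username", "key"))
```

Calling `check_kaggle_credentials(os.getcwd())` from the project root would report whether the file is ready to use.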



---

# Change working directory

### Get the current directory of this Jupyter notebook and import the necessary packages

In [1]:
import os
import pandas as pd
current_dir = os.getcwd()
current_dir

'/workspace/student-AI/jupyter_notebooks'

### Set directory to Project root (parent directory)

In [2]:
os.chdir(os.path.dirname(current_dir))
current_dir = os.getcwd()
print(f"New Directory: {current_dir}")

New Directory: /workspace/student-AI
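
Re-running the cell above moves up one more level each time, which can leave the notebook outside the project. A possible guard is sketched below, assuming the notebooks live in a folder called `jupyter_notebooks` as in the output above; the function name `move_to_project_root` is hypothetical.

```python
import os


def move_to_project_root(notebook_folder="jupyter_notebooks"):
    """Change to the parent directory only while still inside the notebook folder."""
    current = os.getcwd()
    # Only step up when the current folder is the notebook folder itself.
    if os.path.basename(current) == notebook_folder:
        os.chdir(os.path.dirname(current))
    return os.getcwd()
```

With this guard, running the cell a second time leaves the working directory unchanged.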


# Import Dataset from Kaggle
In production, this section would be replaced with a connection to the actual API of the educational facility's database, or at least a manual import of the .csv file for each grade.
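
The manual route could look roughly like the sketch below, which stacks one .csv file per grade into a single dataframe. The file pattern `grade_*.csv`, the function name `load_grade_files` and the added `SourceFile` column are all assumptions for illustration; the real export format of the facility is not known here.

```python
from pathlib import Path

import pandas as pd


def load_grade_files(folder, pattern="grade_*.csv"):
    """Read every matching .csv in folder and stack them into one dataframe."""
    files = sorted(Path(folder).glob(pattern))
    if not files:
        raise FileNotFoundError(f"No files matching {pattern} in {folder}")
    # Record the source file name so each row can be traced back to its grade file.
    frames = [pd.read_csv(path).assign(SourceFile=path.name) for path in files]
    return pd.concat(frames, ignore_index=True)
```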

**Install the kaggle package to access the dataset directly through the Kaggle API:**

In [3]:
! pip install kaggle==1.5.12



Set the environment variable for the Kaggle API so that it reads the kaggle.json file from the project root, and restrict the file permissions to the owner:

In [3]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
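
The same permission change can be made from Python rather than the shell, which also allows the result to be checked. This is a small sketch for POSIX systems; the helper name `restrict_to_owner` is hypothetical.

```python
import os
import stat


def restrict_to_owner(path):
    """Set permissions to 600 (owner read/write only) and return the resulting mode bits."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    # S_IMODE strips the file-type bits so only the permission bits remain.
    return stat.S_IMODE(os.stat(path).st_mode)
```

For example, `restrict_to_owner("kaggle.json") == 0o600` should hold once the call succeeds.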

#### Download student performance dataset from Kaggle - can be viewed on the site [here](https://www.kaggle.com/datasets/desalegngeb/students-exam-scores)

**CAUTION - SCRIPT WILL DELETE PREVIOUS COPY OF DATASET AND REPLACE WITH ONE DOWNLOADED DIRECT FROM KAGGLE**

In [5]:
from kaggle.api.kaggle_api_extended import KaggleApi

# Remove previous copies; -f avoids an error message when a file is not there yet
! rm -f inputs/dataset/Expanded_data_with_more_features.csv
! rm -f inputs/dataset/Original_data_with_more_rows.csv

SourcePath = "desalegngeb/students-exam-scores"
DestinationPath = "inputs/dataset"

# Initialize Kaggle API (reads kaggle.json from KAGGLE_CONFIG_DIR)
api = KaggleApi()
api.authenticate()

# Download dataset and unzip it into the destination folder
api.dataset_download_files(dataset=SourcePath, path=DestinationPath, unzip=True)
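
Because the cell above deletes the previous copies first, a failed download would leave the folder empty. A simple check after the download could be sketched as follows; the two file names are the ones removed above, while the helper name `missing_downloads` is hypothetical.

```python
from pathlib import Path

EXPECTED_FILES = (
    "Expanded_data_with_more_features.csv",
    "Original_data_with_more_rows.csv",
)


def missing_downloads(destination, expected=EXPECTED_FILES):
    """Return the expected file names that are absent or empty in destination."""
    folder = Path(destination)
    return [
        name
        for name in expected
        if not (folder / name).is_file() or (folder / name).stat().st_size == 0
    ]
```

An empty list from `missing_downloads("inputs/dataset")` would indicate that both files arrived.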

---

# Inspect Dataset for correct load

Read the dataset using the pandas library and display the first and last rows of the dataframe:

In [4]:
df = pd.read_csv("inputs/dataset/Expanded_data_with_more_features.csv", index_col=False)
df

| Unnamed: 0.1 | Unnamed: 0 | Gender | EthnicGroup | ParentEduc | LunchType | TestPrep | ParentMaritalStatus | PracticeSport | IsFirstChild | NrSiblings | TransportMeans | WklyStudyHours | MathScore | ReadingScore | WritingScore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | female |  | bachelor's degree | standard | none | married | regularly | yes | 3.0 | school_bus | < 5 | 71 | 71 | 74 |
| 1 | 1 | female | group C | some college | standard |  | married | sometimes | yes | 0.0 |  | 5 - 10 | 69 | 90 | 88 |
| 2 | 2 | female | group B | master's degree | standard | none | single | sometimes | yes | 4.0 | school_bus | < 5 | 87 | 93 | 91 |
| 3 | 3 | male | group A | associate's degree | free/reduced | none | married | never | no | 1.0 |  | 5 - 10 | 45 | 56 | 42 |
| 4 | 4 | male | group C | some college | standard | none | married | sometimes | yes | 0.0 | school_bus | 5 - 10 | 76 | 78 | 75 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 30636 | 816 | female | group D | high school | standard | none | single | sometimes | no | 2.0 | school_bus | 5 - 10 | 59 | 61 | 65 |
| 30637 | 890 | male | group E | high school | standard | none | single | regularly | no | 1.0 | private | 5 - 10 | 58 | 53 | 51 |
| 30638 | 911 | female |  | high school | free/reduced | completed | married | sometimes | no | 1.0 | private | 5 - 10 | 61 | 70 | 67 |
| 30639 | 934 | female | group D | associate's degree | standard | completed | married | regularly | no | 3.0 | school_bus | 5 - 10 | 82 | 90 | 93 |


In [5]:
df['WklyStudyHours'].unique()

array(['< 5', '5 - 10', '> 10', nan], dtype=object)
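
`unique()` lists the categories but not how often each occurs. A tally that keeps missing values visible could be sketched as below, using a small made-up sample rather than rows from the real dataset.

```python
import pandas as pd

# Small illustrative sample, not rows taken from the real dataset.
sample = pd.Series(["< 5", "5 - 10", "> 10", None, "5 - 10"], name="WklyStudyHours")

# dropna=False keeps missing values as their own entry in the tally.
counts = sample.value_counts(dropna=False)
print(counts)
```

On the real dataframe the equivalent call would be `df['WklyStudyHours'].value_counts(dropna=False)`.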

### Show dataset columns, non-null counts and datatypes:

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30641 entries, 0 to 30640
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Unnamed: 0           30641 non-null  int64  
 1   Gender               30641 non-null  object 
 2   EthnicGroup          28801 non-null  object 
 3   ParentEduc           28796 non-null  object 
 4   LunchType            30641 non-null  object 
 5   TestPrep             28811 non-null  object 
 6   ParentMaritalStatus  29451 non-null  object 
 7   PracticeSport        30010 non-null  object 
 8   IsFirstChild         29737 non-null  object 
 9   NrSiblings           29069 non-null  float64
 10  TransportMeans       27507 non-null  object 
 11  WklyStudyHours       29686 non-null  object 
 12  MathScore            30641 non-null  int64  
 13  ReadingScore         30641 non-null  int64  
 14  WritingScore         30641 non-null  int64  
dtypes: float64(1), int64(4), object(10)
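
The non-null counts above imply the number of missing values per column; for example, TransportMeans has 30641 - 27507 = 3134 missing entries, roughly 10.23% of the rows. A sketch of how this could be computed directly is given below; the helper name `missing_summary` is hypothetical.

```python
import pandas as pd


def missing_summary(frame):
    """Return count and percentage of missing values for the columns that have any."""
    counts = frame.isna().sum()
    summary = pd.DataFrame(
        {
            "missing": counts,
            "percent": (counts / len(frame) * 100).round(2),
        }
    )
    # Keep only columns with at least one missing value, largest first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)
```

Calling `missing_summary(df)` on the loaded dataframe would list the affected columns in one table.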


---