# **Data Collection Notebook**

## Objectives

- Fetch the Student Exam Scores dataset from Kaggle
- Save the dataset as a CSV file for use in future notebooks
- Conduct an initial inspection of the dataset to find out more about it

## Inputs
- Kaggle JSON authentication token
- Kaggle dataset file

## Outputs
- Dataset as a CSV file in the outputs/datasets/collection directory

## Additional comments
- The dataset is hosted on Kaggle and is anonymised, so there are no privacy concerns to address
- The dataset is located [here](https://www.kaggle.com/datasets/whenamancodes/students-performance-in-exams)

## Import Packages

In [None]:
import pandas as pd
import os

## Change working directory

Since this notebook lives in the `jupyter_notebooks` directory, we need to change the current working directory to its parent, the workspace root, so that any directories created in later code cells are added in the correct place.

We access the current directory with the `os` module's `getcwd()` function:

In [1]:
current_directory = os.getcwd()
current_directory

'/workspace/Exam-Scores-Analysis/jupyter_notebooks'

We now want to set the working directory as the parent of the current working directory, jupyter_notebooks

- The `os.path.dirname()` function returns the parent directory
- The `os.chdir()` function sets the new current directory
- We do this to access all of the project's files and directories, rather than those in the jupyter_notebooks directory
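
As a minimal sketch of what these two steps do, the snippet below applies `os.path.dirname()` to the path shown in the earlier output, treated purely as a string. The helper `parent_if_notebook_dir` is a hypothetical name, not part of the project; it only illustrates a guard that avoids climbing one level too far if the cell is run twice.

```python
import os


def parent_if_notebook_dir(path, notebook_folder="jupyter_notebooks"):
    """Return the parent of `path` only when `path` ends in the notebook folder.

    Hypothetical helper: re-running it on the project root leaves the path unchanged.
    """
    if os.path.basename(path) == notebook_folder:
        # dirname() strips the final path component, giving the parent directory
        return os.path.dirname(path)
    return path


notebook_dir = "/workspace/Exam-Scores-Analysis/jupyter_notebooks"
project_root = parent_if_notebook_dir(notebook_dir)
print(project_root)

# Applying the helper a second time is a no-op
print(parent_if_notebook_dir(project_root))
```

In a notebook, the result would then be passed to `os.chdir()`.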

In [2]:
os.chdir(os.path.dirname(current_directory))
print("You set a new current directory")

You set a new current directory


To be sure, we confirm in a code cell that the current working directory has been set correctly:

In [3]:
current_directory = os.getcwd()
current_directory

'/workspace/Exam-Scores-Analysis'

Excellent - we are now working in the main directory.

## Fetch data from Kaggle

We are now in a position to fetch the dataset from Kaggle

First, we need to install the Kaggle package to enable this

In [4]:
! pip install kaggle==1.5.12



We upload the `kaggle.json` authentication token to the workspace. Then we run the code cell below to ensure that the token is recognised. The `kaggle.json` token is listed in the `.gitignore` file, so it is not pushed to the project repository. This is because it is linked to my personal Kaggle account, and pushing it to a public repository would constitute a data breach.

In [5]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
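
The `chmod 600` command restricts the token so that only the file's owner can read or write it. One way to verify this from Python is sketched below, assuming a POSIX file system. The helper `is_owner_only` is a hypothetical name, and the check runs on a throwaway file rather than the real token.

```python
import os
import stat
import tempfile


def is_owner_only(path):
    """Return True if group and others have no permissions on the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


# Demonstrate on a placeholder file, not the real kaggle.json
with tempfile.TemporaryDirectory() as tmp:
    demo_token = os.path.join(tmp, "kaggle.json")
    with open(demo_token, "w") as f:
        f.write("{}")
    os.chmod(demo_token, 0o600)  # same effect as `chmod 600` in the shell
    restricted = is_owner_only(demo_token)

print(restricted)
```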

The dataset is located [here](https://www.kaggle.com/datasets/whenamancodes/students-performance-in-exams).

We can now import it. The terminal commands that run first remove any existing copies of the files to be imported. If a file is not present, `rm` reports an error, which can be ignored.

In [6]:
! rm inputs/datasets/raw/exams.csv
! rm inputs/datasets/raw/students-performance-in-exams.zip


KaggleDatasetPath = "whenamancodes/students-performance-in-exams"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

rm: cannot remove 'inputs/datasets/raw/students-performance-in-exams.csv': No such file or directory
students-performance-in-exams.zip: Skipping, found more recently modified local copy (use --force to force download)
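
The `rm` error above is harmless, but the clean-up can also be done silently from Python. The sketch below assumes Python 3.8 or later, which `missing_ok` requires. The helper `remove_if_present` is a hypothetical name, and the demonstration uses a temporary directory instead of the project's `inputs` folder.

```python
import tempfile
from pathlib import Path


def remove_if_present(path):
    """Delete the file at `path`; do nothing if it does not exist."""
    Path(path).unlink(missing_ok=True)  # missing_ok requires Python 3.8+


with tempfile.TemporaryDirectory() as tmp:
    stale_file = Path(tmp) / "exams.csv"
    stale_file.write_text("placeholder")

    remove_if_present(stale_file)  # removes the file
    remove_if_present(stale_file)  # second call is a silent no-op
    still_exists = stale_file.exists()

print(still_exists)
```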


The dataset has now been imported. It exists as a zipped file in the `inputs/datasets/raw` directory. In the code cell below, we unzip it. Unlike in the Churnometer walkthrough project, the raw dataset and the Kaggle token will not be deleted, because the token is excluded from the public repository. Both are retained in case the dataset needs to be re-imported.

In [7]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder}

Archive:  inputs/datasets/raw/students-performance-in-exams.zip
  inflating: inputs/datasets/raw/exams.csv  
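
The same extraction can be done without a shell command, using the standard library's `zipfile` module. This is a sketch under stated assumptions rather than the notebook's method. The helper `extract_archive` is a hypothetical name, and the archive is a tiny stand-in built in a temporary directory, not the Kaggle download.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_archive(zip_path, destination):
    """Extract every member of `zip_path` into `destination` and list the members."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(destination)
        return archive.namelist()


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    archive_path = tmp / "demo.zip"

    # Build a stand-in archive containing a single CSV file
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("exams.csv", "math score,reading score\n")

    extracted = extract_archive(archive_path, tmp / "raw")
    file_exists = (tmp / "raw" / "exams.csv").is_file()

print(extracted, file_exists)
```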


We can now begin to inspect the dataset

## Load and inspect Kaggle dataset

We can assign the dataset to a Pandas dataframe using the `read_csv()` method, and display the first 5 rows using the `head()` method:

In [8]:
df = pd.read_csv("inputs/datasets/raw/exams.csv")
df.head()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,male,group A,high school,standard,completed,67,67,63
1,female,group D,some high school,free/reduced,none,40,59,55
2,male,group E,some college,free/reduced,none,59,60,50
3,male,group B,high school,standard,none,77,78,68
4,male,group E,associate's degree,standard,completed,78,73,68


We can now check the size and structure of the dataset with the `info()` method:

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race/ethnicity               1000 non-null   object
 2   parental level of education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test preparation course      1000 non-null   object
 5   math score                   1000 non-null   int64 
 6   reading score                1000 non-null   int64 
 7   writing score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB


The dataset has 1000 records in 8 columns. This should be an appropriate number of records to train a machine learning model. 

The dataset also has no missing data, so data imputation will not be necessary. None of the columns holds a unique identifier for each record (such as a student ID), so there is no need to drop any columns on that basis.
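
These two checks can also be made explicitly rather than read off the `info()` output. The sketch below uses a small frame built from the first five rows shown earlier, with only three of the columns, so the counts apply to that sample and not to the full dataset. On the full data, a column whose unique count equals the number of rows would be a candidate identifier.

```python
import pandas as pd

# Three columns from the first five rows of the head() output above
sample = pd.DataFrame({
    "gender": ["male", "female", "male", "male", "male"],
    "lunch": ["standard", "free/reduced", "free/reduced", "standard", "standard"],
    "math score": [67, 40, 59, 77, 78],
})

# Number of missing values in each column
missing_per_column = sample.isnull().sum()

# Number of distinct values in each column
unique_per_column = sample.nunique()

print(missing_per_column)
print(unique_per_column)
```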

The gender, race/ethnicity, parental level of education, lunch and test preparation course feature variables contain categorical data, so the object data-type is practical. There is no need to convert data-types.

However, to improve the dataset's ease of use, it would be useful to convert the column headers (gender, race/ethnicity, etc) to snake_case, and to simplify them as follows:
- `race/ethnicity` to `ethnicity`
- `parental level of education` to `parental_education`
- `lunch` to `lunch_program`
- `test preparation course` to `test_preparation_course`
- `math score` to `math_score`
- `reading score` to `reading_score`
- `writing score` to `writing_score`

We can do this by assigning a new list of names to the dataframe's `columns` attribute:

In [10]:
df.columns = ['gender', 'ethnicity', 'parental_education', 'lunch_program', 'test_preparation_course', 'math_score', 'reading_score', 'writing_score']
df.head()

Unnamed: 0,gender,ethnicity,parental_education,lunch_program,test_preparation_course,math_score,reading_score,writing_score
0,male,group A,high school,standard,completed,67,67,63
1,female,group D,some high school,free/reduced,none,40,59,55
2,male,group E,some college,free/reduced,none,59,60,50
3,male,group B,high school,standard,none,77,78,68
4,male,group E,associate's degree,standard,completed,78,73,68
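
Assigning a full list relies on the columns being in a known order. An alternative is to pass an explicit mapping to `DataFrame.rename()`, which matches columns by name. This is a sketch of that alternative, not what the notebook does. The name `column_map` is illustrative, and the frame is empty, carrying only the original headers.

```python
import pandas as pd

# Old header -> new header, as listed above
column_map = {
    "race/ethnicity": "ethnicity",
    "parental level of education": "parental_education",
    "lunch": "lunch_program",
    "test preparation course": "test_preparation_course",
    "math score": "math_score",
    "reading score": "reading_score",
    "writing score": "writing_score",
}

# Empty frame that carries only the original headers
raw = pd.DataFrame(columns=[
    "gender", "race/ethnicity", "parental level of education", "lunch",
    "test preparation course", "math score", "reading score", "writing score",
])

# Columns missing from the mapping (here, gender) keep their names
renamed = raw.rename(columns=column_map)
print(list(renamed.columns))
```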


It would also be useful to add a column called `average_score` that averages the 3 score columns to provide an overall indication of academic performance. We do this with the `mean()` method:

In [11]:
df['average_score'] = df[['math_score', 'reading_score', 'writing_score']].mean(axis=1)
df.head()

Unnamed: 0,gender,ethnicity,parental_education,lunch_program,test_preparation_course,math_score,reading_score,writing_score,average_score
0,male,group A,high school,standard,completed,67,67,63,65.666667
1,female,group D,some high school,free/reduced,none,40,59,55,51.333333
2,male,group E,some college,free/reduced,none,59,60,50,56.333333
3,male,group B,high school,standard,none,77,78,68,74.333333
4,male,group E,associate's degree,standard,completed,78,73,68,73.0


We can now check the status of the dataset:

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   gender                   1000 non-null   object 
 1   ethnicity                1000 non-null   object 
 2   parental_education       1000 non-null   object 
 3   lunch_program            1000 non-null   object 
 4   test_preparation_course  1000 non-null   object 
 5   math_score               1000 non-null   int64  
 6   reading_score            1000 non-null   int64  
 7   writing_score            1000 non-null   int64  
 8   average_score            1000 non-null   float64
dtypes: float64(1), int64(3), object(5)
memory usage: 70.4+ KB


If we run the info method, we see that the `average_score` column is a float. Given that the test scores are integers, we can reasonably assume that the exams do not allow fractional scores, so we convert the `average_score` column to an integer. Note that `astype()` discards the fractional part rather than rounding to the nearest integer (65.666667 becomes 65), and that it returns a new dataframe, which we assign back to `df`:

In [13]:
df = df.astype({'average_score':'int'})
df.head()

Unnamed: 0,gender,ethnicity,parental_education,lunch_program,test_preparation_course,math_score,reading_score,writing_score,average_score
0,male,group A,high school,standard,completed,67,67,63,65
1,female,group D,some high school,free/reduced,none,40,59,55,51
2,male,group E,some college,free/reduced,none,59,60,50,56
3,male,group B,high school,standard,none,77,78,68,74
4,male,group E,associate's degree,standard,completed,78,73,68,73
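
Casting a float column to `int` with `astype()` drops the fractional part, which is why 65.666667 appears as 65 above. If rounding to the nearest integer is preferred, `round()` can be applied first. The sketch below compares the two on three of the averages shown earlier; the variable names are illustrative.

```python
import pandas as pd

# Three of the average_score values from the earlier output
averages = pd.Series([65.666667, 51.333333, 73.0])

# astype(int) truncates: the fractional part is simply discarded
truncated = averages.astype(int)

# round() first, then cast, to get the nearest integer instead
rounded = averages.round().astype(int)

print(truncated.tolist())  # [65, 51, 73]
print(rounded.tolist())    # [66, 51, 73]
```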


If we check the data-type again, we see that `average_score` has been converted to an integer.

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   gender                   1000 non-null   object
 1   ethnicity                1000 non-null   object
 2   parental_education       1000 non-null   object
 3   lunch_program            1000 non-null   object
 4   test_preparation_course  1000 non-null   object
 5   math_score               1000 non-null   int64 
 6   reading_score            1000 non-null   int64 
 7   writing_score            1000 non-null   int64 
 8   average_score            1000 non-null   int64 
dtypes: int64(4), object(5)
memory usage: 70.4+ KB


Now that we have modified the dataframe appropriately, we can save it as a CSV file, and push the file to the repository. We will create an `outputs` directory, a `datasets` directory within the `outputs` directory, and a `collection` directory inside the `datasets` directory. We then save the dataset inside the `collection` directory:

In [15]:
! rm outputs/datasets/collection/student-exam-results.csv

try:
  os.makedirs(name='outputs/datasets/collection') # create outputs/datasets/collection folder
except Exception as e:
  print(e)

df.to_csv("outputs/datasets/collection/student-exam-results.csv", index=False)

[Errno 17] File exists: 'outputs/datasets/collection'
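
The `File exists` message is expected when the directory was created on an earlier run. A minimal alternative is to pass `exist_ok=True` to `os.makedirs()`, which makes the call safe to repeat without a `try`/`except`. It is shown here in a temporary directory rather than in the project workspace.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "outputs", "datasets", "collection")

    os.makedirs(target, exist_ok=True)  # creates all three nested directories
    os.makedirs(target, exist_ok=True)  # repeat call raises no error
    created = os.path.isdir(target)

print(created)
```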


Now that we have the dataset downloaded and saved, we can begin the project in earnest. In the next notebook, we'll begin our data analysis.