# **Data Collection**

## Objectives

* Collect the dataset from Kaggle and save it as raw data
* Inspect the dataset and save it under outputs/datasets/collection

## Inputs

* Kaggle JSON file - the authentication token.

## Outputs

* Generate Dataset: outputs/datasets/collection/StudentPerformance.csv

## Additional Comments



---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor you need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/Student-Performance/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/Student-Performance'
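
Re-running the `os.chdir` cell moves up one more level each time. The guard below is a minimal sketch that avoids this; the helper name `move_to_project_root` is hypothetical, and the folder name `jupyter_notebooks` is assumed from the path shown above.

```python
import os
from pathlib import Path


def move_to_project_root(notebook_dir_name: str = "jupyter_notebooks") -> Path:
    """Change to the parent directory only when currently inside the notebooks folder."""
    current = Path.cwd()
    if current.name == notebook_dir_name:
        os.chdir(current.parent)
    # Return the working directory after the (possible) change
    return Path.cwd()
```

Calling `move_to_project_root()` a second time leaves the working directory unchanged, so the cell is safe to re-run.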

---

# Fetch data from Kaggle

Install the Kaggle package to fetch the data

In [4]:
%pip install kaggle==1.5.12

Note: you may need to restart the kernel to use updated packages.


**Set the Kaggle configuration directory to the current working directory and restrict the permissions of the Kaggle authentication JSON file.**

In [5]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

chmod: cannot access 'kaggle.json': No such file or directory
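
The `chmod` error above means `kaggle.json` was not in the working directory when the cell ran. The following is a minimal sketch of a pre-check, assuming the token is expected at `kaggle.json` in the current directory; the helper name `prepare_kaggle_token` is hypothetical.

```python
import os
import stat
from pathlib import Path


def prepare_kaggle_token(token_path: str = "kaggle.json") -> bool:
    """Restrict the token to owner read/write (600); return False if it is missing."""
    token = Path(token_path)
    if not token.is_file():
        print(f"{token} not found: upload the Kaggle token before downloading.")
        return False
    # Point the Kaggle client at the folder that holds the token
    os.environ["KAGGLE_CONFIG_DIR"] = str(token.resolve().parent)
    # Equivalent to `chmod 600` on POSIX systems
    token.chmod(stat.S_IRUSR | stat.S_IWUSR)
    return True
```

Checking the return value before running the download cell makes the missing-token case visible early instead of surfacing later as a traceback.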


The block below defines the path of the Kaggle student performance dataset and the destination folder for it, and then uses a Kaggle command to download the dataset into that folder.

In [6]:
KaggleDatasetPath = "lainguyn123/student-performance-factors"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Traceback (most recent call last):
  File "/workspace/.pip-modules/bin/kaggle", line 5, in <module>
    from kaggle.cli import main
  File "/workspace/.pip-modules/lib/python3.8/site-packages/kaggle/__init__.py", line 23, in <module>
    api.authenticate()
  File "/workspace/.pip-modules/lib/python3.8/site-packages/kaggle/api/kaggle_api_extended.py", line 164, in authenticate
    raise IOError('Could not find {}. Make sure it\'s located in'
OSError: Could not find kaggle.json. Make sure it's located in /workspace/Student-Performance. Or use the environment method.


The next step is to unzip the downloaded file and then delete the zip file and the kaggle.json file.

In [7]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

unzip:  cannot find or open inputs/datasets/raw/*.zip, inputs/datasets/raw/*.zip.zip or inputs/datasets/raw/*.zip.ZIP.

No zipfiles found.
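
The shell glob fails outright when no archive was downloaded. As an alternative sketch, Python's standard `zipfile` module is swapped in here for the `unzip` command; the helper name `extract_archives` is hypothetical. Each archive in the folder is extracted and then deleted, and an empty folder is reported rather than raising an error.

```python
import zipfile
from pathlib import Path


def extract_archives(folder: str) -> list:
    """Extract every .zip in `folder`, delete the archive, and return the extracted names."""
    extracted = []
    for archive in sorted(Path(folder).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        archive.unlink()  # remove the archive once its contents are extracted
    if not extracted:
        print(f"No zip files found in {folder}")
    return extracted
```

In the notebook this would be called as `extract_archives(DestinationFolder)`; deleting `kaggle.json` afterwards remains a separate step.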


---

# Load and Inspect Kaggle data

The dataset is loaded with the pandas library, and its first rows are displayed.

In [8]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/StudentPerformanceFactors.csv")
df.head()

Unnamed: 0,Hours_Studied,Attendance,Parental_Involvement,Access_to_Resources,Extracurricular_Activities,Sleep_Hours,Previous_Scores,Motivation_Level,Internet_Access,Tutoring_Sessions,Family_Income,Teacher_Quality,School_Type,Peer_Influence,Physical_Activity,Learning_Disabilities,Parental_Education_Level,Distance_from_Home,Gender,Exam_Score
0,23,84,Low,High,No,7,73,Low,Yes,0,Low,Medium,Public,Positive,3,No,High School,Near,Male,67
1,19,64,Low,Medium,No,8,59,Low,Yes,2,Medium,Medium,Public,Negative,4,No,College,Moderate,Female,61
2,24,98,Medium,Medium,Yes,7,91,Medium,Yes,2,Medium,Medium,Public,Neutral,4,No,Postgraduate,Near,Male,74
3,29,89,Low,Medium,Yes,8,98,Medium,Yes,1,Medium,Medium,Public,Negative,4,No,High School,Moderate,Male,71
4,19,92,Medium,Medium,Yes,6,65,Medium,Yes,3,Medium,High,Public,Neutral,4,No,College,Near,Female,70
5,19,88,Medium,Medium,Yes,8,89,Medium,Yes,3,Medium,Medium,Public,Positive,3,No,Postgraduate,Near,Male,71
6,29,84,Medium,Low,Yes,7,68,Low,Yes,1,Low,Medium,Private,Neutral,2,No,High School,Moderate,Male,67
7,25,78,Low,High,Yes,6,50,Medium,Yes,1,High,High,Public,Negative,2,No,High School,Far,Male,66
8,17,94,Medium,High,No,6,80,High,Yes,0,Medium,Low,Private,Neutral,1,No,College,Near,Male,69
9,23,98,Medium,Medium,Yes,8,71,Medium,Yes,0,High,High,Public,Positive,5,No,High School,Moderate,Male,72


---

Below is a summary of the DataFrame

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6607 entries, 0 to 6606
Data columns (total 20 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   Hours_Studied               6607 non-null   int64 
 1   Attendance                  6607 non-null   int64 
 2   Parental_Involvement        6607 non-null   object
 3   Access_to_Resources         6607 non-null   object
 4   Extracurricular_Activities  6607 non-null   object
 5   Sleep_Hours                 6607 non-null   int64 
 6   Previous_Scores             6607 non-null   int64 
 7   Motivation_Level            6607 non-null   object
 8   Internet_Access             6607 non-null   object
 9   Tutoring_Sessions           6607 non-null   int64 
 10  Family_Income               6607 non-null   object
 11  Teacher_Quality             6529 non-null   object
 12  School_Type                 6607 non-null   object
 13  Peer_Influence              6607 non-null   obje

Check whether the DataFrame has any *null* values

In [10]:
df.isnull().sum().sort_values(ascending=False)

Parental_Education_Level      90
Teacher_Quality               78
Distance_from_Home            67
Hours_Studied                  0
Attendance                     0
Gender                         0
Learning_Disabilities          0
Physical_Activity              0
Peer_Influence                 0
School_Type                    0
Family_Income                  0
Tutoring_Sessions              0
Internet_Access                0
Motivation_Level               0
Previous_Scores                0
Sleep_Hours                    0
Extracurricular_Activities     0
Access_to_Resources            0
Parental_Involvement           0
Exam_Score                     0
dtype: int64

In [14]:
# Flag rows that are exact duplicates of an earlier row
duplicated_mask = df.duplicated(subset=None)

# Collect the index labels of the duplicated rows
duplicated_rows = df.index[duplicated_mask].tolist()

if not duplicated_rows:
    print('There are no duplicates in the dataset')
else:
    print(f'{len(duplicated_rows)} rows are duplicated; the duplicates are on rows {duplicated_rows}')

There are no duplicates in the dataset


## Conclusion
A quick inspection of the dataset found the following:

* The dataset has 20 variables: 1 target and 19 features.
* None of the rows are duplicated.
* The dataset has missing data in the columns Parental_Education_Level, Teacher_Quality and Distance_from_Home, with 90, 78 and 67 missing values respectively.
    * Ideally the missing values overlap on the same rows, so that dropping them reduces the dataset as little as possible. In the worst case, with no overlap, 235 rows (90 + 78 + 67) are lost, which is about 3.6% of the 6607 rows and still a small reduction.
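
The number of affected rows can be measured directly rather than bounded. The following is a minimal sketch on a made-up toy frame (the values are illustrative only and are not taken from the dataset): `isnull().any(axis=1)` flags rows with at least one missing value, so its sum is the number of rows a row-wise drop would remove.

```python
import pandas as pd

# Toy data for illustration only; two of the three missing cells share a row
toy = pd.DataFrame({
    "Teacher_Quality": ["High", None, "Low", "Medium"],
    "Distance_from_Home": ["Near", None, "Far", None],
})

# Total missing cells: the worst-case number of rows lost (no overlap)
missing_cells = int(toy.isnull().sum().sum())

# Rows with at least one missing value: the rows actually lost
rows_with_missing = int(toy.isnull().any(axis=1).sum())

share_lost = rows_with_missing / len(toy)
print(missing_cells, rows_with_missing, f"{share_lost:.1%}")
```

Replacing `toy` with `df` would give the true count for this dataset, which must lie between 90 and 235 rows.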






---

# Push files to Repo

In [12]:
import os
try:
  # create outputs/datasets/collection folder
  os.makedirs(name='outputs/datasets/collection')
except Exception as e:
  print(e)

# Save the dataset in the newly created folder
df.to_csv("outputs/datasets/collection/StudentPerformance.csv", index=False)

[Errno 17] File exists: 'outputs/datasets/collection'
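
The `File exists` message above is harmless but appears on every re-run. As a minimal sketch of an alternative, `os.makedirs` accepts `exist_ok=True`, which skips the error when the folder is already there. The helper name `ensure_folder` is hypothetical.

```python
import os


def ensure_folder(path: str) -> str:
    """Create `path` (and any missing parents) without failing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return path
```

In the notebook, `ensure_folder("outputs/datasets/collection")` could run before `df.to_csv(...)`, which removes the need for the `try`/`except` block.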
