Atalov S.

# Fundamentals of Machine Learning and Artificial Intelligence

---
 

## Student Performance Dataset from the UCI Machine Learning Repository

1. **school**: The school the student is attending, either "GP" (Gabriel Pereira) or "MS" (Mousinho da Silveira).
2. **sex**: The gender of the student, either "F" for female or "M" for male.
3. **age**: The age of the student in years.
4. **address**: The type of address, either "U" for urban or "R" for rural.
5. **famsize**: Family size, either "LE3" for less than or equal to 3 or "GT3" for greater than 3.
6. **Pstatus**: The cohabitation status of parents, either "T" for living together or "A" for apart.
7. **Medu**: Mother's education level (0: none, 1: primary education, 2: 5th to 9th grade, 3: secondary education, 4: higher education).
8. **Fedu**: Father's education level (0: none, 1: primary education, 2: 5th to 9th grade, 3: secondary education, 4: higher education).
9. **Mjob**: Mother's job (nominal: "teacher", "health" care related, "services" (e.g., administrative or police), "at_home" or "other").
10. **Fjob**: Father's job (nominal: "teacher", "health" care related, "services" (e.g., administrative or police), "at_home" or "other").
11. **reason**: Reason to choose this school (close to "home", school "reputation", "course" preference or "other").
12. **guardian**: Student's guardian (nominal: "mother", "father" or "other").
13. **traveltime**: Home to school travel time (1: <15 min, 2: 15 to 30 min, 3: 30 min to 1 hour, 4: >1 hour).
14. **studytime**: Weekly study time (1: <2 hours, 2: 2 to 5 hours, 3: 5 to 10 hours, 4: >10 hours).
15. **failures**: Number of past class failures (numeric: n if 1<=n<3, else 4).
16. **schoolsup**: Extra educational support (binary: yes or no).
17. **famsup**: Family educational support (binary: yes or no).
18. **paid**: Extra paid classes within the course subject (Math or Portuguese) (binary: yes or no).
19. **activities**: Extra-curricular activities (binary: yes or no).
20. **nursery**: Attended nursery school (binary: yes or no).
21. **higher**: Wants to take higher education (binary: yes or no).
22. **internet**: Internet access at home (binary: yes or no).
23. **romantic**: With a romantic relationship (binary: yes or no).
24. **famrel**: Quality of family relationships (numeric: from 1 - very bad to 5 - excellent).
25. **freetime**: Free time after school (numeric: from 1 - very low to 5 - very high).
26. **goout**: Going out with friends (numeric: from 1 - very low to 5 - very high).
27. **Dalc**: Workday alcohol consumption (numeric: from 1 - very low to 5 - very high).
28. **Walc**: Weekend alcohol consumption (numeric: from 1 - very low to 5 - very high).
29. **health**: Current health status (numeric: from 1 - very bad to 5 - very good).
30. **absences**: Number of school absences (numeric: from 0 to 93).
31. **G1**: First period grade (numeric: from 0 to 20).
32. **G2**: Second period grade (numeric: from 0 to 20).
33. **G3**: Final grade (numeric: from 0 to 20).

Together, these 33 columns describe each student's demographics, family background, school support, lifestyle, and grades across three grading periods.
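
The documented value ranges can double as a quick sanity check once the data is loaded. The following is a minimal sketch, assuming the data is already in a pandas DataFrame; the names `DOCUMENTED_RANGES` and `count_out_of_range` are hypothetical helpers introduced here, and the bounds are copied from the column list above.

```python
import pandas as pd

# Inclusive bounds taken from the column descriptions above.
DOCUMENTED_RANGES = {
    "Medu": (0, 4), "Fedu": (0, 4),
    "traveltime": (1, 4), "studytime": (1, 4),
    "famrel": (1, 5), "freetime": (1, 5), "goout": (1, 5),
    "Dalc": (1, 5), "Walc": (1, 5), "health": (1, 5),
    "absences": (0, 93),
    "G1": (0, 20), "G2": (0, 20), "G3": (0, 20),
}


def count_out_of_range(df: pd.DataFrame) -> dict:
    """Return {column: number of values outside the documented range}.

    Missing values are counted as out of range.
    """
    violations = {}
    for column, (low, high) in DOCUMENTED_RANGES.items():
        if column not in df.columns:
            continue  # tolerate frames that hold only a subset of the columns
        outside = ~df[column].between(low, high)  # between() is inclusive on both ends
        if outside.any():
            violations[column] = int(outside.sum())
    return violations
```

Calling `count_out_of_range(df)` on the loaded data should return an empty dictionary if every value matches the documentation; any non-empty result points at columns worth inspecting before modelling.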

### Task: Predicting Final Grades (G3) and Analysis

#### Objective
Predict each student's final grade (G3) from demographic, social, and academic attributes. Because G3 is numeric (0 to 20), this is a regression task: build several models and evaluate how accurately each one forecasts G3.

#### Steps to Complete the Task

1. **Exploratory Data Analysis (EDA)**
   - Load the dataset and display the first five rows.
   - Check for missing values and handle them appropriately.
   - Summarize the dataset with descriptive statistics.
   - Visualize the distribution of the final grade (G3) and explore relationships between G3 and other features.

2. **Data Preprocessing**
   - Convert categorical variables into numerical values using techniques such as one-hot encoding.
   - Normalize or standardize numerical features if necessary.
   - Split the data into training and test sets.

3. **Feature Selection**
   - Identify the most important features that influence the final grade (G3) using correlation analysis and feature importance from models like Random Forest.

4. **Model Selection**
   - Choose appropriate regression models for predicting G3. Some common models include:
     - Linear Regression
     - Decision Tree Regressor
     - Random Forest Regressor
     - Gradient Boosting Regressor
   - Train each model using the training set.

5. **Model Evaluation**
   - Evaluate the performance of each model using metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-squared on the test set.
   - Compare the results to identify the best-performing model.

6. **Hyperparameter Tuning**
   - Fine-tune the hyperparameters of the best-performing model using techniques like Grid Search or Random Search.

7. **Final Model Training and Prediction**
   - Train the final model with the best hyperparameters on the entire training dataset.
   - Make predictions on the test set and visualize the predicted vs. actual grades.

8. **Analysis and Interpretation**
   - Analyze the results to understand the key factors affecting students' final grades.
   - Provide insights and recommendations based on the model's performance and feature importance.
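
Steps 2, 4, and 5 above can be sketched as a single helper. This is one possible outline rather than a required solution: the function name `compare_regressors` and its `seed` parameter are assumptions introduced here, feature scaling is omitted for brevity (the tree-based models do not need it), and all 32 non-target columns are used as features.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def compare_regressors(df: pd.DataFrame, target: str = "G3", seed: int = 0) -> pd.DataFrame:
    """One-hot encode, split 80/20, fit four regressors, return test-set metrics."""
    # Step 2: one-hot encode string columns; numeric columns pass through unchanged.
    X = pd.get_dummies(df.drop(columns=[target]), drop_first=True)
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )

    # Step 4: the candidate models named in the task description.
    models = {
        "Linear Regression": LinearRegression(),
        "Decision Tree": DecisionTreeRegressor(random_state=seed),
        "Random Forest": RandomForestRegressor(n_estimators=300, random_state=seed),
        "Gradient Boosting": GradientBoostingRegressor(random_state=seed),
    }

    # Step 5: score every model on the held-out test set.
    rows = []
    for name, model in models.items():
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)
        rows.append({
            "model": name,
            "MAE": mean_absolute_error(y_test, predictions),
            "MSE": mean_squared_error(y_test, predictions),
            "R2": r2_score(y_test, predictions),
        })
    # Lowest mean absolute error first.
    return pd.DataFrame(rows).set_index("model").sort_values("MAE")
```

Calling `compare_regressors(df)` returns one row per model, which makes the comparison in step 5 a matter of reading the table. Fixing `seed` keeps the split and the tree-based models reproducible between runs.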


#### Analysis and Interpretation
After evaluating the model's performance, analyze which features have the highest importance in predicting the final grade. Discuss the implications of these features on students' academic performance and provide recommendations for improving student outcomes based on the findings.
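
As a starting point for this analysis, the sketch below ranks features by the impurity-based importances of a random forest. The helper name `rank_importances` and its `exclude` parameter are assumptions introduced here. Impurity-based importances are only one view of feature relevance, so treat the ranking as a prompt for discussion rather than a definitive answer.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def rank_importances(df: pd.DataFrame, target: str = "G3", exclude=(), seed: int = 0) -> pd.Series:
    """Fit a random forest and return feature importances, largest first."""
    # Drop the target and any columns the caller wants left out, then one-hot encode.
    X = pd.get_dummies(df.drop(columns=[target, *exclude]))
    model = RandomForestRegressor(n_estimators=300, random_state=seed)
    model.fit(X, df[target])
    importances = pd.Series(model.feature_importances_, index=X.columns)
    return importances.sort_values(ascending=False)
```

Running it twice can be informative: once as `rank_importances(df)` and once as `rank_importances(df, exclude=["G1", "G2"])`. If the earlier period grades dominate the first ranking, the second shows which background factors matter when those grades are not available.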

### Submission
Submit your Jupyter Notebook or Python script containing the complete analysis, code, and visualizations. Ensure that your code is well-documented with comments explaining each step.

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

In [3]:
df = pd.read_csv("student-per.csv")

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 649 entries, 0 to 648
Data columns (total 33 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   school      649 non-null    object
 1   sex         649 non-null    object
 2   age         649 non-null    int64 
 3   address     649 non-null    object
 4   famsize     649 non-null    object
 5   Pstatus     649 non-null    object
 6   Medu        649 non-null    int64 
 7   Fedu        649 non-null    int64 
 8   Mjob        649 non-null    object
 9   Fjob        649 non-null    object
 10  reason      649 non-null    object
 11  guardian    649 non-null    object
 12  traveltime  649 non-null    int64 
 13  studytime   649 non-null    int64 
 14  failures    649 non-null    int64 
 15  schoolsup   649 non-null    object
 16  famsup      649 non-null    object
 17  paid        649 non-null    object
 18  activities  649 non-null    object
 19  nursery     649 non-null    object
 20  higher    

In [5]:
df.isnull().sum()

school        0
sex           0
age           0
address       0
famsize       0
Pstatus       0
Medu          0
Fedu          0
Mjob          0
Fjob          0
reason        0
guardian      0
traveltime    0
studytime     0
failures      0
schoolsup     0
famsup        0
paid          0
activities    0
nursery       0
higher        0
internet      0
romantic      0
famrel        0
freetime      0
goout         0
Dalc          0
Walc          0
health        0
absences      0
G1            0
G2            0
G3            0
dtype: int64
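
The output above shows no missing values, so the remaining EDA steps are the summary statistics and a look at the target. A minimal sketch follows, assuming `df` is the loaded DataFrame; the helper name `summarize_target` is hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd


def summarize_target(df: pd.DataFrame, target: str = "G3"):
    """Print basic EDA output; return (correlations with the target, histogram Axes)."""
    print(df.head())      # first five rows
    print(df.describe())  # descriptive statistics for the numeric columns

    # Pearson correlation of each numeric column with the target, strongest first.
    correlations = (
        df.select_dtypes("number")
        .corr()[target]
        .drop(target)
        .sort_values(key=lambda s: s.abs(), ascending=False)
    )

    # One histogram bin per integer grade from 0 to 20.
    fig, ax = plt.subplots()
    ax.hist(df[target], bins=range(0, 22), edgecolor="black")
    ax.set_xlabel(target)
    ax.set_ylabel("Number of students")
    ax.set_title(f"Distribution of {target}")
    return correlations, ax
```

In a notebook, `correlations, ax = summarize_target(df)` displays the histogram inline, and `correlations` gives a first indication of which numeric columns move together with G3.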

In [8]:
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor


In [11]:
# Note: despite the variable name `gb`, this is a random forest, not gradient boosting.
gb = RandomForestRegressor(n_estimators=300)

In [12]:
# Features: the two earlier period grades; target: the final grade.
X = df[["G1", "G2"]]
y = df["G3"]

In [13]:
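# Hold out 20% of the rows for testing. No random_state is set, so the split and the scores vary between runs.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)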

In [14]:
gb.fit(X_train,y_train)

In [15]:
gb.score(X_train, y_train)  # R^2 (coefficient of determination) on the training set

0.8817869657142667

In [16]:
gb.score(X_test, y_test)  # R^2 (coefficient of determination) on the test set

0.8073877569625647
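
The scores above are R-squared values only, and the training score is higher than the test score. Steps 5 to 7 of the task also ask for MAE, MSE, and hyperparameter tuning. The sketch below is one way to cover them with `GridSearchCV`; the helper name `tune_and_evaluate` and the grid values are illustrative assumptions, not tuned recommendations.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV


def tune_and_evaluate(X_train, y_train, X_test, y_test, param_grid=None, seed=0):
    """Grid-search a random forest with 5-fold CV, then report test-set metrics."""
    if param_grid is None:
        # Illustrative grid; widen or narrow it to fit the time available.
        param_grid = {
            "n_estimators": [100, 300],
            "max_depth": [None, 5, 10],
            "min_samples_leaf": [1, 5],
        }
    search = GridSearchCV(
        RandomForestRegressor(random_state=seed),
        param_grid,
        cv=5,
        scoring="neg_mean_absolute_error",  # higher is better, so MAE is negated
    )
    # With the default refit=True, the best setting is retrained on all training data.
    search.fit(X_train, y_train)

    predictions = search.predict(X_test)
    metrics = {
        "MAE": mean_absolute_error(y_test, predictions),
        "MSE": mean_squared_error(y_test, predictions),
        "R2": r2_score(y_test, predictions),
    }
    return search.best_params_, metrics, predictions
```

With the split from the cells above, `best_params, metrics, predictions = tune_and_evaluate(X_train, y_train, X_test, y_test)` returns everything needed for step 7. A scatter plot of `y_test` against `predictions` then gives the predicted vs. actual view.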

In [None]:
from sklearn.utils import 