### Mid-Bootcamp Project:

## Alcohol Effects On Study

#### Data_Source: https://www.kaggle.com/datasets/whenamancodes/alcohol-effects-on-study



### Column Description:

#### school:     student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)

#### sex:     student's sex (binary: 'F' - female or 'M' - male)

#### age:     student's age (numeric: from 15 to 22)

#### address:     student's home address type (binary: 'U' - urban or 'R' - rural)

#### famsize:     family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)

#### Pstatus:     parent's cohabitation status (binary: 'T' - living together or 'A' - apart)

#### Medu:     mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)

#### Fedu:     father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)

#### Mjob:     mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')

#### Fjob:     father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')

#### reason:     reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')

#### guardian:     student's guardian (nominal: 'mother', 'father' or 'other')

#### traveltime:     home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)

#### studytime:     weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)

#### failures:     number of past class failures (numeric: n if 1<=n<3, else 4)

#### schoolsup:     extra educational support (binary: yes or no)

#### famsup:     family educational support (binary: yes or no)

#### paid:     extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)

#### activities:     extra-curricular activities (binary: yes or no)

#### nursery:     attended nursery school (binary: yes or no)

#### higher:     wants to take higher education (binary: yes or no)

#### internet:     Internet access at home (binary: yes or no)

#### romantic:     with a romantic relationship (binary: yes or no)

#### famrel:     quality of family relationships (numeric: from 1 - very bad to 5 - excellent)

#### freetime:     free time after school (numeric: from 1 - very low to 5 - very high)

#### goout:     going out with friends (numeric: from 1 - very low to 5 - very high)

#### Dalc:     workday alcohol consumption (numeric: from 1 - very low to 5 - very high)

#### Walc:     weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)

#### health:     current health status (numeric: from 1 - very bad to 5 - very good)

#### absences:     number of school absences (numeric: from 0 to 93)

_________________________________________________________________________________

### Grade:

#### G1:     first period grade (numeric: from 0 to 20)

#### G2:     second period grade (numeric: from 0 to 20)

#### G3:     final grade (numeric: from 0 to 20, output target)


In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import yaml
import scipy.stats as st

In [None]:
# Open yaml configs
with open('../params.yaml') as file:
    config = yaml.safe_load(file)
config
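
The cells below read `config['data']['maths_raw']` and `config['data']['portuguese_raw']`, so `params.yaml` presumably holds a `data` mapping with those two keys. A minimal sketch of such a layout; only the key names come from this notebook, the file paths are hypothetical placeholders:

```python
import yaml

# Hypothetical contents of params.yaml: the key names match the notebook,
# the paths are made-up placeholders.
example_yaml = """
data:
  maths_raw: ../data/raw/Maths.csv
  portuguese_raw: ../data/raw/Portuguese.csv
"""

example_config = yaml.safe_load(example_yaml)
print(example_config['data']['maths_raw'])
```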

### Data Cleaning

In [None]:
# importing maths file
maths = pd.read_csv(config['data']['maths_raw'])
maths

In [None]:
# importing portuguese file
port = pd.read_csv(config['data']['portuguese_raw'])
port

In [None]:
#Checking duplicates maths --> no duplicates
maths.duplicated().value_counts()


In [None]:
#Checking duplicates portuguese --> no duplicates
port.duplicated().value_counts()

In [None]:
# create subject column
maths['subject'] = 'maths'
maths.head()

In [None]:
# create subject column
port['subject'] = 'portuguese'
port.head()

In [None]:
# combine datasets
data = pd.concat([maths, port], axis=0)
data
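
`pd.concat` keeps each frame's original index by default, so the combined frame contains the labels of the shorter file twice. The sketch below uses toy frames (made-up values, not the real data) to show why a label-based `drop` can then remove more rows than intended, and how `ignore_index=True` avoids that:

```python
import pandas as pd

# Toy stand-ins for the two subject files (values are made up).
a = pd.DataFrame({'age': [15, 21, 16], 'subject': 'maths'})
b = pd.DataFrame({'age': [17, 18, 22], 'subject': 'portuguese'})

# Default concat: the index labels 0, 1, 2 each appear twice.
dup = pd.concat([a, b], axis=0)
old_idx = dup.loc[dup['age'] >= 20].index       # labels 1 and 2
dropped_dup = dup.drop(old_idx)                 # removes every row labelled 1 or 2

# Unique index: only the rows that match the condition are removed.
uniq = pd.concat([a, b], axis=0, ignore_index=True)
dropped_uniq = uniq.drop(uniq.loc[uniq['age'] >= 20].index)

print(len(dropped_dup), len(dropped_uniq))  # 2 4
```

Filtering with a boolean mask (`frame[frame['age'] < 20]`) is another way to sidestep the issue, since it does not rely on index labels at all.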


In [None]:
# checking for NaN values --> No NaN values
data.isna().sum()

In [None]:
#checking for duplicates
data.duplicated().value_counts()


In [None]:
data.shape

In [None]:
data.info()

In [None]:
# standardizing column names in lower case
data.columns = data.columns.str.strip().str.lower()
data.head()


### Data Exploration

In [None]:
# describe categoricals
data.describe(include=object).T

In [None]:
# describe numericals
data.describe()

In [None]:
# Getting the above average grades (g3)
upper_g3 = data[data['g3'] > data['g3'].mean()]
upper_g3

In [None]:
above_percentage = len(upper_g3) * (100/len(data))
print(round(above_percentage),'% of grades (g3) are above average.')

In [None]:
# Getting the below average grades (g3)
lower_g3 = data[data['g3'] < data['g3'].mean()]
lower_g3

In [None]:
lower_percentage = len(lower_g3) * (100/len(data))
print(round(lower_percentage), '% of grades (g3) are below average.')
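
Both filters use strict comparisons, so a grade exactly equal to the mean is counted in neither group and the two percentages need not add up to 100. A small sketch on made-up grades that computes the shares directly from boolean masks:

```python
import pandas as pd

# Toy final grades (made up) to illustrate the calculation.
g3 = pd.Series([0, 8, 10, 10, 12, 20])
mean_g3 = g3.mean()  # 10.0

# The mean of a boolean mask is the share of True values.
pct_above = (g3 > mean_g3).mean() * 100
pct_below = (g3 < mean_g3).mean() * 100
pct_equal = (g3 == mean_g3).mean() * 100

print(round(pct_above), round(pct_below), round(pct_equal))  # 33 33 33
```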

In [None]:
sns.countplot(x='g3', data=data);
plt.grid(axis ='y')
plt.show()

In [None]:
# According to the plot above, adding another grade column, splitting g3 into 4 (high), 3 (high mid), 2 (low mid) and 1 (low) grades

def get_grade(val):
    if val <= 6:
        return 1
    elif val <= 10:
        return 2
    elif val <= 15:
        return 3
    else:
        return 4

data['grade'] = data['g3'].apply(get_grade)
data.head()
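
The same binning can also be expressed with `pd.cut`, which may be easier to adjust if the thresholds change. The sketch below checks on every possible final grade (0 to 20) that right-closed bins reproduce `get_grade`; it is an alternative formulation, not a replacement for the cell above:

```python
import pandas as pd

def get_grade(val):
    # Same thresholds as in the cell above.
    if val <= 6:
        return 1
    elif val <= 10:
        return 2
    elif val <= 15:
        return 3
    else:
        return 4

g3 = pd.Series(range(21))  # every possible final grade, 0 to 20

# Right-closed bins: (-1, 6], (6, 10], (10, 15], (15, 20]
binned = pd.cut(g3, bins=[-1, 6, 10, 15, 20], labels=[1, 2, 3, 4]).astype(int)

print((binned == g3.apply(get_grade)).all())  # True
```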

In [None]:
data.value_counts('grade')

### Checking the impact of each column on the grades

In [None]:
# school
sx = sns.barplot(x='school', y='grade', data=data)
sx.grid(axis='y')
plt.show()

In [None]:
# sex
sx = sns.barplot(x='sex', y='grade', data=data)
sx.grid(axis='y')
plt.show()

In [None]:
data['age'].value_counts()

In [None]:
# Drop outliers in age column
# (boolean mask instead of drop-by-index, because the concatenated frame has duplicate index labels)
data = data[data['age'] < 20].copy()
data['age'].value_counts()

In [None]:
data = data.reset_index(drop=True)
data

In [None]:
# age
sx = sns.barplot(x='age', y='grade', data=data)
sx.grid(axis='y')
plt.show()

In [None]:
# address - urban or rural
sx = sns.barplot(x='address', y='grade', data=data)
sx.grid(axis='y')
plt.show()

In [None]:
# family size - greater than 3 or less/equal 3
sns.barplot(x='famsize', y='grade', data=data)
plt.show()

In [None]:
# parents cohabitation status
sns.barplot(x='pstatus', y='grade', data=data)
plt.show()

### parents cohabitation status / family size

In [None]:
# parents cohabitation status / family size
#edu_order = [0,1,2,3,4]
fig, ax = plt.subplots(1,2,figsize=(10,5))
sx = sns.barplot(data = data, y='grade', x='pstatus', ax = ax[0])
sx.set(title = "Parental Status", xlabel = 'pstatus', ylabel = 'grade')
sx.grid(axis='y')
sx = sns.barplot(data = data, y='grade', x='famsize', ax = ax[1])
sx.set(title = "Family Size", xlabel = 'famsize', ylabel = 'grade')
sx.grid(axis='y')
plt.tight_layout()
plt.show()

In [None]:
# mother's education
sns.barplot(x='medu', y='grade', data=data)
plt.show()

In [None]:
# father's education
sns.barplot(x='fedu', y='grade', data=data)
plt.show()

### Parents' Education

In [None]:
edu_order = [0,1,2,3,4]
fig, ax = plt.subplots(1,2,figsize=(10,5))
sx = sns.barplot(data = data, y='grade', x='medu', ax = ax[0], order=edu_order)
sx.set(title = "Mother's Education", xlabel = 'medu', ylabel = 'grade')
sx.grid(axis='y')
sx = sns.barplot(data = data, y='grade', x='fedu', ax = ax[1], order=edu_order)
sx.set(title = "Father's Education", xlabel = 'fedu', ylabel = 'grade')
sx.grid(axis='y')
plt.tight_layout()
plt.show()

In [None]:
# mother's job
sns.barplot(x='mjob', y='grade', data=data)
plt.show()

In [None]:
# father's job
sns.barplot(x='fjob', y='grade', data=data)
plt.show()

### Parents' Occupation

In [None]:
job_order = ['teacher','health', 'services','other', 'at_home']
fig, ax = plt.subplots(1,2,figsize=(10,5))
sx = sns.barplot(data = data, y='grade', x='mjob', ax = ax[0], order=job_order)
sx.set(title = "Mother's Occupation", xlabel = 'mjob', ylabel = 'grade')
sx.grid(axis='y')
sx = sns.barplot(data = data, y='grade', x='fjob', ax = ax[1], order=job_order)
sx.set(title = "Father's Occupation", xlabel = 'fjob', ylabel = 'grade')
sx.grid(axis='y')
plt.tight_layout()
plt.show()



In [None]:
# reason for choice of school
sx = sns.barplot(x='reason', y='grade', data=data)
sx.grid(axis='y')
plt.show()

In [None]:
# guardian
sx = sns.barplot(x='guardian', y='grade', data=data)
sx.grid(axis='y')
plt.show()

In [None]:
# traveltime (1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
sx = sns.barplot(x='traveltime', y='grade', data=data)
sx.grid(axis='y')
plt.show()

In [None]:
# studytime (1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
sx = sns.barplot(x='studytime', y='grade', data=data)
sx.grid(axis='y')
plt.show()

In [None]:
# past class failures
sx = sns.barplot(x='failures', y='grade', data=data)
sx.grid(axis='y')
plt.show()

In [None]:
# extra educational support
sx = sns.barplot(x='schoolsup', y='grade', data=data)
sx.grid(axis='y')
plt.show()

In [None]:
# family educational support
sx = sns.barplot(x='famsup', y='grade', data=data)
sx.grid(axis='y')
plt.show()

In [None]:
sup_order = ['yes','no']
fig, ax = plt.subplots(1,2,figsize=(10,5))
sx = sns.barplot(data = data, y='grade', x='schoolsup', ax = ax[0], order=sup_order)
sx.set(title = "Extra Educational Support", xlabel = 'schoolsup', ylabel = 'grade')
sx.grid(axis='y')
sx = sns.barplot(data = data, y='grade', x='famsup', ax = ax[1], order=sup_order)
sx.set(title = "Family Educational Support", xlabel = 'famsup', ylabel = 'grade')
sx.grid(axis='y')
plt.tight_layout()
plt.show()

In [None]:
# extra paid classes
sx = sns.barplot(x='paid', y='grade', data=data)
sx.grid(axis='y')
plt.show()

In [None]:
# extra curricular activities
sx = sns.barplot(x='activities', y='grade', data=data)
sx.grid(axis='y')
plt.show()

In [None]:

extra_order = ['yes','no']
fig, ax = plt.subplots(1,2,figsize=(10,5))
sx = sns.barplot(data = data, y='grade', x='paid', ax = ax[0], order=extra_order)
sx.set(title = "Extra Paid Classes", xlabel = 'paid', ylabel = 'grade')
sx.grid(axis='y')
sx = sns.barplot(data = data, y='grade', x='activities', ax = ax[1], order=extra_order)
sx.set(title = "Extra Curricular Activities", xlabel = 'activities', ylabel = 'grade')
sx.grid(axis='y')
plt.tight_layout()
plt.show()

In [None]:
#mathsdata

maths = data["subject"].isin(["maths"])
mathsdata = data[maths]
mathsdata

In [None]:
extra_order = ['yes','no']
fig, ax = plt.subplots(1,2,figsize=(10,5))
sx = sns.barplot(data = mathsdata, y='grade', x='paid', ax = ax[0], order=extra_order)
sx.set(title = "Extra Paid Classes - Maths", xlabel = 'paid', ylabel = 'grade')
sx.grid(axis='y')
sx = sns.barplot(data = mathsdata, y='grade', x='activities', ax = ax[1], order=extra_order)
sx.set(title = "Extra Curricular Activities - Maths", xlabel = 'activities', ylabel = 'grade')
sx.grid(axis='y')
plt.tight_layout()
plt.savefig("../plots/maths_grades_per_paid_activities.png")
plt.show()

In [None]:
port = data["subject"].isin(["portuguese"])
portdata = data[port]
portdata

In [None]:
extra_order = ['yes','no']
fig, ax = plt.subplots(1,2,figsize=(10,5))
sx = sns.barplot(data = portdata, y='grade', x='paid', ax = ax[0], order=extra_order)
sx.set(title = "Extra Paid Classes - Portuguese", xlabel = 'paid', ylabel = 'grade')
sx.grid(axis='y')
sx = sns.barplot(data = portdata, y='grade', x='activities', ax = ax[1], order=extra_order)
sx.set(title = "Extra Curricular Activities - Portuguese", xlabel = 'activities', ylabel = 'grade')
sx.grid(axis='y')
plt.tight_layout()
plt.savefig("../plots/port_grades_per_paid_activities.png")
plt.show()
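
By default `sns.barplot` draws the mean of `y` per category (plus an error bar), so the bar heights can be cross-checked numerically with a `groupby`. A sketch on a toy frame that reuses the notebook's column names but contains made-up values:

```python
import pandas as pd

toy = pd.DataFrame({
    'subject': ['maths', 'maths', 'maths', 'portuguese', 'portuguese', 'portuguese'],
    'paid':    ['yes',   'no',    'yes',   'no',         'no',         'yes'],
    'grade':   [3,       2,       4,       3,            1,            2],
})

# Mean grade per subject and paid-classes status: the numbers the bars represent.
means = toy.groupby(['subject', 'paid'])['grade'].mean().unstack('paid')
print(means)
```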

In [None]:
# attended nursery
sns.barplot(x='nursery', y='grade', data=data)
plt.show()

In [None]:
# aiming for higher education
sns.barplot(x='higher', y='grade', data=data)
plt.show()

In [None]:
#internet access
sns.barplot(x='internet', y='grade', data=data)
plt.show()

In [None]:
# romantically involved
sns.barplot(x='romantic', y='grade', data=data)
plt.show()

In [None]:
# quality of family relationship (1 - very bad to 5 - excellent)
sns.barplot(x='famrel', y='grade', data=data)
plt.show()

In [None]:
# freetime (1 - very low to 5 - very high)
sns.barplot(x='freetime', y='grade', data=data)
plt.show()

In [None]:
# going out (1 - very low to 5 - very high)
sns.barplot(x='goout', y='grade', data=data)
plt.show()

In [None]:
# workday alcohol consumption (1 - very low to 5 - very high)
sns.barplot(x='dalc', y='grade', data=data)
plt.show()

In [None]:
# weekend alcohol consumption (1 - very low to 5 - very high)
sns.barplot(x='walc', y='grade', data=data)
plt.show()

In [None]:
# health status (1 - very bad to 5 - very good)
sns.barplot(x='health', y='grade', data=data)
plt.show()

In [None]:
# school absences (0 to 75)
sns.barplot(x='absences', y='grade', data=data)
plt.tight_layout()
data.value_counts('absences')

In [None]:
data.pivot_table(index=['sex'], values=['dalc'], aggfunc=['mean'])

In [None]:
# Mean workday alcohol consumption by sex and parents' cohabitation status
# Note (weekend): between 15 and 18, the older the students get, the more the gender gap widens
data.groupby(['sex', 'pstatus']).agg({'dalc': 'mean'})

In [None]:
data.value_counts('age')

In [None]:
data.value_counts('guardian')

In [None]:
data.value_counts('reason')

In [None]:
data.value_counts('romantic')

In [None]:
data.value_counts('dalc')

In [None]:
data.value_counts('walc')

In [None]:
data.value_counts('health')

In [None]:
sns.countplot(x='health', data=data)
plt.show()

In [None]:
sns.pairplot(data)

In [None]:
sns.countplot(x='mjob', data=data);
plt.show()

In [None]:
sns.countplot(x='g3', data=data);
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(5,5))
ax.scatter(y=data['mjob'], x=data['dalc'])
ax.set_xlabel('dalc')
ax.set_ylabel('mjob')
ax.set_title("Checking relationships")
plt.show()

In [None]:
sns.barplot(x='guardian', y='g3', data=data);
plt.show()

In [None]:
# good
sns.barplot(x='goout', y='absences', data=data);
plt.show()

In [None]:
# good
sns.barplot(x='goout', y='dalc', data=data);
plt.show()

In [None]:
# good
sns.barplot(x='grade', y='goout', data=data);
plt.show()

In [None]:
# good
sns.barplot(x='romantic', y='grade', data=data);
plt.show()

In [None]:
data.groupby(['sex', 'age']).agg({'dalc': 'mean'})
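
To see how the difference between boys and girls changes with age, the grouped means can be reshaped so that each sex becomes a column. A sketch on made-up values; the `gap` column is an illustrative addition, not part of the dataset:

```python
import pandas as pd

toy = pd.DataFrame({
    'sex':  ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M'],
    'age':  [15, 15, 15, 15, 16, 16, 16, 16],
    'dalc': [1, 2, 2, 3, 1, 4, 2, 5],
})

# One row per age, one column per sex, then the difference M - F.
by_age = toy.groupby(['age', 'sex'])['dalc'].mean().unstack('sex')
by_age['gap'] = by_age['M'] - by_age['F']
print(by_age)
```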

In [None]:
# boxplot??
#get_outliers(data)

In [None]:
data.corr(numeric_only=True)

In [None]:
sns.heatmap(data.corr(numeric_only=True))
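
Depending on the pandas version, `DataFrame.corr()` either silently skips text columns or raises an error on them. Selecting the numeric columns first behaves the same in both cases. A sketch with made-up values:

```python
import pandas as pd

toy = pd.DataFrame({
    'sex':  ['F', 'M', 'F', 'M', 'F'],
    'dalc': [1, 2, 3, 4, 5],
    'g3':   [18, 15, 12, 9, 6],
})

# Keep only numeric columns before correlating; text columns such as 'sex'
# cannot be correlated directly.
corr = toy.select_dtypes(include='number').corr()
print(corr['g3'].sort_values())
```

Sorting one column of the correlation matrix, as in the last line, gives a quick ranking of how strongly each numeric feature is linearly related to the final grade.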

### Hypotheses:

#### 1. Students between 16 and 18 consume alcohol once a week on average. (two-sided)
#### 2. Girls drink less alcohol if their parents live together. (left-sided)
#### 3. Urban boys drink more alcohol than rural boys. (right-sided)

In [None]:
# creating a column 'talc': average alcohol consumption rating over the week (5 workdays weighted with dalc, 2 weekend days with walc)
data['talc'] = data['dalc']*5/7 + data['walc']*2/7
data
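
`talc` is a weighted average of the two ratings, so it stays on the same 1 to 5 scale as `dalc` and `walc`. A worked example of the arithmetic; the function name is hypothetical and only wraps the formula from the cell above:

```python
def weighted_alcohol_rating(dalc, walc):
    """Average of workday and weekend ratings, weighted by days per week."""
    return dalc * 5 / 7 + walc * 2 / 7

# Workday rating 1 (very low) and weekend rating 3: (1*5 + 3*2) / 7 = 11/7
print(round(weighted_alcohol_rating(1, 3), 3))  # 1.571
```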

In [None]:
# 1. Alcohol consumption of students between 16 and 18 is low (2).
# confidence level = 0.95
# H0: population mean == 2
# H1: population mean != 2

sample = data[(data['age']>=16) & (data['age']<=18)]['talc'] 
sample_mean = np.mean(sample)
alpha = 0.05
t_statistic    = (sample_mean - 2) / (np.std(sample,ddof=1) / np.sqrt(len(sample)))
lower_critical = st.t.ppf((alpha/2), df=len(sample)-1)
upper_critical = st.t.ppf(1-(alpha/2), df=len(sample)-1)

print(f"Lower critical: {lower_critical}")
print(f"Statistic:      {t_statistic}")
print(f"Upper critical: {upper_critical}")
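
The manual statistic can be cross-checked against `scipy.stats.ttest_1samp`, which also returns a p-value. The sketch below uses a synthetic sample (not the real data) and shows that comparing the statistic with the critical value and comparing the p-value with alpha lead to the same decision:

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
toy_sample = rng.normal(loc=1.8, scale=0.9, size=200)  # synthetic, not the real data

# Manual statistic, computed as in the cell above.
t_manual = (toy_sample.mean() - 2) / (toy_sample.std(ddof=1) / np.sqrt(len(toy_sample)))

# Same test via SciPy (two-sided by default).
t_scipy, p_value = st.ttest_1samp(toy_sample, popmean=2)

alpha = 0.05
upper_critical = st.t.ppf(1 - alpha / 2, df=len(toy_sample) - 1)

# Two-sided test: |t| > critical value is equivalent to p-value < alpha.
print(abs(t_manual) > upper_critical, p_value < alpha)
```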

#### Result: Since the statistic does not lie between the two critical values, it falls in the rejection region, so the null hypothesis that the mean equals 2 is rejected. The mean alcohol consumption of students between 16 and 18 differs significantly from low (2).

In [None]:
# 2. Girls drink less alcohol, if their parents live together.
# confidence level = 0.95
# H0: sample_mean >= 2
# H1: sample_mean < 2

sample = data[(data['sex']=='F') & (data['pstatus']== 'T')]['talc'] 
stat, pval = st.ttest_1samp(sample, popmean=2, alternative="less")

print(f"Statistic: {stat}")
print(f"Pval:      {pval}")

#### Result: Since the p-value is smaller than the significance level (alpha = 0.05), H0 is rejected. The data support the hypothesis that the alcohol consumption of girls whose parents live together is less than low (2).

In [None]:
# 3. Urban boys drink more alcohol than rural boys.
# confidence level = 0.95
# H0: sample_mean <= 3
# H1: sample_mean > 3

sample = data[(data['address']=='U') & (data['sex']== 'M')]['talc'] 
stat, pval = st.ttest_1samp(sample, popmean=3, alternative="greater")

print(f"Statistic: {stat}")
print(f"Pval:      {pval}")

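#### Result: Since the p-value is bigger than the significance level (alpha = 0.05), H0 cannot be rejected. There is no evidence that the mean alcohol consumption of urban boys is above 3. Note that this one-sample test compares urban boys with a fixed value, not with boys from rural areas.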
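
Hypothesis 3 is a statement about two groups, so one way to test it directly is a two-sample test between urban and rural boys. The sketch below uses Welch's t-test on synthetic values (not the real data); the `alternative` argument assumes SciPy 1.6 or newer. With the real data, the two samples would be the `talc` values of boys with `address == 'U'` and `address == 'R'`.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(1)
# Synthetic 'talc' values for the two groups (made up, not the real data).
urban = rng.normal(loc=2.2, scale=0.8, size=150)
rural = rng.normal(loc=2.0, scale=0.8, size=80)

# H0: mean(urban) <= mean(rural), H1: mean(urban) > mean(rural)
# equal_var=False selects Welch's t-test, which does not assume equal variances.
stat, pval = st.ttest_ind(urban, rural, equal_var=False, alternative='greater')

alpha = 0.05
reject_h0 = pval < alpha
print(f"Statistic: {stat:.3f}, p-value: {pval:.3f}, reject H0: {reject_h0}")
```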

In [None]:
sns.barplot(x='grade', y='talc', data=data)
plt.show()

In [None]:
grade_order = [4,3,2,1]
fig, ax = plt.subplots(1,2,figsize=(10,5))
sx = sns.countplot(x='grade',data = portdata, ax = ax[0], order=grade_order)
sx.set(title = "Grade - Portuguese", xlabel = 'grade', ylabel = 'count')
sx.grid(axis='y')
sx = sns.countplot(x='grade',data = mathsdata, ax = ax[1], order=grade_order)
sx.set(title = "Grade - Maths", xlabel = 'grade', ylabel = 'count')
sx.grid(axis='y')
plt.tight_layout()
plt.savefig("../plots/grade_port_maths.png")
plt.show()