In [None]:
import numpy as np #math
import pandas as pd #data processing, CSV I/O
import seaborn as sns #visualization
import matplotlib.pyplot as plt #plotting

In [None]:
#read in the csv file using pandas; put the path to your csv inside the quotes
df = pd.read_csv('student_drinking.csv')

In [None]:
df.head(10) #show the first 10 rows of the dataset

In [None]:
#find out the number of entries in the dataset
len(df)

In [None]:
#get a list of all of the attributes in this dataset
df.columns

In [None]:
#access the row with index label 246 (the 247th row when using the default 0-based index)
df.loc[246]
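
`df.loc` looks rows up by index label, while `df.iloc` looks them up by integer position. With the default index the two happen to line up, which makes them easy to mix up. Here is a small sketch of the difference on a made-up frame (the `toy` data is hypothetical and not part of the dataset):

```python
import pandas as pd

# Toy frame with a non-default index so that labels and positions differ
toy = pd.DataFrame({"score": [10, 20, 30]}, index=[5, 6, 7])

by_label = toy.loc[6]      # the row whose index label is 6
by_position = toy.iloc[1]  # the second row, counting positions from 0

# Both lookups land on the same row here
print(by_label["score"], by_position["score"])
```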

Before we do anything with our data, we should come up with a reason for working with it. What knowledge do we want to gain from this dataset? Coming up with a question lets us decide how to clean the dataset and get started with the analysis. I want to find out whether alcohol consumption has any effect on grades.

In [None]:
#start out with step 1: let's look for duplicate rows and irrelevant data.
Dup_Rows = df[df.duplicated()]
print(len(Dup_Rows))
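
If that count had come back above zero, the duplicates could be removed with `drop_duplicates`. A minimal sketch on a made-up frame (the `toy` data and the names `n_dups` and `deduped` are only for illustration):

```python
import pandas as pd

# Toy frame in which the second row repeats the first
toy = pd.DataFrame({"school": ["GP", "GP", "MS"], "age": [16, 16, 17]})

n_dups = toy.duplicated().sum()  # rows that are identical to an earlier row
deduped = toy.drop_duplicates().reset_index(drop=True)  # keep first occurrences

print(n_dups, len(deduped))
```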

In [None]:
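#what are some other pieces of irrelevant data? We can use a correlation matrix to help determine that
#(only numeric columns can be correlated, so we select those first)
plt.figure(figsize=(15,15))
sns.heatmap(df.select_dtypes(include='number').corr(), annot = True, cbar = True)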
plt.xticks(rotation=90)
plt.yticks(rotation = 0)

In [None]:
#We can see that the only strong correlations are between Walc and Dalc, and among G1, G2 and G3.
#Fixing irrelevant data can also mean combining related columns. Let's average each student's marks
#so that we are only dealing with a single datapoint per student.
#Note: from here on, G1 holds the average of G1, G2 and G3.
df['G1'] = (df['G1'] + df['G2'] + df['G3']) / 3

#let's also combine weekend and weekday alcohol consumption into one column
#Note: from here on, Walc holds the sum of Walc and Dalc.
df['Walc'] = df['Walc'] + df['Dalc']

#now let's drop the columns we no longer need
dropCols = ['G2', 'G3', 'Dalc']
df.drop(columns = dropCols, inplace = True)

I could get rid of some more data I find irrelevant, but since there are only 30 attributes, I think it's fine to keep the rest, since I may want to use it for different processing later. Moving on to parts 2 and 3, let's now try to fix any structural errors in our data. Since this is a popular Kaggle dataset, you can be fairly confident that the data is clean, but I can give a couple of examples of the many pandas cleaning techniques.

In [None]:
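#drop rows in which every value is NaN; calling dropna() with no arguments drops any row that contains a NaN
#note: dropna returns a new DataFrame, so assign the result if you want to keep it
df.dropna(how='all')

#drop columns that are entirely NaN; use how='any' to drop every column that contains a NaN
df.dropna(axis=1, how='all')

#fill NaN with a placeholder (note that a string placeholder in a numeric column makes that column non-numeric)
df.fillna("N/A", inplace = True)

df['Mjob'].str.upper() #return the values in uppercase; .str.lower() does the same for lowercase
df['Mjob'].str.strip() #return the values with leading and trailing whitespace removed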
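
Since this dataset has no missing values, the calls above have nothing to act on. To see what they do, here is a small sketch on a made-up frame with gaps in it (the `toy` data and the result names are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame: row 1 is partly missing, row 2 is entirely missing
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 5.0, np.nan],
})

missing_per_column = toy.isna().sum()  # count the NaN values in each column

all_dropped = toy.dropna(how="all")  # removes only the fully empty row
any_dropped = toy.dropna()           # default how='any': removes every row with a NaN

print(missing_per_column)
print(len(toy), len(all_dropped), len(any_dropped))
```

Counting the missing values first with `isna().sum()` is a cheap way to decide whether any of the drop or fill steps are needed at all.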

Part 4 is to filter unwanted outliers. In this dataset there are likely few outliers, since the values sit close together and the scales leave little room for extremes, but plotting your data to compare distributions is a good idea when you have more time. Now it is time to get some answers to our initial question: does drinking have anything to do with academic success?
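
Before moving on, here is one way an outlier check could look if you did want to run one. It uses the common 1.5 × IQR rule of thumb, which is an assumption on my part and not something this analysis relies on, and the `grades` values are made up:

```python
import pandas as pd

# Made-up grades with one obviously extreme value
grades = pd.Series([10, 11, 12, 12, 13, 14, 40])

q1 = grades.quantile(0.25)
q3 = grades.quantile(0.75)
iqr = q3 - q1  # interquartile range

# Anything further than 1.5 * IQR outside the quartiles is flagged
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = grades[(grades < lower) | (grades > upper)]

print(outliers.tolist())
```

Whether a flagged value should actually be removed is a judgment call; a flag only tells you where to look.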

In [None]:
counts = [] #avoid naming a variable 'list', which would shadow the built-in
for i in range(11):
    counts.append(len(df[df.Walc == i]))
ax = sns.barplot(x = [0,1,2,3,4,5,6,7,8,9,10], y = counts)
plt.ylabel('Number of Students')
plt.xlabel('Weekly alcohol consumption')

In [None]:
labels = ['2','3','4','5','6','7','8','9','10']
colors = ['lime','blue','orange','cyan','grey','purple','brown','red','darksalmon']
explode = [0,0,0,0,0,0,0,0,0]
sizes = []
for i in range(2,11):
    sizes.append(sum(df[df.Walc == i].G1))
total_grade = sum(sizes)
average = total_grade/float(len(df))
plt.pie(sizes,explode=explode,colors=colors,labels=labels,autopct = '%1.1f%%')
plt.axis('equal')
plt.title('Total grade : '+str(total_grade))
plt.xlabel('Students grade distribution according to weekly alcohol consumption')

In [None]:
len(df)
len(df.G1)

In [None]:
#To see whether alcohol affects students' success, I compare each grade with the overall average.
ave = sum(df.G1)/float(len(df))
df['ave_line'] = ave
df['average'] = ['above average' if i > ave else 'under average' for i in df.G1]
sns.swarmplot(x='Walc', y = 'G1', hue = 'average',data= df,palette={'above average':'lime', 'under average': 'red'})

In [None]:
avg_grades = [] #avoid naming a variable 'list', which would shadow the built-in
for i in range(2,11):
    avg_grades.append(df[df.Walc == i].G1.mean()) #mean grade at this consumption level
ax = sns.barplot(x = [2,3,4,5,6,7,8,9,10], y = avg_grades)
plt.ylabel('Average Grades of students')
plt.xlabel('Weekly alcohol consumption')
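
The per-level averages built by the loop above can also be computed in one step with `groupby`, which gives the group sizes at the same time. A minimal sketch on made-up data that reuses the `Walc` and `G1` column names (the values and the name `summary` are hypothetical):

```python
import pandas as pd

# Made-up consumption levels and grades
toy = pd.DataFrame({
    "Walc": [2, 2, 3, 3, 3],
    "G1": [10.0, 14.0, 9.0, 12.0, 15.0],
})

# Mean grade and number of students at each consumption level
summary = toy.groupby("Walc")["G1"].agg(["mean", "count"])

print(summary)
```

Seeing the counts next to the means is useful here, because an average over a handful of students says much less than one over a hundred.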

Alright, so I guess we haven't really come to a conclusion. It is clear that most students don't drink heavily throughout the week, and the highest marks come from the ones who drink the least. But the worst marks also come from the students who drink the least. It appears that the amount of drinking has little effect on the marks of these students. But it is important to remember that a conclusion reached after plotting a few graphs of a highly skewed dataset isn't a valid finding. This dataset is made up of about 400 Portuguese students who go to an affluent high school and who live in a place where drinking alcohol is legal at their ages. It would take a lot more work with a lot more data to come to any valid conclusions. I hope you have learned a bit more about working with data, and I would recommend checking out the many datasets on Kaggle if you're interested in more!