In [None]:
import numpy as np 
import pandas as pd
import seaborn as sns 
import matplotlib.pyplot as plt
from pylab import subplot
from itertools import combinations
from scipy.stats import chi2_contingency

# Data loading and general data inspection

In [None]:
df = pd.read_csv('../input/students-performance-in-exams/StudentsPerformance.csv')
df.head(3)

In [None]:
num_cols = df.select_dtypes(exclude = 'O').columns
cat_cols = df.select_dtypes(include = 'O').columns
df.info()

<i>This dataset has no NaN values, so we can proceed to analysis and visualization.
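
A quick way to confirm this is to count the missing values per column. The sketch below is self-contained and uses a small made-up frame (`toy`) instead of the real dataset; `missing_counts` is a hypothetical helper name, not something defined in this notebook.

```python
import numpy as np
import pandas as pd

def missing_counts(frame):
    """Return the number of NaN values in each column."""
    return frame.isna().sum()

# Made-up data for illustration only (not taken from StudentsPerformance.csv)
toy = pd.DataFrame({
    'math score': [72, 69, np.nan],
    'reading score': [72, 90, 95],
})

print(missing_counts(toy))
# After dropping incomplete rows the total number of NaN values is zero
print(missing_counts(toy.dropna()).sum() == 0)
```

On the notebook's data the equivalent check would be `df.isna().sum()`.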

# Analysis of score attributes

In [None]:
df.describe()

In [None]:
plt.figure(figsize = (10, 6))
sns.distplot(df['math score'], label = 'math')
sns.distplot(df['reading score'], label = 'reading')
sns.distplot(df['writing score'], label = 'writing')
plt.xlabel('Distributions of different scores')
plt.legend()
plt.show()

<i>As we can see, there are no significant differences between the distributions of the score attributes.<br>
Let's check the correlations between these features.

In [None]:
df[num_cols].corr()

<i>Assumptions and conclusions:
* Reading and writing are very highly correlated. This is likely because both tests were in the same language, and knowing a language involves reading, writing, speaking and listening skills together.
* 'Math' is highly correlated with the other features. Probably few students, if any, prepare only for the math exam or only for the language exams (students usually prepare for all exams or for none).

# Visualization and analysis of categorical features

In [None]:
# The column is passed as x= by keyword: newer seaborn versions treat
# the first positional argument as `data`, not as the variable to count
plt.figure(figsize = (10, 6))

subplot(2, 2, 1)
sns.countplot(x = df['gender'])

subplot(2, 2, 2)
sns.countplot(x = df['lunch'])

subplot(2, 2, 3)
sns.countplot(x = df['race/ethnicity'])

subplot(2, 2, 4)
sns.countplot(x = df['test preparation course'])

plt.figure(figsize = (17, 6))
subplot(2, 2, 2)

sns.countplot(x = df['parental level of education'])
plt.tight_layout()

<i>No feature shows a large difference in the number of objects between its classes; all classes are reasonably balanced.
<br>Now let's try to find relationships between these features.

In [None]:
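cat_combs = combinations(cat_cols, 2)
n = len(df.index)

for pair in cat_combs:
    # Contingency table of the two categorical features
    table_for_chi2 = pd.crosstab(df[pair[0]], df[pair[1]])
    # chi2_contingency returns the statistic, p-value, degrees of freedom and expected counts
    stat, p_val, dof, expected = chi2_contingency(table_for_chi2)
    # sqrt(chi2 / n) is the phi coefficient, used here as a rough measure of association strength
    print('Pair: {0}\np-value of h0 (independence of the pair): {1}\ncorrelation = {2}\n'.format(
        pair, round(p_val, 5), round(np.sqrt(stat / n), 5)))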

<i>We cannot reject the hypothesis of independence for any pair at the 0.05 significance level.<br>
Let's look at the pairs whose p-value is below 0.1.
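
For tables larger than 2x2, `sqrt(chi2 / n)` is not bounded by 1, so a common normalization is Cramér's V, which also divides by `min(rows - 1, cols - 1)`. The sketch below is one possible way to compute it; `cramers_v` is a hypothetical helper and the input lists are made up. Note that `chi2_contingency` applies Yates' continuity correction to 2x2 tables by default, so the sketch passes `correction=False` to get the uncorrected statistic.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    """Cramér's V between two categorical sequences (0 = no association, 1 = perfect)."""
    table = pd.crosstab(pd.Series(x), pd.Series(y))
    # Uncorrected chi-square statistic of the contingency table
    stat, _, _, _ = chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    k = min(table.shape) - 1  # smaller table dimension minus one
    return np.sqrt(stat / (n * k))

# Made-up labels for illustration only
a = ['x', 'x', 'y', 'y'] * 25
identical = cramers_v(a, a)                          # a feature compared with itself
unrelated = cramers_v(a, ['u', 'v', 'u', 'v'] * 25)  # evenly mixed classes
print(identical, unrelated)
```

In the notebook this could be called as `cramers_v(df[pair[0]], df[pair[1]])` inside the loop over pairs.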

In [None]:
# x= and hue= are passed by keyword so the calls work across seaborn versions
plt.figure(figsize = (12, 5))
sns.countplot(x = df['race/ethnicity'], hue = df['parental level of education'])
plt.show()

plt.figure(figsize = (12, 5))
sns.countplot(x = df['race/ethnicity'], hue = df['gender'])
plt.show()

plt.figure(figsize = (12, 5))
sns.countplot(x = df['parental level of education'], hue = df['test preparation course'])
plt.show()

<i>We can see that the proportions change between classes, but it is too early to talk about correlation. (For example, bachelor's and master's degrees have almost the same proportion in every group. Groups A, E and D have the same gender composition, while groups B and C differ noticeably from the others. The last pair shows no striking difference in any class.)<br>
The results are mostly as expected, e.g. gender doesn't affect the type of lunch, and race doesn't affect the test preparation course. But there are some surprises: sometimes the association between features we would expect to be independent is stronger than between features we would expect to be related, such as educated parents, who know the benefits of courses, making their children take them.<br><br>
Now the most interesting part starts.

# Influence of all features on target attributes

In [None]:
# binary attributes
for cat_col in ['gender', 'lunch', 'test preparation course']:
    print('\n', cat_col.upper())
    val1, val2 = df[cat_col].unique()[:2]
    for col in num_cols:
        # Standardize the score, then compare the two class means
        # in units of the overall standard deviation
        z = (df[col] - df[col].mean()) / df[col].std()
        diff = z[df[cat_col] == val1].mean() - z[df[cat_col] == val2].mean()
        print('Correlation with %s =' % col, abs(round(diff, 5)))

<i>Wow. All binary features show a medium or high correlation with the score attributes.
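
The value printed above is a difference of class means measured in standard deviations, so unlike a correlation coefficient it is not bounded by 1. For a two-class feature it is related to the Pearson (point-biserial) correlation `r` by `r = d * sqrt(n1 * n0 / (n * (n - 1)))`, where `n1` and `n0` are the class sizes. The sketch below illustrates this with made-up numbers; both function names are hypothetical.

```python
import numpy as np

def standardized_diff(scores, is_first):
    """Difference of class means after standardizing with the overall std."""
    scores = np.asarray(scores, dtype=float)
    is_first = np.asarray(is_first, dtype=bool)
    # ddof=1 matches the default of pandas' .std()
    z = (scores - scores.mean()) / scores.std(ddof=1)
    return z[is_first].mean() - z[~is_first].mean()

def point_biserial(scores, is_first):
    """Pearson correlation between a 0/1 class indicator and the scores."""
    indicator = np.asarray(is_first, dtype=float)
    return np.corrcoef(indicator, np.asarray(scores, dtype=float))[0, 1]

# Made-up scores for two classes of three students each
scores = [50, 60, 70, 80, 90, 100]
group = [False, False, False, True, True, True]

d = standardized_diff(scores, group)
r = point_biserial(scores, group)
n, n1 = len(scores), sum(group)
print(round(d, 4), round(r, 4))
```

With two classes of similar size and a large sample, `r` is roughly half of `d`, which is worth keeping in mind when reading the numbers as correlations.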

In [None]:
i = 0
plt.figure(figsize = (10, 10))
for bin_col in ['gender', 'lunch', 'test preparation course']:
    for col in num_cols:
        i += 1
        subplot(3, 3, i)
        sns.violinplot(y = df[col], x = df[bin_col])
    plt.tight_layout()

<i>My observations and guesses:
* About gender. For every score attribute, the distribution of male scores is noticeably less peaked (lower kurtosis) than the female one, with a smoother peak. Possible reasons are that female students cooperate more when preparing for exams (studying in groups, etc.) or a more uniform aptitude among female students (female scores are more concentrated than male scores), while most male students may treat passing exams as a personal matter. Judging by the location of the peaks, men do better in math and women do better in the language exams (consistent with the common claim that men are stronger in technical subjects and women in the humanities).
* About lunch. Students with a standard lunch get better scores than students with a free/reduced lunch. The brain works better with full nutrition. (But a standard lunch also suggests that the family has more money, and therefore more opportunities to prepare for exams, for example expensive books or tutors.) Math scores show the largest difference between the two 'lunch' classes.
* About test preparation course. Scores improve for students who completed the test preparation course (the largest improvement is in the writing exam).

In [None]:
# other categorical features
plt.figure(figsize = (12, 6))
sns.violinplot(x= df['parental level of education'], y = df['math score'])
plt.grid()
plt.show()

plt.figure(figsize = (12, 6))
sns.violinplot(x= df['parental level of education'], y = df['reading score'])
plt.grid()
plt.show()

plt.figure(figsize = (12, 6))
sns.violinplot(x= df['parental level of education'], y = df['writing score'])
plt.grid()
plt.show()

<i>As we can observe, there is a dependence between parents' education and scores, but it is very hard to see.<br>
As one would expect, the mean score shifts upward as the parents' level of education rises.
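
Since the shift is hard to read from violin plots, a table of per-class means can make it easier to see. The sketch below assumes nothing beyond pandas; `mean_scores_by` is a hypothetical helper and the `toy` frame is made up for illustration.

```python
import pandas as pd

def mean_scores_by(frame, by, score_cols):
    """Mean of each score column within each class of `by`,
    sorted by the first score column."""
    return frame.groupby(by)[score_cols].mean().sort_values(score_cols[0])

# Made-up rows for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    'parental level of education': ['high school', "master's degree",
                                    'high school', "master's degree"],
    'math score': [60, 70, 64, 80],
    'reading score': [65, 75, 67, 85],
})

summary = mean_scores_by(toy, 'parental level of education',
                         ['math score', 'reading score'])
print(summary)
```

On the notebook's data the call would be `mean_scores_by(df, 'parental level of education', list(num_cols))`.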

In [None]:
plt.figure(figsize = (12, 6))
sns.violinplot(x= df['race/ethnicity'], y = df['math score'])
plt.grid()
plt.show()

plt.figure(figsize = (12, 6))
sns.violinplot(x= df['race/ethnicity'], y = df['reading score'])
plt.grid()
plt.show()

plt.figure(figsize = (12, 6))
sns.violinplot(x= df['race/ethnicity'], y = df['writing score'])
plt.grid()
plt.show()

<i>About race (patterns are difficult to trace, but some assumptions can be made):
* Group E does better than the others in all exams; the difference is noticeable in math and almost nonexistent in the other exams
* Groups D and C are roughly equal (D is slightly better in math and writing)
* Group B is slightly behind C and D; its score distribution has a heavier tail towards zero than C and D
* Group A does worst of all: its mean is lower than the others' and it has the heaviest tails towards the minimum mark.

<i>This is probably because people from different groups tend to have different attitudes to learning.
Let's take a look at the correlations.

In [None]:
from sklearn.preprocessing import LabelEncoder

# LabelEncoder assigns integer codes in alphabetical order of the labels,
# so the codes do not reflect any meaningful ranking of the categories
df['parental level of education'] = LabelEncoder().fit_transform(df['parental level of education'])
df['race/ethnicity'] = LabelEncoder().fit_transform(df['race/ethnicity'])

# Pearson correlation of each score column with the encoded feature
print('Correlation with parental level of education')
print(df[list(num_cols)].corrwith(df['parental level of education']))

print('\nCorrelation with race')
print(df[list(num_cols)].corrwith(df['race/ethnicity']))
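
Because `LabelEncoder` sorts labels alphabetically, 'some high school' receives the largest code and "master's degree" lands in the middle, which makes a correlation with these codes hard to interpret. The sketch below shows one alternative: encoding the education level with an explicit order. The ranking in `EDUCATION_ORDER` is an assumption (the dataset does not define one), and `encode_education` is a hypothetical helper.

```python
import pandas as pd

# Assumed ranking from lowest to highest level; this order is an
# assumption made for illustration, the dataset does not define one
EDUCATION_ORDER = [
    'some high school',
    'high school',
    'some college',
    "associate's degree",
    "bachelor's degree",
    "master's degree",
]

def encode_education(values):
    """Map education labels to integer ranks following EDUCATION_ORDER."""
    dtype = pd.CategoricalDtype(EDUCATION_ORDER, ordered=True)
    return pd.Series(values).astype(dtype).cat.codes

# Codes an alphabetical encoder would assign, for comparison
alphabetical = {label: code for code, label in enumerate(sorted(EDUCATION_ORDER))}

sample = ['some high school', "master's degree", 'some college']
ordinal_codes = encode_education(sample).tolist()
alphabetical_codes = [alphabetical[label] for label in sample]
print(ordinal_codes)
print(alphabetical_codes)
```

With ordered codes, a rank correlation such as `corrwith(..., method='spearman')` may be a more natural choice than Pearson. For "race/ethnicity" no ordering applies at all, so comparing group means is probably more informative than any correlation.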

# Results of the sample analysis

<i>The highest correlations are between the target features (reading - writing ~ 0.95, math - reading and math - writing ~ 0.8)

<i>It is hard to find any dependence between the categorical attributes.
The following pairs probably have some dependence:
* 'race/ethnicity', 'parental level of education' 
* 'parental level of education', 'test preparation course'
* 'gender', 'race/ethnicity'

<i>We can find correlations between binary and numerical features:
* High correlation: gender - writing (~ 0.6); lunch - math (~ 0.72); test_prep_course - writing (~ 0.65)
* Medium correlation: gender - reading (~ 0.48); lunch - writing & reading (~ 0.5); test_prep_course - reading (~ 0.51)
* Low correlation: gender - math (~ 0.33); test_prep_course - math (~ 0.37)

<i>The features "race/ethnicity" and "parental level of education" have very complex relationships with the target attributes. The correlation between these features and the scores is very low, but a few patterns can be found in the per-class distributions.

P.S.: Sorry for my English :)