# Goals  
The goal is to find out what affects students' test scores (post test), and then to train an ML model that tries to predict those post-test scores.  

# Data set   
The data is from the IBM SPSS document folder and is also available on [Kaggle](https://www.kaggle.com/kwadwoofosu/predict-test-scores-of-students)

# Index  
[Does family income affect scores](#Does-family-income-affect-scores?)  
[Does a good pre-test mean a good post-test?](#Does-a-good-pre-test-means-a-good-post-test?)  
[What is the difference in teaching methods?](#what-is-the-diffrence-in-teaching-methods?)  
[Are private schools worth it?](#Are-private-schools-worth-it?)  
[Does gender play a role?](#Does-gender-plays-a-role?)  
[Does school area matter?](#Does-school-area-matter?)  
[AI Model](#AI-Model)  
[Model evaluation](#Model-evaluation)  
[Conclusion](#Conclusion)

In [184]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go

In [183]:
df= pd.read_csv('test_scores.csv')

In [84]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2133 entries, 0 to 2132
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   school           2133 non-null   object 
 1   school_setting   2133 non-null   object 
 2   school_type      2133 non-null   object 
 3   classroom        2133 non-null   object 
 4   teaching_method  2133 non-null   object 
 5   n_student        2133 non-null   float64
 6   student_id       2133 non-null   object 
 7   gender           2133 non-null   object 
 8   lunch            2133 non-null   object 
 9   pretest          2133 non-null   float64
 10  posttest         2133 non-null   float64
dtypes: float64(3), object(8)
memory usage: 183.4+ KB


# Does family income affect scores?

According to the [National School Lunch Program](https://www.ers.usda.gov/topics/food-nutrition-assistance/child-nutrition-programs/national-school-lunch-program/), free lunches are available to children in households with incomes at or below 130 percent of the poverty level, and reduced-price lunches are available to children in households with incomes between 130 and 185 percent of the poverty level.  
Based on this, we can treat qualification for free/reduced meals as an indicator of low household income.
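
The income bands quoted above can be written down as a small function. This is only a sketch: `lunch_category` and its arguments are hypothetical names, the poverty line of 30,000 is an invented figure rather than an official one, and treating exactly 185 percent as still eligible for a reduced-price lunch is an assumption.

```python
def lunch_category(household_income, poverty_line):
    """Classify a household by the income bands quoted above.

    Hypothetical helper for illustration; it is not part of the data set.
    """
    if poverty_line <= 0:
        raise ValueError("poverty_line must be positive")
    ratio = household_income / poverty_line
    if ratio <= 1.30:
        return "free"
    if ratio <= 1.85:  # assumption: the upper bound is inclusive
        return "reduced"
    return "full price"


# Made-up poverty line of 30,000 and three made-up household incomes
example = [lunch_category(income, 30_000) for income in (30_000, 45_000, 60_000)]
print(example)
```

The data set itself only records whether a student qualifies, so the `lunch` column is a coarse proxy for income rather than a measurement of it.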

In [190]:
lunch_pretest = df.groupby('lunch')['pretest'].mean()  # select the column first so non-numeric columns are never averaged
fig = go.Figure( [ go.Bar(x=list(lunch_pretest.index), y=lunch_pretest, text=np.round(lunch_pretest.values, 2), textposition='outside') ] )
fig.update_layout(
    title_text="Pre-test averages by free/reduced lunch eligibility",
    xaxis_title='Free/reduced lunch',
    yaxis_title='Score'
)
fig.show()

In [189]:
lunch_posttest = df.groupby('lunch')['posttest'].mean()  # select the column first so non-numeric columns are never averaged
fig = go.Figure( [ go.Bar(x=list(lunch_posttest.index), y=lunch_posttest, text=np.round(lunch_posttest.values, 2), textposition='outside')  ] )
fig.update_layout(
    title_text="Post-test averages by free/reduced lunch eligibility",
    xaxis_title='Free/reduced lunch',
    yaxis_title='Score'
)
fig.show()

### Kids with reduced/free lunches performed worse than kids without them in both the pre-test and the post-test, though it is not clear whether family income is the only factor that affects their grades.
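
One way to probe whether income is the only factor is to compare the lunch groups inside each school type instead of over the whole data set. The sketch below uses a tiny made-up frame that only mimics the column names of `test_scores.csv`. The scores and the lunch labels `Qualifies` / `Does not qualify` are invented for illustration.

```python
import pandas as pd

# Made-up rows; only the column names follow test_scores.csv
toy = pd.DataFrame({
    "school_type": ["Public"] * 4 + ["Non-public"] * 4,
    "lunch": ["Qualifies", "Does not qualify"] * 4,
    "posttest": [55.0, 65.0, 57.0, 67.0, 70.0, 78.0, 72.0, 80.0],
})

# Mean post-test score per (school_type, lunch) pair, one column per lunch group
by_group = toy.groupby(["school_type", "lunch"])["posttest"].mean().unstack("lunch")

# Gap between the two lunch groups inside each school type
by_group["gap"] = by_group["Does not qualify"] - by_group["Qualifies"]
print(by_group)
```

If the gap stays similar across school types on the real data, that would hint that the lunch effect is not just a side effect of school type. It would still not establish causation.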

# Does a good pre-test means a good post-test?

### Assume that 80/100 is the minimum good grade

In [194]:
fig = go.Figure( [ go.Bar(x=list(df[df['pretest']>=80][['pretest', 'posttest']].columns), y=df[df['pretest']>=80][['pretest', 'posttest']].mean(), text=np.round(df[df['pretest']>=80][['pretest', 'posttest']].mean(),2), textposition='outside')  ] )
fig.update_layout(
    title_text="Test averages for students with good scores >=80",
    xaxis_title='Test',
    yaxis_title='Score'
)
fig.show()

In [193]:
fig = go.Figure( [ go.Bar(x=list(df[df['pretest']<80][['pretest', 'posttest']].columns), y=df[df['pretest']<80][['pretest', 'posttest']].mean(), text=np.round(df[df['pretest']<80][['pretest', 'posttest']].mean(),2), textposition='outside')  ] )
fig.update_layout(
    title_text="Test averages for students with bad scores <80",
    xaxis_title='Test',
    yaxis_title='Score'
)
fig.show()

### From the graphs we can see that students who achieved a good score in their pretest raise their score by about 10 points on average in the posttest. The other students raise their score by about 13 points on average, but their posttest average still stays below 70.
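
The comparison above boils down to the average gain (posttest minus pretest) per pretest band. A minimal sketch of that calculation follows. The rows are made up, and `gain` and `pretest_band` are hypothetical column names introduced only for this example.

```python
import pandas as pd

# Made-up scores, not taken from test_scores.csv
toy = pd.DataFrame({
    "pretest": [85.0, 90.0, 60.0, 50.0, 79.0],
    "posttest": [92.0, 95.0, 72.0, 65.0, 88.0],
})

# Gain per student, and the same 80-point threshold used above
toy["gain"] = toy["posttest"] - toy["pretest"]
toy["pretest_band"] = toy["pretest"].ge(80).map({True: ">=80", False: "<80"})

# Average pretest, posttest and gain for each band
summary = toy.groupby("pretest_band")[["pretest", "posttest", "gain"]].mean()
print(summary)
```

Students who already score high have less room to improve on a 100-point scale, so a smaller gain in the upper band does not by itself mean they learned less.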

# what is the diffrence in teaching methods?

In [196]:
method_posttest = df.groupby('teaching_method')['posttest'].mean()  # average only the numeric posttest column
names= list(method_posttest.index)
values= method_posttest.values

fig = go.Figure( [ go.Bar(x=names, y=values, text=np.round(values, 2), textposition='outside') ] )
fig.update_layout(
    title_text="Post test averages for teaching methods",
    xaxis_title='Teaching method',
    yaxis_title='Score'
)
fig.show()

### On average, students taught with the Experimental method achieve higher grades in the post test

# Are private schools worth it?

In [10]:
df.groupby(['school_type', 'teaching_method'])['posttest'].mean()

school_type  teaching_method
Non-public   Experimental       78.652830
             Standard           73.468531
Public       Experimental       69.947475
             Standard           61.315547
Name: posttest, dtype: float64

In [11]:
#prepare data: mean posttest per (school_type, teaching_method) pair, computed once
y= df.groupby(['school_type', 'teaching_method'])['posttest'].mean()
x= list(y.index.get_level_values(0).unique())
legend= list(y.index.get_level_values(1).unique())


#plot data
fig = go.Figure(data=[
    go.Bar(name=legend[0], x=x, y=[y.loc[x[0], legend[0]], y.loc[x[1], legend[0]]], text=np.round([y.loc[x[0], legend[0]], y.loc[x[1], legend[0]]], 2), textposition='outside'),
    go.Bar(name=legend[1], x=x, y=[y.loc[x[0], legend[1]], y.loc[x[1], legend[1]]], text=np.round([y.loc[x[0], legend[1]], y.loc[x[1], legend[1]]], 2), textposition='outside')
])

# Change the bar mode

fig.update_layout(barmode='group',
                 title_text="Averages for each (school_type, teaching_method) pair",
                 xaxis_title='School type',
                 yaxis_title='Score',
                 annotations=[
        dict(
            x=1.15,
            y=1.05,
            xref='paper',
            yref='paper',
            text='Teaching method',
            showarrow=False
        )
    ]
                 )
fig.show()

### Private (Non-public) schools have a higher post test average than public schools for both teaching methods (78.65 vs 69.95 for Experimental, 73.47 vs 61.32 for Standard).

# Does gender plays a role?

In [12]:
df.groupby('gender')['posttest'].mean()

gender
Female    67.004735
Male      67.197772
Name: posttest, dtype: float64

In [197]:
gender_posttest = df.groupby('gender')['posttest'].mean()  # average only the numeric posttest column
names= list(gender_posttest.index)
values= gender_posttest.values

fig = go.Figure( [ go.Bar(x=names, y=values, text=np.round(values, 2), textposition='outside') ] )
fig.update_layout(
    title_text="Post test averages for male/female",
    xaxis_title='Gender',
    yaxis_title='Score'
)
fig.show()

### Both genders have nearly the same post test average score (67.00 for female vs 67.20 for male)
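
The two averages above differ by roughly 0.2 points. One way to judge whether such a gap is negligible is to express it in units of the score spread, for example with Cohen's d. The notebook does not do this; the sketch below is an added illustration on made-up scores, and `cohens_d` is a hypothetical helper.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Standardised mean difference using the pooled sample variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


# Made-up scores, not taken from test_scores.csv
group_a = [60.0, 65.0, 70.0, 75.0, 80.0]
group_b = [61.0, 66.0, 71.0, 76.0, 81.0]

d = cohens_d(group_a, group_b)
print(round(d, 3))
```

By common convention, an absolute value of d below about 0.2 is read as a small effect, so a one-point gap against a spread of several points would count as negligible.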

# Does school area matter?

In [198]:
names= list(df.groupby('school_setting').mean()['posttest'].index)
values= df.groupby('school_setting').mean()['posttest'].values

fig = go.Figure( [ go.Bar(x=names, y=values, text=np.round(values, 2), textposition='outside') ] )
fig.update_layout(
    title_text="Post-test averages for school areas",
    xaxis_title='School area',
    yaxis_title='Score'
)
fig.show()

### Schools in suburban areas have a high post test average

# AI Model

### Extract and remove features

In [122]:
#remove features that are not used by the model (school, classroom, n_student, student_id, gender)
df.drop(['school', 'classroom', 'n_student', 'student_id', 'gender'], axis=1, inplace=True)

In [123]:
#make dummies for categorical features
t=pd.get_dummies(df[list(df.columns[:4])], drop_first=True)
df= pd.concat([df, t], axis=1)
df.drop(list(df.columns[:4]), axis=1, inplace=True)
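
The cell above picks the categorical columns by position (`df.columns[:4]`), which works only while those four columns come first. A sketch of the same encoding done by column name is shown below on a made-up frame. The rows are invented and only two of the categorical columns are included to keep it short.

```python
import pandas as pd

# Made-up rows; only the column names follow test_scores.csv
toy = pd.DataFrame({
    "school_type": ["Public", "Non-public", "Public"],
    "teaching_method": ["Standard", "Experimental", "Experimental"],
    "pretest": [55.0, 70.0, 62.0],
    "posttest": [60.0, 80.0, 71.0],
})

# Name the categorical columns explicitly instead of relying on their position
categorical = ["school_type", "teaching_method"]

# drop_first=True removes one level per feature, so a two-level feature
# becomes a single indicator column
encoded = pd.get_dummies(toy, columns=categorical, drop_first=True)
print(list(encoded.columns))
```

Passing `columns=` makes `get_dummies` replace the listed columns in one step, so the separate `concat` and `drop` calls are not needed in this variant.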

In [133]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression


In [134]:
#Train test split
x= df.drop('posttest', axis=1)
y=df['posttest']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=101)

In [135]:
#create pipeline
pipeline = Pipeline([
    ('scaler', MinMaxScaler()), 
    ('model', LinearRegression())
])

pipeline.fit(x_train, y_train)  # fit pipeline

Pipeline(steps=[('scaler', MinMaxScaler()), ('model', LinearRegression())])

# Model evaluation

In [199]:
pred= pipeline.predict(x_test)
r2_score(y_test, pred)

0.9514275463397855
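
An R² of about 0.95 means the model accounts for roughly 95% of the variance of the post-test scores in the held-out split. The sketch below computes R² by hand as 1 - SS_res / SS_tot on made-up numbers, and adds RMSE, which reports the typical error in score points. `r2_and_rmse` is a hypothetical helper and the values are not results from this notebook.

```python
from math import sqrt


def r2_and_rmse(y_true, y_pred):
    """Return (R^2, RMSE) for two equally long sequences of numbers."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1 - ss_res / ss_tot, sqrt(ss_res / len(y_true))


# Made-up targets and predictions, for illustration only
y_true_toy = [60.0, 70.0, 80.0, 90.0]
y_pred_toy = [62.0, 68.0, 81.0, 89.0]

r2, rmse = r2_and_rmse(y_true_toy, y_pred_toy)
print(round(r2, 3), round(rmse, 3))
```

Since `pretest` is one of the features, it may be worth checking how much of the score comes from that column alone before crediting the other features.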

# Conclusion

It looks like the environment plays a big role in the scores students achieve in their post test, as the sections above show (e.g. family income, school area, teaching method). Private schools have a higher post test average than public schools, but not everyone can afford them, which makes public schools in suburban areas the best choice for the majority of families.