### Analysis of Student Data

Examining student data to draw conclusions by testing hypotheses.  

The natural target is exam score, and the question to answer is: which variables affect test score the most? With an accurate model, we can predict the most likely test score based on the values those variables take.

---  

### **Data Wrangling**

In [None]:
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats as sts

# Build the path to the CSV relative to the current working directory.
data_path = Path.cwd().parent.parent / "Data" / "student_habits_performance.csv"
student_df = pd.read_csv(data_path)

In [None]:
print(student_df.describe(include='all'))

student_df.head(5)

Based on this small exploration alone, we can see we have multiple categorical variables alongside numerous numerical values, both float and int.  

We also see some useful summary stats about our numerical data, giving us an idea of:  
- Data scale and volume
  - We have 1,000 rows
  - The numerical values span a modest range overall (min=0, max=100).
- How that data is distributed across each feature.  
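
As a minimal sketch of how the categorical and numerical features could be separated programmatically, the snippet below uses `select_dtypes` on a small stand-in frame. The frame `toy_df`, its column names, and its values are hypothetical and only illustrate the idea; the same two calls should work on `student_df`.

```python
import pandas as pd

# Hypothetical stand-in for student_df; columns and values are illustrative only.
toy_df = pd.DataFrame({
    "age": [17, 18, 19],
    "study_hours_per_day": [1.5, 3.0, 2.25],
    "gender": ["Female", "Male", "Female"],
})

# Numeric columns cover both int and float dtypes.
numeric_cols = toy_df.select_dtypes(include="number").columns.tolist()
# Everything else is treated as categorical here.
categorical_cols = toy_df.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```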

Looking at `student_df["parental_education_level"]`, we can also see there's missing data. To confirm that's the only column with missing values, let's loop over every column and count the nulls.

In [None]:
missing_data = student_df.isnull()

for column in missing_data:
    print(missing_data[column].value_counts())

Now, let's take a closer look at the value counts of that feature, including the missing entries.  

In [None]:
parental_education_level_counts = student_df[
    "parental_education_level"
].value_counts(dropna=False).to_frame()
parental_education_level_counts.columns = ['value_counts']
parental_education_level_counts.index.name = "parental_education_level"

print(parental_education_level_counts)

There are 91 missing values out of 1,000, which is 9.1% of the rows in our data set.  
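
The count and percentage can also be computed directly rather than read off the printout. The sketch below does this for every column at once; `toy_df` and its values are hypothetical stand-ins for `student_df`, and `exam_score` is an assumed column name.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for student_df; values are illustrative only.
toy_df = pd.DataFrame({
    "parental_education_level": ["Bachelor", np.nan, "High School", "Master", np.nan],
    "exam_score": [71.0, 64.5, 88.0, 92.5, 55.0],
})

# isnull() yields booleans, so sum() counts missing values
# and mean() gives the missing fraction per column.
missing_summary = pd.DataFrame({
    "missing_count": toy_df.isnull().sum(),
    "missing_pct": toy_df.isnull().mean() * 100,
})

print(missing_summary)
```

On the real data, the `parental_education_level` row of this summary should match the 91 and 9.1% noted above.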

While it would be perfectly fine to stop at replacing the missing values with  
the most frequent value, it could be worthwhile to create a **binary  
indicator column** for missing parental education to explore its  
relationship with student performance. 

In [None]:
# Flag rows where parental education is missing: 1 = missing, 0 = present.
# The vectorized form avoids row-by-row writes into a slice of student_df.
missing_ed_df = (
    student_df["parental_education_level"]
    .isnull()
    .astype(int)
    .to_frame(name="missing_parent_ed")
)

print(missing_ed_df.value_counts())



This indicator is now preserved in a separate data frame that we can  
concatenate with the original, or a copy of the original,  
when it's time to model.  
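
A minimal sketch of that concatenation step, assuming the indicator frame shares its row index with the original. The names `toy_df`, `indicator_df`, and `model_df` and the sample values are hypothetical; `exam_score` is an assumed column name.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for student_df; values are illustrative only.
toy_df = pd.DataFrame({
    "parental_education_level": ["Bachelor", np.nan, "High School"],
    "exam_score": [71.0, 64.5, 88.0],
})

# Indicator frame: 1 where parental education is missing, 0 otherwise.
indicator_df = (
    toy_df["parental_education_level"]
    .isnull()
    .astype(int)
    .to_frame(name="missing_parent_ed")
)

# axis=1 places the indicator beside the existing columns, aligned on the index.
# Working on a copy leaves the original frame untouched.
model_df = pd.concat([toy_df.copy(), indicator_df], axis=1)

print(model_df)
```

Because `pd.concat` aligns on the index, this only lines up correctly if neither frame has been re-sorted or re-indexed in between.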

Next, we'll replace those missing values.