In [1]:
import acquire as a
import pandas as pd 
import matplotlib.pyplot as plt
import numpy as np

In [2]:
df = a.acquire_edu_data()

## First thing I want to do is standardize all the column names. 

In [3]:
df.columns

Index(['Unnamed: 0.1', 'Unnamed: 0', 'Gender', 'EthnicGroup', 'ParentEduc',
       'LunchType', 'TestPrep', 'ParentMaritalStatus', 'PracticeSport',
       'IsFirstChild', 'NrSiblings', 'TransportMeans', 'WklyStudyHours',
       'MathScore', 'ReadingScore', 'WritingScore'],
      dtype='object')

    Key takeaways:
    - The names mix uppercase and lowercase letters.
    - Some column names contain spaces.
    - The unnamed columns mirror the index, so we can remove them.
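
One way to confirm the third takeaway before dropping anything is to compare each unnamed column against the index. The sketch below runs on a small made-up frame (`toy`), not the real dataset, and assumes the index is the default 0..n-1 range.

```python
import pandas as pd

# Hypothetical stand-in for the raw data: two leftover index columns plus one real column.
toy = pd.DataFrame({
    "Unnamed: 0.1": [0, 1, 2],
    "Unnamed: 0": [0, 1, 2],
    "Gender": ["female", "male", "female"],
})

# Columns whose name starts with "Unnamed" are candidates for removal.
unnamed = [c for c in toy.columns if c.startswith("Unnamed")]

# A column mirrors the index only if every value equals its row's index label.
mirrors_index = {
    c: bool((toy[c].to_numpy() == toy.index.to_numpy()).all())
    for c in unnamed
}
print(mirrors_index)
```

If any entry comes back `False`, that column carries information the index does not, and it deserves a closer look before it is dropped.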
    

In [4]:
# let's address the first two with one line of code
df.columns = df.columns.str.lower().str.replace(' ', '_')

In [5]:
# now drop the unnamed columns (names are already lowercase at this point)
df.drop(columns=[c for c in df.columns if 'unnamed' in c], inplace=True)


## Next, we address the null values. This is a special case because every data point is a student. Using df.dropna() would drop any student with an incomplete record, removing their voice from the analysis, and we don't want to take that away from a child.

In [6]:
# first, see what percentage of values is missing in each column
(df.isna().sum() / len(df) )* 100 

gender                  0.000000
ethnicgroup             6.005026
parenteduc              6.021344
lunchtype               0.000000
testprep                5.972390
parentmaritalstatus     3.883685
practicesport           2.059332
isfirstchild            2.950295
nrsiblings              5.130381
transportmeans         10.228126
wklystudyhours          3.116739
mathscore               0.000000
readingscore            0.000000
writingscore            0.000000
dtype: float64
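
To put a number on what a blanket `df.dropna()` would cost, the row count before and after can be compared. This is a sketch on a made-up frame (`toy`), so the figures say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: each row is a student, with gaps spread across columns.
toy = pd.DataFrame({
    "testprep": ["none", np.nan, "completed", "none", "none"],
    "transportmeans": ["school_bus", "private", np.nan, "school_bus", np.nan],
    "mathscore": [71, 69, 87, 45, 76],
})

# Share of missing values per column, in percent (same as isna().sum() / len * 100).
pct_missing = toy.isna().mean() * 100

# Rows that would survive dropna(): only those with no missing value in any column.
rows_kept = len(toy.dropna())
rows_lost = len(toy) - rows_kept

print(pct_missing)
print(f"dropna() would remove {rows_lost} of {len(toy)} students")
```

Because the gaps fall in different rows for different columns, the share of rows lost can be much larger than any single column's missing percentage.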

    Addressing the first column with nulls, I'm going to drop ethnicgroup. I'm doing this to prevent any potential bias or unfair labeling based on ethnicity, and to keep the analysis focused on the other factors that may affect educational performance. Removing a variable like ethnicity does not mean it is unimportant; it means that in this specific analysis we are choosing to focus on other variables.

In [7]:
df.drop(columns='ethnicgroup', inplace=True)

In [8]:
# re-check the percentage of missing values per column after dropping ethnicgroup
(df.isna().sum() / len(df) )* 100 

gender                  0.000000
parenteduc              6.021344
lunchtype               0.000000
testprep                5.972390
parentmaritalstatus     3.883685
practicesport           2.059332
isfirstchild            2.950295
nrsiblings              5.130381
transportmeans         10.228126
wklystudyhours          3.116739
mathscore               0.000000
readingscore            0.000000
writingscore            0.000000
dtype: float64

    We are going to fill the remaining nulls with each column's most frequent value as a placeholder until the information can be updated.

In [9]:
from sklearn.impute import SimpleImputer

# missing_values must be np.nan (the float), not the string 'NaN'
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

In [10]:
# collect the names of the columns that contain at least one null
has_null = [col for col in df.columns if df[col].isna().sum() > 0]

In [11]:
has_null

['parenteduc',
 'testprep',
 'parentmaritalstatus',
 'practicesport',
 'isfirstchild',
 'nrsiblings',
 'transportmeans',
 'wklystudyhours']

In [12]:
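# missing_values defaults to np.nan, so only the strategy needs to be set
imputer = SimpleImputer(strategy='most_frequent')

# Fill the missing values in each column with that column's most frequent value
for col in has_null:
    # fit_transform expects 2D input and returns a 2D array; ravel() flattens it back to 1D
    df[col] = imputer.fit_transform(df[col].values.reshape(-1, 1)).ravel()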
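
A pandas-only alternative to the loop, shown as a sketch and not as the approach this notebook uses: record which cells were missing, then fill every column with its mode in one call. Keeping that record makes it possible to revisit the placeholders once real information arrives. The frame `toy` and the `_was_missing` suffix are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap in a text column and one in a numeric column.
toy = pd.DataFrame({
    "testprep": ["none", np.nan, "completed", "none"],
    "nrsiblings": [1.0, 2.0, np.nan, 2.0],
})

# Remember which cells were originally missing so they can be updated later.
was_missing = toy.isna().add_suffix("_was_missing")

# mode() ignores NaN and returns the most frequent value(s) per column;
# row 0 holds the first mode of each column.
modes = toy.mode().iloc[0]

# fillna with a Series fills each column using the value under the matching name.
filled = toy.fillna(modes)

result = pd.concat([filled, was_missing], axis=1)
print(result)
```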


In [13]:
(df.isna().sum() / len(df) )* 100 

gender                 0.0
parenteduc             0.0
lunchtype              0.0
testprep               0.0
parentmaritalstatus    0.0
practicesport          0.0
isfirstchild           0.0
nrsiblings             0.0
transportmeans         0.0
wklystudyhours         0.0
mathscore              0.0
readingscore           0.0
writingscore           0.0
dtype: float64

In [16]:
df.testprep.value_counts()

none         20686
completed     9955
Name: testprep, dtype: int64
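
Mode imputation moves every missing value into the most common category, so it can be worth comparing category shares before and after filling. The sketch below uses made-up values, not the real `testprep` column.

```python
import numpy as np
import pandas as pd

# Hypothetical column: 4 x "none", 2 x "completed", 2 missing.
before = pd.Series(
    ["none", "completed", np.nan, "none", np.nan, "completed", "none", "none"],
    name="testprep",
)

# Fill gaps with the most frequent observed value, as a most_frequent imputer does.
after = before.fillna(before.mode().iloc[0])

# value_counts skips NaN, so "before" shares are computed over observed values only.
share_before = before.value_counts(normalize=True)
share_after = after.value_counts(normalize=True)

print(pd.DataFrame({"before": share_before, "after": share_after}))
```

In this toy case the share of the majority category rises after filling, which is the kind of shift to keep in mind when reading counts from an imputed column.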