# Imports 

In [1]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

# Import our wrangle files
import wrangle as w 

# Acquire our data

In [2]:
df = w.acquire_edu_data()

# Let's look at our data

In [3]:
# Next I want to check the DataFrame information: column names, non-null counts, and dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30641 entries, 0 to 30640
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Unnamed: 0.1         30641 non-null  int64  
 1   Unnamed: 0           30641 non-null  int64  
 2   Gender               30641 non-null  object 
 3   EthnicGroup          28801 non-null  object 
 4   ParentEduc           28796 non-null  object 
 5   LunchType            30641 non-null  object 
 6   TestPrep             28811 non-null  object 
 7   ParentMaritalStatus  29451 non-null  object 
 8   PracticeSport        30010 non-null  object 
 9   IsFirstChild         29737 non-null  object 
 10  NrSiblings           29069 non-null  float64
 11  TransportMeans       27507 non-null  object 
 12  WklyStudyHours       29686 non-null  object 
 13  MathScore            30641 non-null  int64  
 14  ReadingScore         30641 non-null  int64  
 15  WritingScore         30641 non-null 

In [4]:
# Next I want to get the descriptive statistics of the numeric columns.
df.describe()

,Unnamed: 0.1,Unnamed: 0,NrSiblings,MathScore,ReadingScore,WritingScore
count,30641.0,30641.0,29069.0,30641.0,30641.0,30641.0
mean,15320.0,499.556607,2.145894,66.558402,69.377533,68.418622
std,8845.439136,288.747894,1.458242,15.361616,14.758952,15.443525
min,0.0,0.0,0.0,0.0,10.0,4.0
25%,7660.0,249.0,1.0,56.0,59.0,58.0
50%,15320.0,500.0,2.0,67.0,70.0,69.0
75%,22980.0,750.0,3.0,78.0,80.0,79.0
max,30640.0,999.0,7.0,100.0,100.0,100.0


In [5]:
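# Column names are not standardized: convert CamelCase to snake_case.
# Insert an underscore before every capital letter that is not at the start of the name...
df.columns = df.columns.str.replace(r'(?<!^)([A-Z])', r'_\1', regex=True)
# ...then lowercase everything.
df.columns = df.columns.str.lower()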
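
To see what the renaming in the cell above does to individual names, here is a minimal standalone sketch that applies the same regular expression with the standard-library `re` module. The helper name `to_snake_case` is hypothetical and not part of the notebook or of `wrangle`.

```python
import re

def to_snake_case(name: str) -> str:
    """Insert '_' before every capital letter that is not the first character, then lowercase."""
    return re.sub(r'(?<!^)([A-Z])', r'_\1', name).lower()

# A few of the original column names from df.info()
examples = ['Gender', 'EthnicGroup', 'ParentMaritalStatus', 'NrSiblings', 'Unnamed: 0.1']
converted = [to_snake_case(name) for name in examples]
print(converted)

# Caveat: runs of capitals are split letter by letter.
acronym = to_snake_case('GPA')
print(acronym)
```

The pattern treats every capital as the start of a new word, so an acronym such as `GPA` would become `g_p_a`. None of the columns in this dataset contain acronyms, so the simple pattern is sufficient here.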

In [6]:
# Now I want to look at my data and see what is going on.
df.head()

,unnamed: 0.1,unnamed: 0,gender,ethnic_group,parent_educ,lunch_type,test_prep,parent_marital_status,practice_sport,is_first_child,nr_siblings,transport_means,wkly_study_hours,math_score,reading_score,writing_score
0,0,0,female,,bachelor's degree,standard,none,married,regularly,yes,3.0,school_bus,< 5,71,71,74
1,1,1,female,group C,some college,standard,,married,sometimes,yes,0.0,,5 - 10,69,90,88
2,2,2,female,group B,master's degree,standard,none,single,sometimes,yes,4.0,school_bus,< 5,87,93,91
3,3,3,male,group A,associate's degree,free/reduced,none,married,never,no,1.0,,5 - 10,45,56,42
4,4,4,male,group C,some college,standard,none,married,sometimes,yes,0.0,school_bus,5 - 10,76,78,75


### **Key Takeaways:**
* There are two unnamed columns, probably left over from exporting and re-importing the data together with its index. Unnamed columns are usually a copy of the index.
* Nine columns contain nulls; we will have to decide how to address them.
* Many columns hold categorical information that will have to be converted to numbers before modeling.
* A final score built from the three subject scores may be a better gauge of a student's overall performance than each subject on its own.
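
As an illustration of how such unnamed columns typically arise, the sketch below writes a toy DataFrame to CSV with its index and reads it back. The toy data and variable names are made up for this example; it is not a claim about how the original file was actually produced.

```python
import io
import pandas as pd

toy = pd.DataFrame({'Gender': ['female', 'male'], 'MathScore': [71, 45]})

# to_csv() writes the row index by default, as a first column with an empty header.
# read_csv() then labels that header-less column 'Unnamed: 0'.
with_index = pd.read_csv(io.StringIO(toy.to_csv()))
print(list(with_index.columns))

# Option 1: do not write the index in the first place.
clean_write = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

# Option 2: tell read_csv that the first column is the index.
clean_read = pd.read_csv(io.StringIO(toy.to_csv()), index_col=0)

print(list(clean_write.columns), list(clean_read.columns))
```

Repeating the write-and-read cycle without either option adds one more unnamed column each time, which would be consistent with seeing both `Unnamed: 0` and `Unnamed: 0.1` here.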

# First, let's address the nulls.
    
    Each row represents one student, so dropping a row removes that student from the analysis entirely.
    For that reason we avoid df.dropna(): it would discard every student with even a single missing field, along with all of their valid answers and scores.

    Instead of removing entire rows, we fill in the missing values so that every student stays in the dataset.

In [7]:
# Get the percentage of missing values
(df.isna().sum() / len(df)) * 100

unnamed: 0.1              0.000000
unnamed: 0                0.000000
gender                    0.000000
ethnic_group              6.005026
parent_educ               6.021344
lunch_type                0.000000
test_prep                 5.972390
parent_marital_status     3.883685
practice_sport            2.059332
is_first_child            2.950295
nr_siblings               5.130381
transport_means          10.228126
wkly_study_hours          3.116739
math_score                0.000000
reading_score             0.000000
writing_score             0.000000
dtype: float64

    The dataset is large (30,641 rows) and the share of missing values is small: no column is missing more than about 10.2% of its values, and most are missing about 6% or less. That makes imputation a reasonable choice.
    Imputing keeps every student in the dataset, so it remains a representative view of the student population.

    We fill each gap with the most frequent value of its column. Nearly all of the affected columns are categorical, so the mode is a natural fit.
    This approach retains as much information as possible for analysis.
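
To make the idea concrete, here is a small pandas-only sketch of most-frequent imputation on made-up data. It illustrates the technique and is not the notebook's own implementation, which uses `SimpleImputer`; the toy values and variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Toy data (made up) with one gap in a categorical and one in a numeric column
toy = pd.DataFrame({
    'test_prep': ['none', 'completed', np.nan, 'none'],
    'nr_siblings': [1.0, np.nan, 1.0, 3.0],
    'math_score': [71, 69, 87, 45],
})

# Percentage of missing values per column
missing_pct = toy.isna().mean() * 100
print(missing_pct)

# Most frequent value of each column (first row of DataFrame.mode())
modes = toy.mode().iloc[0]

# Fill each column's gaps with that column's mode
imputed = toy.fillna(modes)
print(imputed)
```

One thing to keep in mind: every gap receives the same value, so the most common category's share grows after imputation. For a column such as `transport_means`, with roughly 10% missing, that shift is worth remembering during exploration.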

In [8]:
# Fill missing values with each column's most frequent value (the mode)
imputer = SimpleImputer(strategy='most_frequent')

In [9]:
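for col in df.columns:
    if df[col].isna().any():
        # SimpleImputer expects 2-D input, so reshape the column to (n_rows, 1),
        # then take the single output column back out.
        df[col] = imputer.fit_transform(df[col].values.reshape(-1, 1))[:, 0]
        print(f'Values in {col} have been imputed')
    else:
        print(f'No missing values in {col}')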


No missing values in unnamed: 0.1
No missing values in unnamed: 0
No missing values in gender
Values in ethnic_group have been imputed
Values in parent_educ have been imputed
No missing values in lunch_type
Values in test_prep have been imputed
Values in parent_marital_status have been imputed
Values in practice_sport have been imputed
Values in is_first_child have been imputed
Values in nr_siblings have been imputed
Values in transport_means have been imputed
Values in wkly_study_hours have been imputed
No missing values in math_score
No missing values in reading_score
No missing values in writing_score


In [10]:
# Trust but verify: no missing values should remain
(df.isna().sum() / len(df)) * 100

unnamed: 0.1             0.0
unnamed: 0               0.0
gender                   0.0
ethnic_group             0.0
parent_educ              0.0
lunch_type               0.0
test_prep                0.0
parent_marital_status    0.0
practice_sport           0.0
is_first_child           0.0
nr_siblings              0.0
transport_means          0.0
wkly_study_hours         0.0
math_score               0.0
reading_score            0.0
writing_score            0.0
dtype: float64

# Removing the unnamed columns:

In [11]:
# Drop every column whose name contains 'unnamed'
df.drop(columns=[c for c in df.columns if 'unnamed' in c], inplace=True)

In [12]:
# Trust but verify: the unnamed columns should be gone
df.head(2)

,gender,ethnic_group,parent_educ,lunch_type,test_prep,parent_marital_status,practice_sport,is_first_child,nr_siblings,transport_means,wkly_study_hours,math_score,reading_score,writing_score
0,female,group C,bachelor's degree,standard,none,married,regularly,yes,3.0,school_bus,< 5,71,71,74
1,female,group C,some college,standard,none,married,sometimes,yes,0.0,school_bus,5 - 10,69,90,88


# Adding `final_score`

In [13]:
# Average the three subject scores and round to the nearest whole point
df['final_score'] = ((df.math_score + df.reading_score + df.writing_score) / 3).round()

In [14]:
df.head()

,gender,ethnic_group,parent_educ,lunch_type,test_prep,parent_marital_status,practice_sport,is_first_child,nr_siblings,transport_means,wkly_study_hours,math_score,reading_score,writing_score,final_score
0,female,group C,bachelor's degree,standard,none,married,regularly,yes,3.0,school_bus,< 5,71,71,74,72.0
1,female,group C,some college,standard,none,married,sometimes,yes,0.0,school_bus,5 - 10,69,90,88,82.0
2,female,group B,master's degree,standard,none,single,sometimes,yes,4.0,school_bus,< 5,87,93,91,90.0
3,male,group A,associate's degree,free/reduced,none,married,never,no,1.0,school_bus,5 - 10,45,56,42,48.0
4,male,group C,some college,standard,none,married,sometimes,yes,0.0,school_bus,5 - 10,76,78,75,76.0
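
As a quick check of the `final_score` arithmetic, this sketch recomputes the value for the five rows shown above. It uses `mean(axis=1)` over a list of subject columns, an equivalent formulation chosen here for illustration; the variable names are assumptions.

```python
import pandas as pd

# Subject scores copied from the five rows of df.head() above
scores = pd.DataFrame({
    'math_score': [71, 69, 87, 45, 76],
    'reading_score': [71, 90, 93, 56, 78],
    'writing_score': [74, 88, 91, 42, 75],
})

subject_cols = ['math_score', 'reading_score', 'writing_score']

# Row-wise mean of the three subjects, rounded to the nearest whole point
scores['final_score'] = scores[subject_cols].mean(axis=1).round()
print(scores['final_score'].tolist())
```

Because the sum of three integer scores divided by 3 always has a fractional part of 0, 1/3, or 2/3, a value never lands exactly on .5, so the tie-breaking rule used by the rounding function does not affect the result.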


# Removing `ethnic_group`

    Our analysis focuses on the relationship between parents' social status and students' grades, so we deliberately excluded the ethnic group variable.
    Dropping it keeps the analysis on the research question at hand and avoids misinterpretation or unwarranted conclusions based on ethnic background.

    With the scope narrowed to parents' social status, we can explore the socioeconomic factors that may influence educational outcomes.
    The results can inform questions of educational equity and point to potential areas for intervention or support.






In [15]:
df.drop(columns='ethnic_group', inplace=True)

In [16]:
df.head(1)

,gender,parent_educ,lunch_type,test_prep,parent_marital_status,practice_sport,is_first_child,nr_siblings,transport_means,wkly_study_hours,math_score,reading_score,writing_score,final_score
0,female,bachelor's degree,standard,none,married,regularly,yes,3.0,school_bus,< 5,71,71,74,72.0


## **Key Takeaways**
#### **We did the following things:**
* Imputed the null values with each column's most frequent value, since no column was missing more than about 10.2% of its values.
* Standardized the column names (lowercase with underscores) to make them easier to reference.
* Removed the two unnamed columns from the DataFrame.
* Created a `final_score` column.
* Removed `ethnic_group` from the DataFrame.

# Conclusion:
    After acquiring the data, we identified several areas that needed preparation. We standardized the column names to the lowercase-with-underscores convention for consistency and ease of use.
    We then imputed the null values with each column's most frequent value, which keeps every student in the dataset. Given the size of the DataFrame and the small share of missing values, we felt confident doing so.

    The "Unnamed" columns appeared to be redundant copies of the index and added no value, so we dropped them.
    We also added a final_score column, the rounded mean of the three subject scores, and removed ethnic_group to keep the analysis focused on parents' social status.

    With these steps complete, the dataset is ready for further analysis and exploration.

In [17]:
# For further information, please refer to the README.md file.
# To flow easily into the explore stage of our pipeline,
# we export the prepared DataFrame to a CSV file.
df.to_csv('prepared_edu_dataframe.csv', index=False)