**Import Packages & Data**

In [4]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split


In [5]:
#Import CSV File of Cleaned Data: Test Scores
df = pd.read_csv('joined_df_cleaned.csv')


**Create Dummy Features**

In [9]:

# Identify categorical columns (replace with actual columns from your DataFrame)
categorical_cols = ['Jurisdiction']  # Add other categorical columns as needed

# Create dummy variables
dummy_df = pd.get_dummies(df[categorical_cols])

# Concatenate dummy variables with original DataFrame
df = pd.concat([df, dummy_df], axis=1)

# Drop the original categorical columns if needed
df.drop(categorical_cols, axis=1, inplace=True)

# Display the updated DataFrame
print(df.head())

   Average Math Score  ELL_x  White_x  Black_x  Hispanic_x  Low SES_x  Male_x  \
0               236.0  216.0    246.0    217.0       224.0      223.0   239.0   
1               230.0  216.0    240.0    213.0       222.0      218.0   233.0   
2               226.0  204.0    242.0      NaN       228.0      213.0   224.0   
3               232.0  197.0    247.0    215.0       221.0      218.0   235.0   
4               228.0  205.0    236.0    207.0       219.0      221.0   230.0   

   Female_x  ESE_x  Average Reading Score  ELL_y  White_y  Black_y  \
0     233.0  212.0                    217  190.0      227    199.0   
1     228.0  211.0                    213  196.0      224    197.0   
2     228.0  201.0                    204  187.0      220      NaN   
3     229.0  206.0                    215  164.0      229    205.0   
4     227.0  199.0                    212  174.0      221    188.0   

   Hispanic_y  Low SES_y  Male_y  Female_y  ESE_y  
0       205.0        203     214       2

**Standardize the magnitude of numeric features using a scaler**

In [28]:
# Identify numeric columns; this selection also includes the two target score columns, so they are standardized too
numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns

# Initialize StandardScaler
scaler = StandardScaler()

# Fit and transform the numeric data
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

# Display the updated DataFrame with standardized values
print(df.head())

   Average Math Score     ELL_x   White_x       Black_x  Hispanic_x  \
0            0.279402  0.448143  0.509269  4.790124e-01    0.153879   
1           -0.924781  0.448143 -0.628222 -2.650845e-01   -0.189665   
2           -1.727570 -1.083692 -0.249059 -1.298127e-17    0.840968   
3           -0.523387 -1.977262  0.698851  1.069639e-01   -0.361437   
4           -1.326176 -0.956039 -1.386550 -1.381230e+00   -0.704982   

   Low SES_x    Male_x  Female_x     ESE_x  Average Reading Score     ELL_y  \
0   0.237934  0.369522  0.166577  0.411312               0.249935  0.357228   
1  -0.826508 -0.739043 -0.895352  0.248700              -0.546733  1.047216   
2  -1.890950 -2.401890 -0.895352 -1.377418              -2.339237  0.012234   
3  -0.826508 -0.369522 -0.682966 -0.564359              -0.148399 -2.632720   
4  -0.187843 -1.293325 -1.107737 -1.702641              -0.745900 -1.482740   

    White_y       Black_y  Hispanic_y  Low SES_y    Male_y  Female_y     ESE_y  
0  0.371738  4.94
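The cell above fits the scaler on every row before the train/test split, so the test rows influence the mean and standard deviation used for scaling. A common alternative is to split first and fit the scaler on the training rows only. The following is a minimal sketch of that order, assuming made-up stand-in data (the `demo` DataFrame and its values are hypothetical, not the contents of `joined_df_cleaned.csv`):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data: 51 rows of made-up subgroup scores (not the real file)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Male_x": rng.normal(235, 6, size=51),
    "Female_x": rng.normal(232, 6, size=51),
})

# Split first, so the test rows play no part in fitting the scaler
demo_train, demo_test = train_test_split(demo, test_size=0.3, random_state=42)

# Fit on the training rows only, then reuse the same mean and std for the test rows
demo_scaler = StandardScaler().fit(demo_train)
demo_train_scaled = pd.DataFrame(
    demo_scaler.transform(demo_train), columns=demo.columns, index=demo_train.index
)
demo_test_scaled = pd.DataFrame(
    demo_scaler.transform(demo_test), columns=demo.columns, index=demo_test.index
)

# Training columns have mean ~0; test columns generally will not, which is expected
print(demo_train_scaled.mean().round(6))
print(demo_test_scaled.mean().round(3))
```

With only 51 rows the practical difference may be small, but this order keeps the test set untouched until evaluation.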

**Split into testing and training datasets**

In [31]:
# Define features (X) and target (y)
features_math = ['ELL_x', 'White_x', 'Black_x', 'Hispanic_x', 'Low SES_x', 'Male_x', 'Female_x', 'ESE_x']
X = df[features_math]  # Features
y = df['Average Math Score']  # Target variable

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Review the dimensions of the datasets
print(f"X_train shape: {X_train.shape}, y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape}, y_test shape: {y_test.shape}")

X_train shape: (35, 8), y_train shape: (35,)
X_test shape: (16, 8), y_test shape: (16,)
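The reported shapes are consistent with 51 rows in total and with the test fraction being rounded up to a whole number of rows. A short arithmetic check, assuming the row count is the sum of the two shapes printed above:

```python
import math

n_samples = 51   # 35 training rows + 16 test rows, as printed above
test_size = 0.3

# 0.3 * 51 = 15.3, which rounds up to 16 test rows; the rest go to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 35 16
```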


In [33]:
# Define features (X) and target (y) for Average Reading Score
features_reading = ['ELL_y', 'White_y', 'Black_y', 'Hispanic_y', 'Low SES_y', 'Male_y', 'Female_y', 'ESE_y']
X_reading = df[features_reading]  # Features for Average Reading Score
y_reading = df['Average Reading Score']  # Target variable for Average Reading Score

# Split the data into training and testing sets for Average Reading Score
X_train_reading, X_test_reading, y_train_reading, y_test_reading = train_test_split(X_reading, y_reading, test_size=0.3, random_state=42)

# Review the dimensions of the datasets for Average Reading Score
print(f"X_train_reading shape: {X_train_reading.shape}, y_train_reading shape: {y_train_reading.shape}")
print(f"X_test_reading shape: {X_test_reading.shape}, y_test_reading shape: {y_test_reading.shape}")


X_train_reading shape: (35, 8), y_train_reading shape: (35,)
X_test_reading shape: (16, 8), y_test_reading shape: (16,)


In [37]:
from sklearn.impute import SimpleImputer

# Check for missing values
print("Missing values in X_train:")
print(X_train.isnull().sum())

# Initialize SimpleImputer
imputer = SimpleImputer(strategy='mean')

# Fit imputer on training data
imputer.fit(X_train)

# Transform both training and testing data
X_train_imputed = imputer.transform(X_train)
X_test_imputed = imputer.transform(X_test)

# Convert back to DataFrame (if needed)
X_train_imputed = pd.DataFrame(X_train_imputed, columns=X_train.columns)
X_test_imputed = pd.DataFrame(X_test_imputed, columns=X_test.columns)

# Check if there are still missing values
print("Missing values after imputation:")
print(X_train_imputed.isnull().sum())


Missing values in X_train:
ELL_x         0
White_x       0
Black_x       0
Hispanic_x    0
Low SES_x     0
Male_x        0
Female_x      0
ESE_x         0
dtype: int64
Missing values after imputation:
ELL_x         0
White_x       0
Black_x       0
Hispanic_x    0
Low SES_x     0
Male_x        0
Female_x      0
ESE_x         0
dtype: int64


## Linear Regression Model for Math Scores

In [39]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Initialize Linear Regression model
model_lr_math = LinearRegression()

# Fit the model on training data
model_lr_math.fit(X_train_imputed, y_train)


In [42]:
# Predictions on test set
y_pred_lr_math = model_lr_math.predict(X_test_imputed)

# Model Evaluation
mse_lr_math = mean_squared_error(y_test, y_pred_lr_math)
r2_lr_math = r2_score(y_test, y_pred_lr_math)

# Display evaluation metrics
print(f"Mean Squared Error (MSE): {mse_lr_math}")
print(f"R-squared (R2): {r2_lr_math}")


Mean Squared Error (MSE): 0.005470748596428069
R-squared (R2): 0.994871688862258


Interpretation:
High R-squared: The model explains about 99.49% of the variance in the test-set Average Math Scores using the eight features (ELL_x, White_x, Black_x, Hispanic_x, Low SES_x, Male_x, Female_x, ESE_x). The features hold values on the same scale as the target, so they appear to be subgroup average scores; an overall average is naturally very close to a combination of its subgroup averages, which makes a score this high expected rather than surprising.

Low MSE: Because the target was standardized, the MSE of 0.0055 is in squared standard-deviation units, which corresponds to an RMSE of about 0.074 standard deviations. The predictions are therefore very close to the actual Average Math Scores on average.
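To make the two metrics concrete, the sketch below computes MSE and R-squared by hand and compares them with the scikit-learn functions used above. The arrays `y_true_demo` and `y_pred_demo` are made-up illustrative values, not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up standardized targets and predictions (illustrative only)
y_true_demo = np.array([0.3, -0.9, -1.7, -0.5, -1.3])
y_pred_demo = np.array([0.2, -1.0, -1.6, -0.4, -1.2])

residuals = y_true_demo - y_pred_demo

# MSE: the mean of the squared residuals
mse_manual = np.mean(residuals ** 2)

# R-squared: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true_demo, y_pred_demo))
print(r2_manual, r2_score(y_true_demo, y_pred_demo))
```

Each pair of printed values should agree, which shows that R-squared measures error relative to always predicting the mean of the actual values.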

In [44]:
# Coefficients and Intercept
coefficients = model_lr_math.coef_
intercept = model_lr_math.intercept_

# Print coefficients
print("Coefficients:")
for feature, coef in zip(features_math, coefficients):
    print(f"{feature}: {coef}")

print(f"\nIntercept: {intercept}")


Coefficients:
ELL_x: -0.01727579132480002
White_x: 0.04403115537174673
Black_x: 0.0029777161956944325
Hispanic_x: 0.010484229253849058
Low SES_x: 0.007594566753561007
Male_x: 0.5649835131260431
Female_x: 0.45790959268088444
ESE_x: -0.047040407659109024

Intercept: -0.0002660643854997813


Interpretation:
Intercept: The intercept is the predicted value of the response variable (Average Math Score) when all predictors are zero. Because both the features and the target were standardized, zero corresponds to the mean, so the near-zero intercept means that when every predictor is at its mean, the predicted Average Math Score is also approximately at its mean.

Coefficients: Each coefficient is the change in the standardized Average Math Score for a one-standard-deviation increase in the corresponding predictor, holding all other predictors constant.

Male_x and Female_x (0.565 and 0.458, respectively) dominate the model, with Male_x weighted slightly more. They describe how closely the overall score tracks these two columns, not an effect of being male or female.
White_x (0.0440) and Hispanic_x (0.0105) have small positive coefficients. These columns hold values on the score scale (for example, 246.0 for White_x in the first row), so they should not be read as proportions of students.
ELL_x (-0.0173) and ESE_x (-0.0470) have small negative coefficients, while Low SES_x (0.00759) and Black_x (0.00298) have small positive ones. All four contribute little once Male_x and Female_x are in the model.
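Since the coefficients above are in standardized units, it can help to convert them back to the original score scale. For a feature j, the original-unit slope is the standardized coefficient multiplied by (std of target / std of feature j). The sketch below checks that identity on synthetic data; the arrays and the assumption that the target is roughly the average of two subgroup scores are hypothetical, chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two subgroup scores and a target near their average
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=[235.0, 232.0], scale=[6.0, 5.0], size=(51, 2))
y_raw = 0.5 * X_raw[:, 0] + 0.5 * X_raw[:, 1] + rng.normal(0, 0.3, size=51)

# Standardize features and target, as the notebook does
x_scaler = StandardScaler().fit(X_raw)
y_scaler = StandardScaler().fit(y_raw.reshape(-1, 1))
X_std = x_scaler.transform(X_raw)
y_std = y_scaler.transform(y_raw.reshape(-1, 1)).ravel()

# Coefficients in standardized units
coef_std = LinearRegression().fit(X_std, y_std).coef_

# Convert to original units: b_original = b_standardized * std(y) / std(x_j)
coef_original_units = coef_std * y_scaler.scale_[0] / x_scaler.scale_

# Reference: fit directly on the unscaled data
coef_raw_fit = LinearRegression().fit(X_raw, y_raw).coef_

print(coef_original_units)
print(coef_raw_fit)
```

The two printed vectors should match, so in the notebook the same conversion could use `scaler.scale_` for the relevant columns to express each slope in score points.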

## Random Forest Model for Average Math Score

In [70]:
# Import necessary libraries
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Instantiate the Random Forest model
rf_model_math = RandomForestRegressor(random_state=42)

# Fit the model on the training data
rf_model_math.fit(X_train, y_train)


In [72]:

# Predictions on the test data
y_pred_rf_math = rf_model_math.predict(X_test)

# Model Evaluation
mse_rf_math = mean_squared_error(y_test, y_pred_rf_math)
r2_rf_math = r2_score(y_test, y_pred_rf_math)

# Display evaluation metrics
print(f"Mean Squared Error (MSE): {mse_rf_math}")
print(f"R-squared (R2): {r2_rf_math}")

# Feature importance
feature_importance_rf = rf_model_math.feature_importances_
feature_importance_df_rf = pd.DataFrame({'Feature': features_math, 'Importance': feature_importance_rf})
print("\nFeature Importance:")
print(feature_importance_df_rf)


Mean Squared Error (MSE): 0.12959383614148134
R-squared (R2): 0.8785179941002947

Feature Importance:
      Feature  Importance
0       ELL_x    0.004168
1     White_x    0.020458
2     Black_x    0.005471
3  Hispanic_x    0.006912
4   Low SES_x    0.003666
5      Male_x    0.316252
6    Female_x    0.616206
7       ESE_x    0.026868


Interpretation:
Female_x and Male_x are the most influential features in predicting Average Math Scores according to the Random Forest model.
ESE_x (presumably Exceptional Student Education, since English learners are already covered by ELL_x) also shows some importance in predictions.
The race/ethnicity variables (White_x, Hispanic_x, Black_x), ELL_x, and Low SES_x have relatively low importance in this model.
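The importances printed above are impurity-based, computed from the training data, and normalized to sum to 1. Permutation importance on held-out data is a commonly used cross-check. The sketch below compares the two on synthetic data; the column names ending in `_demo` and all values are hypothetical, not the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: two informative columns and one pure-noise column
rng = np.random.default_rng(0)
demo_X = pd.DataFrame({
    "Male_demo": rng.normal(0, 1, size=200),
    "Female_demo": rng.normal(0, 1, size=200),
    "Noise_demo": rng.normal(0, 1, size=200),  # unrelated to the target by construction
})
demo_y = 0.5 * demo_X["Male_demo"] + 0.5 * demo_X["Female_demo"] + rng.normal(0, 0.05, size=200)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    demo_X, demo_y, test_size=0.3, random_state=42
)
rf_demo = RandomForestRegressor(random_state=42).fit(Xd_train, yd_train)

# Shuffle each column of the test set in turn and measure the drop in R-squared
perm = permutation_importance(rf_demo, Xd_test, yd_test, n_repeats=10, random_state=42)

importance_demo = pd.DataFrame({
    "Feature": demo_X.columns,
    "Impurity": rf_demo.feature_importances_,
    "Permutation": perm.importances_mean,
}).sort_values(by="Permutation", ascending=False)
print(importance_demo)
```

When features are strongly correlated, as subgroup scores usually are, both measures can split credit between columns somewhat arbitrarily, so the ranking is best treated as a rough guide.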

## Linear Regression Model for Reading Scores

In [58]:
# Initialize the Linear Regression model
model_lr_reading = LinearRegression()

# Fit the model on the training data
model_lr_reading.fit(X_train_reading, y_train_reading)


In [60]:
# Predictions on the test set
y_pred_lr_reading = model_lr_reading.predict(X_test_reading)

# Model Evaluation
mse_lr_reading = mean_squared_error(y_test_reading, y_pred_lr_reading)
r2_lr_reading = r2_score(y_test_reading, y_pred_lr_reading)

# Display evaluation metrics
print(f"Mean Squared Error (MSE): {mse_lr_reading}")
print(f"R-squared (R2): {r2_lr_reading}")


Mean Squared Error (MSE): 0.005623186585414267
R-squared (R2): 0.9940895721535418


These metrics indicate that the model explains approximately 99.41% of the variance in the test data, and the MSE is quite low, suggesting that the predictions are close to the actual values.

In [63]:
# Coefficients and Intercept
coefficients = model_lr_reading.coef_
intercept = model_lr_reading.intercept_

# Display coefficients and intercept
print("Coefficients:")
for feature, coef in zip(features_reading, coefficients):
    print(f"{feature}: {coef}")

print(f"\nIntercept: {intercept}")


Coefficients:
ELL_y: 0.004041012163610265
White_y: -0.009481831238352029
Black_y: -0.005019154345846569
Hispanic_y: 0.02133711282375736
Low SES_y: 0.00014467703282548228
Male_y: 0.5578185703784504
Female_y: 0.4950823619796862
ESE_y: -0.005328220608045623

Intercept: -0.00021592243631324975


These coefficients indicate the contribution of each feature to predicting the average reading scores. Male_y (0.558) carries a slightly larger coefficient than Female_y (0.495), and the remaining features contribute very little. Because the data were standardized, the intercept is the predicted standardized reading score when every predictor is at its mean, which is why it is close to zero.

## Random Forest Model for Reading Scores

In [68]:
# Initialize Random Forest Regressor
rf_regressor = RandomForestRegressor(random_state=42)

# Fit the model to the training data
rf_regressor.fit(X_train_reading, y_train_reading)


In [74]:
# Predict using the trained model
y_pred_rf_reading = rf_regressor.predict(X_test_reading)

# Calculate Mean Squared Error (MSE) and R-squared (R2)
mse_rf_reading = mean_squared_error(y_test_reading, y_pred_rf_reading)
r2_rf_reading = r2_score(y_test_reading, y_pred_rf_reading)

# Print the evaluation metrics
print(f"Mean Squared Error (MSE): {mse_rf_reading}")
print(f"R-squared (R2): {r2_rf_reading}")


Mean Squared Error (MSE): 0.09839952503050198
R-squared (R2): 0.8965740716612376


In [76]:
# Extract feature importance from the model
feature_importance_rf = pd.DataFrame({
    'Feature': features_reading,
    'Importance': rf_regressor.feature_importances_
})

# Sort features by importance
feature_importance_rf = feature_importance_rf.sort_values(by='Importance', ascending=False)

# Print feature importance
print("Feature Importance:")
print(feature_importance_rf)


Feature Importance:
      Feature  Importance
6    Female_y    0.475039
5      Male_y    0.395857
7       ESE_y    0.043538
4   Low SES_y    0.040492
1     White_y    0.016553
3  Hispanic_y    0.016067
2     Black_y    0.006620
0       ELL_y    0.005833


Interpretation:
* Female_y and Male_y are the most influential features, with a combined importance of about 0.87.
* ESE_y (presumably Exceptional Student Education) and Low SES_y (Low Socioeconomic Status) also contribute to the model's predictions, at about 0.04 each.
* The race/ethnicity variables (White_y, Hispanic_y, Black_y) and ELL_y have the lowest importance in this model.

Conclusion
* The Random Forest model performs well for Average Reading Scores, with an R-squared of approximately 0.897, meaning it explains about 89.7% of the variance in the test set. This is still below the Linear Regression model's 0.994 on the same split. Feature importance is dominated by Female_y and Male_y, with ESE_y and Low SES_y playing a much smaller role.



### Comparative Analysis

**Performance Metrics:**
* The Linear Regression model has a significantly lower MSE (0.0055) compared to the Random Forest model (0.1296). This indicates that the Linear Regression model provides better accuracy in predicting Average Math Scores.
* The R-squared (R2) value for the Linear Regression model (0.9949) is also higher than that of the Random Forest model (0.8785), suggesting that the Linear Regression model explains more variance in the data.

**Feature Importance:**
* In the Random Forest model, Male_x and Female_x are identified as the most important features, meaning these two columns carry most of the predictive signal for math scores.
* In contrast, the Linear Regression model coefficients provide direct insights into the direction and strength of relationships between each feature and the target variable.

**Interpretation:**
* The Linear Regression model emphasizes the linear relationships between features and the target variable, providing straightforward coefficients for interpretation.
* The Random Forest model captures non-linear relationships and interactions among features, which can be beneficial for complex datasets but may lack the interpretability of individual feature effects.
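The comparison above rests on a single 16-row test set, so the gap between the two models could shift with a different split. Cross-validation gives a steadier estimate. The sketch below uses synthetic data with a roughly linear target; the arrays are hypothetical stand-ins for the math features and target, and the 5-fold setup is an assumption rather than part of the original analysis.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 51 rows, target close to the average of two subgroup scores
rng = np.random.default_rng(0)
X_cv = rng.normal(loc=[235.0, 232.0], scale=[6.0, 5.0], size=(51, 2))
y_cv = 0.5 * X_cv[:, 0] + 0.5 * X_cv[:, 1] + rng.normal(0, 0.3, size=51)

# Shuffled 5-fold split; the pipeline refits the scaler inside each training fold
cv = KFold(n_splits=5, shuffle=True, random_state=42)
candidates = {
    "Linear Regression": make_pipeline(StandardScaler(), LinearRegression()),
    "Random Forest": RandomForestRegressor(random_state=42),
}

cv_r2 = {
    name: cross_val_score(model, X_cv, y_cv, cv=cv, scoring="r2")
    for name, model in candidates.items()
}
for name, scores in cv_r2.items():
    print(f"{name}: mean R2 = {scores.mean():.3f} (std {scores.std():.3f})")
```

On data that is close to linear, the linear model would be expected to score higher across folds, which matches the pattern reported for both math and reading.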