# CCSS Alignment NLP/ML Model

## - Andria Grace

# Introduction

In this project, we build a recommendation system for Common Core State Standards (CCSS) based on text similarity. We use Natural Language Processing (NLP) techniques to preprocess and analyze the descriptions of the standards: each description is converted into a Term Frequency-Inverse Document Frequency (TF-IDF) vector, and a Nearest Neighbors search finds and recommends the CCSS standards whose descriptions are closest to the user's input text.

# Data Loading

In [51]:
import pandas as pd

df = pd.read_csv('C:/Users/acer/Downloads/ccss.csv')
print("Step 1: Load Dataset")
df.head()  # preview the first five rows


Step 1: Load Dataset


|   | id | content_type | category_id | category_name | grade_id | grade_name | item | description |
|---|---|---|---|---|---|---|---|---|
| 0 | CCSS.ELA-LITERACY.L.K.1 | ELA-LITERACY | L | Language | K | Kindergarten | 1 | Demonstrate command of the conventions of stan... |
| 1 | CCSS.ELA-LITERACY.L.K.1.a | ELA-LITERACY | L | Language | K | Kindergarten | 1a | Print many upper- and lowercase letters. |
| 2 | CCSS.ELA-LITERACY.L.K.1.b | ELA-LITERACY | L | Language | K | Kindergarten | 1b | Use frequently occurring nouns and verbs. |
| 3 | CCSS.ELA-LITERACY.L.K.1.c | ELA-LITERACY | L | Language | K | Kindergarten | 1c | Form regular plural nouns orally by adding /s/... |
| 4 | CCSS.ELA-LITERACY.L.K.1.d | ELA-LITERACY | L | Language | K | Kindergarten | 1d | Understand and use question words (interrogati... |


In [37]:
import pandas as pd

# Step 1: Load the Dataset
def load_dataset(file_path):
    df = pd.read_csv(file_path)  # read from the path passed in, not a hard-coded one
    print("Step 1: Load Dataset")
    print(df.head())  # Display the first few rows of the dataframe
    return df[['id', 'description']]  # keep only the columns used downstream

# Path to your CSV file (adjust to where ccss.csv is stored)
file_path = 'C:/Users/acer/Downloads/ccss.csv'

# Load dataset
ccss_df = load_dataset(file_path)


Step 1: Load Dataset
                          id  content_type category_id category_name grade_id  \
0    CCSS.ELA-LITERACY.L.K.1  ELA-LITERACY           L      Language        K   
1  CCSS.ELA-LITERACY.L.K.1.a  ELA-LITERACY           L      Language        K   
2  CCSS.ELA-LITERACY.L.K.1.b  ELA-LITERACY           L      Language        K   
3  CCSS.ELA-LITERACY.L.K.1.c  ELA-LITERACY           L      Language        K   
4  CCSS.ELA-LITERACY.L.K.1.d  ELA-LITERACY           L      Language        K   

     grade_name item                                        description  
0  Kindergarten    1  Demonstrate command of the conventions of stan...  
1  Kindergarten   1a           Print many upper- and lowercase letters.  
2  Kindergarten   1b          Use frequently occurring nouns and verbs.  
3  Kindergarten   1c  Form regular plural nouns orally by adding /s/...  
4  Kindergarten   1d  Understand and use question words (interrogati...  


# Data Preprocessing 

In [38]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Step 2: Data Preprocessing
def preprocess_data(df):
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(df['description'])
    print("Step 2: Data Preprocessing")
    print("TF-IDF Matrix Shape:", tfidf_matrix.shape)
    print("Feature Names:", vectorizer.get_feature_names_out()[:10])  # Display the first 10 feature names
    return vectorizer, tfidf_matrix

# Preprocess data
vectorizer, tfidf_matrix = preprocess_data(ccss_df)


Step 2: Data Preprocessing
TF-IDF Matrix Shape: (1554, 2886)
Feature Names: ['01' '01212t' '02' '05' '05a' '0s' '10' '100' '1000' '100s']


# Model Building

In [39]:
from sklearn.neighbors import NearestNeighbors

# Step 3: Model Building
def build_model(tfidf_matrix):
    # Default metric is Minkowski with p=2, i.e. Euclidean distance
    model = NearestNeighbors(n_neighbors=5, algorithm='auto')
    model.fit(tfidf_matrix)
    print("Step 3: Model Building")
    print("Model built with n_neighbors=5")
    return model

# Build model
model = build_model(tfidf_matrix)


Step 3: Model Building
Model built with n_neighbors=5
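
The model above uses the default Euclidean metric, while text similarity is often described in terms of cosine similarity. The sketch below, on a toy corpus with assumed `toy_*` names and shortened descriptions, suggests the two agree for this setup: because the TF-IDF rows are L2-normalized, squared Euclidean distance equals twice the cosine distance, so both metrics should rank neighbors the same way.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Shortened toy descriptions, not the full dataset.
toy_descriptions = [
    "Print many upper- and lowercase letters.",
    "Use frequently occurring nouns and verbs.",
    "Understand and use question words.",
    "Form regular plural nouns orally.",
]
toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_matrix = toy_vectorizer.fit_transform(toy_descriptions)

# Same data, two metrics: the default (Euclidean) and cosine distance.
euclidean_nn = NearestNeighbors(n_neighbors=3).fit(toy_matrix)
cosine_nn = NearestNeighbors(n_neighbors=3, metric="cosine").fit(toy_matrix)

query = toy_vectorizer.transform(["question words"])
euc_dist, euc_idx = euclidean_nn.kneighbors(query)
cos_dist, cos_idx = cosine_nn.kneighbors(query)

print("Euclidean distances:", euc_dist[0])
print("Cosine distances:   ", cos_dist[0])

# For unit-length vectors: ||a - b||^2 = 2 * (1 - cos(a, b)).
print("Relation holds:", np.allclose(euc_dist ** 2, 2 * cos_dist, atol=1e-6))
```

Switching to `metric="cosine"` is therefore mostly a matter of how the distances read, not of which standards are returned, as long as the vectorizer keeps its default normalization.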


# Prediction

In [50]:
# Step 4: Making Predictions
def find_closest_ccss(input_text, vectorizer, model, df, n=5):
    input_tfidf = vectorizer.transform([input_text])
    distances, indices = model.kneighbors(input_tfidf, n_neighbors=n)
    closest_ids = df.iloc[indices[0]]['id'].values
    print("Step 4: Making Predictions")
    print("Input Text:", input_text)
    print("Closest CCSS ids:", closest_ids)
    return closest_ids

# Example input text
input_text = "Understand and use question words"

# Find the 5 closest CCSS ids
closest_ccss_ids = find_closest_ccss(input_text, vectorizer, model, ccss_df, n=5)


Step 4: Making Predictions
Input Text: Understand and use question words
Closest CCSS ids: ['CCSS.ELA-LITERACY.L.K.1.d' 'CCSS.ELA-LITERACY.L.8.5.b'
 'CCSS.ELA-LITERACY.L.5.5.c' 'CCSS.ELA-LITERACY.L.6.5.b'
 'CCSS.ELA-LITERACY.L.7.5.b']
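
`find_closest_ccss` returns only the ids and discards the distances, so it is hard to tell a strong match from a weak one. One possible variant is sketched below: `rank_ccss` is a hypothetical helper, not part of the original notebook, and it is exercised on a toy two-row frame built from the complete rows shown in the preview.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors


def rank_ccss(input_text, vectorizer, model, df, n=5):
    """Hypothetical helper: nearest standards with their distances, closest first."""
    n = min(n, len(df))  # kneighbors raises if n exceeds the number of fitted rows
    input_tfidf = vectorizer.transform([input_text])
    distances, indices = model.kneighbors(input_tfidf, n_neighbors=n)
    ranked = df.iloc[indices[0]][["id", "description"]].reset_index(drop=True)
    ranked["distance"] = distances[0]
    return ranked


# Toy stand-in for ccss_df: two complete rows from the dataset preview.
toy_df = pd.DataFrame({
    "id": ["CCSS.ELA-LITERACY.L.K.1.a", "CCSS.ELA-LITERACY.L.K.1.b"],
    "description": [
        "Print many upper- and lowercase letters.",
        "Use frequently occurring nouns and verbs.",
    ],
})
toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_matrix = toy_vectorizer.fit_transform(toy_df["description"])
toy_model = NearestNeighbors(n_neighbors=2).fit(toy_matrix)

ranked = rank_ccss("nouns and verbs", toy_vectorizer, toy_model, toy_df, n=5)
print(ranked)
```

With L2-normalized TF-IDF vectors and the Euclidean metric, a distance close to the square root of 2 means the input shares no terms with that description, which gives a rough cutoff for filtering out irrelevant matches.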


# Conclusion

In summary, this notebook demonstrates the end-to-end process of building a text similarity-based recommendation system:

1. Dataset Loading: The dataset is loaded and reduced to the relevant columns, the 'id' and 'description' of each CCSS standard.
2. Data Preprocessing: We convert the descriptions into TF-IDF vectors, which weight each word by how often it appears in a description and how rare it is across the entire dataset.
3. Model Building: A Nearest Neighbors model is fitted on the TF-IDF matrix so that it can retrieve the standards whose vectors are closest to a query.
4. Making Predictions: Given an input text, the model finds and returns the CCSS standards with descriptions most similar to the input.

The system lets users enter text and receive the closest matching CCSS standards. In the example above, the top match for "Understand and use question words" is CCSS.ELA-LITERACY.L.K.1.d, whose description begins with the same words. Future enhancements could involve expanding the dataset, refining the model, or integrating additional features to improve recommendation accuracy.
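
Before refining the model, it can help to have a baseline check in place. The sketch below is one simple option and an assumption on our part, not something the notebook ran: a self-retrieval test that queries every description against the fitted index and measures how often its nearest neighbor is itself. The corpus is a toy one (the third description is invented), and all `toy_*` names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

toy_descriptions = [
    "Print many upper- and lowercase letters.",
    "Use frequently occurring nouns and verbs.",
    "Compare two numbers between 1 and 10.",  # invented for illustration
]
toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_matrix = toy_vectorizer.fit_transform(toy_descriptions)
toy_model = NearestNeighbors(n_neighbors=1).fit(toy_matrix)

# Query every row against the index; passing the matrix explicitly means
# each row is allowed to match itself.
_, nearest = toy_model.kneighbors(toy_matrix)

# Fraction of descriptions whose closest neighbor is the description itself.
hit_rate = float(np.mean(nearest[:, 0] == np.arange(toy_matrix.shape[0])))
print("Self-retrieval hit rate:", hit_rate)
```

On the real dataset, a rate below 1.0 would most likely point to duplicated descriptions or to descriptions that consist only of stop words, both of which are worth inspecting before changing the model.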