# 1. Smart Cultural Storyteller

## Selected Project Track
AI Applications – NLP-based Cultural Storytelling

## Problem Statement
Cultural folk stories contain valuable moral, social, and historical knowledge. However, long narrative texts are difficult to consume and analyze efficiently.

## Objective
To build an AI-based system that understands cultural folk stories using Natural Language Processing (NLP) and generates meaningful summaries that preserve key events and narrative flow.

## Real-World Relevance
Such systems can support cultural preservation, education platforms, and digital storytelling by making long folk narratives more accessible.


## 2. Data Understanding & Preparation

### Dataset Source
- Dataset Name: 1000Folk_Story_around_the_Globe.csv.zip
- Source: Kaggle (Public Dataset)
- Website URL: https://www.kaggle.com/datasets/chayanonc/1000-folk-stories-around-the-world
- File Format: CSV (compressed as ZIP)

### Dataset Description
The dataset contains folk stories collected from different regions and cultures. It includes the following columns:
- genre
- source
- region
- title
- full_text

For this project, the `title` and `full_text` columns are kept: `full_text` holds the complete story content used for the text analysis, and `title` identifies the matched story.


## 3. Data Cleaning, Preprocessing & Feature Engineering

### Cleaning & Preprocessing
- Converted text to lowercase
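- Removed every character that is not a letter or whitespace (this includes digits and punctuation)
- Collapsed repeated whitespace into single spaces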

### Feature Engineering
- TF-IDF vectorization is used to convert text into numerical representations
- This enables the system to identify important sentences within stories
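
To make the TF-IDF step concrete, here is a minimal pure-Python sketch of the weighting idea. It is an illustration only: the notebook uses scikit-learn's `TfidfVectorizer`, whose smoothing and normalization differ from this simplified formula, and the helper name `tfidf` and the toy documents are assumptions made for the example.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (simplified TF-IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df_counts = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            # term frequency * inverse document frequency
            term: (count / len(tokens)) * math.log(n_docs / df_counts[term])
            for term, count in tf.items()
        })
    return weights

# Toy documents (hypothetical, not taken from the dataset)
toy_docs = [
    "the fox tricked the crow",
    "the crow dropped the cheese",
    "the king rewarded the honest woodcutter",
]
toy_weights = tfidf(toy_docs)
```

In this sketch a word that occurs in every document (such as "the") gets weight zero, while a word confined to one document (such as "fox") gets the highest weight, which is the property the summarizer relies on.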



In [44]:
import pandas as pd

df = pd.read_csv("/content/1000Folk_Story_around_the_Globe.csv.zip")
df = df[['title', 'full_text']].dropna()

df.head()







| | title | full_text |
|---|---|---|
| 0 | Geraint and Enid | \nQueen Guinevere lay idly in bed dreaming bea... |
| 1 | Lancelot and Elaine | \nHer name was Elaine. But she was so fair tha... |
| 2 | Pelleas and Ettarde | \nFar away in a dreary land there lived a lad ... |
| 3 | Gareth and Lynette | \nGareth was a little prince. His home was an ... |
| 4 | Sir Galahad and the Sacred Cup | \n\n ‘My strength is as the strength of ten, ... |


In [45]:
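import re

def clean_text(text):
    # Normalize case so "King" and "king" map to the same token
    text = text.lower()
    # Keep only letters and whitespace; digits and punctuation
    # (including sentence-ending periods) are removed
    text = re.sub(r'[^a-z\s]', '', text)
    # Collapse runs of whitespace (spaces, tabs, newlines) into one space
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

df['clean_text'] = df['full_text'].apply(clean_text)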



## 4. Model / System Design

### Technique Used
Natural Language Processing (NLP)

### System Type
Extractive Text Summarization System

### Design Explanation
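- Stories are broken into sentences
- Each sentence is converted to a TF-IDF vector, and its importance is scored as the sum of its cosine similarities to the other sentences in the story
- The highest-scoring sentences are selected and kept in their original order to form the summary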
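
The design above can be sketched in a few lines of standard-library Python. This is a simplified stand-in, not the notebook's implementation: it uses a naive regex sentence splitter instead of NLTK's `sent_tokenize` and raw word counts instead of TF-IDF weights, and the names `split_sentences`, `cosine`, and `extractive_summary` are hypothetical.

```python
import math
import re
from collections import Counter

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def cosine(a, b):
    """Cosine similarity between two Counter word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def extractive_summary(text, num_sentences=2):
    sentences = split_sentences(text)
    if len(sentences) <= num_sentences:
        return sentences
    vectors = [Counter(re.findall(r'[a-z]+', s.lower())) for s in sentences]
    # Centrality score: total similarity of each sentence to every sentence
    scores = [sum(cosine(v, other) for other in vectors) for v in vectors]
    top = sorted(range(len(sentences)), key=lambda i: scores[i],
                 reverse=True)[:num_sentences]
    return [sentences[i] for i in sorted(top)]  # restore story order

# Toy story (hypothetical, not taken from the dataset)
toy_story = ("The fox saw the crow. The crow held the cheese. "
             "The fox praised the crow. It rained.")
toy_summary = extractive_summary(toy_story, num_sentences=2)
```

Sentences that share vocabulary with many others score highest, so an off-topic sentence such as "It rained." is left out of the toy summary.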

### Justification
Extractive summarization is a classical NLP approach. Every summary sentence is copied verbatim from the source story, so results are easy to trace back to the original text, and no large pretrained model is required.


## 4a. Core Implementation: Story Retrieval Based on User Input


In [51]:
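from sklearn.metrics.pairwise import cosine_similarity

def find_relevant_story(user_input):
    """Return the row of `df` whose story is most similar to `user_input`.

    Relies on the global `vectorizer`, which is fitted in the next cell
    and must exist before this function is called.
    """
    # Vectorize user input with the same cleaning applied to the stories
    user_vec = vectorizer.transform([clean_text(user_input)])

    # Vectorize all stories (recomputed on every call; cache if queried often)
    story_vectors = vectorizer.transform(df['clean_text'])

    # Cosine similarity between the query and every story, shape (1, n_stories)
    similarities = cosine_similarity(user_vec, story_vectors)

    # Position of the most similar story
    best_index = similarities.argmax()

    return df.iloc[best_index]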


In [52]:
import nltk
import numpy as np
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download('punkt')

vectorizer = TfidfVectorizer(stop_words='english')
vectorizer.fit(df['clean_text'])

def summarize_story(text, num_sentences=5):
    sentences = sent_tokenize(text)

    if len(sentences) <= num_sentences:
        return "\n".join(sentences)

    sentence_vectors = vectorizer.transform(sentences)
    similarity_matrix = cosine_similarity(sentence_vectors)
    sentence_scores = similarity_matrix.sum(axis=1)

    top_indices = np.argsort(sentence_scores)[-num_sentences:]
    top_indices.sort()

    summary_sentences = [sentences[i] for i in top_indices]

    # Make output readable (each sentence on new line)
    summary = "\n• " + "\n• ".join(summary_sentences)
    return summary




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## 4b. Model Inference / Sample Output

In [48]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [53]:
# USER INPUT: type the meaning / theme here
user_input = "a story about honesty and truth"

print("USER INPUT:")
print(user_input)

# Find relevant story using NLP
story = find_relevant_story(user_input)

print("\nMATCHED STORY TITLE:")
print(story['title'])

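# Summarize the matched story.
# Note: clean_text has no punctuation, so sent_tokenize treats it as a single
# sentence; pass story['full_text'] instead to get sentence-level extraction.
summary = summarize_story(story['clean_text'], num_sentences=5)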

print("\nOUTPUT: STORY SUMMARY")
print(summary)



USER INPUT:
a story about honesty and truth

MATCHED STORY TITLE:
Elder-tree Mother

OUTPUT: STORY SUMMARY
there was once a little boy who had taken cold by going out and getting his feet wet no one could think how he had managed to do so for the weather was quite dry his mother undressed him and put him to bed and then she brought in the teapot to make him a good cup of elder tea which is so warming at the same time the friendly old man who lived all alone at the top of the house came in at the door he had neither wife nor child but he was very fond of children and knew so many fairy tales and stories that it was a pleasure to hear him talk now if you drink your tea said the mother very likely you will have a story in the meantime yes if i could think of a new one to tell said the old man but how did the little fellow get his feet wet asked he ah said the mother that is what we cannot make out will you tell me a story asked the boy yes if you can tell me exactly how deep the gutter is

## Input–Output Explanation

The user provides a natural language description of a story theme or meaning.
The system uses TF-IDF based NLP similarity to identify the most relevant folk story.
An extractive summarization algorithm then generates a concise, readable summary.

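Together, these steps form an end-to-end flow from a free-text query to a retrieved story and its summary, based on lexical (word-overlap) similarity rather than deeper language understanding.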
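
The retrieval step can be illustrated with a small standard-library sketch. It is a simplified analogue of the notebook's `find_relevant_story`: it compares raw word counts rather than TF-IDF vectors, and the stop-word set, the helper names (`bag_of_words`, `cosine`, `best_match`), and the two toy stories are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (hypothetical; far shorter than a real one)
STOP_WORDS = {"a", "an", "the", "and", "about", "was"}

def bag_of_words(text):
    """Lowercase, keep alphabetic tokens, drop stop words, count the rest."""
    tokens = re.findall(r'[a-z]+', text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

def cosine(a, b):
    """Cosine similarity between two Counter word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def best_match(query, stories):
    """stories maps title -> text; return the title most similar to query."""
    q = bag_of_words(query)
    return max(stories, key=lambda title: cosine(q, bag_of_words(stories[title])))

# Toy stories (hypothetical, not taken from the dataset)
toy_stories = {
    "The Honest Woodcutter": "An honest woodcutter told the truth and was rewarded.",
    "The Fox and the Crow": "A fox flattered a crow and stole the cheese.",
}
matched_title = best_match("a story about honesty and truth", toy_stories)
honesty_vs_honest = cosine(bag_of_words("honesty"), bag_of_words("honest"))
```

Note that "honesty" and "honest" are different tokens, so the match here rests on the shared word "truth" alone. The same holds for a TF-IDF vectorizer without stemming, which is why theme-style queries only work when the story happens to use the query's exact words.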


## 5. Evaluation & Analysis

### Evaluation Method
The system is evaluated qualitatively by checking whether the generated summary preserves the key characters, events, and narrative structure of the original story.
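
One way to complement the qualitative check is a rough keyword-coverage score: the share of a story's most frequent content words that also appear in its summary. The sketch below is an assumption, not part of the original evaluation; the function name `keyword_coverage`, the tiny stop-word set, and the toy texts are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (hypothetical)
STOP_WORDS = {"a", "an", "the", "and"}

def content_words(text):
    """Lowercase alphabetic tokens with stop words removed."""
    return [w for w in re.findall(r'[a-z]+', text.lower()) if w not in STOP_WORDS]

def keyword_coverage(story, summary, top_k=5):
    """Share of the story's top_k most frequent content words found in the summary."""
    top_terms = [term for term, _ in Counter(content_words(story)).most_common(top_k)]
    if not top_terms:
        return 0.0
    summary_words = set(content_words(summary))
    return sum(term in summary_words for term in top_terms) / len(top_terms)

# Toy texts (hypothetical, not taken from the dataset)
toy_story = ("The fox saw the crow. The crow held the cheese. "
             "The fox praised the crow. The crow dropped the cheese.")
toy_summary = "The fox praised the crow."
coverage = keyword_coverage(toy_story, toy_summary, top_k=3)
```

Here the three most frequent content words are "crow", "fox", and "cheese"; the toy summary mentions two of them, so the score is 2/3. A score like this only measures word overlap and does not replace reading the summary against the story.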

### Sample Output
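In the sample run above, `summarize_story` receives `story['clean_text']`. Because cleaning strips all punctuation, `sent_tokenize` finds no sentence boundaries, the early-return branch is taken, and the output is one unsegmented block rather than five extracted sentences. Passing `story['full_text']` instead gives the tokenizer the punctuation it needs to split the story.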

### Limitations
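- The model performs extractive summarization: it copies sentences from the story and cannot paraphrase or generate new text
- TF-IDF matches on word overlap rather than meaning, so both retrieval and sentence scoring miss synonyms and paraphrases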
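
A further practical limitation concerns the interaction between cleaning and sentence splitting: the extractive method needs sentence boundaries, but the cleaning step removes the punctuation that marks them. The sketch below reproduces the notebook's `clean_text` and uses a naive regex splitter as a stand-in for NLTK's `sent_tokenize`; the splitter name and the sample text are assumptions.

```python
import re

def clean_text(text):
    # Same cleaning steps as the notebook's clean_text
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

def naive_sentence_split(text):
    # Simplified stand-in for sent_tokenize: split after '.', '!' or '?'
    return [s for s in re.split(r'(?<=[.!?])\s+', text) if s]

# Sample text (hypothetical)
raw = "There was once a little boy. He had taken cold. His mother put him to bed."
raw_sentences = naive_sentence_split(raw)
cleaned_sentences = naive_sentence_split(clean_text(raw))
```

The raw text splits into three sentences, while the cleaned text comes back as a single item. Summarizing the original text and using the cleaned text only for vectorization avoids this.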


## 6. Ethical Considerations & Responsible AI

- The dataset is publicly available and used strictly for academic purposes
- No personal or sensitive data is involved
- Cultural content is preserved without alteration
- The system does not generate new or misleading narratives


## 7. Conclusion & Future Scope

### Conclusion
This project demonstrates a complete NLP pipeline for cultural storytelling using extractive summarization techniques.

### Future Scope
- Use transformer-based summarization models
- Add multilingual support
- Allow user-uploaded story input
