

---

### **Product Overview**  
The application will convert PDFs (and eventually other text-based formats like `.txt` and `.docx`) into audio files. It will provide real-time text-to-speech playback using the browser’s `SpeechSynthesis` API while allowing users to visually track the text as it is being spoken. Users can also download audio versions of their PDFs in smaller, manageable chunks of 10-20 pages. Google Sign-In will be used for user authentication, allowing users to securely upload files and manage their text/audio data.

---

### **System Architecture**

#### **Frontend**
1. **User Interface**:
   - Users upload PDF files.
   - Real-time speech playback of the PDF text, with synchronized text highlighting using the `SpeechSynthesis` API.
   - PDF pages are rendered using the Canvas API, allowing users to see and interact with the visual representation of the document.
   - Options to play, pause, or stop the audio, and navigate between pages.
   - Download options for audio chunks (10-20 pages).
   - Google Sign-In for authentication.

2. **Frontend Tech Stack**:
   - **HTML/CSS**: For structuring and styling the UI.
   - **JavaScript**: Core logic for handling file uploads, text-to-speech, page navigation, and interactions.
   - **Canvas API**: To visually render each PDF page and highlight the text as it's read.
   - **SpeechSynthesis API**: For real-time text-to-audio playback.
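
As a rough sketch of how synchronized highlighting could work, the browser fires `boundary` events on a `SpeechSynthesisUtterance` as speech progresses, and each event carries a `charIndex` into the spoken text. The helper names below (`wordRangeAt`, `speakWithHighlight`, `onWord`) are hypothetical and not part of the plan above. Drawing the highlight on the canvas is left to the caller.

```javascript
// Pure helper: given the spoken text and the charIndex from a `boundary`
// event, return the [start, end) range of the word starting at that index.
function wordRangeAt(text, charIndex) {
  const match = text.slice(charIndex).match(/^\S+/);
  const length = match ? match[0].length : 0;
  return { start: charIndex, end: charIndex + length };
}

// Browser-only: speak `text` and call `onWord` with the range of each word
// as it is reached. `onWord` is a hypothetical callback that would redraw
// the highlight on the canvas.
function speakWithHighlight(text, onWord) {
  const utterance = new SpeechSynthesisUtterance(text);
  utterance.onboundary = (event) => {
    if (event.name !== 'word') return; // skip sentence boundaries
    onWord(wordRangeAt(text, event.charIndex));
  };
  window.speechSynthesis.speak(utterance);
  return utterance;
}
```

Not every browser and voice combination fires `boundary` events, so the highlight probably needs a fallback, such as highlighting a whole sentence or page at a time.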

#### **Backend**
1. **PDF Processing**:
   - **PDF Text Extraction**: The backend, built in Node.js, will use a library such as `pdf-parse` or `pdfjs-dist` to extract text from each PDF page. (`pdf-lib` targets creating and modifying PDFs and does not provide text extraction.)
   - **Caching Parsed Text**: Once a page is processed, its text will be cached in a database (e.g., MongoDB) to avoid reprocessing for future requests.
   - **Pagination**: Text will be extracted and sent to the frontend in manageable chunks (e.g., 10-20 pages).

2. **Audio Processing**:
   - Users can download audio files for each chunk of text (10-20 pages) to avoid processing large files all at once.
   - Audio will be generated on-demand using a Node.js text-to-speech library (such as `say.js` or an external TTS API).

3. **Authentication**:
   - **Google Sign-In**: Users will authenticate via Google OAuth, enabling a secure and smooth login process.

4. **File Handling**:
   - Uploaded PDFs will be processed server-side, and their extracted text will be sent back to the frontend for playback.
   - Optionally, the server could be connected to cloud storage (e.g., Google Drive) for saving and accessing user files.

5. **Database**:
   - **MongoDB**: To store user data, cached parsed text, and possibly user preferences (e.g., page/chunk history).
   - Each PDF can be identified by a unique file hash or ID, and processed text will be stored along with it.
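
One possible shape for the hash-keyed text cache described above is sketched below. It makes several assumptions: an in-memory `Map` stands in for MongoDB, SHA-256 is used as the file hash, and `PageTextCache`, `getPageText`, and the `extractPage` callback are hypothetical names. A real extraction call would most likely be asynchronous.

```javascript
const { createHash } = require('node:crypto');

// Identify an uploaded file by the SHA-256 hash of its bytes.
function fileHash(buffer) {
  return createHash('sha256').update(buffer).digest('hex');
}

// In-memory stand-in for the database collection of parsed page text.
class PageTextCache {
  constructor() {
    this.store = new Map();
  }
  static key(hash, pageNumber) {
    return `${hash}:${pageNumber}`;
  }
  get(hash, pageNumber) {
    return this.store.get(PageTextCache.key(hash, pageNumber));
  }
  set(hash, pageNumber, text) {
    this.store.set(PageTextCache.key(hash, pageNumber), text);
  }
}

// Return cached text when present; otherwise extract once and cache it.
// `extractPage` is a hypothetical callback wrapping the PDF library.
function getPageText(cache, hash, pageNumber, extractPage) {
  const cached = cache.get(hash, pageNumber);
  if (cached !== undefined) return cached;
  const text = extractPage(pageNumber);
  cache.set(hash, pageNumber, text);
  return text;
}
```

Keying on the content hash rather than the file name means two users uploading the same PDF can share one cached copy of the text.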

#### **Potential Improvements and Suggestions**
1. **Optimized Audio Download**: Instead of generating the audio on the fly every time a user downloads it, you could cache generated audio files as well, so users don’t have to wait if they want to re-download the same chunk.
   
2. **Bookmarking & Resuming Playback**: Allow users to bookmark their progress so they can resume playback from where they left off.

3. **Different Voices & Speeds**: Let users choose different voices (e.g., male/female, accents) or adjust the speed of the audio. You can do this through the `SpeechSynthesis` API or a TTS service on the backend.

4. **File Upload Options**: In addition to PDFs, once you've expanded to `.txt` and `.docx` formats, offer users the option to upload different types of files and automatically detect the file type.

5. **Drive Integration**: Allow users to import PDFs directly from their Google Drive. This would enhance the integration with Google services and improve user experience.

6. **Scalability & Cloud Functions**: Use cloud functions (e.g., AWS Lambda or Google Cloud Functions) to process files on-demand for improved scalability and to offload heavy tasks from your core server.

7. **Try Implementing Something Like NotebookLM**: This could support question answering about a document via voice or text.
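
For suggestion 3 (voices and speeds), a minimal sketch of applying user preferences to an utterance might look like the following. The Web Speech API accepts `rate` values from 0.1 to 10 and `pitch` values from 0 to 2, so the sketch clamps to those ranges. The function names and the settings object shape are assumptions, not part of the plan above.

```javascript
function clamp(value, min, max) {
  return Math.min(max, Math.max(min, value));
}

// Fill in defaults and clamp to the ranges the Web Speech API accepts.
function normalizeSpeechSettings({ rate = 1, pitch = 1, voiceName = null } = {}) {
  return {
    rate: clamp(rate, 0.1, 10),
    pitch: clamp(pitch, 0, 2),
    voiceName,
  };
}

// Apply settings to an utterance. `voices` is the array returned by
// speechSynthesis.getVoices(); an unknown voice name leaves the default voice.
function applySpeechSettings(utterance, settings, voices) {
  const { rate, pitch, voiceName } = normalizeSpeechSettings(settings);
  utterance.rate = rate;
  utterance.pitch = pitch;
  const voice = voices.find((v) => v.name === voiceName);
  if (voice) utterance.voice = voice;
  return utterance;
}
```

In the browser this could be called as `applySpeechSettings(new SpeechSynthesisUtterance(text), prefs, speechSynthesis.getVoices())`, where `prefs` is whatever the user saved.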

---

### **General Flow**
1. **Authentication**: User logs in with Google Sign-In.
2. **PDF Upload**: User uploads a PDF, which is sent to the Node.js backend.
3. **Text Processing**: Backend processes the PDF text, caches each page’s text, and sends it to the frontend.
4. **Text-to-Speech Playback**: Frontend receives text and uses the SpeechSynthesis API to play the audio while highlighting text on the page via Canvas.
5. **Audio Download**: User can request downloadable audio chunks, which the backend generates and returns in 10-20 page chunks.
6. **Caching**: If the same page is requested again, cached text is used to avoid redundant processing.
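
The chunking used in steps 3 and 5 can be sketched as a small pure function. This is only an illustration: `chunkPages` and the shape of the returned objects are hypothetical, and the default of 10 pages is one point in the 10-20 page range mentioned above.

```javascript
// Group an array of per-page text into chunks of `chunkSize` pages.
// Page numbers in the result are 1-based to match what users see.
function chunkPages(pageTexts, chunkSize = 10) {
  if (!Number.isInteger(chunkSize) || chunkSize < 1) {
    throw new RangeError('chunkSize must be a positive integer');
  }
  const chunks = [];
  for (let start = 0; start < pageTexts.length; start += chunkSize) {
    const pages = pageTexts.slice(start, start + chunkSize);
    chunks.push({
      firstPage: start + 1,
      lastPage: start + pages.length,
      text: pages.join('\n\n'),
    });
  }
  return chunks;
}
```

With 25 pages and a chunk size of 10, this yields three chunks covering pages 1-10, 11-20, and 21-25, so the last chunk may be shorter than the others.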

---

This architecture keeps processing efficient by caching parsed text and working in 10-20 page chunks, and it leaves room for future features and improvements as we expand beyond PDFs.

<blockquote>
After a conversation with ChatGPT, we shortlisted these candidates for the final application name:
</blockquote>


- **VocalizeIt**: It’s active and direct. This name feels approachable and action-oriented.
- **NarrateNow**: The "Now" suggests instant results, which fits well with the real-time aspect of your app.
- **SpeechStream**: Sounds modern and tech-forward, with the "stream" hinting at continuous audio playback, which is great for longer documents.
- **Audiopages**: It’s clear and self-explanatory, which could help users immediately grasp what the app does.
- **SonicPages**: Has a more dynamic feel, emphasizing speed and sound with "sonic," which gives it an exciting, energetic vibe.

If you want a name that feels techy and modern, **SpeechStream** or **SonicPages** might be best. For something more approachable and straightforward, **VocalizeIt** or **Audiopages** are great choices. **NarrateNow** could work if you want to emphasize speed and ease of use.

<blockquote>
The question now is "Which vibe do you think fits our app best?"
</blockquote>


<strong>I did some research and found that we can't use say.js because it is platform-specific and doesn't work well on Linux machines. The Google Cloud Text-to-Speech API emerged as a good alternative. Here's a cost analysis:</strong>

<hr/>
The cost of using the Google Text-to-Speech API depends on the number of characters you process each month. Here's a breakdown of the pricing:

- **Free Tier**:
  - **WaveNet voices**: The first 1 million characters are free each month.
  - **Standard (non-WaveNet) voices**: The first 4 million characters are free each month.

- **After Free Tier**:
  - **WaveNet voices**: $16.00 USD per 1 million characters.
  - **Standard voices**: $4.00 USD per 1 million characters¹².

At these rates, small projects can often stay within the free tier, and costs for larger workloads grow linearly with the number of characters processed. For current rates, check the [Google Cloud Text-to-Speech pricing page](https://cloud.google.com/text-to-speech/pricing)⁵.
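
To make the arithmetic concrete, here is a small estimator built only from the figures quoted above (free monthly characters and price per 1 million characters). The function and constant names are hypothetical, and the rates should be checked against the pricing page before relying on them.

```javascript
// Rates as quoted in the cost analysis above; verify before use.
const TTS_PRICING = {
  standard: { freeChars: 4_000_000, usdPerMillion: 4.0 },
  wavenet: { freeChars: 1_000_000, usdPerMillion: 16.0 },
};

// Estimate the monthly bill in USD for `chars` characters of one voice type.
function estimateMonthlyCostUsd(chars, voiceType) {
  const plan = TTS_PRICING[voiceType];
  if (!plan) throw new Error(`Unknown voice type: ${voiceType}`);
  const billableChars = Math.max(0, chars - plan.freeChars);
  return (billableChars / 1_000_000) * plan.usdPerMillion;
}
```

For example, 5 million characters of Standard voices in a month leaves 1 million billable characters, or $4.00, while 3 million characters of WaveNet voices leaves 2 million billable characters, or $32.00.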



Source: Conversation with Copilot, 10/16/2024

1. Text-to-Speech AI: Lifelike Speech Synthesis | Google Cloud. https://cloud.google.com/text-to-speech
2. Google Cloud Text-to-Speech Pricing Plan & Cost Guide - GetApp. https://www.getapp.com/all-software/a/cloud-text-to-speech/pricing/
3. Google Cloud Text-to-Speech Pricing: Cost and Pricing plans - SaaSworthy. https://www.saasworthy.com/product/google-cloud-text-to-speech/pricing
4. Speech-to-Text API Pricing | Google Cloud. https://cloud.google.com/speech-to-text/pricing
5. Google Cloud. https://cloud.google.com/text-to-speech/pricing